PostHog combines feature flags and A/B testing into a unified experimentation platform. This means you can control feature rollouts, run experiments, and analyze results all in one place. Here's how to use it effectively.
Feature Flags vs. Experiments
First, understand the difference:
Feature Flags control who sees a feature. Use them for:
- Gradual rollouts (10% → 50% → 100%)
- Beta testing with specific users
- Kill switches for risky features
- User targeting based on properties
- Trunk-based development (wrap incomplete features)
Experiments measure the impact of a change. Use them for:
- Testing hypotheses with statistical rigor
- Comparing variants against a control
- Determining if a change improves metrics
- Validating product decisions with data
In PostHog, experiments are built on top of feature flags. You create a flag, configure your experiment, and PostHog handles the statistical analysis using either Bayesian or frequentist methods.
Feature Flag Types
PostHog supports three types of feature flags:
Boolean Flags: Return true or false. Use for simple on/off feature toggles.
Multivariate Flags: Return one of multiple string variants (e.g., "control", "test-a", "test-b"). Essential for A/B/n testing.
Payload Flags: Return any valid JSON type (object, array, number, string, boolean, or null) as additional configuration data. Enables you to configure functionality without code changes.
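The three flag types map to three different SDK calls in posthog-js (`isFeatureEnabled`, `getFeatureFlag`, and `getFeatureFlagPayload`). Here is a minimal sketch of handling each type; the `posthog` object is stubbed with hypothetical flag keys so the example is self-contained, whereas in a real app it comes from the posthog-js SDK:

```javascript
// Stubbed client with hypothetical flags — in production, use the real posthog-js client.
const posthog = {
  flags: {
    'new-search': true,          // boolean flag
    'pricing-page': 'test-b',    // multivariate flag
  },
  payloads: {
    'checkout-config': { maxRetries: 3, theme: 'dark' }, // payload flag
  },
  isFeatureEnabled(key) { return this.flags[key] === true },
  getFeatureFlag(key) { return this.flags[key] },
  getFeatureFlagPayload(key) { return this.payloads[key] },
}

// Boolean: simple on/off toggle
const searchEnabled = posthog.isFeatureEnabled('new-search')

// Multivariate: branch on the returned string variant
const pricingVariant = posthog.getFeatureFlag('pricing-page')

// Payload: JSON configuration delivered alongside the flag
const checkoutConfig = posthog.getFeatureFlagPayload('checkout-config')
```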
Creating Effective Feature Flags
Flag Naming Conventions
Use a consistent naming pattern with descriptive, hierarchical keys:
- feature-new-checkout-flow – For new features
- experiment-pricing-page-v2 – For A/B tests
- release-dark-mode – For controlled releases
- ops-maintenance-banner – For operational flags
Best practices for naming:
- Name flags to reflect their return type (e.g., is_premium_user for a boolean, selected_theme for a string)
- Use positive language for boolean flags to avoid double negatives (e.g., is_premium_user instead of is_not_premium_user)
- Keep names descriptive: is_v2_billing_dashboard_enabled is clearer than is_dashboard_enabled
Targeting Options
PostHog offers powerful targeting capabilities:
Percentage Rollout:
Roll out to a percentage of users. PostHog uses consistent hashing, so users stay in their assigned group across sessions and devices (provided they're identified).
User Properties:
Target based on user attributes:
- plan = "enterprise" – Enterprise customers only
- country = "US" – Geographic targeting (requires GeoIP enabled)
- is_beta_tester = true – Opt-in beta users
Group Properties (B2B):
Target entire organizations using group analytics:
- company_size > 100 – Large companies
- industry = "finance" – Specific verticals
Cohorts:
Target users who belong to specific cohorts you've defined. Note: Behavioral cohorts (those using event/action filters) are not supported with feature flags due to query performance requirements.
Feature Flag Dependencies:
Create flags that depend on other flags' states for complex rollout scenarios (e.g., only show feature B if feature A is enabled).
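PostHog evaluates flag dependencies for you, but the underlying logic is easy to emulate in application code. The sketch below (with hypothetical flag keys and a stubbed client) shows the dependency rule: feature B resolves to off whenever its parent feature A is disabled.

```javascript
// Stubbed client — in production this is the posthog-js client.
const client = {
  flags: { 'feature-a': false, 'feature-b': true },
  isFeatureEnabled(key) { return this.flags[key] === true },
}

// Dependent flag: feature-b only activates when feature-a is enabled.
function resolveFlag(client, key) {
  if (key === 'feature-b' && !client.isFeatureEnabled('feature-a')) {
    return false // parent flag off → dependent flag forced off
  }
  return client.isFeatureEnabled(key)
}
```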
Running A/B Tests
The Experiment Process
1. Start with a Hypothesis
Before touching PostHog, write down:
- What you're changing: "Adding a progress bar to onboarding"
- What you expect: "Will increase onboarding completion by 10%"
- Primary metric: Onboarding Completed event
- Why you believe this: "Users drop off because they don't know how much is left"
2. Calculate Sample Size and Duration
PostHog includes a recommended running time calculator in the experiment setup. You need to know:
- Baseline conversion rate: What's the current rate? (e.g., 60%)
- Minimum detectable effect: What lift would be meaningful? (e.g., a 10% relative lift would move 60% to 66%)
- Statistical significance: Usually 95%
- Statistical power: Usually 80%
PostHog will calculate the minimum sample size required and approximately how long the test will take based on your traffic. A good rule of thumb: tests should run between one week and one month.
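Calculators like this are typically based on the standard two-proportion sample-size formula (normal approximation). A sketch, using the running example of a 60% baseline and a 10% relative lift at 95% significance and 80% power:

```javascript
// Required sample size per variant for comparing two conversion rates.
// z-values are fixed to the common defaults: 1.96 for two-sided 95%
// significance, 0.8416 for 80% power.
function sampleSizePerVariant(baseline, relativeMde) {
  const p1 = baseline
  const p2 = baseline * (1 + relativeMde)
  const zAlpha = 1.96
  const zBeta = 0.8416
  const variance = p1 * (1 - p1) + p2 * (1 - p2)
  const effect = p2 - p1
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / effect ** 2)
}

// 60% baseline, 10% relative lift (60% → 66%): ~1,000 users per variant
const n = sampleSizePerVariant(0.6, 0.1)
```

Divide the required sample size by your eligible daily traffic to estimate the duration, then sanity-check it against the one-week-to-one-month rule of thumb.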
3. Create the Experiment
- Go to the A/B Testing tab in PostHog
- Click "New Experiment"
- Name it and describe your hypothesis
- Set a feature flag key (PostHog creates the flag automatically)
- Define variants (Control, Test) or add more for multivariate tests
- Set your goal metric (funnel, trend, or ratio)
- Optionally add secondary metrics and guardrail metrics
- Save as draft first for testing
4. Test Before Launch
Before launching to all users, do a test rollout (e.g., 5% of users) to verify:
- Users are assigned to variants in the expected ratio (e.g., 50/50)
- The experiment isn't causing crashes or errors
- Metrics are being tracked correctly
5. Implement the Variants
In your code, use the feature flag to show different experiences:
```js
const variant = posthog.getFeatureFlag('experiment-onboarding-progress')

if (variant === 'control') {
  // Show original onboarding
} else if (variant === 'test') {
  // Show onboarding with progress bar
}
```
Important: Filter out ineligible users in your code before checking the feature flag. For example, if testing a new onboarding flow, don't include users who have already completed onboarding.
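One way to enforce this is to gate eligibility before the flag is ever evaluated, so ineligible users never enter the experiment's exposure counts. A sketch, with a hypothetical `user` shape and a stubbed client:

```javascript
// Check eligibility first; only eligible users evaluate the flag.
function getOnboardingVariant(client, user) {
  if (user.hasCompletedOnboarding) {
    return null // ineligible — don't evaluate the flag at all
  }
  return client.getFeatureFlag('experiment-onboarding-progress')
}

// Stubbed client for illustration — in production, pass the posthog-js client.
const fakeClient = { getFeatureFlag: () => 'test' }
```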
6. Wait for Significance
This is the hard part. With PostHog's Bayesian approach, you can check results at any time without statistical penalties. PostHog will show you when you've reached statistical significance (typically 90%+ win probability). However, avoid making decisions based on very early data—let the credible intervals stabilize.
Understanding PostHog's Statistical Methods
PostHog supports two statistical approaches:
Bayesian (Default):
- Directly answers "Is variant A better than variant B?"
- Shows win probability (likelihood each variant is better)
- Provides credible intervals (95% probability the true value lies within this range)
- You can check results anytime without statistical penalties
- Uses different models for different metric types:
- Funnel metrics: Beta model
- Count metrics: Gamma-Poisson model
- Revenue/continuous metrics: Lognormal model
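To make the Beta model concrete: for a funnel metric, each variant's conversion rate gets a Beta(successes + 1, failures + 1) posterior, and the win probability can be estimated by sampling from both posteriors. The sketch below is illustrative only (PostHog's implementation differs in detail) and uses the Marsaglia–Tsang gamma sampler to draw Beta variates:

```javascript
// Marsaglia–Tsang gamma sampler (valid for shape >= 1)
function sampleGamma(shape) {
  const d = shape - 1 / 3
  const c = 1 / Math.sqrt(9 * d)
  while (true) {
    let x, v
    do {
      // Box–Muller standard normal (1 - random() avoids log(0))
      const u1 = 1 - Math.random(), u2 = Math.random()
      x = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2)
      v = 1 + c * x
    } while (v <= 0)
    v = v * v * v
    const u = Math.random()
    if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) return d * v
  }
}

// Beta(a, b) via two gamma draws
function sampleBeta(a, b) {
  const x = sampleGamma(a)
  return x / (x + sampleGamma(b))
}

// Monte Carlo estimate of P(test's conversion rate beats control's)
function winProbability(ctrlConv, ctrlN, testConv, testN, draws = 5000) {
  let wins = 0
  for (let i = 0; i < draws; i++) {
    const pCtrl = sampleBeta(ctrlConv + 1, ctrlN - ctrlConv + 1)
    const pTest = sampleBeta(testConv + 1, testN - testConv + 1)
    if (pTest > pCtrl) wins++
  }
  return wins / draws
}
```

With more data, the posteriors narrow, which is why early win probabilities bounce around and later ones stabilize.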
Frequentist:
- Uses t-tests and confidence intervals
- Reports p-values (result is significant if p < 0.05)
- Uses Welch's method to account for unequal variances
- Requires predefined sample sizes; checking early inflates false positives
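For intuition, here is a sketch of Welch's t-statistic and the Welch–Satterthwaite degrees of freedom for two samples with unequal variances. The p-value step (a t-distribution CDF lookup) is omitted, and this is not PostHog's exact implementation:

```javascript
// Welch's t-test core: t-statistic and effective degrees of freedom.
function welch(sample1, sample2) {
  const mean = xs => xs.reduce((a, b) => a + b, 0) / xs.length
  const variance = xs => {
    const m = mean(xs)
    return xs.reduce((a, b) => a + (b - m) ** 2, 0) / (xs.length - 1)
  }
  const se1 = variance(sample1) / sample1.length
  const se2 = variance(sample2) / sample2.length
  const t = (mean(sample2) - mean(sample1)) / Math.sqrt(se1 + se2)
  // Welch–Satterthwaite approximation for the degrees of freedom
  const df = (se1 + se2) ** 2 /
    (se1 ** 2 / (sample1.length - 1) + se2 ** 2 / (sample2.length - 1))
  return { t, df }
}
```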
You can set the default method in Settings > Organization > General, or override per experiment.
Interpreting Results
PostHog shows you:
- Delta: Percentage change compared to control (e.g., +10%)
- Win probability (Bayesian): Likelihood this variant is better (e.g., 97%)
- Credible/Confidence interval: Range where the true effect likely falls
- Statistical significance: Color-coded (green = winning, red = losing, no color = not significant)
Visual indicators:
- If the interval doesn't cross zero, the result is statistically significant
- Arrows (↑ or ↓) indicate whether the metric increased or decreased
When to ship:
- Win probability > 90% AND positive lift → Ship the treatment
- Win probability > 90% AND negative lift → Keep the control
- Win probability < 90% → Run longer or accept inconclusive results
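These shipping rules are simple enough to encode directly. The thresholds below are the rules of thumb from this guide, not PostHog defaults:

```javascript
// Decision rule: ship, keep control, or keep running.
function shipDecision(winProbability, lift) {
  if (winProbability > 0.9 && lift > 0) return 'ship the treatment'
  if (winProbability > 0.9 && lift < 0) return 'keep the control'
  return 'run longer or accept inconclusive'
}
```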
Common Experimentation Mistakes
1. Including Unaffected Users
Including users who aren't affected by the change dilutes your results. If testing a new onboarding flow, filter out users who have already completed onboarding before evaluating the feature flag.
2. Peeking and Stopping Early (Frequentist)
With frequentist methods, looking at results every day and stopping when it "looks significant" inflates false positive rates. Set your sample size upfront and commit to it. Note: Bayesian methods allow checking results anytime.
3. Testing Too Many Things at Once
If you change 5 things at once, you won't know which one drove the result. Test one hypothesis at a time. The caveat: changes that are too small can slow your team down, so balance granularity with velocity.
4. Wrong Success Metric
Optimizing for clicks on a button doesn't matter if those clicks don't lead to conversions. Use business metrics, not vanity metrics.
5. Ignoring Guardrail Metrics
Your treatment might increase signups but decrease retention. Always monitor counter metrics to catch unintended consequences. For example, if testing a sign-up page change, also monitor time spent in app to ensure the new page isn't attracting users who quickly disengage.
6. Running Too Short
Day-of-week effects are real. Run experiments for at least one full week, ideally two, to capture variance in user behavior. Seasonal periods can also cause significant changes.
7. Not Pre-calculating Running Time
Starting without deciding how long to run can cause the "peeking problem." Use PostHog's running time calculator to determine if you have sufficient statistical power.
Advanced Patterns
Holdout Groups
PostHog has built-in holdout group support for measuring cumulative impact of multiple changes. Holdouts are randomly assigned lists of users excluded from experiments. You can:
- Exclude users from specific experiments or all experiments
- Measure long-term effects after experiments end
- Verify experiments don't have negative long-term impacts
When assigned to an experiment, your holdout appears as another variant in analysis with full statistical metrics.
Staged Rollouts
After an experiment wins:
- Roll out to 10% and monitor for bugs
- Increase to 50% and watch metrics
- Roll out to 100%
- Remove the feature flag code in a cleanup sprint
Important: Leaving flags in your code too long creates technical debt and can confuse future developers.
Group-Targeted Experiments
For B2B products, run experiments at the organization level instead of user level. Every member of a group receives the same variant, ensuring consistent experiences and enabling measurement of impact on the group as a whole.
Experiment Documentation
Keep a log of every experiment:
- Hypothesis and rationale
- Start/end dates
- Sample size and duration
- Results and statistical significance
- Decision and learnings
This prevents re-running failed experiments and builds institutional knowledge. PostHog allows you to add descriptions and screenshots directly to experiments.
PostHog-Specific Tips
Performance Optimization
- Use local evaluation for high-volume: Instead of making a request for each flag, PostHog periodically fetches and stores flag definitions locally, enabling evaluation without network calls. Latency drops from ~100-500ms to under 50ms.
- Bootstrap flags for instant loading: Pass precomputed flag values in your initial page load to avoid async evaluation delays and prevent flickering.
- Use server-side flags for critical paths: Client-side flags can flicker. For checkout flows, evaluate flags server-side.
Reliability
- Deploy a reverse proxy: Ad blockers can disable feature flags. Using your own domain for PostHog requests reduces interception by tracking blockers.
- Handle errors gracefully: Wrap PostHog SDK methods in try-catch blocks. Set appropriate timeouts with feature_flag_request_timeout_ms.
- Identify users consistently: Different distinct IDs can cause the same user to receive different flag values across sessions. Always identify users to ensure consistent experiences.
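A common pattern that covers both graceful error handling and the single-function flag-hygiene advice below is a small wrapper with a safe default, so SDK errors or timeouts degrade to the control experience. A sketch (the client is stubbed for illustration):

```javascript
// Single point of flag evaluation with a safe fallback value.
function safeGetFlag(client, key, fallback = 'control') {
  try {
    const value = client.getFeatureFlag(key)
    return value === undefined ? fallback : value
  } catch (err) {
    // Log and fall back rather than breaking the page
    return fallback
  }
}

// Stubbed clients for illustration
const flakyClient = { getFeatureFlag: () => { throw new Error('network') } }
const workingClient = { getFeatureFlag: () => 'test' }
```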
Flag Hygiene
- Minimize flag locations: The more places a flag appears in code, the more likely problems occur. Wrap flags in a single function if used in multiple places.
- Clean up after rollouts: Remove flag code after full rollout to reduce technical debt.
- Use evaluation environments: Control where flags evaluate (client-side vs. server-side) to prevent flags from evaluating in unintended environments and reduce unnecessary evaluation costs.
Experiment Features
- View session recordings: See exactly what users experienced in each variant by accessing recordings tied to experiment results.
- Use the toolbar for testing: Override feature flag values in your browser to test variants without affecting other users.
- Set up alerts: Get notified when experiments reach significance so you can act quickly.
Success Metrics
Don't be surprised when experiments fail. Industry benchmarks show:
- At Bing, only 10-20% of experiments generate positive results
- Booking.com runs ~25,000 tests per year; only 10% generate positive results
The value is in the learning. Every experiment—win or lose—teaches you something about your users. Feature flags and experiments are the fastest path to product improvement. Every feature becomes a hypothesis, every release becomes an opportunity to learn. Start small, build the muscle, and let data drive your decisions.