PostHog combines feature flags and A/B testing into a unified experimentation platform. This means you can control feature rollouts, run experiments, and analyze results all in one place. Here's how to use it effectively.
Feature Flags vs. Experiments
First, understand the difference:
Feature Flags control who sees a feature. Use them for:
- Gradual rollouts (10% → 50% → 100%)
- Beta testing with specific users
- Kill switches for risky features
- User targeting based on properties
- Trunk-based development (wrap incomplete features)
Experiments measure the impact of a change. Use them for:
- Testing hypotheses with statistical rigor
- Comparing variants against a control
- Determining if a change improves metrics
- Validating product decisions with data
In PostHog, experiments are built on top of feature flags. You create a flag, configure your experiment, and PostHog handles the statistical analysis using either Bayesian or frequentist methods.
Feature Flag Types
PostHog supports three types of feature flags:
Boolean Flags: Return true or false. Use for simple on/off feature toggles.
Multivariate Flags: Return one of multiple string variants (e.g., "control", "test-a", "test-b"). Essential for A/B/n testing.
Payload Flags: Return any valid JSON type (object, array, number, string, boolean, or null) as additional configuration data. Enables you to configure functionality without code changes.
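The three flag types map to three different SDK calls in posthog-js (`isFeatureEnabled`, `getFeatureFlag`, and `getFeatureFlagPayload`). Here is a minimal sketch of handling each type; the `posthog` object is stubbed with hypothetical flag keys so the example is self-contained, whereas in a real app it comes from the posthog-js SDK:

```javascript
// Stubbed client with hypothetical flags — in production, use the real posthog-js client.
const posthog = {
  flags: {
    'new-search': true,          // boolean flag
    'pricing-page': 'test-b',    // multivariate flag
  },
  payloads: {
    'checkout-config': { maxRetries: 3, theme: 'dark' }, // payload flag
  },
  isFeatureEnabled(key) { return this.flags[key] === true },
  getFeatureFlag(key) { return this.flags[key] },
  getFeatureFlagPayload(key) { return this.payloads[key] },
}

// Boolean: simple on/off toggle
const searchEnabled = posthog.isFeatureEnabled('new-search')

// Multivariate: branch on the returned string variant
const pricingVariant = posthog.getFeatureFlag('pricing-page')

// Payload: JSON configuration delivered alongside the flag
const checkoutConfig = posthog.getFeatureFlagPayload('checkout-config')
```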
Creating Effective Feature Flags
Flag Naming Conventions
Use a consistent naming pattern with descriptive, hierarchical keys:
- feature-new-checkout-flow – For new features
- experiment-pricing-page-v2 – For A/B tests
- release-dark-mode – For controlled releases
- ops-maintenance-banner – For operational flags
Best practices for naming:
- Name flags to reflect their return type (e.g., is_premium_user for a boolean, selected_theme for a string)
- Use positive language for boolean flags to avoid double negatives (e.g., is_premium_user instead of is_not_premium_user)
- Keep names descriptive: is_v2_billing_dashboard_enabled is clearer than is_dashboard_enabled
Targeting Options
PostHog offers powerful targeting capabilities:
Percentage Rollout:
Roll out to a percentage of users. PostHog uses consistent hashing, so users stay in their assigned group across sessions and devices (provided they're identified).
User Properties:
Target based on user attributes:
- plan = "enterprise" – Enterprise customers only
- country = "US" – Geographic targeting (requires GeoIP enabled)
- is_beta_tester = true – Opt-in beta users
Group Properties (B2B):
Target entire organizations using group analytics:
- company_size > 100 – Large companies
- industry = "finance" – Specific verticals
Cohorts:
Target users who belong to specific cohorts you've defined. Note: Behavioral cohorts (those using event/action filters) are not supported with feature flags due to query performance requirements.
Feature Flag Dependencies:
Create flags that depend on other flags' states for complex rollout scenarios (e.g., only show feature B if feature A is enabled).
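PostHog evaluates flag dependencies for you, but the underlying logic is easy to emulate in application code. The sketch below (with hypothetical flag keys and a stubbed client) shows the dependency rule: feature B resolves to off whenever its parent feature A is disabled.

```javascript
// Stubbed client — in production this is the posthog-js client.
const client = {
  flags: { 'feature-a': false, 'feature-b': true },
  isFeatureEnabled(key) { return this.flags[key] === true },
}

// Dependent flag: feature-b only activates when feature-a is enabled.
function resolveFlag(client, key) {
  if (key === 'feature-b' && !client.isFeatureEnabled('feature-a')) {
    return false // parent flag off → dependent flag forced off
  }
  return client.isFeatureEnabled(key)
}
```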
Running A/B Tests
The Experiment Process
1. Start with a Hypothesis
Before touching PostHog, write down:
- What you're changing: "Adding a progress bar to onboarding"
- What you expect: "Will increase onboarding completion by 10%"
- Primary metric: Onboarding Completed event
- Why you believe this: "Users drop off because they don't know how much is left"
2. Calculate Sample Size and Duration
PostHog includes a recommended running time calculator in the experiment setup. You need to know:
- Baseline conversion rate: What's the current rate? (e.g., 60%)
- Minimum detectable effect: What lift would be meaningful? (e.g., a 10% relative lift would move 60% to 66%)
- Statistical significance: Usually 95%
- Statistical power: Usually 80%
PostHog will calculate the minimum sample size required and approximately how long the test will take based on your traffic. A good rule of thumb: tests should run between one week and one month.
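Calculators like this are typically based on the standard two-proportion sample-size formula (normal approximation). A sketch, using the running example of a 60% baseline and a 10% relative lift at 95% significance and 80% power:

```javascript
// Required sample size per variant for comparing two conversion rates.
// z-values are fixed to the common defaults: 1.96 for two-sided 95%
// significance, 0.8416 for 80% power.
function sampleSizePerVariant(baseline, relativeMde) {
  const p1 = baseline
  const p2 = baseline * (1 + relativeMde)
  const zAlpha = 1.96
  const zBeta = 0.8416
  const variance = p1 * (1 - p1) + p2 * (1 - p2)
  const effect = p2 - p1
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / effect ** 2)
}

// 60% baseline, 10% relative lift (60% → 66%): ~1,000 users per variant
const n = sampleSizePerVariant(0.6, 0.1)
```

Divide the required sample size by your eligible daily traffic to estimate the duration, then sanity-check it against the one-week-to-one-month rule of thumb.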
3. Create the Experiment
- Go to the A/B Testing tab in PostHog
- Click "New Experiment"
- Name it and describe your hypothesis
- Set a feature flag key (PostHog creates the flag automatically)
- Define variants (Control, Test) or add more for multivariate tests
- Set your goal metric (funnel, trend, or ratio)
- Optionally add secondary metrics and guardrail metrics
- Save as draft first for testing
4. Test Before Launch
Before launching to all users, do a test rollout (e.g., 5% of users) to verify:
- Users are assigned to variants in the expected ratio (e.g., 50/50)
- The experiment isn't causing crashes or errors
- Metrics are being tracked correctly
5. Implement the Variants
In your code, use the feature flag to show different experiences:
```js
const variant = posthog.getFeatureFlag('experiment-onboarding-progress')

if (variant === 'control') {
  // Show original onboarding
} else if (variant === 'test') {
  // Show onboarding with progress bar
}
```
Important: Filter out ineligible users in your code before checking the feature flag. For example, if testing a new onboarding flow, don't include users who have already completed onboarding.
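One way to enforce this is to gate eligibility before the flag is ever evaluated, so ineligible users never enter the experiment's exposure counts. A sketch, with a hypothetical `user` shape and a stubbed client:

```javascript
// Check eligibility first; only eligible users evaluate the flag.
function getOnboardingVariant(client, user) {
  if (user.hasCompletedOnboarding) {
    return null // ineligible — don't evaluate the flag at all
  }
  return client.getFeatureFlag('experiment-onboarding-progress')
}

// Stubbed client for illustration — in production, pass the posthog-js client.
const fakeClient = { getFeatureFlag: () => 'test' }
```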
6. Wait for Significance
This is the hard part. With PostHog's Bayesian approach, you can check results at any time without statistical penalties. PostHog will show you when you've reached statistical significance (typically 90%+ win probability). However, avoid making decisions based on very early data—let the credible intervals stabilize.
Understanding PostHog's Statistical Methods
PostHog supports two statistical approaches:
Bayesian (Default):
- Directly answers "Is variant A better than variant B?"
- Shows win probability (likelihood each variant is better)
- Provides credible intervals (95% probability the true value lies within this range)
- You can check results anytime without statistical penalties
- Uses different models for different metric types:
- Funnel metrics: Beta model
- Count metrics: Gamma-Poisson model
- Revenue/continuous metrics: Lognormal model
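To make the Beta model concrete: for a funnel metric, each variant's conversion rate gets a Beta(successes + 1, failures + 1) posterior, and the win probability can be estimated by sampling from both posteriors. The sketch below is illustrative only (PostHog's implementation differs in detail) and uses the Marsaglia–Tsang gamma sampler to draw Beta variates:

```javascript
// Marsaglia–Tsang gamma sampler (valid for shape >= 1)
function sampleGamma(shape) {
  const d = shape - 1 / 3
  const c = 1 / Math.sqrt(9 * d)
  while (true) {
    let x, v
    do {
      // Box–Muller standard normal (1 - random() avoids log(0))
      const u1 = 1 - Math.random(), u2 = Math.random()
      x = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2)
      v = 1 + c * x
    } while (v <= 0)
    v = v * v * v
    const u = Math.random()
    if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) return d * v
  }
}

// Beta(a, b) via two gamma draws
function sampleBeta(a, b) {
  const x = sampleGamma(a)
  return x / (x + sampleGamma(b))
}

// Monte Carlo estimate of P(test's conversion rate beats control's)
function winProbability(ctrlConv, ctrlN, testConv, testN, draws = 5000) {
  let wins = 0
  for (let i = 0; i < draws; i++) {
    const pCtrl = sampleBeta(ctrlConv + 1, ctrlN - ctrlConv + 1)
    const pTest = sampleBeta(testConv + 1, testN - testConv + 1)
    if (pTest > pCtrl) wins++
  }
  return wins / draws
}
```

With more data, the posteriors narrow, which is why early win probabilities bounce around and later ones stabilize.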
Frequentist:
- Uses t-tests and confidence intervals
- Reports p-values (result is significant if p < 0.05)
- Uses Welch's method to account for unequal variances
- Requires predefined sample sizes; checking early inflates false positives
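For intuition, here is a sketch of Welch's t-statistic and the Welch–Satterthwaite degrees of freedom for two samples with unequal variances. The p-value step (a t-distribution CDF lookup) is omitted, and this is not PostHog's exact implementation:

```javascript
// Welch's t-test core: t-statistic and effective degrees of freedom.
function welch(sample1, sample2) {
  const mean = xs => xs.reduce((a, b) => a + b, 0) / xs.length
  const variance = xs => {
    const m = mean(xs)
    return xs.reduce((a, b) => a + (b - m) ** 2, 0) / (xs.length - 1)
  }
  const se1 = variance(sample1) / sample1.length
  const se2 = variance(sample2) / sample2.length
  const t = (mean(sample2) - mean(sample1)) / Math.sqrt(se1 + se2)
  // Welch–Satterthwaite approximation for the degrees of freedom
  const df = (se1 + se2) ** 2 /
    (se1 ** 2 / (sample1.length - 1) + se2 ** 2 / (sample2.length - 1))
  return { t, df }
}
```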
You can set the default method in Settings > Organization > General, or override per experiment.
Interpreting Results
PostHog shows you:
- Delta: Percentage change compared to control (e.g., +10%)
- Win probability (Bayesian): Likelihood this variant is better (e.g., 97%)
- Credible/Confidence interval: Range where the true effect likely falls
- Statistical significance: Color-coded (green = winning, red = losing, no color = not significant)
Visual indicators:
- If the interval doesn't cross zero, the result is statistically significant
- Arrows (↑ or ↓) indicate whether the metric increased or decreased
When to ship:
- Win probability > 90% AND positive lift → Ship the treatment
- Win probability > 90% AND negative lift → Keep the control
- Win probability < 90% → Run longer or accept inconclusive results
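These shipping rules are simple enough to encode directly. The thresholds below are the rules of thumb from this guide, not PostHog defaults:

```javascript
// Decision rule: ship, keep control, or keep running.
function shipDecision(winProbability, lift) {
  if (winProbability > 0.9 && lift > 0) return 'ship the treatment'
  if (winProbability > 0.9 && lift < 0) return 'keep the control'
  return 'run longer or accept inconclusive'
}
```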
Common Experimentation Mistakes
1. Including Unaffected Users
Including users who aren't affected by the change dilutes your results. If testing a new onboarding flow, filter out users who have already completed onboarding before evaluating the feature flag.
2. Peeking and Stopping Early (Frequentist)
With frequentist methods, looking at results every day and stopping when it "looks significant" inflates false positive rates. Set your sample size upfront and commit to it. Note: Bayesian methods allow checking results anytime.
3. Testing Too Many Things at Once
If you change 5 things at once, you won't know which one drove the result. Test one hypothesis at a time. The caveat: changes that are too small can slow your team down, so balance granularity with velocity.
4. Wrong Success Metric
Optimizing for clicks on a button doesn't matter if those clicks don't lead to conversions. Use business metrics, not vanity metrics.
5. Ignoring Guardrail Metrics
Your treatment might increase signups but decrease retention. Always monitor counter metrics to catch unintended consequences. For example, if testing a sign-up page change, also monitor time spent in app to ensure the new page isn't attracting users who quickly disengage.
6. Running Too Short
Day-of-week effects are real. Run experiments for at least one full week, ideally two, to capture variance in user behavior. Seasonal periods can also cause significant changes.
7. Not Pre-calculating Running Time
Starting without deciding how long to run can cause the "peeking problem." Use PostHog's running time calculator to determine if you have sufficient statistical power.
Advanced Patterns
Holdout Groups
PostHog has built-in holdout group support for measuring cumulative impact of multiple changes. Holdouts are randomly assigned lists of users excluded from experiments. You can:
- Exclude users from specific experiments or all experiments
- Measure long-term effects after experiments end
- Verify experiments don't have negative long-term impacts
When assigned to an experiment, your holdout appears as another variant in analysis with full statistical metrics.
Staged Rollouts
After an experiment wins:
- Roll out to 10% and monitor for bugs
- Increase to 50% and watch metrics
- Roll out to 100%
- Remove the feature flag code in a cleanup sprint
Important: Leaving flags in your code too long creates technical debt and can confuse future developers.
Group-Targeted Experiments
For B2B products, run experiments at the organization level instead of user level. Every member of a group receives the same variant, ensuring consistent experiences and enabling measurement of impact on the group as a whole.
Experiment Documentation
Keep a log of every experiment:
- Hypothesis and rationale
- Start/end dates
- Sample size and duration
- Results and statistical significance
- Decision and learnings
This prevents re-running failed experiments and builds institutional knowledge. PostHog allows you to add descriptions and screenshots directly to experiments.
PostHog-Specific Tips
Performance Optimization
- Use local evaluation for high-volume: Instead of making a request for each flag, PostHog periodically fetches and stores flag definitions locally, enabling evaluation without network calls. Latency drops from ~100-500ms to under 50ms.
- Bootstrap flags for instant loading: Pass precomputed flag values in your initial page load to avoid async evaluation delays and prevent flickering.
- Use server-side flags for critical paths: Client-side flags can flicker. For checkout flows, evaluate flags server-side.
Reliability
- Deploy a reverse proxy: Ad blockers can disable feature flags. Using your own domain for PostHog requests reduces interception by tracking blockers.
- Handle errors gracefully: Wrap PostHog SDK methods in try-catch blocks. Set appropriate timeouts with feature_flag_request_timeout_ms.
- Identify users consistently: Different distinct IDs can cause the same user to receive different flag values across sessions. Always identify users to ensure consistent experiences.
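A common pattern that covers both graceful error handling and the single-function flag-hygiene advice below is a small wrapper with a safe default, so SDK errors or timeouts degrade to the control experience. A sketch (the client is stubbed for illustration):

```javascript
// Single point of flag evaluation with a safe fallback value.
function safeGetFlag(client, key, fallback = 'control') {
  try {
    const value = client.getFeatureFlag(key)
    return value === undefined ? fallback : value
  } catch (err) {
    // Log and fall back rather than breaking the page
    return fallback
  }
}

// Stubbed clients for illustration
const flakyClient = { getFeatureFlag: () => { throw new Error('network') } }
const workingClient = { getFeatureFlag: () => 'test' }
```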
Flag Hygiene
- Minimize flag locations: The more places a flag appears in code, the more likely problems occur. Wrap flags in a single function if used in multiple places.
- Clean up after rollouts: Remove flag code after full rollout to reduce technical debt.
- Use evaluation environments: Control where flags evaluate (client-side vs. server-side) to prevent flags from evaluating in unintended environments and reduce unnecessary evaluation costs.
Experiment Features
- View session recordings: See exactly what users experienced in each variant by accessing recordings tied to experiment results.
- Use the toolbar for testing: Override feature flag values in your browser to test variants without affecting other users.
- Set up alerts: Get notified when experiments reach significance so you can act quickly.
Success Metrics
Don't be surprised when experiments fail. Industry benchmarks show:
- At Bing, only 10-20% of experiments generate positive results
- Booking.com runs ~25,000 tests per year; only 10% generate positive results
The value is in the learning. Every experiment—win or lose—teaches you something about your users. Feature flags and experiments are the fastest path to product improvement. Every feature becomes a hypothesis, every release becomes an opportunity to learn. Start small, build the muscle, and let data drive your decisions.