Why Most A/B Tests Are Lying to You


Thursday, 3 PM. A product manager at a Series B SaaS company opens her A/B testing dashboard for the fourth time that day, a half-drunk cold brew beside her laptop. The screen reads: Variant B, +8.3% conversion lift, 96% statistical significance.

She screenshots the result. Posts it in the #product-wins Slack channel with a celebration emoji. The head of engineering replies with a thumbs-up and starts planning the rollout sprint.

Here’s what the dashboard didn’t show her: if she had waited three more days (the original planned test duration), that significance would have dropped to 74%. The +8.3% lift would have shrunk to +1.2%. Below the noise floor. Not real.

If you’ve ever stopped a test early because it “hit significance,” you’ve probably shipped a version of this error. You’re in good company. At Google and Bing, only 10% to 20% of controlled experiments generate positive results, according to Ronny Kohavi’s research published in the Harvard Business Review. At Microsoft broadly, one-third of experiments prove effective, one-third are neutral, and one-third actively hurt the metrics they were intended to improve. Most ideas don’t work. The experiments that “prove” they do are often telling you what you want to hear.

If your A/B testing tool lets you peek at results daily and stop whenever the confidence bar turns green, it’s not a testing tool. It’s a random number generator with a nicer UI.

The four statistical sins below account for the majority of unreliable A/B test results. Each takes less than 15 minutes to fix. By the end of this article, you’ll have a five-item pre-test checklist and a decision framework for choosing between frequentist, Bayesian, and sequential testing that you can apply to your next experiment Monday morning.


The Peeking Problem: 26% of Your Winners Aren’t Real

Every time you check your A/B test results before the planned end date, you’re running a new statistical test. Not metaphorically. Literally.

Frequentist significance tests are designed for a single look at a pre-determined sample size. When you check results after 100 visitors, then 200, then 500, then 1,000, you’re not running one test. You’re running four. Each look gives noise another chance to masquerade as signal.

Evan Miller quantified this in his widely cited analysis “How Not to Run an A/B Test.” If you check results after every batch of new data and stop the moment you see p < 0.05, the actual false positive rate isn’t 5%.

It’s 26.1%.

One in four “winners” is pure noise.

The mechanics are straightforward. A significance test controls the false positive rate at 5% for a single analysis point. Multiple checks create multiple opportunities for random fluctuations to cross the significance threshold. As Miller puts it: “If you peek at an ongoing experiment ten times, then what you think is 1% significance is actually just 5%.”

Checking results repeatedly and stopping at significance inflates your false positive rate by more than 5x. Image by the author.
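To see the inflation yourself, here’s a minimal Python simulation. The numbers are illustrative assumptions, not from Miller’s analysis: a 5% baseline conversion rate, ten interim looks of 200 visitors per arm, and no true difference between variants, so every declared winner is a false positive.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - norm.cdf(abs(z)))

def peeking_run(baseline=0.05, batch=200, n_looks=10):
    """A/A test with no true effect: peek after every batch, stop at p < 0.05."""
    conv_a = conv_b = n_a = n_b = 0
    for _ in range(n_looks):
        conv_a += rng.binomial(batch, baseline)
        conv_b += rng.binomial(batch, baseline)  # same true rate as control
        n_a += batch
        n_b += batch
        if two_proportion_p_value(conv_a, n_a, conv_b, n_b) < 0.05:
            return True  # stopped early and declared a (false) winner
    return False

runs = 5_000
false_winners = sum(peeking_run() for _ in range(runs))
print(f"False positive rate with peeking: {false_winners / runs:.1%}")
# Lands well above the nominal 5% -- the exact figure depends on how often you look
```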

This is the most common sin in A/B testing, and the most costly. Teams make product decisions, allocate engineering resources, and report revenue impact to leadership based on results that had a one-in-four chance of being imaginary.

The fix is simple but unpopular: calculate your required sample size before you start, and don’t look at the results until you hit it. If that discipline feels painful (and for most teams, it does), sequential testing offers a middle path. More on that in the framework below.

Check your test results after every batch of visitors, and you’ll “find” a winner 26% of the time. Even if there isn’t one.


The Power Vacuum: Small Samples, Inflated Effects

Peeking creates false winners. The second sin makes real winners look larger than they are.

Statistical power is the probability that your test will detect a real effect when one exists. The standard goal is 80%, meaning a 20% chance you’ll miss a real effect even when it’s there. To hit 80% power, you need a specific sample size, and that number depends on three things: your baseline conversion rate, the smallest effect you want to detect, and your significance threshold.

Most teams skip the power calculation. They run the test “until it’s significant” or “for two weeks,” whichever comes first. This creates a phenomenon called the winner’s curse.

Here’s how it works. In an underpowered test, the random variation in your data is large relative to the true effect. The only way a real-but-small effect reaches statistical significance in a small sample is if random noise pushes the measured effect far above its true value. So the very act of reaching significance in an underpowered test guarantees that your estimated effect is inflated.

When small samples produce significant results, the observed effect is often inflated well above the true value.
Image by the author.

A team might celebrate a +8% conversion lift, ship the change, and then watch the actual number settle at +2% over the next quarter. The test wasn’t wrong exactly (there was a real effect), but the team based their revenue projections on an inflated number. An artifact of insufficient sample size.
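A short simulation makes the winner’s curse concrete. The numbers below are hypothetical (a 3% baseline, a true +0.3 percentage point lift, only 2,000 visitors per arm), but the pattern holds for any underpowered test: among the runs that reach significance, the average observed lift sits far above the true one.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

baseline, true_lift, n = 0.03, 0.003, 2_000   # deliberately underpowered
lifts_among_winners = []

for _ in range(20_000):
    conv_a = rng.binomial(n, baseline)
    conv_b = rng.binomial(n, baseline + true_lift)
    p_a, p_b = conv_a / n, conv_b / n
    p_pool = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    if p_value < 0.05 and p_b > p_a:          # a "significant winner"
        lifts_among_winners.append(p_b - p_a)

print(f"True lift:                     +{true_lift:.2%}")
print(f"Average lift among 'winners':  +{np.mean(lifts_among_winners):.2%}")
# The significant-only average is typically several times the true lift
```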

An underpowered test that reaches significance doesn’t find the truth. It finds an exaggeration of the truth.

The fix: run a power analysis before every test. Set your Minimum Detectable Effect (MDE) at the smallest change that would justify the engineering and product effort to ship. Calculate the sample size needed at 80% power. Then run the test until you reach that number. No early exits.
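If your platform doesn’t expose a power calculator, a few lines of Python with statsmodels do the same job as Evan Miller’s tool. The baseline and MDE below are placeholder assumptions; substitute your own.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10   # illustrative baseline conversion rate
mde = 0.02        # smallest lift worth shipping: +2 percentage points

# Cohen's h for the two proportions, then solve for the per-variant sample size
effect = proportion_effectsize(baseline + mde, baseline)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Required sample size: ~{n_per_variant:,.0f} per variant")
```

Different calculators use slightly different approximations, so expect small discrepancies between tools; the order of magnitude is what matters for planning runtime.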


The Multiple Comparisons Trap

The third sin scales with ambition. Your A/B test tracks conversion rate, average order value, bounce rate, time on page, and click-through rate on the call-to-action. Five metrics. Standard practice.

Here’s the problem. At a 5% significance level per metric, the probability of at least one false positive across all five isn’t 5%. It’s 22.6%.

The math: 1 − (1 − 0.05)⁵ = 0.226.

Scale that to 20 metrics (common in analytics-heavy teams) and the probability hits 64.2%. You’re more likely to find noise that looks real than to avoid it entirely.

At 20 metrics and a standard 5% threshold, you have a nearly two-in-three chance of celebrating noise.
Image by the author.

Test 20 metrics at a 5% threshold and you have a 64% chance of celebrating noise.
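The arithmetic is worth wiring into a one-line helper, assuming the metrics are roughly independent (correlated metrics inflate the rate less, but the direction is the same):

```python
def family_wise_error_rate(k_metrics: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across k independent metrics."""
    return 1 - (1 - alpha) ** k_metrics

for k in (1, 5, 20):
    print(f"{k:>2} metrics -> {family_wise_error_rate(k):.1%} chance of a false alarm")
# 1 metric -> 5.0%, 5 metrics -> 22.6%, 20 metrics -> 64.2%
```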

This is the multiple comparisons problem, and most practitioners know it exists in theory but don’t correct for it in practice. They declare one primary metric, then quietly celebrate when a secondary metric hits significance. Or they run the same test across four user segments and count a segment-level win as a real result.

Two corrections exist, and major platforms already support them. Benjamini-Hochberg controls the expected proportion of false discoveries among your significant results (less conservative, preserves more power). Holm-Bonferroni controls the probability of even one false positive (more conservative, appropriate when a single wrong call has serious consequences). Optimizely uses a tiered version of Benjamini-Hochberg. GrowthBook offers both.
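Both corrections are a single function call in Python’s statsmodels; the p-values below are invented for illustration, chosen so the two procedures disagree.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values for five metrics from one experiment
p_values = [0.001, 0.008, 0.039, 0.041, 0.042]

# Benjamini-Hochberg: controls the false discovery rate (less conservative)
bh_reject, _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

# Holm-Bonferroni: controls the chance of even one false positive (more conservative)
holm_reject, _, _, _ = multipletests(p_values, alpha=0.05, method="holm")

for p, bh, holm in zip(p_values, bh_reject, holm_reject):
    print(f"p = {p:.3f}   BH keeps it: {bh}   Holm keeps it: {holm}")
# With these values BH keeps all five while Holm keeps only the two smallest
```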

The fix: declare one primary metric before the test starts. Everything else is exploratory. If you want to evaluate multiple metrics formally, apply a correction. If your platform doesn’t offer one, you need a different platform.


When “Significant” Doesn’t Mean Significant

The fourth sin is the quietest and possibly the most costly. A test can be statistically significant and practically worthless at the same time.

Statistical significance answers exactly one question: “Is this result likely due to chance?” It says nothing about whether the difference is large enough to matter. A test with 2 million visitors can detect a 0.02 percentage point lift on conversion with high confidence. That lift is real. It’s also not worth a single sprint of engineering time to ship.

The gap between “real” and “worth acting on” is where practical significance lives. Most teams never define it.

Before any test, set a practical significance threshold: the minimum effect size that justifies implementation. This should reflect the engineering cost of shipping the change, the opportunity cost of the test’s runtime, and the downstream revenue impact. If a 0.5 percentage point lift translates to $200K in annual revenue and the change takes one sprint to build, that’s your threshold. Anything below it is a “true but useless” finding.

The fix: calculate your MDE before the test starts, not only for power analysis (though it’s the same number), but as a decision gate. Even if a test reaches significance, if the measured effect falls below the MDE, you don’t ship. Write this number down. Get stakeholder agreement before launch.
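In code, the gate is two conditions rather than one. A minimal sketch, with a hypothetical 0.5 percentage point practical threshold:

```python
def ship_decision(observed_lift: float, p_value: float,
                  practical_threshold: float = 0.005, alpha: float = 0.05) -> str:
    """Ship only if the result is both statistically real and big enough to matter."""
    if p_value >= alpha:
        return "No ship: not statistically significant"
    if observed_lift < practical_threshold:
        return "No ship: significant but below the practical threshold (true but useless)"
    return "Ship"

print(ship_decision(observed_lift=0.0002, p_value=0.01))   # real, but too small to act on
print(ship_decision(observed_lift=0.0062, p_value=0.02))   # real and worth shipping
```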


The Bayesian Fix That Doesn’t Fix Anything

If you’ve read this far, a thought might be forming: “I’ll just switch to Bayesian A/B testing. It handles peeking. It gives me ‘probability of being best’ instead of confusing p-values. Problem solved.”

This is the most popular misconception in modern experimentation.

Bayesian A/B testing does solve one real problem: communication. Telling a VP “there’s a 94% probability that Variant B is best” is clearer than “we reject the null hypothesis at α = 0.05.” Business stakeholders understand the first statement intuitively. The second requires a statistics lecture.

But Bayesian testing doesn’t solve the peeking problem.

In October 2025, Alex Molas published a detailed simulation study showing that Bayesian A/B tests with fixed posterior thresholds suffer from the same false positive inflation when you peek and stop on success. Using a 95% “probability to beat control” as a stopping rule, checked after every 100 observations, produced false positive rates of 80%. Not 5%. Not 26%. Eighty percent.

David Robinson at Variance Explained reached a parallel conclusion: a fixed posterior threshold used as a stopping rule doesn’t control error rates in the way most practitioners assume. The posterior stays interpretable at any sample size. But interpretability is not the same as error control.
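You can reproduce the effect in a few dozen lines. The sketch below is in the spirit of Molas’s study rather than his exact setup: an A/A test with Beta(1, 1) priors, a peek every 100 visitors per arm, and a stop the moment “probability to beat control” crosses 95%.

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=4_000):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, draws)
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, draws)
    return np.mean(post_b > post_a)

def bayesian_peeking_run(baseline=0.05, batch=100, n_looks=20, threshold=0.95):
    """A/A test: stop as soon as the posterior 'probability to beat control' looks good."""
    conv_a = conv_b = n = 0
    for _ in range(n_looks):
        conv_a += rng.binomial(batch, baseline)
        conv_b += rng.binomial(batch, baseline)  # identical true rates
        n += batch
        if prob_b_beats_a(conv_a, n, conv_b, n) >= threshold:
            return True  # declared a "winner" that cannot be real
    return False

runs = 1_000
false_winners = sum(bayesian_peeking_run() for _ in range(runs))
print(f"False 'winners' with Bayesian peeking: {false_winners / runs:.1%}")
# Well above the nominal 5%, and it keeps climbing the more often you look
```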

None of this means Bayesian methods are useless. For low-stakes directional decisions (picking a blog headline, choosing an email subject line) where Type I error control isn’t critical, the intuitive probability framework is genuinely better. For high-stakes product decisions where you need reliable error guarantees, “just go Bayesian” is not a solution. It’s a costume change on the same problem.

Switching from frequentist to Bayesian doesn’t cure peeking. It just changes the number you’re misinterpreting.

The real solution isn’t a switch in methodology. It’s a pre-test protocol that forces statistical discipline regardless of which framework you choose.


The Pre-Test Protocol

This is the section the rest of the article was building toward. Everything above established why you need it. Everything below shows what changes once you have it.

The 5-Point Pre-Test Checklist

Run through these five items before pressing “Start” on any A/B test. Each is pass/fail. If any item fails, fix it before launching.

  1. Sample size calculated. Set your MDE (the smallest effect worth shipping). Calculate the required sample size at 80% power and 5% significance using Evan Miller’s free calculator or your platform’s built-in tool.
  2. Runtime fixed and documented. Divide the required sample size by daily eligible traffic. Round up. Add a buffer for weekday/weekend variation (minimum 7 full days, even if the sample size is reached sooner). Write down the end date.
  3. One primary metric declared. Write it down before the test starts. Secondary metrics are exploratory only. If you want to evaluate multiple metrics formally, apply a Benjamini-Hochberg or Holm-Bonferroni correction.
  4. Practical significance threshold set. Define the minimum effect that justifies implementation. Agree on this with engineering and product stakeholders before launch. If the test reaches statistical significance but falls below this threshold, you don’t ship. 
  5. Analysis method chosen. Pick one: Frequentist, Bayesian, or Sequential. Document why. Use the decision matrix below.
Image by the author.

Worked Example: Checkout Flow Test

A mid-market e-commerce team (500K monthly visitors) wants to test a new single-page checkout against their current multi-step flow. Here’s how they run the checklist:

1. MDE: 0.5 percentage points (from a 3.2% baseline to 3.7%). At 500K monthly visitors with a $65 average order value, a 0.5pp lift generates roughly $195K in incremental annual revenue. The new checkout costs about two weeks of engineering time (~$15K loaded). The ROI clears the bar.

2. Sample size: At 80% power and 5% significance, this requires ~25,000 per variant. 50,000 total.

3. Runtime: 250K monthly visitors reach checkout. That’s ~8,300/day. 50,000 total ÷ 8,300/day ≈ 6 days. Extended to 14 days to capture weekday/weekend effects.

4. Primary metric: Checkout conversion rate. Average order value and cart abandonment tracked as exploratory (no correction needed since they won’t drive the ship/no-ship decision).

5. Method: Sequential testing. High traffic, and stakeholders want weekly progress updates. Two pre-planned analyses: day 7 and day 14. Alpha spending via O’Brien-Fleming bounds.

Result: At day 7, the observed lift is +0.3 percentage points. The sequential boundary isn’t crossed. Proceed. At day 14, the lift is +0.6 percentage points. Boundary crossed. Ship it.
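For the curious, here is roughly what those two pre-planned looks compute under the hood. The boundaries are the classic O’Brien-Fleming z-thresholds for two equally spaced looks at a two-sided α of 0.05; the interim counts are hypothetical numbers consistent with the lifts above, and a production analysis would derive the boundaries with a group-sequential package rather than hard-coding them.

```python
import numpy as np

# Classic O'Brien-Fleming boundaries for two equally spaced looks, alpha = 0.05 two-sided
OBF_BOUNDS = [2.797, 1.977]

def z_stat(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-statistic."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (conv_b / n_b - conv_a / n_a) / se

# Hypothetical interim data echoing the worked example:
# (conversions_A, visitors_A, conversions_B, visitors_B)
looks = [
    (930, 29_000, 1_017, 29_000),    # day 7:  ~3.2% vs ~3.5%
    (1_860, 58_000, 2_204, 58_000),  # day 14: ~3.2% vs ~3.8%
]

for day, bound, (ca, na, cb, nb) in zip((7, 14), OBF_BOUNDS, looks):
    z = z_stat(ca, na, cb, nb)
    verdict = "stop and ship" if abs(z) > bound else "continue"
    print(f"Day {day}: z = {z:.2f} vs boundary {bound:.3f} -> {verdict}")
```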

Without the protocol: The PM checks daily, sees +1.1 percentage points on day 3 with 93% “significance,” and declares a winner. She ships based on a number that’s nearly double the truth. Revenue projections overshoot by 83%. The actual lift settles at +0.6 points over the following quarter. Leadership loses trust in the experimentation program.

The best A/B test is the one where you wrote down “what would change our mind?” before pressing Start.


What Rigorous Testing Actually Buys You

At Microsoft Bing, an engineer picked up a low-priority idea that had been shelved for months: a small change to how ad headlines were displayed in search results. The change seemed too minor to prioritize. Someone ran an A/B test.

The result was a 12% increase in revenue per search, worth over $100 million annually in the U.S. alone. It became the single most valuable change Bing ever shipped.

This story, documented by Ronny Kohavi in the Harvard Business Review, carries two lessons. First, intuition about what matters is wrong more often than not. At Google and Bing, 80% to 90% of experiments show no positive effect. As Kohavi puts it: “Any figure that looks interesting or different is usually wrong.” You need rigorous testing precisely because your instincts aren’t good enough.

Second, rigorous testing compounds. Bing’s experimentation program identified dozens of revenue-improving changes every month, collectively boosting revenue per search by 10% to 25% a year. This accumulation was a major factor in Bing growing its U.S. search share from 8% in 2009 to 23%.

The 15 minutes you spend on a pre-test checklist isn’t overhead. It’s the difference between an experimentation program that compounds real gains and one that ships noise, erodes stakeholder trust, and makes A/B testing look like theater.

That product manager from 3 PM Thursday? She’s going to run another test next week. So are you.

The dashboard will still show a confidence percentage. It will still turn green when it crosses a threshold. The UI is designed to make calling a winner feel satisfying and definitive.

But now you know what the dashboard doesn’t show. The 26.1%. The winner’s curse. The 64% false alarm rate. The Bayesian mirage.

Your next test starts soon. The checklist takes 15 minutes. The decision matrix takes five. That’s 20 minutes between shipping signal and shipping noise.

Which one will it be?


References

  1. Evan Miller, “How Not To Run an A/B Test”
  2. Alex Molas, “Bayesian A/B Testing Is Not Immune to Peeking” (October 2025)
  3. David Robinson, “Is Bayesian A/B Testing Immune to Peeking? Not Exactly”, Variance Explained
  4. Ron Kohavi, Stefan Thomke, “The Surprising Power of Online Experiments”, Harvard Business Review (September 2017)
  5. Optimizely, “False Discovery Rate Control”, Support Documentation
  6. GrowthBook, “Multiple Testing Corrections”, Documentation
  7. Analytics-Toolkit, “Underpowered A/B Tests: Confusions, Myths, and Reality” (2020)
  8. Statsig, “Effect Size: Practical vs Statistical Significance”
  9. Statsig, “Sequential Testing: How to Peek at A/B Test Results Without Ruining Validity”