The Statistical Significance Scam

An in-depth look into the issues of science’s favorite tool

Statistical significance is like the drive-thru of the research world. Roll up to the study, grab your “significance meal,” and boom: you’ve got a tasty conclusion to share with all your friends. And it isn’t just convenient for the reader; it makes researchers’ lives easier too. Why make the hard sell when you can say two words instead?

But there’s a catch.

Those fancy equations and nitty-gritty details we’ve conveniently avoided? They’re the actual meat of the matter. And when researchers and readers rely too heavily on one statistical tool, we can end up making a whopper of a mistake, like the one that almost broke the laws of physics.

In 2011, physicists at the renowned CERN laboratory announced a shocking discovery: neutrinos could travel faster than the speed of light. The finding threatened to overturn Einstein’s theory of relativity, a cornerstone of modern physics. The researchers were confident in their results, which passed physics’ rigorous statistical significance threshold of 99.9999998%. Case closed, right?

Not quite. As other scientists scrutinized the experiment, they found flaws in the methodology and ultimately couldn’t replicate the results. The original finding, despite its impressive “statistical significance,” turned out to be false.

In this article, we’ll delve into four critical reasons why you shouldn’t instinctively trust a statistically significant finding, and why you shouldn’t habitually discard non-statistically significant results.

The 4 key flaws of statistical significance:

  1. It’s made up: The statistical significance/non-significance line is all too often plucked out of thin air, or lazily taken from the generic standard of 95% confidence.
  2. It doesn’t mean what (most) people think it means: Statistical significance doesn’t mean ‘there’s a Y% probability that X is true’.
  3. It’s easy to hack (and often is): Randomness is frequently labelled statistically significant as a result of mass experimentation.
  4. It has nothing to do with how important the result is: Statistical significance is not related to the importance of the difference.

Statistical significance is simply a line in the sand that humans have created with zero mathematical support. Think about that for a second. Something that is generally considered an objective measure is, at its core, entirely subjective.

The mathematical part comes one step before deciding on significance, via a numerical measure of confidence. The most common form used in hypothesis testing is the p-value. This provides the actual mathematical probability of seeing results like these if they were simply down to randomness.

For example, a p-value of 0.05 means there’s a 5% probability of seeing these data points (or more extreme ones) due to random chance alone, or, put another way, we’re 95% confident the result wasn’t due to chance. Suppose you suspect a coin is unfair in favour of heads, i.e. the probability of landing on heads is greater than 50%. You toss the coin 5 times and it lands on heads every time. If the coin were fair, there’s a 1/2 × 1/2 × 1/2 × 1/2 × 1/2 = 3.1% probability that this happened purely by chance.
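If it helps to see that arithmetic as code, here’s a minimal sketch (my own illustration, not part of the original study) that computes the same number directly and cross-checks it with scipy’s binomial test:

```python
# Probability of 5 heads in 5 tosses of a fair coin: the one-sided p-value
# for the "coin favours heads" hypothesis.
from scipy.stats import binomtest

p_manual = 0.5 ** 5  # 1/2 x 1/2 x 1/2 x 1/2 x 1/2

result = binomtest(k=5, n=5, p=0.5, alternative="greater")

print(f"Manual p-value: {p_manual:.4f}")       # 0.0312, i.e. ~3.1%
print(f"scipy p-value:  {result.pvalue:.4f}")  # same value
```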

But is this enough to say it’s statistically significant? It depends who you ask.

Often, whoever is responsible for deciding where the line of significance is drawn in the sand has more influence on whether a result is significant than the underlying data itself.

Given this subjective final step, in my own analysis I’d often provide the reader of the study with the confidence percentage, rather than the binary significant/non-significant result. The final step is simply too opinion-based.

Sceptic: “But there are standards in place for determining statistical significance.”

I hear this argument a lot in response to my point above (I discuss this quite a bit, much to the delight of my academic researcher girlfriend). To which I respond with something like:

Me: “Of course, if there’s a particular standard you must adhere to, such as for regulatory or academic journal publishing reasons, then you have no choice but to follow the standard. But if that isn’t the case, then there’s no reason not to.”

Sceptic: “But there’s a general standard. It’s 95% confidence.”

At that point in the conversation, I try my best not to roll my eyes. Deciding your test’s statistical significance threshold is 95%, just because that’s the norm, is frankly lazy. It doesn’t take into account the context of what’s being tested.

In my day job, if I see someone using the 95% significance threshold for an experiment without a contextual explanation, it raises a red flag. It suggests that the person either doesn’t understand the implications of their choice or doesn’t care about the specific business needs of the experiment.

An example best explains why this is so important.

Suppose you work as a data scientist for a tech company, and the UI team want to know, “Should we use the colour red or blue for our ‘subscribe’ button to maximise our click-through rate (CTR)?”. The UI team favour neither colour, but must choose one by the end of the week. After some A/B testing and statistical analysis, we have our results:

Image created by the author.

The follow-the-standards data scientist may come back to the UI team announcing, “Unfortunately, the experiment found no statistically significant difference between the click-through rates of the red and blue buttons.”

This is a horrendous analysis, purely because of the final subjective step. Had the data scientist taken the initiative to understand the context (critically, that ‘the UI team favour neither colour, but must choose one by the end of the week’), then she should have set the significance threshold at a very high p-value, arguably 1.0, i.e. the statistical analysis doesn’t matter; the UI team are happy to pick whichever colour had the highest CTR.

Given the chance that data scientists and the like may not have the full context to determine the best significance threshold, it’s better (and simpler) to give that responsibility to those who have the full business context: in this instance, the UI team. In other words, the data scientist should have announced to the UI team, “The experiment resulted in the blue button receiving the higher click-through rate, with 94% confidence that this wasn’t attributable to random chance.” The final step of determining significance should be made by the UI team. Of course, this doesn’t mean the data scientist shouldn’t educate the team on what “94% confidence” means, as well as clearly explaining why the significance decision is best left to them.
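To make that concrete, here’s a minimal sketch of how such a confidence figure could be produced, using a two-proportion z-test with made-up click and impression counts (the experiment’s real figures sit in the image above, so treat these numbers purely as placeholders):

```python
# Two-proportion z-test for an A/B test on click-through rate, reporting a
# confidence level rather than a binary significant / not-significant verdict.
from math import sqrt
from scipy.stats import norm

# Hypothetical counts, for illustration only.
clicks_blue, views_blue = 1_000, 10_000  # blue button: CTR = 10.0%
clicks_red, views_red = 930, 10_000      # red button:  CTR = 9.3%

ctr_blue = clicks_blue / views_blue
ctr_red = clicks_red / views_red

# Pooled CTR under the null hypothesis of "no difference between buttons".
pooled = (clicks_blue + clicks_red) / (views_blue + views_red)
std_err = sqrt(pooled * (1 - pooled) * (1 / views_blue + 1 / views_red))

z = (ctr_blue - ctr_red) / std_err
p_value = 2 * norm.sf(abs(z))  # two-sided p-value
confidence = 1 - p_value

print(f"Blue CTR: {ctr_blue:.1%}, Red CTR: {ctr_red:.1%}")
print(f"Confidence the gap isn't just random chance: {confidence:.0%}")
```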

Let’s assume we live in a rather more perfect world, where point one is no longer an issue. The line-in-the-sand figure is always perfect, huzzah! Say we want to run an experiment, with the significance line set at 99% confidence. Some weeks pass, and eventually we have our results, and the statistical analysis finds they’re statistically significant. Huzzah again! But what does that actually mean?

Common belief, in the case of hypothesis testing, is that there’s a 99% probability that the hypothesis is correct. This is painfully wrong. All it means is that there’s a 1% probability of observing data this extreme, or more extreme, due to randomness for this experiment.

Statistical significance doesn’t take into account whether the experiment itself is accurate. Here are some examples of things statistical significance can’t capture:

  • Sampling quality: The population sampled might be biased or unrepresentative.
  • Data quality: Measurement errors, missing data, or other data quality issues aren’t addressed.
  • Assumption validity: The statistical test’s assumptions (like normality, independence) might be violated.
  • Study design quality: Poor experimental controls, not controlling for confounding variables, testing multiple outcomes without adjusting significance levels.

Coming back to the example mentioned in the introduction: after failures to independently replicate the initial finding, physicists from the original 2011 experiment announced they’d found a bug in their measuring device’s master clock (i.e. a data quality issue), which resulted in a full retraction of their initial study.

The next time you hear of a statistically significant discovery that goes against common belief, don’t be so quick to believe it.

Given that statistical significance is all about how likely something is to have occurred due to randomness, an experimenter who’s more concerned with achieving a statistically significant result than uncovering the truth can quite easily game the system.

The odds of rolling two ones from two dice are (1/6 × 1/6) = 1/36, or 2.8%; a result so rare it would be classified as statistically significant by many people. But what if I throw more than two dice? Naturally, the odds of at least two ones will rise:

  • 3 dice: ≈ 7.4%
  • 4 dice: ≈ 13.2%
  • 5 dice: ≈ 19.6%
  • 6 dice: ≈ 26.3%
  • 7 dice: ≈ 33.0%
  • 8 dice: ≈ 39.5%
  • 12 dice: ≈ 61.9%*

*The probability of at least two dice rolling a one is the equivalent of: 1 (i.e. 100%, certain), minus the probability of rolling zero ones, minus the probability of rolling exactly one one, where n is the number of dice:

P(zero ones) = (5/6)^n

P(exactly one one) = n × (1/6) × (5/6)^(n−1)

So the full formula is: 1 − (5/6)^n − n × (1/6) × (5/6)^(n−1)
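As a quick sanity check, here’s a small sketch that plugs the formula above into code and reproduces the percentages listed earlier:

```python
# Probability of rolling at least two ones with n fair dice, using the
# formula from the footnote: 1 - P(zero ones) - P(exactly one one).
def p_at_least_two_ones(n: int) -> float:
    p_zero_ones = (5 / 6) ** n                    # no dice show a one
    p_one_one = n * (1 / 6) * (5 / 6) ** (n - 1)  # exactly one die shows a one
    return 1 - p_zero_ones - p_one_one

for n in (2, 3, 4, 5, 6, 7, 8, 12):
    print(f"{n:>2} dice: {p_at_least_two_ones(n):.1%}")
```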

Let’s say I run a simple experiment, with an initial theory that a one is more likely to be rolled than other numbers. I roll 12 dice of various colours and sizes. Here are my results:

Image created by the author.

Unfortunately, my (calculated) hopes of getting at least two ones have been dashed… Actually, now that I think of it, I didn’t really want two ones. I was more interested in the odds of the big red dice. I believe there’s a high probability of getting sixes from them. Ah! It looks like my theory is correct: the two big red dice have rolled sixes! There is only a 2.8% probability of this happening by chance. Very interesting. I shall now write a paper on my findings and aim to publish it in an academic journal that accepts my result as statistically significant.

This story may sound far-fetched, but reality isn’t as far removed from it as you’d expect, especially in the highly regarded field of academic research. In fact, this sort of thing happens often enough to have earned itself a name: p-hacking.

If you’re surprised, delving into the academic system will make clear why practices that seem abominable to the scientific method occur so often within the realm of science.

Academia is an exceptionally difficult field in which to have a successful career. For example, in STEM subjects only 0.45% of PhD students become professors. Of course, some PhD students don’t want an academic career, but the majority do (67% according to this survey). So, roughly speaking, you have a 1% chance of making it as a professor if you have completed a PhD and intend to make academia your career. Given these odds you need to consider yourself quite exceptional, or rather, you need other people to think that, since you can’t hire yourself. So, how is exceptional measured?

Perhaps unsurprisingly, the most important measure of an academic’s success is their research impact. Common measures of author impact include the h-index, g-index and i10-index. What they all have in common is that they’re heavily focused on citations, i.e. how many times their published work has been mentioned in other published work. Knowing this, if we want to do well in academia, we need to focus on publishing research that’s likely to get citations.

You’re far more likely to be cited if you publish your work in a highly rated academic journal. And, since 88% of top journal papers are statistically significant, you’re much more likely to be accepted into these journals if your research is statistically significant. This pushes a lot of well-meaning, but career-driven, academics down a slippery slope. They start out with a scientific methodology for producing research papers, like so:
