Bonferroni vs. Benjamini-Hochberg: Selecting Your P-Value Correction

Choosing a significance level can be a sensitive topic. Perhaps best avoided on a first encounter with a statistician. The general disposition toward the subject has led to a tacit agreement that α = 0.05 is the gold standard. In fact, it is a 'convenient convention', a rule of thumb set by Ronald Fisher himself.

Who?? Don’t know him? Don’t worry.

He was the first to introduce Maximum Likelihood Estimation (MLE), ANOVA, and Fisher Information (the last one you might have guessed). Fisher was more than a relevant figure in the field; he was the father of statistics. He had a deep interest in Mendelian genetics and evolutionary biology, to which he made several key contributions. Unfortunately, Fisher also had a thorny past. He was involved with the Eugenics Society and its policy of voluntary sterilization for the “feeble-minded.”

Yes, there is no such thing as an uncontroversial famous statistician.

But a rule of thumb set by the father of statistics can sometimes be mistaken for a law, and a law it should not be.

A portrait of the young Ronald Aylmer Fisher (1890–1962). Source: Wikimedia Commons (file page). Public Domain.

There is one key instance where you will find yourself not just permitted but compelled to change this alpha level, and it all comes down to multiple hypothesis testing.

Running multiple tests without the Bonferroni correction or the Benjamini-Hochberg procedure is more than problematic. Without these corrections, we could prove any hypothesis:

H₁: The sun is blue

By simply re-running our experiment until luck strikes. But how do these corrections work? And which one should you use? They are not interchangeable!
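To see how quickly luck strikes, here is a minimal sketch (my own illustration, with made-up numbers of tests) of how the chance of at least one false positive grows when every test uses α = 0.05:

import numpy as np

# Hypothetical illustration: m independent tests, all nulls true, alpha = 0.05.
rng = np.random.default_rng(42)
alpha = 0.05

for m in (1, 10, 100):
    # Under H0, p-values are uniform on [0, 1].
    p_values = rng.uniform(size=(100_000, m))
    at_least_one = np.mean((p_values <= alpha).any(axis=1))
    print(f"m = {m:>3} tests -> P(at least one false positive) ~ {at_least_one:.3f}")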

P-values and a problem

To understand why, we need to look at what exactly our p-value is telling us. To understand it more deeply than "small is good, large is bad." But to do that, we'll need an experiment, and nothing is as exciting, or as contested, as discovering superheavy elements.

These elements are unstable and created in particle accelerators, one atom at a time. Pound for pound, they are the most expensive material ever produced. In nature they exist only in cosmic events like supernovae, lasting just fractions of a second.

But their instability becomes a bonus for detection, as a new superheavy element exhibits a distinctive radioactive decay. The decay sequence captured by sensors inside the detector can tell us whether a new element is present.

Scientist working inside the ORMAK fusion device at ORNL. Credit: Oak Ridge National Laboratory, licensed under CC BY 2.0

As our null hypothesis, we state:

H₀: The sequence is background noise decay. (No new element)

Now we need to collect evidence that H₀ is not true if we want to prove we have created a new element. This is done through our test statistic T(X). In general terms, it captures the difference between what the sensors observe and what is expected from background radiation. Test statistics are a measure of 'surprise' between what we expect to observe if H₀ is true and what our sample data actually says. The larger T(X), the more evidence we have that H₀ is false.

This is precisely what the Schmidt test statistic does with the sequence of radioactive decay times.

\[
\sigma_{\mathrm{obs}} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(\ln t_i - \overline{\ln t}\right)^2}
\]
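In code, the statistic is just the sample standard deviation of the log decay times. A minimal sketch, with hypothetical decay times chosen purely for illustration:

import numpy as np

def schmidt_statistic(decay_times):
    # sigma_obs: sample standard deviation of the log decay times
    # (the 1/(n-1) factor matches the formula above)
    log_t = np.log(np.asarray(decay_times, dtype=float))
    return float(np.sqrt(np.sum((log_t - log_t.mean()) ** 2) / (len(log_t) - 1)))

# Hypothetical decay times in seconds
print(schmidt_statistic([0.8, 1.9, 0.04, 2.6, 0.31]))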

The Schmidt test statistic was used in the discovery of Hassium (108), Meitnerium (109), Darmstadtium (110), Roentgenium (111), Copernicium (112), Moscovium (115), and Tennessine (117).

It is important to specify a distribution for H₀ so that we can calculate the probability of a test statistic at least as extreme as the one computed from the observed data.

We assume noise decays follow an exponential distribution. There are a million reasons why this is a reasonable assumption, but let's not get bogged down here. If we don't have a distribution for H₀, computing our p-value would be impossible!

\[
H_0^{(\mathrm{Schmidt})}: \; t_1, \dots, t_n \;\; \text{i.i.d.} \sim \mathrm{Exp}(\lambda)
\]

The p-value is then the probability, under the null model, of obtaining a test statistic at least as extreme as that computed from the sample data. The less likely our test statistic is, the more likely it is that H₀ is false.

\[
p = \Pr_{H_0}\!\big( T(X) \ge T(x_{\mathrm{obs}}) \big).
\]
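Since we assumed an exponential null, this probability can be approximated by simulation. A minimal sketch, with a made-up rate λ, sequence length, and observed statistic (none of these numbers come from a real experiment):

import numpy as np

rng = np.random.default_rng(1)

def schmidt_stat(times):
    # Sample standard deviation of the log decay times (the same sigma_obs as above)
    return float(np.std(np.log(times), ddof=1))

lam = 0.5          # assumed background decay rate (events per second)
n = 5              # length of the observed decay sequence
t_obs_stat = 2.1   # hypothetical observed value of the statistic

# Simulate H0: t_1, ..., t_n i.i.d. ~ Exp(lambda), many times over
null_stats = np.array([
    schmidt_stat(rng.exponential(scale=1.0 / lam, size=n))
    for _ in range(50_000)
])

# p = Pr_H0( T(X) >= T(x_obs) )
p_value = float(np.mean(null_stats >= t_obs_stat))
print(f"Monte Carlo p-value ~ {p_value:.4f}")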

Edvard Munch, At the Roulette Table in Monte Carlo (1892). Source: Wikimedia Commons (file page). Public Domain (CC PDM 1.0)

In fact, this brings up an interesting issue. What if we observe a rare background decay, one that simply resembles that of an undiscovered decaying particle? What if our sensors detect an unlikely, though possible, decay sequence that yields a large test statistic? Every time we run the test there is a small chance of getting an outlier purely by chance. This outlier will give a large test statistic, since it is quite different from what we expect to see when H₀ is true. The large T(x) will fall in the tails of the distribution expected under H₀ and will produce a small p-value: a small probability of observing anything more extreme than this outlier. But no new element exists! We just got 31 red by playing roulette a million times.

It seems unlikely, but when you remember that particles are beamed at targets for months at a time, the chance is real. So how do we account for it?

There are two ways: a conservative and a less conservative method. Your choice depends on the experiment. We can use the:

  • Family-Wise Error Rate (FWER) and the Bonferroni correction
  • False Discovery Rate (FDR) and the Benjamini-Hochberg procedure

These are not interchangeable! You need to carefully consider your study and pick the correct one.

If you're interested in the physics of it:

New elements are created by accelerating lighter ions to about 10% of the speed of light. These ion beams bombard heavier target atoms. The incredible speeds and kinetic energy are required to overcome the Coulomb barrier (the immense repulsive force between two positively charged nuclei).

New Element         Beam (Protons)      Target (Protons)
Nihonium (113)      Zinc-70 (30)        Bismuth-209 (83)
Moscovium (115)     Calcium-48 (20)     Americium-243 (95)
Tennessine (117)    Calcium-48 (20)     Berkelium-249 (97)
Oganesson (118)     Calcium-48 (20)     Californium-249 (98)
Computer simulation showing the collision and fusion of two atomic nuclei to form a superheavy element. (Credit: Lawrence Berkeley National Laboratory) Public Domain

Family-Wise Error Rate

This is our conservative approach, and what should be used if we cannot admit any false positives. It keeps the probability of making at least one Type I error across the whole family of tests below our alpha level.

\[
\Pr(\text{at least one Type I error in the family}) \le \alpha
\]

This is also the simpler correction. Simply divide the alpha level by the number of tests m that were run. So for each test i you reject the null hypothesis if and only if:

\[
p_i \le \frac{\alpha}{m}
\]

Equivalently, you can adjust your p-values. If you run m tests, take:

\[
p_i^{\text{adj}} = \min(1, \, m\, p_i)
\]

And reject the null hypothesis if:

\[
p_i^{(\text{Bonf})} \le \alpha
\]

All we did here was multiply both sides of the inequality by m.
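A minimal sketch of both forms of the rule, using made-up p-values (any real analysis would plug in its own):

import numpy as np

# Made-up p-values from m = 5 tests; family-wide alpha = 0.05
p = np.array([0.001, 0.012, 0.04, 0.20, 0.55])
alpha = 0.05
m = len(p)

# Form 1: compare each raw p-value against alpha / m
reject_raw = p <= alpha / m

# Form 2: adjust the p-values with min(1, m * p_i) and compare against alpha
p_adj = np.minimum(1.0, m * p)
reject_adj = p_adj <= alpha

print(reject_raw)    # [ True False False False False]
print(reject_adj)    # identical decisions
print(p_adj)         # Bonferroni-adjusted p-values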

The proof for this is also a slim one-liner. Let Aᵢ be the event that there is a false positive in test i. Then the probability of getting at least one false positive is the probability of the union of all these events.

\[
\Pr(\text{at least one false positive}) = \Pr\!\left(\bigcup_{i=1}^{m} A_i\right) \le \sum_{i=1}^{m} \Pr(A_i) \le m \cdot \frac{\alpha}{m} = \alpha
\]

Here we make use of the union bound, a fundamental concept in probability which states that the probability of A₁, or A₂, …, or Aₖ happening must be less than or equal to the sum of the probabilities of each event happening.

\[
\Pr(A_1 \cup A_2 \cup \cdots \cup A_k) \le \sum_{i=1}^{k} \Pr(A_i)
\]

False Discovery Rate

The Benjamini-Hochberg procedure also isn’t too complicated. Simply:

  • Sort your p-values: p₍₁₎ ≤ … ≤ p₍ₘ₎.
  • Find the largest k where p₍ₖ₎ ≤ (k/m)·q.
  • Reject the null hypotheses behind the k smallest p-values.

With this approach, the goal is to control the false discovery rate (FDR).

\[
\text{FDR} = E\!\left[ \frac{V}{\max(R, 1)} \right]
\]

where R is the number of times we reject the null hypothesis, and V is the number of those rejections that are (unfortunately) false positives (Type I errors). The goal is to keep this metric below a particular threshold q.

The BH thresholds are:

\[
\frac{1}{m}q, \; \frac{2}{m}q, \; \dots, \; \frac{m}{m}q = q
\]

And we reject the k smallest p-values, where k is the largest index satisfying:

\[
p_{(k)} \le \frac{k}{m}q
\]
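Here is a minimal sketch of the procedure written by hand, with made-up p-values purely for illustration (libraries such as statsmodels offer an equivalent 'fdr_bh' correction):

import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask marking which hypotheses the BH procedure rejects."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                        # indices that sort the p-values
    thresholds = q * np.arange(1, m + 1) / m     # (k/m) * q for k = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]))    # largest k with p_(k) <= (k/m) q
        reject[order[: k + 1]] = True            # reject the k smallest p-values
    return reject

# Made-up p-values purely for illustration
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.36]
print(benjamini_hochberg(p, q=0.05))   # rejects the two smallest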

Use this when you are okay with some false positives. When your primary concern is minimizing the Type II error rate, that is, you want to make sure there are fewer false negatives: fewer instances where we accept H₀ when H₀ is in fact false.

Think of a genomics study where you aim to identify everyone who has a particular gene that makes them more prone to a specific cancer. It would be less harmful to treat some individuals who didn't have the gene than to risk letting someone who did have it walk away with no treatment.

Quick side-by-side

Bonferroni:

  • Controls the family-wise error rate (FWER)
  • Guarantees the probability of even a single false positive is ≤ α
  • Higher rate of false negatives ⇒ lower statistical power
  • Zero risk tolerance

Benjamini-Hochberg:

  • Controls the False Discovery Rate (FDR)
  • Guarantees that among all discoveries, the expected proportion of false positives is ≤ q
  • Fewer false negatives ⇒ higher statistical power
  • Some risk tolerance

A super-tiny p for a super-heavy atom

We can't have any nonexistent elements in the periodic table, so when it comes to finding a new element, the Bonferroni correction is the correct approach. But when it comes to decay chain data collected by position-sensitive silicon detectors, picking an α isn't so simple.

Physicists tend to use the expected number of random chains, n_b, produced by the full search over the entire dataset:

\[
\Pr(\ge 1 \text{ random chain}) \approx 1 - e^{-n_b}
\]

\[
1 - e^{-n_b} \le \alpha_{\text{family}} \;\Rightarrow\; n_b \approx \alpha_{\text{family}} \quad (\text{approximately, for rare events})
\]
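Spelling out the "rare events" step: solving for n_b and Taylor-expanding the logarithm shows why the bound on n_b is essentially α_family itself when α_family is small.

\[
n_b \le -\ln\!\left(1 - \alpha_{\text{family}}\right) = \alpha_{\text{family}} + \frac{\alpha_{\text{family}}^2}{2} + \cdots \approx \alpha_{\text{family}}
\]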

The number of random chains comes from observing the background data when no experiment is taking place. From this data we can construct the null distribution for H₀ through Monte Carlo simulation.

We estimate the number of random chains by modelling the background event rates and resampling the observed background events. Under H₀ (no heavy-element decay chain), we use Monte Carlo to simulate many null realizations and compute how often the search algorithm produces a sequence as extreme as the observed chain.

More precisely:

H₀: background events arrive as a Poisson process with rate λ ⇒ inter-arrival times are Exponential.

Then an accidental chain is k consecutive hits within a time window τ. We scan the data using our test statistic to determine whether an extreme cluster exists.

import numpy as np

lambda_rate = 0.2   # background event rate (events per second)
T_total = 2_000.0   # seconds of data-taking (mean number of events ~ 400)
k = 4               # chain length we search for
tau_obs = 0.20      # "observed extreme": 4 events within 0.20 sec

Nmc = 20_000        # number of Monte Carlo null realizations
rng = np.random.default_rng(0)

def dmin_and_count(times, k, tau):
    """Tightest span of k consecutive events, and how many such spans fit within tau."""
    if times.size < k:
        return np.inf, 0
    spans = times[k-1:] - times[:-(k-1)]   # duration covered by each run of k consecutive events
    return float(np.min(spans)), int(np.sum(spans <= tau))

...

Monte-Carlo Simulation on GitHub

Illustration by Author
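The loop elided above (the full script is at the GitHub link) boils down to repeatedly simulating a background-only run and checking whether the search would flag a chain as tight as the observed one. A minimal sketch under the same assumptions as the snippet above, not the exact script behind the link:

# Sketch of the elided Monte Carlo loop, reusing the setup from the snippet above.
false_alarms = 0
for _ in range(Nmc):
    # Under H0 the event count in T_total seconds is Poisson(lambda_rate * T_total),
    # and the sorted arrival times of a Poisson process are uniform on [0, T_total].
    n_events = rng.poisson(lambda_rate * T_total)
    times = np.sort(rng.uniform(0.0, T_total, size=n_events))

    # Did any window of k consecutive events squeeze inside tau_obs?
    dmin, _ = dmin_and_count(times, k, tau_obs)
    if dmin <= tau_obs:
        false_alarms += 1

print(f"Pr(background alone yields a chain this tight) ~ {false_alarms / Nmc:.4f}")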

If you're interested in the numbers: in the discovery of element 117, Tennessine (Ts), a p-value of 5×10⁻¹⁶ was used. I imagine that if no corrections were ever used, our periodic table would, unfortunately, not be poster-sized, and chemistry would be in shambles.

Conclusion

This whole concept of looking for something in a lot of places, then treating an especially significant blip as if it came from a single observation, is known as the Look-Elsewhere Effect. And there are two primary ways we can adjust for it:

  • The Bonferroni correction
  • The Benjamini-Hochberg procedure

Our choice depends entirely on how conservative we want to be.

But even with a p-value of 5×10⁻¹⁶ on record, you may be wondering when a p-value of 10⁻⁹⁹ should still be discarded. That all comes down to Victor Ninov, a physicist at Lawrence Berkeley National Laboratory who was, for a brief moment, the person who discovered element 118.

However, an internal investigation found that he had fabricated the alpha-decay chain. In a case of research misconduct and falsified data, even a p-value of 10⁻⁹⁹ does not justify rejecting the null hypothesis.

Yuri Oganessian, the leader of the team at the Joint Institute for Nuclear Research in Dubna and Lawrence Livermore National Laboratory that discovered Element 118. Wikimedia Commons, CC BY 4.0.

References

Bodmer, W., Bailey, R. A., Charlesworth, B., Eyre-Walker, A., Farewell, V., Mead, A., & Senn, S. (2021). The outstanding scientist, R.A. Fisher: his views on eugenics and race. Heredity, 126(4), 565–576.

Khuyagbaatar, J., Yakushev, A., Düllmann, C. E., Ackermann, D., Andersson, L. L., Asai, M., … & Yakusheva, V. (2014). ⁴⁸Ca + ²⁴⁹Bk fusion reaction leading to element Z = 117: Long-lived α-decaying ²⁷⁰Db and discovery of ²⁶⁶Lr. Physical Review Letters, 112(17), 172501.

How Many False Positives? Multiple Comparisons: Bonferroni Corrections and False Discovery Rates.
