
Why and How to Adjust P-values in Multiple Hypothesis Testing

Photo by the author. Taken at Westfield UTC Mall, La Jolla, California.

Multiple hypothesis testing occurs when we repeatedly test models on a number of features, because the probability of obtaining one or more false discoveries increases with the number of tests. For example, in the field of genomics, scientists often want to test whether any of thousands of genes show significantly different activity in an outcome of interest. Or whether jellybeans cause acne.

In this blog post, we will cover a few of the popular methods used to account for multiple hypothesis testing by adjusting model p-values:

  1. False Positive Rate (FPR)
  2. Family-Wise Error Rate (FWER)
  3. False Discovery Rate (FDR)

and explain when it makes sense to use them.

This post can be summarized in the following image:

Image by the author.

We will create a simulated example to better understand how various manipulations of p-values can lead to different conclusions. To run this code, we need Python with the pandas, numpy, scipy and statsmodels libraries installed.

For the purpose of this example, we start by creating a Pandas DataFrame of 10,000 features. 9,900 of them (99%) will have their values generated from a Normal distribution with mean = 0, called a Null model. (In the norm.rvs() function used below, the mean is set using the loc argument.) The remaining 1% of the features will be generated from a Normal distribution with mean = 3, called a Non-Null model. We will use these to represent interesting features that we would like to discover.

import pandas as pd
import numpy as np
from scipy.stats import norm
from statsmodels.stats.multitest import multipletests

np.random.seed(42)

n_null = 9900
n_nonnull = 100

df = pd.DataFrame({
    'hypothesis': np.concatenate((
        ['null'] * n_null,
        ['non-null'] * n_nonnull,
    )),
    'feature': range(n_null + n_nonnull),
    'x': np.concatenate((
        norm.rvs(loc=0, scale=1, size=n_null),
        norm.rvs(loc=3, scale=1, size=n_nonnull),
    ))
})

For each of the 10,000 features, the p-value is the probability of observing a value at least as large as the one we got, assuming it was generated from the Null distribution.

P-values can be calculated from the cumulative distribution function (norm.cdf() from scipy.stats), which represents the probability of obtaining a value equal to or lower than the one observed. To get the p-value we then compute 1 - norm.cdf(), the probability of a value greater than the one observed:

df['p_value'] = 1 - norm.cdf(df['x'], loc = 0, scale = 1)
df
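As a small aside, the same quantity can be computed with scipy's survival function norm.sf(), which equals 1 - norm.cdf() and is slightly more accurate far out in the tail. A quick sanity check (our own addition, not needed for the rest of the analysis):

p_value_sf = norm.sf(df['x'], loc=0, scale=1)
assert np.allclose(p_value_sf, df['p_value'])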

The first concept is called the False Positive Rate, defined as the fraction of null hypotheses that we flag as “significant” (also called Type I errors). The p-values we calculated earlier can be interpreted as false positive rates by their very definition: they are the probabilities of obtaining a value at least as large as a specified value when we sample from the Null distribution.

For illustrative purposes, we will apply a common (magical 🧙) p-value threshold of 0.05, but any threshold can be used:

df['is_raw_p_value_significant'] = df['p_value'] <= 0.05
df.groupby(['hypothesis', 'is_raw_p_value_significant']).size()
hypothesis  is_raw_p_value_significant
non-null    False                            8
            True                            92
null        False                         9407
            True                           493
dtype: int64

Notice that out of our 9,900 null hypotheses, 493 are flagged as “significant”. Therefore, the False Positive Rate is: FPR = 493 / (493 + 9407) = 493 / 9900 ≈ 0.05.

The main problem with FPR is that in a real scenario we do not know a priori which hypotheses are null and which are not, so the raw p-value by itself (the False Positive Rate) is of limited use. In our case, when the fraction of non-null features is very small, most of the features flagged as significant will be null, simply because there are many more of them. Specifically, out of the 92 + 493 = 585 features flagged as “positive”, only 92 come from our non-null distribution. That means the majority, about 84% (493 / 585), of reported significant features are false positives!
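As a quick sanity check, we can recover these numbers directly from the columns we just created (a small sketch of our own, assuming the df defined above):

# Cross-tabulate the true hypothesis labels against the raw-p-value calls
counts = pd.crosstab(df['hypothesis'], df['is_raw_p_value_significant'])
fp = counts.loc['null', True]        # 493 null features flagged as significant
tp = counts.loc['non-null', True]    # 92 non-null features flagged as significant

fpr = fp / counts.loc['null'].sum()  # fraction of nulls flagged: ~0.05
false_fraction = fp / (fp + tp)      # fraction of flagged features that are null: ~0.84
print(round(fpr, 3), round(false_fraction, 3))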

So, what can we do about this? There are two common ways of addressing this issue: instead of the False Positive Rate, we can calculate the Family-Wise Error Rate (FWER) or the False Discovery Rate (FDR). Each of these methods takes the set of raw, unadjusted p-values as input and produces a new set of “adjusted p-values” as output. These “adjusted p-values” represent estimates of upper bounds on FWER and FDR. They can be obtained from the multipletests() function, which is part of the statsmodels Python library:

def adjust_pvalues(p_values, method):
    return multipletests(p_values, method=method)[1]

The Family-Wise Error Rate is the probability of falsely rejecting one or more null hypotheses, in other words flagging a true Null as Non-null, or the probability of seeing one or more false positives.

When there is only one hypothesis being tested, this equals the raw p-value (the false positive rate). However, the more hypotheses we test, the more likely we are to get one or more false positives. There are two popular ways to estimate FWER: the Bonferroni and Holm procedures. Although neither the Bonferroni nor the Holm procedure makes any assumptions about the dependence of the tests run on individual features, they can be overly conservative. For example, in the extreme case where all of the features are identical (the same model repeated 10,000 times), no correction is needed, while in the other extreme, where no features are correlated, some form of correction is required.
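To see why FWER grows with the number of tests, consider the simplified case of fully independent tests at a 0.05 threshold: the probability of at least one false positive is 1 - (1 - 0.05)^m. The back-of-the-envelope sketch below (our own illustration, not part of the original analysis) shows how quickly this approaches 1:

# FWER for m independent tests at level alpha (independence assumed here purely for illustration)
alpha = 0.05
for m in [1, 10, 100, 10_000]:
    fwer = 1 - (1 - alpha) ** m
    print(m, round(fwer, 3))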

Bonferroni procedure

One of the most popular methods for correcting for multiple hypothesis testing is the Bonferroni procedure. The reason this method is popular is that it is very easy to calculate, even by hand. This procedure multiplies each p-value by the total number of tests performed, or sets it to 1 if this multiplication would push it past 1.

df['p_value_bonf'] = adjust_pvalues(df['p_value'], 'bonferroni')
df.sort_values('p_value_bonf')
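Because the rule is so simple, we can reproduce the statsmodels result by hand (a small check of our own): multiply each raw p-value by the number of tests and cap the result at 1.

# Manual Bonferroni: p * n, capped at 1, should match the adjusted values above
n_tests = len(df)
manual_bonf = np.minimum(df['p_value'] * n_tests, 1.0)
assert np.allclose(manual_bonf, df['p_value_bonf'])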

Holm procedure

Holm’s procedure provides a correction that is more powerful than Bonferroni’s procedure. The only difference is that the p-values are not all multiplied by the total number of tests (here, 10,000). Instead, each sorted p-value is progressively multiplied by a decreasing sequence: 10000, 9999, 9998, 9997, …, 3, 2, 1.

df['p_value_holm'] = adjust_pvalues(df['p_value'], 'holm')
df.sort_values('p_value_holm').head(10)

We can confirm this ourselves: the last (tenth) p-value in this output is multiplied by 9991: 7.943832e-06 * 9991 = 0.079367. Holm’s correction is also the default method for adjusting p-values in the p.adjust() function in R.
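We can also reproduce the whole Holm adjustment by hand (our own sketch of the standard formulation that statsmodels implements): sort the p-values, multiply the i-th smallest by (n - i + 1), take a running maximum so the adjusted values stay ordered, and cap at 1.

# Manual Holm adjustment
n_tests = len(df)
order = np.argsort(df['p_value'].values)
sorted_p = df['p_value'].values[order]

scaled = sorted_p * (n_tests - np.arange(n_tests))         # multipliers n, n-1, ..., 1
adjusted = np.minimum(np.maximum.accumulate(scaled), 1.0)  # keep monotone, cap at 1

manual_holm = np.empty(n_tests)
manual_holm[order] = adjusted                              # back to the original row order
assert np.allclose(manual_holm, df['p_value_holm'].values)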

If we again apply our p-value threshold of 0.05, let’s look at how these adjusted p-values affect our predictions:

df['is_p_value_holm_significant'] = df['p_value_holm'] <= 0.05
df.groupby(['hypothesis', 'is_p_value_holm_significant']).size()
hypothesis  is_p_value_holm_significant
non-null    False                          92
            True                            8
null        False                        9900
dtype: int64

These results are very different from when we applied the same threshold to the raw p-values! Now only 8 features are flagged as “significant”, and all 8 are correct: they were generated from our Non-null distribution. That is because the probability of having even one null feature flagged incorrectly is at most 0.05 (5%).

However, this approach has a downside: it failed to flag the other 92 Non-null features as significant. While it was very stringent to make sure that none of the null features slipped in, it was able to find only 8% (8 out of 100) of the non-null features. This can be seen as taking the opposite extreme to the False Positive Rate approach.

Is there a middle ground? The answer is “yes”, and that middle ground is the False Discovery Rate.

What if we are OK with letting some false positives in, but capturing more than a single-digit percentage of the true positives? Maybe we are OK with having some false positives, just not so many that they overwhelm all the features we flag as significant, as was the case in the FPR example.

This can be done by controlling the False Discovery Rate (rather than FWER or FPR) at a specified threshold level, say 0.05. The False Discovery Rate is defined as the fraction of false positives among all features flagged as positive: FDR = FP / (FP + TP), where FP is the number of False Positives and TP is the number of True Positives. By setting the FDR threshold to 0.05, we are saying we are OK with having 5% (on average) false positives among all of the features we flag as positive.

There are several methods to control FDR, and here we will describe how to use two popular ones: the Benjamini-Hochberg and Benjamini-Yekutieli procedures. Both of these procedures are similar to, although more involved than, the FWER procedures. They still rely on sorting the p-values, multiplying them by a specific number, and then using a cut-off criterion.

Benjamini-Hochberg procedure

The Benjamini-Hochberg (BH) procedure assumes that each of the tests is independent. Dependent tests occur, for example, if the features being tested are correlated with each other. Let’s calculate the BH-adjusted p-values and compare them to our earlier result from FWER using Holm’s correction:

df['p_value_bh'] = adjust_pvalues(df['p_value'], 'fdr_bh')
df[['hypothesis', 'feature', 'x', 'p_value', 'p_value_holm', 'p_value_bh']] \
    .sort_values('p_value_bh') \
    .head(10)
df['is_p_value_holm_significant'] = df['p_value_holm'] <= 0.05
df.groupby(['hypothesis', 'is_p_value_holm_significant']).size()
hypothesis  is_p_value_holm_significant
non-null    False                          92
            True                            8
null        False                        9900
dtype: int64
df['is_p_value_bh_significant'] = df['p_value_bh'] <= 0.05
df.groupby(['hypothesis', 'is_p_value_bh_significant']).size()
hypothesis  is_p_value_bh_significant
non-null    False                        67
            True                         33
null        False                      9898
            True                          2
dtype: int64

The BH procedure now correctly flagged 33 out of 100 non-null features as significant, an improvement over the 8 flagged with Holm’s correction. However, it also flagged 2 null features as significant. So, out of the 35 features flagged as significant, the fraction of incorrect ones is 2 / (2 + 33) ≈ 0.057, or about 6%.

Note that in this case we have a 6% FDR, even though we aimed to control it at 5%. FDR is controlled at the 5% rate on average: sometimes it may be lower and sometimes it may be higher.
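To convince ourselves that the 5% holds on average, we can rerun the whole simulation many times and average the empirical FDR of the BH procedure. The sketch below is our own addition (the helper one_bh_fdr() is purely illustrative, not part of the original analysis):

def one_bh_fdr(rng, n_null=9900, n_nonnull=100, alpha=0.05):
    # Regenerate the data, apply BH at the given alpha, and return the empirical FDR
    x = np.concatenate((rng.normal(0, 1, n_null), rng.normal(3, 1, n_nonnull)))
    is_null = np.concatenate((np.ones(n_null, bool), np.zeros(n_nonnull, bool)))
    p = norm.sf(x)  # same as 1 - norm.cdf(x)
    significant = multipletests(p, alpha=alpha, method='fdr_bh')[0]
    if significant.sum() == 0:
        return 0.0
    return (significant & is_null).sum() / significant.sum()

rng = np.random.default_rng(0)
fdrs = [one_bh_fdr(rng) for _ in range(200)]
print(round(np.mean(fdrs), 3))  # expect a value around or just below 0.05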

Benjamini-Yekutieli procedure

The Benjamini-Yekutieli (BY) procedure controls FDR regardless of whether the tests are independent or not. Again, it is worth noting that all of these procedures try to establish upper bounds on FDR (or FWER), so they may be more or less conservative. Let’s compare the BY procedure with the BH and Holm procedures above:

df['p_value_by'] = adjust_pvalues(df['p_value'], 'fdr_by')
df[['hypothesis', 'feature', 'x', 'p_value', 'p_value_holm', 'p_value_bh', 'p_value_by']] \
    .sort_values('p_value_by') \
    .head(10)
df['is_p_value_by_significant'] = df['p_value_by'] <= 0.05
df.groupby(['hypothesis', 'is_p_value_by_significant']).size()
hypothesis  is_p_value_by_significant
non-null    False                        93
            True                          7
null        False                      9900
dtype: int64

The BY procedure is stricter in controlling FDR; in this case even more so than Holm’s procedure for controlling FWER, flagging only 7 non-null features as significant! Its main advantage is when we know the data may contain a high number of correlated features. However, in that case we may also want to consider filtering out correlated features so that we do not have to test all of them.

In the end, the choice of procedure is left to the user and depends on what the analysis is trying to do. Quoting Benjamini and Hochberg (Journal of the Royal Statistical Society, 1995):

Often the control of the FWER is not quite needed. The control of the FWER is important when a conclusion from the various individual inferences is likely to be erroneous when at least one of them is.

This may be the case, for example, when several new treatments are competing against a standard, and a single treatment is chosen from the set of treatments that are declared significantly better than the standard.

In other cases, where we may be OK with having some false positives, FDR methods such as the BH correction provide less stringent p-value adjustments and may be preferable if we primarily want to increase the number of true positives that pass a certain p-value threshold.

There are other adjustment methods not mentioned here, notably the q-value, which is also used for FDR control and, at the time of writing, exists only as an R package.
