Alternatives to the p-value Criterion for Statistical Significance (with R code)



In establishing statistical significance, the p-value criterion is almost universally used. The criterion is to reject the null hypothesis (H0) in favour of the alternative (H1) when the p-value is less than the level of significance (α). The conventional values for this decision threshold include 0.05, 0.10, and 0.01.

By definition, the p-value measures how compatible the sample data is with H0: i.e., P(D|H0), the probability or likelihood of the data (D) under H0. However, as made clear in the statement of the American Statistical Association (Wasserstein and Lazar, 2016), the p-value criterion as a decision rule has a number of serious deficiencies. The main deficiencies include

  1. the p-value is a decreasing function of sample size;
  2. the criterion completely ignores P(D|H1), the compatibility of the data with H1; and
  3. the conventional values of α (such as 0.05) are arbitrary with little scientific justification.

One consequence is that the p-value criterion frequently rejects H0 when it is violated by a practically negligible margin. This is especially so when the sample size is large or massive. The problem occurs because, while the p-value is a decreasing function of sample size, its threshold (α) is fixed and does not decrease with sample size. On this point, Wasserstein and Lazar (2016) strongly recommend that the p-value be supplemented or even replaced with other alternatives.
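To make the sample-size effect concrete, here is a minimal sketch in R. It assumes a hypothetical, practically negligible standardized effect of 0.02 and uses the normal approximation for the t-statistic; the names delta, n, t_stat, and p_value are illustrative and not from the application later in this post.

```r
# Hypothetical illustration: a fixed, tiny standardized effect (delta = 0.02)
delta <- 0.02
n <- c(100, 10000, 1000000)

# With a fixed effect, the t-statistic grows in proportion to sqrt(n)
t_stat <- delta * sqrt(n)

# Two-sided p-value under the normal approximation
p_value <- 2 * pnorm(-abs(t_stat))

print(data.frame(n = n, t = t_stat, p = signif(p_value, 3)))
```

The effect size never changes, yet the p-value falls below a fixed α = 0.05 once n is large enough.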

In this post, I introduce a range of simple, but more sensible, alternatives to the p-value criterion which can overcome the above-mentioned deficiencies. They can be classified into three categories:

  1. Balancing P(D|H0) and P(D|H1) (Bayesian method);
  2. Adjusting the level of significance (α); and
  3. Adjusting the p-value.

These alternatives are simple to compute, and can provide more sensible inferential outcomes than those based solely on the p-value criterion, as demonstrated below using an application with R code.

Consider a linear regression model

Y = β0 + β1 X1 + … + βk Xk + u,

where Y is the dependent variable, X's are independent variables, and u is a random error term following a normal distribution with zero mean and fixed variance. We consider testing for

H0: β1 = … = βq = 0,

against H1 that H0 doesn’t hold (q ≤ k). A straightforward example is H0: β1 = 0; H1: β1 ≠ 0, where q =1.

Borrowing from Bayesian statistical inference, we define the following probabilities:

P(H0|D): posterior probability for H0, which is the probability or likelihood of H0 after the researcher observes the data D;

P(H1|D) ≡ 1 - P(H0|D): posterior probability for H1;

P(D|H0): (marginal) likelihood of the data under H0;

P(D|H1): (marginal) likelihood of the data under H1;

P(H0): prior probability for H0, representing the researcher's belief about H0 before she observes the data;

P(H1) = 1 - P(H0): prior probability for H1.

These probabilities are related (by Bayes' rule) as

P10 ≡ P(H1|D)/P(H0|D) = [P(D|H1)/P(D|H0)] × [P(H1)/P(H0)].

The main components are as follows:

P10: the posterior odds ratio for H1 over H0, the ratio of the posterior probability of H1 to that of H0;

B10 ≡ P(D|H1)/P(D|H0): the Bayes factor, the ratio of the (marginal) likelihood under H1 to that under H0;

P(H1)/P(H0): prior odds ratio.

Note that the posterior odds ratio is the Bayes factor multiplied by the prior odds ratio, and that P10 = B10 if P(H0) = P(H1) = 0.5.

The decision rule is that, if P10 > 1, the evidence favours H1 over H0. That is, after the researcher observes the data, she favours H1 if P(H1|D) > P(H0|D), i.e., if the posterior probability of H1 is higher than that of H0.
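The relationship between the Bayes factor, the prior odds, and the posterior odds can be sketched in a few lines of R. The function name posterior_odds and its arguments are illustrative assumptions, not part of any package.

```r
# Posterior odds P10 = Bayes factor * prior odds (names here are illustrative)
posterior_odds <- function(B10, prior_H1 = 0.5) {
  prior_odds <- prior_H1 / (1 - prior_H1)   # P(H1)/P(H0)
  B10 * prior_odds
}

P10 <- posterior_odds(B10 = 3, prior_H1 = 0.5)  # equals B10 under equal priors
post_H1 <- P10 / (1 + P10)                      # P(H1|D) implied by the odds
favours_H1 <- P10 > 1                           # same as P(H1|D) > P(H0|D)

print(c(P10 = P10, post_H1 = post_H1))
print(favours_H1)
```

Changing prior_H1 shows how a sceptical prior (for example, prior_H1 = 0.2) can reverse a decision that the Bayes factor alone would support.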

For B10, the decision rule proposed by Kass and Raftery (1995) is given below:

  2log(B10)    B10          Evidence against H0
  0 to 2       1 to 3       Not worth more than a bare mention
  2 to 6       3 to 20      Positive
  6 to 10      20 to 150    Strong
  > 10         > 150        Very strong

For example, if B10 = 3, then P(D|H1) = 3 × P(D|H0), which means that the data is three times more compatible with H1 than it is with H0. Note that the Bayes factor is often expressed as 2log(B10), where log() is the natural logarithm, on the same scale as the likelihood ratio test statistic.

Bayes factor

Wagenmakers (2007) provides a simple approximation formula for the Bayes factor given by

2log(B10) = BIC(H0) - BIC(H1),

where BIC(Hi) denotes the value of the Bayesian information criterion under Hi (i = 0, 1).
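As a minimal sketch of this approximation, the example below uses R's built-in mtcars data rather than the data analysed in this post; the choice of variables (mpg, wt, qsec) is purely illustrative.

```r
# Illustration on R's built-in mtcars data (not the data used in this post)
Reg1 <- lm(mpg ~ wt + qsec, data = mtcars)  # unrestricted model (H1)
Reg0 <- lm(mpg ~ wt, data = mtcars)         # restricted model (H0: qsec coefficient = 0)

# Approximate 2log(B10) as the difference in BIC values
two_log_B10 <- BIC(Reg0) - BIC(Reg1)

# Implied Bayes factor on the original scale
B10 <- exp(0.5 * two_log_B10)

print(c(two_log_B10 = two_log_B10, B10 = B10))
```

Because BIC penalizes each extra parameter by log(n), this quantity equals the likelihood ratio statistic minus log(n) per restriction, which is why it becomes more demanding as the sample grows.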

Posterior probabilities

Zellner and Siow (1979) provide a formula for P10 given by

[Equation: Zellner and Siow (1979) formula for P10; image by the author]

where F is the F-test statistic for H0, Γ() is the gamma function, v1 = n - k0 - k1 - 1, n is the sample size, k0 is the number of parameters restricted under H0, and k1 is the number of parameters unrestricted under H0 (k = k0 + k1).

Startz (2014) provides a formula for P(H0|D), the posterior probability for H0, for testing H0: βi = 0:

[Equation: Startz (2014) formula for P(H0|D); image by the author]

where t is the t-statistic for H0: βi = 0, ϕ() is the standard normal density function, and s is the standard error of the estimate of βi.

Adjustment to the p-value

Good (1988) proposes the following adjustment to the p-value:

p1 = min(0.5, p × √(n/100)),

where p is the p-value for H0: βi = 0. The rule is obtained by considering the convergence rate of the Bayes factor against a sharp null hypothesis. The adjusted p-value (p1) increases with sample size n.
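A minimal sketch of Good's adjustment in R, following the expression used in the R listing later in this post, min(0.5, p × sqrt(n/100)). The function name good_adjust and the grid of sample sizes are illustrative.

```r
# Good (1988) adjustment: p1 = min(0.5, p * sqrt(n/100))
good_adjust <- function(p, n) pmin(0.5, p * sqrt(n / 100))

# The same p-value evaluated at different sample sizes
n_grid <- c(100, 1000, 7886, 100000)
print(data.frame(n = n_grid, p1 = good_adjust(0.0207, n_grid)))
```

At n = 100 the p-value is unchanged, and for larger samples it is scaled up until it reaches the cap of 0.5.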

Harvey (2017) proposes what is called the Bayesianized p-value

p2 = MBF × PR / (1 + MBF × PR),

where PR ≡ P(H0)/P(H1) and MBF = exp(-0.5t²) is the minimum Bayes factor, while t is the t-statistic.
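The Bayesianized p-value can be sketched as follows, assuming it takes the form MBF × PR / (1 + MBF × PR) with MBF and PR as defined above. The function name bayes_p is an illustrative assumption.

```r
# Bayesianized p-value, assuming the form MBF*PR / (1 + MBF*PR)
bayes_p <- function(t, PR = 1) {
  MBF <- exp(-0.5 * t^2)      # minimum Bayes factor
  MBF * PR / (1 + MBF * PR)
}

print(bayes_p(t = 2.314, PR = 1))    # equal prior probabilities
print(bayes_p(t = 2.314, PR = 4))    # prior odds of 4 to 1 in favour of H0
```

The second call shows the role of the prior: the more strongly the researcher believes in H0 beforehand, the larger the Bayesianized p-value for the same t-statistic.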

Significance level adjustment

Perez and Pericchi (2014) propose an adaptive rule for the level of significance derived by reconciling the Bayesian inferential method and the likelihood ratio principle, which is written as follows:

[Equation: Perez and Pericchi (2014) adaptive level of significance; image by the author]

where q is the number of parameters under H0, α is the initial level of significance (such as 0.05), and χ²(α,q) is the α-level critical value from the chi-square distribution with q degrees of freedom. In short, the rule adjusts the level of significance as a decreasing function of sample size n.

In this section, we apply the above alternative measures to a regression with a large sample size, and examine how the inferential results differ from those obtained solely from the p-value criterion. The R code for the calculation of these measures is also provided.

Kamstra et al. (2003) examine the effect of depression linked with seasonal affective disorder on stock return. They claim that the length of sunlight can systematically affect the variation in stock return. They estimate a regression model of the following form:

[Equation: regression model of Kamstra et al. (2003); image by the author]

where R is the stock return in percentage on day t; M is a dummy variable for Monday; T is a dummy variable for the last trading day or the first five trading days of the tax year; A is a dummy variable for autumn days; C is cloud cover; P is precipitation; G is temperature; and S measures the length of sunlight.

They argue that, with longer sunlight, investors are in a better mood and tend to buy more stocks, which will increase the stock price and return. Based on this, their null and alternative hypotheses are

H0: γ3 = 0; H1: γ3 ≠ 0.

Their regression results are replicated using U.S. stock market data, daily from January 1965 to April 1996 (7886 observations). The data range is limited by the cloud cover data, which is available only from 1965 to 1996. The full results with further details are available from Kim (2022).

[Table: summary of the regression results under H0 and H1; image by the author]

The above table presents a summary of the regression results under H0 and H1. The null hypothesis H0: γ3 = 0 is rejected at the 5% level of significance, with a coefficient estimate of 0.033, t-statistic of 2.31, and p-value of 0.021. Hence, based on the p-value criterion, the length of sunlight affects the stock return with statistical significance: the stock return is expected to increase by 0.033% in response to a 1-unit increase in the length of sunlight.
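As a quick arithmetic check, the two-sided p-value implied by a t-statistic can be computed directly. The sketch below uses t = 2.314 (the value in the R listing later in this post) and assumes the degrees of freedom equal the sample size minus ten estimated coefficients (an intercept plus nine regressors); the variable names are illustrative.

```r
n <- 7886; k <- 10            # k: intercept plus nine regressors in the H1 model (assumed)
t_stat <- 2.314               # value used in the R listing in this post

# Two-sided p-value from the t distribution with n - k degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - k)
print(round(p_value, 4))
```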

While this is evidence against the implications of stock market efficiency, it is questionable whether the effect is large enough to be practically important.

The values of the alternative measures and the corresponding decisions are given below:

[Table: values of the alternative measures and the corresponding decisions; image by the author]

Note that P10 and p2 are calculated under the assumption that P(H0) = P(H1), which means that the researcher is impartial between H0 and H1 a priori. It is clear from the results in the above table that nearly all of the alternatives to the p-value criterion strongly favour H0 over H1 or cannot reject H0 at the 5% level of significance. The exception is Harvey's (2017) Bayesianized p-value, which indicates rejection of H0 at the 10% level of significance.

Hence, we may conclude that the results of Kamstra et al. (2003), based solely on the p-value criterion, are not so convincing under the alternative decision rules. Given the questionable effect size and nearly negligible goodness-of-fit of the model (R² = 0.056), the decisions based on these alternatives seem more sensible.

The R code below shows the calculation of these alternatives (the full code and data are available from the author on request):

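# Regression under H1
Reg1 = lm(ret.g ~ ret.g1+ret.g2+SAD+Mon+Tax+FALL+cloud+prep+temp, data=dat)
# Regression under H0 (SAD excluded)
Reg0 = lm(ret.g ~ ret.g1+ret.g2+Mon+FALL+Tax+cloud+prep+temp, data=dat)

# 2log(B10): Wagenmakers (2007)
print(BIC(Reg0) - BIC(Reg1))

# PH0: Startz (2014)
T = length(dat$ret.g); se = 0.014; t = 2.314
# NOTE: the scaling constant is not defined in the original listing;
# the definition of c0 below is an assumption
c0 = sqrt(2*pi*T*se^2)
Ph0 = dnorm(t)/(dnorm(t) + se/c0)
print(Ph0)

# p-value adjustment: Good (1988)
p = 0.0207
P_adjusted = min(c(0.5, p*sqrt(T/100)))
print(P_adjusted)

# Bayesianized p-value: Harvey (2017), with PR = P(H0)/P(H1) = 1
t = 2.314; PR = 1
MBF = exp(-0.5*t^2)
p2 = MBF*PR/(1 + MBF*PR)
print(p2)

# P10: Zellner and Siow (1979)
# NOTE: only P1 appears in the original listing; P2, P3 and P10 are assumptions
f = t^2; k0 = 1; k1 = 8; v1 = T-k0-k1-1
P1 = pi^(0.5)/gamma((k0+1)/2)
P2 = (0.5*v1)^(0.5*k0)
P3 = (1 + (k0/v1)*f)^(0.5*(v1-1))
P10 = (P1*P2/P3)^(-1)
print(P10)

# Adaptive level of significance: Perez and Pericchi (2014)
n = T; alpha = 0.05
q = 1 # Number of parameters under H0
adapt1 = ( qchisq(p=1-alpha,df=q) + q*log(n) )^(0.5*q-1)
adapt2 = 2^(0.5*q-1) * n^(0.5*q) * gamma(0.5*q)
adapt3 = exp(-0.5*qchisq(p=1-alpha,df=q))
# NOTE: combining the three parts this way is an assumption
alphas = adapt1*adapt3/adapt2
print(alphas)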

The p-value criterion has a number of deficiencies. Sole reliance on this decision rule has generated serious problems in scientific research, including the accumulation of wrong stylized facts and damage to research integrity and credibility: see the statement of the American Statistical Association (Wasserstein and Lazar, 2016).

This post presents several alternatives to the p-value criterion for statistical evidence. A balanced and informed statistical decision can be made by considering the information from a range of alternatives. Mindless use of a single decision rule can produce misleading decisions, which can be highly costly and consequential. These alternatives are simple to calculate and can complement the p-value criterion for better and more informed decisions.
