Linear Regression in Time Series: Sources of Spurious Regression


1. Introduction

It seems clear that much of our work will eventually be automated by AI. This is becoming possible because many researchers and professionals make their work available online. These contributions not only help us understand fundamental concepts but also help refine AI models, ultimately freeing up time to focus on other activities.

Nevertheless, there is one concept that continues to be misunderstood, even among experts: spurious regression in time series analysis. This issue arises when regression models suggest strong relationships between variables even when none exist. It typically shows up in time series regressions that appear to fit very well, as indicated by a high R² (squared coefficient of multiple correlation), but that have an extremely low Durbin-Watson statistic (d), signaling strong autocorrelation in the error terms.

What is especially surprising is that nearly every econometrics textbook warns about the danger of autocorrelated errors, yet the issue persists in many published papers. Granger and Newbold (1974) identified several examples. For instance, they found published equations with R² = 0.997 and a Durbin-Watson statistic (d) of 0.53. The most extreme case they found was an equation with R² = 0.999 and d = 0.093.

This is particularly problematic in economics and finance, where many key variables exhibit autocorrelation, or serial correlation between adjacent values, especially when the sampling interval is small (a week or a month), leading to misleading conclusions if the issue is not handled properly. For example, today's GDP is strongly correlated with the GDP of the previous quarter. This post provides a detailed explanation of the results of Granger and Newbold (1974), together with a Python simulation (see section 7) replicating the key results presented in their article.

Whether you are an economist, data scientist, or analyst working with time series data, understanding this issue is crucial to ensuring your models produce meaningful results.

To walk you through this post, the next section introduces the random walk and the ARIMA(0,1,1) process. In section 3, we explain how Granger and Newbold (1974) describe the emergence of nonsense regressions, with examples illustrated in section 4. Finally, we show how to avoid spurious regressions when working with time series data.

2. Simple presentation of a Random Walk and ARIMA(0,1,1) Process

2.1 Random Walk

Let 𝐗ₜ be a time series. We say that 𝐗ₜ follows a random walk if it can be represented as:

𝐗ₜ = 𝐗ₜ₋₁ + 𝜖ₜ. (1)

where 𝜖ₜ is white noise. The process can be written as a cumulative sum of white noise terms, a form that is convenient for simulation. It is a non-stationary time series because its variance grows with time t.
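As a quick illustration, here is a minimal sketch (using NumPy, in the same spirit as the full simulation code in section 7) of how a random walk can be generated as the cumulative sum of white noise:

import numpy as np

np.random.seed(42)
T = 200
epsilon = np.random.normal(0, 1, size=T)   # white noise
X = np.cumsum(epsilon)                     # X_t = X_{t-1} + epsilon_t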

2.2 ARIMA(0,1,1) Process

The ARIMA(0,1,1) process is given by:

𝐗ₜ = 𝐗ₜ₋₁ + 𝜖ₜ − 𝜃 𝜖ₜ₋₁. (2)

where 𝜖ₜ is white noise. The ARIMA(0,1,1) process is non-stationary. It can be written as the sum of an independent random walk and white noise:

𝐗ₜ = 𝐗₀ + random walk + white noise. (3)

This form is useful for simulation.

These non-stationary series are often used as benchmarks against which the forecasting performance of other models is judged.
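Likewise, here is a minimal sketch of how an ARIMA(0,1,1) series can be simulated as the sum of a random walk and an independent white noise, the same construction used by the simulation code in section 7:

import numpy as np

np.random.seed(42)
T = 200
random_walk = np.cumsum(np.random.normal(0, 1, size=T))   # random walk component
white_noise = np.random.normal(0, 1, size=T)               # independent white noise
X = random_walk + white_noise                               # ARIMA(0,1,1) series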

3. Random Walks Can Lead to Nonsense Regression

First, let's recall the linear regression model, which is given by:

𝐘 = 𝐗𝛽 + 𝜖. (4)

where 𝐘 is a T × 1 vector of the dependent variable, 𝛽 is a K × 1 vector of coefficients, and 𝐗 is a T × K matrix of the independent variables containing a column of ones and (K−1) columns with T observations on each of the (K−1) independent variables, which are stochastic but distributed independently of the T × 1 error vector 𝜖. It is generally assumed that:

𝐄(𝜖) = 0, (5)

and

𝐄(𝜖𝜖′) = 𝜎²𝐈. (6)

where 𝐈 is the identity matrix.

A test of the contribution of the independent variables to the explanation of the dependent variable is the F-test. The null hypothesis of the test is:

𝐇₀: 𝛽₁ = 𝛽₂ = ⋯ = 𝛽ₖ₋₁ = 0, (7)

and the test statistic is:

𝐅 = (𝐑² / (𝐊−1)) / ((1−𝐑²) / (𝐓−𝐊)). (8)

where 𝐑² is the coefficient of determination.
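As a quick illustration, here is a small sketch of how this F statistic is computed from 𝐑², K, and T; the numbers are hypothetical and only meant to show the mechanics:

R2 = 0.997   # hypothetical coefficient of determination
K = 3        # number of coefficients, including the intercept
T = 50       # number of observations

F = (R2 / (K - 1)) / ((1 - R2) / (T - K))
print(F)     # a huge F value that would look "highly significant"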

To see where this can go wrong, assume that the null hypothesis is true and that one tries to fit a regression of the form (Equation 4) to the levels of an economic time series. Suppose further that these series are not stationary or are highly autocorrelated. In such a situation, the test procedure is invalid, since 𝐅 in (Equation 8) does not follow an F-distribution under the null hypothesis (Equation 7). Indeed, under the null hypothesis, the errors or residuals from (Equation 4) are given by:

𝜖ₜ = 𝐘ₜ − 𝛽₀ ; t = 1, 2, …, T. (9)

and will have the same autocorrelation structure as the original series 𝐘.

Some idea of the nature of the distributional problem can be gained from the situation where:

𝐘ₜ = 𝛽₀ + 𝐗ₜ𝛽₁ + 𝜖ₜ. (10)

where 𝐘ₜ and 𝐗ₜ follow independent first-order autoregressive processes:

𝐘ₜ = 𝜌 𝐘ₜ₋₁ + 𝜂ₜ, and 𝐗ₜ = 𝜌* 𝐗ₜ₋₁ + 𝜈ₜ. (11)

where 𝜂ₜ and 𝜈ₜ are white noise.

We know that in this case, 𝐑² is the square of the correlation between 𝐘ₜ and 𝐗ₜ. Granger and Newbold use a result of Kendall, reported in Knowles (1954), which gives the variance of 𝐑:

𝐕𝐚𝐫(𝐑) = (1/T) (1 + 𝜌𝜌*) / (1 − 𝜌𝜌*). (12)

Since 𝐑 is constrained to lie between −1 and 1, if its variance is greater than 1/3, the distribution of 𝐑 cannot have a mode at 0. From Equation 12, this happens when 𝜌𝜌* > (T−3) / (T+3).

Thus, for example, if T = 20 and 𝜌 = 𝜌*, a distribution that is not unimodal at 0 is obtained as soon as 𝜌 > 0.86, and if 𝜌 = 0.9, 𝐕𝐚𝐫(𝐑) ≈ 0.47, so 𝐄(𝐑²) will be close to 0.47.
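A quick numerical check of Equation 12 and of the condition above, for T = 20 and ρ = ρ* = 0.9:

T = 20
rho = 0.9
var_R = (1 / T) * (1 + rho**2) / (1 - rho**2)
print(round(var_R, 3))              # about 0.476, the value quoted above
print(rho**2 > (T - 3) / (T + 3))   # True: the distribution of R is not unimodal at 0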

This shows that when 𝜌 is close to 1, 𝐑² can be very high, suggesting a strong relationship between 𝐘ₜ and 𝐗ₜ, even though the two series are in fact completely independent. When 𝜌 is close to 1, each series behaves like a random walk or a near-random walk. Moreover, both series are highly autocorrelated, which causes the residuals of the regression to be strongly autocorrelated as well. As a result, the Durbin-Watson statistic 𝐝 will be very low.

This is why a high 𝐑² in this context should never be taken as evidence of a true relationship between the two series.
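To see this in practice, here is a small sketch (not taken from Granger and Newbold (1974), just an illustration of Equation 11) that regresses two independent AR(1) series with ρ = ρ* = 0.9; it will often produce a sizeable 𝐑² and a low Durbin-Watson statistic even though the series are independent:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

np.random.seed(0)
T, rho = 50, 0.9

def simulate_ar1(T, rho):
    # X_t = rho * X_{t-1} + nu_t, with nu_t ~ N(0, 1)
    x = np.zeros(T)
    nu = np.random.normal(0, 1, size=T)
    for t in range(1, T):
        x[t] = rho * x[t - 1] + nu[t]
    return x

Y = simulate_ar1(T, rho)
X = simulate_ar1(T, rho)                      # independent of Y
model = sm.OLS(Y, sm.add_constant(X)).fit()
print(model.rsquared, durbin_watson(model.resid))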

To explore the possibility of obtaining a spurious regression when regressing two independent random walks, the series of simulations proposed by Granger and Newbold (1974) is reproduced in the next section.

4. Simulation results using Python

In this section, we use simulations to show that regressing independent random walks on one another produces misleading coefficient estimates and invalid hypothesis tests on the coefficients. The Python code that produces the simulation results is given in section 7.

A regression equation proposed by Granger and Newbold (1974) is given by:

𝐘ₜ = 𝛽₀ + 𝐗ₜ𝛽₁ + 𝜖ₜ

where 𝐘ₜ and 𝐗ₜ are generated as independent random walks, each of length 50. The values of 𝐒 = |𝛽̂₁| / 𝐒𝐄̂(𝛽̂₁), the statistic for testing the significance of 𝛽₁, over 100 simulations are reported in the table below.

Table 1: Regressing two independent random walks

The null hypothesis of no relationship between 𝐘ₜ and 𝐗ₜ is rejected at the 5% level if 𝐒 > 2. This table shows that the null hypothesis (𝛽₁ = 0) is wrongly rejected in roughly three quarters of all cases (71 times out of 100). This is troubling, since the two variables are independent random walks, so there is no actual relationship. Let's break down why this happens.

If 𝛽̂₁ / 𝐒𝐄̂(𝛽̂₁) followed a 𝐍(0,1) distribution, the expected value of 𝐒, its absolute value, would be √(2/π) ≈ 0.8 (the mean of the absolute value of a standard normal variable). However, the simulation yields a mean of 4.59, meaning that the classical standard error understates the true sampling variability of 𝛽̂₁ by a factor of about:

4.59 / 0.8 ≈ 5.7

In classical statistics, we typically use a t-test threshold of about 2 to assess the significance of a coefficient. However, these results show that, in this case, you would need to use a threshold of about 11.4 to test for significance properly:

2 × (4.59 / 0.8) ≈ 11.4
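A quick back-of-the-envelope check of these two numbers:

import numpy as np

expected_abs_S = np.sqrt(2 / np.pi)     # mean of |Z| for Z ~ N(0, 1), about 0.8
print(round(expected_abs_S, 2))         # 0.8
print(round(4.59 / 0.8, 1))             # about 5.7
print(round(2 * 4.59 / 0.8, 1))         # about 11.5, roughly the threshold quoted above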

Interpretation: We have just shown that including variables that don't belong in the model, especially random walks, can lead to completely invalid significance tests for the coefficients.

To make the evidence even clearer, Granger and Newbold (1974) ran a series of regressions using variables that follow either a random walk or an ARIMA(0,1,1) process.

Here is how they set up their simulations:

They regressed a dependent series 𝐘ₜ on m series 𝐗ⱼ,ₜ (with j = 1, 2, …, m), varying m from 1 to 5. The dependent series 𝐘ₜ and the independent series 𝐗ⱼ,ₜ follow the same type of process, and they tested four cases:

  • Case 1 (Levels): 𝐘ₜ and 𝐗ⱼ,ₜ follow random walks.
  • Case 2 (Differences): They use the first differences of the random walks, which are stationary.
  • Case 3 (Levels): 𝐘ₜ and 𝐗ⱼ,ₜ follow ARIMA(0,1,1) processes.
  • Case 4 (Differences): They use the first differences of the previous ARIMA(0,1,1) processes, which are stationary.

Each series has a length of 50 observations, and they ran 100 simulations for each case.

All error terms are distributed as 𝐍(0,1), and the ARIMA(0,1,1) series are obtained as the sum of a random walk and independent white noise. The simulation results, based on 100 replications with series of length 50, are summarized in the next table.

Table 2: Regressions of a series on m independent ‘explanatory’ series.

Interpretation of the results:

  • The probability of not rejecting the null hypothesis of no relationship between 𝐘ₜ and 𝐗ⱼ,ₜ becomes very small from m ≥ 3 onwards when the regressions are run on random walks in levels (rw-levels). The 𝐑² and the mean Durbin-Watson statistic increase with m. Similar results are obtained when the regressions are run on ARIMA(0,1,1) series in levels (arima-levels).
  • When white noise series (rw-diffs) are used, classical regression analysis is valid, since the error series is white noise and least squares is efficient.
  • However, when the regressions are run on the differences of the ARIMA(0,1,1) series (arima-diffs), which follow a first-order moving average MA(1) process, the null hypothesis is rejected, on average,

(10 + 16 + 5 + 6 + 6) / 5 = 8.6

percent of the time, which is more than the nominal 5% level.

If your variables are random walks or close to them, and you include unnecessary variables in your regression, you will often get misleading results. A high 𝐑² combined with a low Durbin-Watson statistic does not confirm a true relationship, but rather indicates a likely spurious one.

5. How to avoid spurious regression in time series

It is hard to come up with an exhaustive list of ways to avoid spurious regressions. However, there are a few good practices you can follow to minimize the risk as much as possible.

If one performs a regression analysis with time series data and finds that the residuals are strongly autocorrelated, there is a serious problem when it comes to interpreting the coefficients of the equation. To check for autocorrelation in the residuals, one can use the Durbin-Watson test or a Portmanteau test such as the Ljung-Box test.
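Here is a minimal sketch of both checks with statsmodels, assuming a fitted OLS results object called model, such as the ones produced by the code in section 7:

from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import acorr_ljungbox

dw = durbin_watson(model.resid)              # values far below 2 suggest positive autocorrelation
lb = acorr_ljungbox(model.resid, lags=[10])  # Ljung-Box (Portmanteau) test up to lag 10
print(dw)
print(lb)                                    # small p-values point to autocorrelated residuals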

Based on the study above, we can conclude that if a regression analysis performed on economic variables produces strongly autocorrelated residuals, reflected in a low Durbin-Watson statistic, then the results of the analysis are likely to be spurious, whatever the value of the coefficient of determination R².

In such cases, it is important to understand where the misspecification comes from. According to the literature, misspecification usually falls into three categories: (i) omission of a relevant variable, (ii) inclusion of an irrelevant variable, and (iii) autocorrelation of the errors. Most of the time, misspecification arises from a mix of these three sources.

To avoid spurious regression with time series, several recommendations can be made:

  • The first recommendation is to select macroeconomic variables that are likely to explain the dependent variable. This can be done by reviewing the literature or consulting experts in the field.
  • The second recommendation is to make the series stationary by taking first differences. In most cases, the first differences of macroeconomic variables are stationary and remain easy to interpret. For macroeconomic data, it is strongly advisable to difference the series once to reduce the autocorrelation of the residuals, especially when the sample size is small, since strong serial correlation is often observed in these variables. A simple calculation shows that the first differences will almost always have much smaller serial correlations than the original series.
  • The third recommendation is to use the Box-Jenkins methodology to model each macroeconomic variable individually and then to look for relationships between the series by relating the residuals from each individual model. The idea is that the Box-Jenkins procedure extracts the part of each series explained by its own past, leaving residuals that contain only what cannot be explained by that past behavior. This makes it easier to check whether these unexplained parts (residuals) are related across variables. A short sketch of the second and third recommendations follows this list.
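Here is a minimal sketch of the second and third recommendations, using statsmodels; the series y and x are placeholders, and the ARIMA orders are illustrative rather than the result of a full Box-Jenkins identification:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA

np.random.seed(0)
# Placeholder series: in practice, y and x would be your macroeconomic variables.
y = pd.Series(np.cumsum(np.random.normal(size=100)))
x = pd.Series(np.cumsum(np.random.normal(size=100)))

# Second recommendation: regress in first differences rather than in levels.
dy = y.diff().dropna()
dx = x.diff().dropna()
diff_model = sm.OLS(dy, sm.add_constant(dx)).fit()
print(diff_model.params)

# Third recommendation: model each series individually (orders are illustrative),
# then look for a relationship between the residuals of the two models.
res_y = ARIMA(y, order=(0, 1, 1)).fit().resid
res_x = ARIMA(x, order=(0, 1, 1)).fit().resid
print(np.corrcoef(res_y[1:], res_x[1:])[0, 1])   # correlation of the unexplained parts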

6. Conclusion

Many econometrics textbooks warn about specification errors in regression models, but the problem still shows up in many published papers. Granger and Newbold (1974) highlighted the risk of spurious regressions, where a high R² is paired with a very low Durbin-Watson statistic.

Using Python simulations, we illustrated some of the main causes of these spurious regressions, in particular the inclusion of variables that do not belong in the model and are highly autocorrelated. We also showed how these issues can completely distort hypothesis tests on the coefficients.

Hopefully, this post will help reduce the risk of spurious regressions in future econometric analyses.

7. Appendix: Python code for the simulations

#####################################################
# Simulation code for Table 1
#####################################################

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

np.random.seed(123)
M = 100 
n = 50
S = np.zeros(M)
for i in range(M):
    #---------------------------------------------------------------
    # Generate the data: two independent random walks of length n
    #---------------------------------------------------------------
    epsilon_y = np.random.normal(0, 1, n)
    epsilon_x = np.random.normal(0, 1, n)

    Y = np.cumsum(epsilon_y)
    X = np.cumsum(epsilon_x)
    #---------------------------------------------------------------
    # Fit the regression of Y on X (with an intercept)
    #---------------------------------------------------------------
    X = sm.add_constant(X)
    model = sm.OLS(Y, X).fit()
    #---------------------------------------------------------------
    # Compute the statistic S = |beta_1_hat| / SE(beta_1_hat)
    #---------------------------------------------------------------
    S[i] = np.abs(model.params[1]) / model.bse[1]


#------------------------------------------------------ 
#              Maximum value of S
#------------------------------------------------------
S_max = int(np.ceil(max(S)))

#------------------------------------------------------ 
#                Create bins
#------------------------------------------------------
bins = np.arange(0, S_max + 2, 1)  

#------------------------------------------------------
#    Compute the histogram
#------------------------------------------------------
frequency, bin_edges = np.histogram(S, bins=bins)

#------------------------------------------------------
#    Create a dataframe
#------------------------------------------------------

df = pd.DataFrame({
    "S Interval": [f"{int(bin_edges[i])}-{int(bin_edges[i+1])}" for i in range(len(bin_edges)-1)],
    "Frequency": frequency
})
print(df)
print(np.mean(S))  # mean of S over the 100 simulations

#####################################################
# Simulation code for Table 2
#####################################################

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from tabulate import tabulate

np.random.seed(1)  # Make the results reproducible

#------------------------------------------------------
# Definition of functions
#------------------------------------------------------

def generate_random_walk(T):
    """
    Generate a series of length T following a random walk:
        Y_t = Y_{t-1} + e_t,
    where e_t ~ N(0,1).
    """
    e = np.random.normal(0, 1, size=T)
    return np.cumsum(e)

def generate_arima_0_1_1(T):
    """
    Generate an ARIMA(0,1,1) series using the Granger & Newbold construction:
    the series is obtained by adding a random walk and an independent white noise.
    """
    rw = generate_random_walk(T)
    wn = np.random.normal(0, 1, size=T)
    return rw + wn

def difference(series):
    """
    Compute the first difference of a one-dimensional series.
    Returns a series of length T-1.
    """
    return np.diff(series)

#------------------------------------------------------
# Parameters
#------------------------------------------------------

T = 50           # length of each series
n_sims = 100     # number of Monte Carlo simulations
alpha = 0.05     # significance level

#------------------------------------------------------
# Definition of function for simulation
#------------------------------------------------------

def run_simulation_case(case_name, m_values=[1,2,3,4,5]):
    """
    case_name : an identifier for the data-generating process:
        - 'rw-levels'    : random walks (levels)
        - 'rw-diffs'     : differences of random walks (white noise)
        - 'arima-levels' : ARIMA(0,1,1) in levels
        - 'arima-diffs'  : differences of an ARIMA(0,1,1) => MA(1)

    m_values : list of numbers of regressors.

    Returns a DataFrame with, for each m:
        - % of rejections of H0
        - mean Durbin-Watson statistic
        - mean adjusted R^2
        - % of adjusted R^2 > 0.7
    """
    results = []
    
    for m in m_values:
        count_reject = 0
        dw_list = []
        r2_adjusted_list = []
        
        for _ in range(n_sims):
            #--------------------------------------------------
            # 1) Generate independent series Y_t and X_{j,t}
            #--------------------------------------------------
            if case_name == 'rw-levels':
                Y = generate_random_walk(T)
                Xs = [generate_random_walk(T) for __ in range(m)]

            elif case_name == 'rw-diffs':
                # Y and the X's are first differences of random walks, i.e. ~ white noise
                Y_rw = generate_random_walk(T)
                Y = difference(Y_rw)
                Xs = []
                for __ in range(m):
                    X_rw = generate_random_walk(T)
                    Xs.append(difference(X_rw))
                # NB: Y and Xs now have length T-1, so the effective
                # sample size for the regression is T-1.

            elif case_name == 'arima-levels':
                Y = generate_arima_0_1_1(T)
                Xs = [generate_arima_0_1_1(T) for __ in range(m)]

            elif case_name == 'arima-diffs':
                # First differences of an ARIMA(0,1,1) => MA(1)
                Y_arima = generate_arima_0_1_1(T)
                Y = difference(Y_arima)
                Xs = []
                for __ in range(m):
                    X_arima = generate_arima_0_1_1(T)
                    Xs.append(difference(X_arima))
            
            # 2) Prepare the data for the regression
            #    Depending on the case, the length is T or T-1
            if case_name in ['rw-levels','arima-levels']:
                Y_reg = Y
                X_reg = np.column_stack(Xs) if m>0 else np.array([])
            else:
                # in the differenced cases, the length is T-1
                Y_reg = Y
                X_reg = np.column_stack(Xs) if m>0 else np.array([])

            # 3) OLS regression
            X_with_const = sm.add_constant(X_reg)  # add the intercept
            model = sm.OLS(Y_reg, X_with_const).fit()

            # 4) Global F-test: H0: all beta_j = 0
            #    Reject if the p-value is below alpha
            if model.f_pvalue is not None and model.f_pvalue < alpha:
                count_reject += 1

            # 5) Adjusted R^2 and Durbin-Watson statistic
            r2_adjusted_list.append(model.rsquared_adj)
            dw_list.append(durbin_watson(model.resid))
        
        # Statistics over the n_sims replications
        reject_percent = 100 * count_reject / n_sims
        dw_mean = np.mean(dw_list)
        r2_mean = np.mean(r2_adjusted_list)
        r2_above_0_7_percent = 100 * np.mean(np.array(r2_adjusted_list) > 0.7)
        
        results.append({
            'm': m,
            'Reject %': reject_percent,
            'Mean DW': dw_mean,
            'Mean R^2': r2_mean,
            '% R^2_adj>0.7': r2_above_0_7_percent
        })
    
    return pd.DataFrame(results)
    
#------------------------------------------------------
# Application of the simulation
#------------------------------------------------------       

cases = ['rw-levels', 'rw-diffs', 'arima-levels', 'arima-diffs']
all_results = {}

for c in cases:
    df_res = run_simulation_case(c, m_values=[1,2,3,4,5])
    all_results[c] = df_res

#------------------------------------------------------
# Display the results as tables
#------------------------------------------------------

for case, df_res in all_results.items():
    print(f"nn{case}")
    print(tabulate(df_res, headers='keys', tablefmt='fancy_grid'))

References

  • Granger, Clive W. J., and Paul Newbold. 1974. “Spurious Regressions in Econometrics.” Journal of Econometrics 2 (2): 111–20.
  • Knowles, E. A. G. 1954. “Exercises in Theoretical Statistics.” Oxford University Press.