GDP is a very strong metric of a country’s economic well-being; therefore, forecasts of the measurement are highly sought after. Policymakers and legislators, for example, may want a rough forecast of the trends regarding the country’s GDP prior to passing a new bill or law. Researchers and economists will also consider these forecasts for various endeavors in both academic and industrial settings.
Forecasting GDP, like many other time series problems, follows a general workflow.
- Using the integrated FRED (Federal Reserve Economic Data) library and API, we’ll create our features by constructing a data frame composed of US GDP along with other metrics that are closely related (GDP = Consumption + Investment + Govt. Spending + Net Exports)
- Using a variety of statistical tests and analyses, we’ll explore the nuances of our data in order to better understand the underlying relationships between features.
- Finally, we’ll utilize a variety of statistical and machine-learning models to conclude which approach leads us to the most accurate and efficient forecast.
Throughout all of these steps, we’ll delve into the nuances of the underlying mathematical backbone that supports our tests and models.
To construct our dataset for this project, we will be utilizing the FRED (Federal Reserve Economic Data) API, which is the premier application for gathering economic data. Note that to use this data, one must register an account on the FRED website and request a custom API key.
Every time series on the website is connected to a specific character string (for example, GDP is linked to ‘GDP’, Net Export to ‘NETEXP’, etc.). This is important because when we make a call for each of our features, we need to make sure that we specify the correct character string to go along with it.
Keeping this in mind, let’s now construct our data frame:
import pandas as pd
from fredapi import Fred
#initialize the FRED client (assumes the fredapi package; paste your own API key here)
fred = Fred(api_key='YOUR_API_KEY')

#used to label and construct each feature dataframe
def gen_df(category, series):
    gen_ser = fred.get_series(series, frequency='q')
    return pd.DataFrame({'Date': gen_ser.index, category + ' : Billions of dollars': gen_ser.values})

#used to merge every constructed dataframe
def merge_dataframes(dataframes, on_column):
    merged_df = dataframes[0]
    for df in dataframes[1:]:
        merged_df = pd.merge(merged_df, df, on=on_column)
    return merged_df

#list of features to be used
dataframes_list = [
    gen_df('GDP', 'GDP'),
    gen_df('PCE', 'PCE'),
    gen_df('GPDI', 'GPDI'),
    gen_df('NETEXP', 'NETEXP'),
    gen_df('GovTotExp', 'W068RCQ027SBEA')
]

#defining and displaying the dataset
data = merge_dataframes(dataframes_list, 'Date')
data
Notice that since we have defined functions rather than static chunks of code, we are free to expand our list of features for further testing. Running this code, our resulting data frame is the following:
We notice that our dataset starts from the 1960s, giving us a fairly broad historical context. In addition, looking at the shape of the data frame, we have 1285 instances of actual economic data to work with, a number that isn’t necessarily small but isn’t big either. These observations will come into play during our modeling phase.
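(For reference, a quick sanity check like the following can reproduce these figures in the notebook; this is a small sketch, and the exact counts depend on when the data is pulled.)
#checking the span of the data and the total observations across all feature series
print(data['Date'].min(), data['Date'].max())
print(data.shape[0] * (data.shape[1] - 1), "total observations")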
Now that our dataset is initialized, we can begin visualizing and conducting tests to gather some insights into the behavior of our data and how our features relate to one another.
Visualization (Line plot):
Our first approach to analyzing this dataset is to simply graph each feature on the same plot in order to catch some patterns. We can write the following:
import matplotlib.pyplot as plt

#separating date column from feature columns
date_column = 'Date'
feature_columns = data.columns.difference([date_column])

#set up the plot
fig, ax = plt.subplots(figsize=(10, 6))
fig.suptitle('Features vs Time', y=1.02)

#graph each feature onto the plot
for i, feature in enumerate(feature_columns):
    ax.plot(data[date_column], data[feature], label=feature, color=plt.cm.viridis(i / len(feature_columns)))

#label the axes
ax.set_xlabel('Date')
ax.set_ylabel('Billions of Dollars')
ax.legend(loc='upper left', bbox_to_anchor=(1, 1))

#display the plot
plt.show()
Running the code, we get the result:
Analyzing the graph, we notice that some of the features resemble GDP far more than others. For instance, GDP and PCE follow almost the exact same trend, while NETEXP shares no visible similarities. Though it may be tempting, we cannot yet begin selecting and removing certain features before conducting more exploratory tests.
ADF (Augmented Dickey-Fuller) Test:
The ADF (Augmented Dickey-Fuller) Test evaluates the stationarity of a particular time series by checking for the presence of a unit root, a characteristic that defines a time series as nonstationary. Stationarity essentially means that a time series has a constant mean and variance. This is important to test because many popular forecasting methods (including ones we will use in our modeling phase) require stationarity to operate properly.
Although we can determine the stationarity of most of these time series just by looking at the graph, doing the testing is still useful because we will likely reuse it in later parts of the forecast. Using the Statsmodels library we write:
from statsmodels.tsa.stattools import adfuller

#iterating through each feature
for column in data.columns:
    if column != 'Date':
        result = adfuller(data[column])
        print(f"ADF Statistic for {column}: {result[0]}")
        print(f"P-value for {column}: {result[1]}")
        print("Critical Values:")
        for key, value in result[4].items():
            print(f"   {key}: {value}")
        #separation line between each feature
        print("\n" + "=" * 40 + "\n")
giving us the result:
The numbers we are interested in from this test are the P-values. A P-value close to zero (equal to or lower than 0.05) implies stationarity, while a value closer to 1 implies nonstationarity. We can see that all of our time series features are highly nonstationary due to their statistically insignificant p-values; in other words, we are unable to reject the null hypothesis that a unit root is present. Below is a simple visual representation of the test for one of our features. The red dotted line represents the P-value at which we would be able to determine stationarity for the time series feature, and the blue box represents the P-value of the feature as it currently stands.
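As a quick illustration of this decision rule (a minimal sketch, not part of the original notebook), we could collect the p-values and flag each series at the 0.05 level:
#summarizing the ADF p-values and flagging stationarity at the 0.05 level
adf_pvalues = {col: adfuller(data[col])[1] for col in data.columns if col != 'Date'}
for col, p in adf_pvalues.items():
    print(f"{col}: p-value = {p:.4f} -> {'stationary' if p <= 0.05 else 'nonstationary'}")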
VIF (Variance Inflation Factor) Test:
The purpose of finding the Variance Inflation Factor of each feature is to check for multicollinearity, or the degree of correlation the predictors share with one another. High multicollinearity is not necessarily detrimental to our forecast; however, it can make it much harder for us to determine the individual effect of each feature time series on the prediction, thus hurting the interpretability of the model.
Mathematically, the calculation is as follows:
VIF(Xj) = 1 / (1 − R²j)
with Xj representing our chosen predictor and R²j the coefficient of determination obtained by regressing Xj on the remaining predictors. Applying this calculation to our data, we arrive at the following result:
Evidently, our predictors are very closely linked to one another. A VIF score greater than 5 implies multicollinearity, and the scores our features achieved far exceed this threshold. Predictably, PCE had the highest score by far, which makes sense given how its shape on the line plot resembled many of the other features.
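The scores above come from the notebook; a minimal sketch of how they might be computed, assuming statsmodels’ variance_inflation_factor rather than the author’s exact code, would look like:
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
#add a constant so each VIF is computed against a regression with an intercept
X = sm.add_constant(data.drop('Date', axis=1))
vif_scores = pd.DataFrame({
    'feature': X.columns,
    'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
})
print(vif_scores[vif_scores['feature'] != 'const'])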
Now that we have looked thoroughly through our data to better understand the relationships and characteristics of each feature, we will begin to make modifications to our dataset in order to prepare it for modeling.
Differencing to achieve stationarity
To begin modeling we need to first ensure our data is stationary. We can achieve this using a technique called differencing, which essentially transforms the raw data using a mathematical formula similar to the tests above.
The concept is defined mathematically as:
Δyt = yt − yt−1
This removes the nonlinear trends from the features, resulting in a series with a constant mean. In other words, we take each value from our time series and calculate the change that occurred from the previous point.
We can implement this concept on our dataset and check the results with the previously used ADF test using the following code:
#differencing and storing original dataset
data_diff = data.drop('Date', axis=1).diff().dropna()

#printing ADF test for the new dataset
for column in data_diff.columns:
    result = adfuller(data_diff[column])
    print(f"ADF Statistic for {column}: {result[0]}")
    print(f"P-value for {column}: {result[1]}")
    print("Critical Values:")
    for key, value in result[4].items():
        print(f"   {key}: {value}")
    print("\n" + "=" * 40 + "\n")
Running this results in:
We notice that our new p-values are less than 0.05, meaning that we can now reject the null hypothesis that our dataset is nonstationary. Taking a look at the graph of the new dataset proves this assertion:
We see how all of our time series are now centered around 0, with the mean and variance remaining constant. In other words, our data now visibly demonstrates the characteristics of a stationary system.
VAR (Vector Auto Regression) Model
The first step of the VAR model is performing the Granger Causality Test, which will tell us which of our features are statistically significant to our prediction. The test indicates whether a lagged version of a specific time series can help us predict our target time series, though not necessarily that one time series causes the other (note that causation in the context of statistics is a far more difficult concept to prove).
Using the StatsModels library, we can apply the test as follows:
from statsmodels.tsa.stattools import grangercausalitytests

columns = ['PCE : Billions of dollars', 'GPDI : Billions of dollars', 'NETEXP : Billions of dollars', 'GovTotExp : Billions of dollars']
lags = [6, 9, 1, 1] #determined from individually testing each combination

for column, lag in zip(columns, lags):
    df_new = data_diff[['GDP : Billions of dollars', column]]
    print(f'For: {column}')
    gc_res = grangercausalitytests(df_new, lag)
    print("\n" + "=" * 40 + "\n")
Running the code results in the following table:
Here we are just looking for a single lag for each feature that has a statistically significant p-value (< 0.05). So, for example, since both NETEXP and GovTotExp are significant at the first lag, we will consider both of these features for our VAR model. Personal consumption expenditures (PCE) arguably did not make this cut-off (see notebook); however, its sixth lag is so close that I decided to keep it in. Our next step is to create our VAR model now that we have decided that all of our features are significant from the Granger Causality Test.
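The lag values above were found by testing each combination individually; as an illustration (not the author’s exact procedure), that scan could be automated using the ssr F-test p-values returned by grangercausalitytests:
#scanning lags 1 through 12 for each feature and reporting the lowest p-value
#(the lag grid and the choice of the ssr F-test are assumptions)
max_lag = 12
for column in columns:
    df_new = data_diff[['GDP : Billions of dollars', column]]
    gc_res = grangercausalitytests(df_new, max_lag, verbose=False)
    pvals = {lag: res[0]['ssr_ftest'][1] for lag, res in gc_res.items()}
    best_lag = min(pvals, key=pvals.get)
    print(f"{column}: best lag = {best_lag}, p-value = {pvals[best_lag]:.4f}")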
VAR (Vector Auto Regression) is a model which can leverage different time series to gauge patterns and determine a flexible forecast. Mathematically, the model is defined by:
Yt = C + A1·Yt−1 + A2·Yt−2 + … + Ap·Yt−p + Et
where Yt is a time series at a particular time t, C is a constant vector, Ap is a determined coefficient matrix, and Et is an error term. We are essentially using the lagged values of a time series (and, in our case, other time series) to make a prediction for Yt. Knowing this, we can now apply this algorithm to the data_diff dataset and evaluate the results:
Looking at this forecast, we can clearly see that despite missing the mark quite heavily on both evaluation metrics used (MAE and MAPE), our model was visually not too inaccurate barring the outliers caused by the pandemic. We managed to stay on the testing line for the most part from 2018–2019 and from 2022–2024; however, the global events that followed obviously threw in some unpredictability, which affected the model’s ability to precisely judge the trends.
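The fitting itself is done in the accompanying notebook; a minimal sketch of how such a VAR fit and forecast might look with statsmodels (the lag order here is an assumption) is:
from statsmodels.tsa.api import VAR
#90-10 split of the differenced data, mirroring the split used for the ML models later
split = int(len(data_diff) * 0.90)
train, test = data_diff.iloc[:split], data_diff.iloc[split:]
#fit the VAR and forecast over the test horizon (maxlags chosen for illustration)
var_res = VAR(train).fit(maxlags=6)
var_forecast = var_res.forecast(train.values[-var_res.k_ar:], steps=len(test))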
VECM (Vector Error Correction Model)
VECM (Vector Error Correction Model) is similar to VAR, albeit with a few key differences. Unlike VAR, VECM does not rely on stationarity, so differencing and normalizing the time series will not be needed. VECM also assumes cointegration, or a long-term equilibrium between the time series. Mathematically, we define the model as:
ΔYt = Π·Yt−1 + Γ1·ΔYt−1 + … + Γp−1·ΔYt−p+1 + Et
This equation is similar to the VAR equation, with Π being a coefficient matrix which is the product of two other matrices, alongside a sum of lagged, differenced versions of our time series Yt. Remembering to fit the model on our original (not differenced) dataset, we achieve the following result:
Though it is hard to compare this model to our VAR model given that we are now using nonstationary data, we can still deduce, both from the error metrics and the visualization, that this model was not able to accurately capture the trends in this forecast. With this, it’s fair to say that we can rule out traditional statistical methods for approaching this problem.
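For reference, a minimal sketch of how such a VECM might be fit with statsmodels (the cointegration rank, lag order, and deterministic terms are illustrative assumptions, not the author’s exact settings):
from statsmodels.tsa.vector_ar.vecm import VECM
#fit on the original (non-differenced) levels, excluding the Date column
levels = data.drop('Date', axis=1)
split = int(len(levels) * 0.90)
train_levels, test_levels = levels.iloc[:split], levels.iloc[split:]
vecm_res = VECM(train_levels, k_ar_diff=2, coint_rank=1, deterministic='ci').fit()
vecm_forecast = vecm_res.predict(steps=len(test_levels))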
Machine Learning forecasting
When deciding on a machine learning approach to model this problem, we want to consider the amount of data that we are working with. Prior to creating lagged columns, our dataset has a total of 1275 observations across all time series. This means that using more complex approaches, such as LSTMs or gradient boosting, is perhaps unnecessary, as we can use a simpler model to achieve the same amount of accuracy and far more interpretability.
Train-Test Split
Train-test splits for time series problems differ slightly from splits in traditional regression or classification tasks (note we also used the train-test split in our VAR and VECM models; however, it feels more appropriate to address it in the Machine Learning section). We can perform our train-test split on our differenced data with the following code:
#90-10 data split
split_index = int(len(data_diff) * 0.90)
train_data = data_diff.iloc[:split_index]
test_data = data_diff.iloc[split_index:]
#assigning GDP column to target variable
X_train = train_data.drop('GDP : Billions of dollars', axis=1)
y_train = train_data['GDP : Billions of dollars']
X_test = test_data.drop('GDP : Billions of dollars', axis=1)
y_test = test_data['GDP : Billions of dollars']
Here it is imperative that we do not shuffle our data, since that would mean we are training our model on data from the future, which in turn would cause data leakage.
Also, in comparison, notice that we are training over a very large portion (90 percent) of the data, whereas we would typically train over 75 percent in a standard regression task. This is because, practically, we are not actually concerned with forecasting over a large time frame. Realistically, even forecasting over several years is not probable for this task given the general unpredictability that comes with real-world time series data.
Random Forests
Remembering our VIF test from earlier, we know our features are highly correlated with one another. This partially plays into the decision to choose random forests as one of our machine-learning models. Decision trees make binary decisions between features, meaning that, theoretically, our features being highly correlated should not be detrimental to our model.
To add on, random forest is generally a very strong model, being robust to overfitting thanks to the stochastic nature of how the trees are computed. Each tree uses a random subset of the total feature space, meaning that certain features are unlikely to dominate the model. Following the construction of the individual trees, the results are averaged in order to make a final prediction using every individual learner.
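One note before the code: printevals and plotresults are helper functions from the accompanying notebook and are not shown in the article. A minimal sketch of what they might look like, assuming MAE and MAPE as the metrics and matplotlib for the plot:
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

#hypothetical evaluation helper: prints the two error metrics cited in this article
def printevals(y_true, y_forecast):
    print(f"MAE: {mean_absolute_error(y_true, y_forecast):.3f}")
    print(f"MAPE: {mean_absolute_percentage_error(y_true, y_forecast):.3f}")

#hypothetical plotting helper: relies on y_test and y_pred existing in the calling
#scope, matching how it is called in the snippets below
def plotresults(title):
    plt.figure(figsize=(10, 6))
    plt.plot(y_test.index, y_test.values, label='Actual')
    plt.plot(y_test.index, y_pred, label='Forecast')
    plt.title(title)
    plt.legend()
    plt.show()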
We can implement the model on our dataset with the following code:
from sklearn.ensemble import RandomForestRegressor

#fitting the model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)

#evaluating and plotting the results
printevals(y_test, y_pred)
plotresults('Actual vs Forecasted GDP using Random Forest')
Running this gives us the results:
We can see that Random Forests was able to produce our best forecast yet, attaining better error metrics than our attempts at VAR and VECM. Perhaps most impressively, visually we can see that our model almost perfectly encapsulated the data from 2017–2019, just prior to encountering the outliers.
K Nearest Neighbors
KNN (K-Nearest-Neighbors) is one final approach we will attempt. Part of the reasoning for choosing this specific model is the feature-to-observation ratio: KNN is a distance-based algorithm, and we are dealing with data that has a small feature space relative to the number of observations.
To use the model, we must first select a hyperparameter k, which defines the number of neighbors our data gets mapped to. A higher k value implies a more biased model, while a lower k value implies a more overfit model. We can select the optimal one with the following code:
from sklearn.neighbors import KNeighborsRegressor

#iterate over k values from 1 to 9
for i in range(1, 10):
    knn_model = KNeighborsRegressor(n_neighbors=i)
    knn_model.fit(X_train, y_train)
    y_pred = knn_model.predict(X_test)
    #print evaluation for each k
    print(f'for k = {i}')
    printevals(y_test, y_pred)
    print("\n" + "=" * 40 + "\n")
Running this code gives us:
We can see that our best accuracy measurements are achieved when k=2; after that value the model becomes too biased with increasing values of k. Knowing this, we can now apply the model to our dataset:
#applying the model with the optimal k value
knn_model = KNeighborsRegressor(n_neighbors=2)
knn_model.fit(X_train, y_train)
y_pred = knn_model.predict(X_test)

printevals(y_test, y_pred)
plotresults('Actual vs Forecasted GDP using KNN')
resulting in:
We can see that KNN, in its own right, performed very well. Despite being outperformed slightly in terms of error metrics compared to Random Forests, visually the model performed about the same and arguably captured the period before the pandemic from 2018–2019 even better than Random Forests.
Comparing all of our models, we can see that the one which performed the best was Random Forests. This is most likely because Random Forests is, for the most part, a very strong predictive model that can be fit to a variety of datasets. In general, the machine learning algorithms far outperformed the traditional statistical methods. Perhaps this can be explained by the fact that VAR and VECM both require a great amount of historical background data to work optimally, something which we did not have much of given that our data came in quarterly intervals. There may also be something to be said about how both of the machine learning models used were nonparametric. These models are often governed by fewer assumptions than their counterparts and therefore may be more flexible to unique problem sets like the one here. Below is our final best prediction, removing the differencing transformation we previously used to fit the models.
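That back-transformation can be done with a short snippet like the following (a sketch; the variable names and index alignment are assumptions based on the differencing and split code above):
import numpy as np
#undo the first-order differencing: add the cumulative sum of the predicted changes
#back onto the last observed GDP level before the test window
last_gdp_level = data['GDP : Billions of dollars'].iloc[split_index]
gdp_forecast_levels = last_gdp_level + np.cumsum(y_pred)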
By far the greatest challenge regarding this forecasting problem was handling the massive outlier caused by the pandemic, along with the ensuing instability it created. Our methods for forecasting obviously cannot predict that this would occur, ultimately decreasing our accuracy for each approach. Had our goal been to forecast the previous decade, our models would most likely have had a much easier time finding and predicting trends. In terms of improvement and further research, I think a potential solution would be to perform some sort of normalization and outlier-smoothing technique on the time interval from 2020–2024, and then evaluate our fully trained model on new quarterly data that comes in. In addition, it may be useful to incorporate new features that have a heavy influence on GDP, such as quarterly inflation and personal asset evaluations.
For traditional statistical methods — https://link.springer.com/book/10.1007/978-1-4842-7150-6 , https://www.statsmodels.org/stable/generated/statsmodels.tsa.vector_ar.vecm.VECM.html
For machine learning methods — https://www.statlearning.com/
For dataset — https://fred.stlouisfed.org/docs/api/fred/
FRED provides licensed, free-to-access datasets for any user who owns an API key, read more here — https://fredhelp.stlouisfed.org/fred/about/about-fred/what-is-fred/
All pictures not specifically given credit in the caption belong to me.
Please note that in order to run this notebook you must create an account on the FRED website, request a custom API key, and paste said key into the second cell of the notebook.