Forecasting in the Age of Foundation Models

-

Benchmarking Lag-Llama against XGBoost

Cliffs near Ribadesella. Photo by Enric Domas on Unsplash

On Hugging Face, there are 20 models tagged “time series” at the time of writing. While certainly not a lot (the “text-generation-inference” tag yields 125,950 results), time series forecasting with foundation models is an interesting enough area for large corporations like Amazon, IBM and Salesforce to have developed their own models: Chronos, TinyTimeMixer and Moirai, respectively. At the time of writing, one of the most popular on Hugging Face by number of likes is Lag-Llama, a univariate probabilistic model. Developed by Kashif Rasul, Arjun Ashok and co-authors [1], Lag-Llama was open sourced in February 2024. The authors of the model claim “strong zero-shot generalization capabilities” on a wide range of datasets across different domains. Once fine-tuned for specific tasks, they also claim it to be the best general-purpose model of its kind. Big words!

In this post, I showcase my experience fine-tuning Lag-Llama and test its capabilities against a more classical machine learning approach. Specifically, I benchmark it against an XGBoost model designed to handle univariate time series data. Gradient boosting algorithms such as XGBoost are widely considered the epitome of “classical” machine learning (as opposed to deep learning), and have been shown to perform extremely well with tabular data [2]. Therefore, it seems fitting to use XGBoost to check whether Lag-Llama lives up to its promises. Will the foundation model do better? Spoiler alert: it is not that simple.

By the way, I will not go into the details of the model architecture, but the paper is worth a read, as is this nice walk-through by Marco Peixeiro.

The data that I use for this exercise is a 4-year-long series of hourly wave heights off the coast of Ribadesella, a town in the Spanish region of Asturias. The series is available on the Spanish ports authority data portal. The measurements were taken at a station located at coordinates (43.5, -5.083), from 18/06/2020 00:00 to 18/06/2024 23:00 [3]. I have decided to aggregate the series to a daily level, taking the maximum over the 24 observations in each day. The reason is that the concepts we go through in this post are better illustrated from a slightly less granular perspective. Otherwise, the results become very volatile very quickly. Therefore, our target variable is the maximum height of the waves recorded in a day, measured in meters.
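As a minimal illustration of that aggregation step, a pandas resample does the job. The variable names and the random data below are hypothetical stand-ins, not taken from the preprocessing notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series standing in for the Ribadesella measurements
hours = pd.date_range("2020-06-18 00:00", "2024-06-18 23:00", freq="h")
hourly_waves = pd.Series(np.random.rand(len(hours)) * 5, index=hours, name="wave_height")

# Daily target: the maximum wave height observed on each day
daily_max = hourly_waves.resample("D").max()
```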

Distribution of the target data. Image by author

There are several reasons why I chose this series. The first one is that the Lag-Llama model was trained on some weather-related data, although not a lot, relatively speaking. I would expect the model to find this kind of data slightly challenging, but still manageable. The second is that, while meteorological forecasts are typically produced using numerical weather models, statistical models can still complement them, especially for long-range predictions. At the very least, in the era of climate change, I think statistical models can tell us what we would typically expect, and how far off it is from what is actually happening.

The dataset is pretty standard and doesn’t require much preprocessing other than imputing a few missing values. The plot below shows what it looks like once we split it into train, validation and test sets. The last two sets have a length of 5 months. To learn more about how we preprocess the data, have a look at this notebook.

Maximum daily wave heights in Ribadesella. Image by author

We are going to benchmark Lag-Llama against XGBoost on two univariate forecasting tasks: point forecasting and probabilistic forecasting. The two tasks complement each other: point forecasting gives us a specific, single-number prediction, whereas probabilistic forecasting gives us a confidence region around it. One could say that Lag-Llama was only trained for the latter, so we should focus on that one. While that is true, I believe humans find it easier to understand a single number than a confidence interval, so I think the point forecast is still useful, even if only for illustrative purposes.

There are many factors that we need to consider when producing a forecast. Some of the most important include the forecast horizon, the last observation(s) that we feed the model, and how often we update the model (if at all). Different combinations of these factors yield their own kinds of forecast with their own interpretations. In our case, we are going to do a recursive multi-step forecast without updating the model, with a step size of 7 days. This means that we are going to use one single model to produce batches of 7 forecasts at a time. After producing one batch, the model sees 7 more data points, corresponding to the dates that it just predicted, and it produces 7 more forecasts. The model, however, is not retrained as new data becomes available. In terms of our dataset, this means that we will produce a forecast of maximum wave heights for each day of the following week.

For point forecasting, we are going to use the Mean Absolute Error (MAE) as the performance metric. In the case of probabilistic forecasting, we will aim for an empirical coverage, or coverage probability, of 80%.

The scene is set. Let’s get our hands dirty with the experiments!

While not originally designed for time series forecasting, gradient boosting algorithms in general, and XGBoost in particular, can be great predictors. We just need to feed the algorithm the data in the right format. For instance, if we want to use three lags of our target series, we can simply create three columns (say, in a pandas dataframe) with the lagged values and voilà! An XGBoost forecaster. However, this process can quickly become onerous, especially if we intend to use many lags. Luckily for us, the library Skforecast [4] can do this for us. In fact, Skforecast is the one-stop shop for developing and testing all kinds of forecasters. I honestly can’t recommend it enough!
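For illustration, here is what those lag columns could look like in pandas. The three-lag setup is just the toy example from the paragraph above, applied to the daily series built earlier:

```python
import pandas as pd

# Build a small design matrix with three lagged copies of the target
lag_df = pd.DataFrame({
    "y": daily_max,
    "lag_1": daily_max.shift(1),
    "lag_2": daily_max.shift(2),
    "lag_3": daily_max.shift(3),
}).dropna()

X, y = lag_df[["lag_1", "lag_2", "lag_3"]], lag_df["y"]  # features and target, ready for XGBoost
```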

Creating a forecaster with Skforecast is pretty straightforward. We just need to create a ForecasterAutoreg object with an XGBoost regressor, which we can then fine-tune. On top of the XGBoost hyperparameters that we would typically optimise for, we also need to search for the best number of lags to include in our model. To do that, Skforecast provides a Bayesian optimisation method that runs Optuna in the background, bayesian_search_forecaster.

Defining and optimising hyperparameters of XGBoost forecaster
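The snippet itself is embedded in the original post; below is a minimal sketch of what such a search can look like, assuming a skforecast version that still exposes ForecasterAutoreg (pre-0.14). The split sizes, search ranges and number of trials are illustrative, not the exact values from my notebook:

```python
from xgboost import XGBRegressor
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.model_selection import bayesian_search_forecaster

# Recursive autoregressive forecaster built on top of an XGBoost regressor
forecaster = ForecasterAutoreg(
    regressor=XGBRegressor(random_state=42),
    lags=21,  # starting value; the search below also explores the number of lags
)

# Search space: XGBoost hyperparameters plus the number of lags
def search_space(trial):
    return {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000, step=100),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.5),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0),
        "lags": trial.suggest_int("lags", 7, 30),
    }

results, best_trial = bayesian_search_forecaster(
    forecaster=forecaster,
    y=daily_max[:-150],                       # keep the test set out of the search
    search_space=search_space,
    steps=7,                                  # optimise for weekly forecast batches
    metric="mean_absolute_error",
    initial_train_size=len(daily_max) - 300,  # validate on the months before the test set
    refit=False,
    n_trials=20,
    return_best=True,                         # leave the forecaster fitted with the best setup
)
```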

The search yields an optimised XGBoost forecaster which, among other hyperparameters, uses 21 lags of the target variable, i.e. 21 days of maximum wave heights to predict the next:

Lags: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21] 
Parameters: {'n_estimators': 900,
'max_depth': 12,
'learning_rate': 0.30394338985367425,
'reg_alpha': 0.5,
'reg_lambda': 0.0,
'subsample': 1.0,
'colsample_bytree': 0.2}

But is the model any good? Let’s find out!

Point forecasting

First, let’s look at how well the XGBoost forecaster does at predicting the next 7 days of maximum wave heights. The chart below plots the predictions against the actual values of our test set. We can see that the predictions tend to follow the general trend of the actual data, but they are far from perfect.

Maximum wave heights and XGBoost predictions. Image by author

To create the predictions depicted above, we have used Skforecast’s backtesting_forecaster function, which allows us to evaluate the model on a test set, as shown in the following code snippet. On top of the predictions, we also get a performance metric, which in our case is the MAE.

Backtesting our XGBoost forecaster
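The embedded snippet is not reproduced here; a minimal sketch, under the same assumptions as before (pre-0.14 skforecast, illustrative test-set size), would be:

```python
from skforecast.model_selection import backtesting_forecaster

mae, predictions = backtesting_forecaster(
    forecaster=forecaster,
    y=daily_max,
    initial_train_size=len(daily_max) - 150,  # everything before the test set
    steps=7,                                  # weekly batches, as described above
    refit=False,                              # the model is never retrained
    metric="mean_absolute_error",
    verbose=False,
)
```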

Our model’s MAE is 0.64. This means that, on average, our predictions are 64 cm off the actual measurement. To put this value in context, the standard deviation of the target variable is 0.86. Therefore, our model’s average error is about 0.74 units of the standard deviation. Moreover, if we were to simply use the previous equivalent observation as a dummy best guess for our forecast, we would get an MAE of 0.84 (see point 1 of this notebook). All things considered, it seems that, so far, our model is better than a simple logical rule, which is a relief!
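The naive baseline is easy to reproduce: the forecast for each day is simply the value observed seven days earlier. A sketch, using the last 150 days as a stand-in for the test set:

```python
# Dummy forecaster: today's prediction is the observation from one week ago
naive_pred = daily_max.shift(7)
naive_mae = (daily_max - naive_pred).abs().tail(150).mean()
```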

Probabilistic forecasting

Skforecast allows us to calculate distribution intervals where the future outcome is likely to fall. The library provides two methods: bootstrapped residuals and quantile regression. The results are not very different, so I am going to focus here on the bootstrapped residuals method. You can see more results in part 3 of this notebook.

The idea behind building prediction intervals using bootstrapped residuals is that we can randomly take a model’s forecast errors (residuals) and add them to the same model’s forecasts. By repeating the process a number of times, we can build an equal number of alternative forecasts. These predictions follow a distribution that we can get prediction intervals from. In other words, if we assume that the forecast errors are random and identically distributed in time, adding these errors creates a universe of equally possible forecasts. In this universe, we would expect to see at least a given percentage of the actual values of the forecasted series. In our case, we will aim for 80% of the values (that is, a coverage of 80%).

To build the prediction intervals with Skforecast, we follow a 3-step process: first, we generate forecasts for our validation set; second, we compute the residuals from those forecasts and store them in our forecaster class; third, we get the probabilistic forecasts for our test set. The second and third steps are illustrated in the snippet below (the first one corresponds to the code snippet in the previous section). Lines 14-17 set the parameters that govern our bootstrap calculation.

Generating prediction intervals with bootstrapped residuals
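A sketch of steps two and three could look like the code below. Here `val_predictions` stands for the backtest output over the validation period, and the exact argument names vary slightly across skforecast versions:

```python
from skforecast.model_selection import backtesting_forecaster

# Step 2: compute the validation residuals and store them in the forecaster
residuals = daily_max.loc[val_predictions.index] - val_predictions["pred"]
forecaster.set_out_sample_residuals(residuals=residuals)

# Step 3: backtest on the test set, bootstrapping the stored residuals to
# build an 80% central prediction interval (10th to 90th percentile)
mae, interval_preds = backtesting_forecaster(
    forecaster=forecaster,
    y=daily_max,
    initial_train_size=len(daily_max) - 150,
    steps=7,
    refit=False,
    metric="mean_absolute_error",
    interval=[10, 90],          # 80% interval
    n_boot=500,                 # number of bootstrapped forecast paths
    in_sample_residuals=False,  # use the stored out-of-sample residuals
    verbose=False,
)
```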

The resulting prediction intervals are depicted within the chart below.

Bootstrapped prediction intervals with the XGBoost forecaster. Image by author

84.67% of the values in the test set fall inside our prediction intervals, which is just above our target of 80%. While this is not bad, it may also mean that we are overshooting and our intervals are too big. Think of it this way: if we said that tomorrow’s waves would be between 0 and infinity meters high, we would always be right, but the forecast would be useless! To get an idea of how big our intervals are, Skforecast’s docs suggest that we compute the area of our intervals by taking the sum of the differences between the upper and lower boundaries of the intervals. This is not an absolute measure, but it can help us compare across forecasters. In our case, the area is 348.28.
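Both numbers are straightforward to compute from the interval predictions. A sketch, assuming the backtest output above with its `lower_bound` and `upper_bound` columns:

```python
actuals = daily_max.loc[interval_preds.index]

# Empirical coverage: share of test observations inside the interval
coverage = (
    (actuals >= interval_preds["lower_bound"])
    & (actuals <= interval_preds["upper_bound"])
).mean()

# Interval area: sum of the interval widths over the test set
area = (interval_preds["upper_bound"] - interval_preds["lower_bound"]).sum()
```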

These are our XGBoost results. How about Lag-Llama?

The authors of Lag-Llama provide a demo notebook to start forecasting with the model without fine-tuning it. The code is ready to produce probabilistic forecasts given a set horizon, or prediction length, and a context length, or the amount of previous data points to consider in the forecast. We just need to call the get_llama_predictions function below:

Modified version of the get_llama_predictions function to produce probabilistic forecasts.

The core of the function is a LagLlamaEstimator class (lines 19–47), which is a PyTorch Lightning estimator based on the GluonTS [5] package for probabilistic forecasting. I suggest you go through the GluonTS docs to get acquainted with the package.
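For reference, a condensed sketch of such a function, along the lines of the authors’ demo notebook, is shown below. The checkpoint path and default arguments are illustrative, not the exact ones from my repo:

```python
import torch
from gluonts.evaluation import make_evaluation_predictions
from lag_llama.gluon.estimator import LagLlamaEstimator

def get_llama_predictions(dataset, prediction_length, context_length=32,
                          num_samples=100, device="cuda", batch_size=64):
    """Zero-shot probabilistic forecasts from a pretrained Lag-Llama checkpoint."""
    ckpt = torch.load("lag-llama.ckpt", map_location=device)
    estimator_args = ckpt["hyper_parameters"]["model_kwargs"]

    estimator = LagLlamaEstimator(
        ckpt_path="lag-llama.ckpt",
        prediction_length=prediction_length,
        context_length=context_length,
        # architecture parameters taken from the pretrained checkpoint
        input_size=estimator_args["input_size"],
        n_layer=estimator_args["n_layer"],
        n_embd_per_head=estimator_args["n_embd_per_head"],
        n_head=estimator_args["n_head"],
        scaling=estimator_args["scaling"],
        time_feat=estimator_args["time_feat"],
        batch_size=batch_size,
        num_parallel_samples=num_samples,
        device=torch.device(device),
    )

    lightning_module = estimator.create_lightning_module()
    transformation = estimator.create_transformation()
    predictor = estimator.create_predictor(transformation, lightning_module)

    # Sample-based forecasts plus the corresponding ground-truth series
    forecasts, tss = make_evaluation_predictions(
        dataset=dataset, predictor=predictor, num_samples=num_samples
    )
    return list(forecasts), list(tss)
```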

We can leverage the get_llama_predictions function to produce recursive multi-step forecasts. We simply need to produce batches of predictions over consecutive windows. This is what we do in the function below, recursive_forecast:

This function produces recursive probabilistic and point forecasts

In lines 37 to 39 of the code snippet above, we extract the 10th and 90th percentiles to produce an 80% probabilistic forecast (90–10), as well as the median of the probabilistic prediction to get a point forecast. If you need to learn more about the output of the model, I suggest you have a look at the authors’ tutorial mentioned above.
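Since that snippet is embedded in the original post, here is a hypothetical sketch of the same idea: feed the model everything observed so far, forecast a 7-day batch, move the window forward by 7 days, and keep the 10th, 50th and 90th percentiles of the samples:

```python
import numpy as np
import pandas as pd
from gluonts.dataset.pandas import PandasDataset

def recursive_forecast(series, test_index, context_length=64,
                       prediction_length=7, num_samples=100):
    """Recursive 7-day batches of probabilistic and point forecasts."""
    batches = []
    for start in range(0, len(test_index), prediction_length):
        idx = test_index[start:start + prediction_length]

        # History: everything strictly before the first day of this batch,
        # cast to float32 as Lag-Llama expects
        history = series.loc[:idx[0]].iloc[:-1].astype("float32").to_frame("target")
        dataset = PandasDataset(history, target="target", freq="D")

        forecasts, _ = get_llama_predictions(
            dataset, prediction_length, context_length, num_samples
        )
        samples = forecasts[0].samples  # shape: (num_samples, prediction_length)

        batches.append(pd.DataFrame({
            "lower": np.percentile(samples, 10, axis=0)[: len(idx)],   # 80% interval, lower bound
            "median": np.percentile(samples, 50, axis=0)[: len(idx)],  # point forecast
            "upper": np.percentile(samples, 90, axis=0)[: len(idx)],   # 80% interval, upper bound
        }, index=idx))

    return pd.concat(batches)
```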

The authors of the model advise that different datasets and forecasting tasks may require different context lengths. In our case, we try context lengths of 32, 64 and 128 tokens (lags). The chart below shows the results of the 64-token model.

Zero-shot Lag-Llama predictions with a context length of 128 tokens. Image by author

Point forecasting

As we said above, Lag-Llama is not meant to calculate point forecasts, but we can get one by taking the median of the probabilistic interval that it returns. Another potential point forecast would be the mean, although it would be subject to outliers in the interval. In any case, for our particular dataset, both options yield similar results.

The MAE of the 32-token model was 0.75. That of the 64-token model was 0.77, while the MAE of the 128-token model was 0.77 as well. These are all higher than the XGBoost forecaster’s MAE, which went down to 0.64. In fact, they are very close to the baseline, dummy model that used the previous week’s value as today’s forecast (MAE 0.84).

Probabilistic forecasting

With a predicted interval coverage of 68.67% and an interval area of 280.05, the 32-token forecast does not perform up to our required standard. The 64-token one reaches a 74.0% coverage, which gets closer to the 80% region that we are looking for. To do so, it takes an interval area of 343.74. The 128-token model overshoots but is closer to the mark, with an 84.67% coverage and an area of 399.25. We can see an interesting trend here: more coverage implies a larger interval area. This should not always be the case, since a very narrow interval could always be right. However, in practice this trade-off is very much present in all the models I have trained.

Notice the periodic bulges in the chart (around March 10 or April 7, for instance). Since we are producing a 7-day forecast, the bulges represent the increased uncertainty as we move away from the last observation that the model saw. In other words, a forecast for the next day will be less uncertain than a forecast for the day after that, and so on.

The 128-token model yields very similar results to the XGBoost forecaster, which had an area of 348.28 and a coverage of 84.67%. Based on these results, we can say that, with no training at all, Lag-Llama’s performance is rather solid and up to par with an optimised traditional forecaster.

Lag-Llama’s GitHub repo comes with a “best practices” section with tips for using and fine-tuning the model. The authors especially recommend tuning the context length and the learning rate. We are going to explore some of the suggested values for these hyperparameters. The code snippet below, which I have taken and modified from the authors’ fine-tuning tutorial notebook, shows how we can conduct a small grid search:

Grid search for fine-tuning Lag-Llama

In the code above, we loop over context lengths of 32, 64, and 128 tokens, as well as three learning rates, including 0.001 and 0.005. Within the loop, we also calculate some test metrics: Coverage[0.8], Coverage[0.9] and Mean Absolute Error of Coverage (MAE Coverage). Coverage[0.x] measures how many predictions fall inside their prediction interval. For instance, a good model should have a Coverage[0.8] of around 80%. MAE Coverage, on the other hand, measures the deviation of the actual coverage probabilities from the nominal coverage levels. Therefore, a good model in our case should be one with a small MAE Coverage and coverages of around 80% and 90%, respectively.
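The actual snippet is embedded above; the sketch below only captures its structure, under a few assumptions: the learning-rate grid shown is illustrative, `train_dataset`, `val_dataset` and `test_dataset` stand for GluonTS datasets built from our three splits, and the coverage metrics come from GluonTS’s Evaluator:

```python
import torch
from itertools import product
from gluonts.evaluation import make_evaluation_predictions, Evaluator
from lag_llama.gluon.estimator import LagLlamaEstimator

ckpt = torch.load("lag-llama.ckpt", map_location="cuda")
estimator_args = ckpt["hyper_parameters"]["model_kwargs"]

results = []
for context_length, lr in product([32, 64, 128], [1e-4, 1e-3, 5e-3]):  # illustrative grid
    estimator = LagLlamaEstimator(
        ckpt_path="lag-llama.ckpt",
        prediction_length=7,
        context_length=context_length,
        lr=lr,
        nonnegative_pred_samples=True,  # wave heights cannot be negative
        aug_prob=0,
        # architecture parameters from the pretrained checkpoint
        input_size=estimator_args["input_size"],
        n_layer=estimator_args["n_layer"],
        n_embd_per_head=estimator_args["n_embd_per_head"],
        n_head=estimator_args["n_head"],
        time_feat=estimator_args["time_feat"],
        batch_size=64,
        num_parallel_samples=100,
        trainer_kwargs={"max_epochs": 50},
    )

    # Fine-tune with an explicit validation set so that checkpointing tracks
    # validation loss instead of training loss
    predictor = estimator.train(
        training_data=train_dataset,
        validation_data=val_dataset,
        cache_data=True,
        shuffle_buffer_length=1000,
    )

    # Evaluate on the test set
    forecast_it, ts_it = make_evaluation_predictions(
        dataset=test_dataset, predictor=predictor, num_samples=100
    )
    agg_metrics, _ = Evaluator(quantiles=[0.5, 0.8, 0.9])(list(ts_it), list(forecast_it))

    results.append({
        "context_length": context_length,
        "lr": lr,
        "Coverage[0.8]": agg_metrics["Coverage[0.8]"],
        "Coverage[0.9]": agg_metrics["Coverage[0.9]"],
        "MAE_Coverage": agg_metrics["MAE_Coverage"],
    })
```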

One of the main differences with respect to the authors’ original fine-tuning code is line 46, where I pass a validation set to the training call; the original code does not include one. In my experience, not including it meant that all the models I trained ended up overfitting the training data. On the other hand, with a validation set most models were optimised in epoch 0 and did not improve the validation loss thereafter. With more data, we might see less extreme outcomes.

Once trained, most of the models in the loop yield an MAE of 0.5 and coverages of 1 on the test set. This means that the models have very broad prediction intervals, but the predictions are not very precise. The model that strikes a better balance is model 6 (counting from 0 to 8 in the loop), with the following hyperparameters and metrics:

{'context_length': 128,
 'lr': 0.001,
 'Coverage[0.8]': 0.7142857142857143,
 'Coverage[0.9]': 0.8571428571428571,
 'MAE_Coverage': 0.36666666666666664}

Since this is the most promising model, we are going to run it through the same tests that we have used with the other forecasters.

The chart below shows the predictions from the fine-tuned model.

Fine-tuned Lag-Llama predictions with a context length of 64 tokens. Image by author

Something that catches the eye very quickly is that the prediction intervals are substantially smaller than those from the zero-shot version. In fact, the interval area is 188.69. With these prediction intervals, the model reaches a coverage of 56.67% over the 7-day recursive forecast. Remember that our best zero-shot predictions, with a 128-token context, had an area of 399.25, reaching a coverage of 84.67%. This means roughly a 53% reduction in the interval area, with only a 33% decrease in coverage. However, the fine-tuned model is too far from the 80% coverage that we are aiming for, whereas the zero-shot model with 128 tokens wasn’t.

When it comes to point forecasting, the MAE of the model is 0.77, which is not an improvement over the zero-shot forecasts and is worse than the XGBoost forecaster’s.

Overall, the fine-tuned model does not leave us with a good picture: it does not do better than the zero-shot version at either point or probabilistic forecasting. The authors do suggest that the model can improve if fine-tuned with more data, so it may be that our training set was not large enough.

To recap, let’s ask again the question that we set out at the beginning of this post: is Lag-Llama better at forecasting than XGBoost? For our dataset, the short answer is no, they are similar. The long answer is more complicated, though. Zero-shot forecasts with a 128-token context length were at the same level as XGBoost in terms of probabilistic forecasting. Fine-tuning Lag-Llama further reduced the prediction area, making the model’s correct forecasts more precise, albeit at a considerable cost in terms of probabilistic coverage. This raises the question of where the model could get with more training data. But more data we did not have, so we can’t say that Lag-Llama beat XGBoost.

These results inevitably open a broader debate: since neither is better than the other in terms of performance, which one should we use? In this case, we would need to consider other variables such as ease of use, deployment and maintenance, and inference costs. While I haven’t formally tested the two options on any of these points, I suspect XGBoost would come out on top. Less data- and resource-hungry, pretty robust to overfitting and time-tested are hard-to-beat characteristics, and XGBoost has them all.

But don’t take my word for it! The code that I used is publicly available in this GitHub repo, so go have a look and run it yourself.
