Time Series Forecasting with XGBoost and LightGBM: Predicting Energy Consumption
Problem
Preprocessing
Training the Models
Evaluation
Preprocessing Weather Data
Conclusion & Future Steps

Image generated by Stable Diffusion

In the simplest terms, time series forecasting is the technique of predicting future values based on historical data. One of the most popular fields where time series forecasting is applied today is the cryptocurrency market, where one wants to predict how the prices of popular cryptocurrencies, like Bitcoin or Ethereum, will fluctuate over the next few days or even longer periods of time. Another real-world case is energy consumption prediction. Especially in the contemporary world, where energy is one of the primary points of discussion, being able to accurately predict the demand for energy is an essential tool for any electric power company. In this article, we will take a quick but practical look at how this is done using ensemble models such as Extreme Gradient Boosting (XGBoost) and Light Gradient Boosting Machine (LightGBM).

We will focus on the energy consumption problem: given a sufficiently large dataset of the daily energy consumption of various households in a city, we are tasked with predicting future energy demand as accurately as possible. For the purposes of this tutorial, I have chosen the London Energy Dataset, which contains the energy consumption of 5,567 randomly chosen households in the city of London, UK, for the period November 2011 to February 2014. Later on, in an attempt to improve our predictions, we combine this set with the London Weather Dataset in order to add weather-related data to the process.

The very first thing we have to do in every project is to get a good understanding of the data and preprocess it if needed. To view the data with pandas we can do:
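A minimal sketch of this step, assuming the dataset has been downloaded locally as `london_energy.csv` (the file name is an assumption):

```python
import pandas as pd

# Load the London Energy Dataset (file name assumed)
energy = pd.read_csv("london_energy.csv")

print(energy.head())
print(energy.isna().sum())  # check for missing values per column
```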

The `LCLid` is a unique string that identifies each household, the `Date` is self-explanatory, the `KWH` is the total number of kilowatt-hours consumed on that date, and there are no missing values at all. Since we want to predict consumption in a general fashion and not per household, we need to group the results by date and average the kilowatt-hours.
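One way this aggregation could look (the conversion of `Date` to `datetime` here is an assumption made to simplify the later steps):

```python
# Average the kilowatt-hours of all households per day
energy = energy.groupby("Date")["KWH"].mean().reset_index()

# Parse the dates and make sure the rows are in chronological order
energy["Date"] = pd.to_datetime(energy["Date"])
energy = energy.sort_values("Date").reset_index(drop=True)
```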

At this point, it would be great to have a look at how consumption changes over the years. A line plot can expose this:
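A simple matplotlib sketch of that plot (figure size and labels are assumptions):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(14, 5))
ax.plot(energy["Date"], energy["KWH"])
ax.set_xlabel("Date")
ax.set_ylabel("Average daily kWh per household")
ax.set_title("Energy Consumption on the Entire Dataset")
plt.show()
```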

Energy Consumption Plot on the Entire Dataset

The seasonality is pretty obvious. During the winter months we observe high energy demand, while during the summer consumption is at its lowest. This behavior repeats itself for every year in the dataset, with different highs and lows. To visualize the fluctuation over the span of a year we can do:
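One possible way to draw this, overlaying each calendar year as its own line against the day of the year (this particular layout is an assumption):

```python
# Temporary helper columns for the yearly view
energy["year"] = energy["Date"].dt.year
energy["day_of_year"] = energy["Date"].dt.dayofyear

fig, ax = plt.subplots(figsize=(14, 5))
for year, group in energy.groupby("year"):
    ax.plot(group["day_of_year"], group["KWH"], label=str(year))
ax.set_xlabel("Day of year")
ax.set_ylabel("Average daily kWh per household")
ax.set_title("Yearly Energy Consumption")
ax.legend()
plt.show()
```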

Yearly Energy Consumption

To train a model like XGBoost or LightGBM we need to create the features ourselves. Currently, we have just one feature: the full date. We can extract different features from it, such as the day of the week, the day of the year, the month, and others. To achieve this we can do:
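A sketch of that feature engineering; the exact set of date-based features is an assumption:

```python
def create_features(df):
    """Extract calendar features from the Date column."""
    df = df.copy()
    df["day_of_week"] = df["Date"].dt.dayofweek
    df["day_of_year"] = df["Date"].dt.dayofyear
    df["month"] = df["Date"].dt.month
    df["quarter"] = df["Date"].dt.quarter
    df["year"] = df["Date"].dt.year
    return df

energy = create_features(energy)
```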

So, the `date` feature is now redundant. Before dropping it, we will use it to split our dataset into training and testing sets. Contrary to conventional training, in time series we can't just split the set randomly, because the order of the data is extremely important and we are only allowed to use past data. Otherwise, we might end up predicting a value while taking future values into account too! The dataset contains almost 2.5 years of data, so for the testing set we will use only the last 6 months. If the training set were larger, we would have used the entire last year as the testing set.
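A chronological split might look like the following; the exact cut-off date is an assumption based on the dataset ending in February 2014:

```python
# Everything before the cut-off is training data, the last ~6 months are the test set
cutoff = "2013-09-01"  # assumed cut-off date
training_data = energy.loc[energy["Date"] < cutoff]
testing_data = energy.loc[energy["Date"] >= cutoff]
```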

To visualize the split and distinguish between the training and testing sets we can plot:
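A quick sketch of that visualization, reusing the dataframes from the previous step:

```python
fig, ax = plt.subplots(figsize=(14, 5))
ax.plot(training_data["Date"], training_data["KWH"], label="Training set")
ax.plot(testing_data["Date"], testing_data["KWH"], label="Testing set")
ax.axvline(pd.to_datetime(cutoff), color="black", linestyle="--")
ax.set_title("Training-Testing Split")
ax.legend()
plt.show()
```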

Visualizing Training-Testing Split

Now we can drop the `date` feature and create the training and testing sets:
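Dropping the date happens implicitly here by selecting only the engineered features; the feature list mirrors the assumed `create_features` sketch above:

```python
FEATURES = ["day_of_week", "day_of_year", "month", "quarter", "year"]
TARGET = "KWH"

X_train, y_train = training_data[FEATURES], training_data[TARGET]
X_test, y_test = testing_data[FEATURES], testing_data[TARGET]
```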

Hyperparameter optimization will be done with grid search. Grid search takes a set of parameters and candidate values as its configuration and tries out every possible combination. The parameter configuration that achieves the best result forms the best estimator. Grid search uses cross-validation too, so it is crucial to provide an appropriate splitting mechanism. Again, due to the nature of the problem, we can't just use plain k-fold cross-validation. Scikit-learn provides the `TimeSeriesSplit` method, which splits the data incrementally while respecting its temporal continuity.
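A sketch of the XGBoost grid search; the parameter grid values below are illustrative assumptions, not the article's original configuration:

```python
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from xgboost import XGBRegressor

# Illustrative parameter grid (assumed values)
xgb_grid = {
    "n_estimators": [100, 500, 1000],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
}

# Time-aware cross-validation splitter
tscv = TimeSeriesSplit(n_splits=5)

xgb_search = GridSearchCV(
    XGBRegressor(objective="reg:squarederror"),
    param_grid=xgb_grid,
    cv=tscv,
    scoring="neg_mean_squared_error",
    verbose=1,
)
xgb_search.fit(X_train, y_train)
best_xgb = xgb_search.best_estimator_
```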

For the LightGBM model we can do the same by providing different parameters:
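Again, the grid below is only an assumed example of what the LightGBM search could look like:

```python
from lightgbm import LGBMRegressor

# Illustrative parameter grid (assumed values)
lgbm_grid = {
    "n_estimators": [100, 500, 1000],
    "num_leaves": [15, 31, 63],
    "learning_rate": [0.01, 0.05, 0.1],
}

lgbm_search = GridSearchCV(
    LGBMRegressor(),
    param_grid=lgbm_grid,
    cv=tscv,  # reuse the TimeSeriesSplit from above
    scoring="neg_mean_squared_error",
    verbose=1,
)
lgbm_search.fit(X_train, y_train)
best_lgbm = lgbm_search.best_estimator_
```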

To evaluate the best estimator on the test set we will calculate some metrics: the Mean Absolute Error (MAE), the Mean Squared Error (MSE), and the Mean Absolute Percentage Error (MAPE). Each of these provides a different perspective on the actual performance of the trained model. Moreover, we will plot a line diagram to better visualize the model's performance.
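A minimal evaluation helper along these lines (the function name and plot layout are assumptions):

```python
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    mean_absolute_percentage_error,
)

def evaluate_model(model, X_test, y_test, dates, label):
    """Print MAE, MSE, MAPE and plot actual vs. predicted consumption."""
    predictions = model.predict(X_test)
    print(f"{label} MAE:  {mean_absolute_error(y_test, predictions):.4f}")
    print(f"{label} MSE:  {mean_squared_error(y_test, predictions):.4f}")
    print(f"{label} MAPE: {mean_absolute_percentage_error(y_test, predictions):.4f}")

    fig, ax = plt.subplots(figsize=(14, 5))
    ax.plot(dates, y_test, label="Actual")
    ax.plot(dates, predictions, label="Predicted")
    ax.set_title(label)
    ax.legend()
    plt.show()
```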

Lastly, to evaluate any of the aforementioned models we have to run the following:
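For example, using the assumed helper defined above:

```python
evaluate_model(best_xgb, X_test, y_test, testing_data["Date"], "XGBoost")
evaluate_model(best_lgbm, X_test, y_test, testing_data["Date"], "LightGBM")
```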

XGBoost Results
LightGBM Results

Even though XGBoost predicts the energy consumption during the winter months more accurately, to strictly quantify and compare the performances we need to calculate the error metrics. Looking at the table below, it is more than obvious that XGBoost outperforms LightGBM in all cases.

The model performs relatively well, but is there a way to improve it even further? The answer is yes. There are many different tips and tricks that can be employed in order to achieve better results. One of them is to use auxiliary features that are directly or indirectly correlated with energy consumption. For example, weather data can play a decisive role when it comes to predicting energy demand. That's why we choose to enhance our dataset with weather data from the London Weather Dataset.

First, let's take a look at the structure of the data:
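A sketch of the inspection step, assuming the weather file is named `london_weather.csv`:

```python
# Load the London Weather Dataset (file name assumed)
weather = pd.read_csv("london_weather.csv")

print(weather.info())
print(weather.isna().sum())  # how many values are missing per column
```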

There is a lot of missing data that needs to be filled in. Filling missing data is not trivial and depends on each case. Since we have weather data, where every day depends on the previous and next days, we will fill those values by interpolating. Also, we will convert the `date` column to `datetime` and then merge the two dataframes in order to get one enhanced dataframe.
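One way this could be done; the `yyyymmdd` integer date format is an assumption about the weather file:

```python
# Fill gaps by linear interpolation between neighboring days
weather = weather.interpolate(method="linear")

# Convert the integer date column (assumed yyyymmdd format) to datetime
weather["date"] = pd.to_datetime(weather["date"], format="%Y%m%d")

# Merge the energy and weather dataframes on the date
enhanced = energy.merge(weather, how="inner", left_on="Date", right_on="date")
enhanced = enhanced.drop(columns=["date"])
```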

Remember that after generating the enhanced set, we have to re-run the splitting process to get the new `training_data` and `testing_data`. Don't forget to include the new features too, as in the sketch below.
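A sketch of the re-split; the weather column names listed here are assumptions based on the London Weather Dataset and may need to be adjusted:

```python
# Assumed weather column names
WEATHER_FEATURES = [
    "cloud_cover", "sunshine", "global_radiation", "max_temp",
    "mean_temp", "min_temp", "precipitation", "pressure", "snow_depth",
]
FEATURES = FEATURES + WEATHER_FEATURES

training_data = enhanced.loc[enhanced["Date"] < cutoff]
testing_data = enhanced.loc[enhanced["Date"] >= cutoff]
X_train, y_train = training_data[FEATURES], training_data[TARGET]
X_test, y_test = testing_data[FEATURES], testing_data[TARGET]
```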

There is no need to update the training steps. After training the models on the new dataset, we get the following results:

XGBoost Enhanced with Weather Results
LightGBM Enhanced with Weather Results

The weather data improve the performance of both models by a significant margin. In particular, in the XGBoost scenario the MAE is reduced by almost 44%, while the MAPE drops from 19% to 16%. For LightGBM, the MAE drops by 42% and the MAPE declines from 19.8% to 16.7%.

Ensemble models are very powerful machine learning tools that can be utilized for time series forecasting problems. In this article, we have seen how this is done in the case of energy consumption. At first, we trained our models using solely date-based features. Later on, we incorporated additional data correlated with the task at hand into the training process, which boosted the results in a notable way.

The performance can potentially be improved even further by incorporating so-called lag features or by trying different hyperparameter optimization techniques such as randomized search or Bayesian optimization. I encourage you to try these out yourselves and share the results in the comments below.
