Time Series for Climate Change: Solar Irradiance Forecasting


Photo by Andrey Grinkevich on Unsplash

This is Part 2 of the series Time Series for Climate Change.

Solar energy is an increasingly prevalent source of clean energy.

Sunlight is converted into electricity by photovoltaic devices. Since these devices do not pollute, they’re considered a source of clean energy. Besides environmental advantages, solar energy is also appealing because of its low cost. The initial investment is large, but the low long-term costs make it worthwhile.

The amount of energy produced depends on the level of solar radiation. But solar conditions can change rapidly. For instance, a cloud may unexpectedly cover the sun and reduce the efficiency of photovoltaic devices.

So, solar energy systems rely on forecasting models to predict solar conditions. As in the case of wind power, accurate forecasts have a direct impact on the effectiveness of these systems.

Beyond energy production

Forecasting solar irradiance has other applications besides energy, for instance:

  • Agriculture: Farmers can leverage forecasts to optimize crop production. Examples include estimating when to plant or harvest a crop, or optimizing irrigation systems;
  • Civil engineering: Forecasting solar irradiance is also valuable for designing and constructing buildings. Predictions can be used to maximize solar radiation, thereby reducing heating/cooling costs. Forecasts can also be useful to configure air-conditioning systems. This contributes to the efficient use of energy inside buildings.

Challenges, and what’s next

Despite their importance, solar conditions are highly variable and difficult to predict. They depend on several meteorological factors, for which data is often unavailable.

In the rest of this article, we’ll develop a model for solar irradiance forecasting. Among other things, you’ll learn how to:

  • visualize a multivariate time series;
  • transform a multivariate time series for supervised learning;
  • do feature selection based on correlation and importance scores.

This tutorial is based on a dataset collected by the U.S. Department of Agriculture. You can check more details in reference [1]. The complete code for this tutorial is available on GitHub.

The data is a multivariate time series: at each instant, an observation is composed of several variables. These include the following weather and hydrological variables:

  • Solar irradiance (watts per square meter);
  • Wind direction;
  • Snow depth;
  • Wind speed;
  • Dew point temperature;
  • Precipitation;
  • Vapor pressure;
  • Relative humidity;
  • Air temperature.

The series spans from October 1, 2007, to October 1, 2013. It’s collected at an hourly frequency, totaling 52,608 observations.
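As a quick sanity check on that figure, we can count the hourly timestamps in the interval with pandas. This is a minimal sketch that assumes the first observation is at midnight on October 1, 2007, and the last one at 23:00 on September 30, 2013; the exact boundaries of the real data set may differ.

```python
import pandas as pd

# hourly timestamps covering six years of data (assumed boundaries)
index = pd.date_range(start='2007-10-01 00:00',
                      end='2013-09-30 23:00',
                      freq=pd.Timedelta(hours=1))

# 2192 days (two leap years included) times 24 hours
n_obs = len(index)
print(n_obs)
```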

After downloading the data, we can read it using pandas:

import re
import pandas as pd
# src module available here: https://github.com/vcerqueira/tsa4climate/tree/main/src
from src.log import LogTransformation

# a sample here: https://github.com/vcerqueira/tsa4climate/tree/main/content/part_2/assets
assets = 'path_to_data_directory'

DATE_TIME_COLS = ['month', 'day', 'calendar_year', 'hour']
# we'll focus on the data collected at a particular station called smf1
STATION = 'smf1'

# columns to read from each file: the date-time parts plus the station's measurement
COLUMNS_PER_FILE = {
    'incoming_solar_final.csv': DATE_TIME_COLS + [f'{STATION}_sin_w/m2'],
    'wind_dir_raw.csv': DATE_TIME_COLS + [f'{STATION}_wd_deg'],
    'snow_depth_final.csv': DATE_TIME_COLS + [f'{STATION}_sd_mm'],
    'wind_speed_final.csv': DATE_TIME_COLS + [f'{STATION}_ws_m/s'],
    'dewpoint_final.csv': DATE_TIME_COLS + [f'{STATION}_dpt_C'],
    'precipitation_final.csv': DATE_TIME_COLS + [f'{STATION}_ppt_mm'],
    'vapor_pressure.csv': DATE_TIME_COLS + [f'{STATION}_vp_Pa'],
    'relative_humidity_final.csv': DATE_TIME_COLS + [f'{STATION}_rh'],
    'air_temp_final.csv': DATE_TIME_COLS + [f'{STATION}_ta_C'],
}

data_series = {}
for file in COLUMNS_PER_FILE:
    file_data = pd.read_csv(f'{assets}/{file}')

    # copying to avoid assigning to a slice of file_data
    var_df = file_data[COLUMNS_PER_FILE[file]].copy()

    # building a timestamp from the date-time columns
    var_df['datetime'] = \
        pd.to_datetime([f'{year}/{month}/{day} {hour}:00'
                        for year, month, day, hour in zip(var_df['calendar_year'],
                                                          var_df['month'],
                                                          var_df['day'],
                                                          var_df['hour'])])

    var_df = var_df.drop(DATE_TIME_COLS, axis=1)
    var_df = var_df.set_index('datetime')
    series = var_df.iloc[:, 0].sort_index()

    data_series[file] = series

mv_series = pd.concat(data_series, axis=1)
mv_series.columns = [re.sub('_final.csv|_raw.csv|.csv', '', x) for x in mv_series.columns]
mv_series.columns = [re.sub('_', ' ', x) for x in mv_series.columns]
mv_series.columns = [x.title() for x in mv_series.columns]

mv_series = mv_series.astype(float)

This code results in the following data set:

Sample of the multivariate time series

Exploratory data analysis

Multivariate time series plot in log-scale. For visualization purposes, the series was resampled to a daily frequency. This was done by taking the mean value per day. Image by author.

The series plot suggests there’s a strong yearly seasonality. Radiation levels peak during summertime, and other variables show similar patterns. Apart from seasonal fluctuations, the level of the time series is stable over time.
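The daily resampling used for the plot above can be sketched with pandas. The data here is synthetic and the column names are placeholders. The log step uses `np.log1p`, which is an assumption: the author's `LogTransformation` helper may be implemented differently.

```python
import numpy as np
import pandas as pd

# synthetic hourly data standing in for the real multivariate series
idx = pd.date_range('2007-10-01', periods=24 * 10, freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(1)
hourly = pd.DataFrame({'Solar Irradiance': rng.uniform(0, 800, len(idx)),
                       'Air Temperature': rng.uniform(1, 30, len(idx))},
                      index=idx)

# averaging the 24 hourly values of each day
daily = hourly.resample('D').mean()

# log(1 + x) keeps zero values finite
daily_log = np.log1p(daily)

# daily_log.plot(subplots=True) would draw one panel per variable
```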

We can also visualize the solar irradiance variable individually:

Daily total solar radiation. Image by author.

Besides the clear seasonality, we can also spot some downward spikes throughout the series. These cases should be predicted in a timely manner so that backup energy systems are used efficiently.

We can also analyze the correlation between each pair of variables:

Heatmap showing the pairwise correlation. Image by author.

Solar irradiance is correlated with some of the variables. For instance, air temperature, relative humidity (negative correlation), or wind speed.
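A pairwise correlation matrix like the one behind the heatmap can be computed with `DataFrame.corr`. The sketch below uses toy data whose relationships were chosen only to produce one positive and one negative correlation. It does not reproduce the values in the figure.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
temperature = rng.normal(15, 8, n)
# toy relationships, not estimated from the real data
irradiance = 20 * temperature + rng.normal(0, 50, n)
humidity = -2 * temperature + rng.normal(0, 5, n)

toy = pd.DataFrame({'Solar Irradiance': irradiance,
                    'Air Temperature': temperature,
                    'Relative Humidity': humidity})

# Pearson correlation between each pair of columns
corr_matrix = toy.corr()
print(corr_matrix.round(2))
```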

We’ve explored how to build a forecasting model with a univariate time series in a previous article. Yet, the correlation heatmap suggests that it may be valuable to include these variables in the model.

How can we do this?

Primer on Auto-Regressive Distributed Lags modeling

Auto-regressive distributed lags (ARDL) is a modeling technique for multivariate time series.

ARDL is a useful approach for identifying the relationship between several variables over time. It works by extending the auto-regression technique to multivariate data. The future values of a given variable of the series are modeled based on its lags and the lags of other variables.

In this case, we want to forecast solar irradiance based on the lags of several factors such as air temperature or vapor pressure.
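In symbols, one way to write this setup is shown below. The notation is ours and not taken from the original source: y is the target (solar irradiance), x_1 to x_k are the other variables, p is the number of lags, and h is the forecasting step. In the classical ARDL formulation f is a linear regression, whereas here it can be any learning algorithm.

```latex
\hat{y}_{t+h} = f\left(y_{t}, y_{t-1}, \ldots, y_{t-p+1},\;
                       x_{1,t}, \ldots, x_{1,t-p+1},\; \ldots,\;
                       x_{k,t}, \ldots, x_{k,t-p+1}\right)
```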

Transforming the data for ARDL

Applying the ARDL method involves transforming the time series into a tabular format. This is done by applying time delay embedding to each variable, and then concatenating the results into a single matrix. The following function can be used to do this:

import pandas as pd

# time_delay_embedding is the univariate helper from the src module linked above;
# the import path below is an assumption
from src.tde import time_delay_embedding


def mts_to_tabular(data: pd.DataFrame,
                   n_lags: int,
                   horizon: int,
                   return_Xy: bool = False,
                   drop_na: bool = True):
    """
    Time delay embedding with multivariate time series
    Time series for supervised learning

    :param data: multivariate time series as pd.DataFrame
    :param n_lags: number of past values to use as explanatory variables
    :param horizon: how many values to forecast
    :param return_Xy: whether to return the lags split from future observations
    :param drop_na: whether to drop rows with missing values

    :return: pd.DataFrame with reconstructed time series
    """

    # applying time delay embedding to each variable
    data_list = [time_delay_embedding(data[col], n_lags, horizon)
                 for col in data]

    # concatenating the results in a single dataframe
    df = pd.concat(data_list, axis=1)

    if drop_na:
        df = df.dropna()

    if not return_Xy:
        return df

    # future observations are the columns whose name contains a literal '+'
    is_future = df.columns.str.contains('+', regex=False)

    X = df.iloc[:, ~is_future]
    Y = df.iloc[:, is_future]

    if Y.shape[1] == 1:
        Y = Y.iloc[:, 0]

    return X, Y
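The function above calls `time_delay_embedding` on each variable, but that helper is not shown here. Below is a hypothetical stand-in, written only to illustrate the idea. The column naming scheme (`(t-k)` for lags and `(t+k)` for future values) is an assumption chosen so that future columns contain a `+`, which is what the split in `mts_to_tabular` looks for.

```python
import pandas as pd


def time_delay_embedding(series: pd.Series, n_lags: int, horizon: int) -> pd.DataFrame:
    """Hypothetical sketch: lagged and future copies of a series as columns."""
    name = series.name
    # offsets n_lags-1, ..., 1, 0 are past/current values;
    # offsets -1, ..., -horizon are future values
    offsets = range(n_lags - 1, -horizon - 1, -1)
    columns = {}
    for k in offsets:
        label = f'{name}(t-{k})' if k >= 0 else f'{name}(t+{-k})'
        columns[label] = series.shift(k)
    return pd.DataFrame(columns)


toy = pd.Series([1, 2, 3, 4, 5, 6], name='y')
embedded = time_delay_embedding(toy, n_lags=2, horizon=1).dropna()
print(embedded)
```

Each row of the result pairs the most recent values of the series with the value that follows them, which is the tabular shape a regression algorithm expects.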

This function is applied to the data as follows:

from sklearn.model_selection import train_test_split

# target variable; with the loading code above, the solar irradiance
# column is named after its file, i.e. 'Incoming Solar'
TARGET = 'Incoming Solar'
# number of lags for each variable
N_LAGS = 24
# forecasting horizon for solar irradiance
HORIZON = 48

# leaving the last 30% of observations for testing
train, test = train_test_split(mv_series, test_size=0.3, shuffle=False)

# transforming the time series into a tabular format
X_train, Y_train_all = mts_to_tabular(train, N_LAGS, HORIZON, return_Xy=True)
X_test, Y_test_all = mts_to_tabular(test, N_LAGS, HORIZON, return_Xy=True)

# subsetting the target variable
target_columns = Y_train_all.columns.str.contains(TARGET)
Y_train = Y_train_all.iloc[:, target_columns]
Y_test = Y_test_all.iloc[:, target_columns]

We set the forecasting horizon to 48 hours. Predicting many steps in advance is valuable for the effective integration of several energy sources into the electricity grid.

It’s difficult to say a priori how many lags should be included. So, this value is set to 24 for each variable. With 9 variables, this leads to a total of 216 lag-based features.

Building a forecasting model

Before building a model, we extract 8 more features based on the date and time. These include data such as the day of the year or the hour, which are useful to model seasonality.
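To give an idea of what such features look like, here is a minimal sketch that derives a few of them directly from a pandas DatetimeIndex. The feature names and the placeholder `lag_1` column are illustrative. The exact set of 8 features produced in this tutorial comes from sktime and is not reproduced here.

```python
import pandas as pd

# two days of hourly timestamps with a placeholder explanatory variable
idx = pd.date_range('2013-01-01', periods=48, freq=pd.Timedelta(hours=1))
X = pd.DataFrame({'lag_1': range(48)}, index=idx)

# calendar features derived from the index
X['hour_of_day'] = X.index.hour
X['day_of_year'] = X.index.dayofyear
X['month_of_year'] = X.index.month

print(X.tail(3))
```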

We reduce the number of explanatory variables with feature selection. First, we apply a correlation filter. This is used to remove any feature with a correlation greater than 95% with any other explanatory variable. Then, we also apply recursive feature elimination (RFE) based on the importance scores of a Random Forest. After feature engineering, we train a model using a Random Forest.
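The correlation filter can be sketched as below. This is a hypothetical implementation, not the author's helper: it works on absolute correlations and, for each highly correlated pair, drops the column that appears later in the data frame. Both choices are assumptions.

```python
import numpy as np
import pandas as pd


def correlation_filter(X: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Hypothetical sketch: drop one column of each highly correlated pair."""
    corr = X.corr().abs()
    # keeping only the upper triangle so that each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)


rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a,
                    'b': a + 0.01 * rng.normal(size=200),  # near-duplicate of a
                    'c': rng.normal(size=200)})

filtered = correlation_filter(toy)
print(list(filtered.columns))
```

Note that a stateless function like this decides which columns to drop from whatever data it receives, so the selected columns can differ between the training and test sets. A transformer that stores the selection during fitting avoids that.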

We leverage sklearn’s Pipeline and RandomizedSearchCV to optimize the parameters of the different steps:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sktime.transformations.series.date import DateTimeFeatures

from src.holdout import Holdout
# correlation_filter is also a helper from the src module;
# its import is not shown in the original code

# including datetime information to model seasonality
hourly_feats = DateTimeFeatures(ts_freq='H',
                                # assumed: keeps the lag features next to the new ones
                                keep_original_columns=True,
                                feature_scope='efficient')

# building a pipeline
pipeline = Pipeline([
    # feature extraction based on datetime
    ('extraction', hourly_feats),
    # removing correlated explanatory variables
    ('correlation_filter', FunctionTransformer(func=correlation_filter)),
    # applying feature selection based on recursive feature elimination
    ('select', RFE(estimator=RandomForestRegressor(max_depth=5), step=3)),
    # building a random forest model for forecasting
    ('model', RandomForestRegressor())
])

# parameter grid for optimization
param_grid = {
    'extraction': ['passthrough', hourly_feats],
    'select__n_features_to_select': np.linspace(start=.1, stop=1, num=10),
    'model__n_estimators': [100, 200]
}

# optimizing the pipeline with random search
# the remaining arguments are truncated in the source; the imported Holdout
# splitter is presumably passed as cv to get a single validation split
model = RandomizedSearchCV(estimator=pipeline,
                           param_distributions=param_grid)

# running random search
model.fit(X_train, Y_train)

# checking the selected model
model.best_estimator_
# Pipeline(steps=[('extraction',
#                  DateTimeFeatures(feature_scope='efficient', ts_freq='H')),
#                 ('correlation_filter',
#                  FunctionTransformer(func=correlation_filter)),
#                 ('select',
#                  RFE(estimator=RandomForestRegressor(max_depth=5),
#                      n_features_to_select=0.9, step=3)),
#                 ('model', RandomForestRegressor(n_estimators=200))])
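The search relies on a validation split rather than on repeated cross-validation. The `Holdout` class lives in the `src` module and is not shown here, so the helper below is a hypothetical stand-in that illustrates the idea: a single chronological split where the earlier observations are used for training and the later ones for validation.

```python
import numpy as np


def holdout_split(n_samples: int, train_size: float = 0.75):
    """Hypothetical helper: one chronological train/validation split."""
    cut = int(n_samples * train_size)
    indices = np.arange(n_samples)
    # earlier observations train the model, later ones validate it
    return [(indices[:cut], indices[cut:])]


splits = holdout_split(100, train_size=0.75)
train_idx, val_idx = splits[0]
print(len(train_idx), len(val_idx))
```

scikit-learn accepts an iterable of (train, validation) index arrays as the `cv` argument, so a list like this one could be passed to `RandomizedSearchCV` directly.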

Evaluating the model

We selected a model using a random search coupled with a validation split. Now, we can evaluate its forecasting performance on the test set.

# getting forecasts for the test set
forecasts = model.predict(X_test)
forecasts = pd.DataFrame(forecasts, columns=Y_test.columns)
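One way to summarize performance with forecasts in this shape is to compute the error separately for each step of the horizon. The sketch below uses the mean absolute error on toy numbers. The metric and the column names are assumptions, not results from this tutorial.

```python
import pandas as pd

# toy actual values and forecasts for a 3-step horizon
cols = ['y(t+1)', 'y(t+2)', 'y(t+3)']
actual = pd.DataFrame([[10., 12., 14.], [20., 22., 24.]], columns=cols)
predicted = pd.DataFrame([[11., 12., 10.], [19., 24., 26.]], columns=cols)

# mean absolute error for each forecasting step
mae_per_step = (actual - predicted).abs().mean()
# a single summary number across the whole horizon
overall_mae = mae_per_step.mean()

print(mae_per_step)
```

With the objects from this tutorial, the indices need to match before subtracting, for example by setting `forecasts.index = Y_test.index`, because the forecasts data frame is created with a default integer index.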

The selected model kept only 65 out of the original 224 explanatory variables. Here’s the importance of the top 20 features:

Importance of each feature in the model. Image by author.

The features hour of the day and day of the year are among the top 4 features. This result highlights the strength of seasonal effects in the data. Besides those, the first lags of some of the variables are also useful to the model.
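Importance scores like these can be read from a fitted Random Forest through its `feature_importances_` attribute. Here is a minimal sketch on synthetic data, where the target is constructed to depend mostly on one column. The feature names are placeholders and the scores say nothing about the real data set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 4)),
                 columns=['hour_of_day', 'lag_1', 'lag_2', 'noise'])
# toy target that depends mostly on the first column
y = 3 * X['hour_of_day'] + 0.5 * X['lag_1'] + rng.normal(0, 0.1, 300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# impurity-based importance scores, which sum to 1
importance = pd.Series(forest.feature_importances_, index=X.columns)
top_features = importance.sort_values(ascending=False).head(2)
print(top_features)
```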

