## How to use time series analysis and forecasting to tackle climate change

This is Part 2 of the series *Time Series for Climate Change*.

Solar energy is an increasingly prevalent source of clean energy.

Sunlight is converted into electricity by photovoltaic devices. Since these devices do not emit pollutants, they're considered a source of clean energy. Besides its environmental advantages, solar energy is also appealing because of its low cost. The initial investment is large, but the low long-term costs make it worthwhile.

The amount of energy produced depends on the level of solar radiation. But solar conditions can change rapidly. For instance, a cloud may unexpectedly cover the sun and reduce the efficiency of photovoltaic devices.

So, solar energy systems depend on forecasting models to predict solar conditions. As in the case of wind power, accurate forecasts have a direct impact on the effectiveness of these systems.

## Beyond energy production

Forecasting solar irradiance has other applications besides energy, for instance:

- Agriculture: Farmers can leverage forecasts to optimize crop production. Examples include estimating when to plant or harvest a crop, or optimizing irrigation systems;
- Civil engineering: Forecasting solar irradiance can also be valuable for designing and constructing buildings. Predictions can be used to maximize solar radiation, thereby reducing heating/cooling costs. Forecasts can also be useful to configure air-conditioning systems. This contributes to the efficient use of energy inside buildings.

## Challenges, and what's next

Despite their importance, solar conditions are highly variable and difficult to predict. They depend on several meteorological factors, whose information is often unavailable.

In the rest of this article, we'll develop a model for solar irradiance forecasting. Among other things, you'll learn how to:

- visualize a multivariate time series;
- transform a multivariate time series for supervised learning;
- do feature selection based on correlation and importance scores.

This tutorial relies on a dataset collected by the U.S. Department of Agriculture. You can check more details in reference [1]. The complete code for this tutorial is available on GitHub:

The data is a multivariate time series: at each instant, an observation consists of several variables. These include the following weather and hydrological variables:

- Solar irradiance (watts per square meter);
- Wind direction;
- Snow depth;
- Wind speed;
- Dew point temperature;
- Precipitation;
- Vapor pressure;
- Relative humidity;
- Air temperature.

The series spans from October 1, 2007, to October 1, 2013. It's collected at an hourly frequency, totaling 52,608 observations.
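As a quick sanity check on that count, here is a minimal sketch. It assumes the series covers the half-open interval from the start date up to, but not including, the end date, with exactly one observation per hour and no gaps.

```python
from datetime import datetime

# period covered by the data set (end date treated as exclusive, an assumption)
start = datetime(2007, 10, 1)
end = datetime(2013, 10, 1)

# number of whole hours in the period = number of hourly observations
n_hours = int((end - start).total_seconds() // 3600)

print(n_hours)
```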

After downloading the data, we can read it using pandas:

```python
import re

import pandas as pd

# src module available here: https://github.com/vcerqueira/tsa4climate/tree/principal/src
from src.log import LogTransformation

# a sample here: https://github.com/vcerqueira/tsa4climate/tree/principal/content/part_2/assets
assets = 'path_to_data_directory'

DATE_TIME_COLS = ['month', 'day', 'calendar_year', 'hour']

# we'll focus on the data collected at a particular station called smf1
STATION = 'smf1'

# columns to read from each file: date/time parts plus the station's measurement
COLUMNS_PER_FILE = {
    'incoming_solar_final.csv': DATE_TIME_COLS + [f'{STATION}_sin_w/m2'],
    'wind_dir_raw.csv': DATE_TIME_COLS + [f'{STATION}_wd_deg'],
    'snow_depth_final.csv': DATE_TIME_COLS + [f'{STATION}_sd_mm'],
    'wind_speed_final.csv': DATE_TIME_COLS + [f'{STATION}_ws_m/s'],
    'dewpoint_final.csv': DATE_TIME_COLS + [f'{STATION}_dpt_C'],
    'precipitation_final.csv': DATE_TIME_COLS + [f'{STATION}_ppt_mm'],
    'vapor_pressure.csv': DATE_TIME_COLS + [f'{STATION}_vp_Pa'],
    'relative_humidity_final.csv': DATE_TIME_COLS + [f'{STATION}_rh'],
    'air_temp_final.csv': DATE_TIME_COLS + [f'{STATION}_ta_C'],
}

data_series = {}
for file in COLUMNS_PER_FILE:
    file_data = pd.read_csv(f'{assets}/{file}')

    # copy so that adding a column does not modify a view of file_data
    var_df = file_data[COLUMNS_PER_FILE[file]].copy()

    # building a datetime index from the separate date/time columns
    var_df['datetime'] = pd.to_datetime(
        [f'{year}/{month}/{day} {hour}:00'
         for year, month, day, hour in zip(var_df['calendar_year'],
                                           var_df['month'],
                                           var_df['day'],
                                           var_df['hour'])])

    var_df = var_df.drop(DATE_TIME_COLS, axis=1)
    var_df = var_df.set_index('datetime')
    series = var_df.iloc[:, 0].sort_index()

    data_series[file] = series

# one column per file, aligned on the datetime index
mv_series = pd.concat(data_series, axis=1)

# turning file names into readable column names
mv_series.columns = [re.sub(r'_final\.csv|_raw\.csv|\.csv', '', x) for x in mv_series.columns]
mv_series.columns = [re.sub('_', ' ', x) for x in mv_series.columns]
mv_series.columns = [x.title() for x in mv_series.columns]

mv_series = mv_series.astype(float)
```

This code results in the following data set:

## Exploratory data analysis

The series plot suggests there's a strong yearly seasonality. Radiation levels peak during summertime, and other variables show similar patterns. Aside from seasonal fluctuations, the level of the time series is stable over time.

We can also visualize the solar irradiance variable individually:

Besides the clear seasonality, we can also spot some downward spikes around the level of the series. These cases should be predicted in a timely manner so that backup energy systems are used efficiently.

We can also analyze the correlation between each pair of variables:

Solar irradiance is correlated with some of the variables, for instance air temperature, relative humidity (negative correlation), or wind speed.
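A pairwise correlation matrix like this can be computed with pandas' `DataFrame.corr`. The sketch below uses small synthetic data, not the actual data set: the column names mirror the article, but the values and coefficients are made up purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# synthetic stand-ins for three of the variables (illustrative values only)
irradiance = rng.uniform(0, 1000, size=200)
df = pd.DataFrame({
    'Solar Irradiance': irradiance,
    'Air Temperature': 0.02 * irradiance + rng.normal(scale=2, size=200),
    'Relative Humidity': 80 - 0.05 * irradiance + rng.normal(scale=5, size=200),
})

# Pearson correlation between each pair of columns
corr = df.corr()

print(corr.round(2))
```

With the real data, the same call on `mv_series` would give the matrix behind the heatmap.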

We've explored how to build a forecasting model with a univariate time series in a previous article. Yet, the correlation heatmap suggests that it might be valuable to include these variables in the model.

How can we do this?

## Primer on Auto-Regressive Distributed Lags modeling

Auto-regressive distributed lags (ARDL) is a modeling technique for multivariate time series.

ARDL is a useful approach for identifying the relationship between several variables over time. It works by extending the auto-regression technique to multivariate data. The future values of a given variable of the series are modeled based on its lags and the lags of other variables.

In this case, we want to forecast solar irradiance based on the lags of several factors such as air temperature or vapor pressure.
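To make the idea concrete, here is a minimal sketch of an ARDL-style regression with one lag of the target and one lag of a single explanatory variable. The data is simulated, and the coefficients (0.5 and 0.8) and the least-squares fit are assumptions chosen for illustration. The article itself uses more lags and a Random Forest rather than a linear model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# simulated explanatory variable and target:
# y[t] depends on its own previous value and on the previous value of x
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.normal(scale=0.1)

# design matrix: intercept, lag 1 of y, lag 1 of x
X = np.column_stack([np.ones(n - 1), y[:-1], x[:-1]])
target = y[1:]

# ordinary least squares estimate of the coefficients
coef, *_ = np.linalg.lstsq(X, target, rcond=None)

print(coef.round(2))
```

The estimated coefficients should land close to the values used in the simulation, which shows how the lags of both variables carry information about the next value of the target.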

## Transforming the data for ARDL

Applying the ARDL method involves transforming the time series into a tabular format. This is done by applying time delay embedding to each variable, and then concatenating the results into a single matrix. The following function can be used to do that:

```python
import pandas as pd

# NOTE: time_delay_embedding is not defined in this snippet;
# it is assumed to be available from the tutorial's src module


def mts_to_tabular(data: pd.DataFrame,
                   n_lags: int,
                   horizon: int,
                   return_Xy: bool = False,
                   drop_na: bool = True):
    """
    Time delay embedding with multivariate time series
    Time series for supervised learning

    :param data: multivariate time series as pd.DataFrame
    :param n_lags: number of past values to use as explanatory variables
    :param horizon: how many values to forecast
    :param return_Xy: whether to return the lags split from future observations
    :param drop_na: whether to drop rows with missing values

    :return: pd.DataFrame with reconstructed time series
    """

    # applying time delay embedding to each variable
    data_list = [time_delay_embedding(data[col], n_lags, horizon)
                 for col in data]

    # concatenating the results in a single dataframe
    df = pd.concat(data_list, axis=1)

    if drop_na:
        df = df.dropna()

    if not return_Xy:
        return df

    # future observations are the columns with a '+' in their name;
    # regex=False because '+' alone is not a valid regular expression
    is_future = df.columns.str.contains('+', regex=False)

    X = df.iloc[:, ~is_future]
    Y = df.iloc[:, is_future]

    if Y.shape[1] == 1:
        Y = Y.iloc[:, 0]

    return X, Y
```
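The function above calls `time_delay_embedding`, which is not defined in this article. The following is a minimal sketch of what such a helper could look like, not the author's implementation. It assumes columns are named `name(t-k)` for lags and `name(t+k)` for future values, which is consistent with the `'+'` check used to separate future observations.

```python
import pandas as pd


def time_delay_embedding(series: pd.Series, n_lags: int, horizon: int) -> pd.DataFrame:
    """Hypothetical helper: build lagged and future columns from a single series."""
    name = series.name
    columns = {}

    # positive shifts are past values (lags), negative shifts are future values
    for shift in range(n_lags - 1, -horizon - 1, -1):
        if shift >= 0:
            label = f'{name}(t-{shift})'
        else:
            label = f'{name}(t+{-shift})'
        columns[label] = series.shift(shift)

    return pd.DataFrame(columns)


toy = pd.Series([1, 2, 3, 4, 5, 6], name='Solar Irradiance')
embedded = time_delay_embedding(toy, n_lags=3, horizon=2).dropna()

print(embedded)
```

Each row of the result holds a window of consecutive values: the first columns are the lags and the last ones are the values to forecast. Rows at the edges of the series contain missing values, which is why they are dropped.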

This function is applied to the data as follows:

```python
from sklearn.model_selection import train_test_split

# target variable; must match a column name in mv_series
TARGET = 'Solar Irradiance'
# number of lags for each variable
N_LAGS = 24
# forecasting horizon for solar irradiance
HORIZON = 48

# leaving the last 30% of observations for testing
train, test = train_test_split(mv_series, test_size=0.3, shuffle=False)

# transforming the time series into a tabular format
X_train, Y_train_all = mts_to_tabular(train, N_LAGS, HORIZON, return_Xy=True)
X_test, Y_test_all = mts_to_tabular(test, N_LAGS, HORIZON, return_Xy=True)

# subsetting the target variable
target_columns = Y_train_all.columns.str.contains(TARGET)
Y_train = Y_train_all.iloc[:, target_columns]
Y_test = Y_test_all.iloc[:, target_columns]
```

We set the forecasting horizon to 48 hours. Predicting many steps in advance is valuable for the effective integration of several energy sources into the electricity grid.

It's difficult to say a priori how many lags should be included. So, this value is set to 24 for each variable. This leads to a total of 216 lag-based features.
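The feature count follows directly from the settings above. A small sketch of the arithmetic, assuming the nine variables listed earlier:

```python
n_variables = 9   # weather and hydrological variables in the data set
n_lags = 24       # lags per variable
horizon = 48      # forecasting horizon, in hours

# every variable contributes n_lags explanatory columns
n_lag_features = n_variables * n_lags

# every variable also gets `horizon` future columns,
# but only those of the target variable are kept as outputs
n_future_columns = n_variables * horizon
n_target_columns = horizon

print(n_lag_features, n_future_columns, n_target_columns)
```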

## Constructing a forecasting model

Before constructing a model, we extract 8 more features based on the date and time. These include data such as the day of the year or the hour, which are useful to model seasonality.
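To illustrate what such features look like, here is a minimal pandas-only sketch. It is not the sktime transformer used in the pipeline, and it extracts only three features. The choice of features and the column names are assumptions for illustration.

```python
import pandas as pd

# two days of hourly timestamps starting at the beginning of the series
index = pd.date_range('2007-10-01', periods=48, freq=pd.Timedelta(hours=1))

# calendar features derived from the datetime index
datetime_feats = pd.DataFrame({
    'hour': index.hour,
    'day_of_year': index.dayofyear,
    'month': index.month,
}, index=index)

print(datetime_feats.head())
```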

We reduce the number of explanatory variables with feature selection. First, we apply a correlation filter. This is used to remove any feature with a correlation greater than 95% with any other explanatory variable. Then, we also apply recursive feature elimination (RFE) based on the importance scores of a Random Forest. After feature engineering, we train a model using a Random Forest.
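The correlation filter itself is not shown in this article. The following is a minimal sketch of one way to write it, not the author's implementation. It assumes the input is a pandas DataFrame, uses the absolute Pearson correlation, and keeps the first column of each highly correlated pair.

```python
import numpy as np
import pandas as pd


def correlation_filter(x: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Hypothetical filter: drop columns highly correlated with an earlier column."""
    corr = x.corr().abs()

    # keep only the upper triangle so that each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

    # a column is dropped if it is too correlated with any column before it
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

    return x.drop(columns=to_drop)


toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],  # perfectly correlated with 'a'
    'c': [5, 1, 4, 2, 3],
})
filtered = correlation_filter(toy)

print(list(filtered.columns))
```

Note that a function like this is stateless: it recomputes the correlations on whatever data it receives, so the set of dropped columns could differ between training and test data.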

We leverage sklearn's *Pipeline* and *RandomizedSearchCV* to optimize the parameters of the different steps:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sktime.transformations.series.date import DateTimeFeatures

from src.holdout import Holdout

# NOTE: correlation_filter is not defined in this snippet;
# it is assumed to be available from the tutorial's code

# including datetime information to model seasonality
hourly_feats = DateTimeFeatures(ts_freq='H',
                                keep_original_columns=True,
                                feature_scope='efficient')

# building a pipeline
pipeline = Pipeline([
    # feature extraction based on datetime
    ('extraction', hourly_feats),
    # removing correlated explanatory variables
    ('correlation_filter', FunctionTransformer(func=correlation_filter)),
    # applying feature selection based on recursive feature elimination
    ('select', RFE(estimator=RandomForestRegressor(max_depth=5), step=3)),
    # building a random forest model for forecasting
    ('model', RandomForestRegressor())]
)

# parameter grid for optimization
param_grid = {
    'extraction': ['passthrough', hourly_feats],
    'select__n_features_to_select': np.linspace(start=.1, stop=1, num=10),
    'model__n_estimators': [100, 200]
}

# optimizing the pipeline with random search
model = RandomizedSearchCV(estimator=pipeline,
                           param_distributions=param_grid,
                           scoring='neg_mean_squared_error',
                           n_iter=25,
                           n_jobs=5,
                           refit=True,
                           verbose=2,
                           cv=Holdout(n=X_train.shape[0]),
                           random_state=123)

# running random search
model.fit(X_train, Y_train)

# checking the selected model
model.best_estimator_
# Pipeline(steps=[('extraction',
#                  DateTimeFeatures(feature_scope='efficient', ts_freq='H')),
#                 ('correlation_filter',
#                  FunctionTransformer(func=<function correlation_filter at 0x...>)),
#                 ('select',
#                  RFE(estimator=RandomForestRegressor(max_depth=5),
#                      n_features_to_select=0.9, step=3)),
#                 ('model', RandomForestRegressor(n_estimators=200))])
```
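The search above passes `Holdout(n=X_train.shape[0])` as the `cv` argument. That class comes from the `src` module and is not shown here. A minimal sketch of what a single-split, time-ordered holdout could look like, assuming the last 30% of the training rows are used for validation (the `test_size` parameter and its default are assumptions):

```python
import numpy as np


class Holdout:
    """Hypothetical splitter: one chronological train/validation split."""

    def __init__(self, n: int, test_size: float = 0.3):
        self.n = n
        self.test_size = test_size

    def split(self, X=None, y=None, groups=None):
        # the validation set is the final block of observations
        n_val = int(round(self.n * self.test_size))
        cut = self.n - n_val
        indices = np.arange(self.n)
        yield indices[:cut], indices[cut:]

    def get_n_splits(self, X=None, y=None, groups=None):
        return 1


train_idx, val_idx = next(Holdout(n=10).split())

print(train_idx, val_idx)
```

Keeping the validation block after the training block avoids evaluating the model on observations that precede its training data, which matters for time series.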

## Evaluating the model

We selected a model using a random search coupled with a validation split. Now, we can evaluate its forecasting performance on the test set.

```python
# getting forecasts for the test set
forecasts = model.predict(X_test)
forecasts = pd.DataFrame(forecasts, columns=Y_test.columns)
```
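One possible way to summarize these forecasts is the mean absolute error at each step of the horizon. This metric is an illustrative choice, not necessarily the one used by the author. The sketch below uses made-up numbers with three horizons.

```python
import pandas as pd

cols = [f'Solar Irradiance(t+{h})' for h in range(1, 4)]

# made-up actual values and forecasts, for illustration only
y_true = pd.DataFrame([[100., 200., 300.],
                       [150., 250., 350.]], columns=cols)
y_pred = pd.DataFrame([[110., 190., 330.],
                       [140., 270., 320.]], columns=cols)

# comparing raw arrays avoids accidental alignment on mismatched indices
abs_errors = pd.DataFrame(abs(y_true.to_numpy() - y_pred.to_numpy()), columns=cols)

# mean absolute error for each forecasting horizon
mae_per_horizon = abs_errors.mean()

print(mae_per_horizon)
```

With the real data, `Y_test` keeps its datetime index while `forecasts` gets a default integer index, so comparing the underlying arrays (or aligning the indices first) avoids mismatched rows.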

The selected model kept only 65 out of the original 224 explanatory variables. Here's the importance of the top 20 features:

The features hour of the day and day of the year are among the top 4 features. This result highlights the strength of seasonal effects in the data. Besides those, the first lags of some of the variables are also useful to the model.
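Importance scores like these come from the fitted Random Forest. A minimal sketch on synthetic data, where the feature names and the relationship between features and target are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# synthetic features: the target depends strongly on 'hour', weakly on 'lag_1'
X = pd.DataFrame(rng.normal(size=(300, 4)),
                 columns=['hour', 'day_of_year', 'lag_1', 'noise'])
y = 3 * X['hour'] + X['lag_1'] + rng.normal(scale=0.1, size=300)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# impurity-based importance scores, sorted from most to least important
importance = pd.Series(rf.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)

print(importance)
```

For the tuned pipeline, the fitted forest would be reachable through `model.best_estimator_['model']`. The names of the features it was trained on would have to be recovered from the preceding selection steps.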