Predict sales for 50 different items at 10 different stores. 📈🛒
Kaggle Competition
Store Item Demand Forecasting Challenge
Goal
Predict sales for 50 different items at 10 different stores. 📈🛒
Python Notebook
Multiple Timeseries Forecasting notebook is on the market on Github.
Installing Requirements
Activate a Python environment and install the next
%%capture
!pip install pandas
!pip install numpy
!pip install darts
!pip install matplotlib
!pip install -U "gluonts[torch]==0.11.0" "optuna~=2.10.0"
!pip install ipykernel
!pip install --upgrade nbformat
!pip install lightgbm xgboost
!pip install seaborn
!pip install distributed
!pip install datasetsforecast
!pip install darts
!pip install mlforecast
!pip install statsforecast
Import packages
%matplotlib inlineimport pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Dataset
On this tutorial, we’ll train and evaluate multiple time-series forecasting models using the Store Item Demand Forecasting Challenge dataset from Kaggle. This dataset has 10 different stores and every store has 50 items, i.e. total of 500 each day level time series data for five years (2013–2017).
Download data
- Download the train.csv from here
- Create a
./data
directory contained in the directory of this Python notebook - Save the train.csv contained in the
./data
directory
Load data
data = pd.read_csv("./data/train.csv", index_col=0)
data = data.reset_index()
data.head()
print(f"The dataset has {data.shape[0]} rows and {data.shape[1]} columns")
The dataset has 913000 rows and 4 columns
Data fields
- — Date of the sale data. There aren’t any holiday effects or store closures.
- — Store ID
- — Item ID
- — Variety of items sold at a specific store on a specific date.
data.describe()
- We don’t have any negative “sales” values
- We’d like to convert “store” and “item” columns to type string
Reduce the quantity of knowledge
For simplicity & speed, we’ll only keep data from the primary 2 out of the ten stores
data = data.loc[data["store"] <= 2]
print(f"The dataset now has {data.shape[0]} rows and {data.shape[1]} columns")print(f"The dataset now has {data.shape[0]} rows and {data.shape[1]} columns")
The dataset now has 182600 rows and 4 columns
Convert “store” and “item” columns to type string
data[["store", "item"]] = data[["store", "item"]].astype(str)
Check for duplicate rows
count_duplicate_rows = len(data)-len(data.drop_duplicates())
print(f"There are {count_duplicate_rows} duplicate rows")
There are 0 duplicate rows
Create a recent column by concatenating “store” and “item” columns
data["store_item"] = data["store"] + "_" + data["item"]
Convert date column to datetime
data["date"] = pd.to_datetime(data["date"])
Calculate total sales per date, store and item
item_sales_per_date = data.groupby(["date","store_item"])["sales"].aggregate("sum")
item_sales_per_date = item_sales_per_date.reset_index()
item_sales_per_date.columns = ["date","store_item","sales"]
item_sales_per_date = item_sales_per_date.sort_values("date", ascending=True)
item_sales_per_date.head()
item_sales_per_date.tail()
Plot total sales for all products over time
total_sales_per_date = item_sales_per_date.groupby(["date"])["sales"].aggregate("sum")
total_sales_per_date.plot()
Not all items sell daily
- For every “store_item” value add all of the missing “date” values and fill in “sales” with 0s
Generate an index with all combos of “date” and “store_item” values
import itertoolsmultiindex = list(zip(item_sales_per_date["date"], item_sales_per_date["store_item"]))
multiindex = pd.MultiIndex.from_tuples(multiindex, names=('index_1', 'index_2'))
dataset = item_sales_per_date.copy()
dataset.index = multiindex
idx_dates = list(pd.date_range(min(item_sales_per_date["date"]), max(item_sales_per_date["date"])))
idx_ids = list(dataset["store_item"].unique())
idx = list(itertools.product(idx_dates, idx_ids))
dataset = dataset.reindex(idx, None)
dataset = dataset.reset_index()
dataset.head()
For every “store_item” value fill add all of the missing “date” values and fill in “sales” with 0s
dataset["date"] = dataset["date"].fillna(dataset["index_1"])
dataset["store_item"] = dataset["store_item"].fillna(dataset["index_2"])
dataset["sales"] = dataset["sales"].fillna(0)
dataset = dataset.drop(columns=["index_1", "index_2"])
dataset = dataset.set_index("date")
dataset.index.name = "date"
dataset.head()
Create the dataset that we’ll use for training the forecasting models
# Rename columns to match the Nixtlaverse's expectations
# The 'store_item' becomes 'unique_id' representing the unique identifier of the time series
# The 'date' becomes 'ds' representing the time stamp of the info points
# The 'sales' becomes 'y' representing the goal variable we would like to forecast
# Y_df = dataset.query('unique_id.str.startswith("6_46")').copy()
Y_df = dataset.copy()
Y_df = Y_df.reset_index()
Y_df = Y_df.rename(columns={
'store_item': 'unique_id',
'date': 'ds',
'sales': 'y'
})
Y_df['y'] = Y_df['y'].astype(int)
# Convert the 'ds' column to datetime format to make sure proper handling of date-related operations in subsequent steps
Y_df['ds'] = pd.to_datetime(Y_df['ds'])
Y_df.tail()Y_df['unique_id'] = Y_df['unique_id'].astype(str)
%%capture
!pip install tqdm
from tqdm.autonotebook import tqdm
from statsforecast import StatsForecast
# Plot random series for EDA
StatsForecast.plot(Y_df)
Check for seasonality in the full variety of ‘sales’ per ‘date’
import darts
from darts import TimeSeries
from darts.utils.statistics import plot_acf, check_seasonalityseasonality_check_data = total_sales_per_date.reset_index()
seasonality_check_data = seasonality_check_data.set_index("date")
seasonality_check_data_series = TimeSeries.from_times_and_values(seasonality_check_data.index, seasonality_check_data["sales"].values)
plot_acf(seasonality_check_data_series, m=7, alpha=0.05, max_lag=30)
The ACF presents a spike at x in [1, 7, 14, 21], which suggests a weekly seasonality trend (highlighted). The blue zone determines the importance of the statistics for a confidence level of $alpha = 5%$. We also can run a statistical check of seasonality for every candidate period m
.
plot_acf(seasonality_check_data_series, m=7, alpha=0.05, max_lag=12*30)
We even have 6 months seasonality
We are going to now train multiple Statistical & ML models and evaluate which one performs best
from time import time
# Import needed models from the statsforecast library
from statsforecast.models import (
# SeasonalNaive: A model that uses the previous season's data because the forecast
SeasonalNaive,
# Naive: An easy model that uses the last observed value because the forecast
Naive,
# HistoricAverage: This model uses the common of all historical data because the forecast
HistoricAverage,
# CrostonOptimized: A model specifically designed for intermittent demand forecasting
CrostonOptimized,
# ADIDA: Adaptive combination of Intermittent Demand Approaches, a model designed for intermittent demand
ADIDA,
# IMAPA: Intermittent Multiplicative AutoRegressive Average, a model for intermittent series that includes autocorrelation
IMAPA,
# AutoETS: Automated Exponential Smoothing model that mechanically selects the perfect Exponential Smoothing model based on AIC
AutoETS
)# horizon is the variety of days in the long run for which we'll make a forecast
horizon = 30
# the seasonality window is 7 because now we have weekly seasonality
season_length = 7
# the variety of days that the model will use to make a forecast
window_size = 6*30
models = [
SeasonalNaive(season_length=season_length),
Naive(),
HistoricAverage(),
CrostonOptimized(),
ADIDA(),
IMAPA(),
AutoETS(season_length=season_length)
]
# Instantiate the StatsForecast class
sf = StatsForecast(
models=models, # An inventory of models for use for forecasting
freq='D', # The frequency of the time series data (on this case, 'D' stands for each day frequency)
n_jobs=-1, # The variety of CPU cores to make use of for parallel execution (-1 means use all available cores)
)
# Get the present time before forecasting starts, this will likely be used to measure the execution time
init = time()
# Call the forecast approach to the StatsForecast instance to predict the subsequent horizon days
# Level is about to [90], which suggests that it's going to compute the 90% prediction interval
fcst_df = sf.forecast(df=Y_df, h=horizon, level=[90])
# Get the present time after the forecasting ends
end = time()
# Calculate and print the full time taken for the forecasting in minutes
print(f'Forecast Minutes: {(end - init) / 60}')
Forecast Minutes: 0.9659132162729899
fcst_df.head()
from mlforecast import MLForecast
from mlforecast.target_transforms import Differences
from mlforecast.utils import PredictionIntervals
from window_ops.expanding import expanding_mean# Import the needed models from various libraries
# LGBMRegressor: A gradient boosting framework that uses tree-based learning algorithms from the LightGBM library
from lightgbm import LGBMRegressor
# XGBRegressor: A gradient boosting regressor model from the XGBoost library
from xgboost import XGBRegressor
# LinearRegression: An easy linear regression model from the scikit-learn library
from sklearn.linear_model import LinearRegression
# Instantiate the MLForecast object
mlf = MLForecast(
models=[LGBMRegressor(max_depth=10), XGBRegressor(max_depth=10, eval_metric='rmse'), LinearRegression()], # List of models for forecasting: LightGBM, XGBoost and Linear Regression
freq='D', # Frequency of the info - 'D' for each day frequency
lags=list(range(1, 7)), # Specific lags to make use of as regressors: 1 to six days
lag_transforms = {
1: [expanding_mean], # Apply expanding mean transformation to the lag of 1 day
},
date_features=['year', 'month', 'day', 'dayofweek', 'quarter', 'week', 'dayofyear'], # Date features to make use of as regressors
)
# Start the timer to calculate the time taken for fitting the models
init = time()
# Fit the MLForecast models to the info, with prediction intervals set using a window size of window_size days
mlf.fit(Y_df, prediction_intervals=PredictionIntervals(window_size=window_size))
# Calculate the tip time after fitting the models
end = time()
# Print the time taken to suit the MLForecast models, in minutes
print(f'MLForecast Minutes: {(end - init) / 60}')
MLForecast Minutes: 0.3328384002049764fcst_mlf_df
fcst_mlf_df = mlf.predict(horizon=horizon, level=[90])
fcst_mlf_df.head()
Per store, item and forecast model
fcst_df = fcst_df.merge(fcst_mlf_df, how='left', on=['unique_id', 'ds'])
sf.plot(Y_df, fcst_df, max_insample_length=30)
Cross Validation in StatsForecast
init = time()
cv_df = sf.cross_validation(df=Y_df, h=horizon, n_windows=3, step_size=horizon, level=[90])
end = time()
print(f'CV Minutes: {(end - init) / 60}')
CV Minutes: 2.01016796429952
Cross Validation in MLForecast
init = time()
cv_mlf_df = mlf.cross_validation(
data=Y_df,
window_size=horizon,
n_windows=3,
step_size=horizon,
level=[90],
)
end = time()
print(f'CV Minutes: {(end - init) / 60}')
CV Minutes: 0.4411295692125956
cv_df = cv_df.merge(cv_mlf_df.drop(columns=['y']), how='left', on=['unique_id', 'ds', 'cutoff'])
cv_df.head()
Visualize cross validation splits for a selected unique_id
cutoffs = cv_df['cutoff'].unique()for cutoff in cutoffs:
img = sf.plot(
Y_df,
cv_df.query('cutoff == @cutoff').drop(columns=['y', 'cutoff']),
max_insample_length=30*4,
unique_ids=['1_1'],
)
agg_cv_df = cv_df.loc[:,~cv_df.columns.str.contains('hi|lo')].groupby(['ds', 'cutoff']).sum(numeric_only=True).reset_index()
agg_cv_df.insert(0, 'unique_id', 'Total sales (all items)')agg_Y_df = Y_df.groupby(['ds']).sum(numeric_only=True).reset_index()
agg_Y_df.insert(0, 'unique_id', 'Total sales (all items)')
for cutoff in cutoffs:
img = sf.plot(
agg_Y_df,
agg_cv_df.query('cutoff == @cutoff').drop(columns=['y', 'cutoff']),
max_insample_length=30*4,
)
from typing import List, Callablefrom distributed import Client
from fugue import transform
from fugue_dask import DaskExecutionEngine
from datasetsforecast.losses import mse, mae, smape, mape
def evaluate(df: pd.DataFrame, metrics: List[Callable]) -> pd.DataFrame:
eval_ = {}
models = df.loc[:, ~df.columns.str.contains('unique_id|y|ds|cutoff|lo|hi')].columns
for model in models:
eval_[model] = {}
for metric in metrics:
eval_[model][metric.__name__] = metric(df['y'], df[model])
eval_df = pd.DataFrame(eval_).rename_axis('metric').reset_index()
eval_df.insert(0, 'cutoff', df['cutoff'].iloc[0])
eval_df.insert(0, 'unique_id', df['unique_id'].iloc[0])
return eval_df
str_models = cv_df.loc[:, ~cv_df.columns.str.contains('unique_id|y|ds|cutoff|lo|hi')].columns
str_models = ','.join([f"{model}:float" for model in str_models])
cv_df['cutoff'] = cv_df['cutoff'].astype(str)
cv_df = cv_df.reset_index()
cv_df['unique_id'] = cv_df['unique_id'].astype(str)
Let’s cleate a dask client.
client = Client() # without this, dask shouldn't be in distributed mode
# fugue.dask.dataframe.default.partitions determines the default partitions for a recent DaskDataFrame
engine = DaskExecutionEngine({"fugue.dask.dataframe.default.partitions": 96})
The transform function takes the evaluate functions and applies it to every combination of time series (unique_id) and cross validation window (cutoff) using the dask client we created before.
evaluation_df = transform(
cv_df.loc[:, ~cv_df.columns.str.contains('lo|hi')],
evaluate,
engine="dask",
params={'metrics': [mse, mae, mape, smape]},
schema=f"unique_id:str,cutoff:str,metric:str, {str_models}",
as_local=True,
partition={'by': ['unique_id', 'cutoff']}
)
evaluation_df.head()
# Calculate the mean metric for every cross validation window
evaluation_df.groupby(['unique_id','cutoff', 'metric']).mean(numeric_only=True)
import matplotlib.pyplot as plt
import seaborn as snsevaluation_df_melted = pd.melt(evaluation_df, id_vars=['unique_id', 'cutoff', 'metric'], var_name='model', value_name='error')
evaluation_df_melted
Distribution of MSE errors per model, unique_id and evaluation metric
sns.violinplot(evaluation_df_melted.query('metric=="mse"'), x='error', y='model').set_title('Distribution of MSE errors')
Distribution of SMAPE errors per model, unique_id and evaluation metric
sns.violinplot(evaluation_df_melted.query('metric=="smape"'), x='error', y='model').set_title('Distribution of SMAPE errors')
trained_models = list(evaluation_df_melted.model.unique())evaluation_df[trained_models] = evaluation_df[trained_models].astype(float)
evaluation_df
# Select the perfect model for every time series, metric, and cross validation window
evaluation_df['best_model'] = evaluation_df.idxmin(axis=1, numeric_only=True)
# count how again and again a model wins per metric and cross validation window
count_best_model = evaluation_df.groupby(['unique_id', 'metric', 'best_model']).size().rename('n').to_frame().reset_index()
# plot results
sns.barplot(count_best_model, x='n', y='best_model', hue='metric')
AutoETS is the perfect performing model for all evaluation metrics
This doesn’t mean that AutoETS is the perfect performing model for every individual “store_item”
count_best_model
# add colours
colours = ['#ff9999','#66b3ff','#99ff99','#ffcc99']
# For the mse, calculate how again and again a model wins
count_best_model_mse = count_best_model.query('metric == "mse" & unique_id == "1_1"')
counts_series_mse = count_best_model_mse["n"]
plt.pie(counts_series_mse, labels=count_best_model_mse["best_model"], autopct='%.0f%%', colours=colours)
plt.title("Winning Models Based On MSE - Store_item='1_1'")
plt.show()
was the perfect performing model based on for two out of the three validation folds of .
# For the smape, calculate how again and again a model wins
count_best_model_smape = count_best_model.query('metric == "smape" & unique_id == "1_1"')
counts_series_smape = count_best_model_smape["n"]
plt.pie(counts_series_smape, labels=count_best_model_smape["best_model"], autopct='%.0f%%', colours=colours)
plt.title("Winning Models Based On SMAPE - Store_item='1_1'")
plt.show()
was the perfect performing model based on for two out of the three validation folds of .
- The very best model based on MSE is XGBRegressor
- The very best model based on SMAPE is LGBMRegressor
eval_series_df = evaluation_df.query('metric == "mse" & unique_id == "1_1"').groupby(['unique_id']).mean(numeric_only=True)
eval_series_df['best_model'] = eval_series_df.idxmin(axis=1)sf.plot(Y_df, cv_df.drop(columns=['cutoff', 'y']),
max_insample_length=30 * 2,
models=['XGBRegressor'],
unique_ids=eval_series_df.query('best_model == "XGBRegressor"').index[:8])
eval_series_df = evaluation_df.query('metric == "smape" & unique_id == "1_1"').groupby(['unique_id']).mean(numeric_only=True)
eval_series_df['best_model'] = eval_series_df.idxmin(axis=1)sf.plot(Y_df, cv_df.drop(columns=['cutoff', 'y']),
max_insample_length=30 * 2,
models=['LGBMRegressor'],
unique_ids=eval_series_df.query('best_model == "LGBMRegressor"').index[:8])
eval_series_df = evaluation_df.query('metric == "mape"').groupby(['unique_id']).mean(numeric_only=True)
eval_series_df['best_model'] = eval_series_df.idxmin(axis=1)sf.plot(Y_df, cv_df.drop(columns=['cutoff', 'y']),
max_insample_length=30 * 2,
models=['AutoETS'],
unique_ids=eval_series_df.query('best_model == "AutoETS"').index[:8])
eval_series_df = evaluation_df.query('metric == "mape"').groupby(['unique_id']).mean(numeric_only=True)
eval_series_df['best_model'] = eval_series_df.idxmin(axis=1)sf.plot(Y_df, cv_df.drop(columns=['cutoff', 'y']),
max_insample_length=30 * 2,
models=['XGBRegressor'],
unique_ids=eval_series_df.query('best_model == "XGBRegressor"').index[:8])
This code relies on the next publicly available resources
jazz music
bossa nova