Time series prediction can be beneficial in many fields including logistics, weather, sales forecasting, and predictive maintenance. Walmart provided a complete data set on Kaggle that can be used to evaluate time series prediction techniques. I will be making several posts using the Walmart data set from the M5 Forecasting – Accuracy competition to develop and evaluate time series methods in Python.
This first post demonstrates preliminary exploratory data analysis (EDA) and prediction using seasonal features. The post also provides a brief summary of polymorphism in Python using an abstract parent class to minimize code duplication and avoid conditionals like switch or if statements.
Walmart Data Set
The Walmart data set includes data for items in three categories of products from 2011 through 2016: hobbies, foods, and household. Each item is associated with a store in CA, FL, or TX. Three tables contain data to identify daily unit sales, selling prices, and event data for any given day.
- The calendar table has daily rows with weekday and event labels providing the date of notable events. The event labels contained in the table include religious holidays, such as Chanukah End and Easter, sporting events, such as SuperBowl and NBAFinalsStart, and US national holidays, such as Thanksgiving and IndependenceDay.
- The sales_train_validation table includes daily unit sales data for products in the three categories among stores in the three states. This table is in wide format with each row containing all daily sales data for one product and columns for each day in the full time range.
- The sell_prices table provides weekly prices for each item.
Exploratory Data Analysis
Prior to predicting unit sales, we would like to identify good candidate items that have strong seasonal variation. Since the Walmart data set only provides item_id labels instead of true item names or descriptions, EDA is needed to identify these good candidate items. Items that are the best candidates for seasonal prediction have a high correlation with events and holidays. Since some foods are often associated with events (e.g., chocolate on Valentine’s Day), this analysis focuses on items in the FOOD category. This initial round of exploratory data analysis identifies foods that demonstrate higher sales on events. Same items from multiple stores are grouped since preliminary EDA showed these same food items behaved similarly among multiple stores and states.
To identify good candidates for seasonal prediction, the data tables needs to be merged into a form with one column per event and one row per item with the same items from multiple stores groups into one row.
- Unpivoting (pandas melt) the sales_train_validation table converts the table from a wide format with a column per day to long format with a primary key including day.
- Grouping and averaging (pandas groupby) combines each item sold on one day among all stores into one row for that item with an average unit_sales for this day. The grouped sales_train_validation table now only has three columns: d (day), item_id, and average unit_sales across all stores.
- Merging (pandas merge) joins this grouped table with the calendar table on the day column. This step adds the event labels per day to the average unit_sales per day.
- Grouping again combines items per event to produce a table with a primary key of item_id and event_name. This table identifies the average unit_sales per event.
- Pivoting (pandas pivot) converts this grouped table into wide format with one column per event including a ‘None’ column to group sales on days without events.
- Dividing the unit_sales values in the event columns by the the unit_sales values in the ‘None’ column produces a unit_sales ratio to highlight foods with higher sales on event days. This step produces the final table values and structure.
Sorting this wide table to be descending in the ‘Thanksgiving’ column identifies FOODS_3_069 as the food with highest increase in average unit_sales on Thanksgiving Day compared to days without events.
The unit_sales for FOODS_3_069 at the TX_1 store demonstrates the unit_sales seasonality for this food. Distinct peaks occur near New Years Eve, Christmas, Thanksgiving, and Valentine’s Day although not all holidays have a peak each of the five years.
Unit Sales Prediction with Seasonal Features
This analysis uses a combination of deterministic time series features to predict unit sales. These features are a linear trend, weekly seasonal indicators, and annual seasonal indicators. The linear trend enables the model to detrend a long-term linear trend in time. The seasonal features are Fourier series in which each series has an integer number of cycles within a one year time frame. This analysis uses annual seasonal indicators with 1 to 32 cycles per annum. The statsmodels.tsa.deterministic.DeterministicProcess container class is a convenient class that provides the Fourier Series in addition to constants, time trends, and seasonal indicators for each week. The following method demonstrates the DeterministicProcess syntax. The DeterministicProcess requires a pandas index format for the index column.
def create_seasonal_features(self, df_merged_store):
"""Creates seasonal features for one item and one store"""
df_copy = df_merged_store.copy(deep=True)
y = df_copy['unit_sales']
df_copy['date'] = pd.DatetimeIndex(df_copy['date'])
df_copy.set_index('date', inplace=True)
fourier = CalendarFourier(freq='A', order=16)
dp = DeterministicProcess(index=df_copy.index,
constant=True,
order=1,
seasonal=True,
additional_terms=[fourier],
drop=True)
X = dp.in_sample()
return X, y
The model fitting and prediction problem presents an opportunity to apply polymorphism in Python using the abc package. A parent class contains a generic plotting method and abstract methods for fitting and prediction. A child class defines fitting and prediction methods that are tailored to a specific combination of input features. This first analysis only uses the seasonal features described previously, and the UnitSalesPredictionSeasonal child class fits a linear regression model from sklearn.linear_model.LinearRegression. The full code used in this example is available on GitHub: https://github.com/bspivey/M5ForecastingAccuracy.
from abc import ABC, abstractmethod
class UnitSalesPrediction(ABC):
def plot_predictions(self, X, y, y_pred):
list_of_tuples = list(zip(X.index, y, y_pred))
columns = ['date', 'y', 'y_pred']
df_wide = pd.DataFrame(list_of_tuples, columns=columns)
value_vars = ['y', 'y_pred']
df_tall = pd.melt(df_wide,
id_vars='date',
value_vars=value_vars,
var_name='y_label',
value_name='y_value')
fig = px.line(df_tall,
x='date',
y='y_value',
color='y_label',
width=900,
height=300)
fig.update_layout(
yaxis_title='unit_sales')
fig.show()
@abstractmethod
def fit_unit_sales_model(self):
pass
@abstractmethod
def predict_unit_sales(self):
pass
class UnitSalesPredictionSeasonal(UnitSalesPrediction):
def fit_unit_sales_model(self, X_seasonal, y):
"""Trains a model to predict unit sales for one item and one store"""
X = X_seasonal
model = LinearRegression().fit(X, y)
return model
The model trains on FOODS_3_069 time series data excluding the final two years. The final year contains the test data not used for model tuning, and the prior year contains the validation data used for model tuning.
The results for unit_sales predictions on validation and test data demonstrate that the seasonal features model successfully identifies peaks around Thanksgiving and Christmas and a possible peak near Valentine’s Day. The y_pred signal is the predicted unit_sales shown versus the y signal which is the validation or test data unit_sales.
While the results demonstrate a correlation with several holidays as expected, the results smooth the predictions and show potential for improvement. Ideas for next steps are (1) including categorical features using actual event and holiday labels combined with lag/lead features, (2) using a hybrid linear regression and nonlinear regression model, and (3) using deep learning packages such as Facebook Prophet.