Come Thou Fount of Every Blessing – Piano Instrumental

After I began learning piano again in May 2022, Come Thou Fount of Every Blessing was the first hymn I took up. This is a fairly easy arrangement suited to my first year of playing again: https://www.musicnotes.com/sheetmusic/mtd.asp?ppn=MN0213498.

While I cannot say which is my favorite hymn, this one is in the top five, alongside Amazing Grace, Be Thou My Vision, In Christ Alone, and It is Well with My Soul.

I am playing here on a Roland FP-30X Digital Piano. One of the primary reasons I chose this digital piano is its key action, which is considered comparable to that of higher-end digital pianos from Roland and other manufacturers.

Walmart Daily Sales Prediction Using Time Series Analysis: Seasonality

Time series prediction can be beneficial in many fields, including logistics, weather forecasting, sales forecasting, and predictive maintenance. Walmart provided a complete data set on Kaggle that can be used to evaluate time series prediction techniques. I will be making several posts using the Walmart data set from the M5 Forecasting – Accuracy competition to develop and evaluate time series methods in Python.

This first post demonstrates preliminary exploratory data analysis (EDA) and prediction using seasonal features. The post also provides a brief summary of polymorphism in Python using an abstract parent class to minimize code duplication and avoid long if/elif conditional chains.

Walmart Data Set

The Walmart data set includes data for items in three product categories (hobbies, foods, and household) from 2011 through 2016. Each item is associated with a store in CA, TX, or WI. Three tables contain the data identifying daily unit sales, selling prices, and event data for any given day.

  • The calendar table has daily rows with weekday and event labels providing the dates of notable events. The event labels include religious holidays, such as Chanukah End and Easter, sporting events, such as SuperBowl and NBAFinalsStart, and US national holidays, such as Thanksgiving and IndependenceDay.
The calendar table
  • The sales_train_validation table includes daily unit sales data for products in the three categories among stores in the three states. This table is in wide format, with each row containing all daily sales data for one product and a column for each day in the full time range.
The sales_train_validation table
  • The sell_prices table provides weekly prices for each item.
The sell_prices table
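
For reference, here is a minimal sketch of loading the three tables with pandas, assuming the CSV file names from the Kaggle competition download:

import pandas as pd

df_calendar = pd.read_csv('calendar.csv')
df_sales = pd.read_csv('sales_train_validation.csv')
df_prices = pd.read_csv('sell_prices.csv')
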
Exploratory Data Analysis

Prior to predicting unit sales, we would like to identify good candidate items with strong seasonal variation. Since the Walmart data set only provides item_id labels rather than true item names or descriptions, EDA is needed to find these candidates. The best candidates for seasonal prediction are items whose sales correlate strongly with events and holidays. Since some foods are closely associated with events (e.g., chocolate on Valentine’s Day), this analysis focuses on items in the FOODS category. This initial round of exploratory data analysis identifies foods that demonstrate higher sales on event days. The same item sold at multiple stores is grouped into one series, since preliminary EDA showed that a given food item behaves similarly across stores and states.

To identify good candidates for seasonal prediction, the data tables need to be merged into a form with one column per event and one row per item, with the same item from multiple stores grouped into one row. The steps are listed below and sketched in code after the table.

  1. Unpivoting (pandas melt) the sales_train_validation table converts the table from a wide format with a column per day to long format with a primary key including day.
  2. Grouping and averaging (pandas groupby) combines each item sold on one day among all stores into one row for that item with an average unit_sales for this day. The grouped sales_train_validation table now only has three columns: d (day), item_id, and average unit_sales across all stores.
  3. Merging (pandas merge) joins this grouped table with the calendar table on the day column. This step adds the event labels per day to the average unit_sales per day.
  4. Grouping again combines items per event to produce a table with a primary key of item_id and event_name. This table identifies the average unit_sales per event.
  5. Pivoting (pandas pivot) converts this grouped table into wide format with one column per event including a ‘None’ column to group sales on days without events.
  6. Dividing the unit_sales values in the event columns by the unit_sales values in the ‘None’ column produces a unit_sales ratio that highlights foods with higher sales on event days. This step produces the final table values and structure.
Final unsorted wide table with average unit sales per item and per event
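
The following is a minimal sketch (not the exact project code) of the six steps above, assuming the Kaggle column names such as item_id, d, and event_name_1 and ignoring the dataset’s second event column:

def event_sales_ratios(df_sales, df_calendar):
    # 1. Unpivot: one row per (item, day) instead of one column per day.
    day_cols = [c for c in df_sales.columns if c.startswith('d_')]
    df_long = df_sales.melt(id_vars='item_id', value_vars=day_cols,
                            var_name='d', value_name='unit_sales')

    # 2. Average the same item across all stores for each day.
    df_grouped = (df_long.groupby(['item_id', 'd'], as_index=False)
                  ['unit_sales'].mean())

    # 3. Join the calendar to attach an event label (or 'None') to each day.
    df_merged = df_grouped.merge(df_calendar[['d', 'event_name_1']], on='d')
    df_merged['event_name_1'] = df_merged['event_name_1'].fillna('None')

    # 4. Average unit sales per (item, event).
    df_events = (df_merged.groupby(['item_id', 'event_name_1'], as_index=False)
                 ['unit_sales'].mean())

    # 5. Pivot to wide format: one column per event, including 'None'.
    df_wide = df_events.pivot(index='item_id', columns='event_name_1',
                              values='unit_sales')

    # 6. Divide the event columns by the 'None' column to get sales ratios.
    return df_wide.div(df_wide['None'], axis=0)

df_ratio = event_sales_ratios(df_sales, df_calendar)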

Sorting this wide table in descending order on the ‘Thanksgiving’ column identifies FOODS_3_069 as the food with the highest increase in average unit_sales on Thanksgiving Day compared to days without events.

Sorted table identifies foods that sell more on Thanksgiving than normal days
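
With the hypothetical df_ratio table from the sketch above, this sort is a one-liner:

df_ratio.sort_values('Thanksgiving', ascending=False).head()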

The unit_sales history for FOODS_3_069 at the TX_1 store demonstrates this food’s seasonality. Distinct peaks occur near New Year’s Eve, Christmas, Thanksgiving, and Valentine’s Day, although not every holiday shows a peak in each of the five years.

Unit sales of the FOODS_3_069 item show peaks near three holidays
Unit Sales Prediction with Seasonal Features

This analysis uses a combination of deterministic time series features to predict unit sales: a linear trend, weekly seasonal indicators, and annual seasonal Fourier features. The linear trend enables the model to capture a long-term linear trend in time. The seasonal features are Fourier series in which each term completes an integer number of cycles within a one-year time frame; this analysis uses orders of 1 to 16 cycles per annum (32 sine and cosine terms). The statsmodels.tsa.deterministic.DeterministicProcess container class conveniently provides the Fourier series in addition to a constant, time trends, and weekly seasonal indicators. The following method demonstrates the DeterministicProcess syntax; note that DeterministicProcess requires a pandas DatetimeIndex for its index argument.

# Imports assumed by this method:
import pandas as pd
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess

    def create_seasonal_features(self, df_merged_store):
        """Creates seasonal features for one item and one store"""
        df_copy = df_merged_store.copy(deep=True)
        y = df_copy['unit_sales']

        # DeterministicProcess requires a pandas DatetimeIndex
        df_copy['date'] = pd.DatetimeIndex(df_copy['date'])
        df_copy.set_index('date', inplace=True)
        # Annual Fourier features with 16 sine/cosine pairs
        fourier = CalendarFourier(freq='A', order=16)
        dp = DeterministicProcess(index=df_copy.index,
                                  constant=True,              # intercept term
                                  order=1,                     # linear time trend
                                  seasonal=True,               # weekly seasonal dummies
                                  additional_terms=[fourier],  # annual Fourier terms
                                  drop=True)                   # drop collinear terms
        X = dp.in_sample()

        return X, y
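
A hypothetical usage sketch, assuming this method belongs to the prediction class introduced below and that df_merged_store holds the date and unit_sales columns for one item at one store:

predictor = UnitSalesPredictionSeasonal()
X, y = predictor.create_seasonal_features(df_merged_store)
print(X.columns)  # const, trend, weekly dummies s(2,7)...s(7,7), Fourier sin/cos terms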

The model fitting and prediction problem presents an opportunity to apply polymorphism in Python using the abc package. A parent class contains a generic plotting method and abstract methods for fitting and prediction. A child class defines fitting and prediction methods tailored to a specific combination of input features. This first analysis only uses the seasonal features described previously, and the UnitSalesPredictionSeasonal child class fits a linear regression model using sklearn.linear_model.LinearRegression. The full code used in this example is available on GitHub: https://github.com/bspivey/M5ForecastingAccuracy.

from abc import ABC, abstractmethod

import pandas as pd
import plotly.express as px
from sklearn.linear_model import LinearRegression

class UnitSalesPrediction(ABC):
    def plot_predictions(self, X, y, y_pred):
        """Plots actual vs. predicted unit sales; shared by all child classes"""
        list_of_tuples = list(zip(X.index, y, y_pred))
        columns = ['date', 'y', 'y_pred']
        df_wide = pd.DataFrame(list_of_tuples, columns=columns)
        value_vars = ['y', 'y_pred']
        df_tall = pd.melt(df_wide,
                          id_vars='date',
                          value_vars=value_vars,
                          var_name='y_label',
                          value_name='y_value')

        fig = px.line(df_tall,
                      x='date',
                      y='y_value',
                      color='y_label',
                      width=900,
                      height=300)
        fig.update_layout(
            yaxis_title='unit_sales')

        fig.show()

    @abstractmethod
    def fit_unit_sales_model(self):
        pass

    @abstractmethod
    def predict_unit_sales(self):
        pass

class UnitSalesPredictionSeasonal(UnitSalesPrediction):
    def fit_unit_sales_model(self, X_seasonal, y):
        """Trains a model to predict unit sales for one item and one store"""
        X = X_seasonal
        model = LinearRegression().fit(X, y)

        return model

    def predict_unit_sales(self, model, X_seasonal):
        """Predicts unit sales for one item and one store"""
        # Both abstract methods must be implemented before the child class
        # can be instantiated; this simple version wraps model.predict
        y_pred = model.predict(X_seasonal)

        return y_pred

The model trains on FOODS_3_069 time series data excluding the final two years. The final year contains the test data not used for model tuning, and the prior year contains the validation data used for model tuning.
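
A minimal sketch of this split, assuming X and y carry the DatetimeIndex from create_seasonal_features; the cutoff dates here are hypothetical and chosen only to illustrate holding out the final two years:

train_end = '2014-04-24'  # assumed cutoff two years before the end of the data
val_end = '2015-04-24'    # assumed cutoff one year before the end of the data

X_train, y_train = X.loc[:train_end], y.loc[:train_end]
X_val, y_val = X.loc[train_end:val_end], y.loc[train_end:val_end]
X_test, y_test = X.loc[val_end:], y.loc[val_end:]

model = predictor.fit_unit_sales_model(X_train, y_train)
y_pred_val = predictor.predict_unit_sales(model, X_val)
predictor.plot_predictions(X_val, y_val, y_pred_val)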

The unit_sales predictions on the validation and test data demonstrate that the seasonal features model successfully identifies peaks around Thanksgiving and Christmas and a possible peak near Valentine’s Day. The y_pred signal is the predicted unit_sales, shown against the y signal, which is the validation or test unit_sales.

FOODS_3_069 unit sales predictions on validation data
FOODS_3_069 unit sales predictions on test data

While the results demonstrate a correlation with several holidays as expected, the predictions are overly smooth and show potential for improvement. Ideas for next steps are (1) including categorical features using actual event and holiday labels combined with lag/lead features, (2) using a hybrid linear regression and nonlinear regression model, and (3) using dedicated forecasting packages such as Facebook Prophet.

Compare GPU and CPU Training Times for Image Recognition with TensorFlow 2

This article compares the training times for fitting a TensorFlow 2 convolutional neural network (CNN or convnet) using a GPU or CPU on the Kaggle Dogs vs. Cats dataset. The Dogs vs. Cats competition was an early Kaggle competition that demonstrated the power of convnets to solve computer vision recognition problems, with winning entries reaching 95% accuracy.

The training time comparison follows my prior post explaining how to set up an nvidia-docker container to run TensorFlow 2 on a GPU. I will begin this article by reviewing the main steps to train the convnets using an example from Deep Learning with Python, 1st edition, by Chollet. These steps are provided in more detail on the book’s GitHub site: https://github.com/fchollet/deep-learning-with-python-notebooks.

Starting the Container

The GPU can be enabled or disabled when starting the nvidia-docker container by keeping or removing the --gpus all option in the following line:

sudo docker run --gpus all -d -it -p 8848:8888 -v "$(pwd)/data:/home/jovyan/work" -e GRANT_SUDO=yes -e JUPYTER_ENABLE_LAB=yes --user root cschranz/gpu-jupyter:v1.4_cuda-11.0_ubuntu-18.04_python-only

If the GPU is not selected as an option, the following command should show no GPUs in the list of local devices:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 2823115825857772105
]

Training the Model

The convnet is constructed from a series of paired convolution and max pooling layers. The first Conv2D layer slides 3×3 windows over the 150 × 150 × 3 tensor representing the scaled RGB input image to produce a 148 × 148 × 32 output feature map, with one channel for each of the 32 convolution filters. The output height and width could be kept equal to the input height and width by setting padding="same". The MaxPooling2D layer downsamples the feature maps. Downsampling is important to reduce the number of model parameters and to achieve output feature maps that represent general image features such as cat eyes or ears. The convnet is completed by flattening the output feature map and adding Dense neural network layers. The convolution and max pooling layers transform input images into generalized image features, which serve as inputs to the Dense neural network classifier. The reader may find many more detailed explanations of convnets online.

from keras import layers
from keras import models

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu',
          input_shape=(150, 150, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 148, 148, 32)      896       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 74, 74, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 72, 72, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 36, 36, 64)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 34, 34, 128)       73856     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 17, 17, 128)       0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 15, 15, 128)       147584    
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 7, 7, 128)         0         
_________________________________________________________________
flatten (Flatten)            (None, 6272)              0         
_________________________________________________________________
dense (Dense)                (None, 512)               3211776   
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 513       
=================================================================
Total params: 3,453,121
Trainable params: 3,453,121
Non-trainable params: 0

The model is compiled with the binary_crossentropy loss function and the generic acc accuracy metric. These are appropriate together for a two-class problem; for a multiclass problem, the loss should be changed (e.g., to categorical_crossentropy).

from keras import optimizers

# Note: newer TF2/Keras versions use learning_rate instead of lr
model.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(lr=1e-4),
              metrics=['acc'])

A data generator is used to generate batches of image tensor data that can be augmented at runtime. The first example shows the training time comparison with only image rescaling, and the second example shows the results with rotations, x-y shifts, shear, zoom, and horizontal flip augmentations.

# Image data generator with only scaling
from keras.preprocessing.image import ImageDataGenerator

# All images will be rescaled by 1./255
train_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
        # This is the target directory
        train_dir,
        # All images will be resized to 150x150
        target_size=(150, 150),
        batch_size=20,
        # Since we use binary_crossentropy loss, we need binary labels
        class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
        validation_dir,
        target_size=(150, 150),
        batch_size=20,
        class_mode='binary')
# Image data generator with additional data augmentations
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,)

# Note that the validation data should not be augmented!
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
        # This is the target directory
        train_dir,
        # All images will be resized to 150x150
        target_size=(150, 150),
        batch_size=20,
        # Since we use binary_crossentropy loss, we need binary labels
        class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
        validation_dir,
        target_size=(150, 150),
        batch_size=20,
        class_mode='binary')

The image transformations used for data augmentation help reduce overfitting since the model becomes less sensitive to the placement and orientation of objects within an image. The convnet is fit for 30 epochs without data augmentation and 100 epochs with data augmentation; more epochs are used in the latter run since validation performance continues to improve without overfitting.

history = model.fit(
      train_generator,
      steps_per_epoch=100,
      epochs=30, # 100 epochs with data augmentation
      validation_data=validation_generator,
      validation_steps=50)

Model Validation Results

The convnet without data augmentation demonstrates overfitting that begins by the second epoch as the training accuracy exceeds the validation accuracy. The validation accuracy saturates at ~70%.

Accuracy and loss learning curves demonstrate overfitting early in learning

The convnet with data augmentations demonstrates increasing validation accuracy above 80% by the final epoch.

Accuracy and loss curves demonstrate continued improvement through 90 epochs.

GPU vs. CPU Training Time Results

Without data augmentation, the training time for all GPU epochs after the first one was 8 seconds versus the CPU epoch time of 27 seconds.

GPU training time without data augmentation
CPU training time without data augmentation

With the data augmentations used above, the GPU epoch time was 15 seconds versus the CPU epoch time of 28 seconds.

GPU training time with data augmentation
CPU training time with data augmentation
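
The per-epoch times above come from the Keras progress output; a hypothetical callback to record them explicitly might look like this:

import time
from keras.callbacks import Callback

class EpochTimer(Callback):
    """Prints the wall-clock duration of each training epoch"""
    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.time()

    def on_epoch_end(self, epoch, logs=None):
        print(f'Epoch {epoch}: {time.time() - self._start:.1f} s')

history = model.fit(train_generator,
                    steps_per_epoch=100,
                    epochs=30,
                    validation_data=validation_generator,
                    validation_steps=50,
                    callbacks=[EpochTimer()])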

The GPU epoch time may have increased relative to the CPU epoch time because the ImageDataGenerator augments the images asynchronously on the CPU. The following posts describe how data augmentation may instead be done synchronously on the GPU: https://keras.io/examples/vision/image_classification_from_scratch/ and https://github.com/keras-team/keras/issues/12120.
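
A sketch of that approach using Keras preprocessing layers, following the linked Keras example (these layers run inside the model graph and so execute on the GPU; in TF versions before 2.6 they live under layers.experimental.preprocessing):

from tensorflow import keras
from tensorflow.keras import layers

data_augmentation = keras.Sequential([
    layers.RandomFlip('horizontal'),
    layers.RandomRotation(0.1),
])

inputs = keras.Input(shape=(150, 150, 3))
x = data_augmentation(inputs)  # augmentation now runs on the GPU with the model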

Documenting the 2020 Pandemic

As we all experience larger changes to our world than many of us in the States experienced with the dot-com crash, 9/11, or the Great Recession, I am documenting what it is like to live during this time.

Friday, March 13, 2020

The past week has been stressful. The coronavirus COVID-19 has begun spreading in the US, and the stock market has begun its steepest descent since 2008. Washington and New York each have over 300 cases, and Texas has 23 cases. Overall the US has 1,215 cases and 36 deaths, all in Washington. The Washington outbreak began in a nursing home.

Life is Still “Normal.” I am in a work training class with colleagues from Indonesia, Russia, England, and possibly other countries who flew to Houston for this class. All of us sat in a normal classroom a couple feet apart, sharing the same table. All virus cases in Houston are being reported as due to international travelers returning home. We also have a teacher who has had walking pneumonia for the past month with a dry cough, very similar to COVID-19 symptoms. This is not reassuring as he approaches our desks and talks two feet from us. Meanwhile, by Friday we are being told at work to maintain six feet from others. I feel like I am at higher risk based on all the information we have now, but it seems I would be overreacting to skip the class and return to my desk. We were encouraged this past week to ask our supervisor about working from home if we felt it necessary given the virus, but the company has not provided official guidance on working from home. By the end of the day, I hear that class members are having to rebook their flights to return to their home countries amid tightening travel restrictions.

In the prior month, President Trump placed a ban on non-resident travelers from China on February 1 and quarantined US residents returning from Wuhan. We have seen 80,000+ people infected and 3,000+ deaths in China, though we learn later that these numbers were underreported. I read in the WSJ about Wuhan medical workers who wore hazmat suits all day and had family members trapped outside Wuhan due to the quarantine.

Church Services. By Thursday evening our church had given guidance that the Sunday service would be online and other Sunday classes were cancelled. Meanwhile some classes at church still planned smaller weekly gatherings. Our church was ahead of the government restrictions that would come later.

Monday, March 16, 2020

Working from Home. The past weekend has seemed like a whirlwind of increasing restrictions. On Friday, I was debating whether to request to work from home as some colleagues had done. By Sunday evening our workplace sent an email stating that only essential employees would be asked to come into work this week. In the past two weeks I attended the IADC conference in Galveston and a training class with international travelers; I managed to complete both just before the restrictions would have stopped them.

The President also issued guidance called “15 Days to Slow the Spread.” The federal guidance was generally to stay home if you feel sick, are elderly, or are otherwise at increased risk. It also recommended working or attending school from home where possible (unless you are in a critical industry as defined by DHS), avoiding social gatherings of more than 10 people, avoiding dining out, and practicing good hygiene.

Official gatherings at church are generally cancelled now since they usually involve more than 10 people.

Thursday, March 19, 2020

Schools and Restaurants. Texas Governor Greg Abbott issued the first public health disaster order in Texas since 1901. Schools will be closed, public gatherings are limited to 10 people or fewer, restaurants are limited to take-out orders only, and non-essential state employees are called to telework.

California Governor Newsom issued one of the strictest lockdown orders outside of China and Italy, which limits Californians to their homes except for exercise and essential needs and does not allow even gatherings of 10 people or fewer.

Tuesday, March 24, 2020

Stay at Home. Harris County Judge Hidalgo issued an order for residents to stay at home, similar to the California order and other orders issued since. The main restriction now is that people should not interact within six feet of others outside their household unless caring for a friend or family member. Even groups of 10 people or fewer are no longer permitted.

Since I live alone, this restriction was particularly difficult, but we are allowed to go out for walks and visit parks with friends as long as we mind social distancing guidelines. The order also closed “non-essential” businesses like barbers and gun ranges, and halted church activities besides preparing for services.

Government officials repeat that masks will not help prevent you from contracting the virus. I cannot believe they recommend not wearing masks in good faith. The virus is spread by respiratory droplets that exit the body through the mouth and nose. Some masks, like surgical masks, may not filter the virus well, but they will hinder the sick from spreading it and create some marginal barrier against inhaling the virus.

Sunday, March 29, 2020

Interstate Travel Restrictions. Texas Governor Abbott ordered drivers from Louisiana to self-quarantine for 14 days. He also expanded the self-quarantine order to cover airline travelers from Miami, Atlanta, Detroit, Chicago, California, and Washington state. The Texas DPS is enforcing checks at airports and along highways.

Tuesday, March 31, 2020

I trimmed my hair for the first time, and it actually looks good. A friend in person and colleagues on video chats gave their approval. No barbers are open during the social distancing lockdowns.

April. The federal and state stay-at-home orders were extended through the end of April. Our company likewise extended its work-from-home orders through the end of April.

Saturday, April 4, 2020

Masks. The CDC has finally changed its position and now recommends that the general public wear non-medical cloth masks to hinder the spread of the virus. Many people, maybe up to 50%, do not show symptoms, and wearing a mask will reduce the spread from asymptomatic people. The President said it is voluntary and that he would not be wearing a mask.

United States. The US has by far the most cases worldwide now with 275,000 confirmed cases and 7,100 deaths, 1,100 today alone. Texas has 6,050 confirmed cases while New York is at 102,000 confirmed cases. Spain has the largest number of cases outside the US at 124,700.

Houston. The number of cases in Harris County continues to increase. I hear reports of hospitals whose COVID units are already full placing patients showing COVID symptoms in other units. Houston Methodist is seeing patients double every 3 to 4 days. The hospital currently has 116 patients testing positive, not all requiring the ICU, and can handle 450 ICU patients. Texas is still far behind other states in testing. I read that the 25-county SE Texas region has about 1,000 hospital cases. Since about 10-20% of those infected require hospitalization, we could have 5,000-10,000 actual cases in SE Texas alone. Masks and ventilators are already in short supply for hospital workers in New York. Large hospital systems in Houston are not reporting shortages, but smaller providers are.

Life Changes. Besides what I have already noted, a few other life changes are:

Family members and friends have experienced layoffs or furloughs.

87-octane gas prices at $1.50 per gallon in Spring, TX.

Tape on the floor of stores to encourage social distancing.

Standing in line at grocery stores in the morning, especially in hopes of getting a pack of toilet paper or paper towels. This has improved somewhat, but at the peak I could not get most of my frozen vegetables even after waiting in line for 45 minutes before HEB opened.

Parks are closed in other states like SC, NC, CA, … but not in TX.

Wimbledon was cancelled for the first time since WWII. My tickets to the theater on March 20 and the Rodeo on March 22 were cancelled, and I am still waiting on a refund for the latter.

Weddings have been postponed, including a brother’s wedding. Funerals have limited or no attendance permitted in some states.

Manufacturers have switched their production lines to making ventilators (auto companies) and sanitizer (distillers). ExxonMobil has ramped up its isopropyl alcohol (IPA) production and is helping with a new mask design that uses less polypropylene.

We have daily press conferences from the President and state and local authorities. In a recent press conference, Mayor Turner said the city has a right to take over a former hospital building for sale regardless of the owner’s preference.

My first WordPress site

After building my first blog (innovabots.blogspot.com) during graduate school for a robotics project, I am now finally starting a site with my own domain using WordPress. I chose HostGator to host the site because it advertised tools to install and integrate WordPress sites. Though I cannot say how it compares to other hosts with similar capabilities, I have been satisfied with HostGator’s service. I used their online technical chat support today to solve a cookies error with the site, and their support was responsive and better than most online chat support I have used over the years.

I am planning to use this site to share travel experiences like my annual National Park trips, advice as I learn various machine learning algorithms within and outside the OMS CS program at Georgia Tech, and hobby projects that I work on over the years.