Compare GPU and CPU Training Times for Image Recognition with TensorFlow 2

This article compares the training times for fitting a TensorFlow 2 convolutional neural network (CNN or convnet) using a GPU or a CPU on the Kaggle Dogs vs. Cats dataset. The Dogs vs. Cats competition was an early Kaggle competition that demonstrated the power of convnets for computer vision recognition problems, with winning entries reaching 95% accuracy.

The training time comparison follows my prior post explaining how to set up an nvidia-docker container to run TensorFlow 2 on a GPU. I will begin this article by reviewing the main steps to train the convnets, using an example from Deep Learning with Python, 1st edition, by François Chollet. These steps are covered in more detail on the book's GitHub site: https://github.com/fchollet/deep-learning-with-python-notebooks.

Starting the Container

The GPU can be enabled or disabled when starting the nvidia-docker container by keeping or removing the --gpus all option in the following line:

sudo docker run --gpus all -d -it -p 8848:8888 -v "$(pwd)/data:/home/jovyan/work" -e GRANT_SUDO=yes -e JUPYTER_ENABLE_LAB=yes --user root cschranz/gpu-jupyter:v1.4_cuda-11.0_ubuntu-18.04_python-only
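
If the --gpus all option is removed, the same container runs on the CPU only; the rest of the command is unchanged:

sudo docker run -d -it -p 8848:8888 -v "$(pwd)/data:/home/jovyan/work" -e GRANT_SUDO=yes -e JUPYTER_ENABLE_LAB=yes --user root cschranz/gpu-jupyter:v1.4_cuda-11.0_ubuntu-18.04_python-only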

If the GPU option is omitted, the following commands should show no GPUs in the list of local devices:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 2823115825857772105
]
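
An equivalent check uses the TensorFlow 2 device API; an empty list indicates that no GPU is visible to TensorFlow (a minimal sketch, not part of the original notebook):

import tensorflow as tf

# Returns an empty list when the container was started without --gpus all
print(tf.config.list_physical_devices('GPU'))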

Training the Model

The convnet is constructed with a series of paired convolution and max pooling layers. The first Conv2D layer slides 3 × 3 windows over the 150 × 150 × 3 tensor representing the scaled RGB input image to produce a 148 × 148 × 32 output feature map, with one channel for each of the 32 convolution filters. The output height and width can be kept equal to the input height and width by setting padding="same". The MaxPooling2D layer downsamples the feature maps. Downsampling is important to reduce the number of model parameters and to produce output feature maps that represent general image features, such as cat eyes or ears. The convnet is completed by flattening the output feature map and adding Dense neural network layers. The convolution and max pooling layers transform input images into generalized image features, which serve as inputs to the Dense neural network classifier. The reader can find many more detailed explanations of convnets online.

from keras import layers
from keras import models

# Four Conv2D + MaxPooling2D blocks extract increasingly abstract image features
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu',
          input_shape=(150, 150, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
# Flatten the final feature map and classify with a dense head;
# the single sigmoid unit outputs the probability of one of the two classes
model.add(layers.Flatten())
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 148, 148, 32)      896       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 74, 74, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 72, 72, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 36, 36, 64)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 34, 34, 128)       73856     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 17, 17, 128)       0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 15, 15, 128)       147584    
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 7, 7, 128)         0         
_________________________________________________________________
flatten (Flatten)            (None, 6272)              0         
_________________________________________________________________
dense (Dense)                (None, 512)               3211776   
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 513       
=================================================================
Total params: 3,453,121
Trainable params: 3,453,121
Non-trainable params: 0

The model is compiled with the binary_crossentropy loss function and 'acc' as a generic accuracy metric. This pairing is appropriate for a two-class problem with a single sigmoid output unit; for a multiclass problem, the loss and the output activation should be changed, for example to categorical_crossentropy with a softmax output.

from keras import optimizers

model.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(learning_rate=1e-4),
              metrics=['acc'])
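
For comparison, a multiclass version would be compiled with a matching loss; a sketch, not used in this timing comparison, where the final Dense layer is assumed to have one softmax unit per class:

# Hypothetical multiclass compile step (final layer: Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.RMSprop(learning_rate=1e-4),
              metrics=['acc'])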

A data generator is used to generate batches of image tensor data that can be augmented at runtime. The first example shows the training time comparison with only image rescaling, and the second example shows the results with rotations, x-y shifts, shear, zoom, and horizontal flip augmentations.

# Image data generator with only scaling
from keras.preprocessing.image import ImageDataGenerator

# All images will be rescaled by 1./255
train_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
        # This is the target directory
        train_dir,
        # All images will be resized to 150x150
        target_size=(150, 150),
        batch_size=20,
        # Since we use binary_crossentropy loss, we need binary labels
        class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
        validation_dir,
        target_size=(150, 150),
        batch_size=20,
        class_mode='binary')

# Image data generator with additional data augmentations
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,)

# Note that the validation data should not be augmented!
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
        # This is the target directory
        train_dir,
        # All images will be resized to 150x150
        target_size=(150, 150),
        batch_size=20,
        # Since we use binary_crossentropy loss, we need binary labels
        class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
        validation_dir,
        target_size=(150, 150),
        batch_size=20,
        class_mode='binary')

The image transformations used for data augmentation help reduce overfitting because the model becomes less sensitive to the placement and orientation of objects within an image. The convnet is fit for 30 epochs without data augmentation and for 100 epochs with data augmentation; more epochs are used in the latter run because validation performance continues to improve without overfitting.

history = model.fit(
      train_generator,
      steps_per_epoch=100,
      epochs=30, # 100 epochs with data augmentation
      validation_data=validation_generator,
      validation_steps=50)

Model Validation Results

The convnet without data augmentation demonstrates overfitting that begins by the second epoch as the training accuracy exceeds the validation accuracy. The validation accuracy saturates at ~70%.

Accuracy and loss learning curves demonstrate overfitting early in learning

The convnet with data augmentation shows validation accuracy that continues to increase, exceeding 80% by the final epoch.

Accuracy and loss curves demonstrate continued improvement through 90 epochs.
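
The learning curves shown above can be plotted from the history object returned by model.fit; a minimal sketch, assuming matplotlib is available in the container:

import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)

# Accuracy learning curves
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()

# Loss learning curves
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()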

GPU vs. CPU Training Time Results

Without data augmentation, each GPU epoch after the first took 8 seconds, versus 27 seconds per CPU epoch.

GPU training time without data augmentation
CPU training time without data augmentation

With the data augmentations used above, each GPU epoch took 15 seconds, versus 28 seconds per CPU epoch.

GPU training time with data augmentation
CPU training time with data augmentation
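
The epoch times can be read from the Keras progress output during fitting; alternatively, a small callback can record them explicitly. The EpochTimer class below is an illustrative sketch, not part of the original code:

import time
from keras.callbacks import Callback

class EpochTimer(Callback):
    """Record the wall-clock duration of each training epoch."""
    def on_train_begin(self, logs=None):
        self.epoch_times = []

    def on_epoch_begin(self, epoch, logs=None):
        self._epoch_start = time.time()

    def on_epoch_end(self, epoch, logs=None):
        self.epoch_times.append(time.time() - self._epoch_start)

# Used by passing callbacks=[EpochTimer()] to model.fit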

The GPU epoch time increased with data augmentation (while the CPU epoch time barely changed) most likely because the ImageDataGenerator performs the augmentation on the CPU, which becomes the bottleneck for the GPU run. The following references describe how the data augmentation can instead be performed synchronously on the GPU as part of the model: https://keras.io/examples/vision/image_classification_from_scratch/ and https://github.com/keras-team/keras/issues/12120.
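
As described in the linked Keras example, the augmentation can be expressed as preprocessing layers inside the model so that it runs on the GPU during training. The sketch below uses a shortened layer stack for illustration; depending on the TensorFlow version, these layers live under tf.keras.layers.experimental.preprocessing or directly under tf.keras.layers (e.g., layers.RandomFlip):

from tensorflow import keras
from tensorflow.keras import layers

# Augmentation as model layers: it runs on the same device as the model
# (the GPU) and is only active during training.
data_augmentation = keras.Sequential([
    layers.experimental.preprocessing.RandomFlip('horizontal'),
    layers.experimental.preprocessing.RandomRotation(0.1),
    layers.experimental.preprocessing.RandomZoom(0.2),
])

inputs = keras.Input(shape=(150, 150, 3))
x = data_augmentation(inputs)
# A shortened convolution/pooling stack stands in for the full model above
x = layers.Conv2D(32, (3, 3), activation='relu')(x)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Flatten()(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
gpu_augmented_model = keras.Model(inputs, outputs)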