Introduction

In this exercise, we’ll build a convolutional neural network (CNN) for image classification. This is one of the original and more straightforward uses of CNNs. More complex uses include:

  • Semantic image segmentation
  • Image denoising or reconstruction
  • Working with video streams

But all methods are based around two basic operations:

  • Convolution: in this step, the network learns a series of kernels or filters that transform the original image in some way. These are similar to the filters used in standard image processing (e.g. low-pass filters), but here the filters are learned according to how well the transformed image helps predict the outcome variable. To put this another way, these filters identify shapes or features that are important in differentiating between the different outcomes.
  • Max-pooling: in this step, the image resolution is reduced. In general, the resolution is halved by keeping the maximum value in each non-overlapping two by two window of pixels (see the toy example below).

In general, these steps are repeated several times. As this progresses, the small shapes identified in the first set of convolutions are progressively combined into larger structures. For example, a series of small curves or lines could be aggregated into a cat’s eye.
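
To make the max-pooling step concrete, here is a toy base-R sketch (purely illustrative, not part of the lab code) of what a 2x2 max-pool does to a 4x4 "image":

# A 4x4 toy "image"
m <- matrix(c(1, 3, 2, 4,
              5, 6, 1, 2,
              7, 8, 3, 1,
              2, 4, 9, 5), nrow = 4, byrow = TRUE)

# Keep the maximum value in each non-overlapping 2x2 window
max_pool_2x2 <- function(x) {
  out <- matrix(0, nrow(x) / 2, ncol(x) / 2)
  for (i in seq_len(nrow(out))) {
    for (j in seq_len(ncol(out))) {
      out[i, j] <- max(x[(2 * i - 1):(2 * i), (2 * j - 1):(2 * j)])
    }
  }
  out
}

max_pool_2x2(m)  # 2x2 result: first row 6, 4; second row 8, 9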

Image classification

The basic idea behind image classification is to link features of an image to a single label or class. For example, we might have a photograph of a cat with the label Cat, and a photograph of a dog with the label Dog. The goal of the model is to identify what shapes, colors, and combinations of these can help differentiate between these two classes.

Data

We’ll use a dataset from Kaggle (https://www.kaggle.com/moltean/fruits) containing over 90,000 images of fruits and vegetables. We’ll just be using a subset of these data, and you’ll need to download the zip file fruits.zip from GitHub. I’d suggest making a new folder for this lab and moving the zip file there once it is downloaded.

Once you have unzipped the data, take a look in the fruits folder. This is already set up in the standard way for image classification. There are two high-level folders, Training and Test, each of which contains a subset of the images. Not too surprisingly, the first will be used to train the model, and the second to test it. Within each of these, there is a set of folders, one per class of fruit. The folder names are used as the labels for the images, and this is what TensorFlow will use.

  • fruits
      • Training
          • Apple
          • Banana
          • …
      • Test
          • Apple
          • Banana
          • …

Note that this is a specific format for image classification. There are other options, for example, if you are working with continuous outcomes.

Tensorflow and Keras

The TensorFlow and Keras libraries were first designed to be used in Python. There has been quite a lot of work recently to allow access from R, mainly using the reticulate package, which allows exchange between R and Python. You do still need a version of Python for this to work, but this can be installed directly from R.

First install the R keras package:

install.packages('keras')

Once this step is complete, load the library and run the install_keras() function. This will install a local version of Python, as well as download all the necessary add-ons to build deep learning models:

library(keras)
install_keras()

This can take a few minutes, but should all load without problem. If you do get error messages, please let us know so we can try to fix them.

Data processing

Let’s start, as usual, by loading the libraries we’ll need for the lab:

library(dplyr)
library(tidyr)
library(keras)
library(ggplot2)

Now set the path to the folder containing the training images you downloaded. If you’ve copied these to your datafiles folder, this will look something like this:

train_image_files_path = "./datafiles/fruits/Training/"

If you have any questions about setting this path, please ask.
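
As a quick optional check, you can list the class subfolders that Keras will later treat as labels (this just uses base R's list.dirs() on the path you set above):

# List the class subfolders under the training directory
basename(list.dirs(train_image_files_path, recursive = FALSE))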

You can visualize any of the images using the imager package (you’ll need to install this):

library(imager)
im = load.image(paste0(train_image_files_path, "Banana/0_100.jpg"))
plot(im)

You can try other images by changing the folder name and filename. Note that these are somewhat idealized images, with a blank white background.

Next, we’ll define the classes that we are going to process. There are 16 different types of fruit in the dataset, making this a multi-class classification problem.

fruit_list = c("Kiwi", "Banana", "Apricot", "Avocado", 
                "Cocos", "Clementine", "Mandarine", "Orange",
                "Limes", "Lemon", "Peach", "Plum", 
                "Raspberry", "Strawberry", "Pineapple", "Pomegranate")

# store the number of classes
output_n = length(fruit_list)

The original images are 100x100 pixels. Ideally we’d use these at their full resolution, but as this is an example, we’ll reduce the resolution to 20x20 to make this a bit faster to run, so we set the target size here. We’ll use these values (stored in target_size) to define the input tensor shape in the network. (A good follow-up test would be to increase this and see how much it impacts the predictions.)

img_width = 20
img_height = 20
target_size = c(img_width, img_height)

The other dimension we need for image processing is the image depth. These are RGB images with three color channels. The input tensors then will be rank 4, with shape (n, 20, 20, 3), where n is the number of images.

channels = 3

The last parameter we’ll set here is the batch size. This is the same parameter that we have used before to control the rate at which the network weights are updated. We’ll also use this to control the number of images that are loaded into memory at any step. This is very useful if you’re working on a computer with limited memory (like my old laptop).

batch_size = 32

Image generators

Keras has several functions to facilitate working with images. We’ll start by creating an image generator. This acts a bit like a pipeline and will carry out various pre-processing steps. These include data augmentation: simple transformations that supplement the original images. We’re not going to use that here, but some example code is given in the appendix to illustrate how you might use this.

We’ll create a generator for the training images. This will rescale each channel to a 0-1 range (the RGB channels have values between 0 and 255), and it will hold aside 30% of the training images for validation. We’ll use this to check for overfitting during the training process.

train_data_gen = image_data_generator(
  rescale = 1/255,
  validation_split = 0.3)

The next function we’ll use is a flow function. This function controls how Keras will read in the images for any training step (i.e. any update of the network weights). There are several arguments here:

  • train_image_files_path: The path to the top-level folder containing the training images
  • train_data_gen: The image data generator
  • subset: The subset of images to use from the generator for training. As we set validation_split to 0.3, this will be 1 - 0.3 = 0.7, or 70% of the images
  • target_size: The size for rescaling each image
  • class_mode: The type of label used (this will one-hot encode the labels of the images)
  • classes: The set of categories to use. This is the list we defined earlier and needs to match the subfolder names. If this is not included, this will use all subfolders, and create a list of labels
  • batch_size: The number of images to import for any update step
  • seed: a value to initialize the random number generator (this is only there to ensure consistent results)

train_image_array_gen = flow_images_from_directory(train_image_files_path, 
                                                    train_data_gen,
                                                    subset = 'training',
                                                    target_size = target_size,
                                                    class_mode = "categorical",
                                                    classes = fruit_list,
                                                    batch_size = batch_size,
                                                    seed = 42)
## Found 5401 images belonging to 16 classes.

The function will tell you how many images (and classes) it found in the folder you defined. If this is 0, go back and check the folder path you defined earlier. We’ll also create a flow for the validation images. The only difference here is in the definition of the subset.

valid_image_array_gen = flow_images_from_directory(train_image_files_path, 
                                                   train_data_gen,
                                                   subset = 'validation',
                                                   target_size = target_size,
                                                   class_mode = "categorical",
                                                   classes = fruit_list,
                                                   batch_size = batch_size,
                                                   seed = 42)
## Found 2308 images belonging to 16 classes.

Note that these flow generators contain various useful bits of information. For example, to check the number of images (we’ll also use this number when training the model):

train_samples = train_image_array_gen$n
valid_samples = valid_image_array_gen$n
print(paste(train_samples, valid_samples))
## [1] "5401 2308"

Or you can get the number of images per class (type of fruit)

cat("Number of images per class:")
## Number of images per class:
table(factor(train_image_array_gen$classes))
## 
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
## 327 343 345 299 343 343 343 336 343 345 345 313 343 345 343 345

Which suggests this is a relatively well-balanced dataset. The generator also contains other information, such as the file names, image resolution, and number of channels.
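
For example, you can look at the mapping between class labels and the integer indices that Keras assigns (read from the underlying generator object), or pull a single batch to confirm the rank-4 tensor shape described earlier. Note that generator_next() consumes a batch, so this is just a quick inspection:

# Label-to-index mapping used by Keras
train_image_array_gen$class_indices

# Peek at one batch: a list of (images, labels)
batch <- generator_next(train_image_array_gen)
dim(batch[[1]])  # images: (32, 20, 20, 3)
dim(batch[[2]])  # one-hot labels: (32, 16)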

Model definition

Let’s now set up the model. As this is quite a complex model, we’ll do this as a series of steps rather than in one go.

First, create a template sequential model

model = keras_model_sequential() 

Next we add the first hidden layer. This is a convolutional layer, where we’ll create 16 filters (or convolutions) based on the original images, with a 3x3 kernel. We’ll pad the output of this layer so that it has the same size as the input (same). Note that we also need to define the size of the input tensors (width, height and channels).

model %>%
  layer_conv_2d(filter = 16, kernel_size = c(3,3), padding = "same", input_shape = c(img_width, img_height, channels)) 

We’ll then take the output of this layer and pass it through a ReLU activation function (this could have been included directly in the convolutional layer, but this allows a little more control on the process):

model %>%
  layer_activation("relu") 

Now, we add a max-pooling layer. As a reminder, this reduces the resolution of the output from the previous layer by taking the maximum over each 2x2 window, forcing the next layer of the network to focus on larger image features. We’ll also add a dropout layer. This is a form of regularization: during training it randomly sets a proportion of the layer’s outputs to zero (i.e. they make no contribution to the model), which can reduce overfitting.

model %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_dropout(0.25)

Let’s add another convolutional layer, this time with 32 filters, and pass this through a different activation function (a leaky ReLU)

model %>%
  layer_conv_2d(filter = 32, kernel_size = c(3,3), padding = "same") %>%
  layer_activation_leaky_relu(0.5) 

We’ll take the output of this function and normalize it. Batch normalization adjusts the output of the previous layer so that it has a mean close to zero and a standard deviation close to one. This helps avoid gradient problems caused by very small or very large values.

model %>%
  layer_batch_normalization()

And we’ll run the output of this through a max-pooling function with dropout:

model %>%
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_dropout(0.25)

Now we’ll add layers to connect the output of this last max-pooling step to the output (the fruit classes). The first thing we need to do is to flatten the output. The output of the max-pooling is a tensor of shape (5, 5, 32): the size of 5 is a result of the two max-pooling operations (20 → 10 → 5) and the 32 is the number of filters from the second convolution. The layer_flatten() function will flatten this into a rank 1 tensor of shape (800), since 5 × 5 × 32 = 800.

model %>%
  layer_flatten()
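
If you want to confirm the shape at this point in the build, one optional check is to query the model’s output shape (this reads the output_shape attribute of the underlying Keras model via reticulate):

# The batch dimension is reported as NULL; the flattened features are 800
model$output_shape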

Next we’ll pass this flattened layer through a dense layer, with a ReLU activation and a dropout

model %>% 
  layer_dense(100) %>%
  layer_activation("relu") %>%
  layer_dropout(0.5)

Finally, we need to output predictions. As this is a multiclass task, the final layer needs to have the same number of nodes as classes (16). This is passed through a softmax activation function. This transforms the predictions for all classes into probabilities (i.e. they have to sum to 1).

model %>%
  layer_dense(output_n) %>% 
  layer_activation("softmax")

More practically, we’ll create a function that will build the model in one go. This will allow us to easily create new versions for testing.

create_model = function() {
  model <- keras_model_sequential() %>%
    
    # add convolution layer
    layer_conv_2d(filter = 16, kernel_size = c(3,3), padding = "same", 
                  input_shape = c(img_width, img_height, channels)) %>%
    layer_activation("relu") %>%
    
    # Use max pooling
    layer_max_pooling_2d(pool_size = c(2,2)) %>%
    layer_dropout(0.25) %>%
    
    # Second hidden layer
    layer_conv_2d(filter = 32, kernel_size = c(3,3), padding = "same") %>%
    layer_activation_leaky_relu(0.5) %>%
    layer_batch_normalization() %>%
    
    # Use max pooling
    layer_max_pooling_2d(pool_size = c(2,2)) %>%
    layer_dropout(0.25) %>%
    
    # Flatten and feed into dense layer
    layer_flatten() %>%
    layer_dense(100) %>%
    layer_activation("relu") %>%
    layer_dropout(0.5) %>%
    
    # Outputs 
    layer_dense(output_n) %>% 
    layer_activation("softmax")
  
  return(model)
}

Let’s take a look at the whole thing:

model = create_model()
summary(model)
## Model: "sequential_1"
## ________________________________________________________________________________
##  Layer (type)                  Output Shape               Param #    Trainable  
## ================================================================================
##  conv2d_3 (Conv2D)             (None, 20, 20, 16)         448        Y          
##  activation_5 (Activation)     (None, 20, 20, 16)         0          Y          
##  max_pooling2d_3 (MaxPooling2  (None, 10, 10, 16)         0          Y          
##  D)                                                                             
##  dropout_5 (Dropout)           (None, 10, 10, 16)         0          Y          
##  conv2d_2 (Conv2D)             (None, 10, 10, 32)         4640       Y          
##  leaky_re_lu_1 (LeakyReLU)     (None, 10, 10, 32)         0          Y          
##  batch_normalization_1 (Batch  (None, 10, 10, 32)         128        Y          
##  Normalization)                                                                 
##  max_pooling2d_2 (MaxPooling2  (None, 5, 5, 32)           0          Y          
##  D)                                                                             
##  dropout_4 (Dropout)           (None, 5, 5, 32)           0          Y          
##  flatten_1 (Flatten)           (None, 800)                0          Y          
##  dense_3 (Dense)               (None, 100)                80100      Y          
##  activation_4 (Activation)     (None, 100)                0          Y          
##  dropout_3 (Dropout)           (None, 100)                0          Y          
##  dense_2 (Dense)               (None, 16)                 1616       Y          
##  activation_3 (Activation)     (None, 16)                 0          Y          
## ================================================================================
## Total params: 86932 (339.58 KB)
## Trainable params: 86868 (339.33 KB)
## Non-trainable params: 64 (256.00 Byte)
## ________________________________________________________________________________

Our model has a little under 87,000 parameters or weights to train (hence the need for a lot of images). Note that there is a small set of non-trainable parameters from the batch normalization layer.

The next step is to compile the model. This adds an optimization function, a loss function, and a performance metric. A good option for the loss function is categorical_crossentropy, which measures how far the predicted class probabilities are from the observed classes, and we’ll use the RMSprop optimizer to update the weights via backpropagation. The metric summarizes the error between the predicted and observed classes aggregated across all samples. Accuracy is the simplest of these for classification exercises and gives the proportion of all images that are correctly classified.

model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_rmsprop(),
  metrics = "accuracy"
)
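
As a quick illustration of what the categorical cross-entropy measures, here is a toy calculation for a single image with made-up numbers. With one-hot labels, the loss reduces to the negative log of the probability assigned to the true class, so it is near zero for a confident correct prediction and grows as the prediction gets worse.

# Toy cross-entropy for one sample (values are made up)
y_true <- c(0, 1, 0)            # one-hot label: the true class is class 2
y_pred <- c(0.10, 0.80, 0.10)   # predicted probabilities from a softmax
-sum(y_true * log(y_pred))      # about 0.22; would be about 2.3 if the true class got probability 0.1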

Training the model

We’ll now train the model for 20 epochs. As we are using a data generator to supply the images and labels, we pass the generator directly to the fit() function we have used previously. We specify:

  • train_image_array_gen: The generator of the training samples
  • steps_per_epoch: The number of update steps in each epoch (usually just number of samples / batch size)
  • epochs: number of full training iterations
  • validation_data: the generator of the validation samples
  • validation_steps: the number of validation steps per epoch

This takes a few minutes to train (on my laptop). It’s worth remembering what is going on here: the algorithm is reading in batches of 32 images, rescaling them, updating the model weights through backpropagation, and then repeating the whole thing 20 times. As we previously defined a separate validation set (and image generator), this routine will calculate two losses:

  • The training loss: how accurately the model can predict the images that are being used to update the weights
  • The validation loss: how accurately the model can predict a set of training images that are not used in updating the weights

As the model continues to train, you should see the loss (the cross-entropy) decrease for both of these, although it will likely stabilize at a certain point. The accuracy should (hopefully) increase over time.

epochs = 20
hist <- model %>% fit(
  train_image_array_gen,
  
  steps_per_epoch = as.integer(train_samples / batch_size), 
  epochs = epochs, 
  
  validation_data = valid_image_array_gen,
  validation_steps = as.integer(valid_samples / batch_size)
)

Now plot the evolution of the loss function and performance metric:

plot(hist)

The plot shows a steep decline in both loss values, but no real improvement beyond epoch 3. It’s quite likely that this model has overfit, i.e. it has become too tuned to the training data to generalize to new images. There are a variety of ways we can avoid this, but a simple one is to slow the rate at which the model learns. We’ll refit the model, but add a learning_rate argument to the optimizer in the compilation step.

model = create_model()
model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_rmsprop(learning_rate = 0.0001),
  metrics = "accuracy"
)
hist <- model %>% fit(
  train_image_array_gen,
  
  steps_per_epoch = as.integer(train_samples / batch_size), 
  epochs = epochs, 
  
  validation_data = valid_image_array_gen,
  validation_steps = as.integer(valid_samples / batch_size)
)
plot(hist)

The plot shows a good evolution of both the loss and performance metrics.

Model evaluation

We’ll now evaluate this model on the set of test images. Normally, we’d retrain the model using the full training set (including the validation set), but to save time, we’ll just proceed with the existing model. In order to do this, we first need to create a new image generator for the test images:

test_image_files_path <- "./datafiles/fruits/Test/"

test_data_gen <- image_data_generator(rescale = 1/255)

test_generator <- flow_images_from_directory(
  test_image_files_path,
  test_data_gen,
  target_size = target_size,
  class_mode = "categorical",
  classes = fruit_list,
  batch_size = 1,
  shuffle = FALSE,
  seed = 42)
## Found 2592 images belonging to 16 classes.

The only differences here from the previous code are that we use the test image folder and we no longer specify a validation split. Now we can use this to evaluate the model using evaluate():

results <- model %>%
  evaluate(test_generator, 
           steps = as.integer(test_generator$n))
## 2592/2592 - 17s - loss: 0.2086 - accuracy: 0.9325 - 17s/epoch - 7ms/step
print(results)
##      loss  accuracy 
## 0.2086322 0.9324846

Which gives us an accuracy of about 0.93, making this a very good classifier. It is worth noting that this is partly because the images have all been cleaned and prepared; achieving this level of accuracy with images taken ‘in the wild’ would require much more work in setting up and training the model.

We’ll finish by obtaining predictions for the set of test images, and building a confusion matrix based on these. Predictions for new samples can be obtained by passing the test generator to the predict() function. Note that if you were predicting for completely new images, you would need to make a new generator for them. We’ll start by resetting the image generator (this just ensures that everything will align in the output), then obtain the predictions:

test_generator$reset()
predictions <- model %>% 
  predict(
    test_generator,
    steps = as.integer(test_generator$n)
  ) 
## 2592/2592 - 3s - 3s/epoch - 1ms/step

For each image, there is the predicted probability of each class:

predictions[1,]
##  [1] 9.984205e-01 1.300145e-07 1.360483e-10 8.630380e-05 1.362863e-03
##  [6] 7.293721e-14 5.064761e-08 1.394456e-08 2.414268e-05 1.028145e-04
## [11] 1.220951e-06 1.428428e-09 5.182119e-12 5.544225e-09 1.891201e-06
## [16] 4.449551e-09

To get the predicted labels, we simply need to find the column with the highest probability. We can use R’s max.col() function for this:

pred_class <- max.col(predictions) 
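
If you would rather see the predicted labels as fruit names instead of column numbers, you can index into fruit_list; this works because we passed classes = fruit_list to the generators, so the output columns are in that same order:

# Map column numbers (1-16) back to fruit names
pred_label <- fruit_list[pred_class]
head(pred_label)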

The test generator stores the observed labels for the test set, so let’s extract that and store it as obs_class:

obs_class <- test_generator$classes

We can now make a confusion matrix between the observed and predicted classes:

pred_table <- table(obs_class, pred_class)

Note that the row and column indices are offset by one (R’s max.col() numbers classes from 1, while Keras numbers them from 0). We can simply replace both with the fruit labels; in the resulting table, rows are the observed classes and columns are the predicted classes.

rownames(pred_table) <- colnames(pred_table) <- fruit_list
pred_table
            Kiwi Banana Apricot Avocado Cocos Clementine Mandarine Orange Limes Lemon Peach Plum Raspberry Strawberry Pineapple Pomegranate
Kiwi         156      0       0       0     0          0         0      0     0     0     0    0         0          0         0           0
Banana         0    166       0       0     0          0         0      0     0     0     0    0         0          0         0           0
Apricot        0      0      68       0     0          0         0     96     0     0     0    0         0          0         0           0
Avocado        0      0       0     143     0          0         0      0     0     0     0    0         0          0         0           0
Cocos          0      0       0       0   166          0         0      0     0     0     0    0         0          0         0           0
Clementine     0      0       0       0     0        166         0      0     0     0     0    0         0          0         0           0
Mandarine      0      0       0       0     0          0       166      0     0     0     0    0         0          0         0           0
Orange         0      0       0       0     0          0         0    160     0     0     0    0         0          0         0           0
Limes          0      0       0       0     0          0         0      0   166     0     0    0         0          0         0           0
Lemon          0      0       0       0     0          0         0      0     0   164     0    0         0          0         0           0
Peach          0      0       0       0     0          0         0     37     0     0   119    0         0          0         0           8
Plum           0      0       0       0     0          0         0      0     0     0     0  151         0          0         0           0
Raspberry      0      0       0       0     0          0         0      0     0     0     0    0       166          0         0           0
Strawberry     0      0       0       0     0          0         0      0     0     0     0    0         0        164         0           0
Pineapple      0      0       0       0    34          0         0      0     0     0     0    0         0          0       132           0
Pomegranate    0      0       0       0     0          0         0      0     0     0     0    0         0          0         0         164

This shows why we got such a high accuracy - nearly all the images are correctly classified. Let’s finish by plotting this result. Here, we convert the confusion matrix into a long data frame and calculate the number of images in each observed class. We then use this to calculate the percentage of each observed class assigned to each predicted label, and use ggplot2’s geom_tile() to plot this out. Note that you will need the reshape2 package here:

library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
pred_df <- melt(pred_table, value.name = "count") %>%
  group_by(obs_class) %>%
  mutate(n = sum(count)) %>%
  ungroup()
p <- pred_df %>%
  filter(count > 0) %>%
  mutate(percentage_pred = count / n * 100) %>%
  ggplot(aes(x = obs_class, y = pred_class, 
             fill = percentage_pred,
             label = round(percentage_pred, 2))) +
  geom_tile() +
  #scale_fill_continuous() +
  scale_fill_gradient(low = "blue", high = "red") +
  geom_text(color = "white") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(x = "True class", 
       y = "Predicted class",
       fill = "Percentage\nof predictions",
       title = "True v. predicted class labels", 
       subtitle = "Percentage of test images predicted for each label")

print(p)

Appendix

This is an example of an image generator that will perform data augmentation. In each epoch, each image is transformed according to a set of random modifications. The parameters here set limits on the amount of transformation that any method will carry out. For example, images will be randomly rotated by an amount between + and - 40 degrees.

train_data_gen <- image_data_generator(
  rescale = 1/255,
  rotation_range = 40,        ## Random rotation (+/- 40 degrees)
  width_shift_range = 0.2,    ## Random horizontal shift (+/- proportion of image size)
  height_shift_range = 0.2,   ## Random vertical shift (+/- proportion of image size)
  shear_range = 0.2,          ## Shear angle in counter-clockwise direction in degrees
  zoom_range = 0.2,           ## Zoom range. Zooms by 1 +/- 0.2 from original
  horizontal_flip = TRUE,     ## Randomly flips 50% of images
  fill_mode = "nearest",      ## How to fill newly created pixels (nearest neighbor)
  validation_split = 0.3)