How do convolutional neural networks work?


Convolutional neural networks, CNNs, convnets, call them what you like: these powerful neural nets remain one of the most popular types of model for classifying images. But have you ever wondered how they work? In this blog post, we’ll go through this step by step, with lots of diagrams and code examples to make it as concrete as possible.

For this blog post, I relied heavily on François Chollet’s wonderful book, Deep Learning with Python. If you want one book to gently introduce you to neural networks, this would be the one I recommend. I also used these lecture notes from the University of Washington’s Introduction to Machine Learning course.

How do CNNs process images?

CNNs take in raw images as inputs, breaking them down to the level of individual pixels. Below you can see a greyscale 16 \(\times\) 16 pixel image which has been divided into its 256 constituent pixels. Each of these pixels serves as an input feature for the model.

Within each pixel, there is information about the intensity of the image at that point. For greyscale images, this is a single value describing how light or dark the pixel is. This means that greyscale images have only one channel, as you can see below.

However, for colour images, we have information about the intensity of red, green and blue in each pixel. This gives us one channel with information about the intensity of red in the image, another with the intensity of green, and a final one with the intensity of blue, so colour images have three channels in total. You can see this in the colour 16 \(\times\) 16 pixel image below: the image can be decomposed into its red, green and blue channels (I used this cool channel splitter app to do so), with each channel’s values being the intensity of red, green or blue in that pixel.

As each pixel is treated as a feature of the image, the total number of input features is the image height \(\times\) width \(\times\) number of channels. So for our 16 \(\times\) 16 greyscale image above, the total number of input features would be 16 \(\times\) 16 \(\times\) 1 \(=\) 256. For the 16 \(\times\) 16 colour image, it would be 16 \(\times\) 16 \(\times\) 3 \(=\) 768. You can see that using colour images is quite a bit more computationally expensive than using greyscale ones!

So for greyscale images, we now have one matrix of size height \(\times\) width, and for colour images, we have three such matrices. How do we now convert these into a model input? Dense (fully connected) neural nets need their inputs in the form of a single input row, or vector, not two- or three-dimensional arrays; as we’ll see later, convolutional layers actually work on these arrays directly, but unrolling the image into a vector is still a handy way of seeing how many input features we’re dealing with. It’s pretty straightforward to convert a 2D or 3D array into a vector: we just need to unroll it. Let’s have a look at how this might work in NumPy.

For our greyscale image, we have a 16 \(\times\) 16 matrix (a 2D array) containing all of the greyscale intensity values:

import numpy as np

image = np.array([
    [255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255], 
    [255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255],
    [255, 255, 255, 255, 255, 255, 255, 116, 116, 109, 255, 255, 255, 255, 255, 255],
    [255, 255, 255, 255, 255, 255, 116, 146, 153, 109, 255, 255, 255, 255, 255, 255],
    [255, 255, 255, 255, 255, 116, 138, 255, 255, 255, 153, 109, 255, 255, 255, 255],
    [255, 255, 255, 255, 255, 116, 153, 255, 255, 255, 146, 116, 255, 255, 255, 255],
    [255, 255, 255, 255, 255, 116, 146, 174, 153, 146, 116,  95, 255, 255, 255, 255],
    [255, 255, 255, 255, 116, 167, 174, 198,  65, 153, 146, 138,  95, 255, 255, 255],
    [255, 255, 255, 255, 116, 153, 167,  65,  53,  65, 156, 153, 116, 255, 255, 255],
    [255, 255, 255, 255, 116, 153, 167, 174,  65, 198, 167, 167,  95, 255, 255, 255],
    [255, 255, 255, 255,  95, 153, 167, 167,  65, 174, 167, 153, 109, 255, 255, 255],
    [255, 255, 255, 255,  95, 146, 153, 167, 167, 167, 153, 138, 109, 255, 255, 255],
    [255, 255, 255, 255, 255, 116,  95, 109, 116, 109, 116, 116, 255, 255, 255, 255],
    [255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255],
    [255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255],
    [255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255],
    ])
image.shape
(16, 16)

To convert this into a vector (a 1D array), we use NumPy’s ravel function:

image_input = np.ravel(image)
image_input.shape
(256,)

We can see we now have a vector with 256 features. For the colour image, we’d unroll each of the colour channel matrices and then append them to each other, giving a vector with 768 elements. Naturally, this means that all images used to train the CNN (and indeed all images that the CNN later runs predictions over) must be the same size.
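To make the colour case concrete, here’s a minimal sketch, assuming a hypothetical 16 \(\times\) 16 \(\times\) 3 array standing in for the red, green and blue channels of a colour image:

colour_image = np.random.randint(0, 256, size=(16, 16, 3))  # hypothetical RGB intensities
colour_input = np.ravel(colour_image)
colour_input.shape
(768,)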

The convolutional layer

Now that we understand how CNNs treat images as inputs, let’s have a look at the first type of hidden layer in these networks: the convolutional layer.

The goal of the convolutional layer is to try to summarise the input, called a feature map, by extracting meaningful visual features. Let’s see how this might work with the first convolutional layer in the network, where the feature map is the original image itself.

Firstly, a window, normally of size 3 \(\times\) 3, 5 \(\times\) 5 or 7 \(\times\) 7, is slid across the image. At each point that the window stops, the part of the image under it, called a patch, is extracted as a matrix.

This patch is multiplied by another matrix of the same size, called the convolutional kernel. By multiplying the patch by the convolutional kernel, the model can capture key information contained in the image, such as dominant outlines, structural patterns, and even specific objects. There are kernels that apply specific transformations to the image. For example, this matrix sharpens images:

$$\begin{bmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0 \\ \end{bmatrix} $$

and this one detects edges:

$$\begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \\ \end{bmatrix} $$

However, in a CNN the transformation applied by each convolutional kernel is not fixed in advance: it depends on the specific set of images the network is trying to classify. This means that the weights in the convolutional kernels are learned during the model’s training, just like any other weights in a neural network.
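To get a feel for what one of these fixed kernels does, here’s a quick sketch applying the edge-detection kernel above to our greyscale image using SciPy’s convolve2d (an extra dependency that the rest of this post doesn’t rely on; for a symmetric kernel like this one, convolution gives the same result as the sliding-window operation described next):

from scipy.signal import convolve2d

edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])

# 'valid' keeps only the positions where the kernel fits entirely inside
# the image, so a 16 x 16 input and a 3 x 3 kernel give a 14 x 14 output.
edges = convolve2d(image, edge_kernel, mode='valid')
edges.shape
(14, 14)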

The multiplication operation between each patch and the convolutional kernel is the sum of element-wise products, i.e., every element in a patch is multiplied by its corresponding element in the convolutional kernel, and all of these values are added together. This yields a single number for each multiplication between a patch and the convolutional kernel, which is then assigned to the output matrix of the convolutional layer. We repeat this operation with every patch in the input, until we’ve covered the whole feature map, ending up with a single filter of the feature map. You can see this in the animation below.

In the case of a 3 \(\times\) 3 window, the filter will be of size (feature map height - 2) \(\times\) (feature map width - 2), so in the case of our 16 \(\times\) 16 image, the resulting filter would be 14 \(\times\) 14. This is because with a 3 \(\times\) 3 window, the window can only fit in the image 14 times along the x-axis and 14 times along the y-axis, and each patch is reduced to a single value in the filter. With a 5 \(\times\) 5 window, the resulting filter will be (feature map height - 4) \(\times\) (feature map width - 4), and with a 7 \(\times\) 7 window, it will be (feature map height - 6) \(\times\) (feature map width - 6).
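In general, with a stride of one and no padding, the filter is (feature map size \(-\) window size \(+\) 1) pixels along each axis. A tiny helper function (purely for illustration) makes this easy to check:

def filter_size(feature_map_size, window_size):
    # With a stride of one and no padding, the window fits
    # (feature_map_size - window_size + 1) times along each axis.
    return feature_map_size - window_size + 1

[filter_size(16, w) for w in (3, 5, 7)]
[14, 12, 10]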

Let’s see how a filter is created by doing this step-by-step in NumPy.

Firstly, let’s define our window shape; we’ll then slide it across the image using the view_as_windows function from scikit-image. We’ll work with a 3 \(\times\) 3 window.

from skimage.util.shape import view_as_windows

window_shape = (3, 3)

We apply this to our greyscale image from earlier in the post. We can see we now have an array containing all of our 3 \(\times\) 3 patches, each of which is a two-dimensional NumPy array. As the window fits 14 times along the x-axis and 14 times along the y-axis, we have 196 of these patches in total.

patches = view_as_windows(image, window_shape)
patches.shape
(14, 14, 3, 3)

Let’s see the first of these patches:

patches[0, 0, :, :]
array([[255, 255, 255],
       [255, 255, 255],
       [255, 255, 255]])

We can see this is the upper-left corner of our image, where all pixels have an intensity of 255 (i.e., white).

Let’s now create our convolutional kernel. As explained earlier, the specific values would normally be learned during training, but for now we can use the kernel that sharpens images.

convolutional_kernel = np.array([[ 0, -1,  0],
                                 [-1,  5, -1],
                                 [ 0, -1,  0]])

We now need to take the element-wise product between each of the values in the patch and the values in the convolutional kernel. We can do this by multiplying the corresponding elements in the two matrices together, then summing the values. We need to sum twice, as the first sum operation sums along each column in the matrix.

sum(sum(patches[0, 0, :, :] * convolutional_kernel))
255
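As an aside, np.sum gives the same result in one step, as it sums over every element of the array at once:

np.sum(patches[0, 0, :, :] * convolutional_kernel)
255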

The value is, unsurprisingly, 255. Let’s now repeat this operation for every patch from the image. This is equivalent to sliding the window along the top row of the image matrix, calculating the sum of element-wise products each time it stops, then repeating this for the second row, and so on until the 14th row.

We’ll assign each sum of products to a filter matrix, which we’ll initialise by creating a 14 \(\times\) 14 matrix made up of zeros.

filter_matrix = np.zeros([14, 14])
for i in range(0, 14):
    for j in range(0, 14):
        filter_value = sum(sum(patches[i, j, :, :] * convolutional_kernel))
        filter_matrix[i, j] = filter_value

We can see that the resulting matrix looks quite similar to the input matrix, but smaller and with some numbers transformed. You can also see we’ve departed from the 0 to 255 greyscale values.

filter_matrix
array([[ 255.,  255.,  255.,  255.,  255.,  255.,  394.,  394.,  401.,
         255.,  255.,  255.,  255.,  255.],
       [ 255.,  255.,  255.,  255.,  255.,  533., -192.,  -53., -190.,
         401.,  255.,  255.,  255.,  255.],
       [ 255.,  255.,  255.,  255.,  533., -214.,   90.,  139., -227.,
         503.,  401.,  255.,  255.,  255.],
       [ 255.,  255.,  255.,  394., -184.,   50.,  481.,  357.,  503.,
           0., -234.,  401.,  255.,  255.],
       [ 255.,  255.,  255.,  394.,  -60.,  110.,  438.,  357.,  473.,
          90.,  -25.,  394.,  255.,  255.],
       [ 255.,  255.,  255.,  533., -104.,  113.,  118.,  125.,   53.,
          47., -150.,  575.,  255.,  255.],
       [ 255.,  255.,  394., -213.,  276.,  192.,  512., -232.,  343.,
         167.,  201., -289.,  415.,  255.],
       [ 255.,  255.,  394.,  -60.,  162.,  276., -267.,    5., -235.,
         249.,  188.,  -18.,  394.,  255.],
       [ 255.,  255.,  394.,  -39.,  176.,  174.,  406., -165.,  519.,
         147.,  267., -172.,  415.,  255.],
       [ 255.,  255.,  415., -144.,  204.,  195.,  262., -248.,  273.,
         188.,  184.,  -67.,  401.,  255.],
       [ 255.,  255.,  415., -276.,  213.,  190.,  239.,  320.,  232.,
         177.,  159., -212.,  401.,  255.],
       [ 255.,  255.,  255.,  554., -171., -158.,  -88.,  -60., -109.,
         -53., -184.,  540.,  255.,  255.],
       [ 255.,  255.,  255.,  255.,  394.,  415.,  401.,  394.,  401.,
         394.,  394.,  255.,  255.,  255.],
       [ 255.,  255.,  255.,  255.,  255.,  255.,  255.,  255.,  255.,
         255.,  255.,  255.,  255.,  255.]])

We can see how the filter has transformed the image by visualising it below using matplotlib. As we’re not dealing with greyscale values anymore, this is not strictly how the image “looks” to the network, but it gives us an approximation.

import matplotlib.pyplot as plt

plt.imshow(filter_matrix, 
           cmap='gray')
plt.colorbar();

If you want the filter to be the same size as the input feature map, you can apply padding. This is where extra rows and columns of zeros are added around the outside of the feature map. The number added is equal to the number of rows and columns “lost” when calculating the values in the filter, so in the case of a 3 \(\times\) 3 window, one row or column of zeros would be added to each side, adding two to both the height and the width. You can see this in the diagram below.
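We can also sketch zero padding in NumPy using np.pad, adding one row or column of zeros to each side of our image so that a 3 \(\times\) 3 window produces a 16 \(\times\) 16 filter:

padded_image = np.pad(image, pad_width=1, mode='constant', constant_values=0)
padded_image.shape
(18, 18)

view_as_windows(padded_image, (3, 3)).shape
(16, 16, 3, 3)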

Now, we’ve only discussed making one filter from our input image. However, one of the hyperparameters you can set in the convolutional layer is the number of filters you want the model to extract, and normally you would make many of these (for example, 32 or 64). For each of these, a different convolutional kernel will be used, with the specific weights for that kernel calculated during model training. Each filter will then extract specific features from the image, meaning that a single convolutional layer can capture many different pieces of visual information.

As a result, the output from the convolutional layer also has three dimensions: height and width, which as we discussed above refer to the dimensions of each of the filters, and number of channels, which now refers to the number of filters extracted from the feature map.
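Here’s a minimal sketch of that three-dimensional output, using four hypothetical random kernels to stand in for the kernels the network would learn during training:

num_filters = 4
kernels = np.random.randn(num_filters, 3, 3)  # stand-ins for learned kernels

layer_output = np.zeros([14, 14, num_filters])
for k in range(num_filters):
    for i in range(0, 14):
        for j in range(0, 14):
            # each kernel produces its own 14 x 14 filter, stored as one channel
            layer_output[i, j, k] = np.sum(patches[i, j, :, :] * kernels[k])
layer_output.shape
(14, 14, 4)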

The final thing to cover about the convolutional layer is a parameter called the stride. So far, when we moved the window over the feature map, we were just moving it one pixel at a time. However, it is possible to increase the stride so that the window moves two or more pixels at a time. This makes the output matrix smaller, but strided convolutions are rarely used in practice for image classification. Instead, we rely on the next type of layer in CNNs, the pooling layer, to reduce the size of the feature map as it moves through the network.
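As a quick check of how the stride affects the output size, sliding our 3 \(\times\) 3 window over the image with a stride of two (the step argument of view_as_windows) leaves room for only seven positions along each axis:

view_as_windows(image, (3, 3), step=2).shape
(7, 7, 3, 3)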

The pooling layer

Following the convolutional layer, there is usually another type of layer called a pooling layer. Pooling is designed to aggressively downsample the feature map by again sliding windows over each filter, and taking some aggregation of the values inside the window. There are several operations that can be used, but the most common of these is taking the maximum (known as max pooling). Max pooling usually uses a 2 \(\times\) 2 window and a stride of 2, so that we get non-overlapping patches that we can conduct the pooling operation over. We can see how max pooling works below.

Let’s say that we apply a max pooling operation to our 14 \(\times\) 14 feature map. To remind you, this feature map is one of the filters from the previous convolutional layer, which was the result of multiplying our input matrix by the sharpening convolutional kernel. We can see how this would work in this diagram:

The image is broken into 2 \(\times\) 2 patches, and within each, the maximum value is taken. As a result, we end up with a 7 \(\times\) 7 matrix.

We can again see how this works step-by-step in NumPy:

Firstly, we define our 2 \(\times\) 2 window, and break the feature map into patches. We use the step argument to define a stride of 2.

window_shape = (2, 2)
patches = view_as_windows(filter_matrix, window_shape, step = 2)
patches.shape
(7, 7, 2, 2)

You can see that, as in the diagram above, we can fit 7 windows along the x-axis and 7 along the y-axis, leaving us with 49 patches of size 2 \(\times\) 2.

Again, let’s have a look at the first patch:

patches[0, 0, :, :]
array([[255., 255.],
       [255., 255.]])

We can get the maximum value of each patch by calling the array’s max() method.

patches[0, 0, :, :].max()
255.0

Now, all we need to do is take the max of each patch. We’ll again use a nested for loop to do this:

max_pooling_matrix = np.zeros([7, 7])
for i in range(0, 7):
    for j in range(0, 7):
        max_value = patches[i, j, :, :].max()
        max_pooling_matrix[i, j] = max_value
max_pooling_matrix
array([[255., 255., 533., 394., 401., 255., 255.],
       [255., 394., 533., 481., 503., 401., 255.],
       [255., 533., 113., 438., 473., 575., 255.],
       [255., 394., 276., 512., 343., 201., 415.],
       [255., 415., 204., 406., 519., 267., 415.],
       [255., 554., 213., 320., 232., 540., 401.],
       [255., 255., 415., 401., 401., 394., 255.]])

And there we have it: from our 14 \(\times\) 14 filter, we have a 7 \(\times\) 7 output. If we were doing this over the whole output of a convolutional layer, we would repeat this operation for each channel (filter), leaving us with an output array again of size height \(\times\) width \(\times\) number of channels, but this time the height and width are half of what they were in the input array. As we complete the pooling operation over each filter separately, the number of channels remains the same.
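Here’s a minimal sketch of that per-channel behaviour, using a hypothetical random stack of four 14 \(\times\) 14 filters standing in for the output of a convolutional layer:

feature_maps = np.random.rand(14, 14, 4)  # hypothetical stack of four filters

pooled = np.zeros([7, 7, 4])
for c in range(4):
    # pool each channel separately with a 2 x 2 window and a stride of 2
    channel_patches = view_as_windows(feature_maps[:, :, c], (2, 2), step=2)
    pooled[:, :, c] = channel_patches.max(axis=(2, 3))
pooled.shape
(7, 7, 4)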

And with that, we’ve understood the two types of layers that make up CNNs! These layers are repeated, with a convolutional layer followed by a pooling layer, until the network gets to the final dense prediction layers (which we’ll cover in a second). The output of the pooling layer becomes the input for the next convolutional layer, with each of its channels being treated as a feature map in that convolutional layer.

How many of these layers you include is, again, a subjective decision and will be based on how big and complex your images are. For example, in Chollet’s chapter on CNNs, he shows two examples. For the MNIST digit classification dataset, which has 28 \(\times\) 28 pixel greyscale images, he uses three convolutional layers and three max pooling layers. For the larger and more complex images in the Dogs vs Cats dataset, he uses four of each layer. The idea is to have enough convolutional and pooling layers so that your images are relatively small by the time they hit the dense prediction layers.

So, the final question you might ask about these two layers is: why bother with these complex layer types? Why not just use dense layers to learn the features from the images?

There are a few reasons for this. The convolutional layers do the following:
They learn translation-invariant patterns: This means that the network is able to learn patterns that make up meaningful features, no matter where they occur in the image. This is because the convolutional layers force the network to focus on only subsets of the image at a time, which means that if the network learns, for example, to recognise diagonal lines in one part of an image, it can recognise them anywhere. If we used a dense layer, the network would need to learn the same feature all over again if it occurred in a different place in another image. This mimics the way our own visual system recognises visual features, meaning the network can learn a lot more about images with fewer training examples.

They are capable of learning hierarchies of visual features: Convolutional layers first learn simple features, such as the orientation of lines. As the features are passed through more convolutional layers, these features will be aggregated more and more, first into shapes, then features, as you can see below. This actually also mimics how visual inputs are processed in our visual system.

Dinosaur icon credit

The pooling layers also play an important role in how the network learns. Without pooling, the area of the original image that each window can “see” would grow only slowly as we progress through the network: our 3 \(\times\) 3 window would analyse a 5 \(\times\) 5 area of the original image in the second layer, and a 7 \(\times\) 7 area in the third layer … as you can see, even with our small 16 \(\times\) 16 image, we’re nowhere near the network being able to analyse the whole image. The pooling operation halves the feature space each time, allowing the convolutional layers to more efficiently “see” larger and larger parts of the input image as the network progresses, and allowing it to build up that nice hierarchy of features.
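As a rough check of that arithmetic, each extra 3 \(\times\) 3 convolution (with a stride of one and no pooling in between) only adds two pixels to the area of the original image that a single output value can “see”:

def receptive_field(num_conv_layers, window_size=3):
    # stacked stride-one convolutions: each layer adds (window_size - 1)
    # pixels to the receptive field
    return 1 + num_conv_layers * (window_size - 1)

[receptive_field(n) for n in (1, 2, 3)]
[3, 5, 7]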

Additionally, as we have multiple channels per convolutional layer, without downsampling the feature map we’d potentially end up with an absolutely enormous number of coefficients by the time we hit the dense layers we need for classification. As well as being computationally inefficient, this would also make the model prone to overfitting. These two reasons are also why you need to include enough convolutional and pooling layers to shrink the feature map before you hit the final layers. And speaking of these final layers …

The dense layers

Finally, we of course need to get the CNN to solve a classification problem, not just recognise image features! This is where the final layers come in.

After the final pooling layer, we’re of course left with a three-dimensional array, with the dimensions height, width and number of channels. In order to connect this to a dense layer, we first need to flatten this array into a one-dimensional array, which becomes a regular neural net layer with its number of nodes equal to height \(\times\) width \(\times\) number of channels.
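For example, flattening a hypothetical 7 \(\times\) 7 \(\times\) 4 pooled output (like the one from the pooling sketch above) gives a layer with 196 nodes:

pooled_output = np.random.rand(7, 7, 4)  # hypothetical output of the final pooling layer
np.ravel(pooled_output).shape
(196,)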

From here, CNNs act pretty much like a normal neural network. We first need to create a fully connected layer, which means we create a layer with a chosen number of nodes and connect every node in the flattened pooling layer to every node in this layer. You can see here how not aggressively downsampling your feature maps can blow out the size of your network, as the number of parameters will be enormous if the feature map is still large by the end of the network.
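A rough, hypothetical comparison makes the point: connecting a 14 \(\times\) 14 \(\times\) 64 feature map versus a 7 \(\times\) 7 \(\times\) 64 one to a 64-node dense layer (ignoring biases) gives:

weights_without_pooling = 14 * 14 * 64 * 64
weights_with_pooling = 7 * 7 * 64 * 64
weights_without_pooling, weights_with_pooling
(802816, 200704)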

Finally, we need to add our output layer. For classification problems, this layer will have a number of nodes equal to the number of prediction classes (so for the MNIST dataset, this will be 10, one for each digit). This will be a softmax layer, which returns an array of probability scores, one for each of the nodes in the output layer. Each probability corresponds to how likely the network thinks that class is, and the class with the highest probability will be the network’s prediction for that image.
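To tie all of these pieces together, here’s a minimal sketch of a small convnet in Keras, in the spirit of Chollet’s MNIST example (the exact number of layers and filters here is an illustrative choice, not a prescription):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),                # 28 x 28 greyscale images, one channel
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.Flatten(),                              # unroll the final feature map
    layers.Dense(64, activation="relu"),           # fully connected layer
    layers.Dense(10, activation="softmax"),        # one probability per digit
])
model.summary()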

And there we have it! I hope this post has helped you to understand one of the most important models in computer vision, and see how this interesting model makes accurate predictions about images by processing them in a similar way to our own visual system.