Marco Moscatelli

Posted on Sep 26, 2023 • Edited on Nov 7, 2023

A gentle introduction to Convolutions (Visually explained)

Have you ever asked yourself how it is possible to identify all the horizontal lines in an image with just a few dot products between matrices? No? Me neither, but convolutions can do exactly that, and plenty of other magic too!

The ability of computers to recognize faces, identify objects, and drive cars autonomously builds on a mathematical operation called convolution. This operation was already being studied in the 19th century by mathematicians such as Siméon Denis Poisson, a French mathematician and physicist.

But it wasn’t until the 1980s that convolution found its way into the field of computer vision, thanks to the pioneering work of researchers such as Yann LeCun, Geoff Hinton, and Yoshua Bengio.

Since then, convolution has become a foundation of modern machine learning, enabling computers to process images, videos, and other forms of visual data with unparalleled accuracy and efficiency.

In this article, we’ll explore the world of convolutions, and we’ll even “convolute” a duck. Stick with me to the end to see the power of this beautiful tool.

What is a convolution?

Convolution is a simple mathematical operation. It involves taking a small matrix, called a kernel or filter, and sliding it over an input image. At each position, the kernel is multiplied element-wise with the patch of the image it overlaps and the products are summed (a dot product), which gives one pixel of the output. The process is repeated for every position of the kernel.

The kernel is designed to highlight certain features of the input image, such as edges, corners, or textures, by detecting patterns of pixels that match certain criteria.

You can perform convolution in 1D, 2D, and even in 3D.

You can see from the GIF above that we compute this dot product between the kernel and the image patch at every “jump” of the kernel, and add the result as a new pixel in the output.
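To make the sliding operation concrete, here is a minimal pure-Python sketch of it, assuming no padding and a stride of 1. The helper name `slide_kernel` and the toy values are made up for illustration, and this is not how PyTorch implements the operation internally.

```python
def slide_kernel(image, kernel):
    """Naive 2D sliding-window dot product (no padding, stride 1)."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = img_h - k_h + 1, img_w - k_w + 1

    output = []
    for top in range(out_h):          # vertical position of the kernel
        row = []
        for left in range(out_w):     # horizontal position of the kernel
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    # multiply each kernel value with the pixel it overlaps
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)         # one new pixel of the output
        output.append(row)
    return output


image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [1, 0, 1, 3]]
kernel = [[1, 0],
          [0, -1]]

result = slide_kernel(image, kernel)
print(result)  # [[-4, -4, 2], [-4, -4, 4], [7, 7, 6]]
```

A 4x4 image and a 2x2 kernel give a 3x3 output, because the kernel only fits in three positions along each axis.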

You can use convolution both for upsampling (increasing resolution) and for downsampling (decreasing resolution). Later we’ll see a variant of convolution that is great for upsampling, as well as a method for downsampling.

Remember: in deep learning, convolutions are not used to predict an outcome directly but to extract features, which are then passed to feedforward network (FFN) layers that make the prediction.

How to implement convolutions in Python

For this explanation, we just need two libraries, PyTorch and Matplotlib.



import torch  
import torch.nn.functional as F  
import matplotlib.pyplot as plt

Let’s start by creating an image with random pixels and a “pretty” kernel, then plot both:

# Create a 20x20 image filled with random values
imgSize = 20
image = torch.rand(imgSize, imgSize)

# Kernels typically have an odd size so that they have a center pixel
kernelSize = 7

# Create a 2D grid of coordinates on which to evaluate the kernel
# (indexing='ij' keeps the historical behavior and avoids a deprecation warning)
X, Y = torch.meshgrid(torch.linspace(-3,3,kernelSize),torch.linspace(-3,3,kernelSize), indexing='ij')

# This is the basic formula for a 2D Gaussian; it makes our kernel "pretty"
kernel = torch.exp( -(X**2+Y**2)/7 )

fig,ax = plt.subplots(1,2,figsize=(8,6))  
ax[0].imshow(image)  
ax[0].set_title('Image')  

ax[1].imshow(kernel)  
ax[1].set_title('Convolution kernel')  

plt.show()

Isn’t this kernel beautiful? Now it is time to talk about the part that you have been waiting for… The implementation of convolution.



# PyTorch expects 4D tensors here:
#   image:  (batch_size, in_channels, imgSizeY, imgSizeX)
#   kernel: (out_channels, in_channels, kernelSizeY, kernelSizeX)
image_processed = image.view(1, 1, imgSize, imgSize)  
kernel_processed = kernel.view(1,1, kernelSize, kernelSize)  

# implementing the convolution  
convolution = F.conv2d(image_processed, kernel_processed)  

plt.title("Convolution")  
# we need to bring back the convolution to a format understandable by Matplotlib  
plt.imshow(convolution.view(convolution.shape[2], convolution.shape[3]))
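As a quick sanity check on shapes, the sketch below repeats the call on random tensors with the same sizes as above and prints the output shape. It is a standalone illustration, not part of the original walkthrough.

```python
import torch
import torch.nn.functional as F

imgSize, kernelSize = 20, 7
image = torch.rand(1, 1, imgSize, imgSize)          # (batch, in_channels, height, width)
kernel = torch.rand(1, 1, kernelSize, kernelSize)   # (out_channels, in_channels, kH, kW)

convolution = F.conv2d(image, kernel)

# Without padding and with stride 1, each side shrinks by kernelSize - 1
expected_side = imgSize - kernelSize + 1
print(convolution.shape)  # torch.Size([1, 1, 14, 14])
```

The 7x7 kernel only fits in 14 positions along each axis of the 20x20 image, which is why the plotted result is smaller than the input.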

You should get a result similar to this:

Time to create a “convoluted” duck!

Don’t you think it is better to use real images? We will use a vertical kernel to identify vertical lines and a horizontal kernel to identify horizontal lines, and the image will be this beautiful duck that is ready to be “convoluted”:



# to read an image from a url  
from imageio import imread  
# to average over the color channels
import numpy as np  

duck = imread('https://media.istockphoto.com/id/464988959/photo/mallard-duck-on-white-background.jpg?b=1&s=612x612&w=0&k=20&c=mzgsLywgUZ_mAQPWVPJgbFp68doxm5d7fXy-huiaNSY=')  
print(duck.shape) # should be (612, 530, 3)  

# convert the image to 2D (grayscale) for convenience (not necessary for convolution!)
# We average over the color channels with NumPy; torch.mean(..., dim=2) would also work on a float tensor
duck = torch.Tensor(np.mean(duck,axis=2))  
duck = duck/torch.max(duck) # scaling the image  

# check the size again  
print(duck.shape) # should be (612, 530)

Time to implement the convolution on this beautiful duck:



# creating the kernels  
# vertical kernel  
VK = torch.Tensor([ [1,0,-1],  
                [1,0,-1],  
                [1,0,-1] ]).view(1, 1, 3, 3) # remember? the kernel has to have a specific format  

# horizontal kernel  
HK = torch.Tensor([ [ 1, 1, 1],  
                [ 0, 0, 0],  
                [-1,-1,-1] ]).view(1, 1, 3, 3)  

fig,ax = plt.subplots(2,2,figsize=(12,10))  

# plotting vertical kernel  
ax[0,0].imshow(VK.view(3,3))  
ax[0,0].set_title('Vertical kernel')  

# plotting horizontal kernel  
ax[0,1].imshow(HK.view(3,3))  
ax[0,1].set_title('Horizontal kernel')  

# run convolution and show the result  
convolution = F.conv2d(duck.view(1,1,duck.shape[0], duck.shape[1]),VK) # be sure to change the format  
ax[1,0].imshow(convolution.view(convolution.shape[2], convolution.shape[3]),cmap='gray',vmin=0,vmax=.01) # we scale the image using vmin and vmax  

convolution = F.conv2d(duck.view(1,1,duck.shape[0], duck.shape[1]),HK)  
ax[1,1].imshow(convolution.view(convolution.shape[2], convolution.shape[3]),cmap='gray',vmin=0,vmax=.01) # we scale the image using vmin and vmax  

plt.show()

Ready for the result?

A pretty stunning result, don’t you think?

Here we are using kernels designed by hand; in deep learning models, the kernels are learned by the network.
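To illustrate what “learned” means in practice, here is a small sketch using `nn.Conv2d`, where the kernel is a trainable parameter rather than a hand-written tensor. The layer sizes and the dummy loss are arbitrary choices made for this example.

```python
import torch
import torch.nn as nn

# A convolution layer whose 3x3 kernel starts with random values
conv_layer = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)

print(conv_layer.weight.shape)          # torch.Size([1, 1, 3, 3])
print(conv_layer.weight.requires_grad)  # True: the kernel is trainable

# A dummy forward/backward pass: the kernel receives a gradient,
# which an optimizer would then use to update its values
image = torch.rand(1, 1, 8, 8)
loss = conv_layer(image).sum()
loss.backward()
print(conv_layer.weight.grad.shape)     # torch.Size([1, 1, 3, 3])
```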

Note: PyTorch and other deep learning frameworks actually implement an operation called “cross-correlation” rather than true convolution. Stick with me until the end to know more.

Exploring convolution parameters

The most important parameters are stride and padding, and this article covers both.

Add empty borders with padding!

Padding: have you noticed that in this GIF there are zeros around the borders of the image? This is called padding, and the convolution in this case has a padding of 1. The padded values can be any number, but zero is the usual choice.
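Here is a short sketch of padding in PyTorch, assuming a 5x5 image and a 3x3 kernel chosen only for illustration. The `padding` argument of `F.conv2d` pads with zeros; for any other value, one option is to pad manually with `F.pad` first.

```python
import torch
import torch.nn.functional as F

image = torch.rand(1, 1, 5, 5)
kernel = torch.ones(1, 1, 3, 3)

# No padding: the output shrinks from 5x5 to 3x3
no_padding = F.conv2d(image, kernel)

# One ring of zeros around the image: the output keeps the 5x5 size
zero_padded = F.conv2d(image, kernel, padding=1)

# Padding with a non-zero value (0.5 is arbitrary): pad manually, then convolve
# The tuple is (left, right, top, bottom)
padded_image = F.pad(image, (1, 1, 1, 1), mode='constant', value=0.5)
custom_padded = F.conv2d(padded_image, kernel)

print(no_padding.shape, zero_padded.shape, custom_padded.shape)
```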

Increase stride if you want to jump more!

In the image above, made with my beautiful handwriting skills, you can see that we are skipping some numbers (I am using a kernel of size one for simplicity). This is because we have a stride > 1. The stride is simply the number of pixels the kernel “jumps” each time it moves in a given direction; the image above has a stride of 2. If you want to downsample, you can just increase the stride.

Suppose we apply a stride of 3 with a 3x3 kernel on a 5x5 input. What would happen on the second jump?

Exactly: the kernel would extend past the edge of the image, so the operation cannot be performed there. PyTorch simply omits any output pixel for which the kernel goes outside the image, so the only way to keep those pixels is to add padding.
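The 5x5 input and 3x3 kernel scenario can be tried directly. The sketch below uses a kernel of ones and an image filled with 0 to 24, both chosen only so the numbers are easy to follow.

```python
import torch
import torch.nn.functional as F

image = torch.arange(25, dtype=torch.float32).view(1, 1, 5, 5)
kernel = torch.ones(1, 1, 3, 3)

stride_1 = F.conv2d(image, kernel)             # kernel fits in 3 positions per axis
stride_2 = F.conv2d(image, kernel, stride=2)   # kernel fits in 2 positions per axis
stride_3 = F.conv2d(image, kernel, stride=3)   # only the first position fits

# With padding=1 the second jump lands inside the padded image again
stride_3_padded = F.conv2d(image, kernel, stride=3, padding=1)

print(stride_1.shape[2:], stride_2.shape[2:], stride_3.shape[2:], stride_3_padded.shape[2:])
print(stride_3.item())  # 54.0, the sum of the top-left 3x3 patch
```

With a stride of 3 and no padding, the output collapses to a single pixel: the second jump would need columns 3 to 5, but the image only has columns 0 to 4.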

Knowing the size of the output with convolution

For simple cases you can probably guess the size of the output just by looking at the parameters, but this becomes harder as the parameter values grow. Here’s a formula to calculate the exact size of the output:

X_h = ⌊(M_h + 2p − K) / S_h⌋ + 1

X : is the size of the output
M : is the size of the input
p : padding
K : kernel size
S : stride
h : horizontal or vertical
⌊ ⌋ : round down
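As a small worked example, the function below assumes the standard output-size formula X = ⌊(M + 2p − K) / S⌋ + 1, applied to one axis at a time. The function name is hypothetical; the numbers reuse sizes that appear earlier in the article.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Output size along one axis: floor((M + 2p - K) / S) + 1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# 20x20 random image with a 7x7 kernel
print(conv_output_size(20, 7))                        # 14

# 5x5 input, 3x3 kernel, stride 3: without and with padding of 1
print(conv_output_size(5, 3, stride=3))               # 1
print(conv_output_size(5, 3, padding=1, stride=3))    # 2

# A 612x530 image with a 3x3 kernel
print(conv_output_size(612, 3), conv_output_size(530, 3))  # 610 528
```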

Transposed convolution

Transposed convolution, often (loosely) called deconvolution, is a variant of convolution that is great for upsampling: we start with a small image and receive a bigger image as output.

To do that, multiply the whole kernel by each pixel of the image (a scalar-times-matrix multiplication) and place the result in the output at the position corresponding to that pixel. As in normal convolution, we slide over the image one pixel at a time, and wherever the placed kernels overlap in the output, their values are summed.
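The procedure can be sketched by hand and compared with PyTorch’s `F.conv_transpose2d`. The helper `transposed_conv2d_manual` is a hypothetical name, it assumes no padding, and the 2x2 input with a 3x3 kernel of ones is chosen only for illustration.

```python
import torch
import torch.nn.functional as F


def transposed_conv2d_manual(image, kernel, stride=1):
    """Scale the kernel by each pixel, paste it into the output, sum the overlaps."""
    in_h, in_w = image.shape
    k_h, k_w = kernel.shape
    out_h = stride * (in_h - 1) + k_h
    out_w = stride * (in_w - 1) + k_w
    output = torch.zeros(out_h, out_w)
    for i in range(in_h):
        for j in range(in_w):
            top, left = i * stride, j * stride
            # scalar (pixel) times matrix (kernel), added where it lands
            output[top:top + k_h, left:left + k_w] += image[i, j] * kernel
    return output


small = torch.tensor([[1., 2.],
                      [3., 4.]])
kernel = torch.ones(3, 3)

manual = transposed_conv2d_manual(small, kernel, stride=2)
builtin = F.conv_transpose2d(small.view(1, 1, 2, 2), kernel.view(1, 1, 3, 3), stride=2)

print(manual.shape)                          # torch.Size([5, 5])
print(torch.allclose(manual, builtin[0, 0])) # True
```

The center pixel of the 5x5 output is covered by all four pasted kernels, so it holds 1 + 2 + 3 + 4 = 10.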

Knowing the size of the output with transposed convolution

Just as in the previous section, there is also a formula to calculate the output size of a transposed convolution:

X_h = S_h (M_h − 1) + K − 2p

X : is the size of the output
M : is the size of the input
p : padding
K : kernel size
S : stride
h : horizontal or vertical
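As a worked example, the function below assumes the usual transposed-convolution size formula X = S (M − 1) + K − 2p for one axis (no dilation and no extra output padding). The function name is hypothetical.

```python
def transposed_conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Output size along one axis: S * (M - 1) + K - 2p."""
    return stride * (input_size - 1) + kernel_size - 2 * padding


# 2x2 input, 3x3 kernel, stride 2
print(transposed_conv_output_size(2, 3, stride=2))             # 5

# A common upsampling setup that doubles the size: 4 -> 8
print(transposed_conv_output_size(4, 4, padding=1, stride=2))  # 8
```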

There is something I have not told you…

Actually, deep learning models implement a different operation that is similar to convolution but not identical: it’s called cross-correlation.

The only difference between the two is that convolution uses an “inverted” kernel, rotated by 180°.

PyTorch’s F.conv2d() implements cross-correlation. If you want a true convolution, you can use the SciPy library or write the code yourself (just remember to rotate the kernel by 180°).
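One way to see the difference is to feed in an image that is zero everywhere except for a single 1 in the center. The sketch below assumes `torch.flip` over both spatial axes as the 180° rotation; the kernel values 1 to 9 are arbitrary but deliberately asymmetric.

```python
import torch
import torch.nn.functional as F

# An "impulse" image: a single 1 in the center of a 5x5 grid
impulse = torch.zeros(1, 1, 5, 5)
impulse[0, 0, 2, 2] = 1.0

kernel = torch.arange(1, 10, dtype=torch.float32).view(1, 1, 3, 3)

# What F.conv2d computes by default: cross-correlation
cross_correlation = F.conv2d(impulse, kernel)

# True convolution: rotate the kernel by 180° (flip both spatial axes) first
flipped_kernel = torch.flip(kernel, dims=[2, 3])
true_convolution = F.conv2d(impulse, flipped_kernel)

print(cross_correlation[0, 0])  # the kernel rotated by 180°
print(true_convolution[0, 0])   # the kernel itself
```

Convolving an impulse reproduces the kernel unchanged, while cross-correlation returns it rotated. For a symmetric kernel, such as the Gaussian used earlier, the two operations give the same result.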

Maximizing the performance of convolutions using pooling

The main goal of pooling is to stabilize the results and make the network more robust. It does this by increasing the receptive field (stay with me, I’ll explain later what that means and why it is useful).

With a pooling layer, you want each output pixel to summarize a region of the image as well as it can. This is done by applying an operation to a group of pixels that reduces them to a single value: for example, a 4x4 block of pixels is reduced to a 1x1 block by averaging or by taking the maximum or minimum value.

The pooling operation takes the same parameters as convolution, with one small difference: here we can choose what happens when the kernel goes “outside” the image (which can be caused by a stride that is too large).

There are different types of pooling operations:

Mean pooling
Max pooling
Min pooling

As with convolution, here too you slide a sort of kernel (formally called the “spatial extent”) over the image. Within the highlighted region, you take the average, the maximum, or the minimum of all the values to represent a single pixel in the output.

Usually, when you use pooling, you set the stride equal to the spatial extent, as you can see in the GIF above, where both are two.
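The three pooling types can be sketched on a small 4x4 image with a spatial extent and stride of 2. PyTorch provides `F.avg_pool2d` and `F.max_pool2d`; as far as I know there is no built-in min pooling function, so negating the input around a max pool is used here as a workaround.

```python
import torch
import torch.nn.functional as F

# A 4x4 image with the values 0..15, in (batch, channels, height, width) format
image = torch.arange(16, dtype=torch.float32).view(1, 1, 4, 4)

mean_pooled = F.avg_pool2d(image, kernel_size=2, stride=2)
max_pooled = F.max_pool2d(image, kernel_size=2, stride=2)

# Workaround for min pooling: min(x) == -max(-x)
min_pooled = -F.max_pool2d(-image, kernel_size=2, stride=2)

print(mean_pooled.view(2, 2))  # [[ 2.5,  4.5], [10.5, 12.5]]
print(max_pooled.view(2, 2))   # [[ 5.,  7.], [13., 15.]]
print(min_pooled.view(2, 2))   # [[ 0.,  2.], [ 8., 10.]]
```

Each 2x2 block of the input becomes one pixel, so the 4x4 image is reduced to 2x2 in all three cases.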

What are receptive fields?

Receptive fields are a very important concept in psychology, signal processing, and deep learning. A receptive field is the amount of data in the “field of view” of something; the receptive field of an FFN unit is one pixel.

Having a receptive field of one makes the network non-robust to translation, resizing, rotation, etc. For example, say you take two photos of yourself in a park in the same position. The photos are similar to each other, but they have small differences, which can cause the network to fail to recognize you in both.

So you want a pixel in the output to contain more information than just a single square in the input.

In the image above there are some small changes between the two photos. A network without a pooling layer can struggle to identify you in both photos (because FFN units have a receptive field of one).
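To get a feel for how convolution and pooling layers widen the receptive field, here is a small calculator based on a standard recurrence: each layer adds (kernel size − 1) times the current “jump”, where the jump is the product of the strides of all earlier layers. This recurrence and the function name are additions for illustration, not something taken from the figures in this article.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, ordered from input to output."""
    field, jump = 1, 1
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump   # how many extra input pixels this layer sees
        jump *= stride                      # distance between neighbors, in input pixels
    return field


# Two 3x3 convolutions with stride 1
conv_only = receptive_field([(3, 1), (3, 1)])

# The same, with a 2x2 pooling layer (stride 2) in between
with_pooling = receptive_field([(3, 1), (2, 2), (3, 1)])

print(conv_only, with_pooling)  # 5 8
```

Under these assumptions, inserting a single pooling layer lets each output pixel see an 8x8 patch of the input instead of a 5x5 one.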

“Kernel” outside the image in pooling operations

Do you remember what happens when the stride is too large and the kernel goes outside the image?

In a convolution you would increase the padding. For pooling, PyTorch has a parameter called “ceil_mode”. If it is set to False, the output pixel that would come from a kernel position partly outside the image is dropped. If it is set to True, the pooling operation is performed only on the part of the image covered by the kernel, and the resulting pixel is added to the output.
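The effect of `ceil_mode` is easiest to see on an input whose size is not a multiple of the stride. The sketch below uses a 5x5 image with values 0 to 24 and a 2x2 max pooling with stride 2, both chosen only for illustration.

```python
import torch
import torch.nn.functional as F

image = torch.arange(25, dtype=torch.float32).view(1, 1, 5, 5)

# ceil_mode=False: windows that stick out of the image are dropped
dropped = F.max_pool2d(image, kernel_size=2, stride=2, ceil_mode=False)

# ceil_mode=True: partial windows are kept and pooled over the pixels they cover
kept = F.max_pool2d(image, kernel_size=2, stride=2, ceil_mode=True)

print(dropped.shape[2:])  # torch.Size([2, 2])
print(kept.shape[2:])     # torch.Size([3, 3])
print(kept[0, 0])         # last row and column come from partial windows
```

The bottom-right output pixel with `ceil_mode=True` is 24: its window covers only the single bottom-right pixel of the image.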

A typical CNN architecture is like this:

Example of CNN architecture
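As a rough sketch of such an architecture in PyTorch, the model below stacks convolution, activation, and pooling layers for feature extraction, followed by a fully connected layer for prediction. Every size here (28x28 grayscale input, 8 and 16 channels, 10 output classes) is an arbitrary assumption for the example and is not taken from the figure.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    # feature extraction
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # 1x28x28 -> 8x28x28
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                 # 8x28x28 -> 8x14x14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # 8x14x14 -> 16x14x14
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                 # 16x14x14 -> 16x7x7
    # prediction
    nn.Flatten(),                                # 16x7x7 -> 784
    nn.Linear(16 * 7 * 7, 10),                   # 784 -> 10 class scores
)

batch = torch.rand(4, 1, 28, 28)   # 4 random "images"
scores = model(batch)
print(scores.shape)  # torch.Size([4, 10])
```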

To pool or to stride?

There is an alternative to pooling: increasing the stride of the convolution itself. Both increase the receptive field of a neuron, but there are some differences:

pooling layers are faster at the computational level
pooling layers usually have smaller receptive fields
pooling layers are usually more stable in complex networks

There is no single correct answer as to which one you should pick. Pooling layers have been around longer, so you will find them in more architectures, but both are valid choices.
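Both options reduce the resolution by the same amount, which a short sketch can confirm. The 8x8 input, the 3x3 kernel, and the padding of 1 are arbitrary choices for this comparison.

```python
import torch
import torch.nn.functional as F

image = torch.rand(1, 1, 8, 8)
kernel = torch.rand(1, 1, 3, 3)

# Option 1: downsample inside the convolution by using a stride of 2
strided = F.conv2d(image, kernel, stride=2, padding=1)

# Option 2: keep a stride of 1, then downsample with 2x2 max pooling
pooled = F.max_pool2d(F.conv2d(image, kernel, padding=1), kernel_size=2)

print(strided.shape[2:], pooled.shape[2:])  # both are 4x4
```

The shapes match, but the values do not: the strided convolution skips positions entirely, while pooling computes every position and then keeps the maximum of each 2x2 block.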

Conclusions

Now you know what convolutions and their variants are and how to implement them in PyTorch. You also know how convolutions are used in deep learning models and how to use pooling to your advantage.

You have just scratched the surface of convolutions in deep learning. Now your job is to continue along this path and discover the beautiful things that CNNs can offer you.

I want to share some resources if you want to go deeper into this topic: