AlvBarros

Brief Introduction to CNNs

This article is heavily based on An Introduction to Convolutional Neural Networks by Keiron O'Shea and Ryan Nash.

Let's begin by discussing some background knowledge.

Venn diagram of fields inside Artificial Intelligence
Source: Application of Artificial Intelligence in Lung Cancer. A Venn diagram of fields inside Artificial Intelligence.

Deep Learning

As the diagram shows, deep learning is a subset of machine learning as a whole, and its key distinctions from other types of machine learning are automatic feature extraction and data dependency.

  • Feature engineering: Automatically learns features from raw data.
  • Data: Needs way more data than traditional ML.
  • Performance: Slower to train and requires significant computing power (GPUs), but achieves higher accuracy in complex tasks.
  • Interpretability: Often considered a "black box" due to its complexity and automatic feature extraction.

Neural Networks

Artificial Neural Networks are large collections of interconnected computational nodes (referred to as neurons) that work together in a distributed fashion to collectively learn from the input in order to optimize the final output.

FNN
Source: An Introduction to Convolutional Neural Networks. A three-layered feedforward neural network, comprised of an input layer, a hidden layer and an output layer.
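The idea can be sketched in a few lines of plain Python: each neuron computes a weighted sum of its inputs plus a bias, and passes the result through an activation function. The weights and layer sizes below are arbitrary values chosen for illustration, not taken from the paper.

```python
import math

def neuron(inputs, weights, bias):
    """A single neuron: weighted sum of inputs plus bias, then a sigmoid activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-z))  # sigmoid squashes the output into (0, 1)

def layer(inputs, weight_rows, biases):
    """A fully connected layer: one neuron per row of weights."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# A tiny 2-input -> 2-hidden -> 1-output feedforward pass (made-up weights).
x = [0.5, -1.0]
hidden = layer(x, [[0.1, 0.4], [-0.3, 0.2]], [0.0, 0.1])
output = layer(hidden, [[0.7, -0.5]], [0.2])
print(output)
```

Training would adjust the weights and biases to minimize the error of the final output; here we only show the forward pass.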

Convolutional Neural Networks

The most notable difference between CNNs and traditional ANNs is that CNNs are primarily used for pattern recognition within images.

The layers within the CNN are comprised of neurons organized into three dimensions: the spatial dimensionality of the input (height and width) and the depth. This "depth" does not refer to the total number of layers within the ANN, but to the third dimension of an activation volume.

Overall architecture

CNNs are comprised of three types of layers: convolutional layers, pooling layers and fully-connected layers.

Take for example this simplified CNN architecture for MNIST classification.

Simplified CNN architecture of 5 layers

Source: An Introduction to Convolutional Neural Networks

  1. The input layer will hold the pixel values of the image. These can be RGB colors.
  2. The convolutional layer will determine the output of neurons that are connected to local regions of the input, through the calculation of the scalar product between their weights and the region connected to the input volume. The rectified linear unit (commonly shortened to ReLU) then applies an elementwise activation function, max(0, x), to the output of the activation produced by the previous layer.
  3. The pooling layer will then simply perform downsampling along the spatial dimensionality of the given input, reducing the image size and in turn the number of parameters.
  4. The fully-connected layers will then perform the same duties found in standard neural networks and attempt to produce class scores from the activations. ReLU may also be used between these layers to improve performance.
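To make the flow of these four steps concrete, here is a small shape-tracing sketch in plain Python. The specific sizes (a 5 x 5 kernel with 16 filters, 2 x 2 pooling) are arbitrary choices for illustration, not the exact architecture in the figure.

```python
def conv_shape(h, w, kernel, n_filters, stride=1, padding=0):
    """Spatial output size of a convolutional layer: (V - R + 2Z)/S + 1 per dimension."""
    out_h = (h - kernel + 2 * padding) // stride + 1
    out_w = (w - kernel + 2 * padding) // stride + 1
    return out_h, out_w, n_filters

def pool_shape(h, w, depth, kernel=2, stride=2):
    """Pooling downsamples each spatial dimension; the depth is untouched."""
    return (h - kernel) // stride + 1, (w - kernel) // stride + 1, depth

# 1. Input: a 28 x 28 grayscale MNIST digit.
shape = (28, 28, 1)
# 2. Convolution: 16 filters of 5 x 5 (ReLU does not change the shape).
shape = conv_shape(shape[0], shape[1], kernel=5, n_filters=16)  # (24, 24, 16)
# 3. Pooling: 2 x 2 max-pooling with stride 2.
shape = pool_shape(*shape)                                      # (12, 12, 16)
# 4. Fully connected: the volume is flattened into one vector.
flattened = shape[0] * shape[1] * shape[2]
print(shape, flattened)  # (12, 12, 16) 2304
```

The fully-connected layer then maps those 2304 values to one score per class.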

Convolutional layer

Convolution is a mathematical operation that measures how one function overlaps another across space or time.

In neural networks, it means sliding a small filter over the data to systematically detect local patterns (such as edges, textures, or shapes).

Here's a great video by 3Blue1Brown that goes more in-depth and offers some visualization.

Kernel animation

Notice how a matrix goes through every pixel of the original image.
The convolution in the animation simply averages each pixel with its neighbors, which results in a "blur" effect. Consider that every pixel is a triplet of red, green and blue (RGB) values from 0 to 255. So, for example, a completely red pixel would be [255, 0, 0]. The kernel is a 3 x 3 matrix filled with 1/9, so that it averages a pixel with its neighbors. I heavily encourage you to watch the video so it makes more sense.
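The blur described above can be reproduced on a small grid with plain Python. For simplicity this sketch uses a single intensity value per pixel rather than RGB triplets, and skips the border pixels instead of padding them.

```python
def convolve(image, kernel):
    """Slide a 3x3 kernel over the interior of a 2D image (no padding, stride 1)."""
    h, w = len(image), len(image[0])
    out = []
    for i in range(1, h - 1):
        row = []
        for j in range(1, w - 1):
            # Weighted sum of the pixel and its 8 neighbours.
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    acc += image[i + di][j + dj] * kernel[di + 1][dj + 1]
            row.append(acc)
        out.append(row)
    return out

blur = [[1 / 9] * 3 for _ in range(3)]  # the averaging kernel from the animation
image = [[0,   0,   0,   0],
         [0, 255, 255,   0],
         [0, 255, 255,   0],
         [0,   0,   0,   0]]
print(convolve(image, blur))
```

Each output value is the average of a pixel and its eight neighbors, so the sharp 0-to-255 edges in the input get smoothed out.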

Kernel

The "filter", or the matrix of 1/9's in the example above, is called a kernel. Kernels are usually small in spatial dimensionality, but are slid across the entire image.

When the data hits a convolutional layer, the layer convolves each filter across the spatial dimensionality of the input to produce a 2D activation map.

Kernel example
Source: An Introduction to Convolutional Neural Networks. A visual representation of a convolutional layer. The center element of the kernel is placed over each element of the input vector (the image), which is then replaced with a weighted sum of itself and the nearby pixels.

Training neural networks on inputs such as images results in models that are too big to train effectively. Consider an image 800 pixels high and 600 pixels wide. This would mean 800 x 600 = 480,000 pixels. Bear in mind that each pixel is an RGB triplet, as explained earlier, so that means 480,000 x 3 = 1,440,000 values. Just for this image, the number of weights on a single neuron would be almost 1.5 million, and these networks usually have far more than one neuron.

To mitigate this, the convolutional layer must be connected to small regions of the input, referred to as the receptive field size of the neuron. To visualize this, if the input of the network is an image of size 64 x 64 x 3 and we set the receptive field size to 6 x 6, we would have a total of 108 weights on each neuron in the layer. To put this into perspective, a standard neuron seen in other forms of neural networks would contain 12,288 weights each.
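The arithmetic behind those numbers is worth spelling out:

```python
# A fully connected neuron sees the entire 64 x 64 x 3 input volume.
full_weights = 64 * 64 * 3
print(full_weights)  # 12288

# A convolutional neuron with a 6 x 6 receptive field only sees a 6 x 6 x 3 patch.
local_weights = 6 * 6 * 3
print(local_weights)  # 108

# The 800 x 600 example from above, one weight per input value:
print(800 * 600 * 3)  # 1440000
```

The receptive field shrinks the per-neuron weight count by two orders of magnitude here, which is what makes training on images tractable.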

Optimization

Convolutional layers are also able to significantly reduce the complexity of the model through three hyperparameters: depth, stride and zero-padding.

  • Depth: The depth of the output volume can be manually set through the number of neurons within the layer. This parallels other forms of neural networks, where all of the neurons in a hidden layer are connected to every neuron in the preceding layer. Reducing this value can significantly minimize the total number of neurons, but it can also significantly reduce the pattern recognition capabilities of the model.

  • Stride: You can think of this as the number of "steps" we take when sliding the convolution kernel. A stride of one means the kernel moves one pixel at a time, so every position is convolved. A stride of two means the kernel moves two pixels at a time, skipping every other position.

  • Zero-padding: The simple process of padding the border of the input with zeros. It is an effective method to give further control over the dimensionality of the output volumes.

With these, we can calculate the size of the 2D output of the convolutional layer.

Formula to calculate the output size
Source: An Introduction to Convolutional Neural Networks. V represents the input volume size (height x width x depth), R represents the receptive field size, Z is the amount of zero-padding set and S represents the stride. If the calculated result is not an integer, then the stride has been incorrectly set.
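Written out, the formula in the figure is (V - R + 2Z) / S + 1 per spatial dimension. A small helper makes the integer check explicit; the example values are hypothetical:

```python
def conv_output_size(v, r, z, s):
    """(V - R + 2Z)/S + 1; raises if the stride does not divide evenly."""
    numerator = v - r + 2 * z
    if numerator % s != 0:
        raise ValueError("stride is incorrectly set for this input/kernel/padding")
    return numerator // s + 1

# A 64-pixel-wide input, 6 x 6 receptive field, no padding, stride 1:
print(conv_output_size(64, 6, 0, 1))  # 59

# Stride 4 on the same input does not divide evenly (58 % 4 != 0),
# so conv_output_size(64, 6, 0, 4) raises a ValueError.
```

The ValueError corresponds to the "incorrectly set" stride mentioned in the caption: the kernel would run past the edge of the input partway through a pass.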

Parameter sharing, the idea that the kernel is the same for the entire image, works on the assumption that if one region feature is useful to compute at a set spatial region, then it is likely to be useful in another region.

Pooling layer

The objective is to gradually reduce the dimensionality of the representation.

It operates over each activation map in the input, and scales its dimensionality using the "MAX" function. In most CNNs, these come in the form of max-pooling layers with kernels of a dimensionality of 2 x 2 applied with a stride of 2 along the spatial dimensions of the input. This scales the activation map down to 25% of the original size - whilst maintaining the depth volume to its standard size.

Due to its destructive nature, only two methods of max-pooling are generally observed: the one mentioned previously, where a 2 x 2 kernel with a stride of 2 covers the entirety of the image, and overlapping pooling, where the stride is set to 2 with a kernel size of 3. Pooling kernel sizes above 3 will usually greatly decrease the performance of the model.

It's important to note that beyond max-pooling, some CNNs may contain general pooling. These layers are comprised of pooling neurons that are able to perform a multitude of common operations including L1/L2-normalisation, and average pooling.
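A 2 x 2 max-pool with stride 2 can be sketched in plain Python: each output value is the maximum of a 2 x 2 window, so the spatial size halves while the strongest activations survive. The feature map values are made up for illustration.

```python
def max_pool(activation_map, kernel=2, stride=2):
    """2D max-pooling over a single activation map (a list of lists)."""
    h, w = len(activation_map), len(activation_map[0])
    out = []
    for i in range(0, h - kernel + 1, stride):
        row = []
        for j in range(0, w - kernel + 1, stride):
            window = [activation_map[i + di][j + dj]
                      for di in range(kernel) for dj in range(kernel)]
            row.append(max(window))  # keep only the strongest activation
        out.append(row)
    return out

fmap = [[1, 3, 2, 1],
        [4, 6, 5, 0],
        [7, 2, 9, 8],
        [1, 0, 3, 4]]
print(max_pool(fmap))  # [[6, 5], [7, 9]]
```

Average pooling would replace `max(window)` with `sum(window) / len(window)`; the overlapping variant corresponds to `kernel=3, stride=2`.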

Fully-connected layer

This layer contains neurons that are directly connected to the neurons in the two adjacent layers, without being connected to any other layers. This is analogous to the way that neurons are arranged in traditional forms of neural networks.

Recipes

Despite the relatively small number of layers required for a CNN, there is no set way of formulating a CNN architecture. That being said, they follow a common architecture.

This common architecture is of stacked convolutional layers, followed by pooling layers in a repeated manner, before feeding forward to fully-connected layers, as shown in the Overall architecture section.

Another way is to stack two convolutional layers before each pooling layer, as illustrated below. This is strongly recommended as stacking multiple convolutional layers allows for more complex features of the input vector to be selected.

Another way to stack
Source: An Introduction to Convolutional Neural Networks. A common form of CNN architecture in which convolutional layers are continuously stacked between ReLUs before being passed through the pooling layer, and finally through one or more fully-connected ReLU layers.

Also, it is advised to split large convolutional layers into smaller stacked ones in order to reduce the computational complexity within a given layer.

For example, imagine stacking three layers on top of each other, each with a receptive field of 3 x 3. Each neuron on the first layer has a 3 x 3 view of the input vector. A neuron on the second layer, however, acts on the output of the first layer, so even though its kernel is also 3 x 3, it effectively sees a 5 x 5 region of the original input (with a stride of 1, adjacent 3 x 3 windows overlap). Add a third layer and the effective receptive field becomes 7 x 7, and so on.
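For stride-1 convolutions, each stacked layer grows the effective receptive field by (kernel size - 1). A quick sketch of that rule:

```python
def effective_receptive_field(kernel_sizes):
    """Effective receptive field of stacked stride-1 conv layers: 1 + sum(k - 1)."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

# Stacking 3 x 3 convolutions:
print(effective_receptive_field([3]))        # 3
print(effective_receptive_field([3, 3]))     # 5
print(effective_receptive_field([3, 3, 3]))  # 7

# A single 7 x 7 layer sees the same region but costs more weights per channel:
# 7 * 7 = 49, versus 3 * (3 * 3) = 27 for the three stacked 3 x 3 layers.
```

This is why splitting one large convolutional layer into several small stacked ones reduces complexity without shrinking what the network can "see".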

The input layer should be recursively divisible by two. Common numbers include 32 x 32, 64 x 64, 96 x 96, 128 x 128 and 224 x 224.

With small filters, set the stride to one and make use of zero-padding to ensure that the convolutional layers do not alter the spatial dimensionality of the input. The amount of zero-padding should be calculated by subtracting one from the receptive field size and dividing by two.
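Substituting Z = (R - 1) / 2 into the output-size formula confirms that it preserves the spatial size at stride 1:

```python
def same_padding(receptive_field):
    """Zero-padding that preserves spatial size at stride 1: (R - 1) / 2."""
    return (receptive_field - 1) // 2

def output_size(v, r, z, s=1):
    """The (V - R + 2Z)/S + 1 formula from the convolutional layer section."""
    return (v - r + 2 * z) // s + 1

for r in (3, 5, 7):  # common odd kernel sizes
    z = same_padding(r)
    assert output_size(64, r, z) == 64  # spatial size is unchanged
print([same_padding(r) for r in (3, 5, 7)])  # [1, 2, 3]
```

The rule only lands on an integer for odd receptive field sizes, which is one reason odd kernel sizes like 3, 5 and 7 dominate in practice.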

CNNs can be horrendously resource-heavy. Filtering a large image (anything over 128 x 128 could be considered large) illustrates the problem: if the input is 227 x 227 (as seen with ImageNet) and we filter with 64 kernels, each with zero-padding, then the result will be three activation vectors of size 227 x 227 x 64 - which calculates to roughly 10 million activations, or an enormous 70 megabytes of memory per image.

In this case there are two options.

First, you can reduce the spatial dimensionality of the input images by resizing the raw images to something a little less heavy.

Alternatively, you can go against everything we stated earlier and opt for larger filter sizes with a larger stride (2, as opposed to 1).

Conclusion

CNNs differ from other forms of neural networks in that, instead of focusing on the entirety of the problem domain, knowledge about the specific type of input is exploited. This in turn allows for a much simpler network architecture to be set up.
