Nicanor Korir

CNNs: from a beginner's point of view

I've learned this topic about 20 times now; some parts still confuse me, and of course, some core ideas have stuck. In this article, I'm going to break down CNNs to make the basics, and maybe some of the more advanced parts, easy to understand.


Okay, from your perspective, how do you recognize your friend's face in a crowded room?

Like, genuinely, what's happening in your brain? You're not calculating pixel values or comparing feature vectors. You just see them and instantly think, "Oh, that's Sarah."

Your brain is doing something incredibly sophisticated without you realizing it. And that's exactly what CNNs (Convolutional Neural Networks) are trying to do. They're trying to teach computers to see and understand images the way your brain does.

What's the Problem We're Trying to Solve with CNNs?

Before CNNs, people tried to use regular neural networks (fully connected networks) to process images. Here's how it worked: take an image, flatten it into a long list of numbers (every pixel becomes a number), and feed that into a neural network.

Sorry, this will rush things a bit, but stay with me here.

An image that's 224x224 pixels has about 50,000 pixels. If you have an RGB image (3 color channels), that's about 150,000 numbers. If your first hidden layer has 1,000 neurons, you now have roughly 150 million weights to learn just in the first layer.

That's massive. Your network becomes incredibly expensive to train, slow to run, and prone to overfitting (memorizing instead of learning).

But here's the thing, and this is important: images have structure. Pixels next to each other are related. An eye is an eye, whether it's in the top-left or bottom-right of your image. Your brain doesn't relearn what an eye looks like every time it's in a different position.

So the question becomes: How do we build a neural network that understands this spatial structure and reuses knowledge across the image?

That's where convolutions come in

In mathematics, convolution is an operation that combines two functions to produce a third function, showing how one modifies or overlaps with the other as it shifts across it. In CNNs, this idea is used to slide a filter across an image to detect features such as edges and patterns.
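In discrete 2D form, which is what CNN layers actually compute (strictly speaking it's cross-correlation, i.e. convolution without flipping the kernel), the output at position (i, j) for an image I and kernel K is:

$$S(i, j) = \sum_{m}\sum_{n} I(i + m,\ j + n)\, K(m, n)$$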

The Core Idea

Imagine you're looking at a painting. Instead of analyzing every millimeter of it at once, you use a small window to look at it piece by piece. You slide that window across the painting, examining each region.

You might notice:

  • In this region, there are strong diagonal lines (could be an arm)
  • In that region, there's a curved edge (could be a face)
  • Over there, there's a specific color pattern (could be hair)

Now, imagine you're looking for specific patterns, edges, corners, colors, and shapes. As you slide your window across the image, you're asking: "Does this pattern appear here? How strongly?"

That's convolution

In math terms, you have:

  1. An image (the painting)
  2. A filter/kernel (your small window, usually 3x3 or 5x5)
  3. A convolution operation (sliding the filter across the image and computing a value for each position)

The **filter** is like a feature detector; different filters detect different features:

  • One filter might detect horizontal edges
  • Another detects vertical edges
  • Another detects corners
  • Another detects specific textures

Here's the magic: the network learns what these filters should be. You don't hard-code "detect an edge." The network figures out, "To recognize images well, I should learn these specific filter patterns."

Let's say you have a 5x5 image (tiny, for illustration):

```
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
5 6 7 8 9
```

And a 3x3 filter (kernel):

```
1  0 -1
2  0 -2
1  0 -1
```

(This is actually a real filter, the Sobel filter, which detects vertical edges.)

Convolution works like this:

  1. Place the filter on the top-left of the image
  2. Multiply each element of the filter by the corresponding image element
  3. Sum all those products
  4. That sum is the output for that position
  5. Slide the filter one position to the right, repeat
  6. When you reach the end of a row, move down and start from the left

After sliding through the entire image, you get a new, slightly smaller image. That new image highlights where the filter's pattern appears strongly in the original image.
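Here's a minimal NumPy sketch of those six steps, run on the exact 5x5 image and Sobel filter above (stride 1, no padding, and, like most deep learning libraries, no kernel flip):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` and return the feature map (stride 1, no padding).

    Note: skipping the kernel flip makes this technically cross-correlation,
    which is exactly what CNN layers compute in practice.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the window by the kernel element-wise, then sum
            output[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return output

image = np.array([[1, 2, 3, 4, 5],
                  [2, 3, 4, 5, 6],
                  [3, 4, 5, 6, 7],
                  [4, 5, 6, 7, 8],
                  [5, 6, 7, 8, 9]], dtype=float)

sobel_vertical = np.array([[1, 0, -1],
                           [2, 0, -2],
                           [1, 0, -1]], dtype=float)

print(convolve2d(image, sobel_vertical))
# [[-8. -8. -8.]
#  [-8. -8. -8.]
#  [-8. -8. -8.]]
```

Every 3x3 window of this particular image has the same left-to-right gradient, so the filter responds with the same value at every position.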

Do this with multiple filters, and you get numerous feature maps. Each one shows where different patterns appear in the image.

Why This Is So Powerful

Because you're sliding the same filter across the image, you're using the same weights everywhere. This means:

  • Fewer parameters: Instead of 150 million weights, maybe you have 9 (for a 3x3 filter) × number of filters
  • Weight sharing: The network learns that certain patterns are important, and it looks for them everywhere
  • Translation invariance: An edge detector works whether the edge is in the top-left or bottom-right

Your network becomes smaller, faster, and smarter.
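A quick back-of-envelope comparison makes this concrete (the 1,000-neuron hidden layer and 32 filters are just illustrative numbers):

```python
# First-layer weight counts for a 224x224x3 input (biases ignored)
fc_weights = 224 * 224 * 3 * 1000   # fully connected: every input to every neuron
conv_weights = 3 * 3 * 3 * 32       # convolutional: 32 filters, each 3x3 across 3 channels
print(f"{fc_weights:,} vs {conv_weights:,}")  # 150,528,000 vs 864
```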

Layering

Here's where it gets interesting. You don't just do one convolution; you stack them.

After the first convolution, you get feature maps that detect simple patterns (edges, corners). Then you apply another convolution to those feature maps. Now you're detecting patterns of patterns.

Maybe the second layer detects "edges arranged in a circular pattern" (detecting circles). The third layer might detect "circles with specific textures" (detecting eyes or wheels).

By the time you're 10 layers deep, you're detecting high-level features: "This looks like a face," "This looks like a car," "This looks like a dog."

This is the hierarchy of features:

```
Layer 1: Edges and corners
Layer 2: Simple shapes (circles, lines arranged together)
Layer 3: Textures and patterns
Layer 4: Parts of objects (wheels, fur, eyes)
Layer 5+: Whole objects (cars, animals, faces)
```

This mirrors how your brain works. You see edges first, then recognize that those edges form a nose, then recognize that a nose is part of a face.

Pooling

Pooling in CNNs reduces the size of feature maps by summarizing small regions, so the network keeps the most important information while becoming more efficient. It also makes feature detection more stable when an object shifts slightly in the image. The most common method is max pooling: take the maximum value in each region (typically a 2×2 area).

Why? Because:

  1. It reduces the spatial size (fewer numbers to process)
  2. It makes the network more robust to small shifts (if a feature moves slightly, max pooling will still find it)
  3. It emphasizes the strongest features (the maximum value is usually the most important)
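Here's a minimal NumPy sketch of 2x2 max pooling, assuming the usual non-overlapping windows:

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Downsample by taking the max of each non-overlapping size x size window."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # trim edges so the map divides evenly
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 6, 1, 2],
               [5, 2, 9, 7],
               [1, 0, 3, 8]])
print(max_pool2d(fm))
# [[6 2]
#  [5 9]]
```

Notice the output is a quarter of the size, but the strongest activations (the 6, the 9) survive.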

An Actual CNN Architecture

Let me show you what a simple CNN looks like (assuming each 3x3 convolution uses padding, so it preserves the spatial size):

```
Input Image (224x224x3)
    ↓
Convolution (32 filters, 3x3) → Output: 224x224x32
ReLU activation
Max Pooling (2x2) → Output: 112x112x32
    ↓
Convolution (64 filters, 3x3) → Output: 112x112x64
ReLU activation
Max Pooling (2x2) → Output: 56x56x64
    ↓
Convolution (128 filters, 3x3) → Output: 56x56x128
ReLU activation
Max Pooling (2x2) → Output: 28x28x128
    ↓
Flatten → 28*28*128 = 100,352 values
    ↓
Fully Connected Layer (256 neurons)
ReLU activation
    ↓
Fully Connected Layer (10 neurons) → Output: scores for 10 classes
    ↓
Softmax → Final prediction (probabilities)
```

Each layer is making the data smaller but richer. By the end, instead of 224x224 pixels, you have 10 numbers representing "how confident am I that this is a [cat/dog/bird/etc]?"
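If you'd rather read that as code, here's a minimal PyTorch sketch of the same architecture (padding=1 keeps each convolution's output the same spatial size, matching the diagram):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),    # 224x224x3 -> 224x224x32
    nn.ReLU(),
    nn.MaxPool2d(2),                               # -> 112x112x32
    nn.Conv2d(32, 64, kernel_size=3, padding=1),   # -> 112x112x64
    nn.ReLU(),
    nn.MaxPool2d(2),                               # -> 56x56x64
    nn.Conv2d(64, 128, kernel_size=3, padding=1),  # -> 56x56x128
    nn.ReLU(),
    nn.MaxPool2d(2),                               # -> 28x28x128
    nn.Flatten(),                                  # -> 100,352 values
    nn.Linear(28 * 28 * 128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),                            # -> 10 class scores
)
# Note: in PyTorch the softmax is usually folded into the loss function
# (nn.CrossEntropyLoss), so the model itself outputs raw scores (logits).
```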

Okay, But How Do You Actually Train This?

The process is similar to regular neural networks, but the convolutions make it special:

  1. Forward pass: Image goes through the layers, producing a prediction
  2. Loss calculation: Compare prediction to ground truth. "I said dog, it was actually a cat. That's wrong."
  3. Backpropagation: Calculate gradients through all the layers, including the convolutional layers
  4. Update filters: Adjust the filter weights so they become better at detecting useful features
  5. Repeat: Do this thousands of times until the network gets better

The network automatically learns what filters to use. You don't tell it "detect edges." It figures it out because detecting edges helps it recognize objects better.
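In PyTorch, those five steps boil down to a loop like this (a sketch, assuming the `model` from the previous snippet and a hypothetical `train_loader` that yields batches of images and labels):

```python
import torch
import torch.nn as nn

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # applies softmax internally

for epoch in range(10):
    for images, labels in train_loader:
        logits = model(images)          # 1. forward pass
        loss = loss_fn(logits, labels)  # 2. loss calculation
        optimizer.zero_grad()
        loss.backward()                 # 3. backpropagation through every layer
        optimizer.step()                # 4. update the filters and weights
# 5. repeat: each pass over the data nudges the filters toward useful features
```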


Real-World Applications

  • Medical Imaging: Detecting tumors in X-rays, CT scans
  • Autonomous Vehicles: Detecting pedestrians, traffic signs, and lane markings. CNNs can process camera feeds in real-time
  • Social Media: Instagram uses CNNs for content recommendation, Facebook for face detection, and TikTok for understanding video content.
  • Satellite Imagery: Detecting changes in landscapes, tracking deforestation, and counting crops
  • Quality Control: Manufacturing plants use CNNs to detect defects in products at superhuman speeds.
  • E-commerce: Product recognition, visual search (take a photo of something, find similar items online)

The Limitations

I don't want to oversell this; CNNs have real limitations:

1. They Need Lots of Data

Unlike humans, who can learn from a few examples, CNNs need thousands. Transfer learning helps, but it's still data-hungry.

2. They're Brittle

A CNN trained to recognize a dog can be completely fooled by a tiny, carefully crafted perturbation of the image (an adversarial example). Humans see it as obviously still a dog.

3. They Don't Understand Context

A CNN might recognize all the objects in an image perfectly, but miss the relationship between them. It sees "cat," "couch," but doesn't understand "cat sitting on couch."

4. They're Black Boxes

You can visualize what they learned, but explaining why a specific prediction was made is hard. This matters for medical or legal applications where you need explainability.

5. They're Computationally Expensive

Running inference requires significant resources, especially for complex models.

What next?

For me, I'll create a practical example next. I'm also working on product recognition and categorization for a warehouse using different tools and technologies. As for you, tell me what you'd like to see in the comments or on social media, and we can chat about it.
