CNNs changed computer vision because they stopped treating images like flat lists of numbers.
An image has structure.
Pixels near each other usually matter together.
That is exactly what CNNs are built to capture.
Core Idea
A Convolutional Neural Network is designed for spatial data.
Instead of looking at every pixel independently, it scans small regions with filters.
Those filters learn useful patterns.
Edges.
Corners.
Textures.
Shapes.
As layers get deeper, simple visual patterns become higher-level features.
The Key Structure
A basic CNN flow looks like this:
Image → Convolution → Activation → Pooling → Deeper Features → Classifier
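That flow maps almost directly to code. Below is a minimal sketch, assuming PyTorch; the channel counts, kernel sizes, and 32x32 input are illustrative choices, not taken from any specific model.

```python
import torch
import torch.nn as nn

# Image -> Convolution -> Activation -> Pooling -> Deeper Features -> Classifier
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolution over RGB input
    nn.ReLU(),                                    # activation
    nn.MaxPool2d(2),                              # pooling: 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper features
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # classifier over 10 classes
)

x = torch.randn(1, 3, 32, 32)  # one fake 32x32 RGB image
print(model(x).shape)          # torch.Size([1, 10])
```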
The important part is convolution.
A convolution kernel moves across the image and extracts local features.
In simple terms, each feature value is a weighted sum of one region:
Kernel · Local Image Region → Feature Value
So the CNN does not memorize the whole image at once.
It learns reusable visual detectors.
Implementation View
At a high level, a CNN layer works like this (a runnable sketch follows the list):
- take an input image
- slide a kernel over small regions
- compute feature values
- apply activation
- optionally reduce spatial size with pooling
- pass feature maps to the next layer
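Those steps translate into a few lines of NumPy. This is a minimal sketch of a single layer (no padding, stride 1), meant to show the mechanics rather than serve as an efficient implementation.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image; compute one feature value per position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = image[i:i + kh, j:j + kw]   # small local region
            out[i, j] = np.sum(region * kernel)  # kernel . region -> feature value
    return out

image = np.random.rand(6, 6)
kernel = np.random.rand(3, 3)

feature_map = np.maximum(conv2d(image, kernel), 0)         # convolution + ReLU
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))  # 2x2 max pooling
print(feature_map.shape, pooled.shape)                     # (4, 4) (2, 2)
```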
This is why CNNs are practical for image tasks.
The same kernel is reused across the image.
That means fewer parameters than a fully connected layer over all pixels.
It also means the model can detect the same pattern in different locations.
Concrete Example
Imagine a model looking at a cat image.
Early convolution layers may detect:
- edges
- curves
- color contrasts
Middle layers may combine them into:
- eyes
- ears
- whisker-like patterns
Deeper layers may combine those into:
- face-like structures
- object-level features
That hierarchy is the core intuition.
CNNs do not jump directly from pixels to “cat.”
They build features step by step.
CNN vs Standard Neural Network
A standard fully connected neural network treats an image mostly as a flat vector.
That loses the spatial structure.
CNNs preserve local relationships.
Standard neural network:
- flattens image early
- connects many pixels directly
- uses many parameters
- does not naturally preserve spatial locality
CNN:
- keeps spatial structure
- uses local filters
- shares weights across locations
- learns hierarchical visual features
That is why CNNs became the default architecture for image recognition.
They match the structure of the data.
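To make the parameter difference concrete, here is a quick back-of-the-envelope comparison. The 32x32 RGB input and the layer sizes are illustrative assumptions.

```python
# Fully connected: every hidden unit connects to every pixel.
fc_weights = (32 * 32 * 3) * 256     # 3072 inputs x 256 hidden units
print(fc_weights)                    # 786432 weights (plus biases)

# Convolutional: 16 filters of size 3x3 over 3 channels, reused everywhere.
conv_weights = 16 * (3 * 3 * 3)
print(conv_weights)                  # 432 weights (plus biases)
```

The convolution layer is not just smaller. Because the same 432 weights are reused at every location, one detector works anywhere in the image.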
Why Convolution Kernels Matter
The kernel is the core mechanism.
A kernel is a small matrix that scans over the image.
It responds strongly when it finds a pattern it has learned.
For example:
- one kernel may detect vertical edges
- another may detect horizontal edges
- another may respond to texture
In deep learning, these kernels are learned from data.
You do not define them by hand.
The model discovers useful filters during training.
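Classic hand-crafted kernels make this concrete. The sketch below uses Sobel-style edge kernels (a standard textbook example, not something specific to this article) and shows the vertical-edge kernel firing strongly on a vertical boundary while the horizontal one stays at zero.

```python
import numpy as np

# Sobel-style kernels: classic hand-crafted edge detectors.
vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-2.0, 0.0, 2.0],
                          [-1.0, 0.0, 1.0]])
horizontal_edge = vertical_edge.T

# A tiny image with a vertical edge: dark on the left, bright on the right.
image = np.array([[0, 0, 0, 9, 9, 9]] * 6, dtype=float)

# Kernel response at one position centered on the edge.
region = image[2:5, 2:5]
print(np.sum(region * vertical_edge))    # 36.0: strong response to the vertical edge
print(np.sum(region * horizontal_edge))  # 0.0: no horizontal edge here
```

A learned kernel often ends up looking similar. The difference is that training finds these patterns automatically.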
How CNNs Became Deeper
Early CNNs showed that convolution worked.
But deeper CNNs learned richer features.
That created a new problem:
How do you train very deep networks without making optimization unstable?
This is where the model timeline becomes useful.
LeNet showed the basic CNN idea.
AlexNet showed that CNNs could dominate large-scale image recognition.
VGGNet showed the power of simple depth.
GoogLeNet improved efficiency with Inception modules.
ResNet made very deep networks trainable with residual connections.
Landmark Model Flow
A simple timeline looks like this:
LeNet → AlexNet → VGGNet → GoogLeNet → ResNet
Each model solved a different pressure point.
LeNet:
- early CNN structure
- useful for digit recognition
- showed convolution could work
AlexNet:
- large-scale breakthrough
- helped trigger the deep learning boom
- proved CNNs could scale with data and GPUs
VGGNet:
- simple repeated convolution blocks
- showed depth could improve representation
- easy to understand structurally
GoogLeNet:
- focused on efficiency and multi-scale features
- used Inception-style modules (see the sketch after this list)
- reduced unnecessary computation
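A simplified Inception-style block is easy to sketch, assuming PyTorch. This version keeps only the core idea of parallel filters at different scales; the real module also has a pooling branch and 1x1 reductions before the larger filters.

```python
import torch
import torch.nn as nn

class InceptionStyleBlock(nn.Module):
    """Simplified multi-branch block: look at the input at several scales at once."""
    def __init__(self, in_channels):
        super().__init__()
        self.branch1 = nn.Conv2d(in_channels, 8, kernel_size=1)             # fine detail
        self.branch3 = nn.Conv2d(in_channels, 8, kernel_size=3, padding=1)  # medium scale
        self.branch5 = nn.Conv2d(in_channels, 8, kernel_size=5, padding=2)  # larger scale

    def forward(self, x):
        # Run the branches in parallel, then stack their feature maps by channel.
        return torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)

x = torch.randn(1, 16, 32, 32)
print(InceptionStyleBlock(16)(x).shape)  # torch.Size([1, 24, 32, 32])
```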
ResNet:
- solved the degradation problem in very deep networks
- used skip connections (see the sketch after this list)
- made extremely deep CNNs practical
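The skip connection itself is a one-line idea: add the input back to the output. Here is a minimal residual block sketch, assuming PyTorch; it omits the batch normalization and channel changes that real ResNet blocks use.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: the input skips past two conv layers."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # skip connection: gradients flow through "+ x"

x = torch.randn(1, 16, 32, 32)
print(ResidualBlock(16)(x).shape)  # torch.Size([1, 16, 32, 32])
```

Because each block only learns a residual on top of the identity, stacking many of them stays trainable.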
Why ImageNet Mattered
CNN progress was not only about architecture.
It also needed large-scale benchmarks.
ImageNet and ILSVRC gave researchers a clear way to compare models.
That mattered because architecture improvements became measurable.
Better models were not just theoretically interesting.
They produced visible gains on the same benchmark.
This is why AlexNet became such a turning point.
It showed that deep CNNs could outperform older computer vision pipelines at scale.
Recommended Learning Order
If CNNs feel like a list of model names, learn them in this order:
- Convolutional Neural Network
- Convolution Kernel
- Deep Convolutional Network
- LeNet
- AlexNet
- VGGNet
- GoogLeNet
- ResNet
- ImageNet / ILSVRC
This order works because you first understand the mechanism.
Then you understand the architecture.
Then you understand the historical model flow.
Takeaway
CNNs work because they match the structure of images.
Images are spatial.
Local patterns matter.
The same pattern can appear in many locations.
CNNs use convolution kernels to capture that structure efficiently.
The shortest version is:
Local filters + shared weights + deep feature hierarchy = CNN power
If you remember one idea, remember this:
CNNs turn pixels into features by repeatedly detecting local patterns and composing them into higher-level visual concepts.
Discussion
When learning CNNs, do you find it easier to start from the convolution kernel itself, or from the model timeline like LeNet → AlexNet → ResNet?
Originally published at zeromathai.com.
Original article: https://zeromathai.com/en/cnn-complete-hub-en/
GitHub Resources
AI diagrams, study notes, and visual guides:
https://github.com/zeromathai/zeromathai-ai