"Vision is the art of seeing what is invisible to others." Jonathan Swift
When Regularization Wasn't Enough
In my last post, I showed you how dropout and weight decay stop a network from memorizing training data. We trained on MNIST, closed the generalization gap, and got a network that actually works in the real world.
It felt like we'd finally solved it.
But then I tried it on a real photograph. Not 28×28 grayscale digits. A 224×224 color image.
The math was brutal.
224 × 224 × 3 = 150,528 inputs
Connect those to 1,000 neurons: 150 million parameters
Just for the first layer. Before learning anything useful.
We needed a different idea entirely. And it came, as the best ideas often do, from biology.
The Visual Cortex Moment
In 1959, neuroscientists David Hubel and Torsten Wiesel did something remarkable. They inserted electrodes into a cat's visual cortex and projected shapes onto a screen. They were trying to find what made individual neurons fire.
Most shapes did nothing. Then, almost by accident, they moved a glass slide and cast a thin line of light across the screen.
One neuron went wild.
Not all neurons. Not the whole cortex. One specific neuron, responding to one specific edge, at one specific orientation, in one specific region of the visual field.
They kept experimenting. Different neurons responded to different orientations: horizontal edges, vertical edges, diagonal edges. Each neuron had a small receptive field: a limited patch of the visual field it paid attention to. Neurons in later areas responded to more complex patterns: corners, curves, eventually whole shapes.
The visual cortex isn't a fully connected blob. It's a hierarchy of local detectors, each building on the one before it.
That insight, decades later, became the blueprint for CNNs.
The Problem with Fully Connected Networks on Images
Here's what a fully-connected network does to an image:
FC Network sees an image as:
┌─────────────────────────────────────────────────────┐
│ pixel_1, pixel_2, pixel_3, ..., pixel_150528 │
│ (all spatial structure destroyed, every pixel │
│ connected to every neuron with separate weights) │
└─────────────────────────────────────────────────────┘
Every neuron: "I must learn about ALL 150,528 pixels equally."
This is wasteful in two ways. First, a pixel in the top-left corner has almost nothing to do with a pixel in the bottom-right corner, but the network treats them as equally related. Second, if a cat's ear appears in the top-left of one image and the top-right of another, the network needs separate neurons to detect it in each location.
A CNN thinks differently:
CNN sees an image as:
┌──────────────────────────────────────────────────────┐
│ ┌───┐ ┌───┐ ┌───┐ │
│ │ F │ │ F │ │ F │ ← same filter, sliding across │
│ └───┘ └───┘ └───┘ "Is there an edge here?" │
│ ↘ ↓ ↙ │
│ [feature map] │
└──────────────────────────────────────────────────────┘
Every filter: "I detect ONE pattern, ANYWHERE in the image."
Same filter, applied everywhere. One set of weights to detect vertical edges across the entire image. This is weight sharing, and it's the core reason CNNs work.
The parameter comparison is stark:
| | FC Network | CNN (typical) |
|---|---|---|
| Input | 224×224×3 = 150K | 224×224×3 = 150K |
| First layer params | 150K × 1000 = 150M | 96 filters × 11×11×3 = 35K |
| Assumption | all pixels equally related | nearby pixels are related |
For the full mathematical breakdown of parameter counts across FC, LeNet, AlexNet, VGG, and ResNet, see CNN_ARCHITECTURE_DEEP_DIVE.md.
Key Concepts, Grounded in Biology
Local Receptive Field
In the visual cortex, each neuron only responds to a small patch of the visual field. It's local. It doesn't see the whole image—just its neighborhood.
In a CNN, each filter application does the same thing. A 3×3 filter looks at a 3×3 patch of the image. That's its receptive field. It asks: "Does my pattern exist in this small region?"
As you go deeper in the network, receptive fields grow. A neuron in layer 3 has seen the outputs of layer 2, which saw layer 1, which saw the raw pixels. So it effectively "sees" a larger region, just like neurons deeper in the visual cortex respond to larger, more complex patterns.
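A minimal sketch of that growth for stacked 3×3, stride-1 convolutions (a simplification: pooling and strided layers grow the receptive field even faster):

```python
def receptive_field(num_layers, kernel=3):
    """Receptive field of a stack of stride-1 convolutions.

    Each extra layer lets a neuron 'see' (kernel - 1) more pixels.
    """
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1
    return rf

for n in range(1, 4):
    size = receptive_field(n)
    print(f"layer {n}: effectively sees a {size}x{size} patch")
# layer 1: 3x3, layer 2: 5x5, layer 3: 7x7
```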
Learnable Filter (The Edge Detector)
A filter is just a small grid of numbers—say 3×3. During training, backpropagation (the same algorithm from Post 3) adjusts these numbers until the filter detects something useful. One filter might learn to detect vertical edges. Another learns horizontal edges. Another learns a specific texture.
Learned vertical edge filter:    Learned horizontal edge filter:
[-1  0  1]                       [-1 -2 -1]
[-2  0  2]                       [ 0  0  0]
[-1  0  1]                       [ 1  2  1]
The network learns these automatically; no human designs them. That's the power.
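Here's a rough NumPy sketch of what "sliding a filter" means, applying the vertical-edge filter above to a toy image with one dark-to-bright boundary. (What CNN layers actually compute is cross-correlation, which is exactly this loop.)

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (valid cross-correlation)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Response at (i, j): how well this patch matches the kernel.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# Toy image: dark left half, bright right half -> one vertical edge.
img = np.zeros((5, 5))
img[:, 3:] = 1.0

vertical_edge = np.array([[-1, 0, 1],
                          [-2, 0, 2],
                          [-1, 0, 1]])

print(convolve2d(img, vertical_edge))
# Column 0 (flat dark region) stays 0; the columns straddling the
# dark-to-bright boundary light up with strong positive responses.
```

The output is the feature map described in the next section: a grid of "is my pattern here?" scores.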
Feature Map
When you slide a filter across an image, you get a feature map: a 2D grid showing where that filter's pattern was detected and how strongly.
Think of it like a heat map. Apply a vertical-edge filter to a photo of a face, and the feature map lights up along the sides of the nose, the edges of the eyes, the outline of the jaw. Dark where there are no vertical edges. Bright where there are.
Stack 32 filters and you get 32 feature maps, 32 different "views" of the same image, each highlighting a different pattern.
Padding
Here's a practical problem: if you slide a 3×3 filter across a 5×5 image, the filter can't be centered on the edge pixels. You lose a border of information, and the output shrinks.
Padding adds a ring of zeros around the image before applying the filter. This lets the filter visit every pixel, including the edges, and preserves the spatial dimensions.
It's like giving your peripheral vision a bit of extra context at the boundary of your visual field.
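A quick demonstration with NumPy's `np.pad`; the shapes are the point, not the values:

```python
import numpy as np

img = np.arange(25.0).reshape(5, 5)   # a 5x5 "image"

# Without padding, a 3x3 filter gives a (5 - 3 + 1) = 3x3 output.
# A one-pixel ring of zeros makes the padded image 7x7, so the same
# 3x3 filter now produces a 5x5 output: dimensions are preserved.
padded = np.pad(img, pad_width=1, mode="constant", constant_values=0)

print(img.shape)     # (5, 5)
print(padded.shape)  # (7, 7)
```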
Pooling (Spatial Summarization)
After detecting features, we don't need to track exactly where they appeared—just roughly where. This is pooling.
Max pooling takes a small window (say 2×2) and keeps only the strongest activation. It's like asking: "Did this feature appear anywhere in this region?" The exact location doesn't matter.
Feature map:           After 2×2 max pooling:
[1 3 2 4]
[5 6 1 2]       →      [6 4]
[3 8 4 7]              [8 7]
[1 2 6 3]
This does three things: reduces the spatial size (fewer parameters downstream), makes the network tolerant to small shifts in position (translation invariance), and forces the network to summarize rather than memorize exact locations.
Your visual cortex doesn't care if a face is shifted 5 pixels left. You still recognize it. Pooling builds that tolerance in.
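The 4×4 example can be reproduced in a few lines of NumPy. This sketch uses a reshape trick that assumes even dimensions and non-overlapping 2×2 windows:

```python
import numpy as np

def max_pool_2x2(fmap):
    """Keep only the strongest activation in each 2x2 block."""
    h, w = fmap.shape
    # Split into (h/2, 2, w/2, 2) blocks, then take the max of each block.
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [3, 8, 4, 7],
                 [1, 2, 6, 3]])

print(max_pool_2x2(fmap))
# [[6 4]
#  [8 7]]
```

Each output value is the maximum of one 2×2 block: the top-left block {1, 3, 5, 6} survives as 6, and so on.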
ReLU — Still Here
The activation function hasn't changed. After each convolution, we still apply ReLU (from Post 2):
output = max(0, convolution_result)
Negative activations become zero. Positive ones pass through. Same reason as before: it introduces non-linearity and avoids the vanishing gradient problem we discussed in Post 3. The building blocks carry forward.
Putting It Together: The CNN Pipeline
A CNN is just these ideas stacked in sequence:
Input Image
↓
[Conv → ReLU] × N ← learn local patterns (like V1 cortex)
↓
[Pooling] ← summarize, reduce size
↓
[Conv → ReLU] × N ← learn combinations of patterns (like V2/V4)
↓
[Pooling]
↓
Flatten
↓
[Fully Connected] ← classify based on learned features
↓
Softmax → Prediction
The early layers learn edges and textures. Middle layers combine those into shapes. Deep layers combine shapes into objects. It's the same hierarchy Hubel and Wiesel found in the cat's brain, just learned from data instead of evolution.
Training still uses backpropagation and Adam. The same gradient flow, the same weight updates. CNNs didn't replace what we built; they extended it with a smarter architecture.
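Here's a sketch of the shape bookkeeping through such a pipeline for a 28×28 MNIST digit. The filter counts (8 and 16) are illustrative assumptions, not a specific model:

```python
def conv_same(shape, num_filters):
    """3x3 conv with 'same' padding: H and W unchanged, channels replaced."""
    h, w, _ = shape
    return (h, w, num_filters)

def pool_2x2(shape):
    """2x2 max pooling: spatial dims halve, channels kept."""
    h, w, c = shape
    return (h // 2, w // 2, c)

shape = (28, 28, 1)            # input digit
shape = conv_same(shape, 8)    # Conv1 + ReLU -> (28, 28, 8)
shape = pool_2x2(shape)        # MaxPool      -> (14, 14, 8)
shape = conv_same(shape, 16)   # Conv2 + ReLU -> (14, 14, 16)
shape = pool_2x2(shape)        # MaxPool      -> (7, 7, 16)

flat = shape[0] * shape[1] * shape[2]
print(flat)                    # 784 values feed the final FC classifier
```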
CNN Architectures: A Brief Lineage
The core ideas (local receptive fields, learnable filters, pooling) stayed constant. What changed over the years was how deep, how wide, and how trainable.
LeNet (1998) was the proof of concept. Two conv layers, two pooling layers, three fully-connected layers. Trained on handwritten digits. ~60K parameters. It worked, but the hardware and data of the time couldn't push it further.
AlexNet (2012) was the moment everything changed. Five conv layers, three FC layers, ~60M parameters — trained on GPUs for the first time. It won ImageNet by a margin that shocked the field. The key additions: ReLU activations (faster training), dropout (regularization), and data augmentation. The deep learning era started here.
VGG (2014) asked: what if we just go deeper, but keep it simple? Only 3×3 filters, stacked in blocks. 16–19 layers, ~138M parameters. It showed that depth itself was the driver of accuracy, but the three large FC layers at the end were a parameter bottleneck.
ResNet (2015) solved the problem VGG exposed: beyond ~20 layers, adding more layers actually hurts accuracy. Not from overfitting — from gradients vanishing before they reach early layers. ResNet's fix was elegant: skip connections that let gradients bypass layers entirely. Suddenly 50-layer, 152-layer networks were trainable. ResNet-50 achieves better accuracy than VGG-16 with 5× fewer parameters.
LeNet (1998)      →  AlexNet (2012)   →  VGG (2014)      →  ResNet (2015)
~60K params          ~60M params         ~138M params       ~25M params
proof of concept     GPU + ReLU          depth matters      skip connections
                     revolution          but costly         solve depth
For the full layer-by-layer breakdown with parameter counts, see CNN_ARCHITECTURE_DEEP_DIVE.md.
CNNs Are Built for Images — Not Text
This is worth saying explicitly, because it's easy to assume CNNs are a general-purpose upgrade to fully-connected networks. They're not.
CNNs work because of two assumptions baked into the architecture:
- Nearby inputs are related — a pixel's neighbors matter more than distant pixels
- The same pattern can appear anywhere — weight sharing makes sense because an edge in the top-left is the same edge as one in the bottom-right
Images satisfy both assumptions perfectly. So do audio spectrograms and video frames.
Text doesn't. The word "not" next to "good" completely changes the meaning — but "not" next to "bad" means something different again. Context in language isn't local and positional the way it is in images. The same word means different things in different positions. Weight sharing across positions doesn't make the same kind of sense.
That's why text needs different architectures — RNNs (Post 8) that process sequences step by step, and eventually Transformers (Post 10) that learn which words to pay attention to regardless of distance. CNNs are a specialized tool, and their specialization is spatial data.
What Clicked for Me
I kept thinking about Hubel and Wiesel's cat. One neuron, one edge orientation. Seemed pointless.
Then it clicked: a neuron that responds to everything is useless. A neuron that responds to exactly one pattern, reliably, anywhere—that's signal. That's a building block.
Fully-connected networks try to be generalists from pixel one. CNNs start with specialists and compose up.
Interactive Playground
I've built an interactive playground where you can watch a CNN in action. It has two tabs:
Tab 1: FC Network vs CNN
Both models are trained from scratch on the same 1,000-sample MNIST subset using pure NumPy and Adam — the same setup from Post 4. You can adjust the FC hidden layer size, the number of CNN filters, the number of epochs (up to 20), and the batch size, then hit Train both models to run real training.
Tab 2: CNN Layer Explorer
Pick any of the digits 0, 1, 6, or 8 and explore three views:
What each filter detects — shows the raw filter weights (3×3 grid) alongside the response heatmap on your chosen digit. Bright yellow means "this pattern is strongly present here." You can see how a vertical-edge filter lights up along strokes, while a blob filter responds to filled regions.
Layer-by-layer pipeline — traces your digit through the full network: Conv1+ReLU → MaxPool → Conv2+ReLU → MaxPool → Flatten → FC → Softmax. Each stage shows the actual feature map image with a caption explaining what happened and why. A dimension table below tracks the shape at every step.
MaxPool zoom-in — takes a 4×4 patch from the conv output and shows the actual numerical values, then shows the 2×2 result after pooling. You can see exactly which values survived and why — the maximum in each 2×2 block wins.
What This Unlocked
Before CNNs, computer vision meant handcrafting features. Researchers spent years designing SIFT descriptors, HOG features, edge detectors, all by hand. Then AlexNet (2012) showed that a CNN trained on enough data could learn better features automatically, and it wasn't close. The error rate dropped from 26% to 15% in one year.
That was the moment the field changed.
Every modern vision system—object detection, medical imaging, autonomous driving, face recognition—runs on some variant of this idea. Local receptive fields. Learnable filters. Hierarchical features. Weight sharing.
All of it traced back to a cat, an electrode, and a sliding glass slide in 1959.
What's Next
We can now train CNNs that learn to see. But there's a catch: go deep enough, and training breaks down. Gradients vanish. Accuracy plateaus. Adding more layers actually hurts.
Post 7 covers the two innovations that fixed this: Batch Normalization (stabilize the activations between layers) and Residual Connections (let gradients skip layers entirely). Together, they made 50-layer, 100-layer networks trainable—and unlocked the modern era of deep learning.
References
- LeCun et al. (1998). Gradient-Based Learning Applied to Document Recognition.
- Krizhevsky et al. (2012). ImageNet Classification with Deep Convolutional Neural Networks.