"Vision is the art of seeing what is invisible to others." Jonathan Swift
When Regularization Wasn't Enough
In my last post, I showed you how dropout and weight decay stop a network from memorizing training data. We trained on MNIST, closed the generalization gap, and got a network that actually works in the real world.
It felt like we'd finally solved it.
But then I tried it on a real photograph. Not 28×28 grayscale digits. A 224×224 color image.
The math was brutal.
224 × 224 × 3 = 150,528 inputs
Connect those to 1,000 neurons: 150 million parameters
Just for the first layer. Before learning anything useful.
We needed a different idea entirely. And it came, as the best ideas often do, from biology.
The Visual Cortex Moment
In 1959, neuroscientists David Hubel and Torsten Wiesel did something remarkable. They inserted electrodes into a cat's visual cortex and projected shapes onto a screen. They were trying to find what made individual neurons fire.
Most shapes did nothing. Then, almost by accident, they moved a glass slide and cast a thin line of light across the screen.
One neuron went wild.
Not all neurons. Not the whole cortex. One specific neuron, responding to one specific edge, at one specific orientation, in one specific region of the visual field.
They kept experimenting. Different neurons responded to different orientations: horizontal edges, vertical edges, diagonal edges. Each neuron had a small receptive field: a limited patch of the visual field it paid attention to. Neurons in later areas responded to more complex patterns: corners, curves, eventually whole shapes.
The visual cortex isn't a fully connected blob. It's a hierarchy of local detectors, each building on the one before it.
That insight, decades later, became the blueprint for CNNs.
The Problem with Fully Connected Networks on Images
Here's what a fully-connected network does to an image:
FC Network sees an image as:
┌─────────────────────────────────────────────────────┐
│ pixel_1, pixel_2, pixel_3, ..., pixel_150528 │
│ (all spatial structure destroyed, every pixel │
│ connected to every neuron with separate weights) │
└─────────────────────────────────────────────────────┘
Every neuron: "I must learn about ALL 150,528 pixels equally."
This is wasteful in two ways. First, a pixel in the top-left corner has almost nothing to do with a pixel in the bottom-right corner, but the network treats them as equally related. Second, if a cat's ear appears in the top-left of one image and the top-right of another, the network needs separate neurons to detect it in each location.
A CNN thinks differently:
CNN sees an image as:
┌──────────────────────────────────────────────────────┐
│ ┌───┐ ┌───┐ ┌───┐ │
│ │ F │ │ F │ │ F │ ← same filter, sliding across │
│ └───┘ └───┘ └───┘ "Is there an edge here?" │
│ ↘ ↓ ↙ │
│ [feature map] │
└──────────────────────────────────────────────────────┘
Every filter: "I detect ONE pattern, ANYWHERE in the image."
Same filter, applied everywhere. One set of weights to detect vertical edges across the entire image. This is weight sharing, and it's the core reason CNNs work.
The parameter comparison is stark:
| | FC Network | CNN (typical) |
|---|---|---|
| Input | 224×224×3 = 150K | 224×224×3 = 150K |
| First layer params | 150K × 1000 = 150M | 96 filters × 11×11×3 = 35K |
| Assumption | all pixels equally related | nearby pixels are related |
For the full mathematical breakdown of parameter counts across FC, LeNet, AlexNet, VGG, and ResNet, see CNN_ARCHITECTURE_DEEP_DIVE.md.
Key Concepts, Grounded in Biology
Local Receptive Field
In the visual cortex, each neuron only responds to a small patch of the visual field. It's local. It doesn't see the whole image—just its neighborhood.
In a CNN, each filter application does the same thing. A 3×3 filter looks at a 3×3 patch of the image. That's its receptive field. It asks: "Does my pattern exist in this small region?"
As you go deeper in the network, receptive fields grow. A neuron in layer 3 has seen the outputs of layer 2, which saw layer 1, which saw the raw pixels. So it effectively "sees" a larger region, just like neurons deeper in the visual cortex respond to larger, more complex patterns.
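A minimal sketch of that growth for stacked 3×3, stride-1 convolutions (a simplification: pooling and strided layers grow the receptive field even faster):

```python
def receptive_field(num_layers, kernel=3):
    """Receptive field of a stack of stride-1 convolutions.

    Each extra layer lets a neuron 'see' (kernel - 1) more pixels.
    """
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1
    return rf

for n in range(1, 4):
    size = receptive_field(n)
    print(f"layer {n}: effectively sees a {size}x{size} patch")
# layer 1: 3x3, layer 2: 5x5, layer 3: 7x7
```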
Learnable Filter (The Edge Detector)
A filter is just a small grid of numbers—say 3×3. During training, backpropagation (the same algorithm from Post 3) adjusts these numbers until the filter detects something useful. One filter might learn to detect vertical edges. Another learns horizontal edges. Another learns a specific texture.
Learned vertical edge filter:    Learned horizontal edge filter:
[-1  0  1]                       [-1 -2 -1]
[-2  0  2]                       [ 0  0  0]
[-1  0  1]                       [ 1  2  1]
The network learns these automatically; no human designs them. That's the power.
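Here's a rough NumPy sketch of what "sliding a filter" means, applying the vertical-edge filter above to a toy image with one dark-to-bright boundary. (What CNN layers actually compute is cross-correlation, which is exactly this loop.)

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (valid cross-correlation)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Response at (i, j): how well this patch matches the kernel.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# Toy image: dark left half, bright right half -> one vertical edge.
img = np.zeros((5, 5))
img[:, 3:] = 1.0

vertical_edge = np.array([[-1, 0, 1],
                          [-2, 0, 2],
                          [-1, 0, 1]])

print(convolve2d(img, vertical_edge))
# Column 0 (flat dark region) stays 0; the columns straddling the
# dark-to-bright boundary light up with strong positive responses.
```

The output is the feature map described in the next section: a grid of "is my pattern here?" scores.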
Feature Map
When you slide a filter across an image, you get a feature map: a 2D grid showing where that filter's pattern was detected and how strongly.
Think of it like a heat map. Apply a vertical-edge filter to a photo of a face, and the feature map lights up along the sides of the nose, the edges of the eyes, the outline of the jaw. Dark where there are no vertical edges. Bright where there are.
Stack 32 filters and you get 32 feature maps, 32 different "views" of the same image, each highlighting a different pattern.
Padding
Here's a practical problem: if you slide a 3×3 filter across a 5×5 image, the filter can't be centered on the edge pixels. You lose a border of information, and the output shrinks.
Padding adds a ring of zeros around the image before applying the filter. This lets the filter visit every pixel, including the edges, and preserves the spatial dimensions.
It's like giving your peripheral vision a bit of extra context at the boundary of your visual field.
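A quick demonstration with NumPy's `np.pad`; the shapes are the point, not the values:

```python
import numpy as np

img = np.arange(25.0).reshape(5, 5)   # a 5x5 "image"

# Without padding, a 3x3 filter gives a (5 - 3 + 1) = 3x3 output.
# A one-pixel ring of zeros makes the padded image 7x7, so the same
# 3x3 filter now produces a 5x5 output: dimensions are preserved.
padded = np.pad(img, pad_width=1, mode="constant", constant_values=0)

print(img.shape)     # (5, 5)
print(padded.shape)  # (7, 7)
```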
Pooling (Spatial Summarization)
After detecting features, we don't need to track exactly where they appeared—just roughly where. This is pooling.
Max pooling takes a small window (say 2×2) and keeps only the strongest activation. It's like asking: "Did this feature appear anywhere in this region?" The exact location doesn't matter.
Feature map:           After 2×2 max pooling:
[1 3 2 4]
[5 6 1 2]       →      [6 4]
[3 8 4 7]              [8 7]
[1 2 6 3]
This does three things: reduces the spatial size (fewer parameters downstream), makes the network tolerant to small shifts in position (translation invariance), and forces the network to summarize rather than memorize exact locations.
Your visual cortex doesn't care if a face is shifted 5 pixels left. You still recognize it. Pooling builds that tolerance in.
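The 4×4 example can be reproduced in a few lines of NumPy. This sketch uses a reshape trick that assumes even dimensions and non-overlapping 2×2 windows:

```python
import numpy as np

def max_pool_2x2(fmap):
    """Keep only the strongest activation in each 2x2 block."""
    h, w = fmap.shape
    # Split into (h/2, 2, w/2, 2) blocks, then take the max of each block.
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [3, 8, 4, 7],
                 [1, 2, 6, 3]])

print(max_pool_2x2(fmap))
# [[6 4]
#  [8 7]]
```

Each output value is the maximum of one 2×2 block: the top-left block {1, 3, 5, 6} survives as 6, and so on.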
ReLU — Still Here
The activation function hasn't changed. After each convolution, we still apply ReLU (from Post 2):
output = max(0, convolution_result)
Negative activations become zero. Positive ones pass through. Same reason as before: it introduces non-linearity and avoids the vanishing gradient problem we discussed in Post 3. The building blocks carry forward.
Putting It Together: The CNN Pipeline
A CNN is just these ideas stacked in sequence:
Input Image
↓
[Conv → ReLU] × N ← learn local patterns (like V1 cortex)
↓
[Pooling] ← summarize, reduce size
↓
[Conv → ReLU] × N ← learn combinations of patterns (like V2/V4)
↓
[Pooling]
↓
Flatten
↓
[Fully Connected] ← classify based on learned features
↓
Softmax → Prediction
The early layers learn edges and textures. Middle layers combine those into shapes. Deep layers combine shapes into objects. It's the same hierarchy Hubel and Wiesel found in the cat's brain, just learned from data instead of evolution.
Training still uses backpropagation and Adam. The same gradient flow, the same weight updates. CNNs didn't replace what we built; they extended it with a smarter architecture.
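Here's a sketch of the shape bookkeeping through such a pipeline for a 28×28 MNIST digit. The filter counts (8 and 16) are illustrative assumptions, not a specific model:

```python
def conv_same(shape, num_filters):
    """3x3 conv with 'same' padding: H and W unchanged, channels replaced."""
    h, w, _ = shape
    return (h, w, num_filters)

def pool_2x2(shape):
    """2x2 max pooling: spatial dims halve, channels kept."""
    h, w, c = shape
    return (h // 2, w // 2, c)

shape = (28, 28, 1)            # input digit
shape = conv_same(shape, 8)    # Conv1 + ReLU -> (28, 28, 8)
shape = pool_2x2(shape)        # MaxPool      -> (14, 14, 8)
shape = conv_same(shape, 16)   # Conv2 + ReLU -> (14, 14, 16)
shape = pool_2x2(shape)        # MaxPool      -> (7, 7, 16)

flat = shape[0] * shape[1] * shape[2]
print(flat)                    # 784 values feed the final FC classifier
```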
CNN Architectures: A Brief Lineage
The core ideas (local receptive fields, learnable filters, pooling) stayed constant. What changed over the years was how deep, how wide, and how trainable.
LeNet (1998) was the proof of concept. Two conv layers, two pooling layers, three fully-connected layers. Trained on handwritten digits. ~60K parameters. It worked, but the hardware and data of the time couldn't push it further.
AlexNet (2012) was the moment everything changed. Five conv layers, three FC layers, ~60M parameters — trained on GPUs for the first time. It won ImageNet by a margin that shocked the field. The key additions: ReLU activations (faster training), dropout (regularization), and data augmentation. The deep learning era started here.
VGG (2014) asked: what if we just go deeper, but keep it simple? Only 3×3 filters, stacked in blocks. 16–19 layers, ~138M parameters. It showed that depth itself was the driver of accuracy, but the three large FC layers at the end were a parameter bottleneck.
ResNet (2015) solved the problem VGG exposed: beyond ~20 layers, adding more layers actually hurts accuracy. Not from overfitting — from gradients vanishing before they reach early layers. ResNet's fix was elegant: skip connections that let gradients bypass layers entirely. Suddenly 50-layer, 152-layer networks were trainable. ResNet-50 achieves better accuracy than VGG-16 with 5× fewer parameters.
LeNet (1998)      →  AlexNet (2012)   →  VGG (2014)      →  ResNet (2015)
~60K params          ~60M params         ~138M params       ~25M params
proof of concept     GPU + ReLU          depth matters      skip connections
                     revolution          but costly         solve depth
For the full layer-by-layer breakdown with parameter counts, see CNN_ARCHITECTURE_DEEP_DIVE.md.
CNNs Are Built for Images — Not Text
This is worth saying explicitly, because it's easy to assume CNNs are a general-purpose upgrade to fully-connected networks. They're not.
CNNs work because of two assumptions baked into the architecture:
- Nearby inputs are related — a pixel's neighbors matter more than distant pixels
- The same pattern can appear anywhere — weight sharing makes sense because an edge in the top-left is the same edge as one in the bottom-right
Images satisfy both assumptions perfectly. So do audio spectrograms and video frames.
Text doesn't. The word "not" next to "good" completely changes the meaning — but "not" next to "bad" means something different again. Context in language isn't local and positional the way it is in images. The same word means different things in different positions. Weight sharing across positions doesn't make the same kind of sense.
That's why text needs different architectures — RNNs (Post 8) that process sequences step by step, and eventually Transformers (Post 10) that learn which words to pay attention to regardless of distance. CNNs are a specialized tool, and their specialization is spatial data.
What Clicked for Me
I kept thinking about Hubel and Wiesel's cat. One neuron, one edge orientation. Seemed pointless.
Then it clicked: a neuron that responds to everything is useless. A neuron that responds to exactly one pattern, reliably, anywhere—that's signal. That's a building block.
Fully-connected networks try to be generalists from pixel one. CNNs start with specialists and compose up.
Interactive Playground
I've built an interactive playground where you can watch a CNN in action. It has two tabs:
Tab 1: FC Network vs CNN
Both models are trained from scratch on the same 1,000-sample MNIST subset using pure NumPy and Adam — the same setup from Post 4. You can adjust the FC hidden layer size, the number of CNN filters, the number of epochs (up to 20), and the batch size, then hit Train both models to run real training.
Tab 2: CNN Layer Explorer
Pick any of the digits 0, 1, 6, or 8 and explore three views:
What each filter detects — shows the raw filter weights (3×3 grid) alongside the response heatmap on your chosen digit. Bright yellow means "this pattern is strongly present here." You can see how a vertical-edge filter lights up along strokes, while a blob filter responds to filled regions.
Layer-by-layer pipeline — traces your digit through the full network: Conv1+ReLU → MaxPool → Conv2+ReLU → MaxPool → Flatten → FC → Softmax. Each stage shows the actual feature map image with a caption explaining what happened and why. A dimension table below tracks the shape at every step.
MaxPool zoom-in — takes a 4×4 patch from the conv output and shows the actual numerical values, then shows the 2×2 result after pooling. You can see exactly which values survived and why — the maximum in each 2×2 block wins.
What This Unlocked
Before CNNs, computer vision meant handcrafting features. Researchers spent years designing SIFT descriptors, HOG features, edge detectors, all by hand. Then AlexNet (2012) showed that a CNN trained on enough data could learn better features automatically, and it wasn't close. The error rate dropped from 26% to 15% in one year.
That was the moment the field changed.
Every modern vision system—object detection, medical imaging, autonomous driving, face recognition—runs on some variant of this idea. Local receptive fields. Learnable filters. Hierarchical features. Weight sharing.
All of it traced back to a cat, an electrode, and a sliding glass slide in 1959.
What's Next
We can now train CNNs that learn to see. But there's a catch: go deep enough, and training breaks down. Gradients vanish. Accuracy plateaus. Adding more layers actually hurts.
Post 7 covers the two innovations that fixed this: Batch Normalization (stabilize the activations between layers) and Residual Connections (let gradients skip layers entirely). Together, they made 50-layer, 100-layer networks trainable—and unlocked the modern era of deep learning.
References
- LeCun et al. (1998). Gradient-Based Learning Applied to Document Recognition.
- Krizhevsky et al. (2012). ImageNet Classification with Deep Convolutional Neural Networks.