Understanding CNNs requires more than memorizing architectures. Learn how convolution, stride, padding, and output size shape spatial behavior in deep learning models, with practical intuition and real-world design insights.
Cross-posted from Zeromath. Original article: https://zeromathai.com/en/pooling-activation-layers-en/
## The Real Problem: Spatial Understanding (Not Layers)
Most CNN issues are NOT about:
- “Which architecture should I use?”
They are about:
- wrong tensor shapes
- misunderstanding stride/padding
- losing spatial information too early
If you’ve ever hit something like:
`RuntimeError: size mismatch`
This post is for you.
## Convolution = Sliding Pattern Detector
At each position:
- Take a small patch
- Multiply with filter weights
- Sum → one output value
Repeat → feature map
Key properties:
- local connectivity
- shared weights
- translation equivariance (a shifted input produces a shifted feature map)
This is why CNNs scale.
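The sliding procedure above can be sketched in a few lines of NumPy (`conv2d` is an illustrative helper written for this post, not a library function; real frameworks use heavily optimized implementations):

```python
import numpy as np

def conv2d(image, kernel):
    """Naive stride-1, no-padding convolution (cross-correlation, as in deep learning)."""
    m, k = image.shape[0], kernel.shape[0]
    out = m - k + 1  # output size with stride 1, no padding
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i:i+k, j:j+k]            # take a small patch
            result[i, j] = np.sum(patch * kernel)  # multiply with weights, sum -> one value
    return result

# A vertical-edge filter on a toy image: left half dark, right half bright
img = np.zeros((5, 5))
img[:, 2:] = 1.0
edge_filter = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]], dtype=float)
fmap = conv2d(img, edge_filter)
print(fmap.shape)  # (3, 3)
```

Note that what deep learning calls "convolution" is technically cross-correlation: the kernel is not flipped before sliding.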
## Filters: What Your Model Actually Learns
Each filter learns ONE pattern.
Examples:
- edge detector
- texture detector
- color transition
Multiple filters → multiple feature maps
Conceptually:
output_channels = num_filters
Important:
CNNs don’t learn “images” — they learn patterns.
## Receptive Field (Core Concept)
Each neuron sees only part of the image.
Example:
- Input: 32×32
- Kernel: 5×5
→ neuron sees 5×5 region
Stack layers:
→ receptive field grows
Meaning:
- early layers → local features
- deeper layers → global features
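The growth of the receptive field follows a standard recurrence, which can be computed directly (`receptive_field` is an illustrative helper, not a library function):

```python
def receptive_field(layers):
    """Receptive field of one output neuron after a stack of conv layers.

    layers: list of (kernel_size, stride) tuples, first layer first.
    Recurrence: r_l = r_{l-1} + (k_l - 1) * jump_{l-1}.
    """
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump  # each layer widens the field
        jump *= s            # stride compounds the step between neighboring neurons
    return r

# Three stacked 3x3, stride-1 convs cover the same 7x7 region as one 7x7 conv
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
```

Stacking small kernels reaches the same receptive field as one large kernel with fewer parameters and more nonlinearity, which is one reason modern architectures favor 3x3 convolutions.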
## Stride = Resolution Control
Stride defines how far the filter moves.
| Stride | Effect |
|---|---|
| 1 | high detail |
| 2 | downsample |
Trade-off:
- larger stride → faster
- but → information loss
Real-world mistake:
using stride=2 too early → model misses fine features
## Padding = Boundary Control
Without padding:
- output shrinks
- edge information disappears fast
With padding:
- spatial size preserved
- borders are kept
Typical implementation:
`padding = (kernel_size - 1) // 2`
Rule of thumb:
deep CNNs almost always use padding
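To see why, compare how spatial size evolves with and without padding (a minimal sketch; `same_padding` and `out_size` are illustrative helpers):

```python
def same_padding(kernel_size):
    """Padding that preserves spatial size for stride-1, odd-sized kernels."""
    return (kernel_size - 1) // 2

def out_size(m, k, s=1, p=0):
    """Spatial output size of a conv layer."""
    return (m - k + 2 * p) // s + 1

# Without padding, a 3x3 conv shrinks the map by 2 pixels per layer...
size = 32
for _ in range(5):
    size = out_size(size, 3)
print(size)  # 22

# ...while "same" padding keeps it at 32 through every layer
size = 32
for _ in range(5):
    size = out_size(size, 3, p=same_padding(3))
print(size)  # 32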
## Output Size Formula (You MUST Know This)
Output = floor((m - k + 2p) / s) + 1
Where:
- m = input size
- k = kernel size
- s = stride
- p = padding
Example (k=3, s=1, p=0):
(7 - 3) / 1 + 1 = 5
If you don’t calculate this:
→ your model WILL break
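A tiny helper makes the check painless (`out_size` is a hypothetical name; floor division handles the case where the filter does not fit evenly):

```python
def out_size(m, k, s, p=0):
    """Spatial output size of a conv layer: floor((m - k + 2p) / s) + 1."""
    return (m - k + 2 * p) // s + 1

print(out_size(7, 3, 1))        # (7 - 3)/1 + 1 = 5
print(out_size(32, 3, 2, p=1))  # stride 2 roughly halves the map: 16
print(out_size(28, 5, 1))       # 24
```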
## One Filter vs Many Filters
| Filters | Output |
|---|---|
| 1 | 1 feature map |
| 32 | 32 channels |
Output shape:
H × W × C
Where:
- C = number of filters
More filters = richer representation
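The shape rule can be verified with a naive multi-filter convolution in NumPy (illustrative code: single-channel input, stride 1, no padding):

```python
import numpy as np

H, W, k, num_filters = 32, 32, 3, 32
image = np.random.rand(H, W)                  # single-channel input
filters = np.random.rand(num_filters, k, k)   # one kernel per filter

out = H - k + 1  # stride 1, no padding
feature_maps = np.zeros((out, out, num_filters))
for c in range(num_filters):                  # each filter yields one feature map
    for i in range(out):
        for j in range(out):
            feature_maps[i, j, c] = np.sum(image[i:i+k, j:j+k] * filters[c])

print(feature_maps.shape)  # (30, 30, 32) -> H' x W' x C, with C = num_filters
```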
## Common Real-World Mistakes
1. Shape mismatch
You didn’t compute output size correctly.
2. Too much downsampling
Large stride early → lost spatial information.
3. No padding
Edges vanish layer by layer.
4. Too few filters
Model lacks expressive power.
## Design Intuition (What Actually Matters)
When designing CNNs:
- kernel size → what patterns you detect
- stride → how fast you compress
- padding → whether you preserve structure
- filters → how rich your representation is
This is not hyperparameter tuning.
This is:
designing how your model perceives the world
## Final Takeaway
CNNs don’t “see images”.
They:
- scan locally
- extract patterns
- build hierarchical representations
If you understand:
- convolution
- receptive field
- stride
- padding
Then you understand:
how CNNs actually work
What part of CNN design still feels confusing?
Drop your thoughts 👇