Understanding CNNs requires more than memorizing architectures. Learn how convolution, stride, padding, and output size shape spatial behavior in deep learning models, with practical intuition and real-world design insights.
Cross-posted from Zeromath. Original article: https://zeromathai.com/en/pooling-activation-layers-en/
## The Real Problem: Spatial Understanding (Not Layers)
Most CNN issues are NOT about:
- “Which architecture should I use?”
They are about:
- wrong tensor shapes
- misunderstanding stride/padding
- losing spatial information too early
If you’ve ever hit something like:
`RuntimeError: size mismatch`
This post is for you.
## Convolution = Sliding Pattern Detector
At each position:
- Take a small patch
- Multiply with filter weights
- Sum → one output value
Repeat → feature map
Key properties:
- local connectivity
- shared weights
- translation equivariance (a shifted input produces a shifted feature map)
This is why CNNs scale.
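The sliding procedure above can be sketched in a few lines of NumPy (`conv2d` is an illustrative helper written for this post, not a library function; real frameworks use heavily optimized implementations):

```python
import numpy as np

def conv2d(image, kernel):
    """Naive stride-1, no-padding convolution (cross-correlation, as in deep learning)."""
    m, k = image.shape[0], kernel.shape[0]
    out = m - k + 1  # output size with stride 1, no padding
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i:i+k, j:j+k]            # take a small patch
            result[i, j] = np.sum(patch * kernel)  # multiply with weights, sum -> one value
    return result

# A vertical-edge filter on a toy image: left half dark, right half bright
img = np.zeros((5, 5))
img[:, 2:] = 1.0
edge_filter = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]], dtype=float)
fmap = conv2d(img, edge_filter)
print(fmap.shape)  # (3, 3)
```

Note that what deep learning calls "convolution" is technically cross-correlation: the kernel is not flipped before sliding.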
## Filters: What Your Model Actually Learns
Each filter learns ONE pattern.
Examples:
- edge detector
- texture detector
- color transition
Multiple filters → multiple feature maps
Conceptually:
output_channels = num_filters
Important:
CNNs don’t learn “images” — they learn patterns.
## Receptive Field (Core Concept)
Each neuron sees only part of the image.
Example:
- Input: 32×32
- Kernel: 5×5
→ neuron sees 5×5 region
Stack layers:
→ receptive field grows
Meaning:
- early layers → local features
- deeper layers → global features
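The growth of the receptive field follows a standard recurrence, which can be computed directly (`receptive_field` is an illustrative helper, not a library function):

```python
def receptive_field(layers):
    """Receptive field of one output neuron after a stack of conv layers.

    layers: list of (kernel_size, stride) tuples, first layer first.
    Recurrence: r_l = r_{l-1} + (k_l - 1) * jump_{l-1}.
    """
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump  # each layer widens the field
        jump *= s            # stride compounds the step between neighboring neurons
    return r

# Three stacked 3x3, stride-1 convs cover the same 7x7 region as one 7x7 conv
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
```

Stacking small kernels reaches the same receptive field as one large kernel with fewer parameters and more nonlinearity, which is one reason modern architectures favor 3x3 convolutions.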
## Stride = Resolution Control
Stride defines how far the filter moves.
| Stride | Effect |
|---|---|
| 1 | high detail |
| 2 | downsample |
Trade-off:
- larger stride → faster
- but → information loss
Real-world mistake:
using stride=2 too early → model misses fine features
## Padding = Boundary Control
Without padding:
- output shrinks
- edge information disappears fast
With padding:
- spatial size preserved
- borders are kept
Typical implementation:
`padding = (kernel_size - 1) // 2`
Rule of thumb:
deep CNNs almost always use padding
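To see why, compare how spatial size evolves with and without padding (a minimal sketch; `same_padding` and `out_size` are illustrative helpers):

```python
def same_padding(kernel_size):
    """Padding that preserves spatial size for stride-1, odd-sized kernels."""
    return (kernel_size - 1) // 2

def out_size(m, k, s=1, p=0):
    """Spatial output size of a conv layer."""
    return (m - k + 2 * p) // s + 1

# Without padding, a 3x3 conv shrinks the map by 2 pixels per layer...
size = 32
for _ in range(5):
    size = out_size(size, 3)
print(size)  # 22

# ...while "same" padding keeps it at 32 through every layer
size = 32
for _ in range(5):
    size = out_size(size, 3, p=same_padding(3))
print(size)  # 32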
## Output Size Formula (You MUST Know This)
Output = floor((m - k + 2p) / s) + 1
Where:
- m = input size
- k = kernel size
- s = stride
- p = padding
Example (k=3, s=1, p=0):
(7 - 3) / 1 + 1 = 5
If you don’t calculate this:
→ your model WILL break
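A tiny helper makes the check painless (`out_size` is a hypothetical name; floor division handles the case where the filter does not fit evenly):

```python
def out_size(m, k, s, p=0):
    """Spatial output size of a conv layer: floor((m - k + 2p) / s) + 1."""
    return (m - k + 2 * p) // s + 1

print(out_size(7, 3, 1))        # (7 - 3)/1 + 1 = 5
print(out_size(32, 3, 2, p=1))  # stride 2 roughly halves the map: 16
print(out_size(28, 5, 1))       # 24
```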
## One Filter vs Many Filters
| Filters | Output |
|---|---|
| 1 | 1 feature map |
| 32 | 32 channels |
Output shape:
H × W × C
Where:
- C = number of filters
More filters = richer representation
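The shape rule can be verified with a naive multi-filter convolution in NumPy (illustrative code: single-channel input, stride 1, no padding):

```python
import numpy as np

H, W, k, num_filters = 32, 32, 3, 32
image = np.random.rand(H, W)                  # single-channel input
filters = np.random.rand(num_filters, k, k)   # one kernel per filter

out = H - k + 1  # stride 1, no padding
feature_maps = np.zeros((out, out, num_filters))
for c in range(num_filters):                  # each filter yields one feature map
    for i in range(out):
        for j in range(out):
            feature_maps[i, j, c] = np.sum(image[i:i+k, j:j+k] * filters[c])

print(feature_maps.shape)  # (30, 30, 32) -> H' x W' x C, with C = num_filters
```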
## Common Real-World Mistakes
1. Shape mismatch
You didn’t compute output size correctly.
2. Too much downsampling
Large stride early → lost spatial information.
3. No padding
Edges vanish layer by layer.
4. Too few filters
Model lacks expressive power.
## Design Intuition (What Actually Matters)
When designing CNNs:
- kernel size → what patterns you detect
- stride → how fast you compress
- padding → whether you preserve structure
- filters → how rich your representation is
This is not hyperparameter tuning.
This is:
designing how your model perceives the world
## Final Takeaway
CNNs don’t “see images”.
They:
- scan locally
- extract patterns
- build hierarchical representations
If you understand:
- convolution
- receptive field
- stride
- padding
Then you understand:
how CNNs actually work
What part of CNN design still feels confusing?
Drop your thoughts 👇