shangkyu shin

Posted on • Originally published at zeromathai.com

CNN Spatial Behavior Explained: Convolution, Stride, Padding, and Output Size (With Intuition)

Understanding CNNs requires more than just architectures. Learn how convolution, stride, padding, and output size shape spatial behavior in deep learning models, with practical intuition and real-world design insights.

Cross-posted from Zeromath. Original article: https://zeromathai.com/en/pooling-activation-layers-en/


The Real Problem: Spatial Understanding (Not Layers)

Most CNN issues are NOT about:

  • “Which architecture should I use?”

They are about:

  • wrong tensor shapes
  • misunderstanding stride/padding
  • losing spatial information too early

If you’ve ever hit something like:

RuntimeError: size mismatch

This post is for you.


Convolution = Sliding Pattern Detector

At each position:

  1. Take a small patch
  2. Multiply with filter weights
  3. Sum → one output value

Repeat → feature map

Key properties:

  • local connectivity
  • shared weights
  • translation equivariance (a shifted pattern produces a correspondingly shifted response)

This is why CNNs scale.
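The three steps above can be sketched directly. Here is a minimal NumPy implementation of a single-filter convolution (stride 1, no padding); the image, the edge kernel, and the function name are illustrative choices, not from the original post:

```python
import numpy as np

def conv2d_single(image, kernel):
    """Slide the kernel over the image (stride 1, no padding):
    at each position, multiply the patch by the kernel and sum."""
    m, k = image.shape[0], kernel.shape[0]
    out = m - k + 1
    fmap = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i:i + k, j:j + k]       # 1. take a small patch
            fmap[i, j] = np.sum(patch * kernel)   # 2–3. multiply and sum
    return fmap

image = np.arange(16, dtype=float).reshape(4, 4)
edge = np.array([[1., -1.], [1., -1.]])  # toy vertical-edge kernel
fmap = conv2d_single(image, edge)
print(fmap.shape)  # (3, 3): (4 - 2) / 1 + 1 = 3
```

Note the shared weights: the same `kernel` is reused at every position, which is exactly why the parameter count stays small.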


Filters: What Your Model Actually Learns

Each filter learns ONE pattern.

Examples:

  • edge detector
  • texture detector
  • color transition

Multiple filters → multiple feature maps

Conceptually:

output_channels = num_filters

Important:

CNNs don’t learn “images” — they learn patterns.


Receptive Field (Core Concept)

Each neuron sees only part of the image.

Example:

  • Input: 32×32
  • Kernel: 5×5

→ neuron sees 5×5 region

Stack layers:
→ receptive field grows

Meaning:

  • early layers → local features
  • deeper layers → global features
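The growth of the receptive field with depth can be computed with a small helper. This is a sketch using the standard recurrence (each layer adds `(k - 1) * jump` to the receptive field, where `jump` is the product of earlier strides); the function name and layer configurations are illustrative:

```python
def receptive_field(layers):
    """Receptive field at the output, given (kernel, stride) per layer."""
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump  # each layer widens the field
        jump *= s            # stride multiplies the step between positions
    return r

# three stacked 3x3, stride-1 convs see a 7x7 input region
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
```

This is why stacking small kernels works: three 3×3 layers cover the same 7×7 region as one 7×7 kernel, with fewer parameters and more nonlinearity.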

Stride = Resolution Control

Stride defines how far the filter moves.

| Stride | Effect      |
| ------ | ----------- |
| 1      | high detail |
| 2      | downsample  |

Trade-off:

  • larger stride → faster
  • but → information loss

Real-world mistake:

using stride=2 too early → model misses fine features
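To see how fast this compounds, here is a sketch assuming a hypothetical 64×64 input, 3×3 kernels, no padding, and three consecutive stride-2 convolutions:

```python
def out_size(m, k, s):
    # no-padding output size: (m - k) // s + 1
    return (m - k) // s + 1

sizes = [64]                 # hypothetical 64x64 input
for _ in range(3):           # three consecutive stride-2, 3x3 convs
    sizes.append(out_size(sizes[-1], 3, 2))
print(sizes)  # [64, 31, 15, 7]
```

After just three layers, a 7×7 map is all that remains of the 64×64 input: any feature smaller than the surviving grid spacing is gone for good.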


Padding = Boundary Control

Without padding:

  • output shrinks
  • edge information disappears fast

With padding:

  • spatial size preserved
  • borders are kept

Typical implementation:

padding = (kernel_size - 1) // 2

Rule of thumb:

deep CNNs almost always use padding
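A quick check that `(kernel_size - 1) // 2` really does preserve spatial size for odd kernels at stride 1 (the 32-pixel input here is an arbitrary example):

```python
# with p = (k - 1) // 2 and stride 1, output size equals input size
results = {}
for k in (3, 5, 7):                      # odd kernel sizes
    p = (k - 1) // 2                     # "same" padding
    results[k] = (32 - k + 2 * p) + 1    # stride-1 output for a 32-pixel input
print(results)  # {3: 32, 5: 32, 7: 32}
```

This only works cleanly for odd kernel sizes, which is one reason 3×3 and 5×5 kernels dominate in practice.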


Output Size Formula (You MUST Know This)

Output = floor((m - k + 2p) / s) + 1

Where:

  • m = input size
  • k = kernel size
  • s = stride
  • p = padding (0 if none)

Example:

(7 - 3) / 1 + 1 = 5

If you don’t calculate this:
→ your model WILL break
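The formula is worth wrapping in a small helper so you never compute it by hand. The function name is mine; the second call uses the well-known 7×7, stride-2, padding-3 stem configuration as a sanity check:

```python
def conv_output_size(m, k, s, p=0):
    """Output = floor((m - k + 2p) / s) + 1"""
    return (m - k + 2 * p) // s + 1

print(conv_output_size(7, 3, 1))       # 5, matching the worked example
print(conv_output_size(224, 7, 2, 3))  # 112 (a 7x7 / stride-2 / pad-3 stem)
```

Running this once per layer before training catches every `size mismatch` at design time instead of at runtime.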


One Filter vs Many Filters

| Filters | Output        |
| ------- | ------------- |
| 1       | 1 feature map |
| 32      | 32 channels   |

Output shape:

H × W × C

Where:

  • C = number of filters

More filters = richer representation
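The channel dimension falls out naturally when you stack per-filter feature maps. A minimal NumPy sketch (stride 1, no padding; the random image and kernels are placeholders for learned weights):

```python
import numpy as np

def conv_multi(image, kernels):
    """Apply each kernel independently and stack the resulting feature
    maps along a channel axis: output shape is H x W x C."""
    k = kernels[0].shape[0]
    out = image.shape[0] - k + 1
    maps = [np.array([[np.sum(image[i:i + k, j:j + k] * kern)
                       for j in range(out)]
                      for i in range(out)])
            for kern in kernels]
    return np.stack(maps, axis=-1)  # C = number of kernels

image = np.random.rand(8, 8)
kernels = [np.random.rand(3, 3) for _ in range(32)]
print(conv_multi(image, kernels).shape)  # (6, 6, 32): C = num_filters
```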


Common Real-World Mistakes

1. Shape mismatch

You didn’t compute output size correctly.

2. Too much downsampling

Large stride early → lost spatial information.

3. No padding

Edges vanish layer by layer.

4. Too few filters

Model lacks expressive power.


Design Intuition (What Actually Matters)

When designing CNNs:

  • kernel size → what patterns you detect
  • stride → how fast you compress
  • padding → whether you preserve structure
  • filters → how rich your representation is

This is not hyperparameter tuning.

This is:

designing how your model perceives the world


Final Takeaway

CNNs don’t “see images”.

They:

  • scan locally
  • extract patterns
  • build hierarchical representations

If you understand:

  • convolution
  • receptive field
  • stride
  • padding

Then you understand:

how CNNs actually work


What part of CNN design still feels confusing?

Drop your thoughts 👇
