shangkyu shin

Posted on • Originally published at zeromathai.com

CNNs Explained: How Image Classification Actually Works in Deep Learning

Understanding CNNs means understanding how models turn raw pixels into structured representations. This guide explains convolution, pooling, and architectures like ResNet with practical insights.

Cross-posted from Zeromath. Original article: https://zeromathai.com/en/dl-convolutional-neural-networks-cnn-en/


The Real Problem: Pixels → Meaning

Images are just tensors.

No objects. No semantics.

So the real question is:

How do we extract structure from raw data?
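To see what "just tensors" means, here is a minimal NumPy sketch: to the model, an image is only a grid of numbers.

```python
import numpy as np

# A 32x32 RGB image is nothing but a (height, width, channels)
# array of intensity values -- no labels, no objects, just numbers.
img = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)

print(img.shape)   # (32, 32, 3)
print(img[0, 0])   # three channel values for a single pixel
```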


Why Old Pipelines Didn’t Scale

Classic approach:

  • Feature extraction (SIFT, HOG)
  • Classifier (SVM)

Limitation:

You only learn what you design.
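A toy illustration of hand-designed features (not SIFT or HOG themselves, just the same idea in miniature): the classifier only ever sees the numbers the engineer thought to compute.

```python
import numpy as np

# Hand-designed features, classic-pipeline style: whatever the
# engineer chose to measure is all the classifier ever sees.
def hand_features(img):
    gx = np.abs(np.diff(img, axis=1)).mean()  # horizontal gradient energy
    gy = np.abs(np.diff(img, axis=0)).mean()  # vertical gradient energy
    return np.array([img.mean(), gx, gy])

img = np.zeros((8, 8))
img[:, 4:] = 1.0               # a vertical edge
print(hand_features(img))      # brightness, strong gx, zero gy
```

If the discriminative pattern isn't captured by these three numbers, the pipeline simply cannot learn it.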


Why MLPs Fail (Critical Insight)

Flattening images destroys structure.

Problems:

  • Parameter explosion
  • No spatial awareness

But the deeper issue:

No reuse of patterns
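The parameter explosion is easy to quantify. A back-of-the-envelope count for one fully connected layer on a standard 224×224 RGB input:

```python
# Flattening a 224x224 RGB image into one fully connected layer:
pixels = 224 * 224 * 3           # 150,528 inputs
hidden = 1024
mlp_weights = pixels * hidden    # one layer, ~154 million weights

# And because no weights are shared across positions, the same
# pattern (say, an edge) must be relearned at every location.
print(f"{mlp_weights:,}")        # 154,140,672
```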


CNNs = Structured Efficiency

CNNs fix this with:

  • Local connectivity
  • Weight sharing

Meaning:

  • Fewer parameters
  • Better generalization
  • Built-in spatial bias
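Compare the count above with a convolutional layer over the same input: weight sharing collapses millions of weights into a few thousand.

```python
# A 3x3 conv with 64 output channels over an RGB image:
k, c_in, c_out = 3, 3, 64
conv_params = k * k * c_in * c_out + c_out   # weights + biases
print(conv_params)                           # 1792

# Those 1,792 parameters slide over every location, so one learned
# edge detector works everywhere in the image -- that is the
# built-in spatial bias.
```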

What Convolution Actually Learns

Filters become detectors:

  • Edges
  • Textures
  • Shapes

Stacking layers creates hierarchy:

Edges → shapes → objects
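A hand-rolled sketch of what an early-layer filter does. The Sobel kernel here is fixed by hand, but it is exactly the kind of vertical-edge detector first conv layers tend to learn on their own:

```python
import numpy as np

# A classic vertical-edge kernel (Sobel).
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

def conv2d(img, kernel):
    # Naive valid-mode 2D convolution (correlation), for clarity.
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = (img[i:i+3, j:j+3] * kernel).sum()
    return out

img = np.zeros((6, 6))
img[:, 3:] = 1.0                 # vertical edge at column 3
resp = conv2d(img, sobel_x)
print(resp.max())                # strong response exactly on the edge
```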


Why Depth Matters (Practical View)

Shallow model:

  • Detects edges

Deep model:

  • Understands objects

Depth = abstraction
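One concrete mechanism behind "depth = abstraction" is the receptive field: each stacked 3×3 layer lets a single output pixel see a wider patch of the input.

```python
# Receptive field of stacked 3x3 convs (stride 1):
rf = 1
for _ in range(5):    # five 3x3 layers
    rf += 2           # each layer adds (kernel_size - 1)
print(rf)             # 11x11 patch of input per output pixel
```

Edges need a few pixels; objects need the whole patch. Depth buys that view.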


Core Components (What Actually Matters)

ReLU

  • Keeps gradients from vanishing
  • Makes deep stacks trainable

Pooling

  • Downsamples feature maps
  • Adds translation robustness

Fully Connected

  • Final decision layer
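Here is how the three components wire together, as a minimal PyTorch sketch (an illustration, not a tuned architecture): conv + ReLU extract features, pooling downsamples, the fully connected layer decides.

```python
import torch
import torch.nn as nn

# A minimal CNN for 32x32 RGB input and 10 classes.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),              # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),              # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),    # final decision layer
)

x = torch.randn(1, 3, 32, 32)
print(model(x).shape)             # torch.Size([1, 10])
```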

Why ResNet Changed Everything

Deep networks used to fail.

Problem:

Degradation with depth

Solution:

Skip connections


Real Effect:

  • Easier training
  • Deeper models
  • Better results
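The skip connection is a one-line change. A minimal residual block (a sketch of the ResNet idea, without the usual BatchNorm): the block learns a correction F(x) on top of the identity, so stacking more blocks no longer degrades the signal.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)   # skip connection: add the input back

x = torch.randn(1, 16, 8, 8)
block = ResidualBlock(16)
print(block(x).shape)               # same shape as the input
```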

Training Insights (This Is Where Most Bugs Are)

1. Data Augmentation > Architecture (Often)

Small dataset?

→ augmentation matters more than model choice
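Two augmentations by hand, for the idea (in practice `torchvision.transforms` offers these off the shelf): each epoch, the model sees a slightly different version of every image.

```python
import numpy as np

def augment(img, rng):
    # Random horizontal flip.
    if rng.random() < 0.5:
        img = img[:, ::-1]
    # Random 28x28 crop out of a 32x32 image.
    top = rng.integers(0, 5)
    left = rng.integers(0, 5)
    return img[top:top + 28, left:left + 28]

rng = np.random.default_rng(0)
img = np.arange(32 * 32).reshape(32, 32)
out = augment(img, rng)
print(out.shape)   # (28, 28)
```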


2. BatchNorm = Stability

Without it:

  • training unstable

With it:

  • faster convergence
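What BatchNorm does to one channel of activations, hand-rolled (a sketch; `nn.BatchNorm2d` adds a learnable scale and shift on top): it re-centers and re-scales, so each layer receives inputs at a predictable scale.

```python
import numpy as np

# Activations drifting to a large mean and spread...
acts = np.random.default_rng(0).normal(loc=50.0, scale=10.0, size=1000)

# ...pulled back to zero mean, unit variance.
norm = (acts - acts.mean()) / (acts.std() + 1e-5)
print(norm.mean(), norm.std())   # ~0, ~1
```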

3. Preprocessing Is Not Optional

Unnormalized input = unstable gradients
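The standard fix is per-channel standardization of the input. The mean/std values below are the widely used ImageNet channel statistics; substitute your own dataset's.

```python
import numpy as np

mean = np.array([0.485, 0.456, 0.406])   # ImageNet channel means
std = np.array([0.229, 0.224, 0.225])    # ImageNet channel stds

img = np.random.default_rng(1).random((32, 32, 3))  # raw [0, 1] pixels
normed = (img - mean) / std              # per-channel standardization
print(normed.shape)                      # (32, 32, 3)
```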


Debugging CNNs (Highly Practical)

Feature Maps

See what the model detects


CAM (Class Activation Map)

See what the model uses
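CAM is a weighted sum of the last conv layer's feature maps, using the classifier weights for the class in question. A hand-rolled sketch on random tensors (it assumes a network ending in global average pooling plus one linear layer, as in the original CAM setup):

```python
import torch

fmaps = torch.randn(64, 7, 7)   # last conv features: (C, H, W)
fc_w = torch.randn(10, 64)      # classifier weights: (classes, C)

cls = 3                         # class we want to explain
# Weight each feature map by how much the class cares about it.
cam = (fc_w[cls][:, None, None] * fmaps).sum(dim=0)   # -> (7, 7)

cam -= cam.min()
cam /= cam.max()                # normalize to [0, 1] for overlay
print(cam.shape)                # torch.Size([7, 7])
```

Upsample the 7×7 map onto the input image to see which regions drove the prediction.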


Real-World Example

Model classifies “cow” correctly.

CAM shows:

  • Focus on grass, not cow

Conclusion:

Dataset bias, not model intelligence


Practical Takeaways

  • CNNs learn features automatically
  • Structure matters more than size
  • Depth builds meaning
  • Training tricks are critical
  • Visualization reveals hidden problems

Final Thought

CNNs are not just models.

They encode this idea:

Learn representations, not rules


If you’ve worked with CNNs:

  • Did augmentation help more than architecture?
  • Have you checked CAM for bias?
  • Where did your model actually fail?
