How machines learn to see — locally vs globally.
If you’ve ever wondered why Vision Transformers (ViTs) have overtaken Convolutional Neural Networks (CNNs) across so much of computer vision so quickly, you’re not alone.
Both models “see” — but they see differently.
Let’s visualize how these architectures process the same image step-by-step, and why attention has changed the way machines perceive the world.
🧩 1. How CNNs See: The Local Lens
A CNN processes an image piece by piece — a mosaic of local patterns.
- Each convolution filter slides across the image, seeing only a small window of pixels at a time (its receptive field)
- Early layers learn edges, textures, shapes
- Deeper layers combine them into higher-level features (eyes, wheels, leaves)
```python
import torch
import torch.nn as nn

# A tiny two-stage CNN: each Conv2d only ever sees a 3x3 window of its input
cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1),   # low-level: edges, textures
    nn.ReLU(),
    nn.MaxPool2d(2),                                        # downsample 2x
    nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),  # compositions of low-level features
    nn.ReLU()
)
print(sum(p.numel() for p in cnn.parameters()), "trainable parameters")
```
Visual metaphor:
→ CNNs are like looking through a microscope — powerful, but only one patch at a time.
Local precision, global blindness.
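How local is “local”? For the little stack above we can do the receptive-field arithmetic ourselves: each 3×3 conv widens the window by (kernel − 1) times the current stride, and the 2×2 pool doubles the stride. A quick sanity check:

```python
# Receptive-field arithmetic for the stack above.
# rf grows by (kernel - 1) * jump; strided layers multiply the jump.
rf, jump = 1, 1
for kernel, stride in [(3, 1), (2, 2), (3, 1)]:  # conv -> pool -> conv
    rf += (kernel - 1) * jump
    jump *= stride
print(rf)  # 8 -- each output unit sees only an 8x8 pixel window
```

Even after three layers, each unit sees just 8×8 pixels of a 224×224 image. Everything else is invisible to it.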
🌍 2. How Transformers See: The Global Canvas
Transformers treat an image as a sequence of patches, not pixels.
Each patch becomes a token, similar to a word in NLP.
Instead of convolutions, a self-attention layer learns which patches matter to each other —
so the model can connect “eye” to “face,” or “wheel” to “car,” even if they’re far apart.
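To make “patches as tokens” concrete, here is a minimal sketch of the patch-embedding step (illustrative only, not the Hugging Face internals): a 224×224 image cut into 16×16 patches yields 14 × 14 = 196 tokens, each projected to the model dimension just like a word embedding.

```python
import torch

# Split a 224x224 RGB image into non-overlapping 16x16 patches
image = torch.randn(1, 3, 224, 224)   # (batch, channels, H, W)
patch = 16

patches = image.unfold(2, patch, patch).unfold(3, patch, patch)
# -> (1, 3, 14, 14, 16, 16): a 14x14 grid of 16x16 patches per channel
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, -1)
print(patches.shape)  # torch.Size([1, 196, 768]) -- 196 tokens of 3*16*16 values

# A learned linear projection maps each flattened patch to the model dimension,
# the same way a word embedding maps a token id to a vector
embed = torch.nn.Linear(3 * patch * patch, 768)
tokens = embed(patches)
print(tokens.shape)   # torch.Size([1, 196, 768])
```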
```python
# ViTImageProcessor supersedes the deprecated ViTFeatureExtractor
from transformers import ViTModel, ViTImageProcessor
import torch
from PIL import Image
import requests

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/image_classification.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTModel.from_pretrained("google/vit-base-patch16-224")

inputs = processor(images=image, return_tensors="pt")  # resize + normalize
with torch.no_grad():
    outputs = model(**inputs)
print("Hidden state shape:", outputs.last_hidden_state.shape)
```
Visual metaphor:
→ ViTs are like seeing from above — every part of the image talks to every other part.
Global awareness, context-rich understanding.
🔬 3. Visualizing the Difference
Let’s see this difference side-by-side:
| Concept | CNN | Transformer |
|---|---|---|
| Vision style | Local → Hierarchical | Global → Relational |
| Input type | Pixels | Patches |
| Core operation | Convolution | Self-Attention |
| Memory | Spatial (fixed window) | Contextual (dynamic) |
| Inductive bias | Strong (translation equivariance) | Minimal (learns from data) |
| Pretraining need | Works from scratch | Needs large datasets |
| Best for | Small data, simple patterns | Complex, global reasoning |
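The table is the summary; to actually look at where a ViT is looking, you can ask the model from the snippet in section 2 for its attention maps (`output_attentions=True` is standard `ViTModel` behavior):

```python
# Re-using `model` and `inputs` from the ViT snippet above
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# One attention tensor per layer: (batch, heads, tokens, tokens)
attn = outputs.attentions[-1]          # last layer
print(attn.shape)                      # torch.Size([1, 12, 197, 197])

# How much the [CLS] token (index 0) attends to each image patch,
# averaged over heads -- a crude but popular saliency map
cls_attn = attn[0].mean(dim=0)[0, 1:]  # drop the CLS->CLS entry
print(cls_attn.reshape(14, 14).shape)  # a 14x14 attention grid over the image
```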
🧠 4. Why Transformers Surpass CNNs (Eventually)
Transformers outperform CNNs when:
- You have lots of data
- You need long-range dependencies
- You want to unify vision and language
But CNNs are still valuable: fast, efficient, and great on edge devices.
The real magic is in hybrid architectures that mix convolution and attention (CoAtNet, MobileViT, etc.); even ConvNeXt, a pure CNN, borrows its design recipe from Transformers.
They combine the sharpness of convolution with the context of attention.
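As a toy illustration only (`ToyHybridBlock` is made up for this post, not CoAtNet or any published design), a hybrid block can be as simple as a convolution stage feeding a self-attention stage:

```python
import torch
import torch.nn as nn

class ToyHybridBlock(nn.Module):
    """Toy sketch: local features via convolution, then global mixing via attention."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=4, batch_first=True)

    def forward(self, x):                      # x: (batch, channels, H, W)
        x = torch.relu(self.conv(x))           # local pattern extraction
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (batch, H*W, channels)
        mixed, _ = self.attn(tokens, tokens, tokens)  # every location attends to every other
        return mixed.transpose(1, 2).reshape(b, c, h, w)

out = ToyHybridBlock()(torch.randn(1, 64, 14, 14))
print(out.shape)  # torch.Size([1, 64, 14, 14])
```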
🧪 5. Minimal Code Comparison
Here’s a quick side-by-side snippet using torchvision:
```python
import torch
import torchvision.models as models

# Load both with their ImageNet weights (the `weights=` API replaces the
# deprecated `pretrained=True` flag in recent torchvision)
cnn_model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
vit_model = models.vit_b_16(weights=models.ViT_B_16_Weights.DEFAULT).eval()

x = torch.randn(1, 3, 224, 224)  # one fake 224x224 RGB image
with torch.no_grad():
    cnn_out = cnn_model(x)
    vit_out = vit_model(x)
print("CNN output:", cnn_out.shape)
print("ViT output:", vit_out.shape)
```
Output:
```
CNN output: torch.Size([1, 1000])
ViT output: torch.Size([1, 1000])
```
Same input, same output shape — completely different thought process.
🪞 6. The Philosophy Behind It
CNNs extract meaning.
Transformers connect meaning.
One builds understanding layer by layer.
The other builds it all at once — like a conversation, not a hierarchy.
Deep learning started with perception.
Transformers added awareness.
That’s the real leap.
⚡ 7. The Takeaway
- CNNs = strong inductive bias, fast training, efficient on small data
- Transformers = flexible reasoning, global context, scalability
- Hybrids = the best of both worlds
Both architectures are tools — what matters is when to use which.
Use CNNs when your world is small.
Use Transformers when your world is connected.
Next Up → “Fine-Tuning Failures and Fixes” — my notes from debugging unstable Transformer training runs.