An AI Experiment That Went Deeper Than Expected
When people talk about computer vision today, the conversation almost always turns into a CNN vs. Vision Transformer (ViT) debate.
CNNs dominated vision tasks for years. Then Transformers arrived from NLP and started rewriting the rules.
So I decided to run an experiment.
Not on ImageNet.
Not on some perfectly curated benchmark.
But on something messy, real-world, and meaningful:
🌱 Plant disease detection using the PlantVillage dataset
Because if AI can help farmers detect crop diseases early, the impact is far bigger than just leaderboard scores.
But what started as a simple model comparison turned into one of the most chaotic and insightful experiments I’ve run.
Let’s dive in.
🧠 The Question
Can Vision Transformers outperform CNNs on plant disease detection? And more importantly:
- How do they behave on real agricultural datasets?
- What happens when the data distribution shifts?
- Do Transformers really generalize better?
📊 Dataset: PlantVillage
The PlantVillage dataset is one of the most widely used agricultural datasets.
📦 Stats
- 38 classes
- 162,916 images
- Multiple crops: tomato, potato, corn, apple, grape, etc.
- Both healthy and diseased leaves
Typical diseases include:
- Early Blight
- Late Blight
- Leaf Mold
- Septoria Leaf Spot
- Bacterial Spot
Each disease has distinct visual patterns, which makes the dataset a good candidate for vision models.
But the dataset has a hidden issue...
⚠️ Most images have clean backgrounds.
Meaning the models might learn background cues instead of disease patterns.
This becomes important later.
⚙️ Models Used in the Experiment
I trained two architectures.
1️⃣ CNN Baseline
Classic convolutional architecture.
Model used:
- ResNet50 (transfer learning)
Why ResNet?
Because it is:
- stable
- widely used
- a strong baseline for vision tasks
2️⃣ Vision Transformer (ViT)
Transformer-based architecture designed for images.
Instead of convolutions, it works by:
🔹 Splitting the image into patches
🔹 Treating the patches like tokens
🔹 Running self-attention over them
This allows the model to learn global relationships across the image.
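The patch-splitting step above can be sketched in plain PyTorch (a minimal illustration of the idea, not the actual ViT implementation):

```python
import torch

# A single 224x224 RGB image, the input size used in this experiment.
img = torch.randn(1, 3, 224, 224)

patch_size = 16  # ViT-B/16 uses 16x16 patches

# Carve the image into non-overlapping 16x16 patches along height and width:
# (1, 3, 224, 224) -> (1, 3, 14, 14, 16, 16)
patches = img.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)

# Flatten each patch into a vector, giving a sequence of "tokens":
# (1, 196, 768) -- 14*14 = 196 tokens, each 3*16*16 = 768 values
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)

print(patches.shape)
```

Those 196 token vectors are what self-attention then operates on, which is how ViT sees relationships between distant parts of the leaf.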
Model used:
- ViT-B/16
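Both architectures can be set up in a few lines with torchvision. This is a sketch: replacing the classification head for 38 classes is the standard transfer-learning recipe, and `weights=None` is used here only to keep the example download-free (the experiment used ImageNet-pretrained weights):

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 38  # PlantVillage classes

# ResNet50 baseline; in practice pass an ImageNet weights enum
# (e.g. models.ResNet50_Weights.IMAGENET1K_V2) instead of None.
resnet = models.resnet50(weights=None)
resnet.fc = nn.Linear(resnet.fc.in_features, NUM_CLASSES)

# ViT-B/16, with the same note about pretrained weights.
vit = models.vit_b_16(weights=None)
vit.heads.head = nn.Linear(vit.heads.head.in_features, NUM_CLASSES)

x = torch.randn(1, 3, 224, 224)
print(resnet(x).shape, vit(x).shape)  # both produce 38 class logits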
🏗 Training Setup
Hardware & framework:
- GPU: NVIDIA T4
- Framework: PyTorch
- Pretrained weights for both models
Training configuration:

| Parameter | Value |
| --- | --- |
| Image size | 224x224 |
| Batch size | 32 |
| Optimizer | Adam |
| Epochs | 20 |
| Loss | CrossEntropy |
Data augmentation applied:
- Random flip
- Random rotation
- Color jitter
📈 Results
Here’s where things get interesting.
| Model | Accuracy |
| --- | --- |
| ResNet50 | 99.95% |
| Vision Transformer | 99.37% |
At first glance…
CNN wins.
But accuracy alone doesn’t tell the full story.
🧪 What the Models Actually Learned
After training, I ran activation and attention visualizations.
And the results were surprising.
CNN Behavior
The CNN focused heavily on:
- leaf texture
- disease spots
- color variations
But sometimes it also locked onto:
⚠️ **background patterns**
Which is dangerous, because if the background changes, performance can drop.
Vision Transformer Behavior
ViT behaved differently.
Instead of local textures, it analyzed:
- global leaf structure
- shape irregularities
- spread patterns of disease
Attention maps showed it focusing on multiple disease regions simultaneously.
This suggests better spatial reasoning.
💥 The Real Test: Background Shift
I introduced a challenge.
I tested the models on new leaf images with natural farm backgrounds instead of lab backgrounds.
This is where things exploded.
CNN Performance
Accuracy dropped from:
99.95% → 4%
The model had partially learned the background bias.
Vision Transformer Performance
Accuracy dropped from:
99.37% → 8%
Still a drop.
But much more robust.
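For reference, the accuracy numbers above come from the kind of plain top-1 evaluation loop below, run once on the clean test set and once on the shifted one. This is a generic sketch, not the experiment's exact code:

```python
import torch

@torch.no_grad()
def evaluate(model, loader, device="cpu"):
    """Top-1 accuracy over a loader of (images, labels) batches."""
    model.eval()
    correct = total = 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1)
        correct += (preds == labels.to(device)).sum().item()
        total += labels.size(0)
    return correct / total
```

Running `evaluate(model, shifted_loader)` with a loader built from the natural-background images is what surfaces the background bias that high clean-set accuracy hides.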
🧨 Biggest Challenges We Faced
1️⃣ Transformers Need More Data
- CNNs work well even with smaller datasets.
- Transformers love massive datasets.
- Without enough data, training becomes unstable and convergence slows.
2️⃣ Training Instability
ViT required:
- careful learning rate tuning
- warmup schedules
Otherwise, loss spikes appeared.
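A linear warmup of the kind ViT needed can be expressed with PyTorch's `LambdaLR`: ramp the learning rate from near zero up to its base value over the first steps, then hold it. The warmup length and base LR below are illustrative, not the experiment's exact settings:

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in for the ViT
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

warmup_steps = 500
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    # Scale factor grows linearly from 1/warmup_steps to 1.0, then stays at 1.0.
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),
)

# In the training loop: optimizer.step(); scheduler.step() once per batch.
```

Without the ramp, the large early gradients hitting randomly initialized attention layers are a common source of the loss spikes mentioned above.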
3️⃣ GPU Memory
Transformers are memory hungry.
Even small changes in batch size caused:
💥 CUDA out-of-memory errors.
🧠 Key Insight
The biggest takeaway from this experiment:
CNNs are better pattern detectors.
Transformers are better reasoning engines.
CNN:
✔ excellent at local features
ViT:
✔ excellent at global context
🌾 Why This Matters for Agriculture
Real farms are messy environments.
Leaves are:
- partially occluded
- rotated
- surrounded by soil and plants
Models must generalize beyond lab datasets.
Transformers show promising potential here.
🔬 What I Want to Try Next
This experiment opened many new ideas.
Next experiments:
🔥 Hybrid CNN + Transformer architectures
🔥 Self-supervised pretraining on plant data
🔥 Real-time disease detection using YOLO + ViT embeddings
Goal:
Build a real-world plant disease detection system farmers can actually use.
🚀 Final Thoughts
Benchmarks are easy.
Real-world AI is chaos.
And that’s where the fun begins.
This experiment taught me that accuracy numbers alone don’t define intelligence.
Understanding how models think matters far more.
More experiments coming soon.
Stay curious. 🌿
✍️ If you're working on AI for agriculture or computer vision, I’d love to exchange ideas.