<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vanshika Garg</title>
    <description>The latest articles on DEV Community by Vanshika Garg (@vanshika_garg).</description>
    <link>https://dev.to/vanshika_garg</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3825160%2F068bd885-5e99-4854-ae32-e6847f2103f9.png</url>
      <title>DEV Community: Vanshika Garg</title>
      <link>https://dev.to/vanshika_garg</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vanshika_garg"/>
    <language>en</language>
    <item>
      <title>🌿 Vision Transformers vs CNNs on PlantVillage</title>
      <dc:creator>Vanshika Garg</dc:creator>
      <pubDate>Sun, 15 Mar 2026 10:40:09 +0000</pubDate>
      <link>https://dev.to/vanshika_garg/vision-transformers-vs-cnns-on-plantvillage-23n</link>
      <guid>https://dev.to/vanshika_garg/vision-transformers-vs-cnns-on-plantvillage-23n</guid>
      <description>&lt;p&gt;An AI Experiment That Went Deeper Than Expected&lt;/p&gt;

&lt;p&gt;When people talk about computer vision today, the conversation almost always turns into CNN vs Vision Transformers (ViT).&lt;/p&gt;

&lt;p&gt;CNNs dominated vision tasks for years. Then Transformers arrived from NLP and started rewriting the rules.&lt;/p&gt;

&lt;p&gt;So I decided to run an experiment.&lt;/p&gt;

&lt;p&gt;Not on ImageNet.&lt;br&gt;
Not on some perfectly curated benchmark.&lt;/p&gt;

&lt;p&gt;But on something messy, real-world, and meaningful:&lt;/p&gt;

&lt;p&gt;🌱 Plant disease detection using the PlantVillage dataset&lt;/p&gt;

&lt;p&gt;Because if AI can help farmers detect crop diseases early, the impact is far bigger than just leaderboard scores.&lt;/p&gt;

&lt;p&gt;But what started as a simple model comparison turned into one of the most chaotic and insightful experiments I’ve run.&lt;/p&gt;

&lt;p&gt;Let’s dive in.&lt;/p&gt;

&lt;h2&gt;
  
  
  🧠 The Question
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Can Vision Transformers outperform CNNs on plant disease detection?&lt;/li&gt;
&lt;li&gt;How do they behave on real agricultural datasets?&lt;/li&gt;
&lt;li&gt;What happens when the data distribution shifts?&lt;/li&gt;
&lt;li&gt;Do Transformers really generalize better?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  📊 Dataset: PlantVillage
&lt;/h2&gt;

&lt;p&gt;The PlantVillage dataset is one of the most widely used agricultural datasets.&lt;/p&gt;

&lt;p&gt;📦 Stats&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;38 classes&lt;/li&gt;
&lt;li&gt;162,916 images&lt;/li&gt;
&lt;li&gt;Multiple crops: tomato, potato, corn, apple, grape, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Images include both healthy and diseased leaves.&lt;/p&gt;

&lt;p&gt;Typical diseases include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Early Blight&lt;/li&gt;
&lt;li&gt;Late Blight&lt;/li&gt;
&lt;li&gt;Leaf Mold&lt;/li&gt;
&lt;li&gt;Septoria Leaf Spot&lt;/li&gt;
&lt;li&gt;Bacterial Spot&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each disease has distinct visual patterns, which makes the dataset a good candidate for vision models.&lt;/p&gt;

&lt;p&gt;But the dataset has a hidden issue...&lt;/p&gt;

&lt;p&gt;⚠️ Most images have clean backgrounds.&lt;/p&gt;

&lt;p&gt;This means the models might learn background cues instead of disease patterns.&lt;/p&gt;

&lt;p&gt;This becomes important later.&lt;/p&gt;

&lt;h2&gt;
  ⚙️ Models Used in the Experiment
&lt;/h2&gt;

&lt;p&gt;I trained two architectures.&lt;/p&gt;

&lt;p&gt;1️⃣ &lt;strong&gt;CNN Baseline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Classic convolutional architecture.&lt;/p&gt;

&lt;p&gt;Model used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ResNet50 (transfer learning)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why ResNet?&lt;/p&gt;

&lt;p&gt;Because it is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stable&lt;/li&gt;
&lt;li&gt;widely used&lt;/li&gt;
&lt;li&gt;a strong baseline for vision tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2️⃣ &lt;strong&gt;Vision Transformer (ViT)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Transformer-based architecture designed for images.&lt;/p&gt;

&lt;p&gt;Instead of convolutions, it works by:&lt;/p&gt;

&lt;p&gt;🔹 Splitting image into patches&lt;br&gt;
🔹 Treating patches like tokens&lt;br&gt;
🔹 Running self-attention&lt;/p&gt;

&lt;p&gt;This allows the model to learn global relationships across the image.&lt;/p&gt;

&lt;p&gt;Model used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ViT-B/16&lt;/li&gt;
&lt;/ul&gt;
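&lt;p&gt;To make the patch mechanics concrete, here is a minimal sketch (plain Python, function name is my own) of the token arithmetic ViT-B/16 performs on a 224×224 RGB input: the image is cut into 16×16 patches, and each patch is flattened into a vector before the linear projection.&lt;/p&gt;

```python
def vit_patch_tokens(image_size=224, patch_size=16, channels=3):
    """Token count and raw (pre-projection) patch dimension for a square image."""
    per_side = image_size // patch_size             # patches along each axis
    n_tokens = per_side * per_side                  # one token per patch
    patch_dim = channels * patch_size * patch_size  # flattened patch length
    return n_tokens, patch_dim

# ViT-B/16 on a 224x224 RGB image: 14 x 14 = 196 tokens, each 768 values long
print(vit_patch_tokens())  # (196, 768)
```

&lt;p&gt;Self-attention then runs over those 196 tokens, which is exactly why every token can see the whole leaf at once.&lt;/p&gt;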

&lt;h2&gt;
  
  
  🏗 Training Setup
&lt;/h2&gt;

&lt;p&gt;Hardware:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU: T4&lt;/li&gt;
&lt;li&gt;Framework: PyTorch&lt;/li&gt;
&lt;li&gt;Pretrained weights used&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Training configuration
&lt;/h2&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Parameter&lt;/th&gt;&lt;th&gt;Value&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Image size&lt;/td&gt;&lt;td&gt;224×224&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Batch size&lt;/td&gt;&lt;td&gt;32&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Optimizer&lt;/td&gt;&lt;td&gt;Adam&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Epochs&lt;/td&gt;&lt;td&gt;20&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Loss&lt;/td&gt;&lt;td&gt;CrossEntropy&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;strong&gt;Data augmentation applied:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Random flip&lt;/li&gt;
&lt;li&gt;Random rotation&lt;/li&gt;
&lt;li&gt;Color jitter&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  📈 Results
&lt;/h2&gt;

&lt;p&gt;Here’s where things get interesting.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Model&lt;/th&gt;&lt;th&gt;Accuracy&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;ResNet50&lt;/td&gt;&lt;td&gt;&lt;strong&gt;99.95%&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Vision Transformer&lt;/td&gt;&lt;td&gt;&lt;strong&gt;99.37%&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;At first glance…&lt;/p&gt;

&lt;p&gt;&lt;em&gt;CNN wins.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But accuracy alone doesn’t tell the full story.&lt;/p&gt;

&lt;h2&gt;
  🧪 What the Models Actually Learned
&lt;/h2&gt;

&lt;p&gt;After training, I ran activation and attention visualizations.&lt;/p&gt;

&lt;p&gt;And the results were surprising.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CNN Behavior&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The CNN focused heavily on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;leaf texture&lt;/li&gt;
&lt;li&gt;disease spots&lt;/li&gt;
&lt;li&gt;color variations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But sometimes it also locked onto:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️ background patterns&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Which is dangerous.&lt;/p&gt;

&lt;p&gt;Because if background changes, performance can drop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vision Transformer Behavior&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ViT behaved differently.&lt;/p&gt;

&lt;p&gt;Instead of local textures, it analyzed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;global leaf structure&lt;/li&gt;
&lt;li&gt;shape irregularities&lt;/li&gt;
&lt;li&gt;spread patterns of disease&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Attention maps showed it focusing on multiple disease regions simultaneously.&lt;/p&gt;

&lt;p&gt;This suggests better spatial reasoning.&lt;/p&gt;

&lt;h2&gt;
  
  
  💥 The Real Test: Background Shift
&lt;/h2&gt;

&lt;p&gt;I introduced a challenge.&lt;/p&gt;

&lt;p&gt;I tested the models on new leaf images with natural farm backgrounds instead of lab backgrounds.&lt;/p&gt;

&lt;p&gt;This is where things exploded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CNN Performance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Accuracy dropped from:&lt;/p&gt;

&lt;p&gt;99.95% → 4%&lt;/p&gt;

&lt;p&gt;The model had partially learned the background bias.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vision Transformer Performance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Accuracy dropped from:&lt;/p&gt;

&lt;p&gt;99.37% → 8%&lt;/p&gt;

&lt;p&gt;Still a severe drop.&lt;/p&gt;

&lt;p&gt;But twice the CNN's accuracy under the same shift, so noticeably more robust.&lt;/p&gt;

&lt;h2&gt;
  
  
  🧨 Biggest Challenges We Faced
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1️⃣ Transformers Need More Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CNNs work well even with smaller datasets; Transformers love massive ones. Without enough data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;training becomes unstable&lt;/li&gt;
&lt;li&gt;convergence slows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2️⃣ Training Instability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ViT required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;careful learning rate tuning&lt;/li&gt;
&lt;li&gt;warmup schedules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Otherwise, loss spikes appear.&lt;/p&gt;
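&lt;p&gt;A common recipe for taming those loss spikes is linear warmup followed by cosine decay. Here is a hedged sketch in plain Python (the function name and default learning rates are my own, not the exact schedule from this experiment):&lt;/p&gt;

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, base_lr=3e-4, min_lr=1e-6):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    if warmup_steps > step:
        # ramp linearly from near zero up to base_lr
        return base_lr * (step + 1) / warmup_steps
    # cosine decay over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

&lt;p&gt;In PyTorch this kind of schedule is typically wired in via a per-step scheduler; the small warmup phase is what keeps the early attention layers from blowing up.&lt;/p&gt;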

&lt;p&gt;&lt;strong&gt;3️⃣ GPU Memory&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Transformers are memory hungry.&lt;/p&gt;

&lt;p&gt;Even small changes in batch size caused:&lt;/p&gt;

&lt;p&gt;💥 CUDA out-of-memory errors.&lt;/p&gt;
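&lt;p&gt;A standard workaround is gradient accumulation: run several small micro-batches and step the optimizer once, so the effective batch size stays the same while peak memory shrinks. The bookkeeping is simple arithmetic (sketched below in plain Python; in practice this wraps a PyTorch training step, and the function name is mine):&lt;/p&gt;

```python
def accumulation_plan(target_batch, max_micro_batch):
    """Smallest number of micro-batches whose product covers target_batch."""
    steps = -(-target_batch // max_micro_batch)  # ceiling division
    micro_batch = -(-target_batch // steps)      # rebalance size per step
    return steps, micro_batch

# e.g. a batch of 32 that triggers OOM can run as 4 micro-batches of 8
print(accumulation_plan(32, 8))  # (4, 8)
```

&lt;p&gt;The optimizer then calls &lt;code&gt;step()&lt;/code&gt; only after every &lt;code&gt;steps&lt;/code&gt; backward passes, keeping gradients equivalent to the full batch.&lt;/p&gt;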

&lt;h2&gt;
  🧠 Key Insight
&lt;/h2&gt;

&lt;p&gt;The biggest takeaway from this experiment:&lt;/p&gt;

&lt;p&gt;CNNs are better pattern detectors.&lt;br&gt;
Transformers are better reasoning engines.&lt;/p&gt;

&lt;p&gt;CNN: ✔ excellent at local features&lt;br&gt;
ViT: ✔ excellent at global context&lt;/p&gt;
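&lt;p&gt;The local-vs-global contrast can be made concrete with a back-of-the-envelope receptive-field calculation (simplified to stride-1 convolutions only): each stacked 3×3 conv widens a unit's view by just 2 pixels per axis, while every ViT token attends to all 196 patches from the very first attention layer.&lt;/p&gt;

```python
def conv_receptive_field(n_layers, kernel=3):
    """Receptive field (pixels, one axis) of n stacked stride-1 convolutions."""
    rf = 1
    for _ in range(n_layers):
        rf += kernel - 1  # each layer widens the view by (kernel - 1)
    return rf

# After 5 stride-1 3x3 layers a CNN unit sees an 11-pixel window;
# a ViT token already covers the full 224-pixel image at layer 1.
print(conv_receptive_field(5))  # 11
```

&lt;p&gt;Real CNNs grow the field faster with strides and pooling, but the intuition holds: CNNs build global context layer by layer, Transformers get it immediately.&lt;/p&gt;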

&lt;h2&gt;
  🌾 Why This Matters for Agriculture
&lt;/h2&gt;

&lt;p&gt;Real farms are messy environments.&lt;/p&gt;

&lt;p&gt;Leaves are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;partially occluded&lt;/li&gt;
&lt;li&gt;rotated&lt;/li&gt;
&lt;li&gt;surrounded by soil and plants&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Models must generalize beyond lab datasets.&lt;/p&gt;

&lt;p&gt;Transformers show promising potential here.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  🔬 What I Want to Try Next
&lt;/h2&gt;

&lt;p&gt;This experiment opened many new ideas.&lt;/p&gt;

&lt;p&gt;Next experiments:&lt;/p&gt;

&lt;p&gt;🔥 Hybrid CNN + Transformer architectures&lt;br&gt;
🔥 Self-supervised pretraining on plant data&lt;br&gt;
🔥 Real-time disease detection using YOLO + ViT embeddings&lt;/p&gt;

&lt;p&gt;Goal:&lt;/p&gt;

&lt;p&gt;Build a real-world plant disease detection system farmers can actually use.&lt;/p&gt;

&lt;h2&gt;
  
  
  🚀 Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Benchmarks are easy.&lt;br&gt;
Real-world AI is chaos.&lt;br&gt;
And that’s where the fun begins.&lt;/p&gt;

&lt;p&gt;This experiment taught me that accuracy numbers alone don’t define intelligence.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Understanding how models think matters far more.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;More experiments coming soon.&lt;/p&gt;

&lt;p&gt;Stay curious. 🌿&lt;/p&gt;




&lt;p&gt;✍️ If you're working on AI for agriculture or computer vision, I’d love to exchange ideas.&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>ai</category>
      <category>agriculture</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
