<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anurag Kar</title>
    <description>The latest articles on DEV Community by Anurag Kar (@anuragkar234).</description>
    <link>https://dev.to/anuragkar234</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F457418%2Fa5590573-8c0f-4351-a6c7-0868190fa698.jpg</url>
      <title>DEV Community: Anurag Kar</title>
      <link>https://dev.to/anuragkar234</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anuragkar234"/>
    <language>en</language>
    <item>
      <title>🚨 “Everything Was Green… But Production Was Broken” — A Debugging Story Every Backend Engineer Needs</title>
      <dc:creator>Anurag Kar</dc:creator>
      <pubDate>Sat, 28 Mar 2026 05:22:10 +0000</pubDate>
      <link>https://dev.to/anuragkar234/everything-was-green-but-production-was-broken-a-debugging-story-every-backend-engineer-needs-2k64</link>
      <guid>https://dev.to/anuragkar234/everything-was-green-but-production-was-broken-a-debugging-story-every-backend-engineer-needs-2k64</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;0 errors. 0 alerts. 100% failure.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At 2 AM, everything in our dashboards was green.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No spikes 📊&lt;/li&gt;
&lt;li&gt;No errors ❌&lt;/li&gt;
&lt;li&gt;No alerts 🚨&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And yet…&lt;/p&gt;

&lt;p&gt;👉 Orders were failing&lt;br&gt;
👉 Inventory was stuck&lt;br&gt;
👉 Business impact was real!&lt;/p&gt;

&lt;p&gt;This is the story of how a perfectly healthy system silently failed — and what it taught me about building production-grade distributed systems.&lt;/p&gt;


&lt;h2&gt;
  
  
  🧠 Why This Matters
&lt;/h2&gt;

&lt;p&gt;As a software engineer at a P0 business, your job isn’t just to write working code.&lt;/p&gt;

&lt;p&gt;It’s to answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What happens when things go wrong?&lt;/li&gt;
&lt;li&gt;How will you &lt;em&gt;know&lt;/em&gt; it went wrong?&lt;/li&gt;
&lt;li&gt;Can you debug it at 2 AM under pressure?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This bug exposed a gap between:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“System is running” vs “System is working”&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  🧩 Real System Architecture (Simplified from Production)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzo25lsds2skn30lbd0oc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzo25lsds2skn30lbd0oc.png" alt="Arch View" width="800" height="1149"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  🎯 Expected vs Reality
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Expected Flow:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Event published → Consumer processes → DB updated&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  What Actually Happened:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Event published ✅&lt;/li&gt;
&lt;li&gt;Consumer running ✅&lt;/li&gt;
&lt;li&gt;Logs clean ✅&lt;/li&gt;
&lt;li&gt;Metrics normal ✅&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;❌ &lt;strong&gt;Inventory never updated&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  🚨 The Moment It Got Real
&lt;/h2&gt;

&lt;p&gt;We started getting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-call alerts from business teams&lt;/li&gt;
&lt;li&gt;Manual escalations&lt;/li&gt;
&lt;li&gt;“Orders are stuck” messages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But internally?&lt;/p&gt;

&lt;p&gt;👉 Everything said &lt;strong&gt;“System Healthy”&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  🕵️ Debugging Journey (The Real One)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Step 1: Logs
&lt;/h3&gt;

&lt;p&gt;Nothing.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Metrics
&lt;/h3&gt;

&lt;p&gt;Normal.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 3: Infra
&lt;/h3&gt;

&lt;p&gt;Healthy.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 4: Reproduce locally
&lt;/h3&gt;

&lt;p&gt;Couldn’t.&lt;/p&gt;



&lt;p&gt;At this point, you hit a wall every backend engineer knows:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“If nothing is wrong… why is everything broken?”&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  💀 The Hidden Bug
&lt;/h2&gt;

&lt;p&gt;Buried deep inside the consumer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;isValid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it.&lt;/p&gt;

&lt;p&gt;That one line.&lt;/p&gt;




&lt;h2&gt;
  
  
  😶 Why This Was Dangerous
&lt;/h2&gt;

&lt;p&gt;This caused:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ No logs&lt;/li&gt;
&lt;li&gt;❌ No metrics&lt;/li&gt;
&lt;li&gt;❌ No retries&lt;/li&gt;
&lt;li&gt;❌ No DLQ&lt;/li&gt;
&lt;li&gt;❌ No alerts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Just silent skipping.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧬 What Was Actually Happening
&lt;/h2&gt;

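&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    A[Event Received] --&amp;gt; B{Valid?}
    B -- Yes --&amp;gt; C[Process Event]
    B -- No --&amp;gt; D[return nil: no log, no metric, no DLQ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A silent drop is one of the hardest failure modes to catch in a distributed system: nothing crashes, nothing is logged, and nothing is retried.&lt;/p&gt;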




&lt;h2&gt;
  
  
  ⚠️ The Real Problem
&lt;/h2&gt;

&lt;p&gt;This wasn’t a “bug”.&lt;/p&gt;

&lt;p&gt;This was a &lt;strong&gt;design failure in observability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business logic ✔️&lt;/li&gt;
&lt;li&gt;Infra stability ✔️&lt;/li&gt;
&lt;li&gt;Scalability ✔️&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But missing:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;❌ &lt;strong&gt;Visibility into decision points&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🛠️ The Fix (Simple but Powerful)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Make Failures Visible
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;isValid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"event_validation_failed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  2. Add Metrics for Every Drop
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"inventory.event.validation.failure"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  3. Optional: DLQ for Debugging
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;sendToDLQ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  📊 New System (After Fix)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    A[Event Received] --&amp;gt; B{Valid?}
    B -- Yes --&amp;gt; C[Process Event]
    B -- No --&amp;gt; D[Log + Metrics + DLQ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
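
&lt;p&gt;The three changes above can be combined into a single handler. The following is a minimal sketch, not the production code: the &lt;code&gt;Event&lt;/code&gt; fields, the validation rule, and the in-memory &lt;code&gt;memRecorder&lt;/code&gt; (standing in for the metrics client and the DLQ producer) are all assumptions made for illustration.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"log"
)

// Event is a hypothetical inventory event; the real schema is not shown in the post.
type Event struct {
	ID  string
	SKU string
	Qty int
}

// metricValidationFailure is the counter name used in the post.
const metricValidationFailure = "inventory.event.validation.failure"

// memRecorder is an in-memory stand-in for a metrics client and a DLQ producer.
type memRecorder struct {
	counts map[string]int
	dlq    []Event
}

func newMemRecorder() *memRecorder {
	r := new(memRecorder)
	r.counts = map[string]int{}
	return r
}

// Increment bumps the named counter by one.
func (r *memRecorder) Increment(name string) { r.counts[name]++ }

// SendToDLQ keeps a copy of the rejected event for later inspection.
func (r *memRecorder) SendToDLQ(e Event) error {
	r.dlq = append(r.dlq, e)
	return nil
}

// isValid is an assumed rule: an event needs an ID, a SKU, and a non-zero quantity.
func isValid(e Event) bool {
	if e.ID == "" || e.SKU == "" {
		return false
	}
	return e.Qty != 0
}

// handle processes valid events and makes every drop observable.
func handle(e Event, r *memRecorder, process func(Event) error) error {
	if !isValid(e) {
		log.Printf("event_validation_failed id=%q", e.ID)
		r.Increment(metricValidationFailure)
		if err := r.SendToDLQ(e); err != nil {
			return fmt.Errorf("dlq publish failed for %q: %w", e.ID, err)
		}
		return nil // still skipped, but no longer silently
	}
	return process(e)
}

func main() {
	rec := newMemRecorder()
	process := func(e Event) error {
		fmt.Println("processed", e.ID)
		return nil
	}
	events := []Event{
		{ID: "evt-1", SKU: "sku-42", Qty: 3},
		{ID: "evt-2", SKU: "", Qty: 1}, // invalid: missing SKU
	}
	for _, e := range events {
		if err := handle(e, rec, process); err != nil {
			log.Printf("handler error: %v", err)
		}
	}
	fmt.Println("dropped:", rec.counts[metricValidationFailure], "dlq size:", len(rec.dlq))
}
```

&lt;p&gt;Returning &lt;code&gt;nil&lt;/code&gt; after recording the drop keeps the original control flow (the event is still skipped), but the skip now leaves a log line, a counter increment, and a copy in the DLQ.&lt;/p&gt;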






&lt;h2&gt;
  
  
  🔥 The Shift in Thinking
&lt;/h2&gt;

&lt;p&gt;Before:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“If it fails, it will show up”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“If I don’t explicitly track it, it doesn’t exist”&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  💡 My Production Checklist
&lt;/h2&gt;

&lt;p&gt;Whenever I design a consumer now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Log &lt;strong&gt;every decision branch&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;✅ Add metrics for &lt;strong&gt;drops, skips, retries&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;✅ Never &lt;code&gt;return nil&lt;/code&gt; silently&lt;/li&gt;
&lt;li&gt;✅ Add a DLQ so dropped events can be inspected&lt;/li&gt;
&lt;li&gt;✅ Think in &lt;strong&gt;failure scenarios first&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
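
&lt;p&gt;One way to act on the first three checklist items is to count every exit path of the handler and compare the total against the number of events received. The sketch below is illustrative only: &lt;code&gt;Ledger&lt;/code&gt;, &lt;code&gt;Outcome&lt;/code&gt;, and the outcome names are hypothetical and stand in for a real metrics client.&lt;/p&gt;

```go
package main

import "fmt"

// Outcome names every way a consumed event can leave the handler.
// The set is illustrative; a real consumer may have more branches.
type Outcome string

const (
	Processed      Outcome = "processed"
	DroppedInvalid Outcome = "dropped_invalid"
	Failed         Outcome = "failed"
)

// Ledger is a hypothetical in-memory tally standing in for a metrics client.
type Ledger struct {
	received int
	byResult map[Outcome]int
}

func NewLedger() *Ledger {
	l := new(Ledger)
	l.byResult = map[Outcome]int{}
	return l
}

// Receive is called once when an event is pulled from the queue.
func (l *Ledger) Receive() { l.received++ }

// Record is called once on whichever branch the handler exits through.
func (l *Ledger) Record(o Outcome) { l.byResult[o]++ }

// Unaccounted returns how many events were received but left the handler
// without any recorded outcome, which is the silent-skip case.
func (l *Ledger) Unaccounted() int {
	total := 0
	for _, n := range l.byResult {
		total += n
	}
	return l.received - total
}

func main() {
	l := NewLedger()

	l.Receive()
	l.Record(Processed)

	l.Receive()
	l.Record(DroppedInvalid)

	l.Receive() // a branch that returned without recording anything

	fmt.Println("unaccounted events:", l.Unaccounted())
}
```

&lt;p&gt;A non-zero gap between received events and recorded outcomes means some branch returned without logging or counting, which is the situation described in this post.&lt;/p&gt;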




&lt;h2&gt;
  
  
  🧠 Key Takeaways
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Logs Tell a Story You Choose
&lt;/h3&gt;

&lt;p&gt;If you don’t log it, it didn’t happen.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Metrics Only Measure What You Track
&lt;/h3&gt;

&lt;p&gt;No metric = no visible failure, even while it’s happening.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Silent Failures Are Worse Than Crashes
&lt;/h3&gt;

&lt;p&gt;Crashes alert you. Silence kills you slowly.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Your system is not reliable because it doesn’t crash.&lt;br&gt;
It’s reliable because it tells you when it’s wrong.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  📈 Series: Production Debugging Playbook (for Backend Engineers)
&lt;/h2&gt;

&lt;p&gt;This is &lt;strong&gt;Part 1&lt;/strong&gt; of a series based on real production learnings:&lt;/p&gt;

&lt;h3&gt;
  
  
  🔹 Part 1: When Logs Lie (This Post)
&lt;/h3&gt;

&lt;p&gt;Silent failures &amp;amp; observability gaps&lt;/p&gt;

&lt;p&gt;Next parts coming soon!&lt;/p&gt;




&lt;h2&gt;
  
  
  💬 Let’s Discuss
&lt;/h2&gt;

&lt;p&gt;Have you ever faced a bug where:&lt;/p&gt;

&lt;p&gt;👉 Everything looked fine&lt;br&gt;
👉 But production was broken&lt;/p&gt;

&lt;p&gt;Drop your story 👇&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
