0 errors. 0 alerts. 100% failure.
At 2 AM, everything in our dashboards was green.
- No spikes 📊
- No errors ❌
- No alerts 🚨
And yet…
👉 Orders were failing
👉 Inventory was stuck
👉 Business impact was real!
This is the story of how a perfectly healthy system silently failed — and what it taught me about building production-grade distributed systems.
🧠 Why This Matters
As a software engineer on a business-critical (P0) system, your job isn’t just to write working code.
It’s to answer:
- What happens when things go wrong?
- How will you know it went wrong?
- Can you debug it at 2 AM under pressure?
This bug exposed a gap between:
“System is running” vs “System is working”
🧩 Real System Architecture (Simplified from Production)
🎯 Expected vs Reality
Expected Flow:
- Event published → Consumer processes → DB updated
What Actually Happened:
- Event published ✅
- Consumer running ✅
- Logs clean ✅
- Metrics normal ✅
❌ Inventory never updated
🚨 The Moment It Got Real
We started getting:
- On-call alerts from business teams
- Manual escalations
- “Orders are stuck” messages
But internally?
👉 Everything said “System Healthy”
🕵️ Debugging Journey (The Real One)
Step 1: Logs
Nothing.
Step 2: Metrics
Normal.
Step 3: Infra
Healthy.
Step 4: Reproduce locally
Couldn’t.
At this point, you hit a wall every backend engineer knows:
“If nothing is wrong… why is everything broken?”
💀 The Hidden Bug
Buried deep inside the consumer:
```go
if !isValid(event) {
    return nil
}
```
That’s it.
That one line.
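To see why that one line is lethal, here is a minimal, runnable sketch of the buggy consumer. The event shape, field names, and validation rule are all illustrative, not the real production code; the key point is that most consumer frameworks treat a `nil` return as success, so the offset is committed and the event vanishes without a trace.

```go
package main

import "fmt"

// Event is a hypothetical inventory event; fields are illustrative.
type Event struct {
	ID  string
	Qty int
}

// isValid is a stand-in validation rule: non-positive quantities fail.
func isValid(e Event) bool {
	return e.Qty > 0
}

// handle mirrors the buggy consumer: an invalid event returns nil,
// which the framework reads as "processed successfully" — the offset
// is committed, and the event is gone with no log, metric, or retry.
func handle(e Event) error {
	if !isValid(e) {
		return nil // silent drop
	}
	fmt.Println("processed", e.ID)
	return nil
}

func main() {
	events := []Event{{ID: "e1", Qty: 3}, {ID: "e2", Qty: 0}}
	for _, e := range events {
		if err := handle(e); err != nil {
			// Never reached: both branches report success.
			fmt.Println("error:", err)
		}
	}
	// Only e1 is processed; e2 disappears without any signal.
}
```

Run it and the error branch never fires, which is exactly what our dashboards saw: nothing.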
😶 Why This Was Dangerous
This caused:
- ❌ No logs
- ❌ No metrics
- ❌ No retries
- ❌ No DLQ
- ❌ No alerts
Just silent skipping.
🧬 What Was Actually Happening
```mermaid
flowchart TD
    A[Event Received] --> B{Valid?}
    B -- Yes --> C[Process Event]
    B -- No --> D[Silently dropped: return nil]
```
This is the worst possible failure mode in distributed systems.
⚠️ The Real Problem
This wasn’t a “bug”.
This was a design failure in observability.
We had:
- Business logic ✔️
- Infra stability ✔️
- Scalability ✔️
But missing:
❌ Visibility into decision points
🛠️ The Fix (Simple but Powerful)
1. Make Failures Visible
```go
if !isValid(event) {
    log.Warn("event_validation_failed", event.ID)
    return nil
}
```
2. Add Metrics for Every Drop
```go
metrics.Increment("inventory.event.validation.failure")
```
3. Optional: DLQ for Debugging
```go
sendToDLQ(event)
```
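Putting the three pieces together, here is a self-contained sketch of the fixed handler. The metric and DLQ clients are in-memory stand-ins with illustrative names (real code would use your metrics library and a real dead-letter topic), but the shape is the same: the skip still happens, it just can never happen silently.

```go
package main

import (
	"fmt"
	"log"
)

type Event struct {
	ID  string
	Qty int
}

func isValid(e Event) bool { return e.Qty > 0 }

// In-memory stand-ins for real metrics/DLQ clients — illustrative only.
var validationFailures int
var dlq []Event

func incrementValidationFailure() { validationFailures++ }
func sendToDLQ(e Event)           { dlq = append(dlq, e) }

// handle is the fixed consumer: every dropped event leaves a trace.
func handle(e Event) error {
	if !isValid(e) {
		log.Printf("event_validation_failed id=%s", e.ID) // 1. visible in logs
		incrementValidationFailure()                      // 2. counted in metrics
		sendToDLQ(e)                                      // 3. recoverable from DLQ
		return nil // still skip processing, but never silently
	}
	fmt.Println("processed", e.ID)
	return nil
}

func main() {
	handle(Event{ID: "e1", Qty: 3})
	handle(Event{ID: "e2", Qty: 0})
	fmt.Printf("validation_failures=%d dlq_size=%d\n", validationFailures, len(dlq))
}
```

Now a spike in `validation_failures` shows up on a dashboard, and the DLQ holds the exact events to replay once the upstream bug is fixed.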
📊 New System (After Fix)
```mermaid
flowchart TD
    A[Event Received] --> B{Valid?}
    B -- Yes --> C[Process Event]
    B -- No --> D[Log + Metrics + DLQ]
```
🔥 The Shift in Thinking
Before:
“If it fails, it will show up”
After:
“If I don’t explicitly track it, it doesn’t exist”
💡 My Production Checklist
Whenever I design a consumer now:
- ✅ Log every decision branch
- ✅ Add metrics for drops, skips, retries
- ✅ Never `return nil` silently
- ✅ Add DLQ for debugging paths
- ✅ Think in failure scenarios first
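One way I apply this checklist is to make every decision branch record an outcome, so a "skip" is as countable as a "success". This is a minimal sketch with illustrative names, not a real instrumentation library:

```go
package main

import "fmt"

// Outcome labels every branch a consumer can take.
type Outcome string

const (
	Processed    Outcome = "processed"
	Skipped      Outcome = "skipped"
	DeadLettered Outcome = "dead_lettered"
)

// In-memory outcome counters — a stand-in for a real metrics client.
var outcomes = map[Outcome]int{}

func record(o Outcome) { outcomes[o]++ }

type Event struct {
	ID    string
	Valid bool
}

// handle records an outcome on every branch: no path exits untracked.
func handle(e Event) {
	if !e.Valid {
		record(Skipped)      // the drop is now a first-class, counted event
		record(DeadLettered) // and it lands somewhere we can inspect
		return
	}
	record(Processed)
}

func main() {
	for _, e := range []Event{{"e1", true}, {"e2", false}, {"e3", true}} {
		handle(e)
	}
	fmt.Println(outcomes) // every branch taken is now visible
}
```

If a branch exists in the code but never appears in the counters, that itself is a signal worth alerting on.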
🧠 Key Takeaways
1. Logs Tell a Story You Choose
If you don’t log it, it didn’t happen.
2. Metrics Only Measure What You Track
No metric = no failure (even if it's happening)
3. Silent Failures Are Worse Than Crashes
Crashes alert you. Silence kills you slowly.
Your system is not reliable because it doesn’t crash.
It’s reliable because it tells you when it’s wrong.
📈 Series: Production Debugging Playbook (for Backend Engineers)
This is Part 1 of a series based on real production learnings:
🔹 Part 1: When Logs Lie (This Post)
Silent failures & observability gaps
Next Parts coming soon!
💬 Let’s Discuss
Have you ever faced a bug where:
👉 Everything looked fine
👉 But production was broken
Drop your story 👇
