My Monitoring Dashboard Was All Green — While 80% of Users Got Errors

Everything looked perfect. Response times? Under 200ms. Error rate? 0.3%. CPU? 12%. Memory? Fine. Every metric on the dashboard was painted in soothing green.

And yet, our support inbox was filling up with "the app isn't working" messages.

Here's what happened, how I found it, and the monitoring lesson that changed how I think about observability.

The Problem

It was a Tuesday. Our app had a feature that processed user-uploaded documents — convert, OCR, store. Simple pipeline. The monitoring dashboard showed everything healthy.

But users were complaining. Not loudly. Just a steady trickle of "my document didn't process" tickets.

I checked the dashboard again. All green.

The Investigation

I dug into the raw logs — not the aggregated metrics, but the actual request-level logs. And that's when I saw it:

20% of requests went through the main service → responded normally
80% of requests hit a load balancer rule I'd added weeks ago for "capacity management" → got silently routed to a secondary queue that... nobody was consuming from

The requests weren't failing. They were being redirected to a dead end. No errors, no timeouts. Just... nothing.
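The kind of tally that exposed the split can be sketched in a few lines. The log format here (JSON lines with a `route` field) and the route names are assumptions for illustration, not our actual log schema.

```python
import json
from collections import Counter

def route_shares(log_lines):
    """Return the fraction of requests per route from JSON-lines request logs."""
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # "route" is a hypothetical field name; use whatever your proxy logs.
        counts[record.get("route", "unknown")] += 1
    total = sum(counts.values())
    return {route: n / total for route, n in counts.items()} if total else {}

# Hypothetical sample: 2 requests to the main service, 8 to the overflow queue.
sample = (
    ['{"id": %d, "route": "main"}' % i for i in range(2)]
    + ['{"id": %d, "route": "overflow-queue"}' % i for i in range(2, 10)]
)
shares = route_shares(sample)
for route, share in sorted(shares.items()):
    print(f"{route}: {share:.0%}")
```

Grouping by destination rather than by status code is the point: every one of these requests looked "successful" on its own.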

Our monitoring was tracking:

✅ Response time (of the 20% that completed)
✅ Error rate (near zero, because the diverted requests never produced an error; they just vanished)
✅ CPU/memory (the main service was barely working, because 80% of traffic was going elsewhere)

The dashboard was green because it was only measuring the healthy path.
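To make the blind spot concrete, here is a minimal sketch with made-up numbers. The `Request` type, the 180 ms latency, and the 200/800 split are all illustrative assumptions. It compares what a service-side dashboard computes with a completion ratio measured from the edge.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Request:
    completed: bool                      # did the main service ever respond?
    latency_ms: Optional[float] = None   # only known for completed requests
    error: bool = False

def dashboard_view(requests):
    """Metrics as a service-side dashboard sees them: completed requests only."""
    done = [r for r in requests if r.completed]
    if not done:
        return {"avg_latency_ms": None, "error_rate": None}
    return {
        "avg_latency_ms": sum(r.latency_ms for r in done) / len(done),
        "error_rate": sum(r.error for r in done) / len(done),
    }

def completion_ratio(requests):
    """Share of requests accepted at the edge that actually finished."""
    return sum(r.completed for r in requests) / len(requests) if requests else None

# Hypothetical traffic: 200 requests complete, 800 vanish into an unconsumed queue.
traffic = [Request(True, 180.0) for _ in range(200)] + [Request(False) for _ in range(800)]
print(dashboard_view(traffic))
print(completion_ratio(traffic))
```

The first function never sees the 800 missing requests, so its numbers stay healthy. Only the second one, which uses the total accepted at the edge as its denominator, moves.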

The Root Cause

Three weeks earlier, I'd added a load balancer rule to route "overflow" traffic to a secondary processing queue during peak loads. The idea: prevent the main service from crashing under heavy document uploads.

The rule worked. It routed 80% of traffic to the secondary queue.

Nobody ever set up a consumer for the secondary queue.

The traffic wasn't failing. It was being politely escorted to a room with no exit. And our monitoring, which only tracked the main service, had no idea.

The Fix

Disabled the load balancer rule (immediate fix)
Set up a consumer for the secondary queue (proper fix — now both paths are monitored)
Added queue-depth monitoring (so we catch "traffic going somewhere but not being consumed" scenarios)
Created a "silent failure" runbook (a checklist for when metrics look fine but users report problems)
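The queue-depth check can be sketched roughly as follows. The `QueueSample` type, the `max_depth` threshold, and the three-sample growth rule are assumptions for illustration; in practice the depth and consumer count would come from whatever metrics your broker exposes.

```python
from dataclasses import dataclass

@dataclass
class QueueSample:
    depth: int        # messages waiting in the queue
    consumers: int    # consumers currently attached

def queue_alerts(samples, max_depth=1000):
    """Return alert strings for a time-ordered list of samples of one queue."""
    if not samples:
        return []
    alerts = []
    latest = samples[-1]
    if latest.depth > 0 and latest.consumers == 0:
        alerts.append("messages waiting but no consumers attached")
    if latest.depth > max_depth:
        alerts.append(f"depth {latest.depth} exceeds {max_depth}")
    depths = [s.depth for s in samples]
    # Strictly increasing depth across at least three samples.
    if len(depths) >= 3 and all(b > a for a, b in zip(depths, depths[1:])):
        alerts.append("depth grew in every sample; consumers are not keeping up")
    return alerts

# Hypothetical history of the secondary queue: growing, with nobody reading it.
history = [QueueSample(400, 0), QueueSample(900, 0), QueueSample(1500, 0)]
for alert in queue_alerts(history):
    print(alert)
```

The "no consumers attached" condition is the one that would have caught this incident on day one, well before any depth threshold was crossed.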

The Real Lesson

Green dashboards don't mean healthy systems. They mean your dashboards are measuring the things you told them to measure.

The things you didn't tell them to measure? Those are the things that will quietly break everything.

Here's my new rule: Every routing decision needs monitoring on both ends. If you send traffic somewhere, you need to know if it arrives and if it gets processed.

A routing rule without a corresponding monitor isn't just incomplete. It's dangerous. It gives you false confidence — the worst kind of confidence in production.
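One way to picture "monitoring on both ends" is a ledger that counts what each route was sent and what its destination reports as processed. This is a minimal sketch: `RouteLedger`, the route names, and the 0.95 threshold are hypothetical, and a real version would compare counts over a time window to allow for requests still in flight.

```python
from collections import defaultdict

class RouteLedger:
    """Counts what each route was sent versus what its destination processed."""

    def __init__(self):
        self.sent = defaultdict(int)
        self.processed = defaultdict(int)

    def record_sent(self, route):
        self.sent[route] += 1

    def record_processed(self, route):
        self.processed[route] += 1

    def unhealthy_routes(self, min_ratio=0.95):
        """Routes whose processed/sent ratio falls below min_ratio."""
        flagged = {}
        for route, sent in self.sent.items():
            ratio = self.processed[route] / sent
            if ratio < min_ratio:
                flagged[route] = ratio
        return flagged

ledger = RouteLedger()
for _ in range(20):
    ledger.record_sent("main")
    ledger.record_processed("main")
for _ in range(80):
    ledger.record_sent("overflow-queue")   # nothing ever processes these

print(ledger.unhealthy_routes())
```

The sending side and the processing side report independently, so a destination that never reports shows up as a ratio of zero instead of as silence.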

My "Silent Failure" Checklist

When users report problems but your dashboard is green:

[ ] Check raw logs, not just aggregated metrics
[ ] Trace a single request end-to-end (not just the happy path)
[ ] Look for traffic going somewhere unexpected (new routes, old rules, load balancer configs)
[ ] Check queue depths and consumer lag (requests might be waiting, not failing)
[ ] Ask: "How could this be broken in a way my dashboard couldn't see?"

The answer to that last question is usually the bug.

Your metrics are only as good as your imagination. If you can't imagine how something could silently fail, you won't monitor it. And if you don't monitor it, it will fail — quietly, confidently, while your dashboard stays green.

What's your "green dashboard but everything is broken" story?