I spent three hours last week staring at a perfectly green dashboard while our users were getting 5-second response times. Every metric said "healthy." Every alert was silent. But the product was broken.
That's when it clicked: I had built a monitoring system, not an observability system. And they are not the same thing.
The Dashboard Illusion
Here's what my setup looked like:
- CPU usage: 23% ✅
- Memory: 61% ✅
- Request rate: normal ✅
- Error rate: 0.2% ✅
But users were rage-clicking because the checkout flow was timing out on a specific third-party API call. The error rate was "fine" because retries masked the failures. The CPU was low because the bottleneck was network I/O, not computation. Every metric I was watching was measuring the wrong thing.
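The masking effect is easy to reproduce with a few lines of arithmetic. The sketch below is an illustration only: the `Request` fields, the ten sample requests, and the 2-second "slow" threshold are made-up assumptions, not data from the incident. It shows how an error rate counted on final outcomes can read zero while a large share of upstream attempts fail.

```python
from dataclasses import dataclass


@dataclass
class Request:
    attempts: int      # upstream calls made, retries included
    succeeded: bool    # final outcome after all retries
    latency_s: float   # total time the user waited


def final_error_rate(requests):
    """Error rate counted on the final outcome only."""
    return sum(not r.succeeded for r in requests) / len(requests)


def attempt_error_rate(requests):
    """Error rate per upstream attempt, so retried failures stay visible."""
    total = sum(r.attempts for r in requests)
    failed = sum(r.attempts - (1 if r.succeeded else 0) for r in requests)
    return failed / total


def slow_share(requests, threshold_s=2.0):
    """Share of requests slower than the (assumed) tolerance threshold."""
    return sum(r.latency_s > threshold_s for r in requests) / len(requests)


# Hypothetical traffic: 6 requests succeed first try, 4 succeed on the third try.
sample = [Request(1, True, 0.3)] * 6 + [Request(3, True, 5.0)] * 4

print(f"final error rate:   {final_error_rate(sample):.1%}")
print(f"attempt error rate: {attempt_error_rate(sample):.1%}")
print(f"slow requests:      {slow_share(sample):.1%}")
```

With this sample, the final-outcome error rate is zero even though 8 of the 18 upstream attempts failed and 4 of the 10 users waited five seconds.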
This is what I call decorative telemetry — numbers that look authoritative in a standup meeting but tell you nothing about what users actually experience.
Monitoring vs Observability: The Real Difference
The distinction that changed everything:
- Monitoring asks: "Is it broken?" You define thresholds and get alerts when they are crossed.
- Observability asks: "Why is it broken?" You can explore unknown unknowns without shipping new code.
Monitoring is a car dashboard (speed, fuel, temperature). Observability is a mechanic's diagnostic tool that can trace any symptom back to its root cause.
Most teams — including mine — build dashboards and call it observability. That's like buying a speedometer and claiming you can diagnose engine problems.
What Actually Fixed It
1. SLO-First Instrumentation
Instead of measuring infrastructure, I started measuring what users perceive. For checkout, that meant:
SLO: 99% of checkout requests complete in < 2 seconds
SLI: Share of checkout requests completing in < 2 seconds, per 5-minute window
Error Budget: 1% of checkout requests may miss the target (as time, 7.2 hours of a 30-day month)
When the third-party API degraded, our error budget burned visibly. The dashboard went from "all green" to "73% budget consumed in 2 hours." That's actionable.
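To make the budget arithmetic concrete, here is a minimal sketch. The 99% target and the 2-second threshold come from the SLO above. Everything else is an assumption for illustration: the function names, the monthly traffic of 1,000,000 checkouts, and the request counts in the incident window, which were picked so the result lands on the 73% figure.

```python
SLO_TARGET = 0.99      # from the SLO: 99% of checkout requests
THRESHOLD_S = 2.0      # from the SLO: complete in < 2 seconds


def sli(latencies_s):
    """Share of requests that finished under the latency threshold."""
    good = sum(1 for t in latencies_s if t < THRESHOLD_S)
    return good / len(latencies_s)


def allowed_bad(expected_requests):
    """Error budget expressed as a number of requests for the period."""
    return (1 - SLO_TARGET) * expected_requests


def budget_consumed(bad_requests, expected_requests):
    """Fraction of the period's error budget already spent."""
    return bad_requests / allowed_bad(expected_requests)


def burn_rate(window_sli):
    """1.0 means the budget lasts exactly the period; higher burns faster."""
    return (1 - window_sli) / (1 - SLO_TARGET)


# Hypothetical numbers: 1,000,000 checkouts expected this month, and a
# 2-hour incident window with 20,000 requests, 7,300 of them slow.
window = [0.4] * 12_700 + [5.0] * 7_300

window_sli = sli(window)
print(f"SLI in window:   {window_sli:.1%}")
print(f"budget consumed: {budget_consumed(7_300, 1_000_000):.0%}")
print(f"burn rate:       {burn_rate(window_sli):.1f}x")
```

The burn rate is the number worth alerting on: a value far above 1 means the month's budget will be gone long before the month is, even if every infrastructure metric still looks healthy.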
2. Correlation Discipline
The real breakthrough was forcing structured logs, trace IDs, and bounded context tags into every service. When something breaks, you follow the trace ID — not guess which dashboard to check.
Before: "Checkout is slow. Let me check 6 dashboards."
After: "Trace abc123 failed at step 4. It's the payment provider."
Mean time to diagnosis dropped from hours to minutes.
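Here is one way to get a trace ID into every log line, sketched with only the Python standard library. Treat it as an illustration under stated assumptions rather than the setup from my services: the logger name, the `service` and `step` fields, and the `trace_id_var` context variable are all hypothetical, and a real system would more likely take the trace ID from its tracing library.

```python
import contextvars
import io
import json
import logging

# Hypothetical context variable holding the current request's trace ID.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that always carries trace_id."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": trace_id_var.get(),
            # Fields passed via `extra=` become attributes on the record.
            "service": getattr(record, "service", "unknown"),
            "step": getattr(record, "step", None),
        })


def build_logger(stream):
    """Return a logger that writes structured lines to `stream`."""
    logger = logging.getLogger("checkout-sketch")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# Usage: set the trace ID once per request, then log as usual.
buffer = io.StringIO()
log = build_logger(buffer)
trace_id_var.set("abc123")
log.error("payment provider timed out", extra={"service": "checkout", "step": 4})

line = json.loads(buffer.getvalue())
print(line["trace_id"], line["service"], line["step"])
```

Because the trace ID lives in a context variable, individual log calls never have to pass it around, which is what makes the discipline cheap enough to enforce everywhere.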
3. Alert Hygiene
I cut our alerts from 47 to 8. The rule: page only what demands immediate human action. Everything else goes to a dashboard or a weekly report.
The 8 remaining alerts have explicit severity semantics:
- SEV-1: Users cannot complete core flows → page immediately
- SEV-2: Degraded experience for > 10% of users → page if it persists > 15 min
- SEV-3: Internal tooling degraded → Slack notification
- Everything else: Dashboard-only
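The severity rules are simple enough to write down as a routing function, which also forces the ambiguous cases into the open. A minimal sketch follows. The `Signal` fields and action names are hypothetical, and the "hold" action for a SEV-2 that has not yet lasted 15 minutes is my assumption, since the rules above only say when to page.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    core_flow_blocked: bool = False         # users cannot complete a core flow
    degraded_user_share: float = 0.0        # fraction of users with a degraded experience
    degraded_minutes: float = 0.0           # how long the degradation has lasted
    internal_tooling_degraded: bool = False


def route(signal):
    """Return (severity, action) following the four rules above."""
    if signal.core_flow_blocked:
        return ("SEV-1", "page")
    if signal.degraded_user_share > 0.10:
        # Assumption: before the 15-minute mark the alert waits without paging.
        action = "page" if signal.degraded_minutes > 15 else "hold"
        return ("SEV-2", action)
    if signal.internal_tooling_degraded:
        return ("SEV-3", "slack")
    return (None, "dashboard")


print(route(Signal(core_flow_blocked=True)))
print(route(Signal(degraded_user_share=0.25, degraded_minutes=5)))
print(route(Signal(degraded_user_share=0.25, degraded_minutes=20)))
print(route(Signal(internal_tooling_degraded=True)))
print(route(Signal()))
```

The order of the checks encodes priority: a signal that blocks a core flow pages immediately even if it would also match a lower severity.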
The Maturity Model I Wish I'd Had Earlier
Level 1 — Foundational: Consistent logging, health checks, and actionable paging tied to runbooks. (This is where most teams think they're "done.")
Level 2 — Intermediate: Distributed tracing with sampling strategies that survive production load. You can follow a request across services.
Level 3 — Advanced: Anomaly detection where baselines are meaningful — not noisy vanity curves that nobody trusts.
Level 4 — Principal: Org-wide observability contracts. Every team instruments the same way, SLOs drive priority, and incident learning becomes institutional knowledge.
I was at Level 1 thinking I was at Level 3.
The Hard Truth
Dashboards without on-call runbooks are decorative. Metrics without error budgets are opinions. And alerts that page for everything page for nothing.
Observability isn't a tool you install. It's a discipline you practice — mapping every signal to a decision, every alert to an action, and every incident to a learning loop.
The green dashboard was lying to me not because it was wrong, but because it was answering questions nobody was asking.
What's the biggest gap between what your dashboards show and what your users experience?