Sanket Naik

Your Observability Stack Is Optimized for the Wrong Thing

TL;DR: Modern observability tools—Prometheus, Jaeger, the ELK stack—excel at collecting signals (metrics, logs, traces) but fail at the harder problem: understanding your system. This is an architectural problem, not a tooling problem. The solution is treating flows, topology, and business context as first-class entities, not afterthoughts.


The Problem: A 3am Page

It's 3am. You get paged: checkout is slow.

You open Grafana. Latency looks mostly normal—median is fine. You check Datadog. Nothing obviously broken. You dive into traces in Jaeger. Thousands of traces, most normal. You search logs for errors. A few timeouts, but nothing systematic.

Two hours later, your team figures it out: a recent deployment moved database pods across availability zones, and now checkout requests for mobile users in US-West are crossing zones for every database call. Each round trip adds 50ms, and when you have 20 queries per checkout, you hit 1 second of latency.

Your monitoring stack saw all the signals. It just didn't connect them.


The Real Problem: We're Optimizing for Collection, Not Understanding

Every major observability platform—Prometheus, VictoriaMetrics, Jaeger, the ELK stack, Datadog, New Relic—does one thing exceptionally well: collect signals at scale. They're optimized for throughput, retention, and query performance on individual signal types.

But here's the problem: they treat observability as a collection problem when it's actually a modeling problem.

This has a hidden cost: terrible developer experience. Engineers spend their days context-switching between tools instead of building features. Platform teams spend weeks tuning dashboards that operators barely use. This isn't just operational overhead—it's a drag on DevX.

Why Signal-Centric Observability Fails

Metrics systems excel at aggregation but can't handle high-cardinality dimensions like flow, client, or region. Logging systems preserve rich detail but offer no structural context—you're searching through chaos. Distributed tracing shows request paths beautifully, but each trace is an island with no baseline comparison or business context.

The result: operators and engineers manually jump between tools. "Check dashboards... then logs... then traces... is this even related?" Humans are bad at correlation at scale. You miss subtle failures that emerge gradually.

And everyone's frustrated. Engineers can't easily observe their own systems. SREs are drowning in false positives. Platform teams are exhausted tuning infrastructure that doesn't deliver value.


What's Missing: Context

The checkout problem wasn't mysterious. A human could solve it in minutes if presented with:

  1. System topology: "What changed recently?" (deployment to US-West-2)
  2. Flow definition: "Which flows are affected?" (checkout, specifically mobile users)
  3. Historical baselines: "Is this normal for this context?" (no, baseline checkout is 200ms, this is 800ms)
  4. Correlated signals: "What changed at the same time?" (pod rescheduling coincided with latency jump)

None of these require new algorithms. They require context—understanding your system as a coherent whole, not as isolated signals.


The Solution: Flow-Centric Observability with System Awareness

Instead of asking "what do my metrics/logs/traces say?", ask:

"How are my business-critical flows behaving?"

And more importantly: "What does my service depend on, and how is it performing?"

A flow is a domain-relevant execution path: checkout, search, login, payment processing. Each flow:

  • Spans multiple services and infrastructure
  • Has known operational requirements (latency, error tolerance)
  • Has known business importance (some flows are revenue-critical)
  • Is something an engineer can directly understand ("Here's how checkout works, here's what's expected, here's what's happening now")
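
As a rough illustration, here is what an explicit flow definition could look like as a plain data structure. This is a sketch, not a reference implementation; the field names (entry_point, business_criticality, the SLO fields, and so on) are assumptions about what such a model might contain.

```python
from dataclasses import dataclass


@dataclass
class FlowSLO:
    """Operational requirements for a flow (values are illustrative)."""
    p99_latency_ms: float
    max_error_rate: float  # e.g. 0.005 == 0.5% of requests may fail


@dataclass
class Flow:
    """A domain-relevant execution path, declared explicitly as data."""
    name: str
    entry_point: str            # where the flow starts
    services: list[str]         # services the flow traverses, in order
    slo: FlowSLO
    business_criticality: str   # e.g. "revenue-critical", "internal"


# Hypothetical checkout flow matching the example in this post.
checkout = Flow(
    name="checkout",
    entry_point="api-gateway:/v1/checkout",
    services=["api-gateway", "cart", "pricing", "payments", "orders-db"],
    slo=FlowSLO(p99_latency_ms=200, max_error_rate=0.005),
    business_criticality="revenue-critical",
)
```

The point isn't the exact schema. The point is that the flow exists as data your tools can reference, instead of living only in engineers' heads.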

The Architecture

```
System Model (Topology, Flows, Semantics)
            ↓
Contextual Anomaly Detection
            ↓
Actionable Insights → Engineers
```

The system has three layers:

First, define your system: services, flows, SLOs, business impact. This isn't a new tool—it's a shared data structure existing tools reference.

Second, detect anomalies contextually. Instead of static thresholds, compare observed behavior against baselines for the current context (deployment version, traffic pattern, topology). You don't need ML—simple statistical baselining works great at the flow level.
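
To make "compare against the baseline for the current context" concrete, here is a minimal sketch of flow-level baselining: a rolling window of recent p99 samples per context, with anomalies flagged using a median plus MAD (median absolute deviation) test. The context key fields, window size, and threshold multiplier are assumptions chosen for illustration.

```python
import statistics
from collections import defaultdict, deque

# One rolling window of p99 latency samples per (flow, region, client, deploy) context.
WINDOW = 288  # e.g. 24 hours of 5-minute samples
history: dict[tuple, deque] = defaultdict(lambda: deque(maxlen=WINDOW))


def record(context: tuple, p99_ms: float) -> None:
    history[context].append(p99_ms)


def is_anomalous(context: tuple, p99_ms: float, k: float = 5.0) -> bool:
    """Flag values far outside the robust baseline for this specific context."""
    samples = history[context]
    if len(samples) < 30:  # not enough data to claim a baseline yet
        return False
    median = statistics.median(samples)
    mad = statistics.median(abs(s - median) for s in samples) or 1.0
    return p99_ms > median + k * mad


# The checkout example from this post:
ctx = ("checkout", "us-west", "mobile", "deploy-2024-xx")
for sample in [205, 212, 208, 215, 210] * 10:  # pretend pre-deploy history
    record(ctx, sample)
print(is_anomalous(ctx, 800))  # True: 800ms is far outside this context's baseline
```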

Third, surface insights. When an anomaly is detected, show what flow is affected, why it matters, what changed, and suggested actions. Engineers get immediate feedback: "Your deploy didn't break checkout" or "Checkout latency jumped post-deploy—here's why."


A Concrete Example: The Checkout Slowdown

Let's walk through how this would work:

Pre-deployment baseline:

  • Checkout flow (all regions, all clients): p99 latency = 200ms
  • Checkout flow (mobile, US-West): p99 latency = 210ms (slightly higher due to distance)

Post-deployment:

  • System detects: checkout latency for (mobile, US-West) jumped to 800ms
  • Context: most recent changes were pod rescheduling in US-West-2
  • Baseline comparison: this is 3.8x higher than expected for this context
  • Hypothesis: the system flags that cross-zone database access increased

What the operator sees:

ANOMALY: Checkout flow degradation

  • Flow: checkout (mobile users)
  • Region: US-West
  • Severity: High (checkout is revenue-critical)
  • Context change: Pod rescheduling in US-West-2 at 2:47am
  • Observed: p99 latency 800ms (expected: 210ms)
  • Possible cause: Cross-AZ database access following reschedule
  • Suggested action: Check pod placement; consider local replica or zone-aware scheduling

Instead of 2 hours of detective work, the operator has the answer in 5 minutes.
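
None of the pieces in that report require sophisticated inference. Here is a sketch of how such an insight could be assembled, assuming deployment and rescheduling events are available as a simple timestamped list; the event feed and output format are hypothetical, not any particular tool's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ChangeEvent:
    timestamp: datetime
    description: str


def correlate_changes(anomaly_start: datetime,
                      events: list[ChangeEvent],
                      window: timedelta = timedelta(minutes=30)) -> list[ChangeEvent]:
    """Return change events that happened shortly before the anomaly began."""
    return [e for e in events if anomaly_start - window <= e.timestamp <= anomaly_start]


def render_insight(flow: str, segment: str, observed_ms: float,
                   expected_ms: float, criticality: str,
                   suspects: list[ChangeEvent]) -> str:
    lines = [
        f"ANOMALY: {flow} flow degradation",
        f"  Segment: {segment}",
        f"  Severity: High ({flow} is {criticality})",
        f"  Observed: p99 {observed_ms:.0f}ms (expected: {expected_ms:.0f}ms)",
    ]
    lines += [f"  Context change: {e.description} at {e.timestamp:%H:%M}" for e in suspects]
    return "\n".join(lines)


events = [ChangeEvent(datetime(2024, 1, 10, 2, 47), "Pod rescheduling in US-West-2")]
suspects = correlate_changes(datetime(2024, 1, 10, 2, 55), events)
print(render_insight("checkout", "mobile / US-West", 800, 210, "revenue-critical", suspects))
```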


Why This Isn't "Just Better Alerting"

You might think: "Can't I just add more alerts to Prometheus?"

Not really. Traditional alerting is static—it defines thresholds upfront. But "normal" for checkout depends on:

  • Which region?
  • Which customer segment?
  • Peak vs. off-peak?
  • Post-deployment (higher error rates expected)?
  • Traffic composition?

You can't bake all of this into static alerts. You need a model.
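
A rough way to see why: even a handful of context dimensions multiplies out to more distinct "normals" than anyone will maintain as hand-written alert rules, while a model keyed on those dimensions handles them uniformly. The dimension values below are made up for illustration.

```python
from itertools import product

# Hypothetical context dimensions for a single flow.
regions = ["us-east", "us-west", "eu-west"]
segments = ["web", "mobile", "partner-api"]
traffic = ["peak", "off-peak"]
deploy_state = ["steady", "just-deployed"]

contexts = list(product(regions, segments, traffic, deploy_state))
print(len(contexts))  # 36 distinct 'normals' for a single flow

# With a model, "what is normal here?" is a lookup, not 36 static alert rules:
baseline_p99_ms = {ctx: None for ctx in contexts}  # filled from observed history
```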


Why This Isn't "Just AIOps"

AIOps platforms apply machine learning to observability data to reduce noise and infer root causes. Sounds good, but:

  • They still operate on signal-centric data, inheriting all the same correlation problems
  • They treat intelligence as something to be inferred from data alone
  • They often produce black-box recommendations that operators don't trust

The difference: the intelligence is encoded explicitly in a system model rather than inferred from the data. Humans stay in control.


What You'd Actually Need to Build This

Not a new tool. A system model registry that defines your system once, integrates with existing tools (Prometheus, Jaeger, your CD pipeline), and detects anomalies contextually. That's it. Everything else is implementation detail.
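
As one possible shape, the registry could hold the flow definitions and translate them into queries your existing tools already understand. The PromQL below follows the standard histogram_quantile pattern, but the metric and label names are assumptions; they would need to match your actual instrumentation.

```python
# A minimal sketch of a "system model registry": define the flow once,
# derive tool-specific queries from it instead of hand-writing dashboards.

FLOWS = {
    "checkout": {
        "services": ["api-gateway", "cart", "pricing", "payments"],
        "slo_p99_ms": 200,
        "criticality": "revenue-critical",
    },
}


def p99_query(flow: str, region: str) -> str:
    """Generate a PromQL p99 latency query for a flow's entry service.

    Assumes a conventional *_duration_seconds histogram with service/region labels.
    """
    entry = FLOWS[flow]["services"][0]
    return (
        "histogram_quantile(0.99, sum(rate("
        f'http_request_duration_seconds_bucket{{service="{entry}",region="{region}"}}[5m]'
        ")) by (le))"
    )


print(p99_query("checkout", "us-west"))
```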


Why This Shift Makes Sense Now

Four reasons:

1. Systems are more complex

  • Microservices, Kubernetes, multi-region deployments
  • Your 2-tier monolith had 2 obvious places to look
  • Your 50-service mesh has thousands of possible failure modes

2. Failures are more subtle

  • Not "service X is down" (that's obvious)
  • But "checkout is 30% slower for a specific user cohort" (that's a 3am hunt)
  • These emerge gradually, from subtle cascades across many services

3. Business impact isn't symmetric

  • A 100ms latency increase on your internal admin tool: shrug
  • A 100ms latency increase on checkout: customer churn, lost revenue
  • Current observability treats all latencies the same

4. DevX is becoming a competitive advantage

  • Teams that ship fast have better observability (which lets them ship even faster)
  • Engineers want to understand their systems quickly without becoming observability experts
  • Context-switching between tools is a hidden productivity tax
  • If your observability sucks, your best engineers will leave to work somewhere with better DevX

What Would You Build First?

Pick one flow. Checkout, login, search—whatever drives your business or frustrates your engineers most.

Can you:

  1. Define it explicitly—what services, what SLOs, what's the business impact?
  2. Collect baselines—what's normal for different contexts (deployment version, traffic pattern, region)?
  3. Detect one anomaly contextually—can you catch post-deployment degradation?
  4. Surface it with context—not just "alert fired" but "why it matters and what changed"?

If you can answer yes to all four, you have a working prototype.

The real test: Five minutes after an engineer deploys, can they know with confidence whether they broke anything? If yes, you've solved the problem.
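
As a sketch of that test, under the same assumptions as the earlier snippets (a system model mapping flows to services and SLOs, plus some way to read current and baseline p99 per flow): a few minutes after a deploy, walk the flows the changed service participates in and compare each against its baseline. The metric-fetching callables here are placeholders, not a real API.

```python
from typing import Callable

# Hypothetical system model: flow -> services it traverses and its SLO.
FLOWS = {
    "checkout": {"services": ["api-gateway", "cart", "payments"], "slo_p99_ms": 200},
    "search":   {"services": ["api-gateway", "search"],           "slo_p99_ms": 300},
}


def verify_deploy(service: str,
                  current_p99: Callable[[str], float],
                  baseline_p99: Callable[[str], float],
                  tolerance: float = 1.25) -> dict[str, str]:
    """Return a verdict for every flow the deployed service participates in."""
    verdicts = {}
    for flow, spec in FLOWS.items():
        if service not in spec["services"]:
            continue
        observed, expected = current_p99(flow), baseline_p99(flow)
        if observed <= expected * tolerance:
            verdicts[flow] = f"OK: p99 {observed:.0f}ms (baseline {expected:.0f}ms)"
        else:
            verdicts[flow] = f"DEGRADED: p99 {observed:.0f}ms vs baseline {expected:.0f}ms"
    return verdicts


# Example with canned numbers standing in for real metric reads:
print(verify_deploy("payments",
                    current_p99=lambda f: {"checkout": 800}.get(f, 250),
                    baseline_p99=lambda f: {"checkout": 210}.get(f, 250)))
```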


Why I'm Talking About This

I'm a platform engineer. And I've been thinking about observability differently.

We've invested heavily in the observability stack: Prometheus, Grafana, Jaeger, ELK. We've tuned alerts, built dashboards, created runbooks. And they work—they collect signals at massive scale. But I've noticed something interesting: despite all this investment, the fundamental challenge remains unchanged. Engineers still struggle to understand their own systems. When something goes wrong, we're still jumping between tools, correlating signals manually, relying on human expertise to make sense of what's happening.

It's not a tool problem. It's an architectural one.

I started wondering: what if observability isn't primarily about better signal collection? What if it's about system modeling? What if we treated understanding your system—topology, flows, semantics—as a first-class concern instead of something you figure out ad-hoc when things break?

The more I think about it, the more convinced I become that this is the direction observability needs to move. Not because the current approach is "wrong"—it's genuinely valuable. But because we're optimizing for the wrong thing. We've solved signal collection. Now we need to solve signal understanding.

This isn't a small idea. It changes how teams think about observability, how we build observability tools, and ultimately how engineers interact with their systems.


I Want to Hear From You

If this framework resonates—if you've noticed the same gap between signal collection and signal understanding—I'd like to think through this together.

Some questions I'm genuinely curious about:

If you're a platform engineer:

  • What's the disconnect you're experiencing between observability investment and what engineers actually need?
  • Have you tried to bridge the gap? What worked? What didn't?
  • What would observability need to look like to feel truly solved for your team?

If you're an application engineer:

  • When you deploy a change, what's your actual process for knowing whether it's working?
  • What observability question do you most want answered but can't easily ask?
  • How much of your time goes into "becoming fluent" in observability tools vs. actually debugging?

If you're building or working on observability tools:

  • Where do you see teams struggling most with current platforms?
  • What constraints are you operating under that prevent moving in this direction?
  • What would it take to make flow-centric observability feasible in your architecture?

If you've thought deeply about this problem:

  • What am I getting wrong?
  • What pieces am I missing?
  • What would actually need to happen for teams to adopt this approach?

I'd genuinely appreciate hearing your perspective. Drop a comment or reach out. I'm trying to understand whether this is a real paradigm shift or if I'm missing something important about how observability actually works in practice.

The goal isn't agreement—it's to stress-test the idea with people who understand the problem deeply.


The DevX Angle

Here's something observability vendors rarely talk about: observability is a developer tool, not just an operations tool.

When you build good observability with the system-modeling approach, you're not just reducing MTTR (mean time to recovery). You're:

  • Making it easy for engineers to understand their systems (faster onboarding, higher confidence in changes)
  • Reducing context-switching (one place to look, not five)
  • Enabling engineers to self-serve (they don't need to page SREs for basic debugging)
  • Shortening the feedback loop (deploy → immediate signal on whether it worked)

The teams with the best DevX are often the ones with the best observability. That's not a coincidence. When observability is easy and accessible, everything moves faster.


Further Reading

If this resonates, here's what I'd dig into:

  • Distributed systems monitoring: How do teams currently think about observability? (Brendan Gregg's work on observability methodology)
  • Observability platforms: What do Lightstep, Datadog, and New Relic already do along these lines?
  • SRE principles: How do Google's SRE books frame monitoring and alerting?
  • Anomaly detection: What statistical techniques actually work at scale? (ARIMA, isolation forests, etc.)
  • Developer experience in infrastructure: How does observability fit into broader DevX? (check out Charity Majors' work on observability and engineering culture)

The ideas here aren't new individually. The insight is connecting them around the problem of understanding distributed systems, not just collecting signals. And recognizing that understanding is fundamentally a developer experience problem.
