Gnaneswar

Nobody Knows What's Happening Anymore

We have more observability than ever. So why does every incident still start with confusion?


Last Tuesday at 2:47 PM, our API latency spiked.

Not a small bump — a wake-up-the-entire-team spike.

We had everything you’d expect:

  • Grafana dashboards showing the latency increase
  • Datadog traces pointing at slow database queries
  • CloudWatch metrics showing normal CPU and memory
  • PagerDuty firing on error rates
  • Slack filling up with graphs and screenshots

What surprised me wasn’t that the incident happened.

It was that every tool worked exactly as designed, and we still couldn’t answer the only question that mattered:

What is actually happening?

It took me about 18 minutes to piece together that a deployment had changed how we batched requests.

That slightly increased per-request latency, which triggered retry logic, which cascaded into the database getting hammered with duplicate queries.
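If you want the back-of-the-envelope version of that cascade, here’s a tiny sketch. The numbers and the retry policy are invented for illustration, not taken from our actual incident:

```python
# Back-of-the-envelope retry amplification. All numbers are illustrative.

def db_queries_per_second(rps, latency_ms, timeout_ms, max_retries):
    """Queries hitting the database per second, assuming every timed-out
    attempt is retried and each attempt issues a duplicate query."""
    if latency_ms <= timeout_ms:
        return rps                      # happy path: one query per request
    return rps * (1 + max_retries)      # original attempt plus its retries

before = db_queries_per_second(rps=500, latency_ms=180, timeout_ms=250, max_retries=3)
after = db_queries_per_second(rps=500, latency_ms=320, timeout_ms=250, max_retries=3)

print(f"before deploy: {before} qps, after deploy: {after} qps")
# before deploy: 500 qps, after deploy: 2000 qps
```

In this toy model, a latency change small enough to look boring on a dashboard quadruples database load. That second-order effect is exactly the part no single graph showed me.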

The data was all there.

I just couldn’t see the story.


We’ve Solved the Wrong Problem

Over the last decade, observability tooling got very good at collection and visualization.

You can instrument almost anything.

You can graph nearly everything.

You can trace requests across an entire distributed system.

But we largely stopped there.

We built tools that answer:

“What is this metric doing?”

but not:

“What is actually happening?”

Those are different questions.

The first is a data problem.

The second is a reasoning problem.


How We Actually Debug Incidents

Here’s what incident response usually looks like in practice.

You open a handful of browser tabs:
Grafana. Datadog. Logs. The cloud console. Your service dashboard. Maybe a notebook someone created months ago and never updated.

You start asking questions:

  • What changed recently?
  • When did this start?
  • What else changed around the same time?
  • How might these things be connected?

You’re not reading metrics.

You’re reconstructing a narrative from fragments.

“The deploy was at 2:30… latency jumped at 2:32… which service was that… what changed in that deploy… did it affect retries… wait, retry volume is way higher than normal…”

This is the actual cognitive work of incident response.

Not reading charts.

Turning charts into understanding.


The Ambiguity Problem

What makes this hard is that the same signals can support very different explanations.

Imagine you observe:

  • A deploy at 14:30
  • Latency increasing at 14:32
  • Error rates rising shortly after
  • Retry volume spiking
  • Infrastructure metrics remaining normal

What happened?

Story 1:

The deploy introduced a bug. Latency increased because of faulty code. Errors followed. Roll back.

Story 2:

Latency triggered retries, which amplified downstream load. Infrastructure wasn’t exhausted, but retry amplification caused intermittent failures. Adjust retry behavior.

Story 3:

The deploy changed request characteristics in a way that reduced effective throughput in the model-serving layer. You’re not CPU-bound — you’re throughput-bound. Fix batching or warmup behavior.

Same data.

Different lenses.

Different fixes.

The difference isn’t the signals.

It’s the questions being asked.
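One way to see it: each story stands or falls on a different piece of evidence. Here’s what it might look like to write the competing stories down explicitly, each paired with the question that would separate it from the others. The checks and fixes are invented for illustration, not output from any real tool:

```python
# The same signals, written down as competing hypotheses, each with the
# question that would discriminate it. Illustrative only.

from dataclasses import dataclass

@dataclass
class Hypothesis:
    story: str
    discriminating_question: str
    likely_fix: str

hypotheses = [
    Hypothesis(
        "The deploy introduced a bug; the new code itself is slow",
        "Does latency recover on an instance running the previous build?",
        "Roll back",
    ),
    Hypothesis(
        "Retries are amplifying load and causing intermittent failures",
        "Is retry volume growing faster than first-attempt traffic?",
        "Tighten retry budgets and add backoff",
    ),
    Hypothesis(
        "Request characteristics changed; the serving layer is throughput-bound",
        "Did effective batch size or per-instance throughput drop at deploy time?",
        "Fix batching or warmup behavior",
    ),
]

for h in hypotheses:
    print(f"{h.story}\n  ask: {h.discriminating_question}\n  fix: {h.likely_fix}\n")
```

Nothing here is sophisticated. The value is that the ambiguity is on the page instead of in someone’s head at 2:47 PM.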


What I’m Exploring

I’ve been thinking about this for a while — not just this one incident, but the pattern behind it.

We keep expecting humans to do narrative construction in their heads, under pressure, during incidents. It’s expensive. It’s error-prone. And it gets harder as systems become more distributed.

What if we treated “turning signals into narrative” as an engineering problem?

Not to replace human judgment, but to help with the grunt work (there’s a code sketch after this list):

  • identifying temporal correlations
  • suggesting possible causal relationships
  • surfacing multiple plausible interpretations
  • making ambiguity explicit

I’m calling this exploration Coherence.

Right now, there’s no polished tool. It’s early thinking. I don’t yet know whether this can be built cleanly, or whether it collapses under real-world noise. That uncertainty is part of why I’m interested in it.


Why This Might Be a Bad Idea

There are real risks here.

It could create false confidence.

A generated narrative can sound authoritative even when it’s wrong.

It could hide important details.

Any form of summarization loses information — sometimes the information that matters most.

It might be solving the wrong problem.

Maybe the real issue isn’t interpretation, but that our systems are simply too complex to understand at all.

People might not trust it.

If you’re debugging a production incident, trusting an automated explanation is a big leap.

I don’t have clean answers to these concerns.


Why I Still Think It’s Worth Exploring

Because the current approach — handing humans a pile of data and asking them to reason perfectly under stress — doesn’t scale.

Systems keep growing in complexity.

The gap between “here’s your data” and “here’s what’s happening” keeps widening.

We probably need tools that operate at the level of reasoning, not just visualization.

Even if this particular idea fails, I think the problem itself is worth spending time on.


If you’ve felt this pain — or if you think this framing is wrong — I’d genuinely like to hear about it.

What actually helps you go from “incident declared” to “we understand what’s happening”?
