At 3am, during an incident, nobody is excited about tooling.
You’re tired, Slack is exploding, alerts are firing, and everyone is asking the same question in different words:
“What changed?”
I learned this the hard way after spending a few years on call as an SRE.
Most of the incidents I dealt with didn't go badly because we lacked alerts or dashboards. We had plenty. The breakdown usually happened after the alert, during investigation, when context was missing and people started jumping between tools trying to reconstruct what the system even looked like.
The uncomfortable pattern I kept seeing
Across different teams and stacks, the same things kept happening:
- Logs, metrics, traces, and deploy history all lived in different places
- Context from past incidents lived in Slack threads or postmortems nobody read
- New tools required weeks of setup before they were useful
- During incidents, nobody wanted to open yet another UI
What surprised me most was how much effort went into wiring tools together, instead of helping people reason about failures.
A lot of “AI for SRE” tools I tried assumed:
- clean integrations already existed
- the system graph was known
- teams had time to configure everything upfront
That’s rarely true in real systems.
The problem wasn’t intelligence — it was context
At some point it clicked for me:
the bottleneck wasn’t smarter analysis, it was missing context.
If a tool doesn’t understand:
- what services exist
- how they depend on each other
- what changed recently
- how incidents were handled before
then adding AI on top just produces confident nonsense.
So instead of asking “how do we make the model smarter?”, I started asking:
“How does the tool learn what the system actually looks like?”
A different approach I wanted to try
I started building IncidentFox mostly to scratch my own itch.
Two design decisions came directly from on-call pain:
1. Learn the system at setup, not weeks later
Instead of asking users to manually wire everything, the tool analyzes your system during setup — codebase, observability signals, past incidents — and builds the initial understanding automatically.
Not because setup is annoying (it is), but because incomplete integrations lead to wrong conclusions. Accuracy depends on context, and context shouldn’t take weeks to assemble.
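To make that concrete, here's a rough sketch of the kind of system model a setup step could produce from the codebase and telemetry. Everything in it is illustrative: the class names, fields, and the `blast_radius` helper are my own stand-ins, not IncidentFox's actual internals.

```python
# Hypothetical sketch of a system model built at setup time.
# Names and fields are illustrative, not IncidentFox's actual schema.
from dataclasses import dataclass, field


@dataclass
class ServiceContext:
    name: str
    depends_on: list[str] = field(default_factory=list)      # downstream services
    recent_deploys: list[str] = field(default_factory=list)  # e.g. commit SHAs from the last 24h
    dashboards: list[str] = field(default_factory=list)      # links discovered in the repo / observability stack


@dataclass
class SystemModel:
    services: dict[str, ServiceContext] = field(default_factory=dict)

    def add_dependency(self, service: str, depends_on: str) -> None:
        self.services.setdefault(service, ServiceContext(service))
        self.services.setdefault(depends_on, ServiceContext(depends_on))
        self.services[service].depends_on.append(depends_on)

    def blast_radius(self, service: str) -> set[str]:
        """Everything that could be affected if `service` degrades (reverse dependency walk)."""
        affected, frontier = set(), [service]
        while frontier:
            current = frontier.pop()
            for name, ctx in self.services.items():
                if current in ctx.depends_on and name not in affected:
                    affected.add(name)
                    frontier.append(name)
        return affected


# A setup step would populate this automatically instead of asking
# an engineer to type it all in.
model = SystemModel()
model.add_dependency("checkout", "payments")
model.add_dependency("payments", "postgres")
print(model.blast_radius("postgres"))  # {'payments', 'checkout'} (order may vary)
```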
2. Stay where incidents already happen
Every incident I’ve been part of lived in Slack.
That’s where decisions happened, context was shared, and confusion spread.
So the tool is Slack-first by design. Not as a notification surface, but as the actual place where investigation happens, pulling logs, metrics, traces, and historical context directly into the thread.
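As a rough illustration of what "answering in the thread" can look like, here is a minimal bot built on Slack's Bolt SDK (`slack_bolt`). The `gather_context` helper, its hard-coded reply, and the environment variable names are assumptions for the sketch, not IncidentFox's real implementation.

```python
# Minimal sketch of the "investigate in the thread" idea using slack_bolt.
import os

from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

app = App(token=os.environ["SLACK_BOT_TOKEN"])


def gather_context(channel: str) -> str:
    # Placeholder: a real tool would pull recent deploys, error-rate deltas,
    # and similar past incidents for the services tied to this channel.
    return "Last deploy: payments @ 14:02 UTC. Error rate up 4x since 14:05."


@app.event("app_mention")
def handle_mention(event, say):
    # Reply inside the incident thread instead of sending people to another UI.
    thread_ts = event.get("thread_ts", event["ts"])
    say(text=gather_context(event["channel"]), thread_ts=thread_ts)


if __name__ == "__main__":
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()
```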
The goal wasn’t to replace existing tools, but to stop people from losing context by bouncing between them.
One more thing I didn’t expect
As incidents happen, teams leave behind a lot of implicit knowledge:
- why a hypothesis was rejected
- which signals mattered
- what ended up being noise
Most tools throw that away.
We decided to keep it. The system continuously learns from each incident, so future investigations start with more context instead of a blank slate.
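A hand-wavy sketch of what that memory could look like, assuming a simple service-overlap lookup. The `IncidentRecord` and `IncidentMemory` names are hypothetical, not taken from the project.

```python
# Hypothetical shape of the "incident memory" idea: keep what humans learned,
# then surface it when a similar investigation starts.
from dataclasses import dataclass, field


@dataclass
class IncidentRecord:
    services: set[str]
    rejected_hypotheses: list[str] = field(default_factory=list)  # e.g. "bad deploy (rollback changed nothing)"
    useful_signals: list[str] = field(default_factory=list)       # e.g. "p99 latency on /charge"
    known_noise: list[str] = field(default_factory=list)          # alerts that fired but didn't matter


class IncidentMemory:
    def __init__(self) -> None:
        self._records: list[IncidentRecord] = []

    def remember(self, record: IncidentRecord) -> None:
        self._records.append(record)

    def recall(self, affected_services: set[str]) -> list[IncidentRecord]:
        """Return past incidents that touched any of the same services."""
        return [r for r in self._records if r.services & affected_services]


memory = IncidentMemory()
memory.remember(IncidentRecord(
    services={"payments"},
    rejected_hypotheses=["bad deploy (rollback changed nothing)"],
    useful_signals=["connection pool saturation on payments-db"],
    known_noise=["CPU alert on the batch workers"],
))

# A new investigation touching `payments` starts with this instead of a blank slate.
for past in memory.recall({"payments", "checkout"}):
    print(past.useful_signals)
```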
It’s not magic, and it definitely doesn’t “solve incidents automatically”. It just tries to remember what humans already learned under pressure.
Where this is today
IncidentFox is open source (Apache 2.0), self-hostable, and very much still evolving. It won’t replace your monitoring stack, and it has blind spots I haven’t hit yet.
I’m sharing it because I want feedback from people who’ve actually been on call:
- What helped you most during investigation?
- What tools disappointed you?
- What assumptions here sound naive?
Repo is here if you’re curious:
https://github.com/incidentfox/incidentfox
I’m still learning — and incidents have a way of humbling everyone.