At 3am, during an incident, nobody is excited about tooling.
You’re tired, Slack is exploding, alerts are firing, and everyone is asking the same question in different words:
“What changed?”
I learned this the hard way after spending a few years on call as an SRE.
Most of the incidents I dealt with didn't go badly because we lacked alerts or dashboards. We had plenty. The breakdown usually happened after the alert, during investigation, when context was missing and people started jumping between tools trying to reconstruct what the system even looked like.
The uncomfortable pattern I kept seeing
Across different teams and stacks, the same things kept happening:
- Logs, metrics, traces, and deploy history all lived in different places
- Context from past incidents lived in Slack threads or postmortems nobody read
- New tools required weeks of setup before they were useful
- During incidents, nobody wanted to open yet another UI
What surprised me most was how much effort went into wiring tools together, instead of helping people reason about failures.
A lot of “AI for SRE” tools I tried assumed:
- clean integrations already existed
- the system graph was known
- teams had time to configure everything upfront
That’s rarely true in real systems.
The problem wasn’t intelligence — it was context
At some point it clicked for me:
the bottleneck wasn’t smarter analysis, it was missing context.
If a tool doesn’t understand:
- what services exist
- how they depend on each other
- what changed recently
- how incidents were handled before
then adding AI on top just produces confident nonsense.
So instead of asking “how do we make the model smarter?”, I started asking:
“How does the tool learn what the system actually looks like?”
A different approach I wanted to try
I started building IncidentFox mostly to scratch my own itch.
Two design decisions came directly from on-call pain:
1. Learn the system at setup, not weeks later
Instead of asking users to manually wire everything, the tool analyzes your system during setup — codebase, observability signals, past incidents — and builds the initial understanding automatically.
Not because setup is annoying (it is), but because incomplete integrations lead to wrong conclusions. Accuracy depends on context, and context shouldn’t take weeks to assemble.
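To make that concrete, here's a rough sketch of the kind of system model a setup step could produce from the codebase and telemetry. Everything in it is illustrative: the class names, fields, and the `blast_radius` helper are my own stand-ins, not IncidentFox's actual internals.

```python
# Hypothetical sketch of a system model built at setup time.
# Names and fields are illustrative, not IncidentFox's actual schema.
from dataclasses import dataclass, field


@dataclass
class ServiceContext:
    name: str
    depends_on: list[str] = field(default_factory=list)      # downstream services
    recent_deploys: list[str] = field(default_factory=list)  # e.g. commit SHAs from the last 24h
    dashboards: list[str] = field(default_factory=list)      # links discovered in the repo / observability stack


@dataclass
class SystemModel:
    services: dict[str, ServiceContext] = field(default_factory=dict)

    def add_dependency(self, service: str, depends_on: str) -> None:
        self.services.setdefault(service, ServiceContext(service))
        self.services.setdefault(depends_on, ServiceContext(depends_on))
        self.services[service].depends_on.append(depends_on)

    def blast_radius(self, service: str) -> set[str]:
        """Everything that could be affected if `service` degrades (reverse dependency walk)."""
        affected, frontier = set(), [service]
        while frontier:
            current = frontier.pop()
            for name, ctx in self.services.items():
                if current in ctx.depends_on and name not in affected:
                    affected.add(name)
                    frontier.append(name)
        return affected


# A setup step would populate this automatically instead of asking
# an engineer to type it all in.
model = SystemModel()
model.add_dependency("checkout", "payments")
model.add_dependency("payments", "postgres")
print(model.blast_radius("postgres"))  # {'payments', 'checkout'} (order may vary)
```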
2. Stay where incidents already happen
Every incident I’ve been part of lived in Slack.
That’s where decisions happened, context was shared, and confusion spread.
So the tool is Slack-first by design. Not as a notification surface, but as the actual place where investigation happens, pulling logs, metrics, traces, and historical context directly into the thread.
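As a rough illustration of what "answering in the thread" can look like, here is a minimal bot built on Slack's Bolt SDK (`slack_bolt`). The `gather_context` helper, its hard-coded reply, and the environment variable names are assumptions for the sketch, not IncidentFox's real implementation.

```python
# Minimal sketch of the "investigate in the thread" idea using slack_bolt.
import os

from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

app = App(token=os.environ["SLACK_BOT_TOKEN"])


def gather_context(channel: str) -> str:
    # Placeholder: a real tool would pull recent deploys, error-rate deltas,
    # and similar past incidents for the services tied to this channel.
    return "Last deploy: payments @ 14:02 UTC. Error rate up 4x since 14:05."


@app.event("app_mention")
def handle_mention(event, say):
    # Reply inside the incident thread instead of sending people to another UI.
    thread_ts = event.get("thread_ts", event["ts"])
    say(text=gather_context(event["channel"]), thread_ts=thread_ts)


if __name__ == "__main__":
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()
```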
The goal wasn’t to replace existing tools, but to stop people from losing context by bouncing between them.
One more thing I didn’t expect
As incidents happen, teams leave behind a lot of implicit knowledge:
- why a hypothesis was rejected
- which signals mattered
- what ended up being noise
Most tools throw that away.
We decided to keep it. The system continuously learns from each incident, so future investigations start with more context instead of a blank slate.
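A hand-wavy sketch of what that memory could look like, assuming a simple service-overlap lookup. The `IncidentRecord` and `IncidentMemory` names are hypothetical, not taken from the project.

```python
# Hypothetical shape of the "incident memory" idea: keep what humans learned,
# then surface it when a similar investigation starts.
from dataclasses import dataclass, field


@dataclass
class IncidentRecord:
    services: set[str]
    rejected_hypotheses: list[str] = field(default_factory=list)  # e.g. "bad deploy (rollback changed nothing)"
    useful_signals: list[str] = field(default_factory=list)       # e.g. "p99 latency on /charge"
    known_noise: list[str] = field(default_factory=list)          # alerts that fired but didn't matter


class IncidentMemory:
    def __init__(self) -> None:
        self._records: list[IncidentRecord] = []

    def remember(self, record: IncidentRecord) -> None:
        self._records.append(record)

    def recall(self, affected_services: set[str]) -> list[IncidentRecord]:
        """Return past incidents that touched any of the same services."""
        return [r for r in self._records if r.services & affected_services]


memory = IncidentMemory()
memory.remember(IncidentRecord(
    services={"payments"},
    rejected_hypotheses=["bad deploy (rollback changed nothing)"],
    useful_signals=["connection pool saturation on payments-db"],
    known_noise=["CPU alert on the batch workers"],
))

# A new investigation touching `payments` starts with this instead of a blank slate.
for past in memory.recall({"payments", "checkout"}):
    print(past.useful_signals)
```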
It’s not magic, and it definitely doesn’t “solve incidents automatically”. It just tries to remember what humans already learned under pressure.
Where this is today
IncidentFox is open source (Apache 2.0), self-hostable, and very much still evolving. It won’t replace your monitoring stack, and it has blind spots I haven’t hit yet.
I’m sharing it because I want feedback from people who’ve actually been on call:
- What helped you most during investigation?
- What tools disappointed you?
- What assumptions here sound naive?
Repo is here if you’re curious:
https://github.com/incidentfox/incidentfox
I’m still learning — and incidents have a way of humbling everyone.