A few days ago Andrej Karpathy said we should build LLM-powered knowledge bases. Within 48 hours someone made Graphify, a tool that turns raw data into a semantic knowledge graph with a single command.
But what if we applied this idea to incident management?
The Problem with Incident Data
Most incident management tools tell you what just happened:
- Incident created
- Alerts triggered
- Timeline recorded
But during an actual incident, that’s not what you need. What you really need is:
- What happened last time this service broke?
- Who responded?
- What fixed it?
- What’s likely to break next?
That information exists but is buried across Slack threads, postmortems, dashboards, and logs. It’s not connected.
From Logs to Graph
We took incident data (services, alerts, responders, teams, timelines) and fed it into Graphify. Instead of treating incidents as isolated logs, they become part of a semantic graph:
Nodes: services, incidents, alerts, responders
Edges: relationships between them (co-occurrence, ownership, causality)
Now instead of querying logs, you’re querying relationships.
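To make the node and edge idea concrete, here is a minimal sketch in plain Python. It is not Graphify's API or output format. The record fields (`id`, `service`, `alerts`, `responders`) and the relation names are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical incident records; the field names are illustrative only.
incidents = [
    {"id": "INC-1", "service": "checkout",
     "alerts": ["high-latency"], "responders": ["ana"]},
    {"id": "INC-2", "service": "payments",
     "alerts": ["high-latency", "5xx-rate"], "responders": ["ana", "raj"]},
]


def build_graph(records):
    """Return an adjacency map: node -> set of (relation, neighbor) pairs."""
    graph = defaultdict(set)

    def link(a, relation, b):
        # Store each edge in both directions so it can be walked either way.
        graph[a].add((relation, b))
        graph[b].add((relation, a))

    for rec in records:
        inc = ("incident", rec["id"])
        link(inc, "affected", ("service", rec["service"]))
        for alert in rec["alerts"]:
            link(inc, "triggered_by", ("alert", alert))
        for person in rec["responders"]:
            link(inc, "handled_by", ("responder", person))
    return graph


def neighbors(graph, node, relation):
    """List the nodes connected to `node` by the given relation."""
    return sorted(n for rel, n in graph[node] if rel == relation)


graph = build_graph(incidents)
ana_incidents = neighbors(graph, ("responder", "ana"), "handled_by")
print(ana_incidents)
```

The point of the shape is that "which incidents did ana handle?" is one hop from a responder node, with no log search involved.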
What This Unlocks
1. Instant Incident Memory
When a new incident fires, you can query:
What happened last time this service broke?
And immediately get:
- similar incidents
- who handled them
- what actions resolved them
No more Slack archaeology.
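As a rough sketch of what that lookup could look like, assume past incidents are available as simple records. The fields (`opened`, `responders`, `resolution`) and the `last_incidents` helper are hypothetical, not part of Graphify or Rootly.

```python
from datetime import date

# Hypothetical incident history; every value here is made up.
history = [
    {"id": "INC-101", "service": "checkout", "opened": date(2024, 1, 5),
     "responders": ["ana"], "resolution": "rolled back deploy"},
    {"id": "INC-117", "service": "search", "opened": date(2024, 2, 2),
     "responders": ["li"], "resolution": "rebuilt index"},
    {"id": "INC-130", "service": "checkout", "opened": date(2024, 3, 9),
     "responders": ["raj", "ana"], "resolution": "raised DB pool size"},
]


def last_incidents(history, service, limit=3):
    """Return the most recent incidents for a service, newest first."""
    matches = [inc for inc in history if inc["service"] == service]
    matches.sort(key=lambda inc: inc["opened"], reverse=True)
    return matches[:limit]


recent = last_incidents(history, "checkout")
for inc in recent:
    people = ", ".join(inc["responders"])
    print(f'{inc["id"]} ({inc["opened"]}): {people} -> {inc["resolution"]}')
```

A real version would likely match on more than the service name (shared alerts, similar symptoms), but even this exact-match lookup answers the first three questions from the list above.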
2. Blast Radius Prediction
If Service X goes down, the graph can tell you:
Services Y and Z usually fail shortly after.
It can say this because it has learned co-failure patterns over time.
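One simple way to approximate a co-failure pattern is to count how often a service fails within a fixed window after another one. The sketch below assumes a list of `(service, timestamp)` failure events and a 30 minute window. Both are illustrative choices, and this counting approach is a stand-in, not how Graphify derives its edges.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical failure events: (service, time the incident started).
failures = [
    ("api-gateway", datetime(2024, 1, 5, 10, 0)),
    ("checkout", datetime(2024, 1, 5, 10, 10)),
    ("search", datetime(2024, 1, 5, 10, 20)),
    ("api-gateway", datetime(2024, 2, 1, 9, 0)),
    ("checkout", datetime(2024, 2, 1, 9, 5)),
    ("search", datetime(2024, 3, 1, 12, 0)),
]


def follow_rates(failures, window=timedelta(minutes=30)):
    """Map (first, follower) -> share of `first` failures followed by `follower`."""
    events = sorted(failures, key=lambda e: e[1])
    totals = Counter(svc for svc, _ in events)
    followed = Counter()
    for i, (svc, ts) in enumerate(events):
        seen = set()  # count each follower at most once per failure
        for other, other_ts in events[i + 1:]:
            if other_ts - ts > window:
                break  # events are sorted, so nothing later can match
            if other != svc and other not in seen:
                seen.add(other)
                followed[(svc, other)] += 1
    return {pair: n / totals[pair[0]] for pair, n in followed.items()}


def blast_radius(rates, service):
    """Services that tend to fail after `service`, most likely first."""
    hits = [(b, r) for (a, b), r in rates.items() if a == service]
    return sorted(hits, key=lambda item: (-item[1], item[0]))


rates = follow_rates(failures)
print(blast_radius(rates, "api-gateway"))
```

With this toy data, checkout follows both api-gateway failures and search follows one of the two. Keep in mind this is correlation in time, not proof of causality, and a handful of incidents is a thin sample.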
3. Smarter Onboarding
Instead of asking a new SRE to read 200 past incidents:
Here’s the graph. These are the hot spots, these teams own these systems, this is how everything connects.
It’s a map of your infrastructure's reality across time, not boring, disconnected documentation.
4. Team Load Visibility
You can connect:
- incident volume
- team ownership
- responder activity
And suddenly you can see which teams absorbed the most load relative to their size. This is where things like burnout start to become visible in the data.
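"Load relative to size" can be as simple as incidents per person. Here is a small sketch under that assumption. The team names, headcounts, and incident counts are invented, and a real measure would probably also weigh severity and hours spent.

```python
from collections import Counter

# Hypothetical data: headcount per team, and the owning team of each incident.
team_size = {"payments": 4, "platform": 10, "search": 5}
incident_owners = ["payments"] * 6 + ["platform"] * 10 + ["search"] * 2


def load_per_person(incident_owners, team_size):
    """Rank teams by incidents per team member, heaviest load first."""
    counts = Counter(incident_owners)
    load = {team: counts[team] / size for team, size in team_size.items()}
    return sorted(load.items(), key=lambda kv: kv[1], reverse=True)


ranking = load_per_person(incident_owners, team_size)
for team, load in ranking:
    print(f"{team}: {load:.2f} incidents per person")
```

Here platform has the most incidents in absolute terms, but payments carries the heaviest load per person, which is the kind of imbalance raw incident counts hide.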
5. Alert Signal vs Noise
Because alerts are tied to actual incidents in the graph, you can rank:
- alerts that frequently lead to real incidents
- alerts that never matter
This gives you a way to tune or delete alerts, backed by evidence.
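A minimal way to score this is the share of an alert's firings that were linked to an incident. The sketch assumes each firing is recorded as `(alert_name, incident_id_or_None)`. That shape and the alert names are hypothetical.

```python
from collections import defaultdict

# Hypothetical alert firings: (alert name, linked incident id or None).
firings = [
    ("disk-full", "INC-1"),
    ("disk-full", "INC-2"),
    ("cpu-spike", None),
    ("cpu-spike", None),
    ("cpu-spike", "INC-3"),
    ("cert-expiry-warning", None),
]


def alert_precision(firings):
    """Rank alerts by the fraction of firings tied to a real incident."""
    total = defaultdict(int)
    linked = defaultdict(int)
    for alert, incident_id in firings:
        total[alert] += 1
        if incident_id is not None:
            linked[alert] += 1
    scores = {alert: linked[alert] / total[alert] for alert in total}
    # Highest precision first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


ranked = alert_precision(firings)
for alert, score in ranked:
    print(f"{alert}: {score:.0%} of firings led to an incident")
```

Alerts at the bottom of the ranking are candidates for tuning or deletion, though an alert with very few firings deserves a closer look before it is removed.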
6. Surfacing Dependencies
Some services consistently fail together, even if no one documented the dependency.
The graph reveals what actually depends on what, based on real incident, team, and alert data.
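One hedged way to surface such pairs: count how often two services appear in the same incident, then subtract the dependencies you already have on record. The `documented` set, the service names, and the `min_count` threshold below are all assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: the set of services affected in each past incident.
incident_services = [
    {"checkout", "payments"},
    {"checkout", "payments", "search"},
    {"checkout", "payments"},
    {"search", "auth"},
]
# Dependencies someone already wrote down (unordered pairs).
documented = {frozenset({"search", "auth"})}


def undocumented_pairs(incident_services, documented, min_count=2):
    """Pairs that co-fail at least `min_count` times but are not documented."""
    counts = Counter()
    for services in incident_services:
        for a, b in combinations(sorted(services), 2):
            counts[frozenset((a, b))] += 1
    return sorted(
        (tuple(sorted(pair)), n)
        for pair, n in counts.items()
        if n >= min_count and pair not in documented
    )


suspects = undocumented_pairs(incident_services, documented)
print(suspects)
```

Co-occurrence only says two services fail together. It does not say which one depends on the other, so treat each pair as a lead to investigate.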
Where This Gets Really Interesting
Once you have this graph, it becomes a foundation for:
- Slack bots that auto-post relevant context during incidents
- AI SREs with memory
- Querying your system like a knowledge base instead of dashboards
This shifts on-call teams from repeatedly rediscovering solutions to building accumulated knowledge over time.
Small Plug (If You Use Rootly)
If you’re using Rootly, I built a small plugin to explore your incident data with Graphify:

https://github.com/Rootly-AI-Labs/rootly-graphify-importer
Final Thoughts
Incident management data is already rich. It's full of signals across alerts, incidents, and responses, but it rarely captures how things relate.
Graphify flips that: it turns logs into knowledge, builds connections across events, and turns history into memory.
Once you see your system as a graph, one that turns scattered data into something you can filter, query, and explore, it’s hard to go back.