Hamza

Posted on Apr 13

Using Graphify to turn Incident Data into a Knowledge Graph

#sre #ai #devops #llm

A few days ago Andrej Karpathy said we should build LLM-powered knowledge bases. Within 48 hours someone made Graphify, a tool that turns raw data into a semantic knowledge graph with a single command.

But what if we applied this idea to incident management?

The Problem with Incident Data

Most incident management tools tell you what just happened:

Incident created
Alerts triggered
Timeline recorded

But during an actual incident, that’s not what you need. What you really need is:

What happened last time this service broke?
Who responded?
What fixed it?
What’s likely to break next?

That information exists but is buried across Slack threads, postmortems, dashboards, and logs. It’s not connected.

From Logs to Graph

We took incident data (services, alerts, responders, teams, timelines) and fed it into Graphify. Instead of treating incidents as isolated logs, they become part of a semantic graph:

Nodes: services, incidents, alerts, responders
Edges: relationships between them (co-occurrence, ownership, causality)

Now instead of querying logs, you’re querying relationships.
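To make the node-and-edge model concrete, here is a minimal sketch in plain Python. It is not Graphify's API or output format: the record fields, relation names, and helper functions are all hypothetical and only illustrate the shape of the idea.

```python
from collections import defaultdict

# Hypothetical incident records; the field names are illustrative,
# not a Graphify or Rootly schema.
incidents = [
    {"id": "INC-1", "service": "checkout", "alerts": ["high_latency"], "responders": ["alice"]},
    {"id": "INC-2", "service": "checkout", "alerts": ["high_latency", "5xx_rate"], "responders": ["bob"]},
    {"id": "INC-3", "service": "search", "alerts": ["5xx_rate"], "responders": ["alice"]},
]

def build_graph(records):
    """Return an undirected adjacency map: node -> set of (relation, neighbor)."""
    graph = defaultdict(set)

    def link(a, relation, b):
        # Store the edge in both directions so it can be walked from either end.
        graph[a].add((relation, b))
        graph[b].add((relation, a))

    for rec in records:
        inc = ("incident", rec["id"])
        link(inc, "affects", ("service", rec["service"]))
        for alert in rec["alerts"]:
            link(inc, "triggered_by", ("alert", alert))
        for person in rec["responders"]:
            link(inc, "handled_by", ("responder", person))
    return graph

def neighbors(graph, node, kind):
    """Sorted names of the neighbors of `node` whose node type is `kind`."""
    return sorted(name for _, (k, name) in graph[node] if k == kind)

graph = build_graph(incidents)
print(neighbors(graph, ("service", "checkout"), "incident"))  # incidents touching checkout
print(neighbors(graph, ("responder", "alice"), "incident"))   # incidents alice worked
```

Nodes are typed tuples such as ("service", "checkout"), so a question like "which incidents touched this service?" becomes a one-hop neighbor lookup rather than a log search.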

What This Unlocks

1. Instant Incident Memory
When a new incident fires, you can query:

What happened last time this service broke?

And immediately get:

similar incidents
who handled them
what actions resolved them

No more Slack archaeology.
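As a rough sketch of what that lookup might look like, assuming incidents have already been extracted into simple records (the `Incident` fields and the sample data below are made up for illustration and are not what Graphify produces):

```python
from dataclasses import dataclass

@dataclass
class Incident:
    id: str
    service: str
    started: str       # ISO 8601 date, so string order matches time order
    responders: list
    resolution: str

# Hypothetical history, for illustration only.
history = [
    Incident("INC-1", "checkout", "2024-01-10", ["alice"], "rolled back deploy"),
    Incident("INC-2", "search", "2024-02-02", ["bob"], "restarted indexer"),
    Incident("INC-3", "checkout", "2024-03-15", ["bob", "carol"], "scaled up DB pool"),
]

def last_time(service, incidents, limit=3):
    """Most recent incidents for `service`, newest first."""
    past = [i for i in incidents if i.service == service]
    past.sort(key=lambda i: i.started, reverse=True)
    return past[:limit]

for inc in last_time("checkout", history):
    print(inc.id, inc.responders, inc.resolution)
```

A real setup would match on more than the service name (alert type, symptoms, time of day), but even this exact-match version answers the three questions above in one call.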

2. Blast Radius Prediction
If Service X goes down, the graph can tell you:

Services Y and Z usually fail shortly after.

Because it has learned co-failure patterns over time.
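One simple way to approximate such co-failure patterns, without any graph tooling, is to count how often each other service has an incident shortly after the one you care about. The sketch below assumes a 30-minute window and uses made-up events; both are illustrative choices, not something Graphify prescribes.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical (service, incident start time) pairs.
events = [
    ("x", datetime(2024, 1, 1, 10, 0)),
    ("y", datetime(2024, 1, 1, 10, 5)),
    ("z", datetime(2024, 1, 1, 10, 20)),
    ("x", datetime(2024, 2, 1, 9, 0)),
    ("y", datetime(2024, 2, 1, 9, 10)),
    ("x", datetime(2024, 3, 1, 8, 0)),
]

def followers(service, events, window=timedelta(minutes=30)):
    """Map each other service to the fraction of `service` failures
    that it followed within `window`."""
    starts = [t for s, t in events if s == service]
    if not starts:
        return {}
    counts = Counter()
    for start in starts:
        # A set, so a service is counted at most once per triggering failure.
        hit = {s for s, t in events if s != service and start < t <= start + window}
        counts.update(hit)
    return {s: n / len(starts) for s, n in counts.items()}

print(followers("x", events))
```

With this sample data, y followed two of x's three failures and z followed one. These are observed frequencies, not proof of causation, so treat them as a hint about where to look first.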

3. Smarter Onboarding
Instead of asking a new SRE to read 200 past incidents:

Here’s the graph. These are the hot spots, these teams own these systems, this is how everything connects.

It’s a map of your infrastructure reality across time, not a pile of boring, disconnected documentation.

4. Team Load Visibility
You can connect:

incident volume
team ownership
responder activity

And suddenly you can see which teams absorbed the most load relative to their size. This is where things like burnout start to become visible in the data.
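A back-of-the-envelope version of "load relative to size" is incidents per engineer. The team names, sizes, and ownership mapping below are hypothetical:

```python
from collections import Counter

# Hypothetical team sizes and incident ownership.
team_size = {"platform": 8, "payments": 3}
incident_owner = {
    "INC-1": "payments",
    "INC-2": "payments",
    "INC-3": "platform",
    "INC-4": "payments",
}

def load_per_engineer(incident_owner, team_size):
    """Incidents per team member, highest load first."""
    counts = Counter(incident_owner.values())
    load = {team: counts[team] / size for team, size in team_size.items() if size > 0}
    return sorted(load.items(), key=lambda kv: kv[1], reverse=True)

for team, load in load_per_engineer(incident_owner, team_size):
    print(f"{team}: {load:.2f} incidents per engineer")
```

Raw incident counts would rank these two teams the same way here, but normalizing by team size is what makes a small team carrying a moderate volume stand out.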

5. Alert Signal vs Noise
Because alerts are tied to actual incidents in the graph, you can rank:

alerts that frequently lead to real incidents
alerts that never matter

This gives you an evidence-backed way to tune or delete alerts.
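One way to rank alerts, sketched here with made-up data, is the share of each alert's firings that were linked to a real incident. The (alert, incident) pair format is an assumption for illustration, not a Graphify or Rootly export format.

```python
from collections import Counter

# Hypothetical alert firings: (alert name, linked incident ID or None).
alert_firings = [
    ("high_latency", "INC-1"),
    ("high_latency", "INC-2"),
    ("high_latency", None),
    ("disk_80pct", None),
    ("disk_80pct", None),
    ("5xx_rate", "INC-2"),
]

def alert_precision(firings):
    """Return (alert, fraction of firings tied to an incident, total firings),
    sorted from most to least meaningful."""
    totals, linked = Counter(), Counter()
    for alert, incident_id in firings:
        totals[alert] += 1
        if incident_id is not None:
            linked[alert] += 1
    rows = [(a, linked[a] / totals[a], totals[a]) for a in totals]
    return sorted(rows, key=lambda r: (-r[1], r[0]))

for alert, precision, total in alert_precision(alert_firings):
    print(f"{alert}: {precision:.0%} of {total} firings led to an incident")
```

Alerts at the bottom of this list with many firings are the candidates for tuning or deletion. Keep the total next to the ratio, since a 100% score from a single firing is weak evidence.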

6. Surfacing Dependencies
Some services consistently fail together, even if no one documented the dependency.
The graph reveals what actually depends on what, based on real incident, team, and alert data.
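A simple stand-in for this kind of analysis is a Jaccard score over the incidents each pair of services appears in. This is a sketch of one possible technique, not how Graphify infers relationships, and the incident-to-services mapping is invented for the example.

```python
from itertools import combinations

# Hypothetical mapping of incident -> services affected in that incident.
incident_services = {
    "INC-1": {"api", "cache"},
    "INC-2": {"api", "cache", "db"},
    "INC-3": {"db"},
    "INC-4": {"api", "cache"},
}

def co_failure_scores(incident_services):
    """Jaccard similarity of the incident sets for every pair of services
    that shared at least one incident."""
    by_service = {}
    for inc, services in incident_services.items():
        for s in services:
            by_service.setdefault(s, set()).add(inc)
    scores = {}
    for a, b in combinations(sorted(by_service), 2):
        shared = by_service[a] & by_service[b]
        if shared:
            scores[(a, b)] = len(shared) / len(by_service[a] | by_service[b])
    return scores

for pair, score in sorted(co_failure_scores(incident_services).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

A score near 1.0 means two services almost always show up in the same incidents, which is worth checking against your documented dependencies. It suggests a link but does not say which service depends on which.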

Where This Gets Really Interesting

Once you have this graph, it becomes a foundation for:

Slack bots that auto-post relevant context during incidents
AI SREs with memory
Querying your system like a knowledge base instead of dashboards

This shifts on-call teams from repeatedly rediscovering solutions to building accumulated knowledge over time.

Small Plug (If You Use Rootly)

If you’re using Rootly, I built a small plugin to explore your incident data with Graphify:

https://github.com/Rootly-AI-Labs/rootly-graphify-importer

Final Thoughts

Incident management data is already rich. It's full of signals across alerts, incidents, and responses, but it rarely captures how things relate.

Graphify flips that: it turns logs into knowledge, builds connections across events, and turns history into memory.

Once you see your system as a graph that turns scattered data into something you can filter, query, and explore, it’s hard to go back.

Top comments (1)

Miloslav Homer • Apr 13

GraphRAG seems to me like the highest potential improvement coming from AI. Sounds like a great approach to this much too common problem.

I'd like to ask, please: how do you ensure the ontology stays consistent and compact? This was a pain point in my experiments.