DEV Community: Hamza

Using Graphify to turn Incident Data into a Knowledge Graph

Hamza — Mon, 13 Apr 2026 14:20:22 +0000

A few days ago Andrej Karpathy said we should build LLM-powered knowledge bases. Within 48 hours, someone made Graphify, a tool that turns raw data into a semantic knowledge graph with a single command.

But what if we applied this idea to incident management?

The Problem with Incident Data

Most incident management tools tell you what just happened:

Incident created
Alerts triggered
Timeline recorded

But during an actual incident, that’s not what you need. What you really need is:

What happened last time this service broke?
Who responded?
What fixed it?
What’s likely to break next?

That information exists but is buried across Slack threads, postmortems, dashboards, and logs. It’s not connected.

From Logs to Graph

We took incident data (services, alerts, responders, teams, timelines) and fed it into Graphify. Instead of treating incidents as isolated logs, they become part of a semantic graph:

Nodes: services, incidents, alerts, responders
Edges: relationships between them (co-occurrence, ownership, causality)

Now instead of querying logs, you’re querying relationships.
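
To make the node-and-edge model concrete, here is a minimal sketch in plain Python. It does not use Graphify's API; the record fields, relation names, and helper functions are all hypothetical and only illustrate the idea of linking incidents to services, alerts, and responders.

```python
from collections import defaultdict

# Hypothetical incident records; field names are illustrative only.
incidents = [
    {"id": "INC-1", "service": "checkout", "alert": "high_latency", "responder": "dana"},
    {"id": "INC-2", "service": "checkout", "alert": "error_rate", "responder": "lee"},
    {"id": "INC-3", "service": "search", "alert": "high_latency", "responder": "dana"},
]

def build_graph(records):
    """Return an adjacency map: node -> set of (relation, neighbor) pairs."""
    graph = defaultdict(set)
    for rec in records:
        incident = ("incident", rec["id"])
        links = [
            ("affects", ("service", rec["service"])),
            ("triggered_by", ("alert", rec["alert"])),
            ("handled_by", ("responder", rec["responder"])),
        ]
        for relation, node in links:
            graph[incident].add((relation, node))
            # Store the reverse edge so queries can start from any node.
            graph[node].add((relation + "_inverse", incident))
    return graph

def incidents_for(graph, node):
    """List incident ids connected to a service, alert, or responder node."""
    return sorted(n[1] for _, n in graph.get(node, ()) if n[0] == "incident")

graph = build_graph(incidents)
print(incidents_for(graph, ("service", "checkout")))
print(incidents_for(graph, ("responder", "dana")))
```

Starting from a service or responder node and walking to its incidents is the kind of relationship query that a flat log search makes awkward.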

What This Unlocks

1. Instant Incident Memory
When a new incident fires, you can query:

What happened last time this service broke?

And immediately get:

similar incidents
who handled them
what actions resolved them

No more Slack archaeology.

2. Blast Radius Prediction
If Service X goes down, the graph can tell you:

Services Y and Z usually fail shortly after.

Because it has learned co-failure patterns over time.
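
One simple way to approximate a co-failure pattern is to count how often other services fail shortly after a given service. The sketch below is an assumption about how such a count could work, not a description of what Graphify does internally; the event data, the 30-minute window, and the `followers` helper are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical (service, failure start time) pairs.
failures = [
    ("x", datetime(2026, 1, 5, 10, 0)),
    ("y", datetime(2026, 1, 5, 10, 12)),
    ("z", datetime(2026, 1, 5, 10, 25)),
    ("x", datetime(2026, 2, 9, 3, 0)),
    ("y", datetime(2026, 2, 9, 3, 20)),
    ("z", datetime(2026, 3, 1, 8, 0)),
]

def followers(events, service, window=timedelta(minutes=30)):
    """Fraction of `service` failures followed by each other service within `window`."""
    counts = Counter()
    triggers = [t for s, t in events if s == service]
    for start in triggers:
        # Use a set so a service is counted at most once per trigger.
        seen = {s for s, t in events if s != service and start < t <= start + window}
        counts.update(seen)
    return {s: n / len(triggers) for s, n in counts.items()}

print(followers(failures, "x"))
```

A high fraction only says the failures tend to occur together. It does not prove that one service caused the other to fail.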

3. Smarter Onboarding
Instead of asking a new SRE to read 200 past incidents:

Here’s the graph. These are the hot spots, these teams own these systems, this is how everything connects.

It’s a map of your infrastructure’s reality over time, not boring, disconnected documentation.

4. Team Load Visibility
You can connect:

incident volume
team ownership
responder activity

And suddenly see which teams absorbed the most load relative to their size. This is where things like burnout start to become visible in the data.
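
As a rough illustration of "load relative to size", the sketch below divides incident volume by headcount. The team names and numbers are made up, and a real analysis would likely also weigh severity and timing.

```python
# Hypothetical incident counts and team sizes for one quarter.
incident_counts = {"payments": 24, "search": 6, "platform": 30}
team_sizes = {"payments": 4, "search": 3, "platform": 10}

def load_per_engineer(counts, sizes):
    """Return (team, incidents per engineer) pairs, heaviest load first."""
    ratios = [(team, counts[team] / sizes[team]) for team in counts]
    return sorted(ratios, key=lambda pair: pair[1], reverse=True)

for team, ratio in load_per_engineer(incident_counts, team_sizes):
    print(f"{team}: {ratio:.1f} incidents per engineer")
```

Here the platform team has the most incidents overall, but the payments team carries twice the load per person.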

5. Alert Signal vs Noise
Because alerts are tied to actual incidents in the graph, you can rank:

alerts that frequently lead to real incidents
alerts that never matter

This gives you an evidence-backed way to tune or delete alerts.
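
A minimal sketch of that ranking, assuming each alert firing has already been labeled with whether it was linked to a real incident. The data and the `rank_alerts` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical alert firings: (alert name, was it linked to a real incident?)
firings = [
    ("disk_full", True), ("disk_full", True), ("disk_full", False),
    ("cpu_spike", False), ("cpu_spike", False),
    ("error_rate", True),
]

def rank_alerts(events):
    """Return (alert, incident rate, total firings), highest rate first."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for name, led_to_incident in events:
        totals[name] += 1
        hits[name] += int(led_to_incident)
    ranked = [(name, hits[name] / totals[name], totals[name]) for name in totals]
    return sorted(ranked, key=lambda row: row[1], reverse=True)

for name, rate, total in rank_alerts(firings):
    print(f"{name}: {rate:.0%} of {total} firings led to an incident")
```

Keeping the total next to the rate matters: an alert that fired once and happened to match an incident is weaker evidence than one with a long history.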

6. Surfacing Dependencies
Some services consistently fail together, even if no one documented the dependency.
The graph reveals what actually depends on what, based on real incident, team, and alert data.

Where This Gets Really Interesting

Once you have this graph, it becomes a foundation for:

Slack bots that auto-post relevant context during incidents
AI SREs with memory
Querying your system like a knowledge base instead of dashboards

This shifts on-call teams from repeatedly rediscovering solutions to building accumulated knowledge over time.

Small Plug (If You Use Rootly)

If you’re using Rootly, I built a small plugin to explore your incident data with Graphify:

https://github.com/Rootly-AI-Labs/rootly-graphify-importer

Final Thoughts

Incident management data is already rich. It's full of signals across alerts, incidents, and responses but rarely captures how things relate.

Graphify flips that, turning logs into knowledge, building connections across events, and turning history into memory.

Once you see your system as a graph, one that turns scattered data into something you can filter, query, and explore, it’s hard to go back.

On-Call Burnout: What Incident Data Doesn’t Show

Hamza — Tue, 10 Mar 2026 06:11:55 +0000

Incident dashboards measure system health. However, they rarely show the workload and strain that engineers face when responding to alerts.

How On-Call Works

On-call engineers are developers responsible for resolving critical system outages. When incidents happen, such as outages, performance issues, or system failures, the on-call engineer is alerted and responds by investigating and resolving the issue. Schedules rotate so that someone is always available to respond and resolve these incidents.

Teams track incident metrics like incident counts, Mean Time to Repair (MTTR), Mean Time to Detect (MTTD), and escalations. These metrics give insight into system reliability, but say very little about how incident response affects the engineers handling those alerts. While teams focus on system health, the health of the team itself can silently become strained.

The Numbers Behind On-Call Work

Industry data shows that on-call work takes up a significant share of engineering time. The 2025 SRE Report states that engineers spend a median of 30% of their week on operational work, up from 25% the year before.

Incidents themselves are also common. 46% of SREs reported responding to more than five incidents in the last 30 days, while 23% handled between 6 and 10 incidents. At that level, the load can quickly lead to burnout when combined with other engineering responsibilities and external factors ("SRE report 2025", 2026).

These metrics outline a fundamental gap: incident counts and traditional incident data alone can't fully capture how workload is experienced by the engineers responding to alerts.

Burnout is widespread. A 2024 survey found that about 65% of engineers report currently experiencing burnout (Chevireddy, 2025).

Burnout Is Hard to See

Burnout rarely appears directly in dashboards. Most incident management tools (e.g. Rootly, PagerDuty) track metrics such as incident volume, response time, resolution time, and escalation frequency.

But they rarely capture how that workload is experienced by the engineers responding to incidents.

For example, two engineers may each handle five incidents in a week, which on the surface looks like the same load. However, if one of them received most of their alerts late at night or during personal time, the experience is completely different.
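
The two-engineer example can be sketched in a few lines of Python. The working hours (09:00 to 18:00, weekdays), the page timestamps, and the helper name are assumptions made for illustration; real schedules and time zones would need more care.

```python
from datetime import datetime

WORK_START, WORK_END = 9, 18  # assumed working hours, local time

def is_after_hours(ts):
    """True for weekends or times outside the assumed working hours."""
    return ts.weekday() >= 5 or not (WORK_START <= ts.hour < WORK_END)

# Hypothetical page timestamps: five incidents each in the same week.
pages = {
    "engineer_a": [datetime(2026, 3, 2, 10), datetime(2026, 3, 2, 14),
                   datetime(2026, 3, 3, 11), datetime(2026, 3, 4, 15),
                   datetime(2026, 3, 5, 9)],
    "engineer_b": [datetime(2026, 3, 2, 23), datetime(2026, 3, 3, 2),
                   datetime(2026, 3, 4, 3), datetime(2026, 3, 5, 14),
                   datetime(2026, 3, 7, 12)],
}

after_hours_share = {
    name: sum(is_after_hours(t) for t in times) / len(times)
    for name, times in pages.items()
}
print(after_hours_share)
```

Both engineers have the same incident count, yet under these assumptions one has no after-hours pages and the other has four out of five.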

Patterns like after-hours alerts, uneven incident distribution, and repeated interruptions can accumulate into fatigue or burnout. They can leave responders with a lingering unease, a sense that another alert could arrive at any moment.

These signals are also often scattered across multiple tools, which makes them difficult to detect early. Late-night pull requests, after-hours ticket responses, and consecutive incident response days often go untracked. Looking deeper at incident data revealed something important.

Incident Load Is About Patterns

Many Site Reliability Engineer (SRE) responders shared the same thought: the same number of incidents can feel very different depending on who receives them and when they occur.

A handful of alerts during the workday is very different from repeated pages late at night or during personal time. But timing is only part of the picture. Incident response rarely happens in isolation. Engineers are often working on multiple things: feature work, code reviews, tickets, and meetings just to name a few.

What matters most is not just how many incidents occur, but how that workload accumulates and snowballs across responders and when it happens.

An engineer who receives many alerts during a quiet week may experience them very differently from someone who is balancing deadlines, reviewing pull requests late, or juggling multiple tools like GitHub, Slack, and ticketing systems.

Personal context is also a huge factor. An alert during working hours may be manageable, whereas the same alert during dinner, family time, or late at night can feel far more disruptive. Over time, these interruptions can snowball and change how responders experience on-call.

In other words, incident load isn't only about the count of incidents, but the pattern of work surrounding those incidents: when they happen, how frequently they interrupt responders, and what other responsibilities engineers are managing at the same time.

Introducing On-Call Health

This observation led the team at Rootly’s AI Labs to build On-Call Health, an open-source project designed to help teams understand workload patterns in incident response.

The platform connects to the tools responders already use: Rootly, PagerDuty, GitHub, Slack, Linear, and Jira, collecting data and creating signals to understand how incident response load evolves over time.

At the center of this is the On-Call Health (OCH) Score, a composite workload score derived from incident response activity, engineering workload, and work pattern signals across integrated systems.

Tracking Trends Instead of Thresholds

Rather than relying on fixed thresholds, On-Call Health compares each engineer’s workload against their own historical baseline, since each engineer handles incident pressure differently.

For example, an incident volume that’s normal for an experienced SRE might be overwhelming for someone six months into their first rotation. Static thresholds miss this.

Instead, On-Call Health tracks workload trends over time, signaling when activity is stable, increasing, or shifting significantly from a responder’s baseline. Changes in these trends can act as early warning signals that incident response load may be becoming unsustainable.
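
To illustrate the difference between a fixed threshold and a personal baseline, here is a minimal sketch that measures how far the current week sits from an engineer's own history, in standard deviations. This is not On-Call Health's actual scoring method, which the post does not specify; the function name and the sample numbers are hypothetical.

```python
import math
from statistics import mean, stdev

def baseline_shift(history, current):
    """Standard deviations between `current` and the mean of `history`."""
    if len(history) < 2:
        return 0.0  # not enough history to define a baseline
    center, spread = mean(history), stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is a shift.
        if current == center:
            return 0.0
        return math.copysign(math.inf, current - center)
    return (current - center) / spread

# Hypothetical weekly incident counts.
senior_history = [8, 10, 9, 11, 10, 12]
junior_history = [2, 3, 2, 3, 2, 3]

print(baseline_shift(senior_history, 11))  # close to their usual load
print(baseline_shift(junior_history, 8))   # far above their usual load
```

A fixed threshold of, say, ten incidents per week would flag the experienced engineer and miss the newer one, while the baseline comparison does the opposite.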

Contributing Signals

To detect these patterns, On-Call Health combines activity signals from multiple tools and platforms engineers already use:

Primary Response Signals: Collected from incident platforms such as Rootly and PagerDuty.

Incident Volume
Incident Severity
Time to acknowledge incidents
Time to resolve incidents
After-hours pages
Consecutive on-call days

Work Pattern Signals: Collected from GitHub, Slack, Rootly, and PagerDuty to identify when work is happening and how often responders are active outside normal working hours.

After-hours Activity
Weekend Activity
On-Call Shift Frequency

Engineering Workload Signals: Collected from development and ticketing platforms such as GitHub, Jira, and Linear, to get insight into the engineering work happening alongside incident response.

Assigned Issues
Pull Request volume
Commit frequency
Code reviews
Tickets assigned and Priority levels
Overlapping Work

Engineer Check-in Signals
On-Call Health also sends short Slack check-in surveys inspired by Apple Health’s State of Mind feature. This captures how engineers feel alongside operational data and helps surface trends in the responder timeline and personal life.

Try it Yourself!

On-Call Health is open source and free to use.

You can explore the project or try the demo with preloaded mock data:

Website: https://oncallhealth.ai
GitHub: https://github.com/Rootly-AI-Labs/On-Call-Health

We're continuing to experiment with ways to better understand responder workload and make on-call systems more sustainable for the engineers running them. Feedback, ideas, and contributions are always welcome.

Final Tips to Keep in Mind

Incident volume alone doesn’t show the full picture. When alerts happen matters as much as how many occur.
After-hours interruptions add up. Late-night pages, weekend work, and repeated disruptions often drive burnout more than raw incident volume.
Workload snowballs. Incident response happens side-by-side with development work, reviews, tickets, and meetings.
Look for trends, not only metrics. Changes in workload trends can reveal stress long before traditional dashboards do.

References

Chevireddy, S. (2025, June 29). AI, Burnout, and the Future: Navigating the New Era of Software Engineering. Medium. https://medium.com/@sraavanchevireddy/ai-burnout-and-the-future-navigating-the-new-era-of-software-engineering-9fbc54b658f9
The SRE report 2025. (2026, March 4). Catchpoint | Internet Performance Monitoring (IPM). https://www.catchpoint.com/asset/2025-sre-report