Hamza
On-Call Burnout: What Incident Data Doesn’t Show

Incident dashboards measure system health, but they rarely reveal the workload experienced by the engineers responding to alerts.

Typical On-Call

On-call engineers are developers responsible for responding to and resolving critical system outages. Incidents happen, alerts fire, and on-call engineers respond. Teams rotate on-call duties so that someone is always responsible for resolving outages, performance issues, or system failures when they occur.

Teams track incident metrics like incident counts, Mean Time to Repair (MTTR), Mean Time to Detect (MTTD), and escalation counts. These metrics give insight into system reliability, but say little about how incident response affects the engineers handling those alerts. While teams focus heavily on system health, the health of the team itself can quietly become strained.


The Numbers Behind On-Call Work

Industry data shows that on-call work takes up a significant share of engineering time. According to the 2025 SRE Report, engineers spend a median of 30% of their week on operational work, up from 25% the year before.

Incidents themselves are also common: 46% of SREs reported responding to more than five incidents in the last 30 days, and 23% handled between 6 and 10. At that level, the load can quickly lead to burnout when combined with other engineering responsibilities and external factors ("SRE report 2025", 2026).

These metrics outline a fundamental gap: incident counts and traditional incident data alone fail to capture how that workload is actually experienced by the engineers responding to alerts.

Burnout is widespread. A 2024 survey found that about 65% of engineers report currently experiencing burnout (Chevireddy, 2025).


Burnout Is Hard to See

Burnout rarely appears directly in dashboards. Most incident management tools (e.g. Rootly, PagerDuty) track metrics such as incident volume, response time, resolution time, and escalation frequency.

But they rarely capture how that workload is experienced by the engineers responding to incidents.

For example, two engineers may each handle five incidents in a week, which at surface level may not reveal much. However, if one of them received most of their alerts late at night or during personal time, the experience is completely different.

Over time, patterns like after-hours alerts, uneven incident distribution, and repeated interruptions can accumulate into fatigue or burnout, leaving responders with a lingering unease that another alert could arrive at any moment.
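To make that timing distinction concrete, here is a minimal sketch. The working-hours window, the helper names, and the sample data are all assumptions invented for this illustration, not anything from a real tool:

```python
from datetime import datetime

# Hypothetical working-hours window: 9:00-18:00 local time, Monday-Friday
WORK_START, WORK_END = 9, 18

def is_after_hours(ts: datetime) -> bool:
    """An alert counts as after-hours on weekends or outside 9:00-18:00."""
    return ts.weekday() >= 5 or not (WORK_START <= ts.hour < WORK_END)

def after_hours_share(alerts: list[datetime]) -> float:
    """Fraction of alerts that interrupted personal time."""
    if not alerts:
        return 0.0
    return sum(is_after_hours(ts) for ts in alerts) / len(alerts)

# Two engineers, five incidents each: same count, very different experience
engineer_a = [datetime(2025, 3, 3, h) for h in (10, 11, 14, 15, 16)]
engineer_b = [datetime(2025, 3, 3, 2),   # 2 a.m. page
              datetime(2025, 3, 4, 23),  # late evening
              datetime(2025, 3, 8, 13),  # a Saturday
              datetime(2025, 3, 5, 10),  # ordinary workday alert
              datetime(2025, 3, 6, 1)]   # another overnight page

print(after_hours_share(engineer_a))  # 0.0
print(after_hours_share(engineer_b))  # 0.8
```

Both engineers show "five incidents" on a dashboard, but only the second number hints at the disruption to personal time.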

These signals are also often scattered across multiple tools, which makes them difficult to detect early. Late-night pull requests, after-hours ticket responses, and consecutive incident response days often go untracked. Surfacing burnout risk means looking deeper at the patterns behind the incident data.


Incident Load Is About Patterns

Many Site Reliability Engineer (SRE) responders shared the same thought: the same number of incidents can feel very different depending on who receives them and when they occur.

A handful of alerts during the workday is very different from repeated pages late at night or during personal time. But timing is only part of the picture. Incident response rarely happens in isolation. Engineers are often working on multiple things: feature work, code reviews, tickets, and meetings just to name a few.

What matters most is not just how many incidents occur, but how that workload accumulates across responders and when it happens.

For example, an engineer who receives many alerts during a quiet week may experience them very differently from someone who is already balancing deadlines, reviewing pull requests late in the evening, or responding to issues across multiple tools like GitHub, Slack, and ticketing systems.

Context in personal life is also a huge factor. An alert during working hours may be manageable, while the same alert during dinner, family time, or late at night can feel far more disruptive.

Over time, these interruptions can snowball and change how responders experience on-call.

In other words, incident load isn't only about the count of incidents, but the pattern of work surrounding those incidents: when they happen, how frequently they interrupt responders, and what other responsibilities engineers are managing at the same time.


Introducing On-Call Health

This observation led the team at Rootly’s AI Labs to build On-Call Health, an open-source project designed to help teams understand workload patterns in incident response.

The platform connects to the tools responders already use (Rootly, PagerDuty, GitHub, Slack, Linear, and Jira), collecting data and generating signals to understand how incident response load evolves over time.

At the center of this is the On-Call Health (OCH) Score, a composite workload score derived from incident response activity, engineering workload, and work pattern signals across integrated systems.
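The actual OCH scoring logic lives in the open-source repo. Purely as an illustration of what a composite workload score can look like, here is a hypothetical weighted combination of normalized signals; the signal names and weights below are invented for this sketch, not the real formula:

```python
# Illustrative composite workload score.
# Signal names and weights are hypothetical, NOT the actual OCH formula.
SIGNAL_WEIGHTS = {
    "incident_volume": 0.4,
    "after_hours_activity": 0.35,
    "engineering_workload": 0.25,
}

def composite_score(signals: dict[str, float]) -> float:
    """Combine signals already normalized to 0..1 into a 0..100 score."""
    score = sum(SIGNAL_WEIGHTS.get(name, 0.0) * min(max(value, 0.0), 1.0)
                for name, value in signals.items())
    return round(score * 100, 1)

print(composite_score({
    "incident_volume": 0.6,       # busier than usual incident load
    "after_hours_activity": 0.8,  # heavy late-night paging
    "engineering_workload": 0.3,  # moderate sprint work
}))  # 0.4*0.6 + 0.35*0.8 + 0.25*0.3 = 0.595 -> 59.5
```

The point of a composite like this is that no single input dominates: a quiet incident week can still produce a high score if after-hours activity is climbing.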


Tracking Trends Instead of Thresholds

Rather than relying on fixed thresholds, On-Call Health compares each engineer’s workload against their own historical baseline, since each engineer handles incident pressure differently.

For example, an incident volume that’s normal for an experienced SRE might be overwhelming for someone six months into their first rotation. Static thresholds miss this.

Instead, On-Call Health tracks how workload trends change over time, highlighting when activity remains stable, increases, or shifts significantly from a responder's baseline. Sudden changes in these trends can act as early warning signals that incident response load may be becoming unsustainable.
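As a rough illustration of baseline-relative trend tracking (not On-Call Health's actual algorithm), one way to do this is to compare a responder's recent window against their own history and flag large deviations:

```python
from statistics import mean, stdev

def trend_flag(history: list[float], recent: list[float],
               threshold: float = 2.0) -> str:
    """Flag when the recent window deviates from this engineer's own
    baseline by more than `threshold` standard deviations."""
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        spread = 1e-9  # guard against a perfectly flat baseline
    z = (mean(recent) - baseline) / spread
    if z > threshold:
        return "rising"
    if z < -threshold:
        return "falling"
    return "stable"

# Weekly incident counts: a veteran SRE vs. a newer responder.
# The same busy week reads very differently against each baseline.
veteran = [8, 9, 7, 10, 8, 9]   # ~8-10 incidents a week is normal here
newcomer = [1, 2, 1, 2, 1, 2]   # the same load would be a huge spike

print(trend_flag(veteran, recent=[10, 9]))   # stable
print(trend_flag(newcomer, recent=[10, 9]))  # rising
```

This is why a static "more than N incidents" threshold misses the newcomer's spike while nagging the veteran: the same absolute load sits at very different distances from each engineer's baseline.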


Contributing Signals

To detect these patterns, On-Call Health combines activity signals from multiple tools and platforms engineers already use:

Primary Response Signals: Collected from incident platforms such as Rootly and PagerDuty.

  • Incident Volume
  • Incident Severity
  • Time to acknowledge incidents
  • Time to resolve incidents
  • After-hours pages
  • Consecutive on-call days

Work Pattern Signals: Collected from GitHub, Slack, Rootly, and PagerDuty to identify when work is happening and how often responders are active outside normal working hours.

  • After-hours Activity
  • Weekend Activity
  • On-Call Shift Frequency

Engineering Workload Signals: Collected from development and ticketing platforms such as GitHub, Jira, and Linear, to get insight into the engineering work happening alongside incident response.

  • Assigned issues
  • Pull request volume
  • Commit frequency
  • Code reviews
  • Tickets assigned and priority levels
  • Overlapping work

Engineer Check-in Signals: Short Slack check-in surveys, inspired by Apple Health's State of Mind feature, capture how engineers feel alongside operational data and help surface trends across a responder's timeline and personal life.


Try it Yourself!

On-Call Health is open source and free to use.

You can explore the project or try the demo with preloaded mock data:

Website: https://oncallhealth.ai
GitHub: https://github.com/Rootly-AI-Labs/On-Call-Health

We're continuing to experiment with ways to better understand responder workload and make on-call more sustainable for the engineers running it. Feedback, ideas, and contributions are always welcome.


Final Tips to Keep in Mind

  • Incident volume alone doesn’t show the full picture. When alerts happen matters as much as how many occur.
  • After-hours interruptions add up. Late-night pages, weekend work, and repeated disruptions often drive burnout more than raw incident volume.
  • Workload snowballs. Incident response happens alongside development work, reviews, tickets, and meetings.
  • Look for trends, not only metrics. Changes in workload trends can reveal stress long before traditional dashboards do.

Reliability isn’t just about healthy systems; it’s about healthy engineers too!

