Hamza

On-Call Burnout: What Incident Data Doesn’t Show

Incident dashboards measure system health. However, they rarely show the workload and strain that engineers face when responding to alerts.

How On-Call Works

On-call engineers are developers responsible for resolving critical system outages. When incidents happen, such as outages, performance issues, or system failures, the on-call engineer is alerted and responds by investigating and resolving the issue. Schedules rotate so that someone is always available to respond and resolve these incidents.

Teams track incident metrics such as incident counts, Mean Time to Repair (MTTR), Mean Time to Detect (MTTD), and escalation counts. These metrics give insight into system reliability but say very little about how incident response affects the engineers handling those alerts. While teams focus on system health, the health of the team itself can quietly become strained.


The Numbers Behind On-Call Work

Industry data shows that on-call work takes up a significant share of engineering time. The 2025 SRE Report states that engineers spend a median of 30% of their week on operational work, up from 25% the year before.

Incidents themselves are also common. 46% of SREs reported responding to more than five incidents in the last 30 days, while 23% handled between 6–10 incidents. At that level, the load can quickly lead to burnout when combined with other engineering responsibilities and external factors ("SRE report 2025", 2026).

These metrics point to a fundamental gap: incident counts and traditional incident data alone can't fully capture how workload is experienced by the engineers responding to alerts.

Burnout is widespread. A 2024 survey found that about 65% of engineers report currently experiencing burnout (Chevireddy, 2025).


Burnout Is Hard to See

Burnout rarely appears directly in dashboards. Most incident management tools (e.g., Rootly, PagerDuty) track metrics such as incident volume, response time, resolution time, and escalation frequency.

But they rarely capture how that workload is experienced by the engineers responding to incidents.

For example, two engineers may handle five incidents in a week, which on the surface may not reveal much. However, if one of them received most of their alerts late at night or during personal time, the two experiences are completely different.
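The difference is easy to make concrete. The sketch below computes the share of alerts that fired outside working hours; the working-hours window, field shapes, and the two example engineers are illustrative assumptions, not part of any real incident tool's API:

```python
from datetime import datetime, time

# Assumed working-hours window (not from any real tool's config).
WORK_START, WORK_END = time(9, 0), time(18, 0)

def after_hours_share(alert_times: list[datetime]) -> float:
    """Fraction of alerts fired outside 09:00-18:00 or on a weekend."""
    if not alert_times:
        return 0.0
    off_hours = sum(
        1 for t in alert_times
        if t.weekday() >= 5 or not (WORK_START <= t.time() < WORK_END)
    )
    return off_hours / len(alert_times)

# Two engineers, five incidents each -- same count, very different weeks.
alice = [datetime(2025, 3, 3 + i, 10, 30) for i in range(5)]  # weekday mornings
bob = [datetime(2025, 3, 3 + i, 2, 15) for i in range(5)]     # 2 a.m. pages

print(after_hours_share(alice))  # 0.0
print(after_hours_share(bob))    # 1.0
```

Both engineers show "five incidents" on a dashboard, yet one number is 0.0 and the other is 1.0 once timing is taken into account.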

Patterns like after-hours alerts, uneven incident distribution, and repeated interruptions can accumulate into fatigue or burnout. They can leave responders with a lingering sense that another alert could arrive at any moment.

These signals are also often scattered across multiple tools, which makes them difficult to detect early. Late-night pull requests, after-hours ticket responses, and consecutive incident response days often go untracked. Looking more closely at incident data makes the pattern clear.


Incident Load Is About Patterns

Many Site Reliability Engineer (SRE) responders shared the same thought: the same number of incidents can feel very different depending on who receives them and when they occur.

A handful of alerts during the workday is very different from repeated pages late at night or during personal time. But timing is only part of the picture. Incident response rarely happens in isolation. Engineers are often working on multiple things: feature work, code reviews, tickets, and meetings just to name a few.

What matters most is not just how many incidents occur, but how that workload accumulates across responders and when it happens.

An engineer who receives many alerts during a quiet week may experience them very differently from someone who is balancing deadlines, reviewing pull requests late, or engaging multiple tools like GitHub, Slack, and ticketing systems.

Context in personal life is also a huge factor. An alert during working hours may be manageable, whereas the same alert during dinner, family time, or late at night can feel far more disruptive. Over time, these interruptions can snowball and change how responders experience on-call.

In other words, incident load isn't only about the count of incidents, but the pattern of work surrounding those incidents: when they happen, how frequently they interrupt responders, and what other responsibilities engineers are managing at the same time.


Introducing On-Call Health

This observation led the team at Rootly’s AI Labs to build On-Call Health, an open-source project designed to help teams understand workload patterns in incident response.

The platform connects to the tools responders already use: Rootly, PagerDuty, GitHub, Slack, Linear, and Jira, collecting data and creating signals to understand how incident response load evolves over time.

At the center of this is the On-Call Health (OCH) Score, a composite workload score derived from incident response activity, engineering workload, and work pattern signals across integrated systems.
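A composite score like this can be sketched as a weighted combination of normalized signals. The weights, signal names, and 0–100 scale below are illustrative assumptions, not the actual OCH Score formula:

```python
# Illustrative sketch only: weights, signal names, and the 0-100 scale
# are assumptions, not On-Call Health's real scoring logic.
def och_score(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Combine normalized signals (each in [0, 1]) into a 0-100 score."""
    total_weight = sum(weights.values())
    weighted = sum(signals.get(name, 0.0) * w for name, w in weights.items())
    return round(100 * weighted / total_weight, 1)

weights = {
    "incident_response": 0.5,
    "engineering_workload": 0.3,
    "work_patterns": 0.2,
}
signals = {
    "incident_response": 0.8,    # heavy incident week
    "engineering_workload": 0.4, # moderate dev workload
    "work_patterns": 0.9,        # lots of after-hours activity
}
print(och_score(signals, weights))  # 70.0
```

The point of a composite is that a heavy incident week combined with after-hours activity scores higher than either signal alone would suggest.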


Tracking Trends Instead of Thresholds

Rather than relying on fixed thresholds, On-Call Health compares each engineer’s workload against their own historical baseline, since each engineer handles incident pressure differently.

For example, an incident volume that’s normal for an experienced SRE might be overwhelming for someone six months into their first rotation. Static thresholds miss this.

Instead, On-Call Health tracks workload trends over time, signaling when activity is stable, increasing, or shifting significantly from a responder's baseline. Changes in these trends can act as early warnings that incident response load is becoming unsustainable.
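Baseline-relative detection can be sketched as a z-score against an engineer's own recent history. The window size, thresholds, and status labels below are illustrative assumptions, not On-Call Health's actual logic:

```python
from statistics import mean, stdev

def load_status(history: list[float], current: float, window: int = 8) -> str:
    """Compare this week's load to the engineer's own recent baseline."""
    baseline = history[-window:]
    if len(baseline) < 2:
        return "insufficient-data"
    mu, sigma = mean(baseline), stdev(baseline)
    sigma = sigma or 1e-9  # avoid division by zero on a flat baseline
    z = (current - mu) / sigma
    if z > 2.0:
        return "significant-increase"
    if z > 1.0:
        return "elevated"
    return "stable"

# Same absolute load, different baselines, different verdicts.
veteran = [10, 12, 11, 13, 12, 11, 12, 13]  # ~12 incidents/week is normal
newcomer = [2, 3, 2, 3, 2, 3, 2, 3]         # ~2-3 incidents/week is normal

print(load_status(veteran, 12))   # stable
print(load_status(newcomer, 12))  # significant-increase
```

Twelve incidents is a quiet week for the veteran and a dramatic spike for the newcomer: exactly the distinction a fixed threshold misses.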


Contributing Signals

To detect these patterns, On-Call Health combines activity signals from multiple tools and platforms engineers already use:

Primary Response Signals: Collected from incident platforms such as Rootly and PagerDuty.

  • Incident Volume
  • Incident Severity
  • Time to acknowledge incidents
  • Time to resolve incidents
  • After-hours pages
  • Consecutive on-call days

Work Pattern Signals: Collected from GitHub, Slack, Rootly, and PagerDuty to identify when work is happening and how often responders are active outside normal working hours.

  • After-hours Activity
  • Weekend Activity
  • On-Call Shift Frequency

Engineering Workload Signals: Collected from development and ticketing platforms such as GitHub, Jira, and Linear, to get insight into the engineering work happening alongside incident response.

  • Assigned Issues
  • Pull Request volume
  • Commit frequency
  • Code reviews
  • Tickets assigned and Priority levels
  • Overlapping Work

Engineer Check-in Signals: On-Call Health also sends short Slack check-in surveys inspired by Apple Health’s State of Mind feature. These capture how engineers feel alongside the operational data, surfacing trends in a responder’s timeline that metrics alone can’t show.
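A check-in like this can be built as a Slack Block Kit message. The question wording and response scale below are assumptions; only the Block Kit structure (a "section" block plus an "actions" block of buttons) follows Slack's message format:

```python
import json

def build_checkin_blocks(question: str, scale: list[str]) -> list[dict]:
    """Build a Block Kit payload: one question, one row of answer buttons."""
    return [
        {"type": "section",
         "text": {"type": "mrkdwn", "text": question}},
        {"type": "actions",
         "elements": [
             {"type": "button",
              "text": {"type": "plain_text", "text": label},
              "value": str(i)}
             for i, label in enumerate(scale)
         ]},
    ]

blocks = build_checkin_blocks(
    "How is your on-call load feeling this week?",
    ["Very manageable", "Okay", "Stretched", "Unsustainable"],
)
print(json.dumps(blocks, indent=2))
```

A payload like this could then be sent with slack_sdk's `WebClient.chat_postMessage(channel=..., blocks=blocks)`; the survey content itself is hypothetical.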


Try it Yourself!

On-Call Health is open source and free to use.

You can explore the project or try the demo with preloaded mock data:

Website: https://oncallhealth.ai
GitHub: https://github.com/Rootly-AI-Labs/On-Call-Health

We're continuing to experiment with ways to better understand responder workload and make on-call systems more sustainable for the engineers running them. Feedback, ideas, and contributions are always welcome.


Final Tips to Keep in Mind

  • Incident volume alone doesn’t show the full picture. When alerts happen matters as much as how many occur.
  • After-hours interruptions add up. Late-night pages, weekend work, and repeated disruptions often drive burnout more than raw incident volume.
  • Workload snowballs. Incident response happens side-by-side with development work, reviews, tickets, and meetings.
  • Look for trends, not only metrics. Changes in workload trends can reveal stress long before traditional dashboards do.

References

Top comments (2)

Sylvain Kalache

The tech industry has become excellent at monitoring the health of its software systems, but does nothing for the engineers. On-call rotations have always been hard on people, and with developers now able to ship faster, there are more incidents and on-call engineers are more likely to burn out.

Incident Copilot

Important point. The incident timeline usually captures duration, severity, and resolution path, but it rarely captures the cognitive fragmentation that accumulates around repeated low-grade interruptions.

That is why teams can show acceptable MTTR on paper and still burn people out in practice. If the system wakes the same engineers too often for noisy or low-leverage work, the process is unhealthy even when the dashboard looks fine.