Akshay Galande

Posted on Jul 2

The Hidden Tax Your On-Call Rotation Is Charging Your Team

#oncall #ai #devops #productivity

There's a number most engineering leaders don't track: the real cost of their on-call rotation.

Not the PagerDuty bill. Not the pager pay. The actual cost, measured in burned-out engineers, slower sprint velocity, compounding bugs, and the quiet resignation emails that always seem to come the month after a bad on-call week.

I've spent the last year talking to engineering teams about their on-call experience. The conversations are remarkably similar regardless of company size, stack, or industry. The pain is the same everywhere. Here's what I keep hearing.

The 62% problem

Across every team I've talked to, the rough number is consistent: somewhere between 55 and 70 percent of production pages, about 62 percent at the midpoint, are for issues the team has already solved.

Not the same bug in the same service. The same pattern in a different service. A null check that was fixed in the checkout service in March pages someone in the payments service in June. A timeout retry that was handled in the API gateway shows up as a new incident in the notification service.

The knowledge exists inside the organization. It's sitting in a merged PR from three months ago. But the engineer getting paged at 2am has no way to know that. So they start from scratch. Read the stack trace. Form a hypothesis. Write the fix. Run the tests. Open a PR. Two hours later, they go back to sleep — having essentially duplicated work that already existed somewhere in the codebase.

This is the core inefficiency of on-call. Not that the problems are hard. Most of them aren't. It's that the system has no memory.
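To make "memory" concrete, here is a minimal sketch of what matching the same pattern across services could look like. Everything in it is an assumption for illustration: the `PastFix` shape, the `fingerprint` normalization rules, and the `FixMemory` class are hypothetical names, not how any particular tool does it.

```typescript
// Hypothetical record of a fix that already shipped somewhere in the org.
interface PastFix {
  service: string;
  summary: string;
}

// Collapse an error into a service-independent fingerprint by masking the
// parts that tend to differ between services: quoted identifiers and numbers.
function fingerprint(errorType: string, message: string): string {
  const normalized = message
    .toLowerCase()
    .replace(/'[^']*'|"[^"]*"/g, "<name>")
    .replace(/\d+/g, "<n>")
    .replace(/\s+/g, " ")
    .trim();
  return errorType + ":" + normalized;
}

// In-memory lookup from fingerprint to the fixes that resolved it before.
class FixMemory {
  private fixes: Record<string, PastFix[]> = {};

  remember(errorType: string, message: string, fix: PastFix): void {
    const key = fingerprint(errorType, message);
    const existing = this.fixes[key] || [];
    existing.push(fix);
    this.fixes[key] = existing;
  }

  recall(errorType: string, message: string): PastFix[] {
    return this.fixes[fingerprint(errorType, message)] || [];
  }
}

// The checkout fix from March is found when payments pages in June,
// even though the property name in the message is different.
const memory = new FixMemory();
memory.remember(
  "TypeError",
  "Cannot read properties of undefined (reading 'user')",
  { service: "checkout", summary: "Guard against a missing session.user" }
);
const matches = memory.recall(
  "TypeError",
  "Cannot read properties of undefined (reading 'account')"
);
console.log(matches.map((m) => m.service + ": " + m.summary));
```

A string fingerprint this crude will miss plenty of real matches and produce some false ones. The point is only that the lookup is cheap compared with two hours of rediscovery at 2am.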

The day-after tax

The incident itself is only the beginning of the cost. The real damage happens the next morning.

The engineer who got paged at 3am shows up to standup running on four hours of sleep. They context-switch poorly. Their code quality drops. They skip the thorough review they'd normally do. They rubber-stamp a PR that introduces a subtle bug — which becomes next week's 3am page for someone else.

I call this "on-call debt." It works exactly like technical debt. Each incident doesn't just cost the time to fix it. It reduces the team's capacity to prevent the next one. And like technical debt, it compounds quietly until something breaks badly enough that leadership finally notices.

But by then, the senior engineer who could have prevented it has already accepted an offer somewhere else.

The retention problem nobody talks about

Here's the stat that should keep engineering leaders up at night: roughly one in four senior engineers say on-call is their primary reason for leaving a company.

Not compensation. Not the tech stack. Not the manager. The 3am pages.

This makes sense when you think about it from the engineer's perspective. They spent years developing deep expertise. They can architect systems, mentor juniors, and solve genuinely hard problems. And they're being woken up at 3am to add a null check.

The mismatch between their skill level and the work they're being paged for is what drives the frustration. It's not that they can't do it. It's that they shouldn't have to.

When a senior engineer leaves because of on-call burnout, the replacement cost is staggering. Six months of recruiting. Six months of ramping. A year of lost institutional knowledge. For a 10-person team, on-call-driven turnover can easily cost $200K to $300K per year in hidden replacement costs alone.

The toolchain gap

The irony is that engineering teams have never had more tools. Sentry catches the error. Datadog shows the metrics. PagerDuty wakes someone up. Slack broadcasts the alert to the whole channel.

Detect. Notify. Amplify. Human.

That's the current workflow. Every tool in the chain is doing its job. But notice what's missing: nothing in that chain does the actual work. Every tool passes the incident to the next one until a human picks it up.

We have AI that can write code, generate tests, refactor entire files. But the on-call workflow still terminates at a sleepy engineer with a laptop. The tooling around the incident is sophisticated. The resolution is still completely manual.

The split that should exist

Not every incident needs a human. That's the uncomfortable truth that the current on-call model refuses to acknowledge.

Look at the 62% of repeat-pattern pages. These are incidents where the diagnosis is straightforward, the fix is small, and the risk is low. A null check. A timeout retry. An unhandled promise rejection. A missing error boundary.

These don't require creativity or architectural judgment. They require pattern recognition and the mechanical work of writing a fix, running tests, and opening a PR.
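For a sense of scale, here is roughly what one of these fixes looks like. This is an illustrative sketch, not code from any real service: the `Session` and `User` types and both function names are made up, and I'm assuming `user` goes missing once an auth token expires.

```typescript
// Hypothetical types: `user` is absent once the auth token has expired.
interface User {
  id: string;
  email: string;
}

interface Session {
  user?: User;
}

// Before: the cast hides the problem from the compiler, so this throws a
// TypeError at runtime whenever session.user is undefined.
function checkoutEmailUnsafe(session: Session): string {
  return (session.user as User).email;
}

// After: the entire fix is one guard. Callers get null and can redirect
// to login instead of crashing the request.
function checkoutEmail(session: Session): string | null {
  if (session.user === undefined) {
    return null;
  }
  return session.user.email;
}

console.log(checkoutEmail({ user: { id: "u1", email: "a@example.com" } }));
console.log(checkoutEmail({}));
```

Three lines of guard, plus a test for the expired-token case. That is the work someone was woken up for.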

Meanwhile, the other 38% absolutely need a human: the genuinely novel failures, the cascading incidents, the architectural problems. They need someone who understands the system deeply, can reason about edge cases, and can make judgment calls under pressure.

The split should be:

Routine incidents get handled automatically. Complex incidents get escalated to humans with full context so they can solve, not investigate.

Tier 1: automation. Tier 2: your best engineers.

This isn't a futuristic idea. It's how every other operational domain works. Customer support has had tier-1 automation for years. IT helpdesks auto-resolve password resets. Network operations centers auto-remediate known failure patterns.

Software engineering is the last holdout. And the cost of that holdout is measured in burnout, turnover, and lost velocity.

What would have to be true

For this split to work, you'd need something that can do four things:

First, it needs to understand production errors at the code level. Not just "this service is throwing 500s" but "this is a null reference on line 47 of checkout.ts because session.user can be undefined when the auth token expires."

Second, it needs pattern memory across services. If Service A fixed this exact pattern three months ago, the system should know that and apply the same approach.

Third, it needs to actually produce the fix. Not a suggestion. Not a recommendation. A working code change with tests that passes CI and is ready for human review.

Fourth, it needs to know its own limits. When the incident is too complex or the confidence is too low, it should escalate immediately with full diagnostic context — not attempt a fix it's not sure about.
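The fourth requirement is the one most worth pinning down, so here is a sketch of what "knowing its own limits" could mean as a decision rule. It is not a description of how BugOps or any other tool works. The `Diagnosis` fields, the `triage` function, and both thresholds are assumptions I picked for illustration.

```typescript
// Hypothetical summary of what an automated diagnosis produced.
interface Diagnosis {
  matchedKnownPattern: boolean; // a previously fixed pattern was found
  confidence: number; // 0 to 1, however the system scores itself
  filesChanged: number; // size of the proposed fix
  testsPassed: boolean; // did the proposed fix pass CI
}

type Decision =
  | { action: "open_pr"; reasons: string[] }
  | { action: "escalate"; reasons: string[]; context: string[] };

// Assumed limits; a real team would tune these against its own history.
const MIN_CONFIDENCE = 0.9;
const MAX_FILES_CHANGED = 2;

// Any single doubt is enough to hand the incident to a human, and the
// diagnostic context travels with the escalation.
function triage(d: Diagnosis, context: string[]): Decision {
  const reasons: string[] = [];
  if (!d.matchedKnownPattern) reasons.push("no known pattern matched");
  if (d.confidence < MIN_CONFIDENCE) reasons.push("confidence below threshold");
  if (d.filesChanged > MAX_FILES_CHANGED) reasons.push("fix touches too many files");
  if (!d.testsPassed) reasons.push("tests failed");

  if (reasons.length > 0) {
    return { action: "escalate", reasons: reasons, context: context };
  }
  return { action: "open_pr", reasons: ["known pattern, small fix, tests green"] };
}

const routine = triage(
  { matchedKnownPattern: true, confidence: 0.97, filesChanged: 1, testsPassed: true },
  []
);
const novel = triage(
  { matchedKnownPattern: false, confidence: 0.4, filesChanged: 6, testsPassed: true },
  ["stack trace", "recent deploys"]
);
console.log(routine.action, novel.action, novel.reasons);
```

The design choice that matters is the asymmetry: the automated path has to clear every check, while a single failed check routes to a person. Even the happy path ends in a PR for human review, not a direct deploy.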

This is the problem we set out to solve with BugOps. It connects to your monitoring and version control, handles the routine incidents autonomously, and escalates the hard ones to your team with everything they need to resolve quickly.

But regardless of what tool solves it, the underlying shift is overdue. Engineering teams deserve an on-call model that respects the difference between routine work and genuinely hard problems. The current model treats every incident like it needs a human. It doesn't. And the humans are paying the price.

The question I'd leave you with

Look at your team's on-call data from the last three months. Count the pages that were repeat patterns. Count the ones that were genuinely novel.

If the split is anywhere close to 60/40, most of the pages waking your team up are for work that shouldn't require a human.
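If you want to run that count yourself, a rough sketch follows. It assumes you have exported your pages and tagged each one with a pattern label during review. The `Page` shape and the `repeatSplit` function are hypothetical, and the `pattern` field is something you would add by hand, not something a paging tool gives you.

```typescript
// Hypothetical page record. `pattern` is a label assigned during review,
// for example "null-check" or "timeout-retry".
interface Page {
  pagedAt: number; // epoch milliseconds
  service: string;
  pattern: string;
}

interface Split {
  repeat: number;
  novel: number;
  repeatShare: number; // fraction of pages that were repeats, 0 to 1
}

// Walk the pages in time order. The first page for a pattern counts as
// novel; every later page with the same pattern counts as a repeat,
// regardless of which service it came from.
function repeatSplit(pages: Page[]): Split {
  const seen: Record<string, boolean> = {};
  let repeat = 0;
  let novel = 0;
  const ordered = pages.slice().sort((a, b) => a.pagedAt - b.pagedAt);
  for (const page of ordered) {
    if (seen[page.pattern] === true) {
      repeat += 1;
    } else {
      seen[page.pattern] = true;
      novel += 1;
    }
  }
  const total = repeat + novel;
  return { repeat: repeat, novel: novel, repeatShare: total === 0 ? 0 : repeat / total };
}

const split = repeatSplit([
  { pagedAt: 1, service: "checkout", pattern: "null-check" },
  { pagedAt: 2, service: "api-gateway", pattern: "timeout-retry" },
  { pagedAt: 3, service: "payments", pattern: "null-check" },
  { pagedAt: 4, service: "notifications", pattern: "timeout-retry" },
  { pagedAt: 5, service: "search", pattern: "null-check" },
]);
console.log(split);
```

This undercounts repeats, because a pattern your team solved before the window started still shows up as novel the first time it appears. If the number comes out high anyway, that tells you something.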

The question isn't whether to automate the routine 60%. It's how long your team can sustain the cost of not doing it.

bugops.app
