Here's something nobody talks about: your on-call burnout isn't about being on-call.
It's about what happens after you fix something.
Two weeks ago, a database query locked a critical table for 15 minutes. Cost the company $50K in lost revenue. Took me 30 minutes to fix (restart the query), but 3 hours to figure out why it happened.
The fix? We added monitoring. Great.
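For the curious, here's roughly what that kind of monitoring looks like: a poll that flags any session holding other queries hostage past a threshold. A minimal sketch assuming Postgres and psycopg2; the DSN and threshold are placeholders, not our real setup.

```python
import time
import psycopg2

# Sessions that are blocking others, and how long they've been running.
# pg_blocking_pids() is Postgres 9.6+.
BLOCKING_QUERY = """
SELECT DISTINCT blocking.pid,
       blocking.query,
       now() - blocking.query_start AS held_for
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
"""

def check_blocking_locks(dsn: str, threshold_secs: int = 60) -> None:
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:
            cur.execute(BLOCKING_QUERY)
            for pid, query, held_for in cur.fetchall():
                if held_for and held_for.total_seconds() > threshold_secs:
                    # Swap print for your pager/alerting integration.
                    print(f"pid {pid} blocking for {held_for}: {query[:120]}")
    finally:
        conn.close()

if __name__ == "__main__":
    while True:
        check_blocking_locks("dbname=app")  # hypothetical DSN
        time.sleep(30)
```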
But did we actually figure out why someone wrote a query that could lock that table? No. Did we trace back to the feature that introduced it? No. Did we understand the sequence of deploys that left the system vulnerable? No.
So when something similar happens in 6 months (and it will), someone else will spend 3 hours debugging it again.
This is the pattern that burns people out.
On-call isn't the problem. Incident after incident is the problem. And we keep having the same incidents because we're solving them one layer too shallow.
Junior engineers onboard and within months they're exhausted because they're learning by firefighting. Senior engineers leave because they're tired of playing whack-a-mole. New hires quit during their first on-call rotation because the incidents feel random and unsolvable.
The real fix isn't a better rotation schedule. It's actually understanding your incidents so deeply that you prevent the class of incident, not just the symptom.
That's the only sustainable on-call.
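Concretely, for the incident above: monitoring catches the symptom faster, but a class-level fix makes it impossible for any one query to hold a lock that long in the first place. A minimal sketch, again assuming Postgres; the role name app_rw and the timeouts are hypothetical, tune them to your workload.

```python
import psycopg2

# Guardrails that bound the blast radius of any single query.
GUARDRAILS = [
    # Give up instead of queueing forever behind a held lock.
    "ALTER ROLE app_rw SET lock_timeout = '5s'",
    # Abort any statement that runs long enough to matter.
    "ALTER ROLE app_rw SET statement_timeout = '30s'",
]

def apply_guardrails(dsn: str) -> None:
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:
            for stmt in GUARDRAILS:
                cur.execute(stmt)  # takes effect for new app_rw sessions
    finally:
        conn.close()

if __name__ == "__main__":
    apply_guardrails("dbname=app user=admin")  # hypothetical DSN
```

lock_timeout stops writers from piling up behind a held lock; statement_timeout kills the runaway query itself. Neither replaces the root-cause analysis, but they shrink the blast radius while you do it.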
Tell me: What's the biggest gap in your incident analysis process? Are you finding root cause, or just fixing the immediate break?
Built by: #olivix https://olivix.app/