Darian Vance

Posted on • Originally published at wp.me

Solved: Company is starting on-call soon. What should I request from management? Money? Separate phone? etc…

🚀 Executive Summary

TL;DR: Companies often implement on-call rotations without adequate compensation, equipment, or clear processes, leading to engineer burnout. To counter this, engineers must negotiate non-negotiables like explicit pay, a separate work phone, and defined alert criticality, while advocating for sustainable practices such as observability, SLOs, and comprehensive runbooks.

🎯 Key Takeaways

  • On-call duty necessitates explicit compensation (stipend, hourly pay, or time-off-in-lieu) and a company-provided separate phone to ensure work-life separation and acknowledge the ‘tax on personal time’.
  • A sustainable on-call culture shifts from basic monitoring to comprehensive observability, defining Service Level Objectives (SLOs) and error budgets to prioritize user experience and reduce noisy, non-actionable alerts.
  • Effective incident response requires actionable alerts with corresponding runbooks and a commitment to blameless post-mortems, fostering learning from systemic failures rather than individual blame.

Before you accept that pager, you need a plan. A senior engineer breaks down the non-negotiables for on-call duty—from compensation and equipment to setting boundaries that protect your sanity.

So, They’re Putting You On-Call? Here’s How to Not Get Played.

I remember my first “real” on-call rotation like it was yesterday. It was 2:47 AM on a Tuesday. The alert was a screeching banshee from an app I’d never heard of. The message was cryptic: CRITICAL: Service 'x-data-processor' latency > 500ms. I spent the next 90 minutes fumbling through unfamiliar dashboards, SSH’ing into a box named util-worker-03 (what the hell is that?), and trying to find a runbook that didn’t exist. The “fix”? A senior engineer messaged me back on Slack at 8 AM: “Oh yeah, that thing flaps sometimes. Just ignore it.” I had lost sleep, sanity, and a bit of my soul for a noisy, non-actionable alert. That’s when I learned the most important lesson of my career: On-call isn’t something you just do; it’s something you negotiate.

The “Why”: The On-Call Tax on Your Life

Let’s be brutally honest. When a company starts an on-call rotation, it’s often because they see a business need (“We need 24/7 coverage!”) but want to avoid the cost of a dedicated 24/7 Network Operations Center (NOC). They see your salary and think, “Great, we already pay them, this is just part of the job.”

But it’s not. It’s a fundamental change to your life. It’s a tax on your personal time. It means you can’t have that second beer on a Friday, you can’t go to that movie in the bad-reception theater, you can’t truly disconnect on your kid’s birthday. You are tethered to a device, waiting for something to break. If a company doesn’t acknowledge and compensate for this intrusion, they’re not just being cheap; they’re failing to respect their engineers. This leads to burnout, alert fatigue, and eventually, a revolving door of talent.

Solution 1: The Bare Minimum Survival Kit

If management says, “On-call starts next Monday,” this is your immediate, non-negotiable list. This isn’t about creating a perfect system; it’s about basic survival and establishing that your time is valuable. You’re not asking for the world; you’re asking for the basics.

Your Checklist for Day One:

  • Money: This is non-negotiable. It can be a weekly stipend for being on the hook, an hourly rate for time spent actively working on an incident, or a combination. Don’t work for free. Period.
  • A Separate Phone: They want to wake you up at 3 AM? They can pay for the device and the service that does it. This is critical for work-life separation. When you hand off the phone, you are mentally off the clock. A PagerDuty or Opsgenie app on your personal phone is not a substitute.
  • A Clear Rotation Schedule: Who is on-call? When? For how long? Who is the secondary? Who do you escalate to if you’re stuck? This must be documented and visible to everyone. No “tap on the shoulder” assignments.
  • A Defined “Criticality” Level: What, exactly, constitutes a pageable offense? A 5% increase in dev server CPU is not an emergency. The main payment gateway returning 500 errors is. This needs to be defined before you get your first alert.
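A criticality definition only sticks if it is explicit enough to be written as a rule, not a vibe. Here is a minimal sketch of a paging policy in Python; the tier names, examples, and escalation timings are illustrative assumptions, not a standard:

```python
# Sketch of an explicit paging policy mapping alert severity to response.
# Tier names, examples, and timings are illustrative assumptions.
PAGING_POLICY = {
    "SEV1": {"pages_human": True,
             "example": "Payment gateway returning 500 errors",
             "response": "Page primary immediately; escalate after 15 min"},
    "SEV2": {"pages_human": True,
             "example": "One node of a redundant pair is down",
             "response": "Page primary during business hours only"},
    "SEV3": {"pages_human": False,
             "example": "Dev server CPU up 5%",
             "response": "Create a ticket; review next business day"},
}

def should_page(severity: str, is_business_hours: bool) -> bool:
    """Return True only when the written policy says a human gets woken up."""
    rule = PAGING_POLICY.get(severity)
    if rule is None or not rule["pages_human"]:
        return False
    if severity == "SEV2":
        return is_business_hours
    return True
```

The point of encoding it is that “a 5% dev-server CPU bump is not pageable” becomes a decision the system already made, not an argument you have at 3 AM.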

Pro Tip: Get it in writing. An email from your manager confirming the terms is the bare minimum. A formal HR policy is better. If it’s not written down, it doesn’t exist.
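A rotation schedule is easiest to trust when anyone can compute it for themselves, with no “tap on the shoulder” assignments. A minimal sketch of a deterministic weekly round-robin, where the engineer list, epoch date, and shift length are assumptions for the example:

```python
from datetime import date

def on_call_for(day, engineers, epoch=date(2024, 1, 1), shift_days=7):
    """Deterministic round-robin: anyone can compute who is on call for a date.
    The secondary (escalation target) is simply the next person in rotation.
    Engineer order, epoch, and shift length are illustrative assumptions."""
    shifts = (day - epoch).days // shift_days
    primary = engineers[shifts % len(engineers)]
    secondary = engineers[(shifts + 1) % len(engineers)]
    return primary, secondary
```

Publishing the list order and epoch in the written policy answers “Who is on-call? When? Who is the secondary?” for any date, without anyone having to ask.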

Solution 2: The Sustainable On-Call Culture

Once you’ve survived the initial implementation, it’s time to advocate for a system that doesn’t just compensate you for pain but actively works to reduce it. This is the real goal. A healthy on-call culture is a sign of a mature engineering organization.

Building a Better System:

  1. Invest in Observability, Not Just Monitoring: Monitoring tells you when something is broken. Observability helps you ask questions to figure out why. Stop waking people up for noisy, non-actionable alerts. An alert should be a symptom of real user impact, not a server hiccup.
  2. Define SLOs and Error Budgets: Service Level Objectives (SLOs) formally define your reliability targets. If your user-login service needs to be 99.9% available, that’s your target. Your Error Budget is the 0.1% of the time it’s allowed to fail. You only page someone when you are in danger of burning through that budget. This moves the conversation from “Is the server okay?” to “Is the user experience okay?”.
  3. Blameless Post-mortems: When things break (and they will), the goal is to learn, not to blame. A good post-mortem focuses on systemic failures (“Why did our alerting system fail to detect this sooner?”) not individual mistakes (“Why did Bob push that commit?”).
  4. Actionable Alerts with Runbooks: Every single alert that can wake a human up needs a corresponding runbook. It should explain the likely impact, where to look first, and common remediation steps.
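The “only page when the budget is in danger” rule from point 2 reduces to a burn-rate calculation: how fast are you spending your error budget relative to the rate the SLO allows? A minimal sketch, reusing the 99.9% figure from the example above:

```python
def error_budget_burn_rate(slo_target, window_total, window_bad):
    """Burn rate = observed error rate / allowed error rate.
    1.0 means burning the budget exactly on pace; above 1.0 means faster.
    Pages should fire on sustained high burn rates, not raw server metrics."""
    allowed = 1.0 - slo_target           # e.g. 0.001 for a 99.9% SLO
    observed = window_bad / window_total
    return observed / allowed

# Example: with a 99.9% SLO, 50 failures out of 10,000 requests is a 0.5%
# error rate against a 0.1% allowance -- burning the budget 5x too fast.
```

The alerting threshold (page at 10x? warn at 2x?) is a policy choice your team writes down, which is exactly what moves the conversation from “Is the server okay?” to “Is the user experience okay?”.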

Compare this bad alert…

{
  "alert": "CPU > 90% on prod-db-01",
  "severity": "CRITICAL"
}

…to this good alert:

{
  "alert": "P99 Checkout API latency > 750ms",
  "severity": "CRITICAL",
  "impact": "Users may be unable to complete purchases. Potential revenue loss.",
  "slo_status": "Error budget for 'checkout_api' is burning 10x faster than normal.",
  "runbook": "https://wiki.techresolve.net/runbooks/checkout-api-latency"
}
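Alerts like the good one above don’t appear by magic; the impact and runbook fields usually come from a registry the alert pipeline consults before paging anyone. A minimal sketch of that enrichment step, where the registry shape and the “no runbook means no page” rule are assumptions for illustration (the runbook URL is the one from the example):

```python
# Hypothetical registry mapping alert keys to human-facing metadata.
ALERT_REGISTRY = {
    "checkout_api_latency": {
        "impact": "Users may be unable to complete purchases. Potential revenue loss.",
        "runbook": "https://wiki.techresolve.net/runbooks/checkout-api-latency",
    },
}

def enrich(alert: dict) -> dict:
    """Attach impact and runbook metadata; refuse to page without a runbook."""
    meta = ALERT_REGISTRY.get(alert.get("key"))
    if meta is None:
        # No runbook registered: route to a ticket queue, never to a pager.
        return {**alert, "pageable": False}
    return {**alert, **meta, "pageable": True}
```

Enforcing “no runbook, no page” at this layer is one way to make the rule from point 4 structural rather than aspirational.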

Solution 3: The ‘Nuclear’ Option

What if management says no to everything? No money, no phone, just “do it, it’s part of your job.” You have a choice to make. The ‘Nuclear’ Option is a last resort, and it’s about setting hard, professional boundaries.

It sounds like this:

“I understand the business need for 24/7 coverage. However, being on-call is a significant responsibility that extends beyond my standard job description and hours. I am happy to participate in a rotation once we have a formal, written policy in place that includes fair compensation and clear guidelines. Until then, I will continue to be available to address critical issues during my normal working hours.”

This is a high-risk, high-reward move. In a toxic workplace, it might get you labeled as “not a team player” or worse. In a healthy workplace, it can be the catalyst that forces management to take the issue seriously. You are not refusing to work; you are refusing to be exploited. Sometimes, the most powerful tool you have is your willingness to protect your own time and well-being. And if the company can’t respect that, well, there are a lot of other companies out there hiring.

| Topic | Bare Minimum | Sustainable Culture | What to Avoid |
| --- | --- | --- | --- |
| Compensation | Weekly stipend + hourly pay | Fair stipend + time-off-in-lieu | “It’s part of your salary.” |
| Equipment | Company-provided phone & plan | Company phone + laptop | “Just use your personal phone.” |
| Alerting | Defined escalation path | SLO-based, actionable alerts | “Page on every CPU spike.” |
| Process | Clear schedule | Blameless post-mortems | “Figure it out as you go.” |


👉 Read the original article on TechResolve.blog


Support my work

If this article helped you, you can buy me a coffee:

👉 https://buymeacoffee.com/darianvance
