incident.io

Posted on Mar 22, 2022

What's a fair compensation for being on call?

#sre #devops #discuss

For the vast majority of organisations, it's necessary to have some form of round-the-clock cover to support the business. Whilst it's most commonly a concern for engineering, it's increasingly common to have folks from various disciplines available out-of-hours. Irrespective of role, compensating people fairly is an important part of running a healthy and effective on-call system.

In this guide, we'll cover the various options for paying folks who are on-call, along with some advice, trade-offs and pitfalls to avoid.

P.S. If you happen to be on-call and dealing with incidents, check out incident.io: an out-of-the-box IR automation tool for on-call folks. It handles the tedious manual processes in Slack during an incident, making responding quick and easy so that you can focus on fixing the problem.

A couple of terms worth knowing

On-call: Someone who's the first point of contact when something goes wrong. Most commonly on-call is linked to out-of-hours work.
Pager: We used to carry actual pagers back in the day, but these days the phrase 'holding the pager' really means being the first person whose phone will be called when something goes wrong.

Common approaches to on-call compensation

There are a number of different approaches to on-call comp...

1. No pay for on-call

Many organisations don't pay anything for on-call. The folks taking part in an on-call rotation are doing so out of the goodness of their heart, or because they've been asked (or forced!) to do so.

This approach is particularly prevalent in smaller companies where incidents aren't especially severe, or the impact of downtime isn't considered a significant concern. It's also something we see in much larger organisations, where historical norms prevail and folks feel less empowered to ask for change.

If you're considering this approach to compensation, it's worth keeping an eye on how people are feeling over time. There'll come a point where the impact on home life is meaningful enough that they expect to be compensated for the inconvenience. Left unchecked, it's easy to end up with burnt-out or undervalued people who may look for opportunities elsewhere.

2. Pay as part of your base salary

In this model on-call compensation is positioned as part of your base salary, and is commonly combined with expectations around how often you'll be required to work out-of-hours. For example, you might have a base salary of $100k and an expectation that you'll be on-call no more than one week in eight.

There's a clear appeal to this approach as it's a simple mechanic for paying folks without any complex monthly compensation calculations.

There are a few trade-offs to consider, however. Firstly, this model prevents folks from trading time without feeling indebted to each other. Since everyone is compensated equally, overrides are highly transactional: "I'll take your week if you take mine".

Additionally, people have a tendency to forget these kinds of agreements exist, and over time they may forget that they are being compensated at all. And since it doesn't feel like they're being paid 'extra' for that out-of-hours time, there's a risk they'll begrudge doing it.

3. Pay per incident

With a pay per incident approach, on-callers are paid a fixed amount for every incident they attend. Typically it'll apply to out-of-hours only, but every time you open your laptop to investigate an issue you were paged for, you get rewarded.

It's clear to see where the positives are here: pay is correlated with the amount of active pain you feel, so the more you're disrupted the more you're compensated. A bad week of incidents is made a little less bad when you pocket a nice bump on your paycheck.

There are clear drawbacks to this method though, and in the worst case you can end up disincentivising work to make systems more reliable. If you're going to be paid every time an alert fires, and you don't mind being woken up, why take the time to fix things? This is clearly an extreme, and few people are likely to deliberately leave things in a poor state, but while the incentive structures are in place the risk is there.

Another failure mode is the fact that no incidents means no pay. It might sound reasonable on the surface, but when being on-call means you need to carry your laptop with you for a week, that's an inconvenience you're giving away for free.

4. Pay by time

This approach works by paying folks solely for the hours they spend on-call. Most commonly this is run as a weekly rotation, where on-callers are paid a fixed amount for the time they're 'holding the pager'.

In an ideal world, the compensation is worked out down to the minute, taking overrides into account. In this world, when someone takes over part of your shift for you, they're automatically compensated for their time, which removes the pressure to trade shifts.

Ultimately this approach pays folks for the inconvenience to their home life, rather than their time directly. If you're paid $300 for a full week of on-call (168 hours), you're essentially being compensated at a rate of about $1.79 per hour. Not an amazing rate by any standards, but if it's just covering you to carry a laptop around it's a pretty fair exchange.
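To make the arithmetic concrete, here's a minimal sketch of minute-level pro-rating, assuming the $300 weekly figure above. The function name `pay_for_shift`, the dates, and the 12-hour override are all made up for illustration.

```python
from datetime import datetime, timedelta

WEEKLY_RATE = 300.00             # dollars per full week, from the example above
MINUTES_PER_WEEK = 7 * 24 * 60   # 10,080 minutes


def pay_for_shift(start, end, weekly_rate=WEEKLY_RATE):
    """Pro-rate the weekly amount over the minutes spent holding the pager."""
    minutes = (end - start).total_seconds() / 60
    return round(weekly_rate * minutes / MINUTES_PER_WEEK, 2)


# Hypothetical rotation: one week starting Monday 08:00.
week_start = datetime(2022, 1, 3, 8, 0)
week_end = week_start + timedelta(weeks=1)

# Hypothetical override: a colleague covers 12 hours on Wednesday night.
override_start = datetime(2022, 1, 5, 20, 0)
override_end = datetime(2022, 1, 6, 8, 0)

hourly_rate = WEEKLY_RATE / (7 * 24)
full_week = pay_for_shift(week_start, week_end)
cover_pay = pay_for_shift(override_start, override_end)
primary_pay = round(full_week - cover_pay, 2)

print(f"Effective hourly rate: ${hourly_rate:.2f}")   # $1.79
print(f"Paid to the person covering: ${cover_pay:.2f}")
print(f"Paid to the primary on-caller: ${primary_pay:.2f}")
```

Because the override is paid automatically out of the same weekly amount, nobody has to keep a mental tally of who owes whom a shift.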

As with the other approaches, paying by time also comes with some trade-offs. Firstly, it doesn't work well when someone is on the receiving end of a bad week of on-call: that $300 is going to feel like a rough deal when you've been woken several nights in a row. For this reason, it's helpful to pair it with an environment of proactive overrides: if someone's been up in the night, offer to give them a 24-hour break.

The other failure mode is perceived unfairness when pager load is imbalanced between teams. If I'm paid $300 per week and my team receives double the number of alerts of another, I'm likely to feel aggrieved. There's no easy solution here, but one option is to consider different rates for 'high' and 'low' load teams.

Our advice on getting started

There's clearly no shortage of options for compensating folks who are on-call, but you might be wondering which is best. As with most things, the most accurate answer is "it depends", but for the majority of people, paying by time is a sensible default. Below are a couple of example models which incorporate this approach.

A single rotation in smaller organisations

In smaller organisations, like incident.io, it's likely that a single rotation will have you covered. In this world, you're in easy mode, and you can pick a compensation amount that feels fair based on the criticality of your service and the typical frequency of incidents.

For us, we pay ~$350 per week and expect to be paged no more than once or twice (and sometimes never!). We've written about our setup in more depth in On-call by default.

A more sophisticated default for larger organisations

When I worked at Monzo (a bank here in the UK), a large proportion of the company was on-call. Things were a little more complex and we wanted to structure our compensation to cover a few different scenarios:

Critical teams who needed to be available at very short notice
Teams with lower alert load, or who could afford to respond with less urgency
Fairly compensating folks who weren't on-call, but were opportunistically pulled into incidents

To account for these we had two levels of weekly compensation: one where a team would be paid a higher rate in return for a stricter SLA, and another where a team would be paid slightly less in return for a slightly relaxed response time.

Since you're paying folks for the inconvenience to their home life, this approach feels justified. If you're on a critical rotation and need your laptop with you at all times, it's fair that you're compensated more than someone who can leave it at home whilst they head out for groceries.

To account for folks who were opportunistically pulled into incidents, we introduced the concept of 'sideways escalations'. If you were on-call and needed someone else who wasn't, you could page them to help out and they'd be paid a fixed amount in return. Since their involvement was opportunistic, and the only inconvenience was for this event in particular, this approach felt fair. Moreover, it reduced social barriers to escalation as you knew the person you were paging would be compensated for their involvement.
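As a rough sketch of how a two-tier model with sideways escalations might be calculated: every amount, tier name, and identifier below is a hypothetical placeholder, not a real Monzo rate or system.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical placeholder amounts, chosen only for illustration.
WEEKLY_RATES = {"critical": 400.00, "standard": 300.00}
SIDEWAYS_ESCALATION_FEE = 50.00


@dataclass
class Responder:
    name: str
    tier: Optional[str] = None     # None means not on a rotation this period
    weeks_on_call: float = 0.0
    sideways_escalations: int = 0  # times paged while not on-call


def period_pay(responder):
    """Time-based rota pay plus a fixed fee per sideways escalation."""
    rota_pay = 0.0
    if responder.tier is not None:
        rota_pay = WEEKLY_RATES[responder.tier] * responder.weeks_on_call
    escalation_pay = SIDEWAYS_ESCALATION_FEE * responder.sideways_escalations
    return round(rota_pay + escalation_pay, 2)


people = [
    Responder("Critical on-caller", tier="critical", weeks_on_call=1),
    Responder("Standard on-caller", tier="standard", weeks_on_call=2),
    Responder("Pulled in twice", sideways_escalations=2),
]

payouts = {p.name: period_pay(p) for p in people}
for name, amount in payouts.items():
    print(f"{name}: ${amount:.2f}")
```

Keeping the escalation fee separate from the rota rate means someone who isn't on any rotation still gets paid for the one evening they were pulled in.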

If you're interested in reading more, Monzo have written about their on-call setup here and here.

ℹ️ If you're planning on implementing time-based compensation and you're using PagerDuty to manage your on-call, this API call can grab you all the time slots that folks were on-call on a particular schedule, taking overrides into account. From here it's simple to import into a spreadsheet and multiply the time on-call by the pro-rated weekly amount.

curl --request GET \
  --url 'https://api.pagerduty.com/schedules/<SCHEDULE>?since=2022-01-01T00:00:00+0000&until=2022-02-01T00:00:00+0000' \
  --header 'Accept: application/vnd.pagerduty+json;version=2' \
  --header 'Authorization: Token token=<TOKEN>' \
  --header 'Content-Type: application/json' | jq -r '.schedule.final_schedule.rendered_schedule_entries[]|[.start, .end, .user.summary]|@tsv'

From                        To                          Person
2022-01-01T00:00:00+00:00   2022-01-02T08:00:00+00:00   Pete Hamilton
2022-01-02T08:00:00+00:00   2022-01-03T08:00:00+00:00   Chris Evans
2022-01-03T08:00:00+00:00   2022-01-04T08:00:00+00:00   Sophie Koonin
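If you'd rather skip the spreadsheet, the same multiplication can be sketched in a few lines of Python. This assumes the tab-separated rows that `jq` emits (start, end, person, with no header row) and a $300 weekly rate; `pay_by_person`, `WEEKLY_RATE`, and `SAMPLE` are names made up for this example.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

WEEKLY_RATE = 300.00                  # assumed weekly amount, in dollars
SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def parse_timestamp(value):
    # Normalise a trailing "Z" so older Python versions can parse it too.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def pay_by_person(tsv_text, weekly_rate=WEEKLY_RATE):
    """Sum each person's time on-call and pro-rate the weekly amount."""
    seconds = defaultdict(float)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        start, end, person = row
        seconds[person] += (parse_timestamp(end) - parse_timestamp(start)).total_seconds()
    return {
        person: round(weekly_rate * total / SECONDS_PER_WEEK, 2)
        for person, total in seconds.items()
    }


# The three rows from the sample output above, tab-separated.
SAMPLE = (
    "2022-01-01T00:00:00+00:00\t2022-01-02T08:00:00+00:00\tPete Hamilton\n"
    "2022-01-02T08:00:00+00:00\t2022-01-03T08:00:00+00:00\tChris Evans\n"
    "2022-01-03T08:00:00+00:00\t2022-01-04T08:00:00+00:00\tSophie Koonin\n"
)

pay = pay_by_person(SAMPLE)
for person, amount in pay.items():
    print(f"{person}: ${amount:.2f}")
```

In practice you'd read the rows from the `curl | jq` pipeline (for example via `sys.stdin.read()`) rather than a hard-coded string.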

Getting started

There's no such thing as a one-size-fits-all approach to on-call compensation. What you pay is heavily context-dependent and will vary based on your business, your incident load, the people who'll be on-call, and a number of other factors.

Overall, you'll struggle to go too wrong if you keep things simple, pay people fairly, and remain cognisant of the impact out-of-hours work can have.

In 2019, @evnsio and @spikelindsey surveyed over 300 people to understand how they're compensated for on-call. The full dataset of responses can be found here.
