Bébhinn Egan

Originally published at hostedgraphite.com

Surviving On-Call: Tips from a Hosted Graphite SRE

Originally posted on the Hosted Graphite blog.

On-call is pain, and anyone who says otherwise is trying to sell you something. That said, there are lots of ways to make on-call a better experience, necessary evil that it is. I’ve been an SRE at Hosted Graphite since 2016, so I have done my fair share of on-call. A lot has already been written about how companies can make on-call a better experience for teams, and lucky for us we get a mandatory day off after an on-call shift, flexible working hours and a remote-friendly office, all of which go a long way to making the on-call experience suck less. These are obviously organisation-specific, so for the purposes of this post I’d like to focus on things that you can do to make your individual experience better, based on what I’ve learned.

Notification Hygiene

“It's important to practice good hygiene
At least if you wanna run with my team”

Del the Funky Homosapien (If You Must)

Notification hygiene is something that most of us are pretty terrible at – in and outside of work. Cleaning up your notifications is a very useful tactic for on-call, as chances are you'll be answering the pager and spending more time looking at screens than you normally would. If you use PagerDuty, I’d recommend grabbing the PagerDuty vCard relevant to your region. Next you’ll want to set up your Do Not Disturb exceptions: for me that is close family, my team-mates’ phone numbers and all of the numbers in the PagerDuty vCard. After that, I turn off all other push notifications and pop my phone into Do Not Disturb mode. This setup makes sure the only audible notifications I get are from close family, phone alarms, PagerDuty notifications or emergency work calls. We use Slack at Hosted Graphite, so I also enable Slack notifications during working hours and generally allow all calls from 7am to 9pm to account for things like doctor’s appointments, package deliveries, or most importantly, takeaway deliveries.

This might seem like a lot of effort, particularly if you use your phone pretty much as is (where you’re inundated with a barrage of notifications). If that works for you, that’s cool, but the goal I have during on-call shifts is to reduce the time I’m actively thinking about getting paged, unless the pager is currently yelling at me.
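The rules described above can be written down as a small decision function. This is only a sketch of the logic, not how any phone or PagerDuty actually implements Do Not Disturb: the group names (`close_family`, `team_mate`, `pagerduty`), the `Notification` type and the `is_audible` function are all hypothetical, and the working hours are an assumption borrowed from the 11am to 7pm hours mentioned later in the post.

```python
from dataclasses import dataclass
from datetime import time

# Hypothetical contact groups that bypass Do Not Disturb at any hour.
ALWAYS_ALLOWED = {"close_family", "team_mate", "pagerduty"}

# Assumed working hours (the post mentions roughly 11am to 7pm).
WORK_HOURS = (time(11, 0), time(19, 0))
# "Allow all calls from 7am to 9pm".
OPEN_CALL_HOURS = (time(7, 0), time(21, 0))


@dataclass(frozen=True)
class Notification:
    source: str  # e.g. "close_family", "pagerduty", "slack", "unknown_caller", "app"
    kind: str    # "call", "push" or "alarm"


def _within(window, now):
    """True if `now` falls inside the half-open window [start, end)."""
    start, end = window
    return start <= now < end


def is_audible(n: Notification, now: time) -> bool:
    """Return True if the notification should make a sound under these rules."""
    if n.kind == "alarm":
        return True  # phone alarms always ring
    if n.source in ALWAYS_ALLOWED:
        return True  # family, team-mates and the pager always get through
    if n.source == "slack" and n.kind == "push":
        return _within(WORK_HOURS, now)
    if n.kind == "call":
        return _within(OPEN_CALL_HOURS, now)  # deliveries, appointments, takeaway
    return False  # every other push notification stays silent
```

For example, `is_audible(Notification("pagerduty", "push"), time(3, 0))` is true, while a push from any other app is silent at any hour.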

Bonus tip: When you’re setting up PagerDuty notification channels, make sure to leave a delay of a minute or two before your first notification. This means that if you see a page happen in real time while you’re working, you can acknowledge it through the web app before it blows up your phone.

Note: This is going to depend entirely on the response times required for the services you are on call for. If it’s an incredibly high-priority service that requires response times in the seconds, this strategy isn’t going to work. However, if your services are resilient enough to allow longer response times for incidents, this strategy can make your incident management more effective, as you’re able to properly deal with issues without your phone constantly going off.

For the high-priority case, I’d recommend making your initial notification a push notification directly to your laptop, combined with one going to your phone – then you’ll get instant feedback before your phone goes off, because you’re looking directly at the screen and can acknowledge it from there.
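To make the trade-off between the two setups concrete, here is a minimal sketch that models notification rules as (channel, delay) pairs and works out which ones fire before a page is acknowledged. It does not use the PagerDuty API; the `Rule` type, the `channels_fired` function, the channel names and the specific delays are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    channel: str        # hypothetical names: "laptop_push", "phone_push", "phone_call"
    delay_minutes: int  # minutes after the incident triggers


def channels_fired(rules, acked_after_minutes=None):
    """List the channels that notify before the incident is acknowledged.

    acked_after_minutes=None means nobody acknowledged the page.
    """
    fired = []
    for rule in sorted(rules, key=lambda r: r.delay_minutes):
        if acked_after_minutes is None or rule.delay_minutes < acked_after_minutes:
            fired.append(rule.channel)
    return fired


# Resilient service: leave a short gap before the phone goes off.
DELAYED = [Rule("phone_push", 2), Rule("phone_call", 5)]

# High-priority service: notify the laptop and phone straight away.
IMMEDIATE = [Rule("laptop_push", 0), Rule("phone_push", 0), Rule("phone_call", 1)]
```

With `DELAYED`, acknowledging in the web app after one minute means `channels_fired(DELAYED, 1)` is empty and the phone never rings. With `IMMEDIATE`, the laptop and phone pushes always fire, but a quick acknowledgement still heads off the follow-up call.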

Personal Care

“I know you've had a rough time
Here I've come to hijack you (hijack you), I'll love you while
I'm making the most of the night”

Carly Rae Jepsen (Making the Most of the Night)

Like it or not, most on-call shifts are going to tire you out. The exception is when absolutely nothing happens at all, which we aspire to by doing proper incident follow-up, post-mortems, and focusing on infrastructure and software improvements to make on-call suck less. That said, the best laid plans of mice and men are nothing when faced with a faulty router in a data-centre at 3am.

So you’ve been woken up at 3am, you’ve sorted your incident, but now you’re awake and can’t get back to sleep. The temptation is to get up, power through, finish work early, and try to grab some sleep later. However, that way lies danger, as it’s easy to get stuck in the pattern of staying up late and missing most of the first half of the day. Prioritising regular sleep is vital, as is trying to stick to a reasonable routine. I’m a night owl, and because we have flexible hours at Hosted Graphite, my hours tend to be something like 11am to 7pm, and I usually fall asleep around 2am. So even if I get paged at 8am and could just get up and start my day, I’m always going to take that sleep.

When I first started doing on-call I didn’t do this, and tried to power through... exactly once. You would not think that an hour or two of extra sleep makes that much difference, but it does. I remember spending that day making tonnes of small mistakes and feeling like my brain was working at quarter-speed (and for the people who know me, that’s definitely saying something).

An incident-heavy week can also reach its tendrils into other aspects of your personal life – if you find yourself working late into the evening, the temptation emerges to order in some food or grab a pizza on your way home. I’ve found that if I don’t do some prep in advance of being on-call I end up mostly eating trash for the week, and feeling way worse than I would otherwise. Combine that with potentially disturbed sleep, and skipping breakfast so you can spend a little longer in bed is another tempting decision that will likely put a damper on your on-call week.

To avoid this, before going on-call I’ll buy a box of some sort of cereal/breakfast bar and whatever fruit I feel like eating and set myself the rule that I’m not leaving the house til I eat breakfast. If I make it as easy as possible to eat a decent breakfast, there’s no excuse not to. I have the usual work lunch, and at dinner time make twice as much as I normally would, to cover me for the next day in case I get hit with a page. As for those extreme incident-laden weeks, anything you can set and forget til it’s done cooking is ideal, and it means you’ll still be eating well even if you’re dealing with a lot of incidents.

The last, and probably most important, aspect of looking after yourself on-call is to take stock of your mental health. You need to do whatever it is that lets you relax, be it video games, quiet time with a book, or a trip to the gym. Whatever it is, you need to make time for something that isn’t thinking about work, so you can recharge your batteries. You may be on-call for the week, but the pager isn’t your life.

It’s always good to remember that you have a team behind you to support you and back you up. At Hosted Graphite, the company wants well-rested and happy engineers, because well-rested and happy engineers make fewer mistakes, do better work, and make for a much stronger team.

Out and About on Call

“If you go down in the woods today
You're sure of a big surprise”

Anne Murray (Teddy Bear’s Picnic)

There’s a perception that when you’re doing an on-call shift, you need to be at home, bound to your phone and laptop, hunched over in anticipation of the next notification so you can spring into action. This is not the case. Although I’d generally advise against a trip to the cinema or an evening at a fancy French restaurant, you can pretty much go about your regular routine, so long as you have a couple of things prepared:

A good 4G/LTE connection and phone to use as a hotspot

If you’re doing on-call, you should be provided with this. At Hosted Graphite, our phone bills are covered by the office and we have a choice between using our own devices or a company-provided phone. Whatever way your company does it, you’re going to need ready access to the internet for both notification delivery and the actual incident response.

Your laptop

Obviously, you’ll need a way to respond to incidents and you probably don’t have much choice here. If you’re lucky enough to be able to pick and choose the device you use for on-call, I’d recommend the 13” MacBook Air for most SRE-type work that is heavily biased towards work on remote servers. It's lightweight and easily portable, and its excellent battery life is also a big plus. For the Windows-inclined (or if you just want to run some straight-up Linux with no faff) I’ve also heard very good things about the ThinkPad X Series.

A power bank and assorted chargers

You don’t want to get caught with a dead phone or laptop, so a beefy power bank is worth picking up, both for peace of mind and to top up any devices you may have if you’re caught somewhere without plug access, like a bus or train.

A solid backpack to hold all your on-call related stuff

The ideal here is something that's big enough to comfortably hold all of the above, but not so big that you end up hauling a huge backpack around everywhere. When I’m on-call I tend to use a simple leather messenger bag for trips to and from work, and for everywhere else I use this MATEIN backpack (Amazon UK). It holds everything I need and collapses down fairly thin when all the space isn’t in use. It also has a handy USB port on the side, so you can plug in your power bank without needing to take it out.

You’ll probably have other specific needs, like a YubiKey for auth or maybe a swipe card reader for VPN access, so I’ve tried to focus on just the bare minimum here. The most important thing is to do your best to forget the pager is a thing until it pages out, and go about your life.

Final note

If you only take one thing away from this post, it’s that you need to put your own well-being first, and once you do that other aspects of on-call will become easier. These mobile meat sacks we inhabit are fragile at best, and it is both the responsibility of the person on-call and also the company’s leadership to ensure that people who are doing on-call on a regular basis are given the resources they need to succeed, and the time they need to stay well-rested, healthy and happy.

Written by Dave Fennell, SRE at Hosted Graphite.
