Discussion on: Have you ever been on-call? What was it like?

Alex Vondrak

It's been complicated. 😅 I was part of a small team for many years, and devs were expected to be put on the rotation sooner or later. Over time, my relationship with being on-call changed in several ways - as did my relationship with coworkers, code bases, alerts, and so on.

I've had PagerDuty alerts that led to hours of intensive debugging, escalation to several people, patches across multiple systems, in-depth postmortems, etc. Some of those fires happened in the middle of the day, which is almost no problem at all: everybody's in the office to collaborate and no one gets woken up. But most people think of being on-call as getting paged in the middle of the night or on a weekend or holiday. That comes with the territory, and of course I have those war stories. Nights when I couldn't get onto the VPN, didn't have access to some system, or didn't know what was going on, so I had to call, roll over to voicemail, and redial until some coworker (or coworkers!) woke up. Some nights the issue "fixed itself" by the time I'd rallied the troops. Most nights were more mundane, though. In a sense, that's what you want: no outages, or outages you can handle easily by yourself.

I've been through various attempts to improve the process: training newcomers, writing playbooks, systematizing postmortems, tuning alerts. Upon reflection, I think most of it never really stuck. A lot of it was social. Employee turnover meant that, over time, the core group of developers who'd been there for 4, 5, 6 years were the ones who knew enough not to need training or playbooks. They owned their respective systems and were solely responsible for those systems' outages. We all knew each other and got into a collective rhythm.

It certainly was a culture of choke points, though: the indispensable few who could resolve certain systems' outages. But I think it got even worse than that as time went on, and I'd say it was directly related to the flavor & quality of alerting. Our alerts were only really keyed off infrastructure-level metrics: healthy host count, latency at the load balancer, CPU utilization, Pingdom checks. An alert on such metrics isn't in itself actionable; you don't know why the host count has surged, or whether it's even a problem. Alerts were almost never about something you did as an application developer¹, and yet the developer was put on-call for their application.
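
To make the "not actionable" point concrete, here's a toy sketch in Python (invented names, metrics, and thresholds; none of this is our actual setup) of the difference between alerting on a raw infrastructure number and alerting on a symptom an application developer can actually act on:

```python
# Toy illustration only: the services, metrics, and thresholds are made up.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    summary: str
    next_step: Optional[str] = None  # what the on-call dev should actually do


def infra_alert(healthy_host_count: int, threshold: int = 3) -> Optional[Alert]:
    """Fires on a raw number: it tells you *that* something moved, not *why*."""
    if healthy_host_count < threshold:
        return Alert(summary=f"healthy hosts = {healthy_host_count} (below {threshold})")
    return None


def app_alert(checkout_error_rate: float, budget: float = 0.01) -> Optional[Alert]:
    """Fires on a user-facing symptom that the application team owns."""
    if checkout_error_rate > budget:
        return Alert(
            summary=f"checkout errors at {checkout_error_rate:.1%} (budget {budget:.0%})",
            next_step="check the latest checkout deploy and its dependency dashboards",
        )
    return None
```

The first kind just hands you a number to squint at; the second at least tells you whose pager it belongs on and what to try first.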

So eventually I became conditioned to either (a) ignore alerts² or (b) blindly escalate them to the one guy who was up at odd hours anyway and was better at troubleshooting AWS. The choke points got even worse as I took myself out of the equation.

It didn't make for a healthy organization. But improving alerts & the resulting culture is a whole other topic unto itself, and I can't really say that I've been through that transformation. 😬 So I guess talking about it just serves as catharsis for my years of pain. 😭

The upshot is that the quality of the on-call experience is directly related to the quality of the alerts. I guess this should come as no surprise. Put simply, if I'm going to be woken up in the middle of the night, it should be for a good reason. If I'm doing my job right, those reasons should be few and far between.


  1. Or at least nothing acute. I remember one wild goose chase where an alert in system A led the CTO to see what looked like high latencies in system B. After hours of theory-crafting & fiddling with various things in the middle of the night, it turned out B's latency had been at that level for ages and was just a red herring. The AWS dashboard had been flashing red (although not wired to a PagerDuty alert) because of some arbitrary threshold that was set years ago. If it was an issue, it happened incrementally over time, and no one had been watching. 🤷

  2. After enough insomniac nights of:

    • wake up to an indistinct alert
    • find my laptop
    • type password wrong a couple times
    • 2FA into the VPN
    • see if the service Works For Me™ (cuz there's no signal beyond "number is below threshold")
    • squint at dashboards in vain
    • see no further signs of smoke
    • resolve the alert
    • fail to go back to sleep after funneling all that light into my eyeballs

    I think I can be forgiven for disregarding The Boy Who Cried Wolf.

Kevan Carstensen

This post has rekindled my deep sense of gratitude toward our alerts that skew (for the most part) quiet and meaningful.