Engineer On-Call: The Dos and Don'ts

#devops #sre #engineering #cloud

Have you ever woken up at 2:00 AM to an on-call alert and asked yourself, “Why am I getting an alert for this?” or “Why hasn’t this been automated yet?”

If the answer is yes, which I’m sure it is for most of us, you’re not wrong.

On-call and alerts aren’t just about putting out fires; they’re about preventing fires. In this blog post, you’re going to learn how to be on-call effectively, avoid burning out, and automate a ton of alerts.

The Purpose Of On-Call

Many organizations believe that the reason engineers are on-call is to answer alerts throughout the day and night. Although that’s somewhat true, that’s maybe 20% of the actual reason.

The true purpose of being on-call is to fix the unknown issues. Notice how the keyword there is unknown.

What this means is that any known issue should not be an on-call issue. For example, let’s say that an application reaches an 80% memory threshold before scaling out to another server or container. If that’s a known issue, it should be fixed with automation or within the application itself, not by an engineer waking up at 2:00 AM to fix it.
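As a rough illustration of the memory example, here is a minimal sketch of the decision an automated check could make instead of paging someone. The function name, the `ScaleDecision` type, and the instance limits are hypothetical; only the 80% threshold comes from the example above.

```python
from dataclasses import dataclass

# Threshold from the 80% memory example; everything else here is hypothetical.
MEMORY_THRESHOLD_PERCENT = 80.0


@dataclass
class ScaleDecision:
    scale_out: bool
    reason: str


def decide_scale_out(memory_percent: float, current_instances: int, max_instances: int) -> ScaleDecision:
    """Decide whether to add capacity automatically instead of paging a human."""
    if memory_percent < MEMORY_THRESHOLD_PERCENT:
        return ScaleDecision(False, "memory below threshold")
    if current_instances >= max_instances:
        # No room left to scale: this is the case that still deserves a page.
        return ScaleDecision(False, "at max capacity, page the on-call engineer")
    return ScaleDecision(True, "memory at or above threshold, add one instance")


if __name__ == "__main__":
    print(decide_scale_out(85.0, current_instances=2, max_instances=5))
```

The known case (memory is high and there is room to grow) is handled without a human. Only the case the automation can’t handle reaches the on-call engineer.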

The goal of being on-call is to fix the issues that you’re either not aware of yet or that you haven’t thought of yet. If you’re on-call and you keep fixing the same issues over and over again, read on...
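One way to spot the issues you keep fixing is to count how often each alert fires. The sketch below assumes you can export your alert history as a plain list of alert names; the names and the cutoff of three occurrences are made up for illustration.

```python
from collections import Counter


def automation_candidates(alert_names, min_occurrences=3):
    """Return (alert name, count) pairs for alerts that keep recurring, most frequent first."""
    counts = Counter(alert_names)
    return [(name, count) for name, count in counts.most_common() if count >= min_occurrences]


if __name__ == "__main__":
    # Hypothetical alert history exported from an alerting tool.
    history = [
        "high-memory", "disk-full", "high-memory", "cert-expiry",
        "high-memory", "disk-full", "disk-full", "high-memory",
    ]
    for name, count in automation_candidates(history):
        print(f"{name}: fired {count} times")
```

Anything that shows up on this list repeatedly is a known issue and a candidate for automation rather than another page.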

Scaling On-Call

As time goes on, systems get larger and more numerous. Application code bases grow. Databases expand. With all of that, teams most likely grow as well. The problem is typically that practices don’t change. Those same issues that keep waking you up weekly at 2:00 AM probably aren’t changing. Instead, management may be looking for the wrong way out: expanding the team so more people can burn out by waking up at 2:00 AM.

Instead of scaling out the team for these specific issues, organizations should be scaling out their systems and overall architecture.

Systems come in all shapes and sizes:

Virtual machines
Bare metal
Serverless
Containers (like Docker)
Orchestration (like Kubernetes)

As these systems expand quickly to keep up with application growth, they need repeatability and automation.

As your application requires more resources, on-call teams should spend 50% of their time putting out fires and the other 50% automating the on-call alerts.

The automation could be anything from scripts to platforms put in place to support it. In any case, you should never spend more than 50% of your time putting out fires.
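To check yourself against the 50% rule, the arithmetic is simple enough to script. This is a sketch that assumes you track hours spent on each kind of work; the function names and the sample hours are hypothetical.

```python
def firefighting_share(firefighting_hours: float, automation_hours: float) -> float:
    """Fraction of on-call time spent putting out fires (0.0 when no time is logged)."""
    total = firefighting_hours + automation_hours
    if total == 0:
        return 0.0
    return firefighting_hours / total


def over_budget(firefighting_hours: float, automation_hours: float, limit: float = 0.5) -> bool:
    """True when firefighting takes more than the allowed share of time."""
    return firefighting_share(firefighting_hours, automation_hours) > limit


if __name__ == "__main__":
    # Hypothetical week: 30 hours of firefighting, 10 hours of automation work.
    print(firefighting_share(30, 10))  # 0.75
    print(over_budget(30, 10))         # True
```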

Automating Your On-Call Alerts

Some of the issues that you want to automate may be easy to fix. For example, maybe you have an application running in an AWS Auto Scaling group or an Azure virtual machine scale set. The fix could simply be to add another EC2 instance or virtual machine to the group, which takes a few lines of automation code.
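For the AWS case, those few lines might look something like the sketch below. It assumes a client object that exposes the same `describe_auto_scaling_groups` and `set_desired_capacity` calls as boto3’s Auto Scaling client. The function name `scale_out_by_one` is hypothetical.

```python
def scale_out_by_one(autoscaling_client, group_name: str) -> int:
    """Raise the group's desired capacity by one, capped at its MaxSize.

    Returns the desired capacity after the call.
    """
    response = autoscaling_client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    groups = response["AutoScalingGroups"]
    if not groups:
        raise ValueError(f"auto-scaling group not found: {group_name}")

    group = groups[0]
    current = group["DesiredCapacity"]
    # Never ask for more instances than the group is allowed to run.
    target = min(current + 1, group["MaxSize"])

    if target > current:
        autoscaling_client.set_desired_capacity(
            AutoScalingGroupName=group_name,
            DesiredCapacity=target,
        )
    return target
```

With boto3 installed and credentials configured, you would pass in `boto3.client("autoscaling")` and the name of your own group. The client is taken as a parameter so the logic can be tested against a fake before it ever touches a real environment.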

The question becomes: how do I deploy that automation code?

The answer is with automated runbooks.

There are a few ways to set up automated runbooks, but two of my favorites are PagerDuty Rundeck and xMatters.

PagerDuty Rundeck is great if you’re already using PagerDuty for your on-call and alerting. It’s a separate cost, so you’ll have to pay extra, but it’s definitely worth it. Rundeck’s approach is “write some automation code and we’ll run it when the alert goes off.”
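To make the “run code when the alert goes off” idea concrete, here is a generic sketch of mapping alert names to runbook functions. This is not Rundeck’s API. The registry, the decorator, and the alert fields are all hypothetical and only illustrate the pattern.

```python
from typing import Callable, Dict

Runbook = Callable[[dict], str]

# Hypothetical registry mapping alert names to the code that handles them.
RUNBOOKS: Dict[str, Runbook] = {}


def runbook(alert_name: str):
    """Register a function as the automated runbook for an alert name."""
    def register(func: Runbook) -> Runbook:
        RUNBOOKS[alert_name] = func
        return func
    return register


@runbook("high-memory")
def add_capacity(alert: dict) -> str:
    # A real runbook would call the scaling automation here.
    return f"scaled out {alert['service']}"


def handle_alert(alert: dict) -> str:
    """Run the matching runbook, or fall back to paging a human for unknown alerts."""
    handler = RUNBOOKS.get(alert["name"])
    if handler is None:
        return "page on-call"
    return handler(alert)


if __name__ == "__main__":
    print(handle_alert({"name": "high-memory", "service": "checkout"}))
    print(handle_alert({"name": "never-seen-before", "service": "checkout"}))
```

Known alerts get handled by code, and anything without a registered runbook still reaches the on-call engineer.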

xMatters can be used for on-call alerts as well if you don’t already have a system in place, and it can also be used for automated runbooks. xMatters is definitely more low-code oriented with its automated runbooks. It has a UI-workflow-style approach where you essentially drag and drop the jobs that you want to run. For example, maybe you want a Slack message to be sent when an alert goes off, so you’d drag the Slack icon onto the UI workflow and then configure how it authenticates to your Slack environment.
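For comparison, here is roughly what that Slack step does if you write it by hand. This is a sketch that assumes a Slack incoming webhook, which accepts a JSON body with a `text` field. It is not how xMatters implements the step, and the function names and message wording are made up.

```python
import json
import urllib.request


def build_slack_message(alert_name: str, service: str) -> dict:
    """Build the JSON body for a Slack incoming webhook."""
    return {"text": f"Alert fired: {alert_name} on {service}"}


def post_to_slack(webhook_url: str, message: dict) -> int:
    """POST the message to the webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    # Printing only; posting requires a real webhook URL from your Slack workspace.
    print(build_slack_message("high-memory", "checkout"))
```

A low-code workflow hides the request and authentication details behind the drag-and-drop step, which is the tradeoff compared to writing and maintaining code like this yourself.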
