DEV Community

Hannah Culver for Blameless

Top Practices for Runbook Automation

What is runbook automation?

Runbooks, also known as playbooks, are documents that walk you through a certain task with specific steps. For example, a runbook for spinning up a new server might ask some questions about the purpose of the server and its estimated load, then lead you to the appropriate instructions and settings. Runbooks ease the cognitive load of these common tasks by clearly outlining the process for each.

Runbook automation eliminates toil further by having these steps run through software triggered by certain situations (such as exceeding a threshold in your error budget policy), minimizing the amount of input you need to provide. This requires tools to execute each step, as well as a tool to orchestrate the overall runbook and determine which steps are necessary.
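As a rough illustration of this idea, the Python sketch below checks how much of an error budget has been consumed and runs only the steps whose conditions hold. The `Step` class, the `run_runbook` function, the 75% and 100% thresholds, and the sample request counts are all hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]
    # Predicate deciding whether this step is needed for the given context.
    needed: Callable[[dict], bool] = lambda ctx: True


def error_budget_consumed(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget used so far (1.0 means fully spent)."""
    allowed_failures = (1.0 - slo_target) * total
    actual_failures = total - good
    return actual_failures / allowed_failures if allowed_failures else float("inf")


def run_runbook(steps: List[Step], ctx: dict) -> List[str]:
    """Orchestrate the runbook: run only the steps that apply, in order."""
    executed = []
    for step in steps:
        if step.needed(ctx):
            step.action(ctx)
            executed.append(step.name)
    return executed


# Hypothetical numbers: a 99.9% SLO with 80 failed requests out of 100,000.
ctx = {
    "budget_used": error_budget_consumed(0.999, good=99_920, total=100_000),
    "notes": [],
}
steps = [
    Step("freeze_deploys",
         lambda c: c["notes"].append("deploys frozen"),
         needed=lambda c: c["budget_used"] >= 0.75),
    Step("page_on_call",
         lambda c: c["notes"].append("on-call paged"),
         needed=lambda c: c["budget_used"] >= 1.0),
]
executed = run_runbook(steps, ctx)
```

With these sample numbers roughly 80% of the budget is spent, so only the first step runs. In a real setup the actions would call out to the tools that execute each step.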

Automated runbooks can be a powerful tool for saving time and improving consistency. We’ll look at five best practices for getting the most out of runbook automation, some tools on the market that can help you implement them, and how to integrate runbook automation into a complete SRE solution.

A runbook template: Key steps to consider

  1. Understand and map your system architecture: To create runbooks that automatically use a variety of services, you’ll need to understand how each service functions and how they connect. Map these connections and include information on how automation tools can control each service to lay a solid foundation for future runbooks.
  2. Identify the right service owners: Once you’ve mapped out your architecture, you’ll need a repository of the owners of each service. This will help future runbook authors contact the right people for collaboration, advice, and sign-offs. Complex automated runbooks will work through many service areas, so involving the owners and experts of each space is a must.
  3. Lay out key procedures and checklist tasks: Common tasks often have common steps: subtask procedures like auditing, version control, and deployment are likely to overlap. Identify these key steps and clearly define their processes, then compile them into a list. Future runbook authors should use steps from this list when possible for consistency.
  4. Identify methods to bake into automation: Now that you have a list of key procedures that recur in many tasks, you also have a great starting point for finding automation opportunities. Look for things that can be scripted, and ways to have scripts trigger subsequent scripts. Make your automated steps modular so they can be baked into a variety of runbooks.
  5. Continue refining, learning, and improving: Resources like the architecture map, service owner repository, and list of common tasks aren’t to be created once and left untouched. Include updating these resources as a checklist task on procedures that would modify them, and also have regular checks to ensure they’re up to date. When you revisit them, take the opportunity to learn from them again, looking for new opportunities to automate and optimize.
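One way to picture steps 2 through 4 together is a shared library of named steps plus a lookup of service owners, from which individual runbooks are assembled. This is only a sketch under assumed names: `STEP_LIBRARY`, `SERVICE_OWNERS`, the service names, and the team names are invented for illustration.

```python
from typing import Callable, Dict, List

# Shared library of reusable steps, keyed by name.
STEP_LIBRARY: Dict[str, Callable[[dict], str]] = {}

# Hypothetical service-owner repository.
SERVICE_OWNERS = {"checkout-api": "payments-team", "search": "discovery-team"}


def step(name: str):
    """Decorator that registers a function as a reusable runbook step."""
    def register(func):
        STEP_LIBRARY[name] = func
        return func
    return register


@step("audit")
def audit(ctx: dict) -> str:
    return f"audited {ctx['service']}"


@step("deploy")
def deploy(ctx: dict) -> str:
    return f"deployed {ctx['service']} at {ctx['version']}"


@step("notify_owner")
def notify_owner(ctx: dict) -> str:
    owner = SERVICE_OWNERS.get(ctx["service"], "unknown-owner")
    return f"notified {owner}"


def run(step_names: List[str], ctx: dict) -> List[str]:
    """Run a runbook defined as an ordered list of step names."""
    unknown = [n for n in step_names if n not in STEP_LIBRARY]
    if unknown:
        raise KeyError(f"steps not in the shared library: {unknown}")
    return [STEP_LIBRARY[n](ctx) for n in step_names]


# A runbook is just a list of names drawn from the shared library.
release_runbook = ["audit", "deploy", "notify_owner"]
results = run(release_runbook, {"service": "search", "version": "1.4.2"})
```

Because a runbook here is only a list of step names, a new one can reuse audited steps without copying their code, and a misspelled or missing step fails before anything runs.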

How to write simple runbooks for complex workflows

One of the most powerful features of automated runbooks or playbooks is their ability to navigate long conditional paths to complete complex tasks. Consider a runbook created to update the settings for a variety of development environments. This could require the automating tool to check many variables and deploy different changes for each combination, quickly creating a tree with many branches. Manually determining which branch to follow can be a tedious challenge, but the automated runbook finds the correct branch with ease.
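To make the branching concrete, here is a small sketch of how such a runbook might pick changes for a development environment. The environment fields (`os`, `language`, `uses_gpu`) and the change descriptions are made up for this example.

```python
def choose_changes(env: dict) -> list:
    """Walk the decision tree and collect the changes that apply to one environment."""
    changes = []

    # First branch: operating system.
    if env["os"] == "linux":
        changes.append("raise file watcher limit")
    elif env["os"] == "macos":
        changes.append("install command line tools")

    # Second branch: language toolchain, with a nested condition.
    if env["language"] == "python":
        changes.append("pin interpreter version")
        if env.get("uses_gpu"):
            changes.append("install GPU drivers")
    elif env["language"] == "node":
        changes.append("configure package registry")

    return changes


linux_ml = choose_changes({"os": "linux", "language": "python", "uses_gpu": True})
mac_web = choose_changes({"os": "macos", "language": "node"})
```

Even with two variables and one nested check there are already several distinct paths, which is why the number of branches grows quickly as more settings are added.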

You’ll inevitably need to change your runbook, so you’ll need some way to cut through this complexity. In order for other developers to update and refine your automated runbook, you’ll need some representation of how it actually works. This could take the form of a visual aid, like a flowchart, that shows you the steps and pathways at a glance with embedded links to the code executed at each step.
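One possible way to keep such a flowchart in sync with the runbook is to generate it from the runbook's own structure. The sketch below emits Graphviz DOT text, with each node carrying a link to the code it runs. The step names and the `example.com` URLs are placeholders, and `to_dot` is a hypothetical helper rather than part of any tool mentioned here.

```python
def to_dot(name: str, nodes: dict, edges: list) -> str:
    """Render runbook steps and transitions as Graphviz DOT text.

    nodes maps a step name to a URL for the code executed at that step;
    edges is a list of (source, destination, condition label) tuples.
    """
    lines = [f"digraph {name} {{"]
    for node_id, url in nodes.items():
        lines.append(f'  "{node_id}" [URL="{url}"];')
    for src, dst, label in edges:
        lines.append(f'  "{src}" -> "{dst}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


# Placeholder steps and links, for illustration only.
nodes = {
    "check_disk": "https://example.com/runbooks/check_disk.py",
    "clean_tmp": "https://example.com/runbooks/clean_tmp.py",
    "done": "https://example.com/runbooks/done.py",
}
edges = [
    ("check_disk", "clean_tmp", "over 90% full"),
    ("check_disk", "done", "otherwise"),
    ("clean_tmp", "done", "always"),
]
dot = to_dot("disk_runbook", nodes, edges)
```

The resulting text can be rendered to an image with Graphviz, so the diagram is regenerated whenever the runbook changes instead of being redrawn by hand.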

Another option is to have a simple automating language that dictates the overall structure of the runbook. Ansible provides automation tools that are controlled by instructions in a simple language that’s understandable without any special programming knowledge. This helps your runbooks remain easy to parse and update, even when they contain many steps and connections.

Make creating new automated runbooks easy

To get the most out of runbook automation, encourage developers to implement it wherever possible to create guardrails around specific processes. Never assume that an area of development and operations cannot be automated: even in the most nuanced projects, you’ll find simpler subtasks that can be. Likewise, consider automating even seemingly novel tasks. Your investment in automation can pay big dividends if these tasks do end up recurring.

To encourage this automation mentality, remove as many barriers as you can to creating and implementing new automated runbooks. Ideally, creating a new automated runbook for a task shouldn’t take much longer or need many more resources than just completing the task manually. Rundeck, for example, allows users to quickly create workflows, integrating existing scripts and tools. It prides itself on being “automation for automation,” allowing you to automate as quickly as possible.

Of course, like any other aspect of development, automated runbooks should be observed and reviewed on a regular basis. The more runbooks you have running around, the more essential it is to stay on top of what they’re doing. You can help yourself out by having your automated runbooks log themselves, providing information on when they run, what choices they make, and what resources they use. This small overhead is another rewarding investment.
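Self-logging can be as light as a decorator wrapped around each runbook. The following sketch uses Python's standard `logging` module to record when a runbook runs, how long it takes, and which choices it makes. The `logged_runbook` decorator and the `rotate_logs` example are hypothetical, and resource usage is not tracked here beyond elapsed time.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("runbooks")


def logged_runbook(func):
    """Log start, duration, and the decisions a runbook records while running."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        decisions = []  # the runbook appends each choice it makes to this list
        started = time.monotonic()
        logger.info("runbook %s started", func.__name__)
        try:
            return func(*args, decisions=decisions, **kwargs)
        finally:
            elapsed = time.monotonic() - started
            logger.info("runbook %s finished in %.3fs; decisions=%s",
                        func.__name__, elapsed, decisions)
    return wrapper


@logged_runbook
def rotate_logs(disk_used_pct: float, decisions: list) -> str:
    # Hypothetical 80% threshold for deciding whether to archive.
    if disk_used_pct > 80:
        decisions.append("compress_and_archive")
        return "archived"
    decisions.append("skip")
    return "skipped"


outcome = rotate_logs(91)
```

The `finally` block means the closing log line is written even if the runbook raises, so a failed run still leaves a record of the choices made up to that point.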

Integrate runbook automation into every aspect of DevOps

There are opportunities to automate and save time in even the most nuanced aspects of development and operations. To empower this, your automated runbooks should hook into every tool in your stack. One route to this connection is to have tools that can be easily controlled through things like external scripts, allowing the orchestrating runbook automation tool to deploy custom instructions.
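For tools that expose a command line, the orchestrating runbook can drive them as external processes. Below is a minimal sketch using Python's standard `subprocess` module. The `run_tool` wrapper is hypothetical, and the command being run is a stand-in (a one-line Python script) rather than a real operations tool.

```python
import subprocess
import sys


def run_tool(command: list, timeout_s: int = 30) -> dict:
    """Run an external command and return its exit status and captured output."""
    completed = subprocess.run(
        command,
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode output as text
        timeout=timeout_s,    # raise TimeoutExpired if the tool hangs
        check=False,          # report failures in the result instead of raising
    )
    return {
        "ok": completed.returncode == 0,
        "stdout": completed.stdout.strip(),
        "stderr": completed.stderr.strip(),
    }


# Stand-in for a real tool: a tiny script run with the current interpreter.
result = run_tool([sys.executable, "-c", "print('cache cleared')"])
```

Returning a plain dictionary keeps the orchestrator's branching simple: it can check `ok` and decide whether to continue, retry, or stop.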

Another route is to choose an orchestrating tool that has specific integrations with the rest of your environment. Microsoft’s Azure Automation works with every aspect of an Azure development environment, allowing Azure customers to intuitively create powerful instructions for every part of their DevOps solution.

Have automated runbooks for reliability events

One of the most helpful ways to use automated runbooks is in incident response, increasing the speed and consistency of resolution. Create automated runbooks for your common troubleshooting processes, and have them trigger in response to outages, extreme load, or other events that threaten your SLOs.
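A simple way to wire this up is a lookup from alert type to runbook. In this sketch, the alert kinds, the service name, and the two runbook functions are invented for illustration; a real setup would receive alerts from a monitoring system.

```python
def restart_service(alert: dict) -> str:
    return f"restarted {alert['service']}"


def scale_out(alert: dict) -> str:
    return f"added capacity to {alert['service']}"


# Map each known alert kind to the runbook that handles it.
RUNBOOKS_BY_ALERT = {
    "outage": restart_service,
    "extreme_load": scale_out,
}


def handle_alert(alert: dict) -> str:
    """Trigger the matching runbook, or hand off to a person if none exists."""
    runbook = RUNBOOKS_BY_ALERT.get(alert["kind"])
    if runbook is None:
        return f"no runbook for {alert['kind']}; paging on-call"
    return runbook(alert)


handled = handle_alert({"kind": "extreme_load", "service": "checkout-api"})
unhandled = handle_alert({"kind": "data_corruption", "service": "checkout-api"})
```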

But remember that automated runbooks can only do so much. SRE principles teach us that there will always be incidents that fall outside of our expectations, so it’s impossible to have a runbook for everything that could go wrong. Runbooks will still be useful in these instances, though: the audit trails they generate of what didn’t work provide a great starting point to determine how to triage.
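The audit trail can be sketched as a list of attempts and outcomes that the runbook builds as it goes. Here, `try_remediations` is a hypothetical helper, and the two remediation functions and the health check are stubs that simulate an incident no runbook step can fix.

```python
from datetime import datetime, timezone


def try_remediations(remediations: list, is_healthy) -> tuple:
    """Try each remediation in order, recording what was attempted and what happened."""
    trail = []
    for name, action in remediations:
        try:
            action()
            outcome = "resolved" if is_healthy() else "no effect"
        except Exception as exc:
            outcome = f"failed: {exc}"
        trail.append({
            "step": name,
            "outcome": outcome,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        if outcome == "resolved":
            return True, trail
    return False, trail  # nothing worked: hand the trail to a responder


# Stubs simulating an incident that falls outside the runbook's expectations.
def restart_worker():
    raise ConnectionError("connection refused")


def clear_cache():
    pass


resolved, trail = try_remediations(
    [("restart_worker", restart_worker), ("clear_cache", clear_cache)],
    is_healthy=lambda: False,
)
```

When `resolved` comes back false, the trail already tells the responder which fixes were tried and how each one ended, so triage can start from there rather than from scratch.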

We mentioned earlier the importance of regular reviews to refine your runbooks and incident playbooks, and a good SRE solution will support you here too. The resource monitoring in an SRE solution lets you measure the impact of your runbooks, highlighting areas to refine and optimize. Likewise, monitoring of development resources can suggest areas that could benefit from further automation.

Blameless helps you get the most out of automated runbooks

The Blameless SRE platform provides many tools to help you get the most from runbook automation:

Blameless checklists and reminders create guardrails to help your team follow the steps in your runbook or playbook.
They also encourage the creation of new automated runbooks by highlighting the procedural logic behind complex tasks.
Key runbook activities are automatically captured in Blameless’ incident retrospectives, also known as postmortems, allowing teams to focus on building more insightful incident narratives.
Blameless reliability insights can highlight areas where certain incidents or workflows can benefit from more automation.

Using automated runbooks properly will accelerate your DevOps processes, and using them in an SRE system will help you go faster, safer.

Top comments (1)

Ben Halpern

This is great