Republished from Medium.
At the end of a long day, I happened upon news on Twitter of the unfolding IT disaster at GitLab. They had mistakenly deleted a production database, losing hours of issues, users, comments, and more. I immediately felt heartsick for their team: it’s a very specific sensation when you delete crucial production data and then discover your backup systems won’t save you. There’s no floor beneath your feet, no undo button to reverse time, just a magnificent sense of terror.
The reaction to lost production data tends to follow a pattern. First come the recovery attempts, usually with partial success as the team finds alternate de facto backups (QA servers, incomplete data sets). Second is a stabilization period, where the team reckons with what, if anything, is actually gone. After that, a post-mortem (which must be blameless for anyone to learn and move forward) should extract learning from the incident and fill out a fuller story with every actor’s perspective.
Teams involved in the post-mortem will come up with remediation work, aimed at preventing this kind of failure from happening again. Everyone who’s had their fingers burned will be highly motivated to define this work, and make it happen. The team wants to end this period with a renewed sense of safety.
But there’s always the nagging question, the same kind of concern that slinks out of the security realm and nestles on the shoulders of engineers: What are we missing? Where’s the next failure going to happen? What are we doing—or not doing—right now that’s going to bite us in six months?
Post-mortems should be enlightening events, but they’re also by definition in reaction to a particular failure. It’s hard to find the boundaries of the post-mortem analysis, so that you can point beyond and ask, “But what about that thing we’re not thinking about yet?” We need an additional practice, one that can produce more failures—which is to say, more learning—in a controlled environment.
Larger engineering organizations with distributed systems can implement tools such as Netflix’s Chaos Monkey, an application that semi-randomly terminates servers in your system. This forces engineers to design resilient software, such that services can continue to operate as individual machines fail. In practice, the benefits of running Chaos Monkey increase as deployment and team sizes grow. Smaller organizations with simpler, less distributed systems get much less out of the Monkey approach.
DoSomething had a file-loss disaster of our own in the summer of 2016. The general summary is all too familiar, especially after reading about that GitLab incident: one of us restarted a file transfer process in production, late at night. The process started with an incorrect configuration (tread carefully with any tool that has a “destructive mode”), and diligently deleted hundreds of thousands of files before we were able to identify and stop it. Much worse, we discovered that our core backup process had been silently failing for some time. We eventually managed to recover 94% of the files from old backups and other sources, but the pain of the recovery effort, not to mention the unrecoverable 6%, has stayed with us.
During the aftermath, a wise CTO friend, CJ Rayhill of the Public Library of Science, suggested I look into running fire drills. Her experience consulting in highly regulated industries had shown her the value of running IT teams through scenarios of total destruction. They would find obvious changes to make, such as improving access to data backups. They’d also uncover more obscure failures, such as when no one knew how to change the corporate voicemail to update the outgoing message.
The nature of fire drills will of course vary among teams and industries. For us, fire drills started with two main concerns: lost or corrupted data sets, and lost deployed systems (servers and networks). Fire drills crucially include communication, documentation, and other human factors inside their scope, whereas Chaos Monkey-style failure engineering restricts its scope to software design.
Our infrastructure team set a goal to run two to three fire drills per quarter. To determine where to start, we picked a system that was critical, but fairly self-contained. We wanted to concentrate first on the mechanics and logistics of running a fire drill for the first time. We needed to get through some basic questions: What does it look like? Who’s involved? Who’s in charge? How long does it take? How do you define the disaster? How do you record what happens? What do you do when it’s over? And so on.
Our team has run enough of these to define some requirements and practices:
We’re going for a certain level of verisimilitude, with a cast of characters and clear boundaries around the action. So we started with a kind of film treatment:
System X registers as unavailable in New Relic. When we look at AWS, we see that the entire VPC has been destroyed: app servers, Redis cache, and the MongoDB cluster are all gone. The AWS region and availability zone are still available.
That’s the plot summary. Note that it also defines what’s in scope, and what’s not: In this case, AWS itself is out of scope for the exercise.
For a cast of characters, we wanted to specify not just who was involved, but what roles they should play. This serves as part of our institutional memory. It also lets us play with changing some of the parameters, or expanding the scope of the problem, in subsequent drills against the same system.
RACI is a useful tool for making sure teammates on a project communicate without stepping on each other’s toes. RACI defines the roles people play on a project: Responsible, Accountable, Consulted, and Informed.
In a fire drill for an org our size (which is to say, a staff of 60), it made sense to us to have a single person making the big decisions (that’s the A). More than one person could be doing the work (the R). Others need to give feedback and provide institutional memory, especially if we’re trying to recreate a system with incomplete documentation (the Cs). And some others just need to know when the drill starts, how it’s going, and when it’s over (the Is).
In practice, this can be simple: four R, A, C, I bullet points at the top of a doc. But deciding who is involved, and how, is crucial for providing clarity in a confusing situation. One of the most costly problems in an emergency is not knowing who needs to make critical judgment calls.
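Those four bullet points can also be captured in a form that checks itself. What follows is only a sketch, not the template we actually use: the `DrillPlan` class, its field names, and the people in it are all hypothetical, and the one rule it enforces (exactly one Accountable person, at least one Responsible) is our reading of the roles described above.

```python
from dataclasses import dataclass, field


@dataclass
class DrillPlan:
    """Header of a fire-drill doc: the scenario plus RACI roles (hypothetical shape)."""
    scenario: str
    out_of_scope: list[str]
    accountable: list[str]   # the A: the single person making the big decisions
    responsible: list[str]   # the Rs: the people doing the work
    consulted: list[str] = field(default_factory=list)  # the Cs
    informed: list[str] = field(default_factory=list)   # the Is

    def problems(self) -> list[str]:
        """Return human-readable issues with the role assignments."""
        issues = []
        if len(self.accountable) != 1:
            issues.append("exactly one Accountable person is required")
        if not self.responsible:
            issues.append("at least one Responsible person is required")
        return issues


# Placeholder names; the scenario text paraphrases the treatment above.
plan = DrillPlan(
    scenario="System X VPC destroyed: app servers, Redis cache, MongoDB cluster gone",
    out_of_scope=["AWS region or availability zone outage"],
    accountable=["alice"],
    responsible=["bob", "carol"],
    consulted=["dave"],
    informed=["all-staff channel"],
)
print(plan.problems() or "roles look clear")
```

The point of the check is the same as the point of the bullets: if nobody, or more than one person, is listed as the A, we want to find that out before the drill starts.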
How will we know when we’ve brought the system back? Are we integrating this system into a larger deployment to prove that it works normally, or running a test suite against it? Like an agile team’s “definition of done,” this step tells the team what success looks like, so they can end the fire drill.
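One way to make that definition concrete is to write each acceptance criterion as a named check and only call the drill done when every check passes. This is a rough sketch under our own assumptions: `evaluate`, `drill_is_done`, and the two stub checks are invented for illustration, and real checks would run a test suite or probe the rebuilt system instead of returning canned values.

```python
from typing import Callable

# Each criterion is a name plus a zero-argument check that returns True on success.
Criterion = tuple[str, Callable[[], bool]]


def evaluate(criteria: list[Criterion]) -> dict[str, bool]:
    """Run every check; a check that raises counts as a failure."""
    results = {}
    for name, check in criteria:
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def drill_is_done(results: dict[str, bool]) -> bool:
    """The drill can end only when there are criteria and all of them passed."""
    return bool(results) and all(results.values())


def unreachable_health_check() -> bool:
    # Stand-in for a probe that cannot reach the restored system.
    raise ConnectionError("could not reach restored system")


criteria = [
    ("smoke test suite passes", lambda: True),            # stub check
    ("health endpoint responds", unreachable_health_check),
]
results = evaluate(criteria)
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
print("done" if drill_is_done(results) else "keep going")
```

Treating an empty criteria list as "not done" is deliberate: a drill with no definition of success should not be able to succeed by accident.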
We use the artifice of a fire drill to make some bets about how we’ll react. How do we think we’ll restore systems? What are we worried about? We can compare this analysis to what actually happens, and extract important information from the deviations between plan and reality.
Not every team may take this step, as they may want the test to include how well the team analyzes the situation and creates a resolution path in the heat of the moment.
Rayhill emphasizes the “authenticity of the drill(s)…. Being able to simulate a real drill with real actors and real consequences is the only way to challenge your organization to be as ready as they can be.”
Let’s say a system goes down, and the backups are unavailable—in other words, the actual incident that DoSomething experienced in July 2016. How long should the fire drill go on? Does the team keep banging its head against the wall for twenty-four hours, or call it quits after eight? It’s especially important to set a time limit for when drills go wrong, so this doesn’t become a demoralizing practice.
In the case of an absolute failure to restore, end the fire drill, fix the fatal problems, and reschedule the drill.
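The time limit and the stop rule can be written down before the drill begins. A minimal sketch, assuming an eight-hour limit (one of the two figures floated above, not a recommendation); the `drill_status` function and its status strings are made up for this example.

```python
from datetime import datetime, timedelta


def drill_status(started_at: datetime, now: datetime,
                 time_limit: timedelta, restored: bool) -> str:
    """Decide whether the drill has passed, should continue, or should stop."""
    if restored:
        return "passed"
    if now - started_at >= time_limit:
        # Absolute failure to restore: end the drill rather than grind on.
        return "stop: fix fatal problems and reschedule"
    return "in progress"


start = datetime(2016, 7, 1, 9, 0)   # arbitrary example start time
limit = timedelta(hours=8)           # assumed limit for illustration

print(drill_status(start, start + timedelta(hours=3), limit, restored=False))
print(drill_status(start, start + timedelta(hours=8), limit, restored=False))
```

Whoever is Accountable for the drill would be the natural person to call the stop when the limit is reached.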
In the heat of the moment, we may make a series of failed attempts, or consider approaches that we soon abandon. We want to write all of these down as a timeline, including the dead ends. This is the series of events we can examine later to find holes in our decisions, systems, and documentation.
The important thing is to document exactly what happened and then review it in your post-mortem to make incremental improvements. — CJ Rayhill
At DoSomething, we’re likely coordinating the response as usual in Slack, but we’ve typically kept response notes in a shared Google Doc, so that everyone can contribute and review. As institutional memory, these notes are gold: save them and keep them organized.
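A shared doc is all this needs, but the shape of a useful timeline entry is worth spelling out: when, who, and what was tried, including the attempts that failed. The sketch below is hypothetical (the `Timeline` class is not a tool we run, and the names and events are invented); it only shows the kind of record we mean.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Timeline:
    """Append-only log of what happened during a drill (illustrative only)."""
    entries: list[tuple[datetime, str, str]] = field(default_factory=list)

    def note(self, who: str, what: str, when: Optional[datetime] = None) -> None:
        """Record an event; defaults to the current UTC time."""
        self.entries.append((when or datetime.now(timezone.utc), who, what))

    def render(self) -> str:
        """Return the entries in time order, one line each, ready to paste into a doc."""
        ordered = sorted(self.entries, key=lambda entry: entry[0])
        return "\n".join(f"{when:%H:%M} {who}: {what}" for when, who, what in ordered)


timeline = Timeline()
# Invented events with explicit timestamps so the output is reproducible.
timeline.note("carol", "trying older backup",
              when=datetime(2016, 7, 1, 14, 20, tzinfo=timezone.utc))
timeline.note("bob", "restore from latest snapshot failed",
              when=datetime(2016, 7, 1, 14, 5, tzinfo=timezone.utc))
print(timeline.render())
```

Entries are sorted at render time, so people can add notes late or out of order and the review still reads chronologically.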
After the event, analyze as you would an actual incident. Look at the timeline and all of the artifacts, including the rebuilt system. Did we get it back up and running? Where were the interesting gaps? If we didn’t get the system to pass our acceptance criteria, what are the most crucial pieces to get in place before we rerun the fire drill?
We’re shooting for a simple rule: we must address crucial gaps uncovered by one fire drill before we start another. This helps prevent us from making the same mistakes in two consecutive fire drills. It also guards against the team getting overwhelmed by its backlog.
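The rule is simple enough to state as a check over the list of gaps from the last post-mortem. A sketch only: the `Gap` record, the `crucial` flag, and the example gaps are our own illustration, and deciding which gaps count as crucial remains a judgment call for the team.

```python
from dataclasses import dataclass


@dataclass
class Gap:
    """A weakness uncovered by a drill (hypothetical record shape)."""
    description: str
    crucial: bool
    resolved: bool = False


def blocking_gaps(gaps: list[Gap]) -> list[Gap]:
    """Crucial gaps that are still open and therefore block the next drill."""
    return [gap for gap in gaps if gap.crucial and not gap.resolved]


def can_schedule_next_drill(gaps: list[Gap]) -> bool:
    """True only when no crucial gap from the previous drill remains open."""
    return not blocking_gaps(gaps)


# Invented examples of the kinds of gaps a drill might surface.
gaps = [
    Gap("backup job fails silently", crucial=True),
    Gap("runbook has an outdated screenshot", crucial=False),
]
print(can_schedule_next_drill(gaps))
gaps[0].resolved = True
print(can_schedule_next_drill(gaps))
```

Non-crucial gaps stay on the backlog without blocking anything, which is what keeps the rule from stalling the whole practice.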
Other teams may decide not to enforce this serializing of remedies and subsequent drills. Rayhill allows that “the remedy for a certain issue might be costly to the organization.” If her team determined the cost was too high, they would still “document our reasoning so that it was available to the auditors.”
The primary principle of fire drills is to use a lab environment to create as much information as possible about failures. Use drills to discover weak spots, and subsequently to pound away at those weak spots until they become stronger. Your weaknesses might be around communication and decision-making, freshness and availability of backups, alerts and monitoring, documentation, unnecessary complexity, or automation.
When I…started this process with one organization, we couldn’t even get through the first 3 steps the first time we ran a drill! But we fixed what went wrong and kept at it until we could finish an entire exercise. — CJ Rayhill
Most likely, teams that run fire drills will discover failures across a number of categories. Each of these discoveries should cause a celebration, because the team that finds a problem in the lab can fix it before it bites them in the wild.
Google runs these drills on a massive scale through a practice called DiRT: Disaster Recovery Testing.