Samson Tanimawo
Disaster Recovery Drills That Actually Work

Most DR Drills Are Theater

Someone schedules a meeting. A few senior engineers walk through a runbook. Everyone agrees "yes, we could do this" and marks it complete.

Then the real disaster hits and nobody remembers the procedure, the runbook is 2 years out of date, and half the backup systems don't work.

Real DR drills test whether your team can actually recover, not whether they can talk about recovery.

The Three Levels of DR Testing

Level 1: Tabletop

  • Walk through a scenario on paper
  • Identify missing runbooks
  • Find ownership gaps
  • Useful for: New team members, initial gap analysis
  • Limits: Doesn't prove anything actually works

Level 2: Partial Failure Test

  • Actually fail one component in staging
  • Watch recovery happen with real tools
  • Time the full recovery
  • Useful for: Validating specific runbooks
  • Limits: Staging ≠ production

Level 3: Full Production Drill

  • Actually fail a real production component
  • Customer-facing (announce a maintenance window if needed)
  • Full team responds as if it's real
  • Useful for: Proving you can recover
  • Limits: Scary, high-coordination

Most teams stop at Level 1. Good teams do Level 2 quarterly. The best teams do Level 3 twice a year.

A Real DR Drill Scenario

Scenario: Primary database becomes unreachable

Setup (48 hours before):

  • Schedule window with product team
  • Pick a time when customer impact is minimal
  • Brief the team: "Something will fail tomorrow, respond as normal"
  • Pre-position the incident commander

Execution:

  1. At T+0, block network access to primary database via iptables
  2. Start stopwatch
  3. Watch the team respond
  4. Do NOT intervene or give hints
  5. Document every action, every decision, every delay
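The fault-injection step can be scripted ahead of time so the drill starts cleanly and, just as importantly, can be reverted instantly. Below is a minimal sketch, assuming a PostgreSQL primary at a known IP; the host, port, and helper names are illustrative assumptions, and the dry-run default lets you review the iptables commands before a real drill.

```python
import subprocess

DB_HOST = "10.0.1.25"   # assumption: primary database IP
DB_PORT = 5432          # assumption: PostgreSQL default port

def iptables_rules(action: str) -> list[list[str]]:
    """Build iptables commands to block (-A) or unblock (-D) the primary DB."""
    if action not in ("-A", "-D"):
        raise ValueError("action must be '-A' (append) or '-D' (delete)")
    return [
        ["iptables", action, "OUTPUT", "-p", "tcp",
         "-d", DB_HOST, "--dport", str(DB_PORT), "-j", "DROP"],
    ]

def inject_failure(dry_run: bool = True) -> None:
    """At T+0: drop all outbound traffic to the primary database."""
    for cmd in iptables_rules("-A"):
        if dry_run:
            print("DRY RUN:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # requires root

def restore(dry_run: bool = True) -> None:
    """End of drill: delete the exact rule added above, restoring connectivity."""
    for cmd in iptables_rules("-D"):
        if dry_run:
            print("DRY RUN:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Keeping the add and delete commands symmetric means the abort path is one function call, which matters when a drill goes sideways.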

Metrics to capture:

  • Time to detection (first alert fires)
  • Time to engagement (first human acknowledges)
  • Time to diagnosis ("we know what's wrong")
  • Time to mitigation (customer impact stops)
  • Time to recovery (fully restored)
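The scribe's job during execution is to timestamp every milestone; the metrics fall out afterwards. A minimal sketch of that computation, where the event names are my assumptions rather than a standard schema:

```python
from datetime import datetime

def drill_metrics(events: dict[str, datetime]) -> dict[str, float]:
    """Return elapsed minutes from T+0 (failure injection) for each milestone."""
    t0 = events["failure_injected"]
    milestones = ["detected", "engaged", "diagnosed", "mitigated", "recovered"]
    return {m: (events[m] - t0).total_seconds() / 60
            for m in milestones if m in events}

# Usage: timestamps captured by the scribe during the drill
events = {
    "failure_injected": datetime(2024, 5, 1, 14, 0),
    "detected":  datetime(2024, 5, 1, 14, 1, 30),  # first alert fires
    "engaged":   datetime(2024, 5, 1, 14, 4),      # first human acknowledges
    "diagnosed": datetime(2024, 5, 1, 14, 12),
    "mitigated": datetime(2024, 5, 1, 14, 25),
    "recovered": datetime(2024, 5, 1, 15, 10),
}
print(drill_metrics(events))
```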

Scoring the Drill

Good scores:

  • Detection: under 2 minutes
  • Engagement: under 5 minutes
  • Diagnosis: under 15 minutes
  • Mitigation: under 30 minutes (for DR scenarios)
  • Recovery: depends on scenario

If any of these are 5x longer than target, you have a real problem.
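Scoring can be mechanical. A sketch that applies the targets above plus the 5x rule; the thresholds come from this article, but the function name and output labels are hypothetical:

```python
# Targets in minutes, from the scoring table above
TARGETS_MIN = {"detected": 2, "engaged": 5, "diagnosed": 15, "mitigated": 30}

def score_drill(actual_min: dict[str, float]) -> dict[str, str]:
    """Grade each phase: pass, miss, or critical (more than 5x the target)."""
    results = {}
    for phase, target in TARGETS_MIN.items():
        t = actual_min.get(phase)
        if t is None:
            results[phase] = "not measured"
        elif t <= target:
            results[phase] = "pass"
        elif t > 5 * target:
            results[phase] = "CRITICAL: over 5x target"
        else:
            results[phase] = f"miss (target {target} min)"
    return results

print(score_drill({"detected": 1.5, "engaged": 4.0,
                   "diagnosed": 40.0, "mitigated": 200.0}))
```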

What Always Goes Wrong

In every DR drill we run:

  1. Runbook is out of date. The one that worked 6 months ago has wrong commands now.
  2. Credentials don't work. The service account was rotated, nobody updated the runbook.
  3. Backup is untested. The restore fails because the backup is corrupted.
  4. Escalation paths are stale. The "DBA on-call" has left the company.
  5. Dependencies are missing. The recovery playbook assumes Service X is up, but Service X depends on the failed component.

Every drill uncovers 3-5 of these. Fix them, then drill again.
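Two of these failure modes (stale runbooks, stale escalation paths) are cheap to check automatically before the drill even starts. A sketch under stated assumptions: the 90-day cutoff, the runbook directory layout, and the roster/on-call maps are all illustrative, not a real tool.

```python
import os
import time

MAX_AGE_DAYS = 90
ACTIVE_ENGINEERS = {"alice", "bob"}           # assumption: current roster
ONCALL = {"dba": "carol", "network": "bob"}   # assumption: escalation map

def stale_runbooks(directory: str) -> list[str]:
    """Runbooks not modified within MAX_AGE_DAYS are drill-review candidates."""
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    return [f for f in os.listdir(directory)
            if f.endswith(".md")
            and os.path.getmtime(os.path.join(directory, f)) < cutoff]

def stale_escalations() -> list[str]:
    """Flag on-call roles pointing at people no longer on the roster."""
    return [role for role, person in ONCALL.items()
            if person not in ACTIVE_ENGINEERS]

print(stale_escalations())  # carol has left the company, so 'dba' is flagged
```

Automating these checks doesn't replace the drill; it just means the drill spends its time finding the failures you *couldn't* have predicted.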

Chaos Engineering vs. DR Drills

These are different. Chaos engineering is continuous (daily/weekly), usually automated, and deliberately small in blast radius. DR drills are scheduled, infrequent, and large-scale.

Chaos engineering answers: "Can we survive small failures routinely?"

DR drills answer: "Can we survive catastrophic failures at all?"

You need both.

The Blame-Free Rule

DR drills expose weaknesses. Those weaknesses are process problems, not people problems.

The rules:

  • No firing based on drill performance
  • No promotions based on "being the hero"
  • Focus on process gaps, not individual failures
  • The post-drill retrospective is 90% about fixing systems, 10% about training people

If the team is afraid of the drill, you'll never learn anything real.

Frequency That Actually Works

  • Level 1 (Tabletop): Monthly, 1 hour
  • Level 2 (Partial): Quarterly, 4 hours including retro
  • Level 3 (Full Production): Twice a year, full day

Also: after every major infrastructure change, drill the affected components.

The Hardest Lesson

The drill is the easy part. The hard part is making the fixes from the drill a priority when there's feature pressure.

We track "DR drill remediation items" as a standing OKR. If after two quarters the same items are still open, the SRE team has authority to freeze feature work until they're fixed.

Starting Point

If you've never done a DR drill:

  1. Pick one scenario (database failure, region outage, API gateway down)
  2. Schedule it for a quiet hour
  3. Run a tabletop first to find the obvious gaps
  4. Fix those gaps
  5. Run a partial failure test in staging
  6. Measure everything
  7. Run a retro focused on process
  8. Schedule the next one

Do this for three scenarios, and you'll have a DR program. Do it for ten, and you'll have a resilient company.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
