Most DR Drills Are Theater
Someone schedules a meeting. A few senior engineers walk through a runbook. Everyone agrees "yes, we could do this" and marks it complete.
Then the real disaster hits and nobody remembers the procedure, the runbook is 2 years out of date, and half the backup systems don't work.
Real DR drills test whether your team can actually recover, not whether they can talk about recovery.
The Three Levels of DR Testing
Level 1: Tabletop
- Walk through a scenario on paper
- Identify missing runbooks
- Find ownership gaps
- Useful for: New team members, initial gap analysis
- Limits: Doesn't prove anything actually works
Level 2: Partial Failure Test
- Actually fail one component in staging
- Watch recovery happen with real tools
- Time the full recovery
- Useful for: Validating specific runbooks
- Limits: Staging ≠ production
Level 3: Full Production Drill
- Actually fail a real production component
- Customer-facing (announce a maintenance window if needed)
- Full team responds as if it's real
- Useful for: Proving you can recover
- Limits: Scary and coordination-heavy
Most teams stop at Level 1. Good teams do Level 2 quarterly. The best teams do Level 3 twice a year.
A Real DR Drill Scenario
Scenario: Primary database becomes unreachable
Setup (48 hours before):
- Schedule window with product team
- Pick a time when customer impact is minimal
- Brief the team: "Something will fail tomorrow, respond as normal"
- Pre-position the incident commander
Execution:
- At T+0, block network access to the primary database via iptables (see the sketch after this list)
- Start stopwatch
- Watch the team respond
- Do NOT intervene or give hints
- Document every action, every decision, every delay
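As promised above, a minimal fault-injection sketch in Python. It assumes a Linux drill runner with sudo rights and standard iptables; DB_HOST and DB_PORT are placeholders, not values from any real environment.

```python
# Minimal fault-injection sketch for the T+0 trigger. Assumes a Linux
# host where the drill runner has sudo; host/port are placeholders.
import datetime
import subprocess

DB_HOST = "10.0.1.25"   # hypothetical primary database IP
DB_PORT = "5432"        # hypothetical Postgres port

def run(cmd: list[str]) -> None:
    """Run a command, raising if it fails so a broken drill setup is obvious."""
    subprocess.run(cmd, check=True)

def inject_failure() -> datetime.datetime:
    """Drop all outbound traffic to the primary database and return T+0."""
    run(["sudo", "iptables", "-A", "OUTPUT", "-d", DB_HOST,
         "-p", "tcp", "--dport", DB_PORT, "-j", "DROP"])
    return datetime.datetime.now(datetime.timezone.utc)

def restore() -> None:
    """Delete the DROP rule after the drill ends (-D removes a matching rule)."""
    run(["sudo", "iptables", "-D", "OUTPUT", "-d", DB_HOST,
         "-p", "tcp", "--dport", DB_PORT, "-j", "DROP"])

if __name__ == "__main__":
    t0 = inject_failure()
    print(f"T+0: {t0.isoformat()} -- stopwatch running, do not intervene")
```

Keeping the restore path in the same script matters: ending the drill should be one command, not an archaeology session.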
Metrics to capture (a capture sketch follows this list):
- Time to detection (first alert fires)
- Time to engagement (first human acknowledges)
- Time to diagnosis ("we know what's wrong")
- Time to mitigation (customer impact stops)
- Time to recovery (fully restored)
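One way to make these metrics fall out automatically is to have the scribe timestamp each event as it happens. A minimal sketch; the class and event names are illustrative, not from any real tooling:

```python
# Capture drill events as wall-clock timestamps; metrics become deltas.
from datetime import datetime, timezone

class DrillLog:
    def __init__(self) -> None:
        self.events: dict[str, datetime] = {}

    def mark(self, event: str) -> None:
        """Record when an event happened (the first call for an event wins)."""
        self.events.setdefault(event, datetime.now(timezone.utc))

    def minutes_to(self, event: str) -> float:
        """Minutes from T+0 (failure injection) to the given event."""
        delta = self.events[event] - self.events["failure_injected"]
        return delta.total_seconds() / 60

log = DrillLog()
log.mark("failure_injected")   # T+0
# ... called by the scribe as the drill unfolds ...
log.mark("detection")          # first alert fires
log.mark("engagement")         # first human acknowledges
log.mark("diagnosis")          # "we know what's wrong"
log.mark("mitigation")         # customer impact stops
log.mark("recovery")           # fully restored

for event in ("detection", "engagement", "diagnosis", "mitigation", "recovery"):
    print(f"{event}: T+{log.minutes_to(event):.1f} min")
```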
Scoring the Drill
Good scores:
- Detection: under 2 minutes
- Engagement: under 5 minutes
- Diagnosis: under 15 minutes
- Mitigation: under 30 minutes (for DR scenarios)
- Recovery: depends on scenario
If any of these runs 5x longer than its target, you have a real problem.
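A minimal scoring sketch against those targets, including the 5x rule. The sample durations are invented for illustration:

```python
# Score measured drill times (in minutes) against the targets above.
TARGETS_MIN = {
    "detection": 2,
    "engagement": 5,
    "diagnosis": 15,
    "mitigation": 30,
}

def score(measured_min: dict[str, float]) -> None:
    for phase, target in TARGETS_MIN.items():
        actual = measured_min[phase]
        if actual <= target:
            verdict = "PASS"
        elif actual > 5 * target:
            verdict = "REAL PROBLEM (5x over target)"
        else:
            verdict = "NEEDS WORK"
        print(f"{phase:<11} target {target:>3} min, actual {actual:>5.1f} min -> {verdict}")

# Example: a drill where diagnosis dragged badly.
score({"detection": 1.5, "engagement": 4.0, "diagnosis": 95.0, "mitigation": 28.0})
```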
What Always Goes Wrong
In every DR drill we run:
- Runbook is out of date. The one that worked 6 months ago has wrong commands now.
- Credentials don't work. The service account was rotated, nobody updated the runbook.
- Backup is untested. The restore fails because the backup is corrupted (see the restore check sketched below this list).
- Escalation paths are stale. The "DBA on-call" has left the company.
- Dependencies are missing. The recovery playbook assumes Service X is up, but Service X depends on the failed component.
Every drill uncovers 3-5 of these. Fix them, then drill again.
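The untested-backup item is the easiest one to automate away: restore every backup somewhere disposable, on a schedule. A sketch assuming PostgreSQL with its standard CLI tools (createdb, pg_restore, psql) on PATH; the dump path, scratch database name, and sanity query are all hypothetical:

```python
# Restore the latest backup into a scratch DB and sanity-check it.
import subprocess

DUMP_PATH = "/backups/primary-latest.dump"   # hypothetical backup location
SCRATCH_DB = "dr_restore_check"

def sh(cmd: list[str]) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, check=True, capture_output=True, text=True)

def verify_backup() -> None:
    sh(["dropdb", "--if-exists", SCRATCH_DB])              # clean slate
    sh(["createdb", SCRATCH_DB])
    sh(["pg_restore", "--dbname", SCRATCH_DB, DUMP_PATH])  # fails loudly if corrupt
    # A corrupt-but-restorable dump often shows up as empty tables,
    # so query a table you know should have rows (hypothetical here).
    out = sh(["psql", "-d", SCRATCH_DB, "-tAc", "SELECT count(*) FROM users"])
    assert int(out.stdout.strip()) > 0, "restore succeeded but table is empty"
    print("backup restore check passed")

if __name__ == "__main__":
    verify_backup()
```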
Chaos Engineering vs. DR Drills
These are different. Chaos engineering is continuous (daily or weekly) and usually automated; DR drills are scheduled, human-driven, and large-scale.
Chaos engineering answers: "Can we survive small failures routinely?"
DR drills answer: "Can we survive catastrophic failures at all?"
You need both.
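For contrast, continuous chaos can be as small as a daily script. A sketch assuming a Kubernetes cluster with kubectl configured; the namespace is a placeholder:

```python
# Delete one random pod per run -- small, automated, routine chaos.
import random
import subprocess

NAMESPACE = "staging"   # hypothetical target namespace

def kill_random_pod() -> None:
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", NAMESPACE, "-o", "name"],
        check=True, capture_output=True, text=True,
    )
    pods = out.stdout.split()
    if not pods:
        return
    victim = random.choice(pods)
    # Losing one pod should be absorbed by replicas, probes, and retries.
    subprocess.run(["kubectl", "delete", "-n", NAMESPACE, victim], check=True)
    print(f"chaos: deleted {victim}")

if __name__ == "__main__":
    kill_random_pod()   # run daily from a scheduler, e.g. a cron job
```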
The Blame-Free Rule
DR drills expose weaknesses. Those weaknesses are process problems, not people problems.
The rules:
- No firing based on drill performance
- No promotions based on "being the hero"
- Focus on process gaps, not individual failures
- The post-drill retrospective is 90% about fixing systems, 10% about training people
If the team is afraid of the drill, you'll never learn anything real.
Frequency That Actually Works
Level 1 (Tabletop): Monthly, 1 hour
Level 2 (Partial): Quarterly, 4 hours including retro
Level 3 (Full Production): Twice a year, full day
Also: after every major infrastructure change, drill the affected components.
The Hardest Lesson
The drill is the easy part. The hard part is making the fixes from the drill a priority when there's feature pressure.
We track "DR drill remediation items" as a standing OKR. If after two quarters the same items are still open, the SRE team has authority to freeze feature work until they're fixed.
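The freeze trigger only works if staleness is computed, not remembered. A tiny sketch with invented item data:

```python
# Flag remediation items still open after two quarters (~182 days).
from datetime import date, timedelta

TWO_QUARTERS = timedelta(days=182)

items = [
    {"id": "DR-101", "title": "Update DB failover runbook", "opened": date(2024, 1, 15), "open": True},
    {"id": "DR-102", "title": "Rotate and document service creds", "opened": date(2024, 9, 3), "open": True},
]

def stale(item: dict, today: date) -> bool:
    return item["open"] and (today - item["opened"]) > TWO_QUARTERS

for item in (i for i in items if stale(i, date(2024, 10, 1))):
    print(f"{item['id']} open > 2 quarters -> feature freeze trigger: {item['title']}")
```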
Starting Point
If you've never done a DR drill:
- Pick one scenario (database failure, region outage, API gateway down)
- Schedule it for a quiet hour
- Run a tabletop first to find the obvious gaps
- Fix those gaps
- Run a partial failure test in staging
- Measure everything
- Run a retro focused on process
- Schedule the next one
Do this for three scenarios, and you'll have a DR program. Do it for ten, and you'll have a resilient company.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com