Most DR Drills Are Theater
Someone schedules a meeting. A few senior engineers walk through a runbook. Everyone agrees "yes, we could do this" and marks it complete.
Then the real disaster hits and nobody remembers the procedure, the runbook is 2 years out of date, and half the backup systems don't work.
Real DR drills test whether your team can actually recover, not whether they can talk about recovery.
The Three Levels of DR Testing
Level 1: Tabletop
- Walk through a scenario on paper
- Identify missing runbooks
- Find ownership gaps
- Useful for: New team members, initial gap analysis
- Limits: Doesn't prove anything actually works
Level 2: Partial Failure Test
- Actually fail one component in staging
- Watch recovery happen with real tools
- Time the full recovery
- Useful for: Validating specific runbooks
- Limits: Staging ≠ production
Level 3: Full Production Drill
- Actually fail a real production component
- Customer-facing (announce a maintenance window if needed)
- Full team responds as if it's real
- Useful for: Proving you can recover
- Limits: Scary and coordination-heavy
Most teams stop at Level 1. Good teams do Level 2 quarterly. The best teams do Level 3 twice a year.
A Real DR Drill Scenario
Scenario: Primary database becomes unreachable
Setup (48 hours before):
- Schedule window with product team
- Pick a time when customer impact is minimal
- Brief the team: "Something will fail tomorrow, respond as normal"
- Pre-position the incident commander
Execution:
- At T+0, block network access to the primary database via iptables (see the sketch after this list)
- Start stopwatch
- Watch the team respond
- Do NOT intervene or give hints
- Document every action, every decision, every delay
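As promised above, a minimal fault-injection sketch in Python. It assumes a Linux drill runner with sudo rights and standard iptables; DB_HOST and DB_PORT are placeholders, not values from any real environment.

```python
# Minimal fault-injection sketch for the T+0 trigger. Assumes a Linux
# host where the drill runner has sudo; host/port are placeholders.
import datetime
import subprocess

DB_HOST = "10.0.1.25"   # hypothetical primary database IP
DB_PORT = "5432"        # hypothetical Postgres port

def run(cmd: list[str]) -> None:
    """Run a command, raising if it fails so a broken drill setup is obvious."""
    subprocess.run(cmd, check=True)

def inject_failure() -> datetime.datetime:
    """Drop all outbound traffic to the primary database and return T+0."""
    run(["sudo", "iptables", "-A", "OUTPUT", "-d", DB_HOST,
         "-p", "tcp", "--dport", DB_PORT, "-j", "DROP"])
    return datetime.datetime.now(datetime.timezone.utc)

def restore() -> None:
    """Delete the DROP rule after the drill ends (-D removes a matching rule)."""
    run(["sudo", "iptables", "-D", "OUTPUT", "-d", DB_HOST,
         "-p", "tcp", "--dport", DB_PORT, "-j", "DROP"])

if __name__ == "__main__":
    t0 = inject_failure()
    print(f"T+0: {t0.isoformat()} -- stopwatch running, do not intervene")
```

Keeping the restore path in the same script matters: ending the drill should be one command, not an archaeology session.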
Metrics to capture (a capture sketch follows this list):
- Time to detection (first alert fires)
- Time to engagement (first human acknowledges)
- Time to diagnosis ("we know what's wrong")
- Time to mitigation (customer impact stops)
- Time to recovery (fully restored)
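One way to make these metrics fall out automatically is to have the scribe timestamp each event as it happens. A minimal sketch; the class and event names are illustrative, not from any real tooling:

```python
# Capture drill events as wall-clock timestamps; metrics become deltas.
from datetime import datetime, timezone

class DrillLog:
    def __init__(self) -> None:
        self.events: dict[str, datetime] = {}

    def mark(self, event: str) -> None:
        """Record when an event happened (the first call for an event wins)."""
        self.events.setdefault(event, datetime.now(timezone.utc))

    def minutes_to(self, event: str) -> float:
        """Minutes from T+0 (failure injection) to the given event."""
        delta = self.events[event] - self.events["failure_injected"]
        return delta.total_seconds() / 60

log = DrillLog()
log.mark("failure_injected")   # T+0
# ... called by the scribe as the drill unfolds ...
log.mark("detection")          # first alert fires
log.mark("engagement")         # first human acknowledges
log.mark("diagnosis")          # "we know what's wrong"
log.mark("mitigation")         # customer impact stops
log.mark("recovery")           # fully restored

for event in ("detection", "engagement", "diagnosis", "mitigation", "recovery"):
    print(f"{event}: T+{log.minutes_to(event):.1f} min")
```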
Scoring the Drill
Good scores:
- Detection: under 2 minutes
- Engagement: under 5 minutes
- Diagnosis: under 15 minutes
- Mitigation: under 30 minutes (for DR scenarios)
- Recovery: depends on scenario
If any of these runs 5x longer than its target, you have a real problem.
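A minimal scoring sketch against those targets, including the 5x rule. The sample durations are invented for illustration:

```python
# Score measured drill times (in minutes) against the targets above.
TARGETS_MIN = {
    "detection": 2,
    "engagement": 5,
    "diagnosis": 15,
    "mitigation": 30,
}

def score(measured_min: dict[str, float]) -> None:
    for phase, target in TARGETS_MIN.items():
        actual = measured_min[phase]
        if actual <= target:
            verdict = "PASS"
        elif actual > 5 * target:
            verdict = "REAL PROBLEM (5x over target)"
        else:
            verdict = "NEEDS WORK"
        print(f"{phase:<11} target {target:>3} min, actual {actual:>5.1f} min -> {verdict}")

# Example: a drill where diagnosis dragged badly.
score({"detection": 1.5, "engagement": 4.0, "diagnosis": 95.0, "mitigation": 28.0})
```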
What Always Goes Wrong
In every DR drill we run:
- Runbook is out of date. The one that worked 6 months ago has wrong commands now.
- Credentials don't work. The service account was rotated, nobody updated the runbook.
- Backup is untested. The restore fails because the backup is corrupted (see the restore check sketched below this list).
- Escalation paths are stale. The "DBA on-call" has left the company.
- Dependencies are missing. The recovery playbook assumes Service X is up, but Service X depends on the failed component.
Every drill uncovers 3-5 of these. Fix them, then drill again.
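The untested-backup item is the easiest one to automate away: restore every backup somewhere disposable, on a schedule. A sketch assuming PostgreSQL with its standard CLI tools (createdb, pg_restore, psql) on PATH; the dump path, scratch database name, and sanity query are all hypothetical:

```python
# Restore the latest backup into a scratch DB and sanity-check it.
import subprocess

DUMP_PATH = "/backups/primary-latest.dump"   # hypothetical backup location
SCRATCH_DB = "dr_restore_check"

def sh(cmd: list[str]) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, check=True, capture_output=True, text=True)

def verify_backup() -> None:
    sh(["dropdb", "--if-exists", SCRATCH_DB])              # clean slate
    sh(["createdb", SCRATCH_DB])
    sh(["pg_restore", "--dbname", SCRATCH_DB, DUMP_PATH])  # fails loudly if corrupt
    # A corrupt-but-restorable dump often shows up as empty tables,
    # so query a table you know should have rows (hypothetical here).
    out = sh(["psql", "-d", SCRATCH_DB, "-tAc", "SELECT count(*) FROM users"])
    assert int(out.stdout.strip()) > 0, "restore succeeded but table is empty"
    print("backup restore check passed")

if __name__ == "__main__":
    verify_backup()
```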
Chaos Engineering vs. DR Drills
These are different. Chaos engineering is continuous (daily or weekly) and usually automated; DR drills are scheduled, human-driven, and large-scale.
Chaos engineering answers: "Can we survive small failures routinely?"
DR drills answer: "Can we survive catastrophic failures at all?"
You need both.
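For contrast, continuous chaos can be as small as a daily script. A sketch assuming a Kubernetes cluster with kubectl configured; the namespace is a placeholder:

```python
# Delete one random pod per run -- small, automated, routine chaos.
import random
import subprocess

NAMESPACE = "staging"   # hypothetical target namespace

def kill_random_pod() -> None:
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", NAMESPACE, "-o", "name"],
        check=True, capture_output=True, text=True,
    )
    pods = out.stdout.split()
    if not pods:
        return
    victim = random.choice(pods)
    # Losing one pod should be absorbed by replicas, probes, and retries.
    subprocess.run(["kubectl", "delete", "-n", NAMESPACE, victim], check=True)
    print(f"chaos: deleted {victim}")

if __name__ == "__main__":
    kill_random_pod()   # run daily from a scheduler, e.g. a cron job
```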
The Blame-Free Rule
DR drills expose weaknesses. Those weaknesses are process problems, not people problems.
The rules:
- No firing based on drill performance
- No promotions based on "being the hero"
- Focus on process gaps, not individual failures
- The post-drill retrospective is 90% about fixing systems, 10% about training people
If the team is afraid of the drill, you'll never learn anything real.
Frequency That Actually Works
Level 1 (Tabletop): Monthly, 1 hour
Level 2 (Partial): Quarterly, 4 hours including retro
Level 3 (Full Production): Twice a year, full day
Also: after every major infrastructure change, drill the affected components.
The Hardest Lesson
The drill is the easy part. The hard part is making the fixes from the drill a priority when there's feature pressure.
We track "DR drill remediation items" as a standing OKR. If after two quarters the same items are still open, the SRE team has authority to freeze feature work until they're fixed.
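The freeze trigger only works if staleness is computed, not remembered. A tiny sketch with invented item data:

```python
# Flag remediation items still open after two quarters (~182 days).
from datetime import date, timedelta

TWO_QUARTERS = timedelta(days=182)

items = [
    {"id": "DR-101", "title": "Update DB failover runbook", "opened": date(2024, 1, 15), "open": True},
    {"id": "DR-102", "title": "Rotate and document service creds", "opened": date(2024, 9, 3), "open": True},
]

def stale(item: dict, today: date) -> bool:
    return item["open"] and (today - item["opened"]) > TWO_QUARTERS

for item in (i for i in items if stale(i, date(2024, 10, 1))):
    print(f"{item['id']} open > 2 quarters -> feature freeze trigger: {item['title']}")
```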
Starting Point
If you've never done a DR drill:
- Pick one scenario (database failure, region outage, API gateway down)
- Schedule it for a quiet hour
- Run a tabletop first to find the obvious gaps
- Fix those gaps
- Run a partial failure test in staging
- Measure everything
- Run a retro focused on process
- Schedule the next one
Do this for three scenarios, and you'll have a DR program. Do it for ten, and you'll have a resilient company.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com