Every IT leader has a disaster recovery plan. Most of them do not work. That is not cynicism - it is what the data tells us. Industry research consistently shows that over 70% of organisations that test their DR plans discover critical gaps.
I have been through enough real incidents to know that the gap between a DR plan document and actual recovery capability is often enormous. The plan says four hours. Reality says four days.
If you are an IT leader responsible for keeping systems running, this guide covers how to build a DR plan that survives contact with an actual disaster.
Why most DR plans fail
The document problem
Most DR plans are documents written once, approved by leadership, and filed away. Within six months, they are partially obsolete because infrastructure changes constantly.
The assumption problem
DR plans are built on assumptions: the network will be available, DNS will resolve, the backup site has capacity, the team will be reachable. Stack them together and you have a house of cards.
I learned this the hard way when a data centre power failure also took out the network equipment we needed to reach our backup site. Nobody had questioned the network connectivity assumption.
The people problem
Disasters happen at 3 AM on a bank holiday weekend when your lead engineer is on a flight. Technical systems can be designed for resilience. People cannot be patched.
Building a DR plan that actually works
A working DR plan is not a document. It is a capability.
Step 1: Define what matters
Classify systems into tiers based on business impact:
- Tier 1 - Critical: Direct revenue impact. RTO under one hour, RPO near zero.
- Tier 2 - Important: Significant impact but workarounds exist. RTO 4-8 hours.
- Tier 3 - Standard: Can tolerate extended outages. RTO 24-48 hours.
- Tier 4 - Low priority: Recover when possible.
This forces difficult conversations about what the business actually needs. I have seen organisations classify 80% of their systems as Tier 1, which means nothing is truly prioritised.
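The tiering above can be expressed as data rather than prose, which makes the recovery order mechanical instead of debatable. A minimal sketch, assuming hypothetical system names; the RPO values for Tiers 2-4 are illustrative, since the text only specifies RTOs for them:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    rto_hours: float  # recovery time objective
    rpo_hours: float  # recovery point objective (assumed for Tiers 2-4)

TIERS = {
    1: Tier("Critical", rto_hours=1, rpo_hours=0),
    2: Tier("Important", rto_hours=8, rpo_hours=4),
    3: Tier("Standard", rto_hours=48, rpo_hours=24),
    4: Tier("Low priority", rto_hours=float("inf"), rpo_hours=float("inf")),
}

# Hypothetical inventory: system name -> tier number.
SYSTEMS = {
    "payments-api": 1,
    "crm": 2,
    "wiki": 3,
    "legacy-reports": 4,
}

def recovery_order(systems: dict[str, int]) -> list[str]:
    """Systems sorted by tier: what to bring back first."""
    return sorted(systems, key=systems.get)
```

If more than a handful of systems land in Tier 1, that is the signal to reopen the business conversation, not to widen the tier.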
Step 2: Map your dependencies
Every system depends on other systems. Draw dependency maps explicitly. Include external dependencies like cloud providers, SaaS tools, and CDNs.
Pay special attention to shared dependencies. If your Tier 1 application and your monitoring system both depend on the same DNS provider, you will lose visibility at the moment you need it most.
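Shared dependencies like that DNS provider can be found mechanically once the map exists. A sketch, with a hypothetical dependency map; real maps would come from an inventory or discovery tooling:

```python
from collections import defaultdict

# Hypothetical dependency map: system -> direct dependencies.
DEPENDS_ON = {
    "payments-api": {"postgres-primary", "dns-provider-a", "cdn-x"},
    "monitoring": {"dns-provider-a", "metrics-store"},
    "crm": {"postgres-primary", "saas-auth"},
}

def shared_dependencies(deps: dict[str, set[str]]) -> dict[str, set[str]]:
    """Dependencies relied on by more than one system:
    the candidate single points of failure."""
    dependents = defaultdict(set)
    for system, ds in deps.items():
        for dep in ds:
            dependents[dep].add(system)
    return {dep: systems for dep, systems in dependents.items()
            if len(systems) > 1}
```

Anything this returns that sits underneath both a Tier 1 system and your monitoring deserves its own line in the risk register.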
Step 3: Design for recovery, not just resilience
Resilience prevents failures. Recovery is what happens when prevention fails. Most organisations over-invest in resilience and under-invest in recovery.
- Automated failover for Tier 1. If it requires human intervention, it is not Tier 1 ready.
- Documented manual procedures for Tier 2. Step-by-step runbooks that a competent engineer who has never seen the system can follow.
- Rebuild procedures for Tier 3+. Infrastructure as code and tested restoration scripts.
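The "no human intervention" rule for Tier 1 can be sketched as a failover wrapper that tries replicas in priority order. Everything here is illustrative: the endpoint names and the `call` function are stand-ins for whatever client your stack actually uses:

```python
from collections.abc import Callable

def with_failover(replicas: list[str],
                  call: Callable[[str], str]) -> str:
    """Try each replica in priority order; first success wins.

    `call` performs the real request; any exception marks the
    replica as failed and we move on. No human in the loop.
    """
    errors = []
    for replica in replicas:
        try:
            return call(replica)
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append((replica, exc))
    raise RuntimeError(f"all replicas failed: {errors}")
```

The useful property is that the failure path is exercised by the same code as the happy path, so it can be tested by injecting a failing primary rather than by waiting for a real outage.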
Step 4: Automate everything you can
Manual recovery procedures are unreliable. People make mistakes under pressure.
- Infrastructure as code for rebuilding environments
- Automated backup verification (restore and verify, not just check completion)
- Automated failover testing in production
- Runbook automation over Word documents
Every manual step is a potential failure point.
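"Restore and verify, not just check completion" means actually reading the data back out of the backup and comparing it to the source. A minimal sketch using SQLite as a stand-in for whatever database you run; the table name and digest scheme are illustrative:

```python
import hashlib
import sqlite3
from pathlib import Path

def table_digest(db: Path, table: str) -> str:
    """Logical checksum of a table's contents: we hash the rows we can
    actually read back, not the backup file's bytes."""
    conn = sqlite3.connect(db)
    try:
        rows = conn.execute(
            f"SELECT * FROM {table} ORDER BY rowid").fetchall()
    finally:
        conn.close()
    return hashlib.sha256(repr(rows).encode()).hexdigest()

def verify_backup(live_db: Path, backup_db: Path, table: str) -> bool:
    """Restore-and-verify: a zero-byte backup that 'completed
    successfully' fails here, because the query against it fails."""
    return table_digest(live_db, table) == table_digest(backup_db, table)
```

Run this after every backup job, not once a quarter: a backup you have not read from is a hope, not a backup.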
Step 5: Test relentlessly
- Tabletop exercises (quarterly): Walk through scenarios. Cheap and effective.
- Component testing (monthly): Test individual recovery procedures.
- Full DR tests (twice a year): Execute a complete recovery. Expensive but essential.
- Chaos engineering (ongoing): Introduce controlled failures.
Step 6: Document for humans
Write for the person executing at 3 AM:
- Short sentences. Clear instructions. No ambiguity.
- Decision trees, not novels.
- Contact lists with phone numbers, not just Slack handles.
- Version control with Git, not SharePoint.
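A decision tree can live as data rather than prose, which keeps it short, diffable in Git, and testable. A minimal sketch, with hypothetical questions and runbook IDs; the point is the shape, not the content:

```python
# Each node: a question (or final action) plus the next node per answer.
# A branches value of None marks a final action.
RUNBOOK = {
    "start": ("Is the primary database reachable?",
              {"yes": "check_app", "no": "failover_db"}),
    "check_app": ("Is the application returning 5xx errors?",
                  {"yes": "restart_app", "no": "DONE: monitor only"}),
    "failover_db": ("Promote the replica (runbook DB-07).", None),
    "restart_app": ("Rolling restart the app tier (runbook APP-03).", None),
}

def walk(tree: dict, answers: dict[str, str], node: str = "start") -> str:
    """Follow recorded yes/no answers through the tree to one action."""
    text, branches = tree[node]
    if branches is None:
        return text
    nxt = branches[answers[node]]
    return nxt if nxt.startswith("DONE") else walk(tree, answers, nxt)
```

Because the tree is data, you can assert in CI that every branch terminates in an action, which is exactly the kind of check a Word document can never give you.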
Step 7: Plan for communication
Communication failures during incidents cause more lasting damage than the technical issues. Include internal templates, external procedures, and status page management.
The cloud does not solve this
Cloud providers offer infrastructure resilience but cannot protect against application-level failures, configuration errors, vendor outages, or account-level issues.
Your cloud DR strategy should include the ability to operate independently of any single provider, at least for Tier 1 services.
Getting started
If your DR plan has not been tested in over a year:
- Run a tabletop exercise this month. Pick a realistic scenario and walk through your response.
- Test one backup restoration this week. Time it. Verify the data. Compare to your RTO.
- Update your contact list today. Remove leavers, add joiners.
- Schedule regular testing. Quarterly tabletops, monthly component tests, and full DR tests twice a year.
The best time to test your DR plan was six months ago. The second best time is this week.
Originally published on danieljamesglover.com