Disaster Recovery Maturity Model

#distributedsystems #disasterrecovery #cloud

Points to consider while setting up a Disaster Recovery Maturity Model (DRMM).

Catalog your applications:

Make sure you are aware of all applications that exist in the ecosystem and that each one has an assigned owner along with their contact information. Also make sure that application owners are aware of their responsibilities, e.g. in case of an emergency, they or whoever they nominate on their behalf may be contacted for remediation.
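A catalog like this can be checked mechanically. The following is a minimal sketch, assuming the catalog is kept as simple records; the `Application` fields, the `find_catalog_gaps` helper and the sample entries are all hypothetical and only illustrate the idea of flagging applications without an owner or contact details.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Application:
    """One entry in a hypothetical application catalog."""
    name: str
    owner: Optional[str] = None          # person accountable for the application
    owner_contact: Optional[str] = None  # how to reach the owner in an emergency
    delegate: Optional[str] = None       # nominated stand-in, if any


def find_catalog_gaps(catalog: List[Application]) -> List[str]:
    """Return the names of applications missing an owner or contact details."""
    return [
        app.name
        for app in catalog
        if not app.owner or not app.owner_contact
    ]


# Illustrative entries only; names and addresses are made up.
catalog = [
    Application("payments", owner="A. Owner", owner_contact="a.owner@example.com"),
    Application("reporting", owner="B. Owner"),  # contact details missing
    Application("legacy-batch"),                 # no owner assigned
]

print(find_catalog_gaps(catalog))
```

Running a check like this on a schedule is one way to notice when a new application appears without an owner.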

Document standard recovery procedures:

For each application, make sure recovery procedures are documented and, more importantly, well understood.

Define a strategy:

Make sure that the RPO (recovery point objective) and RTO (recovery time objective) for each application are defined, i.e. a backup and restore strategy that meets them is in place.
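To make the link between RPO/RTO and the backup strategy concrete, here is a small sketch. It assumes a simplified model in which the worst-case data loss equals the time between backups and the restore time comes from the most recent drill; the class names, the `meets_objectives` helper and the numbers are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerable data loss, expressed as time
    rto: timedelta  # maximum tolerable time to restore service


@dataclass
class BackupPlan:
    backup_interval: timedelta        # time between successive backups
    measured_restore_time: timedelta  # restore duration observed in the last drill


def meets_objectives(objectives: RecoveryObjectives, plan: BackupPlan) -> dict:
    """Compare a backup plan against the stated objectives.

    Simplifying assumption: worst-case data loss is one full backup interval.
    """
    return {
        "rpo_met": plan.backup_interval <= objectives.rpo,
        "rto_met": plan.measured_restore_time <= objectives.rto,
    }


# Illustrative numbers only.
objectives = RecoveryObjectives(rpo=timedelta(hours=1), rto=timedelta(hours=4))
plan = BackupPlan(
    backup_interval=timedelta(hours=6),
    measured_restore_time=timedelta(hours=2),
)

print(meets_objectives(objectives, plan))
```

In this example the restore is fast enough for the RTO, but backups every six hours cannot satisfy a one-hour RPO, so the backup frequency (or the objective) has to change.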

Categorize applications into resilience categories:

Not all applications are created equal. Define resilience categories and slot each application into one of them. For example, the payments application for a credit card company may be more important to recover than a reporting application. Be aware of the legal and compliance regulations for your domain. DR strategies come at a cost: an active-active deployment will likely cost more than a cold standby. Make sure your business stakeholders are aware of this and participate actively in the categorisation process.
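One way to record the outcome of this categorisation is a small lookup from category to DR policy. This is only a sketch: the three tier names, the strategy assigned to each, and the RTO/RPO ceilings are assumptions for illustration and should come from your own stakeholders, not from this example.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class Tier(Enum):
    """Hypothetical resilience categories."""
    CRITICAL = "critical"
    IMPORTANT = "important"
    STANDARD = "standard"


@dataclass(frozen=True)
class TierPolicy:
    strategy: str       # DR deployment style agreed for the tier
    max_rto: timedelta  # longest acceptable recovery time
    max_rpo: timedelta  # largest acceptable data-loss window


# Illustrative values only; costlier strategies are reserved for higher tiers.
POLICIES = {
    Tier.CRITICAL: TierPolicy("active-active", timedelta(minutes=5), timedelta(minutes=1)),
    Tier.IMPORTANT: TierPolicy("warm standby", timedelta(hours=1), timedelta(minutes=15)),
    Tier.STANDARD: TierPolicy("cold standby", timedelta(hours=24), timedelta(hours=24)),
}

# Which tier each application was slotted into (made-up assignments).
ASSIGNMENTS = {
    "payments": Tier.CRITICAL,
    "reporting": Tier.STANDARD,
}


def policy_for(app_name: str) -> TierPolicy:
    """Look up the DR policy that applies to an application."""
    return POLICIES[ASSIGNMENTS[app_name]]


print(policy_for("payments").strategy)
print(policy_for("reporting").strategy)
```

Keeping the table in version control gives business and engineering a single place to review what each category costs and promises.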

Practice!:

Ideally all the time as part of your automated deployment pipeline. Take your DR drills seriously.

Can you really recover if things start going south?
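A drill is most useful when it produces a number you can compare against the RTO. Below is a minimal sketch of a drill harness, assuming the recovery procedure can be invoked as a Python callable; `run_drill` and `fake_recover` are hypothetical names, and a real drill would call your actual restore or failover automation instead.

```python
import time
from datetime import timedelta
from typing import Callable


def run_drill(recover: Callable[[], None], rto: timedelta) -> dict:
    """Run a recovery procedure, time it, and compare the result to the RTO."""
    start = time.monotonic()
    try:
        recover()
        succeeded = True
    except Exception:
        # Any failure in the procedure counts as a failed drill.
        succeeded = False
    elapsed = timedelta(seconds=time.monotonic() - start)
    return {
        "succeeded": succeeded,
        "elapsed": elapsed,
        "within_rto": succeeded and elapsed <= rto,
    }


def fake_recover() -> None:
    """Stand-in for a real restore procedure (assumption for this example)."""
    time.sleep(0.01)


result = run_drill(fake_recover, rto=timedelta(seconds=5))
print(result["succeeded"], result["within_rto"])
```

Wiring a harness like this into the deployment pipeline turns "can we recover?" into a check that fails visibly when the answer becomes no.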

Automate!:

Infra as Code, automated configuration and secrets management, automated deployment, 'nuff said!
The AWS Well-Architected Framework has a Reliability pillar that covers planning for disaster recovery, which you may find useful.

Do a maturity assessment:

Areas to consider for maturity assessment (AWS context):

Applicable across all inventory: application, network, infra, CI, storage, test environments, etc. Any system that gets used today needs a DR strategy, including test systems, CI/CD, toggle/configuration stores, doc repos, etc., because they ALL play an important role in keeping both the software and the development process (the machinery that creates software) running. (This is borrowed from security thinking, where keeping non-prod systems secure is very important too.)
Are components deployed in a single AZ, across multiple AZs, or across Regions? (Is cross-Region complexity actually needed?)
What are the RPO/RTO requirements for the business? Each component then needs to be assessed against those goals to get a "heat map" for the whole inventory: where are we today, where do we want to be, how will we get there, and what are the priorities? The RPO/RTO goals are usually not the same for all components, since some sub-systems are more critical than others.
What is the level of automation (to support the RPO/RTO objectives)?
How frequently is the automation tested?
Are there automated checks in place that can detect new features or configurations that no longer meet RPO/RTO objectives because the DR/backup/restore scripts have become out of date? ("DR automation obsolescence drift".) Think poka-yoke (mistake-proofing).
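The "heat map" idea from the questions above can be sketched in a few lines. This is an illustration under stated assumptions: the `Component` fields, the green/amber/red scoring rule, the 90-day drill freshness threshold and the sample inventory are all hypothetical, and a real assessment would use whatever criteria your organisation agrees on.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class Component:
    name: str
    target_rto: timedelta
    achieved_rto: timedelta      # measured in the last drill
    target_rpo: timedelta
    achieved_rpo: timedelta      # worst-case data loss with current backups
    days_since_last_drill: int   # staleness of the DR automation test


def heat(component: Component, max_drill_age_days: int = 90) -> str:
    """Score a component: green = no misses, amber = one miss, red = two or more."""
    misses = sum([
        component.achieved_rto > component.target_rto,
        component.achieved_rpo > component.target_rpo,
        component.days_since_last_drill > max_drill_age_days,
    ])
    if misses == 0:
        return "green"
    if misses == 1:
        return "amber"
    return "red"


# Made-up inventory covering prod and non-prod systems.
inventory: List[Component] = [
    Component("payments-db", timedelta(minutes=15), timedelta(minutes=10),
              timedelta(minutes=1), timedelta(minutes=1), 30),
    Component("ci-server", timedelta(hours=4), timedelta(hours=6),
              timedelta(hours=24), timedelta(hours=24), 45),
    Component("doc-repo", timedelta(hours=24), timedelta(hours=48),
              timedelta(hours=24), timedelta(hours=72), 400),
]

heat_map = {c.name: heat(c) for c in inventory}
print(heat_map)
```

The drill-age term is a crude proxy for obsolescence drift: a component whose recovery automation has not been exercised recently is flagged even if its last recorded numbers looked fine.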