mehdi-arfaoui

Posted on Apr 9

I scanned my AWS account for disaster recovery gaps - here's what I found

#aws #devops #opensource #showdev

Most teams know they should have a disaster recovery plan. Few know if it actually works.

I built an open-source tool called Stronghold that scans AWS infrastructure and answers a simple question: if something fails right now, can we actually recover?

Not "do we have backups configured" but "is the payment service still recoverable if eu-west-1a goes down, and do we have proof?"

Here's what happened when I ran it.

Try it yourself (30 seconds)

No AWS credentials needed:

npx @stronghold-dr/cli demo

This runs against built-in sample infrastructure: a realistic startup setup with 24 resources across RDS, Lambda, S3, DynamoDB, ELB, ElastiCache, and more.

What it found

The demo simulates a typical startup AWS account. Here's the posture report:

DR Posture - 2026-04-09

  Services:
    ✓ frontend       B   85/100   0 critical findings
    ✗ database       F   19/100   2 critical findings
    ! startup-api    D   53/100   4 high findings
    ✗ dns            F    0/100   1 critical finding
    ✗ storage        F   26/100   2 critical findings
    ✗ shared-files   F   28/100   3 critical findings
    ✓ cache          B   85/100   0 critical findings
    ✓ messaging      B   85/100   0 critical findings

  Score: 45/100 (D)
  Scenarios: 0/13 covered
  Evidence: 52 observed, 0 tested, 0 expired

45 out of 100. Grade D. Zero scenarios covered.

That means: if any AZ fails, if the primary database goes down, if data gets corrupted, there is no proven recovery path for any of those cases.

The scary part? This infrastructure has backups configured. It has read replicas. It looks fine in the AWS console. But when you reason about it at the service level and ask "can we actually recover the payment service end-to-end?", the answer is no.

What makes this different from Prowler or AWS Config

Prowler tells you "RDS instance has no backup." That's useful but incomplete.

Stronghold tells you:

Which service is affected (not just which resource)
Which scenarios are no longer survivable because of this gap
What evidence supports the current DR posture (observed config vs. actual tested recovery)
Whether the runbook still matches the live infrastructure
How long the gap has existed and whether posture is improving or degrading

Here's the scenario coverage for the demo account:

Scenario Coverage Analysis

  AZ Failure Scenarios:
    ✗ eu-west-1a failure    6 services affected   UNCOVERED
    ✗ eu-west-1b failure    5 services affected   UNCOVERED

  Region Failure Scenarios:
    ✗ eu-west-1 failure    14 services affected   UNCOVERED

  SPOF Failure Scenarios:
    ! prod-db-primary fails  4 services affected   DEGRADED
    ! prod-api-alb fails     2 services affected   DEGRADED

  Data Corruption Scenarios:
    ! database corruption    4 services affected   DEGRADED

  Summary: 0/13 covered, 3 uncovered, 10 degraded

Zero scenarios covered. That's not because there are no backups. It's because the backups haven't been tested, the runbooks reference resources that have changed, and there's no proof that an end-to-end recovery would work.
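To make the "observed vs. tested" distinction concrete, here is a minimal sketch of how a scenario could be classified from its evidence. The type names, the three evidence levels (borrowed from the report line above), and the classification rule are all assumptions for illustration, not Stronghold's actual schema or logic.

```typescript
// Hypothetical model: names and rules are illustrative, not Stronghold's real schema.
type EvidenceLevel = "observed" | "tested" | "expired";

interface RecoveryPath {
  description: string;
  evidence: EvidenceLevel;
}

type Coverage = "COVERED" | "DEGRADED" | "UNCOVERED";

// Assumed rule: no recovery path at all is UNCOVERED, a path backed only by
// observed (or expired) evidence is DEGRADED, and a tested path is COVERED.
function classifyScenario(paths: RecoveryPath[]): Coverage {
  if (paths.length === 0) return "UNCOVERED";
  return paths.some((p) => p.evidence === "tested") ? "COVERED" : "DEGRADED";
}

const dbCorruption = classifyScenario([
  { description: "restore from automated snapshot", evidence: "observed" },
]);
const azFailure = classifyScenario([]);
const afterDrill = classifyScenario([
  { description: "restore from automated snapshot", evidence: "tested" },
]);

console.log(dbCorruption, azFailure, afterDrill);
```

Under these assumptions, a configured backup alone only moves a scenario from uncovered to degraded; it takes an actual restore drill to mark it covered.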

The five questions

Stronghold exists to answer five questions that most DR tooling skips:

What do we actually have? → Service-level mapping, not just resource inventory
What breaks if this fails? → Dependency graph with blast radius analysis
Do we still have a viable recovery path? → Scenario coverage, not just config checks
What evidence supports that belief? → Evidence maturity (observed ≠ tested)
Has our posture improved or degraded? → History, trends, and DR debt tracking
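The second question, blast radius, is at heart a graph traversal. The sketch below shows the general technique: reverse the dependency edges and walk outward from the failed resource. The node names reuse labels from the demo output, but the edges between them are made up for illustration, and this is not Stronghold's implementation.

```typescript
// Hypothetical dependency map: each key depends on the resources in its array.
// The edges are invented for this example.
const dependsOn: Record<string, string[]> = {
  "startup-api": ["prod-db-primary", "cache"],
  frontend: ["startup-api"],
  "shared-files": ["storage"],
  cache: [],
  storage: [],
  "prod-db-primary": [],
};

// Breadth-first walk over reversed edges: returns everything that
// transitively depends on `failed`, sorted alphabetically.
function blastRadius(graph: Record<string, string[]>, failed: string): string[] {
  const dependents = new Map<string, string[]>();
  for (const [node, deps] of Object.entries(graph)) {
    for (const dep of deps) {
      dependents.set(dep, [...(dependents.get(dep) ?? []), node]);
    }
  }

  const seen = new Set<string>([failed]);
  const queue: string[] = [failed];
  const affected: string[] = [];
  while (queue.length > 0) {
    const current = queue.shift()!;
    for (const next of dependents.get(current) ?? []) {
      if (!seen.has(next)) {
        seen.add(next);
        affected.push(next);
        queue.push(next);
      }
    }
  }
  return affected.sort();
}

console.log(blastRadius(dependsOn, "prod-db-primary"));
```

With this made-up graph, losing prod-db-primary takes out startup-api directly and frontend indirectly, while shared-files is unaffected.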

Quick start on a real account

# Generate the read-only IAM policy (attach it to the identity you scan with)
npx @stronghold-dr/cli iam-policy > stronghold-policy.json

# Scan
npx @stronghold-dr/cli scan --region eu-west-1

# See your posture
npx @stronghold-dr/cli status

# Check scenario coverage
npx @stronghold-dr/cli scenarios

# Generate a recovery plan
npx @stronghold-dr/cli plan generate > drp.yaml

Stronghold is read-only: it calls only Describe, List, and Get APIs, makes no changes to your infrastructure, and sends no telemetry.
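If you'd rather verify the read-only claim than take it on trust, you can check a policy document before attaching it. The sketch below flags any allowed action that doesn't start with Describe, List, or Get. It assumes the standard IAM policy JSON shape with Statement as an array; the sample policy is hypothetical and is not the output of the iam-policy command.

```typescript
// Standard IAM policy JSON shape (Action may be a string or an array).
// Assumes Statement is an array; a single-object Statement is not handled.
interface PolicyStatement {
  Effect: "Allow" | "Deny";
  Action: string | string[];
  Resource?: string | string[];
}

interface PolicyDocument {
  Version?: string;
  Statement: PolicyStatement[];
}

// Matches "service:DescribeX", "service:ListX", "service:GetX",
// including trailing wildcards such as "ec2:Describe*".
const READ_ONLY = /^[a-z0-9-]+:(Describe|List|Get)[A-Za-z0-9*]*$/;

// Returns every allowed action that does not look read-only.
function nonReadOnlyActions(policy: PolicyDocument): string[] {
  const flagged: string[] = [];
  for (const statement of policy.Statement) {
    if (statement.Effect !== "Allow") continue;
    const actions = Array.isArray(statement.Action) ? statement.Action : [statement.Action];
    for (const action of actions) {
      if (!READ_ONLY.test(action)) flagged.push(action);
    }
  }
  return flagged;
}

// Hypothetical sample policy, for illustration only.
const samplePolicy: PolicyDocument = {
  Version: "2012-10-17",
  Statement: [
    {
      Effect: "Allow",
      Action: ["rds:DescribeDBInstances", "s3:ListAllMyBuckets", "ec2:Describe*"],
      Resource: "*",
    },
    { Effect: "Allow", Action: "s3:PutObject", Resource: "*" },
  ],
};

console.log(nonReadOnlyActions(samplePolicy));
```

To run it against the generated file, you could parse stronghold-policy.json with JSON.parse and pass the result in; an empty array means every allowed action matched the read-only pattern.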

Add it to your CI

- name: Stronghold DR Check
  uses: mehdi-arfaoui/stronghold-dr-check@v1
  with:
    aws-region: eu-west-1
    aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
    aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    fail-under-score: 60
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

This posts a DR impact comment on every pull request and fails the check if the score drops below your threshold.
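The gating logic behind a setting like fail-under-score is simple enough to sketch. The function below is an illustration under assumptions: the result shape, the message format, and the comparison against a previous score are hypothetical and not taken from the Action's source.

```typescript
// Hypothetical result shape, not the Action's real output format.
interface GateResult {
  passed: boolean;
  message: string;
}

// Passes when the score is at or above the threshold; optionally reports
// the change against a previous score (for example, the base branch).
function checkScore(score: number, failUnder: number, previous?: number): GateResult {
  const passed = score >= failUnder;
  let delta = "";
  if (previous !== undefined) {
    const diff = score - previous;
    delta = ` (${diff >= 0 ? "+" : ""}${diff} vs. base)`;
  }
  const verdict = passed ? "pass" : `below threshold ${failUnder}`;
  return { passed, message: `DR score ${score}/100${delta}: ${verdict}` };
}

const gate = checkScore(45, 60, 52);
console.log(gate.message);
// A CI wrapper would exit with a non-zero status when gate.passed is false.
```

Using the demo score of 45 with the threshold of 60 from the workflow above, the check would fail.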

What's under the hood

16 AWS service scanners with bounded concurrency and retry
Service detection from CloudFormation, tags, and topology
39 DR validation rules across 6 categories
Evidence model with 5 maturity levels
Scenario coverage analysis (AZ failure, region failure, SPOF, data corruption)
DRP-as-Code with executable runbooks
Posture memory with finding lifecycle, DR debt, and trend tracking
Governance with ownership, risk acceptance, and policy enforcement
AES-256-GCM encryption, redaction, and always-on audit trail

Written in TypeScript in strict mode with zero uses of the any type, 654 tests, and 81% core coverage. Licensed under AGPL-3.0.

Links

GitHub: mehdi-arfaoui/Stronghold
npm: @stronghold-dr/cli
GitHub Action: stronghold-dr-check
Website: stronghold.software

If you've ever wondered whether your DR plan would actually work in a real incident, try running the demo. It takes 30 seconds and you might be surprised by what a fresh pair of eyes finds, even on sample infrastructure.

I'm building this solo and in the open.
Feedback, issues, and stars are all welcome!

DEV Community