Most teams know they should have a disaster recovery plan. Few know if it actually works.
I built an open-source tool called Stronghold that scans AWS infrastructure and answers a simple question: if something fails right now, can we actually recover?
Not "do we have backups configured" but "is the payment service still recoverable if eu-west-1a goes down, and do we have proof?"
Here's what happened when I ran it.
Try it yourself (30 seconds)
No AWS credentials needed:
npx @stronghold-dr/cli demo
This runs against built-in sample infrastructure, a realistic startup setup with 24 resources across RDS, Lambda, S3, DynamoDB, ELB, ElastiCache, and more.
What it found
The demo simulates a typical startup AWS account. Here's the posture report:
DR Posture - 2026-04-09
Services:
  ✓ frontend      B  85/100  0 critical findings
  ✗ database      F  19/100  2 critical findings
  ! startup-api   D  53/100  4 high findings
  ✗ dns           F   0/100  1 critical finding
  ✗ storage       F  26/100  2 critical findings
  ✗ shared-files  F  28/100  3 critical findings
  ✓ cache         B  85/100  0 critical findings
  ✓ messaging     B  85/100  0 critical findings
Score: 45/100 (D)
Scenarios: 0/13 covered
Evidence: 52 observed, 0 tested, 0 expired
45 out of 100. Grade D. Zero scenarios covered.
That means: if any AZ fails, if the primary database goes down, if data gets corrupted, there is no proven recovery path for any of those cases.
The scary part? This infrastructure has backups configured. It has read replicas. It looks fine in the AWS console. But when you reason about it at the service level and ask "can we actually recover the payment service end-to-end?", the answer is no.
What makes this different from Prowler or AWS Config
Prowler tells you "RDS instance has no backup." That's useful but incomplete.
Stronghold tells you:
- Which service is affected (not just which resource)
- Which scenarios are no longer survivable because of this gap
- What evidence supports the current DR posture (observed config vs. actual tested recovery)
- Whether the runbook still matches the live infrastructure
- How long the gap has existed and whether posture is improving or degrading
Here's the scenario coverage for the demo account:
Scenario Coverage Analysis
AZ Failure Scenarios:
  ✗ eu-west-1a failure      6 services affected   UNCOVERED
  ✗ eu-west-1b failure      5 services affected   UNCOVERED
Region Failure Scenarios:
  ✗ eu-west-1 failure       14 services affected  UNCOVERED
SPOF Failure Scenarios:
  ! prod-db-primary fails   4 services affected   DEGRADED
  ! prod-api-alb fails      2 services affected   DEGRADED
Data Corruption Scenarios:
  ! database corruption     4 services affected   DEGRADED
Summary: 0/13 covered, 3 uncovered, 10 degraded
Zero scenarios covered. That is not because there are no backups. It is because the backups haven't been tested, the runbooks reference resources that have changed, and there's no proof that an end-to-end recovery would work.
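To make the covered / degraded / uncovered distinction concrete, here is a minimal TypeScript sketch of how a scenario could be classified from the evidence behind each affected service. The types, the `classifyScenario` function, the "none" evidence level, and the rule itself are assumptions for illustration, not the tool's real data model.

```typescript
// Illustrative only: these types and rules are assumptions, not the real model.
type Evidence = "none" | "observed" | "tested" | "expired";
type Coverage = "COVERED" | "DEGRADED" | "UNCOVERED";

interface AffectedService {
  name: string;
  // Strongest evidence that a recovery path exists for this service.
  evidence: Evidence;
}

function classifyScenario(services: AffectedService[]): Coverage {
  // Some service has no recovery path at all: the scenario is not survivable.
  if (services.some((s) => s.evidence === "none")) return "UNCOVERED";
  // Every service has a recovery path proven by a test that is still valid.
  if (services.every((s) => s.evidence === "tested")) return "COVERED";
  // A path exists on paper (observed config or a stale test) but is unproven.
  return "DEGRADED";
}

const dbCorruption = classifyScenario([
  { name: "database", evidence: "observed" },
  { name: "startup-api", evidence: "observed" },
]);
console.log(dbCorruption); // DEGRADED
```

Under this rule, a backup that exists but has never been restored can lift a scenario out of UNCOVERED, but only a passing, unexpired test moves it to COVERED.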
The five questions
Stronghold exists to answer five questions that most DR tooling skips:
- What do we actually have? → Service-level mapping, not just resource inventory
- What breaks if this fails? → Dependency graph with blast radius analysis
- Do we still have a viable recovery path? → Scenario coverage, not just config checks
- What evidence supports that belief? → Evidence maturity (observed ≠ tested)
- Has our posture improved or degraded? → History, trends, and DR debt tracking
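The "what breaks if this fails?" question boils down to a graph traversal. The sketch below is a generic breadth-first search over a map of direct dependents, written in TypeScript. The `blastRadius` function, the graph shape, and the edges in the example are made up for illustration; only the resource names are borrowed from the demo output.

```typescript
// Illustrative only: the graph shape and the edges below are assumptions.
// Maps each resource to the resources that depend on it directly.
type DependentsGraph = Map<string, string[]>;

function blastRadius(graph: DependentsGraph, failed: string): Set<string> {
  const affected = new Set<string>();
  const queue: string[] = [failed];
  while (queue.length > 0) {
    const current = queue.shift()!;
    for (const dependent of graph.get(current) ?? []) {
      // Skip nodes already seen so dependency cycles cannot loop forever.
      if (dependent !== failed && !affected.has(dependent)) {
        affected.add(dependent);
        queue.push(dependent);
      }
    }
  }
  return affected;
}

const graph: DependentsGraph = new Map<string, string[]>([
  ["prod-db-primary", ["startup-api"]],
  ["prod-api-alb", ["frontend"]],
  ["startup-api", ["frontend"]],
  ["frontend", []],
]);

console.log(Array.from(blastRadius(graph, "prod-db-primary")).sort());
// [ 'frontend', 'startup-api' ]
```

The traversal follows transitive dependents, which is why a database failure reaches the frontend even though the two are not directly connected.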
Quick start on a real account
# Generate the read-only IAM policy
npx @stronghold-dr/cli iam-policy > stronghold-policy.json
# Scan
npx @stronghold-dr/cli scan --region eu-west-1
# See your posture
npx @stronghold-dr/cli status
# Check scenario coverage
npx @stronghold-dr/cli scenarios
# Generate a recovery plan
npx @stronghold-dr/cli plan generate > drp.yaml
Stronghold is read-only: it calls only Describe, List, and Get APIs and makes no changes to your infrastructure. It sends zero telemetry.
Add it to your CI
- name: Stronghold DR Check
  uses: mehdi-arfaoui/stronghold-dr-check@v1
  with:
    aws-region: eu-west-1
    aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
    aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    fail-under-score: 60
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
This posts a DR impact comment on every pull request and fails the check if the score drops below your threshold.
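The threshold behavior is simple to reason about. As a rough TypeScript sketch (the `checkScoreGate` function and `GateResult` shape are hypothetical names, not the action's real code), the gate passes when the score is at or above the threshold and fails otherwise:

```typescript
// Illustrative only: names and result shape are assumptions,
// not the GitHub Action's real implementation.
interface GateResult {
  passed: boolean;
  message: string;
}

function checkScoreGate(score: number, failUnderScore: number): GateResult {
  if (!Number.isFinite(score) || score < 0 || score > 100) {
    throw new RangeError(`score must be between 0 and 100, got ${score}`);
  }
  // "Fail under" means strictly below: a score equal to the threshold passes.
  const passed = score >= failUnderScore;
  const message = passed
    ? `DR score ${score} meets threshold ${failUnderScore}`
    : `DR score ${score} is below threshold ${failUnderScore}`;
  return { passed, message };
}

const gate = checkScoreGate(45, 60);
console.log(gate.message); // DR score 45 is below threshold 60
```

With `fail-under-score: 60`, the demo account's 45/100 would fail the check.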
What's under the hood
- 16 AWS service scanners with bounded concurrency and retry
- Service detection from CloudFormation, tags, and topology
- 39 DR validation rules across 6 categories
- Evidence model with 5 maturity levels
- Scenario coverage analysis (AZ failure, region failure, SPOF, data corruption)
- DRP-as-Code with executable runbooks
- Posture memory with finding lifecycle, DR debt, and trend tracking
- Governance with ownership, risk acceptance, and policy enforcement
- AES-256-GCM encryption, redaction, and always-on audit trail
Written in TypeScript in strict mode with no use of the any type, 654 tests, and 81% core coverage. Licensed under AGPL-3.0.
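For readers unfamiliar with "bounded concurrency and retry", here is a generic TypeScript sketch of the pattern: a fixed number of workers pull tasks from a shared list, and each task is retried a few times before giving up. The `withRetry` and `runBounded` helpers are illustrative names and are not taken from the Stronghold codebase. Backoff delays between attempts are omitted to keep it short.

```typescript
// Illustrative only: a generic pattern, not the actual scanner code.
async function withRetry<T>(task: () => Promise<T>, maxAttempts: number): Promise<T> {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error; // Remember the failure and try again.
    }
  }
  throw lastError;
}

async function runBounded<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
  maxAttempts = 3,
): Promise<T[]> {
  if (limit < 1) throw new RangeError("limit must be at least 1");
  const results: T[] = new Array(tasks.length);
  let next = 0;
  // Each worker claims the next unclaimed index until none remain,
  // so at most `limit` tasks are in flight at any time.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await withRetry(tasks[index], maxAttempts);
    }
  }
  const workers = Array.from({ length: Math.min(limit, tasks.length) }, () => worker());
  await Promise.all(workers);
  return results;
}

// Hypothetical scanners; the second one fails once to exercise the retry path.
let s3Attempts = 0;
const scanners: Array<() => Promise<string>> = [
  async () => "rds: ok",
  async () => {
    if (s3Attempts++ === 0) throw new Error("throttled");
    return "s3: ok";
  },
  async () => "lambda: ok",
];

runBounded(scanners, 2).then((results) => console.log(results));
```

Results come back in the same order as the input tasks, regardless of which worker finished first.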
Links
- GitHub: mehdi-arfaoui/Stronghold
- npm: @stronghold-dr/cli
- GitHub Action: stronghold-dr-check
- Website: stronghold.software
If you've ever wondered whether your DR plan would actually work in a real incident, try running the demo. It takes 30 seconds and you might be surprised by what a fresh pair of eyes finds, even on sample infrastructure.
I'm building this solo and in the open.
Feedback, issues, and stars are all welcome!