DEV Community


BuildWithAI: Architecting a Serverless DR Toolkit on AWS

Overview

I'd been getting more involved in disaster recovery planning lately and kept running into the same gap — a lot of teams on AWS have backups, but not a real Disaster Recovery (DR) plan. No documented runbooks, no tested failover procedures, no RTO/RPO targets tied to business impact. So that became the motivation for this side project: six AI-powered tools that automate the tedious parts of DR planning, built entirely on AWS.

BuildWithAI: DR Toolkit on AWS — DESIGN, PROMPT, LEARN

In part one of this three-part series, we'll walk through the architecture — the serverless stack, the central model config, and the 5-layer cost guardrail system that keeps everything under $10/month (you can set your own threshold; that's just what felt right for this side project). The next two parts will cover prompt engineering for each tool and the lessons learned building it.

Here is a look at what we're going to build. You can try out the live version at https://dr-toolkit.thecloudspark.com.

While this was implemented with the help of Kiro — AWS's spec-driven AI IDE — this series will focus on the DR toolkit, Amazon Bedrock, and the underlying AWS architecture, rather than Kiro itself.


What the toolkit does

Six tools, same workflow: provide input, Lambda calls Amazon Bedrock, get formatted output.

| # | Tool | Default Model | What it does |
|---|------|---------------|--------------|
| 1 | Runbook Generator | Nova Pro | Paste IaC → get a full DR runbook |
| 2 | RTO/RPO Estimator | Nova Lite | Fill a form → get recovery targets and DR tier |
| 3 | DR Strategy Advisor | Nova Lite | Answer questions → get an AWS DR architecture pattern |
| 4 | Post-Mortem Writer | Nova Lite | Paste incident notes → get a structured post-mortem |
| 5 | DR Checklist Builder | Nova Lite | Pick your AWS services → get a tailored audit checklist |
| 6 | Template DR Reviewer | Nova Pro | Paste IaC → get a gap analysis with fix snippets |

Screenshot: DR AI Toolkit homepage showing all 6 tool cards

The live demo at DR Toolkit currently runs on Amazon Nova models. But these are just the defaults — the toolkit supports any model in the Bedrock Model Catalog. You can mix and match: Nova Lite for simple tools, Claude Sonnet for complex ones, or go all-in on a single provider. Just update models.config.json and redeploy.


Architecture

Here's the big picture. I kept the architecture an intentionally simple, straightforward AWS serverless setup: a few Lambda functions, one API Gateway, one DynamoDB table, one SNS topic, and S3 + CloudFront for the frontend.

So when someone opens the toolkit, CloudFront serves the static frontend from a private S3 bucket. When they submit a tool form, the request goes through API Gateway to one of six tool Lambda functions. Each Lambda runs through the guardrail checks against DynamoDB before calling Amazon Bedrock's invoke_model. Separately, if the monthly AWS Budget hits $10, an SNS alert triggers the budget_shutoff Lambda, which flips tools_enabled=False in DynamoDB. Every tool checks that flag before doing anything else.

Browser
   │
   ├── GET ──▶ CloudFront (security headers + URL rewrite)
   │              └──▶ S3 (private bucket, OAC only)
   │
   └── POST ──▶ API Gateway (HTTP API, 10 req/s, burst 25)
                    │
                    ▼
               AWS Lambda (Python 3.14)
                 ├── guardrails.py  ← 5-layer cost protection
                 ├── model_config.py ← reads models.config.json
                 ├── Amazon Bedrock (cross-region inference profiles)
                 └── DynamoDB (daily counters + IP rate limits + kill switch)

AWS Budget $10/mo ──▶ SNS ──▶ Lambda (flips kill switch)
| Layer | What | Why |
|-------|------|-----|
| Frontend | Next.js 16 + Tailwind CSS v3 | Static export, zero server cost |
| Frontend hosting | S3 (private, OAC) + CloudFront | Security headers, HTTPS, URL rewrite |
| API | API Gateway HTTP API | Built-in throttling, cheaper than REST API |
| Compute | Lambda (Python 3.14) | One function per tool + shared layer |
| AI | Amazon Bedrock | Cross-region inference profiles |
| Database | DynamoDB (on-demand) | Counters + feature flag + per-IP rate limits |
| Alerts | SNS + AWS Budgets | Auto-shutoff at $10/month |
| IaC | Serverless Framework | Single serverless.yml |

Central config: models.config.json

Every tool's model, token limit, daily cap, and word count are controlled by a single JSON file at the repo root:

{
  "region": "ap-southeast-1",
  "tools": {
    "runbook-generator": {
      "modelId": "apac.amazon.nova-pro-v1:0",
      "displayLabel": "Nova Pro",
      "badgeColor": "blue",
      "toolLimit": 50,
      "maxTokens": 800,
      "maxWords": 600
    },
    "rto-estimator": {
      "modelId": "apac.amazon.nova-lite-v1:0",
      "displayLabel": "Nova Lite",
      "badgeColor": "green",
      "toolLimit": 50,
      "maxTokens": 400,
      "maxWords": 300
    }
  }
}

This config is consumed at deploy time by three things:

  • Lambda handlers — via a shared model_config.py module
  • Frontend — a slim copy with just displayLabel + badgeColor for the UI badges
  • serverless-models.js — auto-generates IAM resource ARNs so Bedrock permissions stay scoped to exactly the models in use
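For illustration, a minimal sketch of what the shared loader could look like (the function names and signatures here are assumptions, not the actual model_config.py):

```python
import json

def load_config(path: str = "models.config.json") -> dict:
    """Read the shared config file once, typically at Lambda cold start."""
    with open(path) as f:
        return json.load(f)

def get_tool_settings(config: dict, tool: str) -> dict:
    """Return one tool's settings: modelId, toolLimit, maxTokens, maxWords."""
    return config["tools"][tool]
```

Because every consumer reads the same file, the Lambda handlers, the frontend badges, and the generated IAM policy can never drift out of sync.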

The handlers auto-detect the model provider from the modelId and use the correct Bedrock request format — Anthropic's anthropic_version + system string format for Claude, or Amazon's schemaVersion: messages-v1 + system array format for Nova. You can mix providers freely within the same deployment. IAM permissions update automatically on deploy — no manual policy edits needed.
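The provider branch might look like this sketch — the function name is illustrative, and the body shapes follow the two request formats named above:

```python
import json

def build_request_body(model_id: str, system_prompt: str,
                       user_input: str, max_tokens: int) -> str:
    """Build the invoke_model body in the format the provider expects,
    inferring the provider from the modelId string."""
    if "anthropic" in model_id:
        # Claude: anthropic_version + system as a plain string
        body = {
            "anthropic_version": "bedrock-2023-05-31",
            "system": system_prompt,
            "messages": [{"role": "user",
                          "content": [{"type": "text", "text": user_input}]}],
            "max_tokens": max_tokens,
        }
    else:
        # Nova: schemaVersion messages-v1 + system as an array
        body = {
            "schemaVersion": "messages-v1",
            "system": [{"text": system_prompt}],
            "messages": [{"role": "user", "content": [{"text": user_input}]}],
            "inferenceConfig": {"maxTokens": max_tokens},
        }
    return json.dumps(body)
```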

Want to switch from Nova to Claude? Swap the modelId:

"runbook-generator": {
  "modelId": "global.anthropic.claude-sonnet-4-6",
  "displayLabel": "Sonnet 4.6",
  ...
}

Redeploy and that's it 🚀. The Model Selection Guide in the repo has copy-paste-ready model IDs for every supported option.


The 5-layer cost guardrail system

Running a free public tool on Bedrock with no authentication means you need cost protection in layers. Five guardrail layers is probably overkill for most projects. But for a free public demo where anyone can hit the endpoint, I'd rather over-protect than wake up to a surprise bill. All five checks run before Bedrock ever gets called.

Layer 1 — API Gateway throttling

Configured in serverless.yml:

HttpApiStage:
  Properties:
    DefaultRouteSettings:
      ThrottlingRateLimit: 10
      ThrottlingBurstLimit: 25

This is the first line of defense. Abuse gets 429s from API Gateway before Lambda even runs. Zero Bedrock cost.

Layer 2 — Daily usage counters

DynamoDB atomic conditional increments, both global (200/day) and per-tool (50/day for most tools, 30 for DR Reviewer since Nova Pro costs more per call):

table.update_item(
    Key={"pk": f"usage#{today}", "sk": sk},
    UpdateExpression="ADD run_count :inc SET #d = :date",
    ConditionExpression="attribute_not_exists(run_count) OR run_count < :limit",
    ExpressionAttributeNames={"#d": "date"},
    ExpressionAttributeValues={":inc": 1, ":limit": limit, ":date": today},
)

Layer 3 — Per-IP rate limiting

3 requests per minute per IP, using DynamoDB TTL'd counters:

import time
from datetime import datetime, timezone

minute_bucket = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M")
pk = f"ratelimit#{source_ip}#{minute_bucket}"

table.update_item(
    Key={"pk": pk, "sk": "ALL"},
    UpdateExpression="ADD run_count :inc SET expires_at = :exp",
    ConditionExpression="attribute_not_exists(run_count) OR run_count < :limit",
    ExpressionAttributeValues={
        ":inc": 1,
        ":limit": IP_RATE_LIMIT,
        ":exp": int(time.time()) + 120,
    },
)

Layer 4 — Bedrock token caps

Hard max_tokens per tool (400–800 depending on the tool). Input is also truncated to 8,000 characters before it reaches Bedrock. Most templates I tested were well under 3,000 characters, so the cap rarely triggers, but it bounds the worst case.
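The truncation step itself is a one-liner; a sketch, with the constant and function names assumed:

```python
MAX_INPUT_CHARS = 8_000

def truncate_input(text: str, limit: int = MAX_INPUT_CHARS) -> str:
    """Bound the prompt size (and therefore input-token cost) before
    the text ever reaches Bedrock."""
    return text if len(text) <= limit else text[:limit]
```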

Layer 5 — Budget auto-shutoff

AWS Budget at $10/month → SNS → Lambda sets tools_enabled = false in DynamoDB:

def handler(event, context):
    table.put_item(Item={
        "pk": "config", "sk": "global",
        "tools_enabled": False,
        "disabled_reason": "Monthly budget threshold reached.",
    })

Screenshot: DynamoDB table showing usage counters and config row

Every handler checks this flag first. Worst case: tools temporarily unavailable. But never a surprise bill. (There's up to a ~5 minute lag between the budget alert and shutoff, so in-flight requests at alarm time aren't blocked. But at these volumes, the overshoot is negligible.)
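That first check might look like the sketch below (function name assumed, though the pk/sk pair matches the shutoff handler's config row):

```python
def tools_enabled(table) -> bool:
    """Read the global kill switch set by the budget_shutoff Lambda.
    Missing config row means the toolkit has never been disabled."""
    resp = table.get_item(Key={"pk": "config", "sk": "global"})
    item = resp.get("Item")
    if item is None:
        return True  # no config row yet: tools stay enabled
    return bool(item.get("tools_enabled", True))
```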


Security hardening

A few key controls worth highlighting:

IAM least privilege. bedrock:InvokeModel is scoped to specific inference profile and foundation model ARNs, auto-generated from models.config.json by serverless-models.js. No wildcards on any IAM policy.

S3 private + OAC. No public access. Only CloudFront can read from the bucket.

CORS. API Gateway allowedOrigins is restricted to the CloudFront domain. The Lambda response headers themselves use Access-Control-Allow-Origin: * because the response helper doesn't know the domain and the API relies on rate limiting and daily caps (not auth tokens) for protection. The gateway-level restriction is the meaningful one.

Prompt injection defense. All handlers use Bedrock's system parameter to separate instructions from user input. More on this in Part 2.

Full details in the Security Assessment doc in the repo.


What's next

That covers the architecture: the serverless stack, the central config, the 5-layer cost guardrails, and the security controls.


In the next part, we'll look at the tools themselves: the prompts behind each one, how to choose the right model per tool, the system prompt pattern for prompt injection defense, and the patterns that are reusable in any Bedrock project.


Try it / Fork it:

Live Demo: https://dr-toolkit.thecloudspark.com


Source Code: github.com/romarcablao/dr-toolkit-on-aws
