DEV Community


BuildWithAI: Architecting a Serverless DR Toolkit on AWS

Overview

I'd been getting more involved in disaster recovery planning lately and kept running into the same gap — a lot of teams on AWS have backups, but not a real Disaster Recovery (DR) plan. No documented runbooks, no tested failover procedures, no RTO/RPO targets tied to business impact. So that became the motivation for this side project: six AI-powered tools that automate the tedious parts of DR planning, built entirely on AWS.

BuildWithAI: DR Toolkit on AWS — DESIGN, PROMPT, LEARN

In part one of this three-part series, we'll walk through the architecture — the serverless stack, the central model config, and the 5-layer cost guardrail system that keeps everything under $10/month (you can set your own threshold; that's just what felt right for this side project). The next two parts will cover prompt engineering for each tool and the lessons learned building it.

Here is a look at what we're going to build. You can try out the live version at https://dr-toolkit.thecloudspark.com.

While this was implemented with the help of Kiro — AWS's spec-driven AI IDE — this series will focus on the DR toolkit, Amazon Bedrock, and the underlying AWS architecture, rather than Kiro itself.


What the toolkit does

Six tools, same workflow: provide input, Lambda calls Amazon Bedrock, get formatted output.

| # | Tool | Default Model | What it does |
|---|------|---------------|--------------|
| 1 | Runbook Generator | Nova Pro | Paste IaC → get a full DR runbook |
| 2 | RTO/RPO Estimator | Nova Lite | Fill a form → get recovery targets and DR tier |
| 3 | DR Strategy Advisor | Nova Lite | Answer questions → get an AWS DR architecture pattern |
| 4 | Post-Mortem Writer | Nova Lite | Paste incident notes → get a structured post-mortem |
| 5 | DR Checklist Builder | Nova Lite | Pick your AWS services → get a tailored audit checklist |
| 6 | Template DR Reviewer | Nova Pro | Paste IaC → get a gap analysis with fix snippets |

Screenshot: DR AI Toolkit homepage showing all 6 tool cards

The live demo at DR Toolkit currently runs on Amazon Nova models. But these are just the defaults — the toolkit supports any model in the Bedrock Model Catalog. You can mix and match: Nova Lite for simple tools, Claude Sonnet for complex ones, or go all-in on a single provider. Just update models.config.json and redeploy.


Architecture

Here's the big picture. I kept the architecture an intentionally simple, straightforward AWS serverless setup: a few Lambda functions, one API Gateway, one DynamoDB table, one SNS topic, and S3 + CloudFront for the frontend.

So when someone opens the toolkit, CloudFront serves the static frontend from a private S3 bucket. When they submit a tool form, the request goes through API Gateway to one of six tool Lambda functions. Each Lambda runs through the guardrail checks against DynamoDB before calling Amazon Bedrock's invoke_model. Separately, if the monthly AWS Budget hits $10, an SNS alert triggers the budget_shutoff Lambda, which flips tools_enabled=False in DynamoDB. Every tool checks that flag before doing anything else.

Browser
   │
   ├── GET ──▶ CloudFront (security headers + URL rewrite)
   │              └──▶ S3 (private bucket, OAC only)
   │
   └── POST ──▶ API Gateway (HTTP API, 10 req/s, burst 25)
                    │
                    ▼
               AWS Lambda (Python 3.14)
                 ├── guardrails.py  ← 5-layer cost protection
                 ├── model_config.py ← reads models.config.json
                 ├── Amazon Bedrock (cross-region inference profiles)
                 └── DynamoDB (daily counters + IP rate limits + kill switch)

AWS Budget $10/mo ──▶ SNS ──▶ Lambda (flips kill switch)
| Layer | What | Why |
|-------|------|-----|
| Frontend | Next.js 16 + Tailwind CSS v3 | Static export, zero server cost |
| Frontend hosting | S3 (private, OAC) + CloudFront | Security headers, HTTPS, URL rewrite |
| API | API Gateway HTTP API | Built-in throttling, cheaper than REST API |
| Compute | Lambda (Python 3.14) | One function per tool + shared layer |
| AI | Amazon Bedrock | Cross-region inference profiles |
| Database | DynamoDB (on-demand) | Counters + feature flag + per-IP rate limits |
| Alerts | SNS + AWS Budgets | Auto-shutoff at $10/month |
| IaC | Serverless Framework | Single serverless.yml |

Central config: models.config.json

Every tool's model, token limit, daily cap, and word count are controlled by a single JSON file at the repo root:

{
  "region": "ap-southeast-1",
  "tools": {
    "runbook-generator": {
      "modelId": "apac.amazon.nova-pro-v1:0",
      "displayLabel": "Nova Pro",
      "badgeColor": "blue",
      "toolLimit": 50,
      "maxTokens": 800,
      "maxWords": 600
    },
    "rto-estimator": {
      "modelId": "apac.amazon.nova-lite-v1:0",
      "displayLabel": "Nova Lite",
      "badgeColor": "green",
      "toolLimit": 50,
      "maxTokens": 400,
      "maxWords": 300
    }
  }
}

This config is consumed at deploy time by three things:

  • Lambda handlers — via a shared model_config.py module
  • Frontend — a slim copy with just displayLabel + badgeColor for the UI badges
  • serverless-models.js — auto-generates IAM resource ARNs so Bedrock permissions stay scoped to exactly the models in use
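For illustration, a minimal sketch of what the shared loader could look like (the function names and signatures here are assumptions, not the actual model_config.py):

```python
import json

def load_config(path: str = "models.config.json") -> dict:
    """Read the shared config file once, typically at Lambda cold start."""
    with open(path) as f:
        return json.load(f)

def get_tool_settings(config: dict, tool: str) -> dict:
    """Return one tool's settings: modelId, toolLimit, maxTokens, maxWords."""
    return config["tools"][tool]
```

Because every consumer reads the same file, the Lambda handlers, the frontend badges, and the generated IAM policy can never drift out of sync.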

The handlers auto-detect the model provider from the modelId and use the correct Bedrock request format — Anthropic's anthropic_version + system string format for Claude, or Amazon's schemaVersion: messages-v1 + system array format for Nova. You can mix providers freely within the same deployment. IAM permissions update automatically on deploy — no manual policy edits needed.
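The provider branch might look like this sketch — the function name is illustrative, and the body shapes follow the two request formats named above:

```python
import json

def build_request_body(model_id: str, system_prompt: str,
                       user_input: str, max_tokens: int) -> str:
    """Build the invoke_model body in the format the provider expects,
    inferring the provider from the modelId string."""
    if "anthropic" in model_id:
        # Claude: anthropic_version + system as a plain string
        body = {
            "anthropic_version": "bedrock-2023-05-31",
            "system": system_prompt,
            "messages": [{"role": "user",
                          "content": [{"type": "text", "text": user_input}]}],
            "max_tokens": max_tokens,
        }
    else:
        # Nova: schemaVersion messages-v1 + system as an array
        body = {
            "schemaVersion": "messages-v1",
            "system": [{"text": system_prompt}],
            "messages": [{"role": "user", "content": [{"text": user_input}]}],
            "inferenceConfig": {"maxTokens": max_tokens},
        }
    return json.dumps(body)
```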

Want to switch from Nova to Claude? Swap the modelId:

"runbook-generator": {
  "modelId": "global.anthropic.claude-sonnet-4-6",
  "displayLabel": "Sonnet 4.6",
  ...
}

Redeploy and that's it 🚀. The Model Selection Guide in the repo has copy-paste-ready model IDs for every supported option.


The 5-layer cost guardrail system

Running a free public tool on Bedrock with no authentication means you need cost protection in layers. Five guardrail layers is probably overkill for most projects. But for a free public demo where anyone can hit the endpoint, I'd rather over-protect than wake up to a surprise bill. All five checks run before Bedrock ever gets called.

Layer 1 — API Gateway throttling

Configured in serverless.yml:

HttpApiStage:
  Properties:
    DefaultRouteSettings:
      ThrottlingRateLimit: 10
      ThrottlingBurstLimit: 25

This is the first line of defense. Abuse gets 429s from API Gateway before Lambda even runs. Zero Bedrock cost.

Layer 2 — Daily usage counters

DynamoDB atomic conditional increments, both global (200/day) and per-tool (50/day for most tools, 30 for DR Reviewer since Nova Pro costs more per call):

table.update_item(
    Key={"pk": f"usage#{today}", "sk": sk},
    UpdateExpression="ADD run_count :inc SET #d = :date",
    ConditionExpression="attribute_not_exists(run_count) OR run_count < :limit",
    ExpressionAttributeNames={"#d": "date"},
    ExpressionAttributeValues={":inc": 1, ":limit": limit, ":date": today},
)

Layer 3 — Per-IP rate limiting

3 requests per minute per IP, using DynamoDB TTL'd counters:

import time
from datetime import datetime, timezone

minute_bucket = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M")
pk = f"ratelimit#{source_ip}#{minute_bucket}"

table.update_item(
    Key={"pk": pk, "sk": "ALL"},
    UpdateExpression="ADD run_count :inc SET expires_at = :exp",
    ConditionExpression="attribute_not_exists(run_count) OR run_count < :limit",
    ExpressionAttributeValues={
        ":inc": 1,
        ":limit": IP_RATE_LIMIT,
        ":exp": int(time.time()) + 120,
    },
)

Layer 4 — Bedrock token caps

Hard max_tokens per tool (400–800 depending on the tool). Input is also truncated to 8,000 characters before it reaches Bedrock. Most templates I tested were well under 3,000 characters, so the cap rarely triggers, but it bounds the worst case.
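The truncation step itself is a one-liner; a sketch, with the constant and function names assumed:

```python
MAX_INPUT_CHARS = 8_000

def truncate_input(text: str, limit: int = MAX_INPUT_CHARS) -> str:
    """Bound the prompt size (and therefore input-token cost) before
    the text ever reaches Bedrock."""
    return text if len(text) <= limit else text[:limit]
```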

Layer 5 — Budget auto-shutoff

AWS Budget at $10/month → SNS → Lambda sets tools_enabled = false in DynamoDB:

def handler(event, context):
    table.put_item(Item={
        "pk": "config", "sk": "global",
        "tools_enabled": False,
        "disabled_reason": "Monthly budget threshold reached.",
    })

Screenshot: DynamoDB table showing usage counters and config row

Every handler checks this flag first. Worst case: tools temporarily unavailable. But never a surprise bill. (There's up to a ~5 minute lag between the budget alert and shutoff, so in-flight requests at alarm time aren't blocked. But at these volumes, the overshoot is negligible.)
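That first check might look like the sketch below (function name assumed, though the pk/sk pair matches the shutoff handler's config row):

```python
def tools_enabled(table) -> bool:
    """Read the global kill switch set by the budget_shutoff Lambda.
    Missing config row means the toolkit has never been disabled."""
    resp = table.get_item(Key={"pk": "config", "sk": "global"})
    item = resp.get("Item")
    if item is None:
        return True  # no config row yet: tools stay enabled
    return bool(item.get("tools_enabled", True))
```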


Security hardening

A few key controls worth highlighting:

IAM least privilege. bedrock:InvokeModel is scoped to specific inference profile and foundation model ARNs, auto-generated from models.config.json by serverless-models.js. No wildcards on any IAM policy.

S3 private + OAC. No public access. Only CloudFront can read from the bucket.

CORS. API Gateway allowedOrigins is restricted to the CloudFront domain. The Lambda response headers themselves use Access-Control-Allow-Origin: * because the response helper doesn't know the domain and the API relies on rate limiting and daily caps (not auth tokens) for protection. The gateway-level restriction is the meaningful one.

Prompt injection defense. All handlers use Bedrock's system parameter to separate instructions from user input. More on this in Part 2.

Full details in the Security Assessment doc in the repo.


What's next

That covers the architecture: the serverless stack, the central config, the 5-layer cost guardrails, and the security controls.


In the next part, we'll look at the tools themselves: the prompts behind each one, how to choose the right model per tool, the system prompt pattern for prompt injection defense, and the patterns that are reusable in any Bedrock project.


Try it / Fork it:

Live Demo: https://dr-toolkit.thecloudspark.com


Source Code: github.com/romarcablao/dr-toolkit-on-aws
