DEV Community

Harish Aravindan

DynamoDB as a State Machine: How I Stopped Paying for Redundant Lambda Executions

Part of my warrantyAI build series — building an AI-powered warranty management system on AWS, one week at a time.


**I paused an AI agent mid-workflow this week and let a human decide what happens next.** Week 9 of building warrantyAI — and this one changed how I think about AI pipelines.

**The problem:** Week 8's pipeline classified warranties and sent reminders automatically. Fine for low and medium risk. But for high-risk documents — expired warranties, missing serial numbers, suspicious policy terms — no one should be auto-sending anything without a human in the loop.

And honestly? The AI world is finally catching up to this thinking. OpenAI, Anthropic, and Google have all started baking **Human-in-the-Loop (HITL)** patterns into their agent frameworks — not as an afterthought, but as a core design primitive.

The reason is simple: LLMs are probabilistic. They're very good at pattern recognition across millions of documents. They're not good at knowing when they're wrong. A confident wrong answer from a classifier in a warranty pipeline doesn't just fail silently — it sends a notification to a real customer.

**HITL is the circuit breaker.** You let the AI handle the 90% it's genuinely better at — reading documents, extracting structure, classifying risk — and you bring humans in precisely at the 10% where consequences matter. That's not a limitation of the AI. That's good system design.

So I built it. Here's the new flow:

📄 **Reader** → 🔍 **Classifier** → 🛑 **HITL Agent** → 📨 **Reminder**

The HITL agent does three things when risk = HIGH:

1. Serialises the full pipeline state to DynamoDB
2. Sends an SNS email to the reviewer with ✅ Approve and ❌ Reject links
3. Raises NodeInterrupt — LangGraph pauses the graph completely

The pipeline just... stops. Waits.

When the reviewer clicks Approve, a second Lambda fires, reads the DynamoDB state, and re-invokes the pipeline — but only the Reminder agent. Everything before that already ran. When they click Reject, an SNS notification goes to the tenant. No reminder. Full audit trail in S3. For medium and low risk? HITL is skipped entirely. Zero delay.

**What I learned building this:**

— LangGraph's NodeInterrupt is surprisingly clean. One raise, graph pauses.
— DynamoDB as a state checkpoint is more reliable than I expected (7-day TTL, on-demand billing).
— The hardest part wasn't the code. It was deciding what "high risk" actually means.

The best AI systems in production today aren't the ones running fully autonomously. They're the ones that know exactly when to stop and ask.

Stack: LangGraph + AWS Bedrock + DynamoDB + SNS + API Gateway + Lambda
Repo: https://lnkd.in/gYrC3wEW

Week 10: CI/CD for the whole pipeline + prompt regression tests.

#bedrock #aws #serverless #langchain #aiengineering #building #ai #agents


Most people reach for DynamoDB when they need a fast key-value store. I did too.

Then I started using it as a state machine — and accidentally cut the redundant Lambda execution cost out of my AI pipeline entirely.

Here's the pattern.


The Problem: AI Pipelines That Don't Know When to Stop

In Week 8 of building warrantyAI, I had a 3-agent LangGraph pipeline:

```
Reader → Classifier → Reminder
```

Every document that came in ran the full pipeline. Reader extracted text with Textract. Classifier invoked Bedrock (Claude Haiku, fallback to Sonnet). Reminder generated a notification and published to SNS.

That's fine when every document should proceed to the end. But in Week 9, I added a human review step for high-risk warranties. The pipeline needed to:

  1. Pause after classification
  2. Wait for a human decision (could be hours, could be days)
  3. Resume from exactly where it stopped — not re-run everything

The naive approach would be to re-invoke the full pipeline on resume. Reader runs again. Textract runs again. Classifier calls Bedrock again. You pay for all of it twice.

With DynamoDB as the state checkpoint, the resumed execution runs only the Reminder agent. Everything before it is already stored.


The Pattern: Checkpoint State, Not Just Data

The key mental shift: DynamoDB isn't storing the result of your pipeline. It's storing the entire state of your pipeline at the moment it paused.

Here's the DynamoDB schema I use:

| Field | Type | Value |
|---|---|---|
| `document_id` | PK | Unique per document |
| `sk` | SK | Always `"REVIEW"` |
| `status` | S | `pending_review` / `approved` / `rejected` |
| `warranty_state` | S | Full pipeline state as JSON string |
| `created_at` | S | ISO 8601 timestamp |
| `ttl` | N | Unix epoch — auto-expires after 7 days |

The warranty_state field holds everything: raw extracted text, classification result, risk level, model used, guardrail flags, audit log. The entire WarrantyState TypedDict serialised as a JSON string.

When the pipeline resumes, it deserialises that field and picks up exactly where it left off.
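For concreteness, here's roughly what that state object and its round-trip look like. The field names below are an illustrative subset I've assumed, not the repo's exact `WarrantyState` definition:

```python
import json
from typing import TypedDict

class WarrantyState(TypedDict, total=False):
    """Illustrative subset of the pipeline state (field names assumed)."""
    document_id: str
    tenant_id: str
    extracted_text: str   # Reader output (Textract)
    risk_level: str       # "low" / "medium" / "high"
    model_used: str
    audit_log: list

state: WarrantyState = {
    "document_id": "doc-123",
    "tenant_id": "tenant-1",
    "risk_level": "high",
    "audit_log": ["reader:ok", "classifier:ok"],
}

# The round-trip: what the checkpoint stores and the resume Lambda restores
blob = json.dumps(state, default=str)
restored = json.loads(blob)
assert restored == state
```

`default=str` matters here: anything non-JSON-native in the state (datetimes, Decimals from earlier DynamoDB reads) gets coerced to a string instead of raising.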

```python
import json
import os
import time
from datetime import datetime, timezone

import boto3

dynamodb   = boto3.resource("dynamodb")
HITL_TABLE = os.environ["HITL_TABLE"]  # table name from the Lambda environment

def write_to_dynamodb(state: WarrantyState) -> None:
    """Checkpoint the full pipeline state before pausing for review."""
    table = dynamodb.Table(HITL_TABLE)
    ttl   = int(time.time()) + (7 * 86400)  # 7-day auto-expiry

    table.put_item(Item={
        "document_id":    state["document_id"],
        "sk":             "REVIEW",
        "tenant_id":      state["tenant_id"],
        "status":         "pending_review",
        "warranty_state": json.dumps(state, default=str),  # full state, serialised
        "created_at":     datetime.now(timezone.utc).isoformat(),
        "ttl":            ttl,
    })
```

And on resume:

```python
def get_review_record(document_id: str) -> dict:
    table    = dynamodb.Table(HITL_TABLE)
    response = table.get_item(Key={"document_id": document_id, "sk": "REVIEW"})
    item     = response.get("Item")

    if item is None:  # never written, or already deleted by TTL
        raise ValueError(f"No review record for {document_id}")
    if item.get("status") != "pending_review":
        raise ValueError(f"Already actioned: {item.get('status')}")

    return item

# In the resume Lambda:
record         = get_review_record(document_id)
warranty_state = json.loads(record["warranty_state"])  # full state restored

# Run only the Reminder agent — Reader and Classifier already ran
reminder_update = reminder_agent(warranty_state)
```

No Textract. No Bedrock classification call. Just the Reminder.
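The node that triggers all of this ties the pieces together. Here's a sketch of what it might look like — the environment variable names, URL paths, and `build_review_links` helper are my assumptions, not the repo's exact code, and `write_to_dynamodb` is the function shown above:

```python
import os

try:  # AWS / LangGraph dependencies; stubbed so the sketch stays importable
    import boto3
    from langgraph.errors import NodeInterrupt
except ImportError:
    boto3 = None
    class NodeInterrupt(Exception):
        pass

def build_review_links(api_base: str, document_id: str) -> dict:
    """Approve/reject URLs for the reviewer email (paths are assumptions)."""
    return {
        "approve": f"{api_base}/review?document_id={document_id}&action=approve",
        "reject":  f"{api_base}/review?document_id={document_id}&action=reject",
    }

def hitl_agent(state: dict) -> dict:
    if state.get("risk_level") != "high":
        return state  # medium/low risk: no pause, no checkpoint

    # 1. Checkpoint the full pipeline state (write_to_dynamodb from above)
    write_to_dynamodb(state)

    # 2. Email the reviewer with one-click decision links
    links = build_review_links(os.environ["REVIEW_API_BASE"], state["document_id"])
    boto3.client("sns").publish(
        TopicArn=os.environ["REVIEWER_TOPIC_ARN"],
        Subject=f"Review required: {state['document_id']}",
        Message=f"Approve: {links['approve']}\nReject: {links['reject']}",
    )

    # 3. Pause the graph; the resume Lambda re-invokes from Reminder later
    raise NodeInterrupt(f"Awaiting human review for {state['document_id']}")
```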


The Full HITL Flow

```
S3 Upload → Lambda trigger
     │
     ▼
Reader Agent      (Textract + Bedrock Haiku structuring)
     │
     ▼
Classifier Agent  (Bedrock Haiku → Sonnet fallback if confidence < 0.7)
     │
     ▼
HITL Agent ──── risk != "high" ──────────────────────────┐
     │                                                   │
     │ risk == "high"                                    │
     ▼                                                   │
Write full state to DynamoDB                             │
     │                                                   │
     ▼                                                   │
SNS email to reviewer                                    │
(approve/reject links)                                   │
     │                                                   │
     ▼                                                   │
NodeInterrupt — graph pauses                             │
                                                         │
     Reviewer clicks link                                │
     → API Gateway                                       │
     → resume Lambda                                     │
          │                                              │
          ├── APPROVE → run_from_reminder(state)         │
          └── REJECT  → SNS to tenant, stop              │
                                                         │
                                                  Reminder Agent
                                                         │
                                                  SNS to tenant
```
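The resume Lambda in that flow is deliberately thin. A sketch of its handler — the query-string parameter names and the `notify_tenant_rejected` helper are assumptions; `get_review_record` and `run_from_reminder` are the functions named above:

```python
import json

def parse_review_action(params: dict) -> tuple:
    """Validate the approve/reject link parameters (param names assumed)."""
    document_id = (params or {}).get("document_id")
    action      = (params or {}).get("action")
    if not document_id or action not in ("approve", "reject"):
        raise ValueError(f"Bad review request: {params}")
    return document_id, action

def lambda_handler(event: dict, context) -> dict:
    """Resume Lambda behind API Gateway."""
    document_id, action = parse_review_action(event.get("queryStringParameters"))

    record = get_review_record(document_id)        # raises if already actioned
    state  = json.loads(record["warranty_state"])  # full pipeline state restored

    if action == "approve":
        run_from_reminder(state)        # only the Reminder agent runs
    else:
        notify_tenant_rejected(state)   # hypothetical helper: SNS to tenant

    # A real handler would also update_item the record's status here,
    # so the approve/reject links are single-use.
    return {"statusCode": 200, "body": f"Review {action} recorded for {document_id}"}
```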

For medium and low risk documents, the HITL node is skipped entirely — the graph flows straight through to Reminder with no pause, no DynamoDB write, no cost.
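In LangGraph terms, that skip is just a conditional edge after the Classifier. A minimal sketch, with node names assumed to match the flow above:

```python
def route_after_classifier(state: dict) -> str:
    """Conditional edge: only high-risk documents enter the HITL pause."""
    return "hitl" if state.get("risk_level") == "high" else "reminder"

# Wiring sketch (node names assumed):
#   graph.add_conditional_edges("classifier", route_after_classifier,
#                               {"hitl": "hitl_agent", "reminder": "reminder_agent"})
```

The routing function itself is pure — no AWS calls — which also makes it trivially unit-testable.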


Why DynamoDB Over Other Options

When I was designing this, I considered three approaches:

**SQS with visibility timeout** — messages can be "in flight" for up to 12 hours. Not enough for a human review that might sit overnight. Also, you can't query by document_id easily.

**S3 as state store** — works, but you're polling or using S3 notifications to detect resume. Awkward.

**DynamoDB** — point lookups by document_id, TTL handles cleanup automatically, on-demand billing means you pay per read/write not per hour, and the Streams feature gives you a path to event-driven resume if you want it later.

The on-demand billing matters more than it sounds. A warranty pipeline doesn't process documents at a steady rate. Some days 500 documents, some days 5. With provisioned capacity you're paying for peak all the time. With on-demand you pay for actual usage.

At my current volume, the DynamoDB cost for the HITL table is under $0.50/month.
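A back-of-envelope check on that number. Every figure here is an assumption — the on-demand list prices are roughly what us-east-1 charged at time of writing (verify against current pricing), and the volumes are illustrative:

```python
# Assumed on-demand list prices (us-east-1; verify current pricing)
WRITE_PRICE_PER_UNIT = 1.25 / 1_000_000   # $ per 1 KB write request unit
READ_PRICE_PER_UNIT  = 0.25 / 1_000_000   # $ per 4 KB read request unit

docs_per_month     = 500 * 30   # pessimistic: peak-day volume every day
high_risk_fraction = 0.10       # assumed share that pauses for review
item_size_kb       = 4          # assumed serialized state size → 4 write units

reviews = docs_per_month * high_risk_fraction

# One checkpoint write + one status update per review, plus one read on resume
monthly_cost = reviews * (2 * item_size_kb * WRITE_PRICE_PER_UNIT
                          + READ_PRICE_PER_UNIT)
# → fractions of a cent per month; well under the $0.50 figure
```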


The TTL Trick

This is the part I underestimated when I first built this.

Every review record gets a ttl field set 7 days from creation:

```python
ttl = int(time.time()) + (7 * 86400)
```

DynamoDB's TTL feature automatically deletes expired items — no cron, no cleanup Lambda, no cost. Unactioned reviews just disappear. This matters because:

  • Stale review records don't accumulate
  • Storage costs stay flat regardless of volume
  • You don't need to build a cleanup process

The one thing to know: TTL deletion isn't instant. DynamoDB typically cleans up within 48 hours of expiry. If you need exact expiry (e.g. the approve link should stop working at exactly 7 days), enforce it in your Lambda:

```python
if item.get("status") != "pending_review":
    raise ValueError("Already actioned")

# Also check TTL manually if you need hard expiry
created = datetime.fromisoformat(item["created_at"])
if (datetime.now(timezone.utc) - created).days > 7:
    raise ValueError("Review expired")
```

The Cost Comparison

Here's what changed between Week 8 (no HITL) and Week 9 (HITL with DynamoDB state):

| | Week 8 | Week 9 |
|---|---|---|
| High-risk doc: Bedrock calls | 2 (classify + reminder gen) | 1 (reminder only, on approve) |
| High-risk doc: Textract | Yes, every run | Once, state stored |
| Redundant re-processing | On every retry | Zero |
| State cleanup | Manual | Automatic via TTL |
| DynamoDB cost | $0 | <$0.50/month |

The Bedrock saving is the real one. Claude Haiku is cheap (~$0.0004/call) but Sonnet fallback is ~$0.006/call. If a high-risk document triggered the Sonnet fallback and you re-ran the pipeline on resume, you'd pay for Sonnet twice. With DynamoDB state, classification runs once and the result is stored.

At low volume this is pennies. At scale — thousands of documents per day with a meaningful percentage flagged as high-risk — it adds up quickly.
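To put rough numbers on "it adds up", here's a sketch using the per-call prices above. The daily volume, high-risk share, and Sonnet-fallback rate are pure assumptions for illustration:

```python
SONNET_CALL = 0.006    # ~$ per Sonnet classification call (from above)
HAIKU_CALL  = 0.0004   # ~$ per Haiku call (from above)

docs_per_day       = 2000   # assumed scale
high_risk_fraction = 0.05   # assumed share paused for review, later approved
sonnet_rate        = 0.30   # assumed share of those that hit the Sonnet fallback

resumed_per_day = docs_per_day * high_risk_fraction  # pipelines re-invoked

# What a naive full re-run would re-pay in classification alone, per month
redundant_monthly = 30 * resumed_per_day * (
    sonnet_rate * SONNET_CALL + (1 - sonnet_rate) * HAIKU_CALL
)
# Textract re-runs would come on top — often the larger line item
```

With these assumptions that's a few dollars a month in duplicated Bedrock spend alone, growing linearly with volume — which is the point of checkpointing the classification result instead.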


What's Next

Week 10 adds CI/CD to the pipeline — GitHub Actions deploying to Lambda via ECR, with prompt regression tests so a bad Bedrock prompt doesn't silently break classification in production.

The DynamoDB state pattern from this week sets that up nicely: because state is checkpointed, regression tests can inject a known state at any node in the graph and assert the output without running the full pipeline.
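A sketch of that injection pattern — the checkpoint fields and the `reminder_text` output key are assumptions about the state shape, not the repo's actual contract:

```python
import json

def make_checkpoint(risk_level: str) -> dict:
    """A known-good state to inject at the Classifier→Reminder boundary (fields assumed)."""
    return json.loads(json.dumps({
        "document_id": "test-doc",
        "tenant_id": "test-tenant",
        "risk_level": risk_level,
        "classification": {"label": "expired_warranty", "confidence": 0.92},
    }))

def assert_reminder_contract(reminder_agent) -> None:
    """Run only the Reminder node against an injected checkpoint."""
    update = reminder_agent(make_checkpoint("high"))
    # No Textract, no Bedrock classification — just the node under test
    assert update["document_id"] == "test-doc"
    assert "reminder_text" in update   # hypothetical output key
```

Because the state is already a JSON-serialisable dict, the same fixture works in CI against a stubbed Bedrock client and in a staging run against the real one.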

