Part of my warrantyAI build series — building an AI-powered warranty management system on AWS, one week at a time.
Most people reach for DynamoDB when they need a fast key-value store. I did too.
Then I started using it as a state machine — and accidentally cut the redundant Lambda execution cost out of my AI pipeline entirely.
Here's the pattern.
The Problem: AI Pipelines That Don't Know When to Stop
In Week 8 of building warrantyAI, I had a 3-agent LangGraph pipeline:
Reader → Classifier → Reminder
Every document that came in ran the full pipeline. Reader extracted text with Textract. Classifier invoked Bedrock (Claude Haiku, fallback to Sonnet). Reminder generated a notification and published to SNS.
That's fine when every document should proceed to the end. But in Week 9, I added a human review step for high-risk warranties. The pipeline needed to:
- Pause after classification
- Wait for a human decision (could be hours, could be days)
- Resume from exactly where it stopped — not re-run everything
The naive approach would be to re-invoke the full pipeline on resume. Reader runs again. Textract runs again. Classifier calls Bedrock again. You pay for all of it twice.
With DynamoDB as the state checkpoint, the resumed execution runs only the Reminder agent. Everything before it is already stored.
The Pattern: Checkpoint State, Not Just Data
The key mental shift: DynamoDB isn't storing the result of your pipeline. It's storing the entire state of your pipeline at the moment it paused.
Here's the DynamoDB schema I use:
| Field | Type | Value |
|---|---|---|
| `document_id` | PK | Unique per document |
| `sk` | SK | Always `"REVIEW"` |
| `status` | S | `pending_review` / `approved` / `rejected` |
| `warranty_state` | S | Full pipeline state as JSON string |
| `created_at` | S | ISO 8601 timestamp |
| `ttl` | N | Unix epoch; auto-expires after 7 days |
The warranty_state field holds everything: raw extracted text, classification result, risk level, model used, guardrail flags, audit log. The entire WarrantyState TypedDict serialised as a JSON string.
When the pipeline resumes, it deserialises that field and picks up exactly where it left off.
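The article doesn't show the TypedDict itself, so here's a minimal sketch consistent with the fields listed above — the field names are my guesses for illustration, not the repo's actual definitions:

```python
from typing import TypedDict


class WarrantyState(TypedDict, total=False):
    document_id: str
    tenant_id: str
    raw_text: str           # Reader output (Textract extraction)
    classification: dict    # Classifier output
    risk: str               # "low" / "medium" / "high"
    model_used: str         # "haiku", or "sonnet" after fallback
    guardrail_flags: list
    audit_log: list
```

Because every field is a plain JSON-friendly type, the whole thing round-trips through `json.dumps` / `json.loads` cleanly.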
```python
import json
import os
import time
from datetime import datetime, timezone

import boto3

# Module-level setup (shown here for completeness)
dynamodb = boto3.resource("dynamodb")
HITL_TABLE = os.environ["HITL_TABLE"]


def write_to_dynamodb(state: WarrantyState) -> None:
    table = dynamodb.Table(HITL_TABLE)
    ttl = int(time.time()) + (7 * 86400)  # 7-day auto-expiry
    table.put_item(Item={
        "document_id": state["document_id"],
        "sk": "REVIEW",
        "tenant_id": state["tenant_id"],
        "status": "pending_review",
        "warranty_state": json.dumps(state, default=str),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "ttl": ttl,
    })
```
And on resume:
```python
def get_review_record(document_id: str) -> dict:
    table = dynamodb.Table(HITL_TABLE)
    response = table.get_item(Key={"document_id": document_id, "sk": "REVIEW"})
    item = response.get("Item")
    if item is None:
        raise ValueError(f"No review record for {document_id}")
    if item.get("status") != "pending_review":
        raise ValueError(f"Already actioned: {item.get('status')}")
    return item


# In resume Lambda:
record = get_review_record(document_id)
warranty_state = json.loads(record["warranty_state"])  # full state restored

# Run only the Reminder agent — Reader and Classifier already ran
reminder_update = reminder_agent(warranty_state)
```
No Textract. No Bedrock classification call. Just the Reminder.
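One wrinkle worth guarding against: two reviewers (or one impatient double-click) hitting the approve link at once. A conditional write makes the status flip atomic, so the Reminder agent can never run twice for the same document. A sketch — the helper name and kwargs-builder shape are mine, not from the repo:

```python
def approve_kwargs(document_id: str, decision: str) -> dict:
    """Build update_item kwargs that flip status only while it's still pending.

    A second click then fails with ConditionalCheckFailedException instead
    of silently re-triggering the Reminder agent.
    """
    return {
        "Key": {"document_id": document_id, "sk": "REVIEW"},
        "UpdateExpression": "SET #s = :decision",
        "ConditionExpression": "#s = :pending",
        "ExpressionAttributeNames": {"#s": "status"},  # "status" is a reserved word
        "ExpressionAttributeValues": {
            ":decision": decision,
            ":pending": "pending_review",
        },
    }


# Usage in the resume Lambda (assumes a boto3 Table resource):
# table.update_item(**approve_kwargs(document_id, "approved"))
```

The `ExpressionAttributeNames` indirection is needed because `status` is a DynamoDB reserved word.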
The Full HITL Flow
```
S3 Upload → Lambda trigger
     │
     ▼
Reader Agent (Textract + Bedrock Haiku structuring)
     │
     ▼
Classifier Agent (Bedrock Haiku → Sonnet fallback if confidence < 0.7)
     │
     ▼
HITL Agent ──── risk != "high" ──────────────────────────┐
     │                                                   │
     │ risk == "high"                                    │
     ▼                                                   │
Write full state to DynamoDB                             │
     │                                                   │
     ▼                                                   │
SNS email to reviewer                                    │
(approve/reject links)                                   │
     │                                                   │
     ▼                                                   │
NodeInterrupt — graph pauses                             │
     │                                                   │
Reviewer clicks link                                     │
     → API Gateway                                       │
     → resume Lambda                                     │
     │                                                   │
     ├── APPROVE → run_from_reminder(state) ─────────────┤
     └── REJECT  → SNS to tenant, stop                   │
                                                         ▼
                                                  Reminder Agent
                                                         │
                                                         ▼
                                                  SNS to tenant
```
For medium- and low-risk documents, the HITL node is skipped entirely — the graph flows straight through to Reminder with no pause, no DynamoDB write, no cost.
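In LangGraph terms, that skip is just a conditional edge, and the routing decision is a plain Python function of the state. A sketch — node names and the `risk` key are illustrative, not the repo's actual identifiers:

```python
def route_after_classifier(state: dict) -> str:
    """Conditional-edge function: pick the next node from the classified risk.

    Only high-risk documents detour through human review; everything else
    goes straight to the Reminder node.
    """
    return "hitl" if state.get("risk") == "high" else "reminder"


# Wired into the graph with something like:
# graph.add_conditional_edges("classifier", route_after_classifier,
#                             {"hitl": "hitl", "reminder": "reminder"})
```

Keeping the routing function pure (state in, node name out) also makes it trivially unit-testable without building the graph at all.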
Why DynamoDB Over Other Options
When I was designing this, I considered three approaches:
- SQS with visibility timeout — messages can be "in flight" for up to 12 hours. Not enough for a human review that might sit overnight. Also, you can't query by document_id easily.
- S3 as state store — works, but you're polling or using S3 notifications to detect resume. Awkward.
- DynamoDB — point lookups by document_id, TTL handles cleanup automatically, on-demand billing means you pay per read/write not per hour, and the Streams feature gives you a path to event-driven resume if you want it later.
The on-demand billing matters more than it sounds. A warranty pipeline doesn't process documents at a steady rate. Some days 500 documents, some days 5. With provisioned capacity you're paying for peak all the time. With on-demand you pay for actual usage.
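Back-of-envelope, with illustrative on-demand rates (assumed here — check your region's pricing page) and an assumed ~10 KB serialised state per review:

```python
# Assumed on-demand rates; actual per-region prices differ.
WRITE_USD_PER_WRU = 1.25 / 1_000_000   # 1 WRU covers 1 KB written
READ_USD_PER_RRU = 0.25 / 1_000_000    # 1 RRU covers 4 KB read

docs_per_day = 500
high_risk_share = 0.10   # assumed: only high-risk docs hit the HITL table
state_kb = 10            # assumed serialised WarrantyState size

writes_per_day = docs_per_day * high_risk_share
wrus_per_write = state_kb                # 10 KB item -> 10 WRUs
rrus_per_read = -(-state_kb // 4)        # ceil(10 / 4) = 3 RRUs

monthly_usd = 30 * writes_per_day * (
    wrus_per_write * WRITE_USD_PER_WRU + rrus_per_read * READ_USD_PER_RRU
)
print(f"~${monthly_usd:.2f}/month")  # → ~$0.02/month
```

Even with generous assumptions, the table cost is rounding error — consistent with the sub-$0.50 figure above.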
At my current volume, the DynamoDB cost for the HITL table is under $0.50/month.
The TTL Trick
This is the part I underestimated when I first built this.
Every review record gets a ttl field set 7 days from creation:
```python
ttl = int(time.time()) + (7 * 86400)
```
DynamoDB's TTL feature automatically deletes expired items — no cron, no cleanup Lambda, no cost. Unactioned reviews just disappear. This matters because:
- Stale review records don't accumulate
- Storage costs stay flat regardless of volume
- You don't need to build a cleanup process
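One gotcha: TTL is disabled on new tables until you opt in, so `ttl` values sit inert until then. Enabling it is a one-time call — a sketch (the import lives inside the function only so the snippet loads without the AWS SDK installed):

```python
def enable_ttl(table_name: str, attribute: str = "ttl") -> None:
    """One-time setup: tell DynamoDB which numeric attribute holds the
    Unix-epoch expiry time. Items only start expiring once this is enabled."""
    import boto3  # local import so the sketch is importable without the SDK

    client = boto3.client("dynamodb")
    client.update_time_to_live(
        TableName=table_name,
        TimeToLiveSpecification={"Enabled": True, "AttributeName": attribute},
    )
```

Infrastructure-as-code tools expose the same setting, so in practice this usually lives in your Terraform/CDK definition rather than a script.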
The one thing to know: TTL deletion isn't instant. DynamoDB typically cleans up within 48 hours of expiry. If you need exact expiry (e.g. the approve link should stop working at exactly 7 days), enforce it in your Lambda:
```python
if item.get("status") != "pending_review":
    raise ValueError("Already actioned")

# Also check TTL manually if you need hard expiry
created = datetime.fromisoformat(item["created_at"])
if (datetime.now(timezone.utc) - created).days > 7:
    raise ValueError("Review expired")
```
The Cost Comparison
Here's what changed between Week 8 (no HITL) and Week 9 (HITL with DynamoDB state):
| | Week 8 | Week 9 |
|---|---|---|
| High-risk doc: Bedrock calls | 2 (classify + reminder gen) | 1 (reminder only, on approve) |
| High-risk doc: Textract | Yes, every run | Once, state stored |
| Redundant re-processing | On every retry | Zero |
| State cleanup | Manual | Automatic via TTL |
| DynamoDB cost | $0 | <$0.50/month |
The Bedrock saving is the real one. Claude Haiku is cheap (~$0.0004/call) but Sonnet fallback is ~$0.006/call. If a high-risk document triggered the Sonnet fallback and you re-ran the pipeline on resume, you'd pay for Sonnet twice. With DynamoDB state, classification runs once and the result is stored.
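To put rough numbers on "adds up quickly" — the per-call prices below are the article's own figures, but the volumes and fallback rate are made-up illustrations:

```python
HAIKU_USD = 0.0004   # per classification call (from above)
SONNET_USD = 0.006   # per fallback call (from above)

docs_per_day = 2000            # assumed volume
high_risk_share = 0.15         # assumed share that pauses for review
sonnet_fallback_rate = 0.30    # assumed share below the 0.7 confidence bar

# Without checkpointing, every resumed high-risk doc re-pays classification:
avg_classify = (sonnet_fallback_rate * SONNET_USD
                + (1 - sonnet_fallback_rate) * HAIKU_USD)
redundant_monthly = 30 * docs_per_day * high_risk_share * avg_classify
print(f"~${redundant_monthly:.0f}/month of pure re-work avoided")  # → ~$19/month
```

And that's only the Bedrock side — the avoided Textract re-runs stack on top.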
At low volume this is pennies. At scale — thousands of documents per day with a meaningful percentage flagged as high-risk — it adds up quickly.
What's Next
Week 10 adds CI/CD to the pipeline — GitHub Actions deploying to Lambda via ECR, with prompt regression tests so a bad Bedrock prompt doesn't silently break classification in production.
The DynamoDB state pattern from this week sets that up nicely: because state is checkpointed, regression tests can inject a known state at any node in the graph and assert the output without running the full pipeline.
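A hedged sketch of what such a test could look like — `reminder_agent` is stubbed here so the example runs standalone; in the real suite you'd import the actual node:

```python
def reminder_agent(state: dict) -> dict:
    """Stand-in for the real Reminder node (the real one drafts the SNS message)."""
    return {"reminder": f"Warranty for {state['document_id']} needs a reminder"}


def test_reminder_runs_from_checkpoint():
    # A checkpointed state: Reader and Classifier outputs baked in as fixtures,
    # so the test never touches Textract or Bedrock.
    checkpoint = {
        "document_id": "doc-123",
        "risk": "high",
        "classification": {"category": "appliance"},
    }
    update = reminder_agent(checkpoint)
    assert "reminder" in update


test_reminder_runs_from_checkpoint()
```

The same fixture-injection trick works at any node boundary, which is what makes prompt regression testing cheap enough to run on every push.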