The screen loads. You pick your character class (TabbyWarrior, SiameseMage, MaineCoonPaladin, SphinxRogue) and the game begins:
Neo-Pawsburg, 2087. The neon flickers above the Neon Scratch Lounge as Madame
Fluffington slides a data chip across the bar. RoombaCore drones have taken
over Chrome Alley. The resistance needs someone with claws.
What do you do?
You type: I slip into Chrome Alley and scout for a weak point in the patrol.
And while you do all of this, the panel on the right side of the screen
lights up: a Step Functions workflow trace stepping through each stage in
real time, Lambda invocation records streaming underneath it, token counts
ticking up, a d20 rolling. Your SphinxRogue, high on AGI and STL, vanishes
into the shadows.
The Dungeon Master is Claude running on Amazon Bedrock. The dice are Lambda functions. The campaign state is DynamoDB. The whole thing is a real distributed system, and you're watching it execute while you play.
I built The Neon Scratch Lounge as a live demo for AWS Summit LA last month. The goal wasn't to ship a game. It was to make five reliability patterns for AI agent systems feel concrete instead of abstract (and also a little bit for fun, because who wants to see another chatbot demo).
But the more I built, the more it became a genuine exercise in the question I care about most (it's my day job): what does it take to go from a working AI prototype to something you'd trust in production?
So let's dig into it.
The prototype everyone builds first
The first version of any AI-powered feature looks something like this:
User → Lambda → Bedrock → User
It works. It's impressive (someone may have even applauded; I don't know, I just heard). You show it to stakeholders, and they want to ship it tomorrow.
Then you try to actually ship it, and:
- A malformed model response kills the whole request with no retry
- Two concurrent users hit a race condition you never thought about
- You can't tell whether a failure was a Bedrock timeout or bad input
- The model confidently invents facts that aren't in your data
- Session two of a conversation has no memory of session one
None of this is because AI is uniquely broken. It's because an AI agent is a distributed system: it makes network calls, holds state, executes side effects, and fails in all the ways distributed systems fail. Treating it like a clever fetch() wrapper is what bites you.
What the production version looks like
Here's the full architecture for the game. Each piece maps to a real reliability problem it solves.
Player action
  → API Gateway
    → dungeon-controller Lambda
        ├── DynamoDB (read/write campaign state)
        ├── EventBridge (fire-and-forget audit event)
        └── Step Functions Express Workflow
              ├── retrieve-lore
              ├── invoke-dungeon-master (Bedrock)
              ├── validate-and-route
              ├── execute-tools (Map state, parallel)
              ├── persist-campaign
              └── format-response
CloudWatch Logs + Metrics + Alarms (across everything)
This is:
- Seven Lambda functions
- Three DynamoDB tables
- One Step Functions state machine
- One API Gateway
- One EventBridge bus for audit events
- One SQS dead letter queue
All are defined in CDK.
Grounding the model: RAG before every call
The Dungeon Master can't be trusted to remember that SphinxRogues have a SandstormVanish ability, or that Chrome Alley is controlled by RoombaCore drones, or that the night market sells CreditChips. If you don't tell it, it will invent plausible-sounding alternatives, confidently and helpfully wrong.
The retrieve-lore Lambda runs before Bedrock is ever called. It scores 16KB of structured lore JSON (locations, enemies, items, classes) against the player's action using keyword overlap, pulls the top chunks, and injects them into the system prompt. The player's current location is always injected regardless of score.
// lambda/workflow/retrieve-lore.ts
const scored = loreChunks.map(chunk => ({
  chunk,
  score: scoreKeywordOverlap(playerAction, chunk.keywords)
}));
const context = scored
  .sort((a, b) => b.score - a.score)
  .slice(0, 5)
  .map(r => r.chunk.content)
  .join('\n\n');
For a larger knowledge base you'd swap this for Bedrock Knowledge Bases with OpenSearch Serverless for semantic vector search. The project supports both with a single flag in cdk.json. For a game world that fits in 16KB, keyword scoring is fast, free, and good enough.
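The `scoreKeywordOverlap` helper isn't shown in the snippet above, so here is a minimal sketch of what a keyword-overlap scorer could look like. Treat it as an assumption, not the project's code: the `tokenize` helper and the rule that a multi-word keyword only counts when every word appears are my own choices for illustration.

```typescript
// Hypothetical sketch of a keyword-overlap scorer; the real helper in the
// project may tokenize or weight terms differently.
function tokenize(text: string): Set<string> {
  // Lowercase, split on anything that isn't a letter or digit, drop empties.
  return new Set(
    text
      .toLowerCase()
      .split(/[^a-z0-9]+/)
      .filter((token) => token.length > 0)
  );
}

function scoreKeywordOverlap(playerAction: string, keywords: string[]): number {
  const actionTokens = tokenize(playerAction);
  let score = 0;
  for (const keyword of keywords) {
    // A multi-word keyword such as "chrome alley" counts only if every word appears.
    const parts = Array.from(tokenize(keyword));
    if (parts.length > 0 && parts.every((part) => actionTokens.has(part))) {
      score += 1;
    }
  }
  return score;
}

const action = "I slip into Chrome Alley and scout for a weak point in the patrol.";
// "chrome alley" and "patrol" match, "night market" does not.
console.log(scoreKeywordOverlap(action, ["chrome alley", "patrol", "night market"])); // 2
```

Scoring like this has no notion of synonyms, which is the trade-off against vector search: "sneak" will never match a chunk tagged "stealth" unless you add the keyword yourself.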
RAG grounds the model in facts. But even with good context, the model can still return malformed JSON. So after every Bedrock call, a validation step checks the output schema. If it fails, Step Functions retries with a typed error. No custom retry scaffolding needed in Lambda code.
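To make "retries with a typed error" concrete, here is a rough sketch of a validator that throws a named error. The `DmOutput` shape and the `MalformedModelOutputError` name are hypothetical, chosen for illustration; the project's real schema may differ. The idea is that a Node.js Lambda surfaces the error's `name` to Step Functions, so a Retry block can match on it in `ErrorEquals`.

```typescript
// Hypothetical output shape and error name; the project's real schema may differ.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface DmOutput {
  narrative: string;
  toolCalls: ToolCall[];
}

class MalformedModelOutputError extends Error {
  constructor(message: string) {
    super(message);
    // Setting the name lets a state machine match this error by name,
    // e.g. "ErrorEquals": ["MalformedModelOutputError"].
    this.name = "MalformedModelOutputError";
  }
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function parseDmOutput(raw: string): DmOutput {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new MalformedModelOutputError("model response is not valid JSON");
  }
  if (!isRecord(parsed)) {
    throw new MalformedModelOutputError("model response is not a JSON object");
  }
  const narrative = parsed.narrative;
  const rawCalls = parsed.toolCalls;
  if (typeof narrative !== "string") {
    throw new MalformedModelOutputError("missing string field: narrative");
  }
  if (!Array.isArray(rawCalls)) {
    throw new MalformedModelOutputError("missing array field: toolCalls");
  }
  const toolCalls = rawCalls.map((call: unknown): ToolCall => {
    if (!isRecord(call)) {
      throw new MalformedModelOutputError("tool call is not an object");
    }
    const toolName = call.name;
    const args = call.args;
    if (typeof toolName !== "string" || !isRecord(args)) {
      throw new MalformedModelOutputError("tool call needs a name and an args object");
    }
    return { name: toolName, args };
  });
  return { narrative, toolCalls };
}
```

Because the error carries a distinct name, the state machine can retry malformed output on a different schedule than, say, a throttling error, without any retry logic in the Lambda itself.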
A brief detour: "serverless" does not always mean "scales to zero"
I want to tell you about the week I added Amazon OpenSearch Serverless to this project.
Sounds budget-friendly, right? It's serverless. I assumed (and you know what they say about assuming) that serverless meant free when idle. It does not.
OpenSearch Serverless always provisions a minimum of 4 OCUs (OpenSearch Compute Units), and at $0.24/hr per OCU, that's a fixed floor whether you have zero requests or five thousand:
4 OCUs × $0.24/hr × 730 hrs = $700.80/month. Every month. No matter what.
This is how the project ended up with a toggle in cdk.json:
"useBedrockKnowledgeBase": false
The bundled JSON keyword scoring isn't as semantically rich as vector search, but it runs in-memory inside Lambda and costs nothing at idle. For a conference demo with a 16KB lore file, it's the right call. I'd benchmark both before deciding for a real production system, but I'd check the pricing page first.
(Side note: if you're using AOSS for dev/test, you can disable redundancy to drop to 2 OCUs instead of 4, which cuts the floor to ~$350/month. Still not free, but worth knowing.)
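If you want to sanity-check that floor yourself, the arithmetic is small enough to script. This sketch uses the $0.24/hr and 730-hour figures quoted above; they are the numbers from this post, not a live price lookup, so confirm them against the current pricing page.

```typescript
// Rates taken from the figures quoted in this post; verify before relying on them.
const OCU_HOURLY_USD = 0.24;
const HOURS_PER_MONTH = 730;

function monthlyFloorUsd(minOcus: number): number {
  return minOcus * OCU_HOURLY_USD * HOURS_PER_MONTH;
}

console.log(monthlyFloorUsd(4).toFixed(2)); // "700.80" with redundancy (4 OCUs)
console.log(monthlyFloorUsd(2).toFixed(2)); // "350.40" without redundancy (2 OCUs)
```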
The day after I made this decision and committed to the keyword approach, AWS announced that OpenSearch Serverless now supports scaling to zero.
The demo was already written. I wasn't going back. But if you're starting fresh today, that changes the math significantly. Always check the pricing page. A quick read now saves a very awkward conversation with finance later.
Retries, idempotency, and the dead letter queue
The execute-tools stage is where game state actually mutates: damage is applied, XP is awarded, inventory changes. These are Lambda invocations, and Lambda invocations fail.
Step Functions handles the retry strategy entirely in the state machine definition:
"Retry": [{
  "ErrorEquals": ["States.ALL"],
  "IntervalSeconds": 2,
  "MaxAttempts": 3,
  "BackoffRate": 2,
  "JitterStrategy": "FULL"
}]
Full jitter is deliberate. Without it, multiple concurrent workflows retrying the same error pile onto Bedrock in sync. Full jitter staggers them.
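To see what that configuration means in numbers, here is a small sketch of exponential backoff with full jitter. Step Functions does this internally, so the function names here are hypothetical and this is only my illustration of the idea: each retry waits a random time between zero and the backoff ceiling for that attempt.

```typescript
// Illustrative only: Step Functions computes retry delays itself.
// attempt is 1-based, so the first retry's ceiling equals intervalSeconds.
function backoffCeilingSeconds(
  attempt: number,
  intervalSeconds: number,
  backoffRate: number
): number {
  return intervalSeconds * Math.pow(backoffRate, attempt - 1);
}

// Full jitter: pick uniformly between 0 and the ceiling for this attempt.
function fullJitterDelaySeconds(
  attempt: number,
  intervalSeconds: number,
  backoffRate: number,
  random: () => number = Math.random
): number {
  return random() * backoffCeilingSeconds(attempt, intervalSeconds, backoffRate);
}

// With IntervalSeconds 2, BackoffRate 2, MaxAttempts 3 the ceilings are 2, 4, 8.
for (let attempt = 1; attempt <= 3; attempt++) {
  const ceiling = backoffCeilingSeconds(attempt, 2, 2);
  const delay = fullJitterDelaySeconds(attempt, 2, 2);
  console.log(`retry ${attempt}: ceiling ${ceiling}s, this run waits ${delay.toFixed(2)}s`);
}
```

Without jitter, a hundred workflows that failed at the same moment would all retry at exactly 2, 4, and 8 seconds. With full jitter they spread across each window instead.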
But retries create a second problem: if apply-damage succeeds and the response gets lost, a retry would apply that damage again. The fix is idempotency: cache the tool result in DynamoDB and return the cached version on retry instead of re-executing:
// lambda/shared/idempotency.ts
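// One key per tool call within a turn.
const key = `${campaignId}:${turnId}:${toolName}:${purpose}`;
// On a retry the result is already stored, so return it without re-executing.
const cached = await dynamo.get({ Key: { pk: key } });
if (cached.Item) return cached.Item.result;
// First execution: run the tool, then store the result with a one-hour TTL.
const result = await executeTool(args);
await dynamo.put({ Item: { pk: key, result, ttl: now + 3600 } });
return result;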
The key is scoped to turnId. The same tool called in two different turns should produce different outcomes (dice are random), so the cache can't be global.
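Here is a tiny way to convince yourself the key scoping behaves as described. It is a sketch with a synchronous in-memory `Map` standing in for DynamoDB, and `runOnce` and `applyDamage` are hypothetical names, so it shows the caching logic only, not the real storage calls.

```typescript
// In-memory, synchronous stand-in for the DynamoDB table, for illustration only.
const cache = new Map<string, number>();
let executions = 0;

function idempotencyKey(
  campaignId: string,
  turnId: string,
  toolName: string,
  purpose: string
): string {
  return `${campaignId}:${turnId}:${toolName}:${purpose}`;
}

function runOnce(key: string, execute: () => number): number {
  const cached = cache.get(key);
  if (cached !== undefined) return cached; // retry: replay the stored result
  const result = execute();
  cache.set(key, result);
  return result;
}

// Hypothetical tool with a side effect we can count.
function applyDamage(): number {
  executions += 1;
  return 7;
}

const turnOneKey = idempotencyKey("campaign-1", "turn-1", "apply-damage", "attack");
const first = runOnce(turnOneKey, applyDamage);
const retried = runOnce(turnOneKey, applyDamage); // same turn: served from cache
const turnTwoKey = idempotencyKey("campaign-1", "turn-2", "apply-damage", "attack");
const nextTurn = runOnce(turnTwoKey, applyDamage); // new turn: executes again

console.log(first === retried, executions); // true 2
```

The retry within turn 1 never reaches `applyDamage`, while turn 2 gets a fresh execution because its key differs.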
Executions that exhaust all retries route to an SQS dead letter queue. The player gets a safe fallback narrative. Nothing crashes. The DLQ depth metric fires a CloudWatch alarm. No data is lost and the failure is isolated to one turn.
The system controls execution, not the model
When the LLM controls its own tool invocation loop (decides which tool, calls it, reads the result, decides what to do next) you get infinite loops, runaway API costs, and failure states that are nearly impossible to debug. Letting the model govern its own execution is like letting your database write its own SQL.
In this system, the model reasons. Step Functions and Lambda execute. Those are different jobs.
After Bedrock returns a narrative and a list of tool calls, the validate-and-route Lambda filters them against an explicit allowlist before anything runs:
// lambda/workflow/validate-and-route.ts
const ALLOWED_TOOLS = new Set([
  'roll-dice', 'apply-damage', 'update-inventory', 'award-xp',
  'update-location', 'apply-effect', 'use-special-ability', 'update-quest-log'
]);
const validated = dmOutput.toolCalls.filter(call => {
  if (!ALLOWED_TOOLS.has(call.name)) {
    logger.warn({ toolName: call.name, reason: 'not in allowlist' });
    return false;
  }
  return true;
});
The model cannot invoke a tool that isn't on the list. Business logic never lives inside a prompt; it lives in typed Lambda functions with unit tests.
Structured observability from day one
Traditional debugging doesn't translate to AI systems. A stack trace doesn't tell you whether the model misread the lore context, whether the tool arguments were semantically wrong, or whether a retry loop doubled a game effect.
Every Lambda in this system emits exactly one log line per invocation: one structured JSON object, queryable in CloudWatch Logs Insights without any parsing overhead:
console.log(JSON.stringify({
  timestamp: new Date().toISOString(),
  requestId,
  campaignId,
  toolName,
  inputTokens: usage.inputTokens,
  outputTokens: usage.outputTokens,
  latencyMs: Date.now() - start,
  retryCount,
  success: true
}));
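The line above covers the happy path. One way to keep the one-line-per-invocation rule on the failure path too is to build the line in a single helper and call it from both branches. This is a sketch, not the project's logger: `buildInvocationLog` and the `errorName` field are my own assumptions.

```typescript
// Hypothetical helper. Field names follow the log line above; errorName is an
// added assumption for the failure path.
interface InvocationContext {
  requestId: string;
  campaignId: string;
  toolName: string;
  retryCount: number;
}

interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

function buildInvocationLog(
  ctx: InvocationContext,
  usage: TokenUsage,
  startMs: number,
  endMs: number,
  error?: Error
): string {
  return JSON.stringify({
    timestamp: new Date(endMs).toISOString(),
    requestId: ctx.requestId,
    campaignId: ctx.campaignId,
    toolName: ctx.toolName,
    inputTokens: usage.inputTokens,
    outputTokens: usage.outputTokens,
    latencyMs: endMs - startMs,
    retryCount: ctx.retryCount,
    success: error === undefined,
    // JSON.stringify drops undefined values, so this key only appears on failure.
    errorName: error ? error.name : undefined,
  });
}

const sampleCtx: InvocationContext = {
  requestId: "req-123",
  campaignId: "campaign-1",
  toolName: "roll-dice",
  retryCount: 1,
};
const startedAt = Date.now();
console.log(
  buildInvocationLog(sampleCtx, { inputTokens: 0, outputTokens: 0 }, startedAt, Date.now(), new Error("dice jammed"))
);
```

Keeping the same shape for success and failure means one Logs Insights query can filter on `success` instead of parsing two formats.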
CloudWatch metric filters extract fields from those log lines into custom metrics in the NeonScratch namespace: dice roll totals, Bedrock token usage, monsters defeated per hour, DLQ depth. No code change needed to add a dashboard widget for a new field you're already logging.
The alarms are defined in CDK, not clicked in the console:
new cloudwatch.Alarm(this, 'ControllerLatency', {
  metric: controllerFn.metricDuration({
    statistic: 'p99',
    period: Duration.minutes(5),
  }),
  threshold: 10_000,
  evaluationPeriods: 2,
});
The game's mechanics panel shows these log lines live for the current campaign, each Lambda invocation appearing in order as the turn executes. For the Summit demo, I added a failure-injection shortcut, Ctrl+Shift+F in the browser, that forces the tool-execution Lambda to fail, so the audience could watch the retry animation in the workflow trace and see the DLQ alarm flip red in real time.
State lives in DynamoDB, not in the prompt
A stateless AI agent is a frustrating one. Without memory of the previous turn, the Dungeon Master doesn't know your HP, what's in your inventory, where you are, which enemies are still alive, or what quests you've accepted.
Campaign state (character stats, inventory, location, quest log, active effects, conversation history) lives in DynamoDB. Every turn reads it before invoking Bedrock and writes it back after tool execution:
// lambda/workflow/persist-campaign.ts
await dynamo.update({
  Key: { campaignId },
  // "character" is a DynamoDB reserved word, so it needs a name placeholder.
  UpdateExpression: `
    SET #character.hp = :hp,
        #character.inventory = :inventory,
        currentLocation = :location,
        conversationHistory = :history
  `,
  ExpressionAttributeNames: { '#character': 'character' },
  ExpressionAttributeValues: {
    ':hp': state.hp,
    ':inventory': state.inventory,
    ':location': state.location,
    ':history': trimmedHistory,
  }
});
Bedrock charges per token, and conversation history grows without bound. After 20 turns, persist-campaign calls Bedrock to summarize the narrative so far, stores the summary, and trims the raw turn history. The next turn assembles: system context + summary + last N turns + current action. Continuity survives the context window.
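The assembly step described above can be sketched in a few lines. Everything here is an assumption for illustration: the `Turn` shape, the function names, the prompt layout, and `KEEP_LAST_TURNS = 5` (the post only says "last N turns"). I've also read "after 20 turns" as "once the history holds 20 or more turns".

```typescript
// Hypothetical shapes and constants; the project's real ones may differ.
interface Turn {
  player: string;
  dm: string;
}

const SUMMARIZE_AFTER_TURNS = 20; // threshold from the text above
const KEEP_LAST_TURNS = 5; // assumption: stands in for "last N turns"

function needsSummary(turnHistory: Turn[]): boolean {
  return turnHistory.length >= SUMMARIZE_AFTER_TURNS;
}

function trimHistory(turnHistory: Turn[], keep: number = KEEP_LAST_TURNS): Turn[] {
  // slice(-0) would return the whole array, so handle keep <= 0 explicitly.
  return keep > 0 ? turnHistory.slice(-keep) : [];
}

function assembleContext(
  systemContext: string,
  summary: string | null,
  recentTurns: Turn[],
  currentAction: string
): string {
  const parts: string[] = [systemContext];
  if (summary) parts.push(`Story so far: ${summary}`);
  for (const turn of recentTurns) {
    parts.push(`Player: ${turn.player}\nDM: ${turn.dm}`);
  }
  parts.push(`Player: ${currentAction}`);
  return parts.join("\n\n");
}

// Build 22 fake turns, trim to the last 5, and assemble the next request.
const turns: Turn[] = [];
for (let i = 1; i <= 22; i++) {
  turns.push({ player: `action ${i}`, dm: `narration ${i}` });
}
const recent = trimHistory(turns);
const assembled = assembleContext(
  "You are the Dungeon Master of Neo-Pawsburg.",
  needsSummary(turns) ? "The rogue infiltrated Chrome Alley." : null,
  recent,
  "I pick the lock."
);
console.log(assembled);
```

The token cost of each request is then bounded by the summary plus a fixed number of turns, instead of growing with the length of the campaign.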
What I'd add before calling it truly production-ready
The game is a demo, not a product, and I want to be honest about what's missing:
- Authentication: there's none (this wasn't a demo about auth, so I wasn't going to pull my hair out over it). Any request to the API can start or mutate a campaign
- Rate limiting: API Gateway has basic throttling configured, but there are no per-user limits (because again, not the point of the demo)
- Request signing: the frontend calls the API directly with no auth header (again, no auth implemented)
- Encryption at rest: DynamoDB tables don't have customer-managed KMS keys
- Multi-region failover: single region only
For a real application, I'd add Cognito or a JWT check at the API Gateway layer first. Everything else is addressable without changing the core architecture.
What's production-worthy:
- Retry strategy
- Idempotency pattern
- Structured logging discipline
- Alarm definitions
- CDK infrastructure
- Separation between model reasoning and system execution
Those patterns transfer directly to any AI agent system, game or otherwise.
The source
The full project is on GitHub at github.com/chaotictoejam/neon-scratch-lounge. Deploy it with cdk deploy --all from the infra/ directory, point VITE_API_GATEWAY_URL at the output URL, and you have a working skeleton.
The lore JSON files in lore/ are the easiest place to adapt it. Swap out Neo-Pawsburg for your own domain, and the DM will narrate your world instead. The five core patterns stay exactly the same.
Find me at linkedin.com/in/jlskiles or youtube.com/@DrJoanneSkiles
