It was 11:00 PM on a Tuesday when the CTO of a friend's startup dropped a screenshot of their monthly cloud bill into the engineering Slack channel.
The AWS infrastructure costs were flat. But their LLM inference API bill looked like a hockey stick pointing straight up.
"Why are we burning thousands of dollars a day on Claude 3 Opus?" she asked.
The lead engineer replied: "Because to make the AI assistant feel 'smart' and remember the user, we have to pass their entire conversation history into the context window for every single message. If they've been using the app for a month, we are passing 80,000 tokens just so the bot remembers their dog's name when they say 'hello'."
They had fallen into the classic Generative AI trap: Treating the LLM's context window as a database.
As a cloud architect, I love it when we can take "boring" cloud primitives and combine them with AI to create something that feels like magic but is actually just brilliant, highly scalable engineering. If you want to make a CTO stop in their tracks, rethink their architecture, and say, "Wait, is this actually possible?" you need to move away from standard chatbots.
Here is an architectural pivot that radically changes how an AI application scales, operates, and spends money: Event-Driven AI Memory using AWS EventBridge.
The Pivot: From Context Windows to a "Neural Memory Bus"
The traditional approach to AI memory is brute force: stuff conversational history into giant, expensive LLM context windows, or build complex Retrieval-Augmented Generation (RAG) pipelines over raw chat logs.
Both approaches are slow, expensive, and prone to losing important details in the noise.
Instead of keeping a running transcript of everything the user has ever said, what if we decoupled "memory" from the "chat interface" entirely? What if we treated user actions as asynchronous events?
The Architecture: Building the "Fact Store"
We can achieve this by combining AWS EventBridge, AWS Lambda, Amazon DynamoDB, and a hyper-fast, cheap LLM like Claude 3 Haiku via Amazon Bedrock.
Here is how the event-driven memory pipeline works:
Step 1: The Event Bus
Route every user action in your app (not just chat messages, but button clicks, page views, and settings changes) through AWS EventBridge as standard JSON events.
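In practice, publishing a user action is one `PutEvents` call. Here is a minimal sketch; the bus name, `Source`, and `DetailType` strings are illustrative choices, not fixed conventions:

```python
import json

def make_event_entry(user_id: str, action: str, payload: dict) -> dict:
    """Build a single PutEvents entry for a user action."""
    return {
        "EventBusName": "user-activity-bus",   # hypothetical bus name
        "Source": "app.frontend",              # hypothetical source
        "DetailType": action,                  # e.g. "chat.message", "page.view"
        "Detail": json.dumps({"userId": user_id, **payload}),
    }

# In the backend, publishing requires AWS credentials:
# import boto3
# events = boto3.client("events")
# events.put_events(
#     Entries=[make_event_entry("u-123", "page.view", {"page": "/pricing"})]
# )
```

An EventBridge rule on this bus then routes matching events to the Lambda extractor in the next step.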
Step 2: The Memory Extractor (Async)
Have a lightweight AWS Lambda function subscribe to these events. When an event fires, the Lambda function passes the event payload to a fast, cheap model like Claude Haiku.
The system prompt is simple: "You are a background observer. Review this user event. Extract any permanent, highly relevant facts about this user. Output as a JSON array. If nothing is relevant, return an empty array."
Step 3: The Fact Store (DynamoDB)
If Haiku detects a fact (e.g., User is building a SaaS, User prefers Python, User operates in the EU), the Lambda function upserts those facts as key-value pairs into an Amazon DynamoDB table keyed by the UserID. This is your "Fact Store": a living, breathing profile of the user.
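The upsert is a single `UpdateItem` call. A minimal sketch, assuming each fact is stored as a top-level attribute on the user's item (the table name and attribute names are hypothetical):

```python
def build_upsert(user_id: str, fact_key: str, fact_value: str) -> dict:
    """Build UpdateItem parameters that upsert one fact onto the user's
    profile item. UpdateItem creates the item if it does not exist yet."""
    return {
        "TableName": "UserFactStore",          # hypothetical table name
        "Key": {"UserID": {"S": user_id}},
        "UpdateExpression": "SET #k = :v",
        "ExpressionAttributeNames": {"#k": fact_key},
        "ExpressionAttributeValues": {":v": {"S": fact_value}},
    }

# In the Lambda (requires AWS credentials):
# import boto3
# dynamodb = boto3.client("dynamodb")
# dynamodb.update_item(**build_upsert("u-123", "preferred_language", "Python"))
```

Because `UpdateItem` only touches the named attribute, later facts never clobber earlier ones, and re-observing the same fact is an idempotent overwrite.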
The "Aha!" Moment: Querying the AI
Now, let's go back to that expensive chat interface.
When the user asks a complex question, you do not query a massive chat history. You don't pass 80,000 tokens of past transcripts.
Instead, your backend does a single-digit-millisecond GetItem lookup against DynamoDB for that user's Fact Profile. You take those concentrated facts and inject them into the system prompt of your heavy-lifting model (like Claude 3.5 Sonnet or Opus).
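The read path can be sketched in a few lines. The flattening helper below is pure Python; the table name and key layout match the hypothetical Fact Store above:

```python
import json

def facts_to_system_prompt(item: dict) -> str:
    """Flatten a DynamoDB item into a compact system-prompt preamble,
    dropping the key attribute and non-string values."""
    facts = {k: v["S"] for k, v in item.items() if k != "UserID" and "S" in v}
    return "Known facts about this user (use them naturally):\n" + json.dumps(facts)

# In the chat backend (requires AWS credentials):
# import boto3
# dynamodb = boto3.client("dynamodb")
# item = dynamodb.get_item(
#     TableName="UserFactStore", Key={"UserID": {"S": "u-123"}}
# ).get("Item", {})
# system_prompt = facts_to_system_prompt(item)
# ...then prepend system_prompt when invoking Claude 3.5 Sonnet via Bedrock.
```

A profile of a few dozen facts serializes to a few hundred tokens, versus tens of thousands for a raw transcript.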
The CTO’s Reaction: Why This Pattern Wins
When you explain this architecture to engineering leaders, the reaction is almost always the same: "Wait, we can use EventBridge as a global 'neural memory bus' for our AI?"
Yes. And here is why this tradeoff makes sense for scaling startups:
1. Massive Cost Reduction
You are swapping synchronous, high-token inference on your most expensive model for asynchronous, low-token inference on your cheapest model. A 1,000-token prompt to Claude Haiku costs fractions of a cent. Querying a DynamoDB table costs practically nothing. Token consumption on your premium model can easily drop by 90% or more.
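To make the tradeoff concrete, here is a back-of-envelope comparison. The per-million-token prices are placeholders for illustration, not quoted pricing; plug in your own rates:

```python
def per_message_cost(input_tokens: int, price_per_mtok: float) -> float:
    """Input-token cost of one call at a given $/1M-token rate."""
    return input_tokens / 1_000_000 * price_per_mtok

# Old pattern: 80,000-token transcript into a premium model, every message.
old = per_message_cost(80_000, 15.00)   # placeholder $15 / 1M input tokens

# New pattern: ~500-token fact profile into the same premium model,
# plus one ~1,000-token Haiku extraction per event.
new = per_message_cost(500, 15.00) + per_message_cost(1_000, 0.25)
```

Even with generous placeholder numbers, the per-message cost drops by two orders of magnitude, because the expensive model never sees the transcript.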
2. Infinite Scale and Speed
DynamoDB delivers single-digit millisecond performance at any scale. Because you are only injecting a condensed JSON object of "Facts" into your final chat prompt, your time-to-first-token (TTFT) drops drastically. The AI responds faster because it has less text to read.
3. Omnichannel Intelligence
Because the memory is tied to EventBridge, not the chat window, the AI learns from the user's actions, not just their words. If a user struggles with a dashboard and triggers three "Error 500" events, the Fact Store updates. When they finally open the support chatbot, the AI already knows they are frustrated and exactly which error they hit.
The Bottom Line
We need to stop treating Large Language Models as databases. They are reasoning engines.
By leveraging standard, highly scalable cloud primitives like AWS EventBridge and DynamoDB, we can offload the burden of memory from the LLM context window into actual infrastructure.
It feels like AI magic to the user, but under the hood? It’s just brilliant, boring, beautiful engineering.
Have you hit the "context window cost wall" in your generative AI applications yet? Let me know in the comments how your team is managing AI memory at scale.