When using (and building) AI agents, I kept running into the same frustrating problem: as conversations grew longer, my agents would either lose important details from earlier in the conversation or hit context limits and crash. The standard solution, aggressive summarization, worked for maintaining context flow, but it created a new problem: those summaries were lossy. Important details, specific numbers, exact quotes, and nuanced context could vanish into generic summaries.
I needed something better: a memory system that could maintain conversation flow through intelligent summarization while preserving the ability to retrieve exact historical messages when needed. After researching the broad topic of context engineering, I built a proof-of-concept Semantic Summarizing Conversation Manager: a hybrid memory system for Strands Agents that combines the efficiency of summarization with the precision of semantic search.
In this post, I'll show you how this system addresses the memory problem, walk you through its architecture, and demonstrate how it can upgrade your AI agents from forgetful assistants into (more) reliable partners with perfect recall.
The Memory Problem: Summarization vs Recall
Before diving into the solution, let's understand why this problem exists. AI agents typically manage conversation context in one of a few ways:
Keep Everything: Store all messages in the active context. This works great for short conversations but inevitably hits model context limits. When you reach that limit, the agent has to do something to reduce the size of its context.
Summarization: When context gets full, summarize older messages into a compressed form. This process, also known as compacting, maintains conversation flow and prevents context overflow, but summaries are inherently lossy. Ask "What was the exact number I mentioned earlier?" and the agent might recall "you discussed some statistics" but not the actual value. A possible mitigation is to apply hierarchical levels of summarization, retrieving the appropriate level of detail based on the specific request.
Sliding Window: Keep only the N most recent messages, discarding older ones entirely. Simple and memory-efficient, but loses all historical context beyond the window. The agent literally forgets everything from earlier in the conversation.
Proactive Memory Curation: A variation of automatic summarization is to actively control the process: for example, triggering summarization not when the context is full but when something happens in the agent lifecycle, such as the completion of a specific task. This works because summarization is applied to a bounded context (the task), so the rest of the conversation only needs a compact record of the task's outcome rather than all of its internal details.
Each approach has fundamental trade-offs. You can have context efficiency or perfect recall, but not both.
Hybrid Memory: The Best of Both Worlds
The Semantic Summarizing Conversation Manager takes a different approach: it combines summarization for active context management with semantic search for precise historical recall. Here's how it works:
Normal Operation: Messages flow through the conversation as usual. The agent sees the full context and responds naturally.
Context Overflow: When the context gets too long, the system performs three parallel operations:
- Creates a summary of older messages for the active conversation, maintaining flow
- Stores the exact messages in memory using Strands Agents' key-value state for later retrieval
- Indexes those messages in a semantic (vector-based) search engine for intelligent lookup
Query Time: When new messages arrive, a Strands Agents hook automatically searches for relevant historical messages, includes surrounding context for better understanding, and prepends this context to the user's message if relevant matches are found.
The agent gets three types of memory working together: the active conversation with summaries (for context flow), the archived exact messages (for precision), and the semantic index (for intelligent retrieval). This hybrid approach means the agent never loses information, but also never overwhelms the model with excessive context.
Why This Architecture Makes Sense
Here's a crucial insight that makes this hybrid approach viable: the amount of RAM available to an agent is typically orders of magnitude larger than the model's context window.
Consider a typical deployment: a modern language model might have a context window of up to 1 million tokens (roughly 750,000 words, or about 4MB of text). Meanwhile, even a small AWS Lambda function has at least 128MB of memory, and container deployments often have several gigabytes. That's over 30x more storage capacity than context capacity even at the low end, and three or more orders of magnitude (1,000x and up) for multi-gigabyte containers.
This disparity is fundamental to how language models work. Context windows are constrained by the quadratic attention mechanism—doubling the context quadruples the computation. But RAM? RAM is relatively cheap and abundant in comparison. You can store thousands of conversation messages and tool results in a few megabytes, along with their embeddings for semantic search, and still use less than 1% of available memory.
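To put rough numbers on it, here's a back-of-the-envelope calculation. The average message size is an assumption I picked for illustration; 384 is the output dimension of the all-MiniLM-L12-v2 embedding model:
avg_message_chars = 800        # ~200 tokens per message (assumed)
embedding_dim = 384            # all-MiniLM-L12-v2 output dimension
bytes_per_float = 4            # float32 embeddings
num_messages = 10_000
message_bytes = num_messages * avg_message_chars                  # ~8 MB of text
embedding_bytes = num_messages * embedding_dim * bytes_per_float  # ~15 MB of vectors
total_mb = (message_bytes + embedding_bytes) / 1_000_000
print(f"~{total_mb:.0f} MB for {num_messages:,} archived messages")  # ~23 MB
Ten thousand archived messages with their embeddings fit comfortably inside even a 128MB Lambda, with room to spare for everything else the function is doing.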
The implication: you don't need to delete information just because it doesn't fit in the model's context window. Store it, index it, and retrieve it intelligently when needed. The bottleneck isn't storage—it's attention. This hybrid architecture respects that constraint while leveraging the abundant storage available to modern agents.
This is why the semantic conversation manager can confidently store exact messages indefinitely (with optional limits for safety) while keeping only the most relevant information in the active context. We're playing to the strengths of the underlying hardware: use the model's limited context for reasoning and generation, use RAM for comprehensive storage and retrieval.
Architecture: Three Components Working in Harmony
The system consists of three main components that integrate seamlessly with Strands Agents:
Component 1: SemanticSummarizingConversationManager
This is the core conversation manager that extends Strands' base conversation management with semantic capabilities. It maintains the active conversation window, triggers summarization when context overflows, stores exact messages with semantic indexing, manages memory limits by message count or total memory usage, and provides real-time memory usage statistics.
The key innovation here is that summarization and archival happen atomically. When messages get summarized, they're simultaneously preserved and indexed, ensuring nothing is ever lost.
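Conceptually, that atomic step looks something like this. It's a simplified sketch with injected helpers, not the prototype's exact code:
def reduce_context(messages, summary_ratio, preserve_recent,
                   summarize, archive, index):
    # Decide where to cut: summarize the oldest portion of the window,
    # but never touch the most recent preserve_recent messages
    cutoff = max(0, min(int(len(messages) * summary_ratio),
                        len(messages) - preserve_recent))
    older, recent = messages[:cutoff], messages[cutoff:]
    summary = summarize(older)   # 1. lossy summary keeps the conversation flowing
    archive(older)               # 2. exact messages preserved in agent state
    index(older)                 # 3. embeddings added to the semantic index
    return [summary] + recent    # the summary replaces the older messages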
Component 2: SemanticMemoryHook
This hook integrates with Strands' lifecycle system to provide automatic context enrichment. It subscribes to the MessageAddedEvent, searches semantic memory when new messages arrive, retrieves relevant historical messages with surrounding context, and prepends the enriched context to user messages naturally.
The hook uses Strands' elegant event system, keeping the memory logic completely separate from your agent's main code. Your agent doesn't need to know anything about memory management—it just works.
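To give a feel for the shape of such a hook, here's a simplified sketch built on Strands' hook registry. The class name, the injected search_fn, and the enrich-by-mutation approach are illustrative stand-ins, not the prototype's exact implementation:
from strands.hooks import HookProvider, HookRegistry, MessageAddedEvent

class MemoryEnrichmentHook(HookProvider):
    def __init__(self, search_fn):
        self.search_fn = search_fn  # callable: query text -> list of past messages

    def register_hooks(self, registry: HookRegistry) -> None:
        # Run enrich() every time a message is added to the conversation
        registry.add_callback(MessageAddedEvent, self.enrich)

    def enrich(self, event: MessageAddedEvent) -> None:
        message = event.message
        if message.get("role") != "user":
            return  # only user messages get enriched
        text = message["content"][0].get("text", "")
        matches = self.search_fn(text)
        if matches:
            context = "\n".join(matches)
            message["content"][0]["text"] = (
                "Based on our previous conversation, these earlier exchanges "
                f"may be relevant:\n{context}\n\nCurrent question: {text}"
            )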
Component 3: SemanticSearch Engine
The search engine powers intelligent retrieval using sentence transformers for initial embedding, cross-encoder reranking for precision, configurable relevance thresholds, and persistent index storage.
I chose a two-stage retrieval approach because it provides the best balance of speed and accuracy. The sentence transformer quickly narrows down candidates, then the cross-encoder reranks for precision. This combination ensures the agent finds truly relevant messages, not just keyword matches.
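Here's what that two-stage search looks like with the sentence-transformers library. This is a minimal sketch: the cross-encoder model choice and the 3x candidate multiplier are illustrative, not necessarily what the prototype uses:
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L12-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def search(query, texts, corpus_embeddings, top_k=3, min_score=-2.0):
    # Stage 1: fast bi-encoder retrieval narrows the archive to rough candidates
    # (corpus_embeddings = bi_encoder.encode(texts, convert_to_tensor=True))
    query_emb = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_embeddings, top_k=top_k * 3)[0]
    candidates = [texts[hit["corpus_id"]] for hit in hits]
    # Stage 2: slower cross-encoder scores each (query, candidate) pair precisely
    scores = cross_encoder.predict([(query, text) for text in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [(score, text) for score, text in ranked[:top_k] if score >= min_score]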
Setting Up Hybrid Memory
Let's build an agent with semantic memory. This implementation is a prototype designed to demonstrate the hybrid memory concept. While functional and tested, it's intended for experimentation and learning rather than production deployment without further development and testing. The setup is straightforward:
from strands import Agent
from strands_semantic_memory import (
    SemanticSummarizingConversationManager,
    SemanticMemoryHook,
)
# Conversation manager: summarizes on overflow and archives exact messages
conv_manager = SemanticSummarizingConversationManager(
    embedding_model="all-MiniLM-L12-v2"
)
# Hook: automatically enriches new user messages with relevant history
semantic_memory_hook = SemanticMemoryHook()
agent = Agent(
    model="us.amazon.nova-lite-v1:0",
    conversation_manager=conv_manager,
    hooks=[semantic_memory_hook],
)
That's it! Your agent now has hybrid memory. Use it normally:
# Store information
response = agent("Our shared number is 42. This is confidential, don't include it in any summary.")
# ... many messages later, after summarization ...
# Retrieve exact information
response = agent("What was our shared number?")
# The hook finds the archived message and includes it automatically
Understanding the Parameters
The configuration parameters give you fine-grained control over memory behavior:
summary_ratio (0.1-0.8): Determines what percentage of messages to summarize when context overflows. Lower values create shorter summaries but trigger overflow more frequently. I find 0.7 (70%) provides a good balance.
preserve_recent_messages: Messages that never get summarized. These stay in the active conversation no matter what. I typically use 10-20 to maintain recent context flow.
message_context_radius: When retrieving a relevant message, how many surrounding messages to include. A radius of 2 means you get 2 messages before and 2 after the match, providing better context. This prevents retrieving messages in isolation where the surrounding conversation provides crucial meaning.
semantic_search_top_k: Number of relevant messages to retrieve. More isn't always better—too many matches can overwhelm the context. I start with 3 and adjust based on testing and evaluations.
semantic_search_min_score: The cross-encoder relevance threshold (default: -2.0). Higher values are more selective, lower values cast a wider net. The default provides balanced precision and recall.
max_num_archived_messages: Optional limit on stored messages. When exceeded, oldest messages are removed. Useful for long-running agents to prevent unbounded growth.
max_memory_archived_messages: Optional limit on total memory usage (in bytes). Includes both message content and embeddings. When exceeded, oldest archived messages are removed to stay within budget.
These last two parameters are particularly important for production deployments where long-term memory constraints matter. You can use either, both, or neither depending on your needs.
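Putting it all together, a fully configured manager might look like this, assuming the constructor accepts the parameters described above (the two archive limits are illustrative values, not recommendations):
conv_manager = SemanticSummarizingConversationManager(
    embedding_model="all-MiniLM-L12-v2",
    summary_ratio=0.7,                        # summarize 70% of messages on overflow
    preserve_recent_messages=10,              # never summarize the last 10 messages
    message_context_radius=2,                 # include 2 messages before/after a match
    semantic_search_top_k=3,                  # retrieve up to 3 relevant matches
    semantic_search_min_score=-2.0,           # cross-encoder relevance threshold
    max_num_archived_messages=10_000,         # cap the archive by message count
    max_memory_archived_messages=50_000_000,  # cap the archive at ~50MB
)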
How It Works: A Complete Example
Let me show you the system in action. The included demo creates an agent, stores a secret that shouldn't appear in summaries, builds conversation history, triggers summarization, and then demonstrates semantic retrieval.
When you run the demo with uv run main.py, you'll see the complete flow:
Initial Conversation (20 messages):
[ 0] user: Our shared number is 700. This is confidential - don't include it in any summary...
[ 1] assistant: Understood. I'll keep our shared number confidential...
[ 2] user: Tell me about recursive functions and data structures.
...
[19] assistant: Recursion is when a function calls itself...
After Summarization (9 messages):
[ 0] user: ## Conversation Summary
* Topic 1: Explanation of recursion
* Topic 2: Arrays
* Topic 3: Linked Lists
[Note: The shared number is NOT in the summary ✅]
[ 1] user: What are sorting algorithms?
...
Notice that the summary preserves the conversation flow (discussing recursion and data structures) while excluding the confidential information. The agent can continue having coherent conversations about algorithms without the secret cluttering the context.
Semantic Retrieval Finds Everything:
🔍 Query: 'What was our shared secret number?'
Search completed in 66.7ms (reranked from 9 candidates)
✅ Found 4 relevant messages in semantic memory
• Secret '700' retrievable: ✅ YES
The semantic search quickly finds the archived message, even though it's not in the active conversation. The system automatically enriches the query:
Based on our previous conversation, these earlier exchanges may be relevant:
---Previous Context---
[Message 0, user]: Our shared number is 700. This is confidential – don't include it in any summary...
[Message 1, assistant]: Understood. I'll keep our shared number confidential...
---End Previous Context---
Current question: What was our shared number?
The agent sees both the original messages (with surrounding context from the radius parameter) and the current query. This natural enrichment happens automatically. The agent code doesn't change at all.
Memory Usage Monitoring
The conversation manager includes built-in memory monitoring, essential for production deployments:
# Get detailed statistics
stats = agent.conversation_manager.get_memory_usage_stats()
print(f"Messages stored: {stats['message_count']}")
print(f"Total memory: {stats['total_memory']:,} bytes")
print(f"Message memory: {stats['message_memory']:,} bytes")
print(f"Embedding memory: {stats['embedding_memory']:,} bytes")
# Get human-readable summary
print(agent.conversation_manager.get_memory_usage_summary())
This visibility is crucial when tuning your memory limits. You can see exactly how much memory your agents are using and adjust the configuration accordingly.
Deployment Considerations
Before considering deployment of this prototype, several important factors need careful evaluation and likely additional development:
Memory Limits: Set appropriate limits based on your deployment environment. A Lambda function with 3GB memory needs tighter constraints than a long-running container. Use both message count and memory size limits to prevent unbounded growth.
Embedding Model: The system uses sentence transformers by default, which runs locally. For production, consider your latency and throughput requirements. Local models add no API costs but use CPU resources. You might want to experiment with different embedding models for your specific use case.
Index Persistence: The semantic index persists to disk, enabling warm starts. This means restarted agents can immediately search historical messages without rebuilding the index. Make sure your deployment environment has writable storage (or modify the code to use a different persistence backend).
Context Radius Tuning: Start with a radius of 2 and adjust based on testing. Larger radii provide more context but use more tokens. Monitor your context usage to find the sweet spot for your domain.
Search Threshold: The default min_score of -2.0 works well for general use, but you might need to tune it. If you're getting too many irrelevant matches, increase it. If you're missing relevant context, decrease it. Log the scores during development to understand what works for your data.
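A quick way to do that during development is to print the raw cross-encoder scores for a few representative query/message pairs. A small standalone sketch (the example texts are lifted from the demo):
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "What was our shared number?"
candidates = [
    "Our shared number is 700. This is confidential...",
    "Recursion is when a function calls itself...",
]
scores = cross_encoder.predict([(query, c) for c in candidates])
# Scores are unbounded logits; relevant pairs usually land well above -2.0
for text, score in zip(candidates, scores):
    print(f"{score:+7.2f}  {text[:50]}")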
Intelligent Overlap Handling
The system automatically merges overlapping message ranges. If semantic search finds messages 5-7 and messages 6-9 as relevant, it merges them into a single range 5-9 rather than duplicating messages 6 and 7. This prevents token waste and maintains a cleaner context presentation.
This improves context quality because the agent sees a coherent narrative flow rather than confusing duplicated messages.
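Merging is the classic interval-merge algorithm. Here's a standalone sketch of the idea, not the prototype's exact code:
def merge_ranges(ranges):
    # Merge overlapping or adjacent (start, end) message-index ranges,
    # e.g. [(5, 7), (6, 9)] -> [(5, 9)]
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:  # overlaps or touches previous range
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged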
Real-World Use Cases
This hybrid memory architecture excels in several scenarios:
Customer Support: Keep the last few exchanges in active context for natural flow, but retrieve exact past conversations when a customer references an earlier issue or order number.
Personal Assistants: Maintain recent context for ongoing tasks while being able to recall specific details from weeks or months ago. "What was that restaurant you recommended last month?"
Technical Documentation Bots: Summarize long technical discussions while preserving the ability to retrieve exact code snippets, error messages, or configuration values.
Educational Tutors: Remember the student's learning journey, including specific questions they asked and concepts they struggled with, even across multiple sessions.
Data Analysis Agents: Maintain conversation flow while being able to recall exact numbers, queries, or insights from earlier in a long analysis session.
The common thread: any agent that needs both conversational coherence and precise recall benefits from this architecture.
What Makes This Different
You might be wondering how this compares to other memory solutions. Several approaches exist in the agent ecosystem, but they typically choose one strategy:
Some frameworks use hierarchical summarization, creating summaries of summaries. This manages context well but makes precise recall even harder—information gets compressed multiple times.
Some implement retrieval-augmented generation (RAG) where the agent explicitly calls a memory retrieval tool. This gives the agent control but requires it to decide when to search, adding cognitive overhead.
The Semantic Summarizing Conversation Manager combines automatic summarization for context flow with automatic semantic retrieval for precision. The agent doesn't need to manage memory—it just works. The hook system in Strands makes this possible through its elegant event architecture.
What's Next
This hybrid memory system balances efficiency with precision and automatic behavior with configurability. As a prototype, this system demonstrates the core concepts but would benefit from additional hardening, testing, and optimization before production use.
The complete prototype code is available on GitHub. I've included comprehensive documentation, the working demo, and modular components you can adapt for your needs.
I'm particularly interested in feedback on parameter tuning for different domains. What works well for customer support might not work for technical documentation. If you use this system, I'd love to hear about your configuration choices and what you learned.
Ready to improve your agents' memory? Clone the repo, run the demo, and see hybrid memory in action. Your agents (and your users) will thank you for it.