DEV Community: Jerry Gathu

Building Memory-Enabled AI Agents with LangMem

Jerry Gathu — Thu, 13 Nov 2025 03:58:06 +0000

Introduction

Modern AI agents need more than the ability to respond to queries: they need memory. They need to remember past interactions, learn from examples, and store important information for future use. This is where LangMem comes in.

LangMem is a powerful memory management system that integrates seamlessly with LangGraph agents, enabling them to store, search, and retrieve information across conversations. In this article, we'll explore how to build a customer support agent that uses LangMem to provide personalized, context-aware assistance.

What is LangMem?

LangMem provides two key capabilities for AI agents:

Persistent Storage: Store information that persists across multiple conversations
Semantic Search: Find relevant information using natural language queries powered by embeddings

Think of LangMem as giving your AI agent a notebook where it can write down important information and quickly find what it needs later.

Setting Up Your Environment

First, install the required packages:

pip install langchain langchain-openai langgraph langmem python-dotenv

Next, set up your environment with the necessary API keys:

import os
from dotenv import load_dotenv

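load_dotenv()  # Loads OPENAI_API_KEY from a local .env file, if present
# Fall back to a placeholder only when the key is not already set
os.environ.setdefault('OPENAI_API_KEY', 'your-api-key-here')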

Creating a Memory Store

The memories themselves live in a store. This example uses LangGraph's InMemoryStore, which keeps data in process memory, so its contents are lost when the program exits:

from langgraph.store.memory import InMemoryStore

store = InMemoryStore(
    index={"embed": "openai:text-embedding-3-small"}
)

The index parameter enables semantic search by creating embeddings of stored content. This allows your agent to find relevant memories even when queries don't match exactly.

Understanding Namespaces

LangMem uses namespaces to organize memories, similar to folders in a file system. A namespace is a tuple that creates a hierarchical structure:

# User-specific namespace
namespace = ("lance",)

# More specific namespace for examples
examples_namespace = (
    "support_assistant",
    "lance",
    "examples"
)

This organization lets you:

Separate memories by user
Categorize different types of information
Control access to specific memory sets
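To make the folder analogy concrete, here is a minimal sketch of prefix matching over tuple namespaces. It uses a plain dictionary rather than LangGraph's store, and TinyStore and its methods are hypothetical names chosen for illustration only.

```python
from typing import Any


class TinyStore:
    """Toy stand-in for a namespaced key-value store (not LangGraph's implementation)."""

    def __init__(self) -> None:
        self._items: dict[tuple[tuple[str, ...], str], Any] = {}

    def put(self, namespace: tuple[str, ...], key: str, value: Any) -> None:
        self._items[(namespace, key)] = value

    def get(self, namespace: tuple[str, ...], key: str) -> Any:
        # Returns None when nothing is stored under this namespace and key
        return self._items.get((namespace, key))

    def list_under(self, prefix: tuple[str, ...]) -> list[str]:
        """Return the keys whose namespace starts with the given prefix, sorted."""
        return sorted(
            key
            for (namespace, key) in self._items
            if namespace[: len(prefix)] == prefix
        )


tiny = TinyStore()
tiny.put(("support_assistant", "lance", "examples"), "ticket_001", {"label": "tech"})
tiny.put(("support_assistant", "maria", "examples"), "ticket_002", {"label": "sales"})
tiny.put(("lance",), "triage_tech", {"prompt": "Login issues"})

lance_keys = tiny.list_under(("support_assistant", "lance"))
all_support_keys = tiny.list_under(("support_assistant",))
```

Because the user id is part of the tuple, one user's examples never show up in a lookup scoped to another user's prefix.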

Storing Information in Memory

Basic Storage Operations

LangMem provides simple methods to store and retrieve data:

# Store a value
store.put(
    namespace=("lance",),
    key="triage_tech",
    value={"prompt": "Handle login issues, API problems, system errors"}
)

# Retrieve a value
result = store.get(namespace=("lance",), key="triage_tech")
if result is not None:
    print(result.value['prompt'])

Practical Example: Storing Configuration

Here's how our customer support agent stores triage rules:

def store_triage_rules(store, user_id):
    namespace = (user_id,)

    # Store tech support rules
    store.put(
        namespace,
        "triage_tech",
        {"prompt": "Login issues, API problems, system errors"}
    )

    # Store sales support rules
    store.put(
        namespace,
        "triage_sales",
        {"prompt": "Pricing questions, demos, upgrade requests"}
    )

    # Store finance support rules
    store.put(
        namespace,
        "triage_finance",
        {"prompt": "Payment issues, refunds, billing disputes"}
    )

Semantic Search with LangMem

One of LangMem's most powerful features is semantic search—finding relevant memories based on meaning, not just exact matches:

# Search for relevant examples
examples = store.search(
    namespace=("support_assistant", "lance", "examples"),
    query="customer asking about payment problems"
)

# Process search results
for example in examples:
    print(f"Subject: {example.value['subject']}")
    print(f"Content: {example.value['content']}")
    print(f"Label: {example.value['label']}")

The search returns items ranked by semantic similarity to your query, even if they don't contain the exact words.
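As a rough illustration of what "ranked by semantic similarity" means, the sketch below ranks a few memories by cosine similarity. The three-dimensional vectors are hand-written assumptions standing in for real embeddings, and nothing here calls LangMem or OpenAI.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hand-written toy "embeddings" with axes (billing, login, pricing).
# Real embeddings come from a model and have many more dimensions.
memories = {
    "Refund was charged twice": [0.9, 0.1, 0.2],
    "Cannot log in after password reset": [0.1, 0.9, 0.0],
    "Wants a quote for the enterprise plan": [0.2, 0.0, 0.9],
}

# Stands in for the embedding of "customer asking about payment problems"
query_vector = [0.8, 0.0, 0.3]

ranked = sorted(
    memories,
    key=lambda text: cosine(query_vector, memories[text]),
    reverse=True,
)
```

The refund memory ranks first even though it shares no words with the query text, because its vector points in nearly the same direction as the query vector.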

Creating Memory Tools for Agents

LangMem provides pre-built tools that agents can use to manage their own memory:

from langmem import create_manage_memory_tool, create_search_memory_tool

# Tool for storing memories
manage_memory_tool = create_manage_memory_tool(
    namespace=(
        "support_assistant",
        "{langgraph_user_id}",
        "collection"
    )
)

# Tool for searching memories
search_memory_tool = create_search_memory_tool(
    namespace=(
        "support_assistant",
        "{langgraph_user_id}",
        "collection"
    )
)

These tools allow your agent to:

Decide what information is important to remember
Store it for future reference
Search for relevant past information when needed

Building a Memory-Enabled Customer Support Agent

Let's put it all together with a complete example. Our agent will:

Remember triage rules for different support categories
Search for similar past tickets (few-shot examples)
Store information about customer interactions

Step 1: Define the Agent's System Prompt

agent_system_prompt = """
You are ABC Company's customer support assistant.

You have access to the following tools:

1. send_to_tech_support() - Route technical issues
2. send_to_sales_support() - Route sales inquiries  
3. send_to_finance_support() - Route billing issues
4. manage_memory - Store relevant information for future reference
5. search_memory - Search for relevant past information

Use manage_memory to store important details about:
- Customer issues and resolutions
- Common problems and solutions
- Customer preferences and history

Use search_memory to find:
- Similar past tickets
- Previous interactions with this customer
- Relevant solutions or patterns

{instructions}
"""

Step 2: Create the Prompt Function

This function retrieves stored instructions from memory:

def create_prompt(state, config, store):
    user_id = config['configurable']['langgraph_user_id']
    namespace = (user_id,)

    # Try to get custom instructions from memory
    result = store.get(namespace, "agent_instructions")

    if result is None:
        # Use default instructions if none stored
        instructions = "Use these tools appropriately"
        store.put(namespace, "agent_instructions", {"prompt": instructions})
    else:
        instructions = result.value['prompt']

    return [
        {"role": "system", "content": agent_system_prompt.format(instructions=instructions)},
        *state['messages']
    ]

Step 3: Build the Agent with Memory Tools

from langgraph.prebuilt import create_react_agent

tools = [
    send_to_tech_support,
    send_to_sales_support,
    send_to_finance_support,
    manage_memory_tool,  # Agent can store memories
    search_memory_tool   # Agent can search memories
]

agent = create_react_agent(
    model="openai:gpt-4o",
    tools=tools,
    prompt=create_prompt,
    store=store  # Pass the store to enable memory
)

Step 4: Using the Agent

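customer_input = {
    "subject": "Payment refund",
    "message": "I am unsatisfied with the service and wish to receive a full refund"
}

config = {"configurable": {"langgraph_user_id": "lance"}}

# The prebuilt agent reads its input from the "messages" key
response = agent.invoke(
    {"messages": [{
        "role": "user",
        "content": f"Subject: {customer_input['subject']}\n{customer_input['message']}"
    }]},
    config=config
)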

Real-World Use Cases

1. Few-Shot Learning from Past Examples

Store successful ticket resolutions and retrieve similar ones:

# Store a successful resolution
store.put(
    namespace=("support_assistant", "lance", "examples"),
    key="ticket_001",
    value={
        "subject": "Cannot log in",
        "content": "User forgot password",
        "label": "tech_support",
        "resolution": "Password reset link sent"
    }
)

# Later, search for similar tickets
similar = store.search(
    namespace=("support_assistant", "lance", "examples"),
    query="user can't access account"
)

2. Personalized User Preferences

Remember how each user prefers to be helped:

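# Agent stores a memory. The model normally issues these calls itself during a run;
# a direct call needs the user config so the namespace template can be resolved,
# and it only works where the tool can reach the store.
manage_memory_tool.invoke(
    {"content": "Customer prefers detailed technical explanations"},
    config=config
)

# Agent searches before responding
preferences = search_memory_tool.invoke(
    {"query": "how does this customer like to receive help"},
    config=config
)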

3. Dynamic Rule Updates

Update triage rules based on new policies:

# Update tech support criteria
store.put(
    namespace=("lance",),
    key="triage_tech",
    value={"prompt": "Now also includes mobile app issues"}
)

Best Practices

1. Use Hierarchical Namespaces

# Good: Organized and specific
("company", "user_id", "ticket_examples")

# Less ideal: Flat structure
("examples",)

2. Check for Existing Values

result = store.get(namespace, key)
if result is None:
    # Initialize with default
    store.put(namespace, key, default_value)
else:
    # Use existing value
    value = result.value

3. Use Descriptive Keys

# Good
store.put(namespace, "triage_rules_tech", data)

# Less clear
store.put(namespace, "tr_t", data)

4. Leverage Semantic Search

# Let the agent describe what it's looking for naturally
query = "customer who had billing problems last month"
results = store.search(namespace, query=query)

Conclusion

LangMem transforms stateless AI agents into intelligent assistants with memory. By combining persistent storage with semantic search, your agents can:

Learn from past interactions
Provide personalized experiences
Improve over time with few-shot learning
Maintain context across conversations

The key is to think about what information your agent needs to remember and organize it effectively using namespaces. With LangMem, you're not just building a chatbot; you're building an assistant that gets smarter with every interaction.

Advanced Retrieval-Augmented Generation (RAG) Techniques

Jerry Gathu — Thu, 25 Sep 2025 10:03:54 +0000

Retrieval-Augmented Generation (RAG) has become a cornerstone for building powerful AI applications that combine language models with real-world knowledge. While the basic, or “naive,” RAG approach works well, it does have limitations. That’s where advanced RAG techniques come in, improving performance, accuracy, and usability at every step, from indexing to retrieval to generation.

Let’s break down what makes advanced RAG special, what problems it solves, and some key techniques you can implement.

Why Go Beyond Naive RAG?
At its core, naive RAG splits documents into chunks, embeds those chunks, and then retrieves the closest chunks for a query. While simple and effective for small datasets, this method struggles when:

The number of documents or chunks grows large, causing latency and performance bottlenecks.
Documents are large and complex, making relevant chunk retrieval tough.
All chunks are treated equally, ignoring inherent document structure or hierarchy.

Advanced RAG tweaks and optimizes various pipeline steps, from preprocessing to retrieval to generation, to address these issues.

Pre-Retrieval Data Structuring

To speed up and improve search, pre-retrieval steps focus on better indexing and querying:

Metadata tagging: Attach concise, meaningful metadata to chunks (e.g., author, date, document type) to enable fine-grained filtering and boost relevance.

Hierarchical Indexing: Instead of treating chunks flatly, we leverage document structures. Start by embedding summaries at high-level segments like chapters, then drill down to sections and paragraphs. Queries first match summaries, then descend into detailed chunks.

Summarization & Map-Reduce: For very large or dense documents, you can generate summaries for chunks and combine them stepwise. This reduces noise and helps overcome token limits in embedding and generation.

By respecting document hierarchies, advanced RAG better captures context and improves precision, though at the cost of extra indexing work and added latency.
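The two-stage lookup behind hierarchical indexing can be sketched as follows. This is a minimal illustration under stated assumptions: the two-dimensional vectors and sample texts are invented toy data, and hierarchical_search is a hypothetical helper, not part of any library.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy index: each section has a summary vector and a list of (text, vector) chunks.
index = [
    {
        "summary": "Billing and refunds",
        "summary_vec": [0.9, 0.1],
        "chunks": [
            ("Refunds are issued within 5 days", [0.95, 0.05]),
            ("Invoices are emailed monthly", [0.7, 0.3]),
        ],
    },
    {
        "summary": "Account security",
        "summary_vec": [0.1, 0.9],
        "chunks": [
            ("Reset your password from the login page", [0.05, 0.95]),
            ("Enable two-factor authentication", [0.2, 0.8]),
        ],
    },
]


def hierarchical_search(query_vec, index, top_sections=1, top_chunks=1):
    """Match section summaries first, then rank only the chunks inside the best sections."""
    sections = sorted(
        index, key=lambda s: cosine(query_vec, s["summary_vec"]), reverse=True
    )[:top_sections]
    candidates = [chunk for section in sections for chunk in section["chunks"]]
    candidates.sort(key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in candidates[:top_chunks]]


hits = hierarchical_search([0.1, 0.9], index)
```

Only the chunks of the winning section are scored, which is where the savings come from on large corpora; the trade-off is that a poor summary can hide a relevant chunk.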

Similarity Search with Hypothetical Questions and HyDE

Two cutting-edge tricks further improve retrieval:

Hypothetical Questions: Generate artificial questions for each chunk that capture likely user queries. These get embedded instead of the chunk itself, improving alignment between queries and document chunks.

Hypothetical Document Embeddings (HyDE): For a given user question, generate several hypothetical answers and embed those. Then find chunks semantically close to these generated answers, boosting recall especially in niche domains.

HyDE, in particular, can compensate for domain mismatches where embedding models might not generalize accurately.
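A minimal sketch of the HyDE flow, assuming two pluggable callables: embed (text to vector) and generate_answers (question to hypothetical answers). Both are stubbed here with toy stand-ins (toy_embed counts a few keywords, fake_llm returns canned answers); in a real system they would be an embedding model and an LLM call.

```python
import math
from typing import Callable

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def average(vectors: list[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def hyde_search(
    question: str,
    chunks: list[str],
    embed: Callable[[str], Vector],
    generate_answers: Callable[[str], list[str]],
    top_k: int = 1,
) -> list[str]:
    """Embed hypothetical answers instead of the question, then rank chunks against them."""
    hypothetical = generate_answers(question)
    query_vec = average([embed(answer) for answer in hypothetical])
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]


VOCAB = ["refund", "invoice", "password", "login"]


def toy_embed(text: str) -> Vector:
    # Toy embedding: counts of a few fixed keywords (assumption for illustration)
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def fake_llm(question: str) -> list[str]:
    # Canned answers standing in for an LLM call
    return [
        "you can request a refund from the invoice page",
        "a refund is sent to the original card",
    ]


chunks = [
    "refund requests are handled from the invoice page",
    "reset your password on the login screen",
]
question = "How do I get my money back?"
hits = hyde_search(question, chunks, toy_embed, fake_llm)
```

With this toy embedding the question itself maps to an all-zero vector, since it shares no vocabulary with the documents, while the hypothetical answers land close to the refund chunk. That gap is the mismatch HyDE is meant to bridge.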

Context Enrichment

Smaller chunks improve search precision but at the risk of losing important context for generation. Techniques to balance this include:

Sentence Window Retrieval: Embed and search sentences individually, then expand selected sentences with neighbors, restoring useful context for the language model.

Parent Document Retriever: Retrieve relevant chunks but also provide the entire parent document’s context if many chunks come from the same source.

These methods enrich the input, helping the LLM generate more coherent and deeper responses.
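Sentence window retrieval is simple enough to sketch in a few lines. The example below assumes a naive word-overlap matcher (best_match) in place of a real embedding search; the sample sentences and helper names are hypothetical.

```python
def sentence_window(sentences: list[str], hit_index: int, window: int = 1) -> str:
    """Return the matched sentence together with `window` neighbors on each side."""
    start = max(0, hit_index - window)
    end = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[start:end])


def best_match(query: str, sentences: list[str]) -> int:
    """Toy matcher: index of the sentence sharing the most words with the query."""
    query_words = set(query.lower().split())
    overlaps = [
        len(query_words & set(s.lower().rstrip(".").split())) for s in sentences
    ]
    return overlaps.index(max(overlaps))


sentences = [
    "Our plans are billed monthly.",
    "Refunds are available within 30 days.",
    "Contact finance to start a refund.",
    "Enterprise plans include priority support.",
]

hit = best_match("refunds within 30 days", sentences)
context = sentence_window(sentences, hit, window=1)
```

The search operates on single sentences, which keeps matching precise, while the language model receives the surrounding sentences as well.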

Transforming Queries

Some queries are too complex or verbose. Advanced systems:

Break complex queries into smaller subqueries.
Use step-back prompting to generalize overly specific queries.
Apply query rewriting and expansion to add terms and clarify the user’s intent.
Use LLMs to generate multiple query variants for improved matching.

Smart query transformation refines results and increases relevant document recall.
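When several query variants are run, their result lists have to be merged somehow. One common option, not named in this article, is reciprocal rank fusion (RRF). The sketch below assumes each variant has already produced a ranked list of document ids; the ids and the constant k=60 are illustrative choices.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document earns 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for three rewrites of the same user question
rankings = [
    ["doc_refund", "doc_billing", "doc_login"],
    ["doc_billing", "doc_refund"],
    ["doc_refund", "doc_pricing"],
]

fused = reciprocal_rank_fusion(rankings)
```

Documents that appear near the top of several lists rise to the front, so a single odd rewrite has limited influence on the final order.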

Hybrid Search

Vector search excels at semantic meaning but struggles with exact term matches crucial for, say, brand names or jargon.

Hybrid search combines:

Sparse keyword search algorithms like BM25
Dense vector embeddings from transformers

Both scores are weighted to optimize relevance, achieving a sweet balance between precision and coverage.
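The weighting step can be sketched as below. This is a simplified illustration, assuming a small in-memory corpus, a compact BM25 implementation with common default parameters (k1=1.5, b=0.75), and dense scores that are made-up numbers standing in for cosine similarities from an embedding model.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25 (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for other in tokenized if term in other)
            if df == 0:
                continue  # Term appears nowhere in the corpus
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            tf = counts[term]
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


def min_max(values: list[float]) -> list[float]:
    """Rescale scores to [0, 1] so sparse and dense scores are comparable."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def hybrid_rank(sparse: list[float], dense: list[float], alpha: float = 0.5) -> list[int]:
    """Return document indices, best first; alpha weights dense, 1 - alpha weights sparse."""
    combined = [
        alpha * d + (1 - alpha) * s for s, d in zip(min_max(sparse), min_max(dense))
    ]
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)


docs = [
    "acme x200 router firmware update",
    "how to update the firmware on your router",
    "billing and refund policy",
]
sparse = bm25_scores("x200 firmware", docs)
dense = [0.70, 0.75, 0.10]  # Assumed similarities, not computed by a real model

order = hybrid_rank(sparse, dense, alpha=0.5)
dense_only = hybrid_rank(sparse, dense, alpha=1.0)
```

In this toy case the dense scores alone slightly prefer the generic firmware guide, while the exact match on the product name lets BM25 pull the X200 document to the top of the hybrid ranking.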

Query Routing

Complex systems may need to:

Search multiple data sources (vector DBs, SQL, proprietary stores)
Handle mixed modalities like images, text, audio

Query routing uses a router, either rule-based or LLM-powered, to direct queries to one or more appropriate retrieval backends. This avoids wasted compute and speeds up response times.
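A rule-based router can be as small as the sketch below. The backend names (sql, image_index, vector_db) and the keyword lists are assumptions made up for illustration; an LLM-powered router would replace the keyword checks with a classification call.

```python
def route(query: str) -> list[str]:
    """Pick one or more retrieval backends for a query using simple keyword rules."""
    text = query.lower()
    backends = []
    # Aggregation-style questions go to the relational database
    if any(word in text for word in ("total", "count", "average", "how many")):
        backends.append("sql")
    # Questions about visual content go to the image index
    if any(word in text for word in ("image", "photo", "screenshot")):
        backends.append("image_index")
    # Everything else falls back to semantic search over text
    if not backends:
        backends.append("vector_db")
    return backends


routes = {
    "aggregate": route("How many refunds were issued in March?"),
    "visual": route("Find the screenshot of the error"),
    "default": route("What is our refund policy?"),
    "mixed": route("Count the photos uploaded today"),
}
```

Returning a list rather than a single name lets one query fan out to several backends when more than one rule matches.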

Post-Retrieval: Reranking and Context Compression

After retrieving top chunks, simply feeding all to an LLM isn’t ideal:

Reranking uses secondary models to reorder chunks by relevance, reducing hallucinations. Techniques include cross-encoders, multi-vector rerankers, and even fine-tuned LLMs.

Context compression filters redundant or low-value info before generation, saving token usage and improving output quality.
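The two post-retrieval steps can be sketched together. This is a simplified illustration: overlap_score is a toy stand-in for a cross-encoder or other reranking model, and compression here only drops duplicates and enforces a word budget, which is a cruder filter than production systems typically use.

```python
from typing import Callable


def rerank(query: str, chunks: list[str], score: Callable[[str, str], float]) -> list[str]:
    """Reorder chunks by a secondary relevance score, highest first."""
    return sorted(chunks, key=lambda chunk: score(query, chunk), reverse=True)


def compress(chunks: list[str], max_words: int) -> list[str]:
    """Keep chunks in order, skipping duplicates and any chunk that would exceed the budget."""
    kept: list[str] = []
    seen: set[str] = set()
    used = 0
    for chunk in chunks:
        normalized = " ".join(chunk.lower().split())
        n_words = len(normalized.split())
        if normalized in seen or used + n_words > max_words:
            continue
        seen.add(normalized)
        kept.append(chunk)
        used += n_words
    return kept


def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words present in the chunk."""
    query_words = set(query.lower().split())
    return len(query_words & set(chunk.lower().split())) / len(query_words)


retrieved = [
    "Shipping takes five business days",
    "Refund requests are processed within five days",
    "refund requests are processed within five days",
    "Refund policy applies to annual plans only",
]

reranked = rerank("refund requests processed", retrieved, overlap_score)
context = compress(reranked, max_words=14)
```

The off-topic shipping chunk sinks to the bottom during reranking, and the compression pass then drops the case-variant duplicate and stops adding text once the budget is spent.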

Response Optimization and Memory Integration

Generating great answers often needs multi-step reasoning and memory awareness:

Iterative refinement uses multiple LLM calls over chunks to progressively improve answers.
Hierarchical summarization recursively merges chunk-level summaries into a coherent final response.
Chat history embedding and compression helps maintain context over long multi-turn conversations.

Adaptive and Recursive Retrieval

Real-world queries aren’t always simple. Advanced RAG approaches can:

Use iterative retrieval to repeatedly refine results based on generated answers.
Employ recursive retrieval combined with chain-of-thought prompting to break queries into sub-steps, improving precision.
Enable adaptive systems where the LLM decides when and what to retrieve dynamically.

Wrapping Up: Putting It All Together
Advanced RAG techniques collectively evolve the naive pipeline into a sophisticated, high-performing system capable of:

Handling large, complex, hierarchical corpora
Improving retrieval relevance and recall with smarter embeddings and query processing
Providing richer context for state-of-the-art LLMs to generate accurate, grounded answers
Scaling efficiently across use cases and domains

By investing in these enhancements (hierarchical indexing, hypothetical questions and HyDE, hybrid search, reranking, query routing, and more), developers can unlock the full potential of RAG to build robust AI assistants, knowledge bases, and search engines.

Building Your Own Data Parser with Docling

Jerry Gathu — Wed, 24 Sep 2025 15:40:09 +0000

When building AI agents, one of the most crucial steps is preparing the data you feed to the language models (LLMs). If your data is not well-structured and context-ready, your agents may severely underperform and fail to deliver the results you expect.

While there are well-known data parsers such as LlamaParse, Amazon Textract, and Azure AI Document Intelligence, this article will focus on setting up your own data parser using an open-source alternative called Docling. Docling allows you to efficiently transform messy data into structured formats ready for AI workflows.

What this article covers:

Extracting document content
Creating document chunks
Creating embeddings and storing them in ChromaDB
Testing basic search functionality

EXTRACT DOCUMENT CONTENT
First we import the DocumentConverter from docling and initialize it.

from docling.document_converter import DocumentConverter

converter = DocumentConverter()

Next, convert your PDF (or another supported document format) to a structured Docling document and export it as a JSON-serializable dictionary:

result = converter.convert("https://arxiv.org/pdf/2408.09869")

document = result.document
json_output = document.export_to_dict()

Docling supports a wide range of formats, including PDF, DOCX, HTML, Markdown, and PowerPoint. You can also pass a URL to extract HTML content directly.

DOCUMENT CHUNKING

Instead of storing the entire document at once, we split it into smaller, meaningful pieces called chunks. This improves retrieval relevance and reduces the amount of data sent to the language model at once.

Docling offers powerful chunking methods that understand document structure beyond just splitting text blindly:

Hierarchical chunker - Recognizes natural breaks in documents like sections and paragraphs.
Hybrid chunker - Builds on hierarchical chunking and further splits chunks too large for your embedding model's token limits.

Here’s an example using Docling’s HybridChunker with an open-source tokenizer:

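from docling.chunking import HybridChunker
from transformers import AutoTokenizer

# Assumption: any Hugging Face tokenizer can be used here; this model name is only
# an example, and its token counts approximate those of the embedding model
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
MAX_TOKENS = 8191  # Input limit of text-embedding-3-small

chunker = HybridChunker(
    tokenizer=tokenizer,
    max_tokens=MAX_TOKENS
)

# Split the converted document into structure-aware chunks
chunk_iter = chunker.chunk(dl_doc=result.document)
chunks = list(chunk_iter)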
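To illustrate the idea of splitting oversized pieces down to a token limit, here is a toy sketch. It is not Docling's algorithm: it treats whitespace-separated words as tokens and takes pre-split sections as input, and split_to_limit is a hypothetical helper.

```python
def split_to_limit(sections: list[str], max_tokens: int) -> list[str]:
    """Keep each section as one chunk unless it exceeds max_tokens, then split it further."""
    chunks = []
    for section in sections:
        words = section.split()
        # Step through the section in windows of at most max_tokens words
        for start in range(0, len(words), max_tokens):
            chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks


sections = ["Intro to parsing", "one two three four five six seven"]
toy_chunks = split_to_limit(sections, max_tokens=3)
```

Sections that already fit stay whole, so structural boundaries are preserved wherever the limit allows.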

EMBEDDING THE CHUNKS

Now that the chunks are ready, we can store them in our vector database. First, we initialize the ChromaDB client, then create a collection that uses an OpenAI embedding function:

import os

import chromadb
from chromadb.utils import embedding_functions

# Initialize a persistent Chroma client; data is written to this directory automatically
client = chromadb.PersistentClient(path="chroma_persistent_storage")

# Set up OpenAI embedding function
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small"
)

# Create the collection, or reuse it if it already exists
collection = client.get_or_create_collection(
    name='document_collection',
    embedding_function=openai_ef
)

# Add chunks to the collection (Docling chunks expose their content as .text)
for idx, chunk in enumerate(chunks):
    collection.add(
        documents=[chunk.text],
        metadatas=[{"chunk_index": idx}],
        ids=[f"chunk_{idx}"]
    )

Test Basic Search Functionality

Test if your setup works by querying the vector database with sample questions and retrieving relevant chunks:

query = "What is the main contribution of the paper?"
results = collection.query(query_texts=[query], n_results=3)

for doc in results['documents'][0]:
    print(doc)

Summary

With Docling, you can easily:

Extract documents from multiple formats into a clean, structured format.
Chunk documents intelligently, preserving their logical hierarchy.
Generate embeddings using your preferred model (like OpenAI).
Store and retrieve embeddings efficiently in a local vector database like Chroma.