Mudassir Khan

Posted on Jun 6

AI agent memory management: beyond the context window

#ai #llm #machinelearning #webdev

Your agent answered correctly five minutes ago. Now it's asking for the same information again. The context window filled up, the early messages got evicted, and all that history is gone.

This is not a hallucination problem. It's a memory architecture problem.

The symptom: agents that forget during a task

You're debugging an agent that handles multistep workflows. Somewhere around turn 15, it starts contradicting itself. It asks questions you already answered in turn 3. It ignores decisions it made in turn 7.

Context limits are the obvious culprit, but the real issue is deeper. Most agent implementations treat the context window as memory. It isn't. It's working memory that evicts old data as soon as the window fills. The moment something scrolls out of the context, the agent has no path back to it.
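To make the eviction concrete, here is a minimal sketch of a context window as a rolling buffer. Everything in it is an assumption for illustration: the `ContextWindow` class, the word-count stand-in for a tokenizer, and the 15-token budget. Real frameworks trim or summarize in their own ways.

```python
from collections import deque


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


class ContextWindow:
    """Rolling buffer that drops the oldest messages once a token budget is exceeded."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.messages = deque()
        self.evicted = []  # kept only so the loss is visible in this demo

    def total_tokens(self) -> int:
        return sum(count_tokens(m) for m in self.messages)

    def add(self, message: str) -> None:
        self.messages.append(message)
        # Evict from the front (oldest first) until the budget fits again.
        while self.total_tokens() > self.max_tokens and len(self.messages) > 1:
            self.evicted.append(self.messages.popleft())


window = ContextWindow(max_tokens=15)
window.add("turn 1: the deploy target is staging")
window.add("turn 2: use the blue database")
window.add("turn 3: what was the deploy target?")

print(list(window.messages))  # turn 1 is no longer in the window
print(window.evicted)
```

By the time turn 3 asks about the deploy target, the only message that answered it has already been dropped, and nothing left in the window can bring it back.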

That's the gap this article addresses.

Working memory vs long-term memory

A context window is the agent's working memory. It holds the current conversation, the current task state, and whatever you stuffed in at the start. It's fast to read, but it resets. Every new session starts blank.

Long-term memory is what persists across sessions: user preferences, prior decisions, learned facts about the problem domain. Without it, every session is the agent's first session.

The distinction matters because the solutions are different. Working memory problems (forgetting during a task) need context management techniques. Long-term memory problems (cross-session amnesia) need a storage and retrieval layer.

Most articles conflate the two. Most agents solve neither properly.

The lost in the middle problem and why position matters

Even inside a single context window, position matters more than you'd think.

Research on long-context models (Liu et al., 2023, "Lost in the Middle: How Language Models Use Long Contexts") found that LLMs exhibit a "lost in the middle" effect: accuracy tends to be highest when relevant information sits at the start or end of the context, and drops significantly for information buried in the middle. If you have a 64k token window and you put the most critical retrieved documents around position 30k, you risk effectively hiding them from the model.

The practical consequence: in a RAG system, you should not dump all your retrieved chunks in the middle of the prompt and hope the model weighs them equally. It doesn't.

A production fix is to place your highest-confidence retrieved documents at the very beginning and end of the context. Treat the context window like a sandwich: critical context at the top, critical context again at the bottom, lower-priority material in the middle.
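One way to read the sandwich idea in code is a reordering step that runs after retrieval. This is a sketch under assumptions: `sandwich_order` is a hypothetical helper, and it expects chunks already sorted best first. It deals chunks alternately to the front and the back, so the weakest ones land in the middle.

```python
def sandwich_order(chunks_by_confidence: list[str]) -> list[str]:
    """Reorder chunks (best first) so the strongest sit at the edges of the prompt."""
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(chunks_by_confidence):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the second-best chunk is the very last item.
    return front + back[::-1]


ranked = ["A (best)", "B", "C", "D", "E (weakest)"]
ordered = sandwich_order(ranked)
print(ordered)  # ['A (best)', 'C', 'E (weakest)', 'D', 'B']
```

The best chunk opens the prompt, the second best closes it, and the weakest sits in the middle where a miss costs the least.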

The hybrid pattern: auto-retrieve at start, agent-driven storage after

The pattern that holds up in production is a combination of two mechanisms.

Auto-retrieve at request start. Before every agent turn, automatically retrieve relevant long-term memories based on the current query or task state. This means the agent always starts with the best available context, even in a fresh session.

Agent-driven storage. Let the agent decide what is worth keeping. After completing a task or making a significant decision, the agent writes to long-term memory: "User prefers TypeScript strict mode, reminded me three times." That's information worth retrieving in the next session.

You can think of solid agent memory management as a filing cabinet alongside the working scratchpad. Auto-retrieve pulls relevant files from the cabinet at the start of each turn. Agent-driven storage files things back when they are worth keeping.
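The filing cabinet analogy fits in a few lines. This is a toy under stated assumptions: `FilingCabinet` is a made-up class, it lives in process memory, and it matches on shared words instead of embeddings. It only shows the two moves, filing something away and pulling it back at the start of a later turn.

```python
class FilingCabinet:
    """Toy long-term store: keeps notes in a list, retrieves by shared words."""

    def __init__(self) -> None:
        self.notes: list[str] = []

    def store(self, note: str) -> None:
        self.notes.append(note)

    def retrieve(self, query: str, top_k: int = 2) -> list[str]:
        query_words = set(query.lower().split())
        scored = [(len(query_words & set(n.lower().split())), n) for n in self.notes]
        # Keep only notes that share at least one word, best overlap first.
        hits = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
        return [note for _, note in hits[:top_k]]


cabinet = FilingCabinet()

# Earlier session: the agent decides these are worth keeping.
cabinet.store("user prefers typescript strict mode")
cabinet.store("project deploys to staging on fridays")

# Later turn: start by pulling whatever is relevant to the new query.
recalled = cabinet.retrieve("set up typescript config for the user")
print(recalled)  # ['user prefers typescript strict mode']
```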

Where vector stores fit and how to rerank

Long-term memory at scale lives in vector databases. You embed memories (or document chunks) into a vector space and retrieve by semantic similarity. When a new query arrives, you run a similarity search and pull the top K most relevant chunks.
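Stripped of the database, top K retrieval is just a similarity score and a sort. The sketch below assumes hand-made 3-dimensional vectors in place of real embeddings and a plain dict in place of a vector store. Production systems use an embedding model and an approximate nearest-neighbor index instead of a full scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: dict[str, list[float]], k: int) -> list[str]:
    # Score every stored vector against the query, then keep the k best.
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hand-made toy "embeddings"; real ones come from an embedding model.
store = {
    "user prefers strict typing": [0.9, 0.1, 0.0],
    "deploys happen on fridays": [0.0, 0.2, 0.9],
    "user dislikes implicit any": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]
results = top_k(query, store, k=2)
print(results)  # ['user prefers strict typing', 'user dislikes implicit any']
```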

The problem is that "most similar" and "most useful" are not the same thing. A retrieval system that returns raw cosine similarity scores will sometimes surface tangentially related content over the most relevant hit. That's where reranking earns its place.

A reranker takes the top K retrieved chunks and scores them again using a more expensive cross-encoder model. It's slower than vector similarity search, but it runs on a small candidate set (K is usually 10 to 20), so the latency stays manageable. The output is a reordered list where the most genuinely useful chunks end up at the top.
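The shape of the rerank step looks roughly like this. Treat it as a sketch: `toy_pair_scorer` is a stand-in I made up that counts shared words, not a cross-encoder. In a real system you would pass in a function that calls your reranking model on each (query, chunk) pair.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int,
) -> list[str]:
    """Re-score each (query, candidate) pair and return the best top_n."""
    ordered = sorted(candidates, key=lambda c: score_pair(query, c), reverse=True)
    return ordered[:top_n]


def toy_pair_scorer(query: str, chunk: str) -> float:
    # Stand-in for a cross-encoder: fraction of query words that appear in the chunk.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q)


# Pretend these came back from the vector search, in similarity order.
candidates = [
    "typescript was released in 2012",
    "the user wants typescript strict mode enabled",
    "strict parents and their effects",
]
best = rerank("enable typescript strict mode", candidates, toy_pair_scorer, top_n=2)
print(best[0])  # the chunk about the user's strict mode preference
```

The candidate that came back second from the similarity search ends up first, which is the whole point of paying for a second scoring pass.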

If you're picking a stack, you can compare vector databases by latency, filtering support, and managed hosting options before committing.

A minimal memory loop in code

Here's a realistic Python sketch of the hybrid pattern. It uses a hypothetical memory client wrapping a vector store.

from memory_client import MemoryClient  # hypothetical wrapper around a vector store
from llm_client import LLMClient        # hypothetical LLM wrapper

USER_ID = "user_123"

memory = MemoryClient(user_id=USER_ID)
llm = LLMClient()


def format_memories(memories) -> str:
    # Render memories as a bulleted block. Assumes each item exposes a
    # `content` attribute, mirroring the `content=` argument of memory.add().
    return "\n".join(f"- {m.content}" for m in memories)


def agent_turn(query: str, agent_id: str) -> str:
    # Step 1: auto-retrieve relevant memories before every turn
    memories = memory.search(
        query=query,
        agent_id=agent_id,
        top_k=5,
        rerank=True,
    )

    # Step 2: inject top memories at start, rest at end (sandwich pattern)
    top_memories = memories[:2]    # highest confidence at the very top
    bottom_memories = memories[2:] # remaining near the bottom

    context_parts = []
    if top_memories:
        context_parts.append(format_memories(top_memories))

    context_parts.append(query)

    if bottom_memories:
        context_parts.append(format_memories(bottom_memories))

    # The response is assumed to carry `text`, `should_store`, and `memory_payload`.
    response = llm.complete("\n\n".join(context_parts))

    # Step 3: agent-driven storage: write what is worth keeping
    if response.should_store:
        memory.add(
            content=response.memory_payload,
            agent_id=agent_id,
            user_id=USER_ID,
        )

    return response.text

The should_store flag is where the interesting decisions happen. You can implement it as a second LLM call ("Is this response something worth remembering for future sessions?"), a simple heuristic (decisions over a certain length, or responses containing explicit preferences), or a structured output field the main LLM populates.
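As a starting point, the heuristic option can be this small. It's a sketch, not a recommendation: the keyword list and the 40-word threshold are guesses you would tune against your own transcripts, and `should_store` here is a standalone function, not the response field used in the loop.

```python
import re

# Phrases that often signal an explicit, durable preference.
PREFERENCE_PATTERN = re.compile(
    r"\b(prefer|prefers|always|never|from now on)\b", re.IGNORECASE
)


def should_store(user_message: str, response_text: str, min_decision_words: int = 40) -> bool:
    """Naive filter: keep explicit preferences and long, decision-heavy responses."""
    if PREFERENCE_PATTERN.search(user_message):
        return True
    return len(response_text.split()) >= min_decision_words


print(should_store("I prefer strict mode", "Noted."))   # True
print(should_store("what time is it", "It is noon."))  # False
```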

Start simple. A naive heuristic beats no memory at all, and you can upgrade the storage logic once you see what your agent actually needs to keep.

FAQ

What is the difference between a context window and memory?

A context window is temporary working memory. It holds the current conversation and task state, evicts older content as it fills, and is cleared between sessions. Memory is a persistent store that survives session resets, backed by a database the agent reads from and writes to explicitly.

How do agents remember across sessions?

They don't automatically. You wire up explicit storage (usually a vector database or a structured key-value store) and retrieval logic. The agent writes important information to the store at the end of a task and retrieves relevant entries at the start of the next session.
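For the structured key-value route, Python's built-in `sqlite3` module is enough to see the idea. This is a minimal sketch with an assumed schema (`memories` table, `user_id`/`key`/`value` columns) and a temp-file path. Two connections to the same file stand in for two separate sessions.

```python
import os
import sqlite3
import tempfile


def open_store(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS memories ("
        "user_id TEXT, key TEXT, value TEXT, PRIMARY KEY (user_id, key))"
    )
    return conn


def remember(conn: sqlite3.Connection, user_id: str, key: str, value: str) -> None:
    # Upsert so a newer value replaces the older one for the same key.
    conn.execute("INSERT OR REPLACE INTO memories VALUES (?, ?, ?)", (user_id, key, value))
    conn.commit()


def recall(conn: sqlite3.Connection, user_id: str) -> dict[str, str]:
    rows = conn.execute("SELECT key, value FROM memories WHERE user_id = ?", (user_id,))
    return dict(rows.fetchall())


db_path = os.path.join(tempfile.mkdtemp(), "agent_memory.db")

# Session 1: write at the end of a task, then shut down.
session_1 = open_store(db_path)
remember(session_1, "user_123", "ts_mode", "strict")
session_1.close()

# Session 2: a fresh connection reopens the same file and reads it back.
session_2 = open_store(db_path)
restored = recall(session_2, "user_123")
session_2.close()
print(restored)  # {'ts_mode': 'strict'}
```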

What is the lost in the middle problem?

LLMs pay less attention to information in the middle of a long context window than to information at the start or end. If you place your most critical retrieved documents in the center of a large prompt, the model may effectively ignore them. The fix is to place the highest-confidence chunks at the very beginning and end of the context.

If you want a deeper look at agent memory architecture, I cover it in more detail on my site.

If you want this wired up on your own system end to end, that is exactly the kind of work I take on.

Drop a comment if your setup looks different. Curious what memory stacks people are running in production.
