Improving

Posted on May 28 • Originally published at improving.com

Agent Memory Systems: Building Long-Term Context for AI

#agents #agenticmemory #ai

AI is almost everywhere now, from agentic coding and autonomous workflows to the day-to-day engineering tasks you want to delegate to an agent. In many cases, an agent looks impressive in a single turn: you dump all the details of your pipeline or cluster issue into one prompt, and it responds with clean reasoning and a plausible plan.

The trouble starts when you try to work like a human does, iteratively over multiple turns.

Suddenly the agent forgets what you decided earlier, why you chose a certain approach, and sometimes even who is talking to it (SRE? platform engineer? backend dev?).

Agent memory fills this gap. Without a memory system, your agent behaves like a goldfish: it can only remember what fits inside a fixed context window, and once that window is saturated or a new session begins, continuity breaks.

Memory is how you turn a one-shot chatbot into something that can maintain state, learn from outcomes, and stay aligned with your constraints over time.

Let’s start by looking at how context engineering addresses this goldfish problem.

What is Context Engineering?

Context engineering refers to the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference, including everything that lands in the context window besides the prompt itself.

It includes techniques such as:

Prompt engineering
Structured outputs
State handling
RAG (Retrieval-Augmented Generation)
Memory (short-term + long-term)
Context packing / token budgeting

Why Agentic Memory Matters

Consider an example.

You’re building a DevOps pipeline and ask the agent to add one more step. It replies, “I don’t know what you’re talking about.”

That does not mean the model is weak or incapable. It simply has no context: in a single-turn system, it knows nothing about the steps you discussed earlier.

If your system recalls previous conversation state, evidence, and decisions, the agent can understand exactly which pipeline you mean and which step to add, even if you discussed it in a different session.

Without memory, the AI agent behaves like Dory from Finding Nemo. It might remember a few recent turns, but then recall stops, especially if the context window is small or you exceed the token limit.

Once you restart a conversation or start a new session, it forgets nearly everything unless you build persistence.

Agentic memory is required for workflows that are iterative, multi-step, and long-running, helping achieve:

Continuity: Enables agents to remember previous turns in a conversation.
Learning and adaptation: Allows agents to learn from past successes and failures.
Advanced reasoning: Supports planning, personalization, and maintaining state.

Memory Architecture Patterns

Agent memory is not a single bucket where you dump everything from chat history.

In practice, it is layered because different information has different lifetimes, retrieval needs, and failure modes.

A useful mental model is:

Short-term memory: What the agent needs right now to finish the current task.
Long-term memory: What should persist across sessions.

Long-term memory typically splits into:

Episodic memory
Semantic memory
Procedural memory

Short-term Memory (Session / Working Memory)

Short-term memory acts as the agent’s session buffer. It holds the recent conversation plus the immediate working state needed to complete the current task.

It prevents the agent from resetting mid-debug or mid-execution and is typically implemented as a sliding window of messages plus a state object containing plans, variables, tool outputs, and assumptions.

Once the task ends or the buffer grows too large, short-term memory is summarized, pruned, or selectively promoted into long-term memory.
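
The sliding window plus state object described above could look roughly like the following. This is a minimal sketch, not a reference implementation: the names SessionBuffer, WorkingState, and the evicted list are hypothetical and chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any


@dataclass
class WorkingState:
    """Immediate task state kept alongside the message window."""
    plan: list[str] = field(default_factory=list)
    variables: dict[str, Any] = field(default_factory=dict)
    tool_outputs: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)


class SessionBuffer:
    """Sliding window of recent messages plus a working-state object."""

    def __init__(self, max_messages: int = 4) -> None:
        self.messages = deque(maxlen=max_messages)  # (role, text) pairs
        self.evicted = []  # messages that fell out of the window
        self.state = WorkingState()

    def add(self, role: str, text: str) -> None:
        # When the window is full, the oldest message is about to drop out.
        # Keep it aside so it can be summarized or promoted to long-term
        # memory instead of being silently lost.
        if len(self.messages) == self.messages.maxlen:
            self.evicted.append(self.messages[0])
        self.messages.append((role, text))


buffer = SessionBuffer(max_messages=2)
buffer.add("user", "Add a lint step to the pipeline")
buffer.add("agent", "Added lint before build")
buffer.add("user", "Now add a test step")
buffer.state.plan.append("insert test step after lint")
```

After the third message, the window holds only the two most recent turns, and the first message sits in evicted, waiting to be summarized, pruned, or promoted.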

Long-term Memory (Persistent Memory)

Long-term memory can be divided into three categories:

Episodic Memory

Stores past interactions as events with outcomes.

Each record typically captures:

What happened
What failed
What worked

Useful when revisiting the same system over time because it preserves continuity and prevents repeating dead ends.

Semantic Memory

Stores stable facts and constraints about the user, project, and environment.

Examples:

Deployment conventions
User preferences
Team policies
Architecture decisions

This keeps the agent consistent and personalized across sessions.

Procedural Memory

Stores repeatable how-to knowledge.

Examples:

Workflows
Runbooks
Checklists
Operational procedures

This allows the agent to execute proven processes instead of improvising each time.
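
One way to keep the three categories separate is to tag every record with its kind. The sketch below assumes an in-memory list as a stand-in for a real persistent store; MemoryKind, MemoryRecord, LongTermStore, and the scope field are hypothetical names, and the procedural record is an invented example.

```python
from dataclasses import dataclass
from enum import Enum


class MemoryKind(Enum):
    EPISODIC = "episodic"      # events with outcomes
    SEMANTIC = "semantic"      # stable facts and constraints
    PROCEDURAL = "procedural"  # repeatable how-to knowledge


@dataclass(frozen=True)
class MemoryRecord:
    kind: MemoryKind
    text: str
    scope: str  # hypothetical field: the project or team the record applies to


class LongTermStore:
    """In-memory stand-in for a persistent store."""

    def __init__(self) -> None:
        self._records = []

    def write(self, record: MemoryRecord) -> None:
        self._records.append(record)

    def by_kind(self, kind: MemoryKind, scope: str) -> list:
        # Return only records of the requested kind within the given scope.
        return [r for r in self._records if r.kind is kind and r.scope == scope]


store = LongTermStore()
store.write(MemoryRecord(
    MemoryKind.EPISODIC,
    "Terraform state locking issues resolved by moving to S3 + DynamoDB backend",
    "platform",
))
store.write(MemoryRecord(MemoryKind.SEMANTIC, "We're running on EKS, not GKE", "platform"))
store.write(MemoryRecord(
    MemoryKind.PROCEDURAL,
    "Deploy runbook: lint, build, deploy to staging, verify rollout",  # invented example
    "platform",
))

facts = store.by_kind(MemoryKind.SEMANTIC, "platform")
```

Keeping the kind explicit lets each category get its own retention and retrieval policy later, since events, facts, and procedures age differently.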

How Agentic Memory Systems Work

Memory-enabled agents mimic the practical shape of human memory.

Humans have:

Sensory intake
Working memory
Long-term storage

Agents recreate this through an operational loop.

Practical Agent Loop

Imagine you ask an agent:

Deploy payments service to the Kubernetes staging cluster using Helm. Enable HPA and make sure rollout is healthy.

The agent will generally follow these steps:

Read
- Parse the goal, target, constraints, and current state.
Retrieve
- Pull relevant semantic, procedural, and episodic memories.
Assemble
- Build a token-budgeted context with only the relevant information.
Act
- Execute deployment steps and inspect failures.
Evaluate
- Verify rollout health, pod status, and deployment success.
Write-back
- Store durable learnings, fixes, and operational insights.

The write-back phase is what turns chat into learning.

Without it, you are only doing retrieval, not memory.
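
The six steps can be sketched as a single function. Everything here is an assumption made for illustration: Memory uses naive keyword overlap instead of real retrieval, and fake_deploy is a hypothetical stand-in for actual Helm and kubectl calls.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    notes: list[str] = field(default_factory=list)

    def retrieve(self, goal: str) -> list[str]:
        # Naive keyword overlap stands in for real retrieval.
        words = set(goal.lower().split())
        return [n for n in self.notes if words & set(n.lower().split())]

    def write_back(self, note: str) -> None:
        if note not in self.notes:
            self.notes.append(note)


def run_turn(goal: str, memory: Memory, act, max_items: int = 3) -> bool:
    parsed_goal = goal.strip()                      # Read
    recalled = memory.retrieve(parsed_goal)         # Retrieve
    context = [parsed_goal] + recalled[:max_items]  # Assemble (item-capped)
    result = act(context)                           # Act
    healthy = bool(result.get("healthy"))           # Evaluate
    outcome = "succeeded" if healthy else "failed"
    detail = result.get("detail", "")
    memory.write_back(f"{parsed_goal} -> {outcome}: {detail}")  # Write-back
    return healthy


def fake_deploy(context: list[str]) -> dict:
    # Hypothetical stand-in for the real deployment tooling.
    recalled = len(context) - 1
    return {"healthy": True, "detail": f"rollout complete with {recalled} recalled notes"}


memory = Memory()
first = run_turn("Deploy payments service to staging", memory, fake_deploy)
second = run_turn("Deploy payments service to staging", memory, fake_deploy)
```

The second call sees the note written by the first one, which is the difference between retrieval alone and memory: the loop feeds its own outcomes back into what it can recall next time.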

Agent Memory vs RAG (Retrieval-Augmented Generation)

RAG is about retrieving external knowledge such as:

Documentation
Tickets
Wikis
Runbooks

It is fundamentally a stateless retrieval workflow.

Memory, in contrast, is about persistent internal context:

User preferences
Constraints
Decisions
Outcomes
Historical interactions

What Should Become Memory?

Memory should contain information that improves future performance without introducing noise.

Examples include:

Explicit “remember this” instructions

Remember that all production deployments must go through manual approval in Argo CD.

Stable preferences

Examples:

Uses Argo CD for GitOps
Prefers YAML over Helm templates
Follows strict naming conventions

Decisions and milestones

Examples:

Migrated from Jenkins to GitHub Actions
Standardized observability with Prometheus + Grafana

User corrections

Examples:

The API endpoint is /v2/orders, not /v1/orders
We’re running on EKS, not GKE

Outcomes and lessons learned

Examples:

Terraform state locking issues resolved by moving to S3 + DynamoDB backend

RAG and Memory Together

The most effective AI systems use both RAG and memory together.

RAG provides organizational knowledge.
Memory provides personalized context.

Together they create agents that are both knowledgeable and context-aware.

Vector Stores for Semantic Memory

Semantic memory stores stable facts, preferences, and knowledge.

As agents accumulate hundreds or thousands of facts, scalable retrieval becomes necessary.

This is where vector stores become useful.

An embedding model converts text into embeddings (numerical vectors), and a vector database stores and indexes those vectors for similarity search.

When the agent needs information:

The current query is embedded.
Similar vectors are retrieved.
Relevant memories are injected into context.
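
The embed, search, inject flow can be illustrated without any external service. In this sketch, embed is a toy bag-of-words counter standing in for a real embedding model, and a plain list stands in for the vector database; both are assumptions made to keep the example self-contained.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Toy "embedding": token counts. A real system would call an embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, memories: list[str], k: int = 2) -> list[str]:
    q = embed(query)  # 1. embed the current query
    # 2. rank stored memories by similarity to the query
    ranked = sorted(memories, key=lambda m: cosine(q, embed(m)), reverse=True)
    return ranked[:k]


memories = [
    "Uses Argo CD for GitOps",
    "The API endpoint is /v2/orders, not /v1/orders",
    "We're running on EKS, not GKE",
]
hits = top_k("Which API endpoint serves orders?", memories, k=1)
# 3. inject the retrieved memories into the prompt context
prompt = "Relevant memories:\n" + "\n".join(hits) + "\n\nQuestion: Which API endpoint serves orders?"
```

A production setup would precompute and index the memory embeddings rather than re-embedding on every query, which is the part a vector store handles.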

Retrieval Strategies

Common strategies include:

Similarity search (top-k)
Re-ranking
Recency bias
Filtering by scope
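
These strategies are usually combined. The sketch below filters by scope first, then re-ranks by blending similarity with an exponential recency decay. The Candidate fields, the 30-day half-life, the 0.3 recency weight, and the "Pipeline runs on Jenkins" record are all illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    similarity: float  # score from the vector search step, assumed in [0, 1]
    age_days: float
    scope: str


def rank(candidates, scope, k=2, half_life_days=30.0, recency_weight=0.3):
    # Filtering by scope: drop memories that belong to another project or team.
    in_scope = [c for c in candidates if c.scope == scope]

    def score(c: Candidate) -> float:
        # Recency bias: the recency term halves every `half_life_days`.
        recency = 0.5 ** (c.age_days / half_life_days)
        return (1 - recency_weight) * c.similarity + recency_weight * recency

    # Re-ranking: reorder the similarity hits by the blended score, keep top-k.
    return sorted(in_scope, key=score, reverse=True)[:k]


candidates = [
    Candidate("Pipeline runs on Jenkins", 0.85, age_days=365, scope="ci"),
    Candidate("Migrated from Jenkins to GitHub Actions", 0.80, age_days=10, scope="ci"),
    Candidate("Standardized observability with Prometheus + Grafana", 0.95, age_days=5, scope="observability"),
]
ranked = rank(candidates, scope="ci")
```

Here the newer migration record outranks the stale Jenkins record despite a slightly lower similarity score, which is the kind of conflict recency bias is meant to resolve.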

Popular Vector Databases

Pinecone
Weaviate
Milvus
Qdrant
Chroma
pgvector

Context Window Management and Token Accounting

The context window is the model’s working memory.

Even though modern models support huge windows, effective context management is still difficult.

Adding too much information:

Degrades reasoning quality
Increases cost
Adds latency

Long-Running vs Short-Running Agents

Short-running agents may fit everything into a single context window.

Long-running agents operating across sessions accumulate far more information than any window can hold.

These agents require selective retrieval strategies.

The Context Stuffing Trap

A common mistake is including all available information without curation.

This introduces noise and buries critical information.

Helpful techniques include:

Semantic chunking
Memory buffering
Just-in-time retrieval
Hierarchical summarization
Progressive disclosure
Sliding windows

Packing Order Matters

Models tend to pay more attention to the beginning and end of the context than to the middle.

Recommended ordering:

System instructions and high-priority context at the start
Relevant memories and the immediate user query at the end
Lower-priority material in between, so nothing critical is buried in the middle
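
Token budgeting and packing order can be combined in one small routine. This is a sketch under stated assumptions: estimate_tokens uses a rough four-characters-per-token heuristic (a real system would use the model's tokenizer), and pack_context is a hypothetical helper, not part of any library.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def pack_context(system: str, query: str, memories, budget: int) -> list[str]:
    """memories is a list of (relevance, text) pairs; returns ordered context parts."""
    used = estimate_tokens(system) + estimate_tokens(query)
    if used > budget:
        raise ValueError("budget too small for system prompt and query")

    chosen = []
    # Spend the remaining budget on the most relevant memories first.
    for relevance, text in sorted(memories, key=lambda m: m[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            chosen.append((relevance, text))
            used += cost

    # Lay out: system first, query last, and the most relevant memory
    # right before the query so it is not buried in the middle.
    chosen.sort(key=lambda m: m[0])
    return [system] + [text for _, text in chosen] + [query]


system = "You are a deployment assistant."
query = "Deploy payments to staging."
memories = [
    (0.4, "Standardized observability with Prometheus + Grafana."),
    (0.9, "Production deploys need manual approval in Argo CD."),
    (0.7, "We're running on EKS, not GKE."),
]
packed = pack_context(system, query, memories, budget=35)
```

With a budget of 35 estimated tokens, the two most relevant memories fit and the observability note is left out, rather than everything being stuffed in.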

Memory Management: Pruning and Compression

As memory grows, it requires active management.

Without management:

Retrieval slows down
Storage grows indefinitely
Old memories conflict with newer information

Pruning Strategies

Pruning selectively forgets irrelevant information.

Common strategies include:

TTL (time-to-live)
Least Recently Used (LRU)
Relevance scoring
User-requested deletion

Most production systems combine several of these techniques.
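
As an illustration of combining two of these strategies, the sketch below applies a TTL first and then an LRU cap. Entry and prune are hypothetical names, timestamps are plain seconds passed in explicitly, and the example texts are invented.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    text: str
    created_at: float    # seconds since some epoch
    last_used_at: float  # updated whenever the memory is retrieved


def prune(entries, now: float, ttl_seconds: float, max_entries: int):
    # TTL: drop anything older than the time-to-live, regardless of usage.
    alive = [e for e in entries if now - e.created_at <= ttl_seconds]
    # LRU: if still over capacity, keep the most recently used entries.
    alive.sort(key=lambda e: e.last_used_at, reverse=True)
    return alive[:max_entries]


entries = [
    Entry("staging namespace is payments-stg", created_at=900, last_used_at=950),
    Entry("old cluster endpoint", created_at=100, last_used_at=990),
    Entry("lint step added to pipeline", created_at=600, last_used_at=700),
    Entry("HPA enabled for payments", created_at=700, last_used_at=960),
]
kept = prune(entries, now=1000, ttl_seconds=500, max_entries=2)
```

Note the design choice: the old cluster endpoint is dropped by the TTL even though it was used recently. Whether recent use should extend a memory's lifetime is a policy decision, and relevance scoring or user-requested deletion would slot in as additional filters.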

Memory Compression

Compression stores information in more compact forms.

Useful techniques include:

Rolling summaries
Hierarchical summarization
Topic clustering
Deduplication
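
Two of these techniques are easy to sketch. Deduplication below compares normalized text, which only catches near-identical wording; rolling_summary folds older messages into a running summary using a pluggable summarize function. The join_summarizer here is a trivial stand-in for what would normally be an LLM call, and all names are hypothetical.

```python
import re


def normalize(text: str) -> str:
    # Lowercase, strip punctuation, and collapse whitespace.
    no_punct = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def deduplicate(memories: list[str]) -> list[str]:
    seen = set()
    unique = []
    for memory in memories:
        key = normalize(memory)
        if key not in seen:  # keep the first occurrence only
            seen.add(key)
            unique.append(memory)
    return unique


def rolling_summary(summary: str, messages: list[str], summarize, keep_last: int = 2):
    """Fold all but the last `keep_last` messages into the running summary."""
    to_fold = messages[:-keep_last] if keep_last else messages
    recent = messages[-keep_last:] if keep_last else []
    if to_fold:
        summary = summarize(summary, to_fold)
    return summary, recent


def join_summarizer(summary: str, messages: list[str]) -> str:
    # Stand-in for an LLM summarization call: just concatenates.
    prefix = summary + " | " if summary else ""
    return prefix + " | ".join(messages)


unique = deduplicate([
    "Uses Argo CD for GitOps",
    "uses argo cd for  gitops.",
    "We're running on EKS, not GKE",
])
summary, recent = rolling_summary("", ["m1", "m2", "m3", "m4"], join_summarizer)
```

The rolling summary keeps the latest turns verbatim and compresses only what has aged out, so recent detail is not lost to summarization.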

Measuring Compression Quality

Compression should preserve critical information.

Quality checks include:

Ensuring important facts remain retrievable
Detecting contradictions
Avoiding over-compression
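
A simple version of the first check is to verify that a list of must-keep facts survives compression. The sketch below uses case-insensitive substring matching, which is a deliberately crude assumption; a real check might probe the retrieval system with questions instead. compression_report and the sample texts are hypothetical.

```python
def compression_report(original: str, compressed: str, required_facts: list[str]) -> dict:
    # A fact counts as preserved if it appears verbatim (case-insensitive).
    lowered = compressed.lower()
    missing = [fact for fact in required_facts if fact.lower() not in lowered]
    ratio = len(compressed) / len(original) if original else 1.0
    return {"missing_facts": missing, "ratio": ratio, "ok": not missing}


original = (
    "We debugged Terraform for an hour. Terraform state locking issues were "
    "resolved by moving to S3 + DynamoDB backend. Everyone was relieved."
)
required = ["state locking", "S3", "DynamoDB"]

good = compression_report(
    original, "Terraform state locking fixed via S3 + DynamoDB backend.", required
)
over_compressed = compression_report(original, "Terraform issue fixed.", required)
```

The second summary is shorter but loses every required fact, which is the over-compression case: a better ratio is not a better summary if the lesson learned is no longer retrievable.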

Final Thoughts

We are entering the era of personalized AI agents, where memory becomes foundational infrastructure.

Without memory, agents lose continuity across sessions and interactions.

With memory, they can:

Learn from outcomes
Maintain long-term context
Personalize interactions
Execute workflows more reliably

In this article, we explored:

Context engineering
Memory architecture patterns
Agent loops
RAG vs memory
Semantic retrieval systems
Context management
Pruning and compression

After working on 250+ projects and helping companies generate billions, we have learned one thing: most organizations don’t fail at AI because of technology. They fail because they skip the trust-building stages, such as developing the agentic memory systems that make AI safe to scale.

In the next part of this blog, we will implement these memory patterns and see how the pieces come together to form a sophisticated agentic system, one in which agents carry context across sessions and remember past events instead of starting over each time.
