Mukunda Rao Katta

Posted on May 25

Adding Memory to Your Python Agent Without a Vector Database

#hermeschallenge #ai #python #agents

Most blog posts about agent memory go straight to vector databases. Pinecone, Weaviate, ChromaDB. Embeddings, cosine similarity, semantic retrieval.

That's fine if you need it. But most agents I've built don't need semantic search over 10 years of conversation history. They need to remember what happened in this session, and maybe the last few sessions. That's it.

Three patterns handle that. No vector DB required.

The Three Patterns

Pattern 1: Conversation log. Persist every message as JSONL. Load the last N turns on startup. This is stateful replay.

Pattern 2: Key-value session store. Checkpoint arbitrary dict state between runs. The agent picks up exactly where it left off, even after restart.

Pattern 3: Sliding message window. Keep only the most recent messages in context. Older messages get dropped. Cheap and effective.

Each pattern fits different use cases. Let's look at each one with real code.

Pattern 1: Conversation Log with conversation-codec

conversation-codec persists messages as JSONL. Optional Fernet encryption if you're storing anything sensitive.

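import anthropic
from conversation_codec import ConversationCodec

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
codec = ConversationCodec(path="~/.myagent/sessions/user123.jsonl")

# Load prior history on startup
history = codec.load()

# Read the next user turn
user_input = input("> ")

# Send to LLM with prior context
messages = history + [{"role": "user", "content": user_input}]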
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=messages,
)

assistant_reply = response.content[0].text

# Persist new turns
codec.append({"role": "user", "content": user_input})
codec.append({"role": "assistant", "content": assistant_reply})

The codec writes each turn as a newline-delimited JSON object. One line per message. You can inspect the file with any text editor or grep it for debugging.
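If you want to see how little machinery this takes, here is a minimal stdlib-only sketch of the same idea. It is not conversation-codec's implementation; the function names append_message and load_messages are my own for illustration.

```python
import json
from pathlib import Path
from typing import Optional


def append_message(path: Path, message: dict) -> None:
    """Append one message to the log as a single JSON line."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        # json.dumps escapes embedded newlines, so each message stays on one line
        f.write(json.dumps(message, ensure_ascii=False) + "\n")


def load_messages(path: Path, last_n: Optional[int] = None) -> list:
    """Read the log back, optionally keeping only the last N messages."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        messages = [json.loads(line) for line in f if line.strip()]
    if last_n is None:
        return messages
    # Guard against last_n == 0, where messages[-0:] would return everything
    return messages[-last_n:] if last_n > 0 else []
```

Calling load_messages(path, last_n=20) on startup gives you the "load the last N turns" behavior from Pattern 1.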

For encryption, pass a Fernet key:

from cryptography.fernet import Fernet
key = Fernet.generate_key()
codec = ConversationCodec(path="session.jsonl", encryption_key=key)

Generate the key once and store it in your environment, not next to the file. A key regenerated on every run cannot decrypt earlier sessions. The file then contains only encrypted blobs.

Pattern 2: Key-Value State with agent-resume

Conversation history is one thing. But agents often carry state that isn't messages. A list of things they've processed. A counter. A current task step. A dict of user preferences.

agent-resume checkpoints arbitrary dicts between runs.

from agent_resume import AgentResume

resume = AgentResume(path="~/.myagent/state/user123.json")

# Load state from last run, or start fresh
state = resume.load(default={"step": 0, "processed_ids": [], "user_prefs": {}})

print(f"Resuming from step {state['step']}")

# Do work
for item in fetch_items():
    if item["id"] in state["processed_ids"]:
        continue  # already done

    process(item)
    state["processed_ids"].append(item["id"])
    state["step"] += 1

    # Checkpoint after each item
    resume.save(state)

print(f"Finished at step {state['step']}")

The checkpoint is atomic. If your agent crashes mid-run, the last successful save is intact. On next startup, the agent loads that state and skips already-processed items.
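A common way to get that kind of atomicity is to write to a temporary file and rename it over the target. The sketch below shows the technique with the standard library only. It is an illustration of the idea, not agent-resume's actual code, and save_state / load_state are hypothetical names.

```python
import copy
import json
import os
import tempfile
from pathlib import Path


def save_state(path: Path, state: dict) -> None:
    """Write state to a temp file, then atomically swap it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the same directory so the rename stays on one filesystem
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, path)  # readers see the old file or the new one, never a partial one
    except BaseException:
        # A failed write leaves the previous checkpoint untouched; remove the leftover temp file
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_state(path: Path, default: dict) -> dict:
    """Return the last checkpoint, or a copy of the default on first run."""
    if not path.exists():
        return copy.deepcopy(default)
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

The point of the rename is that a crash halfway through json.dump only damages the temp file. The real checkpoint is replaced in a single step or not at all.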

This pattern is useful for any agent running a multi-step pipeline over external data. Ingestion agents. Daily digest agents. Anything where re-processing old items is wasteful or harmful.

Pattern 3: Sliding Window with agent-message-window

Loading the full conversation history eventually hits the context window limit. Fifty turns with large tool results can already be too much, and every turn you replay costs tokens whether the model needs it or not.

agent-message-window keeps a fixed-size window of recent messages. Older messages slide off.

from agent_message_window import MessageWindow

window = MessageWindow(max_messages=20, keep_system=True)

# Load from persistent log (optional -- combine with Pattern 1)
history = codec.load()
for msg in history:
    window.add(msg)

# Add the new user turn before building the request
window.add({"role": "user", "content": user_input})

# Current window is always the last 20 messages
messages = window.get()

# Send to LLM
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=messages,
)

assistant_reply = response.content[0].text

# Add the reply to the window
window.add({"role": "assistant", "content": assistant_reply})

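The keep_system=True flag means the system message never slides off. Only conversational turns get evicted. Note that the Anthropic Messages API takes the system prompt as the top-level system parameter rather than as a message, so the flag only matters when your message list carries a system-role entry.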

One important detail: the window respects tool use / tool result pairing. If a tool_use block is in the window, the corresponding tool_result stays in the window too. Evicting one without the other breaks the Anthropic message format.
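To make the pairing rule concrete, here is a small sketch of one way to trim a message list without orphaning a tool_result. It assumes Anthropic-style content blocks (a list of dicts with a "type" field). The function names are mine, and agent-message-window may resolve the pairing differently, for example by keeping the earlier tool_use instead of dropping the result.

```python
def has_block(message: dict, block_type: str) -> bool:
    """True if the message content is a block list containing the given block type."""
    content = message.get("content")
    if not isinstance(content, list):
        return False
    return any(isinstance(b, dict) and b.get("type") == block_type for b in content)


def trim_window(messages: list, max_messages: int) -> list:
    """Keep system messages plus the most recent turns, never starting on an orphaned tool_result."""
    system = [m for m in messages if m.get("role") == "system"]
    turns = [m for m in messages if m.get("role") != "system"]
    kept = turns[-max_messages:] if max_messages > 0 else []
    # If the cut landed between a tool_use and its tool_result, the result is now
    # first in the window with nothing to pair with, so drop it as well.
    while kept and has_block(kept[0], "tool_result"):
        kept = kept[1:]
    return system + kept
```

This version errs on the side of a slightly smaller window. The alternative is to extend the window backwards by one message so the tool_use comes along with its result.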

Combining All Three

In practice, you often want all three together:

from conversation_codec import ConversationCodec
from agent_resume import AgentResume
from agent_message_window import MessageWindow

# Persistent log (full history, JSONL)
codec = ConversationCodec(path="~/.myagent/sessions/user123.jsonl")

# Key-value state (task state, not messages)
resume = AgentResume(path="~/.myagent/state/user123.json")
state = resume.load(default={"task": None})

# Sliding window (what the LLM actually sees)
window = MessageWindow(max_messages=20, keep_system=True)

# Seed window from recent history only
for msg in codec.load()[-20:]:
    window.add(msg)

# Agent loop
while True:
    user_input = input("> ")
    if not user_input:
        break

    window.add({"role": "user", "content": user_input})
    codec.append({"role": "user", "content": user_input})

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=window.get(),
    )

    reply = response.content[0].text
    window.add({"role": "assistant", "content": reply})
    codec.append({"role": "assistant", "content": reply})
    resume.save(state)

    print(reply)

Each layer has a clear job. The codec holds the full record. The resume holds state. The window controls what the model sees.

What This Does NOT Do

These patterns do not do semantic search. If a user asks "what did we talk about last month regarding the budget?", these patterns cannot retrieve that context efficiently. You need embeddings and vector search for that.

These patterns also do not handle concurrent writers. If two processes append to the same JSONL file at the same time, their writes can interleave and leave you with corrupted lines. Use a database or add a file lock if you have concurrent writers.
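If a file lock is enough for your case, an advisory lock around each append is one option. This is a sketch that assumes a POSIX system (fcntl is not available on Windows) and that every writer goes through the same function. append_locked is a name I made up for the example.

```python
import fcntl
import json
from pathlib import Path


def append_locked(path: Path, message: dict) -> None:
    """Append one JSON line while holding an exclusive advisory lock on the file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until other cooperating writers release the lock
        try:
            f.write(json.dumps(message) + "\n")
            f.flush()  # make sure the full line is written before unlocking
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

The lock is advisory: it only protects you from writers that also take it. A process that appends without locking can still interleave.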

There is no automatic summarization of old turns. If you want to compress old history into a summary before it slides off the window, you wire that up yourself. The libraries give you the primitives.
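Wiring it up yourself can be fairly small. The sketch below folds the messages that are about to be evicted into a single summary message. The summarize argument is a placeholder for whatever you use to produce the summary (typically another LLM call); compact and the summary wording are assumptions for illustration, not part of any of the three libraries.

```python
from typing import Callable


def compact(messages: list, max_messages: int, summarize: Callable[[list], str]) -> list:
    """Replace the oldest messages with one summary message so the list fits max_messages."""
    if max_messages < 2:
        raise ValueError("max_messages must leave room for the summary plus one recent message")
    if len(messages) <= max_messages:
        return list(messages)
    keep = max_messages - 1  # one slot is reserved for the summary
    evicted, recent = messages[:-keep], messages[-keep:]
    summary = {
        "role": "user",
        "content": "Summary of earlier conversation: " + summarize(evicted),
    }
    return [summary] + recent
```

You would call this before sending, with summarize set to a function that asks the model for a short recap of the evicted turns.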

Design Notes

The reason to use JSONL instead of SQLite or a real database is debuggability. You can open the file in any editor. You can grep for a specific tool call. You can replay the session by reading the file line by line. That matters when something breaks at 2am.
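The same inspection works from Python when grep is not enough. Here is a small sketch that replays an unencrypted log line by line with optional filters. The replay function and its parameters are hypothetical, written only against the one-JSON-object-per-line layout described above.

```python
import json
from pathlib import Path
from typing import Iterator, Optional


def replay(path: Path, role: Optional[str] = None, contains: str = "") -> Iterator[tuple]:
    """Yield (line_number, message) pairs from a JSONL log, optionally filtered."""
    with path.open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            message = json.loads(line)
            if role is not None and message.get("role") != role:
                continue
            if contains and contains not in line:
                continue  # plain substring match on the raw line, like grep
            yield line_no, message
```

For example, list(replay(path, contains="get_weather")) returns every logged message that mentions that tool, along with the line number to jump to in your editor.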

The reason to separate conversation log from key-value state is that they have different shapes. Messages are append-only and ordered. State is a mutable dict. Mixing them into one structure causes pain when you need to query one without the other.

The sliding window is a tradeoff, not a solution. You lose old context. But you control your token usage, and the model sees a coherent recent window instead of a half-truncated history.

When This Applies

Use these patterns when:

You're building a single-user chatbot or assistant
Sessions are reasonably bounded (under a few hundred turns)
You don't need to retrieve specific past facts from months ago
You want something you can inspect and debug without a running database

Do not use these patterns when:

You need to answer questions about content from months of history
You have thousands of users sharing a session store
You need vector similarity retrieval over past conversations

Quick Start

pip install conversation-codec agent-resume agent-message-window

The libraries are independent. Use any one or all three. No shared configuration.

Related Libraries

Library	What It Does	Language
`conversation-codec`	JSONL persistence with optional Fernet encryption	Python
`agent-resume`	Checkpoint/resume arbitrary dict state between runs	Python
`agent-message-window`	Sliding window with tool_use/tool_result pairing	Python
`agentfit`	Agent run benchmarking and latency tracking	Python
`prompt-token-counter`	Approximate token counts before sending to LLM	Python
`agent-step-log`	Per-step JSONL logger with structured metadata	Python

What's Next

These three patterns cover the 80% case. If you find you need semantic retrieval, the natural next step is to add an embedding layer on top of the conversation log. The JSONL file becomes your source of truth, and you index it into a vector store separately.

If you're hitting token limits even with the sliding window, look at prompt-token-counter to measure your actual usage before each call, and tool-output-truncate-py to shrink large tool results before they hit the window.
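If you just want a rough guard before reaching for a real counter, a character-based estimate gets you surprisingly far. The sketch below assumes roughly four characters per token, which is a crude rule of thumb for English text and not how prompt-token-counter or any tokenizer actually counts. Both function names are mine.

```python
import json
import math


def estimate_tokens(messages: list, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate based on the character length of each message's content."""
    total_chars = 0
    for message in messages:
        content = message.get("content", "")
        # Block-list content (tool calls, tool results) is measured by its JSON form
        text = content if isinstance(content, str) else json.dumps(content)
        total_chars += len(text)
    return math.ceil(total_chars / chars_per_token)


def fit_budget(messages: list, max_tokens: int) -> list:
    """Drop the oldest messages until the estimate fits the budget, keeping at least one."""
    trimmed = list(messages)
    while len(trimmed) > 1 and estimate_tokens(trimmed) > max_tokens:
        trimmed.pop(0)
    return trimmed
```

Treat the result as an early warning, not an exact count. Leave yourself headroom, and note that this simple version does not apply the tool_use / tool_result pairing rule from Pattern 3.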

The full source for all three libraries is on GitHub under MukundaKatta. PRs are open.
