Most agent tutorials show you the happy path. Start a loop, call the LLM, print the result, done. What they skip is everything that makes agents useful in production: continuity across turns, persistence across crashes, and not blowing up the context window on turn 47.
This post walks the full lifecycle of a conversation. From the first message to a stored session you can resume two days later.
Hook
Here is the situation. A user sends a message to your agent at 9am. They come back at 2pm, pick up the same thread, and expect the agent to remember everything. The agent hits an error on turn 12. They reload the page, and the conversation resumes from turn 11. Across all of this, your context window stays under control and your bill does not spike.
This is not a single library problem. It is a composition problem. Four libraries cover the four distinct jobs:
-
agent-message-windowtrims the in-memory context on every turn -
conversation-codecappends each turn to a JSONL file on disk -
agent-resumecheckpoints tool state so you can recover mid-run -
prompt-token-counterkeeps your token estimates accurate as the window shifts
Main Code
import asyncio
from pathlib import Path
from agent_message_window import MessageWindow
from conversation_codec import ConversationCodec
from agent_resume import Checkpoint, CheckpointStore
from prompt_token_counter import count_tokens
import anthropic
MODEL = "claude-sonnet-4-6"
MAX_TOKENS = 6000 # target window size in tokens
TURNS_BEFORE_CHECKPOINT = 3
async def run_conversation(session_id: str, user_message: str) -> str:
session_path = Path(f"/tmp/sessions/{session_id}")
session_path.mkdir(parents=True, exist_ok=True)
codec = ConversationCodec(session_path / "turns.jsonl")
store = CheckpointStore(session_path / "checkpoint.json")
window = MessageWindow(max_tokens=MAX_TOKENS, token_counter=count_tokens)
# Restore existing conversation from disk
existing_turns = codec.read_all()
for turn in existing_turns:
window.push(turn["role"], turn["content"])
# Restore checkpoint if one exists
checkpoint: Checkpoint | None = store.load()
tool_state = checkpoint.tool_state if checkpoint else {}
turn_count = checkpoint.turn_number if checkpoint else 0
# Push the new user message
window.push("user", user_message)
codec.append("user", user_message)
client = anthropic.Anthropic()
# The main turn loop
while True:
response = client.messages.create(
model=MODEL,
max_tokens=1024,
system="You are a helpful assistant.",
messages=window.as_messages(),
)
assistant_text = response.content[0].text
stop_reason = response.stop_reason
# Persist and push the assistant turn
codec.append("assistant", assistant_text)
window.push("assistant", assistant_text)
turn_count += 1
# Checkpoint every N turns
if turn_count % TURNS_BEFORE_CHECKPOINT == 0:
store.save(Checkpoint(
turn_number=turn_count,
tool_state=tool_state,
last_assistant_message=assistant_text,
))
if stop_reason == "end_turn":
break
# Close the codec cleanly at session end
codec.close()
return assistant_text
async def main():
session_id = "user-123-thread-456"
reply = await run_conversation(session_id, "What is the capital of France?")
print(reply)
# Simulate returning later
reply2 = await run_conversation(session_id, "And what is its population?")
print(reply2)
if __name__ == "__main__":
asyncio.run(main())
What It Does NOT Do
This setup does not summarize old turns. When the window fills up, agent-message-window drops the oldest messages by token count. If you need summaries instead of drops, you add a summarization step before the push. That is a separate concern and a separate library.
This does not handle multi-agent sessions. Each session_id is one thread. If two agents write to the same JSONL, you get interleaved turns with no ordering guarantee. Use separate files per agent.
The checkpoint here is coarse. It saves every three turns. It does not save mid-tool-call state. If your agent calls a tool on turn 12 and crashes mid-execution, you resume from turn 11, not mid-tool. For finer granularity, checkpoint inside your tool dispatcher.
Design Reasoning
Each library does exactly one job.
agent-message-window does not touch disk. It is a pure in-memory structure. You push messages in, it trims from the oldest end when the token budget is exceeded, and you read out a clean list for the LLM call.
conversation-codec does not understand tokens. It appends raw turns to JSONL and reads them back. That is it. The codec does not decide what to keep.
agent-resume does not know what your tools do. It stores and retrieves a JSON blob. What you put in tool_state is up to you.
Keeping these separate means you can swap any one piece. Replace the codec with a database writer. Replace the checkpoint store with Redis. The rest stays the same.
The token counter runs at push time, not at LLM call time. This avoids a last-second recalculation when you are assembling the payload.
When This Applies / Does Not Apply
This pattern fits any agent with persistent sessions. Customer support bots, coding assistants, research agents that work over hours. Anything where a user expects continuity.
It does not fit single-shot agents. If your agent answers one question and discards state, this is overhead you do not need. Use a plain list and call the LLM once.
It does not fit high-frequency automation. If your agent runs 10,000 short tasks per hour, writing a JSONL per task is wasteful. Batch your persistence or skip it and rely on logs.
If your context window is always small and you never hit the token limit, skip agent-message-window and just pass the full list to the LLM. Add it when you start hitting 400 errors from context overflow.
Quick-Start Snippet
Install the four libraries:
pip install agent-message-window conversation-codec agent-resume prompt-token-counter
Minimal usage without crash recovery:
from agent_message_window import MessageWindow
from conversation_codec import ConversationCodec
from prompt_token_counter import count_tokens
from pathlib import Path
import anthropic
window = MessageWindow(max_tokens=4000, token_counter=count_tokens)
codec = ConversationCodec(Path("/tmp/my_session.jsonl"))
client = anthropic.Anthropic()
def chat(user_text: str) -> str:
window.push("user", user_text)
codec.append("user", user_text)
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=512,
messages=window.as_messages(),
)
reply = response.content[0].text
window.push("assistant", reply)
codec.append("assistant", reply)
return reply
print(chat("Hello, who are you?"))
print(chat("What did I just ask you?"))
The second call still knows the answer because the window and codec both carry the history.
Siblings Table
| Library | Job | GitHub |
|---|---|---|
| agent-message-window | Sliding token window for LLM context | MukundaKatta/agent-message-window |
| conversation-codec | JSONL append + read for conversation turns | MukundaKatta/conversation-codec |
| agent-resume | Checkpoint and restore tool state | MukundaKatta/agent-resume |
| prompt-token-counter | Approximate token counts per message | MukundaKatta/prompt-token-counter |
| agentfit | Track token usage per run | MukundaKatta/agentfit |
| agent-step-log | Per-step JSONL structured logger | MukundaKatta/agent-step-log |
What's Next
The next step for this pattern is a summarization hook. When the window drops old turns, instead of discarding them you summarize the oldest N turns into a single system message. This gives the agent compressed long-term memory without growing the context.
The other obvious extension is multi-session management: a session registry that maps user IDs to active windows and codec handles. Right now the session ID is just a directory path. Wrapping that in a proper session manager lets you expire inactive sessions, rotate JSONL files, and query across sessions.
If you are running a multi-user service, also look at agent-rate-fence to put per-user limits on how many turns per minute each session can generate. Without that, one user with a fast polling loop can starve everyone else.
The libraries used in this post are part of the Hermes Agent Challenge sprint. Each one solves a single problem with zero or minimal dependencies. The goal is a stack you can compose without fighting framework opinions.
All repos are at MukundaKatta on GitHub. Issues and PRs welcome.
Top comments (0)