Mukunda Rao Katta

Posted on May 25

The Full Conversation Lifecycle: From First Message to Stored Memory

#hermeschallenge #ai #python #agents

Most agent tutorials show you the happy path. Start a loop, call the LLM, print the result, done. What they skip is everything that makes agents useful in production: continuity across turns, persistence across crashes, and not blowing up the context window on turn 47.

This post walks the full lifecycle of a conversation. From the first message to a stored session you can resume two days later.

Hook

Here is the situation. A user sends a message to your agent at 9am. They come back at 2pm, pick up the same thread, and expect the agent to remember everything. The agent hits an error on turn 12. They reload the page, and the conversation resumes from turn 11. Across all of this, your context window stays under control and your bill does not spike.

This is not a single library problem. It is a composition problem. Four libraries cover the four distinct jobs:

agent-message-window trims the in-memory context on every turn
conversation-codec appends each turn to a JSONL file on disk
agent-resume checkpoints tool state so you can recover mid-run
prompt-token-counter keeps your token estimates accurate as the window shifts

Main Code

import asyncio
from pathlib import Path
from agent_message_window import MessageWindow
from conversation_codec import ConversationCodec
from agent_resume import Checkpoint, CheckpointStore
from prompt_token_counter import count_tokens
import anthropic

MODEL = "claude-sonnet-4-6"
MAX_TOKENS = 6000      # target window size in tokens
TURNS_BEFORE_CHECKPOINT = 3

async def run_conversation(session_id: str, user_message: str) -> str:
    session_path = Path(f"/tmp/sessions/{session_id}")
    session_path.mkdir(parents=True, exist_ok=True)

    codec = ConversationCodec(session_path / "turns.jsonl")
    store = CheckpointStore(session_path / "checkpoint.json")
    window = MessageWindow(max_tokens=MAX_TOKENS, token_counter=count_tokens)

    # Restore existing conversation from disk
    existing_turns = codec.read_all()
    for turn in existing_turns:
        window.push(turn["role"], turn["content"])

    # Restore checkpoint if one exists
    checkpoint: Checkpoint | None = store.load()
    tool_state = checkpoint.tool_state if checkpoint else {}
    turn_count = checkpoint.turn_number if checkpoint else 0

    # Push the new user message
    window.push("user", user_message)
    codec.append("user", user_message)

    client = anthropic.Anthropic()

    # The main turn loop
    while True:
        response = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            system="You are a helpful assistant.",
            messages=window.as_messages(),
        )

        assistant_text = response.content[0].text
        stop_reason = response.stop_reason

        # Persist and push the assistant turn
        codec.append("assistant", assistant_text)
        window.push("assistant", assistant_text)
        turn_count += 1

        # Checkpoint every N turns
        if turn_count % TURNS_BEFORE_CHECKPOINT == 0:
            store.save(Checkpoint(
                turn_number=turn_count,
                tool_state=tool_state,
                last_assistant_message=assistant_text,
            ))

        if stop_reason == "end_turn":
            break

    # Close the codec cleanly at session end
    codec.close()
    return assistant_text


async def main():
    session_id = "user-123-thread-456"
    reply = await run_conversation(session_id, "What is the capital of France?")
    print(reply)

    # Simulate returning later
    reply2 = await run_conversation(session_id, "And what is its population?")
    print(reply2)


if __name__ == "__main__":
    asyncio.run(main())

What It Does NOT Do

This setup does not summarize old turns. When the window fills up, agent-message-window drops the oldest messages by token count. If you need summaries instead of drops, you add a summarization step before the push. That is a separate concern and a separate library.

This does not handle multi-agent sessions. Each session_id is one thread. If two agents write to the same JSONL, you get interleaved turns with no ordering guarantee. Use separate files per agent.

The checkpoint here is coarse. It saves every three turns. It does not save mid-tool-call state. If your agent calls a tool on turn 12 and crashes mid-execution, you resume from turn 11, not mid-tool. For finer granularity, checkpoint inside your tool dispatcher.

Design Reasoning

Each library does exactly one job.

agent-message-window does not touch disk. It is a pure in-memory structure. You push messages in, it trims from the oldest end when the token budget is exceeded, and you read out a clean list for the LLM call.

conversation-codec does not understand tokens. It appends raw turns to JSONL and reads them back. That is it. The codec does not decide what to keep.

agent-resume does not know what your tools do. It stores and retrieves a JSON blob. What you put in tool_state is up to you.

Keeping these separate means you can swap any one piece. Replace the codec with a database writer. Replace the checkpoint store with Redis. The rest stays the same.

The token counter runs at push time, not at LLM call time. This avoids a last-second recalculation when you are assembling the payload.

When This Applies / Does Not Apply

This pattern fits any agent with persistent sessions. Customer support bots, coding assistants, research agents that work over hours. Anything where a user expects continuity.

It does not fit single-shot agents. If your agent answers one question and discards state, this is overhead you do not need. Use a plain list and call the LLM once.

It does not fit high-frequency automation. If your agent runs 10,000 short tasks per hour, writing a JSONL per task is wasteful. Batch your persistence or skip it and rely on logs.

If your context window is always small and you never hit the token limit, skip agent-message-window and just pass the full list to the LLM. Add it when you start hitting 400 errors from context overflow.

Quick-Start Snippet

Install the four libraries:

pip install agent-message-window conversation-codec agent-resume prompt-token-counter

Minimal usage without crash recovery:

from agent_message_window import MessageWindow
from conversation_codec import ConversationCodec
from prompt_token_counter import count_tokens
from pathlib import Path
import anthropic

window = MessageWindow(max_tokens=4000, token_counter=count_tokens)
codec = ConversationCodec(Path("/tmp/my_session.jsonl"))
client = anthropic.Anthropic()

def chat(user_text: str) -> str:
    window.push("user", user_text)
    codec.append("user", user_text)

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=window.as_messages(),
    )
    reply = response.content[0].text
    window.push("assistant", reply)
    codec.append("assistant", reply)
    return reply

print(chat("Hello, who are you?"))
print(chat("What did I just ask you?"))

The second call still knows the answer because the window and codec both carry the history.

Siblings Table

Library	Job	GitHub
agent-message-window	Sliding token window for LLM context	MukundaKatta/agent-message-window
conversation-codec	JSONL append + read for conversation turns	MukundaKatta/conversation-codec
agent-resume	Checkpoint and restore tool state	MukundaKatta/agent-resume
prompt-token-counter	Approximate token counts per message	MukundaKatta/prompt-token-counter
agentfit	Track token usage per run	MukundaKatta/agentfit
agent-step-log	Per-step JSONL structured logger	MukundaKatta/agent-step-log

What's Next

The next step for this pattern is a summarization hook. When the window drops old turns, instead of discarding them you summarize the oldest N turns into a single system message. This gives the agent compressed long-term memory without growing the context.

The other obvious extension is multi-session management: a session registry that maps user IDs to active windows and codec handles. Right now the session ID is just a directory path. Wrapping that in a proper session manager lets you expire inactive sessions, rotate JSONL files, and query across sessions.

If you are running a multi-user service, also look at agent-rate-fence to put per-user limits on how many turns per minute each session can generate. Without that, one user with a fast polling loop can starve everyone else.

The libraries used in this post are part of the Hermes Agent Challenge sprint. Each one solves a single problem with zero or minimal dependencies. The goal is a stack you can compose without fighting framework opinions.

All repos are at MukundaKatta on GitHub. Issues and PRs welcome.

DEV Community