DEV Community

Mukunda Rao Katta
Mukunda Rao Katta

Posted on

My Hermes agent ran for 2 hours, hit a timeout at turn 89, and I had to start over. Now it checkpoints.

Hermes Agent Challenge Submission: Write About Hermes Agent

This is a submission for the Hermes Agent Challenge.

My Hermes research agent was doing a multi-step literature review. Two hours in, at turn 89, the API returned a timeout error that my retry logic couldn't recover from. No checkpoint. I had to start over and re-pay for 89 turns of work.

That's trace-checkpoint.

One line to checkpoint each turn

from trace_checkpoint import Checkpointer

ckpt = Checkpointer("checkpoints/run.jsonl")

for turn in range(1, 100):
    response = call_llm(messages)
    messages.append({"role": "assistant", "content": response.text})

    ckpt.save({"messages": messages})  # one JSONL line per turn

    if is_done(response):
        break
Enter fullscreen mode Exit fullscreen mode

Each save() appends one line to the file. No lock files, no database, no temp files. If the process dies between writes, the last complete line is the last good checkpoint.

Resume after a crash

ckpt = Checkpointer("checkpoints/run.jsonl")
last = ckpt.last_record

if last:
    messages = last.data["messages"]
    start_turn = last.turn + 1
    print(f"Resuming from turn {start_turn}")
else:
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    start_turn = 1

for turn in range(start_turn, 100):
    ...
Enter fullscreen mode Exit fullscreen mode

ckpt.last_record is None if no checkpoint exists, which makes the "first run vs resume" check a single if.

The file format

Each record is a compact JSON line:

{"turn":1,"ts":1716566401.2,"data":{"messages":[...]},"meta":{"label":"research","run_id":"run-001"}}
Enter fullscreen mode Exit fullscreen mode

Plain text. Grep-able. You can inspect it with cat, recover it with any JSON parser, and load it with jq. No proprietary format.

Auto turn numbering

If you don't pass turn=, the checkpointer auto-increments from last_turn + 1:

ckpt = Checkpointer("run.jsonl")
ckpt.save({"x": 1})  # turn 1
ckpt.save({"x": 2})  # turn 2
ckpt.save({"x": 3})  # turn 3
Enter fullscreen mode Exit fullscreen mode

Or explicit:

ckpt.save({"messages": messages}, turn=turn)
Enter fullscreen mode Exit fullscreen mode

label and run_id

Tag records with metadata for multi-agent setups:

ckpt = Checkpointer(
    "checkpoints/worker-1.jsonl",
    label="literature-review",
    run_id="run-2026-05-24",
)
Enter fullscreen mode Exit fullscreen mode

Both go in the meta field of every record. Filter by run_id across merged trace files to reconstruct a full run.

What I checkpoint in my Hermes agent

ckpt.save({
    "messages": messages,             # full message history to this point
    "tool_results": tool_results,     # tool call outputs
    "sub_tasks_done": sub_tasks_done, # which sub-tasks completed
    "cost_usd": running_cost,         # cost so far
})
Enter fullscreen mode Exit fullscreen mode

On resume, I reload this dict and pick up exactly where I left off. The tool results are already there — I don't redo the web searches or API calls that already completed.

Zero dependencies

Standard library only: json, pathlib, time, dataclasses. No third-party packages.

pip install trace-checkpoint
Enter fullscreen mode Exit fullscreen mode

Repo: https://github.com/MukundaKatta/trace-checkpoint

Top comments (0)