My Hermes agent ran for 2 hours, hit a timeout at turn 89, and I had to start over. Now it checkpoints.

#devchallenge #hermesagentchallenge #agents #python

Hermes Agent Challenge Submission: Write About Hermes Agent

This is a submission for the Hermes Agent Challenge.

My Hermes research agent was doing a multi-step literature review. Two hours in, at turn 89, the API returned a timeout error that my retry logic couldn't recover from. No checkpoint. I had to start over and re-pay for 89 turns of work.

That's trace-checkpoint.

One line to checkpoint each turn

from trace_checkpoint import Checkpointer

ckpt = Checkpointer("checkpoints/run.jsonl")

for turn in range(1, 100):
    response = call_llm(messages)
    messages.append({"role": "assistant", "content": response.text})

    ckpt.save({"messages": messages})  # one JSONL line per turn

    if is_done(response):
        break

Each save() appends one line to the file. No lock files, no database, no temp files. If the process dies between writes, the last complete line is the last good checkpoint.

Resume after a crash

ckpt = Checkpointer("checkpoints/run.jsonl")
last = ckpt.last_record

if last:
    messages = last.data["messages"]
    start_turn = last.turn + 1
    print(f"Resuming from turn {start_turn}")
else:
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    start_turn = 1

for turn in range(start_turn, 100):
    ...

ckpt.last_record is None if no checkpoint exists, which makes the "first run vs resume" check a single if.

The file format

Each record is a compact JSON line:

{"turn":1,"ts":1716566401.2,"data":{"messages":[...]},"meta":{"label":"research","run_id":"run-001"}}

Plain text. Grep-able. You can inspect it with cat, recover it with any JSON parser, and load it with jq. No proprietary format.

Auto turn numbering

If you don't pass turn=, the checkpointer auto-increments from last_turn + 1:

ckpt = Checkpointer("run.jsonl")
ckpt.save({"x": 1})  # turn 1
ckpt.save({"x": 2})  # turn 2
ckpt.save({"x": 3})  # turn 3

Or explicit:

ckpt.save({"messages": messages}, turn=turn)

label and run_id

Tag records with metadata for multi-agent setups:

ckpt = Checkpointer(
    "checkpoints/worker-1.jsonl",
    label="literature-review",
    run_id="run-2026-05-24",
)

Both go in the meta field of every record. Filter by run_id across merged trace files to reconstruct a full run.

What I checkpoint in my Hermes agent

ckpt.save({
    "messages": messages,             # full message history to this point
    "tool_results": tool_results,     # tool call outputs
    "sub_tasks_done": sub_tasks_done, # which sub-tasks completed
    "cost_usd": running_cost,         # cost so far
})

On resume, I reload this dict and pick up exactly where I left off. The tool results are already there — I don't redo the web searches or API calls that already completed.