This is a submission for the Hermes Agent Challenge.
My Hermes research agent was doing a multi-step literature review. Two hours in, at turn 89, the API returned a timeout error that my retry logic couldn't recover from. No checkpoint. I had to start over and re-pay for 89 turns of work.
That's trace-checkpoint.
One line to checkpoint each turn
from trace_checkpoint import Checkpointer
ckpt = Checkpointer("checkpoints/run.jsonl")
for turn in range(1, 100):
response = call_llm(messages)
messages.append({"role": "assistant", "content": response.text})
ckpt.save({"messages": messages}) # one JSONL line per turn
if is_done(response):
break
Each save() appends one line to the file. No lock files, no database, no temp files. If the process dies between writes, the last complete line is the last good checkpoint.
Resume after a crash
ckpt = Checkpointer("checkpoints/run.jsonl")
last = ckpt.last_record
if last:
messages = last.data["messages"]
start_turn = last.turn + 1
print(f"Resuming from turn {start_turn}")
else:
messages = [{"role": "system", "content": SYSTEM_PROMPT}]
start_turn = 1
for turn in range(start_turn, 100):
...
ckpt.last_record is None if no checkpoint exists, which makes the "first run vs resume" check a single if.
The file format
Each record is a compact JSON line:
{"turn":1,"ts":1716566401.2,"data":{"messages":[...]},"meta":{"label":"research","run_id":"run-001"}}
Plain text. Grep-able. You can inspect it with cat, recover it with any JSON parser, and load it with jq. No proprietary format.
Auto turn numbering
If you don't pass turn=, the checkpointer auto-increments from last_turn + 1:
ckpt = Checkpointer("run.jsonl")
ckpt.save({"x": 1}) # turn 1
ckpt.save({"x": 2}) # turn 2
ckpt.save({"x": 3}) # turn 3
Or explicit:
ckpt.save({"messages": messages}, turn=turn)
label and run_id
Tag records with metadata for multi-agent setups:
ckpt = Checkpointer(
"checkpoints/worker-1.jsonl",
label="literature-review",
run_id="run-2026-05-24",
)
Both go in the meta field of every record. Filter by run_id across merged trace files to reconstruct a full run.
What I checkpoint in my Hermes agent
ckpt.save({
"messages": messages, # full message history to this point
"tool_results": tool_results, # tool call outputs
"sub_tasks_done": sub_tasks_done, # which sub-tasks completed
"cost_usd": running_cost, # cost so far
})
On resume, I reload this dict and pick up exactly where I left off. The tool results are already there — I don't redo the web searches or API calls that already completed.
Zero dependencies
Standard library only: json, pathlib, time, dataclasses. No third-party packages.
pip install trace-checkpoint
Top comments (0)