agent-resume: checkpoint and resume long-running AI agent jobs in Python

#hermeschallenge #ai #python #agents

You have 500 documents to process. Your agent churns through them, enriching each one with LLM-extracted metadata. Each call takes a second or two. You set it running before you leave the office.

Four hours in, item 447 triggers some edge case. Maybe it is a malformed response from the LLM, maybe it is a network timeout, maybe it is something in your own parsing code. The process crashes and you are staring at an empty output directory.

You restart from item 1.

This is not a hypothetical. If you run batch jobs long enough, this happens. Maybe it is the job that hits a rate limit at 3am. Maybe it is the eval run you kick off before a meeting, only to come back and find it died at item 89 of 200. The only question is whether you lose 4 hours of work or 4 minutes of work. That depends entirely on whether you checkpointed.

Most people either skip checkpointing and pay the restart cost, or wire up a bespoke solution each time: a SQLite table, a set of already-processed IDs in a file, a counter in Redis. It works, but it is plumbing you rewrite for every job.
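For reference, the hand-rolled "processed IDs in a file" version tends to look something like the sketch below. The names `load_done_ids` and `run_with_checkpoint` are made up for illustration, and this version only records which IDs finished, not their results.

```python
import os


def load_done_ids(path):
    """Return the set of item IDs recorded by earlier runs."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def run_with_checkpoint(items, process, path):
    """Process (item_id, payload) pairs, skipping IDs that already finished."""
    done = load_done_ids(path)
    processed = []
    with open(path, "a", encoding="utf-8") as f:
        for item_id, payload in items:
            if item_id in done:
                continue  # finished in a previous run
            process(payload)
            f.write(item_id + "\n")
            f.flush()  # hand the ID to the OS before moving on
            processed.append(item_id)
    return processed
```

It is about twenty lines, and it is the twenty lines that get rewritten, slightly differently, for every new job.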

agent-resume is a small Python library that handles the checkpointing so you do not have to wire it up yourself every time.

The shape of the fix

The library gives you one main function and one store class. That is the whole surface area.

Here is the core pattern:

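from agent_resume import resume_or_start, JsonlStore

# One JSONL file holds the checkpoint for this job.
store = JsonlStore("runs/enrich-docs.jsonl")

# `documents` and `call_llm_to_enrich` are your own code, not part of the library.
def process_documents(documents):
    for doc in documents:
        result = call_llm_to_enrich(doc)
        yield result

# Replays any saved results for this job ID, then continues the generator.
results = resume_or_start(
    job_id="enrich-docs-2026-05-24",
    store=store,
    generator_fn=process_documents,
    generator_args=(documents,),
)

for item in results:
    print(item)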

Call resume_or_start() with a job ID and a generator function. If no checkpoint exists for that job ID, it starts from the beginning. If a checkpoint exists from a previous run, it replays completed items from the store and continues the generator from where it left off.

Your generator does not need to know anything about checkpointing. It just yields results. The library wraps it, saves each result as it comes out, and tracks the highest completed index.

A crash on item 447 means the next run replays everything completed before it from the store and resumes at item 447, the one that never finished.
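To make the replay-then-continue idea concrete, here is a minimal sketch of it. This is not agent-resume's implementation: `checkpointed_map` is a hypothetical helper that takes a per-item function instead of a generator, which keeps the skipping logic simple, and the record layout is an assumption.

```python
import json
import os


def checkpointed_map(fn, items, path):
    """Yield fn(item) for each item, replaying results saved by earlier runs."""
    saved = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    record = json.loads(line)
                    saved[record["index"]] = record["result"]

    with open(path, "a", encoding="utf-8") as f:
        for index, item in enumerate(items):
            if index in saved:
                yield saved[index]  # replayed from disk, fn is not called
                continue
            result = fn(item)
            # Append the result before yielding so a later crash keeps it.
            f.write(json.dumps({"index": index, "result": result}) + "\n")
            f.flush()
            yield result
```

If `fn` raises on an item, nothing is written for that index, so the next run calls `fn` on that same item again.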

What it does NOT do

No async support. This is sync only. If you need async batch processing, you will need something else or a wrapper.
No distributed locking. If two processes try to resume the same job ID from the same store file, results are undefined.
No TTL or expiry. Old checkpoints stay forever unless you delete the file manually.
No built-in retry logic per item. If your generator raises, the job halts. Wrap your per-item logic with try/except inside the generator if you want to skip bad items.
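
For the last point, the try/except can live entirely inside your generator. A minimal sketch, where `process_documents_safely` and the `enrich` callable are hypothetical names standing in for your own code:

```python
def process_documents_safely(documents, enrich):
    """Yield one record per document, marking failures instead of raising."""
    for doc in documents:
        try:
            result = enrich(doc)
        except Exception as exc:
            # Keep going: emit an error marker for this document.
            yield {"ok": False, "doc": doc, "error": repr(exc)}
            continue
        yield {"ok": True, "doc": doc, "result": result}
```

This sketch yields an error marker rather than yielding nothing, so there is still exactly one yield per input document and you can find the failures afterward. Whether you prefer markers or silent skips is a design choice for your job.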

Inside the lib: the append-only store

The JsonlStore writes one JSON line per completed item. It never goes back and edits existing lines.

When you call resume_or_start() on an existing checkpoint file, it reads through every line, finds the highest completed index, and resumes the generator at the next one. Lines written by earlier attempts stay in the file.

This is a deliberate choice. It means you get a full audit trail of every run attempt for free.

# After a crash and one resume, your .jsonl might look like:
# {"job_id": "enrich-docs", "index": 0, "result": {...}}
# {"job_id": "enrich-docs", "index": 1, "result": {...}}
# ...
# {"job_id": "enrich-docs", "index": 446, "result": {...}}
# --- crash happened here on attempt 1 ---
# {"job_id": "enrich-docs", "index": 447, "result": {...}}  <- attempt 2 resumes here
# {"job_id": "enrich-docs", "index": 448, "result": {...}}
# ...

You can read the file afterward and see exactly which run produced which output. If item 100 was processed in attempt 1 and re-processed in attempt 3 (because you cleared a partial checkpoint), both entries are there.

The tradeoff: the file grows across runs and never shrinks unless you delete it. For most batch jobs that is acceptable, because a line of JSON per item is small next to the cost of the work that produced it.
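
The resume scan described above is easy to picture in code. This is a sketch of the idea rather than the library's own code, and it assumes each line carries the `job_id` and `index` fields shown in the example; `next_index` is a hypothetical name.

```python
import json


def next_index(path, job_id):
    """Return the index a resumed run would start at: highest completed + 1."""
    highest = -1
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            if record.get("job_id") == job_id:
                highest = max(highest, record["index"])
    return highest + 1
```

Because the file is append-only, the scan never needs to care which attempt wrote a line. It only needs the largest index for the job.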

When this is useful

LLM batch enrichment jobs where each item takes a real API call and you cannot afford to redo 400 of them.
Nightly data pipelines that run on a schedule and sometimes get killed partway through.
Long eval runs over test sets where you want to resume after adding more test cases.
Any generator-style pipeline where idempotent reprocessing is expensive and you want a cheap crash boundary.

When this is NOT what you want

If your items are not independent. The library assumes each yield is a self-contained result. If items 50-60 depend on the output of item 49, you need to handle that in your generator logic.
If you need atomic all-or-nothing semantics. The store writes items as they complete. A crash mid-run leaves partial results on disk. That is the point, but if partial completion is worse than no completion for your use case, this is not the right pattern.
If you need distributed checkpointing across multiple workers. This library checkpoints a single generator on a single machine to a single file. No coordination layer, no shared state.
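
For the all-or-nothing case, the usual alternative is to write everything to a temporary file and rename it into place only after the last item succeeds. A sketch using only the standard library; `write_all_or_nothing` is a hypothetical helper and has nothing to do with agent-resume:

```python
import json
import os
import tempfile


def write_all_or_nothing(results, final_path):
    """Write every result to a temp file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            for result in results:
                f.write(json.dumps(result) + "\n")
        # The rename only happens if every item was written.
        os.replace(tmp_path, final_path)
    except BaseException:
        os.unlink(tmp_path)  # leave nothing behind on failure
        raise
```

The cost is the one this post is about: a crash at item 447 throws away items 0 through 446.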

Install

pip install agent-resume

Zero dependencies. Python 3.9 or newer. 35 tests.

GitHub: MukundaKatta/agent-resume

Sibling libraries

These libraries cover adjacent boundaries in the same agent pipeline:

Lib	Boundary	Repo
agent-decision-log	Why each branch was chosen (options, rationale, outcome)	MukundaKatta/agent-decision-log
agent-citation	Structured source attribution for agent outputs	MukundaKatta/agent-citation
agentsnap	Snapshot tests for agent tool-call traces	MukundaKatta/agentsnap
agenttrace	Cost and latency tracking per agent run	MukundaKatta/agenttrace
tool-call-budgets	Per-tool call-count caps to stop runaway loops	MukundaKatta/tool-call-budgets

What is next

The obvious gap is async support. A lot of real batch workloads use asyncio with concurrent LLM calls, and the current sync-only design does not cover that. An async version would need to handle concurrent yields carefully to avoid out-of-order checkpointing, but the core idea translates.
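
Until then, one possible stopgap is a sync generator that runs each chunk of calls concurrently and yields the results in input order. This is only a sketch: `enrich` is a stand-in for an async LLM call, and `process_documents_chunked` is a hypothetical name.

```python
import asyncio


async def enrich(doc):
    """Hypothetical stand-in for an async LLM call."""
    await asyncio.sleep(0)
    return doc.upper()


async def _run_chunk(chunk):
    # gather returns results in the order of its inputs, not completion order.
    return await asyncio.gather(*(enrich(doc) for doc in chunk))


def process_documents_chunked(documents, chunk_size=4):
    """Sync generator: each chunk runs concurrently, results come out in order."""
    for start in range(0, len(documents), chunk_size):
        chunk = documents[start:start + chunk_size]
        for result in asyncio.run(_run_chunk(chunk)):
            yield result
```

Since it is an ordinary generator, it should be usable wherever a sync generator function is expected. The catch is that a failure anywhere in a chunk loses the whole chunk, so a resume may redo up to chunk_size - 1 calls that had already succeeded. It also will not work from inside an already running event loop, because asyncio.run() refuses to nest.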

The other thing worth adding is a CLI inspector so you can run agent-resume inspect runs/enrich-docs.jsonl and see a summary: how many items completed, which run attempt got the farthest, whether there are gaps. Right now you have to parse the JSONL yourself.
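
In the meantime, parsing it yourself is short. A sketch that assumes every line has an `index` field, as in the example earlier; `summarize` is a hypothetical helper, not part of the library:

```python
import json
from collections import Counter


def summarize(path):
    """Summarize a checkpoint file: distinct items, highest index, gaps, repeats."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                counts[json.loads(line)["index"]] += 1
    highest = max(counts) if counts else -1
    return {
        "completed": len(counts),
        "highest_index": highest,
        "gaps": [i for i in range(highest + 1) if i not in counts],
        "repeated": sorted(i for i, n in counts.items() if n > 1),
    }
```

It cannot tell you which attempt got the farthest, because nothing in the fields shown above identifies the attempt that wrote a line.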

A third option worth exploring is named snapshots. Right now the resume key is the job ID and the highest completed index. If you want to replay from a specific run attempt rather than the most recent one, you would need to manipulate the file manually. A simple --from-snapshot <timestamp> flag would handle that use case cleanly.

If you build something with it, open an issue. The design is intentionally minimal and I want to see what the real friction points are before adding more.