A while ago I needed a resumable agent runtime.
I did not want something as large as Temporal, and I did not want another agent framework like LangChain. I wanted something small enough to understand, but solid enough to adapt across the different verticals I was building.
It started with a few bare-bones questions.
The moment an agent leaves a notebook, script, or chat session, the hard problems change:
- What work exists?
- Which worker owns it right now?
- What was the last durable step?
- Can another worker resume after a crash?
- Which resources are locked?
- What did the agent produce?
- Can operators inspect what happened?
The result is Roost, a small runtime layer for that problem.
GitHub: https://github.com/mczaykowski/Roost
The basic idea
Roost treats an agent as a durable step machine.
An engine implements two methods:
class Engine:
    engine_id: str
    async def init_snapshot(self, item: WorkItem) -> Snapshot: ...
    async def step(self, snapshot: Snapshot, item: WorkItem) -> Snapshot: ...
The engine owns the domain-specific transition.
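To make the shape concrete, here is a minimal sketch of a toy engine with those two methods. The `WorkItem` and `Snapshot` dataclasses, their fields (`work_id`, `payload`, `version`, `state`, `done`), and the `run_to_completion` driver are hypothetical stand-ins so the example runs on its own; Roost's real types may look different.

```python
import asyncio
from dataclasses import dataclass, field, replace
from typing import Any


# Hypothetical stand-ins for Roost's WorkItem and Snapshot types.
@dataclass(frozen=True)
class WorkItem:
    work_id: str
    payload: dict[str, Any]


@dataclass(frozen=True)
class Snapshot:
    version: int = 0
    state: dict[str, Any] = field(default_factory=dict)
    done: bool = False


class CountdownEngine:
    """Toy engine: counts down from payload["start"], one tick per step."""

    engine_id = "countdown"

    async def init_snapshot(self, item: WorkItem) -> Snapshot:
        # The first snapshot is derived only from the work item.
        return Snapshot(state={"remaining": int(item.payload["start"])})

    async def step(self, snapshot: Snapshot, item: WorkItem) -> Snapshot:
        # Each step returns a new snapshot instead of mutating the old one.
        remaining = max(snapshot.state["remaining"] - 1, 0)
        return replace(
            snapshot,
            version=snapshot.version + 1,
            state={"remaining": remaining},
            done=remaining == 0,
        )


async def run_to_completion(engine: CountdownEngine, item: WorkItem) -> Snapshot:
    # Naive in-process driver; a real runtime would persist between steps.
    snapshot = await engine.init_snapshot(item)
    while not snapshot.done:
        snapshot = await engine.step(snapshot, item)
    return snapshot


final = asyncio.run(run_to_completion(CountdownEngine(), WorkItem("w1", {"start": 3})))
print(final)
```

The point of the sketch is that the engine only describes one transition. Everything about when and where that transition runs stays outside it.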
Roost owns the operational substrate:
Queue
  -> acquire lease
  -> load latest Snapshot
  -> Engine.step(snapshot, item)
  -> compare-and-swap save Snapshot
  -> re-enqueue or mark done
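The loop above can be sketched with an in-memory store. This is an illustration of the lease and compare-and-swap pattern only, not Roost's Redis implementation; `MemoryStore`, `StaleSnapshot`, `run_one_step`, and the `three_steps` transition are all made-up names for this example.

```python
import time
from dataclasses import dataclass, field


class StaleSnapshot(Exception):
    """Raised when another writer saved a newer version first."""


@dataclass
class MemoryStore:
    """In-memory stand-in for the snapshot store and lease table (hypothetical)."""

    snapshots: dict = field(default_factory=dict)  # work_id -> (version, state)
    leases: dict = field(default_factory=dict)     # work_id -> (owner, expires_at)

    def acquire_lease(self, work_id, owner, ttl):
        now = time.monotonic()
        holder = self.leases.get(work_id)
        if holder and holder[0] != owner and holder[1] > now:
            return False  # another worker holds a live lease
        self.leases[work_id] = (owner, now + ttl)
        return True

    def load(self, work_id):
        return self.snapshots.get(work_id, (0, {"steps": 0}))

    def cas_save(self, work_id, expected_version, state):
        current_version, _ = self.load(work_id)
        if current_version != expected_version:
            raise StaleSnapshot(f"{work_id}: expected v{expected_version}, found v{current_version}")
        self.snapshots[work_id] = (expected_version + 1, state)


def run_one_step(store, work_id, owner, step_fn, ttl=30.0):
    """One pass: lease -> load -> step -> CAS save -> re-enqueue or done."""
    if not store.acquire_lease(work_id, owner, ttl):
        return "skipped"
    version, state = store.load(work_id)
    new_state = step_fn(state)
    store.cas_save(work_id, version, new_state)
    return "done" if new_state.get("done") else "re-enqueue"


def three_steps(state):
    # Toy transition: finish after three steps.
    steps = state["steps"] + 1
    return {"steps": steps, "done": steps >= 3}


store = MemoryStore()
results = [run_one_step(store, "w1", "worker-a", three_steps) for _ in range(3)]

# worker-a's lease has not expired, so worker-b cannot take the item.
blocked = run_one_step(store, "w1", "worker-b", three_steps)

# A writer holding an old version loses the compare-and-swap.
try:
    store.cas_save("w1", 1, {"steps": 99})
    stale_rejected = False
except StaleSnapshot:
    stale_rejected = True

print(results, blocked, stale_rejected)
```

The lease keeps a second worker away while the first is alive, and the version check catches the case where a lease expired and two workers raced anyway.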
That gives you:
- durable snapshots
- per-work leases
- at-least-once execution
- retry-safe progress
- delayed continuation
- resource claims
- event history
- content-addressed artifacts
- failed-work inspection
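Of the items in this list, content-addressed artifacts are the easiest to show in isolation. The sketch below assumes the usual meaning of the term, where an artifact's key is a hash of its bytes; `ArtifactStore` and `canonical_json` are hypothetical names, and Roost's actual storage layout and hash choice may differ.

```python
import hashlib
import json


class ArtifactStore:
    """In-memory content-addressed store: the key is the SHA-256 of the bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Writing the same bytes twice is a no-op, which makes retries harmless.
        self._blobs.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        data = self._blobs[digest]
        # The key doubles as an integrity check on read.
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError("artifact corrupted")
        return data

    def __len__(self) -> int:
        return len(self._blobs)


def canonical_json(obj) -> bytes:
    # Sorted keys and fixed separators so equal objects hash to the same key.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


artifacts = ArtifactStore()
first = artifacts.put(canonical_json({"url": "https://example.com", "ok": True}))
second = artifacts.put(canonical_json({"ok": True, "url": "https://example.com"}))

print(first == second, len(artifacts))
```

Because the key depends only on the content, a step that is retried and writes the same output again ends up pointing at the same artifact.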
It is intentionally small. It is not trying to be a prompt framework, model router, workflow DSL, or hosted agent platform.
Roost does not help an agent think.
Roost helps an agent keep going.
Why I built it
A lot of agent tooling focuses on the thinking loop: prompts, tools, retrieval, planning, memory, model routing.
That is useful, but once agents run as workers for minutes, hours, or days, the bottleneck becomes more boring and more operational.
For example:
- a worker dies halfway through a task
- the same job is delivered twice
- a long-running task needs to wait before its next step
- two workers should not touch the same resource at the same time
- an operator needs to know what happened
- the output needs to be inspectable later
You can solve this with a workflow engine, a custom queue, a database table, or a pile of scripts.
Roost is my attempt at a small, agent-shaped version of that layer.
A simple demo: crash-safe URL watchlist
The demo engine is a URL watchlist worker.
It fetches a URL over multiple steps, saves each observation into a snapshot, waits between checks, and writes a final JSON artifact.
You can kill the worker halfway through, restart it, and Roost resumes from the latest saved snapshot.
uv sync --extra redis --extra dev
docker run --rm -p 6379:6379 redis:7
In one terminal:
uv run roost worker --engines watchlist
In another:
WORK_ID=$(uv run roost enqueue \
  --engine watchlist \
  --resource domain:example.com \
  --payload '{"url":"https://example.com","claim":"Example Domain is reachable","checks_required":3,"delay_seconds":5}')
uv run roost status "$WORK_ID"
Then kill the worker with Ctrl-C, start it again, and inspect the same work item.
uv run roost worker --engines watchlist
uv run roost status "$WORK_ID"
There is also a local end-to-end script:
scripts/e2e_watchlist.sh
No LLM key is required. The demo is about runtime behavior, not model behavior.
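The kill-and-resume behavior can also be simulated in a few lines without Redis or the network. This is a sketch under stated assumptions, not the demo engine's source: `WatchSnapshot`, `fake_fetch`, `watch_step`, and `worker` are hypothetical, the fetch is stubbed to always return 200, and the `delay_seconds` wait between checks is left out. Only the `url` and `checks_required` payload fields come from the command above.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class WatchSnapshot:
    """Hypothetical snapshot shape; not Roost's actual Snapshot class."""

    observations: list = field(default_factory=list)
    done: bool = False


async def fake_fetch(url: str) -> int:
    """Stand-in for an HTTP GET so the sketch runs offline; always returns 200."""
    await asyncio.sleep(0)
    return 200


async def watch_step(snapshot: WatchSnapshot, payload: dict) -> WatchSnapshot:
    # One check per step; the new snapshot carries every observation so far.
    status = await fake_fetch(payload["url"])
    check = {"check": len(snapshot.observations) + 1, "status": status}
    observations = snapshot.observations + [check]
    return WatchSnapshot(observations, done=len(observations) >= payload["checks_required"])


async def worker(saved: dict, work_id: str, payload: dict, crash_after=None):
    """Run steps from the last saved snapshot; optionally 'crash' after N steps."""
    steps = 0
    while True:
        snapshot = saved.get(work_id, WatchSnapshot())
        if snapshot.done:
            return snapshot
        if crash_after is not None and steps == crash_after:
            raise RuntimeError("worker killed")
        saved[work_id] = await watch_step(snapshot, payload)
        steps += 1


async def demo():
    saved = {}  # stands in for durable snapshot storage
    payload = {"url": "https://example.com", "checks_required": 3}
    try:
        await worker(saved, "w1", payload, crash_after=2)
    except RuntimeError:
        pass  # the first worker died after two checks
    checks_before_restart = len(saved["w1"].observations)
    final = await worker(saved, "w1", payload)  # a fresh worker picks up the rest
    return checks_before_restart, final


checks_before_restart, final = asyncio.run(demo())
print(checks_before_restart, len(final.observations), final.done)
```

The second worker never repeats the first two checks, because it starts from what was saved rather than from the beginning.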
Local console
Roost includes a small local console:
uv run roost ui
It shows live work, saved state, events, failed work, and artifacts.
The detail view lets you inspect payloads, snapshots, and outputs.
Where this fits
Roost is not a replacement for LangChain, LlamaIndex, CrewAI, AutoGen, Temporal, Celery, or your own agent loop.
It sits at a different layer.
LangChain helps decide what an agent should do.
Temporal helps coordinate workflows.
Celery runs jobs.
Roost keeps long-running agent workers alive, inspectable, and resumable.
The current backend is Redis + SAQ. Execution is at-least-once, so a step can run more than once from the same snapshot, and engines need to make step() safe to retry.
That tradeoff is intentional. I would rather expose the semantics clearly than pretend exactly-once execution exists.
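One way to read "retry-safe" is sketched below: the step is a pure function of the snapshot and its input, any side effect is keyed by work ID and target version, and a version check on save lets only one of two duplicate executions win. All names here (`Snap`, `notify`, `cas_save`, the `sent` dict) are hypothetical and only illustrate the idea; they are not part of Roost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snap:
    version: int
    total: int


sent = {}  # idempotency key -> message; stands in for an external system


def notify(key: str, message: str) -> None:
    """Side effect guarded by an idempotency key: replays overwrite, never duplicate."""
    sent[key] = message


def step(work_id: str, snap: Snap, amount: int) -> Snap:
    # Output depends only on the inputs, so a retry computes the same snapshot.
    new = Snap(version=snap.version + 1, total=snap.total + amount)
    notify(f"{work_id}:v{new.version}", f"total is now {new.total}")
    return new


def cas_save(store: dict, work_id: str, expected_version: int, new: Snap) -> bool:
    # Save only if nobody else has advanced the snapshot in the meantime.
    if store[work_id].version != expected_version:
        return False
    store[work_id] = new
    return True


store = {"w1": Snap(version=0, total=10)}
loaded = store["w1"]

# The same job is delivered twice and both executions start from the same snapshot.
first = step("w1", loaded, 5)
duplicate = step("w1", loaded, 5)

first_saved = cas_save(store, "w1", loaded.version, first)
duplicate_saved = cas_save(store, "w1", loaded.version, duplicate)

print(first == duplicate, first_saved, duplicate_saved, store["w1"], sent)
```

The duplicate does the work again, but it produces the same snapshot, the same keyed side effect, and a rejected save, so the visible result is the same as running once.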
What I’m looking for feedback on
I’m especially interested in feedback on the abstraction boundary.
Is this useful as a small runtime under agent loops?
Would you rather reach for Temporal, Celery, or a custom queue?
Does the init_snapshot() / step() model feel too small, or exactly small enough?

