Mariusz Czajkowski

I built a tiny runtime for resumable agent workers

A while ago I needed a resumable agent runtime.

I did not want something as large as Temporal, and I did not want another agent framework like LangChain. I wanted something small enough to understand, but solid enough to adapt across the different verticals I was building.

It started with a few bare-bones questions.

The moment an agent leaves a notebook, script, or chat session, the hard problems change:

  • What work exists?
  • Which worker owns it right now?
  • What was the last durable step?
  • Can another worker resume after a crash?
  • Which resources are locked?
  • What did the agent produce?
  • Can operators inspect what happened?

The result is Roost, a small runtime layer for that problem.

GitHub: https://github.com/mczaykowski/Roost

The basic idea

Roost treats an agent as a durable step machine.

An engine implements two methods:

class Engine:
    engine_id: str

    async def init_snapshot(self, item: WorkItem) -> Snapshot: ...
    async def step(self, snapshot: Snapshot, item: WorkItem) -> Snapshot: ...

The engine owns the domain-specific transition.
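To make the shape concrete, here is a minimal sketch of a toy engine. The `WorkItem` and `Snapshot` dataclasses are hypothetical stand-ins I made up for illustration (the real Roost types may carry different fields), and `CountdownEngine` and `run_to_completion` are illustrative names, not part of Roost.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any


# Hypothetical stand-ins for Roost's WorkItem and Snapshot;
# the real classes may have different fields.
@dataclass(frozen=True)
class WorkItem:
    work_id: str
    payload: dict[str, Any]


@dataclass(frozen=True)
class Snapshot:
    version: int = 0
    state: dict[str, Any] = field(default_factory=dict)
    done: bool = False


class CountdownEngine:
    """Toy engine: counts down from payload["start"], one unit per step."""

    engine_id = "countdown"

    async def init_snapshot(self, item: WorkItem) -> Snapshot:
        return Snapshot(state={"remaining": int(item.payload["start"])})

    async def step(self, snapshot: Snapshot, item: WorkItem) -> Snapshot:
        # Build a new snapshot instead of mutating the old one, so the
        # previous durable step stays intact until the save succeeds.
        remaining = snapshot.state["remaining"] - 1
        return Snapshot(
            version=snapshot.version + 1,
            state={"remaining": remaining},
            done=remaining <= 0,
        )


async def run_to_completion(engine: CountdownEngine, item: WorkItem) -> Snapshot:
    # In a real deployment the runtime drives this loop, not the engine.
    snapshot = await engine.init_snapshot(item)
    while not snapshot.done:
        snapshot = await engine.step(snapshot, item)
    return snapshot


final = asyncio.run(run_to_completion(CountdownEngine(), WorkItem("w1", {"start": 3})))
```

The engine only knows how to go from one snapshot to the next. Everything about queues, leases, and persistence stays outside it.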

Roost owns the operational substrate:

Queue
  -> acquire lease
  -> load latest Snapshot
  -> Engine.step(snapshot, item)
  -> compare-and-swap save Snapshot
  -> re-enqueue or mark done
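The loop above can be sketched in memory. This is not Roost's implementation: `InMemoryStore`, `run_once`, and the return strings are assumptions for illustration, the lease is a plain dictionary rather than a Redis key, and the step is synchronous for brevity.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Snapshot:
    version: int
    state: dict
    done: bool = False


class InMemoryStore:
    """Hypothetical stand-in for the durable backend (not Roost's API)."""

    def __init__(self) -> None:
        self.snapshots: dict[str, Snapshot] = {}
        self.leases: dict[str, tuple[str, float]] = {}  # work_id -> (owner, expires_at)

    def acquire_lease(self, work_id: str, owner: str, ttl: float) -> bool:
        holder = self.leases.get(work_id)
        now = time.monotonic()
        if holder and holder[0] != owner and holder[1] > now:
            return False  # another worker holds a live lease
        self.leases[work_id] = (owner, now + ttl)
        return True

    def load(self, work_id: str) -> Optional[Snapshot]:
        return self.snapshots.get(work_id)

    def cas_save(self, work_id: str, expected_version: int, new: Snapshot) -> bool:
        current = self.snapshots.get(work_id)
        if current is None or current.version != expected_version:
            return False  # the stored snapshot moved on; reject the stale write
        self.snapshots[work_id] = new
        return True


def run_once(store: InMemoryStore, work_id: str, owner: str,
             step: Callable[[Snapshot], Snapshot]) -> str:
    if not store.acquire_lease(work_id, owner, ttl=30.0):
        return "lease-held"
    snapshot = store.load(work_id)
    if snapshot is None:
        return "missing"
    new = step(snapshot)
    if not store.cas_save(work_id, snapshot.version, new):
        return "conflict"
    return "done" if new.done else "re-enqueue"


def count_step(s: Snapshot) -> Snapshot:
    n = s.state["n"] + 1
    return Snapshot(version=s.version + 1, state={"n": n}, done=n >= 2)


store = InMemoryStore()
store.snapshots["w1"] = Snapshot(version=0, state={"n": 0})
results = [run_once(store, "w1", "worker-a", count_step) for _ in range(2)]
blocked = run_once(store, "w1", "worker-b", count_step)
stale_save = store.cas_save("w1", 0, Snapshot(version=1, state={"n": 99}))
```

The lease keeps a second worker out while the first one is active, and the compare-and-swap rejects a write that was computed from an outdated snapshot. Lease release and expiry handling are left out of the sketch.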

That gives you:

  • durable snapshots
  • per-work leases
  • at-least-once execution
  • retry-safe progress
  • delayed continuation
  • resource claims
  • event history
  • content-addressed artifacts
  • failed-work inspection
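As one example from that list, content-addressed artifacts mean the storage key is derived from the bytes themselves. A minimal sketch, assuming SHA-256 as the hash and an in-memory dictionary as the store (`ArtifactStore` is a hypothetical name, and Roost's actual artifact API and hash choice may differ):

```python
import hashlib
import json


class ArtifactStore:
    """Hypothetical in-memory store keyed by content hash."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # The key is the SHA-256 digest of the content (an assumption here),
        # so writing identical bytes twice yields the same key and one blob.
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]

    def __len__(self) -> int:
        return len(self._blobs)


artifacts = ArtifactStore()
report = json.dumps(
    {"url": "https://example.com", "reachable": True}, sort_keys=True
).encode()
first_key = artifacts.put(report)
second_key = artifacts.put(report)  # e.g. a retried step writing the same output
```

This pairs well with at-least-once execution: a retried step that produces the same output does not create a duplicate artifact.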

It is intentionally small. It is not trying to be a prompt framework, model router, workflow DSL, or hosted agent platform.

Roost does not help an agent think.

Roost helps an agent keep going.

Why I built it

A lot of agent tooling focuses on the thinking loop: prompts, tools, retrieval, planning, memory, model routing.

That is useful, but once agents run as workers for minutes, hours, or days, the bottleneck becomes more boring and more operational.

For example:

  • a worker dies halfway through a task
  • the same job is delivered twice
  • a long-running task needs to wait before its next step
  • two workers should not touch the same resource at the same time
  • an operator needs to know what happened
  • the output needs to be inspectable later

You can solve this with a workflow engine, a custom queue, a database table, or a pile of scripts.

Roost is my attempt at a small, agent-shaped version of that layer.

A simple demo: crash-safe URL watchlist

The demo engine is a URL watchlist worker.

It fetches a URL over multiple steps, saves each observation into a snapshot, waits between checks, and writes a final JSON artifact.

You can kill the worker halfway through, restart it, and Roost resumes from the latest saved snapshot.
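A rough sketch of what one watchlist step could look like. This is not the demo engine's actual code: `WatchSnapshot`, `watch_step`, and the injected `fetch_status` callable are assumptions for illustration, and the fetch is faked so the example runs without network access. The payload keys match the ones used in the enqueue command.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class WatchSnapshot:
    observations: tuple[int, ...] = ()  # HTTP status codes seen so far
    done: bool = False
    delay_seconds: float = 0.0  # how long to wait before the next step


def watch_step(snapshot: WatchSnapshot, payload: dict,
               fetch_status: Callable[[str], int]) -> WatchSnapshot:
    # Record one more observation on top of what the snapshot already holds.
    observations = snapshot.observations + (fetch_status(payload["url"]),)
    done = len(observations) >= payload["checks_required"]
    return WatchSnapshot(
        observations=observations,
        done=done,
        delay_seconds=0.0 if done else float(payload["delay_seconds"]),
    )


def fake_fetch(url: str) -> int:
    return 200  # stand-in for a real HTTP request


payload = {"url": "https://example.com", "checks_required": 3, "delay_seconds": 5}

# Two checks complete and are saved, then the worker is killed.
saved = watch_step(WatchSnapshot(), payload, fake_fetch)
saved = watch_step(saved, payload, fake_fetch)

# A restarted worker only needs the last saved snapshot to continue.
resumed = watch_step(saved, payload, fake_fetch)
```

Because all progress lives in the snapshot, the restarted worker performs the third check instead of starting over.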

uv sync --extra redis --extra dev
docker run --rm -p 6379:6379 redis:7

In one terminal:

uv run roost worker --engines watchlist

In another:

WORK_ID=$(uv run roost enqueue \
  --engine watchlist \
  --resource domain:example.com \
  --payload '{"url":"https://example.com","claim":"Example Domain is reachable","checks_required":3,"delay_seconds":5}')

uv run roost status "$WORK_ID"

Then kill the worker with Ctrl-C, start it again, and inspect the same work item.

uv run roost worker --engines watchlist
uv run roost status "$WORK_ID"

There is also a local end-to-end script:

scripts/e2e_watchlist.sh

No LLM key is required. The demo is about runtime behavior, not model behavior.

Local console

Roost includes a small local console:

uv run roost ui

It shows live work, saved state, events, failed work, and artifacts.

[Screenshot: Roost Console Work View]

The detail view lets you inspect payloads, snapshots, and outputs:

[Screenshot: Roost Console Detail]

Where this fits

Roost is not a replacement for LangChain, LlamaIndex, CrewAI, AutoGen, Temporal, Celery, or your own agent loop.

It sits at a different layer.

LangChain helps decide what an agent should do.
Temporal helps coordinate workflows.
Celery runs jobs.
Roost keeps long-running agent workers alive, inspectable, and resumable.

The current backend is Redis + SAQ. Execution is at-least-once, so engines need to make step() retry-safe from the same snapshot.

That tradeoff is intentional. I would rather expose the semantics clearly than pretend exactly-once execution exists.
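One way to keep a step retry-safe, sketched under assumptions: make the new snapshot a pure function of the old one, and tag any external side effect with a key derived from the work ID and snapshot version so a redelivered step can be deduplicated. The names `step`, `notify`, and `sent` are hypothetical, and the dictionary stands in for an external system that honors idempotency keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    version: int
    checks: tuple[str, ...] = ()


# Stand-in for an external system that deduplicates by idempotency key.
sent: dict[str, str] = {}


def notify(key: str, message: str) -> None:
    # A second call with the same key is a no-op.
    sent.setdefault(key, message)


def step(work_id: str, snapshot: Snapshot) -> Snapshot:
    # The key depends only on inputs that are identical on a retry.
    key = f"{work_id}:{snapshot.version}"
    next_version = snapshot.version + 1
    notify(key, f"check {next_version} started")
    return Snapshot(
        version=next_version,
        checks=snapshot.checks + (f"check-{next_version}",),
    )


s0 = Snapshot(version=0)
first = step("w1", s0)
retry = step("w1", s0)  # redelivery after a crash before the save landed
```

Running the step twice from the same snapshot produces the same result and a single side effect, which is the property at-least-once execution asks of an engine.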

What I’m looking for feedback on

I’m especially interested in feedback on the abstraction boundary.

Is this useful as a small runtime under agent loops?

Would you rather reach for Temporal, Celery, or a custom queue?

Does the init_snapshot() / step() model feel too small, or exactly small enough?

GitHub: https://github.com/mczaykowski/Roost
