DEV Community

Di Lu

Posted on • Originally published at runsmith.lu-d.com

A supervisor-tree library for building predictable and resilient programs

🙋‍♂️ Disclaimer: this is a battle-tested idea, not AI slop. I carefully designed the architecture and wrote most of the implementation myself (>80%). AI assistance is present, but mainly for creating unit tests and standalone utilities.

I just released Runsmith, an Erlang/OTP-style supervisor-tree framework for Python services and systems made up of multiple long-running programs.

Brief Intro

Think of an ETL service with a data poller, a transformer, and a result notifier, each with its own lifecycle, failure modes, and recovery needs. Wiring this by hand with retry loops, watchdog threads, and scattered state flags produces brittle glue code that is hard to reason about.

⚙️ Runsmith brings structure to this problem. Each unit becomes a worker with an explicit FSM lifecycle. A supervisor tree monitors every worker continuously — detecting stalls and timeouts, not just crashes — and confines restarts to the failed unit so the rest of the system keeps running.
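To make the "explicit FSM lifecycle" idea concrete, here is a minimal sketch in plain Python. It is not Runsmith's actual API: the `State` names, the `TRANSITIONS` table, and the `Lifecycle` class are all hypothetical, chosen only to show how a worker's lifecycle can be pinned down as a small set of legal state transitions.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical lifecycle states, for illustration only.
    CREATED = auto()
    STARTING = auto()
    RUNNING = auto()
    STOPPING = auto()
    STOPPED = auto()
    FAILED = auto()


# Allowed transitions; anything not listed here is rejected.
TRANSITIONS = {
    State.CREATED: {State.STARTING},
    State.STARTING: {State.RUNNING, State.FAILED},
    State.RUNNING: {State.STOPPING, State.FAILED},
    State.STOPPING: {State.STOPPED, State.FAILED},
    State.STOPPED: {State.STARTING},  # start again after a clean stop
    State.FAILED: {State.STARTING},   # restart after a failure
}


class IllegalTransition(Exception):
    """Raised when a worker attempts a transition the FSM does not allow."""


class Lifecycle:
    """Tracks one worker's state and refuses illegal transitions."""

    def __init__(self):
        self.state = State.CREATED
        self.history = [State.CREATED]

    def transition(self, target):
        if target not in TRANSITIONS[self.state]:
            raise IllegalTransition(f"{self.state.name} -> {target.name}")
        self.state = target
        self.history.append(target)
```

With a table like this, a worker can never jump from RUNNING straight to STOPPED, for example; it has to pass through STOPPING or FAILED. That makes each unit's current state something a supervisor can inspect and reason about, instead of inferring it from scattered flags.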

Background story

The real origin story: I was building the backend for a safety-protection camera system at work. The cameras are used in manufacturing plants, so downtime is unacceptable. The system has multiple processes all running together:

  • Web app: serving the HTTP API and SSE streams
  • Algorithm worker: running CV inference on incoming frames
  • Camera controller: interacting with the camera device library and polling frames
  • Background task runner: running scheduled jobs such as periodic data vacuuming
  • ONVIF service

Each one needed to run indefinitely and recover from failures without dragging the others down. The algorithm worker once stalled mid-inference because of third-party driver failures. The FastAPI web app's event loop was once starved because someone wrote blocking sync code...

My first version was a messy soup: lots of state flags, retry logic, watchdogs, and probes desperately trying to hold things together. It worked, but it was hard to maintain and reason about.

What I actually wanted was a framework where supervision is a first-class concept and fault isolation is structural rather than bolted on. More importantly, I really wanted a unified structure for modelling long-running stateful function units.

So I built it. Runsmith is essentially what I wished had existed when I started that project 🤗

Fancy version of supervisord?

Nope 🙂‍↔️. Runsmith and supervisord solve different problems. supervisord is an OS-level process control daemon that manages external programs by PID and static config. Runsmith is an in-process, programmable Python library where the supervised unit is a typed worker with an explicit lifecycle.

That gives a few advantages not present in supervisord:

  • Rich concurrency models: beyond process-only orchestration, workers can run in threads, coroutines, or even custom execution backends.
  • Fine-grained health probes: failure is not just an abnormal process exit, but a constraint violation that can be detected and recovered from.
  • Supervisor tree: an Erlang/OTP-style supervisor tree for nested fault domains.
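As a rough illustration of the last two points, the sketch below shows a heartbeat-based health probe and a nested supervisor that restarts only the unhealthy worker. It is plain Python under stated assumptions, not Runsmith's API: `Worker`, `Supervisor`, `heartbeat`, `healthy`, `check`, and the `stall_timeout` parameter are hypothetical names, and a real supervisor would run its checks continuously rather than on demand.

```python
import time


class Worker:
    """Hypothetical supervised unit that reports liveness via heartbeats."""

    def __init__(self, name, stall_timeout, clock=time.monotonic):
        self.name = name
        self.stall_timeout = stall_timeout  # seconds without a heartbeat
        self.clock = clock                  # injectable for testing
        self.crashed = False
        self.restarts = 0
        self.last_beat = clock()

    def heartbeat(self):
        # Called by the worker's own loop to signal it is making progress.
        self.last_beat = self.clock()

    def healthy(self):
        # Unhealthy on a crash OR when heartbeats stop arriving (a stall).
        stalled = self.clock() - self.last_beat > self.stall_timeout
        return not (self.crashed or stalled)

    def restart(self):
        self.crashed = False
        self.restarts += 1
        self.last_beat = self.clock()


class Supervisor:
    """Hypothetical supervisor; children are workers or nested supervisors."""

    def __init__(self, name, children):
        self.name = name
        self.children = children

    def check(self):
        """Restart only the unhealthy workers and return their names."""
        restarted = []
        for child in self.children:
            if isinstance(child, Supervisor):
                restarted += child.check()  # descend into nested fault domain
            elif not child.healthy():
                child.restart()
                restarted.append(child.name)
        return restarted
```

In this sketch, a tree such as `Supervisor("root", [web, Supervisor("pipeline", [poller, inference])])` would restart `inference` alone if it stopped sending heartbeats, leaving `web` and `poller` untouched. A worker that is alive but stuck counts as failed here, which is the kind of constraint a PID-based process monitor cannot see.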
