The bug that taught me how to run one process per user

#devops #opensource #tutorial #python

Update (July 2026): WorkerDeck is now on PyPI — pip install workerdeck.
Zero dependencies, a real test suite, CI on Python 3.9–3.12. The follow-up
post covers what changed, including how my own test suite walked straight
into the zombie trap described below: "WorkerDeck is now on PyPI".

I was building a platform where every user runs their own long-lived
background process. Doesn't matter what the process does — an agent, a
pipeline, a scheduled job. What matters is that each user gets their own,
it runs for days, and I need to start it, watch it, and stop it from one
place.

The lifecycle code for that turned out to be the part that bit me. Not the
hard, interesting part of my product — the boring plumbing around it. I want
to tell you about the specific bug that cost me an afternoon, because if
you're building anything similar, you will hit it, and it's not obvious
until you do.

The setup

My manager did the obvious thing. It kept a dictionary of the processes it
had started — user id to process handle — and answered "is this user's worker
running?" by looking in that dictionary. Start a worker, put it in the dict.
Stop it, take it out. Check status, look it up. Clean, simple, worked in
testing.
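
A rough sketch of that design, reconstructed for illustration. NaiveManager and its method names are hypothetical, not the original code:

```python
import subprocess
import sys


class NaiveManager:
    """In-memory-only tracking: user id -> process handle."""

    def __init__(self):
        # This dict is the manager's only record of what it started.
        self._workers = {}

    def start(self, user_id, argv):
        if self.is_running(user_id):
            return  # as far as this manager knows, already started
        self._workers[user_id] = subprocess.Popen(argv)

    def is_running(self, user_id):
        proc = self._workers.get(user_id)
        # poll() returns None while the child is still running.
        return proc is not None and proc.poll() is None

    def stop(self, user_id):
        proc = self._workers.pop(user_id, None)
        if proc is None:
            return False  # not in the dict, so nothing to stop
        proc.terminate()
        proc.wait()
        return True
```

Everything the manager knows lives in `self._workers`, so a second `NaiveManager()` instance knows nothing about workers started by the first.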

It worked in production too. Right up until I deployed a change.

The afternoon it went wrong

After the deploy, my dashboard told me every user's worker was stopped. That
was alarming, because the workers were clearly still doing things — I could
see their output, their side effects, their logs ticking over. But the
control panel showed them all as down.

So I did the natural thing: I clicked "start" on one.

Now I had two copies of that user's worker running, fighting each other for
the same port. One of them started throwing "address already in use." I
clicked "stop" on another — and nothing happened. The worker kept running.
The button did nothing.

I spent a while convinced something was deeply broken. It wasn't. The bug
was much dumber than that, and much more instructive.

The actual problem

When I deployed, the manager process restarted. And that in-memory dictionary
of process handles? It lives in the manager's memory. A restart wipes it.

The workers themselves were fine — they're separate processes, they don't
care that the thing that launched them blinked. But the new manager came up
with an empty dictionary. It had amnesia. It had no idea those workers
existed, so it reported them all as stopped. When I clicked "start," it
happily launched a second copy, because as far as it knew there was no first
copy. When I clicked "stop," it looked in its empty dictionary, found nothing,
and did nothing.

Every symptom I saw traced back to one root cause: the manager tracked its
workers only in memory, and memory doesn't survive a restart.

Once you see it, it's obvious. But it's invisible until the day you restart
the manager while workers are running — which, if you deploy, is every day.

Fixing it properly

The fix is to write down what's running somewhere that survives a restart, and
to re-attach to those processes when the manager comes back up. So the manager
now keeps a small registry on disk: for each user, the process id of their
worker.

On startup, it reads that registry, and for each entry checks: is a process
with this pid still alive? If yes, adopt it: treat it as a running worker I'm
responsible for again. If not, drop the stale entry.
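
Here is a minimal sketch of that first version, assuming a JSON file that maps user id to pid on a POSIX system. The function names and the file format are my assumptions for illustration, not WorkerDeck's actual API, and the liveness check is deliberately the simplest one possible:

```python
import json
import os
import tempfile


def save_registry(path, registry):
    """Write {user_id: pid} via a temp file so a crash can't leave half a file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(registry, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # rename is atomic on POSIX


def load_registry(path):
    """Return the saved registry, or an empty one on first start."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def pid_exists(pid):
    """Simplest possible check: signal 0 sends nothing, it only tests the pid."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the pid exists but belongs to another user
    return True


def readopt(path):
    """Keep only the registry entries whose pid is still present."""
    return {user: pid for user, pid in load_registry(path).items()
            if pid_exists(pid)}
```

A manager would call `save_registry` after every start and stop, and `readopt` once at boot.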

That's the idea. But there's a trap in it that's worth slowing down for.

The trap: pids get recycled

You cannot just check "is a process with this pid alive?" and call it done.
Operating systems reuse pid numbers. The pid that was your worker yesterday
might belong to something completely unrelated today — a database, a cron job,
anything. If you blindly re-adopt whatever now holds that pid, you'll "manage"
a process that isn't yours, and eventually send a stop signal to some innocent
bystander.

A live pid is not proof it's your process.

What makes it proof is the pid plus the process start time. If I recorded
that my worker was pid 4021 started at 14:03:07, and on restart I find a pid
4021 that also started at 14:03:07, that's my worker — the odds of a recycled
pid coincidentally sharing a start time are effectively nil. If the start time
differs, it's an impostor, and I leave it alone.

That check — pid plus start time — is the difference between "usually works"
and "safe to restart." It's a few lines of code, and it's the whole ballgame.
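
One way to get a start time on Linux is field 22 of /proc/<pid>/stat, which counts clock ticks since boot. The sketch below assumes Linux and uses hypothetical helper names; it illustrates the pid-plus-start-time check and is not necessarily how WorkerDeck reads it:

```python
import os


def read_start_time(pid):
    """Return the start time of `pid` in clock ticks since boot, or None if gone.

    Linux only: this is field 22 ("starttime") of /proc/<pid>/stat.
    """
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # Field 2 is the command name in parentheses and may itself contain
    # spaces or ')', so split after the LAST ')'. What remains starts at field 3.
    fields = stat.rsplit(")", 1)[1].split()
    return int(fields[19])  # field 22 sits at index 22 - 3


def is_same_process(pid, recorded_start):
    """True only if the pid exists AND started when our worker started."""
    current = read_start_time(pid)
    return current is not None and current == recorded_start
```

The registry then stores the pair (pid, start time) captured right after launch, and re-adoption calls `is_same_process` instead of a bare pid check. Comparing the raw tick count avoids any conversion to wall-clock time.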

The bonus trap: zombies

While testing the fix, I hit a second one. I killed a worker, then asked the
manager to stop it, and the stop call hung there insisting the worker was
still alive. But I'd just killed it. What?

Zombies. On Linux, when a process exits but its parent hasn't yet collected
its exit status, it lingers as a "defunct" process, a zombie. It's not
running and it does nothing; all that's left of it is a process-table entry
holding its pid and exit status. But a naive liveness check ("does this pid
exist?") says yes, it exists, because technically the zombie entry is still
there.

This shows up exactly in the re-attach case. The original parent (the old
manager) is gone, and the new manager never was the parent, so it can't reap
the worker itself. Until whichever process inherited the worker collects its
exit status, the zombie entry stays visible. My liveness check had to learn
to read the process state and treat a zombie as dead. Otherwise "stop" waits
forever for a worker that already left.

Another few lines. Another thing you only learn by hitting it.
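
A sketch of a zombie-aware check, again assuming Linux and /proc. The helper names are made up for this post. The process state is field 3 of /proc/<pid>/stat, and "Z" means zombie:

```python
import os


def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # Split after the last ')' so a command name with spaces can't shift fields.
    return stat.rsplit(")", 1)[1].split()[0]  # field 3: R, S, D, Z, ...


def is_alive(pid):
    """A pid counts as alive only if it exists and is not a zombie (or dead)."""
    state = process_state(pid)
    return state is not None and state not in ("Z", "X")
```

With this in place, a stop loop that polls `is_alive` ends as soon as the worker has exited, whether or not anyone has reaped it yet.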

What I ended up with

Once I'd solved this for my own platform, I realised the whole lifecycle layer
was generic — none of it was specific to what my workers actually did. So I
pulled it out into a small library, WorkerDeck, so I (and maybe you) never
have to rediscover these traps from scratch.

It's deliberately small: one class, one host, no orchestration layer. It gives
you:

Zero dependencies — the core is pure Python standard library.
One isolated process per user — own directory, own config injected as environment variables, own log.
Safe start / stop / restart — graceful SIGTERM escalating to SIGKILL, signalling the whole process group so children die too.
Survival of its own restart — the whole point of this post. Workers keep running across a manager restart, and the manager re-adopts them by pid and start time.
Optional self-healing — turn it on and crashed workers auto-restart, with exponential backoff and a circuit breaker so a worker that crashes on startup can't spin forever; after too many crashes it's parked in a "failed" state for a human instead of hot-looping.
An events hook — a callback fired on every meaningful change (started, stopped, crashed, restarted, failed, adopted) so you can wire up logging or alerting.
A small dashboard — so you can see and drive it all without building a UI first.
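
To make two of those items concrete, here is a sketch of the stop escalation and the restart backoff. It is an illustration under stated assumptions, not WorkerDeck's implementation: the function names, default timings, and crash limit are all invented, and the worker is assumed to have been launched with `start_new_session=True` so it leads its own process group.

```python
import os
import signal
import subprocess


def _signal_group(pgid, sig):
    """Signal every process in the group; ignore a group that is already gone."""
    try:
        os.killpg(pgid, sig)
    except ProcessLookupError:
        pass


def stop_gracefully(proc, grace=5.0):
    """SIGTERM the worker's process group, then SIGKILL it after `grace` seconds.

    Assumes proc was started with start_new_session=True, so its process
    group id equals its pid. Returns the worker's exit code.
    """
    if proc.poll() is not None:
        return proc.returncode  # already exited and reaped; the pid may be reused
    _signal_group(proc.pid, signal.SIGTERM)
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        _signal_group(proc.pid, signal.SIGKILL)  # SIGKILL cannot be ignored
        return proc.wait()


def restart_delay(crash_count, base=1.0, cap=60.0, max_crashes=5):
    """Seconds to wait before restarting after the Nth consecutive crash.

    Returns None once the limit is exceeded: the circuit is open and the
    worker should be parked as "failed" rather than restarted.
    """
    if crash_count > max_crashes:
        return None
    return min(cap, base * 2 ** (crash_count - 1))
```

Signalling the group rather than the single pid is what takes the worker's own children down with it.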

It's not a Kubernetes replacement and doesn't try to be. It solves the problem
you actually have when you're running per-user processes on one box — which is
where most projects start, and where these failure modes (orphans, zombies,
ignored stop signals, and the restart amnesia that started this whole story)
show up long before you need anything bigger.

If any of the symptoms above sounded familiar: