DEV Community: codelluis

Per-account task concurrency without a lock service

codelluis — Thu, 07 May 2026 22:13:04 +0000

Many background jobs call an external system on behalf of separate accounts,
tenants, or installations. The external system allows parallel calls across
different accounts, but it does not allow two calls for the same account to
run at the same time.

That is not a global rate limit. It is concurrency by key: the key might be
account_id, tenant_id, or another argument that identifies the shared
quota or state boundary.

You want the worker pool busy across many accounts, while each account stays
serial. Without that guard, two workers eventually pick up work for the same
account in parallel. The external system may throttle the account, reject the
second call, or leave you with a partial update to reconcile.

The usual fixes are external locks, one queue per account, or retry/backoff
logic around every call. They can work, but they add another coordination
layer to the job system.

Pynenc's orchestrator already tracks running invocations and their arguments.
With running_concurrency=KEYS and key_arguments=("account_id",), it can
enforce one in-flight invocation per account key while still running different
accounts in parallel. reroute_on_concurrency_control decides whether blocked
work waits or is dropped, and registration_concurrency=KEYS can collapse
duplicate work before a worker sees it.

Full sample: samples/concurrency_demo.

The demo

Four tiny files, each doing one thing:

concurrency_demo/
├── api_server.py     # tiny HTTP app: pretends to be the external provider
├── tasks.py          # PynencBuilder app + 4 tasks (the whole story)
├── enqueue.py        # CLI: enqueue one scenario, print results
└── sample.py         # one-command demo: boots api+worker, runs all scenarios

The "external provider" is a small HTTP app that holds an account in flight
for 0.4 seconds per call and records a collision whenever a second request
arrives while the first is still in flight. In a real integration, that
collision could be a 429, a rejected write, or an inconsistent refresh:

# api_server.py — the part that matters
@app.post("/call/{account_id}/{op}")
async def call(account_id: str, op: str, hold: float = HOLD_SECONDS) -> dict[str, str]:
    async with lock:
        acc = accounts[account_id]
        acc.calls += 1
        collided = acc.in_flight > 0
        acc.collisions += int(collided)
        acc.in_flight += 1
    print(f"  [{'COLLISION' if collided else 'ok       '}] {account_id:<8} {op}", flush=True)

    await asyncio.sleep(hold)

    async with lock:
        accounts[account_id].in_flight -= 1
    return {"outcome": "collision" if collided else "ok"}
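
The collision bookkeeping is easy to try without an HTTP server. Below is a minimal stdlib sketch, not the sample's code: the `Account` dataclass and the `demo` helper are assumptions inferred from the fields the handler touches, and the hold is shortened to keep the run quick.

```python
import asyncio
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Account:  # hypothetical shape, inferred from the handler's fields
    calls: int = 0
    collisions: int = 0
    in_flight: int = 0


async def demo(hold: float = 0.01):
    accounts = defaultdict(Account)
    lock = asyncio.Lock()

    async def call(account_id: str) -> str:
        async with lock:
            acc = accounts[account_id]
            acc.calls += 1
            collided = acc.in_flight > 0  # another call still holds this account
            acc.collisions += int(collided)
            acc.in_flight += 1
        await asyncio.sleep(hold)  # the provider keeps the account busy
        async with lock:
            accounts[account_id].in_flight -= 1
        return "collision" if collided else "ok"

    # Two overlapping calls for acme, one call for globex.
    outcomes = await asyncio.gather(call("acme"), call("acme"), call("globex"))
    return outcomes, accounts


outcomes, accounts = asyncio.run(demo())
print(outcomes, accounts["acme"].collisions)
```

The second acme call arrives while the first is still sleeping, so it is the only one flagged; the globex call is unaffected.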

The pynenc app and the four tasks fit on one screen. The whole pynenc
configuration (SQLite backend, in-process thread runner, logging) is declared
through the fluent builder in tasks.py, next to the tasks that use it:

# tasks.py
import os
import httpx
from pynenc import PynencBuilder
from pynenc.conf.config_task import ConcurrencyControlType as Mode

API_URL = "http://127.0.0.1:8765"

app = (
    PynencBuilder()
    .app_id("concurrency_demo")
    .sqlite("concurrency_demo.db")
    .thread_runner(min_threads=1, max_threads=8)
    .logging_stream("stdout")
    .logging_level(os.environ.get("DEMO_LOG_LEVEL", "info"))
    .max_pending_seconds(3.0)
    .build()
)


def _hit(account_id: str, op: str, hold: float | None = None) -> str:
    params = {"hold": hold} if hold is not None else None
    r = httpx.post(f"{API_URL}/call/{account_id}/{op}", params=params, timeout=10.0)
    r.raise_for_status()
    return r.json()["outcome"]


@app.task
def call_unsafe(account_id: str, op: str) -> str:
    return _hit(account_id, op)


@app.task(
    running_concurrency=Mode.KEYS,
    key_arguments=("account_id",),
    reroute_on_concurrency_control=True,
)
def call_keyed(account_id: str, op: str) -> str:
    return _hit(account_id, op)


@app.task(
    running_concurrency=Mode.KEYS,
    key_arguments=("account_id",),
    reroute_on_concurrency_control=False,
)
def call_keyed_drop(account_id: str, op: str) -> str:
    return _hit(account_id, op)


@app.task(
    running_concurrency=Mode.KEYS,
    registration_concurrency=Mode.KEYS,
    key_arguments=("account_id",),
    reroute_on_concurrency_control=True,
)
def refresh_once(account_id: str) -> str:
    return _hit(account_id, "refresh")

How to run it

You can launch the demo two ways. The four-terminal flow is useful when you
want to watch the API, the worker, and the pynenc monitor at the same time.
The one-command flow boots the API and worker for you and runs every scenario
in sequence; it is the path used by CI.

# four terminals — recommended for exploring
uv run uvicorn api_server:app --port 8765      # 1. API
uv run pynenc --app tasks.app runner start     # 2. worker
uv run pynenc monitor                          # 3. monitor (optional) at http://127.0.0.1:8000
uv run python enqueue.py all                   # 4. enqueue scenarios

# one command — recommended for CI
uv run python sample.py

What the API observes

All four scenarios, end to end, on a single pynmon timeline. Read it left to
right: scenario A starts with overlapping calls for the same accounts; B fans
out into three account lanes that stay serial per account; C drops blocked
work instead of rerouting it; D collapses duplicate refresh requests before a
worker ever sees them.

Two pynenc state names appear in the screenshots and logs. REROUTED means
the worker tried to start an invocation, found the account key already busy,
and put the invocation back on the queue. CONCURRENCY_CONTROLLED_FINAL
means the invocation was blocked by the key rule and intentionally finished
without running.
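
The difference between the two states can be modelled in a few lines. This is a toy, round-based simulation, not pynenc's scheduler: `simulate` and its round semantics are assumptions made for illustration, and only the two state names come from pynenc.

```python
REROUTED = "REROUTED"
FINAL = "CONCURRENCY_CONTROLLED_FINAL"


def simulate(invocations, reroute):
    """Offer every queued (account, op) to a worker once per round.

    An invocation starts only if no other invocation with the same account
    key has started in that round; everything started finishes before the
    next round begins.
    """
    queue = list(invocations)
    rounds, history = [], []
    while queue:
        busy, started, requeued = set(), [], []
        for account, op in queue:
            if account not in busy:
                busy.add(account)
                started.append((account, op))
            elif reroute:
                history.append((account, op, REROUTED))
                requeued.append((account, op))  # back on the queue
            else:
                history.append((account, op, FINAL))  # finished without running
        rounds.append(started)
        queue = requeued
    return rounds, history


ACCOUNTS = ["acme", "globex", "initech"]
OPS = ["fetch_profile", "list_invoices", "update_metadata", "refresh_usage"]
work = [(a, op) for a in ACCOUNTS for op in OPS]

for reroute in (True, False):
    rounds, history = simulate(work, reroute)
    print(reroute, sum(len(r) for r in rounds), "started in", len(rounds), "rounds")
```

In this model, rerouting runs all 12 invocations across four rounds of three, while dropping runs three and leaves nine in CONCURRENCY_CONTROLLED_FINAL.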

Four scenarios, four stories. Each one below pairs the per-scenario
summary, the API server's collision log, and the matching pynmon timeline.

Scenario A — no concurrency control

The baseline pain. Different provider operations, same account_id key.
The runner can hold up to eight invocations in flight, and it does — most
of the 12 invocations start essentially together. The first call per
account reaches the provider cleanly; everything that overlaps the same
account is recorded as COLLISION — the stand-in for a real 429,
throttle, or inconsistent response.

=== A. unsafe — no concurrency control ===
  12 enqueued -> 12 calls, 9 collisions, 1.42s
   X acme     calls=4  collisions=3
   X globex   calls=4  collisions=3
   X initech  calls=4  collisions=3

--- reset @ 11:49:40 A. unsafe — no concurrency control ---
  [ok       ] acme     fetch_profile
  [COLLISION] acme     list_invoices
  [ok       ] globex   fetch_profile
  [COLLISION] acme     update_metadata
  [COLLISION] acme     refresh_usage
  [COLLISION] globex   refresh_usage
  [COLLISION] globex   list_invoices
  [COLLISION] globex   update_metadata
  [ok       ] initech  fetch_profile
  [COLLISION] initech  list_invoices
  [COLLISION] initech  refresh_usage
  [COLLISION] initech  update_metadata

Scenario B — `running_concurrency=KEYS`, `reroute=True`

Same 12 calls as A, no collisions. The orchestrator indexes invocation
arguments and refuses to start a second call_keyed while one with the same
account_id is already running. When a worker tries to pick up a blocked
invocation, reroute_on_concurrency_control=True puts it back on the queue
so it can run when the slot frees up. The timeline shows three clean lanes,
one per account, with blocked invocations moving through REROUTED until
they get their turn.

=== B. keyed — running_concurrency=KEYS, reroute=True ===
  12 enqueued -> 12 calls, 0 collisions, 2.14s
  OK acme     calls=4  collisions=0
  OK globex   calls=4  collisions=0
  OK initech  calls=4  collisions=0

--- reset @ 11:49:41 B. keyed — running_concurrency=KEYS, reroute=True ---
  [ok       ] acme     fetch_profile
  [ok       ] globex   fetch_profile
  [ok       ] initech  fetch_profile
  [ok       ] initech  list_invoices
  [ok       ] acme     update_metadata
  [ok       ] globex   list_invoices
  [ok       ] initech  refresh_usage
  [ok       ] globex   update_metadata
  [ok       ] acme     refresh_usage
  [ok       ] globex   refresh_usage
  [ok       ] initech  update_metadata
  [ok       ] acme     list_invoices
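
The gap between the 1.42s of scenario A and the 2.14s of scenario B is what serialising per account costs. A rough lower bound for each, using only numbers from the demo (0.4s hold, eight threads, three accounts with four calls each), can be worked out as below. The constant names are illustrative, and the calculation ignores queueing and polling overhead, which presumably explains why the observed times are higher.

```python
import math

HOLD_SECONDS = 0.4       # provider hold per call
MAX_THREADS = 8          # thread_runner(max_threads=8)
ACCOUNTS = 3
CALLS_PER_ACCOUNT = 4

total_calls = ACCOUNTS * CALLS_PER_ACCOUNT

# Scenario A: only the thread pool limits overlap, so 12 calls need two waves.
unguarded_floor = math.ceil(total_calls / MAX_THREADS) * HOLD_SECONDS

# Scenario B: each account runs its four calls one after another,
# and the three accounts proceed side by side.
keyed_floor = CALLS_PER_ACCOUNT * HOLD_SECONDS

print(f"A floor: {unguarded_floor:.1f}s, B floor: {keyed_floor:.1f}s")
```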

Scenario C — `running_concurrency=KEYS`, `reroute=False`

Same guard, opposite policy. reroute_on_concurrency_control=False tells
the orchestrator not to re-queue blocked invocations — they land in
CONCURRENCY_CONTROLLED_FINAL and inv.result raises KeyError. Only
the first invocation per account_id ever reaches the provider; the other
nine are dropped. The timeline ends almost as soon as the first three
invocations finish.

=== C. drop — running_concurrency=KEYS, reroute=False ===
  12 enqueued -> 3 calls (9 dropped), 0 collisions, 0.67s
  OK acme     calls=1  collisions=0
  OK globex   calls=1  collisions=0
  OK initech  calls=1  collisions=0

--- reset @ 11:49:43 C. drop — running_concurrency=KEYS, reroute=False ---
  [ok       ] acme     fetch_profile
  [ok       ] globex   fetch_profile
  [ok       ] initech  fetch_profile

Scenario D — `registration_concurrency=KEYS` + `running_concurrency=KEYS`

A different question. registration_concurrency checks at enqueue time:
when refresh request number two for acme arrives, there is already one
registered, so the producer gets back a ReusedInvocation pointing to the
first. 24 logical "please refresh this account" events, eight per account,
collapse to 3 actual API calls before a worker picks them up. The
running_concurrency guard is the safety net for the case where a worker
picks up the first task before all duplicates have registered.

=== D. dedupe — registration + running KEYS ===
  24 enqueued -> 3 calls (21 deduped), 0 collisions, 0.57s
  OK acme     calls=1  collisions=0
  OK globex   calls=1  collisions=0
  OK initech  calls=1  collisions=0

--- reset @ 11:49:44 D. dedupe — registration + running KEYS ---
  [ok       ] acme     refresh
  [ok       ] globex   refresh
  [ok       ] initech  refresh
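
The enqueue-time check can be pictured as a lookup keyed by account. The sketch below is a toy model under stated assumptions, not pynenc's implementation: `Registry`, `register`, and `picked_up` are hypothetical names, and the returned `reused` flag stands in for the ReusedInvocation the producer gets back.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Registry:
    """Toy enqueue-time check: at most one registered invocation per key."""

    registered: dict = field(default_factory=dict)  # account_id -> invocation id
    ids: count = field(default_factory=count)

    def register(self, account_id):
        """Return (invocation_id, reused)."""
        if account_id in self.registered:
            return self.registered[account_id], True
        inv_id = next(self.ids)
        self.registered[account_id] = inv_id
        return inv_id, False

    def picked_up(self, account_id):
        """A worker took the invocation; it no longer counts as registered."""
        self.registered.pop(account_id, None)


registry = Registry()
requests = ["acme", "globex", "initech"] * 8  # 24 refresh events
results = [registry.register(a) for a in requests]
new = [inv for inv, reused in results if not reused]
print(len(requests), "events ->", len(new), "invocations")

# The window the running guard covers: acme's refresh has already been
# picked up when one more duplicate arrives, so it registers as new work.
registry.picked_up("acme")
late_id, late_reused = registry.register("acme")
```

The last two lines show why the running guard still matters in this model: once a worker has taken the first invocation, a late duplicate is no longer caught at registration.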

When to reach for which

The pattern applies whenever an argument marks the boundary for shared state
or quota. The external system may be a third-party API, an internal service,
or a resource that should only be touched by one task at a time for a given
key. The system may allow broad parallelism overall, while still requiring
serialization for each account, tenant, installation, or resource id.

These two settings cover most of the cases where people would otherwise
reach for an external rate limiter or a per-tenant lock service:

running_concurrency=KEYS on account_id (or tenant_id, or oauth_installation_id, or client_token), with reroute_on_concurrency_control=True: use it when the rule is “no two calls in flight for the same client account”, but you still want all calls to eventually complete. Blocked calls re-queue and retry until a slot opens. Good for distinct operations (op1, op2, op3…) that all need to run.
Same, with reroute_on_concurrency_control=False: use it when the rule is “if a call for this account is already running, drop the new one”. Queue depth stays flat; no retry buildup. Good for “trigger a refresh, but if one is already in flight, skip it”.
registration_concurrency=KEYS + running_concurrency=KEYS on the same key: use it when "do this once per client right now" is enough, regardless of how many places triggered it. Examples: "refresh client dashboard", "rebuild client index", "regenerate client report". A noisy internal event bus firing the same refresh 50 times per second is a bug; deduping it before it reaches a worker keeps queue depth honest. The running guard is the safety net: if a worker is fast enough to pick up the first task before all duplicates register, the running check prevents a parallel run. Together they keep each account to at most one call in flight, regardless of timing. This is Scenario D in the sample.

In production terms, the useful part is that this does not require a separate
lock service. The orchestrator already tracks invocations to route work;
checking whether a matching key is already running uses that same state.

Simpler scopes when you don't need keys

This post zooms in on KEYS, but it is one of four scopes. The same two
flags (running_concurrency and registration_concurrency) accept any
value of ConcurrencyControlType:

DISABLED — the default. No concurrency check.
TASK — at most one invocation of the task itself in the chosen state, regardless of arguments. “Only one nightly cleanup may run.”
ARGUMENTS — at most one invocation per full argument tuple. Two calls with identical arguments collapse; calls that differ in any argument run in parallel. “Don't run the same export twice.”
KEYS — at most one invocation per chosen subset of arguments (key_arguments=(...)). The mode this post is about: serialise on the account key, ignore the operation name.

The scope you pick controls what counts as a duplicate. The flag you put
it on (registration_concurrency vs running_concurrency) controls
when the check happens — at enqueue time or at run time.
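
One way to picture the four scopes is as four ways of computing the key that two invocations are compared on. The sketch below is an illustration only: the `Scope` enum mirrors the names of ConcurrencyControlType, but `duplicate_key` is a hypothetical helper and not how pynenc stores or compares invocations.

```python
from enum import Enum, auto


class Scope(Enum):  # stand-in mirroring the four scope names
    DISABLED = auto()
    TASK = auto()
    ARGUMENTS = auto()
    KEYS = auto()


def duplicate_key(scope, task_name, arguments, key_arguments=()):
    """Return the value two invocations must share to count as duplicates."""
    if scope is Scope.DISABLED:
        return None  # nothing is ever a duplicate
    if scope is Scope.TASK:
        return (task_name,)
    if scope is Scope.ARGUMENTS:
        return (task_name, tuple(sorted(arguments.items())))
    # KEYS: only the chosen subset of arguments matters.
    return (task_name, tuple((k, arguments[k]) for k in key_arguments))


a = {"account_id": "acme", "op": "fetch_profile"}
b = {"account_id": "acme", "op": "list_invoices"}

same_under_keys = (
    duplicate_key(Scope.KEYS, "call_keyed", a, ("account_id",))
    == duplicate_key(Scope.KEYS, "call_keyed", b, ("account_id",))
)
same_under_arguments = (
    duplicate_key(Scope.ARGUMENTS, "call_keyed", a)
    == duplicate_key(Scope.ARGUMENTS, "call_keyed", b)
)
print(same_under_keys, same_under_arguments)
```

Two calls for acme with different operations share a key under KEYS on account_id, but not under ARGUMENTS, which is why KEYS serialises them and ARGUMENTS would let them run side by side.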

Full reference, including how key_arguments interacts with each scope and
the other concurrency knobs, is in the pynenc docs:
Concurrency Control use case.

What's not in the box yet

Two things people will (correctly) ask for:

Multi-slot concurrency — "up to 5 in flight per key", not just 1.
Time-window rate limits — "100 calls per minute per key".

Both are on the roadmap. The current primitive - one in-flight invocation per
task/key - already covers a common integration problem: systems that allow
parallelism across accounts but not overlapping work for the same account.
The bigger controls build on the same orchestrator machinery.

How to try it

git clone https://github.com/pynenc/samples
cd samples/concurrency_demo
uv sync
uv run python sample.py

The full sample and README are at
github.com/pynenc/samples/tree/main/concurrency_demo.
The pynenc framework is on PyPI as
pynenc and the source is at
github.com/pynenc/pynenc. Issues, ideas,
and "this would be great if it also did X" comments are welcome.

Distribute your Python app without rewriting it

codelluis — Mon, 27 Apr 2026 15:00:00 +0000

You have a Python function that processes one item. You call it in a loop over a list. The list grows. The loop slows down. The work is real — an LLM API call, an embedding, a scrape, a database query, a model inference — the kind of thing that does not get faster with prettier code.

Distribution is the answer. Distribution usually means rewriting every call site to handle queues, futures, and result objects. So the loop stays slow and a progress bar gets added.

This post is about removing the migration cost. One decorator. One environment variable. Five reports go from 2.51 seconds to 0.54 seconds. Zero call sites change.

The whole demo is in the direct_task_demo sample of the pynenc samples repository. The example happens to generate sales reports because it needs a concrete I/O-bound function with a list-shaped input — but the pattern is the same for batch LLM calls, embedding generation, RAG indexing, web scraping, ETL enrichment, or any workload of the form "slow function, list of items, want it parallel".

The original code

tasks_original.py is plain Python. No decorators, no imports from any framework, no infrastructure assumptions. It does what the existing codebase already does:

# tasks_original.py
import time
from hashlib import md5

PERIODS = ["Q1-2025", "Q2-2025", "Q3-2025", "Q4-2025", "Q1-2026"]


def _build_report(period: str) -> dict:
    time.sleep(0.5)  # simulates DB queries + aggregation
    seed = int(md5(period.encode()).hexdigest()[:8], 16)
    revenue = 50_000 + (seed % 950_000)
    orders = 100 + (seed % 9_900)
    return {"period": period, "revenue": revenue, "orders": orders,
            "avg_order_value": round(revenue / orders, 2)}


def generate_report(period: str) -> dict:
    return _build_report(period)


def generate_reports(periods: list[str]) -> list[dict]:
    return [_build_report(p) for p in periods]

Running it produces five reports in 2.51 seconds. That is the baseline.

The migration

tasks.py is the same file with three additions:

+ from pynenc import Pynenc
+ app = Pynenc()

+ @app.direct_task
  def generate_report(period: str) -> dict:
      return _build_report(period)

+ @app.direct_task(parallel_func=_per_period, aggregate_func=_flatten)
  def generate_reports(periods: list[str]) -> list[dict]:
      return [_build_report(p) for p in periods]

Function bodies, signatures, and return types are identical. Two helpers, _per_period and _flatten, are added to support the parallel decorator. They work only with the arguments the caller actually passed and the results the workers return; nothing is synthesized:

def _per_period(args: dict) -> list[tuple[list[str]]]:
    return [([p],) for p in args["periods"]]


def _flatten(chunks: list[list[dict]]) -> list[dict]:
    return [report for chunk in chunks for report in chunk]

_per_period reads the periods argument the caller passed and yields one period per worker. _flatten collects the per-worker results back into a single list. The decorator does the routing.

Sync mode: the decorators are inert

Setting PYNENC__DEV_MODE_FORCE_SYNC_TASKS=True runs every decorated call inline in the caller's thread — no runner, no broker, no database writes. Behaviour is identical to tasks_original.py: 5 reports in 2.52s, same values, same order. This is the strangler-fig migration pattern: decorate one function at a time, keep the env var on so existing tests stay green, then remove it in production. No call site needs to change.

$ PYNENC__DEV_MODE_FORCE_SYNC_TASKS=True python sample_sync.py

Sync mode: 5 reports in 2.52s (expected ~2.5s — sequential, like the original)
  Q1-2025     revenue=$  477,381  orders=  381  AOV=$1252.97
  Q2-2025     revenue=$  798,638  orders= 7838  AOV=$101.89
  ...

Distributed mode: the same calls, with workers

Removing the env var and starting a ThreadRunner makes the decorators distribute work over a SQLite-backed broker. The call sites do not change:

$ python sample_distributed.py

Sequential calls on runner: 5 reports in 3.18s (each call blocks before the next starts)

Concurrent caller threads: 5 reports in 0.54s (N caller threads -> N workers running in parallel)
  Q1-2025     revenue=$  477,381  ...
  ...

Two patterns appear here. The sequential loop is the original code, unchanged — each generate_report(p) blocks before the next call starts. That is by design: @app.direct_task preserves the calling contract of a regular Python function. The caller waits, gets the value back, and exception handling works as it always did. That guarantee is what makes the migration zero-cost.

For caller-side concurrency, ThreadPoolExecutor is the standard Python pattern, and it composes naturally:

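from concurrent.futures import ThreadPoolExecutor

from tasks import PERIODS, generate_report  # the decorated function from tasks.py

with ThreadPoolExecutor(max_workers=len(PERIODS)) as pool:
    # Each pool thread makes one blocking call; map() preserves input order.
    reports = list(pool.map(generate_report, PERIODS))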

Each thread blocks on its own call; the runner processes them in parallel. Five reports in 0.54 seconds, about 4.6 times faster than the 2.51-second baseline on the same machine, with no broker change.
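
The caller-side effect is easy to reproduce with the stdlib alone. The sketch below uses a hypothetical `blocking_call` with a short sleep as a stand-in for a blocking task call; it shows the thread-pool pattern, not pynenc itself.

```python
import time
from concurrent.futures import ThreadPoolExecutor

ITEMS = ["a", "b", "c", "d", "e"]


def blocking_call(item: str) -> str:
    time.sleep(0.1)  # stand-in for a call that blocks until a worker finishes
    return item.upper()


def timed(fn):
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# One call after another: total time is roughly the sum of the waits.
sequential, t_seq = timed(lambda: [blocking_call(i) for i in ITEMS])

# One caller thread per item: the waits overlap.
with ThreadPoolExecutor(max_workers=len(ITEMS)) as pool:
    concurrent, t_con = timed(lambda: list(pool.map(blocking_call, ITEMS)))

print(f"sequential {t_seq:.2f}s, concurrent {t_con:.2f}s")
```

Both versions return the same list in the same order; only the wall-clock time differs.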

Single-call fan-out

Sometimes the parallelism belongs inside the function rather than at the call site. The caller passes a list, expects a list back, and does not need to change a single line of code. That is what parallel_func is for: a small helper that describes how to split the arguments into individual work items. Pynenc dispatches one task per item — across whatever workers are running — then reassembles the results via aggregate_func before returning to the caller:

# tasks.py
@app.direct_task(parallel_func=_per_period, aggregate_func=_flatten)
def generate_reports(periods: list[str]) -> list[dict]:
    return [_build_report(p) for p in periods]

The caller calls it exactly as in tasks_original.py:

reports = generate_reports(periods=PERIODS)

Behind the decorator, _per_period reads args["periods"] and yields one argument tuple per period. Pynenc triggers one task per tuple and routes each to an available worker. _flatten collects the per-worker results back into a single list. The caller receives the same shape it always did:

$ python sample_parallel.py

Parallel fan-out: 5 reports in 0.65s (one call, 5 workers running in parallel)

The function signature is honest. Nothing is "ignored". The argument the caller passes is the argument parallel_func reads.

For higher throughput, pynenc's native parallel API goes further: instead of aggregating before returning, the function exposes a result group that the caller can iterate as results arrive. Each item is available as soon as the worker that produced it finishes — no waiting for the slowest one. The parallel_func pattern shown here is the zero-migration-cost option: same signature, same return type, same call site, parallelism handled entirely by the decorator.
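
The difference between "aggregate, then return" and "iterate as results arrive" can be seen with a stdlib analogy. This is not pynenc's parallel API: it uses `concurrent.futures.as_completed` to show the same idea, and `work`, `aggregated`, and `streamed` are illustrative names.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

DELAYS = {"slow": 0.3, "fast": 0.02, "medium": 0.15}


def work(name: str) -> str:
    time.sleep(DELAYS[name])
    return name


def aggregated(names):
    """Return everything at once, in input order, after the slowest finishes."""
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        return list(pool.map(work, names))


def streamed(names):
    """Yield each result as soon as the thread that produced it finishes."""
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        futures = [pool.submit(work, n) for n in names]
        for future in as_completed(futures):
            yield future.result()


names = list(DELAYS)
print(aggregated(names))
print(list(streamed(names)))
```

The aggregated call keeps the input order but only returns after the slowest item; the streamed version hands back the fastest item first, so the caller can start consuming earlier.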

Why not just use `asyncio` / `multiprocessing` / Celery?

These are the obvious alternatives and each one solves a different slice of the problem.

asyncio.gather parallelises async I/O on a single event loop. It works only if the function is already async, only on one machine, and only for I/O-bound work. Synchronous functions need to be rewritten.
multiprocessing.Pool.map parallelises across CPU cores on a single host. It cannot scale beyond one machine, struggles with large arguments (everything is pickled and copied), and the call site changes from f(x) to pool.map(f, xs).
concurrent.futures.ThreadPoolExecutor is a clean primitive but stops at the process boundary. With @app.direct_task it composes — use it on the caller side and pynenc handles the worker side, optionally on different machines.
Celery / RQ / Dramatiq scale across machines but break the calling contract: f(x) becomes f.delay(x).get() or similar. Every call site has to change. Inline modes for tests exist (Celery's task_always_eager setting, RQ's is_async=False), but they sit behind the changed call sites rather than restoring the plain f(x) call.

@app.direct_task is the option that gives you all three properties at once: distributed across machines, the call site does not change, and a single environment variable runs everything inline for tests and local development.

When `direct_task` is the right tool

@app.direct_task always blocks the caller. That is the point: it preserves the calling contract that the original code already relied on. Migration is a copy-the-decorator operation, not a rewrite.

For fire-and-forget semantics — enqueue work and continue without blocking — @app.task is the right decorator. It returns an Invocation and exposes .result for explicit waiting. The two decorators are complementary; the right choice is whichever one preserves the call pattern the codebase already has.

Try it

# uv: https://docs.astral.sh/uv/getting-started/installation/
git clone https://github.com/pynenc/samples.git
cd samples/direct_task_demo
uv sync

uv run python tasks_original.py                                       # baseline
PYNENC__DEV_MODE_FORCE_SYNC_TASKS=True uv run python sample_sync.py   # decorators inert
uv run python sample_distributed.py                                   # workers, two patterns
uv run python sample_parallel.py                                      # single-call fan-out

I Killed a Python Worker Mid-Task. Here's What Should Have Happened.

codelluis — Sun, 19 Apr 2026 13:59:56 +0000

I ran kill -9 on a worker that was processing three tasks. They vanished. No error. No retry. I checked the queue: empty. I checked the results: nothing. The work was just gone.

This is not a bug. This is the default behavior of many Python task frameworks. A worker dies mid-execution, and whatever it was doing disappears.

So I built a framework where the system heals itself. Here is what that looks like.

The problem nobody talks about

Here is what usually happens when a worker crashes in the middle of a task:

A task starts running on Worker-1.
Worker-1 gets OOM-killed (or crashes, or the host dies).
The task message was already acknowledged and removed from the queue.
The task is gone: no record, no detection, no recovery.

Typical workarounds teams build by hand:

Late acknowledgement, which reduces task loss but increases duplicate execution risk.
External monitoring, which detects failures but still requires manual re-queueing.
Strict idempotency layers everywhere, which are useful but still need a recovery trigger.

These are not complete solutions. They are patches around a missing core capability.
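
The trade-off between early and late acknowledgement can be shown with a toy single-worker queue. This is a simplified model with hypothetical names (`run_with_crash`, `ack_early`, `crash_on`), not the behaviour of any specific framework.

```python
def run_with_crash(tasks, ack_early, crash_on):
    """Process tasks in order until the worker dies on `crash_on`.

    Returns (completed, still_in_queue).
    """
    queue = list(tasks)
    completed = []
    while queue:
        task = queue[0]
        if ack_early:
            queue.pop(0)  # acknowledged before the work happens
        if task == crash_on:
            break  # worker dies mid-task
        completed.append(task)
        if not ack_early:
            queue.pop(0)  # acknowledged only after the work succeeded
    return completed, queue


early = run_with_crash(["a", "b", "c"], ack_early=True, crash_on="b")
late = run_with_crash(["a", "b", "c"], ack_early=False, crash_on="b")
print("early ack:", early)
print("late ack: ", late)
```

With early acknowledgement, task b is neither completed nor left in the queue. With late acknowledgement, b stays queued and will run again from the start, which is only safe if a partial first run did no harm.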

So I killed a worker. Here is what happened

I ran the same crash scenario with pynenc: three tasks running, then SIGKILL, then a second worker.

STEP 1: Starting Worker-1...
  Worker-1 started (PID 12345)

STEP 2: Submitting 3 long-running tasks...
  -> Submitted slow_task(0)
  -> Submitted slow_task(1)
  -> Submitted slow_task(2)

  Waiting for Worker-1 to pick up and start running tasks...

STEP 3: Simulating a worker crash!
  X Killing Worker-1 (PID 12345) with SIGKILL...
  X Worker-1 terminated (exit code -9)

  The in-progress task is now orphaned — no worker owns it.

STEP 4: Starting Worker-2 (the recovery worker)...
  Worker-2 started (PID 12346)

STEP 5: Waiting for recovery and task completion...
  OK slow_task completed: task_0_completed
  OK slow_task completed: task_1_completed
  OK slow_task completed: task_2_completed

  ALL 3 TASKS COMPLETED SUCCESSFULLY
  Tasks from the crashed worker were recovered automatically!

Worker-1 died mid-execution. Worker-2 detected the stale heartbeat, recovered orphaned tasks, and finished all three with zero manual intervention.

Monitoring view

This is the same monitoring view used during the run. From here you can inspect the timeline across runners, open each invocation detail, and follow the logs around state changes to understand what happened step by step.

How recovery works

Every runner sends periodic heartbeats. As long as heartbeats arrive, the runner is healthy.

When heartbeats stop:

The recovery service marks the runner as stale.
Orphaned running invocations are claimed safely.
Tasks are re-routed to the broker.
Healthy runners pick them up.

This is built in. No external watcher process required.

Recovery re-executes the full task, so designing tasks to be idempotent remains a best practice.
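
The four steps can be sketched as a small in-memory model. Everything here is an assumption made for illustration: `Cluster`, `beat`, and `recover` are hypothetical names, time is passed in explicitly, and the atomic claiming that a real multi-runner setup needs is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    dead_after: float                                  # seconds without a heartbeat
    heartbeats: dict = field(default_factory=dict)     # runner -> last heartbeat time
    running: dict = field(default_factory=dict)        # invocation -> owning runner
    queue: list = field(default_factory=list)          # invocations waiting for a runner

    def beat(self, runner, now):
        self.heartbeats[runner] = now

    def recover(self, now):
        """Re-queue invocations owned by runners whose heartbeat is too old."""
        stale = {r for r, t in self.heartbeats.items() if now - t > self.dead_after}
        orphaned = [inv for inv, r in self.running.items() if r in stale]
        for inv in orphaned:
            del self.running[inv]
            self.queue.append(inv)
        return orphaned


cluster = Cluster(dead_after=6.0)
cluster.beat("worker-1", now=0.0)
for n in range(3):
    cluster.running[f"slow_task({n})"] = "worker-1"

# worker-1 is killed and never beats again; worker-2 starts later.
cluster.beat("worker-2", now=10.0)
cluster.running["other_task"] = "worker-2"
recovered = cluster.recover(now=10.0)
print(recovered, cluster.queue)
```

Only the three invocations owned by the silent runner go back to the queue; work owned by the healthy runner is left alone.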

The code

The task:

# tasks.py (simplified — full version in the repo)
import time
from pynenc import Pynenc

app = Pynenc()

@app.task
def slow_task(task_num: int) -> str:
    slow_task.logger.info(f"[slow_task({task_num})] Starting — will run for 8 seconds")
    for second in range(8):
        time.sleep(1)
        slow_task.logger.info(f"[slow_task({task_num})] progress {second + 1}/8")
    return f"task_{task_num}_completed"

The demo configuration:

# pyproject.toml (key settings — full config in the repo)
[tool.pynenc]
app_id = "recovery_demo"
orchestrator_cls = "SQLiteOrchestrator"
broker_cls = "SQLiteBroker"
state_backend_cls = "SQLiteStateBackend"
runner_cls = "ThreadRunner"

# Fast recovery timeouts for demo purposes.
# Production systems use much higher values (defaults: 10 min heartbeat, 15 min recovery cron).
runner_considered_dead_after_minutes = 0.1          # 6 seconds — heartbeat expiry
recover_running_invocations_cron = "* * * * *"      # every minute (fastest cron resolution)
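
For a rough sense of what these two settings imply together, here is a back-of-the-envelope calculation. It assumes (this is not confirmed by the post) that recovery runs only on cron ticks and that a runner counts as dead once its last heartbeat is older than the threshold; the constant names are illustrative.

```python
DEAD_AFTER_MINUTES = 0.1    # runner_considered_dead_after_minutes
CRON_PERIOD_SECONDS = 60    # "* * * * *" fires once per minute

dead_after_seconds = DEAD_AFTER_MINUTES * 60

# Best case: a cron tick lands right as the heartbeat expires.
best_case = dead_after_seconds

# Worst case: the heartbeat expires just after a tick, so detection
# waits almost a full cron period more.
worst_case = dead_after_seconds + CRON_PERIOD_SECONDS

print(f"detected between ~{best_case:.0f}s and ~{worst_case:.0f}s after the last heartbeat")
```

Under those assumptions, the cron period rather than the heartbeat threshold dominates how long an orphaned task waits in the demo.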

The full demo is in the public recovery_demo folder of the samples repository.

The entrypoint script is recovery_demo/sample.py.

Try it yourself

# Requires uv — install: https://docs.astral.sh/uv/getting-started/installation/
git clone https://github.com/pynenc/samples.git
cd samples/recovery_demo
uv sync
uv run python sample.py

No Docker. No Redis. No external services. One demo.

What teams usually build by hand

The problem                 Typical approach                  What pynenc does
Worker dies mid-task        Lost task or duplicate retries    Automatic recovery via heartbeat detection
Detecting dead workers      External monitoring stack         Built-in runner heartbeat checks
Re-queuing orphaned tasks   Manual scripts and intervention   Automatic re-routing to broker
Recovery in clusters        Custom distributed locking        Atomic global recovery service
Understanding incidents     Log spelunking                    Invocation state history and timeline views

What is next

Pynenc is open source and actively maintained:

pynenc - core framework
samples - runnable demos
docs - full documentation

How does your team handle crashed workers today? Join the conversation in GitHub Discussions.