<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: codelluis</title>
    <description>The latest articles on DEV Community by codelluis (@codelluis).</description>
    <link>https://dev.to/codelluis</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3887405%2F41485020-bc4a-4b8c-835a-232e5ff013b6.jpeg</url>
      <title>DEV Community: codelluis</title>
      <link>https://dev.to/codelluis</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/codelluis"/>
    <language>en</language>
    <item>
      <title>I Killed a Python Worker Mid-Task. Here's What Should Have Happened.</title>
      <dc:creator>codelluis</dc:creator>
      <pubDate>Sun, 19 Apr 2026 13:59:56 +0000</pubDate>
      <link>https://dev.to/codelluis/i-killed-a-python-worker-mid-task-heres-what-should-have-happened-1kpl</link>
      <guid>https://dev.to/codelluis/i-killed-a-python-worker-mid-task-heres-what-should-have-happened-1kpl</guid>
      <description>&lt;p&gt;I ran &lt;code&gt;kill -9&lt;/code&gt; on a worker that was processing three tasks. They vanished. No error. No retry. I checked the queue: empty. I checked the results: nothing. The work was just gone.&lt;/p&gt;

&lt;p&gt;This is not a bug. This is the default behavior of many Python task frameworks. A worker dies mid-execution, and because its message was already acknowledged, whatever it was doing disappears.&lt;/p&gt;

&lt;p&gt;So I built a framework where the system heals itself. Here is what that looks like.&lt;/p&gt;

&lt;h2&gt;The problem nobody talks about&lt;/h2&gt;

&lt;p&gt;Here is what usually happens when a worker crashes in the middle of a task:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A task starts running on Worker-1.&lt;/li&gt;
&lt;li&gt;Worker-1 gets OOM-killed (or crashes, or the host dies).&lt;/li&gt;
&lt;li&gt;The task message was already acknowledged and removed from the queue.&lt;/li&gt;
&lt;li&gt;The task is gone: no record, no detection, no recovery.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Typical workarounds teams build by hand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Late acknowledgement, which reduces task loss but increases duplicate execution risk.&lt;/li&gt;
&lt;li&gt;External monitoring, which detects failures but still requires manual re-queueing.&lt;/li&gt;
&lt;li&gt;Strict idempotency layers everywhere, which are useful but still need a recovery trigger.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not complete solutions. They are patches around a missing core capability.&lt;/p&gt;
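
&lt;p&gt;To make the acknowledgement trade-off concrete, here is a minimal sketch using a toy in-memory queue. &lt;code&gt;ToyQueue&lt;/code&gt; and &lt;code&gt;crash_mid_task&lt;/code&gt; are hypothetical names invented for this illustration, not the API of any real broker or of pynenc.&lt;/p&gt;

```python
from collections import deque


class ToyQueue:
    """Hypothetical in-memory queue; not any real broker's API."""

    def __init__(self, messages):
        self.pending = deque(messages)
        self.unacked = {}

    def receive(self, ack_early):
        """Hand out the next message, acknowledging it now or later."""
        message = self.pending.popleft()
        if not ack_early:
            # Late ack: the queue remembers the message until ack() is called.
            self.unacked[message] = True
        return message

    def ack(self, message):
        self.unacked.pop(message, None)

    def redeliver_unacked(self):
        """Put unacknowledged messages back, as a broker might after a consumer dies."""
        self.pending.extend(self.unacked)
        self.unacked.clear()


def crash_mid_task(ack_early):
    """Simulate a worker that dies before finishing, then list what survives."""
    queue = ToyQueue(["task-0"])
    queue.receive(ack_early)  # the worker takes the task...
    # ...and is killed here, so ack() is never reached.
    queue.redeliver_unacked()
    return list(queue.pending)


print(crash_mid_task(ack_early=True))   # []
print(crash_mid_task(ack_early=False))  # ['task-0']
```

&lt;p&gt;With early acknowledgement the task is lost. With late acknowledgement it survives, but the same mechanism can redeliver a task that had in fact finished just before the crash, which is where the duplicate execution risk comes from.&lt;/p&gt;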

&lt;h2&gt;So I killed a worker. Here is what happened&lt;/h2&gt;

&lt;p&gt;I ran the same crash scenario with &lt;a href="https://github.com/pynenc/pynenc" rel="noopener noreferrer"&gt;pynenc&lt;/a&gt;: three tasks running, then &lt;code&gt;SIGKILL&lt;/code&gt;, then a second worker.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;STEP 1: Starting Worker-1...
  Worker-1 started (PID 12345)

STEP 2: Submitting 3 long-running tasks...
  -&amp;gt; Submitted slow_task(0)
  -&amp;gt; Submitted slow_task(1)
  -&amp;gt; Submitted slow_task(2)

  Waiting for Worker-1 to pick up and start running tasks...

STEP 3: Simulating a worker crash!
  X Killing Worker-1 (PID 12345) with SIGKILL...
  X Worker-1 terminated (exit code -9)

  The in-progress task is now orphaned — no worker owns it.

STEP 4: Starting Worker-2 (the recovery worker)...
  Worker-2 started (PID 12346)

STEP 5: Waiting for recovery and task completion...
  OK slow_task completed: task_0_completed
  OK slow_task completed: task_1_completed
  OK slow_task completed: task_2_completed

  ALL 3 TASKS COMPLETED SUCCESSFULLY
  Tasks from the crashed worker were recovered automatically!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Worker-1 died mid-execution. Worker-2 detected the stale heartbeat, recovered the orphaned tasks, and finished all three with zero manual intervention.&lt;/p&gt;
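
&lt;p&gt;The &lt;code&gt;exit code -9&lt;/code&gt; in the output is how Python reports a child process that was terminated by signal 9 on POSIX systems. The sketch below reproduces just that part with the standard library. It is not the demo's actual script, and the sleeping child process is only a stand-in for a worker.&lt;/p&gt;

```python
import signal
import subprocess
import sys

# Stand-in for a worker: a child Python process that would sleep for a minute.
worker = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# SIGKILL cannot be caught, so the child gets no chance to clean up or report back.
worker.send_signal(signal.SIGKILL)
exit_code = worker.wait(timeout=10)

# On POSIX, a return code of -N means the child was terminated by signal N.
print(exit_code == -signal.SIGKILL)
```

&lt;p&gt;Because the process cannot run any handler, nothing inside the worker can mark its tasks as failed. Detection has to come from outside the dead process.&lt;/p&gt;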

&lt;h2&gt;Monitoring view&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faxur6pdslguc8lhkmwqm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faxur6pdslguc8lhkmwqm.png" alt="Pynmon monitoring view during recovery demo" width="800" height="630"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the same monitoring view used during the run. From here you can inspect the timeline across runners, open each invocation detail, and follow the logs around state changes to understand what happened step by step.&lt;/p&gt;

&lt;h2&gt;How recovery works&lt;/h2&gt;

&lt;p&gt;Every runner sends periodic heartbeats. As long as heartbeats arrive, the runner is healthy.&lt;/p&gt;

&lt;p&gt;When heartbeats stop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The recovery service marks the runner as stale.&lt;/li&gt;
&lt;li&gt;Orphaned running invocations are claimed atomically, so each one is recovered only once.&lt;/li&gt;
&lt;li&gt;Tasks are re-routed to the broker.&lt;/li&gt;
&lt;li&gt;Healthy runners pick them up.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is built in. No external watcher process required.&lt;/p&gt;
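
&lt;p&gt;The four steps above can be sketched with a small in-memory model. This is an illustration of the idea only: &lt;code&gt;ToyRecovery&lt;/code&gt; and its methods are hypothetical and do not reflect pynenc's internal classes, and the 6-second timeout is borrowed from the demo configuration.&lt;/p&gt;

```python
from collections import deque

HEARTBEAT_TIMEOUT_SECONDS = 6.0  # assumption: mirrors the demo's 0.1-minute expiry


class ToyRecovery:
    """Hypothetical in-memory model; pynenc's real classes are not shown here."""

    def __init__(self):
        self.last_heartbeat = {}  # runner_id -> timestamp of the last heartbeat
        self.running = {}         # invocation_id -> runner_id that owns it
        self.broker = deque()     # invocations waiting for a healthy runner

    def heartbeat(self, runner_id, now):
        self.last_heartbeat[runner_id] = now

    def start(self, invocation_id, runner_id):
        self.running[invocation_id] = runner_id

    def recover(self, now):
        """Re-queue invocations owned by runners whose heartbeat has expired."""
        stale = {
            runner_id
            for runner_id, seen in self.last_heartbeat.items()
            if now - seen > HEARTBEAT_TIMEOUT_SECONDS
        }
        orphaned = [inv for inv, owner in self.running.items() if owner in stale]
        for invocation_id in orphaned:
            # Removing the ownership record is the claim: a later recover()
            # call no longer sees this invocation as running.
            del self.running[invocation_id]
            self.broker.append(invocation_id)
        return orphaned


service = ToyRecovery()
service.heartbeat("worker-1", now=0.0)
for i in range(3):
    service.start(f"slow_task({i})", "worker-1")
service.heartbeat("worker-2", now=10.0)

print(service.recover(now=10.0))  # ['slow_task(0)', 'slow_task(1)', 'slow_task(2)']
print(service.recover(now=10.0))  # []
```

&lt;p&gt;The second call returns nothing because the first one already claimed every orphaned invocation. In a real cluster that claim would need to be atomic across processes, which a single-process dictionary does not demonstrate.&lt;/p&gt;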

&lt;p&gt;Recovery re-executes the full task from the start, so work done before the crash may run a second time. Designing tasks to be idempotent remains a best practice.&lt;/p&gt;
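
&lt;p&gt;As a rough illustration of what idempotent means here, the sketch below guards a side effect with a completion record. The &lt;code&gt;completed&lt;/code&gt; dictionary, the &lt;code&gt;side_effects&lt;/code&gt; list, and &lt;code&gt;send_receipt&lt;/code&gt; are all hypothetical stand-ins for a durable store, an external system, and a task of your own.&lt;/p&gt;

```python
completed = {}     # invocation key -> stored result (hypothetical result store)
side_effects = []  # stands in for an external system, such as emails sent


def send_receipt(order_id):
    """Idempotent task: a repeat run returns the stored result and does nothing else."""
    key = f"send_receipt:{order_id}"
    if key in completed:
        return completed[key]
    side_effects.append(order_id)
    completed[key] = f"receipt sent for {order_id}"
    return completed[key]


first = send_receipt(42)
second = send_receipt(42)  # what a recovery re-execution would look like
print(first == second, side_effects)  # True [42]
```

&lt;p&gt;A crash between the side effect and the completion record would still repeat the side effect, so this pattern narrows the duplicate window rather than closing it.&lt;/p&gt;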

&lt;h2&gt;The code&lt;/h2&gt;

&lt;p&gt;The task:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tasks.py (simplified — full version in the repo)
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pynenc&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pynenc&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pynenc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@app.task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;slow_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_num&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;slow_task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[slow_task(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task_num&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)] Starting — will run for 8 seconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;second&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;slow_task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[slow_task(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task_num&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)] progress &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;second&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task_num&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The demo configuration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# pyproject.toml (key settings — full config in the repo)&lt;/span&gt;
&lt;span class="nn"&gt;[tool.pynenc]&lt;/span&gt;
&lt;span class="py"&gt;app_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"recovery_demo"&lt;/span&gt;
&lt;span class="py"&gt;orchestrator_cls&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"SQLiteOrchestrator"&lt;/span&gt;
&lt;span class="py"&gt;broker_cls&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"SQLiteBroker"&lt;/span&gt;
&lt;span class="py"&gt;state_backend_cls&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"SQLiteStateBackend"&lt;/span&gt;
&lt;span class="py"&gt;runner_cls&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ThreadRunner"&lt;/span&gt;

&lt;span class="c"&gt;# Fast recovery timeouts for demo purposes.&lt;/span&gt;
&lt;span class="c"&gt;# Production systems use much higher values (defaults: 10 min heartbeat, 15 min recovery cron).&lt;/span&gt;
&lt;span class="py"&gt;runner_considered_dead_after_minutes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;          &lt;span class="c"&gt;# 6 seconds — heartbeat expiry&lt;/span&gt;
&lt;span class="py"&gt;recover_running_invocations_cron&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"* * * * *"&lt;/span&gt;      &lt;span class="c"&gt;# every minute (fastest cron resolution)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
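
&lt;p&gt;These two settings together bound how long an orphaned task waits before it is re-queued. The arithmetic below is a back-of-the-envelope sketch. It assumes that recovery only runs on cron ticks and that a runner counts as dead as soon as the expiry has elapsed since its last heartbeat; the actual scheduling inside pynenc may differ.&lt;/p&gt;

```python
# Values taken from the demo configuration above.
heartbeat_expiry_s = round(0.1 * 60, 3)  # runner_considered_dead_after_minutes = 0.1
cron_period_s = 60.0                     # "* * * * *" fires once per minute

# Measured from the dead runner's last heartbeat (assumption: recovery runs only on cron ticks).
best_case_s = heartbeat_expiry_s                   # a tick lands right as the heartbeat expires
worst_case_s = heartbeat_expiry_s + cron_period_s  # a tick was missed just before expiry

print(best_case_s, worst_case_s)  # 6.0 66.0
```

&lt;p&gt;So even with the fast demo timeouts, recovery can take about a minute to start, and the re-executed task then runs for its full 8 seconds.&lt;/p&gt;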



&lt;p&gt;The full demo is in the public &lt;a href="https://github.com/pynenc/samples/tree/main/recovery_demo" rel="noopener noreferrer"&gt;recovery_demo&lt;/a&gt; folder of the samples repository.&lt;/p&gt;

&lt;p&gt;The entrypoint script is &lt;a href="https://github.com/pynenc/samples/blob/main/recovery_demo/sample.py" rel="noopener noreferrer"&gt;recovery_demo/sample.py&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Try it yourself&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Requires uv — install: https://docs.astral.sh/uv/getting-started/installation/&lt;/span&gt;
git clone https://github.com/pynenc/samples.git
&lt;span class="nb"&gt;cd &lt;/span&gt;samples/recovery_demo
uv &lt;span class="nb"&gt;sync
&lt;/span&gt;uv run python sample.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No Docker. No Redis. No external services. One demo.&lt;/p&gt;

&lt;h2&gt;What teams usually build by hand&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;The problem&lt;/th&gt;
&lt;th&gt;Typical approach&lt;/th&gt;
&lt;th&gt;What pynenc does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Worker dies mid-task&lt;/td&gt;
&lt;td&gt;Lost task or duplicate retries&lt;/td&gt;
&lt;td&gt;Automatic recovery via heartbeat detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detecting dead workers&lt;/td&gt;
&lt;td&gt;External monitoring stack&lt;/td&gt;
&lt;td&gt;Built-in runner heartbeat checks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Re-queuing orphaned tasks&lt;/td&gt;
&lt;td&gt;Manual scripts and intervention&lt;/td&gt;
&lt;td&gt;Automatic re-routing to broker&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recovery in clusters&lt;/td&gt;
&lt;td&gt;Custom distributed locking&lt;/td&gt;
&lt;td&gt;Atomic global recovery service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Understanding incidents&lt;/td&gt;
&lt;td&gt;Log spelunking&lt;/td&gt;
&lt;td&gt;Invocation state history and timeline views&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;What is next&lt;/h2&gt;

&lt;p&gt;Pynenc is open source and actively maintained:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/pynenc/pynenc" rel="noopener noreferrer"&gt;pynenc&lt;/a&gt; - core framework&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/pynenc/samples" rel="noopener noreferrer"&gt;samples&lt;/a&gt; - runnable demos&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.pynenc.org" rel="noopener noreferrer"&gt;docs&lt;/a&gt; - full documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How does your team handle crashed workers today? Join the conversation in &lt;a href="https://github.com/pynenc/pynenc/discussions" rel="noopener noreferrer"&gt;GitHub Discussions&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>backend</category>
      <category>distributedsystems</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
