Ditch Electron: Implementing Watchdog Heartbeat & Auto-Healing for Multi-Process Desktop Apps

#electrobun #robyn #turso #htmx

Part 2 of the ERTH Architecture Series: How to build a Bun watchdog daemon to detect, kill, and resurrect crashed Python sidecar processes with dynamic port negotiation.

In the first part of this series, we established the core foundation of our ERTH Stack (ElectroBun + Robyn + Turso + HTMX) desktop application. We spawned a high-performance Python backend (Robyn) as a child process from a Bun master process, binding to port 0 so that the operating system assigns a free port and port collisions are avoided.

But in real-world desktop environments, things are rarely that simple.

What happens if the Python process enters an asynchronous deadlock? What if it crashes due to a C-level memory access violation? Or what if the operating system aggressively reclaims memory and terminates your sidecar backend while the frontend shell remains open?

If you do not have a robust recovery strategy, your application will hang indefinitely, leaving your user staring at a frozen, unresponsive screen.

In this second post, we will build a production-grade Watchdog Heartbeat & Auto-Healing Pipeline that turns our dual-core architecture into a self-healing, crash-resilient application.

The Self-Healing Architecture

In a distributed local desktop architecture, the frontend (Bun) acts like a sonar, periodically pinging the sidecar backend (Robyn) via a lightweight, dedicated health probe. If the backend fails to respond within a specific timeout, the master process assumes the sidecar is dead or hung, terminates it surgically, and restarts it on a fresh dynamic port.

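The watchdog's lifecycle, in order: Bun spawns the backend and captures its port from stdout, pings /ping every 3 seconds, and after 3 consecutive missed heartbeats kills the process and starts a new one on a fresh port.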
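The lifecycle can also be read as a small state machine. What follows is a minimal sketch of that idea, not the code from this series: the names `Phase`, `WatchdogEvent`, `WatchdogState`, and `step` are hypothetical, and the only value taken from the article is the 3-miss tolerance.

```typescript
// Hypothetical model of the watchdog lifecycle; names are illustrative only.
type Phase = "waitingForPort" | "polling";
type WatchdogEvent = "portCaptured" | "pong" | "miss";

interface WatchdogState {
  phase: Phase;
  failCount: number; // consecutive missed heartbeats
  restarts: number;  // how many times the backend has been recycled
}

const MAX_MISSES = 3; // same tolerance as the article's watchdog

// Pure transition function: current state + event -> next state.
function step(state: WatchdogState, event: WatchdogEvent): WatchdogState {
  if (event === "portCaptured") {
    // The backend printed its port: start polling with a clean counter.
    return { ...state, phase: "polling", failCount: 0 };
  }
  if (state.phase !== "polling") {
    return state; // probe results are ignored until a port is known
  }
  if (event === "pong") {
    return { ...state, failCount: 0 };
  }
  // Remaining case: event === "miss"
  const failCount = state.failCount + 1;
  if (failCount < MAX_MISSES) {
    return { ...state, failCount };
  }
  // Third consecutive miss: kill, respawn, and wait for the new port.
  return { phase: "waitingForPort", failCount: 0, restarts: state.restarts + 1 };
}

// Usage: two misses followed by a pong do not trigger a restart; three in a row do.
const initial: WatchdogState = { phase: "waitingForPort", failCount: 0, restarts: 0 };
const events: WatchdogEvent[] = [
  "portCaptured", "pong", "miss", "miss", "pong", "miss", "miss", "miss",
];
const finalState = events.reduce(step, initial);
// finalState is { phase: "waitingForPort", failCount: 0, restarts: 1 }
```

Keeping the transitions in a pure function like this makes the miss-counting rule easy to unit-test without spawning any process.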

Let's build this step-by-step.

Step 1: Implementing the Heartbeat Endpoint in Python

We start by exposing a deliberately lightweight /ping route in our Python Robyn backend. The handler touches no database and runs no business logic, so it costs almost nothing to call and fails only when the server itself is unresponsive.

# backend/app.py
from robyn import Robyn, Request, Response
import json

app = Robyn(__file__)

@app.get("/ping")
def ping(request: Request):
    """Low-overhead watchdog heartbeat probe"""
    return Response(
        status_code=200,
        headers={"Content-Type": "application/json"},
        description=json.dumps({"status": "pong"})
    )

if __name__ == "__main__":
    app.start(host="127.0.0.1", port=0)

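This gives the main process a cheap way to check that the Python runtime is alive and still serving requests. A successful ping does not prove that the rest of the application (database, business logic) is healthy, only that the server loop responds.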
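On the consuming side, a heartbeat should only count when both the status code and the body match what the endpoint returns. Here is a minimal TypeScript sketch of that check. `isPong` is a hypothetical helper introduced for illustration; it takes the status code and raw body text so it can be tested without a running server.

```typescript
// Hypothetical helper: decides whether a probe response counts as a heartbeat.
function isPong(statusCode: number, bodyText: string): boolean {
  // Mirrors the `res.ok` check: only 2xx responses are considered.
  if (statusCode < 200 || statusCode > 299) return false;
  try {
    const data: unknown = JSON.parse(bodyText);
    return (
      typeof data === "object" &&
      data !== null &&
      (data as { status?: unknown }).status === "pong"
    );
  } catch {
    // A non-JSON body is treated as a missed heartbeat rather than an error.
    return false;
  }
}

// Usage
const healthy = isPong(200, '{"status": "pong"}'); // true
const wrongBody = isPong(200, "<html>busy</html>"); // false
```

Treating malformed responses as misses keeps the watchdog's decision to a single boolean per probe.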

Step 2: Coding the Bun Watchdog Daemon

On the Bun frontend control layer, we refactor our process lifecycle management. Instead of starting the backend once and forgetting it, we encapsulate startup in a reusable startBackend() function and run an asynchronous heartbeat check every 3 seconds.

Here is the implementation:

// src-app/frontend/src/bun/index.ts
import { spawn } from "bun";
import { resolve } from "path";

let backendProcess: any = null;
let portFound = false;
let backendPort = 0;
let watchdogInterval: any = null;
let failCount = 0;

const PORT_CAPTURE_REGEX = /listening on: [^:]+:(\d+)|http:\/\/127\.0\.0\.1:(\d+)/;

// Spawns the Robyn process and listens to stdout for dynamic port allocation
const startBackend = () => {
  portFound = false;
  backendPort = 0;

  console.log(`🚀 [ElectroBun] Starting Robyn backend process...`);

  backendProcess = spawn({
    cmd: ["uv", "run", "python", "app.py"],
    cwd: resolve(__dirname, "../../../backend"),
    stdout: "pipe",
    // stderr is never read in this file; inherit it so the pipe cannot fill up unread
    stderr: "inherit",
  });

  // Intercept stdout to parse the dynamically negotiated port
  (async () => {
    const reader = backendProcess.stdout.getReader();
    const decoder = new TextDecoder();
    let buffer = "";

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split("\n");
      buffer = lines.pop() || "";

      for (const line of lines) {
        process.stdout.write(`[Robyn STDOUT] ${line}\n`);

        if (!portFound) {
          const match = line.match(PORT_CAPTURE_REGEX);
          if (match) {
            const rawPort = match[1] || match[2];
            backendPort = parseInt(rawPort, 10);
            portFound = true;
            console.log(`⚡ [ElectroBun] Captured backend port: ${backendPort}`);
            startWatchdog(); // Begin active polling
            // No `break` here: the remaining lines of this chunk should still be echoed
          }
        }
      }
    }
  })();
};

// Polling daemon that hits the /ping endpoint
const startWatchdog = () => {
  if (watchdogInterval) clearInterval(watchdogInterval);
  failCount = 0;

  console.log(`📡 [Watchdog] Heartbeat scanning activated on port :${backendPort}`);

  watchdogInterval = setInterval(async () => {
    if (!portFound || backendPort === 0) return;

    try {
      // Set a strict 1-second timeout for the ping probe
      const res = await fetch(`http://127.0.0.1:${backendPort}/ping`, {
        signal: AbortSignal.timeout(1000)
      });

      if (res.ok) {
        const data = await res.json();
        if (data.status === "pong") {
          failCount = 0; // Reset counter on success
          return;
        }
      }
      failCount++;
    } catch (e) {
      failCount++;
    }

    // Auto-heal if the sidecar fails 3 consecutive times
    if (failCount >= 3) {
      console.log(`🚨 [Watchdog] 3 consecutive heartbeats lost. Backend hung!`);
      console.log(`🚨 [Watchdog] Triggering auto-healing and restarting backend...`);

      killBackend(true); // Kill current sidecar and auto-restart
    }
  }, 3000);
};

// Surgically recycle the backend child process
const killBackend = (autoRestart = false) => {
  if (backendProcess && !backendProcess.killed) {
    console.log("🛑 [ElectroBun] Terminating Robyn backend process...");
    backendProcess.kill("SIGTERM");
  }
  if (watchdogInterval) {
    clearInterval(watchdogInterval);
  }

  if (autoRestart) {
    startBackend();
  }
};

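// Register cleanup listeners so the backend is not left orphaned on exit.
// A custom SIGINT/SIGTERM handler replaces the default "terminate" behavior,
// so exit explicitly once the backend has been signalled.
process.on("SIGINT", () => { killBackend(false); process.exit(130); });
process.on("SIGTERM", () => { killBackend(false); process.exit(143); });
process.on("exit", () => killBackend(false));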

startBackend();
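The port-capture logic in startBackend() depends on two details: stdout arrives in arbitrary chunks, so a log line can be split across reads, and the regex has two alternatives that store the port in different capture groups. The sketch below isolates both so they can be exercised without spawning a process. `extractPort` and `LineBuffer` are hypothetical names, and the sample log lines are made up for illustration.

```typescript
// Same pattern as in the watchdog code above.
const PORT_CAPTURE_REGEX = /listening on: [^:]+:(\d+)|http:\/\/127\.0\.0\.1:(\d+)/;

// Returns the port found in a log line, or null when the line has none.
function extractPort(line: string): number | null {
  const match = line.match(PORT_CAPTURE_REGEX);
  if (!match) return null;
  // Group 1 belongs to the first alternative, group 2 to the second.
  const raw = match[1] || match[2] || "";
  const port = parseInt(raw, 10);
  return Number.isInteger(port) && port > 0 && port <= 65535 ? port : null;
}

// Hypothetical helper: accumulates raw chunks and returns only complete lines.
class LineBuffer {
  private pending = "";

  push(chunk: string): string[] {
    const parts = (this.pending + chunk).split("\n");
    // The last element is an unfinished line (or "" if the chunk ended on "\n").
    this.pending = parts.pop() || "";
    return parts;
  }
}

// Usage: the port number is split across two chunks but is still parsed whole.
const lineBuffer = new LineBuffer();
const found: Array<number | null> = [];
for (const chunk of ["INFO listening on: 127.0.0.1:54", "337\nready\n"]) {
  for (const line of lineBuffer.push(chunk)) {
    found.push(extractPort(line));
  }
}
// found is [54337, null]
```

The range check on the parsed value is an addition that the original code does not perform; it only guards against a stray digit run that is not a valid TCP port.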

The Engineering Details: Why a 3-Second Probe?

When designing a heartbeat loop, you face a trade-off between responsiveness and battery/CPU consumption:

If you ping every 100 milliseconds, you recover instantly, but you waste CPU cycles and drain the laptop battery.
If you ping every 30 seconds, your application remains frozen for half a minute before recovering, causing a terrible user experience.

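Pinging every 3 seconds with a 3-miss tolerance is a reasonable middle ground. With the 1-second probe timeout, the watchdog declares the backend dead roughly 6 to 10 seconds after it stops responding; total recovery time is that plus however long the backend takes to boot and print its new port. The CPU cost of one loopback request every 3 seconds is negligible.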
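As a rough check on the timing, the sketch below computes how long a dead or hung backend can go unnoticed, counting detection only and excluding the time the new backend needs to boot. `detectionWindowMs` is a hypothetical helper, and it assumes the timeout is shorter than the interval so that probes never overlap.

```typescript
// Hypothetical helper: bounds on the delay between a failure and the
// moment the watchdog registers its final consecutive miss.
function detectionWindowMs(
  intervalMs: number,
  timeoutMs: number,
  misses: number,
): { min: number; max: number } {
  return {
    // Best case: the failure lands just before a tick and each probe fails
    // immediately (for example, connection refused after a crash).
    min: (misses - 1) * intervalMs,
    // Worst case: the failure lands just after a tick, so the first failing
    // probe is a full interval away, and the last probe waits out its timeout.
    max: misses * intervalMs + timeoutMs,
  };
}

// Usage with the article's settings: 3 s interval, 1 s timeout, 3 misses.
const detection = detectionWindowMs(3000, 1000, 3);
// detection is { min: 6000, max: 10000 }
```

Plugging in other values shows the trade-off directly: a 100 ms interval shrinks the window to well under 2 seconds, while a 30-second interval stretches it past a minute.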

Live Disaster Recovery Test

When we run this pipeline, the watchdog works silently. If we simulate a sudden backend failure by running kill -9 $(lsof -t -i :<PORT> -sTCP:LISTEN) in another terminal window, the output shows the self-healing loop:

📡 [Watchdog] Heartbeat scanning activated on port :54337
[Robyn STDOUT] INFO:actix_server.server:starting service...
... (Backend runs normally) ...

🚨 [Watchdog] 3 consecutive heartbeats lost. Backend hung!
🚨 [Watchdog] Triggering auto-healing and restarting backend...
🛑 [ElectroBun] Terminating Robyn backend process...
🚀 [ElectroBun] Starting Robyn backend process...
⚡ [ElectroBun] Captured backend port: 54518
📡 [Watchdog] Heartbeat scanning activated on port :54518

The app detects the crash, terminates what is left of the old process, spins up a new instance, captures its new port (54518), and resumes heartbeat polling. Once the WebView is pointed at the new port, the user should see little more than a brief loading spinner.
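How the new port reaches the WebView is not shown in the Step 2 code. One possible shape is a small holder object that notifies subscribers whenever the port changes. This is a sketch under assumptions: `BackendEndpoint` and `PortListener` are hypothetical names, and the listener stands in for whatever messaging mechanism the shell actually provides to the WebView.

```typescript
type PortListener = (port: number) => void;

// Hypothetical holder for the backend's current port.
class BackendEndpoint {
  private port = 0;
  private listeners: PortListener[] = [];

  // Register a callback that runs every time the port changes.
  onChange(listener: PortListener): void {
    this.listeners.push(listener);
  }

  // Called after port capture; notifies listeners only on an actual change.
  set(port: number): void {
    if (port === this.port) return;
    this.port = port;
    for (const listener of this.listeners) listener(port);
  }

  // Base URL for requests, or null while no backend port is known.
  baseUrl(): string | null {
    return this.port > 0 ? `http://127.0.0.1:${this.port}` : null;
  }
}

// Usage: the listener here only records ports; a real one would forward
// the new base URL to the WebView.
const endpoint = new BackendEndpoint();
const seenPorts: number[] = [];
endpoint.onChange((port) => seenPorts.push(port));
endpoint.set(54337);
endpoint.set(54518); // after an auto-heal restart
// seenPorts is [54337, 54518]
```

With something like this in place, the port-capture branch would call `endpoint.set(backendPort)` instead of relying on a bare module-level variable.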

What’s Next?

Our dual-core engine now recovers from crashes on its own. But an open loopback port is also an attack surface: any local process can connect to it, and a script on a malicious website the user visits could probe local ports, find our Robyn port, and try to send requests to the backend.

In Part 3, we will secure our local communications using Opaque Tokens and custom request interceptors, establishing an absolute Zero-Trust shield.

Stay tuned for Part 3!