
Kasey Steinhauer


Celery worker monitoring: detecting silent failures

Originally posted on celeryradar.com.

Workers are the part of Celery that actually do the work. When they stop, your application's background processing stops. That's the easy part to monitor. The harder part is that workers fail in ways that look healthy from the outside: the process is still running, the broker connection looks fine, the log file's last line is from this morning, and yet tasks aren't getting picked up. By the time somebody on your team notices, a downstream user noticed first.

This guide covers what worker monitoring actually needs to catch (more than "is the process running"), why the three dominant detection approaches each have known blind spots, the five ways workers go silent in production, and the specific implementation trap that causes naive heartbeat setups to fire false alerts during recovery.

Why worker death detection is harder than it looks

Worker monitoring isn't underserved the way beat schedule monitoring is. Every Celery monitoring tool tracks workers in some form. The gap is subtler: each dominant approach has a specific blind spot, and the five most common worker failure modes split across those blind spots so that no single approach catches all of them.

Flower and similar broker-inspect tools query worker state through the broker. Celery's inspect ping command sends a control message and waits for the worker to reply. This works when the worker is healthy and the network path is clean. It misses a few important cases: workers running the solo pool while blocked on a long-running task (the solo pool's main thread handles control commands too, so a stuck task means stuck inspect replies), workers behind certain network configurations where the broker's reply path is unreliable, and prefork workers whose main process has stalled on a broker reconnect storm or a slow synchronous transport. In each of these cases Flower shows the worker as offline even though the worker process itself never died.

The APMs (Sentry, Datadog, New Relic) approach worker monitoring primarily from the task side: they instrument task execution and surface errors and slow tasks well. Sentry and New Relic's Python agents are task-side only. If every worker dies, nothing throws, nothing traces, nothing reaches the APM. Sentry Crons covers beat schedules, not workers. Datadog is the partial exception: its Celery integration scrapes Flower's Prometheus endpoint and exposes a per-worker celery.flower.worker.online gauge, so worker absence is visible if you stand up Flower and write the monitor yourself. None of the three ship a preconfigured "worker offline" alert template out of the box.

Process supervision (systemd, supervisord, Kubernetes liveness probes) catches the cleanest failure mode: the process exited. Restart policy kicks in, the worker comes back. What it doesn't catch is the worker process that's still running but has stopped processing tasks. From systemd's view, PID 12345 is alive; from your application's view, nothing's getting done. The liveness probe was wired to the process, not the worker's actual responsiveness.

Each of the three approaches solves part of the problem. None of them, on their own, catches the full set of failure modes that take Celery workers down in production. The rest of this guide is about what proper worker monitoring covers, the five specific failure modes that close the gap, and how to detect each.

What proper worker monitoring entails

Proper worker monitoring is four signals. Three of them are familiar; the fourth is what most homegrown setups miss.

Liveness. Is the worker process running? This is the easy one. systemd reports it, Kubernetes reports it, Flower reports it. Liveness alone is necessary but not sufficient: a process that's alive but no longer processing tasks shows as healthy by every liveness check.

Responsiveness. Is the worker actually picking up and completing tasks? Harder to measure than liveness, and the two common mechanisms (broker inspect in the Flower style, and worker-pushed heartbeats) each prove something narrower. Both confirm the worker's main process is alive and its broker path is healthy enough to drive a signal. Neither, on its own, detects a worker whose main process is healthy while child processes are stuck on a long-running task, or whose broker consumer has lost subscription state but kept its connection. Catching the alive-but-stuck case requires a downstream signal: a queue-depth alert on the queues the worker serves. A worker whose heartbeat is current but whose queue is growing past threshold is stuck, even though no worker-level alert will fire. Queue depth alone would catch both the stuck and dead cases eventually, but the lag is proportional to how fast new tasks arrive; the heartbeat catches the dead case in tens of seconds and tells you which host. The two alerts cover different failure modes and are typically run together rather than chosen between. For the heartbeat cadence, every 30 seconds with a 100-to-300-second offline threshold gives enough of a grace window to absorb a slow network blip without firing an alert over a real but brief disconnection.
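The queue-depth half of that pairing can be a small scheduled check against the broker. A minimal sketch for a Redis broker, with the queue name and threshold as illustrative placeholders:

import redis

# On a Redis broker each Celery queue is a plain list keyed by the queue name,
# so depth is just LLEN. With RabbitMQ you'd query the management API instead.
BROKER_HOST = "localhost"
QUEUE = "celery"
THRESHOLD = 500

r = redis.Redis(host=BROKER_HOST, port=6379, db=0)
depth = r.llen(QUEUE)
if depth > THRESHOLD:
    print(f"queue '{QUEUE}' has {depth} waiting tasks; workers may be stuck or gone")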

Identity stability. Does the worker's identifier survive normal operations? Most setups identify workers by hostname. On a Kubernetes deployment, the hostname is the pod name, which rotates on every restart, every rollout, every autoscaler event. A naive setup accumulates offline ghost workers indefinitely: every prior pod sits in the dashboard reading as down, forever. Stable identity requires an explicit override (an env var or kwarg that names the worker independently of hostname) or careful interpretation of dashboard noise.
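One way to pin identity is to pass an explicit node name at startup instead of letting Celery default to the hostname. A minimal sketch, assuming a CELERY_WORKER_NAME environment variable set per logical worker on the Deployment (the variable name, default, and @prod suffix are illustrative); the CLI equivalent is celery -A myproject worker -n payments-worker-1@prod:

# run_worker.py -- start the worker under a stable node name instead of the pod hostname.
import os

from celery import Celery

app = Celery("myproject", broker="redis://localhost:6379/0")

if __name__ == "__main__":
    # e.g. CELERY_WORKER_NAME=payments-worker-1 set once on the Deployment spec
    node_name = os.environ.get("CELERY_WORKER_NAME", "worker-1")
    app.worker_main(argv=["worker", "--loglevel=INFO", f"--hostname={node_name}@prod"])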

Out-of-order safety. Late-arriving heartbeats during recovery from a monitoring-side outage shouldn't trigger phantom alerts. Sounds obvious. The implementation is where most naive setups break. Covered in detail below.

The 5 ways workers die silently in production

1. OOM kill

The Linux OOM killer is the most common cause of silent worker death in production. The kernel decides a process has consumed too much memory, picks it as the victim, and sends SIGKILL. The worker has no chance to log anything; SIGKILL can't be caught or handled. The only trace is in the kernel log (dmesg, journalctl -k), where you'll see a line like Out of memory: Killed process 12345 (celery).

In Kubernetes, the same kernel OOM mechanism applies via cgroups when a container exceeds its memory limit; the kubelet reports the result on the pod with OOMKilled status. The pod restarts (if restart policy permits), but during the restart window tasks are unprocessed. If the underlying memory pattern repeats (a task with a large allocation that the worker doesn't reclaim between runs), the cycle continues: OOM, restart, OOM, restart, with tasks failing or timing out at each cycle.
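If the repeating pattern is a task that leaks or never releases a large allocation, Celery's worker_max_memory_per_child setting can recycle the child before the kernel steps in; it doesn't replace detection, but it breaks the OOM-restart loop. A minimal sketch with an illustrative threshold:

from celery import Celery

app = Celery("myproject")
# Recycle a pool child once it exceeds roughly 512 MiB of resident memory, so a
# leaking task restarts its own child instead of growing until the kernel's OOM
# killer takes out the whole worker. The value is in kibibytes.
app.conf.worker_max_memory_per_child = 512_000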

The detection signal is identical in both environments: the worker stops producing heartbeats. Liveness alone catches the killed process eventually (systemd marks the unit failed; kubelet marks the pod as restarting), but heartbeat absence catches it sooner because heartbeats are pushed on a fixed cadence that doesn't depend on supervisor poll intervals.

2. SIGKILL during deploy

The deployment race. Your deploy pipeline sends SIGTERM to the worker, the worker starts its graceful shutdown (finishing in-flight tasks before exiting), but the supervisor's grace period is shorter than the longest-running task. After the grace period, the supervisor sends SIGKILL.

In Kubernetes, this is the terminationGracePeriodSeconds setting. The default is 30 seconds. Workers running a 60-second task get SIGKILL'd before the task completes; the task is lost (or retried, depending on acks_late). In systemd, TimeoutStopSec plays the same role. The default is 90 seconds, which is enough for most tasks but not for any long-running operation that can't be paused.
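Whether the interrupted task vanishes or comes back is the acks configuration mentioned above. A minimal sketch, assuming the tasks in question are safe to execute more than once:

from celery import Celery

app = Celery("myproject")
# Acknowledge tasks only after they finish, and requeue tasks whose worker was
# killed mid-execution, so a SIGKILL'd task is redelivered instead of vanishing.
# Both assume the task is idempotent enough to be safe to run twice.
app.conf.task_acks_late = True
app.conf.task_reject_on_worker_lost = True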

The symptom is not "no tasks running" but rather "tasks vanishing mid-execution during deploys." You won't notice it in monitoring that only looks at process state, because to the supervisor the kill looks like a normal deploy-time restart. You'll notice it when a customer reports a task they triggered didn't complete, and the audit trail shows the task started but never reached a terminal state. Worker-side, a heartbeat that stops abruptly during a deploy window is the signal; correlating with deploy times tells you whether the cause was the deploy itself.

3. Prefork child crash, parent alive

Celery's default prefork pool runs a main worker process and a pool of child processes that execute tasks. The main process fetches tasks from the broker, dispatches to children, and monitors child health.

When a child crashes (segfault in a C extension, unhandled C-level exception, OOM-killed individually rather than as the whole pool), the main process reaps the child and spawns a replacement. The in-flight task in that child is lost; depending on acks_late configuration, it may or may not be retried. The main process keeps running and continues to look healthy.

The hard-to-debug variant is a recurring child crash that's specific to certain task arguments. Most tasks succeed; a specific subset crashes their executor every time. The main process never reports unhealthy because it's working as designed: spawn child, dispatch task, child dies, spawn replacement. Liveness sees nothing wrong, responsiveness sees nothing wrong (the main process is responsive), and the only catch is in task outcome correlation. This is one of the few failure modes that the per-task breakdown view (retry rate, failure rate per task name) catches better than worker-level monitoring.
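One way to get that per-task correlation without a full APM is Celery's event stream. A minimal sketch of a standalone monitor that counts task-failed events by task name (broker URL illustrative); note the caveats in the comments:

# failure_monitor.py -- tally task-failed events per task name from the event stream.
# Workers must be started with task events enabled (celery worker -E). Whether a
# hard child crash surfaces here as a task-failed event depends on the pool and
# acks configuration, so treat this as one input, not complete coverage.
from collections import Counter

from celery import Celery

app = Celery("myproject", broker="redis://localhost:6379/0")
failures_by_task = Counter()

def main():
    state = app.events.State()

    def on_task_failed(event):
        state.event(event)
        task = state.tasks.get(event["uuid"])
        name = task.name if task and task.name else "unknown"
        failures_by_task[name] += 1
        print(f"{name}: {failures_by_task[name]} failure(s) so far")

    with app.connection() as connection:
        receiver = app.events.Receiver(connection, handlers={
            "task-failed": on_task_failed,
            "*": state.event,  # keep task names resolvable from task-received events
        })
        receiver.capture(limit=None, timeout=None, wakeup=True)

if __name__ == "__main__":
    main()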

4. Broker connection drop without clean reconnect

Workers maintain a persistent connection to the broker (Redis or RabbitMQ). The connection should reconnect automatically if dropped, and usually does. The edge cases that bite are when reconnection succeeds silently from the worker's view but leaves the worker in a state where it's no longer receiving messages.

The Redis variant: a network blip drops the worker's connection. Reconnect establishes a new socket. The known bug pattern is the worker's event loop ending up polling a stale file descriptor or a new socket that wasn't properly registered with the I/O hub; BRPOP never fires again. The worker process is alive, a TCP connection to Redis exists, but new tasks don't arrive at this worker. Largely fixed in Celery 5.5+/kombu 5.4+, but regressions have appeared in 5.6.x. Worth checking the version pinned in your deployment.

The RabbitMQ variant: with acks_late=True, ACK state lives on the AMQP channel. If a channel dies mid-delivery, RabbitMQ requeues the unacked task, but the worker's prefetch slot stays occupied by a zombie task it can no longer ACK. After enough channel drops, every prefetch slot is zombied and the worker consumes nothing despite looking alive. RabbitMQ's 30-minute default consumer_timeout is a related cause for long-running tasks. The worker_cancel_long_running_tasks_on_connection_loss setting (added in Celery 5.1) is the mitigation.
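The mitigation setting lives in the Celery config; a minimal sketch of enabling it (available from Celery 5.1):

from celery import Celery

app = Celery("myproject")
# On broker connection loss, cancel long-running tasks whose delivery can no
# longer be acknowledged; with acks_late they're requeued and redelivered instead
# of finishing into a prefetch slot the worker can never ACK.
app.conf.worker_cancel_long_running_tasks_on_connection_loss = True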

Detection requires the worker to push state, not just maintain a connection. A heartbeat sent over the same broker path that tasks travel through proves both "I'm alive" and "I can use the broker." A heartbeat sent over a separate HTTPS path proves liveness without proving broker reachability, so the broker-disconnect mode can slip through.

5. Hung on a long task or blocking dependency

A worker process is alive and processing one task that's blocked on something: a synchronous database call with a long timeout, an HTTP request to a slow third-party API, a file lock waiting for a process that died. While that task is blocked, the worker can't pick up new tasks. If your concurrency is 1, or if the entire pool is stuck on similarly blocked tasks, the worker is functionally offline.

Process supervision sees the worker as healthy. The broker connection is healthy. The worker is even technically responsive to control commands sent through certain transports. But the queue is filling and tasks aren't moving.

The clearest detection signal is responsiveness measured against throughput, not against heartbeats. A worker that hasn't acked a task in 10 minutes while its queue depth is increasing is stuck even if its heartbeat is current. This is a correlation across two metrics, harder to express as a single threshold and the reason worker-offline alerts pair naturally with queue-depth alerts on the same queue.
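A complementary probe, where control commands still get through, is to ask workers what they're currently running and flag anything past a runtime threshold. A hedged sketch using Celery's inspect interface (broker URL and threshold illustrative):

import time

from celery import Celery

app = Celery("myproject", broker="redis://localhost:6379/0")

# Ask every reachable worker what it's executing right now. This only works while
# the worker's main process can still answer control commands (see the solo-pool
# and stalled-main-process caveats above). Check how your Celery version reports
# time_start before trusting the arithmetic.
STUCK_AFTER_SECONDS = 600

replies = app.control.inspect(timeout=5).active() or {}
for worker, tasks in replies.items():
    for task in tasks:
        runtime = time.time() - task["time_start"]
        if runtime > STUCK_AFTER_SECONDS:
            print(f"{worker}: {task['name']} running {runtime:.0f}s (id={task['id']})")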

Detecting these in production

The detection space breaks into three approaches. Each catches different parts of the five failure modes above. The right answer for most production deployments is a heartbeat-push setup with a sensible offline threshold, but understanding the case for each approach helps you decide what to layer.

Heartbeat push (worker-initiated). The worker reports "I'm alive" on a fixed interval (typically every 30 seconds) to a monitoring service. The service alerts when heartbeats stop arriving for longer than a configured offline threshold (typically 100 to 300 seconds). This catches OOM kill cleanly (process dies, heartbeats stop), SIGKILL during deploy cleanly (same shape), and broker connection loss when heartbeats travel a different path from tasks. It misses prefork child crashes when heartbeats are reported per-main-process (the main process is still up), and it misses hung-on-long-task because the main process is still alive and heartbeating.
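A homegrown version of heartbeat push can hang off Celery's heartbeat_sent signal (the same hook the SDK below uses). A minimal sketch, with the endpoint and env var as illustrative placeholders:

import os
import socket
import time
import urllib.request

from celery.signals import heartbeat_sent

# Hypothetical endpoint -- substitute whatever your monitoring service expects.
# The signal can fire more often than you want to report, so the handler
# throttles itself to roughly every 30 seconds.
MONITOR_URL = os.environ.get("WORKER_HEARTBEAT_URL", "https://monitoring.example.com/heartbeat")
_last_push = 0.0

@heartbeat_sent.connect
def push_heartbeat(sender=None, **kwargs):
    global _last_push
    now = time.time()
    if now - _last_push < 30:
        return
    _last_push = now
    payload = f"host={socket.gethostname()}&ts={int(now)}".encode()
    try:
        # Fire and forget: a failed heartbeat must never take the worker down.
        urllib.request.urlopen(MONITOR_URL, data=payload, timeout=5)
    except OSError:
        pass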

Broker inspect (controller-initiated). Periodically send control commands through the broker and wait for the worker's reply. This is what Flower does internally via the Celery app.control.inspect() interface. Catches the same failure modes as heartbeat push, plus partial coverage of hung-on-long-task in cases where the worker's main process is itself stalled (the prefork main process can normally service control commands while a child is blocked, but solo-pool workers and prefork workers whose main process is stuck on broker reconnect don't reply). Adds broker load proportional to worker count times inspect frequency, and reports false offlines when broker control replies are slow or asymmetric.
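Rolled by hand, the inspect approach is a periodic ping against an expected set of node names. A minimal sketch (node names and broker URL illustrative):

from celery import Celery

app = Celery("myproject", broker="redis://localhost:6379/0")

# The expected set should come from wherever you define the fleet: config,
# service discovery, or the deployment spec.
EXPECTED = {"celery@worker-1", "celery@worker-2"}

# ping() returns {node_name: {"ok": "pong"}} for every worker that replied within
# the timeout; an expected node that's absent is a candidate for an alert, with
# the false-offline caveats described above.
replies = app.control.inspect(timeout=5).ping() or {}
missing = EXPECTED - set(replies)
if missing:
    print("no inspect reply from: " + ", ".join(sorted(missing)))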

Process supervision (OS-initiated). systemd, supervisord, Kubernetes liveness probes. Catches the cleanest failure mode (process exit). Misses every alive-but-not-processing case. Fast to set up, no Celery awareness required, and worth running regardless because it handles the auto-restart side that monitoring alone doesn't.

The pragmatic answer is to layer all three. Process supervision handles restart automation. Heartbeat push handles the bulk of failure detection. Broker inspect, where you already have Flower running, catches the additional cases at the cost of broker load. Layering is redundant by design: an OOM kill fires the heartbeat-absence alert and triggers the systemd restart, both signals telling the same story from different angles. That redundancy is the point.

The out-of-order heartbeat trap

There's a specific implementation trap that affects every heartbeat-based worker monitor: out-of-order arrivals.

The naive shape is to write last_seen = received_timestamp on every heartbeat. Correct in the steady state. The problem appears during recovery from a monitoring-side outage. Workers buffer heartbeats while monitoring is unreachable. When monitoring comes back, the buffered heartbeats replay alongside fresh ones. They arrive at the receiver in arrival order, not timestamp order. If the receiver writes last_seen unconditionally, the most recent fresh heartbeat can be followed by an older retried one, and last_seen moves backward. The next offline check fires a false alert: the worker looks like it hasn't reported in seventeen minutes instead of seventeen seconds.

The fix is to enforce MAX(existing, incoming) semantics at write time. Read-compare-write isn't safe under concurrent ingest: two near-simultaneous writes can both pass the comparison and the second one's update clobbers the first. The robust shape is an atomic conditional update at the database level: UPDATE workers SET last_seen = $new WHERE hostname = $h AND last_seen < $new. Postgres evaluates the predicate inside the update; concurrent writes serialize cleanly.
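In application code the same conditional write fits naturally into an upsert. A minimal sketch with psycopg2 (table and column names illustrative, hostname assumed to carry a unique constraint):

import os
from datetime import datetime, timezone

import psycopg2

def record_heartbeat(conn, hostname, sent_at):
    # ON CONFLICT needs a unique constraint on hostname. The WHERE clause is what
    # keeps a late, replayed heartbeat from moving last_seen backward, even under
    # concurrent ingest: Postgres evaluates it inside the update.
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO workers (hostname, last_seen)
            VALUES (%s, %s)
            ON CONFLICT (hostname)
            DO UPDATE SET last_seen = EXCLUDED.last_seen
            WHERE workers.last_seen < EXCLUDED.last_seen
            """,
            (hostname, sent_at),
        )
    conn.commit()

if __name__ == "__main__":
    connection = psycopg2.connect(os.environ["DATABASE_URL"])
    record_heartbeat(connection, "celery-prod-1", datetime.now(timezone.utc))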

This is one of three layers of redundancy in CeleryRadar's worker monitoring. The other two are a bounded retry queue in the SDK (which preserves heartbeats during outages so they can replay rather than disappearing) and a 10-minute startup grace on the alert engine (which suppresses absence-based alerts for the first ten minutes after the alert worker boots, so backfilled heartbeats land before any evaluator runs). Each layer covers a different failure mode in the recovery path; together they're robust regardless of which mechanism does the heavy lifting on any given recovery.

Setting up worker monitoring with CeleryRadar

If worker monitoring with out-of-order safety, fork safety, and Kubernetes-friendly identity is what you want without writing it yourself, CeleryRadar handles it as part of the standard SDK setup.

pip install celeryradar-sdk

In your Celery app:

# myproject/celery.py
import os
import celeryradar_sdk
from celery import Celery

app = Celery("myproject")
app.config_from_object("django.conf:settings", namespace="CELERY")
app.autodiscover_tasks()

celeryradar_sdk.connect(
    api_key=os.environ["CELERYRADAR_API_KEY"],
    app_name="myproject",
)

The SDK pushes a heartbeat every 30 seconds via Celery's heartbeat_sent signal, rebuilds itself correctly across prefork forks (the parent's TCP connection would otherwise be duplicated across child processes), and the backend handles the out-of-order arrival case server-side. On Kubernetes, set the CELERYRADAR_WORKER_NAME environment variable on your deployment to override the pod-name-as-hostname default. Pick a stable identifier (the deployment name plus the ordinal, or a fixed string per logical worker class) so the dashboard doesn't accumulate offline ghosts on every pod rotation.

CeleryRadar workers page listing five workers with hostnames, queue assignments as badges, last-seen timestamps, and online/offline status badges. Four workers show as online with last-seen one minute ago; celery-prod-4 shows as offline with last-seen six minutes ago.

Add a worker_offline alert rule from the rules page. Pick a hostname from the dropdown (the dropdown is sourced from heartbeats CeleryRadar has actually received, so typos can't pass), set the absence threshold (100 seconds is the floor; 180 to 300 is a reasonable default for most workloads), pick a delivery channel, and save.

CeleryRadar new alert rule form configured for the Worker Offline trigger on the celery-prod-1 hostname, with absence seconds set to 180, an Alert Discord Channel as the destination, ping set to here, a 300-second cooldown between alerts, and renotify left blank for a single alert per incident.

The five failure modes above land on the dashboard differently. OOM kill and SIGKILL during deploy fire worker_offline cleanly. Prefork child crashes don't fire that alert because the main process is still alive; they surface on the per-task breakdown page as elevated retry and failure rates, and a task_failure_rate alert on the affected task name catches the recurring-on-specific-args case actively. Broker connection drops without clean reconnect fire worker_offline since heartbeats stop arriving. Hung-on-long-task is the hardest case; the main process keeps heartbeating, so the signal is queue depth growing against worker count staying steady. Pair the worker_offline alert with a queue_depth_threshold alert on the queues that worker serves to cover that mode.

Try CeleryRadar free

Closing

Worker monitoring is where the gap between "we monitor workers" and "we'd catch this in production" usually lives. The three dominant approaches (broker inspect, APM task instrumentation, process supervision) each handle part of the failure space, and the five most common failure modes split across them so that no single approach is sufficient. The pragmatic shape is heartbeat-push for the bulk of detection, process supervision for restart automation, and broker inspect where your existing Flower setup catches the additional cases. Out-of-order safety on the heartbeat path is the implementation detail that separates "fires false alerts during every recovery" from "just works."

If beat schedule monitoring is also a gap in your setup, the same SDK installation handles that automatically. See the companion guide on Celery beat monitoring. For the broader picture across tasks, workers, queues, and schedules together, the main guide on monitoring Celery in production covers the full signal map.
