The scenario you've lived
It's 2am. Someone pings you: emails aren't going out. You check Flower. Every worker is green: online, healthy. Queue depth is climbing. Nothing is moving.
This is a ghost worker: a process that's alive, but not consuming.
What actually happens
When a Celery worker reconnects to Redis after a network hiccup, it can end up in a broken state where:
- It maintains its heartbeat (so Flower shows it as online)
- It holds its prefetched tasks (so they appear "running")
- But it never ACKs or re-queues them (so they're stuck forever)
The result: your dashboard shows everything healthy while your queue backs up.
Why Flower misses it
Flower tracks worker processes, not task throughput. It asks "is the worker connected?", not "is the worker actually doing work?".
Here's the check Flower does:
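# Flower's worker health check (simplified)
import time
def is_worker_alive(worker):
    # "Alive" only means a heartbeat arrived recently; task progress is never checked.
    # HEARTBEAT_TIMEOUT is the number of silent seconds tolerated (value not shown here).
    last_heartbeat = worker.last_heartbeat
    return (time.time() - last_heartbeat) < HEARTBEAT_TIMEOUT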
A ghost worker keeps sending heartbeats. It's "alive" by this definition.
What you actually want is:
# What you'd need to detect a ghost
# get_task_count and get_queue_depth stand in for your own metrics source
def is_worker_actually_processing(worker, queue, window_seconds=300):
    tasks_processed = get_task_count(worker, since=time.time() - window_seconds)
    queue_depth = get_queue_depth(queue)
    if queue_depth > 0 and tasks_processed == 0:
        return False  # ghost worker
    return True
The difference: throughput over time, not just connectivity.
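To make the idea concrete, here is a minimal, self-contained sketch of that throughput check. The `ThroughputTracker` class and its method names are hypothetical (they are not part of Celery, Flower, or Kanari), and it assumes something else feeds it a call every time a task finishes, for example a Celery event consumer.

```python
import time
from collections import defaultdict, deque


class ThroughputTracker:
    """Remembers recent task completions per worker (hypothetical helper)."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._completions = defaultdict(deque)  # worker name -> timestamps

    def record_completion(self, worker, timestamp=None):
        """Call this whenever a task finishes on `worker` (in time order)."""
        ts = time.time() if timestamp is None else timestamp
        self._completions[worker].append(ts)

    def completed_in_window(self, worker, now=None):
        """Count completions newer than now - window_seconds."""
        now = time.time() if now is None else now
        cutoff = now - self.window_seconds
        stamps = self._completions[worker]
        while stamps and stamps[0] < cutoff:
            stamps.popleft()  # drop entries that fell out of the window
        return len(stamps)

    def is_ghost(self, worker, queue_depth, now=None):
        """Suspect a ghost when work is waiting but the worker finished nothing."""
        return queue_depth > 0 and self.completed_in_window(worker, now) == 0


tracker = ThroughputTracker(window_seconds=300)
tracker.record_completion("celery@worker-2", timestamp=1000.0)
print(tracker.is_ghost("celery@worker-1", queue_depth=214, now=1100.0))  # True
print(tracker.is_ghost("celery@worker-2", queue_depth=214, now=1100.0))  # False
```

Note the `queue_depth > 0` guard: an idle worker with an empty queue is not a ghost, it just has nothing to do.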
How to detect it manually
If you suspect a ghost worker, here's the manual check using redis-cli and the Celery CLI:
# Check how many tasks are in the queue
$ redis-cli llen celery
# Check what the worker claims it's doing
$ celery -A myapp inspect active
# Look for tasks stuck in "reserved" state for >10 min
$ celery -A myapp inspect reserved
Expected output when everything is healthy:
-> celery@worker-1: OK
- task_id: abc123 eta: None started: 2026-06-15 02:11:03
-> celery@worker-2: OK
(empty)
But a ghost worker looks like this, with tasks claimed but never finishing:
-> celery@worker-1: OK
- task_id: abc123 eta: None started: 2026-06-15 01:43:18 (28 min ago)
- task_id: def456 eta: None started: 2026-06-15 01:43:19 (28 min ago)
- task_id: ghi789 eta: None started: 2026-06-15 01:43:21 (28 min ago)
-> celery@worker-2: OK
(empty)
Worker 1 has been "running" those three tasks for 28 minutes. They're prefetched and stuck.
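If you'd rather not eyeball timestamps at 2am, the same comparison can be scripted. This is a sketch under assumptions: `find_stale_tasks` is a hypothetical helper, and it assumes the input has the shape returned by `app.control.inspect().active()`, with each task carrying an `id` and a `time_start` in epoch seconds. Check what your Celery version actually reports before relying on it.

```python
import time


def find_stale_tasks(active_reply, max_age_seconds=600, now=None):
    """Return (worker, task_id, age_seconds) for tasks running too long.

    active_reply is assumed to look like {worker_name: [task_dict, ...]},
    where each task_dict has "id" and "time_start" (epoch seconds).
    """
    now = time.time() if now is None else now
    stale = []
    for worker, tasks in (active_reply or {}).items():
        for task in tasks:
            started = task.get("time_start")
            if started is None:
                continue  # no start time recorded, so no age to compare
            age = now - started
            if age > max_age_seconds:
                stale.append((worker, task["id"], age))
    return stale


# Hand-built reply mirroring the ghost output above: started 28 minutes ago.
reply = {
    "celery@worker-1": [{"id": "abc123", "time_start": 1000.0}],
    "celery@worker-2": [],
}
print(find_stale_tasks(reply, max_age_seconds=600, now=1000.0 + 28 * 60))
# [('celery@worker-1', 'abc123', 1680.0)]
```

Pick `max_age_seconds` above your longest legitimate task, otherwise a slow report job will look like a ghost.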
What Kanari catches
Running kanari audit against the same setup:
Kanari Audit
----------------------------------------
System: 2 issues found
Workers
   worker-2     0/4   online, idle
🔥 worker-1     4/4   at capacity, 0 tasks completed in last 30 min
Queues          Pending   Oldest task
   celery-low   0         -
🔥 emails       214       28m 14s
[CRITICAL] GHOST_WORKER
   worker-1 holds 4 prefetched tasks with 0 completions
   in the last 1800s. Probable cause: broker reconnect
   with prefetch_multiplier > 1.
[HIGH] QUEUE_SLA_BREACH
   emails: oldest task waiting 28m 14s (threshold: 5m)
💡 Fix: celery -A myapp control revoke <task_ids> --terminate
   Then: systemctl restart celery@worker-1
   To prevent recurrence, set CELERYD_PREFETCH_MULTIPLIER=1
   in your worker config.
----------------------------------------
Audit completed in 1.8s
Three things Kanari does that inspect doesn't:
- Correlates queue depth + worker throughput over time
- Names the pattern (GHOST_WORKER) so you know what you're dealing with
- Gives you the exact commands to fix it
The fix
Once you've identified the ghost worker, the recovery is two steps:
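Step 1: revoke the stuck tasks. Revoking tells workers to discard those task IDs; it does not put them back on the queue, so note the IDs and re-send anything that still needs to run.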
# Revoke by task ID (from the inspect output above)
$ celery -A myapp control revoke abc123 def456 ghi789 --terminate
# Or revoke everything reserved by that worker
$ celery -A myapp control revoke $(
celery -A myapp inspect reserved --json |
jq -r '.["celery@worker-1"][].id'
) --terminate
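The jq pipeline above can also be written in Python, which makes it easier to keep the IDs for later. A rough sketch: `reserved_task_ids` is a hypothetical helper, and it assumes the reply has the shape returned by `app.control.inspect().reserved()`, a dict of worker name to a list of task dicts with an `id` key.

```python
def reserved_task_ids(reserved_reply, worker_name):
    """Return the task IDs a single worker currently holds in reserve.

    reserved_reply is assumed to look like {worker_name: [{"id": ...}, ...]}.
    A missing worker or an empty reply yields an empty list.
    """
    tasks = (reserved_reply or {}).get(worker_name) or []
    return [task["id"] for task in tasks]


# Wiring it to a real app (not run here; `app` is your Celery instance):
#   reply = app.control.inspect().reserved()
#   ids = reserved_task_ids(reply, "celery@worker-1")
#   if ids:
#       app.control.revoke(ids, terminate=True)
```

The `if ids:` guard avoids sending a revoke with no task IDs, which the shell version would do if the worker had nothing reserved. Holding on to `ids` also gives you the list of tasks to re-send, since workers skip revoked tasks rather than retry them.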
Step 2: restart that worker
# systemd
$ sudo systemctl restart celery@worker-1
# supervisor
$ supervisorctl restart celery:worker-1
# docker
$ docker restart worker-1
Preventing it
The root cause is usually a prefetch multiplier above 1 (CELERYD_PREFETCH_MULTIPLIER, which defaults to 4) combined with late acks left off, which is also the default.
# settings.py: safer defaults
# Prefetch 1 task at a time per worker process
# Prevents a crashed worker from holding multiple tasks hostage
CELERYD_PREFETCH_MULTIPLIER = 1
# Acknowledge tasks AFTER completion, not on pickup
# Tasks go back to queue if the worker dies mid-execution
CELERY_ACKS_LATE = True
# With late acks on, requeue a task whose worker process is killed or exits abruptly, instead of acknowledging it
CELERY_REJECT_ON_WORKER_LOST = True
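With PREFETCH_MULTIPLIER=1, each worker process holds at most one prefetched task hostage instead of four. With ACKS_LATE=True, a task that was running during a hard crash is redelivered instead of lost; on Redis that happens once the visibility timeout (one hour by default) expires, so tasks need to be safe to run twice.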
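For a sense of scale, the prefetch limit for a worker node is its concurrency multiplied by the prefetch multiplier. The sketch below works through that arithmetic and collects the same three settings under the lowercase names that current Celery releases document (`worker_prefetch_multiplier`, `task_acks_late`, `task_reject_on_worker_lost`). `SAFER_DEFAULTS` and `max_reserved_tasks` are hypothetical names for illustration; which naming style your config loader expects depends on your setup.

```python
# Lowercase setting names as documented for current Celery releases.
SAFER_DEFAULTS = {
    "worker_prefetch_multiplier": 1,
    "task_acks_late": True,
    "task_reject_on_worker_lost": True,
}


def max_reserved_tasks(concurrency, prefetch_multiplier):
    """Prefetch limit: unacknowledged messages one worker node may hold."""
    return concurrency * prefetch_multiplier


# A node with 4 processes: default multiplier (4) vs. the safer value (1).
print(max_reserved_tasks(4, 4))  # 16
print(max_reserved_tasks(4, SAFER_DEFAULTS["worker_prefetch_multiplier"]))  # 4

# Applying it to a real app (not run here): app.conf.update(**SAFER_DEFAULTS)
```

So on a four-process node, dropping the multiplier from the default to 1 cuts the number of tasks a single ghost can sit on from 16 to 4.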
The takeaway
Flower tells you your workers are connected. Kanari tells you whether they're actually working.
Ghost workers are one of the harder silent failures to catch because every tool that uses heartbeats will miss them. The signal you need is throughput over time, not connectivity.
kanari audit runs this check automatically in ~2 seconds. kanari watch runs it continuously and pages you the moment a worker stalls.
$ pip install kanari-agent
$ kanari audit # one-shot check right now
$ kanari watch # continuous monitoring
No dashboards. Just an alert when something's actually wrong.