Hernan Chilabert

Posted on Jul 4 • Edited on Jul 17 • Originally published at getkanari.com

Ghost Workers: Why Your Celery Dashboard Lies to You

#showdev #devops #opensource #celery

The scenario I've lived

It's 2am. Someone pings me: emails aren't going out. I check Flower. Every worker is green: online, healthy. Queue depth is climbing. Nothing is moving.

This is a ghost worker: a process that's alive, but not consuming.

What actually happens

When a Celery worker reconnects to Redis after a network hiccup, it can end up in a broken state where:

It maintains its heartbeat (so Flower shows it as online)
It holds its prefetched tasks (so they appear "running")
But it never ACKs or re-queues them (so they're stuck forever)

The result: your dashboard shows everything healthy while your queue backs up.

Why Flower misses it

Flower tracks worker processes, not task throughput. It asks "is the worker connected?", not "is the worker actually doing work?".

Here's the check Flower does:

# Flower's worker health check (simplified)
import time

HEARTBEAT_TIMEOUT = 60  # seconds; illustrative value

def is_worker_alive(worker):
    last_heartbeat = worker.last_heartbeat
    return (time.time() - last_heartbeat) < HEARTBEAT_TIMEOUT

A ghost worker keeps sending heartbeats. It's "alive" by this definition.

What you actually want is:

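# What you'd need to detect a ghost
# get_task_count() and get_queue_depth() are placeholders for your own metrics
import time

def is_worker_actually_processing(worker, queue, window_seconds=300):
    # Tasks this worker finished inside the window
    tasks_processed = get_task_count(worker, since=time.time() - window_seconds)
    queue_depth = get_queue_depth(queue)

    # Work is waiting but nothing finished: the worker is not consuming
    if queue_depth > 0 and tasks_processed == 0:
        return False  # ghost worker
    return True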

The difference: throughput over time, not just connectivity.
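
To make the contrast concrete, here is a minimal self-contained sketch of both checks side by side. The `WorkerSnapshot` fields, the 60-second heartbeat timeout and the five-minute window are assumptions for illustration, not Flower's or Kanari's actual data model. In practice you would fill the snapshot from your own metrics.

```python
import time
from dataclasses import dataclass

HEARTBEAT_TIMEOUT = 60   # seconds; assumed value for illustration
WINDOW_SECONDS = 300     # throughput window; assumed value


@dataclass
class WorkerSnapshot:
    """Hypothetical view of one worker, filled from your own metrics."""
    last_heartbeat: float   # epoch seconds of the last heartbeat seen
    completed_at: list      # epoch seconds of each finished task
    queue_depth: int        # pending messages in the queue it consumes


def is_alive(snap, now):
    # Connectivity check: only looks at the heartbeat age.
    return (now - snap.last_heartbeat) < HEARTBEAT_TIMEOUT


def is_processing(snap, now, window=WINDOW_SECONDS):
    # Throughput check: if work is waiting, at least one task
    # should have finished inside the window.
    recent = [t for t in snap.completed_at if now - t <= window]
    if snap.queue_depth > 0 and not recent:
        return False  # ghost worker
    return True


now = time.time()
ghost = WorkerSnapshot(last_heartbeat=now - 2,
                       completed_at=[now - 1800],
                       queue_depth=214)
print(is_alive(ghost, now), is_processing(ghost, now))  # prints: True False
```

The heartbeat is two seconds old, so the connectivity check passes. The last completion was 30 minutes ago with 214 messages waiting, so the throughput check fails.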

How to detect it manually

If you suspect a ghost worker, here's the Redis-level check:

# Check how many tasks are in the queue
$ redis-cli llen celery

# Check what the worker claims it's doing
$ celery -A myapp inspect active

# Look for tasks stuck in "reserved" state for >10 min
$ celery -A myapp inspect reserved

Expected output when everything is healthy:

-> celery@worker-1: OK
    - task_id: abc123  eta: None  started: 2026-06-15 02:11:03

-> celery@worker-2: OK
    (empty)

But a ghost worker looks like this, with tasks claimed but never finishing:

-> celery@worker-1: OK
    - task_id: abc123  eta: None  started: 2026-06-15 01:43:18   ← 28 min ago
    - task_id: def456  eta: None  started: 2026-06-15 01:43:19   ← 28 min ago
    - task_id: ghi789  eta: None  started: 2026-06-15 01:43:21   ← 28 min ago

-> celery@worker-2: OK
    (empty)

Worker 1 has been "running" those three tasks for 28 minutes. They're prefetched and stuck.
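
Eyeballing timestamps at 2am is error-prone, so the comparison can be scripted. The sketch below assumes the inspect output has already been parsed into a plain dict. The `id` and `started` keys and the 10-minute threshold are illustrative assumptions, not the real field names of Celery's inspect payload, which vary by version.

```python
from datetime import datetime, timedelta

STUCK_AFTER = timedelta(minutes=10)  # assumed threshold; tune per workload


def find_stuck_tasks(tasks_by_worker, now, stuck_after=STUCK_AFTER):
    """Return {worker: [task ids]} for tasks older than the threshold.

    tasks_by_worker is a hypothetical, already-parsed structure:
    {"celery@worker-1": [{"id": "abc123", "started": datetime(...)}, ...]}
    """
    stuck = {}
    for worker, tasks in tasks_by_worker.items():
        old = [t["id"] for t in tasks if now - t["started"] > stuck_after]
        if old:
            stuck[worker] = old
    return stuck


# Timestamps copied from the ghost-worker output above.
snapshot = {
    "celery@worker-1": [
        {"id": "abc123", "started": datetime(2026, 6, 15, 1, 43, 18)},
        {"id": "def456", "started": datetime(2026, 6, 15, 1, 43, 19)},
        {"id": "ghi789", "started": datetime(2026, 6, 15, 1, 43, 21)},
    ],
    "celery@worker-2": [],
}
now = datetime(2026, 6, 15, 2, 11, 32)
print(find_stuck_tasks(snapshot, now))
# prints: {'celery@worker-1': ['abc123', 'def456', 'ghi789']}
```

Workers with nothing stuck are left out of the result, so an empty dict means nothing crossed the threshold.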

What Kanari catches

Running kanari audit against the same setup:

🔍 Kanari Audit
════════════════════════════════════════

❌ System: 2 issues found

Workers
  ✅  worker-2    0/4   online, idle
  🔥  worker-1    4/4   at capacity — 0 tasks completed in last 30 min

Queues               Pending   Oldest task
  ✅  celery-low       0         —
  🔥  emails         214       28m 14s

  [CRITICAL]  GHOST_WORKER
              worker-1 holds 4 prefetched tasks with 0 completions
              in the last 1800s. Probable cause: broker reconnect
              with prefetch_multiplier > 1.

  [HIGH]      QUEUE_SLA_BREACH
              emails — oldest task waiting 28m 14s (threshold: 5m)

  💡 Fix: celery -A myapp control revoke <task_ids> --terminate
     Then: systemctl restart celery@worker-1

     To prevent recurrence, set CELERYD_PREFETCH_MULTIPLIER=1
     in your worker config.

════════════════════════════════════════
Audit completed in 1.8s

Three things Kanari does that inspect doesn't:

Correlates queue depth + worker throughput over time
Names the pattern (GHOST_WORKER) so you know what you're dealing with
Gives you the exact commands to fix it

The fix

Once you've identified the ghost worker, the recovery is two steps:

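Step 1: revoke the stuck tasks. Revoking tells workers to discard those task IDs; it does not put them back on the queue, so re-enqueue anything that still has to run.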

# Revoke by task ID (from the inspect output above)
$ celery -A myapp control revoke abc123 def456 ghi789 --terminate

# Or revoke everything reserved by that worker
$ celery -A myapp control revoke $(
    celery -A myapp inspect reserved --json |
    jq -r '.["celery@worker-1"][].id'
  ) --terminate
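
If you would rather do this from Python than from a shell pipeline, the same step can be sketched with Celery's remote-control API: `app.control.inspect().reserved()` to list reserved tasks and `app.control.revoke(task_id, terminate=True)` to revoke them. The helper names below are illustrative, and the functions take the reply and the control object as arguments so they can be exercised without a broker.

```python
def reserved_task_ids(reserved_reply, worker_name):
    """Collect task ids from an inspect-reserved style reply.

    reserved_reply maps worker name -> list of task dicts with an "id" key,
    or is None when no worker answered.
    """
    if not reserved_reply:
        return []
    return [task["id"] for task in reserved_reply.get(worker_name) or []]


def revoke_all(control, task_ids, terminate=True):
    """Revoke each id through a Celery-style control object."""
    for task_id in task_ids:
        control.revoke(task_id, terminate=terminate)
    return len(task_ids)


# Intended use against a real app (assumes `app` is your Celery instance):
#   reply = app.control.inspect().reserved()
#   revoke_all(app.control, reserved_task_ids(reply, "celery@worker-1"))
```

Revoking one ID per call keeps the sketch independent of how a given Celery version handles lists of IDs.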

Step 2: restart that worker

# systemd
$ sudo systemctl restart celery@worker-1

# supervisor
$ supervisorctl restart celery:worker-1

# docker
$ docker restart worker-1

Preventing it

The usual root cause is a prefetch multiplier (CELERYD_PREFETCH_MULTIPLIER) set too high, combined with late acknowledgement turned off.

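# settings.py: safer defaults
# Current Celery docs list these settings as worker_prefetch_multiplier,
# task_acks_late and task_reject_on_worker_lost; check which naming
# your Celery version and config style expect.

# Prefetch 1 task at a time per worker process
# Prevents a stalled worker from holding multiple tasks hostage
CELERYD_PREFETCH_MULTIPLIER = 1

# Acknowledge tasks AFTER completion, not on pickup
# Tasks go back to queue if the worker dies mid-execution
CELERY_ACKS_LATE = True

# Re-queue a task when the process running it exits abruptly,
# instead of acknowledging it
CELERY_REJECT_ON_WORKER_LOST = True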

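With PREFETCH_MULTIPLIER=1, a ghost worker holds at most one task per worker process hostage instead of four (Celery's default multiplier). With ACKS_LATE=True, a task whose worker crashes hard is redelivered instead of being lost; on Redis that happens once the broker's visibility timeout expires.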
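
The arithmetic behind that is worth spelling out. Celery documents the prefetch limit as the multiplier times the number of concurrent pool processes, so the number of tasks a stalled worker can sit on grows with both. The helper name below is illustrative, and the concurrency of 4 is an assumption chosen to match the 4-slot workers in the audit output.

```python
def max_reserved(concurrency, prefetch_multiplier):
    """Upper bound on messages one worker reserves ahead of time.

    Follows Celery's documented rule: multiplier times the number of
    concurrent pool processes. Illustrative helper, not a Celery API.
    """
    return concurrency * prefetch_multiplier


for multiplier in (4, 1):  # 4 is Celery's default multiplier
    held = max_reserved(4, multiplier)
    print(f"concurrency=4, multiplier={multiplier}: up to {held} tasks held")
# prints:
# concurrency=4, multiplier=4: up to 16 tasks held
# concurrency=4, multiplier=1: up to 4 tasks held
```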

The takeaway

Flower tells you your workers are connected. Kanari tells you whether they're actually working.

Ghost workers are one of the harder silent failures to catch because every tool that uses heartbeats will miss them. The signal you need is throughput over time, not connectivity.

kanari audit runs this check automatically in ~2 seconds. kanari watch runs it continuously and pages you the moment a worker stalls.

$ pip install kanari
$ kanari audit          # one-shot check right now
$ kanari watch          # continuous monitoring

No dashboards. Just an alert when something's actually wrong.
