Hernan Chilabert

Posted on • Originally published at getkanari.com

Ghost Workers: Why Your Celery Dashboard Lies to You

The scenario you've lived

It's 2am. Someone pings you: emails aren't going out. You check Flower. Every worker is green: online, healthy. Queue depth is climbing. Nothing is moving.

This is a ghost worker: a process that's alive, but not consuming.

What actually happens

When a Celery worker reconnects to Redis after a network hiccup, it can end up in a broken state where:

  • It maintains its heartbeat (so Flower shows it as online)
  • It holds its prefetched tasks (so they appear "running")
  • But it never ACKs or re-queues them (so they're stuck forever)

The result: your dashboard shows everything healthy while your queue backs up.

Why Flower misses it

Flower tracks worker processes, not task throughput. It asks "is the worker connected?", not "is the worker actually doing work?".

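Here's roughly the check Flower does:

# Flower's worker health check (simplified illustration, not Flower's source)
import time

HEARTBEAT_TIMEOUT = 60  # seconds; illustrative value

def is_worker_alive(worker):
    last_heartbeat = worker.last_heartbeat
    return (time.time() - last_heartbeat) < HEARTBEAT_TIMEOUT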

A ghost worker keeps sending heartbeats. It's "alive" by this definition.

What you actually want is:

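# What you'd need to detect a ghost
# get_task_count() and get_queue_depth() are placeholders for whatever
# metrics source you have (task events, a result backend, Redis LLEN).
import time

def is_worker_actually_processing(worker, queue, window_seconds=300):
    tasks_processed = get_task_count(worker, since=time.time() - window_seconds)
    queue_depth = get_queue_depth(queue)

    # Work is waiting but nothing finished in the window: ghost worker
    if queue_depth > 0 and tasks_processed == 0:
        return False
    return True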

The difference: throughput over time, not just connectivity.
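
To make "throughput over time" concrete, here is a minimal sketch of a stall detector that works on periodic samples. Everything in it is an assumption made for illustration: the names `Sample` and `StallDetector`, the idea that you can read a cumulative completed-task counter per worker, and the 300-second window. It flags a ghost only when the counter has not moved for a full window while the queue was never empty.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    ts: float             # when the sample was taken, in seconds
    completed_total: int  # cumulative tasks this worker has finished
    queue_depth: int      # messages waiting in the queue at that moment


class StallDetector:
    """Flags a worker that completes nothing while a backlog persists."""

    def __init__(self, window_seconds=300.0):
        self.window_seconds = window_seconds
        self._samples = deque()

    def add(self, sample):
        self._samples.append(sample)
        # Drop aged-out samples, keeping one at or before the cutoff
        # so there is always a baseline a full window old.
        cutoff = sample.ts - self.window_seconds
        while len(self._samples) >= 2 and self._samples[1].ts <= cutoff:
            self._samples.popleft()

    def is_ghost(self):
        if len(self._samples) < 2:
            return False
        first, last = self._samples[0], self._samples[-1]
        if last.ts - first.ts < self.window_seconds:
            return False  # not enough history to judge yet
        no_progress = last.completed_total == first.completed_total
        backlog_throughout = all(s.queue_depth > 0 for s in self._samples)
        return no_progress and backlog_throughout


detector = StallDetector(window_seconds=300)
detector.add(Sample(ts=0, completed_total=10, queue_depth=5))
detector.add(Sample(ts=300, completed_total=10, queue_depth=12))
print(detector.is_ghost())
```

A worker legitimately busy with one task longer than the window would also trip this check, so the window should be longer than your slowest task.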

How to detect it manually

If you suspect a ghost worker, check the queue length in Redis, then ask the workers what they're holding:

# Check how many tasks are in the queue
$ redis-cli llen celery

# Check what the worker claims it's doing
$ celery -A myapp inspect active

# Look for tasks stuck in "reserved" state for >10 min
$ celery -A myapp inspect reserved

Output of inspect active when everything is healthy (simplified):

-> celery@worker-1: OK
    - task_id: abc123  eta: None  started: 2026-06-15 02:11:03

-> celery@worker-2: OK
    (empty)

But a ghost worker looks like this: tasks claimed but never finishing.

-> celery@worker-1: OK
    - task_id: abc123  eta: None  started: 2026-06-15 01:43:18   ← 28 min ago
    - task_id: def456  eta: None  started: 2026-06-15 01:43:19   ← 28 min ago
    - task_id: ghi789  eta: None  started: 2026-06-15 01:43:21   ← 28 min ago

-> celery@worker-2: OK
    (empty)

Worker 1 has been "running" those three tasks for 28 minutes. They're prefetched and stuck.
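
The same comparison can be scripted instead of eyeballed. This is a minimal sketch, assuming a reply shaped like the one returned by Celery's `app.control.inspect().active()`: a dict mapping worker name to a list of task dicts that carry `id` and `time_start`. It also assumes `time_start` is a Unix timestamp, which is worth verifying against your Celery version. The function name `stuck_tasks` and the 10-minute threshold are illustrative.

```python
import time


def stuck_tasks(active_reply, max_age_seconds=600.0, now=None):
    """Return (worker, task_id, age_seconds) for tasks active too long."""
    now = time.time() if now is None else now
    stuck = []
    for worker, tasks in (active_reply or {}).items():
        for task in tasks or []:
            started = task.get("time_start")
            if started is None:
                continue  # no start time reported, so no age to compute
            age = now - started
            if age > max_age_seconds:
                stuck.append((worker, task["id"], age))
    return stuck


# Hand-written reply for illustration; a real one would come from
# app.control.inspect().active() on a configured Celery app.
reply = {
    "celery@worker-1": [
        {"id": "abc123", "time_start": 1000.0},
        {"id": "def456", "time_start": 2600.0},
    ],
    "celery@worker-2": [],
}
for worker, task_id, age in stuck_tasks(reply, now=2680.0):
    print(f"{worker} {task_id} running for {age / 60:.0f} min")
```

Passing `now` explicitly keeps the function easy to test; in production you would leave it unset.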

What Kanari catches

Running kanari audit against the same setup:

šŸ” Kanari Audit
════════════════════════════════════════

āŒ System: 2 issues found

Workers
  āœ…  worker-2    0/4   online, idle
  šŸ”„  worker-1    4/4   at capacity — 0 tasks completed in last 30 min

Queues               Pending   Oldest task
  āœ…  celery-low       0         —
  šŸ”„  emails         214       28m 14s

  [CRITICAL]  GHOST_WORKER
              worker-1 holds 4 prefetched tasks with 0 completions
              in the last 1800s. Probable cause: broker reconnect
              with prefetch_multiplier > 1.

  [HIGH]      QUEUE_SLA_BREACH
              emails — oldest task waiting 28m 14s (threshold: 5m)

  💡 Fix: celery -A myapp control revoke <task_ids> --terminate
     Then: systemctl restart celery@worker-1

     To prevent recurrence, set CELERYD_PREFETCH_MULTIPLIER=1
     in your worker config.

════════════════════════════════════════
Audit completed in 1.8s

Three things Kanari does that inspect doesn't:

  1. Correlates queue depth + worker throughput over time
  2. Names the pattern (GHOST_WORKER) so you know what you're dealing with
  3. Gives you the exact commands to fix it

The fix

Once you've identified the ghost worker, the recovery is two steps:

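Step 1: revoke the stuck tasks. Revoking tells workers to discard those task IDs; it does not return them to the queue, so note the IDs and re-submit any work that still has to run.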

# Revoke by task ID (from the inspect output above)
$ celery -A myapp control revoke abc123 def456 ghi789 --terminate

# Or revoke everything reserved by that worker
$ celery -A myapp control revoke $(
    celery -A myapp inspect reserved --json |
    jq -r '.["celery@worker-1"][].id'
  ) --terminate
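
If you'd rather do the jq step in Python, here is a minimal sketch. It assumes the reply has the shape returned by Celery's `app.control.inspect().reserved()`, a dict mapping worker name to a list of task dicts with an `id` key. The helper name `reserved_task_ids` is illustrative.

```python
def reserved_task_ids(reserved_reply, worker_name):
    """Collect the IDs of tasks one worker currently has reserved."""
    tasks = (reserved_reply or {}).get(worker_name) or []
    return [task["id"] for task in tasks if "id" in task]


# Hand-written reply for illustration only.
reply = {
    "celery@worker-1": [{"id": "abc123"}, {"id": "def456"}, {"id": "ghi789"}],
    "celery@worker-2": [],
}
ids = reserved_task_ids(reply, "celery@worker-1")
print(ids)
```

With a configured app, the resulting list can be passed to `app.control.revoke(ids, terminate=True)`. Keep a copy of the IDs first so you know which tasks were affected.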

Step 2: restart that worker

# systemd
$ sudo systemctl restart celery@worker-1

# supervisor
$ supervisorctl restart celery:worker-1

# docker
$ docker restart worker-1

Preventing it

The usual root cause is a high CELERYD_PREFETCH_MULTIPLIER combined with late acknowledgement (CELERY_ACKS_LATE) left off.

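# settings.py — safer defaults
# These are old-style uppercase names. In Celery 4+ the lowercase equivalents
# are worker_prefetch_multiplier, task_acks_late, task_reject_on_worker_lost.

# Prefetch 1 task at a time per worker process
# Prevents a crashed worker from holding multiple tasks hostage
CELERYD_PREFETCH_MULTIPLIER = 1

# Acknowledge tasks AFTER completion, not on pickup
# Tasks go back to queue if the worker dies mid-execution
CELERY_ACKS_LATE = True

# Requeue a task instead of acknowledging it when the process running it
# exits abruptly (only has an effect together with late acks)
CELERY_REJECT_ON_WORKER_LOST = True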

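With PREFETCH_MULTIPLIER=1, a ghost worker holds at most one task per worker process hostage instead of four (Celery's default multiplier). With ACKS_LATE=True, a task that was running during a hard crash stays unacknowledged, so the broker redelivers it. With the Redis transport, that redelivery happens once the visibility timeout expires.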
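
The arithmetic behind that claim is worth spelling out. Celery documents the prefetch limit as the multiplier times the number of concurrent worker processes, and treats a multiplier of 0 as "no limit". The sketch below computes that bound; the function name `prefetch_limit` and the concurrency of 4 are illustrative assumptions.

```python
def prefetch_limit(concurrency, prefetch_multiplier):
    """Upper bound on messages a worker may prefetch, or None if unlimited."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    if prefetch_multiplier < 0:
        raise ValueError("prefetch_multiplier cannot be negative")
    if prefetch_multiplier == 0:
        return None  # Celery treats 0 as "no prefetch limit"
    return concurrency * prefetch_multiplier


# A worker with 4 processes: default multiplier vs. multiplier of 1.
print(prefetch_limit(4, 4))  # 16
print(prefetch_limit(4, 1))  # 4
```

So on a 4-process worker, dropping the multiplier from 4 to 1 cuts the number of tasks a ghost can sit on from 16 to 4, which is the per-process "four versus one" from the text.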

The takeaway

Flower tells you your workers are connected. Kanari tells you whether they're actually working.

Ghost workers are one of the harder silent failures to catch because any tool that relies only on heartbeats will miss them. The signal you need is throughput over time, not connectivity.

kanari audit runs this check automatically in ~2 seconds. kanari watch runs it continuously and pages you the moment a worker stalls.

$ pip install kanari-agent
$ kanari audit          # one-shot check right now
$ kanari watch          # continuous monitoring

No dashboards. Just an alert when something's actually wrong.
