When a Celery task fails after all retries, where does it go? By default, Celery with a Redis broker just marks the task as failed and moves on. The failed task is logged, but it's not stored anywhere accessible -- and unless you're watching logs or have monitoring set up, you might not know it happened until something downstream breaks.
Without a dead-letter queue, you have two choices when a task fails permanently: lose the data silently, or stop processing everything until the issue is manually fixed. Neither is acceptable for a production pipeline where missing even a small percentage of events can cause data integrity problems downstream.
A dead-letter queue (DLQ) changes this: failed tasks get routed to a dedicated queue where you can inspect them, requeue them manually when the underlying issue is fixed, or log them for audit. Setting one up in Celery with Redis requires a small amount of configuration but dramatically improves your ability to operate data pipelines reliably.
Understanding Celery's Retry and Failure Flow
Before configuring a DLQ, it helps to understand Celery's default behavior:
- A task is called. It executes and either succeeds (returns a value) or fails (raises an exception).
- If it fails and
max_retriesis set, Celery schedules a retry after the configuredcountdown. - If it fails after
max_retriesretries, Celery raisesMaxRetriesExceededError. The task entersFAILUREstate. - The failure is logged, the result is stored in the result backend (if configured), and nothing else happens.
There is no built-in routing of failed tasks to a separate queue. You have to build that.
Option 1: Route to a DLQ Using task_failure Signal
The cleanest approach for Redis-backed Celery is to use the task_failure signal to route failed tasks to a dedicated Redis list:
from celery.signals import task_failure
import redis
import json
from datetime import datetime
r = redis.Redis(host='localhost', port=6379, db=0)
DLQ_KEY = "celery_dlq"
@task_failure.connect
def handle_task_failure(sender=None, task_id=None, exception=None,
args=None, kwargs=None, traceback=None, einfo=None, **kw):
dlq_entry = {
"task_id": task_id,
"task_name": sender.name if sender else "unknown",
"exception": str(exception),
"args": str(args),
"kwargs": str(kwargs),
"failed_at": datetime.utcnow().isoformat()
}
r.rpush(DLQ_KEY, json.dumps(dlq_entry))
Connect this signal in your Celery app initialization. Every task that exhausts its retries will push an entry to the celery_dlq list in Redis.
You can then inspect DLQ contents from a management script or a monitoring dashboard:
def inspect_dlq(limit=100):
entries = r.lrange(DLQ_KEY, 0, limit - 1)
for entry in entries:
print(json.loads(entry))
Option 2: Dedicated DLQ Task Queue
An alternative is to route failed tasks to a dedicated Celery queue rather than a raw Redis list. This allows DLQ items to be requeued as Celery tasks when the underlying issue is resolved:
from celery import Celery
from celery.signals import task_failure
app = Celery('myapp', broker='redis://localhost:6379/0')
@app.task(name='dlq.failed_task')
def dead_letter_task(original_task_name: str, task_id: str,
exception: str, args: list, kwargs: dict):
# Store in Redis or log -- this task is for inspection, not re-processing
print(f"DLQ: {original_task_name} [{task_id}] failed with: {exception}")
@task_failure.connect
def on_task_failure(sender=None, task_id=None, exception=None,
args=None, kwargs=None, **kw):
dead_letter_task.apply_async(
args=[sender.name if sender else "unknown", task_id,
str(exception), list(args or []), dict(kwargs or {})],
queue='dlq'
)
Running a dedicated celery worker -Q dlq processes only DLQ tasks. This separation keeps DLQ handling isolated from normal task workers and makes it visible as a queue in any Celery monitoring tool.
Option 3: Custom Task Base Class With Built-In DLQ
For a clean approach that doesn't require signal handlers, create a custom base task:
from celery import Task
import redis
import json
from datetime import datetime
r = redis.Redis(host='localhost', port=6379, db=0)
class DLQTask(Task):
abstract = True
max_retries = 3
def on_failure(self, exc, task_id, args, kwargs, einfo):
entry = {
"task_id": task_id,
"task_name": self.name,
"exception": str(exc),
"traceback": str(einfo),
"failed_at": datetime.utcnow().isoformat()
}
r.rpush("celery_dlq", json.dumps(entry))
super().on_failure(exc, task_id, args, kwargs, einfo)
@app.task(base=DLQTask, bind=True)
def process_event(self, event_data: dict):
# Processing logic
pass
Any task that inherits from DLQTask automatically routes failures to the DLQ.
Adding Alerts on DLQ Depth
A DLQ without monitoring is only marginally better than no DLQ. Add a simple depth check that runs on a schedule:
def check_dlq_depth(threshold: int = 10):
depth = r.llen("celery_dlq")
if depth > threshold:
# Send alert -- Slack webhook, PagerDuty, email, etc.
send_alert(f"DLQ depth is {depth} (threshold: {threshold})")
Run this every 5 minutes with a lightweight scheduler or as part of your existing monitoring. Datadog, Prometheus, and most APM platforms can scrape Redis metrics directly and alert on list length.
Re-queuing DLQ Messages
Once the underlying issue (bad data, downstream service outage, schema mismatch) is resolved, re-queuing DLQ messages means inspecting the failure entries and re-dispatching them:
def requeue_dlq_entries(limit: int = 100):
entries = r.lrange("celery_dlq", 0, limit - 1)
for raw in entries:
entry = json.loads(raw)
task_name = entry.get("task_name")
original_kwargs = entry.get("kwargs", {})
# Re-dispatch -- requires task be importable by name
app.send_task(task_name, kwargs=original_kwargs)
# Clear re-queued entries
r.ltrim("celery_dlq", len(entries), -1)
This script reads up to 100 DLQ entries, re-dispatches them as Celery tasks, and removes them from the DLQ list.
What to Include in Each DLQ Entry
The DLQ entry is your primary debugging artifact. What you store determines how quickly you can diagnose and fix the underlying issue when you finally look at it.
At minimum, every DLQ entry should include:
The task name. Which Celery task failed. This lets you group failures by task type: if all failures are from one task name, the issue is likely in that task's code or configuration. If failures span many task types, the issue is likely in the data or a shared dependency.
The task ID. The Celery task ID lets you cross-reference DLQ entries with your application logs. If you have structured logging in your tasks that includes task_id at start and completion, you can trace a failed task from the DLQ entry back to its full execution log.
The exception string. str(exc) is usually sufficient. The exception type tells you whether the failure is transient (network error, connection timeout) or permanent (schema validation error, missing required field).
The traceback. For permanent failures, the traceback is the fastest path to the specific line of code that failed. The einfo parameter in on_failure provides a formatted traceback via str(einfo).
The task arguments. str(args) and str(kwargs) give you the inputs to the failed task. This is essential for debugging data-specific failures where only certain inputs trigger the error.
The failure timestamp. datetime.utcnow().isoformat() at failure time. When you're reviewing DLQ contents hours after the failure occurred, knowing when each entry failed helps determine whether failures are ongoing or historical.
All three options above include the core fields. If you're building a custom DLQ handler, start with {task_name, task_id, exception, failed_at} as the minimum viable entry, and add traceback and args when you've encountered a debugging session where you wished you had them.
Summary
Setting up a DLQ for Celery with Redis requires three things:
- A mechanism to capture failed tasks (signal handler, base class override, or custom queue)
- Storage for failed task details (Redis list or dedicated Celery queue)
- Monitoring for DLQ depth with alerting
The signal handler approach (Option 1) is the simplest to add to an existing pipeline. The custom base class approach (Option 3) is cleanest for new pipelines where every task should have DLQ behavior by default.
A fourth piece worth adding: a requeue mechanism paired with a clear DLQ review process. The requeue script above reads DLQ entries and re-dispatches them as tasks after the underlying issue is fixed. Without that step, a DLQ is a collection bin for lost work rather than a recoverable buffer. Test the full loop -- failed tasks land in DLQ, alert fires, root cause is fixed, messages are requeued -- in staging before relying on it in production.
The architectural context for why DLQs are necessary in event-driven pipelines -- and how they fit into the broader producer-consumer pattern -- is covered in the full guide from 137Foundry: "How to Build an Event-Driven Data Pipeline With Python and Message Queues". That piece covers producer setup, consumer acknowledgment, and the failure handling design decisions that this implementation guide builds on.
Top comments (0)