DEV Community

Rahim Ranxx
Why I Added Redis Streams Between My Django API and Celery Workers.

A practical engineering breakdown of how I introduced Redis Streams into a live Django + Celery NDVI pipeline without rewriting the worker layer.

Introduction

I run a Django API backed by Celery workers for NDVI processing workloads.

The execution layer worked fine.

The queue semantics didn’t.

I needed:

  • durable ingestion
  • replay visibility
  • dead-letter handling
  • stale consumer recovery
  • rollback safety
  • observability during incidents

…but I did not want to rewrite the worker system or destabilize production.

So instead of replacing Celery, I inserted Redis Streams between the API and the workers.

This article explains why I made that decision, how the architecture works, and what I learned while implementing reliable stream-backed NDVI ingestion in Django.


The Original Problem

The problem was not task execution.

The problem was everything before execution.

Originally, NDVI ingestion looked like this:

```
Django API → Celery Broker → Celery Workers
```

At first, this worked well.

But as the system evolved, operational gaps became more obvious:

  • Direct .delay() calls tightly coupled request ingestion to broker behavior.
  • Queue visibility was limited during incidents.
  • Failed ingestion paths were harder to replay safely.
  • In-flight recovery semantics were weak.
  • There was no dead-letter workflow for poisoned messages.
  • Worker interruptions could leave messages in uncertain states.

The architecture was fast.

It was not durable enough.


Why I Did Not Replace Celery

One of the biggest architectural decisions was choosing not to replace Celery.

That decision reduced risk dramatically.

Celery already handled:

  • worker orchestration
  • task retries
  • execution concurrency
  • scheduling
  • routing
  • operational familiarity

Replacing the worker layer would have increased migration complexity and expanded the failure domain.

Instead, I treated Redis Streams as an ingestion and reliability layer.

The resulting architecture looked like this:

```
Django API
    ↓
Dispatch Boundary
    ↓
Redis Streams (XADD)
    ↓
Consumer Group (XREADGROUP)
    ↓
Celery Queue
    ↓
NDVI Workers
```

Failures route into a dead-letter stream.

Stale consumers are recovered through reclaim logic.

Most importantly, rollback remains simple.


Centralizing Dispatch Before Adding Redis Streams

Before introducing Redis Streams, I centralized every NDVI enqueue path.

This was the most important migration step.

Instead of scattering direct .delay() calls across the codebase, everything flowed through dispatch helpers.

```python
from ndvi.dispatch import dispatch_ndvi_job

job = enqueue_job(...)
dispatch_ndvi_job(job)
```

That allowed one configuration flag to control the ingestion backend.

```python
NDVI_QUEUE_BACKEND = env("NDVI_QUEUE_BACKEND", default="celery")
```

Supported modes: `celery` and `stream`.

This created a clean migration boundary.

The system could switch between direct Celery dispatch and Redis Streams without changing every call site.

Operationally, this mattered more than the stream code itself.
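A minimal sketch of what that dispatch helper can look like (the helper bodies, the env lookup, and the return values here are my own illustration, not the production code): one flag picks the backend, and call sites never know which path ran.

```python
import os

# Hypothetical stand-ins for the real enqueue paths in ndvi.dispatch.
def _enqueue_celery(job):
    # In production this would be process_ndvi_job.delay(job["job_id"]).
    return ("celery", job["job_id"])

def _publish_stream(job):
    # In production this would XADD the payload to the NDVI stream.
    return ("stream", job["job_id"])

_BACKENDS = {"celery": _enqueue_celery, "stream": _publish_stream}

def dispatch_ndvi_job(job, backend=None):
    """Route one NDVI job to the configured ingestion backend."""
    backend = backend or os.environ.get("NDVI_QUEUE_BACKEND", "celery")
    try:
        handler = _BACKENDS[backend]
    except KeyError:
        raise ValueError(f"unknown NDVI_QUEUE_BACKEND: {backend!r}")
    return handler(job)
```

Because the flag is read per dispatch, flipping it changes behavior without touching any call site.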


Publishing NDVI Jobs into Redis Streams

The producer layer publishes deterministic NDVI payloads into a Redis stream.

Example:

```python
payload = {
    "job_id": job.id,
    "request_hash": job.request_hash,
    "farm_id": job.farm_id,
    "engine": job.engine,
    "job_type": job.job_type,
    "enqueue_timestamp": time.time(),
}

redis_client.xadd(
    settings.NDVI_STREAM_NAME,
    payload,
    maxlen=settings.NDVI_STREAM_MAXLEN,
    approximate=True,
)
```

Key design decisions:

  • request_hash acts as the idempotency key.
  • Approximate MAXLEN trimming keeps memory bounded.
  • Stream payloads remain deterministic.
  • Producers do not execute business logic.

The stream became the ingestion ledger.


Redis Streams Consumer Design

The consumer reads from Redis Streams and forwards work into Celery.

Example:

```python
messages = redis_client.xreadgroup(
    groupname=settings.NDVI_STREAM_GROUP,
    consumername=consumer_name,
    streams={settings.NDVI_STREAM_NAME: ">"},
    count=settings.NDVI_STREAM_BATCH_SIZE,
    block=settings.NDVI_STREAM_BLOCK_MS,
)
```

For every message:

  1. Deserialize payload
  2. Validate structure
  3. Apply idempotency safeguards
  4. Enqueue Celery task
  5. Acknowledge stream entry

```python
process_ndvi_job.delay(job_id)

redis_client.xack(
    settings.NDVI_STREAM_NAME,
    settings.NDVI_STREAM_GROUP,
    message_id,
)
```

The stream consumer remains intentionally thin.

Its job is reliable transport and recovery.

Celery still handles execution.
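Putting those steps together, a single-message handler might look like the sketch below. This is a simplification: the idempotency lock described later is omitted, and the helper name and DLQ routing are my own, though `process_ndvi_job` style handoff matches the article.

```python
def handle_entry(redis_client, task, stream, group, message_id, fields):
    """Process one stream entry: validate, hand off to Celery, then ack.

    `task` is the Celery task object; `fields` is the raw field dict
    returned by XREADGROUP (bytes keys/values by default in redis-py).
    """
    try:
        job_id = int(fields[b"job_id"])
    except (KeyError, TypeError, ValueError):
        # Structurally invalid payload: dead-letter it, then ack so the
        # consumer group stops redelivering a message that can never succeed.
        redis_client.xadd(f"{stream}:dlq", {"original_id": message_id, **fields})
        redis_client.xack(stream, group, message_id)
        return False
    task.delay(job_id)                             # execution stays in Celery
    redis_client.xack(stream, group, message_id)   # ack only after enqueue succeeds
    return True
```

Ordering matters here: acknowledging only after the Celery enqueue succeeds means a crash between the two steps leaves the entry pending, where reclaim logic can recover it.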


Why Consumer Groups Matter

Redis Streams consumer groups solved several operational problems immediately.

They provided:

  • cooperative work distribution
  • independent consumer identities
  • pending-entry tracking
  • reclaim support
  • replay visibility

Unlike simple queue semantics, Redis Streams expose message lifecycle state.

That visibility becomes extremely valuable during failures.

Message lifecycle:

```
XADD → pending → reclaimed → acknowledged
                          ↓
                         DLQ
```

This made queue recovery observable instead of implicit.
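That lifecycle state is queryable at any time via XPENDING. A small inspection helper, sketched here with redis-py's `xpending_range` (the helper name and output shape are my own), shows what "observable" means in practice:

```python
def pending_overview(redis_client, stream, group, count=50):
    """Summarize in-flight entries: delivered to a consumer but not yet acked.

    redis-py's xpending_range returns one dict per pending entry with its
    message id, owning consumer, idle time, and delivery count.
    """
    entries = redis_client.xpending_range(stream, group, min="-", max="+", count=count)
    return [
        {
            "id": e["message_id"],
            "consumer": e["consumer"],
            "idle_ms": e["time_since_delivered"],
            "deliveries": e["times_delivered"],
        }
        for e in entries
    ]
```

A high idle time points at a stalled consumer; a high delivery count points at a poison message headed for the DLQ.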


Recovering Stale Messages with XAUTOCLAIM

The most important recovery primitive ended up being XAUTOCLAIM.

If a consumer dies after reading a message but before acknowledging it, the entry remains pending indefinitely unless another consumer reclaims it.

Without reclaim logic, stream durability is incomplete.

Example reclaim loop:

```python
messages = redis_client.xautoclaim(
    name=settings.NDVI_STREAM_NAME,
    groupname=settings.NDVI_STREAM_GROUP,
    consumername=consumer_name,
    min_idle_time=settings.NDVI_STREAM_CLAIM_IDLE_MS,
    start_id="0-0",
    count=settings.NDVI_STREAM_BATCH_SIZE,
)
```

This allows healthy consumers to recover abandoned work automatically.

That changed the reliability profile of the ingestion pipeline significantly.


Dead-Letter Queue Handling

I also introduced a dedicated dead-letter stream.

Messages are routed into the DLQ when:

  • validation fails
  • delivery ceilings are exceeded
  • payloads become structurally invalid
  • repeated execution attempts fail

Example:

```python
redis_client.xadd(
    settings.NDVI_STREAM_DLQ_NAME,
    dlq_payload,
)
```

Every DLQ entry includes:

  • original message ID
  • delivery count
  • failure reason
  • serialized payload
  • timestamps

This made operational debugging dramatically easier.
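A `dlq_payload` carrying those fields can be assembled with a small pure helper. The field names below are illustrative, not the production schema; note that stream field values must be flat strings or bytes, so the nested payload gets serialized:

```python
import json
import time

def build_dlq_payload(message_id, fields, reason, delivery_count):
    """Capture everything needed to triage — and later replay — a failed entry."""
    return {
        "original_id": message_id,
        "delivery_count": str(delivery_count),
        "failure_reason": reason,
        "payload": json.dumps(fields),    # XADD values must be flat, so serialize
        "failed_at": str(time.time()),
    }
```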


The Hardest Problem: Idempotency

Redis Streams provide at-least-once delivery.

That means duplicate delivery is expected.

Exactly-once delivery is not guaranteed.

To prevent duplicate NDVI execution, I added multiple protection layers.

Layer 1: Deterministic Request Hash

Every NDVI job already had a deterministic request_hash.

That became the execution identity.

Layer 2: Distributed Redis Lock

The consumer acquires a Redis lock before execution.

```python
lock_key = f"ndvi:lock:{request_hash}"
```

Acquisition uses SETNX semantics with expiration.
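In redis-py that acquisition is a single SET with `nx` and an expiry, storing a random token so the lock can later be released safely. The helper name and TTL below are my own choices, not the production values:

```python
import uuid

def acquire_job_lock(redis_client, request_hash, ttl_ms=60_000):
    """Try to take the per-job lock; return the owner token, or None if held."""
    token = uuid.uuid4().hex
    # SET key value NX PX ttl — atomic create-with-expiry (the modern SETNX)
    if redis_client.set(f"ndvi:lock:{request_hash}", token, nx=True, px=ttl_ms):
        return token
    return None
```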

Layer 3: Token-Based Lock Release

Locks are released through an atomic Lua script.

This prevents blind deletion.

```lua
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
```
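With redis-py, a script like that can be wired up via `register_script` and invoked per release; the token must match the value stored at acquisition. The helper name is an assumption to mirror the acquisition sketch:

```python
RELEASE_LOCK_LUA = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""

def release_job_lock(redis_client, request_hash, token):
    """Delete the lock only if this consumer still owns it (token matches)."""
    script = redis_client.register_script(RELEASE_LOCK_LUA)
    # Returns 1 if released, 0 if ownership was already lost (e.g. TTL expiry)
    return script(keys=[f"ndvi:lock:{request_hash}"], args=[token])
```

The get-compare-delete runs atomically inside Redis, so a consumer whose lock expired mid-task can never delete a lock now owned by someone else.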

Layer 4: Database Status Recheck

Before execution begins, the worker re-checks terminal job state.

This acts as a second safety boundary.

The result is effectively-once execution semantics.

```
At-least-once delivery + idempotent execution = effectively-once processing
```

Observability Added During the Rollout

One major lesson from this migration:

Do not enable stream mode before queue visibility exists.

I added dedicated metrics before enabling the rollout broadly.

Examples:

```
redis_stream_pending_entries
redis_stream_pending_age_max
ndvi_stream_consumer_heartbeat
ndvi_stream_consumer_failures_total
```

I also expanded upstream visibility:

```
ndvi_upstream_requests_total
ndvi_upstream_failures_total
ndvi_upstream_duration_seconds
```

Grafana dashboards now expose:

  • pending stream backlog
  • reclaim frequency
  • DLQ volume
  • consumer liveness
  • upstream API failures
  • queue drain rate

This transformed rollout decisions from guesswork into measurable operations.


Rollback Strategy

Rollback was designed before rollout.

That mattered.

The stream backend is fully feature-flagged:

```python
NDVI_QUEUE_BACKEND = "celery"
```

or

```python
NDVI_QUEUE_BACKEND = "stream"
```

Rollback requires:

  1. environment variable change
  2. process restart

No redeploy.

No task rewrite.

No schema rollback.

This significantly reduced operational fear during rollout.


What Shipped This Week

This week’s rollout included:

  • a 528-line Redis Streams consumer
  • reclaim + DLQ lifecycle handling
  • distributed execution locking
  • token-safe lock release
  • approximately 400 lines of stream-focused tests
  • Prometheus metrics for queue health
  • Grafana visibility for consumer state and lag
  • feature-flag rollback support

Most of the work was not adding Redis.

Most of the work was making failure recovery predictable.


Was It Worth It?

Redis Streams did not simplify the system.

They made failure states explicit.

That introduced additional complexity:

  • reclaim logic
  • idempotency handling
  • consumer lifecycle management
  • DLQ operations
  • stream observability

But the reliability gains were substantial:

  • durable ingestion
  • replay visibility
  • safer recovery semantics
  • backlog introspection
  • controlled rollback
  • observable queue state

For this NDVI pipeline, the tradeoff was worth it.


Final Thoughts

One of the biggest lessons from this migration is that queue evolution is not just about throughput.

It is about operational recovery.

Redis Streams gave the ingestion layer explicit lifecycle semantics:

  • pending
  • acknowledged
  • reclaimed
  • dead-lettered

That visibility fundamentally changed how the system behaves during failures.

And importantly, I achieved that without rewriting the worker layer.

Sometimes the best migration strategy is not replacing your stack.

It is inserting a safer boundary in front of it.
