DEV Community

Rahim Ranxx
Why I Added Redis Streams Between My Django API and Celery Workers.

A practical engineering breakdown of how I introduced Redis Streams into a live Django + Celery NDVI pipeline without rewriting the worker layer.

Introduction

I run a Django API backed by Celery workers for NDVI processing workloads.

The execution layer worked fine.

The queue semantics didn’t.

I needed:

  • durable ingestion
  • replay visibility
  • dead-letter handling
  • stale consumer recovery
  • rollback safety
  • observability during incidents

…but I did not want to rewrite the worker system or destabilize production.

So instead of replacing Celery, I inserted Redis Streams between the API and the workers.

This article explains why I made that decision, how the architecture works, and what I learned while implementing reliable stream-backed NDVI ingestion in Django.


The Original Problem

The problem was not task execution.

The problem was everything before execution.

Originally, NDVI ingestion looked like this:

```
Django API → Celery Broker → Celery Workers
```

At first, this worked well.

But as the system evolved, operational gaps became more obvious:

  • Direct .delay() calls tightly coupled request ingestion to broker behavior.
  • Queue visibility was limited during incidents.
  • Failed ingestion paths were harder to replay safely.
  • In-flight recovery semantics were weak.
  • There was no dead-letter workflow for poisoned messages.
  • Worker interruptions could leave messages in uncertain states.

The architecture was fast.

It was not durable enough.


Why I Did Not Replace Celery

One of the biggest architectural decisions was choosing not to replace Celery.

That decision reduced risk dramatically.

Celery already handled:

  • worker orchestration
  • task retries
  • execution concurrency
  • scheduling
  • routing
  • operational familiarity

Replacing the worker layer would have increased migration complexity and expanded the failure domain.

Instead, I treated Redis Streams as an ingestion and reliability layer.

The resulting architecture looked like this:

```
Django API
    ↓
Dispatch Boundary
    ↓
Redis Streams (XADD)
    ↓
Consumer Group (XREADGROUP)
    ↓
Celery Queue
    ↓
NDVI Workers
```

Failures route into a dead-letter stream.

Stale consumers are recovered through reclaim logic.

Most importantly, rollback remains simple.


Centralizing Dispatch Before Adding Redis Streams

Before introducing Redis Streams, I centralized every NDVI enqueue path.

This was the most important migration step.

Instead of scattering direct .delay() calls across the codebase, everything flowed through dispatch helpers.

```python
from ndvi.dispatch import dispatch_ndvi_job

job = enqueue_job(...)
dispatch_ndvi_job(job)
```

That allowed one configuration flag to control the ingestion backend.

```python
NDVI_QUEUE_BACKEND = env("NDVI_QUEUE_BACKEND", default="celery")
```

Supported modes: `celery` and `stream`.

This created a clean migration boundary.

The system could switch between direct Celery dispatch and Redis Streams without changing every call site.

Operationally, this mattered more than the stream code itself.
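A minimal sketch of what that dispatch helper can look like (the helper bodies, the env lookup, and the return values here are my own illustration, not the production code): one flag picks the backend, and call sites never know which path ran.

```python
import os

# Hypothetical stand-ins for the real enqueue paths in ndvi.dispatch.
def _enqueue_celery(job):
    # In production this would be process_ndvi_job.delay(job["job_id"]).
    return ("celery", job["job_id"])

def _publish_stream(job):
    # In production this would XADD the payload to the NDVI stream.
    return ("stream", job["job_id"])

_BACKENDS = {"celery": _enqueue_celery, "stream": _publish_stream}

def dispatch_ndvi_job(job, backend=None):
    """Route one NDVI job to the configured ingestion backend."""
    backend = backend or os.environ.get("NDVI_QUEUE_BACKEND", "celery")
    try:
        handler = _BACKENDS[backend]
    except KeyError:
        raise ValueError(f"unknown NDVI_QUEUE_BACKEND: {backend!r}")
    return handler(job)
```

Because the flag is read per dispatch, flipping it changes behavior without touching any call site.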


Publishing NDVI Jobs into Redis Streams

The producer layer publishes deterministic NDVI payloads into a Redis stream.

Example:

```python
payload = {
    "job_id": job.id,
    "request_hash": job.request_hash,
    "farm_id": job.farm_id,
    "engine": job.engine,
    "job_type": job.job_type,
    "enqueue_timestamp": time.time(),
}

redis_client.xadd(
    settings.NDVI_STREAM_NAME,
    payload,
    maxlen=settings.NDVI_STREAM_MAXLEN,
    approximate=True,
)
```

Key design decisions:

  • request_hash acts as the idempotency key.
  • Approximate MAXLEN trimming keeps memory bounded.
  • Stream payloads remain deterministic.
  • Producers do not execute business logic.

The stream became the ingestion ledger.


Redis Streams Consumer Design

The consumer reads from Redis Streams and forwards work into Celery.

Example:

```python
messages = redis_client.xreadgroup(
    groupname=settings.NDVI_STREAM_GROUP,
    consumername=consumer_name,
    streams={settings.NDVI_STREAM_NAME: ">"},
    count=settings.NDVI_STREAM_BATCH_SIZE,
    block=settings.NDVI_STREAM_BLOCK_MS,
)
```

For every message:

  1. Deserialize payload
  2. Validate structure
  3. Apply idempotency safeguards
  4. Enqueue Celery task
  5. Acknowledge stream entry

```python
process_ndvi_job.delay(job_id)

redis_client.xack(
    settings.NDVI_STREAM_NAME,
    settings.NDVI_STREAM_GROUP,
    message_id,
)
```

The stream consumer remains intentionally thin.

Its job is reliable transport and recovery.

Celery still handles execution.
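Putting those steps together, a single-message handler might look like the sketch below. This is a simplification: the idempotency lock described later is omitted, and the helper name and DLQ routing are my own, though `process_ndvi_job` style handoff matches the article.

```python
def handle_entry(redis_client, task, stream, group, message_id, fields):
    """Process one stream entry: validate, hand off to Celery, then ack.

    `task` is the Celery task object; `fields` is the raw field dict
    returned by XREADGROUP (bytes keys/values by default in redis-py).
    """
    try:
        job_id = int(fields[b"job_id"])
    except (KeyError, TypeError, ValueError):
        # Structurally invalid payload: dead-letter it, then ack so the
        # consumer group stops redelivering a message that can never succeed.
        redis_client.xadd(f"{stream}:dlq", {"original_id": message_id, **fields})
        redis_client.xack(stream, group, message_id)
        return False
    task.delay(job_id)                             # execution stays in Celery
    redis_client.xack(stream, group, message_id)   # ack only after enqueue succeeds
    return True
```

Ordering matters here: acknowledging only after the Celery enqueue succeeds means a crash between the two steps leaves the entry pending, where reclaim logic can recover it.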


Why Consumer Groups Matter

Redis Streams consumer groups solved several operational problems immediately.

They provided:

  • cooperative work distribution
  • independent consumer identities
  • pending-entry tracking
  • reclaim support
  • replay visibility

Unlike simple queue semantics, Redis Streams expose message lifecycle state.

That visibility becomes extremely valuable during failures.

Message lifecycle:

```
XADD → pending → reclaimed → acknowledged
                          ↓
                         DLQ
```

This made queue recovery observable instead of implicit.
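That lifecycle state is queryable at any time via XPENDING. A small inspection helper, sketched here with redis-py's `xpending_range` (the helper name and output shape are my own), shows what "observable" means in practice:

```python
def pending_overview(redis_client, stream, group, count=50):
    """Summarize in-flight entries: delivered to a consumer but not yet acked.

    redis-py's xpending_range returns one dict per pending entry with its
    message id, owning consumer, idle time, and delivery count.
    """
    entries = redis_client.xpending_range(stream, group, min="-", max="+", count=count)
    return [
        {
            "id": e["message_id"],
            "consumer": e["consumer"],
            "idle_ms": e["time_since_delivered"],
            "deliveries": e["times_delivered"],
        }
        for e in entries
    ]
```

A high idle time points at a stalled consumer; a high delivery count points at a poison message headed for the DLQ.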


Recovering Stale Messages with XAUTOCLAIM

The most important recovery primitive ended up being XAUTOCLAIM.

If a consumer dies after reading a message but before acknowledging it, the entry remains pending indefinitely unless another consumer reclaims it.

Without reclaim logic, stream durability is incomplete.

Example reclaim loop:

```python
messages = redis_client.xautoclaim(
    name=settings.NDVI_STREAM_NAME,
    groupname=settings.NDVI_STREAM_GROUP,
    consumername=consumer_name,
    min_idle_time=settings.NDVI_STREAM_CLAIM_IDLE_MS,
    start_id="0-0",
    count=settings.NDVI_STREAM_BATCH_SIZE,
)
```

This allows healthy consumers to recover abandoned work automatically.

That changed the reliability profile of the ingestion pipeline significantly.


Dead-Letter Queue Handling

I also introduced a dedicated dead-letter stream.

Messages are routed into the DLQ when:

  • validation fails
  • delivery ceilings are exceeded
  • payloads become structurally invalid
  • repeated execution attempts fail

Example:

```python
redis_client.xadd(
    settings.NDVI_STREAM_DLQ_NAME,
    dlq_payload,
)
```

Every DLQ entry includes:

  • original message ID
  • delivery count
  • failure reason
  • serialized payload
  • timestamps

This made operational debugging dramatically easier.
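A `dlq_payload` carrying those fields can be assembled with a small pure helper. The field names below are illustrative, not the production schema; note that stream field values must be flat strings or bytes, so the nested payload gets serialized:

```python
import json
import time

def build_dlq_payload(message_id, fields, reason, delivery_count):
    """Capture everything needed to triage — and later replay — a failed entry."""
    return {
        "original_id": message_id,
        "delivery_count": str(delivery_count),
        "failure_reason": reason,
        "payload": json.dumps(fields),    # XADD values must be flat, so serialize
        "failed_at": str(time.time()),
    }
```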


The Hardest Problem: Idempotency

Redis Streams provide at-least-once delivery.

That means duplicate delivery is expected.

Exactly-once delivery is not guaranteed.

To prevent duplicate NDVI execution, I added multiple protection layers.

Layer 1: Deterministic Request Hash

Every NDVI job already had a deterministic request_hash.

That became the execution identity.

Layer 2: Distributed Redis Lock

The consumer acquires a Redis lock before execution.

```python
lock_key = f"ndvi:lock:{request_hash}"
```

Acquisition uses SETNX semantics with expiration.
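In redis-py that acquisition is a single SET with `nx` and an expiry, storing a random token so the lock can later be released safely. The helper name and TTL below are my own choices, not the production values:

```python
import uuid

def acquire_job_lock(redis_client, request_hash, ttl_ms=60_000):
    """Try to take the per-job lock; return the owner token, or None if held."""
    token = uuid.uuid4().hex
    # SET key value NX PX ttl — atomic create-with-expiry (the modern SETNX)
    if redis_client.set(f"ndvi:lock:{request_hash}", token, nx=True, px=ttl_ms):
        return token
    return None
```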

Layer 3: Token-Based Lock Release

Locks are released through an atomic Lua script.

This prevents blind deletion.

```lua
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
```
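With redis-py, a script like that can be wired up via `register_script` and invoked per release; the token must match the value stored at acquisition. The helper name is an assumption to mirror the acquisition sketch:

```python
RELEASE_LOCK_LUA = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""

def release_job_lock(redis_client, request_hash, token):
    """Delete the lock only if this consumer still owns it (token matches)."""
    script = redis_client.register_script(RELEASE_LOCK_LUA)
    # Returns 1 if released, 0 if ownership was already lost (e.g. TTL expiry)
    return script(keys=[f"ndvi:lock:{request_hash}"], args=[token])
```

The get-compare-delete runs atomically inside Redis, so a consumer whose lock expired mid-task can never delete a lock now owned by someone else.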

Layer 4: Database Status Recheck

Before execution begins, the worker re-checks terminal job state.

This acts as a second safety boundary.

The result is effectively-once execution semantics.

```
At-least-once delivery + idempotent execution = effectively-once processing
```

Observability Added During the Rollout

One major lesson from this migration:

Do not enable stream mode before queue visibility exists.

I added dedicated metrics before enabling the rollout broadly.

Examples:

```
redis_stream_pending_entries
redis_stream_pending_age_max
ndvi_stream_consumer_heartbeat
ndvi_stream_consumer_failures_total
```

I also expanded upstream visibility:

```
ndvi_upstream_requests_total
ndvi_upstream_failures_total
ndvi_upstream_duration_seconds
```

Grafana dashboards now expose:

  • pending stream backlog
  • reclaim frequency
  • DLQ volume
  • consumer liveness
  • upstream API failures
  • queue drain rate

This transformed rollout decisions from guesswork into measurable operations.


Rollback Strategy

Rollback was designed before rollout.

That mattered.

The stream backend is fully feature-flagged:

```python
NDVI_QUEUE_BACKEND = "celery"
```

or

```python
NDVI_QUEUE_BACKEND = "stream"
```

Rollback requires:

  1. environment variable change
  2. process restart

No redeploy.

No task rewrite.

No schema rollback.

This significantly reduced operational fear during rollout.


What Shipped This Week

This week’s rollout included:

  • a 528-line Redis Streams consumer
  • reclaim + DLQ lifecycle handling
  • distributed execution locking
  • token-safe lock release
  • approximately 400 lines of stream-focused tests
  • Prometheus metrics for queue health
  • Grafana visibility for consumer state and lag
  • feature-flag rollback support

Most of the work was not adding Redis.

Most of the work was making failure recovery predictable.


Was It Worth It?

Redis Streams did not simplify the system.

They made failure states explicit.

That introduced additional complexity:

  • reclaim logic
  • idempotency handling
  • consumer lifecycle management
  • DLQ operations
  • stream observability

But the reliability gains were substantial:

  • durable ingestion
  • replay visibility
  • safer recovery semantics
  • backlog introspection
  • controlled rollback
  • observable queue state

For this NDVI pipeline, the tradeoff was worth it.


Final Thoughts

One of the biggest lessons from this migration is that queue evolution is not just about throughput.

It is about operational recovery.

Redis Streams gave the ingestion layer explicit lifecycle semantics:

  • pending
  • acknowledged
  • reclaimed
  • dead-lettered

That visibility fundamentally changed how the system behaves during failures.

And importantly, I achieved that without rewriting the worker layer.

Sometimes the best migration strategy is not replacing your stack.

It is inserting a safer boundary in front of it.
