A practical engineering breakdown of how I introduced Redis Streams into a live Django + Celery NDVI pipeline without rewriting the worker layer.
Introduction
I run a Django API backed by Celery workers for NDVI processing workloads.
The execution layer worked fine.
The queue semantics didn’t.
I needed:
- durable ingestion
- replay visibility
- dead-letter handling
- stale consumer recovery
- rollback safety
- observability during incidents
…but I did not want to rewrite the worker system or destabilize production.
So instead of replacing Celery, I inserted Redis Streams between the API and the workers.
This article explains why I made that decision, how the architecture works, and what I learned while implementing reliable stream-backed NDVI ingestion in Django.
The Original Problem
The problem was not task execution.
The problem was everything before execution.
Originally, NDVI ingestion looked like this:
Django API → Celery Broker → Celery Workers
At first, this worked well.
But as the system evolved, operational gaps became more obvious:
- Direct .delay() calls tightly coupled request ingestion to broker behavior.
- Queue visibility was limited during incidents.
- Failed ingestion paths were harder to replay safely.
- In-flight recovery semantics were weak.
- There was no dead-letter workflow for poisoned messages.
- Worker interruptions could leave messages in uncertain states.
The architecture was fast.
It was not durable enough.
Why I Did Not Replace Celery
One of the biggest architectural decisions was choosing not to replace Celery.
That decision reduced risk dramatically.
Celery already handled:
- worker orchestration
- task retries
- execution concurrency
- scheduling
- routing
- operational familiarity
Replacing the worker layer would have increased migration complexity and expanded the failure domain.
Instead, I treated Redis Streams as an ingestion and reliability layer.
The resulting architecture looked like this:
Django API
↓
Dispatch Boundary
↓
Redis Streams (XADD)
↓
Consumer Group (XREADGROUP)
↓
Celery Queue
↓
NDVI Workers
Failures route into a dead-letter stream.
Stale consumers are recovered through reclaim logic.
Most importantly, rollback remains simple.
Centralizing Dispatch Before Adding Redis Streams
Before introducing Redis Streams, I centralized every NDVI enqueue path.
This was the most important migration step.
Instead of scattering direct .delay() calls across the codebase, everything flowed through dispatch helpers.
from ndvi.dispatch import dispatch_ndvi_job
job = enqueue_job(...)
dispatch_ndvi_job(job)
That allowed one configuration flag to control the ingestion backend.
NDVI_QUEUE_BACKEND = env("NDVI_QUEUE_BACKEND", default="celery")
Supported modes:
- celery
- stream
This created a clean migration boundary.
The system could switch between direct Celery dispatch and Redis Streams without changing every call site.
Operationally, this mattered more than the stream code itself.
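Roughly, the boundary looks like this (dispatch_ndvi_job and NDVI_QUEUE_BACKEND are the names used above; the branch bodies are illustrative, not the exact production code):

from django.conf import settings

def dispatch_ndvi_job(job):
    if settings.NDVI_QUEUE_BACKEND == "stream":
        # XADD into the Redis stream (producer shown in the next section).
        publish_ndvi_job(job)
    else:
        # Legacy path: direct Celery dispatch.
        process_ndvi_job.delay(job.id)

Every call site stays identical; only the helper's internals change.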
Publishing NDVI Jobs into Redis Streams
The producer layer publishes deterministic NDVI payloads into a Redis stream.
Example:
import time

payload = {
    "job_id": job.id,
    "request_hash": job.request_hash,
    "farm_id": job.farm_id,
    "engine": job.engine,
    "job_type": job.job_type,
    "enqueue_timestamp": time.time(),
}

# maxlen + approximate=True trims the stream lazily (XTRIM semantics),
# keeping memory bounded without exact trimming on every add.
redis_client.xadd(
    settings.NDVI_STREAM_NAME,
    payload,
    maxlen=settings.NDVI_STREAM_MAXLEN,
    approximate=True,
)
Key design decisions:
- request_hash acts as the idempotency key.
- maxlen trimming (XTRIM semantics, applied at XADD time) keeps memory bounded.
- Stream payloads remain deterministic.
- Producers do not execute business logic.
The stream became the ingestion ledger.
Redis Streams Consumer Design
The consumer reads from Redis Streams and forwards work into Celery.
Example:
messages = redis_client.xreadgroup(
    groupname=settings.NDVI_STREAM_GROUP,
    consumername=consumer_name,
    streams={settings.NDVI_STREAM_NAME: ">"},  # ">" = only new, never-delivered entries
    count=settings.NDVI_STREAM_BATCH_SIZE,
    block=settings.NDVI_STREAM_BLOCK_MS,
)
For every message:
- Deserialize payload
- Validate structure
- Apply idempotency safeguards
- Enqueue Celery task
- Acknowledge stream entry
# Hand off to Celery first, then acknowledge; acking only after a
# successful enqueue preserves at-least-once delivery.
process_ndvi_job.delay(job_id)
redis_client.xack(
    settings.NDVI_STREAM_NAME,
    settings.NDVI_STREAM_GROUP,
    message_id,
)
The stream consumer remains intentionally thin.
Its job is reliable transport and recovery.
Celery still handles execution.
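Putting those steps together, the per-message handling looks roughly like this (validate_payload and route_to_dlq are illustrative helper names; the client is assumed to use decode_responses=True so field keys are plain strings):

def handle_entry(message_id, fields):
    if not validate_payload(fields):
        # Structurally invalid payloads go to the DLQ instead of
        # being retried forever.
        route_to_dlq(message_id, fields, reason="invalid_payload")
    else:
        # Celery still owns execution, retries, and concurrency.
        process_ndvi_job.delay(fields["job_id"])
    # Acknowledge so the entry leaves the pending entries list (PEL).
    redis_client.xack(
        settings.NDVI_STREAM_NAME,
        settings.NDVI_STREAM_GROUP,
        message_id,
    )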
Why Consumer Groups Matter
Redis Streams consumer groups solved several operational problems immediately.
They provided:
- cooperative work distribution
- independent consumer identities
- pending-entry tracking
- reclaim support
- replay visibility
Unlike simple queue semantics, Redis Streams expose message lifecycle state.
That visibility becomes extremely valuable during failures.
Message lifecycle:
XADD → pending → reclaimed → acknowledged
          ↓
         DLQ
This made queue recovery observable instead of implicit.
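That lifecycle state is directly inspectable. For example, XPENDING (via redis-py's xpending_range) shows each in-flight entry's owner, idle time, and delivery count:

# Detailed view of pending entries: who holds them, for how long,
# and how many times each has been delivered.
pending = redis_client.xpending_range(
    settings.NDVI_STREAM_NAME,
    settings.NDVI_STREAM_GROUP,
    min="-",
    max="+",
    count=10,
)
for entry in pending:
    print(
        entry["message_id"],
        entry["consumer"],
        entry["time_since_delivered"],  # idle time in ms
        entry["times_delivered"],       # delivery counter
    )

The delivery counter is what later feeds the dead-letter ceiling.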
Recovering Stale Messages with XAUTOCLAIM
The most important recovery primitive ended up being XAUTOCLAIM.
If a consumer dies after reading a message but before acknowledging it, the entry remains pending indefinitely unless another consumer reclaims it.
Without reclaim logic, stream durability is incomplete.
Example reclaim loop:
# Claims entries that have been pending longer than the idle threshold,
# transferring ownership to this consumer.
messages = redis_client.xautoclaim(
    name=settings.NDVI_STREAM_NAME,
    groupname=settings.NDVI_STREAM_GROUP,
    consumername=consumer_name,
    min_idle_time=settings.NDVI_STREAM_CLAIM_IDLE_MS,
    start_id="0-0",
    count=settings.NDVI_STREAM_BATCH_SIZE,
)
This allows healthy consumers to recover abandoned work automatically.
That changed the reliability profile of the ingestion pipeline significantly.
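One detail worth noting: XAUTOCLAIM also returns the cursor to resume from, so a full sweep iterates until the cursor wraps back to 0-0. A sketch of that sweep (handle_entry as in the consumer section above):

def reclaim_stale_entries(consumer_name):
    start_id = "0-0"
    while True:
        # Depending on the redis-py/Redis version, the reply is
        # (next_cursor, claimed_entries) or adds a third element
        # listing IDs that were deleted from the stream.
        reply = redis_client.xautoclaim(
            name=settings.NDVI_STREAM_NAME,
            groupname=settings.NDVI_STREAM_GROUP,
            consumername=consumer_name,
            min_idle_time=settings.NDVI_STREAM_CLAIM_IDLE_MS,
            start_id=start_id,
            count=settings.NDVI_STREAM_BATCH_SIZE,
        )
        start_id, claimed = reply[0], reply[1]
        for message_id, fields in claimed:
            handle_entry(message_id, fields)
        if start_id in ("0-0", b"0-0"):
            break  # cursor wrapped: the sweep covered the whole PEL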
Dead-Letter Queue Handling
I also introduced a dedicated dead-letter stream.
Messages are routed into the DLQ when:
- validation fails
- delivery ceilings are exceeded
- payloads become structurally invalid
- repeated execution attempts fail
Example:
redis_client.xadd(
    settings.NDVI_STREAM_DLQ_NAME,
    dlq_payload,
)
Every DLQ entry includes:
- original message ID
- delivery count
- failure reason
- serialized payload
- timestamps
This made operational debugging dramatically easier.
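A sketch of how such an entry can be assembled (field names mirror the list above; route_to_dlq is an illustrative helper):

import json
import time

def route_to_dlq(message_id, fields, reason, delivery_count=0):
    dlq_payload = {
        "original_id": message_id,
        "delivery_count": delivery_count,
        "failure_reason": reason,
        "payload": json.dumps(fields, default=str),  # serialized original
        "failed_at": time.time(),
    }
    redis_client.xadd(settings.NDVI_STREAM_DLQ_NAME, dlq_payload)
    # The caller still acks the source entry so it stops being redelivered.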
The Hardest Problem: Idempotency
Redis Streams provide at-least-once delivery.
That means duplicate delivery is expected.
Exactly-once delivery is not guaranteed.
To prevent duplicate NDVI execution, I added multiple protection layers.
Layer 1: Deterministic Request Hash
Every NDVI job already had a deterministic request_hash.
That became the execution identity.
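The article's jobs already carry this hash; for readers building one from scratch, a typical construction hashes the canonicalized request parameters:

import hashlib
import json

def compute_request_hash(params: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) makes the hash
    # deterministic for logically identical requests.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()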
Layer 2: Distributed Redis Lock
The consumer acquires a Redis lock before execution.
lock_key = f"ndvi:lock:{request_hash}"
Acquisition uses SETNX semantics with expiration.
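In redis-py, SETNX-with-expiry is a single SET call. A minimal sketch (the token variable and NDVI_LOCK_TTL_SECONDS are assumed names):

import uuid

lock_key = f"ndvi:lock:{request_hash}"
token = uuid.uuid4().hex  # unique per acquisition; required for safe release

# SET ... NX EX acquires the lock atomically with an expiry, so a crashed
# consumer can never hold it forever.
acquired = redis_client.set(
    lock_key,
    token,
    nx=True,
    ex=settings.NDVI_LOCK_TTL_SECONDS,
)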
Layer 3: Token-Based Lock Release
Locks are released through an atomic Lua script.
This prevents blind deletion.
-- Release the lock only if the stored token matches the caller's token;
-- this stops one consumer from deleting a lock another consumer now owns.
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
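On the Python side, redis-py can register that script once and invoke it atomically; for example:

RELEASE_LOCK_LUA = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""

# register_script returns a callable that runs the script server-side,
# so the get-compare-delete sequence is atomic.
release_lock = redis_client.register_script(RELEASE_LOCK_LUA)

# Only deletes the lock if this consumer still owns it (token match).
release_lock(keys=[lock_key], args=[token])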
Layer 4: Database Status Recheck
Before execution begins, the worker re-checks terminal job state.
This acts as a second safety boundary.
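A sketch of that recheck, assuming a Django model named NdviJob with a status field (model, states, and import path are all illustrative):

from ndvi.models import NdviJob  # illustrative import path

TERMINAL_STATES = {"completed", "failed"}

def already_finished(job_id) -> bool:
    # Fetch only the status column; a terminal state means a duplicate
    # delivery slipped past the lock and must not execute again.
    status = NdviJob.objects.only("status").get(id=job_id).status
    return status in TERMINAL_STATES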
The result is effectively-once execution semantics.
At-least-once delivery + idempotent execution = effectively-once processing
Observability Added During the Rollout
One major lesson from this migration:
Do not enable stream mode before queue visibility exists.
I added dedicated metrics before enabling the rollout broadly.
Examples:
redis_stream_pending_entries
redis_stream_pending_age_max
ndvi_stream_consumer_heartbeat
ndvi_stream_consumer_failures_total
I also expanded upstream visibility:
ndvi_upstream_requests_total
ndvi_upstream_failures_total
ndvi_upstream_duration_seconds
Grafana dashboards now expose:
- pending stream backlog
- reclaim frequency
- DLQ volume
- consumer liveness
- upstream API failures
- queue drain rate
This transformed rollout decisions from guesswork into measurable operations.
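A sketch of how the pending-entry gauges can be fed with prometheus_client (metric names from the list above; the collection function and its schedule are illustrative):

import time
from prometheus_client import Gauge

PENDING_ENTRIES = Gauge(
    "redis_stream_pending_entries",
    "Entries in the consumer group's pending entries list",
)
PENDING_AGE_MAX = Gauge(
    "redis_stream_pending_age_max",
    "Age in seconds of the oldest pending entry",
)

def collect_stream_metrics():
    summary = redis_client.xpending(
        settings.NDVI_STREAM_NAME, settings.NDVI_STREAM_GROUP
    )
    PENDING_ENTRIES.set(summary["pending"])
    min_id = summary["min"]
    if min_id:
        if isinstance(min_id, bytes):
            min_id = min_id.decode()
        # Stream IDs encode a millisecond timestamp before the dash.
        oldest_ms = int(min_id.split("-")[0])
        PENDING_AGE_MAX.set(time.time() - oldest_ms / 1000.0)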
Rollback Strategy
Rollback was designed before rollout.
That mattered.
The stream backend is fully feature-flagged:
NDVI_QUEUE_BACKEND = "celery"
or
NDVI_QUEUE_BACKEND = "stream"
Rollback requires:
- environment variable change
- process restart
No redeploy.
No task rewrite.
No schema rollback.
This significantly reduced operational fear during rollout.
What Shipped This Week
This week’s rollout included:
- a 528-line Redis Streams consumer
- reclaim + DLQ lifecycle handling
- distributed execution locking
- token-safe lock release
- approximately 400 lines of stream-focused tests
- Prometheus metrics for queue health
- Grafana visibility for consumer state and lag
- feature-flag rollback support
Most of the work was not adding Redis.
Most of the work was making failure recovery predictable.
Was It Worth It?
Redis Streams did not simplify the system.
They made failure states explicit.
That introduced additional complexity:
- reclaim logic
- idempotency handling
- consumer lifecycle management
- DLQ operations
- stream observability
But the reliability gains were substantial:
- durable ingestion
- replay visibility
- safer recovery semantics
- backlog introspection
- controlled rollback
- observable queue state
For this NDVI pipeline, the tradeoff was worth it.
Final Thoughts
One of the biggest lessons from this migration is that queue evolution is not just about throughput.
It is about operational recovery.
Redis Streams gave the ingestion layer explicit lifecycle semantics:
- pending
- acknowledged
- reclaimed
- dead-lettered
That visibility fundamentally changed how the system behaves during failures.
And importantly, I achieved that without rewriting the worker layer.
Sometimes the best migration strategy is not replacing your stack.
It is inserting a safer boundary in front of it.