Redis Sentinel + Celery Failover: What Actually Happens in Production
Most tutorials on Redis Sentinel stop at “it elects a new master”.
Very few show what happens to a real system under failover pressure.
I ran a failover drill on a Django + Celery stack backed by Redis Sentinel and Prometheus monitoring.
Here’s what actually happened.
Table of Contents
- Architecture Overview
- Sentinel Integration (Django + Celery)
- Observability with Prometheus
- Failover Drill Walkthrough
- Celery Behavior During Failover
- Performance Impact
- Production Readiness Assessment
- How to Reduce Failover Latency
Architecture Overview
```mermaid
flowchart LR
    Client --> Django
    Django -->|Cache| Sentinel
    Django -->|Tasks| Celery
    Celery -->|Broker| Sentinel
    Celery -->|Result Backend| Sentinel
    Sentinel --> RedisMaster
    Sentinel --> RedisReplica1
    Sentinel --> RedisReplica2
    Prometheus --> RedisExporter
    RedisExporter --> Sentinel
```
Stack Components
- Django → Redis cache via Sentinel
- Celery → Broker + result backend via Sentinel
- Redis Sentinel → High availability + failover
- Prometheus + redis_exporter → Monitoring
Sentinel Integration (Django + Celery)
All services were switched to Sentinel using environment configuration:

```
REDIS_ADDR=redis://host.docker.internal:26379
```
Validation steps:
- Django cache → successful round-trip
- Celery broker → connected via Sentinel
- Celery result backend → `SentinelBackend` initialized
- Test suite passed: `pytest tests/test_settings_redis_sentinel.py`

At this stage, the system is fully Sentinel-aware.
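The wiring above can be sketched in Django settings. This is a minimal sketch, not the exact production config: the service name `mymaster`, the single Sentinel address, and the database numbers are assumptions, and the cache block follows django-redis 5.x's Sentinel support.

```python
# settings.py — hypothetical Sentinel wiring for Django cache and Celery.
# "mymaster", hosts, and DB numbers are illustrative assumptions.

SENTINELS = [("host.docker.internal", 26379)]

# Django cache via Sentinel (django-redis >= 5 ships a SentinelClient;
# LOCATION names the Sentinel service, not a concrete host).
DJANGO_REDIS_CONNECTION_FACTORY = "django_redis.pool.SentinelConnectionFactory"
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://mymaster/0",
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.SentinelClient",
            "SENTINELS": SENTINELS,
        },
    }
}

# Celery broker + result backend via Sentinel; the master_name transport
# option tells Kombu which service to ask Sentinel about.
CELERY_BROKER_URL = "sentinel://host.docker.internal:26379/0"
CELERY_BROKER_TRANSPORT_OPTIONS = {"master_name": "mymaster"}
CELERY_RESULT_BACKEND = "sentinel://host.docker.internal:26379/1"
CELERY_RESULT_BACKEND_TRANSPORT_OPTIONS = {"master_name": "mymaster"}
```

Multiple Sentinels can be listed in the broker URL separated by `;` so the client can fail over between them.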
Observability with Prometheus
After pointing redis_exporter to Sentinel:
Key metrics exposed:

- `redis_sentinel_master_status`
- `redis_sentinel_master_ok_sentinels`
- `redis_sentinel_master_ok_slaves`
- `redis_sentinel_masters`
Verification:

```
redis_instance_info{redis_mode="sentinel", tcp_port="26379"}
```
This confirms monitoring is tracking cluster state, not a single node.
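Those gauges can also be checked programmatically by parsing the exporter's text exposition format. A sketch, assuming the standard redis_exporter metric names above; the exporter URL is an illustrative default:

```python
# Pull the Sentinel gauges out of redis_exporter's /metrics output.
from urllib.request import urlopen

SENTINEL_METRICS = {
    "redis_sentinel_master_status",
    "redis_sentinel_master_ok_sentinels",
    "redis_sentinel_master_ok_slaves",
    "redis_sentinel_masters",
}

def parse_sentinel_metrics(text: str) -> dict:
    """Return {metric_with_labels: value} for the Sentinel gauges only."""
    out = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        name_part, _, value = line.rpartition(" ")  # value is the last token
        bare_name = name_part.split("{", 1)[0]      # strip {label=...} block
        if bare_name in SENTINEL_METRICS:
            out[name_part] = float(value)
    return out

def fetch(url: str = "http://localhost:9121/metrics") -> dict:
    # 9121 is redis_exporter's conventional port; adjust for your deployment.
    with urlopen(url) as resp:
        return parse_sentinel_metrics(resp.read().decode())
```

Watching `redis_sentinel_master_status` (and the `master_address` label on other series) across scrapes is what makes the failover drill below observable.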
Failover Drill Walkthrough
Initial State
```mermaid
flowchart LR
    Sentinel -->|Master| Redis1["172.20.0.3:6379"]
    Sentinel --> Redis2["Replica"]
    Sentinel --> Redis3["Replica"]
```
Prometheus reported: `master_address="172.20.0.3:6379"`
Induced Failure
- Current master was stopped manually
Sentinel Election
```mermaid
flowchart LR
    Sentinel -->|New Master| Redis2["172.20.0.2:6379"]
    Sentinel --> Redis3["Replica"]
    Sentinel --> Redis1["Down"]
```
- New master elected on first poll
- Prometheus updated on next scrape
Failover was immediate and correct.
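The same check can be run from Python by asking Sentinel directly which node is master, before and after the drill. A sketch using redis-py's `Sentinel` helper; the service name `mymaster` and the Sentinel address are assumptions:

```python
# Ask Sentinel which node is currently master for a given service.

def format_addr(addr):
    """Render Sentinel's (host, port) reply as 'host:port'."""
    host, port = addr
    return f"{host}:{port}"

def current_master(sentinels, service_name="mymaster"):
    # Imported lazily so format_addr stays usable without redis-py installed.
    from redis.sentinel import Sentinel

    client = Sentinel(sentinels, socket_timeout=0.5)
    return format_addr(client.discover_master(service_name))

# Against a live Sentinel:
#   current_master([("host.docker.internal", 26379)])
#   -> "172.20.0.3:6379" before the drill, "172.20.0.2:6379" after
```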
Celery Behavior During Failover
Timeline
```mermaid
sequenceDiagram
    participant App as Django App
    participant Celery
    participant Sentinel
    participant Redis
    App->>Celery: Submit Task
    Celery->>Redis: Send to Master
    Redis-->>Celery: Connection Lost
    Sentinel->>Sentinel: Elect New Master
    Celery->>Sentinel: Retry Connection
    Note over Celery: ~54.7s delay
    Celery->>Redis: Reconnect to New Master
    Redis-->>Celery: OK
    Celery-->>App: Task SUCCESS
```
Observed Task
- Task ID: `9b57ba3b-a707-4c13-9255-d74de411b64b`
- Status during failover: `PENDING`
- Delay: ~54.7 seconds
- Final state: `SUCCESS`
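The ~54.7 s figure came from polling the task state until it flipped from `PENDING` to `SUCCESS`. A sketch of that measurement loop; `get_state` is generalized to any zero-arg callable (e.g. `lambda: AsyncResult(task_id).state`), and the clock/sleep parameters exist only so the loop is testable without a broker:

```python
import time

def wait_for_success(get_state, timeout=120.0, poll=0.5,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until it returns 'SUCCESS'; return elapsed seconds.

    Raises TimeoutError if the task does not succeed within `timeout`.
    """
    start = clock()
    while True:
        state = get_state()
        elapsed = clock() - start
        if state == "SUCCESS":
            return elapsed
        if elapsed > timeout:
            raise TimeoutError(f"task still {state} after {elapsed:.1f}s")
        sleep(poll)

# During a drill, against a real Celery result backend:
#   wait_for_success(lambda: AsyncResult(task_id).state)
```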
Performance Impact
| Phase | Behavior |
|---|---|
| Normal operation | Immediate execution |
| During failover | ~55s delay |
| Post-recovery | Normal |
Production Readiness Assessment
What Works
- Redis Sentinel failover is reliable
- Prometheus reflects cluster changes correctly
- Django cache survives failover
- No task loss in Celery
What Needs Attention
- Celery introduces significant delay during failover
- Reconnection is not instantaneous
When This Architecture Is Production-Ready
Use this setup if:
- Tasks are asynchronous/background
- Eventual completion is acceptable
- Temporary latency spikes are tolerable
When This Is Not Enough
Avoid this setup (as-is) if you need:
- Real-time task execution
- Sub-10s failover recovery
- User-facing async operations
How to Reduce Failover Latency
To push recovery closer to 10–15 seconds:
- Tune Celery broker retry settings
- Reduce reconnect backoff intervals
- Optimize worker heartbeat and visibility timeout
- Re-run failover drills with timing instrumentation
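A hedged starting point for that tuning, using Celery's lowercase setting names. The numbers are illustrative, not recommendations; re-run the drill after each change and measure:

```python
# celeryconfig.py — illustrative retry/timeout tuning for Sentinel failover.
# "mymaster" is an assumed service name; all values are starting points.
broker_transport_options = {
    "master_name": "mymaster",
    "sentinel_kwargs": {"socket_timeout": 1.0},
    "socket_timeout": 5.0,             # fail fast on a dead master
    "socket_connect_timeout": 2.0,
    "retry_on_timeout": True,
    "visibility_timeout": 3600,        # how long an unacked task may be held
}
result_backend_transport_options = dict(broker_transport_options)

broker_connection_retry_on_startup = True
broker_connection_max_retries = None   # keep retrying until Sentinel recovers
```

Shorter socket timeouts are what shrink the observed ~55 s gap: the sooner a worker notices the old master is gone, the sooner it asks Sentinel for the new one.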
Key Takeaway
Redis Sentinel ensures infrastructure recovery.
Celery determines how fast your system actually resumes work.
In this test:
- Sentinel recovery: instant
- Application recovery: ~55 seconds
That gap is the real engineering challenge.
Final Thoughts
If you're using Redis Sentinel with Celery:
Don’t stop at:
“Failover works.”
Measure:
“How long until my system behaves normally again?”
Because that’s what production users experience.