Redis Sentinel + Celery Failover: What Actually Happens in Production
Most tutorials on Redis Sentinel stop at “it elects a new master”.
Very few show what happens to a real system under failover pressure.
I ran a failover drill on a Django + Celery stack backed by Redis Sentinel and Prometheus monitoring.
Here’s what actually happened.
Table of Contents
- Architecture Overview
- Sentinel Integration (Django + Celery)
- Observability with Prometheus
- Failover Drill Walkthrough
- Celery Behavior During Failover
- Performance Impact
- Production Readiness Assessment
- How to Reduce Failover Latency
Architecture Overview
```mermaid
flowchart LR
    Client --> Django
    Django -->|Cache| Sentinel
    Django -->|Tasks| Celery
    Celery -->|Broker| Sentinel
    Celery -->|Result Backend| Sentinel
    Sentinel --> RedisMaster
    Sentinel --> RedisReplica1
    Sentinel --> RedisReplica2
    Prometheus --> RedisExporter
    RedisExporter --> Sentinel
```
Stack Components
- Django → Redis cache via Sentinel
- Celery → Broker + result backend via Sentinel
- Redis Sentinel → High availability + failover
- Prometheus + redis_exporter → Monitoring
Sentinel Integration (Django + Celery)
All services were switched to Sentinel using environment configuration:

```
REDIS_ADDR=redis://host.docker.internal:26379
```
Validation steps:
- Django cache → successful round-trip
- Celery broker → connected via Sentinel
- Celery result backend → `SentinelBackend` initialized
- Test suite passed: `pytest tests/test_settings_redis_sentinel.py`

At this stage, the system is fully Sentinel-aware.
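The wiring above can be sketched in Django settings. This is a minimal sketch, not the exact production config: the service name `mymaster`, the single Sentinel address, and the database numbers are assumptions, and the cache block follows django-redis 5.x's Sentinel support.

```python
# settings.py — hypothetical Sentinel wiring for Django cache and Celery.
# "mymaster", hosts, and DB numbers are illustrative assumptions.

SENTINELS = [("host.docker.internal", 26379)]

# Django cache via Sentinel (django-redis >= 5 ships a SentinelClient;
# LOCATION names the Sentinel service, not a concrete host).
DJANGO_REDIS_CONNECTION_FACTORY = "django_redis.pool.SentinelConnectionFactory"
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://mymaster/0",
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.SentinelClient",
            "SENTINELS": SENTINELS,
        },
    }
}

# Celery broker + result backend via Sentinel; the master_name transport
# option tells Kombu which service to ask Sentinel about.
CELERY_BROKER_URL = "sentinel://host.docker.internal:26379/0"
CELERY_BROKER_TRANSPORT_OPTIONS = {"master_name": "mymaster"}
CELERY_RESULT_BACKEND = "sentinel://host.docker.internal:26379/1"
CELERY_RESULT_BACKEND_TRANSPORT_OPTIONS = {"master_name": "mymaster"}
```

Multiple Sentinels can be listed in the broker URL separated by `;` so the client can fail over between them.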
Observability with Prometheus
After pointing redis_exporter to Sentinel:
Key metrics exposed:

- `redis_sentinel_master_status`
- `redis_sentinel_master_ok_sentinels`
- `redis_sentinel_master_ok_slaves`
- `redis_sentinel_masters`
Verification:

```
redis_instance_info{redis_mode="sentinel", tcp_port="26379"}
```
This confirms monitoring is tracking cluster state, not a single node.
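Those gauges can also be checked programmatically by parsing the exporter's text exposition format. A sketch, assuming the standard redis_exporter metric names above; the exporter URL is an illustrative default:

```python
# Pull the Sentinel gauges out of redis_exporter's /metrics output.
from urllib.request import urlopen

SENTINEL_METRICS = {
    "redis_sentinel_master_status",
    "redis_sentinel_master_ok_sentinels",
    "redis_sentinel_master_ok_slaves",
    "redis_sentinel_masters",
}

def parse_sentinel_metrics(text: str) -> dict:
    """Return {metric_with_labels: value} for the Sentinel gauges only."""
    out = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        name_part, _, value = line.rpartition(" ")  # value is the last token
        bare_name = name_part.split("{", 1)[0]      # strip {label=...} block
        if bare_name in SENTINEL_METRICS:
            out[name_part] = float(value)
    return out

def fetch(url: str = "http://localhost:9121/metrics") -> dict:
    # 9121 is redis_exporter's conventional port; adjust for your deployment.
    with urlopen(url) as resp:
        return parse_sentinel_metrics(resp.read().decode())
```

Watching `redis_sentinel_master_status` (and the `master_address` label on other series) across scrapes is what makes the failover drill below observable.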
Failover Drill Walkthrough
Initial State
```mermaid
flowchart LR
    Sentinel -->|Master| Redis1["172.20.0.3:6379"]
    Sentinel --> Redis2["Replica"]
    Sentinel --> Redis3["Replica"]
```
Prometheus reported: `master_address="172.20.0.3:6379"`
Induced Failure
- Current master was stopped manually
Sentinel Election
```mermaid
flowchart LR
    Sentinel -->|New Master| Redis2["172.20.0.2:6379"]
    Sentinel --> Redis3["Replica"]
    Sentinel --> Redis1["Down"]
```
- New master elected on first poll
- Prometheus updated on next scrape
Failover was immediate and correct.
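The same check can be run from Python by asking Sentinel directly which node is master, before and after the drill. A sketch using redis-py's `Sentinel` helper; the service name `mymaster` and the Sentinel address are assumptions:

```python
# Ask Sentinel which node is currently master for a given service.

def format_addr(addr):
    """Render Sentinel's (host, port) reply as 'host:port'."""
    host, port = addr
    return f"{host}:{port}"

def current_master(sentinels, service_name="mymaster"):
    # Imported lazily so format_addr stays usable without redis-py installed.
    from redis.sentinel import Sentinel

    client = Sentinel(sentinels, socket_timeout=0.5)
    return format_addr(client.discover_master(service_name))

# Against a live Sentinel:
#   current_master([("host.docker.internal", 26379)])
#   -> "172.20.0.3:6379" before the drill, "172.20.0.2:6379" after
```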
Celery Behavior During Failover
Timeline
```mermaid
sequenceDiagram
    participant App as Django App
    participant Celery
    participant Sentinel
    participant Redis
    App->>Celery: Submit Task
    Celery->>Redis: Send to Master
    Redis-->>Celery: Connection Lost
    Sentinel->>Sentinel: Elect New Master
    Celery->>Sentinel: Retry Connection
    Note over Celery: ~54.7s delay
    Celery->>Redis: Reconnect to New Master
    Redis-->>Celery: OK
    Celery-->>App: Task SUCCESS
```
Observed Task
- Task ID: `9b57ba3b-a707-4c13-9255-d74de411b64b`
- Status during failover: `PENDING`
- Delay: ~54.7 seconds
- Final state: `SUCCESS`
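The ~54.7 s figure came from polling the task state until it flipped from `PENDING` to `SUCCESS`. A sketch of that measurement loop; `get_state` is generalized to any zero-arg callable (e.g. `lambda: AsyncResult(task_id).state`), and the clock/sleep parameters exist only so the loop is testable without a broker:

```python
import time

def wait_for_success(get_state, timeout=120.0, poll=0.5,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until it returns 'SUCCESS'; return elapsed seconds.

    Raises TimeoutError if the task does not succeed within `timeout`.
    """
    start = clock()
    while True:
        state = get_state()
        elapsed = clock() - start
        if state == "SUCCESS":
            return elapsed
        if elapsed > timeout:
            raise TimeoutError(f"task still {state} after {elapsed:.1f}s")
        sleep(poll)

# During a drill, against a real Celery result backend:
#   wait_for_success(lambda: AsyncResult(task_id).state)
```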
Performance Impact
| Phase | Behavior |
|---|---|
| Normal operation | Immediate execution |
| During failover | ~55s delay |
| Post-recovery | Normal |
Production Readiness Assessment
What Works
- Redis Sentinel failover is reliable
- Prometheus reflects cluster changes correctly
- Django cache survives failover
- No task loss in Celery
What Needs Attention
- Celery introduces significant delay during failover
- Reconnection is not instantaneous
When This Architecture Is Production-Ready
Use this setup if:
- Tasks are asynchronous/background
- Eventual completion is acceptable
- Temporary latency spikes are tolerable
When This Is Not Enough
Avoid this setup (as-is) if you need:
- Real-time task execution
- Sub-10s failover recovery
- User-facing async operations
How to Reduce Failover Latency
To push recovery closer to 10–15 seconds:
- Tune Celery broker retry settings
- Reduce reconnect backoff intervals
- Optimize worker heartbeat and visibility timeout
- Re-run failover drills with timing instrumentation
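A hedged starting point for that tuning, using Celery's lowercase setting names. The numbers are illustrative, not recommendations; re-run the drill after each change and measure:

```python
# celeryconfig.py — illustrative retry/timeout tuning for Sentinel failover.
# "mymaster" is an assumed service name; all values are starting points.
broker_transport_options = {
    "master_name": "mymaster",
    "sentinel_kwargs": {"socket_timeout": 1.0},
    "socket_timeout": 5.0,             # fail fast on a dead master
    "socket_connect_timeout": 2.0,
    "retry_on_timeout": True,
    "visibility_timeout": 3600,        # how long an unacked task may be held
}
result_backend_transport_options = dict(broker_transport_options)

broker_connection_retry_on_startup = True
broker_connection_max_retries = None   # keep retrying until Sentinel recovers
```

Shorter socket timeouts are what shrink the observed ~55 s gap: the sooner a worker notices the old master is gone, the sooner it asks Sentinel for the new one.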
Key Takeaway
Redis Sentinel ensures infrastructure recovery.
Celery determines how fast your system actually resumes work.
In this test:
- Sentinel recovery: instant
- Application recovery: ~55 seconds
That gap is the real engineering challenge.
Final Thoughts
If you're using Redis Sentinel with Celery:
Don’t stop at:
“Failover works.”
Measure:
“How long until my system behaves normally again?”
Because that’s what production users experience.