Rahim Ranxx

Django + Celery + Redis Sentinel: A Real Failover Test (With Metrics)

Redis Sentinel + Celery Failover: What Actually Happens in Production

Most tutorials on Redis Sentinel stop at “it elects a new master”.
Very few show what happens to a real system under failover pressure.

I ran a failover drill on a Django + Celery stack backed by Redis Sentinel and Prometheus monitoring.

Here’s what actually happened.


Table of Contents

  • Architecture Overview
  • Sentinel Integration (Django + Celery)
  • Observability with Prometheus
  • Failover Drill Walkthrough
  • Celery Behavior During Failover
  • Performance Impact
  • Production Readiness Assessment
  • How to Reduce Failover Latency

Architecture Overview

```mermaid
flowchart LR
    Client --> Django
    Django -->|Cache| Sentinel
    Django -->|Tasks| Celery
    Celery -->|Broker| Sentinel
    Celery -->|Result Backend| Sentinel

    Sentinel --> RedisMaster
    Sentinel --> RedisReplica1
    Sentinel --> RedisReplica2

    Prometheus --> RedisExporter
    RedisExporter --> Sentinel
```

Stack Components

  • Django → Redis cache via Sentinel
  • Celery → Broker + result backend via Sentinel
  • Redis Sentinel → High availability + failover
  • Prometheus + redis_exporter → Monitoring

Sentinel Integration (Django + Celery)

All services were switched to Sentinel using environment configuration:

```bash
REDIS_ADDR=redis://host.docker.internal:26379
```
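A single `REDIS_ADDR` pointing at the Sentinel port is not enough on its own: Celery and django-redis each need the Sentinel hosts plus the name of the monitored master. A minimal settings sketch, assuming the master service is named `mymaster` and django-redis (≥ 5.0, which ships `SentinelClient`) is installed; all names and values here are illustrative, not the article's actual config:

```python
# settings.py : hypothetical sketch, host/port and "mymaster" are assumptions
SENTINEL_HOST = "host.docker.internal"
SENTINEL_PORT = 26379
REDIS_MASTER_NAME = "mymaster"

# Celery: kombu's sentinel transport resolves the master via this option
CELERY_BROKER_URL = f"sentinel://{SENTINEL_HOST}:{SENTINEL_PORT}"
CELERY_BROKER_TRANSPORT_OPTIONS = {"master_name": REDIS_MASTER_NAME}
CELERY_RESULT_BACKEND = CELERY_BROKER_URL
CELERY_RESULT_BACKEND_TRANSPORT_OPTIONS = {"master_name": REDIS_MASTER_NAME}

# Django cache: django-redis resolves the service name in LOCATION
# through the SENTINELS list below
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": f"redis://{REDIS_MASTER_NAME}/0",
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.SentinelClient",
            "SENTINELS": [(SENTINEL_HOST, SENTINEL_PORT)],
        },
    }
}
```

The key design point is that clients never hardcode the master address; they ask Sentinel for it on every (re)connect, which is what makes the failover below possible.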

Validation steps:

  • Django cache → successful round-trip
  • Celery broker → connected via Sentinel
  • Celery result backend → SentinelBackend initialized
  • Test suite passed:
```bash
pytest tests/test_settings_redis_sentinel.py
```

At this stage, the system is fully Sentinel-aware.
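The validation steps above start from the single `REDIS_ADDR` URL. A small helper to turn that URL into the `(host, port)` pairs redis-py's `Sentinel` client expects (a sketch, not the article's test code):

```python
from urllib.parse import urlparse

def sentinel_hosts(redis_addr: str) -> list:
    """Turn a REDIS_ADDR-style URL into the (host, port) pairs that
    redis-py's Sentinel client expects. Defaults to the standard
    Sentinel port 26379 when the URL omits one."""
    parsed = urlparse(redis_addr)
    return [(parsed.hostname, parsed.port or 26379)]

print(sentinel_hosts("redis://host.docker.internal:26379"))
# → [('host.docker.internal', 26379)]
```

With redis-py, `Sentinel(sentinel_hosts(addr)).discover_master("mymaster")` (master name assumed) is a quick way to confirm that cache and broker traffic really resolve through Sentinel rather than a hardcoded node.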


Observability with Prometheus

After pointing redis_exporter at Sentinel, the key metrics exposed are:

  • redis_sentinel_master_status
  • redis_sentinel_master_ok_sentinels
  • redis_sentinel_master_ok_slaves
  • redis_sentinel_masters

Verification:

```
redis_instance_info{redis_mode="sentinel", tcp_port="26379"}
```

This confirms monitoring is tracking cluster state, not a single node.
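Those gauges can also drive a simple health check. A hypothetical helper that parses the exporter's text output, assuming (as redis_exporter does for its status gauges) that `1` means healthy; the sample payload is illustrative, not captured from this setup:

```python
def master_status_ok(metrics_text: str) -> bool:
    """Scan Prometheus text-format output for redis_sentinel_master_status
    and report whether Sentinel considers the master healthy (value == 1)."""
    for line in metrics_text.splitlines():
        if line.startswith("redis_sentinel_master_status"):
            # exposition format: metric_name{labels} value
            return float(line.rsplit(" ", 1)[1]) == 1.0
    return False  # metric absent: treat as unhealthy

sample = 'redis_sentinel_master_status{master_name="mymaster"} 1'
print(master_status_ok(sample))  # → True
```

The same one-liner logic is what an alerting rule on `redis_sentinel_master_status` encodes declaratively.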


Failover Drill Walkthrough

Initial State

```mermaid
flowchart LR
    Sentinel -->|Master| Redis1["172.20.0.3:6379"]
    Sentinel --> Redis2["Replica"]
    Sentinel --> Redis3["Replica"]
```

Prometheus reported:

```
master_address="172.20.0.3:6379"
```

Induced Failure

  • Current master was stopped manually

Sentinel Election

```mermaid
flowchart LR
    Sentinel -->|New Master| Redis2["172.20.0.2:6379"]
    Sentinel --> Redis3["Replica"]
    Sentinel --> Redis1["Down"]
```
  • New master elected on first poll
  • Prometheus updated on next scrape

Failover was immediate and correct.
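To put a number on "immediate", a drill script can poll Sentinel for the advertised master address and time the switchover. A sketch with the Sentinel lookup injected as a callable; in a real drill, `get_master` would wrap `redis.sentinel.Sentinel.discover_master("mymaster")` (wiring and master name assumed):

```python
import time

def time_until_master_changes(get_master, old_addr, timeout=120.0, interval=0.5):
    """Poll get_master() until it reports an address different from old_addr.

    Returns elapsed seconds, or raises TimeoutError if no failover is seen.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            if get_master() != old_addr:
                return time.monotonic() - start
        except OSError:
            pass  # Sentinel may briefly refuse connections mid-election
        time.sleep(interval)
    raise TimeoutError("no failover observed within timeout")
```

Running this alongside `docker stop` on the master separates Sentinel's election time from the application-level reconnect delay measured in the next section.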


Celery Behavior During Failover

Timeline

```mermaid
sequenceDiagram
    participant App as Django App
    participant Celery
    participant Sentinel
    participant Redis

    App->>Celery: Submit Task
    Celery->>Redis: Send to Master
    Redis-->>Celery: Connection Lost

    Sentinel->>Sentinel: Elect New Master

    Celery->>Sentinel: Retry Connection
    Note over Celery: ~54.7s delay

    Celery->>Redis: Reconnect to New Master
    Redis-->>Celery: OK

    Celery-->>App: Task SUCCESS
```

Observed Task

  • Task ID: 9b57ba3b-a707-4c13-9255-d74de411b64b
  • Status during failover: PENDING
  • Delay: ~54.7 seconds
  • Final state: SUCCESS
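The ~54.7 s figure can be reproduced with a tiny harness. The sketch below keeps the Celery specifics injected: in a real run you would pass the `AsyncResult` returned by something like `ping.delay()` (task name assumed, not from this codebase):

```python
import time

def measure_recovery(async_result, timeout=300.0):
    """Time how long a submitted task takes to complete.

    async_result is anything exposing Celery's AsyncResult.get(timeout=...)
    API; .get() blocks through the broker outage and raises on timeout.
    """
    start = time.monotonic()
    async_result.get(timeout=timeout)
    return time.monotonic() - start
```

Run it once before inducing the failure (baseline) and once while the master is down; the difference between the two readings is your application-level recovery time.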

Performance Impact

| Phase | Behavior |
| --- | --- |
| Normal operation | Immediate execution |
| During failover | ~55 s delay |
| Post-recovery | Normal |

Production Readiness Assessment

What Works

  • Redis Sentinel failover is reliable
  • Prometheus reflects cluster changes correctly
  • Django cache survives failover
  • No task loss in Celery

What Needs Attention

  • Celery introduces significant delay during failover
  • Reconnection is not instantaneous

When This Architecture Is Production-Ready

Use this setup if:

  • Tasks are asynchronous/background
  • Eventual completion is acceptable
  • Temporary latency spikes are tolerable

When This Is Not Enough

Avoid this setup (as-is) if you need:

  • Real-time task execution
  • Sub-10s failover recovery
  • User-facing async operations

How to Reduce Failover Latency

To push recovery closer to 10–15 seconds:

  • Tune Celery broker retry settings
  • Reduce reconnect backoff intervals
  • Optimize worker heartbeat and visibility timeout
  • Re-run failover drills with timing instrumentation
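The knobs above map to standard Celery/kombu settings. A hedged sketch; the values are starting points to re-test in your own drills, not measured optima, and `mymaster` is an assumed service name:

```python
# Hypothetical tuning sketch: values are starting points, not measured optima.
CELERY_BROKER_TRANSPORT_OPTIONS = {
    "master_name": "mymaster",      # assumed Sentinel service name
    "socket_timeout": 2.0,          # fail fast instead of hanging on a dead master
    "retry_on_timeout": True,
    "visibility_timeout": 3600,     # seconds before an unacked task is redelivered
}

# Keep retrying the broker connection through the whole outage
CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True
CELERY_BROKER_CONNECTION_MAX_RETRIES = None

# Tighter publish-retry backoff so producers recover quickly after election
CELERY_TASK_PUBLISH_RETRY_POLICY = {
    "max_retries": 20,
    "interval_start": 0.0,
    "interval_step": 0.5,
    "interval_max": 2.0,
}
```

The trade-off: shorter timeouts recover faster from a dead master but can misfire during ordinary network blips, so re-measure with the drill above after each change.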

Key Takeaway

Redis Sentinel ensures infrastructure recovery.
Celery determines how fast your system actually resumes work.

In this test:

  • Sentinel recovery: instant
  • Application recovery: ~55 seconds

That gap is the real engineering challenge.


Final Thoughts

If you're using Redis Sentinel with Celery:

Don’t stop at:

“Failover works.”

Measure:

“How long until my system behaves normally again?”

Because that’s what production users experience.

