Rahim Ranxx
Hardening Distributed Systems: Retries, Circuit Breakers & Observability.

Building Resilient Distributed Systems: A Solo Engineer's Journey

How I turned flaky upstream APIs into a predictable, observable, and operator-friendly reliability layer — with code you can steal.


Introduction

If you've ever built a service that depends on external APIs (STAC catalogs, SentinelHub, weather data providers, etc.), you know the pain:

  • 429s when you hit rate limits
  • 502s when upstreams hiccup
  • Silent timeouts that leave jobs hanging
  • Retry storms that make bad days worse

Last month, I undertook a focused effort to harden the retry and resilience logic for an NDVI (Normalized Difference Vegetation Index) processing pipeline. What started as "let's clean up some duplicate retry code" evolved into a production-grade reliability subsystem that now governs every upstream interaction.

In this article, I'll walk through:

  1. Phase 1: Consolidating retry policy into a single source of truth
  2. Phase 2: Adding circuit breakers with observability and admin controls
  3. Phase 3 (preview): Decoupling dispatch with Redis Streams for back-pressure resilience
  4. Key principles I learned that you can apply to your own distributed systems

All code is Python/Django/Celery, but the patterns are language-agnostic. And yes — I did this alone. No team, no dedicated SRE, no platform squad. Just me, a codebase, and a lot of careful thinking.


The Problem Space

The NDVI pipeline I was working on orchestrates vegetation index calculations by:

  1. Querying STAC catalogs for satellite imagery metadata
  2. Fetching raster data from SentinelHub
  3. Computing NDVI values per farm/plot
  4. Returning results to farmers/agronomists

The challenge: Each upstream service has different failure modes:

  • STAC: occasional 502s, auth errors (401/403)
  • SentinelHub: strict rate limits (429), validation errors (422), transient 5xx
  • Network: timeouts, DNS failures, TLS handshake issues

Before my refactor, retry logic was scattered across 4+ modules, with inconsistent error classification and no centralized observability. Result? Hard-to-debug failures, wasted Celery retries, and on-call pages at 3 AM.

As a solo engineer, I couldn't afford to keep firefighting. I needed a system that would just work — or fail gracefully, with clear signals.


Phase 1: One Source of Truth for Retries

The Core Insight

Not all errors are retryable. Not all retries are equal.

I started by defining a canonical truth table mapping HTTP status codes to retry behavior:

# ndvi/retry_policy.py
from dataclasses import dataclass

@dataclass(frozen=True)
class RetryClassification:
    retryable: bool
    category: str

def classify_status_code(status_code: int | None) -> RetryClassification:
    """
    Canonical truth table: HTTP status → retry decision.

    | Status      | Retryable | Category           |
    |-------------|-----------|--------------------|
    | 401, 403    | False     | AUTH               |
    | 400, 422    | False     | VALIDATION         |
    | 429         | True      | RATE_LIMIT         |
    | >= 500      | True      | TRANSIENT_UPSTREAM |
    | Other/None  | False     | UNKNOWN            |
    """
    if status_code in (401, 403):
        return RetryClassification(retryable=False, category="AUTH")
    if status_code in (400, 422):
        return RetryClassification(retryable=False, category="VALIDATION")
    if status_code == 429:
        return RetryClassification(retryable=True, category="RATE_LIMIT")
    if status_code is not None and status_code >= 500:
        return RetryClassification(retryable=True, category="TRANSIENT_UPSTREAM")
    return RetryClassification(retryable=False, category="UNKNOWN")

Unified Exception Hierarchy

I made all upstream errors inherit from a common base, ensuring consistent attributes:

class UpstreamFailureError(NdviFailureError):
    """Base for all retryable upstream failures."""
    def __init__(self, message: str, status_code: int | None = None, response: Response | None = None):
        super().__init__(message)
        self.status_code = status_code
        self.response = response
        # Delegate to canonical classifier
        classification = classify_status_code(status_code)
        self.retryable = classification.retryable
        self.category = classification.category
        self.delay = self._compute_delay(classification)

class StacUpstreamError(UpstreamFailureError, StacError): ...
class SentinelHubUpstreamError(UpstreamFailureError): ...
class SentinelHubRasterError(UpstreamFailureError): ...
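The `_compute_delay` helper called in the constructor is elided above. A minimal sketch, assuming exponential backoff with full jitter and per-category base delays (both the formula and the base values here are my assumptions, not the original's):

```python
import random

# Hypothetical per-category base delays; the original's values are not shown.
BASE_DELAYS = {"RATE_LIMIT": 10.0, "TRANSIENT_UPSTREAM": 2.0}

def compute_delay(category: str, attempt: int = 1, cap: float = 120.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)].

    Full jitter spreads retries out, which is exactly what prevents the
    retry storms mentioned in the introduction.
    """
    base = BASE_DELAYS.get(category, 0.0)
    if base == 0.0:
        return 0.0  # non-retryable categories (AUTH, VALIDATION, UNKNOWN) carry no delay
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

The cap matters as much as the curve: without it, a long outage pushes delays past any useful Celery task timeout.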

Centralized Retry Decision

from dataclasses import dataclass

@dataclass
class RetryDecision:
    retry: bool
    delay: float
    reason: str

def should_retry(exc: Exception, response_headers: dict | None = None) -> RetryDecision:
    if not isinstance(exc, UpstreamFailureError):
        return RetryDecision(retry=False, delay=0.0, reason="non-retryable-exception")

    # Respect Retry-After header for 429s
    if exc.status_code == 429 and response_headers:
        server_delay = parse_retry_after(response_headers.get("Retry-After"))
        if server_delay is not None:
            return RetryDecision(retry=True, delay=server_delay, reason="retry-after-header")

    return RetryDecision(
        retry=exc.retryable,
        delay=exc.delay,
        reason=f"{exc.category}-classification"
    )

Impact

  • 28 parametrized tests covering all 13 truth-table branches
  • Removed 3 duplicate retry implementations
  • Celery tasks now use shared should_retry() logic
  • Network errors properly wrapped → no more silent failures

Lesson #1: Centralize failure classification. When retry logic lives in one place, you can test it thoroughly, document it clearly, and evolve it safely — even when you're the only one maintaining it.


Phase 2: Circuit Breakers with Teeth

Retries alone aren't enough. When an upstream is truly down, you want to fail fast and avoid thundering herds.

The Circuit Breaker State Machine

I implemented a simple but effective three-state breaker:

CLOSED    → OPEN       when failures ≥ threshold
OPEN      → HALF_OPEN  when the timeout elapses
HALF_OPEN → CLOSED     on a successful probe
HALF_OPEN → OPEN       on a failed probe
class _CircuitBreaker:
    def __init__(self, engine: str, threshold: int = 3, timeout_secs: float = 300):
        self.engine = engine  # upstream name, used as the metric label below
        self.state = "CLOSED"
        self.failure_count = 0
        self.last_failure_time: float | None = None
        self.threshold = threshold
        self.timeout_secs = timeout_secs

    def record_success(self):
        self.failure_count = 0
        self._transition_to("CLOSED")

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.threshold:
            self._transition_to("OPEN")

    def allow_request(self) -> bool:
        if self.state == "CLOSED":
            return True
        if self.state == "OPEN":
            if time.time() - self.last_failure_time >= self.timeout_secs:
                self._transition_to("HALF_OPEN")
                return True
            return False
        # HALF_OPEN: allow probe requests through (a stricter variant
        # would admit only one probe in flight at a time)
        return True

    def _transition_to(self, new_state: str):
        old_state = self.state
        self.state = new_state
        logger.info(f"Circuit breaker [{self.engine}]: {old_state} → {new_state}")
        # Export Prometheus metric
        circuit_breaker_state.labels(engine=self.engine).set(STATE_VALUES[new_state])
        circuit_breaker_transitions.labels(
            engine=self.engine, from_state=old_state, to_state=new_state
        ).inc()

Observability First

I didn't just build the breaker — I made it visible:

# Prometheus metrics
circuit_breaker_state = Gauge(
    'ndvi_circuit_breaker_state',
    'Current circuit breaker state (0=CLOSED, 1=OPEN, 2=HALF_OPEN)',
    labelnames=['engine']
)

circuit_breaker_transitions = Counter(
    'ndvi_circuit_breaker_transitions_total',
    'Count of circuit breaker state transitions',
    labelnames=['engine', 'from_state', 'to_state']
)

And added a Grafana dashboard with:

  • Stat panels showing current state per engine (color-coded: 🟢 CLOSED, 🔴 OPEN, 🟡 HALF_OPEN)
  • Time series of transition rates
  • Correlation with upstream failure rates

Operator Controls

Because things will go wrong — and when you're solo, you are the operator — I added an admin endpoint to manually reset breakers:

POST /api/v1/ndvi/circuit-breaker/reset/
Content-Type: application/json
Authorization: Bearer <admin-token>

{ "engine": "stac" }

→ { "data": { "previous_state": "OPEN", "new_state": "CLOSED" } }

Lesson #2: Resilience patterns need observability and escape hatches. If you can't see it or control it, you don't own it — and when you're the only one on call, "owning it" means sleeping at night.


Phase 3 Preview: Decoupling with Redis Streams

As I scaled the system, I hit a new challenge: Celery broker unavailability during Redis Sentinel failover (~55 seconds of downtime). For background jobs, this was acceptable. But for real-time dispatch, I needed better.

The Architecture Decision

Instead of relying on Celery's built-in Redis transport, I chose a separate consumer pattern:

API → [Redis Stream] → Consumer → [Celery Queue] → Worker

Why?

  • Avoids Celery/Kombu stream support uncertainty
  • Easier to observe and debug (explicit XREADGROUP/XACK)
  • Natural back-pressure via XPENDING monitoring
  • Cleaner rollback path (just flip a feature flag)
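The consumer's core loop can be sketched with the redis-py call shapes (`xreadgroup`/`xack`); the injected client, batch size, and block timeout here are my choices for illustration:

```python
def consume_loop(client, stream: str, group: str, consumer: str, handle, once: bool = False):
    """Core XREADGROUP/XACK loop: read a batch, process each entry, ack only on success.

    `client` is anything with redis-py's xreadgroup/xack call shapes,
    which keeps the sketch testable without a live Redis.
    """
    while True:
        # ">" asks for entries never delivered to this consumer group before
        batches = client.xreadgroup(group, consumer, {stream: ">"}, count=10, block=1000)
        for _stream_name, entries in batches or []:
            for entry_id, fields in entries:
                try:
                    handle(fields)
                except Exception:
                    # No XACK: the entry stays in the pending list and is
                    # re-delivered or claimed later (and shows up in the
                    # XPENDING back-pressure counts discussed below)
                    continue
                client.xack(stream, group, entry_id)
        if once:  # escape hatch for tests; the real consumer runs until SIGTERM
            return
```

Ack-after-success is the whole trick: a crash mid-entry leaves the entry pending instead of silently lost, which is exactly the "easier to observe and debug" property listed above.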

Key Design Decisions I Made Early

1. Idempotency by Design

stream_payload = {
    "job_id": job.id,
    "request_hash": job.request_hash,  # Primary idempotency key
    "schema_version": 1,                # Future-proofing
    "colormap_normalization": "histogram",  # Evolved schema
    # ... other fields
}
# Consumer checks request_hash before enqueueing to Celery
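The "consumer checks request_hash" step can be made concrete with a set-if-absent guard. This sketch relies on Redis `SET NX EX` semantics; the key prefix and TTL are my inventions:

```python
def should_enqueue(redis_client, request_hash: str, ttl_secs: int = 3600) -> bool:
    """Return True exactly once per request_hash within the TTL window.

    SET with nx=True is atomic, so two consumers racing on the same hash
    cannot both win; the loser sees a falsy reply and drops the duplicate.
    """
    key = f"ndvi:dedup:{request_hash}"  # hypothetical key prefix
    return bool(redis_client.set(key, "1", nx=True, ex=ttl_secs))
```

The TTL bounds memory while still covering the window in which re-delivery of the same stream entry is plausible.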

2. Error Classification at Consumer Boundary

Not all failures should retry:

ERROR_STRATEGY = {
    "no_items": "DLQ",           # Permanent: no data exists
    "missing_assets": "DLQ",     # Permanent: schema mismatch
    "network_timeout": "RETRY",  # Transient: try again
    "celery_unavailable": "RETRY_WITH_BACKOFF",  # Infrastructure blip
}
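Acting on that table is a small dispatcher. The DLQ write shown here (an `XADD` to a parallel stream, then acking the original) is my assumption about the mechanics, as is the defaulting of unknown kinds to the DLQ:

```python
# Restated from above so the sketch runs standalone
ERROR_STRATEGY = {
    "no_items": "DLQ",           # Permanent: no data exists
    "missing_assets": "DLQ",     # Permanent: schema mismatch
    "network_timeout": "RETRY",  # Transient: try again
    "celery_unavailable": "RETRY_WITH_BACKOFF",  # Infrastructure blip
}

def handle_failure(client, stream: str, group: str, entry_id: str,
                   fields: dict, error_kind: str) -> str:
    """Route a failed entry per ERROR_STRATEGY; unknown kinds go to the DLQ to be safe."""
    action = ERROR_STRATEGY.get(error_kind, "DLQ")
    if action == "DLQ":
        # Park the entry (plus the reason) on a parallel stream, then ack the original
        client.xadd(f"{stream}:dlq", {**fields, "error": error_kind})
        client.xack(stream, group, entry_id)
    # RETRY / RETRY_WITH_BACKOFF: leave the entry unacked so re-delivery picks it up
    return action
```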

3. Back-Pressure Strategy

PENDING_WARNING = 1_000
PENDING_CRITICAL = 5_000

pending_count = redis.xpending(stream_name, group_name)["pending"]
if pending_count > PENDING_CRITICAL:
    # Return 429 from the API to slow producers (Django ships no dedicated
    # 429 response class, so set the status explicitly)
    return HttpResponse("Upstream backlog critical", status=429)
elif pending_count > PENDING_WARNING:
    logger.warning(f"Stream backlog growing: {pending_count} pending")

4. Graceful Shutdown

# In consume_ndvi_stream.py
import signal
import threading

shutdown_flag = threading.Event()

def handle_shutdown(signum, frame):
    shutdown_flag.set()  # Stop accepting new entries
    # The loop finishes its current entry, XACKs it if successful,
    # then exits cleanly so the orchestrator can restart the process

signal.signal(signal.SIGTERM, handle_shutdown)

Lesson #3: Decoupling isn't just about scalability — it's about failure isolation. When one component fails, the rest can keep moving. And when you're solo, isolation means you can debug one piece without bringing down the whole system.


Principles I Learned (That You Can Steal)

1. Make Failure Explicit

Don't hide errors behind generic exceptions. Classify them, tag them, and route them intentionally. Your future self — especially at 3 AM — will thank you.

2. Observability Is a Feature, Not an Afterthought

If you can't measure it, you can't improve it. Export metrics at the point of decision (retry? circuit open? stream lag?) — not just at the edges. When you're the only one debugging, every metric is a lifeline.

3. Design for the "Boring" Failure Modes

Everyone plans for the 500 error. Few plan for:

  • Broker failover latency
  • Consumer restart mid-processing
  • Schema evolution mid-deploy
  • Clock skew in distributed timestamps

Document these. Test them. Build escape hatches. When you don't have a team to lean on, preparation is your best defense.

4. Centralize, Then Specialize

Start with a single source of truth (like classify_status_code()). Then layer on engine-specific behavior on top of that foundation. This prevents drift and duplication — critical when you're the only one maintaining the code.

5. Operator Experience Matters

Admin endpoints, health checks, clear logs, and meaningful metrics aren't "nice to have" — they're what let you sleep at night. Build them in from day one. When you're solo, you are the operator.


A Note on Solo Engineering

Working alone doesn't mean working in isolation. I leaned heavily on:

  • Public documentation: Google SRE book, AWS Well-Architected, Martin Fowler's patterns
  • Open source: Studying how Celery, Kombu, and Redis clients handle resilience
  • Community: Reading post-mortems, blog posts, and conference talks from engineers who've been there

And I documented everything. Not for a team — for my future self. Every architecture decision, every tradeoff, every "why" is written down. Because six months from now, I won't remember why I chose 300s for the circuit breaker timeout. But my docs will.

If you're also building alone: you're not behind. You're just optimizing for a different constraint. Depth over breadth. Clarity over velocity. Resilience over features.


Conclusion

Building resilient distributed systems isn't about fancy algorithms or cutting-edge tools. It's about discipline: clear contracts, explicit failure handling, observable behavior, and operator empathy.

The NDVI pipeline I built isn't perfect. My circuit breakers are still process-local (not cluster-wide). My stream consumer doesn't yet support distributed tracing. But it's predictable, testable, and recoverable — and that's what matters.

If you take one thing from this article, let it be this:

Resilience isn't a feature you add at the end. It's a mindset you build in from the start — whether you're on a team of 50 or flying solo.

All code examples are simplified for clarity; production versions include additional error handling and logging. This work reflects my personal approach — your mileage may vary, and that's okay.


💡 Pro Tip: Want to try the circuit breaker pattern? Start small:

  1. Add a failure_count and last_failure_time to your HTTP client
  2. Skip requests when failure_count >= 3 and time_since_failure < 300
  3. Log state transitions
  4. Add one Prometheus gauge

You'll be 80% of the way there — and you'll learn what actually matters for your workload.
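Those four steps fit in roughly twenty lines. A sketch, using the thresholds from the steps above (step 4's gauge would be a `prometheus_client.Gauge.set()` call wherever the transitions are printed):

```python
import time

class MiniBreaker:
    """Steps 1-3 of the tip: count failures, skip while tripped, log transitions."""

    def __init__(self, threshold: int = 3, cooldown: float = 300.0):
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.threshold = threshold
        self.cooldown = cooldown

    def allow(self) -> bool:
        """Step 2: skip requests while failure_count >= threshold inside the cooldown."""
        tripped = self.failure_count >= self.threshold
        cooling = (time.time() - self.last_failure_time) < self.cooldown
        return not (tripped and cooling)

    def record(self, ok: bool) -> None:
        """Step 1: track failures; step 3: log state transitions."""
        if ok:
            if self.failure_count >= self.threshold:
                print("breaker: OPEN -> CLOSED")
            self.failure_count = 0
        else:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count == self.threshold:
                print("breaker: CLOSED -> OPEN")
```

Wrap your HTTP client's call site with `if breaker.allow(): ...` and feed the outcome to `record()` — that's the whole integration surface.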

