Building Resilient Distributed Systems: A Solo Engineer's Journey
How I turned flaky upstream APIs into a predictable, observable, and operator-friendly reliability layer — with code you can steal.
Introduction
If you've ever built a service that depends on external APIs (STAC catalogs, SentinelHub, weather data providers, etc.), you know the pain:
- 429s when you hit rate limits
- 502s when upstreams hiccup
- Silent timeouts that leave jobs hanging
- Retry storms that make bad days worse
Last month, I undertook a focused effort to harden the retry and resilience logic for an NDVI (Normalized Difference Vegetation Index) processing pipeline. What started as "let's clean up some duplicate retry code" evolved into a production-grade reliability subsystem that now governs every upstream interaction.
In this article, I'll walk through:
- Phase 1: Consolidating retry policy into a single source of truth
- Phase 2: Adding circuit breakers with observability and admin controls
- Phase 3 (preview): Decoupling dispatch with Redis Streams for back-pressure resilience
- Key principles I learned that you can apply to your own distributed systems
All code is Python/Django/Celery, but the patterns are language-agnostic. And yes — I did this alone. No team, no dedicated SRE, no platform squad. Just me, a codebase, and a lot of careful thinking.
The Problem Space
The NDVI pipeline I was working on orchestrates vegetation index calculations by:
- Querying STAC catalogs for satellite imagery metadata
- Fetching raster data from SentinelHub
- Computing NDVI values per farm/plot
- Returning results to farmers/agronomists
The challenge: Each upstream service has different failure modes:
- STAC: occasional 502s, auth errors (401/403)
- SentinelHub: strict rate limits (429), validation errors (422), transient 5xx
- Network: timeouts, DNS failures, TLS handshake issues
Before my refactor, retry logic was scattered across 4+ modules, with inconsistent error classification and no centralized observability. Result? Hard-to-debug failures, wasted Celery retries, and on-call pages at 3 AM.
As a solo engineer, I couldn't afford to keep firefighting. I needed a system that would just work — or fail gracefully, with clear signals.
Phase 1: One Source of Truth for Retries
The Core Insight
Not all errors are retryable. Not all retries are equal.
I started by defining a canonical truth table mapping HTTP status codes to retry behavior:
# ndvi/retry_policy.py
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryClassification:
    retryable: bool
    category: str


def classify_status_code(status_code: int | None) -> RetryClassification:
    """
    Canonical truth table: HTTP status → retry decision.

    | Status      | Retryable | Category           |
    |-------------|-----------|--------------------|
    | 401, 403    | False     | AUTH               |
    | 400, 422    | False     | VALIDATION         |
    | 429         | True      | RATE_LIMIT         |
    | >= 500      | True      | TRANSIENT_UPSTREAM |
    | Other/None  | False     | UNKNOWN            |
    """
    if status_code in (401, 403):
        return RetryClassification(retryable=False, category="AUTH")
    if status_code in (400, 422):
        return RetryClassification(retryable=False, category="VALIDATION")
    if status_code == 429:
        return RetryClassification(retryable=True, category="RATE_LIMIT")
    if status_code is not None and status_code >= 500:
        return RetryClassification(retryable=True, category="TRANSIENT_UPSTREAM")
    return RetryClassification(retryable=False, category="UNKNOWN")
Unified Exception Hierarchy
I made all upstream errors inherit from a common base, ensuring consistent attributes:
class UpstreamFailureError(NdviFailureError):
    """Base for all upstream failures; the canonical classifier decides retryability."""

    def __init__(self, message: str, status_code: int | None = None, response: Response | None = None):
        super().__init__(message)
        self.status_code = status_code
        self.response = response
        # Delegate to canonical classifier
        classification = classify_status_code(status_code)
        self.retryable = classification.retryable
        self.category = classification.category
        self.delay = self._compute_delay(classification)


class StacUpstreamError(UpstreamFailureError, StacError): ...
class SentinelHubUpstreamError(UpstreamFailureError): ...
class SentinelHubRasterError(UpstreamFailureError): ...
Centralized Retry Decision
@dataclass
class RetryDecision:
    retry: bool
    delay: float
    reason: str


def should_retry(exc: Exception, response_headers: dict | None = None) -> RetryDecision:
    if not isinstance(exc, UpstreamFailureError):
        return RetryDecision(retry=False, delay=0.0, reason="non-retryable-exception")

    # Respect Retry-After header for 429s
    if exc.status_code == 429 and response_headers:
        server_delay = parse_retry_after(response_headers.get("Retry-After"))
        if server_delay is not None:
            return RetryDecision(retry=True, delay=server_delay, reason="retry-after-header")

    return RetryDecision(
        retry=exc.retryable,
        delay=exc.delay,
        reason=f"{exc.category}-classification",
    )
Impact
- 28 parametrized tests covering all 13 truth-table branches
- Removed 3 duplicate retry implementations
- Celery tasks now use shared should_retry() logic
- Network errors properly wrapped → no more silent failures
Lesson #1: Centralize failure classification. When retry logic lives in one place, you can test it thoroughly, document it clearly, and evolve it safely — even when you're the only one maintaining it.
Phase 2: Circuit Breakers with Teeth
Retries alone aren't enough. When an upstream is truly down, you want to fail fast and avoid thundering herds.
The Circuit Breaker State Machine
I implemented a simple but effective three-state breaker:
CLOSED → (failures ≥ threshold) → OPEN → (timeout elapsed) → HALF_OPEN
HALF_OPEN → (probe succeeds) → CLOSED
HALF_OPEN → (probe fails) → OPEN
import time


class _CircuitBreaker:
    def __init__(self, engine: str, threshold: int = 3, timeout_secs: float = 300):
        self.engine = engine  # label for logs/metrics, e.g. "stac"
        self.state = "CLOSED"
        self.failure_count = 0
        self.last_failure_time: float | None = None
        self.threshold = threshold
        self.timeout_secs = timeout_secs

    def record_success(self):
        self.failure_count = 0
        self._transition_to("CLOSED")

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.threshold:
            self._transition_to("OPEN")

    def allow_request(self) -> bool:
        if self.state == "CLOSED":
            return True
        if self.state == "OPEN":
            if time.time() - self.last_failure_time >= self.timeout_secs:
                self._transition_to("HALF_OPEN")
                return True
            return False
        # HALF_OPEN: allow one probe request
        return True

    def _transition_to(self, new_state: str):
        if new_state == self.state:
            return  # no-op transitions shouldn't spam logs or metrics
        old_state = self.state
        self.state = new_state
        logger.info(f"Circuit breaker ({self.engine}): {old_state} → {new_state}")
        # Export Prometheus metrics
        circuit_breaker_state.labels(engine=self.engine).set(STATE_VALUES[new_state])
        circuit_breaker_transitions.labels(
            engine=self.engine, from_state=old_state, to_state=new_state
        ).inc()
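To watch the state machine run without waiting five real minutes, here's a self-contained toy version: metrics and logging stripped, clock injected. It mirrors the logic above as a sketch, not the production class.

```python
import time


class ToyBreaker:
    """Minimal three-state breaker with an injectable clock for testing."""

    def __init__(self, threshold=3, timeout_secs=300, clock=time.time):
        self.state = "CLOSED"
        self.failure_count = 0
        self.last_failure_time = None
        self.threshold = threshold
        self.timeout_secs = timeout_secs
        self.clock = clock

    def record_success(self):
        self.failure_count = 0
        self.state = "CLOSED"

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = self.clock()
        if self.failure_count >= self.threshold:
            self.state = "OPEN"

    def allow_request(self):
        if self.state == "OPEN":
            if self.clock() - self.last_failure_time >= self.timeout_secs:
                self.state = "HALF_OPEN"  # let exactly one probe through
                return True
            return False
        return True  # CLOSED or HALF_OPEN


# Walk the full lifecycle with a fake clock
now = [0.0]
b = ToyBreaker(threshold=2, timeout_secs=300, clock=lambda: now[0])

b.record_failure(); b.record_failure()       # trip the breaker
assert b.state == "OPEN" and not b.allow_request()

now[0] += 301                                 # timeout elapses
assert b.allow_request() and b.state == "HALF_OPEN"

b.record_success()                            # probe succeeds
assert b.state == "CLOSED"
```

Injecting the clock is the important trick: it makes the OPEN → HALF_OPEN timeout testable in milliseconds instead of minutes.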
Observability First
I didn't just build the breaker — I made it visible:
# Prometheus metrics
circuit_breaker_state = Gauge(
    'ndvi_circuit_breaker_state',
    'Current circuit breaker state (0=CLOSED, 1=OPEN, 2=HALF_OPEN)',
    labelnames=['engine'],
)
circuit_breaker_transitions = Counter(
    'ndvi_circuit_breaker_transitions_total',
    'Count of circuit breaker state transitions',
    labelnames=['engine', 'from_state', 'to_state'],
)
And added a Grafana dashboard with:
- Stat panels showing current state per engine (color-coded: 🟢 CLOSED, 🔴 OPEN, 🟡 HALF_OPEN)
- Time series of transition rates
- Correlation with upstream failure rates
Operator Controls
Because things will go wrong — and when you're solo, you are the operator — I added an admin endpoint to manually reset breakers:
POST /api/v1/ndvi/circuit-breaker/reset/
Content-Type: application/json
Authorization: Bearer <admin-token>
{ "engine": "stac" }
→ { "data": { "previous_state": "OPEN", "new_state": "CLOSED" } }
Lesson #2: Resilience patterns need observability and escape hatches. If you can't see it or control it, you don't own it — and when you're the only one on call, "owning it" means sleeping at night.
Phase 3 Preview: Decoupling with Redis Streams
As I scaled the system, I hit a new challenge: Celery broker unavailability during Redis Sentinel failover (~55 seconds of downtime). For background jobs this was acceptable, but for real-time dispatch I needed something better.
The Architecture Decision
Instead of relying on Celery's built-in Redis transport, I chose a separate consumer pattern:
API → [Redis Stream] → Consumer → [Celery Queue] → Worker
Why?
- Avoids uncertainty around Celery/Kombu's support for Redis Streams as a transport
- Easier to observe and debug (explicit XREADGROUP/XACK)
- Natural back-pressure via XPENDING monitoring
- Cleaner rollback path (just flip a feature flag)
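The consumer's read loop is deliberately boring. A sketch of one batch iteration with redis-py (the stream and group names are made up, and enqueue is a placeholder for the Celery hand-off):

```python
def consume_batch(r, stream="ndvi:jobs", group="dispatchers", consumer="c-1",
                  count=10, block_ms=1000, enqueue=lambda fields: None):
    """Read up to `count` new entries, dispatch each, and XACK only on success."""
    entries = r.xreadgroup(group, consumer, {stream: ">"}, count=count, block=block_ms)
    acked = []
    for _stream, messages in entries or []:
        for msg_id, fields in messages:
            try:
                enqueue(fields)   # hand off to Celery (placeholder)
            except Exception:
                continue          # leave unacked → visible in XPENDING, reclaimable via XCLAIM
            r.xack(stream, group, msg_id)
            acked.append(msg_id)
    return acked
```

The ack-after-dispatch ordering is the whole point: a crash between enqueue and XACK leaves the entry pending, so it gets redelivered rather than lost, and the request_hash check downstream keeps the redelivery from double-processing.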
Key Design Decisions I Made Early
1. Idempotency by Design
stream_payload = {
    "job_id": job.id,
    "request_hash": job.request_hash,       # Primary idempotency key
    "schema_version": 1,                    # Future-proofing
    "colormap_normalization": "histogram",  # Evolved schema
    # ... other fields
}
# Consumer checks request_hash before enqueueing to Celery
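How the request_hash gets computed matters: it has to be stable under key ordering, or identical requests won't deduplicate. One way to do it (a sketch; the field names are illustrative, not my actual schema):

```python
import hashlib
import json


def compute_request_hash(params: dict) -> str:
    """Stable hash of request parameters: same inputs → same idempotency key."""
    # sort_keys + fixed separators give a canonical JSON form,
    # so dict insertion order can't change the hash
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = compute_request_hash({"plot_id": 42, "date": "2025-06-01"})
b = compute_request_hash({"date": "2025-06-01", "plot_id": 42})  # same params, reordered
assert a == b
```

One caveat worth deciding up front: which fields belong in the hash. Anything incidental (trace IDs, timestamps) must be excluded, or every request looks unique.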
2. Error Classification at Consumer Boundary
Not all failures should retry:
ERROR_STRATEGY = {
    "no_items": "DLQ",                           # Permanent: no data exists
    "missing_assets": "DLQ",                     # Permanent: schema mismatch
    "network_timeout": "RETRY",                  # Transient: try again
    "celery_unavailable": "RETRY_WITH_BACKOFF",  # Infrastructure blip
}
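A tiny router turns that table into a decision. The default I chose for unknown kinds is DLQ rather than retry, on the theory that blind retries are the more expensive mistake; that default is my own bias, and you might reasonably prefer the opposite:

```python
ERROR_STRATEGY = {
    "no_items": "DLQ",
    "missing_assets": "DLQ",
    "network_timeout": "RETRY",
    "celery_unavailable": "RETRY_WITH_BACKOFF",
}


def route_failure(error_kind: str) -> str:
    """Map a consumer-side failure kind to an action; unknown kinds go to the DLQ."""
    return ERROR_STRATEGY.get(error_kind, "DLQ")
```

Keeping this as data plus a one-line function means new failure modes are a dictionary entry, not a code change scattered across the consumer.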
3. Back-Pressure Strategy
PENDING_WARNING = 1_000
PENDING_CRITICAL = 5_000

pending_count = redis.xpending(stream_name, group_name)["pending"]

if pending_count > PENDING_CRITICAL:
    # Return 429 on the API to slow producers (Django has no
    # HttpResponseTooManyRequests class, so set the status explicitly)
    return HttpResponse("Upstream backlog critical", status=429)
elif pending_count > PENDING_WARNING:
    logger.warning(f"Stream backlog growing: {pending_count} pending")
4. Graceful Shutdown
# In consume_ndvi_stream.py
shutdown_flag = threading.Event()

def handle_shutdown(signum, frame):
    shutdown_flag.set()  # Stop accepting new entries
    # Finish current entry, XACK if successful
    # Exit cleanly → orchestrator restarts

signal.signal(signal.SIGTERM, handle_shutdown)
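Putting the flag together with the read loop, the consumer's main loop can be sketched like this (self-contained; process_one_entry stands in for the real XREADGROUP/XACK cycle, and max_iterations is a test hook that production wouldn't need):

```python
import signal
import threading

shutdown_flag = threading.Event()


def handle_shutdown(signum, frame):
    shutdown_flag.set()  # finish the in-flight entry, then exit


def run_consumer(process_one_entry, max_iterations=None):
    """Loop until SIGTERM; each iteration fully processes (and acks) one entry."""
    signal.signal(signal.SIGTERM, handle_shutdown)
    iterations = 0
    while not shutdown_flag.is_set():
        process_one_entry()
        iterations += 1
        if max_iterations is not None and iterations >= max_iterations:
            break  # test hook; production loops until signaled
    return iterations
```

Because the flag is only checked between entries, an entry is never abandoned half-done: it either finishes and gets acked, or the crash leaves it pending for redelivery.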
Lesson #3: Decoupling isn't just about scalability — it's about failure isolation. When one component fails, the rest can keep moving. And when you're solo, isolation means you can debug one piece without bringing down the whole system.
Principles I Learned (That You Can Steal)
1. Make Failure Explicit
Don't hide errors behind generic exceptions. Classify them, tag them, and route them intentionally. Your future self — especially at 3 AM — will thank you.
2. Observability Is a Feature, Not an Afterthought
If you can't measure it, you can't improve it. Export metrics at the point of decision (retry? circuit open? stream lag?) — not just at the edges. When you're the only one debugging, every metric is a lifeline.
3. Design for the "Boring" Failure Modes
Everyone plans for the 500 error. Few plan for:
- Broker failover latency
- Consumer restart mid-processing
- Schema evolution mid-deploy
- Clock skew in distributed timestamps
Document these. Test them. Build escape hatches. When you don't have a team to lean on, preparation is your best defense.
4. Centralize, Then Specialize
Start with a single source of truth (like classify_status_code()). Then layer engine-specific behavior on top of that foundation. This prevents drift and duplication — critical when you're the only one maintaining the code.
5. Operator Experience Matters
Admin endpoints, health checks, clear logs, and meaningful metrics aren't "nice to have" — they're what let you sleep at night. Build them in from day one. When you're solo, you are the operator.
A Note on Solo Engineering
Working alone doesn't mean working in isolation. I leaned heavily on:
- Public documentation: Google SRE book, AWS Well-Architected, Martin Fowler's patterns
- Open source: Studying how Celery, Kombu, and Redis clients handle resilience
- Community: Reading post-mortems, blog posts, and conference talks from engineers who've been there
And I documented everything. Not for a team — for my future self. Every architecture decision, every tradeoff, every "why" is written down. Because six months from now, I won't remember why I chose 300s for the circuit breaker timeout. But my docs will.
If you're also building alone: you're not behind. You're just optimizing for a different constraint. Depth over breadth. Clarity over velocity. Resilience over features.
Conclusion
Building resilient distributed systems isn't about fancy algorithms or cutting-edge tools. It's about discipline: clear contracts, explicit failure handling, observable behavior, and operator empathy.
The NDVI pipeline I built isn't perfect. My circuit breakers are still process-local (not cluster-wide). My stream consumer doesn't yet support distributed tracing. But it's predictable, testable, and recoverable — and that's what matters.
If you take one thing from this article, let it be this:
Resilience isn't a feature you add at the end. It's a mindset you build in from the start — whether you're on a team of 50 or flying solo.
All code examples are simplified for clarity; production versions include additional error handling and logging. This work reflects my personal approach — your mileage may vary, and that's okay.
💡 Pro Tip: Want to try the circuit breaker pattern? Start small:
- Add a failure_count and last_failure_time to your HTTP client
- Skip requests when failure_count >= 3 and time_since_failure < 300
- Log state transitions
- Add one Prometheus gauge
You'll be 80% of the way there — and you'll learn what actually matters for your workload.
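The steps above can be sketched as a thin wrapper around whatever call your client already makes (call_fn is a placeholder for yours; the thresholds match the pro tip, and the injected clock is just for testability):

```python
import logging
import time

logger = logging.getLogger(__name__)


class GuardedClient:
    """Pro-tip-sized breaker: a count, a timestamp, and a skip rule."""

    def __init__(self, call_fn, threshold=3, cooloff_secs=300, clock=time.time):
        self.call_fn = call_fn
        self.threshold = threshold
        self.cooloff_secs = cooloff_secs
        self.clock = clock
        self.failure_count = 0
        self.last_failure_time = None

    def call(self, *args, **kwargs):
        tripped = (
            self.failure_count >= self.threshold
            and self.last_failure_time is not None
            and self.clock() - self.last_failure_time < self.cooloff_secs
        )
        if tripped:
            raise RuntimeError("circuit open: skipping upstream call")
        try:
            result = self.call_fn(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = self.clock()
            if self.failure_count == self.threshold:
                logger.warning("breaker tripped after %d failures", self.failure_count)
            raise
        self.failure_count = 0
        return result
```

No HALF_OPEN state here: once the cooloff elapses, the next call simply goes through as the probe, and either resets the count or re-trips the breaker. That asymmetry is most of what the full state machine buys you.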