DEV Community: Rahim Ranxx

From NDVI to a Generic Spectral Engine: Architecting Scalable Earth Observation Pipelines.

Rahim Ranxx — Sun, 28 Jun 2026 07:16:26 +0000

Scaling AgTech Analytics: From NDVI to a Generic Spectral Engine

Most remote sensing and Earth observation projects begin with a single metric: NDVI (Normalized Difference Vegetation Index).

Mine did too.

Initially, this wasn't a problem. Processing one spectral index meant maintaining one computation path, one imagery loader, and one set of satellite provider integrations. Everything was straightforward and manageable.

Then, reality arrived.

I needed to add NDMI (Normalized Difference Moisture Index) to improve farm moisture monitoring across data sources such as Sentinel-2, Landsat, MODIS, and STAC catalogs. At first glance, this looked like a standard feature request.

It wasn't. Adding NDMI exposed a critical architectural bottleneck that had been quietly growing inside the platform.

The Problem with Scaling Spectral Indices

The original implementation followed a familiar, but ultimately flawed, pattern: every spectral index and every data provider had its own bespoke implementation. The codebase was bloating into a matrix of redundant pipelines:

NDVI + Sentinel-2
NDVI + STAC
NDVI + Landsat
NDWI + Sentinel-2
NDWI + STAC
...and now NDMI.

Every new vegetation or moisture index multiplied the codebase. Adding one feature meant generating another set of loaders, unit tests, API handlers, and maintenance paths.

The system wasn't technically broken, but this wasn't sustainable platform engineering. It wasn't becoming more intelligent; it was simply becoming more repetitive.

Designing for the Fourth Index, Not the Third

Rather than brute-forcing NDMI directly into the existing structure, I paused to ask a fundamental architecture question:

What infrastructure would make the next five spectral indices almost free to implement?

This completely shifted the project's trajectory. Instead of writing yet another custom loader, I built clean abstractions around three core concepts:

Spectral Formulas
Sensor Band Mappings
Generic Compute Engines

By transitioning these elements from hardcoded logic into dynamic data, the entire analytics pipeline simplified dramatically.

1. Spectral Formulas as Configuration

Every spectral index shares a basic blueprint requiring a name, a set of sensor bands, and a mathematical formula. Instead of scattering these definitions throughout the business logic, they now live in a centralized registry.

Adding a new index no longer requires a new processing pipeline. It simply requires registering a new formula. The underlying compute engine remains untouched.
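A minimal sketch of what such a registry might look like. The names `SpectralIndex`, `INDEX_REGISTRY`, `register`, and `compute_pixel` are hypothetical and not taken from the platform's code; the formulas are the standard normalized-difference definitions, with NDWI in its green/NIR form.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


def normalized_difference(a: float, b: float) -> float:
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0


@dataclass(frozen=True)
class SpectralIndex:
    """Hypothetical registry entry: a name, its bands, and its formula."""
    name: str
    bands: Tuple[str, ...]
    formula: Callable[..., float]


INDEX_REGISTRY: Dict[str, SpectralIndex] = {}


def register(index: SpectralIndex) -> None:
    """Add one index definition to the central registry."""
    INDEX_REGISTRY[index.name] = index


# Adding an index is one line of data, not a new pipeline.
register(SpectralIndex("NDVI", ("nir", "red"), normalized_difference))
register(SpectralIndex("NDWI", ("green", "nir"), normalized_difference))
register(SpectralIndex("NDMI", ("nir", "swir1"), normalized_difference))


def compute_pixel(index_name: str, reflectance: Dict[str, float]) -> float:
    """Look up the index and apply its formula to one pixel's band values."""
    index = INDEX_REGISTRY[index_name]
    return index.formula(*(reflectance[band] for band in index.bands))
```

Under this sketch, supporting a fourth index would mean one more `register(...)` call, provided the bands it needs are already available.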

2. Abstracting Sensor Band Names

Previously, provider-specific naming conventions leaked throughout the codebase—Sentinel-2 calls a band one thing, while Landsat and MODIS use entirely different conventions.

Now, providers expose abstract band names. The compute engine simply requests universal identifiers:

nir (Near-Infrared)
red
green
swir1 (Short-Wave Infrared)

Each provider is responsible for resolving these abstract names to its own assets. The scientific computation layer no longer knows (or cares) which satellite produced the imagery.
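One way this resolution could be expressed is as a per-provider lookup table. The band numbers below are the published Sentinel-2 and Landsat 8/9 designations; the `BAND_MAPS` layout, the provider keys, and `resolve_bands` are assumptions made for illustration.

```python
from typing import Dict, List

# Abstract band name -> provider asset key (layout is an assumption).
BAND_MAPS: Dict[str, Dict[str, str]] = {
    "sentinel2": {"green": "B03", "red": "B04", "nir": "B08", "swir1": "B11"},
    "landsat8": {"green": "B3", "red": "B4", "nir": "B5", "swir1": "B6"},
}


def resolve_bands(provider: str, abstract_bands: List[str]) -> Dict[str, str]:
    """Translate abstract band names into one provider's asset keys."""
    try:
        band_map = BAND_MAPS[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider}") from None
    missing = [band for band in abstract_bands if band not in band_map]
    if missing:
        raise ValueError(f"{provider} has no mapping for: {missing}")
    return {band: band_map[band] for band in abstract_bands}
```

The compute layer would only ever pass names like `nir` and `swir1`; everything provider-specific stays inside the table.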

3. A Single, Data-Driven Compute Engine

The most significant leap was replacing fragmented, index-specific loaders with a unified generic compute engine. Its responsibilities are strictly bounded:

Resolve required bands.
Load satellite imagery.
Apply the requested formula.
Apply cloud masking.
Return the resulting raster.

Notice what is missing: there are no if index == NDVI conditional branches. There are no provider-specific calculations. By shifting to a data-driven model, a single abstraction replaced an expanding collection of nearly identical scripts.
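The five responsibilities can be sketched in a few lines. This is an illustration in plain Python rather than the platform's engine: rasters are flattened lists, `FORMULAS` is a hypothetical stand-in for the registry, and the band loader is injected so that no provider logic appears in the function.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Raster = List[float]  # flattened pixel values for one band
BandLoader = Callable[[Sequence[str]], Dict[str, Raster]]

# Hypothetical formula table: index name -> (band_a, band_b) for
# (a - b) / (a + b). A real registry would likely hold richer entries.
FORMULAS: Dict[str, Tuple[str, str]] = {
    "NDVI": ("nir", "red"),
    "NDMI": ("nir", "swir1"),
}


def compute_index(
    index_type: str,
    load_bands: BandLoader,
    cloud_mask: Sequence[bool],
) -> List[Optional[float]]:
    """Run the five steps; cloudy or undefined pixels come back as None."""
    band_a, band_b = FORMULAS[index_type]           # 1. resolve bands
    bands = load_bands([band_a, band_b])            # 2. load imagery
    result: List[Optional[float]] = []
    for a, b, cloudy in zip(bands[band_a], bands[band_b], cloud_mask):
        total = a + b
        value = (a - b) / total if total else None  # 3. apply formula
        result.append(None if cloudy else value)    # 4. cloud masking
    return result                                   # 5. return raster
```

The only branch in the function handles a zero denominator; nothing depends on which index or which satellite is involved.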

Beyond Features: Hardening the Production Platform

As a backend engineer, I've learned that users rarely notice the work that matters most. Alongside the NDMI refactor, standardizing the platform layer allowed for crucial operational and observability improvements:

Dependency Management: Streamlining dependency security updates using the ultra-fast uv package manager.
System Observability: Enhancing monitoring and stack trace sanitization across the Django backend using Prometheus, Grafana, and Loki.
Infrastructure Reliability: Remediating secret scanning vulnerabilities and improving email reliability for scheduled jobs.

Production engineering isn't just about shipping AgTech features; it’s about reducing operational risk and ensuring high availability when executing failovers.

What NDMI (and a Generic Engine) Actually Enables

Technology is only valuable if it drives better decisions. Within this Farm Intelligence Platform, integrating NDMI and a robust spectral engine supports:

Early Moisture Stress Detection: Crucial for proactive crop management.
Precision Irrigation Scheduling: Optimizing water usage on large-scale farms.
Seasonal Drought Monitoring: Providing macro-level environmental insights.
Automated Workflows: Triggering downstream automation via Celery pipelines, with Redis Sentinel keeping the broker highly available.
Farmer Advisories: Translating raster data into multilingual text-to-speech alerts.

The spectral engine produces the raw information; the platform's architecture ensures that information reliably becomes an actionable recommendation.

Lessons Learned in Platform Engineering

Looking back, the most valuable outcome wasn't adding NDMI. It was recognizing that the architecture needed to evolve before the feature was integrated.

Build for Stability: Design abstractions around stable concepts, not immediate feature requests.
Isolate Science from Logic: Scientific formulas belong in data registries, not business logic.
Use Interfaces: Provider-specific API behaviors should remain hidden behind strict interfaces.
Refactor First: Cleaning up the architecture before scaling is always cheaper than untangling technical debt later.

What’s Next for the Platform

The next evolutionary step is automating satellite acquisition scheduling using Celery Beat, Redis Streams, and event-driven ingestion. Because the spectral engine is now entirely generic, these CI/CD-validated workflows don't require separate logic for NDVI, NDWI, or NDMI. They simply receive an index_type and execute.

Adding NDMI started as a standard feature request but finished as a comprehensive architectural redesign. The biggest improvements in production systems often don't come from adding new capabilities—they come from removing old assumptions.

From Feature Delivery to Platform Engineering.

Rahim Ranxx — Mon, 22 Jun 2026 15:35:33 +0000

The Problem: Feature Velocity Was Creating Structural Debt

The system originally started as a simple feature delivery backend:

A Django API powering agricultural insights
Celery workers handling asynchronous processing
Independent endpoints for each new capability
A growing set of Earth Observation computations (NDVI, NDWI, etc.)

At first, it worked.

But as more features were added, a pattern emerged:

Each feature introduced its own pipeline logic
Observability was inconsistent across services
API contracts drifted between frontend and backend
Debugging required tracing multiple disconnected systems

We weren’t scaling functionality.

We were scaling fragmentation.

The Turning Point: Features vs Platforms

The key realization was simple:

Features solve user problems. Platforms solve system problems.

We were repeatedly rebuilding:

Authentication flows
Data ingestion logic
Processing pipelines
API validation layers
Monitoring hooks

Each feature was solving its own version of these concerns.

That is where platform engineering became necessary.

The Shift: Introducing a Platform Layer

We introduced a platform layer between feature delivery and infrastructure.

Instead of building isolated pipelines, we standardized:

1. Unified API Surface

All Earth Observation workflows (NDVI, NDWI, and future indices) were normalized into a consistent API contract.

Shared request/response structure
Versioned endpoints
Schema validation through serializers
Central routing logic

This eliminated endpoint fragmentation.

2. Standardized Processing Pipeline

Celery tasks were refactored into a reusable pipeline pattern:

Ingestion
Validation
Computation
Storage
Publishing

Instead of feature-specific workers, we moved toward composable tasks.

This allowed new indices or processing logic to plug into the same execution flow.

3. Observability as a First-Class Layer

One of the biggest failures in the original system was visibility.

We introduced:

Structured logging across all services
Traceable job IDs across Celery tasks
Consistent error schemas
Centralized failure reporting

Now every pipeline run could be traced end-to-end without guessing where it failed.

4. Contract-Driven Development

We enforced strict API contracts:

Schema validation at the edge
Typed serializers in Django
Explicit error responses
Versioned API evolution instead of silent changes

This reduced frontend/backend drift significantly.

5. CI/CD Guardrails for System Integrity

To prevent regression as the system grew:

Linting enforced consistency (Ruff, MyPy, Bandit)
Task registry validation ensured no orphaned Celery tasks
API schema checks prevented breaking changes
Automated tests verified pipeline execution paths

The goal was simple:

If the system breaks, it should fail in CI—not in production.

Earth Observation as a Stress Test

NDVI and NDWI pipelines became more than features—they became a stress test for architecture.

Why?

Because they exposed:

Heavy computation workflows
Large data dependencies
External geospatial inputs
Long-running async tasks
Multiple transformation stages

If the platform could handle these reliably, it could handle anything we built on top of it.

What Changed After the Shift

After moving to a platform-first architecture:

Before

Each feature = new pipeline
Debugging = distributed guesswork
API behavior = inconsistent
Observability = partial

After

Features plug into existing pipelines
Debugging = traceable execution graph
API behavior = predictable contracts
Observability = end-to-end visibility

The biggest win wasn’t performance.

It was predictability.

Key Lessons

1. Feature velocity without platform thinking creates hidden fragility

You don’t see the cost immediately—but it compounds fast.

2. Earth Observation pipelines are excellent architecture stress tests

They force you to confront real-world distributed system problems early.

3. Standardization beats optimization at early scaling stages

Before optimizing performance, unify structure.

4. Observability is not optional infrastructure

If you can’t trace a request end-to-end, you don’t have a production system—you have a collection of services.

5. Platform engineering is a mindset shift, not a rewrite

Most of the improvements came from structure, not new technology.

Closing Thought

The transition from feature delivery to platform engineering is not about scale alone.

It’s about control.

Control over how systems evolve, how they fail, and how quickly they recover.

Once that layer exists, feature development becomes what it should have been from the beginning:

Safe, composable, and predictable.

From Feature Delivery to Platform Engineering: Scaling Earth Observation Pipelines.

Rahim Ranxx — Sun, 21 Jun 2026 06:08:49 +0000

From Feature Delivery to Platform Engineering

Most engineering articles focus on building a new feature.

The reality of production systems is different.

Adding a feature is often the easiest part.

The difficult part is preserving compatibility across asynchronous workloads, external integrations, observability pipelines, CI gates, OpenAPI contracts, and years of accumulated assumptions.

This week, I wasn't simply implementing NDWI.

I was evolving a farm intelligence platform that combines Earth Observation, distributed task execution, observability, and a Nextcloud-based user experience.

The goal sounded straightforward:

Bring NDWI (Normalized Difference Water Index) to feature parity with NDVI.

The actual work touched nearly every layer of the stack.

The Problem: Feature Duplication Becomes Technical Debt

Our existing NDVI implementation already supported:

Multiple processing backends,
Celery orchestration,
Prometheus metrics,
OpenAPI exposure,
Nextcloud integration,
Dashboarding,
Automated tests.

The temptation was obvious:

Copy the NDVI implementation.

Rename everything.

Ship.

That approach works exactly once.

Every duplicated branch becomes future maintenance debt.

Every additional spectral index doubles the operational surface area.

I wanted NDWI to become the second index without making the third index exponentially harder.

Designing for the Next Spectral Index

The first step was eliminating branching logic.

The original engine dispatch evolved toward multiple conditional paths:

if index == "NDVI":
    ...
elif index == "NDWI":
    ...

That pattern does not scale.

Instead, dispatch moved to factory lookup.

factory_key = (
    engine
    if index == "NDVI"
    else f"ndwi_{engine}"
)

factory = ENGINE_FACTORIES[factory_key]

Five NDWI engine factories were introduced:

ndwi_gee
ndwi_sentinelhub
ndwi_stac
ndwi_landsat
ndwi_modis

The result wasn't just NDWI support.

It transformed the platform into one capable of supporting future indices through convention rather than branching.

Adding another index stopped being an architectural event.
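The naming convention can be made explicit with a small helper. This is a sketch under assumptions: `factory_key` and `build_engine` are hypothetical names, and the string-returning lambdas stand in for real engine constructors. It keeps the bare legacy key for NDVI, as in the snippet above, and derives every other key from the index name.

```python
from typing import Callable, Dict

EngineFactory = Callable[[], str]

# Hypothetical registry; the lambdas stand in for real engine constructors.
ENGINE_FACTORIES: Dict[str, EngineFactory] = {
    "stac": lambda: "NDVI via STAC",
    "landsat": lambda: "NDVI via Landsat",
    "ndwi_stac": lambda: "NDWI via STAC",
    "ndwi_landsat": lambda: "NDWI via Landsat",
}


def factory_key(index: str, engine: str) -> str:
    """NDVI keeps its legacy bare key; other indices use '<index>_<engine>'."""
    return engine if index == "NDVI" else f"{index.lower()}_{engine}"


def build_engine(index: str, engine: str) -> str:
    """Look up the factory by convention and fail loudly if it is missing."""
    key = factory_key(index, engine)
    try:
        return ENGINE_FACTORIES[key]()
    except KeyError:
        raise LookupError(f"no engine factory registered as {key!r}") from None
```

With the key derived from the index name, a third index would only need new registry entries such as `ndmi_stac`; the dispatch code would stay as it is.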

Separate Tasks, Shared Internals

A common anti-pattern in Celery systems is task duplication.

Two almost-identical tasks drift apart over time.

I wanted operational separation without implementation divergence.

Instead of copying logic:

run_ndwi_job(...)

delegates into the existing NDVI execution pipeline.

This produced an interesting balance.

NDWI gained:

Independent retry policies,
Dedicated queue routing,
Separate monitoring visibility,
Future scheduling flexibility.

Without duplicating computation logic.

Operational isolation.

Implementation reuse.

Metrics: Fighting Observability Sprawl

Observability debt accumulates quietly.

Originally, NDWI introduced six additional Prometheus metrics.

That meant:

Duplicate Grafana panels,
Duplicate alert rules,
Duplicate recording rules.

Instead of expanding metrics, we collapsed them.

Before:

ndwi_requests_total
ndwi_duration_seconds
...

After:

spectral_requests_total{index="NDVI"}
spectral_requests_total{index="NDWI"}

The dashboard no longer cared which index generated the signal.

The index became metadata.

The monitoring surface remained stable.

One of the most valuable lessons in observability is this:

Labels scale better than metric names.
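To make the idea concrete without depending on a metrics library, here is a standard-library stand-in for a labelled counter. `LabelledCounter` is a hypothetical class written for this illustration and is not the prometheus_client API; it only shows that a new index adds a label value, not a new metric name.

```python
from collections import Counter
from typing import List, Tuple


class LabelledCounter:
    """One metric name; each distinct set of label values is one series."""

    def __init__(self, name: str, label_names: Tuple[str, ...]) -> None:
        self.name = name
        self.label_names = label_names
        self._series: Counter = Counter()

    def inc(self, **labels: str) -> None:
        """Increment the series identified by the given label values."""
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}")
        key = tuple(labels[name] for name in self.label_names)
        self._series[key] += 1

    def expose(self) -> List[str]:
        """Render series in Prometheus text style, sorted for stability."""
        lines = []
        for key, value in sorted(self._series.items()):
            pairs = ",".join(
                f'{name}="{val}"' for name, val in zip(self.label_names, key)
            )
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return lines
```

A dashboard query written against `spectral_requests_total` would pick up a future `index="NDMI"` series without any panel or alert rule being renamed.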

Testing Against Regression, Not Hope

Feature tests prove something works.

Regression tests prove you didn't destroy what already existed.

A dedicated no-regression suite was introduced.

It validated:

Factory registrations,
Query isolation,
Route resolution,
URL parity,
Metrics importability,
Representation consistency,
Celery routing behavior.

Nineteen tests across seven classes focused entirely on one question:

Did this week's work accidentally break yesterday's guarantees?

Those tests became the contract protecting future contributors from invisible coupling.

The Farm 29 Incident

The most valuable discovery wasn't code.

It was a 403 error.

Farm 29 exposed a hidden assumption.

Weather endpoints succeeded.

NDVI failed.

NDWI failed.

Initially, it looked like an authentication defect.

The investigation revealed something deeper.

Integration JWTs enforced per-farm access:

FarmIntegrationAccess

Weather bypassed this path.

Spectral endpoints enforced it.

The fix required zero Django changes.

The integration simply lacked authorization.

This incident reinforced an important operational principle:

Authentication proves identity.

Authorization determines access.

Confusing the two leads to dangerous conclusions.
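The distinction is easy to see in a small decision function. This is a hypothetical sketch: `check_spectral_access` and the in-memory `FarmGrants` mapping stand in for whatever the per-farm access records actually look like in the platform.

```python
from typing import Dict, Optional, Set

# Hypothetical in-memory stand-in for per-farm grants such as the
# FarmIntegrationAccess records mentioned above.
FarmGrants = Dict[str, Set[int]]


def check_spectral_access(
    integration_id: Optional[str],
    farm_id: int,
    grants: FarmGrants,
) -> int:
    """Return the HTTP status a spectral endpoint would answer with."""
    if integration_id is None:
        return 401  # authentication failed: caller identity unknown
    if farm_id not in grants.get(integration_id, set()):
        return 403  # authenticated, but not authorized for this farm
    return 200
```

In this model the Farm 29 symptom is the middle branch: the token was valid, but no grant existed for that farm, so the fix is a data change rather than a code change.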

Production systems teach humility.

The bug is rarely where you first look.

When 102 Radio Stations Became a Concurrency Problem

Not every challenge involved remote sensing.

A radio subsystem health check had become pathological.

Sequential probing meant:

37 stations processed,
300-second execution time,
consistent Celery timeouts.

The solution wasn't another timeout tweak.

It was concurrency.

ThreadPoolExecutor replaced sequential execution.

Redirect chasing disappeared.

HTTP 3xx and 405 responses became acceptable health signals.

After deployment:

Before:

37/102 stations,
~300 seconds,
frequent failures.

After:

102/102 stations,
~21 seconds,
stable execution.

Sometimes resilience isn't sophisticated.

Sometimes it's simply refusing to serialize independent work.
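A minimal sketch of the concurrent version, using the standard library's `ThreadPoolExecutor`. The `probe` callable is an assumption: it is injected here so the example stays network-free, and in a real check it would issue the HTTP request without following redirects and return the status code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def is_healthy(status: int) -> bool:
    """2xx, 3xx, and 405 all show that the station's server answered."""
    return 200 <= status < 400 or status == 405


def check_stations(
    urls: Iterable[str],
    probe: Callable[[str], int],
    max_workers: int = 16,
) -> Dict[str, bool]:
    """Probe every station concurrently; a probe that raises counts as down."""

    def safe_probe(url: str) -> bool:
        try:
            return is_healthy(probe(url))
        except Exception:
            return False

    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with url_list.
        results = list(pool.map(safe_probe, url_list))
    return dict(zip(url_list, results))
```

Because the probes are I/O-bound and independent, the total run time approaches that of the slowest batch rather than the sum of all probes.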

CI as an Operational Safety Net

One subtle defect triggered a larger improvement.

Three Celery beat task names drifted away from their actual registrations.

Everything appeared healthy.

Until scheduled execution failed.

Instead of fixing the names and moving on, a CI guardrail emerged.

A validation script now verifies:

Beat schedules,
Queue routes,
Shared task registrations.

The lesson was simple:

Every production incident deserves the question:

"How do we ensure this category of failure never happens again?"

Fixes remove symptoms.

Guardrails remove classes of bugs.
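The core of such a guardrail can be a plain set comparison. The sketch below assumes the beat schedule follows Celery's usual shape (entry name mapped to a dict with a `task` key); `find_orphaned_beat_entries` is a hypothetical helper, and how the registered task names are collected is left out.

```python
from typing import Dict, Iterable, List


def find_orphaned_beat_entries(
    beat_schedule: Dict[str, Dict[str, object]],
    registered_tasks: Iterable[str],
) -> List[str]:
    """Return beat entries whose 'task' name matches no registered task."""
    known = set(registered_tasks)
    return sorted(
        entry
        for entry, config in beat_schedule.items()
        if config.get("task") not in known
    )
```

A CI step could call this with the project's schedule and task registry and exit non-zero whenever the returned list is not empty, which turns a silent scheduling failure into a failed build.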

OpenAPI as the Source of Truth

Cross-repository systems drift.

Documentation drifts faster.

The Nextcloud application consumed Django APIs.

Over time, operation identifiers diverged.

The answer wasn't manual synchronization.

The answer was declaring ownership.

Django's schema became authoritative.

The Nextcloud OpenAPI specification synchronized directly from it.

Ninety-six operations were verified.

Fifteen controllers aligned.

The integration contract became explicit.

Contracts reduce assumptions.

Assumptions become outages.
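A drift check between two OpenAPI documents can be sketched as a comparison of their `operationId` sets. This assumes both specs are already loaded as dictionaries; `operation_ids` and `diff_operations` are hypothetical helper names, not part of any OpenAPI tooling.

```python
from typing import Dict, Set

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def operation_ids(spec: Dict) -> Set[str]:
    """Collect every operationId from an OpenAPI document's paths."""
    ids = set()
    for path_item in spec.get("paths", {}).values():
        for method, operation in path_item.items():
            # Skip non-operation keys such as path-level "parameters".
            if method in HTTP_METHODS and "operationId" in operation:
                ids.add(operation["operationId"])
    return ids


def diff_operations(source: Dict, consumer: Dict) -> Dict[str, Set[str]]:
    """Compare a consumer spec against the authoritative one."""
    src, dst = operation_ids(source), operation_ids(consumer)
    return {"missing": src - dst, "unknown": dst - src}
```

With the Django schema passed as `source`, an empty result in both sets would mean the consumer's contract matches the authoritative one.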

What This Week Actually Produced

On paper:

93 Django files changed,
~2,469 lines added,
15 commits,
96 operations verified,
102 radio stations covered,
29 Celery tasks validated.

But the numbers tell only part of the story.

The real outcome was different.

The platform became:

Easier to extend,
Easier to observe,
Harder to accidentally break,
More explicit in its contracts,
More resilient under operational stress.

That is the difference between feature development and platform engineering.

The code shipped this week wasn't just NDWI.

It was institutional knowledge encoded into software.

And that compounds over time.

Shipping 12,000+ Lines Across 6 Systems in 19 Days: A Masterclass in Backend Architecture.

Rahim Ranxx — Sat, 13 Jun 2026 08:09:16 +0000

What looked like a chaotic sprint was actually a strict exercise in architectural discipline.

The last time I published on Dev.to was in late May.

At the time, I had just finished documenting how I separated media responsibilities between Django and Nextcloud. I expected the next few weeks to be incremental: fix a few bugs, close a few tickets, and improve observability.

Instead, I disappeared into the codebase.

Nineteen days later, I resurfaced, and Git had a story to tell:

111 files changed
~12,800 lines added
339 tests written
45 endpoints shipped
21 Celery tasks introduced

On paper, it looked absurd.

In reality, it taught me one of the most important backend engineering lessons of my career:

Velocity isn't about typing faster. It's about making architectural decisions that allow future work to compound instead of collide.

Here's how I survived the sprint without breaking production.

The Scaling Trap: When Features Become Surgery

Most software slows down as it grows.

Every new requirement forces developers to revisit existing code paths. A seemingly small feature request quickly expands into:

Updating database models,
Modifying serializers,
Adjusting views and business logic,
Fixing broken tests,
Introducing unexpected regressions.

Eventually, every feature feels like open-heart surgery.

I wanted the opposite.

I wanted a platform where adding functionality felt like plugging another module into a well-designed machine.

Over nineteen days, that philosophy was tested repeatedly.

1. Build Confidence Before Features

The first thing I shipped wasn't visible to users.

It was infrastructure.

Before introducing new systems, I:

Migrated the project to Python 3.12,
Replaced traditional dependency management with uv,
Containerized the CI pipeline.

Every pull request now executes:

Ruff,
MyPy,
Bandit,
Pytest,

inside reproducible Docker environments.

The payoff wasn't glamorous.

Nobody celebrates faster dependency installation.

But confidence compounds.

When your tests are trustworthy and your environments are deterministic, you move differently. You stop negotiating with the fear of breaking things.

You ship.

2. Scheduling Is a Distributed Systems Problem

One of the major features I introduced was farm activity scheduling.

At first glance, it sounded trivial:

"Let users schedule irrigation."

Then production reality arrived.

Questions started appearing:

What happens when schedules recur?
How do you prevent duplicate executions?
How do you acknowledge completed tasks?
How do retries behave after failures?
What happens if the scheduler crashes and restarts?

A simple checkbox had quietly evolved into a distributed systems problem.

The final implementation relied on:

Cron-based recurrence,
Celery orchestration backed by Redis,
WebSocket notifications,
Strict acknowledgement workflows.

The surprising part?

Users only see a push notification.

Good engineering hides complexity.

It doesn't showcase it.

3. Resilience Beats Perfection

I also introduced text-to-speech alerts using multiple synthesis engines.

Initially, I made a common assumption:

If the preferred neural engine fails, the alert fails.

Then I asked a better question.

What matters more?

Perfect audio quality?

Or ensuring critical alerts reach users?

That changed everything.

Instead of relying on a single engine, I implemented strategies.

Then I wrapped those strategies in circuit breakers.

If the primary engine crashes:

The circuit breaker trips,
The fallback engine takes over,
Users still receive alerts.

The experience degrades gracefully.

The system survives.

That single decision eliminated an entire class of outages.
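The fallback behaviour can be sketched with a small consecutive-failure breaker. This is an illustrative implementation, not the platform's: `CircuitBreaker`, `synthesize`, the thresholds, and the engine callables are all assumptions.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after `max_failures` consecutive errors; retry after `reset_after` s."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allows_call(self) -> bool:
        if self._opened_at is None:
            return True
        # Half-open: permit a trial call once the cool-down has elapsed.
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()


def synthesize(text: str, primary: Callable[[str], bytes],
               fallback: Callable[[str], bytes], breaker: CircuitBreaker) -> bytes:
    """Try the primary engine unless its breaker is open; else fall back."""
    if breaker.allows_call():
        try:
            audio = primary(text)
        except Exception:
            breaker.record_failure()
        else:
            breaker.record_success()
            return audio
    return fallback(text)
```

Once the breaker is open, the failing engine is not even attempted until the cool-down passes, so a crashed primary costs no extra latency on each alert.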

4. Production Engineering Means Expecting Broken Systems

The hardest problem of the sprint wasn't satellite imagery.

It wasn't Celery.

It wasn't WebSockets.

It was internet radio metadata.

Specifically, ICY metadata.

The specification is decades old, and stations interpret it creatively.

Some use UTF-8.

Others use Latin-1.

Some omit fields entirely.

Some violate their own metadata intervals.

The parser itself was tiny.

The resilience around it became enormous.

It reinforced a lesson I won't forget:

Production backend engineering isn't writing code for systems behaving correctly.

It's writing code for systems behaving incorrectly.
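A defensive parser for one metadata block might look like the sketch below. It relies on the usual ICY framing (a length byte counting 16-byte units, followed by NUL-padded text such as `StreamTitle='...';`); the function names and the UTF-8-then-Latin-1 fallback order are assumptions, not the platform's parser.

```python
import re
from typing import Dict

_FIELD = re.compile(r"(\w+)='(.*?)';")


def decode_icy_text(raw: bytes) -> str:
    """Prefer UTF-8; fall back to Latin-1, which accepts any byte sequence."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def parse_icy_block(block: bytes) -> Dict[str, str]:
    """Parse one metadata block: a length byte (units of 16) plus payload."""
    if not block:
        return {}
    declared = block[0] * 16
    # Tolerate stations that send fewer bytes than they declared.
    payload = block[1:1 + declared].rstrip(b"\x00")
    text = decode_icy_text(payload)
    return {key: value for key, value in _FIELD.findall(text)}
```

Missing fields simply do not appear in the result, so callers can treat every key as optional instead of failing on an incomplete block.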

5. The Abstraction Decision That Saved Weeks

Toward the end of the sprint, I needed to implement NDWI (Normalized Difference Water Index).

I already had a mature NDVI pipeline.

I had two options.

Option One: Duplicate Everything

Create:

New models,
New services,
New Celery tasks,
New metrics,
New providers.

It would work.

It would also create long-term maintenance debt.

Option Two: Generalize Selectively

Reuse what already worked.

Separate only what genuinely differed.

I chose the second approach.

The result was a hybrid architecture.

Shared Infrastructure

STAC clients,
Service layers,
Celery workflows,
Database infrastructure,
Metrics.

Specialized Logic

Quality thresholds,
Fusion rules,
Farm-state classification,
Visual representations.

I had budgeted nearly a month for the work.

It shipped in less than four days.

That experience fundamentally changed how I think about abstraction.

Bad abstractions slow teams down.

Good abstractions create leverage.

6. Observability Is Architecture

One unexpected lesson involved monitoring.

Initially, I considered separate Prometheus metrics:

ndvi_observations_total
ndwi_observations_total

Then I stopped.

Why duplicate the concept?

Instead, I moved to labels:

spectral_index_observations_total{
    index_type="NDVI"
}

spectral_index_observations_total{
    index_type="NDWI"
}

The immediate benefit was cleaner Grafana dashboards.

The long-term benefit was strategic.

When future indices arrive—NDMI, EVI, SAVI—the infrastructure remains untouched.

Only the labels evolve.

Observability stopped being monitoring.

It became architecture.

The Real Output Was Optionality

Yes, the feature list was substantial.

During those nineteen days, I shipped:

Podcast ingestion,
TTS alerting,
Activity scheduling,
Request tracing,
NDVI V2,
Multi-provider STAC integrations,
A complete NDWI pipeline.

But the real output wasn't features.

It was optionality.

The platform is significantly easier to extend today than it was before this sprint began.

That's the metric I care about most.

The Lesson

People often ask how engineers ship quickly.

The answer isn't raw talent.

It isn't caffeine.

It isn't eighty-hour work weeks.

It's this:

Make decisions today that reduce the cost of tomorrow's decisions.

Protocols instead of conditionals.

Labels instead of duplication.

Circuit breakers instead of assumptions.

Strict service boundaries instead of monolithic entanglement.

Every one of those choices feels slower in the moment.

Until one day, you look up and realize you've delivered six major systems in nineteen days without rewriting half your codebase.

What's Next?

The architecture has proven it can support multiple spectral indices without collapsing under duplication.

The obvious next candidates are:

NDMI for vegetation moisture,
EVI for dense canopy analysis,
More sophisticated agronomic decision engines.

But the interesting question is no longer:

Can the platform support them?

The interesting question has become:

What happens when agronomic intelligence becomes just another interchangeable engine?

And honestly?

That's the problem I'm most excited to solve next.

The biggest takeaway from this sprint wasn't that I shipped 12,000 lines of code.

It was realizing that good architecture doesn't slow you down.

It gives you the confidence to move faster than you thought possible.

Decoupled Media Streams: A Django and Nextcloud Radio Architecture

Rahim Ranxx — Mon, 25 May 2026 12:41:31 +0000

I recently added a radio integration to a platform built around Django REST Framework (DRF) and Nextcloud.

The existing architecture was already doing a lot of heavy lifting, powering authentication, farm management, NDVI processing pipelines, weather data ingestion, API key orchestration, and Nextcloud application integrations.

The new requirement was to introduce internet radio support seamlessly inside the Nextcloud ecosystem. However, there was a strict architectural constraint: we needed to do this without turning Django into a media relay.

That single distinction shaped the entire implementation strategy.

The Core Challenge: Avoiding the Proxy Trap
Instead of proxying heavy audio streams through the backend, the architecture relies on direct playback. Django is strictly responsible for exposing radio metadata and playback endpoints (routed under /api/v1/radio/). Meanwhile, the Nextcloud clients stream the audio directly from the source providers, such as BBC, SomaFM, and TuneIn.

The result is a much cleaner separation of responsibilities:

Nextcloud UI / Web Client: The presentation layer.

Django + DRF API: Radio metadata and stream information logic.

Radio Providers: Media transport, streamed directly to the client.

Architectural Separation in Action
The following diagram illustrates exactly how we achieved this decoupling. The critical path is that thick dark orange arrow (3), showing the media stream bypassing the Django API server entirely.

(Diagram: Metadata requests [blue] are routed through Django, while heavy media streams [orange] flow directly from providers like BBC/SomaFM to the Nextcloud user.)

This separation keeps the backend highly performant and lightweight, while allowing the Nextcloud frontend to integrate radio discovery naturally alongside the rest of the platform's services.

How Nextcloud Fits Into the Architecture
The radio integration was explicitly designed to plug into a broader, Nextcloud-driven ecosystem rather than operating as an isolated, standalone media application. By defining strict boundaries, each system handles what it does best.

Nextcloud provides:

The frontend user experience

Authenticated user workflows

App integration surfaces and dashboard presentation

Native media interaction capabilities

Django provides:

API orchestration and provider abstraction

Station metadata and stream discovery

Data normalization logic

Backend consistency

This clear separation creates a strong boundary between backend platform orchestration and frontend client experience. Instead of embedding complex streaming logic directly into Nextcloud—or forcing Django to waste resources proxying media—the architecture keeps each layer focused entirely on its primary responsibility.

Built for Future Expansion
Because the backend already behaves like a pure metadata platform rather than a streaming server, the architecture leaves massive room for future expansion.

Without needing to redesign the streaming layer itself, this setup easily supports adding:

Personalized stations and user favorites

Listening history tracking

Podcast aggregation

Recommendation systems

Analytics pipelines

Multi-provider federation

By treating media transport and metadata orchestration as two distinct problems, the integration remains scalable, fast, and ready for whatever features the platform requires next.

Debugging a Cross-Language HMAC Signature Failure Between Nextcloud and Django

Rahim Ranxx — Sat, 16 May 2026 14:47:34 +0000

Introduction

A few days ago, I hit a frustrating issue while integrating a custom Nextcloud application with a Django REST Framework backend.

Everything looked correct:

shared HMAC secret ✔️
canonical request string ✔️
HMAC-SHA256 ✔️
timestamps synchronized ✔️

Yet every authenticated request failed with:

invalid nextcloud signature

The interesting part?

Both implementations were technically correct.

The failure came from something much smaller — and much more dangerous in distributed systems:

Different string encodings of the exact same HMAC digest.

This article walks through the full debugging process, the root cause, and the engineering lessons learned from debugging cryptographic interoperability between PHP and Python services.

System Architecture

The integration architecture looked like this:

┌──────────────────────┐
│  Nextcloud App (PHP) │
│  Generates HMAC      │
└──────────┬───────────┘
           │
           │ Signed HTTP Request
           ▼
┌──────────────────────┐
│ Django DRF Backend   │
│ Verifies Signature   │
└──────────────────────┘

The request flow:

Nextcloud generates a canonical request string
PHP computes an HMAC-SHA256 signature
Signature is attached to request headers
Django reconstructs the canonical string
Django recomputes the HMAC
Signatures are compared

Simple in theory.

Except it kept failing.

Initial Symptoms

The backend logs showed repeated authorization failures:

nextcloud_hmac.denied
code=invalid_signature

Even more confusing:

the integration had worked before
secrets matched
clocks matched
payloads matched

At first glance, it looked like a replay issue, timestamp skew problem, or cache corruption.

It turned out to be none of those.

The Root Cause

The issue came from a mismatch in how the HMAC digest was encoded.

Nextcloud (PHP)

The PHP client generated the signature like this:

base64_encode(
    hash_hmac('sha256', $canonical, $secret, true)
);

Notice the important detail:

true

Passing true as the fourth argument makes hash_hmac return the raw digest bytes instead of its default lowercase hexadecimal string.

Those bytes were then encoded as Base64.

Django (Python)

Meanwhile, Django verified signatures like this:

hmac.new(
    secret,
    canonical.encode(),
    hashlib.sha256,
).hexdigest()

hexdigest() returns a hexadecimal string representation.

So both systems produced:

the same HMAC bytes
using the same algorithm
using the same secret

But converted those bytes into different string formats.

The Hidden Interoperability Bug

This was the breakthrough moment.

The exact same digest bytes produced:

Hex:
44c39c4ecc7268547ca51db72c6f27125251e6ea8ce3c659d918a9542522b612

Base64:
RMOcTsxyaFR8pR23LG8nElJR5uqM48ZZ2RipVCUithI=

Both values represent the same underlying bytes.

But string comparison obviously fails.
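The claim that both strings carry the same digest can be checked directly. This is a small sketch using only the Python standard library; the variable names are mine, and the two values are the ones shown above.

```python
import base64
import binascii

hex_sig = "44c39c4ecc7268547ca51db72c6f27125251e6ea8ce3c659d918a9542522b612"
b64_sig = "RMOcTsxyaFR8pR23LG8nElJR5uqM48ZZ2RipVCUithI="

# Decode each textual form back to the raw digest bytes.
raw_from_hex = binascii.unhexlify(hex_sig)
raw_from_b64 = base64.b64decode(b64_sig)

same_bytes = raw_from_hex == raw_from_b64  # the underlying digest matches
same_text = hex_sig == b64_sig             # the two strings never will

print(len(raw_from_hex), same_bytes, same_text)
```

Decoding both sides to bytes before comparing is one way to make a verifier tolerant of either encoding, although agreeing on a single encoding in the protocol is simpler.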

The Second Bug

While investigating, I found another subtle issue.

The Django verifier lowercased the incoming signature before comparison:

signature = signature.lower()

That may appear harmless for hexadecimal values.

But Base64 is case-sensitive.

Meaning:

ABC != abc

So even after fixing the encoding mismatch, lowercasing would still break verification.

This was a protocol normalization bug hiding inside the verification pipeline.

The Fix

I updated Django to verify signatures using Base64 instead of hexadecimal.

New Verification Function

import base64
import hashlib
import hmac


def compute_hmac_signature_b64(
    *,
    secret: bytes,
    canonical_string: str,
) -> str:
    """Compute Base64 encoded HMAC-SHA256 signature."""

    digest = hmac.new(
        secret,
        canonical_string.encode("utf-8"),
        hashlib.sha256,
    ).digest()

    return base64.b64encode(digest).decode()

Then all verification calls were updated to use:

compute_hmac_signature_b64()

instead of:

.hexdigest()

Finally, I removed:

.lower()

from the verification flow.
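The snippets above cover how the signature is computed, but not how it is compared. A minimal sketch of the comparison step follows, assuming the signature arrives as a Base64 string. `verify_signature_b64`, the demo secret, and the canonical string layout are hypothetical and not taken from the project. It relies on `hmac.compare_digest` from the standard library for a constant-time comparison and leaves the case of the incoming value untouched.

```python
import base64
import hashlib
import hmac


def verify_signature_b64(*, secret: bytes, canonical_string: str, received: str) -> bool:
    """Return True when `received` equals the expected Base64 HMAC-SHA256."""
    digest = hmac.new(secret, canonical_string.encode("utf-8"), hashlib.sha256).digest()
    expected = base64.b64encode(digest)
    try:
        received_bytes = received.encode("ascii")
    except UnicodeEncodeError:
        return False  # Base64 is pure ASCII, anything else cannot match
    # Constant-time comparison; no .lower(), no other normalization.
    return hmac.compare_digest(expected, received_bytes)


# Hypothetical values, for illustration only.
secret = b"demo-secret"
canonical = "GET\n/api/v1/integrations/nextcloud/ping/\n1778841776"
mac = hmac.new(secret, canonical.encode("utf-8"), hashlib.sha256)

b64_accepted = verify_signature_b64(
    secret=secret,
    canonical_string=canonical,
    received=base64.b64encode(mac.digest()).decode("ascii"),
)
hex_rejected = not verify_signature_b64(
    secret=secret,
    canonical_string=canonical,
    received=mac.hexdigest(),
)
```

A hex-encoded signature is rejected here even though it was computed from the same bytes, which is exactly the failure described earlier.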

Verification Results

After deploying the fix:

Ping Endpoint

GET /api/v1/integrations/nextcloud/ping/

200 OK

Token Issuance

POST /api/v1/integrations/token/

200 OK

Authentication immediately started working again.

Secondary Investigation Findings

While debugging, I validated several other production concerns.

1. Time Drift

I suspected clock skew initially.

Both services were checked:

Nextcloud epoch: 1778841776
Django epoch:    1778841776
Drift:            0 seconds

Time synchronization was perfect.

2. Shared Secrets

Client IDs and secrets matched correctly across both systems.

This eliminated:

environment mismatch
stale secrets
config drift

3. Redis and Cache State

I flushed:

Redis
Django cache
integration token caches

This helped eliminate stale token artifacts and replay-state inconsistencies.

4. Infrastructure Validation

I also verified:

loopback networking
gunicorn binding
uvicorn workers
allowlists
HTTP dev mode configuration

At this point the investigation became less about cryptography and more about systematic elimination of variables.

Why It “Worked Before”

This was the most interesting systems question.

I had not changed the signing logic recently.

So why did the failure suddenly appear?

The likely answer is:

Infrastructure state had been masking a latent protocol incompatibility.

Possible contributors:

cached tokens
stale replay windows
inactive code paths
existing sessions bypassing verification
Redis persistence behavior

This is an important engineering lesson:

A system can contain dormant interoperability bugs for weeks before infrastructure conditions expose them.

Engineering Lessons Learned

1. Cryptographic Bytes ≠ String Representation

HMAC output is binary data.

Hexadecimal and Base64 are merely different textual encodings of the same bytes.

They are not interchangeable.

2. Cross-Language Integrations Need Explicit Contracts

Never assume:

encoding format
canonicalization rules
normalization behavior

Define them explicitly.

Especially across:

PHP
Python
Go
Node.js
Java

3. Normalization Can Break Security

Lowercasing signatures looked harmless.

It was not.

Cryptographic values should only be normalized if the protocol explicitly defines normalization behavior.

4. Infrastructure State Can Hide Bugs

Cache layers and token persistence can temporarily conceal protocol inconsistencies.

Sometimes:

restarts
cache flushes
clock resets

suddenly expose issues that already existed.

5. Production Debugging Requires Elimination Discipline

The investigation involved validating:

clocks
secrets
caches
workers
networking
encoding
replay protection
request canonicalization

Good debugging is often less about guessing and more about systematically removing uncertainty.

Final Thoughts

The most dangerous bugs are not always algorithm failures.

Sometimes:

the crypto is correct
the infrastructure is healthy
the logic is valid

…but the protocol contract between systems is inconsistent.

In this case:

The cryptography was correct on both sides. The protocol contract was not.

And that single mismatch was enough to break the entire authentication flow.

Why I Added Redis Streams Between My Django API and Celery Workers.

Rahim Ranxx — Sun, 03 May 2026 07:01:10 +0000

A practical engineering breakdown of how I introduced Redis Streams into a live Django + Celery NDVI pipeline without rewriting the worker layer.

Introduction

I run a Django API backed by Celery workers for NDVI processing workloads.

The execution layer worked fine.

The queue semantics didn’t.

I needed:

durable ingestion
replay visibility
dead-letter handling
stale consumer recovery
rollback safety
observability during incidents

…but I did not want to rewrite the worker system or destabilize production.

So instead of replacing Celery, I inserted Redis Streams between the API and the workers.

This article explains why I made that decision, how the architecture works, and what I learned while implementing reliable stream-backed NDVI ingestion in Django.

The Original Problem

The problem was not task execution.

The problem was everything before execution.

Originally, NDVI ingestion looked like this:

Django API → Celery Broker → Celery Workers

At first, this worked well.

But as the system evolved, operational gaps became more obvious:

Direct .delay() calls tightly coupled request ingestion to broker behavior.
Queue visibility was limited during incidents.
Failed ingestion paths were harder to replay safely.
In-flight recovery semantics were weak.
There was no dead-letter workflow for poisoned messages.
Worker interruptions could leave messages in uncertain states.

The architecture was fast.

It was not durable enough.

Why I Did Not Replace Celery

One of the biggest architectural decisions was choosing not to replace Celery.

That decision reduced risk dramatically.

Celery already handled:

worker orchestration
task retries
execution concurrency
scheduling
routing
operational familiarity

Replacing the worker layer would have increased migration complexity and expanded the failure domain.

Instead, I treated Redis Streams as an ingestion and reliability layer.

The resulting architecture looked like this:

Django API
    ↓
Dispatch Boundary
    ↓
Redis Streams (XADD)
    ↓
Consumer Group (XREADGROUP)
    ↓
Celery Queue
    ↓
NDVI Workers

Failures route into a dead-letter stream.

Stale consumers are recovered through reclaim logic.

Most importantly, rollback remains simple.

Centralizing Dispatch Before Adding Redis Streams

Before introducing Redis Streams, I centralized every NDVI enqueue path.

This was the most important migration step.

Instead of scattering direct .delay() calls across the codebase, everything flowed through dispatch helpers.

from ndvi.dispatch import dispatch_ndvi_job

job = enqueue_job(...)
dispatch_ndvi_job(job)

That allowed one configuration flag to control the ingestion backend.

NDVI_QUEUE_BACKEND = env("NDVI_QUEUE_BACKEND", default="celery")

Supported modes:

celery
stream

This created a clean migration boundary.

The system could switch between direct Celery dispatch and Redis Streams without changing every call site.

Operationally, this mattered more than the stream code itself.
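A dispatch boundary of this kind can be sketched in a few lines. This is an illustration under assumptions, not the project's `ndvi.dispatch` module: the two handler functions are in-memory stand-ins for `process_ndvi_job.delay(...)` and `redis_client.xadd(...)`, and the function takes a job ID and an explicit backend argument instead of reading Django settings.

```python
from typing import Callable, Dict, List, Tuple

# Records what each stand-in backend received, so the routing is visible.
sent: List[Tuple[str, int]] = []


def _dispatch_via_celery(job_id: int) -> None:
    sent.append(("celery", job_id))  # stand-in for process_ndvi_job.delay(job_id)


def _dispatch_via_stream(job_id: int) -> None:
    sent.append(("stream", job_id))  # stand-in for redis_client.xadd(...)


_BACKENDS: Dict[str, Callable[[int], None]] = {
    "celery": _dispatch_via_celery,
    "stream": _dispatch_via_stream,
}


def dispatch_ndvi_job(job_id: int, backend: str = "celery") -> None:
    """Route a job to the configured backend; unknown values fail loudly."""
    try:
        handler = _BACKENDS[backend]
    except KeyError:
        raise ValueError(f"Unsupported NDVI_QUEUE_BACKEND: {backend!r}") from None
    handler(job_id)


dispatch_ndvi_job(1, backend="celery")
dispatch_ndvi_job(2, backend="stream")
```

Raising on an unknown backend value is a deliberate choice in this sketch: a typo in the environment variable should stop dispatch instead of silently falling back.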

Publishing NDVI Jobs into Redis Streams

The producer layer publishes deterministic NDVI payloads into a Redis stream.

Example:

payload = {
    "job_id": job.id,
    "request_hash": job.request_hash,
    "farm_id": job.farm_id,
    "engine": job.engine,
    "job_type": job.job_type,
    "enqueue_timestamp": time.time(),
}

redis_client.xadd(
    settings.NDVI_STREAM_NAME,
    payload,
    maxlen=settings.NDVI_STREAM_MAXLEN,
    approximate=True,
)

Key design decisions:

request_hash acts as the idempotency key.
Approximate MAXLEN trimming on XADD keeps memory bounded.
Stream payloads remain deterministic.
Producers do not execute business logic.

The stream became the ingestion ledger.

Redis Streams Consumer Design

The consumer reads from Redis Streams and forwards work into Celery.

Example:

messages = redis_client.xreadgroup(
    groupname=settings.NDVI_STREAM_GROUP,
    consumername=consumer_name,
    streams={settings.NDVI_STREAM_NAME: ">"},
    count=settings.NDVI_STREAM_BATCH_SIZE,
    block=settings.NDVI_STREAM_BLOCK_MS,
)

For every message:

Deserialize payload
Validate structure
Apply idempotency safeguards
Enqueue Celery task
Acknowledge stream entry

process_ndvi_job.delay(job_id)

redis_client.xack(
    settings.NDVI_STREAM_NAME,
    settings.NDVI_STREAM_GROUP,
    message_id,
)

The stream consumer remains intentionally thin.

Its job is reliable transport and recovery.

Celery still handles execution.
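The per-message steps can be sketched as a single function. This is a simplified illustration, not the project's consumer: the callables stand in for the Celery enqueue, `XACK`, and the dead-letter write, the required field names are assumed, and the in-memory `seen_hashes` set stands in for the real idempotency safeguards.

```python
from typing import Callable, Dict, List, Set, Tuple

REQUIRED_FIELDS = ("job_id", "request_hash")  # assumed minimal schema


def handle_entry(
    message_id: str,
    fields: Dict[str, str],
    *,
    enqueue: Callable[[str], None],
    ack: Callable[[str], None],
    dead_letter: Callable[[str, str], None],
    seen_hashes: Set[str],
) -> str:
    """Process one stream entry and return a short outcome label."""
    missing = [name for name in REQUIRED_FIELDS if name not in fields]
    if missing:
        # Structurally invalid: park it, then ack so it is not redelivered.
        dead_letter(message_id, f"missing fields: {missing}")
        ack(message_id)
        return "dead_lettered"

    if fields["request_hash"] in seen_hashes:
        ack(message_id)  # duplicate delivery, nothing to enqueue
        return "duplicate"

    # Enqueue first. If this raises, the entry stays pending and can be reclaimed.
    enqueue(fields["job_id"])
    seen_hashes.add(fields["request_hash"])
    ack(message_id)
    return "enqueued"


# Demo with in-memory stand-ins.
enqueued: List[str] = []
acked: List[str] = []
dlq: List[Tuple[str, str]] = []
seen: Set[str] = set()


def _dead_letter(message_id: str, reason: str) -> None:
    dlq.append((message_id, reason))


entries = [
    ("1-0", {"job_id": "42", "request_hash": "abc"}),
    ("2-0", {"job_id": "42", "request_hash": "abc"}),  # duplicate delivery
    ("3-0", {"job_id": "43"}),                         # missing request_hash
]
outcomes = [
    handle_entry(mid, fields, enqueue=enqueued.append, ack=acked.append,
                 dead_letter=_dead_letter, seen_hashes=seen)
    for mid, fields in entries
]
```

The ordering inside the happy path is the important part: the acknowledgement happens only after the enqueue call has returned.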

Why Consumer Groups Matter

Redis Streams consumer groups solved several operational problems immediately.

They provided:

cooperative work distribution
independent consumer identities
pending-entry tracking
reclaim support
replay visibility

Unlike simple queue semantics, Redis Streams expose message lifecycle state.

That visibility becomes extremely valuable during failures.

Message lifecycle:

XADD → pending → reclaimed → acknowledged
                          ↓
                         DLQ

This made queue recovery observable instead of implicit.

Recovering Stale Messages with XAUTOCLAIM

The most important recovery primitive ended up being XAUTOCLAIM.

If a consumer dies after reading a message but before acknowledging it, the entry remains pending indefinitely unless another consumer reclaims it.

Without reclaim logic, stream durability is incomplete.

Example reclaim loop:

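# XAUTOCLAIM returns the cursor for the next call first, then the claimed
# entries (Redis 7 appends a third element listing deleted IDs).
next_start_id, messages, *_ = redis_client.xautoclaim(
    name=settings.NDVI_STREAM_NAME,
    groupname=settings.NDVI_STREAM_GROUP,
    consumername=consumer_name,
    min_idle_time=settings.NDVI_STREAM_CLAIM_IDLE_MS,
    start_id="0-0",
    count=settings.NDVI_STREAM_BATCH_SIZE,
)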

This allows healthy consumers to recover abandoned work automatically.

That changed the reliability profile of the ingestion pipeline significantly.
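A single XAUTOCLAIM call scans only part of the pending list and returns a cursor for the next call, so a full sweep has to follow that cursor until it comes back as "0-0". The loop below is a sketch of that pagination. `reclaim_all` and `fake_claim_page` are hypothetical names; in a real consumer the callable would wrap `redis_client.xautoclaim(...)`, and the cursor may be bytes instead of str unless the client decodes responses.

```python
from typing import Callable, Dict, List, Tuple

Entry = Tuple[str, Dict[str, str]]
ClaimPage = Callable[[str], Tuple[str, List[Entry]]]


def reclaim_all(claim_page: ClaimPage, max_pages: int = 100) -> List[Entry]:
    """Follow the XAUTOCLAIM cursor until it wraps back to "0-0"."""
    reclaimed: List[Entry] = []
    cursor = "0-0"
    for _ in range(max_pages):  # hard stop so a busy stream cannot loop forever
        cursor, entries = claim_page(cursor)
        reclaimed.extend(entries)
        if cursor == "0-0":
            break
    return reclaimed


# Stand-in for the Redis call: two pages of stale entries.
_pages: Dict[str, Tuple[str, List[Entry]]] = {
    "0-0": ("5-0", [("1-0", {"job_id": "1"}), ("3-0", {"job_id": "3"})]),
    "5-0": ("0-0", [("5-0", {"job_id": "5"})]),
}


def fake_claim_page(start_id: str) -> Tuple[str, List[Entry]]:
    return _pages[start_id]


recovered = reclaim_all(fake_claim_page)
recovered_ids = [message_id for message_id, _ in recovered]
```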

Dead-Letter Queue Handling

I also introduced a dedicated dead-letter stream.

Messages are routed into the DLQ when:

validation fails
delivery ceilings are exceeded
payloads become structurally invalid
repeated execution attempts fail

Example:

redis_client.xadd(
    settings.NDVI_STREAM_DLQ_NAME,
    dlq_payload,
)

Every DLQ entry includes:

original message ID
delivery count
failure reason
serialized payload
timestamps

This made operational debugging dramatically easier.
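A DLQ entry with those fields could be assembled along these lines. The field names, the `MAX_DELIVERIES` ceiling, and both helper functions are assumptions made for the sketch. All values are converted to strings because stream entries are flat field/value maps.

```python
import json
import time
from typing import Dict, Optional

MAX_DELIVERIES = 5  # assumed ceiling, not a value from the rollout


def should_dead_letter(delivery_count: int, max_deliveries: int = MAX_DELIVERIES) -> bool:
    """True once an entry has been delivered more often than the ceiling allows."""
    return delivery_count > max_deliveries


def build_dlq_payload(
    message_id: str,
    fields: Dict[str, str],
    *,
    delivery_count: int,
    reason: str,
    now: Optional[float] = None,
) -> Dict[str, str]:
    """Flatten the failure context into string fields suitable for XADD."""
    return {
        "original_message_id": message_id,
        "delivery_count": str(delivery_count),
        "failure_reason": reason,
        # sort_keys keeps the serialized payload stable across runs
        "payload": json.dumps(fields, sort_keys=True),
        "dead_lettered_at": str(now if now is not None else time.time()),
    }


entry = build_dlq_payload(
    "1714-0",
    {"job_id": "42", "farm_id": "7"},
    delivery_count=6,
    reason="delivery ceiling exceeded",
    now=1778841776.0,
)
```

The resulting dictionary can be passed as the second argument of the `xadd` call shown above.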

The Hardest Problem: Idempotency

Redis Streams provide at-least-once delivery.

That means duplicate delivery is expected.

Exactly-once delivery is not guaranteed.

To prevent duplicate NDVI execution, I added multiple protection layers.

Layer 1: Deterministic Request Hash

Every NDVI job already had a deterministic request_hash.

That became the execution identity.

Layer 2: Distributed Redis Lock

The consumer acquires a Redis lock before execution.

lock_key = f"ndvi:lock:{request_hash}"

Acquisition uses SET with the NX and EX options, so the lock is created only if it does not already exist and expires on its own.

Layer 3: Token-Based Lock Release

Locks are released through an atomic Lua script.

This prevents blind deletion.

if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0

Layer 4: Database Status Recheck

Before execution begins, the worker re-checks terminal job state.

This acts as a second safety boundary.

The result is effectively-once execution semantics.

At-least-once delivery + idempotent execution = effectively-once processing
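The status recheck in Layer 4 is what turns a duplicate delivery into a no-op. A minimal in-memory sketch, assuming status names such as "succeeded" and "failed" and a hypothetical `run_once` helper; in a real database the check and the status update would need to happen atomically, for example inside a transaction with a row lock.

```python
from typing import Callable, Dict, List

TERMINAL_STATES = {"succeeded", "failed"}  # assumed status names


def run_once(job_id: str, statuses: Dict[str, str], execute: Callable[[str], None]) -> bool:
    """Execute a job unless it already reached a terminal state."""
    if statuses.get(job_id) in TERMINAL_STATES:
        return False  # duplicate delivery: skip execution entirely
    statuses[job_id] = "running"
    try:
        execute(job_id)
    except Exception:
        statuses[job_id] = "failed"
        raise
    statuses[job_id] = "succeeded"
    return True


runs: List[str] = []
statuses = {"job-1": "queued"}
first = run_once("job-1", statuses, runs.append)   # executes
second = run_once("job-1", statuses, runs.append)  # same job delivered again
```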

Observability Added During the Rollout

One major lesson from this migration:

Do not enable stream mode before queue visibility exists.

I added dedicated metrics before enabling the rollout broadly.

Examples:

redis_stream_pending_entries
redis_stream_pending_age_max
ndvi_stream_consumer_heartbeat
ndvi_stream_consumer_failures_total

I also expanded upstream visibility:

ndvi_upstream_requests_total
ndvi_upstream_failures_total
ndvi_upstream_duration_seconds

Grafana dashboards now expose:

pending stream backlog
reclaim frequency
DLQ volume
consumer liveness
upstream API failures
queue drain rate

This transformed rollout decisions from guesswork into measurable operations.

Rollback Strategy

Rollback was designed before rollout.

That mattered.

The stream backend is fully feature-flagged:

NDVI_QUEUE_BACKEND = "celery"

NDVI_QUEUE_BACKEND = "stream"

Rollback requires:

environment variable change
process restart

No redeploy.

No task rewrite.

No schema rollback.

This significantly reduced operational fear during rollout.

What Shipped This Week

This week’s rollout included:

a 528-line Redis Streams consumer
reclaim + DLQ lifecycle handling
distributed execution locking
token-safe lock release
approximately 400 lines of stream-focused tests
Prometheus metrics for queue health
Grafana visibility for consumer state and lag
feature-flag rollback support

Most of the work was not adding Redis.

Most of the work was making failure recovery predictable.

Was It Worth It?

Redis Streams did not simplify the system.

They made failure states explicit.

That introduced additional complexity:

reclaim logic
idempotency handling
consumer lifecycle management
DLQ operations
stream observability

But the reliability gains were substantial:

durable ingestion
replay visibility
safer recovery semantics
backlog introspection
controlled rollback
observable queue state

For this NDVI pipeline, the tradeoff was worth it.

Final Thoughts

One of the biggest lessons from this migration is that queue evolution is not just about throughput.

It is about operational recovery.

Redis Streams gave the ingestion layer explicit lifecycle semantics:

pending
acknowledged
reclaimed
dead-lettered

That visibility fundamentally changed how the system behaves during failures.

And importantly, I achieved that without rewriting the worker layer.

Sometimes the best migration strategy is not replacing your stack.

It is inserting a safer boundary in front of it.

Building a Resilient NDVI Pipeline with Redis Streams (Event-Driven Architecture)

Rahim Ranxx — Sun, 26 Apr 2026 09:02:14 +0000

A practical breakdown of moving an NDVI processing pipeline from a synchronous design to an event-driven architecture using Redis Streams — including concurrency challenges, distributed locking pitfalls, and production-safe patterns.

Introduction

Most pipelines work — until concurrency and failure expose their limits.

At first, processing NDVI (Normalized Difference Vegetation Index) data seems straightforward:

receive a request

process imagery

return results

But once you introduce:

concurrent jobs

long-running processing

distributed components

you’re no longer building a simple pipeline.

You’re designing a distributed system.

This article walks through how I transformed an NDVI processing pipeline from a synchronous model into an event-driven architecture using Redis Streams, and the real-world engineering challenges that came with it.

System Overview

The system is built using:

Django REST Framework (backend API)

Nextcloud (client-facing integration layer)

Celery (asynchronous task processing)

Redis Streams (event ingestion and coordination)

The Initial Architecture (Synchronous Design)

Client → API → Celery Task → NDVI Processing → Result

This design works well at small scale, but it introduces hidden risks when the system grows.

The Core Problems

Tight Coupling

The request lifecycle is directly tied to processing.

If processing fails:

the request fails

the user experiences errors

retries become difficult

Concurrency Issues

When multiple requests target the same job:

Request A ─┐
           ├──> Same Job → Duplicate Processing
Request B ─┘

This leads to:

duplicated work

inconsistent outputs

race conditions

Fragile Execution Model

Without coordination:

jobs execute immediately

no buffering exists

failure handling is reactive, not controlled

The Shift to Event-Driven Architecture

To solve these issues, I introduced Redis Streams and redesigned the system into an event-driven model.

New Architecture (Event-Driven Pipeline)

Client → API → Redis Stream → Consumer → Celery → Processing

Why Redis Streams?

Redis Streams provide:

Event buffering (decouples ingestion from execution)

At-least-once delivery (ensures reliability)

Ordered entries within a stream (processing order across multiple consumers and workers is not guaranteed)

Scalability for distributed systems

What Changed

Instead of executing tasks immediately:

The API publishes events to a Redis Stream

A stream consumer controls task execution

Celery workers process jobs asynchronously

This separates:

ingestion

scheduling

execution

Distributed Locking: The Critical Bug

To prevent duplicate processing, a locking mechanism was introduced.

The naive approach released the lock with an unconditional delete:

cache.delete(lock_key)

This looks harmless — but in distributed systems, it’s dangerous.

Why This Fails

Consider this sequence:

Process A acquires a lock
The lock expires
Process B acquires the same lock
Process A deletes the lock

Now:

Process B is running without protection

This creates a race condition — one of the hardest problems in distributed systems.

The Fix: Token-Based Distributed Locking

To solve this, each lock is assigned a unique token.

SET lock_key = token_A (TTL)

Release only if:
stored_token == token_A

Key Principles

Only the owner of the lock can release it

If ownership does not match → do nothing

TTL ensures eventual cleanup

This ensures:

safe concurrency

no accidental unlocks

predictable system behavior
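The failure sequence and the token check can be replayed without Redis. The class below is a toy, single-threaded stand-in that I made up for illustration, with time passed in explicitly so the expiry step is deterministic. A production lock would use an atomic set-if-absent with a TTL and an atomic compare-and-delete instead.

```python
import uuid
from typing import Dict, Optional, Tuple


class InMemoryLockStore:
    """Toy stand-in for Redis: key -> (token, expires_at). Not thread-safe."""

    def __init__(self) -> None:
        self._locks: Dict[str, Tuple[str, float]] = {}

    def acquire(self, key: str, ttl: float, now: float) -> Optional[str]:
        current = self._locks.get(key)
        if current is not None and current[1] > now:
            return None  # still held and not yet expired
        token = uuid.uuid4().hex  # unique per acquisition
        self._locks[key] = (token, now + ttl)
        return token

    def release(self, key: str, token: str) -> bool:
        current = self._locks.get(key)
        if current is None or current[0] != token:
            return False  # not the owner: do nothing
        del self._locks[key]
        return True

    def holder(self, key: str) -> Optional[str]:
        current = self._locks.get(key)
        return current[0] if current else None


store = InMemoryLockStore()
token_a = store.acquire("lock_key", ttl=30, now=0)    # Process A acquires
token_b = store.acquire("lock_key", ttl=30, now=31)   # A's lock expired, B acquires
a_released = store.release("lock_key", token_a)       # A finishes late and tries to release
b_still_holds = store.holder("lock_key") == token_b   # B keeps its protection
```

With the unconditional delete, the third step would have removed B's lock. With the token check it is a no-op.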

Stream Consumer Design

Redis Streams operate with:

At-least-once delivery semantics

This means:

messages can be delivered more than once

consumers must be idempotent

Consumer Processing Flow

Read → Validate → Enqueue → Acknowledge

Critical Rule

Never acknowledge a message before it is safely enqueued.

Idempotency and Reliability

To handle duplicate events:

processing must be idempotent

tasks must tolerate retries

state transitions must be safe

This is essential in any event-driven system.

Final Architecture (Layered System Design)

The system now operates in clear layers:

Ingestion Layer

receives requests

publishes events

Stream Layer

buffers and orders events

decouples system components

Consumer Layer

controls execution

validates and dispatches tasks

Execution Layer

Celery workers process NDVI jobs

Coordination Layer

distributed locking

idempotency

concurrency control

Key Lessons from Building an Event-Driven System

Event-Driven Architecture Does Not Reduce Complexity

It shifts complexity into:

coordination

state management

failure handling

Concurrency Is the Real Challenge

Not performance.
Not frameworks.

Concurrency.

Safety Must Be Designed Explicitly

Small shortcuts (like naive lock deletion) can lead to major production issues.

Idempotency Is Non-Negotiable

In systems with retries and event delivery:

duplicate execution is expected

safe handling is required

Observability Becomes Critical

In asynchronous systems, you must answer:

“What happened to this job?”

This requires:

structured logging

tracing across components

visibility into system flow

Conclusion

This shift changed the system from:

"Run this task now"

to:

"This event will be processed safely"

That difference is fundamental.

Because in distributed systems:

You don’t design for success.
You design for failure.

What’s Next

The next phase is observability-driven engineering:

tracing event lifecycles

monitoring stream lag

correlating logs across services

Because once a system becomes event-driven:

Visibility is what makes it understandable.

Hardening Distributed Systems: Retries, Circuit Breakers & Observability.

Rahim Ranxx — Sun, 12 Apr 2026 05:12:28 +0000

Building Resilient Distributed Systems: A Solo Engineer's Journey

How I turned flaky upstream APIs into a predictable, observable, and operator-friendly reliability layer — with code you can steal.

Introduction

If you've ever built a service that depends on external APIs (STAC catalogs, SentinelHub, weather data providers, etc.), you know the pain:

429s when you hit rate limits
502s when upstreams hiccup
Silent timeouts that leave jobs hanging
Retry storms that make bad days worse

Last month, I undertook a focused effort to harden the retry and resilience logic for an NDVI (Normalized Difference Vegetation Index) processing pipeline. What started as "let's clean up some duplicate retry code" evolved into a production-grade reliability subsystem that now governs every upstream interaction.

In this article, I'll walk through:

Phase 1: Consolidating retry policy into a single source of truth
Phase 2: Adding circuit breakers with observability and admin controls
Phase 3 (preview): Decoupling dispatch with Redis Streams for back-pressure resilience
Key principles I learned that you can apply to your own distributed systems

All code is Python/Django/Celery, but the patterns are language-agnostic. And yes — I did this alone. No team, no dedicated SRE, no platform squad. Just me, a codebase, and a lot of careful thinking.

The Problem Space

The NDVI pipeline I was working on orchestrates vegetation index calculations by:

Querying STAC catalogs for satellite imagery metadata
Fetching raster data from SentinelHub
Computing NDVI values per farm/plot
Returning results to farmers/agronomists

The challenge: Each upstream service has different failure modes:

STAC: occasional 502s, auth errors (401/403)
SentinelHub: strict rate limits (429), validation errors (422), transient 5xx
Network: timeouts, DNS failures, TLS handshake issues

Before my refactor, retry logic was scattered across 4+ modules, with inconsistent error classification and no centralized observability. Result? Hard-to-debug failures, wasted Celery retries, and on-call pages at 3 AM.

As a solo engineer, I couldn't afford to keep firefighting. I needed a system that would just work — or fail gracefully, with clear signals.

Phase 1: One Source of Truth for Retries

The Core Insight

Not all errors are retryable. Not all retries are equal.

I started by defining a canonical truth table mapping HTTP status codes to retry behavior:

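# ndvi/retry_policy.py
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RetryClassification:
    # Assumed shape: inferred from how the type is used below.
    retryable: bool
    category: str

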
def classify_status_code(status_code: int | None) -> RetryClassification:
    """
    Canonical truth table: HTTP status → retry decision.

    | Status      | Retryable | Category           |
    |-------------|-----------|--------------------|
    | 401, 403    | False     | AUTH               |
    | 400, 422    | False     | VALIDATION         |
    | 429         | True      | RATE_LIMIT         |
    | >= 500      | True      | TRANSIENT_UPSTREAM |
    | Other/None  | False     | UNKNOWN            |
    """
    if status_code in (401, 403):
        return RetryClassification(retryable=False, category="AUTH")
    if status_code in (400, 422):
        return RetryClassification(retryable=False, category="VALIDATION")
    if status_code == 429:
        return RetryClassification(retryable=True, category="RATE_LIMIT")
    if status_code is not None and status_code >= 500:
        return RetryClassification(retryable=True, category="TRANSIENT_UPSTREAM")
    return RetryClassification(retryable=False, category="UNKNOWN")

Unified Exception Hierarchy

I made all upstream errors inherit from a common base, ensuring consistent attributes:

class UpstreamFailureError(NdviFailureError):
    """Base for all upstream failures; retryability comes from the classifier."""
    def __init__(self, message: str, status_code: int | None = None, response: Response | None = None):
        super().__init__(message)
        self.status_code = status_code
        self.response = response
        # Delegate to canonical classifier
        classification = classify_status_code(status_code)
        self.retryable = classification.retryable
        self.category = classification.category
        self.delay = self._compute_delay(classification)

class StacUpstreamError(UpstreamFailureError, StacError): ...
class SentinelHubUpstreamError(UpstreamFailureError): ...
class SentinelHubRasterError(UpstreamFailureError): ...

Centralized Retry Decision

@dataclass
class RetryDecision:
    retry: bool
    delay: float
    reason: str

def should_retry(exc: Exception, response_headers: dict | None = None) -> RetryDecision:
    if not isinstance(exc, UpstreamFailureError):
        return RetryDecision(retry=False, delay=0.0, reason="non-retryable-exception")

    # Respect Retry-After header for 429s
    if exc.status_code == 429 and response_headers:
        server_delay = parse_retry_after(response_headers.get("Retry-After"))
        if server_delay is not None:
            return RetryDecision(retry=True, delay=server_delay, reason="retry-after-header")

    return RetryDecision(
        retry=exc.retryable,
        delay=exc.delay,
        reason=f"{exc.category}-classification"
    )
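`parse_retry_after` is called above but not shown. One possible implementation is sketched here, assuming it should accept both forms the HTTP specification allows for Retry-After: a number of seconds or an HTTP-date. The optional `now` parameter is my addition to make the date branch testable.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Return the delay in seconds, or None when the header is absent or unparseable."""
    if value is None:
        return None
    value = value.strip()
    if value.isascii() and value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        retry_at = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry immediately", never a negative delay.
    return max(0.0, (retry_at - now).total_seconds())


fixed_now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
from_seconds = parse_retry_after("120")
from_date = parse_retry_after("Wed, 21 Oct 2015 07:28:00 GMT", now=fixed_now)
```

Returning None for anything unparseable lets `should_retry()` fall through to the classification-based delay instead of failing on a malformed header.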

Impact

28 parametrized tests covering all 13 truth-table branches
Removed 3 duplicate retry implementations
Celery tasks now use shared should_retry() logic
Network errors properly wrapped → no more silent failures

Lesson #1: Centralize failure classification. When retry logic lives in one place, you can test it thoroughly, document it clearly, and evolve it safely — even when you're the only one maintaining it.

Phase 2: Circuit Breakers with Teeth

Retries alone aren't enough. When an upstream is truly down, you want to fail fast and avoid thundering herds.

The Circuit Breaker State Machine

I implemented a simple but effective three-state breaker:

CLOSED → (failures ≥ threshold) → OPEN → (timeout elapsed) → HALF_OPEN → (success) → CLOSED
HALF_OPEN → (failure) → OPEN

import logging
import time

logger = logging.getLogger(__name__)


class _CircuitBreaker:
    def __init__(self, engine: str, threshold: int = 3, timeout_secs: float = 300):
        self.engine = engine  # label for logs and metrics, e.g. "stac"
        self.state = "CLOSED"
        self.failure_count = 0
        self.last_failure_time: float | None = None
        self.threshold = threshold
        self.timeout_secs = timeout_secs

    def record_success(self):
        self.failure_count = 0
        self._transition_to("CLOSED")

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        # In HALF_OPEN the count is still at or above the threshold,
        # so a failed probe reopens the breaker immediately.
        if self.failure_count >= self.threshold:
            self._transition_to("OPEN")

    def allow_request(self) -> bool:
        if self.state == "CLOSED":
            return True
        if self.state == "OPEN":
            if time.time() - self.last_failure_time >= self.timeout_secs:
                self._transition_to("HALF_OPEN")
                return True
            return False
        # HALF_OPEN: probe requests pass; the next recorded success
        # or failure decides the state
        return True

    def _transition_to(self, new_state: str):
        if new_state == self.state:
            return  # no-op: do not log or count CLOSED → CLOSED
        old_state = self.state
        self.state = new_state
        logger.info(f"Circuit breaker [{self.engine}]: {old_state} → {new_state}")
        # Export Prometheus metrics
        circuit_breaker_state.labels(engine=self.engine).set(STATE_VALUES[new_state])
        circuit_breaker_transitions.labels(
            engine=self.engine, from_state=old_state, to_state=new_state
        ).inc()
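The state machine is easier to reason about when it can be exercised without waiting five minutes. The sketch below is a stripped-down variant, not the `_CircuitBreaker` class itself: `MiniBreaker` is a hypothetical name, it has no metrics or logging, and it takes an injectable clock so a test can move time forward.

```python
import time
from typing import Callable, List, Optional


class MiniBreaker:
    """Three-state breaker with an injectable clock, for illustration only."""

    def __init__(
        self,
        threshold: int = 3,
        timeout_secs: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.state = "CLOSED"
        self.failure_count = 0
        self.opened_at: Optional[float] = None
        self.threshold = threshold
        self.timeout_secs = timeout_secs
        self._clock = clock

    def allow_request(self) -> bool:
        if self.state == "OPEN" and self._clock() - self.opened_at >= self.timeout_secs:
            self.state = "HALF_OPEN"  # timeout elapsed: let a probe through
        return self.state != "OPEN"

    def record_success(self) -> None:
        self.failure_count = 0
        self.state = "CLOSED"

    def record_failure(self) -> None:
        self.failure_count += 1
        # A failed probe reopens at once; otherwise wait for the threshold.
        if self.state == "HALF_OPEN" or self.failure_count >= self.threshold:
            self.state = "OPEN"
            self.opened_at = self._clock()


now: List[float] = [0.0]  # mutable cell so the lambda sees updates
breaker = MiniBreaker(threshold=3, timeout_secs=300, clock=lambda: now[0])

for _ in range(3):
    breaker.record_failure()
state_after_failures = breaker.state   # breaker has tripped
blocked = not breaker.allow_request()  # still inside the timeout window

now[0] = 301.0
probe_allowed = breaker.allow_request()
state_during_probe = breaker.state
breaker.record_success()
state_after_probe = breaker.state
```

A monotonic clock is used as the default here because wall-clock time can jump backwards; that is a design choice of this sketch, not something taken from the original class.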

Observability First

I didn't just build the breaker — I made it visible:

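# Prometheus metrics
from prometheus_client import Counter, Gauge

# Numeric encoding used by the state gauge
STATE_VALUES = {"CLOSED": 0, "OPEN": 1, "HALF_OPEN": 2}
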
circuit_breaker_state = Gauge(
    'ndvi_circuit_breaker_state',
    'Current circuit breaker state (0=CLOSED, 1=OPEN, 2=HALF_OPEN)',
    labelnames=['engine']
)

circuit_breaker_transitions = Counter(
    'ndvi_circuit_breaker_transitions_total',
    'Count of circuit breaker state transitions',
    labelnames=['engine', 'from_state', 'to_state']
)

And added a Grafana dashboard with:

Stat panels showing current state per engine (color-coded: 🟢 CLOSED, 🔴 OPEN, 🟡 HALF_OPEN)
Time series of transition rates
Correlation with upstream failure rates

Operator Controls

Because things will go wrong — and when you're solo, you are the operator — I added an admin endpoint to manually reset breakers:

POST /api/v1/ndvi/circuit-breaker/reset/
Content-Type: application/json
Authorization: Bearer <admin-token>

{ "engine": "stac" }

→ { "data": { "previous_state": "OPEN", "new_state": "CLOSED" } }

Lesson #2: Resilience patterns need observability and escape hatches. If you can't see it or control it, you don't own it — and when you're the only one on call, "owning it" means sleeping at night.

Phase 3 Preview: Decoupling with Redis Streams

As I scaled the system, I hit a new challenge: Celery broker unavailability during Redis Sentinel failover (~55 seconds of downtime). For background jobs, this was acceptable. But for real-time dispatch, I needed better.

The Architecture Decision

Instead of relying on Celery's built-in Redis transport, I chose a separate consumer pattern:

API → [Redis Stream] → Consumer → [Celery Queue] → Worker

Why?

Avoids Celery/Kombu stream support uncertainty
Easier to observe and debug (explicit XREADGROUP/XACK)
Natural back-pressure via XPENDING monitoring
Cleaner rollback path (just flip a feature flag)

Key Design Decisions I Made Early

1. Idempotency by Design

stream_payload = {
    "job_id": job.id,
    "request_hash": job.request_hash,  # Primary idempotency key
    "schema_version": 1,                # Future-proofing
    "colormap_normalization": "histogram",  # Evolved schema
    # ... other fields
}
# Consumer checks request_hash before enqueueing to Celery

2. Error Classification at Consumer Boundary

Not all failures should retry:

ERROR_STRATEGY = {
    "no_items": "DLQ",           # Permanent: no data exists
    "missing_assets": "DLQ",     # Permanent: schema mismatch
    "network_timeout": "RETRY",  # Transient: try again
    "celery_unavailable": "RETRY_WITH_BACKOFF",  # Infrastructure blip
}
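A table like this only helps if every failure is looked up in it, including the ones nobody anticipated. The sketch below shows one way to consume it. `strategy_for`, `backoff_delay`, the default strategy, and the backoff constants are all assumptions of mine, not part of the original consumer.

```python
from typing import Dict

ERROR_STRATEGY: Dict[str, str] = {
    "no_items": "DLQ",
    "missing_assets": "DLQ",
    "network_timeout": "RETRY",
    "celery_unavailable": "RETRY_WITH_BACKOFF",
}

# Assumption: unknown errors are parked for inspection, not retried forever.
DEFAULT_STRATEGY = "DLQ"


def strategy_for(error_code: str) -> str:
    """Look up the routing strategy, falling back to the default."""
    return ERROR_STRATEGY.get(error_code, DEFAULT_STRATEGY)


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Exponential backoff without jitter: base * 2**(attempt - 1), capped."""
    return min(cap, base * (2 ** (attempt - 1)))


delays = [backoff_delay(attempt) for attempt in (1, 2, 3, 10)]
```

Adding random jitter to the delay would spread retries out further when many entries fail at the same moment.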

3. Back-Pressure Strategy

from django.http import HttpResponse

PENDING_WARNING = 1_000
PENDING_CRITICAL = 5_000

pending_count = redis.xpending(stream_name, group_name)["pending"]
if pending_count > PENDING_CRITICAL:
    # Return 429 on API to slow producers. Django ships no dedicated
    # 429 response class, so the status code is set explicitly.
    return HttpResponse("Upstream backlog critical", status=429)
elif pending_count > PENDING_WARNING:
    logger.warning(f"Stream backlog growing: {pending_count} pending")

4. Graceful Shutdown

# In consume_ndvi_stream.py
import signal
import threading

shutdown_flag = threading.Event()

def handle_shutdown(signum, frame):
    shutdown_flag.set()  # Stop accepting new entries
    # Finish current entry, XACK if successful
    # Exit cleanly → orchestrator restarts

# Register only after the handler is defined
signal.signal(signal.SIGTERM, handle_shutdown)

Lesson #3: Decoupling isn't just about scalability — it's about failure isolation. When one component fails, the rest can keep moving. And when you're solo, isolation means you can debug one piece without bringing down the whole system.

Principles I Learned (That You Can Steal)

1. Make Failure Explicit

Don't hide errors behind generic exceptions. Classify them, tag them, and route them intentionally. Your future self — especially at 3 AM — will thank you.

2. Observability Is a Feature, Not an Afterthought

If you can't measure it, you can't improve it. Export metrics at the point of decision (retry? circuit open? stream lag?) — not just at the edges. When you're the only one debugging, every metric is a lifeline.

3. Design for the "Boring" Failure Modes

Everyone plans for the 500 error. Few plan for:

Broker failover latency
Consumer restart mid-processing
Schema evolution mid-deploy
Clock skew in distributed timestamps

Document these. Test them. Build escape hatches. When you don't have a team to lean on, preparation is your best defense.

4. Centralize, Then Specialize

Start with a single source of truth (like classify_status_code()). Then layer engine-specific behavior on top of that foundation. This prevents drift and duplication, which is critical when you're the only one maintaining the code.

5. Operator Experience Matters

Admin endpoints, health checks, clear logs, and meaningful metrics aren't "nice to have" — they're what let you sleep at night. Build them in from day one. When you're solo, you are the operator.

A Note on Solo Engineering

Working alone doesn't mean working in isolation. I leaned heavily on:

Public documentation: Google SRE book, AWS Well-Architected, Martin Fowler's patterns
Open source: Studying how Celery, Kombu, and Redis clients handle resilience
Community: Reading post-mortems, blog posts, and conference talks from engineers who've been there

And I documented everything. Not for a team — for my future self. Every architecture decision, every tradeoff, every "why" is written down. Because six months from now, I won't remember why I chose 300s for the circuit breaker timeout. But my docs will.

If you're also building alone: you're not behind. You're just optimizing for a different constraint. Depth over breadth. Clarity over velocity. Resilience over features.

Conclusion

Building resilient distributed systems isn't about fancy algorithms or cutting-edge tools. It's about discipline: clear contracts, explicit failure handling, observable behavior, and operator empathy.

The NDVI pipeline I built isn't perfect. My circuit breakers are still process-local (not cluster-wide). My stream consumer doesn't yet support distributed tracing. But it's predictable, testable, and recoverable — and that's what matters.

If you take one thing from this article, let it be this:

Resilience isn't a feature you add at the end. It's a mindset you build in from the start — whether you're on a team of 50 or flying solo.

All code examples are simplified for clarity; production versions include additional error handling and logging. This work reflects my personal approach — your mileage may vary, and that's okay.

💡 Pro Tip: Want to try the circuit breaker pattern? Start small:

Add a failure_count and last_failure_time to your HTTP client

Skip requests when failure_count >= 3 and time_since_failure < 300

Log state transitions

Add one Prometheus gauge

You'll be 80% of the way there — and you'll learn what actually matters for your workload.
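
The first three steps above fit in a few dozen lines. The sketch below is a minimal, process-local illustration: the class name `SimpleBreaker` and its method names are hypothetical, the clock is injectable so the cooldown can be tested without waiting, and the Prometheus gauge is left out to keep the example stdlib-only.

```python
import logging
import time
from typing import Callable, Optional

logger = logging.getLogger("breaker")


class SimpleBreaker:
    """Process-local circuit breaker: opens after `threshold` failures."""

    def __init__(
        self,
        threshold: int = 3,
        cooldown_s: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failure_count = 0
        self.last_failure_time: Optional[float] = None

    def allow_request(self) -> bool:
        """False while the breaker is open and still cooling down."""
        if self.failure_count < self.threshold or self.last_failure_time is None:
            return True
        # Open: after the cooldown, let one trial request through.
        return (self.clock() - self.last_failure_time) >= self.cooldown_s

    def record_failure(self) -> None:
        self.failure_count += 1
        self.last_failure_time = self.clock()
        if self.failure_count == self.threshold:
            logger.warning("circuit opened after %d failures", self.failure_count)

    def record_success(self) -> None:
        if self.failure_count >= self.threshold:
            logger.info("circuit closed after successful trial request")
        self.failure_count = 0
        self.last_failure_time = None
```

Call `allow_request()` before each HTTP call and `record_success()` or `record_failure()` afterwards. A failed trial request refreshes `last_failure_time`, so the breaker stays open for another full cooldown.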

Django + Celery + Redis Sentinel: A Real Failover Test (With Metrics)

Rahim Ranxx — Sat, 04 Apr 2026 17:36:44 +0000

Redis Sentinel + Celery Failover: What Actually Happens in Production

Most tutorials on Redis Sentinel stop at “it elects a new master”.
Very few show what happens to a real system under failover pressure.

I ran a failover drill on a Django + Celery stack backed by Redis Sentinel and Prometheus monitoring.

Here’s what actually happened.

Architecture Overview
Sentinel Integration (Django + Celery)
Observability with Prometheus
Failover Drill Walkthrough
Celery Behavior During Failover
Performance Impact
Production Readiness Assessment
How to Reduce Failover Latency

Architecture Overview

flowchart LR
    Client --> Django
    Django -->|Cache| Sentinel
    Django -->|Tasks| Celery
    Celery -->|Broker| Sentinel
    Celery -->|Result Backend| Sentinel

    Sentinel --> RedisMaster
    Sentinel --> RedisReplica1
    Sentinel --> RedisReplica2

    Prometheus --> RedisExporter
    RedisExporter --> Sentinel

Stack Components

Django → Redis cache via Sentinel
Celery → Broker + result backend via Sentinel
Redis Sentinel → High availability + failover
Prometheus + redis_exporter → Monitoring

Sentinel Integration (Django + Celery)

All services were switched to Sentinel using environment configuration:

REDIS_ADDR=redis://host.docker.internal:26379

Validation steps:

Django cache → successful round-trip
Celery broker → connected via Sentinel
Celery result backend → SentinelBackend initialized
Test suite passed:

  pytest tests/test_settings_redis_sentinel.py

At this stage, the system is fully Sentinel-aware.

Observability with Prometheus

After pointing redis_exporter at Sentinel, these key metrics were exposed:

redis_sentinel_master_status
redis_sentinel_master_ok_sentinels
redis_sentinel_master_ok_slaves
redis_sentinel_masters

Verification:

redis_instance_info{redis_mode="sentinel", tcp_port="26379"}

This confirms monitoring is tracking cluster state, not a single node.

Failover Drill Walkthrough

Initial State

flowchart LR
    Sentinel -->|Master| Redis1["172.20.0.3:6379"]
    Sentinel --> Redis2["Replica"]
    Sentinel --> Redis3["Replica"]

Prometheus reported:

master_address="172.20.0.3:6379"

Induced Failure

Current master was stopped manually

Sentinel Election

flowchart LR
    Sentinel -->|New Master| Redis2["172.20.0.2:6379"]
    Sentinel --> Redis3["Replica"]
    Sentinel --> Redis1["Down"]

New master elected on first poll
Prometheus updated on next scrape

Failover was immediate and correct

Celery Behavior During Failover

Timeline

sequenceDiagram
    participant App as Django App
    participant Celery
    participant Sentinel
    participant Redis

    App->>Celery: Submit Task
    Celery->>Redis: Send to Master
    Redis-->>Celery: Connection Lost

    Sentinel->>Sentinel: Elect New Master

    Celery->>Sentinel: Retry Connection
    Note over Celery: ~54.7s delay

    Celery->>Redis: Reconnect to New Master
    Redis-->>Celery: OK

    Celery-->>App: Task SUCCESS

Observed Task

Task ID: 9b57ba3b-a707-4c13-9255-d74de411b64b
Status during failover: PENDING
Delay: ~54.7 seconds
Final state: SUCCESS

Performance Impact

Phase	Behavior
Normal operation	Immediate execution
During failover	~55s delay
Post-recovery	Normal

Production Readiness Assessment

What Works

Redis Sentinel failover is reliable
Prometheus reflects cluster changes correctly
Django cache survives failover
No task loss in Celery

What Needs Attention

Celery introduces significant delay during failover
Reconnection is not instantaneous

When This Architecture Is Production-Ready

Use this setup if:

Tasks are asynchronous/background
Eventual completion is acceptable
Temporary latency spikes are tolerable

When This Is Not Enough

Avoid this setup (as-is) if you need:

Real-time task execution
Sub-10s failover recovery
User-facing async operations

How to Reduce Failover Latency

To push recovery closer to 10–15 seconds:

Tune Celery broker retry settings
Reduce reconnect backoff intervals
Optimize worker heartbeat and visibility timeout
Re-run failover drills with timing instrumentation
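
As a starting point for that tuning, a Celery configuration along these lines could be tried. Everything specific here is an assumption: the Sentinel hostnames, the master name `mymaster`, and every numeric value are illustrative placeholders, not settings from the tested stack. Which of them actually shortens the ~55 s gap is something only a repeated drill can show.

```python
# celeryconfig.py (sketch): all hostnames and numbers are illustrative assumptions

# Sentinel broker URL: list every Sentinel node, separated by semicolons.
broker_url = (
    "sentinel://sentinel-1:26379;"
    "sentinel://sentinel-2:26379;"
    "sentinel://sentinel-3:26379"
)

broker_transport_options = {
    "master_name": "mymaster",      # must match the name Sentinel monitors
    "socket_timeout": 5,            # give up quickly on a dead master socket
    "socket_connect_timeout": 5,    # do not hang while connecting
    "visibility_timeout": 3600,     # seconds before unacked tasks are redelivered
}

# Result backend through the same Sentinel group.
result_backend = broker_url
result_backend_transport_options = {"master_name": "mymaster"}
redis_socket_timeout = 5            # result-backend socket timeout (seconds)
redis_socket_connect_timeout = 5

# Broker connection behavior on loss.
broker_connection_retry = True      # keep retrying instead of exiting
broker_connection_timeout = 4       # seconds to wait per connection attempt
```

Lower socket timeouts make a dead master visible sooner, but set too low they can cause spurious reconnects under normal load, so each change is worth validating with another drill.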

Key Takeaway

Redis Sentinel ensures infrastructure recovery.
Celery determines how fast your system actually resumes work.

In this test:

Sentinel recovery: instant
Application recovery: ~55 seconds

That gap is the real engineering challenge.

Final Thoughts

If you're using Redis Sentinel with Celery:

Don’t stop at:

“Failover works.”

Measure:

“How long until my system behaves normally again?”

Because that’s what production users experience.
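
That measurement can be automated with a small polling helper. The sketch below is one possible approach, not the tooling used in this drill: `measure_recovery` and its parameters are hypothetical, and `probe` is any callable that returns True once the system behaves normally again (for example, one that submits a trivial Celery task and waits briefly for its result).

```python
import time
from typing import Callable


def measure_recovery(
    probe: Callable[[], bool],
    timeout_s: float = 120.0,
    interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll `probe` until it succeeds; return the seconds that elapsed.

    Start this right after stopping the master. Raises TimeoutError if
    the probe never succeeds within `timeout_s`.
    """
    start = clock()
    while True:
        try:
            if probe():
                return clock() - start
        except Exception:
            # A raising probe simply means "not recovered yet".
            pass
        if clock() - start >= timeout_s:
            raise TimeoutError(f"no recovery within {timeout_s}s")
        sleep(interval_s)
```

The result is only as precise as `interval_s`, and it measures what the probe exercises, so a probe that goes through the full task path gives the application-level number rather than the Sentinel-level one.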

Escaping Cache Fragmentation: How Misconfigured PHP Workers Flooded My Token System

Rahim Ranxx — Sun, 22 Mar 2026 18:02:23 +0000

🚨 The Symptom

I started noticing something strange in my observability stack:

Integration tokens were being minted repeatedly
My token endpoint showed activity even when no user interaction was happening
Metrics suggested constant “traffic” to an otherwise idle system

At first glance, it looked like:

A security issue
A rogue client
Or a broken API consumer

It was none of those.

🔍 The Root Cause

The issue came down to a subtle but critical architectural mistake:

I was using a non-shared cache in a multi-worker environment.

Stack involved:

PHP-FPM (2 workers)
APCu (in-memory cache)
Token-based integration between services

⚙️ What Went Wrong

APCu is process-local, not shared.

That means:

Worker A cache ≠ Worker B cache

Each PHP-FPM worker had its own isolated memory.

💥 The Cascade Effect

My token logic was straightforward:

if token not in cache:
    mint_new_token()

But in reality, the system behaved like this:

Request hits Worker A → token exists → OK
Next request hits Worker B → cache miss → mint new token
Repeat across workers → continuous token regeneration
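
The effect is easy to reproduce in a toy simulation. The sketch below models workers as objects with either private or shared dictionaries; `Worker` and `count_mints` are hypothetical names, and the model deliberately ignores token expiry and worker restarts, each of which would add further mints on top of what it shows.

```python
import itertools
from typing import Callable


class Worker:
    """A worker that mints a token whenever its cache has none."""

    def __init__(self, cache: dict, mint: Callable[[], str]) -> None:
        self.cache = cache
        self.mint = mint

    def handle(self) -> str:
        if "token" not in self.cache:
            self.cache["token"] = self.mint()
        return self.cache["token"]


def count_mints(shared: bool, workers: int = 2, requests: int = 6) -> int:
    """Route requests round-robin and count how many tokens get minted."""
    counter = itertools.count(1)
    minted: list[str] = []

    def mint() -> str:
        token = f"token-{next(counter)}"
        minted.append(token)
        return token

    shared_cache: dict = {}
    pool = [
        # Private dict per worker unless `shared` is set.
        Worker(shared_cache if shared else {}, mint)
        for _ in range(workers)
    ]
    for i in range(requests):
        pool[i % workers].handle()
    return len(minted)
```

With private caches every worker mints its own token; with one shared cache a single token serves all of them.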

📈 Why Observability Looked “Wrong”

From the outside, it looked like traffic was hitting the token endpoint.

But in reality:

The system was generating its own traffic due to cache inconsistency.

This is a key lesson:

Not all traffic is external
Some is emergent behavior from system design

✅ The Fix

I switched from APCu to:

Redis (shared cache)

Now:

All workers → same cache → consistent token state

Result:

Tokens minted once
Reused across all workers
Metrics stabilized instantly

🔒 Production Hardening (What I Added Next)

Fixing the cache wasn’t enough — I hardened the system further.

1. Distributed Locking

To prevent race conditions:

if token exists:
    return token

acquire lock
    re-check cache
    mint token if still missing
release lock
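
In Python, the same check-lock-recheck pattern could be sketched like this. It is an illustration only: `TokenProvider` is a hypothetical name, the dictionary stands in for the shared cache, and `threading.Lock` stands in for a distributed lock, which would be needed once the workers are separate processes.

```python
import threading
from typing import Callable, Optional


class TokenProvider:
    """Double-checked locking around token minting."""

    def __init__(self, mint: Callable[[], str]) -> None:
        self._cache: dict[str, str] = {}   # stand-in for the shared cache
        self._lock = threading.Lock()      # stand-in for a distributed lock
        self._mint = mint

    def get_token(self) -> str:
        # Fast path: no lock needed when the token is already cached.
        token: Optional[str] = self._cache.get("token")
        if token is not None:
            return token
        with self._lock:
            # Re-check: another caller may have minted while we waited.
            token = self._cache.get("token")
            if token is None:
                token = self._mint()
                self._cache["token"] = token
            return token
```

The re-check inside the lock is what prevents a burst of concurrent cache misses from each minting its own token.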

2. TTL Buffering

Avoid edge expiration issues:

cache_ttl = token_expiry - safety_margin
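
As a worked example, assuming the token lifetime arrives as a number of seconds and using a hypothetical 60-second margin:

```python
def cache_ttl_seconds(expires_in_s: int, safety_margin_s: int = 60) -> int:
    """TTL for the cached token, ending `safety_margin_s` before real expiry.

    A result of 0 means the token is too short-lived to be worth caching.
    """
    return max(expires_in_s - safety_margin_s, 0)


# A token valid for one hour is cached for 59 minutes.
one_hour_ttl = cache_ttl_seconds(3600)   # 3540
# A 30-second token is never cached with a 60-second margin.
short_ttl = cache_ttl_seconds(30)        # 0
```

Clamping at zero avoids passing a negative TTL to the cache when the margin exceeds the token lifetime.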

3. Observability Metrics

I added:

token_cache_hits
token_cache_misses
token_mint_count

Now anomalies show up immediately.

🧠 Key Takeaway

This wasn’t just a bug.

It was a distributed systems failure mode:

Cache locality + multi-worker architecture → inconsistent state → emergent traffic

⚡ Final Insight

If your system:

Runs multiple workers
Uses in-memory caching
Relies on shared state

Then this rule applies:

If your cache isn’t shared, your state isn’t real.

🔗 Closing

This issue reinforced something critical in my engineering journey:

You don’t debug systems by staring at code —
you debug them by understanding how state flows across boundaries.

If you're building distributed APIs, token systems, or high-concurrency services —
this is one edge case worth designing for early.

From 80-Second APIs to Sub-Second: Rebuilding a Geospatial Backend with Async Pipelines

Rahim Ranxx — Sat, 21 Mar 2026 16:37:10 +0000

From 80-Second APIs to Sub-Second: Fixing Latency with Async Pipelines (Django + Celery)

Introduction

At some point, every backend engineer hits this wall:

The API works perfectly… until it doesn’t.

I hit that wall with a farm analytics endpoint computing NDVI (Normalized Difference Vegetation Index) from satellite imagery. The system was correct, the logic was sound, and the results were accurate.

But the numbers told a different story:

P95 latency: 1.25 minutes

That’s not an API. That’s a blocking compute job pretending to be one.

This is the story of how I redesigned the system—from a synchronous request-driven model to an asynchronous data pipeline—and brought latency down to sub-second performance (P95 ≈ 725ms).

The Original Architecture (The Hidden Problem)

At first glance, the system looked clean:

[Client]
   ↓
[Django API]
   ↓
[STAC API → Satellite Data]
   ↓
[Raster Processing (NDVI)]
   ↓
[Response]

What happened on each request?

Query satellite imagery via STAC
Fetch raster bands (Red & NIR) from remote storage
Process NDVI using rasterio
Aggregate coverage
Return result

Why this seemed fine

It worked locally
It returned correct data
It followed a “pure API” mindset

But under the hood:

Remote I/O (S3-backed satellite data)
Heavy raster decoding (JPEG2000)
Sequential band reads
Full computation per request

The Breaking Point

Logs told the truth.

Each request looked like:

STAC request → ~5s
Raster read (B04) → ~5–10s
Raster read (B08) → ~5–10s
Processing → ~5s+
Total → ~80+ seconds

And the key realization:

I wasn’t building an API—I was executing a geospatial compute pipeline on every request.

The Core Insight

This is the shift that changes everything:

APIs should serve data, not compute it on demand.

The problem wasn’t Python.
The problem wasn’t Django.
The problem was architecture.

The New Architecture (Async Pipeline)

I redesigned the system around asynchronous computation + caching:

             (Scheduled / Triggered)
                    ↓
             [Celery Worker]
                    ↓
         [NDVI Computation Pipeline]
                    ↓
             [Redis / Database]
                    ↓
[Client] → [Django API] → [Cache Lookup]

Key changes

NDVI computation moved out of the request path
Results cached in Redis
Background jobs compute and refresh data
API returns instantly (no heavy compute)

Diagram 1 — Before vs After

Before (Request-driven)

Request
   ↓
STAC API
   ↓
Raster I/O
   ↓
NDVI Compute
   ↓
Response (80s)

After (Pipeline-driven)

Request → Cache → Response (~725ms P95)
              ↓ (miss)
         Async Task
              ↓
       Compute + Store

Implementation

1. Fast API Path (Non-blocking)

from django.core.cache import cache
from ndvi.tasks import compute_farm_state_coverage

def get_farm_state(farm_id: int) -> dict:
    cache_key = f"farm_state:{farm_id}"

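    data = cache.get(cache_key)
    if data is not None:  # explicit None check: any cached value counts as a hit
        return data

    # Cache miss: hand the heavy NDVI work to Celery and return immediately
    compute_farm_state_coverage.delay(farm_id=farm_id)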

    return {
        "coverage_pct": None,
        "status": "processing"
    }
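
One thing to watch in this path: every request that misses the cache enqueues another task, so a polling client can queue the same farm many times while the first computation is still running. A possible guard is an in-flight marker set with an add-if-absent call. The sketch below is an assumption, not the production code: `InMemoryCache` is a stand-in that ignores timeouts, the `:inflight` key suffix and the 300-second window are hypothetical, and `enqueue` replaces the direct `.delay()` call so the example runs without Celery. Django's `cache.add()` has the same contract (it returns True only when the key was newly set).

```python
from typing import Any, Callable, Optional


class InMemoryCache:
    """Minimal stand-in exposing get() and add(); timeouts are ignored."""

    def __init__(self) -> None:
        self._data: dict[str, Any] = {}

    def get(self, key: str, default: Any = None) -> Any:
        return self._data.get(key, default)

    def add(self, key: str, value: Any, timeout: Optional[int] = None) -> bool:
        """Store only if the key is absent; report whether it was stored."""
        if key in self._data:
            return False
        self._data[key] = value
        return True


def get_farm_state(
    farm_id: int,
    cache: InMemoryCache,
    enqueue: Callable[[int], None],
) -> dict:
    data = cache.get(f"farm_state:{farm_id}")
    if data is not None:
        return data
    # Only the first miss inside the window wins the marker and enqueues.
    if cache.add(f"farm_state:{farm_id}:inflight", 1, timeout=300):
        enqueue(farm_id)
    return {"coverage_pct": None, "status": "processing"}
```

With a real cache the marker expires after the timeout, so a crashed task does not block recomputation forever.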

2. Celery Task (Async Compute)

from celery import shared_task
from django.core.cache import cache

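# Retries on any exception with exponential backoff (Celery's default max_retries of 3 applies)
@shared_task(bind=True, autoretry_for=(Exception,), retry_backoff=True)
def compute_farm_state_coverage(self, farm_id: int) -> None:
    # compute_ndvi_coverage is the project's NDVI helper, defined elsewhere
    coverage = compute_ndvi_coverage(farm_id)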

    cache.set(
        f"farm_state:{farm_id}",
        {
            "coverage_pct": coverage,
            "status": "ready"
        },
        timeout=60 * 60 * 6,
    )

3. Daily Backfill (Critical)

from celery import shared_task

@shared_task
def enqueue_daily_farm_state_coverage():
    farm_ids = get_active_farm_ids()

    for farm_id in farm_ids:
        compute_farm_state_coverage.delay(farm_id=farm_id)

Observability (The Real Upgrade)

Metrics added:

Task duration
Task success/failure
Queue depth

Metrics (Grafana Observations)

📊 Grafana Screenshots

1. Latency Graph

Before

P95 latency: ~1.25 minutes

After

API latency: ~725ms (P95)
Background tasks: 60–90s

Before vs After Summary

Metric	Before	After
API latency (P95)	~1.25 min	~725 ms
System type	Request-driven	Pipeline-driven
Scalability	Poor	Strong
Observability	Minimal	Improved

Final Thought

I stopped treating my API like a calculator and started treating my system like a data pipeline.

That’s when everything changed.