<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rahim Ranxx</title>
    <description>The latest articles on DEV Community by Rahim Ranxx (@rahim8050).</description>
    <link>https://dev.to/rahim8050</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3744842%2F2195e1a7-7e61-47f7-9c11-41610936958d.jpg</url>
      <title>DEV Community: Rahim Ranxx</title>
      <link>https://dev.to/rahim8050</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rahim8050"/>
    <language>en</language>
    <item>
      <title>From NDVI to a Generic Spectral Engine: Architecting Scalable Earth Observation Pipelines.</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sun, 28 Jun 2026 07:16:26 +0000</pubDate>
      <link>https://dev.to/rahim8050/from-ndvi-to-a-generic-spectral-engine-architecting-scalable-earth-observation-pipelines-p10</link>
      <guid>https://dev.to/rahim8050/from-ndvi-to-a-generic-spectral-engine-architecting-scalable-earth-observation-pipelines-p10</guid>
      <description>&lt;h2&gt;
  
  
  Scaling AgTech Analytics: From NDVI to a Generic Spectral Engine
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Meta Description:&lt;/strong&gt; &lt;em&gt;Discover how refactoring a hardcoded NDVI pipeline into a generic, data-driven spectral engine transforms agricultural technology platforms. Learn about platform engineering, sensor abstraction (Sentinel-2, Landsat, MODIS), and scaling remote sensing analytics.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Most remote sensing and Earth observation projects begin with a single metric: &lt;strong&gt;NDVI&lt;/strong&gt; (Normalized Difference Vegetation Index).&lt;/p&gt;

&lt;p&gt;Mine did too.&lt;/p&gt;

&lt;p&gt;Initially, this wasn't a problem. Processing one spectral index meant maintaining one computation path, one imagery loader, and one set of satellite provider integrations. Everything was straightforward and manageable.&lt;/p&gt;

&lt;p&gt;Then, reality arrived.&lt;/p&gt;

&lt;p&gt;I needed to add &lt;strong&gt;NDMI&lt;/strong&gt; (Normalized Difference Moisture Index) to improve farm moisture monitoring across diverse data sources: &lt;strong&gt;Sentinel-2, Landsat, MODIS, and STAC catalogs&lt;/strong&gt;. At first glance, this looked like a standard feature request.&lt;/p&gt;

&lt;p&gt;It wasn't. Adding NDMI exposed a critical architectural bottleneck that had been quietly growing inside the platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Scaling Spectral Indices
&lt;/h2&gt;

&lt;p&gt;The original implementation followed a familiar, but ultimately flawed, pattern: every spectral index and every data provider had its own bespoke implementation. The codebase was bloating into a matrix of redundant pipelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NDVI + Sentinel-2&lt;/li&gt;
&lt;li&gt;NDVI + STAC&lt;/li&gt;
&lt;li&gt;NDVI + Landsat&lt;/li&gt;
&lt;li&gt;NDWI + Sentinel-2&lt;/li&gt;
&lt;li&gt;NDWI + STAC&lt;/li&gt;
&lt;li&gt;...and now NDMI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every new vegetation or moisture index multiplied the codebase. Adding one feature meant generating another set of loaders, unit tests, API handlers, and maintenance paths.&lt;/p&gt;

&lt;p&gt;The code wasn't technically broken, but it wasn't sustainable platform engineering. The system wasn't becoming more intelligent; it was simply becoming more repetitive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Designing for the Fourth Index, Not the Third
&lt;/h2&gt;

&lt;p&gt;Rather than brute-forcing NDMI directly into the existing structure, I paused to ask a fundamental architecture question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What infrastructure would make the next five spectral indices almost free to implement?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This completely shifted the project's trajectory. Instead of writing yet another custom loader, I built clean abstractions around three core concepts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Spectral Formulas&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sensor Band Mappings&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generic Compute Engines&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By transitioning these elements from hardcoded logic into dynamic data, the entire analytics pipeline simplified dramatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Spectral Formulas as Configuration
&lt;/h3&gt;

&lt;p&gt;Every spectral index shares a basic blueprint requiring a name, a set of sensor bands, and a mathematical formula. Instead of scattering these definitions throughout the business logic, they now live in a centralized registry.&lt;/p&gt;

&lt;p&gt;Adding a new index no longer requires a new processing pipeline. It simply requires registering a new formula. The underlying compute engine remains untouched.&lt;/p&gt;
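
&lt;p&gt;A minimal sketch of what such a registry could look like, assuming a Python codebase. The names &lt;code&gt;SpectralIndex&lt;/code&gt;, &lt;code&gt;INDEX_REGISTRY&lt;/code&gt;, and &lt;code&gt;register&lt;/code&gt; are hypothetical and not taken from the platform. The formulas are the standard normalized-difference definitions, with NDWI in its green/NIR form, and the zero-denominator fallback is an illustrative choice.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class SpectralIndex:
    """One registry entry: a name, the abstract bands it needs, and its formula."""
    name: str
    bands: Tuple[str, ...]
    formula: Callable[..., float]


def normalized_difference(a, b):
    """Compute (a - b) / (a + b), returning 0.0 when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0


# Hypothetical registry: index name mapped to its definition.
INDEX_REGISTRY: Dict[str, SpectralIndex] = {}


def register(name, bands):
    """Register a normalized-difference index over two abstract bands."""
    INDEX_REGISTRY[name] = SpectralIndex(name, tuple(bands), normalized_difference)


register("NDVI", ("nir", "red"))
register("NDWI", ("green", "nir"))
register("NDMI", ("nir", "swir1"))  # the new index is one line, not a new pipeline

ndmi = INDEX_REGISTRY["NDMI"]
print(ndmi.bands, round(ndmi.formula(0.4, 0.2), 3))
```

&lt;p&gt;In this sketch the compute path only ever reads from &lt;code&gt;INDEX_REGISTRY&lt;/code&gt;, so a new index changes data rather than control flow.&lt;/p&gt;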

&lt;h3&gt;
  
  
  2. Abstracting Sensor Band Names
&lt;/h3&gt;

&lt;p&gt;Previously, provider-specific naming conventions leaked throughout the codebase. The near-infrared band, for example, is B08 on Sentinel-2, band 5 on Landsat 8/9, and band 2 on MODIS.&lt;/p&gt;

&lt;p&gt;Now, providers expose abstract band names. The compute engine simply requests universal identifiers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;nir&lt;/code&gt; (Near-Infrared)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;red&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;green&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;swir1&lt;/code&gt; (Short-Wave Infrared)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each provider is responsible for resolving these abstract names to its own assets. The scientific computation layer no longer knows, or cares, which satellite produced the imagery.&lt;/p&gt;
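
&lt;p&gt;The resolution step might be sketched as a plain lookup table. &lt;code&gt;BAND_MAPS&lt;/code&gt; and &lt;code&gt;resolve_bands&lt;/code&gt; are hypothetical names. The band numbers follow the public sensor designations (Landsat 8/9 numbering is assumed), but the exact asset keys a given catalog exposes may differ.&lt;/p&gt;

```python
# Hypothetical per-provider maps from abstract band names to asset keys.
BAND_MAPS = {
    "sentinel2": {"green": "B03", "red": "B04", "nir": "B08", "swir1": "B11"},
    "landsat": {"green": "B3", "red": "B4", "nir": "B5", "swir1": "B6"},
    "modis": {"green": "B04", "red": "B01", "nir": "B02", "swir1": "B06"},
}


def resolve_bands(provider, abstract_bands):
    """Translate abstract band names into one provider's asset keys."""
    try:
        band_map = BAND_MAPS[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider}") from None
    missing = [band for band in abstract_bands if band not in band_map]
    if missing:
        raise ValueError(f"{provider} has no mapping for: {missing}")
    return [band_map[band] for band in abstract_bands]


# The caller asks for the same abstract names regardless of the satellite.
print(resolve_bands("sentinel2", ["nir", "swir1"]))
print(resolve_bands("landsat", ["nir", "swir1"]))
```

&lt;p&gt;Failing loudly on an unmapped band keeps a missing mapping from turning into a silently wrong raster.&lt;/p&gt;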

&lt;h3&gt;
  
  
  3. A Single, Data-Driven Compute Engine
&lt;/h3&gt;

&lt;p&gt;The most significant leap was replacing fragmented, index-specific loaders with a unified generic compute engine. Its responsibilities are strictly bounded:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resolve required bands.&lt;/li&gt;
&lt;li&gt;Load satellite imagery.&lt;/li&gt;
&lt;li&gt;Apply the requested formula.&lt;/li&gt;
&lt;li&gt;Apply cloud masking.&lt;/li&gt;
&lt;li&gt;Return the resulting raster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notice what is missing: there are no &lt;code&gt;if index == NDVI&lt;/code&gt; conditional branches. There are no provider-specific calculations. By shifting to a data-driven model, a single abstraction replaced an expanding collection of nearly identical scripts.&lt;/p&gt;
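
&lt;p&gt;The steps above can be sketched in a few lines. This is a toy version that works on flat Python lists instead of real rasters. &lt;code&gt;FORMULAS&lt;/code&gt;, &lt;code&gt;compute_index&lt;/code&gt;, and the in-memory &lt;code&gt;scene&lt;/code&gt; are hypothetical stand-ins, and masked pixels are represented as &lt;code&gt;None&lt;/code&gt; purely for illustration.&lt;/p&gt;

```python
def normalized_difference(a, b):
    """Compute (a - b) / (a + b), returning 0.0 when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0


# Hypothetical registry: index name mapped to the two abstract bands it needs.
FORMULAS = {
    "NDVI": ("nir", "red"),
    "NDMI": ("nir", "swir1"),
}


def compute_index(index_type, load_band, cloud_mask):
    """Generic engine: resolve bands, load them, apply the formula, mask clouds.

    load_band is any callable that returns a flat list of reflectances for an
    abstract band name; cloud_mask is a list of booleans (True means cloudy).
    """
    band_a, band_b = FORMULAS[index_type]   # resolve required bands
    pixels_a = load_band(band_a)            # load imagery
    pixels_b = load_band(band_b)
    values = [normalized_difference(a, b)   # apply the requested formula
              for a, b in zip(pixels_a, pixels_b)]
    return [None if cloudy else value       # apply cloud masking
            for value, cloudy in zip(values, cloud_mask)]


# Toy in-memory "provider" standing in for a real imagery loader.
scene = {"nir": [0.5, 0.6], "red": [0.1, 0.2], "swir1": [0.3, 0.6]}
print(compute_index("NDVI", scene.__getitem__, [False, True]))
print(compute_index("NDMI", scene.__getitem__, [False, False]))
```

&lt;p&gt;The function body never names a specific index or satellite; both arrive as data.&lt;/p&gt;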

&lt;h2&gt;
  
  
  Beyond Features: Hardening the Production Platform
&lt;/h2&gt;

&lt;p&gt;As a backend engineer, I've learned that users rarely notice the work that matters most. Alongside the NDMI refactor, standardizing the platform layer allowed for crucial operational and observability improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dependency Management:&lt;/strong&gt; Streamlining dependency security updates using the ultra-fast &lt;code&gt;uv&lt;/code&gt; package manager.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System Observability:&lt;/strong&gt; Enhancing monitoring and stack trace sanitization across the Django backend using Prometheus, Grafana, and Loki.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure Reliability:&lt;/strong&gt; Remediating secret scanning vulnerabilities and improving email reliability for scheduled jobs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Production engineering isn't just about shipping AgTech features; it’s about reducing operational risk and keeping the system available through failovers.&lt;/p&gt;

&lt;h2&gt;
  
  
  What NDMI (and a Generic Engine) Actually Enables
&lt;/h2&gt;

&lt;p&gt;Technology is only valuable if it drives better decisions. Within this Farm Intelligence Platform, integrating NDMI and a robust spectral engine supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Early Moisture Stress Detection:&lt;/strong&gt; Crucial for proactive crop management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precision Irrigation Scheduling:&lt;/strong&gt; Optimizing water usage on large-scale farms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seasonal Drought Monitoring:&lt;/strong&gt; Providing macro-level environmental insights.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated Workflows:&lt;/strong&gt; Triggering downstream automation via Celery pipelines, backed by Redis Sentinel so the queue broker stays available through failover.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Farmer Advisories:&lt;/strong&gt; Translating raster data into multilingual text-to-speech alerts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The spectral engine produces the raw information; the platform's architecture ensures that information reliably becomes an actionable recommendation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned in Platform Engineering
&lt;/h2&gt;

&lt;p&gt;Looking back, the most valuable outcome wasn't adding NDMI. It was recognizing that the architecture needed to evolve &lt;em&gt;before&lt;/em&gt; the feature was integrated.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Build for Stability:&lt;/strong&gt; Design abstractions around stable concepts, not immediate feature requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolate Science from Logic:&lt;/strong&gt; Scientific formulas belong in data registries, not business logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Interfaces:&lt;/strong&gt; Provider-specific API behaviors should remain hidden behind strict interfaces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refactor First:&lt;/strong&gt; Cleaning up the architecture before scaling is always cheaper than untangling technical debt later.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What’s Next for the Platform
&lt;/h2&gt;

&lt;p&gt;The next evolutionary step is automating satellite acquisition scheduling using Celery Beat, Redis Streams, and event-driven ingestion. Because the spectral engine is now entirely generic, these CI/CD-validated workflows don't require separate logic for NDVI, NDWI, or NDMI. They simply receive an &lt;code&gt;index_type&lt;/code&gt; and execute.&lt;/p&gt;

&lt;p&gt;Adding NDMI started as a standard feature request but finished as a comprehensive architectural redesign. The biggest improvements in production systems often don't come from adding new capabilities—they come from removing old assumptions.&lt;/p&gt;

</description>
      <category>agtech</category>
      <category>remotesensing</category>
      <category>platformengineering</category>
      <category>geospatialdata</category>
    </item>
    <item>
      <title>From Feature Delivery to Platform Engineering.</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Mon, 22 Jun 2026 15:35:33 +0000</pubDate>
      <link>https://dev.to/rahim8050/from-feature-delivery-to-platform-engineering-3m09</link>
      <guid>https://dev.to/rahim8050/from-feature-delivery-to-platform-engineering-3m09</guid>
      <description>&lt;h2&gt;
  
  
  The Problem: Feature Velocity Was Creating Structural Debt
&lt;/h2&gt;

&lt;p&gt;The system started as a simple feature delivery backend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Django API powering agricultural insights&lt;/li&gt;
&lt;li&gt;Celery workers handling asynchronous processing&lt;/li&gt;
&lt;li&gt;Independent endpoints for each new capability&lt;/li&gt;
&lt;li&gt;A growing set of Earth Observation computations (NDVI, NDWI, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At first, it worked.&lt;/p&gt;

&lt;p&gt;But as more features were added, a pattern emerged:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each feature introduced its own pipeline logic&lt;/li&gt;
&lt;li&gt;Observability was inconsistent across services&lt;/li&gt;
&lt;li&gt;API contracts drifted between frontend and backend&lt;/li&gt;
&lt;li&gt;Debugging required tracing multiple disconnected systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We weren’t scaling functionality.&lt;/p&gt;

&lt;p&gt;We were scaling fragmentation.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Turning Point: Features vs Platforms
&lt;/h2&gt;

&lt;p&gt;The key realization was simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Features solve user problems. Platforms solve system problems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We were repeatedly rebuilding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Authentication flows&lt;/li&gt;
&lt;li&gt;Data ingestion logic&lt;/li&gt;
&lt;li&gt;Processing pipelines&lt;/li&gt;
&lt;li&gt;API validation layers&lt;/li&gt;
&lt;li&gt;Monitoring hooks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each feature was solving its own version of these concerns.&lt;/p&gt;

&lt;p&gt;That is where platform engineering became necessary.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Shift: Introducing a Platform Layer
&lt;/h2&gt;

&lt;p&gt;We introduced a platform layer between feature delivery and infrastructure.&lt;/p&gt;

&lt;p&gt;Instead of building isolated pipelines, we standardized:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Unified API Surface
&lt;/h3&gt;

&lt;p&gt;All Earth Observation workflows (NDVI, NDWI, and future indices) were normalized into a consistent API contract.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shared request/response structure&lt;/li&gt;
&lt;li&gt;Versioned endpoints&lt;/li&gt;
&lt;li&gt;Schema validation through serializers&lt;/li&gt;
&lt;li&gt;Central routing logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This eliminated endpoint fragmentation.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Standardized Processing Pipeline
&lt;/h3&gt;

&lt;p&gt;Celery tasks were refactored into a reusable pipeline pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ingestion&lt;/li&gt;
&lt;li&gt;Validation&lt;/li&gt;
&lt;li&gt;Computation&lt;/li&gt;
&lt;li&gt;Storage&lt;/li&gt;
&lt;li&gt;Publishing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of feature-specific workers, we moved toward composable tasks.&lt;/p&gt;

&lt;p&gt;This allowed new indices or processing logic to plug into the same execution flow.&lt;/p&gt;
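
&lt;p&gt;As a rough illustration of the composable-stage idea, here is a plain-Python sketch with no Celery dependency. The stage functions, the &lt;code&gt;job&lt;/code&gt; dictionary shape, and the placeholder values are all hypothetical. In Celery, the same shape is usually expressed as a &lt;code&gt;chain&lt;/code&gt; of task signatures.&lt;/p&gt;

```python
from functools import reduce


def ingest(job):
    return {**job, "raw": [0.5, 0.1]}   # placeholder for fetched imagery


def validate(job):
    if not job["raw"]:
        raise ValueError("no imagery returned")
    return job


def compute(job):
    nir, red = job["raw"]
    return {**job, "value": (nir - red) / (nir + red)}


def store(job):
    return {**job, "stored": True}      # placeholder for a database write


def publish(job):
    return {**job, "published": True}   # placeholder for a notification


# The shared execution flow: every index runs through the same ordered stages.
PIPELINE = (ingest, validate, compute, store, publish)


def run_pipeline(job, stages=PIPELINE):
    """Pass the job through each stage in order and return the final state."""
    return reduce(lambda state, stage: stage(state), stages, job)


result = run_pipeline({"farm_id": 1, "index_type": "NDVI"})
print(result["value"], result["published"])
```

&lt;p&gt;Because each stage takes and returns the same kind of object, swapping or adding a stage does not require touching the others.&lt;/p&gt;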




&lt;h3&gt;
  
  
  3. Observability as a First-Class Layer
&lt;/h3&gt;

&lt;p&gt;One of the biggest failures in the original system was visibility.&lt;/p&gt;

&lt;p&gt;We introduced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structured logging across all services&lt;/li&gt;
&lt;li&gt;Traceable job IDs across Celery tasks&lt;/li&gt;
&lt;li&gt;Consistent error schemas&lt;/li&gt;
&lt;li&gt;Centralized failure reporting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now every pipeline run could be traced end-to-end without guessing where it failed.&lt;/p&gt;
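
&lt;p&gt;One way to get that traceability, sketched with only the standard library: generate a job ID once and attach it to every structured log line. The &lt;code&gt;log_event&lt;/code&gt; helper and the field names are assumptions for illustration, not the platform's actual schema.&lt;/p&gt;

```python
import json
import logging
import uuid


def log_event(logger, job_id, stage, status, **fields):
    """Emit one JSON log line that always carries the job ID and stage."""
    record = {"job_id": job_id, "stage": stage, "status": status, **fields}
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line


logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("pipeline")

job_id = str(uuid.uuid4())  # generated once, then passed to every stage
log_event(logger, job_id, "ingestion", "ok", index="NDVI")
log_event(logger, job_id, "computation", "failed", error="missing band: swir1")
```

&lt;p&gt;Filtering the log store on a single &lt;code&gt;job_id&lt;/code&gt; then shows the whole run, including the stage where it failed.&lt;/p&gt;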




&lt;h3&gt;
  
  
  4. Contract-Driven Development
&lt;/h3&gt;

&lt;p&gt;We enforced strict API contracts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schema validation at the edge&lt;/li&gt;
&lt;li&gt;Typed serializers in Django&lt;/li&gt;
&lt;li&gt;Explicit error responses&lt;/li&gt;
&lt;li&gt;Versioned API evolution instead of silent changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This reduced frontend/backend drift significantly.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. CI/CD Guardrails for System Integrity
&lt;/h3&gt;

&lt;p&gt;To prevent regression as the system grew:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linting, type checking, and security scanning enforced consistency (Ruff, mypy, Bandit)&lt;/li&gt;
&lt;li&gt;Task registry validation ensured no orphaned Celery tasks&lt;/li&gt;
&lt;li&gt;API schema checks prevented breaking changes&lt;/li&gt;
&lt;li&gt;Automated tests verified pipeline execution paths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal was simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If the system breaks, it should fail in CI—not in production.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Earth Observation as a Stress Test
&lt;/h2&gt;

&lt;p&gt;NDVI and NDWI pipelines became more than features—they became a stress test for architecture.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because they exposed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Heavy computation workflows&lt;/li&gt;
&lt;li&gt;Large data dependencies&lt;/li&gt;
&lt;li&gt;External geospatial inputs&lt;/li&gt;
&lt;li&gt;Long-running async tasks&lt;/li&gt;
&lt;li&gt;Multiple transformation stages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the platform could handle these reliably, it could handle anything we built on top of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Changed After the Shift
&lt;/h2&gt;

&lt;p&gt;Moving to a platform-first architecture changed how the system behaved:&lt;/p&gt;

&lt;h3&gt;
  
  
  Before
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Each feature = new pipeline&lt;/li&gt;
&lt;li&gt;Debugging = distributed guesswork&lt;/li&gt;
&lt;li&gt;API behavior = inconsistent&lt;/li&gt;
&lt;li&gt;Observability = partial&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  After
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Features plug into existing pipelines&lt;/li&gt;
&lt;li&gt;Debugging = traceable execution graph&lt;/li&gt;
&lt;li&gt;API behavior = predictable contracts&lt;/li&gt;
&lt;li&gt;Observability = end-to-end visibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The biggest win wasn’t performance.&lt;/p&gt;

&lt;p&gt;It was predictability.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Lessons
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Feature velocity without platform thinking creates hidden fragility
&lt;/h3&gt;

&lt;p&gt;You don’t see the cost immediately—but it compounds fast.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Earth Observation pipelines are excellent architecture stress tests
&lt;/h3&gt;

&lt;p&gt;They force you to confront real-world distributed system problems early.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Standardization beats optimization at early scaling stages
&lt;/h3&gt;

&lt;p&gt;Before optimizing performance, unify structure.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Observability is not optional infrastructure
&lt;/h3&gt;

&lt;p&gt;If you can’t trace a request end-to-end, you don’t have a production system—you have a collection of services.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. Platform engineering is a mindset shift, not a rewrite
&lt;/h3&gt;

&lt;p&gt;Most of the improvements came from structure, not new technology.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing Thought
&lt;/h2&gt;

&lt;p&gt;The transition from feature delivery to platform engineering is not about scale alone.&lt;/p&gt;

&lt;p&gt;It’s about control.&lt;/p&gt;

&lt;p&gt;Control over how systems evolve, how they fail, and how quickly they recover.&lt;/p&gt;

&lt;p&gt;Once that layer exists, feature development becomes what it should have been from the beginning:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Safe, composable, and predictable.&lt;/p&gt;
&lt;/blockquote&gt;




</description>
      <category>devops</category>
      <category>django</category>
      <category>distributedsystems</category>
      <category>python</category>
    </item>
    <item>
      <title>From Feature Delivery to Platform Engineering: Scaling Earth Observation Pipelines.</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sun, 21 Jun 2026 06:08:49 +0000</pubDate>
      <link>https://dev.to/rahim8050/from-feature-delivery-to-platform-engineering-scaling-earth-observation-pipelines-3b1h</link>
      <guid>https://dev.to/rahim8050/from-feature-delivery-to-platform-engineering-scaling-earth-observation-pipelines-3b1h</guid>
      <description>&lt;h2&gt;
  
  
  From Feature Delivery to Platform Engineering
&lt;/h2&gt;

&lt;p&gt;Most engineering articles focus on building a new feature.&lt;/p&gt;

&lt;p&gt;The reality of production systems is different.&lt;/p&gt;

&lt;p&gt;Adding a feature is often the easiest part.&lt;/p&gt;

&lt;p&gt;The difficult part is preserving compatibility across asynchronous workloads, external integrations, observability pipelines, CI gates, OpenAPI contracts, and years of accumulated assumptions.&lt;/p&gt;

&lt;p&gt;This week, I wasn't simply implementing NDWI.&lt;/p&gt;

&lt;p&gt;I was evolving a farm intelligence platform that combines Earth Observation, distributed task execution, observability, and a Nextcloud-based user experience.&lt;/p&gt;

&lt;p&gt;The goal sounded straightforward:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Bring NDWI (Normalized Difference Water Index) to feature parity with NDVI.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The actual work touched nearly every layer of the stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Feature Duplication Becomes Technical Debt
&lt;/h2&gt;

&lt;p&gt;Our existing NDVI implementation already supported:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple processing backends,&lt;/li&gt;
&lt;li&gt;Celery orchestration,&lt;/li&gt;
&lt;li&gt;Prometheus metrics,&lt;/li&gt;
&lt;li&gt;OpenAPI exposure,&lt;/li&gt;
&lt;li&gt;Nextcloud integration,&lt;/li&gt;
&lt;li&gt;Dashboarding,&lt;/li&gt;
&lt;li&gt;Automated tests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The temptation was obvious:&lt;/p&gt;

&lt;p&gt;Copy the NDVI implementation.&lt;/p&gt;

&lt;p&gt;Rename everything.&lt;/p&gt;

&lt;p&gt;Ship.&lt;/p&gt;

&lt;p&gt;That approach works exactly once.&lt;/p&gt;

&lt;p&gt;Every duplicated branch becomes future maintenance debt.&lt;/p&gt;

&lt;p&gt;Every additional spectral index adds another full copy of the operational surface area.&lt;/p&gt;

&lt;p&gt;I wanted NDWI to become the second index without making the third index exponentially harder.&lt;/p&gt;




&lt;h2&gt;
  
  
  Designing for the Next Spectral Index
&lt;/h2&gt;

&lt;p&gt;The first step was eliminating branching logic.&lt;/p&gt;

&lt;p&gt;The original engine dispatch evolved toward multiple conditional paths:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NDVI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NDWI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That pattern does not scale.&lt;/p&gt;

&lt;p&gt;Instead, dispatch moved to factory lookup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;factory_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NDVI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ndwi_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;factory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ENGINE_FACTORIES&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;factory_key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five NDWI engine factories were introduced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ndwi_gee&lt;/li&gt;
&lt;li&gt;ndwi_sentinelhub&lt;/li&gt;
&lt;li&gt;ndwi_stac&lt;/li&gt;
&lt;li&gt;ndwi_landsat&lt;/li&gt;
&lt;li&gt;ndwi_modis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result wasn't just NDWI support.&lt;/p&gt;

&lt;p&gt;It transformed the platform into one capable of supporting future indices through convention rather than branching.&lt;/p&gt;

&lt;p&gt;Adding another index stopped being an architectural event.&lt;/p&gt;
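
&lt;p&gt;A runnable sketch of the naming convention, generalized so that any non-NDVI index maps to an &lt;code&gt;index_engine&lt;/code&gt; key. The generalization, the &lt;code&gt;make_factory&lt;/code&gt; stand-in, and the helper names are assumptions. Only &lt;code&gt;ENGINE_FACTORIES&lt;/code&gt;, the bare NDVI keys, and the &lt;code&gt;ndwi_&lt;/code&gt; prefix come from the snippets above.&lt;/p&gt;

```python
ENGINES = ("gee", "sentinelhub", "stac", "landsat", "modis")


def make_factory(index, engine):
    """Return a stand-in factory; a real one would build an engine object."""
    return lambda: f"{index} engine on {engine}"


# NDVI keeps its bare engine keys; other indices use an "index_engine" prefix.
ENGINE_FACTORIES = {engine: make_factory("NDVI", engine) for engine in ENGINES}
ENGINE_FACTORIES.update(
    {f"ndwi_{engine}": make_factory("NDWI", engine) for engine in ENGINES}
)


def factory_key(index, engine):
    """Build the lookup key by convention, for any index name."""
    return engine if index == "NDVI" else f"{index.lower()}_{engine}"


def get_factory(index, engine):
    """Look up a factory, failing with a readable error for unknown keys."""
    key = factory_key(index, engine)
    try:
        return ENGINE_FACTORIES[key]
    except KeyError:
        raise ValueError(f"no engine factory registered for {key!r}") from None


print(get_factory("NDWI", "stac")())
print(get_factory("NDVI", "modis")())
```

&lt;p&gt;Under this convention, a third index would only need its factories registered under matching keys; the dispatch code stays unchanged.&lt;/p&gt;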




&lt;h2&gt;
  
  
  Separate Tasks, Shared Internals
&lt;/h2&gt;

&lt;p&gt;A common anti-pattern in Celery systems is task duplication.&lt;/p&gt;

&lt;p&gt;Two almost-identical tasks drift apart over time.&lt;/p&gt;

&lt;p&gt;I wanted operational separation without implementation divergence.&lt;/p&gt;

&lt;p&gt;Instead of copying logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;run_ndwi_job&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;delegates into the existing NDVI execution pipeline.&lt;/p&gt;

&lt;p&gt;This produced an interesting balance.&lt;/p&gt;

&lt;p&gt;NDWI gained:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Independent retry policies,&lt;/li&gt;
&lt;li&gt;Dedicated queue routing,&lt;/li&gt;
&lt;li&gt;Separate monitoring visibility,&lt;/li&gt;
&lt;li&gt;Future scheduling flexibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without duplicating computation logic.&lt;/p&gt;

&lt;p&gt;Operational isolation.&lt;/p&gt;

&lt;p&gt;Implementation reuse.&lt;/p&gt;
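
&lt;p&gt;The shape of that split might look like the sketch below, written as plain functions so it runs without Celery. &lt;code&gt;_run_spectral_job&lt;/code&gt;, &lt;code&gt;TASK_OPTIONS&lt;/code&gt;, the queue names, and the retry counts are hypothetical. With Celery, the per-task settings would live in the task decorator options and the routing configuration.&lt;/p&gt;

```python
def _run_spectral_job(index, farm_id):
    """Shared implementation used by every index-specific task."""
    return {"index": index, "farm_id": farm_id, "status": "done"}


# Hypothetical per-task operational settings, kept apart from the computation.
TASK_OPTIONS = {
    "run_ndvi_job": {"queue": "ndvi", "max_retries": 3},
    "run_ndwi_job": {"queue": "ndwi", "max_retries": 5},
}


def run_ndvi_job(farm_id):
    return _run_spectral_job("NDVI", farm_id)


def run_ndwi_job(farm_id):
    # A thin wrapper: separate task name, same computation path.
    return _run_spectral_job("NDWI", farm_id)


print(run_ndwi_job(29), TASK_OPTIONS["run_ndwi_job"])
```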




&lt;h2&gt;
  
  
  Metrics: Fighting Observability Sprawl
&lt;/h2&gt;

&lt;p&gt;Observability debt accumulates quietly.&lt;/p&gt;

&lt;p&gt;Originally, NDWI introduced six additional Prometheus metrics.&lt;/p&gt;

&lt;p&gt;That meant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Duplicate Grafana panels,&lt;/li&gt;
&lt;li&gt;Duplicate alert rules,&lt;/li&gt;
&lt;li&gt;Duplicate recording rules.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of expanding metrics, we collapsed them.&lt;/p&gt;

&lt;p&gt;Before:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ndwi_requests_total
ndwi_duration_seconds
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spectral_requests_total{index="NDVI"}
spectral_requests_total{index="NDWI"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dashboard no longer cared which index generated the signal.&lt;/p&gt;

&lt;p&gt;The index became metadata.&lt;/p&gt;

&lt;p&gt;The monitoring surface remained stable.&lt;/p&gt;

&lt;p&gt;One of the most valuable lessons in observability is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Labels scale better than metric names.&lt;/p&gt;
&lt;/blockquote&gt;
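
&lt;p&gt;To make the idea concrete, here is a toy stand-in built on &lt;code&gt;collections.Counter&lt;/code&gt;. It is not a Prometheus client and &lt;code&gt;record_request&lt;/code&gt; is a hypothetical helper; it only illustrates that a new index becomes a new label value on an existing metric.&lt;/p&gt;

```python
from collections import Counter

# Toy stand-in for a metrics client: one metric name, with the index as a label.
# A real setup would use a Prometheus client library instead of a Counter.
spectral_requests_total = Counter()


def record_request(index):
    """Count a request under a single metric, labelled by index."""
    spectral_requests_total[("index", index)] += 1


for index in ("NDVI", "NDWI", "NDWI", "NDMI"):
    record_request(index)

# A new index adds a label value, not a new metric name.
for (label, value), count in sorted(spectral_requests_total.items()):
    print(f'spectral_requests_total{{{label}="{value}"}} {count}')
```

&lt;p&gt;Dashboards and alert rules written against the one metric name keep working when the next index appears.&lt;/p&gt;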




&lt;h2&gt;
  
  
  Testing Against Regression, Not Hope
&lt;/h2&gt;

&lt;p&gt;Feature tests prove something works.&lt;/p&gt;

&lt;p&gt;Regression tests prove you didn't destroy what already existed.&lt;/p&gt;

&lt;p&gt;A dedicated no-regression suite was introduced.&lt;/p&gt;

&lt;p&gt;It validated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Factory registrations,&lt;/li&gt;
&lt;li&gt;Query isolation,&lt;/li&gt;
&lt;li&gt;Route resolution,&lt;/li&gt;
&lt;li&gt;URL parity,&lt;/li&gt;
&lt;li&gt;Metrics importability,&lt;/li&gt;
&lt;li&gt;Representation consistency,&lt;/li&gt;
&lt;li&gt;Celery routing behavior.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nineteen tests across seven classes focused entirely on one question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Did this week's work accidentally break yesterday's guarantees?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Those tests became the contract protecting future contributors from invisible coupling.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Farm 29 Incident
&lt;/h2&gt;

&lt;p&gt;The most valuable discovery wasn't code.&lt;/p&gt;

&lt;p&gt;It was a 403 error.&lt;/p&gt;

&lt;p&gt;Farm 29 exposed a hidden assumption.&lt;/p&gt;

&lt;p&gt;Weather endpoints succeeded.&lt;/p&gt;

&lt;p&gt;NDVI failed.&lt;/p&gt;

&lt;p&gt;NDWI failed.&lt;/p&gt;

&lt;p&gt;Initially, it looked like an authentication defect.&lt;/p&gt;

&lt;p&gt;The investigation revealed something deeper.&lt;/p&gt;

&lt;p&gt;Integration JWTs enforced per-farm access:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FarmIntegrationAccess
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Weather bypassed this path.&lt;/p&gt;

&lt;p&gt;Spectral endpoints enforced it.&lt;/p&gt;

&lt;p&gt;The fix required zero Django changes.&lt;/p&gt;

&lt;p&gt;The integration simply lacked authorization.&lt;/p&gt;

&lt;p&gt;This incident reinforced an important operational principle:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Authentication proves identity.&lt;/p&gt;

&lt;p&gt;Authorization determines access.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Confusing the two leads to dangerous conclusions.&lt;/p&gt;

&lt;p&gt;Production systems teach humility.&lt;/p&gt;

&lt;p&gt;The bug is rarely where you first look.&lt;/p&gt;




&lt;h2&gt;
  
  
  When 102 Radio Stations Became a Concurrency Problem
&lt;/h2&gt;

&lt;p&gt;Not every challenge involved remote sensing.&lt;/p&gt;

&lt;p&gt;A radio subsystem health check had become pathological.&lt;/p&gt;

&lt;p&gt;Sequential probing meant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;37 stations processed,&lt;/li&gt;
&lt;li&gt;300-second execution time,&lt;/li&gt;
&lt;li&gt;consistent Celery timeouts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The solution wasn't another timeout tweak.&lt;/p&gt;

&lt;p&gt;It was concurrency.&lt;/p&gt;

&lt;p&gt;ThreadPoolExecutor replaced sequential execution.&lt;/p&gt;

&lt;p&gt;Redirect chasing disappeared.&lt;/p&gt;

&lt;p&gt;HTTP 3xx and 405 responses became acceptable health signals.&lt;/p&gt;

&lt;p&gt;Deployment changed the numbers:&lt;/p&gt;

&lt;p&gt;Before:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;37/102 stations,&lt;/li&gt;
&lt;li&gt;~300 seconds,&lt;/li&gt;
&lt;li&gt;frequent failures.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;102/102 stations,&lt;/li&gt;
&lt;li&gt;~21 seconds,&lt;/li&gt;
&lt;li&gt;stable execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sometimes resilience isn't sophisticated.&lt;/p&gt;

&lt;p&gt;Sometimes it's simply refusing to serialize independent work.&lt;/p&gt;
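
&lt;p&gt;A minimal sketch of the concurrent health check, assuming the probe is a callable that returns an HTTP status code. The fake probe, the example URLs, and &lt;code&gt;max_workers=16&lt;/code&gt; are illustrative assumptions; the real check would issue an HTTP request with a short timeout and without following redirects.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor


def is_healthy(status):
    """Treat 2xx, 3xx (redirects are not followed) and 405 as reachable."""
    return status in range(200, 400) or status == 405


def check_stations(stations, probe, max_workers=16):
    """Probe all stations concurrently and map each URL to a health flag.

    probe is any callable that takes a URL and returns an HTTP status code.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        statuses = list(pool.map(probe, stations))
    return {url: is_healthy(status) for url, status in zip(stations, statuses)}


# Fake probe standing in for a real HTTP request with a short timeout.
FAKE_STATUSES = {
    "http://a.example": 200,
    "http://b.example": 302,
    "http://c.example": 405,
    "http://d.example": 503,
}

print(check_stations(list(FAKE_STATUSES), FAKE_STATUSES.get))
```

&lt;p&gt;Note that &lt;code&gt;pool.map&lt;/code&gt; re-raises any exception from a probe, so a real probe should catch timeouts and connection errors and return a failing status instead.&lt;/p&gt;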




&lt;h2&gt;
  
  
  CI as an Operational Safety Net
&lt;/h2&gt;

&lt;p&gt;One subtle defect triggered a larger improvement.&lt;/p&gt;

&lt;p&gt;Three Celery beat task names drifted away from their actual registrations.&lt;/p&gt;

&lt;p&gt;Everything appeared healthy.&lt;/p&gt;

&lt;p&gt;Until scheduled execution failed.&lt;/p&gt;

&lt;p&gt;Instead of fixing the names and moving on, a CI guardrail emerged.&lt;/p&gt;

&lt;p&gt;A validation script now verifies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Beat schedules,&lt;/li&gt;
&lt;li&gt;Queue routes,&lt;/li&gt;
&lt;li&gt;Shared task registrations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The lesson was simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Every production incident deserves the question:&lt;/p&gt;

&lt;p&gt;"How do we ensure this category of failure never happens again?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Fixes remove symptoms.&lt;/p&gt;

&lt;p&gt;Guardrails remove classes of bugs.&lt;/p&gt;
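
&lt;p&gt;The core of such a guardrail can be sketched as a set comparison. The dictionaries below are stand-ins with hypothetical task names. With Celery, the two inputs would come from the app's &lt;code&gt;conf.beat_schedule&lt;/code&gt; and its &lt;code&gt;tasks&lt;/code&gt; registry.&lt;/p&gt;

```python
def find_unregistered(beat_schedule, registered_tasks):
    """Return schedule entries whose task name is not a registered task."""
    return sorted(
        entry for entry, config in beat_schedule.items()
        if config["task"] not in registered_tasks
    )


# Stand-ins for the real beat schedule and the task registry.
beat_schedule = {
    "nightly-ndvi": {"task": "spectral.tasks.run_ndvi_job"},
    "nightly-ndwi": {"task": "spectral.tasks.run_ndwi"},  # drifted name
}
registered_tasks = {"spectral.tasks.run_ndvi_job", "spectral.tasks.run_ndwi_job"}

problems = find_unregistered(beat_schedule, registered_tasks)
print(problems)
# In CI, a non-empty list would fail the build, for example via SystemExit(1).
```

&lt;p&gt;Running a check like this on every commit turns a silent scheduling failure into a failed build.&lt;/p&gt;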




&lt;h2&gt;
  
  
  OpenAPI as the Source of Truth
&lt;/h2&gt;

&lt;p&gt;Cross-repository systems drift.&lt;/p&gt;

&lt;p&gt;Documentation drifts faster.&lt;/p&gt;

&lt;p&gt;The Nextcloud application consumed Django APIs.&lt;/p&gt;

&lt;p&gt;Over time, operation identifiers diverged.&lt;/p&gt;

&lt;p&gt;The answer wasn't manual synchronization.&lt;/p&gt;

&lt;p&gt;The answer was declaring ownership.&lt;/p&gt;

&lt;p&gt;Django's schema became authoritative.&lt;/p&gt;

&lt;p&gt;The Nextcloud OpenAPI specification synchronized directly from it.&lt;/p&gt;

&lt;p&gt;Ninety-six operations were verified.&lt;/p&gt;

&lt;p&gt;Fifteen controllers aligned.&lt;/p&gt;

&lt;p&gt;The integration contract became explicit.&lt;/p&gt;

&lt;p&gt;Contracts reduce assumptions.&lt;/p&gt;

&lt;p&gt;Assumptions become outages.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Week Actually Produced
&lt;/h2&gt;

&lt;p&gt;On paper:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;93 Django files changed,&lt;/li&gt;
&lt;li&gt;~2,469 lines added,&lt;/li&gt;
&lt;li&gt;15 commits,&lt;/li&gt;
&lt;li&gt;96 operations verified,&lt;/li&gt;
&lt;li&gt;102 radio stations covered,&lt;/li&gt;
&lt;li&gt;29 Celery tasks validated.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the numbers tell only part of the story.&lt;/p&gt;

&lt;p&gt;The real outcome was different.&lt;/p&gt;

&lt;p&gt;The platform became:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easier to extend,&lt;/li&gt;
&lt;li&gt;Easier to observe,&lt;/li&gt;
&lt;li&gt;Harder to accidentally break,&lt;/li&gt;
&lt;li&gt;More explicit in its contracts,&lt;/li&gt;
&lt;li&gt;More resilient under operational stress.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the difference between feature development and platform engineering.&lt;/p&gt;

&lt;p&gt;The code shipped this week wasn't just NDWI.&lt;/p&gt;

&lt;p&gt;It was institutional knowledge encoded into software.&lt;/p&gt;

&lt;p&gt;And that compounds over time.&lt;/p&gt;

</description>
      <category>django</category>
      <category>devops</category>
      <category>distributedsystems</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Shipping 12,000+ Lines Across 6 Systems in 19 Days: A Masterclass in Backend Architecture.</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sat, 13 Jun 2026 08:09:16 +0000</pubDate>
      <link>https://dev.to/rahim8050/shipping-12000-lines-across-6-systems-in-19-days-a-masterclass-in-backend-architecture-53an</link>
      <guid>https://dev.to/rahim8050/shipping-12000-lines-across-6-systems-in-19-days-a-masterclass-in-backend-architecture-53an</guid>
      <description>&lt;p&gt;&lt;em&gt;What looked like a chaotic sprint was actually a strict exercise in architectural discipline.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The last time I published on Dev.to was in late May.&lt;/p&gt;

&lt;p&gt;At the time, I had just finished documenting how I separated media responsibilities between Django and Nextcloud. I expected the next few weeks to be incremental: fix a few bugs, close a few tickets, and improve observability.&lt;/p&gt;

&lt;p&gt;Instead, I disappeared into the codebase.&lt;/p&gt;

&lt;p&gt;Nineteen days later, I resurfaced, and Git had a story to tell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;111 files changed
~12,800 lines added
339 tests written
45 endpoints shipped
21 Celery tasks introduced
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On paper, it looked absurd.&lt;/p&gt;

&lt;p&gt;In reality, it taught me one of the most important backend engineering lessons of my career:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Velocity isn't about typing faster. It's about making architectural decisions that allow future work to compound instead of collide.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here's how I survived the sprint without breaking production.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Scaling Trap: When Features Become Surgery
&lt;/h2&gt;

&lt;p&gt;Most software slows down as it grows.&lt;/p&gt;

&lt;p&gt;Every new requirement forces developers to revisit existing code paths. A seemingly small feature request quickly expands into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Updating database models,&lt;/li&gt;
&lt;li&gt;Modifying serializers,&lt;/li&gt;
&lt;li&gt;Adjusting views and business logic,&lt;/li&gt;
&lt;li&gt;Fixing broken tests,&lt;/li&gt;
&lt;li&gt;Introducing unexpected regressions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Eventually, every feature feels like open-heart surgery.&lt;/p&gt;

&lt;p&gt;I wanted the opposite.&lt;/p&gt;

&lt;p&gt;I wanted a platform where adding functionality felt like plugging another module into a well-designed machine.&lt;/p&gt;

&lt;p&gt;Over nineteen days, that philosophy was tested repeatedly.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Build Confidence Before Features
&lt;/h2&gt;

&lt;p&gt;The first thing I shipped wasn't visible to users.&lt;/p&gt;

&lt;p&gt;It was infrastructure.&lt;/p&gt;

&lt;p&gt;Before introducing new systems, I:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Migrated the project to Python 3.12,&lt;/li&gt;
&lt;li&gt;Replaced traditional dependency management with &lt;code&gt;uv&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;Containerized the CI pipeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every pull request now runs the following inside reproducible Docker environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ruff,&lt;/li&gt;
&lt;li&gt;MyPy,&lt;/li&gt;
&lt;li&gt;Bandit,&lt;/li&gt;
&lt;li&gt;Pytest.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The payoff wasn't glamorous.&lt;/p&gt;

&lt;p&gt;Nobody celebrates faster dependency installation.&lt;/p&gt;

&lt;p&gt;But confidence compounds.&lt;/p&gt;

&lt;p&gt;When your tests are trustworthy and your environments are deterministic, you move differently. You stop negotiating with the fear of breaking things.&lt;/p&gt;

&lt;p&gt;You ship.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Scheduling Is a Distributed Systems Problem
&lt;/h2&gt;

&lt;p&gt;One of the major features I introduced was farm activity scheduling.&lt;/p&gt;

&lt;p&gt;At first glance, it sounded trivial:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Let users schedule irrigation."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then production reality arrived.&lt;/p&gt;

&lt;p&gt;Questions started appearing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What happens when schedules recur?&lt;/li&gt;
&lt;li&gt;How do you prevent duplicate executions?&lt;/li&gt;
&lt;li&gt;How do you acknowledge completed tasks?&lt;/li&gt;
&lt;li&gt;How do retries behave after failures?&lt;/li&gt;
&lt;li&gt;What happens if the scheduler crashes and restarts?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A simple checkbox had quietly evolved into a distributed systems problem.&lt;/p&gt;

&lt;p&gt;The final implementation relied on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cron-based recurrence,&lt;/li&gt;
&lt;li&gt;Celery orchestration backed by Redis,&lt;/li&gt;
&lt;li&gt;WebSocket notifications,&lt;/li&gt;
&lt;li&gt;Strict acknowledgement workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The surprising part?&lt;/p&gt;

&lt;p&gt;Users only see a push notification.&lt;/p&gt;

&lt;p&gt;Good engineering hides complexity.&lt;/p&gt;

&lt;p&gt;It doesn't showcase it.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Resilience Beats Perfection
&lt;/h2&gt;

&lt;p&gt;I also introduced text-to-speech alerts using multiple synthesis engines.&lt;/p&gt;

&lt;p&gt;Initially, I made a common assumption:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If the preferred neural engine fails, the alert fails.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then I asked a better question.&lt;/p&gt;

&lt;p&gt;What matters more?&lt;/p&gt;

&lt;p&gt;Perfect audio quality?&lt;/p&gt;

&lt;p&gt;Or ensuring critical alerts reach users?&lt;/p&gt;

&lt;p&gt;That changed everything.&lt;/p&gt;

&lt;p&gt;Instead of relying on a single engine, I implemented each engine as an interchangeable strategy.&lt;/p&gt;

&lt;p&gt;Then I wrapped those strategies in circuit breakers.&lt;/p&gt;

&lt;p&gt;If the primary engine crashes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The circuit breaker trips,&lt;/li&gt;
&lt;li&gt;The fallback engine takes over,&lt;/li&gt;
&lt;li&gt;Users still receive alerts.&lt;/li&gt;
&lt;/ul&gt;
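
&lt;p&gt;The fallback flow above can be sketched as follows. This is a simplified illustration under stated assumptions: &lt;code&gt;CircuitBreaker&lt;/code&gt;, &lt;code&gt;synthesize&lt;/code&gt;, and both engines are hypothetical names, and a production breaker would also need a recovery timeout (a half-open state), which is omitted here.&lt;/p&gt;

```python
from typing import Callable, Sequence

Engine = Callable[[str], bytes]


class CircuitBreaker:
    """Stops calling an engine after too many consecutive failures."""

    def __init__(self, max_failures: int = 3) -> None:
        self.max_failures = max_failures
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.max_failures

    def call(self, engine: Engine, text: str) -> bytes:
        try:
            audio = engine(text)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # a success resets the failure count
        return audio


def synthesize(text: str, engines: Sequence[tuple[CircuitBreaker, Engine]]) -> bytes:
    """Try engines in priority order, skipping any whose breaker is open."""
    for breaker, engine in engines:
        if breaker.is_open:
            continue
        try:
            return breaker.call(engine, text)
        except Exception:
            continue  # degrade to the next engine
    raise RuntimeError("no TTS engine available")


def neural_engine(text: str) -> bytes:
    raise ConnectionError("neural engine is down")  # simulated outage


def basic_engine(text: str) -> bytes:
    return ("basic:" + text).encode("utf-8")


neural_breaker = CircuitBreaker(max_failures=2)
engines = [(neural_breaker, neural_engine), (CircuitBreaker(), basic_engine)]
results = [synthesize("Frost warning", engines) for _ in range(3)]
print(results[-1], neural_breaker.is_open)
```

&lt;p&gt;In this sketch the first two calls still try the neural engine and fall through to the fallback. Once the breaker is open, the third call skips the failing engine entirely.&lt;/p&gt;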

&lt;p&gt;The experience degrades gracefully.&lt;/p&gt;

&lt;p&gt;The system survives.&lt;/p&gt;

&lt;p&gt;That single decision eliminated an entire class of outages.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Production Engineering Means Expecting Broken Systems
&lt;/h2&gt;

&lt;p&gt;The hardest problem of the sprint wasn't satellite imagery.&lt;/p&gt;

&lt;p&gt;It wasn't Celery.&lt;/p&gt;

&lt;p&gt;It wasn't WebSockets.&lt;/p&gt;

&lt;p&gt;It was internet radio metadata.&lt;/p&gt;

&lt;p&gt;Specifically, ICY metadata.&lt;/p&gt;

&lt;p&gt;The specification is decades old, and stations interpret it creatively.&lt;/p&gt;

&lt;p&gt;Some use UTF-8.&lt;/p&gt;

&lt;p&gt;Others use Latin-1.&lt;/p&gt;

&lt;p&gt;Some omit fields entirely.&lt;/p&gt;

&lt;p&gt;Some violate their own metadata intervals.&lt;/p&gt;

&lt;p&gt;The parser itself was tiny.&lt;/p&gt;

&lt;p&gt;The resilience around it became enormous.&lt;/p&gt;
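
&lt;p&gt;As a rough illustration of that resilience, a tolerant decoder might look like the sketch below. It assumes the common &lt;code&gt;StreamTitle='...';&lt;/code&gt; field layout and NUL-padded metadata blocks. &lt;code&gt;decode_icy&lt;/code&gt; and &lt;code&gt;parse_stream_title&lt;/code&gt; are hypothetical helpers, not the parser described here.&lt;/p&gt;

```python
import re
from typing import Optional

STREAM_TITLE = re.compile(r"StreamTitle='(.*?)';")


def decode_icy(raw: bytes) -> str:
    """Decode an ICY metadata block, tolerating non-UTF-8 stations."""
    payload = raw.rstrip(b"\x00")  # strip the NUL padding at the end of the block
    try:
        return payload.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail.
        return payload.decode("latin-1")


def parse_stream_title(raw: bytes) -> Optional[str]:
    """Return the StreamTitle value, or None when it is missing or empty."""
    match = STREAM_TITLE.search(decode_icy(raw))
    if match is None:
        return None
    return match.group(1).strip() or None


title = "StreamTitle='Caf\u00e9 del Mar - Intro';"
print(parse_stream_title(title.encode("utf-8") + b"\x00" * 3))
print(parse_stream_title(title.encode("latin-1")))
print(parse_stream_title(b"StreamUrl='';"))
```

&lt;p&gt;Trying UTF-8 first and falling back to Latin-1 is a heuristic. It handles the two encodings mentioned above, but it cannot detect a station that sends some other legacy encoding.&lt;/p&gt;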

&lt;p&gt;It reinforced a lesson I won't forget:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Production backend engineering isn't writing code for systems behaving correctly.&lt;/p&gt;

&lt;p&gt;It's writing code for systems behaving incorrectly.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  5. The Abstraction Decision That Saved Weeks
&lt;/h2&gt;

&lt;p&gt;Toward the end of the sprint, I needed to implement NDWI (Normalized Difference Water Index).&lt;/p&gt;

&lt;p&gt;I already had a mature NDVI pipeline.&lt;/p&gt;

&lt;p&gt;I had two options.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option One: Duplicate Everything
&lt;/h3&gt;

&lt;p&gt;Create:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New models,&lt;/li&gt;
&lt;li&gt;New services,&lt;/li&gt;
&lt;li&gt;New Celery tasks,&lt;/li&gt;
&lt;li&gt;New metrics,&lt;/li&gt;
&lt;li&gt;New providers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It would work.&lt;/p&gt;

&lt;p&gt;It would also create long-term maintenance debt.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option Two: Generalize Selectively
&lt;/h3&gt;

&lt;p&gt;Reuse what already worked.&lt;/p&gt;

&lt;p&gt;Separate only what genuinely differed.&lt;/p&gt;

&lt;p&gt;I chose the second approach.&lt;/p&gt;

&lt;p&gt;The result was a hybrid architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Shared Infrastructure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;STAC clients,&lt;/li&gt;
&lt;li&gt;Service layers,&lt;/li&gt;
&lt;li&gt;Celery workflows,&lt;/li&gt;
&lt;li&gt;Database infrastructure,&lt;/li&gt;
&lt;li&gt;Metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Specialized Logic
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Quality thresholds,&lt;/li&gt;
&lt;li&gt;Fusion rules,&lt;/li&gt;
&lt;li&gt;Farm-state classification,&lt;/li&gt;
&lt;li&gt;Visual representations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I had budgeted nearly a month for the work.&lt;/p&gt;

&lt;p&gt;It shipped in less than four days.&lt;/p&gt;

&lt;p&gt;That experience fundamentally changed how I think about abstraction.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Bad abstractions slow teams down.&lt;/p&gt;

&lt;p&gt;Good abstractions create leverage.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  6. Observability Is Architecture
&lt;/h2&gt;

&lt;p&gt;One unexpected lesson involved monitoring.&lt;/p&gt;

&lt;p&gt;Initially, I considered separate Prometheus metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ndvi_observations_total
ndwi_observations_total
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I stopped.&lt;/p&gt;

&lt;p&gt;Why duplicate the concept?&lt;/p&gt;

&lt;p&gt;Instead, I moved to labels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spectral_index_observations_total{
    index_type="NDVI"
}

spectral_index_observations_total{
    index_type="NDWI"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The immediate benefit was cleaner Grafana dashboards.&lt;/p&gt;

&lt;p&gt;The long-term benefit was strategic.&lt;/p&gt;

&lt;p&gt;When future indices arrive—NDMI, EVI, SAVI—the infrastructure remains untouched.&lt;/p&gt;

&lt;p&gt;Only the labels evolve.&lt;/p&gt;

&lt;p&gt;Observability stopped being monitoring.&lt;/p&gt;

&lt;p&gt;It became architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Output Was Optionality
&lt;/h2&gt;

&lt;p&gt;Yes, the feature list was substantial.&lt;/p&gt;

&lt;p&gt;During those nineteen days, I shipped:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Podcast ingestion,&lt;/li&gt;
&lt;li&gt;TTS alerting,&lt;/li&gt;
&lt;li&gt;Activity scheduling,&lt;/li&gt;
&lt;li&gt;Request tracing,&lt;/li&gt;
&lt;li&gt;NDVI V2,&lt;/li&gt;
&lt;li&gt;Multi-provider STAC integrations,&lt;/li&gt;
&lt;li&gt;A complete NDWI pipeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the real output wasn't features.&lt;/p&gt;

&lt;p&gt;It was optionality.&lt;/p&gt;

&lt;p&gt;The platform is significantly easier to extend today than it was before this sprint began.&lt;/p&gt;

&lt;p&gt;That's the metric I care about most.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Lesson
&lt;/h2&gt;

&lt;p&gt;People often ask how engineers ship quickly.&lt;/p&gt;

&lt;p&gt;The answer isn't raw talent.&lt;/p&gt;

&lt;p&gt;It isn't caffeine.&lt;/p&gt;

&lt;p&gt;It isn't eighty-hour work weeks.&lt;/p&gt;

&lt;p&gt;It's this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Make decisions today that reduce the cost of tomorrow's decisions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Protocols instead of conditionals.&lt;/p&gt;

&lt;p&gt;Labels instead of duplication.&lt;/p&gt;

&lt;p&gt;Circuit breakers instead of assumptions.&lt;/p&gt;

&lt;p&gt;Strict service boundaries instead of monolithic entanglement.&lt;/p&gt;

&lt;p&gt;Every one of those choices feels slower in the moment.&lt;/p&gt;

&lt;p&gt;Until one day, you look up and realize you've delivered six major systems in nineteen days without rewriting half your codebase.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;The architecture has proven it can support multiple spectral indices without collapsing under duplication.&lt;/p&gt;

&lt;p&gt;The obvious next candidates are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NDMI for vegetation moisture,&lt;/li&gt;
&lt;li&gt;EVI for dense canopy analysis,&lt;/li&gt;
&lt;li&gt;More sophisticated agronomic decision engines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the interesting question is no longer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can the platform support them?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The interesting question has become:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What happens when agronomic intelligence becomes just another interchangeable engine?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And honestly?&lt;/p&gt;

&lt;p&gt;That's the problem I'm most excited to solve next.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The biggest takeaway from this sprint wasn't that I shipped 12,000 lines of code.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;It was realizing that good architecture doesn't slow you down.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;It gives you the confidence to move faster than you thought possible.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>django</category>
      <category>devops</category>
      <category>systemdesign</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Decoupled Media Streams: A Django and Nextcloud Radio Architecture</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Mon, 25 May 2026 12:41:31 +0000</pubDate>
      <link>https://dev.to/rahim8050/decoupled-media-streams-a-django-and-nextcloud-radio-architecture-4p2b</link>
      <guid>https://dev.to/rahim8050/decoupled-media-streams-a-django-and-nextcloud-radio-architecture-4p2b</guid>
      <description>&lt;p&gt;I recently added a radio integration to a platform built around Django REST Framework (DRF) and Nextcloud.&lt;/p&gt;

&lt;p&gt;The existing architecture was already doing a lot of heavy lifting, powering authentication, farm management, NDVI processing pipelines, weather data ingestion, API key orchestration, and Nextcloud application integrations.&lt;/p&gt;

&lt;p&gt;The new requirement was to introduce internet radio support seamlessly inside the Nextcloud ecosystem. However, there was a strict architectural constraint: we needed to do this without turning Django into a media relay.&lt;/p&gt;

&lt;p&gt;That single distinction shaped the entire implementation strategy.&lt;/p&gt;

&lt;p&gt;The Core Challenge: Avoiding the Proxy Trap&lt;br&gt;
Instead of proxying heavy audio streams through the backend, the architecture relies on direct playback. Django is strictly responsible for exposing radio metadata and playback endpoints (routed under /api/v1/radio/). Meanwhile, the Nextcloud clients stream the audio directly from the source providers, such as BBC, SomaFM, and TuneIn.&lt;/p&gt;

&lt;p&gt;The result is a much cleaner separation of responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nextcloud UI / Web Client: The presentation layer.&lt;/li&gt;
&lt;li&gt;Django + DRF API: Radio metadata and stream information logic.&lt;/li&gt;
&lt;li&gt;Radio Providers: Direct media transport to the client.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Architectural Separation in Action&lt;br&gt;
The following diagram illustrates how we achieved this decoupling. The critical path is the thick dark orange arrow (3), which shows the media stream bypassing the Django API server entirely.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fga9nmcoylgbq7e84kuwf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fga9nmcoylgbq7e84kuwf.png" alt="Architecture diagram" width="799" height="436"&gt;&lt;/a&gt;&lt;br&gt;
(Diagram: Metadata requests [blue] are routed through Django, while heavy media streams [orange] flow directly from providers like BBC/SomaFM to the Nextcloud user.)&lt;/p&gt;

&lt;p&gt;This separation keeps the backend highly performant and lightweight, while allowing the Nextcloud frontend to integrate radio discovery naturally alongside the rest of the platform's services.&lt;/p&gt;

&lt;p&gt;How Nextcloud Fits Into the Architecture&lt;br&gt;
The radio integration was explicitly designed to plug into a broader, Nextcloud-driven ecosystem rather than operating as an isolated, standalone media application. By defining strict boundaries, each system handles what it does best.&lt;/p&gt;

&lt;p&gt;Nextcloud provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The frontend user experience&lt;/li&gt;
&lt;li&gt;Authenticated user workflows&lt;/li&gt;
&lt;li&gt;App integration surfaces and dashboard presentation&lt;/li&gt;
&lt;li&gt;Native media interaction capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Django provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API orchestration and provider abstraction&lt;/li&gt;
&lt;li&gt;Station metadata and stream discovery&lt;/li&gt;
&lt;li&gt;Data normalization logic&lt;/li&gt;
&lt;li&gt;Backend consistency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This clear separation creates a strong boundary between backend platform orchestration and frontend client experience. Instead of embedding complex streaming logic directly into Nextcloud—or forcing Django to waste resources proxying media—the architecture keeps each layer focused entirely on its primary responsibility.&lt;/p&gt;

&lt;p&gt;Built for Future Expansion&lt;br&gt;
Because the backend already behaves like a pure metadata platform rather than a streaming server, the architecture leaves massive room for future expansion.&lt;/p&gt;

&lt;p&gt;Without needing to redesign the streaming layer itself, this setup easily supports adding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Personalized stations and user favorites&lt;/li&gt;
&lt;li&gt;Listening history tracking&lt;/li&gt;
&lt;li&gt;Podcast aggregation&lt;/li&gt;
&lt;li&gt;Recommendation systems&lt;/li&gt;
&lt;li&gt;Analytics pipelines&lt;/li&gt;
&lt;li&gt;Multi-provider federation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By treating media transport and metadata orchestration as two distinct problems, the integration remains scalable, fast, and ready for whatever features the platform requires next.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>backend</category>
      <category>api</category>
      <category>devops</category>
    </item>
    <item>
      <title>Debugging a Cross-Language HMAC Signature Failure Between Nextcloud and Django</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sat, 16 May 2026 14:47:34 +0000</pubDate>
      <link>https://dev.to/rahim8050/debugging-a-cross-language-hmac-signature-failure-between-nextcloud-and-django-3bfa</link>
      <guid>https://dev.to/rahim8050/debugging-a-cross-language-hmac-signature-failure-between-nextcloud-and-django-3bfa</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;A few days ago, I hit a frustrating issue while integrating a custom Nextcloud application with a Django REST Framework backend.&lt;/p&gt;

&lt;p&gt;Everything looked correct:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;shared HMAC secret ✔️&lt;/li&gt;
&lt;li&gt;canonical request string ✔️&lt;/li&gt;
&lt;li&gt;HMAC-SHA256 ✔️&lt;/li&gt;
&lt;li&gt;timestamps synchronized ✔️&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Yet every authenticated request failed with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;invalid nextcloud signature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The interesting part?&lt;/p&gt;

&lt;p&gt;Both implementations were technically correct.&lt;/p&gt;

&lt;p&gt;The failure came from something much smaller — and much more dangerous in distributed systems:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Different string encodings of the exact same HMAC digest.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This article walks through the full debugging process, the root cause, and the engineering lessons learned from debugging cryptographic interoperability between PHP and Python services.&lt;/p&gt;




&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;

&lt;p&gt;The integration architecture looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────┐
│  Nextcloud App (PHP) │
│  Generates HMAC      │
└──────────┬───────────┘
           │
           │ Signed HTTP Request
           ▼
┌──────────────────────┐
│ Django DRF Backend   │
│ Verifies Signature   │
└──────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The request flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Nextcloud generates a canonical request string&lt;/li&gt;
&lt;li&gt;PHP computes an HMAC-SHA256 signature&lt;/li&gt;
&lt;li&gt;Signature is attached to request headers&lt;/li&gt;
&lt;li&gt;Django reconstructs the canonical string&lt;/li&gt;
&lt;li&gt;Django recomputes the HMAC&lt;/li&gt;
&lt;li&gt;Signatures are compared&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Simple in theory.&lt;/p&gt;

&lt;p&gt;Except it kept failing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Initial Symptoms
&lt;/h2&gt;

&lt;p&gt;The backend logs showed repeated authorization failures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nextcloud_hmac.denied
code=invalid_signature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even more confusing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the integration had worked before&lt;/li&gt;
&lt;li&gt;secrets matched&lt;/li&gt;
&lt;li&gt;clocks matched&lt;/li&gt;
&lt;li&gt;payloads matched&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At first glance, it looked like a replay issue, timestamp skew problem, or cache corruption.&lt;/p&gt;

&lt;p&gt;It turned out to be none of those.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Root Cause
&lt;/h2&gt;

&lt;p&gt;The issue came from a mismatch in how the HMAC digest was encoded.&lt;/p&gt;

&lt;h2&gt;
  
  
  Nextcloud (PHP)
&lt;/h2&gt;

&lt;p&gt;The PHP client generated the signature like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight php"&gt;&lt;code&gt;&lt;span class="nb"&gt;base64_encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;hash_hmac&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'sha256'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;$canonical&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;$secret&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the important detail:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight php"&gt;&lt;code&gt;&lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That fourth argument tells &lt;code&gt;hash_hmac&lt;/code&gt; to return the raw digest bytes instead of a hex string.&lt;/p&gt;

&lt;p&gt;Those bytes were then encoded as Base64.&lt;/p&gt;




&lt;h2&gt;
  
  
  Django (Python)
&lt;/h2&gt;

&lt;p&gt;Meanwhile, Django verified signatures like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;hmac&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;canonical&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;hexdigest()&lt;/code&gt; returns a hexadecimal string representation.&lt;/p&gt;

&lt;p&gt;So both systems produced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the same HMAC bytes&lt;/li&gt;
&lt;li&gt;using the same algorithm&lt;/li&gt;
&lt;li&gt;using the same secret&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But converted those bytes into different string formats.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hidden Interoperability Bug
&lt;/h2&gt;

&lt;p&gt;This was the breakthrough moment.&lt;/p&gt;

&lt;p&gt;The exact same digest bytes produced:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hex:
44c39c4ecc7268547ca51db72c6f27125251e6ea8ce3c659d918a9542522b612
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;vs&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Base64:
RMOcTsxyaFR8pR23LG8nElJR5uqM48ZZ2RipVCUithI=
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both values represent the same underlying bytes.&lt;/p&gt;

&lt;p&gt;But string comparison obviously fails.&lt;/p&gt;
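
&lt;p&gt;The equivalence is easy to check in Python. The short sketch below takes the hex value shown above, converts it back to bytes, and re-encodes those bytes as Base64. The variable names are illustrative only.&lt;/p&gt;

```python
import base64

# The digest from the example above, in its hexadecimal form.
hex_signature = "44c39c4ecc7268547ca51db72c6f27125251e6ea8ce3c659d918a9542522b612"
digest = bytes.fromhex(hex_signature)

# The same 32 bytes, encoded the way the PHP client encodes them.
b64_signature = base64.b64encode(digest).decode("ascii")

print(len(digest))                     # 32
print(b64_signature)                   # RMOcTsxyaFR8pR23LG8nElJR5uqM48ZZ2RipVCUithI=
print(hex_signature == b64_signature)  # False
```

&lt;p&gt;Comparing decoded bytes rather than encoded strings would also have worked, but agreeing on a single wire encoding keeps the protocol simpler to reason about.&lt;/p&gt;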




&lt;h2&gt;
  
  
  The Second Bug
&lt;/h2&gt;

&lt;p&gt;While investigating, I found another subtle issue.&lt;/p&gt;

&lt;p&gt;The Django verifier lowercased the incoming signature before comparison:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;signature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;signature&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That may appear harmless for hexadecimal values.&lt;/p&gt;

&lt;p&gt;But Base64 is case-sensitive.&lt;/p&gt;

&lt;p&gt;Meaning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ABC != abc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So even after fixing the encoding mismatch, lowercasing would still break verification.&lt;/p&gt;

&lt;p&gt;This was a protocol normalization bug hiding inside the verification pipeline.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;I updated Django to verify signatures using Base64 instead of hexadecimal.&lt;/p&gt;

&lt;h2&gt;
  
  
  New Verification Function
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hmac&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_hmac_signature_b64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;canonical_string&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Compute Base64 encoded HMAC-SHA256 signature.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;digest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hmac&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;canonical_string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then all verification calls were updated to use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;compute_hmac_signature_b64&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;instead of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, I removed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;from the verification flow.&lt;/p&gt;
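
&lt;p&gt;One detail the snippets above leave implicit is the comparison step itself. A minimal sketch of a full check follows, assuming a hypothetical &lt;code&gt;verify_signature&lt;/code&gt; helper and a made-up canonical string format. It uses &lt;code&gt;hmac.compare_digest&lt;/code&gt; for a constant-time comparison and applies no case normalization.&lt;/p&gt;

```python
import base64
import hashlib
import hmac


def compute_hmac_signature_b64(*, secret: bytes, canonical_string: str) -> str:
    """Compute a Base64-encoded HMAC-SHA256 signature."""
    digest = hmac.new(secret, canonical_string.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


def verify_signature(*, secret: bytes, canonical_string: str, received: str) -> bool:
    """Hypothetical helper: compare signatures without changing their case."""
    expected = compute_hmac_signature_b64(secret=secret, canonical_string=canonical_string)
    # compare_digest takes time independent of where the two values differ.
    return hmac.compare_digest(expected.encode("ascii"), received.encode("utf-8"))


# Illustrative values only; the real canonical format is not shown in this article.
secret = b"demo-secret"
canonical = "GET\n/api/v1/integrations/nextcloud/ping/\n1778841776"

good = compute_hmac_signature_b64(secret=secret, canonical_string=canonical)
old_hex = hmac.new(secret, canonical.encode("utf-8"), hashlib.sha256).hexdigest()

print(verify_signature(secret=secret, canonical_string=canonical, received=good))
print(verify_signature(secret=secret, canonical_string=canonical, received=old_hex))
```

&lt;p&gt;The first call accepts the Base64 signature. The second rejects the old hexadecimal form, even though both strings encode the same digest.&lt;/p&gt;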




&lt;h2&gt;
  
  
  Verification Results
&lt;/h2&gt;

&lt;p&gt;After deploying the fix:&lt;/p&gt;

&lt;h2&gt;
  
  
  Ping Endpoint
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET /api/v1/integrations/nextcloud/ping/

200 OK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Token Issuance
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /api/v1/integrations/token/

200 OK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Authentication immediately started working again.&lt;/p&gt;




&lt;h2&gt;
  
  
  Secondary Investigation Findings
&lt;/h2&gt;

&lt;p&gt;While debugging, I validated several other production concerns.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Time Drift
&lt;/h2&gt;

&lt;p&gt;I suspected clock skew initially.&lt;/p&gt;

&lt;p&gt;Both services were checked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nextcloud epoch: 1778841776
Django epoch:    1778841776
Drift:            0 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The clocks agreed to the second, which ruled out clock skew.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Shared Secrets
&lt;/h2&gt;

&lt;p&gt;Client IDs and secrets matched correctly across both systems.&lt;/p&gt;

&lt;p&gt;This eliminated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;environment mismatch&lt;/li&gt;
&lt;li&gt;stale secrets&lt;/li&gt;
&lt;li&gt;config drift&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. Redis and Cache State
&lt;/h2&gt;

&lt;p&gt;I flushed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redis&lt;/li&gt;
&lt;li&gt;Django cache&lt;/li&gt;
&lt;li&gt;integration token caches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This helped eliminate stale token artifacts and replay-state inconsistencies.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Infrastructure Validation
&lt;/h2&gt;

&lt;p&gt;I also verified:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;loopback networking&lt;/li&gt;
&lt;li&gt;gunicorn binding&lt;/li&gt;
&lt;li&gt;uvicorn workers&lt;/li&gt;
&lt;li&gt;allowlists&lt;/li&gt;
&lt;li&gt;HTTP dev mode configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this point the investigation became less about cryptography and more about systematic elimination of variables.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why It “Worked Before”
&lt;/h2&gt;

&lt;p&gt;This was the most interesting systems question.&lt;/p&gt;

&lt;p&gt;I had not changed the signing logic recently.&lt;/p&gt;

&lt;p&gt;So why did the failure suddenly appear?&lt;/p&gt;

&lt;p&gt;The likely answer is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Infrastructure state had been masking a latent protocol incompatibility.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Possible contributors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cached tokens&lt;/li&gt;
&lt;li&gt;stale replay windows&lt;/li&gt;
&lt;li&gt;inactive code paths&lt;/li&gt;
&lt;li&gt;existing sessions bypassing verification&lt;/li&gt;
&lt;li&gt;Redis persistence behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is an important engineering lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A system can contain dormant interoperability bugs for weeks before infrastructure conditions expose them.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Engineering Lessons Learned
&lt;/h2&gt;

&lt;h2&gt;
  
  
  1. Cryptographic Bytes ≠ String Representation
&lt;/h2&gt;

&lt;p&gt;HMAC output is binary data.&lt;/p&gt;

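&lt;p&gt;Hexadecimal and Base64 are two different textual encodings of the same bytes.&lt;/p&gt;

&lt;p&gt;The encoded strings are not interchangeable: a hex signature never compares equal to a Base64 signature, even when both come from the same digest.&lt;/p&gt;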
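&lt;p&gt;A small illustration of the point, using the bytes of &lt;code&gt;hello&lt;/code&gt; as a stand-in for a real HMAC digest (the variable names are only for this example):&lt;/p&gt;

```python
import base64

raw = b"hello"  # stand-in for the raw HMAC digest bytes

as_hex = raw.hex()
as_b64 = base64.b64encode(raw).decode()

print(as_hex)  # 68656c6c6f
print(as_b64)  # aGVsbG8=

# Both strings decode back to the same bytes...
same_bytes = bytes.fromhex(as_hex) == base64.b64decode(as_b64)

# ...but the strings themselves never compare equal,
# and a lowercased Base64 string decodes to different bytes.
strings_match = as_hex == as_b64
lowercased_ok = base64.b64decode(as_b64.lower()) == raw

print(same_bytes, strings_match, lowercased_ok)  # True False False
```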




&lt;h2&gt;
  
  
  2. Cross-Language Integrations Need Explicit Contracts
&lt;/h2&gt;

&lt;p&gt;Never assume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;encoding format&lt;/li&gt;
&lt;li&gt;canonicalization rules&lt;/li&gt;
&lt;li&gt;normalization behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Define them explicitly.&lt;/p&gt;

&lt;p&gt;Especially across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PHP&lt;/li&gt;
&lt;li&gt;Python&lt;/li&gt;
&lt;li&gt;Go&lt;/li&gt;
&lt;li&gt;Node.js&lt;/li&gt;
&lt;li&gt;Java&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. Normalization Can Break Security
&lt;/h2&gt;

&lt;p&gt;Lowercasing signatures looked harmless.&lt;/p&gt;

&lt;p&gt;It was not. Base64 is case-sensitive, so lowercasing a Base64 signature changes the value it encodes.&lt;/p&gt;

&lt;p&gt;Cryptographic values should only be normalized if the protocol explicitly defines normalization behavior.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Infrastructure State Can Hide Bugs
&lt;/h2&gt;

&lt;p&gt;Cache layers and token persistence can temporarily conceal protocol inconsistencies.&lt;/p&gt;

&lt;p&gt;Sometimes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;restarts&lt;/li&gt;
&lt;li&gt;cache flushes&lt;/li&gt;
&lt;li&gt;clock resets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;suddenly expose issues that already existed.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Production Debugging Requires Elimination Discipline
&lt;/h2&gt;

&lt;p&gt;The investigation involved validating:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clocks&lt;/li&gt;
&lt;li&gt;secrets&lt;/li&gt;
&lt;li&gt;caches&lt;/li&gt;
&lt;li&gt;workers&lt;/li&gt;
&lt;li&gt;networking&lt;/li&gt;
&lt;li&gt;encoding&lt;/li&gt;
&lt;li&gt;replay protection&lt;/li&gt;
&lt;li&gt;request canonicalization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Good debugging is often less about guessing and more about systematically removing uncertainty.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The most dangerous bugs are not always algorithm failures.&lt;/p&gt;

&lt;p&gt;Sometimes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the crypto is correct&lt;/li&gt;
&lt;li&gt;the infrastructure is healthy&lt;/li&gt;
&lt;li&gt;the logic is valid&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…but the protocol contract between systems is inconsistent.&lt;/p&gt;

&lt;p&gt;In this case:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The cryptography was correct on both sides. The protocol contract was not.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And that single mismatch was enough to break the entire authentication flow.&lt;/p&gt;




</description>
      <category>devops</category>
      <category>security</category>
      <category>django</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Why I Added Redis Streams Between My Django API and Celery Workers.</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sun, 03 May 2026 07:01:10 +0000</pubDate>
      <link>https://dev.to/rahim8050/why-i-added-redis-streams-between-my-django-api-and-celery-workers-22bl</link>
      <guid>https://dev.to/rahim8050/why-i-added-redis-streams-between-my-django-api-and-celery-workers-22bl</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A practical engineering breakdown of how I introduced Redis Streams into a live Django + Celery NDVI pipeline without rewriting the worker layer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I run a Django API backed by Celery workers for NDVI processing workloads.&lt;/p&gt;

&lt;p&gt;The execution layer worked fine.&lt;/p&gt;

&lt;p&gt;The queue semantics didn’t.&lt;/p&gt;

&lt;p&gt;I needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;durable ingestion&lt;/li&gt;
&lt;li&gt;replay visibility&lt;/li&gt;
&lt;li&gt;dead-letter handling&lt;/li&gt;
&lt;li&gt;stale consumer recovery&lt;/li&gt;
&lt;li&gt;rollback safety&lt;/li&gt;
&lt;li&gt;observability during incidents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…but I did not want to rewrite the worker system or destabilize production.&lt;/p&gt;

&lt;p&gt;So instead of replacing Celery, I inserted Redis Streams between the API and the workers.&lt;/p&gt;

&lt;p&gt;This article explains why I made that decision, how the architecture works, and what I learned while implementing reliable stream-backed NDVI ingestion in Django.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Original Problem
&lt;/h2&gt;

&lt;p&gt;The problem was not task execution.&lt;/p&gt;

&lt;p&gt;The problem was everything before execution.&lt;/p&gt;

&lt;p&gt;Originally, NDVI ingestion looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Django API → Celery Broker → Celery Workers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At first, this worked well.&lt;/p&gt;

&lt;p&gt;But as the system evolved, operational gaps became more obvious:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Direct &lt;code&gt;.delay()&lt;/code&gt; calls tightly coupled request ingestion to broker behavior.&lt;/li&gt;
&lt;li&gt;Queue visibility was limited during incidents.&lt;/li&gt;
&lt;li&gt;Failed ingestion paths were harder to replay safely.&lt;/li&gt;
&lt;li&gt;In-flight recovery semantics were weak.&lt;/li&gt;
&lt;li&gt;There was no dead-letter workflow for poisoned messages.&lt;/li&gt;
&lt;li&gt;Worker interruptions could leave messages in uncertain states.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture was fast.&lt;/p&gt;

&lt;p&gt;It was not durable enough.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I Did Not Replace Celery
&lt;/h2&gt;

&lt;p&gt;One of the biggest architectural decisions was choosing not to replace Celery.&lt;/p&gt;

&lt;p&gt;That decision reduced risk dramatically.&lt;/p&gt;

&lt;p&gt;Celery already handled:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;worker orchestration&lt;/li&gt;
&lt;li&gt;task retries&lt;/li&gt;
&lt;li&gt;execution concurrency&lt;/li&gt;
&lt;li&gt;scheduling&lt;/li&gt;
&lt;li&gt;routing&lt;/li&gt;
&lt;li&gt;operational familiarity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Replacing the worker layer would have increased migration complexity and expanded the failure domain.&lt;/p&gt;

&lt;p&gt;Instead, I treated Redis Streams as an ingestion and reliability layer.&lt;/p&gt;

&lt;p&gt;The resulting architecture looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Django API
    ↓
Dispatch Boundary
    ↓
Redis Streams (XADD)
    ↓
Consumer Group (XREADGROUP)
    ↓
Celery Queue
    ↓
NDVI Workers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Failures route into a dead-letter stream.&lt;/p&gt;

&lt;p&gt;Stale consumers are recovered through reclaim logic.&lt;/p&gt;

&lt;p&gt;Most importantly, rollback remains simple.&lt;/p&gt;




&lt;h2&gt;
  
  
  Centralizing Dispatch Before Adding Redis Streams
&lt;/h2&gt;

&lt;p&gt;Before introducing Redis Streams, I centralized every NDVI enqueue path.&lt;/p&gt;

&lt;p&gt;This was the most important migration step.&lt;/p&gt;

&lt;p&gt;Instead of scattering direct &lt;code&gt;.delay()&lt;/code&gt; calls across the codebase, everything flowed through dispatch helpers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ndvi.dispatch&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dispatch_ndvi_job&lt;/span&gt;

&lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;enqueue_job&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;span class="nf"&gt;dispatch_ndvi_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That allowed one configuration flag to control the ingestion backend.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;NDVI_QUEUE_BACKEND&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;env&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NDVI_QUEUE_BACKEND&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;celery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Supported modes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;celery
stream
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This created a clean migration boundary.&lt;/p&gt;

&lt;p&gt;The system could switch between direct Celery dispatch and Redis Streams without changing every call site.&lt;/p&gt;

&lt;p&gt;Operationally, this mattered more than the stream code itself.&lt;/p&gt;
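&lt;p&gt;The dispatch boundary itself can stay very small. The sketch below is one possible shape, not the project's actual helper: it takes the backend name and the two sender callables as parameters (an assumption made so the example runs on its own), whereas the real &lt;code&gt;dispatch_ndvi_job&lt;/code&gt; shown earlier takes only the job.&lt;/p&gt;

```python
def dispatch_ndvi_job(job, backend, send_to_celery, publish_to_stream):
    """Route a job to exactly one ingestion backend."""
    if backend == "stream":
        return publish_to_stream(job)
    if backend == "celery":
        return send_to_celery(job)
    # Fail loudly on a typo instead of silently dropping work.
    raise ValueError(f"Unknown NDVI_QUEUE_BACKEND: {backend!r}")
```

&lt;p&gt;Rejecting unknown values matters here: a misspelled flag should stop ingestion visibly rather than fall through to a default path.&lt;/p&gt;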




&lt;h2&gt;
  
  
  Publishing NDVI Jobs into Redis Streams
&lt;/h2&gt;

&lt;p&gt;The producer layer publishes deterministic NDVI payloads into a Redis stream.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request_hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request_hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;farm_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;job_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enqueue_timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xadd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;maxlen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_MAXLEN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;approximate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key design decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;request_hash&lt;/code&gt; acts as the idempotency key.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MAXLEN ~&lt;/code&gt; trimming on &lt;code&gt;XADD&lt;/code&gt; keeps memory bounded.&lt;/li&gt;
&lt;li&gt;Stream payloads remain deterministic, apart from the enqueue timestamp.&lt;/li&gt;
&lt;li&gt;Producers do not execute business logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The stream became the ingestion ledger.&lt;/p&gt;




&lt;h2&gt;
  
  
  Redis Streams Consumer Design
&lt;/h2&gt;

&lt;p&gt;The consumer reads from Redis Streams and forwards work into Celery.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xreadgroup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;groupname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_GROUP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;consumername&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;consumer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;streams&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_NAME&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_BLOCK_MS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For every message:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deserialize payload&lt;/li&gt;
&lt;li&gt;Validate structure&lt;/li&gt;
&lt;li&gt;Apply idempotency safeguards&lt;/li&gt;
&lt;li&gt;Enqueue Celery task&lt;/li&gt;
&lt;li&gt;Acknowledge stream entry
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;process_ndvi_job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_GROUP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;message_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The stream consumer remains intentionally thin.&lt;/p&gt;

&lt;p&gt;Its job is reliable transport and recovery.&lt;/p&gt;

&lt;p&gt;Celery still handles execution.&lt;/p&gt;
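&lt;p&gt;The five steps above can be sketched as a single handler. This is an illustrative outline under several assumptions: entries arrive as already-decoded dicts (for example with &lt;code&gt;decode_responses=True&lt;/code&gt;), the required field list is a guess at a minimal schema, and &lt;code&gt;already_done&lt;/code&gt;, &lt;code&gt;enqueue&lt;/code&gt;, &lt;code&gt;ack&lt;/code&gt;, and &lt;code&gt;dead_letter&lt;/code&gt; are hypothetical callables standing in for the real idempotency check, &lt;code&gt;process_ndvi_job.delay&lt;/code&gt;, &lt;code&gt;XACK&lt;/code&gt;, and the DLQ writer.&lt;/p&gt;

```python
REQUIRED_FIELDS = ("job_id", "request_hash")  # assumed minimum schema


def handle_entry(message_id, fields, already_done, enqueue, ack, dead_letter):
    """Process one stream entry; every outcome ends in an ack."""
    missing = [name for name in REQUIRED_FIELDS if name not in fields]
    if missing:
        # Poisoned message: park it in the DLQ rather than retrying forever.
        dead_letter(message_id, fields, "missing fields: " + ", ".join(missing))
        ack(message_id)
        return "dead-lettered"
    if already_done(fields["request_hash"]):
        # Duplicate delivery: acknowledge without enqueueing again.
        ack(message_id)
        return "skipped"
    # Enqueue first, ack second: a crash in between causes a redelivery,
    # not a lost job.
    enqueue(fields["job_id"])
    ack(message_id)
    return "enqueued"
```

&lt;p&gt;The ordering of the last two calls is the important design choice: acknowledging before enqueueing would turn a crash into silent data loss.&lt;/p&gt;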




&lt;h2&gt;
  
  
  Why Consumer Groups Matter
&lt;/h2&gt;

&lt;p&gt;Redis Streams consumer groups solved several operational problems immediately.&lt;/p&gt;

&lt;p&gt;They provided:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cooperative work distribution&lt;/li&gt;
&lt;li&gt;independent consumer identities&lt;/li&gt;
&lt;li&gt;pending-entry tracking&lt;/li&gt;
&lt;li&gt;reclaim support&lt;/li&gt;
&lt;li&gt;replay visibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike simple queue semantics, Redis Streams expose message lifecycle state.&lt;/p&gt;

&lt;p&gt;That visibility becomes extremely valuable during failures.&lt;/p&gt;

&lt;p&gt;Message lifecycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;XADD → pending → reclaimed → acknowledged
                          ↓
                         DLQ
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This made queue recovery observable instead of implicit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Recovering Stale Messages with XAUTOCLAIM
&lt;/h2&gt;

&lt;p&gt;The most important recovery primitive ended up being &lt;code&gt;XAUTOCLAIM&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If a consumer dies after reading a message but before acknowledging it, the entry remains pending indefinitely unless another consumer reclaims it.&lt;/p&gt;

&lt;p&gt;Without reclaim logic, stream durability is incomplete.&lt;/p&gt;

&lt;p&gt;Example reclaim loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xautoclaim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;groupname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_GROUP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;consumername&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;consumer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;min_idle_time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_CLAIM_IDLE_MS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;start_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0-0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



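&lt;p&gt;This allows healthy consumers to recover abandoned work automatically. Note that redis-py returns the next cursor ID together with the claimed entries, so the entries are the second element of the result rather than the whole return value.&lt;/p&gt;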

&lt;p&gt;That changed the reliability profile of the ingestion pipeline significantly.&lt;/p&gt;
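&lt;p&gt;A single &lt;code&gt;XAUTOCLAIM&lt;/code&gt; call only scans part of the pending list, so a full sweep has to follow the returned cursor. The generator below is a sketch of that loop, assuming a redis-py style client; the function name and parameters are hypothetical.&lt;/p&gt;

```python
def reclaim_stale_entries(client, stream, group, consumer, min_idle_ms, batch_size):
    """Yield (message_id, fields) for entries idle longer than min_idle_ms."""
    cursor = "0-0"
    while True:
        result = client.xautoclaim(
            name=stream,
            groupname=group,
            consumername=consumer,
            min_idle_time=min_idle_ms,
            start_id=cursor,
            count=batch_size,
        )
        # redis-py returns the next cursor first and the claimed entries second.
        cursor, entries = result[0], result[1]
        for message_id, fields in entries:
            if message_id is None:
                # Entry was deleted or trimmed while still pending; skip it.
                continue
            yield message_id, fields
        # A cursor of 0-0 means the pending list has been fully scanned.
        if cursor in ("0-0", b"0-0"):
            break
```

&lt;p&gt;Each reclaimed entry can then go through the same handler as a freshly read one, which keeps the recovery path and the normal path identical.&lt;/p&gt;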




&lt;h2&gt;
  
  
  Dead-Letter Queue Handling
&lt;/h2&gt;

&lt;p&gt;I also introduced a dedicated dead-letter stream.&lt;/p&gt;

&lt;p&gt;Messages are routed into the DLQ when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;validation fails&lt;/li&gt;
&lt;li&gt;delivery ceilings are exceeded&lt;/li&gt;
&lt;li&gt;payloads become structurally invalid&lt;/li&gt;
&lt;li&gt;repeated execution attempts fail&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xadd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_DLQ_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dlq_payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every DLQ entry includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;original message ID&lt;/li&gt;
&lt;li&gt;delivery count&lt;/li&gt;
&lt;li&gt;failure reason&lt;/li&gt;
&lt;li&gt;serialized payload&lt;/li&gt;
&lt;li&gt;timestamps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This made operational debugging dramatically easier.&lt;/p&gt;
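&lt;p&gt;As a rough illustration, a DLQ entry with those fields could be assembled like this. The field names and the &lt;code&gt;build_dlq_payload&lt;/code&gt; helper are assumptions for the example, not the project's actual schema.&lt;/p&gt;

```python
import json
import time


def build_dlq_payload(message_id, delivery_count, reason, fields):
    """Flatten failure context into stream-friendly string fields."""
    return {
        "original_message_id": message_id,
        "delivery_count": str(delivery_count),
        "failure_reason": reason,
        # Nested data is stored as one JSON string, since stream
        # field values must be flat strings or numbers.
        "payload": json.dumps(fields, sort_keys=True),
        "failed_at": str(time.time()),
    }
```

&lt;p&gt;The returned dict can be passed directly as &lt;code&gt;dlq_payload&lt;/code&gt; to the &lt;code&gt;xadd&lt;/code&gt; call shown above.&lt;/p&gt;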




&lt;h2&gt;
  
  
  The Hardest Problem: Idempotency
&lt;/h2&gt;

&lt;p&gt;Redis Streams provide at-least-once delivery.&lt;/p&gt;

&lt;p&gt;That means duplicate delivery is expected.&lt;/p&gt;

&lt;p&gt;Exactly-once delivery is not guaranteed.&lt;/p&gt;

&lt;p&gt;To prevent duplicate NDVI execution, I added multiple protection layers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 1: Deterministic Request Hash
&lt;/h2&gt;

&lt;p&gt;Every NDVI job already had a deterministic &lt;code&gt;request_hash&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That became the execution identity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 2: Distributed Redis Lock
&lt;/h2&gt;

&lt;p&gt;The consumer acquires a Redis lock before execution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;lock_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ndvi:lock:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;request_hash&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Acquisition uses &lt;code&gt;SETNX&lt;/code&gt; semantics with expiration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 3: Token-Based Lock Release
&lt;/h2&gt;

&lt;p&gt;Locks are released through an atomic Lua script.&lt;/p&gt;

&lt;p&gt;This prevents blind deletion.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"get"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"del"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
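&lt;p&gt;Layers 2 and 3 fit together as an acquire and release pair. The sketch below assumes a redis-py style client and uses hypothetical helper names; the lock key format and the Lua script are the ones shown above, while the token format and TTL handling are assumptions.&lt;/p&gt;

```python
import uuid

RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""


def acquire_lock(client, request_hash, ttl_ms):
    """Return a unique token if the lock was taken, otherwise None."""
    token = uuid.uuid4().hex
    # SET key token NX PX ttl: create only if absent, expire automatically.
    acquired = client.set(f"ndvi:lock:{request_hash}", token, nx=True, px=ttl_ms)
    return token if acquired else None


def release_lock(client, request_hash, token):
    """Delete the lock only if this holder's token is still stored."""
    deleted = client.eval(RELEASE_SCRIPT, 1, f"ndvi:lock:{request_hash}", token)
    return deleted == 1
```

&lt;p&gt;The token is what makes release safe: if the lock expired and another consumer took it, the stored value no longer matches and the script leaves the new holder's lock alone.&lt;/p&gt;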



&lt;h2&gt;
  
  
  Layer 4: Database Status Recheck
&lt;/h2&gt;

&lt;p&gt;Before execution begins, the worker re-checks terminal job state.&lt;/p&gt;

&lt;p&gt;This acts as a final safety boundary behind the lock.&lt;/p&gt;

&lt;p&gt;The result is effectively-once execution semantics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;At-least-once delivery + idempotent execution = effectively-once processing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
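&lt;p&gt;The status recheck from Layer 4 is what turns a duplicate delivery into a no-op. A minimal sketch, assuming hypothetical status names and injected &lt;code&gt;load_status&lt;/code&gt; and &lt;code&gt;run_job&lt;/code&gt; callables in place of the real database lookup and NDVI task:&lt;/p&gt;

```python
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}  # assumed status names


def execute_if_needed(load_status, run_job, job_id):
    """Run the job unless a previous delivery already finished it."""
    status = load_status(job_id)
    if status in TERMINAL_STATES:
        # Duplicate delivery of completed work becomes a no-op.
        return status
    return run_job(job_id)
```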






&lt;h2&gt;
  
  
  Observability Added During the Rollout
&lt;/h2&gt;

&lt;p&gt;One major lesson from this migration:&lt;/p&gt;

&lt;p&gt;Do not enable stream mode before queue visibility exists.&lt;/p&gt;

&lt;p&gt;I added dedicated metrics before enabling the rollout broadly.&lt;/p&gt;

&lt;p&gt;Examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;redis_stream_pending_entries
redis_stream_pending_age_max
ndvi_stream_consumer_heartbeat
ndvi_stream_consumer_failures_total
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I also expanded upstream visibility:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ndvi_upstream_requests_total
ndvi_upstream_failures_total
ndvi_upstream_duration_seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
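&lt;p&gt;The two pending-entry gauges can be derived from the consumer group's pending list. The function below is a sketch under assumptions, not the exporter used in the project: it assumes a redis-py style client, reports age in seconds, and only samples the oldest entries by ID, so the maximum age is an approximation on large backlogs.&lt;/p&gt;

```python
def collect_pending_metrics(client, stream, group, sample_size=100):
    """Return pending count and the largest idle time seen, in seconds."""
    summary = client.xpending(stream, group)
    entries = client.xpending_range(
        stream, group, min="-", max="+", count=sample_size
    )
    # redis-py reports per-entry idle time in milliseconds.
    max_idle_ms = max(
        (entry["time_since_delivered"] for entry in entries), default=0
    )
    return {
        "redis_stream_pending_entries": summary["pending"],
        "redis_stream_pending_age_max": max_idle_ms / 1000,
    }
```

&lt;p&gt;The returned values can then be copied into whatever gauge objects the metrics library provides.&lt;/p&gt;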



&lt;p&gt;Grafana dashboards now expose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pending stream backlog&lt;/li&gt;
&lt;li&gt;reclaim frequency&lt;/li&gt;
&lt;li&gt;DLQ volume&lt;/li&gt;
&lt;li&gt;consumer liveness&lt;/li&gt;
&lt;li&gt;upstream API failures&lt;/li&gt;
&lt;li&gt;queue drain rate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This transformed rollout decisions from guesswork into measurable operations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Rollback Strategy
&lt;/h2&gt;

&lt;p&gt;Rollback was designed before rollout.&lt;/p&gt;

&lt;p&gt;That mattered.&lt;/p&gt;

&lt;p&gt;The stream backend is fully feature-flagged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;NDVI_QUEUE_BACKEND&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;celery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;NDVI_QUEUE_BACKEND&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rollback requires:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;environment variable change&lt;/li&gt;
&lt;li&gt;process restart&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No redeploy.&lt;/p&gt;

&lt;p&gt;No task rewrite.&lt;/p&gt;

&lt;p&gt;No schema rollback.&lt;/p&gt;

&lt;p&gt;This significantly reduced operational fear during rollout.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Shipped This Week
&lt;/h2&gt;

&lt;p&gt;This week’s rollout included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a 528-line Redis Streams consumer&lt;/li&gt;
&lt;li&gt;reclaim + DLQ lifecycle handling&lt;/li&gt;
&lt;li&gt;distributed execution locking&lt;/li&gt;
&lt;li&gt;token-safe lock release&lt;/li&gt;
&lt;li&gt;approximately 400 lines of stream-focused tests&lt;/li&gt;
&lt;li&gt;Prometheus metrics for queue health&lt;/li&gt;
&lt;li&gt;Grafana visibility for consumer state and lag&lt;/li&gt;
&lt;li&gt;feature-flag rollback support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most of the work was not adding Redis.&lt;/p&gt;

&lt;p&gt;Most of the work was making failure recovery predictable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Was It Worth It?
&lt;/h2&gt;

&lt;p&gt;Redis Streams did not simplify the system.&lt;/p&gt;

&lt;p&gt;They made failure states explicit.&lt;/p&gt;

&lt;p&gt;That introduced additional complexity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reclaim logic&lt;/li&gt;
&lt;li&gt;idempotency handling&lt;/li&gt;
&lt;li&gt;consumer lifecycle management&lt;/li&gt;
&lt;li&gt;DLQ operations&lt;/li&gt;
&lt;li&gt;stream observability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the reliability gains were substantial:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;durable ingestion&lt;/li&gt;
&lt;li&gt;replay visibility&lt;/li&gt;
&lt;li&gt;safer recovery semantics&lt;/li&gt;
&lt;li&gt;backlog introspection&lt;/li&gt;
&lt;li&gt;controlled rollback&lt;/li&gt;
&lt;li&gt;observable queue state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this NDVI pipeline, the tradeoff was worth it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;One of the biggest lessons from this migration is that queue evolution is not just about throughput.&lt;/p&gt;

&lt;p&gt;It is about operational recovery.&lt;/p&gt;

&lt;p&gt;Redis Streams gave the ingestion layer explicit lifecycle semantics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pending&lt;/li&gt;
&lt;li&gt;acknowledged&lt;/li&gt;
&lt;li&gt;reclaimed&lt;/li&gt;
&lt;li&gt;dead-lettered&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That visibility fundamentally changed how the system behaves during failures.&lt;/p&gt;

&lt;p&gt;And importantly, I achieved that without rewriting the worker layer.&lt;/p&gt;

&lt;p&gt;Sometimes the best migration strategy is not replacing your stack.&lt;/p&gt;

&lt;p&gt;It is inserting a safer boundary in front of it.&lt;/p&gt;

</description>
      <category>django</category>
      <category>redis</category>
      <category>celery</category>
      <category>backend</category>
    </item>
    <item>
      <title>Building a Resilient NDVI Pipeline with Redis Streams (Event-Driven Architecture)</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sun, 26 Apr 2026 09:02:14 +0000</pubDate>
      <link>https://dev.to/rahim8050/building-a-resilient-ndvi-pipeline-with-redis-streams-event-driven-architecture-2l75</link>
      <guid>https://dev.to/rahim8050/building-a-resilient-ndvi-pipeline-with-redis-streams-event-driven-architecture-2l75</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A practical breakdown of moving an NDVI processing pipeline from a synchronous design to an event-driven architecture using Redis Streams — including concurrency challenges, distributed locking pitfalls, and production-safe patterns.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Most pipelines work — until concurrency and failure expose their limits.&lt;/p&gt;

&lt;p&gt;At first, processing NDVI (Normalized Difference Vegetation Index) data seems straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;receive a request&lt;/li&gt;
&lt;li&gt;process imagery&lt;/li&gt;
&lt;li&gt;return results&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But once you introduce:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;concurrent jobs&lt;/li&gt;
&lt;li&gt;long-running processing&lt;/li&gt;
&lt;li&gt;distributed components&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;you’re no longer building a simple pipeline.&lt;/p&gt;

&lt;p&gt;You’re designing a distributed system.&lt;/p&gt;

&lt;p&gt;This article walks through how I transformed an NDVI processing pipeline from a synchronous model into an event-driven architecture using Redis Streams, and the real-world engineering challenges that came with it.&lt;/p&gt;




&lt;h2&gt;
  
  
  System Overview
&lt;/h2&gt;

&lt;p&gt;The system is built using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Django REST Framework (backend API)&lt;/li&gt;
&lt;li&gt;Nextcloud (client-facing integration layer)&lt;/li&gt;
&lt;li&gt;Celery (asynchronous task processing)&lt;/li&gt;
&lt;li&gt;Redis Streams (event ingestion and coordination)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Initial Architecture (Synchronous Design)
&lt;/h2&gt;

&lt;p&gt;Client → API → Celery Task → NDVI Processing → Result&lt;/p&gt;

&lt;p&gt;This design works well at small scale, but it introduces hidden risks when the system grows.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Problems
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Tight Coupling
&lt;/h3&gt;

&lt;p&gt;The request lifecycle is directly tied to processing.&lt;/p&gt;

&lt;p&gt;If processing fails:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the request fails&lt;/li&gt;
&lt;li&gt;the user experiences errors&lt;/li&gt;
&lt;li&gt;retries become difficult&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Concurrency Issues
&lt;/h3&gt;

&lt;p&gt;When multiple requests target the same job:&lt;/p&gt;

&lt;p&gt;Request A ─┐&lt;br&gt;
           ├──&amp;gt; Same Job → Duplicate Processing&lt;br&gt;
Request B ─┘&lt;/p&gt;

&lt;p&gt;This leads to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;duplicated work&lt;/li&gt;
&lt;li&gt;inconsistent outputs&lt;/li&gt;
&lt;li&gt;race conditions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Fragile Execution Model
&lt;/h3&gt;

&lt;p&gt;Without coordination:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;jobs execute immediately&lt;/li&gt;
&lt;li&gt;no buffering exists&lt;/li&gt;
&lt;li&gt;failure handling is reactive, not controlled&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Shift to Event-Driven Architecture
&lt;/h2&gt;

&lt;p&gt;To solve these issues, I introduced Redis Streams and redesigned the system into an event-driven model.&lt;/p&gt;




&lt;h2&gt;
  
  
  New Architecture (Event-Driven Pipeline)
&lt;/h2&gt;

&lt;p&gt;Client → API → Redis Stream → Consumer → Celery → Processing&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Redis Streams?
&lt;/h2&gt;

&lt;p&gt;Redis Streams provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Event buffering (decouples ingestion from execution)&lt;/li&gt;
&lt;li&gt;At-least-once delivery through consumer groups (an entry stays pending until it is acknowledged)&lt;/li&gt;
&lt;li&gt;Ordered entries within a stream&lt;/li&gt;
&lt;li&gt;Horizontal scaling by adding consumers to a group&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Changed
&lt;/h2&gt;

&lt;p&gt;Instead of executing tasks immediately:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The API publishes events to a Redis Stream&lt;/li&gt;
&lt;li&gt;A stream consumer controls task execution&lt;/li&gt;
&lt;li&gt;Celery workers process jobs asynchronously&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This separates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ingestion&lt;/li&gt;
&lt;li&gt;scheduling&lt;/li&gt;
&lt;li&gt;execution&lt;/li&gt;
&lt;/ul&gt;
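&lt;p&gt;The publishing step can be sketched as follows. This is a minimal illustration under assumptions: &lt;code&gt;client&lt;/code&gt; is expected to behave like a redis-py &lt;code&gt;Redis&lt;/code&gt; instance, and the function name, the stream name &lt;code&gt;ndvi:jobs&lt;/code&gt;, and the field layout are hypothetical rather than taken from the real project.&lt;/p&gt;

```python
import json


def publish_ndvi_job(client, job_id, payload, stream="ndvi:jobs"):
    """Append a job event to a Redis Stream and return the new entry ID.

    `client` is expected to behave like redis.Redis; the stream name and
    field layout are illustrative assumptions.
    """
    fields = {
        "job_id": str(job_id),
        # Stream fields are flat strings, so nested data is serialized to JSON.
        "payload": json.dumps(payload, sort_keys=True),
    }
    # XADD only records the event; nothing is executed at this point.
    return client.xadd(stream, fields)
```

&lt;p&gt;The API view returns as soon as the entry is appended, so the request no longer waits on NDVI processing.&lt;/p&gt;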




&lt;h2&gt;
  
  
  Distributed Locking: The Critical Bug
&lt;/h2&gt;

&lt;p&gt;To prevent duplicate processing, a locking mechanism was introduced.&lt;/p&gt;

&lt;p&gt;The naive approach released the lock with an unconditional delete:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cache.delete(lock_key)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This looks harmless, but in a distributed system it is dangerous.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Fails
&lt;/h2&gt;

&lt;p&gt;Consider this sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Process A acquires a lock&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The lock's TTL expires while Process A is still working&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Process B acquires the same lock&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Process A finishes and deletes the lock, which now belongs to Process B&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Process B is running without protection&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This creates a race condition — one of the hardest problems in distributed systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix: Token-Based Distributed Locking
&lt;/h2&gt;

&lt;p&gt;To solve this, each lock is assigned a unique token when it is acquired.&lt;/p&gt;

&lt;p&gt;Acquire: &lt;code&gt;SET lock_key token_A NX EX ttl&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Release only if &lt;code&gt;stored_token == token_A&lt;/code&gt;, with the comparison and the delete performed as one atomic step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Principles
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Only the owner of the lock can release it&lt;/li&gt;
&lt;li&gt;If ownership does not match, do nothing&lt;/li&gt;
&lt;li&gt;TTL ensures eventual cleanup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;safe concurrency&lt;/li&gt;
&lt;li&gt;no accidental unlocks&lt;/li&gt;
&lt;li&gt;predictable system behavior&lt;/li&gt;
&lt;/ul&gt;
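&lt;p&gt;The ownership rule can be sketched in a few lines. This is an illustration under assumptions, not the pipeline's code: &lt;code&gt;InMemoryLockStore&lt;/code&gt; is a hypothetical stand-in for Redis, and the current time is passed in explicitly so that expiry is easy to follow. Against a real Redis server the token check and the delete would need to run atomically, which is commonly done with a short Lua script.&lt;/p&gt;

```python
import uuid


class InMemoryLockStore:
    """Hypothetical stand-in for Redis: maps lock key to (token, expiry time)."""

    def __init__(self):
        self._locks = {}

    def acquire(self, key, ttl_secs, now):
        """Mimic SET key token NX EX ttl: succeed only if the lock is free or expired."""
        current = self._locks.get(key)
        if current is not None and current[1] > now:
            return None  # another owner still holds a live lock
        token = uuid.uuid4().hex
        self._locks[key] = (token, now + ttl_secs)
        return token

    def release(self, key, token):
        """Delete the lock only if the caller's token still owns it."""
        current = self._locks.get(key)
        if current is None or current[0] != token:
            return False  # expired or taken over by someone else: do nothing
        del self._locks[key]
        return True
```

&lt;p&gt;Replaying the failure sequence: Process A acquires at t=0 with a 10 second TTL, Process B acquires at t=11 after the lock has expired, and a late &lt;code&gt;release&lt;/code&gt; from A returns &lt;code&gt;False&lt;/code&gt; and leaves B's lock in place.&lt;/p&gt;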




&lt;h2&gt;
  
  
  Stream Consumer Design
&lt;/h2&gt;

&lt;p&gt;Redis Streams operate with:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;At-least-once delivery semantics&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;messages can be delivered more than once&lt;/li&gt;
&lt;li&gt;consumers must be idempotent&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Consumer Processing Flow
&lt;/h2&gt;

&lt;p&gt;Read → Validate → Enqueue → Acknowledge&lt;/p&gt;

&lt;h3&gt;
  
  
  Critical Rule
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Never acknowledge a message before it is safely enqueued.&lt;/p&gt;
&lt;/blockquote&gt;
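&lt;p&gt;The ordering rule can be sketched as a single consumer pass. This is a minimal illustration under assumptions: &lt;code&gt;client&lt;/code&gt; is expected to behave like a redis-py &lt;code&gt;Redis&lt;/code&gt; instance created with &lt;code&gt;decode_responses=True&lt;/code&gt;, &lt;code&gt;enqueue&lt;/code&gt; is any callable that raises when the hand-off to Celery fails, and the function, stream, group, and field names are hypothetical.&lt;/p&gt;

```python
def consume_batch(client, enqueue, stream="ndvi:jobs", group="ndvi-workers",
                  consumer="consumer-1", count=10, block_ms=1000):
    """Read, validate, enqueue, then acknowledge entries from a consumer group.

    All names and defaults are illustrative assumptions. Returns the IDs of
    entries that were enqueued and acknowledged.
    """
    acked = []
    # The special ID ">" asks for entries never delivered to this group before.
    batches = client.xreadgroup(group, consumer, {stream: ">"},
                                count=count, block=block_ms)
    for _stream_name, entries in batches or []:
        for message_id, fields in entries:
            if "job_id" not in fields:
                # Invalid entries are acknowledged so they do not loop forever;
                # a real system might copy them to a dead-letter stream first.
                client.xack(stream, group, message_id)
                continue
            try:
                enqueue(fields)
            except Exception:
                # No XACK here: the entry stays pending and can be retried
                # or reclaimed by another consumer later.
                continue
            # Acknowledge only after the hand-off succeeded.
            client.xack(stream, group, message_id)
            acked.append(message_id)
    return acked
```

&lt;p&gt;If the process crashes between &lt;code&gt;enqueue&lt;/code&gt; and &lt;code&gt;xack&lt;/code&gt;, the entry is delivered again later, which is why the downstream task has to tolerate duplicates.&lt;/p&gt;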




&lt;h2&gt;
  
  
  Idempotency and Reliability
&lt;/h2&gt;

&lt;p&gt;To handle duplicate events:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;processing must be idempotent&lt;/li&gt;
&lt;li&gt;tasks must tolerate retries&lt;/li&gt;
&lt;li&gt;state transitions must be safe&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is essential in any event-driven system.&lt;/p&gt;
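&lt;p&gt;One simple way to tolerate duplicates is to record finished job IDs and skip repeats. The sketch below is hypothetical: &lt;code&gt;run_once&lt;/code&gt; and the set-like &lt;code&gt;completed&lt;/code&gt; store are assumptions, and in production the store would need to be shared and durable, for example a database table or one Redis key per job.&lt;/p&gt;

```python
def run_once(job_id, completed, handler):
    """Run handler(job_id) unless this job has already been completed.

    `completed` is any set-like store of finished job IDs. Names are
    illustrative assumptions.
    """
    if job_id in completed:
        # A redelivered event for a finished job becomes a no-op.
        return "skipped"
    handler(job_id)
    # Mark the job as done only after the handler succeeds, so a crash
    # mid-run leads to a retry rather than a silently lost job.
    completed.add(job_id)
    return "processed"
```

&lt;p&gt;The check and the write are not atomic here, so two workers could still start the same job at the same moment. That gap is what the token-based lock is meant to cover.&lt;/p&gt;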




&lt;h2&gt;
  
  
  Final Architecture (Layered System Design)
&lt;/h2&gt;

&lt;p&gt;The system now operates in clear layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ingestion Layer&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;receives requests&lt;/li&gt;
&lt;li&gt;publishes events&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream Layer&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;buffers and orders events&lt;/li&gt;
&lt;li&gt;decouples system components&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer Layer&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;controls execution&lt;/li&gt;
&lt;li&gt;validates and dispatches tasks&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution Layer&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Celery workers process NDVI jobs&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coordination Layer&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;distributed locking&lt;/li&gt;
&lt;li&gt;idempotency&lt;/li&gt;
&lt;li&gt;concurrency control&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Key Lessons from Building an Event-Driven System
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Event-Driven Architecture Does Not Reduce Complexity
&lt;/h3&gt;

&lt;p&gt;It shifts complexity into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;coordination&lt;/li&gt;
&lt;li&gt;state management&lt;/li&gt;
&lt;li&gt;failure handling&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Concurrency Is the Real Challenge
&lt;/h3&gt;

&lt;p&gt;Not performance.&lt;br&gt;
Not frameworks.&lt;/p&gt;

&lt;p&gt;Concurrency.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Safety Must Be Designed Explicitly
&lt;/h3&gt;

&lt;p&gt;Small shortcuts (like naive lock deletion) can lead to major production issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Idempotency Is Non-Negotiable
&lt;/h3&gt;

&lt;p&gt;In systems with retries and event delivery:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;duplicate execution is expected&lt;/li&gt;
&lt;li&gt;safe handling is required&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Observability Becomes Critical
&lt;/h3&gt;

&lt;p&gt;In asynchronous systems, you must answer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What happened to this job?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;structured logging&lt;/li&gt;
&lt;li&gt;tracing across components&lt;/li&gt;
&lt;li&gt;visibility into system flow&lt;/li&gt;
&lt;/ul&gt;
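&lt;p&gt;As a small illustration of structured logging, each stage can emit one JSON line keyed by the job ID. The helper name &lt;code&gt;log_job_event&lt;/code&gt; and the field names are assumptions made for this sketch, not part of the pipeline described here.&lt;/p&gt;

```python
import json
import logging


def log_job_event(logger, job_id, stage, **fields):
    """Emit one JSON log line that can be filtered by job_id across services."""
    record = {"job_id": job_id, "stage": stage, **fields}
    # Sorted keys keep the output stable, which helps when diffing or grepping.
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

&lt;p&gt;Calling it at each hand-off (published, consumed, enqueued, completed) leaves a trail that answers the question above with a single search on &lt;code&gt;job_id&lt;/code&gt;.&lt;/p&gt;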




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This shift changed the system from:&lt;/p&gt;

&lt;p&gt;"Run this task now"&lt;/p&gt;

&lt;p&gt;to:&lt;/p&gt;

&lt;p&gt;"This event will be processed safely"&lt;/p&gt;

&lt;p&gt;That difference is fundamental.&lt;/p&gt;

&lt;p&gt;Because in distributed systems:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You don’t design for success.&lt;br&gt;
You design for failure.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;The next phase is observability-driven engineering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tracing event lifecycles&lt;/li&gt;
&lt;li&gt;monitoring stream lag&lt;/li&gt;
&lt;li&gt;correlating logs across services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because once a system becomes event-driven:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Visibility is what makes it understandable.&lt;/p&gt;
&lt;/blockquote&gt;




</description>
      <category>django</category>
      <category>systemdesign</category>
      <category>redis</category>
      <category>devops</category>
    </item>
    <item>
      <title>Hardening Distributed Systems: Retries, Circuit Breakers &amp; Observability.</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sun, 12 Apr 2026 05:12:28 +0000</pubDate>
      <link>https://dev.to/rahim8050/hardening-distributed-systems-retries-circuit-breakers-observability-4m5n</link>
      <guid>https://dev.to/rahim8050/hardening-distributed-systems-retries-circuit-breakers-observability-4m5n</guid>
      <description>&lt;h2&gt;
  
  
  Building Resilient Distributed Systems: A Solo Engineer's Journey
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;How I turned flaky upstream APIs into a predictable, observable, and operator-friendly reliability layer — with code you can steal.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;If you've ever built a service that depends on external APIs (STAC catalogs, SentinelHub, weather data providers, etc.), you know the pain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;429s when you hit rate limits&lt;/li&gt;
&lt;li&gt;502s when upstreams hiccup&lt;/li&gt;
&lt;li&gt;Silent timeouts that leave jobs hanging&lt;/li&gt;
&lt;li&gt;Retry storms that make bad days worse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Last month, I undertook a focused effort to harden the retry and resilience logic for an NDVI (Normalized Difference Vegetation Index) processing pipeline. What started as "let's clean up some duplicate retry code" evolved into a &lt;strong&gt;production-grade reliability subsystem&lt;/strong&gt; that now governs every upstream interaction.&lt;/p&gt;

&lt;p&gt;In this article, I'll walk through:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Phase 1&lt;/strong&gt;: Consolidating retry policy into a single source of truth&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 2&lt;/strong&gt;: Adding circuit breakers with observability and admin controls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 3 (preview)&lt;/strong&gt;: Decoupling dispatch with Redis Streams for back-pressure resilience&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key principles&lt;/strong&gt; I learned that you can apply to your own distributed systems&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All code is Python/Django/Celery, but the patterns are language-agnostic. And yes — I did this alone. No team, no dedicated SRE, no platform squad. Just me, a codebase, and a lot of careful thinking.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem Space
&lt;/h2&gt;

&lt;p&gt;The NDVI pipeline I was working on orchestrates vegetation index calculations by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Querying STAC catalogs for satellite imagery metadata&lt;/li&gt;
&lt;li&gt;Fetching raster data from SentinelHub&lt;/li&gt;
&lt;li&gt;Computing NDVI values per farm/plot&lt;/li&gt;
&lt;li&gt;Returning results to farmers/agronomists&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The challenge&lt;/strong&gt;: Each upstream service has different failure modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;STAC: occasional 502s, auth errors (401/403)&lt;/li&gt;
&lt;li&gt;SentinelHub: strict rate limits (429), validation errors (422), transient 5xx&lt;/li&gt;
&lt;li&gt;Network: timeouts, DNS failures, TLS handshake issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before my refactor, retry logic was scattered across 4+ modules, with inconsistent error classification and no centralized observability. Result? Hard-to-debug failures, wasted Celery retries, and on-call pages at 3 AM.&lt;/p&gt;

&lt;p&gt;As a solo engineer, I couldn't afford to keep firefighting. I needed a system that would &lt;em&gt;just work&lt;/em&gt; — or fail gracefully, with clear signals.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 1: One Source of Truth for Retries
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Core Insight
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Not all errors are retryable. Not all retries are equal.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I started by defining a canonical truth table mapping HTTP status codes to retry behavior:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ndvi/retry_policy.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify_status_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;RetryClassification&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Canonical truth table: HTTP status → retry decision.

    | Status      | Retryable | Category           |
    |-------------|-----------|--------------------|
    | 401, 403    | False     | AUTH               |
    | 400, 422    | False     | VALIDATION         |
    | 429         | True      | RATE_LIMIT         |
    | &amp;gt;= 500      | True      | TRANSIENT_UPSTREAM |
    | Other/None  | False     | UNKNOWN            |
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;401&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;403&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;RetryClassification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retryable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AUTH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;422&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;RetryClassification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retryable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VALIDATION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;RetryClassification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retryable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RATE_LIMIT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;RetryClassification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retryable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TRANSIENT_UPSTREAM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;RetryClassification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retryable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UNKNOWN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Unified Exception Hierarchy
&lt;/h3&gt;

&lt;p&gt;I made all upstream errors inherit from a common base, ensuring consistent attributes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;UpstreamFailureError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NdviFailureError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Base for all upstream failures; the classifier decides whether each one is retryable.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Response&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
        &lt;span class="c1"&gt;# Delegate to canonical classifier
&lt;/span&gt;        &lt;span class="n"&gt;classification&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;classify_status_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retryable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retryable&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_compute_delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StacUpstreamError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;UpstreamFailureError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StacError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SentinelHubUpstreamError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;UpstreamFailureError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SentinelHubRasterError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;UpstreamFailureError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Centralized Retry Decision
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RetryDecision&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
    &lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;should_retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;RetryDecision&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UpstreamFailureError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;RetryDecision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;non-retryable-exception&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Respect Retry-After header for 429s
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;response_headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;server_delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_retry_after&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retry-After&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;server_delay&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;RetryDecision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;server_delay&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retry-after-header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;RetryDecision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retryable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-classification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
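&lt;p&gt;The &lt;code&gt;parse_retry_after&lt;/code&gt; helper is not shown above. One plausible version, written here as an assumption rather than the project's actual code, handles both forms the &lt;code&gt;Retry-After&lt;/code&gt; header can take: a number of seconds or an HTTP date. The optional &lt;code&gt;now&lt;/code&gt; parameter is added only to make the sketch testable.&lt;/p&gt;

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None):
    """Return the delay in seconds requested by a Retry-After header, or None.

    Hypothetical helper: accepts delta-seconds ("120") or an HTTP date.
    """
    if value is None:
        return None
    value = value.strip()
    if value.isascii() and value.isdigit():
        return float(value)
    try:
        target = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Unparseable header: let the caller fall back to its own delay.
        return None
    if target.tzinfo is None:
        # Treat dates without an offset as UTC.
        target = target.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means the caller may retry immediately.
    return max(0.0, (target - now).total_seconds())
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; for a missing or malformed header lets &lt;code&gt;should_retry()&lt;/code&gt; fall through to the classification-based delay.&lt;/p&gt;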



&lt;h3&gt;
  
  
  Impact
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;28 parametrized tests covering all 13 truth-table branches&lt;/li&gt;
&lt;li&gt;Removed 3 duplicate retry implementations&lt;/li&gt;
&lt;li&gt;Celery tasks now use shared &lt;code&gt;should_retry()&lt;/code&gt; logic&lt;/li&gt;
&lt;li&gt;Network errors properly wrapped → no more silent failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson #1&lt;/strong&gt;: Centralize failure classification. When retry logic lives in one place, you can test it thoroughly, document it clearly, and evolve it safely — even when you're the only one maintaining it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 2: Circuit Breakers with Teeth
&lt;/h2&gt;

&lt;p&gt;Retries alone aren't enough. When an upstream is truly down, you want to &lt;strong&gt;fail fast&lt;/strong&gt; and avoid thundering herds.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Circuit Breaker State Machine
&lt;/h3&gt;

&lt;p&gt;I implemented a simple but effective three-state breaker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLOSED → (failures ≥ threshold) → OPEN → (timeout elapsed) → HALF_OPEN → (success) → CLOSED
HALF_OPEN → (failure) → OPEN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;_CircuitBreaker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout_secs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CLOSED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_failure_time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timeout_secs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;timeout_secs&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;record_success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_transition_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CLOSED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;record_failure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_failure_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_transition_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;allow_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CLOSED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_failure_time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timeout_secs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_transition_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HALF_OPEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="c1"&gt;# HALF_OPEN: let probe requests through (this simplified version does not cap them at one)
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_transition_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
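        &lt;span class="n"&gt;old_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;new_state&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;old_state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;  &lt;span class="c1"&gt;# Nothing changed: skip the log line and metric updates
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new_state&lt;/span&gt;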
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Circuit breaker: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;old_state&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; → &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;new_state&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Export Prometheus metric
&lt;/span&gt;        &lt;span class="n"&gt;circuit_breaker_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;STATE_VALUES&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;new_state&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;circuit_breaker_transitions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;from_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;old_state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;new_state&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;inc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
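&lt;p&gt;To show where the breaker sits in the request path, here is a minimal sketch of a wrapper around an upstream call. It only assumes an object with the same three method names as the class above; &lt;code&gt;guarded_call&lt;/code&gt;, &lt;code&gt;CircuitOpenError&lt;/code&gt; and the &lt;code&gt;Breaker&lt;/code&gt; protocol are hypothetical names for illustration, not part of the pipeline's code.&lt;/p&gt;

```python
from typing import Callable, Protocol, TypeVar

T = TypeVar("T")


class Breaker(Protocol):
    """The three methods the wrapper relies on (names taken from the class above)."""

    def allow_request(self) -> bool: ...
    def record_success(self) -> None: ...
    def record_failure(self) -> None: ...


class CircuitOpenError(RuntimeError):
    """Hypothetical error raised when the breaker rejects a call."""


def guarded_call(breaker: Breaker, fn: Callable[[], T]) -> T:
    """Run fn() only if the breaker allows it, then report the outcome back."""
    if not breaker.allow_request():
        # Fail fast: no upstream call is made while the breaker is OPEN.
        raise CircuitOpenError("upstream temporarily disabled")
    try:
        result = fn()
    except Exception:
        breaker.record_failure()
        raise  # the caller still sees the original error
    breaker.record_success()
    return result
```

&lt;p&gt;Keeping the wrapper separate from the breaker means the state machine stays free of any HTTP or STAC details, which also makes it easy to unit test with a stub.&lt;/p&gt;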



&lt;h3&gt;
  
  
  Observability First
&lt;/h3&gt;

&lt;p&gt;I didn't just build the breaker — I made it &lt;strong&gt;visible&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
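&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;prometheus_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Gauge&lt;/span&gt;

&lt;span class="c1"&gt;# Numeric encoding of each state, matching the gauge help text
&lt;/span&gt;&lt;span class="n"&gt;STATE_VALUES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CLOSED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HALF_OPEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Prometheus metrics
&lt;/span&gt;&lt;span class="n"&gt;circuit_breaker_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Gauge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;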
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ndvi_circuit_breaker_state&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Current circuit breaker state (0=CLOSED, 1=OPEN, 2=HALF_OPEN)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;labelnames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;engine&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;circuit_breaker_transitions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ndvi_circuit_breaker_transitions_total&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Count of circuit breaker state transitions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;labelnames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;engine&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;from_state&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;to_state&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And added a Grafana dashboard with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stat panels showing current state per engine (color-coded: 🟢 CLOSED, 🔴 OPEN, 🟡 HALF_OPEN)&lt;/li&gt;
&lt;li&gt;Time series of transition rates&lt;/li&gt;
&lt;li&gt;Correlation with upstream failure rates&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Operator Controls
&lt;/h3&gt;

&lt;p&gt;Because things &lt;em&gt;will&lt;/em&gt; go wrong — and when you're solo, you &lt;em&gt;are&lt;/em&gt; the operator — I added an admin endpoint to manually reset breakers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /api/v1/ndvi/circuit-breaker/reset/
Content-Type: application/json
Authorization: Bearer &amp;lt;admin-token&amp;gt;

{ "engine": "stac" }

→ { "data": { "previous_state": "OPEN", "new_state": "CLOSED" } }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
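&lt;p&gt;The logic behind such a reset endpoint can stay framework-independent. The sketch below is an assumption about how it might be structured: &lt;code&gt;reset_breaker&lt;/code&gt;, the registry dictionary and the 404 body for an unknown engine are hypothetical, while the success payload mirrors the response shown above. Authentication is left to the view layer.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Breaker:
    """Stand-in for the real breaker; only the fields a reset touches."""

    state: str = "CLOSED"
    failure_count: int = 0


def reset_breaker(registry: dict[str, Breaker], engine: str) -> tuple[int, dict]:
    """Return (HTTP status, JSON body) for a manual reset of one engine's breaker."""
    breaker = registry.get(engine)
    if breaker is None:
        # Assumed error shape; the article does not show the failure response.
        return 404, {"error": f"unknown engine: {engine}"}
    previous = breaker.state
    breaker.state = "CLOSED"
    breaker.failure_count = 0  # otherwise the next failure would reopen it at once
    return 200, {"data": {"previous_state": previous, "new_state": "CLOSED"}}
```

&lt;p&gt;Resetting &lt;code&gt;failure_count&lt;/code&gt; together with the state matters: if only the state changed, a single new failure would push the count past the threshold again.&lt;/p&gt;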



&lt;p&gt;&lt;strong&gt;Lesson #2&lt;/strong&gt;: Resilience patterns need observability and escape hatches. If you can't see it or control it, you don't own it — and when you're the only one on call, "owning it" means sleeping at night.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 3 Preview: Decoupling with Redis Streams
&lt;/h2&gt;

&lt;p&gt;As I scaled the system, I hit a new challenge: &lt;strong&gt;Celery broker unavailability during Redis Sentinel failover&lt;/strong&gt; (~55 seconds of downtime). For background jobs, this was acceptable. But for real-time dispatch, I needed something better.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Architecture Decision
&lt;/h3&gt;

&lt;p&gt;Instead of relying on Celery's built-in Redis transport, I chose a &lt;strong&gt;separate consumer pattern&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;API → [Redis Stream] → Consumer → [Celery Queue] → Worker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avoids relying on Celery/Kombu stream support, which I was unsure about&lt;/li&gt;
&lt;li&gt;Easier to observe and debug (explicit XREADGROUP/XACK)&lt;/li&gt;
&lt;li&gt;Natural back-pressure via XPENDING monitoring&lt;/li&gt;
&lt;li&gt;Cleaner rollback path (just flip a feature flag)&lt;/li&gt;
&lt;/ul&gt;
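&lt;p&gt;The explicit XREADGROUP/XACK cycle mentioned in the list could look roughly like this. It is a sketch, not the production consumer: &lt;code&gt;consume_batch&lt;/code&gt; and &lt;code&gt;enqueue&lt;/code&gt; are hypothetical names, and &lt;code&gt;client&lt;/code&gt; is assumed to behave like a redis-py client using its default reply format, where &lt;code&gt;xreadgroup&lt;/code&gt; returns a list of (stream, entries) pairs.&lt;/p&gt;

```python
from typing import Callable


def consume_batch(client, stream: str, group: str, consumer: str,
                  enqueue: Callable[[dict], None], count: int = 10) -> int:
    """Read one batch with XREADGROUP, hand each entry to Celery, then XACK it.

    `enqueue` is a hypothetical callable wrapping the Celery dispatch.
    Returns the number of entries that were acknowledged.
    """
    # The special ID ">" asks for entries never delivered to this group before.
    batches = client.xreadgroup(group, consumer, {stream: ">"}, count=count, block=5000)
    acked = 0
    for _stream_name, entries in batches or []:
        for entry_id, fields in entries:
            try:
                enqueue(fields)
            except Exception:
                # Leave the entry unacknowledged so it stays visible in XPENDING.
                continue
            client.xack(stream, group, entry_id)
            acked += 1
    return acked
```

&lt;p&gt;Acknowledging only after a successful hand-off gives at-least-once delivery, which is why the idempotency key in the next section is needed.&lt;/p&gt;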

&lt;h3&gt;
  
  
  Key Design Decisions I Made Early
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Idempotency by Design
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;stream_payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request_hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request_hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Primary idempotency key
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schema_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                &lt;span class="c1"&gt;# Future-proofing
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;colormap_normalization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;histogram&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Evolved schema
&lt;/span&gt;    &lt;span class="c1"&gt;# ... other fields
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;# Consumer checks request_hash before enqueueing to Celery
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
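&lt;p&gt;One way the consumer-side check on &lt;code&gt;request_hash&lt;/code&gt; might work is sketched below. The in-memory &lt;code&gt;SeenHashes&lt;/code&gt; class is a hypothetical stand-in; with several consumers the claim would need a shared store, for example a Redis key written with NX and an expiry. The &lt;code&gt;handle_entry&lt;/code&gt; name and its return values are also assumptions.&lt;/p&gt;

```python
class SeenHashes:
    """In-memory stand-in for a shared deduplication store."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def claim(self, request_hash: str) -> bool:
        """Return True the first time a hash is seen, False on every repeat."""
        if request_hash in self._seen:
            return False
        self._seen.add(request_hash)
        return True


def handle_entry(payload: dict, seen: SeenHashes, enqueue) -> str:
    """Enqueue a stream entry at most once per request_hash."""
    if payload.get("schema_version") != 1:
        return "rejected"  # unknown schema: do not guess at the field layout
    if not seen.claim(payload["request_hash"]):
        return "duplicate"  # already dispatched: safe to XACK and move on
    enqueue(payload)
    return "enqueued"
```

&lt;p&gt;Note the trade-off in this sketch: the hash is claimed before &lt;code&gt;enqueue&lt;/code&gt; runs, so a failed enqueue would need to release the claim or the job would be skipped on redelivery.&lt;/p&gt;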



&lt;h4&gt;
  
  
  2. Error Classification at Consumer Boundary
&lt;/h4&gt;

&lt;p&gt;Not all failures should retry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ERROR_STRATEGY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no_items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DLQ&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# Permanent: no data exists
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;missing_assets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DLQ&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# Permanent: schema mismatch
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;network_timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RETRY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Transient: try again
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;celery_unavailable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RETRY_WITH_BACKOFF&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Infrastructure blip
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
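&lt;p&gt;A table like this needs a small dispatcher to act on it. The sketch below shows one possible shape: &lt;code&gt;route_failure&lt;/code&gt;, the attempt cap and the base delay are assumptions for illustration, and sending unknown error kinds to the DLQ is a design choice rather than something the pipeline is known to do.&lt;/p&gt;

```python
ERROR_STRATEGY = {
    "no_items": "DLQ",
    "missing_assets": "DLQ",
    "network_timeout": "RETRY",
    "celery_unavailable": "RETRY_WITH_BACKOFF",
}

MAX_ATTEMPTS = 5        # assumed cap, not taken from the article
BASE_DELAY_SECS = 2.0   # assumed base delay, not taken from the article


def route_failure(kind: str, attempt: int) -> tuple[str, float]:
    """Map an error kind and a 1-based attempt number to (action, delay in seconds)."""
    strategy = ERROR_STRATEGY.get(kind, "DLQ")  # unknown errors are parked, not retried
    if strategy == "DLQ" or attempt >= MAX_ATTEMPTS:
        return "DLQ", 0.0
    if strategy == "RETRY_WITH_BACKOFF":
        # Exponential backoff: 2s, 4s, 8s, ... for attempts 1, 2, 3, ...
        return "RETRY", BASE_DELAY_SECS * 2 ** (attempt - 1)
    return "RETRY", 0.0
```

&lt;p&gt;Capping attempts keeps a transient error that never clears from looping forever; after the cap it lands in the DLQ like a permanent one.&lt;/p&gt;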



&lt;h4&gt;
  
  
  3. Back-Pressure Strategy
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;PENDING_WARNING&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1_000&lt;/span&gt;
&lt;span class="n"&gt;PENDING_CRITICAL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5_000&lt;/span&gt;

&lt;span class="n"&gt;pending_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xpending&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;group_name&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pending&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pending_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;PENDING_CRITICAL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Return 429 on API to slow producers
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;HttpResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Upstream backlog critical&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;429&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;pending_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;PENDING_WARNING&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stream backlog growing: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pending_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; pending&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  4. Graceful Shutdown
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In consume_ndvi_stream.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_shutdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;shutdown_flag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Stop accepting new entries
&lt;/span&gt;    &lt;span class="c1"&gt;# Finish current entry, XACK if successful
&lt;/span&gt;    &lt;span class="c1"&gt;# Exit cleanly → orchestrator restarts
&lt;/span&gt;
&lt;span class="c1"&gt;# Register the handler only after it has been defined
&lt;/span&gt;&lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SIGTERM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle_shutdown&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
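&lt;p&gt;For a fuller picture of how the flag and the processing loop interact, here is a self-contained sketch. The &lt;code&gt;run&lt;/code&gt; function and its &lt;code&gt;process&lt;/code&gt; and &lt;code&gt;ack&lt;/code&gt; callables are hypothetical; the point is only that the flag is checked between entries, so the entry in flight always finishes before the process exits.&lt;/p&gt;

```python
import signal
import threading

shutdown_flag = threading.Event()


def handle_shutdown(signum, frame):
    """Signal handler: only flips the flag; the loop does the actual cleanup."""
    shutdown_flag.set()


def run(entries, process, ack) -> int:
    """Process entries until the flag is set; returns how many were acknowledged."""
    done = 0
    for entry in entries:
        if shutdown_flag.is_set():
            break  # checked between entries, never in the middle of one
        if process(entry):
            ack(entry)  # acknowledge only after successful processing
            done += 1
    return done


if __name__ == "__main__":
    # Register after the handler exists; signal.signal must run in the main thread.
    signal.signal(signal.SIGTERM, handle_shutdown)
```

&lt;p&gt;Doing as little as possible inside the handler avoids re-entrancy problems, since Python runs signal handlers between bytecode instructions of the main thread.&lt;/p&gt;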



&lt;p&gt;&lt;strong&gt;Lesson #3&lt;/strong&gt;: Decoupling isn't just about scalability — it's about &lt;strong&gt;failure isolation&lt;/strong&gt;. When one component fails, the rest can keep moving. And when you're solo, isolation means you can debug one piece without bringing down the whole system.&lt;/p&gt;




&lt;h2&gt;
  
  
  Principles I Learned (That You Can Steal)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Make Failure Explicit
&lt;/h3&gt;

&lt;p&gt;Don't hide errors behind generic exceptions. Classify them, tag them, and route them intentionally. Your future self — especially at 3 AM — will thank you.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Observability Is a Feature, Not an Afterthought
&lt;/h3&gt;

&lt;p&gt;If you can't measure it, you can't improve it. Export metrics at the point of decision (retry? circuit open? stream lag?) — not just at the edges. When you're the only one debugging, every metric is a lifeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Design for the "Boring" Failure Modes
&lt;/h3&gt;

&lt;p&gt;Everyone plans for the 500 error. Few plan for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Broker failover latency&lt;/li&gt;
&lt;li&gt;Consumer restart mid-processing&lt;/li&gt;
&lt;li&gt;Schema evolution mid-deploy&lt;/li&gt;
&lt;li&gt;Clock skew in distributed timestamps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Document these. Test them. Build escape hatches. When you don't have a team to lean on, preparation is your best defense.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Centralize, Then Specialize
&lt;/h3&gt;

&lt;p&gt;Start with a single source of truth (like &lt;code&gt;classify_status_code()&lt;/code&gt;). Then layer on engine-specific behavior &lt;em&gt;on top&lt;/em&gt; of that foundation. This prevents drift and duplication — critical when you're the only one maintaining the code.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Operator Experience Matters
&lt;/h3&gt;

&lt;p&gt;Admin endpoints, health checks, clear logs, and meaningful metrics aren't "nice to have" — they're what let you sleep at night. Build them in from day one. When you're solo, &lt;em&gt;you&lt;/em&gt; are the operator.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Note on Solo Engineering
&lt;/h2&gt;

&lt;p&gt;Working alone doesn't mean working in isolation. I leaned heavily on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Public documentation&lt;/strong&gt;: Google SRE book, AWS Well-Architected, Martin Fowler's patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open source&lt;/strong&gt;: Studying how Celery, Kombu, and Redis clients handle resilience&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community&lt;/strong&gt;: Reading post-mortems, blog posts, and conference talks from engineers who've been there&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And I documented &lt;em&gt;everything&lt;/em&gt;. Not for a team — for my future self. Every architecture decision, every tradeoff, every "why" is written down. Because six months from now, I won't remember why I chose &lt;code&gt;300s&lt;/code&gt; for the circuit breaker timeout. But my docs will.&lt;/p&gt;

&lt;p&gt;If you're also building alone: you're not behind. You're just optimizing for a different constraint. Depth over breadth. Clarity over velocity. Resilience over features.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building resilient distributed systems isn't about fancy algorithms or cutting-edge tools. It's about &lt;strong&gt;discipline&lt;/strong&gt;: clear contracts, explicit failure handling, observable behavior, and operator empathy.&lt;/p&gt;

&lt;p&gt;The NDVI pipeline I built isn't perfect. My circuit breakers are still process-local (not cluster-wide). My stream consumer doesn't yet support distributed tracing. But it's &lt;strong&gt;predictable&lt;/strong&gt;, &lt;strong&gt;testable&lt;/strong&gt;, and &lt;strong&gt;recoverable&lt;/strong&gt; — and that's what matters.&lt;/p&gt;

&lt;p&gt;If you take one thing from this article, let it be this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Resilience isn't a feature you add at the end. It's a mindset you build in from the start — whether you're on a team of 50 or flying solo.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;All code examples are simplified for clarity; production versions include additional error handling and logging. This work reflects my personal approach — your mileage may vary, and that's okay.&lt;/em&gt;&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Pro Tip&lt;/strong&gt;: Want to try the circuit breaker pattern? Start small:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add a &lt;code&gt;failure_count&lt;/code&gt; and &lt;code&gt;last_failure_time&lt;/code&gt; to your HTTP client&lt;/li&gt;
&lt;li&gt;Skip requests when &lt;code&gt;failure_count &amp;gt;= 3&lt;/code&gt; and &lt;code&gt;time_since_failure &amp;lt; 300&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Log state transitions&lt;/li&gt;
&lt;li&gt;Add one Prometheus gauge&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You'll be 80% of the way there — and you'll learn what &lt;em&gt;actually&lt;/em&gt; matters for your workload.&lt;/p&gt;
&lt;/blockquote&gt;




</description>
      <category>distributedsystems</category>
      <category>devops</category>
      <category>django</category>
      <category>redis</category>
    </item>
    <item>
      <title>Django + Celery + Redis Sentinel: A Real Failover Test (With Metrics)</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sat, 04 Apr 2026 17:36:44 +0000</pubDate>
      <link>https://dev.to/rahim8050/django-celery-redis-sentinel-a-real-failover-test-with-metrics-4ajn</link>
      <guid>https://dev.to/rahim8050/django-celery-redis-sentinel-a-real-failover-test-with-metrics-4ajn</guid>
      <description>&lt;p&gt;Redis Sentinel + Celery Failover: What Actually Happens in Production&lt;/p&gt;

&lt;p&gt;Most tutorials on Redis Sentinel stop at “it elects a new master”.&lt;br&gt;
Very few show what happens to a real system under failover pressure.&lt;/p&gt;

&lt;p&gt;I ran a failover drill on a Django + Celery stack backed by Redis Sentinel and Prometheus monitoring.&lt;/p&gt;
&lt;p&gt;Here’s what actually happened.&lt;/p&gt;


&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Architecture Overview&lt;/li&gt;
&lt;li&gt;Sentinel Integration (Django + Celery)&lt;/li&gt;
&lt;li&gt;Observability with Prometheus&lt;/li&gt;
&lt;li&gt;Failover Drill Walkthrough&lt;/li&gt;
&lt;li&gt;Celery Behavior During Failover&lt;/li&gt;
&lt;li&gt;Performance Impact&lt;/li&gt;
&lt;li&gt;Production Readiness Assessment&lt;/li&gt;
&lt;li&gt;How to Reduce Failover Latency&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    Client --&amp;gt; Django
    Django --&amp;gt;|Cache| Sentinel
    Django --&amp;gt;|Tasks| Celery
    Celery --&amp;gt;|Broker| Sentinel
    Celery --&amp;gt;|Result Backend| Sentinel

    Sentinel --&amp;gt; RedisMaster
    Sentinel --&amp;gt; RedisReplica1
    Sentinel --&amp;gt; RedisReplica2

    Prometheus --&amp;gt; RedisExporter
    RedisExporter --&amp;gt; Sentinel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Stack Components
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Django&lt;/strong&gt; → Redis cache via Sentinel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Celery&lt;/strong&gt; → Broker + result backend via Sentinel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redis Sentinel&lt;/strong&gt; → High availability + failover&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus + redis_exporter&lt;/strong&gt; → Monitoring&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Sentinel Integration (Django + Celery)
&lt;/h2&gt;

&lt;p&gt;All services were switched to Sentinel using environment configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;REDIS_ADDR=redis://host.docker.internal:26379
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
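&lt;p&gt;On the Celery side, Sentinel support is configured through &lt;code&gt;sentinel://&lt;/code&gt; URLs plus a &lt;code&gt;master_name&lt;/code&gt; transport option. The fragment below is a sketch rather than the project's actual settings: the master name &lt;code&gt;mymaster&lt;/code&gt; is a placeholder, and the upper-case &lt;code&gt;CELERY_&lt;/code&gt; names assume the app loads Django settings with &lt;code&gt;namespace="CELERY"&lt;/code&gt;.&lt;/p&gt;

```python
# settings.py (sketch): host, port and master name are placeholders.
SENTINEL_HOSTS = [("host.docker.internal", 26379)]
SENTINEL_MASTER_NAME = "mymaster"  # assumption: must match the name in sentinel.conf

# Celery accepts several sentinel:// URLs joined by ";", one per Sentinel node.
CELERY_BROKER_URL = ";".join(
    f"sentinel://{host}:{port}" for host, port in SENTINEL_HOSTS
)
CELERY_BROKER_TRANSPORT_OPTIONS = {"master_name": SENTINEL_MASTER_NAME}

# The result backend takes the same URL form and its own transport options.
CELERY_RESULT_BACKEND = CELERY_BROKER_URL
CELERY_RESULT_BACKEND_TRANSPORT_OPTIONS = {"master_name": SENTINEL_MASTER_NAME}
```

&lt;p&gt;Listing every Sentinel node in the URL, not just one, lets clients still discover the current master when a single Sentinel is unreachable.&lt;/p&gt;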



&lt;p&gt;Validation steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Django cache → successful round-trip&lt;/li&gt;
&lt;li&gt;Celery broker → connected via Sentinel&lt;/li&gt;
&lt;li&gt;Celery result backend → &lt;code&gt;SentinelBackend&lt;/code&gt; initialized&lt;/li&gt;
&lt;li&gt;Test suite passed:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  pytest tests/test_settings_redis_sentinel.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this stage, the system is fully &lt;strong&gt;Sentinel-aware&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability with Prometheus
&lt;/h2&gt;

&lt;p&gt;After pointing &lt;code&gt;redis_exporter&lt;/code&gt; at Sentinel, these key metrics were exposed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;redis_sentinel_master_status&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;redis_sentinel_master_ok_sentinels&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;redis_sentinel_master_ok_slaves&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;redis_sentinel_masters&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Verification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="n"&gt;redis_instance_info&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;redis_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sentinel"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;tcp_port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"26379"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This confirms monitoring is tracking &lt;strong&gt;cluster state&lt;/strong&gt;, not a single node.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failover Drill Walkthrough
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Initial State
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    Sentinel --&amp;gt;|Master| Redis1["172.20.0.3:6379"]
    Sentinel --&amp;gt; Redis2["Replica"]
    Sentinel --&amp;gt; Redis3["Replica"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prometheus reported:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;master_address&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"172.20.0.3:6379"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Induced Failure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Current master was stopped manually&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Sentinel Election
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    Sentinel --&amp;gt;|New Master| Redis2["172.20.0.2:6379"]
    Sentinel --&amp;gt; Redis3["Replica"]
    Sentinel --&amp;gt; Redis1["Down"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;New master elected on &lt;strong&gt;first poll&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Prometheus updated on next scrape&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the Sentinel side, failover was immediate and correct.&lt;/p&gt;




&lt;h2&gt;
  
  
  Celery Behavior During Failover
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Timeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sequenceDiagram
    participant App as Django App
    participant Celery
    participant Sentinel
    participant Redis

    App-&amp;gt;&amp;gt;Celery: Submit Task
    Celery-&amp;gt;&amp;gt;Redis: Send to Master
    Redis--&amp;gt;&amp;gt;Celery: Connection Lost

    Sentinel-&amp;gt;&amp;gt;Sentinel: Elect New Master

    Celery-&amp;gt;&amp;gt;Sentinel: Retry Connection
    Note over Celery: ~54.7s delay

    Celery-&amp;gt;&amp;gt;Redis: Reconnect to New Master
    Redis--&amp;gt;&amp;gt;Celery: OK

    Celery--&amp;gt;&amp;gt;App: Task SUCCESS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Observed Task
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Task ID: &lt;code&gt;9b57ba3b-a707-4c13-9255-d74de411b64b&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Status during failover: &lt;code&gt;PENDING&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Delay: &lt;strong&gt;~54.7 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Final state: &lt;code&gt;SUCCESS&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Performance Impact
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Normal operation&lt;/td&gt;
&lt;td&gt;Immediate execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;During failover&lt;/td&gt;
&lt;td&gt;~55s delay&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Post-recovery&lt;/td&gt;
&lt;td&gt;Normal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Production Readiness Assessment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Works
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Redis Sentinel failover is reliable&lt;/li&gt;
&lt;li&gt;Prometheus reflects cluster changes correctly&lt;/li&gt;
&lt;li&gt;Django cache survives failover&lt;/li&gt;
&lt;li&gt;No task loss in Celery&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What Needs Attention
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Celery introduces &lt;strong&gt;significant delay during failover&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Reconnection is not instantaneous&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  When This Architecture Is Production-Ready
&lt;/h2&gt;

&lt;p&gt;Use this setup if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tasks are &lt;strong&gt;asynchronous/background&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Eventual completion is acceptable&lt;/li&gt;
&lt;li&gt;Temporary latency spikes are tolerable&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  When This Is Not Enough
&lt;/h2&gt;

&lt;p&gt;Avoid this setup (as-is) if you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time task execution&lt;/li&gt;
&lt;li&gt;Sub-10s failover recovery&lt;/li&gt;
&lt;li&gt;User-facing async operations&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How to Reduce Failover Latency
&lt;/h2&gt;

&lt;p&gt;To push recovery closer to &lt;strong&gt;10–15 seconds&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tune Celery broker retry settings&lt;/li&gt;
&lt;li&gt;Reduce reconnect backoff intervals&lt;/li&gt;
&lt;li&gt;Optimize worker heartbeat and visibility timeout&lt;/li&gt;
&lt;li&gt;Re-run failover drills with timing instrumentation&lt;/li&gt;
&lt;/ul&gt;
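&lt;p&gt;As a starting point for the first two bullets, here is a hedged sketch of Celery settings for a Sentinel-backed broker. The host names, the master name &lt;code&gt;mymaster&lt;/code&gt;, and every number are placeholders, not the values used in this drill. The option names are ones the Redis transport reads as far as I know, so check them against your Celery and kombu versions before relying on them.&lt;/p&gt;

```python
# Hypothetical Celery settings for a Sentinel-backed broker.
# Hosts, master name, and all timings are placeholders to tune against
# your own failover drills; none of them were measured in this test.
SENTINEL_HOSTS = ["sentinel-1", "sentinel-2", "sentinel-3"]

# One sentinel:// URL per Sentinel node, joined with semicolons.
broker_url = ";".join(f"sentinel://{host}:26379" for host in SENTINEL_HOSTS)

broker_transport_options = {
    "master_name": "mymaster",      # must match the name in sentinel.conf
    "socket_timeout": 5,            # seconds before a stalled read or write fails
    "socket_connect_timeout": 3,    # seconds before a connect attempt fails
    "socket_keepalive": True,       # lets the OS notice dead peers sooner
    "health_check_interval": 10,    # seconds between connection health checks
    "retry_on_timeout": True,
    "visibility_timeout": 3600,     # seconds before an unacked task is redelivered
}

broker_connection_retry = True
broker_connection_retry_on_startup = True
broker_connection_max_retries = None   # None means keep retrying
broker_connection_timeout = 4.0        # seconds per connection attempt
```

&lt;p&gt;Shorter socket timeouts make a dead master surface as an error sooner, which is what lets the worker go back to Sentinel for the new address. Change one value at a time and re-run the drill so each change can be tied to a measured difference.&lt;/p&gt;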




&lt;h2&gt;
  
  
  Key Takeaway
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Redis Sentinel ensures infrastructure recovery.&lt;br&gt;
Celery determines how fast your system actually resumes work.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sentinel recovery: &lt;strong&gt;effectively instant&lt;/strong&gt; (new master elected on the first poll)
&lt;/li&gt;
&lt;li&gt;Application recovery: &lt;strong&gt;~55 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That gap is the real engineering challenge.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;If you're using Redis Sentinel with Celery:&lt;/p&gt;

&lt;p&gt;Don’t stop at:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Failover works.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Measure:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How long until my system behaves normally again?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because that’s what production users experience.&lt;/p&gt;
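&lt;p&gt;One way to put a number on that question is a small polling helper. This is a sketch, not the instrumentation used in the drill above: &lt;code&gt;measure_recovery&lt;/code&gt; and the &lt;code&gt;probe&lt;/code&gt; callable are hypothetical names. In a real drill the probe might submit a trivial Celery task and report whether it completed.&lt;/p&gt;

```python
import time


def measure_recovery(probe, max_attempts=600, interval=0.1,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll `probe` until it returns True.

    Returns the elapsed seconds, or None if the probe never succeeded
    within `max_attempts` polls.
    """
    start = clock()
    for _ in range(max_attempts):
        try:
            if probe():
                return clock() - start
        except Exception:
            # A failing probe (for example a connection error) simply
            # means "not recovered yet", so keep polling.
            pass
        sleep(interval)
    return None


# Demo with a fake probe that "recovers" on its third call.
calls = {"n": 0}


def fake_probe():
    calls["n"] += 1
    return calls["n"] == 3


elapsed = measure_recovery(fake_probe, interval=0.01)
print(f"recovered after {elapsed:.2f}s and {calls['n']} probes")
```

&lt;p&gt;Start the helper at the moment the master is stopped, and the returned value is the application-level recovery time rather than the Sentinel election time.&lt;/p&gt;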




</description>
      <category>devops</category>
      <category>distributedsystems</category>
      <category>django</category>
      <category>redis</category>
    </item>
    <item>
      <title>Escaping Cache Fragmentation: How Misconfigured PHP Workers Flooded My Token System</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sun, 22 Mar 2026 18:02:23 +0000</pubDate>
      <link>https://dev.to/rahim8050/escaping-cache-fragmentation-how-misconfigured-php-workers-flooded-my-token-system-2ijb</link>
      <guid>https://dev.to/rahim8050/escaping-cache-fragmentation-how-misconfigured-php-workers-flooded-my-token-system-2ijb</guid>
      <description>&lt;h2&gt;
  
  
  🚨 The Symptom
&lt;/h2&gt;

&lt;p&gt;I started noticing something strange in my observability stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integration tokens were being minted repeatedly&lt;/li&gt;
&lt;li&gt;My token endpoint showed activity even when no user interaction was happening&lt;/li&gt;
&lt;li&gt;Metrics suggested constant “traffic” to an otherwise idle system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At first glance, it looked like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A security issue&lt;/li&gt;
&lt;li&gt;A rogue client&lt;/li&gt;
&lt;li&gt;Or a broken API consumer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It was none of those.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔍 The Root Cause
&lt;/h2&gt;

&lt;p&gt;The issue came down to a subtle but critical architectural mistake:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;I was using a non-shared cache in a multi-worker environment.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Stack involved:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;PHP-FPM (2 workers)&lt;/li&gt;
&lt;li&gt;APCu (in-memory cache)&lt;/li&gt;
&lt;li&gt;Token-based integration between services&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ⚙️ What Went Wrong
&lt;/h2&gt;

&lt;p&gt;APCu is &lt;strong&gt;process-local&lt;/strong&gt;, not shared.&lt;/p&gt;

&lt;p&gt;That means:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Worker A cache ≠ Worker B cache
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each PHP-FPM worker had its own isolated memory.&lt;/p&gt;




&lt;h2&gt;
  
  
  💥 The Cascade Effect
&lt;/h2&gt;

&lt;p&gt;My token logic was straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;mint_new_token&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But in reality, the system behaved like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Request hits Worker A → token exists → OK&lt;/li&gt;
&lt;li&gt;Next request hits Worker B → cache miss → mint new token&lt;/li&gt;
&lt;li&gt;Repeat across workers → continuous token regeneration&lt;/li&gt;
&lt;/ol&gt;
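&lt;p&gt;The steps above can be reproduced with a toy simulation. It is written in Python to match the other snippets in this post, not in PHP, and the plain dicts stand in for per-worker caches. Round-robin routing is an assumption. Expiry and worker recycling are not modeled, although both would make each worker mint again and again.&lt;/p&gt;

```python
def count_mints(num_requests, worker_caches):
    """Route requests round-robin across caches and count token mints."""
    mints = 0
    for i in range(num_requests):
        cache = worker_caches[i % len(worker_caches)]  # pick the next worker
        if "token" not in cache:
            mints += 1
            cache["token"] = f"token-{mints}"
    return mints


# Two workers, each with a private cache.
isolated = [{}, {}]

# Two workers looking at the same store.
shared_store = {}
shared = [shared_store, shared_store]

print(count_mints(10, isolated))  # 2
print(count_mints(10, shared))    # 1
```

&lt;p&gt;With isolated caches the mint count scales with the number of workers per cache lifetime. With a shared store it stays at one.&lt;/p&gt;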




&lt;h2&gt;
  
  
  📈 Why Observability Looked “Wrong”
&lt;/h2&gt;

&lt;p&gt;From the outside, it looked like traffic was hitting the token endpoint.&lt;/p&gt;

&lt;p&gt;But in reality:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The system was generating its own traffic due to cache inconsistency.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a key lesson:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not all traffic is external&lt;/li&gt;
&lt;li&gt;Some is &lt;strong&gt;emergent behavior from system design&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ✅ The Fix
&lt;/h2&gt;

&lt;p&gt;I switched from APCu to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redis (shared cache)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;All workers → same cache → consistent token state
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Result:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Tokens minted once&lt;/li&gt;
&lt;li&gt;Reused across all workers&lt;/li&gt;
&lt;li&gt;Metrics stabilized instantly&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔒 Production Hardening (What I Added Next)
&lt;/h2&gt;

&lt;p&gt;Fixing the cache wasn’t enough — I hardened the system further.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Distributed Locking
&lt;/h3&gt;

&lt;p&gt;To prevent race conditions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="n"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;

&lt;span class="n"&gt;acquire&lt;/span&gt; &lt;span class="n"&gt;lock&lt;/span&gt;
    &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;check&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;
    &lt;span class="n"&gt;mint&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;still&lt;/span&gt; &lt;span class="n"&gt;missing&lt;/span&gt;
&lt;span class="n"&gt;release&lt;/span&gt; &lt;span class="n"&gt;lock&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
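&lt;p&gt;The pseudocode above is the double-checked locking pattern. Here is a runnable sketch in which an in-process &lt;code&gt;threading.Lock&lt;/code&gt; stands in for the distributed lock and a dict stands in for the shared cache. &lt;code&gt;TokenStore&lt;/code&gt; and &lt;code&gt;mint&lt;/code&gt; are hypothetical names. With Redis, a lock like this is commonly built on &lt;code&gt;SET key value NX PX ttl&lt;/code&gt;.&lt;/p&gt;

```python
import threading


class TokenStore:
    """In-process stand-in for a shared cache guarded by a lock."""

    def __init__(self, mint):
        self._mint = mint
        self._cache = {}
        self._lock = threading.Lock()

    def get_token(self):
        token = self._cache.get("token")
        if token is not None:
            return token  # fast path: no lock needed on a hit

        with self._lock:
            # Re-check: another caller may have minted while we waited.
            token = self._cache.get("token")
            if token is None:
                token = self._mint()
                self._cache["token"] = token
            return token


mint_calls = []


def mint():
    mint_calls.append(1)
    return f"token-{len(mint_calls)}"


store = TokenStore(mint)
threads = [threading.Thread(target=store.get_token) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(mint_calls))  # 1
```

&lt;p&gt;The second cache check inside the lock is what stops callers that queued up behind the first one from minting again.&lt;/p&gt;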






&lt;h3&gt;
  
  
  2. TTL Buffering
&lt;/h3&gt;

&lt;p&gt;Avoid edge expiration issues:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cache_ttl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;token_expiry&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;safety_margin&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
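&lt;p&gt;As a worked example of that formula, here is a small helper. The 60-second default margin is an assumed value, not the one used in this system.&lt;/p&gt;

```python
def cache_ttl(expires_in_seconds, safety_margin_seconds=60):
    """Seconds to cache a token so it leaves the cache before it expires upstream."""
    return max(expires_in_seconds - safety_margin_seconds, 0)


print(cache_ttl(3600))  # 3540
print(cache_ttl(30))    # 0
```

&lt;p&gt;Treat a result of 0 as "do not cache", because cache backends disagree on what a zero TTL means.&lt;/p&gt;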






&lt;h3&gt;
  
  
  3. Observability Metrics
&lt;/h3&gt;

&lt;p&gt;I added:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;token_cache_hits&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;token_cache_misses&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;token_mint_count&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now anomalies show up immediately.&lt;/p&gt;
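&lt;p&gt;A sketch of how those three counters relate to each other. It uses an in-process &lt;code&gt;Counter&lt;/code&gt; instead of a real metrics client, and &lt;code&gt;mint_ratio&lt;/code&gt; is a hypothetical derived signal, not something from the original system.&lt;/p&gt;

```python
from collections import Counter

metrics = Counter()


def get_token(cache, mint):
    """Return the cached token, minting one on a miss, and record metrics."""
    token = cache.get("token")
    if token is not None:
        metrics["token_cache_hits"] += 1
        return token

    metrics["token_cache_misses"] += 1
    token = mint()
    cache["token"] = token
    metrics["token_mint_count"] += 1
    return token


def mint_ratio():
    """Mints per lookup; a value that stays high suggests the cache is not shared."""
    lookups = metrics["token_cache_hits"] + metrics["token_cache_misses"]
    return metrics["token_mint_count"] / lookups if lookups else 0.0


cache = {}
for _ in range(5):
    get_token(cache, lambda: "token-abc")

print(dict(metrics))
print(mint_ratio())  # 0.2
```

&lt;p&gt;In a healthy system the ratio trends toward zero between token expiries. The fragmented setup described earlier would keep it well above that.&lt;/p&gt;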




&lt;h2&gt;
  
  
  🧠 Key Takeaway
&lt;/h2&gt;

&lt;p&gt;This wasn’t just a bug.&lt;/p&gt;

&lt;p&gt;It was a &lt;strong&gt;distributed systems failure mode&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cache locality + multi-worker architecture → inconsistent state → emergent traffic&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  ⚡ Final Insight
&lt;/h2&gt;

&lt;p&gt;If your system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs multiple workers&lt;/li&gt;
&lt;li&gt;Uses in-memory caching&lt;/li&gt;
&lt;li&gt;Relies on shared state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then this rule applies:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;If your cache isn’t shared, your state isn’t real.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🔗 Closing
&lt;/h2&gt;

&lt;p&gt;This issue reinforced something critical in my engineering journey:&lt;/p&gt;

&lt;p&gt;You don’t debug systems by staring at code —&lt;br&gt;
you debug them by understanding &lt;strong&gt;how state flows across boundaries&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;If you're building distributed APIs, token systems, or high-concurrency services —&lt;br&gt;
this is one edge case worth designing for early.&lt;/p&gt;




</description>
      <category>devops</category>
      <category>distributedsystems</category>
      <category>php</category>
      <category>webdev</category>
    </item>
    <item>
      <title>From 80-Second APIs to Sub-Second: Rebuilding a Geospatial Backend with Async Pipelines</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sat, 21 Mar 2026 16:37:10 +0000</pubDate>
      <link>https://dev.to/rahim8050/from-80-second-apis-to-sub-second-rebuilding-a-geospatial-backend-with-async-pipelines-h81</link>
      <guid>https://dev.to/rahim8050/from-80-second-apis-to-sub-second-rebuilding-a-geospatial-backend-with-async-pipelines-h81</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;From 80-Second APIs to Sub-Second: Fixing Latency with Async Pipelines (Django + Celery)&lt;/strong&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;At some point, every backend engineer hits this wall:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The API works perfectly… until it doesn’t.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I hit that wall with a farm analytics endpoint computing NDVI (Normalized Difference Vegetation Index) from satellite imagery. The system was correct, the logic was sound, and the results were accurate.&lt;/p&gt;

&lt;p&gt;But the numbers told a different story:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;P95 latency: 1.25 minutes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s not an API. That’s a blocking compute job pretending to be one.&lt;/p&gt;

&lt;p&gt;This is the story of how I redesigned the system—from a synchronous request-driven model to an asynchronous data pipeline—and brought latency down to &lt;strong&gt;sub-second performance (P95 ≈ 725ms)&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Original Architecture (The Hidden Problem)
&lt;/h2&gt;

&lt;p&gt;At first glance, the system looked clean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Client]
   ↓
[Django API]
   ↓
[STAC API → Satellite Data]
   ↓
[Raster Processing (NDVI)]
   ↓
[Response]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What happened on each request?
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Query satellite imagery via STAC&lt;/li&gt;
&lt;li&gt;Fetch raster bands (Red &amp;amp; NIR) from remote storage&lt;/li&gt;
&lt;li&gt;Process NDVI using rasterio&lt;/li&gt;
&lt;li&gt;Aggregate coverage&lt;/li&gt;
&lt;li&gt;Return result&lt;/li&gt;
&lt;/ol&gt;
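&lt;p&gt;For reference, step 3 reduces to the standard NDVI formula, &lt;code&gt;(NIR - Red) / (NIR + Red)&lt;/code&gt;, applied per pixel. The sketch below uses plain Python lists instead of rasterio arrays. Returning 0.0 for pixels with no signal and averaging the result are assumptions, not the article's actual coverage logic.&lt;/p&gt;

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel."""
    denom = nir + red
    if denom == 0:
        return 0.0  # assumption: treat no-signal pixels as 0
    return (nir - red) / denom


def mean_ndvi(nir_band, red_band):
    """Average NDVI over two equally sized bands (flat lists of reflectances)."""
    values = [ndvi(n, r) for n, r in zip(nir_band, red_band)]
    return sum(values) / len(values) if values else 0.0


# For Sentinel-2, B08 is the NIR band and B04 is the red band.
b08 = [8, 6]
b04 = [2, 2]

print(ndvi(8, 2))                     # 0.6
print(round(mean_ndvi(b08, b04), 2))  # 0.55
```

&lt;p&gt;The arithmetic is cheap. The cost in the original request path came from fetching and decoding the bands, which is why moving the work out of the request matters more than optimizing the formula.&lt;/p&gt;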

&lt;h3&gt;
  
  
  Why this seemed fine
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;It worked locally&lt;/li&gt;
&lt;li&gt;It returned correct data&lt;/li&gt;
&lt;li&gt;It followed a “pure API” mindset&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But under the hood:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remote I/O (S3-backed satellite data)&lt;/li&gt;
&lt;li&gt;Heavy raster decoding (JPEG2000)&lt;/li&gt;
&lt;li&gt;Sequential band reads&lt;/li&gt;
&lt;li&gt;Full computation per request&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Breaking Point
&lt;/h2&gt;

&lt;p&gt;Logs told the truth.&lt;/p&gt;

&lt;p&gt;Each request looked like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;STAC request → ~5s
Raster read (B04) → ~5–10s
Raster read (B08) → ~5–10s
Processing → ~5s+
Total → ~80+ seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the key realization:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I wasn’t building an API—I was executing a geospatial compute pipeline on every request.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Core Insight
&lt;/h2&gt;

&lt;p&gt;This is the shift that changes everything:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;APIs should &lt;strong&gt;serve data&lt;/strong&gt;, not &lt;strong&gt;compute it on demand&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The problem wasn’t Python.&lt;br&gt;
The problem wasn’t Django.&lt;br&gt;
The problem was &lt;strong&gt;architecture&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  The New Architecture (Async Pipeline)
&lt;/h2&gt;

&lt;p&gt;I redesigned the system around &lt;strong&gt;asynchronous computation + caching&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;             (Scheduled / Triggered)
                    ↓
             [Celery Worker]
                    ↓
         [NDVI Computation Pipeline]
                    ↓
             [Redis / Database]
                    ↓
[Client] → [Django API] → [Cache Lookup]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key changes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;NDVI computation moved out of the request path&lt;/li&gt;
&lt;li&gt;Results cached in Redis&lt;/li&gt;
&lt;li&gt;Background jobs compute and refresh data&lt;/li&gt;
&lt;li&gt;API returns immediately with no heavy compute: cached data on a hit, a &lt;code&gt;processing&lt;/code&gt; placeholder on a miss&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Diagram 1 — Before vs After
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Before (Request-driven)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request
   ↓
STAC API
   ↓
Raster I/O
   ↓
NDVI Compute
   ↓
Response (80s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  After (Pipeline-driven)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request → Cache → Response (~725ms P95)
              ↓ (miss)
         Async Task
              ↓
       Compute + Store
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Fast API Path (Non-blocking)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.core.cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ndvi.tasks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;compute_farm_state_coverage&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_farm_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;cache_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;farm_state:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cache_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;

    &lt;span class="n"&gt;compute_farm_state_coverage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coverage_pct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
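&lt;p&gt;One thing the view above does not handle is repeated misses: every request that arrives before the task finishes enqueues another task. A possible guard, sketched here, is a short-lived "pending" marker set with &lt;code&gt;cache.add&lt;/code&gt;, which stores a key only if it is absent. &lt;code&gt;InMemoryCache&lt;/code&gt; is a stand-in so the sketch runs without Django. The marker key and the 300-second window are assumptions, chosen to exceed the 60–90s task duration reported later in this post.&lt;/p&gt;

```python
class InMemoryCache:
    """Tiny stand-in exposing get/set/add like Django's cache API (no expiry)."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value, timeout=None):
        self._data[key] = value

    def add(self, key, value, timeout=None):
        # Store only if the key is absent; report whether we stored it.
        if key in self._data:
            return False
        self._data[key] = value
        return True


def get_farm_state(farm_id, cache, enqueue):
    data = cache.get(f"farm_state:{farm_id}")
    if data is not None:
        return data

    # Only the first miss in the window wins the marker and enqueues work.
    if cache.add(f"farm_state_pending:{farm_id}", True, timeout=300):
        enqueue(farm_id)

    return {"coverage_pct": None, "status": "processing"}


enqueued = []
cache = InMemoryCache()
for _ in range(3):
    get_farm_state(7, cache, enqueued.append)

print(enqueued)  # [7]
```

&lt;p&gt;In the real view, &lt;code&gt;enqueue&lt;/code&gt; would be the task's &lt;code&gt;delay&lt;/code&gt; call and &lt;code&gt;cache&lt;/code&gt; would be Django's cache object.&lt;/p&gt;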






&lt;h3&gt;
  
  
  2. Celery Task (Async Compute)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;celery&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;shared_task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.core.cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;

&lt;span class="nd"&gt;@shared_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;autoretry_for&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt; &lt;span class="n"&gt;retry_backoff&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_farm_state_coverage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;coverage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_ndvi_coverage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;farm_state:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coverage_pct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;coverage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ready&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  3. Daily Backfill (Critical)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;celery&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;shared_task&lt;/span&gt;

&lt;span class="nd"&gt;@shared_task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;enqueue_daily_farm_state_coverage&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;farm_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_active_farm_ids&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;farm_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;farm_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;compute_farm_state_coverage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
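&lt;p&gt;The backfill task only helps if something triggers it. A minimal Celery beat entry might look like the following. The dotted task path is an assumption based on the &lt;code&gt;ndvi.tasks&lt;/code&gt; import shown earlier, and the schedule name is arbitrary.&lt;/p&gt;

```python
from datetime import timedelta

# Assumed task path: adjust to wherever enqueue_daily_farm_state_coverage
# is actually registered in your project.
beat_schedule = {
    "enqueue-daily-farm-state-coverage": {
        "task": "ndvi.tasks.enqueue_daily_farm_state_coverage",
        "schedule": timedelta(days=1),  # run once every 24 hours
    },
}
```

&lt;p&gt;This dict would typically be assigned to &lt;code&gt;app.conf.beat_schedule&lt;/code&gt;. Note that the task above caches results for 6 hours, so with a once-a-day schedule an entry can sit expired until the next request or backfill recomputes it. Aligning the cache timeout with the schedule interval avoids that gap.&lt;/p&gt;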






&lt;h2&gt;
  
  
  Observability (The Real Upgrade)
&lt;/h2&gt;

&lt;p&gt;Metrics added:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task duration&lt;/li&gt;
&lt;li&gt;Task success/failure&lt;/li&gt;
&lt;li&gt;Queue depth&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Metrics (Grafana Observations)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  📊 Grafana Screenshots
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Latency Graph&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgrbfnr3v7fcvdop951k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgrbfnr3v7fcvdop951k.png" alt="725ms on farm get endpoint" width="342" height="458"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Before
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;P95 latency: ~1.25 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  After
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;API latency: ~725ms (P95)&lt;/li&gt;
&lt;li&gt;Background tasks: 60–90s&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Before vs After Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;API latency&lt;/td&gt;
&lt;td&gt;~1.25 min (P95)&lt;/td&gt;
&lt;td&gt;~725 ms (P95)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System type&lt;/td&gt;
&lt;td&gt;Request-driven&lt;/td&gt;
&lt;td&gt;Pipeline-driven&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scalability&lt;/td&gt;
&lt;td&gt;Poor&lt;/td&gt;
&lt;td&gt;Strong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;Improved&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;I stopped treating my API like a calculator and started treating my system like a data pipeline.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s when everything changed.&lt;/p&gt;




</description>
      <category>devops</category>
      <category>performance</category>
      <category>distributedsystems</category>
      <category>backend</category>
    </item>
  </channel>
</rss>
