<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rahim Ranxx</title>
    <description>The latest articles on DEV Community by Rahim Ranxx (@rahim8050).</description>
    <link>https://dev.to/rahim8050</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3744842%2F2195e1a7-7e61-47f7-9c11-41610936958d.jpg</url>
      <title>DEV Community: Rahim Ranxx</title>
      <link>https://dev.to/rahim8050</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rahim8050"/>
    <language>en</language>
    <item>
      <title>Decoupled Media Streams: A Django and Nextcloud Radio Architecture</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Mon, 25 May 2026 12:41:31 +0000</pubDate>
      <link>https://dev.to/rahim8050/decoupled-media-streams-a-django-and-nextcloud-radio-architecture-4p2b</link>
      <guid>https://dev.to/rahim8050/decoupled-media-streams-a-django-and-nextcloud-radio-architecture-4p2b</guid>
      <description>&lt;p&gt;I recently added a radio integration to a platform built around Django REST Framework (DRF) and Nextcloud.&lt;/p&gt;

&lt;p&gt;The existing architecture was already doing a lot of heavy lifting, powering authentication, farm management, NDVI processing pipelines, weather data ingestion, API key orchestration, and Nextcloud application integrations.&lt;/p&gt;

&lt;p&gt;The new requirement was to introduce internet radio support seamlessly inside the Nextcloud ecosystem. However, there was a strict architectural constraint: we needed to do this without turning Django into a media relay.&lt;/p&gt;

&lt;p&gt;That single distinction shaped the entire implementation strategy.&lt;/p&gt;

&lt;p&gt;The Core Challenge: Avoiding the Proxy Trap&lt;br&gt;
Instead of proxying heavy audio streams through the backend, the architecture relies on direct playback. Django is strictly responsible for exposing radio metadata and playback endpoints (routed under /api/v1/radio/). Meanwhile, the Nextcloud clients stream the audio directly from the source providers, such as BBC, SomaFM, and TuneIn.&lt;/p&gt;

&lt;p&gt;The result is a much cleaner separation of responsibilities:&lt;/p&gt;

&lt;p&gt;Nextcloud UI / Web Client: The presentation layer.&lt;/p&gt;

&lt;p&gt;Django + DRF API: Radio metadata and stream information logic.&lt;/p&gt;

&lt;p&gt;Radio Providers: Media transport, streamed directly to the client.&lt;/p&gt;
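&lt;p&gt;As a rough illustration of that split, here is a minimal sketch of what a metadata-only backend could return. The station record, field names, and URL are hypothetical placeholders, not the real API schema: the point is only that the backend hands the client a provider URL and never opens the audio stream itself.&lt;/p&gt;

```python
import json

# Hypothetical station record; field names and the URL are illustrative only.
STATIONS = [
    {
        "id": "example-station",
        "name": "Example Station",
        "provider": "example-provider",
        "stream_url": "https://stream.example.invalid/live.mp3",
    },
]


def station_metadata(station_id):
    """Return JSON-serializable metadata, including the provider's stream URL.

    The backend only describes where the audio lives; it never relays it.
    """
    for station in STATIONS:
        if station["id"] == station_id:
            return dict(station)
    return None


print(json.dumps(station_metadata("example-station"), indent=2))
```

&lt;p&gt;The client would then point its audio player at &lt;code&gt;stream_url&lt;/code&gt; directly.&lt;/p&gt;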

&lt;p&gt;Architectural Separation in Action&lt;br&gt;
The following diagram illustrates exactly how we achieved this decoupling. The critical path is the thick dark orange arrow (3), which shows the media stream bypassing the Django API server entirely.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fga9nmcoylgbq7e84kuwf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fga9nmcoylgbq7e84kuwf.png" alt="Architecture diagram" width="799" height="436"&gt;&lt;/a&gt;&lt;br&gt;
(Diagram: Metadata requests [blue] are routed through Django, while heavy media streams [orange] flow directly from providers like BBC/SomaFM to the Nextcloud user.)&lt;/p&gt;

&lt;p&gt;This separation keeps audio bandwidth and long-lived stream connections off the backend, while allowing the Nextcloud frontend to integrate radio discovery naturally alongside the rest of the platform's services.&lt;/p&gt;

&lt;p&gt;How Nextcloud Fits Into the Architecture&lt;br&gt;
The radio integration was explicitly designed to plug into a broader, Nextcloud-driven ecosystem rather than operating as an isolated, standalone media application. By defining strict boundaries, each system handles what it does best.&lt;/p&gt;

&lt;p&gt;Nextcloud provides:&lt;/p&gt;

&lt;p&gt;The frontend user experience&lt;/p&gt;

&lt;p&gt;Authenticated user workflows&lt;/p&gt;

&lt;p&gt;App integration surfaces and dashboard presentation&lt;/p&gt;

&lt;p&gt;Native media interaction capabilities&lt;/p&gt;

&lt;p&gt;Django provides:&lt;/p&gt;

&lt;p&gt;API orchestration and provider abstraction&lt;/p&gt;

&lt;p&gt;Station metadata and stream discovery&lt;/p&gt;

&lt;p&gt;Data normalization logic&lt;/p&gt;

&lt;p&gt;Backend consistency&lt;/p&gt;

&lt;p&gt;This clear separation creates a strong boundary between backend platform orchestration and frontend client experience. Instead of embedding complex streaming logic directly into Nextcloud—or forcing Django to waste resources proxying media—the architecture keeps each layer focused entirely on its primary responsibility.&lt;/p&gt;

&lt;p&gt;Built for Future Expansion&lt;br&gt;
Because the backend already behaves like a pure metadata platform rather than a streaming server, new features can be added as metadata and API work.&lt;/p&gt;

&lt;p&gt;Without needing to redesign the streaming layer itself, this setup can support adding:&lt;/p&gt;

&lt;p&gt;Personalized stations and user favorites&lt;/p&gt;

&lt;p&gt;Listening history tracking&lt;/p&gt;

&lt;p&gt;Podcast aggregation&lt;/p&gt;

&lt;p&gt;Recommendation systems&lt;/p&gt;

&lt;p&gt;Analytics pipelines&lt;/p&gt;

&lt;p&gt;Multi-provider federation&lt;/p&gt;

&lt;p&gt;By treating media transport and metadata orchestration as two distinct problems, the integration remains scalable, fast, and ready for whatever features the platform requires next.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>backend</category>
      <category>api</category>
      <category>devops</category>
    </item>
    <item>
      <title>Debugging a Cross-Language HMAC Signature Failure Between Nextcloud and Django</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sat, 16 May 2026 14:47:34 +0000</pubDate>
      <link>https://dev.to/rahim8050/debugging-a-cross-language-hmac-signature-failure-between-nextcloud-and-django-3bfa</link>
      <guid>https://dev.to/rahim8050/debugging-a-cross-language-hmac-signature-failure-between-nextcloud-and-django-3bfa</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;A few days ago, I hit a frustrating issue while integrating a custom Nextcloud application with a Django REST Framework backend.&lt;/p&gt;

&lt;p&gt;Everything looked correct:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;shared HMAC secret ✔️&lt;/li&gt;
&lt;li&gt;canonical request string ✔️&lt;/li&gt;
&lt;li&gt;HMAC-SHA256 ✔️&lt;/li&gt;
&lt;li&gt;timestamps synchronized ✔️&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Yet every authenticated request failed with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;invalid nextcloud signature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The interesting part?&lt;/p&gt;

&lt;p&gt;Both implementations were technically correct.&lt;/p&gt;

&lt;p&gt;The failure came from something much smaller — and much more dangerous in distributed systems:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Different string encodings of the exact same HMAC digest.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This article walks through the full debugging process, the root cause, and the engineering lessons learned from debugging cryptographic interoperability between PHP and Python services.&lt;/p&gt;




&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;

&lt;p&gt;The integration architecture looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────┐
│  Nextcloud App (PHP) │
│  Generates HMAC      │
└──────────┬───────────┘
           │
           │ Signed HTTP Request
           ▼
┌──────────────────────┐
│ Django DRF Backend   │
│ Verifies Signature   │
└──────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The request flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Nextcloud generates a canonical request string&lt;/li&gt;
&lt;li&gt;PHP computes an HMAC-SHA256 signature&lt;/li&gt;
&lt;li&gt;Signature is attached to request headers&lt;/li&gt;
&lt;li&gt;Django reconstructs the canonical string&lt;/li&gt;
&lt;li&gt;Django recomputes the HMAC&lt;/li&gt;
&lt;li&gt;Signatures are compared&lt;/li&gt;
&lt;/ol&gt;
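&lt;p&gt;To make steps 1 and 4 concrete, here is a minimal sketch of a canonical string builder. The field order, the newline separator, and the body hash are assumptions chosen for illustration, not the format this integration actually uses. What matters is that both sides build the string from the same parts in the same order.&lt;/p&gt;

```python
import hashlib


def build_canonical_string(method, path, timestamp, body):
    """Join request parts in a fixed order so both sides sign identical bytes.

    The field order and newline separator are assumptions for illustration.
    """
    body_hash = hashlib.sha256(body).hexdigest()
    return "\n".join([method.upper(), path, str(timestamp), body_hash])


canonical = build_canonical_string(
    "get", "/api/v1/integrations/nextcloud/ping/", 1778841776, b""
)
print(canonical)
```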

&lt;p&gt;Simple in theory.&lt;/p&gt;

&lt;p&gt;Except it kept failing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Initial Symptoms
&lt;/h2&gt;

&lt;p&gt;The backend logs showed repeated authorization failures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nextcloud_hmac.denied
code=invalid_signature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even more confusing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the integration had worked before&lt;/li&gt;
&lt;li&gt;secrets matched&lt;/li&gt;
&lt;li&gt;clocks matched&lt;/li&gt;
&lt;li&gt;payloads matched&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At first glance, it looked like a replay issue, a timestamp skew problem, or cache corruption.&lt;/p&gt;

&lt;p&gt;It turned out to be none of those.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Root Cause
&lt;/h2&gt;

&lt;p&gt;The issue came from a mismatch in how the HMAC digest was encoded.&lt;/p&gt;

&lt;h2&gt;
  
  
  Nextcloud (PHP)
&lt;/h2&gt;

&lt;p&gt;The PHP client generated the signature like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight php"&gt;&lt;code&gt;&lt;span class="nb"&gt;base64_encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;hash_hmac&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'sha256'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;$canonical&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;$secret&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the important detail:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight php"&gt;&lt;code&gt;&lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



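&lt;p&gt;Passing &lt;code&gt;true&lt;/code&gt; as the fourth argument makes &lt;code&gt;hash_hmac&lt;/code&gt; return the raw digest bytes instead of its default lowercase hexadecimal string.&lt;/p&gt;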

&lt;p&gt;Those bytes were then encoded as Base64.&lt;/p&gt;




&lt;h2&gt;
  
  
  Django (Python)
&lt;/h2&gt;

&lt;p&gt;Meanwhile, Django verified signatures like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;hmac&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;canonical&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;hexdigest()&lt;/code&gt; returns a hexadecimal string representation.&lt;/p&gt;

&lt;p&gt;So both systems produced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the same HMAC bytes&lt;/li&gt;
&lt;li&gt;using the same algorithm&lt;/li&gt;
&lt;li&gt;using the same secret&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But converted those bytes into different string formats.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hidden Interoperability Bug
&lt;/h2&gt;

&lt;p&gt;This was the breakthrough moment.&lt;/p&gt;

&lt;p&gt;The exact same digest bytes produced:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hex:
44c39c4ecc7268547ca51db72c6f27125251e6ea8ce3c659d918a9542522b612
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;vs&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Base64:
RMOcTsxyaFR8pR23LG8nElJR5uqM48ZZ2RipVCUithI=
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both values represent the same underlying bytes.&lt;/p&gt;

&lt;p&gt;But comparing the two as strings fails.&lt;/p&gt;
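&lt;p&gt;A quick way to confirm this during debugging is to decode both representations back to bytes before comparing. The sketch below uses the two values shown above; the variable names are mine.&lt;/p&gt;

```python
import base64

hex_signature = "44c39c4ecc7268547ca51db72c6f27125251e6ea8ce3c659d918a9542522b612"
b64_signature = "RMOcTsxyaFR8pR23LG8nElJR5uqM48ZZ2RipVCUithI="

# Decode each textual form back to the underlying digest bytes.
raw_from_hex = bytes.fromhex(hex_signature)
raw_from_b64 = base64.b64decode(b64_signature)

print(raw_from_hex == raw_from_b64)    # compares the decoded bytes
print(hex_signature == b64_signature)  # compares the text as sent on the wire
```

&lt;p&gt;The first comparison succeeds and the second does not, which is exactly the situation the verifier was in.&lt;/p&gt;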




&lt;h2&gt;
  
  
  The Second Bug
&lt;/h2&gt;

&lt;p&gt;While investigating, I found another subtle issue.&lt;/p&gt;

&lt;p&gt;The Django verifier lowercased the incoming signature before comparison:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;signature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;signature&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is harmless for hexadecimal values, where letter case carries no information.&lt;/p&gt;

&lt;p&gt;But Base64 is case-sensitive.&lt;/p&gt;

&lt;p&gt;Meaning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ABC != abc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So even after fixing the encoding mismatch, lowercasing would still break verification.&lt;/p&gt;

&lt;p&gt;This was a protocol normalization bug hiding inside the verification pipeline.&lt;/p&gt;
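&lt;p&gt;The effect is easy to reproduce with the Base64 value from earlier. This is only a small demonstration, not code from the verifier:&lt;/p&gt;

```python
import base64

b64_signature = "RMOcTsxyaFR8pR23LG8nElJR5uqM48ZZ2RipVCUithI="
lowered = b64_signature.lower()

# Lowercasing changes the characters, so the text no longer matches...
print(lowered == b64_signature)

# ...and the lowered text decodes to entirely different bytes.
print(base64.b64decode(lowered) == base64.b64decode(b64_signature))
```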




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;I updated Django to verify signatures using Base64 instead of hexadecimal.&lt;/p&gt;

&lt;h2&gt;
  
  
  New Verification Function
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hmac&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_hmac_signature_b64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;canonical_string&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Compute Base64 encoded HMAC-SHA256 signature.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;digest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hmac&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;canonical_string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then all verification calls were updated to use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;compute_hmac_signature_b64&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;instead of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, I removed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;from the verification flow.&lt;/p&gt;
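&lt;p&gt;For completeness, here is a sketch of how the comparison step could look once both changes are in place. The function name &lt;code&gt;verify_signature_b64&lt;/code&gt;, the secret, and the canonical string are placeholders of mine, and using &lt;code&gt;hmac.compare_digest&lt;/code&gt; for a constant-time comparison is my suggestion rather than a description of the deployed code.&lt;/p&gt;

```python
import base64
import hashlib
import hmac


def verify_signature_b64(secret, canonical_string, received_signature):
    """Recompute the Base64 HMAC-SHA256 and compare in constant time."""
    digest = hmac.new(
        secret,
        canonical_string.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    expected = base64.b64encode(digest).decode("ascii")
    # No .lower() here: Base64 is case-sensitive, so compare the text as received.
    return hmac.compare_digest(expected, received_signature)


secret = b"example-secret"              # placeholder value
canonical = "GET\n/ping/\n1778841776"   # placeholder canonical string
good = base64.b64encode(
    hmac.new(secret, canonical.encode("utf-8"), hashlib.sha256).digest()
).decode("ascii")

print(verify_signature_b64(secret, canonical, good))
print(verify_signature_b64(secret, canonical, good + "x"))
```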




&lt;h2&gt;
  
  
  Verification Results
&lt;/h2&gt;

&lt;p&gt;After deploying the fix:&lt;/p&gt;

&lt;h2&gt;
  
  
  Ping Endpoint
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET /api/v1/integrations/nextcloud/ping/

200 OK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Token Issuance
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /api/v1/integrations/token/

200 OK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Authentication immediately started working again.&lt;/p&gt;




&lt;h2&gt;
  
  
  Secondary Investigation Findings
&lt;/h2&gt;

&lt;p&gt;While debugging, I validated several other production concerns.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Time Drift
&lt;/h2&gt;

&lt;p&gt;I suspected clock skew initially.&lt;/p&gt;

&lt;p&gt;Both services were checked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nextcloud epoch: 1778841776
Django epoch:    1778841776
Drift:            0 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Time synchronization was perfect.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Shared Secrets
&lt;/h2&gt;

&lt;p&gt;Client IDs and secrets matched correctly across both systems.&lt;/p&gt;

&lt;p&gt;This eliminated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;environment mismatch&lt;/li&gt;
&lt;li&gt;stale secrets&lt;/li&gt;
&lt;li&gt;config drift&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. Redis and Cache State
&lt;/h2&gt;

&lt;p&gt;I flushed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redis&lt;/li&gt;
&lt;li&gt;Django cache&lt;/li&gt;
&lt;li&gt;integration token caches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This helped eliminate stale token artifacts and replay-state inconsistencies.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Infrastructure Validation
&lt;/h2&gt;

&lt;p&gt;I also verified:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;loopback networking&lt;/li&gt;
&lt;li&gt;gunicorn binding&lt;/li&gt;
&lt;li&gt;uvicorn workers&lt;/li&gt;
&lt;li&gt;allowlists&lt;/li&gt;
&lt;li&gt;HTTP dev mode configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this point the investigation became less about cryptography and more about systematic elimination of variables.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why It “Worked Before”
&lt;/h2&gt;

&lt;p&gt;This was the most interesting systems question.&lt;/p&gt;

&lt;p&gt;I had not changed the signing logic recently.&lt;/p&gt;

&lt;p&gt;So why did the failure suddenly appear?&lt;/p&gt;

&lt;p&gt;The likely answer is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Infrastructure state had been masking a latent protocol incompatibility.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Possible contributors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cached tokens&lt;/li&gt;
&lt;li&gt;stale replay windows&lt;/li&gt;
&lt;li&gt;inactive code paths&lt;/li&gt;
&lt;li&gt;existing sessions bypassing verification&lt;/li&gt;
&lt;li&gt;Redis persistence behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is an important engineering lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A system can contain dormant interoperability bugs for weeks before infrastructure conditions expose them.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Engineering Lessons Learned
&lt;/h2&gt;

&lt;h2&gt;
  
  
  1. Cryptographic Bytes ≠ String Representation
&lt;/h2&gt;

&lt;p&gt;HMAC output is binary data.&lt;/p&gt;

&lt;p&gt;Hexadecimal and Base64 are merely different textual encodings of the same bytes.&lt;/p&gt;

&lt;p&gt;They are not interchangeable.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Cross-Language Integrations Need Explicit Contracts
&lt;/h2&gt;

&lt;p&gt;Never assume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;encoding format&lt;/li&gt;
&lt;li&gt;canonicalization rules&lt;/li&gt;
&lt;li&gt;normalization behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Define them explicitly.&lt;/p&gt;

&lt;p&gt;Especially across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PHP&lt;/li&gt;
&lt;li&gt;Python&lt;/li&gt;
&lt;li&gt;Go&lt;/li&gt;
&lt;li&gt;Node.js&lt;/li&gt;
&lt;li&gt;Java&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. Normalization Can Break Security
&lt;/h2&gt;

&lt;p&gt;Lowercasing signatures looked harmless.&lt;/p&gt;

&lt;p&gt;It was not.&lt;/p&gt;

&lt;p&gt;Cryptographic values should only be normalized if the protocol explicitly defines normalization behavior.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Infrastructure State Can Hide Bugs
&lt;/h2&gt;

&lt;p&gt;Cache layers and token persistence can temporarily conceal protocol inconsistencies.&lt;/p&gt;

&lt;p&gt;Sometimes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;restarts&lt;/li&gt;
&lt;li&gt;cache flushes&lt;/li&gt;
&lt;li&gt;clock resets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;suddenly expose issues that already existed.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Production Debugging Requires Elimination Discipline
&lt;/h2&gt;

&lt;p&gt;The investigation involved validating:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clocks&lt;/li&gt;
&lt;li&gt;secrets&lt;/li&gt;
&lt;li&gt;caches&lt;/li&gt;
&lt;li&gt;workers&lt;/li&gt;
&lt;li&gt;networking&lt;/li&gt;
&lt;li&gt;encoding&lt;/li&gt;
&lt;li&gt;replay protection&lt;/li&gt;
&lt;li&gt;request canonicalization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Good debugging is often less about guessing and more about systematically removing uncertainty.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The most dangerous bugs are not always algorithm failures.&lt;/p&gt;

&lt;p&gt;Sometimes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the crypto is correct&lt;/li&gt;
&lt;li&gt;the infrastructure is healthy&lt;/li&gt;
&lt;li&gt;the logic is valid&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…but the protocol contract between systems is inconsistent.&lt;/p&gt;

&lt;p&gt;In this case:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The cryptography was correct on both sides. The protocol contract was not.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And that single mismatch was enough to break the entire authentication flow.&lt;/p&gt;




</description>
      <category>devops</category>
      <category>security</category>
      <category>django</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Why I Added Redis Streams Between My Django API and Celery Workers.</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sun, 03 May 2026 07:01:10 +0000</pubDate>
      <link>https://dev.to/rahim8050/why-i-added-redis-streams-between-my-django-api-and-celery-workers-22bl</link>
      <guid>https://dev.to/rahim8050/why-i-added-redis-streams-between-my-django-api-and-celery-workers-22bl</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A practical engineering breakdown of how I introduced Redis Streams into a live Django + Celery NDVI pipeline without rewriting the worker layer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I run a Django API backed by Celery workers for NDVI processing workloads.&lt;/p&gt;

&lt;p&gt;The execution layer worked fine.&lt;/p&gt;

&lt;p&gt;The queue semantics didn’t.&lt;/p&gt;

&lt;p&gt;I needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;durable ingestion&lt;/li&gt;
&lt;li&gt;replay visibility&lt;/li&gt;
&lt;li&gt;dead-letter handling&lt;/li&gt;
&lt;li&gt;stale consumer recovery&lt;/li&gt;
&lt;li&gt;rollback safety&lt;/li&gt;
&lt;li&gt;observability during incidents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…but I did not want to rewrite the worker system or destabilize production.&lt;/p&gt;

&lt;p&gt;So instead of replacing Celery, I inserted Redis Streams between the API and the workers.&lt;/p&gt;

&lt;p&gt;This article explains why I made that decision, how the architecture works, and what I learned while implementing reliable stream-backed NDVI ingestion in Django.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Original Problem
&lt;/h2&gt;

&lt;p&gt;The problem was not task execution.&lt;/p&gt;

&lt;p&gt;The problem was everything before execution.&lt;/p&gt;

&lt;p&gt;Originally, NDVI ingestion looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Django API → Celery Broker → Celery Workers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At first, this worked well.&lt;/p&gt;

&lt;p&gt;But as the system evolved, operational gaps became more obvious:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Direct &lt;code&gt;.delay()&lt;/code&gt; calls tightly coupled request ingestion to broker behavior.&lt;/li&gt;
&lt;li&gt;Queue visibility was limited during incidents.&lt;/li&gt;
&lt;li&gt;Failed ingestion paths were harder to replay safely.&lt;/li&gt;
&lt;li&gt;In-flight recovery semantics were weak.&lt;/li&gt;
&lt;li&gt;There was no dead-letter workflow for poisoned messages.&lt;/li&gt;
&lt;li&gt;Worker interruptions could leave messages in uncertain states.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture was fast.&lt;/p&gt;

&lt;p&gt;It was not durable enough.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I Did Not Replace Celery
&lt;/h2&gt;

&lt;p&gt;One of the biggest architectural decisions was choosing not to replace Celery.&lt;/p&gt;

&lt;p&gt;That decision reduced risk dramatically.&lt;/p&gt;

&lt;p&gt;Celery already handled:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;worker orchestration&lt;/li&gt;
&lt;li&gt;task retries&lt;/li&gt;
&lt;li&gt;execution concurrency&lt;/li&gt;
&lt;li&gt;scheduling&lt;/li&gt;
&lt;li&gt;routing&lt;/li&gt;
&lt;li&gt;operational familiarity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Replacing the worker layer would have increased migration complexity and expanded the failure domain.&lt;/p&gt;

&lt;p&gt;Instead, I treated Redis Streams as an ingestion and reliability layer.&lt;/p&gt;

&lt;p&gt;The resulting architecture looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Django API
    ↓
Dispatch Boundary
    ↓
Redis Streams (XADD)
    ↓
Consumer Group (XREADGROUP)
    ↓
Celery Queue
    ↓
NDVI Workers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Failures route into a dead-letter stream.&lt;/p&gt;

&lt;p&gt;Stale consumers are recovered through reclaim logic.&lt;/p&gt;

&lt;p&gt;Most importantly, rollback remains simple.&lt;/p&gt;




&lt;h2&gt;
  
  
  Centralizing Dispatch Before Adding Redis Streams
&lt;/h2&gt;

&lt;p&gt;Before introducing Redis Streams, I centralized every NDVI enqueue path.&lt;/p&gt;

&lt;p&gt;This was the most important migration step.&lt;/p&gt;

&lt;p&gt;Instead of scattering direct &lt;code&gt;.delay()&lt;/code&gt; calls across the codebase, everything flowed through dispatch helpers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ndvi.dispatch&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dispatch_ndvi_job&lt;/span&gt;

&lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;enqueue_job&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;span class="nf"&gt;dispatch_ndvi_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That allowed one configuration flag to control the ingestion backend.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;NDVI_QUEUE_BACKEND&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;env&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NDVI_QUEUE_BACKEND&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;celery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Supported modes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;celery
stream
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This created a clean migration boundary.&lt;/p&gt;

&lt;p&gt;The system could switch between direct Celery dispatch and Redis Streams without changing every call site.&lt;/p&gt;

&lt;p&gt;Operationally, this mattered more than the stream code itself.&lt;/p&gt;
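&lt;p&gt;A minimal sketch of such a dispatch boundary, under some assumptions: the two backend functions are hypothetical stand-ins that just return a tuple, and the flag is read from &lt;code&gt;os.environ&lt;/code&gt; instead of Django settings so the example runs on its own.&lt;/p&gt;

```python
import os

# Same flag name as above; reading it from os.environ is an assumption here.
NDVI_QUEUE_BACKEND = os.environ.get("NDVI_QUEUE_BACKEND", "celery")


def send_to_celery(job):
    """Hypothetical stand-in for the direct Celery dispatch path."""
    return ("celery", job["job_id"])


def publish_to_stream(job):
    """Hypothetical stand-in for the Redis Streams XADD path."""
    return ("stream", job["job_id"])


BACKENDS = {"celery": send_to_celery, "stream": publish_to_stream}


def dispatch_ndvi_job(job, backend=None):
    """Route a job through whichever backend the flag selects."""
    name = backend or NDVI_QUEUE_BACKEND
    if name not in BACKENDS:
        raise ValueError(f"unsupported NDVI_QUEUE_BACKEND: {name!r}")
    return BACKENDS[name](job)


print(dispatch_ndvi_job({"job_id": 42}, backend="stream"))
```

&lt;p&gt;Because call sites only ever see &lt;code&gt;dispatch_ndvi_job&lt;/code&gt;, rolling back is a matter of flipping the flag to &lt;code&gt;celery&lt;/code&gt;.&lt;/p&gt;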




&lt;h2&gt;
  
  
  Publishing NDVI Jobs into Redis Streams
&lt;/h2&gt;

&lt;p&gt;The producer layer publishes deterministic NDVI payloads into a Redis stream.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request_hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request_hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;farm_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;job_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enqueue_timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xadd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;maxlen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_MAXLEN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;approximate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key design decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;request_hash&lt;/code&gt; acts as the idempotency key.&lt;/li&gt;
&lt;li&gt;
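&lt;code&gt;MAXLEN ~&lt;/code&gt; trimming on &lt;code&gt;XADD&lt;/code&gt; (the same cap &lt;code&gt;XTRIM&lt;/code&gt; would apply) keeps memory bounded.&lt;/li&gt;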
&lt;li&gt;Stream payloads remain deterministic.&lt;/li&gt;
&lt;li&gt;Producers do not execute business logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The stream became the ingestion ledger.&lt;/p&gt;
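&lt;p&gt;A minimal sketch of a producer helper along these lines. The function names and the required-field list are assumptions for illustration; &lt;code&gt;client&lt;/code&gt; is assumed to be a redis-py client, whose &lt;code&gt;xadd&lt;/code&gt; accepts &lt;code&gt;maxlen&lt;/code&gt; and &lt;code&gt;approximate&lt;/code&gt;. Values are converted to strings because stream fields cannot hold &lt;code&gt;None&lt;/code&gt;.&lt;/p&gt;

```python
from __future__ import annotations

import time
from typing import Any, Mapping

# Assumed minimum set of fields for this sketch.
REQUIRED = ("job_id", "request_hash", "farm_id", "engine", "job_type")


def build_stream_fields(job: Mapping[str, Any], now: float | None = None) -> dict[str, str]:
    """Flatten a job mapping into string fields suitable for XADD."""
    missing = [key for key in REQUIRED if job.get(key) is None]
    if missing:
        raise ValueError(f"missing job fields: {missing}")
    fields = {key: str(job[key]) for key in REQUIRED}
    fields["enqueue_timestamp"] = repr(now if now is not None else time.time())
    return fields


def publish_job(client: Any, stream: str, job: Mapping[str, Any], maxlen: int) -> Any:
    """Append one entry; the producer does nothing beyond building and publishing."""
    return client.xadd(stream, build_stream_fields(job), maxlen=maxlen, approximate=True)
```

&lt;p&gt;Rejecting incomplete jobs before publishing keeps structurally invalid entries out of the stream in the first place.&lt;/p&gt;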




&lt;h2&gt;
  
  
  Redis Streams Consumer Design
&lt;/h2&gt;

&lt;p&gt;The consumer reads from Redis Streams and forwards work into Celery.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xreadgroup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;groupname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_GROUP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;consumername&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;consumer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;streams&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_NAME&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_BLOCK_MS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For every message:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deserialize payload&lt;/li&gt;
&lt;li&gt;Validate structure&lt;/li&gt;
&lt;li&gt;Apply idempotency safeguards&lt;/li&gt;
&lt;li&gt;Enqueue Celery task&lt;/li&gt;
&lt;li&gt;Acknowledge stream entry
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;process_ndvi_job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_GROUP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;message_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The stream consumer remains intentionally thin.&lt;/p&gt;

&lt;p&gt;Its job is reliable transport and recovery.&lt;/p&gt;

&lt;p&gt;Celery still handles execution.&lt;/p&gt;
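&lt;p&gt;The five per-message steps can be sketched as a single function. This is an illustration, not the production consumer: &lt;code&gt;handle_entry&lt;/code&gt;, &lt;code&gt;enqueue&lt;/code&gt;, and &lt;code&gt;is_duplicate&lt;/code&gt; are hypothetical names, and &lt;code&gt;client&lt;/code&gt; is assumed to expose redis-py's &lt;code&gt;xack(name, groupname, *ids)&lt;/code&gt;.&lt;/p&gt;

```python
from typing import Any, Callable, Mapping

REQUIRED_FIELDS = ("job_id", "request_hash")  # assumed minimum for this sketch


def handle_entry(
    client: Any,
    stream: str,
    group: str,
    message_id: str,
    fields: Mapping[str, str],
    enqueue: Callable[[str], None],
    is_duplicate: Callable[[str], bool],
) -> str:
    """Process one stream entry and report what happened to it."""
    if any(not fields.get(name) for name in REQUIRED_FIELDS):
        # Structurally invalid: left unacknowledged so the caller can dead-letter it.
        return "invalid"
    if not is_duplicate(fields["request_hash"]):
        # Enqueue first; if this raises, the entry stays pending for reclaim.
        enqueue(fields["job_id"])
    # Acknowledge only after the enqueue succeeded (or was safely skipped).
    client.xack(stream, group, message_id)
    return "acked"
```

&lt;p&gt;In the article's setup, &lt;code&gt;enqueue&lt;/code&gt; would be something like &lt;code&gt;process_ndvi_job.delay&lt;/code&gt;. Because the acknowledgement comes last, a crash between the two calls produces a duplicate delivery rather than a lost job.&lt;/p&gt;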




&lt;h2&gt;
  
  
  Why Consumer Groups Matter
&lt;/h2&gt;

&lt;p&gt;Redis Streams consumer groups solved several operational problems immediately.&lt;/p&gt;

&lt;p&gt;They provided:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cooperative work distribution&lt;/li&gt;
&lt;li&gt;independent consumer identities&lt;/li&gt;
&lt;li&gt;pending-entry tracking&lt;/li&gt;
&lt;li&gt;reclaim support&lt;/li&gt;
&lt;li&gt;replay visibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike simple queue semantics, Redis Streams expose message lifecycle state.&lt;/p&gt;

&lt;p&gt;That visibility becomes extremely valuable during failures.&lt;/p&gt;

&lt;p&gt;Message lifecycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;XADD → pending → reclaimed → acknowledged
                          ↓
                         DLQ
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This made queue recovery observable instead of implicit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Recovering Stale Messages with XAUTOCLAIM
&lt;/h2&gt;

&lt;p&gt;The most important recovery primitive ended up being &lt;code&gt;XAUTOCLAIM&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If a consumer dies after reading a message but before acknowledging it, the entry remains pending indefinitely unless another consumer reclaims it.&lt;/p&gt;

&lt;p&gt;Without reclaim logic, stream durability is incomplete.&lt;/p&gt;

&lt;p&gt;Example reclaim loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xautoclaim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;groupname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_GROUP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;consumername&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;consumer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;min_idle_time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_CLAIM_IDLE_MS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;start_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0-0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



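&lt;p&gt;&lt;code&gt;XAUTOCLAIM&lt;/code&gt; returns a cursor for the next call together with the claimed entries, which lets healthy consumers page through abandoned work and recover it automatically.&lt;/p&gt;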

&lt;p&gt;That changed the reliability profile of the ingestion pipeline significantly.&lt;/p&gt;
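&lt;p&gt;A hedged sketch of a full reclaim pass, following the cursor until the pending list has been scanned. &lt;code&gt;iter_stale_entries&lt;/code&gt; is a hypothetical helper; &lt;code&gt;client&lt;/code&gt; is assumed to be a redis-py client, whose &lt;code&gt;xautoclaim&lt;/code&gt; reply starts with the next cursor followed by the claimed entries.&lt;/p&gt;

```python
from __future__ import annotations

from typing import Any, Iterator


def iter_stale_entries(
    client: Any,
    stream: str,
    group: str,
    consumer: str,
    min_idle_ms: int,
    batch_size: int = 10,
) -> Iterator[tuple[Any, Any]]:
    """Yield (message_id, fields) for every entry reclaimed in one full pass."""
    cursor: Any = "0-0"
    while True:
        reply = client.xautoclaim(
            stream, group, consumer,
            min_idle_time=min_idle_ms, start_id=cursor, count=batch_size,
        )
        # The reply is [next_cursor, entries]; Redis 7+ appends a list of
        # deleted IDs as a third element, which this sketch ignores.
        cursor, entries = reply[0], reply[1]
        yield from entries
        # A cursor of 0-0 means the scan of the pending list is complete.
        if cursor in ("0-0", b"0-0"):
            break
```

&lt;p&gt;Each reclaimed entry can then go through the same validate, enqueue, acknowledge path as a freshly read one.&lt;/p&gt;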




&lt;h2&gt;
  
  
  Dead-Letter Queue Handling
&lt;/h2&gt;

&lt;p&gt;I also introduced a dedicated dead-letter stream.&lt;/p&gt;

&lt;p&gt;Messages are routed into the DLQ when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;validation fails&lt;/li&gt;
&lt;li&gt;delivery ceilings are exceeded&lt;/li&gt;
&lt;li&gt;payloads become structurally invalid&lt;/li&gt;
&lt;li&gt;repeated execution attempts fail&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xadd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NDVI_STREAM_DLQ_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dlq_payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every DLQ entry includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;original message ID&lt;/li&gt;
&lt;li&gt;delivery count&lt;/li&gt;
&lt;li&gt;failure reason&lt;/li&gt;
&lt;li&gt;serialized payload&lt;/li&gt;
&lt;li&gt;timestamps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This made operational debugging dramatically easier.&lt;/p&gt;
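&lt;p&gt;One way to assemble such an entry, as a sketch. The field names below are assumptions chosen to mirror the list above, not the names used in the real pipeline.&lt;/p&gt;

```python
from __future__ import annotations

import json
import time
from typing import Any, Mapping


def build_dlq_payload(
    message_id: str,
    fields: Mapping[str, Any],
    reason: str,
    delivery_count: int,
    now: float | None = None,
) -> dict[str, str]:
    """Build flat string fields describing why an entry was dead-lettered."""
    return {
        "original_message_id": message_id,
        "delivery_count": str(delivery_count),
        "failure_reason": reason,
        # Serialize the original payload so it can be inspected or replayed later.
        "payload": json.dumps(dict(fields), sort_keys=True, default=str),
        "failed_at": repr(now if now is not None else time.time()),
    }
```

&lt;p&gt;Keeping the original payload as one serialized field means a replay tool can restore it without knowing the schema that was current when the entry failed.&lt;/p&gt;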




&lt;h2&gt;
  
  
  The Hardest Problem: Idempotency
&lt;/h2&gt;

&lt;p&gt;Redis Streams provide at-least-once delivery.&lt;/p&gt;

&lt;p&gt;That means duplicate delivery is expected.&lt;/p&gt;

&lt;p&gt;Exactly-once delivery is not guaranteed.&lt;/p&gt;

&lt;p&gt;To prevent duplicate NDVI execution, I added multiple protection layers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 1: Deterministic Request Hash
&lt;/h2&gt;

&lt;p&gt;Every NDVI job already had a deterministic &lt;code&gt;request_hash&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That became the execution identity.&lt;/p&gt;
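&lt;p&gt;The article does not show how &lt;code&gt;request_hash&lt;/code&gt; is derived. One common approach, shown here purely as an assumption, is to hash a canonical JSON form of the request parameters so that logically equal requests always map to the same key.&lt;/p&gt;

```python
import hashlib
import json
from typing import Any, Mapping


def compute_request_hash(params: Mapping[str, Any]) -> str:
    """Hash a canonical JSON form so equal requests produce equal digests."""
    # Sorted keys and fixed separators remove ordering and whitespace differences.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

&lt;p&gt;The parameter names passed in (farm, date range, engine, and so on) are up to the caller; what matters is that nothing time-dependent ends up in the hashed input.&lt;/p&gt;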

&lt;h2&gt;
  
  
  Layer 2: Distributed Redis Lock
&lt;/h2&gt;

&lt;p&gt;The consumer acquires a Redis lock before execution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;lock_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ndvi:lock:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;request_hash&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



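&lt;p&gt;Acquisition uses &lt;code&gt;SET ... NX&lt;/code&gt; with an expiration (&lt;code&gt;PX&lt;/code&gt; or &lt;code&gt;EX&lt;/code&gt;), so the key is created and given its TTL in a single atomic command. Plain &lt;code&gt;SETNX&lt;/code&gt; followed by a separate &lt;code&gt;EXPIRE&lt;/code&gt; would leave a window in which a crash produces a lock that never expires.&lt;/p&gt;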
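&lt;p&gt;A minimal acquisition sketch, assuming a redis-py client (its &lt;code&gt;set&lt;/code&gt; method accepts &lt;code&gt;nx&lt;/code&gt; and &lt;code&gt;px&lt;/code&gt;). The helper name and the use of a random UUID as the owner token are illustrative choices.&lt;/p&gt;

```python
import uuid
from typing import Any, Optional


def acquire_lock(client: Any, request_hash: str, ttl_ms: int) -> Optional[str]:
    """Try to take the lock; return the owner token, or None if it is held."""
    token = uuid.uuid4().hex
    # SET key token NX PX ttl: create only if absent, with expiry, atomically.
    acquired = client.set(f"ndvi:lock:{request_hash}", token, nx=True, px=ttl_ms)
    return token if acquired else None
```

&lt;p&gt;The caller keeps the returned token and presents it again when releasing the lock.&lt;/p&gt;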

&lt;h2&gt;
  
  
  Layer 3: Token-Based Lock Release
&lt;/h2&gt;

&lt;p&gt;Locks are released through an atomic Lua script.&lt;/p&gt;

&lt;p&gt;This prevents blind deletion.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"get"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"del"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
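&lt;p&gt;Calling that script from Python could look like the sketch below. It assumes a redis-py client, whose &lt;code&gt;eval(script, numkeys, *keys_and_args)&lt;/code&gt; runs the Lua atomically on the server; &lt;code&gt;release_lock&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
from typing import Any

RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""


def release_lock(client: Any, lock_key: str, token: str) -> bool:
    """Delete the lock only if `token` still owns it; True when it was deleted."""
    # One key (lock_key) and one argument (token) are passed to the script.
    return client.eval(RELEASE_SCRIPT, 1, lock_key, token) == 1
```

&lt;p&gt;A &lt;code&gt;False&lt;/code&gt; result is worth logging: it means the lock expired or changed owner while the work was still running.&lt;/p&gt;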



&lt;h2&gt;
  
  
  Layer 4: Database Status Recheck
&lt;/h2&gt;

&lt;p&gt;Before execution begins, the worker re-checks terminal job state.&lt;/p&gt;

&lt;p&gt;This acts as a second safety boundary behind the lock.&lt;/p&gt;

&lt;p&gt;The result is effectively-once execution semantics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;At-least-once delivery + idempotent execution = effectively-once processing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
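&lt;p&gt;The database recheck can be as small as a guard at the top of the task. The status names below are assumptions; the real job model may use different terminal states.&lt;/p&gt;

```python
TERMINAL_STATUSES = frozenset({"completed", "failed", "cancelled"})  # assumed names


def should_execute(current_status: str) -> bool:
    """Return True only when the job has not already reached a terminal state."""
    return current_status.lower() not in TERMINAL_STATUSES
```

&lt;p&gt;A worker would read the job's current status from the database immediately before starting and return early when this guard is &lt;code&gt;False&lt;/code&gt;, which turns a duplicate delivery into a no-op.&lt;/p&gt;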






&lt;h2&gt;
  
  
  Observability Added During the Rollout
&lt;/h2&gt;

&lt;p&gt;One major lesson from this migration:&lt;/p&gt;

&lt;p&gt;Do not enable stream mode before queue visibility exists.&lt;/p&gt;

&lt;p&gt;I added dedicated metrics before enabling the rollout broadly.&lt;/p&gt;

&lt;p&gt;Examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;redis_stream_pending_entries
redis_stream_pending_age_max
ndvi_stream_consumer_heartbeat
ndvi_stream_consumer_failures_total
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I also expanded upstream visibility:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ndvi_upstream_requests_total
ndvi_upstream_failures_total
ndvi_upstream_duration_seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Grafana dashboards now expose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pending stream backlog&lt;/li&gt;
&lt;li&gt;reclaim frequency&lt;/li&gt;
&lt;li&gt;DLQ volume&lt;/li&gt;
&lt;li&gt;consumer liveness&lt;/li&gt;
&lt;li&gt;upstream API failures&lt;/li&gt;
&lt;li&gt;queue drain rate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This transformed rollout decisions from guesswork into measurable operations.&lt;/p&gt;
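&lt;p&gt;As a sketch of where the two pending gauges could come from: redis-py's &lt;code&gt;xpending_range&lt;/code&gt; returns one dictionary per pending entry, including a &lt;code&gt;time_since_delivered&lt;/code&gt; value in milliseconds. Reducing those rows to the metric names above might look like this; reporting the age in seconds, and treating time since last delivery as the age, are assumptions.&lt;/p&gt;

```python
from __future__ import annotations

from typing import Any, Iterable, Mapping


def summarize_pending(rows: Iterable[Mapping[str, Any]]) -> dict[str, float]:
    """Reduce XPENDING detail rows to a backlog size and a maximum idle age."""
    idle_ms = [int(row["time_since_delivered"]) for row in rows]
    return {
        "redis_stream_pending_entries": float(len(idle_ms)),
        # Milliseconds converted to seconds; 0.0 when nothing is pending.
        "redis_stream_pending_age_max": max(idle_ms, default=0) / 1000.0,
    }
```

&lt;p&gt;The resulting values can be set on gauges by whatever metrics client the service already uses.&lt;/p&gt;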




&lt;h2&gt;
  
  
  Rollback Strategy
&lt;/h2&gt;

&lt;p&gt;Rollback was designed before rollout.&lt;/p&gt;

&lt;p&gt;That mattered.&lt;/p&gt;

&lt;p&gt;The stream backend is fully feature-flagged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;NDVI_QUEUE_BACKEND&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;celery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;NDVI_QUEUE_BACKEND&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rollback requires:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;environment variable change&lt;/li&gt;
&lt;li&gt;process restart&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No redeploy.&lt;/p&gt;

&lt;p&gt;No task rewrite.&lt;/p&gt;

&lt;p&gt;No schema rollback.&lt;/p&gt;

&lt;p&gt;This significantly reduced operational fear during rollout.&lt;/p&gt;
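&lt;p&gt;A minimal sketch of reading that flag from the environment. Defaulting to &lt;code&gt;"celery"&lt;/code&gt; when the variable is unset is an assumption, chosen here so that a missing setting falls back to the legacy path.&lt;/p&gt;

```python
import os
from typing import Mapping

VALID_BACKENDS = ("celery", "stream")


def resolve_queue_backend(env: Mapping[str, str] = os.environ) -> str:
    """Read NDVI_QUEUE_BACKEND and reject anything other than a known backend."""
    value = env.get("NDVI_QUEUE_BACKEND", "celery").strip().lower()
    if value not in VALID_BACKENDS:
        raise ValueError(f"unsupported NDVI_QUEUE_BACKEND: {value!r}")
    return value
```

&lt;p&gt;Failing fast on an unknown value at startup avoids a typo silently selecting the wrong backend during a rollback.&lt;/p&gt;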




&lt;h2&gt;
  
  
  What Shipped This Week
&lt;/h2&gt;

&lt;p&gt;This week’s rollout included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a 528-line Redis Streams consumer&lt;/li&gt;
&lt;li&gt;reclaim + DLQ lifecycle handling&lt;/li&gt;
&lt;li&gt;distributed execution locking&lt;/li&gt;
&lt;li&gt;token-safe lock release&lt;/li&gt;
&lt;li&gt;approximately 400 lines of stream-focused tests&lt;/li&gt;
&lt;li&gt;Prometheus metrics for queue health&lt;/li&gt;
&lt;li&gt;Grafana visibility for consumer state and lag&lt;/li&gt;
&lt;li&gt;feature-flag rollback support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most of the work was not adding Redis.&lt;/p&gt;

&lt;p&gt;Most of the work was making failure recovery predictable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Was It Worth It?
&lt;/h2&gt;

&lt;p&gt;Redis Streams did not simplify the system.&lt;/p&gt;

&lt;p&gt;They made failure states explicit.&lt;/p&gt;

&lt;p&gt;That introduced additional complexity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reclaim logic&lt;/li&gt;
&lt;li&gt;idempotency handling&lt;/li&gt;
&lt;li&gt;consumer lifecycle management&lt;/li&gt;
&lt;li&gt;DLQ operations&lt;/li&gt;
&lt;li&gt;stream observability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the reliability gains were substantial:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;durable ingestion&lt;/li&gt;
&lt;li&gt;replay visibility&lt;/li&gt;
&lt;li&gt;safer recovery semantics&lt;/li&gt;
&lt;li&gt;backlog introspection&lt;/li&gt;
&lt;li&gt;controlled rollback&lt;/li&gt;
&lt;li&gt;observable queue state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this NDVI pipeline, the tradeoff was worth it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;One of the biggest lessons from this migration is that queue evolution is not just about throughput.&lt;/p&gt;

&lt;p&gt;It is about operational recovery.&lt;/p&gt;

&lt;p&gt;Redis Streams gave the ingestion layer explicit lifecycle semantics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pending&lt;/li&gt;
&lt;li&gt;acknowledged&lt;/li&gt;
&lt;li&gt;reclaimed&lt;/li&gt;
&lt;li&gt;dead-lettered&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That visibility fundamentally changed how the system behaves during failures.&lt;/p&gt;

&lt;p&gt;And importantly, I achieved that without rewriting the worker layer.&lt;/p&gt;

&lt;p&gt;Sometimes the best migration strategy is not replacing your stack.&lt;/p&gt;

&lt;p&gt;It is inserting a safer boundary in front of it.&lt;/p&gt;

</description>
      <category>django</category>
      <category>redis</category>
      <category>celery</category>
      <category>backend</category>
    </item>
    <item>
      <title>Building a Resilient NDVI Pipeline with Redis Streams (Event-Driven Architecture)</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sun, 26 Apr 2026 09:02:14 +0000</pubDate>
      <link>https://dev.to/rahim8050/building-a-resilient-ndvi-pipeline-with-redis-streams-event-driven-architecture-2l75</link>
      <guid>https://dev.to/rahim8050/building-a-resilient-ndvi-pipeline-with-redis-streams-event-driven-architecture-2l75</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A practical breakdown of moving an NDVI processing pipeline from a synchronous design to an event-driven architecture using Redis Streams — including concurrency challenges, distributed locking pitfalls, and production-safe patterns.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Most pipelines work — until concurrency and failure expose their limits.&lt;/p&gt;

&lt;p&gt;At first, processing NDVI (Normalized Difference Vegetation Index) data seems straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;receive a request&lt;/li&gt;
&lt;li&gt;process imagery&lt;/li&gt;
&lt;li&gt;return results&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But once you introduce:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;concurrent jobs&lt;/li&gt;
&lt;li&gt;long-running processing&lt;/li&gt;
&lt;li&gt;distributed components&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;you’re no longer building a simple pipeline.&lt;/p&gt;

&lt;p&gt;You’re designing a distributed system.&lt;/p&gt;

&lt;p&gt;This article walks through how I transformed an NDVI processing pipeline from a synchronous model into an event-driven architecture using Redis Streams, and the real-world engineering challenges that came with it.&lt;/p&gt;




&lt;h2&gt;System Overview&lt;/h2&gt;

&lt;p&gt;The system is built using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Django REST Framework (backend API)&lt;/li&gt;
&lt;li&gt;Nextcloud (client-facing integration layer)&lt;/li&gt;
&lt;li&gt;Celery (asynchronous task processing)&lt;/li&gt;
&lt;li&gt;Redis Streams (event ingestion and coordination)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;The Initial Architecture (Synchronous Design)&lt;/h2&gt;

&lt;p&gt;Client → API → Celery Task → NDVI Processing → Result&lt;/p&gt;

&lt;p&gt;This design works well at small scale, but it introduces hidden risks when the system grows.&lt;/p&gt;




&lt;h2&gt;The Core Problems&lt;/h2&gt;

&lt;h3&gt;1. Tight Coupling&lt;/h3&gt;

&lt;p&gt;The request lifecycle is directly tied to processing.&lt;/p&gt;

&lt;p&gt;If processing fails:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the request fails&lt;/li&gt;
&lt;li&gt;the user experiences errors&lt;/li&gt;
&lt;li&gt;retries become difficult&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;2. Concurrency Issues&lt;/h3&gt;

&lt;p&gt;When multiple requests target the same job:&lt;/p&gt;

&lt;p&gt;Request A ─┐&lt;br&gt;
           ├──&amp;gt; Same Job → Duplicate Processing&lt;br&gt;
Request B ─┘&lt;/p&gt;

&lt;p&gt;This leads to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;duplicated work&lt;/li&gt;
&lt;li&gt;inconsistent outputs&lt;/li&gt;
&lt;li&gt;race conditions&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;3. Fragile Execution Model&lt;/h3&gt;

&lt;p&gt;Without coordination:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;jobs execute immediately&lt;/li&gt;
&lt;li&gt;no buffering exists&lt;/li&gt;
&lt;li&gt;failure handling is reactive, not controlled&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;The Shift to Event-Driven Architecture&lt;/h2&gt;

&lt;p&gt;To solve these issues, I introduced Redis Streams and redesigned the system into an event-driven model.&lt;/p&gt;




&lt;h2&gt;New Architecture (Event-Driven Pipeline)&lt;/h2&gt;

&lt;p&gt;Client → API → Redis Stream → Consumer → Celery → Processing&lt;/p&gt;




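&lt;h2&gt;Why Redis Streams?&lt;/h2&gt;

&lt;p&gt;Redis Streams provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Event buffering (decouples ingestion from execution)&lt;/li&gt;
&lt;li&gt;At-least-once delivery through consumer groups (unacknowledged entries stay pending and can be redelivered)&lt;/li&gt;
&lt;li&gt;Ordered entries (IDs increase monotonically within a stream)&lt;/li&gt;
&lt;li&gt;Scalability for distributed systems (multiple consumers can share one group)&lt;/li&gt;
&lt;/ul&gt;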




&lt;h2&gt;What Changed&lt;/h2&gt;

&lt;p&gt;Instead of executing tasks immediately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The API publishes events to a Redis Stream&lt;/li&gt;
&lt;li&gt;A stream consumer controls task execution&lt;/li&gt;
&lt;li&gt;Celery workers process jobs asynchronously&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This separates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ingestion&lt;/li&gt;
&lt;li&gt;scheduling&lt;/li&gt;
&lt;li&gt;execution&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Distributed Locking: The Critical Bug&lt;/h2&gt;

&lt;p&gt;To prevent duplicate processing, a locking mechanism was introduced.&lt;/p&gt;

&lt;p&gt;The naive approach releases the lock with an unconditional delete:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cache.delete(lock_key)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This looks harmless, but in a distributed system it is dangerous.&lt;/p&gt;




&lt;h2&gt;Why This Fails&lt;/h2&gt;

&lt;p&gt;Consider this sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Process A acquires a lock&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The lock expires&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Process B acquires the same lock&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Process A deletes the lock&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

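&lt;p&gt;Now:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Process B is running without protection&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Process A has deleted a lock it no longer owned, so any other process can acquire the key while B is still working. That is a race condition, one of the hardest classes of problems in distributed systems.&lt;/p&gt;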




&lt;h2&gt;The Fix: Token-Based Distributed Locking&lt;/h2&gt;

&lt;p&gt;To solve this, each lock is assigned a unique token.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SET lock_key = token_A (TTL)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Release only if:&lt;br&gt;
&lt;code&gt;stored_token == token_A&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;Key Principles&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Only the owner of the lock can release it&lt;/li&gt;
&lt;li&gt;If ownership does not match → do nothing&lt;/li&gt;
&lt;li&gt;TTL ensures eventual cleanup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;safe concurrency&lt;/li&gt;
&lt;li&gt;no accidental unlocks&lt;/li&gt;
&lt;li&gt;predictable system behavior&lt;/li&gt;
&lt;/ul&gt;
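&lt;p&gt;The four-step failure and the token check can be replayed without Redis. The sketch below uses a hypothetical in-memory store with a manual clock, so it only illustrates the ownership logic; it says nothing about atomicity, which in Redis would need a Lua script or equivalent.&lt;/p&gt;

```python
from __future__ import annotations


class FakeLockStore:
    """In-memory stand-in for a TTL key store, driven by a manual clock."""

    def __init__(self) -> None:
        self.now = 0
        self._data: dict[str, tuple[str, int]] = {}

    def get(self, key: str) -> str | None:
        entry = self._data.get(key)
        if entry is None or self.now >= entry[1]:
            self._data.pop(key, None)  # expired entries disappear
            return None
        return entry[0]

    def set_nx(self, key: str, token: str, ttl: int) -> bool:
        if self.get(key) is not None:
            return False
        self._data[key] = (token, self.now + ttl)
        return True

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

    def release(self, key: str, token: str) -> bool:
        # Token-based release: delete only when the caller still owns the lock.
        if self.get(key) == token:
            self.delete(key)
            return True
        return False


def replay(naive: bool) -> bool:
    """Replay the four steps; return True if B still holds the lock at the end."""
    store = FakeLockStore()
    store.set_nx("lock", "token_A", ttl=5)  # 1. Process A acquires the lock
    store.now = 6                           # 2. The lock expires
    store.set_nx("lock", "token_B", ttl=5)  # 3. Process B acquires the same lock
    if naive:
        store.delete("lock")                # 4. A deletes blindly
    else:
        store.release("lock", "token_A")    # 4. A's release is a no-op
    return store.get("lock") == "token_B"
```

&lt;p&gt;With the naive delete, &lt;code&gt;replay(True)&lt;/code&gt; shows B losing its lock; with the token check, &lt;code&gt;replay(False)&lt;/code&gt; leaves B's lock in place.&lt;/p&gt;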




&lt;h2&gt;Stream Consumer Design&lt;/h2&gt;

&lt;p&gt;Redis Streams operate with:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;At-least-once delivery semantics&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;messages can be delivered more than once&lt;/li&gt;
&lt;li&gt;consumers must be idempotent&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Consumer Processing Flow&lt;/h2&gt;

&lt;p&gt;Read → Validate → Enqueue → Acknowledge&lt;/p&gt;

&lt;h3&gt;Critical Rule&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Never acknowledge a message before it is safely enqueued.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;Idempotency and Reliability&lt;/h2&gt;

&lt;p&gt;To handle duplicate events:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;processing must be idempotent&lt;/li&gt;
&lt;li&gt;tasks must tolerate retries&lt;/li&gt;
&lt;li&gt;state transitions must be safe&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is essential in any event-driven system.&lt;/p&gt;




&lt;h2&gt;Final Architecture (Layered System Design)&lt;/h2&gt;

&lt;p&gt;The system now operates in clear layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ingestion Layer
&lt;ul&gt;
&lt;li&gt;receives requests&lt;/li&gt;
&lt;li&gt;publishes events&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Stream Layer
&lt;ul&gt;
&lt;li&gt;buffers and orders events&lt;/li&gt;
&lt;li&gt;decouples system components&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Consumer Layer
&lt;ul&gt;
&lt;li&gt;controls execution&lt;/li&gt;
&lt;li&gt;validates and dispatches tasks&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Execution Layer
&lt;ul&gt;
&lt;li&gt;Celery workers process NDVI jobs&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Coordination Layer
&lt;ul&gt;
&lt;li&gt;distributed locking&lt;/li&gt;
&lt;li&gt;idempotency&lt;/li&gt;
&lt;li&gt;concurrency control&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;Key Lessons from Building an Event-Driven System&lt;/h2&gt;

&lt;h3&gt;1. Event-Driven Architecture Does Not Reduce Complexity&lt;/h3&gt;

&lt;p&gt;It shifts complexity into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;coordination&lt;/li&gt;
&lt;li&gt;state management&lt;/li&gt;
&lt;li&gt;failure handling&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;2. Concurrency Is the Real Challenge&lt;/h3&gt;

&lt;p&gt;Not performance.&lt;br&gt;
Not frameworks.&lt;/p&gt;

&lt;p&gt;Concurrency.&lt;/p&gt;




&lt;h3&gt;3. Safety Must Be Designed Explicitly&lt;/h3&gt;

&lt;p&gt;Small shortcuts (like naive lock deletion) can lead to major production issues.&lt;/p&gt;




&lt;h3&gt;4. Idempotency Is Non-Negotiable&lt;/h3&gt;

&lt;p&gt;In systems with retries and event delivery:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;duplicate execution is expected&lt;/li&gt;
&lt;li&gt;safe handling is required&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;5. Observability Becomes Critical&lt;/h3&gt;

&lt;p&gt;In asynchronous systems, you must answer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What happened to this job?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;structured logging&lt;/li&gt;
&lt;li&gt;tracing across components&lt;/li&gt;
&lt;li&gt;visibility into system flow&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;This shift changed the system from:&lt;/p&gt;

&lt;p&gt;"Run this task now"&lt;/p&gt;

&lt;p&gt;to:&lt;/p&gt;

&lt;p&gt;"This event will be processed safely"&lt;/p&gt;

&lt;p&gt;That difference is fundamental.&lt;/p&gt;

&lt;p&gt;Because in distributed systems:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You don’t design for success.&lt;br&gt;
You design for failure.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;What’s Next&lt;/h2&gt;

&lt;p&gt;The next phase is observability-driven engineering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tracing event lifecycles&lt;/li&gt;
&lt;li&gt;monitoring stream lag&lt;/li&gt;
&lt;li&gt;correlating logs across services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because once a system becomes event-driven:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Visibility is what makes it understandable.&lt;/p&gt;
&lt;/blockquote&gt;




</description>
      <category>django</category>
      <category>systemdesign</category>
      <category>redis</category>
      <category>devops</category>
    </item>
    <item>
      <title>Hardening Distributed Systems: Retries, Circuit Breakers &amp; Observability.</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sun, 12 Apr 2026 05:12:28 +0000</pubDate>
      <link>https://dev.to/rahim8050/hardening-distributed-systems-retries-circuit-breakers-observability-4m5n</link>
      <guid>https://dev.to/rahim8050/hardening-distributed-systems-retries-circuit-breakers-observability-4m5n</guid>
      <description>&lt;h2&gt;
  
  
  Building Resilient Distributed Systems: A Solo Engineer's Journey
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;How I turned flaky upstream APIs into a predictable, observable, and operator-friendly reliability layer — with code you can steal.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;If you've ever built a service that depends on external APIs (STAC catalogs, SentinelHub, weather data providers, etc.), you know the pain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;429s when you hit rate limits&lt;/li&gt;
&lt;li&gt;502s when upstreams hiccup&lt;/li&gt;
&lt;li&gt;Silent timeouts that leave jobs hanging&lt;/li&gt;
&lt;li&gt;Retry storms that make bad days worse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Last month, I undertook a focused effort to harden the retry and resilience logic for an NDVI (Normalized Difference Vegetation Index) processing pipeline. What started as "let's clean up some duplicate retry code" evolved into a &lt;strong&gt;production-grade reliability subsystem&lt;/strong&gt; that now governs every upstream interaction.&lt;/p&gt;

&lt;p&gt;In this article, I'll walk through:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Phase 1&lt;/strong&gt;: Consolidating retry policy into a single source of truth&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 2&lt;/strong&gt;: Adding circuit breakers with observability and admin controls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 3 (preview)&lt;/strong&gt;: Decoupling dispatch with Redis Streams for back-pressure resilience&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key principles&lt;/strong&gt; I learned that you can apply to your own distributed systems&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All code is Python/Django/Celery, but the patterns are language-agnostic. And yes — I did this alone. No team, no dedicated SRE, no platform squad. Just me, a codebase, and a lot of careful thinking.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem Space
&lt;/h2&gt;

&lt;p&gt;The NDVI pipeline I was working on orchestrates vegetation index calculations by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Querying STAC catalogs for satellite imagery metadata&lt;/li&gt;
&lt;li&gt;Fetching raster data from SentinelHub&lt;/li&gt;
&lt;li&gt;Computing NDVI values per farm/plot&lt;/li&gt;
&lt;li&gt;Returning results to farmers/agronomists&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The challenge&lt;/strong&gt;: Each upstream service has different failure modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;STAC: occasional 502s, auth errors (401/403)&lt;/li&gt;
&lt;li&gt;SentinelHub: strict rate limits (429), validation errors (422), transient 5xx&lt;/li&gt;
&lt;li&gt;Network: timeouts, DNS failures, TLS handshake issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before my refactor, retry logic was scattered across 4+ modules, with inconsistent error classification and no centralized observability. Result? Hard-to-debug failures, wasted Celery retries, and on-call pages at 3 AM.&lt;/p&gt;

&lt;p&gt;As a solo engineer, I couldn't afford to keep firefighting. I needed a system that would &lt;em&gt;just work&lt;/em&gt; — or fail gracefully, with clear signals.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 1: One Source of Truth for Retries
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Core Insight
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Not all errors are retryable. Not all retries are equal.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I started by defining a canonical truth table mapping HTTP status codes to retry behavior:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ndvi/retry_policy.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify_status_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;RetryClassification&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Canonical truth table: HTTP status → retry decision.

    | Status      | Retryable | Category           |
    |-------------|-----------|--------------------|
    | 401, 403    | False     | AUTH               |
    | 400, 422    | False     | VALIDATION         |
    | 429         | True      | RATE_LIMIT         |
    | &amp;gt;= 500      | True      | TRANSIENT_UPSTREAM |
    | Other/None  | False     | UNKNOWN            |
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;401&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;403&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;RetryClassification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retryable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AUTH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;422&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;RetryClassification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retryable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VALIDATION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;RetryClassification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retryable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RATE_LIMIT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;RetryClassification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retryable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TRANSIENT_UPSTREAM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;RetryClassification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retryable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UNKNOWN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
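
&lt;p&gt;The snippet above returns a &lt;code&gt;RetryClassification&lt;/code&gt; without showing its definition. Here is a minimal sketch, assuming it is a small frozen dataclass with only the two fields used above (the real class may carry more). The classifier is repeated so the block runs on its own, and &lt;code&gt;EXPECTED&lt;/code&gt; and &lt;code&gt;mismatches()&lt;/code&gt; are illustrative names for a table-driven check:&lt;/p&gt;

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RetryClassification:
    """Assumed shape: only the two fields the classifier sets."""
    retryable: bool
    category: str


def classify_status_code(status_code: int | None) -> RetryClassification:
    # Same branches as the truth table above, repeated so this sketch is self-contained.
    if status_code in (401, 403):
        return RetryClassification(False, "AUTH")
    if status_code in (400, 422):
        return RetryClassification(False, "VALIDATION")
    if status_code == 429:
        return RetryClassification(True, "RATE_LIMIT")
    if status_code is not None and status_code >= 500:
        return RetryClassification(True, "TRANSIENT_UPSTREAM")
    return RetryClassification(False, "UNKNOWN")


# One entry per row of the table, plus two edge cases (404 and None).
EXPECTED = {
    401: (False, "AUTH"),
    403: (False, "AUTH"),
    400: (False, "VALIDATION"),
    422: (False, "VALIDATION"),
    429: (True, "RATE_LIMIT"),
    500: (True, "TRANSIENT_UPSTREAM"),
    502: (True, "TRANSIENT_UPSTREAM"),
    404: (False, "UNKNOWN"),
    None: (False, "UNKNOWN"),
}


def mismatches() -> list:
    """Return the status codes whose classification differs from EXPECTED."""
    bad = []
    for code, (retryable, category) in EXPECTED.items():
        if classify_status_code(code) != RetryClassification(retryable, category):
            bad.append(code)
    return bad
```

&lt;p&gt;Keeping the expectations in a plain dict means a new row in the truth table becomes one new line in the check.&lt;/p&gt;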



&lt;h3&gt;
  
  
  Unified Exception Hierarchy
&lt;/h3&gt;

&lt;p&gt;I made all upstream errors inherit from a common base, ensuring consistent attributes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;UpstreamFailureError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NdviFailureError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Base for all upstream failures; retryability is set by the classifier.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Response&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
        &lt;span class="c1"&gt;# Delegate to canonical classifier
&lt;/span&gt;        &lt;span class="n"&gt;classification&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;classify_status_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retryable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retryable&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_compute_delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StacUpstreamError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;UpstreamFailureError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StacError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SentinelHubUpstreamError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;UpstreamFailureError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SentinelHubRasterError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;UpstreamFailureError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
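
&lt;p&gt;&lt;code&gt;_compute_delay&lt;/code&gt; is not shown above. One plausible shape is capped exponential backoff with full jitter. This is a sketch only: the function name, the &lt;code&gt;attempt&lt;/code&gt; parameter, and the base delays are assumptions for illustration, not values from the pipeline.&lt;/p&gt;

```python
from __future__ import annotations

import random

# Hypothetical base delays in seconds per category; not taken from the article.
BASE_DELAY = {"RATE_LIMIT": 5.0, "TRANSIENT_UPSTREAM": 1.0}
MAX_DELAY = 300.0  # hypothetical ceiling


def compute_delay(category: str, attempt: int = 0,
                  rng: random.Random | None = None) -> float:
    """Capped exponential backoff with full jitter.

    Returns 0.0 for categories with no base delay (the non-retryable ones).
    """
    base = BASE_DELAY.get(category)
    if base is None:
        return 0.0
    rng = rng if rng is not None else random.Random()
    # The ceiling doubles on each attempt until it reaches MAX_DELAY.
    ceiling = min(MAX_DELAY, base * (2 ** attempt))
    # Full jitter: pick uniformly between zero and the ceiling.
    return rng.uniform(0.0, ceiling)
```

&lt;p&gt;Jitter matters here because many Celery tasks tend to fail at the same moment; without it they would all retry at the same moment too.&lt;/p&gt;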



&lt;h3&gt;
  
  
  Centralized Retry Decision
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
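&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;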
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RetryDecision&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
    &lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;should_retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;RetryDecision&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UpstreamFailureError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;RetryDecision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;non-retryable-exception&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Respect Retry-After header for 429s
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;response_headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;server_delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_retry_after&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retry-After&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;server_delay&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;RetryDecision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;server_delay&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retry-after-header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;RetryDecision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retryable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-classification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
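
&lt;p&gt;&lt;code&gt;should_retry()&lt;/code&gt; relies on a &lt;code&gt;parse_retry_after&lt;/code&gt; helper that is not shown. The header can carry either a number of seconds or an HTTP date, so a sketch of the helper might look like the following. The &lt;code&gt;now&lt;/code&gt; parameter is an assumption added to make the function testable.&lt;/p&gt;

```python
from __future__ import annotations

from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value: str | None, now: datetime | None = None) -> float | None:
    """Return the delay in seconds, or None if the header is absent or unparseable."""
    if value is None:
        return None
    value = value.strip()
    # Form 1: delay-seconds, a non-negative integer.
    if value.isdigit():
        return float(value)
    # Form 2: an HTTP date such as "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError, IndexError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now if now is not None else datetime.now(timezone.utc)
    # A date in the past means "retry now", never a negative delay.
    return max(0.0, (when - now).total_seconds())
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; for anything unparseable lets &lt;code&gt;should_retry()&lt;/code&gt; fall through to the classification-based delay.&lt;/p&gt;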



&lt;h3&gt;
  
  
  Impact
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt; 28 parametrized tests covering all 13 truth-table branches&lt;/li&gt;
&lt;li&gt; Removed 3 duplicate retry implementations&lt;/li&gt;
&lt;li&gt; Celery tasks now use shared &lt;code&gt;should_retry()&lt;/code&gt; logic&lt;/li&gt;
&lt;li&gt; Network errors properly wrapped → no more silent failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson #1&lt;/strong&gt;: Centralize failure classification. When retry logic lives in one place, you can test it thoroughly, document it clearly, and evolve it safely — even when you're the only one maintaining it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 2: Circuit Breakers with Teeth
&lt;/h2&gt;

&lt;p&gt;Retries alone aren't enough. When an upstream is truly down, you want to &lt;strong&gt;fail fast&lt;/strong&gt; and avoid thundering herds.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Circuit Breaker State Machine
&lt;/h3&gt;

&lt;p&gt;I implemented a simple but effective three-state breaker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLOSED → (failures ≥ threshold) → OPEN → (timeout elapsed) → HALF_OPEN → (success) → CLOSED
HALF_OPEN → (failure) → OPEN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
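&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;_CircuitBreaker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout_secs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;  &lt;span class="c1"&gt;# label used by the metrics in _transition_to, e.g. "stac"&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CLOSED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;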
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_failure_time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timeout_secs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;timeout_secs&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;record_success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_transition_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CLOSED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;record_failure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_failure_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_transition_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;allow_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CLOSED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_failure_time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timeout_secs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_transition_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HALF_OPEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="c1"&gt;# HALF_OPEN: let probes through (this version does not cap them at one)
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_transition_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
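        &lt;span class="n"&gt;old_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;new_state&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;old_state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;  &lt;span class="c1"&gt;# no-op: skip logging and metrics when the state is unchanged&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new_state&lt;/span&gt;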
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Circuit breaker: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;old_state&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; → &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;new_state&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Export Prometheus metric
&lt;/span&gt;        &lt;span class="n"&gt;circuit_breaker_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;STATE_VALUES&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;new_state&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;circuit_breaker_transitions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;from_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;old_state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;new_state&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;inc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
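
&lt;p&gt;A breaker like this is easier to trust once its transitions can be tested without sleeping through the timeout. The sketch below is a stripped-down variant (no logging, no metrics) with an injectable clock. Two things are my own substitutions rather than part of the class above: the &lt;code&gt;clock&lt;/code&gt; parameter, and the use of &lt;code&gt;time.monotonic&lt;/code&gt; instead of &lt;code&gt;time.time&lt;/code&gt; so wall-clock adjustments cannot shorten or stretch the timeout.&lt;/p&gt;

```python
from __future__ import annotations

import time
from typing import Callable


class Breaker:
    """Same three states as above, reduced to the state machine itself."""

    def __init__(self, threshold: int = 3, timeout_secs: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self.state = "CLOSED"
        self.failure_count = 0
        self.opened_at = 0.0
        self.threshold = threshold
        self.timeout_secs = timeout_secs
        self.clock = clock  # injectable so tests can move time forward

    def record_success(self) -> None:
        self.failure_count = 0
        self.state = "CLOSED"

    def record_failure(self) -> None:
        self.failure_count += 1
        if self.failure_count >= self.threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()

    def allow_request(self) -> bool:
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.timeout_secs:
            self.state = "HALF_OPEN"
        return self.state != "OPEN"


def walk_through() -> list[str]:
    """Drive the breaker through a full cycle using a fake clock."""
    now = [0.0]
    breaker = Breaker(threshold=2, timeout_secs=10.0, clock=lambda: now[0])
    seen = [breaker.state]
    breaker.record_failure()
    breaker.record_failure()
    seen.append(breaker.state)   # threshold reached
    now[0] = 10.0
    breaker.allow_request()
    seen.append(breaker.state)   # timeout elapsed
    breaker.record_failure()
    seen.append(breaker.state)   # probe failed
    now[0] = 20.0
    breaker.allow_request()
    breaker.record_success()
    seen.append(breaker.state)   # probe succeeded
    return seen
```

&lt;p&gt;Calling &lt;code&gt;walk_through()&lt;/code&gt; returns the sequence of states visited, which makes the whole cycle assertable in a unit test.&lt;/p&gt;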



&lt;h3&gt;
  
  
  Observability First
&lt;/h3&gt;

&lt;p&gt;I didn't just build the breaker — I made it &lt;strong&gt;visible&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
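&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;prometheus_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Gauge&lt;/span&gt;

&lt;span class="c1"&gt;# Prometheus metrics
&lt;/span&gt;&lt;span class="n"&gt;circuit_breaker_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Gauge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;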
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ndvi_circuit_breaker_state&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Current circuit breaker state (0=CLOSED, 1=OPEN, 2=HALF_OPEN)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;labelnames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;engine&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;circuit_breaker_transitions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ndvi_circuit_breaker_transitions_total&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Count of circuit breaker state transitions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;labelnames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;engine&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;from_state&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;to_state&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
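
&lt;p&gt;The breaker indexes a &lt;code&gt;STATE_VALUES&lt;/code&gt; mapping that is not shown. Assuming it mirrors the encoding in the gauge's help text, it could be derived from a small enum so the numbers live in one place. &lt;code&gt;BreakerState&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```python
from enum import IntEnum


class BreakerState(IntEnum):
    """Numeric encoding from the gauge description: 0=CLOSED, 1=OPEN, 2=HALF_OPEN."""
    CLOSED = 0
    OPEN = 1
    HALF_OPEN = 2


# Name-to-number lookup in the shape the breaker uses: STATE_VALUES[new_state]
STATE_VALUES = {state.name: int(state) for state in BreakerState}
```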



&lt;p&gt;And added a Grafana dashboard with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stat panels showing current state per engine (color-coded: 🟢 CLOSED, 🔴 OPEN, 🟡 HALF_OPEN)&lt;/li&gt;
&lt;li&gt;Time series of transition rates&lt;/li&gt;
&lt;li&gt;Correlation with upstream failure rates&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Operator Controls
&lt;/h3&gt;

&lt;p&gt;Because things &lt;em&gt;will&lt;/em&gt; go wrong — and when you're solo, you &lt;em&gt;are&lt;/em&gt; the operator — I added an admin endpoint to manually reset breakers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /api/v1/ndvi/circuit-breaker/reset/
Content-Type: application/json
Authorization: Bearer &amp;lt;admin-token&amp;gt;

{ "engine": "stac" }

→ { "data": { "previous_state": "OPEN", "new_state": "CLOSED" } }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
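
&lt;p&gt;The core of that endpoint can stay framework-free, which keeps it easy to test. The following is a sketch under assumptions: &lt;code&gt;BreakerRecord&lt;/code&gt;, &lt;code&gt;reset_breaker&lt;/code&gt;, the registry dict, and the 404 error body are all hypothetical. The authentication check is left out; in the real service it belongs in the view layer before this function is called.&lt;/p&gt;

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class BreakerRecord:
    """Stand-in for a breaker instance; only the fields the reset touches."""
    state: str = "CLOSED"
    failure_count: int = 0


def reset_breaker(registry: dict[str, BreakerRecord], engine: str) -> tuple[int, dict]:
    """Return (http_status, body) for a reset request on one engine."""
    breaker = registry.get(engine)
    if breaker is None:
        # Assumed error shape; the article only shows the success response.
        return 404, {"error": f"unknown engine: {engine}"}
    previous = breaker.state
    breaker.state = "CLOSED"
    breaker.failure_count = 0
    return 200, {"data": {"previous_state": previous, "new_state": "CLOSED"}}
```

&lt;p&gt;Resetting the failure count along with the state matters: otherwise a single failure after the reset would trip the breaker again immediately.&lt;/p&gt;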



&lt;p&gt;&lt;strong&gt;Lesson #2&lt;/strong&gt;: Resilience patterns need observability and escape hatches. If you can't see it or control it, you don't own it — and when you're the only one on call, "owning it" means sleeping at night.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 3 Preview: Decoupling with Redis Streams
&lt;/h2&gt;

&lt;p&gt;As I scaled the system, I hit a new challenge: &lt;strong&gt;Celery broker unavailability during Redis Sentinel failover&lt;/strong&gt; (~55 seconds of downtime). For background jobs that was acceptable, but real-time dispatch needed something better.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Architecture Decision
&lt;/h3&gt;

&lt;p&gt;Instead of relying on Celery's built-in Redis transport, I chose a &lt;strong&gt;separate consumer pattern&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;API → [Redis Stream] → Consumer → [Celery Queue] → Worker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; Avoids Celery/Kombu stream support uncertainty&lt;/li&gt;
&lt;li&gt; Easier to observe and debug (explicit XREADGROUP/XACK)&lt;/li&gt;
&lt;li&gt; Natural back-pressure via XPENDING monitoring&lt;/li&gt;
&lt;li&gt; Cleaner rollback path (just flip a feature flag)&lt;/li&gt;
&lt;/ul&gt;
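
&lt;p&gt;To make the explicit XREADGROUP/XACK flow concrete, here is a sketch of one pass of the consumer. It assumes a client with redis-py's &lt;code&gt;xreadgroup&lt;/code&gt; and &lt;code&gt;xack&lt;/code&gt; methods and the default list-shaped reply. The stream, group, and consumer names are hypothetical, and &lt;code&gt;enqueue&lt;/code&gt; stands in for whatever hands the job to Celery.&lt;/p&gt;

```python
from __future__ import annotations

from typing import Any, Callable

STREAM = "ndvi:dispatch"    # hypothetical stream name
GROUP = "ndvi-consumers"    # hypothetical consumer group
CONSUMER = "consumer-1"     # hypothetical consumer name


def consume_once(client: Any, enqueue: Callable[[dict], None],
                 count: int = 10, block_ms: int = 5000) -> int:
    """Read one batch, hand each entry to `enqueue`, and ack only on success.

    Returns the number of entries acknowledged.
    """
    # ">" asks for entries never delivered to any consumer in this group.
    batches = client.xreadgroup(GROUP, CONSUMER, {STREAM: ">"},
                                count=count, block=block_ms)
    acked = 0
    for _stream, entries in batches or []:
        for entry_id, fields in entries:
            try:
                enqueue(fields)
            except Exception:
                # Left un-acked on purpose: the entry stays in the pending
                # list, where XPENDING can surface it for a later retry.
                continue
            client.xack(STREAM, GROUP, entry_id)
            acked += 1
    return acked
```

&lt;p&gt;Acknowledging only after a successful hand-off is what turns the pending list into a back-pressure signal: if Celery is unreachable, pending entries pile up where they can be counted.&lt;/p&gt;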

&lt;h3&gt;
  
  
  Key Design Decisions I Made Early
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Idempotency by Design
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;stream_payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request_hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request_hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Primary idempotency key
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schema_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                &lt;span class="c1"&gt;# Future-proofing
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;colormap_normalization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;histogram&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Evolved schema
&lt;/span&gt;    &lt;span class="c1"&gt;# ... other fields
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;# Consumer checks request_hash before enqueueing to Celery
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
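
&lt;p&gt;One way the consumer-side check could work is an atomic "claim" on the hash before enqueueing. This is a sketch under assumptions: &lt;code&gt;client&lt;/code&gt; behaves like a redis-py &lt;code&gt;redis.Redis&lt;/code&gt; instance, and the helper name, key prefix, and 24-hour TTL are illustrative rather than the values I run in production.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def should_enqueue(client, request_hash, ttl_seconds=86400):
    """Return True only the first time a request_hash is seen.

    Assumptions: `client` behaves like redis.Redis; the key prefix and the
    24-hour TTL are illustrative values.
    """
    key = f"ndvi:dedupe:{request_hash}"
    # SET with NX only succeeds when the key is absent, so the check and the
    # claim happen in one atomic command; EX lets old markers expire.
    return bool(client.set(key, "1", nx=True, ex=ttl_seconds))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A redelivered stream entry with the same &lt;code&gt;request_hash&lt;/code&gt; then returns &lt;code&gt;False&lt;/code&gt; and can be acknowledged without enqueueing a second task.&lt;/p&gt;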



&lt;h4&gt;
  
  
  2. Error Classification at Consumer Boundary
&lt;/h4&gt;

&lt;p&gt;Not all failures should retry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ERROR_STRATEGY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no_items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DLQ&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# Permanent: no data exists
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;missing_assets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DLQ&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# Permanent: schema mismatch
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;network_timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RETRY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Transient: try again
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;celery_unavailable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RETRY_WITH_BACKOFF&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Infrastructure blip
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
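
&lt;p&gt;A table like this only helps if something acts on it. The sketch below shows one possible dispatcher: it maps an error code to an action and a delay. The function name, the DLQ default for unknown codes, and the backoff constants are all assumptions for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random

ERROR_STRATEGY = {
    "no_items": "DLQ",
    "missing_assets": "DLQ",
    "network_timeout": "RETRY",
    "celery_unavailable": "RETRY_WITH_BACKOFF",
}


def next_action(error_code, attempt, base=1.0, cap=60.0):
    """Map an error code to (action, delay_seconds).

    Unknown codes go to the DLQ here; that default and the backoff
    constants are assumptions for illustration.
    """
    strategy = ERROR_STRATEGY.get(error_code, "DLQ")
    if strategy == "DLQ":
        return ("dlq", 0.0)
    if strategy == "RETRY":
        return ("retry", 0.0)
    # RETRY_WITH_BACKOFF: exponential ceiling, capped, with full jitter
    ceiling = min(cap, base * 2 ** attempt)
    return ("retry", random.uniform(0.0, ceiling))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Jitter matters most for &lt;code&gt;celery_unavailable&lt;/code&gt;: after a broker failover, every consumer would otherwise retry at the same instant.&lt;/p&gt;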



&lt;h4&gt;
  
  
  3. Back-Pressure Strategy
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;PENDING_WARNING&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1_000&lt;/span&gt;
&lt;span class="n"&gt;PENDING_CRITICAL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5_000&lt;/span&gt;

&lt;span class="n"&gt;pending_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xpending&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;group_name&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pending&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pending_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;PENDING_CRITICAL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Return 429 on API to slow producers (Django ships no dedicated 429 response class)
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;HttpResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Upstream backlog critical&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;429&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;pending_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;PENDING_WARNING&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stream backlog growing: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pending_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; pending&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  4. Graceful Shutdown
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In consume_ndvi_stream.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;signal&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;

&lt;span class="n"&gt;shutdown_flag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Checked by the consume loop
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_shutdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;shutdown_flag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Stop accepting new entries
&lt;/span&gt;    &lt;span class="c1"&gt;# Finish current entry, XACK if successful
&lt;/span&gt;    &lt;span class="c1"&gt;# Exit cleanly → orchestrator restarts
&lt;/span&gt;
&lt;span class="c1"&gt;# Register the handler only after it is defined
&lt;/span&gt;&lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SIGTERM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle_shutdown&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
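
&lt;p&gt;The handler only flips a flag; the loop is what honors it. A minimal sketch of such a loop, assuming the flag is a &lt;code&gt;threading.Event&lt;/code&gt; and that &lt;code&gt;process_next&lt;/code&gt; is a hypothetical function that reads, handles, and XACKs at most one entry:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import signal
import threading

shutdown_flag = threading.Event()


def handle_shutdown(signum, frame):
    shutdown_flag.set()


def run(process_next, idle_wait=1.0):
    """Call process_next() until SIGTERM arrives; return entries processed.

    `process_next` is a hypothetical function that handles at most one
    entry and returns False when the stream had nothing to read.
    """
    signal.signal(signal.SIGTERM, handle_shutdown)
    handled = 0
    while not shutdown_flag.is_set():
        if process_next():
            handled += 1
        else:
            # Wait on the flag so a signal wakes the loop immediately
            shutdown_flag.wait(idle_wait)
    return handled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The flag is checked only between entries, so the entry in flight always finishes (and is acknowledged) before the process exits.&lt;/p&gt;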



&lt;p&gt;&lt;strong&gt;Lesson #3&lt;/strong&gt;: Decoupling isn't just about scalability — it's about &lt;strong&gt;failure isolation&lt;/strong&gt;. When one component fails, the rest can keep moving. And when you're solo, isolation means you can debug one piece without bringing down the whole system.&lt;/p&gt;




&lt;h2&gt;
  
  
  Principles I Learned (That You Can Steal)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Make Failure Explicit
&lt;/h3&gt;

&lt;p&gt;Don't hide errors behind generic exceptions. Classify them, tag them, and route them intentionally. Your future self — especially at 3 AM — will thank you.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Observability Is a Feature, Not an Afterthought
&lt;/h3&gt;

&lt;p&gt;If you can't measure it, you can't improve it. Export metrics at the point of decision (retry? circuit open? stream lag?) — not just at the edges. When you're the only one debugging, every metric is a lifeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Design for the "Boring" Failure Modes
&lt;/h3&gt;

&lt;p&gt;Everyone plans for the 500 error. Few plan for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Broker failover latency&lt;/li&gt;
&lt;li&gt;Consumer restart mid-processing&lt;/li&gt;
&lt;li&gt;Schema evolution mid-deploy&lt;/li&gt;
&lt;li&gt;Clock skew in distributed timestamps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Document these. Test them. Build escape hatches. When you don't have a team to lean on, preparation is your best defense.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Centralize, Then Specialize
&lt;/h3&gt;

&lt;p&gt;Start with a single source of truth (like &lt;code&gt;classify_status_code()&lt;/code&gt;). Then layer on engine-specific behavior &lt;em&gt;on top&lt;/em&gt; of that foundation. This prevents drift and duplication — critical when you're the only one maintaining the code.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Operator Experience Matters
&lt;/h3&gt;

&lt;p&gt;Admin endpoints, health checks, clear logs, and meaningful metrics aren't "nice to have" — they're what let you sleep at night. Build them in from day one. When you're solo, &lt;em&gt;you&lt;/em&gt; are the operator.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Note on Solo Engineering
&lt;/h2&gt;

&lt;p&gt;Working alone doesn't mean working in isolation. I leaned heavily on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Public documentation&lt;/strong&gt;: Google SRE book, AWS Well-Architected, Martin Fowler's patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open source&lt;/strong&gt;: Studying how Celery, Kombu, and Redis clients handle resilience&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community&lt;/strong&gt;: Reading post-mortems, blog posts, and conference talks from engineers who've been there&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And I documented &lt;em&gt;everything&lt;/em&gt;. Not for a team — for my future self. Every architecture decision, every tradeoff, every "why" is written down. Because six months from now, I won't remember why I chose &lt;code&gt;300s&lt;/code&gt; for the circuit breaker timeout. But my docs will.&lt;/p&gt;

&lt;p&gt;If you're also building alone: you're not behind. You're just optimizing for a different constraint. Depth over breadth. Clarity over velocity. Resilience over features.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building resilient distributed systems isn't about fancy algorithms or cutting-edge tools. It's about &lt;strong&gt;discipline&lt;/strong&gt;: clear contracts, explicit failure handling, observable behavior, and operator empathy.&lt;/p&gt;

&lt;p&gt;The NDVI pipeline I built isn't perfect. My circuit breakers are still process-local (not cluster-wide). My stream consumer doesn't yet support distributed tracing. But it's &lt;strong&gt;predictable&lt;/strong&gt;, &lt;strong&gt;testable&lt;/strong&gt;, and &lt;strong&gt;recoverable&lt;/strong&gt; — and that's what matters.&lt;/p&gt;

&lt;p&gt;If you take one thing from this article, let it be this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Resilience isn't a feature you add at the end. It's a mindset you build in from the start — whether you're on a team of 50 or flying solo.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;All code examples are simplified for clarity; production versions include additional error handling and logging. This work reflects my personal approach — your mileage may vary, and that's okay.&lt;/em&gt;&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Pro Tip&lt;/strong&gt;: Want to try the circuit breaker pattern? Start small:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add a &lt;code&gt;failure_count&lt;/code&gt; and &lt;code&gt;last_failure_time&lt;/code&gt; to your HTTP client&lt;/li&gt;
&lt;li&gt;Skip requests when &lt;code&gt;failure_count &amp;gt;= 3&lt;/code&gt; and &lt;code&gt;time_since_failure &amp;lt; 300&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Log state transitions&lt;/li&gt;
&lt;li&gt;Add one Prometheus gauge&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You'll be 80% of the way there — and you'll learn what &lt;em&gt;actually&lt;/em&gt; matters for your workload.&lt;/p&gt;
&lt;/blockquote&gt;
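
&lt;p&gt;The first two steps of the tip fit in a few lines. This is a sketch, not my production breaker: the class and method names are hypothetical, the thresholds are the ones from the tip, and logging and the Prometheus gauge (steps 3 and 4) are left out.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time


class CircuitBreaker:
    """Minimal process-local breaker: failure_count plus last_failure_time."""

    def __init__(self, threshold=3, cooldown=300.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable so tests need not sleep
        self.failure_count = 0
        self.last_failure_time = None

    def allow_request(self):
        if self.threshold > self.failure_count:
            return True
        elapsed = self.clock() - self.last_failure_time
        # Skip requests while cooling down; afterwards let a trial through
        return elapsed >= self.cooldown

    def record_success(self):
        self.failure_count = 0

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = self.clock()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Wrap each outbound call with &lt;code&gt;allow_request()&lt;/code&gt; before and &lt;code&gt;record_success()&lt;/code&gt; or &lt;code&gt;record_failure()&lt;/code&gt; after.&lt;/p&gt;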




</description>
      <category>distributedsystems</category>
      <category>devops</category>
      <category>django</category>
      <category>redis</category>
    </item>
    <item>
      <title>Django + Celery + Redis Sentinel: A Real Failover Test (With Metrics)</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sat, 04 Apr 2026 17:36:44 +0000</pubDate>
      <link>https://dev.to/rahim8050/django-celery-redis-sentinel-a-real-failover-test-with-metrics-4ajn</link>
      <guid>https://dev.to/rahim8050/django-celery-redis-sentinel-a-real-failover-test-with-metrics-4ajn</guid>
      <description>&lt;p&gt;Redis Sentinel + Celery Failover: What Actually Happens in Production&lt;/p&gt;

&lt;p&gt;Most tutorials on Redis Sentinel stop at “it elects a new master”.&lt;br&gt;
Very few show what happens to a real system under failover pressure.&lt;/p&gt;

&lt;p&gt;I ran a failover drill on a Django + Celery stack backed by Redis Sentinel and Prometheus monitoring.&lt;/p&gt;
&lt;p&gt;Here’s what actually happened.&lt;/p&gt;


&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Architecture Overview&lt;/li&gt;
&lt;li&gt;Sentinel Integration (Django + Celery)&lt;/li&gt;
&lt;li&gt;Observability with Prometheus&lt;/li&gt;
&lt;li&gt;Failover Drill Walkthrough&lt;/li&gt;
&lt;li&gt;Celery Behavior During Failover&lt;/li&gt;
&lt;li&gt;Performance Impact&lt;/li&gt;
&lt;li&gt;Production Readiness Assessment&lt;/li&gt;
&lt;li&gt;How to Reduce Failover Latency&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    Client --&amp;gt; Django
    Django --&amp;gt;|Cache| Sentinel
    Django --&amp;gt;|Tasks| Celery
    Celery --&amp;gt;|Broker| Sentinel
    Celery --&amp;gt;|Result Backend| Sentinel

    Sentinel --&amp;gt; RedisMaster
    Sentinel --&amp;gt; RedisReplica1
    Sentinel --&amp;gt; RedisReplica2

    Prometheus --&amp;gt; RedisExporter
    RedisExporter --&amp;gt; Sentinel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Stack Components
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Django&lt;/strong&gt; → Redis cache via Sentinel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Celery&lt;/strong&gt; → Broker + result backend via Sentinel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redis Sentinel&lt;/strong&gt; → High availability + failover&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus + redis_exporter&lt;/strong&gt; → Monitoring&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Sentinel Integration (Django + Celery)
&lt;/h2&gt;

&lt;p&gt;All services were switched to Sentinel using environment configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;REDIS_ADDR=redis://host.docker.internal:26379
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
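
&lt;p&gt;On the Celery side, the documented Sentinel setup uses &lt;code&gt;sentinel://&lt;/code&gt; URLs plus a &lt;code&gt;master_name&lt;/code&gt; transport option. The sketch below shows roughly what that looks like in Django settings. The environment variable names, the master name &lt;code&gt;mymaster&lt;/code&gt;, and the use of the &lt;code&gt;CELERY_&lt;/code&gt; settings namespace are assumptions, not a copy of my configuration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

# Hypothetical env vars; the defaults mirror the host and port shown above.
SENTINEL_HOST = os.environ.get("SENTINEL_HOST", "host.docker.internal")
SENTINEL_PORT = os.environ.get("SENTINEL_PORT", "26379")
MASTER_NAME = os.environ.get("SENTINEL_MASTER", "mymaster")  # assumed name

SENTINEL_URL = f"sentinel://{SENTINEL_HOST}:{SENTINEL_PORT}"

# Broker and result backend both resolve the current master via Sentinel
CELERY_BROKER_URL = SENTINEL_URL
CELERY_BROKER_TRANSPORT_OPTIONS = {"master_name": MASTER_NAME}
CELERY_RESULT_BACKEND = SENTINEL_URL
CELERY_RESULT_BACKEND_TRANSPORT_OPTIONS = {"master_name": MASTER_NAME}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;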



&lt;p&gt;Validation steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Django cache → successful round-trip&lt;/li&gt;
&lt;li&gt;Celery broker → connected via Sentinel&lt;/li&gt;
&lt;li&gt;Celery result backend → &lt;code&gt;SentinelBackend&lt;/code&gt; initialized&lt;/li&gt;
&lt;li&gt;Test suite passed:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  pytest tests/test_settings_redis_sentinel.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this stage, the system is fully &lt;strong&gt;Sentinel-aware&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability with Prometheus
&lt;/h2&gt;

&lt;p&gt;After pointing &lt;code&gt;redis_exporter&lt;/code&gt; at Sentinel, these key metrics were exposed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;redis_sentinel_master_status&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;redis_sentinel_master_ok_sentinels&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;redis_sentinel_master_ok_slaves&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;redis_sentinel_masters&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Verification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="n"&gt;redis_instance_info&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;redis_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sentinel"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;tcp_port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"26379"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This confirms monitoring is tracking &lt;strong&gt;cluster state&lt;/strong&gt;, not a single node.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failover Drill Walkthrough
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Initial State
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    Sentinel --&amp;gt;|Master| Redis1["172.20.0.3:6379"]
    Sentinel --&amp;gt; Redis2["Replica"]
    Sentinel --&amp;gt; Redis3["Replica"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prometheus reported:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;master_address&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"172.20.0.3:6379"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Induced Failure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Current master was stopped manually&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Sentinel Election
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    Sentinel --&amp;gt;|New Master| Redis2["172.20.0.2:6379"]
    Sentinel --&amp;gt; Redis3["Replica"]
    Sentinel --&amp;gt; Redis1["Down"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;New master elected on &lt;strong&gt;first poll&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Prometheus updated on next scrape&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Failover was immediate and correct.&lt;/p&gt;




&lt;h2&gt;
  
  
  Celery Behavior During Failover
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Timeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sequenceDiagram
    participant App as Django App
    participant Celery
    participant Sentinel
    participant Redis

    App-&amp;gt;&amp;gt;Celery: Submit Task
    Celery-&amp;gt;&amp;gt;Redis: Send to Master
    Redis--&amp;gt;&amp;gt;Celery: Connection Lost

    Sentinel-&amp;gt;&amp;gt;Sentinel: Elect New Master

    Celery-&amp;gt;&amp;gt;Sentinel: Retry Connection
    Note over Celery: ~54.7s delay

    Celery-&amp;gt;&amp;gt;Redis: Reconnect to New Master
    Redis--&amp;gt;&amp;gt;Celery: OK

    Celery--&amp;gt;&amp;gt;App: Task SUCCESS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Observed Task
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Task ID: &lt;code&gt;9b57ba3b-a707-4c13-9255-d74de411b64b&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Status during failover: &lt;code&gt;PENDING&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Delay: &lt;strong&gt;~54.7 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Final state: &lt;code&gt;SUCCESS&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Performance Impact
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Normal operation&lt;/td&gt;
&lt;td&gt;Immediate execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;During failover&lt;/td&gt;
&lt;td&gt;~55s delay&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Post-recovery&lt;/td&gt;
&lt;td&gt;Normal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Production Readiness Assessment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Works
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Redis Sentinel failover is reliable&lt;/li&gt;
&lt;li&gt;Prometheus reflects cluster changes correctly&lt;/li&gt;
&lt;li&gt;Django cache survives failover&lt;/li&gt;
&lt;li&gt;No task loss in Celery&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What Needs Attention
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Celery introduces &lt;strong&gt;significant delay during failover&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Reconnection is not instantaneous&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  When This Architecture Is Production-Ready
&lt;/h2&gt;

&lt;p&gt;Use this setup if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tasks are &lt;strong&gt;asynchronous/background&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Eventual completion is acceptable&lt;/li&gt;
&lt;li&gt;Temporary latency spikes are tolerable&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  When This Is Not Enough
&lt;/h2&gt;

&lt;p&gt;Avoid this setup (as-is) if you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time task execution&lt;/li&gt;
&lt;li&gt;Sub-10s failover recovery&lt;/li&gt;
&lt;li&gt;User-facing async operations&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How to Reduce Failover Latency
&lt;/h2&gt;

&lt;p&gt;To push recovery closer to &lt;strong&gt;10–15 seconds&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tune Celery broker retry settings&lt;/li&gt;
&lt;li&gt;Reduce reconnect backoff intervals&lt;/li&gt;
&lt;li&gt;Optimize worker heartbeat and visibility timeout&lt;/li&gt;
&lt;li&gt;Re-run failover drills with timing instrumentation&lt;/li&gt;
&lt;/ul&gt;
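
&lt;p&gt;For the last item, the instrumentation can be very small. A sketch, assuming a hypothetical &lt;code&gt;probe&lt;/code&gt; callable that raises while the system is unhealthy (for example, one that submits a trivial Celery task and waits briefly for its result):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time


def measure_recovery(probe, interval=0.5, max_attempts=240,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll `probe` until it stops raising; return the seconds elapsed.

    `probe` is a hypothetical zero-argument callable. Raises TimeoutError
    if every attempt fails.
    """
    start = clock()
    for _ in range(max_attempts):
        try:
            probe()
        except Exception:
            sleep(interval)
            continue
        return clock() - start
    raise TimeoutError("probe never succeeded")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Start it at the moment the master is stopped and record the returned value for every drill, so tuning changes can be compared against the ~55 second baseline.&lt;/p&gt;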




&lt;h2&gt;
  
  
  Key Takeaway
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Redis Sentinel ensures infrastructure recovery.&lt;br&gt;
Celery determines how fast your system actually resumes work.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sentinel recovery: &lt;strong&gt;instant&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Application recovery: &lt;strong&gt;~55 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That gap is the real engineering challenge.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;If you're using Redis Sentinel with Celery, don’t stop at:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Failover works.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Measure:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How long until my system behaves normally again?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because that’s what production users experience.&lt;/p&gt;




</description>
      <category>devops</category>
      <category>distributedsystems</category>
      <category>django</category>
      <category>redis</category>
    </item>
    <item>
      <title>Escaping Cache Fragmentation: How Misconfigured PHP Workers Flooded My Token System</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sun, 22 Mar 2026 18:02:23 +0000</pubDate>
      <link>https://dev.to/rahim8050/escaping-cache-fragmentation-how-misconfigured-php-workers-flooded-my-token-system-2ijb</link>
      <guid>https://dev.to/rahim8050/escaping-cache-fragmentation-how-misconfigured-php-workers-flooded-my-token-system-2ijb</guid>
      <description>&lt;h2&gt;
  
  
  🚨 The Symptom
&lt;/h2&gt;

&lt;p&gt;I started noticing something strange in my observability stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integration tokens were being minted repeatedly&lt;/li&gt;
&lt;li&gt;My token endpoint showed activity even when no user interaction was happening&lt;/li&gt;
&lt;li&gt;Metrics suggested constant “traffic” to an otherwise idle system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At first glance, it looked like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A security issue&lt;/li&gt;
&lt;li&gt;A rogue client&lt;/li&gt;
&lt;li&gt;Or a broken API consumer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It was none of those.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔍 The Root Cause
&lt;/h2&gt;

&lt;p&gt;The issue came down to a subtle but critical architectural mistake:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;I was using a non-shared cache in a multi-worker environment.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Stack involved:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;PHP-FPM (2 workers)&lt;/li&gt;
&lt;li&gt;APCu (in-memory cache)&lt;/li&gt;
&lt;li&gt;Token-based integration between services&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ⚙️ What Went Wrong
&lt;/h2&gt;

&lt;p&gt;APCu is &lt;strong&gt;process-local&lt;/strong&gt;, not shared.&lt;/p&gt;

&lt;p&gt;That means:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Worker A cache ≠ Worker B cache
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each PHP-FPM worker had its own isolated memory.&lt;/p&gt;




&lt;h2&gt;
  
  
  💥 The Cascade Effect
&lt;/h2&gt;

&lt;p&gt;My token logic was straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;mint_new_token&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But in reality, the system behaved like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Request hits Worker A → token exists → OK&lt;/li&gt;
&lt;li&gt;Next request hits Worker B → cache miss → mint new token&lt;/li&gt;
&lt;li&gt;Repeat across workers → continuous token regeneration&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  📈 Why Observability Looked “Wrong”
&lt;/h2&gt;

&lt;p&gt;From the outside, it looked like traffic was hitting the token endpoint.&lt;/p&gt;

&lt;p&gt;But in reality:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The system was generating its own traffic due to cache inconsistency.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a key lesson:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not all traffic is external&lt;/li&gt;
&lt;li&gt;Some is &lt;strong&gt;emergent behavior from system design&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ✅ The Fix
&lt;/h2&gt;

&lt;p&gt;I switched from APCu to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redis (shared cache)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;All workers → same cache → consistent token state
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Result:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Tokens minted once&lt;/li&gt;
&lt;li&gt;Reused across all workers&lt;/li&gt;
&lt;li&gt;Metrics stabilized instantly&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔒 Production Hardening (What I Added Next)
&lt;/h2&gt;

&lt;p&gt;Fixing the cache wasn’t enough — I hardened the system further.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Distributed Locking
&lt;/h3&gt;

&lt;p&gt;To prevent race conditions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="n"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;

&lt;span class="n"&gt;acquire&lt;/span&gt; &lt;span class="n"&gt;lock&lt;/span&gt;
    &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;check&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;
    &lt;span class="n"&gt;mint&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;still&lt;/span&gt; &lt;span class="n"&gt;missing&lt;/span&gt;
&lt;span class="n"&gt;release&lt;/span&gt; &lt;span class="n"&gt;lock&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  2. TTL Buffering
&lt;/h3&gt;

&lt;p&gt;Avoid edge expiration issues:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cache_ttl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;token_expiry&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;safety_margin&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
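
&lt;p&gt;The lock-and-recheck steps and the TTL buffer combine naturally. The sketch below illustrates the pattern in Python, in line with the snippets above, rather than the PHP I actually run: a &lt;code&gt;threading.Lock&lt;/code&gt; and an in-object entry stand in for the distributed lock and the shared Redis cache, and &lt;code&gt;mint&lt;/code&gt; is a hypothetical function returning a token and its lifetime in seconds.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import threading
import time


class TokenCache:
    """Double-checked locking around a shared token, with a TTL buffer.

    Stand-ins: threading.Lock for the distributed lock, an attribute for
    the shared cache. `mint` returns (token, lifetime_seconds).
    """

    def __init__(self, mint, safety_margin=30.0, clock=time.monotonic):
        self.mint = mint
        self.safety_margin = safety_margin
        self.clock = clock
        self._lock = threading.Lock()
        self._entry = None  # (token, cache_deadline)

    def _cached(self):
        if self._entry and self._entry[1] > self.clock():
            return self._entry[0]
        return None

    def get(self):
        token = self._cached()
        if token is not None:
            return token
        with self._lock:
            token = self._cached()  # re-check: another worker may have minted
            if token is None:
                token, lifetime = self.mint()
                # cache_ttl = token_expiry - safety_margin
                deadline = self.clock() + lifetime - self.safety_margin
                self._entry = (token, deadline)
            return token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The re-check inside the lock is what keeps two workers that miss at the same moment from both minting.&lt;/p&gt;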






&lt;h3&gt;
  
  
  3. Observability Metrics
&lt;/h3&gt;

&lt;p&gt;I added:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;token_cache_hits&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;token_cache_misses&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;token_mint_count&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now anomalies show up immediately.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧠 Key Takeaway
&lt;/h2&gt;

&lt;p&gt;This wasn’t just a bug.&lt;/p&gt;

&lt;p&gt;It was a &lt;strong&gt;distributed systems failure mode&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cache locality + multi-worker architecture → inconsistent state → emergent traffic&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  ⚡ Final Insight
&lt;/h2&gt;

&lt;p&gt;If your system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs multiple workers&lt;/li&gt;
&lt;li&gt;Uses in-memory caching&lt;/li&gt;
&lt;li&gt;Relies on shared state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then this rule applies:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;If your cache isn’t shared, your state isn’t real.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🔗 Closing
&lt;/h2&gt;

&lt;p&gt;This issue reinforced something critical in my engineering journey:&lt;/p&gt;

&lt;p&gt;You don’t debug systems by staring at code —&lt;br&gt;
you debug them by understanding &lt;strong&gt;how state flows across boundaries&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;If you're building distributed APIs, token systems, or high-concurrency services —&lt;br&gt;
this is one edge case worth designing for early.&lt;/p&gt;




</description>
      <category>devops</category>
      <category>distributedsystems</category>
      <category>php</category>
      <category>webdev</category>
    </item>
    <item>
      <title>From 80-Second APIs to Sub-Second: Rebuilding a Geospatial Backend with Async Pipelines</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sat, 21 Mar 2026 16:37:10 +0000</pubDate>
      <link>https://dev.to/rahim8050/from-80-second-apis-to-sub-second-rebuilding-a-geospatial-backend-with-async-pipelines-h81</link>
      <guid>https://dev.to/rahim8050/from-80-second-apis-to-sub-second-rebuilding-a-geospatial-backend-with-async-pipelines-h81</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;From 80-Second APIs to Sub-Second: Fixing Latency with Async Pipelines (Django + Celery)&lt;/strong&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;At some point, every backend engineer hits this wall:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The API works perfectly… until it doesn’t.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I hit that wall with a farm analytics endpoint computing NDVI (Normalized Difference Vegetation Index) from satellite imagery. The system was correct, the logic was sound, and the results were accurate.&lt;/p&gt;

&lt;p&gt;But the numbers told a different story:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;P95 latency: 1.25 minutes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s not an API. That’s a blocking compute job pretending to be one.&lt;/p&gt;

&lt;p&gt;This is the story of how I redesigned the system—from a synchronous request-driven model to an asynchronous data pipeline—and brought latency down to &lt;strong&gt;sub-second performance (P95 ≈ 725ms)&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Original Architecture (The Hidden Problem)
&lt;/h2&gt;

&lt;p&gt;At first glance, the system looked clean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Client]
   ↓
[Django API]
   ↓
[STAC API → Satellite Data]
   ↓
[Raster Processing (NDVI)]
   ↓
[Response]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What happened on each request?
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Query satellite imagery via STAC&lt;/li&gt;
&lt;li&gt;Fetch raster bands (Red &amp;amp; NIR) from remote storage&lt;/li&gt;
&lt;li&gt;Process NDVI using rasterio&lt;/li&gt;
&lt;li&gt;Aggregate coverage&lt;/li&gt;
&lt;li&gt;Return result&lt;/li&gt;
&lt;/ol&gt;
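
&lt;p&gt;Steps 3 and 4 come down to a per-pixel formula, NDVI = (NIR - Red) / (NIR + Red), followed by an aggregate. Here is a minimal pure-Python sketch of that arithmetic, assuming "coverage" means the share of valid pixels whose NDVI exceeds a threshold. The function names, the 0.3 threshold, and the sample reflectance values are illustrative assumptions; the real pipeline reads the bands with rasterio rather than from lists.&lt;/p&gt;

```python
from typing import Optional


def ndvi(red: float, nir: float) -> Optional[float]:
    """Return NDVI for one pixel, or None when both bands are zero (no data)."""
    total = nir + red
    if total == 0:
        return None
    return (nir - red) / total


def coverage_pct(red_band: list[float], nir_band: list[float], threshold: float = 0.3) -> float:
    """Percentage of valid pixels whose NDVI is above `threshold` (assumed definition)."""
    values = [ndvi(r, n) for r, n in zip(red_band, nir_band)]
    valid = [v for v in values if v is not None]
    if not valid:
        return 0.0
    vegetated = sum(1 for v in valid if v > threshold)
    return 100.0 * vegetated / len(valid)


# Hypothetical reflectance samples: B04 is red, B08 is near-infrared.
red = [0.10, 0.20, 0.30, 0.0]
nir = [0.50, 0.25, 0.30, 0.0]
print(round(coverage_pct(red, nir), 2))
```

&lt;p&gt;The arithmetic itself is cheap; what made the request slow was fetching and decoding the bands that feed it.&lt;/p&gt;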

&lt;h3&gt;
  
  
  Why this seemed fine
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;It worked locally&lt;/li&gt;
&lt;li&gt;It returned correct data&lt;/li&gt;
&lt;li&gt;It followed a “pure API” mindset&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But under the hood:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remote I/O (S3-backed satellite data)&lt;/li&gt;
&lt;li&gt;Heavy raster decoding (JPEG2000)&lt;/li&gt;
&lt;li&gt;Sequential band reads&lt;/li&gt;
&lt;li&gt;Full computation per request&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Breaking Point
&lt;/h2&gt;

&lt;p&gt;Logs told the truth.&lt;/p&gt;

&lt;p&gt;Each request looked like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;STAC request → ~5s
Raster read (B04) → ~5–10s
Raster read (B08) → ~5–10s
Processing → ~5s+
Total → ~80+ seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the key realization:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I wasn’t building an API—I was executing a geospatial compute pipeline on every request.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Core Insight
&lt;/h2&gt;

&lt;p&gt;This is the shift that changes everything:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;APIs should &lt;strong&gt;serve data&lt;/strong&gt;, not &lt;strong&gt;compute it on demand&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The problem wasn’t Python.&lt;br&gt;
The problem wasn’t Django.&lt;br&gt;
The problem was &lt;strong&gt;architecture&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  The New Architecture (Async Pipeline)
&lt;/h2&gt;

&lt;p&gt;I redesigned the system around &lt;strong&gt;asynchronous computation + caching&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;             (Scheduled / Triggered)
                    ↓
             [Celery Worker]
                    ↓
         [NDVI Computation Pipeline]
                    ↓
             [Redis / Database]
                    ↓
[Client] → [Django API] → [Cache Lookup]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key changes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;NDVI computation moved out of the request path&lt;/li&gt;
&lt;li&gt;Results cached in Redis&lt;/li&gt;
&lt;li&gt;Background jobs compute and refresh data&lt;/li&gt;
&lt;li&gt;API responds without waiting on computation: cached data on a hit, a &lt;code&gt;processing&lt;/code&gt; status on a miss&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Diagram 1 — Before vs After
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Before (Request-driven)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request
   ↓
STAC API
   ↓
Raster I/O
   ↓
NDVI Compute
   ↓
Response (80s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  After (Pipeline-driven)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request → Cache → Response (~725ms P95)
              ↓ (miss)
         Async Task
              ↓
       Compute + Store
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Fast API Path (Non-blocking)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.core.cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ndvi.tasks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;compute_farm_state_coverage&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_farm_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;cache_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;farm_state:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cache_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
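    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;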
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;

    &lt;span class="n"&gt;compute_farm_state_coverage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coverage_pct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
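
&lt;p&gt;One thing the fast path leaves open: every request that arrives during a miss enqueues another task, so a burst of traffic for a cold farm could queue the same long-running job many times. Below is a sketch of one common guard, a short-lived "in-flight" marker set before enqueueing. Django's cache API offers &lt;code&gt;cache.add()&lt;/code&gt; for this, since it only stores a key that is absent. To keep the example runnable without Django, &lt;code&gt;InMemoryCache&lt;/code&gt; and the &lt;code&gt;enqueue&lt;/code&gt; callback are hypothetical stand-ins, and the lock key name and 120-second lock TTL are assumptions.&lt;/p&gt;

```python
import time
from typing import Any, Callable, Optional


class InMemoryCache:
    """Single-process stand-in for a cache backend with get/set/add and TTLs."""

    def __init__(self) -> None:
        self._data: dict[str, tuple[Any, float]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None or time.monotonic() >= entry[1]:
            return None
        return entry[0]

    def set(self, key: str, value: Any, timeout: float) -> None:
        self._data[key] = (value, time.monotonic() + timeout)

    def add(self, key: str, value: Any, timeout: float) -> bool:
        """Store only if the key is absent; report whether it was stored."""
        if self.get(key) is not None:
            return False
        self.set(key, value, timeout)
        return True


def get_farm_state(cache: InMemoryCache, farm_id: int, enqueue: Callable[[int], None]) -> dict:
    data = cache.get(f"farm_state:{farm_id}")
    if data is not None:
        return data
    # Only the first caller during a miss wins the marker and enqueues the job.
    if cache.add(f"farm_state_lock:{farm_id}", 1, timeout=120):
        enqueue(farm_id)
    return {"coverage_pct": None, "status": "processing"}


cache = InMemoryCache()
queued: list[int] = []
for _ in range(5):  # five requests hit the same cold farm
    get_farm_state(cache, 42, queued.append)
print(queued)
```

&lt;p&gt;The stand-in is not atomic across processes; in production that guarantee would have to come from the cache backend itself.&lt;/p&gt;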






&lt;h3&gt;
  
  
  2. Celery Task (Async Compute)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;celery&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;shared_task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.core.cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;

&lt;span class="nd"&gt;@shared_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;autoretry_for&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt; &lt;span class="n"&gt;retry_backoff&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_farm_state_coverage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# compute_ndvi_coverage is the NDVI pipeline function, defined elsewhere in the project&lt;/span&gt;
    &lt;span class="n"&gt;coverage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_ndvi_coverage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;farm_state:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coverage_pct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;coverage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ready&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
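
&lt;p&gt;As I read Celery's documentation, &lt;code&gt;retry_backoff=True&lt;/code&gt; starts at one second and doubles per retry, capped by &lt;code&gt;retry_backoff_max&lt;/code&gt; (600 seconds by default). &lt;code&gt;retry_jitter&lt;/code&gt; is enabled by default, and &lt;code&gt;max_retries&lt;/code&gt; defaults to 3. Here is a small sketch of that rule as plain Python; &lt;code&gt;backoff_delay&lt;/code&gt; is an illustrative helper, not a Celery function.&lt;/p&gt;

```python
import random


def backoff_delay(retries: int, factor: int = 1, maximum: int = 600, jitter: bool = True) -> int:
    """Delay in seconds before retry number `retries` (0-based), following the documented rule."""
    ceiling = min(maximum, factor * (2 ** retries))
    if jitter:
        # Full jitter: pick anywhere between 0 and the ceiling, inclusive.
        return random.randint(0, ceiling)
    return ceiling


# Upper bounds for the default three retries, with jitter disabled for clarity.
print([backoff_delay(n, jitter=False) for n in range(3)])
```

&lt;p&gt;Under those defaults, a task that keeps failing gives up within a few seconds of retry delay. The cache key then stays empty, so clients keep seeing &lt;code&gt;processing&lt;/code&gt; until another request or the backfill enqueues the job again.&lt;/p&gt;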






&lt;h3&gt;
  
  
  3. Daily Backfill (Critical)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;celery&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;shared_task&lt;/span&gt;

&lt;span class="nd"&gt;@shared_task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;enqueue_daily_farm_state_coverage&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;farm_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_active_farm_ids&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;farm_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;farm_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;compute_farm_state_coverage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
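
&lt;p&gt;The backfill only helps if something runs it on a schedule and if cached entries survive until the next run. The sketch below covers both pieces. It assumes Celery beat is configured from Django settings under the &lt;code&gt;CELERY&lt;/code&gt; namespace and that the task lives in &lt;code&gt;ndvi.tasks&lt;/code&gt;; both are assumptions, since the scheduler config is not shown here. The arithmetic underneath suggests that with the 6-hour TTL from step 2 and a once-a-day backfill, an entry stays warm for only a quarter of the day unless on-demand requests refresh it.&lt;/p&gt;

```python
# settings.py fragment (assumed layout): run the backfill once every 24 hours.
CELERY_BEAT_SCHEDULE = {
    "daily-farm-state-coverage": {
        "task": "ndvi.tasks.enqueue_daily_farm_state_coverage",  # assumed module path
        "schedule": 24 * 60 * 60,  # seconds between runs
    },
}

CACHE_TTL_SECONDS = 60 * 60 * 6  # the timeout used by the compute task in step 2


def warm_fraction(ttl_seconds: float, refresh_interval_seconds: float) -> float:
    """Share of each refresh interval during which the cached entry is still alive."""
    return min(1.0, ttl_seconds / refresh_interval_seconds)


interval = CELERY_BEAT_SCHEDULE["daily-farm-state-coverage"]["schedule"]
print(warm_fraction(CACHE_TTL_SECONDS, interval))
```

&lt;p&gt;If misses during the cold window matter, one option is to raise the TTL above the backfill interval; another is to run the backfill more often than the TTL.&lt;/p&gt;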






&lt;h2&gt;
  
  
  Observability (The Real Upgrade)
&lt;/h2&gt;

&lt;p&gt;Metrics added:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task duration&lt;/li&gt;
&lt;li&gt;Task success/failure&lt;/li&gt;
&lt;li&gt;Queue depth&lt;/li&gt;
&lt;/ul&gt;
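
&lt;p&gt;This is a rough idea of what recording the first two metrics involves: a decorator that times each run and counts outcomes. It uses only the standard library so it runs anywhere. &lt;code&gt;TaskMetrics&lt;/code&gt; and &lt;code&gt;instrumented&lt;/code&gt; are hypothetical names, and in a real deployment the counters would more likely be Prometheus metrics exposed by the worker. Queue depth is left out because it has to be read from the broker.&lt;/p&gt;

```python
import functools
import time
from collections import Counter
from typing import Any, Callable


class TaskMetrics:
    """In-process store for task outcomes and durations (illustrative only)."""

    def __init__(self) -> None:
        self.outcomes: Counter = Counter()
        self.durations: dict[str, list[float]] = {}

    def record(self, name: str, outcome: str, seconds: float) -> None:
        self.outcomes[(name, outcome)] += 1
        self.durations.setdefault(name, []).append(seconds)


metrics = TaskMetrics()


def instrumented(func: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a task body so every call records its duration and success/failure."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            metrics.record(func.__name__, "failure", time.perf_counter() - start)
            raise
        metrics.record(func.__name__, "success", time.perf_counter() - start)
        return result

    return wrapper


@instrumented
def compute_farm_state_coverage(farm_id: int) -> float:
    if farm_id == 0:
        raise ValueError("unknown farm")  # simulated failure
    return 61.5  # placeholder value, not a real measurement


compute_farm_state_coverage(7)
try:
    compute_farm_state_coverage(0)
except ValueError:
    pass
print(dict(metrics.outcomes))
```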




&lt;h2&gt;
  
  
  Metrics (Grafana Observations)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  📊 Grafana Screenshots
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Latency Graph&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgrbfnr3v7fcvdop951k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgrbfnr3v7fcvdop951k.png" alt="725ms on farm get endpoint" width="342" height="458"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Before
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;P95 latency: ~1.25 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  After
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;API latency: ~725ms (P95)&lt;/li&gt;
&lt;li&gt;Background tasks: 60–90s&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Before vs After Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;API latency (P95)&lt;/td&gt;
&lt;td&gt;~1.25 min&lt;/td&gt;
&lt;td&gt;~725 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System type&lt;/td&gt;
&lt;td&gt;Request-driven&lt;/td&gt;
&lt;td&gt;Pipeline-driven&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scalability&lt;/td&gt;
&lt;td&gt;Poor&lt;/td&gt;
&lt;td&gt;Strong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;Improved&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;I stopped treating my API like a calculator and started treating my system like a data pipeline.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s when everything changed.&lt;/p&gt;




</description>
      <category>devops</category>
      <category>performance</category>
      <category>distributedsystems</category>
      <category>backend</category>
    </item>
    <item>
      <title>Designing a One-Way Farm Sync Architecture (Nextcloud Django DRF)</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sun, 15 Mar 2026 07:11:34 +0000</pubDate>
      <link>https://dev.to/rahim8050/designing-a-one-way-farm-sync-architecture-nextcloud-django-drf-5bh3</link>
      <guid>https://dev.to/rahim8050/designing-a-one-way-farm-sync-architecture-nextcloud-django-drf-5bh3</guid>
      <description>&lt;h2&gt;
  
  
  From Nextcloud to Django: Designing a Farm Sync Architecture with DRF
&lt;/h2&gt;

&lt;p&gt;Modern applications rarely live in a single system. As projects grow, different components begin to specialize: one system handles identity and user workflows, while another focuses on computation and domain logic.&lt;/p&gt;

&lt;p&gt;This week I explored a small but interesting distributed architecture problem: &lt;strong&gt;how to synchronize farm data between a Nextcloud application and a Django REST API backend&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The goal was simple in theory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nextcloud provides the user interface.&lt;/li&gt;
&lt;li&gt;Django performs geospatial computation (NDVI, raster processing).&lt;/li&gt;
&lt;li&gt;Farm data must stay consistent between the two systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But as any engineer knows, “simple in theory” is where architecture decisions start to matter.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Initial Problem
&lt;/h2&gt;

&lt;p&gt;In the system I’m building, users manage farms from a Nextcloud application while a Django service handles geospatial workloads.&lt;/p&gt;

&lt;p&gt;The challenge was deciding &lt;strong&gt;where farm data should live&lt;/strong&gt; and &lt;strong&gt;how it should propagate between systems&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Three architectural questions emerged:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Which system is the source of truth?&lt;/li&gt;
&lt;li&gt;How do we synchronize data between services?&lt;/li&gt;
&lt;li&gt;How do we avoid identity conflicts between systems?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These are classic distributed systems questions, even in relatively small projects.&lt;/p&gt;




&lt;h2&gt;
  
  
  Option 1: Bidirectional Sync (The Dangerous Path)
&lt;/h2&gt;

&lt;p&gt;One tempting solution is letting both systems create farms and then syncing them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nextcloud &amp;lt;--&amp;gt; Django
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At first glance this feels flexible. In practice it creates difficult problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;conflicting updates&lt;/li&gt;
&lt;li&gt;race conditions&lt;/li&gt;
&lt;li&gt;reconciliation logic&lt;/li&gt;
&lt;li&gt;versioning requirements&lt;/li&gt;
&lt;/ul&gt;
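
&lt;p&gt;Here is a toy illustration of the first item, conflicting updates, assuming both sides resolve conflicts with last-write-wins on a timestamp. The record layout and helper are made up for the example.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FarmRecord:
    name: str
    area_ha: float
    updated_at: int  # logical timestamp of the last local edit


def last_write_wins(a: FarmRecord, b: FarmRecord) -> FarmRecord:
    """Keep whichever whole record was edited most recently."""
    return a if a.updated_at >= b.updated_at else b


# Two edits to different fields happen before either side has synced.
nextcloud_copy = FarmRecord(name="Demo Farm North", area_ha=10.0, updated_at=2)
django_copy = FarmRecord(name="Demo Farm", area_ha=12.5, updated_at=3)

merged = last_write_wins(nextcloud_copy, django_copy)
print(merged)
```

&lt;p&gt;The rename made in Nextcloud disappears without any error. Preventing that takes per-field merging or version tracking, which is exactly the reconciliation and versioning work in the list.&lt;/p&gt;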

&lt;p&gt;Large distributed databases solve this with vector clocks and conflict resolution strategies. For most applications, that complexity is unnecessary.&lt;/p&gt;

&lt;p&gt;So I rejected bidirectional replication early.&lt;/p&gt;




&lt;h2&gt;
  
  
  Option 2: API-First Architecture
&lt;/h2&gt;

&lt;p&gt;Instead of replicating farms between databases, Nextcloud simply &lt;strong&gt;delegates creation to the Django API&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When a user creates a farm in Nextcloud:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User
  ↓
Nextcloud UI
  ↓
Nextcloud Controller
  ↓
POST /api/v1/farms/ (Django REST Framework)
  ↓
Django database
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Django becomes the &lt;strong&gt;source of truth for farm data&lt;/strong&gt;, while Nextcloud acts as the interface layer.&lt;/p&gt;

&lt;p&gt;This pattern has several advantages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Immediate Consistency
&lt;/h3&gt;

&lt;p&gt;Since farms are created directly in Django, the backend always has the latest data.&lt;/p&gt;

&lt;p&gt;There is no delayed replication.&lt;/p&gt;

&lt;h3&gt;
  
  
  Clear Ownership
&lt;/h3&gt;

&lt;p&gt;Each system has a defined responsibility:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nextcloud&lt;/strong&gt; – user interface, identity, workflow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Django&lt;/strong&gt; – domain logic, geospatial processing, data storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Clear boundaries reduce architectural complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extensibility
&lt;/h3&gt;

&lt;p&gt;Once Django exposes a clean API, other systems can integrate easily:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;mobile apps&lt;/li&gt;
&lt;li&gt;data pipelines&lt;/li&gt;
&lt;li&gt;satellite processing services&lt;/li&gt;
&lt;li&gt;analytics dashboards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything interacts with the same API.&lt;/p&gt;




&lt;h2&gt;
  
  
  Solving the Cross-System Identity Problem
&lt;/h2&gt;

&lt;p&gt;Once multiple systems talk about the same object, a subtle problem appears:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;identity consistency&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If each system generated its own farm IDs, synchronization would become fragile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nextcloud Farm ID = 12
Django Farm ID = 47
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now every integration requires mapping tables.&lt;/p&gt;

&lt;p&gt;Instead, the architecture introduces a &lt;strong&gt;stable external identifier&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Nextcloud generates a UUID called:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;external_farm_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That UUID is sent to Django whenever a farm is synchronized.&lt;/p&gt;

&lt;p&gt;Conceptually the model looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Farm
 ├─ id (internal database id)
 └─ external_farm_id (shared UUID)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now both systems reference the farm using the same identifier.&lt;/p&gt;

&lt;p&gt;When Nextcloud syncs a farm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /api/v1/farms/sync
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Payload example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"external_farm_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uuid"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"external_user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"nextcloud_uid"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Demo Farm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"bbox"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"centroid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
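
&lt;p&gt;Because the payload carries &lt;code&gt;external_farm_id&lt;/code&gt;, the sync endpoint can be idempotent: create on first sight, update on every repeat. Below is a standard-library sketch of that behaviour, with a dict standing in for the Django table. &lt;code&gt;FarmStore&lt;/code&gt; and &lt;code&gt;sync_farm&lt;/code&gt; are hypothetical names. In Django the same effect would typically come from an &lt;code&gt;update_or_create&lt;/code&gt; lookup on &lt;code&gt;external_farm_id&lt;/code&gt;, backed by a unique constraint.&lt;/p&gt;

```python
import itertools
import uuid
from dataclasses import dataclass, field


@dataclass
class Farm:
    id: int                # internal database id
    external_farm_id: str  # shared UUID generated by Nextcloud
    external_user_id: str
    name: str


@dataclass
class FarmStore:
    """In-memory stand-in for the farms table, unique on external_farm_id."""

    _rows: dict[str, Farm] = field(default_factory=dict)
    _ids: itertools.count = field(default_factory=lambda: itertools.count(1))

    def sync_farm(self, payload: dict) -> tuple[Farm, bool]:
        """Upsert by external_farm_id; returns (farm, created)."""
        key = str(uuid.UUID(payload["external_farm_id"]))  # rejects malformed ids
        farm = self._rows.get(key)
        if farm is None:
            farm = Farm(next(self._ids), key, payload["external_user_id"], payload["name"])
            self._rows[key] = farm
            return farm, True
        farm.name = payload["name"]
        return farm, False


store = FarmStore()
farm_uuid = str(uuid.uuid4())  # what Nextcloud would generate once per farm
first, created = store.sync_farm(
    {"external_farm_id": farm_uuid, "external_user_id": "nextcloud_uid", "name": "Demo Farm"}
)
again, created_again = store.sync_farm(
    {"external_farm_id": farm_uuid, "external_user_id": "nextcloud_uid", "name": "Demo Farm (renamed)"}
)
print(created, created_again, first.id == again.id)
```

&lt;p&gt;Retrying a failed sync is then safe: sending the same UUID again updates the existing row instead of creating a duplicate farm.&lt;/p&gt;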



&lt;p&gt;This approach provides several benefits.&lt;/p&gt;

&lt;h3&gt;
  
  
  No ID Collisions
&lt;/h3&gt;

&lt;p&gt;UUIDs are unique for all practical purposes, so farms created by different systems cannot end up sharing an identifier.&lt;/p&gt;

&lt;h3&gt;
  
  
  Clean Synchronization
&lt;/h3&gt;

&lt;p&gt;Updates and raster requests can reference farms using the same external identifier.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;/api/v1/farms/{external_farm_id}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Future Integration
&lt;/h3&gt;

&lt;p&gt;If other services appear later (mobile apps, analytics pipelines, satellite processors), they can all reference farms using the same UUID.&lt;/p&gt;

&lt;p&gt;This pattern is common in distributed systems and prevents a large class of synchronization bugs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Identity Translation Between Systems
&lt;/h2&gt;

&lt;p&gt;Another challenge is mapping users between Nextcloud and Django.&lt;/p&gt;

&lt;p&gt;Nextcloud users authenticate normally and then communicate with Django using an &lt;strong&gt;integration token&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Conceptually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nextcloud User
      ↓
Integration Token
      ↓
Django API
      ↓
Farm Sync
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Django service stores farms under a service user while still preserving the original &lt;code&gt;external_user_id&lt;/code&gt; from Nextcloud.&lt;/p&gt;

&lt;p&gt;This keeps authentication simple while preserving user ownership information.&lt;/p&gt;
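
&lt;p&gt;Here is a sketch of that translation step under stated assumptions: one shared integration token maps to a single Django service user, and the Nextcloud uid travels in the payload as &lt;code&gt;external_user_id&lt;/code&gt;. The token value, the &lt;code&gt;SERVICE_USER&lt;/code&gt; name, and both helper functions are illustrative, not the project's actual code.&lt;/p&gt;

```python
import hmac

# Assumed configuration: a single integration token issued to the Nextcloud app.
INTEGRATION_TOKEN = "example-token-do-not-use"
SERVICE_USER = "nextcloud-integration"


def authenticate_integration(presented_token: str) -> str:
    """Return the service user for a valid token, else raise PermissionError."""
    # compare_digest runs in constant time, so it does not leak how much matched.
    if not hmac.compare_digest(presented_token.encode(), INTEGRATION_TOKEN.encode()):
        raise PermissionError("invalid integration token")
    return SERVICE_USER


def build_farm_row(presented_token: str, payload: dict) -> dict:
    """The farm is owned by the service user; the Nextcloud uid is kept alongside."""
    owner = authenticate_integration(presented_token)
    return {
        "owner": owner,
        "external_user_id": payload["external_user_id"],
        "external_farm_id": payload["external_farm_id"],
        "name": payload["name"],
    }


row = build_farm_row(
    INTEGRATION_TOKEN,
    {"external_user_id": "nextcloud_uid", "external_farm_id": "uuid", "name": "Demo Farm"},
)
print(row["owner"], row["external_user_id"])
```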




&lt;h2&gt;
  
  
  Measuring the Integration Layer
&lt;/h2&gt;

&lt;p&gt;During development I analyzed the Nextcloud app using &lt;code&gt;cloc&lt;/code&gt; to understand the size of the integration layer.&lt;/p&gt;

&lt;p&gt;The results showed roughly &lt;strong&gt;11,000 lines of code&lt;/strong&gt;, split mainly between PHP backend logic and JavaScript UI.&lt;/p&gt;

&lt;p&gt;Around this size, architecture decisions begin to matter more than raw implementation.&lt;/p&gt;

&lt;p&gt;Systems become complex enough that &lt;strong&gt;clear service boundaries&lt;/strong&gt; become essential.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;Several practical lessons emerged from this integration work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Avoid bidirectional replication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Two systems writing to the same domain model creates unnecessary complexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Establish a clear source of truth&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this architecture, Django owns farm data while Nextcloud orchestrates the workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Use stable external identifiers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;UUIDs dramatically simplify synchronization across systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Prefer API-first architectures&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;APIs make it easier to expand systems and integrate future services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Keep compute close to the data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since Django handles geospatial processing, storing farms there keeps the compute layer efficient.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;What started as a simple integration between Nextcloud and Django turned into a useful exercise in distributed system design.&lt;/p&gt;

&lt;p&gt;Even relatively small systems benefit from clear service boundaries and stable identity strategies.&lt;/p&gt;

&lt;p&gt;By combining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an &lt;strong&gt;API-first architecture&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;external UUID identifiers&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;and &lt;strong&gt;clear ownership of farm data&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;the system stays simple today while remaining extensible for future services like satellite analytics or mobile farming applications.&lt;/p&gt;

&lt;p&gt;Sometimes good architecture isn’t about complexity at all — it’s about &lt;strong&gt;clarity of responsibility between systems&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>distributedsystems</category>
      <category>api</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Escaping the Sync Trap: How I Slashed Latency by 10x in a Django-Rust API Gateway</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sun, 01 Mar 2026 13:24:13 +0000</pubDate>
      <link>https://dev.to/rahim8050/escaping-the-sync-trap-how-i-slashed-latency-by-10x-in-a-django-rust-api-gateway-323m</link>
      <guid>https://dev.to/rahim8050/escaping-the-sync-trap-how-i-slashed-latency-by-10x-in-a-django-rust-api-gateway-323m</guid>
      <description>&lt;h2&gt;
  
  
  How I diagnosed and eliminated synchronous bottlenecks in a Django-Rust API gateway, migrating to ASGI and pre-warming caches for millisecond responses.
&lt;/h2&gt;

&lt;p&gt;When building a high-performance backend, the standard playbook is well-known: &lt;strong&gt;offload heavy computational tasks to faster microservices (like Rust)&lt;/strong&gt; and implement an aggressive caching strategy.&lt;/p&gt;

&lt;p&gt;Recently, I did exactly that. My architecture is built around a &lt;strong&gt;Django REST Framework gateway&lt;/strong&gt; sitting behind &lt;strong&gt;Caddy&lt;/strong&gt;, heavily monitored with &lt;strong&gt;Prometheus&lt;/strong&gt; and &lt;strong&gt;Grafana&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But despite the raw speed of Rust and my caching layers, my dashboards were flashing red. Latency was spiking to brutal 10-second flatlines for my most critical endpoints. Worse, my observability itself started failing, creating &lt;em&gt;silent blind spots&lt;/em&gt; exactly when I needed data the most.&lt;/p&gt;

&lt;p&gt;Here is the detective story of how I used telemetry to hunt down synchronous traps, migrate to a non-blocking async architecture, and implement proactive pre-warming to bring response times down to the millisecond range — all while reclaiming &lt;strong&gt;30% of my idle CPU&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: A Gateway and its Heavy Lifters
&lt;/h2&gt;

&lt;p&gt;Before diving into the problem, here is a quick look at my setup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;          ┌──────────────────────────────┐
          │          Nextcloud           │
          │ (Authenticated Client Calls) │
          └──────────────┬───────────────┘
                         │  JWT / API Key
                         ▼
               ┌────────────────────┐
               │  Django Gateway    │
               │ (ASGI, DRF, Caddy) │
               └──────┬─────────────┘
                      │
     ┌────────────────┴────────────────┐
     │                                 │
     ▼                                 ▼
┌───────────────┐              ┌────────────────┐
│ NDVI Service  │              │ Weather Service│
│ (Rust, 8081)  │              │ (Rust, 8090)   │
│  → Postgres   │              │  → MySQL       │
└───────────────┘              └────────────────┘

       ▲
       │ Prometheus &amp;amp; Grafana
       │ (Observability Stack)
       ▼
   System Telemetry + Metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vebc0f2a5ldvbhzfubp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vebc0f2a5ldvbhzfubp.png" alt="The Architecture" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Django&lt;/strong&gt; stays as the public API gateway, accepting requests authenticated via &lt;strong&gt;JWT&lt;/strong&gt; or &lt;strong&gt;API keys&lt;/strong&gt;.&lt;br&gt;
It enforces a shared JSON response envelope containing &lt;code&gt;status&lt;/code&gt;, &lt;code&gt;message&lt;/code&gt;, &lt;code&gt;data&lt;/code&gt;, and &lt;code&gt;errors&lt;/code&gt; to keep all client interactions standardized.&lt;/p&gt;

&lt;p&gt;Specific traffic routes — namely &lt;code&gt;/api/v1/ndvi&lt;/code&gt; and &lt;code&gt;/api/v1/weather/*&lt;/code&gt; — are forwarded directly to my &lt;strong&gt;Rust&lt;/strong&gt; backends:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🛰 &lt;strong&gt;NDVI microservice&lt;/strong&gt; ingests satellite data into a dedicated &lt;strong&gt;Postgres&lt;/strong&gt; database.&lt;/li&gt;
&lt;li&gt;🌦 &lt;strong&gt;Weather microservice&lt;/strong&gt; relies on a &lt;strong&gt;MySQL&lt;/strong&gt; database and communicates with external providers like &lt;strong&gt;Open-Meteo&lt;/strong&gt; and &lt;strong&gt;NASA POWER&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My &lt;strong&gt;Nextcloud&lt;/strong&gt; instance acts like any other client, presenting either an &lt;code&gt;Authorization: Bearer&lt;/code&gt; token or an &lt;code&gt;X-API-Key&lt;/code&gt;. Django manages this traffic using a specific &lt;code&gt;nextcloud_hmac&lt;/code&gt; throttle configuration before passing the authorized call down to Rust with the original headers intact.&lt;/p&gt;




&lt;h2&gt;
  
  
  The False Cure: The Worker Starvation Anomaly
&lt;/h2&gt;

&lt;p&gt;To protect the system, I implemented aggressive TTL caching (e.g., 1 hour for schema data, 5 minutes for API tokens). However, once I added traffic, my &lt;strong&gt;Grafana&lt;/strong&gt; dashboard revealed a chaotic reality.&lt;/p&gt;

&lt;p&gt;I saw brutal, perfectly flat 8–10 second latency spikes on key endpoints. Crucially, at exactly the same moments, my internal &lt;code&gt;/metrics&lt;/code&gt; request rate dropped to zero.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Diagnosis: The Gateway Caching Itself to Death
&lt;/h2&gt;

&lt;p&gt;The telemetry told a story of hidden synchronous bottlenecks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Trigger:&lt;/strong&gt; When a high-traffic endpoint like &lt;code&gt;farm-weather-current/GET&lt;/code&gt; experienced a cache miss, the Django gateway had to fetch fresh data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Trap:&lt;/strong&gt; My Django deployment was running on &lt;strong&gt;standard synchronous workers&lt;/strong&gt;. Each worker called the Rust service, which in turn called the external weather API (taking 2.2+ seconds).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Impact (Worker Starvation):&lt;/strong&gt; Because the Django worker was synchronous, it &lt;em&gt;blocked entirely&lt;/em&gt; for those 2.2 seconds. All incoming traffic got stuck in a queue.&lt;/li&gt;
&lt;/ol&gt;
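&lt;p&gt;To make the queueing effect concrete, here is a small back-of-the-envelope simulation. It is a sketch under stated assumptions: the function name &lt;code&gt;finish_times_ms&lt;/code&gt; is hypothetical, all requests are assumed to arrive at the same instant, and the worker counts are illustrative rather than my real deployment settings. Only the 2.2 second upstream call comes from the telemetry above.&lt;/p&gt;

```python
import heapq


def finish_times_ms(n_requests, n_workers, service_ms):
    """Finish time (ms) of each request when all arrive at t=0 and every
    synchronous worker can handle only one request at a time."""
    free_at = [0] * n_workers  # moment at which each worker becomes idle
    heapq.heapify(free_at)
    finished = []
    for _ in range(n_requests):
        start = heapq.heappop(free_at)  # earliest available worker
        done = start + service_ms       # worker is blocked for the whole call
        heapq.heappush(free_at, done)
        finished.append(done)
    return finished


# One blocked worker, four simultaneous cache misses at 2.2 s each
print(finish_times_ms(4, 1, 2200))  # [2200, 4400, 6600, 8800]
```

&lt;p&gt;Under these assumptions the fourth request waits 8.8 seconds even though no single upstream call took longer than 2.2 seconds, which is the same order of magnitude as the spikes on the dashboard. Adding workers only moves the cliff: with two workers the same burst finishes at 2.2 and 4.4 seconds.&lt;/p&gt;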




&lt;h3&gt;
  
  
  The Trap: Synchronous Gateway Routing
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.http&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;JsonResponse&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.views&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;View&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;async_weather_proxy_view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://weather-service:8090/api/v1/weather-current&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JsonResponse&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;errors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I had successfully offloaded work to Rust — but my synchronous Django workers completely nullified the speed gains.&lt;/p&gt;




&lt;h2&gt;
  
  
  The First Fix: Embracing the Non-Blocking Gateway
&lt;/h2&gt;

&lt;p&gt;I needed to decouple the speed of the gateway from the speed of the external API calls it was routing.&lt;br&gt;
I migrated the Django deployment from &lt;strong&gt;synchronous workers&lt;/strong&gt; to an &lt;strong&gt;ASGI&lt;/strong&gt; (Asynchronous Server Gateway Interface) setup, allowing my gateway to handle requests asynchronously.&lt;/p&gt;

&lt;p&gt;I rewrote my proxy views to use asynchronous HTTP clients like &lt;strong&gt;httpx&lt;/strong&gt;:&lt;/p&gt;




&lt;h3&gt;
  
  
  The Fix: Asynchronous Non-Blocking Routing
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.http&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;JsonResponse&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;async_weather_proxy_view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# The event loop is freed! Django can serve other requests while waiting
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;rust_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://weather-service:8090/api/v1/weather-current&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JsonResponse&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rust_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;errors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The visual evidence on my dashboards was a massive, instant victory:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Observability Restored:&lt;/strong&gt; The metrics scrape line remained unbroken. Django could finally pause a slow weather request, instantly answer the Prometheus scrape, and resume without blocking.&lt;/li&gt;
&lt;li&gt;⚡ &lt;strong&gt;Instant Internal Routing:&lt;/strong&gt; In my initial setup, a simple internal metrics scrape took ~84ms. After the ASGI migration, that duration dropped to &lt;strong&gt;11ms&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Second Fix: Proactive Caching
&lt;/h2&gt;

&lt;p&gt;While the gateway itself no longer stalled, the end-user experience was still occasionally sluggish.&lt;/p&gt;

&lt;p&gt;With an “on-demand” caching strategy, the very first user to request the weather after a 1-hour cache expiration had to pay the &lt;strong&gt;Cache Miss Penalty&lt;/strong&gt; (waiting ~2.2 seconds for the external API).&lt;/p&gt;

&lt;p&gt;To eliminate this, I &lt;strong&gt;decoupled the data-fetching time from the user-request cycle entirely&lt;/strong&gt;.&lt;br&gt;
I implemented a &lt;strong&gt;Proactive Background Pre-warming&lt;/strong&gt; pattern using a background task (like Celery) that runs every 55 minutes, independently fetching slow data and silently overwriting the cache before it expires.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gk9f4fawwvmmjypt9xe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gk9f4fawwvmmjypt9xe.png" alt="caching" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  The Cache Pre-Warmer (Celery Example)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;celery&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;shared_task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.core.cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="nd"&gt;@shared_task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;pre_warm_weather_cache&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Runs in the background every 55 minutes, shielding the user from the 2.2s wait
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://weather-service:8090/api/v1/weather-current&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather_current_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
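&lt;p&gt;The task above still needs something to trigger it every 55 minutes. One way to do that is a Celery beat entry in the Django settings. This is a sketch that assumes Celery is configured to read Django settings with the &lt;code&gt;CELERY&lt;/code&gt; namespace; the module path &lt;code&gt;weather.tasks&lt;/code&gt; and the constant names are hypothetical.&lt;/p&gt;

```python
# settings.py (fragment)

CACHE_TTL_SECONDS = 60 * 60          # matches timeout=3600 in the task
PREWARM_INTERVAL_SECONDS = 55 * 60   # refresh before the entry can expire

CELERY_BEAT_SCHEDULE = {
    "pre-warm-weather-cache": {
        # Hypothetical dotted path; use wherever the task actually lives
        "task": "weather.tasks.pre_warm_weather_cache",
        # A plain number is interpreted by Celery beat as seconds
        "schedule": PREWARM_INTERVAL_SECONDS,
    },
}
```

&lt;p&gt;The five-minute gap between the refresh interval and the TTL is the safety margin: if one run is late or fails, the previous cache entry is still valid for a short while.&lt;/p&gt;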



&lt;p&gt;The result? The average latency for critical weather endpoints &lt;strong&gt;plummeted from seconds to milliseconds&lt;/strong&gt; as the hot cache permanently took over.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Grand Slam: Optimizing the Rust Build Pipeline
&lt;/h2&gt;

&lt;p&gt;The final victory of this new architecture came from pure server efficiency.&lt;br&gt;
During deployments, compiling Rust crates (&lt;code&gt;sqlx&lt;/code&gt;, &lt;code&gt;syn&lt;/code&gt;) from scratch was pegging my 4-core server at 100% CPU, which starved the running services and caused request timeouts.&lt;/p&gt;

&lt;p&gt;To fix this, I implemented &lt;strong&gt;cargo-chef&lt;/strong&gt; in a &lt;strong&gt;multi-stage Dockerfile&lt;/strong&gt; so that compiled Rust dependencies are cached in their own Docker layer and only rebuilt when the dependency list changes.&lt;/p&gt;




&lt;h3&gt;
  
  
  Multi-stage Dockerfile for the Rust Microservice
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;rust:1.88-slim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;chef&lt;/span&gt;
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; root&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;cargo &lt;span class="nb"&gt;install &lt;/span&gt;cargo-chef
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;chef&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;planner&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;cargo chef prepare &lt;span class="nt"&gt;--recipe-path&lt;/span&gt; recipe.json

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;chef&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=planner /app/recipe.json recipe.json&lt;/span&gt;
&lt;span class="c"&gt;# Docker caches this heavy dependency build!&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;cargo chef cook &lt;span class="nt"&gt;--release&lt;/span&gt; &lt;span class="nt"&gt;--recipe-path&lt;/span&gt; recipe.json
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;cargo build &lt;span class="nt"&gt;--release&lt;/span&gt; &lt;span class="nt"&gt;--bin&lt;/span&gt; weather-service

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;debian:bookworm-slim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;runtime&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; libssl-dev ca-certificates
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /app/target/release/weather-service /usr/local/bin/&lt;/span&gt;
&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8090&lt;/span&gt;
&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["/usr/local/bin/weather-service"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Between ASGI, background caching, and Docker layer caching, my total &lt;strong&gt;Node CPU&lt;/strong&gt; now rests comfortably between &lt;strong&gt;11% and 13%&lt;/strong&gt;.&lt;br&gt;
In effect, I reclaimed &lt;strong&gt;30% of my total server compute capacity&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion: Finding the Next Bottleneck
&lt;/h2&gt;

&lt;p&gt;Building high-performance API gateways is an ongoing journey of shifting bottlenecks.&lt;/p&gt;

&lt;p&gt;By relying on my telemetry, I showed that &lt;strong&gt;synchronous workers can nullify microservice speed gains&lt;/strong&gt;, confirmed that &lt;strong&gt;ASGI&lt;/strong&gt; keeps the gateway responsive during slow upstream calls, and removed &lt;strong&gt;cache miss penalties&lt;/strong&gt; from the user request path.&lt;/p&gt;

&lt;p&gt;With the gateway running unburdened, my dashboards have revealed the next bottleneck: a 6-to-8 second delay on my token generation endpoint.&lt;br&gt;
Because my CPU is mostly idle, the delay is not compute-bound, and my leading suspect is a &lt;strong&gt;database connection pool limitation&lt;/strong&gt; in the Rust service.&lt;/p&gt;

&lt;p&gt;And thanks to my new observability baseline, I know exactly &lt;strong&gt;where to strike next&lt;/strong&gt;.&lt;/p&gt;




</description>
      <category>django</category>
      <category>rust</category>
      <category>microservices</category>
      <category>devops</category>
    </item>
    <item>
      <title>Boost Performance by Migrating Django Endpoints to Rust: NDVI &amp; Weather Services (Phase 2 Complete)</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sat, 21 Feb 2026 16:55:43 +0000</pubDate>
      <link>https://dev.to/rahim8050/boost-performance-by-migrating-django-endpoints-to-rust-ndvi-weather-services-phase-2-complete-29a8</link>
      <guid>https://dev.to/rahim8050/boost-performance-by-migrating-django-endpoints-to-rust-ndvi-weather-services-phase-2-complete-29a8</guid>
      <description>&lt;h1&gt;
  
  
  Migrating Django Endpoints to Rust: My NDVI &amp;amp; Weather Services Journey
&lt;/h1&gt;

&lt;p&gt;When I started rethinking my NDVI and weather endpoints, the goal was simple: improve performance, enforce strong auth, and gain full observability. Over the last few weeks, I migrated critical services from Django to Rust, and the process turned out to be an engineering adventure worth sharing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 0 – Contract Freeze: Locking the APIs
&lt;/h2&gt;

&lt;p&gt;Before touching Rust, I froze all NDVI and weather API contracts in Django. This ensured that the front-end and other consumers could continue working without disruptions. Think of it as putting a protective glass over your APIs: nothing moves until Rust is ready to take over.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; Frozen NDVI + weather contracts from Django.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1 – Multi-Service Architecture &amp;amp; Shared Auth/Throttle
&lt;/h2&gt;

&lt;p&gt;Next, I set up a Rust workspace with multiple services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NDVI service:&lt;/strong&gt; Handles vegetation index calculations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weather service:&lt;/strong&gt; Will eventually serve weather data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared auth &amp;amp; throttling module:&lt;/strong&gt; Ensures consistent authentication and rate limiting across all services.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This phase established the skeleton for independent Rust microservices while maintaining the same contract as Django.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; Rust workspace, shared auth/throttle, NDVI envelope.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 2 – Weather Migration
&lt;/h2&gt;

&lt;p&gt;With the workspace ready, I migrated weather endpoints from Django to Rust. Key steps included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implementing shared authentication and throttling.&lt;/li&gt;
&lt;li&gt;Integrating MySQL connections safely with Rust’s type system.&lt;/li&gt;
&lt;li&gt;Ensuring the endpoints conformed to the frozen contract from Phase 0.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After this phase, all weather requests were fully handled by Rust services, improving throughput and reliability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; Weather endpoints implemented in Rust.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 3 – Gateway Cutover (Planned)
&lt;/h2&gt;

&lt;p&gt;The final phase will transition Django routes to forward requests to Rust microservices. This will include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Canary deployments to avoid downtime.&lt;/li&gt;
&lt;li&gt;Metrics and alerting for observability.&lt;/li&gt;
&lt;li&gt;CI enforcement for Rust formatting, clippy lints, and tests across the workspace.&lt;/li&gt;
&lt;/ul&gt;
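&lt;p&gt;For the canary step, one possible approach is to send a fixed percentage of clients to the Rust service based on a stable hash of their identifier. The sketch below is an assumption about how that could look, not the implemented cutover; &lt;code&gt;routes_to_rust&lt;/code&gt;, &lt;code&gt;client_id&lt;/code&gt;, and &lt;code&gt;canary_percent&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
import hashlib


def routes_to_rust(client_id, canary_percent):
    """Return True when this client falls inside the canary slice.

    Hashing the client id keeps the decision stable across requests,
    so a given client does not flip between backends.
    """
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # stable bucket 0..99
    return bucket in range(canary_percent)


# Start small, then raise the percentage as the metrics stay healthy
send_to_rust = routes_to_rust("farm-42", 10)
```

&lt;p&gt;Setting the percentage to 0 routes everything to Django, and 100 routes everything to Rust, so the same function covers rollback and full cutover.&lt;/p&gt;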

&lt;p&gt;&lt;strong&gt;End state after Phase 3:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Django acts as a gateway, routing NDVI + weather requests to Rust services.&lt;/li&gt;
&lt;li&gt;NDVI is fully served by Rust/Postgres.&lt;/li&gt;
&lt;li&gt;Weather is fully served by Rust/MySQL.&lt;/li&gt;
&lt;li&gt;Shared auth and throttling are enforced in Rust.&lt;/li&gt;
&lt;li&gt;Observability and canary rollouts ensure safe production deployment.&lt;/li&gt;
&lt;li&gt;CI checks formatting, linting, and tests across the workspace.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Contract first:&lt;/strong&gt; Freezing contracts before migration prevents chaos.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared modules are gold:&lt;/strong&gt; Auth and throttling reused across services reduces duplication.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rust’s type system and ownership model&lt;/strong&gt; force careful database and network design.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental migration&lt;/strong&gt; avoids “big bang” outages.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Why Rust?
&lt;/h2&gt;

&lt;p&gt;Migrating to Rust allowed me to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Serve high-throughput endpoints with lower latency.&lt;/li&gt;
&lt;li&gt;Reduce runtime errors with compile-time guarantees.&lt;/li&gt;
&lt;li&gt;Scale services independently while sharing critical modules like auth and throttling.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Example: Rust Weather Service (axum + sqlx)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/main.rs&lt;/span&gt;
&lt;span class="nd"&gt;#![deny(clippy::all)]&lt;/span&gt;
&lt;span class="nd"&gt;#![forbid(unsafe_code)]&lt;/span&gt;

&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;axum&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="nn"&gt;routing&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Extension&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;response&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;IntoResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;http&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;serde&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;Deserialize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Serialize&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;sqlx&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;mysql&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;MySqlPoolOptions&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;sync&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;Arc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;tracing_subscriber&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="nn"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;SubscriberExt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;util&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;SubscriberInitExt&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;mod&lt;/span&gt; &lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;mod&lt;/span&gt; &lt;span class="n"&gt;rate_limit&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nd"&gt;#[derive(Clone)]&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;AppState&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;sqlx&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;MySqlPool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;#[derive(Deserialize)]&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;WeatherQuery&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lon&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;i64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;#[derive(Serialize)]&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;WeatherResponse&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;temp_c&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;precip_mm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;#[tokio::main]&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nn"&gt;anyhow&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nn"&gt;tracing_subscriber&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;.with&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;tracing_subscriber&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;EnvFilter&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_default_env&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="nf"&gt;.with&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;tracing_subscriber&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="nf"&gt;.init&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;MySqlPoolOptions&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;.max_connections&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"WEATHER_DATABASE_URL"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Arc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AppState&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/api/v1/weather/point"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_weather_point&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nf"&gt;.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/api/v1/weather/bulk"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post_weather_bulk&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nf"&gt;.layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;AuthLayer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="nf"&gt;.layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;rate_limit&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;RateLimitLayer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;addr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"LISTEN_ADDR"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.unwrap_or_else&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="s"&gt;"0.0.0.0:8080"&lt;/span&gt;&lt;span class="nf"&gt;.into&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
    &lt;span class="nn"&gt;tracing&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nd"&gt;info!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"listening on {}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nn"&gt;axum&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;Server&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;addr&lt;/span&gt;&lt;span class="nf"&gt;.parse&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.serve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="nf"&gt;.into_make_service&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Django Gateway Proxy Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# views/proxy.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.http&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HttpResponse&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.conf&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;

&lt;span class="n"&gt;PROXY_TARGET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PROXY_TARGET&lt;/span&gt;  &lt;span class="c1"&gt;# e.g., "http://rust-weather:8080"
&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;proxy_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;upstream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PROXY_TARGET&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-forwarded-for&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;META&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REMOTE_ADDR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;30.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;upstream_resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# request.body is a bytes property, not a coroutine&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;HttpResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;upstream_resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;upstream_resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
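        &lt;span class="c1"&gt;# httpx has already decoded the body, so drop upstream framing headers that no longer match it&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;upstream_resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transfer-encoding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content-encoding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content-length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)},&lt;/span&gt;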
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CI Example (GitHub Actions)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CI&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rust-check&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
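      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dtolnay/rust-toolchain@stable&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;components&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rustfmt, clippy&lt;/span&gt;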
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cargo fmt --all -- --check&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cargo clippy --workspace --all-targets -- -D warnings&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cargo test --workspace --all-features&lt;/span&gt;

  &lt;span class="na"&gt;python-check&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-python@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;python-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.11'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;python -m pip install --upgrade pip&lt;/span&gt;
          &lt;span class="s"&gt;pip install -r requirements.txt&lt;/span&gt;
          &lt;span class="s"&gt;pip install ruff mypy bandit&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ruff check .&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mypy .&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bandit -r .&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;This setup gives the Rust microservices a CI-checked, observable foundation while Django stays in place as the stable gateway. Phase 3 will finalize the gateway cutover with a canary deployment and metrics monitoring.&lt;/p&gt;

</description>
      <category>api</category>
      <category>django</category>
      <category>performance</category>
      <category>rust</category>
    </item>
    <item>
      <title>From Django to Rust Microservices: What Prometheus Taught Me About Backend Performance</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sun, 15 Feb 2026 06:48:42 +0000</pubDate>
      <link>https://dev.to/rahim8050/from-django-to-rust-microservices-what-prometheus-taught-me-about-backend-performance-3lbk</link>
      <guid>https://dev.to/rahim8050/from-django-to-rust-microservices-what-prometheus-taught-me-about-backend-performance-3lbk</guid>
      <description>&lt;p&gt;&lt;strong&gt;Django Performance and Prometheus Observability&lt;/strong&gt;&lt;br&gt;
I operate a stack combining Django REST Framework, Nextcloud integrations, Prometheus for metrics, and Grafana dashboards — all served behind Caddy with strict CI/CD and Dockerized isolation.&lt;/p&gt;

&lt;p&gt;Everything looked stable until my Prometheus metrics told a different story.&lt;/p&gt;

&lt;p&gt;In Grafana, the /prometheus-django-metrics endpoint consistently showed 250 ms latency spikes, while other endpoints like /farm-weather-hourly and /home averaged under 50 ms. Scrape durations varied between 80 ms and 430 ms, even when request rates stayed flat at 0.08 req/s.&lt;/p&gt;

&lt;p&gt;That meant the latency wasn’t due to load — it was intrinsic to Python’s runtime and how Django handled metrics serialization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyxwxx12fdn2qkxmm1n9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyxwxx12fdn2qkxmm1n9.png" alt="resource-hungry endpoints" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1mldcldzd7vxhtrcapg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1mldcldzd7vxhtrcapg.png" alt="Ram usage and cpu usage of the stack" width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Prometheus Exposed Django’s Bottleneck&lt;/strong&gt;&lt;br&gt;
Each Prometheus scrape forces Django to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Acquire the Global Interpreter Lock (GIL)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gather live counters and histograms&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Serialize the payload as text in the Prometheus exposition format&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reallocate memory on every request&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even low-volume systems suffer because this happens repeatedly at fixed intervals. Observability itself became a performance cost.&lt;/p&gt;

&lt;p&gt;The graphs made it clear: the bottleneck was the runtime, not the code.&lt;/p&gt;
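
&lt;p&gt;To get a feel for the serialization step in isolation, here is a minimal standard-library sketch that renders counters in the Prometheus text format and times it. The &lt;code&gt;render_metrics&lt;/code&gt; and &lt;code&gt;time_scrape&lt;/code&gt; helpers and the sample counter names are hypothetical, not the exporter running in this stack, so treat any numbers it prints as illustrative only.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time


def render_metrics(counters):
    """Render a dict of counter values in the Prometheus text exposition format."""
    lines = []
    for name in sorted(counters):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {counters[name]}")
    return "\n".join(lines) + "\n"


def time_scrape(counters, repeats=100):
    """Return the average wall-clock seconds spent rendering one scrape payload."""
    start = time.perf_counter()
    for _ in range(repeats):
        render_metrics(counters)
    return (time.perf_counter() - start) / repeats


# Hypothetical sample data: 1,000 counters named request_total_0 .. request_total_999.
sample = {f"request_total_{i}": i for i in range(1000)}
payload = render_metrics(sample)
avg_seconds = time_scrape(sample)
print(f"{len(payload)} bytes, {avg_seconds * 1000:.3f} ms per scrape")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The whole payload is rebuilt from scratch on every call, which is the part that repeats at each scrape interval regardless of user traffic.&lt;/p&gt;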

&lt;p&gt;&lt;strong&gt;Why Migrate Django Microservices to Rust&lt;/strong&gt;&lt;br&gt;
Rust’s asynchronous ecosystem (Tokio / Actix Web) solves these exact issues.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;No GIL: True multi-core concurrency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Predictable latency: Consistent under heavy I/O.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Memory safety: Compile-time guarantees without a garbage collector.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Low overhead I/O: Async networking with minimal allocations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my benchmarks, Rust microservices consistently stay under 40 ms latency, use 30–40% less CPU, and make Prometheus scrape times nearly constant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rust Microservices Architecture with Django and Prometheus&lt;/strong&gt;&lt;br&gt;
The new architecture keeps Django as the orchestrator — managing authentication, APIs, and admin routes — while Rust handles performance-intensive modules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;NDVI raster computation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Weather data transformation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Metrics aggregation&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They communicate via REST or gRPC. Prometheus scrapes metrics from both runtimes, and Grafana presents them in unified dashboards.&lt;br&gt;
Caddy provides HTTPS termination and reverse-proxy routing, maintaining secure observability across the stack.&lt;/p&gt;

&lt;p&gt;This hybrid model keeps Django’s flexibility while giving me Rust’s efficiency where it matters.&lt;/p&gt;
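
&lt;p&gt;One way to picture the split is a small routing table on the Django side. This is only a sketch: the path prefixes, service hostnames, and the &lt;code&gt;resolve_upstream&lt;/code&gt; helper are assumptions for illustration, not the actual configuration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical mapping of path prefixes to Rust services (names and ports are illustrative).
RUST_UPSTREAMS = {
    "/api/v1/ndvi/": "http://rust-ndvi:8080",
    "/api/v1/weather/": "http://rust-weather:8080",
    "/api/v1/metrics/": "http://rust-metrics:8080",
}


def resolve_upstream(path):
    """Return the Rust service base URL for a path, or None if Django handles it itself."""
    for prefix, base_url in RUST_UPSTREAMS.items():
        if path.startswith(prefix):
            return base_url
    return None


print(resolve_upstream("/api/v1/ndvi/tiles"))  # a Rust-backed route
print(resolve_upstream("/admin/"))             # stays in Django, so None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Anything that does not match a prefix stays in Django, so authentication and admin routes never leave the orchestrator.&lt;/p&gt;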

&lt;p&gt;&lt;strong&gt;Lessons from Observability&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Metrics are architectural signals, not just health checks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Python’s runtime trade-offs appear first under introspection, not user load.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rust isn’t a replacement for Django — it’s a reinforcement for its weak spots.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Observability drives evolution when used as feedback, not just monitoring.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The Road Ahead&lt;/strong&gt;&lt;br&gt;
My next experiment involves measuring CPU cycles per request across Django and Rust services under sustained Prometheus scrapes. The goal: prove observability-driven performance scaling in production.&lt;/p&gt;
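
&lt;p&gt;A first approximation on the Python side could look like the sketch below. It measures CPU time rather than raw cycles (cycle counts would need hardware performance counters), and &lt;code&gt;fake_metrics_handler&lt;/code&gt; is a hypothetical stand-in for a real view.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time


def cpu_seconds_per_call(handler, calls=1000):
    """Average CPU seconds (user plus system) this process spends per handler call."""
    start = time.process_time()
    for _ in range(calls):
        handler()
    return (time.process_time() - start) / calls


def fake_metrics_handler():
    # Hypothetical stand-in for a /metrics view: builds a small text payload.
    return "\n".join(f"metric_{i} {i}" for i in range(200))


cost = cpu_seconds_per_call(fake_metrics_handler)
print(f"{cost * 1e6:.1f} microseconds of CPU per call")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because &lt;code&gt;time.process_time()&lt;/code&gt; excludes time spent sleeping, it isolates CPU work from network waits, which is the comparison that matters when scrape rates stay flat.&lt;/p&gt;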

&lt;p&gt;If your /metrics endpoint is your slowest route, don’t ignore it — that graph might be pointing directly toward your next architectural upgrade.&lt;/p&gt;

&lt;p&gt;Further reading:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Prometheus Documentation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tokio Runtime&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Actix Web Framework&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Grafana Observability Platform&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Published by Rahim—a backend and DevOps engineer exploring observability-driven architecture with Django, Prometheus, and Rust microservices.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>devops</category>
      <category>django</category>
      <category>backend</category>
    </item>
  </channel>
</rss>
