<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rahim Ranxx</title>
    <description>The latest articles on DEV Community by Rahim Ranxx (@rahim8050).</description>
    <link>https://dev.to/rahim8050</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3744842%2F2195e1a7-7e61-47f7-9c11-41610936958d.jpg</url>
      <title>DEV Community: Rahim Ranxx</title>
      <link>https://dev.to/rahim8050</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rahim8050"/>
    <language>en</language>
    <item>
      <title>Django + Celery + Redis Sentinel: A Real Failover Test (With Metrics)</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sat, 04 Apr 2026 17:36:44 +0000</pubDate>
      <link>https://dev.to/rahim8050/django-celery-redis-sentinel-a-real-failover-test-with-metrics-4ajn</link>
      <guid>https://dev.to/rahim8050/django-celery-redis-sentinel-a-real-failover-test-with-metrics-4ajn</guid>
      <description>&lt;p&gt;Redis Sentinel + Celery Failover: What Actually Happens in Production&lt;/p&gt;

&lt;p&gt;Most tutorials on Redis Sentinel stop at “it elects a new master”.&lt;br&gt;
Very few show what happens to a real system under failover pressure.&lt;/p&gt;

&lt;p&gt;I ran a failover drill on a Django + Celery stack backed by Redis Sentinel and Prometheus monitoring.&lt;/p&gt;
&lt;h2&gt;
  
  
  Here’s what actually happened.
&lt;/h2&gt;


&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Architecture Overview&lt;/li&gt;
&lt;li&gt;Sentinel Integration (Django + Celery)&lt;/li&gt;
&lt;li&gt;Observability with Prometheus&lt;/li&gt;
&lt;li&gt;Failover Drill Walkthrough&lt;/li&gt;
&lt;li&gt;Celery Behavior During Failover&lt;/li&gt;
&lt;li&gt;Performance Impact&lt;/li&gt;
&lt;li&gt;Production Readiness Assessment&lt;/li&gt;
&lt;li&gt;How to Reduce Failover Latency&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    Client --&amp;gt; Django
    Django --&amp;gt;|Cache| Sentinel
    Django --&amp;gt;|Tasks| Celery
    Celery --&amp;gt;|Broker| Sentinel
    Celery --&amp;gt;|Result Backend| Sentinel

    Sentinel --&amp;gt; RedisMaster
    Sentinel --&amp;gt; RedisReplica1
    Sentinel --&amp;gt; RedisReplica2

    Prometheus --&amp;gt; RedisExporter
    RedisExporter --&amp;gt; Sentinel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Stack Components
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Django&lt;/strong&gt; → Redis cache via Sentinel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Celery&lt;/strong&gt; → Broker + result backend via Sentinel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redis Sentinel&lt;/strong&gt; → High availability + failover&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus + redis_exporter&lt;/strong&gt; → Monitoring&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Sentinel Integration (Django + Celery)
&lt;/h2&gt;

&lt;p&gt;All services were switched to Sentinel using environment configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;REDIS_ADDR=redis://host.docker.internal:26379
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
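&lt;p&gt;That environment variable resolves to Sentinel-aware Celery settings. A minimal sketch, assuming the Sentinel service is named &lt;code&gt;mymaster&lt;/code&gt; (substitute your own master name and hosts):&lt;/p&gt;

```python
# Sentinel-aware Celery settings sketch. "mymaster" is an assumption;
# replace it with the service name from your sentinel.conf.
SENTINEL_HOST = "host.docker.internal"
SENTINEL_PORT = 26379
MASTER_NAME = "mymaster"

# Broker and result backend both go through Sentinel, not a Redis node.
CELERY_BROKER_URL = f"sentinel://{SENTINEL_HOST}:{SENTINEL_PORT}"
CELERY_BROKER_TRANSPORT_OPTIONS = {"master_name": MASTER_NAME}
CELERY_RESULT_BACKEND = CELERY_BROKER_URL
CELERY_RESULT_BACKEND_TRANSPORT_OPTIONS = {"master_name": MASTER_NAME}
```

&lt;p&gt;The &lt;code&gt;master_name&lt;/code&gt; transport option is what lets Celery ask Sentinel for the current master instead of pinning a host.&lt;/p&gt;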



&lt;p&gt;Validation steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Django cache → successful round-trip&lt;/li&gt;
&lt;li&gt;Celery broker → connected via Sentinel&lt;/li&gt;
&lt;li&gt;Celery result backend → &lt;code&gt;SentinelBackend&lt;/code&gt; initialized&lt;/li&gt;
&lt;li&gt;Test suite passed:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  pytest tests/test_settings_redis_sentinel.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this stage, the system is fully &lt;strong&gt;Sentinel-aware&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability with Prometheus
&lt;/h2&gt;

&lt;p&gt;After pointing &lt;code&gt;redis_exporter&lt;/code&gt; at Sentinel, it exposed these key metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;redis_sentinel_master_status&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;redis_sentinel_master_ok_sentinels&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;redis_sentinel_master_ok_slaves&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;redis_sentinel_masters&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Verification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="n"&gt;redis_instance_info&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;redis_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sentinel"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;tcp_port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"26379"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This confirms monitoring is tracking &lt;strong&gt;cluster state&lt;/strong&gt;, not a single node.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failover Drill Walkthrough
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Initial State
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    Sentinel --&amp;gt;|Master| Redis1["172.20.0.3:6379"]
    Sentinel --&amp;gt; Redis2["Replica"]
    Sentinel --&amp;gt; Redis3["Replica"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prometheus reported:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;master_address&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"172.20.0.3:6379"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Induced Failure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Current master was stopped manually&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Sentinel Election
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    Sentinel --&amp;gt;|New Master| Redis2["172.20.0.2:6379"]
    Sentinel --&amp;gt; Redis3["Replica"]
    Sentinel --&amp;gt; Redis1["Down"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;New master elected on &lt;strong&gt;first poll&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Prometheus updated on next scrape&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Failover was immediate and correct.&lt;/p&gt;




&lt;h2&gt;
  
  
  Celery Behavior During Failover
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Timeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sequenceDiagram
    participant App as Django App
    participant Celery
    participant Sentinel
    participant Redis

    App-&amp;gt;&amp;gt;Celery: Submit Task
    Celery-&amp;gt;&amp;gt;Redis: Send to Master
    Redis--&amp;gt;&amp;gt;Celery: Connection Lost

    Sentinel-&amp;gt;&amp;gt;Sentinel: Elect New Master

    Celery-&amp;gt;&amp;gt;Sentinel: Retry Connection
    Note over Celery: ~54.7s delay

    Celery-&amp;gt;&amp;gt;Redis: Reconnect to New Master
    Redis--&amp;gt;&amp;gt;Celery: OK

    Celery--&amp;gt;&amp;gt;App: Task SUCCESS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Observed Task
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Task ID: &lt;code&gt;9b57ba3b-a707-4c13-9255-d74de411b64b&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Status during failover: &lt;code&gt;PENDING&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Delay: &lt;strong&gt;~54.7 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Final state: &lt;code&gt;SUCCESS&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
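&lt;p&gt;The ~54.7s figure can be reproduced by polling the task state until it leaves &lt;code&gt;PENDING&lt;/code&gt;. A minimal, dependency-free sketch; in practice &lt;code&gt;get_state&lt;/code&gt; would be &lt;code&gt;lambda: AsyncResult(task_id).state&lt;/code&gt;:&lt;/p&gt;

```python
import time

def wait_for_success(get_state, poll_interval=0.5, timeout=120.0):
    """Poll get_state() until it returns "SUCCESS"; return elapsed seconds.

    get_state is any zero-arg callable returning a Celery state string,
    e.g. lambda: AsyncResult(task_id).state.
    """
    start = time.monotonic()
    while True:
        if get_state() == "SUCCESS":
            return time.monotonic() - start
        if time.monotonic() - start > timeout:
            raise TimeoutError("task did not leave PENDING in time")
        time.sleep(poll_interval)
```

&lt;p&gt;Running this against a task submitted just before killing the master gives you the application-level recovery time, which is the number that matters.&lt;/p&gt;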




&lt;h2&gt;
  
  
  Performance Impact
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Normal operation&lt;/td&gt;
&lt;td&gt;Immediate execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;During failover&lt;/td&gt;
&lt;td&gt;~55s delay&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Post-recovery&lt;/td&gt;
&lt;td&gt;Normal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Production Readiness Assessment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Works
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Redis Sentinel failover is reliable&lt;/li&gt;
&lt;li&gt;Prometheus reflects cluster changes correctly&lt;/li&gt;
&lt;li&gt;Django cache survives failover&lt;/li&gt;
&lt;li&gt;No task loss in Celery&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What Needs Attention
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Celery introduces &lt;strong&gt;significant delay during failover&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Reconnection is not instantaneous&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  When This Architecture Is Production-Ready
&lt;/h2&gt;

&lt;p&gt;Use this setup if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tasks are &lt;strong&gt;asynchronous/background&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Eventual completion is acceptable&lt;/li&gt;
&lt;li&gt;Temporary latency spikes are tolerable&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  When This Is Not Enough
&lt;/h2&gt;

&lt;p&gt;Avoid this setup (as-is) if you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time task execution&lt;/li&gt;
&lt;li&gt;Sub-10s failover recovery&lt;/li&gt;
&lt;li&gt;User-facing async operations&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How to Reduce Failover Latency
&lt;/h2&gt;

&lt;p&gt;To push recovery closer to &lt;strong&gt;10–15 seconds&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tune Celery broker retry settings&lt;/li&gt;
&lt;li&gt;Reduce reconnect backoff intervals&lt;/li&gt;
&lt;li&gt;Optimize worker heartbeat and visibility timeout&lt;/li&gt;
&lt;li&gt;Re-run failover drills with timing instrumentation&lt;/li&gt;
&lt;/ul&gt;
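&lt;p&gt;As a starting point, the first three items might look like this in Django settings. The numbers are illustrative, not tuned values from this drill, and &lt;code&gt;mymaster&lt;/code&gt; is an assumed service name:&lt;/p&gt;

```python
# Illustrative Celery tuning to shorten broker-failover recovery.
# These are starting points to re-test under a drill, not drop-in values.
CELERY_BROKER_TRANSPORT_OPTIONS = {
    "master_name": "mymaster",   # assumption: your Sentinel service name
    "retry_on_timeout": True,
    "socket_timeout": 5,          # fail fast instead of hanging on a dead master
    "socket_connect_timeout": 5,
}
# Keep retrying the broker connection indefinitely with Celery's backoff.
CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True
CELERY_BROKER_CONNECTION_MAX_RETRIES = None  # retry forever
```

&lt;p&gt;Lowering the socket timeouts is what cuts most of the dead time: the worker notices the old master is gone sooner and re-queries Sentinel sooner.&lt;/p&gt;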




&lt;h2&gt;
  
  
  Key Takeaway
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Redis Sentinel ensures infrastructure recovery.&lt;br&gt;
Celery determines how fast your system actually resumes work.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sentinel recovery: &lt;strong&gt;instant&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Application recovery: &lt;strong&gt;~55 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That gap is the real engineering challenge.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;If you're using Redis Sentinel with Celery:&lt;/p&gt;

&lt;p&gt;Don’t stop at:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Failover works.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Measure:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How long until my system behaves normally again?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because that’s what production users experience.&lt;/p&gt;




</description>
      <category>devops</category>
      <category>distributedsystems</category>
      <category>django</category>
      <category>redis</category>
    </item>
    <item>
      <title>Escaping Cache Fragmentation: How Misconfigured PHP Workers Flooded My Token System</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sun, 22 Mar 2026 18:02:23 +0000</pubDate>
      <link>https://dev.to/rahim8050/escaping-cache-fragmentation-how-misconfigured-php-workers-flooded-my-token-system-2ijb</link>
      <guid>https://dev.to/rahim8050/escaping-cache-fragmentation-how-misconfigured-php-workers-flooded-my-token-system-2ijb</guid>
      <description>&lt;h2&gt;
  
  
  🚨 The Symptom
&lt;/h2&gt;

&lt;p&gt;I started noticing something strange in my observability stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integration tokens were being minted repeatedly&lt;/li&gt;
&lt;li&gt;My token endpoint showed activity even when no user interaction was happening&lt;/li&gt;
&lt;li&gt;Metrics suggested constant “traffic” to an otherwise idle system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At first glance, it looked like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A security issue&lt;/li&gt;
&lt;li&gt;A rogue client&lt;/li&gt;
&lt;li&gt;Or a broken API consumer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It was none of those.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔍 The Root Cause
&lt;/h2&gt;

&lt;p&gt;The issue came down to a subtle but critical architectural mistake:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;I was using a non-shared cache in a multi-worker environment.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Stack involved:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;PHP-FPM (2 workers)&lt;/li&gt;
&lt;li&gt;APCu (in-memory cache)&lt;/li&gt;
&lt;li&gt;Token-based integration between services&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ⚙️ What Went Wrong
&lt;/h2&gt;

&lt;p&gt;APCu is &lt;strong&gt;process-local&lt;/strong&gt;, not shared.&lt;/p&gt;

&lt;p&gt;That means:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Worker A cache ≠ Worker B cache
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each PHP-FPM worker had its own isolated memory.&lt;/p&gt;




&lt;h2&gt;
  
  
  💥 The Cascade Effect
&lt;/h2&gt;

&lt;p&gt;My token logic was straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;mint_new_token&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But in reality, the system behaved like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Request hits Worker A → token exists → OK&lt;/li&gt;
&lt;li&gt;Next request hits Worker B → cache miss → mint new token&lt;/li&gt;
&lt;li&gt;Repeat across workers → continuous token regeneration&lt;/li&gt;
&lt;/ol&gt;
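&lt;p&gt;The cascade above can be reproduced in a few lines. A toy simulation where each dict stands in for one worker's APCu; the shared-Redis case is the same dict handed to every worker (and the mint count grows with worker count, and again on every TTL expiry or restart):&lt;/p&gt;

```python
def simulate(requests, caches):
    """Round-robin requests across workers; caches[i % n] is worker i's cache.

    Returns the number of tokens minted (i.e. cache misses).
    """
    mints = 0
    for i in range(requests):
        cache = caches[i % len(caches)]
        if "token" not in cache:
            cache["token"] = "fresh-token"  # mint_new_token()
            mints += 1
    return mints

# APCu-style: two isolated per-worker dicts -> each worker mints its own token.
per_worker = simulate(10, [{}, {}])
# Redis-style: both workers see the same dict -> minted exactly once.
store = {}
shared = simulate(10, [store, store])
```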




&lt;h2&gt;
  
  
  📈 Why Observability Looked “Wrong”
&lt;/h2&gt;

&lt;p&gt;From the outside, it looked like traffic was hitting the token endpoint.&lt;/p&gt;

&lt;p&gt;But in reality:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The system was generating its own traffic due to cache inconsistency.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a key lesson:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not all traffic is external&lt;/li&gt;
&lt;li&gt;Some is &lt;strong&gt;emergent behavior from system design&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ✅ The Fix
&lt;/h2&gt;

&lt;p&gt;I switched from APCu to &lt;strong&gt;Redis&lt;/strong&gt;, a shared cache visible to every worker.&lt;/p&gt;

&lt;p&gt;Now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;All workers → same cache → consistent token state
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Result:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Tokens minted once&lt;/li&gt;
&lt;li&gt;Reused across all workers&lt;/li&gt;
&lt;li&gt;Metrics stabilized instantly&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔒 Production Hardening (What I Added Next)
&lt;/h2&gt;

&lt;p&gt;Fixing the cache wasn’t enough — I hardened the system further.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Distributed Locking
&lt;/h3&gt;

&lt;p&gt;To prevent race conditions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if token exists:
    return token

acquire lock
    re-check cache
    mint token if still missing
release lock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  2. TTL Buffering
&lt;/h3&gt;

&lt;p&gt;Avoid edge expiration issues:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cache_ttl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;token_expiry&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;safety_margin&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
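&lt;p&gt;In code, the buffer is one subtraction plus a sanity check; the 60-second default margin is illustrative:&lt;/p&gt;

```python
def cache_ttl(token_expiry_s, safety_margin_s=60):
    """Cache the token for slightly less than its real lifetime,
    so a cached token is never served moments before it expires."""
    ttl = token_expiry_s - safety_margin_s
    if ttl <= 0:
        raise ValueError("safety margin must be smaller than the token lifetime")
    return ttl
```

&lt;p&gt;So a one-hour token is cached for 59 minutes, and short-lived tokens fail loudly instead of producing a zero or negative TTL.&lt;/p&gt;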






&lt;h3&gt;
  
  
  3. Observability Metrics
&lt;/h3&gt;

&lt;p&gt;I added:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;token_cache_hits&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;token_cache_misses&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;token_mint_count&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now anomalies show up immediately.&lt;/p&gt;
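&lt;p&gt;A minimal sketch of those counters plus the derived ratio worth alerting on. In production these would be real Prometheus counters rather than a dict; the dict keeps the sketch dependency-free:&lt;/p&gt;

```python
# Stand-ins for prometheus_client Counters (assumption: your metrics
# layer exposes equivalent increment operations).
METRICS = {"token_cache_hits": 0, "token_cache_misses": 0, "token_mint_count": 0}

def record_lookup(hit):
    if hit:
        METRICS["token_cache_hits"] += 1
    else:
        METRICS["token_cache_misses"] += 1
        METRICS["token_mint_count"] += 1  # in this design, a miss always mints

def mint_ratio():
    """Mints per lookup. Healthy steady state mints rarely; a climbing
    ratio is exactly the fragmentation symptom described above."""
    total = METRICS["token_cache_hits"] + METRICS["token_cache_misses"]
    return METRICS["token_mint_count"] / total if total else 0.0
```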




&lt;h2&gt;
  
  
  🧠 Key Takeaway
&lt;/h2&gt;

&lt;p&gt;This wasn’t just a bug.&lt;/p&gt;

&lt;p&gt;It was a &lt;strong&gt;distributed systems failure mode&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cache locality + multi-worker architecture → inconsistent state → emergent traffic&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  ⚡ Final Insight
&lt;/h2&gt;

&lt;p&gt;If your system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs multiple workers&lt;/li&gt;
&lt;li&gt;Uses in-memory caching&lt;/li&gt;
&lt;li&gt;Relies on shared state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then this rule applies:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;If your cache isn’t shared, your state isn’t real.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🔗 Closing
&lt;/h2&gt;

&lt;p&gt;This issue reinforced something critical in my engineering journey:&lt;/p&gt;

&lt;p&gt;You don’t debug systems by staring at code —&lt;br&gt;
you debug them by understanding &lt;strong&gt;how state flows across boundaries&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;If you're building distributed APIs, token systems, or high-concurrency services —&lt;br&gt;
this is one edge case worth designing for early.&lt;/p&gt;




</description>
      <category>devops</category>
      <category>distributedsystems</category>
      <category>php</category>
      <category>webdev</category>
    </item>
    <item>
      <title>From 80-Second APIs to Sub-Second: Rebuilding a Geospatial Backend with Async Pipelines</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sat, 21 Mar 2026 16:37:10 +0000</pubDate>
      <link>https://dev.to/rahim8050/from-80-second-apis-to-sub-second-rebuilding-a-geospatial-backend-with-async-pipelines-h81</link>
      <guid>https://dev.to/rahim8050/from-80-second-apis-to-sub-second-rebuilding-a-geospatial-backend-with-async-pipelines-h81</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;From 80-Second APIs to Sub-Second: Fixing Latency with Async Pipelines (Django + Celery)&lt;/strong&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;At some point, every backend engineer hits this wall:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The API works perfectly… until it doesn’t.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I hit that wall with a farm analytics endpoint computing NDVI (Normalized Difference Vegetation Index) from satellite imagery. The system was correct, the logic was sound, and the results were accurate.&lt;/p&gt;

&lt;p&gt;But the numbers told a different story:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;P95 latency: 1.25 minutes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s not an API. That’s a blocking compute job pretending to be one.&lt;/p&gt;

&lt;p&gt;This is the story of how I redesigned the system—from a synchronous request-driven model to an asynchronous data pipeline—and brought latency down to &lt;strong&gt;sub-second performance (P95 ≈ 725ms)&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Original Architecture (The Hidden Problem)
&lt;/h2&gt;

&lt;p&gt;At first glance, the system looked clean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Client]
   ↓
[Django API]
   ↓
[STAC API → Satellite Data]
   ↓
[Raster Processing (NDVI)]
   ↓
[Response]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What happened on each request?
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Query satellite imagery via STAC&lt;/li&gt;
&lt;li&gt;Fetch raster bands (Red &amp;amp; NIR) from remote storage&lt;/li&gt;
&lt;li&gt;Process NDVI using rasterio&lt;/li&gt;
&lt;li&gt;Aggregate coverage&lt;/li&gt;
&lt;li&gt;Return result&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Why this seemed fine
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;It worked locally&lt;/li&gt;
&lt;li&gt;It returned correct data&lt;/li&gt;
&lt;li&gt;It followed a “pure API” mindset&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But under the hood:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remote I/O (S3-backed satellite data)&lt;/li&gt;
&lt;li&gt;Heavy raster decoding (JPEG2000)&lt;/li&gt;
&lt;li&gt;Sequential band reads&lt;/li&gt;
&lt;li&gt;Full computation per request&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Breaking Point
&lt;/h2&gt;

&lt;p&gt;Logs told the truth.&lt;/p&gt;

&lt;p&gt;Each request looked like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;STAC request → ~5s
Raster read (B04) → ~5–10s
Raster read (B08) → ~5–10s
Processing → ~5s+
Total → ~80+ seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the key realization:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I wasn’t building an API—I was executing a geospatial compute pipeline on every request.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Core Insight
&lt;/h2&gt;

&lt;p&gt;This is the shift that changes everything:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;APIs should &lt;strong&gt;serve data&lt;/strong&gt;, not &lt;strong&gt;compute it on demand&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The problem wasn’t Python.&lt;br&gt;
The problem wasn’t Django.&lt;br&gt;
The problem was &lt;strong&gt;architecture&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  The New Architecture (Async Pipeline)
&lt;/h2&gt;

&lt;p&gt;I redesigned the system around &lt;strong&gt;asynchronous computation + caching&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;             (Scheduled / Triggered)
                    ↓
             [Celery Worker]
                    ↓
         [NDVI Computation Pipeline]
                    ↓
             [Redis / Database]
                    ↓
[Client] → [Django API] → [Cache Lookup]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key changes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;NDVI computation moved out of the request path&lt;/li&gt;
&lt;li&gt;Results cached in Redis&lt;/li&gt;
&lt;li&gt;Background jobs compute and refresh data&lt;/li&gt;
&lt;li&gt;API returns instantly (no heavy compute)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Diagram 1 — Before vs After
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Before (Request-driven)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request
   ↓
STAC API
   ↓
Raster I/O
   ↓
NDVI Compute
   ↓
Response (80s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  After (Pipeline-driven)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request → Cache → Response (~725ms P95)
              ↓ (miss)
         Async Task
              ↓
       Compute + Store
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Fast API Path (Non-blocking)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.core.cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ndvi.tasks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;compute_farm_state_coverage&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_farm_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;cache_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;farm_state:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cache_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;

    &lt;span class="n"&gt;compute_farm_state_coverage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coverage_pct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  2. Celery Task (Async Compute)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;celery&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;shared_task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.core.cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;

&lt;span class="nd"&gt;@shared_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;autoretry_for&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt; &lt;span class="n"&gt;retry_backoff&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_farm_state_coverage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;coverage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_ndvi_coverage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;farm_state:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coverage_pct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;coverage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ready&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  3. Daily Backfill (Critical)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;celery&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;shared_task&lt;/span&gt;

&lt;span class="nd"&gt;@shared_task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;enqueue_daily_farm_state_coverage&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;farm_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_active_farm_ids&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;farm_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;farm_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;compute_farm_state_coverage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;farm_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
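&lt;p&gt;The backfill task still needs a scheduler entry. A sketch of the Celery beat configuration as it would appear in Django settings, with the once-a-day interval expressed in seconds; the exact schedule is my assumption, not stated above:&lt;/p&gt;

```python
# CELERY_BEAT_SCHEDULE entry for the daily backfill. A crontab()
# schedule pinned to a quiet hour would work equally well here.
CELERY_BEAT_SCHEDULE = {
    "daily-farm-state-backfill": {
        "task": "ndvi.tasks.enqueue_daily_farm_state_coverage",
        "schedule": 24 * 60 * 60.0,  # run once a day
    },
}
```

&lt;p&gt;This is what keeps the cache warm: most requests never trigger the on-miss compute path at all.&lt;/p&gt;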






&lt;h2&gt;
  
  
  Observability (The Real Upgrade)
&lt;/h2&gt;

&lt;p&gt;Metrics added:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task duration&lt;/li&gt;
&lt;li&gt;Task success/failure&lt;/li&gt;
&lt;li&gt;Queue depth&lt;/li&gt;
&lt;/ul&gt;
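
&lt;p&gt;The first two metrics are straightforward to emit from Celery’s signal hooks. A minimal sketch using &lt;code&gt;prometheus_client&lt;/code&gt; (metric names and labels are illustrative, not the project’s actual setup; queue depth usually comes from a separate probe such as Redis &lt;code&gt;LLEN&lt;/code&gt; on the queue key):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

from celery.signals import task_postrun, task_prerun
from prometheus_client import Counter, Histogram

TASK_DURATION = Histogram("celery_task_duration_seconds", "Task duration", ["task"])
TASK_RESULTS = Counter("celery_task_results_total", "Task outcomes", ["task", "status"])

_started = {}

@task_prerun.connect
def on_task_start(task_id=None, task=None, **kwargs):
    _started[task_id] = time.monotonic()

@task_postrun.connect
def on_task_done(task_id=None, task=None, state=None, **kwargs):
    begun = _started.pop(task_id, None)
    if begun is not None:
        TASK_DURATION.labels(task=task.name).observe(time.monotonic() - begun)
    status = "success" if state == "SUCCESS" else "failure"
    TASK_RESULTS.labels(task=task.name, status=status).inc()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;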




&lt;h2&gt;
  
  
  Metrics (Grafana Observations)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  📊 Grafana Screenshots
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Latency Graph&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgrbfnr3v7fcvdop951k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgrbfnr3v7fcvdop951k.png" alt="725ms on farm get endpoint" width="342" height="458"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Before
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;P95 latency: ~1.25 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  After
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;API latency: ~725ms (P95)&lt;/li&gt;
&lt;li&gt;Background tasks: 60–90s&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Before vs After Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;API latency&lt;/td&gt;
&lt;td&gt;1.25 min&lt;/td&gt;
&lt;td&gt;~725 ms (P95)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System type&lt;/td&gt;
&lt;td&gt;Request-driven&lt;/td&gt;
&lt;td&gt;Pipeline-driven&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scalability&lt;/td&gt;
&lt;td&gt;Poor&lt;/td&gt;
&lt;td&gt;Strong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;Improved&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;I stopped treating my API like a calculator and started treating my system like a data pipeline.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s when everything changed.&lt;/p&gt;




</description>
      <category>devops</category>
      <category>performance</category>
      <category>distributedsystems</category>
      <category>backend</category>
    </item>
    <item>
      <title>Designing a One-Way Farm Sync Architecture (Nextcloud Django DRF)</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sun, 15 Mar 2026 07:11:34 +0000</pubDate>
      <link>https://dev.to/rahim8050/designing-a-one-way-farm-sync-architecture-nextcloud-django-drf-5bh3</link>
      <guid>https://dev.to/rahim8050/designing-a-one-way-farm-sync-architecture-nextcloud-django-drf-5bh3</guid>
      <description>&lt;h2&gt;
  
  
  From Nextcloud to Django: Designing a Farm Sync Architecture with DRF
&lt;/h2&gt;

&lt;p&gt;Modern applications rarely live in a single system. As projects grow, different components begin to specialize: one system handles identity and user workflows, while another focuses on computation and domain logic.&lt;/p&gt;

&lt;p&gt;This week I explored a small but interesting distributed architecture problem: &lt;strong&gt;how to synchronize farm data between a Nextcloud application and a Django REST API backend&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The goal was simple in theory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nextcloud provides the user interface.&lt;/li&gt;
&lt;li&gt;Django performs geospatial computation (NDVI, raster processing).&lt;/li&gt;
&lt;li&gt;Farm data must stay consistent between the two systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But as any engineer knows, “simple in theory” is where architecture decisions start to matter.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Initial Problem
&lt;/h2&gt;

&lt;p&gt;In the system I’m building, users manage farms from a Nextcloud application while a Django service handles geospatial workloads.&lt;/p&gt;

&lt;p&gt;The challenge was deciding &lt;strong&gt;where farm data should live&lt;/strong&gt; and &lt;strong&gt;how it should propagate between systems&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Three architectural questions emerged:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Which system is the source of truth?&lt;/li&gt;
&lt;li&gt;How do we synchronize data between services?&lt;/li&gt;
&lt;li&gt;How do we avoid identity conflicts between systems?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These are classic distributed systems questions, even in relatively small projects.&lt;/p&gt;




&lt;h2&gt;
  
  
  Option 1: Bidirectional Sync (The Dangerous Path)
&lt;/h2&gt;

&lt;p&gt;One tempting solution is letting both systems create farms and then syncing them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nextcloud &amp;lt;--&amp;gt; Django
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At first glance this feels flexible. In practice it creates difficult problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;conflicting updates&lt;/li&gt;
&lt;li&gt;race conditions&lt;/li&gt;
&lt;li&gt;reconciliation logic&lt;/li&gt;
&lt;li&gt;versioning requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Large distributed databases solve this with vector clocks and conflict resolution strategies. For most applications, that complexity is unnecessary.&lt;/p&gt;

&lt;p&gt;So I rejected bidirectional replication early.&lt;/p&gt;




&lt;h2&gt;
  
  
  Option 2: API-First Architecture
&lt;/h2&gt;

&lt;p&gt;Instead of replicating farms between databases, Nextcloud simply &lt;strong&gt;delegates creation to the Django API&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When a user creates a farm in Nextcloud:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User
  ↓
Nextcloud UI
  ↓
Nextcloud Controller
  ↓
POST /api/v1/farms/ (Django REST Framework)
  ↓
Django database
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Django becomes the &lt;strong&gt;source of truth for farm data&lt;/strong&gt;, while Nextcloud acts as the interface layer.&lt;/p&gt;

&lt;p&gt;This pattern has several advantages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Immediate Consistency
&lt;/h3&gt;

&lt;p&gt;Since farms are created directly in Django, the backend always has the latest data.&lt;/p&gt;

&lt;p&gt;There is no delayed replication.&lt;/p&gt;

&lt;h3&gt;
  
  
  Clear Ownership
&lt;/h3&gt;

&lt;p&gt;Each system has a defined responsibility:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nextcloud&lt;/strong&gt; – user interface, identity, workflow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Django&lt;/strong&gt; – domain logic, geospatial processing, data storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Clear boundaries reduce architectural complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extensibility
&lt;/h3&gt;

&lt;p&gt;Once Django exposes a clean API, other systems can integrate easily:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;mobile apps&lt;/li&gt;
&lt;li&gt;data pipelines&lt;/li&gt;
&lt;li&gt;satellite processing services&lt;/li&gt;
&lt;li&gt;analytics dashboards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything interacts with the same API.&lt;/p&gt;




&lt;h2&gt;
  
  
  Solving the Cross-System Identity Problem
&lt;/h2&gt;

&lt;p&gt;Once multiple systems talk about the same object, a subtle problem appears:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;identity consistency&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If each system generated its own farm IDs, synchronization would become fragile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nextcloud Farm ID = 12
Django Farm ID = 47
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now every integration requires mapping tables.&lt;/p&gt;

&lt;p&gt;Instead, the architecture introduces a &lt;strong&gt;stable external identifier&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Nextcloud generates a UUID called:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;external_farm_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That UUID is sent to Django whenever a farm is synchronized.&lt;/p&gt;

&lt;p&gt;Conceptually the model looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Farm
 ├─ id (internal database id)
 └─ external_farm_id (shared UUID)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now both systems reference the farm using the same identifier.&lt;/p&gt;
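
&lt;p&gt;In Django, that conceptual model maps onto the regular auto primary key plus an indexed UUID column. A minimal model sketch (field names beyond &lt;code&gt;external_farm_id&lt;/code&gt; are assumptions, and the &lt;code&gt;Meta.app_label&lt;/code&gt; is only there so the snippet stands alone):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import uuid

from django.db import models

class Farm(models.Model):
    # id stays Django's internal auto-increment primary key.
    # The shared UUID is normally supplied by Nextcloud; the default
    # is only a fallback for farms created directly in Django.
    external_farm_id = models.UUIDField(default=uuid.uuid4, unique=True, db_index=True)
    name = models.CharField(max_length=255)

    class Meta:
        app_label = "farms"  # needed only because this sketch lives outside an app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;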

&lt;p&gt;When Nextcloud syncs a farm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /api/v1/farms/sync
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Payload example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"external_farm_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uuid"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"external_user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"nextcloud_uid"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Demo Farm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"bbox"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"centroid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
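
&lt;p&gt;Keyed on that UUID, the sync handler becomes naturally idempotent: re-sending the same payload updates the existing farm instead of creating a duplicate. In a real Django view this is &lt;code&gt;update_or_create&lt;/code&gt; on &lt;code&gt;external_farm_id&lt;/code&gt;; here is the same logic sketched against an in-memory dict so the behavior is visible (the function and field handling are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def sync_farm(store, payload):
    """Idempotent upsert keyed on the shared UUID. Returns (farm, created)."""
    key = payload["external_farm_id"]
    created = key not in store
    farm = store.setdefault(key, {"external_farm_id": key})
    # A repeated sync with the same UUID updates fields instead of duplicating
    farm.update(
        external_user_id=payload["external_user_id"],
        name=payload["name"],
        bbox=payload.get("bbox"),
        centroid=payload.get("centroid"),
    )
    return farm, created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;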



&lt;p&gt;This approach provides several benefits.&lt;/p&gt;

&lt;h3&gt;
  
  
  No ID Collisions
&lt;/h3&gt;

&lt;p&gt;UUIDs are globally unique, preventing conflicts between systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Clean Synchronization
&lt;/h3&gt;

&lt;p&gt;Updates and raster requests can reference farms using the same external identifier.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;/api/v1/farms/{external_farm_id}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Future Integration
&lt;/h3&gt;

&lt;p&gt;If other services appear later (mobile apps, analytics pipelines, satellite processors), they can all reference farms using the same UUID.&lt;/p&gt;

&lt;p&gt;This pattern is common in distributed systems and prevents a large class of synchronization bugs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Identity Translation Between Systems
&lt;/h2&gt;

&lt;p&gt;Another challenge is mapping users between Nextcloud and Django.&lt;/p&gt;

&lt;p&gt;Nextcloud users authenticate normally and then communicate with Django using an &lt;strong&gt;integration token&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Conceptually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nextcloud User
      ↓
Integration Token
      ↓
Django API
      ↓
Farm Sync
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Django service stores farms under a service user while still preserving the original &lt;code&gt;external_user_id&lt;/code&gt; from Nextcloud.&lt;/p&gt;

&lt;p&gt;This keeps authentication simple while preserving user ownership information.&lt;/p&gt;




&lt;h2&gt;
  
  
  Measuring the Integration Layer
&lt;/h2&gt;

&lt;p&gt;During development I analyzed the Nextcloud app using &lt;code&gt;cloc&lt;/code&gt; to understand the size of the integration layer.&lt;/p&gt;

&lt;p&gt;The results showed roughly &lt;strong&gt;11,000 lines of code&lt;/strong&gt;, split mainly between PHP backend logic and JavaScript UI.&lt;/p&gt;

&lt;p&gt;Around this size, architecture decisions begin to matter more than raw implementation.&lt;/p&gt;

&lt;p&gt;Systems become complex enough that &lt;strong&gt;clear service boundaries&lt;/strong&gt; become essential.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;Several practical lessons emerged from this integration work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Avoid bidirectional replication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Two systems writing to the same domain model creates unnecessary complexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Establish a clear source of truth&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this architecture, Django owns farm data while Nextcloud orchestrates the workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Use stable external identifiers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;UUIDs dramatically simplify synchronization across systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Prefer API-first architectures&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;APIs make it easier to expand systems and integrate future services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Keep compute close to the data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since Django handles geospatial processing, storing farms there keeps the compute layer efficient.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;What started as a simple integration between Nextcloud and Django turned into a useful exercise in distributed system design.&lt;/p&gt;

&lt;p&gt;Even relatively small systems benefit from clear service boundaries and stable identity strategies.&lt;/p&gt;

&lt;p&gt;By combining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an &lt;strong&gt;API-first architecture&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;external UUID identifiers&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;and &lt;strong&gt;clear ownership of farm data&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;the system stays simple today while remaining extensible for future services like satellite analytics or mobile farming applications.&lt;/p&gt;

&lt;p&gt;Sometimes good architecture isn’t about complexity at all — it’s about &lt;strong&gt;clarity of responsibility between systems&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>distributedsystems</category>
      <category>api</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Escaping the Sync Trap: How I Slashed Latency by 10x in a Django-Rust API Gateway</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sun, 01 Mar 2026 13:24:13 +0000</pubDate>
      <link>https://dev.to/rahim8050/escaping-the-sync-trap-how-i-slashed-latency-by-10x-in-a-django-rust-api-gateway-323m</link>
      <guid>https://dev.to/rahim8050/escaping-the-sync-trap-how-i-slashed-latency-by-10x-in-a-django-rust-api-gateway-323m</guid>
      <description>&lt;h2&gt;
  
  
  How I diagnosed and eliminated synchronous bottlenecks in a Django-Rust API gateway, migrating to ASGI and pre-warming caches for millisecond responses.
&lt;/h2&gt;

&lt;p&gt;When building a high-performance backend, the standard playbook is well-known: &lt;strong&gt;offload heavy computational tasks to faster microservices (like Rust)&lt;/strong&gt; and implement an aggressive caching strategy.&lt;/p&gt;

&lt;p&gt;Recently, I did exactly that. My architecture is built around a &lt;strong&gt;Django REST Framework gateway&lt;/strong&gt; sitting behind &lt;strong&gt;Caddy&lt;/strong&gt;, heavily monitored with &lt;strong&gt;Prometheus&lt;/strong&gt; and &lt;strong&gt;Grafana&lt;/strong&gt;.  &lt;/p&gt;

&lt;p&gt;But despite the raw speed of Rust and my caching layers, my dashboards were flashing red. Latency was spiking to brutal 10-second flatlines for my most critical endpoints. Worse, my observability itself started failing, creating &lt;em&gt;silent blind spots&lt;/em&gt; exactly when I needed data the most.&lt;/p&gt;

&lt;p&gt;Here is the detective story of how I used telemetry to hunt down synchronous traps, migrate to a non-blocking async architecture, and implement proactive pre-warming to bring response times down to the millisecond range — all while reclaiming &lt;strong&gt;30% of my idle CPU&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: A Gateway and its Heavy Lifters
&lt;/h2&gt;

&lt;p&gt;Before diving into the problem, here is a quick look at my setup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;          ┌──────────────────────────────┐
          │          Nextcloud           │
          │ (Authenticated Client Calls) │
          └──────────────┬───────────────┘
                         │  JWT / API Key
                         ▼
               ┌────────────────────┐
               │  Django Gateway    │
               │ (ASGI, DRF, Caddy) │
               └──────┬─────────────┘
                      │
     ┌────────────────┴────────────────┐
     │                                 │
     ▼                                 ▼
┌───────────────┐              ┌────────────────┐
│ NDVI Service  │              │ Weather Service│
│ (Rust, 8081)  │              │ (Rust, 8090)   │
│  → Postgres   │              │  → MySQL       │
└───────────────┘              └────────────────┘

       ▲
       │ Prometheus &amp;amp; Grafana
       │ (Observability Stack)
       ▼
   System Telemetry + Metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vebc0f2a5ldvbhzfubp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vebc0f2a5ldvbhzfubp.png" alt="The Architecture" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Django&lt;/strong&gt; stays as the public API gateway, accepting requests authenticated via &lt;strong&gt;JWT&lt;/strong&gt; or &lt;strong&gt;API keys&lt;/strong&gt;.&lt;br&gt;
It enforces a shared JSON response envelope containing &lt;code&gt;status&lt;/code&gt;, &lt;code&gt;message&lt;/code&gt;, &lt;code&gt;data&lt;/code&gt;, and &lt;code&gt;errors&lt;/code&gt; to keep all client interactions standardized.&lt;/p&gt;
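
&lt;p&gt;That envelope is worth centralizing in a single helper so every gateway view returns the same shape. A minimal sketch (the helper name is mine, not from the codebase):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def envelope(status, data=None, message="", errors=None):
    """Shared gateway response shape: status, message, data, errors."""
    return {"status": status, "message": message, "data": data, "errors": errors}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;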

&lt;p&gt;Specific traffic routes — namely &lt;code&gt;/api/v1/ndvi&lt;/code&gt; and &lt;code&gt;/api/v1/weather/*&lt;/code&gt; — are forwarded directly to my &lt;strong&gt;Rust&lt;/strong&gt; backends:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🛰 &lt;strong&gt;NDVI microservice&lt;/strong&gt; ingests satellite data into a dedicated &lt;strong&gt;Postgres&lt;/strong&gt; database.&lt;/li&gt;
&lt;li&gt;🌦 &lt;strong&gt;Weather microservice&lt;/strong&gt; relies on a &lt;strong&gt;MySQL&lt;/strong&gt; database and communicates with external providers like &lt;strong&gt;Open-Meteo&lt;/strong&gt; and &lt;strong&gt;NASA POWER&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My &lt;strong&gt;Nextcloud&lt;/strong&gt; instance acts like any other client, presenting either an &lt;code&gt;Authorization: Bearer&lt;/code&gt; token or an &lt;code&gt;X-API-Key&lt;/code&gt;. Django manages this traffic using a specific &lt;code&gt;nextcloud_hmac&lt;/code&gt; throttle configuration before passing the authorized call down to Rust with the original headers intact.&lt;/p&gt;




&lt;h2&gt;
  
  
  The False Cure: The Worker Starvation Anomaly
&lt;/h2&gt;

&lt;p&gt;To protect the system, I implemented aggressive TTL caching (e.g., 1 hour for schema data, 5 minutes for API tokens). However, once I added traffic, my &lt;strong&gt;Grafana&lt;/strong&gt; dashboard revealed a chaotic reality.&lt;/p&gt;

&lt;p&gt;I saw brutal, perfectly flat 8–10 second latency spikes on key endpoints. Crucially, timed exactly with those spikes, my internal &lt;code&gt;/metrics&lt;/code&gt; request rate dropped to zero.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Diagnosis: The Gateway Caching Itself to Death
&lt;/h2&gt;

&lt;p&gt;The telemetry told a story of hidden synchronous bottlenecks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Trigger:&lt;/strong&gt; When a high-traffic endpoint like &lt;code&gt;farm-weather-current/GET&lt;/code&gt; experienced a cache miss, the Django gateway had to fetch fresh data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Trap:&lt;/strong&gt; My Django deployment was running using &lt;strong&gt;standard synchronous workers&lt;/strong&gt;. It called the Rust service, which then called the external weather API (taking 2.2+ seconds).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Impact (Worker Starvation):&lt;/strong&gt; Because the Django worker was synchronous, it &lt;em&gt;blocked entirely&lt;/em&gt; for those 2.2 seconds. All incoming traffic got stuck in a queue.&lt;/li&gt;
&lt;/ol&gt;
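
&lt;p&gt;A quick back-of-the-envelope calculation makes the starvation concrete: with every blocked worker pinned for the full upstream round trip, total throughput is capped at workers divided by upstream latency (the worker count below is an assumption; the 2.2 s figure is from the telemetry above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;workers = 4               # assumed number of synchronous workers
upstream_seconds = 2.2    # measured external weather API latency
ceiling_rps = workers / upstream_seconds
print(round(ceiling_rps, 1))  # ~1.8 requests/second for the entire gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;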




&lt;h3&gt;
  
  
  The Trap: Synchronous Gateway Routing
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.http&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;JsonResponse&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.views&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;View&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;async_weather_proxy_view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://weather-service:8090/api/v1/weather-current&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JsonResponse&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;errors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I had successfully offloaded work to Rust — but my synchronous Django workers completely nullified the speed gains.&lt;/p&gt;




&lt;h2&gt;
  
  
  The First Fix: Embracing the Non-Blocking Gateway
&lt;/h2&gt;

&lt;p&gt;I needed to decouple the speed of the gateway from the speed of the external API calls it was routing.&lt;br&gt;
I migrated the Django deployment from &lt;strong&gt;synchronous workers&lt;/strong&gt; to an &lt;strong&gt;ASGI&lt;/strong&gt; (Asynchronous Server Gateway Interface) setup, allowing my gateway to handle requests asynchronously.&lt;/p&gt;

&lt;p&gt;I rewrote my proxy views to use asynchronous HTTP clients like &lt;strong&gt;httpx&lt;/strong&gt;:&lt;/p&gt;




&lt;h3&gt;
  
  
  The Fix: Asynchronous Non-Blocking Routing
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.http&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;JsonResponse&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;async_weather_proxy_view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# The event loop is freed! Django can serve other requests while waiting
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;rust_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://weather-service:8090/api/v1/weather-current&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JsonResponse&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rust_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;errors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The visual evidence on my dashboards was a massive, instant victory:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; &lt;strong&gt;Observability Restored:&lt;/strong&gt; The metrics scrape line remained unbroken. Django could finally pause a slow weather request, instantly answer the Prometheus scrape, and resume without blocking.&lt;/li&gt;
&lt;li&gt;⚡ &lt;strong&gt;Instant Internal Routing:&lt;/strong&gt; In my initial setup, a simple internal metrics scrape took ~84ms. After the ASGI migration, that duration dropped to &lt;strong&gt;11ms&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Second Fix: Proactive Caching
&lt;/h2&gt;

&lt;p&gt;While the infrastructure was now bulletproof, the end-user experience was still occasionally sluggish.&lt;/p&gt;

&lt;p&gt;With an “on-demand” caching strategy, the very first user to request the weather after a 1-hour cache expiration had to pay the &lt;strong&gt;Cache Miss Penalty&lt;/strong&gt; (waiting ~2.2 seconds for the external API).&lt;/p&gt;

&lt;p&gt;To eliminate this, I &lt;strong&gt;decoupled the data-fetching time from the user-request cycle entirely&lt;/strong&gt;.&lt;br&gt;
I implemented a &lt;strong&gt;Proactive Background Pre-warming&lt;/strong&gt; pattern: a scheduled Celery task runs every 55 minutes, independently fetching the slow data and silently overwriting the cache before it expires.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gk9f4fawwvmmjypt9xe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gk9f4fawwvmmjypt9xe.png" alt="caching" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  The Cache Pre-Warmer (Celery Example)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;celery&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;shared_task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.core.cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="nd"&gt;@shared_task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;pre_warm_weather_cache&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Runs in the background every 55 minutes, shielding the user from the 2.2s wait
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://weather-service:8090/api/v1/weather-current&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather_current_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result? The average latency for critical weather endpoints &lt;strong&gt;plummeted from seconds to milliseconds&lt;/strong&gt; as the hot cache took over the read path for good.&lt;/p&gt;
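&lt;p&gt;The 55-minute cadence in the task comment comes from Celery beat. A minimal sketch of the schedule entry (the task path here is an assumption, not my exact module layout):&lt;/p&gt;

```python
# settings.py -- Celery beat entry (task path "weather.tasks..." is assumed)
CELERY_BEAT_SCHEDULE = {
    "pre-warm-weather-cache": {
        "task": "weather.tasks.pre_warm_weather_cache",
        # 55-minute interval, comfortably inside the 3600 s cache TTL,
        # so the entry is re-warmed before it can expire.
        "schedule": 55 * 60.0,  # seconds; a crontab() schedule also works
    },
}
```

&lt;p&gt;The 5-minute overlap is the whole trick: the cache entry is rewritten before its TTL lapses, so readers never observe a miss.&lt;/p&gt;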




&lt;h2&gt;
  
  
  The Grand Slam: Optimizing the Rust Build Pipeline
&lt;/h2&gt;

&lt;p&gt;The final win of this new architecture came from raw server efficiency.&lt;br&gt;
During deployments, compiling Rust crates (&lt;code&gt;sqlx&lt;/code&gt;, &lt;code&gt;syn&lt;/code&gt;) from scratch pegged my 4-core server at 100% CPU, causing request timeouts that had nothing to do with real load.&lt;/p&gt;

&lt;p&gt;To fix this, I implemented &lt;strong&gt;cargo-chef&lt;/strong&gt; in a &lt;strong&gt;multi-stage Dockerfile&lt;/strong&gt; so Docker caches the heavy dependency build and only my own code is recompiled on each deploy.&lt;/p&gt;




&lt;h3&gt;
  
  
  Multi-stage Dockerfile for the Rust Microservice
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;rust:1.88-slim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;chef&lt;/span&gt;
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; root&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;cargo &lt;span class="nb"&gt;install &lt;/span&gt;cargo-chef
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;chef&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;planner&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;cargo chef prepare &lt;span class="nt"&gt;--recipe-path&lt;/span&gt; recipe.json

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;chef&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=planner /app/recipe.json recipe.json&lt;/span&gt;
&lt;span class="c"&gt;# Docker caches this heavy dependency build!&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;cargo chef cook &lt;span class="nt"&gt;--release&lt;/span&gt; &lt;span class="nt"&gt;--recipe-path&lt;/span&gt; recipe.json
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;cargo build &lt;span class="nt"&gt;--release&lt;/span&gt; &lt;span class="nt"&gt;--bin&lt;/span&gt; weather-service

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;debian:bookworm-slim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;runtime&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; libssl-dev ca-certificates
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /app/target/release/weather-service /usr/local/bin/&lt;/span&gt;
&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8090&lt;/span&gt;
&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["/usr/local/bin/weather-service"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Between ASGI, background caching, and Docker layer caching, my total &lt;strong&gt;Node CPU&lt;/strong&gt; now rests comfortably between &lt;strong&gt;11% and 13%&lt;/strong&gt;.&lt;br&gt;
In effect, I reclaimed roughly &lt;strong&gt;30% of my total server compute capacity&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion: Finding the Next Bottleneck
&lt;/h2&gt;

&lt;p&gt;Building high-performance API gateways is an ongoing journey of shifting bottlenecks.&lt;/p&gt;

&lt;p&gt;By relying strictly on my telemetry, I proved that &lt;strong&gt;synchronous workers nullify microservice speed&lt;/strong&gt;, validated the immense power of &lt;strong&gt;ASGI&lt;/strong&gt;, and eliminated &lt;strong&gt;cache miss penalties&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;With the gateway running unburdened, my dashboards have revealed one final bottleneck — a 6-to-8 second delay on my token generation endpoint.&lt;br&gt;
Because the CPU is mostly idle during those requests, the evidence points to a &lt;strong&gt;database connection pool limitation&lt;/strong&gt; in the Rust service rather than compute.&lt;/p&gt;

&lt;p&gt;And thanks to my new observability baseline, I know exactly &lt;strong&gt;where to strike next&lt;/strong&gt;.&lt;/p&gt;




</description>
      <category>django</category>
      <category>rust</category>
      <category>microservices</category>
      <category>devops</category>
    </item>
    <item>
      <title>Boost Performance by Migrating Django Endpoints to Rust: NDVI &amp; Weather Services (Phase 2 Complete)</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sat, 21 Feb 2026 16:55:43 +0000</pubDate>
      <link>https://dev.to/rahim8050/boost-performance-by-migrating-django-endpoints-to-rust-ndvi-weather-services-phase-2-complete-29a8</link>
      <guid>https://dev.to/rahim8050/boost-performance-by-migrating-django-endpoints-to-rust-ndvi-weather-services-phase-2-complete-29a8</guid>
      <description>&lt;h1&gt;
  
  
  Migrating Django Endpoints to Rust: My NDVI &amp;amp; Weather Services Journey
&lt;/h1&gt;

&lt;p&gt;When I started rethinking my NDVI and weather endpoints, the goal was simple: improve performance, enforce strong auth, and gain full observability. Over the last few weeks, I migrated critical services from Django to Rust, and the process turned out to be an engineering adventure worth sharing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 0 – Contract Freeze: Locking the APIs
&lt;/h2&gt;

&lt;p&gt;Before touching Rust, I froze all NDVI and weather API contracts in Django. This ensured that the front-end and other consumers could continue working without disruptions. Think of it as putting a protective glass over your APIs: nothing moves until Rust is ready to take over.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; Frozen NDVI + weather contracts from Django.&lt;/p&gt;
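&lt;p&gt;A cheap way to enforce a contract freeze like this is a snapshot-style test that fails CI the moment a response shape drifts. A rough sketch, using hypothetical field names rather than the real frozen contract:&lt;/p&gt;

```python
# tests/test_weather_contract.py -- snapshot-style shape check
# (field names below are hypothetical, not the actual frozen contract)
FROZEN_WEATHER_FIELDS = {"temp_c", "precip_mm", "observed_at"}

def check_weather_contract(payload):
    """True only if the payload exposes exactly the frozen field set."""
    return set(payload.keys()) == FROZEN_WEATHER_FIELDS

# A renamed or missing key is caught before any consumer sees it:
assert check_weather_contract({"temp_c": 21.5, "precip_mm": 0.0, "observed_at": "now"})
assert not check_weather_contract({"temperature": 21.5, "precip_mm": 0.0, "observed_at": "now"})
```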

&lt;h2&gt;
  
  
  Phase 1 – Multi-Service Architecture &amp;amp; Shared Auth/Throttle
&lt;/h2&gt;

&lt;p&gt;Next, I set up a Rust workspace with multiple services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NDVI service:&lt;/strong&gt; Handles vegetation index calculations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weather service:&lt;/strong&gt; Will eventually serve weather data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared auth &amp;amp; throttling module:&lt;/strong&gt; Ensures consistent authentication and rate limiting across all services.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This phase established the skeleton for independent Rust microservices while maintaining the same contract as Django.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; Rust workspace, shared auth/throttle, NDVI envelope.&lt;/p&gt;
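&lt;p&gt;That layout maps to a single Cargo workspace with one shared crate. A sketch of the root manifest (the member names are assumptions about my repo layout):&lt;/p&gt;

```toml
# Cargo.toml -- workspace root (member names are illustrative)
[workspace]
resolver = "2"
members = [
    "ndvi-service",
    "weather-service",
    "shared-auth",   # auth + throttling logic reused by both services
]
```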

&lt;h2&gt;
  
  
  Phase 2 – Weather Migration
&lt;/h2&gt;

&lt;p&gt;With the workspace ready, I migrated weather endpoints from Django to Rust. Key steps included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implementing shared authentication and throttling.&lt;/li&gt;
&lt;li&gt;Integrating MySQL connections safely with Rust’s type system.&lt;/li&gt;
&lt;li&gt;Ensuring the endpoints conformed to the frozen contract from Phase 0.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After this phase, all weather requests were fully handled by Rust services, improving throughput and reliability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; Weather endpoints implemented in Rust.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 3 – Gateway Cutover (Planned)
&lt;/h2&gt;

&lt;p&gt;The final phase will transition Django routes to forward requests to Rust microservices. This will include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Canary deployments to avoid downtime.&lt;/li&gt;
&lt;li&gt;Metrics and alerting for observability.&lt;/li&gt;
&lt;li&gt;CI enforcement for Rust formatting, clippy lints, and tests across the workspace.&lt;/li&gt;
&lt;/ul&gt;
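&lt;p&gt;For the canary rollout, the gateway can send a deterministic slice of traffic to the Rust services and keep the rest on the legacy Django views. One common approach is hashing a stable key so each user always lands on the same backend; this helper and the 5% weight are illustrative, not my production values:&lt;/p&gt;

```python
# Deterministic canary split: hash a stable key into 100 buckets.
# (The 5% weight and the user-id keying are assumptions for illustration.)
import hashlib

CANARY_PERCENT = 5  # share of traffic routed to the Rust service

def routes_to_rust(user_id, percent=CANARY_PERCENT):
    """Same user id always maps to the same backend, so sessions never flip."""
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket in range(percent)
```

&lt;p&gt;Because the split is keyed on the user rather than the request, raising the percentage moves whole users between backends instead of interleaving their requests across two implementations.&lt;/p&gt;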

&lt;p&gt;&lt;strong&gt;End state after Phase 3:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Django acts as a gateway, routing NDVI + weather requests to Rust services.&lt;/li&gt;
&lt;li&gt;NDVI is fully served by Rust/Postgres.&lt;/li&gt;
&lt;li&gt;Weather is fully served by Rust/MySQL.&lt;/li&gt;
&lt;li&gt;Shared auth and throttling are enforced in Rust.&lt;/li&gt;
&lt;li&gt;Observability and canary rollouts ensure safe production deployment.&lt;/li&gt;
&lt;li&gt;CI checks formatting, linting, and tests across the workspace.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Contract first:&lt;/strong&gt; Freezing contracts before migration prevents chaos.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared modules are gold:&lt;/strong&gt; Reusing auth and throttling across services eliminates duplication.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rust’s type system and ownership model&lt;/strong&gt; force careful database and network design.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental migration&lt;/strong&gt; avoids “big bang” outages.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Why Rust?
&lt;/h2&gt;

&lt;p&gt;Migrating to Rust allowed me to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Serve high-throughput endpoints with lower latency.&lt;/li&gt;
&lt;li&gt;Reduce runtime errors with compile-time guarantees.&lt;/li&gt;
&lt;li&gt;Scale services independently while sharing critical modules like auth and throttling.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Example: Rust Weather Service (axum + sqlx)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/main.rs&lt;/span&gt;
&lt;span class="nd"&gt;#![deny(clippy::all)]&lt;/span&gt;
&lt;span class="nd"&gt;#![forbid(unsafe_code)]&lt;/span&gt;

&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;axum&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="nn"&gt;routing&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Extension&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;response&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;IntoResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;http&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;serde&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;Deserialize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Serialize&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;sqlx&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;mysql&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;MySqlPoolOptions&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;sync&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;Arc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;tracing_subscriber&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="nn"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;SubscriberExt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;util&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;SubscriberInitExt&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;mod&lt;/span&gt; &lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;mod&lt;/span&gt; &lt;span class="n"&gt;rate_limit&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nd"&gt;#[derive(Clone)]&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;AppState&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;sqlx&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;MySqlPool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;#[derive(Deserialize)]&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;WeatherQuery&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lon&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;i64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;#[derive(Serialize)]&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;WeatherResponse&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;temp_c&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;precip_mm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;#[tokio::main]&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nn"&gt;anyhow&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nn"&gt;tracing_subscriber&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;.with&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;tracing_subscriber&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;EnvFilter&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_default_env&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="nf"&gt;.with&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;tracing_subscriber&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="nf"&gt;.init&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;MySqlPoolOptions&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;.max_connections&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"WEATHER_DATABASE_URL"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Arc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AppState&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/api/v1/weather/point"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_weather_point&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nf"&gt;.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/api/v1/weather/bulk"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post_weather_bulk&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nf"&gt;.layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;AuthLayer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="nf"&gt;.layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;rate_limit&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;RateLimitLayer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;addr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"LISTEN_ADDR"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.unwrap_or_else&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="s"&gt;"0.0.0.0:8080"&lt;/span&gt;&lt;span class="nf"&gt;.into&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
    &lt;span class="nn"&gt;tracing&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nd"&gt;info!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"listening on {}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nn"&gt;axum&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;Server&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;addr&lt;/span&gt;&lt;span class="nf"&gt;.parse&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.serve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="nf"&gt;.into_make_service&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Django Gateway Proxy Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# views/proxy.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.http&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HttpResponse&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;django.conf&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;

&lt;span class="n"&gt;PROXY_TARGET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PROXY_TARGET&lt;/span&gt;  &lt;span class="c1"&gt;# e.g., "http://rust-weather:8080"
&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;proxy_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;upstream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PROXY_TARGET&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-forwarded-for&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;META&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REMOTE_ADDR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;30.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;upstream_resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;body&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;HttpResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;upstream_resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;upstream_resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;upstream_resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transfer-encoding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CI Example (GitHub Actions)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CI&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rust-check&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dtolnay/rust-toolchain-action@v1&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cargo fmt --all -- --check&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cargo clippy --workspace --all-targets -- -D warnings&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cargo test --workspace --all-features&lt;/span&gt;

  &lt;span class="na"&gt;python-check&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-python@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;python-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.11'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;python -m pip install --upgrade pip&lt;/span&gt;
          &lt;span class="s"&gt;pip install -r requirements.txt&lt;/span&gt;
          &lt;span class="s"&gt;pip install ruff mypy bandit&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ruff check .&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mypy .&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bandit -r .&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;This setup ensures a production-ready, highly observable Rust microservices environment while keeping Django as a stable gateway. Phase 3 will finalize the gateway cutover with canary deployment and metrics monitoring.&lt;/p&gt;

</description>
      <category>api</category>
      <category>django</category>
      <category>performance</category>
      <category>rust</category>
    </item>
    <item>
      <title>From Django to Rust Microservices: What Prometheus Taught Me About Backend Performance</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sun, 15 Feb 2026 06:48:42 +0000</pubDate>
      <link>https://dev.to/rahim8050/from-django-to-rust-microservices-what-prometheus-taught-me-about-backend-performance-3lbk</link>
      <guid>https://dev.to/rahim8050/from-django-to-rust-microservices-what-prometheus-taught-me-about-backend-performance-3lbk</guid>
      <description>&lt;p&gt;&lt;strong&gt;Django Performance and Prometheus Observability&lt;/strong&gt;&lt;br&gt;
I operate a stack combining Django REST Framework, Nextcloud integrations, Prometheus for metrics, and Grafana dashboards — all served behind Caddy with strict CI/CD and Dockerized isolation.&lt;/p&gt;

&lt;p&gt;Everything looked stable until my Prometheus metrics told a different story.&lt;/p&gt;

&lt;p&gt;In Grafana, the /prometheus-django-metrics endpoint consistently showed 250 ms latency spikes, while other endpoints like /farm-weather-hourly and /home averaged under 50 ms. Scrape durations varied between 80 ms and 430 ms, even when request rates stayed flat at 0.08 req/s.&lt;/p&gt;

&lt;p&gt;That meant the latency wasn’t due to load — it was intrinsic to Python’s runtime and how Django handled metrics serialization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyxwxx12fdn2qkxmm1n9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyxwxx12fdn2qkxmm1n9.png" alt="resource-hungry endpoints" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1mldcldzd7vxhtrcapg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1mldcldzd7vxhtrcapg.png" alt="Ram usage and cpu usage of the stack" width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Prometheus Exposed Django’s Bottleneck&lt;/strong&gt;&lt;br&gt;
Each Prometheus scrape forces Django to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Lock the Global Interpreter Lock (GIL)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gather live counters and histograms&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Serialize JSON or text payloads&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reallocate memory on every request&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
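&lt;p&gt;The four steps can be sketched with a toy, stdlib-only stand-in for the metrics registry (the class and names here are illustrative, not the real prometheus_client API):&lt;/p&gt;

```python
# Toy stand-in for a metrics registry (illustrative, not prometheus_client).
# Each scrape repeats steps 1-4 above: with the GIL held, it walks every
# counter, serializes a brand-new text payload, and allocates fresh strings.
from collections import defaultdict

class ToyRegistry:
    def __init__(self):
        self._counters = defaultdict(float)

    def inc(self, name, amount=1.0):
        self._counters[name] += amount

    def scrape(self):
        # Gather live counters, then build a new payload on every call.
        lines = [f"{name} {value}" for name, value in sorted(self._counters.items())]
        return "\n".join(lines) + "\n"

registry = ToyRegistry()
registry.inc("http_requests_total", 3)
print(registry.scrape())
```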

&lt;p&gt;Even low-volume systems suffer because this happens repeatedly at fixed intervals. Observability itself became a performance cost.&lt;/p&gt;
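&lt;p&gt;The fixed-interval cost is easy to quantify. Taking the observed 250 ms per scrape and assuming a 15 s scrape interval (an assumption; the article doesn’t state the interval):&lt;/p&gt;

```python
# Back-of-envelope cost of fixed-interval scraping. The 250 ms figure is the
# article's observation; the 15 s interval is an assumed (common) setting.
scrape_interval_s = 15
scrape_cost_s = 0.250

scrapes_per_day = 24 * 3600 / scrape_interval_s
busy_seconds_per_day = scrapes_per_day * scrape_cost_s
print(f"{scrapes_per_day:.0f} scrapes/day -> "
      f"{busy_seconds_per_day / 60:.0f} min of serialization work")
# 5760 scrapes/day -> 24 min of serialization work, even at 0.08 req/s of real traffic
```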

&lt;p&gt;The graphs made it clear: the bottleneck was the runtime, not the code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Migrate Django Microservices to Rust&lt;/strong&gt;&lt;br&gt;
Rust’s asynchronous ecosystem (Tokio / Actix Web) solves these exact issues.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;No GIL: True multi-core concurrency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Predictable latency: Consistent under heavy I/O.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Memory safety: Compile-time guarantees without a garbage collector.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Low overhead I/O: Async networking with minimal allocations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my benchmarks, Rust microservices consistently stay under 40 ms latency, use 30–40% less CPU, and make Prometheus scrape times nearly constant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rust Microservices Architecture with Django and Prometheus&lt;/strong&gt;&lt;br&gt;
The new architecture keeps Django as the orchestrator — managing authentication, APIs, and admin routes — while Rust handles performance-intensive modules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;NDVI raster computation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Weather data transformation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Metrics aggregation&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They communicate via REST or gRPC. Prometheus exports data from both runtimes into unified Grafana dashboards.&lt;br&gt;
Caddy provides HTTPS termination and reverse-proxy routing, maintaining secure observability across the stack.&lt;/p&gt;
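&lt;p&gt;As a sketch of the REST path, the Django side can delegate a heavy computation to a Rust worker with a plain HTTP call (the service URL and payload shape below are hypothetical; only stdlib urllib is used):&lt;/p&gt;

```python
# Sketch of the Django-side REST call into a Rust worker. The endpoint name,
# port, and JSON shape are hypothetical, not the actual services.
import json
import urllib.request

RUST_NDVI_URL = "http://rust-ndvi.internal:9000/ndvi"  # hypothetical endpoint

def compute_ndvi(tile_id, base_url=RUST_NDVI_URL):
    """POST a tile id to the Rust service and return its decoded JSON reply."""
    body = json.dumps({"tile_id": tile_id}).encode()
    req = urllib.request.Request(
        base_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read())
```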

&lt;p&gt;This hybrid model keeps Django’s flexibility while giving me Rust’s efficiency where it matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lessons from Observability&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Metrics are architectural signals, not just health checks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Python’s runtime trade-offs appear first under introspection, not user load.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rust isn’t a replacement for Django — it’s a reinforcement for its weak spots.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Observability drives evolution when used as feedback, not just monitoring.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The Road Ahead&lt;/strong&gt;&lt;br&gt;
My next experiment involves measuring CPU cycles per request across Django and Rust services under sustained Prometheus scrapes. The goal: prove observability-driven performance scaling in production.&lt;/p&gt;
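&lt;p&gt;A portable first approximation of that experiment is comparing CPU seconds against wall seconds per call (a sketch; true cycle counts need perf or hardware counters, and the handler below is a stand-in, not real view code):&lt;/p&gt;

```python
# Approximating per-request CPU cost: process_time() counts CPU seconds,
# perf_counter() counts wall seconds. The handler is a hypothetical stand-in
# for serializing a metrics payload on each scrape.
import time

def cpu_cost(handler, repeat=200):
    cpu0, wall0 = time.process_time(), time.perf_counter()
    for _ in range(repeat):
        handler()
    cpu = (time.process_time() - cpu0) / repeat
    wall = (time.perf_counter() - wall0) / repeat
    return cpu, wall

def fake_metrics_view():
    # Stand-in for building a text exposition payload.
    return "\n".join(f"metric_{i} {i}" for i in range(1000))

cpu_s, wall_s = cpu_cost(fake_metrics_view)
print(f"cpu={cpu_s * 1e6:.1f}us wall={wall_s * 1e6:.1f}us per call")
```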

&lt;p&gt;If your /metrics endpoint is your slowest route, don’t ignore it — that graph might be pointing directly toward your next architectural upgrade.&lt;/p&gt;

&lt;p&gt;Further reading:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Prometheus Documentation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tokio Runtime&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Actix Web Framework&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Grafana Observability Platform&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Published by Rahim—a backend and DevOps engineer exploring observability-driven architecture with Django, Prometheus, and Rust microservices.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>devops</category>
      <category>django</category>
      <category>backend</category>
    </item>
    <item>
      <title>Wired Django, Nextcloud, Grafana, Loki &amp; Prometheus into a secure observability mesh over Tailnet (metrics &amp; logs, dashboards).</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sun, 08 Feb 2026 05:10:03 +0000</pubDate>
      <link>https://dev.to/rahim8050/wired-django-nextcloud-grafana-loki-prometheus-into-a-secure-observability-mesh-over-tailnet-357k</link>
      <guid>https://dev.to/rahim8050/wired-django-nextcloud-grafana-loki-prometheus-into-a-secure-observability-mesh-over-tailnet-357k</guid>
      <description>&lt;p&gt;&lt;strong&gt;Building an Observability Mesh with Grafana, Loki, and Prometheus&lt;/strong&gt;&lt;br&gt;
When multiple backend services start running in isolation, debugging becomes guesswork. My recent sprint was about turning that guesswork into clarity — by wiring up full observability across Django, Nextcloud, Grafana, Loki, and Prometheus.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;&lt;br&gt;
Unify logs and metrics across services in a distributed setup — all communicating over Caddy TLS and my Tailnet domain.&lt;br&gt;
I wanted one dashboard that could tell me everything about my system’s health without SSH-ing into individual servers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;br&gt;
Here’s the high-level design:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk9m4bkh8c2atv6mjpc86.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk9m4bkh8c2atv6mjpc86.png" alt="Architecture flow diagram" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stack Overview&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Prometheus → scrapes metrics from Django and Nextcloud API endpoints&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Loki → ingests logs from both services&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Grafana → visualizes metrics and logs together&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Caddy → reverse proxy with trusted TLS for all endpoints&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tailnet (Tailscale) → private network with identity-based access&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything talks securely — no exposed ports, no unencrypted traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenges&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Grafana showed logs but no metrics&lt;/strong&gt;&lt;br&gt;
Root cause: Prometheus targets weren’t reachable after moving from localhost to tailnet hostnames.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. TLS verification issues in Prometheus&lt;/strong&gt;&lt;br&gt;
Solved by updating Caddy’s certificates and confirming Prometheus scrape configs pointed to HTTPS endpoints.&lt;/p&gt;
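&lt;p&gt;The fix comes down to two standard scrape-config knobs (hostname below is illustrative; the field names are standard Prometheus configuration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;scrape_configs:
  - job_name: "django"
    scheme: https          # scrape over TLS, terminated by Caddy
    metrics_path: /metrics
    static_configs:
      - targets: ["X.tail.ts.net"]
    # tls_config is only needed when the CA isn't in the system trust store;
    # resist insecure_skip_verify: true - fix the certificate chain instead.
&lt;/code&gt;&lt;/pre&gt;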

&lt;p&gt;&lt;strong&gt;3. Cross-service routing&lt;/strong&gt;&lt;br&gt;
Caddy needed to handle routes like /metrics, /api/schema, and /api/* correctly between Django and Nextcloud.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Config Highlights&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here’s a simplified Prometheus scrape config example:&lt;/strong&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;scrape_configs:
  - job_name: "django"
    metrics_path: /metrics
    static_configs:
      - targets: ["X.tail.ts.net:8000"]
  - job_name: "nextcloud"
    metrics_path: /metrics
    static_configs:
      - targets: ["X.tail.ts.net:8080"]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Both routes sit behind Caddy, which handles TLS termination using trusted Tailnet certificates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Results&lt;/strong&gt;&lt;br&gt;
Once Prometheus started scraping successfully, Grafana dashboards came alive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9237zfjex7yqb7j2r35.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9237zfjex7yqb7j2r35.png" alt="grafana example dashboard" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now I can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Correlate logs and metrics per request&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Track uptime and performance trends&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Visualize distributed system behavior across all nodes&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It feels like operating my own mini control plane — distributed, secure, and explainable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next Steps&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Add distributed tracing (OpenTelemetry)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Define Prometheus alert rules for critical endpoints&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automate observability config rollout via CI/CD&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaway&lt;/strong&gt;&lt;br&gt;
Observability isn’t an add-on — it’s the nervous system of your infrastructure.&lt;br&gt;
When your servers start talking, you start listening differently.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>django</category>
      <category>nextcloud</category>
      <category>prometheus</category>
    </item>
    <item>
      <title>CGNAT Escape Plan: Private Access + Browser-Trusted TLS for My Self-Hosted Stack</title>
      <dc:creator>Rahim Ranxx</dc:creator>
      <pubDate>Sun, 01 Feb 2026 08:19:29 +0000</pubDate>
      <link>https://dev.to/rahim8050/cgnat-escape-plan-private-access-browser-trusted-tls-for-my-self-hosted-stack-37h7</link>
      <guid>https://dev.to/rahim8050/cgnat-escape-plan-private-access-browser-trusted-tls-for-my-self-hosted-stack-37h7</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Femmcskdh1yilj59xgfb0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Femmcskdh1yilj59xgfb0.png" alt="Architecture diagram: Phone/Laptop → Tailscale → Reverse Proxy (TLS) → Nextcloud + Django API" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm02qz91mcakdspbx61ee.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm02qz91mcakdspbx61ee.png" alt="Hardening checklist for private access + browser-trusted TLS" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
Most self-hosting stories end at “it works on my laptop.” Mine only felt real when my phone loaded the stack remotely with a clean TLS lock in the browser—no warnings—and my services stayed private by default.&lt;/p&gt;

&lt;p&gt;This is how I made my self-hosted setup (Nextcloud plus a Django REST Framework API) reachable from anywhere without opening public ports, and how I verified that both the network boundary and TLS trust were actually correct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The constraint&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CGNAT is the quiet villain of home labs. You can build a great system locally, but inbound access becomes fragile or impossible. Even when you can port-forward, “expose 443 to the internet” is a great way to collect bots, scans, and stress you didn’t ask for.&lt;/p&gt;

&lt;p&gt;So I set a different goal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Reach my services from my phone and laptop&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keep them off the public internet&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use browser-trusted TLS (real lock icon, no prompts)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Validate it like a production system: boundaries + proof&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Threat model (what I was defending against)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I’m not building a bank, but I am defending against the common stuff:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Random internet scanning and opportunistic attacks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Misconfiguration that accidentally exposes admin panels&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“It’s HTTPS” illusions where the browser still doesn’t trust the cert&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Curl working while browsers fail because of redirects/headers/mixed content&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The strategy: private reachability + controlled entry point + trusted TLS + verification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The design (high-level)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I used a private network overlay—Tailscale—so my devices can reach the services securely without public inbound ports.&lt;/p&gt;

&lt;p&gt;Flow:&lt;/p&gt;

&lt;p&gt;Phone/Laptop&lt;br&gt;
→ private network&lt;br&gt;
→ reverse proxy (TLS termination + routing)&lt;br&gt;
→ Nextcloud + Django API&lt;/p&gt;

&lt;p&gt;For the reverse proxy, either Caddy or Nginx will do; what matters is the role it plays: one controlled front door, consistent routing, and TLS handled properly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What “network was clean” means (not vibes, boundaries)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For me, “clean network” meant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The services are reachable only over the private network (tailnet)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No public IP exposure for app ports&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reverse proxy is the only entry point I maintain&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Everything uses explicit hostnames + stable routing&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s the difference between “it works” and “it’s defensible.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The TLS win (browser-trusted)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I didn’t want “HTTPS” that still nags the browser. I wanted the lock icon and a certificate chain the browser trusts.&lt;/p&gt;

&lt;p&gt;My acceptance criteria:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The hostname matches the certificate (SAN is correct)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The full trust chain is valid (no warnings)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Works on mobile browsers (the harshest judge)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No weird “mixed content” failures when the UI loads assets&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once that was done, the system stopped feeling like a lab toy and started feeling like a real service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proof: how I validated end-to-end&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I like checks that produce receipts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1) Browser trust proof&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From my phone:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The site loads with a lock icon&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Certificate details match the expected hostname and validity&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2) HTTP behavior proof (sanity headers + redirect behavior)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From a trusted client device:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;curl -I https:///... confirms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;expected status codes (200/302)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;redirects aren’t downgrading to HTTP&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;headers look intentional, not accidental&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
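&lt;p&gt;Those curl checks can be scripted with stdlib urllib (the commented-out hostname is a placeholder; the point is asserting status codes and that no redirect downgrades to HTTP):&lt;/p&gt;

```python
# The curl -I checks as a script: fetch headers without following redirects,
# assert the status code, and assert any redirect target stays on HTTPS.
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # surface the 3xx instead of silently following it

def head_check(url):
    opener = urllib.request.build_opener(NoRedirect)
    req = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(req, timeout=5) as resp:
            code, location = resp.status, resp.headers.get("Location")
    except urllib.error.HTTPError as err:  # 3xx/4xx land here with NoRedirect
        code, location = err.code, err.headers.get("Location")
    assert code in (200, 301, 302), f"unexpected status {code}"
    if location:
        assert location.startswith("https://"), f"HTTP downgrade to {location}"
    return code, location

# head_check("https://nextcloud.example-tailnet.ts.net/")  # placeholder host
```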

&lt;p&gt;&lt;strong&gt;3) Boundary proof (private-only reachability)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On the server:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Confirm services bind where you expect (private interfaces / localhost behind proxy)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Confirm the reverse proxy is the only exposed listener you intended&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A simple sanity check is reviewing active listeners (ss -tulpn) and verifying only the expected ports are bound to the expected interfaces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4) Integration proof (UI ↔ API)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The best proof wasn’t curl:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The Nextcloud app successfully calls the Django API&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data renders end-to-end over the private path&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s the “real system” test.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The debugging lesson: curl worked, browser didn’t&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was the most educational part.&lt;/p&gt;

&lt;p&gt;Curl can succeed while browsers fail because browsers enforce:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;stricter TLS validation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;redirect rules and canonical hostnames&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;mixed content blocking&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;security policies (CSP, framing, referrer policy)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So when I saw “works in Termux, blank in browser,” I treated it as a routing/trust signal—not mystery magic. Tightening hostnames, TLS, and consistent proxy routing fixed the mismatch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardening checklist (what I locked down before calling it done)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the part that turns “reachable” into “safe enough to run.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network &amp;amp; exposure&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Services reachable only via private network path&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reverse proxy is the only entry point&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Avoid binding internal services to public interfaces&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;TLS &amp;amp; trust&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Browser-trusted certs, correct hostname coverage&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No HTTP downgrade paths&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Renewals handled automatically (so security doesn’t rot)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;App security hygiene&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Secrets live in environment config (not in repo)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Strong auth on the API endpoints (and no unauth admin surfaces)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consistent logging so failures aren’t invisible&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Practical resilience&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Basic monitoring and dashboards are next (I’m aiming for Grafana + Prometheus)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Backups and a recovery plan (because “it works” is not the same as “it survives”)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What I learned&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Private access is not a compromise—it’s a security strategy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Browser-trusted TLS is a real quality bar. If the browser trusts it, you eliminate a whole class of hidden problems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Verification beats vibes. I now validate: reachability, trust, boundaries, and integration.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Closing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This wasn’t just “getting Nextcloud running.” It was turning a home lab into something closer to production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;private-by-default access&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;browser-trusted TLS&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;one controlled entry point&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;verifiable behavior end-to-end&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s the kind of engineering I want to keep doing: systems that work—and deserve trust.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>security</category>
      <category>tls</category>
      <category>selfhosted</category>
    </item>
  </channel>
</rss>
