Rahim Ranxx
From 80-Second APIs to Sub-Second: Rebuilding a Geospatial Backend with Async Pipelines


Introduction

At some point, every backend engineer hits this wall:

The API works perfectly… until it doesn’t.

I hit that wall with a farm analytics endpoint computing NDVI (Normalized Difference Vegetation Index) from satellite imagery. The system was correct, the logic was sound, and the results were accurate.

But the numbers told a different story:

P95 latency: 1.25 minutes

That’s not an API. That’s a blocking compute job pretending to be one.

This is the story of how I redesigned the system—from a synchronous request-driven model to an asynchronous data pipeline—and brought latency down to sub-second performance (P95 ≈ 725ms).


The Original Architecture (The Hidden Problem)

At first glance, the system looked clean:

[Client]
   ↓
[Django API]
   ↓
[STAC API → Satellite Data]
   ↓
[Raster Processing (NDVI)]
   ↓
[Response]

What happened on each request?

  1. Query satellite imagery via STAC
  2. Fetch raster bands (Red & NIR) from remote storage
  3. Process NDVI using rasterio
  4. Aggregate coverage
  5. Return result
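The NDVI math in step 3 is simple; the cost was everything around it. Here is a minimal NumPy sketch of that step (the real pipeline reads the bands with rasterio; this just shows the per-pixel formula):

```python
import numpy as np

def ndvi(red: np.ndarray, nir: np.ndarray) -> np.ndarray:
    """NDVI = (NIR - Red) / (NIR + Red), guarding against zero denominators."""
    red = red.astype("float32")
    nir = nir.astype("float32")
    denom = nir + red
    safe = np.where(denom == 0, 1.0, denom)  # avoid divide-by-zero warnings
    return np.where(denom == 0, 0.0, (nir - red) / safe)
```

Values land in [-1, 1]; healthy vegetation typically reads above ~0.3.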

Why this seemed fine

  • It worked locally
  • It returned correct data
  • It followed a “pure API” mindset

But under the hood:

  • Remote I/O (S3-backed satellite data)
  • Heavy raster decoding (JPEG2000)
  • Sequential band reads
  • Full computation per request
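Some of that cost was self-inflicted: the two band reads ran back-to-back. Even within the old design they could have overlapped (a sketch with a thread pool; `read_band` here is a hypothetical stand-in for the remote raster fetch, not a helper from the actual project):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_bands(read_band, item_url: str):
    """Fetch Red (B04) and NIR (B08) concurrently instead of sequentially."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        red_future = pool.submit(read_band, item_url, "B04")
        nir_future = pool.submit(read_band, item_url, "B08")
        return red_future.result(), nir_future.result()
```

That would roughly halve the I/O wait, but it still leaves a multi-second compute job in the request path.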

The Breaking Point

Logs told the truth.

Each request looked like:

STAC request → ~5s
Raster read (B04) → ~5–10s
Raster read (B08) → ~5–10s
Processing → ~5s+
Total → ~80+ seconds

And the key realization:

I wasn’t building an API—I was executing a geospatial compute pipeline on every request.


The Core Insight

This is the shift that changes everything:

APIs should serve data, not compute it on demand.

The problem wasn’t Python.
The problem wasn’t Django.
The problem was architecture.


The New Architecture (Async Pipeline)

I redesigned the system around asynchronous computation + caching:

             (Scheduled / Triggered)
                    ↓
             [Celery Worker]
                    ↓
         [NDVI Computation Pipeline]
                    ↓
             [Redis / Database]
                    ↓
[Client] → [Django API] → [Cache Lookup]

Key changes

  • NDVI computation moved out of the request path
  • Results cached in Redis
  • Background jobs compute and refresh data
  • API returns instantly (no heavy compute)

Diagram 1 — Before vs After

Before (Request-driven)

Request
   ↓
STAC API
   ↓
Raster I/O
   ↓
NDVI Compute
   ↓
Response (80s)

After (Pipeline-driven)

Request → Cache → Response (~725ms P95)
              ↓ (miss)
         Async Task
              ↓
       Compute + Store

Implementation

1. Fast API Path (Non-blocking)

from django.core.cache import cache
from ndvi.tasks import compute_farm_state_coverage

def get_farm_state(farm_id: int) -> dict:
    cache_key = f"farm_state:{farm_id}"

    # Hot path: serve the precomputed result straight from the cache
    data = cache.get(cache_key)
    if data is not None:
        return data

    # Cold path: kick off the compute in the background and
    # return a "processing" placeholder instead of blocking
    compute_farm_state_coverage.delay(farm_id=farm_id)

    return {
        "coverage_pct": None,
        "status": "processing"
    }
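One caveat with this fast path: if several requests miss the cache at once, each one enqueues the same task. A best-effort guard can dedupe them (a sketch; `enqueue_once` is a hypothetical helper, but `cache.add` is Django's real atomic add-if-absent):

```python
def enqueue_once(cache, task, farm_id: int, ttl: int = 300) -> bool:
    """Enqueue the compute task only if nobody else has within `ttl` seconds."""
    lock_key = f"farm_state_lock:{farm_id}"
    # cache.add is atomic: it returns False if the key already exists,
    # so only one concurrent caller wins the race and enqueues the task.
    if cache.add(lock_key, True, timeout=ttl):
        task.delay(farm_id=farm_id)
        return True
    return False
```

The lock expires on its own, so a crashed worker can't wedge a farm in "processing" forever.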

2. Celery Task (Async Compute)

from celery import shared_task
from django.core.cache import cache

@shared_task(bind=True, autoretry_for=(Exception,), retry_backoff=True)
def compute_farm_state_coverage(self, farm_id: int) -> None:
    # All the heavy work (STAC query, raster reads, NDVI aggregation)
    # now happens here, off the request path
    coverage = compute_ndvi_coverage(farm_id)

    cache.set(
        f"farm_state:{farm_id}",
        {
            "coverage_pct": coverage,
            "status": "ready"
        },
        timeout=60 * 60 * 6,  # keep results for 6 hours
    )

3. Daily Backfill (Critical)

from celery import shared_task

@shared_task
def enqueue_daily_farm_state_coverage():
    farm_ids = get_active_farm_ids()

    for farm_id in farm_ids:
        compute_farm_state_coverage.delay(farm_id=farm_id)
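Something has to trigger that backfill on a schedule. With Celery Beat it might look like this (a sketch; the app name and 02:00 UTC run time are my assumptions, not from the project):

```python
from celery import Celery
from celery.schedules import crontab

app = Celery("farm_analytics")

# Run the backfill daily so caches are warm before anyone asks
app.conf.beat_schedule = {
    "daily-farm-state-coverage": {
        "task": "ndvi.tasks.enqueue_daily_farm_state_coverage",
        "schedule": crontab(hour=2, minute=0),
    },
}
```

With the backfill in place, most requests never see a cache miss at all; the "processing" response becomes the rare case, not the norm.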

Observability (The Real Upgrade)

Metrics added:

  • Task duration
  • Task success/failure
  • Queue depth
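Task duration and success/failure can be captured with a small decorator around each task body (a sketch; in production these numbers would feed StatsD or Prometheus for Grafana, not an in-memory dict):

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stand-in for a real metrics backend (StatsD, Prometheus, ...)
METRICS = {
    "duration_s": defaultdict(list),
    "success": defaultdict(int),
    "failure": defaultdict(int),
}

def timed_task(fn):
    """Record per-task duration and outcome around each invocation."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            result = fn(*args, **kwargs)
            METRICS["success"][fn.__name__] += 1
            return result
        except Exception:
            METRICS["failure"][fn.__name__] += 1
            raise
        finally:
            METRICS["duration_s"][fn.__name__].append(time.monotonic() - start)
    return wrapper
```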

Metrics (Grafana Observations)

[Grafana screenshot: latency graph showing ~725ms P95 on the farm GET endpoint]


Before

  • P95 latency: ~1.25 minutes

After

  • API latency: ~725ms (P95)
  • Background tasks: 60–90s

Before vs After Summary

Metric           Before            After
API latency      ~1.25 min (P95)   ~725 ms (P95)
System type      Request-driven    Pipeline-driven
Scalability      Poor              Strong
Observability    Minimal           Improved

Final Thought

I stopped treating my API like a calculator and started treating my system like a data pipeline.

That’s when everything changed.

