DEV Community

Cover image for From Feature Delivery to Platform Engineering: Scaling Earth Observation Pipelines.
Rahim Ranxx
Rahim Ranxx

Posted on

From Feature Delivery to Platform Engineering: Scaling Earth Observation Pipelines.

From Feature Delivery to Platform Engineering

Most engineering articles focus on building a new feature.

The reality of production systems is different.

Adding a feature is often the easiest part.

The difficult part is preserving compatibility across asynchronous workloads, external integrations, observability pipelines, CI gates, OpenAPI contracts, and years of accumulated assumptions.

This week, I wasn't simply implementing NDWI.

I was evolving a farm intelligence platform that combines Earth Observation, distributed task execution, observability, and a Nextcloud-based user experience.

The goal sounded straightforward:

Bring NDWI (Normalized Difference Water Index) to feature parity with NDVI.

The actual work touched nearly every layer of the stack.


The Problem: Feature Duplication Becomes Technical Debt

Our existing NDVI implementation already supported:

  • Multiple processing backends,
  • Celery orchestration,
  • Prometheus metrics,
  • OpenAPI exposure,
  • Nextcloud integration,
  • Dashboarding,
  • Automated tests.

The temptation was obvious:

Copy the NDVI implementation.

Rename everything.

Ship.

That approach works exactly once.

Every duplicated branch becomes future maintenance debt.

Every additional spectral index doubles the operational surface area.

I wanted NDWI to become the second index without making the third index exponentially harder.


Designing for the Next Spectral Index

The first step was eliminating branching logic.

The original engine dispatch evolved toward multiple conditional paths:

if index == "NDVI":
    ...
elif index == "NDWI":
    ...
Enter fullscreen mode Exit fullscreen mode

That pattern does not scale.

Instead, dispatch moved to factory lookup.

factory_key = (
    engine
    if index == "NDVI"
    else f"ndwi_{engine}"
)

factory = ENGINE_FACTORIES[factory_key]
Enter fullscreen mode Exit fullscreen mode

Five NDWI engine factories were introduced:

  • ndwi_gee
  • ndwi_sentinelhub
  • ndwi_stac
  • ndwi_landsat
  • ndwi_modis

The result wasn't just NDWI support.

It transformed the platform into one capable of supporting future indices through convention rather than branching.

Adding another index stopped being an architectural event.


Separate Tasks, Shared Internals

A common anti-pattern in Celery systems is task duplication.

Two almost-identical tasks drift apart over time.

I wanted operational separation without implementation divergence.

Instead of copying logic:

run_ndwi_job(...)
Enter fullscreen mode Exit fullscreen mode

delegates into the existing NDVI execution pipeline.

This produced an interesting balance.

NDWI gained:

  • Independent retry policies,
  • Dedicated queue routing,
  • Separate monitoring visibility,
  • Future scheduling flexibility.

Without duplicating computation logic.

Operational isolation.

Implementation reuse.


Metrics: Fighting Observability Sprawl

Observability debt accumulates quietly.

Originally, NDWI introduced six additional Prometheus metrics.

That meant:

  • Duplicate Grafana panels,
  • Duplicate alert rules,
  • Duplicate recording rules.

Instead of expanding metrics, we collapsed them.

Before:

ndwi_requests_total
ndwi_duration_seconds
...
Enter fullscreen mode Exit fullscreen mode

After:

spectral_requests_total{index="NDVI"}
spectral_requests_total{index="NDWI"}
Enter fullscreen mode Exit fullscreen mode

The dashboard no longer cared which index generated the signal.

The index became metadata.

The monitoring surface remained stable.

One of the most valuable lessons in observability is this:

Labels scale better than metric names.


Testing Against Regression, Not Hope

Feature tests prove something works.

Regression tests prove you didn't destroy what already existed.

A dedicated no-regression suite was introduced.

It validated:

  • Factory registrations,
  • Query isolation,
  • Route resolution,
  • URL parity,
  • Metrics importability,
  • Representation consistency,
  • Celery routing behavior.

Nineteen tests across seven classes focused entirely on one question:

Did this week's work accidentally break yesterday's guarantees?

Those tests became the contract protecting future contributors from invisible coupling.


The Farm 29 Incident

The most valuable discovery wasn't code.

It was a 403 error.

Farm 29 exposed a hidden assumption.

Weather endpoints succeeded.

NDVI failed.

NDWI failed.

Initially, it looked like an authentication defect.

The investigation revealed something deeper.

Integration JWTs enforced per-farm access:

FarmIntegrationAccess
Enter fullscreen mode Exit fullscreen mode

Weather bypassed this path.

Spectral endpoints enforced it.

The fix required zero Django changes.

The integration simply lacked authorization.

This incident reinforced an important operational principle:

Authentication proves identity.

Authorization determines access.

Confusing the two leads to dangerous conclusions.

Production systems teach humility.

The bug is rarely where you first look.


When 102 Radio Stations Became a Concurrency Problem

Not every challenge involved remote sensing.

A radio subsystem health check had become pathological.

Sequential probing meant:

  • 37 stations processed,
  • 300-second execution time,
  • consistent Celery timeouts.

The solution wasn't another timeout tweak.

It was concurrency.

ThreadPoolExecutor replaced sequential execution.

Redirect chasing disappeared.

HTTP 3xx and 405 responses became acceptable health signals.

After deployment:

Before:

  • 37/102 stations,
  • ~300 seconds,
  • frequent failures.

After:

  • 102/102 stations,
  • ~21 seconds,
  • stable execution.

Sometimes resilience isn't sophisticated.

Sometimes it's simply refusing to serialize independent work.


CI as an Operational Safety Net

One subtle defect triggered a larger improvement.

Three Celery beat task names drifted away from their actual registrations.

Everything appeared healthy.

Until scheduled execution failed.

Instead of fixing the names and moving on, a CI guardrail emerged.

A validation script now verifies:

  • Beat schedules,
  • Queue routes,
  • Shared task registrations.

The lesson was simple:

Every production incident deserves the question:

"How do we ensure this category of failure never happens again?"

Fixes remove symptoms.

Guardrails remove classes of bugs.


OpenAPI as the Source of Truth

Cross-repository systems drift.

Documentation drifts faster.

The Nextcloud application consumed Django APIs.

Over time, operation identifiers diverged.

The answer wasn't manual synchronization.

The answer was declaring ownership.

Django's schema became authoritative.

The Nextcloud OpenAPI specification synchronized directly from it.

Ninety-six operations were verified.

Fifteen controllers aligned.

The integration contract became explicit.

Contracts reduce assumptions.

Assumptions become outages.


What This Week Actually Produced

On paper:

  • 93 Django files changed,
  • ~2,469 lines added,
  • 15 commits,
  • 96 operations verified,
  • 102 radio stations covered,
  • 29 Celery tasks validated.

But the numbers tell only part of the story.

The real outcome was different.

The platform became:

  • Easier to extend,
  • Easier to observe,
  • Harder to accidentally break,
  • More explicit in its contracts,
  • More resilient under operational stress.

That is the difference between feature development and platform engineering.

The code shipped this week wasn't just NDWI.

It was institutional knowledge encoded into software.

And that compounds over time.

Top comments (0)