From Feature Delivery to Platform Engineering
Most engineering articles focus on building a new feature.
The reality of production systems is different.
Adding a feature is often the easiest part.
The difficult part is preserving compatibility across asynchronous workloads, external integrations, observability pipelines, CI gates, OpenAPI contracts, and years of accumulated assumptions.
This week, I wasn't simply implementing NDWI.
I was evolving a farm intelligence platform that combines Earth Observation, distributed task execution, observability, and a Nextcloud-based user experience.
The goal sounded straightforward:
Bring NDWI (Normalized Difference Water Index) to feature parity with NDVI.
The actual work touched nearly every layer of the stack.
The Problem: Feature Duplication Becomes Technical Debt
Our existing NDVI implementation already supported:
- Multiple processing backends,
- Celery orchestration,
- Prometheus metrics,
- OpenAPI exposure,
- Nextcloud integration,
- Dashboarding,
- Automated tests.
The temptation was obvious:
Copy the NDVI implementation.
Rename everything.
Ship.
That approach works exactly once.
Every duplicated branch becomes future maintenance debt.
Every additional spectral index adds another full copy of that surface to operate and maintain.
I wanted NDWI to become the second index without making the third index harder still.
Designing for the Next Spectral Index
The first step was eliminating branching logic.
The original engine dispatch evolved toward multiple conditional paths:
if index == "NDVI":
    ...
elif index == "NDWI":
    ...
That pattern does not scale.
Instead, dispatch moved to a factory lookup.
# NDVI keeps its original engine keys; NDWI keys carry an "ndwi_" prefix.
factory_key = (
    engine
    if index == "NDVI"
    else f"ndwi_{engine}"
)
factory = ENGINE_FACTORIES[factory_key]
Five NDWI engine factories were introduced:
- ndwi_gee
- ndwi_sentinelhub
- ndwi_stac
- ndwi_landsat
- ndwi_modis
The result wasn't just NDWI support.
It turned the platform into one that can support future indices through convention rather than branching.
Adding another index stopped being an architectural event.
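Here is a minimal sketch of how that convention could extend to a third index. It assumes every index except NDVI registers under an `<index>_<engine>` key, while NDVI keeps its bare engine keys. Only `ENGINE_FACTORIES` and the `ndwi_` prefix come from the platform; the builder functions and the `resolve_factory` helper are hypothetical names used for illustration.

```python
from typing import Callable, Dict


# Hypothetical builders standing in for the real engine factories.
def build_ndvi_stac() -> str:
    return "ndvi via stac"


def build_ndwi_stac() -> str:
    return "ndwi via stac"


ENGINE_FACTORIES: Dict[str, Callable[[], str]] = {
    "stac": build_ndvi_stac,       # NDVI keeps its original, unprefixed key
    "ndwi_stac": build_ndwi_stac,  # other indices use "<index>_<engine>"
}


def resolve_factory(index: str, engine: str) -> Callable[[], str]:
    """Look up a factory by naming convention instead of per-index branches."""
    key = engine if index == "NDVI" else f"{index.lower()}_{engine}"
    try:
        return ENGINE_FACTORIES[key]
    except KeyError:
        raise ValueError(f"no engine factory registered for {key!r}") from None
```

Under that assumption, a third index only needs new registry entries. The lookup itself does not change.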
Separate Tasks, Shared Internals
A common anti-pattern in Celery systems is task duplication.
Two almost-identical tasks drift apart over time.
I wanted operational separation without implementation divergence.
Instead of copying logic, run_ndwi_job(...) delegates into the existing NDVI execution pipeline.
This produced an interesting balance.
NDWI gained:
- Independent retry policies,
- Dedicated queue routing,
- Separate monitoring visibility,
- Future scheduling flexibility.
Without duplicating computation logic.
Operational isolation.
Implementation reuse.
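The shape of that delegation can be sketched in plain Python. This is an illustration under assumptions, not the platform's code: `_run_spectral_job`, `run_ndvi_job`, and the argument list are hypothetical, and only the name `run_ndwi_job` appears in the real system.

```python
from typing import Any, Dict


def _run_spectral_job(index: str, farm_id: int) -> Dict[str, Any]:
    """Shared pipeline (hypothetical name): the only place computation lives."""
    return {"index": index, "farm_id": farm_id, "status": "done"}


# In a Celery project, each wrapper below would carry its own task decorator.
# That decorator is where per-task retry settings and queue routing attach,
# so the wrappers can diverge operationally while the pipeline stays shared.
def run_ndvi_job(farm_id: int) -> Dict[str, Any]:
    return _run_spectral_job("NDVI", farm_id)


def run_ndwi_job(farm_id: int) -> Dict[str, Any]:
    return _run_spectral_job("NDWI", farm_id)
```

The wrappers stay a few lines long on purpose. Anything longer is a sign that logic is starting to drift out of the shared pipeline.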
Metrics: Fighting Observability Sprawl
Observability debt accumulates quietly.
The first pass at NDWI introduced six additional Prometheus metrics.
That meant:
- Duplicate Grafana panels,
- Duplicate alert rules,
- Duplicate recording rules.
Instead of expanding metrics, we collapsed them.
Before:
ndwi_requests_total
ndwi_duration_seconds
...
After:
spectral_requests_total{index="NDVI"}
spectral_requests_total{index="NDWI"}
The dashboard no longer cared which index generated the signal.
The index became metadata.
The monitoring surface remained stable.
One of the most valuable lessons in observability is this:
Labels scale better than metric names.
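To make the idea concrete, here is a small stand-in written with the standard library only. It is not the Prometheus client and not the platform's metrics module; `record_request` and `render` are hypothetical helpers that mimic one metric name carrying an `index` label.

```python
from collections import Counter
from typing import List

# Stand-in for a metrics client: one metric name, one series per label value.
_series: Counter = Counter()


def record_request(index: str) -> None:
    """Increment spectral_requests_total for the given index label."""
    _series[("spectral_requests_total", index)] += 1


def render() -> List[str]:
    """Render the series in the style of the Prometheus text format."""
    return [
        f'{name}{{index="{index}"}} {value}'
        for (name, index), value in sorted(_series.items())
    ]
```

A new index adds a new label value, not a new metric name, so a panel or alert written against `spectral_requests_total` keeps working without edits.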
Testing Against Regression, Not Hope
Feature tests prove something works.
Regression tests prove you didn't destroy what already existed.
A dedicated no-regression suite was introduced.
It validated:
- Factory registrations,
- Query isolation,
- Route resolution,
- URL parity,
- Metrics importability,
- Representation consistency,
- Celery routing behavior.
Nineteen tests across seven classes focused entirely on one question:
Did this week's work accidentally break yesterday's guarantees?
Those tests became the contract protecting future contributors from invisible coupling.
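One of those checks, factory registration, might look roughly like the sketch below. The baseline key set and the placeholder registry are assumptions made for the example: it supposes NDVI registers the same five engines that NDWI does, which the real suite may or may not mirror.

```python
import unittest

# Hypothetical snapshot of the NDVI keys that existed before NDWI landed.
BASELINE_NDVI_KEYS = {"gee", "sentinelhub", "stac", "landsat", "modis"}

# Placeholder registry; the real one maps keys to engine factories.
ENGINE_FACTORIES = {key: object() for key in BASELINE_NDVI_KEYS}
ENGINE_FACTORIES.update({f"ndwi_{key}": object() for key in BASELINE_NDVI_KEYS})


class FactoryRegistrationRegression(unittest.TestCase):
    def test_ndvi_keys_survive(self):
        # Yesterday's guarantee: no NDVI registration disappeared.
        self.assertTrue(BASELINE_NDVI_KEYS.issubset(ENGINE_FACTORIES))

    def test_every_engine_has_an_ndwi_twin(self):
        missing = {
            key for key in BASELINE_NDVI_KEYS
            if f"ndwi_{key}" not in ENGINE_FACTORIES
        }
        self.assertEqual(missing, set())
```

The first test is the regression test in the strict sense. It fails only when something that used to exist has gone missing.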
The Farm 29 Incident
The most valuable discovery wasn't code.
It was a 403 error.
Farm 29 exposed a hidden assumption.
Weather endpoints succeeded.
NDVI failed.
NDWI failed.
Initially, it looked like an authentication defect.
The investigation revealed something deeper.
Integration JWTs enforced per-farm access via FarmIntegrationAccess.
Weather bypassed this path.
Spectral endpoints enforced it.
The fix required zero Django changes.
The integration simply lacked authorization for that farm.
This incident reinforced an important operational principle:
Authentication proves identity.
Authorization determines access.
Confusing the two leads to dangerous conclusions.
Production systems teach humility.
The bug is rarely where you first look.
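The distinction maps onto HTTP status codes, which is why a 403 was the clue. The function below is purely illustrative: `spectral_status` and its parameters are hypothetical and do not reflect how `FarmIntegrationAccess` is implemented.

```python
from typing import Set


def spectral_status(token_valid: bool, farm_id: int, granted_farms: Set[int]) -> int:
    """Return the HTTP status a per-farm check would produce (illustrative)."""
    if not token_valid:
        return 401  # authentication failed: the caller's identity is unknown
    if farm_id not in granted_farms:
        return 403  # identity is known, but there is no grant for this farm
    return 200
```

A valid token with no grant for farm 29 yields 403, not 401. Read that way, the status code already pointed at access configuration and away from the token.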
When 102 Radio Stations Became a Concurrency Problem
Not every challenge involved remote sensing.
A radio subsystem health check had become pathological.
Sequential probing meant:
- Only 37 of 102 stations processed,
- 300-second execution time,
- consistent Celery timeouts.
The solution wasn't another timeout tweak.
It was concurrency.
ThreadPoolExecutor replaced sequential execution.
Redirect chasing disappeared.
HTTP 3xx and 405 responses became acceptable health signals.
The numbers before and after deployment:
Before:
- 37/102 stations,
- ~300 seconds,
- frequent failures.
After:
- 102/102 stations,
- ~21 seconds,
- stable execution.
Sometimes resilience isn't sophisticated.
Sometimes it's simply refusing to serialize independent work.
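A minimal sketch of the approach, using only the standard library. The function names, the HEAD request, the five-second timeout, and the worker count are all assumptions for illustration; the production check may differ on every one of them.

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None stops the redirect; urllib then raises HTTPError
        # carrying the 3xx code, so no redirect chain is followed.
        return None


_opener = urllib.request.build_opener(_NoRedirect)


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status of one probe, or None if the host is unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with _opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code
    except OSError:  # includes URLError and socket timeouts
        return None


def is_healthy(status: Optional[int]) -> bool:
    """Treat 2xx, 3xx, and 405 as signs that the station is alive."""
    return status is not None and (200 <= status < 400 or status == 405)


def check_stations(
    urls: Iterable[str],
    probe: Callable[[str], Optional[int]] = http_status,
    max_workers: int = 16,
) -> Dict[str, bool]:
    """Probe all stations concurrently and map each URL to a health flag."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        statuses = pool.map(probe, urls)  # results come back in input order
        return {url: is_healthy(status) for url, status in zip(urls, statuses)}
```

Because the probes are I/O bound and independent, threads are enough here. Total runtime tends toward the slowest batch of probes and away from the sum of all of them.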
CI as an Operational Safety Net
One subtle defect triggered a larger improvement.
Three Celery beat task names drifted away from their actual registrations.
Everything appeared healthy.
Until scheduled execution failed.
Instead of fixing the names and moving on, a CI guardrail emerged.
A validation script now verifies:
- Beat schedules,
- Queue routes,
- Shared task registrations.
The lesson was simple:
Every production incident deserves the question:
"How do we ensure this category of failure never happens again?"
Fixes remove symptoms.
Guardrails remove classes of bugs.
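The beat-schedule part of such a guardrail can be sketched as a pure function. The helper name and the sample data shape are hypothetical. In a Celery project the inputs would most likely come from `app.conf.beat_schedule` and the keys of `app.tasks`, but that wiring is an assumption and is left out here.

```python
from typing import Dict, Iterable, List


def unregistered_beat_tasks(
    beat_schedule: Dict[str, dict],
    registered: Iterable[str],
) -> List[str]:
    """List beat entries whose task name matches no registered task."""
    known = set(registered)
    return sorted(
        f"{entry_name} -> {entry['task']}"
        for entry_name, entry in beat_schedule.items()
        if entry["task"] not in known
    )
```

A CI step can then fail the build whenever the returned list is non-empty, which catches a renamed task before the scheduler ever tries to run it.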
OpenAPI as the Source of Truth
Cross-repository systems drift.
Documentation drifts faster.
The Nextcloud application consumed Django APIs.
Over time, operation identifiers diverged.
The answer wasn't manual synchronization.
The answer was declaring ownership.
Django's schema became authoritative.
The Nextcloud OpenAPI specification synchronized directly from it.
Ninety-six operations were verified.
Fifteen controllers aligned.
The integration contract became explicit.
Contracts reduce assumptions.
Assumptions become outages.
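Drift between two specifications can be detected mechanically. The sketch below assumes both specs are already loaded as dictionaries in standard OpenAPI layout (`paths`, then HTTP method, then `operationId`). The function names are hypothetical and this is not the synchronization tool itself.

```python
from typing import Dict, List, Set

HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def operation_ids(spec: dict) -> Set[str]:
    """Collect every operationId declared under the spec's paths."""
    ids = set()
    for path_item in spec.get("paths", {}).values():
        for method, operation in path_item.items():
            # Skip non-operation keys such as "parameters" or "summary".
            if method in HTTP_METHODS and "operationId" in operation:
                ids.add(operation["operationId"])
    return ids


def drift(authoritative: dict, consumer: dict) -> Dict[str, List[str]]:
    """Report operationIds present on one side but not on the other."""
    source, mirror = operation_ids(authoritative), operation_ids(consumer)
    return {
        "missing": sorted(source - mirror),
        "unexpected": sorted(mirror - source),
    }
```

With one schema declared authoritative, an empty report on both keys is the definition of "in sync".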
What This Week Actually Produced
On paper:
- 93 Django files changed,
- ~2,469 lines added,
- 15 commits,
- 96 operations verified,
- 102 radio stations covered,
- 29 Celery tasks validated.
But the numbers tell only part of the story.
The real outcome was different.
The platform became:
- Easier to extend,
- Easier to observe,
- Harder to accidentally break,
- More explicit in its contracts,
- More resilient under operational stress.
That is the difference between feature development and platform engineering.
The code shipped this week wasn't just NDWI.
It was institutional knowledge encoded into software.
And that compounds over time.