We had seven clusters, sixty developers, and a $40K/month AWS bill no one could explain. Here's the architecture that fixed it — and what we'd do differently.
Three days. That's how long a mid-level engineer waited for a staging environment last year. The request went out in a Tuesday Slack message while a Friday release deadline approached.
Not because we were negligent. Because staging environment provisioning required a senior engineer to manually wire Postgres, Redis, ingress config, RBAC bindings, and namespace allocation — while that same senior engineer was handling an active incident and two other identical requests. The environment was ready Thursday. The feature shipped late.
We had a platform engineering problem. What took us longer to admit was that the obvious solutions were going to make it worse.
## The Bill Nobody Could Explain
Sprawl is insidious because it looks like growth. Namespaces accumulate. Engineers spin up test environments, finish the work, move on. The namespace stays. The Postgres pod stays. The load balancer stays. Nobody deletes things they didn't explicitly create.
When finance flagged a $40K month-over-month spike, we spent a week cross-referencing AWS Cost Explorer with Slack history trying to figure out which team owned what. We couldn't. Cost attribution was aspirational. The actual state of our clusters was known only approximately, by the people who'd been there long enough to remember what they'd provisioned.
Flexera's State of the Cloud 2025 puts industry-wide cloud waste at up to 32% from idle and overprovisioned resources. We were running hotter than that.
The YAML problem compounded everything. Junior engineers couldn't self-serve — every new service needed a senior engineer to write Deployment manifests, configure resource limits, set up HPA, wire RBAC, and identify the right ServiceAccount for private registry access. We'd built an architecture that required senior engineers for routine operations. That's not a staffing problem. That's a design problem.
Measured honestly: 20–35% of our engineering hours were going to infrastructure toil. That's consistent with IDC's research on how developers actually spend their time. It's also roughly 1.5 full-time engineers' worth of work that, in theory, shouldn't require human judgment.
## Why We Didn't Just Use Backstage
We ran a two-month Backstage proof of concept. Here's what we learned.
Backstage is a React application that your team owns. That's the thing nobody says clearly upfront. The plugin ecosystem is real. The software catalog concept is good. But operating Backstage in production means maintaining a React app, a Node backend, a Postgres database, and a plugin integration layer — in addition to the clusters you're trying to simplify. Cortex's analysis of real deployments puts the staffing requirement at 3–12 engineers. For a three-person platform team, that math doesn't work. And Backstage ships with no AI features. Every AI capability is a plugin you build and maintain yourself.
We looked at Humanitec and Port. Both are genuinely capable. Both have a structural problem: your infrastructure state lives in their cloud. Environment definitions, deployment configs, service topology — all stored externally. When we asked both vendors what a migration away would look like, neither gave a satisfying answer. That's not a knock on them — it's the inherent tension of a SaaS IDP. To give you a good product, they need to own your state.
Humanitec's pricing at the time: $2,199/month for five users. We had sixty developers.
## What We Actually Built
The constraint we set: all state lives in the cluster, in standard Kubernetes primitives. No external services storing our data. Migrate away by running `kubectl get`.
Fortem is a Kubernetes Operator with a UI layer. When a developer requests an environment, they create a FortemEnvironment custom resource. The Operator's reconciliation loop provisions the constituent resources — Deployments, Services, PVCs, ConfigMaps, RBAC bindings — and writes status conditions back to the CRD.
```yaml
apiVersion: fortem.dev/v1alpha1
kind: FortemEnvironment
metadata:
  name: feature-payments-v2
  namespace: team-backend
spec:
  template: microservice-stack
  services:
    - name: payments-api
      image: registry.internal/payments:pr-442
    - name: postgres
      preset: postgres-15-small
    - name: redis
      preset: redis-7-ephemeral
  ttl: 72h
```
The spec is declarative and portable. Put it in Git. Apply it with kubectl. The TTL field handles cleanup — when it expires, the Operator tears down the environment and releases the resources. No manual deletion. No orphaned namespaces.
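For a sense of the mechanics, here's a minimal sketch of what that reconciliation loop looks like in controller-runtime terms. This is illustrative, not Fortem's actual source: the fortemv1alpha1 package path, the TTL field type, and provisionStack are all stand-ins.

```go
package controller

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	fortemv1alpha1 "example.com/fortem/api/v1alpha1" // hypothetical API package
)

type EnvironmentReconciler struct {
	client.Client
}

func (r *EnvironmentReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var env fortemv1alpha1.FortemEnvironment
	if err := r.Get(ctx, req.NamespacedName, &env); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err) // deleted; nothing to do
	}

	// TTL cleanup: once the environment outlives spec.ttl, delete it.
	// Owner references on the children make the teardown cascade.
	expiry := env.CreationTimestamp.Add(env.Spec.TTL.Duration)
	if time.Now().After(expiry) {
		return ctrl.Result{}, r.Delete(ctx, &env)
	}

	// Converge the constituent resources: Deployments, Services, PVCs,
	// ConfigMaps, RBAC bindings.
	if err := r.provisionStack(ctx, &env); err != nil {
		return ctrl.Result{}, err
	}

	// Report readiness as a status condition on the CRD.
	meta.SetStatusCondition(&env.Status.Conditions, metav1.Condition{
		Type: "Ready", Status: metav1.ConditionTrue, Reason: "Provisioned",
	})
	if err := r.Status().Update(ctx, &env); err != nil {
		return ctrl.Result{}, err
	}

	// Requeue at expiry so the TTL fires without an external timer.
	return ctrl.Result{RequeueAfter: time.Until(expiry)}, nil
}

func (r *EnvironmentReconciler) provisionStack(ctx context.Context, env *fortemv1alpha1.FortemEnvironment) error {
	// Expand spec.template and spec.services into concrete objects with
	// owner references back to env. Elided.
	return nil
}
```

The requeue-at-expiry detail is what makes a TTL like this reliable: cleanup is driven by the same loop that provisions, so there's no separate janitor process to drift out of sync.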
Three AI integrations sit on top of the Operator:
NL-to-manifest. Engineers describe an environment in plain English and get a FortemEnvironment manifest back, with dry-run preview before anything is applied. This works well for templated environments. It's less reliable for novel configurations — the LLM occasionally generates plausible-looking but invalid resource specs, which the dry-run catches. We treat it as a starting point, not a final answer.
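The preview step doesn't need anything exotic. A server-side dry run asks the API server to run full schema and admission validation without persisting the object; something along these lines (the function is a sketch, not Fortem's API):

```go
package aiops

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// validateGenerated submits an LLM-generated object as a server-side dry
// run: validation and admission run, nothing is persisted. A non-nil error
// is surfaced to the user instead of being applied.
func validateGenerated(ctx context.Context, dyn dynamic.Interface, gvr schema.GroupVersionResource, obj *unstructured.Unstructured) error {
	_, err := dyn.Resource(gvr).
		Namespace(obj.GetNamespace()).
		Create(ctx, obj, metav1.CreateOptions{DryRun: []string{metav1.DryRunAll}})
	return err
}
```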
Idle detection. The Operator tracks inbound traffic and deployment activity per namespace. Zero traffic + zero deploys for 48 hours (configurable) triggers an idle flag. Auto-shutdown or manual review, your choice. The first month caught 23 abandoned environments. A typical idle environment — Postgres, a few services, load balancer — runs $180–250/month. We recovered roughly $4,200/month from that initial pass.
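The idle check itself is deliberately dumb, a predicate over two signals. Roughly (the field names and metric sources here are ours, not Fortem's):

```go
package aiops

import "time"

// activity summarizes what the Operator tracks per namespace.
type activity struct {
	RequestsInWindow int64     // inbound requests, e.g. from ingress metrics
	LastDeploy       time.Time // most recent rollout in the namespace
}

// idle flags an environment with zero traffic and zero deploys for the
// configured window (48h by default).
func (a activity) idle(window time.Duration) bool {
	return a.RequestsInWindow == 0 && time.Since(a.LastDeploy) > window
}
```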
Incident diagnosis. On crash loop or unexpected HPA trigger, the Operator aggregates recent logs, events, and resource metrics into a structured prompt and runs it through the configured LLM. Output is a root cause summary and a suggested fix. It's correct often enough to cut mean-time-to-understand significantly — not correct enough to act on without review.
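The structured prompt is mostly careful plumbing. A sketch of the shape (the section labels and wording are ours):

```go
package aiops

import (
	"fmt"
	"strings"
)

// buildDiagnosticPrompt packs the aggregated context into one prompt for
// the configured LLM. The exact layout matters less than clearly
// delimiting each source.
func buildDiagnosticPrompt(logs, events, metrics string) string {
	var b strings.Builder
	b.WriteString("A workload is crash-looping or scaling unexpectedly. ")
	b.WriteString("Give the most likely root cause and a suggested fix.\n\n")
	for _, s := range []struct{ title, body string }{
		{"Recent logs", logs},
		{"Kubernetes events", events},
		{"Resource metrics", metrics},
	} {
		fmt.Fprintf(&b, "## %s\n%s\n\n", s.title, s.body)
	}
	return b.String()
}
```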
Install is a single Helm chart that runs entirely inside your cluster:
```bash
helm install fortem fortem/fortem \
  --namespace fortem-system \
  --create-namespace \
  --set ai.provider=anthropic \
  --set ai.apiKey=$ANTHROPIC_API_KEY
```
No egress requirements beyond your LLM provider. No Fortem infrastructure touches your data.
Migrating away: `kubectl get fortemenv -A -o yaml > environments.yaml`. The underlying resources are all native K8s objects. They exist independently of Fortem. The migration path is real because we tested it — we ran the export against a staging cluster before committing to the architecture.
## What Actually Changed
Environment provisioning: 2–3 days to under 8 minutes. This is the number that gets cited, and it's accurate, but it understates the change. The bigger shift is that provisioning no longer requires senior engineer involvement. Junior engineers self-serve. The senior engineers work on things that need senior judgment.
Cloud spend: down 55% from the baseline we measured at the start of the idle detection project. The idle environment reclamation accounts for most of it. Right-sizing recommendations from the AI layer account for the rest.
Cost attribution: automatic. Every FortemEnvironment carries team and namespace labels that flow through to cost metering. The monthly finance conversation is now a dashboard, not a spreadsheet archaeology project.
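The labels themselves are nothing clever; the point is that the Operator stamps them on every child object it creates, so nothing escapes attribution. A sketch (the label keys are illustrative):

```go
package controller

import fortemv1alpha1 "example.com/fortem/api/v1alpha1" // hypothetical API package

// attributionLabels is applied to every object the Operator creates for an
// environment, so cost metering can group spend by environment and team.
func attributionLabels(env *fortemv1alpha1.FortemEnvironment) map[string]string {
	return map[string]string{
		"fortem.dev/environment": env.Name,
		"fortem.dev/team":        env.Namespace, // namespaces are team-scoped here
	}
}
```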
What didn't get better: the Operator model trades one kind of complexity for another. You're maintaining CRD schemas, managing controller health, and debugging reconciliation loops when the Operator gets into a bad state. We've had three incidents where the Operator's reconciler got stuck on a malformed resource and stopped processing the queue. That's recoverable, but it requires understanding the Operator internals. The abstraction has a floor.
## If You Want to Try It
Community tier is free — one cluster, three environments, basic AIOps. The docs walk through a working environment in about 20 minutes on an existing cluster.
The engineer who sent that Tuesday Slack message hasn't waited more than 10 minutes for an environment since we shipped this. That outcome isn't because we built something clever. It's because environment provisioning is now a reconciliation loop — deterministic, auditable, and not dependent on a senior engineer being available.