Our CI pipelines went from 3 minutes to 15 after moving to Kubernetes. Here's how we fixed it.

Roee Hershko

We're a ~500-developer company with around 800 GitHub repositories. Our CI/CD pipelines work hard.

Before Kubernetes, we had 20 dedicated machines running pipelines — not small ones either. They were fast. Developers pushed code, pipelines kicked off immediately, builds finished in a few minutes. Nobody thought about it. It just worked.

Then we moved our CI agents to Kubernetes.

The migration

The move made sense on paper. We already had Karpenter running, so we had auto-scaling out of the box. No more managing a fixed fleet of machines. Nodes would spin up when pipelines needed them and disappear when they didn't. Pay for what you use. Clean.

And it did work — technically. Karpenter scaled nodes, pods got scheduled, pipelines ran.

But then the complaints started.

The regression

Developers who were used to instant pipelines suddenly had to wait 1-2 minutes just for a pod to get scheduled onto a new node. And that was the good part.

The real problem was what happened next. That pod landed on a completely fresh node. No Docker layer cache. No node_modules. No Go module cache. No build artifacts from previous runs. Nothing.

Pipelines that used to run in 3 minutes were now taking 15.

That's not a minor regression — that's a different experience entirely. Developers who used to push and move on were now pushing, going for coffee, coming back, and checking if the build was done yet.

Management noticed. "You said Kubernetes would cut costs — why is the coffee budget going up?"

(That didn't literally happen, but the sentiment was real. Developer time is far more expensive than compute.)

The dilemma

We were stuck between two bad options:

  1. Keep 20 large machines running 24/7 — defeats the purpose of moving to Kubernetes. We're back to paying for idle compute, just with extra orchestration overhead.
  2. Accept the cold starts — developers waste time on every pipeline run. Multiply 12 extra minutes by hundreds of pipeline runs per day and the math gets ugly fast.

Karpenter is great at what it does — it provisions nodes quickly (~40-50 seconds). But "quickly" still means waiting. And even when the node is ready, it's a blank slate. Every docker pull, every npm install, every go mod download starts from zero. Every single time.

We needed something different.

Thinking outside the box

We sat down and asked a simple question: what if the node was already there, just... asleep?

Cloud providers can start a stopped instance in seconds. The OS is installed, the disk is intact, the network interface is attached. It's not provisioning — it's resuming. And critically, the EBS volume survives. Whatever was on disk before the instance stopped is still there when it starts.

That means Docker layer caches survive. node_modules caches survive. Go module caches survive. Everything persists.
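
To make that concrete, here's roughly what resuming a stopped instance looks like with the AWS SDK for Go v2. This is a sketch, not Stratos's actual code; the instance ID is a placeholder and error handling is minimal:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

func main() {
	ctx := context.Background()

	// Load region and credentials from the usual environment or shared config.
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	// Placeholder ID for a stopped standby instance.
	instanceID := "i-0123456789abcdef0"

	// StartInstances resumes the instance; the EBS root volume, and every
	// cache written to it, is exactly as it was when the instance stopped.
	if _, err := client.StartInstances(ctx, &ec2.StartInstancesInput{
		InstanceIds: []string{instanceID},
	}); err != nil {
		log.Fatal(err)
	}

	// Wait until EC2 reports the instance as running. This typically takes
	// tens of seconds, not the minutes a fresh provision takes.
	waiter := ec2.NewInstanceRunningWaiter(client)
	if err := waiter.Wait(ctx, &ec2.DescribeInstancesInput{
		InstanceIds: []string{instanceID},
	}, 2*time.Minute); err != nil {
		log.Fatal(err)
	}
	log.Printf("instance %s resumed", instanceID)
}
```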

This was the insight, but turning it into something production-ready took work. We went through a lot of research, trial and error, prototypes that kind of worked, edge cases that didn't. How do you pre-warm a node so it's already joined to the cluster when it starts? How do you handle CNI initialization? How do you drain nodes gracefully before stopping them? How do you track which nodes are standby and which are active?

Eventually, all of that became Stratos.

How Stratos works

Stratos is a Kubernetes operator that manages a pool of nodes through a simple lifecycle:

warmup → standby → running → stopping
              ↑                    |
              └────────────────────┘
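
In controller terms, that loop is just a small state machine. Here's a minimal sketch of the phases; the type and constant names here are mine, not necessarily what the operator's actual CRD uses:

```go
package sketch

// NodePhase names the lifecycle states in the diagram above. Illustrative
// names only; the real CRD fields may differ.
type NodePhase string

const (
	PhaseWarmup   NodePhase = "Warmup"   // instance launched, init script running
	PhaseStandby  NodePhase = "Standby"  // instance stopped, disk and caches kept
	PhaseRunning  NodePhase = "Running"  // instance started, pods scheduled on it
	PhaseStopping NodePhase = "Stopping" // node drained, instance being stopped
)

// next is the transition each phase makes once its work is done. Stopping
// loops back to Standby, which is what makes nodes reusable instead of recycled.
func next(p NodePhase) NodePhase {
	switch p {
	case PhaseWarmup, PhaseStopping:
		return PhaseStandby
	case PhaseStandby:
		return PhaseRunning
	case PhaseRunning:
		return PhaseStopping
	default:
		return p
	}
}
```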

Warmup — Stratos launches an instance and runs your initialization script: join the cluster, configure kubelet, pull DaemonSet images, pull whatever other images you configure. When it's done, the instance powers itself off. All the slow work happens here, once, ahead of time.

Standby — The instance is stopped. Compute costs: zero. It only costs EBS storage. The Kubernetes node object still exists. The disk has everything cached from warmup (and from previous runs).

Running — A pipeline triggers, a pod goes Pending, Stratos detects it and starts a standby instance. The node comes back online in ~20 seconds. Kubelet reconnects, CNI is already configured, images are already on disk. The pod gets scheduled.
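
The trigger side boils down to "is anything stuck Pending because there's no capacity?". A rough client-go sketch of that check follows; the wake function is a placeholder for whatever actually starts the standby instance, and a real operator would watch via an informer rather than a one-shot List:

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// watchPending looks for pods that are Pending because no node fits and,
// when it finds one, calls wake to start a standby instance.
func watchPending(ctx context.Context, cs *kubernetes.Clientset, wake func(context.Context) error) error {
	pods, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "status.phase=Pending",
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		for _, cond := range pod.Status.Conditions {
			// The scheduler marks pods it can't place with an Unschedulable
			// PodScheduled condition; that's the signal capacity is needed.
			if cond.Type == corev1.PodScheduled &&
				cond.Status == corev1.ConditionFalse &&
				cond.Reason == corev1.PodReasonUnschedulable {
				return wake(ctx)
			}
		}
	}
	return nil
}
```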

Stopping — When the pipeline finishes and the node empties out, Stratos drains it (respecting PodDisruptionBudgets), stops the instance, and returns it to standby. Not terminated — stopped. The disk, the caches, everything — preserved for the next run.
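
The drain step is what makes this safe. Here's a rough sketch of it using the Eviction API, which is the path that actually respects PodDisruptionBudgets, followed by a stop (not terminate) of the instance. The names and wiring are illustrative, not Stratos's real code:

```go
package sketch

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/service/ec2"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// stopNode evicts every pod on the node, then stops (not terminates) the
// backing instance so the disk and all its caches survive for the next run.
func stopNode(ctx context.Context, cs *kubernetes.Clientset, ec2c *ec2.Client, nodeName, instanceID string) error {
	pods, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		// Evictions go through the API server's disruption checks, so a
		// PodDisruptionBudget can refuse them; a real controller retries.
		ev := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		if err := cs.PolicyV1().Evictions(pod.Namespace).Evict(ctx, ev); err != nil {
			return err
		}
	}
	// Stop, don't terminate: that's the whole trick.
	_, err = ec2c.StopInstances(ctx, &ec2.StopInstancesInput{
		InstanceIds: []string{instanceID},
	})
	return err
}
```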

The key difference from traditional autoscalers: nodes are reused, not recycled. The second pipeline run on a Stratos node is dramatically faster than the first. And every subsequent run benefits from the accumulated cache.

The results

After deploying Stratos:

  • Pod pending to running: ~20 seconds (down from 1-2 minutes with Karpenter)
  • Pipeline duration back to ~3-4 minutes for most builds, because caches persist across runs
  • No idle compute costs — standby nodes cost only EBS storage. A pool of 10 stopped m5.large instances with 50GB gp3 volumes runs about $40/month. Those same instances running idle would be ~$700/month.

Developers stopped complaining. The coffee budget stabilized.

Beyond CI/CD

Once we had Stratos running for CI, we found other uses:

  • LLM model serving — Model images are 10-50GB+. Pre-pulling them during warmup means a standby node starts with the image already on disk. We went from 15+ minute cold starts to under 2 minutes.
  • Scale-to-zero — With ~20-second node startup, true scale-to-zero becomes viable. Pair it with a lightweight ingress proxy that holds requests while a node starts, and you can serve within a 30-second timeout window. No idle replicas needed. A rough sketch of that holding proxy follows below.
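
The proxy really can be small. Assuming a placeholder backend URL, and leaving the actual node wake-up to the operator, something like this holds each request until the backend accepts connections or a 30-second budget runs out:

```go
package main

import (
	"log"
	"net"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

func main() {
	// Placeholder: the in-cluster service that lives on Stratos-managed nodes.
	backend, _ := url.Parse("http://my-model-service:8080")
	proxy := httputil.NewSingleHostReverseProxy(backend)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Hold the request: poll the backend until it accepts connections
		// or the 30-second budget is spent.
		deadline := time.Now().Add(30 * time.Second)
		for {
			conn, err := net.DialTimeout("tcp", backend.Host, 2*time.Second)
			if err == nil {
				conn.Close()
				break
			}
			if time.Now().After(deadline) {
				http.Error(w, "backend did not come up in time", http.StatusGatewayTimeout)
				return
			}
			time.Sleep(1 * time.Second)
		}
		proxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```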

Try it out

Stratos is open source (Apache 2.0) and currently in alpha. It supports AWS EC2 today.

The README has a full quick start with Helm installation and example NodePool configurations.

If you've dealt with cold-start problems on Kubernetes — whether for CI/CD, model serving, or anything else — I'd like to hear about your experience. Issues, feedback, and contributions are all welcome.
