Mateen Anjum

We Didn't Migrate Systems. We Migrated Assumptions: Heroku to EKS at Scale

TL;DR: I'm speaking at Open Source Summit North America 2026 in Minneapolis on Monday, May 18, about moving a fast-growing invoicing SaaS off Heroku onto EKS. This post is the long version of that talk: the three failures that nearly rolled the whole thing back, the open source decisions that saved it, and the honest numbers on what it cost. The one line I keep coming back to: we didn't migrate systems, we migrated assumptions.

If you're at OSS NA, the session is Monday at 5:25pm CDT in Room 200F. If you're not, this is everything I'd tell you over coffee afterward.

The platform

The product was a fast-growing invoicing SaaS. About 2 million active small business merchants, roughly 33 million invoices a year, enterprise clients with contractual SLAs we couldn't afford to miss.

The architecture was already 47 Node.js microservices on Heroku, with SQS for events and Redis for sessions. The engineering team was 10 people, 2 of us on platform.

I want to be precise about the title. The services were already micro. The platform was the monolith. Everything routed through one PaaS that made a lot of decisions for us, quietly, and those decisions were exactly the ones that broke when we left.

What broke at scale

Four things, all at once, all getting worse:

  • API latency sitting at 700ms p99, with no obvious lever to pull because we'd hit the Heroku dyno scaling ceiling.
  • A deploy pipeline that took 45 minutes or more, against enterprise SLAs we kept missing.
  • No container-level observability, so we were guessing.
  • A monthly bill that had quietly crossed the line where it cost more than the value it returned.

We ran the decision honestly. Stay on Heroku and accept the ceiling. Rewrite for serverless and eat the rewrite cost. Move to raw AWS VMs and get cost relief but no velocity. Or move to EKS, the highest-risk, highest-ceiling option. We picked EKS, and we picked it knowing it was the riskiest path on the board.

We failed three times before it worked

This is the part most migration writeups skip. Here's what actually happened.

Attempt 1: the invisible throttle

The PDF generation service went from 800ms p99 to 9 seconds. Dashboards showed 35% CPU. Everything looked fine and nothing was fine.

The Linux CFS scheduler enforces CPU limits in 100ms periods. At a 500m limit, you get 50ms of CPU time per 100ms period. Node.js's libuv spawns 4 worker threads by default, and V8 garbage collection runs on its own threads, so you've got around 6 threads fighting over that 50ms window. A crypto operation that takes 15ms unthrottled stretches to 200ms under contention. Average CPU looked low because the process spent most of its time throttled, not running.

The metric that told the truth was `container_cpu_cfs_throttled_periods_total`, not CPU utilization.

Lesson: a 500m CPU limit isn't a number. It's a 50ms-per-100ms scheduling rule, and Heroku had been hiding that from us by letting dynos burst.
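
For the curious, here's roughly what the two halves of that lesson look like in config: the limit that creates the 50ms-per-100ms rule, and the query that shows how often you pay for it. This is a minimal sketch with illustrative names and numbers, not our actual manifests.

```yaml
# Sketch of a Deployment with the kind of limit that caused the throttling.
# Names, replica counts, and sizes are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pdf-service
spec:
  replicas: 3
  selector:
    matchLabels: { app: pdf-service }
  template:
    metadata:
      labels: { app: pdf-service }
    spec:
      containers:
        - name: pdf-service
          image: example.com/pdf-service:latest   # illustrative image
          resources:
            requests:
              cpu: 500m        # what the scheduler reserves for the pod
              memory: 512Mi
            limits:
              cpu: 500m        # CFS quota: 50ms of CPU per 100ms period,
                               # shared by the event loop, libuv workers, and GC threads
              memory: 512Mi
# The metric that told the truth, as a ratio you can graph or alert on:
#   rate(container_cpu_cfs_throttled_periods_total{container="pdf-service"}[5m])
#     / rate(container_cpu_cfs_periods_total{container="pdf-service"}[5m])
```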

Attempt 2: the DNS amplification tax

Heroku's resolv.conf had options ndots:1. The EKS default is ndots:5. That one number difference turned api.stripe.com, which has 2 dots, into roughly 10 DNS packets per lookup because the resolver walks the search domains before trying the name as-is.

We made about 150,000 Stripe calls a day. That became 1.5 million DNS queries. Across every external integration, around 12 million unnecessary DNS queries a day, and CoreDNS was the thing falling over.

There was a second trap layered on top. The npm ci during the Docker build installed a perfectly valid dependency tree, just not the same one Heroku's slug cache had been running. A drifted agentkeepalive version recycled connections every 15 seconds instead of every 30, which doubled the lookup rate before we'd even noticed the first problem.

Lesson: ndots:5 turns every short hostname into 10x DNS amplification, and your dependency tree can quietly make it worse.
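
The standard mitigations are to lower ndots in the pod spec or to fully qualify external hostnames with a trailing dot (api.stripe.com.). Here's a sketch of the first option; the pod and image names are illustrative.

```yaml
# Sketch: override ndots so a name with 2+ dots (like api.stripe.com) is tried
# as-is instead of being suffixed with the cluster search domains first.
apiVersion: v1
kind: Pod
metadata:
  name: api-worker                 # illustrative
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
  containers:
    - name: api-worker
      image: example.com/api-worker:latest   # illustrative image
```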

Attempt 3: the connection pool death spiral

A Tuesday deploy. Thirty seconds later the connection count had blown through the pool limit, 450 connections against a cap of 400. Sixty seconds in, SIGTERM was being ignored and connections were leaking. Two minutes in, it had exhausted the shared Postgres connections on the Heroku side too, so now both environments were down.

Root cause was one line in a Dockerfile. `CMD npm start` is shell form, which makes PID 1 `/bin/sh`, and `/bin/sh` swallows SIGTERM. The Node process never got the signal, never drained, never shut down cleanly. `CMD ["node", "server.js"]` is exec form, PID 1 is `node`, and the signal arrives.

The fix was three things stacked: PgBouncer in transaction mode to cap real connections around 80, exec-form CMD so SIGTERM lands, and an actual SIGTERM handler that drains gracefully.
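
The handler half of that, sketched in a few lines of Node. This isn't our production code; the point is only that the process that receives SIGTERM has to stop accepting work, drain, and release its connections before it exits.

```js
// Minimal sketch of the drain handler. Assumes the exec-form
// CMD ["node", "server.js"], so this process is PID 1 and actually
// receives the SIGTERM Kubernetes sends before killing the pod.
const http = require('http');
const { Pool } = require('pg');

// DATABASE_URL points at PgBouncer (transaction mode), not Postgres directly,
// so per-pod pools stay small while PgBouncer caps the real connection count.
const pool = new Pool({ connectionString: process.env.DATABASE_URL, max: 10 });

const server = http.createServer(async (req, res) => {
  const { rows } = await pool.query('SELECT 1 AS ok');
  res.end(JSON.stringify(rows[0]));
});
server.listen(3000);

process.on('SIGTERM', () => {
  // Stop accepting new connections, let in-flight requests finish,
  // then release the database connections and exit cleanly.
  server.close(async () => {
    await pool.end();
    process.exit(0);
  });
});
```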

Lesson: PID 1 is a contract. Shell form breaks the contract.

The pattern

The question I had to sit with: why didn't any of our dashboards catch this? CFS, because defaults are invisible. DNS, because amplification is multiplicative, not additive. The connection pool, because PID 1 betrayed us in a way no metric was watching.

That's where the talk's spine comes from. We didn't migrate systems. We migrated assumptions. Every platform hides a different class of failure, and the only safe way through is incremental, observable, reversible.

The four decisions that mattered most

People ask why we didn't just use AWS directly. The answer is that four decisions cost less to make once at the platform layer than to carry per-team forever. All four are open source.

Traffic shifting with Istio. We rejected DNS-based routing and ALB weighted target groups and landed on Istio. Canary in steps: 5%, 25%, 50%, 100%, with rollback as a single config change that takes seconds, no redeploy, no DNS propagation. Istio is heavy. Our adoption was deliberately light, and mTLS came free with the mesh.
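
Roughly what one canary step looks like at the config level. Hosts and service names are illustrative; the important part is that the next step, or the rollback, is nothing more than the weights changing.

```yaml
# Hypothetical Istio config for one canary step (25% to the new version).
# Rolling back is editing the weights back to 100/0: no redeploy, no DNS.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: invoices
spec:
  hosts:
    - invoices.internal.example.com   # illustrative host
  http:
    - route:
        - destination:
            host: invoices            # Kubernetes Service name (illustrative)
            subset: stable
          weight: 75
        - destination:
            host: invoices
            subset: canary
          weight: 25
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: invoices
spec:
  host: invoices
  subsets:
    - name: stable
      labels:
        version: v1
    - name: canary
      labels:
        version: v2
```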

Observe before you migrate. Prometheus with Thanos for long-term cross-cluster metrics, Grafana showing Heroku and EKS side by side on the same panels, Elastic Stack for centralized structured logging. We collected 2 weeks of baseline before moving a single byte. You cannot migrate what you cannot measure.
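
The mechanical trick behind side-by-side panels is just labels: each Prometheus carries an external label saying where it scrapes from, Thanos stitches the series together, and one Grafana query groups by that label. A sketch with illustrative names, not our actual scrape config:

```yaml
# Sketch of the Heroku-side Prometheus config; the EKS-side one carries
# cluster: eks instead. Thanos joins both stores, so a single panel query
# can group by the cluster label and draw both lines together.
global:
  external_labels:
    cluster: heroku
    replica: "0"
scrape_configs:
  - job_name: api
    metrics_path: /metrics
    static_configs:
      - targets: ["api.example.com"]   # illustrative; apps exposing /metrics
# Example panel query (the metric name is illustrative):
#   histogram_quantile(0.99,
#     sum by (le, cluster) (rate(http_request_duration_seconds_bucket{job="api"}[5m])))
```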

PR-driven infrastructure with Atlantis. Open a PR, Atlantis runs terraform plan, the diff lands in the PR comment, you approve and comment atlantis apply, and it executes and audits itself. The on-call engineer at 2am no longer has to wonder who ran apply from their laptop, because nobody does. It also took me out of the critical path as a bottleneck.
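
The repo-side config for that is small. A sketch, with an illustrative project layout; note that apply_requirements also has to be allowed in the server-side Atlantis config.

```yaml
# Sketch of an atlantis.yaml at the root of the infrastructure repo.
version: 3
projects:
  - name: eks-platform              # illustrative project name
    dir: terraform/eks
    autoplan:
      when_modified: ["*.tf", "*.tfvars"]
      enabled: true
    apply_requirements: [approved, mergeable]   # needs allowed_overrides server-side
```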

Deploys are git commits with Flux. HelmRelease resources for declarative deploys, drift detection that auto-corrects the inevitable manual kubectl apply, and within a month everyone was working through git because it was simply easier than not.
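
A sketch of what one of those HelmRelease resources looks like. The chart and repository names are illustrative; the point is that bumping a value in git is the deploy, and Flux reverts anything that drifts away from this file.

```yaml
# Hypothetical Flux HelmRelease: this file in git is the deploy, and Flux
# reconciles the cluster back to it if someone kubectl-applies around it.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: invoices
  namespace: invoices
spec:
  interval: 10m
  chart:
    spec:
      chart: node-service              # illustrative shared chart
      version: "1.x"
      sourceRef:
        kind: HelmRepository
        name: platform-charts
        namespace: flux-system
  values:
    image:
      tag: "2026.05.01-abc123"         # illustrative tag; bumping it is the deploy
```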

The database cutover used dual-write to RDS with checksum-validated continuous replication. When we flipped it, the cutover was anticlimactic. That's exactly what we wanted.
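
For what "checksum-validated" means in practice, here's the shape of the idea, not our actual pipeline: hash the same rows on both sides and compare, continuously, so cutover day is just the day the comparison has been boring for weeks. Table and environment-variable names are made up.

```js
// Sketch of the checksum-validation idea (not the real pipeline): hash the
// same id range on both databases and compare. Assumes each table has an
// integer id column and an identical schema on both sides.
const { Client } = require('pg');

async function rangeChecksum(connectionString, table, from, to) {
  const client = new Client({ connectionString });
  await client.connect();
  // Order the per-row hashes so the aggregate is deterministic on both sides.
  // (In real code, table names come from a fixed allowlist, never from input.)
  const { rows } = await client.query(
    `SELECT md5(string_agg(md5(t::text), '' ORDER BY t.id)) AS checksum
       FROM ${table} AS t WHERE t.id BETWEEN $1 AND $2`,
    [from, to]
  );
  await client.end();
  return rows[0].checksum;
}

async function verify(table, from, to) {
  const [heroku, rds] = await Promise.all([
    rangeChecksum(process.env.HEROKU_DATABASE_URL, table, from, to),
    rangeChecksum(process.env.RDS_DATABASE_URL, table, from, to),
  ]);
  if (heroku !== rds) {
    throw new Error(`checksum mismatch on ${table} [${from}, ${to}]`);
  }
}

// Usage: verify('invoices', 1, 100000).catch((err) => { console.error(err); process.exit(1); });
```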

The results

The headline numbers:

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| API latency (p99) | 700ms | 70ms | down 90% |
| Deploy time | 45 min | 4 min | down 91% |
| Monthly incidents | 12 | 2 | down 83% |
| Deploy frequency | 2/week | 15/day | up 50x |
| Monthly cost | baseline | 60%+ lower | via right-sizing + spot + Karpenter |

I don't like a 90% number with no explanation, so here's where the 630ms went:

  • Routing variance, about 250ms: Istio least-connection routing versus Heroku's effectively random routing.
  • Network topology, around 160ms: pod-to-pod inside the VPC instead of a public path with TLS renegotiation.
  • Resource isolation, about 125ms: CFS throttling going from 65% of periods to under 2%.
  • Connection pooling, the remaining 95ms: PgBouncer transaction mode.

And the part that belongs in every honest migration post: this absorbed 2 platform engineers full-time for 5 months, plus roughly 30% of 8 application engineers' time. Nothing here was free.

Developer experience after the move

Simple for developers means complicated for the platform team, and that's the trade we chose to own. Heroku's superpower was git push heroku main. We weren't going to beat that, so we got close with an internal developer portal built on Backstage. A scaffolder template stands up a new service in about 5 minutes. Kubernetes complexity stayed our problem, not the developers' problem. That's how a 10-person team scaled to 100 on a platform 2 of us maintained.
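
The scaffolder piece is the part worth seeing. This is a trimmed, illustrative Backstage template, not our real one; the action names are the stock Backstage ones, and the skeleton repo and org are placeholders.

```yaml
# Hypothetical Backstage scaffolder template: the "new service in ~5 minutes" path.
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: node-service
  title: New Node.js service
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service details
      required: [name]
      properties:
        name:
          type: string
          description: Short service name
  steps:
    - id: fetch
      action: fetch:template
      input:
        url: ./skeleton              # Dockerfile, Helm values, CI config, HelmRelease
        values:
          name: ${{ parameters.name }}
    - id: publish
      action: publish:github
      input:
        repoUrl: github.com?owner=example-org&repo=${{ parameters.name }}
        defaultBranch: main
    - id: register
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps['publish'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml
```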

What almost stopped us

Istio sidecar injection added about 8 seconds to pod startup until we tuned readiness probe timeouts across every service. Flux reconciliation during peak hours triggered rolling restarts until we scheduled reconciliation windows. cert-manager TLS rotation broke active connections until we added graceful connection draining, which we should have had from day one.
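
For the sidecar-startup one, the tuning was mundane: give the probe enough budget to cover Envoy coming up before the app (Istio's holdApplicationUntilProxyStarts option attacks the same problem from the other direction). An illustrative fragment, not our actual values:

```yaml
# Illustrative container fragment: the ~8s of sidecar startup has to fit
# inside initialDelaySeconds plus failureThreshold x periodSeconds.
containers:
  - name: api
    image: example.com/api:latest   # illustrative image
    readinessProbe:
      httpGet:
        path: /healthz              # illustrative endpoint
        port: 3000
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 6
```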

The migration isn't over; it's a beginning. We're still working on cost-attribution dashboards in Backstage and evaluating Istio's ambient mode.

What we gave back

None of this runs without code other people wrote. So we contributed back: 49 CNCF DevStats contributions in 2026, 22 merged upstream PRs across 14 projects in the last three months, spanning observability, Kubernetes, security, and developer tooling. A cert-manager maintainer's review on one of them, "this is a super cool contribution," is the kind of feedback that makes the loop worth closing.

Open source is the equalizer

Here's the thing I'll close the talk on. A 2-person platform team in Ontario, Canada ran the same infrastructure stack as companies 100 times our size. The team grew from 10 engineers to 100. The service count went from 47 to 47, because the platform absorbed the growth instead of the codebase. The platform team went from 2 people to 2 people.

That's only possible because thousands of contributors built the tools we stand on. Open source is what let a small team in a mid-market company run infrastructure that used to require a department.

Should you migrate?

`git push heroku main` is still the best deploy UX I've ever used, and plenty of companies far bigger than ours still run on Heroku for good reason. Migrate if you have 2 or more platform engineers, steady scaling pressure, some Kubernetes exposure on the team, and a PaaS limit you've actually hit. Don't migrate yet if you're a solo platform owner, your workload is steady-state, nobody has Kubernetes time, or Heroku still meets your needs.

If your team isn't ready for the highest-risk, highest-ceiling option, that's not a failure. That's a correct read of your situation.

Come say hi

If you're at Open Source Summit North America 2026, the talk is Monday, May 18, 5:25pm CDT, Room 200F at the Minneapolis Convention Center. I'll hang around after for the parts that don't fit in 25 minutes, and there are plenty.

Slides and the full list of the 22 merged PRs are at phonotech.ca/ossna26.
