# Our Kubernetes Cluster Was Burning $18K/Month. I Replaced It With 3 Bare Metal Servers.

## The Bill That Stopped Everything

$18,247. Last month. 90% idle.

Three people on rotation. Four applications. Oversized by 2.5x on its best day. Istio sidecars still running (we'd turned the mesh off six months prior). Persistent volume claims nobody could explain. Operational surface area of a company 10x our size.

Still paying for what the previous architect designed in 2020.

## Why We Thought We Needed Kubernetes

We scaled unpredictably once. Fifteen microservices at peak. Twelve archived now. One team got folded. The database-per-service experiment? Failed.

Built for growth that never materialized.

## The Industry Consensus That Charged Us Money

"You can't run production without orchestration."

Every conference says it. AWS says it. I said it.

The problem wasn't choosing Kubernetes in 2020. It was never questioning it again.

## The Math That Made It Unavoidable

Breaking it down:

  • EKS cluster: $9,200/month
  • RDS (bloated Postgres): $4,100/month
  • NAT Gateway (data egress hell): $2,800/month
  • EBS volumes: $1,200/month
  • Junk (ECR, CloudWatch, VPC endpoints): $947/month

Total: $18,247/month. $219,000 a year.

Replacement cost: $7,200 upfront for three Dell PowerEdge R750 boxes. Colocation: $600/month (less than a single EKS instance).

Year one savings: $160,000. Every month after: $17,500 in the bank.

## What We Actually Moved To

Three machines. 24 cores, 192GB RAM each. NVMe drives. Stupid fast.

  • Systemd services (just compiled binaries, no container nonsense; unit sketch below)
  • Nginx load balancer
  • Postgres on machine one, replicated via WAL archiving to standby machines
  • Minio for S3-compatible blob storage
  • One Golang binary per microservice
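
For the curious, each of those units is about a dozen lines. A minimal sketch, with a hypothetical service name, binary path, and flags:

```ini
# /etc/systemd/system/orders.service -- hypothetical service name and paths
[Unit]
Description=orders service (compiled Go binary)
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/opt/app/orders/orders --port 8080
Restart=always
RestartSec=2
User=app
# logs land in journald; read them with: journalctl -u orders

[Install]
WantedBy=multi-user.target
```

systemctl enable, systemctl start, done. Restart=always covers the crash-loop case people assume needs an orchestrator.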

Deleted 47 Helm charts. That's… all of them.

Deploy process: SSH into the box, drop the binary into /opt/app, systemctl restart. Twenty seconds. No image registry flakes. No "why's the infrastructure broken while I'm debugging the app" moments.
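
The whole thing fits in a handful of shell lines. A sketch, with a hypothetical service name, host, and paths, and assuming passwordless sudo for systemctl:

```bash
#!/usr/bin/env bash
# deploy.sh -- sketch only; service name, host, and paths are made up
set -euo pipefail

HOST="app1.internal"
SERVICE="orders"
BINARY="./bin/${SERVICE}"

# copy the freshly built binary next to the live one, then swap and restart
scp "${BINARY}" "${HOST}:/opt/app/${SERVICE}/${SERVICE}.new"
ssh "${HOST}" "mv /opt/app/${SERVICE}/${SERVICE}.new /opt/app/${SERVICE}/${SERVICE} \
  && sudo systemctl restart ${SERVICE} \
  && systemctl is-active ${SERVICE}"
```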

## What Actually Worked

Here's the thing people say will break without Kubernetes, and what actually happened:

Microservices won't communicate without service mesh magic.

They just… do? Each service binds its own port. We configure hostnames in /etc/hosts or use Consul (free tier). DNS works. I spent three weeks bracing for NAT errors that never happened. Not a single cross-machine RPC timeout we couldn't trace to bad code. That was a weird win.
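
The "discovery layer" is literally a few lines per box. Hypothetical names and addresses:

```text
# /etc/hosts on every machine (hostnames and IPs are made up)
10.0.0.11  orders.internal
10.0.0.12  billing.internal
10.0.0.13  reports.internal
```

A caller hits http://billing.internal:8081 and that's the whole story. Consul replaces the static file once you're tired of editing it in three places.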

Deployments will tank availability because nothing's orchestrated.

False. Ansible restarts services serially — health check between each. Three minutes. One deploy rolled back because the build was bad. Took 90 seconds. We've done 184 deploys in six months. Zero unplanned downtime. Zero hotfixes that couldn't wait for the standard deployment window.
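
Here's roughly what that play looks like. A sketch, with a hypothetical inventory group, service name, and health endpoint; the parts doing the real work are serial: 1 and the health check gate:

```yaml
# rolling-restart.yml -- sketch; group, service, port, and endpoint are hypothetical
- hosts: app_servers
  serial: 1                # one machine at a time
  become: true
  tasks:
    - name: Restart the service
      ansible.builtin.systemd:
        name: orders
        state: restarted

    - name: Wait for the health check before touching the next box
      ansible.builtin.uri:
        url: "http://{{ inventory_hostname }}:8080/healthz"
        status_code: 200
      register: health
      retries: 10
      delay: 3
      until: health.status == 200
```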

You can't scale without orchestration.

Peak load is 200 concurrent users. Machines sit at 15% CPU on heavy days. Scaling means — and I'm not exaggerating — buying another server, throwing it in the load balancer config. Maybe 40 minutes of labor tops. We've never had to do it. Ever.
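
"Throwing it in the load balancer config" means adding one line to an nginx upstream block. A sketch with hypothetical hostnames and ports:

```nginx
# /etc/nginx/conf.d/app.conf -- sketch; hostnames and ports are hypothetical
upstream app_backend {
    server app1.internal:8080;
    server app2.internal:8080;
    server app3.internal:8080;
    # scaling: rack a new box, add one line here, nginx -s reload
}

server {
    listen 80;
    location / {
        proxy_pass http://app_backend;
    }
}
```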

The real fight? Came from the team. That's where the pressure was.

## The Resistance (Career Risk Is Real)

One engineer — brilliant engineer — built his entire resume as a "Kubernetes expert." Senior titles, conference talks, the whole thing. Heard the word "migration" and thought: "I'm unemployable now."

I told him: your value isn't Kubernetes. It's shipping production systems. He proved it writing the Ansible playbooks that executed the migration. Caught bugs nobody spotted. Got promoted.

That shift in thinking changed everything about how we make infrastructure decisions.

Everything else was just talking points:

"Self-healing clusters?"
kubectl rollout undo was never used. Not once. Zero times. Good deploy pipelines eliminate the need. Kubernetes doesn't make you ship better code.

"Load balancing?"
Nginx. Single process. We know what it does. Understand the logs. Change the config in 30 seconds. No black magic. No "why is traffic stuck on pod three?" mysteries at 2am.

"Secrets management?"
Ran Vault before Kubernetes. Ran it after migration. Kubernetes Secret management solved exactly zero of our actual security headaches. It was theater.

## The Actual Migration Steps

We didn't rip the band-aid off. Shadow traffic for three weeks.

  1. Provision hardware (one week) — Dell's fast, colocation at a real datacenter. Hardware dies? We recover without calling a vendor. No more waiting for EKS node recovery.

  2. OS setup (three days) — Ubuntu 22.04. Hardened with Lynis. SSH keys only. Firewall restricted to ports 22, 8080, 8081, 8082. Done. Security audit took one afternoon.

  3. Migration pilot (two weeks) — This part actually mattered. One app running simultaneously on bare metal and EKS. Traffic split via DNS weighting at the load balancer — 5% bare metal, 95% Kubernetes. We watched. For 502s. For slow queries. Memory leaks. Pod crashes that mysteriously happened at 4pm. Latency on bare metal came in 8% lower. No sidecar proxy tax. No Istio intercepting every packet. This split meant we caught configuration issues before full cutover. One app had a hardcoded endpoint it couldn't reach; DNS weighting caught it. (A sketch of the weighted split follows this list.)

  4. Full cutover (one day) — Updated load balancer weights. Drained EKS. Monitored through the night. Zero incidents.

  5. Deprovisioning (two days) — Killed RDS, EKS, NAT gateways, redundant VPC cruft.
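
On step 3: we did the split with DNS weighting, but if you'd rather keep it at nginx, split_clients gets you the same 95/5 effect. A sketch with hypothetical upstream names and addresses:

```nginx
# sketch: a 95/5 split in nginx via split_clients, one way to get the
# same effect as weighted DNS; names and addresses are hypothetical
split_clients "${remote_addr}" $pilot_backend {
    5%      baremetal;     # same client keeps landing on the same side
    *       kubernetes;
}

upstream baremetal  { server 10.0.0.11:8080; }
upstream kubernetes { server k8s-ingress.internal:80; }

server {
    listen 80;
    location / {
        proxy_pass http://$pilot_backend;
    }
}
```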

Left the Kubernetes cluster in read-only mode for 30 days. Just in case. Never touched it. Never needed it.

## Database Failover: The Real Remaining Risk

Postgres replication via WAL archiving is solid. That's not the issue. Bare-metal Postgres failover isn't automated like AWS RDS. Primary NVMe drive dies? We detect via monitoring, manually promote a standby, update connection strings. Maybe 15 minutes of downtime in a failure scenario we've never hit.
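
For the record, "manually promote a standby" is one command plus a config change. A sketch, assuming Postgres 14 on Ubuntu; paths and tooling are the stock Debian/Ubuntu wrappers:

```bash
# on the standby, once monitoring confirms the primary is really gone
sudo pg_ctlcluster 14 main promote   # Debian/Ubuntu wrapper around pg_ctl promote

# equivalent, from psql on the standby:
#   SELECT pg_promote();

# then repoint the apps at the new primary and roll them with the usual Ansible play
```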

That's the trade-off. You lose automatic failover. Your ops team needs to understand Postgres replication, not just assume the cloud handles it.

For our scale and SLA? Acceptable.
For high-frequency trading or a Fortune 500 backend? Absolutely not.

## Before and After

| Metric | Kubernetes | Bare Metal | Change |
| --- | --- | --- | --- |
| Monthly cost | $18,247 | $600 | −97% |
| Deploy time | 90–120s | 20s | 82% faster |
| P99 latency | 245ms | 227ms | 7% lower |
| Incident response | 15–20 min | 2 min (SSH) | 88% faster |
| Operational overhead | 12–15 hrs/week | 1–2 hrs/week | 87% reduction |

Six months. Zero incidents on bare metal.

(Your numbers differ. Your requirements differ. These are ours at our scale.)

## What We Lost

Being ruthless:

  • Auto-recovery on hardware failure. Now manual — IPMI reboot or drive replacement.
  • Built-in autoscaling. Now we buy a machine. Happened zero times.
  • Multi-region failover. We're in one datacenter anyway; Kubernetes never helped us there.
  • Container portability. We're not leaving. Never was real for us.

Didn't matter. That's the entire point.

## When Bare Metal Is Wrong

Be brutally honest about your workload.

Traffic's unpredictable? Autoscaling buys the headroom you need. Kubernetes solves that. Don't migrate.

Running ten services with independent release cycles, teams shipping on different schedules, dependency hell? Configuration management overhead becomes real — painful, actually. Don't migrate.

Your team is three months of Linux fundamentals away from running production servers on its own? Buy the abstraction. It's cheaper than hiring people who already know how to operate systems.

Startup chasing product-market fit? Way bigger problems exist. Kubernetes doesn't matter yet.

But profitable, traffic stable, team knows operating systems, everyone understands systemd and can SSH into a box and debug a process? The abstraction's a tax. Stop paying it.

## The Aftermath

Six months in. No Kubernetes. Three on-call pages total, all user error (somebody deployed bad code and blamed the infrastructure).

Team's happier. Deploys are fast. Debugging is SSH, ps aux, check logs, done. It's boring. We like boring.

Hiring changed. We interview for "ships fast," not "optimized resume."

That engineer who panicked about obsolescence — the "Kubernetes expert" — got promoted. That's the lesson nobody talks about. We stopped optimizing for resume keywords and started optimizing for shipping things that work.


Enjoyed the read? Let's stay connected!

🚀 Follow The Speed Engineer for more Rust, Go and high-performance engineering stories.
💡 Like this article? Follow for daily speed-engineering benchmarks and tactics.
⚡ Stay ahead in Rust and Go — follow for a fresh article every morning & night.
