Amal Zacharia

Posted on May 23 • Originally published at crza.dev

Five Clusters. Five Lessons. One Production System.

#kubernetes #devops #cloudnative #infrastructure

Most engineers talk about Kubernetes like it's something you set up once and forget. I set it up five times. Each time because the last version taught me something I couldn't have learned any other way.

This is that story.

The honest version of each stage:

Stage 1. I just need HA.

Stage 2. I just need cheaper HA.

Stage 3. I just need multi-cloud ARM scheduling.

Stage 4. I just need BGP mesh ingress.

Stage 5. I have achieved zero trust enlightenment.

Every stage felt like a small reasonable ask. The cluster had other ideas.

Stage 1: The Naive But Functional Cluster

The first cluster wasn't elegant. It wasn't supposed to be.

Three to five Hetzner dedicated nodes running k3s. Three DigitalOcean VPS running nginx as a makeshift load balancer. A DigitalOcean cloud load balancer sitting in front of those three VPS.

I followed Techno Tim's k3s HA install guide as the foundation. Solid starting point for getting a highly available cluster off the ground. Replaced the default Traefik ingress with nginx and added the Rancher dashboard on top for visibility. Everything after that was documentation and experimentation.

Before this everything ran on plain VPS. Individual servers. No orchestration. Moving to Kubernetes meant migrating workloads one by one. Old VPS running stable, new cluster being tested in parallel, services moved gradually. If something broke I reverted. Nothing burned.

The cluster itself was stable. No memorable incidents with the nodes or networking. The one persistent irritation was MongoDB. I was running Bitnami's MongoDB HA chart. One primary taking writes, two secondaries serving reads. Randomly and without warning one of the pods would enter a crashloop. Logs pointed vaguely at lock contention but nothing actionable. A restart brought it back clean every time.

The read heavy workload saved me. A local cache layer absorbed most requests so when a replica went down users barely felt it. I'd notice eventually. Not through alerts, just observation. Restart the StatefulSet and move on.

It happened often enough to be annoying. Never often enough to be a crisis. The cluster was otherwise stable. This was just a ghost that came with the Bitnami chart.

It followed me into the next stage. And the one after that.

Cost pressure eventually killed this setup.

Stage 2: The WireGuard Mesh

Three DigitalOcean VPS plus one cloud load balancer to serve fewer than a thousand requests a month. Those thousand requests were me hitting the Rancher dashboard. Every actual workload was outbound only. Bots, task processors, queue consumers. Nothing needed to receive HTTP from the internet. Kubernetes was not there to serve traffic. The nginx tier existed for one admin panel. The math didn't make sense.

I wanted to move to Contabo. Not cheaper in absolute terms but more efficient for my actual workload requirements. Hetzner dedicated nodes were sitting with RAM mostly idle. I was paying for resources I wasn't using.

The problem was Contabo had no private network at the time.

So I built one.

WireGuard. I found it while researching private networking options. I looked at Kilo as well, a Kubernetes-native WireGuard mesh operator, but it felt too complex for what I needed. Netclient handled mesh generation automatically. WireGuard configurations for multiple peers without the manual key exchange headache.

A single DigitalOcean VPS became the entry point. It ran the Netmaker server and dashboard, the control plane that Netclient agents register with. All the Contabo nodes connected through it via mesh. First try. Everything came up clean.

The three nginx VPS and the cloud load balancer disappeared. Previously traffic flowed from users through a DigitalOcean cloud load balancer and three nginx VPS before reaching Klipper. An entire layer of infrastructure serving as a middleman. Replaced now by Klipper speaking directly to users. The elaborate routing chain had been architectural overkill from the start.

Contabo had weird reliability. Nodes would occasionally behave unexpectedly in ways that didn't make sense on paper. But the workload was tolerant to disruption and the cost saving from eliminating the most expensive components made the occasional weirdness acceptable.

The MongoDB crashloop followed. Still there. Still managed by cache and manual restarts. Not catastrophic. Just persistent.

Stage 3: ARM Workers and the First Multi-Cloud Cluster

ARM compute is cheap. Oracle's free tier ARM instances are extremely cheap. As in free.

A Network Chuck video appeared in my YouTube recommendations one day. Kubernetes on Raspberry Pi. Watched it casually. The part that stuck was his explanation of why he ran the Rancher dashboard on an amd64 node rather than the ARM64 ones. The reasoning was practical. Certain workloads and management tooling simply behaved better on amd64. That one observation shaped the architecture before I'd even started.

Master nodes stayed on amd64. Workers went on ARM64. Clean separation from the beginning.

I extended the Netclient WireGuard mesh to include Oracle ARM nodes in a different region. This was multi-cloud before multi-cloud was a common thing to talk about. Not for ideological reasons. Purely because free ARM compute was sitting there unused. The scale didn't justify multi-cloud architecture. The price did. Free compute is free compute.

Multi-arch builds handled through GoReleaser. On every GitHub commit, GitHub Actions triggered GoReleaser. Publishing multi-arch Docker images and binaries as releases automatically. Services with amd64-only dependencies got node affinity rules keeping them on the Contabo nodes. Cross-architecture scheduling handled through proper affinity configuration rather than hoping things landed in the right place.
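
The affinity rule described here can be sketched as a pod spec fragment. This is a minimal illustration, not the actual manifest from the cluster: the helper name arch_affinity is hypothetical, and the structure is built as a plain Python dict. The kubernetes.io/arch key is the standard node label that Kubernetes sets on every node.

```python
import json


def arch_affinity(architectures):
    """Return a pod-spec `affinity` fragment that restricts scheduling
    to nodes whose CPU architecture is in `architectures`."""
    return {
        "nodeAffinity": {
            # "required" means the scheduler will not place the pod elsewhere.
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": "kubernetes.io/arch",
                                "operator": "In",
                                "values": list(architectures),
                            }
                        ]
                    }
                ]
            }
        }
    }


# A service with amd64-only dependencies stays off the ARM64 workers.
amd64_only = arch_affinity(["amd64"])
print(json.dumps(amd64_only, indent=2))
```

The printed JSON can be pasted under spec.affinity of a pod template. Services built for both architectures simply omit the rule and land wherever there is room.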

Oracle's networking was strange. Occasionally a worker node would drift out of WireGuard mesh sync and require a restart. Never more than one at a time. Enough redundancy that it didn't cause visible issues. Whether that was a free tier limitation or something inherent to their networking I never definitively determined.

The MongoDB crashloop was still there. Quietly persistent across three different infrastructure setups now.

What ended it was a new client. Twenty Discord bots. Each one 512MB RAM at startup, 2GB at peak load. Discord.js cache is RAM hungry by design. The library caches guild members, messages, and state aggressively. Twenty bots at peak meant up to 40GB of RAM requirement for workloads that could not be disrupted.
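
The sizing arithmetic is simple enough to write down. A small sketch using only the figures quoted above, treating 2GB as 2048MB; the function and constant names are illustrative.

```python
BOT_COUNT = 20
STARTUP_MB = 512   # per bot, right after start
PEAK_MB = 2048     # per bot, at peak load (2GB)


def total_gb(per_bot_mb, bots=BOT_COUNT):
    """Total RAM in GB for `bots` bots using `per_bot_mb` megabytes each."""
    return per_bot_mb * bots / 1024


startup_gb = total_gb(STARTUP_MB)
peak_gb = total_gb(PEAK_MB)

print(f"startup: {startup_gb:.0f}GB, peak: {peak_gb:.0f}GB")
# prints "startup: 10GB, peak: 40GB"
```

The number that matters for capacity planning is the peak. The cluster has to hold 40GB in reserve for workloads that cannot be evicted, even if most of the time the bots sit far below that.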

Oracle free tier ARM nodes are not where you run non-disruptable RAM-intensive production workloads.

Time to go back to dedicated hardware.

Stage 4: Real High Availability

Five Hetzner nodes. Dual network. This is where the cluster started looking like something serious.

The private VLAN handled all internode communication. k3s configured via args to bind to the private IP. Not 0.0.0.0. Cluster join happened over private IPs not public ones. The public interface existed for egress only at the node level.
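
A rough sketch of what binding k3s to the private network could look like. The exact flags used on this cluster are not listed above, so this is an assumption: the flags below are real k3s server options, but the selection, the helper name k3s_server_args, and the 10.0.0.x addresses are illustrative only.

```python
import shlex


def k3s_server_args(private_ip, join_endpoint):
    """Build a k3s server argument list that keeps cluster traffic
    on the private network (illustrative flag selection)."""
    return [
        "server",
        f"--node-ip={private_ip}",            # address the node registers with
        f"--advertise-address={private_ip}",  # address the API server advertises to the cluster
        f"--bind-address={private_ip}",       # listen on the private IP, not 0.0.0.0
        f"--tls-san={join_endpoint}",         # put the join endpoint in the API server certificate
    ]


# Hypothetical addresses: a node's private VLAN IP and a floating join IP.
args = k3s_server_args("10.0.0.11", "10.0.0.100")
print("k3s " + shlex.join(args))
```

The same values can live in the k3s config file instead of the command line. Either way the point is that nothing in the control plane listens on the public interface.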

The goal with CNI was specific. Klipper, k3s's default service load balancer, requires ports 80 and 443 open on every node to function. I didn't want that. I wanted a single advertised IP for ingress, not ports splayed across every node in the cluster.

Calico with BGP was the answer. Or so I thought.

The setup was architecturally sound. Five Hetzner dedicated nodes running a BGP mesh via Calico. MetalLB advertising a private IP range for load balancer services. That private range connected to Hetzner's cloud network via static routes. Each node acting as a router. A Hetzner cloud load balancer sitting in front, pointing to the MetalLB advertised IP. Clean. No ports open on individual nodes. Exactly what I wanted.

Keepalived handled the cluster join entry point. A floating IP that survived node failures. I tested this manually during bootstrap. Simulated a node failure. Keepalived moved the IP cleanly to a healthy node. That part worked exactly as expected.

The BGP setup worked too. Technically. I ran it for a month.

Then I measured HTTP latency.

The requests going through the Hetzner cloud network into the BGP mesh and back out were noticeably slow. Not broken. Just slow in a way that didn't make sense given the hardware. I only had ingress exposed so I couldn't measure other protocols. Whether it was Hetzner Cloud's routing adding overhead or something in the mesh I never definitively diagnosed.

I reverted. Wrote it off as a cloud networking quirk. The internode BGP mesh stayed. Calico peers between all five nodes remained stable. But the MetalLB and cloud LB routing experiment was over. Ingress went back to Keepalived managing a floating IP, with Hetzner API hooks for failover.

Something else happened at this stage. The MongoDB crashloop stopped. I don't know exactly when. I don't know exactly why. It had followed me through three different infrastructure setups. Different providers, different CNIs, different everything. Then somewhere around moving to Calico it just stopped. Never diagnosed. Never explained. Just gone.

Around a hundred workloads running at this point. Bots, websites, the WebSocket infrastructure, client services accumulated over years.

What ended Stage 4 wasn't a failure. It was a conference.

Interlude: The Homelab Test

Before this I validated experiments on DigitalOcean VPS. Spin one up, test the idea, tear it down. Clean. Disposable. No commitment.

But before committing rke2 and Cilium to production I wanted to test on real hardware. Not a cloud VM. Actual bare metal.

I had an old PC. Booted Ventoy, loaded a gparted live image, used dd to write Debian directly to the disk. Configured SSH and networking. Came back to my main machine and SSHed in. Bare metal provisioning the old fashioned way. No cloud console, no managed image, just a disk, Ventoy, and a Debian ISO.

Installed rke2. Installed Cilium. Configured BGP and set static routes in my home router to expose services publicly. Ran a PostgreSQL instance. One database per project, cleaner than running everything on my main machine. Threw a Minecraft server and an Ark Survival server on it so friends could connect.

It worked. Everything I planned for Stage 5 behaved the way I expected on real hardware before I committed it to production.

That's the point of a homelab. Not to show off. Not to run unnecessary infrastructure. To answer the question before production has to.

Stage 5 wasn't a leap of faith. It was a validated decision.

Stage 5: Zero Trust and the Hardened Cluster

The current cluster. OVH this time, not Hetzner.

OVH has better DDoS protection. Their uptime guarantees are stronger.

rke2 replaced k3s as the distribution. The operational feel is similar. rke2 is heavier. Larger resource footprint. But for a cluster running serious production workloads that weight is appropriate.

Cilium replaced Calico as CNI.

I saw Cilium at KubeCon. But the real reason was simpler. I had never successfully run a firewall alongside Kubernetes. Every attempt broke with CNI changes. iptables rules conflicting with CNI networking. Configurations that worked until a CNI update silently invalidated them. I gave up trying to run a traditional firewall on Kubernetes nodes entirely.

Cilium's host firewall was different. It runs at the eBPF layer, below where CNI conflicts happen. That was the actual problem I needed solved. KubeCon introduced me to the solution. The L2 load balancer announcement was a bonus.

L2 announcement for load balancer IPs. A few YAML manifests and the cluster manages IP assignment itself. The goal from Stage 4 finally solved. Differently than expected but solved. Keepalived still exists for the cluster join endpoint but the complex API hook dance for floating IP management is gone.
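
For a sense of what "a few YAML manifests" means, here is a hedged sketch of the two resources involved, built as Python dicts. The kinds CiliumLoadBalancerIPPool and CiliumL2AnnouncementPolicy are Cilium CRDs, but the apiVersion and field names below follow the v2alpha1 documentation as an assumption and should be checked against the Cilium version in use. The resource names, the 192.0.2.0/28 documentation range, and the interface pattern are placeholders, not values from this cluster.

```python
import json


def l2_announcement_manifests(pool_cidr, interface_regex):
    """Return an IP pool and an L2 announcement policy for LoadBalancer services.
    Field names are assumed from the Cilium v2alpha1 docs; verify before use."""
    ip_pool = {
        "apiVersion": "cilium.io/v2alpha1",
        "kind": "CiliumLoadBalancerIPPool",
        "metadata": {"name": "ingress-pool"},        # hypothetical name
        "spec": {"blocks": [{"cidr": pool_cidr}]},   # addresses Cilium may hand out
    }
    l2_policy = {
        "apiVersion": "cilium.io/v2alpha1",
        "kind": "CiliumL2AnnouncementPolicy",
        "metadata": {"name": "announce-ingress"},    # hypothetical name
        "spec": {
            "interfaces": [interface_regex],         # NICs allowed to answer ARP
            "loadBalancerIPs": True,                 # announce LoadBalancer IPs
        },
    }
    return [ip_pool, l2_policy]


manifests = l2_announcement_manifests("192.0.2.0/28", "^eth[0-9]+")
for manifest in manifests:
    print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON as well as YAML, so the output can be applied directly while experimenting. The L2 announcement feature also has to be enabled in the Cilium installation itself, which this sketch does not cover.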

Host firewall via CiliumClusterwideNetworkPolicy. The public IP of every node is egress only. Nodes can initiate connections to the world. The world cannot initiate connections to nodes. Not even ping. Except from monitoring systems and my static IP for SSH. Enforced at the kernel level on every node via eBPF. Not a firewall rule sitting in front of the infrastructure. The infrastructure itself is the firewall.
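
A minimal sketch of such a host policy, assuming Cilium's host firewall feature is enabled. The structure follows the documented host policy shape (a nodeSelector plus ingress rules). Everything specific is a placeholder: the node label, the policy name, the documentation-range CIDRs, and port 9100 for monitoring are assumptions, not the real values.

```python
import json


def tcp_port(port):
    """Cilium expects ports as strings inside a toPorts entry."""
    return {"ports": [{"port": str(port), "protocol": "TCP"}]}


def host_firewall_policy(ssh_cidrs, monitoring_cidrs, monitoring_port):
    """Build a CiliumClusterwideNetworkPolicy that selects nodes (not pods).
    Once it applies, ingress not matched by a rule is dropped. No egress
    rules are defined, so outbound traffic is left unrestricted."""
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumClusterwideNetworkPolicy",
        "metadata": {"name": "host-ingress-lockdown"},  # hypothetical name
        "spec": {
            # Hypothetical label; nodes carrying it get the host policy.
            "nodeSelector": {"matchLabels": {"node-access": "restricted"}},
            "ingress": [
                # Keep node-to-node and pod-to-node traffic working.
                {"fromEntities": ["cluster"]},
                # SSH only from the admin's static IP.
                {"fromCIDR": list(ssh_cidrs), "toPorts": [tcp_port(22)]},
                # Monitoring systems may reach one port.
                {"fromCIDR": list(monitoring_cidrs), "toPorts": [tcp_port(monitoring_port)]},
            ],
        },
    }


policy = host_firewall_policy(["203.0.113.10/32"], ["198.51.100.0/24"], 9100)
print(json.dumps(policy, indent=2))
```

Before applying anything like this to a remote node, test in audit mode or on a machine with console access. A host policy that forgets the SSH rule locks the door from the inside.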

Ingress hardened to Cloudflare only. Traefik as the ingress controller. Cilium network policy restricts ingress traffic to Cloudflare IP ranges only at the network level. Traefik handles mTLS on top of that through Cloudflare's Authenticated Origin Pulls, verifying Cloudflare's client certificate. Two layers of verification before any request reaches a service.
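
The first layer is a CIDR membership check. The sketch below shows the logic with Python's ipaddress module rather than a Cilium policy. The two ranges are a sample subset used for illustration; the authoritative list is published by Cloudflare and changes over time, so a real policy should be generated from that list. The function name is hypothetical.

```python
import ipaddress

# Sample subset only. Fetch the current list from Cloudflare for real use.
SAMPLE_CLOUDFLARE_RANGES = ["173.245.48.0/20", "104.16.0.0/13"]


def is_allowed_source(source_ip, allowed_cidrs=SAMPLE_CLOUDFLARE_RANGES):
    """Return True if `source_ip` falls inside any of the allowed CIDR blocks."""
    address = ipaddress.ip_address(source_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


print(is_allowed_source("173.245.48.10"))  # inside 173.245.48.0/20
print(is_allowed_source("203.0.113.5"))    # documentation range, not Cloudflare
```

Source IP filtering alone only proves that a request came through Cloudflare's network, not through a particular Cloudflare account. That gap is what the mTLS layer closes.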

Namespace isolation by default. Network policies deny all ingress by default. Services explicitly whitelist what they need. DNS, shared resources, specific cross-namespace communication. The model is incoming-decides. Each namespace declares what it accepts rather than what it sends. No traffic sniffing between namespaces is possible by policy.
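
The incoming-decides model can be sketched with standard Kubernetes NetworkPolicy objects. Whether the cluster uses these or Cilium's own policy CRDs is not stated above, so treat this as an assumption. The namespace names and helper functions are hypothetical; kubernetes.io/metadata.name is the label Kubernetes sets on every namespace automatically.

```python
import json


def default_deny_ingress(namespace):
    """Select every pod in the namespace and allow no ingress at all."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
    }


def allow_from_namespace(namespace, source_namespace):
    """The receiving namespace declares which other namespace may reach it."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-from-{source_namespace}", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [
                        {
                            "namespaceSelector": {
                                "matchLabels": {
                                    "kubernetes.io/metadata.name": source_namespace
                                }
                            }
                        }
                    ]
                }
            ],
        },
    }


# Hypothetical namespaces: "bots" accepts traffic only from "traefik".
policies = [default_deny_ingress("bots"), allow_from_namespace("bots", "traefik")]
print(json.dumps(policies, indent=2))
```

Network policies are additive, so the deny-all policy sets the baseline and each allow policy opens exactly one path. Nothing in the sending namespace has to change.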

The cluster currently runs bots, websites, and accumulated client workloads. It is boring in the useful sense. Deployable, diagnosable, recoverable.

What Five Clusters Taught Me

Cost pressure is a better architect than planning. Every significant change in this infrastructure happened because something was too expensive relative to its value. The nginx VPS layer. The idle Hetzner RAM. The Oracle free tier nodes that couldn't handle non-disruptable workloads. Cost pressure forced clarity about what the infrastructure actually needed to do.

The first version is supposed to be wrong. One video and documentation got the first cluster running. That was enough. The wrongness of the first version taught me what the second version needed to be.

Persistent problems have hidden causes. MongoDB crashed across three different providers, two different CNIs, and multiple cluster rebuilds. Then stopped. I never found the root cause. Sometimes production systems have ghosts you learn to live with until they quietly leave on their own.

Security is a journey not a configuration. Stage 1 had no meaningful firewall story. Stage 5 has host-level eBPF enforcing zero trust at the kernel. That's not a single decision. That's five iterations of failing to run a firewall properly until the right tool existed.

Simplicity compounds. Every stage removed something. The nginx VPS layer. The cloud load balancer. The WireGuard entry point VPS. The Keepalived API hooks. Each removal made the system more reliable not less. The most complex part of Stage 1 was the load balancing layer doing the least important work.

Context beats technology. Multi-cloud sounds impressive. The reality was free Oracle ARM compute that cost nothing to add. Cilium sounds cutting edge. The reality was a persistent firewall problem that nothing else solved cleanly. The technology follows the problem. Not the other way around.

Validate before you migrate. I didn't move production to rke2 and Cilium based on a KubeCon talk. I went home, provisioned a bare metal server from scratch, and ran the stack on real hardware first. The homelab wasn't a hobby. It was due diligence.

Security - Five Stage Evolution

Stage	Security Posture
Stage 1	No meaningful firewall · MongoDB on crashloop · Manual restart as incident response
Stage 2	WireGuard mesh · First network isolation layer · Removed public-facing VPS middlemen
Stage 3	Node affinity arch-based isolation · Multi-cloud scheduling boundaries · GoReleaser controlled build pipeline
Stage 4	Private VLAN for internode comms · Public interface egress only · Keepalived tested at bootstrap
Stage 5	eBPF host firewall · Nodes egress only · Cloudflare-only ingress + mTLS · Namespace isolation by default · No traffic sniffing between namespaces

Stage 1 ░░░░░░░░░░░░░░░░░░░░

Stage 2 ████░░░░░░░░░░░░░░░░

Stage 3 ████████░░░░░░░░░░░░

Stage 4 ████████████░░░░░░░░

Stage 5 ████████████████████