A single Kubernetes cluster is manageable. You deploy things, they run, life is good. Things get complicated when you have clusters in multiple regions that need to talk to each other. This post covers why I needed multi-region, what I tried, what didn’t work, and how I got pod-to-pod connectivity between Mongolia and Germany using Netbird.
Why Multi-Region
The answer has two parts.
First: the Mongolia cluster is on-premise, and this infrastructure is unstable. Network switches fail, the DC has maintenance windows, things go down. With only one cluster, when it goes down, the app goes down. Even with 10 users, they deserve better.
Second: I wanted to build something complicated and solve problems I didn't actually have yet. Active/active multi-region means clients in different geographies connect to their nearest cluster — a user in Germany shouldn't be routing through Mongolia just to send a chat message. That's ~200ms round trip just to reach the server. For a chat app that latency is noticeable. I don't have German users yet. I don't have many users at all. But the architecture is ready for when the app explodes. Hope it’s gonna explode.
For the second cluster I chose Hetzner — cheap, stable, good network. Germany datacenter. I provision it with Terraform and run k3s with Cilium, same as Mongolia.
What I Looked At and Ruled Out
Stretched k3s Cluster (Single etcd)
k3s supports Tailscale natively, which means you could in theory run a single stretched cluster across two DCs — nodes in both Mongolia and Germany joining the same k3s cluster with a shared etcd.
The problem is etcd. It's quorum-based — a majority of nodes must agree on every write. With nodes split across two DCs and ~200ms latency between them, every etcd write waits for that round trip. That's not acceptable for a control plane. You'd also need an odd number of etcd nodes for quorum, which means either an unbalanced split between DCs or a third observer node somewhere. The complexity and latency made this a non-starter before even trying it.
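The quorum arithmetic is worth spelling out. A minimal sketch in Python (standard Raft majority math, nothing specific to this setup):

```python
# etcd (Raft) needs a strict majority of members to acknowledge every write.
def quorum(n: int) -> int:
    """Members that must agree before a write commits."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Members you can lose and still keep quorum."""
    return n - quorum(n)

# A 4-member cluster split 2/2 across DCs tolerates no more failures
# than a 3-member cluster, and losing either DC loses quorum entirely.
for n in (3, 4, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates={tolerated_failures(n)} failure(s)")
```

With members split 2/2 between Mongolia and Germany, losing either DC drops below quorum, which is exactly why you'd need the odd count or a third observer site.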
Cilium Cluster Mesh
Cilium has a built-in multi-cluster feature called cluster mesh. It connects multiple clusters at the network level — services become reachable across clusters natively, policies apply across clusters, load balancing works across clusters. It's exactly what I wanted.
The requirement: cluster nodes must be directly reachable from each other. The Mongolia cluster is behind NAT — its nodes have private IPs, not public ones. Without a way to make those nodes reachable from Germany, cluster mesh won't work. I ruled this out before attempting the setup.
This is important to understand about the architecture: each cluster internally uses native Cilium networking, no VPN overhead. Pod-to-pod traffic within a cluster is fully native. Netbird only sits between clusters — it's the bridge layer, not the base layer.
The Solution: Netbird for Cross-Cluster Connectivity
Netbird is a WireGuard-based overlay network with a routing architecture that handles NAT traversal automatically. The key feature: you can configure routing peers that expose a private network to the Netbird network. Other Netbird peers can then reach that private network through the router, even if the network itself is behind NAT.
For my setup:
- cluster-mn (Mongolia, behind NAT) — pod CIDR <MN_POD_CIDR>, service CIDR <MN_SVC_CIDR>
- cluster-de (Germany, Hetzner) — pod CIDR <DE_POD_CIDR>, service CIDR <DE_SVC_CIDR>
Pod and service CIDRs must be different between clusters — if both use the same ranges, routing breaks.
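A quick pre-flight check for this is easy to script. A sketch in Python, using illustrative stand-in ranges rather than my real <MN_POD_CIDR>/<DE_POD_CIDR> values:

```python
# Pre-flight check: no pod/service CIDR from one cluster may overlap
# any CIDR from the other, or cross-cluster routing becomes ambiguous.
import ipaddress

def overlapping_pairs(cidrs_a, cidrs_b):
    """Return every (a, b) pair of CIDRs that overlaps."""
    nets_a = [ipaddress.ip_network(c) for c in cidrs_a]
    nets_b = [ipaddress.ip_network(c) for c in cidrs_b]
    return [(str(a), str(b)) for a in nets_a for b in nets_b if a.overlaps(b)]

# Hypothetical example ranges, not the real placeholder values:
mn = ["10.42.0.0/16", "10.43.0.0/16"]
de = ["10.52.0.0/16", "10.53.0.0/16"]
print(overlapping_pairs(mn, de))  # disjoint -> []
print(overlapping_pairs(mn, mn))  # identical ranges -> overlaps reported
```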
Option 1: Netbird Agent on VMs (Rejected)
Install Netbird agent directly on each cluster's nodes, use those nodes as routing peers. Pods wanting to reach the other cluster go through a Cilium Egress Gateway to the VM running the agent.
This works. I tested it. But it means managing agents on VMs manually, making sure they stay running, handling updates. I didn't want that operational overhead. I'm already over-engineering enough things.
Option 2: Netbird Kubernetes Operator (What I Use)
Netbird has a Kubernetes operator that installs agents as pods inside the cluster. No VM management needed.
The problem: the operator's router pod doesn't use hostNetwork, so pods inside the cluster can't reach the Netbird network through it; the pod is isolated from the node's network namespace. Running a Netbird sidecar in each workload is possible, but for Cassandra and other stateful workloads that quickly becomes complicated.
The fix: patch the router deployment to enable hostNetwork: true after it's deployed. The Netbird Helm chart doesn't expose this as a configurable value, so I use a Kubernetes Job that runs post-deploy and patches the deployment:
apiVersion: batch/v1
kind: Job
metadata:
  name: patch-router-hostnetwork
  namespace: netbird
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  backoffLimit: 5
  template:
    spec:
      serviceAccountName: netbird-patcher
      restartPolicy: OnFailure
      containers:
        - name: patch
          image: bitnami/kubectl:latest
          command:
            - sh
            - -c
            - |
              echo "Waiting for router deployment..."
              kubectl rollout status deployment/router -n netbird --timeout=120s
              echo "Patching hostNetwork..."
              kubectl patch deployment router -n netbird --type=json -p='[
                {"op":"add","path":"/spec/template/spec/hostNetwork","value":true},
                {"op":"add","path":"/spec/template/spec/dnsPolicy","value":"ClusterFirstWithHostNet"}
              ]'
              echo "Done."
The Job is triggered by ArgoCD as a PostSync hook — it runs automatically after every sync, so the patch is always applied even after redeployments. I'll write a dedicated post on ArgoCD, but this is a good example of why it's useful — this whole setup is just config, no manual steps.
The router pod also needs to run on a specific node (the one that will act as the routing peer), so I label that node and add node affinity:
router:
  enabled: true
  replicas: 1
  nodeSelector:
    netbird-router: "true"
Cilium Egress Gateway
With the router pod running with hostNetwork: true on the designated node, I configure a CiliumEgressGatewayPolicy. Any pod in cluster-de wanting to reach <MN_POD_CIDR> (Mongolia's pod network) exits through the router node's wt0 interface (the WireGuard/Netbird interface):
apiVersion: cilium.io/v2
kind: CiliumEgressGatewayPolicy
metadata:
  name: netbird-egress
spec:
  selectors:
    - podSelector: {}
  destinationCIDRs:
    - <MN_POD_CIDR>
  egressGateway:
    nodeSelector:
      matchLabels:
        netbird-router: "true"
    interface: "wt0"
Important: the egress policy only covers the other cluster's pod CIDR, not Netbird IPs (100.115.x.x range). This is intentional. If you add an egress policy for the Netbird IP range, you create a routing loop — traffic arrives from a Netbird IP, the egress policy kicks in and tries to redirect it back through the router, and it never reaches the pod. By leaving Netbird IPs out of the egress policy, pods can't directly reach the Netbird network (minor downside), but the loop is avoided and cross-cluster traffic works correctly.
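The loop-avoidance logic can be modeled in a few lines. This is a toy model of the policy match only, not Cilium's actual datapath, and the CIDRs are illustrative stand-ins:

```python
# Toy model of the egress decision: a packet whose destination falls inside
# destinationCIDRs is redirected through the router node. If the Netbird
# range were also listed, reply traffic involving Netbird IPs would be
# redirected again instead of reaching the pod -- the routing loop.
import ipaddress

def egress_redirects(dst_ip: str, destination_cidrs) -> bool:
    """Would this destination be sent through the egress gateway?"""
    dst = ipaddress.ip_address(dst_ip)
    return any(dst in ipaddress.ip_network(c) for c in destination_cidrs)

POD_CIDR_MN = "10.42.0.0/16"      # stand-in for <MN_POD_CIDR>
NETBIRD_RANGE = "100.115.0.0/16"  # illustrative Netbird IP range

# Correct policy: only the remote pod CIDR is redirected.
safe = [POD_CIDR_MN]
print(egress_redirects("10.42.3.7", safe))    # True: cross-cluster traffic
print(egress_redirects("100.115.9.1", safe))  # False: Netbird IPs untouched

# Broken policy: Netbird range included, so Netbird traffic loops.
looping = [POD_CIDR_MN, NETBIRD_RANGE]
print(egress_redirects("100.115.9.1", looping))  # True: the loop
```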
Enable egress gateway in Cilium's Helm values:
egressGateway:
  enabled: true
The same setup runs mirror-image on cluster-mn, with egress policy pointing at <DE_POD_CIDR>.
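A sketch of that mirrored policy on cluster-mn, assuming the same node label is used there; it's the policy above with the CIDR placeholder swapped:

```yaml
apiVersion: cilium.io/v2
kind: CiliumEgressGatewayPolicy
metadata:
  name: netbird-egress
spec:
  selectors:
    - podSelector: {}
  destinationCIDRs:
    - <DE_POD_CIDR>   # Germany's pod network, routed via the MN router node
  egressGateway:
    nodeSelector:
      matchLabels:
        netbird-router: "true"
    interface: "wt0"
```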
Cross-Cluster Service Discovery
Pod-to-pod connectivity works now, but pod IPs change. You can't exactly hardcode them. Kubernetes service IPs don't work across clusters either — Cilium resolves service IPs locally, so a pod in cluster-de has no way to route a service IP that belongs to cluster-mn. The solution is a bit of a DNS relay: we trick the pods into asking the other cluster for the right IP.
The Relay Logic
When a pod in Germany wants to find database.namespace.svc.astring-mn.internal, the request follows this path:
1. Local CoreDNS: realizes it doesn't own .astring-mn.internal and forwards it to our "DNS Bridge."
2. DNS Bridge: a CoreDNS pod running on the hostNetwork of the router node. It listens on port 1053 and forwards the request over the Netbird tunnel.
3. Netbird Peer: the request travels through the WireGuard tunnel to a Netbird peer that can see the Mongolia cluster's internal API/DNS.
4. Resolution: the IP comes back, and the egress policy we set up earlier handles the actual data routing.
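The dispatch in step 1 boils down to suffix matching. A toy model in Python (the service name is hypothetical; <ROUTER_NODE_IP> stays a placeholder):

```python
# Toy model of the CoreDNS decision for cross-cluster names: anything under
# .astring-mn.internal is handed to the DNS bridge on the router node;
# everything else stays with the normal in-cluster resolver.
def route_query(name: str) -> str:
    if name.endswith(".astring-mn.internal"):
        # Steps 1-2 above: forward to the bridge listening on port 1053.
        return "forward to <ROUTER_NODE_IP>:1053"
    return "resolve via local cluster DNS"

print(route_query("database.chat.svc.astring-mn.internal"))
print(route_query("database.chat.svc.cluster.local"))
```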
CoreDNS Custom Config
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  astring-fsn1.server: |
    astring-fsn1.internal:53 {
        rewrite name suffix .svc.astring-fsn1.internal .svc.cluster.local
        rewrite name suffix .astring-fsn1.internal .svc.cluster.local
        kubernetes cluster.local
    }
  astring-mn.server: |
    astring-mn.internal:53 {
        errors
        cache 30
        forward . <ROUTER_NODE_IP>:1053
    }
Queries for *.astring-mn.internal get forwarded to port 1053 on the router node. The router node runs a DNS bridge that forwards those queries into the other cluster's network via Netbird.
DNS Bridge: The middleman
The DNS bridge is a CoreDNS instance running on the router node with hostNetwork: true, forwarding queries to a Netbird peer IP:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dns-bridge-mn
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dns-bridge-mn
  template:
    metadata:
      labels:
        app: dns-bridge-mn
    spec:
      nodeSelector:
        netbird-router: "true"
      hostNetwork: true
      containers:
        - name: coredns
          image: coredns/coredns:1.10.1
          args: ["-conf", "/etc/coredns/Corefile"]
          volumeMounts:
            - name: config-volume
              mountPath: /etc/coredns
              readOnly: true
      volumes:
        - name: config-volume
          configMap:
            name: dns-bridge-conf
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-bridge-conf
  namespace: kube-system
data:
  Corefile: |
    astring-mn.internal:1053 {
        errors
        cache 30
        forward . <NETBIRD_PEER_IP>
    }
<NETBIRD_PEER_IP> is a fixed Netbird IP — in my case, the management VM's Netbird agent. I chose this because I wanted a stable IP that doesn't change. Using another cluster's routing peer IP would also work. This is not ideal — ideally this would be dynamically resolved — but it works and I'll improve it later. Same applies to <ROUTER_NODE_IP> in the CoreDNS config above. Future me has a lot of work to do.
With this in place, a pod in cluster-de can reach a service in cluster-mn using service-name.namespace.svc.astring-mn.internal. CoreDNS forwards the query to the DNS bridge, the bridge forwards it through Netbird to the other cluster's DNS, gets the pod IP back, and the egress policy routes the traffic through the router node.
Managing All of This with ArgoCD
If you're doing this manually — applying patches, keeping configs in sync across two clusters, making sure the post-deploy job runs — it becomes a nightmare quickly. ArgoCD manages all of it declaratively. The patch job, the egress policies, the CoreDNS configs, the Netbird operator — all defined as ApplicationSets, applied automatically on sync. I'll cover ArgoCD properly in a separate post.
Current State
Pod-to-pod connectivity between cluster-mn and cluster-de is working. Services are resolvable across clusters using the custom CoreDNS setup. NATS supercluster and cross-DC ScyllaDB replication both run on top of this — that's a separate post covering the active/active chat architecture.
Known limitations I'll fix later:
- <ROUTER_NODE_IP> in CoreDNS is hardcoded — should be dynamic
- <NETBIRD_PEER_IP> using the management VM is not ideal — should use a cluster routing peer directly
Is this over-engineered for a chat app with a small number of users? Absolutely.