6-Month Retrospective: Building a Multi-Region Active-Passive K8s 1.32 Cluster
Six months ago, our team set out to build a multi-region active-passive Kubernetes 1.32 cluster to support our global SaaS platform, with targets of 99.99% uptime, sub-100ms latency for global users, and robust disaster recovery (DR). Today, we’re sharing our operational results, key lessons learned, and critical pitfalls to avoid for teams embarking on similar projects.
Initial Deployment Recap
We chose a two-region architecture: us-east-1 (active) and eu-west-1 (passive), running Kubernetes 1.32 on the control plane, selected for its nftables kube-proxy mode (beta in 1.32 and enabled by default), improved Job scheduling, and mature v1 Ingress API support. Our toolchain included:
- Terraform for infrastructure provisioning (VPCs, node groups, load balancers)
- Ansible for node configuration and K8s bootstrapping
- Cilium as the CNI for cross-region networking and eBPF-based observability
- Istio for service mesh, traffic management, and cross-region mTLS
- Velero for cluster backup, Portworx Data Services for stateful workload replication
- ExternalDNS, Cert-Manager, and Prometheus/Thanos for operational tooling
Initial deployment took 8 weeks, with the biggest early challenges being cross-region network latency (70ms baseline between regions) and stateful workload replication for our self-managed PostgreSQL and Redis clusters.
6-Month Operational Metrics
We tracked four core metrics to measure success against our original goals:
- Uptime: 99.99% for the active region and 100% standby availability for the passive region. The active region saw no unplanned downtime; the only interruptions were two scheduled maintenance windows.
- Failover Performance: We ran two scheduled failover tests. The first achieved an RTO (Recovery Time Objective) of 8 minutes, reduced to 3 minutes in the second test after we automated the failover workflow. RPO (Recovery Point Objective) is under 1 minute for stateless workloads and 2 minutes for stateful workloads.
- Latency: Active-region users see <50ms p99 latency; EU users routed through the passive region see <100ms p99. Cross-region traffic is limited to replication and failover workflows, keeping data transfer costs 30% lower than a comparable active-active setup.
- K8s 1.32 Stability: No major control plane outages; only three minor CVE patches, applied via automated rolling node updates. The nftables kube-proxy mode cut kube-proxy CPU usage by 15% versus the legacy iptables mode (config sketch below).
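Enabling this is a small kube-proxy configuration change. Here is a minimal sketch, assuming kube-proxy reads its config via --config (or the kubeadm-managed kube-proxy ConfigMap); the cluster CIDR is a placeholder:

```yaml
# kube-proxy-config.yaml - minimal sketch; clusterCIDR is a placeholder.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
# nftables is beta (on by default) in 1.32, but the backend still has to
# be requested explicitly; the default mode remains iptables.
mode: "nftables"
clusterCIDR: "10.0.0.0/16"
```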
Key Lessons Learned
1. Cross-Region Networking Requires Dedicated Interconnects
Our initial setup routed cross-region traffic over the public internet, which gave us the 70ms baseline and occasional packet loss. Moving replication traffic onto dedicated inter-region links (AWS Transit Gateway inter-Region peering, which stays on the AWS global backbone) cut latency to 40ms and kept it off the public internet entirely. Cilium’s cross-region tunneling with WireGuard encryption added minimal overhead while meeting our security requirements.
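The WireGuard piece is just Cilium configuration. A minimal sketch of the relevant Helm values, using Cilium's standard encryption and Cluster Mesh options; the cluster name and ID are per-region placeholders:

```yaml
# values-cilium.yaml - illustrative Helm values for the active region;
# each region gets a unique cluster name and ID.
cluster:
  name: us-east-1
  id: 1
encryption:
  enabled: true
  type: wireguard        # node-to-node WireGuard tunnels
clustermesh:
  useAPIServer: true     # expose the Cluster Mesh API to the peer region
```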
2. Async Replication Beats Periodic Backups for Stateful Workloads
We started with scheduled Velero cluster backups, but even at a 15-minute cadence the RPO floor was 15 minutes, which violated our DR SLA (schedule sketch below). Switching to Portworx Data Services for asynchronous volume replication brought RPO down to 2 minutes for all stateful workloads. For our managed databases (RDS PostgreSQL), we used native cross-region read replicas with automated failover, delivering RPO under 1 minute.
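For context, the periodic approach we moved away from looked roughly like this; the namespaces and cadence are illustrative, and the point is that restore-point granularity can never beat the schedule:

```yaml
# velero-schedule.yaml - the periodic-backup approach we replaced;
# namespaces and cadence are illustrative.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: stateful-backup
  namespace: velero
spec:
  schedule: "*/15 * * * *"   # every 15 minutes -> 15-minute RPO floor
  template:
    includedNamespaces:
      - postgres
      - redis
    ttl: 72h0m0s             # keep restore points for three days
```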
3. Automate Failover End-to-End
Before automation, a manual failover dry run took 20 minutes of manual DNS updates, workload scaling, and database primary switches. We built a custom Go operator using kubebuilder that watches active-region health via a combination of node, pod, and ingress health checks. The operator triggers failover automatically when health checks fail for 2 consecutive minutes: it updates ExternalDNS records, scales passive workloads to active capacity, and promotes database replicas. This cut RTO to 3 minutes.
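The operator and its CRD are internal, so the resource below is a hypothetical sketch just to show which knobs the automation exposes; the API group, kind, and every field name are invented for illustration:

```yaml
# failover-policy.yaml - hypothetical CR for our internal operator; the
# API group, kind, and all field names are illustrative, not a published API.
apiVersion: failover.example.com/v1alpha1
kind: FailoverPolicy
metadata:
  name: us-east-1-to-eu-west-1
spec:
  activeRegion: us-east-1
  passiveRegion: eu-west-1
  healthChecks:
    sources: [nodes, pods, ingress]  # signals combined into one verdict
    failureWindow: 2m                # sustained failure before acting
  actions:
    updateExternalDNS: true          # repoint records at the passive LB
    scalePassiveToActive: true       # match active-region replica counts
    promoteDatabaseReplicas: true    # e.g. trigger RDS replica promotion
```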
4. K8s 1.32 Features Deliver Tangible Value
Beyond nftables kube-proxy, we leveraged the improved Job controller behavior in 1.32, which reduced failed batch-job retries by 20% thanks to better backoff handling (sketch below). Standardizing on the stable networking.k8s.io/v1 Ingress API also simplified our Istio ingress configuration, removing the beta API annotations that caused frequent drift.
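As an illustration of the retry controls involved, here is a sketch of a batch Job combining a backoff limit with the stable pod failure policy API, so known-fatal exits fail fast instead of burning retries; the image and exit code are placeholders:

```yaml
# batch-job.yaml - sketch of the retry controls; image and exit code
# are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-etl
spec:
  backoffLimit: 4                # cap retries for transient failures
  podFailurePolicy:
    rules:
      - action: FailJob          # don't retry known-fatal exit codes
        onExitCodes:
          operator: In
          values: [42]
  template:
    spec:
      restartPolicy: Never       # required when using podFailurePolicy
      containers:
        - name: etl
          image: registry.example.com/etl:latest
```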
5. Unified Global Observability Is Non-Negotiable
Our initial Prometheus setup was region-local, leading to fragmented metrics and missed alerts for the passive region. Adding Thanos for global metric aggregation and a unified Alertmanager gave us a single pane of glass across both regions. We also enabled node-exporter alerts in all regions, which caught a passive-region node-pressure event that would otherwise have delayed failover.
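The aggregation layer is mostly a matter of pointing a Thanos Query instance at each region's store endpoints. A minimal sketch of the container spec; the image tag and the gRPC endpoint DNS names are placeholders for however you expose the per-region sidecars:

```yaml
# thanos-query.yaml (excerpt) - endpoint DNS names are placeholders for
# each region's Thanos sidecar/store gRPC services.
containers:
  - name: thanos-query
    image: quay.io/thanos/thanos:v0.35.0
    args:
      - query
      - --http-address=0.0.0.0:9090
      - --store=dnssrv+_grpc._tcp.thanos-sidecar.monitoring.us-east-1.example.internal
      - --store=dnssrv+_grpc._tcp.thanos-sidecar.monitoring.eu-west-1.example.internal
```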
Challenges We Faced (and Solved)
- Cert-Manager Cross-Region Issues: HTTP-01 Let’s Encrypt challenges failed for the passive region, since its ingress was not serving traffic. Switching to DNS-01 challenges resolved this, as validation happens via Route53 DNS records instead of HTTP requests to the passive ingress (issuer sketch after this list).
- Image Registry Latency: Pulling container images from a single us-east-1 ECR registry added 30 seconds to pod startup in eu-west-1. Enabling ECR cross-region replication to eu-west-1 cut image pull time by 60%.
- Infrastructure Drift: Terraform state drift between regions caused a 2-hour outage when a manual security group change in eu-west-1 blocked replication traffic. We added nightly drift detection with driftctl, which alerts the team to any unmanaged infrastructure changes.
- K8s 1.32 Scheduler Bug: We hit a known bug in the 1.32.0 kube-scheduler that incorrectly scheduled pods across regions when node selectors were not set. Upgrading to 1.32.2 resolved it, and we now pin K8s patch versions in our Ansible playbooks to avoid unexpected regressions (pinning sketch after this list).
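For reference, the DNS-01 issuer shape we switched to, using cert-manager's v1 API with the Route53 solver; the email, region, and hosted zone ID are placeholders, and IAM access is assumed to come from IRSA or instance roles:

```yaml
# clusterissuer-dns01.yaml - sketch; values are placeholders.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns01
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform@example.com
    privateKeySecretRef:
      name: letsencrypt-dns01-key
    solvers:
      - dns01:
          route53:
            region: us-east-1
            hostedZoneID: Z0000000EXAMPLE   # placeholder zone ID
```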
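And the patch-version pinning amounts to hoisting the version into a single variable that every playbook consumes. An illustrative excerpt, assuming a Debian-family apt install; the variable and file names are ours:

```yaml
# group_vars/all.yaml - one pinned version consumed everywhere
kubernetes_version: "1.32.2"

# roles/kubernetes/tasks/main.yaml (excerpt) - install exactly the pinned
# patch release rather than whatever the repo resolves to.
- name: Install pinned kubelet and kubeadm
  ansible.builtin.apt:
    name:
      - "kubelet={{ kubernetes_version }}-*"
      - "kubeadm={{ kubernetes_version }}-*"
    state: present
    allow_downgrade: true
```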
What’s Next?
We’re planning four major updates over the next 6 months:
- Upgrade to Kubernetes 1.33 once stable, to leverage GA sidecar containers and improved device plugin support.
- Pilot active-active architecture for stateless workloads, while keeping stateful workloads active-passive to minimize replication complexity.
- Add a third region (ap-southeast-1) to support APAC users, with eu-west-1 as the active region for EMEA and us-east-1 as active for Americas.
- Implement regular chaos engineering tests using Chaos Mesh, targeting the passive region to validate failover workflows without impacting production traffic (see the experiment sketch below).
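As a flavor of what those tests will look like, a minimal sketch using Chaos Mesh's PodChaos API; the namespace and labels are placeholders for passive-region workloads:

```yaml
# podchaos-passive.yaml - illustrative experiment; namespace and labels
# are placeholders.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: passive-region-pod-kill
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one                 # kill one randomly selected matching pod
  selector:
    namespaces:
      - payments
    labelSelectors:
      app: api
```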
Conclusion
Our multi-region active-passive K8s 1.32 cluster has met all our original goals, delivering reliable global service with robust DR capabilities. The 1.32 release has been stable and performant, and the active-passive model balances cost efficiency with HA requirements better than active-active for our workload profile. For teams building similar clusters, we recommend prioritizing dedicated interconnects, automating failover early, and leveraging K8s 1.32’s nftables mode for kube-proxy. The 6-month mark has validated our architecture, and we’re confident in scaling to additional regions and use cases in the coming year.