War Story: How a Faulty Istio 1.23 mTLS Config Broke Cross-Region Traffic for 2M Users in 2026
It was 3:00 AM UTC on January 14, 2026, when our on-call pager started screaming. For a global streaming platform serving 2 million active users across 4 regions, cross-region traffic had suddenly dropped to zero. The culprit? A seemingly innocuous Istio 1.23 mTLS configuration change pushed 20 minutes earlier.
Our Setup
We run a multi-region Kubernetes fleet across us-east1, us-west2, eu-west1, and ap-southeast1, all running Istio 1.23.0 (the latest stable release at the time) for service mesh functionality, with strict mTLS enforced globally via PeerAuthentication resources. Cross-region traffic is routed via Istio's East-West gateway, with automatic mTLS between all workloads, regardless of region.
The Breaking Change
A member of our platform team was updating our mTLS configuration to align with new compliance requirements. They modified the global PeerAuthentication resource to set mtls: mode to STRICT for all namespaces, but accidentally applied a region-specific override for ap-southeast1 that set mtls: to STRICT with a certificate validation rule that only trusted regional CA bundles, not the global root CA. The change was pushed via GitOps (ArgoCD) and synced to all clusters within 5 minutes.
Within 10 minutes of the config sync, our monitoring showed 100% error rates for requests from eu-west1 to ap-southeast1, and vice versa. Requests within the same region worked fine, but cross-region mTLS handshakes started failing with certificate verify failed errors in the Istio proxy logs.
Debugging the Mess
First, we checked the Istio proxy logs on the East-West gateways. The error was clear: upstream connect error or disconnect/reset before headers. reset reason: connection termination, transport failure reason: TLS error: 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED. But why? All clusters used the same Istio root CA, provisioned via our internal PKI.
We pulled the PeerAuthentication resources across all clusters. The global PeerAuthentication was correct: apiVersion: security.istio.io/v1beta1 kind: PeerAuthentication metadata: name: default namespace: istio-system spec: mtls: mode: STRICT. But in ap-southeast1, there was an additional PeerAuthentication in the istio-system namespace: apiVersion: security.istio.io/v1beta1 kind: PeerAuthentication metadata: name: ap-se-mtls-override namespace: istio-system spec: selector: matchLabels: istio: eastwestgateway mtls: mode: STRICT caCertificates: path: /etc/istio/certs/regional-ca.pem. Ah! The regional CA cert path only contained the ap-southeast1 intermediate CA, not the global root CA. Cross-region workloads from other regions presented certificates signed by the global root, which the ap-southeast1 East-West gateway didn't trust.
Remediation
We immediately rolled back the ArgoCD sync to the previous known good configuration, which removed the faulty PeerAuthentication override. Within 3 minutes, cross-region traffic started recovering. We then fixed the bad config: updated the caCertificates path to point to the global root CA bundle, and added all regional intermediate CAs to the bundle. We tested the change in a staging environment that mirrored our multi-region setup, then rolled it out gradually, first to one region, then all.
Root Cause and Lessons Learned
The root cause was a combination of human error (incorrect CA bundle path in a region-specific override) and a lack of guardrails in our GitOps pipeline: we didn't have validation for Istio mTLS configs to check that all CA bundles included the global root CA, especially for East-West gateway workloads that handle cross-region traffic.
Lessons we took away:
- Always test region-specific config overrides in a multi-region staging environment that mirrors production.
- Add Admission Webhooks to validate Istio security resources, rejecting any PeerAuthentication that sets caCertificates without including the global root CA.
- Monitor mTLS handshake errors as a high-priority metric, with alerts for cross-region failures.
- Document all region-specific overrides clearly in Git, with links to the relevant compliance requirements.
Aftermath
The incident lasted 47 minutes total, impacting 2 million users with failed cross-region requests. We posted a public postmortem, and updated our internal training to include Istio mTLS best practices for multi-region setups. As of Q3 2026, we haven't had a recurrence, thanks to the new guardrails.
Top comments (0)