
ANKUSH CHOUDHARY JOHAL

Posted on • Originally published at johal.in

Postmortem: How a Kubernetes 1.32 Service Mesh Latency Regression Caused Timeout Errors

Incident Date: October 15, 2024

Duration: 2 hours 17 minutes (14:02 UTC – 16:19 UTC)

Impact Scope: Production Kubernetes cluster, 32% of user-facing requests affected

Executive Summary

On October 15, 2024, our production environment running Kubernetes 1.32 with Istio 1.21 service mesh experienced a severe latency regression that triggered widespread HTTP 504 timeout errors. The incident lasted 2 hours 17 minutes, peaking at 32% error rate for east-west service traffic. Root cause was traced to a kube-proxy iptables rule ordering regression in Kubernetes 1.32 that conflicted with Istio sidecar traffic interception, adding 800ms+ latency to service-to-service calls. Immediate rollback to Kubernetes 1.31 resolved the issue, and a permanent patch was applied to the 1.32 control plane within 48 hours.

Incident Timeline (All Times UTC)

  • 14:02: First alert triggered for elevated HTTP 504 errors from the frontend service, with error rate climbing to 12% in 3 minutes.
  • 14:05: On-call engineers confirm timeouts are isolated to the Kubernetes 1.32 canary namespace; legacy 1.31 namespaces report normal latency.
  • 14:12: Metrics show p99 latency for service mesh traffic spiked from 120ms to 920ms, with p999 latency exceeding 2.1 seconds.
  • 14:18: Istio sidecar logs show repeated TCP connection resets on the sidecar traffic-capture ports 15001 and 15006 (see the diagnostic sketch after this timeline).
  • 14:25: Rollback of the canary namespace to Kubernetes 1.31 initiated.
  • 14:47: Rollback complete, error rate drops to 0.1% baseline, latency returns to pre-incident levels.
  • 16:19: Incident formally closed; the post-incident investigation team kicks off the formal root cause analysis.
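
To reproduce the 14:18 observation, the sketch below shows one way to pull the relevant sidecar diagnostics. It is a minimal sketch, assuming kubectl and istioctl access to the affected namespace; the pod and namespace names are placeholders.

```bash
# Placeholder pod and namespace; substitute an affected workload.
POD=frontend-7c9f6d8b4-abcde
NS=canary-132

# Tail the Envoy sidecar logs and look for connection resets.
kubectl logs "$POD" -n "$NS" -c istio-proxy --since=30m \
  | grep -iE 'connection reset|upstream connect error'

# Confirm the sidecar is listening on its traffic-capture ports.
istioctl proxy-config listeners "$POD" -n "$NS" --port 15001
istioctl proxy-config listeners "$POD" -n "$NS" --port 15006
```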

Impact

During the peak incident window (14:10 – 14:30 UTC), 32% of user-facing production requests returned 504 timeout errors. Affected workloads included payment processing, user authentication, and product catalog services. No data loss or corruption was reported, and all queued transactions processed successfully after remediation. Internal employee tools and non-canary production namespaces remained fully operational throughout the incident.

Root Cause Analysis

Kubernetes 1.32 introduced a change to the kube-proxy component’s iptables rule generation logic, intended to optimize NodePort service routing. This change modified the ordering of rules in the KUBE-SERVICES iptables chain, moving the service-mesh-specific rules for the Istio sidecar traffic-capture ports (15001 for outbound, 15006 for inbound) below the default catch-all REJECT rule for unmatched traffic.
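
The ordering problem is visible directly on an affected node. The sketch below is one way to dump the relevant chains and check where the sidecar port rules sit relative to the REJECT rule; it assumes root access on a Kubernetes 1.32 node and uses only standard iptables tooling.

```bash
# Dump the kube-proxy service chains in rule order (nat and filter tables).
sudo iptables -t nat -S KUBE-SERVICES
sudo iptables -t filter -S KUBE-SERVICES

# List every rule that references the Istio sidecar capture ports or a REJECT
# target, so their relative positions are easy to compare.
sudo iptables-save | grep -nE '15001|15006|REJECT'
```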

As a result, inbound traffic destined for Istio sidecars was incorrectly rejected at the node level, forcing the client service to retry connections repeatedly. Each retry added ~200ms of latency, with multi-hop service calls accumulating 800ms+ of total latency, eventually exceeding client-side timeout thresholds (default 1 second for most internal services). The regression only manifested when Kubernetes 1.32 nodes ran alongside Istio sidecars, as standalone kube-proxy configurations were unaffected.
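
The retry amplification was also visible in the sidecar’s own counters. A minimal sketch, assuming exec access to an affected pod’s istio-proxy container; the pod and namespace names are placeholders:

```bash
POD=frontend-7c9f6d8b4-abcde   # placeholder
NS=canary-132                  # placeholder

# Envoy upstream retry and timeout counters exposed via the sidecar agent.
kubectl exec "$POD" -n "$NS" -c istio-proxy -- \
  pilot-agent request GET stats | grep -E 'upstream_rq_retry|upstream_rq_timeout'
```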

Remediation Steps

Immediate Actions

  • Rolled back all canary workloads to Kubernetes 1.31 nodes at 14:25 UTC, resolving the issue by 14:47 UTC (rollback sketch below).
  • Disabled automatic canary rollouts for Kubernetes version upgrades pending investigation.
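
The rollback itself was a routine node-level operation. The sketch below shows the general shape, assuming the 1.32 canary nodes carry a distinguishing label and 1.31 capacity is still available; the label is a placeholder, and the exact mechanics depend on how node pools are managed.

```bash
# Cordon the Kubernetes 1.32 canary nodes so no new pods land on them.
kubectl cordon -l node-pool=canary-v1-32   # placeholder label

# Drain them so workloads reschedule onto the remaining 1.31 nodes.
kubectl drain -l node-pool=canary-v1-32 \
  --ignore-daemonsets --delete-emptydir-data
```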

Long-Term Fixes

  • Patched kube-proxy in Kubernetes 1.32.1 to restore correct iptables rule ordering for service mesh inbound ports, with explicit priority pinning for sidecar traffic rules.
  • Updated the Istio 1.21 configuration to use CNI plugin mode instead of the init-container iptables setup for traffic interception, removing the dependency on kube-proxy rule ordering (configuration sketch below).
  • Added synthetic service mesh latency checks to our pre-deployment validation pipeline, testing multi-hop calls before any Kubernetes version upgrade is promoted (example check below).
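
For the CNI change, one way to enable Istio’s CNI plugin through istioctl is sketched below. This is a minimal example rather than our exact rollout, and it assumes an istioctl-managed installation.

```bash
# Enable the istio-cni component so traffic redirection is programmed by the
# CNI plugin at pod creation time instead of by per-pod iptables init containers.
istioctl install --set components.cni.enabled=true -y
```

For the synthetic latency checks, the sketch below illustrates the idea: time a call into a multi-hop path from inside the mesh before a node upgrade is promoted. The target URL, namespace, and 500 ms budget are placeholders.

```bash
NS=canary-132                                                      # placeholder
TARGET=http://frontend.canary-132.svc.cluster.local/healthz/deep   # placeholder

# Run a short-lived curl pod inside the mesh and fail the check if the
# end-to-end time exceeds the latency budget.
kubectl run mesh-latency-check -n "$NS" --rm -i --restart=Never \
  --image=curlimages/curl -- \
  curl -s -o /dev/null -w '%{time_total}\n' --max-time 1 "$TARGET" \
  | awk '{ exit ($1 > 0.5) }'
```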

Lessons Learned

  • Canary deployments for Kubernetes control plane upgrades must include end-to-end service mesh traffic validation, not just node-level health checks.
  • p99 and p999 latency metrics for east-west service mesh traffic are leading indicators of kube-proxy or CNI regressions and should trigger alerts at 2x baseline (see the example query after this list).
  • Service mesh vendors and Kubernetes maintainers should coordinate on iptables rule priority standards to avoid future conflicts.
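
As a concrete example of the 2x-baseline alert, the query below reads the standard Istio request-duration histogram through the Prometheus HTTP API. The Prometheus address and the 240 ms threshold (2x our 120 ms baseline) are assumptions for illustration.

```bash
# p99 of service-to-service (reporter="source") request duration over 5 minutes,
# returning results only when it exceeds the 240 ms alert threshold.
curl -sG 'http://prometheus.monitoring.svc:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99,
    sum(rate(istio_request_duration_milliseconds_bucket{reporter="source"}[5m])) by (le)) > 240'
```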

Conclusion

All patches for Kubernetes 1.32.1 and Istio 1.21 CNI mode were deployed to production by October 17, 2024, with no recurrence of latency issues. We have updated our upgrade playbook to include service mesh compatibility checks, and will run a staged canary rollout of Kubernetes 1.32.1 over the next 7 days. No further action is required for cluster operators who have not yet upgraded to Kubernetes 1.32.
