Postmortem: The 2026 Slack Outage Due to Istio 1.22 Circuit Breaker Misconfiguration
Executive Summary
On March 12, 2026, Slack experienced a global outage lasting 2 hours and 27 minutes, rendering core messaging and collaboration features unavailable to 18.2 million active users. The incident was traced to a misconfiguration in Istio 1.22's circuit breaker (outlier detection) settings for the chat-api service, which triggered mass ejection of healthy backend pods. This postmortem details the incident timeline, root cause, impact, remediation, and preventative measures taken to avoid recurrence.
Incident Timeline (All Times UTC)
- 09:14: Transient database connection error causes one chat-api pod to return a 503 response to a client request.
- 09:15: First user reports inability to send messages in the #general channel of a US-based enterprise tenant.
- 09:22: Internal monitoring alerts trigger for chat-api 5xx error rate spiking to 100% across all regions.
- 09:31: Incident commander paged, war room opened with representatives from platform, SRE, and product teams.
- 09:45: Root cause identified as Istio 1.22 circuit breaker misconfiguration ejecting all available chat-api pods.
- 09:52: Rollback of the updated Istio DestinationRule to the previous stable version initiated.
- 10:07: Partial service recovery: 40% of chat-api pods return to the load balancer rotation.
- 10:34: Chat-api success rate returns to 99.9%, core messaging functionality restored for all users.
- 11:12: File sharing, huddles, and workflow automation features fully restored after downstream service checks.
- 11:42: Incident declared resolved, all services operating within normal performance thresholds.
Root Cause Analysis
Slack's platform team had begun migrating core services to Istio 1.22 two weeks prior to the incident, leveraging the service mesh's updated traffic management and circuit breaking capabilities. The chat-api service, which handles all real-time messaging traffic, was configured via an Istio DestinationRule with outlier detection (circuit breaker) settings.
The updated DestinationRule included a critical misconfiguration in the outlier detection parameters, introduced during the Istio 1.22 migration (a sketch of the resulting configuration follows this list):
- `consecutive_5xx_errors` set to 1 (default: 5), meaning a single 5xx response from any pod triggered ejection.
- `max_ejection_percent` set to 100% (default: 10%), allowing all pods in the chat-api pool to be ejected simultaneously.
- `ejection_interval` set to 10 seconds (default: 30 seconds), shortening the window for healthy pod recovery.
- `base_ejection_time` set to 30 minutes (default: 30 seconds), keeping ejected pods out of rotation for an extended period.
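For illustration, the misconfigured traffic policy would have looked roughly like the sketch below. Field names here follow Istio's DestinationRule `OutlierDetection` API (the camelCase equivalents of the parameters listed above); the resource name, namespace, and host are assumptions for readability, not Slack's actual manifest.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: chat-api        # hypothetical resource name
  namespace: chat       # hypothetical namespace
spec:
  host: chat-api.chat.svc.cluster.local   # assumed service host
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 1    # a single 5xx from a pod triggers ejection
      interval: 10s              # outlier analysis sweep runs every 10 seconds
      baseEjectionTime: 30m      # ejected pods stay out of rotation for 30 minutes
      maxEjectionPercent: 100    # the entire pool may be ejected at once
```

With `maxEjectionPercent` left at a conservative value such as the 10% default, the 09:14 transient error could have ejected only a handful of pods rather than the entire pool.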
At 09:14 UTC, a single chat-api pod encountered a transient connection timeout to the PostgreSQL message store, returning a 503 error to a client request. The circuit breaker immediately ejected that pod. As remaining pods took on increased traffic, they began returning 5xx errors under load, and with max_ejection_percent set to 100%, all 42 chat-api pods were ejected within 90 seconds. This left zero healthy endpoints for the chat-api service, causing all client requests to fail.
Compounding the issue, the team had not configured alerts for Istio outlier detection ejection events, relying solely on end-to-end 5xx error rate alerts. This delayed root cause identification by 23 minutes, as the 5xx spike was initially attributed to a backend database outage.
Impact Assessment
The outage affected 18.2 million active Slack users across all regions, with the highest impact to enterprise customers in North America and Europe. Core functionality unavailable during the incident included:
- Sending and receiving channel and direct messages
- Thread replies and message reactions
- File uploads and shared canvas access
- Huddles (voice/video calls) and workflow automations
The 2-hour-27-minute outage reduced Slack's uptime for March to approximately 99.67%, breaching the monthly 99.95% SLA and triggering $2.3 million in automatic SLA credits for enterprise customers. No user data was lost or corrupted during the incident. Third-party integrations (e.g., Google Drive, Salesforce) that rely on chat-api webhooks also experienced failures, though these recovered automatically once chat-api was restored.
Remediation Steps
The war room team followed these steps to resolve the incident:
- Rolled back the Istio 1.22 DestinationRule for chat-api to the previous stable version (which used Istio 1.21-compatible outlier detection settings) at 09:52 UTC.
- Verified that chat-api pods were progressively re-added to the load balancer rotation beginning at 10:07 UTC, with 5xx error rates dropping below 0.1% by 10:34 UTC.
- Conducted a full configuration review of the updated Istio 1.22 DestinationRule to identify all misconfigured parameters.
- Adjusted outlier detection settings to safe defaults: `consecutive_5xx_errors: 5`, `max_ejection_percent: 10%`, `ejection_interval: 30s`, `base_ejection_time: 30s` (see the corrected configuration sketch after this list).
- Tested the updated configuration in the staging environment with simulated 5xx errors and load tests to confirm circuit breaker behavior.
- Deployed the corrected DestinationRule to production at 10:50 UTC, with no further service degradation.
- Conducted a full service health check across all Slack features to confirm restoration by 11:42 UTC.
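A minimal sketch of the corrected outlier detection block with the safe defaults listed above, using the same assumed manifest scaffolding and Istio's camelCase field names:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: chat-api        # hypothetical resource name
  namespace: chat       # hypothetical namespace
spec:
  host: chat-api.chat.svc.cluster.local   # assumed service host
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5    # require a streak of 5xx responses before ejecting
      interval: 30s              # longer analysis window between ejection sweeps
      baseEjectionTime: 30s      # ejected pods return to rotation quickly
      maxEjectionPercent: 10     # never remove more than 10% of the pool
```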
Preventative Measures
To prevent similar incidents in the future, the Slack platform team has implemented the following measures:
- Added pre-deployment validation for all Istio configuration changes, including circuit breaker parameter range checks and schema validation against Istio 1.22 best practices.
- Configured new alerts for Istio outlier detection ejection events, with thresholds set to trigger if the ejection rate exceeds 5% of the pod pool size (see the example alert rule after this list).
- Updated the staging environment to exactly mirror production pod counts and traffic patterns, with mandatory failure injection testing for all service mesh configuration changes.
- Published updated runbooks for Istio circuit breaker troubleshooting, including steps to quickly roll back DestinationRule changes and check ejection status via `istioctl`.
- Conducted a company-wide post-incident review, with learnings shared with all engineering teams to improve service mesh configuration practices.
- Implemented a 24-hour canary rollout for all Istio configuration changes, starting with 1% of pod traffic before full production deployment.
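To illustrate the new ejection alerting (second bullet above), a Prometheus rule along these lines could page when ejections exceed 5% of the chat-api pool. This is a sketch, not Slack's production rule: it assumes the sidecars' stats inclusion list has been extended to expose Envoy's `outlier_detection` statistics (so that `envoy_cluster_outlier_detection_ejections_active` is scraped), and that kube-state-metrics provides the ready-pod count; resource names and namespaces are hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: chat-api-outlier-ejections   # hypothetical rule name
  namespace: monitoring
spec:
  groups:
  - name: istio-outlier-detection
    rules:
    - alert: ChatApiOutlierEjectionHigh
      # Each client sidecar tracks its own ejections of the chat-api cluster,
      # so take the worst-case proxy and compare it to the ready pod count.
      expr: |
        max(envoy_cluster_outlier_detection_ejections_active{cluster_name=~".*chat-api.*"})
          /
        sum(kube_pod_status_ready{namespace="chat", pod=~"chat-api.*", condition="true"})
          > 0.05
      for: 1m
      labels:
        severity: page
      annotations:
        summary: "Outlier detection has ejected more than 5% of chat-api pods"
```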
Conclusion
The 2026 Slack outage was a preventable incident caused by an overly aggressive circuit breaker configuration and insufficient monitoring of service mesh events. The platform team has taken concrete steps to address the root cause and improve resilience for future Istio migrations. Slack remains committed to providing reliable collaboration tools and will continue to iterate on its service mesh practices to minimize downtime for all users.