Slack · Reliability · 17 May 2026
On June 30, 2021, a network link connecting one AWS availability zone failed — and Slack users felt it, despite Slack running in multiple availability zones. The postmortem question was brutal: why did a single AZ failure affect users at all? The answer drove 18 months of architecture work.
- 2021-06-30 AZ outage trigger
- 1.5 years migration time
- AZ drain in <5 minutes
- 99.99% SLA maintained
- Gray failure impact contained
- Stateless + stateful cell strategies
The Story
Slack runs most of its core infrastructure in the AWS us-east-1 region across multiple availability zones (AZs): isolated data centers within the same geographic region, each with independent power, cooling, and networking, designed so that a failure in one AZ does not affect the others. The cloud infrastructure guarantee is clear: AZs should provide failure isolation, and a problem in one AZ should not cascade to others. On June 30, 2021, at 11:45am PDT, an intermittent fault developed in a network link connecting one AZ to its neighbors. From a physical hardware perspective, this was an unremarkable incident: a flaky network link that was automatically removed from service at 12:33pm, restoring full connectivity 48 minutes after it first showed symptoms. What was remarkable was that Slack's users felt it at all.
We were led to wonder why, in fact, this outage was visible to our users at all. Slack operates a global, multi-regional edge network, but most of our core computational infrastructure resides in multiple Availability Zones within a single region, us-east-1.
— Slack Engineering, via the "Slack's Migration to a Cellular Architecture" blog post
The answer was gray failure (a failure mode where different components have different views of system availability — some servers see one AZ as fully available while servers in that AZ see the others as unavailable, creating an inconsistent state that is much harder to detect and respond to than a clean hard failure). When the network link became intermittent, Slack's systems within the impacted AZ believed they had full connectivity to everything inside that AZ. Systems outside the AZ saw it as unavailable. Even clients within the same AZ had inconsistent views depending on whether their specific network flow traversed the failed equipment. This partial, view-dependent failure was far harder to detect and respond to than a clean hard failure. No single alert could capture it. No automated remediation was precise enough to act on it. The answer, the team concluded, was not to solve automated remediation of gray failures — it was to make the computers' job easier by relying on human judgment.
THE BUTTON WE NEEDED
During the June 2021 incident, engineers monitoring the outage could see clearly on their dashboards that one AZ was the problem — nearly every graph segmented by target AZ told the same story. If there had been a button to tell all systems 'this AZ is bad; avoid it,' they would have pressed it immediately. So the goal became: build that button. Design requirements: drain an AZ within 5 minutes with no user-visible errors, operable from outside the affected AZ itself.
Problem
June 2021: AZ Outage Reaches Users
A network link connecting one AWS AZ to the others experienced intermittent faults for 48 minutes. Despite Slack running in multiple AZs, users experienced degraded service — because Slack's core infrastructure was monolithically distributed across AZs without AZ-aware traffic isolation. No single switch could route traffic away from the affected AZ.
Cause
Gray Failures Don't Respect AZ Boundaries
Gray failures (partial failures where different components have inconsistent views of system availability, making it impossible to detect or respond to them with simple binary health checks) are uniquely dangerous in multi-AZ architectures. When a network link is intermittently faulty rather than cleanly down, whether a request fails depends on which specific flow traverses the bad equipment. Automated health checks often cannot detect this: they pass most of the time and fail only occasionally, so the system appears healthy in aggregated metrics.
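To see why a binary health check can miss this kind of fault, here is a minimal sketch. The "three consecutive failures" threshold and the probe pattern are assumptions chosen for illustration, not Slack's actual health check configuration.

```python
from typing import List


def is_marked_unhealthy(probes: List[bool], consecutive_failures: int = 3) -> bool:
    """Return True if any run of failed probes reaches the threshold."""
    run = 0
    for ok in probes:
        run = 0 if ok else run + 1
        if run >= consecutive_failures:
            return True
    return False


def error_rate(probes: List[bool]) -> float:
    """Fraction of probes that failed."""
    return probes.count(False) / len(probes)


# Intermittent fault: one probe in four fails, but never three in a row.
gray = [True, True, True, False] * 25
# Hard failure: every probe fails.
hard = [False] * 100

print("gray:", is_marked_unhealthy(gray), error_rate(gray))
print("hard:", is_marked_unhealthy(hard), error_rate(hard))
```

In this toy setup the intermittent target drops a quarter of its probes yet is never marked unhealthy, while the hard failure is caught immediately.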
Solution
Cellular Architecture + AZ Drain Button
Slack spent 1.5 years migrating its most critical user-facing services to a cell-based architecture — with 3-4 independent instances of each service, one per AZ. An AZ drain button was built that, when pressed by an operator, reroutes all traffic away from the targeted AZ within 5 minutes. The drain mechanism was designed to operate from outside the affected AZ so it remains usable even when the AZ's own control plane is degraded.
Result
<5 Minutes to Safety
An AZ failure that previously meant 48 minutes of user impact can now be mitigated in under 5 minutes via the drain button. Against Slack's 99.99% availability SLA (less than 1 hour of downtime per year), a 5-minute mitigation is operationally viable; a 48-minute one is not. The cellular architecture also brought independent deployment, testing, and monitoring for each cell.
The cellular architecture migration forced Slack to make hard decisions about which services could be siloed cleanly and which could not. The key dividing line was statefulness. Stateless services, those that hold no long-lived data and process requests independently, are natural candidates for full siloing: run 3-4 independent instances, one per AZ, and route requests to the closest healthy instance. Stateful services, those that are the system of record for data, are harder. Distributing state across cells introduces consistency tradeoffs governed by the CAP theorem: a distributed data store can provide at most two of Consistency (all nodes return the same data), Availability (every request receives a response), and Partition tolerance (the system continues operating despite network partitions). Slack's team used CAP theorem analysis as a principled framework for categorizing each service and selecting the appropriate cell architecture for it.
🔴
Slack's AZ drain button has a design requirement that many engineers overlook: it must not rely on the impacted AZ to function. During large network outages, SSH-ing into servers in the affected AZ to make them 'lame-duck' themselves is unreliable. The drain mechanism works from outside — using control plane infrastructure that is deliberately hosted in AZs other than the one being drained.
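As a rough sketch of that requirement, the drain tooling can refuse to use any control-plane endpoint that lives in the AZ being drained. The endpoint names and the `control_plane_endpoints` helper below are hypothetical, not part of Slack's tooling.

```python
from typing import Dict, List


def control_plane_endpoints(endpoints_by_az: Dict[str, List[str]], target_az: str) -> List[str]:
    """Return control-plane endpoints hosted outside the AZ being drained."""
    usable = [
        endpoint
        for az, endpoints in sorted(endpoints_by_az.items())
        if az != target_az
        for endpoint in endpoints
    ]
    if not usable:
        raise RuntimeError(f"no control plane available outside {target_az}")
    return usable


# Hypothetical inventory: one control-plane endpoint per AZ.
endpoints = {
    "az-1": ["ctl-1.example.internal"],
    "az-2": ["ctl-2.example.internal"],
    "az-3": ["ctl-3.example.internal"],
}
print(control_plane_endpoints(endpoints, "az-2"))
```

Failing loudly when no outside endpoint exists is a deliberate choice in this sketch: a drain tool that silently falls back to the impacted AZ would defeat the purpose.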
ℹ️
Incremental Traffic Recovery: 1% at a Time
The drain capability is bidirectional — Slack can also gradually reintroduce traffic to a recovering AZ rather than dumping all traffic back at once. Starting at 1% and monitoring for errors before increasing gives engineers confidence that the recovery is real before exposing the full user base. This incremental re-introduction is as important as the drain itself: a hard full restore after an AZ incident often triggers a second cascade as cold caches and restarted services encounter sudden full-load.
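A small sketch of what such a ramp could look like. Only the 1% starting point comes from the description above; the doubling factor and the `ramp_schedule` helper are assumptions for illustration.

```python
from typing import List


def ramp_schedule(start_pct: int = 1, factor: int = 2) -> List[int]:
    """Traffic percentages to step through when restoring a drained AZ."""
    if start_pct <= 0 or factor <= 1:
        raise ValueError("start_pct must be positive and factor greater than 1")
    steps = []
    pct = start_pct
    while pct < 100:
        steps.append(pct)
        pct *= factor
    steps.append(100)  # always finish at full traffic
    return steps


print(ramp_schedule())
```

With these assumed defaults the AZ passes through eight checkpoints on the way back to full traffic, and each one is an opportunity to stop and drain again if error rates rise.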
📐
Why Not Automatic AZ Failover?
Automatic remediation of gray failures is technically hard because the signal is ambiguous — partial connectivity, inconsistent views, intermittent errors that don't trigger clean alert thresholds. Slack's architects chose to rely on human judgment for the drain decision while making the execution automated and fast. An operator who can see the dashboards and understand that 'one AZ is bad' is a more reliable detector than any automated system for this class of failure.
⚠️
The 99.99% SLA Math
Slack's 99.99% availability SLA means less than 1 hour of downtime per year is tolerable. The June 2021 AZ incident lasted 48 minutes — nearly the entire annual budget in a single event. A second incident of similar duration would breach the SLA. The cellular architecture and AZ drain button are not aspirational reliability improvements; they are the technical prerequisites for Slack to honor its contractual commitments to enterprise customers.
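The arithmetic behind that claim is short enough to check directly. This sketch assumes a 365-day year and treats the whole 48 minutes as counting against the budget, which is a simplification.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability)


budget = downtime_budget_minutes(0.9999)
incident_minutes = 48
share = incident_minutes / budget

print(f"annual budget: {budget:.2f} min")
print(f"share consumed by one 48-minute incident: {share:.0%}")
```

The budget works out to roughly 52.6 minutes a year, so a single 48-minute event uses about 91% of it.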
🏗️
Shipping Deep Changes Across Connected Services
One of the most underappreciated challenges of the cellular migration was coordinating changes across services that have live dependencies on each other. Converting Service A to a cellular model while Service B still calls it monolithically requires careful sequencing and temporary compatibility shims. Slack's engineering team wrote extensively about the 'ship the change' problem: making sweeping architectural changes to a live system without disrupting the engineers working on it daily.
WHAT 'GOOD ENOUGH' ENABLED
Slack's architecture team made an explicit decision to embrace a 'good enough' cellular model rather than pursuing perfect cell isolation. Some services couldn't be fully siloed without years of additional work. A cell with 80% AZ isolation that can be built in 6 months is more valuable than a perfectly isolated cell that requires 3 years. The pragmatic threshold — drain within 5 minutes with no user errors — guided every architecture decision.
The Fix
18 Months of Architecture Work for a 5-Minute Fix
The cellular architecture migration is notable not just for what it produced but for how long it took: 1.5 years of engineering effort across dozens of services, with careful sequencing to avoid disrupting a platform that millions of professionals depend on every day. The team decomposed the problem by service type, migrated services incrementally starting with those most amenable to siloing, and built the AZ drain infrastructure before migrating all services to depend on it. The project combined the operational discipline of a database migration with the architectural ambition of a complete infrastructure overhaul.
- <5 min — Time to drain all traffic from a failing AZ using the drain button — versus ~48 minutes of user impact in the June 2021 AZ incident that triggered this work
- 1.5 years — Duration of the cellular architecture migration — reflecting the complexity of safely rearchitecting infrastructure serving millions of daily active users
- 3–4 cells — Independent instances of each critical service — one per AZ — providing fault isolation so a single AZ failure affects at most 25–33% of requests before drain
- 99.99% — Slack's SLA availability target — less than 1 hour total downtime per year — the business requirement that made sub-5-minute AZ mitigation a hard engineering constraint
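The 25–33% figure follows from simple division, assuming traffic is spread evenly across cells (an assumption; real traffic is rarely perfectly balanced).

```python
def share_of_requests_per_cell(cell_count: int) -> float:
    """Fraction of traffic one cell carries when load is spread evenly."""
    if cell_count < 1:
        raise ValueError("cell_count must be at least 1")
    return 1 / cell_count


for cells in (3, 4):
    print(f"{cells} cells: {share_of_requests_per_cell(cells):.0%} of requests per cell")
```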
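```python
# Simplified AZ drain logic (conceptual)
# Real implementation uses load balancer weight APIs and health check manipulation.
# lb_api, consul_api, and the two monitoring hooks are illustrative placeholders.
ALL_AZS = ["az-1", "az-2", "az-3", "az-4"]  # placeholder AZ names


class AZDrainButton:
    def __init__(self, lb_api, consul_api, critical_services):
        self.lb_api = lb_api
        self.consul_api = consul_api
        self.critical_services = critical_services

    def _set_weight_everywhere(self, target_az: str, weight: float):
        # Uses the cloud provider API, which operates outside the AZ itself
        for service in self.critical_services:
            self.lb_api.set_weight(service=service, az=target_az, weight=weight)

    def drain_az(self, target_az: str):
        """Drain all traffic from target_az within 5 minutes.
        Operable from any other AZ; does not rely on target_az control plane."""
        # Step 1: Weight 0 means no new traffic; existing connections drain naturally
        self._set_weight_everywhere(target_az, 0)
        # Step 2: Update internal service discovery to prefer other AZs
        self.consul_api.set_az_preference(
            preferred_azs=[az for az in ALL_AZS if az != target_az],
            avoid_az=target_az,
        )
        # Step 3: Monitor drain progress; connections should complete within 5 min
        return self.monitor_drain_progress(target_az, timeout_minutes=5)

    def gradual_restore(self, target_az: str, start_pct: float = 0.01) -> bool:
        """Incrementally restore traffic to recovering AZ starting at 1%.
        Returns True once fully restored, False if the restore was aborted."""
        current_pct = start_pct
        while True:
            self._set_weight_everywhere(target_az, current_pct)
            if not self.error_rate_acceptable(target_az):
                self._set_weight_everywhere(target_az, 0)  # back to zero
                return False
            if current_pct >= 1.0:
                return True  # fully restored; stop instead of looping forever
            current_pct = min(current_pct * 2, 1.0)  # double until 100%

    def monitor_drain_progress(self, target_az: str, timeout_minutes: int):
        raise NotImplementedError("placeholder: poll connection counts until drained")

    def error_rate_acceptable(self, target_az: str) -> bool:
        raise NotImplementedError("placeholder: compare the AZ's error rate to a threshold")
```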
STATEFUL VS STATELESS CELL STRATEGY
Slack's cellular migration required a principled decision for each service: can this service be independently siloed per AZ? Stateless services (no persistent data) are straightforward — run 3-4 independent instances. Stateful services (system of record) require more nuance. Services optimizing for availability during a partition get AZ-isolated instances that can serve stale data. Services requiring consistency stay centralized with careful cross-AZ replication. CAP theorem is not an abstract thought experiment here — it is a deployment decision for each individual service.
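One way to picture that per-service decision is as a small classification function. The service names, the `needs_consistency` flag, and the strategy labels below are hypothetical, chosen only to mirror the three cases described above.

```python
from dataclasses import dataclass
from enum import Enum


class CellStrategy(Enum):
    SILOED_PER_AZ = "independent instance in every AZ"
    SILOED_STALE_READS = "AZ-isolated instances that may serve stale data"
    CENTRALIZED_REPLICATED = "centralized with cross-AZ replication"


@dataclass(frozen=True)
class Service:
    name: str
    stateful: bool
    needs_consistency: bool = False


def choose_strategy(service: Service) -> CellStrategy:
    """Map a service's state and consistency needs to a cell strategy."""
    if not service.stateful:
        return CellStrategy.SILOED_PER_AZ
    if service.needs_consistency:
        return CellStrategy.CENTRALIZED_REPLICATED
    return CellStrategy.SILOED_STALE_READS


# Hypothetical services, one per case.
for svc in (
    Service("web-frontend", stateful=False),
    Service("presence-cache", stateful=True),
    Service("ledger", stateful=True, needs_consistency=True),
):
    print(f"{svc.name}: {choose_strategy(svc).value}")
```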
✅
Independent Deployment Per Cell
An underappreciated benefit of cellular architecture is that each cell can be deployed and updated independently. A canary deployment that goes wrong in Cell A does not affect Cell B or Cell C. This dramatically reduces the blast radius of bad deploys — one of the leading causes of production incidents. Cellular architecture is not just a reliability pattern; it's a deployment safety pattern.
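The blast-radius argument can be sketched as a cell-by-cell rollout that stops at the first unhealthy cell. The `rollout` function and its `deploy` and `healthy` callbacks are illustrative stand-ins, not Slack's deploy tooling.

```python
from typing import Callable, List, Sequence


def rollout(
    cells: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy one cell at a time and stop at the first unhealthy cell.

    Returns the cells that received the new version.
    """
    deployed = []
    for cell in cells:
        deploy(cell)
        deployed.append(cell)
        if not healthy(cell):
            break  # remaining cells keep running the old version
    return deployed


touched = rollout(
    ["cell-a", "cell-b", "cell-c"],
    deploy=lambda cell: None,  # stand-in for a real deploy step
    healthy=lambda cell: cell != "cell-a",  # pretend the canary cell regresses
)
print(touched)
```

In this run only the first cell ever sees the bad build; the other two never leave the previous version.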
ℹ️
Testing the Drain Before You Need It
Slack's engineering team explicitly built tooling to test the AZ drain mechanism regularly — not just in staging but in production via controlled drains of individual services. This is chaos engineering applied to the mitigation tool itself: if the drain button is only tested during incidents, its failure modes will be discovered at the worst possible moment. The drain mechanism is exercised regularly to ensure it works when it matters.
🌐
Slack has a global multi-regional edge network that handles user connections near the user's geographic location. This edge layer is already highly distributed. The cellular architecture migration focused on Slack's core computational infrastructure in us-east-1, the tier where message storage, fanout, and business logic live. Fixing the core tier was the key to eliminating AZ-level blast radius for the majority of user-visible failures.
✅
The Independent Cell Deployment Dividend
Post-migration, Slack's oncall teams reported that bad deploys could be isolated to a single cell before being promoted to the full fleet. A canary that degrades performance in Cell A triggers an alert while Cells B and C continue healthy — giving the deploying team clear signal and a clean blast-radius boundary. Reliability and deployment safety turned out to be the same investment.
Architecture
Before the migration, Slack's core platform was a monolithic service topology with components distributed across AZs but not isolated within them. A request might be load-balanced to a webapp in AZ-1, which calls a backend service in AZ-3, which reads from a database in AZ-2. This cross-AZ traffic pattern meant that any AZ degradation could affect the latency of any request — the system had no natural blast-radius boundary at the AZ level. Post-migration, each cell contains a complete serving stack: its own webapp instances, its own backend service instances, and its own cache tier. Cross-cell traffic exists only for data that genuinely requires global consistency.
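The post-migration request path can be sketched as "pick one cell at the edge, then stay inside it". The hash-based cell selection, the tier names, and the `route_request` helper are assumptions for illustration; the article does not describe how Slack actually assigns requests to cells.

```python
import zlib
from typing import List, Sequence, Set, Tuple

TIERS = ("webapp", "backend", "cache")


def route_request(request_id: str, cells: Sequence[str], drained: Set[str]) -> List[Tuple[str, str]]:
    """Pick one non-drained cell and keep every hop of the request inside it."""
    candidates = [cell for cell in cells if cell not in drained]
    if not candidates:
        raise RuntimeError("every cell is drained")
    # crc32 gives a stable hash across runs, unlike Python's built-in hash()
    cell = candidates[zlib.crc32(request_id.encode()) % len(candidates)]
    return [(tier, cell) for tier in TIERS]


cells = ["az-1", "az-2", "az-3"]
for rid in ("req-1", "req-2", "req-3"):
    print(rid, route_request(rid, cells, drained={"az-2"}))
```

Because every hop shares the cell chosen at ingress, draining an AZ amounts to removing one entry from the candidate list rather than reconfiguring each tier separately.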
Before: Cross-AZ Monolithic Topology (Gray Failure Spreads Freely) [diagram; interactive version on TechLogStack]
After: Cellular Architecture with AZ Drain Capability [diagram; interactive version on TechLogStack]
THE CAP THEOREM AS A DEPLOYMENT GUIDE
Every service Slack migrated required a CAP tradeoff decision. Cooper Bethea's QCon talk summarizes it clearly: under a partition, a service can favor availability (serve potentially stale data from an isolated cell, keeping users active) or favor consistency (refuse to serve data if it can't verify it's current, protecting correctness). For Slack, the messaging delivery path chose availability: users can still send messages even if the cell is partitioned. Billing and auth services chose consistency: better to reject a request than to process it on stale data.
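A toy model of the two choices, showing only how a read behaves once a cell is cut off. `CellStore`, `PartitionPolicy`, and the sample values are hypothetical and much simpler than any real data store.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PartitionPolicy(Enum):
    AVAILABLE = "AP"   # keep answering, possibly with stale data
    CONSISTENT = "CP"  # refuse to answer rather than risk stale data


class Unavailable(Exception):
    """Raised when a consistent store cannot verify its data is current."""


@dataclass
class CellStore:
    policy: PartitionPolicy
    local_value: Optional[str] = None
    partitioned: bool = False

    def read(self) -> Optional[str]:
        if self.partitioned and self.policy is PartitionPolicy.CONSISTENT:
            raise Unavailable("cannot confirm the value is current")
        return self.local_value  # may be stale while partitioned


messages = CellStore(PartitionPolicy.AVAILABLE, local_value="last known message", partitioned=True)
billing = CellStore(PartitionPolicy.CONSISTENT, local_value="last known balance", partitioned=True)

print(messages.read())
try:
    billing.read()
except Unavailable as exc:
    print("billing rejected the read:", exc)
```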
ℹ️
Observability Must Be Cell-Scoped
One of the operational requirements that emerged from the cellular migration was cell-scoped monitoring dashboards. Aggregated metrics across cells can hide cell-specific degradation — if Cell B is struggling but Cells A and C are fine, a global average error rate might look acceptable. Each cell needs its own metrics, alerts, and dashboards so operators can detect and act on cell-level issues without noise from the healthy cells masking the signal.
📊
Per-Cell Metrics: The Observability Requirement
After cellularizing Slack's services, global aggregated metrics became misleading. A global p99 latency of 120ms might hide that Cell B has a p99 of 400ms while Cells A and C are at 90ms. Every cell now has its own dashboard, its own alerts, and its own error budget tracking. This per-cell observability investment was not optional — without it, the drain mechanism's input (operator judgment about which cell is degraded) would be unreliable.
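The masking effect is easy to reproduce with made-up numbers. In this sketch the request counts, error counts, and the 2% alert threshold are all invented for illustration.

```python
from typing import Dict, List, Tuple

# (requests, errors) per cell; numbers are made up for illustration.
counts: Dict[str, Tuple[int, int]] = {
    "cell-a": (10_000, 10),
    "cell-b": (10_000, 500),
    "cell-c": (10_000, 10),
}


def global_error_rate(counts: Dict[str, Tuple[int, int]]) -> float:
    """Error rate computed over all cells combined."""
    total_requests = sum(requests for requests, _ in counts.values())
    total_errors = sum(errors for _, errors in counts.values())
    return total_errors / total_requests


def cells_over_threshold(counts: Dict[str, Tuple[int, int]], threshold: float) -> List[str]:
    """Cells whose own error rate exceeds the threshold."""
    return sorted(
        cell for cell, (requests, errors) in counts.items() if errors / requests > threshold
    )


THRESHOLD = 0.02  # assumed alert threshold
print("global alert fires:", global_error_rate(counts) > THRESHOLD)
print("cells alerting:", cells_over_threshold(counts, THRESHOLD))
```

The combined error rate stays under the threshold even though one cell is failing 5% of its requests, which is exactly the signal an operator needs before pressing the drain button.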
Lessons
Slack's cellular architecture migration is a landmark case study in proactive reliability engineering: a team that experienced an incident, asked 'why did this affect users at all?', and then spent 18 months building the answer into the infrastructure.
- 01. Multi-AZ alone does not guarantee AZ-failure isolation. Running services in multiple AZs provides hardware redundancy but not traffic isolation if your services freely communicate cross-AZ. Build AZ-aware traffic routing and cell boundaries so that a failure in one AZ cannot affect requests served entirely by another AZ.
- 02. Gray failures (partial failures where different components have inconsistent views of availability) are best mitigated by human-triggered fast mitigation, not automated remediation. The ambiguous, view-dependent nature of gray failures makes reliable automated detection extremely difficult. Build a fast drain mechanism with a human in the loop — the goal is not autonomous failure response but human-triggered response that completes in minutes.
- 03. Design your incident mitigation tools to operate from outside the affected system. An AZ drain that requires SSH-ing into the affected AZ is useless when the AZ's network is degraded. Control-plane infrastructure for incident mitigation should be hosted in AZs that are deliberately different from the ones being managed.
- 04. Gradual traffic restoration is as important as fast draining. Restoring 100% of traffic instantly to a recovering AZ can trigger a second cascade as cold caches encounter sudden full load. Design your drain mechanism to be bidirectional: fast drain, slow restore starting at 1% with error-rate gating before each increase.
- 05. Cellular architecture is a deployment safety pattern in addition to a reliability pattern. Independent per-cell deployments bound the blast radius of bad code changes. When a canary goes wrong in one cell, the other cells continue serving users normally. This is a compounding benefit that makes every subsequent deploy safer than it would be in a monolithic topology.
THE COST OF CORRECTNESS
Some services at Slack could not be cleanly siloed because they require strong consistency across AZs — and consistency under partition is expensive. These services were migrated last, required the most architectural work, and ended up with more complex cell topologies (cross-cell replication, global coordination). This is the honest cost of CAP theorem reality: perfect AZ isolation is achievable for stateless services and very expensive for stateful ones. Acknowledge the cost early and budget accordingly.
⚠️
Metastable States Can Emerge in Cell-Based Systems Too
Cellular architecture reduces blast radius but does not eliminate the risk of metastable failures. A cell-level cascade — where a single cell degrades in a self-sustaining way — is still possible. The AZ drain mechanism helps by allowing operators to route traffic away from a degraded cell, but the metastable failure patterns described in Slack's 2-22-22 incident postmortem can still occur within an individual cell. Defense in depth requires both architectural isolation and operational practices for cascade detection and recovery.
Slack spent 18 months building a button so an operator could drain an entire data center in five minutes, which is either a lot of work for a button or exactly the right amount of work for that button.
TechLogStack — built at scale, broken in public, rebuilt by engineers
This case is a plain-English retelling of publicly available engineering material.