TechLogStack

Posted on May 20 • Originally published at techlogstack.com on May 17

Slack Built a Big Red Button to Drain an Entire Data Center in Five Minutes

#devops #reliability #architecture #webdev

June 30, 2021 — a single AZ network link failure; users felt it despite Slack running in multiple AZs
48 minutes of user impact from one flaky network link — nearly Slack's entire annual SLA budget in one incident
<5 minutes — AZ drain time after the cellular architecture migration
1.5 years of engineering effort to build that 5-minute button
3–4 cells — independent instances of each critical service, one per AZ
99.99% SLA — less than 1 hour downtime per year; the business constraint that made 5-minute mitigation a hard requirement

On June 30, 2021, a network link connecting one AWS availability zone failed — and Slack users felt it, despite Slack running in multiple availability zones. The postmortem question was brutal: why did a single AZ failure affect users at all? The answer drove 18 months of architecture work and the construction of a very specific button.

The Story

We were led to wonder why, in fact, this outage was visible to our users at all. Slack operates a global, multi-regional edge network, but most of our core computational infrastructure resides in multiple Availability Zones within a single region, us-east-1.

— Slack Engineering, via the "Slack's Migration to a Cellular Architecture" blog post

Slack runs most of its core infrastructure in AWS us-east-1 across multiple availability zones (isolated data centres within the same geographic region, designed so a failure in one AZ does not affect others — each has independent power, cooling, and networking). On June 30, 2021, at 11:45am PDT, an intermittent fault developed in a network link connecting one AZ to its neighbours. The physical incident resolved at 12:33pm, 48 minutes after first symptoms. What was remarkable was that Slack's users felt it at all.

The answer was gray failure (a failure mode in which different components have different views of system availability: servers inside an AZ may see it as fully available while servers elsewhere see it as unreachable, creating inconsistent state that is much harder to detect and respond to than a clean hard failure). When the network link became intermittent, systems within the impacted AZ believed they had full connectivity, while systems outside saw the AZ as unavailable. Even clients within the same AZ had inconsistent views depending on whether their specific network flow traversed the failed equipment. No single alert could capture it. No automated remediation was precise enough to act on it. The team's conclusion was not to solve automated remediation of gray failures but to build a fast, human-triggered button.

The Button We Needed

During the June 2021 incident, engineers monitoring the outage could see clearly on their dashboards that one AZ was the problem — nearly every graph segmented by target AZ told the same story. If there had been a button to tell all systems "this AZ is bad; avoid it," they would have pressed it immediately. So the goal became: build that button. Design requirements: drain an AZ within 5 minutes with no user-visible errors, operable from outside the affected AZ itself.

Problem

June 2021: AZ Outage Reaches Users

A network link connecting one AWS AZ to the others experienced intermittent faults for 48 minutes. Despite Slack running in multiple AZs, users experienced degraded service — because Slack's core infrastructure was distributed across AZs without AZ-aware traffic isolation. No single switch could route traffic away from the affected AZ.

Cause

Gray Failures Don't Respect AZ Boundaries

Gray failures are uniquely dangerous in multi-AZ architectures. When a network link is intermittently faulty rather than completely failed, the failure depends on which specific flow traverses the bad equipment. Automated health checks often cannot detect this — they pass most of the time and fail occasionally, making the system appear healthy by aggregated metrics while individual users experience intermittent errors.
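
To make the health-check blind spot concrete, here is a minimal sketch. It assumes a checker that marks a target unhealthy only after three consecutive failed probes. The class name, the threshold, and the probe sequences are illustrative assumptions, not Slack's configuration.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveFailureCheck:
    """Hypothetical health check: unhealthy only after N failed probes in a row."""
    unhealthy_threshold: int = 3
    consecutive_failures: int = 0
    healthy: bool = True

    def observe(self, probe_ok: bool) -> bool:
        if probe_ok:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.unhealthy_threshold:
                self.healthy = False  # sticky for this sketch: no recovery logic
        return self.healthy


def summarize(probes: list[bool]) -> tuple[bool, float]:
    """Return (still marked healthy?, fraction of probes that failed)."""
    check = ConsecutiveFailureCheck()
    for ok in probes:
        check.observe(ok)
    return check.healthy, probes.count(False) / len(probes)


# Intermittent link: every fifth probe fails, never three in a row.
gray_probes = [i % 5 != 4 for i in range(100)]
# Hard failure: the link goes down and stays down.
hard_probes = [True] * 10 + [False] * 90

gray_result = summarize(gray_probes)
hard_result = summarize(hard_probes)
print("gray:", gray_result)
print("hard:", hard_result)
```

In this sketch the intermittent link fails one probe in five yet is never marked unhealthy, while the hard failure trips the check almost immediately. Users on the intermittent path would still see a steady stream of errors.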

Solution

Cellular Architecture + AZ Drain Button

Slack spent 1.5 years migrating its most critical user-facing services to a cell-based architecture — with 3-4 independent instances of each service, one per AZ. An AZ drain button was built that, when pressed by an operator, reroutes all traffic away from the targeted AZ within 5 minutes. The drain mechanism was designed to operate from outside the affected AZ so it remains usable even when the AZ's own control plane is degraded.

Result

<5 Minutes to Safety

An AZ failure that previously meant 48 minutes of user impact can now be mitigated in under 5 minutes via the drain button. Under Slack's 99.99% availability SLA (about 52.6 minutes of downtime per year), a 5-minute mitigation leaves most of the annual budget intact, while a single 48-minute incident consumes nearly all of it. The cellular architecture also brought independent deployment, testing, and monitoring for each cell.
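
The budget arithmetic behind that comparison can be checked with a short sketch. It assumes a 365-day year and treats the whole incident as counting against the budget; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability)


def budget_consumed(incident_minutes: float, availability: float = 0.9999) -> float:
    """Fraction of the annual downtime budget used by one incident."""
    return incident_minutes / downtime_budget_minutes(availability)


budget = downtime_budget_minutes(0.9999)
print(f"annual budget at 99.99%: {budget:.1f} min")
print(f"48-minute incident: {budget_consumed(48):.0%} of budget")
print(f"5-minute drain: {budget_consumed(5):.0%} of budget")
```

Under these assumptions a 48-minute incident uses roughly nine tenths of the yearly allowance, while a 5-minute drain uses roughly one tenth.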

The Fix

18 Months of Architecture Work for a 5-Minute Fix

The cellular architecture migration is notable not just for what it produced but for how long it took: 1.5 years of engineering effort across dozens of services, carefully sequenced to avoid disrupting a platform that millions of professionals depend on every day. The team decomposed the problem by service type, migrated services incrementally starting with those most amenable to siloing, and built the AZ drain infrastructure before migrating all services to depend on it.

<5 min — time to drain all traffic from a failing AZ; versus ~48 minutes of user impact in the June 2021 incident that triggered this work
1.5 years — duration of the cellular architecture migration
3–4 cells — independent instances of each critical service, one per AZ; assuming traffic is spread evenly, a single AZ failure affects roughly 25–33% of requests before drain (one quarter with four cells, one third with three)
99.99% — Slack's SLA target; less than 1 hour total downtime per year; the business requirement that made sub-5-minute AZ mitigation a hard engineering constraint

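# Simplified AZ drain logic (conceptual sketch, not Slack's production code)
# Key design requirement: operates from outside the affected AZ
# lb_api, consul_api and the two monitoring hooks are hypothetical interfaces.

import time

ALL_AZS = ["az-1", "az-2", "az-3"]  # illustrative AZ names


class AZDrainButton:
    def __init__(self, lb_api, consul_api, critical_services):
        self.lb_api = lb_api
        self.consul_api = consul_api
        self.critical_services = critical_services

    def drain_az(self, target_az: str):
        """Drain all traffic from target_az within 5 minutes.
        Uses cloud provider API; does NOT rely on target_az control plane."""

        # Step 1: Update load balancer weights to 0 for target_az
        # Existing connections drain naturally; no new traffic routed there
        self._set_weight_everywhere(target_az, 0.0)

        # Step 2: Update service discovery to prefer other AZs
        self.consul_api.set_az_preference(
            preferred_azs=[az for az in ALL_AZS if az != target_az],
            avoid_az=target_az
        )

        # Step 3: Monitor drain progress
        return self.monitor_drain_progress(target_az, timeout_minutes=5)

    def gradual_restore(self, target_az: str, start_pct: float = 0.01,
                        soak_seconds: float = 60.0) -> bool:
        """Incrementally restore traffic to recovering AZ starting at 1%.
        Monitor error rates before each increase; prevents second cascade
        from cold caches encountering sudden full load.
        Returns True once the AZ is back at 100%, False if the restore aborted."""
        current_pct = start_pct
        while True:
            self._set_weight_everywhere(target_az, current_pct)
            time.sleep(soak_seconds)  # let error rates settle at this level
            if not self.error_rate_acceptable(target_az):
                self._set_weight_everywhere(target_az, 0.0)  # back to zero
                return False
            if current_pct >= 1.0:
                return True  # full traffic restored and healthy
            current_pct = min(current_pct * 2, 1.0)  # double until 100%

    def _set_weight_everywhere(self, target_az: str, weight: float):
        # Apply the same weight for target_az on every critical service
        for service in self.critical_services:
            self.lb_api.set_weight(service=service, az=target_az, weight=weight)

    def monitor_drain_progress(self, target_az: str, timeout_minutes: int):
        raise NotImplementedError("wire up to your traffic metrics")

    def error_rate_acceptable(self, target_az: str) -> bool:
        raise NotImplementedError("wire up to your per-AZ error-rate alerts")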
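
The doubling ramp in gradual_restore implies a fixed sequence of traffic levels. As a small worked example, the sketch below lists those levels using integer percentages. The function name and the cap-at-100 behaviour are illustrative assumptions.

```python
def ramp_schedule(start_pct: int = 1, factor: int = 2) -> list[int]:
    """Traffic percentages for a gated restore: start low, multiply, cap at 100."""
    if start_pct < 1 or factor < 2:
        raise ValueError("start_pct must be >= 1 and factor must be >= 2")
    steps = []
    pct = start_pct
    while pct < 100:
        steps.append(pct)
        pct *= factor
    steps.append(100)  # the final step is always full traffic
    return steps


schedule = ramp_schedule()
print(schedule)
print(f"{len(schedule)} gated steps from 1% to 100%")
```

Starting at 1% and doubling reaches full traffic in eight steps, so the total restore time is about eight times whatever soak period is chosen between error-rate checks.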

Stateful vs Stateless Cell Strategy

The cellular migration required a principled decision for each service: can this service be independently siloed per AZ? Stateless services (no persistent data) are straightforward: run 3-4 independent instances, one per AZ. Stateful services require analysis under the CAP theorem (the theoretical result stating that a distributed data store can provide at most two of Consistency, Availability, and Partition tolerance). Services optimising for availability during a partition get AZ-isolated instances that can serve stale data. Services requiring consistency stay centralised with careful cross-AZ replication. Here the CAP theorem is not an abstract thought experiment; it is a deployment decision for each individual service.
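
The per-service decision described above can be sketched as a small classification function. The service names, the two boolean attributes, and the three placement labels are hypothetical simplifications of what would in practice be a case-by-case engineering review.

```python
from dataclasses import dataclass
from enum import Enum


class Placement(Enum):
    PER_AZ_CELL = "independent instance per AZ"
    PER_AZ_CELL_STALE_OK = "per-AZ instance, may serve stale data during a partition"
    CENTRALISED = "centralised, replicated across AZs"


@dataclass(frozen=True)
class Service:
    name: str
    stateful: bool
    needs_strong_consistency: bool = False


def choose_placement(service: Service) -> Placement:
    if not service.stateful:
        return Placement.PER_AZ_CELL           # nothing to keep in sync
    if service.needs_strong_consistency:
        return Placement.CENTRALISED           # consistency over availability
    return Placement.PER_AZ_CELL_STALE_OK      # availability over consistency


# Hypothetical services, one for each branch
services = [
    Service("webapp", stateful=False),
    Service("presence-cache", stateful=True),
    Service("billing-ledger", stateful=True, needs_strong_consistency=True),
]
placements = {s.name: choose_placement(s) for s in services}
for name, placement in placements.items():
    print(f"{name}: {placement.value}")
```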

Why not automatic AZ failover?

Automatic remediation of gray failures is technically hard because the signal is ambiguous — partial connectivity, inconsistent views, intermittent errors that don't trigger clean alert thresholds. Slack's architects chose to rely on human judgment for the drain decision while making the execution automated and fast. An operator who can see the dashboards and understand that "one AZ is bad" is a more reliable detector than any automated system for this class of failure. The goal is not autonomous failure response but human-triggered response that completes in minutes.

Independent deployment per cell: the deployment safety dividend

An underappreciated benefit of cellular architecture is that each cell can be deployed and updated independently. A canary deployment that goes wrong in Cell A does not affect Cell B or Cell C. This dramatically reduces the blast radius of bad deploys — one of the leading causes of production incidents. Post-migration, Slack's oncall teams reported that bad deploys could be isolated to a single cell before being promoted to the full fleet. Reliability and deployment safety turned out to be the same investment.
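
A cell-by-cell rollout can be sketched as a loop that stops at the first unhealthy cell. The callbacks and the in-memory version table below are illustrative stand-ins; this is not a description of Slack's deploy tooling.

```python
from typing import Callable


def rollout_by_cell(
    cells: list[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> list[str]:
    """Deploy one cell at a time; on the first failed check, roll that cell
    back and stop. Returns the cells left running the new version."""
    promoted = []
    for cell in cells:
        deploy(cell)
        if not is_healthy(cell):
            rollback(cell)
            break
        promoted.append(cell)
    return promoted


# In-memory stand-ins for a real deploy system
versions = {"cell-a": "v1", "cell-b": "v1", "cell-c": "v1"}
touched = []


def deploy(cell: str) -> None:
    touched.append(cell)
    versions[cell] = "v2"


def rollback(cell: str) -> None:
    versions[cell] = "v1"


def is_healthy(cell: str) -> bool:
    return versions[cell] != "v2"  # pretend v2 is a bad build


promoted = rollout_by_cell(list(versions), deploy, is_healthy, rollback)
print("promoted:", promoted, "touched:", touched, "versions:", versions)
```

In this run the bad build only ever reaches cell-a, which is rolled back; cell-b and cell-c are never touched.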

Testing the drain before you need it

Slack's engineering team explicitly built tooling to test the AZ drain mechanism regularly — not just in staging but in production via controlled drains of individual services. If the drain button is only tested during incidents, its failure modes will be discovered at the worst possible moment. The drain mechanism is exercised regularly to ensure it works when it matters. This is chaos engineering applied to the mitigation tool itself.

Architecture

Before the migration, Slack's core platform was a monolithic service topology with components distributed across AZs but not isolated within them. A request might be load-balanced to a webapp in AZ-1, which calls a backend service in AZ-3, which reads from a database in AZ-2, so degradation in any one AZ could affect the latency of any request. Post-migration, each cell contains a complete serving stack: its own webapp instances, its own backend service instances, and its own cache tier. Cross-cell traffic exists only for data that genuinely requires global consistency.
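
The routing rule implied by the post-migration design (stay inside your own AZ, never send traffic to a drained AZ) can be sketched as follows. The Backend type, host names, and the fallback behaviour are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    host: str
    az: str


def pick_backends(
    caller_az: str, backends: list[Backend], drained_azs: set[str]
) -> list[Backend]:
    """Prefer backends in the caller's own AZ and never return a drained AZ.

    Falls back to other non-drained AZs only when no local backend is usable."""
    available = [b for b in backends if b.az not in drained_azs]
    local = [b for b in available if b.az == caller_az]
    return local or available


# Hypothetical fleet: one backend per AZ
fleet = [
    Backend("svc-1.internal", "az-1"),
    Backend("svc-2.internal", "az-2"),
    Backend("svc-3.internal", "az-3"),
]

normal = pick_backends("az-1", fleet, drained_azs=set())
during_drain = pick_backends("az-1", fleet, drained_azs={"az-1"})
print([b.az for b in normal])
print([b.az for b in during_drain])
```

With no drain in effect a caller in az-1 only ever talks to az-1, so trouble in az-2 or az-3 cannot touch its requests. Once az-1 is drained, the same call returns only backends in the remaining AZs.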

Before: Cross-AZ Monolithic Topology (Gray Failure Spreads Freely)

(Diagram not reproduced here; an interactive version is available on TechLogStack.)

After: Cellular Architecture with AZ Drain Capability

(Diagram not reproduced here; an interactive version is available on TechLogStack.)

Per-Cell Observability: The Required Investment

After cellularising Slack's services, global aggregated metrics became misleading. A global p99 latency of 120ms might hide that Cell B has a p99 of 400ms while Cells A and C are at 90ms. Every cell now has its own dashboard, its own alerts, and its own error budget tracking. This per-cell observability investment was not optional — without it, the drain mechanism's input (operator judgment about which cell is degraded) would be unreliable. The button is only as good as the signal telling you which button to press.
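
The masking effect is easy to reproduce with synthetic data. The sketch below uses a nearest-rank percentile and made-up latency samples; the specific numbers are chosen for illustration and differ from the figures quoted above.

```python
def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) in integer math
    return ordered[rank - 1]


# Synthetic latencies in ms: cell-b has a slow tail, the others do not
latencies_ms = {
    "cell-a": [90] * 100,
    "cell-b": [100] * 98 + [400] * 2,
    "cell-c": [90] * 100,
}

all_samples = [ms for cell in latencies_ms.values() for ms in cell]
global_p99 = percentile(all_samples, 99)
per_cell_p99 = {cell: percentile(ms, 99) for cell, ms in latencies_ms.items()}

print("global p99:", global_p99)
print("per-cell p99:", per_cell_p99)
```

Here the fleet-wide p99 looks unremarkable while cell-b's own p99 is several times higher, which is the signal an operator needs before choosing which AZ to drain.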

Lessons

Multi-AZ alone does not guarantee AZ-failure isolation. Running services in multiple AZs provides hardware redundancy but not traffic isolation if your services freely communicate cross-AZ. Build AZ-aware traffic routing and cell boundaries so a failure in one AZ cannot affect requests served entirely by another AZ.
Gray failures (partial failures where different components have inconsistent views of availability) are best mitigated by human-triggered fast mitigation, not automated remediation. The ambiguous, view-dependent nature of gray failures makes reliable automated detection extremely difficult. Build a fast drain mechanism with a human in the loop — the goal is human-triggered response that completes in minutes, not autonomous failure response.
Design your incident mitigation tools to operate from outside the affected system. An AZ drain that requires SSH-ing into the affected AZ is useless when the AZ's network is degraded. Control-plane infrastructure for incident mitigation should be deliberately hosted in AZs other than the ones being managed.
Gradual traffic restoration is as important as fast draining. Restoring 100% of traffic instantly to a recovering AZ can trigger a second cascade as cold caches encounter sudden full load. Design your drain mechanism to be bidirectional: fast drain, slow restore starting at 1% with error-rate gating before each increase.
Cellular architecture is a deployment safety pattern in addition to a reliability pattern. Independent per-cell deployments bound the blast radius of bad code changes. When a canary goes wrong in one cell, the other cells continue serving users normally. This is a compounding benefit that makes every subsequent deploy safer than it would be in a monolithic topology.

Engineering Glossary

Availability Zone (AZ) — an isolated, physically separate data centre within an AWS region, designed to be independent of failures in other zones. Each AZ has independent power, cooling, and networking. Multi-AZ architecture provides hardware redundancy but not traffic isolation without explicit cell boundaries.

CAP theorem — the theoretical result stating that a distributed data store can provide at most two of: Consistency (all nodes return the same data), Availability (every request receives a response), and Partition tolerance (the system continues operating despite network partitions). Used by Slack as a principled framework for deciding which services should optimise for availability vs consistency in their cell architecture.

Cellular architecture — a system design where a service is split into multiple independent instances (cells), typically one per availability zone, each capable of serving its full workload independently. Enables AZ-level blast radius containment and independent per-cell deployment.

Gray failure — a failure mode where different components have different views of system availability — some servers see one AZ as fully available while servers in that AZ see the others as unavailable. Creates inconsistent state that is much harder to detect and respond to than a clean hard failure.

AZ drain — an operational mechanism that routes all traffic away from a targeted availability zone within a defined time window (5 minutes in Slack's case). Operable from outside the affected AZ so it remains usable even when the AZ's own control plane is degraded.

Gradual traffic restoration — the practice of reintroducing traffic to a recovering AZ incrementally, starting at 1% and monitoring error rates before each increase. Prevents a second cascade from cold caches encountering sudden full load immediately after drain recovery.

Metastable failure — a failure pattern where a system enters a self-sustaining degraded state that persists even after the original trigger is removed. Cellular architecture reduces but does not eliminate this risk — cascade patterns can still occur within an individual cell.

99.99% SLA — a service level agreement guaranteeing less than 52.6 minutes of downtime per year (~1 hour). A 48-minute AZ incident consumes nearly the entire annual budget in a single event — the business constraint that made sub-5-minute AZ mitigation a hard engineering requirement, not an aspirational improvement.

This case is a plain-English retelling of publicly available engineering material.

TechLogStack — built at scale, broken in public, rebuilt by engineers.
