TechLogStack

Posted on • Originally published at techlogstack.com on

Slack's Worst Day: When a Better Cache Manager Made Everything Worse

Slack · Reliability · 17 May 2026

On February 22, 2022, Slack went down for many users, including the engineer serving as Incident Commander, who later wrote the postmortem from first-hand experience. The culprit was a new component that worked exactly as designed.

  • Feb 22 2022 outage
  • Consul rollout to 75% of fleet
  • Cache hit rate collapsed
  • Vitess keyspace severely overloaded
  • Cascading failure: metastable state
  • Mcrib: faster = more dangerous

The Story

Slack experienced a major incident on February 22 this year, during which time many users were unable to connect to Slack, including the author — which certainly made my role as Incident Commander more challenging!

Laura Nolan, Senior Staff Engineer (via Slack Engineering Blog)

The 2-22-22 incident at Slack is one of the cleanest documented examples of a metastable failure (a failure mode described in distributed systems research where a system settles into a stable degraded state from which it cannot recover without external intervention, even after the original trigger has passed) in production systems. It did not require a bug. It did not require bad code. It required a Consul (a service discovery and service mesh tool used by Slack to maintain a dynamic registry of which servers are healthy and serving traffic) rollout to hit a tipping point during peak traffic, a faster-than-previous cache configuration manager to amplify the resulting churn, and a single inefficient database query to become a load-amplifying avalanche. Every component was working exactly as designed. The system, however, was not.

The Architecture: Caches, Consul, and Mcrib

Slack's serving architecture routes requests through a web application layer that uses Memcached (a high-performance in-memory key-value cache used by Slack to store frequently accessed data and avoid repeated database queries) as its primary caching tier. A component called Mcrouter routes cache requests using consistent hashing (a routing algorithm that maps each cache key to a specific server using a hash ring, so cache lookups are predictable and cache warmth is preserved during small topology changes). A control plane called Mcrib watches Consul (the service discovery system that tracks which Memcached nodes are healthy) and updates Mcrouter's configuration whenever nodes appear or disappear from the service catalog. When a Memcached node leaves the catalog — even temporarily, during a restart — Mcrib replaces it with a fresh, empty spare node. The new node's cache is cold. Requests that would have been cache hits on the old node now miss and hit the Vitess (Slack's horizontally sharded MySQL system) database instead. Under normal circumstances, this is fine: node restarts are infrequent, cache warm-up is fast, and the load spike is transient.
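To make the cold-spare effect concrete, here is a minimal sketch of a consistent-hash cache tier. It is not Mcrouter's or Mcrib's actual implementation: the `CacheTier` class, the slot count, the virtual-node count, and the key names are all assumptions chosen for illustration. It assumes a spare inherits the departed node's ring positions but starts empty, so every key that lived on that node becomes a miss.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class CacheTier:
    """Toy cache tier: a consistent-hash ring of slots, each backed by a dict."""

    def __init__(self, slots: int, vnodes: int = 64):
        # Each slot owns several ring points to even out key ownership.
        self._ring = sorted(
            (_hash(f"slot-{s}#{v}"), s) for s in range(slots) for v in range(vnodes)
        )
        self._points = [h for h, _ in self._ring]
        self.servers = {s: {} for s in range(slots)}  # slot -> in-memory store

    def _slot_for(self, key: str) -> int:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

    def get(self, key):
        return self.servers[self._slot_for(key)].get(key)

    def set(self, key, value):
        self.servers[self._slot_for(key)][key] = value

    def replace_with_spare(self, slot: int):
        # Assumption: the spare takes over the slot's ring positions, empty.
        self.servers[slot] = {}


def hit_rate(tier, keys):
    return sum(tier.get(k) is not None for k in keys) / len(keys)


keys = [f"gdm_list:user:{i}" for i in range(10_000)]
tier = CacheTier(slots=10)
for k in keys:
    tier.set(k, "cached")

warm = hit_rate(tier, keys)
tier.replace_with_spare(3)
after_one = hit_rate(tier, keys)
for slot in (4, 5):
    tier.replace_with_spare(slot)
after_three = hit_rate(tier, keys)
print(f"warm={warm:.2f} after_one={after_one:.2f} after_three={after_three:.2f}")
```

With ten slots, one replacement costs roughly a tenth of the hit rate, which the database can usually absorb. Several replacements in quick succession compound before any of the spares has warmed up.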

⚙️

The new Mcrib component was described as objectively better than its predecessor: it was faster and more efficient at detecting downed Memcached instances and replacing them with spare nodes. It did exactly what it was designed to do. That efficiency was precisely why the incident was so severe.

Problem

Consul Rollout Hits Tipping Point

Slack was running a percentage-based rollout (a deployment strategy that applies a change to a fixed percentage of hosts at a time, intended to allow controlled testing before full rollout) of the Consul agent binary. Two 25% rollout steps the prior week had completed without incident. The third 25% step on February 22 — bringing total upgraded hosts to 75% — hit peak traffic and entered a cascading failure. Engineers saw user tickets, internal errors, and alerts firing simultaneously at 6am Pacific.


Cause

Cache Emptying Cascade

When a Consul agent restarts on a Memcached node, it briefly deregisters the node from the service catalog. Mcrib — the new, faster control plane — detects this immediately and replaces the departing node with an empty spare. As the rollout processed 25% of the fleet sequentially, cache nodes were continuously being emptied and replaced. Cache hit rates dropped. Cache misses (requests where the data is not in cache, forcing a database query to serve the response) flooded Vitess (Slack's sharded MySQL layer), particularly one keyspace containing channel membership data.


Solution

Throttle, Optimize, Drain

Recovery required three simultaneous interventions. First, client boot throttling reduced the incoming request rate to give the cache time to warm. Second, the problematic GDM scatter query was optimized to only fetch missing data from Vitess instead of querying every shard. Third, engineers added Vitess replicas as read sources to distribute the database load. The system was in a metastable failure state — pausing the Consul rollout was not sufficient because the cascade was already self-sustaining.


Result

Service Restored, Architecture Hardened

Slack recovered after engineers intervened to break the cascade. Long-term fixes included modifying Mcrib's control loop to avoid rapid consecutive node replacements, changing the scatter query to read from a table sharded by user rather than by channel, and analyzing other high-volume queries backed by the cache tier for similar vulnerability.


THE METASTABLE STATE TRAP

Once Slack's system entered its cascading failure state, simply pausing the Consul rollout did not restore service. The system was already in a metastable state — a condition where the failure was self-sustaining: cache misses caused database load, database load caused slow responses, slow responses caused retries, retries caused more database load. The only exit was external intervention that changed the system state — throttling load or increasing capacity — not undoing the original trigger. This is the defining characteristic of metastable failures and the reason they are so dangerous.
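The feedback loop can be illustrated with a toy simulation. Everything in it is an assumption made for illustration, not a measurement from Slack: the capacity of 100 units per tick, the 90% retry fraction, and in particular the overload model, in which useful throughput falls as offered load rises because work is wasted on requests that time out.

```python
def simulate(demand_per_tick, capacity=100.0, retry_fraction=0.9):
    """Return goodput per tick for a backend whose useful work drops under overload."""
    retries = 0.0
    goodput = []
    for demand in demand_per_tick:
        offered = demand + retries
        if offered <= capacity:
            served = offered
        else:
            # Assumed overload model: effort is wasted on requests that will
            # time out, so useful throughput shrinks as offered load grows.
            served = capacity * capacity / offered
        # Most failed requests come back as retries on the next tick.
        retries = retry_fraction * (offered - served)
        goodput.append(served)
    return goodput


BASE, SPIKE, THROTTLED = 80.0, 180.0, 20.0

# Trigger only: 5 ticks of extra cache-miss load, then normal demand again.
trigger_only = [BASE] * 10 + [SPIKE] * 5 + [BASE] * 200
stuck = simulate(trigger_only)

# Same trigger, later followed by 40 ticks of throttled demand.
with_throttle = [BASE] * 10 + [SPIKE] * 5 + [BASE] * 50 + [THROTTLED] * 40 + [BASE] * 110
recovered = simulate(with_throttle)

print(f"trigger removed, no intervention: final goodput {stuck[-1]:.1f}")
print(f"after temporary throttle:         final goodput {recovered[-1]:.1f}")
```

In the first run, demand returns to its normal level after five ticks, yet goodput stays far below demand because retries alone keep the backend overloaded. In the second run, a temporary throttle lets the retry backlog drain, and the same normal demand is then served in full.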

The GDM (Group Direct Message) scatter query was the specific weakness that turned a cache degradation into a database overload. This query listed GDM conversations per user, and crucially, it queried every shard in the Vitess keyspace even when most shards contained no relevant data for that user. Under normal conditions, this query's results were cached with a long TTL because GDM membership is immutable — so cache hits were nearly universal and the scatter pattern was rarely exercised. When the cache was systematically emptied by the Mcrib replacements, the scatter query began executing on the database at full scale for the first time under real load. The keyspace became severely overloaded almost immediately.
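One of the mitigations described in this post was to fetch only the missing data from Vitess. A minimal sketch of that idea is a cache-aside multi-get that queries the database only for keys the cache lacks. The function name, the dict-based cache, and the `fake_db` loader are hypothetical stand-ins, not Slack's code.

```python
def get_many(cache, keys, load_from_db):
    """Cache-aside multi-get: query the database only for keys the cache lacks."""
    found = {k: cache[k] for k in keys if k in cache}
    missing = [k for k in keys if k not in found]
    if missing:
        loaded = load_from_db(missing)  # one batched query for the gaps only
        cache.update(loaded)            # refill so the next read is a hit
        found.update(loaded)
    return found


db_calls = []


def fake_db(missing_keys):
    # Stand-in for a real database read; records what was requested.
    db_calls.append(list(missing_keys))
    return {k: f"row-for-{k}" for k in missing_keys}


cache = {"gdm:1": "row-for-gdm:1", "gdm:2": "row-for-gdm:2"}
result = get_many(cache, ["gdm:1", "gdm:2", "gdm:3"], fake_db)
second = get_many(cache, ["gdm:1", "gdm:2", "gdm:3"], fake_db)
print(db_calls)
```

The database sees a single request for the one missing key, and the repeated read is served entirely from cache. During a partial cache loss, this keeps database load proportional to what was actually lost rather than to the size of the full result.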

Client Retries: The Amplifier

Client retries, designed to recover from transient failures, become load amplifiers during sustained overload. When the Slack client receives a failure or timeout, it doesn't know whether the system is experiencing a transient local hiccup or a global overload — so it retries. During the 2-22-22 incident, automated retries with exponential backoff significantly increased database load during the window when the system needed space to recover. Backoff and jitter help, but they cannot fully counteract retries from millions of clients all experiencing the same global overload simultaneously.
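For reference, a common form of backoff with jitter is "full jitter", sketched below. The base delay, the cap, and the client count are illustrative assumptions, not Slack's client settings. The arithmetic at the end shows why backoff alone cannot absorb a global overload: even when every client is waiting at the cap, the aggregate retry rate stays large.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=random):
    """Full-jitter backoff: attempt n waits a random time in [0, min(cap, base * 2**n)]."""
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]


delays = backoff_delays(8, rng=random.Random(42))
print([round(d, 2) for d in delays])

# Aggregate view: if every client is failing and already waiting at the cap,
# the mean wait is cap / 2, so the fleet still retries at clients / (cap / 2).
clients = 1_000_000            # illustrative count, not a Slack figure
mean_delay_at_cap = 30.0 / 2   # mean of uniform(0, 30)
retries_per_second = clients / mean_delay_at_cap
print(f"about {retries_per_second:,.0f} retries per second at the cap")
```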

😅

The Author Was Also Affected

A detail that makes this postmortem unusually human: Laura Nolan, who wrote the postmortem, was also the Incident Commander during the event — and was personally unable to connect to Slack for portions of it. She was managing a Slack outage using a platform that wasn't working, making incident coordination substantially harder. The note is a small reminder that incident commanders are humans using the same systems they're trying to fix.

ℹ️

Two Prior Steps Passed Without Incident

The February 22 rollout was the third of three 25% steps — the prior two, executed the previous week, completed without any issues. This is a critical detail for understanding why the incident was surprising: the rollout process was validated, the previous steps were clean, and there was no signal that the third step would behave differently. The failure was not a process failure — it was a scale threshold phenomenon that only manifested at 75% fleet coverage during peak traffic.
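One way to make a percentage-based rollout sensitive to this kind of threshold is to gate each step on live signals rather than on the success of earlier steps. The sketch below is hypothetical: the signal names, the thresholds, and the `may_start_next_step` function are assumptions for illustration and are not taken from Slack's rollout tooling.

```python
from dataclasses import dataclass


@dataclass
class FleetSignals:
    cache_hit_rate: float            # 0.0 to 1.0 over a recent window
    traffic_fraction_of_peak: float  # current request rate / typical daily peak
    db_utilization: float            # 0.0 to 1.0 for the busiest keyspace


def may_start_next_step(signals, min_hit_rate=0.95, max_traffic=0.7,
                        max_db_utilization=0.6):
    """Return (allowed, reasons). Thresholds are illustrative, not Slack's."""
    reasons = []
    if signals.cache_hit_rate < min_hit_rate:
        reasons.append("cache still warming from previous step")
    if signals.traffic_fraction_of_peak > max_traffic:
        reasons.append("too close to peak traffic")
    if signals.db_utilization > max_db_utilization:
        reasons.append("database has too little headroom")
    return (not reasons, reasons)


quiet = FleetSignals(cache_hit_rate=0.99, traffic_fraction_of_peak=0.4, db_utilization=0.3)
peak = FleetSignals(cache_hit_rate=0.93, traffic_fraction_of_peak=0.95, db_utilization=0.5)
print(may_start_next_step(quiet))
print(may_start_next_step(peak))
```

A gate like this would not have needed to predict the cascade. It only needs to refuse to start a cache-disturbing step while traffic is near peak or while the cache is still recovering from the previous step.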

⚠️

Why GDM Membership Was Particularly Vulnerable

The Group Direct Message membership data had a long cache TTL because GDM membership is immutable under Slack's current application requirements — once you're in a GDM, you stay in it. This long TTL meant the cache was almost always warm, the scatter query rarely executed on the database, and the latent scalability issue was never observed under normal conditions. The safest-feeling queries can hide the most dangerous database access patterns.


The Fix

Breaking the Cascade: Three Simultaneous Interventions

The critical insight of the 2-22-22 recovery was that the system was in a metastable state that could not be exited by simply reverting the original trigger. Stopping the Consul rollout was necessary but insufficient — the cache was already empty, the database was already overloaded, and client retries were already sustaining the load even with the rollout paused. The engineering team needed to change the system's state, not just stop what had changed it. This required reducing load from outside while simultaneously increasing the system's capacity to serve that load.

  • 3 — Simultaneous recovery interventions required: client throttling, query optimization, and adding Vitess replicas — none alone was sufficient
  • Metastable — Failure state — self-sustaining cascade where reverting the trigger does not restore service; requires active external intervention to change system state
  • 25% × 3 — Percentage-based rollout steps: two prior 25% steps passed without incident; the third hit a tipping point at peak traffic
  • GDM query — The specific scatter query that turned cache degradation into database overload — querying every Vitess shard even when most shards had no relevant data

# The problematic GDM (Group Direct Message) scatter query pattern
# (conceptual sketch: Slack uses Vitess with a MySQL dialect; the `db` client,
# the shard list, and the table names are illustrative, not Slack's real schema)

# BEFORE: the membership table is sharded by channel_id, so a lookup by
# user_id cannot be routed and must scatter to every shard in the keyspace
def get_gdm_list_old(db, all_shards, user_id):
    results = []
    for shard in all_shards:
        shard_results = db.query(
            "SELECT * FROM gdm_memberships WHERE user_id = ?",
            (user_id,),
            shard=shard,
        )  # expensive: O(shards) database round-trips per user
        results.extend(shard_results)
    return results

# AFTER: read from a table sharded by user_id, so Vitess can route the
# query to the single shard that owns this user's rows
def get_gdm_list_fixed(db, user_id):
    return db.query(
        "SELECT * FROM gdm_memberships_sharded_by_user WHERE user_id = ?",
        (user_id,),
    )  # one shard, one query

# Long-term: always verify cache-backed queries can survive cache-cold load

MCRIB'S EFFICIENCY PARADOX

The key lesson from Mcrib is architectural: a faster control loop for infrastructure changes can make a distributed system less safe, even if the control loop itself is correct. Mcrib was better than its predecessor at detecting and responding to Consul node departures, but that speed meant Memcached churn from the Consul rollout happened faster than the cache tier could recover. The fix was not to make Mcrib worse at detection, but to rate-limit consecutive node replacements, ensuring that the cache tier never loses more than a bounded fraction of its warmth at any moment.

Long-Term Architecture Hardening

After the incident, Slack made permanent structural changes: modifying Mcrib's control loop to prevent rapid consecutive cache node replacements, rewriting the GDM scatter query to target the correctly sharded table, auditing all high-volume cache-backed queries for similar scatter vulnerabilities, and analyzing whether a brief network partition affecting cache nodes could trigger the same cascade (it could — and changes were made to protect against that too). The incident became a forcing function for systematic resilience improvements.

⚠️

Testing Cascading Failure Recovery — Not Just Prevention

A question raised by this incident: do you test your system's ability to recover from metastable failure states, not just its ability to prevent them? Slack's game days had focused on preventing failures. But the 2-22-22 incident required active recovery from a state the prevention had failed to avoid. Testing the exit paths from failure is as important as testing the entry paths to it.

The client boot throttle was the most immediately effective intervention. Boot is the process by which a Slack client initializes its session state: fetching channel memberships, unread counts, and other data that the client caches locally. By artificially limiting the rate at which clients could complete boot, Slack reduced the volume of requests hitting the overloaded database tier. This bought time for the cache to begin refilling and for the query optimization to take effect. Throttling works in a metastable failure because it reduces the load sustaining the cascade: once load drops below the system's current degraded capacity, the system can begin recovering rather than staying pinned at the overload threshold.
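A throttle of this kind is often built as a token bucket. The sketch below is a generic token bucket, not Slack's throttle: the class, the rate of 100 boots per second, and the burst size are assumptions for illustration. Time is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Admit at most `rate_per_sec` operations per second, with a bounded burst."""

    def __init__(self, rate_per_sec, burst, now=0.0):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = now

    def allow(self, now):
        # Refill in proportion to elapsed time, never beyond the burst size.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should tell the client to wait and boot later


bucket = TokenBucket(rate_per_sec=100, burst=100)

# 1,000 clients try to boot at t=0, and another 1,000 one second later.
admitted_first = sum(bucket.allow(0.0) for _ in range(1000))
admitted_next = sum(bucket.allow(1.0) for _ in range(1000))
print(admitted_first, admitted_next)
```

However many clients arrive at once, the backend sees at most the configured rate. The rejected clients experience a delayed boot, which is the price of letting the cache refill.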

🔒

Rate Limiting the Mcrib Control Loop

The architectural fix to Mcrib was conceptually simple: rate-limit how many cache nodes can be replaced within a rolling time window. This prevents a coordinated wave of Consul agent restarts from emptying the entire cache tier simultaneously. The trade-off is that node replacement is slightly slower during a real failure — but this is an acceptable delay in exchange for the cache tier never losing more than a bounded fraction of its warmth at once.
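A rolling-window limit of this kind can be sketched in a few lines. This is not Mcrib's code: the `ReplacementLimiter` class, the limit of two replacements per ten minutes, and the choice to simply defer excess replacements are assumptions for illustration.

```python
from collections import deque


class ReplacementLimiter:
    """Allow at most `max_replacements` node replacements per rolling window."""

    def __init__(self, max_replacements, window_seconds):
        self.max_replacements = max_replacements
        self.window = window_seconds
        self._times = deque()  # timestamps of recent replacements, oldest first

    def try_replace(self, now):
        # Forget replacements that have aged out of the window.
        while self._times and now - self._times[0] >= self.window:
            self._times.popleft()
        if len(self._times) < self.max_replacements:
            self._times.append(now)
            return True
        return False  # defer: the node may re-register before a spare is needed


limiter = ReplacementLimiter(max_replacements=2, window_seconds=600)

# Four nodes deregister within 30 seconds, as in a rolling agent restart.
burst = [limiter.try_replace(t) for t in (0, 10, 20, 30)]
# Ten minutes later, the budget has recovered.
later = [limiter.try_replace(t) for t in (600, 610)]
print(burst, later)
```

Under this sketch, a node that deregisters only briefly during an agent restart is likely to come back with its cache intact before the limiter would allow a replacement, which is the behavior the rate limit is meant to encourage.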


Architecture

The 2-22-22 incident is best understood through the lens of how Slack's serving architecture handles data retrieval and cache topology. Requests to the webapp go through Mcrouter to Memcached first; only on a cache miss do they hit Vitess (Slack's sharded MySQL layer). Consul is the system that keeps Mcrouter and Mcrib informed about which Memcached nodes are alive. When Consul says a node left — even temporarily during a Consul agent restart — Mcrib responds by assigning a cold spare to replace it. The architecture was designed for resilience under normal node failures. It was not designed for a coordinated wave of Consul agent restarts that emptied cache at the exact moment peak traffic arrived.

Before: How a Consul Agent Restart Drains Cache (Single Node)

[Interactive diagram available on TechLogStack]

The Cascade: 25% of Fleet Draining Simultaneously at Peak Traffic

[Interactive diagram available on TechLogStack]

GRAY FAILURE VS HARD FAILURE

The Consul rollout itself was not a bug; it was a maintenance operation that had completed successfully twice before at 25% steps. The failure that emerged was a property of the whole system interacting across components: Consul's temporary deregistration behavior, Mcrib's fast replacement response, the cache's role in shielding the database, and the GDM query's implicit assumption of cache warmth. No single component was broken. The interaction between correct components under peak load at a specific scale threshold produced the cascade. This is precisely why cascading failures in distributed systems are so hard to prevent: they emerge from correctness, not from bugs.

ℹ️

Metastable Failures: The Academic Context

Laura Nolan's postmortem explicitly references the academic concept of metastable failures in distributed systems. The research paper 'Metastable Failures in Distributed Systems' (Bronson et al.) describes how systems can enter stable degraded states where removing the triggering condition does not restore normal operation. Slack's incident is a near-perfect real-world illustration: once the cascade began, it was self-sustaining. This framing matters because it changes how you think about recovery: you're not reverting a change, you're escaping a state.

The Fixed Architecture: Rate-Limited Replacements + Sharded Queries

After the incident, Slack made two structural changes to the architecture: Mcrib was modified to rate-limit consecutive node replacements, ensuring cache churn is bounded even during fleet-wide maintenance operations. The GDM membership query was rewritten to target a correctly sharded table, eliminating the scatter pattern. Together these changes make the system resilient to both Consul rollout-style churn and to cold cache conditions that might arise from other causes like network partitions affecting cache nodes.


Lessons

The 2-22-22 incident is cited by distributed systems researchers and practitioners because it is so precisely documented and because its lessons are universal. No bugs. No negligence. Just the emergent behavior of correct components interacting at scale in a state that the individual components couldn't see.

  1. A faster infrastructure component is not always safer. Mcrib was a better system than its predecessor, but its efficiency amplified cache churn during the Consul rollout in a way the slower predecessor would not have. Whenever you improve the speed of a control loop that modifies infrastructure state, audit whether that speed creates new failure modes under coordinated changes.
  2. Metastable failures (failure states in distributed systems that are self-sustaining even after the original trigger is removed, requiring active external intervention to exit) cannot be fixed by reverting the trigger. Your incident playbooks need explicit recovery procedures for 'we are in sustained overload even though the cause has been addressed': throttle incoming load, add capacity, or both. Waiting for the system to self-recover from a metastable state is not a strategy.
  3. Test your high-traffic cache-miss path before it is exercised under load. The GDM scatter query had never been exercised at full scale because the cache hit rate was near-perfect. When the cache emptied, a latent design flaw, querying all shards, became a severity-1 incident. Every high-frequency query protected by a cache should be load-tested against the scenario where the cache is cold.
  4. Percentage-based rollouts (deployment strategies that apply changes to a fixed fraction of infrastructure at a time to detect problems before full exposure) do not guarantee safety when the failure mode is a tipping-point cascade. The first two 25% Consul steps passed without incident; the third hit a threshold where the interaction between rollout churn and peak traffic became self-sustaining. Consider adding traffic-aware gates to rollout automation.
  5. Client retries are a double-edged sword. They are essential for recovering from transient failures but amplify load during sustained overload. Slack's clients used exponential backoff with jitter, the right design, but even well-designed retries from millions of clients contribute significantly to sustained overload. Design retry logic with a global-overload abort condition: if every retry attempt is failing, stop retrying.
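The global-overload abort condition from the last lesson can be sketched as a client-side retry budget: keep a short history of recent outcomes and stop retrying once most of them are failures. The `RetryBudget` class, its window size, and its threshold are hypothetical and are not taken from Slack's clients.

```python
from collections import deque


class RetryBudget:
    """Allow retries only while the recent failure ratio stays below a threshold."""

    def __init__(self, window=20, max_failure_ratio=0.5):
        self._outcomes = deque(maxlen=window)  # True = success, False = failure
        self.max_failure_ratio = max_failure_ratio

    def record(self, success):
        self._outcomes.append(success)

    def should_retry(self):
        if len(self._outcomes) < self._outcomes.maxlen:
            return True  # not enough evidence yet to call it an overload
        failures = self._outcomes.count(False)
        return failures / len(self._outcomes) <= self.max_failure_ratio


budget = RetryBudget(window=20, max_failure_ratio=0.5)

# Sustained overload: every recent request failed, so stop adding retries.
for _ in range(20):
    budget.record(False)
during_overload = budget.should_retry()

# Service recovering: successes push old failures out of the window.
for _ in range(15):
    budget.record(True)
after_recovery = budget.should_retry()
print(during_overload, after_recovery)
```

A real client would still need some low-rate probe while the budget is exhausted so that it can notice recovery. This sketch only shows the decision to stop amplifying load.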

HOW TO ESCAPE A METASTABLE STATE

Slack's recovery from 2-22-22 provides a practical playbook for escaping metastable failures: Step 1 — Reduce incoming load (client boot throttle) to give the system breathing room below its degraded capacity. Step 2 — Eliminate the load amplification source (fix the scatter query) so each cache miss is less expensive. Step 3 — Increase capacity (add Vitess replicas) so the system can handle the remaining load while recovering. These three moves — load reduction, amplification removal, capacity addition — are the universal toolkit for escaping cascading failure states.

ℹ️

The Reference: How Complex Systems Fail

Laura Nolan's postmortem references Richard Cook's 'How Complex Systems Fail' — a foundational text in systems reliability. Cook's work describes how complex systems are never fully safe, how accidents involve multiple contributing factors, and how practitioners become expert in working the system's defenses. The 2-22-22 incident is a near-perfect illustration: multiple correct components, a scale-dependent tipping point, and the system's defenses (retries, caching) becoming contributors to the failure mode.

Mcrib was the best cache manager Slack had ever built — it responded so fast to node departures that it helped bring down the entire platform, which is a kind of achievement.

TechLogStack — built at scale, broken in public, rebuilt by engineers


This case is a plain-English retelling of publicly available engineering material.

Read the full case on TechLogStack → (interactive diagrams, source links, and the full reader experience).
