TechLogStack

Posted on May 20 • Originally published at techlogstack.com on May 17

Slack's Worst Day: When a Better Cache Manager Made Everything Worse

#devops #reliability #backend #webdev

February 22, 2022 — Slack went down; the Incident Commander was also personally unable to connect
75% of fleet — the Consul rollout percentage that hit a tipping point during peak traffic
Mcrib — the new, faster cache manager that worked exactly as designed and made things worse
GDM scatter query — the specific inefficient query that turned cache degradation into database overload
Metastable failure — self-sustaining cascade; reverting the Consul rollout was necessary but not sufficient
3 simultaneous interventions required to escape: client throttling, query optimisation, Vitess replica addition

On February 22, 2022, Slack went down for many users, including the engineer serving as Incident Commander, who later wrote the postmortem from first-hand experience. The culprit was a new component that worked exactly as designed.

The Story

Slack experienced a major incident on February 22 this year, during which time many users were unable to connect to Slack, including the author — which certainly made my role as Incident Commander more challenging!

— Laura Nolan, Senior Staff Engineer, via Slack Engineering Blog

The 2-22-22 incident at Slack is one of the cleanest documented examples of a metastable failure in a production system: a failure mode in which a system settles into a stable degraded state that it cannot leave without external intervention, even after the original trigger has passed. It did not require a bug. It did not require bad code. It required three things: a rollout of Consul (the service discovery and service mesh tool Slack uses to maintain a dynamic registry of which servers are healthy and serving traffic) that hit a tipping point during peak traffic, a faster cache configuration manager that amplified the resulting churn, and a single inefficient database query that turned into a load-amplifying avalanche. Every component was working exactly as designed. The system was not.

The Architecture: Caches, Consul, and Mcrib

Slack's serving architecture uses Memcached (a high-performance in-memory key-value cache) as its primary caching tier. A component called Mcrib watches Consul and updates cache routing whenever nodes appear or disappear from the service catalog. When a Memcached node leaves the catalog — even temporarily, during a restart — Mcrib replaces it with a fresh, empty spare node. The new node's cache is cold. Requests that would have been cache hits now miss and hit the Vitess database instead. Under normal circumstances this is fine: node restarts are infrequent, cache warm-up is fast, and the load spike is transient. The new Mcrib was faster and more efficient at detecting downed nodes and replacing them. That efficiency was precisely why the incident was so severe.
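
The read path described above can be sketched as a cache-aside lookup. This is a minimal illustration rather than Slack's code: `CacheNode`, `read_through`, and `load_from_db` are hypothetical names, and a plain dict stands in for a Memcached node.

```python
from typing import Callable, Dict


class CacheNode:
    """Stand-in for one Memcached node; a fresh instance is a cold cache."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}


def read_through(node: CacheNode, key: str, load_from_db: Callable[[str], str]) -> str:
    """Cache-aside read: serve from cache, fall back to the database on a miss."""
    if key in node.data:
        return node.data[key]
    value = load_from_db(key)  # a miss costs one database query
    node.data[key] = value     # warm the cache for the next reader
    return value


db_queries = 0


def load_from_db(key: str) -> str:
    """Hypothetical database read that counts how often it is called."""
    global db_queries
    db_queries += 1
    return f"value-for-{key}"


node = CacheNode()
for _ in range(3):
    read_through(node, "user:1", load_from_db)
warm_queries = db_queries  # only the first of three reads reached the database

node = CacheNode()  # the node is replaced by an empty spare
read_through(node, "user:1", load_from_db)
queries_after_replacement = db_queries  # the same key misses again
```

Replacing a node is cheap for the cache tier and expensive for the database: every key the old node held has to be fetched again on its next read.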

Problem

Consul Rollout Hits Tipping Point

Slack was running a percentage-based rollout of the Consul agent binary, a deployment strategy that applies a change to a fixed percentage of hosts at a time so that problems surface before the change reaches the whole fleet. Two 25% steps the prior week had completed without incident. The third 25% step on February 22, which brought the total of upgraded hosts to 75%, coincided with peak traffic and set off a cascading failure.

Cause

Cache Emptying Cascade

When a Consul agent restarts on a Memcached node, it briefly deregisters the node from the service catalog. Mcrib — the new, faster control plane — detects this immediately and replaces the departing node with an empty spare. As the rollout processed 25% of the fleet sequentially, cache nodes were continuously being emptied and replaced. Cache hit rates dropped. Cache misses (requests where the data is not in cache, forcing a database query to serve the response) flooded Vitess — particularly one keyspace containing channel membership data — through a scatter query that hit every shard even when most shards had no relevant data.
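
A little arithmetic shows why a modest drop in hit rate matters so much. The sketch below uses invented numbers (100,000 requests per second, 16 shards); none of them come from Slack's postmortem, and `db_queries_per_second` is a hypothetical helper.

```python
def db_queries_per_second(request_qps: float, hit_rate: float, shards_per_miss: int = 1) -> float:
    """Database queries generated by cache misses, including scatter fan-out."""
    miss_qps = request_qps * (1.0 - hit_rate)
    return miss_qps * shards_per_miss


# Illustrative numbers only.
warm = db_queries_per_second(100_000, hit_rate=0.99)
cold = db_queries_per_second(100_000, hit_rate=0.90)
cold_scatter = db_queries_per_second(100_000, hit_rate=0.90, shards_per_miss=16)

print(f"warm: {warm:,.0f} qps, cold: {cold:,.0f} qps, cold + scatter: {cold_scatter:,.0f} qps")
```

Under these assumptions, a hit rate falling from 99% to 90% multiplies database load by roughly ten, and a query that fans out to every shard multiplies it again by the shard count.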

Solution

Throttle, Optimise, Add Capacity

Recovery required three simultaneous interventions: client boot throttling to reduce incoming request rate, optimising the GDM scatter query to only fetch missing data from Vitess, and adding Vitess replicas as read sources to distribute database load. The system was in a metastable failure state — pausing the Consul rollout was necessary but not sufficient to restore service.

Result

Service Restored, Architecture Hardened

Slack recovered after engineers intervened to break the cascade. Long-term fixes included modifying Mcrib's control loop to avoid rapid consecutive node replacements, rewriting the scatter query to target a correctly sharded table, and auditing all high-volume cache-backed queries for similar vulnerability.

The Fix

Breaking the Cascade: Three Simultaneous Interventions

The critical insight of the 2-22-22 recovery was that the system was in a metastable state that could not be exited by simply reverting the original trigger. Stopping the Consul rollout was necessary but insufficient — the cache was already empty, the database was already overloaded, and client retries were already sustaining the load. The engineering team needed to change the system's state, not just stop what had changed it. This required reducing load from outside while simultaneously increasing the system's capacity to serve that load.
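
The source does not describe how Slack's client boot throttling is implemented. One common way to cap an incoming request rate is a token bucket, sketched here as an assumption; `TokenBucket` and its parameters are hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```python
class TokenBucket:
    """Admit at most `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the bucket size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the caller should ask the client to wait and try later


bucket = TokenBucket(rate=2.0, capacity=2.0)
burst = [bucket.allow(now=0.0) for _ in range(5)]  # five client boots arrive at once
later = bucket.allow(now=1.0)                      # one more, a second later
```

Here only the first two boots in the burst are admitted; the rest are turned away until the bucket refills, which keeps the load reaching the database bounded while it recovers.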

3 — simultaneous recovery interventions required: client throttling, query optimisation, and adding Vitess replicas — none alone was sufficient
Metastable — failure state; self-sustaining cascade where reverting the trigger does not restore service
25% × 3 — percentage-based rollout steps: two prior steps passed without incident; the third hit a tipping point at peak traffic
GDM scatter — the specific query that turned cache degradation into database overload; queried every Vitess shard even when most had no relevant data

# The problematic GDM (Group Direct Message) scatter query pattern
# (conceptual sketch: Slack uses Vitess with a MySQL dialect; `db`,
# ALL_VITESS_SHARDS, and the table names are illustrative placeholders)

# BEFORE: scatter across ALL shards for a single user's GDM list
# Long cache TTL meant this almost never executed on the database directly
# When cache emptied simultaneously across the fleet: catastrophic
def get_gdm_list_old(user_id):
    results = []
    for shard in ALL_VITESS_SHARDS:  # O(shards) round-trips per user
        shard_results = db.query(
            shard,
            "SELECT * FROM gdm_memberships WHERE user_id = ?",
            user_id
        )
        results.extend(shard_results)
    return results

# AFTER: query only the shard that owns this user's data
# Read from a table sharded by user_id, so one user's rows live on one shard
def get_gdm_list_fixed(user_id):
    # Vitess routes this to the single correct shard via the VSchema vindex
    return db.query(
        "SELECT * FROM gdm_memberships_sharded_by_user WHERE user_id = ?",
        user_id
    )  # O(1): one shard, one query

# Lesson: always verify cache-backed queries can survive cache-cold load
# If the scatter query only executes on a cache miss, test it cold
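
The other query-side intervention mentioned earlier, fetching only the missing data, can be sketched the same way. This is an assumption about the general technique, not Slack's implementation: `get_many` and `query_db` are hypothetical, and a dict stands in for the cache.

```python
from typing import Callable, Dict, Iterable, List


def get_many(
    keys: Iterable[str],
    cache: Dict[str, str],
    query_db: Callable[[List[str]], Dict[str, str]],
) -> Dict[str, str]:
    """Return values for `keys`, querying the database only for cache misses."""
    keys = list(keys)
    found = {k: cache[k] for k in keys if k in cache}
    missing = [k for k in keys if k not in found]
    if missing:
        fetched = query_db(missing)  # one query, scoped to the missing keys
        cache.update(fetched)        # repopulate so the next read is a hit
        found.update(fetched)
    return found


requested_from_db: List[List[str]] = []


def query_db(keys: List[str]) -> Dict[str, str]:
    """Hypothetical database read that records which keys it was asked for."""
    requested_from_db.append(list(keys))
    return {k: f"row-{k}" for k in keys}


cache = {"gdm:1": "row-gdm:1", "gdm:2": "row-gdm:2"}
result = get_many(["gdm:1", "gdm:2", "gdm:3"], cache, query_db)
```

A partially warm cache then costs the database only the missing fraction of each request instead of the whole result set.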

Mcrib's Efficiency Paradox

The key lesson from Mcrib is architectural: a faster control loop for infrastructure changes can make a distributed system less safe, even if the control loop itself is correct. Mcrib was better than its predecessor at detecting and responding to Consul node departures — but that speed meant Memcached churn from the Consul rollout happened faster than the cache tier could recover. The fix was not to make Mcrib slower or less correct, but to add rate limiting on consecutive node replacements — ensuring the cache tier never loses more than a bounded fraction of its warmth at any moment.
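
A bound on consecutive replacements can be sketched as a sliding-window limiter. The details of Mcrib's actual change are not given in the source, so `ReplacementLimiter`, its budget, and its window are all hypothetical.

```python
from collections import deque
from typing import Deque


class ReplacementLimiter:
    """Allow at most `max_replacements` node swaps per `window_seconds`."""

    def __init__(self, max_replacements: int, window_seconds: float) -> None:
        self.budget = max_replacements
        self.window = window_seconds
        self.recent: Deque[float] = deque()  # timestamps of recent replacements

    def try_replace(self, now: float) -> bool:
        # Forget replacements old enough that those nodes should have rewarmed.
        while self.recent and now - self.recent[0] >= self.window:
            self.recent.popleft()
        if len(self.recent) >= self.budget:
            return False  # defer the swap instead of emptying another node
        self.recent.append(now)
        return True


limiter = ReplacementLimiter(max_replacements=5, window_seconds=600.0)
decisions = [limiter.try_replace(now=float(t)) for t in range(8)]
after_window = limiter.try_replace(now=600.0)
```

With a budget of five per ten minutes, the sixth through eighth requests are deferred, and capacity frees up again only once the oldest replacement ages out of the window. The control loop stays just as quick at detecting departures; only the rate at which it empties cache is capped.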

The GDM scatter query was the specific weakness that turned a cache degradation into a database overload. This query listed GDM conversations per user, and crucially, it queried every shard in the Vitess keyspace even when most shards contained no relevant data for that user. Under normal conditions, results were cached with a long TTL because GDM membership is immutable — so cache hits were nearly universal and the scatter pattern was rarely exercised. When the cache was systematically emptied by the Mcrib replacements, the scatter query began executing on the database at full scale for the first time under real load.

The metastable state trap: why reverting didn't help

Once Slack's system entered its cascading failure state, pausing the Consul rollout did not restore service. The system was in a metastable state — cache misses caused database load, database load caused slow responses, slow responses caused retries, retries caused more database load. This cycle was self-sustaining even after the original trigger was removed. The only exit was external intervention that changed the system state — throttling load and increasing capacity — not just undoing what had changed. This is the defining characteristic of metastable failures: the system is stable in the failed state, not just on its way to recovery.
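
The loop described above can be reproduced in a toy model. Everything here is invented for illustration (capacity, demand, retry multiplier, warm-up rate); the point is only that the same configuration has two stable states, and that leaving the bad one takes a change to load and capacity.

```python
from typing import Tuple


def simulate(hit_rate: float, overloaded: bool, ticks: int,
             capacity: float = 150.0, demand: float = 1000.0,
             admit: float = 1.0, retry_factor: float = 2.0) -> Tuple[float, bool]:
    """Toy feedback loop: returns (cache hit rate, database overloaded?)."""
    for _ in range(ticks):
        # Clients that failed last tick retry, multiplying offered load.
        offered = demand * admit * (retry_factor if overloaded else 1.0)
        db_load = offered * (1.0 - hit_rate)
        overloaded = db_load > capacity
        if not overloaded:
            # Queries succeed, so results are cached and the hit rate climbs.
            hit_rate = min(0.95, hit_rate + 0.10)
        # When overloaded, queries time out and nothing is cached.
    return hit_rate, overloaded


# Same system, two starting points: warm cache versus emptied cache.
healthy = simulate(hit_rate=0.95, overloaded=False, ticks=50)
stuck = simulate(hit_rate=0.50, overloaded=True, ticks=50)

# Intervene: throttle admitted traffic and add database capacity.
rescued = simulate(hit_rate=0.50, overloaded=True, ticks=10, capacity=300.0, admit=0.2)

# Lift the throttle and return to the original capacity once the cache is warm.
after_unthrottle = simulate(*rescued, ticks=50)
```

In this model the emptied-cache system never recovers on its own, however long it runs, while the throttled and scaled-up system warms its cache within a few ticks and stays healthy after the interventions are withdrawn.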

Why GDM membership was particularly vulnerable

The Group Direct Message membership data had a long cache TTL because GDM membership is immutable under Slack's current application requirements — once you're in a GDM, you stay in it. This long TTL meant the cache was almost always warm, the scatter query rarely executed on the database, and the latent scalability issue was never observed under normal conditions. The queries that feel safest to skip testing are often the ones hiding the most dangerous database access patterns.

Client retries as load amplifiers

Client retries, designed to recover from transient failures, become load amplifiers during sustained overload. When the Slack client receives a failure or timeout, it doesn't know whether the system is experiencing a transient local hiccup or a global overload — so it retries. During the 2-22-22 incident, automated retries with exponential backoff significantly increased database load during the window when the system needed space to recover. Exponential backoff with jitter helps but cannot fully counteract retries from millions of clients all experiencing the same global overload simultaneously.
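
A minimal sketch of both ideas, jittered backoff plus a stop condition for sustained failure, might look like the following. It is not Slack's client code: `backoff_delay` uses the widely used "full jitter" scheme, and `RetryBudget` is a hypothetical, deliberately simple abort rule.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


class RetryBudget:
    """Stop retrying once consecutive failures suggest overload, not a blip."""

    def __init__(self, max_consecutive_failures: int) -> None:
        self.max_failures = max_consecutive_failures
        self.failures = 0

    def record(self, success: bool) -> None:
        self.failures = 0 if success else self.failures + 1

    def may_retry(self) -> bool:
        return self.failures < self.max_failures


rng = random.Random(0)  # seeded only so the example is repeatable
delays = [backoff_delay(attempt, rng=rng) for attempt in range(8)]

budget = RetryBudget(max_consecutive_failures=3)
for _ in range(3):
    budget.record(success=False)
exhausted = not budget.may_retry()  # three failures in a row: stop retrying
budget.record(success=True)
recovered = budget.may_retry()      # one success resets the budget
```

Jitter spreads retries out in time, but only something like the budget reduces the total number of retries when every client is failing for the same reason.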

Architecture

The 2-22-22 incident is best understood through how Slack's serving architecture handles data retrieval and cache topology. Requests go through Mcrouter (a Memcached protocol router) to Memcached first; only on a cache miss do they hit Vitess. Consul keeps Mcrouter and Mcrib informed about which Memcached nodes are alive. The architecture was designed for resilience under normal node failures. It was not designed for a coordinated wave of Consul agent restarts that emptied cache at the exact moment peak traffic arrived.

Before: How a Consul Agent Restart Drains Cache (Single Node)

[Diagram not included here; the interactive version is on TechLogStack.]

The Cascade: 25% of Fleet Draining Simultaneously at Peak Traffic

[Diagram not included here; the interactive version is on TechLogStack.]

No Bugs. Just Emergent Behavior.

The Consul rollout was not a bug — it was a maintenance operation that had completed successfully twice before. The failure was a property of the whole system interacting across components: Consul's temporary deregistration behaviour, Mcrib's fast replacement response, the cache's role in shielding the database, and the GDM query's implicit assumption of cache warmth. No single component was broken. The interaction between correct components under peak load at a specific scale threshold produced the cascade. This is precisely why cascading failures in distributed systems are so hard to prevent: they emerge from correctness, not from bugs.

Lessons

A faster infrastructure component is not always safer. Mcrib was a better system than its predecessor — but its efficiency amplified cache churn during the Consul rollout in a way the slower predecessor would not have. Whenever you improve the speed of a control loop that modifies infrastructure state, audit whether that speed creates new failure modes under coordinated changes.
Metastable failures (self-sustaining failure states that persist even after the original trigger is removed) cannot be fixed by reverting the trigger. Your incident playbooks need explicit recovery procedures for "we are in sustained overload even though the cause has been addressed" — throttle incoming load, add capacity, or both. Waiting for self-recovery from a metastable state is not a strategy.
Test your high-traffic cache-miss path before it is exercised under load. The GDM scatter query had never been exercised at full scale because the cache hit rate was near-perfect. When the cache emptied, a latent design flaw — querying all shards — became a severity-1 incident. Every high-frequency query protected by a cache should be load-tested against the scenario where the cache is cold.
Percentage-based rollouts do not guarantee safety when the failure mode is a tipping-point cascade. The first two 25% Consul steps passed without incident; the third hit a threshold where the interaction between rollout churn and peak traffic became self-sustaining. Consider adding traffic-aware gates to rollout automation.
Client retries are a double-edged sword. They are essential for recovering from transient failures but amplify load during sustained overload. Slack's clients used exponential backoff with jitter — the right design — but even well-designed retries from millions of clients contribute significantly to sustained overload. Design retry logic with a global-overload abort condition: if every retry attempt is failing, stop retrying.
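
The traffic-aware gate suggested in the rollout lesson above could be as simple as a pre-flight check before each step. This is a sketch under assumed inputs: `FleetSignals`, the thresholds, and the idea of comparing live traffic with a typical peak are illustrative choices, not something the postmortem prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FleetSignals:
    current_qps: float     # live request rate
    peak_qps: float        # typical daily peak, for example from recent metrics
    cache_hit_rate: float  # fleet-wide hit rate, 0.0 to 1.0


def may_start_next_step(signals: FleetSignals,
                        max_traffic_fraction: float = 0.6,
                        min_hit_rate: float = 0.95) -> bool:
    """Gate a rollout step on traffic headroom and cache health."""
    off_peak = signals.current_qps <= max_traffic_fraction * signals.peak_qps
    cache_healthy = signals.cache_hit_rate >= min_hit_rate
    return off_peak and cache_healthy


quiet = may_start_next_step(FleetSignals(current_qps=300, peak_qps=1000, cache_hit_rate=0.99))
busy = may_start_next_step(FleetSignals(current_qps=900, peak_qps=1000, cache_hit_rate=0.99))
degraded = may_start_next_step(FleetSignals(current_qps=300, peak_qps=1000, cache_hit_rate=0.80))
```

Only the quiet, healthy case passes. A gate like this would have held the third 25% step until traffic fell away from its peak, at the cost of a slower rollout.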

Engineering Glossary

Cache miss — a request where the data is not in cache, forcing a database query to serve the response. Under normal conditions, cache misses are rare and database load is low. When the cache empties simultaneously across a fleet, cache misses become the majority of traffic and database load becomes catastrophic.

Consul — a service discovery and service mesh tool used by Slack to maintain a dynamic registry of which servers are healthy and serving traffic. When a Consul agent restarts, it briefly deregisters the host from the service catalog — even if the host itself is healthy. This deregistration triggered Mcrib's cache replacement logic.

Consistent hashing — a routing algorithm that maps each cache key to a specific server using a hash ring, so cache lookups are predictable and cache warmth is preserved during small topology changes. Used by Mcrouter to route cache requests to the correct Memcached node.

Mcrib — Slack's control plane that watches Consul and updates Memcached cluster routing whenever nodes appear or disappear from the service catalog. The new Mcrib was faster than its predecessor at detecting node departures and replacing them with cold spare nodes — the efficiency that amplified the 2-22-22 cascade.

Metastable failure — a failure mode in distributed systems where a system settles into a stable degraded state from which it cannot recover without external intervention, even after the original trigger has passed. Characterised by self-sustaining load cycles: cache misses → database overload → slow responses → retries → more cache misses.

Scatter query — a database query that fans out to all shards in a Vitess keyspace regardless of where the relevant data lives. Expensive under normal conditions; catastrophic when the cache protecting it empties simultaneously across a fleet at peak traffic.

Vitess — Slack's horizontally sharded MySQL system. When the Memcached cache tier emptied during the 2-22-22 incident, the scatter query against the GDM membership Vitess keyspace hit every shard simultaneously, severely overloading the keyspace.
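
To make the consistent hashing entry above concrete, here is a generic hash ring with virtual nodes. It is a textbook sketch, not Mcrouter's actual algorithm; `HashRing`, the node names, and the use of MD5 for placement are assumptions made for illustration.

```python
import bisect
import hashlib
from typing import Dict, List


def _point(label: str) -> int:
    """Map a label to a position on the ring."""
    return int(hashlib.md5(label.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes: List[str], vnodes: int = 100) -> None:
        self._owner: Dict[int, str] = {}
        for node in nodes:
            for i in range(vnodes):  # several points per node smooth the split
                self._owner[_point(f"{node}#{i}")] = node
        self._points = sorted(self._owner)

    def node_for(self, key: str) -> str:
        # First ring point at or after the key's hash, wrapping past the end.
        idx = bisect.bisect_left(self._points, _point(key)) % len(self._points)
        return self._owner[self._points[idx]]


keys = [f"user:{i}" for i in range(1000)]
before = HashRing(["mc-a", "mc-b", "mc-c", "mc-d"])
after = HashRing(["mc-a", "mc-b", "mc-c"])  # mc-d leaves the ring
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
```

Only keys that lived on the departed node change owner, which is why small topology changes preserve most cache warmth. It does not help when many nodes are emptied in quick succession, as happened on 2-22-22.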

This case is a plain-English retelling of publicly available engineering material.
