<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: TechLogStack</title>
    <description>The latest articles on DEV Community by TechLogStack (@techlogstack).</description>
    <link>https://dev.to/techlogstack</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3942907%2Fa4ab56cd-2c6b-475e-9f32-91735275dadc.png</url>
      <title>DEV Community: TechLogStack</title>
      <link>https://dev.to/techlogstack</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/techlogstack"/>
    <language>en</language>
    <item>
      <title>The 80% Problem: Why Getting an LLM System to 'Works in Demo' Is 20% of the Work</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/the-80-problem-why-getting-an-llm-system-to-works-in-demo-is-20-of-the-work-3cni</link>
      <guid>https://dev.to/techlogstack/the-80-problem-why-getting-an-llm-system-to-works-in-demo-is-20-of-the-work-3cni</guid>
      <description>&lt;p&gt;&lt;strong&gt;Shopify&lt;/strong&gt; · Reliability · 19 May 2026&lt;/p&gt;

&lt;p&gt;Every team building with LLMs discovers the same brutal truth: 80% quality arrives in a few weeks. The climb from 80% to 95%, the gap between 'impressive demo' and 'product I'd trust with my customers', takes the rest of the time. Shopify's Flow agent and Sidekick teams lived this curve and came back with a systematic playbook. It is mostly about measurement.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM judge: 0.02 → 0.61 Kappa&lt;/li&gt;
&lt;li&gt;300-example hand-crafted benchmark&lt;/li&gt;
&lt;li&gt;Production mirroring closes gap in 2 weeks&lt;/li&gt;
&lt;li&gt;Merchant simulator pre-deployment&lt;/li&gt;
&lt;li&gt;Weekly Qwen3-32B retraining cycle&lt;/li&gt;
&lt;li&gt;ZenML: 1,200 prod deployments analyzed&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;p&gt;ZenML analyzed 1,200 production LLM deployments across companies ranging from startups to large enterprises and found a pattern so consistent it has become a rule: &lt;strong&gt;reaching 80% quality happens quickly, but pushing past 95% requires the majority of total development time&lt;/strong&gt;. The teams that hit 80% in four weeks and spend the next six months trying to reach 95% are not failing — they are experiencing the standard engineering curve for AI systems. The teams that mistake 80% for done are the ones shipping products that quietly erode user trust. Shopify's engineering teams, building both Sidekick (the merchant AI assistant) and the Flow agent (automated workflow generation), lived this curve in production. Their solution was not a better model. It was a better measurement system.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;WHY EVALUATION IS THE HARD PART&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional software has a truth oracle: does the function return the correct value? LLM systems have no such oracle. A response can be grammatically correct, semantically reasonable, formatted perfectly — and still be wrong in ways that only a domain expert would notice, or only appear wrong on the tenth interaction in a specific workflow. &lt;strong&gt;Without a reliable way to measure quality, you cannot improve systematically.&lt;/strong&gt; You are optimizing blind, hoping that the next prompt change or model upgrade makes things better without making other things worse. Evaluation infrastructure is not overhead — it is the prerequisite for all other AI engineering work.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Shopify's Flow agent generates Shopify Flow automations from natural language: merchants describe what they want to happen ('when an order is over $200, add the customer to my VIP segment'), and the agent produces the workflow. The task requires &lt;em&gt;tool calling&lt;/em&gt; (the LLM is given a set of available functions, or tools, with descriptions, and requests that one be executed by generating a structured function call, which lets it take real-world actions beyond text generation) and produces structured output in a domain-specific format. It sounds well-bounded. In practice, the diversity of merchant intent is vast, edge cases accumulate rapidly, and subtle errors in the generated workflow, such as a wrong condition operator or a missing trigger, produce silently incorrect automations that fail only when a merchant's order actually arrives.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📏&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Shopify calibrated their LLM judge from a &lt;strong&gt;Cohen's Kappa of 0.02&lt;/strong&gt; (essentially random: the judge agreed with human evaluators barely more than chance would predict) to &lt;strong&gt;0.61&lt;/strong&gt;, close to the human evaluator baseline of 0.69. That baseline is 0.69 rather than 1.0 because human evaluators don't perfectly agree with each other either. The goal is not a perfect judge; it's a judge trustworthy enough that its signals drive reliable engineering decisions.&lt;/p&gt;
&lt;/blockquote&gt;
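
&lt;p&gt;To make the statistic concrete, here is a minimal sketch of Cohen's Kappa computed by hand. The function name and the toy labels are illustrative assumptions, not Shopify's data or code; in practice &lt;code&gt;sklearn.metrics.cohen_kappa_score&lt;/code&gt; computes the same quantity.&lt;/p&gt;

```python
from collections import Counter

def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's Kappa: (p_observed - p_expected) / (1 - p_expected)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters chose the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labeled at random with their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label for everything
    return (p_observed - p_expected) / (1 - p_expected)

# Toy data, made up for illustration: the judge says "good" almost regardless of input.
human = ["good", "bad", "good", "bad", "good", "bad", "good", "bad"]
judge = ["good", "good", "good", "good", "good", "good", "good", "bad"]
raw_agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohen_kappa(human, judge)
print(f"raw agreement: {raw_agreement:.3f}, kappa: {kappa:.3f}")
```

&lt;p&gt;In this toy case the two raters agree on 5 of 8 items (62.5%), yet Kappa is only 0.25: a judge that answers 'good' nearly every time would match these human labels half the time by chance alone. That correction for chance is why raw agreement is a poor calibration target.&lt;/p&gt;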

&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Benchmarks Said Ready; Production Said Otherwise
&lt;/h4&gt;

&lt;p&gt;Shopify's fine-tuned Flow agent passed a hand-crafted 300-example benchmark at high accuracy. When deployed to production shadow traffic, performance on real merchant workflows diverged from the benchmark. The benchmark had been crafted by engineers who knew the system well and implicitly sampled from the distribution they understood. Real merchant intent had a long tail the benchmark didn't capture.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  No Quality Signal Trustworthy Enough to Drive Iteration
&lt;/h4&gt;

&lt;p&gt;The early LLM judge had a Cohen's Kappa of 0.02 — barely better than random agreement with human evaluators. This meant the judge's verdicts could not reliably distinguish good responses from bad ones. Engineering decisions based on judge verdicts were effectively noise. Human evaluation at scale was impractical. Without a trustworthy quality signal, iteration was slow and direction was unclear.&lt;/p&gt;




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Calibrated LLM Judge + Production Mirroring Flywheel
&lt;/h4&gt;

&lt;p&gt;The team iteratively improved the LLM judge through systematic calibration against human labels (Kappa 0.02 → 0.61), then used it to score production traffic at scale. Production mirroring — routing real traffic through both current and candidate models — generated the failure cases that didn't appear in benchmarks. Those failures were fed back into the training dataset, closing the benchmark-to-production gap.&lt;/p&gt;




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Production Gap Closed in Two Weeks with the Flywheel
&lt;/h4&gt;

&lt;p&gt;The gap from 'benchmark-ready' to 'production-ready' closed in two weeks using the production mirroring flywheel. The fine-tuned Flow agent now serves the majority of production traffic. Weekly retraining cycles on H200 GPUs mean the model continuously improves from new production signal rather than drifting as merchant behavior evolves.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Human Agreement Ceiling&lt;/p&gt;

&lt;p&gt;One of the most grounding facts in Shopify's evaluation system is that human evaluators agreed with each other at &lt;strong&gt;Cohen's Kappa of 0.69&lt;/strong&gt; — not 1.0. Humans disagree about quality. This is not a failure of the evaluation process; it reflects genuine ambiguity in what 'correct' means for natural language tasks. The practical implication: don't try to build a perfect judge. Build a judge that matches or approaches human agreement levels, and treat that as the meaningful ceiling. Optimizing a judge past the human agreement level is overfitting to individual human annotators, not finding truth.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The merchant simulator deserves particular attention as an engineering pattern. Before any system change ships to production, it is tested against simulated merchant interactions derived from real production conversations. The simulator captures the 'essence' of a real conversation, meaning the underlying merchant goal, and replays that goal against the new system. This is fundamentally different from benchmark evaluation: it tests the new system against &lt;strong&gt;realistic merchant intent distributions&lt;/strong&gt;, including the long tail that engineering-crafted benchmarks consistently miss. It is also fundamentally different from A/B testing: it catches regressions before any real merchant sees them, without requiring a traffic split.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Synthetic Data: Closing the Training Data Gap&lt;/p&gt;

&lt;p&gt;The Flow agent's fine-tuning training data was almost entirely &lt;strong&gt;synthetic&lt;/strong&gt; — generated by an LLM, not labeled by humans. The process: sample a real production workflow, use a stronger model to generate a plausible natural-language request that would produce it, construct the ideal multi-turn tool trajectory. The synthetic data generation was the majority of the engineering effort. The resulting dataset covered the breadth of Flow's usage in a way that manual annotation never could — because the diversity of real workflows provided the supervision signal, and the LLM provided the scale. This is the emerging pattern for fine-tuning specialists: synthetic data from real production outputs, not expensive human annotation from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔬&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Industry Pattern: ZenML's 1,200 Case Studies&lt;/p&gt;

&lt;p&gt;ZenML's LLMOps database of 1,200+ production deployments suggests that Shopify's experience is typical, not exceptional. The summary from their analysis: &lt;strong&gt;'Perhaps this is a truism by now, but you'll spend more time building evaluation infrastructure than you will on the actual application logic. And if you're not, you're probably shipping broken features.'&lt;/strong&gt; LLM-as-judge has emerged as the dominant pattern for scalable quality measurement, but the successful deployments in the analysis also maintain human-in-the-loop golden datasets for critical domains. The dual-layer approach (LLM judges for velocity, human ground truth for calibration) is the de facto standard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The A/B Test That Isn't&lt;/p&gt;

&lt;p&gt;Teams new to LLM evaluation often reach for A/B testing as the measurement tool: split traffic, measure conversion, pick the winner. A/B testing has a fatal problem for LLM evaluation: &lt;strong&gt;a 5% improvement in a downstream metric like merchant click-through might take weeks of traffic to reach statistical significance&lt;/strong&gt; — and you may have introduced a subtle quality regression in a different dimension that the metric doesn't capture. Production mirroring with direct output comparison is faster and richer: you see whether response quality improved for the same inputs, without waiting for downstream business metric movement. Business metrics confirm value; output comparison guides engineering.&lt;/p&gt;
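
&lt;p&gt;The 'weeks of traffic' claim can be sanity-checked with back-of-envelope arithmetic. The sketch below uses the standard normal-approximation sample size for a two-sided two-proportion test. Every number in it (10% baseline click-through, 5% relative lift, 5,000 sessions per arm per day, 5% significance, 80% power) is an assumption chosen for illustration, not a Shopify figure.&lt;/p&gt;

```python
from math import ceil
from statistics import NormalDist

def samples_per_arm(baseline: float, relative_lift: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-arm sample size for a two-sided two-proportion z-test."""
    treated = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = baseline * (1 - baseline) + treated * (1 - treated)
    return ceil((z_alpha + z_power) ** 2 * variance / (treated - baseline) ** 2)

# Assumed inputs, for illustration only.
needed = samples_per_arm(baseline=0.10, relative_lift=0.05)
days = ceil(needed / 5_000)
print(f"{needed:,} sessions per arm, about {days} days of traffic")
```

&lt;p&gt;Under these assumptions the test needs on the order of 58,000 sessions per arm, or roughly 12 days of traffic, and a smaller lift or a lower baseline rate pushes that out quickly because the requirement grows with the inverse square of the absolute difference.&lt;/p&gt;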

&lt;p&gt;&lt;strong&gt;THE COST CURVE IS ASYMMETRIC&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The 80% → 95% quality journey is asymmetric in effort. The first 80% comes from model capability: the LLM already knows how to generate text, use tools, and follow instructions. The final 15% comes from &lt;strong&gt;understanding the specific failure modes of your specific application on your specific user distribution&lt;/strong&gt; — and that knowledge cannot be bought or downloaded. It is earned through measurement, systematic failure analysis, and targeted training data creation. The companies that invest in this domain-specific evaluation work build durable advantages over those that simply upgrade to the next model version and hope.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🏭&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Notion AI (referenced in ZenML's analysis) built a &lt;strong&gt;multi-layer evaluation stack&lt;/strong&gt; that balances speed and cost: cheap heuristic checks on every commit, LLM judge scoring on every merge, and expensive human evaluation on every release candidate. Teams that adopted this tiered approach reported &lt;strong&gt;10x faster development velocity&lt;/strong&gt; compared to running full human evaluation on every change. The insight: match eval depth to the stakes of the change, not to a uniform 'always run everything' policy.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Building the Evaluation Flywheel
&lt;/h3&gt;

&lt;p&gt;Shopify's evaluation architecture is best understood as a flywheel: production traffic generates failures, failures feed the training pipeline, retraining improves the model, the improved model generates fewer failures, and the cycle continues. Each turn of the flywheel reduces the gap between benchmark performance and production performance. The flywheel only works if each component — quality measurement (LLM judge), failure collection (production mirroring), training (fine-tuning pipeline), deployment (shadow traffic + promotion) — is production-grade itself. A miscalibrated judge produces misleading signal. A flaky training pipeline slows iteration. A low-coverage benchmark misses the failures that actually matter.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;0.61&lt;/strong&gt; — Cohen's Kappa achieved for Shopify's LLM judge after iterative calibration — close to the human evaluator baseline of 0.69 and sufficient to drive reliable engineering decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;300&lt;/strong&gt; — Hand-crafted benchmark examples for the Flow agent — covering breadth of expected usage, used as the initial quality gate before production shadow testing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2 weeks&lt;/strong&gt; — Time to close the benchmark-to-production gap using the production mirroring flywheel — from 'benchmark says ready' to 'production confirms ready'&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weekly&lt;/strong&gt; — Qwen3-32B retraining cadence on H200 GPUs (12h full training run) — keeping the model aligned with evolving merchant behavior without months-long release cycles
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# LLM Judge calibration: the process that takes you from Kappa 0.02 to 0.61
# A judge is only useful if it agrees with humans. Measure this first.
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cohen_kappa_score&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calibrate_llm_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge_prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calibration_set&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    calibration_set: list of {conversation, human_label} pairs
    human_label: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;good&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bad&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;needs_improvement&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    Returns Cohen&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s Kappa between judge and human labels.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;judge_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;calibration_set&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Ask the judge to evaluate this conversation
&lt;/span&gt;        &lt;span class="n"&gt;judge_verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;conversation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;judge_labels&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge_verdict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;human_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;human_label&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;calibration_set&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;kappa&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cohen_kappa_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;human_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;judge_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;kappa&lt;/span&gt; &lt;span class="c1"&gt;# target: &amp;gt;0.60 before trusting judge at scale
&lt;/span&gt;
&lt;span class="c1"&gt;# The calibration loop. initial_judge_prompt, call_llm, find_disagreements and
# improve_prompt are placeholders for your own prompt, LLM client, and tooling.
&lt;/span&gt;&lt;span class="n"&gt;current_judge_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;initial_judge_prompt&lt;/span&gt;
&lt;span class="n"&gt;kappa&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calibrate_llm_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_judge_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calibration_set&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Shopify's first judge measured 0.02, barely better than random
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;kappa&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.60&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Analyze where judge and humans disagree under the current prompt
&lt;/span&gt;    &lt;span class="n"&gt;disagreements&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;find_disagreements&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_judge_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calibration_set&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Improve judge prompt based on disagreement patterns:
&lt;/span&gt;    &lt;span class="c1"&gt;# - Add clarifying criteria for ambiguous cases
&lt;/span&gt;    &lt;span class="c1"&gt;# - Add few-shot examples from disagreements (human = ground truth)
&lt;/span&gt;    &lt;span class="c1"&gt;# - Adjust rubric language to match human intuitions
&lt;/span&gt;    &lt;span class="n"&gt;current_judge_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;improve_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_judge_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;disagreements&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;kappa&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calibrate_llm_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_judge_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calibration_set&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Kappa after iteration: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;kappa&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# e.g. 0.15 → 0.31 → 0.48 → 0.61
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;PRODUCTION MIRRORING: THE GROUND TRUTH TEST&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Benchmarks are necessary but not sufficient. A benchmark is a fixed dataset that reflects the understanding of the engineers who created it. Production traffic reflects the actual diversity of user intent, including the edge cases, unusual phrasings, and unexpected use patterns that no engineer anticipated. &lt;strong&gt;Production mirroring routes a percentage of real traffic through both the current model and the candidate model simultaneously&lt;/strong&gt;, comparing outputs. Differences trigger human review of high-value or uncertain cases. This is the most direct way to discover whether a model improvement that looks good on a benchmark actually performs better for real users, or merely performs better on what engineers think real users want.&lt;/p&gt;
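
&lt;p&gt;The mirroring idea can be sketched in a few lines. This is a simplified illustration, not Shopify's implementation: &lt;code&gt;handle_request&lt;/code&gt;, &lt;code&gt;MirrorLog&lt;/code&gt;, and the two toy models are hypothetical names, and exact string comparison stands in for the judge-based comparison a real system would use.&lt;/p&gt;

```python
import random
from dataclasses import dataclass, field
from typing import Callable

Model = Callable[[str], str]

@dataclass
class MirrorLog:
    """Collects paired outputs for requests that were mirrored."""
    pairs: list[dict] = field(default_factory=list)

    def disagreements(self) -> list[dict]:
        return [p for p in self.pairs if p["current"] != p["candidate"]]

def handle_request(request: str, current: Model, candidate: Model,
                   log: MirrorLog, mirror_rate: float, rng: random.Random) -> str:
    """Serve the current model; shadow-run the candidate on a sample of traffic."""
    response = current(request)
    if mirror_rate > rng.random():
        # The candidate's output is only logged, never shown to the merchant.
        log.pairs.append({"request": request, "current": response,
                          "candidate": candidate(request)})
    return response

# Toy stand-ins for the two models (hypothetical).
def current_model(request: str) -> str:
    return request.lower()

def candidate_model(request: str) -> str:
    return request.lower().replace("vip", "VIP")

log = MirrorLog()
rng = random.Random(0)
requests = ["Tag VIP customers", "Email on refund", "Flag orders over $200"]
served = [handle_request(r, current_model, candidate_model, log, 1.0, rng)
          for r in requests]
review_queue = log.disagreements()  # candidates for human review
```

&lt;p&gt;The merchant always receives the current model's answer, so a bad candidate costs nothing in user experience. In a real service the candidate call would run off the request path so that it cannot add latency.&lt;/p&gt;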

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Synthetic Data Generation Pipeline&lt;/p&gt;

&lt;p&gt;Shopify's Flow agent training data was generated through a three-step pipeline: &lt;strong&gt;Step 1&lt;/strong&gt; — sample a diverse set of validated production workflows (at least one workflow per unique workflow descriptor, from merchants with two or more qualifying workflows). &lt;strong&gt;Step 2&lt;/strong&gt; — use a stronger LLM to generate a plausible natural-language merchant request that would lead to that workflow. &lt;strong&gt;Step 3&lt;/strong&gt; — construct the ideal multi-turn tool call trajectory from request to completed workflow. The resulting dataset had two properties manual annotation lacks: &lt;strong&gt;scale&lt;/strong&gt; (the production workflow corpus is large) and &lt;strong&gt;grounding&lt;/strong&gt; (every training example was derived from a real workflow that actually ran).&lt;/p&gt;
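
&lt;p&gt;A rough sketch of how those three steps might fit together follows. The field names (&lt;code&gt;merchant_id&lt;/code&gt;, &lt;code&gt;descriptor&lt;/code&gt;), the helper names, and the stub generators are assumptions for illustration; the sampling rule is one reading of the description above, and in the real pipeline step 2 is a call to a stronger LLM.&lt;/p&gt;

```python
from collections import defaultdict
from typing import Callable

def sample_workflows(workflows: list[dict], min_per_merchant: int = 2) -> list[dict]:
    """Step 1: one workflow per unique descriptor, from merchants with enough workflows."""
    by_merchant = defaultdict(list)
    for wf in workflows:
        by_merchant[wf["merchant_id"]].append(wf)
    picked = {}
    for merchant_workflows in by_merchant.values():
        if len(merchant_workflows) >= min_per_merchant:
            for wf in merchant_workflows:
                picked.setdefault(wf["descriptor"], wf)  # keep the first per descriptor
    return list(picked.values())

def build_examples(workflows: list[dict],
                   write_request: Callable[[dict], str],
                   build_trajectory: Callable[[str, dict], list[dict]]) -> list[dict]:
    """Steps 2 and 3: back-generate a request, then the tool-call trajectory."""
    examples = []
    for wf in sample_workflows(workflows):
        request = write_request(wf)  # step 2: a stronger LLM in the real pipeline
        examples.append({"request": request,
                         "trajectory": build_trajectory(request, wf),  # step 3
                         "target": wf})
    return examples

# Toy data and stub generators (hypothetical).
workflows = [
    {"merchant_id": "m1", "descriptor": "tag-vip", "trigger": "order_created"},
    {"merchant_id": "m1", "descriptor": "refund-email", "trigger": "refund_created"},
    {"merchant_id": "m2", "descriptor": "tag-vip", "trigger": "order_created"},
    {"merchant_id": "m3", "descriptor": "low-stock", "trigger": "inventory_changed"},
]
examples = build_examples(
    workflows,
    write_request=lambda wf: f"Set up a {wf['descriptor']} automation",
    build_trajectory=lambda req, wf: [{"tool": "create_workflow", "args": wf}],
)
```

&lt;p&gt;Because every example starts from a workflow that already exists, the target output is known before the request is written, which is what makes the generated pairs usable as supervision.&lt;/p&gt;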

&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tangle: The ML Pipeline That Enables Weekly Retraining&lt;/p&gt;

&lt;p&gt;The full training pipeline — data collection, synthetic data generation, fine-tuning, evaluation, deployment — runs on &lt;strong&gt;Tangle, Shopify's open-source ML experimentation platform&lt;/strong&gt;. Tangle composes each pipeline step as a reproducible workflow with intelligent caching: only the steps affected by a change re-run. This means a change to the synthetic data generator doesn't trigger a full pipeline rerun from scratch — only the data generation step and its downstream steps re-execute. The caching infrastructure is what makes weekly retraining economically and operationally viable. Without it, the iteration cycle would be measured in months, not weeks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Golden Datasets Are Non-Negotiable&lt;/p&gt;

&lt;p&gt;ZenML's analysis is unambiguous: &lt;strong&gt;every successful production LLM deployment they analyzed maintains human-in-the-loop golden datasets&lt;/strong&gt; for critical domains. LLM judges are used for velocity, scoring production traffic at scale. But they drift. A judge calibrated against last month's quality standards may give wrong verdicts on today's outputs. Golden datasets (small, carefully curated, human-labeled examples that represent ground truth) anchor the judge calibration and detect judge drift. Without a golden dataset, you have no way to know when your quality measurement system itself has stopped working.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Two-Week Rule&lt;/p&gt;

&lt;p&gt;Shopify's experience with the production mirroring flywheel produced a rule of thumb that has since appeared in multiple other teams' postmortems: &lt;strong&gt;if your candidate model passes benchmark evaluation, it takes approximately two weeks of production mirroring to confirm whether it's truly production-ready&lt;/strong&gt;. Two weeks of real traffic at a shadow percentage generates enough diverse examples to surface the tail failures that the benchmark didn't cover. If the flywheel is working, those failures are incorporated into the training data and the model improves. If the failures are systematic, indicating a training distribution problem rather than isolated edge cases, the two-week window reveals this before the model is promoted to full production.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The evaluation architecture for production LLM systems has four components that form a cycle. &lt;strong&gt;Benchmark evaluation&lt;/strong&gt; provides fast, reproducible quality gates during development. &lt;strong&gt;LLM-as-judge scoring&lt;/strong&gt; provides continuous quality measurement at production traffic scale. &lt;strong&gt;Production mirroring&lt;/strong&gt; provides ground truth about whether a candidate model performs better for real users. &lt;strong&gt;The training flywheel&lt;/strong&gt; converts production failures into training examples, closing the gap each cycle. Each component is necessary; none is sufficient alone.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Production LLM Evaluation Flywheel
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/shopify-llm-evaluation-production-2025/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Interactive diagram available on TechLogStack (link above).&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  LLM Judge Architecture: From Random Agreement to Near-Human
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/shopify-llm-evaluation-production-2025/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Interactive diagram available on TechLogStack (link above).&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;THE MERCHANT SIMULATOR AS PRE-DEPLOYMENT SAFETY NET&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The merchant simulator sits between benchmark evaluation and production mirroring — it's a &lt;strong&gt;synthetic production environment&lt;/strong&gt;. It replays real merchant intents (extracted from production conversations) against candidate systems in a controlled environment, before any real merchant sees the new system. This catches the specific failure mode that benchmarks miss: correct behavior on engineer-anticipated test cases, incorrect behavior on the realistic distribution of merchant intent in production. The simulator doesn't replace production mirroring — it prevents the worst regressions from reaching the production mirroring stage at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Eval Budget vs Training Budget: The Cost Trap&lt;/p&gt;

&lt;p&gt;ZenML's analysis of 1,200 production deployments found that teams frequently discover that &lt;strong&gt;running comprehensive evaluations on every commit burns through inference budget faster than production traffic&lt;/strong&gt;. Running a full eval suite — LLM judge on 1,000 examples × multiple iterations per PR — can cost more per day than serving users. The solution is a tiered eval strategy: fast, cheap unit evals on every commit; comprehensive judge-scored evals on every merge; full production mirroring only for release candidates. Eval should be sized to the stakes of what's being changed, not run at maximum coverage on every code change.&lt;/p&gt;
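
&lt;p&gt;One way to encode the tiered strategy is a plain lookup from pipeline stage to eval suites. The sketch below is an assumption-laden illustration: the stage and suite names follow the tiers described above, but the per-run dollar costs and the daily run counts are invented purely to show the budget arithmetic.&lt;/p&gt;

```python
from enum import Enum

class Stage(Enum):
    COMMIT = "commit"
    MERGE = "merge"
    RELEASE_CANDIDATE = "release_candidate"

# Suites named after the tiers in the text; the mapping itself is an assumption.
EVAL_PLAN = {
    Stage.COMMIT: ["unit_evals"],
    Stage.MERGE: ["unit_evals", "judge_scored_evals"],
    Stage.RELEASE_CANDIDATE: ["unit_evals", "judge_scored_evals", "production_mirroring"],
}

# Hypothetical per-run costs in dollars.
COST_PER_RUN = {"unit_evals": 0.10, "judge_scored_evals": 25.0, "production_mirroring": 400.0}

def suites_for(stage: Stage) -> list[str]:
    return EVAL_PLAN[stage]

def daily_cost(runs_per_day: dict[Stage, int]) -> float:
    """Total eval spend for a day given how many times each stage fires."""
    return sum(COST_PER_RUN[suite] * runs
               for stage, runs in runs_per_day.items()
               for suite in suites_for(stage))

# Hypothetical daily activity.
runs = {Stage.COMMIT: 200, Stage.MERGE: 20, Stage.RELEASE_CANDIDATE: 1}
tiered = daily_cost(runs)
run_everything = sum(runs.values()) * sum(COST_PER_RUN.values())
print(f"tiered: ${tiered:,.2f} per day, everything on every change: ${run_everything:,.2f}")
```

&lt;p&gt;With these made-up prices the tiered plan comes out roughly 100 times cheaper than running every suite on every change. The exact ratio is meaningless; the point is that the expensive suites dominate the bill, so gating them on stage is where the savings come from.&lt;/p&gt;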

&lt;p&gt;&lt;strong&gt;🧪&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Multi-LLM Annotation Pattern&lt;/p&gt;

&lt;p&gt;For high-stakes quality assessments (like Shopify's Global Catalogue product taxonomy), a single LLM judge has too much variance. The production pattern is to run &lt;strong&gt;multiple LLMs independently on the same evaluation task&lt;/strong&gt;, then use an arbitration system (a specialized model) to resolve disagreements. This ensemble approach dramatically reduces false positives in quality assessment: a response that confuses one model but is rated correctly by three others is probably correct. The arbitration model applies structured ruling logic for edge cases that simple voting would misclassify. The pattern adds cost but reduces the error rate of the quality signal for critical decisions.&lt;/p&gt;
&lt;/blockquote&gt;
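
&lt;p&gt;The multi-LLM annotation pattern described above reduces to 'vote, then escalate'. Here is a minimal sketch, assuming each judge and the arbiter are plain callables that return a label; &lt;code&gt;ensemble_verdict&lt;/code&gt;, the 0.75 agreement threshold, and the toy judges are all hypothetical.&lt;/p&gt;

```python
from collections import Counter
from typing import Callable

Judge = Callable[[str], str]

def ensemble_verdict(item: str, judges: list[Judge], arbiter: Judge,
                     min_agreement: float = 0.75) -> tuple[str, bool]:
    """Return (verdict, escalated). Escalate to the arbiter when judges split."""
    votes = Counter(judge(item) for judge in judges)
    label, count = votes.most_common(1)[0]
    if count / len(judges) >= min_agreement:
        return label, False
    return arbiter(item), True

# Toy judges standing in for independent LLM calls.
def strict(item: str) -> str:
    return "incorrect" if "unisex" in item else "correct"

def lenient(item: str) -> str:
    return "correct"

def arbiter(item: str) -> str:
    return "needs_review"

three_to_one = ensemble_verdict("unisex hoodie", [strict, lenient, lenient, lenient], arbiter)
split = ensemble_verdict("unisex hoodie", [strict, strict, lenient, lenient], arbiter)
```

&lt;p&gt;A three-to-one vote is accepted without escalation, while a two-to-two split goes to the arbiter. Keeping the arbiter off the common path is what bounds the extra cost of the ensemble.&lt;/p&gt;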




&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;p&gt;The 80% quality curve is the defining challenge of production AI engineering. The teams that accept it and build systematic measurement infrastructure navigate it successfully. The teams that are surprised by it and try to push past it with more prompting and model upgrades are still on it.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;01.&lt;/strong&gt;  &lt;strong&gt;You will spend more time building evaluation infrastructure than the application logic itself.&lt;/strong&gt; This is not inefficiency — it is the correct allocation of engineering effort for probabilistic systems. Accept it before starting. Budget for it explicitly. The teams shipping reliable AI products have evaluation as a first-class engineering investment, not an afterthought.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;02.&lt;/strong&gt; &lt;em&gt;LLM-as-judge&lt;/em&gt; (using a language model to evaluate the outputs of another language model, calibrated against human labels to produce quality scores at scale without requiring manual human evaluation of every production interaction) is the scalable evaluation pattern. But an uncalibrated judge (Cohen's Kappa 0.02) is worse than useless — it gives false confidence. Calibrate your judge against human labels before trusting its verdicts. Target Kappa ≥ 0.6.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;03.&lt;/strong&gt;  &lt;strong&gt;A benchmark that passes is a necessary condition, not a sufficient one.&lt;/strong&gt; Benchmarks reflect what engineers anticipated; production reflects what users actually do. Always follow benchmark success with production mirroring — routing real traffic through both current and candidate systems and comparing outputs. The two weeks Shopify needed to close the benchmark-to-production gap is the standard cost of this final validation step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;04.&lt;/strong&gt; &lt;em&gt;Synthetic data generation&lt;/em&gt; (using an LLM to create training examples from a production data source, such as generating natural-language merchant requests from real production workflows) from real production outputs is the path to scalable training data for domain-specific fine-tuning. Manual annotation doesn't scale. Synthetic data derived from production workflows does — and it's grounded in real-world distribution rather than engineer-imagined distribution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;05.&lt;/strong&gt;  &lt;strong&gt;The retraining cycle speed determines how fast you can respond to production drift.&lt;/strong&gt; Merchant behavior changes, new workflow patterns emerge, new merchant categories join Shopify — and a model trained on last quarter's data will drift from current reality. Weekly retraining on production signal, made economically viable by efficient infrastructure (Tangle's intelligent caching, H200 GPUs, 12h run), keeps the model in alignment with the world it serves.&lt;/li&gt;
&lt;/ol&gt;
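&lt;p&gt;The Kappa target in takeaway 02 is easier to reason about with the arithmetic written out. Cohen's Kappa compares observed agreement with the agreement two raters would reach by chance: kappa = (observed - expected) / (1 - expected). The sketch below is a minimal illustration that assumes binary pass/fail labels; the class name, method name, and sample labels are hypothetical and are not taken from Shopify's tooling.&lt;/p&gt;

```java
// Hypothetical sketch: Cohen's Kappa between human labels and LLM-judge labels.
// Assumes binary labels (true = "acceptable"); all names here are illustrative.
final class KappaSketch {

    static double cohensKappa(boolean[] human, boolean[] judge) {
        if (human.length != judge.length || human.length == 0) {
            throw new IllegalArgumentException("label arrays must be non-empty and of equal length");
        }
        int n = human.length;
        int agree = 0;     // items where both raters gave the same label
        int humanYes = 0;  // items the human labelled true
        int judgeYes = 0;  // items the judge labelled true
        for (int i = 0; i != n; i++) {
            if (human[i] == judge[i]) agree++;
            if (human[i]) humanYes++;
            if (judge[i]) judgeYes++;
        }
        double observed = (double) agree / n;
        double pHumanYes = (double) humanYes / n;
        double pJudgeYes = (double) judgeYes / n;
        // Agreement expected by chance if the two raters labelled independently.
        double expected = pHumanYes * pJudgeYes + (1 - pHumanYes) * (1 - pJudgeYes);
        if (expected == 1.0) {
            // Both raters used one identical label for everything; Kappa is
            // undefined here, and this sketch chooses to report 1.0.
            return 1.0;
        }
        return (observed - expected) / (1 - expected);
    }

    public static void main(String[] args) {
        boolean[] human = {true, true, true, true, false, false, false, false};
        boolean[] judge = {true, true, true, false, false, false, false, true};
        boolean[] alwaysYes = {true, true, true, true, true, true, true, true};
        System.out.println(cohensKappa(human, judge));     // 6 of 8 agree, kappa 0.5
        System.out.println(cohensKappa(human, alwaysYes)); // 4 of 8 agree, kappa 0.0
    }
}
```

&lt;p&gt;The second call shows why raw agreement misleads: a judge that approves everything still matches the human half the time on this sample, yet its Kappa is zero because all of that agreement is what chance alone would produce.&lt;/p&gt;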

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Universal Pattern Across 1,200 Deployments&lt;/p&gt;

&lt;p&gt;ZenML's analysis of 1,200 production LLM deployments confirms Shopify's findings are not unique: &lt;strong&gt;the organizations extracting real value from AI are not the ones with the most innovative demos — they are the ones doing the less glamorous engineering work: building evaluation pipelines, implementing guardrails, designing for uncertainty, and treating their LLM systems with the same rigor they'd apply to any critical infrastructure.&lt;/strong&gt; The pattern is consistent across startups, mid-market, and enterprise. Model quality is table stakes. Evaluation infrastructure is competitive differentiation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EVALUATION INFRASTRUCTURE IS PRODUCT&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The merchant simulator, the calibrated LLM judge, the production mirroring pipeline, the golden dataset maintenance process — these are not internal tooling that engineers built for themselves. They are the &lt;strong&gt;product quality infrastructure&lt;/strong&gt; that Shopify's merchants depend on, even though they will never see it. Every improvement to the evaluation system is an improvement to Sidekick's and Flow's reliability. Building evaluation infrastructure is building the product. Teams that separate 'evaluation tooling' from 'product work' are misclassifying one of their highest-value investments.&lt;/p&gt;

&lt;p&gt;Shopify's engineers discovered that getting an AI to produce a correct Shopify Flow automation 80% of the time takes two weeks, and getting it to 95% takes the rest of the year — which is either a profound insight about probabilistic systems or a profound insight about how hard it is to write good evals for commerce automation, and it turns out to be both.&lt;br&gt;&lt;br&gt;
&lt;cite&gt;TechLogStack — built at scale, broken in public, rebuilt by engineers&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://techlogstack.com/explore/shopify-llm-evaluation-production-2025/" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;&lt;/strong&gt; (interactive diagrams, source links, and the full reader experience).&lt;/p&gt;

</description>
      <category>reliability</category>
      <category>shopify</category>
      <category>machinelearning</category>
      <category>engineering</category>
    </item>
    <item>
      <title>LinkedIn Needed a Message Queue. They Built the One the Entire Internet Runs On.</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/linkedin-needed-a-message-queue-they-built-the-one-the-entire-internet-runs-on-24j1</link>
      <guid>https://dev.to/techlogstack/linkedin-needed-a-message-queue-they-built-the-one-the-entire-internet-runs-on-24j1</guid>
      <description>&lt;p&gt;&lt;strong&gt;LinkedIn&lt;/strong&gt; · Messaging · 19 May 2026&lt;/p&gt;

&lt;p&gt;In 2010, LinkedIn was drowning in data it couldn't move. Every ML model, every recommendation engine, every real-time feature was starving because there was no reliable way to get activity data from the website into the systems that needed it. Jay Kreps, Jun Rao, and Neha Narkhede spent a year building a fix. They named it after Franz Kafka. The rest of the internet adopted it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1B events/day at launch (2011)&lt;/li&gt;
&lt;li&gt;1T messages/day by 2015&lt;/li&gt;
&lt;li&gt;7T messages/day by 2019&lt;/li&gt;
&lt;li&gt;Named after Franz Kafka&lt;/li&gt;
&lt;li&gt;Confluent founded 2014&lt;/li&gt;
&lt;li&gt;80%+ of Fortune 100 run it today&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;p&gt;In 2010, LinkedIn was a growing professional network with a problem that every ambitious data-driven company eventually hits: a massive accumulation of valuable activity data that was effectively &lt;strong&gt;locked inside the systems that generated it&lt;/strong&gt;. Every page view, every job click, every connection request, every search query was data. LinkedIn's ML engineers wanted that data to train recommendation models. LinkedIn's analytics engineers wanted it to understand user behavior. LinkedIn's search engineers needed it to keep the index fresh within seconds of updates. But the pipelines connecting these systems to their data sources were a fragile, inconsistent web of point-to-point integrations — each one custom-built, each one brittle, none of them sharing any infrastructure. Jay Kreps, who was leading data infrastructure engineering at LinkedIn, later described the root cause with characteristic directness: "Everyone wanted to build fancy machine-learning algorithms, but without the data, the algorithms were useless. Getting the data from source systems and reliably moving it around was very difficult."&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📊&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before Kafka, LinkedIn's data architecture had an &lt;strong&gt;N×M integration problem&lt;/strong&gt;: every data source needed a custom pipeline to every data destination. With dozens of source systems (N) and dozens of consumer systems (M), engineers were maintaining hundreds of individual pipelines, each with its own error handling, schema management, and operational burden. Adding one new data source meant writing up to M new pipelines, one per consumer. Adding one new consumer meant writing up to N new pipelines, one per source.&lt;/p&gt;
&lt;/blockquote&gt;
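&lt;p&gt;The integration problem is a counting argument: point-to-point wiring grows with the product of sources and destinations, while a shared log grows with their sum. A small sketch makes the difference concrete. The figure of 24 systems on each side is an assumed illustration of "dozens", not a number from LinkedIn, and the class and method names are hypothetical.&lt;/p&gt;

```java
// Hypothetical sketch: how many integrations must be built and maintained
// with point-to-point pipelines versus a single shared log.
final class PipelineCountSketch {

    // Every source needs its own pipeline to every destination.
    static long pointToPoint(long sources, long destinations) {
        return sources * destinations;
    }

    // Every source writes to the log once; every destination reads from it once.
    static long viaCentralLog(long sources, long destinations) {
        return sources + destinations;
    }

    public static void main(String[] args) {
        long sources = 24;       // assumed for illustration
        long destinations = 24;  // assumed for illustration
        System.out.println("point-to-point: " + pointToPoint(sources, destinations)); // 576
        System.out.println("shared log:     " + viaCentralLog(sources, destinations)); // 48
        // Adding one more destination costs 24 new pipelines in the first
        // model and exactly one new reader in the second.
    }
}
```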

&lt;p&gt;Kreps, alongside Jun Rao (who had joined from IBM's database group) and Neha Narkhede (who had come from Oracle), evaluated every existing solution. &lt;strong&gt;Traditional message queues&lt;/strong&gt; — &lt;em&gt;ActiveMQ&lt;/em&gt; (an open-source message broker implementing the JMS specification, designed for reliable, ordered message delivery between enterprise applications), &lt;em&gt;RabbitMQ&lt;/em&gt; (a message broker built around the AMQP protocol, designed for flexible routing, delivery guarantees, and complex messaging patterns) — were designed for a different problem. They offered rich delivery guarantees, complex routing, and transaction semantics, but they were built for the &lt;em&gt;reliable delivery of individual task messages&lt;/em&gt;, not for the &lt;em&gt;high-throughput streaming of millions of activity events&lt;/em&gt;. The broker in these systems tracked the delivery state of every message — consuming memory and CPU proportional to the number of outstanding messages. They were designed for near-immediate consumption. They could not handle the situation where a Hadoop job needed to replay yesterday's activity data. They could not scale to the throughput LinkedIn needed. Most critically: their &lt;strong&gt;per-message overhead was enormous&lt;/strong&gt;. ActiveMQ's message format had 144 bytes of overhead per message. LinkedIn needed to process millions of messages per second.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;THE INSIGHT: TREAT DATA MOVEMENT LIKE A LOG&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The founding insight of Kafka was recognizing that LinkedIn's data movement problem was not a messaging problem but a &lt;strong&gt;log problem&lt;/strong&gt;. Databases have used append-only logs for decades: the &lt;em&gt;write-ahead log&lt;/em&gt; (a sequential record of all changes made to a database, written before the changes are applied and used for crash recovery, replication, and point-in-time restoration) is how MySQL, Postgres, and most production databases achieve durability and replication. Jay Kreps asked: what if the data pipeline itself were an append-only log? Producers append events. Consumers read them at their own pace. The log retains messages for a configured period. Any consumer can replay from any retained point. The broker doesn't track per-consumer delivery state. The simplicity unlocked everything.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  LinkedIn's Data Was Locked in Silos
&lt;/h4&gt;

&lt;p&gt;By 2010, LinkedIn had dozens of data source systems and dozens of consumer systems — ML models, analytics pipelines, search indexers, real-time features — all of which needed the same activity stream data. Point-to-point custom pipelines were the solution, but maintaining hundreds of them was unsustainable. The existing messaging systems (ActiveMQ, RabbitMQ) couldn't handle LinkedIn's throughput requirements and were designed for task queues, not event streams.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  No Tool Existed for High-Throughput Real-Time Event Streaming
&lt;/h4&gt;

&lt;p&gt;The problem in 2010 had two halves: batch systems (Hadoop) could handle large volumes but only hours later; traditional message queues could deliver in real-time but couldn't scale to LinkedIn's volume or support replay. There was no system that provided high throughput, low latency, durability, replayability, and horizontal scalability simultaneously. The three engineers concluded that the tool they needed did not exist.&lt;/p&gt;




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  One Year Building Kafka: The Append-Only Distributed Log
&lt;/h4&gt;

&lt;p&gt;Kreps, Rao, and Narkhede spent approximately one year building the first version of Kafka. The core architectural decision was treating the message store as an append-only log rather than a queue. This single choice enabled sequential disk I/O (orders of magnitude faster than random I/O), stateless brokers (consumers track their own position), arbitrary replay (consumers read from any offset), and horizontal partitioning (each partition is an independent log that scales independently).&lt;/p&gt;
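&lt;p&gt;The log-instead-of-queue idea fits in a few lines. The sketch below is an in-memory toy, not Kafka's implementation: it has no disk, no partitions, and no replication, and every name in it is hypothetical. It only shows the two properties the paragraph above relies on, namely that writers append and nothing else, and that the log keeps no record of who has read what.&lt;/p&gt;

```java
import java.util.Arrays;

// Hypothetical in-memory sketch of an append-only log with consumer-owned offsets.
final class AppendOnlyLogSketch {
    private String[] records = new String[8];
    private int size = 0;

    // Producers only ever append; the return value is the new record's offset.
    long append(String record) {
        if (size == records.length) {
            records = Arrays.copyOf(records, size * 2);
        }
        records[size] = record;
        return size++;
    }

    // The log stores no per-consumer state: the caller says where to start.
    String[] read(long fromOffset, int maxRecords) {
        if (0 > fromOffset || fromOffset > size || 0 > maxRecords) {
            throw new IllegalArgumentException("offset or maxRecords out of range");
        }
        int start = (int) fromOffset;
        int end = Math.min(size, start + maxRecords);
        return Arrays.copyOfRange(records, start, end);
    }

    public static void main(String[] args) {
        AppendOnlyLogSketch log = new AppendOnlyLogSketch();
        log.append("page_view");
        log.append("job_click");
        log.append("search");

        // A real-time consumer keeps its own position and advances it itself.
        long indexerOffset = 0;
        String[] batch = log.read(indexerOffset, 2);
        indexerOffset += batch.length;

        // A batch job replays from offset 0 without disturbing the indexer.
        String[] replay = log.read(0, 10);
        System.out.println("indexer at " + indexerOffset + ", replayed " + replay.length);
    }
}
```

&lt;p&gt;Because the reader supplies the offset, a slow consumer, a fast consumer, and a replaying batch job can all share the same stored records without the log doing any bookkeeping on their behalf.&lt;/p&gt;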




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1 Billion Events Per Day at Launch, 7 Trillion by 2019
&lt;/h4&gt;

&lt;p&gt;Kafka went into production at LinkedIn in 2011 and immediately became the backbone of the company's real-time infrastructure. At launch it was ingesting over 1 billion events per day. LinkedIn open-sourced it in early 2011. It became an Apache Top-Level Project in October 2012. By 2015 it was processing 1 trillion messages per day. By 2019, 7 trillion. Kreps, Narkhede, and Rao left LinkedIn in November 2014 to found Confluent, building the commercial ecosystem around Kafka.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;I thought that since Kafka was a system optimized for writing, using a writer's name would make sense. I had taken a lot of lit classes in college and liked Franz Kafka. Plus the name sounded cool for an open source project.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Jay Kreps, on naming Kafka after Franz Kafka (via Quora)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The name was chosen when the project was being prepared for open-sourcing. Jay Kreps was inspired by &lt;em&gt;Franz Kafka&lt;/em&gt; (the German-language novelist, 1883–1924, known for works including The Metamorphosis, The Trial, and The Castle, which explore themes of alienation, bureaucracy, and transformation), a writer whose work Kreps admired and whose name, he felt, suited a system built for writing. The practical truth is that the naming was an afterthought. The engineering came first. In the original academic paper published at the NetDB workshop in June 2011, the system is described without literary flourish: a distributed messaging system for log processing, designed for high throughput, low latency, and horizontal scalability. The paper's benchmarks were direct: &lt;strong&gt;Kafka's producer throughput was more than an order of magnitude higher than ActiveMQ's, and well ahead of RabbitMQ's&lt;/strong&gt;. The numbers were not close.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why LinkedIn's Data Architecture Needed Real-Time&lt;/p&gt;

&lt;p&gt;LinkedIn's core value proposition — showing you the right jobs, the right connections, the right content — required &lt;strong&gt;real-time signals&lt;/strong&gt;. If you search for "software engineers in San Francisco" and connect with three of them, LinkedIn's recommendations should update within seconds to reflect what your new connections know and who they know. With Hadoop batch jobs, this update happened hours later. With Kafka feeding real-time stream processing, &lt;strong&gt;updates became searchable within seconds of being posted&lt;/strong&gt;. The latency reduction from hours to seconds was not a technical nicety — it was the product feature that made LinkedIn's social graph feel alive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What Existing Systems Got Wrong at Scale&lt;/p&gt;

&lt;p&gt;The original Kafka paper published the benchmark numbers without ceremony. LinkedIn configured a single producer to publish 10 million messages of 200 bytes each. Kafka with batch size 50: &lt;strong&gt;~50MB/sec&lt;/strong&gt;. ActiveMQ: &lt;strong&gt;~2MB/sec&lt;/strong&gt;. RabbitMQ: slightly better than ActiveMQ but far below Kafka. The gap was not 10% or even 2x — it was an order of magnitude. The performance difference traced directly to Kafka's design: sequential disk writes, zero per-message broker state, batched I/O, and a message format with 9 bytes of overhead versus ActiveMQ's 144 bytes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Jay Kreps wrote one of the most-cited engineering essays of the last decade: &lt;strong&gt;"The Log: What Every Software Engineer Should Know About Real-Time Data's Unifying Abstraction"&lt;/strong&gt; (LinkedIn Engineering, 2013). The essay argued that the append-only log was not just a Kafka implementation detail — it was a &lt;strong&gt;universal primitive&lt;/strong&gt; for distributed systems. Databases use it for replication. Kafka uses it for messaging. Stream processors use it for state. The essay made the case that any system that needs to integrate data across multiple consumers should be built on a log, not on point-to-point integrations. At the time Kreps wrote the essay, LinkedIn was running over &lt;strong&gt;60 billion unique message writes through Kafka per day&lt;/strong&gt; — several hundred billion counting cross-datacenter replication. The argument was not theoretical.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LinkedIn's Real-Time Feature Hunger&lt;/p&gt;

&lt;p&gt;LinkedIn's product ambitions in 2010 were fundamentally real-time: &lt;strong&gt;who viewed your profile in the last 24 hours?&lt;/strong&gt; Which of your connections just updated their job title? When a recruiter posts a job matching your skills, how fast does it appear in your feed? These features required that activity data flowing from the website into the recommendation and notification systems be fresh — not hours old. The batch pipeline to Hadoop was adequate for weekly model training but useless for features that needed sub-minute freshness. Kafka was not just a performance improvement over existing infrastructure; it was the prerequisite for an entire class of real-time product features that didn't yet exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1 BILLION EVENTS PER DAY — IMMEDIATELY&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When Kafka went into production at LinkedIn in 2011, it was immediately processing &lt;strong&gt;more than 1 billion events per day&lt;/strong&gt;. This was not a gradual ramp. The scale was there from day one because LinkedIn's existing activity volume was already at that level; Kafka simply replaced the fragile point-to-point pipelines that had been handling it. The immediate billion-event scale validated the architecture under real production conditions within weeks of launch. It also meant the open-source release that same year came with a credibility that mattered: this was not a research prototype. It was a system already running at significant scale.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Five Design Decisions That Made Kafka Fast
&lt;/h3&gt;

&lt;p&gt;Kafka's performance advantage over existing systems was not the result of clever optimization of a standard architecture. It was the result of choosing a fundamentally different architecture, where every key design decision reinforced the same goal: &lt;strong&gt;maximize throughput for streaming event data&lt;/strong&gt;. Five decisions stand out as architecturally defining — and each was a deliberate rejection of how existing messaging systems had been built.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;~50MB/s&lt;/strong&gt; — Kafka producer throughput in the original 2011 benchmark — versus ~2MB/s for ActiveMQ at the same message size (200 bytes, 10M messages)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;9 bytes&lt;/strong&gt; — Per-message overhead in Kafka, versus 144 bytes in ActiveMQ: 16x less framing per message. For the benchmark's 200-byte payloads that is about 209 bytes stored per message instead of 344, so roughly 1.6x more messages fit in the same disk space&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateless&lt;/strong&gt; — Kafka brokers — consumer offset tracking is done by the consumer, not the broker, eliminating the broker memory pressure that crippled traditional queues at scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequential&lt;/strong&gt; — Disk access pattern for both writes and reads — append-only to the log means no random I/O, allowing Kafka to push disk throughput to near hardware limits
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// The five key Kafka design decisions in code form&lt;/span&gt;

&lt;span class="c1"&gt;// DECISION 1: Append-only log storage (not a queue)&lt;/span&gt;
&lt;span class="c1"&gt;// Each partition is a directory of segment files, appended to sequentially&lt;/span&gt;
&lt;span class="c1"&gt;// /kafka-logs/my-topic-0/00000000000000000000.log&lt;/span&gt;
&lt;span class="c1"&gt;// /kafka-logs/my-topic-0/00000000000000100000.log (new segment at 100k messages)&lt;/span&gt;
&lt;span class="c1"&gt;// → Sequential writes: disk seeks are expensive; sequential I/O is 100x faster&lt;/span&gt;

&lt;span class="c1"&gt;// DECISION 2: Consumer tracks its own offset&lt;/span&gt;
&lt;span class="c1"&gt;// The broker doesn't care what consumers have read — it just serves bytes&lt;/span&gt;
&lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;consumerOffset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;position&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topicPartition&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// consumer owns this&lt;/span&gt;
&lt;span class="c1"&gt;// → Brokers are stateless: no per-consumer memory, no ack tracking overhead&lt;/span&gt;
&lt;span class="c1"&gt;// → Consumers can replay any time: just reset the offset&lt;/span&gt;
&lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;seek&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topicPartition&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// replay from the beginning&lt;/span&gt;

&lt;span class="c1"&gt;// DECISION 3: Topics are partitioned for horizontal scale&lt;/span&gt;
&lt;span class="c1"&gt;// Each partition is an independent log — producers and consumers parallelise&lt;/span&gt;
&lt;span class="nc"&gt;ProducerRecord&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"user-activity"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// partition key: same user → same partition = ordered&lt;/span&gt;
        &lt;span class="n"&gt;eventJson&lt;/span&gt; &lt;span class="c1"&gt;// the message&lt;/span&gt;
    &lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// → Topic with N partitions can be consumed by N consumers in parallel&lt;/span&gt;
&lt;span class="c1"&gt;// → Add brokers, add partitions: linear throughput scaling&lt;/span&gt;

&lt;span class="c1"&gt;// DECISION 4: Batch I/O from client to broker&lt;/span&gt;
&lt;span class="n"&gt;props&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"batch.size"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16384&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// batch up to 16KB before sending&lt;/span&gt;
&lt;span class="n"&gt;props&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"linger.ms"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// or wait up to 5ms for the batch&lt;/span&gt;
&lt;span class="c1"&gt;// The original paper: batch size 50 improved throughput by ~10x vs batch size 1&lt;/span&gt;

&lt;span class="c1"&gt;// DECISION 5: Zero-copy transfer using sendfile()&lt;/span&gt;
&lt;span class="c1"&gt;// When a consumer fetches data, Kafka uses OS sendfile() syscall:&lt;/span&gt;
&lt;span class="c1"&gt;// data goes directly disk → network socket, bypassing userspace entirely&lt;/span&gt;
&lt;span class="c1"&gt;// → No data copy into JVM heap → no GC pressure → consistent low latency&lt;/span&gt;
&lt;span class="c1"&gt;// This is why Kafka can deliver data nearly as fast as the network allows&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;THE STATELESS BROKER: THE COUNTERINTUITIVE MASTERSTROKE&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The most counterintuitive decision in Kafka's design is making the broker stateless with respect to consumers: the broker doesn't track which consumers have read which messages. In ActiveMQ and RabbitMQ, the broker maintains delivery state for every message: who acknowledged it, who hasn't, what needs to be retried. At scale, this per-message state tracking consumes enormous memory and creates a bottleneck. Kafka's solution was radical: &lt;strong&gt;let consumers track their own position&lt;/strong&gt; (their offset in each partition). The broker just stores bytes in a log. Consumers read at their own pace, commit their offset to ZooKeeper (later to a Kafka topic itself), and can reset to any retained offset to replay. The broker's memory footprint is constant regardless of consumer count or message backlog.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Kafka vs Traditional Message Queues: Architectural Comparison (2011 original benchmarks and design properties)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;ActiveMQ / RabbitMQ&lt;/th&gt;
&lt;th&gt;Kafka&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Storage model&lt;/td&gt;
&lt;td&gt;Queue (messages deleted after ack)&lt;/td&gt;
&lt;td&gt;Append-only log (messages retained by time/size)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Broker state&lt;/td&gt;
&lt;td&gt;Tracks ack state per message per consumer&lt;/td&gt;
&lt;td&gt;Stateless — consumers track own offset&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Producer throughput (bench)&lt;/td&gt;
&lt;td&gt;~2 MB/sec (ActiveMQ)&lt;/td&gt;
&lt;td&gt;~50 MB/sec (batch size 50)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Message overhead&lt;/td&gt;
&lt;td&gt;144 bytes (ActiveMQ, JMS header)&lt;/td&gt;
&lt;td&gt;9 bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Consumer replay&lt;/td&gt;
&lt;td&gt;Not supported (message gone after ack)&lt;/td&gt;
&lt;td&gt;Supported — seek to any offset&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Horizontal scale&lt;/td&gt;
&lt;td&gt;Limited (complex cluster configs)&lt;/td&gt;
&lt;td&gt;Native — add partitions, add consumers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use case fit&lt;/td&gt;
&lt;td&gt;Task queues, guaranteed delivery, routing&lt;/td&gt;
&lt;td&gt;Event streaming, log aggregation, activity tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What LinkedIn Actually Used Kafka For&lt;/p&gt;

&lt;p&gt;By 2019, Kafka was the circulatory system of LinkedIn's infrastructure. &lt;strong&gt;Activity tracking&lt;/strong&gt; (the original use case): every page view, search, ad impression fed to both Hadoop and real-time processors. &lt;strong&gt;Real-time search indexing&lt;/strong&gt;: profile updates searchable within seconds. &lt;strong&gt;Database replication&lt;/strong&gt;: Espresso CDC via Kafka replaced MySQL replication. &lt;strong&gt;Inter-service messaging&lt;/strong&gt;: microservices decoupled through Kafka topics. &lt;strong&gt;Stream processing&lt;/strong&gt;: Apache Samza (LinkedIn's open-source stream processor) used Kafka as both input and durable state store. Every part of LinkedIn's data plane ran on Kafka.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Zero-Copy: The OS Kernel Trick That Doubled Throughput&lt;/p&gt;

&lt;p&gt;One of Kafka's most impactful performance optimizations is invisible to application code: &lt;strong&gt;zero-copy data transfer&lt;/strong&gt; using the OS-level &lt;code&gt;sendfile()&lt;/code&gt; syscall. In a traditional data transfer, data moves from disk to kernel buffer, kernel buffer to userspace, userspace to socket buffer, socket buffer to network. In Kafka's consumer path, the OS &lt;code&gt;sendfile()&lt;/code&gt; call transfers data directly from the page cache (disk buffer) to the network socket, bypassing userspace entirely. This means no data is copied into the JVM heap — no GC pressure, no object allocation overhead. At LinkedIn's throughput rates, this optimization alone is responsible for significant throughput gains and, more importantly, for Kafka's consistent low latency even under high load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Open-Source Flywheel&lt;/p&gt;

&lt;p&gt;LinkedIn open-sourced Kafka in early 2011 — before it was even an Apache project. The decision to share the core infrastructure was not philanthropic; it was strategic. LinkedIn's engineers knew that the data pipeline problem they had solved was universal. By open-sourcing Kafka, they attracted contributions from engineers at Netflix, Uber, Twitter, and hundreds of other companies — all of whom had the same problem. The community built tooling LinkedIn would never have had resources to build alone: Kafka Streams, Kafka Connect, ksqlDB, MirrorMaker, Schema Registry. The open-source flywheel turned a LinkedIn internal tool into the internet's standard real-time data infrastructure.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Kafka's architecture has three layers. The &lt;strong&gt;storage layer&lt;/strong&gt; is a set of partitioned, replicated append-only log files on disk — each partition is an independent, totally ordered sequence of records. The &lt;strong&gt;broker layer&lt;/strong&gt; is a cluster of server processes that manage partition assignment, replication, and client connections — but hold no consumer state. The &lt;strong&gt;client layer&lt;/strong&gt; is producers writing to partitions and consumer groups reading from them, each consumer group maintaining its own independent offset position per partition. Understanding why this architecture outperforms traditional queues requires visualizing the data flow.&lt;/p&gt;
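&lt;p&gt;The step that connects the client layer to the storage layer is key-based routing: a record's key decides which partition's log it is appended to, which is what keeps one user's events in order. Here is a minimal sketch of that idea. It uses &lt;code&gt;String.hashCode()&lt;/code&gt; as a stand-in hash, which is an assumption made for brevity; Kafka's Java client hashes the serialized key bytes with a different function, so the partition numbers this sketch produces will not match a real cluster. The class and method names are hypothetical.&lt;/p&gt;

```java
// Hypothetical sketch of key-based partition routing.
// String.hashCode() is a stand-in; it is not the hash Kafka's client uses.
final class PartitionSketch {

    static int partitionFor(String key, int numPartitions) {
        if (1 > numPartitions) {
            throw new IllegalArgumentException("numPartitions must be at least 1");
        }
        // floorMod keeps the result in [0, numPartitions) even for negative hashes.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 8; // assumed for illustration
        // The same key always maps to the same partition, so that key's
        // events land in one log and keep their relative order.
        System.out.println(partitionFor("user-42", partitions) == partitionFor("user-42", partitions));
        System.out.println(partitionFor("user-7", partitions));
    }
}
```

&lt;p&gt;One consequence of this design: changing the partition count changes the mapping, so events for an existing key can start landing in a different partition than before.&lt;/p&gt;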

&lt;h3&gt;
  
  
  Before Kafka: N×M Integration Spaghetti
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/linkedin-kafka-origin-2011/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Interactive diagram available on TechLogStack (link above).&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  After Kafka: The Centralized Log Hub
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/linkedin-kafka-origin-2011/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Inside Kafka: Topics, Partitions, Offsets, and Consumer Groups
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/linkedin-kafka-origin-2011/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;THE LOG/TABLE DUALITY: JAY KREPS' DEEPER INSIGHT&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In his 2013 essay "The Log," Kreps articulated a concept that went beyond Kafka's implementation: the &lt;em&gt;log/table duality&lt;/em&gt; (a log of changes and a table of current state are two views of the same underlying truth). Every database table can be derived by replaying its change log from the beginning. Every stream of events can be materialized into a table by applying each event as a state update. This duality means a &lt;strong&gt;Kafka topic is simultaneously a stream and a database&lt;/strong&gt;: you can process it as a stream in motion (stream processing) or materialize it as a snapshot (a table). This insight later became the foundation for Kafka Streams, ksqlDB, and much of the stream-processing ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LinkedIn's Kafka by 2019: The Numbers&lt;/p&gt;

&lt;p&gt;By 2019, LinkedIn's Kafka deployment had become one of the largest publicly documented distributed systems in existence: &lt;strong&gt;7 trillion messages per day&lt;/strong&gt;, spread across &lt;strong&gt;100+ clusters&lt;/strong&gt;, &lt;strong&gt;4,000+ brokers&lt;/strong&gt;, &lt;strong&gt;100,000+ topics&lt;/strong&gt;, and &lt;strong&gt;7 million partitions&lt;/strong&gt;. Each message was consumed by approximately &lt;strong&gt;four consumer groups&lt;/strong&gt; on average. The cross-datacenter replication system (Brooklin) was itself mirroring over 7 trillion messages per day. From 1 billion events per day at launch in 2011 to 7 trillion by 2019: a &lt;strong&gt;7,000x growth&lt;/strong&gt; in eight years on the same fundamental architecture.&lt;/p&gt;
&lt;/blockquote&gt;
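&lt;p&gt;The log/table duality is easiest to see by replaying a small changelog. The sketch below is a toy under stated assumptions: each log entry is a key/value pair, a null value means "delete this key", and &lt;code&gt;java.util.Properties&lt;/code&gt; stands in for a table. None of the names come from Kafka Streams or ksqlDB; they are hypothetical.&lt;/p&gt;

```java
import java.util.Properties;

// Hypothetical sketch: materialize a table by replaying a changelog.
final class LogTableDualitySketch {

    // Applies entries [0, upToOffset) in order. Each entry is {key, value};
    // a null value is treated as a delete (an assumption of this sketch).
    static Properties materialize(String[][] changelog, int upToOffset) {
        Properties table = new Properties();
        for (int offset = 0; offset != upToOffset; offset++) {
            String key = changelog[offset][0];
            String value = changelog[offset][1];
            if (value == null) {
                table.remove(key);
            } else {
                table.setProperty(key, value);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        String[][] changelog = {
            {"alice", "Engineer"},
            {"bob", "Recruiter"},
            {"alice", "Staff Engineer"},
            {"bob", null},
        };
        // Replaying the whole log yields the current table...
        System.out.println(materialize(changelog, 4).getProperty("alice")); // Staff Engineer
        // ...and replaying a prefix yields the table as it was at that offset.
        System.out.println(materialize(changelog, 2).getProperty("alice")); // Engineer
    }
}
```

&lt;p&gt;The same log therefore answers two questions: "what is the state now?" (replay everything) and "what was the state at offset k?" (replay a prefix). The table alone can only answer the first.&lt;/p&gt;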




&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;p&gt;Kafka is fifteen years old, and more than 80% of the Fortune 100 run it. Its success is not luck: it emerged directly from the architectural decisions Jay Kreps, Jun Rao, and Neha Narkhede made in 2010. The lessons here are about identifying the right abstraction, challenging assumptions baked into existing tools, and the compounding returns of open-sourcing infrastructure.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;01.&lt;/strong&gt;  &lt;strong&gt;Before building, verify no existing tool solves your problem at your scale.&lt;/strong&gt; The Kafka team evaluated ActiveMQ, RabbitMQ, and existing log aggregation systems before building. Their conclusion — existing tools were designed for the wrong problem — was evidence-based. The benchmark comparison (50 MB/sec vs 2 MB/sec) made the decision concrete. Never rebuild what can be adopted; never adopt what demonstrably can't serve your workload.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;02.&lt;/strong&gt; &lt;em&gt;The append-only log&lt;/em&gt; (a data structure where records are only ever added to the end, never modified in place — enabling sequential I/O, arbitrary consumer replay, and stateless brokers) is the universal data integration primitive. Any system that moves data between producers and consumers is implementing a log, whether it knows it or not. The explicit recognition of this pattern — and building directly on it — is what gave Kafka its performance advantage and its flexibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;03.&lt;/strong&gt;  &lt;strong&gt;Stateless brokers make systems horizontally scalable in ways stateful brokers cannot match.&lt;/strong&gt; When the broker tracks delivery state per consumer per message, broker memory scales with consumers × outstanding messages. When consumers track their own offsets, broker memory scales with partitions only. This seemingly small architectural choice is why Kafka can serve hundreds of consumer groups without broker degradation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;04.&lt;/strong&gt; Sequential I/O is dramatically faster than random I/O on both HDDs and SSDs. &lt;strong&gt;An append-only log turns a bursty stream of writes into sequential disk operations&lt;/strong&gt;, allowing Kafka to approach disk hardware throughput limits. Systems that update records in-place pay random I/O costs on every write. Kafka writes append-only and leverages the OS page cache for reads, achieving throughput that surprised the entire industry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;05.&lt;/strong&gt; Open-sourcing infrastructure that solves a universal problem creates compounding returns. &lt;strong&gt;LinkedIn open-sourced Kafka in 2011 because the team recognized it solved a problem every data-intensive company had.&lt;/strong&gt; The community contributions, ecosystem tools (Kafka Streams, Connect, ksqlDB), and widespread adoption that followed made Kafka better than LinkedIn could have built alone. Netflix, Uber, Goldman Sachs, and thousands of other companies now run Kafka — and improvements they contribute flow back to LinkedIn. The return on open-sourcing infrastructure is measured in ecosystem, not just code.&lt;/li&gt;
&lt;/ol&gt;
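&lt;p&gt;Lessons 02 and 03 can be illustrated together with a toy model. The sketch below is an in-memory stand-in, not Kafka's client API: the class names &lt;code&gt;PartitionLog&lt;/code&gt; and &lt;code&gt;Consumer&lt;/code&gt; are hypothetical. What it shows is that the log holds no per-consumer state, so adding consumers costs the log nothing and replay is only a change of position.&lt;/p&gt;

```python
from typing import List


class PartitionLog:
    """Append-only log: records are only added at the end, never changed."""

    def __init__(self) -> None:
        self._records: List[bytes] = []

    def append(self, record: bytes) -> int:
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset: int, max_records: int) -> List[bytes]:
        # The log keeps no per-consumer state; the caller supplies the offset.
        return self._records[offset:offset + max_records]


class Consumer:
    """Each consumer remembers its own position in the log."""

    def __init__(self, log: PartitionLog, offset: int = 0) -> None:
        self.log = log
        self.offset = offset

    def poll(self, max_records: int = 10) -> List[bytes]:
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch

    def seek(self, offset: int) -> None:
        self.offset = offset  # replay is just moving the position back


log = PartitionLog()
for event in (b"page_view", b"click", b"purchase"):
    log.append(event)

analytics = Consumer(log)
audit = Consumer(log)
first = analytics.poll(max_records=2)
everything = audit.poll()
analytics.seek(0)
replayed = analytics.poll()
```

&lt;p&gt;The two consumers read at different speeds without the log tracking either of them, and the analytics consumer re-reads the full history after seeking back to offset 0.&lt;/p&gt;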

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From 1 Billion to 7 Trillion: The Same Architecture&lt;/p&gt;

&lt;p&gt;The most remarkable fact about Kafka's growth is that the core architecture described in the 2011 paper — append-only partitioned log, stateless brokers, consumer-side offsets — is still the architecture running at 7 trillion messages per day in 2019. The system scaled &lt;strong&gt;7,000x on the same fundamental design&lt;/strong&gt;. Operational complexity grew (Cruise Control for rebalancing, Burrow for consumer lag monitoring, Brooklin for cross-datacenter replication), but the append-only log at the center of it all never needed to be replaced. Good architecture ages well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;THE PLATFORM THAT MADE KAFKA A COMPANY&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In November 2014, three years after Kafka's public launch, Jay Kreps, Neha Narkhede, and Jun Rao left LinkedIn to found &lt;strong&gt;Confluent&lt;/strong&gt;, a company built to provide enterprise Kafka services, managed Kafka infrastructure, and the commercial ecosystem around the open-source project. Confluent went public in 2021 at a multibillion-dollar valuation. The path from LinkedIn internal tool to public company in roughly a decade is one of the most compelling data points for the value of open-sourcing well-designed infrastructure. The tool built to solve LinkedIn's data pipeline problem had become the data pipeline solution for much of the internet.&lt;/p&gt;

&lt;p&gt;Jay Kreps named Kafka after Franz Kafka because it was 'a system optimized for writing' — and then built something that the entire internet writes 7 trillion messages through per day, which is exactly the kind of outcome Franz Kafka would have found deeply, cosmically absurd.&lt;br&gt;&lt;br&gt;
&lt;cite&gt;TechLogStack — built at scale, broken in public, rebuilt by engineers&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://techlogstack.com/explore/linkedin-kafka-origin-2011/" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;&lt;/strong&gt; (interactive diagrams, source links, and the full reader experience).&lt;/p&gt;

</description>
      <category>messaging</category>
      <category>linkedin</category>
      <category>apachekafka</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Datadog Went Dark for 24 Hours and Came Back With a Different Philosophy</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/datadog-went-dark-for-24-hours-and-came-back-with-a-different-philosophy-1692</link>
      <guid>https://dev.to/techlogstack/datadog-went-dark-for-24-hours-and-came-back-with-a-different-philosophy-1692</guid>
      <description>&lt;p&gt;&lt;strong&gt;Datadog&lt;/strong&gt; · Reliability · 18 May 2026&lt;/p&gt;

&lt;p&gt;On March 8, 2023, Datadog — the platform engineers use to know when their own infrastructure is broken — broke. For more than 24 hours, across five regions on three cloud providers, metrics stopped arriving, logs disappeared, and dashboards showed nothing. The people whose job was to fix it couldn't see what was happening. It cost $5 million. It changed how Datadog thinks about building software.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;24h+ global outage&lt;/li&gt;
&lt;li&gt;$5M revenue loss&lt;/li&gt;
&lt;li&gt;50–60% Kubernetes nodes lost&lt;/li&gt;
&lt;li&gt;5 regions, 3 cloud providers&lt;/li&gt;
&lt;li&gt;All affected simultaneously&lt;/li&gt;
&lt;li&gt;Philosophy shift: graceful degradation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;We had built with the assumption that the only way to handle failure was to prevent it entirely — or to stop everything — rather than finding ways to degrade gracefully and continue delivering value to customers, even under extreme conditions.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;— Laura de Vesine, Rob Thomas, Maciej Kowalewski, via Datadog Engineering Blog&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At 01:31 EST on March 8, 2023, Datadog experienced its first global outage — every region, every cloud provider, simultaneously. The company that monitors the infrastructure of thousands of other companies could not monitor its own. Dashboards loaded but displayed no data. &lt;strong&gt;Logs, metrics, alerting, and traces were all unavailable.&lt;/strong&gt; The engineers whose job was to diagnose and fix the outage were operating without the observability tools that Datadog itself provides. It lasted over 24 hours. It cost $5 million in direct revenue. And it forced a fundamental rethink of how Datadog builds reliable systems.&lt;/p&gt;

&lt;p&gt;The immediate cause was disarmingly mundane: an &lt;strong&gt;automated&lt;/strong&gt; &lt;em&gt;systemd&lt;/em&gt; (the init system and service manager used by most modern Linux distributions; it starts processes, manages services, and handles system initialization) update was applied to Datadog's Ubuntu-based virtual machines across all regions simultaneously. The update came through a legacy security patch mechanism. Datadog had since built a modern lifecycle automation system for all nodes, but the legacy channel was still active and executed its update across the global fleet without a staged rollout, health gates, or any human awareness. The update triggered a restart of &lt;em&gt;systemd-networkd&lt;/em&gt; (the systemd component responsible for managing network interfaces on Linux hosts) that &lt;strong&gt;removed network routes from the machines as they came back up&lt;/strong&gt;. Nodes that had been reachable moments earlier simply vanished from the cluster.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;THE CIRCULAR DEPENDENCY TRAP&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The worst part was not that 50–60% of Kubernetes nodes lost network connectivity — it was what those nodes were running. Among the VMs brought down by the network route removal were the VMs powering Datadog's &lt;strong&gt;regionalized control planes based on&lt;/strong&gt; &lt;em&gt;Cilium&lt;/em&gt; (a cloud-native networking platform for Kubernetes that uses eBPF to provide networking, security, and observability for containerized workloads). The control plane going down meant Kubernetes couldn't schedule new pods, auto-repair failed nodes, or scale workloads to compensate. The very system that should have responded to the failure was among the first things the failure took down. This circular dependency — &lt;strong&gt;the recovery mechanism depending on the infrastructure that failed&lt;/strong&gt; — is what turned a 50% node loss into a nearly complete platform outage.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Simultaneous Global Node Loss at 01:31 EST
&lt;/h4&gt;

&lt;p&gt;A legacy automated Ubuntu security update channel applied a systemd update across Datadog's entire global fleet simultaneously — all five regions, all three cloud providers, all at once. The update caused a systemd-networkd restart interaction that removed network routing tables from nodes as they restarted. 50–60% of Kubernetes nodes lost network connectivity within minutes. Pages loaded but displayed no data. The outage was total from the customer perspective.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  The Control Plane Was in the Blast Radius
&lt;/h4&gt;

&lt;p&gt;The Kubernetes control plane — the cluster management layer responsible for scheduling, auto-repair, and scaling — was among the nodes that lost connectivity. This created a circular dependency: the recovery system needed the cluster to heal, but the cluster could not heal without the recovery system. Additionally, Datadog's multi-region, multi-cloud architecture provided no protection because the update was applied uniformly across all infrastructure simultaneously.&lt;/p&gt;
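&lt;p&gt;One way to catch this kind of circular dependency ahead of time is to compute the blast radius of a failure and check whether any recovery tooling sits inside it. The sketch below is illustrative only: the component names and the dependency map are hypothetical and do not describe Datadog's actual topology.&lt;/p&gt;

```python
from typing import Dict, List, Set

# Hypothetical dependency map: component -> things it needs in order to run.
DEPENDS_ON: Dict[str, List[str]] = {
    "node-network": [],
    "control-plane": ["node-network"],
    "auto-repair": ["control-plane"],
    "metrics-intake": ["node-network", "control-plane"],
    "out-of-band-console": [],
}


def blast_radius(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every component that transitively depends on the failed one."""
    affected = {failed}
    changed = True
    while changed:
        changed = False
        for component, needs in depends_on.items():
            if component not in affected and affected.intersection(needs):
                affected.add(component)
                changed = True
    return affected


def recovery_at_risk(failed: str, recovery_tools: Set[str]) -> Set[str]:
    """Recovery tools that would be taken down by the failure they should fix."""
    return recovery_tools.intersection(blast_radius(failed, DEPENDS_ON))


at_risk = recovery_at_risk("node-network", {"auto-repair", "out-of-band-console"})
```

&lt;p&gt;In this toy graph the auto-repair component is flagged because it needs the control plane, which needs the node network. The out-of-band console has no such dependency and stays usable during the failure.&lt;/p&gt;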




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Manual Node Recovery + Architecture Rethink
&lt;/h4&gt;

&lt;p&gt;Recovery required manual intervention: engineers identified and restarted affected nodes, restoring network routing and bringing Kubernetes control planes back online region by region. The legacy update channel was immediately disabled. But recovery took over 24 hours — far longer than the node loss itself — because services loaded large in-memory caches on startup that were slow to initialize, and the cluster lacked the spare capacity to absorb the sudden recovery surge.&lt;/p&gt;




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Full Recovery, New Philosophy
&lt;/h4&gt;

&lt;p&gt;Full service was restored after more than 24 hours. Datadog later published a detailed engineering blog describing not just what happened but the architectural shift it drove: away from &lt;em&gt;never-fail&lt;/em&gt; systems toward systems designed to &lt;strong&gt;degrade gracefully&lt;/strong&gt; when failure inevitably occurs. Published in October 2025, the blog documented two years of architectural work that followed directly from the March 2023 incident.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💸&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Datadog operates on usage-based billing — customers pay for the volume of metrics, logs, and traces they send. During the 24-hour outage, Datadog &lt;strong&gt;did not charge customers for data they couldn't send&lt;/strong&gt;. The $5M revenue loss was direct: one day of global service unavailability translated directly into one day of foregone billing. This number was revealed on an earnings call, making the financial cost of the outage unusually concrete and public.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;❌&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Multi-Cloud Did Not Help&lt;/p&gt;

&lt;p&gt;Datadog ran in five regions across three cloud providers — AWS, GCP, and Azure. This architecture is often cited as a reliability best practice. But it provided &lt;strong&gt;zero protection&lt;/strong&gt; in this incident because the failure mechanism — the automated Ubuntu update — operated at the OS layer, uniformly across all infrastructure regardless of cloud provider. Multi-cloud protects against cloud provider failures. It does not protect against failures in your own automation that touch all infrastructure simultaneously.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The 24-hour recovery time was itself a lesson. Even after the Kubernetes control planes came back online and new pods could be scheduled, &lt;strong&gt;services were slow to recover&lt;/strong&gt;. The investigation found two patterns: some services had insufficient compute allocated relative to others, causing them to wait a long time for Kubernetes to schedule their pods after the control plane recovered. Others loaded large, processing-intensive caches into memory at startup — caches that had been optimized for steady-state operation but were extremely expensive to rebuild from scratch after a complete restart. Both of these were design choices that had seemed reasonable in a world where failure was rare and total restarts were rarer still. In a world where failure must be expected, they were traps.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Irony of the Observability Platform&lt;/p&gt;

&lt;p&gt;There is a particular quality of darkness in losing observability tooling during an outage. Engineers responding to the incident would normally rely on Datadog to understand what was happening, and Datadog was the thing that was down. The response team had to work from first principles: SSH into individual hosts, read raw logs, check systemd status directly. The tooling built to abstract away that complexity was unavailable at precisely the moment the complexity needed to be navigated. The incident revealed how dependent Datadog's own on-call rotation was on Datadog itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Square-Wave Failure Pattern&lt;/p&gt;

&lt;p&gt;Datadog's engineers described the outage as a &lt;strong&gt;square-wave failure&lt;/strong&gt; — the platform went from fully operational to nearly completely down almost instantaneously, rather than degrading gradually. This pattern is characteristic of failures at the infrastructure layer: when Kubernetes nodes lose network connectivity, every pod running on those nodes disappears from service meshes and load balancers at once. There is no gradual ramp. For an observability platform designed around monitoring continuous signals, a square-wave drop to zero looked different from every other failure mode the monitoring systems had been trained on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🌐&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Datadog ran infrastructure across &lt;strong&gt;five regions on three different cloud providers&lt;/strong&gt; — a setup specifically designed to avoid single points of failure. It provided no protection at all against this incident because the failure mechanism lived at a layer beneath the cloud provider abstraction: the Ubuntu OS update that ran on every Datadog-managed VM, regardless of which cloud it ran on. The lesson is precise: multi-cloud resilience and OS-level automation independence are orthogonal properties.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;THE POSTMORTEM DELAY&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Datadog waited over two months to publish a public postmortem, a gap that generated significant industry commentary, particularly after the CEO referenced it on an earnings call before it was publicly available. The eventual postmortem was substantive and technical. But the delay, along with the CEO's apparent confusion about whether it had been shared, was widely noted as a departure from the transparency standard set by companies like Cloudflare. &lt;strong&gt;Speed of postmortem publication matters for customer trust&lt;/strong&gt;, especially for a platform whose entire value proposition is reliability and observability.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Philosophical Shift: From Never-Fail to Graceful Degradation
&lt;/h3&gt;

&lt;p&gt;The deep engineering response to the March 2023 outage was not a list of tactical fixes. It was a philosophical shift. Datadog's engineering teams had, historically, built for reliability through redundancy, designing systems so that individual components never went down. This produced what the postmortem called &lt;strong&gt;never-fail architectures&lt;/strong&gt;: systems where components and services had to be fully functional to serve any user use case. When a component did fail, the entire service path that depended on it failed with it. The incident revealed a hidden assumption: that recovery would be fast and partial failure would be brief. A 24-hour outage broke that assumption completely, and exposed how little thought had gone into what the system should do while broken.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;24h+&lt;/strong&gt; — Total outage duration — longer than the initial node loss because service startup was slow and the cluster lacked capacity to absorb the recovery surge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$5M&lt;/strong&gt; — Direct revenue loss from usage-based billing — one day of global unavailability translated to one day of zero billing, revealed publicly on an earnings call&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;50–60%&lt;/strong&gt; — Kubernetes nodes that lost network connectivity from the systemd update — enough to take down control planes and make automated recovery impossible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3 clouds&lt;/strong&gt; — Cloud providers affected simultaneously — AWS, GCP, and Azure all impacted because the failure was in Datadog's own automation, not in any cloud provider's infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;WHAT GRACEFUL DEGRADATION ACTUALLY MEANS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Datadog's post-incident architectural shift was built on a simple principle: when failure occurs, the system should continue to deliver &lt;strong&gt;as much value as possible to as many customers as possible&lt;/strong&gt;, even if it cannot deliver full value to all customers. This means designing every service with an explicit answer to the question: &lt;em&gt;what does this service do when its dependencies are unavailable?&lt;/em&gt; Can it serve stale data? Can it serve a subset of features? Can it serve with degraded accuracy? Or does it have to stop entirely? Most services, when the question is asked honestly, can do better than stop.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before: Never-fail architecture (implicit assumption)
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MetricsQueryService&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metric_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time_range&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# If storage is unavailable, this raises an exception
&lt;/span&gt;        &lt;span class="c1"&gt;# The exception propagates up — user sees an error page
&lt;/span&gt;        &lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metric_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time_range&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# no fallback
&lt;/span&gt;
&lt;span class="c1"&gt;# After: Graceful degradation architecture
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MetricsQueryService&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metric_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time_range&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Try live storage first
&lt;/span&gt;            &lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metric_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time_range&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;StorageUnavailable&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Fall back to cached/stale data — user sees old data with a warning
&lt;/span&gt;            &lt;span class="n"&gt;stale_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_through_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metric_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time_range&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;stale_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;DataResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;stale_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;staleness_warning&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="c1"&gt;# Fall further back — return partial data from other sources
&lt;/span&gt;            &lt;span class="n"&gt;partial&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fallback_source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metric_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time_range&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;partial&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;DataResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;partial&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;completeness_warning&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="c1"&gt;# Only now surface an error — and make it informative
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;DataResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Storage degraded&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retry_in&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Startup Optimization: Fixing the Recovery Drag&lt;/p&gt;

&lt;p&gt;Two changes addressed the slow recovery after node restoration. First, Datadog used &lt;strong&gt;Kubernetes priority mechanisms&lt;/strong&gt; to ensure critical services got compute allocated before lower-priority ones when the cluster came back online — preventing a thundering herd of equal-priority pods all waiting for the same scarce resources. Second, services with large startup caches shortened their &lt;strong&gt;lookback windows&lt;/strong&gt; and changed data formats to eliminate processing-intensive deserialization at startup. Services that had been trying to rebuild six months of cache at startup were redesigned to start with a smaller warm window and build up over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Architectural Patterns That Emerged&lt;/p&gt;

&lt;p&gt;Over the two years following the incident, Datadog published a set of graceful degradation patterns applied across its products: &lt;strong&gt;persist data early&lt;/strong&gt; (write to durable storage as early as possible in the pipeline, so recovery is stateless); &lt;strong&gt;stale reads&lt;/strong&gt; (serve cached data with a staleness indicator rather than surfacing an error); &lt;strong&gt;partial serving&lt;/strong&gt; (return what you have rather than nothing); &lt;strong&gt;circuit breaking&lt;/strong&gt; (automatically stop calling a failing dependency, fall back to alternative, re-probe for recovery). None of these patterns were invented by Datadog — they were standard resilience engineering techniques that Datadog had systematically under-applied.&lt;/p&gt;
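&lt;p&gt;Of the patterns listed above, circuit breaking is the easiest to get subtly wrong, so a small sketch may help. This is a generic textbook breaker, not Datadog's implementation: the class name, the thresholds, and the injectable &lt;code&gt;clock&lt;/code&gt; parameter are all assumptions chosen for the example.&lt;/p&gt;

```python
import time
from typing import Callable, List, Optional


class CircuitBreaker:
    """Stop calling a failing dependency, fall back, then re-probe later."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock  # injectable so the example can control time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # Once the cool-down has elapsed, allow a probe call through.
        return self.reset_after > self.clock() - self.opened_at

    def call(self, primary: Callable[[], str], fallback: Callable[[], str]) -> str:
        if self.is_open():
            return fallback()  # skip the failing dependency entirely
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        self.failures = 0
        self.opened_at = None  # a successful probe closes the circuit
        return result


now = [0.0]  # fake clock for the demonstration
breaker = CircuitBreaker(max_failures=2, reset_after=30.0, clock=lambda: now[0])
calls: List[float] = []


def flaky_storage() -> str:
    calls.append(now[0])
    raise ConnectionError("storage unavailable")


def stale_cache() -> str:
    return "stale"


results = [breaker.call(flaky_storage, stale_cache) for _ in range(4)]
```

&lt;p&gt;All four requests are answered from the fallback, but only the first two reach the failing dependency. Once the cool-down has passed, the next call acts as the re-probe: success closes the circuit, failure re-opens it.&lt;/p&gt;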

&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Persist Data Early: The Durability Pattern&lt;/p&gt;

&lt;p&gt;One of the most concrete architectural changes after the incident was implementing a &lt;strong&gt;persist early&lt;/strong&gt; pattern across Datadog's data pipelines. Instead of holding data in-memory for processing before writing to durable storage, the system was changed to write to durable storage as soon as data arrived — before processing. This meant that even if processing services went down, incoming customer telemetry was safely on disk and could be processed retroactively when services recovered. Recovery no longer required customers to resend data that had arrived during the outage window.&lt;/p&gt;
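&lt;p&gt;A minimal sketch of the persist-early idea follows, assuming a local append-only journal file stands in for durable storage. The class and method names are hypothetical, and a real pipeline would also track which journal entries have already been processed so that replay does not repeat work.&lt;/p&gt;

```python
import json
import os
import tempfile
from typing import Callable, Dict, List

Payload = Dict[str, float]


class PersistEarlyIntake:
    """Write each payload to durable storage before any processing runs."""

    def __init__(self, journal_path: str) -> None:
        self.journal_path = journal_path

    def ingest(self, payload: Payload, process: Callable[[Payload], object]) -> bool:
        with open(self.journal_path, "a", encoding="utf-8") as journal:
            journal.write(json.dumps(payload) + "\n")
            journal.flush()
            os.fsync(journal.fileno())  # force the write to disk first
        try:
            process(payload)
            return True
        except Exception:
            return False  # processing failed, but the data is in the journal

    def replay(self, process: Callable[[Payload], object]) -> int:
        """Re-run processing over everything in the journal (assumed idempotent)."""
        count = 0
        with open(self.journal_path, encoding="utf-8") as journal:
            for line in journal:
                process(json.loads(line))
                count += 1
        return count


journal_path = os.path.join(tempfile.mkdtemp(), "intake.journal")
intake = PersistEarlyIntake(journal_path)
processed: List[Payload] = []


def broken_processor(payload: Payload) -> None:
    raise RuntimeError("processing tier is down")


accepted = intake.ingest({"cpu": 0.93}, broken_processor)
recovered = intake.replay(processed.append)
```

&lt;p&gt;The payload survives a failed processing step because it was journaled first, and the replay picks it up once a working processor is available.&lt;/p&gt;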

&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Kubernetes Priority Class Oversight&lt;/p&gt;

&lt;p&gt;After the outage, Datadog's investigation found that many services had not been assigned appropriate &lt;strong&gt;Kubernetes Priority Classes&lt;/strong&gt; — a mechanism that tells the Kubernetes scheduler which pods should get compute resources first when the cluster is under resource pressure. In normal operation, this doesn't matter much. After a large failure where the entire cluster restarts simultaneously, priority classes determine recovery order. Services that should start first (database proxies, ingestion pipelines) were waiting for the same CPU allocations as low-priority background jobs. Recovery order is a design decision that should be made explicitly, not left to scheduler defaults.&lt;/p&gt;
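&lt;p&gt;The effect of priority on recovery order can be shown with a toy scheduler. This is a simplified simulation, not the Kubernetes scheduler or its API: the pod names, priority values, and CPU figures are made up, and the only behavior borrowed from Kubernetes is that a higher priority value is considered first.&lt;/p&gt;

```python
from typing import List, NamedTuple


class Pod(NamedTuple):
    name: str
    priority: int  # higher value is considered first
    cpu: int       # requested CPU cores


def schedule(pending: List[Pod], free_cpu: int) -> List[str]:
    """Place pods in priority order until the free capacity runs out."""
    placed: List[str] = []
    for pod in sorted(pending, key=lambda p: p.priority, reverse=True):
        if free_cpu >= pod.cpu:
            placed.append(pod.name)
            free_cpu -= pod.cpu
    return placed


# Hypothetical recovery queue after a cluster-wide restart.
prioritized = [
    Pod("nightly-report", priority=0, cpu=4),
    Pod("ingestion-pipeline", priority=1000, cpu=4),
    Pod("database-proxy", priority=2000, cpu=2),
]
# The same pods with no priorities assigned.
unprioritized = [pod._replace(priority=0) for pod in prioritized]

with_priorities = schedule(prioritized, free_cpu=6)
without_priorities = schedule(unprioritized, free_cpu=6)
```

&lt;p&gt;With explicit priorities, the six free cores go to the database proxy and the ingestion pipeline. Without them, the background report job claims four cores first and the ingestion pipeline is left waiting.&lt;/p&gt;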
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The architecture that failed in March 2023 had a specific shape: every product feature in Datadog's platform depended on a chain of services, each of which had to be fully healthy for any part of the chain to work. Logs required a log ingestion pipeline, a storage layer, a query layer, and a frontend — all healthy. If any component in the chain was down, the entire feature was down. The never-fail architecture assumed each link in the chain would always be up. The March 2023 incident showed what happens when multiple links go down simultaneously.&lt;/p&gt;
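&lt;p&gt;The cost of a strict chain can be made concrete with a little arithmetic. The figures below are illustrative only, not Datadog's measured availability, and the calculation assumes links fail independently. The March 2023 incident violated exactly that assumption, which is why a correlated failure is far worse than this model suggests.&lt;/p&gt;

```python
from math import prod
from typing import List


def chain_availability(links: List[float]) -> float:
    """All links must be healthy at once, so availabilities multiply."""
    return prod(links)


def degraded_availability(links: List[float], fallbacks: List[float]) -> float:
    """Each link is usable if it, or its independent fallback, is healthy."""
    return prod(1 - (1 - link) * (1 - fb) for link, fb in zip(links, fallbacks))


# Illustrative numbers: intake, storage, query, frontend.
links = [0.999, 0.999, 0.999, 0.999]
strict = chain_availability(links)
relaxed = degraded_availability(links, [0.99] * 4)
```

&lt;p&gt;Four links at 99.9% each give a chain that is available about 99.6% of the time. Giving every link even a modest fallback (stale reads, partial results) raises the share of time the feature can serve something.&lt;/p&gt;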

&lt;h3&gt;
  
  
  Before: Never-Fail Chain Architecture (Any Failure = Total Failure)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/datadog-systemd-outage-graceful-degradation-2023/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Interactive diagram available on TechLogStack (link above).&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  After: Graceful Degradation Architecture (Failure = Degraded, Not Dark)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/datadog-systemd-outage-graceful-degradation-2023/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Interactive diagram available on TechLogStack (link above).&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;MULTI-CLOUD IS NOT A RELIABILITY SILVER BULLET&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Datadog's global, multi-cloud infrastructure — five regions, three cloud providers — provided zero protection against this incident. The lesson generalizes: &lt;strong&gt;multi-cloud protects against cloud provider failures&lt;/strong&gt;. It does not protect against failures in your own configuration management, your own automation, your own deployment systems, or your own service design. An automated update that runs across all infrastructure uniformly bypasses all multi-cloud redundancy. Organizations that invest heavily in multi-cloud while neglecting the uniformity of their own automation are addressing the wrong failure vector.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Legacy Channel Problem&lt;/p&gt;

&lt;p&gt;The update that caused the outage went through a &lt;strong&gt;legacy security update mechanism&lt;/strong&gt; — a channel that Datadog's security team had kept active while building a modern replacement. The modern system had been built; the legacy system had not been decommissioned. This is one of the most common failure patterns in infrastructure: a replaced system that was never actually turned off. The old system executed one last time at the worst possible moment. Every team with legacy automation that still runs in production should audit whether it could execute in a way that bypasses the modern system's safety gates.&lt;/p&gt;
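&lt;p&gt;For contrast, here is a minimal sketch of the kind of safety gate the legacy channel lacked: a rollout that proceeds wave by wave and halts when a health check fails. The wave layout, host names, and callback signatures are hypothetical and are not taken from Datadog's tooling.&lt;/p&gt;

```python
from typing import Callable, Dict, List, Set


def staged_rollout(
    waves: List[List[str]],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Dict[str, List[str]]:
    """Update one wave at a time and halt as soon as a health gate fails."""
    updated: List[str] = []
    for index, wave in enumerate(waves):
        for host in wave:
            apply_update(host)
            updated.append(host)
        if not all(is_healthy(host) for host in wave):
            untouched = [h for later in waves[index + 1:] for h in later]
            return {"updated": updated, "untouched": untouched}
    return {"updated": updated, "untouched": []}


# Hypothetical fleet: a canary, then one region, then everything else.
waves = [["canary-1"], ["eu-1", "eu-2"], ["us-1", "us-2", "ap-1"]]
broken: Set[str] = set()


def bad_update(host: str) -> None:
    broken.add(host)  # simulate an update that drops the host's routes


result = staged_rollout(waves, bad_update, lambda host: host not in broken)
```

&lt;p&gt;With the gate in place, the faulty update reaches only the canary host and the other five hosts are never touched. Applying the same update to every wave at once is the failure mode described above.&lt;/p&gt;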

&lt;p&gt;&lt;strong&gt;🔬&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Root Cause Archaeology: Finding the Network Route Bug&lt;/p&gt;

&lt;p&gt;The technical root cause was subtle: when &lt;strong&gt;systemd-networkd restarted&lt;/strong&gt; during the OS update, it cleared the network routing table for container workloads that had been set up by Kubernetes's networking plugin (Cilium). New nodes starting up for the first time don't have this problem — they start with an empty routing table and Cilium populates it correctly. But nodes that were already running had existing routing entries that were erased by the systemd-networkd restart. This was a previously unobserved interaction that only manifested when restarting a running node rather than provisioning a new one.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;p&gt;The March 2023 Datadog outage is extraordinary for two reasons: the irony of an observability platform going dark, and the depth of the architectural response it drove. The lessons here are not primarily about the incident itself but about the philosophy that emerged from it.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;01.&lt;/strong&gt;  &lt;strong&gt;Build for graceful degradation, not just failure prevention.&lt;/strong&gt; Every service should have an explicit answer to: what do I do when my dependencies are unavailable? Stale data with a warning, partial results, degraded accuracy — all of these are better than returning nothing. The goal is to serve as many customers as possible, as fully as possible, even while broken.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;02.&lt;/strong&gt; &lt;em&gt;Circular dependencies&lt;/em&gt; between service infrastructure and recovery infrastructure (where component A depends on component B for recovery, and component B depends on component A to be running) are a reliability catastrophe waiting to happen. Explicitly audit your control planes, monitoring systems, and automation pipelines: if the thing that fixes failures is also in the blast radius of those failures, you have a recovery problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;03.&lt;/strong&gt;  &lt;strong&gt;Decommission legacy automation systems completely.&lt;/strong&gt; The outage was caused by a legacy update channel that still had execution access after its replacement was built. Every organization has deprecated-but-still-running systems. Audit them. A legacy channel that runs once a year can cause an outage just as reliably as one that runs every day.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;04.&lt;/strong&gt; &lt;em&gt;Staged rollouts&lt;/em&gt; (applying changes to a small percentage of infrastructure first, checking health, then expanding gradually) are not optional for automated changes to production infrastructure. The Datadog systemd update was applied globally and simultaneously. A staged rollout — 1% of nodes, health check, 10%, health check — would have caught the network route removal on a handful of nodes before it cascaded to the entire fleet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;05.&lt;/strong&gt;  &lt;strong&gt;Design service startup to be fast under the conditions that follow a large outage.&lt;/strong&gt; When a cluster recovers from a significant failure, all services restart simultaneously with no warm caches, competing for scarce cluster capacity. Services optimized for steady-state operation can become bottlenecks in this cold-restart scenario. Test your startup behavior under cluster-wide cold-start conditions, not just under normal rolling restarts.&lt;/li&gt;
&lt;/ol&gt;
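&lt;p&gt;The staged rollout in lesson 04 can be sketched in a few lines. This is a minimal illustration, not Datadog's tooling: the function name, the stage percentages beyond the 1% and 10% mentioned above, and the &lt;code&gt;apply_update&lt;/code&gt; and &lt;code&gt;is_healthy&lt;/code&gt; callbacks are all assumptions made for the example.&lt;/p&gt;

```python
import math


def staged_rollout(nodes, apply_update, is_healthy, stage_percents=(1, 10, 50, 100)):
    """Apply an update in expanding waves and stop at the first failed health gate.

    Returns (updated_nodes, halted). All names are illustrative assumptions.
    """
    updated = []
    for percent in stage_percents:
        # Cumulative number of nodes that should be updated after this stage.
        target = max(1, math.ceil(len(nodes) * percent / 100))
        for node in nodes[len(updated):target]:
            apply_update(node)
            updated.append(node)
        # Health gate: verify every node touched so far before expanding.
        if not all(is_healthy(node) for node in updated):
            return updated, True
    return updated, False


# Demo: a 200-node fleet and an update that breaks every node it touches.
fleet = [f"node-{i}" for i in range(200)]
broken = set()

updated, halted = staged_rollout(
    fleet,
    apply_update=broken.add,  # simulate the bad update
    is_healthy=lambda node: node not in broken,
)
print(len(updated), halted)  # 2 True
```

&lt;p&gt;In this sketch the 1% stage touches 2 of the 200 nodes, the health gate fails, and the remaining 198 nodes are never updated.&lt;/p&gt;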

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Two and a Half Years to Publish the Engineering Post&lt;/p&gt;

&lt;p&gt;The outage happened in March 2023. The detailed engineering blog documenting the architectural response was published in &lt;strong&gt;October 2025, two and a half years later&lt;/strong&gt;. This timeline reflects the depth of the work: the blog described real architectural changes that had been implemented and validated in production across Datadog's entire product portfolio, not aspirational plans. Publishing only after the work was done is the responsible version of transparency: claiming to have fixed something before you've fixed it erodes trust faster than silence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GRACEFUL DEGRADATION AS A DESIGN PRINCIPLE, NOT A FEATURE&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The deepest lesson from Datadog's post-incident work is that graceful degradation is not a feature you add to a service after it's built — it's a design principle that shapes how the service is architected from the beginning. A service designed to gracefully degrade will have different internal boundaries, different cache strategies, different dependency contracts, and different SLOs than one designed to always succeed. &lt;strong&gt;Retrofitting graceful degradation into a never-fail architecture is expensive&lt;/strong&gt;. Building for it from the start is cheaper. After two years of retrofitting, Datadog's engineering organization now treats the question 'how does this service degrade?' as a required design review criterion.&lt;/p&gt;

&lt;p&gt;Datadog's monitoring platform went down for 24 hours — which means the engineers had to debug a global infrastructure failure using SSH, intuition, and the kind of raw log reading skills that got them into engineering in the first place.&lt;br&gt;&lt;br&gt;
&lt;cite&gt;TechLogStack — built at scale, broken in public, rebuilt by engineers&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://techlogstack.com/explore/datadog-systemd-outage-graceful-degradation-2023/" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;&lt;/strong&gt; (interactive diagrams, source links, and the full reader experience).&lt;/p&gt;

</description>
      <category>reliability</category>
      <category>datadog</category>
      <category>devops</category>
      <category>observability</category>
    </item>
    <item>
      <title>Netflix Unleashed a Monkey With a Weapon in Its Own Data Center — On Purpose</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/netflix-unleashed-a-monkey-with-a-weapon-in-its-own-data-center-on-purpose-3ph0</link>
      <guid>https://dev.to/techlogstack/netflix-unleashed-a-monkey-with-a-weapon-in-its-own-data-center-on-purpose-3ph0</guid>
      <description>&lt;p&gt;&lt;strong&gt;Netflix&lt;/strong&gt; · Chaos Engineering · 18 May 2026&lt;/p&gt;

&lt;p&gt;It was 2011 and Netflix was in the middle of migrating hundreds of microservices to AWS. Their architecture was distributed, horizontally scaled, and theoretically fault-tolerant. But theory and production are different things. The only way to know if a system could survive failures was to cause failures: constantly, deliberately, during business hours, and in production. So they built a monkey.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;July 19 2011 blog published&lt;/li&gt;
&lt;li&gt;Business-hours instance killing&lt;/li&gt;
&lt;li&gt;10 Simian Army members&lt;/li&gt;
&lt;li&gt;Sept 2014: AWS rebooted 10% of servers — Netflix unaffected&lt;/li&gt;
&lt;li&gt;Open-sourced 2012, v2.0 in 2016&lt;/li&gt;
&lt;li&gt;Spawned entire chaos engineering discipline&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while we continue serving our customers without interruption.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Yury Izrailevsky &amp;amp; Ariel Tseitlin, via The Netflix Simian Army, Netflix Tech Blog, July 19, 2011&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The origin of Chaos Monkey is not a clever engineering insight — it is a three-day disaster. In August 2008, Netflix was still primarily a DVD-by-mail business, running its technology on vertically scaled servers in its own data centers. A &lt;strong&gt;major database corruption&lt;/strong&gt; took down the entire system. For three days, Netflix could not ship DVDs to its customers. It wasn't a complicated failure. It was a &lt;em&gt;single point of failure&lt;/em&gt; (a component whose failure brings down the entire system — the exact opposite of a fault-tolerant distributed architecture) at the most basic level: one database, one failure mode, total outage. The company's engineering leadership concluded that the only path forward was to move away from centralized relational databases in their own datacenter toward &lt;strong&gt;highly reliable, horizontally scalable, distributed systems in the cloud&lt;/strong&gt;. They chose Amazon Web Services. The seven-year cloud migration that followed would produce one of the most influential engineering philosophies in the history of distributed systems.&lt;/p&gt;

&lt;p&gt;The migration to AWS presented a new problem in place of the old one. Netflix was moving from a single monolith with a small number of failure points — each catastrophic — to a &lt;em&gt;microservices architecture&lt;/em&gt; (a system design where an application is broken into many small, independently deployable services that communicate over a network, improving scalability and team autonomy at the cost of increased distributed systems complexity) with hundreds of services, each potentially failing in its own unique way. The distributed system was theoretically more resilient. But theory is not production. Netflix's engineers designed systems with graceful degradation in mind — if the recommendations service failed, show popular titles instead of personalized ones; if the search service was slow, streaming should still work. They wrote the code. They reviewed it. They tested it in staging. And then they realized: &lt;strong&gt;there was no way to know if the fault tolerance actually worked without experiencing actual failures&lt;/strong&gt;. The staging environment couldn't reproduce the chaos of production. Controlled tests couldn't capture the emergent failure modes of hundreds of interdependent services under real load.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;THE CORE INSIGHT: FAIL CONSTANTLY&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Netflix's founding philosophy for Chaos Engineering was radical in its simplicity: &lt;strong&gt;the best way to avoid failure is to fail constantly&lt;/strong&gt;. If you only experience failures accidentally, in production, at 3am, your engineers have no muscle memory for responding to them and your systems have never been forced to prove their resilience claims. If you fail constantly, during business hours, with engineers present — your systems either prove they can recover or they expose the gaps so engineers can fix them before those gaps become incidents.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  What Chaos Monkey Actually Does
&lt;/h3&gt;

&lt;p&gt;Chaos Monkey is, mechanically, a simple tool. It runs continuously across Netflix's AWS environment and, at some point during &lt;strong&gt;business hours&lt;/strong&gt;, picks one EC2 instance at random from each cluster and terminates it. No warning. No coordination. No grace period. The instance just stops. This deceptively simple act forces every service in Netflix's architecture to prove, continuously, that it can tolerate the loss of an individual instance. Services that depend on a single backend instance fail immediately and obviously. Services built with proper fallbacks (load balancers, retries, graceful degradation paths) continue working. The business hours constraint is deliberate: when Chaos Monkey strikes at 2pm on a Tuesday, engineers are at their desks and can respond to any cascading failure. Striking at 2am would produce the exact scenario Netflix was trying to avoid: unplanned, unattended failures with no one ready to respond.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  August 2008: Database Corruption, Three Days of Darkness
&lt;/h4&gt;

&lt;p&gt;Netflix's vertically scaled infrastructure suffered a major database corruption that halted DVD shipping for three days. The root cause was architectural: a single relational database instance, a single point of failure. No redundancy, no graceful degradation, no recovery path faster than manual intervention. The outage made the problem concrete: this architecture couldn't support Netflix's growth.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Distributed Systems Are Only Theoretically Resilient
&lt;/h4&gt;

&lt;p&gt;Moving to hundreds of microservices on AWS solved the single-point-of-failure problem at the architecture level — but introduced new questions: did the code actually implement the graceful degradation it was supposed to? Staging environments couldn't tell you. Code review couldn't tell you. The only honest answer required production failures, and those were the thing Netflix was trying to avoid.&lt;/p&gt;




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Chaos Monkey: Production Failure on a Schedule
&lt;/h4&gt;

&lt;p&gt;Netflix built Chaos Monkey — a script that randomly terminates EC2 instances during business hours — and deployed it in all production environments. Engineers came in every day knowing that Chaos Monkey was running, knowing their services might get an instance killed at any moment, and knowing they had to build recovery mechanisms or face a very bad afternoon. The tool made fault tolerance a daily engineering discipline, not a theoretical design principle.&lt;/p&gt;




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Sept 2014: AWS Reboots 10% of Its Servers. Netflix Shrugs.
&lt;/h4&gt;

&lt;p&gt;On September 25, 2014, AWS rebooted approximately 10% of its EC2 instances without warning. Netflix's systems handled it without customer impact. Netflix explicitly credited Chaos Monkey: the engineers had already been building and proving recovery mechanisms every day for years. When AWS created an unplanned failure event at scale, Netflix's systems responded exactly as they'd been trained to respond — automatically, gracefully, and without requiring an emergency war room.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;🐒&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Chaos Monkey was one of the &lt;strong&gt;first systems&lt;/strong&gt; Netflix engineers built in AWS during the cloud migration. Not a caching layer, not a deployment system, not a monitoring platform — a tool to randomly kill their own production servers. This sequencing was intentional: the discipline came first, and the architecture was shaped by it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Rambo Architecture&lt;/p&gt;

&lt;p&gt;Netflix's engineering team coined the term &lt;strong&gt;Rambo Architecture&lt;/strong&gt; for the design philosophy that Chaos Monkey enforced: each system must be able to succeed no matter what, even all on its own. If the recommendations service is down, still respond — show popular titles. If the search service is slow, streaming still works. If a dependent microservice returns an error, handle it gracefully. Every service is both a potential failure source and a potential victim of failures, and must be designed for both roles simultaneously.&lt;/p&gt;
&lt;/blockquote&gt;
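&lt;p&gt;The "still respond, show popular titles" behaviour can be illustrated with a small fallback wrapper. This is only a sketch: the service call, the exception type, and the placeholder titles are hypothetical names invented for the example, not Netflix code.&lt;/p&gt;

```python
POPULAR_TITLES = ["Title A", "Title B", "Title C"]  # placeholder data


class RecommendationsUnavailable(Exception):
    """Raised by the hypothetical recommendations client on failure."""


def fetch_personalized(user_id, healthy=True):
    # Stand-in for a network call to a recommendations service.
    if not healthy:
        raise RecommendationsUnavailable(f"no response for user {user_id}")
    return [f"Pick {n} for {user_id}" for n in range(1, 4)]


def homepage_rows(user_id, healthy=True):
    """Always return something to render, even when the dependency fails."""
    try:
        return {"titles": fetch_personalized(user_id, healthy), "degraded": False}
    except RecommendationsUnavailable:
        # Fallback path: non-personalized, but the page still loads.
        return {"titles": POPULAR_TITLES, "degraded": True}


print(homepage_rows("u1"))
print(homepage_rows("u1", healthy=False))
```

&lt;p&gt;The &lt;code&gt;degraded&lt;/code&gt; flag is a design choice for the sketch: it lets the caller log or measure how often the fallback path is taken, which is the signal a Chaos Monkey termination would surface.&lt;/p&gt;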

&lt;h3&gt;
  
  
  The Simian Army
&lt;/h3&gt;

&lt;p&gt;The success of Chaos Monkey triggered a proliferation. If randomly killing instances made Netflix more resilient to instance failures, what would it take to become resilient to other failure categories? In July 2011, in the same blog post that named Chaos Monkey publicly, Netflix announced the &lt;strong&gt;Simian Army&lt;/strong&gt;: a growing suite of failure-injection and resilience-verification tools, each targeting a different class of failure. The roster was remarkable in its scope and its naming creativity. &lt;em&gt;Latency Monkey&lt;/em&gt; injected artificial delays into Netflix's RESTful service communication layer, simulating network degradation to verify that upstream services detected and responded to downstream slowdowns appropriately. &lt;em&gt;Conformity Monkey&lt;/em&gt; identified and shut down instances not following engineering best practices. &lt;em&gt;Doctor Monkey&lt;/em&gt; ran health checks and removed unhealthy instances from service. &lt;em&gt;Janitor Monkey&lt;/em&gt; cleaned up unused cloud resources to reduce costs and complexity. &lt;em&gt;Security Monkey&lt;/em&gt; hunted for security vulnerabilities. &lt;em&gt;10-18 Monkey&lt;/em&gt; (short for localization-internationalization, l10n-i18n) detected configuration and runtime problems in instances serving customers across multiple geographic regions. And &lt;em&gt;Chaos Gorilla&lt;/em&gt; simulated the failure of an entire AWS availability zone, one step up from Chaos Monkey's instance-level failures, testing whether Netflix's architecture could survive losing an entire AZ.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Chaos Kong: The Region Killer&lt;/p&gt;

&lt;p&gt;Above Chaos Gorilla in the hierarchy sat &lt;strong&gt;Chaos Kong&lt;/strong&gt; — the most extreme tool in the Simian Army, designed to simulate the complete failure of an entire AWS region. If Chaos Monkey proved Netflix could survive an instance failure and Chaos Gorilla proved it could survive an AZ failure, Chaos Kong tested the hardest question: could Netflix continue streaming if us-east-1 went dark? The answer, after years of Chaos Engineering practice, was yes — with careful architecture involving active-active multi-region deployment and data replication strategies that Netflix documented in subsequent engineering blog posts.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Building a Fault-Tolerant Culture
&lt;/h3&gt;

&lt;p&gt;The most important thing Chaos Monkey fixed was not a technical system but an organizational incentive. Before Chaos Monkey, engineers at Netflix could ship code that was theoretically fault-tolerant but practically fragile without facing immediate consequences. The fragility would only become visible during a real, unplanned outage, at which point it was someone else's problem. After Chaos Monkey, the consequences were immediate and personal: if your service didn't handle instance failures gracefully, Chaos Monkey would expose this &lt;strong&gt;during your working hours, while you were at your desk&lt;/strong&gt;, with your team watching. This behavioral economics effect, where the cost of fragility was paid immediately by the person who created it, transformed how Netflix engineers thought about resilience. It was no longer a design principle to be aspirationally implemented. It was a daily test to be continuously passed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2011&lt;/strong&gt; — Year Chaos Monkey was publicly announced in 'The Netflix Simian Army' blog post — three years after the 2008 database outage that triggered the AWS migration and the need for built-in fault tolerance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10+&lt;/strong&gt; — Members of the Simian Army at peak — each targeting a different failure category from individual instances (Chaos Monkey) to full AWS regions (Chaos Kong)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business hours&lt;/strong&gt; — The scheduling constraint that made Chaos Monkey safe and effective — failures during working hours, with engineers present to respond, rather than 3am on-call escalations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sept 2014&lt;/strong&gt; — The real-world validation: AWS rebooted 10% of EC2 instances without warning — Netflix handled it without customer impact, directly crediting years of Chaos Monkey practice
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified version of what Chaos Monkey does
# Real implementation was originally Java, later Go (v2.0)
# Runs continuously during configurable business hours
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ChaosMonkey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aws_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;excluded_clusters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aws_client&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;  &lt;span class="c1"&gt;# must expose termination_interval_seconds, read in run()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;excluded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;excluded_clusters&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_business_hours&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Only run during business hours so engineers are present.
        The key safety constraint of Chaos Monkey&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s original design.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;weekday&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="c1"&gt;# Monday–Friday
&lt;/span&gt;            &lt;span class="mi"&gt;9&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hour&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;17&lt;/span&gt; &lt;span class="c1"&gt;# 9am–5pm local time
&lt;/span&gt;        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_business_hours&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="c1"&gt;# Identify all clusters Chaos Monkey is configured to target
&lt;/span&gt;                &lt;span class="n"&gt;clusters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_all_clusters&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="k"&gt;continue&lt;/span&gt;

                    &lt;span class="c1"&gt;# Pick one instance at random from each cluster
&lt;/span&gt;                    &lt;span class="n"&gt;instances&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_running_instances&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;instances&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="k"&gt;continue&lt;/span&gt;

                    &lt;span class="n"&gt;victim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;instances&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                    &lt;span class="c1"&gt;# Terminate it. No warning. No coordination.
&lt;/span&gt;                    &lt;span class="c1"&gt;# If the system doesn't survive this, the engineers
&lt;/span&gt;                    &lt;span class="c1"&gt;# will know about it immediately — and fix it.
&lt;/span&gt;                    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;terminate_instance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Chaos Monkey] Terminated &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                          &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;in cluster &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Wait before the next pass. This sketch uses one fixed interval;
&lt;/span&gt;            &lt;span class="c1"&gt;# v2.0 instead schedules by a mean time between terminations.
&lt;/span&gt;            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;termination_interval_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;FAILURE INJECTION TESTING (FIT): THE EVOLUTION&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In 2014, Netflix engineers (including Kolton Andrus, who later co-founded Gremlin) introduced &lt;strong&gt;FIT — Failure Injection Testing&lt;/strong&gt;. Where Chaos Monkey operated at the infrastructure level (kill an EC2 instance), FIT operated at the application level: injecting failure metadata through &lt;em&gt;Zuul&lt;/em&gt; (Netflix's edge proxy that handles all requests from devices and applications to Netflix's backend services) to simulate specific service failures with surgical precision. FIT could say 'for this specific user's request, pretend the recommendations service is timing out' without actually degrading the recommendations service for everyone. This precision made chaos experiments far more targeted and safer to run continuously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Chaos Monkey 2.0: Open-Sourced and Rebuilt in Go&lt;/p&gt;

&lt;p&gt;Chaos Monkey was open-sourced in 2012 and rebuilt in 2016 as version 2.0. The new version was written in Go, depended on Spinnaker as its deployment platform, and scheduled terminations by mean time between terminations (rather than by probability) for more predictable test coverage. Version 2.0 also added &lt;strong&gt;Trackers&lt;/strong&gt;: Go objects that report instance terminations to external monitoring systems, enabling downstream correlation of Chaos Monkey events with application metrics and alerts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Industry Adoption: From Netflix to Everywhere&lt;/p&gt;

&lt;p&gt;By 2015, Netflix's Chaos Engineering practices had been codified in the &lt;strong&gt;Principles of Chaos Engineering&lt;/strong&gt; document (published by a team including Casey Rosenthal, who led Netflix's Chaos Engineering team), transforming what had been an internal Netflix tool into a formal engineering discipline. Companies including LinkedIn, Facebook, Google, Amazon, and Twilio adopted chaos engineering practices. Kolton Andrus (from Netflix's FIT team) founded Gremlin in 2016 to commercialize chaos engineering tooling. AWS launched its own Fault Injection Simulator service in 2021.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Open-Source Release and Industry Spread&lt;/p&gt;

&lt;p&gt;Netflix open-sourced Chaos Monkey in 2012, making the tool available to any engineering team that wanted to adopt the practice. The release did something more important than provide the code: it legitimized the approach. Engineering teams at other companies who had been quietly running similar experiments could now point to Netflix's published methodology as industry precedent. By 2015, companies including &lt;strong&gt;LinkedIn, Facebook, Google, Amazon, and Twilio&lt;/strong&gt; had publicly acknowledged chaos engineering practices. The 2015 publication of the &lt;em&gt;Principles of Chaos Engineering&lt;/em&gt; by Netflix's Casey Rosenthal and colleagues formalized the discipline with scientific language: hypothesis, experiment, steady state, blast radius. What had been a Netflix internal tool became a named engineering discipline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;THE SPINNAKER DEPENDENCY IN V2.0&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Chaos Monkey 2.0 (2016) introduced a significant constraint: it requires &lt;em&gt;Spinnaker&lt;/em&gt; (Netflix's open-source multi-cloud continuous delivery platform that manages application deployments across AWS, Azure, Kubernetes, and other providers) as its deployment platform. This means that teams wanting to use Chaos Monkey 2.0 must also adopt Spinnaker — a substantial investment. Companies unwilling to commit to Spinnaker found Chaos Monkey 2.0 inaccessible, which opened market space for alternatives like Gremlin (founded by Netflix alumni Kolton Andrus and Matt Fornaciari) that offered chaos engineering as a service without infrastructure prerequisites.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Netflix's architecture in 2011 was organized around a principle that Chaos Monkey enforced: every service must be independently deployable, independently scalable, and independently recoverable. The microservices were connected through REST APIs, with each service maintaining its own data store and exposing a versioned interface to its consumers. Chaos Monkey operated at the AWS EC2 instance layer — the individual virtual machines running each service's processes. When an instance was terminated, the load balancer in front of that service's cluster detected the unhealthy instance and stopped routing traffic to it. If the cluster had been sized with enough redundancy, other instances absorbed the traffic without degradation. If not, the service degraded — and the engineers learned something.&lt;/p&gt;
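&lt;p&gt;The redundancy condition in the paragraph above reduces to simple arithmetic. The following is a minimal sketch, not Netflix's tooling: the function name &lt;code&gt;survives_instance_loss&lt;/code&gt; and the capacity figures are assumptions chosen purely for illustration.&lt;/p&gt;

```python
def survives_instance_loss(instance_count: int,
                           per_instance_rps: float,
                           peak_rps: float,
                           lost: int = 1) -> bool:
    """Return True if the cluster can still serve peak traffic after losing `lost` instances.

    All numbers are hypothetical inputs; real capacity planning would use measured load.
    """
    remaining = max(instance_count - lost, 0)
    return remaining * per_instance_rps >= peak_rps


# A cluster sized with one spare instance absorbs a single termination...
print(survives_instance_loss(instance_count=4, per_instance_rps=100, peak_rps=300))
# ...while a cluster sized exactly to peak load degrades when one instance dies.
print(survives_instance_loss(instance_count=3, per_instance_rps=100, peak_rps=300))
```

&lt;p&gt;The point of terminating instances for real is that this inequality is usually assumed rather than measured; the experiment replaces the assumption with an observation.&lt;/p&gt;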

&lt;h3&gt;
  
  
  The Simian Army: Failure Coverage Across Infrastructure Layers
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/netflix-chaos-monkey-2011/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Interactive diagram available on TechLogStack (link above).&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How Netflix's Architecture Handles Chaos Monkey Instance Loss
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/netflix-chaos-monkey-2011/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;THE BEHAVIORAL ECONOMICS OF CHAOS ENGINEERING&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Chaos Monkey's deepest contribution to Netflix's culture was &lt;strong&gt;aligning incentives&lt;/strong&gt;. Without it, the cost of fragile code was paid by whoever happened to be on-call when a real failure occurred — often not the engineer who wrote the fragile code. With Chaos Monkey, the cost was paid immediately and visibly by the team whose service broke. Engineers who experienced a Chaos Monkey failure during business hours had a powerful motivator to invest in proper fault tolerance: they didn't want to experience it again. This is DevOps incentive design at its finest — not policy mandates, but a system where the right behavior is the path of least resistance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why Business Hours Only — The Safety Constraint&lt;/p&gt;

&lt;p&gt;The original Chaos Monkey ran only during business hours, and this was not a limitation; it was the essential design principle. An instance killed at 2am when engineers are asleep creates exactly the scenario Netflix wanted to avoid: unplanned, unattended failure with long MTTD (Mean Time To Detect) and long MTTR (Mean Time To Recover). An instance killed at 2pm on a Tuesday &lt;strong&gt;is pedagogical, not adversarial&lt;/strong&gt;: engineers learn from it, fix the gap, and build better systems. As Netflix's confidence in its architecture grew, chaos experiments expanded to cover more scenarios and broader failure scopes, but the principle of human-attended chaos remained core to responsible chaos engineering practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What Chaos Monkey Doesn't Test&lt;/p&gt;

&lt;p&gt;Chaos Monkey's instance-termination model is powerful but deliberately narrow. It does not test &lt;strong&gt;network partitions&lt;/strong&gt; (instances visible but unreachable), &lt;strong&gt;latency degradation&lt;/strong&gt; (Latency Monkey's job), &lt;strong&gt;data corruption&lt;/strong&gt;, or &lt;strong&gt;slow memory leaks&lt;/strong&gt; that cause gradual performance degradation over hours. Chaos Monkey's successors in the Simian Army and in tools like Gremlin were created precisely to cover these gaps. The original insight, that failing constantly builds resilience, generalizes to all failure types, but the specific mechanism must match the specific failure mode being tested. A chaos engineering program that only kills instances is missing most of the failure surface.&lt;/p&gt;
&lt;/blockquote&gt;
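&lt;p&gt;The business-hours constraint described above can be expressed as a small gate in front of the random selection. This is a sketch under stated assumptions: the exact window (Monday to Friday, 09:00 to 15:00), the function names, and the instance IDs are hypothetical, and the code only chooses a candidate rather than calling any cloud API.&lt;/p&gt;

```python
import random
from datetime import datetime
from typing import Optional, Sequence

# Assumed window: Monday to Friday, 09:00 up to (not including) 15:00 local time.
BUSINESS_DAYS = range(0, 5)
BUSINESS_HOURS = range(9, 15)


def in_business_hours(now: datetime) -> bool:
    """True only when engineers are expected to be at their desks."""
    return now.weekday() in BUSINESS_DAYS and now.hour in BUSINESS_HOURS


def pick_instance_to_terminate(instances: Sequence[str],
                               now: datetime,
                               rng: random.Random) -> Optional[str]:
    """Pick one random instance, but only while humans are around to respond."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(list(instances))


rng = random.Random(42)  # seeded only so the sketch is repeatable
cluster = ["i-aaa", "i-bbb", "i-ccc"]  # hypothetical instance IDs
print(pick_instance_to_terminate(cluster, datetime(2011, 7, 19, 14, 0), rng))  # a Tuesday afternoon
print(pick_instance_to_terminate(cluster, datetime(2011, 7, 19, 2, 0), rng))   # 2am: prints None
```

&lt;p&gt;Passing the clock and the random source in as arguments keeps the gate easy to test, which matters for a tool whose whole job is to break things on purpose.&lt;/p&gt;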




&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;p&gt;Chaos Monkey is well over a decade old, and it has shaped how many major engineering organizations approach reliability. Its lessons are not about the specific tool; they are about the philosophy the tool embodies and the cultural transformation it requires.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;01.&lt;/strong&gt;  &lt;strong&gt;Designing for fault tolerance is not the same as having fault tolerance.&lt;/strong&gt; Netflix's engineers wrote graceful degradation code. Netflix's Chaos Monkey tested whether it actually worked. Until production failure exercises the code path, you don't know whether your fault tolerance design survived contact with reality. Chaos Monkey converts theoretical resilience into empirical evidence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;02.&lt;/strong&gt; &lt;em&gt;Chaos Engineering&lt;/em&gt; (the discipline of deliberately injecting controlled failures into production systems in order to expose resilience gaps before they become unplanned outages) must be practiced during business hours, with humans present. The purpose is learning, not destruction. Chaos experiments run at 3am, when no one is available to respond, create exactly the incidents that chaos engineering is supposed to prevent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;03.&lt;/strong&gt;  &lt;strong&gt;Align incentives with the behavior you want.&lt;/strong&gt; Chaos Monkey made the cost of fragile code immediate and personal — the engineer whose service broke during business hours paid the cost of fixing it right then. Without this alignment, resilience engineering is aspirational. With it, resilience engineering is survival instinct.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;04.&lt;/strong&gt; The &lt;em&gt;blast radius&lt;/em&gt; (the scope of impact when a single component fails — chaos engineering is designed to continuously measure and minimize blast radius by forcing service-level isolation) of individual failures is only measurable through testing. A microservices architecture where every service failure cascades to every other service provides less reliability than a monolith, not more. Chaos Monkey surfaces these cascade dependencies so they can be eliminated before a real failure exposes them at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;05.&lt;/strong&gt;  &lt;strong&gt;Start at the instance level and escalate gradually.&lt;/strong&gt; Netflix began with Chaos Monkey (instances), expanded to Chaos Gorilla (availability zones), then to Chaos Kong (regions). Each level was only attempted after the previous level produced a stable, confident result. This graduated escalation model — expand scope only when you're confident you've solved the current scope — is the responsible path for any chaos engineering program.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The September 2014 Test That Validated Everything&lt;/p&gt;

&lt;p&gt;Netflix's most public validation of Chaos Monkey's philosophy came not from their own experiments but from AWS itself. In late September 2014, AWS rebooted approximately 10% of its EC2 instances across regions for an urgent security update, with very little notice to customers: a real disruption at significant scale that Netflix did not schedule. Netflix handled it without customer impact. The years of Chaos Monkey practice had built exactly the muscle memory and architectural robustness required. Engineers didn't panic. Systems didn't cascade. Services degraded gracefully and recovered automatically. This was the experiment Netflix couldn't have designed themselves, and they passed it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FROM TOOL TO DISCIPLINE: THE PRINCIPLES OF CHAOS ENGINEERING&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In 2015, Netflix's Casey Rosenthal formalized Chaos Monkey's philosophy into the &lt;strong&gt;Principles of Chaos Engineering&lt;/strong&gt; — a document that defined chaos engineering with scientific rigor: establish a steady-state hypothesis, vary real-world events, run experiments in production, automate continuously, minimize blast radius. These principles transformed chaos engineering from 'Netflix's thing where they kill their own servers' into a reproducible engineering discipline with clear methodologies. &lt;strong&gt;The formalization is what allowed chaos engineering to spread beyond Netflix&lt;/strong&gt; — teams could now implement the practice without having to rediscover the same principles themselves.&lt;/p&gt;

&lt;p&gt;Netflix built a tool that killed their own servers on purpose every business day for years, and the one time AWS rebooted roughly 10% of its EC2 fleet on its own schedule, customers never noticed, which is either the best possible outcome of a chaos engineering program or proof that Netflix engineers have very high stress tolerances.&lt;br&gt;&lt;br&gt;
&lt;cite&gt;TechLogStack — built at scale, broken in public, rebuilt by engineers&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://techlogstack.com/explore/netflix-chaos-monkey-2011/" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;&lt;/strong&gt; (interactive diagrams, source links, and the full reader experience).&lt;/p&gt;

</description>
      <category>chaosengineering</category>
      <category>netflix</category>
      <category>cloud</category>
      <category>devops</category>
    </item>
    <item>
      <title>Uber Had 150,000 Secrets Scattered Across 25 Vaults — So They Built One Platform to Rule Them</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/uber-had-150000-secrets-scattered-across-25-vaults-so-they-built-one-platform-to-rule-them-32j6</link>
      <guid>https://dev.to/techlogstack/uber-had-150000-secrets-scattered-across-25-vaults-so-they-built-one-platform-to-rule-them-32j6</guid>
      <description>&lt;p&gt;&lt;strong&gt;Uber&lt;/strong&gt; · Security · 18 May 2026&lt;/p&gt;

&lt;p&gt;150,000 secrets. 25 separate vaults. Hundreds of teams managing their own credentials in their own ways, some in plain text in version control. At Uber's scale — 5,000 microservices, 5,000 databases, 500,000 analytical jobs per day — secrets sprawl is not a compliance problem. It is an incident waiting to happen. A team of ten engineers decided to fix it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;150,000 secrets managed&lt;/li&gt;
&lt;li&gt;25 vaults → 6 managed vaults&lt;/li&gt;
&lt;li&gt;5,000 microservices secured&lt;/li&gt;
&lt;li&gt;20,000 automated rotations/month&lt;/li&gt;
&lt;li&gt;90% fewer secrets in pipelines&lt;/li&gt;
&lt;li&gt;Team of 10 engineers&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;p&gt;Secrets sprawl is the entropy of infrastructure security. Left to its own devices, every team builds its own vault, stores credentials however is convenient, and shares secrets in whatever way is fastest. At a startup with ten engineers, this is manageable. At Uber's scale (&lt;strong&gt;5,000 microservices, 5,000 databases, 400+ third-party integrations, 500,000 analytical jobs per day&lt;/strong&gt;) it becomes a systemic security risk. By the time Uber's Secrets team began their consolidation project, the company had &lt;strong&gt;150,000 secrets scattered across 25 separate vault systems&lt;/strong&gt;, operated by different teams, with different security standards, different rotation practices, and inconsistent access controls. Some secrets were in plain text in codebases. Others lived in databases that had never been audited for credential exposure. Cyberattacks targeting exposed credentials were rising industry-wide. The question was not whether Uber should fix this but how.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;🔐&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before the consolidation, Uber's &lt;strong&gt;25 separate vault systems&lt;/strong&gt; were operated by various teams across engineering. Some were standard &lt;em&gt;HashiCorp Vault&lt;/em&gt; (an open-source secrets management tool that provides a secure, centralized store for tokens, passwords, certificates, and encryption keys) deployments. Others were custom databases. Others were cloud-specific secret managers for AWS, GCP, and Azure. None of them talked to each other. None of them had a unified view of what credentials existed where.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The Secrets team's strategy had two phases. Phase 1 was consolidation: take ownership of all vault infrastructure, standardize on a small number of canonical vault systems (one per cloud provider plus one on-premises HashiCorp Vault), and migrate all secrets from the 25 fragmented vaults into these six. This was the foundation work — unglamorous, involving hundreds of engineers across different teams, and requiring careful coordination to avoid breaking services that depended on existing vault paths. Phase 2 was the platform: building a &lt;strong&gt;Secret Management Platform&lt;/strong&gt; on top of the consolidated vaults — a metadata model, lifecycle automation, unified API, and real-time scanning — that turned six vaults into a governed, auditable, self-service system.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;THE FIVE PROBLEMS THEY HAD TO SOLVE&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As the Secrets team consolidated vaults, five common problem patterns emerged that any future platform would need to address: &lt;strong&gt;(1) no unified metadata model&lt;/strong&gt; — no way to know what a secret was for, who owned it, when it was last rotated; &lt;strong&gt;(2) no cross-vault CRUD&lt;/strong&gt; — managing secrets across different vault types required different tools and APIs; &lt;strong&gt;(3) no developer self-service&lt;/strong&gt; — engineers filed tickets to create or rotate secrets; &lt;strong&gt;(4) no inventory&lt;/strong&gt; — no way to generate a complete list of secrets for security incident response; &lt;strong&gt;(5) no automated rotation&lt;/strong&gt; — credential rotation required manual coordination, so it was delayed or skipped.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Secrets Sprawl: 150,000 Credentials, No Visibility
&lt;/h4&gt;

&lt;p&gt;Uber's infrastructure had grown faster than its secrets governance. 25 vault systems operated by different teams meant no single team had visibility into the company's complete credential inventory. Shadow IT vaults with no central oversight created audit gaps. Secrets were shared insecurely, rotated rarely, and sometimes stored in version control. With cyberattacks targeting credential exposure rising industry-wide, the status quo was untenable.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Scale + Decentralization = Governance Collapse
&lt;/h4&gt;

&lt;p&gt;At Uber's scale, decentralized secrets management doesn't produce diversity and resilience — it produces inconsistency and risk. Each of 25 vaults had its own standards, its own rotation schedule (usually none), its own access model. There was no way to answer basic security questions: who has access to which credentials? When were they last rotated? Are any credentials in source code? The scale that made the problem urgent also made it hard to fix without a dedicated team and platform.&lt;/p&gt;




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Consolidation + Secret Management Platform
&lt;/h4&gt;

&lt;p&gt;Phase 1 consolidated 25 vaults into 6 centrally managed vaults (one per cloud provider plus on-prem HashiCorp Vault). Phase 2 built the Secret Management Platform: a metadata model, a unified API abstracting across all vault types, a Cadence-orchestrated &lt;em&gt;Secret Lifecycle Manager&lt;/em&gt; (Uber's automation system that handles the complete lifecycle of secrets — creation, rotation, distribution to workloads, and eventual decommissioning — using Uber's Cadence workflow engine), real-time scanning across git/Slack/CI pipelines, and self-service developer tooling.&lt;/p&gt;




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  20,000 Automated Rotations Per Month, 90% Fewer Exposed Secrets
&lt;/h4&gt;

&lt;p&gt;A team of 10 engineers now drives 20,000 automated monthly secret rotations — up from manual rotation that happened rarely. Secrets exposed in CI pipelines dropped by 90%. The platform generates a complete inventory of all 150,000 secrets on demand, enabling rapid response to security incidents. Uber is actively pursuing &lt;strong&gt;secretless authentication&lt;/strong&gt; — replacing long-lived credentials with ephemeral, automatically-issued tokens wherever possible.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Migration Scale Problem&lt;/p&gt;

&lt;p&gt;Migrating secrets from 25 vaults to 6 involved hundreds of engineers whose workloads depended on existing vault paths. A secret migration is not just a data copy — it is &lt;strong&gt;a coordination problem&lt;/strong&gt;. Every service reading a secret from vault path A needs to be updated to read from vault path B. In a monolith, that's one codebase. Across Uber's 5,000 microservices, that's 5,000 potential update targets. The team built tooling to discover which services were reading from which vault paths, generated migration checklists automatically, and used feature flags to switch services over gradually with rollback capability.&lt;/p&gt;
&lt;/blockquote&gt;
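&lt;p&gt;A gradual, flag-controlled switch from an old vault path to a new one can be sketched as a deterministic percentage rollout. This is an illustration of the general technique, not Uber's migration tooling: the function names, the hashing scheme, and the example paths are all assumptions.&lt;/p&gt;

```python
import hashlib


def rollout_bucket(service_name: str) -> int:
    """Map a service name to a stable bucket in 0..99 (same name, same bucket, every run)."""
    digest = hashlib.sha256(service_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def vault_path_for(service_name: str, old_path: str, new_path: str,
                   rollout_percent: int) -> str:
    """Return the new path for services inside the rollout, the old path otherwise.

    Setting rollout_percent back to 0 is the rollback: every service reads the old path again.
    """
    return new_path if rollout_percent > rollout_bucket(service_name) else old_path


services = ["rides-api", "payments-api", "maps-tiles"]  # hypothetical service names
for percent in (0, 100):
    print(percent, [vault_path_for(s, "vault-a/secret", "vault-b/secret", percent) for s in services])
```

&lt;p&gt;Hashing the service name, rather than picking services at random on each read, keeps a given service on one side of the switch until the percentage changes, so a failure can be traced to the migration rather than to flapping configuration.&lt;/p&gt;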

&lt;p&gt;The &lt;em&gt;metadata model&lt;/em&gt; (a structured representation of a secret's properties, such as owner, purpose, rotation schedule, associated services, expiry date, and security classification, that enables automated governance and incident response) was the architectural cornerstone of the Secret Management Platform. Before consolidation, a secret was just a key-value pair in a vault with no context. After the platform was built, every secret had a structured record: who owned it, which services used it, when it was last rotated, what its rotation policy was, and what its security classification was. This metadata made &lt;strong&gt;automated governance possible&lt;/strong&gt;: the platform could identify secrets that hadn't been rotated in 90 days, generate compliance reports, and automatically alert owners of soon-to-expire credentials. It also made incident response practical: when a security team needed to identify all credentials that could have been exposed in a compromise, they could query the inventory rather than interviewing hundreds of engineering teams.&lt;/p&gt;
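&lt;p&gt;To make the idea concrete, here is a minimal sketch of what such a record and two of the queries it enables might look like. The field names, the &lt;code&gt;SecretMetadata&lt;/code&gt; class, and the sample data are assumptions for illustration; Uber's actual schema is not described in the source material.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class SecretMetadata:
    """One catalog record. Field names are illustrative, not Uber's schema."""
    secret_id: str
    owner: str
    purpose: str
    classification: str
    last_rotated: datetime
    services: List[str] = field(default_factory=list)


def overdue_for_rotation(catalog: Iterable[SecretMetadata],
                         now: datetime,
                         max_age: timedelta = timedelta(days=90)) -> List[SecretMetadata]:
    """Governance query: secrets that have not been rotated within max_age."""
    return [m for m in catalog if now - m.last_rotated > max_age]


def exposed_by_service(catalog: Iterable[SecretMetadata], service: str) -> List[SecretMetadata]:
    """Incident-response query: every secret a compromised service could read."""
    return [m for m in catalog if service in m.services]


catalog = [
    SecretMetadata("payments/db_password", "payments-team", "Postgres login", "restricted",
                   datetime(2026, 1, 5), ["payments-api", "ledger"]),
    SecretMetadata("maps/api_key", "maps-team", "Vendor API key", "internal",
                   datetime(2026, 5, 1), ["routing"]),
]
now = datetime(2026, 5, 18)
print([m.secret_id for m in overdue_for_rotation(catalog, now)])
print([m.secret_id for m in exposed_by_service(catalog, "ledger")])
```

&lt;p&gt;Neither query touches a secret's value. Governance and incident response run entirely on the metadata, which is why the record structure matters more than the vault that stores the credential.&lt;/p&gt;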

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Real-Time Scanning: Catching Secrets Before They Ship&lt;/p&gt;

&lt;p&gt;One of the most impactful platform features was real-time scanning across Uber's code repositories, CI pipelines, and internal Slack messages. The scanner looks for patterns matching API keys, database passwords, and private key formats. When detected, it &lt;strong&gt;automatically revokes the exposed credential&lt;/strong&gt; and alerts the owning team. Before the platform, a credential committed to git might live there for months — or forever. Now, exposure is measured in seconds. The 90% reduction in secrets found in pipelines reflects this detection-and-revocation automation.&lt;/p&gt;
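&lt;p&gt;Pattern-based detection of this kind can be sketched with a few regular expressions. The patterns below are common public heuristics (an AWS access key ID prefix, a PEM private key header, a quoted password assignment), not Uber's rule set, and the sketch covers detection only; revocation and alerting are left out.&lt;/p&gt;

```python
import re
from typing import List, Tuple

# Heuristic patterns; a production scanner would use many more and tune for false positives.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "quoted_password": re.compile(r"password\s*[:=]\s*[\"'][^\"']{8,}[\"']", re.IGNORECASE),
}


def scan_text(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, pattern_name) for every line that looks like it holds a secret."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, kind))
    return findings


sample = "\n".join([
    "region = 'us-east-1'",
    "aws_key = 'AKIAIOSFODNN7EXAMPLE'",  # AWS's documented example key ID, not a real credential
    'db_password = "correct-horse-battery"',
])
print(scan_text(sample))
```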

&lt;p&gt;&lt;strong&gt;❌&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Shadow IT Vaults: The Security Debt Multiplier&lt;/p&gt;

&lt;p&gt;Perhaps the most dangerous aspect of Uber's pre-platform state was what the team called &lt;strong&gt;shadow IT vaults&lt;/strong&gt;: secret storage systems created by individual teams outside the knowledge of the central Secrets team. These vaults had no security baseline review, no rotation policy, no access audit, and no inventory. When a team built a shadow vault, they optimized for their immediate convenience and created a security liability that the company didn't know existed. You cannot rotate credentials you don't know about. You cannot audit access to vaults you don't know exist. Shadow IT vaults are the point where 'move fast' becomes 'incur unquantifiable risk.'&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WHY 400 THIRD-PARTY INTEGRATIONS MATTER&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Uber's 400+ third-party vendor integrations are a significant factor in the secrets management challenge. Each integration requires credentials — API keys, OAuth tokens, database passwords — that must be rotated when vendors change their systems or when Uber's access policy changes. Before the platform, vendor credential rotation required manual coordination: someone had to get the new credentials from the vendor, find which services used them, update each service's configuration, and verify nothing broke. &lt;strong&gt;At 400 integrations, this manual process consumed disproportionate engineering time&lt;/strong&gt; and rotations were often delayed. The Secret Lifecycle Manager automated the rotation for most standard integrations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔄&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before the Secret Management Platform, secret rotation at Uber required a service owner to coordinate with the Secrets team, obtain a new credential from the upstream provider, update their service's configuration, and verify the rotation succeeded. At 150,000 secrets across 5,000 services, this process ran rarely — not because security was a low priority but because &lt;strong&gt;the operational overhead was prohibitive at scale&lt;/strong&gt;. Most secrets were rotated only when forced by a security incident or vendor requirement. The platform inverts this: rotation is the default, manual coordination is the exception.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kubernetes Native Injection&lt;/p&gt;

&lt;p&gt;One of the most seamless developer experiences in the platform is &lt;strong&gt;Kubernetes-native secret injection&lt;/strong&gt;. Rather than requiring services to call an API to retrieve their credentials at startup, the platform can inject secrets directly as environment variables or mounted files into Kubernetes pods at deploy time. This is transparent to application code — the service sees its credentials as normal environment variables, with no awareness of which vault they came from or how they were rotated. When a rotation occurs, the platform can trigger a pod restart with the new credentials injected automatically.&lt;/p&gt;
&lt;/blockquote&gt;
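&lt;p&gt;From the application's side, Kubernetes-native injection means reading a credential from the process environment or from a mounted file. The helper below is a minimal sketch of that consumer side only; the name &lt;code&gt;load_secret&lt;/code&gt;, the upper-casing convention, and the mount directory are assumptions, not part of Uber's platform.&lt;/p&gt;

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def load_secret(name: str,
                env: Mapping[str, str] = os.environ,
                mount_dir: Path = Path("/var/run/secrets/app")) -> Optional[str]:
    """Read an injected secret: environment variable first, then a mounted file.

    The mount directory is a hypothetical default; the real path depends on the pod spec.
    """
    value = env.get(name.upper())
    if value is not None:
        return value
    candidate = mount_dir / name
    if candidate.is_file():
        return candidate.read_text(encoding="utf-8").strip()
    return None


# The application never learns which vault the value came from.
print(load_secret("db_password", env={"DB_PASSWORD": "example-value"}))
```

&lt;p&gt;Environment variables are fixed when a process starts, which is why a rotation is paired with a pod restart in the description above.&lt;/p&gt;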




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Secret Lifecycle Manager
&lt;/h3&gt;

&lt;p&gt;The Secret Lifecycle Manager (SLM) is the operational core of Uber's Secret Management Platform. Built on &lt;em&gt;Cadence&lt;/em&gt; (Uber's open-source distributed workflow engine, designed for long-running, fault-tolerant business processes — the same engine that powers Uber's ride dispatch and payment workflows), SLM orchestrates the complete lifecycle of every secret: initial creation, distribution to consuming services, periodic rotation, and eventual decommissioning. Using Cadence's durable workflow model means that secret rotation operations are fault-tolerant — if the rotation workflow fails midway through, it can resume from where it left off rather than leaving credentials in a half-rotated, potentially inconsistent state.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;25→6&lt;/strong&gt; — Vault systems consolidated — from 25 team-operated vaults with inconsistent standards to 6 centrally managed vaults with uniform security baselines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;20,000&lt;/strong&gt; — Automated secret rotations per month — up from rare manual rotation that required coordination between the Secrets team and service owners&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;90%&lt;/strong&gt; — Reduction in secrets found exposed in CI/CD pipelines — achieved through real-time scanning with automatic revocation on detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10&lt;/strong&gt; — Engineers on the Secrets team that built and now operates the entire platform — evidence that well-designed automation multiplies individual team capacity dramatically
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified Secret Lifecycle Manager rotation workflow (conceptual)
# Real implementation uses Cadence's durable workflow primitives
# The self.* helper methods and now() are placeholders, not part of any real API
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cadence.workflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;workflow_method&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SecretRotationWorkflow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nd"&gt;@workflow_method&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rotate_secret&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;secret_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Cadence ensures this completes even if individual steps fail.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 1: Generate new credential from upstream provider
&lt;/span&gt;        &lt;span class="n"&gt;new_credential&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_new_credential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secret_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 2: Write new credential to canonical vault
&lt;/span&gt;        &lt;span class="c1"&gt;# (Durable: if this step completes, Cadence records it)
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_to_vault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;secret_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;secret_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;new_credential&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;new&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="c1"&gt;# old version still readable during transition
&lt;/span&gt;        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 3: Signal consuming services to reload credential
&lt;/span&gt;        &lt;span class="c1"&gt;# Each service has a registered reload handler
&lt;/span&gt;        &lt;span class="n"&gt;consuming_services&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_consumers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secret_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;consuming_services&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;signal_reload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;secret_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 4: Verify all services are using new credential
&lt;/span&gt;        &lt;span class="c1"&gt;# (Wait for health checks to confirm)
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;verify_rotation_complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secret_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;consuming_services&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 5: Expire old credential in upstream provider
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;revoke_old_credential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secret_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 6: Update metadata: last_rotated, next_rotation_due
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secret_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rotated_at&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="c1"&gt;# Cadence schedules next rotation based on policy
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;SECRETLESS AUTHENTICATION: THE NEXT FRONTIER&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The logical endpoint of Uber's secrets management journey is &lt;strong&gt;secretless authentication&lt;/strong&gt;, a model where services don't hold long-lived credentials at all. Instead, they use their identity (a Kubernetes service account, a SPIFFE/SPIRE identity, a cloud provider IAM role) to dynamically request short-lived tokens at runtime. When a token expires in 1 hour, a stolen copy is useful only briefly and there is no long-lived credential to rotate. Uber is actively building toward this model as the long-term replacement for static credential management. The Secret Management Platform is both the current solution and the bridge to the secretless future.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Self-Service Developer Tooling&lt;/p&gt;

&lt;p&gt;Before the platform, creating a new secret required filing a ticket with the Secrets team. The turnaround could be days. After the platform, developers can create, update, and delete secrets through a &lt;strong&gt;self-service API, CLI, and web UI&lt;/strong&gt; — all of which enforce the metadata requirements and policy compliance automatically. The Secrets team's workload shifted from manual secret operations (which scaled linearly with the number of services) to platform maintenance and governance (which scales much more slowly). A team of 10 can now serve 5,000 microservices because the services serve themselves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Multi-Vault Abstraction Layer&lt;/p&gt;

&lt;p&gt;Uber's infrastructure spans AWS, GCP, Azure, and on-premises HashiCorp Vault. Each environment has its own native secret manager with a different API. The Secret Management Platform includes a &lt;strong&gt;unified abstraction layer&lt;/strong&gt; that presents a single API for secret CRUD operations regardless of which underlying vault the secret lives in. Application code interacts with the platform API; the platform handles routing the operation to the correct vault (AWS Secrets Manager, GCP Secret Manager, Azure Key Vault, or HashiCorp Vault) and translating the response. This abstraction decouples application code from vault topology — when Uber migrates a secret from one vault to another, no application code changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Unified API: One SDK, Four Vaults&lt;/p&gt;

&lt;p&gt;Uber's unified abstraction layer exposes a single SDK that application developers use regardless of which underlying vault stores their secret. The SDK handles routing: an AWS-deployed service's database password might live in AWS Secrets Manager; an on-prem service's certificate might live in HashiCorp Vault. The developer writes &lt;code&gt;secrets.get('myservice/db_password')&lt;/code&gt; and receives the credential; the SDK consults the metadata catalog to find which vault holds that secret and retrieves it via the appropriate vault API. &lt;strong&gt;Application code is decoupled from vault topology&lt;/strong&gt;, making future vault migrations transparent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Compliance Reporting on Demand&lt;/p&gt;

&lt;p&gt;Before the metadata model, answering a compliance auditor's question — 'show me all credentials with access to our payment processing systems' — would have required interviewing dozens of engineering teams over days. After the platform, the same question is answered by a metadata query: filter by associated_system='payment_processing', return all matching secrets with their rotation history, access policies, and owner contacts. Compliance reporting that took days now takes seconds. The metadata model was built for developer self-service but it turns out to be equally valuable for security operations and compliance.&lt;/p&gt;
&lt;/blockquote&gt;
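&lt;p&gt;The compliance query described above can be pictured as a plain filter over metadata records. The following is a minimal sketch, not Uber's schema: the &lt;code&gt;SecretMetadata&lt;/code&gt; fields, the helper name, and the sample records are all assumptions made for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SecretMetadata:
    """Hypothetical catalog record; field names are illustrative only."""
    path: str
    owner: str
    associated_system: str
    last_rotated: date


def secrets_for_system(records, system):
    """Return every record tied to `system`, sorted by path for stable reports."""
    matches = [r for r in records if r.associated_system == system]
    return sorted(matches, key=lambda r: r.path)


# Invented sample data standing in for the metadata catalog.
catalog = [
    SecretMetadata("payments/db_password", "team-payments",
                   "payment_processing", date(2025, 3, 1)),
    SecretMetadata("search/api_key", "team-search",
                   "search", date(2025, 2, 10)),
    SecretMetadata("payments/acquirer_cert", "team-payments",
                   "payment_processing", date(2025, 1, 15)),
]

report = secrets_for_system(catalog, "payment_processing")
for record in report:
    print(record.path, record.owner, record.last_rotated.isoformat())
```

&lt;p&gt;The point of the sketch is that once ownership and system association are stored as structured fields, an audit question becomes a lookup rather than a round of interviews.&lt;/p&gt;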




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The Secret Management Platform sits as an orchestration layer above Uber's six canonical vault systems. Applications and services no longer talk directly to specific vaults — they interact with the platform's unified API or use Kubernetes-native injection (where secrets are automatically mounted into pods at deployment time). The platform maintains the metadata catalog, handles lifecycle automation via the Secret Lifecycle Manager, runs real-time scanning, and provides the developer self-service tools. The vault systems themselves are the authoritative stores; the platform is the governance and automation layer on top.&lt;/p&gt;
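&lt;p&gt;One way to picture the unified API is a client that looks up a secret's location in the metadata catalog and then delegates to a vault-specific reader. This is a toy sketch under stated assumptions: &lt;code&gt;SecretsClient&lt;/code&gt;, &lt;code&gt;SecretRecord&lt;/code&gt;, and the lambda backends are hypothetical stand-ins, not Uber's SDK or any real vault client.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecretRecord:
    """Hypothetical catalog entry: which vault holds a secret and who owns it."""
    path: str
    vault: str   # e.g. "aws", "gcp", "azure", "hashicorp"
    owner: str


class SecretsClient:
    """Toy unified client: consult the catalog, then call the right backend."""

    def __init__(self, catalog, backends):
        self._catalog = catalog      # dict: path to SecretRecord
        self._backends = backends    # dict: vault name to reader callable

    def get(self, path):
        record = self._catalog[path]           # KeyError if the secret is unregistered
        reader = self._backends[record.vault]  # vault-specific reader
        return reader(path)


# Stand-in readers; real ones would call each vault's own API.
backends = {
    "aws": lambda path: "from-aws:" + path,
    "hashicorp": lambda path: "from-vault:" + path,
}
catalog = {
    "myservice/db_password": SecretRecord("myservice/db_password", "aws", "team-a"),
}
client = SecretsClient(catalog, backends)
before = client.get("myservice/db_password")

# Migrating the secret only changes the catalog entry; the caller is untouched.
catalog["myservice/db_password"] = SecretRecord(
    "myservice/db_password", "hashicorp", "team-a")
after = client.get("myservice/db_password")
```

&lt;p&gt;The calling code is identical before and after the simulated migration, which is the decoupling the architecture is aiming for.&lt;/p&gt;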

&lt;h3&gt;
  
  
  Before: 25 Fragmented Vaults, No Governance
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/uber-multicloud-secrets-2025/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  After: Unified Secret Management Platform
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/uber-multicloud-secrets-2025/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;CADENCE: WHY UBER CHOSE WORKFLOWS FOR ROTATION&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Secret rotation is a multi-step process with real failure modes: the upstream provider might be unavailable, the vault write might fail, a downstream service might not acknowledge the new credential. A simple cron job or Lambda function that fails midway leaves the system in an unknown state: is the old credential still valid? Is the new one active? &lt;strong&gt;Cadence's durable workflow model records each completed step&lt;/strong&gt;, so a workflow that fails partway through resumes from the last successful step instead of starting over. Individual steps may still be retried, so they should be idempotent. This makes secret rotation reliable enough to run 20,000 times per month without manual oversight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Migration Coordination Challenge&lt;/p&gt;

&lt;p&gt;Migrating a secret from its old vault path to the new centralized platform sounds like a database copy operation. In practice it's a &lt;strong&gt;distributed coordination problem across hundreds of teams&lt;/strong&gt;. Every service reading the secret needs to be updated simultaneously (or with a dual-read transition period). Uber built tooling to discover all consumers of a vault path, generate migration checklists, and track completion status. Services that hadn't migrated within the target window were flagged for the owning team. The tooling made a migration problem tractable across an organization of thousands of engineers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🛡️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SPIFFE/SPIRE: The Path to Secretless&lt;/p&gt;

&lt;p&gt;Uber's secretless authentication initiative builds on the &lt;strong&gt;SPIFFE/SPIRE framework&lt;/strong&gt; — an open standard for issuing cryptographic workload identities. Every service at Uber has a unique SPIFFE identity that is automatically issued, cryptographically verifiable, and short-lived. Services that can authenticate using their SPIFFE identity don't need to hold long-lived credentials at all — the identity proves who they are, and the system issues time-limited tokens dynamically. As more of Uber's infrastructure adopts SPIFFE-based authentication, the number of long-lived secrets that need to be managed by the platform shrinks toward zero.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;p&gt;Uber's secrets management story is about the organizational and engineering cost of decentralization without governance — and the compounding returns of building the right platform once.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;01.&lt;/strong&gt;  &lt;strong&gt;Consolidate ownership before building automation.&lt;/strong&gt; Uber's two-phase approach — consolidate 25 vaults into 6, then build the platform — was the right sequence. Building a governance platform on top of 25 independent vaults would have required integrating 25 different systems. Building it on 6 centrally owned vaults meant one integration per vault type. Consolidation first is harder organizationally but dramatically simpler technically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;02.&lt;/strong&gt;  &lt;strong&gt;A metadata model is the prerequisite for all other automation.&lt;/strong&gt; The &lt;em&gt;metadata model&lt;/em&gt; is a structured record of each secret's properties (owner, purpose, associated services, rotation policy, security classification) that enables automated governance, inventory, and incident response. Without metadata, you cannot automate rotation (you don't know the rotation policy), you cannot generate inventory (you don't know what secrets are for), and you cannot respond to incidents (you don't know which services are affected). Build the metadata model before building any automation on top of it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;03.&lt;/strong&gt;  &lt;strong&gt;Real-time scanning with automatic revocation changes the economics of credential exposure.&lt;/strong&gt; When exposure is detected in seconds and the credential is automatically revoked, a developer accidentally committing a credential to git causes a 30-second incident rather than a multi-month exposure. The scanning + revocation loop is the highest-leverage security improvement for teams still relying on manual credential hygiene.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;04.&lt;/strong&gt;  &lt;strong&gt;Use durable workflow systems (like Cadence or Temporal) for secret rotation, not scripts or cron jobs.&lt;/strong&gt; Rotation is a multi-step process with real failure modes at each step. A workflow system that durably records each completed step and resumes automatically after a failure makes rotation reliable enough to run at scale without manual oversight. A cron job that fails halfway through a rotation leaves credentials in an unknown state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;05.&lt;/strong&gt;  &lt;strong&gt;Self-service developer tooling is what makes centralized governance scale.&lt;/strong&gt; A centralized Secrets team without self-service tooling becomes a bottleneck — every credential operation requires a ticket. A centralized Secrets team with self-service tooling becomes a platform team — they build and maintain the guardrails, and developers operate within them autonomously. The goal is governance at scale, not control at the cost of velocity.&lt;/li&gt;
&lt;/ol&gt;
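&lt;p&gt;Lesson 04's idea of resuming a rotation from the last successful step can be sketched without any workflow engine. The code below is a toy illustration, not the Cadence or Temporal API: the step names are invented, and an in-memory list stands in for the durable history a real engine would persist.&lt;/p&gt;

```python
def run_rotation(steps, completed):
    """Run steps in order, skipping any whose name is already in `completed`.

    `completed` plays the role of the workflow history: a step is recorded
    only after it succeeds, so a rerun picks up where the last run stopped.
    """
    for name, action in steps:
        if name in completed:
            continue            # finished in an earlier attempt
        action()                # may raise; nothing is recorded in that case
        completed.append(name)  # checkpoint after success


calls = []
attempts = {"write_vault": 0}


def generate_credential():
    calls.append("generate_credential")


def write_vault():
    attempts["write_vault"] += 1
    if attempts["write_vault"] == 1:
        raise RuntimeError("vault unavailable")  # simulated transient failure
    calls.append("write_vault")


def notify_consumers():
    calls.append("notify_consumers")


steps = [
    ("generate_credential", generate_credential),
    ("write_vault", write_vault),
    ("notify_consumers", notify_consumers),
]
completed = []  # a real system would persist this journal durably

try:
    run_rotation(steps, completed)
except RuntimeError:
    pass  # first attempt stops midway; the journal remembers step one

run_rotation(steps, completed)  # second attempt resumes at write_vault
```

&lt;p&gt;A crash between running a step and recording it would cause that step to run again on resume, which is why each step in a design like this should be safe to repeat.&lt;/p&gt;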

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;10 Engineers, 5,000 Microservices&lt;/p&gt;

&lt;p&gt;The most striking number in Uber's secrets story is the ratio: &lt;strong&gt;10 engineers managing secrets governance for 5,000 microservices&lt;/strong&gt;. This 500:1 leverage ratio is only possible because the platform does the work that used to require human coordination. Automated rotation, self-service tooling, policy enforcement in the platform layer — all of these shift work from the coordination model (each secret operation requires a human) to the automation model (each secret operation executes itself). Platform teams that want to scale should measure their leverage ratio and ask what automation would improve it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SECRETS IN VERSION CONTROL ARE NOT A MISTAKE; THEY'RE AN ARCHITECTURE PROBLEM&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every engineering organization has discovered credentials accidentally committed to git. The standard response is to educate developers about the risk. Uber's analysis found that the root cause was not developer carelessness — it was the absence of a convenient alternative. When getting a credential into a service requires filing a ticket and waiting days, developers find a shortcut: put it in the config file. &lt;strong&gt;The secure path needs to be the easy path.&lt;/strong&gt; The self-service API and Kubernetes injection that the Secret Management Platform provides made the secure approach easier than the shortcut.&lt;/p&gt;

&lt;p&gt;Uber built a platform to manage 150,000 secrets at scale — and the most important feature turned out to be a metadata field that just says 'who owns this thing?'&lt;br&gt;&lt;br&gt;
&lt;cite&gt;TechLogStack — built at scale, broken in public, rebuilt by engineers&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://techlogstack.com/explore/uber-multicloud-secrets-2025/" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;&lt;/strong&gt; (interactive diagrams, source links, and the full reader experience).&lt;/p&gt;

</description>
      <category>security</category>
      <category>uber</category>
      <category>devops</category>
    </item>
    <item>
      <title>OpenAI Runs ChatGPT for 800 Million Users on One PostgreSQL Instance — and It Works</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/openai-runs-chatgpt-for-800-million-users-on-one-postgresql-instance-and-it-works-5o1</link>
      <guid>https://dev.to/techlogstack/openai-runs-chatgpt-for-800-million-users-on-one-postgresql-instance-and-it-works-5o1</guid>
      <description>&lt;p&gt;&lt;strong&gt;OpenAI&lt;/strong&gt; · Databases · 18 May 2026&lt;/p&gt;

&lt;p&gt;ChatGPT has 800 million users. It handles millions of database queries per second. And it runs on a single primary PostgreSQL instance on Azure — one writer, backed by about fifty read replicas. No sharding. No distributed SQL. Just Postgres, pushed further than almost anyone thought possible through obsessive optimization and ruthless operational discipline.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;800M users, 1 primary PG instance&lt;/li&gt;
&lt;li&gt;~50 read replicas globally&lt;/li&gt;
&lt;li&gt;Millions of QPS, p99 &amp;lt;20ms&lt;/li&gt;
&lt;li&gt;PgBouncer: 50ms → 5ms connect&lt;/li&gt;
&lt;li&gt;One SEV-0 in 12 months&lt;/li&gt;
&lt;li&gt;5-second DDL timeout enforced&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;p&gt;The conventional wisdom about database scaling at 800 million users is straightforward: you shard. You move to a distributed SQL system. You decompose into microservices each with their own database. You do not run a single primary PostgreSQL instance. OpenAI's ChatGPT does not follow this conventional wisdom. It runs on &lt;strong&gt;one Azure PostgreSQL Flexible Server&lt;/strong&gt; that handles all writes — backed by approximately 50 read replicas spread across multiple regions. The system handles millions of queries per second at low double-digit millisecond p99 latency and has maintained five-nines availability. In twelve months, they had one SEV-0. The story is not that Postgres is magic. The story is that &lt;strong&gt;relentless optimization of a boring, proven technology can outperform premature architectural complexity&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;WHY SINGLE-PRIMARY WORKS AT THIS SCALE&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ChatGPT's workload is &lt;strong&gt;overwhelmingly read-heavy&lt;/strong&gt;. When 800 million users open the app, browse their chat history, or load their settings, those are reads. Writes happen on message submission and account updates — a much smaller fraction of the total traffic. This access pattern is exactly what a single-primary with many read replicas handles well: the write path stays narrow, the read load fans out horizontally across replicas. The architecture is not brilliant. It is appropriate for the workload. That fit is what makes it work.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;OpenAI's account of this work, presented at PGConf.dev 2025, was unusually candid about both the decisions that worked and the ones that nearly broke the system. The database load grew by &lt;strong&gt;more than 10x in a single year&lt;/strong&gt; following ChatGPT's viral growth. The team responded with aggressive optimization at every layer: connection management, query design, caching, write path discipline, and schema change governance. Each of these deserves examination, not because the techniques are novel, but because executing all of them simultaneously, under extreme growth pressure, with production at risk, is far harder than any one technique in isolation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;🔌&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OpenAI's Azure PostgreSQL Flexible Server has a maximum of &lt;strong&gt;5,000 concurrent connections&lt;/strong&gt;. At ChatGPT's scale, application servers would easily exhaust this limit without connection pooling. Before deploying &lt;em&gt;PgBouncer&lt;/em&gt; (a lightweight connection pooler for PostgreSQL that multiplexes many application connections into a smaller pool of real database connections, dramatically reducing connection overhead), average connection time was 50ms. After deployment in statement-pooling mode: &lt;strong&gt;5ms&lt;/strong&gt;. A 10x improvement from one infrastructure change.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  10x Database Load Growth in One Year
&lt;/h4&gt;

&lt;p&gt;ChatGPT's viral growth — 100 million users in two months at launch, 800 million by 2025 — drove database load up more than 10x in a single year. Connection exhaustion became a recurring threat. A 12-table ORM-generated join was causing multiple high-severity incidents when traffic spiked. Write pressure on the single primary was approaching dangerous levels during high-demand events.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Invisible Query Complexity and Write Pressure
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;ORMs&lt;/em&gt; (Object-Relational Mappers: framework layers such as the Django ORM or SQLAlchemy that generate SQL from application code, abstracting away the database; convenient, but capable of producing complex, inefficient queries that stay invisible until they cause production incidents) hide query complexity from developers. Under low load, even a 12-table join is fast enough to go unnoticed. Under 10x load, the same query saturates database CPU. Meanwhile, write-heavy workloads that could be migrated to sharded systems like Azure Cosmos DB remained on the single primary longer than optimal.&lt;/p&gt;




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Multi-Layer Defense: Pool + Cache + Rate Limit + Migrate
&lt;/h4&gt;

&lt;p&gt;OpenAI implemented PgBouncer connection pooling (cutting connect time 10x), a cache-locking mechanism to prevent thundering herd on cache misses, multi-layer rate limiting at application, proxy, and query levels, surgical elimination of the worst ORM-generated queries, strict schema change governance (5-second DDL timeout), and a policy of migrating all new write-heavy workloads to sharded systems by default.&lt;/p&gt;
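&lt;p&gt;The source does not say which rate-limiting algorithm OpenAI uses at each layer. As one common option, a token bucket is sketched below; the class name, parameters, and injectable clock are assumptions for illustration only.&lt;/p&gt;

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_second`."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self._rate = rate_per_second
        self._capacity = capacity
        self._clock = clock              # injectable so the demo is deterministic
        self._tokens = float(capacity)   # start with a full bucket
        self._last = clock()

    def allow(self):
        """Return True and spend one token if the request may proceed."""
        now = self._clock()
        refill = (now - self._last) * self._rate
        self._tokens = min(self._capacity, self._tokens + refill)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


# Deterministic demo with a fake clock instead of real time.
fake_now = [0.0]
bucket = TokenBucket(rate_per_second=1, capacity=2, clock=lambda: fake_now[0])

burst = [bucket.allow(), bucket.allow(), bucket.allow()]  # third call is rejected
fake_now[0] = 1.0                                          # one second passes
after_refill = [bucket.allow(), bucket.allow()]            # one token came back
```

&lt;p&gt;Stacking limiters like this at the application, proxy, and query levels means a burst that slips past one layer can still be absorbed by the next before it reaches the primary.&lt;/p&gt;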




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  One SEV-0 in Twelve Months, Five-Nines Availability
&lt;/h4&gt;

&lt;p&gt;One SEV-0 in twelve months — triggered by the viral launch of ChatGPT ImageGen, which caused a 10x write surge as over 100 million users signed up within a week. Postgres recovered by design. p99 latency held at low double-digit milliseconds. The single-primary architecture remained viable at a scale that surprised the entire database engineering community.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ORM Query That Caused Multiple High-Severity Incidents&lt;/p&gt;

&lt;p&gt;OpenAI's engineers discovered that a single ORM-generated SQL query was &lt;strong&gt;joining 12 tables&lt;/strong&gt;. Under normal load, the query executed in acceptable time. Under traffic spikes, it saturated the primary database's CPU and caused multiple high-severity incidents. The query had been auto-generated by the ORM framework and never explicitly reviewed. ORMs are excellent for developer productivity and terrible for query performance visibility. OpenAI now requires that all ORM-generated queries against high-traffic tables be reviewed and analyzed with EXPLAIN ANALYZE before deployment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;OpenAI's schema change governance is one of the most operationally distinctive aspects of their Postgres setup. They enforce a strict rule: &lt;strong&gt;schema changes that trigger a full table rewrite are prohibited in production&lt;/strong&gt;. Operations that rewrite a table, such as changing a column's type or running &lt;code&gt;ALTER TABLE ... ADD COLUMN ... DEFAULT&lt;/code&gt; with a volatile default (or any default before PostgreSQL 11), hold an exclusive lock on a large table for as long as rewriting billions of rows takes, potentially hours. Postgres's &lt;em&gt;MVCC&lt;/em&gt; (Multi-Version Concurrency Control: the mechanism that lets readers and writers operate concurrently without blocking each other, at the cost of retaining multiple versions of each row and requiring periodic vacuum to reclaim space) does not help here, because an exclusive lock blocks readers and writers alike. This would be catastrophic at ChatGPT's scale. All DDL operations have a &lt;strong&gt;5-second timeout&lt;/strong&gt;: if the schema change cannot acquire a lock within 5 seconds, it is cancelled automatically. Long-running queries that would block vacuum or DDL are automatically terminated.&lt;/p&gt;
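&lt;p&gt;A lock-wait limit of this kind can be expressed with PostgreSQL's &lt;code&gt;lock_timeout&lt;/code&gt; setting. The helper below is a minimal sketch that only builds the statements to run; whether OpenAI enforces its limit this way is an assumption, and the function name and the table in the example are hypothetical.&lt;/p&gt;

```python
def guarded_ddl(ddl, lock_timeout_seconds=5):
    """Return the statements for running one DDL change under a lock timeout.

    SET LOCAL scopes lock_timeout to the surrounding transaction, so the
    limit disappears again at COMMIT or ROLLBACK.
    """
    if lock_timeout_seconds not in range(1, 61):
        raise ValueError("lock timeout must be a whole number of seconds, 1 to 60")
    return [
        "BEGIN",
        "SET LOCAL lock_timeout = '{}s'".format(lock_timeout_seconds),
        ddl.strip().rstrip(";"),
        "COMMIT",
    ]


# Hypothetical schema change; the table and column names are invented.
statements = guarded_ddl("ALTER TABLE conversations ADD COLUMN archived boolean;")
for statement in statements:
    print(statement + ";")
```

&lt;p&gt;Note that &lt;code&gt;lock_timeout&lt;/code&gt; bounds only the time spent waiting for the lock. Bounding the total run time of the change would need a separate limit such as &lt;code&gt;statement_timeout&lt;/code&gt;.&lt;/p&gt;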

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Hot Standby in High-Availability Mode&lt;/p&gt;

&lt;p&gt;OpenAI runs the primary database in &lt;strong&gt;High-Availability mode with a hot standby&lt;/strong&gt; — a continuously synchronized replica specifically designated as the failover target. If the primary goes down, the hot standby can be promoted to primary with ~30–60 seconds of downtime. During a primary failure, read traffic on replicas is unaffected — since most ChatGPT requests are reads, a primary failure is not a SEV-0 (because reads remain available). Writes fail until promotion completes. This asymmetry between read and write availability is a conscious architectural tradeoff: the 800 million users who are just browsing conversation history continue being served.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why Not Shard? The Honest Answer&lt;/p&gt;

&lt;p&gt;The engineering question 'why didn't OpenAI shard PostgreSQL?' has a straightforward answer: &lt;strong&gt;sharding is expensive and their workload didn't require it yet&lt;/strong&gt;. Horizontal sharding introduces cross-shard transaction complexity, scatter-gather query patterns, operational overhead of multiple database instances, and application-layer awareness of shard routing. For a read-heavy workload that can be served from replicas, these costs are not justified. OpenAI chose to pay the operational cost of extreme Postgres optimization rather than the architectural cost of sharding — and the math worked out. The 'no new tables' policy ensures this calculation will be revisited for write-heavy workloads as they emerge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IDLE TRANSACTIONS: THE QUIET KILLER&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OpenAI identified a subtle but devastating Postgres pattern at scale: idle transactions. When application code opens a database connection, starts a transaction, does unrelated work (calling an external API, waiting for user input), and only then commits — the transaction holds locks for the entire duration. At ChatGPT's scale, applications that hold open transactions for seconds can block vacuum, block DDL, and degrade query performance for all other connections. OpenAI enforces &lt;strong&gt;strict idle_in_transaction_session_timeout&lt;/strong&gt; settings — any connection idle inside a transaction for more than a few seconds is automatically terminated. This breaks poorly-written code immediately in staging rather than causing incidents in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📊&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Despite having ~50 read replicas across multiple geographic regions, OpenAI reports &lt;strong&gt;near-zero replication lag&lt;/strong&gt; on most replicas under normal conditions. This is achieved mainly by keeping primary write load within the replication throughput capacity of the replicas. Co-locating PgBouncer, application servers, and replicas in the same region separately keeps query round trips short. Heavy write events, like the ImageGen launch surge, temporarily increase replication lag, which is why read-your-own-write operations are always routed to the primary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Write Ceiling Is Real&lt;/p&gt;

&lt;p&gt;OpenAI's single-primary architecture has an acknowledged limit: &lt;strong&gt;write-heavy events can overwhelm it&lt;/strong&gt;. The ImageGen SEV-0 was caused by a write surge, not a read surge. The architecture is not defended against arbitrary write load — it is defended against the &lt;em&gt;current&lt;/em&gt; write profile, which remains manageable because most new write-heavy workloads are being routed to Cosmos DB. If write load grows faster than the migration effort proceeds, the single-primary architecture will face a harder ceiling. The 'no new tables in Postgres' policy is the operational discipline that buys time.&lt;/p&gt;
&lt;/blockquote&gt;
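&lt;p&gt;The routing rule mentioned above, where reads that must observe the caller's own recent writes go to the primary, can be sketched as follows. This is an illustrative toy: the &lt;code&gt;QueryRouter&lt;/code&gt; class, its flags, and the round-robin replica choice are assumptions, not OpenAI's implementation.&lt;/p&gt;

```python
import itertools


class QueryRouter:
    """Send writes and read-your-own-writes to the primary, other reads to replicas."""

    def __init__(self, primary, replicas):
        self._primary = primary
        self._replicas = itertools.cycle(replicas)  # simple round-robin

    def route(self, is_write=False, needs_own_writes=False):
        if is_write or needs_own_writes:
            return self._primary      # replicas may lag behind the primary
        return next(self._replicas)   # ordinary reads fan out


router = QueryRouter("primary", ["replica-1", "replica-2"])

targets = [
    router.route(),                       # plain read
    router.route(),                       # plain read
    router.route(is_write=True),          # write
    router.route(needs_own_writes=True),  # read that must see a fresh write
    router.route(),                       # plain read
]
```

&lt;p&gt;Every request flagged this way adds load to the single primary, so a design like this works best when the flag is reserved for the few reads that truly cannot tolerate lag.&lt;/p&gt;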




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Seven-Layer Defense
&lt;/h3&gt;

&lt;p&gt;OpenAI's Postgres scaling is not one clever trick — it is seven mutually reinforcing operational practices applied simultaneously. Any one of them in isolation would help marginally. Together they have produced an architecture that handles a scale that its underlying technology was not originally designed for. The practices are: connection pooling, thundering herd prevention, multi-layer rate limiting, hot standby failover, write offloading, query surgery, and DDL governance. Each addresses a specific failure mode that appeared as ChatGPT grew.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;10x&lt;/strong&gt; — Database load growth in a single year following ChatGPT's viral expansion — the growth rate that forced each of the seven defensive layers to be implemented under production pressure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5ms&lt;/strong&gt; — Average connection setup time after PgBouncer deployment — down from 50ms before pooling, a 10x improvement that eliminated connection exhaustion as a recurring incident cause&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5 sec&lt;/strong&gt; — Maximum DDL lock wait timeout — schema changes that cannot acquire a lock within 5 seconds are automatically cancelled, preventing table-lock incidents on billion-row tables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 SEV-0&lt;/strong&gt; — High-severity incidents in the twelve months after full defensive architecture was deployed — triggered by ImageGen launch write surge, resolved by design without full platform outage
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Cache-locking pattern: prevents thundering herd on cache misses
# When cache expires, only ONE request repopulates it — others wait
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="c1"&gt;# Simplified cache-lock implementation
&lt;/span&gt;&lt;span class="n"&gt;_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;  &lt;span class="c1"&gt;# key: (value, expires_at)
&lt;/span&gt;&lt;span class="n"&gt;_locks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;  &lt;span class="c1"&gt;# key: Event set when the in-flight fetch ends
&lt;/span&gt;&lt;span class="n"&gt;_lock_mutex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_with_cache_lock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fetch_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Get value from cache. On a miss or expired entry, only one thread
    fetches; others block and receive the result once available.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# Fast path: cache hit that has not expired
&lt;/span&gt;    &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Slow path: miss or expired entry, so claim or join the per-key fetch
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;_lock_mutex&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_locks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;_locks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;should_fetch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;should_fetch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
            &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_locks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;should_fetch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# This thread does the database read
&lt;/span&gt;            &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_fn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# ONE database query
&lt;/span&gt;            &lt;span class="n"&gt;_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ttl_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# populate cache with expiry
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;
        &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;_lock_mutex&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_locks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# wake all waiters
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Other threads wait for the fetching thread to complete
&lt;/span&gt;        &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# don't wait forever
&lt;/span&gt;        &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="c1"&gt;# may be None if the fetch failed or timed out
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Cosmos DB Migration Policy&lt;/p&gt;

&lt;p&gt;OpenAI's most forward-looking operational decision is a standing policy: &lt;strong&gt;no new tables are created in PostgreSQL&lt;/strong&gt;. All new workloads default to sharded systems — primarily Azure Cosmos DB. Existing write-heavy workloads that can be horizontally partitioned are gradually migrated out. This policy doesn't fix the current architecture; it fixes the future architecture. Over time, the Postgres primary handles a smaller and smaller share of writes while remaining the canonical store for core user and conversation data. The single-primary architecture is not defended forever — it's being gracefully phased toward a hybrid model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;REVIEW ORM-GENERATED SQL BEFORE IT REACHES PRODUCTION&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OpenAI's most actionable advice: &lt;strong&gt;add ORM-generated SQL review to your production deployment process&lt;/strong&gt;. ORM frameworks are brilliant for development velocity. They are silent performance landmines at scale. A query that joins 12 tables, a query that does a full table scan on an unindexed column, a query that triggers N+1 loads: none of these are visible in code review because the ORM generates them at runtime. OpenAI now requires that SQL generated by ORM frameworks for high-traffic tables be logged, analyzed with EXPLAIN ANALYZE at peak load, and reviewed by a database engineer before the code ships. This practice is cheap. Not having it costs high-severity incidents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Lazy Writes: Smoothing Write Spikes&lt;/p&gt;

&lt;p&gt;OpenAI introduced &lt;strong&gt;lazy writes&lt;/strong&gt; for certain workloads — deferring non-critical writes instead of executing them immediately. For example, updating a user's last-seen timestamp or incrementing a view counter doesn't need to hit the database synchronously with every request. Batching these writes and flushing them periodically smooths write traffic from a spiky real-time pattern to a steadier background pattern. Lazy writes reduced write load on the primary meaningfully without any change to user-visible behavior.&lt;/p&gt;
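
&lt;p&gt;The lazy-write pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not OpenAI's implementation: the &lt;code&gt;LazyWriteBuffer&lt;/code&gt; class, its &lt;code&gt;flush_fn&lt;/code&gt; callback, and the &lt;code&gt;max_pending&lt;/code&gt; threshold are all assumptions made for the example.&lt;/p&gt;

```python
import time
from typing import Callable, Dict, Optional


class LazyWriteBuffer:
    """Coalesces non-critical writes in memory and flushes them in batches.

    Hypothetical helper for illustration only.
    """

    def __init__(self, flush_fn: Callable[[Dict[str, float]], None],
                 max_pending: int = 1000) -> None:
        self._flush_fn = flush_fn        # receives {user_id: last_seen} per flush
        self._max_pending = max_pending  # flush early once this many users are buffered
        self._pending: Dict[str, float] = {}

    def touch(self, user_id: str, now: Optional[float] = None) -> None:
        # A later write for the same user overwrites the earlier one, so many
        # requests collapse into a single row update at flush time.
        self._pending[user_id] = time.time() if now is None else now
        if len(self._pending) >= self._max_pending:
            self.flush()

    def flush(self) -> int:
        """Send the buffered writes as one batch and return how many rows it held."""
        if not self._pending:
            return 0
        batch, self._pending = self._pending, {}
        self._flush_fn(batch)
        return len(batch)
```

&lt;p&gt;In a real service, &lt;code&gt;flush&lt;/code&gt; would also run on a timer and issue one batched update. The tradeoff is that buffered writes are lost if the process dies before a flush, which is why the pattern suits only non-critical data such as last-seen timestamps.&lt;/p&gt;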

&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Covering Indexes: The Query Surgery Tool&lt;/p&gt;

&lt;p&gt;Beyond eliminating bad ORM queries, OpenAI invested heavily in &lt;strong&gt;covering indexes&lt;/strong&gt; — indexes that contain all columns needed by a query, so Postgres can answer it from the index alone without reading table rows. A covering index on a high-frequency query can reduce query cost from a sequential scan of billions of rows to a few hundred index lookups. OpenAI's database engineers regularly audit slow query logs and apply targeted index improvements, particularly after any traffic increase reveals latent query inefficiencies that weren't visible at lower load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Feature Traffic Isolation&lt;/p&gt;

&lt;p&gt;OpenAI isolates low-priority features from critical traffic paths. If a secondary feature (say, a background data analysis job) starts behaving poorly and consuming database resources, it should not degrade ChatGPT's core conversational experience. This isolation is implemented through &lt;strong&gt;separate connection pools for different traffic classes&lt;/strong&gt;, Kubernetes resource quotas for background workloads, and rate limiting that gives core product queries priority access to database capacity. The principle: a misbehaving low-priority feature should degrade itself, not the entire platform.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;OpenAI's Postgres architecture is simple at the macro level — one writer, many readers — but densely engineered at the micro level. The simplicity is intentional: every additional layer of infrastructure complexity is a potential failure mode. The dense engineering at the application and proxy layers is what makes the simple macro architecture viable at unprecedented scale. Understanding why this architecture works requires understanding both its strengths (read-heavy workload perfectly matched to replica fan-out) and its known limits (write spikes, ORM-generated queries, connection exhaustion).&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenAI's PostgreSQL Architecture: Single Primary, Global Read Scale
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/openai-postgresql-scaling-2026/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Multi-Layer Rate Limiting: Defense in Depth for Write Spikes
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/openai-postgresql-scaling-2026/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;THE REPLICATION LAG TRADEOFF&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Asynchronous replication to read replicas introduces a tradeoff: reads may return slightly stale data. For most ChatGPT operations — loading conversation history, displaying user settings, browsing the interface — &lt;strong&gt;a few hundred milliseconds of staleness is imperceptible and acceptable&lt;/strong&gt;. For the small fraction of requests that require current data (a write followed immediately by a read-your-own-write pattern), OpenAI routes those reads to the primary. This explicit differentiation between 'reads that can tolerate lag' and 'reads that cannot' is a design discipline, not an accident — and it's what allows the read load to be distributed across 50 replicas.&lt;/p&gt;
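
&lt;p&gt;The split between lag-tolerant reads and reads that need current data can be sketched as a small router. This is an illustration under stated assumptions, not OpenAI's code: the &lt;code&gt;ReadRouter&lt;/code&gt; class, the per-session write tracking, and the &lt;code&gt;stale_window_s&lt;/code&gt; heuristic for detecting a read-your-own-write are all hypothetical.&lt;/p&gt;

```python
import itertools
from typing import Dict, List


class ReadRouter:
    """Sends lag-tolerant reads to replicas and fresh reads to the primary."""

    def __init__(self, primary: str, replicas: List[str],
                 stale_window_s: float = 1.0) -> None:
        self._primary = primary
        self._replicas = itertools.cycle(replicas)  # simple round-robin
        self._stale_window_s = stale_window_s       # assumed upper bound on replica lag
        self._last_write: Dict[str, float] = {}     # session_id -> time of last write

    def record_write(self, session_id: str, now: float) -> None:
        self._last_write[session_id] = now

    def pick(self, session_id: str, now: float, require_fresh: bool = False) -> str:
        last = self._last_write.get(session_id)
        # A session that wrote within the lag window may not see its own write
        # on a replica yet, so its reads go to the primary.
        recently_wrote = last is not None and self._stale_window_s >= (now - last)
        if require_fresh or recently_wrote:
            return self._primary
        return next(self._replicas)
```

&lt;p&gt;Keeping the "needs fresh data" decision explicit at the call site is what keeps primary reads rare enough for the replicas to absorb the rest.&lt;/p&gt;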

&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Backfill Rate Limit: So Slow It Takes a Week&lt;/p&gt;

&lt;p&gt;OpenAI enforces strict rate limits on database backfill operations — migrations that populate new columns or update existing rows across large tables. These rate limits are aggressive enough that a large backfill can take over a week to complete. This is deliberate: a fast backfill on a billion-row table would compete with live traffic for I/O, degrade query latency, and risk triggering the DDL timeout. Slow backfills are boring and invisible. Fast backfills cause incidents. OpenAI chose boring.&lt;/p&gt;
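
&lt;p&gt;A rate-limited backfill of the kind described above can be sketched as a batched loop that pauses after every batch. The function name, batch size, and rate below are assumptions for illustration; OpenAI's actual limits are not stated in the source.&lt;/p&gt;

```python
import time
from typing import Callable, Iterable, List


def throttled_backfill(ids: Iterable[int],
                       apply_batch: Callable[[List[int]], None],
                       batch_size: int = 1000,
                       max_rows_per_second: float = 500.0,
                       sleep: Callable[[float], None] = time.sleep) -> int:
    """Apply apply_batch to ids in small batches, capped at max_rows_per_second.

    Sleeping batch_size / rate after each full batch keeps the average rate at
    or below the cap, because time spent inside apply_batch only slows it further.
    """
    batch: List[int] = []
    total = 0
    for row_id in ids:
        batch.append(row_id)
        if len(batch) == batch_size:
            apply_batch(batch)
            total += len(batch)
            sleep(len(batch) / max_rows_per_second)
            batch = []
    if batch:  # final partial batch, no pause needed afterwards
        apply_batch(batch)
        total += len(batch)
    return total
```

&lt;p&gt;The arithmetic explains the week-long duration: at an assumed cap of 1,500 rows per second, a billion-row table takes roughly 7.7 days.&lt;/p&gt;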

&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Five-Nines Achieved&lt;/p&gt;

&lt;p&gt;OpenAI reports achieving &lt;strong&gt;99.999% availability&lt;/strong&gt; on their Postgres infrastructure — five nines, which means less than 5.26 minutes of downtime per year. This is achieved despite running a single primary, primarily because most customer traffic is read-only (served by replicas even during primary downtime), write failures during primary maintenance are brief (hot standby promotion in 30–60 seconds), and the defensive layers prevent the most common failure modes from escalating. Five nines on a single-primary setup requires more engineering discipline, not less, than achieving the same availability on a distributed system.&lt;/p&gt;
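
&lt;p&gt;The downtime figure follows directly from the availability percentage. The short calculation below assumes a 365.25-day year; the helper name is hypothetical.&lt;/p&gt;

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_budget_minutes(availability: float) -> float:
    """Return the yearly downtime allowed by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


five_nines = downtime_budget_minutes(0.99999)  # about 5.26 minutes per year
```

&lt;p&gt;Seen this way, a single 60-second failover that blocked all traffic would use up nearly a fifth of the yearly budget, which is why it matters that reads keep flowing from replicas during a promotion.&lt;/p&gt;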
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;p&gt;OpenAI's PostgreSQL story is the strongest available evidence that conventional wisdom about 'you must shard at scale' is not a law — it's a heuristic that depends heavily on workload shape. The lessons here are about operational discipline, honest workload analysis, and knowing the limits of your architecture before they find you.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;01.&lt;/strong&gt;  &lt;strong&gt;Analyze your workload before choosing your architecture.&lt;/strong&gt; OpenAI's single-primary architecture works because ChatGPT is overwhelmingly read-heavy. A write-heavy workload at the same scale would fail with this architecture. The lesson is not 'use a single primary' — it's 'design for your actual access patterns, not for the scale number on the slide.'&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;02.&lt;/strong&gt; &lt;em&gt;Connection pooling&lt;/em&gt; (deploying a proxy like PgBouncer between application servers and PostgreSQL that multiplexes thousands of application connections into a smaller pool of database connections, reducing connection overhead and preventing connection exhaustion) is not optional at scale. At ChatGPT's traffic volume, hitting the instance's 5,000-connection limit without pooling would have caused regular outages. PgBouncer turned a recurring incident cause into a non-issue. Deploy it before you need it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;03.&lt;/strong&gt;  &lt;strong&gt;Review ORM-generated SQL for high-traffic tables before shipping.&lt;/strong&gt; A 12-table join that worked fine at 1x traffic caused multiple SEV-0s at 10x. ORMs are invisible query generators. Add explicit review of ORM-generated queries — EXPLAIN ANALYZE at production load levels — as a standard pre-deployment step for database-touching code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;04.&lt;/strong&gt; Enforce schema change governance with hard timeouts. &lt;strong&gt;A DDL operation that holds a table lock for hours will cause an incident.&lt;/strong&gt; OpenAI's 5-second DDL timeout automatically cancels any schema change that cannot acquire a lock quickly. This constraint forces engineers to use online DDL tools (pg_repack, zero-downtime column addition) rather than naive ALTER TABLE on large tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;05.&lt;/strong&gt; Plan the exit from your current architecture before you need it. OpenAI's 'no new tables in PostgreSQL' policy and ongoing write workload migration to Cosmos DB are the planned evolution of the current architecture. &lt;strong&gt;A single-primary Postgres at 800M users is viable today because write load is bounded. It's viable tomorrow because write-heavy workloads are being systematically migrated out.&lt;/strong&gt; Know the limits of your current architecture and have a credible plan for crossing them.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ImageGen Launch: The One That Got Through&lt;/p&gt;

&lt;p&gt;In twelve months of operation with the fully hardened architecture, OpenAI had one SEV-0: the launch of ChatGPT ImageGen. Over 100 million new users signed up within a week, driving a &amp;gt;10x spike in write traffic — specifically new account creation and preference storage — that temporarily overwhelmed the primary's write capacity. The system recovered by design, but the event validated the 'no new tables in Postgres' policy. Write surges at viral launch scale are the known limit of single-primary architecture. The Cosmos DB migration is the known fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;THE HYBRID MIGRATION STRATEGY&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OpenAI's hybrid approach — &lt;strong&gt;keep Postgres for what it does well, migrate write-heavy workloads to Cosmos DB, enforce 'no new tables in Postgres'&lt;/strong&gt; — is a template for any team running a successful legacy database under growth pressure. The alternative extremes (migrate everything at once, or never migrate anything) are both wrong. Incremental migration guided by workload characteristics is boring, slow, and correct. The discipline is in writing down the policy and enforcing it before the crisis arrives.&lt;/p&gt;

&lt;p&gt;OpenAI runs ChatGPT for 800 million users on a single Postgres primary, and the most complex part of their database engineering is telling people not to use ORMs without reading the SQL they generate.&lt;br&gt;&lt;br&gt;
&lt;cite&gt;TechLogStack — built at scale, broken in public, rebuilt by engineers&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://techlogstack.com/explore/openai-postgresql-scaling-2026/" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;&lt;/strong&gt; (interactive diagrams, source links, and the full reader experience).&lt;/p&gt;

</description>
      <category>database</category>
      <category>openai</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>Figma's Database Grew 100x in Four Years — Here's How a Small Team Kept It From Toppling</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/figmas-database-grew-100x-in-four-years-heres-how-a-small-team-kept-it-from-toppling-53fp</link>
      <guid>https://dev.to/techlogstack/figmas-database-grew-100x-in-four-years-heres-how-a-small-team-kept-it-from-toppling-53fp</guid>
      <description>&lt;p&gt;&lt;strong&gt;Figma&lt;/strong&gt; · Databases · 18 May 2026&lt;/p&gt;

&lt;p&gt;In 2020, Figma ran on a single Postgres instance on AWS's largest available machine. Four years later, that database had grown nearly 100x. Some tables had swelled to several terabytes and billions of rows. The Postgres vacuum process, the background job that reclaims dead rows and guards against transaction ID wraparound, was causing reliability incidents. The team had only months of runway left before hitting the IOPS ceiling, and a small databases team had nine months to fix it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100x DB growth since 2020&lt;/li&gt;
&lt;li&gt;Single instance → horizontal shards&lt;/li&gt;
&lt;li&gt;9-month migration&lt;/li&gt;
&lt;li&gt;Billions of rows per table&lt;/li&gt;
&lt;li&gt;DBProxy built in Go&lt;/li&gt;
&lt;li&gt;Zero-downtime logical → physical sharding&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;We needed a bigger lever.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;— Sammy Steele, Tech Lead, Figma Databases Team, via Figma Engineering Blog&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Figma's database story follows a pattern familiar to every fast-growing product company, but with unusually high stakes and an unusually compressed timeline. In 2020, Figma ran on a &lt;strong&gt;single Postgres database&lt;/strong&gt; on AWS's largest available RDS instance. By the end of 2022, the team had done what most scaling playbooks suggest first. They added read replicas. They added &lt;em&gt;PgBouncer&lt;/em&gt;, a lightweight PostgreSQL connection pooler that sits between application code and the database and multiplexes many application connections onto a smaller pool of real database connections, which cuts connection overhead. And they &lt;em&gt;vertically partitioned&lt;/em&gt; the database, splitting it into a dozen smaller domain-specific databases that each hold a logical group of related tables (for example, one database for Figma files data and another for organization data). These steps bought them runway. They did not buy them enough runway.&lt;/p&gt;

&lt;p&gt;The data was unambiguous. Certain tables — the ones tracking Figma files, user activity, and collaboration state — were growing at rates that would soon exceed what &lt;strong&gt;Amazon RDS could support in IOPS&lt;/strong&gt;. Some of these tables already contained &lt;strong&gt;several terabytes and billions of rows&lt;/strong&gt;. At that size, Postgres's &lt;em&gt;vacuum process&lt;/em&gt; (a critical background maintenance operation in Postgres that reclaims storage from deleted rows and prevents the database from running out of 32-bit transaction IDs — if vacuuming falls behind, it can cause severe performance degradation and, in extreme cases, force the database offline) was beginning to cause reliability incidents — it was falling behind on the largest tables, unable to reclaim space fast enough to keep up with write volume. Vertical partitioning couldn't fix this: the smallest unit of vertical partitioning is a single table, and these individual tables were the problem.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;THE VACUUM PROBLEM AT SCALE&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Postgres must periodically vacuum tables to reclaim space from deleted and updated rows. This is not optional — if a table accumulates too many dead tuples, query performance degrades severely. On tables with billions of rows and high write rates, the vacuum process can fall behind the rate of new writes. When this happens, the database starts showing reliability symptoms: &lt;strong&gt;bloated tables, degraded query plans, and in extreme cases the risk of transaction ID wraparound&lt;/strong&gt; — a catastrophic condition that forces Postgres into read-only emergency mode. Figma was seeing the early signs of this at scale.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Why Not CockroachDB, TiDB, Spanner, or Vitess?
&lt;/h3&gt;

&lt;p&gt;Figma's databases team evaluated every obvious alternative before committing to building their own horizontal sharding layer. &lt;strong&gt;CockroachDB, TiDB, Google Spanner, and Vitess&lt;/strong&gt; were all on the list. All were rejected for the same core reason: switching to any of them would have required a complex data migration across two different database stores simultaneously. With only months of runway remaining before hitting critical IOPS limits, a migration to an unfamiliar storage layer under deadline pressure was a risk the team couldn't accept. They had also accumulated significant operational expertise running RDS Postgres. That expertise would have to be rebuilt from scratch for any new system. The team instead chose to build horizontal sharding on top of their existing RDS Postgres infrastructure — not a generic solution, but one scoped precisely to Figma's data model and access patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  IOPS Ceiling and Vacuum Incidents Converge
&lt;/h4&gt;

&lt;p&gt;By late 2022, Figma's largest tables had grown to several terabytes with billions of rows, and the Postgres vacuum process was causing reliability incidents on the highest-write tables. Projections showed the team would exceed RDS maximum IOPS within months. Vertical partitioning — splitting databases by domain — could not help because individual tables were the bottleneck, not cross-domain coupling.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Single-Table Ceiling: The Vertical Partitioning Limit
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Horizontal sharding&lt;/em&gt; (splitting a single large table's rows across multiple physical database instances based on a shard key — allowing any individual table to grow beyond the limits of a single machine) was the only viable path. But implementing it on a complex relational data model, with hundreds of engineers writing queries, required solving three hard problems simultaneously: routing queries correctly, maintaining developer productivity, and enabling rollback if something went wrong.&lt;/p&gt;




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Colos + Logical Sharding + DBProxy
&lt;/h4&gt;

&lt;p&gt;The team invented three interlocking abstractions: 'colos' (colocation groups of related tables sharing a shard key), logical sharding via Postgres views (which allowed safe percentage-based rollout without moving any data), and DBProxy (a custom Go query proxy with an AST parser that routed queries to the correct physical shard). Together these allowed incremental, reversible rollout of horizontal sharding without disrupting product development.&lt;/p&gt;




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Nine Months, Nearly Infinite Scalability
&lt;/h4&gt;

&lt;p&gt;The migration completed in nine months with zero downtime and the ability to roll back at any step. Future shard splits at the physical level are now transparent to application developers — after the initial upfront work to make a table compatible with horizontal sharding, all subsequent scale-outs happen in the infrastructure layer without any product team involvement.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;🪄&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The most elegant part of Figma's sharding approach was using standard &lt;strong&gt;Postgres views&lt;/strong&gt; to implement logical sharding. A view like &lt;code&gt;CREATE VIEW table_shard1 AS SELECT * FROM table WHERE hash(shard_key) BETWEEN min AND max&lt;/code&gt; lets Postgres behave as if data is already sharded — without any data moving. This made the logical sharding phase essentially free to roll back: change the view definition, flip the config, done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Shadow Planning Framework&lt;/p&gt;

&lt;p&gt;Before building DBProxy's query engine, the team needed to know which queries to support. They built a &lt;strong&gt;shadow planning framework&lt;/strong&gt; that let engineers define potential sharding schemes for their tables, then ran those plans against live production traffic — logging the queries and plans to Snowflake for offline analysis. This gave them empirical data to design a query language covering the most common &lt;strong&gt;90% of queries&lt;/strong&gt; while deliberately excluding the rare worst-case patterns that would have made DBProxy impossibly complex.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The constraints the team placed on their query language were deliberate and principled. All range scan and point queries were supported. Cross-table joins were &lt;strong&gt;only allowed when both tables belonged to the same colo and the join was on the sharding key&lt;/strong&gt;. Scatter-gather queries — those that must fan out to all shards because they lack a shard key — were supported but their use was actively discouraged because each scatter-gather effectively multiplies database load by the shard count. Application developers were encouraged to refactor scatter-gather access patterns before sharding their tables, using the shadow planning data to understand which of their queries fell into this category.&lt;/p&gt;
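
&lt;p&gt;The join and scatter-gather rules above can be expressed as two small checks. This is a sketch, not DBProxy's logic: the &lt;code&gt;COLOS&lt;/code&gt; registry, the &lt;code&gt;users_colo&lt;/code&gt; entry and its tables, and both function names are assumptions for illustration.&lt;/p&gt;

```python
from typing import Optional

# Hypothetical colo registry. users_colo and its tables are invented here.
COLOS = {
    "files_colo": {
        "shard_key": "file_id",
        "tables": {"figma_files", "file_nodes", "file_comments"},
    },
    "users_colo": {
        "shard_key": "user_id",
        "tables": {"users", "user_settings"},
    },
}


def colo_of(table: str) -> Optional[str]:
    """Return the name of the colo that owns a table, or None if unsharded."""
    for name, colo in COLOS.items():
        if table in colo["tables"]:
            return name
    return None


def join_allowed(left: str, right: str, join_column: str) -> bool:
    """A join is allowed only within one colo and only on that colo's shard key."""
    left_colo = colo_of(left)
    if left_colo is None or left_colo != colo_of(right):
        return False
    return COLOS[left_colo]["shard_key"] == join_column


def shard_queries_per_request(has_shard_key: bool, num_shards: int) -> int:
    """A keyed query touches one shard; a scatter-gather touches all of them."""
    return 1 if has_shard_key else num_shards
```

&lt;p&gt;With 16 shards, for example, each scatter-gather request becomes 16 shard-level queries, which is the load multiplication the team wanted developers to refactor away.&lt;/p&gt;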

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Five Goals They Refused to Compromise&lt;/p&gt;

&lt;p&gt;Figma's team defined five non-negotiables before writing a line of sharding code: &lt;strong&gt;minimize developer impact&lt;/strong&gt; (product engineers shouldn't need to rewrite queries), &lt;strong&gt;scale-out transparency&lt;/strong&gt; (future shard splits invisible to application layer), &lt;strong&gt;no expensive backfills&lt;/strong&gt; (no solution requiring moving terabytes before going live), &lt;strong&gt;incremental progress&lt;/strong&gt; (percentage-based rollout at every step), and &lt;strong&gt;rollback at any stage&lt;/strong&gt; — even after physical sharding. Every architectural decision was measured against these five goals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Postgres Vacuum Threat Nobody Talks About&lt;/p&gt;

&lt;p&gt;Postgres uses 32-bit transaction IDs, and every transaction that writes consumes one. Vacuum must periodically freeze old rows so their IDs can be reused. If vacuum falls so far behind that the oldest unfrozen rows approach an age of 2^31 (about two billion) transactions, Postgres &lt;strong&gt;stops accepting new writes to prevent transaction ID wraparound&lt;/strong&gt;, an emergency stop that protects against data corruption. On tables with billions of rows and heavy write rates, falling behind on vacuuming is not a theoretical risk. Figma was experiencing real reliability incidents from vacuum lag on their largest tables. This was the alarm that confirmed horizontal sharding was urgent, not optional.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Three-Layer Solution
&lt;/h3&gt;

&lt;p&gt;Figma's horizontal sharding solution had three distinct architectural components that worked together. &lt;strong&gt;Colos&lt;/strong&gt; (colocations) were the conceptual layer — groups of related tables that shared the same sharding key and physical shard layout. Tables within a colo could be joined and queried transactionally as long as the join was on the sharding key. The sharding keys were chosen from a small set: &lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;file_id&lt;/code&gt;, or &lt;code&gt;org_id&lt;/code&gt; — most tables at Figma could be naturally associated with one of these. &lt;strong&gt;Logical sharding&lt;/strong&gt; was the rollout layer — using Postgres views to simulate sharding behavior without moving any data. &lt;strong&gt;DBProxy&lt;/strong&gt; was the execution layer — intercepting queries, parsing them into an AST, determining which logical shard the query targeted, and routing it to the appropriate physical database.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;100x&lt;/strong&gt; — Database growth since 2020 — the scale that made vertical partitioning insufficient and horizontal sharding the only viable path forward&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;9 months&lt;/strong&gt; — Total migration timeline from design to production completion — achieved with a small team under aggressive growth-driven deadline pressure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;90%&lt;/strong&gt; — Query coverage targeted by DBProxy's query engine — the pragmatic threshold that kept the proxy simple while covering the vast majority of production access patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0&lt;/strong&gt; — Application layer changes required for future shard splits — after initial table compatibility work, all subsequent scale-outs are transparent to product engineers
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Logical sharding via Postgres views: the key insight&lt;/span&gt;
&lt;span class="c1"&gt;-- No data moves during logical sharding phase.&lt;/span&gt;
&lt;span class="c1"&gt;-- Tables behave as if sharded — just views on the same physical table.&lt;/span&gt;

&lt;span class="c1"&gt;-- Single physical table still holds all data:&lt;/span&gt;
&lt;span class="c1"&gt;-- CREATE TABLE figma_files (file_id UUID, org_id UUID, data JSONB, ...)&lt;/span&gt;

&lt;span class="c1"&gt;-- Logical shards created as views filtered by hash range:&lt;/span&gt;
&lt;span class="c1"&gt;-- hashtext() can return negative integers, so normalize the remainder into 0..3:&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;figma_files_shard_0&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;figma_files&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;hashtext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_id&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;figma_files_shard_1&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;figma_files&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;hashtext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_id&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Views accept both reads AND writes in Postgres:&lt;/span&gt;
&lt;span class="c1"&gt;-- INSERT INTO figma_files_shard_0 (file_id, data) VALUES (...);&lt;/span&gt;
&lt;span class="c1"&gt;-- → Postgres routes to the underlying table&lt;/span&gt;
&lt;span class="c1"&gt;-- → DBProxy validates the shard key is in the correct range&lt;/span&gt;

&lt;span class="c1"&gt;-- Physical sharding later:&lt;/span&gt;
&lt;span class="c1"&gt;-- Data is ACTUALLY moved to separate RDS instances per shard&lt;/span&gt;
&lt;span class="c1"&gt;-- DBProxy routing stays the same — application code unchanged&lt;/span&gt;
&lt;span class="c1"&gt;-- Rollback: re-point physical shard back to original instance&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;DBPROXY: THE QUERY ENGINE&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DBProxy is a Go service sitting between the application layer and PgBouncer. Its query engine has three components: a &lt;strong&gt;query parser&lt;/strong&gt; that transforms SQL into an AST, a &lt;strong&gt;logical planner&lt;/strong&gt; that extracts query type and shard IDs from the AST, and a &lt;strong&gt;physical planner&lt;/strong&gt; that maps logical shard IDs to physical database instances and rewrites queries accordingly. DBProxy also handles scatter-gather queries (fanning out to all shards and aggregating results), dynamic load-shedding, improved observability, and database topology management. Building it took months — but it was the only way to make sharding transparent to application developers.&lt;/p&gt;
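
&lt;p&gt;The three-stage pipeline (parser, logical planner, physical planner) can be sketched as a toy in Python. Everything here is a stand-in: DBProxy is written in Go and uses a real SQL parser, while this example uses a regex that accepts one query shape, a CRC32 hash, four logical shards, and invented host names &lt;code&gt;db-a&lt;/code&gt; and &lt;code&gt;db-b&lt;/code&gt;.&lt;/p&gt;

```python
import re
import zlib
from dataclasses import dataclass
from typing import List, Optional, Tuple

NUM_LOGICAL_SHARDS = 4  # assumed shard count
# Assumed topology: two logical shards per physical host.
PHYSICAL_HOSTS = {0: "db-a", 1: "db-a", 2: "db-b", 3: "db-b"}

_QUERY_RE = re.compile(
    r"\s*SELECT .+ FROM (\w+)(?: WHERE file_id = '([^']*)')?\s*;?\s*$",
    re.IGNORECASE,
)


@dataclass
class ParsedQuery:
    table: str
    shard_key_value: Optional[str]  # None means the query has no shard key


def parse(sql: str) -> ParsedQuery:
    """Toy replacement for a SQL-to-AST parser; accepts one SELECT shape only."""
    match = _QUERY_RE.match(sql)
    if match is None:
        raise ValueError("unsupported query shape")
    return ParsedQuery(table=match.group(1), shard_key_value=match.group(2))


def logical_plan(query: ParsedQuery) -> List[int]:
    """Pick the logical shard ids: one for a keyed query, all for scatter-gather."""
    if query.shard_key_value is None:
        return list(range(NUM_LOGICAL_SHARDS))
    key = query.shard_key_value.encode("utf-8")
    return [zlib.crc32(key) % NUM_LOGICAL_SHARDS]


def physical_plan(query: ParsedQuery, shard_ids: List[int]) -> List[Tuple[str, str]]:
    """Map each logical shard to (physical host, per-shard relation name)."""
    return [(PHYSICAL_HOSTS[s], f"{query.table}_shard_{s}") for s in shard_ids]


def route(sql: str) -> List[Tuple[str, str]]:
    query = parse(sql)
    return physical_plan(query, logical_plan(query))
```

&lt;p&gt;Because only the &lt;code&gt;PHYSICAL_HOSTS&lt;/code&gt; mapping changes when a shard moves, this separation is one way to read the article's claim that later shard splits stay invisible to application code.&lt;/p&gt;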

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Logical Before Physical: The Two-Phase Rollout&lt;/p&gt;

&lt;p&gt;Figma's key migration insight was separating logical sharding from physical sharding. &lt;strong&gt;Logical sharding&lt;/strong&gt; (Phase 1) makes the application behave as if tables are sharded — using views, updating DBProxy config — but all data still lives in one physical database. This can be rolled out as a percentage-based config change, validated against production traffic, and rolled back instantly. &lt;strong&gt;Physical sharding&lt;/strong&gt; (Phase 2) actually moves data to separate RDS instances. Much higher risk — but by this point, the logical layer has been running in production for weeks, bugs are fixed, and the team has empirical confidence in the sharding correctness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Rollback Guarantee: Even After Physical Sharding&lt;/p&gt;

&lt;p&gt;Most horizontal sharding implementations are one-way migrations — once data is on separate physical instances, rolling back requires a complex reverse migration. Figma's team designed their system so that &lt;strong&gt;physical shard splits are reversible&lt;/strong&gt;. They maintained the ability to point physical shards back to the original database instance while the new routing logic was validated. This reduced the risk of being stuck in a bad state when unknown unknowns inevitably occurred.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Figma's Three-Phase Database Scaling Journey: Before and After&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;Bottleneck Addressed&lt;/th&gt;
&lt;th&gt;Runway Gained&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2020&lt;/td&gt;
&lt;td&gt;Single RDS Postgres instance&lt;/td&gt;
&lt;td&gt;Initial growth&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021–2022&lt;/td&gt;
&lt;td&gt;Vertical partitioning (12 domain DBs) + read replicas + PgBouncer&lt;/td&gt;
&lt;td&gt;CPU, read load, connection pool&lt;/td&gt;
&lt;td&gt;~1 year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2023–2024&lt;/td&gt;
&lt;td&gt;Horizontal sharding via colos + DBProxy + logical/physical migration&lt;/td&gt;
&lt;td&gt;Table-level IOPS ceiling, vacuum backlog, billions-of-row tables&lt;/td&gt;
&lt;td&gt;Near-infinite scalability&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;🔀&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scatter-Gather: The Necessary Evil&lt;/p&gt;

&lt;p&gt;Some queries don't have a shard key. A query like 'get all recently modified files for an admin dashboard' has no natural file_id scope. DBProxy handles these with scatter-gather: fan the query out to every shard in parallel, collect results, merge and sort. It works correctly but is expensive. Figma's engineering team was &lt;strong&gt;explicit with product engineers about the scatter-gather tax&lt;/strong&gt;, encouraging them to refactor access patterns before their tables were sharded. The shadow planning data showed exactly which queries would become scatter-gather, so engineering teams had weeks to fix them before cutover.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The before-state of Figma's architecture had application services talking directly to PgBouncer, which connected to RDS Postgres. Vertical partitioning meant multiple databases, but each database was still a single physical instance — and the largest individual tables still had no mechanism to distribute their rows across instances. DBProxy was inserted between the application and PgBouncer layers, adding the query parsing and routing intelligence that made horizontal sharding possible without requiring application code changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before Horizontal Sharding: Vertical Partitions Only
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/figma-postgres-horizontal-sharding-2024/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  After: DBProxy + Logical/Physical Horizontal Sharding
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/figma-postgres-horizontal-sharding-2024/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;COLOS: THE DEVELOPER-FACING ABSTRACTION&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The colo concept is what made horizontal sharding usable for product engineers. A colo is a named group of tables that share a sharding key — for example, the &lt;code&gt;files_colo&lt;/code&gt; contains &lt;code&gt;figma_files&lt;/code&gt;, &lt;code&gt;file_nodes&lt;/code&gt;, &lt;code&gt;file_comments&lt;/code&gt;, and other tables all sharded by &lt;code&gt;file_id&lt;/code&gt;. Within a colo, &lt;strong&gt;cross-table joins and full transactions are supported&lt;/strong&gt; when restricted to a single shard key value. This matches how Figma's application code already accessed the database — most operations concerned a single file or a single user, not cross-colo data. The colo abstraction minimized the number of queries that needed to be refactored.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Scatter-Gather Tax&lt;/p&gt;

&lt;p&gt;Queries without a shard key — those that need results from all shards — are handled by DBProxy's scatter-gather mechanism: the query is fanned out to all shards in parallel and results are merged. Scatter-gather is correct but expensive: it multiplies read load by the number of shards. &lt;strong&gt;Having too many scatter-gather queries would defeat the purpose of horizontal sharding&lt;/strong&gt;. The shadow planning framework specifically identified scatter-gather patterns in the production query log before sharding, allowing teams to refactor the most frequent offenders before their tables were migrated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DBProxy: Six Months to Build, Indefinite Value&lt;/p&gt;

&lt;p&gt;Building DBProxy — the Go service with an AST parser, logical planner, and physical planner — was the highest-risk engineering bet in the sharding project. It took months to build and required solving problems that existing tools had already solved in different ways. But the payoff was precise control: &lt;strong&gt;DBProxy understands Figma's specific query patterns&lt;/strong&gt;, supports exactly the subset of SQL that Figma uses, and can be extended as Figma's needs evolve. A generic proxy would have required adapting Figma's code to its limitations. DBProxy was adapted to Figma's code.&lt;/p&gt;
&lt;/blockquote&gt;
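
&lt;p&gt;One way to picture colos and the logical/physical split is as two lookups: table to colo, then shard key to logical shard to physical database. The sketch below is an assumption-heavy illustration. The hash-based placement, the &lt;code&gt;users_colo&lt;/code&gt; entry, the count of eight logical shards, and the &lt;code&gt;files-db-*&lt;/code&gt; names are all hypothetical, and none of it is taken from DBProxy itself.&lt;/p&gt;

```python
import hashlib

# Hypothetical colo definitions: tables in a colo share one sharding key.
COLOS = {
    "files_colo": {
        "shard_key": "file_id",
        "tables": {"figma_files", "file_nodes", "file_comments"},
    },
    "users_colo": {"shard_key": "user_id", "tables": {"users"}},
}
LOGICAL_SHARDS = 8

# Logical shard -> physical database. Assumed names, not a real topology.
PHYSICAL_MAP = {n: "files-db-0" for n in range(LOGICAL_SHARDS)}


def colo_for(table):
    for name, colo in COLOS.items():
        if table in colo["tables"]:
            return name
    raise KeyError(f"{table} is not in any colo")


def logical_shard(key_value):
    """Stable hash of the shard key value onto a fixed set of logical shards."""
    digest = hashlib.sha256(str(key_value).encode()).digest()
    return int.from_bytes(digest[:8], "big") % LOGICAL_SHARDS


def route(tables, key_value):
    """Return the physical database for a query scoped to one shard key value."""
    colos = {colo_for(table) for table in tables}
    if len(colos) != 1:
        raise ValueError("cross-colo queries cannot be served from one shard")
    return PHYSICAL_MAP[logical_shard(key_value)]


before = route(["figma_files", "file_comments"], key_value=42)

# A physical split only rewrites the map; callers of route() do not change.
for n in range(LOGICAL_SHARDS // 2, LOGICAL_SHARDS):
    PHYSICAL_MAP[n] = "files-db-1"
after = route(["figma_files", "file_comments"], key_value=42)
```

&lt;p&gt;Because callers only ever name tables and a key value, the join across &lt;code&gt;figma_files&lt;/code&gt; and &lt;code&gt;file_comments&lt;/code&gt; keeps working after the split, while a query that mixes colos is rejected up front.&lt;/p&gt;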




&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;p&gt;Figma's sharding story is widely cited because it did something genuinely hard — horizontally sharding a complex relational production database under deadline pressure — and documented the architecture decisions clearly enough for other teams to learn from. The lessons are about sequencing, abstraction, and the courage to build something custom when existing tools genuinely don't fit.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;01.&lt;/strong&gt;  &lt;strong&gt;Separate logical sharding from physical sharding.&lt;/strong&gt; Implementing sharding routing behavior at the application layer — using views, config, or a proxy — before moving any physical data gives you weeks of production validation at essentially zero risk. When you flip to physical sharding, the routing is already proven correct. This two-phase approach is the biggest risk reducer in a horizontal sharding migration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;02.&lt;/strong&gt; &lt;em&gt;Colocations&lt;/em&gt; (groups of related tables that share the same sharding key and physical shard layout, allowing cross-table joins and transactions within the group) are the abstraction that makes sharding survivable for product engineers. Without colos, horizontal sharding forces engineers to think about shard routing on every database query. With colos, most queries just work as they always did.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;03.&lt;/strong&gt;  &lt;strong&gt;Use shadow traffic to define your query language before building your proxy.&lt;/strong&gt; Figma's shadow planning framework let them empirically measure which query patterns existed in production before designing DBProxy. This meant the proxy was built for real queries, not imagined ones — and the 10% of queries excluded from support were known and manageable, not discovered as surprises in production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;04.&lt;/strong&gt; Know when existing tools don't fit your timeline. Figma evaluated CockroachDB, Spanner, TiDB, and Vitess — all good systems. They chose to build something custom not out of arrogance but because &lt;strong&gt;the migration risk to an unfamiliar storage layer under a months-long deadline was genuinely higher than building a scoped custom solution&lt;/strong&gt; on their existing Postgres expertise. The build-vs-buy decision was made with real risk data, not intuition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;05.&lt;/strong&gt; Design for rollback even after the migration completes. Figma maintained the ability to reverse physical shard splits after they happened. &lt;strong&gt;The unknown unknowns in a horizontal sharding migration are real&lt;/strong&gt; — building in a reverse path at every phase is the engineering discipline that lets teams execute confidently rather than hold their breath.&lt;/li&gt;
&lt;/ol&gt;
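
&lt;p&gt;Lesson 03 can be made concrete with a toy classifier over a query log. The sketch below is only an illustration of the idea: the log format, the &lt;code&gt;SHARD_KEY&lt;/code&gt; table, and the three labels are assumptions, and a real shadow planner would parse SQL rather than read pre-extracted filter columns.&lt;/p&gt;

```python
from collections import Counter

# Hypothetical sharding config: table -> shard key column.
SHARD_KEY = {"figma_files": "file_id", "file_comments": "file_id"}

# Hypothetical log entries; real logs would hold SQL text to be parsed.
QUERY_LOG = [
    {"table": "figma_files", "filters": ["file_id"]},
    {"table": "file_comments", "filters": ["file_id", "author_id"]},
    {"table": "figma_files", "filters": ["updated_at"]},
    {"table": "figma_files", "filters": ["file_id"]},
]


def classify(query):
    """Label a logged query by how a sharded topology would have to serve it."""
    key = SHARD_KEY.get(query["table"])
    if key is None:
        return "unsharded"
    return "single_shard" if key in query["filters"] else "scatter_gather"


report = Counter(classify(query) for query in QUERY_LOG)
scatter_share = report["scatter_gather"] / len(QUERY_LOG)
```

&lt;p&gt;Running a report like this against production traffic before cutover is what turns the unsupported or expensive queries into a known list instead of a surprise.&lt;/p&gt;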

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Post-Migration State: Scale Without Change&lt;/p&gt;

&lt;p&gt;After the initial work to make a table horizontal-sharding-compatible, all future shard splits happen transparently. As a table grows again toward limits, the infrastructure team can split shards — updating the physical topology and DBProxy's routing config — without any product engineer touching their code. This is the payoff of the upfront investment: the database can now scale indefinitely at the infrastructure layer, decoupled from the application layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WHEN MONTHS OF RUNWAY MEANS NOW&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Figma's team framed their problem as 'months of runway remaining' — meaning that if they did nothing, they would hit a hard scaling ceiling and likely experience database reliability incidents or outages within months. This framing was not catastrophizing; it was &lt;strong&gt;the math of their growth rate applied to their IOPS limit&lt;/strong&gt;. The urgency drove the decision to build custom rather than migrate to an unfamiliar system. Teams facing a similar trajectory should run this calculation early — months of runway sounds like plenty of time until the migration itself takes several months.&lt;/p&gt;

&lt;p&gt;Figma's database grew 100x and a small team fixed it in nine months — which is either very good database engineering or very good use of Postgres views depending on who you ask, and the answer is both.&lt;br&gt;&lt;br&gt;
&lt;cite&gt;TechLogStack — built at scale, broken in public, rebuilt by engineers&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;
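
&lt;p&gt;The runway calculation mentioned above is short enough to write down. This sketch assumes load compounds at a steady monthly rate; every number in it is made up for illustration and none of them are Figma's figures.&lt;/p&gt;

```python
import math


def months_of_runway(current_iops, iops_limit, monthly_growth):
    """Months until load, compounding monthly, reaches the hard limit."""
    if not monthly_growth > 0:
        raise ValueError("monthly_growth must be positive")
    if current_iops >= iops_limit:
        return 0.0
    return math.log(iops_limit / current_iops) / math.log(1 + monthly_growth)


# Hypothetical inputs chosen for illustration only.
runway = months_of_runway(current_iops=40_000, iops_limit=64_000, monthly_growth=0.08)
migration_months = 9  # assumed project length
must_start_now = migration_months >= runway
```

&lt;p&gt;With these inputs the runway comes out at a little over six months, which is shorter than the assumed project, so the work would already be late.&lt;/p&gt;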




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://techlogstack.com/explore/figma-postgres-horizontal-sharding-2024/" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;&lt;/strong&gt; (interactive diagrams, source links, and the full reader experience).&lt;/p&gt;

</description>
      <category>database</category>
      <category>figma</category>
      <category>systemdesign</category>
      <category>architecture</category>
    </item>
    <item>
      <title>How Stripe Moves Petabytes Between Database Shards Without Stopping the Money</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Sun, 17 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/how-stripe-moves-petabytes-between-database-shards-without-stopping-the-money-24kg</link>
      <guid>https://dev.to/techlogstack/how-stripe-moves-petabytes-between-database-shards-without-stopping-the-money-24kg</guid>
      <description>&lt;p&gt;&lt;strong&gt;Stripe&lt;/strong&gt; · Databases · 17 May 2026&lt;/p&gt;

&lt;p&gt;Stripe processed over $1 trillion in payment volume in 2023 while maintaining 99.999% uptime — five nines, fewer than 6 minutes of downtime all year. The infrastructure secret is a database platform called DocDB and a migration engine that moves petabytes of financial data between shards without any application knowing it happened.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$1T+ payment volume 2023&lt;/li&gt;
&lt;li&gt;99.999% uptime achieved&lt;/li&gt;
&lt;li&gt;5M database queries/sec&lt;/li&gt;
&lt;li&gt;1.5 PB migrated in 2023&lt;/li&gt;
&lt;li&gt;Thousands of shards managed&lt;/li&gt;
&lt;li&gt;Zero-downtime migrations&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;$1T+&lt;/strong&gt; — Payment volume processed by Stripe in 2023 — making their database reliability requirements some of the most demanding in the industry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;99.999%&lt;/strong&gt; — Uptime achieved — five nines means less than 5.26 minutes of total downtime per year across all Stripe APIs and payment processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5M QPS&lt;/strong&gt; — Database queries per second sustained across Stripe's DocDB fleet — comparable to some of the largest databases in the world&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.5 PB&lt;/strong&gt; — Data migrated between shards in 2023 alone using the Data Movement Platform — transparently to all applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When Stripe launched in 2011, they chose &lt;em&gt;MongoDB&lt;/em&gt; (a document-oriented NoSQL database that stores data as flexible JSON-like documents rather than fixed relational table schemas, offering developer productivity advantages for rapid iteration) because it was more developer-friendly than standard relational databases for a fast-moving startup. Over the next decade, as Stripe grew from a startup to a financial infrastructure company processing trillion-dollar payment volumes, the team built a layer on top of MongoDB that they call &lt;em&gt;DocDB&lt;/em&gt; — a &lt;em&gt;Database-as-a-Service&lt;/em&gt; (an abstraction layer that gives application developers a simple API for data access while hiding all the complexity of sharding, replication, failover, and migrations beneath it). DocDB handles &lt;em&gt;horizontal sharding&lt;/em&gt; (a database scaling technique that distributes data across multiple independent database instances, called shards, based on a partition key, so that no single instance holds all the data or serves all the traffic) across thousands of shards, manages replication for high availability, and — crucially — enables the zero-downtime data migrations that allow Stripe's database fleet to scale continuously without ever taking payments offline.&lt;/p&gt;

&lt;p&gt;The central innovation of DocDB is its &lt;strong&gt;Data Movement Platform&lt;/strong&gt; — a system that can migrate chunks of data between shards while both the source and target shards continue serving live production traffic. This capability is essential for Stripe's operations: as certain merchants grow rapidly and their shard fills up, it needs to be split. As the fleet evolves and some shards become underutilized, they can be consolidated. When a new MongoDB version is released, shards can be upgraded by &lt;em&gt;fork-lifting&lt;/em&gt; (migrating data to a new instance running the target version, avoiding multi-step in-place upgrades that pass through each intermediate version) to the new version rather than performing multi-step in-place upgrades. All of these operations have one requirement in common: &lt;strong&gt;Stripe cannot stop accepting payments while they happen&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;THE FIVE NINES CONSTRAINT&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;99.999% uptime means &lt;strong&gt;less than 5.26 minutes of downtime per year&lt;/strong&gt;. For a payment processor, downtime is not just an SLA violation — it's merchants unable to complete sales, customers unable to pay, and real-time revenue loss for the businesses Stripe serves. Every database operation — migration, split, consolidation, upgrade — must happen transparently. The constraint is absolute: there is no maintenance window at Stripe's scale.&lt;/p&gt;
&lt;/blockquote&gt;
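
&lt;p&gt;The 5.26-minute figure follows directly from the availability percentage. A small sketch of the arithmetic, assuming a 365-day year:&lt;/p&gt;

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years are ignored


def downtime_budget_minutes(availability):
    """Allowed downtime per year for a given availability fraction."""
    return MINUTES_PER_YEAR * (1 - availability)


five_nines = downtime_budget_minutes(0.99999)
four_nines = downtime_budget_minutes(0.9999)
```

&lt;p&gt;Five nines leaves roughly 5.26 minutes a year and four nines roughly 52.6, so each extra nine shrinks the budget tenfold.&lt;/p&gt;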

&lt;h3&gt;
  
  
  The Six-Step Migration Protocol
&lt;/h3&gt;

&lt;p&gt;The Data Movement Platform executes every shard migration through a six-step protocol: (1) register the migration plan in the &lt;em&gt;chunk metadata service&lt;/em&gt; (a central catalog that tracks which data chunks live on which shards — the source of truth for query routing across Stripe's fleet), (2) build indexes on the target shard before data arrives (avoiding the performance penalty of indexing after a large data load), (3) bulk-copy a snapshot of the chunk from source to target, (4) stream async replication to apply changes made on the source since the snapshot was taken, (5) perform correctness checks to verify data consistency, (6) switch traffic to the target and deregister the chunk from the source. Steps 3 and 4 were where Stripe hit unexpected engineering challenges — and where the most creative solutions emerged.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Shard Splits and Consolidations Required Downtime
&lt;/h4&gt;

&lt;p&gt;Without the Data Movement Platform, scaling Stripe's database fleet required either accepting downtime during shard operations or building complex dual-write logic for every migration. As Stripe's fleet grew to thousands of shards, this was operationally unsustainable and created real risk for every migration event.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Financial Data Cannot Tolerate Inconsistency
&lt;/h4&gt;

&lt;p&gt;Payment data has zero tolerance for consistency errors — a payment record that exists on the source shard but hasn't yet appeared on the target is a payment that could be double-charged, lost, or corrupted if traffic switches at the wrong moment. The six-step protocol was designed specifically to guarantee that by the time traffic switches, the target is &lt;strong&gt;exactly consistent&lt;/strong&gt; with the source including all writes made during migration.&lt;/p&gt;




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  CDC Replication + Correctness Verification
&lt;/h4&gt;

&lt;p&gt;Stripe solved the consistency problem using &lt;em&gt;Change Data Capture&lt;/em&gt; (a technique that continuously reads the MongoDB operation log (oplog) to stream every write applied to the source shard to the target, keeping the target synchronized even as live traffic modifies the source data) streaming from the source shard's oplog. After CDC replication catches up to near-real-time, correctness checks compare source and target before traffic is switched. The switch itself is atomic from the application's perspective.&lt;/p&gt;
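
&lt;p&gt;The snapshot-then-replay idea behind CDC can be shown with plain dictionaries. This is a toy model under loud assumptions: &lt;code&gt;SourceShard&lt;/code&gt;, its integer clock, and the tuple-shaped log entries are hypothetical and do not reflect MongoDB's oplog format or Stripe's tooling.&lt;/p&gt;

```python
import copy


class SourceShard:
    """Hypothetical shard that records every write in an ordered log."""

    def __init__(self, docs):
        self.docs = dict(docs)
        self.oplog = []  # entries are (timestamp, op, doc_id, doc)
        self.clock = 0

    def write(self, op, doc_id, doc=None):
        self.clock += 1
        if op == "delete":
            self.docs.pop(doc_id, None)
        else:
            self.docs[doc_id] = doc
        self.oplog.append((self.clock, op, doc_id, doc))


def replay(target, entries, since):
    """Apply, in order, every logged write newer than the snapshot."""
    applied_up_to = since
    for ts, op, doc_id, doc in entries:
        if ts > since:
            if op == "delete":
                target.pop(doc_id, None)
            else:
                target[doc_id] = copy.deepcopy(doc)
            applied_up_to = ts
    return applied_up_to


source = SourceShard({"pay_1": {"amount": 100}, "pay_2": {"amount": 250}})

# Bulk copy: snapshot taken at the current clock value.
snapshot_ts = source.clock
target = copy.deepcopy(source.docs)

# Live traffic keeps modifying the source while the copy is in flight.
source.write("upsert", "pay_3", {"amount": 75})
source.write("upsert", "pay_1", {"amount": 120})
source.write("delete", "pay_2")

caught_up_to = replay(target, source.oplog, since=snapshot_ts)
```

&lt;p&gt;After the replay the target holds the same documents as the source, including the update and the delete that happened after the snapshot, which is the state the correctness check then confirms.&lt;/p&gt;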




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1.5 Petabytes Moved in 2023 Transparently
&lt;/h4&gt;

&lt;p&gt;In 2023 alone, Stripe migrated 1.5 petabytes of data between shards, consolidated thousands of databases through bin packing, and upgraded the entire MongoDB fleet — all with zero application downtime and no payment processing interruptions. 99.999% uptime was maintained throughout.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;DocDB's ability to migrate data between shards in a consistent, granular and reliable way has made it significantly easier for Stripe to scale.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;— Jimmy Morzaria, Suraj Narkhede — via Stripe Engineering Blog, June 2024&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Bulk Load Throughput Problem&lt;/p&gt;

&lt;p&gt;Step 3 of the migration — bulk loading a snapshot of the chunk onto the target shard — hit a &lt;strong&gt;significant throughput limitation&lt;/strong&gt; during testing. Stripe's engineering team tried batching writes and tuning DocDB engine parameters, but neither approach resolved the bottleneck. The root cause was an impedance mismatch between the bulk loader and the target shard's write path: the target shard was not optimized for sequential ingestion at high speeds. The engineering team eventually solved this by building purpose-built bulk import tooling with different I/O patterns than the standard DocDB write path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🗃️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Stripe manages &lt;strong&gt;thousands of DocDB shards&lt;/strong&gt; — and periodically performs bin-packing consolidations where underutilized shards are merged to reduce operational overhead and hardware costs. In 2023 they reduced the total number of underlying DocDB shards by approximately three-quarters through such consolidation, migrating 1.5 petabytes of data in the process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⬆️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Fork-Lift Upgrade Strategy&lt;/p&gt;

&lt;p&gt;Traditional in-place database major version upgrades require going through each intermediate version sequentially — upgrading from MongoDB 4.0 to 5.0 to 6.0, for example, each step requiring careful validation. Stripe's Data Movement Platform enables a &lt;strong&gt;fork-lift strategy&lt;/strong&gt;: provision a new shard running the target version, migrate the data to it, switch traffic, decommission the old shard. Any version can jump to any other version in a single migration step. This eliminates the risk accumulation of multi-step in-place upgrades.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DocDB: Not a Rewrite, an Extension&lt;/p&gt;

&lt;p&gt;A key decision in Stripe's database evolution was building DocDB &lt;strong&gt;on top of MongoDB Community&lt;/strong&gt; rather than replacing MongoDB with a different database. This preserved compatibility with existing application code, the existing data model, and years of operational knowledge. The extensions — sharding, proxy routing, migration tooling — were added as a platform layer, not a fork. This pragmatic approach to building on existing foundations rather than starting from scratch is characteristic of Stripe's infrastructure philosophy.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  DocDB Architecture: The Database-as-a-Service Abstraction
&lt;/h3&gt;

&lt;p&gt;DocDB's architecture is a three-tier system sitting between Stripe's application code and raw MongoDB instances. The &lt;strong&gt;Database Proxy&lt;/strong&gt; is the entry point for all application read/write requests — it performs access control checks, validates queries, and routes requests to the correct shard by consulting the chunk metadata service. The &lt;strong&gt;Chunk Metadata Service&lt;/strong&gt; maintains the authoritative map of which data chunks live on which shards. The &lt;strong&gt;Database Shards&lt;/strong&gt; are replicated MongoDB instances that store the actual data. Applications talk only to the proxy; they are completely unaware of sharding, shard splits, or migrations in progress.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified 6-step Data Movement Platform migration flow
# Each step is atomic and resumable — migrations can be paused and continued
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DataMovementPlatform&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;migrate_chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_shard&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_shard&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Step 1: Register migration plan — makes migration visible to monitoring
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chunk_metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_migration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;source_shard&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;target_shard&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 2: Pre-build indexes on target BEFORE data arrives
&lt;/span&gt;        &lt;span class="c1"&gt;# Avoids the performance penalty of indexing a large loaded dataset
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build_indexes_on_target&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target_shard&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 3: Bulk copy snapshot at time T
&lt;/span&gt;        &lt;span class="c1"&gt;# Uses purpose-built I/O patterns for high-throughput sequential writes
&lt;/span&gt;        &lt;span class="n"&gt;snapshot_timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bulk_copy_snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_shard&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_shard&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 4: Stream CDC replication — catch up all writes since snapshot
&lt;/span&gt;        &lt;span class="c1"&gt;# Reads MongoDB oplog on source; applies to target until near-real-time
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cdc_replicate_to_target&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;source_shard&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_shard&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;since&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;snapshot_timestamp&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 5: Correctness verification — compare source and target
&lt;/span&gt;        &lt;span class="c1"&gt;# Financial data requires full consistency before any traffic switch
&lt;/span&gt;        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;verify_consistency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_shard&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_shard&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 6: Atomic traffic switch — update chunk metadata, switch routing
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chunk_metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_active_shard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_shard&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Applications querying the chunk now get routed to target
&lt;/span&gt;        &lt;span class="c1"&gt;# Deregister from source after confirmation
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chunk_metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;deregister_from_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_shard&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;BIN-PACKING: REDUCING THE FLEET BY 75%&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In 2023, Stripe used the Data Movement Platform to &lt;strong&gt;bin-pack thousands of underutilized shards&lt;/strong&gt; into a smaller number of larger shards. Bin-packing is the reverse of splitting: instead of one shard becoming two, many small shards are consolidated into fewer shards with more data. This reduced the total number of DocDB shards by approximately 75% while moving 1.5 petabytes — dramatically reducing operational overhead and hardware costs without any application code changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Multitenant to Single-Tenant: Isolation on Demand&lt;/p&gt;

&lt;p&gt;DocDB supports migrating a large merchant's data from a &lt;strong&gt;shared multitenant shard&lt;/strong&gt; (multiple merchants on one shard) to a &lt;strong&gt;dedicated single-tenant shard&lt;/strong&gt; (one merchant per shard). This is done transparently via the Data Movement Platform: the merchant's data is migrated to a dedicated shard, traffic routing is updated atomically, and the merchant gets dedicated resources without any downtime or visible change in behavior. This capability is increasingly important as Stripe's largest customers grow to Shopify, Amazon, and OpenAI scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Heat Management System: Next Chapter&lt;/p&gt;

&lt;p&gt;At the time of the June 2024 blog post, Stripe was &lt;strong&gt;prototyping a heat management system&lt;/strong&gt; that proactively balances data across shards based on real-time access patterns. Rather than waiting for a shard to become a bottleneck and then splitting it reactively, the heat management system would detect access pattern shifts and pre-emptively migrate hot data to shards with more capacity. Reactive sharding at Stripe's scale will eventually give way to predictive sharding.&lt;/p&gt;
&lt;/blockquote&gt;
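
&lt;p&gt;The bin-packing consolidation described above is a classic packing problem. The sketch below uses the first-fit decreasing heuristic as one plausible approach. The source does not say which algorithm Stripe uses, and the shard names, sizes, and capacity are invented for the example.&lt;/p&gt;

```python
def bin_pack(shard_sizes_gb, capacity_gb):
    """First-fit decreasing: place each shard in the first bin with room."""
    bins = []  # each bin tracks its remaining room and the shards merged into it
    ordered = sorted(shard_sizes_gb.items(), key=lambda item: item[1], reverse=True)
    for name, size in ordered:
        if size > capacity_gb:
            raise ValueError(f"{name} does not fit in any bin")
        for candidate in bins:
            if candidate["free"] >= size:
                candidate["free"] -= size
                candidate["shards"].append(name)
                break
        else:
            # No existing bin had room, so open a new consolidated shard.
            bins.append({"free": capacity_gb - size, "shards": [name]})
    return [candidate["shards"] for candidate in bins]


# Hypothetical sizes for eight underutilized shards.
sizes = {"s1": 30, "s2": 20, "s3": 50, "s4": 10, "s5": 40, "s6": 25, "s7": 15, "s8": 10}
plan = bin_pack(sizes, capacity_gb=100)
```

&lt;p&gt;Here eight small shards fit into two consolidated ones. Each entry in &lt;code&gt;plan&lt;/code&gt; would then become a set of chunk migrations run through the same six-step protocol.&lt;/p&gt;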

&lt;p&gt;Correctness verification (Step 5) is the most cautious part of the migration protocol, and deliberately so. The platform compares a sample of records between source and target shards after CDC replication has caught up. For financial data, even a single inconsistency before the traffic switch would be unacceptable — a payment that exists on the source but not on the target could be double-charged or lost if the switch happens before it replicates. &lt;strong&gt;The verification step is the safety gate that makes five-nines availability compatible with live shard migrations.&lt;/strong&gt; The cost is time — migrations take longer because of the verification window — but that cost is the explicit price of correctness guarantees on financial data.&lt;/p&gt;
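
&lt;p&gt;A verification gate of this kind can be approximated by comparing per-document fingerprints. The sketch below is an assumption, not Stripe's checker: it compares every document rather than a sample, and the &lt;code&gt;digest&lt;/code&gt; and &lt;code&gt;verify_consistency&lt;/code&gt; helpers and their return shape are hypothetical.&lt;/p&gt;

```python
import hashlib
import json


def digest(doc):
    """Fingerprint of one document that ignores key order."""
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def verify_consistency(source, target):
    """Return ids that are missing, extra, or different on the target."""
    problems = {}
    for doc_id in source.keys() | target.keys():
        if doc_id not in target:
            problems[doc_id] = "missing_on_target"
        elif doc_id not in source:
            problems[doc_id] = "extra_on_target"
        elif digest(source[doc_id]) != digest(target[doc_id]):
            problems[doc_id] = "mismatch"
    return problems


source = {
    "pay_1": {"amount": 100, "currency": "usd"},
    "pay_2": {"amount": 250, "currency": "eur"},
}
good_target = {
    "pay_1": {"currency": "usd", "amount": 100},
    "pay_2": {"currency": "eur", "amount": 250},
}
bad_target = {"pay_1": {"amount": 101, "currency": "usd"}}

clean = verify_consistency(source, good_target)
dirty = verify_consistency(source, bad_target)
```

&lt;p&gt;The traffic switch would proceed only when the report is empty. Any entry, such as the missing &lt;code&gt;pay_2&lt;/code&gt; here, blocks the migration until replication closes the gap.&lt;/p&gt;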

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Bulk Load Throughput Engineering Challenge&lt;/p&gt;

&lt;p&gt;During testing, Stripe found that standard MongoDB write patterns were insufficiently fast for bulk data loading during shard migrations. Batching writes and tuning engine parameters both failed to resolve the throughput bottleneck. The root cause: the standard MongoDB write path is optimized for &lt;strong&gt;low-latency individual writes&lt;/strong&gt;, not for &lt;strong&gt;high-throughput sequential bulk loads&lt;/strong&gt;. The engineering team built custom I/O patterns specifically for the bulk copy phase of migrations — patterns that bypassed some standard write overhead in favor of throughput.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;THE OPLOG AND FINANCIAL CONSISTENCY&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;MongoDB's &lt;em&gt;oplog&lt;/em&gt; (a capped collection that stores all write operations in order, used for replication across MongoDB replica sets) is the technical foundation of CDC replication in DocDB. Every write to the source shard appears in the oplog in order. By replaying the oplog on the target shard in sequence, the Data Movement Platform guarantees that every write applied to the source during migration is also applied to the target — preserving full consistency of financial records. The oplog is not just a replication mechanism: it is a &lt;strong&gt;linearizable history of financial truth&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;DocDB's architecture enforces a clean separation between application code and database topology. Applications at Stripe never connect directly to MongoDB instances — they connect to the &lt;em&gt;Database Proxy&lt;/em&gt;, which is the single point of truth for routing, access control, and scalability decisions. This indirection is what makes zero-downtime migrations possible: the proxy can update its routing table atomically as migrations complete, and applications never see anything other than consistent data.&lt;/p&gt;
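
&lt;p&gt;The indirection can be sketched as a proxy that looks up the owning shard on every request. Everything here is a simplified stand-in: &lt;code&gt;ChunkMetadata&lt;/code&gt; and &lt;code&gt;DatabaseProxy&lt;/code&gt; are hypothetical in-memory classes, the lock is just one way to make the routing update atomic, and only &lt;code&gt;set_active_shard&lt;/code&gt; borrows its name from the simplified migration flow shown earlier.&lt;/p&gt;

```python
import threading


class ChunkMetadata:
    """Hypothetical in-memory stand-in for the chunk metadata service."""

    def __init__(self, routes):
        self._routes = dict(routes)
        self._lock = threading.Lock()

    def shard_for(self, chunk_id):
        with self._lock:
            return self._routes[chunk_id]

    def set_active_shard(self, chunk_id, shard):
        # Readers see either the old owner or the new one, never a partial state.
        with self._lock:
            self._routes[chunk_id] = shard


class DatabaseProxy:
    """Applications call the proxy and never address a shard directly."""

    def __init__(self, metadata, shards):
        self.metadata = metadata
        self.shards = shards  # shard name -> dict of documents

    def read(self, chunk_id, doc_id):
        shard = self.metadata.shard_for(chunk_id)
        return self.shards[shard][doc_id]


shards = {
    "shard-a": {"pay_1": {"amount": 100}},
    "shard-b": {"pay_1": {"amount": 100}},  # already copied and verified
}
metadata = ChunkMetadata({"chunk-7": "shard-a"})
proxy = DatabaseProxy(metadata, shards)

before = proxy.read("chunk-7", "pay_1")
metadata.set_active_shard("chunk-7", "shard-b")  # the traffic switch
after = proxy.read("chunk-7", "pay_1")
```

&lt;p&gt;The caller issues the same &lt;code&gt;read&lt;/code&gt; before and after the switch and gets the same document back, even though a different shard served it.&lt;/p&gt;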

&lt;h3&gt;
  
  
  DocDB Architecture: Three-Tier Database-as-a-Service
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/stripe-docdb-data-movement-2024/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Data Movement Platform: Six-Step Migration Protocol
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/stripe-docdb-data-movement-2024/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;THE PROXY MAKES MIGRATIONS TRANSPARENT&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Database Proxy's role is the architectural key to zero-downtime migrations. By &lt;strong&gt;abstracting away shard topology from application code&lt;/strong&gt;, the proxy can update routing atomically at Step 6 — the traffic switch — without any application restarting, reconnecting, or changing behavior. Applications see a continuous stream of consistent reads and writes before and after the switch. The migration is completely invisible from the application layer, which is the entire point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Change Data Capture: Reading the Oplog&lt;/p&gt;

&lt;p&gt;MongoDB maintains an &lt;em&gt;oplog&lt;/em&gt; (operation log — a capped MongoDB collection that records every write operation applied to the database, used for replication and CDC streaming) that records every write in sequence. DocDB's CDC service reads this oplog on the source shard and replays every operation on the target shard in order. This keeps the target continuously synchronized with the source during the migration window. When replication lag drops to near-zero, the correctness verification and traffic switch can proceed safely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Transparent Application Layer: The Developer Experience&lt;/p&gt;

&lt;p&gt;Stripe's application engineers interact with DocDB through a simple API: read a document, write a document, query by index. They never configure sharding keys, never think about which shard holds a specific customer's data, and never coordinate with the database infrastructure team before their code ships. The abstraction layer is what makes it possible for Stripe's product engineering velocity to be decoupled from its database scaling complexity — two teams that would otherwise be in each other's way operate independently.&lt;/p&gt;
&lt;/blockquote&gt;
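&lt;p&gt;The gate between replication and the traffic switch can be sketched as a check on both lag and contents. This is a simplified illustration under stated assumptions: &lt;code&gt;checksum&lt;/code&gt; and &lt;code&gt;ready_to_switch&lt;/code&gt; are hypothetical helpers, shards are modeled as in-memory dictionaries, and lag is counted in log entries rather than time.&lt;/p&gt;

```python
import hashlib
import json
from typing import Any, Dict


def checksum(docs: Dict[str, Dict[str, Any]]) -> str:
    """Order-independent digest of a set of documents keyed by id."""
    digest = hashlib.sha256()
    for doc_id in sorted(docs):
        payload = json.dumps(docs[doc_id], sort_keys=True)
        digest.update(f"{doc_id}:{payload}\n".encode("utf-8"))
    return digest.hexdigest()


def ready_to_switch(lag_entries: int,
                    source: Dict[str, Dict[str, Any]],
                    target: Dict[str, Dict[str, Any]],
                    max_lag: int = 0) -> bool:
    """Allow the switch only when lag is low and both sides hold the same data."""
    if lag_entries > max_lag:
        return False  # target is still catching up
    return checksum(source) == checksum(target)
```

&lt;p&gt;A real verifier would compare chunks incrementally instead of hashing everything at once, but the decision rule is the same: no switch until the copies match.&lt;/p&gt;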




&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;p&gt;Stripe's DocDB and Data Movement Platform represent a decade of investment in making financial database operations invisible to application code. The lessons here are about architectural abstraction, the price of correctness, and why migration tooling is a competitive advantage.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;01.&lt;/strong&gt;  &lt;strong&gt;A database abstraction layer is an operational multiplier.&lt;/strong&gt; Stripe's applications never talk to MongoDB directly — they talk to the proxy. This indirection cost engineering time upfront but enabled zero-downtime migrations, transparent sharding, and fleet-wide upgrades for a decade of scale growth. The abstraction layer is where scaling strategies live.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;02.&lt;/strong&gt; &lt;em&gt;Change Data Capture&lt;/em&gt; (reading a database's operation log to stream every change to a downstream consumer in real time) is the foundation of live migration. Without CDC, migrating a live database requires a maintenance window. With CDC, you copy a snapshot, stream the delta, verify consistency, then switch traffic atomically. Build CDC capability into your database infrastructure before you need live migrations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;03.&lt;/strong&gt;  &lt;strong&gt;Pre-build indexes on the target before loading data.&lt;/strong&gt; Loading data first and then building indexes on a large dataset is far more expensive than building the indexes on empty data and then inserting. For petabyte-scale migrations, this ordering difference can be the difference between hours and days. Stripe explicitly sequences index creation before bulk data arrival.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;04.&lt;/strong&gt; Gradual traffic restoration and correctness verification before the switch are not optional for financial data. &lt;strong&gt;A migration that completes fast but introduces even a single data inconsistency is worse than a slow correct migration.&lt;/strong&gt; For domains where correctness is non-negotiable, treat Step 5 (verification) as the most important step in your migration protocol.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;05.&lt;/strong&gt; &lt;em&gt;Bin-packing&lt;/em&gt; (consolidating many small, underutilized database shards into fewer larger shards to reduce operational overhead and hardware costs) is as important as shard splitting for long-term database fleet health. As traffic patterns shift, some shards become cold. Without consolidation, you accumulate operational overhead and hardware waste. Plan for bidirectional shard topology management from day one.&lt;/li&gt;
&lt;/ol&gt;
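&lt;p&gt;The bin-packing lesson can be illustrated with the classic first-fit decreasing heuristic. This is a generic sketch, not Stripe's placement algorithm: &lt;code&gt;pack_shards&lt;/code&gt;, the shard names, and the capacity figure are hypothetical, and a production planner would also weigh throughput and isolation, not just size.&lt;/p&gt;

```python
from typing import Dict, List


def pack_shards(shard_sizes_gb: Dict[str, float], capacity_gb: float) -> List[List[str]]:
    """First-fit decreasing: put each shard into the first group with room.

    Each returned group lists the small shards that would be consolidated
    into one larger target shard of at most capacity_gb.
    """
    groups: List[List[str]] = []
    free: List[float] = []
    # Largest shards first, which tends to leave fewer half-empty groups.
    for name, size in sorted(shard_sizes_gb.items(), key=lambda kv: (-kv[1], kv[0])):
        if size > capacity_gb:
            raise ValueError(f"{name} ({size} GB) exceeds the target capacity")
        for i, room in enumerate(free):
            if room >= size:
                groups[i].append(name)
                free[i] = room - size
                break
        else:
            # No existing group had room, so open a new one.
            groups.append([name])
            free.append(capacity_gb - size)
    return groups
```

&lt;p&gt;For example, five shards of 60, 50, 40, 30, and 20 GB fit into two 100 GB targets instead of five separate ones.&lt;/p&gt;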

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Correctness vs Speed Tradeoff&lt;/p&gt;

&lt;p&gt;Stripe's Data Movement Platform deliberately accepts &lt;strong&gt;slower migrations in exchange for guaranteed correctness&lt;/strong&gt;. The CDC replication phase, the correctness verification step, and the atomic traffic switch all add latency to the migration timeline that a less careful system could avoid. For a company processing $1 trillion in payments, data inconsistency risk is not a speed-for-correctness tradeoff — it's a business continuity risk. The migration protocol encodes this priority explicitly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MIGRATION TOOLING AS INFRASTRUCTURE&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Stripe's Data Movement Platform is not a script that gets run during migrations — it is &lt;strong&gt;production infrastructure&lt;/strong&gt; that runs continuously, managing ongoing shard operations across thousands of databases. The platform has its own SLOs, its own monitoring, its own oncall rotation. Building migration tooling as first-class infrastructure rather than ad-hoc tooling is what enables Stripe to migrate petabytes per year without extraordinary engineering effort per migration.&lt;/p&gt;

&lt;p&gt;Stripe moved 1.5 petabytes of financial data between database shards in 2023 and nobody noticed — which is either the most boring success story in engineering history or the most impressive one.&lt;br&gt;&lt;br&gt;
&lt;cite&gt;TechLogStack — built at scale, broken in public, rebuilt by engineers&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://techlogstack.com/explore/stripe-docdb-data-movement-2024/" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;&lt;/strong&gt; (interactive diagrams, source links, and the full reader experience).&lt;/p&gt;

</description>
      <category>database</category>
      <category>stripe</category>
      <category>systemdesign</category>
      <category>scalability</category>
    </item>
    <item>
      <title>A Database Permission Change in ClickHouse Took Down 28% of Cloudflare's HTTP Traffic</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Sun, 17 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/a-database-permission-change-in-clickhouse-took-down-28-of-cloudflares-http-traffic-1mg1</link>
      <guid>https://dev.to/techlogstack/a-database-permission-change-in-clickhouse-took-down-28-of-cloudflares-http-traffic-1mg1</guid>
      <description>&lt;p&gt;&lt;strong&gt;Cloudflare&lt;/strong&gt; · Reliability · 17 May 2026&lt;/p&gt;

&lt;p&gt;On November 2, 2023 — the same day as the control plane datacenter failure — Cloudflare also experienced a separate six-hour global outage. The cause: a database permission change in ClickHouse generated a corrupt configuration file that was silently propagated to every server in Cloudflare's Bot Management system, crashing it globally.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nov 2 2023 outage&lt;/li&gt;
&lt;li&gt;28% HTTP traffic impacted&lt;/li&gt;
&lt;li&gt;6 hours total duration&lt;/li&gt;
&lt;li&gt;2.5h to find root cause&lt;/li&gt;
&lt;li&gt;ClickHouse permission change&lt;/li&gt;
&lt;li&gt;Bot Management crashed globally&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;p&gt;November 2, 2023 was an unusually bad day at Cloudflare. The datacenter power failure that took down the control plane had already created a major incident. Then, separately and concurrently, a different failure caused a completely independent global outage affecting 28% of Cloudflare's HTTP traffic. The two incidents shared a date but not a cause. The Bot Management outage was caused by a database permission change in &lt;em&gt;ClickHouse&lt;/em&gt; (a column-oriented database designed for real-time analytical queries, used by Cloudflare for its Bot Management system to query feature metadata) that inadvertently generated a &lt;strong&gt;corrupt configuration file&lt;/strong&gt; — and the corrupt file was propagated globally to every Bot Management node before anyone noticed something was wrong.&lt;/p&gt;

&lt;p&gt;The mechanics are precise. Cloudflare's Bot Management system queries a ClickHouse database to fetch &lt;strong&gt;feature metadata&lt;/strong&gt;: data used to evaluate whether a given request exhibits bot-like behavior patterns. A database change altered the permissions for those queries, causing them to fall back from the distributed tables normally used to a different database called 'default', which held a larger set of 60 features. The Bot Management configuration file generator fetched this expanded feature set and emitted a file larger than the software consuming it could handle. The oversized file was then &lt;strong&gt;propagated throughout Cloudflare's global network&lt;/strong&gt;, instantly and completely, as a standard configuration update.&lt;/p&gt;
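&lt;p&gt;The failure chain can be reproduced in miniature. The sketch below is illustrative only: the function names and the limit of 40 are assumptions (the article does not state the real limit), and the only number taken from the incident description is the 60-feature fallback result.&lt;/p&gt;

```python
from typing import Iterable, List

# Hypothetical capacity of the consumer; the real limit is not stated in the article.
MAX_FEATURES = 40


class FeatureFileTooLarge(Exception):
    """Raised by the consumer when the file holds more features than it can load."""


def build_feature_file(rows: Iterable[str]) -> List[str]:
    """Generator side: writes out whatever the metadata query returned, unchecked."""
    return list(rows)


def load_feature_file(features: List[str], limit: int = MAX_FEATURES) -> List[str]:
    """Consumer side: sized for the normal result set, so it rejects anything larger."""
    if len(features) > limit:
        raise FeatureFileTooLarge(f"{len(features)} features exceeds limit of {limit}")
    return features
```

&lt;p&gt;With a normal query result the loader succeeds; with the 60-row fallback result the same loader raises, even though neither the generator nor the loader changed. The only thing that changed was which rows the query returned.&lt;/p&gt;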

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;THE GLOBAL PROPAGATION PROBLEM&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cloudflare's configuration system was designed to propagate changes globally as fast as possible — this is a feature for legitimate configuration updates. For security changes, speed matters. For this incident, speed was the accelerant: a corrupt configuration file reached &lt;strong&gt;every Cloudflare server globally within seconds&lt;/strong&gt; of being generated. There was no staged rollout, no canary deployment, no percentage-based rollout. One bad file. Every server. Instantly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  ClickHouse Permission Change Triggers Fallback
&lt;/h4&gt;

&lt;p&gt;A database permission change in ClickHouse caused Bot Management queries to fall back from distributed tables to the 'default' database containing 60 features. The configuration file generator fetched the larger dataset, generating a file that exceeded the size limit of the consuming software.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Oversized Config Silently Propagated Globally
&lt;/h4&gt;

&lt;p&gt;The oversized configuration file was not validated before propagation. Cloudflare's configuration distribution system treated it like any other config update and propagated it globally to all Bot Management nodes. Every node crashed when it tried to load the oversized file. 28% of HTTP traffic was impacted because Bot Management is in the critical path for Cloudflare's proxy layer.&lt;/p&gt;




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  2.5h to Find Root Cause, 3.5h to Fix and Deploy
&lt;/h4&gt;

&lt;p&gt;It took 2.5 hours to identify the incorrect configuration files as the source of the outage. Early investigation suspected a DDoS attack because Cloudflare's status page, hit by an unrelated failure, went offline at the same time. Once the cause was identified, stopping the propagation and deploying a correct file took another hour, and cleanup took 2.5 more hours.&lt;/p&gt;




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Service Restored 6 Hours After Start
&lt;/h4&gt;

&lt;p&gt;The outage was resolved at 17:06 UTC, approximately 6 hours after it started. A new configuration file was deployed. Bot Management came back online globally. The postmortem identified staged configuration rollouts as the primary required fix — the same action item from the control plane outage postmortem that hadn't been implemented yet.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;🔍&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cloudflare's status page went offline at the same time as the outage, causing the incident response team to &lt;strong&gt;initially suspect a DDoS attack&lt;/strong&gt;. The status page failure was a coincidence — an unrelated issue — not part of the outage. This created a 2.5-hour investigation red herring: engineers were looking for evidence of an attack while the actual cause was a configuration file size issue.&lt;/p&gt;

&lt;p&gt;Matthew: 'None of us were happy — we were embarrassed by what had happened — but we declared it true and accurate. Sent the draft over to the SF team, who did one more sweep, then posted it.'&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Matthew Prince, CEO of Cloudflare, discussing the postmortem publication, via The Pragmatic Engineer&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Cloudflare CEO Matthew Prince wrote the first version of the incident review at home in Lisbon, the evening the incident resolved. This was not a PR-managed corporate response — it was an engineer's honest account of what went wrong, written while the incident was fresh. The postmortem was then circulated internally, reviewed by the SF team, and published. The same-day publication is unusual for a company of Cloudflare's size and is a demonstration of the &lt;strong&gt;cultural commitment to transparency&lt;/strong&gt; that makes Cloudflare's postmortems some of the most cited in the industry.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The November 2023 Postmortem Action Item (Uncompleted)&lt;/p&gt;

&lt;p&gt;The previous November 2023 Cloudflare control plane outage had included an explicit action item: implement staged configuration rollouts so that configuration files do not propagate immediately to the full global network. The Bot Management outage was, in part, a consequence of that work not yet being completed. The postmortem was explicit: staged config rollouts 'remains our first priority across the organization' but implementation was a large project that could take months.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;❌&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why 28% of Traffic Was Affected&lt;/p&gt;

&lt;p&gt;Bot Management is not a peripheral feature — it's in the critical path for Cloudflare's proxy layer. When Bot Management crashes on a node, that node's proxy functionality goes offline. 28% of Cloudflare's HTTP traffic routes through nodes where Bot Management is active in the serving path. This architectural coupling — a feature module that can take down the core proxy function — is exactly the kind of dependency that staged rollouts would have contained: a crash on 1% of nodes is very different from a crash on 100%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Bot Management Architecture: Why It Was Critical Path&lt;/p&gt;

&lt;p&gt;Cloudflare's Bot Management evaluates every HTTP request against behavioral signals to determine if it's bot traffic. This evaluation happens &lt;strong&gt;inline in the request path&lt;/strong&gt; — the proxy holds the request while Bot Management runs its checks. This design is necessary for real-time bot mitigation: if the check happened asynchronously, bots could complete their requests before being blocked. The trade-off is that a Bot Management failure blocks the request path entirely rather than allowing traffic through unprotected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;❌&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Two Outages, One Day: The Coincidence Tax&lt;/p&gt;

&lt;p&gt;The fact that Cloudflare experienced two separate major outages on November 2, 2023 — one from a datacenter power failure, one from a configuration file — created disproportionate reputational damage. Each incident was explainable individually. Together, they suggested to some customers that Cloudflare had a systemic reliability problem rather than two independent bad-luck events. The same is true in reliability engineering generally: coincident failures compound trust damage beyond what either would cause alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;THE ERROR LOGGING GAP&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One finding from the postmortem: the line of code that returned an error from the oversized configuration file &lt;strong&gt;did not log the error&lt;/strong&gt;. If errors had been logged and alerted on when they spiked on nodes, root cause identification would have taken minutes rather than 2.5 hours. Logging errors at the point they occur — not just aggregating them — and alerting on error rate spikes is fundamental debugging infrastructure. This was one of the most actionable lessons from the incident.&lt;/p&gt;
&lt;/blockquote&gt;
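&lt;p&gt;Logging at the point of failure takes very little code. The following is a minimal sketch using Python's standard &lt;code&gt;logging&lt;/code&gt; module; the logger name, &lt;code&gt;load_features&lt;/code&gt;, and &lt;code&gt;ErrorCounter&lt;/code&gt; are hypothetical and stand in for whatever loader and alerting pipeline a real system uses.&lt;/p&gt;

```python
import logging
from typing import List

logger = logging.getLogger("bot_management.config")


class ErrorCounter(logging.Handler):
    """Counts ERROR-level records so an alert can fire when the count spikes."""

    def __init__(self) -> None:
        super().__init__(level=logging.ERROR)
        self.count = 0

    def emit(self, record: logging.LogRecord) -> None:
        self.count += 1


def load_features(features: List[str], limit: int) -> List[str]:
    """Hypothetical loader that logs the failure where it happens, then raises."""
    if len(features) > limit:
        # The log line carries the numbers needed to diagnose the failure.
        logger.error("feature file rejected: %d features, limit %d",
                     len(features), limit)
        raise ValueError("feature file exceeds limit")
    return features
```

&lt;p&gt;Attaching the counter with &lt;code&gt;logger.addHandler(ErrorCounter())&lt;/code&gt; gives monitoring a signal that rises the moment nodes start rejecting a file, which is the forward trail the responders in this incident did not have.&lt;/p&gt;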




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Required Fixes: Staged Rollouts and Config Validation
&lt;/h3&gt;

&lt;p&gt;The Bot Management outage had two independent root causes that both needed to be addressed. The first: the ClickHouse permission change that caused the query fallback should have been tested in a staging environment where the configuration file output could be validated before propagation. The second: the configuration distribution system should have validated the file size and format before propagating globally — and should never have propagated a configuration change globally and instantly regardless of its validity.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;28%&lt;/strong&gt; — HTTP traffic impacted — because Bot Management is in the critical path of Cloudflare's proxy layer, a module crash takes down the proxy function for that node&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2.5h&lt;/strong&gt; — Time to identify the root cause — delayed by initial suspicion of DDoS attack after the status page coincidentally went offline at the same time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;6h&lt;/strong&gt; — Total outage duration from start to full resolution — 2.5h investigation, 1h fix deployment, 2.5h cleanup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instant&lt;/strong&gt; — Configuration propagation speed before fix — the system was designed to propagate configs globally as fast as possible, which made the corrupt config a global instant failure
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
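&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified config validation and staged rollout logic
# Addresses both root causes of the Bot Management outage
# Illustrative only: the _deploy_*, _health_check_passes and _rollback helpers
# and the CONFIG_PARSERS registry are assumed to be defined elsewhere
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ConfigValidationError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ne"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ConfigDeploymentError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ne"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;
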
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ConfigDeployer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;MAX_CONFIG_SIZE_BYTES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10_000_000&lt;/span&gt; &lt;span class="c1"&gt;# explicit size limit
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;deploy_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# VALIDATION GATE: Reject invalid configs before any propagation
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_validate_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# STAGED ROLLOUT: Not global-instant anymore
&lt;/span&gt;        &lt;span class="c1"&gt;# Phase 1: Deploy to 1% of nodes
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_deploy_to_percentage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pct&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_health_check_passes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_rollback&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# automatic rollback on health failure
&lt;/span&gt;            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ConfigDeploymentError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Health check failed at 1%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Phase 2: Expand to 10%
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_deploy_to_percentage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pct&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_health_check_passes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_rollback&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ConfigDeploymentError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Health check failed at 10%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Phase 3: Full deployment
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_deploy_global&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_validate_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Size validation — catches the ClickHouse fallback issue
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MAX_CONFIG_SIZE_BYTES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ConfigValidationError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Config size &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; exceeds maximum &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MAX_CONFIG_SIZE_BYTES&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Schema validation — catches structural issues
&lt;/span&gt;        &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CONFIG_PARSERS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;config_type&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# raises on malformed config
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;THE INVESTIGATION RED HERRING&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the most instructive details in this postmortem is the &lt;strong&gt;DDoS attack hypothesis&lt;/strong&gt;. Cloudflare's status page went offline coincidentally at the same time as the Bot Management outage — completely unrelated. Incident responders, seeing both the outage and the status page failure, initially focused on finding evidence of an attack. This wasted 2.5 hours investigating the wrong hypothesis. The lesson: when an incident starts, explicitly enumerate and test competing hypotheses rather than pursuing only the first plausible one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ClickHouse Permission Architecture&lt;/p&gt;

&lt;p&gt;Cloudflare's Bot Management uses ClickHouse to query feature metadata — data about which behavioral signals to look for in traffic. The ClickHouse cluster had two query paths: the distributed tables path (normal operation, queries a subset of features), and the 'default' database fallback (60 features, designed for different purposes). The permission change that triggered the fallback was routine maintenance — &lt;strong&gt;there was no intent to cause the fallback&lt;/strong&gt;. This is a reminder that permission changes to production databases require the same testing rigor as code changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Same-Day Postmortem: The Transparency Standard&lt;/p&gt;

&lt;p&gt;Cloudflare published the incident postmortem on the same day as the outage. This is exceptional — most companies take days or weeks to publish postmortems. The same-day publication reflects a culture where transparency with customers is treated as part of incident response, not a post-recovery PR exercise. Cloudflare's CEO wrote the first draft the evening the incident resolved. That speed and candor is why Cloudflare's postmortems are among the most trusted in the industry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Missing Error Log&lt;/p&gt;

&lt;p&gt;A key finding in the postmortem: the code that crashed when loading the oversized configuration file &lt;strong&gt;returned an error but did not log it&lt;/strong&gt;. This meant that even as nodes were crashing, the specific error causing the crash was not visible in monitoring. Engineers investigating the incident had to work backward from service failures rather than forward from error messages. Every error should be logged at the point it occurs, and log-level alerts should be configured for error rate spikes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;THE SPEED-SAFETY TRADEOFF IN CONFIG PROPAGATION&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cloudflare's instant global config propagation was designed for a real use case: when a new DDoS attack signature is detected, Cloudflare needs to push the mitigation rule globally as fast as possible. &lt;strong&gt;Security changes genuinely benefit from fast propagation&lt;/strong&gt;. The fix isn't to make config propagation slower — it's to distinguish between security-critical changes (fast propagation with validation) and configuration updates (staged rollout with health gates). Not all configuration changes have the same urgency requirements.&lt;/p&gt;
&lt;/blockquote&gt;
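&lt;p&gt;The distinction between urgent security changes and ordinary configuration updates can be expressed as a small policy table. This is a sketch under stated assumptions: &lt;code&gt;RolloutPlan&lt;/code&gt;, &lt;code&gt;PLANS&lt;/code&gt;, and &lt;code&gt;plan_for&lt;/code&gt; are hypothetical names, and the 1% / 10% / 100% stages with a 5-minute health window simply mirror the simplified deployer shown earlier in this article.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RolloutPlan:
    stages: Tuple[float, ...]      # fraction of nodes covered after each stage
    soak_minutes: int              # health-check window between stages
    validate_first: bool = True    # every plan validates before shipping


# Hypothetical policy table; stage sizes and soak times are illustrative.
PLANS = {
    "security": RolloutPlan(stages=(1.0,), soak_minutes=0),
    "standard": RolloutPlan(stages=(0.01, 0.10, 1.0), soak_minutes=5),
}


def plan_for(change_class: str) -> RolloutPlan:
    """Unknown change classes get the cautious plan rather than the fast one."""
    return PLANS.get(change_class, PLANS["standard"])
```

&lt;p&gt;Defaulting unknown change classes to the staged plan means a change has to be explicitly labeled as security-critical before it is allowed to skip the health gates, and even then it is validated first.&lt;/p&gt;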




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The Bot Management outage reveals how Cloudflare's internal architecture works at a feature module level. Bot Management is a module within Cloudflare's proxy software that evaluates every HTTP request against bot detection criteria. When it loads its configuration file at startup (or on configuration update), it reads the feature definitions that determine what signals to analyze. If that configuration file is oversized or malformed, the module crashes — and because it's in the critical path of the proxy, the proxy function for that node crashes too.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bot Management Outage: The Configuration Propagation Chain
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/cloudflare-bot-management-outage-2023/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Interactive diagram available on TechLogStack (link above).&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  After: Config Validation + Staged Rollout Architecture
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/cloudflare-bot-management-outage-2023/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Interactive diagram available on TechLogStack (link above).&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;THE 'SAME MISTAKE TWICE' CONCERN&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Two separate Cloudflare outages on the same day created a serious customer confidence problem. The November 2023 datacenter outage was an external failure. The Bot Management outage was a self-inflicted failure: a configuration change propagated globally without a staged rollout, a gap the team had already identified from the prior incident. &lt;strong&gt;Customers rightly noticed the pattern&lt;/strong&gt;. CTO Dane Knecht acknowledged in the postmortem that hardening global configuration changes 'remains our first priority across the organization', a public commitment to completing the staged rollout work that the team already knew it needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Module Criticality and Blast Radius&lt;/p&gt;

&lt;p&gt;An architecture question raised by this incident: &lt;strong&gt;should Bot Management be in the critical path of the proxy layer?&lt;/strong&gt; If Bot Management crashes, the proxy crashes. An alternative design isolates Bot Management as a non-critical component that the proxy bypasses on failure — allowing traffic to flow (without bot protection) rather than blocking entirely. This fail-open vs fail-closed design decision has security implications (fail-open allows bots through temporarily) versus availability implications (fail-closed takes the proxy down). For a CDN, the availability argument may outweigh the security argument.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔒&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fail-Open vs Fail-Closed: The Bot Management Design Decision&lt;/p&gt;

&lt;p&gt;The Bot Management outage surfaces a fundamental architecture decision: when a security module fails, should the system &lt;strong&gt;fail-open&lt;/strong&gt; (allow traffic through unprotected) or &lt;strong&gt;fail-closed&lt;/strong&gt; (block traffic until the module recovers)? Fail-open maintains availability but exposes customers to unprotected bot traffic during the failure window. Fail-closed maintains security posture but impacts availability. Cloudflare's current design is fail-closed — 28% of traffic went down rather than flowing unprotected. &lt;strong&gt;The right answer depends on whether your customers value security continuity or availability continuity more&lt;/strong&gt; during module failures.&lt;/p&gt;
&lt;/blockquote&gt;
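&lt;p&gt;The fail-open versus fail-closed choice discussed above can be sketched in a few lines. This is a minimal illustration, not Cloudflare's proxy code: the &lt;code&gt;handle_request&lt;/code&gt; function, the &lt;code&gt;FailureMode&lt;/code&gt; enum, the response shape, and the score threshold of 30 are all assumptions made for the example.&lt;/p&gt;

```python
from enum import Enum
from typing import Callable


class FailureMode(Enum):
    FAIL_OPEN = "fail_open"      # let traffic through unscored
    FAIL_CLOSED = "fail_closed"  # reject traffic while the module is down


def handle_request(
    request: dict,
    score_bot: Callable[[dict], int],
    mode: FailureMode,
) -> dict:
    """Route one request through a bot-scoring module that may crash."""
    try:
        score = score_bot(request)
    except Exception:
        if mode is FailureMode.FAIL_OPEN:
            # Availability wins: serve the request without bot protection.
            return {"status": 200, "bot_score": None, "protected": False}
        # Security wins: refuse to serve traffic that could not be scored.
        return {"status": 500, "bot_score": None, "protected": False}
    # Hypothetical threshold: scores of 30 or more are treated as human.
    status = 200 if score >= 30 else 403
    return {"status": status, "bot_score": score, "protected": True}


def broken_module(request: dict) -> int:
    """Stand-in for a scoring module that crashes on a bad config."""
    raise RuntimeError("feature file exceeded limit")
```

&lt;p&gt;With &lt;code&gt;broken_module&lt;/code&gt; plugged in, the same request returns a 200 with &lt;code&gt;protected&lt;/code&gt; set to false under fail-open and a 500 under fail-closed, which is the trade-off in its smallest form.&lt;/p&gt;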




&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;p&gt;The Cloudflare Bot Management outage teaches a simple lesson about configuration safety that applies to every distributed system: fast global propagation is an availability risk. The lessons here are architectural and process-oriented.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;01.&lt;/strong&gt;  &lt;strong&gt;Validate configuration files before propagating them.&lt;/strong&gt; Size limits, schema validation, and semantic checks should all run before a configuration update is distributed to production nodes. A corrupt config that fails validation is an alert; a corrupt config that propagates globally is an outage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;02.&lt;/strong&gt; &lt;em&gt;Staged rollouts&lt;/em&gt; for configuration changes (deploying a change to a small percentage of nodes first, checking health, then expanding gradually) are as important as staged rollouts for code changes. The same principles apply: canary, health gate, expand. Global instant propagation for configuration changes is a global outage waiting to happen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;03.&lt;/strong&gt;  &lt;strong&gt;Database permission changes are code changes.&lt;/strong&gt; They modify system behavior and can cause unexpected fallbacks, query plan changes, and downstream effects. Test them in staging. Apply them with the same rigor as schema migrations. The Cloudflare ClickHouse permission change was routine maintenance that caused a global outage because it wasn't tested for downstream effects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;04.&lt;/strong&gt; When investigating incidents, explicitly enumerate competing hypotheses and test the most likely ones in parallel. &lt;strong&gt;The DDoS false lead cost 2.5 hours&lt;/strong&gt; because investigators committed too quickly to one explanation. Structured incident investigation that tests multiple hypotheses simultaneously finds root causes faster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;05.&lt;/strong&gt; Postmortem action items must have urgency. &lt;strong&gt;The same staged rollout improvement identified in the November 2023 control plane outage postmortem would have prevented the Bot Management outage&lt;/strong&gt; if it had been implemented before the second incident. Postmortem action items are not backlog items — they are debt with interest that accrues in the form of the next incident.&lt;/li&gt;
&lt;/ol&gt;
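&lt;p&gt;The first lesson above (size limits, schema validation, and semantic checks before propagation) can be sketched as a single gate function. This is a minimal example under stated assumptions: the JSON layout, the &lt;code&gt;version&lt;/code&gt; and &lt;code&gt;features&lt;/code&gt; keys, and both limits are hypothetical and are not taken from Cloudflare's actual file format.&lt;/p&gt;

```python
import json

MAX_CONFIG_BYTES = 1_000_000   # hypothetical size budget
MAX_FEATURES = 200             # hypothetical entry limit
REQUIRED_KEYS = {"version", "features"}


class ConfigValidationError(Exception):
    """Raised when a generated config must not be propagated."""


def validate_config(raw: bytes) -> dict:
    """Run size, schema, and semantic checks; return the parsed config."""
    # 1. Size limit: reject oversized files before parsing anything.
    if len(raw) > MAX_CONFIG_BYTES:
        raise ConfigValidationError(
            f"config is {len(raw)} bytes, limit is {MAX_CONFIG_BYTES}"
        )
    # 2. Schema: must be a JSON object with the expected keys and types.
    try:
        config = json.loads(raw)
    except ValueError as exc:  # covers invalid JSON and invalid UTF-8
        raise ConfigValidationError(f"not valid JSON: {exc}") from exc
    if not isinstance(config, dict) or not REQUIRED_KEYS.issubset(config):
        raise ConfigValidationError("missing required keys")
    features = config["features"]
    if not isinstance(features, list) or not all(
        isinstance(name, str) for name in features
    ):
        raise ConfigValidationError("features must be a list of strings")
    # 3. Semantic checks: entry count within budget, no duplicate entries.
    if len(features) > MAX_FEATURES:
        raise ConfigValidationError(
            f"{len(features)} features, limit is {MAX_FEATURES}"
        )
    if len(set(features)) != len(features):
        raise ConfigValidationError("duplicate feature names")
    return config
```

&lt;p&gt;The point of the design is that a failed check raises before any distribution step runs, so a bad file becomes an alert for the generating team rather than a change on production nodes.&lt;/p&gt;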

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The 2023 Cloudflare Transparency Report&lt;/p&gt;

&lt;p&gt;Cloudflare's CEO published the incident review on the same day as the outage, and the write-up was detailed and candid about the mistakes made. This level of post-incident transparency is unusual and valuable for the industry. &lt;strong&gt;When major infrastructure providers share honest postmortems&lt;/strong&gt;, they give other engineering teams a chance to learn from failures they didn't experience themselves, and they raise the industry standard for incident communication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CONFIGURATION AS CODE: THE MISSING GATE&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Bot Management config file was generated by a system that fetched data from a database and formatted it. This is code that produces configuration. It had no equivalent of a test suite, a staging environment validation, or a size limit check. &lt;strong&gt;Configuration generators need the same quality gates as application code&lt;/strong&gt;: unit tests for the generation logic, integration tests against real database states, validation of the output before propagation, and size/schema checks at the distribution layer. Configuration generation is engineering, not operations.&lt;/p&gt;

&lt;p&gt;The same configuration safety fix that would have prevented the first outage also would have prevented the second outage — which makes the second outage Cloudflare's most expensive action item ever left in a backlog.&lt;br&gt;&lt;br&gt;
&lt;cite&gt;TechLogStack — built at scale, broken in public, rebuilt by engineers&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://techlogstack.com/explore/cloudflare-bot-management-outage-2023/" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;&lt;/strong&gt; (interactive diagrams, source links, and the full reader experience).&lt;/p&gt;

</description>
      <category>reliability</category>
      <category>cloudflare</category>
      <category>devops</category>
      <category>database</category>
    </item>
    <item>
      <title>Cloudflare Fixed a React Security Vulnerability and Broke the Entire Network</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Sun, 17 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/cloudflare-fixed-a-react-security-vulnerability-and-broke-the-entire-network-15n0</link>
      <guid>https://dev.to/techlogstack/cloudflare-fixed-a-react-security-vulnerability-and-broke-the-entire-network-15n0</guid>
      <description>&lt;p&gt;&lt;strong&gt;Cloudflare&lt;/strong&gt; · Reliability · 17 May 2026&lt;/p&gt;

&lt;p&gt;In late 2025, Cloudflare was rolling out a fix for a React security vulnerability. To do so, they needed to disable an internal testing tool with a global killswitch. The killswitch, unexpectedly, triggered a bug that sent HTTP 500 errors across Cloudflare's entire global network. This was the third major configuration-related global outage in two years.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dec 2025 global outage&lt;/li&gt;
&lt;li&gt;React CVE fix triggered outage&lt;/li&gt;
&lt;li&gt;Global killswitch bug&lt;/li&gt;
&lt;li&gt;HTTP 500 across network&lt;/li&gt;
&lt;li&gt;Third config-related outage&lt;/li&gt;
&lt;li&gt;Staged rollout still incomplete&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;p&gt;By December 2025, Cloudflare had already experienced major configuration-related global outages, starting with the November 2023 Bot Management outage, and had identified staged configuration rollouts as the primary systemic fix. That fix was still not fully implemented. Then came the React security vulnerability outage. Cloudflare was deploying a fix for a &lt;em&gt;React CVE&lt;/em&gt; (a Common Vulnerabilities and Exposures entry for a security flaw in the React JavaScript library; CVEs typically trigger time-sensitive patching workflows across the industry) in their internal tooling. The patch introduced an error in an &lt;strong&gt;internal testing tool&lt;/strong&gt;. The team disabled the testing tool with a &lt;strong&gt;global killswitch&lt;/strong&gt;. That killswitch unexpectedly triggered a bug in a seemingly unrelated code path, causing HTTP 500 errors across Cloudflare's network.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In this latest outage, Cloudflare was burnt by yet another global configuration change. The previous outage in November happened thanks to a global database permissions change. This change would make it so that Cloudflare's configuration files do not propagate immediately to the full network, as they still do now. But making all global configuration files have staged rollouts is a large implementation that could take months. Evidently, there wasn't time to make it yet, and it has come back to bite Cloudflare.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: The Pragmatic Engineer newsletter, analysis of the Cloudflare December 2025 outage&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The pattern was now impossible to ignore. Cloudflare had experienced multiple major outages in the 2023–2025 period, each in the same root-cause category: a configuration change that propagated globally and instantly, without staged rollout, and caused unexpected systemic failures. The primary action item from the November 2023 Bot Management outage, implementing staged configuration rollouts, was explicitly identified as &lt;strong&gt;a large implementation that could take months&lt;/strong&gt;. Each new outage was part of the price of that implementation not yet being complete. The React outage became a widely documented illustration of the technical debt that unimplemented postmortem action items create.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;THE KILLSWITCH THAT WASN'T JUST A KILLSWITCH&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A killswitch is a simple concept: disable something. But in a complex distributed system, disabling one component can expose unexpected dependencies. The internal testing tool that was disabled via global killswitch was apparently connected to a code path that, when the tool was absent, triggered a bug causing HTTP 500 errors. &lt;strong&gt;Killswitches are configuration changes.&lt;/strong&gt; All the same rules apply: validate them, stage them, monitor them. A killswitch deployed globally and instantly is a global instant configuration change.&lt;/p&gt;
&lt;/blockquote&gt;
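&lt;p&gt;To make the "killswitches are configuration changes" point concrete, here is a small simulation of a killswitch that is staged rather than flipped everywhere at once. Everything in it is hypothetical: the &lt;code&gt;Node&lt;/code&gt; class, the &lt;code&gt;testing_tool_enabled&lt;/code&gt; flag, and the latent bug in &lt;code&gt;serve&lt;/code&gt; are stand-ins chosen to mirror the story, not a description of Cloudflare's systems.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    flags: dict = field(default_factory=lambda: {"testing_tool_enabled": True})

    def serve(self) -> int:
        # Hypothetical latent bug: a code path assumes the tool is present.
        return 200 if self.flags["testing_tool_enabled"] else 500


def staged_killswitch(fleet: list[Node], flag: str, canary_count: int = 1) -> bool:
    """Disable `flag` on a canary slice first; expand only if it stays healthy.

    Returns True if the switch was applied fleet-wide, False if it was reverted.
    """
    canaries, rest = fleet[:canary_count], fleet[canary_count:]
    for node in canaries:
        node.flags[flag] = False
    if any(node.serve() != 200 for node in canaries):
        # Health gate failed: revert the canaries and stop the rollout.
        for node in canaries:
            node.flags[flag] = True
        return False
    for node in rest:
        node.flags[flag] = False
    return True
```

&lt;p&gt;In this toy fleet, disabling &lt;code&gt;testing_tool_enabled&lt;/code&gt; fails on the single canary and is reverted, so the other nodes never see the change. A flag with no hidden dependency passes the canary check and is applied everywhere.&lt;/p&gt;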

&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  React CVE Fix Introduces Testing Tool Error
&lt;/h4&gt;

&lt;p&gt;Cloudflare was rolling out a fix for a React security vulnerability in internal tooling. The fix caused an error in an internal testing tool, prompting the team to disable the tool. The disable was executed as a global configuration change via killswitch.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Killswitch Triggered Unexpected Code Path Bug
&lt;/h4&gt;

&lt;p&gt;The global killswitch that disabled the testing tool unexpectedly triggered a bug in a connected code path. The bug caused HTTP 500 errors across Cloudflare's network. Because the killswitch was propagated globally and instantly, the impact was immediate and global.&lt;/p&gt;




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Revert Killswitch Configuration
&lt;/h4&gt;

&lt;p&gt;The fix was to revert the killswitch configuration — undoing the disable of the testing tool that had triggered the bug. This brought Cloudflare's network back to its pre-fix state. The React CVE patch then needed to be reworked to avoid triggering the testing tool error.&lt;/p&gt;




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Service Restored, Pattern Acknowledged
&lt;/h4&gt;

&lt;p&gt;Service was restored after reverting the configuration. The postmortem was published on the same day. CTO Dane Knecht acknowledged the pattern publicly and committed to making enhanced rollouts and versioning 'the first priority across the organization' — the same commitment made after the 2023 outages.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;❌&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Third Configuration-Related Outage in Two Years&lt;/p&gt;

&lt;p&gt;The React security fix outage was the third major configuration-related global outage in Cloudflare's 2023–2025 period. The November 2023 Bot Management outage, subsequent incidents, and the December 2025 React outage all shared the same fundamental cause: a configuration change propagated globally and instantly without safety validation. The same fix had been identified after the first outage. That the fix hadn't been implemented by the third outage is a case study in the organizational cost of deprioritizing postmortem action items.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚛️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The React vulnerability that started this chain of events was a &lt;strong&gt;security patch that Cloudflare was doing the right thing by deploying&lt;/strong&gt;. Security vulnerability patching is mandatory and time-sensitive. The outage wasn't caused by bad intentions or negligence — it was caused by a security response that didn't account for all of its dependencies.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One of the most challenging aspects of Cloudflare's staged rollout implementation is the security-versus-safety tension. Cloudflare's configuration distribution system was designed to be fast because &lt;strong&gt;security changes need to be fast&lt;/strong&gt;. When a new attack pattern is detected, Cloudflare needs to push mitigation rules globally as quickly as possible. Slowing down configuration propagation has real security costs: the window between an attack being detected and the mitigation being globally deployed gets longer. The engineering challenge is building a system that can be fast for security-critical changes but staged for everything else — which requires distinguishing between change types at the infrastructure level.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CTO Dane Knecht's Public Commitment&lt;/p&gt;

&lt;p&gt;Following the December 2025 outage, Cloudflare CTO Dane Knecht wrote in the postmortem that making global configuration changes safer &lt;strong&gt;'remains our first priority across the organization.'&lt;/strong&gt; This was the same commitment made after the 2023 outages. The public, repeated commitment to the same fix, without the fix having been implemented, created accountability that was difficult to ignore. The staged rollout project was given resources and a deadline commitment following the December 2025 outage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Same-Day Postmortem: The Third Time&lt;/p&gt;

&lt;p&gt;Cloudflare published their postmortem for the December 2025 React outage on the same day the incident resolved — maintaining their remarkable transparency standard for the third major outage in two years. The postmortem's candor was notable: it explicitly referenced the November 2023 action item that hadn't been completed, and included CTO Dane Knecht's public acknowledgment that staged configuration rollouts 'remains our first priority.' Three same-day postmortems, three public commitments to the same fix, growing organizational accountability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔄&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Pattern: Configuration Changes That Break Things&lt;/p&gt;

&lt;p&gt;Looking across Cloudflare's 2023–2025 incidents, a precise pattern emerges: (1) a routine operational change is made to production infrastructure, (2) the change has unexpected downstream effects, (3) the affected configuration or rule is propagated globally and instantly, (4) the impact is global and immediate. The fix to this pattern is not 'be more careful' — it's &lt;strong&gt;staged rollout infrastructure that makes global instant propagation impossible for non-security-critical changes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WHAT MAKES CLOUDFLARE'S CASE UNIQUE&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most organizations have configuration-related incidents. What makes Cloudflare's case unusual is the scale: a configuration change at Cloudflare affects infrastructure serving &lt;strong&gt;a significant fraction of all internet traffic&lt;/strong&gt;. The blast radius is not one company's systems — it's millions of websites and their users globally. This scale makes configuration safety not just an operational concern but a responsibility to the broader internet ecosystem. Cloudflare's staged rollout implementation is infrastructure for global internet resilience.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Systemic Fix: Enhanced Rollouts and Versioning
&lt;/h3&gt;

&lt;p&gt;Cloudflare's CTO described the required fix as &lt;strong&gt;'Enhanced Rollouts and Versioning'&lt;/strong&gt; — applying the same safety and blast mitigation features to configuration data that Cloudflare already applies to software deployments. Software at Cloudflare is deployed gradually, with strict health validation at each stage. Configuration changes had no equivalent safety system. The fix required building one: a configuration versioning system that could tag changes, a rollout engine that could apply them to staged percentages, and health checks that could catch problems before wider propagation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3rd&lt;/strong&gt; — Configuration-related global outage in the 2023–2025 period — each one traceable to the same root cause: instant global config propagation without safety gates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Months&lt;/strong&gt; — Estimated implementation time for staged rollouts as quoted in the November 2023 postmortem — the duration that allowed the second and third outages to occur before the fix was complete&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same day&lt;/strong&gt; — Postmortem publication time — Cloudflare's consistent practice of same-day transparency, maintained even when the incident revealed repeated failure to implement a known fix&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Priority #1&lt;/strong&gt; — Stated organizational priority for staged configuration rollouts — acknowledged as the highest infrastructure priority after the December 2025 outage
&lt;/li&gt;
&lt;/ul&gt;
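&lt;p&gt;The versioning part of the fix described above (a system that tags each change so it can be rolled back) can be illustrated with an append-only store. This is a sketch under assumptions: the &lt;code&gt;VersionedConfigStore&lt;/code&gt; and &lt;code&gt;ConfigVersion&lt;/code&gt; names and the in-memory list are invented for the example and say nothing about how Cloudflare stores configuration.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigVersion:
    number: int
    payload: dict
    note: str


class VersionedConfigStore:
    """Append-only history of config versions with a movable 'active' pointer."""

    def __init__(self, initial: dict):
        self._history = [ConfigVersion(1, dict(initial), "initial")]
        self._active = 1

    @property
    def active(self) -> ConfigVersion:
        return self._history[self._active - 1]

    def publish(self, payload: dict, note: str) -> ConfigVersion:
        """Record a new version and make it the active one."""
        version = ConfigVersion(len(self._history) + 1, dict(payload), note)
        self._history.append(version)
        self._active = version.number
        return version

    def revert_to(self, number: int) -> ConfigVersion:
        """Re-activate a known-good version; history is never rewritten."""
        if number not in range(1, len(self._history) + 1):
            raise ValueError(f"unknown config version {number}")
        self._active = number
        return self.active
```

&lt;p&gt;Because a revert only moves the pointer, undoing a bad killswitch is a single call to &lt;code&gt;revert_to&lt;/code&gt; with the previous version number, and the failed version stays in the history for the postmortem.&lt;/p&gt;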

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Illustrative sketch of the required Enhanced Rollouts and Versioning system
# Differentiates security-critical changes (fast) from all other configuration changes (staged)
# ConfigChange and ConfigChangeType are assumed to be defined elsewhere
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ConfigRolloutEngine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;deploy_change&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ConfigChange&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Security-critical changes (DDoS mitigations, attack signatures)
&lt;/span&gt;        &lt;span class="c1"&gt;# Still fast — but with validation gate
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;ConfigChangeType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SECURITY_CRITICAL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_validate_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# validation must pass
&lt;/span&gt;            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_deploy_global_fast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# then deploy fast
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt;

        &lt;span class="c1"&gt;# All other changes: staged rollout with health gates
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_validate_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Stage 1: 1% canary
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_deploy_to_percentage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pct&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_wait_and_check_health&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Stage 2: 10% cohort  
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_deploy_to_percentage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pct&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_wait_and_check_health&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Stage 3: 50% cohort
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_deploy_to_percentage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pct&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_wait_and_check_health&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Stage 4: Full rollout
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_deploy_global&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_validate_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ConfigChange&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Size limits, schema validation, semantic checks
&lt;/span&gt;        &lt;span class="c1"&gt;# Catches the oversized ClickHouse fallback config
&lt;/span&gt;        &lt;span class="c1"&gt;# Catches malformed configs before any propagation
&lt;/span&gt;        &lt;span class="k"&gt;pass&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_wait_and_check_health&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Error rate, latency, traffic metrics
&lt;/span&gt;        &lt;span class="c1"&gt;# Auto-rollback if thresholds exceeded
&lt;/span&gt;        &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
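&lt;p&gt;The health-check step in the sketch above is left as a stub. One possible way to fill it in is a pure function that compares a stage's metrics against a pre-change baseline. The &lt;code&gt;HealthSnapshot&lt;/code&gt; type, the &lt;code&gt;health_gate&lt;/code&gt; name, and both thresholds are assumptions for illustration; real thresholds would have to be tuned per service.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthSnapshot:
    requests: int
    errors_5xx: int
    p99_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors_5xx / self.requests if self.requests else 0.0


def health_gate(
    baseline: HealthSnapshot,
    current: HealthSnapshot,
    max_error_rate_increase: float = 0.001,  # hypothetical: +0.1 percentage points
    max_latency_ratio: float = 1.5,          # hypothetical: p99 may grow by 50%
) -> str:
    """Return 'proceed' or 'rollback' by comparing a stage against its baseline."""
    if current.error_rate - baseline.error_rate > max_error_rate_increase:
        return "rollback"
    if current.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return "rollback"
    return "proceed"
```

&lt;p&gt;Comparing against a baseline rather than a fixed number keeps the gate meaningful for services that always carry a small background error rate.&lt;/p&gt;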



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;THE SECURITY-SPEED TENSION&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The core tension in Cloudflare's configuration safety problem is that their configuration system was designed for security use cases where speed matters. Staged rollouts introduce latency that's unacceptable for DDoS mitigation rules. The solution requires &lt;strong&gt;distinguishing between change types&lt;/strong&gt;: security responses (fast propagation + validation) versus configuration updates (staged propagation + health gates). This distinction is architecturally complex: the system needs to know the change type, enforce the right deployment mode, and maintain separate pipelines without creating a new single point of failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Three-Outage Forcing Function&lt;/p&gt;

&lt;p&gt;If the staged rollout implementation had been deprioritized after the November 2023 outage, the December 2025 outage provided an undeniable forcing function. Three configuration-related global outages in two years, all with the same root cause, create organizational pressure that cannot be managed with further prioritization discussions. The December 2025 outage finally resulted in resources, a deadline commitment, and executive ownership for the staged rollout project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Postmortem Action Items Need Priority Enforcement&lt;/p&gt;

&lt;p&gt;The Cloudflare staged rollout story is one of the industry's clearest examples of what happens when postmortem action items are treated as backlog items rather than critical debt. The November 2023 postmortem identified the fix. The December 2025 outage demonstrated the cost of not implementing it. Engineering organizations need mechanisms to track postmortem action items with urgency, not just completeness — including escalation paths when critical action items age without progress.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Resources Finally Allocated After Three Incidents&lt;/p&gt;

&lt;p&gt;The December 2025 outage served as the organizational forcing function that earlier incidents hadn't fully achieved. Following the third configuration-related global outage in two years, Cloudflare allocated dedicated engineering resources, a named project lead, and a committed delivery timeline for the Enhanced Rollouts and Versioning system. The system is now being built as production infrastructure rather than a backlog item.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Security-Critical Fast Path&lt;/p&gt;

&lt;p&gt;One of the hardest engineering problems in the staged rollout system is the security-critical fast path. When Cloudflare detects a new DDoS attack pattern or zero-day exploit, they need to push mitigations to every PoP globally within seconds, not over the tens of minutes that a multi-stage rollout with health-check windows takes. The system must &lt;strong&gt;distinguish at the protocol level&lt;/strong&gt; between security-critical changes (which maintain fast propagation) and configuration updates (which go through staged rollout). Building this distinction correctly, without creating a bypass that regular configuration changes can be misclassified into, is the core engineering challenge.&lt;/p&gt;
&lt;/blockquote&gt;
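&lt;p&gt;One way to reduce the misclassification risk described above is to make the staged pipeline the default and require two independent conditions for the fast path. The sketch below assumes a hypothetical allowlist of change kinds and a hypothetical approval flag; the names &lt;code&gt;choose_pipeline&lt;/code&gt;, &lt;code&gt;FAST_PATH_KINDS&lt;/code&gt;, and &lt;code&gt;approved_by_security&lt;/code&gt; are illustrative only.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum


class Pipeline(Enum):
    FAST = "fast"
    STAGED = "staged"


# Hypothetical allowlist: only these change kinds may ever use the fast path.
FAST_PATH_KINDS = frozenset({"ddos_mitigation_rule", "attack_signature"})


@dataclass(frozen=True)
class ConfigChange:
    kind: str
    requested_fast: bool = False
    approved_by_security: bool = False


def choose_pipeline(change: ConfigChange) -> Pipeline:
    """Default to staged; the fast path needs an allowlisted kind and an approval."""
    if not change.requested_fast:
        return Pipeline.STAGED
    if change.kind not in FAST_PATH_KINDS:
        raise PermissionError(f"{change.kind!r} is not eligible for the fast path")
    if not change.approved_by_security:
        raise PermissionError("fast path requires security approval")
    return Pipeline.FAST
```

&lt;p&gt;Under these assumptions, a killswitch that asks for fast propagation is rejected outright instead of being silently downgraded, which makes attempted bypasses visible.&lt;/p&gt;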




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The React outage sits in a chain of failures that reveals a systemic architectural vulnerability in Cloudflare's control plane. At the data plane level — PoPs, traffic routing, DDoS mitigation — Cloudflare's architecture is highly resilient. At the configuration plane level — the system that distributes rules and settings to the data plane — the architecture was designed for speed rather than safety. Three outages in two years from the same root cause is the empirical evidence that speed without safety is not viable at global infrastructure scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Configuration Safety Gap: 2023–2025 Timeline
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/cloudflare-react-config-outage-2025/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Required Enhanced Rollout Architecture for Cloudflare
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/cloudflare-react-config-outage-2025/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;THE ORGANIZATIONAL LESSON: ACTION ITEMS NEED OWNERS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cloudflare's staged rollout work was identified as a priority after three separate incidents. Each time, it was described as a large implementation requiring months. In hindsight, the organizational failure was not the identification — it was the &lt;strong&gt;lack of a named owner with authority, resources, and a committed deadline&lt;/strong&gt;. Postmortem action items without named owners, resource allocation, and deadline accountability often age in backlogs until a subsequent incident forces the conversation again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cloudflare's Transparency as Industry Standard&lt;/p&gt;

&lt;p&gt;Despite three major outages with related root causes, Cloudflare's consistent same-day postmortem publication is widely recognized as an industry best practice. The transparency builds trust even when the incidents themselves erode it. &lt;strong&gt;Companies that publish honest postmortems attract and retain engineers who want to learn from failures&lt;/strong&gt;, and they establish accountability mechanisms that internal-only postmortems don't create. The public commitment to fixing staged rollouts after the December 2025 outage has an accountability dimension that an internal action item does not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cloudflare's Scale Makes the Problem Harder&lt;/p&gt;

&lt;p&gt;Staged configuration rollout at Cloudflare's scale (300+ PoPs, millions of configuration updates per year, microsecond-sensitive security decisions) is genuinely difficult infrastructure engineering. The problem is not that Cloudflare doesn't know how to build staged rollouts — they already do this for software deployments. The problem is &lt;strong&gt;retrofitting staged rollout semantics onto a configuration distribution system that was designed for a different set of requirements&lt;/strong&gt; (fast propagation, consistency, global reach) without disrupting the security use cases that depend on that speed.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;p&gt;The React security fix outage is the third chapter in a two-year story about the cost of not completing a known critical infrastructure fix. The lessons are organizational as much as technical.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;01.&lt;/strong&gt;  &lt;strong&gt;A postmortem action item that isn't implemented before the next incident becomes evidence.&lt;/strong&gt; The staged rollout fix was identified in November 2023. The incidents that followed demonstrated its absence, and each one could have been prevented or contained had the fix been implemented. Organizations that deprioritize critical postmortem action items pay the price in the form of the next incident.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;02.&lt;/strong&gt; &lt;em&gt;Killswitches&lt;/em&gt; (configuration flags that disable functionality globally) are configuration changes and must be treated with the same safety rigor. A killswitch that propagates globally and instantly, without validation and health gating, is a global instant configuration change. Apply staged rollout requirements to all configuration changes — including disables, removes, and shutdowns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;03.&lt;/strong&gt;  &lt;strong&gt;Security patches create deployment urgency that can override normal safety practices.&lt;/strong&gt; CVE patches are time-sensitive, creating pressure to deploy quickly. Build explicit processes for security patching that maintain urgency while preserving safety gates — staged deployment with fast canary windows is both fast and safe compared to instant global deployment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;04.&lt;/strong&gt; Postmortem action items need &lt;strong&gt;named owners, resource allocation, and deadline commitment&lt;/strong&gt; — not just backlog entries. The difference between 'we identified the need for staged rollouts' and 'engineer X owns staged rollouts with Y engineers and a Q1 deadline' is the difference between an action item that ages and one that gets done.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;05.&lt;/strong&gt; Repeated incidents with the same root cause are not evidence that the fix is impossible — they are evidence that the fix is &lt;strong&gt;insufficiently prioritized&lt;/strong&gt;. Three configuration-related global outages is a forcing function for resource allocation. If the first incident's postmortem doesn't unlock the resources to fix the root cause, count on needing either the second or third incident to do it.&lt;/li&gt;
&lt;/ol&gt;
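
&lt;p&gt;Lesson 02 can be made concrete with a small sketch. The Python below is a minimal illustration of a health-gated staged rollout that treats a killswitch like any other configuration change. It is not Cloudflare's implementation: the stage fractions and the &lt;code&gt;apply_to&lt;/code&gt;, &lt;code&gt;revert_on&lt;/code&gt;, and &lt;code&gt;is_healthy&lt;/code&gt; callables are hypothetical names chosen for the example.&lt;/p&gt;

```python
from typing import Callable, Sequence

# Hypothetical stage sizes: fraction of locations that have the change after each stage.
STAGES = (0.01, 0.10, 0.50, 1.00)


def staged_rollout(
    locations: Sequence[str],
    apply_to: Callable[[str], None],
    revert_on: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stages: Sequence[float] = STAGES,
) -> bool:
    """Apply a config change (including a killswitch) stage by stage.

    Returns True if every stage passed its health gate, False if the
    rollout was halted and reverted.
    """
    done = 0  # number of locations that already have the change
    for fraction in stages:
        target = max(done, round(len(locations) * fraction))
        for loc in locations[done:target]:
            apply_to(loc)
        done = target
        # Health gate: check every location changed so far before widening.
        if not all(is_healthy(loc) for loc in locations[:done]):
            for loc in locations[:done]:
                revert_on(loc)
            return False
    return True
```

&lt;p&gt;With 100 locations and a fault that only shows up at the sixth one, this sketch stops after touching 10 locations and reverts them, instead of reaching all 100 at once.&lt;/p&gt;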

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;THE TRANSPARENCY COMPOUNDING EFFECT&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cloudflare's pattern of same-day postmortem publication for major incidents has created a compounding transparency dividend: each postmortem increases customer trust, each public commitment creates accountability, each incident with the same root cause raises the organizational urgency. &lt;strong&gt;The third outage with the same root cause forced a resource and timeline commitment that the first and second outages hadn't achieved&lt;/strong&gt;. Transparency accelerates accountability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Testing Infrastructure for Operational Safety Changes&lt;/p&gt;

&lt;p&gt;The React CVE fix that started this chain of events was a security response — the right thing to do. But deploying it through a testing tool that hadn't been validated for that specific change created the downstream error. &lt;strong&gt;Operational safety infrastructure (testing tools, killswitches, monitoring systems) needs the same testing rigor as application code&lt;/strong&gt;. When safety infrastructure fails, it often does so during incidents — exactly the moment it's needed most.&lt;/p&gt;

&lt;p&gt;Cloudflare fixed a React security vulnerability and accidentally broke the global internet, which is both very on-brand for React and a reminder that security patches are just change management with higher stakes.&lt;br&gt;&lt;br&gt;
&lt;cite&gt;TechLogStack — built at scale, broken in public, rebuilt by engineers&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://techlogstack.com/explore/cloudflare-react-config-outage-2025/" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;&lt;/strong&gt; (interactive diagrams, source links, and the full reader experience).&lt;/p&gt;

</description>
      <category>reliability</category>
      <category>cloudflare</category>
      <category>security</category>
      <category>react</category>
    </item>
    <item>
      <title>Shopify Sharded a Rails Database With Vitess and the App Never Knew It Happened</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Sun, 17 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/shopify-sharded-a-rails-database-with-vitess-and-the-app-never-knew-it-happened-4ff9</link>
      <guid>https://dev.to/techlogstack/shopify-sharded-a-rails-database-with-vitess-and-the-app-never-knew-it-happened-4ff9</guid>
      <description>&lt;p&gt;&lt;strong&gt;Shopify&lt;/strong&gt; · Databases · 17 May 2026&lt;/p&gt;

&lt;p&gt;The Shop app was growing exponentially. Its single MySQL database was approaching vertical scaling limits. Shopify needed horizontal sharding — but they had a Rails monolith that expected a single database, and a commerce platform used by millions daily that could not afford downtime during the migration.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;KateSQL → Vitess migration&lt;/li&gt;
&lt;li&gt;user_id as sharding key&lt;/li&gt;
&lt;li&gt;VTGate transparent to app&lt;/li&gt;
&lt;li&gt;Dynamic connection switcher&lt;/li&gt;
&lt;li&gt;Zero downtime cutover&lt;/li&gt;
&lt;li&gt;Jan 2024 blog published&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;p&gt;Shopify launched the Shop app in April 2020, giving consumers a personalized browsing and checkout experience across Shopify's merchant network. By 2023, the Shop app had achieved remarkable growth — and its backend database was approaching the scaling ceiling that every fast-growing application eventually hits. The database powering the Shop backend, running on Shopify's internal managed MySQL system called &lt;em&gt;KateSQL&lt;/em&gt;, was a single MySQL instance. Single-instance databases have a hard vertical limit: no matter how much you upgrade the hardware, there's a maximum amount of data and queries per second one machine can handle. &lt;strong&gt;Horizontal sharding was the only path forward&lt;/strong&gt;, and Shopify's team chose &lt;em&gt;Vitess&lt;/em&gt; (an open-source MySQL scaling system developed at YouTube that adds horizontal sharding, connection pooling, and query routing on top of standard MySQL) to execute it.&lt;/p&gt;

&lt;p&gt;Vitess has a deceptively clean architecture at the application level: applications connect to &lt;em&gt;VTGate&lt;/em&gt; (Vitess's query routing proxy — a stateless service that accepts MySQL connections from applications, parses queries, and routes them to the correct shard based on the query's sharding key) as if it were a regular MySQL server. VTGate speaks the MySQL wire protocol, so applications need only update their database connection string. Queries are then routed by VTGate to the appropriate &lt;em&gt;VTTablet&lt;/em&gt; (a Vitess process that runs alongside each MySQL instance and manages the connection pool, health checks, and query execution for that shard), which communicates directly with the underlying MySQL process. &lt;strong&gt;From the application's perspective, there is one database. From the infrastructure's perspective, there are many.&lt;/strong&gt; This transparency is what makes Vitess viable for a Rails monolith like Shopify's — the application code doesn't change, only the database topology.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;🔑&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Shopify chose &lt;strong&gt;user_id&lt;/strong&gt; as the sharding key for the Shop app's user-owned data. Almost all tables in the database are associated with a user, so user_id was a natural choice — it distributes data evenly, ensures all of a user's data lives on the same shard, and keeps user-scoped queries on a single shard without cross-shard joins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;THE VITESSIFYING PHASE&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Shopify coined the term &lt;strong&gt;'Vitessifying'&lt;/strong&gt; for the process of transforming an existing MySQL database into a Vitess keyspace without immediately sharding. In this first phase, a VTTablet is added alongside each MySQL process, and the application is reconfigured to connect through VTGate — but all data still lives on a single shard. This allows the team to validate Vitess integration, test VTGate routing, and gain operational familiarity with Vitess before making the more complex sharding changes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Single-Instance Database Approaching Its Ceiling
&lt;/h4&gt;

&lt;p&gt;The Shop app's backend was scaling rapidly but its database was a single MySQL instance. Vertical scaling had diminishing returns and a hard ceiling. The engineering team needed horizontal sharding to support continued growth without database-induced bottlenecks.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Rails Monolith Expected One Database
&lt;/h4&gt;

&lt;p&gt;Shopify's Shop backend was a Rails application that, like most Rails apps, expected a single primary database connection. Introducing sharding without a transparent proxy would require extensive application-level changes to route queries to the correct shard — a significant refactoring risk. The alternative was a transparent proxy that handled sharding invisibly.&lt;/p&gt;




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Vitess + Dynamic Connection Switcher
&lt;/h4&gt;

&lt;p&gt;The migration proceeded in phases: first Vitessifying (adding VTTablet and VTGate without sharding), then adding application-layer VTGate connectivity, then splitting tables into the users and global keyspaces, then horizontally sharding the users keyspace by user_id. A dynamic connection switcher allowed gradual traffic migration from the old system to VTGate, with the percentage adjustable without a deploy.&lt;/p&gt;
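
&lt;p&gt;A percentage-based switcher of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Shopify's code: the &lt;code&gt;ConnectionSwitcher&lt;/code&gt; class, its method names, the &lt;code&gt;"vtgate"&lt;/code&gt; and &lt;code&gt;"legacy"&lt;/code&gt; labels, and the choice to bucket requests by a CRC32 of a request key are all hypothetical.&lt;/p&gt;

```python
import threading
import zlib


class ConnectionSwitcher:
    """Routes a configurable percentage of requests to VTGate.

    Hypothetical sketch: the class, method names, and backend labels are
    illustrative and are not taken from Shopify's code.
    """

    def __init__(self, vtgate_percent: int = 0) -> None:
        self._lock = threading.Lock()
        self._percent = 0
        self.set_percent(vtgate_percent)

    def set_percent(self, percent: int) -> None:
        # Adjustable at runtime, so a ramp (or a rollback to 0) needs no deploy.
        if percent not in range(0, 101):
            raise ValueError("percent must be between 0 and 100")
        with self._lock:
            self._percent = percent

    def backend_for(self, request_key: str) -> str:
        # Stable bucket in 0..99, so the same key keeps the same backend
        # until the percentage changes.
        bucket = zlib.crc32(request_key.encode("utf-8")) % 100
        with self._lock:
            percent = self._percent
        return "vtgate" if percent > bucket else "legacy"
```

&lt;p&gt;Deterministic bucketing is a design choice of this sketch: it keeps a given key on one backend while the percentage is fixed, which makes before-and-after comparisons easier than random sampling would.&lt;/p&gt;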




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Horizontally Scalable, App Unchanged
&lt;/h4&gt;

&lt;p&gt;The Shop app backend gained horizontal scalability via Vitess sharding without requiring the application to understand sharding. The connection string changed; the application code did not. Shopify can now add shards as the Shop app continues to grow without additional application-level changes.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Auto-Increment Problem in Sharded Systems&lt;/p&gt;

&lt;p&gt;Rails applications default to using &lt;strong&gt;auto-incrementing integer primary keys&lt;/strong&gt; — a database feature that generates unique IDs by incrementing a counter. In a sharded system, multiple shards generating auto-increment IDs independently would produce duplicate IDs across shards. Vitess solves this with a &lt;strong&gt;Sequences table&lt;/strong&gt; in an unsharded keyspace: VTTablets cache blocks of IDs from the Sequences table and distribute them, ensuring globally unique IDs across all shards. The cache size of 1000 IDs per VTTablet reduces the per-ID write overhead while maintaining uniqueness.&lt;/p&gt;
&lt;/blockquote&gt;
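
&lt;p&gt;The block-caching idea behind sequences can be sketched as follows. This is a simplified Python model, not Vitess's implementation: &lt;code&gt;SequenceTable&lt;/code&gt; and &lt;code&gt;CachedSequence&lt;/code&gt; are hypothetical names, and where exactly the cache lives in a real Vitess deployment is not modelled here.&lt;/p&gt;

```python
class SequenceTable:
    """Stand-in for the Sequences table in the unsharded keyspace."""

    def __init__(self) -> None:
        self.next_id = 1
        self.reservations = 0  # how many times a block was reserved

    def reserve_block(self, size: int) -> range:
        # One write to the sequence source hands out `size` IDs at once.
        start = self.next_id
        self.next_id += size
        self.reservations += 1
        return range(start, start + size)


class CachedSequence:
    """Cache that hands out IDs from a reserved block (hypothetical helper)."""

    def __init__(self, table: SequenceTable, cache_size: int = 1000) -> None:
        self._table = table
        self._cache_size = cache_size
        self._block = iter(())  # empty until the first request

    def next_id(self) -> int:
        try:
            return next(self._block)
        except StopIteration:
            # Cache exhausted: reserve a fresh block, then serve from it.
            self._block = iter(self._table.reserve_block(self._cache_size))
            return next(self._block)
```

&lt;p&gt;Two caches sharing one table never hand out the same ID, because each reserved block is disjoint. If a cache is discarded before its block is used up, the remaining IDs are simply skipped, which is where the gaps after a restart come from.&lt;/p&gt;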

&lt;p&gt;The schema migration challenge was particularly subtle. Schema migrations are &lt;em&gt;DDL&lt;/em&gt; operations (Data Definition Language: SQL statements like ALTER TABLE that change database structure rather than data). On a sharded Vitess cluster, every shard must apply and complete the migration before the Rails application can rely on the new table schema. If the migration completes on some shards but not others, a Rails query checking the schema might get an inconsistent view — triggering a dump of a potentially incorrect schema. Shopify's solution was to track each migration across all shards and trigger the schema dump only after every shard confirmed completion. This required custom Rails tooling to coordinate with Vitess's sharding topology.&lt;/p&gt;
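
&lt;p&gt;The gating rule described above reduces to "dump only when every shard reports completion". A minimal Python sketch of that check follows. The &lt;code&gt;MigrationState&lt;/code&gt; enum and both function names are hypothetical, and Shopify's actual tooling is Rails-based and not shown in the source material.&lt;/p&gt;

```python
from enum import Enum


class MigrationState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETE = "complete"
    FAILED = "failed"


def ready_for_schema_dump(shard_states: dict) -> bool:
    """True only when every shard has reported the migration complete."""
    if not shard_states:
        return False  # no reports yet: treat as not ready
    return all(state is MigrationState.COMPLETE for state in shard_states.values())


def lagging_shards(shard_states: dict) -> list:
    """Shards that still block the schema dump, sorted for stable output."""
    return sorted(
        shard for shard, state in shard_states.items()
        if state is not MigrationState.COMPLETE
    )
```

&lt;p&gt;For example, with states &lt;code&gt;{"-80": COMPLETE, "80-": RUNNING}&lt;/code&gt; the dump stays blocked and &lt;code&gt;lagging_shards&lt;/code&gt; names &lt;code&gt;"80-"&lt;/code&gt; as the shard to wait for.&lt;/p&gt;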

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;🧩&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Two Keyspaces: Users and Global&lt;/p&gt;

&lt;p&gt;Shopify split the Shop app data into two &lt;em&gt;keyspaces&lt;/em&gt; (a Vitess concept for a logical database that can span one or more shards): a &lt;strong&gt;sharded 'users' keyspace&lt;/strong&gt; containing all user-owned tables (sharded by user_id), and an &lt;strong&gt;unsharded 'global' keyspace&lt;/strong&gt; for data that doesn't belong to individual users and must be accessed without a sharding key. This two-keyspace architecture is the standard pattern for Vitess migrations: shard what scales with users, keep globally-accessed lookup data unsharded.&lt;/p&gt;

&lt;p&gt;Vitessifying is our internal terminology for the process of transforming an existing MySQL into a keyspace in a Vitess cluster. This allows us to start using core Vitess functionality without explicitly moving data.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;— Shopify Engineering — via 'Horizontally scaling the Rails backend of Shop app with Vitess'&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dynamic Connection Switching: Gradual Traffic Migration&lt;/p&gt;

&lt;p&gt;Rather than a hard cutover from KateSQL to VTGate, Shopify built a &lt;strong&gt;dynamic connection switcher&lt;/strong&gt; that allowed them to gradually route increasing percentages of traffic through VTGate while monitoring for performance differences. Starting at a small percentage and slowly ramping to 100% gave the team confidence in VTGate's behavior under real production load before fully committing. The percentage was adjustable at runtime without a code deploy — giving operators immediate control during the migration window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Shopify's First Vitess Production Deployment&lt;/p&gt;

&lt;p&gt;The Shop app backend migration was Shopify's &lt;strong&gt;first deployment of Vitess in production&lt;/strong&gt;. This wasn't just a database migration — it was building organizational competency with a new database infrastructure layer from scratch. The team had to learn Vitess's operational model, its failure modes, its monitoring requirements, and its configuration nuances simultaneously with executing a live migration. Phasing the migration was in part a strategy to build this knowledge incrementally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cross-Shard Queries: The Scatter-Gather Problem&lt;/p&gt;

&lt;p&gt;When a query cannot be routed to a single shard — because it lacks a sharding key or spans multiple shards — Vitess performs a &lt;strong&gt;scatter-gather operation&lt;/strong&gt;: it sends the query to all shards and aggregates the results. Scatter-gather is more expensive than single-shard queries. Shopify's engineering team reviewed the Shop app's query patterns to identify scatter queries and either added sharding keys to make them single-shard or moved the data they accessed into the global keyspace. Unhandled scatter queries can become performance bottlenecks at scale.&lt;/p&gt;
&lt;/blockquote&gt;
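
&lt;p&gt;The difference between a single-shard route and a scatter-gather can be shown with a toy router. The Python below is a sketch under assumptions: the four-shard layout, the function names, and the use of SHA-256 as the hash are hypothetical (Vitess's own hash vindex uses a different function), and real VTGate planning handles far more than equality filters.&lt;/p&gt;

```python
import hashlib

SHARDS = ("-40", "40-80", "80-c0", "c0-")  # hypothetical four-shard layout


def shard_for(user_id: int, shards=SHARDS) -> str:
    # Stand-in hash: Vitess's own hash vindex uses a different function.
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    # The first byte (0..255) picks an equal-width key range.
    return shards[digest[0] * len(shards) // 256]


def plan(filters: dict, shards=SHARDS) -> list:
    """Return the shards a query must visit, given its equality filters."""
    if "user_id" in filters:
        return [shard_for(filters["user_id"], shards)]  # single-shard route
    return list(shards)  # no sharding key: scatter to every shard


def scatter_gather(filters: dict, run_on_shard, shards=SHARDS) -> list:
    """Run the query on each planned shard and merge the rows."""
    rows = []
    for shard in plan(filters, shards):
        rows.extend(run_on_shard(shard, filters))
    return rows
```

&lt;p&gt;A filter that includes &lt;code&gt;user_id&lt;/code&gt; touches one shard; a filter without it touches all four, so its cost grows with the shard count. That is why the team either added the sharding key to such queries or moved the data to the global keyspace.&lt;/p&gt;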




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Vitess Migration Playbook: Four Phases
&lt;/h3&gt;

&lt;p&gt;Shopify's Vitess migration was carefully sequenced into phases that minimized risk at each step. Phase 1 (Vitessifying) validated the Vitess stack without sharding. Phase 2 (dual connectivity) validated that the application could talk to VTGate alongside the existing system. Phase 3 (keyspace splitting) separated tables into users and global keyspaces. Phase 4 (sharding) performed the actual horizontal split of the users keyspace by user_id. Each phase produced a stable, production-validated state before the next phase began — the classic incremental risk management strategy.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;4 phases&lt;/strong&gt; — Migration phases: Vitessify → dual connectivity → keyspace split → horizontal shard — each independently production-validated before proceeding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;user_id&lt;/strong&gt; — Sharding key — ensures all data for a user lives on the same shard, making user-scoped queries single-shard with no cross-shard joins for most operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0 app changes&lt;/strong&gt; — Application code changes required to complete the sharding — VTGate's MySQL protocol compatibility meant only the connection string changed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1000 IDs&lt;/strong&gt; — VTTablet sequence cache size — each shard pre-fetches 1000 globally-unique IDs from the Sequences table to avoid per-insert writes to the sequence source
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;--&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Simplified&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Vitess&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;VSchema&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Shopify's&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;two-keyspace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;architecture&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;--&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;VSchema&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;tells&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;VTGate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;how&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;route&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;queries&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;shards&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;--&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;USERS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;keyspace:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;sharded&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;user_id&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;--&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;All&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;user-owned&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;tables&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;have&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;user_id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Primary&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;VIndex&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(shard&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;key)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sharded"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"vindexes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hash"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;--&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;hash&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;user_id&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tables"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"orders"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"columnVindexes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"column"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hash"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;--&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;shard&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;user_id&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;--&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Vitess&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Sequence&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;globally&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;unique&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;primary&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;key&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"autoIncrement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"column"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"sequence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GLOBAL_KEYSPACE.orders_seq"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;--&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;lives&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;unsharded&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;global&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;keyspace&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"user_preferences"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"columnVindexes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"column"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hash"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;--&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;GLOBAL&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;keyspace:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;unsharded&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(no&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;user_id)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;--&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Merchant&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;data,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;category&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;data,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;other&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;cross-user&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;lookups&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sharded"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tables"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"merchants"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;--&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;accessed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;without&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;sharding&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;key&lt;/span&gt;&lt;span class="w"&gt;
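    &lt;/span&gt;&lt;span class="nl"&gt;"categories"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"orders_seq"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sequence"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;--&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;sequence&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;backing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;orders.id&lt;/span&gt;&lt;span class="w"&gt;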
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Schema Migrations Across Multiple Shards&lt;/p&gt;

&lt;p&gt;Running &lt;code&gt;ALTER TABLE&lt;/code&gt; on a sharded Vitess cluster requires coordination: the DDL must be applied to all shards, and the application must not attempt to query the new schema until all shards have confirmed completion. Shopify built tooling to track migration status across all shards and only allow the Rails schema dump (used to verify the schema is as expected) after all shards reported completion. Without this coordination, a Rails schema check on a partially-migrated cluster could return an inconsistent view.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;THE SHOPIFY FIRST: VITESS IN PRODUCTION&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Shop app backend was &lt;strong&gt;Shopify's first production deployment of Vitess&lt;/strong&gt;. This meant the team was building operational knowledge from scratch — learning Vitess's failure modes, monitoring requirements, and operational procedures while also executing a live migration. The careful phasing of the migration (Vitessify first, shard second) was in part a strategy to build this operational experience incrementally rather than learning all of Vitess's complexity at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;VTGate: MySQL Protocol Transparency&lt;/p&gt;

&lt;p&gt;VTGate's most valuable property for application developers is that it speaks the &lt;strong&gt;standard MySQL wire protocol&lt;/strong&gt;. Any MySQL client — including ActiveRecord, the ORM that powers Rails — can connect to VTGate without modification. From the application's perspective, VTGate is just another MySQL server. The sharding logic, the shard topology, the cross-shard routing — all invisible to the application layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ProxySQL to VTGate: The Connection String Change&lt;/p&gt;

&lt;p&gt;The Shop app had previously been using &lt;strong&gt;ProxySQL&lt;/strong&gt; as its database proxy — a standard approach for MySQL connection pooling and query routing. Replacing ProxySQL with VTGate was the connection-layer change that made Vitess integration possible. From the application's perspective, both ProxySQL and VTGate speak the MySQL wire protocol; the change was transparent to Rails. The dual connectivity phase let the team validate VTGate behavior alongside ProxySQL before fully committing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VITESS RESOURCE ALLOCATION&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One operational detail that surprised the Shopify team: &lt;strong&gt;VTTablet requires significant resource allocation&lt;/strong&gt;. Vitess's own rule of thumb is to allocate as many CPUs to VTTablet as to the mysqld process it runs alongside. Memory consumption for VTTablet is generally low, but CPU requirements are substantial — VTTablet handles connection pooling, health checking, query execution, and replication management. Underprovisioning VTTablet creates a bottleneck in the query path that can limit the effective throughput of the underlying MySQL instance.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Vitess's architecture introduces two new components between the application and MySQL: VTGate (the stateless query router, deployed as multiple replicas for high availability) and VTTablet (a sidecar process running alongside each MySQL instance). The application connects to VTGate using a standard MySQL connection. VTGate consults the &lt;em&gt;VSchema&lt;/em&gt; (Vitess Schema — a configuration document that describes how keyspaces and shards are organized and which columns are used as sharding keys) to determine which shard a query should target, then forwards it to the appropriate VTTablet. The MySQL instances themselves are unchanged — they continue running as standard MySQL servers with replication configured for high availability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vitess Architecture: Rails App → VTGate → Sharded MySQL
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/shopify-vitess-horizontal-scale-2024/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Migration Phases: From KateSQL to Sharded Vitess
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/shopify-vitess-horizontal-scale-2024/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;CONNECTION POOLING: AN OFTEN-OVERLOOKED VITESS BENEFIT&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Beyond sharding, VTTablet provides &lt;strong&gt;connection pooling at the database level&lt;/strong&gt;. A Rails application with 100 Puma worker threads might open 100 MySQL connections — and 100 application instances might open 10,000. VTTablet multiplexes these connections to a much smaller pool against the actual MySQL process. At Shopify's scale, this connection efficiency is a meaningful resource saving in addition to the sharding capability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sequence Caching: Trading Latency for Throughput&lt;/p&gt;

&lt;p&gt;VTTablet's sequence ID caching (set at 1000 in Shopify's production config) is a throughput-versus-latency tradeoff. &lt;strong&gt;Without caching&lt;/strong&gt;, every INSERT requires a roundtrip to the Sequences table in the global keyspace to get the next ID, adding latency to every write. &lt;strong&gt;With a cache of 1000 IDs&lt;/strong&gt;, 999 out of every 1000 INSERTs get their ID from the local cache instantly, and only every 1000th INSERT requires a roundtrip. After a server restart the sequence has gaps (cached-but-unused IDs are lost), but IDs remain globally unique.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;VTOrc: Automated Topology Management&lt;/p&gt;

&lt;p&gt;In a sharded Vitess cluster, manually managing primary/replica failover across dozens of shards would be operationally prohibitive. &lt;em&gt;VTOrc&lt;/em&gt; (Vitess Orchestrator, an automated MySQL topology manager integrated into Vitess that detects primary failures and promotes replicas) handles this. When a shard's primary fails, VTOrc promotes the best available replica and updates VTGate's routing table, keeping the cluster available without human intervention.&lt;/p&gt;
&lt;/blockquote&gt;
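
&lt;p&gt;The sequence caching tradeoff above can be modelled as a block allocator. This is a minimal sketch under stated assumptions, not VTTablet's implementation: &lt;code&gt;SequenceCache&lt;/code&gt;, &lt;code&gt;reserve_block&lt;/code&gt;, &lt;code&gt;next_id&lt;/code&gt; and &lt;code&gt;roundtrips_for&lt;/code&gt; are hypothetical names, and the global Sequences table is reduced to a single counter passed around by value.&lt;/p&gt;

```rust
// Illustrative model of block-based sequence caching (not VTTablet's code).
// All names here are hypothetical.

#[derive(Clone, Copy, Debug, PartialEq)]
struct SequenceCache {
    next: u64,       // next ID to hand out locally
    limit: u64,      // first ID that is NOT covered by the cached block
    roundtrips: u64, // how many times the global counter was consulted
}

// Stand-in for the global Sequences table: reserves `block` IDs starting at
// `global_next` and returns the advanced global counter plus a fresh cache.
fn reserve_block(global_next: u64, block: u64, roundtrips: u64) -> (u64, SequenceCache) {
    let cache = SequenceCache {
        next: global_next,
        limit: global_next + block,
        roundtrips: roundtrips + 1,
    };
    (global_next + block, cache)
}

// Returns (new global counter, new cache, allocated ID). Only an exhausted
// cache triggers a (simulated) roundtrip to the global counter.
fn next_id(global_next: u64, cache: SequenceCache, block: u64) -> (u64, SequenceCache, u64) {
    let (global_next, cache) = if cache.next == cache.limit {
        reserve_block(global_next, block, cache.roundtrips)
    } else {
        (global_next, cache)
    };
    let id = cache.next;
    (global_next, SequenceCache { next: id + 1, ..cache }, id)
}

// Allocates `inserts` IDs from an empty cache and reports the roundtrips used.
fn roundtrips_for(inserts: u64, block: u64) -> u64 {
    let mut global_next = 1;
    let mut cache = SequenceCache { next: 0, limit: 0, roundtrips: 0 };
    for _ in 0..inserts {
        let (g, c, _id) = next_id(global_next, cache, block);
        global_next = g;
        cache = c;
    }
    cache.roundtrips
}
```

&lt;p&gt;In this model, 3000 inserts with a block size of 1000 consult the global counter three times, while a block size of 1 consults it on every insert. Starting again from an empty cache (the restart case) reserves a fresh block, which is where the gaps come from.&lt;/p&gt;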




&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;p&gt;Shopify's Vitess migration demonstrates that horizontal database sharding doesn't have to mean rewriting your application. With the right proxy architecture, the sharding lives in the infrastructure, invisible to the application layer.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;01.&lt;/strong&gt;  &lt;strong&gt;Vitessify before you shard.&lt;/strong&gt; Adding Vitess to an existing MySQL database without sharding (Vitessifying) is a safe, low-risk first step that validates the Vitess stack and builds operational knowledge before attempting the more complex sharding migration. Shopify's phased approach reflects this: get comfortable with Vitess on one shard before splitting into many.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;02.&lt;/strong&gt; Choose your &lt;em&gt;sharding key&lt;/em&gt; (the column value used to determine which shard a row belongs to — the most important architectural decision in horizontal sharding because it determines data locality and query routing) carefully and early. user_id was the right choice for Shopify's user-centric application: it distributes data evenly, keeps user data colocated on one shard, and makes user-scoped queries single-shard. A bad sharding key creates hot shards, cross-shard joins, and an architecture that fights itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;03.&lt;/strong&gt;  &lt;strong&gt;Auto-increment IDs break in sharded systems.&lt;/strong&gt; Every sharded application needs a strategy for globally unique IDs. Vitess Sequences, UUIDs, Snowflake IDs — the choice matters for performance, sortability, and debuggability. Don't discover this problem during your sharding migration; design for it before migration begins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;04.&lt;/strong&gt; Schema migrations on sharded clusters require explicit cross-shard coordination. &lt;strong&gt;Any tooling that inspects or depends on schema state must be sharding-aware.&lt;/strong&gt; Rails's schema dump, ActiveRecord migrations, and ORM schema introspection all need to understand that schema changes must be applied to all shards before the application can assume they've taken effect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;05.&lt;/strong&gt; A dynamic connection switcher that allows gradual traffic migration is &lt;strong&gt;the safety mechanism that makes production sharding migrations recoverable&lt;/strong&gt;. Being able to route 1% → 5% → 25% → 100% of traffic through the new system, with instant rollback by setting the percentage back to 0%, is the difference between a migration you can execute confidently and one that requires a maintenance window.&lt;/li&gt;
&lt;/ol&gt;
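
&lt;p&gt;The gradual routing in lesson 05 can be sketched with deterministic bucketing. The source does not describe how Shopify's switcher selects traffic, so everything here is an assumption for illustration: the &lt;code&gt;Backend&lt;/code&gt; enum, the &lt;code&gt;choose_backend&lt;/code&gt; and &lt;code&gt;vitess_share&lt;/code&gt; functions, and bucketing on a numeric request key modulo 100.&lt;/p&gt;

```rust
// Hypothetical sketch of a percentage-based connection switcher.
// The source does not describe how Shopify's switcher picks traffic;
// bucketing on a numeric request key is an assumption for illustration.

#[derive(Clone, Copy, Debug, PartialEq)]
enum Backend {
    LegacyMysql,
    Vitess,
}

// Routes a request to Vitess when its bucket (0..=99) falls below the
// configured rollout percentage. Setting `percent` to 0 is the rollback.
fn choose_backend(request_key: u64, percent: u64) -> Backend {
    let bucket = request_key % 100;
    if percent > bucket {
        Backend::Vitess
    } else {
        Backend::LegacyMysql
    }
}

// Counts how many of the keys 0..total would be routed to Vitess.
fn vitess_share(total: u64, percent: u64) -> u64 {
    (0..total)
        .filter(|key| choose_backend(*key, percent) == Backend::Vitess)
        .count() as u64
}
```

&lt;p&gt;Because the bucket is derived from the key rather than drawn at random, a given key stays on the same backend as long as the percentage is unchanged, and raising the percentage only ever moves keys toward Vitess. A production switcher would likely hash the key first so that sequential IDs spread evenly.&lt;/p&gt;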

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;VSchema Maintenance: An Ongoing Obligation&lt;/p&gt;

&lt;p&gt;The VSchema must be updated every time the database schema changes. A new table needs a VSchema entry defining its sharding key. A new index needs evaluation for VIndex configuration. &lt;strong&gt;Vitess amplifies the schema change process&lt;/strong&gt;: what was previously a single DDL operation now requires DDL plus a VSchema update, coordinated across all shards. Teams adopting Vitess need processes and tooling to ensure VSchema updates are not overlooked during schema migrations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VITESS AS SHOPIFY STANDARD&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Following the Shop app success, Shopify has been expanding Vitess adoption to other services. The first deployment built the organizational knowledge and tooling (custom Rails integration, dynamic connection switcher, cross-shard schema migration tooling) that makes subsequent deployments faster and safer. &lt;strong&gt;Infrastructure investments compound&lt;/strong&gt;: the second Vitess deployment benefits from all the work done during the first.&lt;/p&gt;

&lt;p&gt;Shopify added horizontal database sharding to a Rails app, and the app continued insisting there was only one database — which is either a beautiful abstraction or a comfortable lie, and honestly both.&lt;br&gt;&lt;br&gt;
&lt;cite&gt;TechLogStack — built at scale, broken in public, rebuilt by engineers&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://techlogstack.com/explore/shopify-vitess-horizontal-scale-2024/" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;&lt;/strong&gt; (interactive diagrams, source links, and the full reader experience).&lt;/p&gt;

</description>
      <category>database</category>
      <category>shopify</category>
      <category>analytics</category>
      <category>backend</category>
    </item>
    <item>
      <title>How Discord Migrated Trillions of Messages and Fired Their Garbage Collector</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Sun, 17 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/how-discord-migrated-trillions-of-messages-and-fired-their-garbage-collector-4818</link>
      <guid>https://dev.to/techlogstack/how-discord-migrated-trillions-of-messages-and-fired-their-garbage-collector-4818</guid>
      <description>&lt;p&gt;&lt;strong&gt;Discord&lt;/strong&gt; · Databases · 17 May 2026&lt;/p&gt;

&lt;p&gt;It is 2022 and Discord's on-call engineers are babysitting a 177-node database cluster, manually rebooting nodes after Java GC pauses spiral out of control. The system holding every message ever sent is becoming the thing everyone fears touching most.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;177 → 72 nodes&lt;/li&gt;
&lt;li&gt;p99 latency 15ms (was 40–125ms)&lt;/li&gt;
&lt;li&gt;9-day migration (was 3-month est.)&lt;/li&gt;
&lt;li&gt;3.2M records/sec migrated&lt;/li&gt;
&lt;li&gt;4T+ messages moved&lt;/li&gt;
&lt;li&gt;Zero user-visible downtime&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Our Cassandra cluster exhibited serious performance issues that required increasing amounts of effort to just maintain, not improve.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;— Bo Ingram, Senior Software Engineer, via Discord Engineering Blog&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Discord launched in 2015 with a mission to build the best voice and text chat platform for gamers. By 2017 they had outgrown MongoDB and migrated their entire message store to &lt;em&gt;Apache Cassandra&lt;/em&gt; (a distributed wide-column NoSQL database designed for high availability across many nodes without a single point of failure). Cassandra's promise was compelling: write anywhere, replicate everywhere, scale horizontally forever. For a few years it held. By 2022, however, the promises had curdled into a maintenance nightmare that consumed engineering cycles every single week. The database cluster had grown to &lt;strong&gt;177 nodes&lt;/strong&gt; holding &lt;strong&gt;trillions of messages&lt;/strong&gt;, and keeping it alive required the kind of expertise and vigilance that should be reserved for nuclear reactor operators, not chat app engineers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;🔥&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At peak, Discord's Cassandra cluster required engineers to manually reboot individual nodes after &lt;em&gt;JVM GC pauses&lt;/em&gt; (Java Virtual Machine garbage collection — periodic stop-the-world pauses where the JVM freezes all threads to reclaim memory) spiraled long enough to drop the node from the cluster. This was not a rare emergency — it was routine on-call work.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The core problem was architectural. Cassandra is written in Java, and Java's garbage collector periodically halts all threads in the JVM to reclaim heap memory — a moment engineers call a stop-the-world pause. Under Discord's workloads, these pauses could last long enough to cause cascading latency spikes visible to users, and in severe cases, the JVM's consecutive GC pauses got so bad that a node would effectively fall out of the cluster entirely. An on-call engineer would then have to manually reboot it and babysit it back to health. The &lt;strong&gt;p99 latency on historical message reads&lt;/strong&gt; ranged between &lt;strong&gt;40 and 125 milliseconds&lt;/strong&gt; depending on whether compaction was running — an unpredictability that made SLO planning impossible. Every time someone tried to improve the cluster rather than merely maintain it, they risked triggering a cascade.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Hot Partition Problem
&lt;/h3&gt;

&lt;p&gt;Discord's message data model organized messages by channel ID and a fixed time window called a &lt;em&gt;bucket&lt;/em&gt; (a fixed time slice, e.g. 10 days, used as part of the partition key so messages are spread across multiple Cassandra partitions rather than one per channel). This was efficient for write distribution and replication, but created a painful read problem. Cassandra performs writes cheaply by appending to a &lt;em&gt;commit log&lt;/em&gt; (a sequential on-disk journal where writes are recorded before being applied to the in-memory structure, enabling fast writes at the cost of read complexity) and updating an in-memory structure called a &lt;em&gt;memtable&lt;/em&gt; (an in-memory write buffer in Cassandra that is flushed to disk as SSTables when it fills up). Reads, however, must query the memtable and potentially multiple &lt;em&gt;SSTables&lt;/em&gt; (Sorted String Tables: immutable on-disk files in Cassandra that hold flushed memtable data, which must all be merged on read to reconstruct the current value), a dramatically more expensive operation. When a popular Discord server made a major announcement and thousands of users simultaneously opened their apps to read it, every single one of those reads would hammer the same partition. The team called these &lt;strong&gt;hot partitions&lt;/strong&gt;, and they were Discord's most common and painful operational incident.&lt;/p&gt;
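
&lt;p&gt;The partitioning scheme described above can be sketched as a function from a message to its partition key. This is a minimal illustration, not Discord's schema code: the 10-day window follows the example in the text, while the names &lt;code&gt;BUCKET_MS&lt;/code&gt;, &lt;code&gt;bucket_for&lt;/code&gt; and &lt;code&gt;partition_key&lt;/code&gt;, and the use of millisecond timestamps, are assumptions.&lt;/p&gt;

```rust
// Hypothetical sketch of a (channel_id, bucket) partition key.
// The 10-day window follows the example in the text; the names and the use
// of millisecond timestamps are assumptions for illustration.

const BUCKET_MS: u64 = 10 * 24 * 60 * 60 * 1000; // 10 days in milliseconds

// Maps a message timestamp to its fixed time bucket.
fn bucket_for(timestamp_ms: u64) -> u64 {
    timestamp_ms / BUCKET_MS
}

// Builds the partition key: every message in the same channel and the same
// 10-day window lands in the same partition.
fn partition_key(channel_id: u64, timestamp_ms: u64) -> (u64, u64) {
    (channel_id, bucket_for(timestamp_ms))
}
```

&lt;p&gt;Writes spread across channels and windows, but every read of a fresh announcement in one channel resolves to the same key, which is exactly the hot-partition pattern.&lt;/p&gt;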

&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  The Maintenance Spiral
&lt;/h4&gt;

&lt;p&gt;By early 2022, Discord's on-call rotation was spending more time nursing Cassandra than building features. GC pause alerts fired multiple times a week, and the p99 latency on reads ranged from 40ms to 125ms depending on whether compaction was running on the affected node — an unpredictability engineers had simply learned to live with.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  JVM GC + Hot Partition Physics
&lt;/h4&gt;

&lt;p&gt;The root cause split into two layers: &lt;em&gt;JVM garbage collection&lt;/em&gt; (Java's memory management system that periodically pauses all threads to reclaim heap memory — in large heaps, these pauses could last hundreds of milliseconds) on write-heavy nodes created latency cliffs, while Cassandra's read path — requiring merges across multiple SSTables — meant any popular channel partition would spike latency under concurrent user load. The combination made the cluster inherently unpredictable at scale.&lt;/p&gt;




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  ScyllaDB + Rust Data Services
&lt;/h4&gt;

&lt;p&gt;Discord chose ScyllaDB, a Cassandra-compatible database written in C++ with a &lt;em&gt;shard-per-core architecture&lt;/em&gt; (a design where each CPU core is assigned its own exclusive subset of data and handles requests independently, avoiding cross-core coordination and lock contention). They also built a Rust-based data services layer between the API and the database to absorb hot-partition spikes via request coalescing. The migration tool was rewritten in Rust and reached transfer speeds of up to 3.2 million records per second.&lt;/p&gt;




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  9 Days, 4 Trillion Messages, Zero Users Noticed
&lt;/h4&gt;

&lt;p&gt;The migration completed in nine days — the original estimate using ScyllaDB's Spark migrator had been three months. The cluster footprint shrank from 177 nodes to 72, each ScyllaDB node running with 9 TB of disk versus the average 4 TB on Cassandra. P99 latency for historical reads settled at a stable, predictable &lt;strong&gt;15 milliseconds&lt;/strong&gt;.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why cassandra-messages Was the Last to Move&lt;/p&gt;

&lt;p&gt;By 2020 Discord had migrated &lt;strong&gt;every other database&lt;/strong&gt; to ScyllaDB, leaving the messages cluster as the lone holdout. They deliberately left it for last because it was the most critical dataset: trillions of messages, nearly 200 nodes, and the one cluster whose failure would be immediately visible to every user. They used the other migrations to tune ScyllaDB for their access patterns first, including filing and waiting on performance improvements to ScyllaDB's reverse query support.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The Tombstone Trap at 99.9999%
&lt;/h3&gt;

&lt;p&gt;The migration nearly ended in drama rather than triumph. After running the Rust migrator for days at &lt;strong&gt;3.2 million records per second&lt;/strong&gt;, the progress bar hit 99.9999% — and stopped. The migrator was timing out trying to read the last few &lt;em&gt;token ranges&lt;/em&gt; (in Cassandra, data is distributed across the ring by assigning each partition a hash token, and a token range is a contiguous slice of that ring assigned to a node) because they contained gigantic ranges of &lt;em&gt;tombstones&lt;/em&gt; (deletion markers in Cassandra — when data is deleted, a tombstone is written instead of the row being removed, because immutable SSTables cannot be modified in-place; these tombstones must be read and skipped during every subsequent read until compaction removes them) that had never been compacted away. Engineers had to manually trigger compaction on that token range; seconds later, the migration hit 100%. Automated data validation confirmed correctness by sending a sample of reads to both databases and comparing results. Discord switched to ScyllaDB in May 2022.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;THE WORLD CUP TEST&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The real stress test came months after go-live: the 2022 FIFA World Cup Final between Argentina and France. Every goal by Messi, every equalizer by Mbappé, every moment in the shootout created a massive spike of simultaneous message reads across Discord's biggest servers. Under the old Cassandra architecture this would have triggered hot-partition alerts and cascading latency. Under ScyllaDB with the Rust data services layer, the monitoring dashboards showed nothing unusual. The system held flat through 120 minutes of football and a penalty shootout.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The database choice alone did not make ScyllaDB viable; the Rust data services layer was the architectural insight that did. When a popular server makes an announcement and &lt;strong&gt;thousands of users simultaneously open their clients&lt;/strong&gt;, all those read requests arrive at the data service within milliseconds of each other, all asking for the same messages in the same channel. Without coalescing, each request would hit the database separately, creating a hot partition. With &lt;em&gt;request coalescing&lt;/em&gt; (a pattern where the first incoming request for a piece of data triggers an active lookup, and all subsequent requests for the same data subscribe to that lookup's result rather than issuing their own query, reducing N database hits to 1), only one query goes to ScyllaDB; every subsequent request subscribes to the in-flight result and receives the answer when the single database query returns. The data services layer also used &lt;em&gt;consistent hashing&lt;/em&gt; (a ring-based routing scheme where each data service instance is responsible for a specific subset of channel IDs, ensuring all requests for a given channel are routed to the same service instance to maximize coalescing effectiveness) to route requests for the same channel to the same service instance, maximizing coalescing opportunity.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;177→72&lt;/strong&gt; — Cassandra nodes replaced by ScyllaDB nodes — a 59% reduction in cluster footprint while handling the same workload&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;15ms&lt;/strong&gt; — Stable p99 read latency on ScyllaDB, down from an unpredictable 40–125ms range on Cassandra depending on compaction status&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;9 days&lt;/strong&gt; — Total migration time for 4+ trillion messages — versus the original 3-month estimate with ScyllaDB's Spark migrator&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3.2M/s&lt;/strong&gt; — Peak migration throughput of the Rust-rewritten migrator, unlocking a single-flip cutover instead of a complex time-based phased approach&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix had three distinct components, and Discord was deliberate about not rushing any of them. First, they spent years migrating every other database to ScyllaDB to build operational expertise before touching the one cluster that mattered most. Second, they collaborated with the ScyllaDB team to improve reverse query performance — a blocker they hit in early testing — and waited until that was production-grade before proceeding. Third, they &lt;strong&gt;built the Rust data services layer before starting the migration&lt;/strong&gt;, so the new database would go live already protected from hot-partition load patterns. This sequencing was the engineering discipline that made the migration look easy in retrospect.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Rust Migrator Rewrite
&lt;/h3&gt;

&lt;p&gt;The turning point in the migration timeline was a one-day engineering sprint. With ScyllaDB's off-the-shelf &lt;em&gt;Spark migrator&lt;/em&gt; (an Apache Spark-based tool provided by ScyllaDB for bulk data migration that reads token ranges from Cassandra and writes them to ScyllaDB), the estimate for moving the message data was three months: three months of dual-running two massive database clusters, three months of operational complexity, and three months of potential failure modes. Bo Ingram decided that was three months too long. He and two colleagues rewrote the migrator in Rust in a single day. The new migrator read token ranges from the source database, checkpointed them locally via SQLite for crash recovery, and fired them into ScyllaDB as fast as possible. The result: &lt;strong&gt;3.2 million records per second&lt;/strong&gt;. The new estimate was nine days, and the team chose a single-flip cutover instead of a phased, time-based approach.&lt;/p&gt;

&lt;p&gt;The request coalescing performed by the data services layer, described earlier, looks roughly like this in simplified form:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified version of Discord's request coalescing logic in the Rust data service&lt;/span&gt;
&lt;span class="c1"&gt;// Real implementation uses Tokio async runtime&lt;/span&gt;

&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;collections&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;tokio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;sync&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;CoalescingDataService&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Map from cache_key -&amp;gt; active broadcast sender&lt;/span&gt;
    &lt;span class="c1"&gt;// If a task is in flight, subscribers receive the result&lt;/span&gt;
    &lt;span class="n"&gt;in_flight&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;CoalescingDataService&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;get_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;channel_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;before_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Build a stable cache key for this exact query&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{}:{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;channel_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;before_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sender&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.in_flight&lt;/span&gt;&lt;span class="nf"&gt;.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// A query for this channel is already in flight&lt;/span&gt;
            &lt;span class="c1"&gt;// Subscribe and wait — NO second database round-trip&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;rx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sender&lt;/span&gt;&lt;span class="nf"&gt;.subscribe&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;vec!&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;rx&lt;/span&gt;&lt;span class="nf"&gt;.recv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt; &lt;span class="c1"&gt;// receive the shared result&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// No existing task — we are the first; create the broadcast channel&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_rx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.in_flight&lt;/span&gt;&lt;span class="nf"&gt;.insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

        &lt;span class="c1"&gt;// Execute the single database query to ScyllaDB&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.scylladb&lt;/span&gt;&lt;span class="nf"&gt;.query_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;channel_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;before_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;// Broadcast to ALL waiting subscribers at once&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="nf"&gt;.send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt; &lt;span class="c1"&gt;// every subscriber wakes up&lt;/span&gt;
        &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.in_flight&lt;/span&gt;&lt;span class="nf"&gt;.remove&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// clean up the in-flight tracker&lt;/span&gt;

        &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The SuperDisk: Custom Hardware for Cloud Durability&lt;/p&gt;

&lt;p&gt;ScyllaDB is optimized for &lt;em&gt;NVMe SSDs&lt;/em&gt; (Non-Volatile Memory Express solid-state drives, extremely fast local storage that dramatically reduces I/O latency), but in cloud environments local NVMe is ephemeral: its contents do not survive the loss or replacement of the underlying instance. Discord engineered a &lt;strong&gt;custom RAID 1 configuration&lt;/strong&gt; they called the SuperDisk: writes go to both fast local NVMe and slower persistent network-attached storage simultaneously; reads prefer the NVMe for speed. This gave them NVMe-level read performance with cloud-level data durability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Zstandard Compression: 50–60% Disk Reduction&lt;/p&gt;

&lt;p&gt;Alongside the database migration, Discord enabled &lt;strong&gt;Zstandard compression&lt;/strong&gt; on their ScyllaDB tables. Message data compresses extremely well. The result was a &lt;strong&gt;50–60% reduction&lt;/strong&gt; in raw disk usage compared to uncompressed Cassandra storage — effectively giving each physical node far more useful capacity at zero hardware cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;THE VALIDATION STRATEGY&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Discord ran automated correctness validation throughout the migration by sending a &lt;strong&gt;small percentage of reads to both databases simultaneously&lt;/strong&gt; and comparing results. Only when reads matched across Cassandra and ScyllaDB was a partition considered successfully migrated. This shadow-read approach caught data inconsistencies without any user-visible impact, and gave the team confidence to flip the cutover switch as a single atomic event rather than a long, hedged transition.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Cassandra vs ScyllaDB at Discord: Before and After Migration&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Cassandra (Before)&lt;/th&gt;
&lt;th&gt;ScyllaDB (After)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cluster Nodes&lt;/td&gt;
&lt;td&gt;177&lt;/td&gt;
&lt;td&gt;72&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk per Node (avg)&lt;/td&gt;
&lt;td&gt;4 TB&lt;/td&gt;
&lt;td&gt;9 TB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p99 Read Latency&lt;/td&gt;
&lt;td&gt;40–125ms (variable)&lt;/td&gt;
&lt;td&gt;~15ms (stable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GC Pauses&lt;/td&gt;
&lt;td&gt;Frequent stop-the-world&lt;/td&gt;
&lt;td&gt;None (C++, no GC)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hot Partition Risk&lt;/td&gt;
&lt;td&gt;High — no coalescing&lt;/td&gt;
&lt;td&gt;Mitigated by Rust data services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;On-Call Toil&lt;/td&gt;
&lt;td&gt;Weekly node babysitting&lt;/td&gt;
&lt;td&gt;Dramatically reduced&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Before the migration, Discord's message write and read path ran through a monolithic API server directly into the Cassandra cluster. There was no intermediary — every user action that read messages translated directly into a database query, with no protection against fan-out or &lt;em&gt;hot partition amplification&lt;/em&gt; (when many users simultaneously request data stored in the same database partition, causing that node to receive far more traffic than its neighbors, creating latency spikes and potential instability). The API server held connection pools to Cassandra, handled &lt;em&gt;CQL queries&lt;/em&gt; (Cassandra Query Language — a SQL-like interface for querying Cassandra) for message pagination, and relied on Cassandra's own internal mechanisms (memtable, SSTables, compaction) to handle read pressure. Under normal load this worked. Under peak load — a major announcement, a viral moment, a World Cup Final — it did not.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before: Direct API-to-Cassandra Architecture (Hot Partition Risk)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/discord-cassandra-scylladb-2022/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;SHARD-PER-CORE ARCHITECTURE&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The fundamental reason ScyllaDB handles concurrent reads so much better than Cassandra is its &lt;strong&gt;shard-per-core architecture&lt;/strong&gt;. Each CPU core is assigned its own exclusive slice of the data and handles all requests for that data without coordination with other cores. In Cassandra's JVM-based model, all threads compete for heap memory under a single garbage collector. In ScyllaDB's C++ model, &lt;strong&gt;each core is an independent actor&lt;/strong&gt;: no cross-core locking, no GC, no stop-the-world. When one partition gets hot, it affects only the core assigned to that shard; it cannot cascade to neighbors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consistent Hashing: Routing Channels to Service Instances&lt;/p&gt;

&lt;p&gt;Each Rust data service instance is responsible for a &lt;strong&gt;deterministic subset of channel IDs&lt;/strong&gt; via &lt;em&gt;consistent hashing&lt;/em&gt; (a routing scheme where each channel_id is mapped to a specific service instance using a hash ring, so all requests for channel #12345 always go to Data Service Instance B — maximizing the chance that an in-flight coalescing task for that channel already exists). This means if 1,000 users simultaneously load the same popular channel, all 1,000 requests arrive at the same service instance and collapse into one database query.&lt;/p&gt;
&lt;/blockquote&gt;
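
&lt;p&gt;The routing idea above can be sketched in a few lines. This is a minimal illustration, not Discord's implementation (their data services are written in Rust): the class name &lt;code&gt;HashRing&lt;/code&gt;, the instance names, and the choice of 64 virtual nodes per instance are all assumptions made for the example.&lt;/p&gt;

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring mapping channel IDs to service instances."""

    def __init__(self, instances, vnodes=64):
        # Place each instance on the ring at several points (virtual nodes)
        # so that channel IDs spread roughly evenly across instances.
        self._ring = sorted(
            (self._hash(f"{name}#{i}"), name)
            for name in instances
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        # SHA-256 is stable across processes, unlike Python's built-in hash().
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def route(self, channel_id):
        # Walk clockwise to the first ring point at or after the key's hash,
        # wrapping around to the start of the ring if needed.
        idx = bisect.bisect_left(self._points, self._hash(str(channel_id)))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["instance-a", "instance-b", "instance-c"])
owner = ring.route(12345)  # the same channel ID always yields the same owner
```

&lt;p&gt;The useful property is stability: if one instance leaves the ring, only the channels it owned are re-routed, so coalescing state for every other channel stays where it is.&lt;/p&gt;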

&lt;h3&gt;
  
  
  After: Rust Data Services + ScyllaDB Architecture (Hot Partition Mitigated)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/discord-cassandra-scylladb-2022/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;🦀&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why Rust for the Data Services Layer&lt;/p&gt;

&lt;p&gt;Discord chose Rust for data services because it offered C-level throughput with memory safety guarantees that prevent entire classes of concurrency bugs common in C++ — exactly what you want in a layer handling millions of concurrent subscribers. The Tokio async runtime gave them non-blocking I/O without the GC overhead that had plagued their Cassandra setup. As Bo Ingram noted with characteristic candor: it also let them say they &lt;strong&gt;rewrote it in Rust&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;p&gt;Discord's migration took years of preparation and nine days of execution. The long preparation was not waste — it was the reason the execution was clean. The lessons here are as much about sequencing and courage as they are about database choice.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;01.&lt;/strong&gt;  &lt;strong&gt;Migrate your riskiest system last, but don't use that as an excuse to never migrate it.&lt;/strong&gt; Discord deliberately kept the messages database in Cassandra for two years after migrating everything else, using that time to build ScyllaDB expertise on less critical workloads. However, they committed to a hard deadline once operational confidence was achieved — avoiding the trap of indefinite deferral that plagues many large migrations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;02.&lt;/strong&gt; &lt;em&gt;Request coalescing&lt;/em&gt; (combining multiple concurrent requests for identical data into a single database query, broadcasting the result to all waiters) is a force multiplier against hot partitions that no amount of database scaling alone can provide. When you have popular content that thousands of users read simultaneously, add a coalescing layer between your application and your database — the reduction in query fan-out is often more impactful than hardware upgrades.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;03.&lt;/strong&gt;  &lt;strong&gt;Rewrite your migration tooling if the estimated duration is unacceptable.&lt;/strong&gt; A three-month migration estimate is not a constraint — it's a scope definition that you can change. Discord's one-day Rust rewrite of the migrator turned a three-month project into nine days, enabling a simpler single-flip cutover instead of a complex phased approach. Always ask: what would it take to make this ten times faster?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;04.&lt;/strong&gt; &lt;em&gt;Stop-the-world GC pauses&lt;/em&gt; (periodic halts in JVM-based systems where all threads freeze while the garbage collector reclaims memory) are a predictable, structural problem in Java-based databases at high concurrency — not a tuning problem you can engineer your way out of at Discord's scale. When your on-call team spends more time maintaining a database than improving it, that's the signal to evaluate architecturally different alternatives, not just different JVM flags.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;05.&lt;/strong&gt;  &lt;strong&gt;Run shadow reads for data validation before any large-scale cutover.&lt;/strong&gt; Sending a percentage of reads to both old and new systems simultaneously — and comparing results automatically — gives you objective confidence that your migration is correct without user-visible risk. This pattern is applicable to any database migration and should be standard practice before any atomic cutover switch.&lt;/li&gt;
&lt;/ol&gt;
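
&lt;p&gt;Lesson 02 is easier to see in code. The following is a minimal sketch of request coalescing using Python's &lt;code&gt;asyncio&lt;/code&gt;, not Discord's Rust implementation. The names &lt;code&gt;Coalescer&lt;/code&gt;, &lt;code&gt;fake_query&lt;/code&gt;, and &lt;code&gt;demo&lt;/code&gt; are hypothetical, and the 10 ms sleep is a stand-in for a database round trip.&lt;/p&gt;

```python
import asyncio


class Coalescer:
    """Collapse concurrent requests for the same key into one backend call."""

    def __init__(self, fetch):
        self._fetch = fetch      # async function: key -> result
        self._in_flight = {}     # key -> asyncio.Task that is still running

    async def get(self, key):
        task = self._in_flight.get(key)
        if task is None:
            # The first caller for this key starts the query. Callers that
            # arrive before it finishes await the same task instead.
            task = asyncio.create_task(self._fetch(key))
            self._in_flight[key] = task
            task.add_done_callback(lambda _: self._in_flight.pop(key, None))
        return await task


async def demo():
    calls = 0

    async def fake_query(channel_id):
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # stand-in for a database round trip
        return f"messages for {channel_id}"

    coalescer = Coalescer(fake_query)
    # 1,000 concurrent readers of the same channel.
    results = await asyncio.gather(*(coalescer.get(12345) for _ in range(1000)))
    return calls, results
```

&lt;p&gt;Running &lt;code&gt;asyncio.run(demo())&lt;/code&gt; issues one backend query for all 1,000 readers. A production version would also need to decide what happens when one waiter is cancelled or the query fails, which this sketch does not handle.&lt;/p&gt;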

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The World Cup Validation&lt;/p&gt;

&lt;p&gt;The 2022 FIFA World Cup Final was Discord's unplanned load test, and the system passed cleanly. Every goal, every save, every penalty created message spikes across thousands of servers simultaneously. The combination of ScyllaDB's shard-per-core architecture and request coalescing in the Rust data services kept latency flat through all 120 minutes plus penalties. &lt;strong&gt;No hot partition alerts. No on-call pages. No post-match war rooms.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SHADOW READ VALIDATION&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Discord's validation strategy during migration was elegantly simple: send a &lt;strong&gt;small percentage of reads to both Cassandra and ScyllaDB simultaneously&lt;/strong&gt;, compare results automatically, and flag any discrepancy. This meant correctness was continuously verified during the nine days of data transfer rather than checked at the end in a tense manual review. Any database migration touching production data should implement this pattern before flipping the final switch.&lt;/p&gt;

&lt;p&gt;They migrated four trillion messages in nine days, and the most stressful moment was the progress bar stopping at 99.9999% — because even tombstones refuse to die quietly.&lt;br&gt;&lt;br&gt;
&lt;cite&gt;TechLogStack — built at scale, broken in public, rebuilt by engineers&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;
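
&lt;p&gt;The shadow-read pattern described above can be sketched as a thin wrapper around the read path. This is an illustrative sketch under stated assumptions, not Discord's tooling: the function name &lt;code&gt;shadow_read&lt;/code&gt;, its parameters, and the 1% default sample rate are all hypothetical. The old store remains the source of truth, so a failure or mismatch in the new store never reaches the user.&lt;/p&gt;

```python
import random


def shadow_read(key, read_old, read_new, on_mismatch,
                sample_rate=0.01, rng=random.random):
    """Serve from the old store; for a sample of reads, also check the new one."""
    result = read_old(key)
    if rng() >= sample_rate:
        # Not sampled: behave exactly like the pre-migration read path.
        return result
    try:
        candidate = read_new(key)
    except Exception as exc:
        # An error in the new store is reported, never surfaced to the caller.
        on_mismatch(key, result, exc)
        return result
    if candidate != result:
        on_mismatch(key, result, candidate)
    return result


# Usage with two in-memory dicts standing in for the old and new databases.
old_store = {"msg:1": "hello", "msg:2": "world"}
new_store = {"msg:1": "hello", "msg:2": "w0rld"}
mismatches = []
shadow_read("msg:2", old_store.get, new_store.get,
            lambda *args: mismatches.append(args), sample_rate=1.0)
```

&lt;p&gt;In a real service the comparison would usually run off the request's critical path so that shadow traffic adds no user-visible latency; the sketch does it inline to keep the logic easy to follow.&lt;/p&gt;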




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://techlogstack.com/explore/discord-cassandra-scylladb-2022/" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;&lt;/strong&gt; (interactive diagrams, source links, and the full reader experience).&lt;/p&gt;

</description>
      <category>database</category>
      <category>discord</category>
      <category>scalability</category>
      <category>infrastructure</category>
    </item>
  </channel>
</rss>
