<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: TechLogStack</title>
    <description>The latest articles on DEV Community by TechLogStack (@techlogstack).</description>
    <link>https://dev.to/techlogstack</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3942907%2Fa4ab56cd-2c6b-475e-9f32-91735275dadc.png</url>
      <title>DEV Community: TechLogStack</title>
      <link>https://dev.to/techlogstack</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/techlogstack"/>
    <language>en</language>
    <item>
      <title>A Race Condition in DynamoDB's DNS Took Down Snapchat, Fortnite, Ring, and Half the Internet for 15 Hours</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Sun, 31 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/a-race-condition-in-dynamodbs-dns-took-down-snapchat-fortnite-ring-and-half-the-internet-for-15-3l9f</link>
      <guid>https://dev.to/techlogstack/a-race-condition-in-dynamodbs-dns-took-down-snapchat-fortnite-ring-and-half-the-internet-for-15-3l9f</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;October 19–20, 2025&lt;/strong&gt; — 15-hour outage in US-EAST-1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Root cause:&lt;/strong&gt; race condition between two DNS Enactor processes; cleanup job deleted active DNS records&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~3 hours&lt;/strong&gt; for DynamoDB to recover; &lt;strong&gt;12+ additional hours&lt;/strong&gt; for EC2 cascade to clear&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;140+ AWS services&lt;/strong&gt; affected: EC2, IAM, Lambda, STS, S3, and every control-plane dependency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapchat (375M daily users), Fortnite, Roblox, Ring, Venmo, Coinbase, UK HMRC&lt;/strong&gt; all affected&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;17M+ outage reports&lt;/strong&gt; across 3,000+ organisations (Ookla data); 20–30% of internet-facing services disrupted at peak&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recovery anti-pattern:&lt;/strong&gt; engineers had to manually disable automatic failover — the automation was making things worse&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;It was 11:48 PM PDT on October 19, 2025. Two automation processes inside AWS's DynamoDB DNS management system were doing the same job simultaneously: one fast, one painfully slow. The slow one was just finishing up when the fast one, having already completed, triggered a cleanup job that deleted the slow one's work. In that moment, every DNS record for DynamoDB's regional endpoint in the world's busiest cloud region vanished. Snapchat went dark for 375 million daily users. Fortnite lobbies dissolved mid-match. Ring cameras stopped recording. The UK's HMRC tax authority went offline. For roughly three hours, the internet's largest database service had no address, and the cascade that followed kept the platform degraded for about 15 hours in total.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;When this issue occurred at 11:48 PM PDT, all systems needing to connect to the DynamoDB service in the N. Virginia (us-east-1) Region via the public endpoint immediately began experiencing DNS failures and failed to connect to DynamoDB. This included customer traffic as well as traffic from internal AWS services that rely on DynamoDB.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;— Amazon Web Services, Official Post-Incident Summary, October 2025&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;DynamoDB is not just a database. Inside AWS's infrastructure, it is the connective tissue — the system that EC2, IAM, Lambda, STS, Redshift, and dozens of other control-plane services rely on to store metadata, track state, and coordinate operations. When DynamoDB becomes unreachable, it doesn't just take databases offline. It takes down the systems that &lt;em&gt;manage&lt;/em&gt; everything else. This is why a DNS failure that lasted roughly three hours for DynamoDB itself cascaded into a 15-hour platform-wide crisis. The control plane broke. And when the control plane breaks, recovery is not a matter of fixing the root cause — it is a matter of stabilising everything that lost its footing when the ground disappeared.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;The Two-Component DNS Architecture: Planner and Enactor&lt;/strong&gt;

&lt;p&gt;At AWS's scale, DynamoDB maintains hundreds of thousands of DNS records to route traffic across load balancers. AWS built a two-component system to manage this: &lt;strong&gt;The DNS Planner&lt;/strong&gt; monitors load balancer health and periodically creates DNS plans — specifications of which load balancers should receive traffic and with what weight distribution. &lt;strong&gt;The DNS Enactors&lt;/strong&gt; are the workers — multiple independent processes running across three Availability Zones — that pick up the plans and apply them to Route53. Multiple Enactors running in parallel provide redundancy. In theory.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Enactor A Slows Down — And Its Stale Check Becomes a Time Bomb
&lt;/h4&gt;

&lt;p&gt;DNS Enactor A began applying an older DNS plan but encountered unusual delays — blocked trying to update records, moving painfully slowly through the list of endpoints. Crucially, Enactor A performed a staleness check early in its process: "Is my plan newer than what's currently active?" At the time of that check, it was. But by the time Enactor A actually finished applying the plan, newer plans had been created and applied. The staleness check was now stale itself.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  The Race Condition Fires — Enactor B Wins, Then Cleans Up
&lt;/h4&gt;

&lt;p&gt;While Enactor A was slowly working through its updates, Enactor B picked up one of the newer plans and rapidly applied it across all endpoints. When Enactor B completed, it triggered the cleanup process: identify plans that are significantly older than the one just applied, and delete them. At that exact moment — T+45 seconds after the race began — Enactor A finally finished applying its old plan, overwriting Enactor B's newer records. The cleanup job identified Enactor A's newly-applied old plan as many generations old, and deleted it. All DynamoDB DNS records for the US-EAST-1 regional endpoint were gone.&lt;/p&gt;
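&lt;p&gt;The sequence above can be replayed deterministically in a few lines. This is a minimal sketch, not AWS's implementation: &lt;code&gt;DnsState&lt;/code&gt;, &lt;code&gt;is_fresh&lt;/code&gt;, &lt;code&gt;apply_plan&lt;/code&gt;, &lt;code&gt;cleanup&lt;/code&gt;, the plan numbers and the hostnames are all hypothetical, and the real system writes to Route53 rather than to a dictionary.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DnsState:
    """Hypothetical stand-in for one endpoint's DNS state."""
    plans: dict = field(default_factory=dict)   # generation -> record set
    active_generation: Optional[int] = None

    def resolve(self):
        # The endpoint resolves only if the active plan still exists.
        return self.plans.get(self.active_generation, [])


def is_fresh(state, generation):
    """Staleness check: is this plan newer than what is currently active?"""
    return state.active_generation is None or generation > state.active_generation


def apply_plan(state, generation):
    state.active_generation = generation


def cleanup(state, newest, keep=2):
    """Delete plans much older than the newest one, with no 'is it active?' guard."""
    for generation in list(state.plans):
        if newest - generation >= keep:
            del state.plans[generation]


state = DnsState(plans={g: [f"lb-{g}.example.internal"] for g in range(1, 11)})

# 1. Slow Enactor A checks plan 3 early. Nothing newer is active yet, so it passes.
a_check_passed = is_fresh(state, 3)

# 2. Fast Enactor B applies plan 10 while A is still delayed.
if is_fresh(state, 10):
    apply_plan(state, 10)

# 3. A finally applies plan 3, trusting the result from step 1.
if a_check_passed:
    apply_plan(state, 3)

# 4. B's cleanup removes plans far older than plan 10, including the active plan 3.
cleanup(state, newest=10)

print(state.active_generation, state.resolve())
```

&lt;p&gt;After step 4 the active generation points at a plan that no longer exists, so &lt;code&gt;resolve()&lt;/code&gt; returns an empty record set. Neither step is wrong in isolation; the failure only appears when the two interleave.&lt;/p&gt;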




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  11:48 PM PDT: Total DNS Blackout → Manual Recovery
&lt;/h4&gt;

&lt;p&gt;At 11:48 PM PDT, every system trying to connect to DynamoDB in US-EAST-1 received DNS failures. Engineers identified the DNS issue by 12:38 AM PDT, began temporary mitigations by 1:15 AM PDT, and DynamoDB itself recovered by approximately 2:25 AM PDT, a little under three hours after the incident began. But the cascade had already overwhelmed EC2's Droplet Workflow Manager (DWFM) with a backlog of expired instance leases it couldn't process.&lt;/p&gt;




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  15 Hours of Cascading Failure
&lt;/h4&gt;

&lt;p&gt;The DWFM entered congestive collapse, requiring 12+ more hours for network state to fully stabilise. Engineers had to manually disable the automatic failover system entirely to stop it from flip-flopping between states and allow the platform to stabilise. Full recovery across all services wasn't complete until late afternoon on October 20 — roughly 15 hours after the cascade began.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  AWS's Post-Incident Fixes: Preventing the Race, Containing the Cascade
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;AWS's five-layer post-incident fix plan (from the official post-incident summary, October 23, 2025):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure Layer&lt;/th&gt;
&lt;th&gt;What Went Wrong&lt;/th&gt;
&lt;th&gt;AWS's Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DNS Enactor race condition&lt;/td&gt;
&lt;td&gt;Enactor A's stale staleness check allowed it to overwrite Enactor B's newer plan&lt;/td&gt;
&lt;td&gt;Stronger staleness validation at time of application — must reflect current world state, not time of plan pickup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cleanup automation&lt;/td&gt;
&lt;td&gt;Cleanup job deleted Enactor A's just-applied old plan, wiping all DNS records&lt;/td&gt;
&lt;td&gt;Safeguards ensuring no automated process can delete an active DNS plan regardless of generation number&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NLB failover velocity&lt;/td&gt;
&lt;td&gt;Network Load Balancers moved large capacity during AZ failover, amplifying the cascade&lt;/td&gt;
&lt;td&gt;Velocity control mechanism limiting how much capacity a single NLB can remove during health check failures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EC2 recovery workflow&lt;/td&gt;
&lt;td&gt;DWFM entered congestive collapse when DynamoDB recovered — failure mode not tested at scale&lt;/td&gt;
&lt;td&gt;Additional test suite to exercise the DWFM recovery workflow at scale before production discovery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automatic failover during recovery&lt;/td&gt;
&lt;td&gt;Failover automation flip-flopped during recovery, requiring manual disabling before stabilisation&lt;/td&gt;
&lt;td&gt;Review of failover automation behaviour during degraded DNS states — distinguish 'service down' from 'DNS inconsistent during recovery'&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
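&lt;p&gt;The first two rows of the table translate into two small invariants: check freshness at the moment of application, and never let cleanup delete the active plan. The sketch below illustrates both under simplifying assumptions. &lt;code&gt;PlanStore&lt;/code&gt; and its methods are hypothetical names, and a single in-process lock stands in for whatever conditional write or transaction a distributed system would actually need.&lt;/p&gt;

```python
import threading


class PlanStore:
    """Hypothetical plan store; names are illustrative, not AWS's."""

    def __init__(self, plans):
        self._lock = threading.Lock()
        self.plans = dict(plans)          # generation -> record set
        self.active_generation = None

    def apply_if_newer(self, generation):
        """Check and apply under one lock, so the check cannot go stale."""
        with self._lock:
            current = self.active_generation
            if current is not None and current >= generation:
                return False              # a newer (or equal) plan is already live
            if generation not in self.plans:
                return False              # plan was already cleaned up
            self.active_generation = generation
            return True

    def cleanup(self, newest, keep=2):
        """Delete old plans, but never the one serving live traffic."""
        with self._lock:
            for generation in list(self.plans):
                too_old = newest - generation >= keep
                if too_old and generation != self.active_generation:
                    del self.plans[generation]


store = PlanStore({g: [f"lb-{g}.example.internal"] for g in range(1, 11)})
fast_applied = store.apply_if_newer(10)   # fast Enactor wins
slow_applied = store.apply_if_newer(3)    # slow Enactor is refused at apply time
store.cleanup(newest=10)
print(fast_applied, slow_applied, store.active_generation)
```

&lt;p&gt;The guard in &lt;code&gt;cleanup&lt;/code&gt; is deliberately redundant with the check in &lt;code&gt;apply_if_newer&lt;/code&gt;. If some other path ever makes an old plan active, cleanup still cannot remove the records that traffic depends on.&lt;/p&gt;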

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;~3 hrs&lt;/strong&gt; — time from incident start to DynamoDB DNS restoration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;12+ hrs&lt;/strong&gt; — additional hours EC2's Droplet Workflow Manager required to clear congestive collapse&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;140+&lt;/strong&gt; — AWS services eventually affected; DynamoDB powers the control planes of EC2, IAM, Lambda, STS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$581M&lt;/strong&gt; — estimated insurance losses (CyberCube) representing disruption to thousands of globally dependent businesses&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;The Anti-Pattern: When Automation Prevents Recovery&lt;/strong&gt;

&lt;p&gt;The most counterintuitive part of the recovery was that engineers had to &lt;strong&gt;disable automatic failover&lt;/strong&gt; to stabilise the system. The automatic failover mechanisms were detecting DNS inconsistency as failures and triggering failovers, which created new inconsistencies, which triggered more failovers. The automation designed to speed recovery was making recovery impossible. Engineers had to manually turn it off, let the system reach a stable state, and re-enable it with correct DNS records in place. &lt;strong&gt;Sometimes, the recovery automation has to stop before recovery can start.&lt;/strong&gt; Build your recovery playbooks to include the question: "Is any automated system currently making this worse?"&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;p&gt;The &lt;em&gt;congestive collapse&lt;/em&gt; pattern that extended the outage by 12 hours is worth naming clearly. When DynamoDB recovered, EC2's DWFM was facing an enormous queue of backlogged lease management tasks — all trying to execute simultaneously. The more it tried to process, the more it overwhelmed the now-recovered DynamoDB, which slowed processing, which lengthened the queue, which increased the pressure. The system was stuck in a self-sustaining degraded state. This is the same metastable failure pattern documented in the Slack 2-22-22 incident — and the solution is the same: reduce incoming load or add capacity, rather than waiting for self-recovery.&lt;/p&gt;
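&lt;p&gt;A toy model makes the "reduce incoming load" advice concrete. The numbers and the degradation rule below are assumptions chosen for illustration, not measurements from the incident: the dependency completes up to &lt;code&gt;capacity&lt;/code&gt; requests per tick, but when it is offered more than that it thrashes and completes only a tenth of its capacity.&lt;/p&gt;

```python
def ticks_to_drain(backlog, offered_per_tick, capacity=100, max_ticks=10_000):
    """Count ticks needed to clear `backlog` against a toy dependency.

    Assumed model: offering at or below `capacity` completes everything
    offered; offering more than `capacity` completes only capacity // 10.
    """
    ticks = 0
    while backlog > 0 and max_ticks > ticks:
        offered = min(offered_per_tick, backlog)
        completed = offered if capacity >= offered else capacity // 10
        backlog -= completed
        ticks += 1
    return ticks


# Everything at once: the dependency stays overloaded for almost the whole drain.
unthrottled = ticks_to_drain(backlog=10_000, offered_per_tick=10_000)

# Admission control: never offer more than the dependency can absorb.
throttled = ticks_to_drain(backlog=10_000, offered_per_tick=100)

print(unthrottled, throttled)
```

&lt;p&gt;In this model the throttled drain finishes roughly ten times sooner, even though it sends less work per tick. The point is the shape of the curve rather than the specific ratio: once a dependency degrades under overload, pushing harder lengthens recovery.&lt;/p&gt;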

&lt;p&gt;&lt;strong&gt;The EC2 Droplet Workflow Manager congestive collapse&lt;/strong&gt;&lt;br&gt;
EC2's Droplet Workflow Manager (DWFM) is the system responsible for managing EC2 instance lifecycle events, including lease renewals. When DynamoDB became unavailable, DWFM couldn't process instance state updates and began accumulating a backlog of expired leases. By the time DynamoDB recovered, DWFM was facing an enormous simultaneous queue. The system entered congestive collapse: the more it tried to process, the more it overwhelmed the now-recovered DynamoDB, which slowed processing, which lengthened the queue. Network state recovery from this collapse took more than five additional hours after DynamoDB was fixed. AWS's fix: build the test suite that exercises this recovery workflow at production scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hidden cross-region dependency problem&lt;/strong&gt;&lt;br&gt;
The October 2025 outage adds to a body of evidence about a specific architectural anti-pattern: &lt;strong&gt;regions that are called independent but aren't.&lt;/strong&gt; AWS regions were designed with the premise that a failure in US-EAST-1 should not affect services running in EU-WEST-1. But control-plane dependencies (authentication services, metadata stores, quota management systems) create invisible cross-region ties. Ring cameras deployed globally still authenticated against US-EAST-1 IAM. UK government services deployed in EU regions still made US-EAST-1 API calls. True regional independence requires not just deploying application code in multiple regions, but ensuring that every control-plane dependency is also independently redundant per region. For most organisations, this is not the architecture they have; it is the architecture they think they have.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The October 2025 DynamoDB outage is a case study in &lt;em&gt;control-plane failure&lt;/em&gt; — a class of failure categorically more damaging than a data-plane failure because it removes the ability to manage and coordinate infrastructure rather than just disrupting one service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Major services affected:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Affected Services&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Social &amp;amp; Entertainment&lt;/td&gt;
&lt;td&gt;Snapchat (375M daily users), Discord, Reddit, Roblox, Fortnite, Disney+, Hulu, Twitch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Finance &amp;amp; Payments&lt;/td&gt;
&lt;td&gt;Coinbase, Venmo, Lloyds, Halifax&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smart Home &amp;amp; IoT&lt;/td&gt;
&lt;td&gt;Amazon Ring, Amazon Alexa, Eight Sleep&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Communications&lt;/td&gt;
&lt;td&gt;Signal, enterprise platforms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Government&lt;/td&gt;
&lt;td&gt;UK HMRC tax authority&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Travel&lt;/td&gt;
&lt;td&gt;United Airlines, Delta apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Services (internal)&lt;/td&gt;
&lt;td&gt;EC2, IAM, STS, Lambda, S3, SQS, Redshift (140+ total)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The DNS Race Condition: Step-by-Step
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/aws-dynamodb-dns-outage-2025/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  The Cascade: How DynamoDB's DNS Failure Propagated
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/aws-dynamodb-dns-outage-2025/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;



&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Why US-EAST-1 Became a Single Point of Failure for the Internet&lt;/strong&gt;

&lt;p&gt;AWS designed its regions to be independently operable — a failure in US-EAST-1 should not affect EU-WEST-1. This design intention is correct, but the reality that emerged over 20 years is different. US-EAST-1 is where AWS first launched most services, accumulating the most mature feature sets. It became the default — the region developers reach for first, the one that decades of "just deploy to us-east-1" decisions have concentrated critical infrastructure in. Even services claiming multi-region redundancy often still rely on US-EAST-1 for authentication flows, control-plane coordination, or foundational database calls. The technical independence of regions is real. &lt;strong&gt;The operational independence, as experienced during the October 2025 outage, is not.&lt;/strong&gt;&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;





&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Staleness checks must be evaluated at time of use, not time of pickup.&lt;/strong&gt; Enactor A's staleness check was valid when it ran. By the time Enactor A acted on the result, the check was stale. In any concurrent system where state can change between the check and the action, the check must be re-evaluated immediately before the action, or made atomic with it. This is &lt;em&gt;TOCTOU&lt;/em&gt; (Time-of-Check to Time-of-Use): a race condition where the condition being checked changes between when it is checked and when it is acted upon. It is one of the oldest race condition patterns in computer science, and here it appeared in production at AWS scale.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No automated process should be able to delete an active record.&lt;/strong&gt; The cleanup job had no protection for the case where an older plan was actively in use as the live DNS record. The invariant that must be protected: &lt;em&gt;the record currently resolving live traffic cannot be deleted by any automated process, regardless of its generation number.&lt;/em&gt; This invariant is simpler than the cleanup logic that violated it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Congestive collapse is a failure mode that only appears at scale — and the recovery path for it must be tested before it's needed.&lt;/strong&gt; EC2's DWFM had never been tested through the scenario of processing a massive backlog of expired leases simultaneously after a DynamoDB recovery. The scenario seemed unlikely enough to skip in testing. Building the test suite that exercises recovery workflows at production scale is the investment that pays off only in disasters — but those are exactly the moments when it matters most.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Control-plane dependencies&lt;/em&gt; must be evaluated independently for each region.&lt;/strong&gt; These are the hidden dependencies that applications have on cloud provider management systems (authentication services, metadata stores, quota management), and they can create cross-region failure modes even when application code is deployed in multiple regions. Ring cameras deployed globally still authenticated against US-EAST-1 IAM. True regional independence requires independently redundant control planes, not just independently deployed application code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sometimes, the recovery automation has to stop before recovery can start.&lt;/strong&gt; Build recovery playbooks to include the question: "Is any automated system currently making this worse?" Automation that detects 'DNS is inconsistent during manual recovery' the same way as 'service is down' will trigger failovers that create new inconsistencies. Automation must be able to distinguish between these states — and humans must be empowered to pause it when it cannot.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
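&lt;p&gt;Lesson 5 can be made mechanical with a guard in front of the failover action. The following is a minimal sketch under assumed thresholds: &lt;code&gt;FailoverGuard&lt;/code&gt; is a hypothetical name, and the limit of three failovers in five minutes is an arbitrary illustration. It refuses further failovers once the system is clearly flapping, and it gives operators an explicit pause switch.&lt;/p&gt;

```python
from collections import deque


class FailoverGuard:
    """Hypothetical guard that sits in front of an automated failover action."""

    def __init__(self, max_failovers=3, window_seconds=300.0):
        self.max_failovers = max_failovers
        self.window_seconds = window_seconds
        self.recent = deque()      # timestamps of recently allowed failovers
        self.paused = False        # set by an operator or by flap detection

    def pause(self):
        """Operator override: stop all automated failovers."""
        self.paused = True

    def resume(self):
        self.recent.clear()
        self.paused = False

    def allow_failover(self, now):
        """Return True if a failover may run at time `now` (in seconds)."""
        if self.paused:
            return False
        # Forget failovers that have aged out of the sliding window.
        while self.recent and now - self.recent[0] > self.window_seconds:
            self.recent.popleft()
        if len(self.recent) >= self.max_failovers:
            self.paused = True     # flapping: stop and wait for a human
            return False
        self.recent.append(now)
        return True


guard = FailoverGuard(max_failovers=3, window_seconds=300.0)
decisions = [guard.allow_failover(t) for t in (0.0, 20.0, 40.0, 60.0, 80.0)]
print(decisions, guard.paused)
```

&lt;p&gt;The guard stays paused until someone calls &lt;code&gt;resume()&lt;/code&gt;. That is a deliberate choice in this sketch: automation that has already flapped should not decide on its own that it is safe to start again.&lt;/p&gt;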




&lt;h2&gt;
  
  
  Engineering Glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Congestive collapse&lt;/strong&gt; — a failure mode where a system attempting to recover from backlog overwhelms its dependencies, slowing processing and lengthening the queue, creating a self-sustaining degraded state. EC2's DWFM entered congestive collapse when DynamoDB recovered and the accumulated lease backlog overwhelmed the now-restored database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Control-plane failure&lt;/strong&gt; — a class of failure where the management and coordination layer of a system fails, rather than the data-serving layer. Uniquely damaging because it removes the ability to manage everything else: EC2 can't track instances, IAM can't validate credentials, Lambda can't execute. Control-plane failures cascade differently from data-plane failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DNS Enactor&lt;/strong&gt; — one of the worker processes in AWS's DynamoDB DNS management system that picks up DNS plans and applies them to Route53. Multiple Enactors run in parallel across Availability Zones for redundancy. The race condition that caused the October 2025 outage occurred between two Enactors picking up different-generation plans simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DNS Planner&lt;/strong&gt; — the planning component in AWS's DynamoDB DNS management system that monitors load balancer health and creates DNS plans specifying which load balancers should receive traffic. Plans are then consumed by DNS Enactors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Droplet Workflow Manager (DWFM)&lt;/strong&gt; — EC2's system responsible for managing EC2 instance lifecycle events, including lease renewals. When DynamoDB became unavailable, DWFM accumulated a backlog of expired lease management tasks. When DynamoDB recovered, the simultaneous burst of backlog processing triggered congestive collapse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TOCTOU (Time-of-Check to Time-of-Use)&lt;/strong&gt; — a race condition where the condition being checked changes between when it is checked and when it is acted upon, causing the action to operate on incorrect assumptions. Enactor A checked its plan's staleness, found it valid, then applied the plan — but by the time it applied, the world had moved on and the check was stale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thundering herd / herd effect&lt;/strong&gt; — a distributed systems failure mode where many clients simultaneously attempt to reconnect to a shared resource, overwhelming it. Appears in the October 2025 outage as the DWFM congestive collapse. The standard solution is randomised exponential backoff.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/aws-dynamodb-dns-outage-2025/" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Interactive diagrams, source links, and the full reader experience)&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;TechLogStack — built at scale, broken in public, rebuilt by engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>reliability</category>
      <category>devops</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Google's Own Cleanup Job Crashed Cloud Services Across 4 Continents — and Then Made Recovery Worse</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Sun, 31 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/googles-own-cleanup-job-crashed-cloud-services-across-4-continents-and-then-made-recovery-worse-g4</link>
      <guid>https://dev.to/techlogstack/googles-own-cleanup-job-crashed-cloud-services-across-4-continents-and-then-made-recovery-worse-g4</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;June 12, 2025&lt;/strong&gt; — 7+ hour outage; North America, Europe, Far East, Africa simultaneously&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Root cause:&lt;/strong&gt; null pointer exception in Service Control from a May 29 code change — dormant for 14 days&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No feature flag, no error handling&lt;/strong&gt; on the new code path&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;50+ Google Cloud services&lt;/strong&gt; affected: IAM, Compute Engine, Cloud Storage, BigQuery, Vertex AI, Google Workspace&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Third-order cascade:&lt;/strong&gt; Google → Cloudflare → Discord/Twitch; Discord users had no idea why they were down&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Herd effect during recovery&lt;/strong&gt; overwhelmed Spanner in us-central1, extending the outage by 2+ hours&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;On May 29, 2025, a Google engineer deployed new quota-checking code to Service Control — the system that authorises every single API request across Google Cloud. The code had a bug: it couldn't handle a null value. But the bug was invisible during deployment because it could only be triggered by specific policy data that hadn't appeared yet. Two weeks later, an automated system pushed a routine policy update containing blank fields. The policy data replicated globally within seconds. Every Service Control binary in every region hit the null pointer, crashed, and refused to restart properly. Spotify went down. Discord went down. Snapchat went down. Google's own status page went down. And when engineers deployed the fix, the restart surge overwhelmed the infrastructure — making recovery worse than the crash.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;On May 29, 2025, a new feature was added to Service Control for additional quota policy checks. The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;— Google Cloud, Official Incident Report, June 14, 2025&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Service Control is not a product you've heard of. It doesn't have a marketing page or a conference talk. It exists in the infrastructure layer beneath everything else — the system that authorises every API request across Google Cloud and Google Workspace before that request is allowed to proceed. If you call the Cloud Storage API, Service Control checks your quota. If you authenticate with Google IAM, Service Control validates your policy. If your app on Google Cloud makes any call to any Google service, Service Control is in the critical path. It is, in the most literal sense, the gatekeeper of the entire platform.&lt;/p&gt;

&lt;p&gt;When Service Control crashed on June 12, it didn't just take down one service. It took down the authorisation layer for every service. API calls returned 503 errors not because the underlying services had failed, but because the gatekeeper wasn't there to let them through. Compute Engine instances were running. Cloud Storage buckets were intact. BigQuery jobs were ready to execute. None of it mattered — because without Service Control, nothing could be authorised, and nothing unauthorised can proceed in a correctly secured cloud platform.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;What Service Control Actually Does&lt;/strong&gt;

&lt;p&gt;Google Cloud's Service Control performs three functions on every API request: &lt;strong&gt;authentication&lt;/strong&gt; (is this requester who they claim to be?), &lt;strong&gt;authorisation&lt;/strong&gt; (are they allowed to perform this operation?), and &lt;strong&gt;quota enforcement&lt;/strong&gt; (have they exceeded their usage limits?). It processes these checks at massive scale across every region — billions of API calls per day — using policy metadata stored in and synchronised across Spanner, Google's globally distributed database. The May 29 code change was adding more sophisticated quota checking logic to this pipeline. The change worked correctly in every scenario that was tested. The scenario that wasn't tested was the one that appeared on June 12.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  May 29: Code Deployed — Bug Present, But Invisible
&lt;/h4&gt;

&lt;p&gt;Google engineers deployed new quota policy checking code to Service Control. The deployment went through the standard region-by-region rollout and passed all checks. But the new code path had two critical gaps: no error handling for null values, and no feature flag to disable it if something went wrong. The bug was invisible during rollout because the problematic code path could only be triggered by blank fields in the policy metadata. That input hadn't appeared during rollout. The binary was now running in every region with a loaded trap, waiting for the right trigger.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  June 12, 10:45 AM PDT: The Policy Update That Pulled the Trigger
&lt;/h4&gt;

&lt;p&gt;An automated system inserted a routine policy change into the regional Spanner tables that Service Control uses for policy metadata. The policy update contained unintended blank fields. Because quota management is global, Spanner's replication engine distributed this metadata worldwide within seconds. Every Service Control binary in every region hit the new code path, encountered the null values, and threw a null pointer exception. Without error handling, the exception crashed the binary. Service Control was dead globally.&lt;/p&gt;
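&lt;p&gt;The missing error handling is easiest to see at the boundary where policy rows enter the process. Below is a minimal sketch of validating rows before anything dereferences them. The row shape, &lt;code&gt;QuotaPolicy&lt;/code&gt;, &lt;code&gt;InvalidPolicy&lt;/code&gt;, &lt;code&gt;parse_policy&lt;/code&gt; and &lt;code&gt;load_policies&lt;/code&gt; are all hypothetical and say nothing about how Service Control is actually written.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class QuotaPolicy:
    """Hypothetical validated policy; every field is guaranteed non-blank."""
    project: str
    limit: int


class InvalidPolicy(Exception):
    """Raised for a policy row that is missing required fields."""


def parse_policy(row):
    """Validate a raw policy row before any check dereferences it."""
    project = row.get("project")
    limit = row.get("limit")
    if not project or limit is None:
        raise InvalidPolicy(f"blank field in policy row: {row!r}")
    return QuotaPolicy(project=project, limit=int(limit))


def load_policies(rows):
    """Load what is valid, quarantine what is not, and keep serving."""
    good, rejected = [], []
    for row in rows:
        try:
            good.append(parse_policy(row))
        except (InvalidPolicy, ValueError, TypeError):
            rejected.append(row)   # log and alert here instead of crashing
    return good, rejected


rows = [
    {"project": "alpha", "limit": 100},
    {"project": "beta", "limit": None},   # blank field, as in the bad update
    {"project": "", "limit": 5},
]
good, rejected = load_policies(rows)
print(len(good), len(rejected))
```

&lt;p&gt;One malformed row costs one rejected policy rather than the whole binary. What the service should do for requests that depend on a rejected policy (fail open or fail closed) is a separate product and security decision that this sketch does not make.&lt;/p&gt;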




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  The SRE Response: Diagnosis in 10 Minutes, Red Button in 40
&lt;/h4&gt;

&lt;p&gt;Google's SRE team began triaging within two minutes of the first alert. They identified the root cause — the null pointer exception in the new quota checking code path — within 10 minutes. Engineers deployed a 'red button' kill switch within 40 minutes to disable the problematic serving path. Most regions began recovering within two hours.&lt;/p&gt;




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  The Herd Effect: When Recovery Made Things Worse
&lt;/h4&gt;

&lt;p&gt;As Service Control instances restarted in us-central1 after the red button was deployed, they all simultaneously reached for the regional Spanner database to load their policy metadata. Hundreds of instances, all restarting at the same moment, all hitting Spanner at once, with no randomisation in their startup sequence. Spanner was overwhelmed by the simultaneous burst. Service Control couldn't load its policies, couldn't restart properly, kept trying, kept hitting Spanner. The recovery created a herd effect that prolonged the outage in us-central1 by more than two hours beyond when other regions had stabilised. Full resolution wasn't complete until 6:18 PM PDT, more than seven hours after the incident began.&lt;/p&gt;
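&lt;p&gt;Randomised exponential backoff is the standard countermeasure for this kind of restart surge. Here is a minimal sketch of the "full jitter" variant, assuming a hypothetical &lt;code&gt;load&lt;/code&gt; callable that raises &lt;code&gt;ConnectionError&lt;/code&gt; while the backing store is overloaded. The base delay, cap and attempt count are illustrative values, not Google's.&lt;/p&gt;

```python
import random
import time


def backoff_delays(attempts, base=0.5, cap=60.0, rng=None):
    """Yield one randomised delay per attempt ('full jitter' variant)."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        # Spread instances across the whole interval, not just near the ceiling.
        yield rng.uniform(0.0, ceiling)


def load_with_backoff(load, attempts=6, sleep=None, rng=None):
    """Call `load` until it succeeds, sleeping a jittered delay after each failure."""
    sleep = sleep or time.sleep
    last_error = None
    for delay in backoff_delays(attempts, rng=rng):
        try:
            return load()
        except ConnectionError as error:
            last_error = error
            sleep(delay)
    raise last_error or RuntimeError("no attempts were made")


# Usage with a stubbed store that fails twice and a recorded sleep.
calls = {"n": 0}


def flaky_load():
    calls["n"] += 1
    if 3 > calls["n"]:
        raise ConnectionError("store overloaded")
    return "policies"


slept = []
result = load_with_backoff(flaky_load, sleep=slept.append, rng=random.Random(7))
print(result, len(slept))
```

&lt;p&gt;The exponential part reduces the total retry rate as failures persist. The random part is what breaks up the herd, because instances that restarted together stop retrying together.&lt;/p&gt;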




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Google's Response: Five Commitments After the Outage
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Google's five-category post-incident remediation plan (from the official June 14, 2025 incident report):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure Mode&lt;/th&gt;
&lt;th&gt;What Happened&lt;/th&gt;
&lt;th&gt;Google's Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Missing error handling&lt;/td&gt;
&lt;td&gt;Null pointer exception crashed the binary when blank fields appeared&lt;/td&gt;
&lt;td&gt;Mandatory null-safe code patterns with static analysis to catch null pointer vulnerabilities before deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No feature flag&lt;/td&gt;
&lt;td&gt;New code path couldn't be disabled without full binary redeployment — adding 30+ min to response&lt;/td&gt;
&lt;td&gt;Feature flag protection required for all new Service Control code paths&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Herd effect during recovery&lt;/td&gt;
&lt;td&gt;Hundreds of instances restarting simultaneously overwhelmed Spanner&lt;/td&gt;
&lt;td&gt;Randomised exponential backoff on Service Control startup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Status page availability&lt;/td&gt;
&lt;td&gt;Cloud Service Health dashboard went down during the outage&lt;/td&gt;
&lt;td&gt;Decouple status infrastructure from the services it monitors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Service Control architecture&lt;/td&gt;
&lt;td&gt;Monolithic binary — crash in quota logic crashes all authorisation&lt;/td&gt;
&lt;td&gt;Modularise Service Control — isolate quota checking from authentication&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;10 min&lt;/strong&gt; — time for Google's SRE team to identify the root cause from the first alert at 10:49 AM PDT&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;40 min&lt;/strong&gt; — time to deploy the red button kill switch that disabled the problematic code path&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;7+ hrs&lt;/strong&gt; — total outage duration; most regions recovered in ~2 hours, herd effect in us-central1 extended full resolution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;50+&lt;/strong&gt; — Google Cloud services affected, including all core infrastructure APIs, all Google Workspace products, and all AI/ML services&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;The Feature Flag That Would Have Saved Seven Hours&lt;/strong&gt;

&lt;p&gt;The most consequential missing safeguard was the absence of a feature flag on the new quota checking code path. A feature flag would have changed the timeline dramatically: when null pointer exceptions began firing at 10:49 AM PDT, engineers with a feature flag could have disabled the new code path across all regions within seconds — before the crash had spread globally. Without a feature flag, the only option was a red-button kill switch requiring a new binary deployment: 40 minutes. &lt;strong&gt;40 minutes of global outage versus seconds of a feature flag toggle.&lt;/strong&gt; Google's incident report acknowledges this directly: "If this had been flag protected, the issue would have been caught in staging."&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;p&gt;The dormant trap pattern that caused this outage is worth naming explicitly. Google's staged, region-by-region rollout is exactly the right practice for catching bugs introduced by new deployments. It worked correctly for 14 days — no failures appeared during the May 29 rollout because the failure condition required specific policy data (blank fields) that hadn't yet been inserted. &lt;strong&gt;Staged rollouts are structurally unable to catch dormant traps&lt;/strong&gt; — bugs that only activate when a specific trigger arrives weeks later from an unrelated automated system. The only defences against dormant traps are error handling (so the crash doesn't happen when the trigger arrives) and feature flags (so the code path can be disabled immediately when the trigger produces unexpected behaviour). The May 29 change had neither.&lt;/p&gt;
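&lt;p&gt;The two defences named above can be sketched together. This is a minimal illustration, not Google's code: the &lt;code&gt;Policy&lt;/code&gt; type, the &lt;code&gt;FLAGS&lt;/code&gt; store, and every function name are hypothetical, and falling back to the legacy path on a blank field is an assumed design choice.&lt;/p&gt;

```python
from dataclasses import dataclass
from operator import le
from typing import Optional


@dataclass
class Policy:
    """Hypothetical policy row; a blank field arrives as None."""
    project: str
    quota_limit: Optional[int] = None


# Hypothetical runtime flag store. A real system would read this from a
# configuration service so it can be flipped without redeploying.
FLAGS = {"new_quota_check": True}


def legacy_check(policy, usage):
    """Stand-in for the pre-existing, known-good path."""
    return "allow"


def new_quota_check(policy, usage):
    """New path: handles the blank field instead of dereferencing it."""
    if policy.quota_limit is None:
        # Blank field: fall back rather than crash.
        return legacy_check(policy, usage)
    # le(a, b) is True when a is less than or equal to b.
    return "allow" if le(usage, policy.quota_limit) else "deny"


def check_request(policy, usage):
    """Route through the flag so the new path can be switched off."""
    if not FLAGS.get("new_quota_check", False):
        return legacy_check(policy, usage)
    try:
        return new_quota_check(policy, usage)
    except Exception:
        # Last line of defence: an unexpected error degrades to the
        # known-good behaviour instead of taking down the request path.
        return legacy_check(policy, usage)
```

&lt;p&gt;Setting &lt;code&gt;FLAGS["new_quota_check"] = False&lt;/code&gt; routes every request back to the legacy path without a redeploy. Whether a blank field should allow or deny the request is a policy decision this sketch does not settle.&lt;/p&gt;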

&lt;p&gt;&lt;strong&gt;The herd effect: a recovery anti-pattern with a known fix&lt;/strong&gt;&lt;br&gt;
The herd effect that prolonged the us-central1 outage is not a new problem. It is a long-documented failure mode in distributed systems: when many clients restart simultaneously after a shared dependency recovers, they all connect at once and overwhelm the dependency, preventing it from returning to steady state. The canonical solution, &lt;strong&gt;randomised exponential backoff&lt;/strong&gt;, is equally well documented and simple: each client waits a random delay before its first attempt and widens the delay window after every failed retry, so reconnection attempts are staggered over a time window rather than clustered at a single instant. Every Service Control instance waiting exactly zero milliseconds before hitting Spanner is the problem; each instance waiting a random delay between 0 and 30 seconds is the solution. Google committed to implementing this. That it took an outage to prompt the implementation is a reminder that known fixes for known problems often go unimplemented until the cost is paid in production.&lt;/p&gt;
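&lt;p&gt;A minimal sketch of randomised exponential backoff, assuming the common "full jitter" variant. The function names, the 0.5-second base, and the 30-second cap are illustrative assumptions, not values from Google's report.&lt;/p&gt;

```python
import random


def startup_delay(max_delay=30.0, rng=random):
    """Random wait (seconds) before the first load from the shared store."""
    return rng.uniform(0.0, max_delay)


def backoff_delays(attempts, base=0.5, cap=30.0, rng=random):
    """Return one randomised delay (seconds) per retry attempt.

    Full jitter: each delay is drawn uniformly from
    [0, min(cap, base * 2**attempt)], so the window grows exponentially
    while instances stay spread out inside it.
    """
    delays = []
    for attempt in range(attempts):
        window = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, window))
    return delays
```

&lt;p&gt;An instance would sleep for &lt;code&gt;startup_delay()&lt;/code&gt; before its first read, then sleep for the next value from &lt;code&gt;backoff_delays&lt;/code&gt; after each failed attempt. Passing a seeded &lt;code&gt;random.Random&lt;/code&gt; as &lt;code&gt;rng&lt;/code&gt; makes the schedule reproducible in tests.&lt;/p&gt;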

&lt;p&gt;&lt;strong&gt;The status page that went dark&lt;/strong&gt;&lt;br&gt;
Google's Cloud Service Health dashboard went offline during the June 12 outage because the status infrastructure shared a dependency on the same Google Cloud services that were failing. &lt;strong&gt;A status page that fails during a widespread outage is not just unhelpful; it is actively harmful.&lt;/strong&gt; Customers experiencing failures couldn't access the standard channel to confirm they weren't the source of the problem, couldn't track recovery progress, and couldn't communicate accurate information to their own stakeholders. The status page being down created a second outage: an outage of information. A status page that goes down during the incident it is supposed to report is a monitoring anti-pattern at its most consequential.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Service Control sits at the intersection of every API request Google Cloud processes. Understanding how it failed — and why the failure spread so quickly and recovered so slowly — requires understanding three things: the role of Spanner as the global policy data store, the absence of safe failure handling in the new code path, and the herd effect as a predictable consequence of synchronised restart under load.&lt;/p&gt;

&lt;p&gt;The blast radius of the June 12 outage had three concentric rings:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure Ring&lt;/th&gt;
&lt;th&gt;What Failed&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;First: Google's own infrastructure&lt;/td&gt;
&lt;td&gt;Cloud IAM, Compute Engine, Cloud Storage, BigQuery, Cloud SQL, Vertex AI, Cloud Monitoring, Google Workspace&lt;/td&gt;
&lt;td&gt;Service Control crashed globally, blocking all API authorisation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Second: Direct GCP customers&lt;/td&gt;
&lt;td&gt;Spotify (~46K reports), Snapchat, Fitbit, Replit, GitLab, Shopify, Character.AI, Cursor&lt;/td&gt;
&lt;td&gt;Applications on GCP couldn't authorise any backend calls — services appeared down even though underlying compute was running&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Third: Cloudflare and its customers&lt;/td&gt;
&lt;td&gt;Cloudflare (partial), Discord, Twitch&lt;/td&gt;
&lt;td&gt;Cloudflare uses Google Cloud for certain backend operations; those degraded, cascading to Cloudflare's own customers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Normal Flow vs June 12 Failure: What Service Control Does on Every Request
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/google-cloud-service-control-outage-2025/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  The Herd Effect: Why us-central1 Recovery Took 2+ Extra Hours
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/google-cloud-service-control-outage-2025/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;



&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;The Global Spanner Replication Trap&lt;/strong&gt;

&lt;p&gt;The reason the June 12 failure was global rather than regional was Spanner's design strength working against Google in this case. Spanner is engineered to replicate data to all regions in real time — typically within seconds. When the automated system inserted the policy update with blank fields into the regional Spanner tables, &lt;strong&gt;Spanner replicated that policy data to every region within seconds.&lt;/strong&gt; Every Service Control instance in every region hit the null pointer at essentially the same moment. There was no regional staging, no propagation delay, no opportunity for an alert to fire in one region before the failure had spread to all others. The same architecture that gives Spanner its global consistency guarantee gave this bug its global blast radius.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;





&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Error handling is not optional for code that runs in the critical path of a globally distributed system.&lt;/strong&gt; The null pointer exception that crashed Service Control was caused by a missing null check. Any code path that processes external data — data that arrives from an automated system and could contain unexpected values — must explicitly handle the unexpected cases. Blank fields in policy metadata is a predictable input variation. The code should have anticipated it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Feature flags&lt;/em&gt; (a software engineering practice where new code is deployed but kept inactive until explicitly enabled via configuration, allowing teams to disable problematic features instantly without redeployment) on infrastructure code are not optional — they are the minimum viable safety mechanism for any code that processes global-scale policy data.&lt;/strong&gt; The difference between "feature flag enabled, issue caught in staging" and "no feature flag, 7-hour global outage" is one line of configuration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;The thundering herd&lt;/em&gt; (a distributed systems failure mode where many clients simultaneously attempt to reconnect to a shared resource after it recovers, overwhelming it and preventing it from returning to stable operation) is a known failure mode with a known fix: randomised exponential backoff.&lt;/strong&gt; Build randomised backoff into any service that has a shared dependency it needs to reconnect to after a failure. This has been documented for decades. The fact that Service Control lacked it is a reminder that known fixes for known problems often go unimplemented until the cost of not implementing them is paid.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Your monitoring infrastructure must be architecturally independent of the services it monitors.&lt;/strong&gt; No shared dependencies between the monitoring stack and the application stack. The moment customers need status information most is exactly the moment a shared-dependency status page is most likely to be unavailable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Third-order cascade failures are invisible until they happen.&lt;/strong&gt; Discord's users had no idea their outage originated in a null pointer in Google's quota management code. The dependency chain was opaque: Discord → Cloudflare → Google Cloud → Service Control → policy metadata blank fields. Every engineering team should map their dependency chain at least two levels deep — not just "we use Cloudflare" but "Cloudflare uses Google Cloud, and a Google Cloud outage of sufficient scope will reach us through Cloudflare."&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
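&lt;p&gt;The dependency mapping recommended in the last lesson can be sketched as a graph walk. The map below is hypothetical: its edges mirror the chain described above, and a real inventory would come from vendor documentation and architecture reviews.&lt;/p&gt;

```python
from collections import deque

# Hypothetical dependency map; edges follow the chain described above.
DEPENDS_ON = {
    "Discord": ["Cloudflare"],
    "Cloudflare": ["Google Cloud"],
    "Google Cloud": ["Service Control"],
    "Service Control": ["Spanner"],
}


def transitive_dependencies(service, graph):
    """Breadth-first walk returning {dependency: depth} for a service."""
    depths = {}
    queue = deque([(service, 0)])
    while queue:
        node, depth = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in depths:
                depths[dep] = depth + 1
                queue.append((dep, depth + 1))
    return depths


def shared_dependencies(a, b, graph):
    """Dependencies two services have in common (a shared fate)."""
    deps_a = set(transitive_dependencies(a, graph))
    deps_b = set(transitive_dependencies(b, graph))
    return deps_a.intersection(deps_b)
```

&lt;p&gt;Calling &lt;code&gt;transitive_dependencies("Discord", DEPENDS_ON)&lt;/code&gt; surfaces Service Control three hops away. Running &lt;code&gt;shared_dependencies&lt;/code&gt; on a status page and the platform it reports on is one way to check the independence the fourth lesson asks for.&lt;/p&gt;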




&lt;h2&gt;
  
  
  Engineering Glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Dormant trap&lt;/strong&gt; — a bug present in production code that cannot be triggered by any input present at deployment time, but activates when a specific trigger arrives later from an unrelated system. The May 29 Service Control change was a dormant trap: it executed correctly for 14 days until the automated policy update inserted blank fields. Staged rollouts are structurally unable to catch dormant traps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature flag (kill switch)&lt;/strong&gt; — a configuration switch that enables or disables a code path without requiring redeployment. The absent safeguard in this incident. A feature flag on the new quota checking code path would have allowed it to be disabled across all regions within seconds when null pointer exceptions began firing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Herd effect (thundering herd)&lt;/strong&gt; — a distributed systems failure mode where many clients simultaneously attempt to reconnect to a shared resource after it recovers, overwhelming the resource and preventing it from returning to stable operation. The mechanism that extended the us-central1 outage by 2+ hours after the red button was deployed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Null pointer exception&lt;/strong&gt; — a runtime error that occurs when code attempts to use a reference that points to no object (null). The missing null check in Service Control's new quota checking code that caused a 7-hour global outage for 50+ cloud services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Randomised exponential backoff&lt;/strong&gt; — a retry strategy where clients wait a random delay that increases exponentially with each retry attempt. The standard solution to the thundering herd problem — prevents synchronised reconnection bursts by distributing client attempts across a time window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Service Control&lt;/strong&gt; — Google's internal authorisation gateway that processes every API request across Google Cloud and Google Workspace. Performs authentication, authorisation, and quota enforcement on every call. A crash in Service Control takes down all API authorisation for the entire platform — making it the highest-blast-radius single component in Google Cloud.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spanner&lt;/strong&gt; — Google's globally distributed database, engineered to replicate data to all regions in real time (typically within seconds). Used by Service Control for policy metadata. The same replication speed that makes Spanner powerful for global consistency made this bug's blast radius global and instantaneous.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/google-cloud-service-control-outage-2025/" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Interactive diagrams, source links, and the full reader experience)&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;TechLogStack — built at scale, broken in public, rebuilt by engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>reliability</category>
      <category>devops</category>
      <category>webdev</category>
    </item>
    <item>
      <title>GitHub Built the Internet's Code Platform — Then AI Agents Broke It</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Sun, 31 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/github-built-the-internets-code-platform-then-ai-agents-broke-it-3lek</link>
      <guid>https://dev.to/techlogstack/github-built-the-internets-code-platform-then-ai-agents-broke-it-3lek</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;257 incidents&lt;/strong&gt; — May 2025 to April 2026; ~5 per week, every week, for 12 months straight&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;48 major outages&lt;/strong&gt; — 112+ hours of total significant downtime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI-agent PRs:&lt;/strong&gt; 4M (Sept 2025) → 17M (Mar 2026) — 325% increase in six months&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actions usage:&lt;/strong&gt; 500M min/week (2023) → 2.1B min/week (early 2026)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10x scaling plan&lt;/strong&gt; launched October 2025; &lt;strong&gt;revised to 30x&lt;/strong&gt; by February 2026&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mitchell Hashimoto&lt;/strong&gt; (GitHub user #1299, 18 years, co-founder of HashiCorp) migrated Ghostty away&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Between May 2025 and April 2026, GitHub experienced 257 incidents — 48 of them major outages. That's roughly one significant disruption every single week. The culprit wasn't a security breach, a botched deployment, or a rogue engineer. It was the thing GitHub had spent years celebrating: AI. Specifically, agentic AI workflows that turned one human developer's footprint into hundreds of commits, thousands of CI minutes, and a dozen simultaneous PR operations — all at once, across millions of accounts. GitHub had been built for humans. Agents are not human.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;We started executing our plan to increase GitHub's capacity by 10X in October 2025, with a goal of substantially improving reliability and failover. By February 2026, it was clear that we needed to design for a future that requires 30X today's scale. The main driver is a rapid change in how software is being built. Since the second half of December 2025, agentic development workflows have accelerated sharply.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;— Vlad Fedorov, CTO of GitHub, GitHub Engineering Blog, April 28, 2026&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For most of its existence, GitHub has been one of the most reliable platforms on the internet. Developers took it for granted the way they take electricity for granted — always on, always there, a utility so dependable it disappeared into the background. That changed in 2025. Not because GitHub's engineers got worse. Not because the codebase got sloppier. But because something fundamental changed about who — or more precisely, &lt;em&gt;what&lt;/em&gt; — was using GitHub. &lt;strong&gt;AI coding agents arrived at scale&lt;/strong&gt;, and they didn't behave anything like the human developers the platform was built for.&lt;/p&gt;

&lt;p&gt;In 2024, GitHub logged 119 service incidents, including 26 major ones — frustrating, but manageable. Then, between May 2025 and April 2026, incident monitoring service IncidentHub tracked &lt;strong&gt;257 separate incidents&lt;/strong&gt;, of which 48 were classified as major outages. February 2026 alone produced 37 incidents — the worst month on record. GitHub Actions suffered 57 outages in the same 12-month stretch. On May 15, 2026, a single Actions degradation caused &lt;strong&gt;42% of all Actions runs to fail at peak impact&lt;/strong&gt;.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;The Core Problem: Agents Don't Behave Like Humans&lt;/strong&gt;

&lt;p&gt;A human developer on a free GitHub account might generate a few commits and a handful of CI runs in a working day. &lt;strong&gt;An AI agent on the same account can generate hundreds of commits, dozens of PRs, and thousands of Actions minutes in a single afternoon.&lt;/strong&gt; GitHub's 2025 Octoverse report celebrated nearly 1 billion commits. By early 2026, GitHub COO Kyle Daigle shared a more alarming figure: the platform was handling &lt;strong&gt;275 million commits every single week&lt;/strong&gt; — on pace for 14 billion in 2026. That's a 14x annual increase. It wasn't 14x more developers. It was agents treating GitHub's API like a utility and consuming at machine speed.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;
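&lt;p&gt;The growth figures quoted above can be checked with a few lines of arithmetic. The only assumption is rounding "nearly 1 billion" commits up to exactly 1 billion.&lt;/p&gt;

```python
WEEKS_PER_YEAR = 52

commits_per_week = 275_000_000   # early-2026 weekly figure quoted above
commits_2025 = 1_000_000_000     # "nearly 1 billion", rounded up (assumption)

annual_run_rate = commits_per_week * WEEKS_PER_YEAR
growth_multiple = annual_run_rate / commits_2025

agent_prs_sept_2025 = 4_000_000
agent_prs_mar_2026 = 17_000_000
pr_increase_pct = (agent_prs_mar_2026 - agent_prs_sept_2025) / agent_prs_sept_2025 * 100

print(f"{annual_run_rate:,} commits per year")   # 14,300,000,000 commits per year
print(f"{growth_multiple:.1f}x the 2025 total")  # 14.3x the 2025 total
print(f"{pr_increase_pct:.0f}% more agent PRs")  # 325% more agent PRs
```

&lt;p&gt;The weekly rate extrapolates to about 14.3 billion commits a year, which is where the 14x figure comes from. The rise from 4 million to 17 million agent PRs is a 325% increase, or 4.25 times the starting volume.&lt;/p&gt;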


&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  GitHub Was Built for Human-Paced Development
&lt;/h4&gt;

&lt;p&gt;GitHub's architecture was designed for a world where developers work at human speed: open a PR, push commits over hours or days, wait for CI to run, merge when green. The platform's capacity planning, its database schemas, its job queues, its rate limits — all calibrated for a workflow where one human generates a bounded amount of activity per session. That assumption held for 17 years.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  AI Agents Changed the Economics of Every GitHub Operation
&lt;/h4&gt;

&lt;p&gt;GitHub CTO Vlad Fedorov identified the mechanism: a single pull request can simultaneously touch Git storage, mergeability checks, branch protection, GitHub Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases. A human merging one PR triggers this chain once. An AI agent framework running hundreds of concurrent sessions triggers it thousands of times simultaneously. AI-agent PRs jumped from 4 million in September 2025 to 17 million in March 2026 — a 325% increase in six months.&lt;/p&gt;




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  10x Plan Became 30x Plan — And They Were Still Behind
&lt;/h4&gt;

&lt;p&gt;GitHub began a 10x capacity scaling initiative in October 2025. By February 2026, that plan was already obsolete — the real demand required 30x. Simultaneously, GitHub was running a migration to Azure, with 12.5% of all traffic on Azure Central US and a target of 50% by July 2026. Running a platform migration alongside an AI-driven traffic explosion is the engineering equivalent of rebuilding an airplane's engines at 35,000 feet.&lt;/p&gt;




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Cascading Failures and High-Profile Departures
&lt;/h4&gt;

&lt;p&gt;The pressure produced not just performance degradation but engineering failures. On April 23, 2026, a merge queue bug that slipped past an incomplete feature flag silently reverted merged commits, affecting 2,092 pull requests across 658 repositories; the UI showed green checkmarks while code was being rewritten underneath. On April 28, Mitchell Hashimoto (GitHub user #1299, co-founder of HashiCorp, a member since February 2008) announced that Ghostty was leaving GitHub after 18 years. The Zig programming language project also migrated away.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Engineering Response: Ruby to Go, Monolith to Services, Single Cloud to Multi-Cloud
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;GitHub's five-layer engineering response (as outlined by CTO Vlad Fedorov, April 2026):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem Layer&lt;/th&gt;
&lt;th&gt;Root Cause&lt;/th&gt;
&lt;th&gt;GitHub's Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Language / Runtime&lt;/td&gt;
&lt;td&gt;Ruby monolith has GIL limiting CPU parallelism under high concurrency&lt;/td&gt;
&lt;td&gt;Rewriting performance-critical services from Ruby to Go — goroutine model handles massive concurrency without the GIL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;Single-cloud creates concentrated failure risk and limits horizontal scaling&lt;/td&gt;
&lt;td&gt;Multi-cloud deployment — 12.5% on Azure Central US in early 2026, targeting 50% by July 2026&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Service Isolation&lt;/td&gt;
&lt;td&gt;A single PR cascades through 10+ interconnected subsystems&lt;/td&gt;
&lt;td&gt;Isolating Git and Actions into independent failure domains&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Capacity Planning&lt;/td&gt;
&lt;td&gt;10x plan (October 2025) obsolete by February 2026&lt;/td&gt;
&lt;td&gt;30x capacity design with automated scaling for agent-driven burst load&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature Safety&lt;/td&gt;
&lt;td&gt;April 23 merge queue regression caused by incomplete feature flag&lt;/td&gt;
&lt;td&gt;Strengthened feature flag discipline — no data-integrity code path ships without complete flag protection&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;257&lt;/strong&gt; — total incidents May 2025–April 2026; roughly five per week, every week, for twelve months&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;48&lt;/strong&gt; — major outages producing over 112 hours of total significant downtime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;30x&lt;/strong&gt; — the scale GitHub needed to design for by February 2026, triple the 10x plan launched four months earlier&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2,092&lt;/strong&gt; — pull requests silently reverted by the April 23 merge queue bug across 658 repositories, with no notification&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;The April 23 Silent Revert: Why This One Was Different&lt;/strong&gt;

&lt;p&gt;The April 23 merge queue bug was caused by an incomplete feature flag that allowed a new code path to activate without full safeguards. Commits that had been merged were silently reverted across 658 repositories and 2,092 pull requests. The terrifying part was not the scope — it was the silence. The UI continued to show green checkmarks and merge confirmations while the system was actively undoing work underneath. A platform's most sacred contract with its users is that when it shows a green checkmark, the operation succeeded. GitHub broke that contract. A complete feature flag would have allowed engineers to disable the affected code path instantly. For any code path that touches data that developers trust as immutable, flag protection is not a best practice — it is the minimum viable safety mechanism.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Why the Ruby-to-Go rewrite is the right call&lt;/strong&gt;&lt;br&gt;
Ruby served GitHub extraordinarily well for 18 years. But the Global Interpreter Lock (GIL) in Ruby's reference implementation, formally called the Global VM Lock (GVL), is a fundamental constraint: even on a 64-core server, a single Ruby process executes only one thread of Ruby code at a time, so CPU-bound work scales by adding processes rather than threads. For human-paced web traffic, this limitation is manageable. &lt;strong&gt;For AI agent workflows that generate thousands of concurrent operations, the GIL is a hard per-process ceiling.&lt;/strong&gt; Go's goroutine model (lightweight threads managed by the Go runtime that can run across all available CPU cores without a GIL) is architecturally suited to exactly the concurrency profile that AI agents create. The rewrite is not about language preference. It is about physics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Mitchell Hashimoto moment&lt;/strong&gt;&lt;br&gt;
On April 28, 2026, Mitchell Hashimoto (GitHub user number 1299, co-founder of HashiCorp, creator of Vagrant, Packer, Consul, Terraform, and Vault) posted that Ghostty was leaving GitHub. He had visited GitHub almost every day for over 18 years. His post described the decision as 'irrationally sad' but said the platform was no longer a place where he could 'get work done' and 'ship software.' He made a point that resonated across the developer community: the problem was not Git itself, since the distributed version control system remained excellent. The problem was the &lt;strong&gt;surrounding infrastructure&lt;/strong&gt;: issues, pull requests, GitHub Actions. When someone who had no reason to question the platform for 18 years starts questioning it, something has fundamentally changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The structural billing gap&lt;/strong&gt;&lt;br&gt;
GitHub's business model was designed for humans, and its pricing reflects human-scale consumption. A developer on a free GitHub account generates some commits, a few CI runs, and a handful of API calls per day. An AI agent on the same account can generate hundreds of commits, dozens of PRs, thousands of Actions minutes, and tens of thousands of API calls in a single afternoon. The infrastructure cost per 'user' has fundamentally changed, but the pricing model has not yet caught up. GitHub's Octoverse 2025 report celebrated nearly 1 billion commits and 36 million new developers. But the 2026 numbers aren't being driven by 36 million new developers; they're being driven by agents treating GitHub's API like a utility.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;GitHub's architecture evolved over 18 years around a core assumption: the unit of load is a human developer. A human opens a PR, waits for review, pushes a few commits, and merges. The platform's service graph — Git storage, mergeability computation, branch protection evaluation, Actions job dispatch, search indexer, notification fan-out, webhook delivery, permission evaluation, API gateway — was sized and coupled around this human-paced access pattern.&lt;/p&gt;

&lt;p&gt;AI agents broke the architecture's fundamental assumption. An agent doesn't open a PR and wait. An agent opens 50 PRs in parallel, each triggering the full service chain simultaneously. When the number of concurrent PRs scales 4x in six months, the pressure on every one of those systems scales accordingly — and the interconnected failures begin.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Single GitHub PR: The 10+ Subsystems It Touches
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/github-ai-agents-outage-2026/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  GitHub Actions: Weekly Compute Minutes — The AI Agent Surge
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/github-ai-agents-outage-2026/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;



&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;83 Incidents From Capacity Failures Alone&lt;/strong&gt;

&lt;p&gt;83 of GitHub's 257 incidents between May 2025 and April 2026 were caused by &lt;strong&gt;load and capacity problems&lt;/strong&gt; — with indications that many services did not have automatic scaling configured, requiring manual intervention to add capacity during surges. This means that dozens of times, engineers had to notice the problem, escalate it, and manually provision resources before the platform could recover. Automated capacity scaling for burst load is not optional infrastructure. For a platform being consumed by AI agents, it is the minimum viable reliability architecture.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;
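&lt;p&gt;A minimal sketch of what automated scaling for burst load could look like, assuming a proportional rule of the kind used by common autoscalers. The function name, the percentage-based utilisation metric, and the default bounds are illustrative assumptions, not GitHub's implementation.&lt;/p&gt;

```python
import math


def desired_replicas(current, observed_util_pct, target_util_pct,
                     floor=1, ceiling=1000):
    """Proportional scaling rule, clamped to [floor, ceiling].

    desired = ceil(current * observed utilisation / target utilisation)
    """
    if target_util_pct == 0:
        raise ValueError("target utilisation must be non-zero")
    raw = math.ceil(current * observed_util_pct / target_util_pct)
    return max(floor, min(ceiling, raw))
```

&lt;p&gt;With 10 replicas running at 90% utilisation against a 60% target, the rule asks for 15 replicas and needs no human in the loop. The &lt;code&gt;ceiling&lt;/code&gt; bound keeps a runaway burst from scaling without limit.&lt;/p&gt;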





&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Your platform's capacity model must be built around its actual consumers — not its original consumers.&lt;/strong&gt; GitHub was built for human developers. AI agents consume infrastructure at orders of magnitude greater intensity. Any platform that introduces AI-native workflows must remodel its capacity assumptions from scratch, not incrementally adjust from the human baseline.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Feature flags&lt;/em&gt; (a software engineering practice where new code is deployed but kept inactive until explicitly enabled, allowing teams to test in production, roll out gradually, and instantly disable a feature without redeployment) are not optional for infrastructure that handles data integrity.&lt;/strong&gt; The April 23 merge queue bug — which silently reverted 2,092 pull requests — was caused by an incomplete feature flag. A complete feature flag would have allowed engineers to disable the affected code path instantly. For any code path that touches data developers trust as immutable, flag protection is the minimum viable safety mechanism.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A monolith that can't be incrementally scaled will become a single point of failure at sufficient scale.&lt;/strong&gt; GitHub's Ruby monolith served the platform for 18 years because human-paced traffic was bounded enough that the GIL's concurrency limit never became the primary bottleneck. AI agents removed that bound. The architectural lesson is not that monoliths are bad — it's that every architectural decision encodes assumptions about scale, and those assumptions must be revisited when the scale changes fundamentally.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Service isolation is not premature optimisation — it is the prerequisite for containing blast radius at scale.&lt;/strong&gt; When critical services are deeply coupled — when a PR touches Git storage, Actions, search, notifications, permissions, and webhooks in a single chain — a failure in any one component becomes a failure across all components. GitHub's commitment to isolating Git and Actions into independent failure domains is the architectural move that will have the most long-term impact on reliability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Trust is the asset that reliability engineering protects.&lt;/strong&gt; Mitchell Hashimoto didn't leave GitHub because of any single outage. He left because 257 incidents over 12 months had eroded confidence in the platform as a reliable foundation for serious work. Reliability is measured not in individual incident severities but in the cumulative effect of failures on whether people trust that the platform did what it says it did.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
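&lt;p&gt;The kill-switch idea from lesson 2 can be sketched in a few lines. This is a minimal illustration, not GitHub's implementation: &lt;code&gt;FlagStore&lt;/code&gt;, the flag name, and &lt;code&gt;merge_pull_request&lt;/code&gt; are all hypothetical, and the source does not say in what way the real flag was incomplete.&lt;/p&gt;

```python
class FlagStore:
    """In-memory stand-in for a runtime flag service (hypothetical)."""

    def __init__(self):
        self._flags = {}

    def set(self, name, enabled):
        self._flags[name] = enabled

    def is_enabled(self, name, default=False):
        # Unknown flags fall back to the safe default: the old code path.
        return self._flags.get(name, default)


def merge_pull_request(pr_id, flags):
    # The new behaviour is reachable only through this check, so turning
    # the flag off disables it everywhere without a redeploy.
    if flags.is_enabled("new_merge_queue_path"):
        return f"PR {pr_id}: merged via new queue path"
    return f"PR {pr_id}: merged via legacy path"


flags = FlagStore()
flags.set("new_merge_queue_path", True)
print(merge_pull_request(101, flags))
flags.set("new_merge_queue_path", False)  # kill switch
print(merge_pull_request(102, flags))
```

&lt;p&gt;A flag only works as a kill switch if every entry point into the new code path goes through the same check; a path that bypasses it stays live when the flag is off.&lt;/p&gt;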




&lt;h2&gt;
  
  
  Engineering Glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Global Interpreter Lock (GIL)&lt;/strong&gt;: a mutex in the reference CPython and Ruby (MRI) interpreters, called the Global VM Lock (GVL) in Ruby, that prevents multiple threads from executing interpreter code simultaneously in the same process. Even on a multi-core server, a single Ruby process can run Ruby code on only one CPU core at a time, although threads blocked on I/O release the lock. This is the fundamental scaling constraint behind the Ruby-to-Go rewrite described for GitHub's AI agent traffic levels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI agent PR&lt;/strong&gt; — a pull request created by an autonomous AI coding agent (such as Copilot Workspace or similar agentic tools) rather than a human developer. AI-agent PRs jumped from 4 million in September 2025 to 17 million in March 2026 on GitHub — the primary driver of the platform's capacity crisis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agentic development workflow&lt;/strong&gt; — a software development pattern where AI agents autonomously perform multi-step tasks: creating branches, writing code, running tests, opening PRs, and iterating based on feedback. Unlike human-paced development, agentic workflows can generate hundreds of concurrent operations from a single user session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature flag (kill switch)&lt;/strong&gt;: a configuration switch that enables or disables a code path without requiring redeployment. This is the safeguard that was incomplete in GitHub's April 23 merge queue incident. A complete feature flag would have allowed the problematic code path to be disabled instantly rather than requiring a full redeployment cycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Service isolation&lt;/strong&gt; — an architectural design where services are deployed as independent failure domains rather than a tightly coupled chain. The goal: a failure in one service (e.g. GitHub Actions) does not cascade to unrelated services (e.g. Git storage). GitHub's post-crisis architectural commitment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Merge queue regression&lt;/strong&gt; — a class of GitHub-specific incident where the merge queue processing pipeline fails, causing incorrect behaviour (such as the April 23 silent revert) or blocking PRs from merging. Merge queue regressions are particularly damaging because they violate the fundamental contract of version control: that merge operations are irreversible and accurately reported.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/github-ai-agents-outage-2026/" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Interactive diagrams, source links, and the full reader experience)&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;TechLogStack — built at scale, broken in public, rebuilt by engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>reliability</category>
      <category>devops</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Spotify Changed a Filter Order in Their Proxy — Then Every Server in the World Crashed at Once</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Sun, 24 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/spotify-changed-a-filter-order-in-their-proxy-then-every-server-in-the-world-crashed-at-once-37bb</link>
      <guid>https://dev.to/techlogstack/spotify-changed-a-filter-order-in-their-proxy-then-every-server-in-the-world-crashed-at-once-37bb</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3h 27m&lt;/strong&gt; outage — 12:18 UTC to 15:45 UTC, April 16 2025&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;675M&lt;/strong&gt; monthly active users affected globally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;48,000+&lt;/strong&gt; peak Downdetector reports&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0&lt;/strong&gt; regions with staged rollout — applied globally simultaneously&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Root cause:&lt;/strong&gt; Envoy max heap configured higher than K8s memory limit&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; capacity increase reduced per-instance memory below the kill threshold&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;On April 16, 2025, Spotify's engineering team made a change they deemed low risk: reordering the custom filters in their perimeter, which runs on &lt;em&gt;Envoy Proxy&lt;/em&gt; (an open-source edge proxy that receives all incoming user traffic before distributing it to backend services). They applied it to all regions simultaneously. Every Envoy instance crashed at once, and alarms fired two minutes later. Then the restart loop began, powered by Kubernetes itself, killing each new server as fast as it came back up. Only Asia Pacific kept serving traffic, and the reason why told the engineers exactly what was broken.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;This crash happened simultaneously on all Envoy instances.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;— Spotify Engineering, Incident Report: Spotify Outage on April 16, 2025&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There is a specific kind of engineering failure that hurts more than the others: the change that was reviewed, discussed, and approved — the change the team looked at together and agreed was fine. Spotify's perimeter is the first layer of software that receives traffic from every user worldwide — every stream request, every search, every login. To extend Envoy's capabilities, Spotify develops its own &lt;strong&gt;custom filters&lt;/strong&gt; — plugins that handle rate limiting, authentication, and other cross-cutting concerns. These filters execute in a defined order. The April 16 change altered that order. The new sequence triggered a &lt;strong&gt;latent bug in one of the custom filters&lt;/strong&gt;: a code path that had existed harmlessly, triggered only when the filter received control at that specific position. Envoy crashed. Not one instance, not one region. All of them.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;The Death Loop: Why the Restart Made Things Worse&lt;/strong&gt;

&lt;p&gt;An Envoy crash is normally survivable — Kubernetes detects the failed pod and starts a replacement. But client-side retry logic (every user's app retrying its failed request) created an unprecedented traffic spike onto each new instance. Each new Envoy started, received the full flood of retry traffic, consumed more memory than the &lt;em&gt;Kubernetes memory limit&lt;/em&gt; (the maximum memory a pod is allowed to use — when exceeded, K8s automatically terminates it), and was killed. A new instance started. The same thing happened. The loop repeated — powered by Kubernetes itself — for hours.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  12:18 UTC — Filter Reorder Applied Globally, All Envoy Instances Crash
&lt;/h4&gt;

&lt;p&gt;The change to Envoy filter execution order was applied simultaneously to all cloud regions worldwide. The new order activated a latent bug in a custom Spotify filter. Every Envoy instance on Spotify's networking perimeter crashed at the same moment. Alarms fired two minutes later as the traffic drop became measurable.&lt;/p&gt;
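&lt;p&gt;To make the order-dependence concrete, here is a small Python analogy. Real Envoy filters are not written this way, and the three filters below are invented for illustration. One filter quietly assumes another has already run, and the check exercises every ordering rather than only the original one.&lt;/p&gt;

```python
from itertools import permutations


def auth_filter(request):
    request["user_id"] = request["headers"].get("x-user", "anonymous")
    return request


def rate_limit_filter(request):
    # Latent bug: assumes auth_filter already ran and set "user_id".
    request["bucket"] = "rl:" + request["user_id"]
    return request


def logging_filter(request):
    request.setdefault("log", []).append("seen")
    return request


def run_chain(chain, request):
    for f in chain:
        request = f(request)
    return request


def failing_orders(filters):
    """Return the filter names of every ordering that raises on a sample request."""
    failures = []
    for order in permutations(filters):
        try:
            # A fresh request per ordering so runs do not affect each other.
            run_chain(order, {"headers": {"x-user": "u1"}})
        except KeyError:
            failures.append([f.__name__ for f in order])
    return failures


for order in failing_orders([auth_filter, rate_limit_filter, logging_filter]):
    print("crashes:", " -> ".join(order))
```

&lt;p&gt;A suite that only ever runs the original order passes indefinitely; the defect surfaces only when the ordering itself is treated as a test input.&lt;/p&gt;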




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  The Hidden Misconfiguration: Heap Larger Than the K8s Memory Limit
&lt;/h4&gt;

&lt;p&gt;The traffic flood from client retries exposed a misconfiguration that had existed undetected: Envoy's max heap size was configured higher than the Kubernetes memory limit for the pod. Under normal traffic, Envoy never approached its heap limit and the misconfiguration was invisible. Under the retry flood, each new instance immediately exceeded the K8s limit and was killed. This turned a recoverable crash into an infinite restart loop.&lt;/p&gt;




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Asia Pacific Stayed Up — and Explained Everything
&lt;/h4&gt;

&lt;p&gt;Asia Pacific was the only region unaffected. Engineers investigated why. The answer: lower traffic volume at that time of day (timezone difference) meant APAC Envoy instances never received enough retry traffic to exceed the K8s memory limit. The asymmetry proved the hypothesis: the death loop was memory-limit driven, not bug-driven. Fix the memory headroom, break the loop.&lt;/p&gt;




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  15:45 UTC — Death Loop Broken, Full Recovery
&lt;/h4&gt;

&lt;p&gt;Increasing total perimeter server capacity gave each new Envoy instance enough headroom to stay under the K8s memory limit even while absorbing the retry traffic flood. The death loop broke. EU recovered at 14:20 UTC, US at 15:10 UTC, full normalisation at 15:40 UTC. Total duration: 3 hours 27 minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Misconfiguration Nobody Noticed — Until the Crash
&lt;/h3&gt;

&lt;p&gt;The root problem was that &lt;strong&gt;Envoy's max heap size was set higher than the Kubernetes memory limit for the pod&lt;/strong&gt;. In normal operation, Envoy memory usage never approached its heap maximum — the misconfiguration was invisible. The retry flood was the first event extreme enough to push instances over the K8s limit and trigger the kill cycle.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3h 27m&lt;/strong&gt; — Total outage duration, 12:18 to 15:45 UTC&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;675M&lt;/strong&gt; — Users affected; 263M paying Premium subscribers — no perimeter differentiation by tier&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;48,000+&lt;/strong&gt; — Peak Downdetector reports (active reporters only; actual affected users in the hundreds of millions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0&lt;/strong&gt; — Regions with staged rollout before full deployment
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# THE MISCONFIGURATION: Envoy heap limit higher than K8s memory limit&lt;/span&gt;

&lt;span class="c1"&gt;# Kubernetes pod resource specification (simplified)&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;envoy&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2Gi"&lt;/span&gt;
      &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3Gi"&lt;/span&gt;  &lt;span class="c1"&gt;# K8s will OOMKill the pod above this&lt;/span&gt;

&lt;span class="c1"&gt;# Envoy overload manager configuration (simplified)&lt;/span&gt;
&lt;span class="na"&gt;overload_manager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;resource_monitors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;envoy.resource_monitors.fixed_heap&lt;/span&gt;
    &lt;span class="na"&gt;typed_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;max_heap_size_bytes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4294967296&lt;/span&gt;  &lt;span class="c1"&gt;# 4GB — HIGHER than K8s 3GB limit!&lt;/span&gt;

&lt;span class="c1"&gt;# Why this is catastrophic:&lt;/span&gt;
&lt;span class="c1"&gt;# - K8s kills at 3GB memory usage&lt;/span&gt;
&lt;span class="c1"&gt;# - Envoy's own safety valve triggers at 95% of 4GB = 3.8GB&lt;/span&gt;
&lt;span class="c1"&gt;# - K8s limit is hit BEFORE Envoy's graceful degradation kicks in&lt;/span&gt;
&lt;span class="c1"&gt;# - Under normal load: Envoy peaks at ~1.5GB — misconfiguration invisible&lt;/span&gt;
&lt;span class="c1"&gt;# - Under retry flood: Envoy climbs past 3GB → OOMKill → restart → repeat&lt;/span&gt;

&lt;span class="c1"&gt;# IMMEDIATE FIX: Increase perimeter server count&lt;/span&gt;
&lt;span class="c1"&gt;# More servers = retry traffic spread across more instances&lt;/span&gt;
&lt;span class="c1"&gt;# = each instance stays under 3GB = K8s doesn't kill = loop breaks&lt;/span&gt;

&lt;span class="c1"&gt;# PERMANENT FIX: Align heap config with K8s memory limit&lt;/span&gt;
&lt;span class="c1"&gt;# max_heap_size_bytes: 2684354560  # 2.5GB — safely below K8s 3GB limit&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
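&lt;p&gt;A mismatch like the one in the simplified config above is mechanical enough to check automatically. The sketch below is one hedged way to do it: it parses a small subset of Kubernetes memory quantities and verifies that the heap ceiling sits below the pod limit with some headroom. The helper names and the 10% headroom are assumptions, not Spotify's tooling; headroom matters because heap is only part of a process's total memory.&lt;/p&gt;

```python
# Subset of Kubernetes quantity suffixes; binary first, then decimal.
UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3,
         "M": 1000 ** 2, "G": 1000 ** 3}


def parse_quantity(text):
    """Parse a memory quantity such as "3Gi" or "512Mi" into bytes."""
    for suffix in ("Ki", "Mi", "Gi", "M", "G"):
        if text.endswith(suffix):
            return int(float(text[:-len(suffix)]) * UNITS[suffix])
    return int(text)  # plain byte count


def heap_fits(max_heap_bytes, pod_limit, headroom=0.10):
    """True when the heap ceiling leaves `headroom` of the pod limit free."""
    budget = parse_quantity(pod_limit) * (1 - headroom)
    return budget >= max_heap_bytes


print(heap_fits(4294967296, "3Gi"))  # the misconfigured pair from the example
print(heap_fits(2684354560, "3Gi"))  # the aligned value from the example
```

&lt;p&gt;Run as a CI or admission check, a test like this fails at deploy time instead of during a traffic spike.&lt;/p&gt;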



&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Why Increasing Capacity Fixed the Loop&lt;/strong&gt;

&lt;p&gt;The K8s memory limit was fixed. The retry traffic load was fixed (determined by user behaviour). The only variable Spotify could change quickly was the number of Envoy instances sharing that retry load. More instances → each instance receives a smaller share of the flood → memory stays below the K8s limit → K8s doesn't kill it → stable. The underlying misconfiguration (heap &amp;gt; K8s limit) was fixed separately afterward as permanent remediation.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;
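&lt;p&gt;The reasoning above is simple arithmetic, sketched here under a deliberately crude model: each instance's memory is a fixed baseline plus an amount proportional to the traffic it receives. Every number is hypothetical; Spotify has not published per-instance figures.&lt;/p&gt;

```python
import math


def memory_per_instance(total_rps, instances, baseline_gib, gib_per_krps):
    """Modelled memory (GiB) for one instance when load is spread evenly."""
    return baseline_gib + gib_per_krps * (total_rps / instances / 1000)


def min_instances(total_rps, baseline_gib, gib_per_krps, limit_gib):
    """Smallest instance count that keeps modelled memory at or under the limit."""
    usable = limit_gib - baseline_gib
    if not usable > 0:
        raise ValueError("baseline alone exceeds the limit")
    max_krps_per_instance = usable / gib_per_krps
    return math.ceil((total_rps / 1000) / max_krps_per_instance)


# Hypothetical retry flood of 3M requests/s against a 3 GiB pod limit.
flood = 3_000_000
print(memory_per_instance(flood, 500, baseline_gib=1.0, gib_per_krps=0.5))
print(min_instances(flood, baseline_gib=1.0, gib_per_krps=0.5, limit_gib=3.0))
```

&lt;p&gt;With the limit and the load both fixed, the instance count is the only term left to change, which is why adding capacity broke the loop.&lt;/p&gt;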



&lt;p&gt;&lt;strong&gt;Spotify's four post-incident commitments:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fix the filter bug&lt;/strong&gt; that caused the initial crash on filter reorder&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fix the heap/K8s limit mismatch&lt;/strong&gt; — align Envoy config with pod resource limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Staged perimeter rollouts&lt;/strong&gt; — regional validation before global deployment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved monitoring&lt;/strong&gt; — detect configuration issues earlier in the failure chain&lt;/li&gt;
&lt;/ul&gt;
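&lt;p&gt;The staged-rollout commitment can be sketched as a loop with a health gate between regions. This is a minimal illustration with hypothetical callables; a real pipeline would also wait for a bake period before checking health and would roll back the halted region.&lt;/p&gt;

```python
def staged_rollout(regions, apply_change, is_healthy):
    """Apply a change one region at a time; stop at the first unhealthy region.

    Returns (regions_applied, halted_region_or_None).
    """
    applied = []
    for region in regions:
        apply_change(region)
        applied.append(region)
        if not is_healthy(region):
            return applied, region
    return applied, None


# Hypothetical stand-ins: the change breaks every region it touches.
touched = []
applied, halted = staged_rollout(
    ["apac", "eu", "us"],
    apply_change=touched.append,
    is_healthy=lambda region: False,
)
print(applied, halted)
```

&lt;p&gt;With a gate like this, a change that crashes every instance it reaches stops after the first region instead of reaching all of them.&lt;/p&gt;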

&lt;p&gt;&lt;strong&gt;Incident timeline:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time (UTC)&lt;/th&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;12:18&lt;/td&gt;
&lt;td&gt;Filter reorder applied; all Envoy instances crash&lt;/td&gt;
&lt;td&gt;🔴 Global failure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:20&lt;/td&gt;
&lt;td&gt;Alarms fire on traffic drop; death loop running&lt;/td&gt;
&lt;td&gt;🔴 Engineers paged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12:28&lt;/td&gt;
&lt;td&gt;Escalated; only APAC serving traffic&lt;/td&gt;
&lt;td&gt;🔴 Incident declared&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;~13:xx&lt;/td&gt;
&lt;td&gt;Root cause identified via APAC asymmetry&lt;/td&gt;
&lt;td&gt;🟡 Diagnosis complete&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:20&lt;/td&gt;
&lt;td&gt;EU fully recovered&lt;/td&gt;
&lt;td&gt;🟡 Partial recovery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:10&lt;/td&gt;
&lt;td&gt;US fully recovered&lt;/td&gt;
&lt;td&gt;🟡 Partial recovery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:40&lt;/td&gt;
&lt;td&gt;All regions normalised&lt;/td&gt;
&lt;td&gt;🟢 Full recovery&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Spotify's networking perimeter places Envoy Proxy as the outermost layer — the first software that receives every user request, regardless of what backend it is destined for. When every Envoy instance crashes simultaneously, no user request can reach any backend service. The entire platform goes dark regardless of whether individual backend services remain healthy. This is the &lt;em&gt;shared fate&lt;/em&gt; property of perimeter architecture: a perimeter failure has a blast radius of every service, every user, every region simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Spotify's Perimeter Architecture: Envoy as the Universal Traffic Gateway
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/spotify-envoy-proxy-outage-2025/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  The Three-Layer Failure Cascade: From Filter Bug to Death Loop
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/spotify-envoy-proxy-outage-2025/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;



&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;The APAC Diagnostic: How One Region Proved the Root Cause&lt;/strong&gt;

&lt;p&gt;When engineers observed APAC was unaffected, they had two candidate hypotheses: (A) the filter bug is region-specific, or (B) the death loop is traffic-intensity dependent. Investigation confirmed (B): APAC runs identical filter configuration — lower traffic meant less retry amplification, meaning per-instance memory pressure never reached the K8s limit. This asymmetry transformed a hard debugging problem ("why is the loop happening?") into a tractable one ("what's different about APAC?") and pointed directly at the memory-limit misconfiguration.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Configuration drift: why this existed undetected for months&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Envoy heap/K8s limit misconfiguration almost certainly existed long before April 16. It was never caught because Envoy memory usage never reached the dangerous threshold under normal traffic. This is a common pattern: configuration mismatches that are only dangerous under abnormal load go undetected indefinitely in systems where abnormal load doesn't occur. The misconfiguration didn't cause the outage; the filter bug did. But it was what turned a recoverable crash into a multi-hour global outage. Auditing resource limit configurations against actual peak usage, including synthetic stress tests, is the practice that catches these before they detonate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;'Low risk' is not a substitute for staged rollout at the perimeter.&lt;/strong&gt; A change's risk profile determines what validation it needs — it doesn't override the need for validation. The filter reorder was simple; the blast radius of failure was total. Stage perimeter changes by region and monitor before expanding.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Latent bugs&lt;/em&gt; (code defects harmless until a specific triggering condition occurs) that depend on execution context cannot be caught by tests that don't vary that context.&lt;/strong&gt; A filter test suite that exercises filters in their original order will never discover a bug that only manifests in a different order. When making ordering or sequencing changes, test explicitly in the new order.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Audit resource limit configurations against actual and stress-test peak usage regularly.&lt;/strong&gt; Mismatches between Envoy heap size and Kubernetes memory limits are invisible until a load event forces memory beyond the limit. A misconfiguration harmless for months can become catastrophic under the right load spike.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Client-side retry logic&lt;/em&gt; turns total simultaneous failures into traffic amplification events.&lt;/strong&gt; Design retry logic with awareness of this: exponential backoff with jitter spreads retries over time; circuit breakers prevent retries when failure rate exceeds a threshold; retry budgets limit total retry volume per client.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;When one region survives an outage that hits all others, that region is your fastest path to root cause.&lt;/strong&gt; APAC's survival was a controlled experiment running in production. Its configuration was identical; its traffic was lower. The asymmetry proved the diagnosis. Systematically compare surviving regions against failed ones — it shortens MTTR.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
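&lt;p&gt;Two of the retry techniques named in lesson 4 fit in a short sketch: exponential backoff with full jitter, and a retry budget that caps retries at a fraction of requests sent. The names, the defaults, and the 10% ratio are illustrative assumptions, not Spotify's client code.&lt;/p&gt;

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter backoff: delay i is drawn uniformly from [0, min(cap, base * 2**i))."""
    return [rng() * min(cap, base * 2 ** i) for i in range(attempts)]


class RetryBudget:
    """Allow retries only while they stay within `ratio` of requests sent."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        # Grant a retry only if it keeps total retries within the budget.
        if self.requests * self.ratio >= self.retries + 1:
            self.retries += 1
            return True
        return False


budget = RetryBudget(ratio=0.1)
for _ in range(20):
    budget.record_request()
print([budget.can_retry() for _ in range(4)])
print(backoff_delays(5))
```

&lt;p&gt;Jitter spreads clients out so they do not all retry at the same instant, while the budget bounds the total amplification no matter how many requests fail.&lt;/p&gt;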




&lt;h2&gt;
  
  
  Engineering Glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Client-side retry logic&lt;/strong&gt; — application behaviour where the client automatically retries failed requests after a brief delay. Designed to handle transient failures, but capable of amplifying load during sustained simultaneous failures by converting each failed request into one or more retry requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Death loop&lt;/strong&gt; — an informal term for an infinite restart cycle where a pod crashes, Kubernetes restarts it, and the replacement crashes for the same reason. Powered by K8s restart behaviour combined with a condition (here: retry flood + heap misconfiguration) that guarantees each replacement fails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Envoy Proxy&lt;/strong&gt; — an open-source, high-performance edge proxy originally built at Lyft, widely used as the networking perimeter layer in distributed systems. Receives all incoming user traffic before distributing it to backend services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Filter chain&lt;/strong&gt; — the ordered sequence of processing modules (filters) that each request passes through in an Envoy proxy instance. Each filter can inspect, modify, or reject the request before passing it to the next filter. Order is semantically meaningful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latent bug&lt;/strong&gt; — a code defect that exists in production but is harmless until a specific triggering condition occurs. Undetectable by standard testing if the triggering condition is rare or contextual.&lt;/p&gt;

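&lt;p&gt;&lt;strong&gt;OOMKill&lt;/strong&gt;: Out-Of-Memory Kill. The termination of a container by the Linux kernel's OOM killer when the container exceeds the memory limit configured for it in Kubernetes. Kubernetes reports the reason as &lt;code&gt;OOMKilled&lt;/code&gt; and restarts the container according to the pod's restart policy. The limit exists to protect other workloads on the node from memory starvation.&lt;/p&gt;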

&lt;p&gt;&lt;strong&gt;Shared fate system&lt;/strong&gt; — an architecture where all dependent services rise and fall with a shared component. Spotify's Envoy perimeter is a shared fate system: if it fails, every backend service becomes unreachable regardless of whether those services are healthy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Staged rollout&lt;/strong&gt; — deploying a change to a subset of infrastructure (one region, one cluster) and validating behaviour before expanding to the full fleet. The safety mechanism absent from the April 16 deployment.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/spotify-envoy-proxy-outage-2025/" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Interactive diagrams, source links, and the full reader experience)&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;TechLogStack — built at scale, broken in public, rebuilt by engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>reliability</category>
      <category>webdev</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Airbnb's Fraud Detection Runs on a Graph of 7 Billion Nodes — Here's Why They Rebuilt It From Scratch</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Sun, 24 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/airbnbs-fraud-detection-runs-on-a-graph-of-7-billion-nodes-heres-why-they-rebuilt-it-from-4j38</link>
      <guid>https://dev.to/techlogstack/airbnbs-fraud-detection-runs-on-a-graph-of-7-billion-nodes-heres-why-they-rebuilt-it-from-4j38</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;7B&lt;/strong&gt; nodes, &lt;strong&gt;11B&lt;/strong&gt; edges in Airbnb's identity graph&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5M&lt;/strong&gt; new edges ingested per day&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;P99 read latency: 5.0s → 2.5s&lt;/strong&gt; (49% reduction)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;P95 write latency: 353ms → 156ms&lt;/strong&gt; (56% reduction)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10×&lt;/strong&gt; write QPS ceiling vs previous vendor maximum&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero&lt;/strong&gt; manual reboots required post-migration&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Airbnb's identity graph connects every user, every device, every listing, and every relationship that might reveal a fraudster trying to create a duplicate account or collude on a fake transaction. In 2024, this graph held 7 billion nodes and 11 billion edges — growing by 5 million new edges every day. The third-party vendor powering it required periodic manual reboots to stay stable, and 8-hop graph traversal queries were hitting 5-second P99 latencies. A small team rebuilt the entire thing internally. The results were not incremental.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;p&gt;The stakes of Airbnb's identity graph are not abstract. When a fraudster creates a second account after being banned, tries to rent a listing to damage it, or coordinates with other accounts to inflate reviews, the first system that needs to detect the connection is the identity graph. It holds the relationships between &lt;strong&gt;every user, every device, every verified identity, every behavioral signal&lt;/strong&gt; that Airbnb's Trust and Safety team uses to determine whether a new account is truly new or a known bad actor resurfacing.&lt;/p&gt;

&lt;p&gt;The identity graph's architecture progressed through three distinct generations, each solving the previous generation's limit while introducing new constraints. The first generation used a relational database for user and entity data paired with a key-value store holding JSON-encoded edge lists. This worked at low graph density. As individual users accumulated hundreds or thousands of edges, the JSON edge lists became expensive to read and update — &lt;em&gt;relational databases&lt;/em&gt; (database systems built around tables, rows, and SQL joins — optimal for normalised structured data but increasingly expensive as relationship traversal depth grows, because each hop requires an additional join) are not optimised for multi-hop traversal at graph scale.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;The Four Anti-Patterns That Plagued Airbnb's Graph Teams&lt;/strong&gt;

&lt;p&gt;Before the centralised graph infrastructure, teams building graph-based products fell into four documented patterns: &lt;strong&gt;Relational graphs&lt;/strong&gt; — modelling nodes and edges in SQL tables, producing expensive joins during traversal. &lt;strong&gt;Offline graphs&lt;/strong&gt; — building in the data warehouse, limiting freshness to daily batch snapshots. &lt;strong&gt;DIY open source&lt;/strong&gt; — self-managing community graph databases, creating high operational toil. &lt;strong&gt;Managed PaaS&lt;/strong&gt; — third-party vendors with vendor lock-in, limited tuning access, and performance bottlenecks the team couldn't debug.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Generation 1 → 2: Relational DB + KV Store Couldn't Scale Graph Density
&lt;/h4&gt;

&lt;p&gt;The first-generation architecture used a relational database for entity data and a KV store holding JSON-encoded edge lists. As graph density grew — individual users accumulating hundreds of edges — querying became expensive. JSON deserialisation and cross-table joins are not optimised for the multi-hop traversal patterns that fraud detection requires.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Generation 2 → 3: SaaS Vendor — Better Scale, Worse Reliability
&lt;/h4&gt;

&lt;p&gt;The 2021 migration to a third-party SaaS graph database improved horizontal scalability but introduced new problems: P99 read latency reaching 5 seconds on 8-hop queries, operational instability requiring periodic manual reboots, no ability to tune performance for Airbnb's specific query patterns, and no fine-grained access controls. The vendor was a black box the team couldn't debug.&lt;/p&gt;




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Generation 3: JanusGraph + DynamoDB, Internally Managed
&lt;/h4&gt;

&lt;p&gt;In 2024, Airbnb built an internal graph infrastructure on &lt;em&gt;JanusGraph&lt;/em&gt; (open-source, Apache TinkerPop stack, Gremlin query language) with DynamoDB as the storage backend and OpenSearch for indexing. The pluggable storage architecture let Airbnb leverage DynamoDB's operational reliability without reinventing distributed storage — while maintaining full control over the graph logic layer. They forked JanusGraph internally to add custom optimisations.&lt;/p&gt;




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  49% P99 Latency Reduction, 10× Write QPS, Zero Manual Reboots
&lt;/h4&gt;

&lt;p&gt;P99 read end-to-end latency dropped from 5.0s to 2.5s (-49%). P95 from 2.1s to 1.0s (-51%). Write P95 from 353ms to 156ms (-56%). Write QPS during load testing reached 10× the previous vendor's maximum. Manual reboots eliminated entirely. Auto-scaling enabled for the first time.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Three JanusGraph Engine Optimisations That Closed the Latency Gap
&lt;/h3&gt;

&lt;p&gt;Deploying stock JanusGraph with DynamoDB would not have been sufficient. Airbnb's query patterns — particularly high-fanout traversals that caused the worst P99 spikes — required modifications to the JanusGraph engine itself. The team forked JanusGraph internally and made three targeted optimisations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;-49%&lt;/strong&gt; — P99 read latency: 5.0s → 2.5s, directly improving fraud detection response time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-56%&lt;/strong&gt; — P95 write latency: 353ms → 156ms, enabling faster ingestion of 5M daily new edges&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10×&lt;/strong&gt; — Write QPS ceiling during load testing vs the vendor maximum&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0&lt;/strong&gt; — Manual reboots required post-migration; the internal solution auto-scales&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The choice of &lt;em&gt;Gremlin&lt;/em&gt; as the query language was a deliberate migration enabler. Gremlin is a graph traversal language developed as part of the Apache TinkerPop framework, and a query reads like a path through the graph: &lt;code&gt;g.V(userId).out('booked').in('listed')&lt;/code&gt; means "find all users who listed properties that this user has booked". Both the outgoing vendor system and the incoming JanusGraph support Gremlin, which meant &lt;strong&gt;Airbnb could run the same queries against both systems simultaneously&lt;/strong&gt; during migration, giving direct performance benchmarking under real production load before any cutover.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Connected Accounts: How the Graph Detects Fraud&lt;/strong&gt;

&lt;p&gt;Airbnb's Trust Graph finds structural patterns that correlate with fraud. A fraudster re-entering after a ban often reuses the same phone number, payment method, or device. The Connected Accounts system traverses the graph to find these connections: &lt;em&gt;"this new account shares a device with a banned account, which shared a payment method with another banned account, which has reviewed listings that the new account also reviewed."&lt;/em&gt; That traversal pattern — spanning 4–8 hops — is exactly why graph depth performance matters.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;
&lt;br&gt;
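&lt;p&gt;The Connected Accounts traversal described above can be sketched as a hop-limited breadth-first search. This is a minimal illustration over an in-memory toy graph, not Airbnb's implementation: the node identifiers, the &lt;code&gt;BANNED&lt;/code&gt; set, and the &lt;code&gt;banned_within&lt;/code&gt; helper are all invented for the example.&lt;/p&gt;

```python
from collections import deque

# Toy undirected graph: accounts linked through shared attributes.
# Every identifier here is invented for illustration.
EDGES = [
    ("acct:new", "device:d1"),
    ("device:d1", "acct:banned1"),
    ("acct:banned1", "card:c9"),
    ("card:c9", "acct:banned2"),
    ("acct:other", "phone:p5"),
]
BANNED = {"acct:banned1", "acct:banned2"}


def build_adjacency(edges):
    """Return a dict mapping each node to the set of its neighbours."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def banned_within(adjacency, start, max_hops):
    """Breadth-first search from start. Returns {banned account: hop distance}
    for every banned account reachable in at most max_hops hops."""
    found = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node in BANNED and node != start:
            found[node] = hops
        if hops == max_hops:
            continue  # depth budget exhausted on this path
        for neighbour in sorted(adjacency.get(node, ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return found


adjacency = build_adjacency(EDGES)
print(banned_within(adjacency, "acct:new", max_hops=4))
```

&lt;p&gt;In this toy graph the second banned account only appears once the depth limit reaches four hops, which is the kind of connection a shallow traversal would miss.&lt;/p&gt;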


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Three JanusGraph engine optimisations that reduced long-tail latency
&lt;/span&gt;
&lt;span class="c1"&gt;# OPTIMISATION 1: DynamoDB conditional writes replace distributed locking
# Old: explicit distributed lock before write = round-trip overhead
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_edge_default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;src_vertex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dst_vertex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;edge_label&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;lock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;acquire_distributed_lock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src_vertex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;edge_label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# expensive
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src_vertex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dst_vertex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;edge_label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;release_lock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# New: DynamoDB evaluates condition atomically server-side — no lock round-trip
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_edge_optimized&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;src_vertex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dst_vertex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;edge_label&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge_with_condition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;src_vertex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dst_vertex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;edge_label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;condition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attribute_not_exists(edge_key)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# OPTIMISATION 2: Parallel getMultiSlices for high-fanout nodes
# Before: N sequential DynamoDB calls for a user with 1000+ edges
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_edges_serial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vertex_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_slices&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;slice_key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;compute_slice_keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vertex_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_slices&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dynamo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slice_key&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After: single BatchGetItem — critical for high-fanout nodes
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_edges_parallel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vertex_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_slices&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;slice_keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_slice_keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vertex_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_slices&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dynamo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batch_get_items&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slice_keys&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 1 call instead of N
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# OPTIMISATION 3: Distributed tracing in the internal fork
# OSS JanusGraph: no tracing — impossible to profile slow queries
# Internal fork: Airbnb trace context propagated through every graph op
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_gremlin_traversal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trace_context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;airbnb_tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;janusgraph.traversal&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;trace_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query.hops&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;count_hops&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query.fanout&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;estimated_fanout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;janusgraph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result.edges_traversed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;edge_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
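&lt;p&gt;As a complement to optimisation 2 above, here is a small runnable sketch of issuing slice reads concurrently instead of one after another. It assumes nothing about JanusGraph or DynamoDB internals: &lt;code&gt;fetch_slice&lt;/code&gt; is a hypothetical stand-in that sleeps to mimic a network round trip, and the slice keys are made up.&lt;/p&gt;

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_slice(slice_key):
    """Hypothetical stand-in for one storage read; sleeps to mimic network latency."""
    time.sleep(0.02)
    return [f"{slice_key}:edge{i}" for i in range(3)]


def get_edges_serial(slice_keys):
    """One read after another: total time grows linearly with the slice count."""
    edges = []
    for key in slice_keys:
        edges.extend(fetch_slice(key))
    return edges


def get_edges_concurrent(slice_keys, max_workers=10):
    """Issue the reads concurrently; pool.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_slice = list(pool.map(fetch_slice, slice_keys))
    return [edge for edges in per_slice for edge in edges]


slice_keys = [f"v42#slice{i}" for i in range(20)]

started = time.perf_counter()
serial_edges = get_edges_serial(slice_keys)
serial_s = time.perf_counter() - started

started = time.perf_counter()
concurrent_edges = get_edges_concurrent(slice_keys)
concurrent_s = time.perf_counter() - started

print(f"serial: {serial_s:.2f}s, concurrent: {concurrent_s:.2f}s")
```

&lt;p&gt;Both functions return the same edges in the same order; only the wall-clock time differs, and the gap widens as the number of slices per vertex grows.&lt;/p&gt;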


&lt;p&gt;&lt;strong&gt;The shadow traffic migration strategy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Migrating 7 billion nodes and 11 billion edges without downtime required running both the vendor system and the internal JanusGraph system in parallel, routing the same production queries to both and comparing results. Because both systems use Gremlin, the same queries ran unchanged on both simultaneously. This shadow traffic phase provided a performance benchmark under real load (not synthetic tests) and correctness validation before any cutover. Only after shadow traffic had validated both performance and correctness was production traffic cut over and the vendor deprecated.&lt;/p&gt;
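&lt;p&gt;A shadow traffic comparison can be sketched in a few lines. This is an assumption-laden illustration rather than Airbnb's tooling: the two backends are plain dictionaries, and &lt;code&gt;shadow_compare&lt;/code&gt; and &lt;code&gt;normalise&lt;/code&gt; are hypothetical helpers. A production version would typically run the candidate call asynchronously so it cannot add latency to the caller.&lt;/p&gt;

```python
import time

# Toy result sets keyed by query; stand-ins for the vendor and internal systems.
VENDOR = {"q1": ["a", "b"], "q2": ["a"], "q3": []}
INTERNAL = {"q1": ["b", "a"], "q2": ["a", "c"]}


def vendor_backend(query):
    return VENDOR[query]


def internal_backend(query):
    return INTERNAL[query]  # raises KeyError for queries it cannot answer


def normalise(result):
    """Graph results carry no guaranteed order, so compare them order-insensitively."""
    return sorted(result)


def shadow_compare(query, primary, candidate):
    """Run the query on both systems, serve the primary result, and report the diff."""
    t0 = time.perf_counter()
    primary_result = primary(query)
    t1 = time.perf_counter()
    try:
        candidate_result = candidate(query)
        error = None
    except Exception as exc:  # a candidate failure must never reach the caller
        candidate_result = None
        error = repr(exc)
    t2 = time.perf_counter()
    report = {
        "query": query,
        "match": error is None
        and normalise(primary_result) == normalise(candidate_result),
        "primary_ms": (t1 - t0) * 1000,
        "candidate_ms": (t2 - t1) * 1000,
        "error": error,
    }
    return primary_result, report


for q in ("q1", "q2", "q3"):
    served, report = shadow_compare(q, vendor_backend, internal_backend)
    print(q, "served:", served, "match:", report["match"], "error:", report["error"])
```

&lt;p&gt;The caller always receives the primary system's answer; mismatches and candidate errors only show up in the report, which is what makes it safe to run against production traffic.&lt;/p&gt;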


&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Airbnb's new graph infrastructure has three conceptual layers. The &lt;strong&gt;storage layer&lt;/strong&gt; is DynamoDB for graph data persistence and OpenSearch for secondary indexes — both managed AWS services that auto-scale. The &lt;strong&gt;graph engine layer&lt;/strong&gt; is Airbnb's internal JanusGraph fork — the Gremlin server that executes traversal queries, with custom optimisations for Airbnb's access patterns. The &lt;strong&gt;management layer&lt;/strong&gt; is the Graph Management Service — schema enforcement, index management, multi-tenant namespace isolation, and the Thrift API surface that client services call.&lt;/p&gt;
&lt;h3&gt;
  
  
  Before: Vendor Graph DB — Black Box, Manual Reboots, P99 at 5 Seconds
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/airbnb-identity-graph-janusgraph-2026/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Interactive diagram available on TechLogStack (link above).&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  After: Airbnb Internal Graph Infrastructure — JanusGraph + DynamoDB
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/airbnb-identity-graph-janusgraph-2026/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why High-Fanout Nodes Cause Long-Tail Latency
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/airbnb-identity-graph-janusgraph-2026/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Performance comparison across query types:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query Type&lt;/th&gt;
&lt;th&gt;Vendor P95&lt;/th&gt;
&lt;th&gt;Internal P95&lt;/th&gt;
&lt;th&gt;P95 Improvement&lt;/th&gt;
&lt;th&gt;Vendor P99&lt;/th&gt;
&lt;th&gt;Internal P99&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1-hop query&lt;/td&gt;
&lt;td&gt;~180ms&lt;/td&gt;
&lt;td&gt;~65ms&lt;/td&gt;
&lt;td&gt;-64%&lt;/td&gt;
&lt;td&gt;~420ms&lt;/td&gt;
&lt;td&gt;~150ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2-hop query&lt;/td&gt;
&lt;td&gt;~350ms&lt;/td&gt;
&lt;td&gt;~130ms&lt;/td&gt;
&lt;td&gt;-63%&lt;/td&gt;
&lt;td&gt;~900ms&lt;/td&gt;
&lt;td&gt;~280ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2-hop (high fanout)&lt;/td&gt;
&lt;td&gt;~620ms&lt;/td&gt;
&lt;td&gt;~200ms&lt;/td&gt;
&lt;td&gt;-68%&lt;/td&gt;
&lt;td&gt;~1,800ms&lt;/td&gt;
&lt;td&gt;~450ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4-hop query&lt;/td&gt;
&lt;td&gt;~900ms&lt;/td&gt;
&lt;td&gt;~380ms&lt;/td&gt;
&lt;td&gt;-58%&lt;/td&gt;
&lt;td&gt;~2,500ms&lt;/td&gt;
&lt;td&gt;~850ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8-hop query (max depth)&lt;/td&gt;
&lt;td&gt;~2,100ms&lt;/td&gt;
&lt;td&gt;~1,000ms&lt;/td&gt;
&lt;td&gt;-52%&lt;/td&gt;
&lt;td&gt;~5,000ms&lt;/td&gt;
&lt;td&gt;~2,500ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write (edge creation)&lt;/td&gt;
&lt;td&gt;~353ms&lt;/td&gt;
&lt;td&gt;~156ms&lt;/td&gt;
&lt;td&gt;-56%&lt;/td&gt;
&lt;td&gt;~800ms&lt;/td&gt;
&lt;td&gt;~360ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;JanusGraph's Pluggable Storage: The Architectural Decision That Made This Possible&lt;/strong&gt;

&lt;p&gt;Most graph databases tightly couple the query engine and the storage layer — they are one system. JanusGraph decouples them through a pluggable storage backend. Airbnb chose DynamoDB — infrastructure their team already operated at scale. This gave them full control over the graph logic layer while standing on a storage foundation that didn't need to be invented from scratch. The separation also lets them evolve the storage backend in the future without rewriting the graph layer.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;
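&lt;p&gt;The pluggable storage idea can be illustrated with a small interface sketch. This is not JanusGraph's actual storage API: &lt;code&gt;StorageBackend&lt;/code&gt;, the two toy backends, and &lt;code&gt;GraphEngine&lt;/code&gt; are hypothetical names, chosen only to show that the traversal code never needs to know which store sits underneath.&lt;/p&gt;

```python
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Hypothetical storage interface: all the engine knows about storage."""

    @abstractmethod
    def put_edge(self, src, label, dst):
        """Persist one directed edge."""

    @abstractmethod
    def get_edges(self, src, label):
        """Return destination vertices of edges leaving src with this label."""


class DictBackend(StorageBackend):
    """Keeps adjacency lists in a nested dict."""

    def __init__(self):
        self._adjacency = {}

    def put_edge(self, src, label, dst):
        self._adjacency.setdefault(src, {}).setdefault(label, []).append(dst)

    def get_edges(self, src, label):
        return list(self._adjacency.get(src, {}).get(label, []))


class FlatKeyBackend(StorageBackend):
    """Mimics a key-value store: one row per edge under a composite key."""

    def __init__(self):
        self._rows = {}

    def put_edge(self, src, label, dst):
        self._rows[(src, label, dst)] = True

    def get_edges(self, src, label):
        return [d for (s, l, d) in self._rows if s == src and l == label]


class GraphEngine:
    """Traversal logic written only against the StorageBackend interface."""

    def __init__(self, backend):
        self.backend = backend

    def out(self, vertices, label):
        """One hop: follow edges with this label from every input vertex."""
        reached = []
        for vertex in vertices:
            for dst in self.backend.get_edges(vertex, label):
                if dst not in reached:
                    reached.append(dst)
        return reached


def hosts_of_booked_listings(backend):
    """Load the same toy data into a backend and run a two-hop traversal."""
    backend.put_edge("user:1", "booked", "listing:a")
    backend.put_edge("user:1", "booked", "listing:b")
    backend.put_edge("listing:a", "listed_by", "host:x")
    backend.put_edge("listing:b", "listed_by", "host:y")
    engine = GraphEngine(backend)
    return engine.out(engine.out(["user:1"], "booked"), "listed_by")


print(hosts_of_booked_listings(DictBackend()))
print(hosts_of_booked_listings(FlatKeyBackend()))
```

&lt;p&gt;Swapping the backend changes how edges are stored but not the traversal result, which is the property that lets the storage layer evolve independently of the graph layer.&lt;/p&gt;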






&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Know the signals that a vendor relationship has outlived its usefulness.&lt;/strong&gt; Recurring manual operational interventions, inability to instrument the system's internals, no path to tune performance for your access patterns, and P99 latency an order of magnitude worse than P50: each individually might be tolerable, but all four together mean the vendor is costing more than an internal solution would cost to build.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pluggable storage backends are what make graph databases practical at scale.&lt;/strong&gt; JanusGraph's DynamoDB backend let Airbnb separate concerns cleanly: Airbnb owns the graph logic layer, AWS owns the distributed storage operations. Build where you have competitive advantage; buy where you don't.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Shadow traffic is the only honest migration validation strategy for a stateful system.&lt;/strong&gt; You cannot reproduce 7 billion nodes and 11 billion edges in staging. Running both old and new systems against the same production queries, comparing outputs and latencies, closes the validation gap. Gremlin compatibility between vendor and JanusGraph made shadow traffic feasible here — evaluate migration options partly on query language compatibility.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;High-fanout nodes&lt;/em&gt; (vertices with an unusually large number of edges — sometimes called supernodes) are the specific failure mode of graph databases at scale.&lt;/strong&gt; They don't appear until the graph is large and dense. Design your query architecture around the assumption that some nodes will have orders of magnitude more edges than the average — parallel fetching, fanout budgets, and explicit query limits are the tools that prevent P99 from diverging from P50.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fork open-source infrastructure when you have specific, documented performance requirements the upstream project doesn't address — and when you intend to maintain the fork.&lt;/strong&gt; The fork is a commitment that creates a maintenance obligation and diverges from upstream. Make that decision with eyes open, but don't avoid it when the production requirements are clear.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Engineering Glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;DynamoDB&lt;/strong&gt; — Amazon's fully managed NoSQL key-value and document database, used by Airbnb as JanusGraph's storage backend. Provides auto-scaling, multi-region replication, and conditional write operations used in Airbnb's optimised transaction strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gremlin&lt;/strong&gt; — a graph traversal language developed as part of the Apache TinkerPop framework. Reads like a path through the graph: &lt;code&gt;g.V(userId).out('booked').in('listed')&lt;/code&gt; means "find all users who listed properties that this user has booked."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High-fanout node&lt;/strong&gt; — a vertex in a graph database with an unusually large number of edges, sometimes called a supernode. Causes disproportionate latency on traversal queries because a single hop can require fetching thousands of edges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JanusGraph&lt;/strong&gt; — an open-source distributed graph database built on Apache TinkerPop, with a pluggable storage backend that can use Cassandra, DynamoDB, or HBase as the underlying data store.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-tail latency&lt;/strong&gt; — the phenomenon where the slowest requests in a system (P95, P99) are dramatically slower than the median. Particularly damaging for real-time applications where even a small fraction of slow responses degrades user experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P99 latency&lt;/strong&gt; — the response time that 99% of requests complete within. A P99 of 5.0s means 1 in 100 requests takes 5 seconds or longer — directly visible to users at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pluggable storage backend&lt;/strong&gt; — an architectural pattern where the database query engine and the distributed storage layer are decoupled through a defined interface, allowing different storage systems to be swapped without changing the query layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shadow traffic&lt;/strong&gt; — a migration validation strategy where the same production queries are routed to both the old and new systems simultaneously, comparing outputs and latencies before committing to a cutover.&lt;/p&gt;
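&lt;p&gt;The P99 and long-tail latency definitions above can be made concrete with a short calculation. The sketch below uses the nearest-rank percentile method (one of several common definitions) on an invented sample in which 2 of 100 requests are slow; the numbers are illustrative, not Airbnb's measurements.&lt;/p&gt;

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p from 1 to 100: the smallest
    sample such that at least p percent of all samples are at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[rank - 1]


# Invented sample: 98 fast requests and 2 slow ones, in milliseconds.
latencies_ms = [100] * 98 + [5000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
print("mean:", mean_ms)
print("P50:", percentile(latencies_ms, 50))
print("P95:", percentile(latencies_ms, 95))
print("P99:", percentile(latencies_ms, 99))
```

&lt;p&gt;The mean and the P95 both look healthy for this sample while the P99 is fifty times the median, which is why tail percentiles rather than averages are the usual way to track long-tail latency.&lt;/p&gt;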




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/airbnb-identity-graph-janusgraph-2026/" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Interactive diagrams, source links, and the full reader experience)&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;TechLogStack — built at scale, broken in public, rebuilt by engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>database</category>
      <category>architecture</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Google's Gemini Omni Is the First AI That Creates From Anything — Here Is What That Actually Means</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Thu, 21 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/googles-gemini-omni-is-the-first-ai-that-creates-from-anything-here-is-what-that-actually-means-2a1k</link>
      <guid>https://dev.to/techlogstack/googles-gemini-omni-is-the-first-ai-that-creates-from-anything-here-is-what-that-actually-means-2a1k</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Any → Video&lt;/strong&gt; — text, image, audio, and video inputs simultaneously → video output in one model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 model&lt;/strong&gt; vs chained pipeline (Veo + Imagen + Lyria) — the architectural difference that enables cross-modal reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10 seconds&lt;/strong&gt; — maximum clip length at Flash launch; longer-form on the roadmap&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2B+ users&lt;/strong&gt; — YouTube Shorts monthly active users with Day 1 Omni integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SynthID&lt;/strong&gt; watermark on every generation — survives re-encoding, resizing, and colour grading&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conversational editing&lt;/strong&gt; — full context retained turn-to-turn, no re-prompting from scratch&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;For three years, Google built Gemini to be "natively multimodal." At I/O 2026, they finally showed what that phrase means in practice. Gemini Omni takes a photo, an audio clip, a video, and a text description — all at once — and produces a new video that reflects all of them simultaneously. This is not four models chained together. It is one — and the distinction is architectural, not cosmetic.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;When we first announced Gemini, it was our first AI model to be natively multimodal. We knew that training it on a combination of text, code, audio, images, and video would give it a deeper understanding of the world. With world models, AI is moving from predicting text to simulating reality. Gemini Omni is the next step in that direction.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;— Sundar Pichai, CEO of Google, Google I/O 2026, May 19 2026&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The phrase "natively multimodal" had been in Google's vocabulary since Gemini's December 2023 announcement, describing an aspiration more than a reality. At I/O 2026, Google delivered the concrete version: &lt;strong&gt;Gemini Omni&lt;/strong&gt;, a model that accepts text, image, audio, and video simultaneously and generates video as output, not by chaining Veo, Imagen, and Lyria together, but by processing all four input modalities within a single transformer's forward pass. A chain of models cannot reason about relationships between its inputs. A unified model can.&lt;/p&gt;

&lt;p&gt;The path from Gemini's announcement to Omni runs through three milestones. &lt;strong&gt;Gemini 2.0 Flash&lt;/strong&gt; (late 2024) introduced native audio output and real-time multimodal interaction — the first demonstration that Gemini could generate, not just understand, audio and video natively. &lt;strong&gt;Project Astra&lt;/strong&gt; explored continuous, persistent understanding of physical environments through video and audio streams. &lt;strong&gt;Nano Banana&lt;/strong&gt; (2025) brought Gemini's intelligence to image generation and editing, establishing the UX patterns — natural language editing, reference image input, conversational refinement — that Omni extends to video. Omni synthesises all three threads into a single production model.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Chained Models vs Native Omni: The Fundamental Difference&lt;/strong&gt;

&lt;p&gt;OpenAI's Sora and Google's Veo were excellent at their specific tasks but could not natively reason across modalities. Generating a video matching a specific audio track and reference image required: (1) generate a video with Veo from a text description, (2) separately process the audio, (3) manually synchronise the two. &lt;strong&gt;Gemini Omni collapses these three steps into one prompt&lt;/strong&gt; — upload the image, the audio, write a description, and the model reasons about all three simultaneously. The unified context window is what makes this possible.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Multimodal AI Was a Pipeline of Specialised Models
&lt;/h4&gt;

&lt;p&gt;The previous state-of-the-art for multimodal content creation required chaining specialised models — text-to-video, text-to-image, text-to-audio — and manual integration. Each handoff between models lost context: the relationship between audio tempo and visual rhythm, the visual style of a reference image, the emotional tone of a text prompt. Creators managed these integrations manually, limiting access to specialists.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Separate Models Cannot Reason Across Modality Boundaries
&lt;/h4&gt;

&lt;p&gt;A video model that receives a reference image as a text description has lost the actual pixel relationships. A video model that receives an audio file as a text description has lost the actual waveform data. Genuine multimodal reasoning requires all modalities in the same context window — not converted to text summaries of each other.&lt;/p&gt;




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  One Transformer Trained on All Modalities Simultaneously
&lt;/h4&gt;

&lt;p&gt;Gemini Omni was trained on text, image, audio, and video simultaneously within a single transformer architecture. The model develops internal representations encoding cross-modal relationships — understanding that a warm colour palette relates to a particular musical key, that physical object behaviour in video follows the laws of physics Gemini has observed across its training data.&lt;/p&gt;




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Any Input to Video Output, With Conversational Editing
&lt;/h4&gt;

&lt;p&gt;Gemini Omni Flash launched May 19 2026 in the Gemini app and YouTube Shorts — 10-second clips, API access planned within weeks. The model accepted any combination of inputs and produced video with character consistency, physics grounding, and SynthID watermarking. Conversational editing retained full context across turns — a generated scene could be revised through natural language without re-prompting from scratch.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Architecture: How Natively Multimodal Actually Works
&lt;/h3&gt;

&lt;p&gt;Gemini Omni's architecture is a transformer trained across all modalities simultaneously. It is not a &lt;em&gt;mixture of experts&lt;/em&gt; architecture with separate video, image, and audio experts, but a single dense model where all modalities interact in every layer. (A mixture of experts is a neural network architecture in which a routing mechanism directs each input to one of several specialised "expert" subnetworks.) A visual token and an audio token from the same moment in a video can attend to each other directly within the same attention layer, rather than being processed by separate networks whose outputs are later merged.&lt;/p&gt;
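&lt;p&gt;What "attend to each other directly within the same attention layer" means can be shown with generic scaled dot-product self-attention over one mixed sequence. This is a textbook sketch, not Gemini Omni's implementation: the two-dimensional embeddings are invented, and the learned query, key, and value projections of a real transformer are omitted (treated as identity) to keep it short.&lt;/p&gt;

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_weights(tokens):
    """Scaled dot-product attention weights with identity projections.
    Row i holds how much token i attends to every token in the sequence."""
    dim = len(tokens[0])
    weights = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights.append(softmax(scores))
    return weights


# Invented embeddings; the labels only record which modality each token came from.
sequence = [
    ("text", [1.0, 0.0]),
    ("image", [0.8, 0.6]),
    ("audio", [0.6, 0.8]),
]
labels = [label for label, _ in sequence]
weights = self_attention_weights([vector for _, vector in sequence])

for label, row in zip(labels, weights):
    print(label, dict(zip(labels, [round(w, 3) for w in row])))
```

&lt;p&gt;Because all three tokens sit in one sequence, the image row contains a non-zero weight on the audio token. In a chained pipeline there is no such row: each model only ever sees its own modality.&lt;/p&gt;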

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Any→Video&lt;/strong&gt; — text, image, audio, video inputs simultaneously → video output with physics grounding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10s&lt;/strong&gt; — maximum clip length at Flash launch; longer-form on the roadmap&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SynthID&lt;/strong&gt; — imperceptible watermark embedded in pixel-level statistical patterns; survives re-encoding, resizing, and colour grading&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 model&lt;/strong&gt; — vs chained pipeline (Veo + Imagen + Lyria); unification enables cross-modal reasoning that pipeline architectures cannot match&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The conversational editing model is Omni's most transformative product experience. Previous video generation tools operated like vending machines: insert prompt, receive video, discard and re-insert if wrong. Gemini Omni operates like a &lt;strong&gt;continuous creative collaboration&lt;/strong&gt;: generate a scene, ask for the camera angle to change, ask for a second character to enter — and the model keeps the context of every previous instruction. The resulting video reflects all decisions across the conversation, not just the most recent prompt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Conceptual: Gemini Omni vs the chained model approach it replaces
# Illustrates the architectural difference — API details TBC when GA
&lt;/span&gt;
&lt;span class="c1"&gt;# OLD APPROACH: Chaining specialised models — context lost at every handoff
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;veo&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VeoClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;lyria&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LyriaClient&lt;/span&gt;

&lt;span class="n"&gt;audio_clip&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LyriaClient&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;upbeat electronic music, 10 seconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# no knowledge of the visual reference
&lt;/span&gt;
&lt;span class="n"&gt;video&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VeoClient&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city timelapse, matches photo style&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;reference_image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# can't process image input; can't see the audio
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Manual synchronisation: the user's problem
&lt;/span&gt;
&lt;span class="c1"&gt;# GEMINI OMNI: One model, all modalities in one prompt
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;google.generativeai&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerativeModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gemini-omni-flash&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# start_chat() keeps the turn history; separate generate_content() calls are stateless
&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_chat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_message&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create a 10-second timelapse of a city transforming from day to night.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reference_photo.jpg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# actual pixel data — style extracted
&lt;/span&gt;    &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;audio_track.mp3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;      &lt;span class="c1"&gt;# actual waveform — beat sync possible
&lt;/span&gt;    &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reference_clip.mp4&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# actual video — motion style extracted
&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# Output: video reflecting the photo's style, synced to audio's beat,
# using the reference clip's camera movement — all from one inference pass
&lt;/span&gt;
&lt;span class="c1"&gt;# Conversational editing — full context preserved across turns
&lt;/span&gt;&lt;span class="n"&gt;response2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Same scene, but make it rain and show the character from my last prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="c1"&gt;# Model retains: character, city style, audio — no re-upload needed
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;SynthID: Watermarking That Cannot Be Removed&lt;/strong&gt;

&lt;p&gt;Every Gemini Omni video carries an imperceptible SynthID watermark embedded in the pixel data's statistical patterns — not in metadata. It survives re-encoding to different codecs, resizing, colour grading, and speed adjustments. Any C2PA-compatible platform can verify that a video was AI-generated by a Gemini product. Digital avatars additionally require mandatory onboarding (recording yourself, speaking verification numbers) before use — a guardrail against deepfakes built into the product from day one.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;



&lt;p&gt;&lt;/p&gt;
  World models: the theoretical foundation behind physics grounding
  &lt;br&gt;
Sundar Pichai described Omni as a step toward world models — AI systems that simulate physical and social reality rather than just predict token sequences. A language model predicting video token sequences will produce realistic-looking but physically incorrect motion: objects falling upward, light sources moving inconsistently, bodies with impossible joint angles. A world model that has internalised physics and causality from its training data produces videos where motion is physically coherent because the model understands &lt;em&gt;why&lt;/em&gt; objects move the way they do, not just what they look like when they move.&lt;br&gt;


&lt;p&gt;&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;
  Character consistency: how the long context window makes this possible
  &lt;br&gt;
A character introduced in scene 1 retains their face, clothing, and voice across all subsequent scenes in the same conversation, without the creator re-uploading the reference image for each shot. This is enabled by Gemini's long context window: the model carries the character's visual description as implicit context throughout the conversation. Competing video models, which have shorter effective contexts, require reference images at every generation turn and still produce inconsistent results.&lt;br&gt;


&lt;p&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Gemini Omni's internal architecture reflects the design philosophy Gemini has had since its December 2023 announcement: train a single model on all modalities simultaneously so that cross-modal understanding is emergent from training, not engineered through explicit routing. The practical consequence is that Omni's internal representation of a video frame encodes relationships to audio, text context, and physical reality simultaneously — enabling generation that reflects all input modalities without explicit instructions about how to combine them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chained Pipeline vs Gemini Omni: Architectural Comparison
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/google-gemini-omni-2026/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Interactive diagram available on TechLogStack (link above).&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Gemini Omni: Conversational Editing Flow and Context Retention
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/google-gemini-omni-2026/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Interactive diagram available on TechLogStack (link above).&lt;/em&gt;&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;The YouTube Shorts Integration: Distribution as the Moat&lt;/strong&gt;

&lt;p&gt;Gemini Omni's Day 1 integration into YouTube Shorts is a distribution strategy no standalone AI video tool can match. Creators generate a 10-second clip directly within YouTube's creation tools — no separate app, no API key. Every Omni-generated Short carries YouTube's standard content policy enforcement on top of SynthID watermarking, and is labelled as AI-generated in discovery surfaces. This is the first time a frontier AI video model has had a direct distribution path to a 2-billion-user platform on launch day.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;p&gt;&lt;/p&gt;
  C2PA content credentials: the open standard for AI provenance
  &lt;br&gt;
&lt;em&gt;C2PA&lt;/em&gt; (Coalition for Content Provenance and Authenticity — an open technical standard co-developed by Adobe, Microsoft, BBC, Intel, Sony, and others) cryptographically signs digital content at the point of creation with metadata about its origin and modification history. Any C2PA-compatible media player or content verification tool can confirm that a video was generated by Gemini Omni, when it was generated, and (if the user consented) by whom. This resolves the "is this real?" question for media at scale — not by restricting AI generation, but by making AI generation verifiable.&lt;br&gt;


&lt;p&gt;&lt;/p&gt;
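&lt;p&gt;To make the idea of signed provenance concrete, here is a minimal sketch of binding origin metadata to content and detecting tampering. It is not the C2PA format: C2PA relies on certificate-based signatures, whereas this example substitutes an HMAC with a shared demo key so it runs on the Python standard library alone. The key, field names, and function names are all hypothetical.&lt;/p&gt;

```python
# Illustrative stand-in for signed provenance metadata. NOT the C2PA format:
# C2PA uses certificate-based signatures; HMAC with a shared key is used here
# only so the sketch is self-contained. All names are hypothetical.
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key-not-a-real-credential"


def sign_claim(claim):
    """Sign a canonical JSON encoding of the claim."""
    payload = json.dumps(claim, sort_keys=True).encode()
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()


def make_manifest(content, generator, created_at):
    """Bind origin metadata to the exact bytes of the content."""
    claim = {
        "content_sha256": hashlib.sha256(content).hexdigest(),
        "generator": generator,
        "created_at": created_at,
    }
    return {"claim": claim, "signature": sign_claim(claim)}


def verify(content, manifest):
    """True only if the claim is untampered and matches these content bytes."""
    signature_ok = hmac.compare_digest(
        sign_claim(manifest["claim"]), manifest["signature"]
    )
    content_hash = hashlib.sha256(content).hexdigest()
    return signature_ok and manifest["claim"]["content_sha256"] == content_hash


video = b"fake video bytes"
manifest = make_manifest(video, "example-generator", "2026-01-01T00:00:00Z")

forged = {"claim": dict(manifest["claim"], generator="someone-else"),
          "signature": manifest["signature"]}

print(verify(video, manifest))            # original content and claim
print(verify(video + b"!", manifest))     # content altered after signing
print(verify(video, forged))              # metadata altered after signing
```

&lt;p&gt;The point of the design is that verification needs no trust in the file's path or uploader: changing either the bytes or the metadata after signing makes the check fail.&lt;/p&gt;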




&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Training a single model on all modalities simultaneously is architecturally superior to chaining specialised models for tasks requiring cross-modal reasoning.&lt;/strong&gt; A chain of models loses pixel relationships, waveform data, and temporal correlations at every handoff. A unified model retains them throughout. The performance gap between chained and unified architectures grows with the complexity of the cross-modal reasoning required.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;World models&lt;/em&gt; (AI architectures that simulate the physical and causal structure of reality rather than predict what the next frame statistically should look like) produce more coherent generated video than token-prediction models.&lt;/strong&gt; They model causality rather than correlation. "AI is moving from predicting text to simulating reality" is the product-facing version of this architectural shift.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The conversational editing model changes who can use AI video generation.&lt;/strong&gt; Prompt-and-retry was a specialist workflow — only people fluent in prompt engineering got good results efficiently. Conversational steering, where natural language revisions apply incrementally to a persistent context, is intuitive for anyone who has ever given feedback in a meeting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Safety infrastructure is a prerequisite for deploying generative video at platform scale, not a post-launch patch.&lt;/strong&gt; &lt;em&gt;SynthID&lt;/em&gt; (Google's imperceptible AI-generated content watermark embedded in pixel-level statistical patterns — survives re-encoding, resizing, and colour processing), C2PA content credentials, and mandatory avatar onboarding verification are what make Omni deployable on YouTube without becoming deepfake infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Distribution is the moat that model quality cannot easily overcome.&lt;/strong&gt; An average model with YouTube Shorts integration reaches 2 billion users on Day 1. A superior model without distribution reaches the early-adopter population. Route new AI capabilities through existing products with existing users — don't build a new acquisition funnel when you don't have to.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Engineering Glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;C2PA (Coalition for Content Provenance and Authenticity)&lt;/strong&gt; — an open technical standard co-developed by Adobe, Microsoft, BBC, Intel, Sony, and others that cryptographically signs digital content at creation with metadata about its origin. Enables any C2PA-compatible tool to verify whether content is AI-generated, human-made, or modified.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mixture of experts&lt;/strong&gt; — a neural network architecture where different "expert" subnetworks specialise in different input types, with a routing mechanism directing each input to the appropriate expert. Contrasted with Gemini Omni's single dense model where all modalities interact in every layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Natively multimodal&lt;/strong&gt; — a model architecture trained on multiple modalities (text, image, audio, video) simultaneously rather than routing between specialised single-modality models. Enables cross-modal reasoning that pipeline architectures cannot replicate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project Astra&lt;/strong&gt; — Google DeepMind's ongoing research into a universal AI assistant that processes real-time audio and video streams continuously — exploring what it means for an AI to have persistent understanding of a physical environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SynthID&lt;/strong&gt; — Google's imperceptible digital watermark embedded in the statistical patterns of AI-generated pixel data. Survives re-encoding, resizing, and colour grading. Enables AI provenance verification without visible degradation of the content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;World model&lt;/strong&gt; — an AI architecture that simulates the physical and causal structure of reality — understanding why objects move, how light behaves, and what consequences follow from actions — rather than simply predicting what the next frame statistically should look like.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/google-gemini-omni-2026/" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Interactive diagrams, source links, and the full reader experience)&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;TechLogStack — built at scale, broken in public, rebuilt by engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>webdev</category>
      <category>javascript</category>
    </item>
    <item>
      <title>OpenAI Deployed a Tool to Monitor Kubernetes — and It Took Down All of Kubernetes</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Thu, 21 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/openai-deployed-a-tool-to-monitor-kubernetes-and-it-took-down-all-of-kubernetes-adh</link>
      <guid>https://dev.to/techlogstack/openai-deployed-a-tool-to-monitor-kubernetes-and-it-took-down-all-of-kubernetes-adh</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;4h 22m&lt;/strong&gt; outage — 3:16 PM to 7:38 PM PST, December 11 2024&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;29 minutes&lt;/strong&gt; to roll the change out to every cluster; products began degrading 25 minutes in&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0&lt;/strong&gt; staging warnings — telemetry service passed validation completely&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0&lt;/strong&gt; regions with staged rollout — applied to all clusters simultaneously&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All&lt;/strong&gt; OpenAI services affected: ChatGPT, the API, and Sora simultaneously&lt;/li&gt;
&lt;li&gt;Engineers &lt;strong&gt;locked out&lt;/strong&gt; of clusters — &lt;code&gt;kubectl&lt;/code&gt; requires a control plane that was down&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;On December 11, 2024, OpenAI deployed a new telemetry service designed to improve Kubernetes observability — to give engineers better visibility into how their clusters were behaving, to catch problems earlier. Within 29 minutes, the telemetry service had crashed the Kubernetes control plane across every cluster. ChatGPT, the API, and Sora were all unavailable. And the engineers responsible for fixing it couldn't run &lt;code&gt;kubectl&lt;/code&gt; — the control plane that manages Kubernetes was down, and it was the only way back in.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Our tests didn't catch the impact the change was having on the Kubernetes control plane. DNS caching added a delay between making the change and when services started failing. Remediation was very slow because of the locked out effect.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;— OpenAI, December 11 2024 Incident Postmortem, status.openai.com&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The events unfolded with the particular cruelty of incidents where staging does not predict production. The telemetry service was deployed to a staging cluster on December 10 and verified as working correctly. On December 11 at 2:51 PM PST, the change began rolling out to all production clusters. At 3:16 PM, &lt;strong&gt;about four minutes before the 29-minute rollout was complete&lt;/strong&gt;, all OpenAI products began degrading. The root cause: a configuration that caused every node in every cluster to execute resource-intensive Kubernetes API operations simultaneously. The cost of these operations scaled with cluster size, so the largest, most critical clusters were hit hardest and fastest.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;DNS Caching: The Hidden Time Bomb&lt;/strong&gt;

&lt;p&gt;The staging environment passed for two reasons. First, the staging cluster was small: the telemetry service's API load scaled with cluster size, so staging generated only a manageable load. Second, &lt;strong&gt;DNS caching masked the failure&lt;/strong&gt;. When the telemetry service started overwhelming the Kubernetes API servers, services that had already cached DNS responses kept working temporarily on stale cache entries. Engineers saw a clean deployment and functioning services until the cached DNS entries expired, at which point everything that had not yet failed did so all at once.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;
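&lt;p&gt;The masking effect described above can be reproduced with a toy TTL cache. This is a minimal sketch, not OpenAI's resolver: the class names, the 30-second TTL, the hostname, and the address are all hypothetical, and time is passed in explicitly so the example is deterministic.&lt;/p&gt;

```python
# Illustrative sketch only: every name and number here is hypothetical.
class CachingResolver:
    """Resolver that serves cached answers until their TTL expires."""

    def __init__(self, backend, ttl_seconds):
        self.backend = backend            # callable(name) returning an address; may raise
        self.ttl_seconds = ttl_seconds
        self.cache = {}                   # name mapped to (address, expiry time)

    def resolve(self, name, now):
        entry = self.cache.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]               # cache hit: backend health is invisible here
        address = self.backend(name)      # miss or expired entry: backend must answer
        self.cache[name] = (address, now + self.ttl_seconds)
        return address


class FlakyBackend:
    """Stand-in for cluster DNS that can be switched off to simulate the outage."""

    def __init__(self):
        self.healthy = True

    def __call__(self, name):
        if not self.healthy:
            raise RuntimeError("DNS backend unavailable")
        return "10.0.0.7"


def probe(resolver, name, now):
    """Return 'ok' or 'FAILED' for one lookup at simulated time `now`."""
    try:
        resolver.resolve(name, now)
        return "ok"
    except RuntimeError:
        return "FAILED"


backend = FlakyBackend()
resolver = CachingResolver(backend, ttl_seconds=30)

timeline = [probe(resolver, "api.internal", now=0)]         # warm cache while healthy
backend.healthy = False                                     # backend goes down
timeline.append(probe(resolver, "api.internal", now=10))    # still inside the TTL
timeline.append(probe(resolver, "api.internal", now=29))    # still inside the TTL
timeline.append(probe(resolver, "api.internal", now=31))    # TTL expired
print(timeline)  # ['ok', 'ok', 'ok', 'FAILED']
```

&lt;p&gt;The backend is already down at the second probe, but callers only notice once the cached entry expires, which is why a health check taken shortly after a deployment can look clean.&lt;/p&gt;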


&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Telemetry Rollout to All Clusters in 29 Minutes
&lt;/h4&gt;

&lt;p&gt;At 2:51 PM PST, the new telemetry service configuration began rolling out to all Kubernetes clusters simultaneously. The service's configuration caused every node in every cluster to issue simultaneous resource-intensive Kubernetes API calls — a load that scaled with cluster size, hitting the largest, most critical clusters hardest.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Kubernetes Control Plane Overwhelmed — DNS and Service Discovery Broken
&lt;/h4&gt;

&lt;p&gt;With thousands of nodes simultaneously hammering the Kubernetes API servers, the control planes of most large clusters crashed. &lt;em&gt;Kubernetes's control plane&lt;/em&gt; (the set of components managing overall cluster state — API server, etcd, scheduler, controller manager) manages service discovery and DNS resolution. When it failed, services could no longer find each other. DNS cache expiry then propagated the failure to services temporarily protected by stale cache entries, turning partial degradation into complete cascading failure.&lt;/p&gt;




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  The Locked-Out Problem: No kubectl Access
&lt;/h4&gt;

&lt;p&gt;Recovery required rolling back the telemetry configuration — but rolling back Kubernetes configurations requires &lt;code&gt;kubectl&lt;/code&gt;, which requires a functioning Kubernetes control plane. The control plane was down. Engineers were effectively locked out of the clusters they needed to fix. Recovery required out-of-band mechanisms: directly accessing nodes through cloud provider management consoles, bypassing the Kubernetes layer entirely to remove the telemetry service's configuration.&lt;/p&gt;




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  4h 22min Outage, Full Postmortem Published
&lt;/h4&gt;

&lt;p&gt;ChatGPT reached substantial recovery at 5:45 PM PST. Full recovery across all services was achieved at 7:38 PM PST — 4 hours and 22 minutes after the incident began. OpenAI published a detailed postmortem identifying four root causes and committing to specific architectural changes including break-glass emergency access mechanisms and staged rollouts for all infrastructure changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Actually Broke and Why Recovery Took Four Hours
&lt;/h3&gt;

&lt;p&gt;The telemetry service's configuration caused each node to watch Kubernetes API resources continuously — a &lt;em&gt;Watch API&lt;/em&gt; (a Kubernetes feature allowing clients to receive a stream of events as resources change — creates a persistent connection from each watcher to the API server, consuming server resources proportional to the number of watchers) operation making API calls proportional to cluster size. Across thousands of nodes in large clusters, these calls compounded into an overwhelming flood. The API servers became saturated. With them unresponsive, &lt;em&gt;etcd&lt;/em&gt; (the distributed key-value store backing all Kubernetes state — node metadata, pod specifications, service definitions — API servers cannot function without it) became unreachable. Without etcd, API servers couldn't recover. Without API servers, nothing could be changed. The cluster was in a deadlock.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;4h 22m&lt;/strong&gt; — total outage duration, 3:16 PM to 7:38 PM PST — longest single outage in ChatGPT's history at the time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;29 min&lt;/strong&gt; — time to roll the change out to every cluster, with products degrading 25 minutes in; fast enough that the full fleet was affected before the scope was understood&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All&lt;/strong&gt; — services affected simultaneously: ChatGPT, API, Sora — every OpenAI product at once&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0&lt;/strong&gt; — staging warnings — staging clusters were too small to reproduce the API call scaling behaviour that took down production
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified model of the failure: telemetry service overwhelming K8s API
# Each node watches K8s API objects; in this simplified model, load grows linearly with cluster size
&lt;/span&gt;
&lt;span class="n"&gt;TELEMETRY_CONFIG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;watch_all_pods&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# persistent connection per node to API server
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;watch_all_nodes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# another persistent connection per node
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;watch_all_services&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# another persistent connection per node
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;poll_interval_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# aggressive — 10 checks/second per watcher
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;api_calls_per_second&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# 3 watchers per node × 10 calls/sec per watcher
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cluster_size&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;

&lt;span class="c1"&gt;# Staging cluster (100 nodes):
&lt;/span&gt;&lt;span class="n"&gt;staging_load&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;api_calls_per_second&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# 3,000/sec — manageable
# Illustrative assumption: the API servers can absorb a few thousand requests/sec
&lt;/span&gt;
&lt;span class="c1"&gt;# Large production cluster (5,000 nodes):
&lt;/span&gt;&lt;span class="n"&gt;prod_load&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;api_calls_per_second&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# 150,000/sec — CATASTROPHIC
# API server saturated within seconds → DNS breaks → services go blind
# kubectl stops working → engineers locked out
&lt;/span&gt;
&lt;span class="c1"&gt;# THE LOCKED-OUT DEADLOCK:
# Fix requires: kubectl → needs API server → API server is down → needs fix
#
# RECOVERY PATH (bypassing K8s entirely):
# 1. SSH to nodes via cloud provider console (not through K8s)
# 2. Manually stop telemetry service process on each node
# 3. API server load drops → control plane recovers
# 4. kubectl works again → roll back config through standard channels
# 5. Monitor DNS propagation and service recovery across fleet
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;The Four Root Causes from OpenAI's Postmortem&lt;/strong&gt;

&lt;p&gt;&lt;strong&gt;(1) Staging cluster too small&lt;/strong&gt; — the failure only manifested at production cluster sizes. &lt;strong&gt;(2) DNS caching masked the initial failure&lt;/strong&gt; — services continued on stale cache entries, giving engineers a false "clean deployment" signal before cache expiry revealed the truth. &lt;strong&gt;(3) No canary deployment&lt;/strong&gt; — configuration applied to all clusters simultaneously rather than validated incrementally. &lt;strong&gt;(4) No break-glass mechanism&lt;/strong&gt; — no pre-arranged out-of-band access path for the scenario where the standard Kubernetes management plane was unavailable.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Recovery steps — bypassing Kubernetes entirely:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Access individual nodes directly through the cloud provider's management console — not through Kubernetes&lt;/li&gt;
&lt;li&gt;Manually stop the telemetry service process on each node to eliminate the API call flood&lt;/li&gt;
&lt;li&gt;With load removed, Kubernetes API servers begin recovering&lt;/li&gt;
&lt;li&gt;Once &lt;code&gt;kubectl&lt;/code&gt; is functional, roll back the telemetry service configuration through standard channels&lt;/li&gt;
&lt;li&gt;Monitor service recovery and DNS propagation across the fleet&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Post-incident engineering commitments:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Immediate&lt;/strong&gt; — locked the telemetry configuration to prevent re-deployment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Short-term&lt;/strong&gt; — implement break-glass emergency access that functions when the K8s control plane is unavailable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium-term&lt;/strong&gt; — decouple observability infrastructure from the components it monitors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term&lt;/strong&gt; — all infrastructure configuration changes use staged deployment with continuous monitoring and the ability to halt at any percentage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;/p&gt;
  The iOS 18.2 coincidence
  &lt;br&gt;
Apple shipped iOS 18.2 — which introduced ChatGPT integration into Apple Intelligence — on the same day as the outage. Millions of users who updated and then tried ChatGPT saw it was unavailable. Social media immediately speculated that the iOS update had caused the outage. OpenAI's postmortem was explicit: iOS 18.2 had nothing to do with it. The telemetry failure had already begun degrading infrastructure before the iOS update's traffic could have any effect. Correlation — especially coincidence of timing — is not causation, and attributing outage causes to the most visible concurrent event is a common and often wrong instinct.&lt;br&gt;


&lt;p&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;OpenAI's Kubernetes architecture runs the inference clusters powering ChatGPT's model serving, the API gateway, and the Sora video generation pipeline — all depending on the Kubernetes control plane for service discovery, DNS resolution, pod scheduling, and configuration management. When a single telemetry service configuration saturated the API servers, it took all three of these simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Failure Chain: From Telemetry Deployment to Complete Outage
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/openai-chatgpt-kubernetes-outage-2024/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Interactive diagram available on TechLogStack (link above).&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Recovery Architecture: Bypassing Kubernetes to Restore It
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/openai-chatgpt-kubernetes-outage-2024/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Interactive diagram available on TechLogStack (link above).&lt;/em&gt;&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Why Kubernetes Control Plane Failure Is Catastrophic&lt;/strong&gt;

&lt;p&gt;The control plane manages three things that are catastrophic to lose simultaneously: &lt;strong&gt;DNS resolution&lt;/strong&gt; (services find each other by name, not IP — without DNS, microservices go blind), &lt;strong&gt;service discovery&lt;/strong&gt; (load balancers can't route to healthy pods without the API server updating configuration), and &lt;strong&gt;pod scheduling&lt;/strong&gt; (crashed pods can't be restarted, replicas can't be scaled). In most partial failures, you lose one of these. A control plane failure loses all three — and recovery requires the control plane to function, creating a circular dependency that demands pre-arranged out-of-band access.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;p&gt;&lt;/p&gt;
  The staged rollout that would have caught it
  &lt;br&gt;
A staged rollout (1 cluster → verify for 30 minutes → 10% of clusters → verify → 50% → verify → 100%) would have caught this failure at the 1-cluster stage. One large cluster showing API server saturation is a signal. Every cluster crashing before engineers understand why is an outage. The difference between the two outcomes is a verification window between deployment stages: time to observe behaviour before the next stage commits. OpenAI's December 11 deployment had no such window; the configuration was applied to all clusters in 29 minutes without a verification pause.&lt;br&gt;


&lt;p&gt;&lt;/p&gt;
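&lt;p&gt;The staged rollout described above can be sketched as a loop over waves with a verification gate after each one. This is a minimal illustration under stated assumptions: the stage sizes follow the text, while the cluster names, the &lt;code&gt;apply_change&lt;/code&gt; and &lt;code&gt;is_healthy&lt;/code&gt; callbacks, and the function names are hypothetical. A real gate would wait and watch API server latency and error rates before continuing.&lt;/p&gt;

```python
# Illustrative sketch: stage sizes follow the text; everything else is hypothetical.
import math

STAGE_PERCENTS = [10, 50, 100]   # cumulative share of clusters after the canary


def plan_waves(clusters):
    """Split clusters into rollout waves: one canary, then up to 10%, 50%, 100%."""
    if not clusters:
        return []
    targets = [1] + [math.ceil(len(clusters) * p / 100) for p in STAGE_PERCENTS]
    waves, done = [], 0
    for target in targets:
        if target > done:
            waves.append(clusters[done:target])
            done = target
    return waves


def staged_rollout(clusters, apply_change, is_healthy):
    """Apply the change wave by wave; halt at the first wave that fails verification."""
    deployed = []
    for wave in plan_waves(clusters):
        for cluster in wave:
            apply_change(cluster)
            deployed.append(cluster)
        # Verification window: in practice, pause here (for example 30 minutes)
        # and observe control plane metrics before committing the next wave.
        if not all(is_healthy(cluster) for cluster in wave):
            return {"status": "halted", "deployed": deployed}
    return {"status": "complete", "deployed": deployed}


clusters = [f"cluster-{i}" for i in range(20)]

# Simulate a change that saturates the API servers on any cluster it touches.
bad = staged_rollout(clusters, apply_change=lambda c: None,
                     is_healthy=lambda c: False)
print(bad["status"], bad["deployed"])      # halted ['cluster-0']

good = staged_rollout(clusters, apply_change=lambda c: None,
                      is_healthy=lambda c: True)
print(good["status"], len(good["deployed"]))   # complete 20
```

&lt;p&gt;With a failing change, the blast radius is one canary cluster instead of the whole fleet, which is the property the verification window buys.&lt;/p&gt;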




&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Observability infrastructure is production infrastructure.&lt;/strong&gt; A telemetry service deployed across your entire fleet has the blast radius of your entire fleet. Deploy it with the same staged rollout rigor you apply to production services: one cluster, verify, one region, verify, full fleet. The December 11 rollout applied the configuration to all clusters in 29 minutes. A staged rollout would have revealed the problem on the first cluster before it cascaded.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;DNS caching&lt;/em&gt; (storing DNS lookup results locally for a period defined by the record's TTL) is a reliability asset that becomes a diagnostic liability during incidents.&lt;/strong&gt; When an infrastructure change breaks DNS, services continue functioning on cached entries — masking the failure until TTLs expire. If your deployment passes initial health checks and then fails minutes later at scale, DNS cache expiry is a likely explanation. Monitor DNS resolution success rates separately from application health checks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build break-glass emergency access before you need it.&lt;/strong&gt; The December 11 engineers needed to access nodes directly, bypassing the Kubernetes control plane, using mechanisms that had not been pre-arranged. Pre-arrange them. Every Kubernetes deployment should have a documented, tested procedure for accessing nodes when &lt;code&gt;kubectl&lt;/code&gt; is unavailable. Like any emergency procedure, it must be practiced before the emergency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Size-dependent bugs&lt;/em&gt; (failures manifesting only at production scale because their severity is a non-linear function of system size) cannot be caught by functional testing at reduced scale.&lt;/strong&gt; Load test infrastructure changes against production-equivalent cluster sizes. If production-scale testing is not feasible, test at 10% of production scale and extrapolate load metrics before applying to the full fleet, treating any linear extrapolation as a lower bound because load can grow faster than cluster size.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Decouple the components that manage your infrastructure from the infrastructure they manage.&lt;/strong&gt; The Kubernetes control plane should not be the only path to emergency recovery. If the control plane fails, some emergency management capability should remain available independently of the failed layer.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
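&lt;p&gt;Lesson 2 suggests tracking DNS resolution success as its own signal. A minimal sketch of that idea follows, assuming a caller-supplied &lt;code&gt;resolve&lt;/code&gt; function that rejects when a lookup fails. The names &lt;code&gt;dnsSuccessRate&lt;/code&gt; and &lt;code&gt;checkDns&lt;/code&gt; and the 0.99 threshold are hypothetical, not taken from the incident report.&lt;/p&gt;

```javascript
// Measure DNS resolution success on its own, separate from app health checks.
// `resolve` is any async function that takes a hostname and rejects on failure.
async function dnsSuccessRate(hostnames, resolve) {
  if (hostnames.length === 0) return 1;
  const results = await Promise.allSettled(hostnames.map((h) => resolve(h)));
  const ok = results.filter((r) => r.status === "fulfilled").length;
  return ok / hostnames.length;
}

// Flag a DNS problem even while cached entries keep the application "healthy".
// The default threshold is an illustrative assumption.
async function checkDns(hostnames, resolve, minRate = 0.99) {
  const rate = await dnsSuccessRate(hostnames, resolve);
  return { rate, alert: minRate > rate };
}
```

&lt;p&gt;In Node.js, one candidate for &lt;code&gt;resolve&lt;/code&gt; is a thin wrapper around &lt;code&gt;dns.promises.resolve4&lt;/code&gt;, which queries a DNS server directly instead of going through the operating system's lookup path. Whichever resolver is used, the probe should avoid the caches that mask the failure.&lt;/p&gt;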




&lt;h2&gt;
  
  
  Engineering Glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Break-glass mechanism&lt;/strong&gt; — a pre-arranged, out-of-band access path to infrastructure that functions even when the standard management layer is unavailable. Named after the physical "break glass in case of emergency" safety cabinet. The absence of a break-glass mechanism was one of OpenAI's four identified root causes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DNS caching&lt;/strong&gt; — storing the results of DNS lookups locally for a period defined by the record's TTL (Time to Live), allowing services to resolve domain names without contacting the DNS server on every request. A reliability asset under normal conditions; a diagnostic liability that masks failures during incidents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;etcd&lt;/strong&gt; — the distributed key-value store that backs all Kubernetes cluster state — node metadata, pod specifications, service definitions. Kubernetes API servers cannot function without access to etcd; etcd unavailability produces total control plane failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes control plane&lt;/strong&gt; — the set of components managing overall Kubernetes cluster state: the API server (handles all REST operations), etcd (state store), the scheduler (assigns pods to nodes), and the controller manager (runs reconciliation loops). Runs on dedicated master nodes, separate from the data plane nodes running actual workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Locked-out effect&lt;/strong&gt; — the circular dependency where recovering from a Kubernetes control plane failure requires &lt;code&gt;kubectl&lt;/code&gt;, which requires a functioning control plane. The cluster is frozen in a state where existing workloads continue running but nothing can be changed, fixed, scaled, or recovered through standard channels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Size-dependent bug&lt;/strong&gt; — a failure that only manifests at production scale because its severity is a non-linear function of system size. A 100-node staging cluster may pass cleanly while a 5,000-node production cluster fails catastrophically — the same configuration producing 50× the load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Watch API&lt;/strong&gt; — a Kubernetes API feature allowing clients to receive a stream of events as resources change. More efficient than polling, but creates a persistent connection from each watching client to the API server, consuming server resources proportional to the number of watchers. Misused by the December 11 telemetry service to create 15,000+ persistent connections on large clusters.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/openai-chatgpt-kubernetes-outage-2024/" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Interactive diagrams, source links, and the full reader experience)&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;TechLogStack — built at scale, broken in public, rebuilt by engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>reliability</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Google Built a Free Design Tool That Generates Production Code From a Sentence — Then Added Multiplayer</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Thu, 21 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/google-built-a-free-design-tool-that-generates-production-code-from-a-sentence-then-added-5307</link>
      <guid>https://dev.to/techlogstack/google-built-a-free-design-tool-that-generates-production-code-from-a-sentence-then-added-5307</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;30 seconds&lt;/strong&gt; — plain English sentence to complete mobile UI, live on stage at Google I/O 2025&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$0 vs $15&lt;/strong&gt; — Google Stitch multiplayer vs Figma professional plan per editor per month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3 input types&lt;/strong&gt; — text prompt, reference image, annotated screenshot — processed simultaneously&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5 screens&lt;/strong&gt; — simultaneous canvas rendering introduced in Stitch 2.0, March 2026&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;350 free generations/month&lt;/strong&gt; — standard tier; $20/month Pro for unlimited&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1M+ waitlist signups&lt;/strong&gt; overnight after the I/O 2025 live demo&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;At Google I/O 2025, Sundar Pichai typed a one-sentence description of a mobile app and watched Google Stitch render a complete, multi-component UI in under 30 seconds. One click exported it as React code. Another exported it as an editable Figma file. Figma charges $15 per editor per month for collaborative design. Stitch does it free. A year later, Google added real-time multiplayer, a streaming design agent, and voice input — and the design industry started paying attention.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;p&gt;Google Stitch did not emerge from Google's internal R&amp;amp;D labs. It began with the early-2025 acquisition of &lt;strong&gt;Galileo AI&lt;/strong&gt; — a startup that had built one of the first credible text-to-UI generators, capable of interpreting product descriptions and producing coherent interface layouts. Google acquired Galileo, rebranded it as Stitch, integrated it with &lt;em&gt;Gemini 2.5 Pro&lt;/em&gt; (Google's multimodal model able to process text, images, audio, and video simultaneously and generate structured outputs across all of them), and launched it as a Google Labs experiment at I/O 2025. The Labs framing was deliberate — testing the market before committing to a full product. Over 1 million waitlist signups appeared overnight.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;What 'Vibe Design' Actually Means&lt;/strong&gt;

&lt;p&gt;Stitch entered the vocabulary alongside "vibe coding" — describing software intent to an AI and refining the output iteratively rather than building from first principles. The skill shifts from &lt;strong&gt;pixel manipulation to intent specification&lt;/strong&gt;. A founder who cannot use Figma can produce a working prototype in minutes. A product manager can test five layout variations in the time it would previously have taken to brief a designer on one.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;p&gt;The evolution from launch to Stitch 2.0 compressed ten months of user feedback into a clear product trajectory. The May 2025 version was single-screen only: one prompt, one screen, export. July 2025 added theme customisation and RTL language support. December 2025 brought multi-screen &lt;strong&gt;Prototypes&lt;/strong&gt; alongside Gemini 3 integration. March 19, 2026 was &lt;strong&gt;Stitch 2.0&lt;/strong&gt;: infinite canvas, 5-screen simultaneous generation, voice input, and app-flow generation. A demo had become a workspace.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Design-to-Dev Handoff: The Productivity Black Hole
&lt;/h4&gt;

&lt;p&gt;The traditional pipeline required designers to build components in Figma, annotate specs manually, and hand off to developers who re-implemented everything in code. Even with design tokens and component libraries, the gap between "designed" and "built" consumed weeks. For small teams and solo founders this gap was existential — they lacked either the design skill or the engineering skill to close it alone.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Multimodal Models Reached UI-Generation Quality
&lt;/h4&gt;

&lt;p&gt;By early 2025, Gemini's multimodal capabilities had reached a threshold where they could reliably interpret both text descriptions and uploaded images of existing UIs, generating coherent layouts with appropriate component choices, spacing, and visual hierarchy. The Galileo acquisition gave Google a product layer that had already solved the prompt engineering, training data, and output format problems on top of that capability.&lt;/p&gt;




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Stitch: Three Inputs, Gemini Core, Production-Grade Exports
&lt;/h4&gt;

&lt;p&gt;Stitch accepted three input types simultaneously: natural language descriptions, uploaded reference images or screenshots, and annotated screenshots with modification notes. Gemini 2.5 Pro processed all three in a single context window. Export paths targeted real developer workflows: Figma files with editable layers and auto-layout, production-ready HTML/CSS, React components, and Vue code.&lt;/p&gt;




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  I/O 2026: Streaming Agent + Multiplayer — Both Free
&lt;/h4&gt;

&lt;p&gt;At I/O 2026, Google launched a streaming design agent that renders UI components onto the canvas in real time as a designer types or speaks, so the designer can correct course before generation finishes. Simultaneous multi-user editing was also added, directly matching Figma's flagship collaboration feature. Both are free. Figma's professional plan charges $15 per editor per month.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Technical Architecture: Gemini as the UI Design Engine
&lt;/h3&gt;

&lt;p&gt;Stitch's core is not a purpose-built design model — it is &lt;em&gt;Gemini 2.5 Pro&lt;/em&gt; with a specialised prompt engineering and output parsing layer on top. This explains both Stitch's strengths and its limitations. Stitch understands concepts like "glassmorphism," "material design," and "iOS Human Interface Guidelines" because Gemini was trained on documentation and examples of all of them. It generates production-quality React because Gemini understands React at a level that exceeds most specialised code generation models.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;30s&lt;/strong&gt; — sentence to complete mobile UI including navigation, components, and colour palette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3 inputs&lt;/strong&gt; — text prompt, reference image, annotated screenshot — single Gemini context window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5-screen&lt;/strong&gt; — simultaneous canvas rendering in Stitch 2.0, March 2026&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$0 vs $15&lt;/strong&gt; — Stitch multiplayer vs Figma professional plan per editor per month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The I/O 2026 streaming agent is an architectural change, not just a speed improvement. Previous versions were &lt;strong&gt;turn-based&lt;/strong&gt;: submit a prompt, wait for completion, review, resubmit. The streaming model replaces this with continuous render — components appear on canvas as they are generated, layouts reflow before generation finishes. The practical difference is the ability to &lt;strong&gt;steer mid-generation&lt;/strong&gt;: if a layout is heading in the wrong direction, a designer can interrupt and redirect before it finishes. Voice input, integrated since March 2026, works within this same loop.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Turn-based vs streaming: the architectural difference in Stitch's I/O 2026 upgrade&lt;/span&gt;

&lt;span class="c1"&gt;// BEFORE (turn-based): designer sees nothing until fully done&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;generateUI_old&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;stitch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// blocking — full wait&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;screens&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// [{ html, css, figmaLayers }]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// AFTER (streaming agent): real-time render + mid-generation steering&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;generateUI_streaming&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;stitch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Components render onto canvas as they are generated&lt;/span&gt;
  &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;component&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;component&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;canvas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;renderPartial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;component&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// visible immediately — no waiting&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Designer can interrupt and redirect before generation finishes&lt;/span&gt;
  &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;layoutDecision&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userFeedback&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;canvas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;checkInterrupt&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userFeedback&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;steer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userFeedback&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// mid-generation course correction&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Voice input works inline — spoken mid-generation, reflected immediately&lt;/span&gt;
  &lt;span class="nx"&gt;voiceInput&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;command&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;steer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;canvas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getCurrentState&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;The Galileo Acquisition Rationale&lt;/strong&gt;

&lt;p&gt;Google could have built Stitch from scratch using Gemini. It acquired Galileo instead because Galileo had already solved the hardest non-model problems: the prompt engineering approach that reliably produces coherent UIs, the output parser that converts model outputs into valid design tokens and component trees, and the UX model for iterative refinement. Rebuilding these would have taken months. The acquisition compressed that to days. Galileo's technology became the product layer; Gemini became the intelligence underneath it.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RLHF for UI quality: how Stitch reached 95% component rendering accuracy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Stitch's code export quality reached 95% accuracy (component rendering fidelity) in the March 2025 closed beta, up from ~70% in early estimates. The improvement came from RLHF (Reinforcement Learning from Human Feedback) applied specifically to UI generation quality. The beta involved 500+ partner users, including Vercel developers, who provided direct feedback on generated code quality and design accuracy. This domain-specific signal tuned Gemini's output for the criteria professional designers and developers actually cared about: component naming, layout accuracy, code cleanliness, and design system compatibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature timeline — launch to I/O 2026:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Update&lt;/th&gt;
&lt;th&gt;Key Feature Added&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;May 20, 2025&lt;/td&gt;
&lt;td&gt;Google I/O Launch&lt;/td&gt;
&lt;td&gt;Single-screen generation, Figma export, HTML/CSS/React export&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jul–Aug 2025&lt;/td&gt;
&lt;td&gt;Public beta&lt;/td&gt;
&lt;td&gt;Theme customisation, RTL language support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dec 2025&lt;/td&gt;
&lt;td&gt;Stitch 2.0 preview&lt;/td&gt;
&lt;td&gt;Prototypes (multi-screen flows), Gemini 3 integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mar 19, 2026&lt;/td&gt;
&lt;td&gt;Stitch 2.0 GA&lt;/td&gt;
&lt;td&gt;Infinite canvas, 5-screen simultaneous generation, voice input, app-flow generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;May 20, 2026&lt;/td&gt;
&lt;td&gt;I/O 2026&lt;/td&gt;
&lt;td&gt;Streaming agent (real-time canvas render), multiplayer — both free&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Stitch's internal architecture has three distinct layers. The &lt;strong&gt;input layer&lt;/strong&gt; processes multimodal inputs through Gemini 2.5 Pro — text prompts, reference images, and annotated screenshots are unified into a single context window. The &lt;strong&gt;generation layer&lt;/strong&gt; produces an &lt;em&gt;intermediate representation&lt;/em&gt; (an abstract, format-agnostic description of design intent — component hierarchy, spacing tokens, visual relationships — that can be translated into multiple output formats without losing design semantics) rather than raw HTML or Figma JSON directly. The &lt;strong&gt;export layer&lt;/strong&gt; translates that IR into Figma-compatible JSON with proper component structure and auto-layout, production-grade React/HTML/CSS, and AI Studio integration configs.&lt;/p&gt;
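&lt;p&gt;To make the three-layer idea concrete, here is a small sketch of one intermediate representation feeding two exporters. Everything in it is an assumption for illustration: the node shape (&lt;code&gt;type&lt;/code&gt;, &lt;code&gt;props&lt;/code&gt;, &lt;code&gt;children&lt;/code&gt;), the function names &lt;code&gt;toReactSource&lt;/code&gt; and &lt;code&gt;toLayerTree&lt;/code&gt;, and the layer schema are invented here and do not describe Stitch's actual IR or Figma's file format.&lt;/p&gt;

```javascript
// Hypothetical intermediate representation: a format-agnostic component tree.
const ir = {
  type: "Screen",
  props: { name: "Login" },
  children: [
    { type: "Text", props: { value: "Welcome back" }, children: [] },
    { type: "Button", props: { label: "Sign in" }, children: [] },
  ],
};

// Exporter 1: emit React source text as nested createElement calls.
function toReactSource(node) {
  const args = [node.type, JSON.stringify(node.props)];
  for (const child of node.children) {
    args.push(toReactSource(child));
  }
  return "React.createElement(" + args.join(", ") + ")";
}

// Exporter 2: emit a design-tool style layer tree (schema invented for illustration).
function toLayerTree(node) {
  return {
    layerType: node.type === "Screen" ? "FRAME" : "INSTANCE",
    name: node.props.name || node.type,
    autoLayout: node.children.length > 0,
    layers: node.children.map((child) => toLayerTree(child)),
  };
}

console.log(toReactSource(ir));
console.log(JSON.stringify(toLayerTree(ir), null, 2));
```

&lt;p&gt;The design point is that neither exporter knows about the other. Adding a third output format means writing one more function over the same tree, with no change to generation.&lt;/p&gt;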

&lt;h3&gt;
  
  
  Before Stitch: The Traditional Design-to-Development Pipeline
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/google-stitch-ai-design-tool-2026/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Interactive diagram available on TechLogStack (link above).&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Google Stitch Architecture: Multimodal Input to Production Output
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/google-stitch-ai-design-tool-2026/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;



&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;The Multiplayer Technical Challenge&lt;/strong&gt;

&lt;p&gt;Adding simultaneous multi-user editing to an AI-native canvas is harder than adding it to a traditional design tool. In Figma, multiplayer synchronises deterministic object operations with well-understood, &lt;em&gt;CRDT&lt;/em&gt;-inspired semantics (a CRDT, or Conflict-free Replicated Data Type, is a data structure that lets multiple users edit concurrently and merges their changes automatically without conflicts). In Stitch, two users can simultaneously prompt the AI to modify the same canvas, producing non-deterministic outputs that may conflict visually. Google's implementation queues concurrent AI generation requests per canvas object and applies last-write-wins for AI-generated changes, while standard CRDT semantics apply for manual edits.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;
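&lt;p&gt;The queue-then-overwrite behaviour described in the card above can be sketched as follows. This is a guess at the general shape, not Google's implementation: the class name &lt;code&gt;CanvasAiQueue&lt;/code&gt;, the &lt;code&gt;generate&lt;/code&gt; callback, and the stored record fields are all hypothetical, and manual edits are left out entirely.&lt;/p&gt;

```javascript
// Hypothetical sketch: serialise AI generation requests per canvas object.
class CanvasAiQueue {
  constructor(generate) {
    this.generate = generate; // async (objectId, prompt) => generated content
    this.tails = new Map();   // objectId -> promise for the last queued request
    this.state = new Map();   // objectId -> { content, author, seq }
    this.seq = 0;
  }

  // Queue a request; requests for the same object run one at a time, in order.
  request(objectId, author, prompt) {
    const seq = ++this.seq;
    const previous = this.tails.get(objectId) || Promise.resolve();
    const run = previous
      .catch(() => {}) // a failed earlier request must not block the queue
      .then(() => this.generate(objectId, prompt))
      .then((content) => {
        // Last write wins: the most recently queued request overwrites.
        this.state.set(objectId, { content, author, seq });
        return content;
      });
    this.tails.set(objectId, run);
    return run;
  }
}
```

&lt;p&gt;Because requests for one object never overlap, a slow early generation cannot land after a fast later one, so the last prompt queued is always the one whose output stays on the canvas. Requests for different objects still run in parallel.&lt;/p&gt;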


&lt;p&gt;&lt;strong&gt;The design quality ceiling: what Stitch still can't do&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Stitch's core limitation remains consistent across all reviews: generated designs are starting points, not finished products. The AI produces layouts with appropriate components and reasonable visual hierarchy, but professional polish still requires human design expertise: precise spacing, custom illustration integration, brand-specific typography choices, and edge-case state design (empty states, error states, loading states). Stitch is strongest for exploration and prototyping; weakest for production-ready UI that needs to meet professional brand standards.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Acquiring a specialised AI startup accelerates a product category by months, not weeks.&lt;/strong&gt; Google had the models (Gemini) but not the product layer (Galileo). Galileo had the product layer but not the model quality or distribution. The acquisition combined both instantly. Teams building in AI-adjacent product categories should evaluate whether acquiring specialised AI startups is faster than building the application layer from scratch on top of foundation models.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Intermediate representation&lt;/em&gt; between AI generation and format-specific output is the architecture that makes multi-format export viable.&lt;/strong&gt; Generating React directly loses Figma compatibility. Generating Figma directly loses code usability. An IR exports to both, and to future formats not yet defined.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Free with generous limits is a viable disruption strategy when the underlying AI cost is subsidised.&lt;/strong&gt; Google can offer Stitch free because Gemini API calls are already budgeted across Google's infrastructure at marginal cost. Figma cannot match free without destroying its revenue model. This asymmetry is the structural moat Stitch is building: it competes not on feature parity but on a price of zero.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build the complement-not-replace narrative from day one.&lt;/strong&gt; Sarah Drasner's explicit framing of Stitch as a Figma complement — not replacement — reduced designer resistance and encouraged adoption among professional users. Fighting the dominant tool's ecosystem directly creates adversarial resistance. Complementing it creates adoption.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Streaming generation&lt;/em&gt; (delivering AI outputs progressively as they are computed) changes the product experience more profoundly than speed improvements do.&lt;/strong&gt; A 30-second generation showing nothing for 28 seconds feels slow. A 30-second generation showing components appearing in real time and allowing mid-stream steering feels like collaboration. Same underlying model, fundamentally different user experience.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Engineering Glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;CRDT (Conflict-free Replicated Data Type)&lt;/strong&gt; — a data structure designed for distributed systems that allows multiple users to edit the same data concurrently without conflicts, automatically merging changes. Used in Stitch's multiplayer for deterministic manual edits alongside non-deterministic AI-generated changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini 2.5 Pro&lt;/strong&gt; — Google's multimodal frontier model capable of processing text, images, audio, and code simultaneously. Stitch uses it as the core reasoning engine for interpreting design intent and generating UI outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intermediate representation (IR)&lt;/strong&gt; — an abstract, format-agnostic description of design intent — component hierarchy, spacing tokens, visual relationships — that can be translated into multiple output formats (Figma JSON, React, HTML/CSS) without losing design semantics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RLHF (Reinforcement Learning from Human Feedback)&lt;/strong&gt; — a training technique where human evaluators rate model outputs, and those ratings are used to fine-tune the model toward preferred outputs. Used by Stitch to improve component rendering fidelity from ~70% to 95% accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming generation&lt;/strong&gt; — delivering AI outputs progressively as they are computed, rather than waiting for the full generation to complete before showing any output. Enables mid-generation steering and real-time canvas rendering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vibe design&lt;/strong&gt; — the practice of describing interface intent to an AI and refining the output iteratively, rather than building pixel by pixel. The AI design equivalent of vibe coding.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/google-stitch-ai-design-tool-2026/" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Interactive diagrams, source links, and the full reader experience)&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;TechLogStack — built at scale, broken in public, rebuilt by engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>frontend</category>
      <category>javascript</category>
    </item>
    <item>
      <title>LinkedIn Needed a Message Queue. They Built the One the Entire Internet Runs On.</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/linkedin-needed-a-message-queue-they-built-the-one-the-entire-internet-runs-on-24j1</link>
      <guid>https://dev.to/techlogstack/linkedin-needed-a-message-queue-they-built-the-one-the-entire-internet-runs-on-24j1</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1 billion events/day&lt;/strong&gt; at LinkedIn launch in 2011 — immediate production scale from day one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;7 trillion messages/day&lt;/strong&gt; by 2019 — same core architecture, 7,000× growth&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~50 MB/sec&lt;/strong&gt; Kafka producer throughput vs &lt;strong&gt;~2 MB/sec&lt;/strong&gt; ActiveMQ — in the original 2011 benchmark&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;9 bytes&lt;/strong&gt; per-message overhead vs &lt;strong&gt;144 bytes&lt;/strong&gt; in ActiveMQ — 16× storage efficiency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateless brokers&lt;/strong&gt; — consumers track their own offset; broker memory doesn't scale with consumers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;80%+ of Fortune 100&lt;/strong&gt; run Kafka today; Confluent, valued at $4.5B in 2020, went public in 2021&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;In 2010, LinkedIn was drowning in data it couldn't move. Every ML model, every recommendation engine, every real-time feature was starving because there was no reliable way to get activity data from the website into the systems that needed it. Jay Kreps, Jun Rao, and Neha Narkhede spent a year building a fix. They named it after Franz Kafka. The rest of the internet adopted it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;p&gt;By 2010, LinkedIn had dozens of data source systems and dozens of consumer systems (ML models, analytics pipelines, search indexers, real-time features), all needing the same activity stream data. The existing solution was point-to-point custom pipelines: each one custom-built, each one brittle, none sharing infrastructure. Adding one new data source meant writing N new pipelines. Adding one new consumer meant updating M existing sources. Jay Kreps, leading data infrastructure engineering, described the root cause directly: "Everyone wanted to build fancy machine-learning algorithms, but without the data, the algorithms were useless. Getting the data from source systems and reliably moving it around was very difficult."&lt;/p&gt;

&lt;p&gt;Kreps, alongside Jun Rao (from IBM's database group) and Neha Narkhede (from Oracle), evaluated every existing solution. &lt;em&gt;ActiveMQ&lt;/em&gt; (an open-source message broker implementing JMS, designed for reliable ordered message delivery between enterprise applications) and &lt;em&gt;RabbitMQ&lt;/em&gt; (a message broker built around AMQP, designed for flexible routing and delivery guarantees) were built for a different problem — reliable delivery of individual task messages, not high-throughput streaming of millions of activity events. Their per-message broker state tracking consumed memory proportional to outstanding messages. They couldn't support the scenario where a Hadoop job needed to replay yesterday's activity data. Most critically: ActiveMQ's message format carried &lt;strong&gt;144 bytes of overhead per message&lt;/strong&gt;. LinkedIn needed millions of messages per second.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;The Founding Insight: Treat Data Movement Like a Log&lt;/strong&gt;

&lt;p&gt;The breakthrough was recognising that LinkedIn's data movement problem was not a messaging problem; it was a &lt;strong&gt;log problem&lt;/strong&gt;. Databases have used append-only logs for decades: the &lt;em&gt;write-ahead log&lt;/em&gt; (a sequential record of all changes, written before the changes are applied, and used for crash recovery, replication, and point-in-time restoration) is how MySQL and Postgres achieve durability. Jay Kreps asked: what if the data pipeline itself were an append-only log? Producers append events. Consumers read at their own pace. The log retains messages for a configured period. Any consumer can replay from any retained offset. The broker tracks no per-consumer delivery state. That simplicity unlocked everything.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  LinkedIn's Data Was Locked in Silos
&lt;/h4&gt;

&lt;p&gt;By 2010, LinkedIn had an N×M integration problem — every data source needed a custom pipeline to every data destination. Existing messaging systems (ActiveMQ, RabbitMQ) were designed for task queues, not event streams, and couldn't handle LinkedIn's throughput requirements or support replay.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  No Tool Existed for High-Throughput Real-Time Event Streaming
&lt;/h4&gt;

&lt;p&gt;Batch systems (Hadoop) could handle large volumes but only hours later. Traditional message queues could deliver in real-time but couldn't scale to LinkedIn's volume or support replay. No system simultaneously provided high throughput, low latency, durability, replayability, and horizontal scalability. The three engineers concluded that the tool they needed did not exist.&lt;/p&gt;




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  One Year Building Kafka: The Append-Only Distributed Log
&lt;/h4&gt;

&lt;p&gt;Kreps, Rao, and Narkhede spent approximately one year building the first version of Kafka. The core architectural decision was treating the message store as an append-only log rather than a queue. This single choice enabled sequential disk I/O (orders of magnitude faster than random I/O), stateless brokers (consumers track their own position), arbitrary replay (consumers read from any offset), and horizontal partitioning (each partition is an independent log that scales independently).&lt;/p&gt;
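&lt;p&gt;The log-versus-queue distinction can be sketched in a few lines of Python. This is a toy, assuming an in-memory list stands in for the on-disk segment files; the names &lt;code&gt;AppendOnlyLog&lt;/code&gt;, &lt;code&gt;append&lt;/code&gt;, and &lt;code&gt;read&lt;/code&gt; are illustrative and are not Kafka's API.&lt;/p&gt;

```python
class AppendOnlyLog:
    """Toy single-partition log: records are only ever added at the end."""

    def __init__(self):
        self._records = []  # stand-in for sequential segment files on disk

    def append(self, record):
        """Append a record and return the offset it was assigned."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records starting at offset; nothing is deleted."""
        return self._records[offset:offset + max_records]


log = AppendOnlyLog()
for event in ["page_view", "search", "profile_click"]:
    log.append(event)

# Each consumer keeps its own position; the log keeps no per-consumer state.
realtime_offset = 0
batch = log.read(realtime_offset)
realtime_offset += len(batch)  # the consumer advances its own offset

# A second consumer (say, a nightly batch job) replays from the start.
replayed = log.read(0)
```

&lt;p&gt;Reading never removes anything, so the second consumer sees the same three events as the first. A queue that deletes on acknowledgement cannot offer that.&lt;/p&gt;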




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1 Billion Events Per Day at Launch, 7 Trillion by 2019
&lt;/h4&gt;

&lt;p&gt;Kafka went into production at LinkedIn in 2011 and immediately processed over 1 billion events per day. LinkedIn open-sourced it in early 2011. It became an Apache Top-Level Project in October 2012. By 2015: 1 trillion messages per day. By 2019: 7 trillion. Kreps, Narkhede, and Rao left LinkedIn in November 2014 to found Confluent, building the commercial ecosystem around Kafka.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I thought that since Kafka was a system optimized for writing, using a writer's name would make sense. I had taken a lot of lit classes in college and liked Franz Kafka. Plus the name sounded cool for an open source project.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;— Jay Kreps, on naming Kafka, via Quora&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Five Design Decisions That Made Kafka Fast
&lt;/h3&gt;

&lt;p&gt;Kafka's performance advantage was not clever optimisation of a standard architecture — it was a fundamentally different architecture where every key decision reinforced the same goal: maximise throughput for streaming event data. Five decisions stand out as architecturally defining, and each was a deliberate rejection of how existing messaging systems had been built.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;~50 MB/s&lt;/strong&gt; — Kafka producer throughput in the original 2011 benchmark vs ~2 MB/s for ActiveMQ at 200-byte messages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;9 bytes&lt;/strong&gt; — per-message overhead in Kafka vs 144 bytes in ActiveMQ, 16× less framing overhead per message&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateless&lt;/strong&gt; — Kafka brokers; consumer offset tracking is done by the consumer, not the broker&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequential&lt;/strong&gt; — disk access pattern for both writes and reads; append-only means no random I/O
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// The five key Kafka design decisions illustrated in code&lt;/span&gt;

&lt;span class="c1"&gt;// DECISION 1: Append-only log storage (not a queue)&lt;/span&gt;
&lt;span class="c1"&gt;// Each partition is a directory of sequential segment files&lt;/span&gt;
&lt;span class="c1"&gt;// /kafka-logs/my-topic-0/00000000000000000000.log&lt;/span&gt;
&lt;span class="c1"&gt;// /kafka-logs/my-topic-0/00000000000000100000.log&lt;/span&gt;
&lt;span class="c1"&gt;// → Sequential writes: disk seeks are expensive; sequential I/O is ~100x faster&lt;/span&gt;

&lt;span class="c1"&gt;// DECISION 2: Consumer tracks its own offset — broker holds no state&lt;/span&gt;
&lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;consumerOffset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;position&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topicPartition&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// consumer owns this&lt;/span&gt;
&lt;span class="c1"&gt;// → Brokers are stateless: no per-consumer memory, no ack tracking overhead&lt;/span&gt;
&lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;seek&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topicPartition&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// replay from the beginning — any time&lt;/span&gt;

&lt;span class="c1"&gt;// DECISION 3: Topics partitioned for horizontal scale&lt;/span&gt;
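&lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;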
    &lt;span class="s"&gt;"user-activity"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// partition key: same user → same partition = ordered per user&lt;/span&gt;
    &lt;span class="n"&gt;eventJson&lt;/span&gt;  &lt;span class="c1"&gt;// the message payload&lt;/span&gt;
&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// → N partitions = N consumers in parallel = linear throughput scaling&lt;/span&gt;

&lt;span class="c1"&gt;// DECISION 4: Batch I/O from client to broker&lt;/span&gt;
&lt;span class="n"&gt;props&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"batch.size"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16384&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// batch up to 16KB before sending&lt;/span&gt;
&lt;span class="n"&gt;props&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"linger.ms"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;      &lt;span class="c1"&gt;// or wait 5ms for the batch to fill&lt;/span&gt;
&lt;span class="c1"&gt;// Original paper: batch size 50 improved throughput ~10x vs batch size 1&lt;/span&gt;

&lt;span class="c1"&gt;// DECISION 5: Zero-copy transfer via OS sendfile()&lt;/span&gt;
&lt;span class="c1"&gt;// Consumer fetch path: disk page cache → network socket (no userspace copy)&lt;/span&gt;
&lt;span class="c1"&gt;// → No data enters JVM heap → no GC pressure → consistent low latency&lt;/span&gt;
&lt;span class="c1"&gt;// → Delivers data at near-network-hardware-limit throughput&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;The Stateless Broker: The Counterintuitive Masterstroke&lt;/strong&gt;

&lt;p&gt;In ActiveMQ and RabbitMQ, the broker maintains delivery state for every message: who acknowledged it, who hasn't, what needs to be retried. At scale, this per-message state tracking consumes enormous memory and creates a bottleneck. Kafka's solution was radical: &lt;strong&gt;let consumers track their own position&lt;/strong&gt; (their offset in each partition). The broker stores bytes in a log. Consumers read at their own pace and can reset to any offset to replay. The broker's memory footprint is constant regardless of consumer count or message backlog — making horizontal scaling of consumers a configuration change, not an infrastructure problem.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;
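&lt;p&gt;A rough way to see the difference is to count how many pieces of delivery state each model has to hold. The sketch below is back-of-the-envelope arithmetic only; the workload numbers are hypothetical and the function names are made up for illustration.&lt;/p&gt;

```python
def stateful_broker_entries(outstanding_messages, consumers):
    """Ack-tracking broker: one delivery-state entry per message per consumer."""
    return outstanding_messages * consumers


def offset_entries(partitions, consumer_groups):
    """Offset model: one integer per partition per consumer group."""
    return partitions * consumer_groups


# Hypothetical workload, chosen only for illustration.
backlog = 10_000_000
groups = 4
partitions = 64

per_message_state = stateful_broker_entries(backlog, groups)
offset_state = offset_entries(partitions, groups)

print(per_message_state)  # 40000000 entries that grow with the backlog
print(offset_state)       # 256 integers, independent of the backlog
```

&lt;p&gt;Doubling the backlog doubles the first number and leaves the second unchanged, which is the property the card above describes.&lt;/p&gt;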



&lt;p&gt;&lt;strong&gt;Kafka vs traditional message queues — original 2011 benchmarks and design properties:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;ActiveMQ / RabbitMQ&lt;/th&gt;
&lt;th&gt;Kafka&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Storage model&lt;/td&gt;
&lt;td&gt;Queue — messages deleted after ack&lt;/td&gt;
&lt;td&gt;Append-only log — retained by time/size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Broker state&lt;/td&gt;
&lt;td&gt;Tracks ack state per message per consumer&lt;/td&gt;
&lt;td&gt;Stateless — consumers track own offset&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Producer throughput&lt;/td&gt;
&lt;td&gt;~2 MB/sec (ActiveMQ)&lt;/td&gt;
&lt;td&gt;~50 MB/sec (batch size 50)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Message overhead&lt;/td&gt;
&lt;td&gt;144 bytes (ActiveMQ JMS header)&lt;/td&gt;
&lt;td&gt;9 bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Consumer replay&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;Supported — seek to any offset&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Horizontal scale&lt;/td&gt;
&lt;td&gt;Limited (complex cluster configs)&lt;/td&gt;
&lt;td&gt;Native — add partitions, add consumers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use case fit&lt;/td&gt;
&lt;td&gt;Task queues, guaranteed delivery, routing&lt;/td&gt;
&lt;td&gt;Event streaming, log aggregation, activity tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
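&lt;p&gt;The 16× figure is a ratio of per-message overhead, not of total bytes stored. A short worked calculation makes the distinction concrete, assuming the 200-byte message size quoted for the benchmark also applies to the overhead numbers (the source does not state this explicitly).&lt;/p&gt;

```python
PAYLOAD_BYTES = 200      # assumed payload size, taken from the benchmark figure
KAFKA_OVERHEAD = 9       # bytes of overhead per message
ACTIVEMQ_OVERHEAD = 144  # bytes of overhead per message

overhead_ratio = ACTIVEMQ_OVERHEAD / KAFKA_OVERHEAD

kafka_total = PAYLOAD_BYTES + KAFKA_OVERHEAD
activemq_total = PAYLOAD_BYTES + ACTIVEMQ_OVERHEAD

kafka_overhead_share = KAFKA_OVERHEAD / kafka_total
activemq_overhead_share = ACTIVEMQ_OVERHEAD / activemq_total
total_size_ratio = activemq_total / kafka_total

print(f"overhead ratio:          {overhead_ratio:.0f}x")
print(f"Kafka overhead share:    {kafka_overhead_share:.1%}")
print(f"ActiveMQ overhead share: {activemq_overhead_share:.1%}")
print(f"total bytes per message: {total_size_ratio:.2f}x")
```

&lt;p&gt;Under that assumption, overhead is about 4% of each Kafka message and about 42% of each ActiveMQ message, so the same payloads take roughly 1.65× the bytes in ActiveMQ. Smaller payloads widen the gap; larger ones narrow it.&lt;/p&gt;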

&lt;p&gt;&lt;strong&gt;Zero-copy: the OS kernel trick that doubled throughput&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of Kafka's most impactful performance optimisations is invisible to application code. In a traditional data transfer, data moves: disk → kernel buffer → userspace → socket buffer → network. In Kafka's consumer path, the OS &lt;code&gt;sendfile()&lt;/code&gt; syscall transfers data directly from the page cache to the network socket, bypassing userspace entirely. No data is copied into the JVM heap, so there is no GC pressure and no object allocation overhead. At LinkedIn's throughput rates, this optimisation alone accounts for significant throughput gains and, more importantly, consistent low latency even under high load.&lt;/p&gt;
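&lt;p&gt;The same kernel path is reachable from Python, which makes for a compact demonstration. Kafka itself is JVM code and reaches &lt;code&gt;sendfile()&lt;/code&gt; through Java NIO's &lt;code&gt;FileChannel.transferTo&lt;/code&gt;; the sketch below is only an analogy, with a temporary file standing in for a log segment and a local socket pair standing in for the broker-to-consumer connection.&lt;/p&gt;

```python
import os
import socket
import tempfile

# Write a small "log segment" to disk (a stand-in for a Kafka segment file).
payload = b"event-1\nevent-2\nevent-3\n"
with tempfile.NamedTemporaryFile(delete=False) as segment:
    segment.write(payload)
    segment_path = segment.name

# A connected socket pair stands in for the broker-to-consumer connection.
broker_sock, consumer_sock = socket.socketpair()

with open(segment_path, "rb") as f:
    # socket.sendfile() uses os.sendfile() where the platform supports it,
    # so the bytes move from the page cache to the socket inside the kernel.
    # On platforms without it, Python falls back to an ordinary send() loop.
    sent = broker_sock.sendfile(f)
broker_sock.close()

# The "consumer" drains the socket until the broker side is closed.
received = b""
while True:
    chunk = consumer_sock.recv(4096)
    if not chunk:
        break
    received += chunk
consumer_sock.close()
os.remove(segment_path)

print(sent, received == payload)
```

&lt;p&gt;The sending side never reads the file contents into a Python object; it hands the kernel a file and a socket and lets it do the transfer.&lt;/p&gt;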

&lt;p&gt;&lt;strong&gt;The log/table duality: Jay Kreps' deeper insight&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In his 2013 essay "The Log," Kreps articulated a concept beyond Kafka's implementation: the &lt;em&gt;log/table duality&lt;/em&gt;. Every change to a database table can be captured as an entry in a log, and replaying that log from the beginning, applying each event as a state update, rebuilds the table. A table is the current state; the log is the history that produced it. This duality means &lt;strong&gt;a Kafka topic can be treated as both a stream and a table&lt;/strong&gt;: process it as a stream in motion (stream processing) or materialise it as a snapshot (a table). This insight became the foundation for Kafka Streams, ksqlDB, and the stream-processing ecosystem that followed.&lt;/p&gt;
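&lt;p&gt;Both directions of the duality fit in a short Python sketch. The names &lt;code&gt;LoggedTable&lt;/code&gt; and &lt;code&gt;materialise&lt;/code&gt; are hypothetical, and using &lt;code&gt;None&lt;/code&gt; as a delete marker is an assumption made for the example.&lt;/p&gt;

```python
def materialise(changelog):
    """Replay a changelog from the beginning to rebuild the table."""
    table = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)  # None marks a delete (a "tombstone")
        else:
            table[key] = value    # later events overwrite earlier ones
    return table


class LoggedTable:
    """A table that records every change it accepts as a log entry."""

    def __init__(self):
        self.rows = {}
        self.changelog = []

    def put(self, key, value):
        self.rows[key] = value
        self.changelog.append((key, value))

    def delete(self, key):
        self.rows.pop(key, None)
        self.changelog.append((key, None))


profiles = LoggedTable()
profiles.put("alice", "engineer")
profiles.put("bob", "designer")
profiles.put("alice", "manager")
profiles.delete("bob")

# Table to log: four changes were captured. Log to table: replay rebuilds it.
rebuilt = materialise(profiles.changelog)
print(rebuilt == profiles.rows)
```

&lt;p&gt;The table holds one row while the log holds four entries. The log carries strictly more information, which is why the table can always be derived from it.&lt;/p&gt;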




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Kafka's architecture has three layers. The &lt;strong&gt;storage layer&lt;/strong&gt; is a set of partitioned, replicated append-only log files on disk — each partition is an independent, totally ordered sequence of records. The &lt;strong&gt;broker layer&lt;/strong&gt; is a cluster of server processes managing partition assignment, replication, and client connections — holding no consumer state. The &lt;strong&gt;client layer&lt;/strong&gt; is producers writing to partitions and consumer groups reading from them, each group maintaining its own independent offset per partition.&lt;/p&gt;
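&lt;p&gt;The client layer's two core mechanics, routing a key to a partition and sharing partitions across a consumer group, can be sketched as follows. This is a simplification under stated assumptions: CRC32 is a stand-in hash (Kafka's own partitioner uses a different one), round-robin is only one possible assignment strategy, and both function names are invented for the example.&lt;/p&gt;

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition; the same key always lands on the same one."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions, consumers):
    """Round-robin partitions across the consumers in one group."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


NUM_PARTITIONS = 6

# Same key, same partition: this is what gives per-user ordering.
first = partition_for("user-42", NUM_PARTITIONS)
second = partition_for("user-42", NUM_PARTITIONS)

# Each partition has exactly one owner within the group.
group = assign_partitions(NUM_PARTITIONS, ["c1", "c2", "c3"])
print(first == second, group)
```

&lt;p&gt;A second consumer group would run the same assignment independently and keep its own offsets, which is how several downstream systems can read the same topic without coordinating with each other.&lt;/p&gt;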

&lt;h3&gt;
  
  
  Before Kafka: N×M Integration Spaghetti
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/linkedin-kafka-origin-2011/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Interactive diagram available on TechLogStack (link above).&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  After Kafka: The Centralised Log Hub
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/linkedin-kafka-origin-2011/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Interactive diagram available on TechLogStack (link above).&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Inside Kafka: Topics, Partitions, Offsets, and Consumer Groups
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/linkedin-kafka-origin-2011/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Interactive diagram available on TechLogStack (link above).&lt;/em&gt;&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;LinkedIn's Kafka by 2019: The Scale Numbers&lt;/strong&gt;

&lt;p&gt;From 1 billion events per day at launch (2011) to 7 trillion messages per day by 2019 — a &lt;strong&gt;7,000× growth&lt;/strong&gt; in eight years on the same fundamental architecture. Spread across 100+ clusters, 4,000+ brokers, 100,000+ topics, and 7 million partitions. Each message consumed by approximately four consumer groups on average. The most remarkable fact: the append-only partitioned log described in the 2011 paper is still the architecture running at 7 trillion messages per day. Good architecture ages well.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;





&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Before building, verify no existing tool solves your problem at your scale.&lt;/strong&gt; The Kafka team evaluated ActiveMQ, RabbitMQ, and existing log aggregation systems before building. Their conclusion — existing tools were designed for the wrong problem — was evidence-based. The benchmark (50 MB/sec vs 2 MB/sec) made the decision concrete. Never rebuild what can be adopted; never adopt what demonstrably can't serve your workload.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;The append-only log&lt;/em&gt; (a data structure where records are only ever added to the end, never modified in place — enabling sequential I/O, arbitrary consumer replay, and stateless brokers) is the universal data integration primitive.&lt;/strong&gt; Any system moving data between producers and consumers is implementing a log, whether it knows it or not. Recognising and building explicitly on this pattern is what gave Kafka its performance advantage and its flexibility.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stateless brokers make systems horizontally scalable in ways stateful brokers cannot match.&lt;/strong&gt; When the broker tracks delivery state per consumer per message, broker memory scales with consumers × outstanding messages. When consumers track their own offsets, broker memory scales with partitions only. This single architectural choice is why Kafka can serve hundreds of consumer groups without broker degradation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sequential I/O is much faster than random I/O, most dramatically on spinning disks, where every random access pays a seek.&lt;/strong&gt; An append-only log turns a bursty stream of writes into sequential disk operations, allowing Kafka to approach the disk's hardware throughput limit. Systems that update records in place pay random I/O costs on every write. Kafka only appends, and it serves most reads from the OS page cache, which is how a disk-backed system reached the throughput figures in the 2011 benchmark.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Open-sourcing infrastructure that solves a universal problem creates compounding returns.&lt;/strong&gt; LinkedIn open-sourced Kafka in 2011 because the team recognised it solved a problem every data-intensive company had. Community contributions from Netflix, Uber, Twitter, and thousands of others built tooling LinkedIn could never have built alone: Kafka Streams, Kafka Connect, ksqlDB, MirrorMaker, Schema Registry. The return on open-sourcing infrastructure is measured in ecosystem, not just code.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Engineering Glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Append-only log&lt;/strong&gt; — a data structure where records are only ever added to the end, never modified in place. Enables sequential disk I/O, arbitrary consumer replay, and stateless brokers. The core data structure underlying Kafka's architecture and the reason for its performance advantage over traditional message queues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consumer group&lt;/strong&gt; — a set of Kafka consumers that collectively read from a topic, with each partition assigned to exactly one consumer in the group at a time. Enables parallel consumption: a topic with N partitions can be consumed by up to N consumers simultaneously.&lt;/p&gt;

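&lt;p&gt;&lt;strong&gt;Consumer offset&lt;/strong&gt; — the position of a consumer within a partition, marking the next message to read. In Kafka, consumers (not brokers) decide where to read from and when to commit their position. Current Kafka versions store committed offsets in an internal topic (&lt;code&gt;__consumer_offsets&lt;/code&gt;), but the broker still keeps no per-message acknowledgement state, which is the key decision behind the stateless-broker design.&lt;/p&gt;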

&lt;p&gt;&lt;strong&gt;Log/table duality&lt;/strong&gt; — the relationship between a table and its changelog: every change to a table can be recorded as a log entry, and replaying that log from the beginning, applying each event as a state update, rebuilds the table. The theoretical foundation for Kafka Streams and ksqlDB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partition&lt;/strong&gt; — the unit of parallelism in Kafka. Each topic is divided into one or more partitions, each of which is an independent append-only log. Each replica of a partition lives entirely on one broker, and one replica is the leader that accepts writes. Producers choose a partition by key; consumers read from partitions independently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stateless broker&lt;/strong&gt; — a broker that holds no per-consumer delivery state. Kafka brokers store bytes in partitioned logs; consumers own their own offset positions. Broker memory scales with partition count, not consumer count — the property that makes Kafka horizontally scalable to hundreds of consumer groups without broker degradation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write-ahead log (WAL)&lt;/strong&gt; — a sequential record of all changes made to a database, written to disk before the changes are applied. Used for crash recovery and replication in MySQL, Postgres, and virtually every serious database. The inspiration for Kafka's append-only log architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero-copy transfer&lt;/strong&gt; — the use of the OS &lt;code&gt;sendfile()&lt;/code&gt; syscall to transfer data directly from the kernel page cache to a network socket, bypassing userspace entirely. Used in Kafka's consumer fetch path to eliminate JVM heap copies, GC pressure, and the associated latency spikes at high throughput.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/linkedin-kafka-origin-2011/" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Interactive diagrams, source links, and the full reader experience)&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;TechLogStack — built at scale, broken in public, rebuilt by engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>backend</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>The 80% Problem: Why Getting an LLM System to 'Works in Demo' Is 20% of the Work</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/the-80-problem-why-getting-an-llm-system-to-works-in-demo-is-20-of-the-work-3cni</link>
      <guid>https://dev.to/techlogstack/the-80-problem-why-getting-an-llm-system-to-works-in-demo-is-20-of-the-work-3cni</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;0.02 → 0.61&lt;/strong&gt; Cohen's Kappa — LLM judge calibration from near-random to near-human agreement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.69&lt;/strong&gt; — human evaluator Kappa baseline; the meaningful ceiling for any judge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;300 examples&lt;/strong&gt; — hand-crafted benchmark for the Flow agent; necessary but not sufficient&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2 weeks&lt;/strong&gt; — time to close the benchmark-to-production gap using the production mirroring flywheel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weekly&lt;/strong&gt; — Qwen3-32B retraining cadence on H200 GPUs (12h full run)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1,200&lt;/strong&gt; production LLM deployments analysed by ZenML — Shopify's findings are not exceptional, they are universal&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Every team building with LLMs discovers the same brutal truth: 80% quality arrives in a few weeks. The next 15 points, the gap between "impressive demo" and "product I'd trust with my customers", take the rest of the time. Shopify's Flow agent and Sidekick teams lived this curve and came back with a systematic playbook. It is mostly about measurement.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;p&gt;ZenML analysed 1,200 production LLM deployments and found a pattern so consistent it has become a rule: &lt;strong&gt;reaching 80% quality happens quickly, but pushing past 95% requires the majority of total development time&lt;/strong&gt;. The teams that hit 80% in four weeks and spend the next six months trying to reach 95% are not failing — they are experiencing the standard engineering curve for AI systems. The teams that mistake 80% for done are the ones shipping products that quietly erode user trust.&lt;/p&gt;

&lt;p&gt;Shopify's engineering teams, building both Sidekick (the merchant AI assistant) and the Flow agent (automated workflow generation from natural language), lived this curve in production. The Flow agent generates Shopify Flow automations from merchant descriptions — "when an order is over $200, add the customer to my VIP segment" — and produces a structured workflow. It uses &lt;em&gt;tool calling&lt;/em&gt; (a pattern where an LLM is given a set of available functions with descriptions and can request that a specific tool be executed by generating a structured function call — enabling LLMs to take real-world actions beyond text generation) and operates in a domain-specific format. The task sounds well-bounded. In practice, the diversity of merchant intent is vast, edge cases accumulate rapidly, and subtle errors — a wrong condition operator, a missing trigger — produce silently incorrect automations that only fail when a merchant's order actually arrives.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Why Evaluation Is the Hard Part&lt;/strong&gt;

&lt;p&gt;Traditional software has a truth oracle: does the function return the correct value? LLM systems have no such oracle. A response can be grammatically correct, semantically reasonable, formatted perfectly — and still be wrong in ways only a domain expert would notice, or only appear wrong on the tenth interaction in a specific workflow. &lt;strong&gt;Without a reliable way to measure quality, you cannot improve systematically.&lt;/strong&gt; You are optimising blind, hoping the next prompt change or model upgrade makes things better without making other things worse. Evaluation infrastructure is not overhead — it is the prerequisite for all other AI engineering work.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Benchmarks Said Ready; Production Said Otherwise
&lt;/h4&gt;

&lt;p&gt;Shopify's fine-tuned Flow agent passed a hand-crafted 300-example benchmark at high accuracy. When deployed to production shadow traffic, performance on real merchant workflows diverged from the benchmark. The benchmark had been crafted by engineers who knew the system well and implicitly sampled from the distribution they understood. Real merchant intent had a long tail the benchmark didn't capture.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  No Quality Signal Trustworthy Enough to Drive Iteration
&lt;/h4&gt;

&lt;p&gt;The early LLM judge had a &lt;em&gt;Cohen's Kappa&lt;/em&gt; (a statistical measure of agreement between two raters that corrects for chance — Kappa of 0 means agreement no better than random, 1.0 means perfect agreement) of 0.02 — barely better than random agreement with human evaluators. Engineering decisions based on its verdicts were effectively noise. Human evaluation at scale was impractical. Without a trustworthy quality signal, iteration was slow and direction was unclear.&lt;/p&gt;
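&lt;p&gt;Cohen's Kappa is computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. A from-scratch sketch shows why raw agreement overstates a judge's quality. The ten labels below are invented for illustration and are not Shopify's data.&lt;/p&gt;

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e) for two raters' labels."""
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's frequency for every label.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    # Note: undefined (division by zero) if both raters use a single label.
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for ten conversations, for illustration only.
human = ["good", "good", "bad", "good", "bad", "bad", "good", "bad", "good", "bad"]
judge = ["good", "good", "bad", "bad", "bad", "good", "good", "bad", "good", "bad"]

kappa = cohen_kappa(human, judge)
print(round(kappa, 2))  # 0.6
```

&lt;p&gt;The two raters agree on 8 of 10 items, but with a 50/50 label split they would agree on half by chance alone, so Kappa is 0.6 and not 0.8. A judge at 0.02 is adding almost nothing beyond that chance baseline.&lt;/p&gt;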




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Calibrated LLM Judge + Production Mirroring Flywheel
&lt;/h4&gt;

&lt;p&gt;The team iteratively improved the LLM judge through systematic calibration against human labels (Kappa 0.02 → 0.61), then used it to score production traffic at scale. Production mirroring — routing real traffic through both current and candidate models — generated the failure cases that didn't appear in benchmarks. Those failures were fed back into the training dataset, closing the benchmark-to-production gap.&lt;/p&gt;
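&lt;p&gt;The mirroring step can be sketched as a loop that runs each request through both models, serves only the current one, and keeps the candidate's failures as new training cases. Everything here is a hypothetical stand-in written for the example: the function names, the toy models, and the one-line judge are not Shopify's implementation.&lt;/p&gt;

```python
def mirror_traffic(requests, current_model, candidate_model, judge):
    """Run every request through both models; users only see current_model.

    Returns the candidate's failures so they can be added to training data.
    """
    failures = []
    for request in requests:
        served = current_model(request)    # response actually returned
        shadow = candidate_model(request)  # scored, never shown to users
        if judge(request, shadow) != "good":
            failures.append({
                "request": request,
                "candidate_output": shadow,
                "served_output": served,
            })
    return failures


# Hypothetical stand-ins so the sketch runs end to end.
def current_model(request):
    return f"workflow for: {request}"


def candidate_model(request):
    # Pretend the candidate mishandles requests that mention refunds.
    return "" if "refund" in request else f"workflow for: {request}"


def judge(request, output):
    return "good" if output else "bad"


traffic = ["tag orders over $200", "email me on refund", "archive old drafts"]
new_training_cases = mirror_traffic(traffic, current_model, candidate_model, judge)
print(len(new_training_cases))  # 1
```

&lt;p&gt;The value of the loop depends entirely on the judge: with a near-random judge, the collected "failures" would be noise, which is why calibration came first.&lt;/p&gt;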




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Production Gap Closed in Two Weeks with the Flywheel
&lt;/h4&gt;

&lt;p&gt;The gap from "benchmark-ready" to "production-ready" closed in two weeks using the production mirroring flywheel. The fine-tuned Flow agent now serves the majority of production traffic. Weekly retraining cycles on H200 GPUs mean the model continuously improves from new production signal rather than drifting as merchant behaviour evolves.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Building the Evaluation Flywheel
&lt;/h3&gt;

&lt;p&gt;Shopify's evaluation architecture is best understood as a flywheel: production traffic generates failures, failures feed the training pipeline, retraining improves the model, the improved model generates fewer failures, and the cycle continues. Each turn reduces the gap between benchmark performance and production performance. The flywheel only works if each component — quality measurement (LLM judge), failure collection (production mirroring), training (fine-tuning pipeline), deployment (shadow traffic + promotion) — is production-grade itself. A miscalibrated judge produces misleading signal. A flaky training pipeline slows iteration.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;0.61&lt;/strong&gt; — Cohen's Kappa achieved after iterative calibration — close to the human evaluator baseline of 0.69, sufficient to drive reliable engineering decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;300&lt;/strong&gt; — hand-crafted benchmark examples, covering the breadth of expected usage; initial quality gate before shadow testing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2 weeks&lt;/strong&gt; — time to close the benchmark-to-production gap using the production mirroring flywheel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weekly&lt;/strong&gt; — Qwen3-32B retraining cadence on H200 GPUs; 12-hour full training run per cycle
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# LLM judge calibration: the process from Kappa 0.02 to 0.61
# A judge is only useful if it agrees with humans. Measure agreement first.
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cohen_kappa_score&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calibrate_llm_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge_prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calibration_set&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    calibration_set: list of {conversation, human_label} pairs
    human_label: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;good&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bad&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;needs_improvement&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    Returns Cohen&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s Kappa between judge and human labels.
    Target: Kappa &amp;gt;= 0.60 before trusting judge at scale.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;judge_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;calibration_set&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;conversation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;judge_labels&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;human_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;human_label&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;calibration_set&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;cohen_kappa_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;human_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;judge_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# The calibration loop: iterate until the judge is trustworthy.
# Assumes current_judge_prompt already holds the initial rubric, and that
# find_disagreements / improve_prompt are project-specific helpers.
&lt;/span&gt;&lt;span class="n"&gt;kappa&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.02&lt;/span&gt;  &lt;span class="c1"&gt;# initial judge is barely better than random
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;kappa&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.60&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Analyse where the current prompt's verdicts and the human labels disagree
&lt;/span&gt;    &lt;span class="n"&gt;disagreements&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;find_disagreements&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;calibration_set&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_judge_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Improve judge prompt based on disagreement patterns:
&lt;/span&gt;    &lt;span class="c1"&gt;# - Add clarifying criteria for ambiguous cases
&lt;/span&gt;    &lt;span class="c1"&gt;# - Add few-shot examples where human label is the ground truth
&lt;/span&gt;    &lt;span class="c1"&gt;# - Adjust rubric language to match human intuitions
&lt;/span&gt;    &lt;span class="n"&gt;current_judge_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;improve_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_judge_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;disagreements&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;kappa&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calibrate_llm_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_judge_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calibration_set&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Kappa: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;kappa&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# progression: 0.02 → 0.15 → 0.31 → 0.48 → 0.61
&lt;/span&gt;
&lt;span class="c1"&gt;# Once Kappa &amp;gt;= 0.60: use judge to score production traffic at scale
# Once judge is calibrated: production mirroring generates the failure cases
# that benchmarks never captured — feed those failures back into training data
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
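&lt;p&gt;To make the Kappa numbers above less abstract, here is a minimal sketch of what the statistic computes: observed agreement corrected for the agreement two raters would reach by chance. The function name &lt;code&gt;cohen_kappa&lt;/code&gt; and the eight sample labels are illustrative assumptions, not Shopify data; in practice &lt;code&gt;sklearn.metrics.cohen_kappa_score&lt;/code&gt; does the same job.&lt;/p&gt;

```python
from collections import Counter


def cohen_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa = (p_o - p_e) / (1 - p_e) for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # p_o: observed fraction of items on which the two raters agree
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: agreement expected by chance, from each rater's own label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # sketch-level choice: both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels, for illustration only
human = ["good", "good", "bad", "bad", "needs_improvement", "good", "bad", "good"]
judge = ["good", "bad", "bad", "bad", "needs_improvement", "good", "good", "good"]
kappa = cohen_kappa(human, judge)
print(f"kappa = {kappa:.2f}")
```

&lt;p&gt;In this toy sample the raters agree on 6 of 8 items (75%), yet Kappa comes out at about 0.58 because a large share of that agreement would be expected by chance. That gap is why raw agreement rates overstate how trustworthy a judge is.&lt;/p&gt;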



&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Production Mirroring: The Ground Truth Test&lt;/strong&gt;

&lt;p&gt;Benchmarks are necessary but not sufficient. A benchmark reflects the understanding of the engineers who created it. Production traffic reflects the actual diversity of user intent — including all edge cases, unusual phrasings, and unexpected use patterns no engineer anticipated. &lt;strong&gt;Production mirroring routes a percentage of real traffic through both the current model and the candidate model simultaneously&lt;/strong&gt;, comparing outputs. Differences trigger human review of high-value or uncertain cases. This is the only way to discover whether a model improvement that looks good on a benchmark actually performs better for real users — or merely performs better on what engineers think real users want.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;
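&lt;p&gt;The routing idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Shopify's implementation: &lt;code&gt;handle_request&lt;/code&gt;, &lt;code&gt;in_mirror_sample&lt;/code&gt;, the 5% default, and the in-memory review queue are all hypothetical, and the candidate is called inline here where a real system would run it asynchronously.&lt;/p&gt;

```python
import hashlib
from typing import Callable

Model = Callable[[str], str]


def in_mirror_sample(request_id: str, percent: int) -> bool:
    """Deterministically place a request in the mirrored slice by hashing its id."""
    bucket = int(hashlib.sha256(request_id.encode("utf-8")).hexdigest(), 16) % 100
    return percent > bucket  # bucket is 0..99, so percent=5 mirrors about 5% of ids


def handle_request(request_id: str, prompt: str, current_model: Model,
                   candidate_model: Model, review_queue: list[dict],
                   percent: int = 5) -> str:
    """Serve the current model's answer; shadow the candidate on sampled traffic."""
    served = current_model(prompt)
    if in_mirror_sample(request_id, percent):
        shadow = candidate_model(prompt)  # computed for comparison only, never returned
        if shadow != served:
            review_queue.append({"request_id": request_id, "prompt": prompt,
                                 "current": served, "candidate": shadow})
    return served


# Toy stand-ins for the two models (hypothetical)
review_queue: list[dict] = []
answer = handle_request("req-001", "Tag orders over $100",
                        lambda p: p.upper(), lambda p: p.lower(),
                        review_queue, percent=100)
print(answer, len(review_queue))
```

&lt;p&gt;Exact string comparison is only a placeholder. Two LLM outputs almost never match character for character, so a production version would more plausibly score both outputs with the calibrated judge and queue the cases where the scores diverge.&lt;/p&gt;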



&lt;p&gt;&lt;strong&gt;Synthetic training data: how Shopify generated the Flow agent dataset&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Flow agent's fine-tuning data was almost entirely &lt;strong&gt;synthetic&lt;/strong&gt;: generated by an LLM, not labelled by humans. The pipeline had three steps: (1) sample a diverse set of validated production workflows, at least one per unique workflow descriptor, from merchants with two or more qualifying workflows; (2) use a stronger LLM to generate a plausible natural-language merchant request that would lead to that workflow; (3) construct the ideal multi-turn tool-call trajectory from request to completed workflow. The resulting dataset had two properties that manual annotation lacks: &lt;strong&gt;scale&lt;/strong&gt; (the production workflow corpus is large) and &lt;strong&gt;grounding&lt;/strong&gt; (every training example was derived from a real workflow that actually ran). Synthetic data from real production outputs is the emerging standard for fine-tuning domain specialists.&lt;/p&gt;
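&lt;p&gt;The three steps can be sketched as follows. Everything here is an assumption made for illustration: the field names (&lt;code&gt;merchant_id&lt;/code&gt;, &lt;code&gt;descriptor&lt;/code&gt;), the helper names, and the two callables standing in for the LLM calls. The sketch also keeps exactly one workflow per descriptor, where the source says "at least one".&lt;/p&gt;

```python
from collections import defaultdict
from typing import Callable


def sample_workflows(workflows: list[dict]) -> list[dict]:
    """Step 1: keep merchants with two or more workflows, then one workflow per descriptor."""
    by_merchant: dict[str, list[dict]] = defaultdict(list)
    for wf in workflows:
        by_merchant[wf["merchant_id"]].append(wf)
    sampled: dict[str, dict] = {}
    for merchant_workflows in by_merchant.values():
        if len(merchant_workflows) >= 2:
            for wf in merchant_workflows:
                sampled.setdefault(wf["descriptor"], wf)  # first workflow seen per descriptor
    return list(sampled.values())


def build_examples(workflows: list[dict],
                   write_request: Callable[[dict], str],
                   build_trajectory: Callable[[str, dict], list[dict]]) -> list[dict]:
    """Steps 2 and 3: write a merchant request for each workflow, then the tool-call trajectory."""
    examples = []
    for wf in sample_workflows(workflows):
        request = write_request(wf)                 # step 2: stand-in for the stronger LLM
        trajectory = build_trajectory(request, wf)  # step 3: ideal multi-turn tool calls
        examples.append({"request": request, "trajectory": trajectory,
                         "source_workflow": wf["id"]})
    return examples


# Hypothetical workflows; the lambdas replace real LLM calls
workflows = [
    {"id": "w1", "merchant_id": "m1", "descriptor": "tag-high-value-order"},
    {"id": "w2", "merchant_id": "m1", "descriptor": "notify-low-stock"},
    {"id": "w3", "merchant_id": "m2", "descriptor": "cancel-risky-order"},
    {"id": "w4", "merchant_id": "m3", "descriptor": "tag-high-value-order"},
    {"id": "w5", "merchant_id": "m3", "descriptor": "email-on-refund"},
]
examples = build_examples(
    workflows,
    write_request=lambda wf: f"Please set up: {wf['descriptor']}",
    build_trajectory=lambda request, wf: [{"tool": "create_workflow", "workflow_id": wf["id"]}],
)
print([e["source_workflow"] for e in examples])
```

&lt;p&gt;Each example keeps a pointer back to its source workflow, which is the grounding property in practice: any training row can be traced to something that actually ran.&lt;/p&gt;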

&lt;p&gt;&lt;strong&gt;Tangle: the ML pipeline that enables weekly retraining&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full training pipeline (data collection, synthetic data generation, fine-tuning, evaluation, deployment) runs on Tangle, Shopify's open-source ML experimentation platform. Tangle composes each pipeline step as a reproducible workflow with intelligent caching: only the steps affected by a change re-run. A change to the synthetic data generator doesn't trigger a full pipeline rerun; only the data generation step and its downstream steps re-execute. The caching infrastructure is what makes weekly retraining economically and operationally viable. Without it, the iteration cycle would be measured in months, not weeks.&lt;/p&gt;
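&lt;p&gt;One common way to get "only affected steps re-run" behaviour is to key each step's output on a hash of its name, version, and inputs. The sketch below illustrates that general technique; it is not Tangle's API, and &lt;code&gt;CachedPipeline&lt;/code&gt;, &lt;code&gt;run_step&lt;/code&gt;, and the toy steps are hypothetical.&lt;/p&gt;

```python
import hashlib
import json
from typing import Any, Callable


class CachedPipeline:
    """Caches step outputs under a hash of (step name, step version, inputs)."""

    def __init__(self) -> None:
        self.cache: dict[str, Any] = {}
        self.executed: list[str] = []  # names of steps that actually ran

    def run_step(self, name: str, version: str,
                 func: Callable[..., Any], *inputs: Any) -> Any:
        key_material = json.dumps([name, version, inputs], sort_keys=True, default=str)
        key = hashlib.sha256(key_material.encode("utf-8")).hexdigest()
        if key not in self.cache:  # cache miss: step code or upstream data changed
            self.cache[key] = func(*inputs)
            self.executed.append(name)
        return self.cache[key]


def run_pipeline(pipeline: CachedPipeline, generator_version: str) -> str:
    raw = pipeline.run_step("collect", "v1", lambda: ["wf-1", "wf-2"])
    synthetic = pipeline.run_step(
        "generate", generator_version,
        lambda rows: [f"{generator_version}:{row}" for row in rows], raw)
    return pipeline.run_step(
        "train", "v1", lambda rows: "model trained on " + ",".join(rows), synthetic)


pipeline = CachedPipeline()
run_pipeline(pipeline, "gen-v1")
first_run = list(pipeline.executed)
pipeline.executed.clear()
run_pipeline(pipeline, "gen-v2")  # only the generator changed
second_run = list(pipeline.executed)
print(first_run, second_run)
```

&lt;p&gt;On the second run the collection step is a cache hit, while the generator and the training step that consumes its output both re-execute. The downstream step needs no explicit invalidation: its inputs changed, so its key changed.&lt;/p&gt;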




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The evaluation architecture for production LLM systems has four components that form a cycle. &lt;strong&gt;Benchmark evaluation&lt;/strong&gt; provides fast, reproducible quality gates during development. &lt;strong&gt;LLM-as-judge scoring&lt;/strong&gt; provides continuous quality measurement at production traffic scale. &lt;strong&gt;Production mirroring&lt;/strong&gt; provides ground truth about whether a candidate model performs better for real users. &lt;strong&gt;The training flywheel&lt;/strong&gt; converts production failures into training examples, closing the gap each cycle. Each component is necessary; none is sufficient alone.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Production LLM Evaluation Flywheel
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/shopify-llm-evaluation-production-2025/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  LLM Judge Architecture: From Random Agreement to Near-Human
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/shopify-llm-evaluation-production-2025/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;



&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;The Merchant Simulator as Pre-Deployment Safety Net&lt;/strong&gt;

&lt;p&gt;The merchant simulator sits between benchmark evaluation and production mirroring — a &lt;strong&gt;synthetic production environment&lt;/strong&gt;. It replays real merchant intents (extracted from production conversations) against candidate systems in a controlled environment, before any real merchant sees the new system. This catches the specific failure mode benchmarks miss: correct behaviour on engineer-anticipated test cases, incorrect behaviour on the realistic distribution of merchant intent. The simulator doesn't replace production mirroring — it prevents the worst regressions from reaching the production mirroring stage at all.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Golden datasets: why they are non-negotiable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ZenML's analysis is unambiguous: every successful production LLM deployment they analysed maintains human-in-the-loop golden datasets for critical domains. LLM judges are used for velocity, scoring production traffic at scale, but they drift. A judge calibrated against last month's quality standards may give wrong verdicts on today's outputs. Golden datasets (small, carefully curated, human-labelled examples representing ground truth) anchor judge calibration and detect judge drift. Without a golden dataset, you have no way to know when your quality measurement system itself has stopped working.&lt;/p&gt;
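&lt;p&gt;A drift check against a golden dataset can be as small as the sketch below. It is an illustration under assumptions: the function names, the 0.05 tolerance, and the toy judges are hypothetical, and it uses the raw agreement rate to stay self-contained where a real check would more likely track Cohen's Kappa.&lt;/p&gt;

```python
from typing import Callable

Judge = Callable[[str], str]


def agreement_rate(golden_set: list[dict], judge: Judge) -> float:
    """Fraction of golden examples where the judge's verdict matches the human label."""
    if not golden_set:
        raise ValueError("golden set must not be empty")
    matches = sum(judge(ex["conversation"]) == ex["human_label"] for ex in golden_set)
    return matches / len(golden_set)


def judge_has_drifted(golden_set: list[dict], judge: Judge,
                      baseline_agreement: float, tolerance: float = 0.05) -> bool:
    """Flag drift when agreement falls more than `tolerance` below the calibrated baseline."""
    return baseline_agreement - agreement_rate(golden_set, judge) > tolerance


# Hypothetical golden examples and judges, for illustration only
golden_set = [
    {"conversation": "issue resolved in one turn", "human_label": "good"},
    {"conversation": "refund resolved politely", "human_label": "good"},
    {"conversation": "agent ignored the question", "human_label": "bad"},
    {"conversation": "wrong workflow created", "human_label": "bad"},
]
stable_judge = lambda conv: "good" if "resolved" in conv else "bad"
lenient_judge = lambda conv: "good"  # a judge that has started approving everything

print(judge_has_drifted(golden_set, stable_judge, baseline_agreement=1.0))
print(judge_has_drifted(golden_set, lenient_judge, baseline_agreement=1.0))
```

&lt;p&gt;Running a check like this on a schedule turns the golden dataset into an alarm: the judge keeps scoring production traffic, and the alarm fires when its verdicts stop matching the human ground truth.&lt;/p&gt;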




&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You will spend more time building evaluation infrastructure than the application logic itself.&lt;/strong&gt; This is not inefficiency — it is the correct allocation of engineering effort for probabilistic systems. Accept it before starting. Budget for it explicitly. ZenML's summary from 1,200 deployments: &lt;em&gt;"Perhaps this is a truism by now, but you'll spend more time building evaluation infrastructure than you will on the actual application logic. And if you're not, you're probably shipping broken features."&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;LLM-as-judge&lt;/em&gt; (using a language model to evaluate the outputs of another language model, calibrated against human labels to produce quality scores at scale) is the scalable evaluation pattern.&lt;/strong&gt; But an uncalibrated judge (Kappa 0.02) is worse than useless — it gives false confidence. Calibrate your judge against human labels before trusting its verdicts. Target Kappa ≥ 0.6. The human evaluator baseline (0.69 for Shopify) is the meaningful ceiling — don't optimise past it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A benchmark that passes is a necessary condition, not a sufficient one.&lt;/strong&gt; Benchmarks reflect what engineers anticipated; production reflects what users actually do. Always follow benchmark success with production mirroring — routing real traffic through both current and candidate systems and comparing outputs. Two weeks of shadow traffic is the standard cost of this final validation step.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Synthetic data generation&lt;/em&gt; (using an LLM to create training examples from real production outputs — generating natural-language merchant requests from real production workflows) is the path to scalable fine-tuning training data.&lt;/strong&gt; Manual annotation doesn't scale. Synthetic data derived from production outputs does — and it's grounded in real-world distribution rather than engineer-imagined distribution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Retraining cycle speed determines how fast you can respond to production drift.&lt;/strong&gt; Merchant behaviour changes, new workflow patterns emerge, new merchant categories join Shopify — a model trained on last quarter's data will drift from current reality. Weekly retraining on production signal, made economically viable by efficient infrastructure (intelligent caching, H200 GPUs, 12h runs), keeps the model aligned with the world it serves.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Engineering Glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cohen's Kappa&lt;/strong&gt; — a statistical measure of agreement between two raters that corrects for chance agreement. Kappa of 0 means agreement no better than random; 1.0 means perfect agreement; 0.6+ is generally considered the threshold for a trustworthy judge. The Shopify LLM judge improved from 0.02 to 0.61; the human evaluator baseline was 0.69.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-tuning&lt;/strong&gt; — the process of further training a pre-trained LLM on a domain-specific dataset to improve performance on a specific task. Used by Shopify to specialise a base model (Qwen3-32B) for Shopify Flow workflow generation, with weekly retraining cycles to keep pace with evolving merchant behaviour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Golden dataset&lt;/strong&gt; — a small, carefully curated set of human-labelled evaluation examples representing ground truth for a specific domain. Used to calibrate LLM judges and detect judge drift over time. The anchor of any reliable LLM evaluation system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM-as-judge&lt;/strong&gt; — the pattern of using a language model to evaluate the outputs of another language model, calibrated against human labels to produce quality scores at scale without requiring manual human evaluation of every production interaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production mirroring&lt;/strong&gt; — routing a percentage of real production traffic through both the current deployed model and a candidate model simultaneously, comparing outputs to measure whether the candidate performs better for real users. The ground truth test that benchmark evaluation cannot replicate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Synthetic data generation&lt;/strong&gt; — using an LLM to create training examples from a production data source — for example, generating plausible natural-language merchant requests from real validated production workflows. Enables scalable training data creation grounded in real-world distribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool calling&lt;/strong&gt; — a pattern where an LLM is given a set of available functions (tools) with descriptions and can request that a specific tool be executed by generating a structured function call. Enables LLMs to take real-world actions beyond text generation — used by Shopify's Flow agent to generate and execute workflow operations.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/shopify-llm-evaluation-production-2025/" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Interactive diagrams, source links, and the full reader experience)&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;TechLogStack — built at scale, broken in public, rebuilt by engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Quantum Computing Just Beat the Best Classical Computer — Here Is the Engineering That Made It Happen</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/quantum-computing-just-beat-the-best-classical-computer-here-is-the-engineering-that-made-it-1ie3</link>
      <guid>https://dev.to/techlogstack/quantum-computing-just-beat-the-best-classical-computer-here-is-the-engineering-that-made-it-1ie3</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3,000×&lt;/strong&gt; speedup — quantum completed in 2 minutes what classical needed 100+ hours for&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;60%&lt;/strong&gt; gate count reduction by Q-CTRL's Fire Opal compiler vs native Qiskit — the engineering that made it possible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;12,635 atoms&lt;/strong&gt; — largest biologically meaningful molecule ever simulated on quantum hardware (May 5 2026)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;40×&lt;/strong&gt; larger protein simulation than six months prior — driven by the EWF-TrimSQD algorithm&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;120 qubits, 10,000+ two-qubit gates&lt;/strong&gt; — circuit depth previously considered infeasible on NISQ hardware&lt;/li&gt;
&lt;li&gt;IBM Starling roadmap: &lt;strong&gt;200 logical qubits&lt;/strong&gt; under error correction by 2029&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;On May 6, 2026, Q-CTRL ran a materials science simulation on an IBM quantum computer in 2 minutes. The best-in-class classical solver, ITensor's TDVP running on a 32-vCPU AWS instance, ran for over 100 hours before its results diverged from the quantum output. The day before, IBM, Cleveland Clinic, and RIKEN simulated a 12,635-atom protein, 40 times larger than anything attempted six months prior. After 30 years of promises, practical quantum advantage arrived. What actually changed was a compiler.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;For years, quantum computing has been a promise. Now, quantum computers are producing results that matter to science. The systems we simulated here are the kind of molecules that biologists and chemists work with in the real world.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;— Jay Gambetta, Director of IBM Research, IBM Think 2026, Boston&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On May 19 2026, Google Trends showed a BREAKOUT signal — the highest possible designation — on the query "what is quantum computing in simple terms." The trigger was two announcements that landed within 48 hours of each other. On May 5, scientists at Cleveland Clinic, RIKEN, and IBM used quantum computers to simulate &lt;strong&gt;trypsin, a protein with 12,635 atoms&lt;/strong&gt; — the largest biologically meaningful molecule ever simulated on quantum hardware, 40 times larger than what the same method could achieve just six months prior. On May 6, Q-CTRL demonstrated a &lt;strong&gt;3,000× speedup&lt;/strong&gt; on a problem of real commercial relevance. The physics community called it practical quantum advantage — the first time a quantum computer had demonstrably outperformed the best classical tool on a problem that matters outside a laboratory.&lt;/p&gt;

&lt;p&gt;Understanding why these results matter requires understanding what stood in the way. &lt;em&gt;NISQ&lt;/em&gt; (Noisy Intermediate-Scale Quantum — the current era of quantum computing, characterised by processors with 50–1,000 qubits that are not error-corrected, meaning errors accumulate as circuit depth grows and place hard limits on what computations can run reliably) quantum computers accumulate errors with every &lt;em&gt;two-qubit gate&lt;/em&gt; (the fundamental entangling operation in quantum computing — essential for quantum algorithms but a primary source of error in NISQ hardware, with typical error rates of 0.1–1% per gate). At shallow circuit depths with a handful of gates, error mitigation can recover useful results. At 10,000+ gates across 120 qubits — the depth required for commercially meaningful simulations — errors historically compounded until the output was indistinguishable from noise. This was the wall. The May 2026 results are not the wall coming down. They are the first evidence that engineers have found a way to work precisely enough within its constraints that real problems now fall on the quantum side of it.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;What Q-CTRL Actually Did&lt;/strong&gt;

&lt;p&gt;Q-CTRL used an IBM 156-qubit Heron processor on the IBM Quantum Platform, enhanced by their own Fire Opal performance-management software. The target: the &lt;em&gt;Fermi-Hubbard model&lt;/em&gt; (a foundational physics model describing how electrons interact in a crystal lattice — capturing phenomena like high-temperature superconductivity) — a system of 60 interacting electrons using 120 qubits and executing over 10,000 two-qubit gate operations. The classical competitor was ITensor's TDVP solver on a 32-vCPU, 64GB-RAM AWS instance — the best-in-class classical tool for this problem class. Quantum: &lt;strong&gt;~2 minutes&lt;/strong&gt;. Classical: &lt;strong&gt;over 100 hours&lt;/strong&gt; before the two results diverged irreconcilably.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  NISQ Wall: Errors Compound Before Computation Completes
&lt;/h4&gt;

&lt;p&gt;NISQ quantum processors accumulate errors with every two-qubit gate. For shallow circuits (hundreds of gates), error mitigation can recover useful results. For commercially meaningful simulations (10,000+ gates), errors historically compounded until the quantum output was indistinguishable from random noise. This wall had blocked practical quantum advantage for three decades.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Gate Count Was the Critical Variable
&lt;/h4&gt;

&lt;p&gt;Every additional two-qubit gate compounds the error: if each gate succeeds with probability 1 − p, the chance that a circuit of N gates runs without a fault falls roughly as (1 − p)^N. IBM's native Qiskit compiler produced correct but gate-heavy implementations. Q-CTRL's Fire Opal compiler took the same algorithm and reduced gate count by 60% through circuit optimisation and error suppression. That 60% reduction was the difference between circuits that collapsed into noise and circuits that produced valid results.&lt;/p&gt;
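&lt;p&gt;A back-of-envelope calculation shows why gate count dominates. The sketch assumes a deliberately crude model: every two-qubit gate fails independently with the same probability, taken here as 0.1% (the low end of the range quoted earlier), and the gate counts are the approximate figures from this article's conceptual code. It ignores error suppression entirely, so it shows sensitivity to gate count, not the fidelity of the actual experiment.&lt;/p&gt;

```python
def estimated_fidelity(n_gates: int, error_per_gate: float) -> float:
    """Probability that all n_gates succeed, assuming independent, identical gate errors."""
    return (1.0 - error_per_gate) ** n_gates


ERROR_PER_GATE = 0.001  # assumed 0.1% two-qubit gate error

native = estimated_fidelity(15_000, ERROR_PER_GATE)    # roughly the native gate count
optimised = estimated_fidelity(6_000, ERROR_PER_GATE)  # after a 60% reduction

print(f"native:    {native:.2e}")
print(f"optimised: {optimised:.2e}")
print(f"ratio:     {optimised / native:.0f}x")
```

&lt;p&gt;Because the relationship is exponential, removing 60% of the gates improves the estimate by a factor of several thousand under this model. Both absolute numbers are still small, which is consistent with the article's point that compilation alone was not enough and runtime error suppression had to carry the rest.&lt;/p&gt;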




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Two Simultaneous Breakthroughs: Materials and Biology
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;May 5:&lt;/strong&gt; IBM, Cleveland Clinic, and RIKEN simulated a 12,635-atom protein using quantum-centric supercomputing — fragmenting the molecule, computing quantum-mechanical behaviour on IBM Heron processors, and assembling results on Fugaku and Miyabi-G supercomputers. &lt;strong&gt;May 6:&lt;/strong&gt; Q-CTRL demonstrated 3,000× speedup on the Fermi-Hubbard model, completing in 2 minutes what took classical computers 100+ hours.&lt;/p&gt;




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Practical Quantum Advantage: The Field's First
&lt;/h4&gt;

&lt;p&gt;On May 6 2026, Q-CTRL declared practical quantum advantage — the first time a quantum computer had outperformed the best available classical tool on a problem of known commercial relevance, using hardware accessible to any developer via the IBM Quantum Platform. IBM CEO Arvind Krishna had predicted quantum advantage would arrive in 2026. The prediction was correct.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Engineering Stack That Made It Possible
&lt;/h3&gt;

&lt;p&gt;The Q-CTRL result did not emerge from better quantum hardware alone. It emerged from a full engineering stack combining IBM's hardware, Q-CTRL's compiler, and years of quantum control research. Three layers mattered: the &lt;strong&gt;hardware layer&lt;/strong&gt; (IBM Heron's 156-qubit chip with improved coherence times and gate fidelity), the &lt;strong&gt;compilation layer&lt;/strong&gt; (Q-CTRL's Fire Opal reducing gate count by 60%), and the &lt;strong&gt;error suppression layer&lt;/strong&gt; (runtime techniques that actively suppress errors during execution). None of these layers alone would have been sufficient — the result is an emergent property of all three operating together.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3,000×&lt;/strong&gt; — wall-clock speedup of quantum over classical: 2 minutes vs 100+ hours on the best available classical hardware and software&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;60%&lt;/strong&gt; — gate count reduction by Fire Opal vs native Qiskit — the single optimisation that made the circuit depth feasible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;12,635&lt;/strong&gt; — atoms in the trypsin protein simulated by Cleveland Clinic + RIKEN + IBM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;40×&lt;/strong&gt; — increase in simulation system size achieved in six months, driven by the EWF-TrimSQD algorithm
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
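&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Conceptual: What Fire Opal does differently from native Qiskit compilation
# The 60% gate reduction is the engineering story in code form
# Illustrative pseudocode: build_fermi_hubbard_circuit, the backend name, and the
# fire_opal.run call with its keyword arguments are placeholders, not the published Fire Opal API
&lt;/span&gt;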
&lt;span class="c1"&gt;# NATIVE QISKIT: correct but gate-heavy
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qiskit&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QuantumCircuit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transpile&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qiskit_ibm_runtime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QiskitRuntimeService&lt;/span&gt;

&lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QiskitRuntimeService&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;backend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ibm_heron_r2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 156-qubit Heron
&lt;/span&gt;
&lt;span class="c1"&gt;# Fermi-Hubbard Trotter circuit at 90 steps:
# Naive implementation produces ~15,000+ two-qubit (CX) gates
&lt;/span&gt;&lt;span class="n"&gt;circuit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_fermi_hubbard_circuit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_qubits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_trotter_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;native&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transpile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;circuit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ~15,000+ CX gates → error rate exceeds threshold → output is noise
&lt;/span&gt;
&lt;span class="c1"&gt;# Q-CTRL FIRE OPAL: noise-aware compilation
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;fire_opal&lt;/span&gt;

&lt;span class="c1"&gt;# Fire Opal applies four optimisations simultaneously:
# 1. Circuit rewriting — finds equivalent circuits with fewer gates
# 2. Noise-aware qubit mapping — minimises cross-talk between physical qubits
# 3. Dynamical decoupling — inserts refocusing pulses to cancel drift errors
# 4. Gate fusion — combines adjacent compatible gates into single operations
&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fire_opal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;circuits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;circuit&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;optimization_level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;aggressive&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;error_suppression&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dynamical_decoupling&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gate_twirling&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ~6,000 CX gates — 60% reduction
# Circuit runs within error tolerance → produces results accurate enough
# to match and then exceed the classical TDVP benchmark
# Wall time: ~2 minutes
# Classical TDVP equivalent: 100+ hours before diverging irreconcilably
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;Error Suppression vs Error Correction: The Critical Distinction&lt;/strong&gt;

&lt;p&gt;The May 2026 results were achieved with &lt;strong&gt;error suppression&lt;/strong&gt;, not error correction. &lt;strong&gt;Error correction&lt;/strong&gt; (the goal for 2029) uses logical qubits — groups of physical qubits encoding information redundantly, detecting and fixing errors in real-time. It requires hundreds of physical qubits per logical qubit. &lt;strong&gt;Error suppression&lt;/strong&gt; (what Q-CTRL and IBM use now) cannot fix errors — it minimises them through circuit optimisation, noise-aware compilation, and runtime control. Error suppression works within NISQ limits. Error correction eliminates those limits entirely. The 3,000× result was achieved within the NISQ limits. What becomes possible once error correction arrives is qualitatively different.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Cleveland Clinic protein simulation: EWF-TrimSQD explained&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The May 5 simulation used a quantum-centric supercomputing (QCSC) approach, pairing IBM Heron quantum processors at Cleveland Clinic (USA) and RIKEN (Japan) with two classical supercomputers: Fugaku at RIKEN and Miyabi-G at the University of Tokyo. The key algorithm was EWF-TrimSQD (Embedding Workflow with Tailored Reduced-qubit Molecular Dynamics), a quantum-classical hybrid that fragmented the 12,635-atom trypsin protein into computable pieces, computed quantum-mechanical behaviour on QPUs (up to 94 qubits, ~6,000 quantum operations per fragment), and reconstructed the full protein's behaviour on classical supercomputers. The 40× system size increase in six months came from algorithmic improvement in how fragments were computed and assembled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IBM Quantum roadmap: Loon to Starling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;IBM Quantum Loon (November 2025) was the first processor to demonstrate all hardware components required for fault-tolerant quantum computing (FTQC): c-couplers for long-range qubit connectivity, qubit reset between computations, and high-fidelity gates at FTQC-relevant speeds. IBM also achieved real-time decoding of qLDPC codes (quantum Low-Density Parity-Check codes, IBM's chosen family of error-correcting codes, which require fewer physical qubits per logical qubit than the surface code) in under 480 nanoseconds, a full year ahead of schedule. IBM Starling targets 200 logical qubits under full error correction by 2029.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Both May 2026 achievements reflect the same architectural pattern: quantum processors are specialised accelerators for specific types of computation, tightly integrated with classical CPUs and GPUs that handle the parts of the problem where quantum offers no advantage. IBM calls this &lt;strong&gt;Quantum-Centric Supercomputing (QCSC)&lt;/strong&gt; — a heterogeneous computing architecture where tasks are assigned to the compute layer where they run best. Quantum does not replace classical computing. It extends it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The NISQ Error Accumulation Problem: Why Circuit Depth Is the Wall
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/ibm-quantum-advantage-2026/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;
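&lt;p&gt;As a rough illustration of why circuit depth is the wall, the sketch below assumes every two-qubit gate fails independently with the same probability p, so the chance of an error-free run is (1 - p)^N for N gates. The 0.5% error rate and the 1,000-gate baseline are illustrative assumptions, not figures from Q-CTRL or IBM; only the 60% gate reduction comes from this article.&lt;/p&gt;

```python
def success_probability(two_qubit_gates, error_per_gate):
    """Chance that no two-qubit gate fails, assuming independent errors."""
    return (1.0 - error_per_gate) ** two_qubit_gates


# Hypothetical numbers, chosen only to illustrate the trend.
ERROR_PER_GATE = 0.005   # assumed 0.5% two-qubit gate error
BASELINE_GATES = 1000    # assumed gate count before optimisation
REDUCTION = 0.60         # 60% fewer gates, as reported for Fire Opal

optimised_gates = round(BASELINE_GATES * (1 - REDUCTION))

baseline = success_probability(BASELINE_GATES, ERROR_PER_GATE)
optimised = success_probability(optimised_gates, ERROR_PER_GATE)

print(f"baseline:  {BASELINE_GATES} gates, success ~ {baseline:.4f}")
print(f"optimised: {optimised_gates} gates, success ~ {optimised:.4f}")
print(f"improvement factor ~ {optimised / baseline:.1f}x")
```

&lt;p&gt;Under this simplified model the success probability falls exponentially with gate count, which is why removing gates pays off far more than linearly. Real devices have correlated and gate-dependent errors, so treat the numbers as a trend rather than a prediction.&lt;/p&gt;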


&lt;h3&gt;
  
  
  Quantum-Centric Supercomputing (QCSC): The Cleveland Clinic Architecture
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/ibm-quantum-advantage-2026/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  IBM Quantum Roadmap: From NISQ to Fault Tolerance (2025–2029)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/ibm-quantum-advantage-2026/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;



&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;What These Results Are Not&lt;/strong&gt;

&lt;p&gt;The Fermi-Hubbard result is not proof that quantum computers beat classical computers at everything. The advantage holds for this specific class of fermionic simulation problems, which scale poorly for classical computers by a known theoretical argument. Breaking RSA-2048 with Shor's algorithm requires hundreds of thousands to millions of physical qubits under error correction — orders of magnitude harder. The May 2026 results are the first concrete proof that quantum advantage is achievable on useful, commercially relevant problems with today's hardware, properly engineered.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;





&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Quantum advantage arrived not from better qubits alone, but from better compilers.&lt;/strong&gt; Q-CTRL's Fire Opal reduced gate count by 60% on IBM hardware that was already available. The 3,000× speedup was enabled by those 60% fewer gates, which in turn were enabled by years of investment in quantum control theory and noise-aware compilation. Hardware and software co-optimisation, not hardware alone, crossed the threshold.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Quantum-centric supercomputing&lt;/em&gt; (a heterogeneous architecture pairing quantum processors with classical CPUs and GPUs, assigning each part of a problem to the resource where it runs best) is how quantum advantage works in practice.&lt;/strong&gt; Quantum computers do not replace classical computers — they accelerate the specific parts where quantum mechanics provides exponential advantage. Drug discovery, materials simulation, and optimisation are the first domains where this integration delivers measurable commercial results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Error suppression and circuit optimisation are the engineering disciplines that matter most in the NISQ era.&lt;/strong&gt; Error correction remains the long-term goal (IBM Starling, 2029), but error suppression — reducing gate count, noise-aware mapping, dynamical decoupling — is the bridge that makes today's hardware useful for real problems. Engineers building on quantum hardware should invest as much in compilation optimisation as in circuit design.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The rate of improvement is accelerating.&lt;/strong&gt; 40× larger molecule simulation in six months. A year-ahead-of-schedule qLDPC decoder. A &lt;em&gt;Trotter&lt;/em&gt; depth (Trotterisation is a simulation technique that approximates quantum time evolution by breaking it into small sequential steps) of 90 steps on 120 qubits with useful accuracy, which was previously considered infeasible on NISQ hardware and would have been impossible two years ago. Organisations that start developing quantum-advantage applications now will be ahead of those waiting for the technology to "mature."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Practical quantum advantage arrived on public cloud infrastructure.&lt;/strong&gt; Q-CTRL's 3,000× speedup was achieved on IBM Quantum Platform hardware accessible via API to any registered developer — not on a private research machine. The cloud-first approach IBM took in 2016 is what made May 2026's results broadly verifiable and immediately applicable.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
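&lt;p&gt;The Trotter technique named in lesson 4 can be illustrated on a toy problem. The sketch below evolves a single qubit under the hypothetical Hamiltonian H = X + Z and compares the exact evolution with a textbook first-order product formula. It uses plain Python lists in place of a quantum SDK and is not the circuit used in the Q-CTRL experiment.&lt;/p&gt;

```python
import math

# Pauli matrices as nested lists (pure Python, no external libraries).
IDENTITY = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]
X_PLUS_Z = [[1, 1], [1, -1]]  # toy Hamiltonian H = X + Z; H squared = 2 * I


def matmul(a, b):
    """Multiply two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def evolve(theta, op, norm=1.0):
    """Return exp(-i * theta * op), valid when (op / norm) squared is I."""
    c, s = math.cos(theta * norm), math.sin(theta * norm)
    return [[c * IDENTITY[i][j] - 1j * s * op[i][j] / norm for j in range(2)]
            for i in range(2)]


def trotter(t, steps):
    """First-order product formula: (e^{-iXt/n} e^{-iZt/n})^n."""
    one_step = matmul(evolve(t / steps, X), evolve(t / steps, Z))
    result = IDENTITY
    for _ in range(steps):
        result = matmul(result, one_step)
    return result


def max_error(t, steps):
    """Largest entrywise gap between the Trotter and exact evolutions."""
    exact = evolve(t, X_PLUS_Z, norm=math.sqrt(2))
    approx = trotter(t, steps)
    return max(abs(exact[i][j] - approx[i][j])
               for i in range(2) for j in range(2))


for steps in (1, 10, 90):
    print(f"{steps:3d} steps: error ~ {max_error(1.0, steps):.4f}")
```

&lt;p&gt;The approximation error shrinks as the number of steps grows, but every extra step adds gates. On noisy hardware that trade-off is what makes a useful result at 90 steps notable.&lt;/p&gt;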




&lt;h2&gt;
  
  
  Engineering Glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Dynamical decoupling&lt;/strong&gt; — a quantum error suppression technique that inserts short refocusing pulses during circuit execution to cancel low-frequency noise and drift errors. One of the core techniques used by Q-CTRL's Fire Opal to reduce effective error rates without requiring full error correction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fermi-Hubbard model&lt;/strong&gt; — a foundational model in condensed matter physics describing interacting electrons on a lattice. Used to understand high-temperature superconductivity, Mott insulators, and quantum magnetism. The cost of exact classical simulation grows exponentially with system size: a 60-electron system has 2^60 possible states. The Q-CTRL result is presented as the first real-world demonstration of a quantum advantage on this class of problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fire Opal&lt;/strong&gt; — Q-CTRL's performance-management software for quantum computers. Applies circuit rewriting, noise-aware qubit mapping, dynamical decoupling, and gate fusion to reduce two-qubit gate count and improve circuit fidelity. Achieved 60% gate reduction vs native Qiskit on the Fermi-Hubbard circuit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logical qubit&lt;/strong&gt; — a fault-tolerant qubit encoded across multiple physical qubits, with error detection and correction running continuously. The target unit for IBM Starling (200 logical qubits by 2029). Contrasted with physical qubits, which are noisy and uncorrected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NISQ (Noisy Intermediate-Scale Quantum)&lt;/strong&gt; — the current era of quantum computing, characterised by processors with 50–1,000 physical qubits that are not error-corrected. Errors accumulate as circuit depth grows, placing hard limits on computation length. The May 2026 results were achieved within NISQ constraints, not beyond them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;qLDPC (quantum Low-Density Parity-Check codes)&lt;/strong&gt; — IBM's chosen quantum error-correcting code, requiring fewer physical qubits per logical qubit than the surface code used by most competitors. IBM achieved real-time qLDPC decoding in under 480 nanoseconds in November 2025, a year ahead of schedule.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quantum-centric supercomputing (QCSC)&lt;/strong&gt; — IBM's heterogeneous computing architecture pairing quantum processors with classical CPUs and GPUs, assigning each part of a computation to the resource where it runs best. The architectural model used in the Cleveland Clinic protein simulation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two-qubit gate&lt;/strong&gt; — the fundamental entangling operation in quantum computing that creates correlations between qubits. Essential for quantum algorithms but a primary source of error in NISQ hardware. Reducing two-qubit gate count is the primary lever for improving circuit fidelity on today's hardware.&lt;/p&gt;
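&lt;p&gt;The Fermi-Hubbard entry above cites 2^60 states for a 60-electron system. To make that scale concrete, the sketch below estimates the memory needed just to hold a dense state vector, assuming a brute-force simulation that stores one double-precision complex amplitude (16 bytes) per basis state. Practical classical methods avoid storing the full vector, so this is a naive illustration and not a description of any specific simulator.&lt;/p&gt;

```python
BYTES_PER_AMPLITUDE = 16  # one complex amplitude as two 64-bit floats
UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB"]


def state_vector_bytes(n_sites):
    """Bytes needed for a dense vector over 2**n_sites basis states."""
    return (2 ** n_sites) * BYTES_PER_AMPLITUDE


def human_readable(n_bytes):
    """Format a byte count with binary (1024-based) units."""
    # Each unit step is 2**10, so the bit length picks the unit directly.
    index = min(max((n_bytes.bit_length() - 1) // 10, 0), len(UNITS) - 1)
    return f"{n_bytes / 1024 ** index:.0f} {UNITS[index]}"


for n in (20, 30, 40, 50, 60):
    print(f"{n} sites: {human_readable(state_vector_bytes(n))}")
```

&lt;p&gt;Each additional ten sites multiplies the requirement by 1,024. At 60 sites the dense vector needs 16 EiB, far beyond the memory of any single machine.&lt;/p&gt;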




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/ibm-quantum-advantage-2026/" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Interactive diagrams, source links, and the full reader experience)&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;TechLogStack — built at scale, broken in public, rebuilt by engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>webdev</category>
      <category>machinelearning</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Netflix Unleashed a Monkey With a Weapon in Its Own Data Center — On Purpose</title>
      <dc:creator>TechLogStack</dc:creator>
      <pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/techlogstack/netflix-unleashed-a-monkey-with-a-weapon-in-its-own-data-center-on-purpose-3ph0</link>
      <guid>https://dev.to/techlogstack/netflix-unleashed-a-monkey-with-a-weapon-in-its-own-data-center-on-purpose-3ph0</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2008&lt;/strong&gt; — database corruption, 3 days of darkness, entire DVD operation halted&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2011&lt;/strong&gt; — Chaos Monkey deployed; instance-killing runs every business day in production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10+&lt;/strong&gt; members of the Simian Army — from instance kills (Chaos Monkey) to full region failures (Chaos Kong)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business hours only&lt;/strong&gt; — the essential design constraint that makes chaos safe and pedagogical&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;September 2014&lt;/strong&gt; — AWS reboots 10% of EC2 instances without warning; Netflix serves customers without interruption&lt;/li&gt;
&lt;li&gt;Chaos Monkey spawned an entire engineering discipline now practiced at LinkedIn, Google, Amazon, and Twilio&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;It was 2011 and Netflix had just migrated hundreds of microservices to AWS. Their architecture was distributed, horizontally scaled, and theoretically fault-tolerant. But theory and production are different things. The only way to know if a system could survive failures was to cause failures — constantly, deliberately, during business hours, and in production. So they built a monkey.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while we continue serving our customers without interruption.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;— Yury Izrailevsky &amp;amp; Ariel Tseitlin, The Netflix Simian Army, Netflix Tech Blog, July 19 2011&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The origin of Chaos Monkey is not a clever engineering insight — it is a three-day disaster. In August 2008, Netflix was still primarily a DVD-by-mail business, running on vertically scaled servers in its own datacentres. A &lt;strong&gt;major database corruption&lt;/strong&gt; took down the entire system. For three days, Netflix could not ship DVDs to its customers. It was a &lt;em&gt;single point of failure&lt;/em&gt; (a component whose failure brings down the entire system — the exact opposite of a fault-tolerant distributed architecture) at the most basic level: one database, one failure mode, total outage.&lt;/p&gt;

&lt;p&gt;Netflix's engineering leadership concluded that the only path forward was to move toward &lt;strong&gt;highly reliable, horizontally scalable, distributed systems in the cloud&lt;/strong&gt;. They chose AWS. The migration presented a new problem: moving from a monolith with a small number of catastrophic failure points to a &lt;em&gt;microservices architecture&lt;/em&gt; (a system design where an application is broken into many small, independently deployable services communicating over a network — improving scalability at the cost of increased distributed systems complexity) with hundreds of services, each potentially failing in its own unique way. The engineers designed graceful degradation: if recommendations failed, show popular titles instead; if search was slow, streaming should still work. They wrote the code, reviewed it, tested it in staging — and then realised they had no way to know if the fault tolerance actually worked without experiencing actual failures.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;The Core Insight: Fail Constantly&lt;/strong&gt;

&lt;p&gt;Netflix's founding philosophy for Chaos Engineering was radical in its simplicity: &lt;strong&gt;the best way to avoid failure is to fail constantly&lt;/strong&gt;. If you only experience failures accidentally, in production, at 3am, your engineers have no muscle memory for responding to them and your systems have never been forced to prove their resilience claims. If you fail constantly, during business hours, with engineers present — your systems either prove they can recover or they expose the gaps so engineers can fix them before those gaps become incidents.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;h3&gt;
  
  
  Problem
&lt;/h3&gt;

&lt;h4&gt;
  
  
  August 2008: Database Corruption, Three Days of Darkness
&lt;/h4&gt;

&lt;p&gt;Netflix's vertically scaled infrastructure suffered a major database corruption that halted DVD shipping for three days. The root cause was architectural: a single relational database instance, a single point of failure, no recovery path faster than manual intervention. The outage made the problem concrete: this architecture couldn't support Netflix's growth.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cause
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Distributed Systems Are Only Theoretically Resilient
&lt;/h4&gt;

&lt;p&gt;Moving to hundreds of microservices on AWS solved the single-point-of-failure problem in theory — but raised new questions: did the code actually implement the graceful degradation it was designed for? Staging environments couldn't answer this. Code review couldn't answer this. The only honest answer required production failures — and those were the thing Netflix was trying to avoid.&lt;/p&gt;




&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Chaos Monkey: Production Failure on a Schedule
&lt;/h4&gt;

&lt;p&gt;Netflix built Chaos Monkey — a script that randomly terminates EC2 instances during business hours — and deployed it in all production environments. Engineers came in every day knowing Chaos Monkey was running, knowing their services might get an instance killed at any moment, and knowing they had to build recovery mechanisms or face a very bad afternoon. The tool made fault tolerance a daily engineering discipline, not a theoretical design principle.&lt;/p&gt;




&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;h4&gt;
  
  
  September 2014: AWS Reboots 10% of Its Servers. Netflix Shrugs.
&lt;/h4&gt;

&lt;p&gt;On September 25 2014, AWS rebooted approximately 10% of its EC2 instances without warning. Netflix's systems handled it without customer impact. Netflix explicitly credited Chaos Monkey: the engineers had been building and proving recovery mechanisms every day for years. When AWS created an unplanned failure event at scale, Netflix's systems responded automatically, gracefully, and without requiring an emergency war room.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Building a Fault-Tolerant Culture
&lt;/h3&gt;

&lt;p&gt;The most important thing Chaos Monkey fixed was not a technical system — it was an organisational incentive. Before Chaos Monkey, engineers could ship theoretically fault-tolerant but practically fragile code without facing immediate consequences. The fragility only became visible during a real, unplanned outage — at which point it was someone else's problem. After Chaos Monkey, the consequences were immediate and personal: if your service didn't handle instance failures gracefully, Chaos Monkey would expose this &lt;strong&gt;during your working hours, while you were at your desk&lt;/strong&gt;, with your team watching.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2011&lt;/strong&gt; — the year Chaos Monkey was publicly announced, three years after the 2008 database outage that triggered the AWS migration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10+&lt;/strong&gt; — members of the Simian Army at peak, each targeting a different failure category&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business hours&lt;/strong&gt; — the scheduling constraint that made Chaos Monkey safe; failures during working hours with engineers present&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;September 2014&lt;/strong&gt; — the real-world validation: AWS reboots 10% of EC2 instances; Netflix serves customers without interruption
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified version of what Chaos Monkey does
# Real implementation: originally Java, rebuilt in Go for v2.0 (2016)
# Runs continuously during configurable business hours
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ChaosMonkey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aws_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;excluded_clusters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aws_client&lt;/span&gt;
        &lt;span class="c1"&gt;# config must expose termination_interval_seconds (read in run())
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;excluded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;excluded_clusters&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_business_hours&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Only run during business hours — engineers must be present.
        This is the key safety constraint of Chaos Monkey&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s original design.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;weekday&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt;   &lt;span class="c1"&gt;# Monday–Friday
&lt;/span&gt;            &lt;span class="mi"&gt;9&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hour&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;17&lt;/span&gt;      &lt;span class="c1"&gt;# 9am–5pm local time
&lt;/span&gt;        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_business_hours&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="n"&gt;clusters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_all_clusters&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="k"&gt;continue&lt;/span&gt;

                    &lt;span class="c1"&gt;# Pick one instance at random from each cluster
&lt;/span&gt;                    &lt;span class="n"&gt;instances&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_running_instances&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;instances&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="k"&gt;continue&lt;/span&gt;

                    &lt;span class="n"&gt;victim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;instances&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                    &lt;span class="c1"&gt;# Terminate it. No warning. No coordination.
&lt;/span&gt;                    &lt;span class="c1"&gt;# If the system doesn't survive this, engineers know
&lt;/span&gt;                    &lt;span class="c1"&gt;# immediately — and fix it before it becomes a 3am incident.
&lt;/span&gt;                    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;terminate_instance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Chaos Monkey] Terminated &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;victim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                          &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;in cluster &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;termination_interval_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;The Simian Army: Expanding Beyond Instance Kills&lt;/strong&gt;

&lt;p&gt;The success of Chaos Monkey triggered a proliferation. If randomly killing instances built resilience to instance failures, what would it take to become resilient to other failure categories? Netflix announced the &lt;strong&gt;Simian Army&lt;/strong&gt; in July 2011 — a suite of failure-injection tools each targeting a different failure class. &lt;em&gt;Latency Monkey&lt;/em&gt; injected artificial delays to simulate network degradation. &lt;em&gt;Conformity Monkey&lt;/em&gt; shut down instances not following engineering best practices. &lt;em&gt;Doctor Monkey&lt;/em&gt; removed unhealthy instances from service. &lt;em&gt;Janitor Monkey&lt;/em&gt; cleaned up unused cloud resources. &lt;em&gt;Chaos Gorilla&lt;/em&gt; simulated the complete failure of an entire AWS availability zone. And above all of these: &lt;strong&gt;Chaos Kong&lt;/strong&gt; — simulating the complete failure of an entire AWS region.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;



&lt;p&gt;&lt;/p&gt;
  Failure Injection Testing (FIT): the evolution beyond instance kills
  &lt;br&gt;
In 2014, Netflix engineers (including Kolton Andrus, who later co-founded Gremlin) introduced FIT — Failure Injection Testing. Where Chaos Monkey operated at the infrastructure level (kill an EC2 instance), FIT operated at the application level: injecting failure metadata through &lt;em&gt;Zuul&lt;/em&gt; (Netflix's edge proxy handling all requests from devices to backend services) to simulate specific service failures with surgical precision. FIT could say "for this specific user's request, pretend the recommendations service is timing out" without actually degrading the recommendations service for everyone. This precision made chaos experiments far more targeted and safer to run continuously — and became the pattern that tools like Gremlin later commercialised.&lt;br&gt;


&lt;p&gt;&lt;/p&gt;
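&lt;p&gt;The request-scoped idea behind FIT can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Netflix's FIT or Zuul API: the names &lt;code&gt;RequestContext&lt;/code&gt;, &lt;code&gt;tag_at_edge&lt;/code&gt;, &lt;code&gt;call_service&lt;/code&gt;, and &lt;code&gt;InjectedFailure&lt;/code&gt; are hypothetical. The edge attaches failure metadata to selected requests, and each service call consults that metadata before doing real work.&lt;/p&gt;

```python
from dataclasses import dataclass, field


class InjectedFailure(Exception):
    """Raised instead of calling a service that the request marks as failed."""


@dataclass
class RequestContext:
    """Per-request metadata, attached at the edge and passed downstream."""
    user_id: str
    failed_services: set = field(default_factory=set)


def tag_at_edge(user_id, experiment):
    """Edge step: attach failure metadata only for users in the experiment.

    `experiment` maps a user_id to the set of services to fail for that user.
    """
    return RequestContext(user_id, set(experiment.get(user_id, ())))


def call_service(ctx, service_name, real_call):
    """Injection point: consult the request context before the real call."""
    if service_name in ctx.failed_services:
        raise InjectedFailure(f"{service_name} failed by experiment")
    return real_call()


def home_page(ctx):
    """Caller with graceful degradation: fall back to popular titles."""
    try:
        rows = call_service(ctx, "recommendations", lambda: ["picked for you"])
    except InjectedFailure:
        rows = ["popular titles"]
    return rows


experiment = {"user-42": {"recommendations"}}
print(home_page(tag_at_edge("user-42", experiment)))  # ['popular titles']
print(home_page(tag_at_edge("user-7", experiment)))   # ['picked for you']
```

&lt;p&gt;Only the tagged request sees the failure; every other user still reaches the real service. That scoping is what keeps the blast radius of an experiment small.&lt;/p&gt;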

&lt;p&gt;&lt;/p&gt;
  Chaos Monkey 2.0: open-sourced and rebuilt in Go
  &lt;br&gt;
Chaos Monkey was open-sourced in 2012 and rebuilt in 2016 as version 2.0. The new version was written in Go, depended on Spinnaker as its deployment platform, and scheduled terminations from a configured mean time between terminations (rather than a fixed probability) for more predictable test coverage. Version 2.0 also added Trackers: Go objects that report instance terminations to external monitoring systems, so that Chaos Monkey events can be correlated with application metrics and alerts. The Spinnaker dependency became a significant constraint: teams unwilling to adopt Spinnaker found Chaos Monkey 2.0 inaccessible, which opened market space for alternatives like Gremlin.&lt;br&gt;


&lt;p&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Netflix's architecture in 2011 was organised around a principle that Chaos Monkey enforced: every service must be independently deployable, independently scalable, and independently recoverable. The microservices were connected through REST APIs, each service maintaining its own data store and exposing a versioned interface to its consumers. Chaos Monkey operated at the EC2 instance layer. When an instance was terminated, the load balancer in front of that cluster detected the unhealthy instance and stopped routing traffic to it. If the cluster had sufficient redundancy, other instances absorbed the traffic without degradation. If not, the service degraded — and the engineers learned something they needed to know.&lt;/p&gt;
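&lt;p&gt;The recovery path described above can be sketched as a toy simulation. The &lt;code&gt;Cluster&lt;/code&gt; class and instance IDs below are hypothetical and stand in for a real load balancer with health checks. The sketch shows only the routing behaviour: once an instance is terminated, traffic is spread over the survivors.&lt;/p&gt;

```python
import random


class Cluster:
    """A named group of identical instances behind one load balancer."""

    def __init__(self, name, instance_ids):
        self.name = name
        self.healthy = list(instance_ids)
        self._next = 0

    def terminate_random_instance(self, rng):
        """What a Chaos Monkey style tool does: remove one instance."""
        victim = rng.choice(self.healthy)
        self.healthy.remove(victim)
        return victim

    def route(self):
        """Round-robin over instances that still pass health checks."""
        if not self.healthy:
            raise RuntimeError(f"{self.name}: no healthy instances left")
        instance = self.healthy[self._next % len(self.healthy)]
        self._next += 1
        return instance


rng = random.Random(0)  # seeded so the sketch is repeatable
cluster = Cluster("recommendations", ["i-a", "i-b", "i-c"])

victim = cluster.terminate_random_instance(rng)
served_by = {cluster.route() for _ in range(10)}

print(f"terminated {victim}; traffic now served by {sorted(served_by)}")
```

&lt;p&gt;With three instances the cluster keeps serving after one is lost. A cluster built with a single instance would raise on the next request, which is the kind of gap the daily terminations are meant to surface.&lt;/p&gt;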

&lt;h3&gt;
  
  
  The Simian Army: Failure Coverage Across Infrastructure Layers
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/netflix-chaos-monkey-2011/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  How Netflix's Architecture Handles Chaos Monkey Instance Loss
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/netflix-chaos-monkey-2011/#architecture" rel="noopener noreferrer"&gt;View interactive diagram on TechLogStack →&lt;/a&gt;&lt;/p&gt;



&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;The Behavioural Economics of Chaos Engineering&lt;/strong&gt;

&lt;p&gt;Chaos Monkey's deepest contribution to Netflix's culture was &lt;strong&gt;aligning incentives&lt;/strong&gt;. Without it, the cost of fragile code was paid by whoever happened to be on-call when a real failure occurred — often not the engineer who wrote the fragile code. With Chaos Monkey, the cost was paid immediately and visibly by the team whose service broke. Engineers who experienced a Chaos Monkey failure during business hours had a powerful motivator to invest in proper fault tolerance: they didn't want to experience it again. This is DevOps incentive design at its finest — not policy mandates, but a system where the right behaviour is the path of least resistance.&lt;br&gt;

&lt;/p&gt;
&lt;/div&gt;


&lt;p&gt;&lt;/p&gt;
  What Chaos Monkey doesn't test
  &lt;br&gt;
Chaos Monkey's instance-termination model is powerful but deliberately narrow. It does not test &lt;strong&gt;network partitions&lt;/strong&gt; (instances visible but unreachable), &lt;strong&gt;latency degradation&lt;/strong&gt; (Latency Monkey's job), &lt;strong&gt;data corruption&lt;/strong&gt;, or &lt;strong&gt;slow memory leaks&lt;/strong&gt; that cause gradual performance degradation over hours. Chaos Monkey's successors in the Simian Army, and later tools like Gremlin, were created to cover these gaps. The original insight — failing constantly builds resilience — generalises to all failure types, but the specific mechanism must match the specific failure mode being tested. A chaos engineering programme that only kills instances is missing most of the failure surface.&lt;br&gt;


&lt;p&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Designing for fault tolerance is not the same as having fault tolerance.&lt;/strong&gt; Netflix's engineers wrote graceful degradation code. Chaos Monkey tested whether it actually worked. Until production failure exercises the code path, you don't know whether your fault tolerance design survived contact with reality. Chaos Monkey converts theoretical resilience into empirical evidence.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Chaos Engineering&lt;/em&gt; (deliberately injecting controlled failures into production systems to expose resilience gaps before they become unplanned outages) must be practised during business hours with humans present.&lt;/strong&gt; The purpose is learning, not destruction. Chaos experiments run at 3am, when no one is available to respond, create exactly the incidents that chaos engineering is supposed to prevent.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Align incentives with the behaviour you want.&lt;/strong&gt; Chaos Monkey made the cost of fragile code immediate and personal — the engineer whose service broke during business hours paid the cost of fixing it right then. Without this alignment, resilience engineering is aspirational. With it, resilience engineering is survival instinct.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The &lt;em&gt;blast radius&lt;/em&gt; (the scope of impact when a single component fails) of individual failures is only measurable through testing.&lt;/strong&gt; A microservices architecture where every service failure cascades to every other provides less reliability than a monolith, not more. Chaos Monkey surfaces these cascade dependencies so they can be eliminated before a real failure exposes them at scale.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start at the instance level and escalate gradually.&lt;/strong&gt; Netflix began with Chaos Monkey (instances), expanded to Chaos Gorilla (availability zones), then Chaos Kong (regions). Each level was attempted only after the system reliably survived the previous one. Expand scope only once you're confident the current scope is solved.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
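
&lt;p&gt;The business-hours rule from the lessons above can be sketched as a guard in front of random instance selection. This is a minimal illustration, not Netflix's implementation: the 09:00 to 17:00 weekday window and the names &lt;code&gt;in_business_hours&lt;/code&gt; and &lt;code&gt;pick_victim&lt;/code&gt; are assumptions, and the caller is assumed to supply the instance list and do the actual termination.&lt;/p&gt;

```python
import random
from datetime import datetime
from typing import Optional, Sequence

BUSINESS_DAYS = range(0, 5)    # Monday=0 .. Friday=4
BUSINESS_HOURS = range(9, 17)  # 09:00-16:59 local time; an assumed staffed window


def in_business_hours(now: datetime) -> bool:
    """True only on weekdays inside the assumed staffed window."""
    return now.weekday() in BUSINESS_DAYS and now.hour in BUSINESS_HOURS


def pick_victim(instances: Sequence[str], now: datetime,
                rng: random.Random) -> Optional[str]:
    """Return one instance to terminate, or None when no experiment should run."""
    if not instances or not in_business_hours(now):
        return None  # nobody around to learn from the failure, so skip it
    return rng.choice(instances)
```

&lt;p&gt;Passing the clock and the random source in as arguments keeps the guard easy to test, and the same shape could wrap a wider scope (an availability zone, then a region) once the narrower one is routinely survived.&lt;/p&gt;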




&lt;h2&gt;
  
  
  Engineering Glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Blast radius&lt;/strong&gt; — the scope of impact when a single component fails. Chaos engineering is designed to continuously measure and minimise blast radius by forcing service-level isolation. A microservices architecture where every service failure cascades to all others has a blast radius equivalent to a monolith.&lt;/p&gt;
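
&lt;p&gt;As a rough illustration, blast radius can be estimated from a map of hard dependencies (calls with no fallback) by walking it backwards from the failed service. The graph, the service names, and the function &lt;code&gt;blast_radius&lt;/code&gt; are hypothetical; a real system would only learn which dependencies are truly hard by testing them.&lt;/p&gt;

```python
from collections import deque
from typing import Dict, Set


def blast_radius(hard_deps: Dict[str, Set[str]], failed: str) -> Set[str]:
    """Services that transitively depend on `failed` with no fallback."""
    # Invert the graph: for each service, which services call it?
    callers: Dict[str, Set[str]] = {}
    for service, deps in hard_deps.items():
        for dep in deps:
            callers.setdefault(dep, set()).add(service)

    affected: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, set()):
            if caller != failed and caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


# Hypothetical example: homepage hard-depends on recommendations.
coupled = {"homepage": {"recommendations", "catalog"},
           "recommendations": {"ratings"}}
# Same system after homepage gains a fallback for recommendations.
isolated = {"homepage": {"catalog"},
            "recommendations": {"ratings"}}
```

&lt;p&gt;In the coupled graph a &lt;code&gt;ratings&lt;/code&gt; failure reaches both &lt;code&gt;recommendations&lt;/code&gt; and &lt;code&gt;homepage&lt;/code&gt;; in the isolated graph it stops at &lt;code&gt;recommendations&lt;/code&gt;.&lt;/p&gt;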

&lt;p&gt;&lt;strong&gt;Chaos Engineering&lt;/strong&gt; — the discipline of deliberately injecting controlled failures into production systems during business hours, with engineers present, in order to proactively expose resilience gaps before they become unplanned outages. Formalised as a named discipline in the 2015 &lt;em&gt;Principles of Chaos Engineering&lt;/em&gt; document by Netflix's Casey Rosenthal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chaos Kong&lt;/strong&gt; — the most extreme Simian Army tool, simulating the complete failure of an entire AWS region. Built after Netflix had proven resilience to instance failures (Chaos Monkey) and AZ failures (Chaos Gorilla). Tests active-active multi-region deployment under full regional failure conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FIT (Failure Injection Testing)&lt;/strong&gt; — a Netflix evolution beyond Chaos Monkey that operates at the application layer rather than the infrastructure layer, injecting failure metadata through Zuul to simulate specific service failures for specific users without degrading the service for everyone.&lt;/p&gt;
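
&lt;p&gt;The idea of request-scoped failure injection can be sketched in a few lines. This is an assumption-laden illustration, not Netflix's FIT or the Zuul API: the header name &lt;code&gt;x-fail-service&lt;/code&gt; and the functions &lt;code&gt;tag_request&lt;/code&gt; and &lt;code&gt;call_service&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
class InjectedFailure(Exception):
    """Raised in place of a real call when the request is tagged to fail."""


def tag_request(headers: dict, user_id: str, test_users: set, target: str) -> dict:
    """Edge step: attach failure metadata only for opted-in test users."""
    tagged = dict(headers)  # copy so the caller's headers are not mutated
    if user_id in test_users:
        tagged["x-fail-service"] = target  # hypothetical header name
    return tagged


def call_service(name: str, headers: dict, real_call):
    """Client step: fail this call only if the request carries a matching tag."""
    if headers.get("x-fail-service") == name:
        raise InjectedFailure(name)
    return real_call()
```

&lt;p&gt;Because the tag travels with the request, only the selected users see the simulated outage while everyone else hits the real service.&lt;/p&gt;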

&lt;p&gt;&lt;strong&gt;Microservices architecture&lt;/strong&gt; — a system design where an application is broken into many small, independently deployable services communicating over a network. Improves scalability and team autonomy at the cost of increased distributed systems complexity and new categories of failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rambo Architecture&lt;/strong&gt; — Netflix's internal term for the design philosophy Chaos Monkey enforced: each service must be able to succeed no matter what, even on its own. If a dependent service is down, handle it gracefully. Every service is both a potential failure source and a potential victim of failures, and must be designed for both roles simultaneously.&lt;/p&gt;
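
&lt;p&gt;A minimal sketch of that philosophy, assuming a hypothetical homepage that prefers personalised rows but can serve a static list when the recommendations dependency is down. All function names here are illustrative.&lt;/p&gt;

```python
def with_fallback(primary, fallback):
    """Wrap a dependency call so its failure degrades the response instead of breaking it."""
    def guarded(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Broad catch keeps the sketch short; real code would narrow it and log.
            return fallback(*args, **kwargs)
    return guarded


def personalised_rows(user_id: str) -> list:
    # Simulated outage of the (hypothetical) recommendations service.
    raise ConnectionError("recommendations service unavailable")


def popular_rows(user_id: str) -> list:
    return ["Top 10", "New releases"]  # static, non-personalised default


homepage_rows = with_fallback(personalised_rows, popular_rows)
```

&lt;p&gt;Here &lt;code&gt;homepage_rows("u1")&lt;/code&gt; returns the static rows rather than raising. A fallback path like this is what Chaos Monkey exercises every time it kills an instance.&lt;/p&gt;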

&lt;p&gt;&lt;strong&gt;Simian Army&lt;/strong&gt; — the suite of failure-injection and resilience-verification tools Netflix built following Chaos Monkey's success, each targeting a different failure class: Latency Monkey (network degradation), Conformity Monkey (best practice enforcement), Doctor Monkey (health checks), Janitor Monkey (resource cleanup), Chaos Gorilla (AZ failure), Chaos Kong (region failure).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single point of failure&lt;/strong&gt; — a component whose failure causes the entire system to stop working. The database whose corruption in 2008 triggered Netflix's cloud migration was a single point of failure at the most basic level. Eliminating single points of failure through distributed architecture is the goal; Chaos Monkey tests whether that goal was actually achieved.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This case is a plain-English retelling of publicly available engineering material.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://techlogstack.com/explore/netflix-chaos-monkey-2011/" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Read the full case on TechLogStack →&lt;/a&gt;
&lt;/p&gt;






</description>
      <category>devops</category>
      <category>reliability</category>
      <category>cloud</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
