DEV Community: TechLogStack

Google's Own Cleanup Job Crashed Cloud Services Across 4 Continents — and Then Made Recovery Worse

TechLogStack — Sun, 31 May 2026 00:00:00 +0000

June 12, 2025 — 7+ hour outage; North America, Europe, Far East, Africa simultaneously
Root cause: null pointer exception in Service Control from a May 29 code change — dormant for 14 days
No feature flag, no error handling on the new code path
50+ Google Cloud services affected: IAM, Compute Engine, Cloud Storage, BigQuery, Vertex AI, Google Workspace
Third-order cascade: Google → Cloudflare → Discord/Twitch; Discord users had no idea why they were down
Herd effect during recovery overwhelmed Spanner in us-central1, extending the outage by 2+ hours

On May 29, 2025, a Google engineer deployed new quota-checking code to Service Control — the system that authorises every single API request across Google Cloud. The code had a bug: it couldn't handle a null value. But the bug was invisible during deployment because it could only be triggered by specific policy data that hadn't appeared yet. Two weeks later, an automated system pushed a routine policy update containing blank fields. The policy data replicated globally within seconds. Every Service Control binary in every region hit the null pointer, crashed, and refused to restart properly. Spotify went down. Discord went down. Snapchat went down. Google's own status page went down. And when engineers deployed the fix, the restart surge overwhelmed the infrastructure — making recovery worse than the crash.

The Story

On May 29, 2025, a new feature was added to Service Control for additional quota policy checks. The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash.

— Google Cloud, Official Incident Report, June 14, 2025

Service Control is not a product you've heard of. It doesn't have a marketing page or a conference talk. It exists in the infrastructure layer beneath everything else — the system that authorises every API request across Google Cloud and Google Workspace before that request is allowed to proceed. If you call the Cloud Storage API, Service Control checks your quota. If you authenticate with Google IAM, Service Control validates your policy. If your app on Google Cloud makes any call to any Google service, Service Control is in the critical path. It is, in the most literal sense, the gatekeeper of the entire platform.

When Service Control crashed on June 12, it didn't just take down one service. It took down the authorisation layer for every service. API calls returned 503 errors not because the underlying services had failed, but because the gatekeeper wasn't there to let them through. Compute Engine instances were running. Cloud Storage buckets were intact. BigQuery jobs were ready to execute. None of it mattered — because without Service Control, nothing could be authorised, and nothing unauthorised can proceed in a correctly secured cloud platform.

What Service Control Actually Does

Google Cloud's Service Control performs three functions on every API request: authentication (is this requester who they claim to be?), authorisation (are they allowed to perform this operation?), and quota enforcement (have they exceeded their usage limits?). It processes these checks at massive scale across every region — billions of API calls per day — using policy metadata stored in and synchronised across Spanner, Google's globally distributed database. The May 29 code change was adding more sophisticated quota checking logic to this pipeline. The change worked correctly in every scenario that was tested. The scenario that wasn't tested was the one that appeared on June 12.

Problem

May 29: Code Deployed — Bug Present, But Invisible

Google engineers deployed new quota policy checking code to Service Control. The deployment went through the standard region-by-region rollout and passed all checks. But the new code path had two critical gaps: no error handling for null values, and no feature flag to disable it if something went wrong. The bug was invisible during rollout because the problematic code path could only be triggered by blank fields in the policy metadata. That input hadn't appeared during rollout. The binary was now running in every region with a loaded trap, waiting for the right trigger.

Cause

June 12, 10:45 AM PDT: The Policy Update That Pulled the Trigger

An automated system inserted a routine policy change into the regional Spanner tables that Service Control uses for policy metadata. The policy update contained unintended blank fields. Because quota management is global, Spanner's replication engine distributed this metadata worldwide within seconds. Every Service Control binary in every region hit the new code path, encountered the null values, and threw a null pointer exception. Without error handling, the exception crashed the binary. Service Control was dead globally.

Solution

The SRE Response: Diagnosis in 10 Minutes, Red Button in 40

Google's SRE team began triaging within two minutes of the first alert. They identified the root cause — the null pointer exception in the new quota checking code path — within 10 minutes. Engineers deployed a 'red button' kill switch within 40 minutes to disable the problematic serving path. Most regions began recovering within two hours.

Result

The Herd Effect: When Recovery Made Things Worse

As Service Control instances restarted in us-central1 after the red button was deployed, they all simultaneously reached for the regional Spanner database to load their policy metadata. Hundreds of instances, all restarting at the same moment, all hitting Spanner at once, with no randomisation in their startup sequence. Spanner was overwhelmed by the simultaneous burst. Service Control couldn't load its policies, couldn't restart properly, kept trying, kept hitting Spanner. The recovery created a herd effect that prolonged the outage in us-central1 by more than two hours beyond when other regions had stabilised. Full resolution wasn't complete until 18:18 PDT — more than seven hours after the incident began.

The Fix

Google's Response: Five Commitments After the Outage

Google's five-category post-incident remediation plan (from the official June 14, 2025 incident report):

Failure Mode	What Happened	Google's Fix
Missing error handling	Null pointer exception crashed the binary when blank fields appeared	Mandatory null-safe code patterns with static analysis to catch null pointer vulnerabilities before deployment
No feature flag	New code path couldn't be disabled without full binary redeployment — adding 30+ min to response	Feature flag protection required for all new Service Control code paths
Herd effect during recovery	Hundreds of instances restarting simultaneously overwhelmed Spanner	Randomised exponential backoff on Service Control startup
Status page availability	Cloud Service Health dashboard went down during the outage	Decouple status infrastructure from the services it monitors
Service Control architecture	Monolithic binary — crash in quota logic crashes all authorisation	Modularise Service Control — isolate quota checking from authentication

10 min — time for Google's SRE team to identify the root cause from the first alert at 10:49 AM PDT
40 min — time to deploy the red button kill switch that disabled the problematic code path
7+ hrs — total outage duration; most regions recovered in ~2 hours, herd effect in us-central1 extended full resolution
50+ — Google Cloud services affected, including all core infrastructure APIs, all Google Workspace products, and all AI/ML services

The Feature Flag That Would Have Saved Seven Hours

The most consequential missing safeguard was a feature flag on the new quota checking code path. A feature flag would have changed the timeline dramatically: when null pointer exceptions began firing at 10:49 AM PDT, engineers could have disabled the new code path across all regions within seconds of identifying it. Without a feature flag, the only option was the red-button kill switch, which took 40 minutes to roll out. That is 40 minutes of global outage versus the seconds a flag toggle takes. Google's incident report acknowledges this directly: "If this had been flag protected, the issue would have been caught in staging."

The dormant trap pattern that caused this outage is worth naming explicitly. Google's staged, region-by-region rollout is exactly the right practice for catching bugs introduced by new deployments. It worked correctly for 14 days — no failures appeared during the May 29 rollout because the failure condition required specific policy data (blank fields) that hadn't yet been inserted. Staged rollouts are structurally unable to catch dormant traps — bugs that only activate when a specific trigger arrives weeks later from an unrelated automated system. The only defences against dormant traps are error handling (so the crash doesn't happen when the trigger arrives) and feature flags (so the code path can be disabled immediately when the trigger produces unexpected behaviour). The May 29 change had neither.
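The two defences named above can be sketched in a few lines. This is a minimal illustration, not Service Control's code: `QuotaPolicy`, `FLAGS`, `legacy_check`, `new_quota_check`, and `authorise` are all hypothetical names, and falling back to the previous behaviour when the policy is malformed is an assumed design choice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotaPolicy:
    # Hypothetical policy record; either field may arrive blank (None).
    project: Optional[str]
    limit: Optional[int]


# Hypothetical runtime flag store. In a real system this would be pushed
# through configuration so it can change without a redeploy.
FLAGS = {"new_quota_check": True}


def legacy_check(usage: int) -> bool:
    # Stand-in for the behaviour that was already running before the change.
    return True


def new_quota_check(policy: QuotaPolicy, usage: int) -> bool:
    # Defence 1: reject blank fields explicitly instead of dereferencing them.
    if policy.project is None or policy.limit is None:
        raise ValueError("policy has blank fields")
    return usage <= policy.limit


def authorise(policy: QuotaPolicy, usage: int) -> bool:
    # Defence 2: with the flag off, the new code path never executes.
    if not FLAGS["new_quota_check"]:
        return legacy_check(usage)
    try:
        return new_quota_check(policy, usage)
    except ValueError:
        # Malformed policy data degrades to the old behaviour (an assumption
        # of this sketch) rather than crashing the whole process.
        return legacy_check(usage)
```

With the flag on, `authorise(QuotaPolicy(None, None), 5)` returns a decision instead of raising; setting `FLAGS["new_quota_check"] = False` bypasses the new path entirely. Whether a gatekeeper should allow or deny requests when its policy data is bad is a separate security decision that this sketch does not settle.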

The herd effect: a recovery anti-pattern with a known fix
The herd effect that prolonged the us-central1 outage is not a new problem. It has been documented since the earliest days of distributed systems: when many clients restart simultaneously after a shared dependency recovers, they all connect at once and overwhelm the dependency, preventing it from returning to steady state. The canonical solution, randomised exponential backoff, is equally well documented and simple: each client waits a random delay before reconnecting and lengthens that delay after every failed attempt, so reconnections spread over a time window instead of clustering at a single instant. The problem is every Service Control instance waiting exactly zero milliseconds before hitting Spanner. The solution is each instance waiting a random delay, for example between 0 and 30 seconds. Google committed to implementing this. That it took an outage to prompt the implementation is a reminder that known fixes for known problems often go unimplemented until the cost is paid in production.
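A minimal sketch of the pattern, assuming a startup stagger of up to 30 seconds followed by capped exponential backoff with full jitter. The names `backoff_delay` and `load_with_backoff`, the parameter defaults, and the use of `ConnectionError` to signal an overloaded dependency are all assumptions made for illustration; none of them come from Google's report.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def load_with_backoff(load, max_attempts=8, startup_window=30.0,
                      sleep=time.sleep, rng=random):
    """Call load() with a random startup stagger and jittered retries."""
    # Stagger the first attempt so instances restarted together do not
    # all reach the shared dependency at the same instant.
    sleep(rng.uniform(0.0, startup_window))
    for attempt in range(max_attempts):
        try:
            return load()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            # The delay ceiling doubles with each failed attempt, up to the cap.
            sleep(backoff_delay(attempt, rng=rng))
```

Passing `sleep` and `rng` as parameters keeps the function testable without real waiting. The jitter matters as much as the exponent: without it, instances that failed together would also retry together.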

The status page that went dark
Google's Cloud Service Health dashboard went offline during the June 12 outage because the status infrastructure shared a dependency on the same Google Cloud services that were failing. A status page that fails during a widespread outage is not just unhelpful — it is actively harmful. Customers experiencing failures couldn't access the standard channel to confirm they weren't the source of the problem, couldn't track recovery progress, and couldn't communicate accurate information to their own stakeholders. The status page being down created a second outage: an outage of information. A status page that goes down during the incident it's supposed to report is a monitoring anti-pattern at its most consequential.

Architecture

Service Control sits at the intersection of every API request Google Cloud processes. Understanding how it failed — and why the failure spread so quickly and recovered so slowly — requires understanding three things: the role of Spanner as the global policy data store, the absence of safe failure handling in the new code path, and the herd effect as a predictable consequence of synchronised restart under load.

The blast radius of the June 12 outage had three concentric rings:

Failure Ring	What Failed	Why
First: Google's own infrastructure	Cloud IAM, Compute Engine, Cloud Storage, BigQuery, Cloud SQL, Vertex AI, Cloud Monitoring, Google Workspace	Service Control crashed globally, blocking all API authorisation
Second: Direct GCP customers	Spotify (~46K reports), Snapchat, Fitbit, Replit, GitLab, Shopify, Character.AI, Cursor	Applications on GCP couldn't authorise any backend calls — services appeared down even though underlying compute was running
Third: Cloudflare and its customers	Cloudflare (partial), Discord, Twitch	Cloudflare uses Google Cloud for certain backend operations; those degraded, cascading to Cloudflare's own customers

Normal Flow vs June 12 Failure: What Service Control Does on Every Request

View interactive diagram on TechLogStack →

The Herd Effect: Why us-central1 Recovery Took 2+ Extra Hours

View interactive diagram on TechLogStack →

The Global Spanner Replication Trap

The reason the June 12 failure was global rather than regional was Spanner's design strength working against Google in this case. Spanner is engineered to replicate data to all regions in real time — typically within seconds. When the automated system inserted the policy update with blank fields into the regional Spanner tables, Spanner replicated that policy data to every region within seconds. Every Service Control instance in every region hit the null pointer at essentially the same moment. There was no regional staging, no propagation delay, no opportunity for an alert to fire in one region before the failure had spread to all others. The same architecture that gives Spanner its global consistency guarantee gave this bug its global blast radius.

Lessons

Error handling is not optional for code that runs in the critical path of a globally distributed system. The null pointer exception that crashed Service Control was caused by a missing null check. Any code path that processes external data (data that arrives from an automated system and could contain unexpected values) must explicitly handle the unexpected cases. Blank fields in policy metadata are a predictable input variation. The code should have anticipated them.
Feature flags (a software engineering practice where new code is deployed but kept inactive until explicitly enabled via configuration, allowing teams to disable problematic features instantly without redeployment) on infrastructure code are not optional — they are the minimum viable safety mechanism for any code that processes global-scale policy data. The difference between "feature flag enabled, issue caught in staging" and "no feature flag, 7-hour global outage" is one line of configuration.
The thundering herd (a distributed systems failure mode where many clients simultaneously attempt to reconnect to a shared resource after it recovers, overwhelming it and preventing it from returning to stable operation) is a known failure mode with a known fix: randomised exponential backoff. Build randomised backoff into any service that has a shared dependency it needs to reconnect to after a failure. This has been documented for decades. The fact that Service Control lacked it is a reminder that known fixes for known problems often go unimplemented until the cost of not implementing them is paid.
Your monitoring infrastructure must be architecturally independent of the services it monitors. No shared dependencies between the monitoring stack and the application stack. The moment customers need status information most is exactly the moment a shared-dependency status page is most likely to be unavailable.
Third-order cascade failures are invisible until they happen. Discord's users had no idea their outage originated in a null pointer in Google's quota management code. The dependency chain was opaque: Discord → Cloudflare → Google Cloud → Service Control → policy metadata blank fields. Every engineering team should map their dependency chain at least two levels deep — not just "we use Cloudflare" but "Cloudflare uses Google Cloud, and a Google Cloud outage of sufficient scope will reach us through Cloudflare."
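The dependency-mapping lesson can be made concrete with a small graph walk. The sketch below is illustrative only: the `DEPENDS_ON` map encodes just the chain described above, and `transitive_dependencies` is a hypothetical helper, not an existing tool.

```python
from collections import deque

# Hypothetical dependency map; the edges are the chain described in the text.
DEPENDS_ON = {
    "Discord": ["Cloudflare"],
    "Cloudflare": ["Google Cloud"],
    "Google Cloud": ["Service Control"],
    "Service Control": ["policy metadata"],
}


def transitive_dependencies(service, graph, max_depth):
    """Return {dependency: hops} for everything within max_depth hops."""
    seen = {}
    queue = deque([(service, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for dep in graph.get(node, []):
            if dep not in seen and dep != service:
                seen[dep] = depth + 1
                queue.append((dep, depth + 1))
    return seen
```

Calling `transitive_dependencies("Discord", DEPENDS_ON, 2)` surfaces Google Cloud as a second-level dependency, which is exactly the link that a one-level inventory ("we use Cloudflare") would miss.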

Engineering Glossary

Dormant trap — a bug present in production code that cannot be triggered by any input present at deployment time, but activates when a specific trigger arrives later from an unrelated system. The May 29 Service Control change was a dormant trap: it executed correctly for 14 days until the automated policy update inserted blank fields. Staged rollouts are structurally unable to catch dormant traps.

Feature flag (kill switch) — a configuration switch that enables or disables a code path without requiring redeployment. The absent safeguard in this incident. A feature flag on the new quota checking code path would have allowed it to be disabled across all regions within seconds when null pointer exceptions began firing.

Herd effect (thundering herd) — a distributed systems failure mode where many clients simultaneously attempt to reconnect to a shared resource after it recovers, overwhelming the resource and preventing it from returning to stable operation. The mechanism that extended the us-central1 outage by 2+ hours after the red button was deployed.

Null pointer exception — a runtime error that occurs when code attempts to use a reference that points to no object (null). In this incident, a missing null check in Service Control's new quota checking code let the exception crash the binary, causing a 7-hour global outage for 50+ cloud services.

Randomised exponential backoff — a retry strategy where clients wait a random delay that increases exponentially with each retry attempt. The standard solution to the thundering herd problem — prevents synchronised reconnection bursts by distributing client attempts across a time window.

Service Control — Google's internal authorisation gateway that processes every API request across Google Cloud and Google Workspace. Performs authentication, authorisation, and quota enforcement on every call. A crash in Service Control takes down all API authorisation for the entire platform — making it the highest-blast-radius single component in Google Cloud.

Spanner — Google's globally distributed database, engineered to replicate data to all regions in real time (typically within seconds). Used by Service Control for policy metadata. The same replication speed that makes Spanner powerful for global consistency made this bug's blast radius global and instantaneous.

This case is a plain-English retelling of publicly available engineering material.


GitHub Built the Internet's Code Platform — Then AI Agents Broke It

TechLogStack — Sun, 31 May 2026 00:00:00 +0000

257 incidents — May 2025 to April 2026; ~5 per week, every week, for 12 months straight
48 major outages — 112+ hours of total significant downtime
AI-agent PRs: 4M (Sept 2025) → 17M (Mar 2026) — 325% increase in six months
Actions usage: 500M min/week (2023) → 2.1B min/week (early 2026)
10x scaling plan launched October 2025; revised to 30x by February 2026
Mitchell Hashimoto (GitHub user #1299, 18 years, co-founder of HashiCorp) migrated Ghostty away

Between May 2025 and April 2026, GitHub experienced 257 incidents — 48 of them major outages. That's roughly one significant disruption every single week. The culprit wasn't a security breach, a botched deployment, or a rogue engineer. It was the thing GitHub had spent years celebrating: AI. Specifically, agentic AI workflows that turned one human developer's footprint into hundreds of commits, thousands of CI minutes, and a dozen simultaneous PR operations — all at once, across millions of accounts. GitHub had been built for humans. Agents are not human.

The Story

We started executing our plan to increase GitHub's capacity by 10X in October 2025, with a goal of substantially improving reliability and failover. By February 2026, it was clear that we needed to design for a future that requires 30X today's scale. The main driver is a rapid change in how software is being built. Since the second half of December 2025, agentic development workflows have accelerated sharply.

— Vlad Fedorov, CTO of GitHub, GitHub Engineering Blog, April 28, 2026

For most of its existence, GitHub has been one of the most reliable platforms on the internet. Developers took it for granted the way they take electricity for granted — always on, always there, a utility so dependable it disappeared into the background. That changed in 2025. Not because GitHub's engineers got worse. Not because the codebase got sloppier. But because something fundamental changed about who — or more precisely, what — was using GitHub. AI coding agents arrived at scale, and they didn't behave anything like the human developers the platform was built for.

In 2024, GitHub logged 119 service incidents, including 26 major ones — frustrating, but manageable. Then, between May 2025 and April 2026, incident monitoring service IncidentHub tracked 257 separate incidents, of which 48 were classified as major outages. February 2026 alone produced 37 incidents — the worst month on record. GitHub Actions suffered 57 outages in the same 12-month stretch. On May 15, 2026, a single Actions degradation caused 42% of all Actions runs to fail at peak impact.

The Core Problem: Agents Don't Behave Like Humans

A human developer on a free GitHub account might generate a few commits and a handful of CI runs in a working day. An AI agent on the same account can generate hundreds of commits, dozens of PRs, and thousands of Actions minutes in a single afternoon. GitHub's 2025 Octoverse report celebrated nearly 1 billion commits. By early 2026, GitHub COO Kyle Daigle shared a more alarming figure: the platform was handling 275 million commits every single week — on pace for 14 billion in 2026. That's a 14x annual increase. It wasn't 14x more developers. It was agents treating GitHub's API like a utility and consuming at machine speed.
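The growth figures quoted in this piece can be checked with simple arithmetic. The sketch below uses only numbers stated in the article, rounding "nearly 1 billion" up to 1 billion; `pct_increase` is a hypothetical helper name.

```python
WEEKLY_COMMITS_2026 = 275_000_000   # weekly commit figure quoted above
COMMITS_2025 = 1_000_000_000        # "nearly 1 billion", rounded up


def pct_increase(old, new):
    """Percentage increase from old to new."""
    return (new - old) / old * 100


# Annualise the weekly rate and compare it with the 2025 total.
annualised = WEEKLY_COMMITS_2026 * 52
multiple = annualised / COMMITS_2025

print(f"{annualised / 1e9:.1f}B commits per year, {multiple:.1f}x the 2025 total")
# AI-agent PRs, in millions: 4 (Sept 2025) to 17 (Mar 2026).
print(f"AI-agent PR growth: {pct_increase(4, 17):.0f}%")
```

The annualised figure comes out at about 14.3 billion, which matches the "on pace for 14 billion" and "14x" claims, and the PR growth works out to the 325% quoted in the summary.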

Problem

GitHub Was Built for Human-Paced Development

GitHub's architecture was designed for a world where developers work at human speed: open a PR, push commits over hours or days, wait for CI to run, merge when green. The platform's capacity planning, its database schemas, its job queues, its rate limits — all calibrated for a workflow where one human generates a bounded amount of activity per session. That assumption held for 17 years.

Cause

AI Agents Changed the Economics of Every GitHub Operation

GitHub CTO Vlad Fedorov identified the mechanism: a single pull request can simultaneously touch Git storage, mergeability checks, branch protection, GitHub Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases. A human merging one PR triggers this chain once. An AI agent framework running hundreds of concurrent sessions triggers it thousands of times simultaneously. AI-agent PRs jumped from 4 million in September 2025 to 17 million in March 2026 — a 325% increase in six months.

Solution

10x Plan Became 30x Plan — And They Were Still Behind

GitHub began a 10x capacity scaling initiative in October 2025. By February 2026, that plan was already obsolete — the real demand required 30x. Simultaneously, GitHub was running a migration to Azure, with 12.5% of all traffic on Azure Central US and a target of 50% by July 2026. Running a platform migration alongside an AI-driven traffic explosion is the engineering equivalent of rebuilding an airplane's engines at 35,000 feet.

Result

Cascading Failures and High-Profile Departures

The pressure produced not just performance degradation but engineering failures. On April 23, 2026, a merge queue bug caused by an incomplete feature flag silently reverted merged commits across 658 repositories and 2,092 pull requests; the UI showed green checkmarks while code was being rewritten underneath. On April 28, Mitchell Hashimoto (GitHub user #1299, co-founder of HashiCorp, a member since February 2008) announced that, after 18 years on the platform, he was moving Ghostty off GitHub. The Zig programming language project also migrated away.

The Fix

The Engineering Response: Ruby to Go, Monolith to Services, Single Cloud to Multi-Cloud

GitHub's five-layer engineering response (as outlined by CTO Vlad Fedorov, April 2026):

Problem Layer	Root Cause	GitHub's Fix
Language / Runtime	Ruby monolith has GIL limiting CPU parallelism under high concurrency	Rewriting performance-critical services from Ruby to Go — goroutine model handles massive concurrency without the GIL
Infrastructure	Single-cloud creates concentrated failure risk and limits horizontal scaling	Multi-cloud deployment — 12.5% on Azure Central US in early 2026, targeting 50% by July 2026
Service Isolation	A single PR cascades through 10+ interconnected subsystems	Isolating Git and Actions into independent failure domains
Capacity Planning	10x plan (October 2025) obsolete by February 2026	30x capacity design with automated scaling for agent-driven burst load
Feature Safety	April 23 merge queue regression caused by incomplete feature flag	Strengthened feature flag discipline — no data-integrity code path ships without complete flag protection

257 — total incidents May 2025–April 2026; roughly five per week, every week, for twelve months
48 — major outages producing over 112 hours of total significant downtime
30x — the scale GitHub needed to design for by February 2026, triple the 10x plan launched four months earlier
2,092 — pull requests silently reverted by the April 23 merge queue bug across 658 repositories, with no notification

The April 23 Silent Revert: Why This One Was Different

The April 23 merge queue bug was caused by an incomplete feature flag that allowed a new code path to activate without full safeguards. Commits that had been merged were silently reverted across 658 repositories and 2,092 pull requests. The terrifying part was not the scope — it was the silence. The UI continued to show green checkmarks and merge confirmations while the system was actively undoing work underneath. A platform's most sacred contract with its users is that when it shows a green checkmark, the operation succeeded. GitHub broke that contract. A complete feature flag would have allowed engineers to disable the affected code path instantly. For any code path that touches data that developers trust as immutable, flag protection is not a best practice — it is the minimum viable safety mechanism.

Why the Ruby-to-Go rewrite is the right call
Ruby served GitHub extraordinarily well for 18 years. But Ruby's Global Interpreter Lock (GIL) is a fundamental constraint: even on a 64-core server, a Ruby process can only execute one thread of Ruby code at a time. For human-paced web traffic, this limitation is manageable. For AI agent workflows that generate thousands of concurrent operations, the GIL is a hard ceiling. Go's goroutine model — lightweight threads managed by the Go runtime that can run across all available CPU cores without a GIL — is architecturally suited for exactly the concurrency profile that AI agents create. The rewrite is not about language preference. It is about physics.
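The ceiling described above can be observed in any runtime with a global interpreter lock. The sketch below uses CPython as a stand-in for Ruby, which is an assumption of this example rather than anything GitHub published; `count_down`, `run_sequential`, and `run_threaded` are hypothetical names.

```python
import threading
import time


def count_down(n):
    # Pure CPU-bound work: no I/O, so the thread never yields voluntarily.
    while n > 0:
        n -= 1


def run_sequential(n, workers):
    """Run the work `workers` times back to back; return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(workers):
        count_down(n)
    return time.perf_counter() - start


def run_threaded(n, workers):
    """Run the same work on `workers` threads; return elapsed seconds."""
    threads = [threading.Thread(target=count_down, args=(n,))
               for _ in range(workers)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start
```

Comparing `run_sequential(2_000_000, 4)` with `run_threaded(2_000_000, 4)` on a GIL-enabled build typically shows little or no speed-up from the threads, because only one of them executes interpreter code at a time. Exact timings vary by machine and interpreter version, so no figures are given here.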

The Mitchell Hashimoto moment
On April 28, 2026, Mitchell Hashimoto — GitHub user number 1299, co-founder of HashiCorp, creator of Vagrant, Packer, Consul, Terraform, and Vault — posted that Ghostty was leaving GitHub. He had visited GitHub almost every day for over 18 years. His post described the decision as 'irrationally sad' but said the platform was no longer a place where he could 'get work done' and 'ship software.' He made a point that resonated across the developer community: the problem was not Git itself — the distributed version control system remained excellent. The problem was the surrounding infrastructure: issues, pull requests, GitHub Actions. When the person who never had reason to question the platform for 18 years starts questioning it, something has fundamentally changed.

The structural billing gap
GitHub's business model was designed for humans, and its pricing reflects human-scale consumption. A developer on a free GitHub account generates some commits, a few CI runs, and a handful of API calls per day. An AI agent on the same account can generate hundreds of commits, dozens of PRs, thousands of Actions minutes, and tens of thousands of API calls in a single afternoon. The infrastructure cost per 'user' has fundamentally changed, but the pricing model has not yet caught up. GitHub's Octoverse 2025 report celebrated nearly 1 billion commits and 36 million new developers. But the 2026 numbers aren't being driven by 36 million new developers — they're being driven by agents treating GitHub's API like a utility.

Architecture

GitHub's architecture evolved over 18 years around a core assumption: the unit of load is a human developer. A human opens a PR, waits for review, pushes a few commits, and merges. The platform's service graph — Git storage, mergeability computation, branch protection evaluation, Actions job dispatch, search indexer, notification fan-out, webhook delivery, permission evaluation, API gateway — was sized and coupled around this human-paced access pattern.

AI agents broke the architecture's fundamental assumption. An agent doesn't open a PR and wait. An agent opens 50 PRs in parallel, each triggering the full service chain simultaneously. When AI-agent PR volume grows more than 4x in six months (from 4 million to 17 million), the pressure on every one of those systems scales accordingly, and the interconnected failures begin.

A Single GitHub PR: The 10+ Subsystems It Touches

View interactive diagram on TechLogStack →

GitHub Actions: Weekly Compute Minutes — The AI Agent Surge

View interactive diagram on TechLogStack →

83 Incidents From Capacity Failures Alone

83 of GitHub's 257 incidents between May 2025 and April 2026 were caused by load and capacity problems — with indications that many services did not have automatic scaling configured, requiring manual intervention to add capacity during surges. This means that dozens of times, engineers had to notice the problem, escalate it, and manually provision resources before the platform could recover. Automated capacity scaling for burst load is not optional infrastructure. For a platform being consumed by AI agents, it is the minimum viable reliability architecture.
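As a rough illustration of what automatic scaling replaces, the sketch below applies a simple proportional rule, the same shape as the formula documented for the Kubernetes Horizontal Pod Autoscaler. It is not GitHub's implementation; `desired_replicas`, its parameters, and the choice of queued jobs per runner as the load metric are assumptions.

```python
import math


def desired_replicas(current, observed_load, target_load,
                     min_replicas=1, max_replicas=1000):
    """Proportional scaling: keep per-replica load near target_load.

    observed_load and target_load are per-replica values of the same
    metric, for example queued jobs per runner.
    """
    if current < 1 or target_load <= 0:
        raise ValueError("current must be >= 1 and target_load must be > 0")
    wanted = math.ceil(current * observed_load / target_load)
    # Clamp so a metric spike cannot request unbounded capacity.
    return max(min_replicas, min(max_replicas, wanted))
```

With 10 runners each seeing 300 queued jobs against a target of 100, `desired_replicas(10, 300, 100)` asks for 30 runners without anyone having to notice, escalate, and provision by hand. The upper bound is a deliberate guard: burst load from agents can otherwise turn a capacity problem into a cost problem.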

Lessons

Your platform's capacity model must be built around its actual consumers — not its original consumers. GitHub was built for human developers. AI agents consume infrastructure at orders of magnitude greater intensity. Any platform that introduces AI-native workflows must remodel its capacity assumptions from scratch, not incrementally adjust from the human baseline.
Feature flags (a software engineering practice where new code is deployed but kept inactive until explicitly enabled, allowing teams to test in production, roll out gradually, and instantly disable a feature without redeployment) are not optional for infrastructure that handles data integrity. The April 23 merge queue bug — which silently reverted 2,092 pull requests — was caused by an incomplete feature flag. A complete feature flag would have allowed engineers to disable the affected code path instantly. For any code path that touches data developers trust as immutable, flag protection is the minimum viable safety mechanism.
A monolith that can't be incrementally scaled will become a single point of failure at sufficient scale. GitHub's Ruby monolith served the platform for 18 years because human-paced traffic was bounded enough that the GIL's concurrency limit never became the primary bottleneck. AI agents removed that bound. The architectural lesson is not that monoliths are bad — it's that every architectural decision encodes assumptions about scale, and those assumptions must be revisited when the scale changes fundamentally.
Service isolation is not premature optimisation — it is the prerequisite for containing blast radius at scale. When critical services are deeply coupled — when a PR touches Git storage, Actions, search, notifications, permissions, and webhooks in a single chain — a failure in any one component becomes a failure across all components. GitHub's commitment to isolating Git and Actions into independent failure domains is the architectural move that will have the most long-term impact on reliability.
Trust is the asset that reliability engineering protects. Mitchell Hashimoto didn't leave GitHub because of any single outage. He left because 257 incidents over 12 months had eroded confidence in the platform as a reliable foundation for serious work. Reliability is not measured in individual incident severities — it is measured in the cumulative effect of failures on whether people trust the platform to do what it says it did.
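
The flag protection described in the lessons above can be sketched in a few lines. This is a minimal illustration, not GitHub's implementation: `FlagStore`, the flag name `new_merge_queue_path`, and the `merge` function are all hypothetical names chosen for the example.

```python
class FlagStore:
    """Hypothetical in-memory flag store; real systems read this from config."""

    def __init__(self, flags=None):
        self._flags = dict(flags or {})

    def is_enabled(self, name):
        # Unknown flags default to off, so new code paths start disabled.
        return self._flags.get(name, False)

    def set(self, name, enabled):
        self._flags[name] = enabled


def merge(pr_id, flags):
    # The new code path runs only while its flag is on.
    if flags.is_enabled("new_merge_queue_path"):
        return f"{pr_id}: merged via new path"
    return f"{pr_id}: merged via legacy path"


flags = FlagStore({"new_merge_queue_path": True})
before = merge("PR-1", flags)
flags.set("new_merge_queue_path", False)   # kill switch: no redeploy needed
after = merge("PR-1", flags)
```

The design point is the default: a flag that is missing or unreadable resolves to the legacy path, so a partially rolled-out flag fails towards the known-good behaviour.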

Engineering Glossary

Global Interpreter Lock (GIL) — a mutex in Python and Ruby runtimes that prevents multiple threads from executing interpreter code simultaneously in the same process. Even on a multi-core server, a Ruby process can only use one CPU core at a time for Ruby execution. The fundamental scaling constraint that makes the Ruby-to-Go rewrite necessary for GitHub's AI agent traffic levels.

AI agent PR — a pull request created by an autonomous AI coding agent (such as Copilot Workspace or similar agentic tools) rather than a human developer. AI-agent PRs jumped from 4 million in September 2025 to 17 million in March 2026 on GitHub — the primary driver of the platform's capacity crisis.

Agentic development workflow — a software development pattern where AI agents autonomously perform multi-step tasks: creating branches, writing code, running tests, opening PRs, and iterating based on feedback. Unlike human-paced development, agentic workflows can generate hundreds of concurrent operations from a single user session.

Feature flag (kill switch) — a configuration switch that enables or disables a code path without requiring redeployment. The absent safeguard in GitHub's April 23 merge queue incident. A complete feature flag would have allowed the problematic code path to be disabled instantly rather than requiring a full redeployment cycle.

Service isolation — an architectural design where services are deployed as independent failure domains rather than a tightly coupled chain. The goal: a failure in one service (e.g. GitHub Actions) does not cascade to unrelated services (e.g. Git storage). GitHub's post-crisis architectural commitment.

Merge queue regression — a class of GitHub-specific incident where the merge queue processing pipeline fails, causing incorrect behaviour (such as the April 23 silent revert) or blocking PRs from merging. Merge queue regressions are particularly damaging because they violate the fundamental contract of version control: that merge operations are irreversible and accurately reported.

This case is a plain-English retelling of publicly available engineering material.

Read the full case on TechLogStack →

(Interactive diagrams, source links, and the full reader experience)

TechLogStack — built at scale, broken in public, rebuilt by engineers.

A Race Condition in DynamoDB's DNS Took Down Snapchat, Fortnite, Ring, and Half the Internet for 15 Hours

TechLogStack — Sun, 31 May 2026 00:00:00 +0000

October 19–20, 2025 — 15-hour outage in US-EAST-1
Root cause: race condition between two DNS Enactor processes; cleanup job deleted active DNS records
~3 hours for DynamoDB to recover; 12+ additional hours for EC2 cascade to clear
140+ AWS services affected: EC2, IAM, Lambda, STS, S3, and every control-plane dependency
Snapchat (375M daily users), Fortnite, Roblox, Ring, Venmo, Coinbase, UK HMRC all affected
17M+ outage reports across 3,000+ organisations (Ookla data); 20–30% of internet-facing services disrupted at peak
Recovery anti-pattern: engineers had to manually disable automatic failover — the automation was making things worse

It was 11:48 PM PDT on October 19, 2025. Two automation processes inside AWS's DynamoDB DNS management system were doing the same job simultaneously — one fast, one painfully slow. The slow one was just finishing up when the fast one, having already completed, triggered a cleanup job that deleted the slow one's work. In that moment, every DNS record for DynamoDB in the world's busiest cloud region vanished. Snapchat went dark for 375 million daily users. Fortnite lobbies dissolved mid-match. Ring cameras stopped recording. The UK's HMRC tax authority went offline. For 15 hours, the internet's largest database service had no address.

The Story

When this issue occurred at 11:48 PM PDT, all systems needing to connect to the DynamoDB service in the N. Virginia (us-east-1) Region via the public endpoint immediately began experiencing DNS failures and failed to connect to DynamoDB. This included customer traffic as well as traffic from internal AWS services that rely on DynamoDB.

— Amazon Web Services, Official Post-Incident Summary, October 2025

DynamoDB is not just a database. Inside AWS's infrastructure, it is the connective tissue — the system that EC2, IAM, Lambda, STS, Redshift, and dozens of other control-plane services rely on to store metadata, track state, and coordinate operations. When DynamoDB becomes unreachable, it doesn't just take databases offline. It takes down the systems that manage everything else. This is why a DNS failure that lasted roughly three hours for DynamoDB itself cascaded into a 15-hour platform-wide crisis. The control plane broke. And when the control plane breaks, recovery is not a matter of fixing the root cause — it is a matter of stabilising everything that lost its footing when the ground disappeared.

The Two-Component DNS Architecture: Planner and Enactor

At AWS's scale, DynamoDB maintains hundreds of thousands of DNS records to route traffic across load balancers. AWS built a two-component system to manage this: The DNS Planner monitors load balancer health and periodically creates DNS plans — specifications of which load balancers should receive traffic and with what weight distribution. The DNS Enactors are the workers — multiple independent processes running across three Availability Zones — that pick up the plans and apply them to Route53. Multiple Enactors running in parallel provide redundancy. In theory.

Problem

Enactor A Slows Down — And Its Stale Check Becomes a Time Bomb

DNS Enactor A began applying an older DNS plan but encountered unusual delays — blocked trying to update records, moving painfully slowly through the list of endpoints. Crucially, Enactor A performed a staleness check early in its process: "Is my plan newer than what's currently active?" At the time of that check, it was. But by the time Enactor A actually finished applying the plan, newer plans had been created and applied. The staleness check was now stale itself.

Cause

The Race Condition Fires — Enactor B Wins, Then Cleans Up

While Enactor A was slowly working through its updates, Enactor B picked up one of the newer plans and rapidly applied it across all endpoints. When Enactor B completed, it triggered the cleanup process: identify plans that are significantly older than the one just applied, and delete them. At that exact moment — T+45 seconds after the race began — Enactor A finally finished applying its old plan, overwriting Enactor B's newer records. The cleanup job identified Enactor A's newly-applied old plan as many generations old, and deleted it. All DynamoDB DNS records for the US-EAST-1 regional endpoint were gone.
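
The race above comes down to a check that is separated in time from the write it guards. A minimal sketch of the alternative follows, assuming a hypothetical in-memory `DnsStore` in place of Route53 and an integer `generation` on each plan (neither is taken from AWS's actual design). The staleness comparison moves into the same critical section as the write:

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    generation: int   # hypothetical monotonically increasing plan number
    records: tuple    # endpoint records this plan would publish


class DnsStore:
    """Hypothetical stand-in for the live DNS state; not an AWS API."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = None

    def apply_if_newer(self, plan):
        # Check and write happen under one lock, so no newer plan can be
        # applied between the comparison and the update.
        with self._lock:
            if self.active is not None and plan.generation <= self.active.generation:
                return False  # stale at time of use: refuse to overwrite
            self.active = plan
            return True


store = DnsStore()
old_plan = Plan(generation=7, records=("lb-a",))
new_plan = Plan(generation=9, records=("lb-b", "lb-c"))

applied_new = store.apply_if_newer(new_plan)   # fast Enactor finishes first
applied_old = store.apply_if_newer(old_plan)   # slow Enactor arrives late
```

In a distributed store the lock would typically be replaced by a conditional write on the generation number, but the idea is the same: the comparison and the update are one atomic step.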

Solution

11:48 PM PDT: Total DNS Blackout → Manual Recovery

At 11:48 PM PDT, every system trying to connect to DynamoDB in US-EAST-1 received DNS failures. Engineers identified the DNS issue by 12:38 AM PDT, began temporary mitigations by 1:15 AM PDT, and DynamoDB itself recovered by approximately 2:25 AM PDT, roughly three hours after the incident began. But the cascade had already overwhelmed EC2's Droplet Workflow Manager with a backlog of expired instance leases it couldn't process.

Result

15 Hours of Cascading Failure

The DWFM entered congestive collapse, requiring 12+ more hours for network state to fully stabilise. Engineers had to manually disable the automatic failover system entirely to stop it from flip-flopping between states and allow the platform to stabilise. Full recovery across all services wasn't complete until late afternoon on October 20 — roughly 15 hours after the cascade began.

The Fix

AWS's Post-Incident Fixes: Preventing the Race, Containing the Cascade

AWS's five-layer post-incident fix plan (from the official post-incident summary, October 23, 2025):

Failure Layer	What Went Wrong	AWS's Fix
DNS Enactor race condition	Enactor A's stale staleness check allowed it to overwrite Enactor B's newer plan	Stronger staleness validation at time of application — must reflect current world state, not time of plan pickup
Cleanup automation	Cleanup job deleted Enactor A's just-applied old plan, wiping all DNS records	Safeguards ensuring no automated process can delete an active DNS plan regardless of generation number
NLB failover velocity	Network Load Balancers moved large capacity during AZ failover, amplifying the cascade	Velocity control mechanism limiting how much capacity a single NLB can remove during health check failures
EC2 recovery workflow	DWFM entered congestive collapse when DynamoDB recovered — failure mode not tested at scale	Additional test suite to exercise the DWFM recovery workflow at scale before production discovery
Automatic failover during recovery	Failover automation flip-flopped during recovery, requiring manual disabling before stabilisation	Review of failover automation behaviour during degraded DNS states — distinguish 'service down' from 'DNS inconsistent during recovery'
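
The cleanup safeguard in the table's second row can be expressed as a single invariant. The sketch below is an assumption-laden illustration: the function name, the `keep_latest` policy, and the integer generations are hypothetical, not AWS's cleanup logic.

```python
def plans_safe_to_delete(all_generations, active_generation, keep_latest=3):
    """Return generations a cleanup job may delete.

    Hypothetical policy: keep the newest `keep_latest` generations and
    never delete the active one, however old its number looks.
    """
    newest = sorted(all_generations, reverse=True)[:keep_latest]
    protected = set(newest) | {active_generation}
    return sorted(g for g in all_generations if g not in protected)


# The active plan (7) is many generations old, as in the incident.
deletable = plans_safe_to_delete([5, 6, 7, 20, 21, 22], active_generation=7)
```

The active generation is added to the protected set unconditionally, so the age-based rule can never override it.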

~3 hrs — time from incident start to DynamoDB DNS restoration
12+ hrs — additional hours EC2's Droplet Workflow Manager required to clear congestive collapse
140+ — AWS services eventually affected; DynamoDB powers the control planes of EC2, IAM, Lambda, STS
$581M — estimated insurance losses (CyberCube) representing disruption to thousands of globally dependent businesses

The Anti-Pattern: When Automation Prevents Recovery

The most counterintuitive part of the recovery was that engineers had to disable automatic failover to stabilise the system. The automatic failover mechanisms were detecting DNS inconsistency as failures and triggering failovers, which created new inconsistencies, which triggered more failovers. The automation designed to speed recovery was making recovery impossible. Engineers had to manually turn it off, let the system reach a stable state, and re-enable it with correct DNS records in place. Sometimes, the recovery automation has to stop before recovery can start. Build your recovery playbooks to include the question: "Is any automated system currently making this worse?"
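
One way to give automation an answer to that question is a flap guard that pauses failover when it fires too often. The sketch below is illustrative only: `FlapGuard` and its thresholds are hypothetical and do not describe AWS's tooling.

```python
from collections import deque


class FlapGuard:
    """Pauses automated failover when it changes state too often.

    Hypothetical thresholds: `max_flips` failovers inside `window_s`
    seconds is treated as flapping, not as a series of real failures.
    """

    def __init__(self, max_flips=3, window_s=300.0):
        self.max_flips = max_flips
        self.window_s = window_s
        self._flips = deque()
        self.paused = False

    def allow_failover(self, now):
        # Drop failover timestamps that have aged out of the window.
        while self._flips and now - self._flips[0] > self.window_s:
            self._flips.popleft()
        if self.paused or len(self._flips) >= self.max_flips:
            self.paused = True   # stays paused until a human resumes it
            return False
        self._flips.append(now)
        return True

    def resume(self):
        """Manual re-enable once the underlying state is consistent."""
        self._flips.clear()
        self.paused = False


guard = FlapGuard(max_flips=3, window_s=300.0)
decisions = [guard.allow_failover(t) for t in (0.0, 30.0, 60.0, 90.0, 1000.0)]
```

Once tripped, the guard stays paused even after the window empties. Resuming is a deliberate human action, mirroring the manual re-enable described above.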

The congestive collapse pattern that extended the outage by 12 hours is worth naming clearly. When DynamoDB recovered, EC2's DWFM was facing an enormous queue of backlogged lease management tasks — all trying to execute simultaneously. The more it tried to process, the more it overwhelmed the now-recovered DynamoDB, which slowed processing, which lengthened the queue, which increased the pressure. The system was stuck in a self-sustaining degraded state. This is the same metastable failure pattern documented in the Slack 2-22-22 incident — and the solution is the same: reduce incoming load or add capacity, rather than waiting for self-recovery.

The EC2 Droplet Workflow Manager congestive collapse
EC2's Droplet Workflow Manager (DWFM) is the system responsible for managing EC2 instance lifecycle events, including lease renewals. When DynamoDB became unavailable, DWFM couldn't process instance state updates and began accumulating a backlog of expired leases. By the time DynamoDB recovered, DWFM was facing an enormous simultaneous queue. The system entered congestive collapse: the more it tried to process, the more it overwhelmed the now-recovered DynamoDB, which slowed processing, which lengthened the queue. Network state recovery from this collapse took more than five additional hours after DynamoDB was fixed. AWS's fix: build the test suite that exercises this recovery workflow at production scale.
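
The "reduce incoming load" remedy can be sketched as a drain loop that caps how much backlog is sent to the recovering dependency per tick. Everything here is a toy under stated assumptions: `drain_backlog`, `per_tick_budget`, and the `renew_lease` stand-in are hypothetical, and a real version would also cap retries per item.

```python
from collections import deque


def drain_backlog(backlog, process, per_tick_budget):
    """Process a backlog in fixed-size ticks instead of all at once.

    `per_tick_budget` is a hypothetical cap on calls sent to the
    recovering dependency per tick; items that fail are re-queued.
    Returns the number of ticks used.
    """
    queue = deque(backlog)
    ticks = 0
    while queue:
        ticks += 1
        for _ in range(min(per_tick_budget, len(queue))):
            item = queue.popleft()
            if not process(item):
                queue.append(item)   # retry in a later tick
    return ticks


seen = set()


def renew_lease(lease_id):
    # Toy dependency: the first attempt on every 10th lease fails once.
    if lease_id % 10 == 0 and lease_id not in seen:
        seen.add(lease_id)
        return False
    return True


ticks_used = drain_backlog(range(1, 101), renew_lease, per_tick_budget=25)
```

The budget bounds the pressure on the dependency regardless of backlog size, which is what breaks the feedback loop between queue length and load.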

The hidden cross-region dependency problem
The October 2025 outage adds to a body of evidence about a specific architectural anti-pattern: regions that are called independent but aren't. AWS regions were designed with the premise that a failure in US-EAST-1 should not affect services running in EU-WEST-1. But control-plane dependencies — authentication services, metadata stores, quota management systems — create invisible cross-region ties. Ring cameras deployed globally still authenticated against US-EAST-1 IAM. UK government services deployed in EU regions still made US-EAST-1 API calls. True regional independence requires not just deploying application code in multiple regions, but ensuring that every control-plane dependency is also independently redundant per region. For most organisations, this is not the architecture they have — it is the architecture they think they have.
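
A first step towards finding these invisible ties is simply listing them. The sketch below assumes a hypothetical service inventory format (a dict of regions and dependency regions) and made-up service names. It illustrates the audit idea and is not a tool any of the affected companies used.

```python
def cross_region_dependencies(services):
    """Map each service to the dependencies it calls outside its own region.

    `services` is a hypothetical inventory:
    {service_name: {"region": str, "depends_on": {dep_name: dep_region}}}
    """
    findings = {}
    for name, spec in services.items():
        foreign = {dep: region
                   for dep, region in spec["depends_on"].items()
                   if region != spec["region"]}
        if foreign:
            findings[name] = foreign
    return findings


inventory = {
    "camera-api": {"region": "eu-west-1",
                   "depends_on": {"auth": "us-east-1", "storage": "eu-west-1"}},
    "billing":    {"region": "eu-west-1",
                   "depends_on": {"ledger": "eu-west-1"}},
}
findings = cross_region_dependencies(inventory)
```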

Architecture

The October 2025 DynamoDB outage is a case study in control-plane failure — a class of failure categorically more damaging than a data-plane failure because it removes the ability to manage and coordinate infrastructure rather than just disrupting one service.

Major services affected:

Category	Affected Services
Social & Entertainment	Snapchat (375M daily users), Discord, Reddit, Roblox, Fortnite, Disney+, Hulu, Twitch
Finance & Payments	Coinbase, Venmo, Lloyds, Halifax
Smart Home & IoT	Amazon Ring, Amazon Alexa, Eight Sleep
Communications	Signal, enterprise platforms
Government	UK HMRC tax authority
Travel	United Airlines, Delta apps
AWS Services (internal)	EC2, IAM, STS, Lambda, S3, SQS, Redshift (140+ total)

The DNS Race Condition: Step-by-Step

View interactive diagram on TechLogStack →


The Cascade: How DynamoDB's DNS Failure Propagated

View interactive diagram on TechLogStack →


Why US-EAST-1 Became a Single Point of Failure for the Internet

AWS designed its regions to be independently operable — a failure in US-EAST-1 should not affect EU-WEST-1. This design intention is correct, but the reality that emerged over 20 years is different. US-EAST-1 is where AWS first launched most services, accumulating the most mature feature sets. It became the default — the region developers reach for first, the one that decades of "just deploy to us-east-1" decisions have concentrated critical infrastructure in. Even services claiming multi-region redundancy often still rely on US-EAST-1 for authentication flows, control-plane coordination, or foundational database calls. The technical independence of regions is real. The operational independence, as experienced during the October 2025 outage, is not.

Lessons

Staleness checks must be evaluated at time of use, not time of pickup. Enactor A's staleness check was valid when it ran. By the time Enactor A acted on the result, the check was stale. In any concurrent system where state changes between the check and the action, the check must be re-evaluated immediately before the action. This is TOCTOU (Time-of-Check to Time-of-Use), a race condition in which the condition being checked changes between when it is checked and when it is acted upon. It is one of the oldest race condition patterns in computer science, appearing here in production at AWS scale.
No automated process should be able to delete an active record. The cleanup job had no protection for the case where an older plan was actively in use as the live DNS record. The invariant that must be protected: the record currently resolving live traffic cannot be deleted by any automated process, regardless of its generation number. This invariant is simpler than the cleanup logic that violated it.
Congestive collapse is a failure mode that only appears at scale — and the recovery path for it must be tested before it's needed. EC2's DWFM had never been tested through the scenario of processing a massive backlog of expired leases simultaneously after a DynamoDB recovery. The scenario seemed unlikely enough to skip in testing. Building the test suite that exercises recovery workflows at production scale is the investment that pays off only in disasters — but those are exactly the moments when it matters most.
Control-plane dependencies must be evaluated independently for each region. These are the hidden dependencies that applications have on cloud provider management systems (authentication services, metadata stores, quota management), and they can create cross-region failure modes even when application code is deployed in multiple regions. Ring cameras deployed globally still authenticated against US-EAST-1 IAM. True regional independence requires independently redundant control planes, not just independently deployed application code.
Sometimes, the recovery automation has to stop before recovery can start. Build recovery playbooks to include the question: "Is any automated system currently making this worse?" Automation that detects 'DNS is inconsistent during manual recovery' the same way as 'service is down' will trigger failovers that create new inconsistencies. Automation must be able to distinguish between these states — and humans must be empowered to pause it when it cannot.

Engineering Glossary

Congestive collapse — a failure mode where a system attempting to recover from backlog overwhelms its dependencies, slowing processing and lengthening the queue, creating a self-sustaining degraded state. EC2's DWFM entered congestive collapse when DynamoDB recovered and the accumulated lease backlog overwhelmed the now-restored database.

Control-plane failure — a class of failure where the management and coordination layer of a system fails, rather than the data-serving layer. Uniquely damaging because it removes the ability to manage everything else: EC2 can't track instances, IAM can't validate credentials, Lambda can't execute. Control-plane failures cascade differently from data-plane failures.

DNS Enactor — one of the worker processes in AWS's DynamoDB DNS management system that picks up DNS plans and applies them to Route53. Multiple Enactors run in parallel across Availability Zones for redundancy. The race condition that caused the October 2025 outage occurred between two Enactors picking up different-generation plans simultaneously.

DNS Planner — the planning component in AWS's DynamoDB DNS management system that monitors load balancer health and creates DNS plans specifying which load balancers should receive traffic. Plans are then consumed by DNS Enactors.

Droplet Workflow Manager (DWFM) — EC2's system responsible for managing EC2 instance lifecycle events, including lease renewals. When DynamoDB became unavailable, DWFM accumulated a backlog of expired lease management tasks. When DynamoDB recovered, the simultaneous burst of backlog processing triggered congestive collapse.

TOCTOU (Time-of-Check to Time-of-Use) — a race condition where the condition being checked changes between when it is checked and when it is acted upon, causing the action to operate on incorrect assumptions. Enactor A checked its plan's staleness, found it valid, then applied the plan — but by the time it applied, the world had moved on and the check was stale.

Thundering herd / herd effect — a distributed systems failure mode where many clients simultaneously attempt to reconnect to a shared resource, overwhelming it. Appears in the October 2025 outage as the DWFM congestive collapse. The standard solution is randomised exponential backoff.
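
Randomised exponential backoff, named in the last entry above, can be sketched as follows. The function name and the default `base_s` and `cap_s` values are assumptions for illustration. The variant shown draws each delay uniformly between zero and the exponential ceiling, often called "full jitter".

```python
import random


def backoff_delays(attempts, base_s=0.5, cap_s=30.0, rng=None):
    """Yield one randomised delay per retry attempt ("full jitter").

    Each delay is drawn uniformly from [0, min(cap_s, base_s * 2**n)],
    so clients that failed together do not retry together.
    """
    rng = rng or random.Random()
    for n in range(attempts):
        ceiling = min(cap_s, base_s * (2 ** n))
        yield rng.uniform(0.0, ceiling)


delays = list(backoff_delays(8, rng=random.Random(42)))
```

A caller would sleep for each delay before its next retry. The cap keeps late retries from waiting unboundedly, and the randomisation spreads a synchronised crowd of clients across the whole interval.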

This case is a plain-English retelling of publicly available engineering material.


Airbnb's Fraud Detection Runs on a Graph of 7 Billion Nodes — Here's Why They Rebuilt It From Scratch

TechLogStack — Sun, 24 May 2026 00:00:00 +0000

7B nodes, 11B edges in Airbnb's identity graph
5M new edges ingested per day
P99 read latency: 5.0s → 2.5s (-49%)
P95 write latency: 353ms → 156ms (-56%)
10× write QPS ceiling vs previous vendor maximum
Zero manual reboots required post-migration

Airbnb's identity graph connects every user, every device, every listing, and every relationship that might reveal a fraudster trying to create a duplicate account or collude on a fake transaction. In 2024, this graph held 7 billion nodes and 11 billion edges — growing by 5 million new edges every day. The third-party vendor powering it required periodic manual reboots to stay stable, and 8-hop graph traversal queries were hitting 5-second P99 latencies. A small team rebuilt the entire thing internally. The results were not incremental.

The Story

The stakes of Airbnb's identity graph are not abstract. When a fraudster creates a second account after being banned, tries to rent a listing to damage it, or coordinates with other accounts to inflate reviews, the first system that needs to detect the connection is the identity graph. It holds the relationships between every user, every device, every verified identity, every behavioral signal that Airbnb's Trust and Safety team uses to determine whether a new account is truly new or a known bad actor resurfacing.

The identity graph's architecture progressed through three distinct generations, each solving the previous generation's limit while introducing new constraints. The first generation used a relational database for user and entity data paired with a key-value store holding JSON-encoded edge lists. This worked at low graph density. As individual users accumulated hundreds or thousands of edges, the JSON edge lists became expensive to read and update. Relational databases are built around tables, rows, and SQL joins: they are optimal for normalised structured data but become increasingly expensive as relationship traversal depth grows, because each hop requires an additional join. They are not optimised for multi-hop traversal at graph scale.

The Four Anti-Patterns That Plagued Airbnb's Graph Teams

Before the centralised graph infrastructure, teams building graph-based products fell into four documented patterns:
Relational graphs: modelling nodes and edges in SQL tables, producing expensive joins during traversal.
Offline graphs: building in the data warehouse, limiting freshness to daily batch snapshots.
DIY open source: self-managing community graph databases, creating high operational toil.
Managed PaaS: third-party vendors with vendor lock-in, limited tuning access, and performance bottlenecks the team couldn't debug.

Problem

Generation 1 → 2: Relational DB + KV Store Couldn't Scale Graph Density

The first-generation architecture used a relational database for entity data and a KV store holding JSON-encoded edge lists. As graph density grew — individual users accumulating hundreds of edges — querying became expensive. JSON deserialisation and cross-table joins are not optimised for the multi-hop traversal patterns that fraud detection requires.

Cause

Generation 2 → 3: SaaS Vendor — Better Scale, Worse Reliability

The 2021 migration to a third-party SaaS graph database improved horizontal scalability but introduced new problems: P99 read latency reaching 5 seconds on 8-hop queries, operational instability requiring periodic manual reboots, no ability to tune performance for Airbnb's specific query patterns, and no fine-grained access controls. The vendor was a black box the team couldn't debug.

Solution

Generation 3: JanusGraph + DynamoDB, Internally Managed

In 2024, Airbnb built an internal graph infrastructure on JanusGraph (open-source, Apache TinkerPop stack, Gremlin query language) with DynamoDB as the storage backend and OpenSearch for indexing. The pluggable storage architecture let Airbnb leverage DynamoDB's operational reliability without reinventing distributed storage — while maintaining full control over the graph logic layer. They forked JanusGraph internally to add custom optimisations.

Result

49% P99 Latency Reduction, 10× Write QPS, Zero Manual Reboots

P99 read end-to-end latency dropped from 5.0s to 2.5s (-49%). P95 from 2.1s to 1.0s (-51%). Write P95 from 353ms to 156ms (-56%). Write QPS during load testing reached 10× the previous vendor's maximum. Manual reboots eliminated entirely. Auto-scaling enabled for the first time.

The Fix

Three JanusGraph Engine Optimisations That Closed the Latency Gap

Deploying stock JanusGraph with DynamoDB would not have been sufficient. Airbnb's query patterns — particularly high-fanout traversals that caused the worst P99 spikes — required modifications to the JanusGraph engine itself. The team forked JanusGraph internally and made three targeted optimisations:

-49% — P99 read latency: 5.0s → 2.5s, directly improving fraud detection response time
-56% — P95 write latency: 353ms → 156ms, enabling faster ingestion of 5M daily new edges
10× — Write QPS ceiling during load testing vs the vendor maximum
0 — Manual reboots required post-migration; the internal solution auto-scales

The choice of Gremlin as the query language was a deliberate migration enabler. Gremlin is a graph traversal language developed as part of the Apache TinkerPop framework, and a query reads like a path through the graph: g.V(userId).out('booked').in('listed') means "find all users who listed properties that this user has booked". Both the outgoing vendor system and the incoming JanusGraph support Gremlin, which meant Airbnb could run the same queries against both systems simultaneously during migration, benchmarking performance directly under real production load before any cutover.

Connected Accounts: How the Graph Detects Fraud

Airbnb's Trust Graph finds structural patterns that correlate with fraud. A fraudster re-entering after a ban often reuses the same phone number, payment method, or device. The Connected Accounts system traverses the graph to find these connections: "this new account shares a device with a banned account, which shared a payment method with another banned account, which has reviewed listings that the new account also reviewed." That traversal pattern — spanning 4–8 hops — is exactly why graph depth performance matters.
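
The traversal described above is, at its core, a bounded breadth-first search. The sketch below uses a plain Python adjacency dict with made-up node names. It illustrates the hop-limited search only and is not Airbnb's Connected Accounts implementation, which runs as Gremlin queries against the graph store.

```python
from collections import deque


def connected_banned_accounts(graph, start, banned, max_hops=8):
    """Breadth-first search for banned accounts within `max_hops` of `start`.

    `graph` is a hypothetical adjacency dict mixing account, device and
    payment nodes; returns {banned_account: hop_distance}.
    """
    hits = {}
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if node in banned and node != start:
            hits[node] = hops
        if hops == max_hops:
            continue   # depth budget spent; do not expand further
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, hops + 1))
    return hits


graph = {
    "acct:new": ["device:1"],
    "device:1": ["acct:new", "acct:banned1"],
    "acct:banned1": ["device:1", "card:9"],
    "card:9": ["acct:banned1", "acct:banned2"],
    "acct:banned2": ["card:9"],
}
banned = {"acct:banned1", "acct:banned2"}
hits = connected_banned_accounts(graph, "acct:new", banned)
```

Each extra hop multiplies the number of nodes in the frontier, which is why the deepest queries dominate the latency figures.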

# Three JanusGraph engine optimisations that reduced long-tail latency

# OPTIMISATION 1: DynamoDB conditional writes replace distributed locking
# Old: explicit distributed lock before write = round-trip overhead
def write_edge_default(tx, src_vertex, dst_vertex, edge_label):
    lock = acquire_distributed_lock(src_vertex, edge_label)  # expensive
    try:
        tx.add_edge(src_vertex, dst_vertex, edge_label)
        tx.commit()
    finally:
        release_lock(lock)

# New: DynamoDB evaluates condition atomically server-side — no lock round-trip
def write_edge_optimized(tx, src_vertex, dst_vertex, edge_label):
    tx.add_edge_with_condition(
        src_vertex, dst_vertex, edge_label,
        condition="attribute_not_exists(edge_key)"
    )

# OPTIMISATION 2: Parallel getMultiSlices for high-fanout nodes
# Before: N sequential DynamoDB calls for a user with 1000+ edges
def get_edges_serial(vertex_id, num_slices=50):
    results = []
    for slice_key in compute_slice_keys(vertex_id, num_slices):
        results.append(dynamo.get_item(slice_key))
    return merge(results)

# After: single BatchGetItem — critical for high-fanout nodes
def get_edges_parallel(vertex_id, num_slices=50):
    slice_keys = compute_slice_keys(vertex_id, num_slices)
    results = dynamo.batch_get_items(slice_keys)  # 1 call instead of N
    return merge(results)

# OPTIMISATION 3: Distributed tracing in the internal fork
# OSS JanusGraph: no tracing — impossible to profile slow queries
# Internal fork: Airbnb trace context propagated through every graph op
def execute_gremlin_traversal(query, trace_context):
    with airbnb_tracer.start_span('janusgraph.traversal', parent=trace_context) as span:
        span.set_tag('query.hops', count_hops(query))
        span.set_tag('query.fanout', estimated_fanout(query))
        result = janusgraph.execute(query)
        span.set_tag('result.edges_traversed', result.edge_count)
    return result

The shadow traffic migration strategy
Migrating 7 billion nodes and 11 billion edges without downtime required running both the vendor system and the internal JanusGraph system in parallel, routing the same production queries to both and comparing results. Because both systems use Gremlin, the same queries ran unchanged on both simultaneously. This shadow traffic phase provided a performance benchmark under real load (not synthetic tests) and correctness validation before any cutover. Only after shadow traffic validated both was production traffic cut over and the vendor deprecated.
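
The comparison step of a shadow traffic phase can be sketched as a wrapper that runs one query against both backends. The names `shadow_compare`, `vendor`, and `internal` are hypothetical, and the two backends are toy functions standing in for Gremlin clients. How Airbnb actually compared results is not described in the source.

```python
import time


def shadow_compare(query, primary, shadow):
    """Run one query on both backends and report agreement and timings.

    `primary` and `shadow` are hypothetical callables that take a query
    string and return an iterable of results. Only the primary result is
    returned to the caller; the shadow result is used for comparison.
    """
    t0 = time.perf_counter()
    primary_result = list(primary(query))
    t1 = time.perf_counter()
    try:
        shadow_result = list(shadow(query))
        shadow_error = None
    except Exception as exc:   # a shadow failure must never affect serving
        shadow_result, shadow_error = None, repr(exc)
    t2 = time.perf_counter()

    report = {
        "query": query,
        "match": shadow_error is None
                 and sorted(primary_result) == sorted(shadow_result),
        "primary_s": t1 - t0,
        "shadow_s": t2 - t1,
        "shadow_error": shadow_error,
    }
    return primary_result, report


def vendor(query):      # toy stand-ins for the two graph backends
    return ["acct:2", "acct:5"]


def internal(query):
    return ["acct:5", "acct:2"]


result, report = shadow_compare("g.V('acct:1').out('shares_device')", vendor, internal)
```

Results are compared order-insensitively here because two graph backends need not return edges in the same order. Whether that is the right equivalence depends on the queries being migrated.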

Architecture

Airbnb's new graph infrastructure has three conceptual layers. The storage layer is DynamoDB for graph data persistence and OpenSearch for secondary indexes — both managed AWS services that auto-scale. The graph engine layer is Airbnb's internal JanusGraph fork — the Gremlin server that executes traversal queries, with custom optimisations for Airbnb's access patterns. The management layer is the Graph Management Service — schema enforcement, index management, multi-tenant namespace isolation, and the Thrift API surface that client services call.

Before: Vendor Graph DB — Black Box, Manual Reboots, P99 at 5 Seconds

View interactive diagram on TechLogStack →


After: Airbnb Internal Graph Infrastructure — JanusGraph + DynamoDB

View interactive diagram on TechLogStack →


Why High-Fanout Nodes Cause Long-Tail Latency

View interactive diagram on TechLogStack →


Performance comparison across query types:

Query Type	Vendor P95	Internal P95	P95 Improvement	Vendor P99	Internal P99
1-hop query	~180ms	~65ms	-64%	~420ms	~150ms
2-hop query	~350ms	~130ms	-63%	~900ms	~280ms
2-hop (high fanout)	~620ms	~200ms	-68%	~1,800ms	~450ms
4-hop query	~900ms	~380ms	-58%	~2,500ms	~850ms
8-hop query (max depth)	~2,100ms	~1,000ms	-52%	~5,000ms	~2,500ms
Write (edge creation)	~353ms	~156ms	-56%	~800ms	~360ms

JanusGraph's Pluggable Storage: The Architectural Decision That Made This Possible

Most graph databases tightly couple the query engine and the storage layer — they are one system. JanusGraph decouples them through a pluggable storage backend. Airbnb chose DynamoDB — infrastructure their team already operated at scale. This gave them full control over the graph logic layer while standing on a storage foundation that didn't need to be invented from scratch. The separation also lets them evolve the storage backend in the future without rewriting the graph layer.

Lessons

Know the signals that a vendor relationship has passed its usefulness. Recurring manual operational interventions, inability to instrument the system's internals, no path to tune performance for your access patterns, and P99 latency an order of magnitude worse than P50 — each individually might be tolerable, but all four together mean the vendor is costing more than an internal solution would cost to build.
Pluggable storage backends are what make graph databases practical at scale. JanusGraph's DynamoDB backend let Airbnb separate concerns cleanly: Airbnb owns the graph logic layer, AWS owns the distributed storage operations. Build where you have competitive advantage; buy where you don't.
Shadow traffic is the only honest migration validation strategy for a stateful system. You cannot reproduce 7 billion nodes and 11 billion edges in staging. Running both old and new systems against the same production queries, comparing outputs and latencies, closes the validation gap. Gremlin compatibility between vendor and JanusGraph made shadow traffic feasible here — evaluate migration options partly on query language compatibility.
High-fanout nodes (vertices with an unusually large number of edges — sometimes called supernodes) are the specific failure mode of graph databases at scale. They don't appear until the graph is large and dense. Design your query architecture around the assumption that some nodes will have orders of magnitude more edges than the average — parallel fetching, fanout budgets, and explicit query limits are the tools that prevent P99 from diverging from P50.
Fork open-source infrastructure when you have specific, documented performance requirements the upstream project doesn't address — and when you intend to maintain the fork. The fork is a commitment that creates a maintenance obligation and diverges from upstream. Make that decision with eyes open, but don't avoid it when the production requirements are clear.
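
The fanout budgets and explicit query limits mentioned in the lessons above can be sketched as a breadth-first expansion that reads at most a fixed number of edges per vertex. The graph, the vertex names, and the budget of 100 are all made up for illustration; this is not Airbnb's implementation.

```javascript
// Adjacency list keyed by vertex id. listing:2 plays the supernode.
const edges = new Map([
  ["user:1", ["listing:1", "listing:2"]],
  ["listing:1", ["host:1"]],
  ["listing:2", Array.from({ length: 5000 }, (_, i) => `guest:${i}`)],
]);

// Expand `hops` levels from `start`, reading at most `fanoutBudget` edges per vertex.
// `truncated` tells the caller that the result is partial.
function traverse(start, hops, fanoutBudget) {
  let frontier = [start];
  let truncated = false;
  for (let hop = 0; hop < hops; hop++) {
    const next = new Set();
    for (const vertex of frontier) {
      const neighbours = edges.get(vertex) ?? [];
      if (neighbours.length > fanoutBudget) truncated = true;
      for (const n of neighbours.slice(0, fanoutBudget)) next.add(n);
    }
    frontier = [...next];
  }
  return { vertices: frontier, truncated };
}

const result = traverse("user:1", 2, 100);
console.log(result.vertices.length, result.truncated); // 101 true
```

Without the budget the second hop would read 5,001 edges; with it the work per vertex is bounded, which is what keeps the slowest queries close to the typical ones.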

Engineering Glossary

DynamoDB — Amazon's fully managed NoSQL key-value and document database, used by Airbnb as JanusGraph's storage backend. Provides auto-scaling, multi-region replication, and conditional write operations used in Airbnb's optimised transaction strategy.

Gremlin — a graph traversal language developed as part of the Apache TinkerPop framework. Reads like a path through the graph: g.V(userId).out('booked').in('listed') means "find all users who listed properties that this user has booked."

High-fanout node — a vertex in a graph database with an unusually large number of edges, sometimes called a supernode. Causes disproportionate latency on traversal queries because a single hop can require fetching thousands of edges.

JanusGraph — an open-source distributed graph database built on Apache TinkerPop, with a pluggable storage backend that can use Cassandra, DynamoDB, or HBase as the underlying data store.

Long-tail latency — the phenomenon where the slowest requests in a system (P95, P99) are dramatically slower than the median. Particularly damaging for real-time applications where even a small fraction of slow responses degrades user experience.

P99 latency — the response time that 99% of requests complete within. A P99 of 5.0s means 1 in 100 requests takes 5 seconds or longer — directly visible to users at scale.
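
As a worked example of the definition above, here is a nearest-rank percentile over a list of latency samples. The sample values are invented; production systems usually compute percentiles from histograms rather than raw lists.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// 100 made-up latencies in ms: 98 fast requests and 2 slow ones.
const latencies = [...Array(98).fill(150), 5000, 5000];
console.log(percentile(latencies, 50)); // 150
console.log(percentile(latencies, 99)); // 5000
```

The median hides the two slow requests entirely, while P99 reports them, which is why the two numbers are tracked separately.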

Pluggable storage backend — an architectural pattern where the database query engine and the distributed storage layer are decoupled through a defined interface, allowing different storage systems to be swapped without changing the query layer.

Shadow traffic — a migration validation strategy where the same production queries are routed to both the old and new systems simultaneously, comparing outputs and latencies before committing to a cutover.
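
A minimal sketch of the comparison step in shadow traffic follows. `primary` and `shadow` are hypothetical query functions standing in for the old and new databases, and the example is synchronous for brevity where a real harness would be asynchronous and would sample traffic.

```javascript
// Run one query against both systems and record whether results and timings agree.
// Result order is normalised before comparing, since two graph engines may return
// the same rows in a different order.
function shadowCompare(query, primary, shadow, now = Date.now) {
  const t0 = now();
  const expected = primary(query);
  const t1 = now();
  let actual = null;
  let shadowError = null;
  try {
    actual = shadow(query);
  } catch (err) {
    shadowError = err.message; // a shadow failure must never affect the caller
  }
  const t2 = now();
  const normalise = (rows) => JSON.stringify([...rows].sort());
  return {
    query,
    match: shadowError === null && normalise(actual) === normalise(expected),
    shadowError,
    primaryMs: t1 - t0,
    shadowMs: t2 - t1,
    response: expected, // callers are always served from the primary
  };
}

const oldDb = () => ["listing:2", "listing:1"];
const newDb = () => ["listing:1", "listing:2"];
const report = shadowCompare("g.V('user:1').out('booked')", oldDb, newDb);
console.log(report.match, report.response);
```

Mismatches and latency pairs would be logged and reviewed before cutover; only the primary's answer is ever returned to the user.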

This case is a plain-English retelling of publicly available engineering material.

Read the full case on TechLogStack →

(Interactive diagrams, source links, and the full reader experience)

TechLogStack — built at scale, broken in public, rebuilt by engineers.

Spotify Changed a Filter Order in Their Proxy — Then Every Server in the World Crashed at Once

TechLogStack — Sun, 24 May 2026 00:00:00 +0000

3h 27m outage — 12:18 UTC to 15:45 UTC, April 16 2025
675M monthly active users affected globally
48,000+ peak Downdetector reports
0 regions with staged rollout — applied globally simultaneously
Root cause: Envoy max heap configured higher than K8s memory limit
Fix: capacity increase reduced per-instance memory below the kill threshold

On April 16, 2025, Spotify's engineering team made a change they deemed low risk: reordering the custom filters in their Envoy Proxy perimeter. (Envoy is an open-source edge proxy that receives all incoming user traffic before distributing it to backend services.) They applied the change to all regions simultaneously. Within two minutes, every Envoy instance worldwide had crashed. Then the restart loop began, powered by Kubernetes itself, killing each new server as fast as it came back up. Asia Pacific stayed up, and the reason why told the engineers exactly what was broken.

The Story

This crash happened simultaneously on all Envoy instances.

— Spotify Engineering, Incident Report: Spotify Outage on April 16, 2025

There is a specific kind of engineering failure that hurts more than the others: the change that was reviewed, discussed, and approved — the change the team looked at together and agreed was fine. Spotify's perimeter is the first layer of software that receives traffic from every user worldwide — every stream request, every search, every login. To extend Envoy's capabilities, Spotify develops its own custom filters — plugins that handle rate limiting, authentication, and other cross-cutting concerns. These filters execute in a defined order. The April 16 change altered that order. The new sequence triggered a latent bug in one of the custom filters: a code path that had existed harmlessly, triggered only when the filter received control at that specific position. Envoy crashed. Not one instance, not one region. All of them.

The Death Loop: Why the Restart Made Things Worse

An Envoy crash is normally survivable — Kubernetes detects the failed pod and starts a replacement. But client-side retry logic (every user's app retrying its failed request) created an unprecedented traffic spike onto each new instance. Each new Envoy started, received the full flood of retry traffic, consumed more memory than the Kubernetes memory limit (the maximum memory a pod is allowed to use — when exceeded, K8s automatically terminates it), and was killed. A new instance started. The same thing happened. The loop repeated — powered by Kubernetes itself — for hours.

Problem

12:18 UTC — Filter Reorder Applied Globally, All Envoy Instances Crash

The change to Envoy filter execution order was applied simultaneously to all cloud regions worldwide. The new order activated a latent bug in a custom Spotify filter. Every Envoy instance on Spotify's networking perimeter crashed at the same moment. Alarms fired two minutes later as the traffic drop became measurable.

Cause

The Hidden Misconfiguration: Heap Larger Than the K8s Memory Limit

The traffic flood from client retries exposed a misconfiguration that had existed undetected: Envoy's max heap size was configured higher than the Kubernetes memory limit for the pod. Under normal traffic, Envoy never approached its heap limit and the misconfiguration was invisible. Under the retry flood, each new instance immediately exceeded the K8s limit and was killed. This turned a recoverable crash into an infinite restart loop.

Solution

Asia Pacific Stayed Up — and Explained Everything

Asia Pacific was the only region unaffected. Engineers investigated why. The answer: lower traffic volume at that time of day (timezone difference) meant APAC Envoy instances never received enough retry traffic to exceed the K8s memory limit. The asymmetry proved the hypothesis: the death loop was memory-limit driven, not bug-driven. Fix the memory headroom, break the loop.

Result

15:45 UTC — Death Loop Broken, Full Recovery

Increasing total perimeter server capacity gave each new Envoy instance enough headroom to stay under the K8s memory limit even while absorbing the retry traffic flood. The death loop broke. EU recovered at 14:20 UTC and US at 15:10 UTC, traffic in all regions had normalised by 15:40 UTC, and the outage window closed at 15:45 UTC. Total duration from 12:18 UTC: 3 hours 27 minutes.

The Fix

The Misconfiguration Nobody Noticed — Until the Crash

The root problem was that Envoy's max heap size was set higher than the Kubernetes memory limit for the pod. In normal operation, Envoy memory usage never approached its heap maximum — the misconfiguration was invisible. The retry flood was the first event extreme enough to push instances over the K8s limit and trigger the kill cycle.

3h 27m — Total outage duration, 12:18 to 15:45 UTC
675M — Users affected; 263M paying Premium subscribers — no perimeter differentiation by tier
48,000+ — Peak Downdetector reports (active reporters only; actual affected users in the hundreds of millions)
0 — Regions with staged rollout before full deployment

# THE MISCONFIGURATION: Envoy heap limit higher than K8s memory limit

# Kubernetes pod resource specification (simplified)
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: envoy
    resources:
      requests:
        memory: "2Gi"
      limits:
        memory: "3Gi"  # K8s will OOMKill the pod above this

# Envoy overload manager configuration (simplified)
overload_manager:
  resource_monitors:
  - name: envoy.resource_monitors.fixed_heap
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
      max_heap_size_bytes: 4294967296  # 4GB, HIGHER than the K8s 3GB limit!

# Why this is catastrophic:
# - K8s kills at 3GB memory usage
# - Envoy's own safety valve triggers at 95% of 4GB = 3.8GB
# - K8s limit is hit BEFORE Envoy's graceful degradation kicks in
# - Under normal load: Envoy peaks at ~1.5GB, so the misconfiguration is invisible
# - Under retry flood: Envoy climbs past 3GB → OOMKill → restart → repeat

# IMMEDIATE FIX: Increase perimeter server count
# More servers = retry traffic spread across more instances
# = each instance stays under 3GB = K8s doesn't kill = loop breaks

# PERMANENT FIX: Align heap config with K8s memory limit
# max_heap_size_bytes: 2684354560  # 2.5GB — safely below K8s 3GB limit

Why Increasing Capacity Fixed the Loop

The K8s memory limit was fixed. The retry traffic load was fixed (determined by user behaviour). The only variable Spotify could change quickly was the number of Envoy instances sharing that retry load. More instances → each instance receives a smaller share of the flood → memory stays below the K8s limit → K8s doesn't kill it → stable. The underlying misconfiguration (heap > K8s limit) was fixed separately afterward as permanent remediation.
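
The reasoning above can be put into numbers with a deliberately simple model: each instance holds a fixed baseline plus an equal share of the memory needed to absorb the retry flood. The linear model and the 600 GiB flood figure are assumptions for illustration; the 3 GiB limit and 1.5 GiB baseline are taken from the simplified configuration shown earlier, not from Spotify's real settings.

```javascript
const GIB = 1024 ** 3;

// Assumed linear model: baseline memory plus an equal share of the flood.
function perInstanceBytes(baselineBytes, floodBytes, instances) {
  return baselineBytes + floodBytes / instances;
}

// Smallest fleet size that keeps every instance strictly under the pod memory limit.
function minInstances(baselineBytes, floodBytes, limitBytes) {
  if (baselineBytes >= limitBytes) throw new Error("baseline alone exceeds the limit");
  return Math.floor(floodBytes / (limitBytes - baselineBytes)) + 1;
}

const limit = 3 * GIB;      // pod memory limit in the simplified spec
const baseline = 1.5 * GIB; // normal-load peak in the simplified spec
const flood = 600 * GIB;    // made-up fleet-wide memory demand of the retry flood

console.log(perInstanceBytes(baseline, flood, 200) / GIB); // 4.5, over the limit
console.log(minInstances(baseline, flood, limit));         // 401
```

With 200 instances each one needs 4.5 GiB and is killed; at 401 instances the per-instance figure drops just under 3 GiB and the loop can break, without touching the limit or the traffic.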

Spotify's four post-incident commitments:

Fix the filter bug that caused the initial crash on filter reorder
Fix the heap/K8s limit mismatch — align Envoy config with pod resource limits
Staged perimeter rollouts — regional validation before global deployment
Improved monitoring — detect configuration issues earlier in the failure chain

Incident timeline:

Time (UTC)	Event	Status
12:18	Filter reorder applied; all Envoy instances crash	🔴 Global failure
12:20	Alarms fire on traffic drop; death loop running	🔴 Engineers paged
12:28	Escalated; only APAC serving traffic	🔴 Incident declared
~13:xx	Root cause identified via APAC asymmetry	🟡 Diagnosis complete
14:20	EU fully recovered	🟡 Partial recovery
15:10	US fully recovered	🟡 Partial recovery
15:40	All regions normalised	🟢 Full recovery

Architecture

Spotify's networking perimeter places Envoy Proxy as the outermost layer — the first software that receives every user request, regardless of what backend it is destined for. When every Envoy instance crashes simultaneously, no user request can reach any backend service. The entire platform goes dark regardless of whether individual backend services remain healthy. This is the shared fate property of perimeter architecture: a perimeter failure has a blast radius of every service, every user, every region simultaneously.

Spotify's Perimeter Architecture: Envoy as the Universal Traffic Gateway

Interactive diagram available on TechLogStack.

The Three-Layer Failure Cascade: From Filter Bug to Death Loop

Interactive diagram available on TechLogStack.

The APAC Diagnostic: How One Region Proved the Root Cause

When engineers observed APAC was unaffected, they had two candidate hypotheses: (A) the filter bug is region-specific, or (B) the death loop is traffic-intensity dependent. Investigation confirmed (B): APAC runs identical filter configuration — lower traffic meant less retry amplification, meaning per-instance memory pressure never reached the K8s limit. This asymmetry transformed a hard debugging problem ("why is the loop happening?") into a tractable one ("what's different about APAC?") and pointed directly at the memory-limit misconfiguration.

Configuration drift: why this existed undetected for months
The Envoy heap/K8s limit misconfiguration almost certainly existed long before April 16. It was never caught because Envoy memory usage never reached the dangerous threshold under normal traffic. This is a common pattern: configuration mismatches that are only dangerous under abnormal load go undetected indefinitely in systems where abnormal load doesn't occur. The misconfiguration didn't cause the outage — the filter bug did. But it was what turned a recoverable crash into a multi-hour global outage. Auditing resource limit configurations against actual peak usage, including synthetic stress tests, is the practice that catches these before they detonate.
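
The audit described above can be sketched as a small check that compares the pod memory limit with the point at which the proxy's own overload protection would fire. The function names and the 0.95 threshold are assumptions carried over from the simplified example; a real audit would also allow for memory the process uses outside the heap.

```javascript
// Parse Kubernetes binary-suffix memory quantities such as "3Gi" or "512Mi" into bytes.
// Only the Ki/Mi/Gi suffixes are handled in this sketch.
function parseMemory(quantity) {
  const match = /^(\d+)(Ki|Mi|Gi)$/.exec(quantity);
  if (!match) throw new Error(`unsupported quantity: ${quantity}`);
  const factor = { Ki: 1024, Mi: 1024 ** 2, Gi: 1024 ** 3 }[match[2]];
  return Number(match[1]) * factor;
}

// Flag configs where the overload action would trigger at or above the pod limit,
// meaning the pod is killed before graceful degradation can start.
function auditHeap(podLimit, maxHeapBytes, overloadThreshold = 0.95) {
  const limitBytes = parseMemory(podLimit);
  const triggerBytes = maxHeapBytes * overloadThreshold;
  return { limitBytes, triggerBytes, safe: triggerBytes < limitBytes };
}

console.log(auditHeap("3Gi", 4294967296).safe); // false: the misconfigured pairing
console.log(auditHeap("3Gi", 2684354560).safe); // true: heap aligned below the limit
```

Run against every deployment's rendered config, a check like this would surface the mismatch without waiting for a load event to expose it.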

Lessons

'Low risk' is not a substitute for staged rollout at the perimeter. A change's risk profile determines what validation it needs — it doesn't override the need for validation. The filter reorder was simple; the blast radius of failure was total. Stage perimeter changes by region and monitor before expanding.
Latent bugs (code defects harmless until a specific triggering condition occurs) that depend on execution context cannot be caught by tests that don't vary that context. A filter test suite that exercises filters in their original order will never discover a bug that only manifests in a different order. When making ordering or sequencing changes, test explicitly in the new order.
Audit resource limit configurations against actual and stress-test peak usage regularly. Mismatches between Envoy heap size and Kubernetes memory limits are invisible until a load event forces memory beyond the limit. A misconfiguration harmless for months can become catastrophic under the right load spike.
Client-side retry logic turns total simultaneous failures into traffic amplification events. Design retry logic with awareness of this: exponential backoff with jitter spreads retries over time; circuit breakers prevent retries when failure rate exceeds a threshold; retry budgets limit total retry volume per client.
When one region survives an outage that hits all others, that region is your fastest path to root cause. APAC's survival was a controlled experiment running in production. Its configuration was identical; its traffic was lower. The asymmetry proved the diagnosis. Systematically compare surviving regions against failed ones — it shortens MTTR.
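
The retry guidance in the lessons above (exponential backoff with jitter, retry budgets) can be sketched as follows. The defaults, the 10% budget ratio, and the class name are illustrative choices, not values from Spotify's clients.

```javascript
// "Full jitter" backoff: wait a random time between 0 and min(cap, base * 2^attempt),
// so clients that failed at the same moment do not retry at the same moment.
function backoffDelayMs(attempt, { baseMs = 200, capMs = 30000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Retry budget: a client may retry only while retries stay within a fixed
// ratio of the requests it has sent.
class RetryBudget {
  constructor(ratio = 0.1) {
    this.ratio = ratio;
    this.requests = 0;
    this.retries = 0;
  }
  recordRequest() {
    this.requests += 1;
  }
  tryRetry() {
    if (this.retries + 1 > this.requests * this.ratio) return false;
    this.retries += 1;
    return true;
  }
}

const budget = new RetryBudget(0.1);
for (let i = 0; i < 20; i++) budget.recordRequest();
console.log(budget.tryRetry(), budget.tryRetry(), budget.tryRetry()); // true true false
console.log(backoffDelayMs(3, { random: () => 0.5 })); // 800
```

Jitter spreads the retries over time, and the budget caps their total volume, so a simultaneous global failure produces a bounded surge rather than a multiple of normal traffic.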

Engineering Glossary

Client-side retry logic — application behaviour where the client automatically retries failed requests after a brief delay. Designed to handle transient failures, but capable of amplifying load during sustained simultaneous failures by converting each failed request into one or more retry requests.

Death loop — an informal term for an infinite restart cycle where a pod crashes, Kubernetes restarts it, and the replacement crashes for the same reason. Powered by K8s restart behaviour combined with a condition (here: retry flood + heap misconfiguration) that guarantees each replacement fails.

Envoy Proxy — an open-source, high-performance edge proxy originally built at Lyft, widely used as the networking perimeter layer in distributed systems. Receives all incoming user traffic before distributing it to backend services.

Filter chain — the ordered sequence of processing modules (filters) that each request passes through in an Envoy proxy instance. Each filter can inspect, modify, or reject the request before passing it to the next filter. Order is semantically meaningful.
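
Because order is semantically meaningful, one way to surface order-dependent latent bugs is to run the chain in every ordering under test. The sketch below uses three hypothetical filters, one of which silently assumes another has already run; it is plain JavaScript and has no relation to Envoy's real filter API.

```javascript
// Generate every ordering of a small list (n! orderings, so keep n small).
function permutations(items) {
  if (items.length <= 1) return [items];
  return items.flatMap((item, i) =>
    permutations([...items.slice(0, i), ...items.slice(i + 1)]).map((rest) => [item, ...rest])
  );
}

// Run a request through the filters in order; each filter returns the updated request.
function runChain(filters, request) {
  return filters.reduce((req, filter) => filter.apply(req), request);
}

// Hypothetical filters. `rateLimit` has a latent bug: it assumes `auth` already set req.user.
const auth = { name: "auth", apply: (req) => ({ ...req, user: { id: 42 } }) };
const rateLimit = { name: "rateLimit", apply: (req) => ({ ...req, bucket: `user-${req.user.id}` }) };
const logging = { name: "logging", apply: (req) => ({ ...req, logged: true }) };

// Return the orderings in which the chain throws.
function findFailingOrders(filters) {
  const failures = [];
  for (const order of permutations(filters)) {
    try {
      runChain(order, { path: "/stream" });
    } catch (err) {
      failures.push(order.map((f) => f.name).join(" > "));
    }
  }
  return failures;
}

console.log(findFailingOrders([auth, rateLimit, logging]));
```

Three of the six orderings fail, all of them with `rateLimit` ahead of `auth`. A suite that only ever runs the original order would report no problem.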

Latent bug — a code defect that exists in production but is harmless until a specific triggering condition occurs. Undetectable by standard testing if the triggering condition is rare or contextual.

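OOMKill — Out-Of-Memory Kill. Termination of a container that exceeds its configured memory limit. The Linux kernel's OOM killer enforces the limit through cgroups; Kubernetes reports the container as OOMKilled and restarts it according to the pod's restart policy. The limit exists to protect other workloads on the node from memory starvation.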

Shared fate system — an architecture where all dependent services rise and fall with a shared component. Spotify's Envoy perimeter is a shared fate system: if it fails, every backend service becomes unreachable regardless of whether those services are healthy.

Staged rollout — deploying a change to a subset of infrastructure (one region, one cluster) and validating behaviour before expanding to the full fleet. The safety mechanism absent from the April 16 deployment.
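
A staged rollout can be sketched as a loop that applies the change to one stage, checks health, and refuses to continue on failure. `applyChange` and `isHealthy` are hypothetical hooks supplied by the caller, the stage names are invented, and a real pipeline would be asynchronous with a soak period between stages.

```javascript
// Apply a change one stage at a time and stop at the first stage that fails its health check.
function stagedRollout(stages, applyChange, isHealthy) {
  const applied = [];
  for (const stage of stages) {
    applyChange(stage);
    applied.push(stage);
    if (!isHealthy(stage)) {
      return { ok: false, haltedAt: stage, applied, untouched: stages.slice(applied.length) };
    }
  }
  return { ok: true, haltedAt: null, applied, untouched: [] };
}

const stages = ["canary", "asia-pacific", "europe", "north-america"];
// Simulate a change that breaks the very first stage it reaches.
const outcome = stagedRollout(stages, () => {}, (stage) => stage !== "canary");
console.log(outcome.haltedAt, outcome.untouched);
```

The failure is confined to the first stage and the remaining three are never touched, which is the difference between a contained incident and a global one.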

This case is a plain-English retelling of publicly available engineering material.


Google Built a Free Design Tool That Generates Production Code From a Sentence — Then Added Multiplayer

TechLogStack — Thu, 21 May 2026 00:00:00 +0000

30 seconds — plain English sentence to complete mobile UI, live on stage at Google I/O 2025
$0 vs $15 — Google Stitch multiplayer vs Figma professional plan per editor per month
3 input types — text prompt, reference image, annotated screenshot — processed simultaneously
5 screens — simultaneous canvas rendering introduced in Stitch 2.0, March 2026
350 free generations/month — standard tier; $20/month Pro for unlimited
1M+ waitlist signups overnight after the I/O 2025 live demo

At Google I/O 2025, Sundar Pichai typed a one-sentence description of a mobile app and watched Google Stitch render a complete, multi-component UI in under 30 seconds. One click exported it as React code. Another exported it as an editable Figma file. Figma charges $15 per editor per month for collaborative design. Stitch does it free. A year later, Google added real-time multiplayer, a streaming design agent, and voice input — and the design industry started paying attention.

The Story

Google Stitch did not emerge from Google's internal R&D labs. It began with the early-2025 acquisition of Galileo AI, a startup that had built one of the first credible text-to-UI generators, capable of interpreting product descriptions and producing coherent interface layouts. Google acquired Galileo, rebranded it as Stitch, integrated it with Gemini 2.5 Pro (Google's multimodal model, which accepts text, images, audio, and video in a single context and can return structured output such as code), and launched it as a Google Labs experiment at I/O 2025. The Labs framing was deliberate: it tested the market before Google committed to a full product. Over 1 million waitlist signups appeared overnight.

What 'Vibe Design' Actually Means

Stitch entered the vocabulary alongside "vibe coding" — describing software intent to an AI and refining the output iteratively rather than building from first principles. The skill shifts from pixel manipulation to intent specification. A founder who cannot use Figma can produce a working prototype in minutes. A product manager can test five layout variations in the time it would previously have taken to brief a designer on one.

The evolution from launch to Stitch 2.0 compressed ten months of user feedback into a clear product trajectory. The May 2025 version was single-screen only: one prompt, one screen, export. July 2025 added theme customisation and RTL language support. December 2025 brought multi-screen Prototypes alongside Gemini 3 integration. March 19, 2026 was Stitch 2.0: infinite canvas, 5-screen simultaneous generation, voice input, and app-flow generation. A demo had become a workspace.

Problem

Design-to-Dev Handoff: The Productivity Black Hole

The traditional pipeline required designers to build components in Figma, annotate specs manually, and hand off to developers who re-implemented everything in code. Even with design tokens and component libraries, the gap between "designed" and "built" consumed weeks. For small teams and solo founders this gap was existential — they lacked either the design skill or the engineering skill to close it alone.

Cause

Multimodal Models Reached UI-Generation Quality

By early 2025, Gemini's multimodal capabilities had reached a threshold where they could reliably interpret both text descriptions and uploaded images of existing UIs, generating coherent layouts with appropriate component choices, spacing, and visual hierarchy. The Galileo acquisition gave Google a product layer that had already solved the prompt engineering, training data, and output format problems on top of that capability.

Solution

Stitch: Three Inputs, Gemini Core, Production-Grade Exports

Stitch accepted three input types simultaneously: natural language descriptions, uploaded reference images or screenshots, and annotated screenshots with modification notes. Gemini 2.5 Pro processed all three in a single context window. Export paths targeted real developer workflows: Figma files with editable layers and auto-layout, production-ready HTML/CSS, React components, and Vue code.

Result

I/O 2026: Streaming Agent + Multiplayer — Both Free

At I/O 2026, Google launched a streaming design agent that renders UI components onto the canvas in real time as a designer types or speaks — mid-generation course correction is possible before the generation finishes. Simultaneous multi-user editing was also added, directly matching Figma's flagship collaboration feature. Both are free. Figma's professional plan charges $15 per editor per month.

The Fix

The Technical Architecture: Gemini as the UI Design Engine

Stitch's core is not a purpose-built design model — it is Gemini 2.5 Pro with a specialised prompt engineering and output parsing layer on top. This explains both Stitch's strengths and its limitations. Stitch understands concepts like "glassmorphism," "material design," and "iOS Human Interface Guidelines" because Gemini was trained on documentation and examples of all of them. It generates production-quality React because Gemini understands React at a level that exceeds most specialised code generation models.

30s — sentence to complete mobile UI including navigation, components, and colour palette
3 inputs — text prompt, reference image, annotated screenshot — single Gemini context window
5-screen — simultaneous canvas rendering in Stitch 2.0, March 2026
$0 vs $15 — Stitch multiplayer vs Figma professional plan per editor per month

The I/O 2026 streaming agent is an architectural change, not just a speed improvement. Previous versions were turn-based: submit a prompt, wait for completion, review, resubmit. The streaming model replaces this with continuous render — components appear on canvas as they are generated, layouts reflow before generation finishes. The practical difference is the ability to steer mid-generation: if a layout is heading in the wrong direction, a designer can interrupt and redirect before it finishes. Voice input, integrated since March 2026, works within this same loop.

// Turn-based vs streaming: the architectural difference in Stitch's I/O 2026 upgrade
// Illustrative pseudocode: `stitch`, `canvas`, and `voiceInput` are hypothetical
// objects passed in by the caller, not a published Stitch API.

// BEFORE (turn-based): designer sees nothing until generation is fully done
async function generateUiTurnBased(stitch, prompt) {
  const result = await stitch.generate(prompt); // blocks until the full result is ready
  return result.screens; // [{ html, css, figmaLayers }]
}

// AFTER (streaming agent): real-time render + mid-generation steering
async function generateUiStreaming({ stitch, canvas, voiceInput }, prompt) {
  const stream = stitch.generateStream(prompt);

  // Components render onto the canvas as they are generated
  stream.on('component', (component) => {
    canvas.renderPartial(component); // visible immediately, no waiting
  });

  // Designer can interrupt and redirect before generation finishes
  stream.on('layoutDecision', () => {
    const userFeedback = canvas.checkInterrupt();
    if (userFeedback) {
      stream.steer(userFeedback); // mid-generation course correction
    }
  });

  // Voice input works inline: spoken mid-generation, applied to the live stream
  const onVoiceCommand = (cmd) => stream.steer(cmd);
  voiceInput.on('command', onVoiceCommand);

  try {
    await stream.complete();
    return canvas.getCurrentState();
  } finally {
    // Detach the listener so later voice commands cannot steer a finished stream
    voiceInput.off('command', onVoiceCommand);
  }
}

The Galileo Acquisition Rationale

Google could have built Stitch from scratch using Gemini. It acquired Galileo instead because Galileo had already solved the hardest non-model problems: the prompt engineering approach that reliably produces coherent UIs, the output parser that converts model outputs into valid design tokens and component trees, and the UX model for iterative refinement. Rebuilding these would have taken months. The acquisition compressed that to days. Galileo's technology became the product layer; Gemini became the intelligence underneath it.

RLHF for UI quality: how Stitch reached 95% component rendering accuracy
Stitch's code export quality reached 95% accuracy (component rendering fidelity) in the March 2025 closed beta, up from ~70% in early estimates. The improvement came from RLHF — Reinforcement Learning from Human Feedback — applied specifically to UI generation quality. The beta involved 500+ partner users including Vercel developers who provided direct feedback on generated code quality and design accuracy. This domain-specific signal tuned Gemini's output for the criteria professional designers and developers actually cared about: component naming, layout accuracy, code cleanliness, and design system compatibility.

Feature timeline — launch to I/O 2026:

Date	Update	Key Feature Added
May 20, 2025	Google I/O Launch	Single-screen generation, Figma export, HTML/CSS/React export
Jul–Aug 2025	Public beta	Theme customisation, RTL language support
Dec 2025	Stitch 2.0 preview	Prototypes (multi-screen flows), Gemini 3 integration
Mar 19, 2026	Stitch 2.0 GA	Infinite canvas, 5-screen canvas, voice input, app-flow generation
May 20, 2026	I/O 2026	Streaming agent (real-time canvas render), multiplayer — both free

Architecture

Stitch's internal architecture has three distinct layers. The input layer processes multimodal inputs through Gemini 2.5 Pro — text prompts, reference images, and annotated screenshots are unified into a single context window. The generation layer produces an intermediate representation (an abstract, format-agnostic description of design intent — component hierarchy, spacing tokens, visual relationships — that can be translated into multiple output formats without losing design semantics) rather than raw HTML or Figma JSON directly. The export layer translates that IR into Figma-compatible JSON with proper component structure and auto-layout, production-grade React/HTML/CSS, and AI Studio integration configs.
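
The role of the intermediate representation can be illustrated with a toy IR and two exporters that read it. The node types, the spacing tokens, and the layer shape below are invented for the example; Stitch's real schema is not public in this article.

```javascript
// A tiny made-up IR: component type, spacing token, and children.
const ir = {
  type: "Column",
  gap: "md",
  children: [
    { type: "Text", value: "Welcome back" },
    { type: "Button", value: "Sign in" },
  ],
};

const SPACING_PX = { sm: 8, md: 16, lg: 24 };

// Exporter 1: IR -> HTML string (no escaping; the sketch assumes trusted text).
function toHtml(node) {
  switch (node.type) {
    case "Column": {
      const inner = node.children.map(toHtml).join("");
      return `<div style="display:flex;flex-direction:column;gap:${SPACING_PX[node.gap]}px">${inner}</div>`;
    }
    case "Text":
      return `<p>${node.value}</p>`;
    case "Button":
      return `<button>${node.value}</button>`;
    default:
      throw new Error(`unknown node type: ${node.type}`);
  }
}

// Exporter 2: IR -> a layer tree loosely shaped like a design-tool document.
function toLayers(node) {
  const layer = { kind: node.type.toUpperCase(), name: node.value ?? node.type };
  if (node.children) {
    layer.autoLayout = { direction: "VERTICAL", spacing: SPACING_PX[node.gap] };
    layer.children = node.children.map(toLayers);
  }
  return layer;
}

console.log(toHtml(ir));
console.log(JSON.stringify(toLayers(ir)));
```

Both outputs carry the same 16px spacing because both read the same `gap` token; adding a third format means adding a third function, with no change to generation.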

Before Stitch: The Traditional Design-to-Development Pipeline

Interactive diagram available on TechLogStack.

Google Stitch Architecture: Multimodal Input to Production Output

Interactive diagram available on TechLogStack.

The Multiplayer Technical Challenge

Adding simultaneous multi-user editing to an AI-native canvas is harder than adding it to a traditional design tool. In Figma, multiplayer synchronises deterministic object operations with well-understood CRDT (Conflict-free Replicated Data Type — a data structure that allows multiple users to edit concurrently without conflicts, automatically merging changes) semantics. In Stitch, two users can simultaneously prompt the AI to modify the same canvas, producing non-deterministic outputs that may conflict visually. Google's implementation queues concurrent AI generation requests per canvas object and applies last-write-wins for AI-generated changes, while standard CRDT semantics apply for manual edits.
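
The queue-plus-last-write-wins approach described above can be sketched as below. This is an interpretation under stated assumptions, not Google's implementation: the class name, the `(timestamp, userId)` ordering, and the `generate` callback standing in for the model call are all hypothetical.

```javascript
// Last-write-wins register: keep the change with the highest timestamp,
// breaking ties by userId so every replica picks the same winner.
function lwwMerge(current, incoming) {
  if (!current) return incoming;
  if (incoming.timestamp !== current.timestamp) {
    return incoming.timestamp > current.timestamp ? incoming : current;
  }
  return incoming.userId > current.userId ? incoming : current;
}

// One FIFO queue of pending AI requests per canvas object, drained in arrival order.
class CanvasAiQueue {
  constructor() {
    this.pending = new Map(); // objectId -> queued requests
    this.state = new Map();   // objectId -> winning AI-generated change
  }
  enqueue(objectId, request) {
    if (!this.pending.has(objectId)) this.pending.set(objectId, []);
    this.pending.get(objectId).push(request);
  }
  drain(objectId, generate) {
    const queue = this.pending.get(objectId) ?? [];
    while (queue.length > 0) {
      const request = queue.shift();
      const change = { ...request, value: generate(request.prompt) };
      this.state.set(objectId, lwwMerge(this.state.get(objectId), change));
    }
    return this.state.get(objectId);
  }
}

const queue = new CanvasAiQueue();
queue.enqueue("header", { userId: "ana", timestamp: 2, prompt: "make it blue" });
queue.enqueue("header", { userId: "ben", timestamp: 1, prompt: "make it red" });
const winner = queue.drain("header", (prompt) => `applied: ${prompt}`);
console.log(winner.userId, winner.value); // ana applied: make it blue
```

Queueing serialises the non-deterministic generations per object, and the merge rule makes the surviving change independent of arrival order.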

The design quality ceiling: what Stitch still can't do
Stitch's core limitation remains consistent across all reviews: generated designs are starting points, not finished products. The AI produces layouts with appropriate components and reasonable visual hierarchy, but professional polish — precise spacing, custom illustration integration, brand-specific typography choices, edge-case state design (empty states, error states, loading states) — still requires human design expertise. Stitch is strongest for exploration and prototyping; weakest for production-ready UI that needs to meet professional brand standards.

Lessons

Acquiring a specialised AI startup accelerates a product category by months, not weeks. Google had the models (Gemini) but not the product layer (Galileo). Galileo had the product layer but not the model quality or distribution. The acquisition combined both instantly. Teams building in AI-adjacent product categories should evaluate whether acquiring specialised AI startups is faster than building the application layer from scratch on top of foundation models.
Intermediate representation between AI generation and format-specific output is the architecture that makes multi-format export viable. Generating React directly loses Figma compatibility. Generating Figma directly loses code usability. An IR exports to both, and to future formats not yet defined.
Free with generous limits is a viable disruption strategy when the underlying AI cost is subsidised. Google can offer Stitch free because Gemini API calls are already budgeted across Google's infrastructure at marginal cost. Figma cannot match free without destroying its revenue model. This asymmetry is the structural moat Stitch is building — not feature parity, but cost parity at zero.
Build the complement-not-replace narrative from day one. Sarah Drasner's explicit framing of Stitch as a Figma complement — not replacement — reduced designer resistance and encouraged adoption among professional users. Fighting the dominant tool's ecosystem directly creates adversarial resistance. Complementing it creates adoption.
Streaming generation (delivering AI outputs progressively as they are computed) changes the product experience more profoundly than speed improvements do. A 30-second generation showing nothing for 28 seconds feels slow. A 30-second generation showing components appearing in real time and allowing mid-stream steering feels like collaboration. Same underlying model, fundamentally different user experience.

Engineering Glossary

CRDT (Conflict-free Replicated Data Type) — a data structure designed for distributed systems that allows multiple users to edit the same data concurrently without conflicts, automatically merging changes. Used in Stitch's multiplayer for deterministic manual edits alongside non-deterministic AI-generated changes.

Gemini 2.5 Pro — Google's multimodal frontier model capable of processing text, images, audio, video, and code in a single context. Stitch uses it as the core reasoning engine for interpreting design intent and generating UI outputs.

Intermediate representation (IR) — an abstract, format-agnostic description of design intent — component hierarchy, spacing tokens, visual relationships — that can be translated into multiple output formats (Figma JSON, React, HTML/CSS) without losing design semantics.

RLHF (Reinforcement Learning from Human Feedback) — a training technique where human evaluators rate model outputs, and those ratings are used to fine-tune the model toward preferred outputs. Used by Stitch to improve component rendering fidelity from ~70% to 95% accuracy.

Streaming generation — delivering AI outputs progressively as they are computed, rather than waiting for the full generation to complete before showing any output. Enables mid-generation steering and real-time canvas rendering.

Vibe design — the practice of describing interface intent to an AI and refining the output iteratively, rather than building pixel by pixel. The AI design equivalent of vibe coding.

This case is a plain-English retelling of publicly available engineering material.


OpenAI Deployed a Tool to Monitor Kubernetes — and It Took Down All of Kubernetes

TechLogStack — Thu, 21 May 2026 00:00:00 +0000

4h 22m outage — 3:16 PM to 7:38 PM PST, December 11 2024
25 minutes from rollout start (2:51 PM) to all OpenAI products degrading (3:16 PM)
0 staging warnings — telemetry service passed validation completely
0 regions with staged rollout — applied to all clusters simultaneously
All OpenAI services affected: ChatGPT, the API, and Sora simultaneously
Engineers locked out of clusters — kubectl requires a control plane that was down

On December 11, 2024, OpenAI deployed a new telemetry service designed to improve Kubernetes observability — to give engineers better visibility into how their clusters were behaving, to catch problems earlier. Within 29 minutes, the telemetry service had crashed the Kubernetes control plane across every cluster. ChatGPT, the API, and Sora were all unavailable. And the engineers responsible for fixing it couldn't run kubectl — the control plane that manages Kubernetes was down, and it was the only way back in.

The Story

Our tests didn't catch the impact the change was having on the Kubernetes control plane. DNS caching added a delay between making the change and when services started failing. Remediation was very slow because of the locked out effect.

— OpenAI, December 11 2024 Incident Postmortem, status.openai.com

The events unfolded with the particular cruelty of incidents where staging does not predict production. The telemetry service was deployed to a staging cluster on December 10 and verified as working correctly. On December 11 at 2:51 PM, the change began rolling out to all production clusters. At 3:16 PM, 25 minutes in and a few minutes before the rollout had even completed, all OpenAI products began degrading. The root cause: a configuration that caused every node in every cluster to execute resource-intensive Kubernetes API operations simultaneously. The cost of these operations scaled with cluster size, so the largest, most critical clusters were hit hardest and fastest.

DNS Caching: The Hidden Time Bomb

The staging environment passed for two reasons. First, the staging cluster was small: the telemetry service's API load scaled with cluster size, so the small staging cluster generated a manageable load. Second, DNS caching masked the failure. When the telemetry service started overwhelming the Kubernetes API servers, services that had already cached DNS responses continued functioning temporarily through stale cache entries. Engineers saw a clean deployment and services continuing to function, until the DNS cache entries expired and everything that hadn't failed yet failed all at once.
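The masking effect described above can be illustrated with a toy TTL cache. This is a minimal sketch, not OpenAI's resolver: the `TTLCache` class, the 30-second TTL, the service name, and the `resolve_upstream` function are all hypothetical and chosen only for illustration.

```python
# Toy model of DNS caching hiding an upstream failure (all names hypothetical).

class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._entries = {}  # name -> (address, expiry_time)

    def lookup(self, name, now, resolve_upstream):
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]               # cache hit: upstream is never consulted
        address = resolve_upstream(name)  # raises if DNS is unavailable
        self._entries[name] = (address, now + self.ttl)
        return address


dns_state = {"up": True}

def resolve_upstream(name):
    if not dns_state["up"]:
        raise RuntimeError(f"DNS unavailable for {name}")
    return "10.0.0.7"

def service_healthy(cache, now):
    try:
        cache.lookup("payments.internal", now, resolve_upstream)
        return True
    except RuntimeError:
        return False

cache = TTLCache(ttl_seconds=30)
timeline = [service_healthy(cache, now=0)]       # DNS up: entry cached until t=30
dns_state["up"] = False                          # DNS breaks at roughly t=1
timeline.append(service_healthy(cache, now=10))  # looks healthy: stale cache hit
timeline.append(service_healthy(cache, now=29))  # still looks healthy
timeline.append(service_healthy(cache, now=31))  # TTL expired: failure surfaces
print(timeline)
```

The health signal stays green for the whole TTL window even though the dependency is already broken, which is why a deployment can look clean at first and fail minutes later.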

Problem

Telemetry Rollout to All Clusters in 29 Minutes

At 2:51 PM PST, the new telemetry service configuration began rolling out to all Kubernetes clusters simultaneously. The service's configuration caused every node in every cluster to issue simultaneous resource-intensive Kubernetes API calls — a load that scaled with cluster size, hitting the largest, most critical clusters hardest.

Cause

Kubernetes Control Plane Overwhelmed — DNS and Service Discovery Broken

With thousands of nodes simultaneously hammering the Kubernetes API servers, the control planes of most large clusters crashed. Kubernetes's control plane (the set of components managing overall cluster state — API server, etcd, scheduler, controller manager) manages service discovery and DNS resolution. When it failed, services could no longer find each other. DNS cache expiry then propagated the failure to services temporarily protected by stale cache entries, turning partial degradation into complete cascading failure.

Solution

The Locked-Out Problem: No kubectl Access

Recovery required rolling back the telemetry configuration — but rolling back Kubernetes configurations requires kubectl, which requires a functioning Kubernetes control plane. The control plane was down. Engineers were effectively locked out of the clusters they needed to fix. Recovery required out-of-band mechanisms: directly accessing nodes through cloud provider management consoles, bypassing the Kubernetes layer entirely to remove the telemetry service's configuration.

Result

4h 22min Outage, Full Postmortem Published

ChatGPT reached substantial recovery at 5:45 PM PST. Full recovery across all services was achieved at 7:38 PM PST — 4 hours and 22 minutes after the incident began. OpenAI published a detailed postmortem identifying four root causes and committing to specific architectural changes including break-glass emergency access mechanisms and staged rollouts for all infrastructure changes.

The Fix

What Actually Broke and Why Recovery Took Four Hours

The telemetry service's configuration caused each node to watch Kubernetes API resources continuously. The Watch API is a Kubernetes feature that lets clients receive a stream of events as resources change; each watcher holds a persistent connection to the API server, so server cost grows with the number of watchers, and here the number of watchers grew with cluster size. Across thousands of nodes in large clusters, these calls compounded into an overwhelming flood. The API servers became saturated. With them unresponsive, etcd (the distributed key-value store backing all Kubernetes state: node metadata, pod specifications, service definitions) became unreachable. Without etcd, API servers couldn't recover. Without API servers, nothing could be changed. The cluster was in a deadlock.

4h 22m — total outage duration, 3:16 PM to 7:38 PM PST — longest single outage in ChatGPT's history at the time
29 min — time to apply the change to every cluster, with products already degrading 25 minutes in; the full fleet was affected before the scope was understood
All — services affected simultaneously: ChatGPT, API, Sora — every OpenAI product at once
0 — staging warnings — staging clusters were too small to reproduce the API call scaling behaviour that took down production

# Simplified model of the failure: telemetry service overwhelming the K8s API.
# All numbers are illustrative; none come from OpenAI's postmortem.
# In this toy model, load grows linearly with node count. The outage comes from
# crossing the API servers' capacity threshold, not from the growth rate itself.

TELEMETRY_CONFIG = {
    "watch_all_pods": True,      # persistent connection per node to API server
    "watch_all_nodes": True,     # another persistent connection per node
    "watch_all_services": True,  # another persistent connection per node
    "poll_interval_ms": 100,     # aggressive: 10 requests/second per watcher
}

# Assumed capacity for this model only (hypothetical figure).
ASSUMED_API_CAPACITY_RPS = 5_000

def api_calls_per_second(cluster_size: int) -> int:
    # Count the enabled watchers (3 here) and derive 10 calls/sec from the interval.
    watchers = sum(
        1 for key, enabled in TELEMETRY_CONFIG.items()
        if key.startswith("watch_") and enabled
    )
    calls_per_watcher = 1000 // TELEMETRY_CONFIG["poll_interval_ms"]
    return cluster_size * watchers * calls_per_watcher

# Staging cluster (100 nodes):
staging_load = api_calls_per_second(100)   # 3,000/sec: under the assumed capacity

# Large production cluster (5,000 nodes):
prod_load = api_calls_per_second(5000)     # 150,000/sec: 30x the assumed capacity
# API server saturated within seconds → DNS breaks → services go blind
# kubectl stops working → engineers locked out

# THE LOCKED-OUT DEADLOCK:
# Fix requires: kubectl → needs API server → API server is down → needs fix
#
# RECOVERY PATH (bypassing K8s entirely):
# 1. SSH to nodes via cloud provider console (not through K8s)
# 2. Manually stop telemetry service process on each node
# 3. API server load drops → control plane recovers
# 4. kubectl works again → roll back config through standard channels
# 5. Monitor DNS propagation and service recovery across fleet

The Four Root Causes from OpenAI's Postmortem

(1) Staging cluster too small — the failure only manifested at production cluster sizes. (2) DNS caching masked the initial failure — services continued on stale cache entries, giving engineers a false "clean deployment" signal before cache expiry revealed the truth. (3) No canary deployment — configuration applied to all clusters simultaneously rather than validated incrementally. (4) No break-glass mechanism — no pre-arranged out-of-band access path for the scenario where the standard Kubernetes management plane was unavailable.

Recovery steps — bypassing Kubernetes entirely:

Access individual nodes directly through the cloud provider's management console — not through Kubernetes
Manually stop the telemetry service process on each node to eliminate the API call flood
With load removed, Kubernetes API servers begin recovering
Once kubectl is functional, roll back the telemetry service configuration through standard channels
Monitor service recovery and DNS propagation across the fleet

Post-incident engineering commitments:

Immediate — locked the telemetry configuration to prevent re-deployment
Short-term — implement break-glass emergency access that functions when the K8s control plane is unavailable
Medium-term — decouple observability infrastructure from the components it monitors
Long-term — all infrastructure configuration changes use staged deployment with continuous monitoring and the ability to halt at any percentage

The iOS 18.2 coincidence
Apple shipped iOS 18.2 — which introduced ChatGPT integration into Apple Intelligence — on the same day as the outage. Millions of users who updated and then tried ChatGPT saw it was unavailable. Social media immediately speculated that the iOS update had caused the outage. OpenAI's postmortem was explicit: iOS 18.2 had nothing to do with it. The telemetry failure had already begun degrading infrastructure before the iOS update's traffic could have any effect. Correlation — especially coincidence of timing — is not causation, and attributing outage causes to the most visible concurrent event is a common and often wrong instinct.

Architecture

OpenAI's Kubernetes architecture runs the inference clusters powering ChatGPT's model serving, the API gateway, and the Sora video generation pipeline — all depending on the Kubernetes control plane for service discovery, DNS resolution, pod scheduling, and configuration management. When a single telemetry service configuration saturated the API servers, it took all three of these simultaneously.

The Failure Chain: From Telemetry Deployment to Complete Outage

View interactive diagram on TechLogStack →

Recovery Architecture: Bypassing Kubernetes to Restore It

View interactive diagram on TechLogStack →

Why Kubernetes Control Plane Failure Is Catastrophic

The control plane manages three things that are catastrophic to lose simultaneously: DNS resolution (services find each other by name, not IP — without DNS, microservices go blind), service discovery (load balancers can't route to healthy pods without the API server updating configuration), and pod scheduling (crashed pods can't be restarted, replicas can't be scaled). In most partial failures, you lose one of these. A control plane failure loses all three — and recovery requires the control plane to function, creating a circular dependency that demands pre-arranged out-of-band access.

The staged rollout that would have caught it
A staged rollout — 1 cluster → verify 30 minutes → 10% of clusters → verify → 50% → verify → 100% — would have caught this failure at the 1-cluster stage. One large cluster showing API server saturation is a signal. One large cluster crashing before engineers even understood why is an outage. The difference between the two outcomes is a verification window between deployment stages — time to observe behaviour before the next stage commits. OpenAI's December 11 deployment had no such window: configuration applied to all clusters in 29 minutes without a verification pause.
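The staged rollout described above can be sketched as a loop with a verification gate between stages. This is a minimal sketch under stated assumptions: `staged_rollout`, `apply_to`, and `is_healthy` are hypothetical names, the stage percentages mirror the text, and a real gate would wait and watch API server metrics rather than return immediately.

```python
# Sketch of a staged rollout: one cluster, then 10%, 50%, 100% of the fleet,
# halting at the first stage that fails verification.

def staged_rollout(clusters, apply_to, is_healthy, percents=(10, 50, 100)):
    """Apply a change stage by stage and return the clusters that received it."""
    n = len(clusters)
    # Cumulative targets: a single cluster first, then ceil(n * p / 100) clusters.
    targets = [1] + [max(1, (n * p + 99) // 100) for p in percents]
    done = []
    for target in targets:
        batch = clusters[len(done):target]
        for cluster in batch:
            apply_to(cluster)
        done.extend(batch)
        # Verification window: a real pipeline would pause here (for example
        # 30 minutes) and watch control plane load before continuing.
        if not all(is_healthy(c) for c in done):
            return done  # halt: blast radius limited to clusters touched so far
    return done


fleet = [f"cluster-{i}" for i in range(20)]

# A change that overloads every cluster it touches is stopped after one cluster.
halted = staged_rollout(fleet, apply_to=lambda c: None, is_healthy=lambda c: False)

# A harmless change proceeds through every stage.
completed = staged_rollout(fleet, apply_to=lambda c: None, is_healthy=lambda c: True)

print(halted, len(completed))
```

The design choice that matters is the early return: the function cannot reach the next stage without passing the health check on everything already changed.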

Lessons

Observability infrastructure is production infrastructure. A telemetry service deployed across your entire fleet has the blast radius of your entire fleet. Deploy it with the same staged rollout rigor you apply to production services: one cluster, verify, one region, verify, full fleet. The December 11 rollout applied the configuration to all clusters in 29 minutes. A staged rollout would have revealed the problem on the first cluster before it cascaded.
DNS caching (storing DNS lookup results locally for a period defined by the record's TTL) is a reliability asset that becomes a diagnostic liability during incidents. When an infrastructure change breaks DNS, services continue functioning on cached entries — masking the failure until TTLs expire. If your deployment passes initial health checks and then fails minutes later at scale, DNS cache expiry is a likely explanation. Monitor DNS resolution success rates separately from application health checks.
Build break-glass emergency access before you need it. The December 11 engineers needed to access nodes directly, bypassing the Kubernetes control plane, using mechanisms that had not been pre-arranged. Pre-arrange them. Every Kubernetes deployment should have a documented, tested procedure for accessing nodes when kubectl is unavailable. Like any emergency procedure, it must be practiced before the emergency.
Size-dependent bugs (failures manifesting only at production scale because their severity is a non-linear function of system size) cannot be caught by functional testing on clusters much smaller than production. Load test infrastructure changes against production-equivalent cluster sizes. If production-scale testing is not feasible, test at 10% of production scale and extrapolate load metrics before applying to the full fleet.
Decouple the components that manage your infrastructure from the infrastructure they manage. The Kubernetes control plane should not be the only path to emergency recovery. If the control plane fails, some emergency management capability should remain available independently of the failed layer.
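The "test at 10% and extrapolate" lesson above can be made concrete with a small projection check. This is a rough sketch, assuming load grows linearly with node count; the function names, the measured figures, the capacity, and the 50% headroom rule are all hypothetical. If growth is worse than linear, the projection understates the risk.

```python
# Project API load from a reduced-scale test before a fleet-wide rollout.

def extrapolate_load(measured_nodes, measured_rps, production_nodes):
    """Project requests/sec at production size, assuming linear growth per node."""
    per_node = measured_rps / measured_nodes
    return per_node * production_nodes

def safe_to_roll_out(projected_rps, capacity_rps, headroom=0.5):
    """Require the projected load to stay within a fraction of known capacity."""
    return projected_rps <= capacity_rps * headroom


# Hypothetical measurement: a 500-node test cluster (10% of production)
# generated 15,000 requests/sec against the API servers.
projected = extrapolate_load(measured_nodes=500, measured_rps=15_000,
                             production_nodes=5_000)
decision = safe_to_roll_out(projected, capacity_rps=20_000)

print(projected, decision)
```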

Engineering Glossary

Break-glass mechanism — a pre-arranged, out-of-band access path to infrastructure that functions even when the standard management layer is unavailable. Named after the physical "break glass in case of emergency" safety cabinet. The absence of a break-glass mechanism was one of OpenAI's four identified root causes.

DNS caching — storing the results of DNS lookups locally for a period defined by the record's TTL (Time to Live), allowing services to resolve domain names without contacting the DNS server on every request. A reliability asset under normal conditions; a diagnostic liability that masks failures during incidents.

etcd — the distributed key-value store that backs all Kubernetes cluster state — node metadata, pod specifications, service definitions. Kubernetes API servers cannot function without access to etcd; etcd unavailability produces total control plane failure.

Kubernetes control plane — the set of components managing overall Kubernetes cluster state: the API server (handles all REST operations), etcd (state store), the scheduler (assigns pods to nodes), and the controller manager (runs reconciliation loops). Typically runs on dedicated control-plane nodes (historically called master nodes), separate from the worker nodes running actual workloads.

Locked-out effect — the circular dependency where recovering from a Kubernetes control plane failure requires kubectl, which requires a functioning control plane. The cluster is frozen in a state where existing workloads continue running but nothing can be changed, fixed, scaled, or recovered through standard channels.

Size-dependent bug — a failure that only manifests at production scale because its severity is a non-linear function of system size. A 100-node staging cluster may pass cleanly while a 5,000-node production cluster fails catastrophically: the same configuration produces 50× the load, which is enough to push the API servers past saturation.

Watch API — a Kubernetes API feature allowing clients to receive a stream of events as resources change. More efficient than polling, but creates a persistent connection from each watching client to the API server, consuming server resources proportional to the number of watchers. Misused by the December 11 telemetry service; in the simplified model used in this case (three watches per node on a 5,000-node cluster), that amounts to 15,000 persistent connections on a single large cluster.

This case is a plain-English retelling of publicly available engineering material.


Google's Gemini Omni Is the First AI That Creates From Anything — Here Is What That Actually Means

TechLogStack — Thu, 21 May 2026 00:00:00 +0000

Any → Video — text, image, audio, and video inputs simultaneously → video output in one model
1 model vs chained pipeline (Veo + Imagen + Lyria) — the architectural difference that enables cross-modal reasoning
10 seconds — maximum clip length at Flash launch; longer-form on the roadmap
2B+ users — YouTube Shorts monthly active users with Day 1 Omni integration
SynthID watermark on every generation — survives re-encoding, resizing, and colour grading
Conversational editing — full context retained turn-to-turn, no re-prompting from scratch

For three years, Google built Gemini to be "natively multimodal." At I/O 2026, they finally showed what that phrase means in practice. Gemini Omni takes a photo, an audio clip, a video, and a text description — all at once — and produces a new video that reflects all of them simultaneously. This is not four models chained together. It is one — and the distinction is architectural, not cosmetic.

The Story

When we first announced Gemini, it was our first AI model to be natively multimodal. We knew that training it on a combination of text, code, audio, images, and video would give it a deeper understanding of the world. With world models, AI is moving from predicting text to simulating reality. Gemini Omni is the next step in that direction.

— Sundar Pichai, CEO of Google, Google I/O 2026, May 19 2026

The phrase "natively multimodal" had been in Google's vocabulary since Gemini's December 2023 announcement — describing an aspiration more than a reality. At I/O 2026, Google delivered the concrete version: Gemini Omni, a model that accepts text, image, audio, and video simultaneously and generates video as output — not by chaining Veo, Imagen, and Lyria together, but by processing all of them within a single transformer's forward pass. A chain of models cannot reason about relationships between its inputs. A unified model can.

The path from Gemini's announcement to Omni runs through three milestones. Gemini 2.0 Flash (late 2024) introduced native audio output and real-time multimodal interaction — the first demonstration that Gemini could generate, not just understand, audio and video natively. Project Astra explored continuous, persistent understanding of physical environments through video and audio streams. Nano Banana (2025) brought Gemini's intelligence to image generation and editing, establishing the UX patterns — natural language editing, reference image input, conversational refinement — that Omni extends to video. Omni synthesises all three threads into a single production model.

Chained Models vs Native Omni: The Fundamental Difference

OpenAI's Sora and Google's Veo were excellent at their specific tasks but could not natively reason across modalities. Generating a video matching a specific audio track and reference image required: (1) generate a video with Veo from a text description, (2) separately process the audio, (3) manually synchronise the two. Gemini Omni collapses these three steps into one prompt — upload the image, the audio, write a description, and the model reasons about all three simultaneously. The unified context window is what makes this possible.

Problem

Multimodal AI Was a Pipeline of Specialised Models

The previous state-of-the-art for multimodal content creation required chaining specialised models — text-to-video, text-to-image, text-to-audio — and manual integration. Each handoff between models lost context: the relationship between audio tempo and visual rhythm, the visual style of a reference image, the emotional tone of a text prompt. Creators managed these integrations manually, limiting access to specialists.

Cause

Separate Models Cannot Reason Across Modality Boundaries

A video model that receives a reference image as a text description has lost the actual pixel relationships. A video model that receives an audio file as a text description has lost the actual waveform data. Genuine multimodal reasoning requires all modalities in the same context window — not converted to text summaries of each other.

Solution

One Transformer Trained on All Modalities Simultaneously

Gemini Omni was trained on text, image, audio, and video simultaneously within a single transformer architecture. The model develops internal representations encoding cross-modal relationships — understanding that a warm colour palette relates to a particular musical key, that physical object behaviour in video follows the laws of physics Gemini has observed across its training data.

Result

Any Input to Video Output, With Conversational Editing

Gemini Omni Flash launched May 19 2026 in the Gemini app and YouTube Shorts — 10-second clips, API access planned within weeks. The model accepted any combination of inputs and produced video with character consistency, physics grounding, and SynthID watermarking. Conversational editing retained full context across turns — a generated scene could be revised through natural language without re-prompting from scratch.

The Fix

Architecture: How Natively Multimodal Actually Works

Gemini Omni's architecture is a transformer trained across all modalities simultaneously. It is not a mixture-of-experts architecture (one in which different "expert" subnetworks specialise in different input types, with a routing mechanism that directs each input to the appropriate expert) with separate video, image, and audio experts, but a single dense model where all modalities interact in every layer. A visual token and an audio token from the same moment in a video can attend to each other directly within the same attention layer, rather than being processed by separate networks whose outputs are later merged.
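The difference between tokens attending across modalities and tokens confined to their own modality can be shown with generic scaled dot-product attention. This is a toy illustration, not Gemini Omni's actual architecture, which has not been published in this detail: the two-dimensional embeddings are made up, and the "separate networks" case is approximated by masking out the other modality.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys, allowed):
    """Scaled dot-product attention weights from one query over the allowed keys."""
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale if ok else float("-inf")
        for key, ok in zip(keys, allowed)
    ]
    return softmax(scores)

# Toy sequence: two visual tokens and two audio tokens (made-up embeddings).
tokens = [
    ("visual", [1.0, 0.0]),
    ("visual", [0.8, 0.2]),
    ("audio",  [0.9, 0.1]),
    ("audio",  [0.0, 1.0]),
]
keys = [vec for _, vec in tokens]
query_modality, query = tokens[0]

# Unified model: the visual query may attend to every token in the sequence.
joint = attention_weights(query, keys, allowed=[True] * len(tokens))

# Per-modality networks: the visual query only ever sees visual tokens.
separate = attention_weights(
    query, keys, allowed=[m == query_modality for m, _ in tokens]
)

cross_modal_joint = sum(w for w, (m, _) in zip(joint, tokens) if m == "audio")
cross_modal_separate = sum(w for w, (m, _) in zip(separate, tokens) if m == "audio")
print(cross_modal_joint, cross_modal_separate)
```

In the joint case some attention weight lands on the audio tokens; in the masked case it is exactly zero, so any audio-visual relationship would have to be reconstructed after the fact.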

Any→Video — text, image, audio, video inputs simultaneously → video output with physics grounding
10s — maximum clip length at Flash launch; longer-form on the roadmap
SynthID — imperceptible watermark embedded in pixel-level statistical patterns; survives re-encoding, resizing, and colour grading
1 model — vs chained pipeline (Veo + Imagen + Lyria); unification enables cross-modal reasoning that pipeline architectures cannot match

The conversational editing model is Omni's most transformative product experience. Previous video generation tools operated like vending machines: insert prompt, receive video, discard and re-insert if wrong. Gemini Omni operates like a continuous creative collaboration: generate a scene, ask for the camera angle to change, ask for a second character to enter — and the model keeps the context of every previous instruction. The resulting video reflects all decisions across the conversation, not just the most recent prompt.

# Conceptual: Gemini Omni vs the chained model approach it replaces
# Illustrates the architectural difference; API details TBC when GA.
# Assumptions: `veo`, `lyria`, their client classes, and the 'gemini-omni-flash'
# model name are placeholders, so this sketch will not run as written.

# OLD APPROACH: Chaining specialised models, context lost at every handoff
from veo import VeoClient      # placeholder module
from lyria import LyriaClient  # placeholder module

audio_clip = LyriaClient().generate(
    prompt="upbeat electronic music, 10 seconds"
)  # no knowledge of the visual reference

video = VeoClient().generate(
    prompt="city timelapse, matches photo style",
    reference_image=None  # can't process image input; can't see the audio
)
# Manual synchronisation: the user's problem

# GEMINI OMNI: One model, all modalities in one prompt
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel('gemini-omni-flash')

# A chat session carries context from turn to turn;
# separate generate_content() calls are stateless.
chat = model.start_chat()

response = chat.send_message([
    "Create a 10-second timelapse of a city transforming from day to night.",
    genai.upload_file('reference_photo.jpg'),  # actual pixel data: style extracted
    genai.upload_file('audio_track.mp3'),      # actual waveform: beat sync possible
    genai.upload_file('reference_clip.mp4')    # actual video: motion style extracted
])
# Output: video reflecting the photo's style, synced to audio's beat,
# using the reference clip's camera movement, all from one inference pass

# Conversational editing: the session history preserves context across turns
response2 = chat.send_message(
    "Same scene, but make it rain as night falls"
    # Session retains: city style, audio, reference clip; no re-upload needed
)

SynthID: Watermarking That Cannot Be Removed

Every Gemini Omni video carries an imperceptible SynthID watermark embedded in the pixel data's statistical patterns, not in metadata. It is designed to survive re-encoding to different codecs, resizing, colour grading, and speed adjustments. Alongside the watermark, C2PA content credentials let any compatible platform verify that a video was generated by a Gemini product. Digital avatars additionally require mandatory onboarding (recording yourself, speaking verification numbers) before use, a guardrail against deepfakes built into the product from day one.

World models: the theoretical foundation behind physics grounding
Sundar Pichai described Omni as a step toward world models — AI systems that simulate physical and social reality rather than just predict token sequences. A language model predicting video token sequences will produce realistic-looking but physically incorrect motion: objects falling upward, light sources moving inconsistently, bodies with impossible joint angles. A world model that has internalised physics and causality from its training data produces videos where motion is physically coherent because the model understands why objects move the way they do, not just what they look like when they move.

Character consistency: how the long context window makes this possible
A character introduced in scene 1 retains their face, clothing, and voice across all subsequent scenes in the same conversation, without the creator re-uploading the reference image for each shot. This is enabled by Gemini's long context window: the model carries the character's visual description as implicit context throughout the conversation. Competing video models, which have shorter effective contexts, required reference images at every generation turn and still produced inconsistent results.

Architecture

Gemini Omni's internal architecture reflects the design philosophy Gemini has had since its December 2023 announcement: train a single model on all modalities simultaneously so that cross-modal understanding is emergent from training, not engineered through explicit routing. The practical consequence is that Omni's internal representation of a video frame encodes relationships to audio, text context, and physical reality simultaneously — enabling generation that reflects all input modalities without explicit instructions about how to combine them.

Chained Pipeline vs Gemini Omni: Architectural Comparison

View interactive diagram on TechLogStack →

Gemini Omni: Conversational Editing Flow and Context Retention

View interactive diagram on TechLogStack →

The YouTube Shorts Integration: Distribution as the Moat

Gemini Omni's Day 1 integration into YouTube Shorts is a distribution strategy no standalone AI video tool can match. Creators generate a 10-second clip directly within YouTube's creation tools — no separate app, no API key. Every Omni-generated Short carries YouTube's standard content policy enforcement on top of SynthID watermarking, and is labelled as AI-generated in discovery surfaces. This is the first time a frontier AI video model has had a direct distribution path to a 2-billion-user platform on launch day.

C2PA content credentials: the open standard for AI provenance
C2PA (Coalition for Content Provenance and Authenticity — an open technical standard co-developed by Adobe, Microsoft, BBC, Intel, Sony, and others) cryptographically signs digital content at the point of creation with metadata about its origin and modification history. Any C2PA-compatible media player or content verification tool can confirm that a video was generated by Gemini Omni, when it was generated, and (if the user consented) by whom. This resolves the "is this real?" question for media at scale — not by restricting AI generation, but by making AI generation verifiable.
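The idea of signing content at creation so that later tampering is detectable can be sketched with the Python standard library. This is a deliberately simplified illustration and not the C2PA format: real content credentials use certificate-based signatures and a defined manifest structure, whereas this sketch uses a shared-secret HMAC, and the key, field names, and generator string are all hypothetical.

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key-not-a-real-credential"  # hypothetical shared secret

def make_credential(content: bytes, generator: str, created: str) -> dict:
    """Bind provenance metadata to a hash of the content and sign the result."""
    manifest = {
        "content_sha256": hashlib.sha256(content).hexdigest(),
        "generator": generator,
        "created": created,
    }
    payload = json.dumps(manifest, sort_keys=True).encode()
    signature = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"manifest": manifest, "signature": signature}

def verify_credential(content: bytes, credential: dict) -> bool:
    """Check that the manifest is untampered and still matches the content."""
    payload = json.dumps(credential["manifest"], sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    signature_ok = hmac.compare_digest(expected, credential["signature"])
    content_hash = hashlib.sha256(content).hexdigest()
    return signature_ok and credential["manifest"]["content_sha256"] == content_hash


video = b"placeholder video bytes"
credential = make_credential(video, generator="example-generator", created="2026-05-19")

original_ok = verify_credential(video, credential)            # untouched content
edited_ok = verify_credential(video + b"!", credential)       # content was altered

forged = {"manifest": dict(credential["manifest"], generator="someone-else"),
          "signature": credential["signature"]}
forged_ok = verify_credential(video, forged)                  # metadata was altered
print(original_ok, edited_ok, forged_ok)
```

Unlike a pixel-level watermark, a credential of this kind lives beside the content, so it proves origin only for files that still carry it.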

Lessons

Training a single model on all modalities simultaneously is architecturally superior to chaining specialised models for tasks requiring cross-modal reasoning. A chain of models loses pixel relationships, waveform data, and temporal correlations at every handoff. A unified model retains them throughout. The performance gap between chained and unified architectures grows with the complexity of the cross-modal reasoning required.
World models (AI architectures that simulate the physical and causal structure of reality rather than predict what the next frame statistically should look like) produce more coherent generated video than token-prediction models. They model causality rather than correlation. "AI is moving from predicting text to simulating reality" is the product-facing version of this architectural shift.
The conversational editing model changes who can use AI video generation. Prompt-and-retry was a specialist workflow — only people fluent in prompt engineering got good results efficiently. Conversational steering, where natural language revisions apply incrementally to a persistent context, is intuitive for anyone who has ever given feedback in a meeting.
Safety infrastructure is a prerequisite for deploying generative video at platform scale, not a post-launch patch. SynthID (Google's imperceptible AI-generated content watermark embedded in pixel-level statistical patterns — survives re-encoding, resizing, and colour processing), C2PA content credentials, and mandatory avatar onboarding verification are what make Omni deployable on YouTube without becoming deepfake infrastructure.
Distribution is the moat that model quality cannot easily overcome. An average model with YouTube Shorts integration reaches 2 billion users on Day 1. A superior model without distribution reaches the early-adopter population. Route new AI capabilities through existing products with existing users — don't build a new acquisition funnel when you don't have to.

Engineering Glossary

C2PA (Coalition for Content Provenance and Authenticity) — an open technical standard co-developed by Adobe, Microsoft, BBC, Intel, Sony, and others that cryptographically signs digital content at creation with metadata about its origin. Enables any C2PA-compatible tool to verify whether content is AI-generated, human-made, or modified.

Mixture of experts — a neural network architecture where different "expert" subnetworks specialise in different input types, with a routing mechanism directing each input to the appropriate expert. Contrasted with Gemini Omni's single dense model where all modalities interact in every layer.

Natively multimodal — a model architecture trained on multiple modalities (text, image, audio, video) simultaneously rather than routing between specialised single-modality models. Enables cross-modal reasoning that pipeline architectures cannot replicate.

Project Astra — Google DeepMind's ongoing research into a universal AI assistant that processes real-time audio and video streams continuously — exploring what it means for an AI to have persistent understanding of a physical environment.

SynthID — Google's imperceptible digital watermark embedded in the statistical patterns of AI-generated pixel data. Survives re-encoding, resizing, and colour grading. Enables AI provenance verification without visible degradation of the content.

World model — an AI architecture that simulates the physical and causal structure of reality — understanding why objects move, how light behaves, and what consequences follow from actions — rather than simply predicting what the next frame statistically should look like.

This case is a plain-English retelling of publicly available engineering material.

Read the full case on TechLogStack →

(Interactive diagrams, source links, and the full reader experience)

TechLogStack — built at scale, broken in public, rebuilt by engineers.

The 80% Problem: Why Getting an LLM System to 'Works in Demo' Is 20% of the Work

TechLogStack — Tue, 19 May 2026 00:00:00 +0000

0.02 → 0.61 Cohen's Kappa — LLM judge calibration from near-random to near-human agreement
0.69 — human evaluator Kappa baseline; the meaningful ceiling for any judge
300 examples — hand-crafted benchmark for the Flow agent; necessary but not sufficient
2 weeks — time to close the benchmark-to-production gap using the production mirroring flywheel
Weekly — Qwen3-32B retraining cadence on H200 GPUs (12h full run)
1,200 production LLM deployments analysed by ZenML — Shopify's findings are not exceptional, they are universal

Every team building with LLMs discovers the same brutal truth: 80% quality arrives in a few weeks. The final 15% — the gap between "impressive demo" and "product I'd trust with my customers" — takes the rest of the time. Shopify's Flow agent and Sidekick teams lived this curve and came back with a systematic playbook. It is mostly about measurement.

The Story

ZenML analysed 1,200 production LLM deployments and found a pattern so consistent it has become a rule: reaching 80% quality happens quickly, but pushing past 95% requires the majority of total development time. The teams that hit 80% in four weeks and spend the next six months trying to reach 95% are not failing — they are experiencing the standard engineering curve for AI systems. The teams that mistake 80% for done are the ones shipping products that quietly erode user trust.

Shopify's engineering teams, building both Sidekick (the merchant AI assistant) and the Flow agent (automated workflow generation from natural language), lived this curve in production. The Flow agent generates Shopify Flow automations from merchant descriptions — "when an order is over $200, add the customer to my VIP segment" — and produces a structured workflow. It uses tool calling (a pattern where an LLM is given a set of available functions with descriptions and can request that a specific tool be executed by generating a structured function call — enabling LLMs to take real-world actions beyond text generation) and operates in a domain-specific format. The task sounds well-bounded. In practice, the diversity of merchant intent is vast, edge cases accumulate rapidly, and subtle errors — a wrong condition operator, a missing trigger — produce silently incorrect automations that only fail when a merchant's order actually arrives.

Why Evaluation Is the Hard Part

Traditional software has a truth oracle: does the function return the correct value? LLM systems have no such oracle. A response can be grammatically correct, semantically reasonable, formatted perfectly — and still be wrong in ways only a domain expert would notice, or only appear wrong on the tenth interaction in a specific workflow. Without a reliable way to measure quality, you cannot improve systematically. You are optimising blind, hoping the next prompt change or model upgrade makes things better without making other things worse. Evaluation infrastructure is not overhead — it is the prerequisite for all other AI engineering work.

Problem

Benchmarks Said Ready; Production Said Otherwise

Shopify's fine-tuned Flow agent passed a hand-crafted 300-example benchmark at high accuracy. When deployed to production shadow traffic, performance on real merchant workflows diverged from the benchmark. The benchmark had been crafted by engineers who knew the system well and implicitly sampled from the distribution they understood. Real merchant intent had a long tail the benchmark didn't capture.

Cause

No Quality Signal Trustworthy Enough to Drive Iteration

The early LLM judge had a Cohen's Kappa (a statistical measure of agreement between two raters that corrects for chance — Kappa of 0 means agreement no better than random, 1.0 means perfect agreement) of 0.02 — barely better than random agreement with human evaluators. Engineering decisions based on its verdicts were effectively noise. Human evaluation at scale was impractical. Without a trustworthy quality signal, iteration was slow and direction was unclear.
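To make the statistic concrete, here is a minimal from-scratch sketch of Cohen's Kappa for two raters. The labels and the eight-item toy dataset are assumptions chosen for illustration, not Shopify's data. The example is built so that the judge agrees with the human on half the items, which sounds acceptable until chance agreement is subtracted.

```python
from collections import Counter

def cohen_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability that both pick the same label independently,
    # given each rater's own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # convention chosen here: both raters used one identical label
    return (p_o - p_e) / (1 - p_e)

# Toy data (hypothetical): 4 of 8 verdicts match, i.e. 50% raw agreement.
human = ["good", "good", "bad", "bad", "good", "bad", "good", "bad"]
judge = ["good", "bad", "bad", "good", "good", "bad", "bad", "good"]
print(f"kappa = {cohen_kappa(human, judge):.2f}")
```

With two balanced labels, 50% raw agreement is exactly what chance would produce, so Kappa comes out at zero. That is the situation a judge at 0.02 is in.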

Solution

Calibrated LLM Judge + Production Mirroring Flywheel

The team iteratively improved the LLM judge through systematic calibration against human labels (Kappa 0.02 → 0.61), then used it to score production traffic at scale. Production mirroring — routing real traffic through both current and candidate models — generated the failure cases that didn't appear in benchmarks. Those failures were fed back into the training dataset, closing the benchmark-to-production gap.

Result

Production Gap Closed in Two Weeks with the Flywheel

The gap from "benchmark-ready" to "production-ready" closed in two weeks using the production mirroring flywheel. The fine-tuned Flow agent now serves the majority of production traffic. Weekly retraining cycles on H200 GPUs mean the model continuously improves from new production signal rather than drifting as merchant behaviour evolves.

The Fix

Building the Evaluation Flywheel

Shopify's evaluation architecture is best understood as a flywheel: production traffic generates failures, failures feed the training pipeline, retraining improves the model, the improved model generates fewer failures, and the cycle continues. Each turn reduces the gap between benchmark performance and production performance. The flywheel only works if each component — quality measurement (LLM judge), failure collection (production mirroring), training (fine-tuning pipeline), deployment (shadow traffic + promotion) — is production-grade itself. A miscalibrated judge produces misleading signal. A flaky training pipeline slows iteration.

0.61 — Cohen's Kappa achieved after iterative calibration — close to the human evaluator baseline of 0.69, sufficient to drive reliable engineering decisions
300 — hand-crafted benchmark examples, covering the breadth of expected usage; initial quality gate before shadow testing
2 weeks — time to close the benchmark-to-production gap using the production mirroring flywheel
Weekly — Qwen3-32B retraining cadence on H200 GPUs; 12-hour full training run per cycle

# LLM judge calibration: the process from Kappa 0.02 to 0.61
# A judge is only useful if it agrees with humans. Measure agreement first.
# Conceptual sketch: call_llm, find_disagreements, improve_prompt,
# initial_judge_prompt and calibration_set are placeholders, not a real API.

from sklearn.metrics import cohen_kappa_score

def calibrate_llm_judge(judge_prompt: str, calibration_set: list[dict]) -> tuple[float, list[str]]:
    """
    calibration_set: list of {conversation, human_label} pairs
    human_label: 'good' | 'bad' | 'needs_improvement'
    Returns (Cohen's Kappa between judge and human labels, the judge's labels).
    Target: Kappa >= 0.60 before trusting judge at scale.
    """
    judge_labels = []
    for sample in calibration_set:
        verdict = call_llm(judge_prompt, sample['conversation'])
        judge_labels.append(verdict)

    human_labels = [s['human_label'] for s in calibration_set]
    return cohen_kappa_score(human_labels, judge_labels), judge_labels

# The calibration loop — iterate until the judge is trustworthy
judge_prompt = initial_judge_prompt
# Measure the starting point instead of assuming it (Shopify's first judge scored 0.02)
kappa, judge_labels = calibrate_llm_judge(judge_prompt, calibration_set)
while kappa < 0.60:
    # Analyse where judge and humans disagree
    disagreements = find_disagreements(calibration_set, judge_labels)

    # Improve judge prompt based on disagreement patterns:
    # - Add clarifying criteria for ambiguous cases
    # - Add few-shot examples where human label is the ground truth
    # - Adjust rubric language to match human intuitions
    judge_prompt = improve_prompt(judge_prompt, disagreements)

    # Re-score with the updated prompt so the next pass sees fresh labels
    kappa, judge_labels = calibrate_llm_judge(judge_prompt, calibration_set)
    print(f"Kappa: {kappa:.2f}")  # progression: 0.02 → 0.15 → 0.31 → 0.48 → 0.61

# Once Kappa >= 0.60: use judge to score production traffic at scale.
# Production mirroring then generates the failure cases that benchmarks
# never captured; feed those failures back into training data.

Production Mirroring: The Ground Truth Test

Benchmarks are necessary but not sufficient. A benchmark reflects the understanding of the engineers who created it. Production traffic reflects the actual diversity of user intent — including all edge cases, unusual phrasings, and unexpected use patterns no engineer anticipated. Production mirroring routes a percentage of real traffic through both the current model and the candidate model simultaneously, comparing outputs. Differences trigger human review of high-value or uncertain cases. This is the only way to discover whether a model improvement that looks good on a benchmark actually performs better for real users — or merely performs better on what engineers think real users want.
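The mirroring pattern described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Shopify's implementation: the `Mirror` class, its field names, and the exact-match comparison are all hypothetical, and a production system would call the candidate asynchronously and compare outputs with a calibrated judge and not with string equality.

```python
import random
from dataclasses import dataclass, field
from typing import Callable

Model = Callable[[str], str]  # hypothetical model interface: request in, output out

@dataclass
class Mirror:
    current: Model
    candidate: Model
    sample_rate: float = 0.1  # fraction of live traffic that is mirrored
    rng: random.Random = field(default_factory=random.Random)
    review_queue: list = field(default_factory=list)

    def handle(self, request: str) -> str:
        # The user always receives the current model's answer.
        served = self.current(request)
        if self.rng.random() < self.sample_rate:
            shadow = self.candidate(request)  # computed but never shown to the user
            if shadow != served:
                # Disagreements are queued for review.
                self.review_queue.append(
                    {"request": request, "current": served, "candidate": shadow}
                )
        return served

# Toy models standing in for the deployed and candidate systems.
mirror = Mirror(current=str.lower, candidate=lambda r: r.lower().strip(), sample_rate=1.0)
answers = [mirror.handle(r) for r in ["Tag VIP orders", " Tag VIP orders "]]
print(len(mirror.review_queue), "disagreement(s) queued for review")
```

The design point is that the candidate never affects what the user sees: its output exists only to be compared, so a regression costs review time and not merchant trust.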

Synthetic training data: how Shopify generated the Flow agent dataset
The Flow agent's fine-tuning training data was almost entirely synthetic — generated by an LLM, not labelled by humans. The three-step pipeline: (1) sample a diverse set of validated production workflows — at least one per unique workflow descriptor, from merchants with two or more qualifying workflows; (2) use a stronger LLM to generate a plausible natural-language merchant request that would lead to that workflow; (3) construct the ideal multi-turn tool call trajectory from request to completed workflow. The resulting dataset had two properties manual annotation lacks: scale (the production workflow corpus is large) and grounding (every training example was derived from a real workflow that actually ran). Synthetic data from real production outputs is the emerging standard for fine-tuning domain specialists.
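The three steps above can be sketched as follows. Everything named here is an assumption for illustration: the workflow record layout, the tool names, and `write_request` (a stub standing in for the stronger LLM) are hypothetical, and this version keeps exactly one workflow per descriptor where the real pipeline keeps at least one.

```python
from collections import defaultdict
from typing import Callable

def sample_workflows(workflows: list[dict]) -> list[dict]:
    """Step 1: one workflow per unique descriptor, from merchants with >= 2 workflows."""
    per_merchant = defaultdict(int)
    for wf in workflows:
        per_merchant[wf["merchant_id"]] += 1
    chosen = {}
    for wf in workflows:
        if per_merchant[wf["merchant_id"]] >= 2:
            chosen.setdefault(wf["descriptor"], wf)  # keep the first per descriptor
    return list(chosen.values())

def build_example(wf: dict, write_request: Callable[[dict], str]) -> dict:
    """Steps 2 and 3: synthesise the request, then the ideal tool-call trajectory."""
    request = write_request(wf)  # step 2: an LLM call in a real pipeline
    trajectory = [{"tool": s["tool"], "args": s["args"]} for s in wf["steps"]]
    return {"request": request, "trajectory": trajectory}

vip_steps = [
    {"tool": "add_trigger", "args": {"event": "order_created"}},
    {"tool": "add_condition", "args": {"field": "total", "op": ">", "value": 200}},
    {"tool": "add_action", "args": {"action": "add_customer_tag", "tag": "VIP"}},
]
workflows = [
    {"merchant_id": "m1", "descriptor": "tag_vip", "steps": vip_steps},
    {"merchant_id": "m1", "descriptor": "low_stock_alert", "steps": vip_steps[:1]},
    {"merchant_id": "m2", "descriptor": "tag_vip", "steps": vip_steps},  # m2 has only one workflow
]

def stub_writer(wf: dict) -> str:
    # Stand-in for the LLM that writes a plausible merchant request.
    return f"Please set up a workflow for: {wf['descriptor']}"

examples = [build_example(wf, stub_writer) for wf in sample_workflows(workflows)]
print(len(examples), "training examples")
```

Because each trajectory is copied from a workflow that already ran, the only generated part is the merchant request, which is what keeps the dataset grounded.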

Tangle: the ML pipeline that enables weekly retraining
The full training pipeline — data collection, synthetic data generation, fine-tuning, evaluation, deployment — runs on Tangle, Shopify's open-source ML experimentation platform. Tangle composes each pipeline step as a reproducible workflow with intelligent caching: only the steps affected by a change re-run. A change to the synthetic data generator doesn't trigger a full pipeline rerun — only the data generation step and its downstream steps re-execute. The caching infrastructure is what makes weekly retraining economically and operationally viable. Without it, the iteration cycle would be measured in months, not weeks.
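The caching behaviour described above can be illustrated with a small content-addressed pipeline. This is a generic sketch of the idea and not Tangle's API: the `Pipeline` class, the version strings (standing in for a hash of each step's code and configuration), and the step names are all assumptions.

```python
import hashlib

class Pipeline:
    def __init__(self):
        self.steps = []     # (name, version, func, upstream names), in execution order
        self.cache = {}     # cache key -> step output
        self.executed = []  # steps actually run during the most recent run()

    def add(self, name, version, func, upstream=()):
        self.steps.append((name, version, func, tuple(upstream)))

    def set_version(self, name, version):
        self.steps = [(n, version if n == name else v, f, u) for n, v, f, u in self.steps]

    def run(self):
        self.executed = []
        keys, outputs = {}, {}
        for name, version, func, upstream in self.steps:
            # The key covers this step's version and every upstream key, so a
            # change anywhere upstream changes the key and forces a re-run.
            material = "|".join([name, version, *(keys[u] for u in upstream)])
            key = hashlib.sha256(material.encode()).hexdigest()
            if key not in self.cache:
                self.cache[key] = func(*(outputs[u] for u in upstream))
                self.executed.append(name)
            keys[name], outputs[name] = key, self.cache[key]
        return outputs

p = Pipeline()
p.add("collect", "v1", lambda: ["wf1", "wf2"])
p.add("synthesise", "v1", lambda wfs: [f"request for {w}" for w in wfs], upstream=["collect"])
p.add("fine_tune", "v1", lambda data: f"model trained on {len(data)} examples", upstream=["synthesise"])

p.run()
first_run = list(p.executed)
p.run()
second_run = list(p.executed)
p.set_version("synthesise", "v2")  # simulate editing the synthetic data generator
p.run()
third_run = list(p.executed)
print(first_run, second_run, third_run)
```

After the generator changes, only `synthesise` and `fine_tune` re-execute; `collect` is served from cache, which is the property that keeps a weekly cadence affordable.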

Architecture

The evaluation architecture for production LLM systems has four components that form a cycle. Benchmark evaluation provides fast, reproducible quality gates during development. LLM-as-judge scoring provides continuous quality measurement at production traffic scale. Production mirroring provides ground truth about whether a candidate model performs better for real users. The training flywheel converts production failures into training examples, closing the gap each cycle. Each component is necessary; none is sufficient alone.

The Production LLM Evaluation Flywheel (interactive diagram available on TechLogStack)

LLM Judge Architecture: From Random Agreement to Near-Human (interactive diagram available on TechLogStack)

The Merchant Simulator as Pre-Deployment Safety Net

The merchant simulator sits between benchmark evaluation and production mirroring — a synthetic production environment. It replays real merchant intents (extracted from production conversations) against candidate systems in a controlled environment, before any real merchant sees the new system. This catches the specific failure mode benchmarks miss: correct behaviour on engineer-anticipated test cases, incorrect behaviour on the realistic distribution of merchant intent. The simulator doesn't replace production mirroring — it prevents the worst regressions from reaching the production mirroring stage at all.

Golden datasets: why they are non-negotiable
ZenML's analysis is unambiguous: every successful production LLM deployment they analysed maintains human-in-the-loop golden datasets for critical domains. LLM judges are used for velocity — scoring production traffic at scale. But they drift. A judge calibrated against last month's quality standards may give wrong verdicts on today's outputs. Golden datasets — small, carefully curated, human-labelled examples representing ground truth — anchor judge calibration and detect judge drift. Without a golden dataset, you have no way to know when your quality measurement system itself has stopped working.
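A drift check against a golden dataset can be as simple as re-measuring agreement on a schedule and alerting when it falls below the calibration threshold. The sketch below assumes a hypothetical record layout (`output`, `human_label`) and uses two toy judges; the 0.60 threshold is the calibration target quoted in this case.

```python
from collections import Counter
from typing import Callable

def cohen_kappa(a: list[str], b: list[str]) -> float:
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)

def check_judge(golden: list[dict], judge: Callable[[dict], str],
                min_kappa: float = 0.60) -> tuple[float, bool]:
    """Return (kappa on the golden set, True if the judge has drifted below threshold)."""
    human = [item["human_label"] for item in golden]
    verdicts = [judge(item["output"]) for item in golden]
    kappa = cohen_kappa(human, verdicts)
    return kappa, kappa < min_kappa

# Hypothetical golden set: a workflow is 'good' only if it has a trigger.
golden = [
    {"output": {"has_trigger": i % 2 == 0}, "human_label": "good" if i % 2 == 0 else "bad"}
    for i in range(8)
]

def healthy_judge(output: dict) -> str:
    return "good" if output["has_trigger"] else "bad"

def drifted_judge(output: dict) -> str:
    return "good"  # has stopped discriminating: approves everything

healthy_result = check_judge(golden, healthy_judge)
drifted_result = check_judge(golden, drifted_judge)
print(healthy_result, drifted_result)
```

The drifted judge still agrees with humans on half the items, yet its Kappa is zero, which is why the check uses chance-corrected agreement and not raw accuracy.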

Lessons

You will spend more time building evaluation infrastructure than the application logic itself. This is not inefficiency — it is the correct allocation of engineering effort for probabilistic systems. Accept it before starting. Budget for it explicitly. ZenML's summary from 1,200 deployments: "Perhaps this is a truism by now, but you'll spend more time building evaluation infrastructure than you will on the actual application logic. And if you're not, you're probably shipping broken features."
LLM-as-judge (using a language model to evaluate the outputs of another language model, calibrated against human labels to produce quality scores at scale) is the scalable evaluation pattern. But an uncalibrated judge (Kappa 0.02) is worse than useless — it gives false confidence. Calibrate your judge against human labels before trusting its verdicts. Target Kappa ≥ 0.6. The human evaluator baseline (0.69 for Shopify) is the meaningful ceiling — don't optimise past it.
A benchmark that passes is a necessary condition, not a sufficient one. Benchmarks reflect what engineers anticipated; production reflects what users actually do. Always follow benchmark success with production mirroring — routing real traffic through both current and candidate systems and comparing outputs. Shopify needed about two weeks of production mirroring to close the gap, so budget a comparable window for this final validation step.
Synthetic data generation (using an LLM to create training examples from real production outputs — generating natural-language merchant requests from real production workflows) is the path to scalable fine-tuning training data. Manual annotation doesn't scale. Synthetic data derived from production outputs does — and it's grounded in real-world distribution rather than engineer-imagined distribution.
Retraining cycle speed determines how fast you can respond to production drift. Merchant behaviour changes, new workflow patterns emerge, new merchant categories join Shopify — a model trained on last quarter's data will drift from current reality. Weekly retraining on production signal, made economically viable by efficient infrastructure (intelligent caching, H200 GPUs, 12h runs), keeps the model aligned with the world it serves.

Engineering Glossary

Cohen's Kappa — a statistical measure of agreement between two raters that corrects for chance agreement. Kappa of 0 means agreement no better than random; 1.0 means perfect agreement; values above roughly 0.6 are conventionally read as substantial agreement, which is the bar this case uses for a trustworthy judge. The Shopify LLM judge improved from 0.02 to 0.61; the human evaluator baseline was 0.69.

Fine-tuning — the process of further training a pre-trained LLM on a domain-specific dataset to improve performance on a specific task. Used by Shopify to specialise a base model (Qwen3-32B) for Shopify Flow workflow generation, with weekly retraining cycles to keep pace with evolving merchant behaviour.

Golden dataset — a small, carefully curated set of human-labelled evaluation examples representing ground truth for a specific domain. Used to calibrate LLM judges and detect judge drift over time. The anchor of any reliable LLM evaluation system.

LLM-as-judge — the pattern of using a language model to evaluate the outputs of another language model, calibrated against human labels to produce quality scores at scale without requiring manual human evaluation of every production interaction.

Production mirroring — routing a percentage of real production traffic through both the current deployed model and a candidate model simultaneously, comparing outputs to measure whether the candidate performs better for real users. The ground truth test that benchmark evaluation cannot replicate.

Synthetic data generation — using an LLM to create training examples from a production data source — for example, generating plausible natural-language merchant requests from real validated production workflows. Enables scalable training data creation grounded in real-world distribution.

Tool calling — a pattern where an LLM is given a set of available functions (tools) with descriptions and can request that a specific tool be executed by generating a structured function call. Enables LLMs to take real-world actions beyond text generation — used by Shopify's Flow agent to generate and execute workflow operations.

This case is a plain-English retelling of publicly available engineering material.

TechLogStack — built at scale, broken in public, rebuilt by engineers.

Quantum Computing Just Beat the Best Classical Computer — Here Is the Engineering That Made It Happen

TechLogStack — Tue, 19 May 2026 00:00:00 +0000

3,000× speedup — quantum completed in 2 minutes what classical needed 100+ hours for
60% gate count reduction by Q-CTRL's Fire Opal compiler vs native Qiskit — the engineering that made it possible
12,635 atoms — largest biologically meaningful molecule ever simulated on quantum hardware (May 5 2026)
40× larger protein simulation than six months prior — driven by the EWF-TrimSQD algorithm
120 qubits, 10,000+ two-qubit gates — circuit depth previously considered infeasible on NISQ hardware
IBM Starling roadmap: 200 logical qubits under error correction by 2029

On May 6, 2026, Q-CTRL ran a materials science simulation on an IBM quantum computer in 2 minutes. The best classical supercomputer needed over 100 hours to reach the same accuracy — and then gave up. The day before, IBM, Cleveland Clinic, and RIKEN simulated a 12,635-atom protein, 40 times larger than anything attempted six months prior. After 30 years of promises, practical quantum advantage arrived. What actually changed was a compiler.

The Story

For years, quantum computing has been a promise. Now, quantum computers are producing results that matter to science. The systems we simulated here are the kind of molecules that biologists and chemists work with in the real world.

— Jay Gambetta, Director of IBM Research, IBM Think 2026, Boston

On May 19 2026, Google Trends showed a BREAKOUT signal — the highest possible designation — on the query "what is quantum computing in simple terms." The trigger was two announcements that landed within 48 hours of each other. On May 5, scientists at Cleveland Clinic, RIKEN, and IBM used quantum computers to simulate trypsin, a protein with 12,635 atoms — the largest biologically meaningful molecule ever simulated on quantum hardware, 40 times larger than what the same method could achieve just six months prior. On May 6, Q-CTRL demonstrated a 3,000× speedup on a problem of real commercial relevance. The physics community called it practical quantum advantage — the first time a quantum computer had demonstrably outperformed the best classical tool on a problem that matters outside a laboratory.

Understanding why these results matter requires understanding what stood in the way. NISQ (Noisy Intermediate-Scale Quantum — the current era of quantum computing, characterised by processors with 50–1,000 qubits that are not error-corrected, meaning errors accumulate as circuit depth grows and place hard limits on what computations can run reliably) quantum computers accumulate errors with every two-qubit gate (the fundamental entangling operation in quantum computing — essential for quantum algorithms but a primary source of error in NISQ hardware, with typical error rates of 0.1–1% per gate). At shallow circuit depths with a handful of gates, error mitigation can recover useful results. At 10,000+ gates across 120 qubits — the depth required for commercially meaningful simulations — errors historically compounded until the output was indistinguishable from noise. This was the wall. The May 2026 results are not the wall coming down. They are the first evidence that engineers have found a way to work precisely enough within its constraints that real problems now fall on the quantum side of it.

What Q-CTRL Actually Did

Q-CTRL used an IBM 156-qubit Heron processor on the IBM Quantum Platform, enhanced by their own Fire Opal performance-management software. The target: the Fermi-Hubbard model (a foundational physics model describing how electrons interact in a crystal lattice — capturing phenomena like high-temperature superconductivity) — a system of 60 interacting electrons using 120 qubits and executing over 10,000 two-qubit gate operations. The classical competitor was ITensor's TDVP solver on a 32-vCPU, 64GB-RAM AWS instance — the best-in-class classical tool for this problem class. Quantum: ~2 minutes. Classical: over 100 hours before the two results diverged irreconcilably.
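The headline multiplier follows directly from the two wall-clock figures above. A quick check, using the rounded values quoted in the text (the variable names are illustrative):

```python
classical_hours = 100  # lower bound reported for the classical solver
quantum_minutes = 2    # approximate quantum wall-clock time

# Convert both to minutes before dividing: 100 h = 6,000 min.
speedup = (classical_hours * 60) / quantum_minutes
print(f"speedup >= {speedup:,.0f}x")
```

Because the classical figure is "over 100 hours", 3,000× is best read as a lower bound on the wall-clock ratio.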

Problem

NISQ Wall: Errors Compound Before Computation Completes

NISQ quantum processors accumulate errors with every two-qubit gate. For shallow circuits (hundreds of gates), error mitigation can recover useful results. For commercially meaningful simulations (10,000+ gates), errors historically compounded until the quantum output was indistinguishable from random noise. This wall had blocked practical quantum advantage for three decades.

Cause

Gate Count Was the Critical Variable

Every additional two-qubit gate multiplies the circuit's overall fidelity by a factor slightly below 1, so the chance of an error-free run shrinks roughly exponentially with gate count. IBM's native Qiskit compiler produced correct but gate-heavy implementations. Q-CTRL's Fire Opal compiler took the same algorithm and reduced gate count by 60% through circuit optimisation and error suppression. That 60% reduction was the difference between circuits that collapsed into noise and circuits that produced valid results.
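A rough model shows why gate count dominates. The sketch below assumes independent, identical gate errors at 0.1% per two-qubit gate (the low end of the range quoted earlier) and uses round illustrative gate counts; it is a back-of-the-envelope estimate, not a simulation of the actual experiment, and real runs with error mitigation do not need a fully error-free execution.

```python
def error_free_probability(gate_count: int, error_per_gate: float) -> float:
    """Probability that no gate fails, assuming independent, identical gate errors."""
    return (1.0 - error_per_gate) ** gate_count

error_per_gate = 0.001           # assumed: 0.1% per two-qubit gate
native_gates = 15_000            # illustrative gate-heavy compilation
optimised_gates = native_gates * 40 // 100  # 60% fewer gates

f_native = error_free_probability(native_gates, error_per_gate)
f_optimised = error_free_probability(optimised_gates, error_per_gate)
gain = f_optimised / f_native

print(f"native:    {f_native:.2e}")
print(f"optimised: {f_optimised:.2e}")
print(f"gain:      {gain:,.0f}x")
```

Under these assumptions, removing 60% of the gates improves the chance of an error-free run by several thousand times, because the saving sits in the exponent.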

Solution

Two Simultaneous Breakthroughs: Materials and Biology

May 5: IBM, Cleveland Clinic, and RIKEN simulated a 12,635-atom protein using quantum-centric supercomputing — fragmenting the molecule, computing quantum-mechanical behaviour on IBM Heron processors, and assembling results on Fugaku and Miyabi-G supercomputers. May 6: Q-CTRL demonstrated 3,000× speedup on the Fermi-Hubbard model, completing in 2 minutes what took classical computers 100+ hours.

Result

Practical Quantum Advantage: The Field's First

On May 6 2026, Q-CTRL declared practical quantum advantage — the first time a quantum computer had outperformed the best available classical tool on a problem of known commercial relevance, using hardware accessible to any developer via the IBM Quantum Platform. IBM CEO Arvind Krishna had predicted quantum advantage would arrive in 2026. The prediction was correct.

The Fix

The Engineering Stack That Made It Possible

The Q-CTRL result did not emerge from better quantum hardware alone. It emerged from a full engineering stack combining IBM's hardware, Q-CTRL's compiler, and years of quantum control research. Three layers mattered: the hardware layer (IBM Heron's 156-qubit chip with improved coherence times and gate fidelity), the compilation layer (Q-CTRL's Fire Opal reducing gate count by 60%), and the error suppression layer (runtime techniques that actively suppress errors during execution). None of these layers alone would have been sufficient — the result is an emergent property of all three operating together.

3,000× — wall-clock speedup of quantum over classical: 2 minutes vs 100+ hours on the best available classical hardware and software
60% — gate count reduction by Fire Opal vs native Qiskit — the single optimisation that made the circuit depth feasible
12,635 — atoms in the trypsin protein simulated by Cleveland Clinic + RIKEN + IBM
40× — increase in simulation system size achieved in six months, driven by the EWF-TrimSQD algorithm

# Conceptual: What Fire Opal does differently from native Qiskit compilation
# The 60% gate reduction is the engineering story in code form
# Illustrative only: build_fermi_hubbard_circuit is a hypothetical helper, the
# backend name is a placeholder, and the gate counts in comments are round numbers.

# NATIVE QISKIT: correct but gate-heavy
from qiskit import transpile
from qiskit_ibm_runtime import QiskitRuntimeService

# Placeholder name: substitute a 156-qubit Heron backend available to your account
backend = QiskitRuntimeService().backend('ibm_heron_r2')

# Fermi-Hubbard Trotter circuit at 90 steps:
# Naive implementation produces ~15,000+ two-qubit gates
circuit = build_fermi_hubbard_circuit(n_qubits=120, n_trotter_steps=90)
native = transpile(circuit, backend=backend)
print(native.num_nonlocal_gates())  # multi-qubit gate count after transpilation
# ~15,000+ two-qubit gates → error rate exceeds threshold → output is noise

# Q-CTRL FIRE OPAL: noise-aware compilation
# The call below is a conceptual stand-in, not the Fire Opal client's actual
# function name or parameters; see Q-CTRL's documentation for the real interface.
import fire_opal

# Fire Opal applies four optimisations simultaneously:
# 1. Circuit rewriting — finds equivalent circuits with fewer gates
# 2. Noise-aware qubit mapping — minimises cross-talk between physical qubits
# 3. Dynamical decoupling — inserts refocusing pulses to cancel drift errors
# 4. Gate fusion — combines adjacent compatible gates into single operations

result = fire_opal.run(
    circuits=[circuit],
    backend=backend,
    optimization_level='aggressive',
    error_suppression=['dynamical_decoupling', 'gate_twirling']
)
# ~6,000 two-qubit gates — 60% reduction
# Circuit runs within error tolerance → produces results accurate enough
# to match and then exceed the classical TDVP benchmark
# Wall time: ~2 minutes
# Classical TDVP equivalent: 100+ hours before diverging irreconcilably

Error Suppression vs Error Correction: The Critical Distinction

The May 2026 results were achieved with error suppression, not error correction. Error correction (the goal for 2029) uses logical qubits — groups of physical qubits encoding information redundantly, detecting and fixing errors in real-time. With the surface code it typically requires hundreds of physical qubits or more per logical qubit; qLDPC codes aim to reduce that overhead. Error suppression (what Q-CTRL and IBM use now) cannot fix errors — it minimises them through circuit optimisation, noise-aware compilation, and runtime control. Error suppression works within NISQ limits. Error correction is designed to lift those limits by keeping logical error rates low as circuits grow. The 3,000× result was achieved within the NISQ limits. What becomes possible once error correction arrives is qualitatively different.
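The redundancy idea behind error correction is easiest to see in its classical form, the repetition code with majority-vote decoding. The sketch below is an analogy only: quantum codes cannot copy states and must also handle phase errors, so they measure error syndromes and never read the data directly. The function name and parameters are illustrative.

```python
from itertools import product

def logical_error_rate(p: float, copies: int = 3) -> float:
    """Exact failure probability of majority-vote decoding over `copies` noisy bits."""
    total = 0.0
    for flips in product([0, 1], repeat=copies):
        n_flipped = sum(flips)
        if n_flipped > copies // 2:  # majority corrupted: the decoder picks the wrong value
            total += (p ** n_flipped) * ((1 - p) ** (copies - n_flipped))
    return total

p = 0.01  # assumed physical error rate per bit
for copies in (1, 3, 5):
    print(f"{copies} copies: logical error rate = {logical_error_rate(p, copies):.2e}")
```

For three copies the result equals 3p² − 2p³: as long as the physical error rate is low enough, adding redundancy drives the logical error rate down, at the cost of more physical resources per logical unit. Suppression, by contrast, only lowers p itself.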

The Cleveland Clinic protein simulation: EWF-TrimSQD explained
The May 5 simulation used a quantum-centric supercomputing (QCSC) approach — pairing IBM Heron quantum processors at Cleveland Clinic (USA) and RIKEN (Japan) with two classical supercomputers: Fugaku at RIKEN and Miyabi-G at the University of Tokyo. The key algorithm was EWF-TrimSQD (Embedding Workflow with Tailored Reduced-qubit Molecular Dynamics) — a quantum-classical hybrid that fragmented the 12,635-atom trypsin protein into computable pieces, computed quantum-mechanical behaviour on QPUs (up to 94 qubits, ~6,000 quantum operations per fragment), and reconstructed the full protein's behaviour on classical supercomputers. The 40× system size increase in six months came from algorithmic improvement in how fragments were computed and assembled.

IBM Quantum roadmap: Loon to Starling
IBM Quantum Loon (November 2025) was the first processor to demonstrate all hardware components required for fault-tolerant quantum computing: c-couplers for long-range qubit connectivity, qubit reset between computations, and high-fidelity gates at FTQC-relevant speeds. IBM also achieved real-time qLDPC (quantum Low-Density Parity-Check codes — IBM's chosen error-correcting code, requiring fewer physical qubits per logical qubit than the surface code) decoding in under 480 nanoseconds — a full year ahead of schedule. IBM Starling targets 200 logical qubits under full error correction by 2029.

Architecture

Both May 2026 achievements reflect the same architectural pattern: quantum processors are specialised accelerators for specific types of computation, tightly integrated with classical CPUs and GPUs that handle the parts of the problem where quantum offers no advantage. IBM calls this Quantum-Centric Supercomputing (QCSC) — a heterogeneous computing architecture where tasks are assigned to the compute layer where they run best. Quantum does not replace classical computing. It extends it.

The NISQ Error Accumulation Problem: Why Circuit Depth Is the Wall (interactive diagram available on TechLogStack)

Quantum-Centric Supercomputing (QCSC): The Cleveland Clinic Architecture (interactive diagram available on TechLogStack)

IBM Quantum Roadmap: From NISQ to Fault Tolerance (2025–2029) (interactive diagram available on TechLogStack)

What These Results Are Not

The Fermi-Hubbard result is not proof that quantum computers beat classical computers at everything. The advantage holds for this specific class of fermionic simulation problems, which scale poorly for classical computers by a known theoretical argument. Breaking RSA-2048 with Shor's algorithm requires hundreds of thousands to millions of physical qubits under error correction — orders of magnitude harder. The May 2026 results are the first concrete proof that quantum advantage is achievable on useful, commercially relevant problems with today's hardware, properly engineered.

Lessons

Quantum advantage arrived not from better qubits alone, but from better compilers. Q-CTRL's Fire Opal reduced gate count by 60% on the same IBM hardware that was already available. The 3,000× speedup was enabled by 60% fewer gates — and 60% fewer gates was enabled by years of investment in quantum control theory and noise-aware compilation. Hardware and software co-optimisation, not hardware alone, crossed the threshold.
Quantum-centric supercomputing (a heterogeneous architecture pairing quantum processors with classical CPUs and GPUs, assigning each part of a problem to the resource where it runs best) is how quantum advantage works in practice. Quantum computers do not replace classical computers — they accelerate the specific parts where quantum mechanics provides exponential advantage. Drug discovery, materials simulation, and optimisation are the first domains where this integration delivers measurable commercial results.
Error suppression and circuit optimisation are the engineering disciplines that matter most in the NISQ era. Error correction remains the long-term goal (IBM Starling, 2029), but error suppression — reducing gate count, noise-aware mapping, dynamical decoupling — is the bridge that makes today's hardware useful for real problems. Engineers building on quantum hardware should invest as much in compilation optimisation as in circuit design.
The rate of improvement is accelerating. 40× larger molecule simulation in six months. A year-ahead-of-schedule qLDPC decoder. A Trotter (a simulation technique approximating quantum time evolution by breaking it into small sequential steps) depth of 90 steps on 120 qubits, previously considered infeasible on NISQ hardware and impossible two years ago. Organisations that start developing quantum-advantage applications now will be ahead of those waiting for the technology to "mature."
Practical quantum advantage arrived on public cloud infrastructure. Q-CTRL's 3,000× speedup was achieved on IBM Quantum Platform hardware accessible via API to any registered developer — not on a private research machine. The cloud-first approach IBM took in 2016 is what made May 2026's results broadly verifiable and immediately applicable.
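Dynamical decoupling, one of the suppression techniques named in the lessons above, can be illustrated with the simplest refocusing sequence: a single echo pulse at the midpoint of an idle period. The toy model below assumes perfectly static shot-to-shot frequency offsets and an ideal instantaneous pulse. It is a textbook-style sketch in arbitrary units and makes no claim about how Fire Opal implements the technique.

```python
import cmath
import random

def coherence(detunings: list[float], total_time: float, echo: bool) -> float:
    """|average of exp(i*phase)| over the ensemble; 1.0 means no dephasing."""
    total = 0j
    for delta in detunings:
        first_half = delta * total_time / 2
        # An ideal pi pulse at the midpoint inverts the sign of the phase
        # accumulated afterwards, so a static offset cancels itself.
        second_half = -first_half if echo else first_half
        total += cmath.exp(1j * (first_half + second_half))
    return abs(total / len(detunings))

rng = random.Random(7)
detunings = [rng.gauss(0.0, 1.0) for _ in range(5_000)]  # static noise, arbitrary units

free = coherence(detunings, total_time=3.0, echo=False)
echoed = coherence(detunings, total_time=3.0, echo=True)
print(f"no echo:   {free:.3f}")
print(f"with echo: {echoed:.3f}")
```

Without the pulse, the random offsets scramble the phase and coherence collapses; with it, each run's offset is undone exactly. Real noise is not perfectly static, which is why practical sequences repeat the pulses so that only drift slower than the pulse spacing needs to cancel.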

Engineering Glossary

Dynamical decoupling — a quantum error suppression technique that inserts short refocusing pulses during circuit execution to cancel low-frequency noise and drift errors. One of the core techniques used by Q-CTRL's Fire Opal to reduce effective error rates without requiring full error correction.

Fermi-Hubbard model — a foundational model in condensed matter physics describing interacting electrons on a lattice. Used to understand high-temperature superconductivity, Mott insulators, and quantum magnetism. Classical simulation cost grows exponentially with system size — a 60-electron system has 2^60 possible states. The Q-CTRL result is the first real-world confirmation that quantum computers provide exponential advantage on this class of problem.

Fire Opal — Q-CTRL's performance-management software for quantum computers. Applies circuit rewriting, noise-aware qubit mapping, dynamical decoupling, and gate fusion to reduce two-qubit gate count and improve circuit fidelity. Achieved 60% gate reduction vs native Qiskit on the Fermi-Hubbard circuit.

Logical qubit — a fault-tolerant qubit encoded across multiple physical qubits, with error detection and correction running continuously. The target unit for IBM Starling (200 logical qubits by 2029). Contrasted with physical qubits, which are noisy and uncorrected.

NISQ (Noisy Intermediate-Scale Quantum) — the current era of quantum computing, characterised by processors with 50–1,000 physical qubits that are not error-corrected. Errors accumulate as circuit depth grows, placing hard limits on computation length. The May 2026 results were achieved within NISQ constraints, not beyond them.

qLDPC (quantum Low-Density Parity-Check codes) — IBM's chosen quantum error-correcting code, requiring fewer physical qubits per logical qubit than the surface code used by most competitors. IBM achieved real-time qLDPC decoding in under 480 nanoseconds in November 2025, a year ahead of schedule.

Quantum-centric supercomputing (QCSC) — IBM's heterogeneous computing architecture pairing quantum processors with classical CPUs and GPUs, assigning each part of a computation to the resource where it runs best. The architectural model used in the Cleveland Clinic protein simulation.

Two-qubit gate — the fundamental entangling operation in quantum computing that creates correlations between qubits. Essential for quantum algorithms but a primary source of error in NISQ hardware. Reducing two-qubit gate count is the primary lever for improving circuit fidelity on today's hardware.

This case is a plain-English retelling of publicly available engineering material.

TechLogStack — built at scale, broken in public, rebuilt by engineers.

LinkedIn Needed a Message Queue. They Built the One the Entire Internet Runs On.

TechLogStack — Tue, 19 May 2026 00:00:00 +0000

1 billion events/day at LinkedIn launch in 2011 — immediate production scale from day one
7 trillion messages/day by 2019 — same core architecture, 7,000× growth
~50 MB/sec Kafka producer throughput vs ~2 MB/sec ActiveMQ — in the original 2011 benchmark
9 bytes per-message overhead vs 144 bytes in ActiveMQ — 16× storage efficiency
Stateless brokers — consumers track their own offset; broker memory doesn't scale with consumers
80%+ of Fortune 100 run Kafka today; Confluent, valued at $4.5B in its 2020 funding round, went public in 2021

In 2010, LinkedIn was drowning in data it couldn't move. Every ML model, every recommendation engine, every real-time feature was starving because there was no reliable way to get activity data from the website into the systems that needed it. Jay Kreps, Jun Rao, and Neha Narkhede spent a year building a fix. They named it after Franz Kafka. The rest of the internet adopted it.

The Story

By 2010, LinkedIn had dozens of data source systems and dozens of consumer systems (ML models, analytics pipelines, search indexers, real-time features), all needing the same activity stream data. The existing approach was point-to-point pipelines: each one custom-built, each one brittle, none sharing infrastructure. With N consumers and M sources, adding one new data source meant writing up to N new pipelines, and adding one new consumer meant touching up to M existing sources. Jay Kreps, leading data infrastructure engineering, described the root cause directly: "Everyone wanted to build fancy machine-learning algorithms, but without the data, the algorithms were useless. Getting the data from source systems and reliably moving it around was very difficult."

Kreps, alongside Jun Rao (from IBM's database group) and Neha Narkhede (from Oracle), evaluated every existing solution. ActiveMQ (an open-source message broker implementing JMS, designed for reliable ordered message delivery between enterprise applications) and RabbitMQ (a message broker built around AMQP, designed for flexible routing and delivery guarantees) were built for a different problem — reliable delivery of individual task messages, not high-throughput streaming of millions of activity events. Their per-message broker state tracking consumed memory proportional to outstanding messages. They couldn't support the scenario where a Hadoop job needed to replay yesterday's activity data. Most critically: ActiveMQ's message format carried 144 bytes of overhead per message. LinkedIn needed millions of messages per second.

The Founding Insight: Treat Data Movement Like a Log

The breakthrough was recognising that LinkedIn's data movement problem was not a messaging problem — it was a log problem. Databases have used append-only logs for decades: the write-ahead log (a sequential record of all changes, written before the changes are applied — used for crash recovery, replication, and point-in-time restoration) is how MySQL and Postgres achieve durability. Jay Kreps asked: what if the data pipeline itself was an append-only log? Producers append events. Consumers read at their own pace. The log retains messages for a configured period. Any consumer can replay from any point. The broker tracks no state. That simplicity unlocked everything.
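The idea can be sketched in a few lines of Python. This is a toy, in-memory, single-partition illustration and not Kafka's implementation; the class names AppendOnlyLog and Consumer are hypothetical. It shows the two properties the paragraph above describes: the log never records who has read what, and replay is nothing more than a consumer moving its own pointer.

```python
from dataclasses import dataclass, field


@dataclass
class AppendOnlyLog:
    """Toy single-partition log: records are appended, never modified."""
    records: list = field(default_factory=list)

    def append(self, record) -> int:
        """Append a record and return its offset (its index in the log)."""
        self.records.append(record)
        return len(self.records) - 1

    def read(self, offset: int, max_records: int = 100) -> list:
        """Return up to max_records starting at offset; no reader state is kept."""
        return self.records[offset:offset + max_records]


@dataclass
class Consumer:
    """Each consumer owns its position; the log never learns who read what."""
    log: AppendOnlyLog
    offset: int = 0

    def poll(self, max_records: int = 100) -> list:
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch

    def seek(self, offset: int) -> None:
        """Replay is just moving the consumer's own pointer."""
        self.offset = offset


log = AppendOnlyLog()
for event in ["page_view", "search", "profile_click"]:
    log.append(event)

realtime = Consumer(log)   # reads events as they arrive
batch_job = Consumer(log)  # e.g. a nightly job that starts later

first_read = realtime.poll()
log.append("connection_request")
second_read = realtime.poll()   # only the new event

batch_job.seek(0)               # replay from the beginning
replayed = batch_job.poll()     # sees every event, unaffected by `realtime`
```

Because the two consumers share nothing but the log itself, adding a third consumer costs the log no extra memory.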

Problem

LinkedIn's Data Was Locked in Silos

By 2010, LinkedIn had an N×M integration problem — every data source needed a custom pipeline to every data destination. Existing messaging systems (ActiveMQ, RabbitMQ) were designed for task queues, not event streams, and couldn't handle LinkedIn's throughput requirements or support replay.

Cause

No Tool Existed for High-Throughput Real-Time Event Streaming

Batch systems (Hadoop) could handle large volumes but only hours later. Traditional message queues could deliver in real-time but couldn't scale to LinkedIn's volume or support replay. No system simultaneously provided high throughput, low latency, durability, replayability, and horizontal scalability. The three engineers concluded that the tool they needed did not exist.

Solution

One Year Building Kafka: The Append-Only Distributed Log

Kreps, Rao, and Narkhede spent approximately one year building the first version of Kafka. The core architectural decision was treating the message store as an append-only log rather than a queue. This single choice enabled sequential disk I/O (orders of magnitude faster than random I/O), stateless brokers (consumers track their own position), arbitrary replay (consumers read from any offset), and horizontal partitioning (each partition is an independent log that scales independently).

Result

1 Billion Events Per Day at Launch, 7 Trillion by 2019

Kafka went into production at LinkedIn in 2011 and immediately processed over 1 billion events per day. LinkedIn open-sourced it in early 2011. It became an Apache Top-Level Project in October 2012. By 2015: 1 trillion messages per day. By 2019: 7 trillion. Kreps, Narkhede, and Rao left LinkedIn in November 2014 to found Confluent, building the commercial ecosystem around Kafka.

I thought that since Kafka was a system optimized for writing, using a writer's name would make sense. I had taken a lot of lit classes in college and liked Franz Kafka. Plus the name sounded cool for an open source project.

— Jay Kreps, on naming Kafka, via Quora

The Fix

Five Design Decisions That Made Kafka Fast

Kafka's performance advantage was not clever optimisation of a standard architecture — it was a fundamentally different architecture where every key decision reinforced the same goal: maximise throughput for streaming event data. Five decisions stand out as architecturally defining, and each was a deliberate rejection of how existing messaging systems had been built.

~50 MB/s — Kafka producer throughput in the original 2011 benchmark vs ~2 MB/s for ActiveMQ at 200-byte messages
9 bytes — per-message overhead in Kafka vs 144 bytes in ActiveMQ — 16× storage efficiency
Stateless — Kafka brokers; consumer offset tracking is done by the consumer, not the broker
Sequential — disk access pattern for both writes and reads; append-only means no random I/O

// The five key Kafka design decisions illustrated in code

// DECISION 1: Append-only log storage (not a queue)
// Each partition is a directory of sequential segment files
// /kafka-logs/my-topic-0/00000000000000000000.log
// /kafka-logs/my-topic-0/00000000000000100000.log
// → Sequential writes: disk seeks are expensive; sequential I/O is ~100x faster

// DECISION 2: Consumer tracks its own offset — broker holds no state
long consumerOffset = consumer.position(topicPartition); // consumer owns this
// → Brokers are stateless: no per-consumer memory, no ack tracking overhead
consumer.seek(topicPartition, 0); // replay from the beginning — any time

// DECISION 3: Topics partitioned for horizontal scale
ProducerRecord<String, String> record = new ProducerRecord<>(
    "user-activity",
    userId,    // partition key: same user → same partition = ordered per user
    eventJson  // the message payload
);
// → N partitions = N consumers in parallel = linear throughput scaling

// DECISION 4: Batch I/O from client to broker
props.put("batch.size", 16384); // batch up to 16KB before sending
props.put("linger.ms", 5);      // or wait 5ms for the batch to fill
// Original paper: batch size 50 improved throughput ~10x vs batch size 1

// DECISION 5: Zero-copy transfer via OS sendfile()
// Consumer fetch path: disk page cache → network socket (no userspace copy)
// → No data enters JVM heap → no GC pressure → consistent low latency
// → Delivers data at near-network-hardware-limit throughput

The Stateless Broker: The Counterintuitive Masterstroke

In ActiveMQ and RabbitMQ, the broker maintains delivery state for every message: who acknowledged it, who hasn't, what needs to be retried. At scale, this per-message state tracking consumes enormous memory and creates a bottleneck. Kafka's solution was radical: let consumers track their own position (their offset in each partition). The broker stores bytes in a log. Consumers read at their own pace and can reset to any offset to replay. The broker's memory footprint is constant regardless of consumer count or message backlog — making horizontal scaling of consumers a configuration change, not an infrastructure problem.

Kafka vs traditional message queues — original 2011 benchmarks and design properties:

Property	ActiveMQ / RabbitMQ	Kafka
Storage model	Queue — messages deleted after ack	Append-only log — retained by time/size
Broker state	Tracks ack state per message per consumer	Stateless — consumers track own offset
Producer throughput	~2 MB/sec (ActiveMQ)	~50 MB/sec (batch size 50)
Message overhead	144 bytes (ActiveMQ JMS header)	9 bytes
Consumer replay	Not supported	Supported — seek to any offset
Horizontal scale	Limited (complex cluster configs)	Native — add partitions, add consumers
Use case fit	Task queues, guaranteed delivery, routing	Event streaming, log aggregation, activity tracking

Zero-copy: the OS kernel trick that doubled throughput
One of Kafka's most impactful performance optimisations is invisible to application code. In a traditional data transfer, data moves: disk → kernel buffer → userspace → socket buffer → network. In Kafka's consumer path, the OS sendfile() syscall transfers data directly from the page cache to the network socket, bypassing userspace entirely. No data is copied into the JVM heap — no GC pressure, no object allocation overhead. At LinkedIn's throughput rates, this optimisation alone accounts for significant throughput gains and, more importantly, consistent low latency even under high load.
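Kafka itself reaches sendfile() through Java's FileChannel.transferTo. The same OS facility is exposed in Python as socket.sendfile(), which makes for a compact illustration. The sketch below is an assumption-laden demo, not Kafka code: a temporary file stands in for a log segment and a local socket pair stands in for the broker-to-consumer connection. On platforms without os.sendfile, Python falls back to ordinary send() calls, so the zero-copy benefit is platform-dependent.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path: str, sock: socket.socket) -> int:
    """Send a whole file over a connected stream socket.

    socket.sendfile() uses os.sendfile() where available, so the bytes go
    from the page cache to the socket without passing through this process.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)


# A small temporary file stands in for a log segment (hypothetical data).
payload = b"activity-event\n" * 100
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    segment_path = tmp.name

# A connected socket pair stands in for the broker -> consumer connection.
broker_side, consumer_side = socket.socketpair()
try:
    sent = send_file_zero_copy(segment_path, broker_side)
    broker_side.close()  # signal end-of-stream to the reader
    received = b"".join(iter(lambda: consumer_side.recv(65536), b""))
finally:
    broker_side.close()
    consumer_side.close()
    os.unlink(segment_path)
```

The application code never touches the payload bytes on the sending side, which is the point: nothing is allocated on the heap per message.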

The log/table duality: Jay Kreps' deeper insight
In his 2013 essay "The Log," Kreps articulated a concept beyond Kafka's implementation: the log/table duality. Any database table can be derived by replaying a log of changes from the beginning, and any log can be materialised into a table by applying each event as a state update. In that sense, every table is a log in disguise. This duality means a Kafka topic is simultaneously a stream and a database: query it as a stream in motion (stream processing) or materialise it as a snapshot (a table). This insight became the foundation for Kafka Streams, ksqlDB, and the entire stream-processing ecosystem that followed.
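The duality is easy to demonstrate with plain dictionaries. The following is a minimal sketch, not how Kafka Streams or ksqlDB implement it; the function names and the convention that a value of None means "delete" are assumptions made for the example.

```python
def materialise(changelog):
    """Fold a log of (key, value) changes into a table; None means delete."""
    table = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value
    return table


def table_to_changelog(old, new):
    """Diff two table snapshots into the changes that turn old into new."""
    changes = [(key, None) for key in old if key not in new]
    changes += [(key, value) for key, value in new.items() if old.get(key) != value]
    return changes


# Log -> table: replaying the changes yields the current state.
changelog = [
    ("alice", "Engineer"),
    ("bob", "Designer"),
    ("alice", "Staff Engineer"),  # update
    ("bob", None),                # delete
]
table = materialise(changelog)

# Table -> log: the difference between two snapshots is itself a log.
earlier = materialise(changelog[:2])
delta = table_to_changelog(earlier, table)
rebuilt = materialise(changelog[:2] + delta)
```

Replaying the first two events plus the derived delta rebuilds the same table as replaying the full log, which is the round trip the duality describes.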

Architecture

Kafka's architecture has three layers. The storage layer is a set of partitioned, replicated append-only log files on disk — each partition is an independent, totally ordered sequence of records. The broker layer is a cluster of server processes managing partition assignment, replication, and client connections — holding no consumer state. The client layer is producers writing to partitions and consumer groups reading from them, each group maintaining its own independent offset per partition.
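How the client layer maps onto the storage layer can be sketched as two small functions: one that picks a partition from a record key, and one that spreads partitions across the consumers in a group. This is an illustrative simplification. The crc32 hash and the round-robin assignment are assumptions chosen for brevity (Kafka's default partitioner hashes keys with murmur2, and real group assignment is negotiated through the brokers), and the function names are hypothetical.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """A stable hash of the key picks the partition (crc32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions: int, consumers: list) -> dict:
    """Round-robin partitions across a group; each partition gets one owner."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]  # each list is one ordered log

events = [
    ("user-1", "login"),
    ("user-2", "search"),
    ("user-1", "view"),
    ("user-1", "logout"),
]
for user_id, action in events:
    partitions[partition_for(user_id, NUM_PARTITIONS)].append((user_id, action))

# Same key -> same partition, so one user's events stay in order.
user1_partition = partition_for("user-1", NUM_PARTITIONS)
user1_actions = [a for u, a in partitions[user1_partition] if u == "user-1"]

group = assign_partitions(NUM_PARTITIONS, ["consumer-a", "consumer-b"])
```

Ordering is guaranteed only within a partition, which is why the choice of key matters: it decides which events must stay in sequence relative to each other.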

Before Kafka: N×M Integration Spaghetti

Interactive diagram available on TechLogStack.

After Kafka: The Centralised Log Hub

Interactive diagram available on TechLogStack.

Inside Kafka: Topics, Partitions, Offsets, and Consumer Groups

Interactive diagram available on TechLogStack.

LinkedIn's Kafka by 2019: The Scale Numbers

From 1 billion events per day at launch (2011) to 7 trillion messages per day by 2019 — a 7,000× growth in eight years on the same fundamental architecture. Spread across 100+ clusters, 4,000+ brokers, 100,000+ topics, and 7 million partitions. Each message consumed by approximately four consumer groups on average. The most remarkable fact: the append-only partitioned log described in the 2011 paper is still the architecture running at 7 trillion messages per day. Good architecture ages well.

Lessons

Before building, verify no existing tool solves your problem at your scale. The Kafka team evaluated ActiveMQ, RabbitMQ, and existing log aggregation systems before building. Their conclusion — existing tools were designed for the wrong problem — was evidence-based. The benchmark (50 MB/sec vs 2 MB/sec) made the decision concrete. Never rebuild what can be adopted; never adopt what demonstrably can't serve your workload.
The append-only log (a data structure where records are only ever added to the end, never modified in place — enabling sequential I/O, arbitrary consumer replay, and stateless brokers) is the universal data integration primitive. Any system moving data between producers and consumers is implementing a log, whether it knows it or not. Recognising and building explicitly on this pattern is what gave Kafka its performance advantage and its flexibility.
Stateless brokers make systems horizontally scalable in ways stateful brokers cannot match. When the broker tracks delivery state per consumer per message, broker memory scales with consumers × outstanding messages. When consumers track their own offsets, broker memory scales with partitions only. This single architectural choice is why Kafka can serve hundreds of consumer groups without broker degradation.
Sequential I/O is dramatically faster than random I/O on both HDDs and SSDs. An append-only log turns a bursty stream of writes into sequential disk operations, allowing Kafka to approach disk hardware throughput limits. Systems that update records in-place pay random I/O costs on every write. Kafka writes append-only and leverages the OS page cache for reads, achieving throughput that surprised the entire industry.
Open-sourcing infrastructure that solves a universal problem creates compounding returns. LinkedIn open-sourced Kafka in 2011 because the team recognised it solved a problem every data-intensive company had. Community contributions from Netflix, Uber, Twitter, and thousands of others built tooling LinkedIn could never have built alone: Kafka Streams, Kafka Connect, ksqlDB, MirrorMaker, Schema Registry. The return on open-sourcing infrastructure is measured in ecosystem, not just code.

Engineering Glossary

Append-only log — a data structure where records are only ever added to the end, never modified in place. Enables sequential disk I/O, arbitrary consumer replay, and stateless brokers. The core data structure underlying Kafka's architecture and the reason for its performance advantage over traditional message queues.

Consumer group — a set of Kafka consumers that collectively read from a topic, with each partition assigned to exactly one consumer in the group at a time. Enables parallel consumption: a topic with N partitions can be consumed by up to N consumers simultaneously.

Consumer offset — the position of a consumer within a partition, tracking which messages have been read. In Kafka, consumers (not brokers) own and commit their offsets — the key architectural decision that makes Kafka brokers stateless.

Log/table duality — the mathematical relationship where any database table can be derived by replaying a log of changes from the beginning, and any log can be materialised into a table by applying each event as a state update. The theoretical foundation for Kafka Streams and ksqlDB.

Partition — the unit of parallelism in Kafka. Each topic is divided into one or more partitions, each of which is an independent append-only log whose leader replica is hosted on a single broker, with copies replicated to other brokers. Producers write to partitions by key; consumers read from partitions independently.

Stateless broker — a broker that holds no per-consumer delivery state. Kafka brokers store bytes in partitioned logs; consumers own their own offset positions. Broker memory scales with partition count, not consumer count — the property that makes Kafka horizontally scalable to hundreds of consumer groups without broker degradation.

Write-ahead log (WAL) — a sequential record of all changes made to a database, written to disk before the changes are applied. Used for crash recovery and replication in MySQL, Postgres, and virtually every serious database. The inspiration for Kafka's append-only log architecture.

Zero-copy transfer — the use of the OS sendfile() syscall to transfer data directly from the kernel page cache to a network socket, bypassing userspace entirely. Used in Kafka's consumer fetch path to eliminate JVM heap copies, GC pressure, and the associated latency spikes at high throughput.

This case is a plain-English retelling of publicly available engineering material.

TechLogStack — built at scale, broken in public, rebuilt by engineers.

OpenAI Runs ChatGPT for 800 Million Users on One PostgreSQL Instance — and It Works

TechLogStack — Mon, 18 May 2026 00:00:00 +0000

800M users — 1 primary PostgreSQL instance on Azure, ~50 read replicas globally
Millions of QPS — p99 latency held at low double-digit milliseconds
50ms → 5ms connection setup time after PgBouncer deployment — a 10× improvement
10× database load growth in a single year following ChatGPT's viral expansion
5 seconds — maximum DDL lock wait timeout; schema changes that can't acquire a lock in time are automatically cancelled
1 SEV-0 in twelve months — triggered by ImageGen launch write surge, resolved by design

ChatGPT has 800 million users. It handles millions of database queries per second. And it runs on a single primary PostgreSQL instance on Azure — one writer, backed by about fifty read replicas. No sharding. No distributed SQL. Just Postgres, pushed further than almost anyone thought possible through obsessive optimisation and ruthless operational discipline.

The Story

The conventional wisdom about database scaling at 800 million users is straightforward: you shard. You move to a distributed SQL system. You do not run a single primary PostgreSQL instance. OpenAI's ChatGPT does not follow this conventional wisdom. It runs on one Azure PostgreSQL Flexible Server that handles all writes — backed by approximately 50 read replicas spread across multiple regions. The system handles millions of queries per second at low double-digit millisecond p99 latency and has maintained five-nines availability. In twelve months, they had one SEV-0.

The story is not that Postgres is magic. The story is that relentless optimisation of a boring, proven technology can outperform premature architectural complexity.

Why Single-Primary Works at This Scale

ChatGPT's workload is overwhelmingly read-heavy. When 800 million users open the app, browse their chat history, or load their settings, those are reads. Writes happen on message submission and account updates — a much smaller fraction of total traffic. This access pattern is exactly what a single-primary with many read replicas handles well: the write path stays narrow, the read load fans out horizontally across replicas. The architecture is not brilliant. It is appropriate for the workload. That fit is what makes it work.

OpenAI's account of this work, presented at PGConf.dev 2025, was unusually candid about both the decisions that worked and the ones that nearly broke the system. The database load grew by more than 10× in a single year following ChatGPT's viral growth. The team responded with aggressive optimisation at every layer: connection management, query design, caching, write path discipline, and schema change governance.

Problem

10× Database Load Growth in One Year

ChatGPT's viral growth (100 million users in two months at launch, 800 million by 2025) drove database load up more than 10× in a single year. Connection exhaustion became a recurring threat. A 12-table join generated by an ORM caused multiple high-severity incidents when traffic spiked. (An ORM, or Object-Relational Mapping layer, is a framework such as Django's ORM or SQLAlchemy that generates SQL from application code. It is convenient, but it can produce complex, inefficient queries that stay invisible until they cause production incidents.) Write pressure on the single primary was approaching dangerous levels during high-demand events.

Cause

Invisible Query Complexity and Write Pressure

ORMs generate SQL automatically, hiding complexity from developers. Under low load, even a 12-table join is fast enough to go unnoticed. Under 10× load, the same query saturates database CPU. Meanwhile, write-heavy workloads that could be migrated to sharded systems like Azure Cosmos DB remained on the single primary longer than optimal.

Solution

Multi-Layer Defence: Pool + Cache + Rate Limit + Migrate

OpenAI implemented PgBouncer connection pooling (cutting connect time 10×), a cache-locking mechanism to prevent thundering herd on cache misses, multi-layer rate limiting at application, proxy, and query levels, surgical elimination of the worst ORM-generated queries, strict schema change governance (5-second DDL timeout), and a policy of migrating all new write-heavy workloads to sharded systems by default.

Result

One SEV-0 in Twelve Months, Five-Nines Availability

One SEV-0 in twelve months — triggered by the viral launch of ChatGPT ImageGen, which caused a 10× write surge as over 100 million users signed up within a week. Postgres recovered by design. p99 latency held at low double-digit milliseconds. The single-primary architecture remained viable at a scale that surprised the entire database engineering community.

The Fix

The Seven-Layer Defence

OpenAI's Postgres scaling is not one clever trick — it is seven mutually reinforcing operational practices applied simultaneously. Any one in isolation would help marginally. Together they have produced an architecture that handles a scale its underlying technology was not originally designed for.

10× — database load growth in a single year; the growth rate that forced each defensive layer to be implemented under production pressure
5ms — average connection setup time after PgBouncer deployment; down from 50ms, a 10× improvement
5 sec — maximum DDL lock wait timeout; prevents table-lock incidents on billion-row tables
1 SEV-0 — high-severity incidents in twelve months after the full defensive architecture was deployed

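# Cache-locking pattern: prevents thundering herd on cache misses
# When an entry is missing or expired, only ONE request repopulates it; others wait

import threading
import time

_cache = {}   # key -> (value, expires_at)
_locks = {}   # key -> threading.Event for the fetch currently in flight
_lock_mutex = threading.Lock()

def _fresh(key):
    """Return (True, value) if key is cached and unexpired, else (False, None)."""
    entry = _cache.get(key)
    if entry is not None and entry[1] > time.monotonic():
        return True, entry[0]
    return False, None

def get_with_cache_lock(key: str, fetch_fn, ttl_seconds: int):
    """
    On cache hit: return immediately.
    On cache miss: one thread fetches from DB; all others wait for the result.
    Prevents N simultaneous DB queries for the same expired cache key.
    """
    # Fast path: cache hit
    hit, value = _fresh(key)
    if hit:
        return value

    # Slow path: cache miss, so elect exactly one fetcher per key
    with _lock_mutex:
        hit, value = _fresh(key)     # re-check: another thread may have just filled it
        if hit:
            return value
        event = _locks.get(key)
        should_fetch = event is None
        if should_fetch:
            event = _locks[key] = threading.Event()

    if should_fetch:
        try:
            value = fetch_fn()       # ONE database query, not N
            _cache[key] = (value, time.monotonic() + ttl_seconds)
            return value
        finally:
            with _lock_mutex:
                _locks.pop(key, None)
            event.set()              # wake all waiting threads, even if the fetch failed
    else:
        event.wait(timeout=5)        # wait for the fetching thread
        hit, value = _fresh(key)
        # If the fetcher failed or timed out, fall back to a direct fetch
        return value if hit else fetch_fn()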

ORM Query Review: The Most Actionable Lesson

OpenAI discovered a single ORM-generated query joining 12 tables. Under normal load it was acceptable. Under traffic spikes it saturated the primary's CPU and caused multiple SEV-0s. The query had been auto-generated by the ORM framework and never explicitly reviewed. OpenAI now requires that all ORM-generated queries against high-traffic tables be analysed with EXPLAIN ANALYZE before deployment. This practice is cheap. Not having it costs SEV-0s.

OpenAI's schema change governance is one of the most operationally distinctive aspects of their Postgres setup. They enforce a strict rule: schema changes that trigger a full table rewrite are prohibited in production. On large tables, operations such as changing a column's type or adding a column with a volatile default (before PostgreSQL 11, any ADD COLUMN with a DEFAULT) can hold an exclusive lock for hours while rewriting billions of rows. Postgres's MVCC (Multi-Version Concurrency Control: the mechanism that lets readers and writers operate concurrently without blocking each other, at the cost of retaining multiple versions of each row and requiring periodic vacuum to reclaim space) does not help here, because an exclusive table lock blocks readers and writers alike. All DDL operations have a 5-second timeout: if the schema change cannot acquire a lock within 5 seconds, it is cancelled automatically.
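One way to apply such a limit is PostgreSQL's lock_timeout setting, scoped to a single transaction with SET LOCAL. The helper below is a hypothetical sketch of that approach, not OpenAI's tooling: it only builds the SQL text, and the function name and the example table and column are invented for illustration. Note that lock_timeout bounds how long a statement waits to acquire a lock, not how long it holds the lock afterwards, which is why the ban on table-rewriting operations is a separate rule.

```python
def guarded_ddl(statement: str, lock_timeout_seconds: int = 5) -> str:
    """Wrap one DDL statement so it aborts if the lock is not granted in time."""
    if lock_timeout_seconds <= 0:
        raise ValueError("lock_timeout_seconds must be positive")
    ddl = statement.strip().rstrip(";")
    return "\n".join([
        "BEGIN;",
        # SET LOCAL applies only until the end of this transaction.
        f"SET LOCAL lock_timeout = '{lock_timeout_seconds}s';",
        f"{ddl};",
        "COMMIT;",
    ])


# Hypothetical table and column, for illustration only.
script = guarded_ddl(
    "ALTER TABLE conversations ADD COLUMN archived_at timestamptz"
)
print(script)
```

If the ALTER TABLE cannot get its lock within the window, Postgres raises an error and the transaction rolls back, so queries queued behind the DDL are released instead of piling up.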

The seven defensive layers:

Connection pooling — PgBouncer in statement mode; 50ms → 5ms connect time; eliminates connection exhaustion
Thundering herd prevention — cache-locking on cache misses; only one thread fetches from DB; all others wait
Multi-layer rate limiting — at application, proxy, and query levels; core product queries get priority
Hot standby failover — continuously synchronised replica; primary promotion in 30–60 seconds
Write offloading — all new write-heavy workloads default to Cosmos DB; no new tables in Postgres
Query surgery — ORM-generated SQL reviewed with EXPLAIN ANALYZE; covering indexes on hot query paths
DDL governance — 5-second lock timeout; backfills rate-limited so aggressively they can take a week
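The multi-layer rate limiting in the list above can be sketched with token buckets, one per layer, where a request must pass every layer in order. The source does not say which algorithm OpenAI uses, so the token bucket, the class names, and the rates below are all assumptions. The clock is injectable so the demo is deterministic.

```python
import time


class TokenBucket:
    """Refills at rate_per_sec up to capacity; each request spends one token."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class LayeredLimiter:
    """A request must pass every layer, checked outermost first."""

    def __init__(self, layers):
        self.layers = layers  # list of (layer_name, TokenBucket)

    def check(self):
        """Return None if admitted, else the name of the rejecting layer."""
        for name, bucket in self.layers:
            if not bucket.allow():
                return name
        return None


fake_now = [0.0]                      # fake clock for a repeatable demo
clock = lambda: fake_now[0]
limiter = LayeredLimiter([
    ("application", TokenBucket(10, 10, clock)),
    ("proxy", TokenBucket(5, 5, clock)),
    ("query", TokenBucket(2, 2, clock)),   # tightest limit guards the database
])

burst = [limiter.check() for _ in range(3)]  # third request exceeds the query layer
fake_now[0] = 1.0                            # one second later the bucket has refilled
after_refill = limiter.check()
```

A simplification worth noting: a request rejected at an inner layer has already spent tokens at the outer ones. A production limiter might refund them or give core product queries their own buckets.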

The Cosmos DB migration policy: fixing the future architecture
OpenAI's most forward-looking operational decision is a standing policy: no new tables are created in PostgreSQL. All new workloads default to sharded systems — primarily Azure Cosmos DB. Existing write-heavy workloads that can be horizontally partitioned are gradually migrated out. This policy doesn't fix the current architecture; it fixes the future architecture. Over time, the Postgres primary handles a smaller and smaller share of writes while remaining the canonical store for core user and conversation data. The single-primary architecture is not defended forever — it is being gracefully phased toward a hybrid model.

Idle transaction timeouts: the quiet killer
OpenAI identified a subtle but devastating Postgres pattern at scale: idle transactions. When application code opens a database connection, starts a transaction, does unrelated work (calling an external API, waiting for user input), and only then commits — the transaction holds locks for the entire duration. At ChatGPT's scale, applications holding open transactions for seconds can block vacuum, block DDL, and degrade query performance for all other connections. OpenAI enforces strict idle_in_transaction_session_timeout settings — any connection idle inside a transaction for more than a few seconds is automatically terminated. This breaks poorly-written code immediately in staging rather than causing incidents in production.

The backfill rate limit: so slow it takes a week
OpenAI enforces strict rate limits on database backfill operations — migrations that populate new columns or update existing rows across large tables. These rate limits are aggressive enough that a large backfill can take over a week to complete. This is deliberate: a fast backfill on a billion-row table would compete with live traffic for I/O, degrade query latency, and risk triggering the DDL timeout. Slow backfills are boring and invisible. Fast backfills cause incidents. OpenAI chose boring.
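A throttled backfill usually has this shape: update a small batch in its own short transaction, pause, repeat. The sketch below uses sqlite3 from the standard library as a stand-in for Postgres so that it runs anywhere; the table, column names, batch size, and pause length are assumptions, not OpenAI's values.

```python
import sqlite3
import time


def throttled_backfill(conn, batch_size=1000, pause_seconds=1.0, sleep=time.sleep):
    """Populate users.display_name in small batches, pausing between them.

    Assumes users.name is NOT NULL, so every updated row leaves the
    "display_name IS NULL" set and the loop is guaranteed to finish.
    """
    total = 0
    while True:
        with conn:  # one short transaction per batch, committed on exit
            cursor = conn.execute(
                "UPDATE users SET display_name = name "
                "WHERE id IN (SELECT id FROM users "
                "WHERE display_name IS NULL ORDER BY id LIMIT ?)",
                (batch_size,),
            )
        if cursor.rowcount == 0:
            return total
        total += cursor.rowcount
        sleep(pause_seconds)  # yield I/O to live traffic before the next batch


# Demo on an in-memory database with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, display_name TEXT)"
)
conn.executemany(
    "INSERT INTO users (name) VALUES (?)", [(f"user{i}",) for i in range(2500)]
)
conn.commit()

pauses = []  # record the pauses instead of actually sleeping
updated = throttled_backfill(conn, batch_size=1000, pause_seconds=0.5, sleep=pauses.append)
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
).fetchone()[0]
```

The total runtime is roughly (rows / batch_size) × pause_seconds, so the week-long duration described above is simply what a small batch and a long pause add up to on a billion-row table.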

Architecture

OpenAI's Postgres architecture is simple at the macro level — one writer, many readers — but densely engineered at the micro level. The simplicity is intentional: every additional layer of infrastructure complexity is a potential failure mode. The dense engineering at the application and proxy layers is what makes the simple macro architecture viable at unprecedented scale.

OpenAI's PostgreSQL Architecture: Single Primary, Global Read Scale

Interactive diagram available on TechLogStack.

Multi-Layer Rate Limiting: Defence in Depth for Write Spikes

Interactive diagram available on TechLogStack.

The Replication Lag Tradeoff

Asynchronous replication to read replicas introduces a tradeoff: reads may return slightly stale data. For most ChatGPT operations (loading conversation history, displaying user settings, browsing the interface), a few hundred milliseconds of staleness is imperceptible and acceptable. For the small fraction of requests that require current data, such as a read issued immediately after the same user's write, OpenAI routes the read to the primary. This explicit split between "reads that can tolerate lag" and "reads that cannot" is a deliberate design discipline, and it is what allows read load to be distributed across 50 replicas.
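One common way to implement this split is to pin a user's reads to the primary for a short window after that user writes. The source says only that such reads go to the primary; the pin window, the per-user bookkeeping, and the ReadRouter class below are assumptions made for the sketch.

```python
import itertools
import time


class ReadRouter:
    """Send reads to replicas unless the user wrote very recently."""

    def __init__(self, replicas, pin_seconds=1.0, clock=time.monotonic):
        self.replicas = itertools.cycle(replicas)  # simple round-robin
        self.pin_seconds = pin_seconds             # assumed upper bound on lag
        self.clock = clock
        self.last_write = {}                       # user_id -> time of last write

    def record_write(self, user_id):
        self.last_write[user_id] = self.clock()

    def route_read(self, user_id):
        wrote_at = self.last_write.get(user_id)
        if wrote_at is not None and self.clock() - wrote_at < self.pin_seconds:
            return "primary"        # read-your-own-write: replicas may lag
        return next(self.replicas)  # staleness is acceptable here


now = [100.0]  # fake clock so the demo is deterministic
router = ReadRouter(["replica-1", "replica-2"], pin_seconds=1.0, clock=lambda: now[0])

before_write = router.route_read("u1")   # no recent write: a replica
router.record_write("u1")
right_after = router.route_read("u1")    # pinned to the primary
other_user = router.route_read("u2")     # other users are unaffected
now[0] = 101.5
after_window = router.route_read("u1")   # window elapsed: back to replicas
```

The pin window has to exceed typical replication lag. Set it too short and users see stale data after their own writes; set it too long and read load drifts back onto the primary.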

Lessons

Analyse your workload before choosing your architecture. OpenAI's single-primary architecture works because ChatGPT is overwhelmingly read-heavy. A write-heavy workload at the same scale would fail with this architecture. The lesson is not "use a single primary" — it's "design for your actual access patterns, not for the scale number on the slide."
Connection pooling (deploying a proxy like PgBouncer between application servers and PostgreSQL that multiplexes thousands of application connections into a smaller pool of database connections) is not optional at scale. At ChatGPT's traffic volume, hitting Postgres's 5,000-connection limit without pooling would have caused regular outages. PgBouncer turned a recurring incident cause into a non-issue. Deploy it before you need it.
Review ORM-generated SQL for high-traffic tables before shipping. A 12-table join that worked fine at 1× traffic caused multiple SEV-0s at 10×. ORMs are invisible query generators. Add explicit review of ORM-generated queries — EXPLAIN ANALYZE at production load levels — as a standard pre-deployment step for database-touching code.
Enforce schema change governance with hard timeouts. A DDL operation that holds a table lock for hours will cause an incident. OpenAI's 5-second DDL timeout automatically cancels any schema change that cannot acquire a lock quickly. This constraint forces engineers to use online schema-change techniques (pg_repack, zero-downtime column addition) rather than naive ALTER TABLE on large tables.
Plan the exit from your current architecture before you need it. OpenAI's "no new tables in PostgreSQL" policy and ongoing write workload migration to Cosmos DB are the planned evolution of the current architecture. A single-primary Postgres at 800M users is viable today because write load is bounded. It's viable tomorrow because write-heavy workloads are being systematically migrated out. Know the limits of your current architecture and have a credible plan for crossing them.
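
The connection-pooling lesson above is easier to see with a toy model. The sketch below is not PgBouncer and talks to no database; BoundedPool and simulate are hypothetical names, and the "connections" are plain strings. It only shows the multiplexing idea: many callers, a small fixed set of backend connections.

```python
import queue
import threading


class BoundedPool:
    """Share a fixed number of backend connections among many callers."""

    def __init__(self, size):
        self._idle = queue.Queue()
        for i in range(size):
            self._idle.put(f"conn-{i}")   # stand-in for a real DB connection

    def run(self, work):
        conn = self._idle.get()           # blocks until a connection is free
        try:
            return work(conn)
        finally:
            self._idle.put(conn)          # always hand the connection back


def simulate(clients=200, pool_size=5):
    """Run many client threads and count distinct backend connections used."""
    pool = BoundedPool(pool_size)
    used = set()
    used_lock = threading.Lock()

    def work(conn):
        with used_lock:
            used.add(conn)
        return conn

    threads = [
        threading.Thread(target=pool.run, args=(work,)) for _ in range(clients)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(used)
```

However many client threads are started, the number of distinct backend connections never exceeds pool_size. Callers wait briefly for a free connection instead of each opening their own, which is the property that keeps a database below its connection limit.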

Engineering Glossary

Covering index — a database index that contains all columns needed by a query, allowing Postgres to answer it from the index alone without reading table rows. Can reduce query cost from a sequential scan of billions of rows to a few hundred index lookups on high-frequency query paths.
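
A covering index can be observed directly in a query plan. The sketch below uses SQLite through Python's standard library rather than Postgres, purely so that it runs without a server; the table and index names are hypothetical. In Postgres the equivalent signal is an index-only scan in the EXPLAIN output.

```python
import sqlite3

# Hypothetical query: fetch only columns that the index can supply.
QUERY = "SELECT status FROM messages WHERE user_id = 42"


def plan_for(query, with_covering_index):
    """Return SQLite's query plan text, with or without a covering index."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE messages ("
        "id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT, body TEXT)"
    )
    if with_covering_index:
        # The index holds both the filter column and the selected column,
        # so the query never needs to touch the table rows.
        db.execute("CREATE INDEX idx_user_status ON messages (user_id, status)")
    rows = db.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    db.close()
    return " ".join(row[3] for row in rows)   # column 3 is the plan detail
```

Without the index the plan is a full scan of the table; with it, SQLite reports that it is using a covering index and reads no table rows at all.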

DDL (Data Definition Language) — SQL statements that change database structure, such as ALTER TABLE and CREATE INDEX. DDL operations in Postgres can acquire exclusive locks that block all reads and writes on a table while they execute — catastrophic at scale. OpenAI enforces a 5-second DDL timeout and prohibits operations that require full table rewrites.
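
One way to express a DDL timeout like this is Postgres's lock_timeout setting, applied only for the duration of the migration transaction. The sketch below just builds the statements; the helper name guarded_ddl and the example table are hypothetical, and the source does not say which mechanism OpenAI actually uses.

```python
def guarded_ddl(statement, lock_timeout="5s"):
    """Wrap a DDL statement so it fails fast if it cannot get its lock.

    Assumption: the timeout is expressed with Postgres's lock_timeout
    setting. SET LOCAL limits the setting to the current transaction.
    Statements that cannot run inside a transaction block (for example
    CREATE INDEX CONCURRENTLY) need a different wrapper.
    """
    statement = statement.strip().rstrip(";")
    return [
        "BEGIN;",
        f"SET LOCAL lock_timeout = '{lock_timeout}';",
        f"{statement};",
        "COMMIT;",
    ]


# Hypothetical usage: print the statements a migration tool would send.
for line in guarded_ddl("ALTER TABLE conversations ADD COLUMN archived boolean"):
    print(line)
```

If the ALTER TABLE cannot acquire its lock within the timeout, Postgres cancels the statement with an error and the transaction is rolled back, so the migration fails quickly instead of queueing behind long-running queries and blocking everything that arrives after it.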

Hot standby — a continuously synchronised replica specifically designated as the failover target. When the primary goes down, the hot standby can be promoted to primary with ~30–60 seconds of downtime. During a primary failure, read traffic on other replicas is unaffected.

Idle transaction timeout — a database session setting that automatically terminates connections that are idle inside an open transaction for longer than a configured period. Prevents applications that open transactions and do unrelated work from holding locks indefinitely and blocking vacuum and DDL.

MVCC (Multi-Version Concurrency Control) — Postgres's mechanism allowing readers and writers to operate concurrently without blocking each other, at the cost of retaining multiple versions of each row. Requires periodic vacuum to reclaim space from deleted and updated rows.

ORM (Object-Relational Mapping) — a framework layer like Django or SQLAlchemy that automatically generates SQL from application code. Convenient for development velocity; capable of generating complex, inefficient queries (like a 12-table join) that are invisible in code review because they are generated at runtime.

PgBouncer — a lightweight connection pooler for PostgreSQL that multiplexes many application connections into a smaller pool of real database connections. Reduces connection overhead and prevents connection exhaustion. Reduced OpenAI's average connection setup time from 50ms to 5ms.

Thundering herd — a failure pattern where many concurrent requests simultaneously attempt to repopulate an expired cache entry. Each request queries the database independently, so N waiting requests generate N identical queries. Prevented by a cache-locking mechanism where only one thread fetches while all others wait for the result.
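
The cache-locking idea can be sketched with a per-key lock. This is a minimal in-process illustration under assumed names (SingleFlightCache, loader); a production cache would also need expiry and error handling, and a distributed cache would need a distributed lock.

```python
import threading


class SingleFlightCache:
    """On a cache miss, let one thread load the value while the rest wait."""

    def __init__(self, loader):
        self._loader = loader            # function that queries the database
        self._values = {}
        self._key_locks = {}
        self._guard = threading.Lock()   # protects _key_locks

    def get(self, key):
        if key in self._values:          # fast path: cache hit, no locking
            return self._values[key]
        with self._guard:
            key_lock = self._key_locks.setdefault(key, threading.Lock())
        with key_lock:
            # Re-check after acquiring the lock: another thread may have
            # loaded the value while this one was waiting.
            if key not in self._values:
                self._values[key] = self._loader(key)
            return self._values[key]
```

With this in place, twenty threads asking for the same missing key result in one call to the loader, not twenty. The other nineteen block on the key's lock and then read the value the first thread stored.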

This case is a plain-English retelling of publicly available engineering material.

Read the full case on TechLogStack →

(Interactive diagrams, source links, and the full reader experience)

TechLogStack — built at scale, broken in public, rebuilt by engineers.