TechLogStack

Posted on Jun 3 • Originally published at techlogstack.com on May 31

GitHub Built the Internet's Code Platform — Then AI Agents Broke It

#distributedsystems #reliability #devops #webdev

257 incidents — May 2025 to April 2026; ~5 per week, every week, for 12 months straight
48 major outages — 112+ hours of total significant downtime
AI-agent PRs: 4M (Sept 2025) → 17M (Mar 2026) — 325% increase in six months
Actions usage: 500M min/week (2023) → 2.1B min/week (early 2026)
10x scaling plan launched October 2025; revised to 30x by February 2026
Mitchell Hashimoto (GitHub user #1299, 18 years, co-founder of HashiCorp) migrated Ghostty away

Between May 2025 and April 2026, GitHub experienced 257 incidents — 48 of them major outages. That's roughly one significant disruption every single week. The culprit wasn't a security breach, a botched deployment, or a rogue engineer. It was the thing GitHub had spent years celebrating: AI. Specifically, agentic AI workflows that turned one human developer's footprint into hundreds of commits, thousands of CI minutes, and a dozen simultaneous PR operations — all at once, across millions of accounts. GitHub had been built for humans. Agents are not human.

The Story

We started executing our plan to increase GitHub's capacity by 10X in October 2025, with a goal of substantially improving reliability and failover. By February 2026, it was clear that we needed to design for a future that requires 30X today's scale. The main driver is a rapid change in how software is being built. Since the second half of December 2025, agentic development workflows have accelerated sharply.

— Vlad Fedorov, CTO of GitHub, GitHub Engineering Blog, April 28, 2026

For most of its existence, GitHub has been one of the most reliable platforms on the internet. Developers took it for granted the way they take electricity for granted — always on, always there, a utility so dependable it disappeared into the background. That changed in 2025. Not because GitHub's engineers got worse. Not because the codebase got sloppier. But because something fundamental changed about who — or more precisely, what — was using GitHub. AI coding agents arrived at scale, and they didn't behave anything like the human developers the platform was built for.

In 2024, GitHub logged 119 service incidents, including 26 major ones — frustrating, but manageable. Then, between May 2025 and April 2026, incident monitoring service IncidentHub tracked 257 separate incidents, of which 48 were classified as major outages. February 2026 alone produced 37 incidents — the worst month on record. GitHub Actions suffered 57 outages in the same 12-month stretch. On May 15, 2026, a single Actions degradation caused 42% of all Actions runs to fail at peak impact.

The Core Problem: Agents Don't Behave Like Humans

A human developer on a free GitHub account might generate a few commits and a handful of CI runs in a working day. An AI agent on the same account can generate hundreds of commits, dozens of PRs, and thousands of Actions minutes in a single afternoon. GitHub's 2025 Octoverse report celebrated nearly 1 billion commits. By early 2026, GitHub COO Kyle Daigle shared a more alarming figure: the platform was handling 275 million commits every single week, on pace for roughly 14 billion in 2026. That is about 14 times the annual total Octoverse had just celebrated. The increase did not come from 14x more developers. It came from agents treating GitHub's API like a utility and consuming it at machine speed.
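The annualized figure follows from simple arithmetic. A quick check, using only the numbers quoted above and assuming the weekly rate holds for a full year (the function names are illustrative):

```python
WEEKS_PER_YEAR = 52


def annualize(weekly_count: int) -> int:
    """Project a weekly count onto a full year, assuming the rate holds."""
    return weekly_count * WEEKS_PER_YEAR


def multiple_of(new_total: int, old_total: int) -> float:
    """How many times larger new_total is than old_total."""
    return new_total / old_total


weekly_commits = 275_000_000            # weekly figure quoted above
octoverse_2025_commits = 1_000_000_000  # "nearly 1 billion", rounded up

projected_2026 = annualize(weekly_commits)
print(projected_2026)                                                 # 14300000000
print(round(multiple_of(projected_2026, octoverse_2025_commits), 1))  # 14.3
```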

Problem

GitHub Was Built for Human-Paced Development

GitHub's architecture was designed for a world where developers work at human speed: open a PR, push commits over hours or days, wait for CI to run, merge when green. The platform's capacity planning, its database schemas, its job queues, its rate limits — all calibrated for a workflow where one human generates a bounded amount of activity per session. That assumption held for 17 years.

Cause

AI Agents Changed the Economics of Every GitHub Operation

GitHub CTO Vlad Fedorov identified the mechanism: a single pull request can simultaneously touch Git storage, mergeability checks, branch protection, GitHub Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases. A human merging one PR triggers this chain once. An AI agent framework running hundreds of concurrent sessions triggers it thousands of times simultaneously. AI-agent PRs jumped from 4 million in September 2025 to 17 million in March 2026 — a 325% increase in six months.
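The multiplication described here can be made concrete with a back-of-the-envelope model. The sketch below is illustrative only: it takes the twelve subsystems named above, assumes each PR touches every one of them exactly once (a simplification), and uses a hypothetical count of 500 concurrent PRs.

```python
# Subsystems named in the paragraph above. One "touch" per PR per subsystem
# is a simplifying assumption, not a measured figure.
SUBSYSTEMS = (
    "git_storage", "mergeability_checks", "branch_protection",
    "actions", "search", "notifications", "permissions",
    "webhooks", "apis", "background_jobs", "caches", "databases",
)


def subsystem_touches(concurrent_prs: int) -> int:
    """Total subsystem operations triggered by a batch of concurrent PRs."""
    return concurrent_prs * len(SUBSYSTEMS)


def percent_increase(old: int, new: int) -> float:
    """Relative growth from old to new, as a percentage."""
    return (new - old) / old * 100


print(subsystem_touches(1))                     # 12
print(subsystem_touches(500))                   # 6000
print(percent_increase(4_000_000, 17_000_000))  # 325.0
```

The last line reproduces the 325% figure: 17 million is 4.25 times 4 million, which is an increase of 3.25 times the starting value.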

Solution

10x Plan Became 30x Plan — And They Were Still Behind

GitHub began a 10x capacity scaling initiative in October 2025. By February 2026, that plan was already obsolete: real demand required 30x. At the same time, GitHub was migrating to Azure, with 12.5% of all traffic on Azure Central US in early 2026 and a target of 50% by July 2026. Running a platform migration alongside an AI-driven traffic explosion is the engineering equivalent of rebuilding an airplane's engines at 35,000 feet.

Result

Cascading Failures and High-Profile Departures

The pressure produced not just performance degradation but engineering failures. On April 23, 2026, an incomplete feature flag silently reverted commits across 658 repositories and 2,092 pull requests; the UI showed green checkmarks while code was being rewritten underneath. On April 28, Mitchell Hashimoto (GitHub user #1299, co-founder of HashiCorp, a member since February 2008) announced that, after 18 years on the platform, he was moving Ghostty off GitHub. The Zig programming language project also migrated away.

The Fix

The Engineering Response: Ruby to Go, Monolith to Services, Single Cloud to Multi-Cloud

GitHub's five-layer engineering response (as outlined by CTO Vlad Fedorov, April 2026):

Problem Layer	Root Cause	GitHub's Fix
Language / Runtime	Ruby monolith has GIL limiting CPU parallelism under high concurrency	Rewriting performance-critical services from Ruby to Go — goroutine model handles massive concurrency without the GIL
Infrastructure	Single-cloud creates concentrated failure risk and limits horizontal scaling	Multi-cloud deployment — 12.5% on Azure Central US in early 2026, targeting 50% by July 2026
Service Isolation	A single PR cascades through 10+ interconnected subsystems	Isolating Git and Actions into independent failure domains
Capacity Planning	10x plan (October 2025) obsolete by February 2026	30x capacity design with automated scaling for agent-driven burst load
Feature Safety	April 23 merge queue regression caused by incomplete feature flag	Strengthened feature flag discipline — no data-integrity code path ships without complete flag protection

257 — total incidents May 2025–April 2026; roughly five per week, every week, for twelve months
48 — major outages producing over 112 hours of total significant downtime
30x — the scale GitHub needed to design for by February 2026, triple the 10x plan launched four months earlier
2,092 — pull requests silently reverted by the April 23 merge queue bug across 658 repositories, with no notification

The April 23 Silent Revert: Why This One Was Different

The April 23 merge queue bug was caused by an incomplete feature flag that allowed a new code path to activate without full safeguards. Commits that had been merged were silently reverted across 658 repositories and 2,092 pull requests. The terrifying part was not the scope — it was the silence. The UI continued to show green checkmarks and merge confirmations while the system was actively undoing work underneath. A platform's most sacred contract with its users is that when it shows a green checkmark, the operation succeeded. GitHub broke that contract. A complete feature flag would have allowed engineers to disable the affected code path instantly. For any code path that touches data that developers trust as immutable, flag protection is not a best practice — it is the minimum viable safety mechanism.
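The kill-switch idea described above can be sketched in a few lines. This is a minimal illustration, not GitHub's implementation: `FlagStore`, the flag name, and the two merge functions are all hypothetical stand-ins. The point is that every entry into the new code path passes through one flag check, and that an unset flag defaults to off.

```python
class FlagStore:
    """In-memory stand-in for a real flag service (hypothetical)."""

    def __init__(self) -> None:
        self._flags = {}

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        # Unknown flags are treated as off, so a missing entry fails safe.
        return self._flags.get(name, False)


NEW_MERGE_PATH = "new_merge_path"  # hypothetical flag name


def merge_legacy(pr_id: int) -> str:
    return f"legacy:{pr_id}"


def merge_new(pr_id: int) -> str:
    return f"new:{pr_id}"


def merge(pr_id: int, flags: FlagStore) -> str:
    # The only way into the new path is through this check.
    if flags.is_enabled(NEW_MERGE_PATH):
        return merge_new(pr_id)
    return merge_legacy(pr_id)


flags = FlagStore()
print(merge(1, flags))            # legacy:1 (flag unset, defaults to off)
flags.set(NEW_MERGE_PATH, True)
print(merge(2, flags))            # new:2
flags.set(NEW_MERGE_PATH, False)  # the kill switch: no redeploy needed
print(merge(3, flags))            # legacy:3
```

A flag is "incomplete" in this sense when some caller can reach `merge_new` without going through the check, so switching the flag off no longer guarantees the old behaviour.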

Why the Ruby-to-Go rewrite is the right call

Ruby served GitHub extraordinarily well for 18 years. But Ruby's Global Interpreter Lock (GIL, formally the Global VM Lock in the standard CRuby interpreter) is a fundamental constraint: even on a 64-core server, a single Ruby process can execute only one thread of Ruby code at a time. For human-paced web traffic, this limitation is manageable, typically by running many worker processes side by side. For AI agent workflows that generate thousands of concurrent operations, the GIL becomes a hard ceiling on what each process can do. Go's goroutine model, in which lightweight threads managed by the Go runtime run across all available CPU cores without a GIL, is architecturally suited to exactly the concurrency profile that AI agents create. The rewrite is not about language preference. It is about physics.
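A rough way to see the constraint is to count how many cores CPU-bound code can keep busy under each model. This is a simplified model, not a benchmark: it ignores I/O (during which CRuby releases the lock), scheduler overhead, and memory cost, and the function names are illustrative.

```python
def busy_cores_with_gil(processes: int, threads_per_process: int, cores: int) -> int:
    """CPU-bound work under an interpreter lock: one running thread per process."""
    running_per_process = min(threads_per_process, 1)
    return min(processes * running_per_process, cores)


def busy_cores_without_gil(processes: int, tasks_per_process: int, cores: int) -> int:
    """CPU-bound work with no interpreter lock: any task can run on any core."""
    return min(processes * tasks_per_process, cores)


CORES = 64
print(busy_cores_with_gil(1, 1000, CORES))     # 1
print(busy_cores_with_gil(64, 16, CORES))      # 64
print(busy_cores_without_gil(1, 1000, CORES))  # 64
```

The middle line is the usual workaround for a locked interpreter: run one process per core. That recovers the cores, but each process carries its own copy of the application, which is the overhead a single goroutine-based process avoids.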

The Mitchell Hashimoto moment

On April 28, 2026, Mitchell Hashimoto — GitHub user number 1299, co-founder of HashiCorp, creator of Vagrant, Packer, Consul, Terraform, and Vault — posted that Ghostty was leaving GitHub. He had visited GitHub almost every day for over 18 years. His post described the decision as 'irrationally sad' but said the platform was no longer a place where he could 'get work done' and 'ship software.' He made a point that resonated across the developer community: the problem was not Git itself — the distributed version control system remained excellent. The problem was the surrounding infrastructure: issues, pull requests, GitHub Actions. When the person who never had reason to question the platform for 18 years starts questioning it, something has fundamentally changed.

The structural billing gap

GitHub's business model was designed for humans, and its pricing reflects human-scale consumption. A developer on a free GitHub account generates some commits, a few CI runs, and a handful of API calls per day. An AI agent on the same account can generate hundreds of commits, dozens of PRs, thousands of Actions minutes, and tens of thousands of API calls in a single afternoon. The infrastructure cost per 'user' has fundamentally changed, but the pricing model has not yet caught up. GitHub's Octoverse 2025 report celebrated nearly 1 billion commits and 36 million new developers. But the 2026 numbers aren't being driven by 36 million new developers — they're being driven by agents treating GitHub's API like a utility.

Architecture

GitHub's architecture evolved over 18 years around a core assumption: the unit of load is a human developer. A human opens a PR, waits for review, pushes a few commits, and merges. The platform's service graph — Git storage, mergeability computation, branch protection evaluation, Actions job dispatch, search indexer, notification fan-out, webhook delivery, permission evaluation, API gateway — was sized and coupled around this human-paced access pattern.

AI agents broke the architecture's fundamental assumption. An agent doesn't open a PR and wait. An agent opens 50 PRs in parallel, each triggering the full service chain simultaneously. When AI-agent PR volume more than quadruples in six months, the pressure on every one of those systems scales with it, and the interconnected failures begin.

A Single GitHub PR: The 10+ Subsystems It Touches

[Interactive diagram, available on TechLogStack.]

GitHub Actions: Weekly Compute Minutes — The AI Agent Surge

[Interactive diagram, available on TechLogStack.]

83 Incidents From Capacity Failures Alone

83 of GitHub's 257 incidents between May 2025 and April 2026 were caused by load and capacity problems — with indications that many services did not have automatic scaling configured, requiring manual intervention to add capacity during surges. This means that dozens of times, engineers had to notice the problem, escalate it, and manually provision resources before the platform could recover. Automated capacity scaling for burst load is not optional infrastructure. For a platform being consumed by AI agents, it is the minimum viable reliability architecture.
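What automated scaling replaces is the notice, escalate, provision loop described above. A minimal sketch of the decision step follows, using the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas equal the current count multiplied by observed utilization over target utilization, rounded up. Nothing in the source says GitHub uses this rule; the parameter names and limits are assumptions for illustration.

```python
def desired_replicas(current: int, observed_pct: int, target_pct: int,
                     min_replicas: int = 1, max_replicas: int = 1000) -> int:
    """Proportional scaling: keep average utilization near target_pct."""
    if current < 1 or target_pct < 1:
        raise ValueError("current and target_pct must be positive")
    # Integer ceiling of current * observed / target, avoiding float rounding.
    wanted = -(-(current * observed_pct) // target_pct)
    # Clamp to configured bounds so a bad metric cannot scale without limit.
    return max(min_replicas, min(wanted, max_replicas))


print(desired_replicas(10, 90, 60))                    # 15
print(desired_replicas(10, 30, 60))                    # 5
print(desired_replicas(10, 600, 60, max_replicas=50))  # 50
```

The third call shows why the upper bound matters for burst load: a tenfold spike asks for 100 replicas, and the cap decides how much of that surge the service is allowed to absorb automatically.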

Lessons

Your platform's capacity model must be built around its actual consumers — not its original consumers. GitHub was built for human developers. AI agents consume infrastructure at orders of magnitude greater intensity. Any platform that introduces AI-native workflows must remodel its capacity assumptions from scratch, not incrementally adjust from the human baseline.
Feature flags are not optional for infrastructure that handles data integrity. A feature flag lets a team deploy new code in an inactive state, test it in production, roll it out gradually, and disable it instantly without redeploying. The April 23 merge queue bug, which silently reverted commits across 2,092 pull requests, was caused by an incomplete feature flag. A complete one would have let engineers disable the affected code path instantly. For any code path that touches data developers trust as immutable, flag protection is the minimum viable safety mechanism.
A monolith that can't be incrementally scaled will become a single point of failure at sufficient scale. GitHub's Ruby monolith served the platform for 18 years because human-paced traffic was bounded enough that the GIL's concurrency limit never became the primary bottleneck. AI agents removed that bound. The architectural lesson is not that monoliths are bad — it's that every architectural decision encodes assumptions about scale, and those assumptions must be revisited when the scale changes fundamentally.
Service isolation is not premature optimisation — it is the prerequisite for containing blast radius at scale. When critical services are deeply coupled — when a PR touches Git storage, Actions, search, notifications, permissions, and webhooks in a single chain — a failure in any one component becomes a failure across all components. GitHub's commitment to isolating Git and Actions into independent failure domains is the architectural move that will have the most long-term impact on reliability.
Trust is the asset that reliability engineering protects. Mitchell Hashimoto didn't leave GitHub because of any single outage. He left because 257 incidents over 12 months had eroded confidence in the platform as a reliable foundation for serious work. Reliability is not measured in individual incident severities — it is measured in the cumulative effect of failures on whether people trust the platform to do what it says it did.
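The service-isolation lesson above is about stopping one failing dependency from dragging down its callers. One common technique for this is a circuit breaker, sketched below. It is a generic illustration, not something the source attributes to GitHub, and it omits the timed "half-open" recovery state that production breakers normally have. All names are hypothetical.

```python
class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def call(self, fn, *args):
        if self.is_open:
            # Fail fast instead of piling more load on a struggling dependency.
            raise CircuitOpenError("dependency unavailable, call skipped")
        try:
            result = fn(*args)
        except Exception:
            self.consecutive_failures += 1
            raise
        self.consecutive_failures = 0
        return result

    def reset(self) -> None:
        self.consecutive_failures = 0


def flaky_notifications(pr_id: int) -> None:
    # Stand-in for a dependency that is timing out.
    raise TimeoutError("notifications backend timed out")


breaker = CircuitBreaker(failure_threshold=2)
for pr_id in (1, 2, 3):
    try:
        breaker.call(flaky_notifications, pr_id)
    except CircuitOpenError:
        print(f"PR {pr_id}: skipped")
    except TimeoutError:
        print(f"PR {pr_id}: timed out")
```

After two timeouts the breaker opens, so the third PR skips the notification call entirely rather than waiting on a backend that is already failing.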

Engineering Glossary

Global Interpreter Lock (GIL): a mutex in the reference Python and Ruby interpreters (CPython and CRuby; CRuby calls it the Global VM Lock) that prevents multiple threads from executing interpreter code simultaneously in the same process. Even on a multi-core server, a single Ruby process can use only one CPU core at a time for Ruby execution. It is the scaling constraint behind the Ruby-to-Go rewrite at GitHub's AI agent traffic levels.

AI agent PR: a pull request created by an autonomous AI coding agent (such as Copilot Workspace or similar agentic tools) rather than a human developer. AI-agent PRs jumped from 4 million in September 2025 to 17 million in March 2026 on GitHub, the primary driver of the platform's capacity crisis.

Agentic development workflow: a software development pattern where AI agents autonomously perform multi-step tasks: creating branches, writing code, running tests, opening PRs, and iterating based on feedback. Unlike human-paced development, agentic workflows can generate hundreds of concurrent operations from a single user session.

Feature flag (kill switch): a configuration switch that enables or disables a code path without requiring redeployment. This was the safeguard left incomplete in GitHub's April 23 merge queue incident. A complete feature flag would have allowed the problematic code path to be disabled instantly rather than requiring a full redeployment cycle.

Service isolation: an architectural design where services are deployed as independent failure domains rather than a tightly coupled chain. The goal: a failure in one service (e.g. GitHub Actions) does not cascade to unrelated services (e.g. Git storage). GitHub's post-crisis architectural commitment.

Merge queue regression: a class of GitHub-specific incident where the merge queue processing pipeline fails, causing incorrect behaviour (such as the April 23 silent revert) or blocking PRs from merging. Merge queue regressions are particularly damaging because they violate a fundamental contract of version control: that a merge reported as successful is durable and accurately recorded.

This case is a plain-English retelling of publicly available engineering material.

