GitHub · Reliability · 17 May 2026
June 29, 2023, 17:39 UTC: GitHub engineers initiate a planned live failover test of their second Internet edge facility, built to eliminate a single point of failure and in service for roughly six months. Within seconds, instead of validating their redundancy, they have created an outage that cuts off GitHub for developers across parts of North America and South America.
- 32-minute outage
- 2-min detect-to-revert
- US East + South America
- Built Jan 2023, flaw undiscovered
- Also: 55-min Actions delays
- ~100M devs on the platform
The Story
- 32 min — Duration of the June 29 outage, caused not by an external attack or software bug but by GitHub's own validation test of infrastructure designed to prevent exactly this kind of outage
- 2 min — Time from alert firing to engineers reverting the failover and bringing the primary edge facility back online — fast human response, but the damage was done the moment the test ran
- 6 months — Approximate age of GitHub's second Internet edge facility when the test ran — built in January 2023 and actively routing production traffic since then, yet the configuration flaw was never discovered
- 55 min — Maximum delay GitHub Actions workflows experienced during the June 7 incident — a separate but equally instructive outage where one customer's pathological repository data starved the entire Git push queue
GitHub is the infrastructure underpinning virtually all software development on Earth. Over 100 million developers use it to store code, run CI/CD pipelines via GitHub Actions (GitHub's built-in automation platform, used by millions of teams to automatically build, test, and deploy software on every code push), and collaborate on projects ranging from weekend hobby projects to the world's most critical open-source software. When GitHub goes down, the entire software industry's ability to ship code grinds to a halt. This is not a theoretical concern — it is the reality GitHub's infrastructure team lives with every day. For years, the team had known about a single point of failure (a component whose failure causes the entire system to stop working — if there is only one of it and it breaks, everything that depends on it also breaks) in their network architecture at the Internet edge. The fix was a second Internet edge facility, completed in January 2023.
The second edge facility had been routing live production traffic since January, operating alongside the primary in a high availability (a system design where redundant components allow the service to continue operating even when one component fails, typically achieved by having a primary and one or more secondaries that can take over) architecture. Six months of real traffic without incident. The team's next step was logical and responsible: perform a live failover test — deliberately route all traffic to the secondary, as if the primary had failed — to verify the redundancy actually worked. June 29, 2023. The test began. The secondary facility could not function as a primary. GitHub went down.
"Unfortunately, during this failover test we inadvertently caused a production outage. The test exposed a network pathing configuration issue in the secondary side that prevented it from properly functioning as the primary facility."
GitHub Status Page, Incident gqx5l06jjxhp, June 29, 2023
The Hidden Configuration Flaw
The root cause was a network path configuration issue in the secondary Internet edge facility. The secondary had been designed to route traffic alongside the primary in a shared HA architecture (High Availability architecture: a design where two or more facilities share load simultaneously rather than one sitting idle as a hot standby, which keeps the secondary warm but means it must be able to handle full load on its own at any moment), but its network routing configuration was never validated for the scenario where it had to handle all traffic alone. This is the subtle trap of active-active HA: a facility can route its share of traffic (say, 50%) flawlessly for six months and still fail when asked to route 100%, because some of its network paths only work while the primary is present. The quoted incident report names only a network pathing configuration issue; BGP routes (Border Gateway Protocol routes: the routing table entries that tell the internet how to reach GitHub's network, which edge routers must announce correctly as GitHub's IP prefixes) are a plausible candidate. The facility was a co-pilot that had never practiced landing the plane alone.
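One way to picture the trap is to compare what the two facilities announce together with what the secondary announces alone. The sketch below is illustrative only: the prefix sets are documentation ranges rather than GitHub's real address space, `missing_when_solo` is a hypothetical helper, and the actual misconfiguration may have had nothing to do with missing prefixes.

```python
from ipaddress import ip_network

# Hypothetical data: prefixes each facility originates.
# These are documentation ranges, not GitHub's real announcements.
PRIMARY = {
    ip_network("192.0.2.0/24"),
    ip_network("198.51.100.0/24"),
    ip_network("203.0.113.0/24"),
}
SECONDARY = {
    ip_network("192.0.2.0/24"),
    ip_network("198.51.100.0/24"),
}


def missing_when_solo(required, announced_by_survivor):
    """Return the prefixes nobody announces if only the survivor remains."""
    return sorted(required - announced_by_survivor, key=str)


required = PRIMARY | SECONDARY

# While both facilities are up, every prefix is covered by someone.
shared_gap = missing_when_solo(required, PRIMARY | SECONDARY)

# With the primary withdrawn, the gap appears.
solo_gap = missing_when_solo(required, SECONDARY)

print("gap while sharing load:", shared_gap)
print("gap with secondary alone:", solo_gap)
```

Months of healthy shared operation correspond to the first check, which passes. Only the second check, which removes the primary from the picture, shows the hole.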
Problem
Failover Test Initiated
At 17:39 UTC on June 29, 2023, GitHub engineers begin a planned live validation of the secondary Internet edge facility by shifting all traffic away from the primary. Within seconds, parts of North America (especially the US East Coast) and South America begin experiencing connectivity failures to GitHub.
Cause
Secondary Cannot Function as Primary
The secondary facility has a network path configuration issue that was invisible while it shared load with the primary but becomes critical when it must handle all traffic alone. Border router reconvergence (the process by which BGP routers across the internet update their routing tables to reflect the new path to GitHub's network — a process that takes time and cannot be instantly reversed) cannot happen correctly because the secondary's own configuration is broken.
Solution
Revert in 2 Minutes
GitHub's monitoring fires immediately. Within two minutes of being alerted, engineers revert the failover change and bring the primary facility back online. The revert itself is fast — but once online, border routers across the internet need time to reconverge, meaning GitHub service is not instantly restored even after the primary is running.
Result
Fixed, Then Tested Better
The network path configuration issue in the secondary is corrected. GitHub commits to improved failover testing procedures that minimize customer impact — specifically, scheduling future tests in a way that reduces blast radius. The test that caused the outage was ironically the most valuable test the team ever ran: it found the flaw that would have caused a much longer, unplanned outage during a real emergency.
❌
The Reconvergence Penalty
Even after GitHub reverted the failover and the primary came back online, users could not immediately reach GitHub. The internet's BGP routing tables (global distributed routing databases maintained by thousands of autonomous networks — once updated to reflect a path change, they must propagate the reversal across every network that learned the new path) needed time to reconverge — to undo the routing changes that the failover had caused. This is the hidden cost of network-level failures: the fix is fast, the recovery is slow.
The June 7 Incident: A Different Kind of Queue Starvation
Three weeks before the failover outage, GitHub experienced a completely separate but equally instructive incident. On June 7 at 16:11 UTC, GitHub's internal job queue for processing Git pushes began experiencing increasing delays. The monitoring system alerted engineers after 19 minutes. Customers experienced GitHub Actions workflow delays of up to 55 minutes and pull requests that failed to reflect new commits. The root cause was a single customer pushing to a repository with a specific, unusual data shape, which caused the Git backend to throttle the processing jobs and slow them down. These slow jobs exhausted the worker pool (consumed all available concurrent job slots, leaving no capacity to process pushes from any other repository) that served all other users. One customer's pathological repository data silently starved the global Git push queue for nearly two and a half hours.
⚠️
Tenant Isolation in Shared Queues
The June 7 incident is a textbook case of the noisy neighbor problem (when one tenant or workload in a shared system consumes disproportionate resources, degrading performance for all other tenants) in a shared job queue. GitHub's fix, making the Git backend's throttle behavior fail faster and reducing the Git client timeout, keeps any single customer's workload from holding a worker slot indefinitely. The principle applies anywhere a shared queue serves diverse workloads.
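To see how one tenant can starve a shared pool, here is a small tick-based simulation. It is a toy model under stated assumptions: the `simulate` function, the tenant names, and the job durations are all hypothetical, and the per-tenant cap shown is a common isolation technique rather than the fix GitHub reported.

```python
from collections import Counter, deque


def simulate(jobs, workers, per_tenant_cap=None, ticks=10):
    """Tick-based model of a shared worker pool.

    jobs: list of (tenant, duration_in_ticks) in arrival order.
    Returns a Counter of jobs completed per tenant after `ticks` ticks.
    """
    queue = deque(jobs)
    running = []  # each entry is [tenant, remaining_ticks]
    done = Counter()
    for _ in range(ticks):
        active = Counter(tenant for tenant, _ in running)
        skipped = deque()
        # Fill free slots, skipping tenants that are already at their cap.
        while queue and len(running) < workers:
            tenant, duration = queue.popleft()
            if per_tenant_cap is not None and active[tenant] >= per_tenant_cap:
                skipped.append((tenant, duration))
                continue
            running.append([tenant, duration])
            active[tenant] += 1
        queue.extendleft(reversed(skipped))  # put skipped jobs back in order
        # Advance every running job by one tick and collect the finished ones.
        for job in running:
            job[1] -= 1
            if job[1] == 0:
                done[job[0]] += 1
        running = [job for job in running if job[1] > 0]
    return done


# Four very slow jobs from one tenant arrive ahead of six quick jobs.
jobs = [("tenant-x", 100)] * 4 + [(t, 1) for t in ("a", "b", "c", "a", "b", "c")]

uncapped = simulate(jobs, workers=4)
capped = simulate(jobs, workers=4, per_tenant_cap=2)
print("no isolation:", dict(uncapped))
print("cap of 2 per tenant:", dict(capped))
```

Without a cap, the slow jobs occupy all four slots and nothing else completes in the window. With a cap of two slots per tenant, the quick jobs from the other tenants drain within a few ticks.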
GitHub spent six months routing real traffic through a backup facility, and it took a deliberate test to discover the backup couldn't actually back anything up — which is the whole point of testing, just not quite how they planned.
TechLogStack — built at scale, broken in public, rebuilt by engineers
The Fix
Fixing the Failover Test Outage
The immediate fix for the June 29 outage was surgical and fast: engineers identified the network path configuration issue exposed by the failover test and corrected it in the secondary edge facility. But the more important fix was procedural: changing how future failover tests are designed and scheduled. A live failover test that takes GitHub offline for users on two continents is not a sustainable validation strategy. GitHub committed to scheduling tests in ways that minimize customer impact. The public commitment does not spell out the method, but likely options are phased traffic migration (moving a small percentage of traffic first), shadow testing (running the secondary in parallel and comparing its behavior against the primary without routing real user traffic to it) to identify configuration gaps before they cause outages, and off-peak timing to reduce the blast radius if something goes wrong. With the configuration corrected, the secondary should now be able to function as a primary.
```python
# Simplified model of a safer failover test strategy.
# Instead of "flip all traffic to secondary", use staged validation.
# shadow_test, traffic_shift, error_rate, monitor, rollback, alert_team,
# SYNTHETIC_TRAFFIC, ACCEPTABLE_THRESHOLD and EdgeFacility stand in for a
# team's own tooling and are not defined in this sketch.
from enum import Enum


class FailoverResult(Enum):
    SUCCESS = "success"
    ABORTED = "aborted"


def run_failover_validation(primary: "EdgeFacility", secondary: "EdgeFacility") -> FailoverResult:
    """
    Safe failover validation: verify the secondary can function as primary
    without causing a production outage.
    """
    # Step 1: Shadow test. Route 0% of real traffic and compare responses.
    # Checks routing and config WITHOUT touching user requests.
    shadow_result = shadow_test(secondary, sample_requests=SYNTHETIC_TRAFFIC)
    if not shadow_result.routes_correctly:
        # ✅ Caught here: no user impact
        alert_team("Secondary cannot route independently: config issue found")
        return FailoverResult.ABORTED

    # Step 2: Canary. Shift 1% of traffic to secondary, monitor error rates.
    with traffic_shift(secondary, percentage=1):
        if error_rate() > ACCEPTABLE_THRESHOLD:
            rollback()  # Instant revert, only 1% of users briefly affected
            return FailoverResult.ABORTED

    # Step 3: Gradual ramp: 10% → 25% → 50% → 100%
    # At each stage, verify secondary handles the load correctly
    for percentage in [10, 25, 50, 100]:
        with traffic_shift(secondary, percentage=percentage):
            # Monitor BGP convergence, latency, error rates
            health = monitor(duration_seconds=300)
            if not health.acceptable:
                rollback()  # Revert to last good state
                return FailoverResult.ABORTED

    # Step 4: Full failover validated; secondary proved capable
    return FailoverResult.SUCCESS


# The June 29 incident used the equivalent of jumping straight to the 100% stage.
# A broken secondary had no chance to be caught before users felt it.
```
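The staged strategy can also be written as a version that runs on its own, with the probes passed in as plain functions. This is a sketch under assumptions: `staged_failover`, `Outcome`, and the three scenario lambdas are hypothetical, and the error rates are made up to illustrate the control flow.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Sequence


class FailoverResult(Enum):
    SUCCESS = "success"
    ABORTED = "aborted"


@dataclass
class Outcome:
    result: FailoverResult
    max_exposure_pct: int  # largest share of real traffic sent to the secondary


def staged_failover(
    synthetic_ok: Callable[[], bool],
    error_rate_at: Callable[[int], float],
    stages: Sequence[int] = (1, 10, 25, 50, 100),
    threshold: float = 0.01,
) -> Outcome:
    """Run a synthetic probe first, then ramp real traffic stage by stage."""
    # Stage 0: synthetic probes only, so no real user traffic is exposed.
    if not synthetic_ok():
        return Outcome(FailoverResult.ABORTED, 0)
    exposure = 0
    for pct in stages:
        exposure = pct
        if error_rate_at(pct) > threshold:
            return Outcome(FailoverResult.ABORTED, exposure)
    return Outcome(FailoverResult.SUCCESS, exposure)


# Scenario 1: the synthetic solo-path probe finds the flaw.
caught_early = staged_failover(lambda: False, lambda pct: 0.0)

# Scenario 2: the probe misses it, and the flaw only shows with the primary gone.
caught_late = staged_failover(lambda: True, lambda pct: 1.0 if pct == 100 else 0.0)

# Scenario 3: the secondary is healthy at every stage.
healthy = staged_failover(lambda: True, lambda pct: 0.0)

for name, outcome in [("early", caught_early), ("late", caught_late), ("healthy", healthy)]:
    print(name, outcome.result.value, f"{outcome.max_exposure_pct}% exposed")
```

The second scenario is worth noting: a percentage ramp alone does not catch a flaw that only appears once the primary is fully withdrawn. In this model it is the synthetic solo-path probe that finds the problem with zero user exposure.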
ℹ️
The BGP Reconvergence Reality
When GitHub's primary facility came back online after the revert, engineers could not simply flip a switch and restore service. Border routers across the internet needed time to reconverge — each network that had learned the (broken) route to GitHub's secondary needed to update its routing tables back to the primary. This BGP propagation delay (the time it takes for routing updates to spread across the global internet's interconnected autonomous systems — typically seconds to minutes, depending on the number of hops and the speed of peering relationships) is unavoidable, which is why the outage lasted 32 minutes even though the fix itself took under 2 minutes.
The June 7 Git push queue fix was more technically nuanced. The Git backend's throttling behavior was changed to fail faster — instead of a throttled job slowly consuming a worker slot while retrying indefinitely, it now returns a failure quickly so the slot is released for another repository's work. The Git client timeout within the job was also reduced, preventing a hung upstream connection from holding a worker open. These two changes together mean a pathological repository data shape can no longer starve the shared worker pool. Additional observability (instrumentation that gives engineers visibility into system behavior — here, metrics that reveal when a single tenant is consuming disproportionate queue capacity, before it becomes a user-facing incident) improvements were added to reduce detection and diagnosis time for future incidents of this type.
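The fail-faster idea can be sketched with a timeout around each job. This is a minimal illustration, not GitHub's job system: `process_push` is a hypothetical coroutine, the pool is modeled as a single slot with a semaphore, and the delays are invented.

```python
import asyncio


async def process_push(repo: str, backend_delay: float, timeout: float) -> str:
    """Run one hypothetical push job, giving up after `timeout` seconds."""
    try:
        # asyncio.sleep stands in for a slow, throttled backend call.
        await asyncio.wait_for(asyncio.sleep(backend_delay), timeout=timeout)
        return f"{repo}: processed"
    except asyncio.TimeoutError:
        return f"{repo}: failed fast, slot released"


async def main():
    # The slow repository arrives first and would hold the slot for 5 seconds.
    jobs = [("slow-repo", 5.0), ("repo-a", 0.01), ("repo-b", 0.01)]
    slot = asyncio.Semaphore(1)  # a pool with a single worker slot

    async def worker(repo, delay):
        async with slot:
            return await process_push(repo, delay, timeout=0.2)

    return await asyncio.gather(*(worker(repo, delay) for repo, delay in jobs))


results = asyncio.run(main())
for line in results:
    print(line)
```

With the short timeout, the slow job gives up its slot after 0.2 seconds instead of 5, and the two quick jobs behind it complete almost immediately.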
✅
The Outage That Validated the Investment
GitHub's engineering team noted a pointed irony: the test that caused the outage was exactly the right test to run. Without it, the hidden configuration flaw would have remained undetected until a real infrastructure failure — at which point the outage would have been unplanned, potentially longer, and without the fast human revert that limited the June 29 impact to 32 minutes. A self-inflicted outage you can control is always better than a real one you cannot.
TWO INCIDENTS, ONE JUNE
June 2023 gave GitHub two distinct outage patterns in a single month. The June 7 incident (2h 28m) was caused by shared resource exhaustion: one customer's data starving a global queue. The June 29 incident (32 min) was caused by untested redundancy: infrastructure built for resilience that had never been validated as a solo primary. Both share a root: assumptions that were never tested in production conditions.
⚠️
The Hidden Cost of Active-Active HA
The secondary facility had routed live production traffic for six months without incident — because it was always operating alongside the primary, not instead of it. Active-active HA gives you a false signal of readiness. A facility that handles 50% of traffic when the primary is healthy has never been proven to handle 100% of traffic when the primary is gone. Failover capability must be explicitly validated at full load, not inferred from shared-load health.
The most important long-term fix was cultural: GitHub's team committed to making failover testing a regular practice, not a one-time event. Regular failover tests — scheduled with appropriate notice, designed to minimize blast radius, and run at off-peak times — are the only way to keep redundancy validated over time. Infrastructure drifts: routers get reconfigured, network policies change, and a facility that was a fully functional backup six months ago may not be today. Untested redundancy is not redundancy. It is the comforting fiction that your system is more resilient than it actually is.
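Drift of this kind can be made visible by fingerprinting the configuration that was last validated and comparing it with what is running now. The sketch assumes hypothetical config dictionaries and key names; real router configuration is far richer than this.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """Stable SHA-256 digest of a config, independent of key order."""
    canonical = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Hypothetical snapshot taken when the last failover test passed.
validated = {
    "announced_prefixes": ["192.0.2.0/24", "198.51.100.0/24"],
    "route_policy": "accept-all-traffic-solo",
    "health_check_path": "/status",
}

# Hypothetical current state, after someone changed a routing policy.
current = {
    "announced_prefixes": ["192.0.2.0/24", "198.51.100.0/24"],
    "route_policy": "accept-shared-traffic-only",
    "health_check_path": "/status",
}

drifted = fingerprint(validated) != fingerprint(current)
changed_keys = sorted(
    key for key in validated.keys() | current.keys()
    if validated.get(key) != current.get(key)
)

print("drift detected:", drifted)
print("changed keys:", changed_keys)
```

A check like this only flags that something changed since the last validated test. It does not replace the failover test itself.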
Architecture
GitHub's Internet edge architecture is the layer that connects the global internet to GitHub's internal infrastructure. Every request from every developer in the world, whether pushing code, pulling a repository, or triggering a GitHub Action, flows through an Internet edge facility. For years, this was a single point of failure: one facility, one set of border routers (network devices that connect GitHub's private network to the public internet, running BGP to announce GitHub's IP address ranges to the global routing table), and one path in from the internet. The second facility, completed in January 2023, was designed to eliminate this vulnerability. What the architecture diagrams did not capture was the specific network path configuration that would only become a problem when the secondary had to stand alone.
Before: Single Point of Failure at the Internet Edge
[Interactive diagram available on TechLogStack]
After: HA Architecture with Secondary Edge — and the Hidden Flaw
[Interactive diagram available on TechLogStack]
The architecture diagram shows the deceptive appearance of redundancy. Two edge facilities, both actively routing traffic, both connected to the same internal load balancer — it looks bulletproof. But the diagram does not show the network path configuration inside the secondary: the specific BGP advertisements (announcements that a network device makes to its peers describing which IP address ranges it can route to, along with the path information used to select the best route) that tell the global internet how to reach GitHub, and the internal routing rules that control traffic flow within the facility. When the secondary was asked to function as the primary during the failover test, those configurations were incorrect for the solo-primary role. The redundancy was a drawing on paper, not a tested fact.
🔄
Border Router Reconvergence: The Delay Nobody Talks About
When GitHub's primary facility came back online, the recovery was not instant. Every network on the internet that had updated its BGP routing table (a database maintained by every network router describing the best path to reach any IP address on the internet — changes propagate gradually from router to router as UPDATE messages are exchanged) to route via the broken secondary had to learn the new path to the primary. This propagation delay is inherent to how the internet works and is unavoidable once a failover has been initiated. It is one more reason to avoid unnecessary failover events — even a 2-minute fix can result in 30 minutes of degraded service.
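A toy model shows why propagation takes time: an update spreads from peer to peer, and each hop adds delay. Everything here is an assumption made for illustration, including the tiny topology, the network names, and the flat 30 seconds per hop. Real BGP convergence also involves path exploration and per-router timers, so this does not reproduce the real recovery time.

```python
from collections import deque

# Hypothetical AS-level topology: each network and the peers it talks to.
PEERS = {
    "github-edge": ["transit-1", "transit-2"],
    "transit-1": ["github-edge", "isp-a", "isp-b"],
    "transit-2": ["github-edge", "isp-b"],
    "isp-a": ["transit-1", "isp-c"],
    "isp-b": ["transit-1", "transit-2"],
    "isp-c": ["isp-a"],
}


def time_to_learn(origin, per_hop_seconds):
    """Breadth-first spread of one update; seconds until each network hears it."""
    learned = {origin: 0}
    queue = deque([origin])
    while queue:
        current = queue.popleft()
        for peer in PEERS[current]:
            if peer not in learned:
                learned[peer] = learned[current] + per_hop_seconds
                queue.append(peer)
    return learned


arrival = time_to_learn("github-edge", per_hop_seconds=30)
slowest = max(arrival.values())

for network, seconds in arrival.items():
    print(f"{network}: {seconds}s")
print("last network updated after", slowest, "seconds")
```

Even in this six-network sketch, the farthest network hears the correction three hops after the fix was made. The real internet has far more networks and more complicated update behavior.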
June 2023 GitHub incidents — two outages, two root causes, one shared theme
| Incident | Date | Duration | Root Cause | Fix |
|---|---|---|---|---|
| Git Push Queue Starvation | June 7 | 2h 28m | Single customer's pathological data shape throttled jobs, exhausting the shared worker pool | Fail-faster throttling, reduced Git client timeout |
| Failover Test Outage | June 29 | 32 min | Secondary edge facility had hidden network path config flaw that only manifested when operating solo | Fixed secondary config; improved failover test procedures |
| Common thread | Both | — | Assumptions about system behavior that were never validated under the actual failure conditions | Testing at the real failure boundary, not the assumed one |
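The timings quoted for both incidents can be laid out with a little date arithmetic. This assumes each duration is measured from the start time given in the article; the end times below are derived from those figures, not separately reported.

```python
from datetime import datetime, timedelta, timezone


def utc(month, day, hour, minute):
    """Build a 2023 UTC timestamp."""
    return datetime(2023, month, day, hour, minute, tzinfo=timezone.utc)


# June 7: delays begin 16:11 UTC, alert fires after 19 minutes, 2h 28m in total.
june7_start = utc(6, 7, 16, 11)
june7_alert = june7_start + timedelta(minutes=19)
june7_end = june7_start + timedelta(hours=2, minutes=28)

# June 29: test begins 17:39 UTC, 32 minutes of impact, revert within 2 minutes of the alert.
june29_start = utc(6, 29, 17, 39)
june29_end = june29_start + timedelta(minutes=32)

# Share of the June 29 outage that was not spent on the revert itself.
non_revert_share = 1 - timedelta(minutes=2) / timedelta(minutes=32)

print("June 7 alert:", june7_alert.strftime("%H:%M"), "end:", june7_end.strftime("%H:%M"))
print("June 29 end:", june29_end.strftime("%H:%M"))
print(f"June 29 time outside the revert: {non_revert_share:.0%}")
```

On these figures, the revert accounts for only a small slice of the June 29 outage. The rest is detection and recovery.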
Lessons
June 2023 gave GitHub — and the industry — two clean case studies in the same month. Neither outage was caused by a novel bug or an obscure race condition. Both were caused by things that look like good engineering on paper but hadn't been tested at the right failure boundary. These lessons apply to any team operating infrastructure with redundancy assumptions they have never validated.
- 01. Untested redundancy is not redundancy; it is a liability. GitHub's secondary edge facility shared live production traffic for six months without revealing the flaw that prevented it from functioning as a primary. Active-active HA (a high-availability design where both primary and secondary systems handle live traffic simultaneously; effective for load sharing, but it does not test failover capability unless one side is actually turned off) does not validate failover capability; it validates shared-load operation. Test your redundancy by actually removing the primary, not by observing the secondary under normal conditions.
- 02. Failover tests should be staged, not binary. Shifting 100% of traffic to an untested secondary in a single step is a high-stakes gamble: the revert may be quick, but every user is already exposed by the time it happens. Canary failovers (shifting 1%, then 10%, then 25%, validating at each stage before proceeding) can expose configuration issues before they cause full outages. The extra complexity of staged testing is small compared to the cost of a production outage discovered mid-test.
- 03. Reverting fast does not mean recovering fast. GitHub reverted the failover change in under 2 minutes, but the outage lasted 32 minutes because BGP reconvergence (the time for routing updates to propagate across the global internet after a path change; typically measured in minutes, and not under the control of the affected party once a change has been announced) takes time that the affected party cannot compress. Build your incident response timelines around recovery time, not just fix time.
- 04. Shared queues need tenant isolation to prevent noisy neighbor failures. The June 7 incident is a canonical example of one tenant's unusual workload consuming all of a shared resource. Design queue systems with per-tenant rate limits and fast-fail timeouts so that a single job never holds a worker slot long enough to starve the entire pool. The fix, making the Git backend's throttling fail faster and shortening the Git client timeout, is a small change that protects every other user from one user's edge case.
- 05. The test that breaks production is the most valuable test you ever run. GitHub's team made a pointed admission: without the June 29 failover test, the network path flaw would have remained hidden until a real infrastructure emergency forced a failover under far worse conditions — with no time to prepare, no clean revert path, and no certainty about what was broken. Deliberately probing your redundancy in a controlled environment, even at the cost of a brief outage, is the engineering equivalent of a fire drill: painful in the moment, essential in the long run.
THE IRONY THEOREM
The infrastructure designed to prevent an outage caused the outage. The test designed to validate resilience proved the resilience didn't exist. And the 32-minute disruption designed to be the worst case turned out to be far better than the real-emergency case it prevented. Sometimes the most constructive thing you can do for your reliability is schedule an outage before the universe schedules one for you.
ℹ️
Document Your Assumptions Before You Test Them
A pre-test checklist should include: what does the secondary need to do independently? Not just what load it handles, but what configuration it needs — BGP route advertisements, internal routing policies, health check endpoints, TLS certificates. Every assumption about how the secondary behaves when the primary is absent should be written down and verified before the test runs, not discovered by watching production users experience an outage.
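Such a checklist can be encoded so that the test is blocked until every assumption has been verified. This is a sketch with hypothetical names: `Assumption`, `unverified`, and the probes are invented for illustration, and the probes are hard-coded so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Assumption:
    description: str
    verify: Callable[[], bool]  # hypothetical probe; True means verified


def unverified(checklist: List[Assumption]) -> List[str]:
    """Return the descriptions of assumptions whose probe did not pass."""
    return [item.description for item in checklist if not item.verify()]


# In practice each probe would query real systems; here they are fixed values.
checklist = [
    Assumption("Secondary advertises every required BGP route on its own", lambda: False),
    Assumption("Internal routing policies work with the primary absent", lambda: True),
    Assumption("Health check endpoints respond via the secondary", lambda: True),
    Assumption("TLS certificates are present on the secondary", lambda: True),
]

blockers = unverified(checklist)
safe_to_test = not blockers

print("safe to run failover test:", safe_to_test)
for description in blockers:
    print("unverified:", description)
```

The value is less in the code than in the discipline: every item has to be written down and given a probe before the test is allowed to start.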
GitHub built a backup facility, routed real traffic through it for six months, and then discovered it was a backup that couldn't actually back anything up — all of which was fine, because they found out during a test instead of a Thursday morning at 3am.
This case is a plain-English retelling of publicly available engineering material.
Read the full case on TechLogStack → (interactive diagrams, source links, and the full reader experience).