- 32-minute outage from a failover test of infrastructure built to prevent outages
- 2 minutes detect-to-revert — fast fix, slow recovery due to BGP reconvergence
- 6 months of live production traffic through the secondary facility; the flaw was never visible
- Secondary routed 50% of traffic flawlessly — and failed completely when asked to route 100%
- Also: June 7 incident — one customer's repository data starved the global Git push queue for 2h28m
- Both June 2023 incidents share one root: assumptions never tested under actual failure conditions
June 29, 2023, 17:39 UTC: GitHub engineers initiate a planned live failover test of their second Internet edge facility, which had carried live traffic for six months and was built to eliminate a single point of failure. Within seconds, instead of validating their redundancy, they have caused an outage that cuts off GitHub for developers in parts of North America and South America.
The Story
Unfortunately, during this failover test we inadvertently caused a production outage. The test exposed a network pathing configuration issue in the secondary side that prevented it from properly functioning as the primary facility.
— GitHub Status Page, Incident gqx5l06jjxhp, June 29 2023
The second Internet edge facility had been routing live production traffic since January 2023, operating alongside the primary in a high availability architecture, a system design in which redundant components allow the service to keep operating when one component fails. Six months of real traffic passed without incident. The team's next step was logical and responsible: perform a live failover test, deliberately routing all traffic to the secondary as if the primary had failed, to verify that the redundancy actually worked. The test began. The secondary facility could not function as a primary. GitHub went down.
The root cause was a network path configuration issue in the secondary facility. The secondary had been designed to route traffic alongside the primary in an active-active HA architecture, in which both facilities handle live traffic simultaneously. That keeps the secondary warm, but it also means the secondary must be able to carry full load on its own at any moment, and its routing configuration had never been validated for that scenario. This is the subtle trap of active-active HA: a facility can route 50% of traffic flawlessly for six months and still fail when asked to route 100%, because some of its internal network paths, BGP routes in particular, were only configured to work while the primary was present. BGP (Border Gateway Protocol) routes are the routing table entries that tell the internet how to reach GitHub's network, and edge routers must be configured correctly to announce GitHub's IP prefixes to the global internet. The facility was a co-pilot that had never practised landing the plane alone.
Problem
Failover Test Initiated — Secondary Cannot Function as Primary
At 17:39 UTC on June 29, GitHub engineers shift all traffic to the secondary facility. Within seconds, parts of North America and South America begin experiencing connectivity failures. The secondary's network path configuration is broken for the solo-primary role.
Cause
Active-Active HA Does Not Validate Failover Capability
The secondary had been routing 50% of live traffic for six months. This gave a false signal of readiness — a facility that handles 50% of traffic when the primary is healthy has never been proven to handle 100% of traffic when the primary is gone. Failover capability must be explicitly validated at full load.
Solution
Revert in 2 Minutes, Fix the Config, Test Better
GitHub's monitoring fired immediately. Within two minutes, engineers reverted the failover and brought the primary back online. The network path configuration issue in the secondary was corrected. GitHub committed to staged failover testing procedures that minimize customer impact.
Result
Fixed, Then Tested Better
The secondary facility was fixed and is now genuinely capable of functioning as a primary. The test that caused the outage was ironically the most valuable test the team ever ran: it found the flaw that would have caused a much longer, unplanned outage during a real emergency.
The Fix
Two Incidents, Two Fixes, One Shared Theme
June 29 fix: staged failover testing procedures that validate the secondary's solo-primary capability before all traffic is shifted onto it. June 7 fix: Git backend throttle behaviour changed to fail faster, preventing a single customer's pathological repository data from holding worker slots indefinitely.
- 32 min — total outage duration; 2-minute fix stretched to 32 minutes by BGP reconvergence
- 2 min — detect-to-revert; fast human response, but the damage was done the moment the test ran
- 6 months — age of the secondary facility; routing live traffic without revealing the flaw because the flaw only appeared under solo-primary conditions
- 55 min — maximum GitHub Actions workflow delay during the separate June 7 incident
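The gap between the first two figures is simple arithmetic, sketched here in Python. The function and variable names are illustrative, and the split assumes detection was effectively immediate, as the narrative states; it is an inference from the numbers above rather than a minute-by-minute timeline published by GitHub.

```python
# Figures taken from the list above. The breakdown is an inference,
# not a timeline published by GitHub.
TOTAL_OUTAGE_MIN = 32      # total customer-visible outage
DETECT_TO_REVERT_MIN = 2   # time from detection to revert


def recovery_tail(total_min: int, fix_min: int) -> int:
    """Minutes of outage that remained after the fix had been applied."""
    return total_min - fix_min


tail = recovery_tail(TOTAL_OUTAGE_MIN, DETECT_TO_REVERT_MIN)
tail_share = tail / TOTAL_OUTAGE_MIN

print(f"Recovery tail after revert: {tail} min")
print(f"Share of the outage spent waiting on recovery: {tail_share:.0%}")
```

Under that assumption, roughly 30 of the 32 minutes fell after the revert, which is why the fix time alone says little about the customer-visible duration.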
```python
# Simplified model of a safer failover test strategy.
# Instead of "flip all traffic to secondary", use staged validation.
# Illustrative sketch: EdgeFacility, FailoverResult, shadow_test, traffic_shift,
# error_rate, monitor, rollback, alert_team and the constants are placeholders
# that are not defined here.

def run_failover_validation(primary: "EdgeFacility", secondary: "EdgeFacility"):
    """
    Safe failover validation: verify the secondary can function as primary
    without causing a production outage.
    """
    # Step 1: Shadow test. Route 0% of real traffic and compare responses.
    # Catches routing and config issues WITHOUT touching user requests.
    shadow_result = shadow_test(secondary, sample_requests=SYNTHETIC_TRAFFIC)
    if not shadow_result.routes_correctly:
        alert_team("Secondary cannot route independently: config issue found")
        return FailoverResult.ABORTED  # Caught here, no user impact

    # Step 2: Canary. Shift 1% of traffic to secondary, monitor error rates.
    with traffic_shift(secondary, percentage=1):
        if error_rate() > ACCEPTABLE_THRESHOLD:
            rollback()  # Only 1% of users briefly affected
            return FailoverResult.ABORTED

    # Step 3: Gradual ramp through 10%, 25%, 50%, 100%.
    for percentage in [10, 25, 50, 100]:
        with traffic_shift(secondary, percentage=percentage):
            health = monitor(duration_seconds=300)
            if not health.acceptable:
                rollback()
                return FailoverResult.ABORTED

    return FailoverResult.SUCCESS

# The June 29 incident used the equivalent of jumping straight to 100%.
# A broken secondary had no chance to be caught before users felt it.
```
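For readers who want something they can run, here is a toy, self-contained version of the same staged idea. Everything in it is hypothetical: `SimulatedSecondary`, its `breaks_above` threshold, the 50% baseline, and the stage list are invented for illustration and say nothing about how GitHub's edge actually behaves. It is a sketch of the technique, not a reconstruction of GitHub's tooling.

```python
from dataclasses import dataclass
from enum import Enum


class FailoverResult(Enum):
    SUCCESS = "success"
    ABORTED = "aborted"


@dataclass
class SimulatedSecondary:
    """Toy facility that works until its traffic share exceeds `breaks_above`."""
    breaks_above: int  # percent; hypothetical point where the hidden flaw appears
    share: int = 50    # percent of traffic carried in normal active-active mode

    def error_rate(self) -> float:
        # All requests fail once the facility is pushed past its hidden limit.
        return 1.0 if self.share > self.breaks_above else 0.0


def staged_failover(secondary, stages=(60, 75, 90, 100), max_error_rate=0.01):
    """Ramp traffic stage by stage and roll back at the first unhealthy stage.

    Returns (result, share_at_which_the_run_ended).
    """
    baseline = secondary.share
    for stage in stages:
        secondary.share = stage
        if secondary.error_rate() > max_error_rate:
            secondary.share = baseline  # roll back to the known-good split
            return FailoverResult.ABORTED, stage
    return FailoverResult.SUCCESS, secondary.share


# A flaw that shows up just above the usual share is caught at the first stage.
print(staged_failover(SimulatedSecondary(breaks_above=50)))
# A flaw that only appears when the secondary stands alone is caught at 100%.
print(staged_failover(SimulatedSecondary(breaks_above=99)))
```

The second call illustrates a limit of ramping by itself: if the flaw only manifests when the primary is fully absent, the ramp still reaches 100% before it trips. That is the case the shadow-test step is meant to cover.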
The two June 2023 incidents:
| Incident | Date | Duration | Root Cause | Fix |
|---|---|---|---|---|
| Git Push Queue Starvation | June 7 | 2h 28m | One customer's pathological data shape throttled jobs, exhausting the shared worker pool | Fail-faster throttling, reduced Git client timeout |
| Failover Test Outage | June 29 | 32 min | Secondary edge facility had hidden network path config flaw that only manifested when operating solo | Fixed secondary config; staged failover test procedures |
| Common thread | Both | — | Assumptions about system behaviour never validated under actual failure conditions | Testing at the real failure boundary, not the assumed one |
The hidden cost of active-active HA
The secondary facility had routed live production traffic for six months without incident — because it was always operating alongside the primary, not instead of it. Active-active HA gives you a false signal of readiness. A facility that handles 50% of traffic when the primary is healthy has never been proven to handle 100% of traffic when the primary is gone. Failover capability must be explicitly validated at full load, not inferred from shared-load health.
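One way to make "validated alone" concrete is to model the network as a graph and ask whether traffic entering the secondary can still reach the services once the primary is removed. The sketch below does that with a breadth-first search. The topology, node names, and the hidden dependency are invented for illustration; GitHub has not published the details of the actual path configuration.

```python
from collections import deque


def reachable(links, src, dst, removed=frozenset()):
    """Breadth-first search over `links`, ignoring every node in `removed`."""
    if src in removed or dst in removed:
        return False
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in links.get(node, ()):
            if nxt not in removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical topology: the secondary forwards inbound traffic over a path
# that happens to transit the primary. Invisible while both facilities are up.
LINKS = {
    "internet": ["primary_edge", "secondary_edge"],
    "primary_edge": ["core"],
    "secondary_edge": ["primary_edge"],  # the hidden dependency
    "core": ["github_services"],
}

shared_ok = reachable(LINKS, "secondary_edge", "github_services")
solo_ok = reachable(LINKS, "secondary_edge", "github_services",
                    removed={"primary_edge"})
print(f"shared-load path works: {shared_ok}, solo path works: {solo_ok}")
```

In this toy model the shared-load check passes and the solo check fails, which is the shape of the false readiness signal described above: healthy traffic in active-active mode proves nothing about paths that silently depend on the primary.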
Document your assumptions before you test them
A pre-test checklist should include: what does the secondary need to do independently? Not just what load it handles, but what configuration it needs — BGP route advertisements, internal routing policies, health check endpoints, TLS certificates. Every assumption about how the secondary behaves when the primary is absent should be written down and verified before the test runs, not discovered by watching production users experience an outage.
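A checklist like this can be kept as data, with each assumption paired with a check that must pass before the test is allowed to run. The following is a minimal sketch under stated assumptions: the `Assumption` type, the `unverified` helper, and the lambda checks are placeholders. Real checks would query routers, probes, and certificate stores, none of which are modelled here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Assumption:
    description: str
    verify: Callable[[], bool]  # returns True when the assumption holds


def unverified(assumptions: Iterable[Assumption]) -> List[str]:
    """Return descriptions of assumptions whose check fails or raises."""
    failures = []
    for assumption in assumptions:
        try:
            ok = assumption.verify()
        except Exception:
            ok = False  # a check that cannot run counts as unverified
        if not ok:
            failures.append(assumption.description)
    return failures


# Placeholder checks, hard-coded for the example.
checklist = [
    Assumption("Secondary advertises all BGP routes on its own", lambda: False),
    Assumption("Internal routing policies do not depend on the primary", lambda: True),
    Assumption("Health check endpoints respond via the secondary alone", lambda: True),
    Assumption("TLS certificates are present and valid on the secondary", lambda: True),
]

blockers = unverified(checklist)
if blockers:
    print("Do not run the failover test yet:")
    for item in blockers:
        print(" -", item)
```

Treating a check that raises as a failure is a deliberate choice in this sketch: an assumption that cannot be verified is no safer than one known to be false.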
Architecture
GitHub's Internet edge architecture is the layer that connects the global internet to GitHub's internal infrastructure. Every request from every developer — pushing code, pulling a repository, triggering a GitHub Action — flows through an Internet edge facility. For years, this was a single point of failure (a component whose failure causes the entire system to stop working). The second facility, completed in January 2023, was designed to eliminate this vulnerability. What the architecture diagrams did not capture was the specific network path configuration that would only become a problem when the secondary had to stand alone.
Before: Single Point of Failure at the Internet Edge
Interactive diagram available on TechLogStack.
After: HA Architecture with Secondary Edge — and the Hidden Flaw
Interactive diagram available on TechLogStack.
Lessons
Untested redundancy is not redundancy — it is a liability. GitHub's secondary facility routed 50% of live production traffic for six months without revealing the flaw that prevented it from functioning as a primary. Active-active HA does not validate failover capability; it validates shared-load operation. Test your redundancy by actually removing the primary, not by observing the secondary under normal conditions.
Failover tests should be staged, not binary. Shifting 100% of traffic to an untested secondary in a single step is a high-stakes gamble with no intermediate checkpoint at which to abort. Canary failovers, which shift 1%, then 10%, then 25% and validate at each stage, can expose configuration issues before they cause full outages.
Reverting fast does not mean recovering fast. GitHub reverted in under 2 minutes, but the outage lasted 32 minutes because of BGP reconvergence: the time routing updates take to propagate across the global internet after a path change. Once a change has been initiated, that propagation is outside the affected party's control and cannot be engineered away during the incident. Build your incident response timelines around recovery time, not just fix time.
Shared queues need tenant isolation to prevent noisy neighbour failures. The June 7 incident is a canonical example of one tenant's unusual workload consuming all of a shared resource. Design queue systems with per-tenant rate limits and fast-fail timeouts so that a single job never holds a worker slot long enough to starve the entire pool.
The test that breaks production is the most valuable test you ever run. Without the June 29 failover test, the network path flaw would have remained hidden until a real infrastructure emergency forced a failover under far worse conditions: no time to prepare, no clean revert path, and no certainty about what was broken. Deliberately probing your redundancy under controlled conditions, even at the cost of a brief outage, is the engineering equivalent of a fire drill.
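The shared-queue lesson above can be made concrete with a per-tenant slot limiter that rejects excess work immediately instead of letting it wait. This is a minimal sketch of the general pattern, not GitHub's actual fix (which the source describes only as fail-faster throttling and a reduced Git client timeout). `TenantLimiter`, the tenant names, and the slot count are hypothetical.

```python
import threading
from collections import defaultdict


class TenantLimiter:
    """Caps concurrent worker slots per tenant and rejects excess immediately."""

    def __init__(self, max_slots_per_tenant: int):
        self._max = max_slots_per_tenant
        self._in_use = defaultdict(int)
        self._lock = threading.Lock()

    def try_acquire(self, tenant: str) -> bool:
        # Fail fast: never block waiting for a slot to free up.
        with self._lock:
            if self._in_use[tenant] >= self._max:
                return False
            self._in_use[tenant] += 1
            return True

    def release(self, tenant: str) -> None:
        with self._lock:
            if self._in_use[tenant] > 0:
                self._in_use[tenant] -= 1


limiter = TenantLimiter(max_slots_per_tenant=2)

# One tenant submits five jobs at once; only two get slots.
noisy = [limiter.try_acquire("tenant-a") for _ in range(5)]
# A different tenant is unaffected by tenant-a hitting its cap.
other = limiter.try_acquire("tenant-b")
print(noisy, other)
```

Because each tenant has its own cap, one tenant's slow or stuck jobs can hold at most its own slots, and rejected work returns immediately so the caller can retry or surface an error rather than queueing behind the stuck jobs.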
Engineering Glossary
Active-active HA — a high-availability architecture where both primary and secondary systems handle live traffic simultaneously. Effective for load sharing but does not test failover capability — a secondary that handles 50% of traffic when the primary is healthy has never been proven to handle 100% when the primary is gone.
BGP (Border Gateway Protocol) — the routing protocol that governs how routers across the internet exchange information about which networks they can reach. BGP routes determine how internet traffic finds GitHub's network. A BGP misconfiguration in the secondary facility prevented it from announcing GitHub's routes correctly when operating as primary.
BGP reconvergence — the process by which BGP routers across the internet update their routing tables to reflect a network change. Takes seconds to minutes to propagate globally. Once a failover event has been initiated and BGP routes updated, reconvergence after revert is unavoidable — which is why a 2-minute fix produced a 32-minute outage.
Noisy neighbour problem — when one tenant or workload in a shared system consumes disproportionate resources, degrading performance for all other tenants. The June 7 incident: one customer's pathological repository data shape held all worker pool slots, starving every other repository's push processing.
Shadow testing — running the secondary infrastructure in parallel and comparing its behaviour against the primary without routing real user traffic to it. Enables detection of configuration gaps before they cause outages. The safer alternative to live failover testing at full load.
Single point of failure — a component whose failure causes the entire system to stop working. GitHub's original single Internet edge facility was a single point of failure; the second facility was built to eliminate it.
Worker pool exhaustion — the condition where all available concurrent job processing slots are consumed by slow or stuck jobs, leaving no capacity to process any other work. The June 7 root cause: throttled jobs held slots indefinitely rather than failing fast and releasing them.
This case is a plain-English retelling of publicly available engineering material.
TechLogStack — built at scale, broken in public, rebuilt by engineers.