TechLogStack

Posted on May 20 • Originally published at techlogstack.com on May 17

The Test That Broke GitHub: A Failover Drill Goes Live

#devops #reliability #architecture #webdev

32-minute outage from a failover test of infrastructure built to prevent outages
2 minutes detect-to-revert — fast fix, slow recovery due to BGP reconvergence
6 months of live production traffic through the secondary facility; the flaw was never visible
Secondary routed 50% of traffic flawlessly — and failed completely when asked to route 100%
Also: June 7 incident — one customer's repository data starved the global Git push queue for 2h28m
Both June 2023 incidents share one root: assumptions never tested under actual failure conditions

June 29, 2023, 17:39 UTC: GitHub engineers initiate a planned live failover test of their second Internet edge facility, which has carried production traffic for six months and was built to eliminate a single point of failure. Within seconds, instead of validating their redundancy, they have created an outage that cuts off GitHub for developers in parts of North America and South America.

The Story

Unfortunately, during this failover test we inadvertently caused a production outage. The test exposed a network pathing configuration issue in the secondary side that prevented it from properly functioning as the primary facility.

— GitHub Status Page, Incident gqx5l06jjxhp, June 29 2023

The second Internet edge facility had been routing live production traffic since January 2023, operating alongside the primary in a high availability (a system design where redundant components allow the service to continue operating even when one component fails) architecture. Six months of real traffic without incident. The team's next step was logical and responsible: perform a live failover test — deliberately route all traffic to the secondary, as if the primary had failed — to verify the redundancy actually worked. The test began. The secondary facility could not function as a primary. GitHub went down.

The root cause was a network path configuration issue in the secondary facility. The secondary had been designed to route traffic alongside the primary in a shared active-active HA architecture (a design where both primary and secondary systems handle live traffic simultaneously, which keeps the secondary warm but means it must be able to carry full load on its own at any moment). Its network routing configuration, however, was never validated for the scenario where it had to handle all traffic alone. This is the subtle trap of active-active HA: a facility can route 50% of traffic flawlessly for six months and still fail when asked to route 100%, because some of its internal network paths were only configured to work while the primary was present. BGP routes in particular (Border Gateway Protocol routes: the routing table entries that tell the internet how to reach GitHub's network, announced by edge routers as GitHub's IP prefixes) fell into this category. The facility was a co-pilot that had never practised landing the plane alone.

The BGP Reconvergence Penalty

Even after GitHub reverted the failover and the primary came back online, users could not immediately reach GitHub. The internet's BGP routing tables (global distributed routing databases maintained by thousands of autonomous networks) needed time to reconverge, that is, to undo the routing changes the failover had caused. This BGP propagation delay (the time for routing updates to spread across the global internet, typically seconds to minutes, though here the tail ran to roughly 30 minutes) is unavoidable once a failover has been initiated. The fix took 2 minutes. The recovery took 32 minutes. Build your incident response timelines around recovery time, not just fix time.
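The gap between fix time and recovery time is easy to make explicit with timestamp arithmetic. The sketch below is illustrative only: the 17:39 UTC start comes from the incident report, while the revert and restore timestamps are assumptions derived from the "2 minutes" and "32 minutes" figures quoted above, and the variable names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Start time is from the incident report. The other two timestamps are
# ASSUMPTIONS derived from the "2 minutes" and "32 minutes" figures.
test_started = datetime(2023, 6, 29, 17, 39, tzinfo=timezone.utc)
revert_done = test_started + timedelta(minutes=2)
service_restored = test_started + timedelta(minutes=32)

time_to_fix = revert_done - test_started
time_to_recover = service_restored - test_started
reconvergence_tail = service_restored - revert_done


def minutes(delta: timedelta) -> int:
    """Convert a timedelta to whole minutes."""
    return int(delta.total_seconds() // 60)


print(f"fix: {minutes(time_to_fix)} min")                       # fix: 2 min
print(f"recovery: {minutes(time_to_recover)} min")              # recovery: 32 min
print(f"tail after revert: {minutes(reconvergence_tail)} min")  # tail after revert: 30 min
# Dividing two timedeltas yields a plain float ratio.
print(f"share of outage after the fix: {reconvergence_tail / time_to_recover:.0%}")
```

Under these assumptions, roughly 94% of the outage happened after the fix was already in place, which is why a recovery-time objective is the more honest number to plan around.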

Problem

Failover Test Initiated — Secondary Cannot Function as Primary

At 17:39 UTC on June 29, GitHub engineers shift all traffic to the secondary facility. Within seconds, parts of North America and South America begin experiencing connectivity failures. The secondary's network path configuration is broken for the solo-primary role.

Cause

Active-Active HA Does Not Validate Failover Capability

The secondary had been routing 50% of live traffic for six months. This gave a false signal of readiness — a facility that handles 50% of traffic when the primary is healthy has never been proven to handle 100% of traffic when the primary is gone. Failover capability must be explicitly validated at full load.

Solution

Revert in 2 Minutes, Fix the Config, Test Better

GitHub's monitoring fired immediately. Within two minutes, engineers reverted the failover and brought the primary back online. The network path configuration issue in the secondary was corrected. GitHub committed to staged failover testing procedures that minimize customer impact.

Result

Fixed, Then Tested Better

The secondary facility was fixed and is now genuinely capable of functioning as a primary. The test that caused the outage was ironically the most valuable test the team ever ran: it found the flaw that would have caused a much longer, unplanned outage during a real emergency.

The Fix

Two Incidents, Two Fixes, One Shared Theme

June 29 fix: staged failover testing procedures that validate the secondary's solo-primary capability before all traffic is shifted to it. June 7 fix: Git backend throttle behaviour changed to fail faster, preventing a single customer's pathological repository data from holding worker slots indefinitely.

32 min — total outage duration; 2-minute fix stretched to 32 minutes by BGP reconvergence
2 min — detect-to-revert; fast human response, but the damage was done the moment the test ran
6 months — age of the secondary facility; routing live traffic without revealing the flaw because the flaw only appeared under solo-primary conditions
55 min — maximum GitHub Actions workflow delay during the separate June 7 incident

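# Simplified model of a safer failover test strategy.
# Instead of "flip all traffic to secondary", use staged validation.
# EdgeFacility and every helper below are hypothetical stand-ins, not GitHub tooling.
from contextlib import contextmanager
from dataclasses import dataclass
from enum import Enum

ACCEPTABLE_THRESHOLD = 0.01  # illustrative maximum error rate


class FailoverResult(Enum):
    SUCCESS = "success"
    ABORTED = "aborted"


@dataclass
class EdgeFacility:
    name: str
    routes_alone: bool = True  # can it route correctly with no primary present?
    traffic_percent: int = 0

    def shadow_test(self) -> bool:
        """Replay synthetic traffic only; no user requests are touched."""
        return self.routes_alone

    def error_rate(self) -> float:
        """Stub metric; a real version would observe for several minutes."""
        return 1.0 if self.traffic_percent == 100 and not self.routes_alone else 0.0


@contextmanager
def traffic_shift(facility: EdgeFacility, percentage: int):
    """Shift traffic for the duration of the block; restore it on exit."""
    previous, facility.traffic_percent = facility.traffic_percent, percentage
    try:
        yield
    finally:
        facility.traffic_percent = previous  # acts as the rollback


def run_failover_validation(secondary: EdgeFacility) -> FailoverResult:
    """
    Safe failover validation: verify the secondary can function as primary
    without causing a production outage.
    """
    # Step 1: Shadow test. Route 0% of real traffic, check routing behaviour.
    # Catches routing and config issues WITHOUT touching user requests.
    if not secondary.shadow_test():
        print("Secondary cannot route independently: config issue found")
        return FailoverResult.ABORTED  # Caught here, no user impact

    # Step 2: Canary at 1%. Step 3: Gradual ramp 10% -> 25% -> 50% -> 100%.
    for percentage in [1, 10, 25, 50, 100]:
        with traffic_shift(secondary, percentage):
            if secondary.error_rate() > ACCEPTABLE_THRESHOLD:
                return FailoverResult.ABORTED  # leaving the block rolls back
    return FailoverResult.SUCCESS

# The June 29 incident used the equivalent of jumping straight to 100%.
# A broken secondary had no chance to be caught before users felt it.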

The June 7 Incident: A Different Kind of Queue Starvation

Three weeks before the failover outage, a single customer pushed to a repository with a specific, unusual data shape — a shape that caused the Git backend to throttle the processing jobs, making them slow. These slow jobs exhausted the worker pool (consumed all available concurrent job slots, leaving no capacity for any other repository) that served all users. One customer's pathological repository data silently starved the global Git push queue for nearly 2.5 hours — causing GitHub Actions delays of up to 55 minutes. The fix: make the Git backend throttle behaviour fail faster, releasing worker slots quickly rather than holding them while retrying indefinitely.
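The effect of failing fast can be seen in a toy tick-based simulation of a shared FIFO worker pool. This is a sketch under stated assumptions, not GitHub's implementation: the pool size, job durations, the `max_hold` timeout, and the `simulate` function itself are all hypothetical, and throttling and retries are reduced to "a job holds a slot for N ticks".

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


def simulate(jobs: List[Tuple[str, int]], slots: int,
             max_hold: Optional[int] = None) -> Dict[str, int]:
    """Run a FIFO worker pool tick by tick and report when each job started.

    jobs: (name, duration_in_ticks) pairs, all queued at tick 0.
    max_hold: if set, a job is failed and its slot released after this many ticks.
    """
    queue = deque(jobs)
    running: List[int] = []          # remaining ticks for each occupied slot
    started_at: Dict[str, int] = {}
    tick = 0
    while queue or running:
        # Fill free slots from the head of the queue.
        while queue and len(running) < slots:
            name, duration = queue.popleft()
            hold = duration if max_hold is None else min(duration, max_hold)
            running.append(hold)
            started_at[name] = tick
        # Advance one tick and release finished (or timed-out) slots.
        tick += 1
        running = [t - 1 for t in running if t > 1]
    return started_at


# Two pathological jobs arrive first and fill both slots (hypothetical values).
jobs = [("slow-1", 100), ("slow-2", 100), ("normal", 1)]

print(simulate(jobs, slots=2)["normal"])              # 100: starved behind slow jobs
print(simulate(jobs, slots=2, max_hold=5)["normal"])  # 5: slots released early
```

The slow jobs still fail in the second run; the point is that they fail quickly, so unrelated work is delayed by the timeout rather than by the full duration of the pathological job.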

The two June 2023 incidents:

Incident	Date	Duration	Root Cause	Fix
Git Push Queue Starvation	June 7	2h 28m	One customer's pathological data shape throttled jobs, exhausting the shared worker pool	Fail-faster throttling, reduced Git client timeout
Failover Test Outage	June 29	32 min	Secondary edge facility had hidden network path config flaw that only manifested when operating solo	Fixed secondary config; staged failover test procedures
Common thread	Both	—	Assumptions about system behaviour never validated under actual failure conditions	Testing at the real failure boundary, not the assumed one

The hidden cost of active-active HA

The secondary facility had routed live production traffic for six months without incident — because it was always operating alongside the primary, not instead of it. Active-active HA gives you a false signal of readiness. A facility that handles 50% of traffic when the primary is healthy has never been proven to handle 100% of traffic when the primary is gone. Failover capability must be explicitly validated at full load, not inferred from shared-load health.

Document your assumptions before you test them

A pre-test checklist should include: what does the secondary need to do independently? Not just what load it handles, but what configuration it needs — BGP route advertisements, internal routing policies, health check endpoints, TLS certificates. Every assumption about how the secondary behaves when the primary is absent should be written down and verified before the test runs, not discovered by watching production users experience an outage.
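One way to keep such a checklist honest is to store each assumption next to the check that verifies it, and refuse to start the test while any check fails. The sketch below is a minimal illustration: the `Assumption` type, the `unverified` helper, and the placeholder lambdas are hypothetical, and each lambda stands in for a real probe that would have to be written for the environment in question.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Assumption:
    description: str
    verify: Callable[[], bool]  # returns True when the assumption holds


def unverified(assumptions: List[Assumption]) -> List[str]:
    """Return the description of every assumption whose check fails or raises."""
    failures = []
    for item in assumptions:
        try:
            ok = item.verify()
        except Exception:
            ok = False  # a check that cannot run counts as unverified
        if not ok:
            failures.append(item.description)
    return failures


# Placeholder checks: each lambda stands in for a real probe.
checklist = [
    Assumption("Secondary advertises every public prefix on its own", lambda: False),
    Assumption("Internal routing policies work with the primary withdrawn", lambda: True),
    Assumption("Health check endpoints answer via the secondary path", lambda: True),
    Assumption("TLS certificates are served from the secondary", lambda: True),
]

blockers = unverified(checklist)
if blockers:
    print("Do not run the failover test yet:")
    for description in blockers:
        print(f"  - {description}")
```

Treating a check that raises as a failure is deliberate: an assumption nobody can verify is exactly the kind that surfaces in production.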

Architecture

GitHub's Internet edge architecture is the layer that connects the global internet to GitHub's internal infrastructure. Every request from every developer — pushing code, pulling a repository, triggering a GitHub Action — flows through an Internet edge facility. For years, this was a single point of failure (a component whose failure causes the entire system to stop working). The second facility, completed in January 2023, was designed to eliminate this vulnerability. What the architecture diagrams did not capture was the specific network path configuration that would only become a problem when the secondary had to stand alone.

Before: Single Point of Failure at the Internet Edge

View interactive diagram on TechLogStack →

After: HA Architecture with Secondary Edge — and the Hidden Flaw

View interactive diagram on TechLogStack →

Why the Architecture Looked Bulletproof — But Wasn't

Two edge facilities, both actively routing traffic, both connected to the same internal load balancer — it looks bulletproof. But the diagram does not show the network path configuration inside the secondary: the specific BGP advertisements (announcements a network device makes to peers describing which IP address ranges it can route to) that tell the global internet how to reach GitHub, and the internal routing rules that control traffic flow within the facility. When the secondary was asked to function as the primary, those configurations were incorrect for the solo-primary role. The redundancy was a drawing on paper, not a tested fact.
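A solo-primary check of this kind can be approximated offline by asking whether the secondary's own announcements cover every prefix the service needs. The sketch below is a simplification under stated assumptions: the precise flaw in GitHub's configuration is not public in this detail, the prefixes are reserved documentation ranges rather than real ones, and `missing_when_solo` is a hypothetical helper built on Python's standard `ipaddress` module.

```python
from ipaddress import ip_network
from typing import Iterable, List


def missing_when_solo(required: Iterable[str],
                      advertised_by_secondary: Iterable[str]) -> List[str]:
    """Return required prefixes not covered by anything the secondary announces."""
    announced = [ip_network(p) for p in advertised_by_secondary]
    missing = []
    for prefix in map(ip_network, required):
        # subnet_of raises TypeError across IP versions, so compare versions first.
        covered = any(
            prefix.version == a.version and prefix.subnet_of(a)
            for a in announced
        )
        if not covered:
            missing.append(str(prefix))
    return missing


# Documentation prefixes (RFC 5737 / RFC 3849), used purely as placeholders.
required = ["192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32"]
secondary_alone = ["192.0.2.0/24", "2001:db8::/32"]

print(missing_when_solo(required, secondary_alone))  # ['198.51.100.0/24']
```

A non-empty result means some address space would become unreachable the moment the primary's announcements are withdrawn, even though both facilities look healthy while running side by side.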

Lessons

Untested redundancy is not redundancy — it is a liability. GitHub's secondary facility routed 50% of live production traffic for six months without revealing the flaw that prevented it from functioning as a primary. Active-active HA does not validate failover capability; it validates shared-load operation. Test your redundancy by actually removing the primary, not by observing the secondary under normal conditions.
Failover tests should be staged, not binary. Shifting 100% of traffic to an untested secondary in a single step is a high-stakes gamble: by the time anyone can abort, users are already affected. Canary failovers (shifting 1%, then 10%, then 25%, validating at each stage) expose configuration issues before they cause full outages. Pair them with a shadow test of the solo-primary configuration, because a flaw that only appears when the primary is absent will not show up at partial load.
Reverting fast does not mean recovering fast. GitHub reverted within 2 minutes, but the outage lasted 32 minutes because BGP reconvergence (the propagation of routing updates across the global internet after a path change) depends on thousands of networks outside the affected party's control. Build your incident response timelines around recovery time, not just fix time.
Shared queues need tenant isolation to prevent noisy neighbour failures. The June 7 incident is a canonical example of one tenant's unusual workload consuming all of a shared resource. Design queue systems with per-tenant rate limits and fast-fail timeouts so that a single job never holds a worker slot long enough to starve the entire pool.
The test that breaks production is the most valuable test you ever run. Without the June 29 failover test, the network path flaw would have remained hidden until a real infrastructure emergency forced a failover under far worse conditions — with no time to prepare, no clean revert path, and no certainty about what was broken. Deliberately probing your redundancy in a controlled environment, even at the cost of a brief outage, is the engineering equivalent of a fire drill.
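The tenant-isolation lesson above can be illustrated with a pool that caps how many worker slots a single tenant may hold at once. This is a minimal sketch, assuming a simple in-memory pool; `TenantAwarePool`, its parameters, and the tenant names are hypothetical and not taken from GitHub's Git backend.

```python
from collections import Counter
from typing import List, Optional, Tuple


class TenantAwarePool:
    """Worker pool that caps how many slots one tenant may hold at once."""

    def __init__(self, slots: int, per_tenant_limit: int):
        self.slots = slots
        self.per_tenant_limit = per_tenant_limit
        self.active: Counter = Counter()           # tenant -> slots currently held
        self.waiting: List[Tuple[str, str]] = []   # (tenant, job_id) not yet admitted

    def submit(self, tenant: str, job_id: str) -> bool:
        """Admit the job if the pool and the tenant's cap allow; else queue it."""
        has_free_slot = sum(self.active.values()) < self.slots
        under_cap = self.active[tenant] < self.per_tenant_limit
        if has_free_slot and under_cap:
            self.active[tenant] += 1
            return True
        self.waiting.append((tenant, job_id))
        return False

    def release(self, tenant: str) -> Optional[Tuple[str, str]]:
        """Free one of the tenant's slots and admit the first eligible waiter."""
        self.active[tenant] -= 1
        for index, (waiting_tenant, _job_id) in enumerate(self.waiting):
            if self.active[waiting_tenant] < self.per_tenant_limit:
                self.active[waiting_tenant] += 1
                return self.waiting.pop(index)
        return None


pool = TenantAwarePool(slots=4, per_tenant_limit=2)
big = [pool.submit("big", f"b{i}") for i in range(4)]
print(big)                         # [True, True, False, False]
print(pool.submit("small", "s1"))  # True: capacity was held back for other tenants
```

Combined with a fast-fail timeout, a cap like this bounds both how many slots one tenant can take and how long each slot can be held.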

Engineering Glossary

Active-active HA — a high-availability architecture where both primary and secondary systems handle live traffic simultaneously. Effective for load sharing but does not test failover capability — a secondary that handles 50% of traffic when the primary is healthy has never been proven to handle 100% when the primary is gone.

BGP (Border Gateway Protocol) — the routing protocol that governs how routers across the internet exchange information about which networks they can reach. BGP routes determine how internet traffic finds GitHub's network. A BGP misconfiguration in the secondary facility prevented it from announcing GitHub's routes correctly when operating as primary.

BGP reconvergence — the process by which BGP routers across the internet update their routing tables to reflect a network change. Takes seconds to minutes to propagate globally. Once a failover event has been initiated and BGP routes updated, reconvergence after revert is unavoidable — which is why a 2-minute fix produced a 32-minute outage.

Noisy neighbour problem — when one tenant or workload in a shared system consumes disproportionate resources, degrading performance for all other tenants. The June 7 incident: one customer's pathological repository data shape held all worker pool slots, starving every other repository's push processing.

Shadow testing — running the secondary infrastructure in parallel and comparing its behaviour against the primary without routing real user traffic to it. Enables detection of configuration gaps before they cause outages. The safer alternative to live failover testing at full load.

Single point of failure — a component whose failure causes the entire system to stop working. GitHub's original single Internet edge facility was a single point of failure; the second facility was built to eliminate it.

Worker pool exhaustion — the condition where all available concurrent job processing slots are consumed by slow or stuck jobs, leaving no capacity to process any other work. The June 7 root cause: throttled jobs held slots indefinitely rather than failing fast and releasing them.

This case is a plain-English retelling of publicly available engineering material.

