TechLogStack

Posted on • Originally published at techlogstack.com on

How GitHub Upgraded 1200 MySQL Hosts Without Dropping a Single Query

GitHub · Databases · 17 May 2026

MySQL 5.7 was hitting end-of-life, and GitHub's production database fleet spanned 1,200 hosts, 300 terabytes of data, and 5.5 million queries every second. Getting from here to MySQL 8.0 without disrupting 100 million developers was going to take more than a weekend.

  • 1,200+ MySQL hosts upgraded
  • 300+ TB data migrated
  • 5.5M queries/sec maintained
  • >1 year planning+execution
  • 50+ clusters zero-downtime
  • Rollback path preserved throughout

The Story

  • 1,200+ — MySQL hosts across Azure Virtual Machines and bare-metal data center hardware — each needing individual upgrade without disturbing its neighbors
  • 300+ TB — Relational data stored across 50+ clusters, sharded both horizontally and vertically using Vitess for GitHub's highest-traffic product domains
  • 5.5M QPS — Queries per second sustained throughout the entire year-long upgrade — the SLO target that could not slip during any single cluster promotion
  • >1 year — Total duration from preparation start in July 2022 through final cluster upgrades — a timeline that reflects the discipline of doing this safely, not slowly

GitHub started as a Ruby on Rails application with a single MySQL database over 15 years ago. Since then, MySQL has become the foundation of everything GitHub stores relationally: repositories, pull requests, issues, code review comments, user accounts, billing data, and the entire social graph of 100 million developers. By 2022, MySQL 5.7, the version GitHub had been running in production for years, was approaching end-of-life: Oracle had announced that support, including security patches and bug fixes, would end in October 2023. The GitHub database team faced a simple choice: stop receiving security patches on the databases behind nearly every GitHub product feature, or upgrade. The only real question was how to upgrade 1,200 hosts holding 300+ TB of data and serving 5.5 million queries per second without disrupting a single user-visible transaction.

Preparation began in July 2022 — a full year before any production host was promoted to 8.0. The team added MySQL 8.0 to CI (Continuous Integration — the automated system that runs tests against every code change before it merges, ensuring the codebase is always in a shippable state) for all applications using MySQL, running 5.7 and 8.0 side-by-side to catch regressions early. They built MySQL 8.0 Codespaces (GitHub's cloud development environment that spins up isolated VM workspaces for debugging, allowing engineers to test against specific MySQL versions without affecting production) debug containers so developers could test their queries against the new version. They created an internal GitHub Project board to track every cluster's upgrade status across the entire fleet. And they did all of this before upgrading a single production host. The discipline of the preparation phase is what made the execution phase look routine.

THE HIDDEN BREAKING CHANGE

MySQL 8.0 changes the default character set to utf8mb4 and its default collation to utf8mb4_0900_ai_ci — a newer Unicode specification that MySQL 5.7 does not support. This created a problem: when an 8.0 primary replicates writes to a 5.7 replica, the collation metadata in the binary log (the record MySQL maintains of every data modification, used to replicate changes to replica hosts and to reconstruct data state for point-in-time recovery) can cause replication to break entirely on the downstream 5.7 nodes. GitHub's rollback strategy depended on maintaining backward replication from 8.0 to 5.7 — so this had to be solved before a single production primary was promoted.

The 5-Step Upgrade Playbook

Problem

MySQL 5.7 Hits End-of-Life

Oracle announced MySQL 5.7 end-of-life for October 2023, cutting off security patches and bug fixes. GitHub's 1,200+ host fleet running at 5.5M QPS could not safely continue on an unsupported database version. The challenge was executing a major version upgrade across a mixed fleet of Azure VMs and bare-metal hosts without a maintenance window or service disruption.


Cause

Backward Replication Incompatibilities

Testing revealed two breaking changes: MySQL 8.0's new default utf8mb4_0900_ai_ci collation (a Unicode 9.0 character sorting specification supported in MySQL 8.0 but absent from 5.7, causing replication to break when an 8.0 primary writes to a 5.7 replica) broke downstream 5.7 replicas, and the new MySQL 8.0 roles feature caused permission-expansion scripts to generate 8.0-syntax statements that 5.7 replicas could not parse. Both had to be patched before any primary promotion.


Solution

Rolling Replica Upgrades + Dual Replication Chains

GitHub built a 5-step playbook: upgrade replicas one data center at a time, reconfigure the replication topology (the tree of primary and replica MySQL hosts through which write changes propagate — the primary receives writes and replicas receive a stream of changes to stay in sync) to create parallel 5.7 and 8.0 chains, promote an 8.0 host to primary via graceful failover, keep 5.7 standbys ready for rollback, then clean up after 24 hours of successful traffic.


Result

100% Fleet Upgraded, Zero SLO Violations

Every cluster upgraded without a single SLO violation. The rollback path was preserved throughout the entire year-long process — a 5.7 standby was always available. The project delivered not just the MySQL 8.0 upgrade but a repeatable automation framework for future major version upgrades, so the next one will be faster.


⚠️

The Vitess Complication: Version Advertisement

Vitess (an open-source MySQL sharding layer, originally developed at YouTube, that GitHub uses for its highest-traffic product domains) adds an extra layer of complexity: its proxy component VTGate (Vitess's query router, which accepts MySQL connections and directs queries to the correct shard) advertises a MySQL version to client applications. One Java client checked the advertised version to decide whether to disable the MySQL query cache, a feature that was removed entirely in 8.0. As soon as even one shard in a Vitess keyspace was upgraded, VTGate's version advertisement had to be updated; otherwise the Java client would generate blocking errors. Timing the VTGate version bump to coincide with the first shard promotion became a critical coordination step.
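To see why a stale version advertisement is harmful, here is a minimal Python sketch of a client that chooses its session setup from the advertised version. The function name and the exact statement are illustrative assumptions, not the real Java client's logic, which the source does not describe. The only fact relied on is that the `query_cache_type` system variable exists in MySQL 5.7 and was removed in 8.0.

```python
from typing import List


def session_setup_statements(advertised_version: str) -> List[str]:
    """Return the statements a hypothetical client sends after connecting.

    Illustrative only: the real client's logic is not described in the source.
    """
    major, minor = (int(part) for part in advertised_version.split(".")[:2])
    if (major, minor) < (8, 0):
        # Valid on 5.7; an 8.0 server rejects it because the variable is gone.
        return ["SET SESSION query_cache_type = OFF"]
    return []


# A proxy that still advertises 5.7 in front of an upgraded 8.0 shard makes
# the client send a statement the backend no longer understands.
print(session_setup_statements("5.7.44"))
print(session_setup_statements("8.0.30"))
```

Under these assumptions, the advertised version has to change as soon as the first 8.0 shard starts taking traffic, which is the coordination step described above.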

Upgrading the fleet with no impact to our Service Level Objectives (SLO) was no small feat — planning, testing and the upgrade itself took over a year and collaboration across multiple teams within GitHub.

Jiaqi Liu, Daniel Rogart, Xin Wu (via GitHub Engineering Blog)

🔄

GitHub's engineers discovered a replication bug in MySQL 8.0 that only manifested under intensive load over long periods — a host could eventually run out of commit-order sequence numbers and stall. The bug had been patched in MySQL 8.0.28. This meant GitHub had to ensure all hosts were on 8.0.28 or later before any long-running cluster was considered safe, adding a version-pinning requirement to an already complex upgrade matrix.

The upgrade process for each cluster was designed to preserve the rollback option at every step. Promoting an 8.0 replica to primary did not become irreversible until 24 hours of clean traffic had confirmed success. Throughout that window, GitHub kept a set of offline 5.7 replicas specifically for rollback: not serving traffic, not eligible for automated promotion, just sitting ready. Orchestrator (the open-source MySQL topology management tool GitHub uses for automated failover, replication topology visualization, and candidate promotion) was configured to blacklist all 5.7 hosts as failover candidates during this window, preventing an automated failover from accidentally rolling back to 5.7 during an unplanned outage. The rollback path was designed as carefully as the upgrade path itself.

🔧

gh-ost: Schema Changes Without Table Locks

GitHub's in-house tool gh-ost (GitHub Online Schema Migrations) was a critical part of the upgrade preparation. It enabled schema changes required for MySQL 8.0 compatibility to be applied to production tables without locking them — essential when those tables receive millions of queries per second. Without gh-ost, applying schema changes to GitHub's largest tables would have required multi-hour maintenance windows that users would have noticed.

ℹ️

What MySQL 8.0 Actually Unlocked

Beyond escaping end-of-life, MySQL 8.0 delivered features GitHub's database team genuinely wanted. Instant DDLs allow many schema changes to be applied without rebuilding the entire table — critical for a 300+ TB fleet where traditional ALTER TABLE could take hours. Invisible indexes let engineers create an index, test it under production traffic without it being used by the query planner, and only then make it active — dramatically safer index deployment. Compressed binary logs reduce replication bandwidth between primary and replicas, a meaningful saving at 5.5M queries per second.


The Fix

Engineering the Rollback Path

The hardest technical problem in this upgrade was not moving forward — it was preserving the ability to move backward. MySQL officially supports replication from a lower version to the next higher version but does not support reverse replication from 8.0 down to 5.7. When GitHub tested this in staging, promoting an 8.0 host to primary caused replication to break on all downstream 5.7 replicas immediately. Two root causes: MySQL 8.0's new default collation (a set of rules that determines how character strings are compared and sorted; different collations can produce different sort orders for the same strings) utf8mb4_0900_ai_ci was not recognized by 5.7's replication parser, and MySQL 8.0's new ROLE management syntax generated statements in the binary log (the sequential log of all data-modifying SQL statements that MySQL writes to enable replication and point-in-time recovery) that 5.7 could not execute. Both required surgical fixes before any production promotion could proceed.

-- The collation incompatibility fix:
-- MySQL 8.0 defaults to utf8mb4_0900_ai_ci (Unicode 9.0)
-- MySQL 5.7 only supports up to utf8mb4_unicode_520_ci
-- Fix: explicitly set database/table collations to a 5.7-compatible value

-- On the 8.0 primary, before promotion
-- (github_production is an illustrative database name):
ALTER DATABASE github_production
  CHARACTER SET utf8
  COLLATE utf8_unicode_ci; -- 5.7-compatible collation

-- Verify that new tables inherit the correct collation
SHOW CREATE TABLE repositories\G
-- Should show utf8_unicode_ci, NOT utf8mb4_0900_ai_ci

-- Confirm replication is running on downstream 5.7 replicas
-- after a test write to ensure no Seconds_Behind_Master growth
SHOW SLAVE STATUS\G
-- Expected: Seconds_Behind_Master: 0
-- Slave_SQL_Running: Yes
-- Last_Error: (empty)

-- The roles fix: temporarily strip role-expansion from permission grants
-- during the upgrade window so no ROLE syntax appears in the binlog

ℹ️

The Dual Replication Chain Architecture

During the critical promotion window, GitHub maintained two parallel replication chains downstream of a single 8.0 replica: one chain of offline 5.7 standbys ready for rollback, and one chain of serving 8.0 replicas handling production traffic. The pre-promotion form of this topology lasted only hours per cluster; after promotion, the 5.7 standby chain stayed in place until the 24-hour validation window had confirmed 8.0 health, and only then was it decommissioned. The temporary cost: double the replica infrastructure per cluster during the transition.

ORCHESTRATOR: PREVENTING ACCIDENTAL ROLLBACK

Orchestrator (an open-source MySQL high-availability tool co-created by GitHub that manages replication topology and automated failover) is configured to make automated failover decisions when a primary fails. During the upgrade, GitHub added an explicit blacklist of all 5.7 hosts as failover candidates. Without this, an unplanned primary failure during the upgrade window could have caused Orchestrator to promote a 5.7 host as the new primary — an automated rollback that would undo hours of upgrade work and potentially confuse application behavior with a sudden version downgrade. The blacklist was the safety guard against automation working against the upgrade.
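The effect of that blacklist can be sketched in a few lines of Python. This is not Orchestrator's configuration format or API; the host names, the input shape, and the function names are hypothetical and only illustrate the filtering rule: during the upgrade window, a host below 8.0 is never offered as a failover candidate.

```python
from typing import Dict, List, Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '8.0.32'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def eligible_failover_candidates(
    hosts: Dict[str, str], minimum: Tuple[int, int] = (8, 0)
) -> List[str]:
    """Return host names allowed to be promoted, excluding older versions.

    `hosts` maps a hypothetical host name to its MySQL version string.
    """
    return sorted(
        name for name, version in hosts.items() if major_minor(version) >= minimum
    )


cluster = {"db-a": "8.0.32", "db-b": "5.7.44", "db-c": "8.0.28"}
print(eligible_failover_candidates(cluster))
```

The 5.7 host stays in the topology as a rollback target, but only a deliberate, human-initiated rollback can promote it.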

After each primary promotion, GitHub's policy required at least one complete 24-hour traffic cycle before declaring a cluster successfully upgraded and decommissioning the 5.7 standby chain. This was not arbitrary — GitHub's traffic has strong diurnal patterns, with dramatically different load profiles between business-hours peak traffic and overnight lows. A cluster that behaved well during off-peak hours might reveal latency regressions during the morning rush of developers opening pull requests in Europe and North America. The 24-hour window caught several edge cases in early clusters that were fixed before the team moved to the next one.

GitHub's 5-Step MySQL 8.0 Upgrade Playbook Per Cluster

| Step | Action | Rollback available? |
| --- | --- | --- |
| 1 | Upgrade replicas one DC at a time; route read traffic to 8.0 replicas | Yes: disable 8.0 replicas, re-enable 5.7 |
| 2 | Reconfigure topology: split into dual 8.0 and 5.7 replication chains | Yes: fail back to 5.7 chain |
| 3 | Promote 8.0 replica to primary via Orchestrator graceful failover | Yes: 5.7 chain still in sync |
| 4 | Monitor for 24 hours of complete traffic cycle at full load | Yes: promote 5.7 standby if needed |
| 5 | Decommission 5.7 standbys after 24h success confirmation | No: rollback window closed |
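One way to make the playbook above machine-checkable is to encode each step together with its rollback action. The Python sketch below is an illustration under that assumption, not GitHub's tooling: the `Step` type and helper names are hypothetical, and the step data is transcribed from the table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    number: int
    action: str
    rollback: Optional[str]  # None once the rollback window has closed


# Step data transcribed from the playbook table above.
PLAYBOOK: List[Step] = [
    Step(1, "Upgrade replicas one DC at a time", "disable 8.0 replicas, re-enable 5.7"),
    Step(2, "Split into dual 8.0 and 5.7 replication chains", "fail back to 5.7 chain"),
    Step(3, "Promote 8.0 replica to primary", "5.7 chain still in sync"),
    Step(4, "Monitor for a full 24-hour traffic cycle", "promote 5.7 standby if needed"),
    Step(5, "Decommission 5.7 standbys", None),
]


def rollback_for(step_number: int) -> Optional[str]:
    """Return the rollback action for a step, or None if none remains."""
    for step in PLAYBOOK:
        if step.number == step_number:
            return step.rollback
    raise ValueError(f"unknown step: {step_number}")


def last_reversible_step() -> int:
    """Return the highest step number that still has a rollback path."""
    return max(step.number for step in PLAYBOOK if step.rollback is not None)


print(last_reversible_step())
```

Encoding the rollback action next to each step makes the point of no return explicit: step 5 is the only one with no way back.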

The Automation Investment That Pays Forward

GitHub's database team explicitly designed this upgrade to produce a reusable automation framework for future major MySQL versions. The tooling for mixed-version CI, the dual-chain promotion scripts, the rollback procedures, the checklist issue templates — all of it was built as a library, not a one-time script. When MySQL 9.0 eventually needs to be adopted, the playbook already exists. The year of effort became infrastructure.

🏷️

The Mixed-Version CI Safety Net

Running MySQL 5.7 and 8.0 side-by-side in CI for all applications throughout the entire year-long upgrade was the single most important safety investment GitHub made. Application teams discovered query incompatibilities, deprecated feature usage, and reserved keyword conflicts in automated tests rather than in production promotions. This meant by the time each cluster was promoted, the application code was already known-compatible — the upgrade was validating infrastructure, not discovering application bugs.
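Reserved keyword conflicts are one class of problem that can be caught even before a test suite runs. The Python sketch below scans identifiers against a deliberately partial list of words that became reserved in MySQL 8.0; the function name and the sample column names are hypothetical, and the authoritative list is the keywords reference in the MySQL manual.

```python
from typing import Iterable, List

# Partial list of words that are reserved in MySQL 8.0 but not in 5.7.
# Consult the MySQL keywords reference for the complete set.
NEW_RESERVED_IN_80 = {
    "GROUPS",
    "LATERAL",
    "OVER",
    "RANK",
    "RECURSIVE",
    "ROW_NUMBER",
    "WINDOW",
}


def conflicting_identifiers(names: Iterable[str]) -> List[str]:
    """Return identifiers that would need backtick quoting or renaming on 8.0."""
    return sorted(name for name in names if name.upper() in NEW_RESERVED_IN_80)


# Hypothetical column names from an application schema.
print(conflicting_identifiers(["id", "rank", "groups", "title"]))
```

A static check like this complements the mixed-version CI run; it does not replace it, because most incompatibilities only show up when queries actually execute against 8.0.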


Architecture

GitHub's MySQL fleet is not a single cluster — it's a network of over 50 independent clusters, each serving a specific product domain (repositories, issues, pull requests, billing, etc.), with larger domains horizontally sharded via Vitess. Each cluster has its own primary-replica topology. The upgrade had to be executed independently per cluster, each following the same 5-step playbook, with the dual replication chain state existing only during the transition window. Understanding this topology is essential to understanding why the upgrade took a year and why that was the right timeline, not the wrong one.

During Upgrade: Dual Replication Chain Topology (Transition State)

View interactive diagram on TechLogStack →

After: Fully Upgraded Cluster (5.7 Standbys Decommissioned)

View interactive diagram on TechLogStack →

THE TOOLING ECOSYSTEM

GitHub's MySQL reliability depends on a suite of open-source and in-house tools: Orchestrator manages replication topology and automated failover; gh-ost applies online schema changes without table locks; freno throttles schema migration speed based on replica lag to prevent migrations from disrupting production reads; and Percona Toolkit provides checksumming and replication verification. Without this ecosystem, the year-long upgrade would have required dozens of maintenance windows instead of zero.

ℹ️

Vitess Sharding: One Shard at a Time

GitHub's Vitess clusters required upgrading one shard at a time rather than one cluster at a time, adding an inner loop to the upgrade playbook. For each sharded keyspace, the VTGate version advertisement had to be updated immediately after the first shard was promoted to 8.0; otherwise client applications checking the advertised version would behave incorrectly. This timing constraint added coordination overhead but was handled with explicit upgrade checklist items per keyspace.
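The inner loop and its timing constraint can be written down as an ordered plan. The Python sketch below is a hypothetical illustration: the keyspace and shard names are made up, and the plan entries are plain descriptions rather than real Vitess commands.

```python
from typing import List


def keyspace_upgrade_plan(keyspace: str, shards: List[str]) -> List[str]:
    """Build an ordered checklist for upgrading one sharded keyspace.

    The version advertisement is bumped right after the first shard is
    promoted, matching the constraint described above.
    """
    plan: List[str] = []
    for index, shard in enumerate(shards):
        plan.append(f"promote 8.0 primary for {keyspace}/{shard}")
        if index == 0:
            plan.append(f"update VTGate advertised MySQL version for {keyspace}")
    return plan


for item in keyspace_upgrade_plan("example_keyspace", ["-80", "80-"]):
    print(item)
```

Generating the checklist instead of writing it by hand keeps the version bump from being forgotten or scheduled after the wrong shard.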

📋

GitHub Projects as the Upgrade Control Plane

GitHub used its own GitHub Projects tool to build a rolling calendar that tracked every cluster's upgrade status, scheduled upcoming cluster promotions, and coordinated between the database team and application teams. Issue templates gave application teams a standardized checklist for validating their service before and after each promotion. Meta-note: GitHub building GitHub with GitHub to upgrade GitHub's database is either very on-brand or very circular, depending on your disposition.


Lessons

GitHub's MySQL 8.0 upgrade is one of the cleanest examples of large-scale infrastructure migration executed with discipline. The lessons here are as much about process and architecture as they are about database mechanics.

  1. Design for rollback before you design for progress. GitHub's upgrade strategy was built around the constraint that rollback must be available at every step until the 24-hour validation window closed. The dual replication chain architecture, the Orchestrator blacklisting of 5.7 failover candidates, and the parallel standby maintenance were all overhead deliberately accepted to preserve the ability to undo. That safety margin is what allowed the team to execute confidently.
  2. Binary log compatibility between versions is a hidden risk in any major database upgrade. (The binary log is the record of data changes that MySQL uses for replication and point-in-time recovery.) Always test reverse replication in staging, not just forward replication, before committing to a production upgrade strategy. GitHub discovered the collation and roles incompatibilities in staging, which is exactly the right time to find them.
  3. Run a complete 24-hour traffic cycle before decommissioning your rollback infrastructure. A cluster isn't 'done' when the primary promotion completes successfully. GitHub's requirement for a full 24-hour window before removing 5.7 standbys caught edge cases during peak traffic that weren't visible during off-peak hours. Don't close the escape hatch until you've seen the full traffic profile.
  4. Build your upgrade automation as a reusable library, not a one-time script. GitHub's database team explicitly designed the tooling, templates, playbooks, and automation from this project as the foundation for the next major version upgrade. The year of effort becomes infrastructure that compounds in value over time: every future upgrade starts from a much higher base.
  5. Orchestrator and equivalent automation tools can work against you during a migration if not explicitly constrained. Blacklisting 5.7 hosts as failover candidates during the upgrade window was a critical safety measure. Any automated system that could undo your migration work must be told, explicitly, not to. Never assume automation understands your maintenance window.

THE END-OF-LIFE FORCING FUNCTION

MySQL 5.7's end-of-life announcement was the external forcing function that gave GitHub's database team the organizational priority to execute this migration. Security patch cutoffs are one of the most effective levers for getting cross-team infrastructure migrations approved and resourced. If your team has been deferring a major version upgrade, check when your current version's security support ends — it may already be overdue.

⚠️

The Replication Bug in MySQL 8.0 Before 8.0.28

GitHub's testing surfaced a MySQL bug where replica_preserve_commit_order under intensive load could cause a host to exhaust commit-order sequence numbers and stall replication. The fix was in 8.0.28. This meant every host had to be on at minimum 8.0.28 — adding a version-pinning constraint to an already complex upgrade matrix. Moral: always scan the target version's release notes for known bugs before committing to that specific build in production.
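A version-pinning rule like this is easy to enforce mechanically. The Python sketch below flags hosts running an 8.0 release older than 8.0.28; the fleet mapping and function names are hypothetical, and the version strings stand in for whatever `SELECT VERSION()` returns on each host.

```python
import re
from typing import Dict, List, Tuple

MIN_SAFE_80 = (8, 0, 28)  # first release containing the fix, per the text above


def parse_version(raw: str) -> Tuple[int, int, int]:
    """Parse the leading 'X.Y.Z' from a version string such as '8.0.28-log'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", raw)
    if match is None:
        raise ValueError(f"unrecognized MySQL version string: {raw!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def hosts_needing_patch(fleet: Dict[str, str]) -> List[str]:
    """Return 8.0 hosts that are older than the minimum safe release."""
    flagged = []
    for host, raw in fleet.items():
        version = parse_version(raw)
        if (8, 0, 0) <= version < MIN_SAFE_80:
            flagged.append(host)
    return sorted(flagged)


fleet = {"db1": "8.0.27", "db2": "8.0.28-log", "db3": "5.7.44", "db4": "8.0.9"}
print(hosts_needing_patch(fleet))
```

Comparing parsed tuples rather than raw strings matters here: as text, "8.0.9" sorts after "8.0.28", which would let an affected host slip through.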

They upgraded 1,200 database hosts without a single user noticing — which means they either did extraordinary engineering or extraordinary documentation, and based on the blog post, it was both.

TechLogStack — built at scale, broken in public, rebuilt by engineers


This case is a plain-English retelling of publicly available engineering material.

Read the full case on TechLogStack → (interactive diagrams, source links, and the full reader experience).
