TechLogStack

Posted on May 20 • Originally published at techlogstack.com on May 17

How GitHub Upgraded 1200 MySQL Hosts Without Dropping a Single Query

#database #backend #devops #programming

1,200+ MySQL hosts upgraded across Azure VMs and bare-metal data centre hardware
300+ TB of relational data across 50+ clusters
5.5M queries/second maintained throughout the entire year-long upgrade
>1 year from preparation start (July 2022) to final cluster upgrade
Rollback path preserved at every single step until 24-hour validation window closed
0 SLO violations across all cluster promotions

MySQL 5.7 was hitting end-of-life, and GitHub's production database fleet spanned 1,200 hosts, 300 terabytes of data, and 5.5 million queries every second. Getting from here to MySQL 8.0 without disrupting 100 million developers was going to take more than a weekend.

The Story

Upgrading the fleet with no impact to our Service Level Objectives (SLO) was no small feat — planning, testing and the upgrade itself took over a year and collaboration across multiple teams within GitHub.

— Jiaqi Liu, Daniel Rogart, Xin Wu, via GitHub Engineering Blog

GitHub started as a Ruby on Rails application with a single MySQL database over 15 years ago. Since then, MySQL has become the foundation of GitHub's relational data: repository metadata, pull requests, issues, code review comments, user accounts, billing data, and the entire social graph of 100 million developers. By 2022, MySQL 5.7, the version GitHub had run in production for years, was approaching end-of-life: Oracle had scheduled it for October 2023, after which there would be no more security patches or bug fixes. The choice was simple: upgrade, or stop receiving security patches on the database behind nearly every GitHub feature.

Preparation began in July 2022, long before any production host was promoted to 8.0. The team added MySQL 8.0 to CI (continuous integration, the automated system that runs tests against every code change before it merges) for all applications using MySQL, running 5.7 and 8.0 side by side to catch regressions early. They built MySQL 8.0 debug containers for Codespaces so developers could test their queries against the new version, and they created an internal GitHub Project board to track every cluster's upgrade status across the fleet. All of this happened before a single production host was upgraded. The discipline of the preparation phase is what made the execution phase look routine.

The Hidden Breaking Change: Collation Incompatibility

MySQL 8.0 changes the default character set to utf8mb4 and the default collation to utf8mb4_0900_ai_ci, a collation based on Unicode 9.0 that MySQL 5.7 does not support. When an 8.0 primary replicates writes to a 5.7 replica, events in the binary log (the record MySQL maintains of every data modification, used to replicate changes to replica hosts) that reference this collation can break replication entirely on the downstream 5.7 nodes. GitHub's rollback strategy depended on maintaining backward replication from 8.0 to 5.7, so this had to be solved before a single production primary was promoted.
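
One way to gauge exposure before promoting an 8.0 primary is to look for objects that already use an 8.0-only collation. The queries below are a minimal sketch, not GitHub's tooling. They assume they are run on a MySQL 8.0 host and that the "0900" collation family is the only one of concern.

```sql
-- Server-level defaults that newly created databases inherit.
SELECT @@character_set_server AS default_charset,
       @@collation_server     AS default_collation;

-- Tables whose default collation belongs to the 8.0-only "0900" family
-- (utf8mb4_0900_ai_ci and its language-specific variants).
SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_COLLATION
FROM information_schema.TABLES
WHERE TABLE_COLLATION LIKE 'utf8mb4%\_0900\_%'
  AND TABLE_SCHEMA NOT IN ('mysql', 'sys', 'information_schema', 'performance_schema');

-- Columns can override the table default, so check them separately.
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE COLLATION_NAME LIKE 'utf8mb4%\_0900\_%'
  AND TABLE_SCHEMA NOT IN ('mysql', 'sys', 'information_schema', 'performance_schema');
```

Any rows returned by the second or third query name objects that a 5.7 replica could not recreate as defined.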

Problem

MySQL 5.7 Hits End-of-Life

Oracle announced MySQL 5.7 end-of-life for October 2023, cutting off security patches and bug fixes. GitHub's 1,200+ host fleet running at 5.5M QPS could not safely continue on an unsupported database version. The challenge: executing a major version upgrade across a mixed fleet of Azure VMs and bare-metal hosts without a maintenance window or service disruption.

Cause

Backward Replication Incompatibilities

Testing revealed two breaking changes. First, MySQL 8.0's new default collation, utf8mb4_0900_ai_ci, does not exist in 5.7 (a collation is the set of rules that determines how character strings are compared and sorted), so replication broke whenever an 8.0 primary wrote to a 5.7 replica. Second, MySQL 8.0's new role management feature generated statements in the binary log that 5.7 replicas could not execute. Both had to be worked around before any primary promotion could proceed.

Solution

Rolling Replica Upgrades + Dual Replication Chains

GitHub built a 5-step playbook: upgrade replicas one data centre at a time, reconfigure the replication topology (the tree of primary and replica MySQL hosts through which write changes propagate) to create parallel 5.7 and 8.0 chains, promote an 8.0 host to primary via graceful failover, keep 5.7 standbys ready for rollback, then clean up after 24 hours of successful traffic.

Result

100% Fleet Upgraded, Zero SLO Violations

Every cluster was upgraded without a single SLO violation. For each cluster, the rollback path was preserved until its 24-hour validation window closed, with a 5.7 standby available throughout. The project delivered not just the MySQL 8.0 upgrade but a repeatable automation framework for future major version upgrades.

The Fix

Engineering the Rollback Path

The hardest technical problem in this upgrade was not moving forward — it was preserving the ability to move backward. MySQL officially supports replication from a lower version to the next higher version but does not support reverse replication from 8.0 down to 5.7. When GitHub tested this in staging, promoting an 8.0 host to primary caused replication to break on all downstream 5.7 replicas immediately. Both the collation and the ROLE syntax issues required surgical fixes before any production promotion could proceed.

1,200+ — MySQL hosts upgraded across Azure VMs and bare-metal hardware
5.5M QPS — query throughput maintained throughout; the SLO target that could not slip during any single cluster promotion
24 hours — minimum observation window after primary promotion before decommissioning 5.7 standbys
0 — SLO violations across all cluster promotions during the year-long project

-- The collation incompatibility fix:
-- MySQL 8.0 defaults to utf8mb4_0900_ai_ci (Unicode 9.0)
-- MySQL 5.7 only supports up to utf8mb4_unicode_520_ci
-- Fix: explicitly set database/table collations to a 5.7-compatible value

-- On the 8.0 primary, before promotion:
ALTER DATABASE github_production
  CHARACTER SET utf8
  COLLATE utf8_unicode_ci;  -- 5.7-compatible collation, not the 8.0 default

-- Verify that new tables inherit the correct collation
SHOW CREATE TABLE repositories\G
-- Must show utf8_unicode_ci, NOT utf8mb4_0900_ai_ci

-- Confirm replication is running on downstream 5.7 replicas
-- after a test write to ensure no replication lag growth
SHOW SLAVE STATUS\G
-- Expected:
--   Seconds_Behind_Master: 0
--   Slave_SQL_Running: Yes
--   Last_Error: (empty)

-- The ROLE syntax fix:
-- Temporarily strip role-expansion from permission grants during the upgrade
-- window so no ROLE syntax appears in the binary log that 5.7 replicas
-- cannot parse.
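
The source does not show the exact statements GitHub used for the ROLE workaround. The sketch below only illustrates the general idea under stated assumptions: the account 'reporting'@'%' already exists, and both it and the example_app schema are hypothetical names.

```sql
-- MySQL 8.0 only: role statements like these are written to the binary log,
-- and a 5.7 replica has no syntax for them, so they are shown commented out.
--   CREATE ROLE 'app_reader';
--   GRANT SELECT ON example_app.* TO 'app_reader';
--   GRANT 'app_reader' TO 'reporting'@'%';

-- 5.7-compatible alternative for the upgrade window:
-- grant the same privileges directly to the account instead of via a role.
GRANT SELECT ON example_app.* TO 'reporting'@'%';

-- Confirm that the account's grants contain no role references.
SHOW GRANTS FOR 'reporting'@'%';
```

Direct grants are more verbose to manage than roles, which is why a workaround like this is best limited to the period when 5.7 replicas are still attached.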

The Dual Replication Chain Architecture

During the critical promotion window, GitHub maintained two parallel replication chains downstream of a single 8.0 replica: one chain of offline 5.7 standbys ready for rollback, and one chain of serving 8.0 replicas handling production traffic. This dual-chain state was temporary for each cluster: it lasted only until the 24-hour validation window closed, long enough to confirm 8.0 health before decommissioning the 5.7 standby chain. The temporary cost was extra replica infrastructure per cluster during the transition. The payoff was a rollback available at any moment.
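
To make the topology concrete, here is a rough sketch of attaching a 5.7 standby beneath the 8.0 intermediate replica. It is an illustration, not GitHub's procedure: in practice a topology manager such as Orchestrator would perform the relocation. The sketch assumes GTID-based replication and previously configured replication credentials, and the hostname is hypothetical.

```sql
-- Run on a 5.7 standby (5.7 uses the MASTER/SLAVE keywords).
STOP SLAVE;

CHANGE MASTER TO
  MASTER_HOST = 'mysql80-intermediate.example.internal',  -- hypothetical hostname
  MASTER_PORT = 3306,
  MASTER_AUTO_POSITION = 1;  -- assumes GTIDs are enabled on both hosts

START SLAVE;

-- The standby is a usable rollback target only while both replication
-- threads run and Last_Error stays empty.
SHOW SLAVE STATUS\G
```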

GitHub's 5-step MySQL 8.0 upgrade playbook per cluster:

Step	Action	Rollback Available?
1	Upgrade replicas one DC at a time; route read traffic to 8.0 replicas	Yes — disable 8.0 replicas, re-enable 5.7
2	Reconfigure topology: split into dual 8.0 and 5.7 replication chains	Yes — fail back to 5.7 chain
3	Promote 8.0 replica to primary via Orchestrator graceful failover	Yes — 5.7 chain still in sync
4	Monitor for 24 hours of complete traffic cycle at full load	Yes — promote 5.7 standby if needed
5	Decommission 5.7 standbys after 24h success confirmation	No — rollback window closed

The Orchestrator blacklist: preventing automated rollback

Orchestrator, the open-source MySQL topology manager GitHub uses for automated failover, decides which replica to promote when a primary fails. During the upgrade, GitHub explicitly blacklisted all 5.7 hosts as failover candidates. Without this, an unplanned primary failure during the upgrade window could have led Orchestrator to promote a 5.7 host as the new primary: an automated rollback that would undo hours of upgrade work and expose applications to a sudden version downgrade. The blacklist was the safeguard against automation working against the upgrade.

The Vitess complication: VTgate version advertisement

Vitess (the open-source MySQL sharding layer, originally built at YouTube, that GitHub uses for its highest-traffic product domains) adds an extra layer: its proxy component, VTgate, advertises a MySQL version to client applications. One Java client checked the advertised version to decide whether to disable the MySQL query cache, a feature removed entirely in 8.0. If VTgate still advertised 5.7 while a shard was already running 8.0, that client would try to disable a query cache that no longer existed and generate blocking errors. So as soon as even one shard in a Vitess keyspace was upgraded, VTgate's advertised version had to be updated as well, which made timing the version bump a critical coordination step per keyspace.

The replication bug in MySQL 8.0 before 8.0.28

GitHub's testing surfaced a MySQL bug: with replica_preserve_commit_order enabled, a replica under sustained intensive load could exhaust its commit-order sequence numbers and stall replication. The bug was fixed in MySQL 8.0.28, which meant every host had to run at least 8.0.28, adding a version-pinning constraint to an already complex upgrade matrix. Lesson: always scan the target version's release notes for known bugs before committing to a specific build in production.
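
A version floor like this is easy to verify per host. The query below is a small sketch that only handles the 8.0 release series and assumes VERSION() returns the usual major.minor.patch format, optionally followed by a suffix such as "-log".

```sql
-- at_least_8_0_28 is 1 on 8.0.28 or a later 8.0 release, 0 otherwise.
SELECT VERSION() AS server_version,
       (SUBSTRING_INDEX(VERSION(), '.', 2) = '8.0'
        AND CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(VERSION(), '-', 1), '.', -1)
                 AS UNSIGNED) >= 28) AS at_least_8_0_28;

-- Show whether commit-order preservation is enabled. The pattern matches
-- both the older slave_ and the newer replica_ variable names.
SHOW GLOBAL VARIABLES LIKE '%preserve_commit_order';
```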

Architecture

GitHub's MySQL fleet is not a single cluster — it's a network of over 50 independent clusters, each serving a specific product domain (repositories, issues, pull requests, billing), with larger domains horizontally sharded via Vitess. Each cluster has its own primary-replica topology. The upgrade had to be executed independently per cluster, each following the same 5-step playbook, with the dual replication chain state existing only during the transition window.

During Upgrade: Dual Replication Chain Topology (Transition State)

(Interactive diagram available on TechLogStack.)

After: Fully Upgraded Cluster (5.7 Standbys Decommissioned)

(Interactive diagram available on TechLogStack.)

The Mixed-Version CI Safety Net

Running MySQL 5.7 and 8.0 side-by-side in CI for all applications throughout the entire year-long upgrade was the single most important safety investment GitHub made. Application teams discovered query incompatibilities, deprecated feature usage, and reserved keyword conflicts in automated tests rather than in production promotions. This meant by the time each cluster was promoted, the application code was already known-compatible — the upgrade was validating infrastructure, not discovering application bugs.

What MySQL 8.0 actually unlocked

Beyond escaping end-of-life, MySQL 8.0 delivered features GitHub's database team genuinely wanted. Instant DDL allows many schema changes to be applied without rebuilding the entire table, which matters for a 300+ TB fleet where a traditional ALTER TABLE could take hours. Invisible indexes are maintained on writes but ignored by the query planner, so engineers can create an index, observe it under production traffic, and only then make it visible: a much safer way to deploy indexes. Compressed binary logs reduce replication bandwidth between primary and replicas, a meaningful saving at 5.5M queries per second.
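
For readers who have not used these features, the statements below show roughly what each looks like on MySQL 8.0. This is an illustrative sketch: the example_issues table and its columns are hypothetical, and the exact set of operations that qualify for ALGORITHM=INSTANT depends on the 8.0 minor version.

```sql
-- Hypothetical table used only for this illustration.
CREATE TABLE example_issues (
  id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(255) NOT NULL
);

-- Instant DDL: requesting ALGORITHM=INSTANT makes the statement fail
-- rather than silently fall back to a full table rebuild.
ALTER TABLE example_issues
  ADD COLUMN triage_state TINYINT NOT NULL DEFAULT 0,
  ALGORITHM=INSTANT;

-- Invisible index: maintained on writes, ignored by the optimizer.
ALTER TABLE example_issues
  ADD INDEX idx_triage_state (triage_state) INVISIBLE;

-- Let only the current session's optimizer consider invisible indexes,
-- then inspect the plan before exposing the index to all traffic.
SET SESSION optimizer_switch = 'use_invisible_indexes=on';
EXPLAIN SELECT id FROM example_issues WHERE triage_state = 1;

-- Make the index visible to every session.
ALTER TABLE example_issues ALTER INDEX idx_triage_state VISIBLE;

-- Binary log transaction compression (MySQL 8.0.20 and later);
-- the global setting applies to sessions opened afterwards.
SET GLOBAL binlog_transaction_compression = ON;
```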

Lessons

Design for rollback before you design for progress. GitHub's upgrade strategy was architected around the constraint that rollback must be available at every step until the 24-hour validation window closed. The dual replication chain architecture, the Orchestrator blacklisting of 5.7 failover candidates, the parallel standby maintenance — all overhead deliberately accepted to preserve the ability to undo. That safety margin is what allowed the team to execute confidently across 1,200 hosts.
Binary log compatibility between versions is a hidden source of risk in any major database upgrade (the binary log is the sequential log MySQL uses to record data changes for replication). Always test reverse replication in staging, not just forward replication, before committing to a production upgrade strategy. GitHub discovered the collation and role incompatibilities in staging, which is exactly the right time to find them.
Run a complete 24-hour traffic cycle before decommissioning your rollback infrastructure. GitHub required a full 24-hour window before removing 5.7 standbys so that edge cases visible only at peak traffic had a chance to surface. A cluster isn't done the moment the primary promotion completes successfully. Don't close the escape hatch until you've seen the full traffic profile.
Build your upgrade automation as a reusable library, not a one-time script. GitHub's database team explicitly designed the tooling, templates, playbooks, and automation from this project as the foundation for the next major version upgrade. The year of effort becomes infrastructure that compounds in value over time — every future upgrade starts from a much higher base.
Orchestrator and equivalent automation tools can work against you during a migration if not explicitly constrained. Blacklisting 5.7 hosts as failover candidates during the upgrade window was a critical safety measure. Any automated system that could undo your migration work must be told, explicitly, not to. Never assume automation understands your maintenance window.

Engineering Glossary

Binary log — the sequential log in which MySQL records all data-modifying events (as statements or row changes) for replication purposes. Replication works by replaying the primary's binary log on each replica. Binary log compatibility between MySQL versions is the key technical challenge in any major version upgrade.

Collation — a set of rules that determines how character strings are compared and sorted in a database. MySQL 8.0's new default collation (utf8mb4_0900_ai_ci) is absent from MySQL 5.7, causing replication to break when an 8.0 primary writes to a 5.7 replica. The fix: explicitly set collations to a 5.7-compatible value on 8.0 hosts before promotion.

gh-ost — GitHub's open-source Online Schema Migration tool that applies schema changes to production tables without locking them. Essential for applying MySQL 8.0 compatibility changes to tables receiving millions of queries per second without causing maintenance-window-level disruption.

Invisible index — a MySQL 8.0 feature that allows an index to be created and tested under production traffic without being used by the query planner. Engineers can validate the index's performance impact before making it active — significantly safer than deploying indexes directly to production.

Instant DDL — a MySQL 8.0 feature that allows many common schema changes (ADD COLUMN, DROP COLUMN, etc.) to be applied without rebuilding the entire table. Critical for a 300+ TB database fleet where traditional ALTER TABLE could take hours.

Orchestrator — an open-source MySQL high-availability tool, used and maintained by GitHub, that manages replication topology and automated failover. During the MySQL 8.0 upgrade, GitHub configured Orchestrator to blacklist all 5.7 hosts as failover candidates, preventing automated failover from accidentally reversing the upgrade.

Replication topology — the tree of primary and replica MySQL hosts through which write changes propagate. The primary receives all writes and maintains a binary log; replicas connect to the primary and replay that log to stay in sync. GitHub's dual replication chain architecture maintained parallel 5.7 and 8.0 chains during the upgrade transition window.

Vitess — YouTube's open-source MySQL sharding layer that GitHub uses for its highest-traffic product domains. Adds a proxy component (VTgate) that routes queries to the correct shard and advertises the MySQL version to client applications — creating an additional coordination requirement during the upgrade.

This case is a plain-English retelling of publicly available engineering material.


TechLogStack — built at scale, broken in public, rebuilt by engineers.
