1. Context — Why Should You Care?
Imagine you live in a city with one single hospital. Every person — from a child with a cold, to someone needing heart surgery, to someone getting an eye test — goes to the same building, the same reception desk, the same set of doctors.
When the city had 10,000 people, this was fine. But now the city has 10 million people. The waiting room is overflowing. The reception computer is crashing. A patient getting a routine blood test is accidentally blocking the queue for someone who needs emergency surgery.
What do you do?
You don't shut down the hospital and build new ones. People are already inside, being treated. You need to split the hospital into specialized clinics — a heart clinic, an eye clinic, a general clinic — while patients are still being treated, without anyone noticing the transition.
That is exactly what GitHub did with its database.
GitHub — the platform where 100+ million developers store their code — was running on a single, massive MySQL database. As it grew, the database started groaning under the weight. Queries were slow. One team's heavy workload was ruining performance for everyone else. Something had to change.
But GitHub is a 24/7 platform. They couldn't just "turn it off for maintenance." They had to split their giant database into smaller, independent databases — while the platform was live and millions of developers were pushing code.
What You Will Learn in This Blog
- What a monolithic database is and why it becomes a problem
- What "sharding" means and why companies do it
- GitHub's exact two-phase strategy: Virtual Partitioning → Physical Partitioning
- How they achieved a cut-over in under 100 milliseconds
- Every technical term explained from scratch with analogies
- Step-by-step walkthrough with diagrams
What Problem This Knowledge Solves
If you ever work at a company that is growing fast, you will face database scaling problems. Understanding how GitHub solved this gives you a mental framework for one of the hardest problems in backend engineering — scaling databases without downtime.
2. Jargon & Terminology Breakdown
Before we touch any concept, let's make sure every single term is crystal clear. Read this section like a mini-dictionary. Come back to it anytime you get confused later.
Monolithic Database
| Definition | One single database that stores everything for your entire application. |
| Real-Life Analogy | One giant notebook where you write your work notes, grocery lists, personal diary, and meeting minutes — all mixed together. |
| Where It's Used | Almost every application starts with a monolithic database. It's simple and works fine when you're small. |
Sharding
| Definition | Splitting one big database into multiple smaller databases, each responsible for a specific portion of the data. |
| Real-Life Analogy | Instead of one giant notebook, you now have separate notebooks: one for work, one for groceries, one for personal diary. Each notebook is independent. |
| Where It's Used | Large-scale systems like GitHub, Instagram, Uber, Pinterest — any platform that outgrows a single database. |
Schema
| Definition | The structure or blueprint of your database — what tables exist, what columns each table has, and how tables relate to each other. |
| Real-Life Analogy | Think of a schema like the layout plan of a library: "Fiction books go on floor 1, Science on floor 2, History on floor 3." The schema doesn't hold the actual books — it defines where things go. |
| Where It's Used | Every database has a schema. When developers say "schema domain," they mean a logical group of related tables. |
Schema Domain
| Definition | A logical grouping of related tables within a database. For example, all tables related to "Repositories" form one domain, all tables related to "Users" form another. |
| Real-Life Analogy | In a big hospital, "Cardiology" is one domain (heart patients, heart tests, heart doctors), "Ophthalmology" is another domain (eye patients, eye tests, eye doctors). They're in the same building but logically separate. |
| Where It's Used | GitHub created this concept internally to organize their monolithic database before splitting it. |
Cluster
| Definition | A group of database servers working together. Usually there's one "primary" server (handles writes) and several "replica" servers (handle reads). |
| Real-Life Analogy | A team at work: one team lead (primary) makes decisions, and several team members (replicas) execute and share the workload. |
| Where It's Used | Any production database setup. You rarely run just one server — you run a cluster for reliability and performance. |
Primary (Server)
| Definition | The main database server that handles all write operations (INSERT, UPDATE, DELETE). It is the "source of truth." |
| Real-Life Analogy | The original document that everyone copies from. If you want to make a change, you change this document. |
| Where It's Used | In a standard replication setup, every cluster has exactly one primary. (Multi-primary configurations exist, but they are the exception.) |
Replica (Server)
| Definition | A copy of the primary server. It receives all changes from the primary and handles read operations to reduce load on the primary. |
| Real-Life Analogy | Photocopies of the original document distributed to different offices so people can read without crowding around the original. |
| Where It's Used | Read-heavy applications (like GitHub, where millions read code but fewer write code at any given moment). |
Replication
| Definition | The automatic process of copying data changes from the primary server to replica servers. |
| Real-Life Analogy | A live Google Doc where one person types (primary) and everyone else sees changes in real time (replicas). |
| Where It's Used | Every database cluster uses replication to keep replicas in sync with the primary. |
Replication Lag
| Definition | The delay between when a change happens on the primary and when replicas receive that change. |
| Real-Life Analogy | When someone speaks in a video call and there's a 2-second delay before you hear it. That delay is "lag." |
| Where It's Used | Critical during migrations — you need lag to be near zero before you switch traffic to a new server. |
ProxySQL
| Definition | A middleman software that sits between your application and the database. It decides which database server to send each query to. |
| Real-Life Analogy | A receptionist at a hospital who listens to your problem and directs you to the right department. You don't need to know which room — the receptionist handles it. |
| Where It's Used | Large-scale MySQL deployments. It's especially useful during migrations because you can redirect traffic without changing application code. |
Cut-Over
| Definition | The final moment when you switch traffic from the old database to the new database. |
| Real-Life Analogy | The moment when a new highway opens and traffic is redirected from the old road to the new road. |
| Where It's Used | Any database migration. The cut-over is the most critical and risky moment. |
Snapshot
| Definition | A point-in-time copy of data. Like taking a photograph of your database at a specific moment. |
| Real-Life Analogy | Photocopying all pages of a book at once. The photocopy represents the book's state at that exact moment. |
| Where It's Used | Used as a starting point when setting up a new database — you load the snapshot, then apply any changes that happened after the snapshot was taken. |
Noisy Neighbor Problem
| Definition | When one workload in a shared system uses so many resources that it degrades performance for other workloads in the same system. |
| Real-Life Analogy | You're in a shared office. One person starts playing loud music and making phone calls. Your concentration is ruined — not because of your work, but because of their behavior. |
| Where It's Used | Any shared resource: shared databases, shared cloud servers, shared networks. |
Query Linter
| Definition | A tool that automatically checks your database queries for "rule violations" — like a spell-checker, but for database queries. |
| Real-Life Analogy | A grammar checker that underlines mistakes in your essay before you submit it. |
| Where It's Used | Development and testing environments. GitHub used it to catch queries that broke domain boundaries. |
Cross-Domain Join
| Definition | A database query that pulls data from tables belonging to different schema domains in a single operation. |
| Real-Life Analogy | A hospital receptionist trying to book one appointment that requires both a heart doctor and an eye doctor in the same room at the same time. It couples two independent departments together. |
| Where It's Used | Common in monolithic databases. But if you want to split the database, these cross-domain joins must be eliminated first. |
Downtime
| Definition | A period when the system is unavailable to users. |
| Real-Life Analogy | A shop putting up a "Closed for Renovation" sign. Customers can't enter. |
| Where It's Used | Every online platform fears downtime. GitHub's goal was to achieve zero visible downtime during their migration. |
3. The Big Picture — High-Level Mental Model
Before diving into the details, let's understand the overall story in simple terms.
The Problem
GitHub's entire platform — repositories, pull requests, issues, users, notifications, everything — was stored in one MySQL database cluster. Think of it as one massive warehouse where every department stores their goods.
As GitHub grew to serve 100+ million developers, two problems became unbearable:
- High query volume: Too many people asking for too many things from the same warehouse at the same time.
- Noisy neighbor problem: The "Repositories" team's heavy operations slowed down the "Notifications" team, even though their data had nothing to do with each other.
The Solution (at 30,000 feet)
GitHub decided to shard — break the one giant database into many smaller, independent databases, each responsible for one "domain" of data.
But here's the brilliance: they didn't do it all at once. They did it in two careful phases:
Phase 1 — Virtual Partitioning (Logical separation)
"Before we physically move anything, let's first draw boundaries inside the existing database and make sure no one crosses them."
Phase 2 — Physical Partitioning (Actual migration)
"Now that boundaries are clean, let's physically move each domain to its own database cluster — without downtime."
The Analogy
Imagine you have a huge shared apartment with 6 roommates. Everyone's stuff is everywhere — someone's books are on your shelf, your clothes are in their closet.
Phase 1 (Virtual Partitioning): Before anyone moves out, you first sort everything. Each person labels their stuff. You make sure no one is using someone else's belongings. You draw invisible boundaries.
Phase 2 (Physical Partitioning): Now that everything is sorted, each person moves into their own apartment — taking only their labeled stuff, without any mix-ups.
┌─────────────────────────────────────────────────┐
│ GITHUB'S SHARDING STRATEGY │
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ PHASE 1: VIRTUAL PARTITIONING │ │
│ │ ───────────────────────────────── │ │
│ │ • Define schema domains │ │
│ │ • Enforce boundaries (no cross-domain │ │
│ │ queries or transactions) │ │
│ │ • Use linters + alerts to catch │ │
│ │ violations │ │
│ │ • All still in ONE physical database │ │
│ └──────────────┬───────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ PHASE 2: PHYSICAL PARTITIONING │ │
│ │ ───────────────────────────────── │ │
│ │ • Snapshot domain tables │ │
│ │ • Set up new cluster │ │
│ │ • Replicate data │ │
│ │ • Redirect traffic via ProxySQL │ │
│ │ • Cut-over in < 100ms │ │
│ └──────────────────────────────────────────┘ │
│ │
│ Result: Independent databases per domain, │
│ zero visible downtime │
└─────────────────────────────────────────────────┘
4. Concept-by-Concept Deep Dive
4.1 — The Starting Point: GitHub's Monolithic MySQL Database
Simple Definition
GitHub stored all its data — for repositories, users, issues, pull requests, notifications, actions, and more — in a single MySQL database cluster.
Why This Existed
Every startup and growing company starts here. A monolithic database is:
- Simple to set up
- Simple to query (you can JOIN any table with any other table)
- Simple to maintain (one place, one backup)
It's the natural starting point. The problem isn't that GitHub chose this — it's that they outgrew it.
The Problems It Caused
Problem 1: Query Volume
Millions of developers, millions of repositories, billions of commits. The single database was receiving a staggering volume of queries — reads and writes, all funneled into the same cluster.
Problem 2: Noisy Neighbor
This is the critical one. Let's say the "Notifications" system runs a heavy batch query to send email digests every morning. While this query runs, it hogs database resources (CPU, memory, I/O). At the same time, a developer is trying to push code to a repository. The push is slow or fails — not because of anything wrong with the "Repositories" tables, but because the "Notifications" tables are hogging resources.
┌─────────────────────────────────────────────────────────┐
│ SINGLE DATABASE CLUSTER │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐ │
│ │ Repos │ │ Users │ │ Notifs │ │ Issues │ │
│ │ Tables │ │ Tables │ │ Tables │ │ Tables │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └───┬────┘ │
│ │ │ │ │ │
│ └──────────────┴──────┬───────┴─────────────┘ │
│ │ │
│ SHARED RESOURCES │
│ (CPU, Memory, Disk I/O) │
│ │ │
│ ┌──────────────┴──────────────┐ │
│ │ 🔴 NOISY NEIGHBOR EFFECT │ │
│ │ Heavy Notifs query slows │ │
│ │ down Repos, Users, Issues │ │
│ └─────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Common Mistakes & Misunderstandings
❌ "Monolithic databases are bad."
No. They're fine when you're small or medium-sized. The problem is when you scale to GitHub-level traffic. Don't prematurely shard — it adds enormous complexity.
❌ "Just buy a bigger server."
This is called "vertical scaling" and it has a ceiling. There's only so much CPU and RAM you can add to a single machine. And it doesn't solve the noisy neighbor problem — all domains still share the same resources.
4.2 — Phase 1: Virtual Partitioning (Logical Separation)
Simple Definition
Virtual partitioning means drawing invisible boundaries inside the existing database so that different groups of tables (schema domains) stop interacting with each other — even though they still physically live in the same database.
Why It Exists
You can't just rip tables out of a database and move them elsewhere if your application code is constantly doing queries that JOIN tables from different domains, or running transactions that span multiple domains.
If "Repository" code is doing a JOIN with "Notifications" tables, and you move the Notifications tables to a separate database, that JOIN will break. The application will crash.
So before physically moving anything, you need to guarantee that each domain is self-contained — it doesn't depend on tables from other domains.
How It Works — Step by Step
Step 1: Define Schema Domains
GitHub categorized their tables into logical groups:
┌─────────────────────────────────────────────┐
│ SCHEMA DOMAIN EXAMPLES │
│ │
│ Domain: "Repositories" │
│ ├── repositories table │
│ ├── commits table │
│ ├── branches table │
│ └── pull_requests table │
│ │
│ Domain: "Users" │
│ ├── users table │
│ ├── profiles table │
│ └── sessions table │
│ │
│ Domain: "Notifications" │
│ ├── notifications table │
│ ├── email_preferences table │
│ └── notification_logs table │
│ │
│ Domain: "Issues" │
│ ├── issues table │
│ ├── comments table │
│ └── labels table │
└─────────────────────────────────────────────┘
Each domain contains only the tables that are tightly related to each other.
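The domain definitions above can be captured in a simple lookup table. Here is a minimal Python sketch; the mapping mirrors the example domains in the diagram, not GitHub's actual internal configuration:

```python
# Hypothetical schema-domain registry, mirroring the example domains above.
SCHEMA_DOMAINS = {
    "repositories": ["repositories", "commits", "branches", "pull_requests"],
    "users": ["users", "profiles", "sessions"],
    "notifications": ["notifications", "email_preferences", "notification_logs"],
    "issues": ["issues", "comments", "labels"],
}

# Invert the registry so any table can be mapped back to its owning domain.
TABLE_TO_DOMAIN = {
    table: domain
    for domain, tables in SCHEMA_DOMAINS.items()
    for table in tables
}

def domain_of(table: str) -> str:
    """Return the schema domain that owns a given table."""
    return TABLE_TO_DOMAIN[table]
```

A registry like this becomes the single source of truth that both the query linter and the production alerts can consult.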
Step 2: Enforce Boundaries — Eliminate Cross-Domain Queries
Once domains were defined, GitHub's goal was: No query or transaction should touch tables from more than one domain.
Example of a violation (cross-domain join):
-- ❌ BAD: This query joins Repositories tables with Notifications tables
SELECT r.name, n.message
FROM repositories r
JOIN notifications n ON r.user_id = n.user_id
WHERE r.id = 12345;
This query reaches across two domains. If we later move Notifications to a separate database, this query will break.
Example of a clean query (single domain):
-- ✅ GOOD: This query only touches Repositories tables
SELECT r.name, b.branch_name
FROM repositories r
JOIN branches b ON r.id = b.repository_id
WHERE r.id = 12345;
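How do you fix a cross-domain join? The usual pattern is to split it into two single-domain queries and combine the results in application code. A sketch of that refactor, where `query_repos` and `query_notifs` are assumed helpers bound to each domain's database connection:

```python
def repo_with_notifications(query_repos, query_notifs, repo_id):
    """Replace the cross-domain JOIN with two single-domain queries,
    joined in application code instead of in the database."""
    # Query 1: touches only the Repositories domain.
    repo = query_repos(
        "SELECT name, user_id FROM repositories WHERE id = %s", (repo_id,)
    )
    # Query 2: touches only the Notifications domain.
    messages = query_notifs(
        "SELECT message FROM notifications WHERE user_id = %s", (repo["user_id"],)
    )
    return {"name": repo["name"], "messages": messages}
```

The trade-off is an extra round trip, but each query now stays inside one domain, so the underlying tables are free to live on separate clusters.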
Step 3: Use Query Linters in Dev/Test Environments
GitHub built a tool — a query linter — that automatically scanned every database query in the codebase during development and testing.
If a developer wrote a query that crossed domain boundaries, the linter would flag it with an error — just like a spell-checker flags a misspelled word.
┌─────────────────────────────────────────────────┐
│ QUERY LINTER IN ACTION │
│ │
│ Developer writes query ──► Linter checks it │
│ │
│ ┌──────────────────────────────┐ │
│ │ Query: SELECT * FROM repos │ │
│ │ JOIN notifications ON ... │ │
│ └──────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────┐ │
│ │ 🔴 LINTER ERROR: │ │
│ │ "Cross-domain join │ │
│ │ detected! repos domain │ │
│ │ cannot join with │ │
│ │ notifications domain." │ │
│ └──────────────────────────────┘ │
│ │
│ Developer must refactor the query │
│ into two separate domain-specific queries. │
└─────────────────────────────────────────────────┘
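A toy version of such a linter fits in a few lines of Python. The regex-based extraction below is deliberately naive (GitHub's real linter understands SQL properly, not via regexes), and the mapping is a small hypothetical subset:

```python
import re

# Hypothetical subset of the table-to-domain mapping.
TABLE_TO_DOMAIN = {
    "repositories": "repositories",
    "branches": "repositories",
    "notifications": "notifications",
    "users": "users",
}

def tables_in(sql: str) -> set:
    """Naively pull out table names that follow FROM or JOIN."""
    return {t.lower() for t in re.findall(r"\b(?:FROM|JOIN)\s+(\w+)", sql, re.IGNORECASE)}

def lint(sql: str) -> None:
    """Raise if a query touches tables from more than one schema domain."""
    domains = {TABLE_TO_DOMAIN[t] for t in tables_in(sql) if t in TABLE_TO_DOMAIN}
    if len(domains) > 1:
        raise ValueError(f"Cross-domain query touches domains: {sorted(domains)}")
```

Wired into CI, a check like this fails the build on any new cross-domain query, which is exactly how the boundary stays clean over time.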
Step 4: Production Alerts for Cross-Domain Transactions
Even after linting, some cross-domain transactions might slip through to production (maybe from older code, or edge cases).
GitHub added monitoring in production that raised alerts when a transaction spanned multiple schema domains.
This didn't block the transaction (that would cause downtime), but it notified the team so they could fix it.
Think of it like a security camera: it doesn't physically stop the thief, but it records and alerts so you can respond.
Real-World Analogy
Imagine a company with one giant open-plan office. Marketing, Engineering, Sales, and HR all share the same space, same printer, same coffee machine.
Virtual Partitioning is like:
- Defining zones: "Marketing sits in the east wing, Engineering in the west wing."
- Making sure Marketing documents don't reference Engineering's internal files.
- Installing a system that alerts if someone from Marketing accidentally uses Engineering's private printer.
Nobody has physically moved yet. Everyone is still in the same building. But the boundaries are now clear and enforced.
Common Mistakes & Misunderstandings
❌ "Virtual partitioning alone solves the scaling problem."
No. The data is still on the same physical server. You still have shared resources and the noisy neighbor problem. Virtual partitioning is a prerequisite for physical partitioning, not a substitute.
❌ "We can skip virtual partitioning and directly move tables."
Dangerous. If your code has cross-domain dependencies, moving tables will break your application. Virtual partitioning ensures you're safe to move.
4.3 — Phase 2: Physical Partitioning (The Actual Migration)
Simple Definition
Physical partitioning means actually moving a schema domain's tables from the original database cluster to a brand-new, independent database cluster — while the application is live and serving users.
Why It Exists
Virtual partitioning drew boundaries, but everything is still on the same server. The noisy neighbor problem persists. Physical partitioning gives each domain its own hardware resources — its own CPU, memory, and disk I/O.
After physical partitioning, the "Notifications" domain running a heavy query will only affect its own cluster, not the "Repositories" cluster.
How It Works — Step by Step (The 5-Step Process)
This is the core of GitHub's engineering achievement. Let's go step by step.
For clarity, let's say we're migrating the "Notifications" domain from the old cluster (Cluster A) to a new cluster (Cluster B).
Step 1: Take a Snapshot of the Domain's Tables from Cluster A
┌────────────────────────────────┐
│ CLUSTER A │
│ (Original Database) │
│ │
│ ┌──────────┐ ┌──────────────┐ │
│ │ Repos │ │ Notifications│ │ ──── Snapshot taken
│ │ Tables │ │ Tables │ │ of Notifications
│ └──────────┘ └──────┬───────┘ │ tables only
│ │ │
└──────────────────────┼─────────┘
│
▼
┌────────────────┐
│ SNAPSHOT │
│ (Point-in-time │
│ copy of │
│ Notifications │
│ data) │
└────────────────┘
A snapshot is like taking a photograph of the Notifications tables at a specific moment in time. It captures all the rows, all the data — as it exists right now.
Important: While the snapshot is being taken, Cluster A continues serving traffic normally. No downtime.
Step 2: Load the Snapshot into Cluster B
The snapshot is loaded into the new Cluster B — including its primary and replica servers.
┌────────────────┐
│ SNAPSHOT │
└───────┬────────┘
│
▼
┌────────────────────────┐
│ CLUSTER B │
│ (New Database) │
│ │
│ ┌──────────────────┐ │
│ │ Notifications │ │
│ │ Tables │ │
│ │ (from snapshot) │ │
│ └──────────────────┘ │
│ │
│ Primary + Replicas │
└────────────────────────┘
At this point, Cluster B has the Notifications data, but it's frozen at the time the snapshot was taken. Any changes that happened on Cluster A after the snapshot are not yet in Cluster B.
Step 3: Set Up Replication from Cluster A → Cluster B
Now GitHub sets up live replication from Cluster A's primary to Cluster B's primary. This means any new changes to Notifications data on Cluster A are automatically streamed to Cluster B.
┌────────────────────┐ ┌────────────────────┐
│ CLUSTER A │ │ CLUSTER B │
│ (Original) │ │ (New) │
│ │ │ │
│ ┌──────────────┐ │ ─────── │ ┌──────────────┐ │
│ │ Primary │──┼─Replication──│ Primary │ │
│ └──────────────┘ │ ─────► │ └──────────────┘ │
│ │ │ │
│ All live changes │ │ Receiving changes │
│ happen here │ │ in real-time │
└────────────────────┘ └────────────────────┘
Cluster B is essentially a replica of Cluster A — but only for the Notifications tables. It's catching up on all the changes that happened since the snapshot.
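Concretely, "replicate only one domain's tables" can be expressed with MySQL's replication filters. The sketch below issues modern MySQL (8.0+) statements through an assumed `run_on_b` helper; the host name, schema name, and binlog coordinates are placeholders, not real values:

```python
# Statements to run on Cluster B's primary so that it replicates only the
# Notifications tables from Cluster A. All identifiers are placeholders.
REPLICATION_SETUP = [
    # Point Cluster B at Cluster A's primary, starting at the
    # binlog position captured when the snapshot was taken.
    "CHANGE REPLICATION SOURCE TO SOURCE_HOST='cluster-a-primary', "
    "SOURCE_LOG_FILE='binlog.000123', SOURCE_LOG_POS=4",
    # Replicate only tables matching the Notifications domain.
    "CHANGE REPLICATION FILTER REPLICATE_WILD_DO_TABLE = ('github.notification%')",
    "START REPLICA",
]

def set_up_replication(run_on_b):
    """Issue each setup statement against Cluster B's primary."""
    for stmt in REPLICATION_SETUP:
        run_on_b(stmt)
```

The key detail is the starting binlog position: it must match the moment the snapshot was taken, so that no change is missed and none is applied twice.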
Step 4: Redirect Traffic via ProxySQL (But Still to Cluster A)
Here's the clever part.
GitHub now updates the application to route all Notifications-related queries through Cluster B's ProxySQL. But — and this is key — that ProxySQL is configured to forward all queries back to Cluster A.
Why? Because Cluster B might not be fully caught up yet. You can't serve queries from Cluster B if it's behind Cluster A.
┌──────────┐ ┌──────────────────┐ ┌────────────┐
│ │ │ CLUSTER B's │ │ │
│ App │────►│ ProxySQL │────►│ CLUSTER A │
│ Server │ │ │ │ (still │
│ │ │ (Middleman - │ │ serving │
│ │ │ routes to A │ │ data) │
│ │ │ for now) │ │ │
└──────────┘ └──────────────────┘ └────────────┘
This is like changing the receptionist: the new receptionist (Cluster B's ProxySQL) is now at the front desk, but for now they forward every patient to the old hospital. The patients (the application) don't notice any difference.
Why do this intermediate step?
Because when it's time for the final cut-over, the application is already talking to Cluster B's ProxySQL. You only need to change ProxySQL's routing from "forward to A" → "serve from B directly." No application code changes needed.
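The "new receptionist forwarding to the old hospital" trick is easy to see in miniature. Here is a toy stand-in for ProxySQL (all names hypothetical): the application only ever talks to the proxy object, so the eventual cut-over is a one-field change rather than an application deploy:

```python
class DomainProxy:
    """Toy stand-in for ProxySQL: holds a mutable backend pointer."""

    def __init__(self, backend: str):
        self.backend = backend

    def route(self, sql: str) -> str:
        # Real ProxySQL would open a connection to the backend and
        # forward the query; here we just report where it would go.
        return self.backend

# Step 4 state: the app talks to Cluster B's proxy, which still
# forwards everything to Cluster A.
proxy = DomainProxy("cluster_a")
```

When the cut-over happens, flipping `proxy.backend` to `"cluster_b"` redirects every subsequent query without touching application code.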
Step 5: The Cut-Over (The Big Moment)
This is the most critical step. Let's break it down into micro-steps:
┌──────────────────────────────────────────────────────────┐
│ CUT-OVER PROCESS │
│ │
│ ① Check replication lag between Cluster A & B │
│ └── Must be < 1 second │
│ │
│ ② Temporarily BLOCK all requests │
│ └── ProxySQL holds all incoming queries briefly │
│ └── This ensures no new data is written to A │
│ │
│ ③ Wait for Cluster B to FULLY synchronize │
│ └── Cluster B processes the last remaining │
│ replicated changes from A │
│ │
│ ④ STOP replication from Cluster A │
│ └── Cluster B is now independent │
│ │
│ ⑤ Update ProxySQL routing │
│ └── Route traffic DIRECTLY to Cluster B's primary │
│ │
│ ⑥ UNBLOCK all requests │
│ └── Traffic now flows to Cluster B │
│ │
│ Total time: < 100 milliseconds │
└──────────────────────────────────────────────────────────┘
Let me walk through each micro-step:
① Check Replication Lag:
Before initiating the cut-over, GitHub monitors the replication lag. The cut-over only begins when the lag is less than 1 second — meaning Cluster B is almost perfectly in sync with Cluster A.
② Temporarily Block All Requests:
ProxySQL briefly holds all incoming queries. No queries reach either Cluster A or Cluster B. This creates a tiny window where no new writes happen.
Think of it like a traffic cop stopping all cars at an intersection for 2 seconds to let an ambulance pass.
③ Wait for Full Synchronization:
In this blocked state (which lasts milliseconds), Cluster B processes the final remaining replication data from Cluster A. After this, Cluster B has 100% of the Notifications data — exactly matching Cluster A.
④ Stop Replication:
The replication link from A to B is cut. Cluster B is now a standalone, independent cluster. It no longer needs Cluster A.
⑤ Update ProxySQL Routing:
ProxySQL's configuration is updated: instead of forwarding queries to Cluster A, it now sends them directly to Cluster B's primary.
⑥ Unblock Requests:
All the held queries are released. They flow to Cluster B, which is now the authoritative database for Notifications.
The entire process — from blocking requests to unblocking — takes less than 100 milliseconds. That's 0.1 seconds. A human blink takes 300-400 milliseconds. The "downtime" is shorter than a blink.
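The six micro-steps translate almost line-for-line into orchestration code. A minimal sketch, with every cluster interaction abstracted into a callable (the names here are illustrative, not GitHub's actual tooling):

```python
import time

MAX_LAG_SECONDS = 1.0  # Step 1 threshold: don't start until lag < 1s.

def cut_over(proxy, get_lag, block_traffic, unblock_traffic, stop_replication):
    """Run the cut-over micro-steps against abstracted cluster operations."""
    # 1. Wait until replication lag is under the threshold.
    while get_lag() >= MAX_LAG_SECONDS:
        time.sleep(0.1)
    # 2. Hold all incoming queries at the proxy.
    block_traffic()
    try:
        # 3. Drain the last replicated changes from Cluster A.
        while get_lag() > 0:
            time.sleep(0.001)
        # 4. Cut the replication link; Cluster B is now independent.
        stop_replication()
        # 5. Flip routing so queries go directly to Cluster B.
        proxy.backend = "cluster_b"
    finally:
        # 6. Release the held queries toward Cluster B.
        unblock_traffic()
```

Notice that all the expensive work (snapshot, load, catch-up) happened long before this function runs; that is why the window between `block_traffic` and `unblock_traffic` can be a handful of milliseconds.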
The Complete Physical Partitioning Flow (Combined Diagram)
┌─────────────────────────────────────────────────────────────────┐
│ PHYSICAL PARTITIONING - COMPLETE FLOW │
│ │
│ Step 1: Snapshot │
│ ┌──────────┐ │
│ │Cluster A │──── Take snapshot of ────► ┌──────────┐ │
│ │(Original)│ Notifications tables │ Snapshot │ │
│ └──────────┘ └────┬─────┘ │
│ │ │
│ Step 2: Load snapshot into Cluster B │ │
│ ▼ │
│ ┌──────────┐ │
│ │Cluster B │ │
│ │ (New) │ │
│ └────┬─────┘ │
│ │ │
│ Step 3: Set up replication A → B │ │
│ ┌──────────┐ ┌────┴─────┐ │
│ │Cluster A │════ Replication Stream ═══►│Cluster B │ │
│ │ Primary │ │ Primary │ │
│ └──────────┘ └────┬─────┘ │
│ │ │
│ Step 4: Redirect app → Cluster B's ProxySQL │ │
│ ┌──────┐ ┌──────────────┐ ┌────┴─────┐ │
│ │ App │───►│ B's ProxySQL │───────────►│Cluster A │ │
│ └──────┘ └──────────────┘ (still └──────────┘ │
│ routes │
│ to A) │
│ │
│ Step 5: Cut-over (< 100ms) │
│ ┌──────┐ ┌──────────────┐ ┌──────────┐ │
│ │ App │───►│ B's ProxySQL │───────────►│Cluster B │ ✅ │
│ └──────┘ └──────────────┘ (now └──────────┘ │
│ routes │
│ to B!) │
│ │
│ Cluster A no longer handles Notifications. │
│ Cluster B is fully independent. │
└─────────────────────────────────────────────────────────────────┘
Real-World Analogy (Complete)
Let's use the hospital analogy one final time.
The Hospital Analogy:
Snapshot: You photocopy all Cardiology patient records from the main hospital's filing cabinet.
Load into new clinic: You bring the photocopies to the new Heart Clinic across the street and set up their filing cabinet.
Replication: You set up a live fax machine between the main hospital and the Heart Clinic. Any new Cardiology records added to the main hospital are automatically faxed to the Heart Clinic.
Redirect reception: You tell the main hospital's receptionist: "When a Cardiology patient comes in, send them to the Heart Clinic's receptionist." But the Heart Clinic's receptionist, for now, sends them back to the main hospital (because the Heart Clinic isn't fully set up yet).
Cut-over: The Heart Clinic is fully caught up. In one swift moment:
- The receptionist stops accepting patients for 0.1 seconds
- The fax machine confirms all records are synced
- The fax line is disconnected
- The Heart Clinic's receptionist starts directing patients to the Heart Clinic's own doctors
- Patients resume flowing — now to the Heart Clinic directly
Nobody in the waiting room even noticed the switch.
Common Mistakes & Misunderstandings
❌ "The cut-over requires the app to be shut down."
No. ProxySQL acts as a middleman. The app doesn't know or care whether queries go to Cluster A or B. ProxySQL handles the routing transparently.
❌ "100 milliseconds sounds too good to be true."
It's achievable because most of the work is done before the cut-over. The snapshot, loading, replication, and catch-up happen over hours or days. The cut-over itself is just: block → sync the last few transactions → switch routing → unblock.
❌ "What about data loss?"
The blocking step ensures no new writes happen during the switch. The synchronization step ensures Cluster B has 100% of the data. There is no window where data can be lost.
❌ "Can you do this for all domains at once?"
GitHub did it one domain at a time. Migrating everything at once would be too risky. Each domain was virtually partitioned, validated, and then physically migrated independently.
5. Visual Explanation Section — Summary of All Diagrams
Here's a consolidated list of all diagrams you should create/reference for this blog:
| # | Diagram Type | What It Shows |
|---|---|---|
| 1 | Two-tier vertical flowchart | Phase 1 (Virtual) → Phase 2 (Physical), overall strategy |
| 2 | Shared resource diagram | All domains in one database, noisy neighbor effect |
| 3 | Colored regions within one box | Schema domains drawn inside the monolithic database |
| 4 | CI/CD flowchart | Query linter catching cross-domain joins in dev/test |
| 5 | Two-box replication diagram | Cluster A primary → Cluster B primary, replication arrow |
| 6 | Timeline diagram | The cut-over micro-steps, with the <100ms window highlighted |
| 7 | 5-panel sequential diagram | Complete physical migration flow (most important visual) |
| 8 | Before/After architecture | Monolith → multiple independent clusters (final state) |
BEFORE AFTER
────── ─────
┌──────────┐ ┌──────────┐ ┌──────────┐
│ App │ │ App │ │ App │
└────┬─────┘ └──┬───┬───┘ └──┬───┬───┘
│ │ │ │ │
▼ ▼ │ │ ▼
┌──────────────┐ ┌────────┐ │ ┌──────┴──┐
│ ONE GIANT │ │ProxySQL│ │ │ProxySQL │
│ DATABASE │ │ (A) │ │ │ (B) │
│ │ └───┬────┘ │ └───┬─────┘
│ ┌────┐┌────┐│ ▼ │ ▼
│ │Repo││User││ ┌────────┐ │ ┌────────┐
│ └────┘└────┘│ │Cluster │ │ │Cluster │
│ ┌────┐┌────┐│ │ for │ │ │ for │
│ │Noti││Issu││ │ Repos │ │ │ Notifs │
│ └────┘└────┘│ └────────┘ │ └────────┘
└──────────────┘ │
┌──────┴──┐
│ProxySQL │
│ (C) │
└───┬─────┘
▼
┌────────┐
│Cluster │
│ for │
│ Users │
└────────┘
6. Realistic Use Cases
Where This Strategy Is Actually Used
GitHub's approach isn't unique to GitHub. The pattern of virtual partitioning → physical partitioning is used widely in the industry:
| Company | Use Case |
|---|---|
| GitHub | Sharded monolithic MySQL to isolate repos, users, notifications, etc. |
| Shopify | Sharded their monolithic MySQL database to handle millions of merchants. They built a similar tool called "Ghostferry" for live data migration. |
| Pinterest | Moved from a monolithic MySQL to sharded clusters as they scaled to billions of pins. |
| Instagram | Migrated from a monolithic PostgreSQL database to sharded PostgreSQL as they grew beyond 1 billion users. |
| Any growing startup | When you hit the scaling ceiling of a single database, this is the playbook. |
Why Companies Care
- Reliability: One domain's failure doesn't bring down the entire platform.
- Performance: Each domain gets dedicated resources — no more noisy neighbors.
- Independent scaling: The "Repositories" domain needs more powerful hardware? Scale just that cluster, not the entire database.
- Team independence: The Notifications team can deploy changes to their database without coordinating with the Repositories team.
7. Connecting All Concepts Together
Let's zoom out and see how everything connects into one coherent system.
┌──────────────────────────────────────────────────────────────┐
│ THE COMPLETE PICTURE │
│ │
│ 1. GitHub starts with ONE big MySQL database (Monolith) │
│ │ │
│ ▼ │
│ 2. Problems emerge: Query overload + Noisy neighbors │
│ │ │
│ ▼ │
│ 3. Decision: SHARD the database │
│ │ │
│ ┌─────────┴──────────┐ │
│ ▼ ▼ │
│ 4. PHASE 1: Virtual 5. PHASE 2: Physical │
│ Partitioning Partitioning │
│ ┌─────────────────┐ ┌──────────────────────┐ │
│ │• Define domains │ │• Snapshot tables │ │
│ │• Kill cross- │ │• Load into new │ │
│ │ domain queries │ │ cluster │ │
│ │• Add linters │ │• Set up replication │ │
│ │• Add production │ │• Redirect via │ │
│ │ alerts │ │ ProxySQL │ │
│ │ │ │• Cut-over (<100ms) │ │
│ │ (Prerequisite │ │ │ │
│ │ for Phase 2) │ │ (Repeat per domain) │ │
│ └─────────────────┘ └──────────────────────┘ │
│ │ │
│ ▼ │
│ 6. Result: Multiple independent database clusters, │
│ each serving one domain, zero visible downtime, │
│ no noisy neighbors, independently scalable. │
└──────────────────────────────────────────────────────────────┘
In plain English:
GitHub had one giant database that was struggling. They couldn't just split it randomly — their code had queries reaching across different parts of the database. So first, they drew invisible boundaries (virtual partitioning) and made sure nothing crossed them. Once a domain was cleanly isolated, they physically moved it to its own dedicated server using a snapshot-replicate-cutover strategy that caused less than 100ms of disruption. They repeated this process for each domain, one at a time, until the monolith was fully broken up.
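The snapshot, replicate, cut-over sequence described above can be sketched as a tiny in-memory simulation. Everything below is illustrative: dicts and lists stand in for MySQL clusters and ProxySQL, and only the ordering of the steps reflects the strategy described in the text.

```python
# Minimal in-memory simulation of snapshot -> replicate -> cut-over.
# Real migrations use MySQL snapshots and replication, not Python lists.

class Cluster:
    def __init__(self, name):
        self.name = name
        self.rows = []           # pretend table data

class Proxy:
    def __init__(self, cluster):
        self.target = cluster    # where queries are currently routed
        self.writes_blocked = False

def migrate(old, new, proxy):
    # Phase A: heavy lifting while traffic still flows to the old cluster.
    snapshot = list(old.rows)    # 1. snapshot the tables
    new.rows = snapshot          # 2. load the snapshot into the new cluster
    new.rows = list(old.rows)    # 3. "replication" catches up post-snapshot writes

    # Phase B: the cut-over itself, designed to finish in under 100 ms.
    proxy.writes_blocked = True  # briefly hold incoming writes
    new.rows = list(old.rows)    # drain the final replicated rows
    proxy.target = new           # switch routing to the new cluster
    proxy.writes_blocked = False # release held writes

old = Cluster("monolith"); old.rows = ["repo1", "repo2"]
new = Cluster("repos-cluster")
proxy = Proxy(old)
migrate(old, new, proxy)
print(proxy.target.name, proxy.target.rows)  # repos-cluster ['repo1', 'repo2']
```

Notice that steps 1 to 3 can take hours and nobody cares, because traffic is untouched. Only Phase B sits inside the write-block window, and every step in it is cheap.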
8. Final Summary — Professor Style
Let me recap what we've learned today. If you remember nothing else, remember these five things:
🔑 Key Takeaways
Monolithic databases are a fine starting point, but they eventually become bottlenecks as you scale — both in performance (query volume) and in isolation (noisy neighbor problem).
Sharding is the solution, but you can't just rip tables apart. You need a disciplined approach.
Virtual Partitioning comes first. Before you physically move data, you must logically isolate each schema domain. Kill cross-domain queries. Kill cross-domain transactions. Use linters and monitoring to enforce this.
Physical Partitioning is the actual migration. It follows a careful 5-step process: Snapshot → Load → Replicate → Redirect → Cut-over. The magic is in ProxySQL (the middleman that makes the switch invisible to the application).
The cut-over takes less than 100ms. This is possible because all the heavy lifting (snapshot, replication, catch-up) happens before the cut-over. The cut-over itself is just: block briefly → sync → switch → unblock.
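To make Takeaway 3 concrete: a cross-domain query linter can be approximated in a few lines by extracting the table names a query touches and flagging any query that spans more than one domain. GitHub's real tooling is far more sophisticated; the regex and the domain map below are simplifications invented for this sketch.

```python
import re

# Simplified domain map: each table belongs to exactly one schema domain.
# Table and domain names here are illustrative, not GitHub's actual schema.
TABLE_DOMAIN = {
    "repositories": "repos",
    "stars": "repos",
    "notifications": "notifs",
    "users": "users",
}

def tables_in(query: str) -> set[str]:
    """Naively extract table names that follow FROM or JOIN keywords."""
    return set(re.findall(r"\b(?:from|join)\s+(\w+)", query, re.IGNORECASE))

def crosses_domains(query: str) -> bool:
    """True if the query touches tables from more than one schema domain."""
    domains = {TABLE_DOMAIN[t] for t in tables_in(query) if t in TABLE_DOMAIN}
    return len(domains) > 1

# A JOIN across the repos and notifs domains gets flagged:
print(crosses_domains("SELECT * FROM repositories JOIN notifications ON 1=1"))  # True
# A JOIN inside a single domain is fine:
print(crosses_domains("SELECT * FROM repositories JOIN stars ON 1=1"))          # False
```

A real linter would parse SQL properly instead of using a regex, but the enforcement idea is the same: make cross-domain access fail loudly in CI before it ever reaches production.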
How This Knowledge Helps You
- If you're a backend developer, you now understand one of the most complex database operations companies face. This is senior-level knowledge.
- If you're preparing for system design interviews, database sharding and zero-downtime migrations are frequently asked topics. You now have a concrete, real-world example to reference.
- If you're a startup developer, you know the roadmap: start with a monolith, and when you outgrow it, follow this phased approach.
9. When NOT to Use This Approach & Trade-Offs
When NOT to Shard
| Situation | Why Sharding is Overkill |
|---|---|
| Your database is under 100GB | A single well-tuned server can handle this easily. |
| You have fewer than 1,000 queries/second | You're not at the scale where this matters. |
| Your team is small (< 5 engineers) | Sharding adds massive operational complexity. Managing multiple clusters requires dedicated infrastructure engineers. |
| You haven't tried simpler solutions first | Read replicas, query optimization, caching (Redis/Memcached), and connection pooling can often delay the need for sharding by years. |
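The "simpler solutions first" row deserves emphasis. A cache-aside layer, for instance, is often days of work versus a multi-quarter sharding project. Here is a minimal sketch of the pattern, with a plain dict standing in for Redis:

```python
# Cache-aside pattern: check the cache first, fall back to the database,
# then populate the cache. A dict stands in for Redis in this sketch.

cache = {}
db_reads = 0  # counter so we can see how much load the cache absorbs

def query_db(user_id):
    """Pretend database lookup; each call counts as one unit of DB load."""
    global db_reads
    db_reads += 1
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    if user_id in cache:         # cache hit: no database load at all
        return cache[user_id]
    row = query_db(user_id)      # cache miss: hit the database once
    cache[user_id] = row         # ...and remember the result
    return row

get_user(42); get_user(42); get_user(42)
print(db_reads)  # 1 -- two of the three reads never touched the database
```

For read-heavy workloads (which most web apps are), this kind of layer can absorb the bulk of query volume and push the sharding decision years into the future.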
Trade-Offs of Sharding
| Benefit | Cost |
|---|---|
| Better performance per domain | More clusters to manage, monitor, and backup |
| No noisy neighbors | Cross-domain queries become impossible — you must use application-level joins or event-driven architecture |
| Independent scaling | Operational complexity increases significantly |
| Team independence | Need robust tooling (ProxySQL, linters, monitoring) |
| Higher availability | More potential points of failure (more clusters = more servers that can go down) |
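The "application-level joins" cost in the table above looks like this in practice: instead of one SQL JOIN, you run one query per cluster and stitch the results together in code. A sketch, with in-memory lists standing in for result sets from two independent clusters:

```python
# After sharding, users and repos live in different clusters, so a single
# SQL JOIN is impossible. Query each cluster separately, then join in code.

# Pretend result sets fetched from two independent clusters:
users_cluster = [{"id": 1, "login": "alice"}, {"id": 2, "login": "bob"}]
repos_cluster = [{"owner_id": 1, "name": "proj-a"},
                 {"owner_id": 1, "name": "proj-b"},
                 {"owner_id": 2, "name": "proj-c"}]

def repos_with_owners():
    """Application-level join: index users by id, then attach to each repo."""
    users_by_id = {u["id"]: u for u in users_cluster}
    return [
        {"repo": r["name"], "owner": users_by_id[r["owner_id"]]["login"]}
        for r in repos_cluster
    ]

print(repos_with_owners()[0])  # {'repo': 'proj-a', 'owner': 'alice'}
```

This works, but note what you gave up: the database can no longer optimize the join, and you now pay two round trips plus the memory to hold both result sets. That is the price of "no noisy neighbors."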
Beginner Learning Roadmap
If this topic excites you, here's what to learn next:
1. SQL fundamentals (JOINs, transactions, indexes)
│
▼
2. Database replication (primary-replica architecture)
│
▼
3. Read replicas and load balancing
│
▼
4. Database sharding strategies (horizontal vs. vertical)
│
▼
5. Proxy layers (ProxySQL, PgBouncer, Vitess)
│
▼
6. Zero-downtime migration patterns
│
▼
7. Distributed systems fundamentals (CAP theorem, consistency)
Wrapping Up
What GitHub achieved is genuinely impressive engineering. They didn't shut down a platform used by 100+ million developers. They didn't lose a single byte of data. They didn't introduce hours of downtime.
They drew boundaries. They enforced discipline. They moved carefully, one domain at a time. And the result? A database architecture that can scale with them for the next decade.
The next time someone tells you "just shard the database," you'll know it's not that simple — and you'll know exactly what it takes.
Thanks for reading. If this blog helped you understand database sharding, consider bookmarking it and sharing it with someone who's preparing for system design interviews. This is the kind of knowledge that separates good engineers from great ones.