Rajkiran

Posted on May 31

System Design - 9.Database Sharding & Replication, How Facebook Serves a Billion Reads Per Second

#database #distributedsystems #performance #systemdesign

The Single Database Problem
In 2004, Facebook launched. One server. One database.
By 2008, they had 100 million users. The database was choking — not on storage, but on query volume. Every page load hit the same database. Peak traffic meant thousands of simultaneous queries. Response times were climbing. Engineers were sweating.
The naive fix was to upgrade the database server — more CPU, more RAM. They did. It helped. For a while.
Then they hit the wall every growing system hits: no single machine, no matter how powerful, can serve the entire world.
The solution? Sharding and replication — splitting data across multiple machines (sharding) and copying it across multiple machines (replication). These two techniques are the backbone of every database system operating at internet scale.
Today we break both down completely.

Replication First: Solving the Read Problem
Before you split your data, you copy it. Replication creates multiple identical copies of your database on different servers.

Why replicate?

Read scalability: Route read traffic across multiple replicas instead of hammering one server
High availability: If one server dies, others take over
Geographic distribution: Put replicas closer to your users to reduce latency

Pattern 1: Master-Slave (Primary-Replica) Replication
The most common replication topology:
Writes
↓
[Primary/Master]
/ | \
↓ ↓ ↓ (replication — async or sync)
[Replica] [Replica] [Replica]
↑ ↑ ↑
Reads distributed

Rules:

ALL writes go to the Primary
Reads can go to any Replica (or Primary)
Replicas are read-only — they receive changes from Primary

Replication modes:

Synchronous replication:
Write confirmed ONLY after all replicas acknowledge
→ Strong consistency (all replicas always in sync)
→ Higher write latency (must wait for all replicas)
→ If a replica is slow, your writes are slow

Asynchronous replication:
Write confirmed after Primary writes, replicas catch up later
→ Eventual consistency (brief lag between Primary and Replicas)
→ Low write latency (don't wait for replicas)
→ Risk: if Primary dies before replica catches up, recent writes are lost

Semi-synchronous (MySQL default):
Write confirmed after at LEAST ONE replica acknowledges
→ Balance between consistency and latency
→ At least one replica always has the latest data
→ If Primary dies, that one replica can take over without data loss

Replication lag — the delay between a write on Primary and its appearance on Replicas — is one of the most common sources of bugs in production systems. A user writes a post and immediately refreshes — they might read from a replica that hasn't caught up yet and see their post missing. This is solved by:

Routing the user's own reads to Primary for a short window after writes
Read-your-writes consistency (from Day 2)
Monitoring replication lag with alerts when it exceeds thresholds

Pattern 2: Master-Master (Multi-Primary) Replication
Both nodes accept writes. Changes are replicated bidirectionally.
[Primary A] ←—————→ [Primary B]
↑ ↑
Writes Writes

Advantages:

Both nodes can handle writes — double the write capacity
If one fails, the other keeps serving reads AND writes

The problem — write conflicts:
User A (via Primary A): UPDATE balance SET amount = 100
User B (via Primary B): UPDATE balance SET amount = 200
Both happen simultaneously — which wins?

This is genuinely hard to resolve correctly. Most databases handle it with:

Last-write-wins (simple but loses data)
Application-level conflict resolution (complex but correct)
Operational transformation (used in Google Docs)

When to use: Geographic multi-region deployments where you need writes in multiple regions simultaneously. Not recommended as a simple "more writes" solution — the conflict complexity is usually not worth it.

Pattern 3: Leaderless Replication (Quorum-Based)
No single primary. Any node can accept writes. Reads and writes use quorum to achieve consistency.

Client writes to [N1], [N2], [N3] simultaneously
Write succeeds when W nodes confirm (e.g., W=2 of 3)

Client reads from [N1], [N2], [N3] simultaneously
Read succeeds when R nodes respond (e.g., R=2 of 3)

W + R > N ensures at least one overlapping node has the latest write

Used by: Cassandra, Amazon Dynamo, Riak.

Advantages:

No single point of failure
Any node can serve reads or writes
Naturally supports multi-datacenter deployments

Disadvantages:

More complex consistency model
"Last write wins" conflicts need careful handling
Slightly higher read/write latency (must contact multiple nodes)

Sharding: Solving the Write and Storage Problem
Replication helps with reads. But what about writes? Every write still goes to a single Primary. And what about storage — when your data outgrows any single machine?

Sharding **(also called horizontal partitioning) splits your data across multiple servers. Each server holds a subset of the data — called a **shard.
Without sharding:
[All 1 billion user rows] → [Single Database]

With sharding (4 shards):
[Users 0–250M] → [Shard 1]
[Users 250M–500M] → [Shard 2]
[Users 500M–750M] → [Shard 3]
[Users 750M–1B] → [Shard 4]
Now each shard handles 1/4 of the queries and stores 1/4 of the data. Add more shards to scale linearly.
But how do you decide which shard a given piece of data goes to? That's the sharding strategy question.

The 4 Sharding Strategies

Strategy 1: Hash-Based Sharding
Apply a hash function to the shard key and use modulo to determine the shard:
pythonshard_id = hash(user_id) % num_shards

Example with 4 shards:

user_id = 12345
shard_id = hash(12345) % 4 = 2 → goes to Shard 2

user_id = 99999
shard_id = hash(99999) % 4 = 1 → goes to Shard 1

Advantages:

Data is distributed evenly (hash functions spread keys uniformly)
Simple to implement

Disadvantages:

Resharding is catastrophic: Add a 5th shard and % 5 remaps almost all data to different shards. Every record needs to move. This is a massive migration.
Range queries are impossible: users 1000-2000 are scattered across all shards

**Solution: **Consistent hashing (Day 8 deep dive) — a technique where adding a node only moves a small fraction of keys, not all of them.

Strategy 2: Range-Based Sharding
Divide the shard key into consecutive ranges:
user_id 1 – 1,000,000 → Shard 1
user_id 1,000,001 – 2,000,000 → Shard 2
user_id 2,000,001 – 3,000,000 → Shard 3

OR by date:
2020-2021 data → Shard 1
2022-2023 data → Shard 2
2024-2025 data → Shard 3

Advantages:

Range queries are efficient: "give me all users 1M–2M" hits exactly one shard
Excellent for time-series data: recent data always on the newest shard

Disadvantages:

Hotspot risk: If your access pattern skews (e.g., everyone is a new user with high IDs), one shard gets all the traffic while others sit idle
Uneven data distribution: one range might have 10x more active data than another

HBase uses range-based sharding (called "regions") with automatic splitting when a region gets too large.

Strategy 3: Directory-Based Sharding
A separate lookup service (the shard directory) maintains a mapping of every key to its shard:

Lookup Service:
user_123 → Shard 3
user_456 → Shard 1
user_789 → Shard 2
...

Query: find user_123

Ask lookup service: "where is user_123?"
Lookup returns: "Shard 3"
Query Shard 3 directly

Advantages:

Flexible: you can move data between shards without changing application logic (just update the directory)
No resharding catastrophe: just update the mapping

Disadvantages:

The lookup service becomes a single point of failure — if it goes down, nothing works
Additional network hop for every query
Directory itself must be highly available and fast

Best for: Systems that need flexibility to rebalance shards over time without downtime.

Strategy 4: Geographic Sharding
Shard by user location — data for European users on European servers, data for Asian users on Asian servers.

Europe users → EU-West Shard (Frankfurt)
US users → US-East Shard (Virginia)
Asia users → AP-Southeast Shard (Singapore)

Advantages:

Dramatically reduced latency — data is physically close to the user
Regulatory compliance (GDPR requires EU user data to stay in EU)

Disadvantages:

Cross-region queries are slow and complex
Uneven shard sizes if user distribution is unequal

Used by: WhatsApp, Facebook, any system with strict data residency requirements.

Hotspot Mitigation: When One Shard Burns
Even with a good sharding strategy, hotspots emerge. The most followed Twitter account, the most popular product on launch day, a viral post — they all create disproportionate load on a single shard.

Key Salting
Add a random suffix to distribute a hot key across multiple shards:

python

Without salting:

key = "celebrity_tweet_12345"
shard = hash(key) % 4 # Always goes to same shard

With salting (spread across 10 virtual shards):

suffix = random.randint(0, 9)
key = f"celebrity_tweet_12345#{suffix}"
shard = hash(key) % 4 # Distributed across multiple shards

On read: query all 10 salted versions and merge results

results = []
for i in range(10):
results.extend(cache.get(f"celebrity_tweet_12345#{i}"))

Trade-off: Writes are spread across shards (good), but reads must query all shards and merge (more complex, more network calls).

Virtual Nodes (vnodes)
Instead of assigning one range to one physical shard, assign many small virtual shards to each physical node:

Without vnodes:
Node A handles: 0–25% of keyspace
Node B handles: 25–50% of keyspace

With vnodes (3 vnodes per node):
Node A handles: 0–8%, 33–41%, 66–74%
Node B handles: 8–16%, 41–50%, 74–83%
Node C handles: 16–33%, 50–66%, 83–100%

If Node A gets overloaded, you can move one of its vnodes to another node with minimal data migration. Cassandra uses vnodes by default (256 vnodes per node).

Facebook TAO: Sharding at Billion-User Scale
Facebook's TAO (The Associations and Objects) system is one of the most sophisticated sharding implementations in existence.

The problem: Facebook's social graph — 3 billion users, all their connections, likes, posts, events — cannot fit in any single database. Even split across thousands of databases, hot users (celebrities, viral content) cause massive hotspots.

TAO's solution:

Hash-based sharding across thousands of MySQL shards
Each shard has a primary in one datacenter + replicas in 2 others
Aggressive in-memory caching (memcached) in front of every shard
Special handling for "hot" objects: frequently accessed objects get replicated to additional cache nodes dynamically

The result: TAO serves over a billion reads per second globally, with single-digit millisecond latency, running on commodity MySQL instances. Not exotic distributed databases — standard MySQL, sharded intelligently with a sophisticated caching layer.

Multi-Datacenter Replication
For global systems, data must exist in multiple geographic regions simultaneously.
US-East DC EU-West DC AP-Southeast DC
[Primary] ←───→ [Primary] ←───→ [Primary]
[Replica 1] [Replica 1] [Replica 1]
[Replica 2] [Replica 2] [Replica 2]

Challenges:

Cross-continent network adds 100-150ms latency to replication
Write conflicts when users in different regions modify the same data
Data residency regulations (GDPR) may prohibit some data from crossing borders

Cassandra's multi-DC replication:
Cassandra handles this natively. You specify replication factor per datacenter:
keyspace 'user_data'
WITH replication = {
'class': 'NetworkTopologyStrategy',
'us-east': 3, -- 3 replicas in US-East
'eu-west': 3, -- 3 replicas in EU-West
'ap-south': 2 -- 2 replicas in AP-Southeast
};

Local reads/writes use quorum within the local DC (fast). Cross-DC replication happens asynchronously (low latency local, eventual global consistency).

Interview Scenario: "Resharding Without Downtime"
This is a real senior interview question. The answer:

The naive approach (wrong):

Add new shards
Migrate all data to new shards
Update routing logic Problem: Downtime during migration. Hours or days.

The production approach:

Phase 1 — Double-write (days/weeks)

New writes go to BOTH old and new shards
Old shard is still authoritative for reads

Phase 2 — Backfill

Copy historical data from old shards to new shards
Verify data consistency

Phase 3 — Switch reads

Gradually route read traffic to new shards (1% → 10% → 50% → 100%)
Monitor for data inconsistencies

Phase 4 — Stop double-writing

Old shards become read-only archives
New shards are now primary

Phase 5 — Decommission old shards

After validation period, old shards are removed

Zero downtime. Weeks-long migration. This is how Pinterest migrated from 6 MySQL shards to 4096 shards in production.

Key Takeaways

Replication copies data for reads and availability.
Sharding splits data for writes and storage.
Primary-Replica replication is the most common pattern — writes to Primary, reads distributed across Replicas.
Replication lag causes read-your-own-writes bugs — solve with session-based read routing.
4 sharding strategies: Hash (even distribution), Range (range queries), Directory (flexible), Geographic (latency + compliance).
Hash sharding has a resharding problem — adding shards remaps most data. Consistent hashing solves this (Day 8).
Hotspots are inevitable — mitigate with key salting and virtual nodes.
Resharding in production requires double-writing, gradual traffic migration, and weeks of careful execution — never a weekend migration.