The Single Database Problem
In 2004, Facebook launched. One server. One database.
By 2008, they had 100 million users. The database was choking — not on storage, but on query volume. Every page load hit the same database. Peak traffic meant thousands of simultaneous queries. Response times were climbing. Engineers were sweating.
The naive fix was to upgrade the database server — more CPU, more RAM. They did. It helped. For a while.
Then they hit the wall every growing system hits: no single machine, no matter how powerful, can serve the entire world.
The solution? Sharding and replication — splitting data across multiple machines (sharding) and copying it across multiple machines (replication). These two techniques are the backbone of every database system operating at internet scale.
Today we break both down completely.
Replication First: Solving the Read Problem
Before you split your data, you copy it. Replication creates multiple identical copies of your database on different servers.
Why replicate?
- Read scalability: Route read traffic across multiple replicas instead of hammering one server
- High availability: If one server dies, others take over
- Geographic distribution: Put replicas closer to your users to reduce latency
Pattern 1: Master-Slave (Primary-Replica) Replication
The most common replication topology:
Writes
↓
[Primary/Master]
/ | \
↓ ↓ ↓ (replication — async or sync)
[Replica] [Replica] [Replica]
↑ ↑ ↑
Reads distributed
Rules:
- ALL writes go to the Primary
- Reads can go to any Replica (or Primary)
- Replicas are read-only — they receive changes from Primary
Replication modes:
Synchronous replication:
Write confirmed ONLY after all replicas acknowledge
→ Strong consistency (all replicas always in sync)
→ Higher write latency (must wait for all replicas)
→ If a replica is slow, your writes are slow
Asynchronous replication:
Write confirmed after Primary writes, replicas catch up later
→ Eventual consistency (brief lag between Primary and Replicas)
→ Low write latency (don't wait for replicas)
→ Risk: if Primary dies before replica catches up, recent writes are lost
Semi-synchronous (MySQL default):
Write confirmed after at LEAST ONE replica acknowledges
→ Balance between consistency and latency
→ At least one replica always has the latest data
→ If Primary dies, that one replica can take over without data loss
Replication lag — the delay between a write on Primary and its appearance on Replicas — is one of the most common sources of bugs in production systems. A user writes a post and immediately refreshes — they might read from a replica that hasn't caught up yet and see their post missing. This is solved by:
- Routing the user's own reads to Primary for a short window after writes
- Read-your-writes consistency (from Day 2)
- Monitoring replication lag with alerts when it exceeds thresholds
Pattern 2: Master-Master (Multi-Primary) Replication
Both nodes accept writes. Changes are replicated bidirectionally.
[Primary A] ←—————→ [Primary B]
↑ ↑
Writes Writes
Advantages:
- Both nodes can handle writes — double the write capacity
- If one fails, the other keeps serving reads AND writes
The problem — write conflicts:
User A (via Primary A): UPDATE balance SET amount = 100
User B (via Primary B): UPDATE balance SET amount = 200
Both happen simultaneously — which wins?
This is genuinely hard to resolve correctly. Most databases handle it with:
- Last-write-wins (simple but loses data)
- Application-level conflict resolution (complex but correct)
- Operational transformation (used in Google Docs)
When to use: Geographic multi-region deployments where you need writes in multiple regions simultaneously. Not recommended as a simple "more writes" solution — the conflict complexity is usually not worth it.
Pattern 3: Leaderless Replication (Quorum-Based)
No single primary. Any node can accept writes. Reads and writes use quorum to achieve consistency.
Client writes to [N1], [N2], [N3] simultaneously
Write succeeds when W nodes confirm (e.g., W=2 of 3)
Client reads from [N1], [N2], [N3] simultaneously
Read succeeds when R nodes respond (e.g., R=2 of 3)
W + R > N ensures at least one overlapping node has the latest write
Used by: Cassandra, Amazon Dynamo, Riak.
Advantages:
- No single point of failure
- Any node can serve reads or writes
- Naturally supports multi-datacenter deployments
Disadvantages:
- More complex consistency model
- "Last write wins" conflicts need careful handling
- Slightly higher read/write latency (must contact multiple nodes)
Sharding: Solving the Write and Storage Problem
Replication helps with reads. But what about writes? Every write still goes to a single Primary. And what about storage — when your data outgrows any single machine?
Sharding **(also called horizontal partitioning) splits your data across multiple servers. Each server holds a subset of the data — called a **shard.
Without sharding:
[All 1 billion user rows] → [Single Database]
With sharding (4 shards):
[Users 0–250M] → [Shard 1]
[Users 250M–500M] → [Shard 2]
[Users 500M–750M] → [Shard 3]
[Users 750M–1B] → [Shard 4]
Now each shard handles 1/4 of the queries and stores 1/4 of the data. Add more shards to scale linearly.
But how do you decide which shard a given piece of data goes to? That's the sharding strategy question.
The 4 Sharding Strategies
Strategy 1: Hash-Based Sharding
Apply a hash function to the shard key and use modulo to determine the shard:
pythonshard_id = hash(user_id) % num_shards
Example with 4 shards:
user_id = 12345
shard_id = hash(12345) % 4 = 2 → goes to Shard 2
user_id = 99999
shard_id = hash(99999) % 4 = 1 → goes to Shard 1
Advantages:
- Data is distributed evenly (hash functions spread keys uniformly)
- Simple to implement
Disadvantages:
- Resharding is catastrophic: Add a 5th shard and % 5 remaps almost all data to different shards. Every record needs to move. This is a massive migration.
- Range queries are impossible: users 1000-2000 are scattered across all shards
**Solution: **Consistent hashing (Day 8 deep dive) — a technique where adding a node only moves a small fraction of keys, not all of them.
Strategy 2: Range-Based Sharding
Divide the shard key into consecutive ranges:
user_id 1 – 1,000,000 → Shard 1
user_id 1,000,001 – 2,000,000 → Shard 2
user_id 2,000,001 – 3,000,000 → Shard 3
OR by date:
2020-2021 data → Shard 1
2022-2023 data → Shard 2
2024-2025 data → Shard 3
Advantages:
- Range queries are efficient: "give me all users 1M–2M" hits exactly one shard
- Excellent for time-series data: recent data always on the newest shard
Disadvantages:
- Hotspot risk: If your access pattern skews (e.g., everyone is a new user with high IDs), one shard gets all the traffic while others sit idle
- Uneven data distribution: one range might have 10x more active data than another
HBase uses range-based sharding (called "regions") with automatic splitting when a region gets too large.
Strategy 3: Directory-Based Sharding
A separate lookup service (the shard directory) maintains a mapping of every key to its shard:
Lookup Service:
user_123 → Shard 3
user_456 → Shard 1
user_789 → Shard 2
...
Query: find user_123
- Ask lookup service: "where is user_123?"
- Lookup returns: "Shard 3"
- Query Shard 3 directly
Advantages:
- Flexible: you can move data between shards without changing application logic (just update the directory)
- No resharding catastrophe: just update the mapping
Disadvantages:
- The lookup service becomes a single point of failure — if it goes down, nothing works
- Additional network hop for every query
- Directory itself must be highly available and fast
Best for: Systems that need flexibility to rebalance shards over time without downtime.
Strategy 4: Geographic Sharding
Shard by user location — data for European users on European servers, data for Asian users on Asian servers.
Europe users → EU-West Shard (Frankfurt)
US users → US-East Shard (Virginia)
Asia users → AP-Southeast Shard (Singapore)
Advantages:
- Dramatically reduced latency — data is physically close to the user
- Regulatory compliance (GDPR requires EU user data to stay in EU)
Disadvantages:
- Cross-region queries are slow and complex
- Uneven shard sizes if user distribution is unequal
Used by: WhatsApp, Facebook, any system with strict data residency requirements.
Hotspot Mitigation: When One Shard Burns
Even with a good sharding strategy, hotspots emerge. The most followed Twitter account, the most popular product on launch day, a viral post — they all create disproportionate load on a single shard.
Key Salting
Add a random suffix to distribute a hot key across multiple shards:
python
Without salting:
key = "celebrity_tweet_12345"
shard = hash(key) % 4 # Always goes to same shard
With salting (spread across 10 virtual shards):
suffix = random.randint(0, 9)
key = f"celebrity_tweet_12345#{suffix}"
shard = hash(key) % 4 # Distributed across multiple shards
On read: query all 10 salted versions and merge results
results = []
for i in range(10):
results.extend(cache.get(f"celebrity_tweet_12345#{i}"))
Trade-off: Writes are spread across shards (good), but reads must query all shards and merge (more complex, more network calls).
Virtual Nodes (vnodes)
Instead of assigning one range to one physical shard, assign many small virtual shards to each physical node:
Without vnodes:
Node A handles: 0–25% of keyspace
Node B handles: 25–50% of keyspace
With vnodes (3 vnodes per node):
Node A handles: 0–8%, 33–41%, 66–74%
Node B handles: 8–16%, 41–50%, 74–83%
Node C handles: 16–33%, 50–66%, 83–100%
If Node A gets overloaded, you can move one of its vnodes to another node with minimal data migration. Cassandra uses vnodes by default (256 vnodes per node).
Facebook TAO: Sharding at Billion-User Scale
Facebook's TAO (The Associations and Objects) system is one of the most sophisticated sharding implementations in existence.
The problem: Facebook's social graph — 3 billion users, all their connections, likes, posts, events — cannot fit in any single database. Even split across thousands of databases, hot users (celebrities, viral content) cause massive hotspots.
TAO's solution:
- Hash-based sharding across thousands of MySQL shards
- Each shard has a primary in one datacenter + replicas in 2 others
- Aggressive in-memory caching (memcached) in front of every shard
- Special handling for "hot" objects: frequently accessed objects get replicated to additional cache nodes dynamically
The result: TAO serves over a billion reads per second globally, with single-digit millisecond latency, running on commodity MySQL instances. Not exotic distributed databases — standard MySQL, sharded intelligently with a sophisticated caching layer.
Multi-Datacenter Replication
For global systems, data must exist in multiple geographic regions simultaneously.
US-East DC EU-West DC AP-Southeast DC
[Primary] ←───→ [Primary] ←───→ [Primary]
[Replica 1] [Replica 1] [Replica 1]
[Replica 2] [Replica 2] [Replica 2]
Challenges:
- Cross-continent network adds 100-150ms latency to replication
- Write conflicts when users in different regions modify the same data
- Data residency regulations (GDPR) may prohibit some data from crossing borders
Cassandra's multi-DC replication:
Cassandra handles this natively. You specify replication factor per datacenter:
keyspace 'user_data'
WITH replication = {
'class': 'NetworkTopologyStrategy',
'us-east': 3, -- 3 replicas in US-East
'eu-west': 3, -- 3 replicas in EU-West
'ap-south': 2 -- 2 replicas in AP-Southeast
};
Local reads/writes use quorum within the local DC (fast). Cross-DC replication happens asynchronously (low latency local, eventual global consistency).
Interview Scenario: "Resharding Without Downtime"
This is a real senior interview question. The answer:
The naive approach (wrong):
- Add new shards
- Migrate all data to new shards
- Update routing logic Problem: Downtime during migration. Hours or days.
The production approach:
Phase 1 — Double-write (days/weeks)
- New writes go to BOTH old and new shards
- Old shard is still authoritative for reads
Phase 2 — Backfill
- Copy historical data from old shards to new shards
- Verify data consistency
Phase 3 — Switch reads
- Gradually route read traffic to new shards (1% → 10% → 50% → 100%)
- Monitor for data inconsistencies
Phase 4 — Stop double-writing
- Old shards become read-only archives
- New shards are now primary
Phase 5 — Decommission old shards
- After validation period, old shards are removed
Zero downtime. Weeks-long migration. This is how Pinterest migrated from 6 MySQL shards to 4096 shards in production.
Key Takeaways
- Replication copies data for reads and availability.
- Sharding splits data for writes and storage.
- Primary-Replica replication is the most common pattern — writes to Primary, reads distributed across Replicas.
- Replication lag causes read-your-own-writes bugs — solve with session-based read routing.
- 4 sharding strategies: Hash (even distribution), Range (range queries), Directory (flexible), Geographic (latency + compliance).
- Hash sharding has a resharding problem — adding shards remaps most data. Consistent hashing solves this (Day 8).
- Hotspots are inevitable — mitigate with key salting and virtual nodes.
- Resharding in production requires double-writing, gradual traffic migration, and weeks of careful execution — never a weekend migration.
Top comments (0)