You picked range sharding by created_at because the data was time-series. Imagine it's six months in. The bulk of writes are landing on one shard while the rest of the cluster idles. Your on-call incident is a multi-hour rebalance script holding an ACCESS EXCLUSIVE lock that blocks every write.
Or you picked hash sharding for even spread. Two quarters later, the analytics team wants "all orders from last Tuesday." The planner fans out across every shard, holds partial results on the coordinator, and times out half the dashboards.
Sharding is one of the few choices that can break a system after launch. Most teams know the four options exist. Fewer are honest about what each one costs when the data goes lopsided. Here is the tradeoff matrix, with copyable Postgres examples for each.
The four strategies
Range assigns rows to shards by a sortable key, usually a timestamp, a numeric ID, or a region code. Shard 1 holds A–F, shard 2 holds G–M.
Hash runs the key through a hash function and uses the result modulo N. Two adjacent keys end up on completely different nodes.
Geo maps rows to shards by region: eu-west, us-east, ap-south. The shard is chosen by where the user lives, where data must legally live, or where latency must be lowest.
Directory stores a small mapping table (tenant_id → shard_id) and hits it on every routed request. Any shard can hold any row; the directory decides.
Each is good at something and bad at something else. The trap is picking one because it fits the first six months, then discovering over the next eighteen what it costs.
Range: easy queries, painful rebalancing
Range works when access is bounded by the same key you're sharding on. Time-series workloads. Multi-tenant systems scoped to one tenant per query. Geographic systems where a request stays inside one country.
It dies on hot ranges. New writes pile onto whichever shard owns the highest key. Any monotonically increasing key (timestamps, sequential IDs, ULIDs) creates a single hot shard while the rest of the cluster watches. Splitting it means picking a new boundary, copying half the rows, and updating the routing layer. On a busy table, that's hours under contention.
In Postgres with native partitioning:
CREATE TABLE orders (
id BIGINT NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
tenant_id UUID NOT NULL,
payload JSONB NOT NULL
) PARTITION BY RANGE (created_at);
CREATE TABLE orders_2026_q1 PARTITION OF orders
FOR VALUES FROM ('2026-01-01') TO ('2026-04-01');
CREATE TABLE orders_2026_q2 PARTITION OF orders
FOR VALUES FROM ('2026-04-01') TO ('2026-07-01');
pg_partman automates boundary creation so you stop adding the next quarter manually at 11pm on March 31st. Pair it with Citus when partitions need to live on different worker nodes.
Pick range when queries are bounded by the shard key. Avoid range when writes pile up at one end of the key space.
Hash: uniform load, painful range queries
Hash sharding solves the hot-shard problem by spreading writes uniformly across N nodes. Two tenant_id values that would sort next to each other land on different shards, so new tenants spread across the cluster instead of piling onto one node.
It dies on range queries. "Give me all rows where created_at between X and Y" fans out to every shard. The coordinator gathers, merges, and sorts on the way back. At 12 shards, annoying. At 96, it's an SLA conversation.
Resharding is also expensive. Adding shard N+1 means rehashing every key, since hash(key) % 12 and hash(key) % 13 produce different answers for almost every row. Production systems handle this with consistent hashing (only ~1/N of keys move when you add a node) or by hashing into a large fixed virtual-shard space (say 4096) and mapping virtual-to-physical separately. The latter is the pattern Citus uses.
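A minimal sketch of the virtual-shard idea, in Python. The 4096 matches the number above; the round-robin placement and the add_node policy are illustrative, not Citus internals:

import hashlib

N_VIRTUAL = 4096  # fixed for the life of the cluster; only the map changes

def virtual_shard(key: str) -> int:
    digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % N_VIRTUAL

# Virtual-to-physical map. Starts round-robin across 12 physical nodes.
placement = [v % 12 for v in range(N_VIRTUAL)]

def physical_shard(key: str) -> int:
    return placement[virtual_shard(key)]

def add_node(new_node: int) -> None:
    # Hand the new node an equal share of virtual shards. Keys never
    # rehash; only the rows inside the reassigned virtual shards move.
    # Naive: steals the first slots; a real system balances by load.
    for v in range(N_VIRTUAL // (new_node + 1)):
        placement[v] = new_node

Going from 12 to 13 nodes changes 315 of the 4096 map entries. A plain hash(key) % 13 would remap roughly 12 out of every 13 keys.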
DDL with Citus:
-- Coordinator side. Citus turns this into a hash-distributed table
-- across worker nodes, using 32 shards by default.
CREATE TABLE events (
id BIGSERIAL,
tenant_id UUID NOT NULL,
occurred_at TIMESTAMPTZ NOT NULL,
event_type TEXT NOT NULL,
payload JSONB
);
SELECT create_distributed_table('events', 'tenant_id');
-- Co-locate a related table to keep tenant-scoped joins local.
CREATE TABLE event_attachments (
event_id BIGINT NOT NULL,
tenant_id UUID NOT NULL,
blob_url TEXT NOT NULL
);
SELECT create_distributed_table(
'event_attachments', 'tenant_id',
colocate_with => 'events'
);
The colocation hint is the load-bearing part. Without it, every join between events and event_attachments fans out. With it, joins stay inside a single shard because rows with the same tenant_id land on the same worker node.
Pick hash when writes need to spread evenly and most queries are point lookups. Avoid hash when range scans on a non-shard column drive the workload.
Geo: low-latency reads, regulatory headaches
Geo sharding gives you two things: every user reads from a node near them, and every byte of personal data lives in a jurisdiction your lawyers signed off on. EU users hit eu-west, US users hit us-east. Latency is bounded by the speed of light, not by an overseas round trip.
It dies on cross-region operations. A US user buys a gift for a German friend. Which shard owns the order? You pick one and treat the other as an asynchronous notification, or you build a two-phase write nobody wants to debug at 3am.
It also dies on compliance edges. GDPR doesn't prohibit storing EU personal data outside the EU outright; it prohibits transfers without a legal basis under Chapter V (Articles 44–49). If shards bleed into each other (a shared cache, an analytics pipeline pulling from all regions to one warehouse), the residency story falls apart. The data plane needs hard borders.
Failover is harder too. If eu-west goes down, you can't transparently promote us-east to serve EU traffic without violating the residency contract. You need a same-region standby, which doubles the shard count.
-- Schema-per-region pattern, enforced at the routing layer.
CREATE SCHEMA shard_eu_west;
CREATE SCHEMA shard_us_east;
CREATE TABLE shard_eu_west.users (
id UUID PRIMARY KEY,
email CITEXT NOT NULL,
region TEXT NOT NULL CHECK (region = 'eu-west'),
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
The CHECK constraint on region is cheap insurance. It catches the bug where a stale config writes an EU user into the US shard before the misroute even reaches your alerting.
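The same guard is worth duplicating at the routing layer. A sketch, assuming a hypothetical region-to-shard map in the router (all names here are illustrative):

# Hypothetical router-side mirror of the CHECK constraint: a stale
# config fails loudly here instead of writing EU data to the US shard.
REGION_TO_SHARD = {"eu-west": "shard_eu_west", "us-east": "shard_us_east"}

class ResidencyViolation(Exception):
    pass

def assert_residency(user_region: str, target_shard: str) -> None:
    # Called just before every write; disagreement means misroute.
    if REGION_TO_SHARD.get(user_region) != target_shard:
        raise ResidencyViolation(
            f"{user_region!r} data cannot be written to {target_shard!r}"
        )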
Pick geo when latency to the user is critical or residency is a hard constraint. Avoid geo when workloads cross regions frequently.
Directory: flexible, single-point-of-failure risk
Directory sharding stores a mapping table (typically tenant_id → shard_id) and routes every request through it. Move tenant 47 from shard 3 to shard 8 because they outgrew their neighbors? Update one row. The next request follows the new route.
This is the strategy with the highest ceiling and the most operational risk. The ceiling: arbitrary placement, surgical rebalancing, per-tenant capacity tuning. The risk: the directory itself. Every request reads it. If it goes down, the entire system is dark. If it lies, writes go to the wrong shard.
Standard pattern: directory in Postgres or a strongly-consistent KV, aggressive read-through caching, explicit invalidation on tenant moves, and a fallback "lookup by hash" for tenants the directory hasn't placed yet.
CREATE TABLE shard_directory (
tenant_id UUID PRIMARY KEY,
shard_id SMALLINT NOT NULL,
moved_at TIMESTAMPTZ NOT NULL DEFAULT now(),
pinned BOOLEAN NOT NULL DEFAULT false
);
CREATE INDEX shard_directory_shard_idx ON shard_directory (shard_id);
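The caching half of that pattern, sketched in Python. The TTL, class name, and load callback are illustrative; the lookup signature deliberately matches the Directory protocol the router below expects:

import time
from typing import Callable

class CachedDirectory:
    def __init__(self, load: Callable[[str], int | None], ttl_seconds: float = 30.0):
        self._load = load          # e.g. SELECT shard_id FROM shard_directory ...
        self._ttl = ttl_seconds
        self._cache: dict[str, tuple[int | None, float]] = {}

    def lookup(self, tenant_id: str) -> int | None:
        hit = self._cache.get(tenant_id)
        if hit is not None and hit[1] > time.monotonic():
            return hit[0]
        shard = self._load(tenant_id)  # None means: not placed yet
        self._cache[tenant_id] = (shard, time.monotonic() + self._ttl)
        return shard

    def invalidate(self, tenant_id: str) -> None:
        # Call from the tenant-move workflow before cutting traffic over.
        self._cache.pop(tenant_id, None)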
Pick directory when tenants have wildly different sizes and you need per-tenant placement you can change with one UPDATE. Avoid it when ops isn't ready to treat the directory as a Tier-0 service.
A small shard router
The decision logic fits in one file. Complexity lives in how each branch handles failure.
import hashlib
from dataclasses import dataclass
from typing import Protocol
@dataclass
class Key:
tenant_id: str
region: str | None = None
timestamp: int | None = None
class Directory(Protocol):
def lookup(self, tenant_id: str) -> int | None: ...
def pick_shard(
key: Key,
strategy: str,
n_shards: int,
directory: Directory | None = None,
) -> int:
if strategy == "range":
if key.timestamp is None:
raise ValueError("range needs a timestamp")
        bucket_size = (2**31) // n_shards  # key space: Unix seconds through 2038
        return min(key.timestamp // bucket_size, n_shards - 1)
if strategy == "hash":
digest = hashlib.blake2b(
key.tenant_id.encode(), digest_size=8
).digest()
return int.from_bytes(digest, "big") % n_shards
if strategy == "geo":
if key.region is None:
raise ValueError("geo needs a region")
regions = ["us-east", "us-west", "eu-west", "ap-south"]
return regions.index(key.region)
if strategy == "directory":
if directory is None:
raise ValueError("directory needs a Directory")
shard = directory.lookup(key.tenant_id)
if shard is None:
# Fallback so a missing directory entry is not a 500.
return pick_shard(key, "hash", n_shards)
return shard
raise ValueError(f"unknown strategy: {strategy}")
Two deliberate choices. Hash uses BLAKE2b, not Python's hash(). Python randomizes hash() per process, so two services would route the same key differently. Directory falls back to hash on miss, so a directory hiccup degrades to a wrong-but-consistent answer rather than a 500. This is graceful degradation for reads, not a write strategy — a tenant the directory hasn't placed should be placed before it's written.
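Wiring it together, with a toy in-memory directory standing in for the cached lookup above:

class DictDirectory:
    def __init__(self, placements: dict[str, int]):
        self._placements = placements

    def lookup(self, tenant_id: str) -> int | None:
        return self._placements.get(tenant_id)

directory = DictDirectory({"tenant-47": 8})
key = Key(tenant_id="tenant-47", region="eu-west", timestamp=1_760_000_000)

pick_shard(key, "hash", n_shards=12)              # stable across processes
pick_shard(key, "geo", n_shards=4)                # 2, the index of eu-west
pick_shard(key, "directory", 12, directory)       # 8, the placed shard
pick_shard(Key("tenant-99"), "directory", 12, directory)  # hash fallback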
The matrix
| Strategy | Write spread | Range queries | Resharding | Ops weight |
|---|---|---|---|---|
| Range | Hot end | Cheap | Medium | Low |
| Hash | Even | Expensive (fan-out) | High without consistent hashing | Low |
| Geo | Per region | Cheap in region | Medium | High (compliance) |
| Directory | Tunable | Depends on placement | Cheap per tenant | High (Tier-0) |
The winning strategy depends on which failure mode arrives first:
- If your worst day is "we ran out of write headroom in one place," range is the trap and hash is the answer.
- If your worst day is "analytics times out scanning everything," hash is the trap.
- If your worst day is "a regulator asked where exactly this row lives," geo is the only answer that doesn't require a six-month remediation.
Picking, in 2026
Most production systems run two of these at once. A typical SaaS stack: directory on the tenant axis, with range partitioning inside each tenant for time-bounded data. A typical global consumer product: geo for residency, hash for write distribution inside each region.
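A sketch of that composition for the SaaS case, reusing the router above. The directory picks the shard; the timestamp picks the quarterly partition inside it (partition names follow the orders_2026_q1 convention from earlier; the place helper is illustrative):

from datetime import datetime, timezone

def place(tenant_id: str, occurred_at: datetime,
          directory: Directory, n_shards: int) -> tuple[int, str]:
    # Axis 1: directory routes the tenant, hash fallback for unplaced ones.
    shard = directory.lookup(tenant_id)
    if shard is None:
        shard = pick_shard(Key(tenant_id), "hash", n_shards)
    # Axis 2: range partitioning by quarter inside the shard.
    quarter = (occurred_at.month - 1) // 3 + 1
    return shard, f"orders_{occurred_at.year}_q{quarter}"

# place("tenant-47", datetime(2026, 5, 1, tzinfo=timezone.utc), directory, 12)
# -> (8, "orders_2026_q2")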
If this was useful
The System Design Pocket Guide: Fundamentals covers sharding alongside the rest of the building blocks at the depth where you can pick between options instead of memorizing definitions. The Database Playbook goes deeper on the storage side: choosing the right engine before you get to the partition strategy.

