You picked range sharding by created_at because the data was time-series. Imagine it's six months in. The bulk of writes are landing on one shard while the rest of the cluster idles. Your on-call incident is a multi-hour rebalance script holding an ACCESS EXCLUSIVE lock that blocks every write.
Or you picked hash sharding for even spread. Two quarters later, the analytics team wants "all orders from last Tuesday." The planner fans out across every shard, holds partial results on the coordinator, and times out half the dashboards.
Sharding is one of the few choices that can break a system after launch. Most teams know the four options exist. Fewer are honest about what each one costs when the data goes lopsided. Here is the tradeoff matrix, with copyable Postgres examples for each.
The four strategies
Range assigns rows to shards by a sortable key, usually a timestamp, a numeric ID, or a region code. Shard 1 holds A–F, shard 2 holds G–M.
Hash runs the key through a hash function and uses the result modulo N. Two adjacent keys end up on completely different nodes.
Geo maps rows to shards by region: eu-west, us-east, ap-south. The shard is chosen by where the user lives, where data must legally live, or where latency must be lowest.
Directory stores a small mapping table (tenant_id → shard_id) and hits it on every routed request. Any shard can hold any row; the directory decides.
Each is good at something and bad at something else. The trap is picking one because it fits the first six months, then discovering over the next eighteen what it costs.
Range: easy queries, painful rebalancing
Range works when access is bounded by the same key you're sharding on. Time-series workloads. Multi-tenant systems scoped to one tenant per query. Geographic systems where a request stays inside one country.
It dies on hot ranges. New writes pile onto whichever shard owns the highest key. Any monotonically increasing key (timestamps, sequential IDs, ULIDs) creates a single hot shard while the rest of the cluster watches. Splitting it means picking a new boundary, copying half the rows, and updating the routing layer. On a busy table, that's hours under contention.
In Postgres with native partitioning:
CREATE TABLE orders (
id BIGINT NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
tenant_id UUID NOT NULL,
payload JSONB NOT NULL
) PARTITION BY RANGE (created_at);
CREATE TABLE orders_2026_q1 PARTITION OF orders
FOR VALUES FROM ('2026-01-01') TO ('2026-04-01');
CREATE TABLE orders_2026_q2 PARTITION OF orders
FOR VALUES FROM ('2026-04-01') TO ('2026-07-01');
pg_partman automates boundary creation so you stop adding the next quarter manually at 11pm on March 31st. Pair it with Citus when partitions need to live on different worker nodes.
Pick range when queries are bounded by the shard key. Avoid range when writes pile up at one end of the key space.
Hash: uniform load, painful range queries
Hash sharding solves the hot-shard problem by spreading writes uniformly across N nodes. Two tenant_id values that would sort next to each other land on different shards, so new tenants spread across the cluster instead of piling onto one node.
It dies on range queries. "Give me all rows where created_at between X and Y" fans out to every shard. The coordinator gathers, merges, and sorts on the way back. At 12 shards, annoying. At 96, it's an SLA conversation.
Resharding is also expensive. Adding shard N+1 means rehashing every key, since hash(key) % 12 and hash(key) % 13 produce different answers for almost every row. Production systems handle this with consistent hashing (only ~1/N of keys move when you add a node) or by hashing into a large fixed virtual-shard space (say 4096) and mapping virtual-to-physical separately. The latter is the pattern Citus uses.
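A minimal sketch of the virtual-shard idea, in Python. The 4096 matches the number above; the round-robin placement and the add_node policy are illustrative, not Citus internals:

import hashlib

N_VIRTUAL = 4096  # fixed for the life of the cluster; only the map changes

def virtual_shard(key: str) -> int:
    digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % N_VIRTUAL

# Virtual-to-physical map. Starts round-robin across 12 physical nodes.
placement = [v % 12 for v in range(N_VIRTUAL)]

def physical_shard(key: str) -> int:
    return placement[virtual_shard(key)]

def add_node(new_node: int) -> None:
    # Hand the new node an equal share of virtual shards. Keys never
    # rehash; only the rows inside the reassigned virtual shards move.
    # Naive: steals the first slots; a real system balances by load.
    for v in range(N_VIRTUAL // (new_node + 1)):
        placement[v] = new_node

Going from 12 to 13 nodes changes 315 of the 4096 map entries. A plain hash(key) % 13 would remap roughly 12 out of every 13 keys.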
DDL with Citus:
-- Coordinator side. Citus turns this into a hash-distributed table
-- across worker nodes, using 32 shards by default.
CREATE TABLE events (
id BIGSERIAL,
tenant_id UUID NOT NULL,
occurred_at TIMESTAMPTZ NOT NULL,
event_type TEXT NOT NULL,
payload JSONB
);
SELECT create_distributed_table('events', 'tenant_id');
-- Co-locate a related table to keep tenant-scoped joins local.
CREATE TABLE event_attachments (
event_id BIGINT NOT NULL,
tenant_id UUID NOT NULL,
blob_url TEXT NOT NULL
);
SELECT create_distributed_table(
'event_attachments', 'tenant_id',
colocate_with => 'events'
);
The colocation hint is the load-bearing part. Without it, every join between events and event_attachments fans out. With it, joins stay inside a single shard because rows with the same tenant_id land on the same worker node.
Pick hash when writes need to spread evenly and most queries are point lookups. Avoid hash when range scans on a non-shard column drive the workload.
Geo: low-latency reads, regulatory headaches
Geo sharding gives you two things: every user reads from a node near them, and every byte of personal data lives in a jurisdiction your lawyers signed off on. EU users hit eu-west, US users hit us-east. Latency is bounded by the speed of light, not by an overseas round trip.
It dies on cross-region operations. A US user buys a gift for a German friend. Which shard owns the order? You pick one and treat the other as an asynchronous notification, or you build a two-phase write nobody wants to debug at 3am.
It also dies on compliance edges. GDPR doesn't prohibit storing EU personal data outside the EU outright; it prohibits transfers without a legal basis under Chapter V (Articles 44–49). If shards bleed into each other (a shared cache, an analytics pipeline pulling from all regions to one warehouse), the residency story falls apart. The data plane needs hard borders.
Failover is harder too. If eu-west goes down, you can't transparently promote us-east to serve EU traffic without violating the residency contract. You need a same-region standby, which doubles the shard count.
-- Schema-per-region pattern, enforced at the routing layer.
CREATE SCHEMA shard_eu_west;
CREATE SCHEMA shard_us_east;
CREATE TABLE shard_eu_west.users (
id UUID PRIMARY KEY,
email CITEXT NOT NULL,
region TEXT NOT NULL CHECK (region = 'eu-west'),
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
The CHECK constraint on region is cheap insurance. It catches the bug where a stale config writes an EU user into the US shard before the misroute even reaches your alerting.
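The same guard is worth duplicating at the routing layer. A sketch, assuming a hypothetical region-to-shard map in the router (all names here are illustrative):

# Hypothetical router-side mirror of the CHECK constraint: a stale
# config fails loudly here instead of writing EU data to the US shard.
REGION_TO_SHARD = {"eu-west": "shard_eu_west", "us-east": "shard_us_east"}

class ResidencyViolation(Exception):
    pass

def assert_residency(user_region: str, target_shard: str) -> None:
    # Called just before every write; disagreement means misroute.
    if REGION_TO_SHARD.get(user_region) != target_shard:
        raise ResidencyViolation(
            f"{user_region!r} data cannot be written to {target_shard!r}"
        )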
Pick geo when latency to the user is critical or residency is a hard constraint. Avoid geo when workloads cross regions frequently.
Directory: flexible, single-point-of-failure risk
Directory sharding stores a mapping table (typically tenant_id → shard_id) and routes every request through it. Move tenant 47 from shard 3 to shard 8 because they outgrew their neighbors? Update one row. The next request follows the new route.
This is the strategy with the highest ceiling and the most operational risk. The ceiling: arbitrary placement, surgical rebalancing, per-tenant capacity tuning. The risk: the directory itself. Every request reads it. If it goes down, the entire system is dark. If it lies, writes go to the wrong shard.
Standard pattern: directory in Postgres or a strongly-consistent KV, aggressive read-through caching, explicit invalidation on tenant moves, and a fallback "lookup by hash" for tenants the directory hasn't placed yet.
CREATE TABLE shard_directory (
tenant_id UUID PRIMARY KEY,
shard_id SMALLINT NOT NULL,
moved_at TIMESTAMPTZ NOT NULL DEFAULT now(),
pinned BOOLEAN NOT NULL DEFAULT false
);
CREATE INDEX shard_directory_shard_idx ON shard_directory (shard_id);
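The caching half of that pattern, sketched in Python. The TTL, class name, and load callback are illustrative; the lookup signature deliberately matches the Directory protocol the router below expects:

import time
from typing import Callable

class CachedDirectory:
    def __init__(self, load: Callable[[str], int | None], ttl_seconds: float = 30.0):
        self._load = load          # e.g. SELECT shard_id FROM shard_directory ...
        self._ttl = ttl_seconds
        self._cache: dict[str, tuple[int | None, float]] = {}

    def lookup(self, tenant_id: str) -> int | None:
        hit = self._cache.get(tenant_id)
        if hit is not None and hit[1] > time.monotonic():
            return hit[0]
        shard = self._load(tenant_id)  # None means: not placed yet
        self._cache[tenant_id] = (shard, time.monotonic() + self._ttl)
        return shard

    def invalidate(self, tenant_id: str) -> None:
        # Call from the tenant-move workflow before cutting traffic over.
        self._cache.pop(tenant_id, None)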
Pick directory when tenants have wildly different sizes and you need per-tenant placement you can change with one UPDATE. Avoid it when ops isn't ready to treat the directory as a Tier-0 service.
A small shard router
The decision logic fits in one file. Complexity lives in how each branch handles failure.
import hashlib
from dataclasses import dataclass
from typing import Protocol
@dataclass
class Key:
tenant_id: str
region: str | None = None
timestamp: int | None = None
class Directory(Protocol):
def lookup(self, tenant_id: str) -> int | None: ...
def pick_shard(
key: Key,
strategy: str,
n_shards: int,
directory: Directory | None = None,
) -> int:
if strategy == "range":
if key.timestamp is None:
raise ValueError("range needs a timestamp")
        bucket_size = (2**31) // n_shards  # key space: Unix seconds through 2038
        return min(key.timestamp // bucket_size, n_shards - 1)
if strategy == "hash":
digest = hashlib.blake2b(
key.tenant_id.encode(), digest_size=8
).digest()
return int.from_bytes(digest, "big") % n_shards
if strategy == "geo":
if key.region is None:
raise ValueError("geo needs a region")
regions = ["us-east", "us-west", "eu-west", "ap-south"]
return regions.index(key.region)
if strategy == "directory":
if directory is None:
raise ValueError("directory needs a Directory")
shard = directory.lookup(key.tenant_id)
if shard is None:
# Fallback so a missing directory entry is not a 500.
return pick_shard(key, "hash", n_shards)
return shard
raise ValueError(f"unknown strategy: {strategy}")
Two deliberate choices. Hash uses BLAKE2b, not Python's hash(). Python randomizes hash() per process, so two services would route the same key differently. Directory falls back to hash on miss, so a directory hiccup degrades to a wrong-but-consistent answer rather than a 500. This is graceful degradation for reads, not a write strategy — a tenant the directory hasn't placed should be placed before it's written.
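Wiring it together, with a toy in-memory directory standing in for the cached lookup above:

class DictDirectory:
    def __init__(self, placements: dict[str, int]):
        self._placements = placements

    def lookup(self, tenant_id: str) -> int | None:
        return self._placements.get(tenant_id)

directory = DictDirectory({"tenant-47": 8})
key = Key(tenant_id="tenant-47", region="eu-west", timestamp=1_760_000_000)

pick_shard(key, "hash", n_shards=12)              # stable across processes
pick_shard(key, "geo", n_shards=4)                # 2, the index of eu-west
pick_shard(key, "directory", 12, directory)       # 8, the placed shard
pick_shard(Key("tenant-99"), "directory", 12, directory)  # hash fallback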
The matrix
| Strategy | Write spread | Range queries | Resharding | Ops weight |
|---|---|---|---|---|
| Range | Hot end | Cheap | Medium | Low |
| Hash | Even | Expensive (fan-out) | High without consistent hashing | Low |
| Geo | Per region | Cheap in region | Medium | High (compliance) |
| Directory | Tunable | Depends on placement | Cheap per tenant | High (Tier-0) |
The winning strategy depends on which failure mode arrives first:
- If your worst day is "we ran out of write headroom in one place," range is the trap and hash is the answer.
- If your worst day is "analytics times out scanning everything," hash is the trap.
- If your worst day is "a regulator asked where exactly this row lives," geo is the only answer that doesn't require a six-month remediation.
Picking, in 2026
Most production systems run two of these at once. A typical SaaS stack: directory on the tenant axis, with range partitioning inside each tenant for time-bounded data. A typical global consumer product: geo for residency, hash for write distribution inside each region.
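A sketch of that composition for the SaaS case, reusing the router above. The directory picks the shard; the timestamp picks the quarterly partition inside it (partition names follow the orders_2026_q1 convention from earlier; the place helper is illustrative):

from datetime import datetime, timezone

def place(tenant_id: str, occurred_at: datetime,
          directory: Directory, n_shards: int) -> tuple[int, str]:
    # Axis 1: directory routes the tenant, hash fallback for unplaced ones.
    shard = directory.lookup(tenant_id)
    if shard is None:
        shard = pick_shard(Key(tenant_id), "hash", n_shards)
    # Axis 2: range partitioning by quarter inside the shard.
    quarter = (occurred_at.month - 1) // 3 + 1
    return shard, f"orders_{occurred_at.year}_q{quarter}"

# place("tenant-47", datetime(2026, 5, 1, tzinfo=timezone.utc), directory, 12)
# -> (8, "orders_2026_q2")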
If this was useful
The System Design Pocket Guide: Fundamentals covers sharding alongside the rest of the building blocks at the depth where you can pick between options instead of memorizing definitions. The Database Playbook goes deeper on the storage side: choosing the right engine before you get to the partition strategy.

