ahmet gedik
Database Sharding Strategies for Growing Video Platforms

Scaling a European Video Database

ViralVidVault stores viral video data from seven tracked regions: six European markets plus the US. As the database grows, queries that scan the entire videos table slow down. Sharding -- splitting data across multiple tables or databases -- keeps per-query cost roughly flat even as total data volume grows.

Here are three strategies, with concrete PostgreSQL implementations.

Strategy 1: List Partitioning by Region

The most natural fit for a multi-region video vault. Each European region gets its own partition:

CREATE TABLE videos (
    id BIGSERIAL,
    video_id VARCHAR(16) NOT NULL,
    title TEXT NOT NULL,
    view_count BIGINT DEFAULT 0,
    virality_score FLOAT DEFAULT 0,
    region VARCHAR(4) NOT NULL,
    category_id INTEGER,
    fetched_at TIMESTAMP NOT NULL DEFAULT NOW(),
    PRIMARY KEY (id, region)
) PARTITION BY LIST (region);

-- One partition per tracked region
CREATE TABLE videos_pl PARTITION OF videos FOR VALUES IN ('PL');
CREATE TABLE videos_nl PARTITION OF videos FOR VALUES IN ('NL');
CREATE TABLE videos_se PARTITION OF videos FOR VALUES IN ('SE');
CREATE TABLE videos_no PARTITION OF videos FOR VALUES IN ('NO');
CREATE TABLE videos_at PARTITION OF videos FOR VALUES IN ('AT');
CREATE TABLE videos_gb PARTITION OF videos FOR VALUES IN ('GB');
CREATE TABLE videos_us PARTITION OF videos FOR VALUES IN ('US');

-- Default partition for any unexpected regions
CREATE TABLE videos_other PARTITION OF videos DEFAULT;

-- Each partition gets its own index (including the default partition)
CREATE INDEX ON videos_pl (virality_score DESC);
CREATE INDEX ON videos_nl (virality_score DESC);
CREATE INDEX ON videos_se (virality_score DESC);
CREATE INDEX ON videos_no (virality_score DESC);
CREATE INDEX ON videos_at (virality_score DESC);
CREATE INDEX ON videos_gb (virality_score DESC);
CREATE INDEX ON videos_us (virality_score DESC);
CREATE INDEX ON videos_other (virality_score DESC);

Region-scoped queries now scan a single partition:

-- Partition pruning: only videos_pl is scanned, not the other partitions
SELECT * FROM videos
WHERE region = 'PL' AND virality_score >= 85
ORDER BY virality_score DESC
LIMIT 20;
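Under the hood, list-partition routing is conceptually just a lookup from partition key to child table. A minimal Python model of that routing, mirroring the DDL above (illustrative only, not PostgreSQL's actual code path):

```python
# Model of list-partition routing: region code -> child table.
# Mirrors the CREATE TABLE ... FOR VALUES IN (...) clauses above.
PARTITION_FOR_REGION = {
    "PL": "videos_pl", "NL": "videos_nl", "SE": "videos_se",
    "NO": "videos_no", "AT": "videos_at", "GB": "videos_gb",
    "US": "videos_us",
}

def target_partition(region: str) -> str:
    # Unknown regions fall through to the DEFAULT partition.
    return PARTITION_FOR_REGION.get(region, "videos_other")
```

The DEFAULT partition is the safety net: a row with an unexpected region code lands in videos_other instead of failing the INSERT.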

Strategy 2: Composite Partitioning (Region + Date)

For even finer granularity, partition by region first, then by date:

CREATE TABLE videos_partitioned (
    id BIGSERIAL,
    video_id VARCHAR(16) NOT NULL,
    title TEXT NOT NULL,
    virality_score FLOAT DEFAULT 0,
    region VARCHAR(4) NOT NULL,
    fetched_at TIMESTAMP NOT NULL DEFAULT NOW(),
    PRIMARY KEY (id, region, fetched_at)
) PARTITION BY LIST (region);

-- Region partition, further sub-partitioned by month
CREATE TABLE videos_gb_part PARTITION OF videos_partitioned
    FOR VALUES IN ('GB')
    PARTITION BY RANGE (fetched_at);

CREATE TABLE videos_gb_2026_02 PARTITION OF videos_gb_part
    FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');
CREATE TABLE videos_gb_2026_03 PARTITION OF videos_gb_part
    FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');

This is ideal for queries like "show me viral UK videos from this week."
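Writing these monthly DDL statements by hand gets tedious fast. A small generator sketch -- the naming scheme follows the examples above and is an assumption, not a fixed convention:

```python
from datetime import date

def monthly_partition_ddl(region: str, start: date) -> str:
    """Emit DDL for one monthly sub-partition of a region partition."""
    # The first day of the following month is the exclusive upper bound.
    if start.month == 12:
        end = date(start.year + 1, 1, 1)
    else:
        end = date(start.year, start.month + 1, 1)
    name = f"videos_{region.lower()}_{start:%Y_%m}"
    return (
        f"CREATE TABLE {name} PARTITION OF videos_{region.lower()}_part\n"
        f"    FOR VALUES FROM ('{start}') TO ('{end}');"
    )
```

Feed the output to your migration tool or a cron job; PostgreSQL range bounds are inclusive at the start and exclusive at the end, so consecutive months never overlap.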

Strategy 3: Application-Level Sharding

When you outgrow a single PostgreSQL instance, shard at the application layer:

from typing import Dict

import asyncpg

class RegionShardRouter:
    """Route queries to region-specific database instances."""

    SHARD_MAP = {
        "western": ["GB", "NL", "AT"],   # Western Europe shard
        "nordic": ["SE", "NO"],           # Nordic shard
        "eastern": ["PL"],                # Eastern Europe shard
        "global": ["US"],                 # Global shard
    }

    def __init__(self, shard_dsns: Dict[str, str]):
        self.dsns = shard_dsns
        self.pools: Dict[str, asyncpg.Pool] = {}
        # Build reverse lookup: region -> shard name
        self.region_to_shard = {}
        for shard, regions in self.SHARD_MAP.items():
            for r in regions:
                self.region_to_shard[r] = shard

    async def initialize(self):
        for name, dsn in self.dsns.items():
            self.pools[name] = await asyncpg.create_pool(dsn, min_size=2, max_size=10)

    def _get_pool(self, region: str) -> asyncpg.Pool:
        shard = self.region_to_shard.get(region, "western")
        return self.pools[shard]

    async def get_viral_videos(self, region: str, min_score: float = 85.0, limit: int = 20) -> list:
        pool = self._get_pool(region)
        return await pool.fetch(
            "SELECT * FROM videos WHERE region = $1 AND virality_score >= $2 "
            "ORDER BY virality_score DESC LIMIT $3",
            region, min_score, limit,
        )

    async def search_all_regions(self, query: str, limit: int = 20) -> list:
        """Fan-out search across all shards, merge by virality score."""
        import asyncio
        tasks = [
            pool.fetch(
                "SELECT * FROM videos WHERE title ILIKE $1 ORDER BY virality_score DESC LIMIT $2",
                f"%{query}%", limit,
            )
            for pool in self.pools.values()
        ]
        results = await asyncio.gather(*tasks)
        merged = [dict(r) for shard_rows in results for r in shard_rows]
        merged.sort(key=lambda v: v["virality_score"], reverse=True)
        return merged[:limit]
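The merge step in search_all_regions is worth isolating: it is a concatenate-sort-truncate over per-shard result lists, which you can unit-test without a database. A standalone sketch of the same logic:

```python
def merge_by_score(shard_results, limit):
    """Flatten per-shard result lists and keep the top rows by virality score."""
    merged = [row for rows in shard_results for row in rows]
    merged.sort(key=lambda v: v["virality_score"], reverse=True)
    return merged[:limit]
```

Note that each shard already returns at most `limit` rows sorted by score, so the fan-out never pulls more than `limit * num_shards` rows into application memory.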

Choosing the Right Strategy

| Strategy | Best For | Complexity | ViralVidVault Fit |
| --- | --- | --- | --- |
| Region list partition | Region-scoped queries | Low | Ideal for trending pages |
| Region + date composite | Time-bounded + region queries | Medium | Great for analytics |
| App-level shard | Multiple DB servers | High | Future scaling |

For ViralVidVault, region-based list partitioning is the clear winner. Every page on the site is implicitly region-scoped -- "viral in Poland," "trending in Sweden" -- so partition pruning eliminates 80%+ of rows from every query automatically.

Partition Maintenance

Automate monthly partition creation:

CREATE OR REPLACE FUNCTION ensure_monthly_partitions(base_region TEXT, months_ahead INT DEFAULT 3)
RETURNS VOID AS $$
DECLARE
    m INT;
    part_name TEXT;
    start_d DATE;
    end_d DATE;
BEGIN
    FOR m IN 0..months_ahead LOOP
        start_d := DATE_TRUNC('month', NOW() + (m || ' months')::INTERVAL);
        end_d := start_d + INTERVAL '1 month';
        part_name := format('videos_%s_%s', lower(base_region), TO_CHAR(start_d, 'YYYY_MM'));
        IF NOT EXISTS (SELECT 1 FROM pg_class WHERE relname = part_name) THEN
            -- %I quotes identifiers, %L quotes literals, guarding against injection
            EXECUTE format(
                'CREATE TABLE %I PARTITION OF %I FOR VALUES FROM (%L) TO (%L)',
                part_name, format('videos_%s_part', lower(base_region)), start_d, end_d
            );
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
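The same month arithmetic in Python is handy for checking what a given maintenance run will create before pointing it at production (the helper name is illustrative):

```python
from datetime import date

def upcoming_partition_names(region: str, today: date, months_ahead: int = 3):
    """List the partition names the maintenance function would ensure exist."""
    names = []
    year, month = today.year, today.month
    # 0..months_ahead inclusive, matching the SQL FOR loop
    for _ in range(months_ahead + 1):
        names.append(f"videos_{region.lower()}_{year}_{month:02d}")
        month += 1
        if month > 12:
            month, year = 1, year + 1
    return names
```

Schedule the SQL function (via pg_cron or an external scheduler) to run well before month boundaries, so the next partition exists before any row needs it.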

Start simple, partition early. It's far easier to add partitioning from day one than to migrate a bloated table later.


This article is part of the Building ViralVidVault series. Check out ViralVidVault to see these techniques in action.
