When a Single Database Isn't Enough
At DailyWatch, we aggregate trending videos from 8 regions with a cron pipeline that runs every 2 hours. The videos table grows by thousands of rows a day. Eventually queries slow down, indexes bloat, and backups take too long. That's when you start thinking about sharding.
This article covers three sharding strategies we evaluated, with real PostgreSQL examples.
Strategy 1: Range Partitioning by Date
The most natural fit for time-series video data. PostgreSQL's declarative partitioning handles this natively:
-- Parent table (no data stored directly here)
CREATE TABLE videos (
id BIGSERIAL,
video_id VARCHAR(16) NOT NULL,
title TEXT NOT NULL,
view_count BIGINT DEFAULT 0,
region VARCHAR(4) NOT NULL,
category_id INTEGER,
fetched_at TIMESTAMP NOT NULL DEFAULT NOW(),
PRIMARY KEY (id, fetched_at)
) PARTITION BY RANGE (fetched_at);
-- Monthly partitions
CREATE TABLE videos_2026_01 PARTITION OF videos
FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
CREATE TABLE videos_2026_02 PARTITION OF videos
FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');
CREATE TABLE videos_2026_03 PARTITION OF videos
FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');
-- Indexes are per partition: create them on each one (or, on PostgreSQL 11+,
-- create a single index on the parent and it cascades to every partition)
CREATE INDEX ON videos_2026_01 (video_id);
CREATE INDEX ON videos_2026_02 (video_id);
CREATE INDEX ON videos_2026_03 (video_id);
The query planner automatically prunes partitions:
-- Only scans videos_2026_03, skips others entirely
EXPLAIN ANALYZE
SELECT * FROM videos
WHERE fetched_at >= '2026-03-01' AND fetched_at < '2026-03-15';
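The same pruning logic is easy to mirror in application code. A minimal Python sketch (the `monthly_partitions_for_range` helper is illustrative, not part of the pipeline) that computes which monthly partitions a half-open date-range query would touch:

```python
from datetime import date

def monthly_partitions_for_range(start: date, end: date) -> list[str]:
    """Return the videos_YYYY_MM partition names a [start, end) query touches."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        # skip the end month entirely when the range stops on its first day
        if (year, month) == (end.year, end.month) and end.day == 1:
            break
        names.append(f"videos_{year:04d}_{month:02d}")
        month += 1
        if month == 13:
            year, month = year + 1, 1
    return names

# The EXPLAIN query above touches exactly one partition:
# monthly_partitions_for_range(date(2026, 3, 1), date(2026, 3, 15))
# -> ['videos_2026_03']
```

This is the calculation the planner does for you; having it in code is handy for pre-warming caches or validating that a query stays within the partitions you expect.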
Automating Partition Creation
Don't create partitions manually. Use a function:
CREATE OR REPLACE FUNCTION create_monthly_partition(target_date DATE)
RETURNS TEXT AS $$
DECLARE
partition_name TEXT;
start_date DATE;
end_date DATE;
BEGIN
start_date := DATE_TRUNC('month', target_date);
end_date := start_date + INTERVAL '1 month';
partition_name := 'videos_' || TO_CHAR(start_date, 'YYYY_MM');
-- Check if partition already exists
IF NOT EXISTS (
SELECT 1 FROM pg_class WHERE relname = partition_name
) THEN
EXECUTE format(
'CREATE TABLE %I PARTITION OF videos FOR VALUES FROM (%L) TO (%L)',
partition_name, start_date, end_date
);
EXECUTE format(
'CREATE INDEX ON %I (video_id)',
partition_name
);
RETURN 'Created: ' || partition_name;
END IF;
RETURN 'Already exists: ' || partition_name;
END;
$$ LANGUAGE plpgsql;
-- Run ahead of each month, e.g. from a scheduled job
SELECT create_monthly_partition('2026-04-01');
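If partition creation runs from a deploy script rather than inside the database, the same naming scheme can be generated in Python. A hedged sketch (the `partition_ddl` helper is hypothetical; real code should compose SQL with a driver's identifier-quoting utilities rather than f-strings):

```python
from datetime import date

def partition_ddl(target: date) -> tuple[str, str]:
    """Build the partition name and CREATE TABLE DDL for target's month."""
    start = target.replace(day=1)
    # first day of the following month (handles the December -> January rollover)
    if start.month == 12:
        end = date(start.year + 1, 1, 1)
    else:
        end = date(start.year, start.month + 1, 1)
    name = f"videos_{start:%Y_%m}"
    ddl = (
        f"CREATE TABLE {name} PARTITION OF videos "
        f"FOR VALUES FROM ('{start}') TO ('{end}')"
    )
    return name, ddl

# partition_ddl(date(2026, 4, 17)) -> name 'videos_2026_04',
# bounds ['2026-04-01', '2026-05-01')
```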
Strategy 2: List Partitioning by Region
When queries almost always filter by region, partition on that dimension:
CREATE TABLE videos_by_region (
id BIGSERIAL,
video_id VARCHAR(16) NOT NULL,
title TEXT NOT NULL,
view_count BIGINT DEFAULT 0,
region VARCHAR(4) NOT NULL,
fetched_at TIMESTAMP NOT NULL DEFAULT NOW(),
PRIMARY KEY (id, region)
) PARTITION BY LIST (region);
CREATE TABLE videos_americas PARTITION OF videos_by_region
FOR VALUES IN ('US', 'BR', 'CA');
CREATE TABLE videos_europe PARTITION OF videos_by_region
FOR VALUES IN ('GB', 'DE', 'FR');
CREATE TABLE videos_apac PARTITION OF videos_by_region
FOR VALUES IN ('IN', 'AU');
This shines when your application logic is already region-aware. At DailyWatch, each region's trending page only queries its own partition.
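Region-aware routing on the application side can be as simple as a lookup table. A sketch that mirrors the `FOR VALUES IN (...)` lists above (the dict and helper are illustrative):

```python
# Mirrors the FOR VALUES IN (...) lists of the three partitions above
REGION_PARTITION = {
    "US": "videos_americas", "BR": "videos_americas", "CA": "videos_americas",
    "GB": "videos_europe", "DE": "videos_europe", "FR": "videos_europe",
    "IN": "videos_apac", "AU": "videos_apac",
}

def partition_for(region: str) -> str:
    """Resolve a region code to its partition. Unknown regions fail loudly,
    just as PostgreSQL rejects them unless a DEFAULT partition exists."""
    try:
        return REGION_PARTITION[region]
    except KeyError:
        raise ValueError(f"no partition for region {region!r}") from None
```

Keeping this map in one place also documents the region-to-partition assignment for anyone adding a ninth region later.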
Strategy 3: Application-Level Sharding
When PostgreSQL partitioning isn't enough (different databases on different servers), shard at the application level:
import asyncio
import hashlib
from typing import Dict, Optional

import asyncpg

class VideoShardRouter:
    """Routes video queries to the correct database shard."""

    def __init__(self, shard_configs: Dict[str, str]):
        # shard_configs: {"shard_0": "postgres://...", "shard_1": "postgres://...", ...}
        self.shard_count = len(shard_configs)
        self.pools: Dict[str, asyncpg.Pool] = {}
        self.configs = shard_configs

    async def initialize(self):
        for name, dsn in self.configs.items():
            self.pools[name] = await asyncpg.create_pool(dsn, min_size=2, max_size=10)

    def get_shard(self, video_id: str) -> str:
        """Hash-based routing. Note: this is plain modulo hashing, not
        consistent hashing -- changing the shard count remaps most keys."""
        hash_val = int(hashlib.md5(video_id.encode()).hexdigest(), 16)
        shard_idx = hash_val % self.shard_count
        return f"shard_{shard_idx}"

    async def insert_video(self, video_id: str, title: str, region: str, views: int):
        shard = self.get_shard(video_id)
        pool = self.pools[shard]
        # Assumes each shard's videos table has a UNIQUE constraint on video_id
        await pool.execute(
            "INSERT INTO videos (video_id, title, region, view_count) VALUES ($1, $2, $3, $4) "
            "ON CONFLICT (video_id) DO UPDATE SET view_count = $4",
            video_id, title, region, views,
        )

    async def get_video(self, video_id: str) -> Optional[dict]:
        shard = self.get_shard(video_id)
        pool = self.pools[shard]
        row = await pool.fetchrow(
            "SELECT * FROM videos WHERE video_id = $1", video_id
        )
        return dict(row) if row else None

    async def search_all_shards(self, query: str, limit: int = 20) -> list:
        """Fan-out query across all shards, merge results."""
        tasks = [
            pool.fetch(
                "SELECT * FROM videos WHERE title ILIKE $1 ORDER BY view_count DESC LIMIT $2",
                f"%{query}%", limit,
            )
            for pool in self.pools.values()
        ]
        results = await asyncio.gather(*tasks)
        merged = [dict(row) for shard_rows in results for row in shard_rows]
        merged.sort(key=lambda v: v["view_count"], reverse=True)
        return merged[:limit]
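The routing step itself needs no database to verify. A standalone sketch of the same MD5-modulo scheme (shard names assumed to follow the router's `shard_{i}` convention):

```python
import hashlib

def shard_for(video_id: str, shard_count: int = 4) -> str:
    """Same MD5-modulo routing as VideoShardRouter.get_shard, without a DB."""
    hash_val = int(hashlib.md5(video_id.encode()).hexdigest(), 16)
    return f"shard_{hash_val % shard_count}"

# Routing is deterministic: the same id always lands on the same shard,
# so single-video reads and writes never need cross-shard coordination.
ids = ["dQw4w9WgXcQ", "9bZkp7q19f0", "kJQP7kiw5Fk"]
placements = {vid: shard_for(vid) for vid in ids}
```

The caveat from the docstring bears repeating: modulo hashing remaps most keys whenever `shard_count` changes, forcing a bulk data migration. If you expect to add shards, a consistent-hashing ring limits how many keys move per resize.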
Which Strategy to Choose?
| Strategy | Best When | Complexity | DailyWatch Fit |
|---|---|---|---|
| Date partitioning | Time-series queries dominate | Low | Good for archiving |
| Region partitioning | Region-scoped queries | Low | Great for trending pages |
| App-level sharding | Multiple DB servers needed | High | Future-proofing |
For most video platforms under 10M rows, PostgreSQL native partitioning is the right call. You get automatic partition pruning, simpler ops, and no application code changes. App-level sharding is the nuclear option for when you've outgrown a single server.
Key Takeaway
Start with date-based partitioning. It's the lowest-effort, highest-impact sharding strategy for content platforms where "recent" is always the most-queried dimension. You can always add region partitioning as a sub-partition later.
This article is part of the Building DailyWatch series. Check out DailyWatch to see these techniques in action.