ahmet gedik

Posted on Jul 2

Multi-Tenant Video Platform Schema Design in PostgreSQL With RLS

#postgres #database #multitenancy #architecture

When DailyWatch began indexing trending videos for partner publishers, each partner was just a partner_id column bolted onto a handful of tables. It held together until a cache key collision served one partner's "trending" rail to another. Nothing catastrophic reached end users, but it was the moment I understood that "add a column and filter on it" is not a multi-tenancy strategy — it is a missing WHERE clause waiting to happen. We run a deliberately lean stack at DailyWatch (PHP 8.4, SQLite FTS5 for discovery, LiteSpeed, Cloudflare at the edge), but tenant metadata, billing, and per-publisher configuration live in PostgreSQL, and that is where isolation has to be enforced by the engine rather than trusted to application code.

This is the schema I wish I had designed on day one: shared tables, hard isolation through Row-Level Security, tenant-aware partitioning for the high-volume event stream, and the operational glue — connection pooling, migrations, cache keys — that makes it survive contact with production.

Three ways to isolate tenants, and why I picked one

There are really three options, and the trade-offs are concrete:

Database per tenant. Strongest isolation and trivial to reason about, but painful past a few dozen tenants. Connection counts explode, every migration fans out, and cross-tenant analytics become a federation problem.
Schema per tenant. One database, one search_path switch per request. Better connection reuse than separate databases, but pg_dump slows to a crawl with thousands of schemas and the catalog bloats.
Shared schema, tenant_id on every row, enforced by Row-Level Security. One set of tables, one migration, isolation enforced by the database instead of by every developer remembering a filter.

For a video discovery platform onboarding publishers continuously, the shared-schema model wins. The catch everyone warns you about — "one forgotten WHERE tenant_id = ? leaks data" — is exactly the failure mode RLS eliminates. The policy lives in the database; the application physically cannot read another tenant's rows, even when a query forgets the filter.

The core schema

Start with the tenant table and make tenant_id a first-class part of every dependent table's primary key. Compound keys (tenant_id, id) keep indexes naturally clustered per tenant and make later table partitioning straightforward. Compound foreign keys are the quiet hero here: a video can only reference a channel belonging to the same tenant, because the FK includes tenant_id on both sides.

CREATE TABLE tenant (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    slug        text NOT NULL UNIQUE,
    name        text NOT NULL,
    plan        text NOT NULL DEFAULT 'free',
    created_at  timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE channel (
    tenant_id   bigint NOT NULL REFERENCES tenant(id),
    id          bigint GENERATED ALWAYS AS IDENTITY,
    external_id text NOT NULL,          -- e.g. a YouTube channel id
    title       text NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (tenant_id, id),
    UNIQUE (tenant_id, external_id)
);

CREATE TABLE video (
    tenant_id    bigint NOT NULL REFERENCES tenant(id),
    id           bigint GENERATED ALWAYS AS IDENTITY,
    channel_id   bigint NOT NULL,
    external_id  text NOT NULL,
    title        text NOT NULL,
    duration_s   int  NOT NULL DEFAULT 0,
    published_at timestamptz,
    region       text NOT NULL DEFAULT 'US',
    created_at   timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (tenant_id, id),
    UNIQUE (tenant_id, external_id),
    -- This is the line that makes cross-tenant references impossible:
    FOREIGN KEY (tenant_id, channel_id) REFERENCES channel(tenant_id, id)
);

CREATE INDEX video_trending_idx
    ON video (tenant_id, region, published_at DESC);

The video_trending_idx index leads with tenant_id on purpose. Every read path is already tenant-scoped, so the planner can use the index for the most common query — "latest videos for this tenant in this region" — without scanning across tenants.

Locking it down with Row-Level Security

RLS is where the design stops being a convention and becomes a guarantee. Enable it, then force it so even the table owner is subject to the policy, and run your application as a dedicated non-superuser role (superusers and BYPASSRLS roles skip policies entirely).

ALTER TABLE channel ENABLE ROW LEVEL SECURITY;
ALTER TABLE channel FORCE  ROW LEVEL SECURITY;
ALTER TABLE video   ENABLE ROW LEVEL SECURITY;
ALTER TABLE video   FORCE  ROW LEVEL SECURITY;

-- current_setting WITHOUT the 'missing_ok' flag throws if app.tenant_id
-- is unset. That is intentional: no tenant context means no rows, never
-- all rows. Fail closed.
CREATE POLICY tenant_isolation ON channel
    USING      (tenant_id = current_setting('app.tenant_id')::bigint)
    WITH CHECK (tenant_id = current_setting('app.tenant_id')::bigint);

CREATE POLICY tenant_isolation ON video
    USING      (tenant_id = current_setting('app.tenant_id')::bigint)
    WITH CHECK (tenant_id = current_setting('app.tenant_id')::bigint);

The USING clause filters what a query can see; the WITH CHECK clause filters what it can write. Without WITH CHECK, a tenant could INSERT a row stamped with someone else's tenant_id. With both, the row a tenant inserts must belong to that tenant, and the rows it selects or updates are silently constrained to its own. The application never writes WHERE tenant_id = ? again — and crucially, cannot opt out of it.

The one thing the application must do is set app.tenant_id correctly for every request. Get that wrong and you get zero rows, not someone else's rows. That asymmetry is the whole point.

Setting tenant context from PHP

Here is the request-scoping helper in PHP 8.4. The non-negotiable detail is SET LOCAL semantics — via set_config(name, value, is_local => true) — so the setting is bound to the transaction and dies on COMMIT/ROLLBACK. On a pooled connection, a plain SET would leak one tenant's context into the next request that reuses the socket.

<?php
declare(strict_types=1);

final class TenantConnection
{
    public function __construct(private readonly PDO $pdo) {}

    /**
     * Run a unit of work scoped to a single tenant. The GUC is set with
     * is_local = true, so it is tied to this transaction and can never
     * outlive the request on a pooled connection.
     */
    public function scoped(int $tenantId, callable $work): mixed
    {
        $this->pdo->beginTransaction();
        try {
            // set_config is parameterisable; the bare SET command is not,
            // so this avoids string-concatenating the tenant id into SQL.
            $stmt = $this->pdo->prepare(
                "SELECT set_config('app.tenant_id', ?, true)"
            );
            $stmt->execute([(string) $tenantId]);

            $result = $work($this->pdo);

            $this->pdo->commit();
            return $result;
        } catch (Throwable $e) {
            $this->pdo->rollBack();
            throw $e;
        }
    }
}

// Usage in a controller. Note: no tenant_id in the query — RLS adds it.
$videos = $tenant->scoped($currentTenantId, static function (PDO $db): array {
    $stmt = $db->prepare(
        'SELECT id, title FROM video
          WHERE region = :region
          ORDER BY published_at DESC
          LIMIT 40'
    );
    $stmt->execute([':region' => 'US']);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
});

Notice the query has no tenant_id predicate at all. A new engineer can write the most naive query imaginable and it is still isolated, because the policy is appended by the planner. That is the property you are buying: correctness that does not depend on discipline.

The connection pooling trap

This is the bug that bites everyone who adopts a transaction pooler (PgBouncer in transaction mode, or a language-level pool). In that mode a physical connection is handed to a different client after every transaction. If you ever run SET app.tenant_id = 5 without LOCAL, the GUC sticks to the socket and the next checkout inherits tenant 5. The fix is two-fold: always use the local/transaction-scoped form, and reset connections on return to the pool as a belt-and-suspenders measure.

import psycopg_pool


def _reset(conn) -> None:
    # Belt-and-suspenders: even a leaked GUC cannot survive a checkout.
    conn.execute("RESET ALL")
    conn.commit()


pool = psycopg_pool.ConnectionPool(
    "postgresql://app@db/dailywatch",
    min_size=4,
    max_size=20,
    reset=_reset,
)


def trending(tenant_id: int, region: str = "US") -> list[tuple]:
    with pool.connection() as conn:
        with conn.transaction():            # explicit txn => SET LOCAL scope
            conn.execute(
                "SELECT set_config('app.tenant_id', %s, true)",
                (str(tenant_id),),
            )
            cur = conn.execute(
                """
                SELECT id, title
                  FROM video
                 WHERE region = %s
                 ORDER BY published_at DESC
                 LIMIT 40
                """,
                (region,),
            )
            return cur.fetchall()

The reset callback runs RESET ALL before a connection re-enters the pool. Combined with set_config(..., true) inside an explicit transaction, you now have two independent guarantees that tenant context cannot leak. When the failure mode is "customer A sees customer B's data," two guarantees are not paranoia.

Partitioning the event firehose

Metadata tables stay small. The table that grows without bound is the watch/impression event stream — millions of rows a day across all tenants. Range-partition it by time so retention is a metadata operation (drop last month's partition instead of running a DELETE that bloats the heap), and keep RLS on the parent so every partition inherits tenant isolation automatically.

CREATE TABLE watch_event (
    tenant_id   bigint      NOT NULL,
    video_id    bigint      NOT NULL,
    occurred_at timestamptz NOT NULL DEFAULT now(),
    country     text,
    PRIMARY KEY (tenant_id, video_id, occurred_at)
) PARTITION BY RANGE (occurred_at);

CREATE TABLE watch_event_2026_06 PARTITION OF watch_event
    FOR VALUES FROM ('2026-06-01') TO ('2026-07-01');
CREATE TABLE watch_event_2026_07 PARTITION OF watch_event
    FOR VALUES FROM ('2026-07-01') TO ('2026-08-01');

-- RLS on the parent applies to every partition.
ALTER TABLE watch_event ENABLE ROW LEVEL SECURITY;
ALTER TABLE watch_event FORCE  ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON watch_event
    USING (tenant_id = current_setting('app.tenant_id')::bigint);

For a handful of very large "whale" tenants you can go a step further and LIST-partition the top level by tenant_id, then sub-partition each by time. That isolates a noisy tenant's data to its own physical files and lets you set per-tenant autovacuum thresholds. Start with time-range partitioning; reach for hybrid partitioning only when one tenant's volume distorts the others.

Migrations run once, not per tenant

This is the operational payoff that schema-per-tenant gives up. With a shared schema there is a single physical copy of each table, so a migration applies to every tenant the moment it commits — no fan-out, no half-migrated tenants, no drift. Here is a minimal, idempotent runner in Go that applies a pending file inside one transaction and records it.

package main

import (
    "context"
    "log"
    "os"

    "github.com/jackc/pgx/v5"
)

func main() {
    const version = "0007_add_watch_event_partitions"
    ctx := context.Background()

    conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
    if err != nil {
        log.Fatalf("connect: %v", err)
    }
    defer conn.Close(ctx)

    tx, err := conn.Begin(ctx)
    if err != nil {
        log.Fatal(err)
    }
    defer tx.Rollback(ctx) // no-op after a successful Commit

    var applied bool
    if err := tx.QueryRow(ctx,
        `SELECT EXISTS (SELECT 1 FROM schema_migrations WHERE version = $1)`,
        version,
    ).Scan(&applied); err != nil {
        log.Fatal(err)
    }
    if applied {
        log.Println("already applied; nothing to do")
        return
    }

    sql, err := os.ReadFile("migrations/" + version + ".sql")
    if err != nil {
        log.Fatal(err)
    }
    if _, err := tx.Exec(ctx, string(sql)); err != nil {
        log.Fatalf("migration failed, rolling back: %v", err)
    }
    if _, err := tx.Exec(ctx,
        `INSERT INTO schema_migrations (version) VALUES ($1)`, version,
    ); err != nil {
        log.Fatal(err)
    }
    if err := tx.Commit(ctx); err != nil {
        log.Fatal(err)
    }
    log.Println("migration applied to all tenants at once")
}

Running the schema change and the bookkeeping INSERT in the same transaction means either both land or neither does — you can never end up with a migration applied but unrecorded, which is the classic cause of a re-run double-applying DDL.

Search lives outside Postgres — but still per tenant

DailyWatch serves discovery from SQLite FTS5 indexes pushed to the edge behind LiteSpeed and Cloudflare, not from Postgres full-text search, because a read-only FTS5 file is dramatically cheaper to replicate and query for "type-ahead" workloads. Postgres stays the source of truth; a periodic job exports each tenant's videos into its own FTS5 file (search/{tenant_slug}.db). The tenant boundary that RLS enforces in Postgres is preserved downstream simply by never mixing two tenants into one index file. The lesson generalises: whatever you build on top of the database — search, cache, exports — must carry tenant_id all the way through, or you reopen the hole RLS just closed.

Cache keys must carry the tenant

That original incident was a cache bug, not a database bug. The page cache key was trending:{region} with no tenant in it, so the first publisher to warm the cache for US served their rail to everyone in that region. The rule is blunt and absolute: every cache key, at every layer, includes the tenant.

Page cache (LiteSpeed / PHP file cache): tenant:{id}:page:{route}:{region}
Cloudflare Cache-Key / Vary: include a tenant-derived header or hostname so the edge never crosses tenants
Application data cache: prefix every key with tenant:{id}:

If the tenant is part of the key, a collision is impossible by construction — the same defensive instinct as putting tenant_id in the primary key.

What I would tell my past self

Put tenant_id in the primary key, not just as a filtered column. Compound keys make isolation, indexing, and partitioning all fall out naturally.
Enforce isolation with FORCE RLS and a non-superuser app role. A USING plus WITH CHECK policy closes both read and write leaks.
Set tenant context with SET LOCAL / set_config(..., true) inside a transaction, and reset pooled connections. The pooling leak is the one that gets you in production.
Fail closed: no tenant context should yield zero rows, never all rows.
Carry tenant_id into every downstream system — search indexes, cache keys, exports. The database boundary is only as strong as its weakest consumer.

The shared-schema-plus-RLS design is not the flashiest answer, but it is the one that lets a small team onboard publishers indefinitely without the per-tenant operational tax, while making the worst-case bug — one tenant seeing another's data — a thing the database refuses to let happen rather than a thing your code has to remember to prevent.

DEV Community