Ronin_GO

Posted on Jan 29

PostgreSQL RLS in Go: Architecting Secure Multi-tenancy

#go #sql #rls #programming

Manual tenant isolation (adding WHERE tenant_id = ? to every query) is a ticking time bomb. It relies entirely on developer discipline. Eventually, someone will forget a filter during a hotfix or a late-night refactor, and data will leak.
In this article, I’ll share how we moved from manual checks to PostgreSQL Row Level Security (RLS) in a Go application. We’ll cover the implementation with pgx, performance benchmarks, and a zero-downtime migration strategy.

The Problem: Why Standard Solutions Failed

We evaluated three common isolation patterns before settling on RLS:
Logical Isolation (Manual WHERE clauses):
Pros: Simple to start.
Cons: Human error factor is critical. One missing clause = security breach.
Schema-per-tenant:
Pros: Strong isolation.
Cons: Doesn't scale past ~100 tenants. With 10,000 clients and 50 tables, you have 500,000 files in the database directory. Vacuuming becomes a nightmare, and inode usage explodes.
Database-per-tenant:
Pros: Perfect physical isolation.
Cons: Prohibitively expensive on infrastructure (RDS instances) for a startup.
We chose Row Level Security (RLS). It allows us to declare access rules once in the database schema, making them automatically applicable to every query—even those generated by ORMs or raw SQL.

Implementation Details

The core concept is simple: The application never writes tenant filters. The database automatically filters rows based on the current transaction's context.

The Database Schema (The "Paranoid" Mode) We define the policy once. Note the FORCE ROW LEVEL SECURITY command—this is crucial because, by default, table owners (the app user) bypass RLS.

-- Business data table
CREATE TABLE documents (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    tenant_id UUID NOT NULL REFERENCES tenants(id),
    title TEXT NOT NULL,
    body TEXT
);

-- Enable RLS
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

-- CRITICAL: Force RLS
-- Without this, the table owner (usually the app user)
-- will see EVERYTHING, ignoring policies.
ALTER TABLE documents FORCE ROW LEVEL SECURITY;

-- The Policy: "Show only rows belonging to the current tenant"
CREATE POLICY tenant_isolation_policy ON documents
    FOR ALL
    USING (tenant_id = current_setting('app.current_tenant')::uuid);

Go Middleware: Setting the Context

In the HTTP layer, we extract the tenant ID (e.g., from a JWT claim) and place it into the Go context.

func TenantMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // PRODUCTION NOTE:
        // In a real app, extract tenant_id from validated JWT claims (sub/tenant_id).
        // Do NOT trust "X-Tenant-ID" headers from the client (security hole).
        // We use a header here only for demonstration simplicity.
        tenantStr := r.Header.Get("X-Tenant-ID")

        // Validation...
        if _, err := uuid.Parse(tenantStr); err != nil {
            http.Error(w, "invalid tenant id", http.StatusBadRequest)
            return
        }

        // Put into context (use a custom type for keys in production)
        ctx := context.WithValue(r.Context(), "tenant_id", tenantStr)
        next.ServeHTTP(w, r.WithContext(ctx))
    })
}

The Database Wrapper (The "Glue")

This is the trickiest part. We need to pass the tenant_id from Go context to the Postgres session.

We use pgx. We cannot simply use SET app.current_tenant on a connection because connections are pooled. If a connection returns to the pool with the variable set, the next user might inherit those privileges.

The Solution: Use set_config with the is_local=true parameter inside a transaction.

type Postgres struct {
    Pool *pgxpool.Pool
}

// Wrapper for all transactions
func (p *Postgres) RunInTx(ctx context.Context, fn func(ctx context.Context, tx pgx.Tx) error) error {
    // 1. Hard Check: Fail fast if tenant_id is missing.
    // If we proceed without it, current_setting() returns NULL,
    // and queries return 0 rows. This might mask a bug.
    // Panic is justified here as a developer-time fail-fast mechanism.
    tenantID, ok := ctx.Value("tenant_id").(string)
    if !ok || tenantID == "" {
        panic("CRITICAL: DB transaction without tenant_id!")
    }

    // Note: Production code should handle nested transactions (savepoints).
    tx, err := p.Pool.Begin(ctx)
    if err != nil { return err }
    defer tx.Rollback(ctx) // Always rollback if not committed

    // 2. Set Session Variable
    // The third parameter 'true' (is_local) means this setting lives
    // ONLY until the end of the transaction.
    // Even if the connection returns to the pool "dirty", Postgres resets it.
    _, err = tx.Exec(ctx, "SELECT set_config('app.current_tenant', $1, true)", tenantID)
    if err != nil { return err }

    // 3. Execute Business Logic
    if err := fn(ctx, tx); err != nil { return err }

    return tx.Commit(ctx)
}

Business Logic: Clean and Simple

Now, our repository code is clean. No more WHERE clauses.

func ListDocuments(ctx context.Context, db *Postgres) ([]Document, error) {
    var docs []Document

    // Just SELECT *
    // Postgres automatically injects "WHERE tenant_id = ..."
    err := db.RunInTx(ctx, func(ctx context.Context, tx pgx.Tx) error {
        // Using a scanner library like pgxscan
        return pgxscan.Select(ctx, tx, &docs, "SELECT * FROM documents ORDER BY created_at DESC")
    })

    return docs, err
}

Testing the Un-testable

Standard unit tests with mocks (like go.mock) are useless for RLS. You can mock the current_setting call, but you won't verify if the policy actually restricts data access.

We use Testcontainers to spin up a real, disposable Postgres instance for every test suite.

// internal/infra/db/container_test.go

func SetupTestDB(t testing.TB) string {
    ctx := context.Background()

    // 1. Start a container (once per test suite)
    // Use the exact same image as production!
    pgContainer, err := postgres.Run(ctx,
        "pgvector/pgvector:pg16",
        postgres.WithDatabase("ronin_test"),
        postgres.WithUsername("ronin"),
        postgres.WithPassword("password"),
        // Copy migrations into the container
        postgres.WithCopyFileToContainer(
            "./migrations/", 
            "/docker-entrypoint-initdb.d/", 
            0644,
        ),
    )
    if err != nil {
        t.Fatalf("failed to start postgres: %v", err)
    }

    // 2. Ensure cleanup
    t.Cleanup(func() { pgContainer.Terminate(ctx) })

    connStr, _ := pgContainer.ConnectionString(ctx, "sslmode=disable")
    return connStr
}

The test itself becomes a concise security proof:

func TestAlienAccess(t *testing.T) {
    // 1. Get a clean DB
    dsn := SetupTestDB(t) 
    db := connect(dsn)

    // 2. Create document for Tenant A
    docID := createDoc(db, "tenant-A", "Secret Plan")

    // 3. Attempt to read as Tenant B
    ctxB := context.WithValue(context.Background(), "tenant_id", "tenant-B")
    _, err := db.GetDoc(ctxB, docID)

    // 4. Verify the row "does not exist" for us
    // Ideally, we get ErrNoRows, not an Access Denied error.
    assert.ErrorIs(t, err, sql.ErrNoRows) 
}

Benchmarks: The Cost of Security

We ran load tests (10k records, Docker) to measure the overhead.

Scenario	Without RLS	With RLS	Overhead	Notes
Simple Select (ID lookup)	1.2 ms	1.3 ms	+0.1 ms	Negligible. Ideal for CRUD.
JOIN (Docs + Tenants)	1.25 ms	1.35 ms	+0.1 ms	Planner handles joins well.
Vector Search (HNSW)	3-5 ms	5-6 ms	~20%	Acceptable vs. schema overhead.
GROUP BY (Count)	0.8 ms	1.95 ms	x2.4	Painful. Requires full scan.
ILIKE Search (GIN Index)	0.2 ms	1.3 ms	x6	High. RLS checks every row.

Key Takeaways:

Aggregations are slow: SELECT COUNT(*) checks visibility for every row. For admin dashboards, consider denormalized counters.

Vector Search works: This is the only scalable way to do multi-tenant vector search (AI/RAG). You build one big HNSW index, and RLS filters the results. Note: Use pgvector 0.8.0+ for iterative index scans.

Production Pitfalls (Read Before Deploying)

Views are dangerous

By default (before PG 15), views run with the permissions of the view owner, not the caller.

Fix (PG 15+): Use security_invoker = true.

Fix (Older): Always use FORCE ROW LEVEL SECURITY.

Side-Channel Leaks (Leakproof)

Postgres might execute a function before the RLS filter. If you have a query like WHERE secret_func(data), the function might run on rows the user shouldn't see.

Fix: Mark your trusted functions as LEAKPROOF.

Fix: For GIN indexes, rely on extensions that implement leakproof operators (like pg_trgm).

PgBouncer Compatibility

Never use Statement Pooling with RLS. The session variable app.current_tenant will be lost or mixed between statements.
Always use Transaction Pooling.

Zero-Downtime Migration Strategy
How do you add this to a legacy monolith without breaking everything? We used a 3-phase rollout.

Phase 1: Permissive (Preparation)

Add the columns and enable RLS, but allow everything.

ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
CREATE POLICY migration_phase_1 ON orders
    FOR ALL
    USING (true)
    WITH CHECK (true);

Phase 2: Hybrid (Transition)

Update the app to send context. The policy enforces rules only if context is present.

CREATE POLICY migration_phase_2 ON orders
    FOR ALL
    USING (
        -- Path A: Context exists -> Enforce isolation
        (current_setting('app.current_tenant', true) IS NOT NULL 
         AND tenant_id = current_setting('app.current_tenant')::uuid)
        OR 
        -- Path B: No context (Legacy code) -> Allow access
        (current_setting('app.current_tenant', true) IS NULL)
    );

Phase 3: Strict (Enforcement)

Once logs show 100% of requests have context, lock it down.

CREATE POLICY tenant_isolation_policy ON orders
    FOR ALL
    USING (tenant_id = current_setting('app.current_tenant')::uuid);

ALTER TABLE orders FORCE ROW LEVEL SECURITY;

Conclusion

RLS moved our security boundary from "every developer remembering WHERE clauses" to the infrastructure layer. The code is cleaner, and the "cold sweat" of potential data leaks is gone.

If you are building a multi-tenant B2B app, RLS is a robust engineering choice that is often overlooked. Give it a try.