Actinode

Multi-Tenant SaaS Architecture: What Nobody Tells You Before You Build

Multi-tenancy is one of those decisions that looks simple on a whiteboard and turns out to be complicated in production. Choosing the wrong isolation model at the start, or not consciously choosing at all, creates a class of problems that are genuinely hard to undo later. Here's what you should know before you write the first migration.

The Three Isolation Models (and the Trade-offs Nobody Leads With)

Every multi-tenant SaaS sits somewhere on a spectrum between full isolation and full shared infrastructure. There are three canonical patterns:

1. Separate databases per tenant

Each tenant gets their own database instance. Full data isolation, no risk of cross-tenant data leakage, clean tenant offboarding, and trivial per-tenant backup and restore.

The trade-offs: provisioning time increases, connection pool management gets complicated fast, schema migrations need to run across N databases, and cost scales linearly with tenant count. This model makes sense when you have strong compliance requirements, enterprise clients who mandate data isolation, or when per-tenant customisation of the data model is a real product requirement.

2. Separate schemas, shared database

One database server, but each tenant gets their own schema. This is Postgres-native and is a reasonable middle ground: you get logical separation without the operational overhead of separate database instances.

The catch: connection pooling tools like PgBouncer work at the connection level, not the schema level. You'll need to set the search_path correctly per request, and some ORMs handle this gracefully and some don't. Schema migrations still need to be coordinated across all tenants.
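The per-request search_path dance is worth getting right, because the schema name cannot be a bind parameter. A minimal sketch in Python, assuming a `tenant_<slug>` schema naming convention (the convention and function name are this sketch's assumptions, not a standard):

```python
import re

# Assumption for this sketch: schemas are named after the tenant slug,
# e.g. tenant "acme" lives in schema "tenant_acme".
SLUG_RE = re.compile(r"^[a-z][a-z0-9_]{0,62}$")

def search_path_statement(tenant_slug: str) -> str:
    """Build a safe SET search_path statement for one request.

    search_path cannot be passed as a bind parameter, so the schema
    name is validated against a strict allow-list pattern and then
    double-quoted, rather than interpolated blindly.
    """
    if not SLUG_RE.match(tenant_slug):
        raise ValueError(f"invalid tenant slug: {tenant_slug!r}")
    schema = f"tenant_{tenant_slug}"
    # The regex already excludes quote characters, so quoting is safe.
    return f'SET search_path TO "{schema}", public'
```

At the start of each request you would execute the returned statement on the request's connection (with `SET LOCAL` inside a transaction if you pool connections in transaction mode).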

3. Shared schema, shared database (row-level tenancy)

Every table has a tenant_id column. All tenants share the same tables. This is the cheapest to operate, the simplest to migrate, and the easiest to onboard new tenants into.

The risk is the one that bites teams most: if you forget to filter by tenant_id anywhere in your application, a tenant can see another tenant's data. This is not a theoretical risk — it has caused real security incidents. The mitigation is row-level security (RLS) at the database layer.

Row-Level Security: Use It, Don't Trust the Application Layer Alone

If you go with the shared schema approach, Postgres RLS is not optional — it's the last line of defence.

-- Enable RLS on the projects table
ALTER TABLE projects ENABLE ROW LEVEL SECURITY;

-- Create a policy that restricts both reads and writes to the current
-- tenant: USING filters rows on read, WITH CHECK rejects writes that
-- would create or move a row into another tenant
CREATE POLICY tenant_isolation ON projects
  USING (tenant_id = current_setting('app.current_tenant_id')::uuid)
  WITH CHECK (tenant_id = current_setting('app.current_tenant_id')::uuid);

Then in your application, set the tenant context at the start of each request. If you run a connection pooler in transaction mode (e.g. PgBouncer), use SET LOCAL inside the request's transaction so the setting cannot leak to the next request that borrows the same connection:

SET LOCAL app.current_tenant_id = '{{tenant_uuid}}';

This means that even if your application code has a bug and omits the tenant_id WHERE clause, the database enforces the boundary. Defence in depth. One caveat: by default RLS does not apply to superusers or to the table's owner, so either connect as a dedicated non-owner application role or run ALTER TABLE projects FORCE ROW LEVEL SECURITY so the owner is subject to the policy too.

The cost: current_setting() calls add marginal overhead per query. In practice this is negligible for most workloads, but benchmark it if you're operating at very high query rates.

The Tenant Resolution Problem

How does your application know which tenant a request belongs to? There are three common approaches, each with real implications for your routing and infrastructure:

Subdomain-based: acme.yourapp.com → resolve acme to a tenant.
Clean for end users. Requires wildcard TLS, and your CDN/load balancer needs to handle arbitrary subdomains. Custom domains (where a tenant uses app.acmecorp.com) add another layer of complexity: you need to map arbitrary domains to tenant IDs at the edge.

Path-based: yourapp.com/t/acme/dashboard
Simpler infrastructure. Less ergonomic for users. Works everywhere including native apps. Good for internal tooling or B2B products where users don't care about the URL.

Token-based: tenant encoded in the JWT or session token.
Common for API-first products. Tenant context is explicit in every request. No DNS dependency. Slightly more complex auth middleware.

Most teams mix approaches: subdomain for the main app, token-based for the API.

Schema Migrations at Scale

This is the operational burden that takes teams by surprise. In a single-tenant app, a migration runs once. In a multi-tenant app with separate databases or schemas, it runs N times.

For shared-schema tenancy, this is not a problem — migrations run once. For schema-per-tenant or database-per-tenant, you need a migration runner that can:

  1. Track migration state per tenant
  2. Run migrations in parallel (with configurable concurrency)
  3. Handle failures gracefully: a partial rollout should never silently leave some tenants on schema version 7 and others on version 8

A pattern that works well: maintain a tenant_migrations table in your management database that records the current migration version for each tenant. Your deployment pipeline queries this table, identifies tenants not at the current version, and runs migrations in batches.
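The core of that pattern fits in a few lines. This is a sketch under assumptions: `tenant_versions` stands in for the query against the hypothetical `tenant_migrations` table, and `migrate` is your per-tenant migration function:

```python
from concurrent.futures import ThreadPoolExecutor

CURRENT_VERSION = 8  # target schema version for this deploy (example value)

def run_pending(tenant_versions: dict[str, int], migrate, concurrency: int = 4):
    """Bring every out-of-date tenant up to CURRENT_VERSION.

    tenant_versions maps tenant id -> version recorded in the
    tenant_migrations table. migrate(tenant_id) applies that tenant's
    pending migrations. Failures are collected and returned rather
    than swallowed, so a partial rollout is visible, not invisible.
    """
    pending = [t for t, v in tenant_versions.items() if v < CURRENT_VERSION]
    failed = {}
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {pool.submit(migrate, t): t for t in pending}
        for fut, tenant in futures.items():
            try:
                fut.result()
            except Exception as exc:
                failed[tenant] = exc
    return pending, failed
```

Your deployment pipeline would alert on a non-empty `failed` map and retry those tenants before declaring the deploy done.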

Tenant Onboarding: Provisioning New Tenants Efficiently

In a shared-schema model, onboarding a new tenant is simply inserting a row into a tenants table and using the generated ID as the tenant_id in all subsequent writes. It's fast, it's atomic, and it doesn't require any infrastructure changes.

In a schema-per-tenant model, onboarding requires creating a new schema and running the baseline migrations against it. This adds latency to the signup flow — typically seconds for small schemas, potentially minutes for large ones. The standard approach is to provision the tenant schema asynchronously and show a "getting your workspace ready" state to the user while it completes.

In a database-per-tenant model, provisioning is the most expensive: create a new database instance, configure access, run migrations, and update the routing table. This is typically a background job that takes minutes, and the user experience must be designed to accommodate it.

Plan your onboarding UX around your provisioning model. Don't discover the mismatch after you've shipped.
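One way to sketch the asynchronous flow is a small status machine that the signup UI polls. The status names and the injected `create_schema` / `run_baseline_migrations` callables are this sketch's assumptions:

```python
from enum import Enum

class TenantStatus(Enum):
    PENDING = "pending"            # row created at signup, nothing provisioned
    PROVISIONING = "provisioning"  # background job is creating schema/database
    READY = "ready"                # workspace usable
    FAILED = "failed"              # provisioning error; safe to retry

def provision(tenant: dict, create_schema, run_baseline_migrations):
    """Background job: move a tenant from PENDING to READY.

    create_schema and run_baseline_migrations are injected so the same
    flow covers schema-per-tenant and database-per-tenant. The UI polls
    tenant["status"] and shows "getting your workspace ready" until READY.
    """
    tenant["status"] = TenantStatus.PROVISIONING
    try:
        create_schema(tenant["id"])
        run_baseline_migrations(tenant["id"])
        tenant["status"] = TenantStatus.READY
    except Exception:
        tenant["status"] = TenantStatus.FAILED  # surfaced so signup can retry
        raise
```

In the shared-schema model this whole job collapses to the single INSERT described above, which is exactly why onboarding latency differs so much between the models.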

Don't Confuse Multi-Tenancy with Multi-Instance

One thing worth saying explicitly: multi-tenancy is a data architecture and application design concern, not an infrastructure concern. You can run a multi-tenant application on a single server or across hundreds. You can run a single-tenant application in Kubernetes with 50 replicas.

The isolation model lives in your data layer and your application logic. Infrastructure decisions about scaling, availability, and deployment are separate — important, but separate.

For a deeper look at the specific patterns and when to apply each, the Actinode guide on multi-tenant SaaS architecture covers the full decision matrix including compliance considerations and per-pattern migration strategies.

Summary

| Model | Isolation | Cost | Migration complexity | Best for |
| --- | --- | --- | --- | --- |
| Separate databases | Full | High | High | Enterprise, regulated industries |
| Separate schemas | Logical | Medium | Medium | Mid-market SaaS |
| Shared schema + RLS | Row-level | Low | Low | High-volume B2B, most startups |

The choice you make here will be with you for years. Make it deliberately.

A Note on Testing Your Isolation Model

Whatever model you choose, write explicit tests that verify your tenant isolation holds. Not just unit tests for the query logic — integration tests that simulate cross-tenant access attempts and verify they fail at the database layer.

A test suite that catches isolation failures looks like this:

  1. Create two test tenants with separate datasets
  2. Authenticate as tenant A
  3. Attempt to read tenant B's records via your application's own API routes
  4. Assert the response contains zero tenant B records

This test should run in your CI pipeline on every merge. Tenant isolation failures are the kind of bug that makes the news, and the cost of catching them in CI is trivially low compared to the cost of discovering them in production.
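The four steps above can be sketched as a pytest-style test. The in-memory `DB` and `list_projects` route are deliberately trivial stand-ins so the shape is runnable end to end; in a real suite they would be your app's test client and fixtures:

```python
# Stand-in data store: two tenants, one project each (steps 1-2).
DB = [
    {"tenant_id": "tenant-a", "name": "Apollo"},
    {"tenant_id": "tenant-b", "name": "Borealis"},
]

def list_projects(authenticated_tenant: str, requested_tenant: str):
    """API route stub. A correct implementation scopes queries by the
    *authenticated* tenant and ignores any tenant id the caller names."""
    return [p for p in DB if p["tenant_id"] == authenticated_tenant]

def test_cross_tenant_read_returns_nothing():
    # Step 3: authenticated as tenant A, explicitly ask for tenant B's data.
    rows = list_projects(authenticated_tenant="tenant-a",
                         requested_tenant="tenant-b")
    # Step 4: zero tenant B records may come back.
    assert all(p["tenant_id"] != "tenant-b" for p in rows)
    assert rows == [{"tenant_id": "tenant-a", "name": "Apollo"}]
```

The point of the stub is the assertion shape, not the implementation: the test requests the other tenant's data through the application's own surface and asserts the result set is scoped anyway.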

The same principle applies to your RLS policies: write a test that connects to Postgres with the RLS policy active, sets app.current_tenant_id to tenant A's ID, and attempts to SELECT tenant B's rows directly. If RLS is working correctly, the query returns zero rows. If it returns any rows, your policy has a gap.

Isolation is a correctness property, not just a design preference. Test it like one.
