For a while, my multi-tenant SaaS isolated tenants the way most apps do: every query carried a WHERE organization_id = :current_org clause, enforced in the application layer. It works. Until it doesn't — one missing filter, one new endpoint that forgets the convention, one ORM relationship that loads more than you expected, and one tenant can see another tenant's data.
For most products that's a bug. For a product whose entire value proposition is being the trustworthy custodian of someone else's records, it's existential. "We filter by organization in the code, trust us" is not a sentence I wanted to say to a security reviewer.
So I moved isolation down a layer — into the database itself, with Postgres Row-Level Security (RLS). This is a short write-up of how that rollout went, and specifically about the moment it appeared to break, which turned out to be the moment it proved it was working.
The shape of the design
The idea behind RLS is simple: instead of trusting every query to filter correctly, you attach a policy to the table, and Postgres refuses to return rows that don't match — no matter what the query says.
The policy needs to know who is asking. The common pattern is to push the current tenant into a session-scoped runtime parameter (a GUC), and have the policy read it:
sql-- set per request, transaction-scoped
SELECT set_config('app.current_org_id', :org_id, true);
-- policy on each tenant table
CREATE POLICY tenant_isolation ON some_table
FOR ALL
USING (organization_id = current_setting('app.current_org_id', true))
WITH CHECK (organization_id = current_setting('app.current_org_id', true));
Two design choices matter here.
First, fail closed. I wrapped the GUC lookup in a helper that returns NULL when the parameter is missing, rather than throwing or defaulting to something permissive. Because organization_id = NULL is never true in SQL, a request that forgets to set its tenant context sees zero rows — not everything. The absence of identity is treated as "you are no one," not "you are everyone."
Second, the app connects as a non-superuser, non-BYPASSRLS role. RLS politely does nothing for superusers and table owners. If your app's database role can bypass the very policies you wrote, you've built a very expensive no-op. The runtime role has exactly the grants it needs and nothing more; privileged maintenance happens through a separate role.
The chicken-and-egg table
One table fought back: the membership table that maps users to organizations.
Every request resolves its org_id by querying that table ("which org does this user belong to?"). But the org GUC isn't set yet at that point — resolving it is the whole reason we're reading the table. If the membership table's read policy depends on the org GUC, you deadlock yourself out of every request.
The fix is to give that one table a policy keyed on the user identity (set earlier in the request) rather than the org identity (set later). The user GUC is available before org resolution; the org GUC is not. Writes to the membership table are blocked entirely on the app path and go through the privileged role, because membership changes are not something a normal tenant request should ever do directly.
Small thing, but if you get the ordering wrong, nothing works, and the failure looks terrifyingly total.
The moment it "broke"
Here's the part I actually wanted to write about.
I enabled the policies table by table, smoke-testing each one. Most went green. Then I enabled it on a table behind a create endpoint — make a thing, read it back, return it to the user — and it threw:
sqlalchemy.exc.InvalidRequestError: Could not refresh instance ''
The handler had just inserted a row, committed, and then tried to read it back. And the read came back empty. The database refused to return a row the same request had created seconds earlier.
For about thirty seconds this looks like a catastrophic regression. It is the opposite. It is the system working exactly as designed.
The cause: transaction-scoped GUCs (set_config(..., true), equivalent to SET LOCAL) are cleared on COMMIT. The pattern was:
pythondb.add(row)
db.commit() # <- SET LOCAL tenant context is wiped here
db.refresh(row) # <- new transaction, no GUC, policy sees NULL -> 0 rows
After the commit, the next statement runs in a fresh transaction with no tenant context. The fail-closed helper returns NULL. The policy matches nothing. The refresh() finds no row — including the one we just wrote — and errors.
This is the fail-closed design proving itself in production. A session that hadn't re-established who it was got shown nothing. If the isolation had been sloppy — defaulting to "see everything" when context is missing — this would have silently succeeded, and I'd have shipped a system that returns data to sessions with no identity. The error was the receipt.
The fix is to re-inject the tenant context immediately after committing, so legitimate work continues to carry its identity across the commit boundary:
pythondef commit_and_restore(db, ctx):
db.commit()
if ctx: # tenant request: re-apply GUCs for the next transaction
set_config(db, "app.current_user_id", ctx.user_id)
set_config(db, "app.current_org_id", ctx.org_id)
# privileged/admin sessions have no ctx and pass straight through
I routed every commit-then-keep-querying handler through that helper, and watched the same endpoints go green — this time because isolation was working, not because it was absent.
What I took away
Fail-closed is only useful if you can tell it apart from a bug. The "Could not refresh" error and a genuine isolation hole would look completely different in production, but in the moment they feel identical. Knowing your design well enough to read the failure correctly is the skill.
Test the negative, eventually. Watching your own tenant's data appear is necessary but not sufficient. The real property is that another tenant's data does not appear. Single-tenant smoke tests can't prove that; the fail-closed refresh error was, ironically, the strongest live evidence I got that the empty case behaves correctly.
Defense in depth means not trusting your own code. RLS doesn't replace careful query-writing. It's the layer that holds when the careful query-writing eventually slips — and over a long enough timeline, it slips.
The most reassuring thing a security layer can do is refuse you when you haven't proven who you are. Even when "you" is the request that wrote the row.
Top comments (0)