PostgreSQL RLS Is Fail-Closed, But Is It Fast? Making django-rls-tenants Index-Friendly

Jan Dvorak

A couple of days ago I published a post about why PostgreSQL Row-Level Security is the right approach to Django multitenancy. The short version: application-level filtering is opt-in, RLS is opt-out. One fails by leaking data, the other fails by returning nothing.

I still stand by all of that. But after spending more time with RLS on actual data -- not the neat 50-row test tables I started with -- I ran into something the PostgreSQL documentation doesn't make obvious.

RLS policies have a performance cost, and it's not where you'd expect it.

The isolation itself is rock solid. The problem is the query planner. PostgreSQL has opinions about how RLS policy expressions get evaluated, and those opinions can prevent your indexes from doing their job. I only noticed because a query that should have been fast wasn't, and EXPLAIN told me a story I didn't expect.

This post is about what I found, why it happens, and what changed in django-rls-tenants 1.1.0 to fix it.


The Problem: RLS Policies Can't Use Your Indexes

Here's the setup. You have an orders table with a tenant_id column and a composite index on (tenant_id, created_at). You've enabled RLS with a policy that checks tenant_id = current_setting('rls.current_tenant')::integer. You run a query:

SELECT * FROM orders;

RLS appends the policy filter automatically. You'd expect PostgreSQL to use your composite index to efficiently scan just the rows belonging to the current tenant. That's what the index is for. Makes sense, right?

It doesn't happen. PostgreSQL does a sequential scan on the entire table and applies the RLS filter row-by-row.
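Concretely, a setup along these lines reproduces the behavior (the table and index match the description above; the policy name is illustrative, not the library's generated name):

```sql
-- Illustrative DDL for the scenario described above
CREATE TABLE orders (
    id         bigserial PRIMARY KEY,
    tenant_id  integer NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX idx_orders_tenant_created ON orders (tenant_id, created_at);

ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
ALTER TABLE orders FORCE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON orders
    USING (tenant_id = current_setting('rls.current_tenant')::integer);
```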

I stared at this for longer than I'd like to admit before I understood why.

Why This Happens

The reason is a security mechanism called security barriers. When PostgreSQL evaluates an RLS policy, it treats the policy expression as a "security barrier qual." This is intentional -- it prevents user-defined functions in regular WHERE clauses from seeing rows that the RLS policy would have filtered out. Without this protection, a cleverly crafted function could leak information about rows the user shouldn't see.

The consequence is that PostgreSQL will only push a security barrier expression into an index scan if every function in that expression is marked as leakproof. A leakproof function is one that PostgreSQL guarantees will never reveal information about its arguments through error messages, side effects, or any other channel.

And here's the thing: current_setting() -- the function that RLS policies use to read session variables -- is not leakproof. PostgreSQL can't guarantee it won't throw an error that reveals something about a row's data. So the planner refuses to push it into an index condition and falls back to evaluating it as a filter on every row after reading them.
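You can confirm this directly in the system catalog:

```sql
-- proleakproof is false for current_setting (both overloads)
SELECT proname, proleakproof
FROM pg_proc
WHERE proname = 'current_setting';
```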

This is the correct behavior from a security perspective. PostgreSQL is being conservative, and for good reason. But it means that on a table with a million rows across 500 tenants, every single query does a sequential scan and filters down to 2,000 rows after the fact. Your carefully designed composite index just sits there.

For small tables, none of this matters. For anything above a few tens of thousands of rows, it starts to show.

What This Looks Like in EXPLAIN

If you run EXPLAIN on a query against an RLS-protected table, you'll see something like this:

Seq Scan on orders  (cost=0.00..2541.00 rows=500 width=64)
  Filter: (tenant_id = (current_setting('rls.current_tenant',...))::integer)

That Filter line is the giveaway. The RLS expression is evaluated per-row, after a full table scan. Now compare that to what you get when you add an explicit WHERE tenant_id = 42:

Index Scan using idx_orders_tenant_created on orders  (cost=0.42..85.30 rows=500 width=64)
  Index Cond: (tenant_id = 42)
  Filter: (tenant_id = (current_setting('rls.current_tenant',...))::integer)

Now the planner uses the index for the explicit WHERE condition and applies the RLS policy as a secondary filter on the already-scoped rows. Both conditions evaluate to the same thing, but the planner trusts the explicit one and won't trust the RLS one.

Same logic. Completely different execution path.


The Fix: A Redundant WHERE Clause That's Not Actually Redundant

Once I understood the problem, the fix was almost obvious: if the ORM adds an explicit WHERE tenant_id = X to every query, the planner can use it for index scans. The RLS policy still runs as a secondary filter -- the fail-closed guarantee stays intact -- but the heavy lifting of narrowing down rows is done by the index.

In v1.0, RLSManager.get_queryset() returned a plain TenantQuerySet with no filters. Tenant scoping was entirely handled by PostgreSQL via the RLS policy. Correct, but slow.

In v1.1, RLSManager.get_queryset() checks whether a tenant context is active and, if so, adds .filter(tenant_id=X) automatically:

# Before (v1.0): RLS only
def get_queryset(self):
    return TenantQuerySet(self.model, using=self._db)

# After (v1.1): ORM filter + RLS
def get_queryset(self):
    qs = TenantQuerySet(self.model, using=self._db)
    tenant_id = get_current_tenant_id()
    if tenant_id is not None:
        qs = qs.filter(tenant_id=tenant_id)
    return qs

The natural question: how does get_current_tenant_id() know which tenant is active?

ContextVar-Based Tenant State

v1.1 introduces a new module (tenants/state.py) that uses Python's contextvars.ContextVar to track the current tenant ID. When you use tenant_context(), admin_context(), or the RLSTenantMiddleware, the library now sets both the PostgreSQL GUC variable (for RLS) and the ContextVar (for ORM auto-scoping).
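A minimal sketch of what such a state module can look like (the getter name matches the library's `get_current_tenant_id()`; the setter and reset helpers are my assumptions, not the library's exact API):

```python
from contextvars import ContextVar
from typing import Optional

# Sketch of a tenants/state.py-style module -- illustrative, not the
# library's actual source.
_current_tenant_id: ContextVar[Optional[int]] = ContextVar(
    "current_tenant_id", default=None
)

def get_current_tenant_id() -> Optional[int]:
    """Return the tenant ID for the current execution context, or None."""
    return _current_tenant_id.get()

def set_current_tenant_id(tenant_id: Optional[int]):
    """Set the tenant ID; returns a Token so the caller can restore it."""
    return _current_tenant_id.set(tenant_id)

def reset_current_tenant_id(token) -> None:
    """Restore the value that was active before the matching set() call."""
    _current_tenant_id.reset(token)
```

Because the value lives in a `ContextVar`, each request (thread or coroutine) sees its own tenant ID, which is exactly what the ORM auto-scoping needs.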

from django_rls_tenants import tenant_context

with tenant_context(tenant_id=42):
    # This now does TWO things:
    # 1. SET rls.current_tenant = '42'  (database-level, for RLS)
    # 2. ContextVar = 42                (Python-level, for ORM filter)

    orders = Order.objects.all()
    # Generated SQL: SELECT ... FROM orders WHERE tenant_id = 42
    # Plus RLS policy applied by PostgreSQL as secondary filter

You don't change any of your code. If you're already using tenant_context() or the middleware, every query gets the ORM filter automatically. The RLS policy still enforces isolation at the database level -- the ORM filter just helps the planner find the right rows faster.

This is defense in depth that also happens to be a performance win. The ORM filter handles the "fast path" via index scans. The RLS policy handles the "safe path" -- catching raw SQL, dbshell, migrations, and anything else that bypasses the ORM. You get both.


select_related() Across RLS-Protected Joins

The same index problem applies to JOINs. If you do OrderItem.objects.select_related('order') and both OrderItem and Order are RLS-protected, PostgreSQL joins the tables and then applies RLS filters to both sides row-by-row. The indexes on the joined table don't help there either.

In v1.1, TenantQuerySet.select_related() detects when a joined model is RLS-protected and adds a tenant filter on the joined table:

with tenant_context(tenant_id=42):
    items = OrderItem.objects.select_related('order').all()
    # Generated SQL includes:
    #   WHERE order_item.tenant_id = 42
    #     AND (order.tenant_id = 42 OR order_item.order_id IS NULL)

The library walks the relation path, checks whether the target model has an RLSConstraint in its Meta (results are cached for the process lifetime), and adds the appropriate filter. For nullable foreign keys, it preserves LEFT OUTER JOIN semantics -- rows where the FK is NULL are kept, not silently dropped by the inner join that .filter() would normally force.
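The per-process caching can be as simple as an `lru_cache` keyed on the model class. This helper is a hypothetical stand-in for the library's actual check, which inspects the model's Meta:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # computed once per model, cached for the process
def is_rls_protected(model) -> bool:
    """Sketch: does this model declare an RLSConstraint in Meta.constraints?

    Illustrative only -- the attribute walk mirrors Django's
    model._meta.constraints but works on any object shaped like it.
    """
    meta = getattr(model, "_meta", None)
    constraints = getattr(meta, "constraints", None) or []
    return any(type(c).__name__ == "RLSConstraint" for c in constraints)
```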

If you're already using select_related() in your views, the tenant filters are added automatically when a context is active. Nothing to change.


Policy Rewrite: CASE WHEN Instead of OR

While I was in the EXPLAIN output anyway, I also took a closer look at the structure of the RLS policy itself. In v1.0, the policy used an OR-based structure:

-- v1.0 policy
USING (
    tenant_id = coalesce(nullif(current_setting('rls.current_tenant', true), '')::integer, NULL)
    OR coalesce(current_setting('rls.is_admin', true) = 'true', false)
)

Two things bothered me. First, the coalesce(..., NULL) wrapping the tenant match is redundant -- nullif() already returns NULL when the setting is empty, and NULL::integer is NULL. The extra coalesce does nothing.
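A quick check confirms the no-op:

```sql
-- Both columns come back true: the outer coalesce changes nothing
SELECT nullif('', '')::integer IS NULL                 AS plain,
       coalesce(nullif('', '')::integer, NULL) IS NULL AS wrapped;
```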

Second, the OR structure means PostgreSQL evaluates both branches. In v1.1, the policy uses CASE WHEN instead:

-- v1.1 policy
USING (
    CASE WHEN current_setting('rls.is_admin', true) = 'true'
         THEN true
         ELSE tenant_id = nullif(current_setting('rls.current_tenant', true), '')::integer
    END
)

The CASE WHEN checks the admin flag first. If it's an admin session, it short-circuits to true without evaluating the tenant match. For regular tenant sessions, it evaluates just the tenant match.

I want to be honest here: the real performance win in this release comes from the ORM-level auto-scoping described above, not from this policy rewrite. The CASE WHEN change is more about clarity and correctness -- the evaluation order is explicit, and the policy reads more naturally. But I wouldn't call it a performance optimization by itself.

If you run makemigrations after upgrading, the library generates a migration that updates existing policies to the new structure.


Hardening: ASGI, Exceptions, and Superusers

Beyond performance, v1.1 fixes several correctness issues I should have caught earlier. These are the kind of things that work fine in development and then bite you in production under specific conditions.

From threading.local to ContextVar

The v1.0 middleware used threading.local() to track whether GUC variables had been set during a request. This works fine in WSGI, where each request gets its own thread. In ASGI, though, multiple coroutines can share a single thread. threading.local() doesn't distinguish between them, so one coroutine's state can leak into another's.

v1.1 replaces all threading.local() usage with contextvars.ContextVar, which provides proper per-coroutine isolation. If you're running Django under an ASGI server (Uvicorn, Daphne, Hypercorn), this is the fix you didn't know you needed.
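Here's a minimal, self-contained illustration of the difference -- no Django involved. Two coroutines simulate two concurrent requests on one thread: the `threading.local` value leaks from one to the other, while the `ContextVar` stays isolated per task:

```python
import asyncio
import contextvars
import threading

local = threading.local()
tenant_var = contextvars.ContextVar("tenant", default=None)
seen = {}

async def handle_request(name, tenant_id):
    # Each "request" records its tenant in both places, yields to the
    # event loop, then reads the values back.
    local.tenant = tenant_id
    tenant_var.set(tenant_id)
    await asyncio.sleep(0)  # another coroutine runs on this same thread
    seen[name] = {"threading_local": local.tenant,
                  "contextvar": tenant_var.get()}

async def main():
    await asyncio.gather(handle_request("a", 1), handle_request("b", 2))

asyncio.run(main())
# Request "a" set tenant 1, but threading.local now shows tenant 2,
# overwritten by "b". The ContextVar still correctly shows 1.
print(seen["a"])  # {'threading_local': 2, 'contextvar': 1}
```

Each `asyncio` Task gets its own copy of the context, so `ContextVar.set()` inside one task is invisible to the other -- which is precisely the isolation the middleware needs.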

Middleware Exception Handling

In v1.0, if a view raised an unhandled exception, Django's middleware chain might skip process_response. That meant the GUC variables (and now the ContextVar) could remain set for the next request on the same connection. Not great.

v1.1 adds a process_exception() method to the middleware that runs the same cleanup. Between process_response, process_exception, and the request_finished signal safety net, there are now three independent layers of cleanup. Belt, suspenders, and a backup pair of suspenders.
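The shape of the fix, sketched outside Django. The method names match Django's middleware hooks; the cleanup body is a counter stand-in for resetting the GUC and the ContextVar:

```python
class RLSTenantMiddlewareSketch:
    """Illustrative only -- not the library's actual implementation."""

    def __init__(self, get_response=None):
        self.get_response = get_response
        self.cleanups = 0  # stand-in for "GUC + ContextVar were reset"

    def _clear_tenant_state(self):
        # Real middleware: RESET the session GUCs and reset the ContextVar.
        self.cleanups += 1

    def process_response(self, request, response):
        self._clear_tenant_state()  # normal path
        return response

    def process_exception(self, request, exception):
        self._clear_tenant_state()  # exception path: the v1.1 addition
        return None  # let Django's usual exception handling continue
```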

Superuser Detection at Startup

It's embarrassing that this one wasn't there from the start. PostgreSQL superusers bypass all RLS policies. FORCE ROW LEVEL SECURITY doesn't override it. If your Django application connects to PostgreSQL as a superuser, every RLS policy is silently ignored and every query returns all tenants' data.

This is the single most dangerous misconfiguration possible with RLS, and in v1.0 there was no warning about it. v1.1 adds a Django system check (W005) that queries pg_user at startup and warns you immediately if the current connection is a superuser.

I mentioned this risk in the first post. The example project ships with a docker/init-db.sql that creates a non-superuser role. But mentioning it in docs is not the same as catching it automatically. Now the library catches it for you.
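The check itself boils down to something like:

```sql
-- If usesuper is true, every RLS policy on this connection
-- is silently bypassed.
SELECT usesuper FROM pg_user WHERE usename = current_user;
```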

Fail-Fast on Bad User Configuration

In v1.0, if a non-admin user had rls_tenant_id = None (typically because the user model wasn't fully configured), the library would stringify it as "None" and pass that to PostgreSQL. The GUC variable would be set to the literal string "None", which would fail the integer cast in the RLS policy and return zero rows. Technically correct behavior -- fail-closed -- but completely unhelpful when you're trying to figure out why your users see empty pages.

v1.1 raises a ValueError immediately with a clear message: "Non-admin user has rls_tenant_id=None. Assign the user to a tenant or set is_tenant_admin=True." The same validation applies to @with_rls_context, which now includes the decorated function name in the error message so you know exactly where the problem is.
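The validation itself is simple. A sketch of the idea -- the attribute names (`rls_tenant_id`, `is_tenant_admin`) come from the library, but this helper function is an illustrative stand-in:

```python
def validate_rls_user(user):
    """Fail fast instead of letting "None" reach PostgreSQL as a string."""
    if not getattr(user, "is_tenant_admin", False) and user.rls_tenant_id is None:
        raise ValueError(
            "Non-admin user has rls_tenant_id=None. Assign the user to a "
            "tenant or set is_tenant_admin=True."
        )
    return user
```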


Bug Fixes

Two quick ones worth mentioning:

Custom TENANT_FK_FIELD was partially broken. The class_prepared signal handler that auto-adds the tenant FK to models was checking for a field named "tenant" regardless of your TENANT_FK_FIELD setting. If you configured TENANT_FK_FIELD = "organization", the handler wouldn't find "tenant" in the model's fields and would try to add a second FK. Fixed -- it now reads the configured field name.

CONN_MAX_AGE=None slipped through the system check. The W004 check warns about persistent connections with session-scoped GUCs (risk of GUC state leaking between requests). But CONN_MAX_AGE=None -- Django's sentinel for "keep connections alive forever" -- wasn't being caught because the check only tested for positive integers. None is arguably the most dangerous value here. Fixed.
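The corrected condition has to treat None as persistent, not falsy. A sketch with a hypothetical helper name:

```python
def connection_is_persistent(conn_max_age) -> bool:
    """Does this CONN_MAX_AGE value keep connections alive across requests?

    Django's semantics:
      0    -> close the connection at the end of each request
      N>0  -> keep the connection alive for up to N seconds
      None -> keep the connection alive forever (the most dangerous value
              for session-scoped GUCs)
    """
    if conn_max_age is None:
        return True
    return conn_max_age > 0
```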


Wrapping Up

v1.0 was about correctness: making the database enforce tenant isolation so your application code doesn't have to be perfect.

v1.1 is about making that correctness fast.

The core change is straightforward: add an ORM-level WHERE tenant_id = X alongside the RLS policy, so PostgreSQL can use your indexes for the fast path while RLS handles the safe path. Everything else -- the CASE WHEN policy rewrite, select_related() propagation, ContextVar migration, exception handling -- either supports that central idea or hardens the library for production use.

An empty page is still better than a data leak. But a fast, correct page is better than both.

The full changelog is in the release notes. If you run into anything, open an issue or start a discussion.


django-rls-tenants is open-source and MIT licensed. If you find it useful, a GitHub star helps others discover it.
