Gabriel Anhaia

Posted on May 24

Multi-Tenant RAG: 4 Isolation Patterns and the One Regulators Actually Ask About

#rag #ai #security #architecture

Book: RAG Pocket Guide: Retrieval, Chunking, and Reranking Patterns for Production

Your RAG demo works. You add a second customer. Tenant A asks "what's our Q3 churn target?" and the model retrieves a chunk from tenant B's pricing memo. Nobody notices for a week because the answer reads as plausible. Then somebody does notice, and now you're writing an incident report to a Fortune 500 procurement team that just signed a DPA promising data isolation.

There are four patterns for keeping tenant data apart in a RAG system. Three of them you'll see in any vector-DB tutorial. The fourth is the one a SOC 2 or HIPAA auditor will ask about by name, and most teams haven't built it.

The data-leak scenario

Concrete setup. You're running a SaaS that ingests customer documents (contracts, internal wikis, support tickets), chunks them, embeds them with text-embedding-3-small, stores the vectors in Qdrant, retrieves on each user query, and stuffs the results into a Claude or GPT context window. Two paying tenants: acme-corp and globex. Each has uploaded roughly 50k chunks.

A developer adds a feature that lets users ask follow-up questions referencing "the document we discussed last week." The retrieval code forgets to apply a tenant filter on that particular path. Acme's CEO asks about pricing strategy. The retriever returns three chunks from Globex's competitive-pricing memo because the embeddings happen to be close.

This is not hypothetical. It's the most common postmortem shape in early-stage RAG. The question is how much of the blast radius you can prevent with architecture instead of code review.

Pattern 1: filter-at-query (tenant_id WHERE clause)

One collection, one index, every chunk carries a tenant_id payload. Every query gets a filter.

from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client = QdrantClient(url="http://localhost:6333")

def retrieve(tenant_id: str, query_vector: list[float], k: int = 5):
    # the filter MUST be applied here, every call, no exceptions
    return client.search(
        collection_name="documents",
        query_vector=query_vector,
        query_filter=Filter(
            must=[FieldCondition(key="tenant_id",
                                 match=MatchValue(value=tenant_id))]
        ),
        limit=k,
    )

Fast to ship. Cheap to run. One collection, one set of HNSW graphs, one operational target.

The problem is that "every call, no exceptions" is a property of human discipline, not architecture. A new endpoint, a debug script, a Jupyter notebook one of your data scientists runs against prod: any of those skips the filter and reads across tenants. Auditors hate this pattern for exactly that reason. The isolation guarantee lives in application code, which is the easiest layer to bypass.

You can harden it. Wrap the client in a class that refuses to call search without a tenant_id argument, and write a pre-commit hook that fails on raw client.search( calls. A Qdrant alias per tenant looks like a third option, but aliases point at whole collections, not filtered views, so it only helps once each tenant has its own collection (Pattern 2). Each of these raises the bar a little. None of them turns this into a defensible "tenants are physically separated" story.
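The wrapper idea can be sketched roughly like this. It is a minimal sketch, not the post's implementation: `TenantScopedRetriever`, `VectorSearchClient`, and `tenant_filter` are hypothetical names, and the filter is written as a plain dict in Qdrant's JSON filter shape. Whether your client version accepts that dict directly is an assumption; if it doesn't, build the same filter with the `Filter` and `FieldCondition` models shown above.

```python
from typing import Any, Protocol, Sequence


class VectorSearchClient(Protocol):
    """Anything with a Qdrant-style search method (hypothetical interface)."""

    def search(self, **kwargs: Any) -> Any: ...


def tenant_filter(tenant_id: str) -> dict:
    # Qdrant's JSON filter shape; swap in the Filter/FieldCondition models if preferred.
    return {"must": [{"key": "tenant_id", "match": {"value": tenant_id}}]}


class TenantScopedRetriever:
    """Binds one tenant to one retriever so the filter cannot be forgotten."""

    def __init__(self, client: VectorSearchClient, tenant_id: str,
                 collection: str = "documents") -> None:
        if not isinstance(tenant_id, str) or not tenant_id.strip():
            raise ValueError("tenant_id is required")
        self._client = client
        self._tenant_id = tenant_id
        self._collection = collection

    def search(self, query_vector: Sequence[float], k: int = 5) -> Any:
        # Callers cannot pass their own filter or collection; both are fixed here.
        return self._client.search(
            collection_name=self._collection,
            query_vector=list(query_vector),
            query_filter=tenant_filter(self._tenant_id),
            limit=k,
        )
```

Application code would receive a `TenantScopedRetriever` built from the authenticated session and never see the raw client. That narrows the ways to skip the filter, but the guarantee still lives in application code.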

Use it when: you're pre-product-market-fit, your tenants are internal teams, or you're shipping a demo. Not for anything regulated.

Pattern 2: namespace per tenant (collection per tenant)

Same Qdrant cluster, one collection per tenant. Naming convention: docs__<tenant_id>.

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, HnswConfigDiff, VectorParams

client = QdrantClient(url="http://localhost:6333")

def provision_tenant(tenant_id: str, dim: int = 1536) -> str:
    name = f"docs__{tenant_id}"
    client.create_collection(
        collection_name=name,
        vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
        # per-collection HNSW tuning is possible (handy for tiers)
        hnsw_config=HnswConfigDiff(m=16, ef_construct=100),
    )
    return name

def retrieve(tenant_id: str, query_vector: list[float], k: int = 5):
    # no payload filter needed; the collection IS the tenant boundary
    return client.search(
        collection_name=f"docs__{tenant_id}",
        query_vector=query_vector,
        limit=k,
    )

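Now isolation is structural. A query with a missing or malformed tenant ID targets a collection that doesn't exist and fails with "collection not found" instead of returning a leaky result set. There is no filter to forget, because the collection address itself is the tenant boundary. What this doesn't catch is a request that resolves to the wrong but valid tenant ID, so the tenant ID still has to come from the authenticated session, never from user input.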

Cost grows. Each Qdrant collection carries its own HNSW graphs, its own WAL, and its own set of segment files, plus a baseline of memory for bookkeeping. On Qdrant 1.11 that baseline is small per collection, but it scales linearly with tenant count: at 50k tenants the overhead becomes real, and so does the rebalance cost after a node failure. Qdrant's own multitenancy guidance favors a single collection with payload-based partitioning when tenant counts get large, so measure the overhead on your version before committing. The vector data itself is the same total size either way.

Operationally this is the SaaS default. It maps cleanly to tenant lifecycle: onboarding creates a collection, offboarding drops one. Backups can be per-tenant. You can move a hot tenant's collection onto dedicated nodes without re-ingesting its data.

Two gotchas. First, your control plane has to track collection names; lose the mapping, lose the tenant. Persist it in your primary DB, not in Qdrant. Second, the embedding model is still shared (more on this below).
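A minimal sketch of that mapping, assuming the primary DB is reachable through a DB-API connection (sqlite3 stands in here). The `tenant_collections` table, the `TenantRegistry` class, and the tenant-ID pattern are all hypothetical choices for illustration, not something Qdrant prescribes.

```python
import re
import sqlite3

# Assumed tenant-id format: lowercase letters, digits, and hyphens only.
_TENANT_ID = re.compile(r"^[a-z0-9][a-z0-9-]{0,62}$")


def collection_name_for(tenant_id: str) -> str:
    # Reject anything that could collide or smuggle characters into the name.
    if not _TENANT_ID.fullmatch(tenant_id):
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    return f"docs__{tenant_id}"


class TenantRegistry:
    """Source of truth for tenant -> collection, kept outside the vector store."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS tenant_collections ("
            " tenant_id TEXT PRIMARY KEY,"
            " collection_name TEXT NOT NULL UNIQUE)"
        )

    def register(self, tenant_id: str) -> str:
        name = collection_name_for(tenant_id)
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT INTO tenant_collections (tenant_id, collection_name)"
                " VALUES (?, ?)",
                (tenant_id, name),
            )
        return name

    def lookup(self, tenant_id: str) -> str:
        row = self._conn.execute(
            "SELECT collection_name FROM tenant_collections WHERE tenant_id = ?",
            (tenant_id,),
        ).fetchone()
        if row is None:
            # Fail closed: never guess a collection for an unknown tenant.
            raise KeyError(f"unknown tenant: {tenant_id!r}")
        return row[0]
```

The retriever would call `lookup` instead of formatting the collection name itself, so an unregistered or offboarded tenant fails before any query reaches Qdrant.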

Use it when: SaaS, more than three tenants, no strict per-tenant compliance commitments yet.

Pattern 3: separate index per tenant (separate cluster or deployment)

Different physical Qdrant cluster, different EBS volume, different IAM role. Or a different managed-vector service entirely for the tenant that asked for it.

from qdrant_client import QdrantClient

# tenant routing lives in your control plane, not the retriever
TENANT_ENDPOINTS = {
    "acme-corp": "https://qdrant-acme.internal.example.com:6333",
    "globex":    "https://qdrant-globex.internal.example.com:6333",
}

def client_for(tenant_id: str) -> QdrantClient:
    # fail closed: an unknown tenant must never fall through to a shared,
    # unfiltered "documents" collection
    url = TENANT_ENDPOINTS[tenant_id]  # KeyError if the tenant is not registered
    # read_secret is a placeholder for your secrets-manager lookup
    api_key = read_secret(f"qdrant/{tenant_id}/api-key")
    return QdrantClient(url=url, api_key=api_key)

def retrieve(tenant_id: str, query_vector: list[float], k: int = 5):
    c = client_for(tenant_id)
    return c.search(
        collection_name="documents",
        query_vector=query_vector,
        limit=k,
    )

Now the blast radius of any retriever bug is one tenant. A leaked API key gives access to one tenant's index, not all of them. A snapshot restore for one tenant doesn't risk overwriting another's data. You can put the regulated tenant in a different VPC, a different region, a different cloud account.

The operational cost is real. Per-tenant alerts, per-tenant capacity planning, per-tenant upgrade windows, per-tenant on-call. For three Fortune 500 tenants on premium tiers this is fine. For 800 small-business tenants on a $49/month plan it's bankruptcy.

What auditors like about this pattern: you can hand them a network diagram showing physical separation. "Tenant X's data lives on these volumes, accessed by these IAM roles, in this VPC." That's a much shorter conversation than explaining how your filter middleware works.

Use it when: a small number of regulated or enterprise tenants want isolation guarantees you can point at on a diagram.

Pattern 4: encrypted-at-rest with per-tenant keys

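This is the one regulators ask about by name. Encryption at rest and key management come up in HIPAA guidance and in PCI DSS 4.0, and buyers who expect to fall under the EU AI Act's high-risk obligations tend to raise the same questions. In security questionnaires it usually appears as "customer-managed keys" or "BYOK" (bring your own key).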

The shape: every vector and every chunk payload is encrypted with a tenant-specific data encryption key (DEK). The DEK itself is wrapped by a tenant-specific key encryption key (KEK) held in AWS KMS, GCP KMS, or Azure Key Vault. The tenant controls the KEK. Revoke the KEK, the data becomes unreadable, even by you.

Qdrant doesn't do envelope encryption natively as of 1.11. You build it in the application layer.

import base64, json, os
import boto3
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

kms = boto3.client("kms", region_name="eu-west-1")
qdrant = QdrantClient(url="http://localhost:6333")

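def new_tenant_dek(tenant_id: str) -> tuple[bytes, bytes]:
    # ask KMS for a fresh 256-bit DEK under the tenant's KEK
    # returns (plaintext DEK, KMS-wrapped DEK); persist only the wrapped copy,
    # because without it the ciphertext can never be decrypted again
    resp = kms.generate_data_key(
        KeyId=f"alias/rag-tenant-{tenant_id}",
        KeySpec="AES_256",
    )
    return resp["Plaintext"], resp["CiphertextBlob"]

def encrypt_payload(dek: bytes, wrapped_dek: bytes, payload: dict) -> dict:
    aes = AESGCM(dek)
    nonce = os.urandom(12)  # 96-bit nonce; must never repeat under the same DEK
    plaintext = json.dumps(payload).encode("utf-8")
    ciphertext = aes.encrypt(nonce, plaintext, associated_data=None)
    return {
        "nonce": base64.b64encode(nonce).decode(),
        "ct":    base64.b64encode(ciphertext).decode(),
        # wrapped DEK travels with the data; kms.decrypt unwraps it on read
        "wrapped_dek": base64.b64encode(wrapped_dek).decode(),
    }

def upsert_chunk(tenant_id: str, chunk_id: str,
                 vector: list[float], payload: dict):
    # one KMS call per chunk is the simplest shape; reuse the DEK across a
    # single ingestion request if KMS cost matters, but drop it afterwards
    dek, wrapped_dek = new_tenant_dek(tenant_id)
    encrypted_payload = encrypt_payload(dek, wrapped_dek, payload)
    qdrant.upsert(
        collection_name=f"docs__{tenant_id}",
        points=[PointStruct(
            id=chunk_id,  # Qdrant point IDs must be unsigned ints or UUID strings
            vector=vector,
            payload=encrypted_payload,  # opaque to Qdrant
        )],
    )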

On retrieval, you unwrap the stored DEK through KMS and decrypt the payload in the application after Qdrant returns the points. The vectors themselves stay in plaintext inside Qdrant (they have to, for similarity search to work), but the chunk text and metadata are opaque to anyone with raw disk access.

Two trade-offs. The vectors leak some information about the source text even when the payload is encrypted; the inversion attacks published in 2023 (Morris et al., "Text Embeddings Reveal (Almost) As Much As Text") can recover partial inputs from embeddings alone. If your threat model includes "an attacker exfiltrates the vector store," you also need to redact or tokenize sensitive fields before embedding, or accept the risk and document it. The other trade-off is operational: KMS calls cost real money and add latency to ingestion. Cache the unwrapped DEK per request, never per process; a long-lived in-memory DEK defeats half the point of envelope encryption.
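One way to read "tokenize sensitive fields before embedding" is keyed pseudonymization: swap each sensitive value for an HMAC-derived token before the text reaches the embedder. The sketch below makes that assumption and handles email addresses only. `tokenize_sensitive` and the per-tenant secret are hypothetical, and a real deployment would need detectors for far more than emails.

```python
import hashlib
import hmac
import re

# Deliberately simple email pattern; an illustration, not a complete PII detector.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def tokenize_sensitive(text: str, tenant_secret: bytes) -> str:
    """Replace each email address with a keyed, tenant-specific token."""

    def _token(match: re.Match) -> str:
        value = match.group(0).lower().encode("utf-8")
        digest = hmac.new(tenant_secret, value, hashlib.sha256).hexdigest()
        # Truncated for readability; the same value under the same secret
        # always yields the same token.
        return f"<email:{digest[:12]}>"

    return _EMAIL.sub(_token, text)
```

Because the token is stable per tenant, two chunks that mention the same address still embed the same token, while the address itself never appears in the vector. The cost is that queries containing the raw address need the same transformation before they are embedded.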

What you get in return: a tenant can disable or revoke their KEK, and once any cached plaintext DEKs expire your system can no longer decrypt that tenant's payloads. That's the bit auditors want to see written down, "the customer has cryptographic control over data deletion," because it's a strong answer to GDPR's right to erasure and to enterprise procurement's "what happens when we leave" question. It covers the encrypted payloads only: the plaintext vectors stay in Qdrant until you delete them, so offboarding still has to drop the collection.

Use it when: regulated industry, enterprise tenant with a CISO, or any case where "the customer can delete their data without our cooperation" is a contract clause.

Combining patterns: the real production stack

Don't pick one. Pick a tier.

free / trial tier      → Pattern 1 (filter-at-query)
growth tier (SaaS)     → Pattern 2 (collection per tenant)
enterprise tier        → Pattern 2 + Pattern 3 (separate cluster)
regulated tier         → Pattern 2 + Pattern 3 + Pattern 4 (BYOK)

The combination matters more than the choice. A namespace-per-tenant deployment with per-tenant KMS keys covers most of what enterprise procurement asks for. Add separate clusters when the contract requires "physical isolation" by name, which is more often than you'd guess for healthcare and finance buyers.

Your provisioning code becomes a tier-aware factory:

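# create_tenant_kek, provision_dedicated_cluster, provision_namespace,
# register_tenant, and SHARED_CLUSTER are placeholders for your control plane
def provision_tenant(tenant_id: str, tier: str):
    if tier == "regulated":
        # Patterns 2 + 3 + 4: dedicated cluster plus a tenant-owned KEK
        kms.create_alias(
            AliasName=f"alias/rag-tenant-{tenant_id}",
            TargetKeyId=create_tenant_kek(tenant_id),
        )
        cluster_url = provision_dedicated_cluster(tenant_id)
        register_tenant(tenant_id, cluster_url, encrypted=True)
    elif tier == "enterprise":
        # Patterns 2 + 3: dedicated cluster, no customer-managed key
        cluster_url = provision_dedicated_cluster(tenant_id)
        register_tenant(tenant_id, cluster_url, encrypted=False)
    elif tier == "growth":
        # Pattern 2: own collection on the shared cluster; create it first so
        # a registered tenant always has somewhere to write
        provision_namespace(tenant_id)
        register_tenant(tenant_id, SHARED_CLUSTER, encrypted=False)
    else:
        # free / trial, Pattern 1: shared collection, filter at query time
        register_tenant(tenant_id, SHARED_CLUSTER, encrypted=False)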

The gotcha nobody documents: the shared embedder is a side channel

All four patterns above isolate the data at rest and at retrieval. They don't isolate the embedding model.

Every tenant's chunks pass through the same text-embedding-3-small call. If you're using a hosted embedder (OpenAI, Cohere, Voyage), tenant A's text and tenant B's text both travel to the same external endpoint. If the provider has a bug, a logging leak, or a billing-side data retention policy you didn't read, the isolation you built in your vector store stops at the embedder boundary.

Worse, embeddings are effectively deterministic for a given model version. If tenant A and tenant B both ingest a document containing the exact same paragraph (say, the same boilerplate NDA clause), their vectors will be identical or within floating-point noise of each other. That's not a leak in itself, but it means a sufficiently motivated attacker who controls one tenant can probe for the presence of specific text in another tenant's corpus by ingesting candidate snippets and looking for vector collisions in any system that exposes raw vectors.
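To make the probe concrete, here is a toy illustration. `toy_embed` is a hash-based stand-in for a real embedding model (an assumption purely for demonstration; it shares only the "same text in, same vector out" property), and `probe` is a hypothetical helper that checks a candidate snippet against vectors an API has exposed.

```python
import hashlib


def toy_embed(text: str, dim: int = 8) -> list[float]:
    # Stand-in for a real model: deterministic, so the same text gives the same vector.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def probe(candidate: str, exposed_vectors: list[list[float]],
          tol: float = 1e-6) -> bool:
    """True if some exposed vector matches the candidate's embedding within tol."""
    target = toy_embed(candidate)
    return any(
        all(abs(x - y) <= tol for x, y in zip(target, vec))
        for vec in exposed_vectors
    )
```

The tolerance stands in for the floating-point noise a hosted model may add. The attacker never needs the other tenant's text, only a guess at it and read access to vectors, which is why the mitigation is to stop returning vectors at all.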

What to do about it:

Pick a self-hosted embedder for regulated tenants. bge-large-en-v1.5 or e5-mistral-7b-instruct on your own GPU. Slower, more expensive per-token, but the text never leaves your VPC.
If you stay on a hosted embedder, sign a DPA with the provider that explicitly covers embedding inputs and document it in your security overview. Saying "we use OpenAI for embeddings" in your DPA is the bare minimum your enterprise buyers will check for.
Never expose raw vector values back to API consumers. Return chunk IDs and decrypted text only. The collision probe attack relies on the attacker being able to read back vectors.
Use per-tenant embedding models for the highest tier if you can afford it. Fine-tuned variants per tenant give you both better retrieval quality and a real isolation boundary on the embedding side.
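The "never expose raw vectors" item can be enforced at the API boundary with an explicit allow-list. A minimal sketch, assuming each returned point has `id`, `score`, and `payload` attributes (as Qdrant's scored points do) and that the decrypted chunk text sits under a hypothetical `text` payload key:

```python
from typing import Any, Iterable


def to_api_response(points: Iterable[Any]) -> list[dict]:
    """Keep id, text, and score; drop vectors and every other payload field."""
    out = []
    for p in points:
        payload = getattr(p, "payload", None) or {}
        out.append({
            "chunk_id": str(p.id),
            "text": payload.get("text", ""),  # "text" is an assumed payload key
            "score": float(p.score),
        })
    return out
```

Building the response from named fields, rather than deleting `vector` from whatever came back, means a new field added to the point later doesn't leak by default.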

This is the bit most teams skip in their threat model. The vector store gets the security review; the embedder gets a footnote. Reverse that order when you write the threat model up, because a hosted embedder is the part of the system you don't control and can't audit.

The takeaway

Filter-at-query is fast and fragile: fine for trials, terrible under audit. Namespace-per-tenant is the SaaS default and the right answer for the next thousand customers. Separate indexes buy you a network diagram you can show a procurement team. Per-tenant KMS keys are the only pattern that gives the customer cryptographic control over deletion, and that's the one regulators name in their questionnaires.

Pick a tier, layer the patterns, and write the embedder side-channel into your security overview before someone else finds it for you.

What's the leakiest tenant-isolation bug you've shipped in a RAG system, and which pattern would've caught it?

If this was useful

Multi-tenancy is one chapter of the production-RAG story; the rest is chunking strategy, hybrid retrieval, reranking, and the eval loop that keeps quality from rotting. The RAG Pocket Guide walks through the production patterns end-to-end, including the security and isolation chapter this post draws from. Worth a read if you're past the "it works in a notebook" phase and trying to make the system survive a SOC 2 audit.