Originally published on PrepStack. This is **Part 5** of *Context Engineering for Enterprise AI*.
Parts 1–4 built the context layer for Mattrx, a multi-tenant marketing-analytics SaaS (110k MAU, ~9,000 tenants, ~3,200 req/sec peak, ASP.NET Core / .NET 9 + a separate Python FastAPI AI service). Every one of those parts quietly leaned on one primitive — the tenant boundary. This part makes it the design surface instead of an afterthought.
TL;DR
Multi-tenancy is not a WHERE tenant_id = ? you sprinkle on queries. It is a single boundary that has to hold across every surface of the context pipeline — identity, retrieval, memory, prompt, cache, rate limit, model routing, cost, and residency — each with its own failure mode.
Mattrx results after making tenant scope a first-class context primitive (2-week build):
- Cross-tenant leak incidents (docs + cache + prompt + logs, load + red-team): 0.
- Noisy-neighbor: other tenants' p95 during the whale's nightly batch: 6.4s -> 1.9s.
- Prompt-cache hit on the shared system preamble: 0% -> 71% (~50% cheaper prefill).
- Per-tenant cost attribution: 0% (one blended bill) -> 100% (per-call ledger).
- Runaway-tenant spend in one hour: ~$140 -> $5 (budget cap trips; clean 429).
- New-tenant isolation onboarding: ~2 days manual -> < 5 min automated.
- Retrieval p95 on the pool index with a tenant filter at ~9,000 tenants: 31 ms (held); whale on a dedicated index: 22 ms, and pool recall@5 recovered 0.88 -> 0.94.
- Cost per AI query: $0.008 (unchanged) — now attributed, capped, and routed per plan.
The one mental shift
Stop treating the tenant as a filter you remember to add. Treat tenant scope as part of the context itself — resolved once at the edge, carried as an unforgeable token through retrieval, prompt, cache, model, and ledger. If isolation depends on anyone remembering to add a filter, it will leak the day someone forgets.
1. Tenant identity: resolve once, never trust the body
The first multi-tenant version read tenant_id from wherever was convenient — a query string, the JSON body — and passed it to the AI service. Any authenticated user could read another tenant by changing one field.
The fix: resolve the tenant once, at the edge, from the signed workspace_id claim in the JWT. Everything downstream receives a TenantScope it cannot forge or widen, and every method signature gains a required TenantScope parameter, so a query that forgets the tenant fails to compile. Moving identity from the body to the token closed the entire "change one field, read another tenant" class: 0 successful cross-tenant reads in red-team testing.
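A minimal Python sketch of the same idea, as it might look on the FastAPI side. It assumes the JWT signature has already been verified upstream and that the claims arrive as a plain dict. The `region` claim and the `search_documents` function are hypothetical names used only for illustration. Python has no compile step, so the "refuses to compile" guarantee becomes a missing-argument error at call time (or a static type checker failure) rather than a build break.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantScope:
    """Immutable tenant identity, built only from verified token claims."""
    tenant_id: str
    region: str


def resolve_tenant_scope(verified_claims: dict) -> TenantScope:
    """Build the scope from the claims of an already-verified JWT.

    The request body and query string are deliberately never consulted.
    """
    workspace_id = verified_claims.get("workspace_id")
    if not workspace_id:
        raise PermissionError("token carries no workspace_id claim")
    # Assumption: residency travels as a "region" claim (hypothetical).
    region = str(verified_claims.get("region", "unknown"))
    return TenantScope(tenant_id=str(workspace_id), region=region)


def search_documents(scope: TenantScope, question: str) -> dict:
    """Hypothetical downstream call: the scope is a required argument."""
    return {"tenant_id": scope.tenant_id, "question": question}
```

Because the dataclass is frozen, a handler cannot widen the scope after it is resolved; the only way to get a different tenant is a different token.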
2. Knowledge isolation: pool by default, silo on a threshold
Three models, and why the hybrid (bridge) wins:
| Model | What it is | Isolation | Best for |
|---|---|---|---|
| Pool | One index, every doc tagged tenant_id, hard filter on read | Logical (a missing filter = leak) | The long tail of small tenants |
| Silo | One index per tenant | Physical (nothing to forget) | Whales, Enterprise, EU-residency |
| Bridge | Pool by default; promote to silo on a threshold | Logical for most, physical for the few | Mattrx's choice |
Pure silo is impossible at 9,000 tenants (Azure AI Search caps indexes per service). Pure pool can't give residency and lets one tenant's corpus degrade everyone's recall. Bridge keeps the cheap, instant pool for the 99% and spends physical isolation only where size, SLA, or regulation forces it. Crucially, the tenant_id filter is applied by the router, never by the call site — so no caller can forget it. Promoting the whale to its own index dropped its p95 to 22 ms and recovered the long tail's recall@5 from 0.88 to 0.94.
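The router idea can be sketched as follows. This is an illustration under stated assumptions, not Mattrx's implementation: `IndexRouter`, `SearchPlan`, and `should_promote` are hypothetical names, the promotion threshold is a placeholder parameter, and the filter string follows the OData style that Azure AI Search filters use (single quotes escaped by doubling).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SearchPlan:
    index_name: str
    filter_expr: Optional[str]  # None only when the index is a dedicated silo


class IndexRouter:
    """Decides which index a tenant reads and attaches the tenant filter."""

    def __init__(self, pool_index: str, silo_indexes: Dict[str, str]):
        self._pool_index = pool_index
        self._silo_indexes = dict(silo_indexes)

    def plan(self, tenant_id: str) -> SearchPlan:
        silo = self._silo_indexes.get(tenant_id)
        if silo is not None:
            # Physical isolation: the index holds one tenant's documents.
            return SearchPlan(silo, None)
        # Logical isolation: the router, not the caller, builds the filter.
        escaped = tenant_id.replace("'", "''")
        return SearchPlan(self._pool_index, f"tenant_id eq '{escaped}'")

    def promote(self, tenant_id: str, index_name: str) -> None:
        """Move a tenant from the pool to its own index."""
        self._silo_indexes[tenant_id] = index_name


def should_promote(doc_count: int, threshold: int) -> bool:
    """Hypothetical size-based rule; SLA and residency rules are omitted."""
    return doc_count >= threshold
```

Call sites only ever ask the router for a plan, so there is no code path that reaches the pool index without a tenant filter attached.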
3. Prompt and policy variation: behavior as data, not branches
Per-tenant behavior used to accrete as if (tenantId == ...) branches, and large tenants got bespoke prompts rebuilt every call (a DateTime.UtcNow in the system block alone meant a 0% cache hit). The fix: tenant behavior is data (a TenantConfig row), and the prompt is assembled as a byte-stable shared preamble first (identical bytes for every tenant, so prompt caching reuses it) followed by a small tenant delta. Result: system-block cache hit 0% -> 71%, and every tenant inherits the same injection/redaction rules — no per-tenant safety drift.
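A small sketch of the preamble-plus-delta assembly. The preamble text and the `TenantConfig` fields (`tone`, `max_answer_words`) are made up for illustration; the point is only that the first block contains no tenant data and no timestamps, so its bytes are the same for every tenant and every call.

```python
import hashlib
from dataclasses import dataclass
from typing import List

# Shared block: no tenant values, no clock reads, identical bytes every call.
SHARED_PREAMBLE = (
    "You are an analytics assistant.\n"
    "Apply the shared injection and redaction rules to every answer.\n"
)


@dataclass(frozen=True)
class TenantConfig:
    """Per-tenant behavior stored as data (fields here are hypothetical)."""
    tenant_id: str
    tone: str
    max_answer_words: int


def build_system_blocks(config: TenantConfig) -> List[str]:
    """Return [shared preamble, tenant delta], always in that order."""
    delta = (
        f"Tone: {config.tone}\n"
        f"Answer limit: {config.max_answer_words} words\n"
    )
    return [SHARED_PREAMBLE, delta]


def preamble_digest(blocks: List[str]) -> str:
    """Hash of the first block, handy for asserting byte stability in tests."""
    return hashlib.sha256(blocks[0].encode("utf-8")).hexdigest()
```

Comparing `preamble_digest` across tenants in a unit test is one cheap way to catch a stray timestamp or tenant name creeping back into the shared block.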
4. Cache, quota, and routing: stop one tenant starving (or impersonating) another
Two shared resources had no tenant in them. The answer cache was keyed only by the question hash — so A's cached answer was served to B. And one global rate limiter let the whale's nightly batch consume the whole budget.
Fixes: the cache key carries residency + tenant (residency:tenant:hash), so answers can't cross the boundary; rate limiting is partitioned by tenant with per-plan token buckets (QueueLimit = 0 → a clean 429 with Retry-After, not an unbounded queue); model routing respects plan and remaining budget (over budget downgrades the model, never 500s). Partitioned fairness held small tenants' p95 at 1.9s during the whale's batch, down from 6.4s.
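The cache key and the per-tenant bucket can be sketched in a few lines of Python. This is a simplified stand-in for the ASP.NET Core rate limiter mentioned above, not its API: `TokenBucket`, `TenantRateLimiter`, and the plan limits are hypothetical, and time is passed in explicitly so the behavior is easy to test.

```python
import hashlib
import math
from typing import Dict, Tuple


def answer_cache_key(residency: str, tenant_id: str, question: str) -> str:
    """Key shape residency:tenant:hash, so answers cannot cross tenants."""
    normalized = question.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"{residency}:{tenant_id}:{digest}"


class TokenBucket:
    def __init__(self, capacity: int, refill_per_sec: float, now: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.updated = now

    def try_acquire(self, now: float) -> float:
        """Return 0.0 if allowed, else seconds until the next token."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 0.0
        return (1.0 - self.tokens) / self.refill_per_sec


class TenantRateLimiter:
    """One bucket per tenant, sized by plan; no queue, reject immediately."""

    def __init__(self, plan_limits: Dict[str, Tuple[int, float]]):
        self._plan_limits = plan_limits  # plan -> (capacity, refill_per_sec)
        self._buckets: Dict[str, TokenBucket] = {}

    def check(self, tenant_id: str, plan: str,
              now: float) -> Tuple[int, Dict[str, str]]:
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            capacity, refill = self._plan_limits[plan]
            bucket = self._buckets[tenant_id] = TokenBucket(capacity, refill, now)
        retry_after = bucket.try_acquire(now)
        if retry_after > 0:
            return 429, {"Retry-After": str(math.ceil(retry_after))}
        return 200, {}
```

Since each tenant drains only its own bucket, a whale exhausting its allowance has no effect on the tokens available to anyone else.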
5. Cost attribution, residency, and the RLS backstop
Every model call books cost to a per-tenant ledger, which feeds dashboards, the budget cap from Section 4, and a usage-based billing export. Residency is enforced by region-pinned indexes and Azure SQL Row-Level Security, so even a buggy query that forgets the tenant filter returns nothing instead of leaking. The ledger took attribution 0% -> 100% and capped a runaway tenant's hour at $5 instead of ~$140; RLS + region pinning kept EU data in-region with 0 cross-region reads in audit.
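As a rough sketch of the ledger and the cap it feeds, assuming an in-memory store and an hourly bucket identified by a string. `CostLedger`, its method names, and the cap value used in any example are illustrative; a real ledger would be durable and written transactionally with the call record.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict, List, Tuple


class CostLedger:
    """Per-call cost entries, summed per tenant per hour."""

    def __init__(self, hourly_cap_usd: Decimal):
        self._cap = hourly_cap_usd
        self._entries: List[Tuple[str, str, Decimal]] = []
        self._spent: Dict[Tuple[str, str], Decimal] = defaultdict(Decimal)

    def record(self, tenant_id: str, hour: str, cost_usd: Decimal) -> None:
        """Book one model call against the tenant that made it."""
        self._entries.append((tenant_id, hour, cost_usd))
        self._spent[(tenant_id, hour)] += cost_usd

    def spent(self, tenant_id: str, hour: str) -> Decimal:
        return self._spent.get((tenant_id, hour), Decimal("0"))

    def allow(self, tenant_id: str, hour: str) -> bool:
        """False once the tenant has reached its cap for this hour."""
        return self.spent(tenant_id, hour) < self._cap

    def export_totals(self) -> Dict[str, Decimal]:
        """Total per tenant across all hours, e.g. for a billing export."""
        totals: Dict[str, Decimal] = defaultdict(Decimal)
        for tenant_id, _hour, cost in self._entries:
            totals[tenant_id] += cost
        return dict(totals)
```

The check happens before the call and the booking after it, so a tenant can overshoot the cap by at most the cost of the calls in flight; whether that is acceptable depends on how large a single call can get.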
The subtle part: RLS can mask a missing app filter by silently returning no rows. So Mattrx alerts when RLS filters a row the app should have scoped — the backstop catching something is a signal, not a success.
The closing mental model
Multi-tenancy is one boundary, enforced in many places, resolved once and never re-derived. Three habits:
- Resolve at the edge, carry as a token. Derive TenantScope from auth once; make every layer take it as a required input; let the type system reject any tenant-less call.
- Default to pool, promote on a threshold. Keep the cheap shared path for the 99%; spend physical isolation only where size, SLA, or regulation demands it, and automate the promotion.
- Enforce isolation twice. App filter for speed, RLS for the day the filter is missing. If a single forgotten WHERE can leak, you have not isolated anything; you've documented an intention.
👉 The full article — with all the C# (.NET 9) and Python code, the end-to-end boundary diagram, the pool/silo/bridge comparison, the pre-ship checklist, and the "honest stuff" caveats — is on PrepStack:
Context Engineering for Enterprise AI, Part 5