Context Engineering for Enterprise AI, Part 5: Multi-Tenant Patterns That Don't Leak, Starve, or Overspend

#ai #dotnet #architecture #machinelearning

Originally published on PrepStack. This is **Part 5** of *Context Engineering for Enterprise AI*.

Parts 1–4 built the context layer for Mattrx, a multi-tenant marketing-analytics SaaS (110k MAU, ~9,000 tenants, ~3,200 req/sec peak, ASP.NET Core / .NET 9 + a separate Python FastAPI AI service). Every one of those parts quietly leaned on one primitive — the tenant boundary. This part makes it the design surface instead of an afterthought.

TL;DR

Multi-tenancy is not a WHERE tenant_id = ? you sprinkle on queries. It is a single boundary that has to hold across every surface of the context pipeline — identity, retrieval, memory, prompt, cache, rate limit, model routing, cost, and residency — each with its own failure mode.

Mattrx results after making tenant scope a first-class context primitive (2-week build):

Cross-tenant leak incidents (docs + cache + prompt + logs, load + red-team): 0.
Noisy neighbor: other tenants' p95 during the nightly batch of the largest tenant (the "whale"): 6.4s -> 1.9s.
Prompt-cache hit on the shared system preamble: 0% -> 71% (~50% cheaper prefill).
Per-tenant cost attribution: 0% (one blended bill) -> 100% (per-call ledger).
Runaway-tenant spend in one hour: ~$140 -> $5 (budget cap trips; clean 429).
New-tenant isolation onboarding: ~2 days manual -> < 5 min automated.
Retrieval p95 on the pool index with a tenant filter at ~9,000 tenants: 31 ms (held); whale on a dedicated index: 22 ms, and pool recall@5 recovered 0.88 -> 0.94.
Cost per AI query: $0.008 (unchanged) — now attributed, capped, and routed per plan.

The one mental shift

Stop treating the tenant as a filter you remember to add. Treat tenant scope as part of the context itself — resolved once at the edge, carried as an unforgeable token through retrieval, prompt, cache, model, and ledger. If isolation depends on anyone remembering to add a filter, it will leak the day someone forgets.

1. Tenant identity: resolve once, never trust the body

The first multi-tenant version read tenant_id from wherever was convenient — a query string, the JSON body — and passed it to the AI service. Any authenticated user could read another tenant by changing one field.

The fix: resolve the tenant once, at the edge, from the signed workspace_id claim in the JWT. Everything downstream receives a TenantScope it cannot forge or widen, and every method signature gains a required TenantScope parameter, so a call that omits the tenant fails to compile. Moving identity from the body to the token closed the entire "change one field, read another tenant" class: 0 successful cross-tenant reads in red-team testing.
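A minimal sketch of the "resolve once, carry as a token" idea, written in Python for the AI-service side. It assumes the JWT signature has already been verified upstream and that the claims arrive as a plain mapping; `resolve_tenant_scope`, `TenantResolutionError`, and `fetch_dashboards` are hypothetical names, not Mattrx's actual code. Python has no compile-time check, so here a frozen dataclass plus a required parameter (and a type checker such as mypy) stands in for what the compiler does in C#.

```python
from dataclasses import dataclass
from typing import Mapping


class TenantResolutionError(Exception):
    """Raised when the verified token carries no usable tenant claim."""


@dataclass(frozen=True)
class TenantScope:
    """Immutable tenant identity; frozen so downstream code cannot widen it."""
    tenant_id: str


def resolve_tenant_scope(verified_claims: Mapping[str, object]) -> TenantScope:
    """Build the scope from the claims of an already signature-verified JWT.

    The request body and query string are deliberately not inputs here,
    so a client cannot pick its own tenant by editing a field.
    """
    workspace_id = verified_claims.get("workspace_id")
    if not isinstance(workspace_id, str) or not workspace_id.strip():
        raise TenantResolutionError("missing workspace_id claim")
    return TenantScope(tenant_id=workspace_id)


def fetch_dashboards(scope: TenantScope, query: str) -> dict:
    """Hypothetical downstream call: the scope is a required argument."""
    return {"tenant_id": scope.tenant_id, "query": query}
```

Calling `fetch_dashboards` without a scope raises a `TypeError` at call time, and assigning to `scope.tenant_id` raises `dataclasses.FrozenInstanceError`, so the tenant can only come from the edge.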

2. Knowledge isolation: pool by default, silo on a threshold

Three models, and why the hybrid wins:

| Model | What it is | Isolation | Best for |
| --- | --- | --- | --- |
| Pool | One index, every doc tagged `tenant_id`, hard filter on read | Logical (a missing filter = leak) | The long tail of small tenants |
| Silo | One index per tenant | Physical (nothing to forget) | Whales, Enterprise, EU-residency |
| Bridge | Pool by default; promote to silo on a threshold | Logical for most, physical for the few | Mattrx's choice |

Pure silo is impossible at 9,000 tenants (Azure AI Search caps indexes per service). Pure pool can't give residency and lets one tenant's corpus degrade everyone's recall. Bridge keeps the cheap, instant pool for the 99% and spends physical isolation only where size, SLA, or regulation forces it. Crucially, the tenant_id filter is applied by the router, never by the call site — so no caller can forget it. Promoting the whale to its own index dropped its p95 to 22 ms and recovered the long tail's recall@5 from 0.88 to 0.94.
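The router-owned filter can be sketched as below. This is an illustration under assumptions, not the article's implementation: the index names, `IndexRouter`, and `SearchPlan` are hypothetical, and the filter string uses the OData style (`field eq 'value'`) that Azure AI Search accepts. Call sites only ever pass a `TenantScope`; they never write a filter themselves.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class TenantScope:
    tenant_id: str


@dataclass(frozen=True)
class SearchPlan:
    index_name: str
    filter_expr: Optional[str]  # None only when the index is dedicated


@dataclass
class IndexRouter:
    pool_index: str = "docs-pool"  # hypothetical shared index name
    silo_indexes: Dict[str, str] = field(default_factory=dict)

    def promote(self, tenant_id: str, index_name: str) -> None:
        """Record that a tenant crossed the threshold and owns an index."""
        self.silo_indexes[tenant_id] = index_name

    def plan(self, scope: TenantScope) -> SearchPlan:
        """Choose the index and build the tenant filter in one place."""
        silo = self.silo_indexes.get(scope.tenant_id)
        if silo is not None:
            # Dedicated index: isolation is physical, no filter required.
            return SearchPlan(index_name=silo, filter_expr=None)
        # Pool index: escape single quotes (OData doubles them) and filter.
        safe_id = scope.tenant_id.replace("'", "''")
        return SearchPlan(self.pool_index, f"tenant_id eq '{safe_id}'")
```

Keeping the filter on silo indexes as well would be a reasonable defensive variant if the dedicated index still carries a `tenant_id` field.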

3. Prompt and policy variation: behavior as data, not branches

Per-tenant behavior used to accrete as if (tenantId == ...) branches, and large tenants got bespoke prompts rebuilt every call (a DateTime.UtcNow in the system block alone meant a 0% cache hit). The fix: tenant behavior is data (a TenantConfig row), and the prompt is assembled as a byte-stable shared preamble first (identical bytes for every tenant, so prompt caching reuses it) followed by a small tenant delta. Result: system-block cache hit 0% -> 71%, and every tenant inherits the same injection/redaction rules — no per-tenant safety drift.
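As a rough sketch of "shared preamble first, tenant delta second": the preamble text and the `TenantConfig` fields below are invented for illustration, and how a given model provider detects a cacheable prefix is outside this example. The point it shows is that the first block contains no timestamp, tenant name, or other per-request value, so its bytes are identical for every tenant.

```python
import hashlib
from dataclasses import dataclass
from typing import List

# Hypothetical preamble: constant text only, no timestamps or tenant values.
SHARED_PREAMBLE = (
    "You are an analytics assistant.\n"
    "Apply the shared redaction and injection-handling rules.\n"
)


@dataclass(frozen=True)
class TenantConfig:
    """Tenant behavior as data (field names are assumptions)."""
    tenant_id: str
    tone: str
    max_answer_words: int


def build_system_blocks(cfg: TenantConfig) -> List[str]:
    """Return [shared preamble, tenant delta], always in that order."""
    delta = (
        f"Tenant settings: tone={cfg.tone}; "
        f"max_answer_words={cfg.max_answer_words}\n"
    )
    return [SHARED_PREAMBLE, delta]


def preamble_fingerprint(blocks: List[str]) -> str:
    """Hash of the first block; useful in a test that guards byte stability."""
    return hashlib.sha256(blocks[0].encode("utf-8")).hexdigest()
```

A unit test that asserts the fingerprint is equal across two different tenants is a cheap guard against someone reintroducing a per-request value into the preamble.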

4. Cache, quota, and routing: stop one tenant starving (or impersonating) another

Two shared resources had no tenant in them. The answer cache was keyed only by the question hash — so A's cached answer was served to B. And one global rate limiter let the whale's nightly batch consume the whole budget.

Fixes: the cache key carries residency + tenant (residency:tenant:hash), so answers can't cross the boundary; rate limiting is partitioned by tenant with per-plan token buckets (QueueLimit = 0 → a clean 429 with Retry-After, not an unbounded queue); model routing respects plan and remaining budget (over budget downgrades the model, never 500s). Partitioned fairness held small tenants' p95 at 1.9s during the whale's batch, down from 6.4s.
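The two fixes can be sketched together. This is a simplified, single-process illustration: the plan limits are made-up numbers, `TenantRateLimiter` is a hypothetical name, and the article's own limiter lives in ASP.NET Core rather than Python. The cache key follows the residency:tenant:hash layout described above, and the limiter keeps one token bucket per tenant with no queue, so an empty bucket yields a 429 and a Retry-After value immediately.

```python
import hashlib
import math
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


def answer_cache_key(residency: str, tenant_id: str, question: str) -> str:
    """Key layout residency:tenant:hash, so entries cannot cross tenants."""
    digest = hashlib.sha256(question.encode("utf-8")).hexdigest()
    return f"{residency}:{tenant_id}:{digest}"


@dataclass
class TokenBucket:
    capacity: float
    refill_per_sec: float
    tokens: float
    updated_at: float

    def try_take(self, now: float, cost: float = 1.0) -> Optional[float]:
        """Return None if admitted, else seconds until enough tokens exist."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return None
        return (cost - self.tokens) / self.refill_per_sec


# Hypothetical per-plan limits: (bucket capacity, refill tokens per second).
PLAN_LIMITS: Dict[str, Tuple[float, float]] = {
    "free": (5.0, 1.0),
    "enterprise": (50.0, 10.0),
}


class TenantRateLimiter:
    """One bucket per tenant; no queue, so rejection is immediate."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def check(self, tenant_id: str, plan: str) -> Tuple[int, int]:
        """Return (http_status, retry_after_seconds)."""
        now = self._clock()
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            capacity, rate = PLAN_LIMITS[plan]
            bucket = TokenBucket(capacity, rate, capacity, now)
            self._buckets[tenant_id] = bucket
        retry_after = bucket.try_take(now)
        if retry_after is None:
            return 200, 0
        return 429, math.ceil(retry_after)
```

Because each tenant drains only its own bucket, a batch that exhausts one tenant's tokens leaves every other tenant's bucket untouched.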

5. Cost attribution, residency, and the RLS backstop

Every model call books cost to a per-tenant ledger, which feeds dashboards, the budget cap from Section 4, and a usage-based billing export. Residency is enforced by region-pinned indexes and Azure SQL Row-Level Security (RLS), so even a buggy query that forgets the tenant filter returns nothing instead of leaking. The ledger took attribution 0% -> 100% and capped a runaway tenant's hour at $5 instead of ~$140; RLS + region pinning kept EU data in-region with 0 cross-region reads in audit.
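A per-call ledger with an hourly cap might look like the sketch below. It is an in-memory illustration only: `CostLedger`, `BudgetExceeded`, the hour-bucket integer, and the example model name are assumptions, and a real ledger would persist entries. The caller decides what a tripped cap means (a 429, or a cheaper model as in Section 4).

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, List, Tuple


class BudgetExceeded(Exception):
    """Signals that the tenant has used up its budget for this hour."""


@dataclass(frozen=True)
class LedgerEntry:
    tenant_id: str
    hour_bucket: int  # e.g. epoch seconds // 3600 (an assumption)
    model: str
    cost_usd: Decimal


class CostLedger:
    def __init__(self, hourly_cap_usd: Decimal) -> None:
        self._cap = hourly_cap_usd
        self._entries: List[LedgerEntry] = []
        self._spent: Dict[Tuple[str, int], Decimal] = {}

    def spent(self, tenant_id: str, hour_bucket: int) -> Decimal:
        """Total booked for one tenant in one hour."""
        return self._spent.get((tenant_id, hour_bucket), Decimal("0"))

    def ensure_budget(self, tenant_id: str, hour_bucket: int) -> None:
        """Call before the model call; raises once the cap is reached."""
        if self.spent(tenant_id, hour_bucket) >= self._cap:
            raise BudgetExceeded(tenant_id)

    def book(self, tenant_id: str, hour_bucket: int,
             model: str, cost_usd: Decimal) -> None:
        """Call after the model call; every call is attributed to a tenant."""
        self._entries.append(
            LedgerEntry(tenant_id, hour_bucket, model, cost_usd))
        key = (tenant_id, hour_bucket)
        self._spent[key] = self.spent(tenant_id, hour_bucket) + cost_usd

    def entries_for(self, tenant_id: str) -> List[LedgerEntry]:
        """Per-tenant rows, e.g. for a billing export."""
        return [e for e in self._entries if e.tenant_id == tenant_id]
```

`Decimal` is used instead of `float` so that many small per-call amounts sum exactly, which matters once the ledger feeds billing.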

The subtle part: RLS can mask a missing app filter by silently returning no rows. So Mattrx alerts when RLS filters a row the app should have scoped — the backstop catching something is a signal, not a success.

The closing mental model

Multi-tenancy is one boundary, enforced in many places, resolved once and never re-derived. Three habits:

Resolve at the edge, carry as a token. Derive TenantScope from auth once; make every layer take it as a required input; let the type system reject any tenant-less call.
Default to pool, promote on a threshold. Keep the cheap shared path for the 99%; spend physical isolation only where size, SLA, or regulation demands it — and automate the promotion.
Enforce isolation twice. App filter for speed, RLS for the day the filter is missing. If a single forgotten WHERE can leak, you have not isolated anything — you've documented an intention.

👉 The full article — with all the C# (.NET 9) and Python code, the end-to-end boundary diagram, the pool/silo/bridge comparison, the pre-ship checklist, and the "honest stuff" caveats — is on PrepStack:
Context Engineering for Enterprise AI, Part 5