Originally published on PrepStack. This is **Part 6** (the finale) of *Context Engineering for Enterprise AI*.
Parts 1–5 built capability for Mattrx, a multi-tenant marketing-analytics SaaS (110k MAU, ~9,000 tenants, ~3,200 req/sec peak, ASP.NET Core / .NET 9 on Azure SQL plus a Python FastAPI AI service): a budgeted context window, a memory layer, multi-agent orchestration, an enterprise design spine, and multi-tenant isolation. This part is the layer underneath all of them — the control plane that decides what data is allowed to enter or leave the context at all, for whom, and for what purpose. Capability without governance is a breach waiting for a date.
TL;DR
Governance is a control plane, not a checklist. Five controls decide what may enter or leave the context, each enforced before the model sees data, never after a breach:
| Control | The question it answers | Where it's enforced |
|---|---|---|
| Classification | How sensitive is this data? | A classify gate at ingestion; class + purpose tags to the Data Catalog |
| Entitlement-aware retrieval | May *this user* (not just this tenant) see it? | Principal ACL/group filter stacked on the Part 5 tenant filter |
| Consent & purpose | Are we *allowed* to use it for AI at all? | Purpose tags + a consent registry checked at use time |
| Lineage & provenance | What exactly did the model see, and why allowed? | An append-only record per generation |
| Policy-as-code (PDP) | Who decides, and can we prove the rule? | One deny-by-default Policy Decision Point every path calls |
Mattrx production results (3-week build):
- Confidential/Restricted documents reaching the shared embedding store ungoverned: ~3,100 -> 0.
- Intra-tenant cross-principal leaks (a support agent retrieving a finance doc inside their own tenant), red-team: reproducible -> 0.
- "What did the model see about subject X, and why allowed?": 0% -> 100% of generations; DSAR fulfillment ~2 days -> under 3 minutes (one query).
- Customer data used for eval/fine-tuning without a consenting purpose: unbounded -> 0.
- Governance enforcement: ~14 scattered sites -> 1 PDP + N versioned policies; a policy change ships without touching service code.
- Governance overhead on the hot path: +4 ms p95 (PDP decision ~3 ms cached, deny-by-default). Retrieval p95 31 ms -> 35 ms, recall@5 held 0.94; cost/query $0.008 (unchanged).
The one mental shift
Governance is a control plane that runs before retrieval, not a report you generate after an incident. Make every data access a question — can(principal, data, purpose, context)? — answered deny-by-default by one engine, and recorded — so the answer to "could this leak?" is a query, not a prayer.
1. Classification at ingestion: you cannot govern what you never labeled
Before: everything a tenant connected was embedded the moment it arrived — no sensitivity label, no purpose. A "knowledge base" sync could vectorize a spreadsheet of customer emails and a file of API tokens, now one cosine hop from any prompt.
After: a classify gate runs at ingestion. Every asset gets a sensitivity class (Public / Internal / Confidential / Restricted) and purpose tags, is registered in a Data Catalog, and then routed — Public/Internal to the pool index, Confidential to a restricted index, and Restricted is catalogued and quarantined, never embedded at all. Cheap deterministic detectors (regex/Presidio for secrets and PII) run first; an LLM classifier handles only the ambiguous remainder. Classifying at the front door means a missing filter downstream can't expose what was never indexed. The gate moved ~3,100 sensitive docs out of the shared store (0.4% over-quarantine rate, human-reviewed).
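A minimal sketch of the classify gate described above, in Python. The two regexes, the route names, the `CatalogEntry` shape, and the "default to Internal when nothing fires and no classifier is supplied" rule are all assumptions made for illustration; they are not Mattrx's detectors, and a real gate would lean on a maintained detector library such as Presidio rather than hand-rolled patterns.

```python
import re
from dataclasses import dataclass
from enum import IntEnum


class Sensitivity(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical cheap detectors (illustrative only, far from exhaustive).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET_RE = re.compile(r"(?i)\b(?:api[_-]?key|secret|token)\b\s*[:=]\s*\S{16,}")

# Routing from the text: Public/Internal -> pool, Confidential -> restricted,
# Restricted -> catalogued and quarantined, never embedded.
ROUTES = {
    Sensitivity.PUBLIC: "pool_index",
    Sensitivity.INTERNAL: "pool_index",
    Sensitivity.CONFIDENTIAL: "restricted_index",
    Sensitivity.RESTRICTED: "quarantine",
}


@dataclass(frozen=True)
class CatalogEntry:
    asset_id: str
    sensitivity: Sensitivity
    purposes: frozenset
    route: str


def classify(text, llm_classifier=None):
    """Deterministic detectors first; the optional classifier sees only the remainder."""
    if SECRET_RE.search(text):
        return Sensitivity.RESTRICTED
    if EMAIL_RE.search(text):
        return Sensitivity.CONFIDENTIAL
    if llm_classifier is not None:
        return llm_classifier(text)
    return Sensitivity.INTERNAL  # assumption: fallback when no classifier is wired in


def ingest(asset_id, text, purposes, catalog, llm_classifier=None):
    """Label the asset, register it in the catalog, and return where it may go."""
    sensitivity = classify(text, llm_classifier)
    entry = CatalogEntry(asset_id, sensitivity, frozenset(purposes), ROUTES[sensitivity])
    catalog[asset_id] = entry  # every asset gets a catalog row, including quarantined ones
    return entry
```

The point of the shape is that the embedding step only ever receives assets whose `route` is an index, so a quarantined asset has nothing downstream to leak from.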
2. Entitlement-aware retrieval: authorize the principal, not just the tenant
Part 5 made retrieval tenant-scoped. But inside a tenant, retrieval returned anything that tenant owned, to anyone in it — a support agent could pull the revenue runbook. The fix stacks a principal predicate (the user's groups/clearance) on top of the tenant predicate; both apply at once. Mattrx runs a hybrid: a fast ACL filter in the index, then a live revalidation of the surviving top-k against the entitlements service, so a just-revoked permission can't leak through a stale index. Intra-tenant cross-role leaks went reproducible -> 0, recall@5 held at 0.94, +4 ms p95.
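The two-stage hybrid can be sketched roughly like this. The in-memory list stands in for the vector index, and `Chunk`, `Principal`, and the `still_entitled` callback (a stand-in for the entitlements service) are hypothetical names, not the real Mattrx interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    tenant_id: str
    allowed_groups: frozenset  # ACL snapshot stored alongside the vector
    score: float


@dataclass(frozen=True)
class Principal:
    user_id: str
    tenant_id: str
    groups: frozenset


def retrieve(index, principal, still_entitled, k=5):
    # Stage 1: tenant predicate AND principal predicate, applied together in the index.
    candidates = [
        c for c in index
        if c.tenant_id == principal.tenant_id and c.allowed_groups & principal.groups
    ]
    top_k = sorted(candidates, key=lambda c: c.score, reverse=True)[:k]
    # Stage 2: live revalidation of the survivors, so a stale ACL snapshot cannot leak.
    return [c for c in top_k if still_entitled(principal, c.doc_id)]


# Example data (all hypothetical).
index = [
    Chunk("faq", "t1", frozenset({"support", "finance"}), 0.90),
    Chunk("revenue-runbook", "t1", frozenset({"finance"}), 0.95),
    Chunk("other-tenant-doc", "t2", frozenset({"support"}), 0.99),
    Chunk("macros", "t1", frozenset({"support"}), 0.80),
]
agent = Principal("u1", "t1", frozenset({"support"}))
revoked = {("u1", "macros")}  # permission pulled after the index was built


def live_check(principal, doc_id):
    return (principal.user_id, doc_id) not in revoked


visible = [c.doc_id for c in retrieve(index, agent, live_check)]
```

Here the support agent never sees `revenue-runbook` (wrong group) or `other-tenant-doc` (wrong tenant), and `macros` is dropped in stage 2 even though the index still lists the agent's group on it.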
3. Consent and purpose limitation: allowed to have it isn't allowed to use it
If Mattrx stored data, every feature treated it as fair game: retrieval, agent analysis, eval, fine-tuning. But "we hold this to run the customer's campaigns" is not consent to "use it to improve our product." Every asset carries purpose tags; a consent registry records what each tenant agreed to; each AI use names its purpose (serve / eval / train), and the PDP refuses data whose purposes don't include it. Opt-out tenants are structurally excluded from eval/train while still fully served their own features. Customer data used for eval or fine-tuning without a consenting purpose went from unbounded to 0.
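As a rough illustration of the purpose check, assuming the consent registry is a simple mapping from tenant id to the purposes that tenant agreed to (the `Asset` shape and the registry format are hypothetical):

```python
from dataclasses import dataclass

PURPOSES = ("serve", "eval", "train")  # the three purposes named in the text


@dataclass(frozen=True)
class Asset:
    asset_id: str
    tenant_id: str
    purposes: frozenset  # purpose tags assigned at ingestion


def purpose_allowed(asset, declared_purpose, consent_registry):
    """Allow only if the asset's tags AND the tenant's consent both include the purpose."""
    if declared_purpose not in PURPOSES:
        return False
    # Unknown tenant means no recorded consent, which means deny.
    tenant_consent = consent_registry.get(asset.tenant_id, frozenset())
    return declared_purpose in asset.purposes and declared_purpose in tenant_consent


# Hypothetical registry: one tenant opted in to everything, one opted out of eval/train.
registry = {
    "t-optin": frozenset({"serve", "eval", "train"}),
    "t-optout": frozenset({"serve"}),
}
```

An opt-out tenant still passes the `serve` check for its own features, but no `eval` or `train` job can pull its assets, whatever the asset tags say.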
4. Lineage and provenance: prove exactly what the model saw
A DSAR ("what does your AI know about me, and where did it come from?") used to be a multi-day dig through stateless logs. Now every generation writes one append-only lineage record: output hash, the exact source assets (id, version, class) that entered the prompt, the principal, the declared purpose, and the PDP decisions that allowed each source. It stores references and versions, not raw content (so the audit trail isn't a second copy of the sensitive data), the table grants no UPDATE/DELETE, and the write is off the hot path. Provenance coverage 0% -> 100%; DSAR ~2 days -> under 3 minutes.
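One possible shape for the lineage record, sketched against an in-memory SQLite database. SQLite has no GRANT, so the triggers below are a stand-in for "the table grants no UPDATE/DELETE"; the column names and the `decision` string format are assumptions.

```python
import hashlib
import json
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE lineage (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    created_at    TEXT NOT NULL,
    output_sha256 TEXT NOT NULL,
    principal     TEXT NOT NULL,
    purpose       TEXT NOT NULL,
    sources_json  TEXT NOT NULL
);
CREATE TRIGGER lineage_no_update BEFORE UPDATE ON lineage
BEGIN SELECT RAISE(ABORT, 'lineage is append-only'); END;
CREATE TRIGGER lineage_no_delete BEFORE DELETE ON lineage
BEGIN SELECT RAISE(ABORT, 'lineage is append-only'); END;
"""


def open_store():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def record_generation(conn, output_text, principal, purpose, sources):
    """sources: list of dicts with id, version, class, decision. References only, no content."""
    conn.execute(
        "INSERT INTO lineage (created_at, output_sha256, principal, purpose, sources_json) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            datetime.now(timezone.utc).isoformat(),
            hashlib.sha256(output_text.encode("utf-8")).hexdigest(),  # hash, not the output
            principal,
            purpose,
            json.dumps(sources, sort_keys=True),
        ),
    )
    conn.commit()


def generations_citing(conn, asset_id):
    """The DSAR-style question: which generations saw this asset?"""
    rows = conn.execute("SELECT id, sources_json FROM lineage ORDER BY id").fetchall()
    return [
        row_id for row_id, sources_json in rows
        if any(src["id"] == asset_id for src in json.loads(sources_json))
    ]
```

In a production database the lookup would be an indexed query rather than a Python scan, and the insert would be queued so it stays off the hot path; the sketch only shows what goes into the record and why it can't be edited afterwards.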
5. Policy-as-code: one decision point everything asks
Each rule above — and in Parts 2/4/5 — lived as its own if statements in its own service: ~14 enforcement sites, four dialects of "allowed," nothing to read, test, or change centrally. Drift was inevitable. The fix: a single Policy Decision Point answering can(principal, data, purpose, context)? deny-by-default, with rules as versioned, unit-tested policy-as-code (Mattrx runs OPA/Rego as a sidecar). Ingest, retrieval, output, and agent tools all ask the same PDP; every decision is cached (~3 ms) and recorded into lineage. The PDP fails closed — an outage blocks access rather than risking an open one, the correct failure mode for governance (and a real availability coupling you design for).
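To make the deny-by-default and fail-closed behavior concrete, here is a toy decision point in plain Python. It is a stand-in for illustration, not OPA/Rego: the `Request` fields flatten `can(principal, data, purpose, context)` into one hashable value, and the two rules are invented examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    principal_groups: frozenset
    data_class: str
    data_purposes: frozenset
    purpose: str


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str
    policy_version: str


# Hypothetical allow rules. Anything no rule allows is denied.
def serve_public_or_internal(req):
    return (req.purpose == "serve" and req.purpose in req.data_purposes
            and req.data_class in {"Public", "Internal"})


def serve_confidential_to_finance(req):
    return (req.purpose == "serve" and req.purpose in req.data_purposes
            and req.data_class == "Confidential" and "finance" in req.principal_groups)


class PolicyDecisionPoint:
    def __init__(self, rules, version):
        self._rules = list(rules)
        self.version = version
        self._cache = {}
        self.decision_log = []  # would feed the lineage record

    def _evaluate(self, req):
        try:
            for rule in self._rules:
                if rule(req):
                    return Decision(True, rule.__name__, self.version)
        except Exception:
            # Fail closed: a broken rule or engine outage blocks access.
            return Decision(False, "evaluation-error", self.version)
        return Decision(False, "default-deny", self.version)

    def can(self, req):
        decision = self._cache.get(req)
        if decision is None:
            decision = self._evaluate(req)
            if decision.reason != "evaluation-error":  # don't pin an outage in the cache
                self._cache[req] = decision
        self.decision_log.append((req, decision))
        return decision
```

Every caller (ingest, retrieval, output, agent tools) would go through `can`, and adding a rule changes the rule list and its version, not the calling services.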
Honest stuff
- Governance can become theater — a catalog nobody trusts and policies nobody tests manufacture false confidence. The PDP earns trust only because its policies are unit-tested in CI and its decisions are recorded.
- Classification is probabilistic — make the failure bounded and reviewable (every asset has a catalog row), and put humans on the quarantine queue.
- The PDP is an availability coupling — deny-by-default makes it a tier-1 dependency (sidecar, health checks, cached decisions).
- Don't classify what you can delete — minimization beats governance. The cheapest data to govern is the data you never kept.
The closing mental model
Govern at the front door (classify and decide before data is embedded). Ask one engine, deny by default (every path asks the same PDP; if a new path doesn't ask, it's a hole). Record so you can prove it (append-only lineage turns "trust us" into "here's the query"). If you can't reconstruct what the model saw and why it was allowed, you don't have governance — you have hope.
👉 The full article — with all the C# (.NET 9), Python, OPA/Rego, and SQL code, the control-plane diagram, the pre-ship checklist, and the full "honest stuff" — is on PrepStack:
Context Engineering for Enterprise AI, Part 6