Originally published on PrepStack. This is **Part 4 of 6* of* Context Engineering for Enterprise AI.
Parts 1–3 gave us a context pipeline, a memory layer, and a multi-agent architecture. All real, all measurable — and all still a demo until you wrap them in what this part covers: governance, security, evaluation, observability, cost control, and reliability. That is the enterprise design that lets you ship AI to 110k paying users without losing sleep, money, or a compliance audit.
TL;DR
A context pipeline without governance is a liability, not a feature. The hard part of enterprise AI is not the model — it's the boundary around it.
Production metrics after the full enterprise design is in place:
- Wrong-answer / hallucination rate: 18% (naive RAG) → 3%.
- Faithfulness (groundedness) eval score: 0.96; answer-relevance: 0.91.
- Eval gate threshold: any change dropping faithfulness below 0.90 is blocked in CI.
- Prompt-injection attempts blocked at the boundary: ~40/week.
- Cost per AI query: $0.021 → $0.008 (caching + model routing + context compression).
- Context tokens per request: ~14,000 → ~3,500.
- Agentic query p95: 4.2s → 1.8s.
- The C# app API p95 stays 120 ms — the AI work never bled into the product API.
- Every AI response carries a trace id + an immutable audit row (prompt hash, tokens, cost, citations).
The one mental shift
Stop treating the model as the system. The model is one untrusted, non-deterministic dependency inside a system you do govern. Everything around it — eval gates, the security boundary, cost routing, tracing, audit — is the part you actually own, test, and are accountable for. Engineer that, and the model becomes swappable.
Evaluation gates: stop shipping prompts on vibes
A prompt change is a code change with a non-deterministic compiler. You'd never merge a refactor without tests; don't merge a system-prompt edit without an eval.
A golden set of ~200 curated (question, ideal-answer, must-cite-source) tuples lives in version control. Every prompt or model change runs the offline harness in CI, scoring faithfulness and answer-relevance. A change that drops faithfulness below 0.90 fails the build. We sit at 0.96 faithfulness, 0.91 relevance.
Offline catches regressions; online catches drift. We sample ~2% of live traffic and run the same groundedness judge asynchronously (never on the hot path), alerting if rolling faithfulness dips.
Security: the boundary that says no
Every AI request passes through AiGovernanceMiddleware before it can reach the Python service. It enforces RBAC, stamps the authenticated tenant_id (never trusting a client-supplied one), redacts PII, and runs an injection classifier. Only a sanitized, scoped request crosses the HTTP boundary.
The injection classifier is cheap on the Python side — a small, fast model plus a deny-pattern check, kept off the expensive model entirely. PII redaction happens in C# at both ingress (before the model sees it) and egress (before we log or store the answer).
Result: the boundary blocks ~40 prompt-injection attempts per week, and zero cross-tenant retrievals have occurred since tenant_id enforcement moved from "in the query" to "in the token."
Cost and reliability: budgets, routing, and graceful failure
At 3,200 req/sec, a 2-cent query versus a 0.8-cent query is a $30k/month argument. And the Python service will go down; the only question is whether the user sees a 500 or a graceful degrade.
The C# client wraps the call in a Polly resilience pipeline (timeout + circuit breaker + fallback), and a budget gate refuses queries from a tenant that has blown its monthly AI spend. The Python service routes cheap tasks to a small model. Combined with caching and context compression, cost per query dropped from $0.021 to $0.008.
Observability and audit: trace every prompt, token, and citation
OpenTelemetry spans flow from the C# request through the HTTP boundary into the Python service and back, carrying the same trace id. Every AI response writes an immutable audit row: prompt hash, model, token counts, cost, and the exact citations. Mean time to answer "what did the AI cite for this response" went from "we can't" to under 30 seconds.
The closing mental model
The model is the cheapest, most replaceable part of an enterprise AI system. The eval gate, the security boundary, the cost router, and the audit trail are the product — and they're the parts you can actually be held accountable for.
- No prompt or model change merges without passing the eval gate.
- The boundary is the only door. Every AI request goes through governance or it doesn't go at all.
- If you can't trace it and audit it, it didn't happen.
👉 The full article — with all the C# (.NET 9) and Python code, the architecture diagram, the pre-ship checklist, and the "honest stuff" caveats — is on PrepStack:
Context Engineering for Enterprise AI, Part 4
Top comments (1)
Governance and cost usually get treated as separate problems, but on .NET I've found they collapse into the same lever: how much context you let into each call. Tightening retrieval to only what a request actually needs cut both my token bill and my safety surface at once, fewer chances to leak something that shouldn't be there. Are you handling the cost side at the prompt-assembly layer, or further up with caching/routing?