kirandeepjassal-crypto

Posted on Jun 23 • Originally published at prepstack.co.in

Context Engineering for Enterprise AI, Part 4: Enterprise AI Design — Governance, Cost & Safety

#ai #dotnet #architecture #machinelearning

Originally published on PrepStack. This is **Part 4 of 6** of *Context Engineering for Enterprise AI*.

Parts 1–3 gave us a context pipeline, a memory layer, and a multi-agent architecture. All real, all measurable — and all still a demo until you wrap them in what this part covers: governance, security, evaluation, observability, cost control, and reliability. That is the enterprise design that lets you ship AI to 110k paying users without losing sleep, money, or a compliance audit.

TL;DR

A context pipeline without governance is a liability, not a feature. The hard part of enterprise AI is not the model — it's the boundary around it.

Production metrics after the full enterprise design is in place:

Wrong-answer / hallucination rate: 18% (naive RAG) → 3%.
Faithfulness (groundedness) eval score: 0.96; answer-relevance: 0.91.
Eval gate threshold: any change dropping faithfulness below 0.90 is blocked in CI.
Prompt-injection attempts blocked at the boundary: ~40/week.
Cost per AI query: $0.021 → $0.008 (caching + model routing + context compression).
Context tokens per request: ~14,000 → ~3,500.
Agentic query p95: 4.2s → 1.8s.
The C# app API p95 stays at 120 ms: the AI work never bled into the product API.
Every AI response carries a trace id + an immutable audit row (prompt hash, tokens, cost, citations).

The one mental shift

Stop treating the model as the system. The model is one untrusted, non-deterministic dependency inside a system you do govern. Everything around it — eval gates, the security boundary, cost routing, tracing, audit — is the part you actually own, test, and are accountable for. Engineer that, and the model becomes swappable.

Evaluation gates: stop shipping prompts on vibes

A prompt change is a code change with a non-deterministic compiler. You'd never merge a refactor without tests; don't merge a system-prompt edit without an eval.

A golden set of ~200 curated (question, ideal-answer, must-cite-source) tuples lives in version control. Every prompt or model change runs the offline harness in CI, scoring faithfulness and answer-relevance. A change that drops faithfulness below 0.90 fails the build. We sit at 0.96 faithfulness, 0.91 relevance.
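As a rough illustration of the gate, here is a minimal Python sketch. It is not the harness from the full article. `GoldenCase`, `Scored`, `run_gate`, `answer_fn`, and `judge_fn` are hypothetical names, and treating a missing required citation as zero faithfulness is an assumption made for the example.

```python
from dataclasses import dataclass
from statistics import mean

FAITHFULNESS_FLOOR = 0.90  # gate threshold quoted in the article


@dataclass(frozen=True)
class GoldenCase:
    question: str
    ideal_answer: str
    must_cite: str  # id of the source the answer is required to cite


@dataclass(frozen=True)
class Scored:
    faithfulness: float
    relevance: float


def run_gate(cases, answer_fn, judge_fn, floor=FAITHFULNESS_FLOOR):
    """Score every golden case and decide whether the build may pass.

    answer_fn(question) -> (answer_text, cited_source_ids)  # hypothetical
    judge_fn(case, answer_text) -> Scored                    # hypothetical
    """
    scores = []
    for case in cases:
        answer, citations = answer_fn(case.question)
        scored = judge_fn(case, answer)
        # Assumption: an answer missing its required citation is ungrounded.
        if case.must_cite not in citations:
            scored = Scored(faithfulness=0.0, relevance=scored.relevance)
        scores.append(scored)
    faithfulness = mean(s.faithfulness for s in scores)
    relevance = mean(s.relevance for s in scores)
    return faithfulness >= floor, faithfulness, relevance
```

A CI step would call `run_gate` with the real pipeline and judge, then exit non-zero when the first returned value is `False`.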

Offline catches regressions; online catches drift. We sample ~2% of live traffic and run the same groundedness judge asynchronously (never on the hot path), alerting if rolling faithfulness dips.
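One way to sketch the online half is to sample by hashing the trace id and keep a rolling mean of judge scores. The ~2% rate comes from the article. The window size, the 0.90 alert line, and the names `is_sampled` and `RollingFaithfulness` are assumptions for illustration.

```python
import hashlib
from collections import deque

SAMPLE_RATE = 0.02  # ~2% of live traffic, as in the article


def is_sampled(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Pick a stable fraction of requests by hashing the trace id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


class RollingFaithfulness:
    """Rolling mean over the last `window` judged responses."""

    def __init__(self, window: int = 500, alert_below: float = 0.90):
        self.scores = deque(maxlen=window)  # old scores fall off automatically
        self.alert_below = alert_below

    def record(self, score: float) -> bool:
        """Add a score; return True when the rolling mean is below the alert line."""
        self.scores.append(score)
        return sum(self.scores) / len(self.scores) < self.alert_below
```

Hashing the trace id keeps the decision deterministic, so the same request is sampled (or not) wherever the check runs. The judge itself would run in a background worker that feeds `record`.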

Security: the boundary that says no

Every AI request passes through AiGovernanceMiddleware before it can reach the Python service. It enforces RBAC, stamps the authenticated tenant_id (never trusting a client-supplied one), redacts PII, and runs an injection classifier. Only a sanitized, scoped request crosses the HTTP boundary.

The injection classifier is cheap on the Python side — a small, fast model plus a deny-pattern check, kept off the expensive model entirely. PII redaction happens in C# at both ingress (before the model sees it) and egress (before we log or store the answer).
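The article's middleware is C#. The sketch below uses Python only to show the order of checks at the boundary. Every name in it (`Principal`, `ScopedRequest`, `govern`, the `ai.query` role) is hypothetical, the patterns are deliberately tiny, and the small classifier model mentioned above is left out, so only the deny-pattern half is shown.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only; real PII and injection detection needs far more.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
DENY_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (the )?system prompt", re.IGNORECASE),
]


@dataclass(frozen=True)
class Principal:
    tenant_id: str      # taken from the authenticated token
    roles: frozenset


@dataclass(frozen=True)
class ScopedRequest:
    tenant_id: str
    prompt: str


class Rejected(Exception):
    pass


def govern(principal: Principal, body: dict, required_role: str = "ai.query") -> ScopedRequest:
    """Apply RBAC, the deny-pattern check, and PII redaction, in that order."""
    if required_role not in principal.roles:
        raise Rejected("rbac")
    prompt = str(body.get("prompt", ""))
    if any(p.search(prompt) for p in DENY_PATTERNS):
        raise Rejected("injection")
    redacted = EMAIL.sub("[REDACTED_EMAIL]", prompt)
    # tenant_id comes from the principal; any body["tenant_id"] is ignored.
    return ScopedRequest(tenant_id=principal.tenant_id, prompt=redacted)
```

The point of returning a separate `ScopedRequest` type is that downstream code cannot receive the raw client body by accident.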

Result: the boundary blocks ~40 prompt-injection attempts per week, and zero cross-tenant retrievals have occurred since tenant_id enforcement moved from "in the query" to "in the token."

Cost and reliability: budgets, routing, and graceful failure

At 3,200 req/sec, a 2-cent query versus a 0.8-cent query is a $30k/month argument. And the Python service will go down; the only question is whether the user sees a 500 or a graceful degrade.

The C# client wraps the call in a Polly resilience pipeline (timeout + circuit breaker + fallback), and a budget gate refuses queries from a tenant that has blown its monthly AI spend. The Python service routes cheap tasks to a small model. Combined with caching and context compression, cost per query dropped from $0.021 to $0.008.
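The article does this with Polly in C#. What follows is not Polly's API, just a minimal Python sketch of the same two ideas: a budget gate and a circuit breaker with a fallback. `BudgetGate`, `CircuitBreaker`, and their parameters are hypothetical, and the timeout is assumed to be enforced inside the wrapped call (for example by the HTTP client).

```python
import time


class BudgetExceeded(Exception):
    pass


class BudgetGate:
    """Refuse AI queries for tenants that have spent their monthly allowance."""

    def __init__(self, monthly_limit_usd: dict):
        self.limit = monthly_limit_usd
        self.spent = {}

    def check(self, tenant_id: str) -> None:
        # Tenants without a configured limit are denied (limit defaults to 0).
        if self.spent.get(tenant_id, 0.0) >= self.limit.get(tenant_id, 0.0):
            raise BudgetExceeded(tenant_id)

    def charge(self, tenant_id: str, cost_usd: float) -> None:
        self.spent[tenant_id] = self.spent.get(tenant_id, 0.0) + cost_usd


class CircuitBreaker:
    """Open after `max_failures` consecutive errors; retry after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: skip the failing service entirely
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

In use, the caller runs `gate.check(tenant_id)` first, then `breaker.call(ask_ai, degraded_answer)`, so the user gets a degraded response instead of a 500 while the Python service is down.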

Observability and audit: trace every prompt, token, and citation

OpenTelemetry spans flow from the C# request through the HTTP boundary into the Python service and back, carrying the same trace id. Every AI response writes an immutable audit row: prompt hash, model, token counts, cost, and the exact citations. Mean time to answer "what did the AI cite for this response" went from "we can't" to under 30 seconds.
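A minimal sketch of what such an audit row might look like, using the fields listed above. The field names, `make_audit_row`, and `to_log_line` are assumptions. The frozen dataclass only prevents in-process mutation; storage-level immutability (for example an append-only table) is a separate concern not shown here.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class AuditRow:
    trace_id: str
    prompt_sha256: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    citations: tuple


def make_audit_row(trace_id, prompt, model, input_tokens, output_tokens,
                   cost_usd, citations):
    """Build the row, storing a hash of the prompt rather than the raw text."""
    prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return AuditRow(trace_id, prompt_hash, model, input_tokens,
                    output_tokens, cost_usd, tuple(citations))


def to_log_line(row: AuditRow) -> str:
    """Serialize with sorted keys so identical rows produce identical lines."""
    return json.dumps(asdict(row), sort_keys=True)
```

Keying the row by `trace_id` is what makes the "what did the AI cite for this response" lookup a single query.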

The closing mental model

The model is the cheapest, most replaceable part of an enterprise AI system. The eval gate, the security boundary, the cost router, and the audit trail are the product — and they're the parts you can actually be held accountable for.

No prompt or model change merges without passing the eval gate.
The boundary is the only door. Every AI request goes through governance or it doesn't go at all.
If you can't trace it and audit it, it didn't happen.

👉 The full article — with all the C# (.NET 9) and Python code, the architecture diagram, the pre-ship checklist, and the "honest stuff" caveats — is on PrepStack:
Context Engineering for Enterprise AI, Part 4

Top comments (3)

Vasyl • Jun 23

Governance and cost usually get treated as separate problems, but on .NET I've found they collapse into the same lever: how much context you let into each call. Tightening retrieval to only what a request actually needs cut both my token bill and my safety surface at once: fewer chances to leak something that shouldn't be there. Are you handling the cost side at the prompt-assembly layer, or further up with caching/routing?

kirandeepjassal-crypto • Jun 29

Spot on, Vasyl — and I'd push that framing one step further: caching and routing cut cost, but they don't shrink the safety surface. Only the assembly layer cuts both at once, so that's where I start.
It ends up layered, in priority order:

Assembly first. Retrieval is scoped to the request and compressed before anything is sent — that's the ~14k → ~3.5k tokens/request drop in the post. Fewer tokens means a lower bill and fewer chances to smuggle something in: exactly the collapse you're describing.
Routing next. Once context is minimal, cheap tasks go to a small model; only the hard ones hit the expensive one.
Caching on top. A semantic cache serves repeat and near-repeat queries, so an already-assembled context never gets rebuilt.

The piece I keep upstream of all three is scoping itself — RBAC, tenant_id from the token, PII redaction in the governance middleware — so "only what the request needs" is enforced before assembly runs, not as a side effect of it.
How are you handling cache invalidation when the underlying docs change? That's the part that fights the cost win for me.

Vasyl • Jul 6

For me it's content-hash versioning, not TTL. Each doc gets a hash at ingestion, chunk cache keys include it, so an updated doc just misses the cache naturally and only changed chunks get re-embedded. The semantic cache is the annoying part: answers derived from a stale doc don't know their source changed. I ended up storing source doc ids next to each cached answer and evicting by doc id on re-ingestion. Crude but debuggable.
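The scheme described in this comment could be sketched roughly as follows. This is an illustration, not Vasyl's code: `content_hash`, `chunk_cache_key`, and `AnswerCache` are hypothetical names, and an exact-match `query_key` stands in for the similarity lookup a real semantic cache would do.

```python
import hashlib
from collections import defaultdict


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunk_cache_key(doc_id: str, doc_hash: str, chunk_index: int) -> str:
    # The doc hash is part of the key, so an updated doc misses the cache.
    return f"{doc_id}:{doc_hash}:{chunk_index}"


class AnswerCache:
    """Cached answers indexed by the source doc ids they were derived from."""

    def __init__(self):
        self.answers = {}               # query key -> cached answer
        self.by_doc = defaultdict(set)  # doc id -> query keys derived from it

    def put(self, query_key, answer, source_doc_ids):
        self.answers[query_key] = answer
        for doc_id in source_doc_ids:
            self.by_doc[doc_id].add(query_key)

    def get(self, query_key):
        return self.answers.get(query_key)

    def evict_doc(self, doc_id):
        """Drop every cached answer derived from `doc_id`; call on re-ingestion."""
        removed = 0
        for key in self.by_doc.pop(doc_id, set()):
            if key in self.answers:
                del self.answers[key]
                removed += 1
        return removed
```

With the doc hash in the chunk key, no TTL is needed for chunks. The answer cache still needs the explicit `evict_doc` call, because a cached answer carries no hash of its own.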