Building a Governed AI Platform: What Enterprise RAG Actually Looks Like in Production
Every RAG tutorial follows the same script: chunk your documents, embed them, stuff them into a vector database, retrieve the top-k results, and feed them to an LLM. Congratulations, you have a demo.
Now try deploying that to an organization where 400 users across 12 departments need access to different document collections, where some queries touch controlled-unclassified information, where your CFO wants to know exactly which model processed which request and why, and where a prompt injection in one user's session absolutely cannot leak another user's retrieval context.
That tutorial didn't prepare you for any of this.
I've spent the last several years building and operating a multi-LLM RAG platform in production. What I've learned is that the hard part of enterprise AI isn't retrieval or generation. It's governance -- the configuration layer that determines who can do what, with which model, against which data, under what constraints. This article is about that layer.
The Governance Layer: Configuration as a First-Class Concern
Most AI platforms treat configuration as an afterthought -- environment variables, maybe a YAML file. In production, your governance configuration is the product. It's the thing that makes a platform trustworthy enough to deploy.
Think of it as a control plane that sits between three actors:
- Operators (your platform admins) who define policies
- Users who interact through constrained interfaces
- Agents (LLMs + tools) who execute within boundaries set by operators on behalf of users
The governance layer answers questions like:
- Which models is this user group authorized to use?
- What retrieval collections can this role access?
- What guardrails apply to this conversation context?
- What happens when a policy violation is detected -- block, redact, or flag?
This isn't application logic. It's a declarative policy layer that operators configure and the platform enforces. Every request flows through it. No exceptions.
If you're designing an AI platform and you don't have a clean separation between "what the platform can do" and "what this specific user is allowed to do in this context," stop and fix that first. Everything else builds on it.
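That separation can be made concrete as two distinct layers that must both agree before anything executes. The sketch below is illustrative (the type and function names are hypothetical, not this platform's actual API):

```typescript
// Sketch: platform capability vs. per-user, per-context grant.
// Names (Capability, ContextGrant, isAllowed) are illustrative.
type Capability = 'summarize' | 'code_gen' | 'retrieval' | 'tool_use';

interface PlatformConfig {
  enabledCapabilities: Set<Capability>; // what the platform can do
}

interface ContextGrant {
  userId: string;
  workspaceId: string;
  allowedCapabilities: Set<Capability>; // what THIS user may do HERE
}

// A request is allowed only when both layers permit it.
function isAllowed(
  platform: PlatformConfig,
  grant: ContextGrant,
  requested: Capability
): boolean {
  return (
    platform.enabledCapabilities.has(requested) &&
    grant.allowedCapabilities.has(requested)
  );
}
```

The point of the two types is that enabling a capability platform-wide never implicitly grants it to anyone; every request still needs a matching per-context grant.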
Multi-LLM Orchestration in Practice
Running one model is easy. Running 16 is an engineering problem that has almost nothing to do with the models themselves and everything to do with routing policy.
In production, model selection isn't a user choice. It's a policy decision based on:
- Task classification: Is this summarization, extraction, code generation, or analysis?
- Data sensitivity: Does the input or expected output contain controlled information?
- Cost envelope: What's the budget allocation for this department/project?
- Performance requirements: Is this synchronous (user waiting) or async (batch pipeline)?
- Capability matching: Does the task require tool calling, vision, structured output?
Here's a simplified version of the routing pattern we use:
```typescript
interface RoutingPolicy {
  taskType: string;
  sensitivityLevel: 'public' | 'internal' | 'controlled';
  maxLatencyMs: number;
  requiresToolCalling: boolean;
  costTier: 'standard' | 'premium';
}

interface ModelEndpoint {
  provider: string;
  model: string;
  capabilities: string[];
  costPerMillionTokens: number;
  avgLatencyMs: number;
  sensitivityClearance: string[];
}

function resolveModel(
  policy: RoutingPolicy,
  endpoints: ModelEndpoint[],
  userAuthorization: UserAuth
): ModelEndpoint {
  const eligible = endpoints
    .filter(e => e.sensitivityClearance.includes(policy.sensitivityLevel))
    .filter(e => e.avgLatencyMs <= policy.maxLatencyMs)
    .filter(e => !policy.requiresToolCalling || e.capabilities.includes('tool_calling'))
    .filter(e => userAuthorization.allowedProviders.includes(e.provider));

  if (eligible.length === 0) {
    throw new RoutingError('No eligible model for policy constraints', { policy });
  }

  // Within eligible set, optimize for cost unless premium tier
  return policy.costTier === 'premium'
    ? [...eligible].sort((a, b) => a.avgLatencyMs - b.avgLatencyMs)[0]
    : [...eligible].sort((a, b) => a.costPerMillionTokens - b.costPerMillionTokens)[0];
}
```
The critical detail: userAuthorization.allowedProviders. The model a user gets isn't just about capability matching. It's about what their role is authorized to use. An analyst might be restricted to a specific provider for data residency reasons. A developer might have access to bleeding-edge models that aren't yet approved for production workflows.
This routing happens on every single request. It's not a settings page -- it's a policy engine.
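Upstream of resolution, something has to produce the routing policy in the first place. The sketch below shows one minimal way that could look; the keyword heuristic is purely illustrative (a production classifier would likely be model-based), and classifyRequest is a hypothetical name:

```typescript
interface RoutingPolicy {
  taskType: string;
  sensitivityLevel: 'public' | 'internal' | 'controlled';
  maxLatencyMs: number;
  requiresToolCalling: boolean;
  costTier: 'standard' | 'premium';
}

// Illustrative heuristic: map a raw request to a routing policy.
function classifyRequest(text: string, containsControlledData: boolean): RoutingPolicy {
  const wantsCode = /\b(function|class|refactor|implement)\b/i.test(text);
  return {
    taskType: wantsCode ? 'code_generation' : 'analysis',
    sensitivityLevel: containsControlledData ? 'controlled' : 'internal',
    maxLatencyMs: 5000,             // synchronous, user-waiting default
    requiresToolCalling: wantsCode, // e.g. code tasks may need a sandboxed runner
    costTier: 'standard',
  };
}
```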
RAG Authorization: The Question You Should Actually Be Asking
Most RAG implementations have a retrieval step that looks like this:
```typescript
const results = await vectorDB.query(embedding, { topK: 10 });
```
This is a security hole. The query runs with the platform's credentials, returning whatever is semantically similar. The user's authorization is nowhere in the picture.
The correct question isn't "can the agent access this document?" The agent is a process -- it doesn't have clearance. The question is: "Can THIS user retrieve from THIS collection given their current role and context?"
Document-level RBAC in a RAG system means:
- Collections are authorization boundaries. Every document lives in a collection. Collections have access policies.
- Queries are scoped at retrieval time. The vector search is filtered to only collections the user is authorized to access.
- Authorization is evaluated per-request, not cached. Role changes propagate immediately.
```typescript
async function authorizedRetrieval(
  query: string,
  user: AuthenticatedUser,
  conversationContext: ConversationMeta
): Promise<RetrievalResult[]> {
  // Resolve which collections this user can access
  const authorizedCollections = await rbac.resolveCollections(
    user.roles,
    user.department,
    conversationContext.classification
  );

  if (authorizedCollections.length === 0) {
    return []; // No retrieval, model runs on its own knowledge
  }

  const embedding = await embed(query);

  // Query ONLY authorized collections
  const results = await vectorDB.query(embedding, {
    topK: 10,
    filter: {
      collection: { $in: authorizedCollections.map(c => c.id) }
    }
  });

  // Log what was retrieved and under what authorization
  await audit.log({
    event: 'retrieval',
    userId: user.id,
    collections: authorizedCollections.map(c => c.id),
    resultCount: results.length,
    conversationId: conversationContext.id
  });

  return results;
}
```
The filter on the vector query is the entire security model. Get this wrong and you have a platform where users can semantically search documents they shouldn't be able to read. Get it right and you have per-user, per-role, per-context retrieval authorization that's invisible to the end user.
One more thing: the conversationContext.classification parameter. The same user might have different retrieval access depending on which workspace or project they're operating in. Authorization isn't just about who you are. It's about what you're doing right now.
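A minimal sketch of what that context-dependent resolution could look like (the rule shape and names here are illustrative; a real system would load rules from the governance store rather than pass them in):

```typescript
// Sketch: per-request collection resolution, scoped by role AND context.
interface CollectionRule {
  collectionId: string;
  allowedRoles: string[];
  classification: 'public' | 'internal' | 'controlled';
}

const CLASS_ORDER = ['public', 'internal', 'controlled'] as const;

function resolveCollections(
  roles: string[],
  contextClassification: (typeof CLASS_ORDER)[number],
  rules: CollectionRule[]
): string[] {
  const ctxLevel = CLASS_ORDER.indexOf(contextClassification);
  return rules
    // The user must hold at least one role the collection admits...
    .filter(r => r.allowedRoles.some(role => roles.includes(role)))
    // ...AND the current context must be cleared at or above the
    // collection's classification. Same user, different context,
    // different retrieval scope.
    .filter(r => CLASS_ORDER.indexOf(r.classification) <= ctxLevel)
    .map(r => r.collectionId);
}
```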
Guardrails as Authorization Constructs
PII detection, prompt injection prevention, and content moderation are usually filed under "AI safety." That framing is correct but incomplete. In a governed platform, guardrails are authorization mechanisms.
Consider PII detection. When a user's input contains a Social Security number, you have options: block the request, redact the PII before it reaches the model, or allow it with enhanced logging. That's not a safety decision -- it's a policy decision that varies by user role, task context, and data classification.
Our guardrail engine runs as a pipeline of configurable stages:
```typescript
interface GuardrailStage {
  name: string;
  evaluate: (input: string, context: RequestContext) => Promise<GuardrailResult>;
}

interface GuardrailResult {
  action: 'allow' | 'redact' | 'block' | 'flag';
  detections: Detection[];
  redactedContent?: string;
}

// Guardrail policies are configured per-workspace, not hardcoded
const workspacePolicy = await governance.getGuardrailPolicy(workspace.id);

for (const stage of workspacePolicy.stages) {
  const result = await stage.evaluate(input, requestContext);

  if (result.action === 'block') {
    await audit.log({ event: 'guardrail_block', stage: stage.name, ...result });
    throw new PolicyViolationError(stage.name, result.detections);
  }

  if (result.action === 'redact') {
    // redactedContent is optional; keep the original input if a stage
    // reports 'redact' without supplying redacted content
    input = result.redactedContent ?? input;
    await audit.log({ event: 'guardrail_redact', stage: stage.name, ...result });
  }
}
```
The key design decision: guardrail policies are operator-configured, not developer-hardcoded. One workspace might block all PII. Another might allow SSNs but redact them from logs. A third might allow everything but flag it for review. These are governance choices made by the people accountable for the data, not by the engineering team at deploy time.
Prompt injection detection follows the same pattern. It's not just "is this input malicious?" It's "given this user's role and this workspace's policy, what's the appropriate response to a suspicious input?" Sometimes that's a hard block. Sometimes it's routing to a model with more conservative system prompts. The guardrail framework doesn't decide -- the policy does.
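As a concrete illustration, here is what one PII stage could look like when the action comes from workspace policy rather than from the detector. The names and the SSN regex are illustrative, not the platform's actual implementation:

```typescript
// Sketch: the detector finds; the POLICY decides what happens.
type Action = 'allow' | 'redact' | 'block' | 'flag';

interface PiiPolicy {
  onSsn: Action; // operator-configured per workspace
}

interface StageResult {
  action: Action;
  redactedContent?: string;
}

const SSN_PATTERN = /\b\d{3}-\d{2}-\d{4}\b/g;

function ssnStage(input: string, policy: PiiPolicy): StageResult {
  if (!SSN_PATTERN.test(input)) return { action: 'allow' };
  SSN_PATTERN.lastIndex = 0; // reset: .test() on a /g/ regex advances lastIndex
  return policy.onSsn === 'redact'
    ? { action: 'redact', redactedContent: input.replace(SSN_PATTERN, '[SSN]') }
    : { action: policy.onSsn };
}
```

The same detection produces a block in one workspace and a redaction in another, purely as a function of the policy object passed in.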
Audit Everything (Especially Configuration Changes)
Every enterprise AI platform logs requests. That's table stakes. What separates a governed platform is auditing configuration changes with the same rigor.
When an operator changes a model routing policy, modifies a guardrail configuration, updates collection access permissions, or adjusts a role's authorized capabilities -- those events matter more than any individual request log. A single configuration change affects every subsequent request.
We track 60+ distinct audit event types. The most important ones aren't user_query or model_response. They're:
- policy_modified -- who changed what routing or guardrail policy, when, and what the previous value was
- collection_access_granted / collection_access_revoked -- changes to retrieval authorization
- model_endpoint_added / model_endpoint_removed -- changes to available models
- guardrail_stage_modified -- changes to the guardrail pipeline
- role_permission_changed -- RBAC modifications
Every configuration change is immutable, timestamped, and attributed to a specific operator. You can reconstruct the exact governance state at any point in time. When someone asks "why did the platform behave this way on Tuesday?" you can answer with precision.
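Reconstructing the governance state at a point in time reduces to replaying configuration events up to that instant. A minimal sketch (the event shape is illustrative; stateAt is a hypothetical name):

```typescript
// Sketch: point-in-time reconstruction from immutable config-change events.
interface PolicyEvent {
  timestamp: number; // epoch millis
  operator: string;
  key: string;       // e.g. 'guardrails.pii'
  newValue: unknown;
}

function stateAt(events: PolicyEvent[], atMs: number): Record<string, unknown> {
  const state: Record<string, unknown> = {};
  // Replay in chronological order, stopping at the requested instant;
  // later writes to the same key overwrite earlier ones.
  for (const e of [...events].sort((a, b) => a.timestamp - b.timestamp)) {
    if (e.timestamp > atMs) break;
    state[e.key] = e.newValue;
  }
  return state;
}
```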
What I Wish I Knew Before Building This
Governance configuration will change more often than your code. Design for it. Make policies hot-reloadable. Don't require deployments for policy changes.
Multi-tenancy is a governance problem, not an infrastructure problem. You can run everything on the same cluster. The isolation happens in the policy layer.
Your audit log is your most valuable table. Invest in making it queryable, exportable, and tamper-evident early. You will be asked to produce audit reports, and "let me write a script" is not an acceptable answer.
Guardrails need escape hatches. Not every detection is accurate. Build operator-controlled override mechanisms with enhanced logging, rather than hard blocks with no recourse.
Model routing will be political. Different stakeholders will have opinions about which models their teams should use. Make it easy to express those preferences as policies rather than arguments.
Start with four roles, not fourteen. Admin, operator, analyst, viewer. You can always add granularity. You can't easily remove it.
The hardest bugs are policy bugs. When a user can't access something they should be able to, or can access something they shouldn't, the issue is almost never in your code. It's in the governance configuration. Build tooling to visualize and debug policy resolution.
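The tamper-evidence point above can be sketched as a hash chain, where each audit entry commits to the previous entry's hash so any retroactive edit breaks verification. This is one common approach, shown with illustrative names; a real deployment would also need an append-only store:

```typescript
import { createHash } from 'crypto';

interface AuditEntry {
  payload: string;
  prevHash: string;
  hash: string;
}

// Append an entry whose hash covers the previous entry's hash.
function appendEntry(log: AuditEntry[], payload: string): AuditEntry[] {
  const prevHash = log.length ? log[log.length - 1].hash : 'GENESIS';
  const hash = createHash('sha256').update(prevHash + payload).digest('hex');
  return [...log, { payload, prevHash, hash }];
}

// Recompute every link; any edited payload breaks the chain from there on.
function verifyChain(log: AuditEntry[]): boolean {
  return log.every((e, i) => {
    const prevHash = i === 0 ? 'GENESIS' : log[i - 1].hash;
    const expected = createHash('sha256').update(prevHash + e.payload).digest('hex');
    return e.prevHash === prevHash && e.hash === expected;
  });
}
```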
Enterprise RAG isn't a retrieval problem. It's a governance problem with a retrieval component. Get the governance layer right, and the retrieval and generation parts are almost straightforward. Get it wrong, and no amount of prompt engineering or embedding optimization will make your platform trustworthy enough to deploy where it matters.
Jamie Thompson is CEO of Sprinklenet, an AI technology company building governed AI platforms for government and enterprise. Learn more at sprinklenet.com.