<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jamie Thompson</title>
    <description>The latest articles on DEV Community by Jamie Thompson (@jamie_thompson).</description>
    <link>https://dev.to/jamie_thompson</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3815871%2F7abe639c-b35c-4b3a-8461-014b025b8254.jpg</url>
      <title>DEV Community: Jamie Thompson</title>
      <link>https://dev.to/jamie_thompson</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jamie_thompson"/>
    <language>en</language>
    <item>
      <title>Building a Governed AI Platform: What Enterprise RAG Actually Looks Like in Production</title>
      <dc:creator>Jamie Thompson</dc:creator>
      <pubDate>Mon, 30 Mar 2026 02:32:13 +0000</pubDate>
      <link>https://dev.to/jamie_thompson/building-a-governed-ai-platform-what-enterprise-rag-actually-looks-like-in-production-2nk9</link>
      <guid>https://dev.to/jamie_thompson/building-a-governed-ai-platform-what-enterprise-rag-actually-looks-like-in-production-2nk9</guid>
      <description>&lt;h1&gt;
  
  
  Building a Governed AI Platform: What Enterprise RAG Actually Looks Like in Production
&lt;/h1&gt;

&lt;p&gt;Every RAG tutorial follows the same script: chunk your documents, embed them, stuff them into a vector database, retrieve the top-k results, and feed them to an LLM. Congratulations, you have a demo.&lt;/p&gt;

&lt;p&gt;Now try deploying that to an organization where 400 users across 12 departments need access to different document collections, where some queries touch controlled-unclassified information, where your CFO wants to know exactly which model processed which request and why, and where a prompt injection in one user's session absolutely cannot leak another user's retrieval context.&lt;/p&gt;

&lt;p&gt;That tutorial didn't prepare you for any of this.&lt;/p&gt;

&lt;p&gt;I've spent the last several years building and operating a multi-LLM RAG platform in production. What I've learned is that the hard part of enterprise AI isn't retrieval or generation. It's governance -- the configuration layer that determines who can do what, with which model, against which data, under what constraints. This article is about that layer.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fivzpyf2ro4bvyz1hja3u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fivzpyf2ro4bvyz1hja3u.png" alt="Governance layer architecture" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;The Governance Layer: Configuration as a First-Class Concern&lt;/h2&gt;

&lt;p&gt;Most AI platforms treat configuration as an afterthought -- environment variables, maybe a YAML file. In production, your governance configuration &lt;em&gt;is&lt;/em&gt; the product. It's the thing that makes a platform trustworthy enough to deploy.&lt;/p&gt;

&lt;p&gt;Think of it as a control plane that sits between three actors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operators&lt;/strong&gt; (your platform admins) who define policies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Users&lt;/strong&gt; who interact through constrained interfaces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents&lt;/strong&gt; (LLMs + tools) who execute on behalf of users, within boundaries set by operators&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The governance layer answers questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which models is this user group authorized to use?&lt;/li&gt;
&lt;li&gt;What retrieval collections can this role access?&lt;/li&gt;
&lt;li&gt;What guardrails apply to this conversation context?&lt;/li&gt;
&lt;li&gt;What happens when a policy violation is detected -- block, redact, or flag?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't application logic. It's a declarative policy layer that operators configure and the platform enforces. Every request flows through it. No exceptions.&lt;/p&gt;
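&lt;p&gt;As a sketch of what that declarative layer can look like (the shape and field names here are illustrative, not a real schema):&lt;/p&gt;

```typescript
// Hypothetical shape for an operator-authored policy document.
// Field names are illustrative -- the point is that policy is data
// the platform enforces, not application code.
interface GroupPolicy {
  group: string;                // role or department this policy applies to
  allowedModels: string[];      // which models the group may invoke
  allowedCollections: string[]; // retrieval collections in scope
  onPiiDetection: 'block' | 'redact' | 'flag'; // violation handling
}

// The platform evaluates every request against the resolved policy.
function isModelAllowed(policy: GroupPolicy, model: string): boolean {
  return policy.allowedModels.includes(model);
}

const analystPolicy: GroupPolicy = {
  group: 'finance-analyst',
  allowedModels: ['provider-a/standard'],
  allowedCollections: ['finance-reports'],
  onPiiDetection: 'redact',
};
```

&lt;p&gt;An operator edits the policy document; tightening or loosening access involves no code changes and no redeploy.&lt;/p&gt;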

&lt;p&gt;If you're designing an AI platform and you don't have a clean separation between "what the platform &lt;em&gt;can&lt;/em&gt; do" and "what this specific user is &lt;em&gt;allowed&lt;/em&gt; to do in this context," stop and fix that first. Everything else builds on it.&lt;/p&gt;




&lt;h2&gt;Multi-LLM Orchestration in Practice&lt;/h2&gt;

&lt;p&gt;Running one model is easy. Running 16 is an engineering problem that has almost nothing to do with the models themselves and everything to do with routing policy.&lt;/p&gt;

&lt;p&gt;In production, model selection isn't a user choice. It's a policy decision based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Task classification&lt;/strong&gt;: Is this summarization, extraction, code generation, or analysis?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data sensitivity&lt;/strong&gt;: Does the input or expected output contain controlled information?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost envelope&lt;/strong&gt;: What's the budget allocation for this department/project?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance requirements&lt;/strong&gt;: Is this synchronous (user waiting) or async (batch pipeline)?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability matching&lt;/strong&gt;: Does the task require tool calling, vision, structured output?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's a simplified version of the routing pattern we use:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;RoutingPolicy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;taskType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;sensitivityLevel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;public&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;internal&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;controlled&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;maxLatencyMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;requiresToolCalling&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;costTier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;standard&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;premium&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;ModelEndpoint&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;capabilities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="nl"&gt;costPerMillionTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;avgLatencyMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;sensitivityClearance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;resolveModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;RoutingPolicy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;endpoints&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ModelEndpoint&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
  &lt;span class="nx"&gt;userAuthorization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;UserAuth&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;ModelEndpoint&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;eligible&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;endpoints&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sensitivityClearance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sensitivityLevel&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;avgLatencyMs&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="nx"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxLatencyMs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;requiresToolCalling&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;capabilities&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tool_calling&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;userAuthorization&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;allowedProviders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;eligible&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;RoutingError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;No eligible model for policy constraints&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;policy&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Within eligible set, optimize for cost unless premium tier&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;costTier&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;premium&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;eligible&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;avgLatencyMs&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;avgLatencyMs&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;eligible&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;costPerMillionTokens&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;costPerMillionTokens&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical detail: &lt;code&gt;userAuthorization.allowedProviders&lt;/code&gt;. The model a user gets isn't just about capability matching. It's about what their role is authorized to use. An analyst might be restricted to a specific provider for data residency reasons. A developer might have access to bleeding-edge models that aren't yet approved for production workflows.&lt;/p&gt;

&lt;p&gt;This routing happens on every single request. It's not a settings page -- it's a policy engine.&lt;/p&gt;
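&lt;p&gt;To make the fail-closed behavior concrete, here is a self-contained miniature of the eligibility filtering (provider names and numbers are invented): the cheapest endpoint loses when authorization or latency rules exclude it.&lt;/p&gt;

```typescript
// Miniature of the routing filter chain; values are invented.
interface Endpoint {
  provider: string;
  model: string;
  costPerMillionTokens: number;
  avgLatencyMs: number;
}

const endpoints: Endpoint[] = [
  { provider: 'a', model: 'a-fast', costPerMillionTokens: 2.0, avgLatencyMs: 300 },
  { provider: 'b', model: 'b-cheap', costPerMillionTokens: 0.5, avgLatencyMs: 900 },
];

// Policy for this request: only provider 'a' is authorized for this user,
// and the user is waiting, so the latency budget is 500ms.
const allowedProviders = ['a'];
const maxLatencyMs = 500;

const eligible = endpoints
  .filter(e => allowedProviders.includes(e.provider))
  .filter(e => maxLatencyMs >= e.avgLatencyMs)
  .sort((a, b) => a.costPerMillionTokens - b.costPerMillionTokens);

// 'b-cheap' costs less, but it never survives the eligibility filters.
const chosen = eligible[0];
```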




&lt;h2&gt;RAG Authorization: The Question You Should Actually Be Asking&lt;/h2&gt;

&lt;p&gt;Most RAG implementations have a retrieval step that looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;vectorDB&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;topK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a security hole. The query runs with the platform's credentials, returning whatever is semantically similar. The user's authorization is nowhere in the picture.&lt;/p&gt;

&lt;p&gt;The correct question isn't "can the agent access this document?" The agent is a process -- it doesn't have clearance. The question is: &lt;strong&gt;"Can THIS user retrieve from THIS collection given their current role and context?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Document-level RBAC in a RAG system means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Collections are authorization boundaries.&lt;/strong&gt; Every document lives in a collection. Collections have access policies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queries are scoped at retrieval time.&lt;/strong&gt; The vector search is filtered to only collections the user is authorized to access.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authorization is evaluated per-request&lt;/strong&gt;, not cached. Role changes propagate immediately.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;authorizedRetrieval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AuthenticatedUser&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;conversationContext&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ConversationMeta&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;RetrievalResult&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Resolve which collections this user can access&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;authorizedCollections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;rbac&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolveCollections&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;roles&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;conversationContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;classification&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;authorizedCollections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt; &lt;span class="c1"&gt;// No retrieval, model runs on its own knowledge&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Query ONLY authorized collections&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;vectorDB&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;topK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$in&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;authorizedCollections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Log what was retrieved and under what authorization&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;audit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;retrieval&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;collections&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;authorizedCollections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;resultCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;conversationId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;conversationContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;filter&lt;/code&gt; on the vector query is the entire security model. Get this wrong and you have a platform where users can semantically search documents they shouldn't be able to read. Get it right and you have per-user, per-role, per-context retrieval authorization that's invisible to the end user.&lt;/p&gt;

&lt;p&gt;One more thing: the &lt;code&gt;conversationContext.classification&lt;/code&gt; parameter. The &lt;em&gt;same user&lt;/em&gt; might have different retrieval access depending on which workspace or project they're operating in. Authorization isn't just about who you are. It's about what you're doing right now.&lt;/p&gt;
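&lt;p&gt;One way to sketch that context dependence (the types and names here are hypothetical, not our actual RBAC API): the same role resolves to different collections under different workspace classifications.&lt;/p&gt;

```typescript
// Hypothetical sketch: retrieval scope depends on both the user's role
// and the classification of the current conversation context.
type Classification = 'public' | 'internal' | 'controlled';

interface Collection {
  id: string;
  requiredRole: string;
  classification: Classification;
}

const rank: { [c: string]: number } = { public: 0, internal: 1, controlled: 2 };

function resolveCollections(
  roles: string[],
  contextClassification: Classification,
  collections: Collection[]
): Collection[] {
  return collections
    .filter(c => roles.includes(c.requiredRole))
    // a collection is visible only in contexts cleared at or above its level
    .filter(c => rank[contextClassification] >= rank[c.classification]);
}
```

&lt;p&gt;An analyst in a public workspace retrieves nothing from a controlled collection; the same analyst in a controlled workspace does.&lt;/p&gt;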




&lt;h2&gt;Guardrails as Authorization Constructs&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikh2oxkfysea5jf2p70s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikh2oxkfysea5jf2p70s.png" alt="Guardrail pipeline" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;PII detection, prompt injection prevention, and content moderation are usually filed under "AI safety." That framing is correct but incomplete. In a governed platform, guardrails are authorization mechanisms.&lt;/p&gt;

&lt;p&gt;Consider PII detection. When a user's input contains a Social Security number, you have options: block the request, redact the PII before it reaches the model, or allow it with enhanced logging. That's not a safety decision -- it's a policy decision that varies by user role, task context, and data classification.&lt;/p&gt;

&lt;p&gt;Our guardrail engine runs as a pipeline of configurable stages:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;GuardrailStage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;RequestContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;GuardrailResult&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;GuardrailResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;allow&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;redact&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;block&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;flag&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;detections&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Detection&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="nl"&gt;redactedContent&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Guardrail policies are configured per-workspace, not hardcoded&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;workspacePolicy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;governance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getGuardrailPolicy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;workspace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stage&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;workspacePolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;requestContext&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;block&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;audit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;guardrail_block&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PolicyViolationError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;detections&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;redact&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;redactedContent&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;audit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;guardrail_redact&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key design decision: guardrail policies are &lt;strong&gt;operator-configured, not developer-hardcoded&lt;/strong&gt;. One workspace might block all PII. Another might allow SSNs but redact them from logs. A third might allow everything but flag it for review. These are governance choices made by the people accountable for the data, not by the engineering team at deploy time.&lt;/p&gt;

&lt;p&gt;Prompt injection detection follows the same pattern. It's not just "is this input malicious?" It's "given this user's role and this workspace's policy, what's the appropriate response to a suspicious input?" Sometimes that's a hard block. Sometimes it's routing to a model with more conservative system prompts. The guardrail framework doesn't decide -- the policy does.&lt;/p&gt;
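
&lt;p&gt;A minimal sketch of what policy-driven dispatch can look like (the &lt;code&gt;resolveAction&lt;/code&gt; helper, the policy shape, and the category names are all hypothetical, not our actual schema):&lt;/p&gt;

```javascript
// Hypothetical workspace policy: the operator chooses the action per
// detection category; nothing here is hardcoded by the engineering team.
const workspacePolicy = {
  pii_ssn: 'redact',        // allow SSNs but strip them from logs
  pii_email: 'allow',
  prompt_injection: 'block',
  default: 'flag',          // anything unlisted goes to human review
};

// Resolve what the guardrail should do for a given detection category.
function resolveAction(policy, category) {
  if (policy[category] !== undefined) {
    return policy[category];
  }
  return policy.default;
}

console.log(resolveAction(workspacePolicy, 'pii_ssn'));           // 'redact'
console.log(resolveAction(workspacePolicy, 'jailbreak_attempt')); // 'flag'
```

&lt;p&gt;The point is the indirection: engineering ships the mechanism, and the workspace operator owns the mapping from detection to action.&lt;/p&gt;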




&lt;h2&gt;
  
  
  Audit Everything (Especially Configuration Changes)
&lt;/h2&gt;

&lt;p&gt;Every enterprise AI platform logs requests. That's table stakes. What separates a governed platform is auditing configuration changes with the same rigor.&lt;/p&gt;

&lt;p&gt;When an operator changes a model routing policy, modifies a guardrail configuration, updates collection access permissions, or adjusts a role's authorized capabilities -- those events matter more than any individual request log. A single configuration change affects every subsequent request.&lt;/p&gt;

&lt;p&gt;We track 60+ distinct audit event types. The most important ones aren't &lt;code&gt;user_query&lt;/code&gt; or &lt;code&gt;model_response&lt;/code&gt;. They're:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;policy_modified&lt;/code&gt; -- who changed what routing or guardrail policy, when, and what the previous value was&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;collection_access_granted&lt;/code&gt; / &lt;code&gt;revoked&lt;/code&gt; -- changes to retrieval authorization&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;model_endpoint_added&lt;/code&gt; / &lt;code&gt;removed&lt;/code&gt; -- changes to available models&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;guardrail_stage_modified&lt;/code&gt; -- changes to the guardrail pipeline&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;role_permission_changed&lt;/code&gt; -- RBAC modifications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every configuration change is recorded as an immutable, timestamped entry attributed to a specific operator. You can reconstruct the exact governance state at any point in time. When someone asks "why did the platform behave this way on Tuesday?" you can answer with precision.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Wish I Knew Before Building This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Governance configuration will change more often than your code.&lt;/strong&gt; Design for it. Make policies hot-reloadable. Don't require deployments for policy changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-tenancy is a governance problem, not an infrastructure problem.&lt;/strong&gt; You can run everything on the same cluster. The isolation happens in the policy layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your audit log is your most valuable table.&lt;/strong&gt; Invest in making it queryable, exportable, and tamper-evident early. You will be asked to produce audit reports, and "let me write a script" is not an acceptable answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guardrails need escape hatches.&lt;/strong&gt; Not every detection is accurate. Build operator-controlled override mechanisms with enhanced logging, rather than hard blocks with no recourse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model routing will be political.&lt;/strong&gt; Different stakeholders will have opinions about which models their teams should use. Make it easy to express those preferences as policies rather than arguments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with four roles, not fourteen.&lt;/strong&gt; Admin, operator, analyst, viewer. You can always add granularity. You can't easily remove it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hardest bugs are policy bugs.&lt;/strong&gt; When a user can't access something they should be able to, or can access something they shouldn't, the issue is almost never in your code. It's in the governance configuration. Build tooling to visualize and debug policy resolution.&lt;/p&gt;
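
&lt;p&gt;That tooling does not need to be elaborate. Even a resolver that returns its reasoning alongside the decision goes a long way. A hypothetical sketch:&lt;/p&gt;

```javascript
// Sketch of "explainable" policy resolution: instead of returning a bare
// allow/deny, return the trace of every rule consulted. The rule shape
// and rule ids here are hypothetical.
const rules = [
  { id: 'deny-cui-to-viewer', effect: 'deny',
    match: (req) => req.collection === 'cui-docs' ? req.role === 'viewer' : false },
  { id: 'allow-analyst-read', effect: 'allow',
    match: (req) => req.role === 'analyst' ? req.action === 'read' : false },
];

function resolve(req) {
  const trace = [];
  for (const rule of rules) {
    const hit = rule.match(req);
    trace.push({ rule: rule.id, matched: hit });
    if (hit) return { decision: rule.effect, trace };
  }
  return { decision: 'deny', trace }; // default-deny when nothing matches
}
```

&lt;p&gt;When a user reports "I can't see this collection," the trace tells you which rule fired (or that nothing did and the default-deny applied) without reading the configuration by hand.&lt;/p&gt;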




&lt;p&gt;Enterprise RAG isn't a retrieval problem. It's a governance problem with a retrieval component. Get the governance layer right, and the retrieval and generation parts are almost straightforward. Get it wrong, and no amount of prompt engineering or embedding optimization will make your platform trustworthy enough to deploy where it matters.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Jamie Thompson is CEO of Sprinklenet, an AI technology company building governed AI platforms for government and enterprise. Learn more at &lt;a href="https://sprinklenet.com" rel="noopener noreferrer"&gt;sprinklenet.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>security</category>
      <category>architecture</category>
    </item>
    <item>
      <title>How to Evaluate AI Vendors Without Getting Burned</title>
      <dc:creator>Jamie Thompson</dc:creator>
      <pubDate>Tue, 10 Mar 2026 14:12:03 +0000</pubDate>
      <link>https://dev.to/jamie_thompson/how-to-evaluate-ai-vendors-without-getting-burned-2p5</link>
      <guid>https://dev.to/jamie_thompson/how-to-evaluate-ai-vendors-without-getting-burned-2p5</guid>
      <description>&lt;p&gt;I've been on both sides of enterprise AI deals. I've sold platforms to government agencies and Fortune 500 companies. I've also sat in the buyer's chair, evaluating tools for our own stack. The experience of doing both has given me a clear picture of what separates a vendor worth trusting from one that's going to waste your next 18 months.&lt;/p&gt;

&lt;p&gt;Most AI vendor evaluations fail because they focus on the wrong things. Teams get dazzled by demo magic, benchmark claims, and slide decks full of architecture diagrams that look impressive but tell you nothing about what happens when real users hit the system at scale.&lt;/p&gt;

&lt;p&gt;Here's the framework I actually use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Ignore the Demo
&lt;/h2&gt;

&lt;p&gt;I know that sounds extreme. But the demo is theater. Every vendor has a golden path demo that makes their product look flawless. The question isn't whether the demo works. It's whether the product works when your data is messy, your users are unpredictable, and your compliance team has 47 questions about audit logging.&lt;/p&gt;

&lt;p&gt;Instead of watching a demo, ask for a sandbox. Give the vendor your actual data (or a representative sample) and your actual use cases. Let your team spend a week breaking it. If a vendor won't give you a sandbox environment, that tells you something important.&lt;/p&gt;

&lt;p&gt;When we deploy &lt;a href="https://sprinklenet.com/knowledge-spaces/" rel="noopener noreferrer"&gt;Knowledge Spaces&lt;/a&gt; for new clients, we insist on a pilot phase with real data. Not because we lack confidence in our demos (we don't), but because the pilot surfaces integration issues, data quality problems, and user workflow gaps that no demo can reveal. The pilot is where trust gets built.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Ask the Hard Questions
&lt;/h2&gt;

&lt;p&gt;Most evaluation checklists are surface level. "Does it support SSO?" Yes, every enterprise vendor supports SSO. The real questions are deeper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On data handling:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where does my data live at rest and in transit?&lt;/li&gt;
&lt;li&gt;Can I bring my own encryption keys?&lt;/li&gt;
&lt;li&gt;What happens to my data if I cancel the contract?&lt;/li&gt;
&lt;li&gt;Do you use customer data to train models? (If the answer is anything other than an immediate, unqualified "no," walk away.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;On model access:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Am I locked into a single LLM provider?&lt;/li&gt;
&lt;li&gt;Can I swap models without re-architecting my prompts and workflows?&lt;/li&gt;
&lt;li&gt;What happens when a model provider has an outage?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Single-provider lock-in is one of the most expensive mistakes an enterprise can make right now. The LLM landscape is shifting fast. A platform that forces you onto one provider is a platform that will cost you flexibility when you need it most. This is exactly why we built Knowledge Spaces to support 16+ models across OpenAI, Anthropic, Google, Groq, and others. Not because more is better, but because production environments need routing flexibility and provider redundancy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On governance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How granular is your audit logging?&lt;/li&gt;
&lt;li&gt;Can I see exactly which model generated which response, with what context, for which user, at what time?&lt;/li&gt;
&lt;li&gt;What role-based access controls exist beyond basic admin/user splits?&lt;/li&gt;
&lt;li&gt;How do you handle PII in prompts and responses?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a vendor can't answer these questions with specifics, they haven't built for enterprise. They've built a prototype and wrapped it in a sales pitch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Benchmark on Your Workload, Not Theirs
&lt;/h2&gt;

&lt;p&gt;Vendor benchmarks are meaningless for your use case. A model that scores 94% on MMLU might perform terribly on your internal knowledge base because your documents are full of domain-specific jargon, acronyms, and context that general benchmarks don't capture.&lt;/p&gt;

&lt;p&gt;Build your own evaluation set. Take 50 to 100 real questions that your users would actually ask. Include edge cases. Include questions where the correct answer is "I don't have enough information to answer that." Include questions that require synthesizing information from multiple documents.&lt;/p&gt;

&lt;p&gt;Run these through every vendor you're evaluating. Score the results on accuracy, citation quality, response latency, and hallucination rate. This takes time. It's worth every hour.&lt;/p&gt;
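
&lt;p&gt;The aggregation side of this is simple enough to sketch. Assuming each answer has already been judged (by a human or an automated grader), something like the following rolls per-question scores up into comparable vendor numbers (the field and metric names are illustrative):&lt;/p&gt;

```javascript
// Sketch of a vendor benchmark aggregator. Each result row is one
// question, already judged for correctness and hallucination.
function summarize(results) {
  const total = results.length;
  const correct = results.filter((r) => r.correct).length;
  const hallucinated = results.filter((r) => r.hallucinated).length;
  const latencies = results.map((r) => r.latencyMs).sort((a, b) => a - b);
  return {
    accuracy: correct / total,
    hallucinationRate: hallucinated / total,
    p50LatencyMs: latencies[Math.floor(total / 2)],
  };
}

const vendorA = summarize([
  { correct: true, hallucinated: false, latencyMs: 820 },
  { correct: true, hallucinated: false, latencyMs: 1120 },
  { correct: false, hallucinated: true, latencyMs: 640 },
  { correct: true, hallucinated: false, latencyMs: 910 },
]);
console.log(vendorA); // { accuracy: 0.75, hallucinationRate: 0.25, p50LatencyMs: 910 }
```

&lt;p&gt;Run the same summary for every vendor on the same question set and the comparison becomes a table instead of a gut feeling.&lt;/p&gt;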

&lt;p&gt;For government clients, we also benchmark on compliance-specific scenarios. Can the system correctly refuse to answer questions outside its authorized scope? Does it properly cite source documents with section and paragraph references? Does the guardrail engine catch prompt injection attempts? These aren't academic concerns. For a DoW agency or an intelligence community customer, a single hallucinated citation in an operational context is a serious problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Evaluate the Vendor, Not Just the Product
&lt;/h2&gt;

&lt;p&gt;Products change. Vendors don't, at least not quickly. Here's what I look at beyond the software.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Engineering depth.&lt;/strong&gt; How large is the engineering team relative to the sales team? If a company has 40 salespeople and 8 engineers, the product is not going to evolve at the pace you need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Customer concentration.&lt;/strong&gt; If a vendor's revenue depends heavily on one or two clients, they'll prioritize those clients' roadmap over yours. Ask about their customer distribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment flexibility.&lt;/strong&gt; Can you run this in your own cloud? On-premises? In an air-gapped environment? "Cloud-only" platforms are non-starters for a significant portion of the enterprise and government market.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing transparency.&lt;/strong&gt; If the pricing model requires a custom quote for everything, expect scope creep. The best vendors publish their pricing logic, even if the actual numbers are negotiated. You should understand exactly what drives your costs before you sign.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Check the Integration Story
&lt;/h2&gt;

&lt;p&gt;The most capable AI platform in the world is useless if it can't connect to your existing systems. Evaluate integrations ruthlessly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does it connect to your data sources natively, or do you need to build custom pipelines?&lt;/li&gt;
&lt;li&gt;What authentication protocols does it support? SAML 2.0, OAuth 2.0, CAC/PKI?&lt;/li&gt;
&lt;li&gt;Can it call your internal APIs with proper credential management?&lt;/li&gt;
&lt;li&gt;How does it handle document ingestion at scale? Not "we support PDF" but "we can ingest 10,000 PDFs with metadata preservation and incremental updates."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In our platform, we've built 15+ data connectors covering Salesforce, PostgreSQL, REST APIs, and OAuth-protected services. Every one of those connectors exists because a real customer needed it. Not because it looked good on a feature matrix.&lt;/p&gt;

&lt;h2&gt;
  
  
  Red Flags That Should Kill a Deal
&lt;/h2&gt;

&lt;p&gt;After years in this space, these are the signals that tell me to walk away.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"We use proprietary AI."&lt;/strong&gt; Unless the vendor has genuinely trained their own foundation model (and almost none of them have), this means they've wrapped an API and don't want you to know. Proprietary claims in the current market are almost always misleading.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No audit trail.&lt;/strong&gt; If the platform can't tell you exactly what happened, when, and why, it's not enterprise-ready. Period.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No security roadmap.&lt;/strong&gt; Every platform is at a different stage of its compliance journey. What matters is whether the vendor has a clear, funded plan with milestones and timelines. Ask what controls they have in place today, what their target certifications are, and when they expect to achieve them. A vendor actively investing in compliance with a transparent roadmap is far more credible than one who waves their hands and changes the subject.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contract lock-in longer than 12 months.&lt;/strong&gt; The AI landscape changes too fast for multi-year commitments on unproven platforms. A vendor confident in their product will earn renewals, not trap you into them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No customer references in your industry.&lt;/strong&gt; If a vendor has never deployed in your sector, you're paying them to learn on your dime. That can work, but price it accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Enterprise AI evaluation is not a technology decision. It's a risk management decision. You're choosing a partner who will have deep access to your data, your workflows, and your users. Treat it with the same rigor you'd apply to hiring a senior executive.&lt;/p&gt;

&lt;p&gt;Do the sandbox. Ask the hard questions. Benchmark on your own data. Check the humans behind the product. And never, ever let a polished demo substitute for real due diligence.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Jamie Thompson is the Founder and CEO of &lt;a href="https://sprinklenet.com" rel="noopener noreferrer"&gt;Sprinklenet AI&lt;/a&gt;, where he builds enterprise AI platforms for government and commercial clients. He writes weekly at &lt;a href="https://newsletter.sprinklenet.com" rel="noopener noreferrer"&gt;newsletter.sprinklenet.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>enterprise</category>
      <category>evaluation</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>What Running AI for the Air Force Taught Me About Enterprise AI</title>
      <dc:creator>Jamie Thompson</dc:creator>
      <pubDate>Tue, 10 Mar 2026 14:12:03 +0000</pubDate>
      <link>https://dev.to/jamie_thompson/what-running-ai-for-the-air-force-taught-me-about-enterprise-ai-4405</link>
      <guid>https://dev.to/jamie_thompson/what-running-ai-for-the-air-force-taught-me-about-enterprise-ai-4405</guid>
      <description>&lt;p&gt;I spent the better part of a year as Principal Investigator on an Air Force basic research program focused on trust measurement, multi-source signal analysis, and uncertainty quantification. Before that, I contributed to AI research efforts for the Air Force Research Laboratory through SBIR programs. And before any of that, I spent nearly two decades building AI products across industries.&lt;/p&gt;

&lt;p&gt;None of it looked like the AI demos you see at conferences.&lt;/p&gt;

&lt;p&gt;There were no polished chatbots on stage. No live demos where everything magically works. The actual work was methodical, sometimes tedious, and deeply focused on whether the outputs could be trusted, explained, and acted on by real people in real situations.&lt;/p&gt;

&lt;p&gt;That experience changed how I think about enterprise AI. Not because government AI is so different from commercial AI, but because it strips away the hype and forces you to confront what actually matters. Here are five lessons I took away that apply to any organization trying to make AI work at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Nobody needs another AI demo. They need production systems.
&lt;/h2&gt;

&lt;p&gt;The defense and intelligence communities are drowning in AI prototypes. Every contractor, startup, and research lab has a demo. Most of them are impressive for about fifteen minutes. Then someone asks, "How does this connect to our existing systems?" or "What happens when the data format changes?" and the conversation gets quiet.&lt;/p&gt;

&lt;p&gt;What I learned working with Air Force programs is that the gap between a working demo and a production system is enormous. It is not a gap you can close with more funding or more data scientists. It requires thinking about integration from day one. It requires designing for the unglamorous realities of authentication, access control, data pipelines, and system monitoring.&lt;/p&gt;

&lt;p&gt;The organizations that succeed with AI are not the ones with the most sophisticated models. They are the ones that treat AI deployment the way they would treat any critical infrastructure project: with proper engineering, testing, and operational planning.&lt;/p&gt;

&lt;p&gt;If your AI only works in a demo environment, you do not have an AI capability. You have a science project.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Trust is not a feature. It is the architecture.
&lt;/h2&gt;

&lt;p&gt;My research focused specifically on trust and influence measurement frameworks. One of the clearest takeaways was that trust in AI systems cannot be bolted on after the fact. You cannot build a black-box system and then add an "explainability module" later. That approach fails every time.&lt;/p&gt;

&lt;p&gt;Trustworthy AI requires architectural decisions made at the foundation. That means audit trails that log every decision and every data source consulted. It means source citations so a human can verify why the system produced a given output. It means uncertainty quantification so users know not just what the system thinks, but how confident it is.&lt;/p&gt;

&lt;p&gt;In government contexts, this is not optional. An analyst cannot act on AI output they cannot explain to their leadership. A program manager cannot defend a recommendation to Congress if the reasoning is opaque.&lt;/p&gt;

&lt;p&gt;But this applies equally in the commercial world. A CFO will not trust an AI-generated forecast if they cannot trace it back to source data. A compliance officer will not sign off on an AI system that cannot explain its decisions. A board of directors will not accept "the model said so" as justification for a strategic pivot.&lt;/p&gt;

&lt;p&gt;If your AI system does not have built-in explainability, source attribution, and audit trails, you do not have a trustworthy system. You have a prototype with good marketing.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Multi-source data is where the real value lives.
&lt;/h2&gt;

&lt;p&gt;The hardest part of AI is not the model. It never has been. The hardest part is connecting to the messy, fragmented, inconsistent data sources that organizations actually rely on.&lt;/p&gt;

&lt;p&gt;During my Air Force research, the signal analysis work involved synthesizing information from multiple sources, each with different formats, different levels of reliability, and different update cadences. The model was almost the easy part. The data integration was where the real engineering happened.&lt;/p&gt;

&lt;p&gt;This is true in every enterprise I have worked with. The data lives in SharePoint, in legacy databases, in email threads, in PDFs that were scanned in 2014, in Slack channels, and in the heads of subject matter experts who are three years from retirement. Getting AI to work means building connectors to all of it, normalizing it, handling conflicts between sources, and doing so in a way that is maintainable over time.&lt;/p&gt;

&lt;p&gt;This is exactly why I built &lt;a href="https://sprinklenet.com/knowledge-spaces/" rel="noopener noreferrer"&gt;Knowledge Spaces&lt;/a&gt; as a multi-source RAG platform. The value is not in choosing the right LLM. The value is in connecting the right data, from the right sources, with the right context, so the AI can actually be useful.&lt;/p&gt;

&lt;p&gt;Any vendor who tells you their AI solution "just works" without a serious conversation about data integration is selling you a fantasy.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Small teams move faster, and speed is everything right now.
&lt;/h2&gt;

&lt;p&gt;I have seen AI programs run by teams of five deliver results in months that teams of fifty could not deliver in years. This is not an exaggeration. It is a pattern I observed repeatedly across SBIR programs, research labs, and commercial engagements.&lt;/p&gt;

&lt;p&gt;Large defense contractors have enormous resources, deep relationships, and decades of institutional knowledge. But when the technology is evolving as fast as AI is evolving right now, those advantages can become liabilities. Big teams mean more coordination overhead. Legacy processes mean slower iteration. Risk aversion means waiting for someone else to prove the concept before committing.&lt;/p&gt;

&lt;p&gt;Small, focused teams can prototype, test, deploy, learn, and iterate in the time it takes a large program to complete its requirements gathering phase. When the underlying models are improving every few months, the ability to adapt quickly is not just nice to have. It is the difference between deploying something useful and deploying something obsolete.&lt;/p&gt;

&lt;p&gt;This does not mean large organizations should only work with small companies. It means they should structure their AI initiatives to preserve agility. Use small, empowered teams. Reduce approval layers. Accept that the first version will not be perfect and plan to iterate. The organizations that move fastest will learn fastest, and the ones that learn fastest will win.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. The fractional Chief AI Officer model works.
&lt;/h2&gt;

&lt;p&gt;Not every organization needs a full-time Chief AI Officer. Most do not. What they need is someone who has built AI systems in production, who understands both the technology and the organizational dynamics, and who can set the right architecture and strategy without the overhead of a permanent C-suite hire.&lt;/p&gt;

&lt;p&gt;I have been operating as a &lt;a href="https://sprinklenet.com/ai-strategy/" rel="noopener noreferrer"&gt;fractional Chief AI Officer&lt;/a&gt; for multiple organizations, and the model works for a simple reason: the critical decisions in enterprise AI are architectural and strategic, not operational. You need experienced judgment to decide which problems AI should solve, which data sources to prioritize, which vendors to trust, and which hype to ignore. You do not need that person sitting in every standup meeting.&lt;/p&gt;

&lt;p&gt;A good fractional CAIO sets the foundation, builds the evaluation frameworks, trains the team to execute, and then stays engaged enough to course-correct when the landscape shifts. They bring pattern recognition from working across multiple organizations and industries, which is something no internal hire can replicate.&lt;/p&gt;

&lt;p&gt;The organizations getting the most value from AI right now are not the ones who hired the most impressive AI team. They are the ones who found someone who has done this before, who can tell them directly what AI can and cannot do for their specific situation, and who can keep the team focused on outcomes instead of technology for its own sake.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;AI in the enterprise is not a technology problem. It is an integration problem, a trust problem, and a leadership problem. The technology is mature enough. The models are good enough. What most organizations lack is the experience to deploy AI in a way that is reliable, explainable, and connected to the data and systems that actually matter.&lt;/p&gt;

&lt;p&gt;That is what working with the Air Force taught me. Not how to build better models, but how to build AI systems that people can actually trust and use.&lt;/p&gt;

&lt;p&gt;If your organization is navigating these same challenges, I am always happy to compare notes.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Jamie Thompson is the Founder and CEO of &lt;a href="https://sprinklenet.com" rel="noopener noreferrer"&gt;Sprinklenet AI&lt;/a&gt;, where he builds enterprise AI platforms for government and commercial clients. He writes weekly at &lt;a href="https://newsletter.sprinklenet.com" rel="noopener noreferrer"&gt;newsletter.sprinklenet.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>government</category>
      <category>enterprise</category>
      <category>leadership</category>
    </item>
    <item>
      <title>MCP Servers Explained: How AI Assistants Connect to Your Tools</title>
      <dc:creator>Jamie Thompson</dc:creator>
      <pubDate>Tue, 10 Mar 2026 14:12:03 +0000</pubDate>
      <link>https://dev.to/jamie_thompson/mcp-servers-explained-how-ai-assistants-connect-to-your-tools-598o</link>
      <guid>https://dev.to/jamie_thompson/mcp-servers-explained-how-ai-assistants-connect-to-your-tools-598o</guid>
      <description>&lt;p&gt;If you have been building with AI coding assistants, agents, or LLM-powered tools recently, you have probably encountered the term MCP. Model Context Protocol is quickly becoming the standard way that AI assistants connect to external tools, data sources, and services. It is one of those infrastructure-level shifts that seems technical and niche until you realize it changes how every AI application gets built.&lt;/p&gt;

&lt;p&gt;I run an AI platform company. We build &lt;a href="https://sprinklenet.com/knowledge-spaces/" rel="noopener noreferrer"&gt;Knowledge Spaces&lt;/a&gt;, a multi-LLM RAG platform with 15+ data connectors, and &lt;a href="https://sprinklenet.com/farbot/" rel="noopener noreferrer"&gt;FARbot&lt;/a&gt;, a public AI chatbot for federal acquisition regulations. Connecting AI systems to real data sources is literally what we do every day. MCP is the most significant development in that space since function calling became standard in LLM APIs.&lt;/p&gt;

&lt;p&gt;Here is what MCP actually is, why it matters, and how it works in practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is the Model Context Protocol?
&lt;/h2&gt;

&lt;p&gt;MCP is an open protocol, originally developed by Anthropic, that standardizes how AI applications communicate with external tools and data sources. Think of it as a universal adapter layer between AI assistants and the services they need to interact with.&lt;/p&gt;

&lt;p&gt;Before MCP, every AI tool integration was bespoke. If you wanted your AI assistant to read from Google Drive, you wrote custom code. If you wanted it to query a database, you wrote different custom code. If you wanted it to interact with Slack, GitHub, or Salesforce, each one required its own integration logic, authentication handling, and error management.&lt;/p&gt;

&lt;p&gt;MCP replaces that fragmented approach with a single, standardized protocol. An MCP server exposes a set of tools (functions the AI can call), resources (data the AI can read), and prompts (templates the AI can use). An MCP client, which is typically the AI assistant or agent framework, connects to one or more MCP servers and makes those capabilities available to the LLM.&lt;/p&gt;

&lt;p&gt;The protocol uses JSON-RPC over standard transport mechanisms (stdio for local servers, HTTP with server-sent events for remote ones). If you have worked with Language Server Protocol (LSP) in code editors, the mental model is similar. LSP standardized how editors talk to language tooling. MCP standardizes how AI talks to everything else.&lt;/p&gt;
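
&lt;p&gt;To make the wire format concrete, here is roughly what a tool-call exchange looks like. The &lt;code&gt;tools/call&lt;/code&gt; method and result shape follow the MCP specification as I understand it, but treat the payload details as a sketch and check the current spec before building against it:&lt;/p&gt;

```javascript
// Illustrative JSON-RPC 2.0 exchange for an MCP tool call. The
// 'list_events' tool and its arguments are hypothetical.
const request = {
  jsonrpc: '2.0',
  id: 7,
  method: 'tools/call',
  params: {
    name: 'list_events',
    arguments: { start: '2026-03-10', end: '2026-03-17' },
  },
};

const response = {
  jsonrpc: '2.0',
  id: 7, // must echo the request id
  result: {
    content: [{ type: 'text', text: '3 events found' }],
  },
};
```

&lt;p&gt;Every tool on every server speaks this same shape, which is exactly what makes the ecosystem composable.&lt;/p&gt;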

&lt;h2&gt;
  
  
  Why MCP Matters
&lt;/h2&gt;

&lt;p&gt;Three reasons, each increasingly important.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, it eliminates redundant integration work.&lt;/strong&gt; Before MCP, every AI application that wanted to connect to, say, PostgreSQL had to build its own PostgreSQL integration. With MCP, someone builds a PostgreSQL MCP server once, and every MCP-compatible AI client can use it. This is the same composability pattern that made REST APIs and package managers transformative. Build once, use everywhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, it creates a real ecosystem.&lt;/strong&gt; There are already MCP servers for Google Workspace, Slack, GitHub, file systems, databases, web browsers, and dozens of other services. The list grows weekly. When you build an AI agent using an MCP-compatible framework, you get access to this entire ecosystem without writing integration code. Your agent can read email, check calendars, query databases, and interact with project management tools through a uniform interface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, it separates concerns cleanly.&lt;/strong&gt; The AI application does not need to know the details of how to authenticate with Salesforce or parse Google Sheets. The MCP server handles that complexity. The AI application just sees a set of tools with descriptions, input schemas, and output formats. This separation makes systems more maintainable, more secure, and easier to audit.&lt;/p&gt;

&lt;h2&gt;
  
  
  How MCP Servers Work in Practice
&lt;/h2&gt;

&lt;p&gt;An MCP server is a process that implements the Model Context Protocol and exposes capabilities to AI clients. Let me walk through what this looks like concretely.&lt;/p&gt;

&lt;p&gt;A typical MCP server does three things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Declares its tools.&lt;/strong&gt; Each tool has a name, a description (which the LLM reads to decide when to use it), and a JSON Schema defining its inputs. For example, a Google Calendar MCP server might expose tools like &lt;code&gt;list_events&lt;/code&gt;, &lt;code&gt;create_event&lt;/code&gt;, and &lt;code&gt;find_free_time&lt;/code&gt;, each with parameters like date ranges, attendee lists, and event details.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handles tool calls.&lt;/strong&gt; When the AI decides to use a tool, the MCP client sends a JSON-RPC request to the server with the tool name and arguments. The server executes the operation (querying an API, reading a file, running a database query) and returns the result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manages authentication and state.&lt;/strong&gt; The server handles credentials, session management, rate limiting, and any other operational concerns. The AI client never sees raw API keys or authentication tokens.&lt;/p&gt;
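&lt;p&gt;The three responsibilities above can be sketched as a plain-Python dispatch loop. This is an illustrative skeleton of the JSON-RPC shape only: the calendar tool, the handler body, and the simplified envelope are stand-ins, and a real server would use one of the official SDKs, which handle transport and protocol details.&lt;/p&gt;

```python
import json

# Illustrative sketch of the JSON-RPC shape an MCP server implements.
# Tool names, handler logic, and the simplified envelope are hypothetical.

TOOLS = {
    "list_events": {
        "description": "List calendar events between two ISO-8601 dates.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "start": {"type": "string"},
                "end": {"type": "string"},
            },
            "required": ["start", "end"],
        },
    },
}

def handle_request(raw: str) -> str:
    req = json.loads(raw)
    if req["method"] == "tools/list":          # 1. declare tools
        result = {"tools": [dict(name=n, **t) for n, t in TOOLS.items()]}
    elif req["method"] == "tools/call":        # 2. handle tool calls
        args = req["params"]["arguments"]
        result = {"content": f"events from {args['start']} to {args['end']}"}
    else:
        result = {"error": "unknown method"}
    # 3. credentials, rate limits, and sessions would be enforced here,
    # invisible to the AI client.
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})
```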

&lt;p&gt;In our work at Sprinklenet, we use MCP servers extensively. Our development environment connects to Google Workspace (Drive, Sheets, Docs, Calendar), Gmail, and various internal tools through MCP. When I ask my AI assistant to check my calendar, draft an email, or look up a document on Drive, those requests flow through MCP servers that handle all the authentication and API complexity transparently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Your Own MCP Server
&lt;/h2&gt;

&lt;p&gt;If you have a tool or data source you want to expose to AI assistants, building an MCP server is surprisingly approachable. The official SDKs exist in TypeScript and Python, and the basic structure is straightforward.&lt;/p&gt;

&lt;p&gt;You define your tools with clear descriptions (this matters enormously because the LLM uses these descriptions to decide when and how to use each tool), implement handlers for each tool, and configure the transport layer. A minimal MCP server can be up and running in under an hour.&lt;/p&gt;

&lt;p&gt;The critical skill is writing good tool descriptions. Remember that an LLM is reading your tool's name and description to decide whether to call it and what arguments to pass. Vague descriptions lead to misuse. Overly technical descriptions confuse the model. The best tool descriptions are concise, specific, and written as if you were explaining the tool to a smart colleague who has never used it before.&lt;/p&gt;
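&lt;p&gt;A hypothetical before-and-after makes the point concrete. The tool names and wording below are invented for illustration; the schema format is standard JSON Schema, which MCP tool inputs use.&lt;/p&gt;

```python
# Hypothetical example contrasting a vague and a specific tool definition.

vague = {
    "name": "search",
    "description": "Searches stuff.",  # the model can't tell when to use this
}

specific = {
    "name": "search_contracts",
    "description": (
        "Full-text search over signed customer contracts. Use for questions "
        "about contract terms, renewal dates, or parties. Do not use for "
        "draft documents or emails."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search phrase."},
            "limit": {"type": "integer", "description": "Max results, default 10."},
        },
        "required": ["query"],
    },
}
```

The second description tells the model what the tool covers, when to reach for it, and when not to, which is exactly the decision the LLM has to make.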

&lt;h2&gt;
  
  
  Real World Applications
&lt;/h2&gt;

&lt;p&gt;The most compelling MCP use cases I have seen fall into a few categories.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enterprise data access.&lt;/strong&gt; Connecting AI assistants to internal knowledge bases, CRMs, ERPs, and document management systems. This is what we do with Knowledge Spaces, where 15+ data connectors bring in information from Salesforce, PostgreSQL, REST APIs, and other sources through authenticated, access-controlled channels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Developer productivity.&lt;/strong&gt; MCP servers for GitHub, CI/CD systems, monitoring dashboards, and cloud infrastructure let AI coding assistants go beyond just writing code. They can check build status, review pull requests, query logs, and manage deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Government and compliance workflows.&lt;/strong&gt; This is close to home for us. Our FARbot product, built on Knowledge Spaces, provides cited answers to Federal Acquisition Regulation questions. The underlying architecture connects to curated regulatory data through the same kinds of structured data pipelines that MCP standardizes. As MCP matures, we see it becoming the standard interface layer for connecting AI to authoritative government data sources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Personal productivity.&lt;/strong&gt; Calendar management, email triage, document organization, task tracking. MCP servers for Google Workspace, Microsoft 365, and productivity tools turn AI assistants into genuine workflow automation platforms rather than just chat interfaces.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Watch For
&lt;/h2&gt;

&lt;p&gt;MCP is still early. The protocol is evolving, and there are open questions around authentication standards for remote servers, capability negotiation between clients and servers, and security boundaries.&lt;/p&gt;

&lt;p&gt;A few practical considerations if you are adopting MCP today:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security is your responsibility.&lt;/strong&gt; MCP servers can do anything their underlying credentials allow. Scope permissions tightly. Audit tool usage. Do not give an MCP server write access to systems unless you have thought carefully about what happens when the AI makes a mistake.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool sprawl is real.&lt;/strong&gt; Connecting twenty MCP servers to a single AI assistant sounds powerful, but LLMs have finite context and attention. More tools means more opportunity for the model to pick the wrong one. Curate your tool set deliberately. Fewer, well-described tools outperform a bloated toolkit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Descriptions drive behavior.&lt;/strong&gt; The quality of your tool descriptions directly determines how well the AI uses them. Invest time in writing, testing, and refining these descriptions. Treat them like API documentation that your most important client will read.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where MCP Is Headed
&lt;/h2&gt;

&lt;p&gt;MCP is doing for AI tool integration what HTTP did for web services and what LSP did for developer tooling. It is creating a common language that lets independently developed systems work together without custom glue code.&lt;/p&gt;

&lt;p&gt;For teams building AI products, the implication is clear: design your systems to be MCP-compatible from the start. Expose your capabilities as MCP tools. Consume external capabilities through MCP clients. The ecosystem effects will compound rapidly as more servers and clients come online.&lt;/p&gt;

&lt;p&gt;At Sprinklenet, we are building toward a future where enterprise AI platforms like Knowledge Spaces connect to any data source, any tool, and any workflow through standardized, secure, auditable interfaces. MCP is a major step in that direction, and we are investing heavily in the ecosystem.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Jamie Thompson is the Founder and CEO of &lt;a href="https://sprinklenet.com" rel="noopener noreferrer"&gt;Sprinklenet AI&lt;/a&gt;, where he builds enterprise AI platforms for government and commercial clients. He writes weekly at &lt;a href="https://newsletter.sprinklenet.com" rel="noopener noreferrer"&gt;newsletter.sprinklenet.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>llm</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Multi-LLM Orchestration in Production: Lessons from Running 16+ Models</title>
      <dc:creator>Jamie Thompson</dc:creator>
      <pubDate>Tue, 10 Mar 2026 14:12:03 +0000</pubDate>
      <link>https://dev.to/jamie_thompson/multi-llm-orchestration-in-production-lessons-from-running-16-models-3hip</link>
      <guid>https://dev.to/jamie_thompson/multi-llm-orchestration-in-production-lessons-from-running-16-models-3hip</guid>
      <description>&lt;p&gt;Most teams start with one LLM. Maybe GPT-4. Maybe Claude. You wire it up, ship a prototype, and everything looks great.&lt;/p&gt;

&lt;p&gt;Then reality sets in.&lt;/p&gt;

&lt;p&gt;Your users need different things. Some tasks need raw reasoning power. Others need speed. Some need to stay cheap because you're processing thousands of documents a day. And then a provider has an outage on a Tuesday afternoon and your entire platform goes dark.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://sprinklenet.com" rel="noopener noreferrer"&gt;Sprinklenet&lt;/a&gt;, we've been building &lt;a href="https://sprinklenet.com/knowledge-spaces/" rel="noopener noreferrer"&gt;Knowledge Spaces&lt;/a&gt;, our enterprise AI platform, with multi-model orchestration from the start. We currently run 16+ foundation models across OpenAI, Anthropic, Google, Groq, xAI, and others. Not as a gimmick. Because production demands it.&lt;/p&gt;

&lt;p&gt;Here's what we've learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why You Actually Need Multiple Models
&lt;/h2&gt;

&lt;p&gt;The pitch for multi-LLM isn't "more models equals more better." It's that different models have genuinely different strengths, and pretending otherwise costs you money, performance, or both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reasoning vs. speed tradeoffs are real.&lt;/strong&gt; Claude Opus is exceptional at nuanced analysis and long document comprehension. GPT-4o is a strong generalist. Groq's LLaMA inference is blazingly fast for simpler extraction tasks. Google's Gemini handles massive context windows. You wouldn't use a sledgehammer to hang a picture frame.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost differences are staggering.&lt;/strong&gt; Running every query through your most expensive model is like taking a limo to get groceries. We've seen 10x cost reductions on certain workloads by routing simple queries to smaller, faster models and reserving premium models for complex reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provider reliability varies.&lt;/strong&gt; Every major provider has outages. If your platform depends on a single provider, you inherit their downtime as your own. For enterprise and government clients, that's a nonstarter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Routing Strategies That Actually Work
&lt;/h2&gt;

&lt;p&gt;The hardest engineering problem in multi-LLM orchestration isn't calling APIs. It's deciding which model handles which request.&lt;/p&gt;

&lt;h3&gt;
  
  
  Complexity-Based Routing
&lt;/h3&gt;

&lt;p&gt;We classify incoming queries by estimated complexity before they hit a model. Simple factual lookups go to fast, cheap models. Multi-step reasoning tasks go to premium models. Document analysis with large context goes to models with bigger context windows.&lt;/p&gt;

&lt;p&gt;The classification itself can be lightweight. You don't need an LLM to classify for an LLM. A combination of query length, keyword detection, conversation history depth, and task type metadata handles 80% of routing decisions. For the remaining 20%, a small classifier model makes the call.&lt;/p&gt;
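&lt;p&gt;A minimal sketch of that lightweight heuristic pass, with invented thresholds, keywords, and tier names (they are not Sprinklenet's actual values):&lt;/p&gt;

```python
# Hedged sketch of heuristic complexity classification. The hint list,
# thresholds, and tier names are illustrative placeholders.

REASONING_HINTS = {"why", "compare", "analyze", "tradeoff", "plan"}

def route_tier(query: str, history_depth: int = 0) -> str:
    words = query.lower().split()
    if len(words) > 200 or history_depth > 10:
        return "large-context"   # big documents or long conversations
    if REASONING_HINTS.intersection(words) or history_depth > 3:
        return "premium"         # multi-step reasoning signals
    return "fast-cheap"          # simple factual lookups
```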

&lt;h3&gt;
  
  
  Task-Specific Assignment
&lt;/h3&gt;

&lt;p&gt;Some tasks have a clear best model. Code generation, summarization, translation, structured extraction. These are all tasks where benchmarks and our own internal evals point to clear winners. We maintain a routing table that maps task types to preferred models, with fallbacks defined for each.&lt;/p&gt;
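&lt;p&gt;Such a routing table can be as simple as an ordered map; the model names below are examples, not an endorsement or the table we actually run:&lt;/p&gt;

```python
# Illustrative task-to-model routing table with ordered fallbacks.
ROUTING_TABLE = {
    "code_generation":       ["claude-opus", "gpt-4o", "gemini-pro"],
    "summarization":         ["gpt-4o-mini", "llama-70b", "gpt-4o"],
    "structured_extraction": ["gpt-4o", "claude-sonnet"],
}

def models_for(task: str) -> list[str]:
    # First entry is the preferred model; the rest are ordered fallbacks.
    return ROUTING_TABLE.get(task, ["gpt-4o"])  # default generalist
```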

&lt;h3&gt;
  
  
  User-Driven Selection
&lt;/h3&gt;

&lt;p&gt;In Knowledge Spaces, users can also select their preferred model directly. Enterprise users often have opinions about which models they trust, and some have compliance requirements that restrict them to specific providers. The platform respects that while still applying guardrails.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Optimization in Practice
&lt;/h2&gt;

&lt;p&gt;Running 16+ models sounds expensive. It can be, if you're naive about it. Here's how we keep costs manageable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token-aware routing.&lt;/strong&gt; Before sending a request, we estimate the token count and factor that into the routing decision. A 50,000 token document analysis has very different cost implications on GPT-4o vs. Gemini 1.5 Pro.&lt;/p&gt;
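&lt;p&gt;The arithmetic is trivial, which is exactly why it should run before every routing decision. Per-million-token prices below are placeholders (real prices change often); the pattern is what matters.&lt;/p&gt;

```python
# Back-of-envelope input-token cost comparison with placeholder prices.
PRICE_PER_MTOK_IN = {"premium-model": 5.00, "budget-model": 0.15}

def estimate_cost(model: str, prompt_tokens: int) -> float:
    return PRICE_PER_MTOK_IN[model] * prompt_tokens / 1_000_000

# The same 50,000-token document on each tier:
premium = estimate_cost("premium-model", 50_000)  # 0.25 dollars
budget = estimate_cost("budget-model", 50_000)    # 0.0075 dollars
```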

&lt;p&gt;&lt;strong&gt;Caching at multiple layers.&lt;/strong&gt; We cache at the retrieval layer (so the same document chunks don't get re-embedded repeatedly), at the prompt layer (similar queries hitting the same context get cached responses), and at the provider layer (leveraging provider-side prompt caching where available).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch processing for non-interactive workloads.&lt;/strong&gt; Not everything needs real-time responses. Document ingestion, bulk analysis, and background processing tasks can use batch APIs at significant discounts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model version pinning.&lt;/strong&gt; We pin to specific model versions rather than floating aliases. This prevents surprise cost increases when a provider updates their "latest" pointer, and it avoids subtle behavior changes that break downstream logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fallback Handling and Resilience
&lt;/h2&gt;

&lt;p&gt;This is where most multi-LLM implementations fall apart. Having multiple models available means nothing if your fallback logic is an afterthought.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cascading fallbacks with budget awareness.&lt;/strong&gt; Each primary model has an ordered fallback chain. If Claude Opus is down, we fall back to GPT-4o, then to Gemini Pro. But the fallback chain also respects cost budgets. We won't fall back to a more expensive model unless the task priority warrants it.&lt;/p&gt;
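&lt;p&gt;A budget-aware fallback chain reduces to a short loop. This is a sketch: &lt;code&gt;call_model&lt;/code&gt; and &lt;code&gt;cost_of&lt;/code&gt; are stand-ins for real provider calls and pricing lookups, not our actual implementation.&lt;/p&gt;

```python
# Sketch of a budget-aware cascading fallback. call_model and cost_of
# are hypothetical stand-ins injected by the caller.

def call_with_fallback(chain, prompt, max_cost, cost_of, call_model):
    last_error = None
    for model in chain:
        if cost_of(model) > max_cost:
            continue  # respect the budget even when failing over
        try:
            return model, call_model(model, prompt)
        except Exception as exc:
            last_error = exc  # record and try the next model in the chain
    raise RuntimeError(f"all models failed or exceeded budget: {last_error}")
```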

&lt;p&gt;&lt;strong&gt;Timeout-based failover.&lt;/strong&gt; We don't wait for a provider to return an error. If a response hasn't started streaming within our threshold, we fire the request to the next provider in parallel. First response wins. This adds marginal cost but dramatically improves perceived reliability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graceful degradation.&lt;/strong&gt; Sometimes the right answer isn't "try another premium model." It's "use a smaller model and tell the user the response may be less detailed." We surface model selection transparently so users understand what they're getting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Health checking.&lt;/strong&gt; We maintain a lightweight health check loop against each provider. If a provider starts returning errors or latency spikes above thresholds, we proactively remove it from the routing pool before user requests are affected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Streaming Across Providers
&lt;/h2&gt;

&lt;p&gt;Streaming is table stakes for LLM applications. Nobody wants to stare at a spinner for 30 seconds. But streaming implementations vary wildly across providers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSE vs. WebSocket vs. proprietary.&lt;/strong&gt; OpenAI uses server-sent events. Anthropic uses SSE with a different event structure. Some providers use WebSockets. Your orchestration layer needs to normalize all of these into a single streaming interface for your frontend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We built a unified streaming adapter&lt;/strong&gt; that accepts provider-specific stream formats and emits a consistent event stream to the client. The frontend doesn't know or care which model is responding. It just renders tokens as they arrive.&lt;/p&gt;
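&lt;p&gt;The core of such an adapter is a small normalization function. The two input shapes below are simplified stand-ins for illustration, not the providers' actual wire formats.&lt;/p&gt;

```python
# Minimal sketch of stream-event normalization: provider-specific event
# shapes (simplified here) map to one internal event the frontend renders.

def normalize_event(provider: str, event: dict) -> dict:
    if provider == "openai-style":
        text = event["choices"][0]["delta"].get("content", "")
    elif provider == "anthropic-style":
        text = event.get("delta", {}).get("text", "")
    else:
        raise ValueError(f"unknown provider: {provider}")
    return {"type": "token", "text": text}
```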

&lt;p&gt;&lt;strong&gt;Partial response handling matters.&lt;/strong&gt; If a stream fails mid-response, you need to decide: retry from scratch, continue with a different model, or present what you have. We checkpoint streamed content and can resume with a different provider, passing the partial response as context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tool Calling Is the Wild West
&lt;/h2&gt;

&lt;p&gt;If you're building agents or any system that needs structured outputs, tool calling (function calling) is essential. And every provider does it differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema formats vary.&lt;/strong&gt; OpenAI uses JSON Schema. Anthropic uses a similar but not identical format. Google has its own approach. If you define tools once and expect them to work everywhere, you'll be debugging schema translation bugs for weeks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We maintain a canonical tool schema&lt;/strong&gt; and compile it to provider-specific formats at request time. One source of truth, multiple compilation targets. This also makes it easy to add new providers without touching tool definitions.&lt;/p&gt;
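&lt;p&gt;In miniature, the pattern looks like this. The target shapes below are abbreviated sketches of provider formats, not their exact current specs, and the canonical layout is just one reasonable choice.&lt;/p&gt;

```python
# One canonical tool definition compiled to two (simplified) provider shapes.

CANONICAL = {
    "name": "get_weather",
    "description": "Current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def compile_tool(tool: dict, target: str) -> dict:
    if target == "openai":
        return {"type": "function", "function": tool}
    if target == "anthropic":
        return {
            "name": tool["name"],
            "description": tool["description"],
            "input_schema": tool["parameters"],
        }
    raise ValueError(f"unknown target: {target}")
```

Adding a provider means adding one branch here, with tool definitions themselves untouched.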

&lt;p&gt;&lt;strong&gt;Reliability of tool calling differs.&lt;/strong&gt; Some models are better at consistently producing valid tool calls. Some hallucinate tool names or parameters. We validate every tool call response against the schema before execution and retry with a corrective prompt if validation fails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parallel vs. sequential tool calls.&lt;/strong&gt; Some providers support parallel tool calling. Others don't. Your orchestration layer needs to handle both gracefully, especially when tools have dependencies on each other.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Tell You If You're Starting Today
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Don't abstract too early.&lt;/strong&gt; Start with two models. Get your routing and fallback patterns right with two before you scale to sixteen. The patterns are the same, but debugging is much easier with fewer variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invest in observability from day one.&lt;/strong&gt; Log every request with the model used, tokens consumed, latency, and cost. You can't optimize what you can't measure. We track 64+ audit events per interaction in Knowledge Spaces, and that granularity has saved us countless times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treat model selection as a product feature, not an infrastructure detail.&lt;/strong&gt; Your users care about which model is answering their question. Make it visible. Make it configurable. This is especially true in enterprise and government contexts where model provenance matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build your evaluation pipeline early.&lt;/strong&gt; When you swap models or update routing logic, you need to know if quality changed. Automated evals against your specific use cases are worth more than any public benchmark.&lt;/p&gt;

&lt;p&gt;Running multiple LLMs in production is harder than running one. But for any serious AI platform, it's not optional. The landscape changes too fast, the strengths are too varied, and the risk of single-provider dependency is too high. Build for multi-model from the start. Your future self will thank you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Jamie Thompson is the Founder and CEO of &lt;a href="https://sprinklenet.com" rel="noopener noreferrer"&gt;Sprinklenet AI&lt;/a&gt;, where he builds enterprise AI platforms for government and commercial clients. He writes weekly at &lt;a href="https://newsletter.sprinklenet.com" rel="noopener noreferrer"&gt;newsletter.sprinklenet.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>architecture</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The $100M Question: Should You Embed AI or Go Native?</title>
      <dc:creator>Jamie Thompson</dc:creator>
      <pubDate>Tue, 10 Mar 2026 14:12:03 +0000</pubDate>
      <link>https://dev.to/jamie_thompson/the-100m-question-should-you-embed-ai-or-go-native-2d10</link>
      <guid>https://dev.to/jamie_thompson/the-100m-question-should-you-embed-ai-or-go-native-2d10</guid>
      <description>&lt;p&gt;&lt;em&gt;Jamie Thompson, Founder &amp;amp; CEO, &lt;a href="https://sprinklenet.com" rel="noopener noreferrer"&gt;Sprinklenet AI&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I've had the same conversation six times in the last two months. Different companies, different verticals, different revenue levels. But the same question every time.&lt;/p&gt;

&lt;p&gt;"Do we bolt AI onto what we have, or do we start over?"&lt;/p&gt;

&lt;p&gt;If you run a SaaS company with real customers and real revenue, this is the most consequential decision you'll make in 2026. Get it right and you accelerate. Get it wrong and you spend 18 months building something your customers don't want, while a competitor eats your lunch.&lt;/p&gt;

&lt;p&gt;Here's how I think about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Case for Embedding
&lt;/h2&gt;

&lt;p&gt;You have something most startups would kill for: an installed base. Customers who pay you money every month. Relationships. Data. Workflows that people depend on.&lt;/p&gt;

&lt;p&gt;That is not nothing. That is a beachhead.&lt;/p&gt;

&lt;p&gt;The smart version of the embed strategy looks like this. You decompose your platform into its core functional areas. Each one gets its own AI layer, purpose-built for that domain. Then you put a master agent on top that orchestrates across all of them. Your customers get AI capabilities inside the product they already use, without switching costs, without migration pain, without retraining their teams.&lt;/p&gt;

&lt;p&gt;This is the strategy that preserves revenue. It respects the fact that your customers chose you for a reason. And it lets you move incrementally, shipping value every sprint instead of disappearing into a cave for a year.&lt;/p&gt;

&lt;p&gt;I've seen this work well when the underlying architecture is reasonably modern, when there's a clear API layer, and when the team has the discipline to treat each AI integration as a product decision, not a science experiment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Case for Going Native
&lt;/h2&gt;

&lt;p&gt;Now here's the other side.&lt;/p&gt;

&lt;p&gt;OpenAI, Anthropic, Google, xAI, Meta. These companies have billions of dollars, tens of thousands of engineers, and they are building platforms that overlap with yours. Every single one of them is expanding into adjacent capabilities. The pace of innovation is unlike anything I've seen in 25 years of building software.&lt;/p&gt;

&lt;p&gt;If your product was built in 2015 on a monolithic architecture with years of technical debt baked in, embedding AI into it is like putting a turbocharger on a car with a cracked engine block. You can do it. It will even go faster for a while. But the foundation won't hold.&lt;/p&gt;

&lt;p&gt;Sometimes you need to burn the past.&lt;/p&gt;

&lt;p&gt;Going AI native means designing from the ground up around what foundation models can do today and what they'll be able to do in six months. It means building your product as an orchestration layer, not a feature set. It means accepting that the model will handle 80% of what your engineers used to build manually, and your job is to own the 20% that makes you irreplaceable.&lt;/p&gt;

&lt;p&gt;This path is faster if you have the courage to take it. But it requires abandoning code, processes, and sometimes people that got you to where you are.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Answer
&lt;/h2&gt;

&lt;p&gt;Most companies will do a hybrid. That's fine. In fact, it's probably the right call for the majority.&lt;/p&gt;

&lt;p&gt;But the companies that win will be the ones who are ruthlessly clear with themselves about what's working and what isn't.&lt;/p&gt;

&lt;p&gt;Here's what I mean. You'll start embedding AI into your existing platform. Some of those integrations will land beautifully. Customers will love them. Usage will spike. Revenue will follow. Keep those. Double down.&lt;/p&gt;

&lt;p&gt;Other integrations will feel forced. They'll require so much scaffolding and workaround code that your engineers spend more time fighting the legacy architecture than building AI features. Those are the ones you need to cut.&lt;/p&gt;

&lt;p&gt;And this is where most companies fail. They hang on too long. They keep pouring resources into the embed strategy for a module that should have been rebuilt from scratch three months ago. They do it out of loyalty to the team that built it, or out of fear of writing off the sunk cost, or because the VP who owns that module has a loud voice in the leadership meeting.&lt;/p&gt;

&lt;p&gt;Stop it.&lt;/p&gt;

&lt;p&gt;The hybrid strategy only works if you're willing to be brutal about which parts get the embed treatment and which parts get rebuilt natively. If you try to embed everywhere, you'll move too slowly. If you try to go native everywhere, you'll break too much. The skill is in making the cut correctly and making it fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Matters
&lt;/h2&gt;

&lt;p&gt;Here's the thing nobody wants to hear. The window for differentiation is closing.&lt;/p&gt;

&lt;p&gt;The foundation models are getting better every quarter. Capabilities that were your competitive advantage six months ago are now available through an API call. The models will keep improving. You cannot build a moat on the AI itself.&lt;/p&gt;

&lt;p&gt;So what do you build a moat on?&lt;/p&gt;

&lt;p&gt;Three things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Domain expertise.&lt;/strong&gt; You know your customer's workflow better than OpenAI does. You know the edge cases, the regulatory requirements, the integration points with their other systems. That knowledge is your moat, but only if you encode it into your product fast enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proprietary data.&lt;/strong&gt; If your platform generates or captures data that nobody else has, that's gold. But only if you're using it to fine-tune, to build retrieval systems, to create feedback loops that make your AI better than the generic version. If you're sitting on data and not weaponizing it, someone else will find a way to replicate it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Integration depth.&lt;/strong&gt; The company that is wired deepest into a customer's operations is the hardest to rip out. Every API connection, every workflow automation, every data pipeline is a thread that binds you to your customer. Build more threads.&lt;/p&gt;

&lt;p&gt;Speed is the meta strategy. Whoever moves fastest to deliver real, tangible, unique value on top of the major AI models wins. Not whoever has the best pitch deck. Not whoever raises the most money. Whoever ships.&lt;/p&gt;

&lt;h2&gt;
  
  
  One More Thing
&lt;/h2&gt;

&lt;p&gt;This is exactly the kind of strategic question I help companies answer. At Sprinklenet, I serve as a &lt;a href="https://sprinklenet.com/ai-strategy/" rel="noopener noreferrer"&gt;fractional Chief AI Officer&lt;/a&gt; for companies navigating these decisions. Not with a 200 page consulting report that sits on a shelf, but by working alongside your leadership team to make these calls in real time, with skin in the game.&lt;/p&gt;

&lt;p&gt;If you're a CEO wrestling with this question, I'm always happy to compare notes.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Jamie Thompson is the Founder and CEO of &lt;a href="https://sprinklenet.com" rel="noopener noreferrer"&gt;Sprinklenet AI&lt;/a&gt;, where he builds enterprise AI platforms for government and commercial clients. He writes weekly at &lt;a href="https://newsletter.sprinklenet.com" rel="noopener noreferrer"&gt;newsletter.sprinklenet.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>startup</category>
      <category>strategy</category>
      <category>saas</category>
    </item>
    <item>
      <title>What Is RAG? A Practitioner's Guide to Retrieval-Augmented Generation</title>
      <dc:creator>Jamie Thompson</dc:creator>
      <pubDate>Tue, 10 Mar 2026 14:12:03 +0000</pubDate>
      <link>https://dev.to/jamie_thompson/what-is-rag-a-practitioners-guide-to-retrieval-augmented-generation-3fo8</link>
      <guid>https://dev.to/jamie_thompson/what-is-rag-a-practitioners-guide-to-retrieval-augmented-generation-3fo8</guid>
      <description>&lt;p&gt;If you have spent any time building with large language models over the past two years, you have almost certainly encountered the term RAG. Retrieval-Augmented Generation has become one of the most important architectural patterns in applied AI, and for good reason. It solves a fundamental problem that every team hits the moment they try to make LLMs useful in production: the model does not know your data.&lt;/p&gt;

&lt;p&gt;I have been building RAG systems since before the term was trendy. At Sprinklenet, our flagship platform &lt;a href="https://sprinklenet.com/knowledge-spaces/" rel="noopener noreferrer"&gt;Knowledge Spaces&lt;/a&gt; is a multi-LLM RAG system that serves enterprise and government clients across sensitive, high-stakes environments. What follows is not theory. It is what we have learned building, deploying, and operating these systems in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  What RAG Actually Is
&lt;/h2&gt;

&lt;p&gt;RAG is a design pattern where you augment an LLM's generation capabilities by first retrieving relevant context from an external knowledge base. Instead of relying solely on what the model learned during training, you give it fresh, specific, verified information at inference time.&lt;/p&gt;

&lt;p&gt;The concept is straightforward. A user asks a question. Your system searches a curated knowledge base for the most relevant documents or passages. Those passages get injected into the prompt as context. The LLM then generates a response grounded in that retrieved information.&lt;/p&gt;

&lt;p&gt;This is fundamentally different from fine-tuning. Fine-tuning changes the model's weights. RAG changes the model's context window. That distinction matters enormously in practice because it means you can update your knowledge base without retraining anything, you can control exactly what information the model has access to, and you can cite specific sources in every response.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why RAG Matters
&lt;/h2&gt;

&lt;p&gt;LLMs are remarkable at language understanding, reasoning, and generation. They are terrible at knowing facts about your organization, your documents, your policies, or anything that happened after their training cutoff.&lt;/p&gt;

&lt;p&gt;Without RAG, you are stuck with a model that confidently generates plausible answers that may be completely wrong. In enterprise settings, that is not just annoying. It is dangerous. An analyst acting on hallucinated intelligence, a contractor citing a regulation that does not exist, a compliance officer relying on outdated policy guidance: these are real failure modes with real consequences.&lt;/p&gt;

&lt;p&gt;RAG addresses this by grounding every response in retrievable, verifiable source material. When done well, the system can tell you not just what it thinks, but exactly which documents it consulted and which passages informed its answer. That traceability is what makes RAG production-ready for serious applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  The RAG Architecture Stack
&lt;/h2&gt;

&lt;p&gt;Every RAG system has three core phases: embedding, retrieval, and generation. Getting each one right matters, and the interactions between them matter even more.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Embedding and Ingestion
&lt;/h3&gt;

&lt;p&gt;Before you can retrieve anything, you need to transform your documents into a format that supports semantic search. This means converting text into vector embeddings, which are dense numerical representations that capture meaning rather than just keywords.&lt;/p&gt;

&lt;p&gt;The ingestion pipeline typically looks like this: documents come in through connectors (file uploads, API integrations, database pulls), get parsed and cleaned, get split into chunks, and then get embedded and stored in a vector database.&lt;/p&gt;

&lt;p&gt;Chunking strategy is one of the most consequential decisions you will make. Chunk too large and your retrieval loses precision. Too small and you lose context. In our experience building Knowledge Spaces, we have found that the optimal chunk size varies significantly by use case. Dense regulatory text (like the Federal Acquisition Regulation, which powers our &lt;a href="https://sprinklenet.com/farbot/" rel="noopener noreferrer"&gt;FARbot&lt;/a&gt; product) benefits from smaller, paragraph-level chunks with overlapping windows. Narrative documents like reports and memos work better with larger chunks that preserve the author's reasoning flow.&lt;/p&gt;

&lt;p&gt;Overlap between chunks is critical and often overlooked. If your chunks do not overlap, you will inevitably split important information across chunk boundaries, and the retrieval system will miss it. We typically use 10 to 20 percent overlap, though this is tunable.&lt;/p&gt;
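&lt;p&gt;A minimal sketch of overlapping chunking, splitting on words for clarity; production pipelines typically split on tokens or sentence boundaries instead.&lt;/p&gt;

```python
# Word-based chunking with overlap. Default sizes are illustrative;
# real systems tune these per corpus, as discussed above.

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 30):
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        # each chunk repeats the last `overlap` words of its predecessor,
        # so information at a boundary appears whole in at least one chunk
        chunks.append(" ".join(words[start:start + chunk_size]))
    return chunks
```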

&lt;h3&gt;
  
  
  Phase 2: Retrieval
&lt;/h3&gt;

&lt;p&gt;Retrieval is where your system searches the vector database for chunks that are semantically similar to the user's query. The user's question gets embedded using the same model that embedded your documents, and then a similarity search (typically cosine similarity or dot product) finds the closest matches.&lt;/p&gt;

&lt;p&gt;This sounds simple, but there are several layers of complexity in production systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid search&lt;/strong&gt; combines vector similarity with traditional keyword matching. Pure semantic search can miss exact terms that matter (like specific regulation numbers or product names), while pure keyword search misses conceptual relevance. The best production systems use both and merge the results.&lt;/p&gt;
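&lt;p&gt;One common way to merge the two result lists is reciprocal rank fusion. This is a generic sketch of the technique, not the merging logic of any particular product:&lt;/p&gt;

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists: each document scores sum(1 / (k + rank))
    across the lists it appears in, so items ranked well by both the
    vector search and the keyword search rise to the top."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc3", "doc1", "doc7"]    # semantic similarity order
keyword_hits = ["doc1", "doc9", "doc3"]   # keyword/BM25 order
merged = reciprocal_rank_fusion([vector_hits, keyword_hits])
# doc1 and doc3 lead because both retrievers surfaced them
```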

&lt;p&gt;&lt;strong&gt;Metadata filtering&lt;/strong&gt; lets you scope retrieval to specific document sets, time ranges, access levels, or categories before the similarity search even runs. In multi-tenant systems like Knowledge Spaces, this is essential. You cannot have one client's documents leaking into another client's results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reranking&lt;/strong&gt; takes the initial retrieval results and applies a second, more computationally expensive model to reorder them by relevance. The initial vector search is fast but approximate. A cross-encoder reranker is slower but significantly more accurate. In practice, you retrieve a larger candidate set (say 20 to 50 chunks) and then rerank down to the top 5 to 10 that actually get passed to the LLM.&lt;/p&gt;
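&lt;p&gt;The retrieve-then-rerank flow can be sketched like this, with a toy word-overlap scorer standing in for a real cross-encoder model:&lt;/p&gt;

```python
def rerank(query: str, candidates: list[str], score_fn, top_n: int = 5) -> list[str]:
    """Re-order a larger candidate set with a slower, more accurate scorer,
    keeping only the top_n chunks that get passed to the LLM."""
    scored = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:top_n]

def overlap_score(query: str, chunk: str) -> int:
    # Illustration only: a production system would call a cross-encoder here.
    return len(set(query.lower().split()) & set(chunk.lower().split()))

candidates = [
    "Chunk overlap prevents information loss at boundaries.",
    "Invoices must be submitted within 30 days.",
    "Retrieval quality depends on chunking strategy.",
]
top = rerank("how does chunking affect retrieval quality", candidates, overlap_score, top_n=2)
```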

&lt;h3&gt;
  
  
  Phase 3: Generation
&lt;/h3&gt;

&lt;p&gt;The retrieved chunks get assembled into a prompt alongside the user's question and any system instructions. The LLM then generates a response grounded in that context.&lt;/p&gt;

&lt;p&gt;Prompt engineering for RAG is its own discipline. You need to instruct the model to use the provided context, cite its sources, and clearly indicate when it does not have enough information to answer. You also need to handle the case where the retrieved context is irrelevant to the question, because the retrieval system will always return something, even if nothing in the knowledge base actually answers the query.&lt;/p&gt;
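&lt;p&gt;A minimal example of assembling a grounded prompt along these lines. The wording and field names are illustrative, not a canonical template:&lt;/p&gt;

```python
def build_rag_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble retrieved chunks plus grounding instructions into one prompt.
    Each chunk carries its source so the model can cite it."""
    context = "\n\n".join(
        f"[Source: {c['source']}]\n{c['text']}" for c in chunks
    )
    return (
        "Answer the question using ONLY the context below. "
        "Cite the source of each claim. If the context does not "
        "contain the answer, say you don't have enough information.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "What is the overlap requirement?",
    [{"source": "policy.pdf#p3", "text": "Chunks overlap by 10 to 20 percent."}],
)
```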

&lt;p&gt;Source attribution is non-negotiable in production RAG. Every claim in the response should trace back to a specific chunk from a specific document. This is what separates a useful enterprise tool from a liability. In Knowledge Spaces, we log retrieval results alongside every generated response so that administrators can audit not just what the system said, but what it consulted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vector Databases: Choosing Your Foundation
&lt;/h2&gt;

&lt;p&gt;The vector database is the backbone of your RAG system. It stores your embeddings and handles similarity search at scale. The major options include Pinecone, Weaviate, Qdrant, Milvus, Chroma, and pgvector (for teams already running PostgreSQL).&lt;/p&gt;

&lt;p&gt;We use Pinecone for Knowledge Spaces, and it has served us well at scale. But the choice depends on your constraints. If you need on-premises deployment for security requirements, Qdrant or Milvus give you that control. If you want to minimize infrastructure, Pinecone's managed service is hard to beat. If you are prototyping and want minimal setup, Chroma works fine locally but think carefully before taking it to production.&lt;/p&gt;

&lt;p&gt;Key factors to evaluate: query latency at your expected scale, filtering capabilities (metadata filtering performance varies dramatically between solutions), managed versus self-hosted options, and cost at your projected data volume.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Pitfalls and How to Avoid Them
&lt;/h2&gt;

&lt;p&gt;After building RAG systems for several years, I have seen the same mistakes repeatedly. Here are the ones that cause the most pain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ignoring chunk quality.&lt;/strong&gt; Garbage in, garbage out applies doubly to RAG. If your ingestion pipeline produces poorly parsed, badly chunked documents, no amount of retrieval sophistication will save you. Invest heavily in document parsing and chunk quality. Parse tables correctly. Handle headers and footers. Strip boilerplate. This unsexy work is often the difference between a system that works and one that hallucinates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skipping evaluation.&lt;/strong&gt; Most teams build a RAG pipeline, try a few queries manually, and call it done. You need systematic evaluation: a test set of questions with known correct answers, automated retrieval quality metrics (precision, recall, MRR), and end-to-end answer quality assessment. Without this, you are flying blind every time you change a parameter.&lt;/p&gt;
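&lt;p&gt;As one concrete example, mean reciprocal rank (MRR) over a small test set can be computed like this:&lt;/p&gt;

```python
def mean_reciprocal_rank(results: list[list[str]], relevant: list[str]) -> float:
    """MRR: for each query, 1/rank of the first relevant chunk in the
    retrieved list (0 if it never appears), averaged over all queries."""
    total = 0.0
    for retrieved, gold in zip(results, relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id == gold:
                total += 1.0 / rank
                break
    return total / len(results)

# Three test queries; the known-correct chunk appears at rank 1, rank 2, and not at all.
retrieved = [["a", "b"], ["c", "a"], ["b", "c"]]
gold = ["a", "a", "a"]
mrr = mean_reciprocal_rank(retrieved, gold)  # (1 + 0.5 + 0) / 3 = 0.5
```

Tracking this number across pipeline changes is what turns "flying blind" into an actual feedback loop.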

&lt;p&gt;&lt;strong&gt;Overloading the context window.&lt;/strong&gt; Retrieving too many chunks and stuffing them all into the prompt is counterproductive. LLMs have finite attention. Research consistently shows that models perform worse when given excessive context, particularly in the middle of long prompts (the "lost in the middle" phenomenon). Be selective. Five highly relevant chunks will outperform twenty mediocre ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Neglecting access control.&lt;/strong&gt; In any multi-user or multi-tenant system, retrieval must respect authorization boundaries. A user should never receive information from documents they do not have permission to access. This sounds obvious, but implementing it correctly requires thinking about access control at the vector database level, not just at the application layer. In Knowledge Spaces, we enforce role-based access control with a four-tier hierarchy and 64+ auditable event types precisely because this is a hard problem that demands rigorous engineering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treating RAG as a one-time build.&lt;/strong&gt; A production RAG system is a living system. Documents change. New sources get added. Embedding models improve. User needs evolve. You need operational infrastructure for re-ingestion, monitoring retrieval quality over time, and updating your pipeline as the underlying models and data shift.&lt;/p&gt;

&lt;h2&gt;
  
  
  When RAG Is Not the Answer
&lt;/h2&gt;

&lt;p&gt;RAG is powerful, but it is not the right pattern for every problem. If your task requires real-time computation, complex multi-step reasoning over structured data, or actions in external systems, you likely need agentic architectures, tool use, or traditional software engineering rather than (or in addition to) retrieval-augmented generation.&lt;/p&gt;

&lt;p&gt;RAG excels when the core task is: "Answer questions using information from a specific, curated knowledge base." The further you drift from that pattern, the more you should consider other approaches.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Path Forward
&lt;/h2&gt;

&lt;p&gt;RAG is maturing rapidly. The next generation of production systems will incorporate more sophisticated retrieval strategies (graph-based retrieval, hypothetical document embeddings, multi-hop reasoning), tighter integration with structured data sources, and better evaluation frameworks.&lt;/p&gt;

&lt;p&gt;But the fundamentals remain the same. Ingest your data carefully. Retrieve with precision. Generate with grounding. Audit everything. If you get those four things right, you are ahead of most teams building in this space.&lt;/p&gt;

&lt;p&gt;At Sprinklenet, we have distilled these lessons into Knowledge Spaces, a platform that handles multi-LLM orchestration across 16+ foundation models, enterprise-grade access control, and comprehensive audit logging out of the box. We built it because we got tired of solving the same hard infrastructure problems on every engagement. If you are building RAG systems seriously, whether for government or commercial use, the infrastructure layer matters as much as the AI layer.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Jamie Thompson is the Founder and CEO of &lt;a href="https://sprinklenet.com" rel="noopener noreferrer"&gt;Sprinklenet AI&lt;/a&gt;, where he builds enterprise AI platforms for government and commercial clients. He writes weekly at &lt;a href="https://newsletter.sprinklenet.com" rel="noopener noreferrer"&gt;newsletter.sprinklenet.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The Gold Is in the Basement: Why Your AI Strategy Should Start With the Data You Already Have</title>
      <dc:creator>Jamie Thompson</dc:creator>
      <pubDate>Tue, 10 Mar 2026 14:12:03 +0000</pubDate>
      <link>https://dev.to/jamie_thompson/the-gold-is-in-the-basement-why-your-ai-strategy-should-start-with-the-data-you-already-have-532c</link>
      <guid>https://dev.to/jamie_thompson/the-gold-is-in-the-basement-why-your-ai-strategy-should-start-with-the-data-you-already-have-532c</guid>
      <description>&lt;p&gt;&lt;em&gt;Most organizations chasing AI transformation are looking in the wrong direction. The highest-value data isn't in the shiny new tool , it's buried in the systems you've been running for years.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Every few months, I sit across from a technical leader who tells me some version of the same story: "We've got GPT wired up for internal chat, but nobody's using it." Or: "We built a chatbot, but it just makes things up." Or my personal favorite: "We tried RAG, but the results were garbage."&lt;/p&gt;

&lt;p&gt;And almost every time, the problem isn't the model. It's the plumbing.&lt;/p&gt;

&lt;p&gt;I run a company called &lt;a href="https://sprinklenet.com" rel="noopener noreferrer"&gt;Sprinklenet AI&lt;/a&gt;, where we build and deploy multi-LLM platforms (primarily &lt;a href="https://sprinklenet.com/knowledge-spaces/" rel="noopener noreferrer"&gt;Knowledge Spaces&lt;/a&gt;, our RAG-based system) for government agencies and enterprise clients. Over the past two years, I've watched the industry fixate on model selection (GPT-4 vs. Claude vs. Gemini) while almost completely ignoring the far harder, far more valuable problem: getting AI systems reliably connected to the raw operational data that actually drives decisions.&lt;/p&gt;

&lt;p&gt;The gold isn't in the model. It's in the basement: in the ERP logs, the CRM records, the SharePoint graveyards, the PostgreSQL tables that nobody has touched since 2019. And the organizations that figure out how to connect AI to that data, securely and at scale, are the ones that will win.&lt;/p&gt;

&lt;h2&gt;
  
  
  The RAG Gap Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Retrieval-Augmented Generation has become the default architecture for enterprise AI, and for good reason. Instead of fine-tuning a model on your data (expensive, brittle, and a governance nightmare), you retrieve relevant documents at query time and inject them into the prompt context. The model generates answers grounded in your actual information.&lt;/p&gt;

&lt;p&gt;In theory, this is elegant. In practice, most RAG implementations fail at the retrieval step, not the generation step.&lt;/p&gt;

&lt;p&gt;Here's what I mean. A typical proof-of-concept goes like this: someone uploads a few PDFs into a vector store, builds a simple semantic search pipeline, and demos it to leadership. "Look, it can answer questions about our procurement policy!" Everyone applauds. Budget gets approved.&lt;/p&gt;

&lt;p&gt;Then reality hits. The production system needs to pull from Salesforce, a legacy SQL database, an internal wiki, and six different file shares with overlapping and contradictory versions of the same document. The PDFs that worked great in the demo turn out to represent maybe 3% of the organization's actual knowledge. The other 97% lives in structured databases, transactional systems, and formats that don't neatly convert to text chunks.&lt;/p&gt;

&lt;p&gt;This is the RAG gap: the distance between "we can do semantic search on a folder of documents" and "we can give our people AI-powered access to everything they need to make decisions." It's enormous, and closing it is mostly an engineering problem, not an AI problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Data Connectors Are the Real Moat
&lt;/h2&gt;

&lt;p&gt;When we architect Knowledge Spaces deployments, we spend roughly 60% of our integration time on data connectors: the unglamorous middleware that pulls information from source systems, normalizes it, chunks it appropriately, generates vector embeddings, and keeps everything in sync.&lt;/p&gt;

&lt;p&gt;We've built connectors for more than 15 different source systems: Salesforce, PostgreSQL, REST APIs with OAuth flows, file systems, cloud storage. Each one has its own authentication model, rate limits, data schema, and update patterns. And each one requires a different chunking strategy to produce embeddings that actually return relevant results during retrieval.&lt;/p&gt;

&lt;p&gt;This is the part that doesn't make it into conference talks. Nobody gives a keynote about spending three weeks tuning chunk sizes for a PostgreSQL connector so that semantic search over transactional records returns meaningful results. But that's the work that separates a demo from a system people actually rely on.&lt;/p&gt;

&lt;p&gt;A few hard-won lessons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chunk size matters more than model choice.&lt;/strong&gt; I've seen teams agonize over whether to use GPT-4o or Claude 3.5 while their retrieval pipeline is returning irrelevant context because they're splitting documents at arbitrary 500-token boundaries. For structured data from relational databases, we typically chunk by logical record (one row or one transaction per chunk) with schema metadata preserved. For long-form documents, overlapping chunks of 800-1200 tokens with section-header context prepended tend to outperform naive splitting. The right strategy depends entirely on how your users actually query the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embeddings are not one-size-fits-all.&lt;/strong&gt; Different embedding models perform differently depending on the domain and the nature of the queries. We've found that for government and defense use cases, where terminology is highly specific and acronym-dense, general-purpose embedding models underperform unless you prepend definitional context to chunks. Running a small evaluation set before committing to an embedding strategy saves weeks of debugging bad retrieval later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Freshness is a first-class concern.&lt;/strong&gt; Static RAG (upload once, query forever) works for reference documents. It falls apart for operational data. If your sales team is asking the AI about pipeline status and your Salesforce connector last synced three days ago, trust evaporates immediately. We run incremental sync jobs on configurable schedules (hourly for transactional data, daily for documents) and surface last-sync timestamps in the UI so users know what they're working with.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security Isn't a Feature; It's the Foundation
&lt;/h2&gt;

&lt;p&gt;Here's where enterprise RAG diverges most sharply from the open-source tutorials.&lt;/p&gt;

&lt;p&gt;In a real deployment, especially in government, you can't just dump all your documents into a single vector store and let everyone query everything. That's a data spill waiting to happen. The AI system has to respect the same access controls that govern the source systems.&lt;/p&gt;

&lt;p&gt;In Knowledge Spaces, we implement this through a four-tier RBAC hierarchy (Organization Owner, Admin, Contributor, Viewer) that controls not just who can query, but what data each query can retrieve against. When a user asks a question, the retrieval step filters the vector search results by that user's permissions before anything reaches the LLM. The model never sees data the user isn't authorized to access.&lt;/p&gt;

&lt;p&gt;We also enforce SAML 2.0 SSO and support CAC/PKI authentication for defense clients, because if your AI platform has a separate login from everything else, your security team will (rightly) shut it down.&lt;/p&gt;

&lt;p&gt;And then there's audit logging. We capture 64+ event types: every query, every retrieval, every model invocation, every document access. Not because we love logging, but because our government clients need to answer the question: "Who asked what, and what data informed the answer?" If you can't answer that question, you don't have a governed AI system. You have a liability.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Multi-LLM Reality
&lt;/h2&gt;

&lt;p&gt;One more pattern I want to surface, because I think it's underappreciated: in production, you almost certainly need more than one model.&lt;/p&gt;

&lt;p&gt;We currently orchestrate across models from OpenAI, Anthropic, Google, Groq, and xAI: 16+ foundation models with support for tool calling, streaming, and structured JSON output. Different models excel at different tasks. Some are better at precise factual extraction. Others handle nuanced summarization more gracefully. Some are fast and cheap enough for high-volume classification tasks. Others are worth the latency for complex analytical queries.&lt;/p&gt;

&lt;p&gt;The point isn't to have options for the sake of options. It's that when you're connecting AI to diverse enterprise data sources, the queries that hit your system are diverse too. A procurement analyst asking "What were the top three cost overruns on Program X last quarter?" needs a different model behavior than a policy researcher asking "How does this draft regulation compare to FAR Part 15?" Routing queries to the right model, and having guardrails that catch PII leakage, prompt injection, and off-topic responses regardless of which model is active, is table stakes for production deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start With the Basement
&lt;/h2&gt;

&lt;p&gt;If I could give one piece of advice to a technical leader starting an enterprise AI initiative, it would be this: before you evaluate a single model, before you pick a vector database, before you write a line of prompt engineering, go inventory your data.&lt;/p&gt;

&lt;p&gt;Map every system that holds information your people need to make decisions. Understand the access controls on each one. Document the update frequency. Figure out what's structured versus unstructured. Identify which sources overlap, which contradict each other, and which are authoritative.&lt;/p&gt;

&lt;p&gt;Then build your RAG architecture around that map. Let the data topology drive the system design, not the other way around.&lt;/p&gt;

&lt;p&gt;The organizations that get this right don't just get a better chatbot. They get something much more valuable: a single, governed, intelligent interface to their institutional knowledge. An interface that respects security boundaries, stays current with source systems, and gets smarter as more data flows through it.&lt;/p&gt;

&lt;p&gt;The gold has been in the basement all along. You just need to build the stairs.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Jamie Thompson is the Founder and CEO of &lt;a href="https://sprinklenet.com" rel="noopener noreferrer"&gt;Sprinklenet AI&lt;/a&gt;, where he builds enterprise AI platforms for government and commercial clients. He writes weekly at &lt;a href="https://newsletter.sprinklenet.com" rel="noopener noreferrer"&gt;newsletter.sprinklenet.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>enterprise</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Why Your AI Chatbot Fails (And How to Fix It with RAG)</title>
      <dc:creator>Jamie Thompson</dc:creator>
      <pubDate>Tue, 10 Mar 2026 14:12:03 +0000</pubDate>
      <link>https://dev.to/jamie_thompson/why-your-ai-chatbot-fails-and-how-to-fix-it-with-rag-1771</link>
      <guid>https://dev.to/jamie_thompson/why-your-ai-chatbot-fails-and-how-to-fix-it-with-rag-1771</guid>
      <description>&lt;p&gt;You shipped an AI chatbot. Users tried it. And now you're dealing with some combination of these complaints:&lt;/p&gt;

&lt;p&gt;"It made something up and it sounded completely confident."&lt;/p&gt;

&lt;p&gt;"It gave me information that's six months out of date."&lt;/p&gt;

&lt;p&gt;"How do I know if this answer is actually correct?"&lt;/p&gt;

&lt;p&gt;"Wait, can everyone see our internal documents through this thing?"&lt;/p&gt;

&lt;p&gt;These aren't edge cases. They're the predictable failure modes of any chatbot built on a raw LLM without proper retrieval architecture. And they're all fixable.&lt;/p&gt;

&lt;p&gt;Let me walk through each failure mode and the architectural pattern that addresses it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure 1: Hallucination
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem.&lt;/strong&gt; LLMs generate plausible text. That's what they're trained to do. When they don't have relevant information, they don't say "I don't know." They construct an answer that sounds authoritative and is completely fabricated.&lt;/p&gt;

&lt;p&gt;This isn't a bug. It's the fundamental architecture of language models. They're optimizing for coherent next-token prediction, not factual accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix: Ground every response in retrieved documents.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Retrieval-Augmented Generation works by searching a knowledge base for relevant documents before generating a response. The LLM receives those documents as context and generates an answer based on what it found, not what it "knows" from training.&lt;/p&gt;

&lt;p&gt;The key architectural decision: constrain the model's response to the retrieved content. Your system prompt should explicitly instruct the model to only answer based on the provided context and to say "I don't have information on that" when the retrieved documents don't cover the question.&lt;/p&gt;

&lt;p&gt;This doesn't eliminate hallucination entirely. Models can still misinterpret retrieved content or over-extrapolate. But it reduces the failure rate dramatically because the model is working from specific source material rather than its parametric memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What this looks like in practice.&lt;/strong&gt; We built &lt;a href="https://sprinklenet.com/farbot/" rel="noopener noreferrer"&gt;FARbot&lt;/a&gt;, a free chatbot for searching the Federal Acquisition Regulation, using this exact pattern. The FAR is a massive, complex regulatory document. Getting an answer wrong isn't just unhelpful, it could lead to compliance violations. Every FARbot response is grounded in specific FAR sections that were retrieved based on the user's question.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure 2: Stale and Outdated Information
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem.&lt;/strong&gt; LLMs have a training cutoff. GPT-4's knowledge ends at some point in the past. Claude's does too. If your chatbot is answering questions about current policies, recent product changes, or evolving regulations, the base model's knowledge is already wrong.&lt;/p&gt;

&lt;p&gt;Fine-tuning helps marginally, but it's expensive, slow, and still creates a new cutoff date.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix: Decouple knowledge from the model.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;RAG separates the knowledge layer from the reasoning layer. Your documents live in a vector database, indexed and searchable. When those documents change, you re-index them. The model's training data becomes irrelevant for domain-specific questions because it's always reading from your current document set.&lt;/p&gt;

&lt;p&gt;The practical architecture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ingest pipeline.&lt;/strong&gt; Documents are chunked, embedded, and stored in a vector database. We use Pinecone for production workloads, but Qdrant, Weaviate, and pgvector all work depending on your requirements.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Incremental updates.&lt;/strong&gt; When documents change, you re-process only the changed documents. You don't need to re-embed your entire corpus.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Metadata timestamps.&lt;/strong&gt; Attach last-updated timestamps to your chunks. Surface these in responses so users know how current the information is.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
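&lt;p&gt;The incremental-update step can be as simple as comparing content hashes between the source system and the index; a sketch under that assumption (the hashing scheme and document IDs are illustrative):&lt;/p&gt;

```python
import hashlib

def changed_docs(docs: dict[str, str], index_hashes: dict[str, str]) -> list[str]:
    """Return IDs of documents whose content hash differs from what is
    indexed, so only those get re-chunked and re-embedded."""
    stale = []
    for doc_id, text in docs.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        if index_hashes.get(doc_id) != digest:
            stale.append(doc_id)
    return stale

docs = {"far-15": "Part 15 text v2", "far-12": "Part 12 text"}
indexed = {"far-15": hashlib.sha256(b"Part 15 text v1").hexdigest(),
           "far-12": hashlib.sha256(b"Part 12 text").hexdigest()}
to_reingest = changed_docs(docs, indexed)  # only the updated section
```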

&lt;p&gt;In FARbot, when FAR clauses are updated, we re-ingest the affected sections. Users always get answers based on the current regulation, not whatever version existed when the underlying model was trained.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure 3: No Source Citations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem.&lt;/strong&gt; Your chatbot gives an answer. The user asks, "Where did you get that?" And there's no good response.&lt;/p&gt;

&lt;p&gt;For internal tools, this erodes trust. For anything customer-facing, it's a liability. For government and regulated industries, it's a disqualifier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix: Track and surface retrieval provenance.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every RAG response should include the source documents that informed it. Not as an afterthought, but as a core feature of the response architecture.&lt;/p&gt;

&lt;p&gt;This requires:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chunk-level attribution.&lt;/strong&gt; When the retrieval system returns relevant passages, maintain references to the original documents, sections, and page numbers. Pass these through the generation step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source panel in the UI.&lt;/strong&gt; Don't bury citations in footnotes. Give them a dedicated, prominent place in the interface. Users should be able to click through to the original document.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval logs.&lt;/strong&gt; Log which documents were retrieved for each query, their relevance scores, and which ones the model actually referenced in its response. This is invaluable for debugging answer quality and for audit purposes.&lt;/p&gt;

&lt;p&gt;FARbot implements all three. Every answer includes a source panel showing the specific FAR sections that were retrieved. Users can see exactly which regulatory text informed the response. The retrieval logs record every search, so we can analyze patterns and improve retrieval quality over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure 4: No Access Control
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem.&lt;/strong&gt; You built a chatbot over your company's internal documents. Sales proposals, HR policies, financial reports, engineering specs. A summer intern asks it a question, and the RAG system happily retrieves from the CFO's confidential board presentation.&lt;/p&gt;

&lt;p&gt;The LLM doesn't understand permissions. It doesn't know that Document A is for executives only. It retrieved the most semantically relevant chunks, regardless of who was asking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix: Permission-aware retrieval.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Access control in RAG must happen at the retrieval layer, before the LLM ever sees the documents. This means:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-document (or per-collection) permissions.&lt;/strong&gt; When documents are ingested, they're tagged with access control metadata. User roles, groups, classification levels, whatever your permission model requires.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Filtered vector search.&lt;/strong&gt; When a user submits a query, the vector search includes a permission filter. The query is: "Find the most relevant documents that this specific user is authorized to access." Not: "Find the most relevant documents, and we'll filter the display later."&lt;/p&gt;

&lt;p&gt;That distinction matters. If you filter after retrieval, the LLM has already seen unauthorized content and may reference it in the response even if you strip the citations. The filter must happen before content reaches the model.&lt;/p&gt;
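&lt;p&gt;The filter-before-ranking distinction can be sketched with a toy in-memory index; real systems push the same filter into the vector store's metadata query rather than scanning in application code:&lt;/p&gt;

```python
def permitted_search(query_vec, index, user_groups: set[str], top_k: int = 5):
    """Similarity search restricted to chunks the user may access.
    The permission filter runs as part of the search itself, so
    unauthorized content never reaches the model."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    visible = [
        (vec, meta) for vec, meta in index
        if meta["allowed_groups"] & user_groups  # filter BEFORE ranking
    ]
    visible.sort(key=lambda item: dot(query_vec, item[0]), reverse=True)
    return [meta["text"] for _, meta in visible[:top_k]]

index = [
    ([1.0, 0.0], {"text": "Board deck figures", "allowed_groups": {"exec"}}),
    ([0.9, 0.1], {"text": "HR leave policy", "allowed_groups": {"all-staff"}}),
]
# A staff user never sees the executive-only chunk, even though it
# is the most semantically relevant match for this query vector.
context = permitted_search([1.0, 0.0], index, user_groups={"all-staff"})
```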

&lt;p&gt;&lt;strong&gt;Hierarchical access models.&lt;/strong&gt; Most organizations need more than flat roles. Team-level access, project-level access, classification-level access. These all need to compose correctly with your retrieval filters.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://sprinklenet.com/knowledge-spaces/" rel="noopener noreferrer"&gt;Knowledge Spaces&lt;/a&gt;, we implement this with per-collection RBAC that's enforced at the vector query layer. An analyst in Division A only retrieves from collections they're authorized to access. The model never sees content from Division B's restricted collections. This is also why we log 64+ audit events per interaction. When you're handling sensitive data, you need to prove that access controls are working, not just assert it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture That Ties It All Together
&lt;/h2&gt;

&lt;p&gt;These four fixes aren't independent features you bolt on. They're layers of a coherent retrieval architecture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ingestion layer.&lt;/strong&gt; Documents are chunked, embedded, tagged with metadata (source, timestamps, permissions), and stored in a vector database with the right connectors for your data sources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Retrieval layer.&lt;/strong&gt; Queries are embedded, permission filters are applied, and semantically relevant chunks are retrieved with full provenance tracking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Generation layer.&lt;/strong&gt; The LLM receives retrieved context with explicit instructions to ground responses in that context, cite sources, and acknowledge gaps.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Presentation layer.&lt;/strong&gt; Responses are displayed with source citations, confidence indicators, and links to original documents.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Observability layer.&lt;/strong&gt; Every step is logged. Retrieval scores, model selection, token usage, permission evaluations, response latency. All queryable, all auditable.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When people ask why their chatbot fails, the answer is almost always that they skipped one or more of these layers. They went straight from "user question" to "LLM response" and hoped for the best.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;If you have a failing chatbot, you don't need to rebuild from scratch. Start with the retrieval layer. Add a vector database, index your documents, and inject retrieved context into your existing prompts. That alone fixes the hallucination and staleness problems.&lt;/p&gt;

&lt;p&gt;Then add source tracking. Then add permission filters. Each layer compounds the reliability of the one before it.&lt;/p&gt;

&lt;p&gt;RAG isn't magic. It's plumbing. But it's the plumbing that makes the difference between a demo that impresses people in a meeting and a product that people actually trust with real work.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Jamie Thompson is the Founder and CEO of &lt;a href="https://sprinklenet.com" rel="noopener noreferrer"&gt;Sprinklenet AI&lt;/a&gt;, where he builds enterprise AI platforms for government and commercial clients. He writes weekly at &lt;a href="https://newsletter.sprinklenet.com" rel="noopener noreferrer"&gt;newsletter.sprinklenet.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>chatbot</category>
      <category>rag</category>
      <category>llm</category>
    </item>
    <item>
      <title>Vector Databases Explained: A Builder's Guide</title>
      <dc:creator>Jamie Thompson</dc:creator>
      <pubDate>Tue, 10 Mar 2026 14:12:03 +0000</pubDate>
      <link>https://dev.to/jamie_thompson/vector-databases-explained-a-builders-guide-1hbh</link>
      <guid>https://dev.to/jamie_thompson/vector-databases-explained-a-builders-guide-1hbh</guid>
      <description>&lt;p&gt;If you're building a Retrieval-Augmented Generation (RAG) system, you'll spend more time thinking about your vector database than you expect. I know because I've been through this decision multiple times while building &lt;a href="https://sprinklenet.com/knowledge-spaces/" rel="noopener noreferrer"&gt;Knowledge Spaces&lt;/a&gt;, our enterprise AI platform, and &lt;a href="https://sprinklenet.com/realtime-language-analysis/" rel="noopener noreferrer"&gt;ClearCast&lt;/a&gt;, our multilingual intelligence tool.&lt;/p&gt;

&lt;p&gt;The vector database is the backbone of your retrieval pipeline. Get it wrong and your AI gives bad answers, no matter how good your LLM is. Get it right and retrieval feels invisible, which is exactly how it should feel.&lt;/p&gt;

&lt;p&gt;This is the guide I wish I'd had when I started. No vendor marketing. Just practical observations from building and running these systems in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Vector Database Actually Does
&lt;/h2&gt;

&lt;p&gt;Before comparing options, let's be precise about the job.&lt;/p&gt;

&lt;p&gt;A vector database stores high-dimensional numerical representations (embeddings) of your content and enables fast similarity search across those embeddings. When a user asks a question, your system converts that question into an embedding using the same model that embedded your documents, then queries the vector database for the most similar document chunks.&lt;/p&gt;
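
&lt;p&gt;Stripped to its essentials, that flow is embed-and-rank. Here's a toy version with 3-dimensional vectors standing in for real embeddings (typically hundreds to thousands of dimensions); the key detail is that the query and the documents go through the same embedding model:&lt;/p&gt;

```python
import math

# Toy retrieval flow: documents and the query live in the same vector
# space, and results are ranked by cosine similarity. The 3-d vectors
# stand in for real embeddings.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc_embeddings = {
    "pricing.md":    [0.9, 0.1, 0.0],
    "security.md":   [0.1, 0.9, 0.2],
    "onboarding.md": [0.2, 0.3, 0.9],
}

# Produced by the SAME model that embedded the documents.
query_embedding = [0.15, 0.85, 0.25]

ranked = sorted(doc_embeddings.items(),
                key=lambda kv: cosine(query_embedding, kv[1]),
                reverse=True)
top_k = [name for name, _ in ranked[:2]]
```

&lt;p&gt;At scale, this exact scan gets replaced by approximate indexes, which is where the database choice starts to matter.&lt;/p&gt;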

&lt;p&gt;The quality of your retrieval depends on three things: your embedding model, your chunking strategy, and your vector database's ability to return accurate results quickly. The vector database handles that third piece.&lt;/p&gt;

&lt;p&gt;What makes vector databases different from traditional databases is the search algorithm. You're not matching exact values. You're finding the nearest neighbors in a high-dimensional space, typically using algorithms like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index). The tradeoffs between these algorithms, and how each database implements them, drive most of the practical differences you'll encounter.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Contenders
&lt;/h2&gt;

&lt;p&gt;I've worked meaningfully with four vector databases. Here's what I've learned about each.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pinecone
&lt;/h3&gt;

&lt;p&gt;Pinecone is what we use in Knowledge Spaces for our primary RAG pipeline. The reason is straightforward: it's a fully managed service that handles scaling, replication, and index optimization without operational overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What works well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero infrastructure management. You create an index, push vectors, and query. Pinecone handles sharding, replication, and failover.&lt;/li&gt;
&lt;li&gt;Consistent performance at scale. We've pushed millions of vectors through Pinecone indexes and query latency stays predictable in the low-millisecond range.&lt;/li&gt;
&lt;li&gt;Metadata filtering. You can attach metadata to vectors and filter on it during queries. This is critical for multi-tenant systems where you need to scope searches to a specific organization's documents without maintaining separate indexes.&lt;/li&gt;
&lt;li&gt;Namespace isolation. Pinecone namespaces give you logical separation within a single index, which maps cleanly to our multi-tenant architecture.&lt;/li&gt;
&lt;/ul&gt;
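
&lt;p&gt;A sketch of what a tenant-scoped query looks like. The index name, namespace scheme, and metadata fields are hypothetical, but the &lt;code&gt;$in&lt;/code&gt;-style filter follows Pinecone's metadata-filter format:&lt;/p&gt;

```python
# Sketch of a tenant-scoped Pinecone query. Namespace scheme and
# metadata fields are hypothetical; the filter syntax ($in, $eq, etc.)
# follows Pinecone's metadata-filter format.

def tenant_query_kwargs(query_vector, org_id, doc_types, top_k=5):
    return {
        "vector": query_vector,
        "top_k": top_k,
        "namespace": f"tenant-{org_id}",   # hard isolation per tenant
        "filter": {                        # scoping within the namespace
            "doc_type": {"$in": doc_types},
        },
        "include_metadata": True,
    }

kwargs = tenant_query_kwargs([0.1] * 1536, "acme", ["policy", "contract"])
# With the real client this would be something like:
#   from pinecone import Pinecone
#   index = Pinecone(api_key="...").Index("knowledge")
#   results = index.query(**kwargs)
```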

&lt;p&gt;&lt;strong&gt;What to watch for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost at scale. Pinecone is not cheap, especially with serverless pricing where you pay per read/write unit. If you're doing high-volume ingestion or running thousands of queries per minute, model the costs carefully before committing.&lt;/li&gt;
&lt;li&gt;Vendor lock-in. Your data lives in Pinecone's infrastructure. Migration requires re-indexing everything, which is feasible but not trivial.&lt;/li&gt;
&lt;li&gt;Limited query flexibility. Pinecone is excellent at what it does (vector similarity search with metadata filters), but if you need complex hybrid queries combining vector similarity with full-text search, keyword matching, and structured filters in a single query, you'll hit limitations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Qdrant
&lt;/h3&gt;

&lt;p&gt;Qdrant is what we use in ClearCast for multilingual semantic search. It's an open-source vector database written in Rust, and it offers a different set of tradeoffs than Pinecone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What works well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-hosted option. You can run Qdrant on your own infrastructure, which matters enormously for government and air-gapped deployments. This was the primary reason we chose it for ClearCast.&lt;/li&gt;
&lt;li&gt;Rich filtering. Qdrant's payload filtering is more expressive than Pinecone's metadata filtering. You can build complex boolean queries combining vector similarity with structured data conditions.&lt;/li&gt;
&lt;li&gt;Performance. Rust-based, HNSW indexing with quantization options. Query performance is excellent, and memory usage is well-optimized with scalar and product quantization.&lt;/li&gt;
&lt;li&gt;Collection snapshots. You can snapshot and restore collections, which simplifies backup and migration workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What to watch for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Operational responsibility. Self-hosting means you manage scaling, backups, monitoring, and upgrades. Qdrant Cloud exists as a managed option, but Pinecone's managed offering is more mature.&lt;/li&gt;
&lt;li&gt;Community size. Qdrant's community is growing fast but still smaller than some alternatives. Finding production-tested patterns and troubleshooting unusual issues takes more effort.&lt;/li&gt;
&lt;li&gt;Shard management. For very large datasets, you'll need to configure sharding and replication carefully. The defaults work for moderate scale, but high-volume production deployments need tuning.&lt;/li&gt;
&lt;/ul&gt;
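
&lt;p&gt;To make the filtering point concrete, here's a sketch of a boolean payload filter in the JSON form Qdrant's search APIs accept. The field names (&lt;code&gt;lang&lt;/code&gt;, &lt;code&gt;source&lt;/code&gt;, &lt;code&gt;published_ts&lt;/code&gt;) are hypothetical:&lt;/p&gt;

```python
# Sketch of a Qdrant boolean payload filter, in the JSON shape the
# REST/gRPC search APIs accept. Field names are hypothetical; the
# must / match / range structure is Qdrant's filter format.

def multilingual_filter(lang, allowed_sources, since_ts):
    return {
        "must": [
            {"key": "lang", "match": {"value": lang}},
            {"key": "source", "match": {"any": allowed_sources}},
            {"key": "published_ts", "range": {"gte": since_ts}},
        ]
    }

f = multilingual_filter("es", ["broadcast", "social"], 1_740_000_000)
# Passed alongside the query vector, e.g.:
#   client.search(collection_name="docs", query_vector=vec,
#                 query_filter=f, limit=10)
```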

&lt;h3&gt;
  
  
  pgvector
&lt;/h3&gt;

&lt;p&gt;pgvector is a PostgreSQL extension that adds vector similarity search to your existing Postgres database. It's not a standalone vector database. It's vector search bolted onto the database you probably already have.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What works well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No new infrastructure. If you're already running PostgreSQL (and you probably are), pgvector is a &lt;code&gt;CREATE EXTENSION&lt;/code&gt; away. No new services to deploy, monitor, or pay for.&lt;/li&gt;
&lt;li&gt;Unified queries. You can combine vector similarity search with standard SQL in a single query. Join your embeddings table with your metadata tables, filter on timestamps, aggregate results. This is genuinely powerful and something standalone vector databases can't match.&lt;/li&gt;
&lt;li&gt;Familiar tooling. Your existing Postgres backup, monitoring, and scaling infrastructure works with pgvector. Your team already knows how to manage it.&lt;/li&gt;
&lt;li&gt;Transaction support. Vector operations participate in PostgreSQL transactions. If your application needs ACID guarantees around vector operations (rare, but it happens), pgvector handles this natively.&lt;/li&gt;
&lt;/ul&gt;
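
&lt;p&gt;A sketch of what a unified query can look like. Table and column names are hypothetical; &lt;code&gt;cosine_distance()&lt;/code&gt; is the function behind pgvector's &lt;code&gt;&amp;lt;=&amp;gt;&lt;/code&gt; cosine operator:&lt;/p&gt;

```python
# Sketch of a pgvector query that joins vector similarity with ordinary
# SQL. Table and column names are hypothetical; cosine_distance() is the
# function behind pgvector's cosine-distance operator.

QUERY = """
SELECT d.chunk_text, m.title, m.updated_at
FROM doc_chunks d
JOIN doc_metadata m ON m.doc_id = d.doc_id
WHERE m.org_id = %(org_id)s
  AND m.updated_at >= %(since)s
ORDER BY cosine_distance(d.embedding, %(query_vec)s)
LIMIT 10;
"""

# pgvector accepts the vector as a bracketed literal string.
params = {"org_id": 42, "since": "2026-01-01", "query_vec": "[0.1,0.2,0.3]"}
# With psycopg: cur.execute(QUERY, params); rows = cur.fetchall()
```

&lt;p&gt;One round trip gives you similarity ranking, a metadata join, and a freshness filter. Doing the same against a standalone vector database usually takes two queries and application-side stitching.&lt;/p&gt;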

&lt;p&gt;&lt;strong&gt;What to watch for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Performance ceiling. pgvector with HNSW indexing is fast enough for many workloads, but it won't match Pinecone or Qdrant at high scale. If you're querying across tens of millions of vectors with sub-10ms latency requirements, pgvector will struggle.&lt;/li&gt;
&lt;li&gt;Memory usage. HNSW indexes in pgvector are memory-resident. Large indexes eat RAM, and PostgreSQL's memory management wasn't designed with vector indexes as a primary concern.&lt;/li&gt;
&lt;li&gt;Scaling limitations. PostgreSQL scaling (read replicas, partitioning) applies, but horizontal scaling of vector workloads is less natural than with purpose-built vector databases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; pgvector is the right choice more often than the vector database vendors want you to believe. If your dataset is under 5 million vectors, your query volume is moderate, and you're already running Postgres, pgvector eliminates an entire category of operational complexity. Start here unless you have a specific reason not to.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weaviate
&lt;/h3&gt;

&lt;p&gt;Weaviate is an open-source vector database with a strong focus on hybrid search (combining vector and keyword search) and a GraphQL-based query API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What works well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hybrid search. Weaviate's BM25 + vector fusion is the best out-of-the-box hybrid search I've used. If your retrieval quality depends on combining semantic similarity with keyword matching (and for many domains it does), Weaviate handles this natively.&lt;/li&gt;
&lt;li&gt;Schema-driven approach. Weaviate uses a typed schema for your data classes. This enforces structure and makes the data model explicit, which helps with long-term maintenance.&lt;/li&gt;
&lt;li&gt;Vectorization modules. Built-in integrations with embedding providers mean you can push raw text to Weaviate and it handles embedding automatically. Convenient for simpler architectures.&lt;/li&gt;
&lt;li&gt;Multi-tenancy. Native multi-tenant support with per-tenant data isolation.&lt;/li&gt;
&lt;/ul&gt;
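
&lt;p&gt;For intuition, here's a toy alpha-weighted fusion in the spirit of Weaviate's relative-score fusion (not its exact implementation): normalize each score list, then blend with a weight between pure-vector and pure-BM25:&lt;/p&gt;

```python
# Toy alpha-weighted hybrid fusion: min-max normalize each score list
# to [0, 1], then blend. alpha=1.0 is pure vector search, alpha=0.0 is
# pure BM25. Illustrative only, not Weaviate's exact algorithm.

def normalize(scores):
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid(vector_scores, bm25_scores, alpha=0.5):
    v, b = normalize(vector_scores), normalize(bm25_scores)
    docs = set(v) | set(b)
    fused = {d: alpha * v.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0)
             for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

vec = {"a": 0.92, "b": 0.85, "c": 0.40}   # cosine similarities
bm25 = {"b": 11.2, "c": 9.7, "d": 2.1}    # raw BM25 scores
ranking = hybrid(vec, bm25, alpha=0.5)
```

&lt;p&gt;Note that document &lt;code&gt;b&lt;/code&gt; wins by being strong in both lists, which is exactly the behavior hybrid search buys you.&lt;/p&gt;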

&lt;p&gt;&lt;strong&gt;What to watch for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resource consumption. Weaviate tends to use more memory and CPU than Qdrant for comparable workloads. On shared infrastructure, this adds up.&lt;/li&gt;
&lt;li&gt;GraphQL complexity. The GraphQL query API is powerful but adds a learning curve. REST and gRPC APIs exist too, but GraphQL is the primary interface.&lt;/li&gt;
&lt;li&gt;Operational weight. Weaviate has more moving parts than Qdrant or Pinecone. Backup, restore, and cluster management require more attention.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to Choose
&lt;/h2&gt;

&lt;p&gt;After building with all four, here's my decision framework.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose Pinecone if:&lt;/strong&gt; you want zero operational overhead, your team is small, your budget can absorb the cost, and you don't have air-gap or data sovereignty requirements. It's the fastest path to production-quality retrieval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose Qdrant if:&lt;/strong&gt; you need self-hosting capability, you want strong performance with lower cost, and your team can handle infrastructure management. Best fit for government deployments and organizations with strict data residency requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose pgvector if:&lt;/strong&gt; your dataset is moderate (under 5 million vectors), you're already on PostgreSQL, and you want to minimize architectural complexity. The right default choice for most early-stage RAG systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose Weaviate if:&lt;/strong&gt; hybrid search quality is your primary concern, you need native multi-tenancy, and you're comfortable with a heavier operational footprint.&lt;/p&gt;

&lt;h2&gt;
  
  
  Things That Matter More Than Your Vector Database Choice
&lt;/h2&gt;

&lt;p&gt;Here is the part most guides skip: your vector database choice matters less than three other decisions in your RAG pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chunking strategy.&lt;/strong&gt; How you split documents into chunks has a bigger impact on retrieval quality than which database stores those chunks. Chunk size, overlap, boundary detection (splitting on paragraphs vs. sentences vs. token counts), and metadata preservation all affect retrieval precision. We've spent more engineering time on chunking in Knowledge Spaces than on vector database integration.&lt;/p&gt;
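
&lt;p&gt;Even a minimal chunker involves real decisions. This sketch splits on paragraphs with character-based sizes and overlap; production chunkers usually count tokens and carry richer metadata (source file, section, page):&lt;/p&gt;

```python
# Minimal paragraph-aware chunker with overlap. Sizes are in characters
# for simplicity; real chunkers usually count tokens.

def chunk(text, max_chars=200, overlap=50):
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if len(current) + len(p) + 2 > max_chars and current:
            chunks.append(current)
            current = current[-overlap:]  # carry a tail forward for context
        current = (current + "\n\n" + p).strip()
    if current:
        chunks.append(current)
    return chunks

doc = ("First paragraph about setup." + "\n\n") * 3 + "Final paragraph about teardown."
pieces = chunk(doc, max_chars=80, overlap=20)
```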

&lt;p&gt;&lt;strong&gt;Embedding model selection.&lt;/strong&gt; The embedding model determines the quality of your vector representations. A great vector database can't fix bad embeddings. Test multiple embedding models on your specific data. We use different embedding approaches in Knowledge Spaces (Pinecone's built-in embedding pipeline) and ClearCast (BGE-M3 for multilingual coverage) because the data characteristics are completely different.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval evaluation.&lt;/strong&gt; Most teams never systematically measure their retrieval quality. They ship a RAG system and hope it works. Build an evaluation set of queries with known relevant documents. Measure recall and precision. Track these metrics over time. A 10% improvement in retrieval recall will do more for your system's output quality than switching vector databases.&lt;/p&gt;
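
&lt;p&gt;The core metric is simple to compute. A sketch of recall@k over a tiny evaluation set:&lt;/p&gt;

```python
# Recall@k over a small evaluation set: for each query, what fraction
# of the known-relevant documents appear in the top-k results?

def recall_at_k(retrieved, relevant, k=5):
    hits = len(set(retrieved[:k]).intersection(relevant))
    return hits / len(relevant)

eval_set = [
    {"retrieved": ["d1", "d7", "d3", "d9", "d2"], "relevant": ["d1", "d2", "d4"]},
    {"retrieved": ["d5", "d4", "d8", "d1", "d6"], "relevant": ["d4"]},
]

mean_recall = sum(recall_at_k(q["retrieved"], q["relevant"])
                  for q in eval_set) / len(eval_set)
```

&lt;p&gt;Run this after every change to your chunking, embedding model, or index configuration. A number that moves is worth a hundred vibe checks.&lt;/p&gt;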

&lt;p&gt;The vector database is infrastructure. Important infrastructure, but infrastructure. Get the fundamentals right, pick the option that fits your operational reality, and spend your engineering energy on the problems that actually drive output quality.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Jamie Thompson is the Founder and CEO of &lt;a href="https://sprinklenet.com" rel="noopener noreferrer"&gt;Sprinklenet AI&lt;/a&gt;, where he builds enterprise AI platforms for government and commercial clients. He writes weekly at &lt;a href="https://newsletter.sprinklenet.com" rel="noopener noreferrer"&gt;newsletter.sprinklenet.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vectordatabase</category>
      <category>rag</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The Fractional Chief AI Officer: Why Every Enterprise Needs One</title>
      <dc:creator>Jamie Thompson</dc:creator>
      <pubDate>Tue, 10 Mar 2026 14:12:03 +0000</pubDate>
      <link>https://dev.to/jamie_thompson/the-fractional-chief-ai-officer-why-every-enterprise-needs-one-1h94</link>
      <guid>https://dev.to/jamie_thompson/the-fractional-chief-ai-officer-why-every-enterprise-needs-one-1h94</guid>
      <description>&lt;p&gt;Most companies know they need AI leadership. Very few of them need a full-time Chief AI Officer.&lt;/p&gt;

&lt;p&gt;I say this as someone who serves as a &lt;a href="https://sprinklenet.com/ai-strategy/" rel="noopener noreferrer"&gt;fractional CAIO&lt;/a&gt; for multiple organizations. I have spent nearly two decades building AI products, led research programs for the Air Force Office of Scientific Research, and run an AI platform company that serves government and commercial clients. When organizations bring me in, the problem is almost never "we need more AI." The problem is "we have no idea how to make AI actually work for us."&lt;/p&gt;

&lt;p&gt;That gap between AI ambition and AI execution is where the fractional CAIO role lives. And it is a gap that is getting wider, not narrower.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Leadership Problem
&lt;/h2&gt;

&lt;p&gt;Every enterprise is feeling pressure to adopt AI. Board members are asking about it. Competitors are announcing initiatives. Employees are using ChatGPT on their personal devices whether IT approves or not.&lt;/p&gt;

&lt;p&gt;The typical response is one of two extremes. Either the company makes a senior AI hire (VP of AI, Chief AI Officer, Head of AI Strategy) at $300K to $500K annually, or leadership assigns AI responsibilities to an existing executive who already has a full plate and limited technical depth.&lt;/p&gt;

&lt;p&gt;Both approaches fail more often than they succeed. The expensive full-time hire often arrives to find that the organization lacks the data infrastructure, the engineering culture, or the strategic clarity to execute on any AI initiative. They spend their first year building a team and fighting for budget, and by month eighteen the board is asking what they have to show for it.&lt;/p&gt;

&lt;p&gt;The overloaded existing executive, meanwhile, makes well-intentioned decisions based on vendor demos and analyst reports. They greenlight pilots that never reach production. They buy platforms they do not need. They underestimate the governance, security, and change management requirements that determine whether an AI initiative succeeds or stalls.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Fractional CAIO Actually Does
&lt;/h2&gt;

&lt;p&gt;A fractional CAIO is a senior AI leader who works with your organization on a part-time, contracted basis. Typically 10 to 20 hours per week, sometimes less, sometimes more depending on the phase.&lt;/p&gt;

&lt;p&gt;The work falls into four areas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strategy and prioritization.&lt;/strong&gt; Not every process needs AI. A fractional CAIO evaluates your operations, identifies where AI creates genuine value versus where it creates complexity, and builds a roadmap that sequences initiatives based on feasibility, impact, and organizational readiness. This is the highest-leverage work because it prevents the most common failure mode: doing too many AI things badly instead of doing a few AI things well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture and vendor evaluation.&lt;/strong&gt; The AI tooling landscape is overwhelming. Foundation models, vector databases, orchestration frameworks, evaluation tools, deployment platforms. A fractional CAIO brings current, hands-on knowledge of what works and what does not. They can evaluate vendor claims against technical reality because they have built these systems themselves. At Sprinklenet, I evaluate and work with dozens of models, frameworks, and infrastructure components every month. That operational knowledge is the difference between choosing tools that work and choosing tools that demo well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Governance and risk management.&lt;/strong&gt; AI governance is not optional, especially for regulated industries, government contractors, and any organization handling sensitive data. A fractional CAIO establishes policies for data handling, model evaluation, output monitoring, access control, and audit logging. They build these frameworks proportionally, appropriate for your organization's size and risk profile, rather than either ignoring governance entirely or implementing an unwieldy bureaucracy that kills momentum.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Team development and culture.&lt;/strong&gt; AI adoption is ultimately a people problem. A fractional CAIO helps your existing team develop AI literacy, establishes best practices for prompt engineering and AI-assisted workflows, and creates the internal knowledge base that lets the organization sustain AI capabilities independently over time. The goal is not permanent dependency on outside leadership. The goal is building internal capability while having experienced guidance during the critical early phases.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Hire a Fractional CAIO
&lt;/h2&gt;

&lt;p&gt;The fractional model is the right fit in several common scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You are early in your AI journey.&lt;/strong&gt; If you are still figuring out where AI fits in your organization, a fractional CAIO provides strategic direction without the commitment and cost of a full-time executive. Once the strategy is clear and execution is underway, you can decide whether to hire a permanent leader or continue with fractional support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You are a mid-market company.&lt;/strong&gt; Organizations with 50 to 500 employees often need AI leadership but cannot justify a $400K executive salary plus the team-building costs that come with it. A fractional CAIO gives you senior-level guidance at a fraction of the cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You are a government contractor.&lt;/strong&gt; This is a space I know well. Government contractors face unique AI challenges: compliance requirements (FedRAMP, CMMC, DFARS), acquisition cycle constraints, and the need to demonstrate AI capabilities in proposals and past performance narratives. A fractional CAIO who understands the federal landscape can accelerate your competitive positioning while ensuring compliance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You had an AI initiative fail.&lt;/strong&gt; If your first attempt at AI adoption stalled, a fractional CAIO can diagnose what went wrong, salvage what is salvageable, and reset the organization's approach based on lessons learned rather than hype.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Look For
&lt;/h2&gt;

&lt;p&gt;Not every experienced AI practitioner is a good fractional CAIO. The role requires a specific combination of skills.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hands-on technical depth.&lt;/strong&gt; Beware of AI strategists who have never built a production system. Your fractional CAIO should understand model architectures, RAG pipelines, embedding strategies, deployment patterns, and evaluation frameworks at a practitioner level. They should be able to read code, evaluate technical designs, and have informed opinions about infrastructure choices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business acumen.&lt;/strong&gt; Technical skill without business judgment produces impressive solutions to problems nobody has. A good fractional CAIO ties every AI initiative to a measurable business outcome: revenue, cost reduction, cycle time, error rate, customer satisfaction. If they cannot articulate the ROI case for an initiative in plain language, they should not be recommending it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Communication skills.&lt;/strong&gt; The fractional CAIO sits between technical teams and executive leadership. They need to explain complex AI concepts to non-technical stakeholders without condescension, and translate business requirements into technical specifications without losing fidelity. This translation capability is rarer than it sounds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Current, operational knowledge.&lt;/strong&gt; AI moves fast. Someone whose last hands-on work was three years ago is working from outdated mental models. Look for someone who is actively building, deploying, and operating AI systems today. They should have opinions about current tools and frameworks that come from direct experience, not analyst reports.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ROI Case
&lt;/h2&gt;

&lt;p&gt;The math is straightforward. A full-time CAIO costs $300K to $500K in salary alone, plus benefits, equity, and the team they will inevitably need to hire. A fractional CAIO typically costs $10K to $25K per month, scales up or down based on need, and brings a breadth of experience from working across multiple organizations simultaneously.&lt;/p&gt;

&lt;p&gt;More importantly, the fractional model reduces the risk of the most expensive AI failure: spending six to twelve months and significant budget building the wrong thing. An experienced fractional CAIO has seen enough implementations to know which approaches work, which vendors deliver, and which "AI strategies" are just repackaged consulting frameworks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;If this resonates, here is what I recommend.&lt;/p&gt;

&lt;p&gt;Start with an assessment. A good fractional CAIO will begin by understanding your current state: what data you have, what systems you run, what your team's capabilities are, and what outcomes actually matter to your business. At Sprinklenet, we built the &lt;a href="https://sprinklenet.com/ai-scorecard/" rel="noopener noreferrer"&gt;Enterprise AI Scorecard&lt;/a&gt; specifically for this purpose, a structured evaluation that gives organizations a clear baseline and a prioritized roadmap.&lt;/p&gt;

&lt;p&gt;From there, the engagement takes shape around your specific needs. Some clients need heavy strategic work upfront and lighter ongoing guidance. Others need hands-on architecture support during a build phase. The fractional model adapts to the work rather than the other way around.&lt;/p&gt;

&lt;p&gt;The organizations that succeed with AI are not the ones with the biggest budgets or the most sophisticated technology. They are the ones with the right leadership at the right time. For most companies, "right" means experienced, practical, available when needed, and focused on outcomes rather than empire building. That is exactly what the fractional CAIO model delivers.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Jamie Thompson is the Founder and CEO of &lt;a href="https://sprinklenet.com" rel="noopener noreferrer"&gt;Sprinklenet AI&lt;/a&gt;, where he builds enterprise AI platforms for government and commercial clients. He writes weekly at &lt;a href="https://newsletter.sprinklenet.com" rel="noopener noreferrer"&gt;newsletter.sprinklenet.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>leadership</category>
      <category>enterprise</category>
      <category>strategy</category>
    </item>
    <item>
      <title>Building AI for Government: What Developers Need to Know</title>
      <dc:creator>Jamie Thompson</dc:creator>
      <pubDate>Tue, 10 Mar 2026 14:12:03 +0000</pubDate>
      <link>https://dev.to/jamie_thompson/building-ai-for-government-what-developers-need-to-know-934</link>
      <guid>https://dev.to/jamie_thompson/building-ai-for-government-what-developers-need-to-know-934</guid>
      <description>&lt;p&gt;Government AI is a massive market that most developers ignore because it looks intimidating from the outside. The acronyms alone could fill a dictionary. FedRAMP, FISMA, IL2 through IL6, DFARS, NIST 800-53, CAC, PKI, SAML, SCIM.&lt;/p&gt;

&lt;p&gt;But here's the thing. The underlying technical problems are ones you already know how to solve. Authentication, authorization, audit logging, data isolation, encryption. The difference is that government has very specific, very documented requirements for how you solve them. And if you can meet those requirements, you're competing in a market where most startups never bother to show up.&lt;/p&gt;

&lt;p&gt;At Sprinklenet, we've spent years building AI products for federal agencies. Our platform, &lt;a href="https://sprinklenet.com/knowledge-spaces/" rel="noopener noreferrer"&gt;Knowledge Spaces&lt;/a&gt;, serves government clients with the full stack of compliance requirements. Here's what I wish someone had told me when we started.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Compliance Landscape (Without the Jargon)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  FedRAMP: Your Cloud Hosting Matters
&lt;/h3&gt;

&lt;p&gt;FedRAMP (Federal Risk and Authorization Management Program) is a standardized security assessment framework for cloud services. If a government agency is going to use your SaaS product, they'll likely ask about FedRAMP.&lt;/p&gt;

&lt;p&gt;The practical implication: you need to host on FedRAMP-authorized infrastructure. AWS GovCloud, Azure Government, and Google Cloud have FedRAMP-authorized regions. If you're already on one of these clouds, you're closer than you think.&lt;/p&gt;

&lt;p&gt;Getting your own FedRAMP Authorization to Operate (ATO) is a heavy lift. Hundreds of security controls, third-party assessment, months of documentation. For most startups, the better path is to deploy on infrastructure that's already authorized and inherit those controls.&lt;/p&gt;

&lt;h3&gt;
  
  
  Impact Levels: Not All Government Data Is Equal
&lt;/h3&gt;

&lt;p&gt;The Department of War and other agencies classify data by Impact Level (IL).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IL2&lt;/strong&gt; covers publicly releasable information. Low bar. Most commercial cloud environments can handle this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IL4&lt;/strong&gt; covers Controlled Unclassified Information (CUI). This is where things get real. You need encryption at rest and in transit, access controls, audit logging, and hosting in a U.S. sovereign cloud environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IL5&lt;/strong&gt; covers CUI with additional restrictions, typically requiring dedicated infrastructure within the U.S. with personnel controls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IL6&lt;/strong&gt; is classified. If you're reading this article, you're probably not building for IL6 yet.&lt;/p&gt;

&lt;p&gt;Most federal AI opportunities sit at IL2 or IL4. Don't let the higher levels scare you away from the space.&lt;/p&gt;

&lt;h3&gt;
  
  
  FAR and DFARS: The Rules of Engagement
&lt;/h3&gt;

&lt;p&gt;The Federal Acquisition Regulation (FAR) governs how the government buys things. DFARS, the Defense FAR Supplement, layers Department of War-specific rules on top.&lt;/p&gt;

&lt;p&gt;As a developer, the parts that matter most are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DFARS 252.204-7012&lt;/strong&gt; requires safeguarding of Covered Defense Information and cyber incident reporting. If you're handling any DoW data, you need to implement NIST SP 800-171 controls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FAR 52.204-21&lt;/strong&gt; covers basic safeguarding of covered contractor information systems. This is the minimum bar.&lt;/p&gt;

&lt;p&gt;The practical takeaway: you need documented security controls, incident response procedures, and the ability to demonstrate compliance. Not aspirational compliance. Documented, auditable compliance.&lt;/p&gt;

&lt;p&gt;We built &lt;a href="https://sprinklenet.com/farbot/" rel="noopener noreferrer"&gt;FARbot&lt;/a&gt; as a free tool specifically because navigating FAR/DFARS is painful for developers and contractors alike. It's a RAG-powered chatbot that searches the complete FAR and DFARS with cited answers, so you can look up specific clauses without reading thousands of pages of regulations. Every answer includes source citations and retrieval logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Authentication: CAC, PKI, and Why OAuth Isn't Enough
&lt;/h2&gt;

&lt;p&gt;This is where government AI diverges most sharply from commercial AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  CAC/PIV Authentication
&lt;/h3&gt;

&lt;p&gt;Most government employees authenticate using a Common Access Card (CAC) or Personal Identity Verification (PIV) card. It's a smart card with X.509 certificates. Your application needs to support certificate-based authentication, typically via mutual TLS at the web server layer.&lt;/p&gt;

&lt;p&gt;If you've never implemented smart card auth before, here's the short version: the user's browser presents a client certificate during the TLS handshake. Your server validates the certificate chain against DoW or federal PKI root certificates. You extract the user's identity from the certificate's Subject DN or SAN.&lt;/p&gt;
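
&lt;p&gt;Once the handshake succeeds, identity extraction is plain parsing. Here's a sketch using the nested-tuple shape Python's &lt;code&gt;ssl.getpeercert()&lt;/code&gt; returns; the &lt;code&gt;LAST.FIRST.MIDDLE.EDIPI&lt;/code&gt; CommonName pattern is the usual CAC convention, but verify it against your agency's PKI profile:&lt;/p&gt;

```python
# Sketch of extracting identity from an already-validated client cert,
# using the nested-tuple structure ssl.getpeercert() returns. The
# LAST.FIRST.MIDDLE.EDIPI CommonName pattern is the common CAC
# convention; confirm against your agency's PKI profile.

def identity_from_cert(peercert: dict) -> dict:
    # Flatten the RDN tuples into a simple attribute map.
    subject = {k: v for rdn in peercert["subject"] for k, v in rdn}
    cn = subject["commonName"]
    parts = cn.split(".")
    return {"cn": cn, "surname": parts[0], "edipi": parts[-1]}

cert = {
    "subject": (
        (("countryName", "US"),),
        (("organizationName", "U.S. Government"),),
        (("commonName", "DOE.JOHN.A.1234567890"),),
    )
}
user = identity_from_cert(cert)
```

&lt;p&gt;Certificate chain validation itself belongs at the web server or load balancer layer (mutual TLS); by the time your application sees the certificate, it should already be trusted.&lt;/p&gt;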

&lt;h3&gt;
  
  
  SAML SSO
&lt;/h3&gt;

&lt;p&gt;Many agencies use SAML-based single sign-on through identity providers like Okta, Azure AD (now Entra ID), or ICAM solutions. Your app needs to be a SAML Service Provider. This means supporting SAML assertions, attribute mapping, and session management.&lt;/p&gt;

&lt;p&gt;In Knowledge Spaces, we support SAML SSO alongside CAC/PKI so agencies can use whichever authentication method fits their environment. Some agencies use both, with CAC for on-premises access and SAML for remote.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Key Difference
&lt;/h3&gt;

&lt;p&gt;Commercial apps can get away with username/password plus MFA. Government apps can't. Plan for certificate-based auth and SAML from the start. Retrofitting it later is painful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Audit Logging: If It's Not Logged, It Didn't Happen
&lt;/h2&gt;

&lt;p&gt;Government requires comprehensive audit logging. Not "we log errors." Everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who&lt;/strong&gt; accessed the system, &lt;strong&gt;when&lt;/strong&gt; they accessed it, &lt;strong&gt;what&lt;/strong&gt; they did, &lt;strong&gt;what data&lt;/strong&gt; they touched, and &lt;strong&gt;from where&lt;/strong&gt; they connected. Every interaction needs an immutable audit trail.&lt;/p&gt;

&lt;p&gt;For AI platforms specifically, this extends to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which model processed the query&lt;/li&gt;
&lt;li&gt;What documents were retrieved (and from which collections)&lt;/li&gt;
&lt;li&gt;What permissions were evaluated during retrieval&lt;/li&gt;
&lt;li&gt;Whether any content was filtered or modified&lt;/li&gt;
&lt;li&gt;Token counts, response times, and cost attribution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Knowledge Spaces, we log 64+ distinct audit events per interaction. That sounds excessive until an agency security officer asks you to produce a complete access history for a specific document over the last 90 days. Then it sounds like exactly the right amount.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical tip:&lt;/strong&gt; Use structured logging (JSON) with consistent schemas from day one. Ship logs to an immutable store. Make them queryable. Government auditors don't want to grep through text files.&lt;/p&gt;
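
&lt;p&gt;A minimal sketch of what a consistent audit schema can look like, with hash chaining as a cheap tamper-evidence mechanism. The field names and chaining scheme here are illustrative, not a standard; in production these records would ship to an append-only store.&lt;/p&gt;

```python
# Sketch: one JSON audit record per event, chained to the previous
# record's hash so later tampering with history is detectable.
import hashlib
import json
from datetime import datetime, timezone

def audit_event(prev_hash: str, actor: str, action: str, resource: str,
                source_ip: str, **extra) -> dict:
    """Build a structured audit record with a consistent schema."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,          # who
        "action": action,        # what they did
        "resource": resource,    # what data they touched
        "source_ip": source_ip,  # from where
        "prev_hash": prev_hash,
        **extra,                 # model, token counts, collections, etc.
    }
    payload = json.dumps(event, sort_keys=True).encode()
    event["hash"] = hashlib.sha256(payload).hexdigest()
    return event

e1 = audit_event("genesis", "jdoe", "rag.query", "collection:finance",
                 "10.0.0.5", model="local-llm", tokens=412)
e2 = audit_event(e1["hash"], "jdoe", "doc.view", "doc:fy25-budget.pdf",
                 "10.0.0.5")
print(json.dumps(e2, indent=2))
```

&lt;p&gt;Because every record carries &lt;code&gt;prev_hash&lt;/code&gt;, an auditor can walk the chain and detect any record that was altered or deleted after the fact.&lt;/p&gt;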

&lt;h2&gt;
  
  
  RBAC: Access Control That Means Something
&lt;/h2&gt;

&lt;p&gt;Role-Based Access Control in government isn't just "admin, editor, viewer." It's granular, hierarchical, and tied to data classification.&lt;/p&gt;

&lt;p&gt;You need to control:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which users can access which document collections&lt;/li&gt;
&lt;li&gt;Which models are available to which user groups&lt;/li&gt;
&lt;li&gt;Which data can leave which boundaries&lt;/li&gt;
&lt;li&gt;Who can administer which portions of the system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a RAG system, RBAC matters most at the retrieval layer. When a user asks a question, the system should only retrieve documents that user is authorized to see. This means your vector database queries need to include permission filters, not just semantic similarity.&lt;/p&gt;

&lt;p&gt;We implement this in Knowledge Spaces with per-collection access controls that are evaluated at query time. A user might have access to three collections out of twenty. Their RAG results only draw from those three. No leakage, no "the model saw it but we filtered the display."&lt;/p&gt;
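
&lt;p&gt;The idea can be shown with a toy in-memory retriever: the ACL check happens inside the search, so unauthorized documents never enter the candidate set. Collection names, the user table, and the similarity metric here are all illustrative stand-ins for a real vector database's metadata filters.&lt;/p&gt;

```python
# Toy sketch of permission-filtered retrieval: filter by the user's
# allowed collections BEFORE scoring, rather than filtering results
# after the model has already seen them.
import math

DOCS = [
    {"id": "d1", "collection": "finance", "vec": [0.9, 0.1]},
    {"id": "d2", "collection": "hr",      "vec": [0.8, 0.2]},
    {"id": "d3", "collection": "osint",   "vec": [0.1, 0.9]},
]

USER_ACL = {"jdoe": {"finance", "osint"}}  # collections jdoe may read

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def retrieve(user: str, query_vec, k: int = 2):
    """Top-k similarity search restricted to the user's collections."""
    allowed = USER_ACL.get(user, set())
    candidates = [d for d in DOCS if d["collection"] in allowed]
    candidates.sort(key=lambda d: cosine(d["vec"], query_vec), reverse=True)
    return [d["id"] for d in candidates[:k]]

print(retrieve("jdoe", [1.0, 0.0]))  # ['d1', 'd3'] -- d2 (hr) is never scored
```

&lt;p&gt;Note that &lt;code&gt;d2&lt;/code&gt; is more similar to the query than &lt;code&gt;d3&lt;/code&gt;, but it never appears: the permission filter runs before similarity ranking, which is the property that prevents "the model saw it but we filtered the display."&lt;/p&gt;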

&lt;h2&gt;
  
  
  Data Sovereignty: Where Your Bits Live Matters
&lt;/h2&gt;

&lt;p&gt;Government agencies care deeply about where data is stored and processed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data residency.&lt;/strong&gt; All data must reside in the continental United States, on infrastructure operated by U.S. persons. This applies to your database, your vector store, your object storage, your backups, and your logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model API calls.&lt;/strong&gt; If you're calling external LLM APIs, where is that data being processed? Can you guarantee it's not being used for training? Most major providers now offer data processing agreements, but you need to read them carefully and be able to articulate the data flow to your government client.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Air-gapped deployments.&lt;/strong&gt; Some environments require fully disconnected operation. No external API calls, no cloud dependencies. This means running local models, local vector databases, and local everything. It's a different architectural challenge entirely.&lt;/p&gt;
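
&lt;p&gt;One way to keep disconnected operation honest is to enforce it in configuration rather than convention: when an air-gap flag is set, any attempt to reach an external endpoint fails fast instead of silently calling out. The environment variable name, endpoints, and the ".internal" heuristic below are all assumptions for illustration.&lt;/p&gt;

```python
# Sketch: fail fast on external calls in air-gapped mode.
# Endpoint URLs and the AIRGAPPED env var are illustrative.
import os

LOCAL_LLM = "http://llm.internal:8080/v1"    # assumed on-prem model server
REMOTE_LLM = "https://api.example.com/v1"    # assumed external API

def llm_endpoint() -> str:
    """Select the model endpoint based on deployment mode."""
    if os.environ.get("AIRGAPPED") == "1":
        return LOCAL_LLM
    return REMOTE_LLM

def guard_external(url: str) -> str:
    """Refuse any non-internal URL when running disconnected."""
    if os.environ.get("AIRGAPPED") == "1" and ".internal" not in url:
        raise RuntimeError(f"external call blocked in air-gapped mode: {url}")
    return url

os.environ["AIRGAPPED"] = "1"
print(guard_external(llm_endpoint()))  # http://llm.internal:8080/v1
```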

&lt;h2&gt;
  
  
  Getting Started Without Getting Overwhelmed
&lt;/h2&gt;

&lt;p&gt;If you're a developer looking to enter the government AI space, here's my practical advice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with IL2 workloads.&lt;/strong&gt; Public-facing government data, open-source intelligence, unclassified research. The compliance bar is manageable, and you'll learn the ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use already-authorized infrastructure.&lt;/strong&gt; Deploy on AWS GovCloud or Azure Government. Inherit their FedRAMP authorization rather than pursuing your own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Get on GSA Schedule.&lt;/strong&gt; The General Services Administration's Multiple Award Schedule is the easiest procurement vehicle for agencies to buy from. It takes months to get, but it opens doors that are otherwise locked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build compliance into your architecture, not your roadmap.&lt;/strong&gt; Audit logging, RBAC, encryption, certificate auth. These aren't features you add later. They're architectural decisions that affect everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Read the FAR.&lt;/strong&gt; Or better yet, use &lt;a href="https://sprinklenet.com/farbot/" rel="noopener noreferrer"&gt;FARbot&lt;/a&gt; to search it conversationally. Understanding the acquisition rules gives you a massive advantage over developers who treat government as just another market segment.&lt;/p&gt;

&lt;p&gt;The government AI market is growing fast, and the technical barriers are more approachable than they appear. The developers who invest in understanding the compliance landscape now will be well positioned as agencies accelerate their AI adoption.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Jamie Thompson is the Founder and CEO of &lt;a href="https://sprinklenet.com" rel="noopener noreferrer"&gt;Sprinklenet AI&lt;/a&gt;, where he builds enterprise AI platforms for government and commercial clients. He writes weekly at &lt;a href="https://newsletter.sprinklenet.com" rel="noopener noreferrer"&gt;newsletter.sprinklenet.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>government</category>
      <category>security</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
