Your AI model is brilliant. It can write essays, analyze data, and hold conversations that feel human. But ask it about your company's internal documentation, last quarter's sales numbers, or the return policy you updated yesterday, and it fails. Not because it is stupid. Because it does not know.
Large language models only know what was in their training data. Your private documents, your proprietary databases, your recent updates: none of it exists in the model's world. Fine-tuning is expensive, slow, and inflexible. Every time your data changes, you would need to retrain.
Retrieval-Augmented Generation solves this. Instead of teaching the model your data, you fetch the relevant data at query time and hand it to the model alongside the question. The model generates a response grounded in real, current information from your systems.
RAG is not a framework. It is not a library. It is an architecture pattern. And the .NET ecosystem in 2026 has everything you need to implement it at production scale.
This article walks through the entire journey. We start with what RAG actually is. We build a naive implementation. Then we make it production-grade, covering chunking strategies, embedding generation, vector storage, hybrid search, and the architectural patterns that make RAG maintainable in enterprise .NET applications.
What RAG Actually Is
RAG has three letters and five stages. Here is what happens when a user asks a question:
1. The user asks a question. "What is our refund policy for digital products?"
2. The question gets embedded. An embedding model converts the question into a vector: a high-dimensional array of numbers that represents the meaning of the question.
3. Relevant documents are retrieved. The system searches a vector database for documents whose embeddings are similar to the question's embedding. It returns the top 5 or 10 most relevant chunks.
4. The context is injected into the prompt. The retrieved chunks are added to the prompt alongside the user's question. The model now has the original question plus the relevant context.
5. The model generates a grounded response. Instead of guessing or hallucinating, the model answers based on the retrieved context. It can cite sources, quote specific passages, and admit when the context does not contain the answer.
That is the entire pattern. Embed, retrieve, inject, generate. Everything else is optimization.
The Naive Implementation
Let us build the simplest possible RAG pipeline in .NET. No frameworks, no magic. Just the raw pattern.
// The simplest RAG pipeline you can write
var embedder = new OpenAIEmbeddingGenerator(
    "text-embedding-3-small", apiKey);
var chatClient = new OpenAIChatClient("gpt-4o", apiKey);

// 1. Your "database" of documents
var documents = new List<(string Text, float[] Vector)>();

// 2. Index a document
var docText = "Digital products are non-refundable after " +
    "download. Customers may request a refund within " +
    "24 hours of purchase if the product has not been " +
    "downloaded.";
var docVector = await embedder.GenerateVectorAsync(docText);
documents.Add((docText, docVector));

// 3. User asks a question
var question = "Can I get a refund on a digital purchase?";
var questionVector = await embedder
    .GenerateVectorAsync(question);

// 4. Find the most similar document (cosine similarity)
var bestMatch = documents
    .OrderBy(d => CosineDistance(questionVector, d.Vector))
    .First();

// 5. Build the prompt with context
var prompt = $"""
    Answer the question based on the context below.
    If the context does not contain the answer, say
    "I don't know."

    Context:
    {bestMatch.Text}

    Question: {question}
    """;

// 6. Generate response
var response = await chatClient.GetResponseAsync(prompt);
Console.WriteLine(response.Text);
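The snippet leans on a CosineDistance helper that is not defined anywhere above. A minimal version, treating embeddings as plain float[] arrays the way the snippet does:

// Cosine similarity: 1 = same direction, 0 = unrelated, -1 = opposite
static float CosineSimilarity(float[] a, float[] b)
{
    float dot = 0, magA = 0, magB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(magA) * MathF.Sqrt(magB));
}

// Distance form used by the OrderBy above (smaller = more similar)
static float CosineDistance(float[] a, float[] b)
    => 1f - CosineSimilarity(a, b);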
This works. The model gets the refund policy document as context and answers accurately. But this naive approach has problems that become obvious at scale.
Problem 1: Documents are too big. A 50-page PDF does not fit in the model's context window. You need to split documents into smaller chunks.
Problem 2: In-memory storage. A list of tuples does not survive a restart, does not scale, and does not support metadata filtering.
Problem 3: No relevance threshold. The system always returns something, even when nothing is relevant.
Problem 4: Single embedding model. If the provider changes pricing or goes down, everything breaks.
Let us fix all of these.
Stage 1: Chunking — Splitting Documents Intelligently
Before you can embed documents, you need to split them into chunks that fit in the model's context window and are semantically coherent. This is the single most impactful decision in a RAG pipeline.
Chunking Strategies
Fixed-size chunking. Split every N tokens. Simple, predictable, but can break mid-sentence or mid-paragraph. A chunk about refund policy might end with "customers may request a" and the next chunk starts with "refund within 24 hours." Neither chunk is useful alone.
Paragraph-based chunking. Split on double newlines. Preserves natural semantic units, but paragraphs vary wildly in size. A one-sentence paragraph wastes embedding capacity. A five-page paragraph defeats the purpose.
Recursive chunking. Try to split on paragraphs first. If a paragraph is too big, split on sentences. If a sentence is too big, split on tokens. This is the best default for most use cases.
Semantic chunking. Use embedding similarity to detect topic boundaries. When the embedding of consecutive sentences diverges sharply, that is a chunk boundary. Produces the most coherent chunks but is computationally expensive.
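For reference, here is a minimal sketch of the semantic approach. It assumes sentences are already split, reuses the CosineSimilarity helper from the naive example, and compares single sentences; production implementations usually average a sliding window of sentence embeddings instead.

// Semantic chunking sketch: start a new chunk when the next sentence's
// embedding diverges sharply from the previous sentence's embedding.
async Task<List<string>> SemanticChunkAsync(
    IReadOnlyList<string> sentences,
    IEmbeddingGenerator<string, Embedding<float>> embedder,
    float boundaryThreshold = 0.75f)
{
    var chunks = new List<string>();
    var current = new List<string>();
    float[]? previous = null;

    foreach (var sentence in sentences)
    {
        var vector = (await embedder.GenerateVectorAsync(sentence)).ToArray();

        // A sharp similarity drop marks a topic boundary
        if (previous is not null && current.Count > 0 &&
            CosineSimilarity(previous, vector) < boundaryThreshold)
        {
            chunks.Add(string.Join(" ", current));
            current.Clear();
        }

        current.Add(sentence);
        previous = vector;
    }

    if (current.Count > 0)
        chunks.Add(string.Join(" ", current));

    return chunks;
}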
Implementation with Semantic Kernel's TextChunker
using Microsoft.SemanticKernel.Text;

// Load your document
var documentText = await File.ReadAllTextAsync(
    "refund-policy.md");

// Recursive-style chunking: split into lines, then group into paragraphs
var lines = TextChunker.SplitPlainTextLines(
    documentText, maxTokensPerLine: 128);

var chunks = TextChunker.SplitPlainTextParagraphs(
    lines,
    maxTokensPerParagraph: 256,
    overlapTokens: 32);

// Each chunk is now at most 256 tokens, with a 32-token
// overlap between consecutive chunks
The overlap parameter is important. A 32-token overlap means consecutive chunks share context at their boundaries, so information that spans a chunk boundary is not lost entirely. Typical overlap is 10-15% of the chunk size.
How Big Should Chunks Be?
There is no universal answer, but here is what works in practice:
128-256 tokens for factual Q&A (support docs, FAQs, policies). Small chunks mean more precise retrieval.
256-512 tokens for analytical content (reports, articles, research). Larger chunks preserve more context for complex reasoning.
512-1024 tokens for code documentation and tutorials. Code blocks need surrounding explanation to make sense.
Start with 256 tokens and adjust based on retrieval quality.
Stage 2: Embedding — Converting Text to Vectors
Each chunk needs to be converted to a vector that captures its semantic meaning. The choice of embedding model determines the quality of your retrieval.
Using Microsoft.Extensions.AI
using Microsoft.Extensions.AI;
// Register in DI (provider-agnostic)
builder.Services.AddEmbeddingGenerator(
    new OpenAIClient(apiKey)
        .GetEmbeddingClient("text-embedding-3-small")
        .AsIEmbeddingGenerator());
The IEmbeddingGenerator<string, Embedding<float>> abstraction means you can switch providers without changing your RAG code.
Embedding Models Compared
OpenAI text-embedding-3-small. 1536 dimensions. Good quality, low cost. Best default for most .NET applications.
OpenAI text-embedding-3-large. 3072 dimensions. Higher quality, higher cost and storage. Use when precision matters more than cost.
Local models via Ollama (mxbai-embed-large). 1024 dimensions. Runs on your machine. Zero API cost. Good for development and privacy-sensitive deployments.
Azure OpenAI. Same models as OpenAI, but deployed in your Azure tenant. Best for enterprise compliance requirements.
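For the local Ollama option above, registration looks roughly like this; a sketch assuming the Microsoft.Extensions.AI.Ollama package (type names have shifted between preview releases) and a local Ollama instance with the model already pulled:

using Microsoft.Extensions.AI;

// Local embeddings: no API key, nothing leaves your machine
builder.Services.AddEmbeddingGenerator(
    new OllamaEmbeddingGenerator(
        new Uri("http://localhost:11434"),
        "mxbai-embed-large"));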
Batch Embedding
Never embed one chunk at a time in production. Batch calls reduce latency and cost:
public async Task<List<(string Text, float[] Vector)>> EmbedChunksAsync(
    List<string> chunks,
    IEmbeddingGenerator<string, Embedding<float>> embedder)
{
    var results = new List<(string, float[])>();

    // Process in batches of 100
    foreach (var batch in chunks.Chunk(100))
    {
        var embeddings = await embedder
            .GenerateAsync(batch.ToList());

        for (int i = 0; i < batch.Length; i++)
        {
            results.Add((
                batch[i],
                embeddings[i].Vector.ToArray()));
        }
    }

    return results;
}
Stage 3: Storage — Where Your Vectors Live
You have three options in the .NET ecosystem, each with different trade-offs.
Option A: EF Core 10 with SQL Server 2025
If you already use SQL Server, this is the simplest path. No new infrastructure.
public class DocumentChunk
{
    public int Id { get; set; }
    public string Text { get; set; } = "";
    public string Source { get; set; } = "";
    public string Category { get; set; } = "";
    public DateTime IndexedAt { get; set; }

    [Column(TypeName = "vector(1536)")]
    public SqlVector<float> Embedding { get; set; }
}

// Query
var results = await db.Chunks
    .Where(c => c.Category == "policies")
    .OrderBy(c => EF.Functions.VectorDistance(
        "cosine", c.Embedding, queryVector))
    .Take(5)
    .ToListAsync();
Best for: datasets under 10 million vectors, existing SQL Server infrastructure, applications that need joins and transactions alongside vector search.
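Ingestion with this option is plain EF Core. A sketch, assuming embeddedChunks is the output of the batching example from Stage 2:

// Persist chunks and their vectors like any other entity
foreach (var (text, vector) in embeddedChunks)
{
    db.Chunks.Add(new DocumentChunk
    {
        Text = text,
        Source = "refund-policy.md",
        Category = "policies",
        IndexedAt = DateTime.UtcNow,
        Embedding = new SqlVector<float>(vector)
    });
}

await db.SaveChangesAsync();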
Option B: Microsoft.Extensions.VectorData
A provider-agnostic abstraction that works with Azure AI Search, Qdrant, Cosmos DB, Redis, Weaviate, and an in-memory implementation for testing.
using Microsoft.Extensions.VectorData;
using Microsoft.SemanticKernel.Connectors.InMemory;

// Define your record
public class DocChunk
{
    [VectorStoreRecordKey]
    public string Id { get; set; } = "";

    [VectorStoreRecordData]
    public string Text { get; set; } = "";

    [VectorStoreRecordData]
    public string Source { get; set; } = "";

    [VectorStoreRecordVector(1536)]
    public ReadOnlyMemory<float> Embedding { get; set; }
}

// Create a collection
var vectorStore = new InMemoryVectorStore();
var collection = vectorStore
    .GetCollection<string, DocChunk>("documents");
await collection.CreateCollectionIfNotExistsAsync();

// Upsert a record
await collection.UpsertAsync(new DocChunk
{
    Id = Guid.NewGuid().ToString(),
    Text = chunkText,
    Source = "refund-policy.md",
    Embedding = embeddingVector
});

// Search
var searchResults = await collection
    .VectorizedSearchAsync(queryVector,
        new VectorSearchOptions { Top = 5 });

await foreach (var result in searchResults.Results)
{
    Console.WriteLine(
        $"{result.Score}: {result.Record.Text}");
}
Best for: applications that may switch vector backends, teams using Semantic Kernel, multi-backend architectures.
Option C: Kernel Memory
Microsoft's out-of-the-box RAG solution. Handles the entire pipeline: document ingestion, chunking, embedding, storage, and retrieval through a single service.
using Microsoft.KernelMemory;

builder.Services.AddKernelMemory<MemoryServerless>(
    memoryBuilder =>
    {
        memoryBuilder
            .WithPostgresMemoryDb(postgresConfig)
            .WithOpenAITextGeneration(openAiConfig)
            .WithOpenAITextEmbeddingGeneration(openAiConfig);
    });

// Ingest a document (handles chunking + embedding)
await memory.ImportDocumentAsync(
    "refund-policy.pdf", documentId: "policy-001");

// Ask a question (handles retrieval + generation)
var answer = await memory.AskAsync(
    "What is the refund policy for digital products?");
Console.WriteLine(answer.Result);
Best for: rapid prototyping, applications where you want RAG with minimal custom code, teams that prefer convention over configuration.
Stage 4: Retrieval — Getting the Right Chunks
Retrieval quality determines RAG quality. If you retrieve irrelevant chunks, no amount of prompt engineering fixes the output.
Basic Vector Retrieval
var queryVector = await embedder
    .GenerateVectorAsync(userQuestion);

var chunks = await db.Chunks
    .OrderBy(c => EF.Functions.VectorDistance(
        "cosine", c.Embedding,
        new SqlVector<float>(queryVector)))
    .Take(5)
    .Select(c => new { c.Text, c.Source })
    .ToListAsync();
Filtered Retrieval
Always filter before searching when you can. It reduces the search space and improves relevance:
var chunks = await db.Chunks
    .Where(c => c.Category == "policies")
    .Where(c => c.IndexedAt >= DateTime.UtcNow
        .AddMonths(-3))
    .OrderBy(c => EF.Functions.VectorDistance(
        "cosine", c.Embedding,
        new SqlVector<float>(queryVector)))
    .Take(5)
    .ToListAsync();
Hybrid Retrieval
Combine vector similarity with keyword matching for the best results. On Cosmos DB, EF Core 10 provides native hybrid search:
var results = await db.Chunks
    .OrderBy(c => EF.Functions.Rrf(
        EF.Functions.FullTextScore(
            c.Text, "refund digital"),
        EF.Functions.VectorDistance(
            c.Embedding, queryVector)))
    .Take(5)
    .ToListAsync();
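If your backend has no native hybrid search (the SQL Server path shown earlier, for example), you can fuse the two rankings yourself. A minimal reciprocal rank fusion sketch, assuming the DocumentChunk entity from Option A and that you have already run a vector query and a keyword query separately:

// Manual reciprocal rank fusion over two already-ranked lists
static List<DocumentChunk> FuseWithRrf(
    IReadOnlyList<DocumentChunk> vectorRanked,
    IReadOnlyList<DocumentChunk> keywordRanked,
    int topK = 5,
    int k = 60) // standard RRF damping constant
{
    var scores = new Dictionary<int, double>();

    void Accumulate(IReadOnlyList<DocumentChunk> ranked)
    {
        for (int rank = 0; rank < ranked.Count; rank++)
        {
            var id = ranked[rank].Id;
            scores[id] = scores.GetValueOrDefault(id)
                + 1.0 / (k + rank + 1);
        }
    }

    Accumulate(vectorRanked);
    Accumulate(keywordRanked);

    var byId = vectorRanked.Concat(keywordRanked)
        .GroupBy(c => c.Id)
        .ToDictionary(g => g.Key, g => g.First());

    return scores
        .OrderByDescending(kv => kv.Value)
        .Take(topK)
        .Select(kv => byId[kv.Key])
        .ToList();
}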
Relevance Thresholds
Not every query has a relevant answer in your database. Set a distance threshold to avoid returning garbage:
var results = await db.Chunks
    .Select(c => new
    {
        c.Text,
        c.Source,
        Distance = EF.Functions.VectorDistance(
            "cosine", c.Embedding,
            new SqlVector<float>(queryVector))
    })
    .Where(r => r.Distance < 0.35)
    .OrderBy(r => r.Distance)
    .Take(5)
    .ToListAsync();

if (!results.Any())
    return "I don't have information about that topic.";
A cosine distance below 0.35 typically indicates meaningful relevance. Above 0.5, the results are likely noise. Tune based on your data.
Stage 5: Generation — Building the Prompt
The prompt is where retrieval and generation meet. A well-structured prompt makes the difference between a useful answer and a hallucinated one.
The Basic RAG Prompt
public async Task<string> GenerateAnswerAsync(
    string question,
    List<RetrievedChunk> chunks,
    IChatClient chat)
{
    var context = string.Join("\n\n",
        chunks.Select((c, i) =>
            $"[Source {i + 1}: {c.Source}]\n{c.Text}"));

    var prompt = $"""
        You are a helpful assistant that answers
        questions based on the provided context.

        Rules:
        1. Only use information from the context below.
        2. If the context does not contain the answer,
           say "I don't have information about that."
        3. Cite your sources using [Source N] format.
        4. Be concise and direct.

        Context:
        {context}

        Question: {question}
        """;

    var response = await chat.GetResponseAsync(prompt);
    return response.Text;
}
Prompt Design Principles
Be explicit about grounding. Tell the model to only use the provided context. Without this instruction, models happily mix retrieved context with training knowledge, which defeats the purpose of RAG.
Include source attribution. By numbering sources and instructing the model to cite them, you get verifiable answers. Users can check the cited source to confirm accuracy.
Handle the "no answer" case. Explicitly instruct the model what to do when the context does not contain the answer. Without this, models confabulate convincing but incorrect responses.
Keep the system prompt short. The system message should establish behavior, not provide content. Put all retrieved context in the user message or a dedicated context section.
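Applied to that last principle, the same prompt can be split into a short system message that sets behavior and a user message that carries the retrieved context. A sketch using Microsoft.Extensions.AI chat messages, where context and question come from the retrieval step:

// Behavior in the system message, content in the user message
var messages = new List<ChatMessage>
{
    new(ChatRole.System,
        "Answer only from the provided context. " +
        "Cite sources as [Source N]. If the context " +
        "does not contain the answer, say so."),
    new(ChatRole.User,
        $"Context:\n{context}\n\nQuestion: {question}")
};

var response = await chat.GetResponseAsync(messages);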
The Complete Production RAG Service
Here is everything combined into a clean, injectable service:
public class RagService
{
    private readonly AppDbContext _db;
    private readonly IEmbeddingGenerator<string, Embedding<float>> _embedder;
    private readonly IChatClient _chat;
    private readonly ILogger<RagService> _logger;

    public RagService(
        AppDbContext db,
        IEmbeddingGenerator<string, Embedding<float>> embedder,
        IChatClient chat,
        ILogger<RagService> logger)
    {
        _db = db;
        _embedder = embedder;
        _chat = chat;
        _logger = logger;
    }

    public async Task<RagResponse> AskAsync(
        string question,
        string? category = null,
        int topK = 5,
        double maxDistance = 0.35)
    {
        // 1. Embed the question
        var queryVector = new SqlVector<float>(
            await _embedder.GenerateVectorAsync(question));

        // 2. Retrieve relevant chunks
        var query = _db.Chunks.AsQueryable();

        if (category is not null)
            query = query.Where(c => c.Category == category);

        var chunks = await query
            .Select(c => new
            {
                c.Text,
                c.Source,
                Distance = EF.Functions.VectorDistance(
                    "cosine", c.Embedding, queryVector)
            })
            .Where(c => c.Distance < maxDistance)
            .OrderBy(c => c.Distance)
            .Take(topK)
            .ToListAsync();

        _logger.LogInformation(
            "Retrieved {Count} chunks for question: {Q}",
            chunks.Count, question);

        if (!chunks.Any())
        {
            return new RagResponse
            {
                Answer = "I don't have information " +
                    "about that topic in my knowledge base.",
                Sources = [],
                ChunksUsed = 0
            };
        }

        // 3. Build prompt with context
        var context = string.Join("\n\n",
            chunks.Select((c, i) =>
                $"[Source {i + 1}: {c.Source}]\n{c.Text}"));

        var prompt = $"""
            Answer the question based only on the
            context below. Cite sources using
            [Source N] format. If the context does not
            contain the answer, say so.

            Context:
            {context}

            Question: {question}
            """;

        // 4. Generate response
        var response = await _chat.GetResponseAsync(prompt);

        return new RagResponse
        {
            Answer = response.Text,
            Sources = chunks
                .Select(c => c.Source)
                .Distinct()
                .ToList(),
            ChunksUsed = chunks.Count
        };
    }
}

public class RagResponse
{
    public string Answer { get; set; } = "";
    public List<string> Sources { get; set; } = [];
    public int ChunksUsed { get; set; }
}
This service is provider-agnostic (IEmbeddingGenerator + IChatClient), supports category filtering, enforces relevance thresholds, includes source attribution, handles the "no answer" case gracefully, and logs retrieval metrics.
Where RAG Fits in Clean Architecture
RAG is not a separate system. It is a pattern within your application architecture.
Your Application
  Domain          → Entities, business rules
  Application     → RagService, IndexingService, commands, queries
  Infrastructure  → EF Core (vector storage), embedding providers, LLM providers
  Presentation    → REST endpoints, MCP tools
The RagService lives in the Application layer. It depends on abstractions (IEmbeddingGenerator, IChatClient, DbContext) defined or referenced in that layer. Infrastructure provides the implementations. Presentation exposes the service through REST endpoints, MCP tools, or both.
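A sketch of that wiring in the composition root; the connection string, API key handling, and service lifetimes are illustrative:

// Program.cs: Infrastructure provides the concrete implementations
builder.Services.AddDbContext<AppDbContext>(options =>
    options.UseSqlServer(connectionString));

builder.Services.AddEmbeddingGenerator(
    new OpenAIClient(apiKey)
        .GetEmbeddingClient("text-embedding-3-small")
        .AsIEmbeddingGenerator());

builder.Services.AddChatClient(
    new OpenAIClient(apiKey)
        .GetChatClient("gpt-4o")
        .AsIChatClient());

builder.Services.AddScoped<RagService>();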
This means your RAG pipeline follows the same architectural rules as the rest of your application. Testing is straightforward: mock the embedding generator and chat client, provide a test database, and verify retrieval quality in isolation.
RAG Through MCP
RAG and MCP complement each other naturally. An MCP server can expose RAG as a tool that any AI agent can call:
[McpToolType]
public class KnowledgeBaseTools
{
    private readonly RagService _rag;

    public KnowledgeBaseTools(RagService rag)
        => _rag = rag;

    [McpTool("search_knowledge_base")]
    [Description(
        "Search the company knowledge base using " +
        "semantic search. Returns relevant documents " +
        "with source citations. Use when the user " +
        "asks about company policies, procedures, " +
        "products, or internal documentation.")]
    public async Task<ToolResult> SearchKnowledgeBase(
        [Description("The user's question")]
        string question,
        [Description("Category filter (optional)")]
        string? category = null)
    {
        var response = await _rag.AskAsync(
            question, category);

        var result = $"{response.Answer}\n\n" +
            $"Sources: {string.Join(", ", response.Sources)}\n" +
            $"Chunks used: {response.ChunksUsed}";

        return ToolResult.Success(result);
    }
}
Now any MCP-compatible AI client (Claude, Copilot, Semantic Kernel agents) can search your knowledge base through a standardized protocol. The RAG pipeline is an implementation detail behind the tool. The AI agent does not need to know about embeddings, vectors, or chunking. It just calls a tool and gets grounded answers.
Common Pitfalls and How to Avoid Them
Chunks too large (over 512 tokens). Wastes context window space and dilutes relevance. The retrieved chunk contains the answer somewhere in 500 tokens of surrounding text. The model has to find the needle.
Chunks too small (under 64 tokens). Loses context. A 50-token chunk rarely contains enough information to be useful on its own. "The refund period is 24 hours" without knowing what product or under what conditions is incomplete.
No metadata filtering. Vector similarity alone retrieves the most semantically similar chunks, but "similar" is not always "relevant." A question about refund policies might retrieve chunks about return shipping procedures because they are semantically close. Category and date filters narrow the search space.
Embedding the wrong text. If your documents have titles, embed "Title: Refund Policy. Content: Digital products are non-refundable..." not just the content. The title carries significant semantic signal.
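For example, a hypothetical ingestion step that folds the title into the text that gets embedded (document.Title and chunkText are illustrative names):

// Embed title + content so the title's semantics land in the vector
var textToEmbed = $"Title: {document.Title}\n\n{chunkText}";
var vector = await embedder.GenerateVectorAsync(textToEmbed);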
Ignoring the "no answer" case. Without a relevance threshold, RAG returns the "least irrelevant" chunk for every question, and the model generates a convincing answer from irrelevant context. Always set a distance threshold and handle the case where nothing is relevant.
Not measuring retrieval quality. You can have a perfect LLM and a perfect prompt, but if retrieval returns wrong chunks, the output is wrong. Measure retrieval precision and recall separately from generation quality. Build a test set of questions with known relevant chunks and evaluate your retrieval pipeline against it.
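A minimal sketch of such a harness, assuming a hand-labeled test set that maps questions to the chunk IDs a correct retrieval should return, and a hypothetical RetrieveTopKAsync wrapper around your retrieval query:

// question -> IDs of the chunks that should be retrieved
var testSet = new Dictionary<string, HashSet<int>>
{
    ["Can I get a refund on a digital purchase?"] = new() { 42 },
    ["How do I reset my password?"] = new() { 17, 18 }
};

double totalRecall = 0;

foreach (var (question, expectedIds) in testSet)
{
    var retrievedIds = (await RetrieveTopKAsync(question, topK: 5))
        .Select(c => c.Id)
        .ToHashSet();

    totalRecall += expectedIds.Count(id => retrievedIds.Contains(id))
        / (double)expectedIds.Count;
}

Console.WriteLine($"Average recall@5: {totalRecall / testSet.Count:P0}");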
Performance Tips
Cache embeddings for common queries. If users frequently ask similar questions, cache the query embedding and retrieval results. A semantic cache (search cached queries by vector similarity) can serve repeated questions without hitting the embedding API.
Pre-compute embeddings during ingestion. Never compute embeddings at query time for documents. All document embeddings should be generated during the indexing phase and stored alongside the text.
Use appropriate dimensions. OpenAI's text-embedding-3-small supports dimension reduction. If 1536 dimensions is too expensive for storage, you can generate 512-dimension embeddings with modest accuracy loss. Test with your data to find the sweet spot.
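A sketch of that with Microsoft.Extensions.AI, which exposes the request through EmbeddingGenerationOptions (assuming your provider and model support reduced dimensions):

// Request 512-dimension vectors instead of the model's default 1536
var options = new EmbeddingGenerationOptions { Dimensions = 512 };

var vector = await embedder.GenerateVectorAsync(
    "Digital products are non-refundable after download.",
    options);

// Keep the storage side in sync, e.g. a vector(512) column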
Index your vectors. For datasets over 100,000 rows, vector search without an index is a full table scan. Use DiskANN indexes on SQL Server 2025 or HNSW indexes on dedicated vector databases.
Batch everything. Batch embedding generation during indexing. Batch retrieval if handling multiple queries. Batch database operations when ingesting large document sets.
Conclusion
RAG is not complicated. It is five stages: chunk, embed, store, retrieve, generate. The .NET ecosystem provides production-ready tools for every stage. Semantic Kernel's TextChunker handles splitting. Microsoft.Extensions.AI provides embedding abstractions. EF Core 10 or Microsoft.Extensions.VectorData handles storage and retrieval. IChatClient handles generation.
The architecture is clean. The RagService lives in the Application layer. Infrastructure provides the implementations. Presentation exposes it through REST or MCP. Testing is straightforward. Provider switching is a configuration change.
The difference between a naive RAG implementation and a production-grade one is not the pattern. It is the details. Chunk size. Overlap. Metadata filters. Relevance thresholds. Prompt structure. Source attribution. Error handling for the "no answer" case.
Get those details right and RAG transforms your .NET application from a system that can only answer questions about public knowledge into one that can answer questions about your data. Your policies. Your products. Your documentation. Your knowledge.
That is the whole point.