
Brian Spann

Production RAG with Semantic Kernel: Patterns, Chunking, and Retrieval Strategies

Retrieval-Augmented Generation (RAG) is the pattern that makes LLMs genuinely useful for enterprise applications. Instead of relying solely on training data, RAG grounds responses in your actual documents, databases, and knowledge bases.

In Part 3, we explored memory and vector stores. Now we'll build production-ready RAG systems with proper chunking, retrieval strategies, and evaluation.

The RAG Pipeline

Every RAG system follows this flow:

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   INGEST     │ -> │    INDEX     │ -> │   RETRIEVE   │
│ Load docs    │    │ Chunk + embed│    │ Vector search│
└──────────────┘    └──────────────┘    └──────────────┘
                                              │
                                              v
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   RESPOND    │ <- │   AUGMENT    │ <- │    RANK      │
│ LLM generates│    │ Build prompt │    │ Score + filter│
└──────────────┘    └──────────────┘    └──────────────┘

Let's build each component properly.

Document Chunking: The Foundation

Chunking is where most RAG systems succeed or fail. Too large, and you waste context window space. Too small, and you lose coherence. Let's explore strategies.
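The tradeoff is easy to see with a naive fixed-size splitter, before reaching for a library. This is a sketch only: sizes are in characters rather than tokens, and it assumes the overlap is smaller than the chunk size.

```csharp
using System;
using System.Collections.Generic;

// Naive fixed-size chunker with overlap -- illustrates the size/overlap tradeoff.
// Sizes are in characters for simplicity; real chunkers count tokens.
// Assumes overlap < chunkSize.
List<string> ChunkWithOverlap(string text, int chunkSize, int overlap)
{
    var chunks = new List<string>();
    var step = chunkSize - overlap;
    for (int start = 0; start < text.Length; start += step)
    {
        chunks.Add(text.Substring(start, Math.Min(chunkSize, text.Length - start)));
        if (start + chunkSize >= text.Length) break;
    }
    return chunks;
}

var text = new string('a', 1000);
var big = ChunkWithOverlap(text, 500, 50);   // few chunks, lots of context each
var small = ChunkWithOverlap(text, 100, 20); // many chunks, tighter focus
Console.WriteLine($"{big.Count} vs {small.Count} chunks");
```

Larger chunks mean fewer, richer retrieval units; smaller chunks retrieve more precisely but fragment ideas across boundaries.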

Semantic Kernel's TextChunker

using Microsoft.SemanticKernel.Text;

var documentText = await File.ReadAllTextAsync("docs/user-manual.md");

// Split by paragraphs with overlap
// (TextChunker is experimental: suppress SKEXP0050 to use it)
var chunks = TextChunker.SplitPlainTextParagraphs(
    lines: documentText.Split('\n').ToList(),
    maxTokensPerParagraph: 500,    // Target chunk size
    overlapTokens: 50);            // Overlap for context continuity

Console.WriteLine($"Created {chunks.Count} chunks");
foreach (var (chunk, index) in chunks.Select((c, i) => (c, i)))
{
    Console.WriteLine($"Chunk {index}: {chunk.Length} chars, ~{chunk.Split(' ').Length} words");
}

Markdown-Aware Chunking

For structured documents, respect the hierarchy:

var markdownContent = await File.ReadAllTextAsync("docs/api-reference.md");
var lines = markdownContent.Split('\n').ToList();

// Markdown chunking respects headers and code blocks
var chunks = TextChunker.SplitMarkdownParagraphs(
    lines: lines,
    maxTokensPerParagraph: 500,
    overlapTokens: 50);

// Each chunk maintains markdown structure
foreach (var chunk in chunks.Take(3))
{
    Console.WriteLine("---CHUNK---");
    Console.WriteLine(chunk);
}

Custom Chunking Strategies

For complex documents, build domain-specific chunkers:

public class SemanticChunker
{
    private readonly ITextEmbeddingGenerationService _embeddingService;
    private readonly float _similarityThreshold;

    public SemanticChunker(
        ITextEmbeddingGenerationService embeddingService,
        float similarityThreshold = 0.85f)
    {
        _embeddingService = embeddingService;
        _similarityThreshold = similarityThreshold;
    }

    public async Task<List<string>> ChunkBySemanticSimilarityAsync(
        string text,
        int targetChunkSize = 500)
    {
        // Split into sentences
        var sentences = SplitIntoSentences(text);
        if (sentences.Length == 0)
            return new List<string>();

        // Generate embeddings for each sentence (one call for the whole batch)
        var embeddings = await _embeddingService.GenerateEmbeddingsAsync(sentences);

        var chunks = new List<string>();
        var currentChunk = new List<string> { sentences[0] };
        var currentEmbedding = embeddings[0];

        for (int i = 1; i < sentences.Length; i++)
        {
            var similarity = CosineSimilarity(currentEmbedding, embeddings[i]);
            var currentLength = string.Join(" ", currentChunk).Split(' ').Length;

            // Start new chunk if semantically different or too long
            if (similarity < _similarityThreshold || currentLength > targetChunkSize)
            {
                chunks.Add(string.Join(" ", currentChunk));
                currentChunk = new List<string>();
                currentEmbedding = embeddings[i];
            }

            currentChunk.Add(sentences[i]);
        }

        if (currentChunk.Count > 0)
            chunks.Add(string.Join(" ", currentChunk));

        return chunks;
    }

    private string[] SplitIntoSentences(string text)
    {
        // Requires: using System.Text.RegularExpressions;
        return Regex.Split(text, @"(?<=[.!?])\s+")
            .Where(s => !string.IsNullOrWhiteSpace(s))
            .ToArray();
    }

    private static float CosineSimilarity(ReadOnlyMemory<float> a, ReadOnlyMemory<float> b)
    {
        var x = a.Span;
        var y = b.Span;
        float dot = 0, magA = 0, magB = 0;
        for (int i = 0; i < x.Length; i++)
        {
            dot += x[i] * y[i];
            magA += x[i] * x[i];
            magB += y[i] * y[i];
        }
        return dot / (MathF.Sqrt(magA) * MathF.Sqrt(magB));
    }
}
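The regex sentence splitter is doing a lot of quiet work here, so it is worth checking in isolation. A quick sanity check, not production-grade segmentation (abbreviations like "e.g." will split incorrectly):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Same split the chunker uses: break after ., !, or ? followed by whitespace
var sentences = Regex.Split("Reboot the device. Did it help? Great!", @"(?<=[.!?])\s+")
    .Where(s => !string.IsNullOrWhiteSpace(s))
    .ToArray();

foreach (var s in sentences)
    Console.WriteLine(s);
// Three sentences: "Reboot the device." / "Did it help?" / "Great!"
```

For domains with heavy abbreviation use (legal, medical), swap in a proper sentence segmenter.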

Chunking with Metadata

Preserve document structure as metadata:

public record DocumentChunk
{
    public required string Text { get; init; }
    public required string DocumentId { get; init; }
    public required string DocumentTitle { get; init; }
    public int ChunkIndex { get; init; }
    public int TotalChunks { get; init; }
    public string? SectionHeader { get; init; }
    public int? PageNumber { get; init; }
    public Dictionary<string, string> CustomMetadata { get; init; } = new();
}

public class StructuredChunker
{
    public List<DocumentChunk> ChunkMarkdownDocument(string markdown, string docId, string title)
    {
        var chunks = new List<DocumentChunk>();
        var lines = markdown.Split('\n');
        var currentSection = "";
        var currentContent = new StringBuilder();
        var chunkIndex = 0;

        foreach (var line in lines)
        {
            // Track section headers
            if (line.StartsWith("# "))
                currentSection = line[2..].Trim();
            else if (line.StartsWith("## "))
                currentSection = line[3..].Trim();

            currentContent.AppendLine(line);

            // Check if we've reached chunk size
            if (currentContent.Length > 2000)  // Characters, not tokens
            {
                chunks.Add(new DocumentChunk
                {
                    Text = currentContent.ToString().Trim(),
                    DocumentId = docId,
                    DocumentTitle = title,
                    ChunkIndex = chunkIndex++,
                    TotalChunks = -1,  // Update after processing
                    SectionHeader = currentSection
                });

                // Keep some overlap
                var overlap = GetLastParagraph(currentContent.ToString());
                currentContent.Clear();
                currentContent.Append(overlap);
            }
        }

        // Don't forget the last chunk
        if (currentContent.Length > 0)
        {
            chunks.Add(new DocumentChunk
            {
                Text = currentContent.ToString().Trim(),
                DocumentId = docId,
                DocumentTitle = title,
                ChunkIndex = chunkIndex,
                TotalChunks = chunkIndex + 1,
                SectionHeader = currentSection
            });
        }

        // Update total chunks count; a for loop avoids mutating the list
        // while foreach is enumerating it (records need a with-expression)
        for (int i = 0; i < chunks.Count; i++)
        {
            chunks[i] = chunks[i] with { TotalChunks = chunks.Count };
        }

        return chunks;
    }

    private static string GetLastParagraph(string text)
    {
        // Overlap helper: carry the final paragraph into the next chunk
        var paragraphs = text.Split("\n\n", StringSplitOptions.RemoveEmptyEntries);
        return paragraphs.Length > 0 ? paragraphs[^1] + "\n" : string.Empty;
    }
}
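The header-tracking logic above is easy to verify on its own: each content line should inherit the nearest preceding header. A minimal standalone sketch (hypothetical lines):

```csharp
using System;
using System.Collections.Generic;

// Each non-header line inherits the most recent section header above it
var lines = new[] { "# Setup", "Install the SDK.", "## Configuration", "Set the API key." };
var sectionFor = new Dictionary<string, string>();
var currentSection = "";

foreach (var line in lines)
{
    if (line.StartsWith("# "))
        currentSection = line[2..].Trim();
    else if (line.StartsWith("## "))
        currentSection = line[3..].Trim();
    else
        sectionFor[line] = currentSection;
}

Console.WriteLine(sectionFor["Set the API key."]);  // Configuration
```

Storing the section header as metadata pays off at retrieval time: you can prepend it to the chunk text, or show it in citations.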

Building the RAG Plugin

Here's a complete RAG plugin that ties everything together:

public class RagPlugin
{
    private readonly ISemanticTextMemory _memory;
    private readonly string _collection;
    private readonly ILogger<RagPlugin> _logger;
    private readonly RagOptions _options;

    public RagPlugin(
        ISemanticTextMemory memory,
        IOptions<RagOptions> options,
        ILogger<RagPlugin> logger)
    {
        _memory = memory;
        _options = options.Value;
        _collection = options.Value.Collection;
        _logger = logger;
    }

    [KernelFunction("search_knowledge_base")]
    [Description("Searches the knowledge base for information relevant to a query")]
    public async Task<string> SearchAsync(
        [Description("The search query")] string query,
        [Description("Maximum number of results (default: 5)")] int limit = 5)
    {
        _logger.LogInformation("RAG search: {Query}", query);

        var results = await _memory
            .SearchAsync(_collection, query, limit, _options.MinRelevanceScore)
            .ToListAsync();

        if (results.Count == 0)
        {
            return "No relevant information found in the knowledge base.";
        }

        var contextBuilder = new StringBuilder();
        contextBuilder.AppendLine("## Relevant Information:\n");

        foreach (var result in results)
        {
            contextBuilder.AppendLine($"**Source: {result.Metadata.Description}** (Relevance: {result.Relevance:P0})");
            contextBuilder.AppendLine(result.Metadata.Text);
            contextBuilder.AppendLine();
        }

        return contextBuilder.ToString();
    }

    [KernelFunction("answer_from_knowledge_base")]
    [Description("Answers a question using the knowledge base, with citations")]
    public async Task<AnswerWithCitations> AnswerAsync(
        Kernel kernel,
        [Description("The question to answer")] string question)
    {
        _logger.LogInformation("RAG answer: {Question}", question);

        // Retrieve relevant context
        var searchResults = await _memory
            .SearchAsync(_collection, question, _options.MaxContextChunks, _options.MinRelevanceScore)
            .ToListAsync();

        if (searchResults.Count == 0)
        {
            return new AnswerWithCitations
            {
                Answer = "I don't have information about that in my knowledge base.",
                Citations = new List<Citation>()
            };
        }

        // Build context with citation markers
        var contextBuilder = new StringBuilder();
        var citations = new List<Citation>();

        for (int i = 0; i < searchResults.Count; i++)
        {
            var result = searchResults[i];
            var citationId = $"[{i + 1}]";

            citations.Add(new Citation
            {
                Id = citationId,
                Source = result.Metadata.Description ?? result.Metadata.Id,
                Text = result.Metadata.Text,
                Relevance = result.Relevance
            });

            contextBuilder.AppendLine($"{citationId} {result.Metadata.Text}");
            contextBuilder.AppendLine();
        }

        // Generate answer with citation instructions
        var prompt = $"""
            Answer the following question based ONLY on the provided context.
            Include citation numbers [1], [2], etc. when using information from the sources.
            If the answer isn't in the context, say "I don't have that specific information."

            ## Context:
            {contextBuilder}

            ## Question:
            {question}

            ## Answer (with citations):
            """;

        var answer = await kernel.InvokePromptAsync<string>(prompt);

        return new AnswerWithCitations
        {
            Answer = answer ?? "Unable to generate an answer.",
            Citations = citations
        };
    }
}

public record AnswerWithCitations
{
    public required string Answer { get; init; }
    public required List<Citation> Citations { get; init; }
}

public record Citation
{
    public required string Id { get; init; }
    public required string Source { get; init; }
    public required string Text { get; init; }
    public double Relevance { get; init; }
}

public class RagOptions
{
    public string Collection { get; set; } = "knowledge-base";
    public int MaxContextChunks { get; set; } = 5;
    public double MinRelevanceScore { get; set; } = 0.7;
}

Retrieval Strategies

Similarity Search (Default)

Pure vector similarity—fast and effective for most cases:

var results = await memory.SearchAsync(
    collection: "docs",
    query: userQuestion,
    limit: 5,
    minRelevanceScore: 0.75);

Hybrid Search with Azure AI Search

Combine vector similarity with keyword matching:

using Azure.Search.Documents;
using Azure.Search.Documents.Models;

public class HybridSearchService
{
    private readonly SearchClient _searchClient;
    private readonly ITextEmbeddingGenerationService _embeddingService;

    public HybridSearchService(
        SearchClient searchClient,
        ITextEmbeddingGenerationService embeddingService)
    {
        _searchClient = searchClient;
        _embeddingService = embeddingService;
    }

    public async Task<List<SearchResult>> HybridSearchAsync(string query, int limit = 10)
    {
        // Generate query embedding
        var embeddings = await _embeddingService.GenerateEmbeddingsAsync(new[] { query });
        var queryVector = embeddings[0].ToArray();

        var options = new SearchOptions
        {
            Size = limit,
            QueryType = SearchQueryType.Semantic,
            SemanticSearch = new SemanticSearchOptions
            {
                SemanticConfigurationName = "my-semantic-config",
                QueryCaption = new QueryCaption(QueryCaptionType.Extractive),
                QueryAnswer = new QueryAnswer(QueryAnswerType.Extractive)
            },
            VectorSearch = new VectorSearchOptions
            {
                Queries =
                {
                    new VectorizedQuery(queryVector)
                    {
                        KNearestNeighborsCount = limit,
                        Fields = { "contentVector" }
                    }
                }
            },
            Select = { "id", "content", "title", "metadata" }
        };

        var response = await _searchClient.SearchAsync<SearchDocument>(query, options);

        var results = new List<SearchResult>();
        await foreach (var result in response.Value.GetResultsAsync())
        {
            results.Add(new SearchResult
            {
                Id = result.Document["id"].ToString()!,
                Content = result.Document["content"].ToString()!,
                Title = result.Document["title"]?.ToString(),
                Score = result.Score ?? 0
            });
        }

        return results;
    }
}

public record SearchResult
{
    public required string Id { get; init; }
    public required string Content { get; init; }
    public string? Title { get; init; }
    public double Score { get; init; }
}

Maximum Marginal Relevance (MMR)

Reduce redundancy by diversifying results:

public class MmrRetriever
{
    private readonly ISemanticTextMemory _memory;
    private readonly ITextEmbeddingGenerationService _embeddingService;
    private readonly float _lambda;  // Balance relevance vs diversity (0.5 = balanced)

    public MmrRetriever(
        ISemanticTextMemory memory,
        ITextEmbeddingGenerationService embeddingService,
        float lambda = 0.5f)
    {
        _memory = memory;
        _embeddingService = embeddingService;
        _lambda = lambda;
    }

    public async Task<List<MemoryQueryResult>> SearchWithMmrAsync(
        string collection,
        string query,
        int limit = 5,
        int candidateMultiplier = 3)
    {
        // Get more candidates than we need
        var candidates = await _memory
            .SearchAsync(collection, query, limit * candidateMultiplier, 0.5)
            .ToListAsync();

        if (candidates.Count <= limit)
            return candidates;

        // Embed all candidate texts once, up front -- re-embedding inside the
        // selection loop would cost O(n^2) embedding calls
        var embeddings = await _embeddingService.GenerateEmbeddingsAsync(
            candidates.Select(c => c.Metadata.Text).ToList());

        var selected = new List<int>();
        var remaining = Enumerable.Range(0, candidates.Count).ToList();

        while (selected.Count < limit && remaining.Count > 0)
        {
            int best = -1;
            float bestScore = float.MinValue;

            foreach (var i in remaining)
            {
                // Relevance to the query
                var relevance = (float)candidates[i].Relevance;

                // Maximum similarity to already selected results
                float maxSimToSelected = 0;
                foreach (var j in selected)
                {
                    var sim = CosineSimilarity(embeddings[i], embeddings[j]);
                    maxSimToSelected = Math.Max(maxSimToSelected, sim);
                }

                // MMR score: reward relevance, penalize redundancy
                var mmrScore = _lambda * relevance - (1 - _lambda) * maxSimToSelected;

                if (mmrScore > bestScore)
                {
                    bestScore = mmrScore;
                    best = i;
                }
            }

            selected.Add(best);
            remaining.Remove(best);
        }

        return selected.Select(i => candidates[i]).ToList();
    }

    private static float CosineSimilarity(ReadOnlyMemory<float> a, ReadOnlyMemory<float> b)
    {
        var x = a.Span;
        var y = b.Span;
        float dot = 0, magA = 0, magB = 0;
        for (int k = 0; k < x.Length; k++)
        {
            dot += x[k] * y[k];
            magA += x[k] * x[k];
            magB += y[k] * y[k];
        }
        return dot / (MathF.Sqrt(magA) * MathF.Sqrt(magB));
    }
}
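To see why MMR helps, score three candidates by hand: two near-duplicates and one distinct chunk. The numbers below are toy values, with λ = 0.5:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy MMR: relevance scores and a precomputed pairwise similarity matrix
var relevance = new[] { 0.90f, 0.85f, 0.50f };
var sim = new float[,]
{
    { 1.0f, 0.95f, 0.10f },  // candidates 0 and 1 are near-duplicates
    { 0.95f, 1.0f, 0.10f },
    { 0.10f, 0.10f, 1.0f },
};
const float lambda = 0.5f;

var selected = new List<int>();
var remaining = Enumerable.Range(0, 3).ToList();

while (selected.Count < 2)
{
    // Highest MMR score: relevance minus similarity to anything already picked
    var best = remaining
        .OrderByDescending(i =>
            lambda * relevance[i]
            - (1 - lambda) * (selected.Count == 0 ? 0 : selected.Max(j => sim[i, j])))
        .First();
    selected.Add(best);
    remaining.Remove(best);
}

// Pure similarity would pick {0, 1}; MMR skips the duplicate and picks {0, 2}
Console.WriteLine(string.Join(", ", selected));
```

Candidate 1 loses despite a 0.85 relevance because its 0.95 similarity to the already-selected candidate 0 drags its MMR score below zero.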

Context Window Management

LLMs have limited context windows. Manage them carefully:

public class ContextWindowManager
{
    private readonly int _maxTokens;
    private readonly int _reservedForResponse;
    private readonly Tiktoken.Encoding _tokenizer;

    public ContextWindowManager(int maxTokens = 128000, int reservedForResponse = 4000)
    {
        _maxTokens = maxTokens;
        _reservedForResponse = reservedForResponse;
        _tokenizer = Tiktoken.Encoding.ForModel("gpt-4o");  // from the Tiktoken NuGet package
    }

    public string BuildOptimalContext(
        string systemPrompt,
        List<MemoryQueryResult> retrievedChunks,
        string userQuery,
        List<ChatMessage>? conversationHistory = null)  // ChatMessage: a simple record with Role and Content
    {
        var budget = _maxTokens - _reservedForResponse;
        var usedTokens = 0;

        // System prompt (required)
        usedTokens += CountTokens(systemPrompt);

        // User query (required)
        usedTokens += CountTokens(userQuery);

        // Conversation history (most recent first, within budget)
        var includedHistory = new List<ChatMessage>();
        if (conversationHistory != null)
        {
            var historyBudget = (int)(budget * 0.2);  // 20% for history
            var historyTokens = 0;

            foreach (var msg in conversationHistory.AsEnumerable().Reverse())
            {
                var msgTokens = CountTokens(msg.Content);
                if (historyTokens + msgTokens > historyBudget)
                    break;

                includedHistory.Insert(0, msg);
                historyTokens += msgTokens;
            }
            usedTokens += historyTokens;
        }

        // Retrieved context (fill remaining budget)
        var contextBudget = budget - usedTokens;
        var includedChunks = new List<string>();
        var contextTokens = 0;

        foreach (var chunk in retrievedChunks.OrderByDescending(c => c.Relevance))
        {
            var chunkTokens = CountTokens(chunk.Metadata.Text);
            if (contextTokens + chunkTokens > contextBudget)
                break;

            includedChunks.Add(chunk.Metadata.Text);
            contextTokens += chunkTokens;
        }

        // Build final prompt
        return $"""
            {systemPrompt}

            ## Retrieved Context:
            {string.Join("\n\n---\n\n", includedChunks)}

            ## Conversation History:
            {string.Join("\n", includedHistory.Select(m => $"{m.Role}: {m.Content}"))}

            ## Current Question:
            {userQuery}
            """;
    }

    private int CountTokens(string text) => _tokenizer.CountTokens(text);
}
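The budgeting arithmetic can be checked without a real tokenizer by substituting a rough chars/4 estimate (a common heuristic for English text, not exact). All sizes below are made up for illustration:

```csharp
using System;
using System.Linq;

// Rough token estimate: ~4 characters per token for English text
int EstimateTokens(string text) => Math.Max(1, text.Length / 4);

const int maxTokens = 8000;
const int reservedForResponse = 1000;
var budget = maxTokens - reservedForResponse;   // 7000 tokens for the prompt
var historyBudget = (int)(budget * 0.2);        // 1400 tokens for history
var contextBudget = budget - historyBudget;     // 5600 tokens for retrieved context

// Greedily fill the context budget, most relevant chunk first
var chunks = new[] { new string('x', 8000), new string('y', 8000), new string('z', 20000) };
var used = 0;
var included = chunks.TakeWhile(c =>
{
    if (used + EstimateTokens(c) > contextBudget) return false;
    used += EstimateTokens(c);
    return true;
}).Count();

Console.WriteLine($"{included} chunks fit in a {contextBudget}-token context budget");
```

The third chunk is dropped entirely rather than truncated mid-sentence, which is usually the right call: a partial chunk tends to confuse the model more than a missing one.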

Evaluating RAG Quality

Measure and improve your RAG system:

public class RagEvaluator
{
    private readonly Kernel _kernel;
    private readonly ISemanticTextMemory _memory;

    public async Task<RagEvaluationResult> EvaluateAsync(
        string question,
        string expectedAnswer,
        string actualAnswer,
        List<MemoryQueryResult> retrievedContext)
    {
        // 1. Retrieval metrics
        var retrievalScore = await EvaluateRetrievalAsync(question, retrievedContext);

        // 2. Answer relevance
        var relevanceScore = await EvaluateAnswerRelevanceAsync(question, actualAnswer);

        // 3. Faithfulness (is answer grounded in context?)
        var faithfulnessScore = await EvaluateFaithfulnessAsync(actualAnswer, retrievedContext);

        // 4. Correctness (if we have ground truth)
        var correctnessScore = await EvaluateCorrectnessAsync(expectedAnswer, actualAnswer);

        return new RagEvaluationResult
        {
            RetrievalScore = retrievalScore,
            RelevanceScore = relevanceScore,
            FaithfulnessScore = faithfulnessScore,
            CorrectnessScore = correctnessScore,
            OverallScore = (retrievalScore + relevanceScore + faithfulnessScore + correctnessScore) / 4
        };
    }

    private async Task<double> EvaluateRetrievalAsync(
        string question, 
        List<MemoryQueryResult> context)
    {
        var prompt = $"""
            Rate how relevant the retrieved documents are to the question on a scale of 0-1.

            Question: {question}

            Retrieved Documents:
            {string.Join("\n---\n", context.Select(c => c.Metadata.Text))}

            Relevance Score (0-1, just the number):
            """;

        var result = await _kernel.InvokePromptAsync<string>(prompt);
        // Invariant culture: parse "0.85" the same way regardless of host locale
        return double.TryParse(result?.Trim(), NumberStyles.Float, CultureInfo.InvariantCulture, out var score) ? score : 0;
    }

    private async Task<double> EvaluateFaithfulnessAsync(
        string answer,
        List<MemoryQueryResult> context)
    {
        var prompt = $"""
            Rate how well the answer is supported by the provided context on a scale of 0-1.
            A score of 1 means every claim in the answer is directly supported by the context.
            A score of 0 means the answer contains claims not found in the context.

            Context:
            {string.Join("\n---\n", context.Select(c => c.Metadata.Text))}

            Answer:
            {answer}

            Faithfulness Score (0-1, just the number):
            """;

        var result = await _kernel.InvokePromptAsync<string>(prompt);
        // Invariant culture: parse "0.85" the same way regardless of host locale
        return double.TryParse(result?.Trim(), NumberStyles.Float, CultureInfo.InvariantCulture, out var score) ? score : 0;
    }

    // EvaluateAnswerRelevanceAsync and EvaluateCorrectnessAsync follow the same
    // LLM-as-judge pattern: prompt for a 0-1 score and parse the number
}

public record RagEvaluationResult
{
    public double RetrievalScore { get; init; }
    public double RelevanceScore { get; init; }
    public double FaithfulnessScore { get; init; }
    public double CorrectnessScore { get; init; }
    public double OverallScore { get; init; }
}
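LLM judges rarely return a perfectly bare number, so the parsing path deserves its own test. A sketch of a defensive score parser (invariant culture, trimmed input, clamped range; the helper name is mine):

```csharp
using System;
using System.Globalization;

// Parse a 0-1 score from whatever string an LLM judge actually returns
double ParseScore(string? raw) =>
    double.TryParse(raw?.Trim(), NumberStyles.Float, CultureInfo.InvariantCulture, out var s)
        ? Math.Clamp(s, 0, 1)
        : 0;

Console.WriteLine(ParseScore("0.85\n"));       // trailing newline: still parses
Console.WriteLine(ParseScore("Score: 0.85"));  // extra prose: falls back to 0
Console.WriteLine(ParseScore("1.4"));          // out of range: clamped to 1
```

When the fallback fires often, tighten the judge prompt or use structured output rather than trying to regex the number out of prose.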

Running Evaluation Suites

public class RagTestSuite
{
    private readonly RagPlugin _ragPlugin;
    private readonly RagEvaluator _evaluator;
    private readonly Kernel _kernel;

    public async Task<EvaluationReport> RunTestSuiteAsync(List<TestCase> testCases)
    {
        var results = new List<TestResult>();

        foreach (var testCase in testCases)
        {
            // Get RAG response
            var answer = await _ragPlugin.AnswerAsync(_kernel, testCase.Question);

            // Get retrieved context for evaluation (ParseContext, not shown, splits
            // the formatted search output back into results; in production, expose
            // the raw results from the plugin instead of re-parsing strings)
            var context = await _ragPlugin.SearchAsync(testCase.Question);

            // Evaluate
            var evaluation = await _evaluator.EvaluateAsync(
                testCase.Question,
                testCase.ExpectedAnswer,
                answer.Answer,
                ParseContext(context));

            results.Add(new TestResult
            {
                TestCase = testCase,
                ActualAnswer = answer.Answer,
                Citations = answer.Citations,
                Evaluation = evaluation
            });
        }

        return new EvaluationReport
        {
            Results = results,
            AverageRetrievalScore = results.Average(r => r.Evaluation.RetrievalScore),
            AverageFaithfulness = results.Average(r => r.Evaluation.FaithfulnessScore),
            AverageRelevance = results.Average(r => r.Evaluation.RelevanceScore),
            AverageCorrectness = results.Average(r => r.Evaluation.CorrectnessScore)
        };
    }
}

public record TestCase(string Question, string ExpectedAnswer, List<string>? ExpectedSources = null);

Production Checklist

Before deploying RAG:

  • [ ] Chunking tuned: Test different sizes, measure retrieval quality
  • [ ] Overlap configured: Prevent context loss at chunk boundaries
  • [ ] Minimum relevance set: Filter low-quality retrievals
  • [ ] Context window managed: Never exceed model limits
  • [ ] Citations working: Users can verify sources
  • [ ] Evaluation baseline: Know your quality metrics
  • [ ] Monitoring in place: Track retrieval scores, latency, failures
  • [ ] Index refresh strategy: Keep embeddings current with source changes

What's Next

In this article, we built production-ready RAG:

  • Chunking strategies: From simple to semantic
  • RAG plugin: Complete implementation with citations
  • Retrieval patterns: Similarity, hybrid, MMR
  • Context management: Optimizing for token budgets
  • Evaluation: Measuring retrieval and answer quality

In Part 5, we'll explore AI agents—ChatCompletionAgent, multi-agent orchestration with AgentGroupChat, and building autonomous systems that can reason and act.


This is Part 4 of a 5-part series on Semantic Kernel. Next up: AI Agents and Orchestration
