Frank Noorloos

RAG without the cloud: .NET + Semantic Kernel + Ollama on your laptop

1. Introduction

Generative AI is now impossible to ignore in software development. Tools like ChatGPT and GitHub Copilot can answer technical questions in seconds. But there is a downside: you become dependent on external APIs (mainly from US providers), you pay per prompt, and you may have to hand over sensitive company data. An alternative is to run language models locally with Ollama. In this article I show how to make documents searchable locally with a console application that applies Retrieval Augmented Generation (RAG). We use Ollama together with the language model llama3.2.

What is Retrieval Augmented Generation (RAG)?

RAG is a pattern in which you give a Large Language Model (LLM) not only the question but also extra context from your own sources. Via a semantic search query (retrieval) you pull the most relevant text fragments from documents, databases, or API responses. You add those fragments to the prompt, so that during answering (generation) the LLM can use up-to-date and domain-specific knowledge. The result is fewer hallucinations and direct reuse of your existing data assets.

2. Requirements and setup

The AI world moves fast, and many components are still in preview or experimental. The configuration below works at the time of writing. Check the GitHub repository for the most up-to-date information.

  • .NET 9 SDK
  • Ollama v0.9.5 or newer
  • 4 GB (V)RAM for the language model
  • Licenses: Llama 3.2 falls under Meta's Llama 3.2 Community License. Semantic Kernel and the demo code are MIT.

Create a standard console application, add Semantic Kernel, and install the prerelease package Microsoft.SemanticKernel.Connectors.Ollama. Then start Ollama and download the model:

ollama serve
ollama run llama3.2

3. Architecture at a glance

The RAG demo consists of three core components that together form the chain from question to retrieval to answer.

3.1 Semantic Kernel (orchestrator)

  • .NET SDK for calling large language models, embeddings, chat history, and optionally function calling.
  • Keeps conversation state and makes it easy to enrich context with documents.
  • Has connectors to, among others, OpenAI, Azure OpenAI, Ollama, and multiple vector databases.

3.2 Ollama (local LLM runtime)

  • Starts and manages models locally (CPU/GPU), including model pulls.
  • Supports dozens of open source models. We use llama3.2.
  • Can also generate embeddings with the same language model.

3.3 Document storage and retrieval

  • In this demo: a simple in-memory store with cosine similarity (a calculation that measures similarity between two vectors) and top-k selection (take the k best-scoring matches).
  • Great for small document sets and ideal to understand the core of RAG.
  • For production or larger datasets: replace it with Qdrant or another vector database (a specialized datastore for vector embeddings with persistent storage, indexing, filtering, and horizontal scalability).
  • Tip: try Qdrant locally: docker run -p 6333:6333 qdrant/qdrant (starts an instance on your machine in seconds).

When a question is asked, it flows like this: user prompt -> embedding -> most relevant docs -> context in prompt -> language model answer.

4. Step by step implementation

First install two NuGet packages: Microsoft.SemanticKernel and Microsoft.SemanticKernel.Connectors.Ollama. For the Ollama connector you need the prerelease version. Also add the attribute below to the class, otherwise the project will not compile:

using System.Diagnostics.CodeAnalysis;

[Experimental("SKEXP0070")]
public static class SkOllama {

In the Main method we initialize the required components. The OllamaApiClient points to the default URL of the locally running Ollama API and we choose llama3.2 as the model. That same model also generates embeddings, so no separate embedding model is needed. This choice is a bit slower and sometimes less accurate than a specialized embedding model, but for a local proof of concept it is fine.

Next we read the documents. The application scans the documents folder, reads all markdown files, and generates an embedding for each document. You do not want to repeat this process every time; the larger the set, the longer it takes.

Finally we start the chat loop with the initialized services.

    public static async Task Main(string[] args)
    {
        using var ollamaClient = new OllamaApiClient(uriString: "http://localhost:11434", defaultModel: "llama3.2");

        // llama3.2 supports embeddings, so no separate model is required 
        var embeddingService = ollamaClient.AsTextEmbeddingGenerationService();

        var documentsPath = Path.Combine(AppContext.BaseDirectory, "documents");
        var documentStore = await ImportDocumentsFromDirectoryAsync(documentsPath, embeddingService);

        await ChatLoop(ollamaClient, embeddingService, documentStore);
    }
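
As a side note: llama3.2 doubles here as the embedding model. If you later want to try a dedicated embedding model, the swap is small. The snippet below is only a minimal sketch, assuming you have pulled such a model first (for example with ollama pull nomic-embed-text); it reuses the same connector API as the rest of the article.

    // Sketch: a second Ollama client dedicated to embeddings (the model name is an example).
    using var embeddingClient = new OllamaApiClient(
        uriString: "http://localhost:11434",
        defaultModel: "nomic-embed-text");

    // Chat keeps using llama3.2; only the embedding service changes.
    var embeddingService = embeddingClient.AsTextEmbeddingGenerationService();

Keep in mind that embeddings from different models are not comparable, so after a switch you need to regenerate the embeddings for all documents.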

Importing documents is straightforward. The application reads all files in the specified folder, generates an embedding per document, and stores both the text and the vector in the in-memory document store. During a chat question we can then retrieve both values directly.

    private static async Task<InMemoryDocumentStore> ImportDocumentsFromDirectoryAsync(
        string directory,
        ITextEmbeddingGenerationService embeddingService)
    {
        var documentStore = new InMemoryDocumentStore();
        if (!Directory.Exists(directory))
        {
            Console.WriteLine($"Directory '{directory}' not found. No documents loaded.");
            return documentStore;
        }

        foreach (var file in Directory.GetFiles(directory, "*.md"))
        {
            var content = await File.ReadAllTextAsync(file);
            var embedding = (await embeddingService.GenerateEmbeddingsAsync([content]))[0].ToArray();
            documentStore.Add(content, embedding);
        }

        Console.WriteLine($"Loaded {Directory.GetFiles(directory, "*.md").Length} documents.");
        return documentStore;
    }

The document store keeps each document together with its embedding in a list of records. During semantic search we convert the question into a vector and use cosine similarity (an angle-based metric: the dot product of two vectors divided by the product of their lengths) to measure how well each document matches. We then keep the two best-scoring documents.

record Document(string Content, float[] Embedding);

class InMemoryDocumentStore
{
    private readonly List<Document> _documents = [];

    public void Add(string content, float[] embedding) =>
        _documents.Add(new Document(content, embedding));

    public IList<(string Content, double Score)> GetRelevantWithScores(float[] queryEmbedding, int top = 2)
    {
        return _documents
            .Select(d => (d.Content, Score: CosineSimilarity(queryEmbedding, d.Embedding)))
            .OrderByDescending(item => item.Score) 
            .Take(top)
            .ToList();
    }

    private static double CosineSimilarity(float[] a, float[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length && i < b.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB) + 1e-5);
    }
}

Chat Loop

Everything comes together in the ChatLoop, the heart of the application. Here we connect the semantic search results to a live conversation with the language model and make sure the context stays up to date on every turn.

    private static async Task ChatLoop(
        OllamaApiClient ollamaClient,
        ITextEmbeddingGenerationService embeddingService,
        InMemoryDocumentStore store)
    {
        // The method starts by creating a ChatCompletionService based on the Ollama client. Semantic Kernel provides standard helpers, such as a ChatHistory object, role handling, and token streaming.
        var chatService = ollamaClient.AsChatCompletionService();

        // Next we initialize a ChatHistory with a system prompt. That prompt defines the model behavior so we do not need to repeat rules in every prompt.
        var chatHistory = new ChatHistory("You are an expert in comics and sci-fi, you always try to help the user and give random facts about the topics");

        // In the do while loop the actual conversation happens:
        do 
        {
            Console.Write("user: ");
            // 1. Read the user message. End of input stops the loop; an empty line is ignored.
            var userMessage = Console.ReadLine();
            if (userMessage is null)
                break;
            if (string.IsNullOrWhiteSpace(userMessage))
                continue;

            // 2. Ask Ollama for an embedding. Each question becomes a vector of length 3072.
            var queryEmbedding = (await embeddingService.GenerateEmbeddingsAsync([userMessage]))[0].ToArray();

            // 3. Use cosine similarity to find the two documents with the highest relevance.
            var contextTuples = store.GetRelevantWithScores(queryEmbedding, top: 2);
            double bestSimilarityScore = contextTuples.FirstOrDefault().Score;

            // 4. Check the highest score. If it is at least 0.10 (a threshold to determine empirically, for example via a manual evaluation), we add the corresponding text fragments as context. This keeps the prompt short and relevant.
            var addDocsToContext = bestSimilarityScore >= 0.10;
            if (addDocsToContext)
                Console.WriteLine($"[gate] relevant docs added (bestSim={bestSimilarityScore:F2}, hits={contextTuples.Count})");

            var context = addDocsToContext ? string.Join("\n---\n", contextTuples.Select(t => t.Content)) : string.Empty;

            // 5. Create a temporary copy of the ChatHistory (so the injected context does not pollute the persistent history and your system prompt stays clean), add the context (if present) and the user message, and send it to the model.
            var promptHistory = new ChatHistory(chatHistory);

            if (!string.IsNullOrWhiteSpace(context))
                promptHistory.AddSystemMessage($"Context:\n{context}");

            promptHistory.AddUserMessage(userMessage);

            // 6. The model returns the full response.
            var reply = await chatService.GetChatMessageContentAsync(promptHistory);

            // 7. Then we store it in the global ChatHistory and print it to the screen.
            chatHistory.AddUserMessage(userMessage);
            chatHistory.Add(reply);
            var lastMessage = chatHistory[^1];
            Console.WriteLine($"{lastMessage.Role}: {lastMessage.Content}\n");

            // Then the loop starts again and the conversation continues until the user stops the program.
        } while (true);
    }

5. Result: a local RAG conversation

The console session below shows at a glance what happens when you start the demo:

  • The application reports that fourteen documents were loaded and vectorized. In the /documents directory there happen to be several Star Wars texts. See the GitHub repository for a full overview of the documents and setup.
  • After the user question, the gate line appears, showing that two documents were added as context along with the best relevance score (0.17 in this example).
  • The language model processes those Star Wars texts and returns an answer.

For readability the model output is truncated after a few lines, but in a real session the dialog continues.

Loaded 14 documents.
user: Who is the greatest user of the force?
[gate] relevant docs added (bestSim=0.17, hits=2)
assistant: A question that sparks debate among Star Wars fans! While opinions may vary, I'd argue that Yoda is one of the most powerful users of the Force.

As a wise and ancient Jedi Master, Yoda's mastery of the Force is unparalleled. His unique connection to the Living Force allows him to tap into its energy, using it to augment his physical abilities and....

6. Optimizations and extensions

This demo is intentionally kept as simple as possible: one console app, one model, and an in-memory vector store. If you want to scale up more seriously, a few changes quickly pay off:

  1. Use a vector database: As the number of documents grows, replace the InMemoryDocumentStore with a vector database like Qdrant (started with Docker in two lines). This gives millisecond search performance, persistent storage, and advanced filters, and prevents recalculating embeddings on every startup. A rough sketch follows after this list.

  2. Use a specialized embedding model: Right now we use llama3.2 for both chat and embeddings. Instead, use a compact model trained specifically for embeddings. That reduces compute time and improves similarity accuracy.

  3. Pick a better language model: If you have extra (V)RAM, consider a stronger model such as Phi-4 or llama3.3 for richer answers. For niche domains a smaller tuned model can sometimes be better and faster. Always measure latency and quality with an objective evaluation framework so you pick the right model.
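
As promised under point 1, here is a rough sketch of what the Qdrant variant could look like. It assumes the Qdrant.Client NuGet package and a locally running Qdrant instance with the gRPC port exposed (add -p 6334:6334 to the Docker command from section 3.3); embedding, content, and queryEmbedding stand in for the values produced earlier in the article, and the collection name and payload key are made up. Check the client documentation for the exact API, since it still evolves.

    using Qdrant.Client;        // NuGet: Qdrant.Client
    using Qdrant.Client.Grpc;

    // Connect to the local Qdrant instance (6334 is the default gRPC port).
    var qdrant = new QdrantClient("localhost", 6334);

    // One-time setup: a collection sized for the 3072-dimensional llama3.2 embeddings.
    await qdrant.CreateCollectionAsync("documents",
        new VectorParams { Size = 3072, Distance = Distance.Cosine });

    // Store a document: the embedding as the vector, the original text as payload.
    await qdrant.UpsertAsync("documents", new List<PointStruct>
    {
        new()
        {
            Id = new PointId { Uuid = Guid.NewGuid().ToString() },
            Vectors = embedding,                 // float[] from the embedding service
            Payload = { ["content"] = content }  // the original markdown text
        }
    });

    // Retrieval: the two nearest documents for a query embedding, scored by cosine similarity.
    var hits = await qdrant.SearchAsync("documents", queryEmbedding, limit: 2);
    foreach (var hit in hits)
        Console.WriteLine($"score {hit.Score:F2}: {hit.Payload["content"].StringValue}");

Because Qdrant persists the vectors, the import step only has to run when documents change instead of on every startup.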

7. Conclusion

With the combination of Semantic Kernel, Ollama, and a compact in-memory vector store, you have seen how easy it is to implement Retrieval Augmented Generation fully locally. No vendor lock-in with a cloud provider, no compliance headaches over data leaving your network, and, with smart hardware choices, a response time that can compete with SaaS alternatives.

In this article we built it step by step:

  1. Laying the foundation: Semantic Kernel plus the preview connector for Ollama is enough to run a proof of concept within minutes.
  2. Embeddings and storage: one llama3.2 model can produce both chat responses and embeddings. The simple InMemoryDocumentStore shows the core logic is only a few dozen lines of code.
  3. Context injection: with a minimal threshold you keep the prompt lean and relevant, which is crucial for smaller models.
  4. Expand when needed: replace the in-memory store with a vector database for faster search and persistent storage, use a specialized embedding model, and move to a stronger or tuned LLM when the use case and hardware allow it.

The result is a robust RAG chat app that runs on a developer laptop, but also fits just as well in an air gapped datacenter. That gives your .NET team full control over privacy, cost, and performance, while still using the power of generative AI.

My advice: start small, measure a lot, and scale with intent. First run an internal pilot with a handful of documents, monitor accuracy, and then experiment with larger datasets and models. You will notice the learning curve is steep, but the time to value is surprisingly short.

That way you can benefit from GenAI today without compromising on governance or budget. In short: RAG in .NET is not only production ready, it may be the most pragmatic path to safe and scalable AI assistants on your own turf.
