<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Frank Noorloos</title>
    <description>The latest articles on DEV Community by Frank Noorloos (@frankiey).</description>
    <link>https://dev.to/frankiey</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2058052%2F4abc3b79-0fd9-4348-9a73-8c9ea764d366.jpeg</url>
      <title>DEV Community: Frank Noorloos</title>
      <link>https://dev.to/frankiey</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/frankiey"/>
    <language>en</language>
    <item>
      <title>[Boost]</title>
      <dc:creator>Frank Noorloos</dc:creator>
      <pubDate>Tue, 13 Jan 2026 14:25:01 +0000</pubDate>
      <link>https://dev.to/frankiey/-ojc</link>
      <guid>https://dev.to/frankiey/-ojc</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/frankiey" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2058052%2F4abc3b79-0fd9-4348-9a73-8c9ea764d366.jpeg" alt="frankiey"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/frankiey/rag-without-the-cloud-net-semantic-kernel-ollama-on-your-laptop-1hg7" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;RAG without the cloud: .NET + Semantic Kernel + Ollama on your laptop&lt;/h2&gt;
      &lt;h3&gt;Frank Noorloos ・ Jan 9&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#ai&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#programming&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#dotnet&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#llm&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>programming</category>
      <category>dotnet</category>
      <category>llm</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Frank Noorloos</dc:creator>
      <pubDate>Fri, 09 Jan 2026 14:07:54 +0000</pubDate>
      <link>https://dev.to/frankiey/-3249</link>
      <guid>https://dev.to/frankiey/-3249</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/frankiey" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2058052%2F4abc3b79-0fd9-4348-9a73-8c9ea764d366.jpeg" alt="frankiey"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/frankiey/rag-without-the-cloud-net-semantic-kernel-ollama-on-your-laptop-1hg7" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;RAG without the cloud: .NET + Semantic Kernel + Ollama on your laptop&lt;/h2&gt;
      &lt;h3&gt;Frank Noorloos ・ Jan 9&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#ai&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#programming&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#dotnet&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#llm&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>programming</category>
      <category>dotnet</category>
      <category>llm</category>
    </item>
    <item>
      <title>RAG without the cloud: .NET + Semantic Kernel + Ollama on your laptop</title>
      <dc:creator>Frank Noorloos</dc:creator>
      <pubDate>Fri, 09 Jan 2026 13:56:07 +0000</pubDate>
      <link>https://dev.to/frankiey/rag-without-the-cloud-net-semantic-kernel-ollama-on-your-laptop-1hg7</link>
      <guid>https://dev.to/frankiey/rag-without-the-cloud-net-semantic-kernel-ollama-on-your-laptop-1hg7</guid>
<description>&lt;h2&gt;1. Introduction&lt;/h2&gt;

&lt;p&gt;Generative AI is now impossible to ignore in software development. Tools like ChatGPT and GitHub Copilot can answer technical questions in seconds. But there is a downside: you become dependent on external APIs, mainly from US providers, you pay per prompt, and you may have to hand over sensitive company data. An alternative is to run language models locally with Ollama. In this article I show how to make documents searchable locally with a console application that applies Retrieval Augmented Generation (RAG). We use Ollama together with the language model llama3.2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Retrieval Augmented Generation (RAG)?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
RAG is a pattern where you give a Large Language Model (LLM) not only the question but also extra context from your own sources. Via a semantic search query (&lt;em&gt;retrieval&lt;/em&gt;) you pull the most relevant text fragments from documents, databases, or API responses. You add those fragments to the prompt, so that during answering (&lt;em&gt;generation&lt;/em&gt;) the LLM can draw on up-to-date, domain-specific knowledge. The result: fewer hallucinations and direct reuse of your existing data assets.&lt;/p&gt;
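&lt;p&gt;As a minimal sketch of that last step (all names here are hypothetical, not part of the demo code), augmenting the prompt can be as simple as concatenating the retrieved fragments ahead of the question:&lt;/p&gt;

```csharp
// Hypothetical sketch: inject retrieved fragments into the prompt.
// In a real RAG pipeline, retrievedFragments comes from semantic search.
var retrievedFragments = new[] { "Fragment A ...", "Fragment B ..." };
var question = "What does the internal handbook say about returns?";

var augmentedPrompt =
    "Use only the context below to answer the question.\n\n" +
    "Context:\n" + string.Join("\n---\n", retrievedFragments) + "\n\n" +
    "Question: " + question;
```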
&lt;h2&gt;2. Requirements and setup&lt;/h2&gt;

&lt;p&gt;The AI world moves fast, and many components are still in preview or experimental. The configuration below works at the time of writing. Check the GitHub repository for the most up to date information.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;.NET 9 SDK&lt;/li&gt;
&lt;li&gt;Ollama v0.9.5 or newer&lt;/li&gt;
&lt;li&gt;4 GB (V)RAM for the language model&lt;/li&gt;
&lt;li&gt;Licenses: Llama 3.2 falls under the Llama 3.2 Community License. Semantic Kernel and the demo code are MIT.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Create a standard console application, add Semantic Kernel, and install the prerelease package &lt;code&gt;Microsoft.SemanticKernel.Connectors.Ollama&lt;/code&gt;. Then start Ollama and download the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama serve
ollama run llama3.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
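&lt;p&gt;For completeness, the project setup described above can be done from the command line like this (the project name is just an example; the Ollama connector needs the prerelease flag):&lt;/p&gt;

```shell
dotnet new console -n RagDemo
cd RagDemo
dotnet add package Microsoft.SemanticKernel
dotnet add package Microsoft.SemanticKernel.Connectors.Ollama --prerelease
```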



&lt;h2&gt;3. Architecture at a glance&lt;/h2&gt;

&lt;p&gt;The RAG demo consists of three core components that together form the chain from question to retrieval to answer.&lt;/p&gt;

&lt;h3&gt;3.1 Semantic Kernel (orchestrator)&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;.NET SDK for calling large language models, embeddings, chat history, and optionally function calling.&lt;/li&gt;
&lt;li&gt;Keeps conversation state and makes it easy to enrich context with documents.&lt;/li&gt;
&lt;li&gt;Has connectors to, among others, OpenAI, Azure OpenAI, Ollama, and multiple vector databases.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;3.2 Ollama (local LLM runtime)&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Starts and manages models locally (CPU/GPU), including model pulls.&lt;/li&gt;
&lt;li&gt;Supports dozens of open source models. We use llama3.2.&lt;/li&gt;
&lt;li&gt;Can also generate embeddings with the same language model.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;3.3 Document storage and retrieval&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;In this demo: a simple &lt;strong&gt;in memory&lt;/strong&gt; store with &lt;strong&gt;cosine similarity&lt;/strong&gt; (&lt;em&gt;a calculation that measures similarity between two vectors&lt;/em&gt;) and &lt;strong&gt;top k&lt;/strong&gt; selection (&lt;em&gt;take the k best scoring matches&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;Great for small document sets and ideal to understand the core of RAG.&lt;/li&gt;
&lt;li&gt;For production or larger datasets: replace it with &lt;strong&gt;Qdrant&lt;/strong&gt; or another &lt;strong&gt;vector database&lt;/strong&gt; (&lt;em&gt;a specialized datastore for vector embeddings with persistent storage, indexing, filtering, and horizontal scalability&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;Tip: try Qdrant locally: &lt;code&gt;docker run -p 6333:6333 qdrant/qdrant&lt;/code&gt; (starts an instance on your machine in seconds).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a question is asked, it flows like this: user prompt -&amp;gt; embedding -&amp;gt; most relevant docs -&amp;gt; context in prompt -&amp;gt; language model answer.&lt;/p&gt;

&lt;h2&gt;4. Step by step implementation&lt;/h2&gt;

&lt;p&gt;First install two NuGet packages: Microsoft.SemanticKernel and Microsoft.SemanticKernel.Connectors.Ollama. For the Ollama connector you need the prerelease version. Also add the attribute below to the class; the Ollama connector APIs are still marked experimental, and without it the project will not compile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;Experimental&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SKEXP0070"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SkOllama&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the Main method we initialize the required components. The OllamaApiClient points to the default URL of the locally running Ollama API and we choose llama3.2 as the model. That same model also generates embeddings, so no separate embedding model is needed. This choice is a bit slower and sometimes less accurate than a specialized embedding model, but for a local proof of concept it is fine.&lt;/p&gt;

&lt;p&gt;Next we read the documents. The application scans the documents folder, reads all markdown files, and generates an embedding for each document. You do not want to repeat this process every time; the larger the set, the longer it takes.&lt;/p&gt;
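&lt;p&gt;One way to avoid recomputing embeddings on every start is to cache them on disk. The sketch below is an assumption, not part of the demo: it stores each document's embedding in a sidecar text file, one float per line, and reuses it as long as it is at least as new as the document:&lt;/p&gt;

```csharp
using System.Globalization;

// Sketch only (hypothetical helper, not in the demo code): cache each
// document's embedding next to the document and reuse it while it is fresh.
static class EmbeddingCache
{
    static string CachePathFor(string documentPath) =>
        documentPath + ".embedding.txt";

    // Returns the cached embedding, or null when the cache is missing or stale.
    public static float[]? TryLoad(string documentPath)
    {
        var cachePath = CachePathFor(documentPath);
        if (!File.Exists(cachePath))
            return null;
        // A document edited after the cache was written invalidates the cache.
        if (File.GetLastWriteTimeUtc(documentPath) > File.GetLastWriteTimeUtc(cachePath))
            return null;
        return Array.ConvertAll(File.ReadAllLines(cachePath),
            line => float.Parse(line, CultureInfo.InvariantCulture));
    }

    public static void Save(string documentPath, float[] embedding) =>
        File.WriteAllLines(CachePathFor(documentPath),
            Array.ConvertAll(embedding,
                v => v.ToString(CultureInfo.InvariantCulture)));
}
```

&lt;p&gt;In the import loop you would then call TryLoad first and only fall back to the embedding service on a cache miss.&lt;/p&gt;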

&lt;p&gt;Finally we start the chat loop with the initialized services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="nf"&gt;Main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;var&lt;/span&gt; &lt;span class="n"&gt;ollamaClient&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;OllamaApiClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uriString&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:11434"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;defaultModel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"llama3.2"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// llama3.2 supports embeddings, so no separate model is required &lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;embeddingService&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollamaClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AsTextEmbeddingGenerationService&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;documentsPath&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Combine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AppContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BaseDirectory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"documents"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;documentStore&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ImportDocumentsFromDirectoryAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documentsPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embeddingService&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ChatLoop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ollamaClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embeddingService&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;documentStore&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Importing documents is straightforward. The application reads all files in the specified folder, generates an embedding per document, and stores both the text and the vector in the in-memory document store, so during a chat question we can retrieve both directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;InMemoryDocumentStore&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;ImportDocumentsFromDirectoryAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;directory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;ITextEmbeddingGenerationService&lt;/span&gt; &lt;span class="n"&gt;embeddingService&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;documentStore&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;InMemoryDocumentStore&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;Directory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;directory&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;$"Directory '&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;directory&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;' not found. No documents loaded."&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;documentStore&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Directory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetFiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;directory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"*.md"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ReadAllTextAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;embeddingService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GenerateEmbeddingsAsync&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;]))[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;ToArray&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="n"&gt;documentStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;$"Loaded &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Directory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetFiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;directory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"*.md"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt; documents."&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;documentStore&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The document store keeps each document together with its embedding in a list of records. During semantic search we convert your question into a vector and measure with &lt;em&gt;cosine similarity&lt;/em&gt; (an angle-based metric that expresses how close two vectors are) how well each document matches. We then keep the two best scores.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;record&lt;/span&gt; &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;InMemoryDocumentStore&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;_documents&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;
        &lt;span class="n"&gt;_documents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;IList&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;Score&lt;/span&gt;&lt;span class="p"&gt;)&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;GetRelevantWithScores&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;queryEmbedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_documents&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;CosineSimilarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queryEmbedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;OrderByDescending&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToList&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="nf"&gt;CosineSimilarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;dot&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normA&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normB&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;++)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;dot&lt;/span&gt; &lt;span class="p"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
            &lt;span class="n"&gt;normA&lt;/span&gt; &lt;span class="p"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
            &lt;span class="n"&gt;normB&lt;/span&gt; &lt;span class="p"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dot&lt;/span&gt; &lt;span class="p"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normA&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normB&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1e-5&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxilhmuboklgqsuzwlra3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxilhmuboklgqsuzwlra3.png" alt="Chat Loop" width="636" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everything comes together in the ChatLoop, the heart of the application. Here we connect the semantic search results to a live conversation with the language model and make sure the context stays up to date on every turn.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="nf"&gt;ChatLoop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;OllamaApiClient&lt;/span&gt; &lt;span class="n"&gt;ollamaClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;ITextEmbeddingGenerationService&lt;/span&gt; &lt;span class="n"&gt;embeddingService&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;InMemoryDocumentStore&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// The method starts by creating a ChatCompletionService based on the Ollama client. Semantic Kernel provides standard helpers, such as a ChatHistory object, role handling, and token streaming.&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;chatService&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollamaClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AsChatCompletionService&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// Next we initialize a ChatHistory with a system prompt. That prompt defines the model behavior so we do not need to repeat rules in every prompt.&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;chatHistory&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;ChatHistory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"You are an expert in comics and sci-fi, you always try to help the user and give random facts about the topics"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// In the do while loop the actual conversation happens:&lt;/span&gt;
        &lt;span class="k"&gt;do&lt;/span&gt; 
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"user: "&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="c1"&gt;// 1. Read the user message.&lt;/span&gt;
            &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;userMessage&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ReadLine&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userMessage&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

            &lt;span class="c1"&gt;// 2. Ask Ollama for an embedding. Each question becomes a vector of length 3072.&lt;/span&gt;
            &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;queryEmbedding&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;embeddingService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GenerateEmbeddingsAsync&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;userMessage&lt;/span&gt;&lt;span class="p"&gt;]))[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;ToArray&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

            &lt;span class="c1"&gt;// 3. Use cosine similarity to find the two documents with the highest relevance.&lt;/span&gt;
            &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;contextTuples&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetRelevantWithScores&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queryEmbedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;bestSimilarityScore&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;contextTuples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;FirstOrDefault&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;Score&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

            &lt;span class="c1"&gt;// 4. Check the highest score. If it is above 0.10 (determine empirically via, for example, a manual evaluation) then we add the corresponding text fragments as context. This keeps the prompt short and relevant.&lt;/span&gt;
            &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;addDocsToContext&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bestSimilarityScore&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="m"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;addDocsToContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;$"[gate] relevant docs added (bestSim=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bestSimilarityScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;F2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, hits=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;contextTuples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

            &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;addDocsToContext&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"\n---\n"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;contextTuples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Empty&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

            &lt;span class="c1"&gt;// 5. Create a temporary copy of the ChatHistory (this works as prompt toxicity isolation and keeps your system prompt clean), add the context (if present) and the user message, and send it to the model.&lt;/span&gt;
            &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;promptHistory&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;ChatHistory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chatHistory&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(!&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IsNullOrWhiteSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="n"&gt;promptHistory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddSystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;$"Context:\n&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

            &lt;span class="n"&gt;promptHistory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddUserMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userMessage&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

            &lt;span class="c1"&gt;// 6. The model returns the full response.&lt;/span&gt;
            &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;reply&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;chatService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetChatMessageContentAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;promptHistory&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

            &lt;span class="c1"&gt;// 7. Then we store it in the global ChatHistory and print it to the screen.&lt;/span&gt;
            &lt;span class="n"&gt;chatHistory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddUserMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userMessage&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;chatHistory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;lastMessage&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chatHistory&lt;/span&gt;&lt;span class="p"&gt;[^&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
            &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;$"&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;lastMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Role&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;lastMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;\n"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

            &lt;span class="c1"&gt;// Then the loop starts again and the conversation continues until the user stops the program.&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
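&lt;p&gt;The &lt;code&gt;GetRelevantWithScores&lt;/code&gt; call in step 3 is where retrieval actually happens. As a rough sketch of what such a store can look like (the tuple shapes and member names below are illustrative assumptions, not the repository's exact code), cosine similarity over the stored embeddings is all it takes:&lt;/p&gt;

```csharp
// Illustrative in-memory store: ranks stored fragments by cosine similarity
// against a query embedding. Shapes and names are assumptions for this sketch.
using System;
using System.Linq;

public sealed class InMemoryDocumentStore
{
    // Each entry pairs a text fragment with its embedding vector.
    private readonly (string Content, float[] Embedding)[] _docs;

    public InMemoryDocumentStore((string Content, float[] Embedding)[] docs) => _docs = docs;

    public (string Content, double Score)[] GetRelevantWithScores(float[] query, int top) =>
        _docs.Select(d => (d.Content, Score: Cosine(query, d.Embedding)))
             .OrderByDescending(t => t.Score)
             .Take(top)
             .ToArray();

    private static double Cosine(float[] a, float[] b)
    {
        double dot   = a.Zip(b, (x, y) => (double)x * y).Sum();
        double normA = Math.Sqrt(a.Sum(x => (double)x * x));
        double normB = Math.Sqrt(b.Sum(x => (double)x * x));
        return dot / (normA * normB + 1e-12); // guard against zero vectors
    }
}
```

&lt;p&gt;A vector database replaces exactly this class once the corpus grows.&lt;/p&gt;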



&lt;h2&gt;
  
  
  5. Result: a local RAG conversation
&lt;/h2&gt;

&lt;p&gt;The console session below shows at a glance what happens when you start the demo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The application reports that fourteen documents were loaded and vectorized. The /documents directory happens to contain several Star Wars texts; see the GitHub repository for the full set of documents and the setup.&lt;/li&gt;
&lt;li&gt;After the user question the gate line appears, showing that two documents were added as context and showing the best relevance score (0.17 in this example).&lt;/li&gt;
&lt;li&gt;The language model processes those Star Wars texts and returns an answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For readability the model output is truncated after a few lines, but in a real session the dialog continues.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loaded 14 documents.
user: Who is the greatest user of the force?
[gate] relevant docs added (bestSim=0.17, hits=2)
assistant: A question that sparks debate among Star Wars fans! While opinions may vary, I'd argue that Yoda is one of the most powerful users of the Force.

As a wise and ancient Jedi Master, Yoda's mastery of the Force is unparalleled. His unique connection to the Living Force allows him to tap into its energy, using it to augment his physical abilities and....
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  6. Optimizations and extensions
&lt;/h2&gt;

&lt;p&gt;This demo is intentionally kept as simple as possible: one console app, one model, and an in-memory vector store. When you want to scale up more seriously, a few targeted changes yield quick results:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Use a vector database: As the number of documents grows, replace the InMemoryDocumentStore with a vector database like Qdrant (started with Docker in two lines). This gives millisecond search performance, persistent storage, and advanced filters, and prevents recalculating embeddings on every startup.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use a specialized embedding model: Right now we use llama3.2 for both chat and embeddings. Instead, use a compact model trained specifically for embeddings. That reduces compute time and improves similarity accuracy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pick a better language model: If you have extra (V)RAM, consider a stronger model such as Phi-4 or llama3.3 for richer answers. For niche domains a smaller tuned model can sometimes be better and faster. Always measure latency and quality with an objective evaluation framework so you pick the right model.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
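&lt;p&gt;For the first two suggestions the setup cost is genuinely small. For example (standard public image and model names, shown as an illustration rather than the repository's setup):&lt;/p&gt;

```shell
# Qdrant with Docker: two lines, with persistent storage on a named volume.
docker pull qdrant/qdrant
docker run -p 6333:6333 -v qdrant_storage:/qdrant/storage qdrant/qdrant

# A compact embedding-only model in Ollama, instead of using llama3.2 for both roles.
ollama pull nomic-embed-text
```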

&lt;h2&gt;
  
  
  7. Conclusion
&lt;/h2&gt;

&lt;p&gt;With the combination of Semantic Kernel, Ollama, and a compact in-memory vector store, you have seen how easy it is to implement Retrieval Augmented Generation fully locally. No vendor lock-in with a cloud provider, no compliance headaches from data leaving your premises, and, with smart hardware choices, a response time that can compete with SaaS alternatives.&lt;/p&gt;

&lt;p&gt;In this article we built it step by step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Laying the foundation: Semantic Kernel plus the preview connector for Ollama is enough to run a proof of concept within minutes.&lt;/li&gt;
&lt;li&gt;Embeddings and storage: one llama3.2 model can produce both chat responses and embeddings, and the simple InMemoryDocumentStore shows that the core logic is only a few dozen lines of code.&lt;/li&gt;
&lt;li&gt;Context injection: with a minimal threshold you keep the prompt lean and relevant, which is crucial for smaller models.&lt;/li&gt;
&lt;li&gt;Expand when needed: replace the in-memory store with a vector database for faster, persistent storage, use a specialized embedding model, and move to a stronger or tuned LLM when the use case and hardware allow it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result is a robust RAG chat app that runs on a developer laptop, but fits just as well in an air-gapped datacenter. That gives your .NET team full control over privacy, cost, and performance, while still using the power of generative AI.&lt;/p&gt;

&lt;p&gt;My advice: start small, measure a lot, and scale with intent. First run an internal pilot with a handful of documents, monitor accuracy, and then experiment with larger datasets and models. You will notice the learning curve is steep, but the time to value is surprisingly short.&lt;/p&gt;

&lt;p&gt;That way you can benefit from GenAI today without compromising on governance or budget. In short: RAG in .NET is not only production ready, it may be the most pragmatic path to safe and scalable AI assistants on your own turf.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GitHub repo (demo): &lt;a href="https://github.com/Frankiey/semantic-kernel-ollama-demo" rel="noopener noreferrer"&gt;https://github.com/Frankiey/semantic-kernel-ollama-demo&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Ollama documentation: &lt;a href="https://github.com/ollama/ollama" rel="noopener noreferrer"&gt;https://github.com/ollama/ollama&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Semantic Kernel docs: &lt;a href="https://aka.ms/semantic-kernel" rel="noopener noreferrer"&gt;https://aka.ms/semantic-kernel&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Qdrant Quickstart: &lt;a href="https://qdrant.tech/documentation/quick-start/" rel="noopener noreferrer"&gt;https://qdrant.tech/documentation/quick-start/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>dotnet</category>
      <category>llm</category>
    </item>
    <item>
      <title>Building an Azure VM Sizer for LLMs — with Codex Doing 90% of the Work</title>
      <dc:creator>Frank Noorloos</dc:creator>
      <pubDate>Sat, 16 Aug 2025 10:25:46 +0000</pubDate>
      <link>https://dev.to/frankiey/building-an-azure-vm-sizer-for-llms-with-codex-doing-90-of-the-work-7kn</link>
      <guid>https://dev.to/frankiey/building-an-azure-vm-sizer-for-llms-with-codex-doing-90-of-the-work-7kn</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Lately I’ve been looking into hosting open‑source models on dedicated Azure virtual machines and thought: how hard can it be to pick the right VM and how much cost can you actually save by choosing a smaller model? Of course, Microsoft’s serverless options are cheaper and much easier to deploy. But I like to know how things work and to manage my own compute, partly out of technical curiosity, partly for compliance and privacy. Running on dedicated VMs gives you more control on both fronts.&lt;/p&gt;

&lt;p&gt;I couldn’t find a clear, practical guide for sizing a VM based on the model you choose, so I built one: a quick single page web app. &lt;strong&gt;Try it here → &lt;a href="https://frankiey.github.io/azure-llm-sizer/?model=Meta-Llama-3-70B&amp;amp;prec=fp16&amp;amp;ctx=8k" rel="noopener noreferrer"&gt;Live Azure LLM Sizer&lt;/a&gt;&lt;/strong&gt; (code: &lt;a href="https://github.com/Frankiey/azure-llm-sizer" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;).&lt;br&gt;
 To move fast (and avoid writing more code than necessary) I used Codex for about 90% of the work. In this post I’ll cover how I built the site with Codex, where it’s useful and where it hits its limits, and I’ll share what I learned along the way, plus what’s next.&lt;/p&gt;
&lt;h2&gt;
  
  
  How I built it with Codex
&lt;/h2&gt;

&lt;p&gt;I started with a simple problem definition, a rough goal for the app, and a few broad requirements. I deliberately kept things open ended to see what frameworks it would pick. I fed the description into ChatGPT with a “deep research” prompt, and out came a functional spec (trimmed here for brevity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Azure-LLM-Sizer — Functional Spec (≤ 512 words)

Purpose
Browser-only SPA that tells users the smallest Azure VM (GPU SKU) capable of serving / fine-tuning a chosen open-source large-language model under given precision, context length, batch size, and optional multi-GPU constraints.

⸻

Core Workflow
    1.  Select inputs
    • Model (type-ahead over HF IDs)
    • Precision (FP32 / FP16 / BF16 / INT8 / INT4)
    • Context length slider (256 – 128 k)
    • Batch size slider (1 – 64)
    • Advanced: “Training mode” toggle adds optimizer-state multiplier; “World-size” sets required GPUs.
    2.  Estimator (runs client-side, ≤ 100 ms)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
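&lt;p&gt;The estimator in step 2 boils down to a first-order memory formula: weight memory (parameter count times bytes per parameter) plus KV cache (which grows with context length and batch size), plus some runtime overhead. A sketch of that calculation (the multipliers and the helper itself are my illustration, not the app's actual estimator):&lt;/p&gt;

```csharp
// First-order VRAM estimate for serving an LLM. Illustrative only: real sizing
// also depends on the runtime, attention variant (e.g. GQA), and fragmentation.
using System;

public static class VramEstimator
{
    // bytesPerParam: 4 for FP32, 2 for FP16/BF16, 1 for INT8, 0.5 for INT4.
    public static double EstimateGiB(
        double paramsBillions, double bytesPerParam,
        int layers, int hiddenSize, int contextLength, int batchSize,
        double bytesPerKvElement = 2.0, double overhead = 1.2)
    {
        double weights = paramsBillions * 1e9 * bytesPerParam;
        // KV cache: two tensors (K and V) per layer, one vector per token per sequence.
        double kvCache = 2.0 * layers * hiddenSize * contextLength * batchSize * bytesPerKvElement;
        return (weights + kvCache) * overhead / Math.Pow(2, 30);
    }
}

// Example: EstimateGiB(70, 2, layers: 80, hiddenSize: 8192,
//                      contextLength: 8192, batchSize: 1)
// gives roughly 180 GiB, i.e. a multi-GPU SKU for Llama-3-70B in FP16.
```

&lt;p&gt;Comparing that number against each VM's aggregate GPU memory is what reduces "which SKU?" to a lookup.&lt;/p&gt;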



&lt;p&gt;It even suggested the tech stack, some non-functional requirements, and the reasoning behind certain choices. I quickly reviewed it, dropped it into Codex, and off it went 🚀. The very first PR included a complete repo setup and a working version of the app. The frontend wasn’t pretty, but it did the job.&lt;/p&gt;

&lt;p&gt;From there, the process was a loop of iterating and adding features. If something didn’t work or I didn’t like it, I’d just describe the change to Codex. It would spin up its own environment, make the changes, and open a PR. My job was to review, test locally, and merge to main.&lt;/p&gt;

&lt;p&gt;My workflow looked like this:&lt;br&gt;
Define problem or feature → Create Codex task → Review PR → Test locally → Approve/merge or request tweaks → Repeat.&lt;/p&gt;

&lt;p&gt;When Codex hit a wall, either struggling with complex issues or producing subpar results, I’d switch tools. ChatGPT and Perplexity handled research heavy questions. For the frontend, I brought in Claude from Anthropic to rework the UI in Tailwind, which massively improved the look. Once Claude produced the HTML, I pasted it back into Codex to integrate with the existing codebase, and it slotted in seamlessly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges &amp;amp; Insights
&lt;/h2&gt;

&lt;p&gt;Where Codex shines&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Turns a vague problem into a working repo fast. The first PR had scaffolding, routing, basic state, and CI/CD to GitHub Pages—without me touching YAML.&lt;/li&gt;
&lt;li&gt;Excellent at wiring and repetitive tasks: project setup, glue code, small refactors, and “add this option to the form + estimator.”&lt;/li&gt;
&lt;li&gt;PR-first flow works well: I describe the change, it opens a PR, I review/test, and merge.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where it struggled&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visual design: getting a polished UI took multiple passes. I handed the layout to Claude (Tailwind) and then had Codex integrate it.&lt;/li&gt;
&lt;li&gt;Cross cutting changes: when work spanned multiple files or data flow edges, it sometimes lost the thread and anchored on the existing approach.&lt;/li&gt;
&lt;li&gt;Code critique &amp;amp; alternatives: it tended to justify the current implementation instead of proposing new designs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What worked&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep tasks tight and atomic with explicit acceptance criteria.&lt;/li&gt;
&lt;li&gt;Minimize context: link to 1–2 files and quote exact function names or components.&lt;/li&gt;
&lt;li&gt;Use specialists: ChatGPT/Perplexity for research, Claude for UI drafts, Codex for integration and wiring.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Takeaways&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Treat codegen like a sharp junior engineer on rails: great at well-specified tasks; less great at architecture and product taste.&lt;/li&gt;
&lt;li&gt;Always keep PR review and a local run in the loop.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Results &amp;amp; Impact
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg203sscatvwdzv6rropz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg203sscatvwdzv6rropz.png" alt="Resulting live website" width="800" height="699"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the biggest wins was how quickly I could go from a blank page to a fully working, deployed application. With Codex handling most of the scaffolding and setup, the initial project came together in hours, not days. The automatic build and deployment pipeline to GitHub Pages was in place right after the first PR merge. Again, all generated by Codex without me touching a single YAML file.&lt;/p&gt;

&lt;p&gt;Iteration speed was another game changer. Adding new features or fixing bugs became a matter of describing the change, letting Codex implement it, reviewing the PR, and pushing it live. User feedback could be integrated almost instantly. If someone reported that something looked off or didn’t work as expected, I could have a fix deployed within minutes. The only caveat was the UI: while Codex could wire up the functionality quickly, the styling usually needed a few rounds of refinement (or a complete handoff to Claude) to make it fit the look and feel I wanted.&lt;/p&gt;

&lt;p&gt;Overall, the development loop felt incredibly fast and satisfying, giving me the ability to turn ideas into live, functional features almost as quickly as I could think of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;This is just the first iteration of a wild idea. I want to expand it so I can use it for any open source model on any hardware. For example, I still need to add the latest gpt-oss models from OpenAI and many others. For frontend and functional changes, I will try to do as much as possible with Codex or other AI systems, while the data mining and sourcing will mostly be done by hand or with other approaches, since Codex does not excel in this part. Also, the way the calculation is done can be improved, as it makes some assumptions and is only relevant for loading the whole model with a certain KV cache. While it gives a decent estimate, it does not represent real world scenarios: you might have a multi cluster setup or want to know how fast a system is and how many requests it can handle. The website currently focuses solely on inference and is limited to Azure. Expanding into training capabilities and support for other cloud platforms would be a natural next step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Codex really surprised me. I started with a broad, almost vague prompt, and it spun up a complete repository with everything I needed to get going. Adding new features was just as smooth, sometimes I’d have five changes in progress at once, each handled in parallel. As a full stack engineer, I’ve never been fond of wrangling CSS or Tailwind, so being able to delegate the design side and focus on the technical logic felt like a huge productivity win.&lt;/p&gt;

&lt;p&gt;That said, Codex isn’t perfect. It can be stubborn, especially when dealing with bigger patterns or more abstract feedback. Frontend polish often took a few extra passes, and longer conversations sometimes left it stuck in a loop. The way I’ve come to see it: Codex is like a sharp junior engineer. It’s fast, confident, and great with well-defined tasks, but it struggles when the problem gets fuzzy or spans multiple layers.&lt;/p&gt;

&lt;p&gt;Even with those limitations, the experience of working this way was exciting. It shifted my role from typing out every line to directing, reviewing, and refining. That’s a glimpse of where software development is headed: AI-assisted tools embedded in our everyday workflows, helping us move faster while we focus on the higher level decisions. I’m curious to see how quickly these tools mature and how they’ll change the way we build software in the enterprise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources &amp;amp; Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://frankiey.github.io/azure-llm-sizer/?model=Meta-Llama-3-70B&amp;amp;prec=fp16&amp;amp;ctx=8k" rel="noopener noreferrer"&gt;Live Demo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Frankiey/azure-llm-sizer" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>azure</category>
      <category>ai</category>
      <category>llm</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Codex CLI – Adding Azure OpenAI to Codex by using… Codex!</title>
      <dc:creator>Frank Noorloos</dc:creator>
      <pubDate>Fri, 18 Apr 2025 18:24:42 +0000</pubDate>
      <link>https://dev.to/frankiey/codex-cli-adding-azure-openai-to-codex-by-using-codex-201l</link>
      <guid>https://dev.to/frankiey/codex-cli-adding-azure-openai-to-codex-by-using-codex-201l</guid>
      <description>&lt;p&gt;You might have missed it amid the launch of the new models &lt;strong&gt;o3&lt;/strong&gt; and &lt;strong&gt;o4‑mini&lt;/strong&gt;, but OpenAI also &lt;a href="https://openai.com/index/introducing-o3-and-o4-mini/" rel="noopener noreferrer"&gt;announced&lt;/a&gt; &lt;strong&gt;Codex&lt;/strong&gt;, a lightweight, open‑source coding agent that runs entirely in your terminal.  &lt;/p&gt;

&lt;p&gt;Because Codex is open source, it immediately caught my attention. Three questions sprang to mind:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;How well does it work with the latest frontier models?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How does it work under the hood—how complex is an agent like this?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How easy is it to adapt the code and make it enterprise‑ready by connecting it to an Azure OpenAI endpoint?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I’ll tackle each of these in the sections below.&lt;/p&gt;

&lt;h2&gt;
  
  
  What exactly &lt;em&gt;is&lt;/em&gt; Codex?
&lt;/h2&gt;

&lt;p&gt;Originally “Codex” referred to an early OpenAI code‑generation model, but the name has now been repurposed for an &lt;strong&gt;open‑source CLI companion for software engineers&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
&lt;a href="https://github.com/openai/codex" rel="noopener noreferrer"&gt;https://github.com/openai/codex&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In practice, Codex is a coding assistant you run from your terminal. You can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ask questions about your code base
&lt;/li&gt;
&lt;li&gt;Request suggestions or refactors
&lt;/li&gt;
&lt;li&gt;Let it automatically edit, test, and even run parts of your project—safely isolated inside a sandbox to prevent unwanted commands or outbound connections&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’ve used tools like &lt;strong&gt;Cursor&lt;/strong&gt;, &lt;strong&gt;Claude Code&lt;/strong&gt;, or &lt;strong&gt;GitHub Copilot&lt;/strong&gt;, you’ll feel right at home. What sets Codex apart is its fully open‑source nature and its simple CLI interface, which makes it easy to integrate into existing workflows.&lt;/p&gt;
&lt;h2&gt;
  
  
  Extending Codex with… Codex!
&lt;/h2&gt;

&lt;p&gt;During the launch stream the presenters hinted that you could modify Codex &lt;strong&gt;using Codex itself&lt;/strong&gt;. Challenge accepted! My goal: &lt;strong&gt;wire Codex to an enterprise‑ready Azure OpenAI deployment—using only Codex.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Setting up
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Clone the repo and install the latest release.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Export your OpenAI API key as &lt;code&gt;OPENAI_API_KEY&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;Run Codex in the directory you want to target.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By default, the CLI selects the &lt;code&gt;gpt‑4o&lt;/code&gt; model, which is brilliant for many tasks but can struggle with very large code bases. I switched to &lt;strong&gt;o4‑mini&lt;/strong&gt; immediately (more recent versions default to o4-mini). Add this to &lt;code&gt;~/.codex/config.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;o4-mini&lt;/span&gt;   &lt;span class="c1"&gt;# default model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;—or pass it ad‑hoc:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;codex &lt;span class="nt"&gt;-m&lt;/span&gt; o4-mini
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Finding the hook
&lt;/h3&gt;

&lt;p&gt;After a few exploratory questions I located the heart of the system: &lt;strong&gt;&lt;code&gt;agent-loop.ts&lt;/code&gt;&lt;/strong&gt;. That’s the place to add Azure OpenAI support. But manual edits were slow, so I relaunched Codex in &lt;em&gt;full‑auto&lt;/em&gt; mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;codex &lt;span class="nt"&gt;-m&lt;/span&gt; o4-mini &lt;span class="nt"&gt;--approval-mode&lt;/span&gt; full-auto
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I asked it to update the code so that, when a specific environment variable is present, it calls my Azure OpenAI endpoint instead of &lt;code&gt;api.openai.com&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Watching it work
&lt;/h3&gt;

&lt;p&gt;Codex began reading files, drafting patches, and applying them. Because I was still using my regular OpenAI account, the agent occasionally hit rate limits on &lt;strong&gt;o4‑mini&lt;/strong&gt; and crashed, losing context. Annoying, but expected in an early release; simply re‑running the last instruction resumes after it scans the project again.&lt;/p&gt;

&lt;p&gt;I was impressed by how economically it sliced the repo into small context windows, which is one reason I &lt;strong&gt;only&lt;/strong&gt; burned ~4.5 M tokens during the experiment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fylihzhehee1kq975ylpz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fylihzhehee1kq975ylpz.png" alt="Codex progress during the run" width="746" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The results
&lt;/h3&gt;

&lt;p&gt;The generated code was surprisingly solid—minor syntax errors cropped up once or twice, usually after an interrupted run, but nothing serious. My only real blunder was pointing Codex to the wrong Azure endpoint (&lt;code&gt;/chat/completions&lt;/code&gt; instead of &lt;code&gt;/responses&lt;/code&gt;), but Codex even helped debug that.&lt;/p&gt;

&lt;p&gt;You can find the patched fork here on Github: &lt;a href="https://github.com/Frankiey/codex/tree/feature/add_openai_service" rel="noopener noreferrer"&gt;https://github.com/Frankiey/codex/tree/feature/add_openai_service&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Codex isn’t just another code‑completion toy; it’s a hackable, scriptable engineering agent small enough to grasp in an afternoon and powerful enough to refactor itself. With a few tweaks (and a lot of tokens) I had it talking to Azure OpenAI, proving that an enterprise‑ready workflow is within easy reach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;Here’s the personal checklist I’m working through to take Codex from “cool prototype” to production‑ready helper:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Wire up Azure Entra ID&lt;/strong&gt; so Codex authenticates through SSO without API keys scattered around.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add robust retries and back‑off&lt;/strong&gt; - rate limits should become log lines, not crashes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drop Codex into the CI pipeline&lt;/strong&gt; so every pull request gets an automated review (and, where safe, an auto‑patch).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spin up companion agents&lt;/strong&gt; for docs, security, and test generation, all sharing the same project context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate releases&lt;/strong&gt; once the workflow feels stable, keeping a final human approval gate in place.&lt;/li&gt;
&lt;/ol&gt;
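&lt;p&gt;As a rough sketch of item 2, here's what a retry wrapper with exponential back-off could look like. This is a hypothetical helper of my own, not code from the fork; names like &lt;code&gt;Backoff.SendAsync&lt;/code&gt; are placeholders.&lt;/p&gt;

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical sketch: exponential back-off around an HTTP call,
// so HTTP 429 rate limits become log lines instead of crashes.
public static class Backoff
{
    // Pure helper: 2s, 4s, 8s, ... for attempts 1, 2, 3, ...
    public static TimeSpan DelayFor(int attempt) =>
        TimeSpan.FromSeconds(Math.Pow(2, attempt));

    // Takes a request factory because an HttpRequestMessage
    // cannot be sent twice.
    public static async Task<HttpResponseMessage> SendAsync(
        HttpClient client, Func<HttpRequestMessage> makeRequest, int maxAttempts = 5)
    {
        for (var attempt = 1; ; attempt++)
        {
            var response = await client.SendAsync(makeRequest());
            if (response.StatusCode != HttpStatusCode.TooManyRequests || attempt == maxAttempts)
                return response;

            var delay = DelayFor(attempt);
            Console.WriteLine($"Rate limited; retry {attempt}/{maxAttempts} in {delay.TotalSeconds}s");
            await Task.Delay(delay);
        }
    }
}
```

&lt;p&gt;The idea is simply that a 429 triggers a logged, bounded wait instead of bubbling up as a failure.&lt;/p&gt;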

&lt;p&gt;I’ll report back on each milestone and any surprises in the next post!&lt;/p&gt;

</description>
      <category>openai</category>
      <category>azure</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Run LLMs Locally with Ollama &amp; Semantic Kernel in .NET: A Quick Start</title>
      <dc:creator>Frank Noorloos</dc:creator>
      <pubDate>Tue, 24 Dec 2024 15:09:55 +0000</pubDate>
      <link>https://dev.to/frankiey/run-llms-locally-with-ollama-semantic-kernel-in-net-a-quick-start-4go4</link>
      <guid>https://dev.to/frankiey/run-llms-locally-with-ollama-semantic-kernel-in-net-a-quick-start-4go4</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As AI becomes increasingly central to modern applications, developers in the .NET ecosystem are exploring ways to incorporate powerful language models with minimal friction. &lt;strong&gt;Ollama&lt;/strong&gt; and &lt;strong&gt;Semantic Kernel&lt;/strong&gt; provide a compelling approach for running &lt;strong&gt;generative AI applications locally&lt;/strong&gt;, giving teams the flexibility to keep data on-premises, reduce network latency, and avoid recurring cloud costs.&lt;/p&gt;

&lt;p&gt;In this post, you’ll learn how to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set up Ollama on your machine.
&lt;/li&gt;
&lt;li&gt;Pull and serve a local model (like &lt;code&gt;llama3.2&lt;/code&gt;).
&lt;/li&gt;
&lt;li&gt;Integrate it with Semantic Kernel in a .NET 9 project.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By the end, you’ll have a simple yet powerful local AI application — no cloud dependency required.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Ollama?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Ollama&lt;/strong&gt; is a self-hosted platform for running language models locally, eliminating the need for external cloud services. Key benefits include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Privacy&lt;/strong&gt;: Your data never leaves your environment.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lower Costs&lt;/strong&gt;: Eliminate the “pay by API call” model of external services.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ease of Setup&lt;/strong&gt;: A quick and straightforward way to get up and running with local AI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With Ollama, you pull the models you need (for example, &lt;code&gt;llama3.2&lt;/code&gt;), serve them locally, and integrate them just like you would with a remote API—except it all stays on your machine or server.&lt;/p&gt;




&lt;h2&gt;
  
  
  Introducing Semantic Kernel
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Semantic Kernel&lt;/strong&gt; is an open-source SDK from Microsoft that enables developers to seamlessly integrate AI capabilities into .NET applications. It allows you to combine AI models with APIs and existing application logic to build intelligent and context-aware solutions.&lt;/p&gt;

&lt;p&gt;Key features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Function Orchestration: Efficiently manage and compose multiple AI functions, such as text generation, summarization, and Q&amp;amp;A.&lt;/li&gt;
&lt;li&gt;Extensibility: Support for a wide range of AI model providers, including OpenAI, Azure OpenAI, and local deployment options like Ollama.&lt;/li&gt;
&lt;li&gt;Context Management: Retain and utilize contextual information, such as conversation history and user preferences, to create personalized and coherent AI-driven experiences.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By pairing Semantic Kernel with Ollama, you can explore powerful AI interactions entirely on your own hardware. This ensures you maintain full control over your data and resources, and eliminates the ongoing costs tied to cloud-based APIs. It’s a great way to experiment, prototype, or run offline—without sacrificing the advanced capabilities of AI.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;.NET 9 SDK&lt;/strong&gt;&lt;br&gt;
Make sure you have the .NET 9 SDK installed on your system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ollama&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install or build Ollama from the &lt;a href="https://github.com/ollama/ollama" rel="noopener noreferrer"&gt;Ollama GitHub repository&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;Start Ollama locally:
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt; ollama serve
&lt;/code&gt;&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;Pull the desired model, for example:
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt; ollama pull llama3.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Step-by-Step Integration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Create a New .NET 9 Project
&lt;/h3&gt;

&lt;p&gt;Open your terminal or command prompt and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dotnet new console &lt;span class="nt"&gt;-n&lt;/span&gt; OllamaSemanticKernelDemo
&lt;span class="nb"&gt;cd &lt;/span&gt;OllamaSemanticKernelDemo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Add the Required NuGet Packages
&lt;/h3&gt;

&lt;p&gt;Inside your project directory, add the Semantic Kernel package and the Ollama connector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dotnet add package Microsoft.SemanticKernel --version 1.32.0
dotnet add package Microsoft.SemanticKernel.Connectors.Ollama --version 1.32.0-alpha
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Ensure Ollama is Running and Pull Your Model
&lt;/h3&gt;

&lt;p&gt;In another terminal window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then pull the model you want to use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama pull llama3.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ollama will listen on &lt;a href="http://localhost:11434" rel="noopener noreferrer"&gt;http://localhost:11434&lt;/a&gt; by default.&lt;/p&gt;
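&lt;p&gt;A quick way to confirm the server is up is to hit the REST API directly; this endpoint lists the models you have pulled so far:&lt;/p&gt;

```shell
# Assumes `ollama serve` is already running on the default port.
curl -s http://localhost:11434/api/tags
```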

&lt;h3&gt;
  
  
  Example Code
&lt;/h3&gt;

&lt;p&gt;In your Program.cs (or any .cs file you designate as the entry point), use the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;System.Diagnostics.CodeAnalysis&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;Microsoft.SemanticKernel.ChatCompletion&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;Microsoft.SemanticKernel.Connectors.Ollama&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OllamaSemanticKernelDemo&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;Experimental&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SKEXP0070"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="nf"&gt;Main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// 1. Initialize the Ollama client with the local endpoint and model&lt;/span&gt;
        &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;var&lt;/span&gt; &lt;span class="n"&gt;ollamaClient&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;OllamaClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Uri&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://localhost:11434"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"llama3.2"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// 2. Create a chat service from the Ollama client&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;chatService&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollamaClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AsChatCompletionService&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// 3. Define the AI's role/behavior via a system message&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;chatHistory&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;ChatHistory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"You are an expert about comic books"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// 4. User initiates the conversation&lt;/span&gt;
        &lt;span class="n"&gt;chatHistory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddUserMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Hi, I'm looking for comic suggestions"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nf"&gt;OutputLastMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chatHistory&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// 5. Get the AI's reply&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;reply&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;chatService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetChatMessageContentAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chatHistory&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;chatHistory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nf"&gt;OutputLastMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chatHistory&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// 6. User follows up with more info&lt;/span&gt;
        &lt;span class="n"&gt;chatHistory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddUserMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"I love sci-fi, I'd like to learn something new about the galatic empire, any suggestion"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nf"&gt;OutputLastMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chatHistory&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// 7. AI responds with more tailored suggestions&lt;/span&gt;
        &lt;span class="n"&gt;reply&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;chatService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetChatMessageContentAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chatHistory&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;chatHistory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nf"&gt;OutputLastMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chatHistory&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;OutputLastMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ChatHistory&lt;/span&gt; &lt;span class="n"&gt;chatHistory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;lastMessage&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chatHistory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Last&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;$"&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;lastMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Role&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;lastMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;\n"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Example output
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;user: Hi, I'm looking for comic suggestions

assistant: I'd be happy to help you find some great comics.

Before we get started, can you please tell me a bit more about what you're in the mood for? Here are a few questions to help me narrow down some recommendations:

1. What genre are you interested in? (e.g., superhero, fantasy, horror, romance, etc.)
2. Are there any specific characters or franchises that you enjoy?
3. Do you prefer classic comics from the past (e.g., 50s-90s), or more modern releases?
4. Is there a particular tone you're looking for? (e.g., light-hearted, dark and gritty, adventurous, etc.)
5. Are you looking for something new to read, or are you open to exploring older comics?

Let me know your answers to these questions, and I'll do my best to suggest some fantastic comics that fit your tastes!

user: I love sci-fi, I'd like to learn something new about the galactic empire, any suggestion

assistant: Science fiction is an amazing genre.

If you're interested in learning more about the Galactic Empire from a comic book perspective, here are some suggestions:

**Must-read comics:**

1. **Darth Vader** (Marvel Comics, 2015-2019) - A solo series that explores Darth Vader's backstory and his fall to the dark side.
2. **Star Wars: Tarkin** (Dynamite Entertainment, 2016) - A graphic novel that delves into Grand Moff Tarkin's personality and motivations.
3. **Star Wars: Lando** (Marvel Comics, 2018-2020) - A series that explores the life of Lando Calrissian, a key character in the Galactic Empire.

**Recommended comics for Imperial insights:**

1. **Star Wars: Rebel Run** (Dark Horse Comics, 2009) - A miniseries that shows the inner workings of the Empire's security forces and their efforts to track down Rebel Alliance members.
2. **Star Wars: The Old Republic - Revan** (Dark Horse Comics, 2014-2015) - A limited series based on the popular video game, which explores the complexities of the Mandalorian Wars and the Imperial forces involved.

**Classic comics with Imperial connections:**

1. **Tales of the Jedi** (Dark Horse Comics, 1993-1996) - A comic book series that explores various events in the Star Wars universe, including some involving the Galactic Empire.
2. **Star Wars: X-Wing** (Dark Horse Comics, 1998-2000) - A comic book series based on the popular video game, which focuses on a group of Rebel Alliance pilots fighting against Imperial forces.

**Recent comics with Imperial themes:**

1. **Star Wars: The High Republic** (Marvel Comics, 2020-present) - An ongoing series that explores a new era in the Star Wars universe, including events and characters connected to the Galactic Empire.
2. **Star Wars: Resistance** (IDW Publishing, 2018-2020) - A comic book series set during the First Order's rise to power, which includes connections to the original trilogy.

These comics offer a mix of Imperial perspectives, character studies, and historical context that can help deepen your understanding of the Galactic Empire. May the Force be with you!


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Ollama listens on localhost:11434, so every request goes to your local server.&lt;/li&gt;
&lt;li&gt;Specifying "llama3.2" tells Ollama which model to use for inference.&lt;/li&gt;
&lt;li&gt;Semantic Kernel's ChatHistory class records each turn of the conversation; the service returned by AsChatCompletionService() forwards those messages to Ollama, which generates the next reply.&lt;/li&gt;
&lt;/ol&gt;
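&lt;p&gt;For the curious: under the hood each turn ends up as a POST to Ollama's &lt;code&gt;/api/chat&lt;/code&gt; endpoint. The request body looks roughly like this (illustrative and trimmed; the connector sets additional fields):&lt;/p&gt;

```json
{
  "model": "llama3.2",
  "messages": [
    { "role": "system", "content": "You are an expert about comic books" },
    { "role": "user", "content": "Hi, I'm looking for comic suggestions" }
  ],
  "stream": false
}
```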

&lt;h3&gt;
  
  
  Running the Application
&lt;/h3&gt;

&lt;p&gt;To run your newly created console application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dotnet run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Interact with your console and observe how the local AI model responds to your prompts—no cloud calls required.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Considerations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Resource Usage: Larger models demand more CPU, RAM, and disk. llama3.2 is comparatively small at roughly 2.0 GB.&lt;/li&gt;
&lt;li&gt;Latency: Local inference avoids network round-trips and can respond faster than cloud APIs, but make sure your hardware can handle the load.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Use Cases
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Prototyping: Experiment quickly without incurring cloud costs or dealing with rate limits.&lt;/li&gt;
&lt;li&gt;Internal Knowledge Bases: Keep information in-house for compliance and data privacy.&lt;/li&gt;
&lt;li&gt;Edge or Offline Applications: Perfect for scenarios with limited internet access or strict data governance.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By running AI models locally with Ollama and coordinating them through Semantic Kernel, you can build private, cost-effective, and high-performance .NET applications. After installing Ollama, serving your model, and adding the Semantic Kernel + Ollama NuGet packages, you can host generative AI experiences entirely on your own infrastructure. This setup is especially well-suited for local development, letting you experiment and prototype without relying on cloud services or external APIs. In doing so, you maintain complete control over your environment while reducing both complexity and recurring costs.&lt;/p&gt;

&lt;p&gt;As local AI continues to evolve, the combination of Ollama and Semantic Kernel lays the groundwork for building robust, self-contained solutions that scale from simple prototypes to production-ready applications—without the hidden costs of cloud dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Experiment with different models in Ollama to find the best fit for your application.
&lt;/li&gt;
&lt;li&gt;Tweak your system and user prompts to see how they change the model’s responses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preview of What’s Next:&lt;/strong&gt; In an upcoming post, we’ll dive into &lt;em&gt;Retrieval Augmented Generation (RAG)&lt;/em&gt; to give your LLM context-aware responses sourced from your own data—still running entirely on local infrastructure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy coding! 🎄&lt;/p&gt;




&lt;h2&gt;
  
  
  Additional Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt;: &lt;a href="https://github.com/ollama/ollama" rel="noopener noreferrer"&gt;Ollama GitHub Repository&lt;/a&gt; – Install instructions, docs, and source code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic Kernel&lt;/strong&gt;: &lt;a href="https://github.com/microsoft/semantic-kernel" rel="noopener noreferrer"&gt;Semantic Kernel GitHub Repository&lt;/a&gt; – Documentation, samples, and community contributions&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dotnet</category>
      <category>ai</category>
      <category>semantickernel</category>
      <category>ollama</category>
    </item>
  </channel>
</rss>
