Todd Tanner

Posted on Jun 25 • Edited on Jun 27

A Drop-In Ollama Replacement in Pure C# - No llama.cpp, No Native Binaries, Just ILGPU Kernels

#csharp #machinelearning #llm #dotnet

A month ago I shipped a neural network engine written entirely in C# - six GPU backends, no ONNX Runtime, no JavaScript bridge, no native binaries. It ran image models in the browser straight from a <canvas>.

This week it grew up. It now serves Large Language Models, and it speaks Ollama.

SpawnDev.ILGPU.ML now ships an Ollama-compatible inference server. You point Claude CLI, Pi, Codex, OpenCode, Continue - any agentic frontend that talks to Ollama or an OpenAI-compatible endpoint - at http://localhost:11434, and they just work. No reconfiguration. And here's the part I still find a little hard to believe: it runs the GGUF models already sitting in your Ollama cache, zero-copy. No re-download, no duplicate files. It reads ~/.ollama/models, the exact same store the real ollama CLI uses.

The whole inference stack - tokenizer, GGUF dequantization, attention, the GEMV that generates each token - is 100% C#. No llama.cpp under the hood. No native interop. The GPU kernels are C# methods transpiled to PTX / WGSL / SPIR-V by SpawnDev.ILGPU, my fork of ILGPU.

Here's the part to keep in mind for everything that follows: the GPU engine (SpawnDev.ILGPU) had its first commit on February 7, 2026, and the ML library doing this LLM inference started on March 17, 2026 - it's about three months old. Three months. The honest read on the numbers below isn't "it's a bit slower than Ollama" - it's "a three-month-old C# library is in the same ballpark as the tool everyone uses, and it's nowhere near done."

The part that genuinely surprised me

I wired the server up to the Pi agent frontend, threw normal interactive prompts at it, and timed it against real Ollama running the same model. For interactive chat, the response times are indistinguishable. You ask a question, the answer streams back at the same conversational pace. If I didn't already know which port I was hitting, I couldn't tell you which engine answered.

One screen, four panes. Left: real Ollama (ollama serve → a one-shot Pi prompt). It's then stopped and our C# engine starts on the right (dotnet run → the exact same Pi prompt). Same model (qwen2.5-coder:7b), same prompt - one is llama.cpp, one is a three-month-old C# library.

That is not where this started. A few weeks ago our token generation was 7x slower than Ollama. Today, on the same quantized model and the same RTX 4070, our decode loop runs within about 1.75x of Ollama's - and since decode comfortably outruns human reading speed either way, the interactive experience lands as "the same." We'll come back to the honest version of these numbers, including the gap that's still open.

Here's a real, unedited run - qwen2.5-coder:7b generating through the C# engine, asked to write a palindrome check:

Accelerator: NVIDIA GeForce RTX 4070
Model: qwen2.5-coder:7b-instruct-q4_K_M  arch=qwen2  chat-format=ChatML

Response:
public static bool IsPalindrome(string input)
{
    int left = 0;
    int right = input.Length - 1;
    while (left < right)
    {
        if (input[left] != input[right]) return false;
        left++;
        right--;
    }
    return true;
}

[timing] prefill(TTFT)=826ms | decode=51.4 tok/s (19.5 ms/tok)

That's correct, idiomatic C#, generated by C# kernels - no llama.cpp anywhere in the stack.

Three protocols, one engine

The thing that makes it a drop-in replacement instead of "yet another inference server" is that it impersonates three different APIs at once, all driving the same loaded model:

Protocol	Endpoints	Streaming	Who uses it
Ollama-native	`/api/chat`, `/api/generate`, `/api/tags`, `/api/version`	NDJSON	Ollama-aware tools
OpenAI-compatible	`/v1/chat/completions`, `/v1/completions`, `/v1/models`	SSE	Codex, OpenCode, Continue
Anthropic Messages	`/v1/messages`, `/v1/messages/count_tokens`	SSE	Claude CLI

Pointing a client at it is one environment variable:

# Claude CLI thinks it's talking to Anthropic:
ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_MODEL=gemma4:12b claude

# Anything OpenAI-shaped:
#   base URL  http://localhost:11434/v1
#   API key   (anything - it's local, there's nothing to authenticate)
#   model     any name from your Ollama cache

# Anything Ollama-aware:
OLLAMA_HOST=http://localhost:11434

Start it the same way you'd run any other example:

# Serves on Ollama's default port. (Real Ollama already there? Use OLLAMA_PORT=11435.)
dotnet run --project Examples/06.OllamaServer.Console -c Release

# List the cached models it can serve:
dotnet run --project Examples/06.OllamaServer.Console -- --list

# One-shot generate straight from the CLI:
dotnet run --project Examples/06.OllamaServer.Console -- --chat gemma4:12b "Hi"

The model loads lazily on the first request and generation is serialized through a single gate - one decode at a time per GPU, exactly like Ollama on a single card.

A compatibility detail that fell out of driving it with a real agent: pointed at qwen2.5-coder, the model often chooses to call a tool - to write its answer to a file - instead of just printing it. Our server hands those tool calls back to the agent correctly, so it just works. Real Ollama, driving the same model over its OpenAI-compatible endpoint, currently prints the raw tool-call JSON to the screen instead - a known issue (ollama#12557). That's no knock on Ollama - it's a great tool and the bug is narrow - but it's exactly the kind of edge we're set on getting right, and a small sign of how fast this is moving on every axis at once: models, quant formats, agent frontends, and platforms.

What "pure C#" actually means here

There's no model format magic and no shelling out. When a request comes in:

The server finds the GGUF blob in your Ollama cache and memory-maps it.
It reads the model's own metadata - the tokenizer family (tokenizer.ggml.model) and the chat template (tokenizer.chat_template) - and auto-selects the right tokenizer (SentencePiece for Gemma, byte-level BPE for Qwen / Llama3) and the right chat format (ChatML / Llama3 / Gemma). Nothing is hard-coded per model.
The quantized weights are dequantized on the GPU - Q4_0/1, Q5_0/1, Q8_0, Q2_K through Q6_K, MXFP4 - by C# kernels.
Attention, the feed-forward MLPs, the RMSNorms, and the token-by-token GEMV all run as ILGPU kernels.

That last point is the whole bet. A kernel in this engine is an ordinary C# method:

private static void DoubleKernel(Index1D idx,
    TensorView<float> input, TensorView<float> output)
{
    output[idx] = input[idx] * 2f;
}

ILGPU transpiles that .NET IL into a PTX kernel on CUDA, an OpenCL kernel on AMD/Intel, a WGSL compute shader on WebGPU, a GLSL shader on WebGL2, a WebAssembly function across web workers, or a parallel-for on the CPU. One C# function. Six backends. Chosen at runtime. The LLM server runs the CUDA path; the exact same kernels run in a browser tab on WebGPU. That's not a separate port - it's the same code.

And that's the thing I'd put in front of any developer comparing this to Ollama: Ollama runs on your desktop. This runs on your desktop and inside a browser tab, with nothing installed - same model, same kernels, no native binary anywhere. The decode kernels in the table below are already verified running on WebGPU in the browser. A drop-in Ollama replacement is the desktop proof; the browser is the part Ollama can't follow us into.

The honest performance picture

I'm not going to tell you we beat Ollama. We don't, yet. Here's a real head-to-head I ran while writing this article - same model, same machine, same prompts, both engines warmed first:

Model: qwen2.5-coder:7b-instruct-q4_K_M (Q4_K_M) · GPU: RTX 4070 · Ollama: 0.30.10

Phase	Ollama	SpawnDev.ILGPU.ML (C#)	Gap
Decode (generate each token)	89.6 tok/s	51.4 tok/s	Ollama 1.74x
Prefill (~700-token prompt)	3065 tok/s	260 tok/s	Ollama ~11.8x

Two very different stories in those two rows, and I want to be precise about both.

Decode is the part that dominates an interactive chat - the speed answers stream back at - and it's the part we've gotten close on. We went from ~7x slower than Ollama a few weeks ago to 1.74x slower today. 51 tokens/sec is far faster than anyone reads, which is exactly why it feels identical in use.

What closed that gap was one insight about the decode matrix-multiply. Generating a token multiplies a single vector against the whole weight matrix - a GEMV, not a GEMM. Our first kernel put one thread per output and had each thread stream an entire weight row, so consecutive threads hit non-consecutive memory: roughly 32x uncoalesced, burning ~97% of decode time on bad access patterns. Rewriting it as a coalesced GEMV - a thread group per output column, striding through K together, reducing in shared memory - was a 6x jump on the big model (gemma4:12b: ~2317 → ~380 ms/token), token-for-token identical output. Then int8 dp4a activation quantization on top, the same path Ollama's own MMVQ kernels take.

Prefill - ingesting your prompt - is where the gap is still real and still large: about 12x. Prefill is a big dense matrix-multiply, and Ollama routes it through the GPU's tensor cores; ours still runs on the regular ALU path. On a short interactive prompt you never feel it (both finish in well under a second). On a long paste you do. Closing it means emitting WMMA / tensor-core instructions from the ILGPU kernel compiler, and it's the next big lever I'm pulling. Worth noting: that 12x is already down from ~20x a few weeks ago, before the multi-row prefill GEMM and the grouped attention kernel landed.

I'd rather show you exactly where the line is than wave a single cherry-picked number at you. The line is: interactive decode is there; bulk prefill is the work in front of me - and I can see the kernel that fixes it.

Why build this at all when Ollama exists

Two reasons, one practical and one that's the actual point.

The practical one: this server is the best stress test the ML library has. A real agentic frontend hammering a real inference pipeline finds bugs that no demo ever will - it has already caught and driven fixes for a quantized-matmul crash and a byte-level-BPE tokenization bug. Make the engine correct and fast here, and every other consumer benefits, including the browser path.

The real one: it's all the same engine, and that engine runs in a browser tab with no install, no server, and no native dependency. An Ollama-compatible server proves the LLM stack works end-to-end on the desktop GPU. The next step is that exact stack streaming a quantized LLM on WebGPU inside a Blazor WebAssembly page - the decode kernels are already verified there. When the model runs on the user's device, the user's data never leaves it. No upload, no "we promise we won't train on your prompts," no data plane at all. That's the destination the whole SpawnDev stack is pointed at, and a drop-in Ollama replacement written in C# is one more layer of the proof.

Try it

The library is on NuGet:

dotnet add package SpawnDev.ILGPU.ML --prerelease

The source, including the Ollama server example, is at github.com/LostBeard/SpawnDev.ILGPU.ML. Clone it, run Example 06, and point your favorite agent CLI at localhost:11434. If you've got models in your Ollama cache already, it'll list and serve them as-is.

It works today, verified end-to-end, on gemma4, qwen2 / qwen2.5-coder / qwen3, and llama3. Tool-calling already works on the Qwen/ChatML and Gemma paths (see above); wider tool-calling coverage, vision, and a full GGUF-template renderer are on the v2 roadmap. IQ-quants aren't supported yet - the loader tells you so instead of serving garbage.

If you're a .NET developer who has watched the local-LLM space happen entirely in C++ and Python and wondered where C# was - it's here, and after three months it's already competitive. Try it, break it, file issues from your own models. Star the repo if you want to see it keep going.

And to be clear about where this is: we are nowhere near done. We're just getting started. Prefill tensor cores, the in-browser LLM demo, tool-calling, vision, peer-to-peer model delivery over WebTorrent - all in flight. The decode gap went 7x → 1.74x in weeks; the prefill gap went 20x → 12x; both are still closing. This isn't a finished product I'm showing off. It's a three-month-old library that already keeps up with Ollama on interactive chat and runs in a browser tab, and I'm telling you now so you can watch the rest happen.

And if you can help fund the crew: github.com/sponsors/LostBeard. This is built on a $20/month budget by one person with a small crew when the budget allows. $5/month is a vote of confidence that local, private, C#-native AI is worth finishing.

The SpawnDev Crew

LostBeard (Todd Tanner) - Captain, library author, keeper of the vision
Riker - First Officer, consuming-project implementation
Data - Operations, deep-library work and test rigor
Tuvok - Security & research, design, docs, code review
Geordi - Chief Engineer, GPU kernels and backend internals
Seven - Wasm backend and fail-loud verification

🖖🚀

DEV Community