Maksim Demin

Posted on May 27

Adding Gemma 4 speech recognition to a .NET desktop app: the llama-server sidecar that survived

#ai #architecture #dotnet #gemma

In April 2026 Google shipped Gemma 4, a multimodal model with a native audio path. I wanted to add it to Parlotype, my .NET 10 dictation app, as a second speech engine alongside Whisper. Four runtime paths got cut before I landed on llama.cpp's llama-server as a child process. This post walks through the cuts, the architecture that survived, the variant catalog, and the benchmarks.

Parlotype is a voice-to-text desktop app for Windows with on-device speech recognition as the default. You hold a global hotkey, speak, release. Text appears in whatever app you were typing into. This post is about adding a second on-device engine. Cloud speech providers are a separate, opt-in track and not the subject here.

This is the long companion to my Gemma 4 Challenge submission on the same topic. The challenge post is the 5-variant tour with the shipping decision. This one is the runtime selection and the architecture under it.

The constraints

Worth naming the constraints up front so the obvious answers make sense as dead-ends:

On-device engine. Gemma 4 is being added as another local recognizer alongside Whisper, so inference for this path stays on the user's machine. Cloud providers are a separate, opt-in track and out of scope for this post.
Windows desktop, single end-user installer. No "first install Python, then WSL2, then..." Real users will not do that.
Cross-vendor GPU. AMD, Intel, and NVIDIA, with CPU fallback. Locking the app to one vendor is not acceptable.
The audio pipeline already exists. WASAPI capture -> 16 kHz mono float[] -> Silero VAD -> speech segments -> recognizer -> text injection. The new engine has to slot in behind the existing ISpeechRecognizer interface without redesigning the pipeline.
.NET 10 and Avalonia UI 12 for the host process.

Then the trigger. Google released Gemma 4 (E2B and E4B) with a conformer audio encoder. Their reported WER on LibriSpeech-test-clean is 4.17%, which is competitive with bigger Whisper variants on clean speech. The same checkpoint can also do text post-processing later. The question was never "should we add Gemma 4". It was "how, on Windows, in .NET, as another local engine that preserves the on-device default".

Four runtime dead-ends

This is the part of the post that took the most engineering and the part most worth writing down. Each rejection has a specific reason.

Dead-end 1: native .NET inference via `onnxruntime-genai`

The obvious first stop. ONNX Runtime with the GenAI extension already runs Phi-3 and similar small models from .NET. If Gemma 4 were supported, the app would need nothing more than a new ISpeechRecognizer implementation. No extra processes, no separate installer.

It is not supported. Gemma 4's architecture uses per-layer embeddings, variable head dimensions, and KV cache sharing. None of those were understood by onnxruntime-genai at the time of writing. Tracking issue: microsoft/onnxruntime-genai#2062.

Per-layer embeddings, briefly, mean each transformer layer has its own embedding matrix instead of sharing one. Variable head dimensions mean attention heads in different layers can have different sizes. Standard ONNX exporters and runtimes assume neither of these. Until ONNX Runtime ships the underlying support, no .NET-native path exists.

Dead-end 2: a Python sidecar with HuggingFace Transformers

The second attempt was a small Python sidecar. Spawn a local FastAPI server, talk HTTP to 127.0.0.1, transcribe via HF Transformers with bitsandbytes for 4-bit quantization. From .NET: write a temp WAV, POST it, parse JSON, clean up.

This actually shipped, as a benchmark-only tool (ADR-024). It was never wired into the desktop app. Three reasons:

It pulls Python and CUDA into the install. That is a non-starter for non-developer users.
bitsandbytes has limited Windows support. Users would need WSL2 or Linux to get the 4-bit path that makes Gemma 4 affordable on consumer GPUs.
The benchmarks were unreliable.

That third point is worth dwelling on. The first Gemma 4 benchmark on LibriSpeech-test-other came back at 96.94% WER. Peak host RAM for the sidecar process was about 79 MB, for a model that should occupy several gigabytes. The number was so bad that the obvious conclusion was not "Gemma 4 is bad". It was "this pipeline is silently broken". Two weeks later, on the same dataset and same machine, the llama.cpp path produced 13.15% WER for the same model.

The lesson is not "Python is bad". The lesson is that the inference path you ship matters more than the model card claims, and you only learn that by measuring on your own stack.

The broken benchmark was also what prompted the search that found llama-server.

Dead-end 3: LLamaSharp

LLamaSharp is a native .NET P/Invoke layer over llama.cpp. More control, no separate process, no HTTP boundary. On paper this is the best fit for a .NET app.

The blocker was build-coupling. LLamaSharp links against a specific llama.cpp build at compile time. Switching the user's backend from Vulkan to CUDA means rebuilding the host app. There is no good way to ship "use Vulkan on AMD, use CUDA on NVIDIA" from one binary. Wiring up Gemma 4 audio through LLamaSharp would also have been significantly more engineering than the chat-completions path.

Dead-end 4: Ollama and Lemonade

Ollama would have given the smoothest UX of any option. It also did not support Gemma audio at the time. Tracking issue: ollama/ollama#15333.

Lemonade is strong on Ryzen AI hardware, but it is AMD-specific. Cross-vendor was a hard requirement.

Why `llama-server`

llama-server is the HTTP server that ships with llama.cpp. At the decision date (2026-05-09, ADR-025), it was the only cross-vendor native Windows runtime with a stable HTTP API that supported Gemma 4 audio.

The concrete reasons:

It exposes an OpenAI-compatible /v1/chat/completions endpoint. Audio goes in as an input_audio content block. The shape is documented and stable.
Pre-built Vulkan binaries (llama-bXXXX-bin-win-vulkan-x64 from llama.cpp's GitHub Releases) work on AMD, Intel, and NVIDIA GPUs from a single download.
CUDA, Vulkan, CPU, and other backends each ship as a separate archive. You can install more than one side by side and switch.
Gemma 4 GGUF weights and the audio projector (mmproj) are published by ggml-org on HuggingFace.

The cost is an extra process to manage. Cold start, port conflicts, crash handling, file locks during upgrade. Most of the rest of this post is how that was tamed.

Architecture

Two diagrams. The first shows what is on disk and who downloads what. The second shows where the audio pipeline branches by engine.

Top-level integration

The diagram has three layers. The app (the .NET host process), disk (%LOCALAPPDATA%/parlotype for installed servers, models, and prompts), and external sources (HuggingFace for GGUFs, GitHub Releases for llama-server builds). The sidecar sits between the app and disk because it spans both: spawned by the app, but its binary and weights live on disk.

Audio pipeline: Whisper and Gemma 4, side by side

The diamond in the middle is the architectural pivot. DelegatingSpeechRecognizer reads the user's SpeechEngine setting at init time and forwards every call to either WhisperSpeechRecognizer or LlamaCppSpeechRecognizer. The audio pipeline itself does not know which engine is active. Same capture, same VAD, same injector. The right branch crosses a process boundary, which is the cost of the Gemma 4 path.
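A minimal sketch of that delegation, under stated assumptions: the two-method ISpeechRecognizer shape below is hypothetical (the real interface in Parlotype may differ), and StubRecognizer only stands in for the two real engines so the example is self-contained.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

public enum SpeechEngine { Whisper, Gemma4 }

// Hypothetical shape: the real ISpeechRecognizer in Parlotype may differ.
public interface ISpeechRecognizer
{
    Task InitializeAsync(CancellationToken ct);
    Task<string> TranscribeAsync(float[] samples16kMono, CancellationToken ct);
}

// Stand-in for WhisperSpeechRecognizer / LlamaCppSpeechRecognizer in this sketch.
public sealed class StubRecognizer : ISpeechRecognizer
{
    private readonly string _name;
    public StubRecognizer(string name) => _name = name;
    public Task InitializeAsync(CancellationToken ct) => Task.CompletedTask;
    public Task<string> TranscribeAsync(float[] samples16kMono, CancellationToken ct)
        => Task.FromResult(_name);
}

public sealed class DelegatingSpeechRecognizer : ISpeechRecognizer
{
    private readonly Func<SpeechEngine> _readSetting;
    private readonly Func<SpeechEngine, ISpeechRecognizer> _create;
    private ISpeechRecognizer? _inner;

    public DelegatingSpeechRecognizer(
        Func<SpeechEngine> readSetting,
        Func<SpeechEngine, ISpeechRecognizer> create)
    {
        _readSetting = readSetting;
        _create = create;
    }

    // The engine is chosen once, at init time; every later call is forwarded.
    public Task InitializeAsync(CancellationToken ct)
    {
        _inner = _create(_readSetting());
        return _inner.InitializeAsync(ct);
    }

    public Task<string> TranscribeAsync(float[] samples16kMono, CancellationToken ct)
    {
        if (_inner is null)
            throw new InvalidOperationException("Call InitializeAsync first.");
        return _inner.TranscribeAsync(samples16kMono, ct);
    }
}
```

The pipeline only ever sees ISpeechRecognizer, so capture, VAD, and injection stay unaware of which branch is active.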

Key types worth naming:

SpeechEngine enum in Parlotype.Core (Whisper or Gemma4), persisted via SettingsKeys.SpeechEngine.
DelegatingSpeechRecognizer is registered as the ISpeechRecognizer singleton. It picks the underlying recognizer at InitializeAsync time.
LlamaCppSpeechRecognizer owns the llama-server.exe process lifecycle. Spawn, poll /health, transcribe, terminate.
JsonLlamaServerRegistry tracks managed installs in manifest.json (covered below).
IPromptTemplateRegistry looks up the active transcription prompt per call.
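The "spawn, poll /health" part of that lifecycle can be sketched roughly as follows. This is not the code from LlamaCppSpeechRecognizer: SidecarStartup, FindFreeLoopbackPort, and WaitUntilReadyAsync are hypothetical names, and picking a free port is my assumption about one way to sidestep port conflicts. The probe is passed in as a delegate so the loop itself has no network dependency.

```csharp
using System;
using System.Diagnostics;
using System.Net;
using System.Net.Http;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

public static class SidecarStartup
{
    // Ask the OS for a currently free loopback port by binding to port 0.
    // There is a small race: another process could grab it before the child binds.
    public static int FindFreeLoopbackPort()
    {
        var listener = new TcpListener(IPAddress.Loopback, 0);
        listener.Start();
        try { return ((IPEndPoint)listener.LocalEndpoint).Port; }
        finally { listener.Stop(); }
    }

    // Calls probe until it reports ready or the timeout elapses.
    public static async Task<bool> WaitUntilReadyAsync(
        Func<CancellationToken, Task<bool>> probe,
        TimeSpan timeout, TimeSpan interval, CancellationToken ct)
    {
        var clock = Stopwatch.StartNew();
        while (true)
        {
            ct.ThrowIfCancellationRequested();
            try
            {
                if (await probe(ct)) return true;
            }
            catch (HttpRequestException)
            {
                // Connection refused: the server is not listening yet.
            }
            if (clock.Elapsed >= timeout) return false;
            await Task.Delay(interval, ct);
        }
    }
}
```

In the app, the probe would be something like `async ct => (await http.GetAsync("/health", ct)).IsSuccessStatusCode`, on the assumption that the endpoint answers with a non-success status while the model is still loading.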

The `input_audio` content block

Most "use llama.cpp from .NET" tutorials cover text-only chat. The audio path is worth showing in detail. Audio is sent as a base64-encoded WAV blob in an input_audio content block:

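// Excerpt from LlamaCppSpeechRecognizer.cs
// PostAsJsonAsync comes from System.Net.Http.Json.
// promptText is the active transcription prompt; base64 is the Base64-encoded WAV clip.
var body = new
{
    model = "gemma-4",
    messages = new[]
    {
        new
        {
            role = "user",
            content = new object[]
            {
                // The instruction for the model, followed by the audio it applies to.
                new { type = "text", text = promptText },
                new { type = "input_audio", input_audio = new { data = base64, format = "wav" } }
            }
        }
    },
    stream = false,   // one JSON response, no SSE parsing
    max_tokens = 200
};

// The relative URI assumes _httpClient.BaseAddress is set to the llama-server address.
using var response = await _httpClient.PostAsJsonAsync(
    "/v1/chat/completions", body, cancellationToken);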

stream = false is deliberate. Simpler error handling, no SSE parser, and transcription is short-burst (under 30 seconds per clip, see trade-offs below). When post-processing lands and outputs longer text, streaming becomes worth the complexity.
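For completeness, here is a sketch of the two ends around that request: turning the pipeline's 16 kHz mono float[] into the Base64 WAV payload, and pulling the transcript out of a non-streamed response. AudioPayload and its method names are hypothetical, 16-bit PCM is my assumption about the WAV flavor, and the response shape is the usual OpenAI-style choices[0].message.content.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.Json;

public static class AudioPayload
{
    // Encodes float samples in [-1, 1] as a 16-bit PCM mono WAV file in memory.
    public static byte[] ToWavPcm16(float[] samples, int sampleRate = 16000)
    {
        const short channels = 1, bitsPerSample = 16;
        int dataBytes = samples.Length * 2;
        using var ms = new MemoryStream(44 + dataBytes);
        using var w = new BinaryWriter(ms, Encoding.ASCII);
        w.Write(Encoding.ASCII.GetBytes("RIFF"));
        w.Write(36 + dataBytes);                              // remaining file size
        w.Write(Encoding.ASCII.GetBytes("WAVE"));
        w.Write(Encoding.ASCII.GetBytes("fmt "));
        w.Write(16);                                          // fmt chunk size
        w.Write((short)1);                                    // format tag: PCM
        w.Write(channels);
        w.Write(sampleRate);
        w.Write(sampleRate * channels * bitsPerSample / 8);   // byte rate
        w.Write((short)(channels * bitsPerSample / 8));       // block align
        w.Write(bitsPerSample);
        w.Write(Encoding.ASCII.GetBytes("data"));
        w.Write(dataBytes);
        foreach (float s in samples)
            w.Write((short)Math.Round(Math.Clamp(s, -1f, 1f) * short.MaxValue));
        w.Flush();
        return ms.ToArray();
    }

    public static string ToBase64Wav(float[] samples, int sampleRate = 16000)
        => Convert.ToBase64String(ToWavPcm16(samples, sampleRate));

    // Reads choices[0].message.content from a non-streamed chat completion.
    public static string ExtractTranscript(string responseJson)
    {
        using var doc = JsonDocument.Parse(responseJson);
        return doc.RootElement.GetProperty("choices")[0]
            .GetProperty("message").GetProperty("content").GetString() ?? string.Empty;
    }
}
```

Building the WAV in memory avoids the temp-file round trip; whether that matters depends on clip size and how often the recognizer is called.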

Trade-offs

The decisions that bit me, in the order they bit.

Model size. GGUF E4B Q4_K_M is about 5.9 GiB. BF16 variants reach about 15 GiB. The Gemma4ModelInfo catalog (ADR-029) curates five variants and explicitly notes that ggml-org/gemma-4-E2B-it-GGUF does not publish a Q4_K_M asset. I learned this from a 404 in manual testing, then rebuilt the catalog from the actual file lists.

Noisy audio. On LibriSpeech-test-other, Whisper LargeV3Turbo on CUDA lands at 11.48% WER. The best Gemma 4 variant (E2B-it-BF16) lands at 13.15% WER. That is a 1.67-point gap on the harder English split. Google's own evaluations showed Gemma 4 falling further behind on meeting-style noise (AMI is about 41% WER for Gemma versus about 16% for Whisper-large-v3). The honest pitch is that Gemma 4 is competitive on read speech and degrades faster than Whisper as noise and overlap rise.

Cold start. ADR-025 estimated 3 to 30 seconds for llama-server cold start. My first benchmark numbers confirmed the high end (21.3 seconds modelLoad for E2B-Q8_0). After I added an always-on warm-up pass (ADR-031), the same modelLoad dropped to 6.7 seconds. Most of the original cost was OS page cache and CUDA driver init, not the recognizer. Real InitializeAsync on a warm host is about 6.7 to 9.3 seconds for Gemma 4 and about 0.7 to 1.5 seconds for Whisper.

30-second clip limit. Gemma 4 audio is bounded at 30 seconds per request. Parlotype's VAD already chunks below this, so it did not bite, but it is a real architectural ceiling.

E2B-Q8_0 is unstable. During the benchmark, gemma-4-E2B-it-Q8_0 intermittently emitted stray <|channel> reasoning tokens that crashed llama-server's chat-template parser with HTTP 500. The first 50-sample run failed mid-stream. The second succeeded but with abnormally high RTF (0.315 versus about 0.04 for other Gemma quants) because of verbose thought-text bleed-through. The catalog keeps E2B-Q8_0 selectable for experimentation. The default is E4B Q4_K_M.

BF16 hallucinations on Blackwell GPUs. Separate from the E2B-Q8_0 issue, BF16 variants have a documented hallucination behavior on some NVIDIA Blackwell hardware. On the CUDA 13.1 box used here, BF16 was actually the strongest Gemma 4 variant, so this is hardware-specific.

The managed-install subsystem

The simplest "give the user a folder picker" version of this worked for about two weeks. Then it became obvious that:

llama.cpp ships a different archive per backend, OS, and architecture. Vulkan, CUDA 12.4, CUDA 13.1, CPU, HIP, SYCL.
CUDA on Windows needs a companion cudart-llama-bin-*.zip for the NVIDIA runtime DLLs.
New releases land several times a week, tagged bXXXX, with no "latest" alias.

ADR-026 added a full managed-install subsystem. A catalog backed by GitHub Releases with ETag caching. An installer that stages downloads under .staging/{guid}/payload/ and commits with a single Directory.Move. A registry (manifest.json) as the source of truth for what is installed. A tolerant asset parser that turns unknown backend strings into Unknown rather than throwing.
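The tolerant-parser idea can be illustrated like this. It is a sketch, not the parser from the repo: LlamaBackend, AssetNameParser, and the token names are assumptions based on the archive names quoted in this post, and real release assets may use other spellings.

```csharp
using System;

public enum LlamaBackend { Unknown, Cpu, Vulkan, Cuda, Hip, Sycl }

public static class AssetNameParser
{
    // Maps a server archive name to a backend. Anything unrecognized becomes
    // Unknown instead of throwing, so a new upstream backend cannot break the catalog.
    public static LlamaBackend ParseBackend(string assetName)
    {
        string name = assetName.ToLowerInvariant();

        // Only "llama-..." archives are server builds in this sketch; companion
        // archives such as the cudart one need their own classification.
        if (!name.StartsWith("llama-", StringComparison.Ordinal))
            return LlamaBackend.Unknown;

        foreach (string token in name.Split('-', '.'))
        {
            switch (token)
            {
                case "cpu": return LlamaBackend.Cpu;
                case "vulkan": return LlamaBackend.Vulkan;
                case "cuda": return LlamaBackend.Cuda;
                case "hip": return LlamaBackend.Hip;
                case "sycl": return LlamaBackend.Sycl;
            }
        }
        return LlamaBackend.Unknown;
    }
}
```

The UI can then list Unknown entries as unsupported rather than failing the whole catalog refresh.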

Two details worth calling out.

Atomic rename. Every install is assembled under a staging directory and committed by a single Directory.Move. A crash mid-install leaves no visible state. The user does not end up with a half-installed server. This is the kind of detail no library does for you.
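The stage-then-move pattern looks roughly like the sketch below. StagedInstaller and its parameters are hypothetical names, and the populatePayload callback stands in for the real download and extract steps. Keeping the staging directory under the same root as the target matters, because a directory move is only a cheap rename when both paths are on the same volume.

```csharp
using System;
using System.IO;

public static class StagedInstaller
{
    // Builds the install under root/.staging/{guid}/payload, then commits it
    // to root/{installName} with one Directory.Move.
    public static string Install(string root, string installName, Action<string> populatePayload)
    {
        string staging = Path.Combine(root, ".staging", Guid.NewGuid().ToString("N"));
        string payload = Path.Combine(staging, "payload");
        string target = Path.Combine(root, installName);

        Directory.CreateDirectory(payload);
        try
        {
            populatePayload(payload);          // download and extract happen here
            if (Directory.Exists(target))
                throw new IOException($"Already installed: {target}");
            Directory.Move(payload, target);   // the single commit step
            return target;
        }
        finally
        {
            // On success this removes the now-empty staging folder;
            // on failure it removes the partial payload.
            if (Directory.Exists(staging))
                Directory.Delete(staging, recursive: true);
        }
    }
}
```

A hard crash can still leave an orphaned folder under .staging, so a startup sweep of that directory is a natural companion; nothing under the visible install path is ever half-written.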

Shared download primitive. StreamingFileDownloader was extracted from the pre-existing Whisper model downloader and is now used by both. About 150 lines, no abstraction layer, just a shared chunk loop.
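A shared chunk loop of that kind could look like the following sketch. It is not the StreamingFileDownloader source: StreamCopy, the buffer size, and the IProgress-based reporting are my assumptions about a reasonable shape.

```csharp
#nullable enable
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class StreamCopy
{
    // Copies source to destination in fixed-size chunks, reporting the running
    // byte count after each chunk. Returns the total number of bytes copied.
    public static async Task<long> CopyInChunksAsync(
        Stream source, Stream destination,
        IProgress<long>? progress, CancellationToken ct,
        int bufferSize = 81920)
    {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int read;
        while ((read = await source.ReadAsync(buffer.AsMemory(), ct)) > 0)
        {
            await destination.WriteAsync(buffer.AsMemory(0, read), ct);
            total += read;
            progress?.Report(total);
        }
        return total;
    }
}
```

Both the Whisper model download and the llama-server download can feed an HTTP response stream and a FileStream into the same loop, which is why no further abstraction is needed.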

The whole subsystem is about 1,800 lines across Core, Platform, and Desktop, plus tests. Worth naming so the cost is visible. "Add a button that downloads a binary" is not what shipped.

Configurable prompts

The Gemma 4 path sends a prompt alongside each audio clip. The text block in the user message tells the model what to do with the audio. Originally this was a hardcoded const. ADR-030 made it a first-class registry. Users create, edit, and duplicate prompts via the Settings UI. Prompts persist to prompts.json. The active prompt is re-read on every transcription, so changing it requires no model reload.

A {language} placeholder is the one small interface seam left for a future feature: source-language detection from keyboard layout. Small interface seams beat retroactive migrations of saved user data.
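As an illustration of how small that seam is, here is a sketch of a render step. Only the PromptTemplate.Render name and the {language} token come from the project; the record shape and the fallback wording are assumptions for this example.

```csharp
#nullable enable
using System;

public sealed record PromptTemplate(string Name, string Text)
{
    // Substitutes the {language} token. With no detected language, a neutral
    // phrase is used so the prompt still reads as a sentence.
    public string Render(string? language) =>
        Text.Replace(
            "{language}",
            string.IsNullOrWhiteSpace(language) ? "the spoken language" : language);
}
```

Because saved prompts already contain the token, a future detector only has to pass a value in; prompts.json does not need a migration.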

Here is an example prompt that shows what the multimodal-prompt approach actually unlocks.

1. Technical: bug-report formatter

Transcribe the speech verbatim. Then, on a new line, reformat it as a GitHub bug report with sections "Steps to reproduce", "Expected", "Actual", "Environment". If a section cannot be inferred from the speech, write (not specified).

Input (spoken): "I clicked save and the app just died, nothing in the logs, on my Windows machine, 64-bit."

Output: a structured issue

I click save and the app just died. Nothing in the logs. On my windows machine, 64 bit.

**Expected**
(not specified)

**Actual**
The app crashes/dies. There are no errors in the logs.

**Environment**
Windows machine, 64 bit.

**Steps to reproduce**
Click save.

Benchmarks

The main benchmark data is in results/comparison-libri-speech-test-other-2026-05-23-cuda.md in the repo. The numbers below match that file exactly.

Setup

Dataset: LibriSpeech test-other, 50 samples. The harder English split, with more diverse accents and recording conditions than test-clean.
Whisper: CUDA runtime (Whisper.net.Runtime.Cuda, strict via runtimePreference: "Cuda"), beam size 1 (greedy, deterministic).
Gemma 4: llama-server CUDA build b9297-win-cuda-13.1-x64, port 8321, no streaming, the built-in transcription prompt.
VAD: disabled. The dataset is pre-segmented, so full-file transcription is correct here.
Warm-up: one throwaway transcription before the timed loop, per ADR-031.

Methodology sidebar: the warm-up fix

The first time I ran these numbers, gemma-4-E2B-it-Q8_0 reported a 21.3-second modelLoad. The other Gemma variants reported about 9 seconds. Whisper Small reported 1.2. None of that matched my hand measurements. Once I added an always-on warm-up pass, the picture changed:

Model	Cold modelLoad	Warm modelLoad	Delta
Whisper `Small`	1192 ms	755 ms	-437 ms
Whisper `LargeV3Turbo`	1567 ms	1511 ms	about the same
Gemma `E2B-Q8_0`	21300 ms	6741 ms	-14.6 s
Gemma `E2B-BF16`	9256 ms	6703 ms	-2.5 s

The decoder is greedy and deterministic, so WER and CER did not change between cold and warm runs. Only the timing fields became meaningful. If you publish inference timings without an explicit warm-up policy, you are publishing your filesystem cache state.
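A warm-up policy of this kind can be sketched in a few lines. Bench and its members are hypothetical names, not the benchmark harness from the repo, and the RTF helper assumes the usual definition of processing time divided by audio duration.

```csharp
using System;
using System.Diagnostics;

public static class Bench
{
    // Runs the action once untimed (the warm-up), then times each further run.
    public static TimeSpan[] TimeWithWarmUp(Action transcribeOnce, int timedRuns)
    {
        transcribeOnce(); // throwaway pass: fills caches, triggers lazy init

        var results = new TimeSpan[timedRuns];
        for (int i = 0; i < timedRuns; i++)
        {
            var sw = Stopwatch.StartNew();
            transcribeOnce();
            sw.Stop();
            results[i] = sw.Elapsed;
        }
        return results;
    }

    // Real-time factor: below 1.0 means faster than real time.
    public static double RealTimeFactor(TimeSpan processing, TimeSpan audio)
        => processing.TotalSeconds / audio.TotalSeconds;
}
```

The important part is that the warm-up is unconditional. An optional flag is easy to forget, and then the first model in the loop pays for everyone's cold caches.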

Results

Rank	Engine	Model	WER %	CER %	RTF	Model load (s)
1	Whisper (CUDA)	`LargeV3Turbo`	11.48	4.97	0.055	1.31
2	Whisper (CUDA)	`Medium`	12.18	5.41	0.073	1.28
3	Whisper (CUDA)	`Small`	13.10	5.87	0.034	0.71
4	Gemma 4 (llama.cpp CUDA)	`E2B-it-BF16`	13.15	4.95	0.038	6.70
5	Gemma 4 (llama.cpp CUDA)	`E4B-it-Q4_K_M`	13.82	5.80	0.038	6.73
6	Gemma 4 (llama.cpp CUDA)	`E4B-it-BF16`	14.20	5.40	0.038	6.72
7	Gemma 4 (llama.cpp CUDA)	`E4B-it-Q8_0`	14.39	5.79	0.044	9.25
8	Gemma 4 (llama.cpp CUDA)	`E2B-it-Q8_0`	19.22	8.95	0.315	6.74

Things worth calling out in prose.

Whisper LargeV3Turbo still leads. 11.48% versus Gemma's best 13.15%. The gap is 1.67 points, smaller than I expected before running these numbers.
Whisper Small on CUDA is the fastest in the field. RTF 0.034 beats every Gemma variant (0.038 or higher) and every other Whisper. At 13.10% WER it also essentially ties Gemma E2B-it-BF16 (13.15%) on accuracy. If you only keep one configuration on disk, Whisper Small on CUDA is hard to argue against on this dataset.
Gemma 4 E2B-it-BF16 has the lowest CER of the whole field. 4.95% versus Whisper LargeV3Turbo's 4.97%. The WER ordering does not always agree with the CER ordering, and Gemma's character-level errors at this size are unusually small.
Gemma BF16 and Q4 are faster than mid-tier Whisper. Gemma variants sit at RTF 0.038, faster than Whisper Medium (0.073) and LargeV3Turbo (0.055), but slower than Whisper Small (0.034).
E2B-Q8_0 is broken on this dataset. RTF 0.315 (8x slower than other Gemma variants), WER 19.22%. The crash on the stray <|channel> token is the same issue from the trade-offs section.
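For readers replicating the WER and CER columns, this is a minimal sketch of the standard definitions: edit distance over words (or characters) divided by the reference length. ErrorRate is a hypothetical helper, and it applies no text normalization. The normalization used for the table is not described here, so numbers from this sketch will only match if the inputs are normalized the same way.

```csharp
using System;

public static class ErrorRate
{
    // WER = word-level edit distance / number of reference words.
    public static double WordErrorRate(string reference, string hypothesis)
    {
        string[] r = reference.Split(' ', StringSplitOptions.RemoveEmptyEntries);
        string[] h = hypothesis.Split(' ', StringSplitOptions.RemoveEmptyEntries);
        if (r.Length == 0) throw new ArgumentException("Reference is empty.", nameof(reference));
        return (double)EditDistance(r, h) / r.Length;
    }

    // CER = character-level edit distance / number of reference characters.
    public static double CharErrorRate(string reference, string hypothesis)
    {
        if (reference.Length == 0) throw new ArgumentException("Reference is empty.", nameof(reference));
        return (double)EditDistance(reference.ToCharArray(), hypothesis.ToCharArray()) / reference.Length;
    }

    // Levenshtein distance with two rolling rows: substitutions, insertions,
    // and deletions each cost 1.
    private static int EditDistance<T>(T[] a, T[] b) where T : IEquatable<T>
    {
        int[] prev = new int[b.Length + 1], curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;
        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1].Equals(b[j - 1]) ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            (prev, curr) = (curr, prev);
        }
        return prev[b.Length];
    }
}
```

One wrong word in an eight-word reference is a WER of 0.125 but a much smaller CER when only two characters differ. That is how WER and CER orderings can disagree, as they do for E2B-it-BF16 above.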

Vulkan vs CUDA: a regression I did not expect

Before pivoting Whisper to CUDA, I ran the same three Whisper models on Vulkan. The result is almost invariant, but not quite.

Model	Vulkan WER	CUDA WER	Delta
`Small`	13.10	13.10	0.00 (bit-identical)
`Medium`	12.18	12.18	0.00 (bit-identical)
`LargeV3Turbo`	10.15	11.48	+1.33 pp

Small and Medium produce identical WER on both runtimes, which is what a deterministic greedy decoder gives you when the kernels reproduce. LargeV3Turbo regresses by 1.33 percentage points on CUDA, reproducibly.

The most likely culprit is non-bitwise-identical kernel math between the Vulkan and CUDA backends. Reduction order in matmul and softmax, and FP16 accumulation order, are not guaranteed to match across GPU backends. At the scale of LargeV3Turbo's larger matrices, accumulated FP error tips a handful of borderline decoder choices.
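The underlying effect is easy to reproduce on a CPU: floating-point addition is not associative, so summing the same values in a different order can give a different result. The sketch below is a toy illustration of that property only, not a model of either GPU backend; FloatOrder is a hypothetical name.

```csharp
public static class FloatOrder
{
    // Accumulates in single precision from the first element to the last.
    public static float SumLeftToRight(float[] values)
    {
        float acc = 0f;
        foreach (float v in values) acc = (float)(acc + v);
        return acc;
    }

    // Same values, same precision, opposite order.
    public static float SumRightToLeft(float[] values)
    {
        float acc = 0f;
        for (int i = values.Length - 1; i >= 0; i--) acc = (float)(acc + values[i]);
        return acc;
    }
}
```

With the input { 1e8f, -1e8f, 1f }, left-to-right cancels the two large terms first and keeps the 1, while right-to-left adds the 1 to -1e8f, where it is below the rounding step and disappears. Two kernels that reduce in different orders hit the same effect at much larger scale.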

The takeaway is not "CUDA is buggy". It is that GPU backends are not interchangeable when you care about exact transcripts. If LargeV3Turbo is your production target, benchmark on the runtime you will actually ship.

CUDA also delivered what you would expect on the other dimensions. RTF improved 8 to 26% across all three Whisper models. Host RAM dropped 30 to 60% because weights now live in VRAM. The speed and memory wins are real and worth taking.

What this section does not claim

Gemma 4 wins. It does not, on this dataset.
Whisper is obsolete. It is not. LargeV3Turbo still leads by 1.67 WER points.
These numbers generalize. They are 50 samples of read English on a single benchmark machine. The point is to give readers numbers they can replicate, not to declare a winner.

What's next

Three concrete follow-ups, each with one sentence of why.

A LlamaServerHost extraction. Right now LlamaCppSpeechRecognizer owns the llama-server process. The first post-processing consumer will need to share the server. A dedicated host class will manage spawn and terminate so neither workload can tear the server down on the other.

A post-processing pipeline. Same loaded model, second invocation. Whisper text -> llama-server -> cleaned, translated, or structured text -> injector. The configurable prompts feature is the first half of this. The consumer is what is still missing.

Source language detection from keyboard layout. The {language} token in PromptTemplate.Render is already in place. The detector is what comes next.

Try it

Repo: github.com/mdemin729/parlotype
Demo video, 60 seconds, Gemma 4 dictation walkthrough:

ADRs: docs/decisions/024-gemma4-python-sidecar.md through 030-configurable-gemma4-prompts.md.
Benchmark data: results/comparison-libri-speech-test-other-2026-05-23-cuda.md.

Windows only for now. .NET 10, MIT licensed. Pick Gemma 4 in Settings -> Speech Engine. The in-app installer downloads llama-server and the GGUF for you.

If you have shipped llama.cpp's /v1/chat/completions audio path in production, I am curious about cold-start mitigations beyond keeping the server warm. Spinning-disk first-inference times in the 30-second range are the part I have not solved cleanly yet.

Maksim Demin is a .NET engineer building Parlotype, a voice-to-text desktop app. He writes about cross-platform .NET, Avalonia, and local AI.
