In April 2026 Google shipped Gemma 4, a multimodal model with a native audio path. I wanted to add it to Parlotype, my .NET 10 dictation app, as a second speech engine alongside Whisper. Four runtime paths got cut before I landed on llama.cpp's llama-server as a child process. This post walks through the cuts, the architecture that survived, the variant catalog, and the benchmarks.
Parlotype is a voice-to-text desktop app for Windows with on-device speech recognition as the default. You hold a global hotkey, speak, release. Text appears in whatever app you were typing into. This post is about adding a second on-device engine. Cloud speech providers are a separate, opt-in track and not the subject here.
This is the long companion to my Gemma 4 Challenge submission on the same topic. The challenge post is the 5-variant tour with the shipping decision. This one is the runtime selection and the architecture under it.
The constraints
Worth naming the constraints up front so the obvious answers make sense as dead-ends:
- On-device engine. Gemma 4 is being added as another local recognizer alongside Whisper, so inference for this path stays on the user's machine. Cloud providers are a separate, opt-in track and out of scope for this post.
- Windows desktop, single end-user installer. No "first install Python, then WSL2, then..." Real users will not do that.
- Cross-vendor GPU. AMD, Intel, and NVIDIA, with CPU fallback. Locking the app to one vendor is not acceptable.
- The audio pipeline already exists. WASAPI capture -> 16 kHz mono float[] -> Silero VAD -> speech segments -> recognizer -> text injection. The new engine has to slot in behind the existing ISpeechRecognizer interface without redesigning the pipeline.
- .NET 10 and Avalonia UI 12 for the host process.
Then the trigger. Google released Gemma 4 (E2B and E4B) with a conformer audio encoder. Their reported WER on LibriSpeech-test-clean is 4.17%, which is competitive with bigger Whisper variants on clean speech. The same checkpoint can also do text post-processing later. The question was never "should we add Gemma 4". It was "how, on Windows, in .NET, as another local engine that preserves the on-device default".
Four runtime dead-ends
This is the part of the post that took the most engineering and the part most worth writing down. Each rejection has a specific reason.
Dead-end 1: native .NET inference via onnxruntime-genai
The obvious first stop. ONNX Runtime with the GenAI extension already runs Phi-3 and similar small models from .NET. If Gemma 4 were supported, the app would have nothing more than a new ISpeechRecognizer implementation. No extra processes, no separate installer.
It is not supported. Gemma 4's architecture uses per-layer embeddings, variable head dimensions, and KV cache sharing. None of those were understood by onnxruntime-genai at the time of writing. Tracking issue: microsoft/onnxruntime-genai#2062.
Per-layer embeddings, briefly, mean each transformer layer has its own embedding matrix instead of sharing one. Variable head dimensions mean attention heads in different layers can have different sizes. Standard ONNX exporters and runtimes assume neither of these. Until ONNX Runtime ships the underlying support, no .NET-native path exists.
Dead-end 2: a Python sidecar with HuggingFace Transformers
The second attempt was a small Python sidecar. Spawn a local FastAPI server, talk HTTP to 127.0.0.1, transcribe via HF Transformers with bitsandbytes for 4-bit quantization. From .NET: write a temp WAV, POST it, parse JSON, clean up.
This actually shipped, as a benchmark-only tool (ADR-024). It was never wired into the desktop app. Three reasons:
- It pulls Python and CUDA into the install. That is a non-starter for non-developer users.
- bitsandbytes has limited Windows support. Users would need WSL2 or Linux to get the 4-bit path that makes Gemma 4 affordable on consumer GPUs.
- The benchmarks were unreliable.
That third point is worth dwelling on. The first Gemma 4 benchmark on LibriSpeech-test-other came back at 96.94% WER. Peak host RAM for the sidecar process was about 79 MB, for a model that should occupy several gigabytes. The number was so bad that the obvious conclusion was not "Gemma 4 is bad". It was "this pipeline is silently broken". Two weeks later, on the same dataset and same machine, the llama.cpp path produced 13.15% WER for the same model.
The lesson is not "Python is bad". The lesson is that the inference path you ship matters more than the model card claims, and you only learn that by measuring on your own stack.
The broken benchmark was also what prompted the search that found llama-server.
Dead-end 3: LLamaSharp
LLamaSharp is a native .NET P/Invoke layer over llama.cpp. More control, no separate process, no HTTP boundary. On paper this is the best fit for a .NET app.
The blocker was build-coupling. LLamaSharp links against a specific llama.cpp build at compile time. Switching the user's backend from Vulkan to CUDA means rebuilding the host app. There is no good way to ship "use Vulkan on AMD, use CUDA on NVIDIA" from one binary. Wiring up Gemma 4 audio through LLamaSharp would also have taken significantly more engineering than the chat-completions path.
Dead-end 4: Ollama and Lemonade
Ollama would have given the smoothest UX of any option. It also did not support Gemma audio at the time. Tracking issue: ollama/ollama#15333.
Lemonade is strong on Ryzen AI hardware, but it is AMD-specific. Cross-vendor was a hard requirement.
Why llama-server
llama-server is the HTTP server that ships with llama.cpp. At the decision date (2026-05-09, ADR-025), it was the only cross-vendor native Windows runtime with a stable HTTP API that supported Gemma 4 audio.
The concrete reasons:
- It exposes an OpenAI-compatible /v1/chat/completions endpoint. Audio goes in as an input_audio content block. The shape is documented and stable.
- Pre-built Vulkan binaries (llama-bXXXX-bin-win-vulkan-x64 from llama.cpp's GitHub Releases) work on AMD, Intel, and NVIDIA GPUs from a single download.
- CUDA, Vulkan, CPU, and other backends each ship as a separate archive. You can install more than one side by side and switch.
- Gemma 4 GGUF weights and the audio projector (mmproj) are published by ggml-org on HuggingFace.
The cost is an extra process to manage. Cold start, port conflicts, crash handling, file locks during upgrade. Most of the rest of this post is how that was tamed.
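To make the "extra process" cost concrete, here is a minimal sketch of the spawn-and-wait step. It is not the Parlotype implementation: the class and method names are hypothetical, and the HttpClient is assumed to have its BaseAddress pointed at the server. The flags (-m, --mmproj, --host, --port) and the /health endpoint are llama-server's own.

```csharp
using System;
using System.Diagnostics;
using System.Globalization;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class LlamaServerLauncher
{
    // Builds the start info for a llama-server child process bound to loopback.
    public static ProcessStartInfo BuildStartInfo(
        string exePath, string modelPath, string mmprojPath, int port)
    {
        var psi = new ProcessStartInfo(exePath)
        {
            UseShellExecute = false,
            CreateNoWindow = true,
        };
        string[] args =
        {
            "-m", modelPath,
            "--mmproj", mmprojPath,
            "--host", "127.0.0.1",
            "--port", port.ToString(CultureInfo.InvariantCulture),
        };
        foreach (string arg in args) psi.ArgumentList.Add(arg);
        return psi;
    }

    // Polls /health until the server answers with a success status or the timeout elapses.
    // Assumes http.BaseAddress is set, e.g. http://127.0.0.1:8321.
    public static async Task<bool> WaitUntilHealthyAsync(
        HttpClient http, TimeSpan timeout, CancellationToken ct = default)
    {
        var elapsed = Stopwatch.StartNew();
        while (elapsed.Elapsed < timeout)
        {
            try
            {
                using var response = await http.GetAsync("/health", ct);
                if (response.IsSuccessStatusCode) return true; // model loaded
            }
            catch (HttpRequestException)
            {
                // Port not open yet; keep polling.
            }
            await Task.Delay(250, ct);
        }
        return false;
    }
}
```

A caller would pass the start info to Process.Start, then wait on WaitUntilHealthyAsync with a generous timeout before sending the first clip. Port conflicts and crash handling are left out of this sketch.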
Architecture
Two diagrams. The first shows what is on disk and who downloads what. The second shows where the audio pipeline branches by engine.
Top-level integration
The diagram has three layers. The app (the .NET host process), disk (%LOCALAPPDATA%/parlotype for installed servers, models, and prompts), and external sources (HuggingFace for GGUFs, GitHub Releases for llama-server builds). The sidecar sits between the app and disk because it spans both: spawned by the app, but its binary and weights live on disk.
Audio pipeline: Whisper and Gemma 4, side by side
The diamond in the middle is the architectural pivot. DelegatingSpeechRecognizer reads the user's SpeechEngine setting at init time and forwards every call to either WhisperSpeechRecognizer or LlamaCppSpeechRecognizer. The audio pipeline itself does not know which engine is active. Same capture, same VAD, same injector. The right branch crosses a process boundary, which is the cost of the Gemma 4 path.
Key types worth naming:
- SpeechEngine enum in Parlotype.Core (Whisper or Gemma4), persisted via SettingsKeys.SpeechEngine.
- DelegatingSpeechRecognizer is registered as the ISpeechRecognizer singleton. It picks the underlying recognizer at InitializeAsync time.
- LlamaCppSpeechRecognizer owns the llama-server.exe process lifecycle. Spawn, poll /health, transcribe, terminate.
- JsonLlamaServerRegistry tracks managed installs in manifest.json (covered below).
- IPromptTemplateRegistry looks up the active transcription prompt per call.
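The delegation pattern can be sketched in a few lines. This is a simplified illustration, not the real code. The two-method ISpeechRecognizer shape, the constructor delegates, and the FixedTextRecognizer stand-in are all assumptions made for the example. Only the type names SpeechEngine, ISpeechRecognizer, and DelegatingSpeechRecognizer come from the app.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public enum SpeechEngine { Whisper, Gemma4 }

// Hypothetical minimal shape; the real interface is not shown in this post.
public interface ISpeechRecognizer
{
    Task InitializeAsync(CancellationToken ct = default);
    Task<string> TranscribeAsync(float[] samples16kMono, CancellationToken ct = default);
}

public sealed class DelegatingSpeechRecognizer : ISpeechRecognizer
{
    private readonly Func<SpeechEngine> _readEngineSetting;
    private readonly Func<SpeechEngine, ISpeechRecognizer> _createRecognizer;
    private ISpeechRecognizer? _inner;

    public DelegatingSpeechRecognizer(
        Func<SpeechEngine> readEngineSetting,
        Func<SpeechEngine, ISpeechRecognizer> createRecognizer)
    {
        _readEngineSetting = readEngineSetting;
        _createRecognizer = createRecognizer;
    }

    // The engine is chosen once, at init time. Later calls are forwarded unchanged.
    public async Task InitializeAsync(CancellationToken ct = default)
    {
        _inner = _createRecognizer(_readEngineSetting());
        await _inner.InitializeAsync(ct);
    }

    public Task<string> TranscribeAsync(float[] samples16kMono, CancellationToken ct = default)
    {
        if (_inner is null)
            throw new InvalidOperationException("Call InitializeAsync before TranscribeAsync.");
        return _inner.TranscribeAsync(samples16kMono, ct);
    }
}

// Stand-in recognizer used only to exercise the delegation.
public sealed class FixedTextRecognizer : ISpeechRecognizer
{
    private readonly string _text;
    public FixedTextRecognizer(string text) => _text = text;
    public Task InitializeAsync(CancellationToken ct = default) => Task.CompletedTask;
    public Task<string> TranscribeAsync(float[] samples16kMono, CancellationToken ct = default)
        => Task.FromResult(_text);
}
```

Because the pipeline only ever sees ISpeechRecognizer, swapping engines is a settings change followed by a re-init, with no change to capture, VAD, or injection.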
The input_audio content block
Most "use llama.cpp from .NET" tutorials cover text-only chat. The audio path is worth showing in detail. Audio is sent as a base64-encoded WAV blob in an input_audio content block:
```csharp
// Excerpt from LlamaCppSpeechRecognizer.cs
// promptText is the active transcription prompt; base64 is the base64-encoded WAV clip.
var body = new
{
    model = "gemma-4",
    messages = new[]
    {
        new
        {
            role = "user",
            content = new object[]
            {
                new { type = "text", text = promptText },
                new { type = "input_audio", input_audio = new { data = base64, format = "wav" } }
            }
        }
    },
    stream = false, // one JSON response, no SSE parsing
    max_tokens = 200
};

using var response = await _httpClient.PostAsJsonAsync(
    "/v1/chat/completions", body, cancellationToken);
```
stream = false is deliberate. Simpler error handling, no SSE parser, and transcription is short-burst (under 30 seconds per clip, see trade-offs below). When post-processing lands and outputs longer text, streaming becomes worth the complexity.
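The excerpt does not show where the base64 payload comes from. One way to produce it from the pipeline's 16 kHz mono float[] is sketched below. The WavEncoder name and the choice of 16-bit PCM are assumptions for illustration; the post does not say which WAV encoding Parlotype actually sends.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: wraps float samples in a 16-bit PCM mono WAV container.
public static class WavEncoder
{
    public static byte[] ToPcm16Wav(ReadOnlySpan<float> samples, int sampleRate = 16000)
    {
        int dataBytes = samples.Length * 2; // 2 bytes per 16-bit sample
        using var stream = new MemoryStream(44 + dataBytes);
        using var writer = new BinaryWriter(stream, Encoding.ASCII);

        // RIFF header (BinaryWriter emits little-endian, as WAV requires).
        writer.Write(Encoding.ASCII.GetBytes("RIFF"));
        writer.Write(36 + dataBytes);
        writer.Write(Encoding.ASCII.GetBytes("WAVE"));

        // fmt chunk: PCM, mono, 16 bits per sample.
        writer.Write(Encoding.ASCII.GetBytes("fmt "));
        writer.Write(16);              // fmt chunk size
        writer.Write((short)1);        // audio format: PCM
        writer.Write((short)1);        // channels: mono
        writer.Write(sampleRate);
        writer.Write(sampleRate * 2);  // byte rate
        writer.Write((short)2);        // block align
        writer.Write((short)16);       // bits per sample

        // data chunk: clamp to [-1, 1] and scale to the 16-bit range.
        writer.Write(Encoding.ASCII.GetBytes("data"));
        writer.Write(dataBytes);
        foreach (float sample in samples)
        {
            float clamped = Math.Clamp(sample, -1f, 1f);
            writer.Write((short)MathF.Round(clamped * short.MaxValue));
        }

        writer.Flush();
        return stream.ToArray();
    }
}
```

With that in place, the payload is Convert.ToBase64String(WavEncoder.ToPcm16Wav(samples)). Encoding in memory also avoids the temp-file round trip the Python sidecar needed.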
Trade-offs
The decisions that bit me, in the order they bit.
Model size. GGUF E4B Q4_K_M is about 5.9 GiB. BF16 variants reach about 15 GiB. The Gemma4ModelInfo catalog (ADR-029) curates five variants and explicitly notes that ggml-org/gemma-4-E2B-it-GGUF does not publish a Q4_K_M asset. I learned this from a 404 in manual testing, then rebuilt the catalog from the actual file lists.
Noisy audio. On LibriSpeech-test-other, Whisper LargeV3Turbo on CUDA lands at 11.48% WER. The best Gemma 4 variant (E2B-it-BF16) lands at 13.15% WER. A 1.7-point gap on the harder English split. Google's own evaluations showed Gemma 4 falling further behind on meeting-style noise (AMI is about 41% WER for Gemma versus about 16% for Whisper-large-v3). The honest pitch is that Gemma 4 is competitive on read speech and degrades faster than Whisper as noise and overlap rise.
Cold start. ADR-025 estimated 3 to 30 seconds for llama-server cold start. My first benchmark numbers confirmed the high end (21.3 seconds modelLoad for E2B-Q8_0). After I added an always-on warm-up pass (ADR-031), the same modelLoad dropped to 6.7 seconds. Most of the original cost was OS page cache and CUDA driver init, not the recognizer. Real InitializeAsync on a warm host is about 6.7 to 9.3 seconds for Gemma 4 and about 0.7 to 1.5 seconds for Whisper.
30-second clip limit. Gemma 4 audio is bounded at 30 seconds per request. Parlotype's VAD already chunks below this, so it did not bite, but it is a real architectural ceiling.
E2B-Q8_0 is unstable. During the benchmark, gemma-4-E2B-it-Q8_0 intermittently emitted stray <|channel> reasoning tokens that crashed llama-server's chat-template parser with HTTP 500. The first 50-sample run failed mid-stream. The second succeeded but with abnormally high RTF (0.315 versus about 0.04 for other Gemma quants) because of verbose thought-text bleed-through. The catalog keeps E2B-Q8_0 selectable for experimentation. The default is E4B Q4_K_M.
BF16 hallucinations on Blackwell GPUs. Separate from the E2B-Q8_0 issue, BF16 variants have a documented hallucination behavior on some NVIDIA Blackwell hardware. On the CUDA 13.1 box used here, BF16 was actually the strongest Gemma 4 variant, so this is hardware-specific.
The managed-install subsystem
The simplest "give the user a folder picker" version of this worked for about two weeks. Then it became obvious that:
- llama.cpp ships a different archive per backend, OS, and architecture. Vulkan, CUDA 12.4, CUDA 13.1, CPU, HIP, SYCL.
- CUDA on Windows needs a companion cudart-llama-bin-*.zip for the NVIDIA runtime DLLs.
- New releases land several times a week, tagged bXXXX, with no "latest" alias.
ADR-026 added a full managed-install subsystem. A catalog backed by GitHub Releases with ETag caching. An installer that stages downloads under .staging/{guid}/payload/ and commits with a single Directory.Move. A registry (manifest.json) as the source of truth for what is installed. A tolerant asset parser that turns unknown backend strings into Unknown rather than throwing.
Two details worth calling out.
Atomic rename. Every install assembles under a staging directory and is committed by a single Directory.Move. A crash mid-install leaves no visible state. The user does not end up with a half-installed server. This is the kind of detail no library does for you.
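A minimal sketch of the stage-then-move idea follows. The StagedInstaller name and the populatePayload callback are hypothetical; only the .staging/{guid}/payload/ layout and the single Directory.Move commit come from the design described above. It assumes the staging folder and the final folder are on the same volume and that the target does not exist yet, since Directory.Move fails otherwise.

```csharp
using System;
using System.IO;

// Hypothetical helper, for illustration only.
public static class StagedInstaller
{
    public static void Install(string root, string installName, Action<string> populatePayload)
    {
        string staging = Path.Combine(root, ".staging", Guid.NewGuid().ToString("N"));
        string payload = Path.Combine(staging, "payload");
        string finalDir = Path.Combine(root, installName);

        Directory.CreateDirectory(payload);
        try
        {
            // Download and extract into the staging payload. Nothing is visible yet.
            populatePayload(payload);

            // Commit: a single rename makes the install appear all at once.
            Directory.Move(payload, finalDir);
        }
        finally
        {
            // On success this removes the now-empty staging folder.
            // On failure it removes the partial payload.
            if (Directory.Exists(staging)) Directory.Delete(staging, recursive: true);
        }
    }
}
```

A hard crash between the two steps can still leave an orphaned folder under .staging, so a real installer would also sweep that directory at startup. The final install location never holds a partial payload either way.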
Shared download primitive. StreamingFileDownloader was extracted from the pre-existing Whisper model downloader and is now used by both. About 150 lines, no abstraction layer, just a shared chunk loop.
The whole subsystem is about 1,800 lines across Core, Platform, and Desktop, plus tests. Worth naming so the cost is visible. "Add a button that downloads a binary" is not what shipped.
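The tolerant asset parser is the easiest piece to show in isolation. The sketch below is an approximation, not the shipped parser. The type names and the regular expression are mine, and the pattern assumes Windows-style asset names of the form llama-bXXXX-bin-{os}-{backend}-{arch}.zip. Names that do not fit are skipped, and backends it does not recognize map to Unknown instead of throwing.

```csharp
using System;
using System.Text.RegularExpressions;

public enum LlamaBackend { Unknown, Cpu, Vulkan, Cuda, Hip, Sycl }

// Hypothetical record describing one release asset.
public sealed record LlamaAsset(string Tag, string Os, LlamaBackend Backend, string Arch);

public static class LlamaAssetParser
{
    // Assumed naming scheme, e.g. llama-b9297-bin-win-cuda-13.1-x64.zip
    private static readonly Regex Pattern = new(
        @"^llama-(?<tag>b\d+)-bin-(?<os>[a-z]+)-(?<backend>.+)-(?<arch>x64|arm64)\.zip$",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns null for files that are not server archives (for example cudart-* companions).
    public static LlamaAsset? TryParse(string fileName)
    {
        Match match = Pattern.Match(fileName);
        if (!match.Success) return null;

        string backendText = match.Groups["backend"].Value.ToLowerInvariant();
        LlamaBackend backend = backendText switch
        {
            "cpu" => LlamaBackend.Cpu,
            "vulkan" => LlamaBackend.Vulkan,
            _ when backendText.StartsWith("cuda", StringComparison.Ordinal) => LlamaBackend.Cuda,
            _ when backendText.StartsWith("hip", StringComparison.Ordinal) => LlamaBackend.Hip,
            _ when backendText.StartsWith("sycl", StringComparison.Ordinal) => LlamaBackend.Sycl,
            _ => LlamaBackend.Unknown, // new backend upstream: keep going, do not throw
        };

        return new LlamaAsset(
            match.Groups["tag"].Value,
            match.Groups["os"].Value.ToLowerInvariant(),
            backend,
            match.Groups["arch"].Value.ToLowerInvariant());
    }
}
```

Mapping to Unknown matters because releases land several times a week. A new backend string upstream should show up as an unselectable entry, not as a crash in the catalog refresh.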
Configurable prompts
The Gemma 4 path sends a prompt alongside each audio clip. The text block in the user message tells the model what to do with the audio. Originally this was a hardcoded const. ADR-030 made it a first-class registry. Users create, edit, and duplicate prompts via the Settings UI. Prompts persist to prompts.json. The active prompt is re-read per transcription, no model reload required.
A {language} placeholder is the one small interface seam left for a future feature: source-language detection from keyboard layout. Small interface seams beat retroactive migrations of saved user data.
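For a sense of how small that seam is, here is a sketch of a render step. The record shape, the optional parameter, and the fallback wording are assumptions for illustration. The real PromptTemplate.Render signature is not shown in this post.

```csharp
using System;

// Hypothetical shape of a stored prompt.
public sealed record PromptTemplate(string Name, string Text)
{
    private const string LanguageToken = "{language}";

    // Substitutes the placeholder. With no detected language, a neutral phrase is used
    // so prompts saved today keep working when detection ships later.
    public string Render(string? language = null)
    {
        string value = string.IsNullOrWhiteSpace(language) ? "the spoken language" : language;
        return Text.Replace(LanguageToken, value, StringComparison.Ordinal);
    }
}
```

Prompts without the placeholder pass through unchanged, which is why adding the token early costs nothing for existing saved prompts.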
An example prompt to show what the multimodal-prompt approach actually unlocks.
1. Technical: bug-report formatter
Transcribe the speech verbatim. Then, on a new line, reformat it as a GitHub bug report with sections "Steps to reproduce", "Expected", "Actual", "Environment". If a section cannot be inferred from the speech, write (not specified).
Input (spoken): "I clicked save and the app just died, nothing in the logs, on my Windows machine, 64-bit."
Output: a structured issue
I click save and the app just died. Nothing in the logs. On
my windows machine, 64 bit.
** Expected*
(not specified)
** Actual **
The app crashes/dies. There are no errors in the logs.
*Environment*
Windows machine, 64 bit.
** Steps to reproduce*
Click save.
Benchmarks
The main benchmark data is in results/comparison-libri-speech-test-other-2026-05-23-cuda.md in the repo. The numbers below match that file exactly.
Setup
- Dataset: LibriSpeech test-other, 50 samples. The harder English split, with more diverse accents and recording conditions than test-clean.
- Whisper: CUDA runtime (Whisper.net.Runtime.Cuda, strict via runtimePreference: "Cuda"), beam size 1 (greedy, deterministic).
- Gemma 4: llama-server CUDA build b9297-win-cuda-13.1-x64, port 8321, no streaming, the built-in transcription prompt.
- VAD: disabled. The dataset is pre-segmented, so full-file transcription is correct here.
- Warm-up: one throwaway transcription before the timed loop, per ADR-031.
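For readers who want to reproduce the metrics, here is a sketch of the two headline numbers. WER is word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. RTF is processing time divided by audio duration. The AsrMetrics name and the text normalization (lowercasing and whitespace splitting only) are my assumptions here. The benchmark harness in the repo may normalize differently, which shifts absolute WER.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class AsrMetrics
{
    public static double WordErrorRate(string reference, string hypothesis)
    {
        string[] refWords = Tokenize(reference);
        string[] hypWords = Tokenize(hypothesis);
        if (refWords.Length == 0)
            throw new ArgumentException("Reference must contain at least one word.", nameof(reference));

        // Two-row Levenshtein distance over words.
        var previous = new int[hypWords.Length + 1];
        var current = new int[hypWords.Length + 1];
        for (int j = 0; j <= hypWords.Length; j++) previous[j] = j;

        for (int i = 1; i <= refWords.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= hypWords.Length; j++)
            {
                int substitution = previous[j - 1] + (refWords[i - 1] == hypWords[j - 1] ? 0 : 1);
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.Min(substitution, Math.Min(deletion, insertion));
            }
            (previous, current) = (current, previous);
        }

        return (double)previous[hypWords.Length] / refWords.Length;
    }

    // RTF below 1.0 means faster than real time.
    public static double RealTimeFactor(TimeSpan processing, TimeSpan audio)
        => processing.TotalSeconds / audio.TotalSeconds;

    private static string[] Tokenize(string text)
        => text.ToLowerInvariant().Split(
            new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
}
```

An RTF of 0.038 means a 10-second clip takes about 0.38 seconds to transcribe.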
Methodology sidebar: the warm-up fix
The first time I ran these numbers, gemma-4-E2B-it-Q8_0 reported a 21.3-second modelLoad. The other Gemma variants reported about 9 seconds. Whisper Small reported 1.2. None of that matched my hand measurements. Once I added an always-on warm-up pass, the picture changed:
| Model | Cold modelLoad | Warm modelLoad | Delta |
|---|---|---|---|
| Whisper Small | 1192 ms | 755 ms | -437 ms |
| Whisper LargeV3Turbo | 1567 ms | 1511 ms | about the same |
| Gemma E2B-Q8_0 | 21300 ms | 6741 ms | -14.6 s |
| Gemma E2B-BF16 | 9256 ms | 6703 ms | -2.5 s |
The decoder is greedy and deterministic, so WER and CER did not change between cold and warm runs. Only the timing fields became meaningful. If you publish inference timings without an explicit warm-up policy, you are publishing your filesystem cache state.
Results
| Rank | Engine | Model | WER % | CER % | RTF | Model load (s) |
|---|---|---|---|---|---|---|
| 1 | Whisper (CUDA) | LargeV3Turbo | 11.48 | 4.97 | 0.055 | 1.31 |
| 2 | Whisper (CUDA) | Medium | 12.18 | 5.41 | 0.073 | 1.28 |
| 3 | Whisper (CUDA) | Small | 13.10 | 5.87 | 0.034 | 0.71 |
| 4 | Gemma 4 (llama.cpp CUDA) | E2B-it-BF16 | 13.15 | 4.95 | 0.038 | 6.70 |
| 5 | Gemma 4 (llama.cpp CUDA) | E4B-it-Q4_K_M | 13.82 | 5.80 | 0.038 | 6.73 |
| 6 | Gemma 4 (llama.cpp CUDA) | E4B-it-BF16 | 14.20 | 5.40 | 0.038 | 6.72 |
| 7 | Gemma 4 (llama.cpp CUDA) | E4B-it-Q8_0 | 14.39 | 5.79 | 0.044 | 9.25 |
| 8 | Gemma 4 (llama.cpp CUDA) | E2B-it-Q8_0 | 19.22 | 8.95 | 0.315 | 6.74 |
Things worth calling out in prose.
- Whisper LargeV3Turbo still leads. 11.48% versus Gemma's best 13.15%. The gap is 1.67 points, which is smaller than I expected before running these numbers.
- Whisper Small on CUDA is the fastest in the field. RTF 0.034 beats every Gemma variant (0.038 or higher) and every other Whisper. At 13.10% WER it also essentially ties Gemma E2B-it-BF16 (13.15%) on accuracy. If you only keep one configuration on disk, Whisper Small on CUDA is hard to argue against on this dataset.
- Gemma 4 E2B-it-BF16 has the lowest CER of the whole field. 4.95% versus Whisper LargeV3Turbo's 4.97%. The WER ordering does not always agree with the CER ordering, and Gemma's character-level errors at this size are unusually small.
- Gemma BF16 and Q4 are faster than mid-tier Whisper. Those variants sit at RTF 0.038, faster than Whisper Medium (0.073) and LargeV3Turbo (0.055), but slower than Whisper Small (0.034).
- E2B-Q8_0 is broken on this dataset. RTF 0.315 (8x slower than other Gemma variants), WER 19.22%. The crash on the stray <|channel> token is the same issue from the trade-offs section.
Vulkan vs CUDA: a regression I did not expect
Before pivoting Whisper to CUDA, I ran the same three Whisper models on Vulkan. The result is almost invariant, but not quite.
| Model | Vulkan WER | CUDA WER | Delta |
|---|---|---|---|
| Small | 13.10 | 13.10 | 0.00 (bit-identical) |
| Medium | 12.18 | 12.18 | 0.00 (bit-identical) |
| LargeV3Turbo | 10.15 | 11.48 | +1.33 pp |
Small and Medium produce bit-identical WER across runtimes. The greedy decoder is deterministic and the kernels reproduce. LargeV3Turbo regresses by 1.33 percentage points on CUDA, reproducibly.
The most likely culprit is non-bitwise-identical kernel math between the Vulkan and CUDA backends. Matmul and softmax reduction order, and FP16 accumulation order, are not guaranteed to be deterministic across GPU backends. At the scale of LargeV3Turbo's larger matrices, accumulated FP error tips a handful of borderline decoder choices.
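The underlying effect is easy to reproduce on the CPU. Floating-point addition is not associative, so the same numbers summed in a different order can give a different result. The toy below is only an illustration of that property in single precision, with a made-up class name. It says nothing about what the Vulkan or CUDA kernels actually do.

```csharp
using System;

public static class ReductionOrderDemo
{
    // Sums left to right, rounding to single precision after every step.
    public static float SumInOrder(float[] values)
    {
        float accumulator = 0f;
        foreach (float value in values)
            accumulator = (float)(accumulator + value);
        return accumulator;
    }

    public static void Main()
    {
        // Same three numbers, two orders.
        float[] largeFirst = { 1e8f, 1f, -1e8f };  // the 1 is lost next to 1e8
        float[] cancelFirst = { 1e8f, -1e8f, 1f }; // the large terms cancel first

        Console.WriteLine(SumInOrder(largeFirst));  // 0
        Console.WriteLine(SumInOrder(cancelFirst)); // 1
    }
}
```

In a greedy decoder, a difference that small only matters when two candidate tokens are nearly tied, which fits a handful of flipped choices and not a broad accuracy collapse.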
The takeaway is not "CUDA is buggy". It is that GPU backends are not interchangeable when you care about exact transcripts. If LargeV3Turbo is your production target, benchmark on the runtime you will actually ship.
CUDA also delivered what you would expect on the other dimensions. RTF improved 8 to 26% across all three Whisper models. Host RAM dropped 30 to 60% because weights now live in VRAM. The speed and memory wins are real and worth taking.
What this section does not claim
- Gemma 4 wins. It does not, on this dataset.
- Whisper is obsolete. It is not. LargeV3Turbo still leads by 1.67 WER points.
- These numbers generalize. They are 50 samples of read English on a single benchmark machine. The point is to give readers numbers they can replicate, not to declare a winner.
What's next
Three concrete follow-ups, each with one sentence of why.
A LlamaServerHost extraction. Right now LlamaCppSpeechRecognizer owns the llama-server process. The first post-processing consumer will need to share the server. A dedicated host class will manage spawn and terminate so neither workload can tear the server down on the other.
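One possible shape for that host is a reference-counted lease: the first consumer starts the server and the last one to leave stops it. This is a speculative sketch of the idea, not the planned design. The start and stop delegates stand in for the real process management.

```csharp
using System;
using System.Threading;

// Hypothetical sketch; the real LlamaServerHost does not exist yet.
public sealed class LlamaServerHost
{
    private readonly object _gate = new();
    private readonly Action _startServer;
    private readonly Action _stopServer;
    private int _leaseCount;

    public LlamaServerHost(Action startServer, Action stopServer)
    {
        _startServer = startServer;
        _stopServer = stopServer;
    }

    // Each consumer holds a lease for as long as it needs the server.
    public IDisposable Acquire()
    {
        lock (_gate)
        {
            if (_leaseCount == 0) _startServer(); // first consumer spawns the process
            _leaseCount++;
        }
        return new Lease(this);
    }

    private void Release()
    {
        lock (_gate)
        {
            _leaseCount--;
            if (_leaseCount == 0) _stopServer(); // last consumer out terminates it
        }
    }

    private sealed class Lease : IDisposable
    {
        private LlamaServerHost? _owner;
        public Lease(LlamaServerHost owner) => _owner = owner;

        // Safe to call more than once; only the first call releases.
        public void Dispose() => Interlocked.Exchange(ref _owner, null)?.Release();
    }
}
```

With this shape, the recognizer and a post-processing consumer would each wrap their work in a using block around Acquire(), and neither would need to know about the other.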
A post-processing pipeline. Same loaded model, second invocation. Whisper text -> llama-server -> cleaned, translated, or structured text -> injector. The configurable prompts feature is the first half of this. The consumer is what is still missing.
Source language detection from keyboard layout. The {language} token in PromptTemplate.Render is already in place. The detector is what comes next.
Try it
- Repo: github.com/mdemin729/parlotype
- Demo video, 60 seconds, Gemma 4 dictation walkthrough:
- ADRs: docs/decisions/024-gemma4-python-sidecar.md through 030-configurable-gemma4-prompts.md.
- Benchmark data: results/comparison-libri-speech-test-other-2026-05-23-cuda.md.
Windows only for now. .NET 10, MIT licensed. Pick Gemma 4 in Settings -> Speech Engine. The in-app installer downloads llama-server and the GGUF for you.
If you have shipped llama.cpp's /v1/chat/completions audio path in production, I am curious about cold-start mitigations beyond keeping the server warm. Spinning-disk first-inference times in the 30-second range are the part I have not solved cleanly yet.
Maksim Demin is a .NET engineer building Parlotype, a voice-to-text desktop app. He writes about cross-platform .NET, Avalonia, and local AI.



