DEV Community: Maksim Demin

Adding Gemma 4 speech recognition to a .NET desktop app: the llama-server sidecar that survived

Maksim Demin — Wed, 27 May 2026 02:36:13 +0000

In April 2026 Google shipped Gemma 4, a multimodal model with a native audio path. I wanted to add it to Parlotype, my .NET 10 dictation app, as a second speech engine alongside Whisper. Four runtime paths got cut before I landed on llama.cpp's llama-server as a child process. This post walks through the cuts, the architecture that survived, the variant catalog, and the benchmarks.

Parlotype is a voice-to-text desktop app for Windows with on-device speech recognition as the default. You hold a global hotkey, speak, release. Text appears in whatever app you were typing into. This post is about adding a second on-device engine. Cloud speech providers are a separate, opt-in track and not the subject here.

This is the long companion to my Gemma 4 Challenge submission on the same topic. The challenge post is the 5-variant tour with the shipping decision. This one is the runtime selection and the architecture under it.

The constraints

Worth naming the constraints up front so the obvious answers make sense as dead-ends:

On-device engine. Gemma 4 is being added as another local recognizer alongside Whisper, so inference for this path stays on the user's machine. Cloud providers are a separate, opt-in track and out of scope for this post.
Windows desktop, single end-user installer. No "first install Python, then WSL2, then..." Real users will not do that.
Cross-vendor GPU. AMD, Intel, and NVIDIA, with CPU fallback. Locking the app to one vendor is not acceptable.
The audio pipeline already exists. WASAPI capture -> 16 kHz mono float[] -> Silero VAD -> speech segments -> recognizer -> text injection. The new engine has to slot in behind the existing ISpeechRecognizer interface without redesigning the pipeline.
.NET 10 and Avalonia UI 12 for the host process.

Then the trigger. Google released Gemma 4 (E2B and E4B) with a conformer audio encoder. Their reported WER on LibriSpeech-test-clean is 4.17%, which is competitive with bigger Whisper variants on clean speech. The same checkpoint can also do text post-processing later. The question was never "should we add Gemma 4". It was "how, on Windows, in .NET, as another local engine that preserves the on-device default".

Four runtime dead-ends

This is the part of the post that took the most engineering and the part most worth writing down. Each rejection has a specific reason.

Dead-end 1: native .NET inference via `onnxruntime-genai`

The obvious first stop. ONNX Runtime with the GenAI extension already runs Phi-3 and similar small models from .NET. If Gemma 4 were supported, the app would have nothing more than a new ISpeechRecognizer implementation. No extra processes, no separate installer.

It is not supported. Gemma 4's architecture uses per-layer embeddings, variable head dimensions, and KV cache sharing. None of those were understood by onnxruntime-genai at the time of writing. Tracking issue: microsoft/onnxruntime-genai#2062.

Per-layer embeddings, briefly, mean each transformer layer has its own embedding matrix instead of sharing one. Variable head dimensions mean attention heads in different layers can have different sizes. Standard ONNX exporters and runtimes assume neither of these. Until ONNX Runtime ships the underlying support, no .NET-native path exists.

Dead-end 2: a Python sidecar with HuggingFace Transformers

The second attempt was a small Python sidecar. Spawn a local FastAPI server, talk HTTP to 127.0.0.1, transcribe via HF Transformers with bitsandbytes for 4-bit quantization. From .NET: write a temp WAV, POST it, parse JSON, clean up.

This actually shipped, as a benchmark-only tool (ADR-024). It was never wired into the desktop app. Three reasons:

It pulls Python and CUDA into the install. That is a non-starter for non-developer users.
bitsandbytes has limited Windows support. Users would need WSL2 or Linux to get the 4-bit path that makes Gemma 4 affordable on consumer GPUs.
The benchmarks were unreliable.

That third point is worth dwelling on. The first Gemma 4 benchmark on LibriSpeech-test-other came back at 96.94% WER. Peak host RAM for the sidecar process was about 79 MB, for a model that should occupy several gigabytes. The number was so bad that the obvious conclusion was not "Gemma 4 is bad". It was "this pipeline is silently broken". Two weeks later, on the same dataset and same machine, the llama.cpp path produced 13.15% WER for the same model.

The lesson is not "Python is bad". The lesson is that the inference path you ship matters more than the model card claims, and you only learn that by measuring on your own stack.

The broken benchmark was also what prompted the search that found llama-server.

Dead-end 3: LLamaSharp

LLamaSharp is a native .NET P/Invoke layer over llama.cpp. More control, no separate process, no HTTP boundary. On paper this is the best fit for a .NET app.

The blocker was build-coupling. LLamaSharp links against a specific llama.cpp build at compile time. Switching the user's backend from Vulkan to CUDA means rebuilding the host app. There is no good way to ship "use Vulkan on AMD, use CUDA on NVIDIA" from one binary. Audio support for Gemma 4 was also significantly more engineering than the chat-completions path.

Dead-end 4: Ollama and Lemonade

Ollama would have given the smoothest UX of any option. It also did not support Gemma audio at the time. Tracking issue: ollama/ollama#15333.

Lemonade is strong on Ryzen AI hardware, but it is AMD-specific. Cross-vendor was a hard requirement.

Why `llama-server`

llama-server is the HTTP server that ships with llama.cpp. At the decision date (2026-05-09, ADR-025), it was the only cross-vendor native Windows runtime with a stable HTTP API that supported Gemma 4 audio.

The concrete reasons:

It exposes an OpenAI-compatible /v1/chat/completions endpoint. Audio goes in as an input_audio content block. The shape is documented and stable.
Pre-built Vulkan binaries (llama-bXXXX-bin-win-vulkan-x64 from llama.cpp's GitHub Releases) work on AMD, Intel, and NVIDIA GPUs from a single download.
CUDA, Vulkan, CPU, and other backends each ship as a separate archive. You can install more than one side by side and switch.
Gemma 4 GGUF weights and the audio projector (mmproj) are published by ggml-org on HuggingFace.

The cost is an extra process to manage. Cold start, port conflicts, crash handling, file locks during upgrade. Most of the rest of this post is how that was tamed.
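A minimal sketch of the spawn-and-wait step, to make the "extra process" cost concrete. It assumes llama-server's `-m`, `--mmproj`, `--host`, and `--port` flags and a `/health` endpoint that returns a success status only once the model is loaded. `LlamaServerLauncher` and its method names are hypothetical, not Parlotype's actual code.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Globalization;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not the real LlamaCppSpeechRecognizer.
public static class LlamaServerLauncher
{
    // Builds the argument list: model weights, audio projector, loopback bind, port.
    public static IReadOnlyList<string> BuildArguments(string modelPath, string mmprojPath, int port) =>
        new[]
        {
            "-m", modelPath,
            "--mmproj", mmprojPath,
            "--host", "127.0.0.1",
            "--port", port.ToString(CultureInfo.InvariantCulture),
        };

    // Starts llama-server as a child process without a console window.
    public static Process Start(string exePath, IReadOnlyList<string> args)
    {
        var psi = new ProcessStartInfo(exePath) { UseShellExecute = false, CreateNoWindow = true };
        foreach (var arg in args) psi.ArgumentList.Add(arg);
        return Process.Start(psi) ?? throw new InvalidOperationException("llama-server did not start.");
    }

    // Polls GET /health until it succeeds, the process exits, or the timeout elapses.
    // The HttpClient is assumed to have BaseAddress set to http://127.0.0.1:<port>/.
    public static async Task<bool> WaitForHealthyAsync(
        HttpClient http, Process server, TimeSpan timeout, CancellationToken ct)
    {
        var elapsed = Stopwatch.StartNew();
        while (elapsed.Elapsed < timeout)
        {
            ct.ThrowIfCancellationRequested();
            if (server.HasExited) return false; // crashed during model load
            try
            {
                using var response = await http.GetAsync("/health", ct);
                if (response.IsSuccessStatusCode) return true;
            }
            catch (HttpRequestException)
            {
                // The socket is not listening yet; keep polling.
            }
            await Task.Delay(250, ct);
        }
        return false;
    }
}
```

Checking `HasExited` inside the loop is what turns a crash during model load into a fast failure instead of a full timeout.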

Architecture

Two diagrams. The first shows what is on disk and who downloads what. The second shows where the audio pipeline branches by engine.

Top-level integration

The diagram has three layers. The app (the .NET host process), disk (%LOCALAPPDATA%/parlotype for installed servers, models, and prompts), and external sources (HuggingFace for GGUFs, GitHub Releases for llama-server builds). The sidecar sits between the app and disk because it spans both: spawned by the app, but its binary and weights live on disk.

Audio pipeline: Whisper and Gemma 4, side by side

The diamond in the middle is the architectural pivot. DelegatingSpeechRecognizer reads the user's SpeechEngine setting at init time and forwards every call to either WhisperSpeechRecognizer or LlamaCppSpeechRecognizer. The audio pipeline itself does not know which engine is active. Same capture, same VAD, same injector. The right branch crosses a process boundary, which is the cost of the Gemma 4 path.

Key types worth naming:

SpeechEngine enum in Parlotype.Core (Whisper or Gemma4), persisted via SettingsKeys.SpeechEngine.
DelegatingSpeechRecognizer is registered as the ISpeechRecognizer singleton. It picks the underlying recognizer at InitializeAsync time.
LlamaCppSpeechRecognizer owns the llama-server.exe process lifecycle. Spawn, poll /health, transcribe, terminate.
JsonLlamaServerRegistry tracks managed installs in manifest.json (covered below).
IPromptTemplateRegistry looks up the active transcription prompt per call.
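The delegation described above can be sketched in a few lines. The `ISpeechRecognizer` shape here is an assumption (the real interface is not shown in this post), and `DelegatingRecognizerSketch` and `StubRecognizer` are illustrative names only.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public enum SpeechEngine { Whisper, Gemma4 }

// Assumed interface shape; the real ISpeechRecognizer may expose more members.
public interface ISpeechRecognizer
{
    Task InitializeAsync(CancellationToken ct);
    Task<string> TranscribeAsync(float[] samples, CancellationToken ct);
}

// Picks the concrete recognizer once, at initialization, then forwards every call.
public sealed class DelegatingRecognizerSketch : ISpeechRecognizer
{
    private readonly Func<SpeechEngine> _readSetting;
    private readonly Func<SpeechEngine, ISpeechRecognizer> _factory;
    private ISpeechRecognizer? _inner;

    public DelegatingRecognizerSketch(
        Func<SpeechEngine> readSetting, Func<SpeechEngine, ISpeechRecognizer> factory)
    {
        _readSetting = readSetting;
        _factory = factory;
    }

    public async Task InitializeAsync(CancellationToken ct)
    {
        var inner = _factory(_readSetting()); // the engine setting is read here, not per call
        await inner.InitializeAsync(ct);
        _inner = inner;
    }

    public Task<string> TranscribeAsync(float[] samples, CancellationToken ct) =>
        (_inner ?? throw new InvalidOperationException("Call InitializeAsync first."))
            .TranscribeAsync(samples, ct);
}

// Stand-in recognizer used only to exercise the sketch.
public sealed class StubRecognizer : ISpeechRecognizer
{
    private readonly string _label;
    public StubRecognizer(string label) => _label = label;
    public Task InitializeAsync(CancellationToken ct) => Task.CompletedTask;
    public Task<string> TranscribeAsync(float[] samples, CancellationToken ct) => Task.FromResult(_label);
}
```

Because the pipeline only ever sees `ISpeechRecognizer`, switching engines comes down to re-running initialization rather than touching capture, VAD, or injection.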

The `input_audio` content block

Most "use llama.cpp from .NET" tutorials cover text-only chat. The audio path is worth showing in detail. Audio is sent as a base64-encoded WAV blob in an input_audio content block:

// Excerpt from LlamaCppSpeechRecognizer.cs
var body = new
{
    model = "gemma-4",
    messages = new[]
    {
        new
        {
            role = "user",
            content = new object[]
            {
                new { type = "text", text = promptText },
                new { type = "input_audio", input_audio = new { data = base64, format = "wav" } }
            }
        }
    },
    stream = false,
    max_tokens = 200
};

using var response = await _httpClient.PostAsJsonAsync(
    "/v1/chat/completions", body, cancellationToken);

stream = false is deliberate. Simpler error handling, no SSE parser, and transcription is short-burst (under 30 seconds per clip, see trade-offs below). When post-processing lands and outputs longer text, streaming becomes worth the complexity.
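With streaming off, the reply is a single JSON document. A minimal sketch of pulling the transcript out of it, assuming the standard OpenAI chat-completion shape (`choices[0].message.content`). `ChatCompletionParser` is a hypothetical name, not a type from the repo.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper: extracts the assistant text from a non-streaming
// OpenAI-style chat completion response.
public static class ChatCompletionParser
{
    public static string ExtractTranscript(string json)
    {
        using var doc = JsonDocument.Parse(json);
        var content = doc.RootElement
            .GetProperty("choices")[0]
            .GetProperty("message")
            .GetProperty("content")
            .GetString();

        // A null content field is treated as an empty transcript.
        return (content ?? string.Empty).Trim();
    }
}
```

`GetProperty` throws `KeyNotFoundException` when a field is missing, which is convenient for surfacing an error payload as a failure rather than as an empty transcript.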

Trade-offs

The decisions that bit me, in the order they bit.

Model size. GGUF E4B Q4_K_M is about 5.9 GiB. BF16 variants reach about 15 GiB. The Gemma4ModelInfo catalog (ADR-029) curates five variants and explicitly notes that ggml-org/gemma-4-E2B-it-GGUF does not publish a Q4_K_M asset. I learned this from a 404 in manual testing, then rebuilt the catalog from the actual file lists.

Noisy audio. On LibriSpeech-test-other, Whisper LargeV3Turbo on CUDA lands at 11.48% WER. The best Gemma 4 variant (E2B-it-BF16) lands at 13.15% WER. A 1.7-point gap on the harder English split. Google's own evaluations showed Gemma 4 falling further behind on meeting-style noise (AMI is about 41% WER for Gemma versus about 16% for Whisper-large-v3). The honest pitch is that Gemma 4 is competitive on read speech and degrades faster than Whisper as noise and overlap rise.

Cold start. ADR-025 estimated 3 to 30 seconds for llama-server cold start. My first benchmark numbers confirmed the high end (21.3 seconds modelLoad for E2B-Q8_0). After I added an always-on warm-up pass (ADR-031), the same modelLoad dropped to 6.7 seconds. Most of the original cost was OS page cache and CUDA driver init, not the recognizer. Real InitializeAsync on a warm host is about 6.7 to 9.3 seconds for Gemma 4 and about 0.7 to 1.5 seconds for Whisper.

30-second clip limit. Gemma 4 audio is bounded at 30 seconds per request. Parlotype's VAD already chunks below this, so it did not bite, but it is a real architectural ceiling.
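Parlotype relies on VAD segmentation to stay under the ceiling. As a last-resort guard, a hard cap can be sketched as below. This is an illustrative fallback, not what the app does: `ClipSplitter` is a hypothetical name, and a blind cut like this can split a word in half.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical safety net: splits a sample buffer into windows of at most maxSeconds.
public static class ClipSplitter
{
    public static IEnumerable<ArraySegment<float>> Split(
        float[] samples, int sampleRate = 16000, double maxSeconds = 30.0)
    {
        int maxSamples = (int)(sampleRate * maxSeconds);
        if (maxSamples <= 0) throw new ArgumentOutOfRangeException(nameof(maxSeconds));

        // Each segment is a view over the original array; nothing is copied.
        for (int offset = 0; offset < samples.Length; offset += maxSamples)
        {
            int count = Math.Min(maxSamples, samples.Length - offset);
            yield return new ArraySegment<float>(samples, offset, count);
        }
    }
}
```

For 70 seconds of 16 kHz audio this yields three segments: two of 480,000 samples and one of 160,000.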

E2B-Q8_0 is unstable. During the benchmark, gemma-4-E2B-it-Q8_0 intermittently emitted stray <|channel> reasoning tokens that crashed llama-server's chat-template parser with HTTP 500. The first 50-sample run failed mid-stream. The second succeeded but with abnormally high RTF (0.315 versus about 0.04 for other Gemma quants) because of verbose thought-text bleed-through. The catalog keeps E2B-Q8_0 selectable for experimentation. The default is E4B Q4_K_M.

BF16 hallucinations on Blackwell GPUs. Separate from the E2B-Q8_0 issue, BF16 variants have a documented hallucination behavior on some NVIDIA Blackwell hardware. On the CUDA 13.1 box used here, BF16 was actually the strongest Gemma 4 variant, so this is hardware-specific.

The managed-install subsystem

The simplest "give the user a folder picker" version of this worked for about two weeks. Then it became obvious that:

llama.cpp ships a different archive per backend, OS, and architecture. Vulkan, CUDA 12.4, CUDA 13.1, CPU, HIP, SYCL.
CUDA on Windows needs a companion cudart-llama-bin-*.zip for the NVIDIA runtime DLLs.
New releases land several times a week, tagged bXXXX, with no "latest" alias.

ADR-026 added a full managed-install subsystem. A catalog backed by GitHub Releases with ETag caching. An installer that stages downloads under .staging/{guid}/payload/ and commits with a single Directory.Move. A registry (manifest.json) as the source of truth for what is installed. A tolerant asset parser that turns unknown backend strings into Unknown rather than throwing.

Two details worth calling out.

Atomic rename. Every install assembles under a staging directory and is committed by a single Directory.Move. A crash mid-install leaves no visible state. The user does not end up with a half-installed server. This is the kind of detail no library does for you.
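A minimal sketch of the stage-then-commit pattern, assuming staging lives under the same root as the final install so the move stays a same-volume rename. `StagedInstaller` and the `populate` callback are hypothetical; the real installer also handles downloads, extraction, and the registry update.

```csharp
using System;
using System.IO;

// Hypothetical sketch of stage-then-commit; not the actual Parlotype installer.
public static class StagedInstaller
{
    public static string Install(string root, string installName, Action<string> populate)
    {
        var finalDir = Path.Combine(root, installName);
        if (Directory.Exists(finalDir)) throw new IOException($"Already installed: {finalDir}");

        // Layout from the post: .staging/{guid}/payload/
        var stagingRoot = Path.Combine(root, ".staging", Guid.NewGuid().ToString("N"));
        var payload = Path.Combine(stagingRoot, "payload");
        Directory.CreateDirectory(payload);
        try
        {
            populate(payload);                 // download and extract into staging
            Directory.Move(payload, finalDir); // the single commit step
            return finalDir;
        }
        finally
        {
            // Runs on success and on failure, so no half-built payload stays visible.
            if (Directory.Exists(stagingRoot)) Directory.Delete(stagingRoot, recursive: true);
        }
    }
}
```

If the process is killed before the move, only the `.staging` folder holds leftovers, and a startup sweep of that folder can clear them.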

Shared download primitive. StreamingFileDownloader was extracted from the pre-existing Whisper model downloader and is now used by both. About 150 lines, no abstraction layer, just a shared chunk loop.

The whole subsystem is about 1,800 lines across Core, Platform, and Desktop, plus tests. Worth naming so the cost is visible. "Add a button that downloads a binary" is not what shipped.
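To illustrate the tolerant asset parser mentioned above, here is a minimal sketch. The file-name pattern is an assumption inferred from the `llama-bXXXX-bin-win-vulkan-x64` name quoted earlier, and `LlamaBackend`, `LlamaAsset`, and `LlamaAssetParser` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public enum LlamaBackend { Unknown, Cpu, Vulkan, Cuda, Hip, Sycl }

public sealed record LlamaAsset(string Tag, string Os, LlamaBackend Backend, string BackendRaw, string Arch);

// Hypothetical parser: unrecognized backends become Unknown instead of throwing.
public static class LlamaAssetParser
{
    // Assumed pattern: llama-<tag>-bin-<os>-<backend>-<arch>.zip
    private static readonly Regex Pattern = new(
        @"^llama-(?<tag>b\d+)-bin-(?<os>[a-z]+)-(?<backend>.+)-(?<arch>x64|arm64)\.zip$",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns null for files that are not llama.cpp binary archives at all.
    public static LlamaAsset? TryParse(string fileName)
    {
        var m = Pattern.Match(fileName);
        if (!m.Success) return null;

        var raw = m.Groups["backend"].Value.ToLowerInvariant();
        var backend = raw switch
        {
            "cpu" => LlamaBackend.Cpu,
            "vulkan" => LlamaBackend.Vulkan,
            _ when raw.StartsWith("cuda", StringComparison.Ordinal) => LlamaBackend.Cuda,
            _ when raw.StartsWith("hip", StringComparison.Ordinal) => LlamaBackend.Hip,
            _ when raw.StartsWith("sycl", StringComparison.Ordinal) => LlamaBackend.Sycl,
            _ => LlamaBackend.Unknown, // new or renamed backend: keep it listed
        };
        return new LlamaAsset(m.Groups["tag"].Value, m.Groups["os"].Value, backend, raw, m.Groups["arch"].Value);
    }
}
```

Keeping the raw backend string next to the enum means an `Unknown` entry can still be shown to the user by name.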

Configurable prompts

The Gemma 4 path sends a prompt alongside each audio clip. The text block in the user message tells the model what to do with the audio. Originally this was a hardcoded const. ADR-030 made it a first-class registry. Users create, edit, and duplicate prompts via the Settings UI. Prompts persist to prompts.json. The active prompt is re-read per transcription, no model reload required.

A {language} placeholder is the one small interface seam left for a future feature: source-language detection from keyboard layout. Small interface seams beat retroactive migrations of saved user data.
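A minimal sketch of what rendering such a template could look like. The record shape and the fallback phrase are assumptions for illustration; the real `PromptTemplate.Render` may differ.

```csharp
using System;

// Hypothetical template type; field names and the fallback text are illustrative.
public sealed record PromptTemplateSketch(string Id, string Name, string Text)
{
    private const string LanguageToken = "{language}";

    // Substitutes {language}; falls back to a neutral phrase when no language is known yet.
    public string Render(string? language)
    {
        var value = string.IsNullOrWhiteSpace(language) ? "the spoken language" : language;
        return Text.Replace(LanguageToken, value, StringComparison.Ordinal);
    }
}
```

Usage: `new PromptTemplateSketch("default", "Default", "Transcribe the {language} speech verbatim.").Render("German")` returns `Transcribe the German speech verbatim.` A template without the token renders unchanged, which is why saved prompts need no migration when the detector arrives.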

An example prompt to show what the multimodal-prompt approach actually unlocks.

1. Technical: bug-report formatter

Transcribe the speech verbatim. Then, on a new line, reformat it as a GitHub bug report with sections "Steps to reproduce", "Expected", "Actual", "Environment". If a section cannot be inferred from the speech, write (not specified).

Input (spoken): "I clicked save and the app just died, nothing in the logs, on my Windows machine, 64-bit."

Output: a structured issue

I click save and the app just died. Nothing in the logs. On my windows machine, 64 bit.

**Expected**
(not specified)

**Actual**
The app crashes/dies. There are no errors in the logs.

**Environment**
Windows machine, 64 bit.

**Steps to reproduce**
Click save.

Benchmarks

The main benchmark data is in results/comparison-libri-speech-test-other-2026-05-23-cuda.md in the repo. The numbers below match that file exactly.

Setup

Dataset: LibriSpeech test-other, 50 samples. The harder English split, with more diverse accents and recording conditions than test-clean.
Whisper: CUDA runtime (Whisper.net.Runtime.Cuda, strict via runtimePreference: "Cuda"), beam size 1 (greedy, deterministic).
Gemma 4: llama-server CUDA build b9297-win-cuda-13.1-x64, port 8321, no streaming, the built-in transcription prompt.
VAD: disabled. The dataset is pre-segmented, so full-file transcription is correct here.
Warm-up: one throwaway transcription before the timed loop, per ADR-031.
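For readers who want to reproduce the metric, WER is the word-level edit distance between reference and hypothesis, divided by the number of reference words. A minimal sketch follows. It lowercases and splits on whitespace only, which is an assumption; the benchmark's actual text normalization (punctuation, numerals) is not described here and will change the absolute numbers. `ErrorRate` is a hypothetical name.

```csharp
using System;

// Hypothetical metric helper: WER = (substitutions + deletions + insertions) / reference words.
public static class ErrorRate
{
    public static double WordErrorRate(string reference, string hypothesis)
    {
        var r = Tokenize(reference);
        var h = Tokenize(hypothesis);
        if (r.Length == 0) throw new ArgumentException("Reference is empty.", nameof(reference));
        return (double)EditDistance(r, h) / r.Length;
    }

    // Lowercase, then split on any whitespace.
    private static string[] Tokenize(string s) =>
        s.ToLowerInvariant().Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries);

    // Levenshtein distance over word tokens, keeping only two rows of the table.
    private static int EditDistance(string[] a, string[] b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            (prev, curr) = (curr, prev);
        }
        return prev[b.Length];
    }
}
```

CER is the same computation over characters instead of words, which is why the two rankings can disagree: one wrong letter costs a whole word in WER but a single character in CER.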

Methodology sidebar: the warm-up fix

The first time I ran these numbers, gemma-4-E2B-it-Q8_0 reported a 21.3-second modelLoad. The other Gemma variants reported about 9 seconds. Whisper Small reported 1.2 seconds. None of that matched my hand measurements. Once I added an always-on warm-up pass, the picture changed:

Model	Cold modelLoad	Warm modelLoad	Delta
Whisper `Small`	1192 ms	755 ms	-437 ms
Whisper `LargeV3Turbo`	1567 ms	1511 ms	-56 ms (about the same)
Gemma `E2B-Q8_0`	21300 ms	6741 ms	-14.6 s
Gemma `E2B-BF16`	9256 ms	6703 ms	-2.6 s

The decoder is greedy and deterministic, so WER and CER did not change between cold and warm runs. Only the timing fields became meaningful. If you publish inference timings without an explicit warm-up policy, you are publishing your filesystem cache state.
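A minimal sketch of a timing loop with an explicit warm-up policy. It assumes RTF means processing time divided by audio duration, and `TimedBenchmark` is a hypothetical name; the real harness records more fields than this.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical harness: one discarded warm-up call, then a timed loop.
public static class TimedBenchmark
{
    // Returns the real-time factor: seconds spent transcribing / seconds of audio.
    public static double MeasureRtf(
        IReadOnlyList<float[]> clips, Action<float[]> transcribe, int sampleRate = 16000)
    {
        if (clips.Count == 0) throw new ArgumentException("No clips.", nameof(clips));

        transcribe(clips[0]); // warm-up: its time and result are thrown away

        double audioSeconds = 0;
        var timer = new Stopwatch();
        foreach (var clip in clips)
        {
            timer.Start();
            transcribe(clip);
            timer.Stop(); // only the transcribe call is inside the timed window
            audioSeconds += (double)clip.Length / sampleRate;
        }
        return timer.Elapsed.TotalSeconds / audioSeconds;
    }
}
```

The warm-up clip is transcribed again inside the timed loop, so the sample count stays the same while page-cache and driver-init cost is paid before the timer starts.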

Results

Rank	Engine	Model	WER %	CER %	RTF	Model load (s)
1	Whisper (CUDA)	`LargeV3Turbo`	11.48	4.97	0.055	1.31
2	Whisper (CUDA)	`Medium`	12.18	5.41	0.073	1.28
3	Whisper (CUDA)	`Small`	13.10	5.87	0.034	0.71
4	Gemma 4 (llama.cpp CUDA)	`E2B-it-BF16`	13.15	4.95	0.038	6.70
5	Gemma 4 (llama.cpp CUDA)	`E4B-it-Q4_K_M`	13.82	5.80	0.038	6.73
6	Gemma 4 (llama.cpp CUDA)	`E4B-it-BF16`	14.20	5.40	0.038	6.72
7	Gemma 4 (llama.cpp CUDA)	`E4B-it-Q8_0`	14.39	5.79	0.044	9.25
8	Gemma 4 (llama.cpp CUDA)	`E2B-it-Q8_0`	19.22	8.95	0.315	6.74

Things worth calling out in prose.

Whisper LargeV3Turbo still leads. 11.48% versus Gemma's best 13.15%. The gap is 1.67 points, smaller than I expected before running these numbers.
Whisper Small on CUDA is the fastest in the field. RTF 0.034 beats every Gemma variant (0.038 or higher) and every other Whisper. At 13.10% WER it also essentially ties Gemma E2B-it-BF16 (13.15%) on accuracy. If you only keep one configuration on disk, Whisper Small on CUDA is hard to argue against on this dataset.
Gemma 4 E2B-it-BF16 has the lowest CER of the whole field. 4.95% versus Whisper LargeV3Turbo's 4.97%. The WER ordering does not always agree with the CER ordering, and Gemma's character-level errors at this size are unusually small.
Gemma BF16 and Q4 are faster than mid-tier Whisper. Gemma variants sit at RTF 0.038, faster than Whisper Medium (0.073) and LargeV3Turbo (0.055), but slower than Whisper Small (0.034).
E2B-Q8_0 is broken on this dataset. RTF 0.315 (8x slower than other Gemma variants), WER 19.22%. The crash on the stray <|channel> token is the same issue from the trade-offs section.

Vulkan vs CUDA: a regression I did not expect

Before pivoting Whisper to CUDA, I ran the same three Whisper models on Vulkan. The result is almost invariant, but not quite.

Model	Vulkan WER	CUDA WER	Delta
`Small`	13.10	13.10	0.00 (bit-identical)
`Medium`	12.18	12.18	0.00 (bit-identical)
`LargeV3Turbo`	10.15	11.48	+1.33 pp

Small and Medium produce bit-identical WER across runtimes. The greedy decoder is deterministic and the kernels reproduce. LargeV3Turbo regresses by 1.33 percentage points on CUDA, reproducibly.

The most likely culprit is non-bitwise-identical kernel math between the Vulkan and CUDA backends. Matmul and softmax reduction order, and FP16 accumulation order, are not guaranteed to match across GPU backends. At the scale of LargeV3Turbo's larger matrices, accumulated FP error tips a handful of borderline decoder choices.
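The underlying effect is easy to show on the CPU: floating-point addition is not associative, so the order of a reduction changes the result. A small sketch in single precision follows. It only illustrates the mechanism and is not a reproduction of the Whisper kernels. `ReductionOrderDemo` is a hypothetical name.

```csharp
using System;

// Illustrates that summation order changes a floating-point result.
public static class ReductionOrderDemo
{
    public static (float LeftToRight, float RightToLeft) Sum()
    {
        float a = 1e8f, b = -1e8f, c = 1f;

        // (a + b) + c: the large values cancel first, so c survives.
        float leftToRight = (float)((float)(a + b) + c);

        // a + (b + c): c is absorbed into b, because adjacent floats near 1e8 are 8 apart.
        float rightToLeft = (float)(a + (float)(b + c));

        return (leftToRight, rightToLeft); // (1, 0)
    }
}
```

The same three numbers sum to 1 or 0 depending only on grouping. FP16 has far fewer mantissa bits, so the effect starts at much smaller magnitudes.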

The takeaway is not "CUDA is buggy". It is that GPU backends are not interchangeable when you care about exact transcripts. If LargeV3Turbo is your production target, benchmark on the runtime you will actually ship.

CUDA also delivered what you would expect on the other dimensions. RTF improved 8 to 26% across all three Whisper models. Host RAM dropped 30 to 60% because weights now live in VRAM. The speed and memory wins are real and worth taking.

What this section does not claim

Gemma 4 wins. It does not, on this dataset.
Whisper is obsolete. It is not. LargeV3Turbo still leads by 1.67 WER points.
These numbers generalize. They are 50 samples of read English on a single benchmark machine. The point is to give readers numbers they can replicate, not to declare a winner.

What's next

Three concrete follow-ups, each with one sentence of why.

A LlamaServerHost extraction. Right now LlamaCppSpeechRecognizer owns the llama-server process. The first post-processing consumer will need to share the server. A dedicated host class will manage spawn and terminate so neither workload can tear the server down on the other.

A post-processing pipeline. Same loaded model, second invocation. Whisper text -> llama-server -> cleaned, translated, or structured text -> injector. The configurable prompts feature is the first half of this. The consumer is what is still missing.

Source language detection from keyboard layout. The {language} token in PromptTemplate.Render is already in place. The detector is what comes next.

Try it

Repo: github.com/mdemin729/parlotype
Demo video, 60 seconds, Gemma 4 dictation walkthrough:

ADRs: docs/decisions/024-gemma4-python-sidecar.md through 030-configurable-gemma4-prompts.md.
Benchmark data: results/comparison-libri-speech-test-other-2026-05-23-cuda.md.

Windows only for now. .NET 10, MIT licensed. Pick Gemma 4 in Settings -> Speech Engine. The in-app installer downloads llama-server and the GGUF for you.

If you have shipped llama.cpp's /v1/chat/completions audio path in production, I am curious about cold-start mitigations beyond keeping the server warm. Spinning-disk first-inference times in the 30-second range are the part I have not solved cleanly yet.

Maksim Demin is a .NET engineer building Parlotype, a voice-to-text desktop app. He writes about cross-platform .NET, Avalonia, and local AI.

Shipping Gemma 4 speech recognition in a Windows .NET desktop app: a 5-variant model-selection tour

Maksim Demin — Sun, 24 May 2026 03:51:31 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

Parlotype is a voice-to-text desktop app for Windows. It is built with .NET 10 and Avalonia UI. You hold a global hotkey, speak, then release it. Your text appears in whatever app you were typing into. All speech recognition runs on your machine. No cloud, no audio leaves the machine.

Google released Gemma 4 in April 2026. It has a native multimodal audio path. I added it as an alternative speech engine alongside the existing Whisper.net pipeline. You pick Whisper or Gemma 4 in Settings. The rest of the audio pipeline (WASAPI capture, then Silero VAD, then text injection) stays the same.

The interesting part, and what this post is mostly about, is which Gemma 4 variant to ship. The ggml-org GGUF repo publishes five variants (E2B and E4B, each in BF16, Q4_K_M, and Q8_0, except where the repo skips one). The model card does not tell you which combination of accuracy, speed, and disk footprint you will actually get. So I ran each one on the same dataset, picked a default, and shipped.

Demo

The video shows the engine selector, the model picker with five variants, and a live dictation with Gemma 4.

Code

Source, ADRs, and benchmark configs: github.com/mdemin729/parlotype

Relevant entry points:

src/Parlotype.Platform/Speech/LlamaCppSpeechRecognizer.cs: the recognizer that talks to llama-server.
src/Parlotype.Core/Speech/Gemma4ModelInfo.cs: the 5-variant catalog.
docs/decisions/025-gemma4-llamacpp-desktop.md through 030-configurable-gemma4-prompts.md: the ADR series covering the integration.
results/comparison-libri-speech-test-other-2026-05-23-cuda.md: the benchmark data behind the choices below.

How I Used Gemma 4

Why a separate engine at all

Whisper is great on clean read English. It gets noticeably worse on conversational or noisy audio. Gemma 4 has a conformer audio encoder. Google's own evaluations show it reaching 4.17% WER on LibriSpeech-test-clean, which is competitive with much larger Whisper variants. For a voice-to-text app, the typical user is dictating to themselves into a focused text field. That noise profile is closer to "clean read" than to "AMI meeting", so Gemma 4 is a real alternative. Giving people the choice felt right. Either way, privacy does not depend on which model is loaded.

Why `llama-server` as the runtime

I looked at several inference paths before picking llama-server, the HTTP server from llama.cpp. The constraints were: no cloud, Windows desktop, single end-user installer, cross-vendor GPU support, no Python runtime in the user's install.

onnxruntime-genai does not support Gemma 4's architecture yet (per-layer embeddings, variable head dimensions). Tracking issue: microsoft/onnxruntime-genai#2062. A Python sidecar works, but it pulls Python and CUDA into the user's install. That is a non-starter for non-developer users. LLamaSharp's P/Invoke bindings lock you to one llama.cpp build at compile time, so switching from Vulkan to CUDA means re-compiling. Ollama does not support Gemma audio yet (ollama/ollama#15333). Lemonade is AMD-only.

llama-server with the pre-built Vulkan/CUDA Windows binaries hits all of these. Cross-vendor GPU support from one download. A stable OpenAI-compatible HTTP API at /v1/chat/completions, with input_audio blocks for audio. A release cadence I can manage from in-app updates. ADR-025 has the longer version of this decision.

Picking a variant: the benchmark

The catalog has five variants. That is what ggml-org/gemma-4-E2B-it-GGUF and ggml-org/gemma-4-E4B-it-GGUF actually publish, not what I would ideally pick (see ADR-029):

ModelId	GGUF	Size on disk (with bf16 mmproj)
`gemma-4-E2B-it-Q8_0`	E2B Q8_0	~5.5 GiB
`gemma-4-E2B-it-bf16`	E2B BF16	~9.6 GiB
`gemma-4-E4B-it-Q4_K_M`	E4B Q4_K_M	~5.9 GiB
`gemma-4-E4B-it-Q8_0`	E4B Q8_0	~8.4 GiB
`gemma-4-E4B-it-bf16`	E4B BF16	~15 GiB

E2B has no Q4_K_M. That asset does not exist in the repo. I learned this when manual testing returned a 404. After that, I rebuilt the catalog from the actual file lists on HuggingFace.

I ran each variant against Whisper (Small, Medium, LargeV3Turbo) on 50 samples of LibriSpeech test-other, which is the "harder" English split. Same machine, same warm-up methodology, both engines on CUDA. Whisper used greedy decoding (beam=1) so the runs are reproducible.

Rank	Engine	Model	WER %	CER %	RTF	Model load (s)
1	Whisper (CUDA)	`LargeV3Turbo`	11.48	4.97	0.055	1.31
2	Whisper (CUDA)	`Medium`	12.18	5.41	0.073	1.28
3	Whisper (CUDA)	`Small`	13.10	5.87	0.034	0.71
4	Gemma 4 (llama.cpp)	`E2B-it-BF16`	13.15	4.95	0.038	6.70
5	Gemma 4 (llama.cpp)	`E4B-it-Q4_K_M`	13.82	5.80	0.038	6.73
6	Gemma 4 (llama.cpp)	`E4B-it-BF16`	14.20	5.40	0.038	6.72
7	Gemma 4 (llama.cpp)	`E4B-it-Q8_0`	14.39	5.79	0.044	9.25
8	Gemma 4 (llama.cpp)	`E2B-it-Q8_0`	19.22	8.95	0.315	6.74

Three things from the table:

E2B-it-BF16 has the lowest CER of any model here (4.95%). It barely beats Whisper LargeV3Turbo (4.97%), but it still beats it. WER and CER do not always agree, and at this size class Gemma's character-level errors are unusually small.
E4B-it-Q4_K_M (the shipping default) is at 13.82% WER and 0.038 RTF. That is close to Whisper Small (13.10% WER and 0.034 RTF). The Q4_K_M quant is the right floor for shipping. It gives people Gemma 4 without asking them to download 15 GiB.
E2B-it-Q8_0 is broken on this dataset. RTF 0.315, which is 8x slower than the other Gemma variants. WER 19.22%. The first benchmark attempt crashed llama-server mid-sample because the model emitted a stray <|channel> reasoning token that the chat-template parser could not handle. I keep this variant selectable in the catalog for experimentation, but the user-facing default avoids it.

What I picked, and why

The shipping default is gemma-4-E4B-it-Q4_K_M. About 5.9 GiB on disk, 13.82% WER on this dataset, 0.038 RTF. E2B-BF16 is technically more accurate, but it takes 9.6 GiB. That is not worth it for a tiny WER edge. E4B Q8 and BF16 are there for people who want maximum accuracy and have the disk space. E2B-Q8 stays in the catalog with a "known issue" tag.

The model picker shows all five so people can experiment. But the default is the one I would install on a friend's machine without thinking about it.

Architecture

Gemma 4 sits behind the same ISpeechRecognizer interface as Whisper. A DelegatingSpeechRecognizer (backed by a small SpeechRecognizerFactory) picks one or the other at init time, based on the user's engine setting. The LlamaCppSpeechRecognizer owns a child llama-server.exe process. It posts audio as a base64 WAV blob to /v1/chat/completions:

// Excerpt from LlamaCppSpeechRecognizer.cs
var body = new
{
    messages = new[]
    {
        new
        {
            role = "user",
            content = new object[]
            {
                new { type = "text", text = promptText },
                new { type = "input_audio", input_audio = new { data = base64, format = "wav" } }
            }
        }
    },
    stream = false
};
using var response = await _httpClient.PostAsJsonAsync(
    "/v1/chat/completions", body, cancellationToken);

Same capture, same VAD, different recognizer.

The llama-server binary itself is also managed by the app. ADR-026 covers the catalog/installer/registry subsystem that downloads Vulkan or CUDA builds from llama.cpp's GitHub Releases on demand. Users do not pick paths in a folder browser. They pick a backend in a list and hit Install. That subsystem is about 1,800 lines on its own and probably deserves its own post.

The transcription prompt is also user-editable. ADR-030 turned the hardcoded prompt into a small registry with a built-in default and a {language} placeholder. The placeholder is there for a future feature that picks the source language from the active keyboard layout.
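The base64 WAV blob in the request shown earlier has to be built from the pipeline's 16 kHz mono `float[]`. A minimal sketch of that encoding, assuming 16-bit PCM output and samples in the range -1 to 1. `WavEncoder` is a hypothetical name, and the app's own encoder may differ.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical encoder: mono float samples to an in-memory 16-bit PCM WAV file.
public static class WavEncoder
{
    public static byte[] EncodePcm16(float[] samples, int sampleRate = 16000)
    {
        const short channels = 1, bitsPerSample = 16;
        int dataBytes = samples.Length * 2;

        using var ms = new MemoryStream(44 + dataBytes);
        using var w = new BinaryWriter(ms, Encoding.ASCII, leaveOpen: true);

        // RIFF header (BinaryWriter always writes little-endian, as WAV requires).
        w.Write(Encoding.ASCII.GetBytes("RIFF"));
        w.Write(36 + dataBytes);                             // remaining file size
        w.Write(Encoding.ASCII.GetBytes("WAVE"));

        // fmt chunk
        w.Write(Encoding.ASCII.GetBytes("fmt "));
        w.Write(16);                                         // fmt chunk size
        w.Write((short)1);                                   // format tag 1 = PCM
        w.Write(channels);
        w.Write(sampleRate);
        w.Write(sampleRate * channels * bitsPerSample / 8);  // byte rate
        w.Write((short)(channels * bitsPerSample / 8));      // block align
        w.Write(bitsPerSample);

        // data chunk
        w.Write(Encoding.ASCII.GetBytes("data"));
        w.Write(dataBytes);
        foreach (var sample in samples)
        {
            var clamped = Math.Clamp(sample, -1f, 1f);       // guard against overshoot
            w.Write((short)MathF.Round(clamped * short.MaxValue));
        }

        w.Flush();
        return ms.ToArray();
    }

    // The string that goes into input_audio.data.
    public static string ToBase64Wav(float[] samples) =>
        Convert.ToBase64String(EncodePcm16(samples));
}
```

Encoding in memory avoids a temp file per utterance. Base64 inflates the payload by about a third, which is acceptable over loopback for clips under 30 seconds.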

What this taught me

Three things I learned from doing this:

The model card's headline numbers do not transfer to your stack. Google's reported 4.17% WER on LibriSpeech-clean is real. But the path from "the model can do 4.17%" to "my app does 13.82% on noisy audio with the quantization that fits on user disks" goes through five variant choices, a runtime choice, and the measurement methodology. Benchmark on your own stack.
Most of the work is in the catalog, not in the inference call. The actual /v1/chat/completions HTTP call is about 30 lines of code. The variant catalog, the download manager, the side-by-side install of llama-server backends, the prompt registry. That is where most of the engineering went.
Asymmetric quantization coverage is the rule, not the exception. E2B has no Q4_K_M in the published GGUFs. The catalog has to reflect what is actually on HuggingFace, not what would be theoretically nicest.

Try Parlotype

Repo: github.com/mdemin729/parlotype
Windows only for now. .NET 10, MIT licensed.
Pick Gemma 4 in Settings -> Speech Engine. The in-app installer downloads llama-server and the GGUF for you.


Why I built Parlotype: a privacy-first voice-to-English desktop app on .NET 10

Maksim Demin — Fri, 08 May 2026 00:05:45 +0000

The friction

I've been shipping production code for 20 years across five languages — C, C++, Java, Scala, and now C#. My English is decent enough for daily work, but it's not native.

So whenever I want a sharper adjective in an email, or a phrase that doesn't read as translated, I still reach for Google Translate. Sometimes I dictate into it. Sometimes I type — which is slower. And if I'm on a machine without a Russian keyboard layout, the friction goes up another notch.

Multiple times a day. Across email, MS Teams, PR descriptions, design docs.

I finally got tired of switching context, and built a tool to skip it.

Why not the built-in Windows dictation?

Windows 11 has perfectly fine built-in dictation. But it doesn't translate — and translation is the half that matters for non-native English speakers like me.

The workflow I needed:

Press a global hotkey
Speak in my native language
Get English text inserted directly into whatever app I'm in

No browser tab. No copy-paste. Nothing sent to the cloud.

That's Parlotype.

The stack

The first version is Windows-only, but I picked every piece with cross-platform support in mind from day one:

.NET 10 — runtime
Avalonia UI 12 — cross-platform desktop UI (tray-based)
Whisper.net — on-device speech recognition (OpenAI Whisper bindings for .NET)
Silero VAD — voice activity detection (ONNX-based)
NAudio — Windows audio capture (WASAPI)
CommunityToolkit.Mvvm — MVVM source generators
SharpHook — cross-platform global hotkeys

A few decisions worth highlighting:

Avalonia over MAUI. I needed a real desktop tray app on Windows/Linux/macOS. MAUI's desktop story is still uneven; Avalonia handles tray, hotkeys, and native window chrome cleanly across all three platforms.

Whisper.net over Whisper.cpp directly. Whisper.cpp is the reference implementation, but Whisper.net wraps it with idiomatic C# APIs and managed memory handling — meaningful when integrating with the rest of a .NET app.

Silero VAD over WebRTC VAD. WebRTC's VAD is older and noisier on modern audio. Silero, running through ONNX Runtime, gives much better speech/silence segmentation, which matters for snappy hotkey-triggered dictation.

GPU acceleration: CUDA and Vulkan

There's a second reason this project exists. A year ago I assembled a PC with an NVIDIA RTX 5000-series GPU for one specific purpose: to run local LLMs. It mostly sat idle — until Parlotype gave it a job.

Whisper.net supports CUDA out of the box, which is great for NVIDIA hardware. But "NVIDIA-only" isn't a cross-platform-friendly story — and many developers (including potential users) run on AMD or integrated GPUs.

The current build adds Vulkan as a second acceleration backend. Vulkan runs on NVIDIA, AMD, and Intel GPUs, including AMD integrated graphics, which broadens the hardware story significantly. CUDA is still preferred when available (faster on NVIDIA), but Vulkan covers the rest without falling back to CPU.

I'll publish benchmarks comparing CUDA vs Vulkan vs CPU across model sizes (tiny, base, small, medium, large-v3) in a follow-up post.

Parlotype as an AI-coding-agent testbed

Parlotype also became my real-world lab for AI coding agents — Claude Code, Copilot, OpenCode, and others. After 20 years of writing code by hand, I wanted to see how these tools hold up on a non-trivial .NET codebase. Not toy demos, not greenfield React apps — actual cross-platform desktop work with audio pipelines, native interop, and ONNX runtimes.

I'll write about that workflow in detail later: agent setup, automated project memory in an Obsidian vault, and which kinds of tasks each agent handles well versus poorly.

What's next in this series

Posts I'm planning to write next:

The speech recognition pipeline end-to-end (audio capture → VAD → Whisper → translation → injection)
Benchmarks for Whisper model parameters (size, language, beam size, temperature) on real hardware
CUDA vs Vulkan vs CPU performance across model sizes
My AI coding agent setup and the Obsidian-based project memory

Which one would you want to read first? Drop a comment.

Try it

Repo: github.com/mdemin729/parlotype

Issues, feedback, and PRs all welcome — especially benchmark numbers if you run it on AMD or Intel GPUs.
