This is a submission for the Gemma 4 Challenge: Build with Gemma 4
What I Built
Parlotype is a voice-to-text desktop app for Windows. It is built with .NET 10 and Avalonia UI. You hold a global hotkey, speak, then release it. Your text appears in whatever app you were typing into. All speech recognition runs on your machine. No cloud, no audio leaves the machine.
Google released Gemma 4 in April 2026. It has a native multimodal audio path. I added it as an alternative speech engine alongside the existing Whisper.net pipeline. You pick Whisper or Gemma 4 in Settings. The rest of the audio pipeline (WASAPI capture, then Silero VAD, then text injection) stays the same.
The interesting part, and what this post is mostly about, is which Gemma 4 variant to ship. The ggml-org GGUF repos publish five variants: E2B in Q8_0 and BF16, and E4B in Q4_K_M, Q8_0, and BF16. The model card does not tell you which combination of accuracy, speed, and disk footprint you will actually get. So I ran each one on the same dataset, picked a default, and shipped.
Demo
The video shows the engine selector, the model picker with five variants, and a live dictation with Gemma 4.
Code
Source, ADRs, and benchmark configs: github.com/mdemin729/parlotype
Relevant entry points:
- src/Parlotype.Platform/Speech/LlamaCppSpeechRecognizer.cs: the recognizer that talks to llama-server.
- src/Parlotype.Core/Speech/Gemma4ModelInfo.cs: the 5-variant catalog.
- docs/decisions/025-gemma4-llamacpp-desktop.md through 030-configurable-gemma4-prompts.md: the ADR series covering the integration.
- results/comparison-libri-speech-test-other-2026-05-23-cuda.md: the benchmark data behind the choices below.
How I Used Gemma 4
Why a separate engine at all
Whisper is great on clean read English. It gets noticeably worse on conversational or noisy audio. Gemma 4 has a conformer audio encoder. Google's own evaluations show it reaching 4.17% WER on LibriSpeech-test-clean, which is competitive with much larger Whisper variants. For a voice-to-text app, the typical user is dictating to themselves into a focused text field. That noise profile is closer to "clean read" than to "AMI meeting", so Gemma 4 is a real alternative. Giving people the choice felt right. Either way, privacy does not depend on which model is loaded.
Why llama-server as the runtime
I looked at several inference paths before picking llama-server, the HTTP server from llama.cpp. The constraints were: no cloud, Windows desktop, single end-user installer, cross-vendor GPU support, no Python runtime in the user's install.
onnxruntime-genai does not support Gemma 4's architecture yet (per-layer embeddings, variable head dimensions). Tracking issue: microsoft/onnxruntime-genai#2062. A Python sidecar works, but it pulls Python and CUDA into the user's install. That is a non-starter for non-developer users. LLamaSharp's P/Invoke bindings lock you to one llama.cpp build at compile time, so switching from Vulkan to CUDA means re-compiling. Ollama does not support Gemma audio yet (ollama/ollama#15333). Lemonade is AMD-only.
llama-server with the pre-built Vulkan/CUDA Windows binaries hits all of these. Cross-vendor GPU support from one download. A stable OpenAI-compatible HTTP API at /v1/chat/completions, with input_audio blocks for audio. A release cadence I can manage from in-app updates. ADR-025 has the longer version of this decision.
Picking a variant: the benchmark
The catalog has five variants. That is what ggml-org/gemma-4-E2B-it-GGUF and ggml-org/gemma-4-E4B-it-GGUF actually publish, not what I would ideally pick (see ADR-029):
| ModelId | GGUF | Size on disk (with bf16 mmproj) |
|---|---|---|
| gemma-4-E2B-it-Q8_0 | E2B Q8_0 | ~5.5 GiB |
| gemma-4-E2B-it-bf16 | E2B BF16 | ~9.6 GiB |
| gemma-4-E4B-it-Q4_K_M | E4B Q4_K_M | ~5.9 GiB |
| gemma-4-E4B-it-Q8_0 | E4B Q8_0 | ~8.4 GiB |
| gemma-4-E4B-it-bf16 | E4B BF16 | ~15 GiB |
E2B has no Q4_K_M. That asset does not exist in the repo. I learned this when manual testing returned a 404. After that, I rebuilt the catalog from the actual file lists on HuggingFace.
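To make the "catalog reflects what is actually published" idea concrete, here is a minimal sketch of a catalog lookup that returns nothing for a combination the repos do not ship. The tuple shape and the FindModelId helper are hypothetical and are not taken from Gemma4ModelInfo.cs. The ids and sizes come from the table above.

```csharp
using System;
using System.Linq;

// Hypothetical catalog shape; the real Gemma4ModelInfo.cs may look different.
// Sizes are the approximate on-disk figures from the table above (GGUF plus bf16 mmproj).
var catalog = new (string ModelId, string Family, string Quant, double SizeGiB)[]
{
    ("gemma-4-E2B-it-Q8_0",   "E2B", "Q8_0",   5.5),
    ("gemma-4-E2B-it-bf16",   "E2B", "BF16",   9.6),
    ("gemma-4-E4B-it-Q4_K_M", "E4B", "Q4_K_M", 5.9),
    ("gemma-4-E4B-it-Q8_0",   "E4B", "Q8_0",   8.4),
    ("gemma-4-E4B-it-bf16",   "E4B", "BF16",   15.0),
};

// Returns the model id for a family/quant pair, or null when no such asset is published.
string? FindModelId(string family, string quant) =>
    catalog.Where(v => v.Family == family && v.Quant == quant)
           .Select(v => v.ModelId)
           .FirstOrDefault();

Console.WriteLine(FindModelId("E4B", "Q4_K_M") ?? "(not published)");
Console.WriteLine(FindModelId("E2B", "Q4_K_M") ?? "(not published)");
```

Returning null for E2B Q4_K_M lets the UI hide the option, so the user never finds out about the gap through a failed download.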
I ran each variant against Whisper (Small, Medium, LargeV3Turbo) on 50 samples of LibriSpeech test-other, which is the "harder" English split. Same machine, same warm-up methodology, both engines on CUDA. Whisper used greedy decoding (beam=1) so the runs are reproducible.
| Rank | Engine | Model | WER % | CER % | RTF | Model load (s) |
|---|---|---|---|---|---|---|
| 1 | Whisper (CUDA) | LargeV3Turbo | 11.48 | 4.97 | 0.055 | 1.31 |
| 2 | Whisper (CUDA) | Medium | 12.18 | 5.41 | 0.073 | 1.28 |
| 3 | Whisper (CUDA) | Small | 13.10 | 5.87 | 0.034 | 0.71 |
| 4 | Gemma 4 (llama.cpp) | E2B-it-BF16 | 13.15 | 4.95 | 0.038 | 6.70 |
| 5 | Gemma 4 (llama.cpp) | E4B-it-Q4_K_M | 13.82 | 5.80 | 0.038 | 6.73 |
| 6 | Gemma 4 (llama.cpp) | E4B-it-BF16 | 14.20 | 5.40 | 0.038 | 6.72 |
| 7 | Gemma 4 (llama.cpp) | E4B-it-Q8_0 | 14.39 | 5.79 | 0.044 | 9.25 |
| 8 | Gemma 4 (llama.cpp) | E2B-it-Q8_0 | 19.22 | 8.95 | 0.315 | 6.74 |
Three things from the table:
- E2B-it-BF16 has the lowest CER of any model here (4.95%). It barely beats Whisper LargeV3Turbo (4.97%), but it still beats it. WER and CER do not always agree, and at this size class Gemma's character-level errors are unusually small.
- E4B-it-Q4_K_M (the shipping default) is at 13.82% WER and 0.038 RTF. That is close to Whisper Small (13.10% WER and 0.034 RTF) at about the same on-disk size. The Q4_K_M quant is the right floor for shipping. It gives people Gemma 4 without asking them to download 15 GiB.
- E2B-it-Q8_0 is broken on this dataset. RTF 0.315, which is 8x slower than the other Gemma variants. WER 19.22%. The first benchmark attempt crashed llama-server mid-sample because the model emitted a stray <|channel> reasoning token that the chat-template parser could not handle. I keep this variant selectable in the catalog for experimentation, but the user-facing default avoids it.
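For readers who have not worked with these metrics, here is a minimal sketch of how WER, CER, and RTF are conventionally defined. It is not the benchmark harness from the repo, and all function names are illustrative. It assumes the reference is non-empty and that both strings were already normalized (casing, punctuation), which a real harness would normally do first.

```csharp
using System;

// Levenshtein distance over token sequences: words for WER, characters for CER.
static int EditDistance<T>(T[] reference, T[] hypothesis) where T : IEquatable<T>
{
    var prev = new int[hypothesis.Length + 1];
    var curr = new int[hypothesis.Length + 1];
    for (int j = 0; j <= hypothesis.Length; j++) prev[j] = j;

    for (int i = 1; i <= reference.Length; i++)
    {
        curr[0] = i;
        for (int j = 1; j <= hypothesis.Length; j++)
        {
            int substitution = prev[j - 1] + (reference[i - 1].Equals(hypothesis[j - 1]) ? 0 : 1);
            int deletion = prev[j] + 1;
            int insertion = curr[j - 1] + 1;
            curr[j] = Math.Min(substitution, Math.Min(deletion, insertion));
        }
        (prev, curr) = (curr, prev); // the row just filled becomes the previous row
    }
    return prev[hypothesis.Length];
}

// WER = word-level edits / number of reference words.
static double WordErrorRate(string reference, string hypothesis)
{
    var refWords = reference.Split(' ', StringSplitOptions.RemoveEmptyEntries);
    var hypWords = hypothesis.Split(' ', StringSplitOptions.RemoveEmptyEntries);
    return (double)EditDistance(refWords, hypWords) / refWords.Length;
}

// CER = character-level edits / number of reference characters.
static double CharErrorRate(string reference, string hypothesis) =>
    (double)EditDistance(reference.ToCharArray(), hypothesis.ToCharArray()) / reference.Length;

// RTF = processing time / audio duration; below 1.0 means faster than real time.
static double RealTimeFactor(double processingSeconds, double audioSeconds) =>
    processingSeconds / audioSeconds;

// One substitution (sat -> sit) and one deletion (the): 2 edits over 6 reference words.
Console.WriteLine(WordErrorRate("the cat sat on the mat", "the cat sit on mat"));
Console.WriteLine(CharErrorRate("kitten", "sitting"));
Console.WriteLine(RealTimeFactor(0.38, 10.0));
```

Because WER counts a one-letter slip as a whole wrong word while CER counts it as one wrong character, the two metrics can rank models differently, as they do for E2B-it-BF16 in the table.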
What I picked, and why
The shipping default is gemma-4-E4B-it-Q4_K_M. About 5.9 GiB on disk, 13.82% WER on this dataset, 0.038 RTF. E2B-BF16 is technically more accurate (13.15% WER), but it takes 9.6 GiB. That is not worth it for a 0.67-point WER edge. E4B Q8_0 and BF16 are there for people who have the disk space and want higher-precision weights, although on this 50-sample run neither beat Q4_K_M on WER. E2B-Q8 stays in the catalog with a "known issue" tag.
The model picker shows all five so people can experiment. But the default is the one I would install on a friend's machine without thinking about it.
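One way to frame that choice is as a constrained pick: the lowest-WER Gemma variant that fits a disk budget. The sketch below is only an illustration of that framing. The 6 GiB budget and the PickDefault helper are assumptions, not something the app does at runtime. The numbers are the Gemma rows from the two tables above.

```csharp
using System;
using System.Linq;

// Gemma 4 rows from the tables above: on-disk size (GiB) and WER (%) on the 50-sample run.
var variants = new (string Model, double SizeGiB, double WerPercent)[]
{
    ("E2B-it-Q8_0",   5.5,  19.22),
    ("E2B-it-BF16",   9.6,  13.15),
    ("E4B-it-Q4_K_M", 5.9,  13.82),
    ("E4B-it-Q8_0",   8.4,  14.39),
    ("E4B-it-BF16",   15.0, 14.20),
};

// Lowest-WER variant that fits the budget, or null if nothing fits.
string? PickDefault(double maxSizeGiB) =>
    variants.Where(v => v.SizeGiB <= maxSizeGiB)
            .OrderBy(v => v.WerPercent)
            .Select(v => v.Model)
            .FirstOrDefault();

Console.WriteLine(PickDefault(6.0));  // assumed budget: roughly 6 GiB
Console.WriteLine(PickDefault(10.0)); // a looser budget changes the answer
```

With a budget of about 6 GiB the pick is E4B-it-Q4_K_M. Only when the budget grows to about 10 GiB does E2B-it-BF16 take over.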
Architecture
Gemma 4 sits behind the same ISpeechRecognizer interface as Whisper. A DelegatingSpeechRecognizer (backed by a small SpeechRecognizerFactory) picks one or the other at init time, based on the user's engine setting. The LlamaCppSpeechRecognizer owns a child llama-server.exe process. It posts audio as a base64 WAV blob to /v1/chat/completions:
// Excerpt from LlamaCppSpeechRecognizer.cs
var body = new
{
    messages = new[]
    {
        new
        {
            role = "user",
            content = new object[]
            {
                new { type = "text", text = promptText },
                new { type = "input_audio", input_audio = new { data = base64, format = "wav" } }
            }
        }
    },
    stream = false
};
using var response = await _httpClient.PostAsJsonAsync(
    "/v1/chat/completions", body, cancellationToken);
Same capture, same VAD, different recognizer.
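The request excerpt stops at the POST. As a rough sketch of the other half, this is how the transcript could be read out of an OpenAI-style chat completion body, where the text sits at choices[0].message.content. ExtractTranscript is a hypothetical helper, not code from the repo, and the sample JSON is hand-written rather than captured from a real llama-server run.

```csharp
using System;
using System.Text.Json;

// Pulls choices[0].message.content out of an OpenAI-style chat completion body.
static string ExtractTranscript(string responseJson)
{
    using var doc = JsonDocument.Parse(responseJson);
    var content = doc.RootElement
        .GetProperty("choices")[0]
        .GetProperty("message")
        .GetProperty("content")
        .GetString();
    // Models often pad the text with whitespace; trim before injecting it into the target app.
    return (content ?? string.Empty).Trim();
}

// Hand-written sample body, for illustration only.
var sample = "{\"choices\":[{\"index\":0,\"message\":{\"role\":\"assistant\",\"content\":\" hello world \"}}]}";
Console.WriteLine(ExtractTranscript(sample));
```

GetProperty throws if a field is missing, so a production recognizer would likely wrap this in error handling or use TryGetProperty.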
The llama-server binary itself is also managed by the app. ADR-026 covers the catalog/installer/registry subsystem that downloads Vulkan or CUDA builds from llama.cpp's GitHub Releases on demand. Users do not pick paths in a folder browser. They pick a backend in a list and hit Install. That subsystem is about 1,800 lines on its own and probably deserves its own post.
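Once a backend and a model are on disk, owning the server as a child process could look roughly like the sketch below. BuildServerStartInfo and all file paths are hypothetical. The flags (--model, --mmproj, --host, --port) are standard llama-server options, but the exact set the app passes is not shown in this post.

```csharp
using System;
using System.Diagnostics;
using System.Globalization;

// Describes (but does not start) a llama-server child process bound to localhost.
static ProcessStartInfo BuildServerStartInfo(
    string serverExePath, string modelPath, string mmprojPath, int port)
{
    var startInfo = new ProcessStartInfo(serverExePath)
    {
        UseShellExecute = false, // launch the executable directly, not through the shell
        CreateNoWindow = true,   // no console window popping up next to the desktop app
    };
    // ArgumentList handles quoting of paths that contain spaces.
    startInfo.ArgumentList.Add("--model");
    startInfo.ArgumentList.Add(modelPath);
    startInfo.ArgumentList.Add("--mmproj"); // multimodal projector, needed for audio input
    startInfo.ArgumentList.Add(mmprojPath);
    startInfo.ArgumentList.Add("--host");
    startInfo.ArgumentList.Add("127.0.0.1"); // keep the HTTP API off the network
    startInfo.ArgumentList.Add("--port");
    startInfo.ArgumentList.Add(port.ToString(CultureInfo.InvariantCulture));
    return startInfo;
}

// Hypothetical install locations, for illustration only.
var info = BuildServerStartInfo(
    @"C:\Parlotype\backends\cuda\llama-server.exe",
    @"C:\Parlotype\models\gemma-4-E4B-it-Q4_K_M.gguf",
    @"C:\Parlotype\models\mmproj-bf16.gguf",
    8080);
Console.WriteLine(string.Join(" ", info.ArgumentList));
// To actually launch it: using var server = Process.Start(info);
```

Binding to 127.0.0.1 fits the app's privacy stance: the audio only ever travels over the loopback interface.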
The transcription prompt is also user-editable. ADR-030 turned the hardcoded prompt into a small registry with a built-in default and a {language} placeholder. The placeholder is there for a future feature that picks the source language from the active keyboard layout.
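The placeholder mechanism itself is small. A minimal sketch, assuming plain string substitution: the template text below is invented for illustration and is not the built-in default from the repo, and RenderPrompt is a hypothetical name.

```csharp
using System;

// Invented example template; the real built-in prompt lives in the repo (see ADR-030).
const string ExampleTemplate = "Transcribe the following {language} speech to text.";

// Fills the {language} placeholder; templates without it pass through unchanged.
static string RenderPrompt(string template, string language) =>
    template.Replace("{language}", language, StringComparison.Ordinal);

Console.WriteLine(RenderPrompt(ExampleTemplate, "English"));
```

Because a template without the placeholder is returned as-is, user-edited prompts that hardcode a language keep working.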
What this taught me
Three things I learned from doing this:
- The model card's headline numbers do not transfer to your stack. Google's reported 4.17% WER on LibriSpeech-test-clean is real. But the path from "the model can do 4.17%" to "my app does 13.82% on noisy audio with the quantization that fits on user disks" goes through five variant choices, a runtime choice, and the measurement methodology. Benchmark on your own stack.
- Most of the work is in the catalog, not in the inference call. The actual /v1/chat/completions HTTP call is about 30 lines of code. The variant catalog, the download manager, the side-by-side install of llama-server backends, and the prompt registry are where most of the engineering went.
- Asymmetric quantization coverage is the rule, not the exception. E2B has no Q4_K_M in the published GGUFs. The catalog has to reflect what is actually on HuggingFace, not what would be theoretically nicest.
Try Parlotype
- Repo: github.com/mdemin729/parlotype
- Windows only for now. .NET 10, MIT licensed.
- Pick Gemma 4 in Settings -> Speech Engine. The in-app installer downloads llama-server and the GGUF for you.
Maksim Demin is a .NET engineer building Parlotype, a voice-to-text desktop app. He writes about cross-platform .NET, Avalonia, and local AI.


