<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Maksim Demin</title>
    <description>The latest articles on DEV Community by Maksim Demin (@mdemin729).</description>
    <link>https://dev.to/mdemin729</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3910746%2F58086179-1e90-45c4-8f59-7032f153ab5d.jpg</url>
      <title>DEV Community: Maksim Demin</title>
      <link>https://dev.to/mdemin729</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mdemin729"/>
    <language>en</language>
    <item>
      <title>Adding Gemma 4 speech recognition to a .NET desktop app: the llama-server sidecar that survived</title>
      <dc:creator>Maksim Demin</dc:creator>
      <pubDate>Wed, 27 May 2026 02:36:13 +0000</pubDate>
      <link>https://dev.to/mdemin729/adding-gemma-4-speech-recognition-to-a-net-desktop-app-the-llama-server-sidecar-that-survived-298j</link>
      <guid>https://dev.to/mdemin729/adding-gemma-4-speech-recognition-to-a-net-desktop-app-the-llama-server-sidecar-that-survived-298j</guid>
      <description>&lt;p&gt;In April 2026 Google shipped Gemma 4, a multimodal model with a native audio path. I wanted to add it to Parlotype, my .NET 10 dictation app, as a second speech engine alongside Whisper. Four runtime paths got cut before I landed on llama.cpp's &lt;code&gt;llama-server&lt;/code&gt; as a child process. This post walks through the cuts, the architecture that survived, the variant catalog, and the benchmarks.&lt;/p&gt;

&lt;p&gt;Parlotype is a voice-to-text desktop app for Windows with on-device speech recognition as the default. You hold a global hotkey, speak, release. Text appears in whatever app you were typing into. This post is about adding a second on-device engine. Cloud speech providers are a separate, opt-in track and not the subject here.&lt;/p&gt;

&lt;p&gt;This is the long companion to my &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge submission&lt;/a&gt; on the same topic. The challenge post is the 5-variant tour with the shipping decision. This one is the runtime selection and the architecture under it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The constraints
&lt;/h2&gt;

&lt;p&gt;Worth naming the constraints up front so the obvious answers make sense as dead-ends:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;On-device engine.&lt;/strong&gt; Gemma 4 is being added as another local recognizer alongside Whisper, so inference for this path stays on the user's machine. Cloud providers are a separate, opt-in track and out of scope for this post.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Windows desktop, single end-user installer.&lt;/strong&gt; No "first install Python, then WSL2, then..." Real users will not do that.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-vendor GPU.&lt;/strong&gt; AMD, Intel, and NVIDIA, with CPU fallback. Locking the app to one vendor is not acceptable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The audio pipeline already exists.&lt;/strong&gt; WASAPI capture -&amp;gt; 16 kHz mono float[] -&amp;gt; Silero VAD -&amp;gt; speech segments -&amp;gt; recognizer -&amp;gt; text injection. The new engine has to slot in behind the existing &lt;code&gt;ISpeechRecognizer&lt;/code&gt; interface without redesigning the pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;.NET 10 and Avalonia UI 12&lt;/strong&gt; for the host process.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then the trigger. Google released Gemma 4 (E2B and E4B) with a conformer audio encoder. Their reported WER on LibriSpeech-test-clean is 4.17%, which is competitive with bigger Whisper variants on clean speech. The same checkpoint can also do text post-processing later. The question was never "should we add Gemma 4". It was "how, on Windows, in .NET, as another local engine that preserves the on-device default".&lt;/p&gt;

&lt;h2&gt;
  
  
  Four runtime dead-ends
&lt;/h2&gt;

&lt;p&gt;This is the part of the post that took the most engineering and the part most worth writing down. Each rejection has a specific reason.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dead-end 1: native .NET inference via &lt;code&gt;onnxruntime-genai&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The obvious first stop. ONNX Runtime with the GenAI extension already runs Phi-3 and similar small models from .NET. If Gemma 4 were supported, the app would need nothing more than a new &lt;code&gt;ISpeechRecognizer&lt;/code&gt; implementation. No extra processes, no separate installer.&lt;/p&gt;

&lt;p&gt;It is not supported. Gemma 4's architecture uses per-layer embeddings, variable head dimensions, and KV cache sharing. None of those were understood by &lt;code&gt;onnxruntime-genai&lt;/code&gt; at the time of writing. Tracking issue: &lt;a href="https://github.com/microsoft/onnxruntime-genai/issues/2062" rel="noopener noreferrer"&gt;microsoft/onnxruntime-genai#2062&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Per-layer embeddings, briefly, mean each transformer layer has its own embedding matrix instead of sharing one. Variable head dimensions mean attention heads in different layers can have different sizes. Standard ONNX exporters and runtimes assume neither of these. Until ONNX Runtime ships the underlying support, no .NET-native path exists.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dead-end 2: a Python sidecar with HuggingFace Transformers
&lt;/h3&gt;

&lt;p&gt;The second attempt was a small Python sidecar. Spawn a local FastAPI server, talk HTTP to &lt;code&gt;127.0.0.1&lt;/code&gt;, transcribe via HF Transformers with &lt;code&gt;bitsandbytes&lt;/code&gt; for 4-bit quantization. From .NET: write a temp WAV, POST it, parse JSON, clean up.&lt;/p&gt;

&lt;p&gt;This actually shipped, as a benchmark-only tool (&lt;a href="https://github.com/mdemin729/parlotype/blob/v0.1.0/docs/decisions/024-gemma4-python-sidecar.md" rel="noopener noreferrer"&gt;ADR-024&lt;/a&gt;). It was never wired into the desktop app. Three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It pulls Python and CUDA into the install. That is a non-starter for non-developer users.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;bitsandbytes&lt;/code&gt; has limited Windows support. Users would need WSL2 or Linux to get the 4-bit path that makes Gemma 4 affordable on consumer GPUs.&lt;/li&gt;
&lt;li&gt;The benchmarks were unreliable.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That third point is worth dwelling on. The first Gemma 4 benchmark on LibriSpeech-test-other came back at 96.94% WER. Peak host RAM for the sidecar process was about 79 MB, for a model that should occupy several gigabytes. The number was so bad that the obvious conclusion was not "Gemma 4 is bad". It was "this pipeline is silently broken". Two weeks later, on the same dataset and same machine, the llama.cpp path produced 13.15% WER for the same model.&lt;/p&gt;

&lt;p&gt;The lesson is not "Python is bad". The lesson is that the inference path you ship matters more than the model card claims, and you only learn that by measuring on your own stack.&lt;/p&gt;

&lt;p&gt;The broken benchmark was also what prompted the search that found &lt;code&gt;llama-server&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dead-end 3: LLamaSharp
&lt;/h3&gt;

&lt;p&gt;LLamaSharp is a native .NET P/Invoke layer over llama.cpp. More control, no separate process, no HTTP boundary. On paper this is the best fit for a .NET app.&lt;/p&gt;

&lt;p&gt;The blocker was build-coupling. LLamaSharp links against a specific llama.cpp build at compile time. Switching the user's backend from Vulkan to CUDA means rebuilding the host app. There is no good way to ship "use Vulkan on AMD, use CUDA on NVIDIA" from one binary. Wiring up Gemma 4 audio through LLamaSharp would also have taken significantly more engineering than the chat-completions path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dead-end 4: Ollama and Lemonade
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; would have given the smoothest UX of any option. It also did not support Gemma audio at the time. Tracking issue: &lt;a href="https://github.com/ollama/ollama/issues/15333" rel="noopener noreferrer"&gt;ollama/ollama#15333&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/lemonade-sdk/lemonade" rel="noopener noreferrer"&gt;Lemonade&lt;/a&gt; is strong on Ryzen AI hardware, but it is AMD-specific. Cross-vendor was a hard requirement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why &lt;code&gt;llama-server&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;llama-server&lt;/code&gt; is the HTTP server that ships with llama.cpp. At the decision date (2026-05-09, &lt;a href="https://github.com/mdemin729/parlotype/blob/v0.1.0/docs/decisions/025-gemma4-llamacpp-desktop.md" rel="noopener noreferrer"&gt;ADR-025&lt;/a&gt;), it was the only cross-vendor native Windows runtime with a stable HTTP API that supported Gemma 4 audio.&lt;/p&gt;

&lt;p&gt;The concrete reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It exposes an OpenAI-compatible &lt;code&gt;/v1/chat/completions&lt;/code&gt; endpoint. Audio goes in as an &lt;code&gt;input_audio&lt;/code&gt; content block. The shape is documented and stable.&lt;/li&gt;
&lt;li&gt;Pre-built Vulkan binaries (&lt;code&gt;llama-bXXXX-bin-win-vulkan-x64&lt;/code&gt; from llama.cpp's GitHub Releases) work on AMD, Intel, and NVIDIA GPUs from a single download.&lt;/li&gt;
&lt;li&gt;CUDA, Vulkan, CPU, and other backends each ship as a separate archive. You can install more than one side by side and switch.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/ggml-org/gemma-4-E2B-it-GGUF" rel="noopener noreferrer"&gt;Gemma 4 GGUF&lt;/a&gt; weights and the audio projector (&lt;code&gt;mmproj&lt;/code&gt;) are published by &lt;code&gt;ggml-org&lt;/code&gt; on HuggingFace.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cost is an extra process to manage: cold start, port conflicts, crash handling, and file locks during upgrade. Most of the rest of this post is how that was tamed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Two diagrams. The first shows what is on disk and who downloads what. The second shows where the audio pipeline branches by engine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Top-level integration
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zy3xapmoo4a9rs5yagk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zy3xapmoo4a9rs5yagk.png" alt="Top-level integration" width="800" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram has three layers: the app (the .NET host process), disk (&lt;code&gt;%LOCALAPPDATA%/parlotype&lt;/code&gt; for installed servers, models, and prompts), and external sources (HuggingFace for GGUFs, GitHub Releases for &lt;code&gt;llama-server&lt;/code&gt; builds). The sidecar sits between the app and disk because it spans both: spawned by the app, but its binary and weights live on disk.&lt;/p&gt;

&lt;h3&gt;
  
  
  Audio pipeline: Whisper and Gemma 4, side by side
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bf3edvc7f4lj26f13z1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bf3edvc7f4lj26f13z1.png" alt="Audio pipeline: Whisper and Gemma 4, side by side" width="672" height="1109"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diamond in the middle is the architectural pivot. &lt;code&gt;DelegatingSpeechRecognizer&lt;/code&gt; reads the user's &lt;code&gt;SpeechEngine&lt;/code&gt; setting at init time and forwards every call to either &lt;code&gt;WhisperSpeechRecognizer&lt;/code&gt; or &lt;code&gt;LlamaCppSpeechRecognizer&lt;/code&gt;. The audio pipeline itself does not know which engine is active. Same capture, same VAD, same injector. The right branch crosses a process boundary, which is the cost of the Gemma 4 path.&lt;/p&gt;

&lt;p&gt;Key types worth naming:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SpeechEngine&lt;/code&gt; enum&lt;/strong&gt; in &lt;code&gt;Parlotype.Core&lt;/code&gt; (&lt;code&gt;Whisper&lt;/code&gt; or &lt;code&gt;Gemma4&lt;/code&gt;), persisted via &lt;code&gt;SettingsKeys.SpeechEngine&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DelegatingSpeechRecognizer&lt;/code&gt;&lt;/strong&gt; is registered as the &lt;code&gt;ISpeechRecognizer&lt;/code&gt; singleton. It picks the underlying recognizer at &lt;code&gt;InitializeAsync&lt;/code&gt; time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LlamaCppSpeechRecognizer&lt;/code&gt;&lt;/strong&gt; owns the &lt;code&gt;llama-server.exe&lt;/code&gt; process lifecycle. Spawn, poll &lt;code&gt;/health&lt;/code&gt;, transcribe, terminate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;JsonLlamaServerRegistry&lt;/code&gt;&lt;/strong&gt; tracks managed installs in &lt;code&gt;manifest.json&lt;/code&gt; (covered below).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;IPromptTemplateRegistry&lt;/code&gt;&lt;/strong&gt; looks up the active transcription prompt per call.&lt;/li&gt;
&lt;/ul&gt;
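
&lt;p&gt;A minimal sketch of the delegation described above, not the actual Parlotype code. The synchronous &lt;code&gt;Initialize&lt;/code&gt; and &lt;code&gt;Transcribe&lt;/code&gt; signatures, the constructor, and &lt;code&gt;StubRecognizer&lt;/code&gt; are assumptions made to keep the example short. The real recognizer picks its target in &lt;code&gt;InitializeAsync&lt;/code&gt;.&lt;/p&gt;

```csharp
using System;

// Usage: the pipeline only ever sees ISpeechRecognizer.
ISpeechRecognizer recognizer = new DelegatingSpeechRecognizer(
    SpeechEngine.Gemma4,
    new StubRecognizer("whisper"),
    new StubRecognizer("gemma4"));
recognizer.Initialize();
Console.WriteLine(recognizer.Transcribe(new float[16000])); // prints "gemma4:16000"

public enum SpeechEngine { Whisper, Gemma4 }

// Hypothetical, simplified interface shape for illustration only.
public interface ISpeechRecognizer
{
    void Initialize();
    string Transcribe(float[] samples);
}

// Stand-in for WhisperSpeechRecognizer / LlamaCppSpeechRecognizer.
public sealed class StubRecognizer : ISpeechRecognizer
{
    private readonly string _name;
    public StubRecognizer(string name) { _name = name; }
    public void Initialize() { }
    public string Transcribe(float[] samples) { return _name + ":" + samples.Length; }
}

public sealed class DelegatingSpeechRecognizer : ISpeechRecognizer
{
    private readonly SpeechEngine _engine;
    private readonly ISpeechRecognizer _whisper;
    private readonly ISpeechRecognizer _gemma;
    private ISpeechRecognizer? _active;

    public DelegatingSpeechRecognizer(
        SpeechEngine engine, ISpeechRecognizer whisper, ISpeechRecognizer gemma)
    {
        _engine = engine;
        _whisper = whisper;
        _gemma = gemma;
    }

    // The engine is chosen once, at init time; every later call is forwarded.
    public void Initialize()
    {
        _active = _engine == SpeechEngine.Gemma4 ? _gemma : _whisper;
        _active.Initialize();
    }

    public string Transcribe(float[] samples)
    {
        if (_active is null) throw new InvalidOperationException("Call Initialize first.");
        return _active.Transcribe(samples);
    }
}
```

&lt;p&gt;Because the choice is made once at init time, switching engines in this sketch means constructing and initializing the recognizer again. Capture, VAD, and injection code never change.&lt;/p&gt;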

&lt;h3&gt;
  
  
  The &lt;code&gt;input_audio&lt;/code&gt; content block
&lt;/h3&gt;

&lt;p&gt;Most "use llama.cpp from .NET" tutorials cover text-only chat. The audio path is worth showing in detail. Audio is sent as a base64-encoded WAV blob in an &lt;code&gt;input_audio&lt;/code&gt; content block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Excerpt from LlamaCppSpeechRecognizer.cs&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"gemma-4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;new&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;role&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;promptText&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"input_audio"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_audio&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"wav"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;var&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_httpClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;PostAsJsonAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"/v1/chat/completions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancellationToken&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;stream = false&lt;/code&gt; is deliberate. Simpler error handling, no SSE parser, and transcription is short-burst (under 30 seconds per clip, see trade-offs below). When post-processing lands and outputs longer text, streaming becomes worth the complexity.&lt;/p&gt;
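
&lt;p&gt;The excerpt above does not show how &lt;code&gt;base64&lt;/code&gt; is built. One plausible way to get there from the pipeline's 16 kHz mono &lt;code&gt;float[]&lt;/code&gt; is to write a 16-bit PCM WAV into memory and base64-encode it. The helper name &lt;code&gt;WavBase64.Encode&lt;/code&gt; and the choice of 16-bit PCM are assumptions for this sketch, not necessarily what Parlotype does.&lt;/p&gt;

```csharp
using System;
using System.IO;
using System.Text;

// Usage: one second of silence at 16 kHz.
string base64 = WavBase64.Encode(new float[16000]);
Console.WriteLine(base64.Substring(0, 8));

public static class WavBase64
{
    // Encodes mono float samples in [-1, 1] as a 16-bit PCM WAV, then as base64.
    public static string Encode(float[] samples, int sampleRate = 16000)
    {
        const short channels = 1;
        const short bitsPerSample = 16;
        int dataBytes = samples.Length * 2;

        using var stream = new MemoryStream();
        using var writer = new BinaryWriter(stream, Encoding.ASCII, leaveOpen: true);

        // 44-byte canonical WAV header (BinaryWriter is little-endian, as WAV requires).
        writer.Write(Encoding.ASCII.GetBytes("RIFF"));
        writer.Write(36 + dataBytes);                             // RIFF chunk size
        writer.Write(Encoding.ASCII.GetBytes("WAVE"));
        writer.Write(Encoding.ASCII.GetBytes("fmt "));
        writer.Write(16);                                         // fmt chunk size
        writer.Write((short)1);                                   // format tag: PCM
        writer.Write(channels);
        writer.Write(sampleRate);
        writer.Write(sampleRate * channels * bitsPerSample / 8);  // byte rate
        writer.Write((short)(channels * bitsPerSample / 8));      // block align
        writer.Write(bitsPerSample);
        writer.Write(Encoding.ASCII.GetBytes("data"));
        writer.Write(dataBytes);

        foreach (float sample in samples)
        {
            // Clamp first so out-of-range input cannot overflow the short.
            float clamped = Math.Clamp(sample, -1f, 1f);
            writer.Write((short)Math.Round(clamped * short.MaxValue));
        }

        writer.Flush();
        return Convert.ToBase64String(stream.ToArray());
    }
}
```

&lt;p&gt;Encoding in memory avoids a temp file on disk, which is one difference from the Python sidecar flow described earlier.&lt;/p&gt;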

&lt;h2&gt;
  
  
  Trade-offs
&lt;/h2&gt;

&lt;p&gt;The decisions that bit me, in the order they bit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model size.&lt;/strong&gt; GGUF E4B Q4_K_M is about 5.9 GiB. BF16 variants reach about 15 GiB. The &lt;code&gt;Gemma4ModelInfo&lt;/code&gt; catalog (&lt;a href="https://github.com/mdemin729/parlotype/blob/v0.1.0/docs/decisions/029-gemma4-model-download-ui.md" rel="noopener noreferrer"&gt;ADR-029&lt;/a&gt;) curates five variants and explicitly notes that &lt;code&gt;ggml-org/gemma-4-E2B-it-GGUF&lt;/code&gt; does not publish a Q4_K_M asset. I learned this from a 404 in manual testing, then rebuilt the catalog from the actual file lists.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Noisy audio.&lt;/strong&gt; On LibriSpeech-test-other, Whisper &lt;code&gt;LargeV3Turbo&lt;/code&gt; on CUDA lands at 11.48% WER. The best Gemma 4 variant (&lt;code&gt;E2B-it-BF16&lt;/code&gt;) lands at 13.15% WER. A 1.7-point gap on the harder English split. Google's own evaluations showed Gemma 4 falling further behind on meeting-style noise (AMI is about 41% WER for Gemma versus about 16% for Whisper-large-v3). The honest pitch is that Gemma 4 is competitive on read speech and degrades faster than Whisper as noise and overlap rise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cold start.&lt;/strong&gt; &lt;a href="https://github.com/mdemin729/parlotype/blob/v0.1.0/docs/decisions/025-gemma4-llamacpp-desktop.md" rel="noopener noreferrer"&gt;ADR-025&lt;/a&gt; estimated 3 to 30 seconds for &lt;code&gt;llama-server&lt;/code&gt; cold start. My first benchmark numbers confirmed the high end (21.3 seconds modelLoad for E2B-Q8_0). After I added an always-on warm-up pass (&lt;a href="https://github.com/mdemin729/parlotype/blob/v0.1.0/docs/decisions/031-benchmark-warmup-pass.md" rel="noopener noreferrer"&gt;ADR-031&lt;/a&gt;), the same modelLoad dropped to 6.7 seconds. Most of the original cost was OS page cache and CUDA driver init, not the recognizer. Real &lt;code&gt;InitializeAsync&lt;/code&gt; on a warm host is about 6.7 to 9.3 seconds for Gemma 4 and about 0.7 to 1.5 seconds for Whisper.&lt;/p&gt;
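
&lt;p&gt;With cold start anywhere between a few seconds and half a minute, the host has to wait for the sidecar before sending audio. A hedged sketch of the &lt;code&gt;/health&lt;/code&gt; polling step follows. The class name, the 250 ms interval, and treating any success status as "ready" are assumptions for illustration, not the actual &lt;code&gt;LlamaCppSpeechRecognizer&lt;/code&gt; code.&lt;/p&gt;

```csharp
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class SidecarHealth
{
    // Polls GET {baseUrl}/health until it returns a success status,
    // or throws TimeoutException once the timeout has elapsed.
    public static async Task WaitUntilReadyAsync(
        HttpClient client, string baseUrl, TimeSpan timeout, CancellationToken ct = default)
    {
        var elapsed = Stopwatch.StartNew();
        while (timeout > elapsed.Elapsed)
        {
            try
            {
                using var response = await client.GetAsync(baseUrl + "/health", ct);
                if (response.IsSuccessStatusCode) return; // server reports ready
            }
            catch (HttpRequestException)
            {
                // Port not open yet: the process is still starting up.
            }
            await Task.Delay(TimeSpan.FromMilliseconds(250), ct);
        }
        throw new TimeoutException("llama-server did not become healthy in time.");
    }
}
```

&lt;p&gt;A caller would spawn &lt;code&gt;llama-server.exe&lt;/code&gt;, then await something like &lt;code&gt;SidecarHealth.WaitUntilReadyAsync(client, "http://127.0.0.1:8321", TimeSpan.FromSeconds(60))&lt;/code&gt;, with the timeout sized above the 30-second worst case.&lt;/p&gt;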

&lt;p&gt;&lt;strong&gt;30-second clip limit.&lt;/strong&gt; Gemma 4 audio is bounded at 30 seconds per request. Parlotype's VAD already chunks below this, so it did not bite, but it is a real architectural ceiling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;E2B-Q8_0 is unstable.&lt;/strong&gt; During the benchmark, &lt;code&gt;gemma-4-E2B-it-Q8_0&lt;/code&gt; intermittently emitted stray &lt;code&gt;&amp;lt;|channel&amp;gt;&lt;/code&gt; reasoning tokens that crashed &lt;code&gt;llama-server&lt;/code&gt;'s chat-template parser with HTTP 500. The first 50-sample run failed mid-stream. The second succeeded but with abnormally high RTF (0.315 versus about 0.04 for other Gemma quants) because of verbose thought-text bleed-through. The catalog keeps E2B-Q8_0 selectable for experimentation. The default is E4B Q4_K_M.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BF16 hallucinations on Blackwell GPUs.&lt;/strong&gt; Separate from the E2B-Q8_0 issue, BF16 variants have a documented hallucination behavior on some NVIDIA Blackwell hardware. On the CUDA 13.1 box used here, BF16 was actually the strongest Gemma 4 variant, so this is hardware-specific.&lt;/p&gt;

&lt;h2&gt;
  
  
  The managed-install subsystem
&lt;/h2&gt;

&lt;p&gt;The simplest "give the user a folder picker" version of this worked for about two weeks. Then it became obvious that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;llama.cpp ships a different archive per backend, OS, and architecture. Vulkan, CUDA 12.4, CUDA 13.1, CPU, HIP, SYCL.&lt;/li&gt;
&lt;li&gt;CUDA on Windows needs a companion &lt;code&gt;cudart-llama-bin-*.zip&lt;/code&gt; for the NVIDIA runtime DLLs.&lt;/li&gt;
&lt;li&gt;New releases land several times a week, tagged &lt;code&gt;bXXXX&lt;/code&gt;, with no "latest" alias.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/mdemin729/parlotype/blob/v0.1.0/docs/decisions/026-managed-llama-server-install.md" rel="noopener noreferrer"&gt;ADR-026&lt;/a&gt; added a full managed-install subsystem. A catalog backed by GitHub Releases with ETag caching. An installer that stages downloads under &lt;code&gt;.staging/{guid}/payload/&lt;/code&gt; and commits with a single &lt;code&gt;Directory.Move&lt;/code&gt;. A registry (&lt;code&gt;manifest.json&lt;/code&gt;) as the source of truth for what is installed. A tolerant asset parser that turns unknown backend strings into &lt;code&gt;Unknown&lt;/code&gt; rather than throwing.&lt;/p&gt;
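
&lt;p&gt;To illustrate the tolerant asset parser, here is a small sketch that maps a release asset name to a backend and falls back to &lt;code&gt;Unknown&lt;/code&gt; instead of throwing. The enum, the class name, and the exact substrings for HIP, SYCL, and CPU are assumptions. Only the &lt;code&gt;llama-bXXXX-bin-win-vulkan-x64&lt;/code&gt; naming pattern comes from the text above.&lt;/p&gt;

```csharp
using System;

Console.WriteLine(AssetNameParser.ParseBackend("llama-b9297-bin-win-vulkan-x64.zip"));   // Vulkan
Console.WriteLine(AssetNameParser.ParseBackend("llama-b9297-bin-win-newthing-x64.zip")); // Unknown

public enum LlamaBackend { Unknown, Cpu, Vulkan, Cuda, Hip, Sycl }

public static class AssetNameParser
{
    // Anything unrecognised becomes Unknown, so a new upstream backend
    // shows up as an unsupported row instead of breaking the catalog.
    public static LlamaBackend ParseBackend(string assetName)
    {
        string name = assetName.ToLowerInvariant();

        // Companion archives such as cudart-llama-bin-* are not server builds.
        if (!name.StartsWith("llama-b", StringComparison.Ordinal)) return LlamaBackend.Unknown;

        if (name.Contains("-vulkan-")) return LlamaBackend.Vulkan;
        if (name.Contains("-cuda-")) return LlamaBackend.Cuda;
        if (name.Contains("-hip-")) return LlamaBackend.Hip;
        if (name.Contains("-sycl-")) return LlamaBackend.Sycl;
        if (name.Contains("-cpu-")) return LlamaBackend.Cpu;
        return LlamaBackend.Unknown;
    }
}
```

&lt;p&gt;The point of the design is that a parser failure never blocks the download UI. The cost is that the UI has to decide what to do with &lt;code&gt;Unknown&lt;/code&gt; entries, for example hiding them.&lt;/p&gt;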

&lt;p&gt;Two details worth calling out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Atomic rename.&lt;/strong&gt; Every install assembles under a staging directory and is committed by a single &lt;code&gt;Directory.Move&lt;/code&gt;. A crash mid-install leaves no visible state. The user does not end up with a half-installed server. This is the kind of detail no library does for you.&lt;/p&gt;
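
&lt;p&gt;The staging-then-move pattern can be sketched in a few lines. This is an illustration under assumptions, not the ADR-026 code: &lt;code&gt;StagedInstaller&lt;/code&gt;, &lt;code&gt;AssemblePayload&lt;/code&gt;, and the install folder name are hypothetical. Staging and target share one root here, which keeps &lt;code&gt;Directory.Move&lt;/code&gt; on a single volume.&lt;/p&gt;

```csharp
using System;
using System.IO;

// Usage: "install" a stub binary into a throwaway root.
string root = Path.Combine(Path.GetTempPath(), "parlotype-demo-" + Guid.NewGuid().ToString("N"));
Directory.CreateDirectory(root);
string installed = StagedInstaller.Install(root, "b9297-vulkan",
    dir => File.WriteAllText(Path.Combine(dir, "llama-server.exe"), "stub"));
Console.WriteLine(installed);

// Callback that downloads and extracts into the payload directory.
public delegate void AssemblePayload(string payloadDir);

public static class StagedInstaller
{
    // Assembles under root/.staging/{guid}/payload, then commits with one Directory.Move.
    public static string Install(string root, string installName, AssemblePayload assemble)
    {
        string stagingDir = Path.Combine(root, ".staging", Guid.NewGuid().ToString("N"));
        string payloadDir = Path.Combine(stagingDir, "payload");
        string targetDir = Path.Combine(root, installName);

        Directory.CreateDirectory(payloadDir);
        try
        {
            assemble(payloadDir);
            Directory.Move(payloadDir, targetDir); // the single commit step
            return targetDir;
        }
        finally
        {
            // Runs on success and on failure, so a thrown exception
            // leaves neither a partial install nor a staging folder.
            Directory.Delete(stagingDir, recursive: true);
        }
    }
}
```

&lt;p&gt;A hard process kill skips the &lt;code&gt;finally&lt;/code&gt; block, so a real implementation would probably also sweep stale &lt;code&gt;.staging&lt;/code&gt; folders at startup. Either way, the install directory only ever appears fully populated.&lt;/p&gt;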

&lt;p&gt;&lt;strong&gt;Shared download primitive.&lt;/strong&gt; &lt;code&gt;StreamingFileDownloader&lt;/code&gt; was extracted from the pre-existing Whisper model downloader and is now used by both. About 150 lines, no abstraction layer, just a shared chunk loop.&lt;/p&gt;

&lt;p&gt;The whole subsystem is about 1,800 lines across Core, Platform, and Desktop, plus tests. Worth naming so the cost is visible. "Add a button that downloads a binary" is not what shipped.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configurable prompts
&lt;/h2&gt;

&lt;p&gt;The Gemma 4 path sends a prompt alongside each audio clip. The text block in the user message tells the model what to do with the audio. Originally this was a hardcoded const. &lt;a href="https://github.com/mdemin729/parlotype/blob/v0.1.0/docs/decisions/030-configurable-gemma4-prompts.md" rel="noopener noreferrer"&gt;ADR-030&lt;/a&gt; made it a first-class registry. Users create, edit, and duplicate prompts via the Settings UI. Prompts persist to &lt;code&gt;prompts.json&lt;/code&gt;. The active prompt is re-read for every transcription, so changing it requires no model reload.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;{language}&lt;/code&gt; placeholder is the one small interface seam left for a future feature: source-language detection from keyboard layout. Small interface seams beat retroactive migrations of saved user data.&lt;/p&gt;

&lt;p&gt;An example prompt to show what the multimodal-prompt approach actually unlocks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Technical: bug-report formatter&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Transcribe the speech verbatim. Then, on a new line, reformat it as a GitHub bug report with sections "Steps to reproduce", "Expected", "Actual", "Environment". If a section cannot be inferred from the speech, write &lt;code&gt;(not specified)&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Input (spoken): "I clicked save and the app just died, nothing in the logs, on my Windows machine, 64-bit."&lt;/p&gt;

&lt;p&gt;Output: a structured issue&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I click save and the app just died. Nothing in the logs. On
my windows machine, 64 bit.

** Expected*
(not specified)

** Actual **
The app crashes/dies. There are no errors in the logs.

*Environment*
Windows machine, 64 bit.

** Steps to reproduce*
Click save.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;p&gt;The main benchmark data is in &lt;a href="https://github.com/mdemin729/parlotype/blob/v0.1.0/results/comparison-libri-speech-test-other-2026-05-23-cuda.md" rel="noopener noreferrer"&gt;&lt;code&gt;results/comparison-libri-speech-test-other-2026-05-23-cuda.md&lt;/code&gt;&lt;/a&gt; in the repo. The numbers below match that file exactly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dataset:&lt;/strong&gt; LibriSpeech &lt;code&gt;test-other&lt;/code&gt;, 50 samples. The harder English split, with more diverse accents and recording conditions than &lt;code&gt;test-clean&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whisper:&lt;/strong&gt; CUDA runtime (&lt;code&gt;Whisper.net.Runtime.Cuda&lt;/code&gt;, strict via &lt;code&gt;runtimePreference: "Cuda"&lt;/code&gt;), beam size 1 (greedy, deterministic).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemma 4:&lt;/strong&gt; &lt;code&gt;llama-server&lt;/code&gt; CUDA build &lt;code&gt;b9297-win-cuda-13.1-x64&lt;/code&gt;, port 8321, no streaming, the built-in transcription prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VAD:&lt;/strong&gt; disabled. The dataset is pre-segmented, so full-file transcription is correct here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warm-up:&lt;/strong&gt; one throwaway transcription before the timed loop, per ADR-031.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Methodology sidebar: the warm-up fix
&lt;/h3&gt;

&lt;p&gt;The first time I ran these numbers, &lt;code&gt;gemma-4-E2B-it-Q8_0&lt;/code&gt; reported a 21.3-second modelLoad. The other Gemma variants reported about 9 seconds. Whisper Small reported 1.2 seconds. None of that matched my hand measurements. Once I added an always-on warm-up pass, the picture changed:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Cold modelLoad&lt;/th&gt;
&lt;th&gt;Warm modelLoad&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Whisper &lt;code&gt;Small&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;1192 ms&lt;/td&gt;
&lt;td&gt;755 ms&lt;/td&gt;
&lt;td&gt;-437 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Whisper &lt;code&gt;LargeV3Turbo&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;1567 ms&lt;/td&gt;
&lt;td&gt;1511 ms&lt;/td&gt;
&lt;td&gt;about the same&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma &lt;code&gt;E2B-Q8_0&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;21300 ms&lt;/td&gt;
&lt;td&gt;6741 ms&lt;/td&gt;
&lt;td&gt;-14.6 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma &lt;code&gt;E2B-BF16&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;9256 ms&lt;/td&gt;
&lt;td&gt;6703 ms&lt;/td&gt;
&lt;td&gt;-2.6 s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The decoder is greedy and deterministic, so WER and CER did not change between cold and warm runs. Only the timing fields became meaningful. If you publish inference timings without an explicit warm-up policy, you are publishing your filesystem cache state.&lt;/p&gt;
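
&lt;p&gt;A warm-up policy of this kind is small enough to sketch. The helper below runs one untimed pass and then averages the timed passes. &lt;code&gt;WarmBenchmark&lt;/code&gt; and &lt;code&gt;Workload&lt;/code&gt; are hypothetical names, and the real harness from ADR-031 may be structured differently.&lt;/p&gt;

```csharp
using System;
using System.Diagnostics;

// Usage: count how often the workload actually runs.
int calls = 0;
double meanMs = WarmBenchmark.MeasureMs(() => calls++, timedRuns: 5);
Console.WriteLine($"{calls} calls, {meanMs:F3} ms mean"); // 6 calls: 1 warm-up + 5 timed

public delegate void Workload();

public static class WarmBenchmark
{
    // Runs one throwaway pass, then returns mean wall-clock milliseconds per timed pass.
    public static double MeasureMs(Workload workload, int timedRuns)
    {
        if (!(timedRuns > 0)) throw new ArgumentOutOfRangeException(nameof(timedRuns));

        workload(); // warm-up: pays for page cache and driver init, result discarded

        var stopwatch = Stopwatch.StartNew();
        int remaining = timedRuns;
        while (remaining > 0)
        {
            workload();
            remaining--;
        }
        stopwatch.Stop();
        return stopwatch.Elapsed.TotalMilliseconds / timedRuns;
    }
}
```

&lt;p&gt;Making the warm-up unconditional, rather than a flag, means two runs of the same benchmark cannot silently differ in policy.&lt;/p&gt;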

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;WER %&lt;/th&gt;
&lt;th&gt;CER %&lt;/th&gt;
&lt;th&gt;RTF&lt;/th&gt;
&lt;th&gt;Model load (s)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Whisper (CUDA)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;LargeV3Turbo&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11.48&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4.97&lt;/td&gt;
&lt;td&gt;0.055&lt;/td&gt;
&lt;td&gt;1.31&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Whisper (CUDA)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Medium&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;12.18&lt;/td&gt;
&lt;td&gt;5.41&lt;/td&gt;
&lt;td&gt;0.073&lt;/td&gt;
&lt;td&gt;1.28&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Whisper (CUDA)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Small&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;13.10&lt;/td&gt;
&lt;td&gt;5.87&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.034&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.71&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Gemma 4 (llama.cpp CUDA)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;E2B-it-BF16&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;13.15&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.95&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.038&lt;/td&gt;
&lt;td&gt;6.70&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Gemma 4 (llama.cpp CUDA)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;E4B-it-Q4_K_M&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;13.82&lt;/td&gt;
&lt;td&gt;5.80&lt;/td&gt;
&lt;td&gt;0.038&lt;/td&gt;
&lt;td&gt;6.73&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Gemma 4 (llama.cpp CUDA)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;E4B-it-BF16&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;14.20&lt;/td&gt;
&lt;td&gt;5.40&lt;/td&gt;
&lt;td&gt;0.038&lt;/td&gt;
&lt;td&gt;6.72&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Gemma 4 (llama.cpp CUDA)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;E4B-it-Q8_0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;14.39&lt;/td&gt;
&lt;td&gt;5.79&lt;/td&gt;
&lt;td&gt;0.044&lt;/td&gt;
&lt;td&gt;9.25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Gemma 4 (llama.cpp CUDA)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;E2B-it-Q8_0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;19.22&lt;/td&gt;
&lt;td&gt;8.95&lt;/td&gt;
&lt;td&gt;0.315&lt;/td&gt;
&lt;td&gt;6.74&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4hhho58fdrwlvpubtjz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4hhho58fdrwlvpubtjz.png" alt="WER % by model (lower is better). LibriSpeech test-other, 50 samples, CUDA" width="800" height="558"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdarb6me3daniqgqryq8z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdarb6me3daniqgqryq8z.png" alt="RTF by model (lower is faster). Same models, same dataset" width="800" height="572"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Five things worth calling out in prose:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Whisper &lt;code&gt;LargeV3Turbo&lt;/code&gt; still leads.&lt;/strong&gt; 11.48% WER versus Gemma's best of 13.15%. The gap is 1.67 points, smaller than I expected before running these numbers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whisper &lt;code&gt;Small&lt;/code&gt; on CUDA is the fastest in the field.&lt;/strong&gt; RTF 0.034 beats every Gemma variant (0.038 or higher) and every other Whisper. At 13.10% WER it also essentially ties Gemma &lt;code&gt;E2B-it-BF16&lt;/code&gt; (13.15%) on accuracy. If you only keep one configuration on disk, Whisper Small on CUDA is hard to argue against on this dataset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemma 4 &lt;code&gt;E2B-it-BF16&lt;/code&gt; has the lowest CER of the whole field.&lt;/strong&gt; 4.95% versus Whisper &lt;code&gt;LargeV3Turbo&lt;/code&gt;'s 4.97%. The WER ordering does not always agree with the CER ordering, and Gemma's character-level errors at this size are unusually small.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemma BF16 and Q4 are faster than the larger Whisper models.&lt;/strong&gt; These three Gemma variants sit at RTF 0.038, faster than Whisper &lt;code&gt;Medium&lt;/code&gt; (0.073) and &lt;code&gt;LargeV3Turbo&lt;/code&gt; (0.055), but slower than Whisper &lt;code&gt;Small&lt;/code&gt; (0.034).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;E2B-Q8_0 is broken on this dataset.&lt;/strong&gt; RTF 0.315 (about 8x slower than the other Gemma variants), WER 19.22%. The crash on the stray &lt;code&gt;&amp;lt;|channel&amp;gt;&lt;/code&gt; token is the same issue described in the trade-offs section.&lt;/li&gt;
&lt;/ol&gt;
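
&lt;p&gt;For readers who want to reproduce the columns above, here is a minimal C# sketch of the two headline metrics under their standard definitions: WER as word-level edit distance divided by the number of reference words, and RTF as processing time divided by audio duration. The function names and the simple normalization (lowercasing and splitting on spaces) are assumptions for illustration, not the benchmark harness from the repo.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Linq;

// Word error rate: word-level edit distance divided by the reference length.
// Normalization is only lowercasing and splitting on spaces (an assumption).
static double WordErrorRate(string reference, string hypothesis)
{
    string[] refWords = reference.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);
    string[] hypWords = hypothesis.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);
    if (refWords.Length == 0) throw new ArgumentException("Reference must contain at least one word.");

    // Levenshtein distance over words, keeping only two rows of the table.
    int[] previous = Enumerable.Range(0, hypWords.Length + 1).ToArray();
    int[] current = new int[hypWords.Length + 1];
    int row = 0;
    foreach (string refWord in refWords)
    {
        row++;
        current[0] = row;
        int col = 0;
        foreach (string hypWord in hypWords)
        {
            col++;
            int substitution = previous[col - 1] + (refWord == hypWord ? 0 : 1);
            int deletion = previous[col] + 1;
            int insertion = current[col - 1] + 1;
            current[col] = Math.Min(substitution, Math.Min(deletion, insertion));
        }
        (previous, current) = (current, previous);
    }
    return (double)previous[hypWords.Length] / refWords.Length;
}

// Real-time factor: seconds spent transcribing per second of audio.
static double RealTimeFactor(TimeSpan processing, TimeSpan audio)
{
    return processing.TotalSeconds / audio.TotalSeconds;
}

Console.WriteLine(WordErrorRate("the cat sat on the mat", "the cat sat on a mat"));
Console.WriteLine(RealTimeFactor(TimeSpan.FromSeconds(0.38), TimeSpan.FromSeconds(10)));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The sample pair differs by one substituted word out of six, so the first line reports a WER of about 16.7%. The second line reports an RTF of about 0.038, the same order as the faster Gemma rows in the table.&lt;/p&gt;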

&lt;h3&gt;
  
  
  Vulkan vs CUDA: a regression I did not expect
&lt;/h3&gt;

&lt;p&gt;Before pivoting Whisper to CUDA, I ran the same three Whisper models on Vulkan. WER is almost invariant across the two backends, but not quite.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Vulkan WER&lt;/th&gt;
&lt;th&gt;CUDA WER&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Small&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;13.10&lt;/td&gt;
&lt;td&gt;13.10&lt;/td&gt;
&lt;td&gt;0.00 (bit-identical)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Medium&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;12.18&lt;/td&gt;
&lt;td&gt;12.18&lt;/td&gt;
&lt;td&gt;0.00 (bit-identical)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LargeV3Turbo&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;10.15&lt;/td&gt;
&lt;td&gt;11.48&lt;/td&gt;
&lt;td&gt;+1.33 pp&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Small and Medium produce bit-identical WER across runtimes. The greedy decoder is deterministic and the kernels reproduce. &lt;code&gt;LargeV3Turbo&lt;/code&gt; regresses by 1.33 percentage points on CUDA, reproducibly.&lt;/p&gt;

&lt;p&gt;The most likely culprit is kernel math that is not bitwise identical between the Vulkan and CUDA backends. Matmul and softmax reduction order, and FP16 accumulation order, are not guaranteed to match across GPU backends, and floating-point addition is not associative. At the scale of &lt;code&gt;LargeV3Turbo&lt;/code&gt;'s larger matrices, accumulated rounding error plausibly tips a handful of borderline decoder choices.&lt;/p&gt;
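
&lt;p&gt;The effect is easy to reproduce on a CPU. The sketch below is not Whisper or GPU code: it sums the same three numbers in two orders using &lt;code&gt;double&lt;/code&gt;, then runs a greedy pick against a competing score. The values and the two-token "vocabulary" are made up to exaggerate what happens at much smaller magnitudes in FP16 kernels.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;

// Sum the terms left to right, the way one particular kernel might reduce them.
static double SumInOrder(double[] terms)
{
    double total = 0.0;
    foreach (double term in terms) total += term;
    return total;
}

double[] orderOne = { 1e16, -1e16, 1.0 }; // (1e16 + -1e16) + 1.0
double[] orderTwo = { 1.0, -1e16, 1e16 }; // (1.0 + -1e16) + 1e16, the 1.0 is rounded away

double logitA1 = SumInOrder(orderOne);
double logitA2 = SumInOrder(orderTwo);
double logitB = 0.5; // a competing token with a fixed score (illustrative)

// Greedy decoding takes the larger logit, so the winner depends on summation order.
string pickOne = logitA1 > logitB ? "A" : "B";
string pickTwo = logitA2 > logitB ? "A" : "B";
Console.WriteLine($"order one: sum={logitA1}, greedy pick={pickOne}");
Console.WriteLine($"order two: sum={logitA2}, greedy pick={pickTwo}");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With these inputs the first order sums to 1 and picks token A, while the second sums to 0 and picks token B. Real kernels differ by far smaller amounts, but a near-tie between two tokens is all it takes.&lt;/p&gt;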

&lt;p&gt;The takeaway is not "CUDA is buggy". It is that GPU backends are not interchangeable when you care about exact transcripts. If &lt;code&gt;LargeV3Turbo&lt;/code&gt; is your production target, benchmark on the runtime you will actually ship.&lt;/p&gt;

&lt;p&gt;CUDA also delivered what you would expect on the other dimensions. RTF improved 8 to 26% across all three Whisper models. Host RAM dropped 30 to 60% because weights now live in VRAM. The speed and memory wins are real and worth taking.&lt;/p&gt;

&lt;h3&gt;
  
  
  What this section does not claim
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Gemma 4 wins. It does not, on this dataset.&lt;/li&gt;
&lt;li&gt;Whisper is obsolete. It is not. &lt;code&gt;LargeV3Turbo&lt;/code&gt; still leads by 1.67 WER points.&lt;/li&gt;
&lt;li&gt;These numbers generalize. They do not: they are 50 samples of read English on a single benchmark machine. The point is to give readers numbers they can replicate, not to declare a winner.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;Three concrete follow-ups, each with a short note on why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A &lt;code&gt;LlamaServerHost&lt;/code&gt; extraction.&lt;/strong&gt; Right now &lt;code&gt;LlamaCppSpeechRecognizer&lt;/code&gt; owns the &lt;code&gt;llama-server&lt;/code&gt; process. The first post-processing consumer will need to share the server. A dedicated host class will manage spawn and terminate so neither workload can tear the server down on the other.&lt;/p&gt;
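
&lt;p&gt;One possible shape for that host is reference counting: the first consumer to acquire a lease spawns the server, and only the last one to release it terminates it. Everything in the sketch below is hypothetical, since the post only names &lt;code&gt;LlamaServerHost&lt;/code&gt; as planned work. The &lt;code&gt;Acquire&lt;/code&gt; method, the lease type, and the start/stop callbacks stand in for real process management.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;

// Stand-ins for real spawn/terminate logic (assumptions, not Parlotype code).
void SpawnServer() { Console.WriteLine("spawn llama-server"); }
void TerminateServer() { Console.WriteLine("terminate llama-server"); }

var host = new LlamaServerHost(SpawnServer, TerminateServer);
IDisposable speech = host.Acquire();      // first lease spawns the server
IDisposable postProcess = host.Acquire(); // second lease reuses it
speech.Dispose();                         // one lease left, server stays up
postProcess.Dispose();                    // last lease released, server stops

// Hypothetical host: counts leases and owns the start/stop decisions.
sealed class LlamaServerHost
{
    private readonly object _gate = new object();
    private readonly Action _start;
    private readonly Action _stop;
    private int _leases;

    public LlamaServerHost(Action start, Action stop)
    {
        _start = start;
        _stop = stop;
    }

    public bool IsRunning { get { lock (_gate) { return _leases != 0; } } }

    public IDisposable Acquire()
    {
        lock (_gate)
        {
            if (_leases == 0) _start();
            _leases++;
        }
        return new Lease(this);
    }

    private void Release()
    {
        lock (_gate)
        {
            _leases--;
            if (_leases == 0) _stop();
        }
    }

    private sealed class Lease : IDisposable
    {
        private readonly LlamaServerHost _owner;
        private bool _disposed;

        public Lease(LlamaServerHost owner) { _owner = owner; }

        public void Dispose()
        {
            if (_disposed) return; // a double dispose must not release twice
            _disposed = true;
            _owner.Release();
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Handing out &lt;code&gt;IDisposable&lt;/code&gt; leases means neither workload ever calls stop directly, which is the property the extraction is after.&lt;/p&gt;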

&lt;p&gt;&lt;strong&gt;A post-processing pipeline.&lt;/strong&gt; Same loaded model, second invocation. Whisper text -&amp;gt; llama-server -&amp;gt; cleaned, translated, or structured text -&amp;gt; injector. The configurable prompts feature is the first half of this. The consumer is what is still missing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source language detection from keyboard layout.&lt;/strong&gt; The &lt;code&gt;{language}&lt;/code&gt; token in &lt;code&gt;PromptTemplate.Render&lt;/code&gt; is already in place. The detector is what comes next.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Repo: &lt;a href="https://github.com/mdemin729/parlotype" rel="noopener noreferrer"&gt;github.com/mdemin729/parlotype&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Demo video, 60 seconds, Gemma 4 dictation walkthrough:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/IKjBvYKNKHs"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ADRs: &lt;a href="https://github.com/mdemin729/parlotype/blob/v0.1.0/docs/decisions/024-gemma4-python-sidecar.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/decisions/024-gemma4-python-sidecar.md&lt;/code&gt;&lt;/a&gt; through &lt;a href="https://github.com/mdemin729/parlotype/blob/v0.1.0/docs/decisions/030-configurable-gemma4-prompts.md" rel="noopener noreferrer"&gt;&lt;code&gt;030-configurable-gemma4-prompts.md&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Benchmark data: &lt;a href="https://github.com/mdemin729/parlotype/blob/v0.1.0/results/comparison-libri-speech-test-other-2026-05-23-cuda.md" rel="noopener noreferrer"&gt;&lt;code&gt;results/comparison-libri-speech-test-other-2026-05-23-cuda.md&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Windows only for now. .NET 10, MIT licensed. Pick Gemma 4 in Settings -&amp;gt; Speech Engine. The in-app installer downloads &lt;code&gt;llama-server&lt;/code&gt; and the GGUF for you.&lt;/p&gt;

&lt;p&gt;If you have shipped llama.cpp's &lt;code&gt;/v1/chat/completions&lt;/code&gt; audio path in production, I am curious about cold-start mitigations beyond keeping the server warm. Spinning-disk first-inference times in the 30-second range are the part I have not solved cleanly yet.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Maksim Demin is a .NET engineer building Parlotype, a voice-to-text desktop app. He writes about cross-platform .NET, Avalonia, and local AI.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>dotnet</category>
      <category>gemma</category>
    </item>
    <item>
      <title>Shipping Gemma 4 speech recognition in a Windows .NET desktop app: a 5-variant model-selection tour</title>
      <dc:creator>Maksim Demin</dc:creator>
      <pubDate>Sun, 24 May 2026 03:51:31 +0000</pubDate>
      <link>https://dev.to/mdemin729/shipping-gemma-4-speech-recognition-in-a-windows-net-desktop-app-a-5-variant-model-selection-tour-2l8i</link>
      <guid>https://dev.to/mdemin729/shipping-gemma-4-speech-recognition-in-a-windows-net-desktop-app-a-5-variant-model-selection-tour-2l8i</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Build with Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Parlotype&lt;/strong&gt; is a voice-to-text desktop app for Windows. It is built with .NET 10 and Avalonia UI. You hold a global hotkey, speak, then release it. Your text appears in whatever app you were typing into. All speech recognition runs on your machine. No cloud, no audio leaves the machine.&lt;/p&gt;

&lt;p&gt;Google released Gemma 4 in April 2026. It has a native multimodal audio path. I added it as an alternative speech engine alongside the existing Whisper.net pipeline. You pick Whisper or Gemma 4 in Settings. The rest of the audio pipeline (WASAPI capture, then Silero VAD, then text injection) stays the same.&lt;/p&gt;

&lt;p&gt;The interesting part, and what this post is mostly about, is which Gemma 4 variant to ship. The &lt;code&gt;ggml-org&lt;/code&gt; GGUF repos publish five variants: E2B in BF16 and Q8_0, and E4B in BF16, Q4_K_M, and Q8_0. The model card does not tell you which combination of accuracy, speed, and disk footprint you will actually get. So I ran each one on the same dataset, picked a default, and shipped.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrm7tkfb4aszhtv961b2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrm7tkfb4aszhtv961b2.png" alt="Gemma 4 model picker" width="800" height="648"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/IKjBvYKNKHs"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;The video shows the engine selector, the model picker with five variants, and a live dictation with Gemma 4.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;Source, ADRs, and benchmark configs: &lt;strong&gt;&lt;a href="https://github.com/mdemin729/parlotype" rel="noopener noreferrer"&gt;github.com/mdemin729/parlotype&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Relevant entry points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/mdemin729/parlotype/blob/gemma4-challenge/src/Parlotype.Platform/Speech/LlamaCppSpeechRecognizer.cs" rel="noopener noreferrer"&gt;&lt;code&gt;src/Parlotype.Platform/Speech/LlamaCppSpeechRecognizer.cs&lt;/code&gt;&lt;/a&gt;: the recognizer that talks to &lt;code&gt;llama-server&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/mdemin729/parlotype/blob/gemma4-challenge/src/Parlotype.Core/Speech/Gemma4ModelInfo.cs" rel="noopener noreferrer"&gt;&lt;code&gt;src/Parlotype.Core/Speech/Gemma4ModelInfo.cs&lt;/code&gt;&lt;/a&gt;: the 5-variant catalog.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/mdemin729/parlotype/blob/gemma4-challenge/docs/decisions/025-gemma4-llamacpp-desktop.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/decisions/025-gemma4-llamacpp-desktop.md&lt;/code&gt;&lt;/a&gt; through &lt;a href="https://github.com/mdemin729/parlotype/blob/gemma4-challenge/docs/decisions/030-configurable-gemma4-prompts.md" rel="noopener noreferrer"&gt;&lt;code&gt;030-configurable-gemma4-prompts.md&lt;/code&gt;&lt;/a&gt;: the ADR series covering the integration.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/mdemin729/parlotype/blob/gemma4-challenge/results/comparison-libri-speech-test-other-2026-05-23-cuda.md" rel="noopener noreferrer"&gt;&lt;code&gt;results/comparison-libri-speech-test-other-2026-05-23-cuda.md&lt;/code&gt;&lt;/a&gt;: the benchmark data behind the choices below.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How I Used Gemma 4
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why a separate engine at all
&lt;/h3&gt;

&lt;p&gt;Whisper is great on clean read English. It gets noticeably worse on conversational or noisy audio. Gemma 4 has a conformer audio encoder. Google's own evaluations show it reaching 4.17% WER on LibriSpeech test-clean, which is competitive with much larger Whisper variants. For a voice-to-text app, the typical user is dictating to themselves into a focused text field. That noise profile is closer to "clean read" than to "AMI meeting", so Gemma 4 is a real alternative. Giving people the choice felt right. Either way, privacy does not depend on which model is loaded.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why &lt;code&gt;llama-server&lt;/code&gt; as the runtime
&lt;/h3&gt;

&lt;p&gt;I looked at several inference paths before picking &lt;code&gt;llama-server&lt;/code&gt;, the HTTP server from llama.cpp. The constraints were: no cloud, Windows desktop, single end-user installer, cross-vendor GPU support, no Python runtime in the user's install.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;onnxruntime-genai&lt;/code&gt; does not support Gemma 4's architecture yet (per-layer embeddings, variable head dimensions). Tracking issue: &lt;a href="https://github.com/microsoft/onnxruntime-genai/issues/2062" rel="noopener noreferrer"&gt;microsoft/onnxruntime-genai#2062&lt;/a&gt;. A Python sidecar works, but it pulls Python and CUDA into the user's install. That is a non-starter for non-developer users. LLamaSharp's P/Invoke bindings lock you to one llama.cpp build at compile time, so switching from Vulkan to CUDA means re-compiling. Ollama does not support Gemma audio yet (&lt;a href="https://github.com/ollama/ollama/issues/15333" rel="noopener noreferrer"&gt;ollama/ollama#15333&lt;/a&gt;). Lemonade is AMD-only.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;llama-server&lt;/code&gt; with the pre-built Vulkan/CUDA Windows binaries meets all of these constraints. Cross-vendor GPU support from one download. A stable OpenAI-compatible HTTP API at &lt;code&gt;/v1/chat/completions&lt;/code&gt;, with &lt;code&gt;input_audio&lt;/code&gt; blocks for audio. A release cadence I can manage from in-app updates. &lt;a href="https://github.com/mdemin729/parlotype/blob/gemma4-challenge/docs/decisions/025-gemma4-llamacpp-desktop.md" rel="noopener noreferrer"&gt;ADR-025&lt;/a&gt; has the longer version of this decision.&lt;/p&gt;

&lt;h3&gt;
  
  
  Picking a variant: the benchmark
&lt;/h3&gt;

&lt;p&gt;The catalog has five variants. That is what &lt;code&gt;ggml-org/gemma-4-E2B-it-GGUF&lt;/code&gt; and &lt;code&gt;ggml-org/gemma-4-E4B-it-GGUF&lt;/code&gt; actually publish, not what I would ideally pick (see &lt;a href="https://github.com/mdemin729/parlotype/blob/gemma4-challenge/docs/decisions/029-gemma4-model-download-ui.md" rel="noopener noreferrer"&gt;ADR-029&lt;/a&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ModelId&lt;/th&gt;
&lt;th&gt;GGUF&lt;/th&gt;
&lt;th&gt;Size on disk (with bf16 mmproj)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemma-4-E2B-it-Q8_0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;E2B Q8_0&lt;/td&gt;
&lt;td&gt;~5.5 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemma-4-E2B-it-bf16&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;E2B BF16&lt;/td&gt;
&lt;td&gt;~9.6 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemma-4-E4B-it-Q4_K_M&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;E4B Q4_K_M&lt;/td&gt;
&lt;td&gt;~5.9 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemma-4-E4B-it-Q8_0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;E4B Q8_0&lt;/td&gt;
&lt;td&gt;~8.4 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemma-4-E4B-it-bf16&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;E4B BF16&lt;/td&gt;
&lt;td&gt;~15 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;E2B has no Q4_K_M. That asset does not exist in the repo. I learned this when manual testing returned a 404. After that, I rebuilt the catalog from the actual file lists on HuggingFace.&lt;/p&gt;

&lt;p&gt;I ran each variant against Whisper (Small, Medium, LargeV3Turbo) on 50 samples of LibriSpeech &lt;code&gt;test-other&lt;/code&gt;, which is the "harder" English split. Same machine, same warm-up methodology, both engines on CUDA. Whisper used greedy decoding (beam=1) so the runs are reproducible.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;WER %&lt;/th&gt;
&lt;th&gt;CER %&lt;/th&gt;
&lt;th&gt;RTF&lt;/th&gt;
&lt;th&gt;Model load (s)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Whisper (CUDA)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;LargeV3Turbo&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11.48&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4.97&lt;/td&gt;
&lt;td&gt;0.055&lt;/td&gt;
&lt;td&gt;1.31&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Whisper (CUDA)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Medium&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;12.18&lt;/td&gt;
&lt;td&gt;5.41&lt;/td&gt;
&lt;td&gt;0.073&lt;/td&gt;
&lt;td&gt;1.28&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Whisper (CUDA)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Small&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;13.10&lt;/td&gt;
&lt;td&gt;5.87&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.034&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.71&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Gemma 4 (llama.cpp)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;E2B-it-BF16&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;13.15&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.95&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.038&lt;/td&gt;
&lt;td&gt;6.70&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Gemma 4 (llama.cpp)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;E4B-it-Q4_K_M&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;13.82&lt;/td&gt;
&lt;td&gt;5.80&lt;/td&gt;
&lt;td&gt;0.038&lt;/td&gt;
&lt;td&gt;6.73&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Gemma 4 (llama.cpp)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;E4B-it-BF16&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;14.20&lt;/td&gt;
&lt;td&gt;5.40&lt;/td&gt;
&lt;td&gt;0.038&lt;/td&gt;
&lt;td&gt;6.72&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Gemma 4 (llama.cpp)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;E4B-it-Q8_0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;14.39&lt;/td&gt;
&lt;td&gt;5.79&lt;/td&gt;
&lt;td&gt;0.044&lt;/td&gt;
&lt;td&gt;9.25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Gemma 4 (llama.cpp)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;E2B-it-Q8_0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;19.22&lt;/td&gt;
&lt;td&gt;8.95&lt;/td&gt;
&lt;td&gt;0.315&lt;/td&gt;
&lt;td&gt;6.74&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focb4u9a7sxve6dizuzve.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focb4u9a7sxve6dizuzve.png" alt="WER % by model" width="799" height="551"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three things from the table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;E2B-it-BF16&lt;/code&gt; has the lowest CER of any model here&lt;/strong&gt; (4.95%). It barely beats Whisper &lt;code&gt;LargeV3Turbo&lt;/code&gt; (4.97%), but it still beats it. WER and CER do not always agree, and at this size class Gemma's character-level errors are unusually small.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;E4B-it-Q4_K_M&lt;/code&gt; (the shipping default) is at 13.82% WER and 0.038 RTF.&lt;/strong&gt; That is close to Whisper &lt;code&gt;Small&lt;/code&gt; (13.10% WER and 0.034 RTF) on both accuracy and speed. The Q4_K_M quant is the right floor for shipping. It gives people Gemma 4 without asking them to download 15 GiB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;E2B-it-Q8_0&lt;/code&gt; is broken on this dataset.&lt;/strong&gt; RTF 0.315, which is 8x slower than the other Gemma variants. WER 19.22%. The first benchmark attempt crashed &lt;code&gt;llama-server&lt;/code&gt; mid-sample because the model emitted a stray &lt;code&gt;&amp;lt;|channel&amp;gt;&lt;/code&gt; reasoning token that the chat-template parser could not handle. I keep this variant selectable in the catalog for experimentation, but the user-facing default avoids it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What I picked, and why
&lt;/h3&gt;

&lt;p&gt;The shipping default is &lt;strong&gt;&lt;code&gt;gemma-4-E4B-it-Q4_K_M&lt;/code&gt;&lt;/strong&gt;. About 5.9 GiB on disk, 13.82% WER on this dataset, 0.038 RTF. E2B-BF16 is technically more accurate, but it takes 9.6 GiB. That is not worth it for a 0.67-point WER edge. E4B Q8 and BF16 are there for people who have the disk space and want higher-precision weights, although neither beat Q4_K_M on WER in this benchmark. E2B-Q8 stays in the catalog with a "known issue" tag.&lt;/p&gt;

&lt;p&gt;The model picker shows all five so people can experiment. But the default is the one I would install on a friend's machine without thinking about it.&lt;/p&gt;
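
&lt;p&gt;The reasoning above can be written down as a small selection rule: drop variants with a known issue, drop anything over a disk budget, then take the lowest WER. The sizes and WER figures below are copied from the two tables. The 6 GiB budget and the selection code itself are assumptions for illustration, not how the app chooses its default.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;

// Figures copied from the catalog and benchmark tables in this post.
var variants = new (string Id, double SizeGiB, double WerPercent, bool KnownIssue)[]
{
    ("gemma-4-E2B-it-Q8_0", 5.5, 19.22, true),
    ("gemma-4-E2B-it-bf16", 9.6, 13.15, false),
    ("gemma-4-E4B-it-Q4_K_M", 5.9, 13.82, false),
    ("gemma-4-E4B-it-Q8_0", 8.4, 14.39, false),
    ("gemma-4-E4B-it-bf16", 15.0, 14.20, false),
};

const double diskBudgetGiB = 6.0; // hypothetical cut-off, not a number from the post

string chosen = "";
double bestWer = double.MaxValue;
foreach (var variant in variants)
{
    // Skip anything flagged as broken or too large for the budget.
    if (variant.KnownIssue || variant.SizeGiB > diskBudgetGiB) continue;
    if (bestWer > variant.WerPercent)
    {
        bestWer = variant.WerPercent;
        chosen = variant.Id;
    }
}
Console.WriteLine($"default: {chosen} at {bestWer}% WER");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a 6 GiB budget this lands on &lt;code&gt;gemma-4-E4B-it-Q4_K_M&lt;/code&gt;. Raising the budget to 10 GiB would flip the choice to E2B BF16, which is exactly the trade-off described above.&lt;/p&gt;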

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Gemma 4 sits behind the same &lt;code&gt;ISpeechRecognizer&lt;/code&gt; interface as Whisper. A &lt;code&gt;DelegatingSpeechRecognizer&lt;/code&gt; (backed by a small &lt;code&gt;SpeechRecognizerFactory&lt;/code&gt;) picks one or the other at init time, based on the user's engine setting. The &lt;code&gt;LlamaCppSpeechRecognizer&lt;/code&gt; owns a child &lt;code&gt;llama-server.exe&lt;/code&gt; process. It posts audio as a base64 WAV blob to &lt;code&gt;/v1/chat/completions&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Excerpt from LlamaCppSpeechRecognizer.cs&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;new&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;role&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;promptText&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"input_audio"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_audio&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"wav"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;var&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_httpClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;PostAsJsonAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"/v1/chat/completions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancellationToken&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same capture, same VAD, different recognizer:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F204yp52cmb5ojghrainp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F204yp52cmb5ojghrainp.png" alt="Architecture diagram: same capture, same VAD, different recognizer" width="672" height="1109"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;llama-server&lt;/code&gt; binary itself is also managed by the app. &lt;a href="https://github.com/mdemin729/parlotype/blob/gemma4-challenge/docs/decisions/026-managed-llama-server-install.md" rel="noopener noreferrer"&gt;ADR-026&lt;/a&gt; covers the catalog/installer/registry subsystem that downloads Vulkan or CUDA builds from llama.cpp's GitHub Releases on demand. Users do not pick paths in a folder browser. They pick a backend in a list and hit Install. That subsystem is about 1,800 lines on its own and probably deserves its own post.&lt;/p&gt;
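
&lt;p&gt;Once a build is installed, spawning it as a child process is the small part. The sketch below shows one way to describe that process in .NET. It is an assumption about the shape, not the code from the repo: the helper name, the paths, and the port are placeholders, while &lt;code&gt;-m&lt;/code&gt;, &lt;code&gt;--mmproj&lt;/code&gt;, &lt;code&gt;--host&lt;/code&gt;, and &lt;code&gt;--port&lt;/code&gt; are existing &lt;code&gt;llama-server&lt;/code&gt; flags.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Diagnostics;
using System.Globalization;

// Builds (but does not start) the child-process description for llama-server.
static ProcessStartInfo BuildLlamaServerStartInfo(string exePath, string modelPath, string mmprojPath, int port)
{
    var info = new ProcessStartInfo(exePath)
    {
        UseShellExecute = false, // start the executable directly, no shell
        CreateNoWindow = true,   // no console window behind the desktop app
    };
    // ArgumentList handles quoting, so paths with spaces stay intact.
    info.ArgumentList.Add("-m");
    info.ArgumentList.Add(modelPath);
    info.ArgumentList.Add("--mmproj");
    info.ArgumentList.Add(mmprojPath);
    info.ArgumentList.Add("--host");
    info.ArgumentList.Add("127.0.0.1"); // loopback only
    info.ArgumentList.Add("--port");
    info.ArgumentList.Add(port.ToString(CultureInfo.InvariantCulture));
    return info;
}

// Placeholder paths and port; real values would come from the installer and settings.
ProcessStartInfo startInfo = BuildLlamaServerStartInfo(
    @"C:\Parlotype\llama-server.exe",
    @"C:\Parlotype\models\model.gguf",
    @"C:\Parlotype\models\mmproj.gguf",
    8080);
Console.WriteLine(startInfo.FileName + " " + string.Join(" ", startInfo.ArgumentList));
// Process.Start(startInfo) would launch the child. It is left out so the sketch runs anywhere.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Binding to the loopback address keeps the HTTP API off the network, which fits the on-device constraint.&lt;/p&gt;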

&lt;p&gt;The transcription prompt is also user-editable. &lt;a href="https://github.com/mdemin729/parlotype/blob/gemma4-challenge/docs/decisions/030-configurable-gemma4-prompts.md" rel="noopener noreferrer"&gt;ADR-030&lt;/a&gt; turned the hardcoded prompt into a small registry with a built-in default and a &lt;code&gt;{language}&lt;/code&gt; placeholder. The placeholder is there for a future feature that picks the source language from the active keyboard layout.&lt;/p&gt;
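
&lt;p&gt;As a rough illustration of how a registry entry with a &lt;code&gt;{language}&lt;/code&gt; placeholder might be rendered, here is a minimal sketch. The function name, the fallback phrase, and the prompt text are all made up for the example. The repo's actual rendering code may look different.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;

// Hypothetical renderer: fills the {language} placeholder and falls back to a
// neutral phrase when no source language is known yet.
static string RenderPrompt(string promptTemplate, string language)
{
    string value = string.IsNullOrWhiteSpace(language) ? "the spoken language" : language;
    return promptTemplate.Replace("{language}", value, StringComparison.Ordinal);
}

string defaultPrompt = "Transcribe this audio in {language}."; // made-up prompt text
Console.WriteLine(RenderPrompt(defaultPrompt, "Russian"));
Console.WriteLine(RenderPrompt(defaultPrompt, ""));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping a fallback means the same template works today, before any keyboard-layout detection exists.&lt;/p&gt;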

&lt;h2&gt;
  
  
  What this taught me
&lt;/h2&gt;

&lt;p&gt;Three things I learned from doing this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The model card's headline numbers do not transfer to your stack.&lt;/strong&gt; Google's reported 4.17% WER on LibriSpeech test-clean is real. But the path from "the model can do 4.17%" to "my app does 13.82% on noisy audio with the quantization that fits on user disks" goes through five variant choices, a runtime choice, and the measurement methodology. Benchmark on your own stack.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Most of the work is in the catalog, not in the inference call.&lt;/strong&gt; The actual &lt;code&gt;/v1/chat/completions&lt;/code&gt; HTTP call is about 30 lines of code. The variant catalog, the download manager, the side-by-side install of llama-server backends, the prompt registry. That is where most of the engineering went.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Asymmetric quantization coverage is the rule, not the exception.&lt;/strong&gt; E2B has no Q4_K_M in the published GGUFs. The catalog has to reflect what is actually on HuggingFace, not what would be theoretically nicest.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Try Parlotype
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Repo: &lt;a href="https://github.com/mdemin729/parlotype" rel="noopener noreferrer"&gt;github.com/mdemin729/parlotype&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Windows only for now. .NET 10, MIT licensed.&lt;/li&gt;
&lt;li&gt;Pick Gemma 4 in Settings -&amp;gt; Speech Engine. The in-app installer downloads &lt;code&gt;llama-server&lt;/code&gt; and the GGUF for you.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Maksim Demin is a .NET engineer building Parlotype, a voice-to-text desktop app. He writes about cross-platform .NET, Avalonia, and local AI.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
      <category>dotnet</category>
    </item>
    <item>
      <title>Why I built Parlotype: a privacy-first voice-to-English desktop app on .NET 10</title>
      <dc:creator>Maksim Demin</dc:creator>
      <pubDate>Fri, 08 May 2026 00:05:45 +0000</pubDate>
      <link>https://dev.to/mdemin729/why-i-built-parlotype-a-privacy-first-voice-to-english-desktop-app-on-net-10-5gc5</link>
      <guid>https://dev.to/mdemin729/why-i-built-parlotype-a-privacy-first-voice-to-english-desktop-app-on-net-10-5gc5</guid>
      <description>&lt;h2&gt;
  
  
  The friction
&lt;/h2&gt;

&lt;p&gt;I've been shipping production code for 20 years across five languages — C, C++, Java, Scala, and now C#. My English is decent enough for daily work, but it's not native.&lt;/p&gt;

&lt;p&gt;So whenever I want a sharper adjective in an email, or a phrase that doesn't read as translated, I still reach for Google Translate. Sometimes I dictate into it. Sometimes I type — which is slower. And if I'm on a machine without a Russian keyboard layout, the friction goes up another notch.&lt;/p&gt;

&lt;p&gt;Multiple times a day. Across email, MS Teams, PR descriptions, design docs.&lt;/p&gt;

&lt;p&gt;I finally got tired of switching context, and built a tool to skip it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not the built-in Windows dictation?
&lt;/h2&gt;

&lt;p&gt;Windows 11 has perfectly fine built-in dictation. But it doesn't translate — and translation is the half that matters for non-native English speakers like me.&lt;/p&gt;

&lt;p&gt;The workflow I needed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Press a global hotkey&lt;/li&gt;
&lt;li&gt;Speak in my native language&lt;/li&gt;
&lt;li&gt;Get English text inserted directly into whatever app I'm in&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No browser tab. No copy-paste. Nothing sent to the cloud.&lt;/p&gt;

&lt;p&gt;That's Parlotype.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/gMPKQqMKp8c"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  The stack
&lt;/h2&gt;

&lt;p&gt;The first version is Windows-only, but I picked every piece with cross-platform support in mind from day one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;.NET 10&lt;/strong&gt; — runtime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avalonia UI 12&lt;/strong&gt; — cross-platform desktop UI (tray-based)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whisper.net&lt;/strong&gt; — on-device speech recognition (OpenAI Whisper bindings for .NET)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silero VAD&lt;/strong&gt; — voice activity detection (ONNX-based)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NAudio&lt;/strong&gt; — Windows audio capture (WASAPI)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CommunityToolkit.Mvvm&lt;/strong&gt; — MVVM source generators&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SharpHook&lt;/strong&gt; — cross-platform global hotkeys&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A few decisions worth highlighting:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avalonia over MAUI.&lt;/strong&gt; I needed a real desktop tray app on Windows/Linux/macOS. MAUI's desktop story is still uneven; Avalonia handles tray, hotkeys, and native window chrome cleanly across all three platforms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Whisper.net over whisper.cpp directly.&lt;/strong&gt; whisper.cpp is the C/C++ port of OpenAI's Whisper, and Whisper.net wraps it with idiomatic C# APIs and managed memory handling, which is meaningful when integrating with the rest of a .NET app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Silero VAD over WebRTC VAD.&lt;/strong&gt; WebRTC's VAD is an older, GMM-based detector that misclassifies background noise as speech more often. Silero, a small neural model running through ONNX Runtime, gives much better speech/silence segmentation, which matters for snappy hotkey-triggered dictation.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPU acceleration: CUDA &lt;em&gt;and&lt;/em&gt; Vulkan
&lt;/h2&gt;

&lt;p&gt;There's a second reason this project exists. A year ago I assembled a PC with an NVIDIA RTX 5000-series GPU for one specific purpose: to run local LLMs. It mostly sat idle — until Parlotype gave it a job.&lt;/p&gt;

&lt;p&gt;Whisper.net supports CUDA out of the box, which is great for NVIDIA hardware. But "NVIDIA-only" isn't a cross-platform-friendly story — and many developers (including potential users) run on AMD or integrated GPUs.&lt;/p&gt;

&lt;p&gt;The current build adds &lt;strong&gt;Vulkan&lt;/strong&gt; as a second acceleration backend. Vulkan runs on NVIDIA, AMD, and Intel GPUs, including AMD integrated graphics, which broadens the hardware story significantly. CUDA is still preferred when available (faster on NVIDIA), but Vulkan covers the rest without falling back to CPU.&lt;/p&gt;
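
&lt;p&gt;As a rough sketch of that preference order (not Parlotype's actual code, and not a Whisper.net API): assuming the app has already probed whether CUDA and Vulkan are usable, the choice reduces to a short fallback chain. The names &lt;code&gt;Backend&lt;/code&gt; and &lt;code&gt;BackendPicker&lt;/code&gt; are hypothetical.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;using System;

// Hypothetical types for illustration; not Parlotype's or Whisper.net's API.
enum Backend { Cuda, Vulkan, Cpu }

static class BackendPicker
{
    // Assumes the caller has already probed which backends are usable.
    // Order: CUDA first, then Vulkan, then CPU as the last resort.
    public static Backend Pick(bool cudaAvailable, bool vulkanAvailable)
    {
        if (cudaAvailable) return Backend.Cuda;
        if (vulkanAvailable) return Backend.Vulkan;
        return Backend.Cpu;
    }
}

static class Program
{
    static void Main()
    {
        // An AMD or Intel machine: no CUDA, but Vulkan is present.
        Console.WriteLine(BackendPicker.Pick(false, true)); // prints "Vulkan"
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In practice the two flags would come from whatever probing happens when the native libraries are loaded; that part is deliberately left out here.&lt;/p&gt;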

&lt;p&gt;I'll publish benchmarks comparing CUDA vs Vulkan vs CPU across model sizes (&lt;code&gt;tiny&lt;/code&gt;, &lt;code&gt;base&lt;/code&gt;, &lt;code&gt;small&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt;, &lt;code&gt;large-v3&lt;/code&gt;) in a follow-up post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parlotype as an AI-coding-agent testbed
&lt;/h2&gt;

&lt;p&gt;Parlotype also became my real-world lab for AI coding agents: Claude Code, Copilot, OpenCode, and others. After 20 years of writing code by hand, I wanted to see how these tools hold up on a non-trivial .NET codebase. Not toy demos, not greenfield React apps, but actual cross-platform desktop work with audio pipelines, native interop, and ONNX Runtime inference.&lt;/p&gt;

&lt;p&gt;I'll write about that workflow in detail later: agent setup, automated project memory in an Obsidian vault, and which kinds of tasks each agent handles well versus poorly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next in this series
&lt;/h2&gt;

&lt;p&gt;Posts I'm planning to write next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The speech recognition pipeline end-to-end (audio capture → VAD → Whisper → translation → injection)&lt;/li&gt;
&lt;li&gt;Benchmarks for Whisper model parameters (size, language, beam size, temperature) on real hardware&lt;/li&gt;
&lt;li&gt;CUDA vs Vulkan vs CPU performance across model sizes&lt;/li&gt;
&lt;li&gt;My AI coding agent setup and the Obsidian-based project memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Which one would you want to read first? Drop a comment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;Repo: &lt;strong&gt;&lt;a href="https://github.com/mdemin729/parlotype" rel="noopener noreferrer"&gt;github.com/mdemin729/parlotype&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Issues, feedback, and PRs all welcome — especially benchmark numbers if you run it on AMD or Intel GPUs.&lt;/p&gt;

</description>
      <category>dotnet</category>
      <category>csharp</category>
      <category>showdev</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
