The friction
I've been shipping production code for 20 years across five languages — C, C++, Java, Scala, and now C#. My English is decent enough for daily work, but it's not native.
So whenever I want a sharper adjective in an email, or a phrase that doesn't read as translated, I still reach for Google Translate. Sometimes I dictate into it. Sometimes I type — which is slower. And if I'm on a machine without a Russian keyboard layout, the friction goes up another notch.
Multiple times a day. Across email, MS Teams, PR descriptions, design docs.
I finally got tired of switching context, and built a tool to skip it.
Why not the built-in Windows dictation?
Windows 11 has perfectly fine built-in dictation (voice typing, Win+H). But it doesn't translate — and translation is the half that matters for non-native English speakers like me.
The workflow I needed:
- Press a global hotkey
- Speak in my native language
- Get English text inserted directly into whatever app I'm in
No browser tab. No copy-paste. Nothing sent to the cloud.
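That loop is small enough to sketch. Here's a minimal version of the trigger side using SharpHook for the global hotkey — F9 and the toggle logic are illustrative, not Parlotype's actual bindings:

```csharp
using SharpHook;
using SharpHook.Native;

// Toggle recording on a global hotkey. SharpHook installs a native keyboard
// hook, so the event fires no matter which application has focus.
var hook = new TaskPoolGlobalHook();
var recording = false;

hook.KeyPressed += (_, e) =>
{
    if (e.Data.KeyCode != KeyCode.VcF9)
        return;

    recording = !recording;
    // On start: begin audio capture (NAudio/WASAPI on Windows).
    // On stop: hand the buffered audio to VAD + Whisper, then inject
    // the English text into the focused app.
    Console.WriteLine(recording ? "recording…" : "transcribing…");
};

await hook.RunAsync(); // keeps pumping hook events until disposed
```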
That's Parlotype.
The stack
The first version is Windows-only, but I picked every piece with cross-platform support in mind from day one:
- .NET 10 — runtime
- Avalonia UI 12 — cross-platform desktop UI (tray-based)
- Whisper.net — on-device speech recognition (OpenAI Whisper bindings for .NET)
- Silero VAD — voice activity detection (ONNX-based)
- NAudio — Windows audio capture (WASAPI)
- CommunityToolkit.Mvvm — MVVM source generators
- SharpHook — cross-platform global hotkeys
A few decisions worth highlighting:
Avalonia over MAUI. I needed a real desktop tray app on Windows/Linux/macOS. MAUI's desktop story is still uneven; Avalonia handles tray, hotkeys, and native window chrome cleanly across all three platforms.
Whisper.net over Whisper.cpp directly. Whisper.cpp is the reference implementation, but Whisper.net wraps it with idiomatic C# APIs and managed memory handling — meaningful when integrating with the rest of a .NET app.
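The part Windows dictation is missing — translation — is a single builder flag in Whisper.net. A sketch, assuming a ggml model file on disk (model and file names are placeholders):

```csharp
using Whisper.net;

// Whisper can translate at transcription time; WithTranslate() switches it
// from "transcribe in the source language" to "emit English".
using var factory = WhisperFactory.FromPath("ggml-small.bin");
using var processor = factory.CreateBuilder()
    .WithLanguage("ru")  // source language; "auto" lets the model detect it
    .WithTranslate()     // output English instead of the source language
    .Build();

// Whisper expects 16 kHz mono audio; a WAV stream works directly.
using var wav = File.OpenRead("dictation.wav");
await foreach (var segment in processor.ProcessAsync(wav))
    Console.Write(segment.Text);
```

Note that Whisper's translate task only targets English — which happens to be exactly the use case here.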
Silero VAD over WebRTC VAD. WebRTC's VAD is older and noisier on modern audio. Silero, running through ONNX Runtime, gives much better speech/silence segmentation, which matters for snappy hotkey-triggered dictation.
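Whichever VAD produces the per-frame speech probabilities, turning them into stable start/stop decisions needs a little smoothing so short pauses don't split a sentence mid-dictation. A self-contained sketch of one common approach, a hangover counter (threshold and frame counts are illustrative, not Parlotype's actual settings):

```csharp
using System;
using System.Collections.Generic;

// Keep "speech" active for `hangoverFrames` frames after the last frame whose
// probability cleared the threshold, so brief pauses stay inside one segment
// instead of ending the recording early.
static IEnumerable<bool> SmoothSpeech(
    IEnumerable<float> probabilities, float threshold = 0.5f, int hangoverFrames = 15)
{
    var hang = 0;
    foreach (var p in probabilities)
    {
        var speech = p >= threshold;
        if (speech) hang = hangoverFrames;   // reload on every speech frame
        yield return speech || hang > 0;     // stay active through the hangover
        if (!speech && hang > 0) hang--;     // silence: count the hangover down
    }
}
```

At Silero's usual 512-sample frames and 16 kHz, 15 frames is roughly half a second of grace before the segment closes.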
GPU acceleration: CUDA and Vulkan
There's a second reason this project exists. A year ago I assembled a PC with an NVIDIA RTX 5000-series GPU for one specific purpose: to run local LLMs. It mostly sat idle — until Parlotype gave it a job.
Whisper.net supports CUDA out of the box, which is great for NVIDIA hardware. But "NVIDIA-only" isn't a cross-platform-friendly story — and many developers (including potential users) run on AMD or integrated GPUs.
The current build adds Vulkan as a second acceleration backend. Vulkan runs on NVIDIA, AMD, and Intel GPUs, including AMD integrated graphics, which broadens the hardware story significantly. CUDA is still preferred when available (faster on NVIDIA), but Vulkan covers the rest without falling back to CPU.
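In Whisper.net, backend preference is mostly configuration: you reference the runtime packages (Whisper.net.Runtime.Cuda, Whisper.net.Runtime.Vulkan) and tell the loader which native library to try first. The exact API shape has shifted between Whisper.net versions, so treat this sketch as an assumption to verify against the version you pin:

```csharp
using Whisper.net;

// Try CUDA first (fastest on NVIDIA), fall back to Vulkan (NVIDIA/AMD/Intel),
// and only then plain CPU. Each entry only loads if the matching
// Whisper.net.Runtime.* package is referenced and its native deps are present.
RuntimeOptions.RuntimeLibraryOrder = new List<RuntimeLibrary>
{
    RuntimeLibrary.Cuda,
    RuntimeLibrary.Vulkan,
    RuntimeLibrary.Cpu,
};

// The first runtime in the list that loads successfully wins.
using var factory = WhisperFactory.FromPath("ggml-small.bin");
```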
I'll publish benchmarks comparing CUDA vs Vulkan vs CPU across model sizes (tiny, base, small, medium, large-v3) in a follow-up post.
Parlotype as an AI-coding-agent testbed
Parlotype also became my real-world lab for AI coding agents — Claude Code, Copilot, OpenCode, and others. After 20 years of writing code by hand, I wanted to see how these tools hold up on a non-trivial .NET codebase. Not toy demos, not greenfield React apps — actual cross-platform desktop work with audio pipelines, native interop, and ONNX runtimes.
I'll write about that workflow in detail later: agent setup, automated project memory in an Obsidian vault, and which kinds of tasks each agent handles well versus poorly.
What's next in this series
Posts I'm planning to write next:
- The speech recognition pipeline end-to-end (audio capture → VAD → Whisper → translation → injection)
- Benchmarks for Whisper model parameters (size, language, beam size, temperature) on real hardware
- CUDA vs Vulkan vs CPU performance across model sizes
- My AI coding agent setup and the Obsidian-based project memory
Which one would you want to read first? Drop a comment.
Try it
Repo: github.com/mdemin729/parlotype
Issues, feedback, and PRs all welcome — especially benchmark numbers if you run it on AMD or Intel GPUs.