Intro
This post breaks down Voice Type — a system-wide speech-to-text tool for Linux I built and use daily. I'll cover why I built it, how it works under the hood, and what I learned along the way.
Why I built it
Two reasons: wrist pain and productivity.
Modern dev workflows involve a ton of prompting — LLMs, search, documentation. If you're doing that all day with a keyboard, your wrists notice. And beyond the ergonomics, you simply speak faster than you type. Offloading even part of that to voice has a real impact.
The obvious solution was a local STT model. I tried SpeechNote on Linux with a Whisper.cpp small model — it wasn't accurate enough and had noticeable latency. Running a larger model wasn't an option either: I'm on an 8GB RAM laptop, and my dev setup (VSCode, Chrome, Docker, DBeaver, etc.) already pushes that. A 3–4GB model sitting in RAM wasn't viable.
I eventually found that Chrome's built-in Web Speech API — which routes audio to Google's servers under the hood — matches the accuracy and speed of much larger models like Whisper large, with virtually zero local resource usage. That was the unlock.
How it works
Voice Type launches a headless Chrome instance via Puppeteer and uses the Web Speech API to do real-time transcription. Here's the flow:
1. Browser — Chrome runs the Web Speech API with interimResults: true, which streams partial transcripts in real time as you speak. These flow in over Google's WebSocket infrastructure.
2. IPC — When a new interim result arrives in the browser, it triggers a callback into the main Node.js daemon. This is done via Puppeteer's exposeFunction, which rides the Chrome DevTools Protocol WebSocket connection to invoke a function in the main process in real time.
3. Diffing — The daemon keeps track of currentText. On every update it runs a diff: if nothing changed, do nothing; if it changed, find the longest common prefix between the old and new text, send the right number of backspaces, and type the new characters.
4. Typing — Text is typed system-wide via dotool. Standard ASCII goes through directly. Accented characters, emojis, and CJK are handled via GNOME's Unicode Hex Input sequence (Ctrl+Shift+U → hex code → Enter), which makes it work across languages without needing to match the user's keyboard layout.
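The typing step's ASCII-versus-Unicode split can be sketched like this. The planKeystrokes function and the action shapes are mine for illustration — they're not Voice Type's internals or dotool's command syntax:

```javascript
// Sketch: route printable ASCII straight through, and everything else via
// GNOME's Unicode Hex Input sequence (Ctrl+Shift+U, hex code, Enter).
function planKeystrokes(text) {
  const actions = [];
  for (const ch of text) { // for-of iterates by code point, so emoji stay intact
    const cp = ch.codePointAt(0);
    if (cp >= 0x20 && cp <= 0x7e) {
      actions.push({ kind: 'ascii', text: ch });       // typed directly
    } else {
      actions.push({ kind: 'hex', code: cp.toString(16) }); // hex input sequence
    }
  }
  return actions;
}

const plan = planKeystrokes('café');
console.log(plan[3]); // { kind: 'hex', code: 'e9' }
```

Because the routing is keyed on code points rather than keycodes, the same plan works no matter which keyboard layout the user has active.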
This is why Voice Type can correct itself mid-sentence without retyping everything from scratch — it only fixes what changed.
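The diffing step above can be sketched in a few lines. The diffUpdate name and the returned shape are illustrative, not Voice Type's actual internals:

```javascript
// Sketch of the prefix-diff step: given the previous transcript and a new
// interim result, compute the minimal backspaces + keystrokes to reconcile them.
function diffUpdate(oldText, newText) {
  if (oldText === newText) return null; // nothing changed, do nothing

  // Find the longest common prefix of the two strings.
  let prefix = 0;
  while (
    prefix < oldText.length &&
    prefix < newText.length &&
    oldText[prefix] === newText[prefix]
  ) {
    prefix++;
  }

  return {
    backspaces: oldText.length - prefix, // erase the diverging tail
    insert: newText.slice(prefix),       // then type the new tail
  };
}

// Example: the recognizer revises "I want to by" into "I want to buy milk".
console.log(diffUpdate('I want to by', 'I want to buy milk'));
// { backspaces: 1, insert: 'uy milk' }
```

One backspace and seven new characters instead of retyping the whole sentence — which is exactly why mid-sentence corrections feel instant.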
Privacy note
Since the Web Speech API is powered by Google's servers, your audio does leave your machine. Worth knowing before you use it for anything sensitive.
How I actually use it
Mostly for prompting and writing blog posts like this one. Once it's bound to a key, it gets out of the way completely — press, speak, done. It supports text and sound notifications, though I prefer them disabled — GNOME already shows a mic icon in the system tray when the mic is active.
Conclusion
STT is underrated as a dev tool. If you're on Linux, give Voice Type a try — it's free, open source, and uses almost no resources. If it's useful to you, a star on GitHub goes a long way. Happy coding!