Intro
This post breaks down Voice Type — a system-wide speech-to-text tool for Linux I built and use daily. I'll cover why I built it, how it works under the hood, and what I learned along the way.
Why I built it
Two reasons: wrist pain and productivity.
Modern dev workflows involve a ton of prompting — LLMs, search, documentation. If you're doing that all day with a keyboard, your wrists notice. And beyond the ergonomics, you simply speak faster than you type. Offloading even part of that to voice has a real impact.
The obvious solution was a local STT model. I tried SpeechNote on Linux with a Whisper.cpp small model — it wasn't accurate enough and had noticeable latency. Running a larger model wasn't an option either: I'm on an 8GB RAM laptop, and my dev setup (VSCode, Chrome, Docker, DBeaver, etc.) already pushes that. A 3–4GB model sitting in RAM wasn't viable.
I eventually found that Chrome's built-in Web Speech API — which routes audio to Google's servers under the hood — matches the accuracy and speed of much larger models like Whisper large, with virtually zero local resource usage. That was the unlock.
How it works
Voice Type launches a headless Chrome instance via Puppeteer and uses the Web Speech API to do real-time transcription. Here's the flow:
1. Browser — Chrome runs the Web Speech API with interimResults: true, which streams partial transcripts in real time as you speak. These flow in over Google's WebSocket infrastructure.
2. IPC — When a new interim result arrives in the browser, it triggers a callback into the main Node.js daemon. This is done via Puppeteer's exposeFunction, which rides the Chrome DevTools Protocol WebSocket connection to invoke a function in the main process in real time.
3. Diffing — The daemon keeps track of currentText. On every update it runs a diff: if nothing changed, do nothing; if it changed, find the longest common prefix between the old and new text, send the right number of backspaces, and type the new characters.
4. Typing — Text is typed system-wide via dotool. Standard ASCII goes through directly. Accented characters, emojis, and CJK are handled via GNOME's Unicode Hex Input sequence (Ctrl+Shift+U → hex code → Enter), which makes it work across languages without needing to match the user's keyboard layout.
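The typing step's ASCII-versus-Unicode split can be sketched like this. The planKeystrokes function and the action shapes are mine for illustration — they're not Voice Type's internals or dotool's command syntax:

```javascript
// Sketch: route printable ASCII straight through, and everything else via
// GNOME's Unicode Hex Input sequence (Ctrl+Shift+U, hex code, Enter).
function planKeystrokes(text) {
  const actions = [];
  for (const ch of text) { // for-of iterates by code point, so emoji stay intact
    const cp = ch.codePointAt(0);
    if (cp >= 0x20 && cp <= 0x7e) {
      actions.push({ kind: 'ascii', text: ch });       // typed directly
    } else {
      actions.push({ kind: 'hex', code: cp.toString(16) }); // hex input sequence
    }
  }
  return actions;
}

const plan = planKeystrokes('café');
console.log(plan[3]); // { kind: 'hex', code: 'e9' }
```

Because the routing is keyed on code points rather than keycodes, the same plan works no matter which keyboard layout the user has active.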
This is why Voice Type can correct itself mid-sentence without retyping everything from scratch — it only fixes what changed.
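The diffing step above can be sketched in a few lines. The diffUpdate name and the returned shape are illustrative, not Voice Type's actual internals:

```javascript
// Sketch of the prefix-diff step: given the previous transcript and a new
// interim result, compute the minimal backspaces + keystrokes to reconcile them.
function diffUpdate(oldText, newText) {
  if (oldText === newText) return null; // nothing changed, do nothing

  // Find the longest common prefix of the two strings.
  let prefix = 0;
  while (
    prefix < oldText.length &&
    prefix < newText.length &&
    oldText[prefix] === newText[prefix]
  ) {
    prefix++;
  }

  return {
    backspaces: oldText.length - prefix, // erase the diverging tail
    insert: newText.slice(prefix),       // then type the new tail
  };
}

// Example: the recognizer revises "I want to by" into "I want to buy milk".
console.log(diffUpdate('I want to by', 'I want to buy milk'));
// { backspaces: 1, insert: 'uy milk' }
```

One backspace and seven new characters instead of retyping the whole sentence — which is exactly why mid-sentence corrections feel instant.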
Privacy note
Since the Web Speech API is powered by Google's servers, your audio does leave your machine. Worth knowing before you use it for anything sensitive.
How I actually use it
Mostly for prompting and writing blog posts like this one. Once it's bound to a key, it gets out of the way completely — press, speak, done. It supports text and sound notifications, though I prefer them disabled — GNOME already shows a mic icon in the system tray when the mic is active.
Conclusion
STT is underrated as a dev tool. If you're on Linux, give Voice Type a try — it's free, open source, and uses almost no resources. If it's useful to you, a star on GitHub goes a long way. Happy coding!