Voice-to-Text for Developers: Why I Stopped Typing Half My Code Comments

#privacy #ai #productivity #rust

I type fast. Probably 90-100 WPM on a good day. So when someone first suggested I try voice-to-text for development work, I laughed. Why would I dictate when my fingers are already on the keyboard?

Then I timed myself writing a pull request description. Three paragraphs explaining a refactor — what changed, why, what to watch for in review. It took eight minutes. Not because I type slowly, but because I kept rewording things, deleting sentences, second-guessing phrasing. Writing prose is a different cognitive task than writing code, and the keyboard creates friction between thinking and expressing.

I tried dictating the same kind of description the next day. Spoke for about 90 seconds, let the tool clean it up, made two small edits. Done in under three minutes. The output was arguably better because I'd just explained it like I was talking to a colleague, which is exactly what a good PR description should sound like.

That was six months ago. Now I dictate roughly half of all the non-code text I produce in a day. Here's what I've learned.

What Developers Actually Dictate

Let me be clear: I'm not dictating for loops. Voice-to-text isn't replacing the keyboard for writing code. It's replacing the keyboard for everything around the code:

Pull request descriptions. The best PRs read like you're explaining the change to a teammate. Dictation naturally produces that tone because you're literally just... explaining it.

Code comments and docstrings. That function that needs a "why" comment? Explaining it out loud produces clearer, more natural documentation than staring at the screen trying to compose the perfect terse sentence.

Commit messages. "Refactored the authentication middleware to separate token validation from session management, reducing coupling and making it easier to unit test each concern independently." That came from about ten seconds of speaking. Typing it would've taken 30 seconds and I probably would've just written "refactor auth" instead.

Slack and Teams messages. Developers spend a shocking amount of time writing messages. Dictation turns a two-minute typing session into a 20-second speaking session. Multiply that by dozens of messages per day.

Documentation. README files, architecture decision records, onboarding guides, runbooks. These all benefit from a conversational tone, and dictation naturally produces one.

Emails and stand-up notes. The low-value text that eats time every day. Dictate it, clean it up, move on.

Why Local Matters for Developer Workflows

If you're going to dictate work content, where that audio goes matters. Developer conversations contain proprietary information — architecture decisions, security vulnerabilities, unreleased features, customer names, internal debates.

Cloud-based dictation tools process your audio on remote servers. That means your PR description about a security fix, your Slack message about a customer's infrastructure, your commit message mentioning an unpatched vulnerability — all of it passes through a third party's infrastructure.

Local voice-to-text removes that exposure. The audio never leaves your machine, so no third party ever handles it. For developers working under NDA, in regulated industries, or at companies whose security policies prohibit sending data to unauthorized third parties, local processing isn't optional; it's required.

MumbleFlow is built on this principle. It uses whisper.cpp and llama.cpp to run the entire speech-to-text pipeline on your hardware — no cloud, no API calls, no audio stored anywhere. As a developer, you can verify this yourself: run it with network monitoring and watch nothing leave your machine.
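
MumbleFlow's internals aren't described here, so what follows is only a rough sketch of what a local transcription step can look like if you shell out to whisper.cpp's command-line tool yourself. The binary name `whisper-cli`, the model file, and the WAV file name are assumptions for illustration; adjust them to match your own build. The point is simply that the whole step is a local subprocess reading local files.

```python
import subprocess
from pathlib import Path


def build_whisper_command(binary: str, model: Path, wav: Path) -> list[str]:
    """Build a whisper.cpp CLI invocation for a single audio file.

    -m selects the model file, -f the input audio, and -nt asks for
    the transcript without timestamps.
    """
    return [binary, "-m", str(model), "-f", str(wav), "-nt"]


def transcribe_locally(binary: str, model: Path, wav: Path) -> str:
    """Run the transcription as a local subprocess and return its stdout."""
    result = subprocess.run(
        build_whisper_command(binary, model, wav),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tool exits non-zero
    )
    return result.stdout.strip()
```

Calling `transcribe_locally("./whisper-cli", Path("models/ggml-base.en.bin"), Path("note.wav"))` (all three values hypothetical) would return the raw transcript. A cleanup pass through a local LLM would be a second, separate step.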

The Workflow That Actually Works

After experimenting with different tools and approaches, here's the workflow I've settled on:

The hardware: Any microphone that's not your laptop's built-in one. I use a $40 USB condenser mic. The accuracy difference is massive — local Whisper models are good, but they're not magic. Clean audio input matters.

The tool: MumbleFlow. Hold Fn, speak, release. Text appears at cursor position. Works in VS Code, terminal, Slack, browser — any text field. The LLM cleanup step (via llama.cpp) is critical for developer use because it turns stream-of-consciousness speech into properly punctuated, grammatically correct text without changing the meaning.

The habit: I dictate anything that's more than two sentences and isn't code. If I catch myself staring at a text field composing prose, I hold Fn instead. The mental shift took about a week.

The editing pass: Dictated text is 90% ready. I do a quick scan for technical terms that got mangled (model names, library names, and acronyms sometimes need a fix) and hit send. Total time: a fraction of what typing takes.
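
If the same terms get mangled repeatedly, part of that scan can be automated. Here's a minimal sketch of a personal glossary pass, assuming you run it yourself on dictated text. It is not a MumbleFlow feature, and the glossary entries are made-up examples of how a transcript might mishear a term.

```python
import re

# Hypothetical glossary: what the transcript tends to say -> what was meant.
GLOSSARY = {
    "pie test": "pytest",
    "sequel alchemy": "SQLAlchemy",
    "cube control": "kubectl",
    "engine x": "nginx",
}


def fix_terms(text: str, glossary: dict[str, str] = GLOSSARY) -> str:
    """Replace known mis-transcriptions, matching whole words case-insensitively."""
    for heard, meant in glossary.items():
        pattern = re.compile(rf"\b{re.escape(heard)}\b", re.IGNORECASE)
        # A function replacement keeps `meant` literal even if it contains backslashes.
        text = pattern.sub(lambda _match, meant=meant: meant, text)
    return text


print(fix_terms("Ran Pie Test against the sequel alchemy models via cube control."))
# prints: Ran pytest against the SQLAlchemy models via kubectl.
```

Keep the glossary narrow. A phrase that is also ordinary English will get rewritten everywhere it appears, so this works best for terms that only ever mean one thing in your writing.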

Common Objections (And What I've Found)

"I'll look weird talking to my computer." If you work from home, nobody's watching. If you're in an office, you already take calls at your desk. This is quieter than a phone call.

"It won't understand technical terms." Modern Whisper models handle technical vocabulary surprisingly well. "Kubernetes," "PostgreSQL," "middleware," "refactor" — all transcribed correctly in my experience. Unusual library names or internal jargon occasionally need manual correction, but the LLM cleanup catches most formatting issues.

"It's slower than typing." For code, yes. For prose, absolutely not. The average person speaks at 130-150 WPM. Even fast typists top out at 80-100 WPM, and that's raw speed — not accounting for the thinking-while-typing overhead that slows actual composition to 30-40 WPM for most people. Dictation lets you think and produce text simultaneously.

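To make the speed comparison concrete, here's a quick back-of-the-envelope calculation using the midpoints of the rates quoted above. The 300-word draft length is my own assumption for a three-paragraph PR description.

```python
def minutes_to_produce(words: int, words_per_minute: float) -> float:
    """Time needed to produce `words` at a steady rate, in minutes."""
    return words / words_per_minute


DRAFT_WORDS = 300  # assumed length of a three-paragraph PR description

composing = minutes_to_produce(DRAFT_WORDS, 35)   # midpoint of 30-40 WPM composing
speaking = minutes_to_produce(DRAFT_WORDS, 140)   # midpoint of 130-150 WPM speaking

print(f"typing while composing: {composing:.1f} min")  # 8.6 min
print(f"speaking:               {speaking:.1f} min")   # 2.1 min
```

The speaking figure doesn't include the editing pass, so the real gap is somewhat smaller than the raw ratio suggests.
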
"I need to be precise with technical writing." Dictation produces a first draft. You edit it. This is exactly how most writing works anyway — the difference is that the first draft takes 30 seconds instead of five minutes.

The Numbers

Here's my rough before/after over the past six months:

| Task | Typing | Dictating + Editing |
|---|---|---|
| PR description (3 paragraphs) | 6-8 min | 2-3 min |
| Substantial Slack message | 2-3 min | 30-60 sec |
| Code comment (2-3 sentences) | 45 sec | 15 sec |
| Commit message (detailed) | 30-45 sec | 10-15 sec |
| Documentation section (500 words) | 20-25 min | 8-10 min |

The savings compound. If you produce 2,000 words of non-code text per day (plausible for many developers once you add up PRs, messages, docs, and emails), dictation saves roughly 30-45 minutes daily. Over a five-day week that's 2.5 to 3.75 hours. Over a year, it adds up to more than a hundred hours reclaimed.
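
The weekly and yearly figures are plain arithmetic on the daily estimate. A small sketch, assuming a five-day week and 48 working weeks per year (both assumptions of mine, so plug in your own):

```python
def hours_saved(minutes_per_day: float,
                workdays_per_week: int = 5,
                working_weeks: int = 48) -> tuple[float, float]:
    """Return (hours saved per week, hours saved per year)."""
    weekly = minutes_per_day * workdays_per_week / 60
    return weekly, weekly * working_weeks


for daily in (30, 45):
    weekly, yearly = hours_saved(daily)
    print(f"{daily} min/day -> {weekly:.2f} h/week, {yearly:.0f} h/year")
# 30 min/day -> 2.50 h/week, 120 h/year
# 45 min/day -> 3.75 h/week, 180 h/year
```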

Getting Started

If you're curious, here's the lowest-friction way to try it:

Get MumbleFlow ($5, runs on Mac/Windows/Linux).
Use a decent microphone (even earbuds with a mic beat a laptop mic).
Start with low-stakes text — Slack messages, commit messages, casual docs.
Give it a week before judging. The first few dictations feel awkward. By day three, it's natural.

You don't have to dictate everything. You don't have to give up your keyboard. Just try dictating the next PR description and see if the output surprises you.

It surprised me.

MumbleFlow — local voice-to-text for developers. $5 one-time. Fully offline. Works everywhere your cursor does.