I built a Windows dictation app with Groq Whisper — here's what I learned
I've been a bad typist my whole life. Not slow — just error-prone. I spend more time correcting than creating. So a few months ago I decided to build my own Windows dictation app powered by Groq's Whisper API. What shipped is dictate.app, and the journey taught me more than I expected.
Why not just use Windows' built-in dictation?
Windows has had dictation since Windows 10. It works okay — until it doesn't. The accuracy drops on technical vocabulary, it doesn't handle punctuation well without training, and you can't pipe the output anywhere cleanly. I wanted something that:
- Worked in any app, not just Microsoft ones
- Had real-time transcription, not batch
- Used a modern model, not a 2018-era acoustic model
- Cost almost nothing per use
Groq's Whisper API ticked every box.
The technical stack
The app is a lightweight Windows system tray application. Here's the core flow:
- Press a hotkey (customizable)
- Audio is captured from the default mic using the Windows audio APIs
- Audio is chunked and sent to Groq's Whisper endpoint
- Transcribed text is injected directly into whatever input field is focused
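The chunking step in that flow is the only part that's plain code rather than Windows plumbing. A minimal sketch, assuming 16-bit mono PCM and fixed-size windows (the real capture path has more going on, but this is the shape of it):

```javascript
// Split a raw PCM capture into fixed-length chunks for transcription.
// Assumes 16-bit mono PCM (2 bytes per sample); boundaries are aligned
// to whole samples so no sample frame is torn across two requests.
function chunkPcm(buffer, sampleRate, chunkSeconds, bytesPerSample = 2) {
  const chunkBytes = sampleRate * chunkSeconds * bytesPerSample;
  const chunks = [];
  for (let offset = 0; offset < buffer.length; offset += chunkBytes) {
    chunks.push(buffer.subarray(offset, Math.min(offset + chunkBytes, buffer.length)));
  }
  return chunks;
}

// 3 seconds of 16 kHz audio → two 1.5-second chunks of 48,000 bytes each
const capture = Buffer.alloc(16000 * 3 * 2);
const chunks = chunkPcm(capture, 16000, 1.5);
```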
The Groq API call itself is dead simple (via the official groq-sdk package):

```javascript
import fs from "node:fs";
import Groq from "groq-sdk";

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });
const audioFile = fs.createReadStream("recording.wav");

const transcription = await groq.audio.transcriptions.create({
  file: audioFile,
  model: "whisper-large-v3",
  response_format: "json",
  language: "en",
});
```
That's it. The model does the heavy lifting. The tricky parts were all Windows-specific.
Where I got surprised
Simulating keystrokes is harder than it looks
Injecting text into arbitrary Windows apps sounds trivial. It's not. Different apps handle keyboard events differently. Some respond to SendInput, some need WM_CHAR messages, some (looking at you, certain Electron apps) need both. I ended up building a small compatibility layer that tries methods in order and falls back gracefully.
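The layer itself is mostly ordering and error collection. A minimal sketch of that try-in-order logic; the injector names here (SendInput, WM_CHAR, clipboard) label hypothetical stand-ins for the native bindings, not a real API:

```javascript
// Try each injection method in order until one succeeds.
// `injectors` is an ordered list of [name, fn] pairs, where each fn
// takes the text and throws on failure. Returns the name of the
// method that worked, which is handy for debugging per-app quirks.
function injectText(text, injectors) {
  const errors = [];
  for (const [name, inject] of injectors) {
    try {
      inject(text);
      return name;
    } catch (err) {
      errors.push(`${name}: ${err.message}`);
    }
  }
  throw new Error(`all injection methods failed:\n${errors.join("\n")}`);
}
```

Ordering the methods from least to most intrusive means well-behaved apps get the clean path and only the awkward ones pay for the fallback.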
Latency matters more than accuracy
I assumed accuracy would be the thing users cared most about. I was wrong. Latency was the real killer. If there's more than ~1.5 seconds between you stopping speaking and the text appearing, the UX feels broken — even if the transcription is perfect. Groq's speed advantage over OpenAI Whisper here is dramatic. On identical audio clips, Groq returns in ~300ms vs ~1200ms on OpenAI's API. That gap is the entire difference between the app feeling native and feeling laggy.
Background audio capture on Windows is a minefield
Capturing audio while other apps are running means navigating Windows audio session management. I hit exclusivity conflicts with certain pro audio setups. The fix was adding a configurable audio device selector — power users who have weird audio routing can specify exactly which device to use.
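The selector logic boils down to "use the configured device if it still exists, otherwise fall back to the default." A sketch, with the enumerated device shape assumed:

```javascript
// Pick the capture device to use. `devices` is the enumerated list
// (objects with id, name, isDefault); `preferredId` comes from user
// settings. Falls back to the system default if the configured device
// is gone (unplugged interface, changed routing, etc.).
function pickCaptureDevice(devices, preferredId) {
  if (preferredId) {
    const match = devices.find((d) => d.id === preferredId);
    if (match) return match;
  }
  return devices.find((d) => d.isDefault) ?? devices[0];
}
```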
System tray UX has its own conventions
Windows users have strong expectations about tray apps. They should:
- Start minimized
- Show a meaningful context menu on right-click
- Not hijack focus
- Not spawn a console window
Violate any of these and people feel like something is wrong, even if they can't articulate why.
What I'd do differently
Offline fallback. When Groq's API is unreachable (VPN, firewall, offline), the app just fails. I'm adding a local Whisper model fallback — heavier, slower, but it works without internet.
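The shape I have in mind is a simple try-remote-then-local chain. A sketch only; `remoteTranscribe` and `localTranscribe` are placeholders for the Groq call and a local Whisper runtime:

```javascript
// Prefer the fast remote API; if it fails (VPN, firewall, offline),
// fall back to the slower local model instead of failing outright.
async function transcribeWithFallback(audio, remoteTranscribe, localTranscribe) {
  try {
    return { text: await remoteTranscribe(audio), source: "groq" };
  } catch (err) {
    // A real implementation should narrow this to network errors;
    // an auth failure, for instance, ought to surface to the user.
    return { text: await localTranscribe(audio), source: "local" };
  }
}
```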
Better onboarding. First-run experience is terrible. I dumped people into a settings screen. Users want to press a button and hear it work within 30 seconds. I'm rebuilding the first-run flow to be a literal one-click demo.
Usage analytics (opt-in). I have no idea which features people actually use. Adding privacy-respecting, opt-in telemetry to guide future decisions.
The pricing model
I landed on $9/month. The reasoning:
- Groq's Whisper API costs are roughly $0.04–0.08/hour of audio depending on volume
- Heavy users might do 2–3 hours/day of dictation, but most do 15–30 minutes
- 30 min/day × 30 days = 15 hours/month, and 15 hours × $0.06 ≈ $0.90 in API cost
- $9 gives enough margin to support, improve, and not go bankrupt
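The back-of-envelope math above, as code (the rates are my estimates from the list, not published Groq pricing tiers):

```javascript
// Estimate monthly API cost for a given usage pattern.
// ratePerHour is the estimated Whisper price in USD per audio hour.
function monthlyApiCost(minutesPerDay, daysPerMonth, ratePerHour) {
  const hours = (minutesPerDay * daysPerMonth) / 60;
  return hours * ratePerHour;
}

const typical = monthlyApiCost(30, 30, 0.06);  // ≈ $0.90
const heavy = monthlyApiCost(150, 30, 0.08);   // 2.5 h/day at the high rate ≈ $6.00
```

Even the heavy case stays under the $9 price, which is where the margin comes from.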
Surprisingly, the price has not been the objection I expected. The objection is trust — people want to know their audio isn't being stored or sold. I now have a privacy page and a clear statement on first run: audio is sent to Groq for transcription and immediately discarded. Groq's own privacy policy backs this up.
Numbers so far
This is an indie project, not a funded startup. Early days. But the retention among people who actually adopt it into their workflow is strong — the ones who stick past day 3 are still using it a month later. That's the signal I'm building toward.
Want to try it?
If you type a lot and wish you could just talk — dictate.app is $9/month and there's a free trial. It works on Windows 10 and 11. No cloud accounts, no OAuth, just a Groq API key you bring yourself (or use the managed version where I handle the key).
Happy to answer questions about the build in the comments. The Windows audio API rabbit holes alone could fill another post.