I built a Windows dictation app with Groq Whisper — here's what I learned
I've been a bad typist my whole life. Not slow — just error-prone. I spend more time correcting than creating. So a few months ago I decided to build my own Windows dictation app powered by Groq's Whisper API. What shipped is dictate.app, and the journey taught me more than I expected.
Why not just use Windows' built-in dictation?
Windows has had dictation since Windows 10. It works okay — until it doesn't. The accuracy drops on technical vocabulary, it doesn't handle punctuation well without training, and you can't pipe the output anywhere cleanly. I wanted something that:
- Worked in any app, not just Microsoft ones
- Had real-time transcription, not batch
- Used a modern model, not a 2018-era acoustic model
- Cost almost nothing per use
Groq's Whisper API ticked every box.
The technical stack
The app is a lightweight Windows system tray application. Here's the core flow:
- Press a hotkey (customizable)
- Audio is captured from the default mic using the Windows audio APIs
- Audio is chunked and sent to Groq's Whisper endpoint
- Transcribed text is injected directly into whatever input field is focused
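The chunking step in that flow is the only part that's plain code rather than Windows plumbing. A minimal sketch, assuming 16-bit mono PCM and fixed-size windows (the real capture path has more going on, but this is the shape of it):

```javascript
// Split a raw PCM capture into fixed-length chunks for transcription.
// Assumes 16-bit mono PCM (2 bytes per sample); boundaries are aligned
// to whole samples so no sample frame is torn across two requests.
function chunkPcm(buffer, sampleRate, chunkSeconds, bytesPerSample = 2) {
  const chunkBytes = sampleRate * chunkSeconds * bytesPerSample;
  const chunks = [];
  for (let offset = 0; offset < buffer.length; offset += chunkBytes) {
    chunks.push(buffer.subarray(offset, Math.min(offset + chunkBytes, buffer.length)));
  }
  return chunks;
}

// 3 seconds of 16 kHz audio → two 1.5-second chunks of 48,000 bytes each
const capture = Buffer.alloc(16000 * 3 * 2);
const chunks = chunkPcm(capture, 16000, 1.5);
```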
The Groq API call itself is dead simple (via the official groq-sdk package):

```javascript
import fs from "node:fs";
import Groq from "groq-sdk";

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });
const audioFile = fs.createReadStream("recording.wav");

const transcription = await groq.audio.transcriptions.create({
  file: audioFile,
  model: "whisper-large-v3",
  response_format: "json",
  language: "en",
});
```
That's it. The model does the heavy lifting. The tricky parts were all Windows-specific.
Where I got surprised
Simulating keystrokes is harder than it looks
Injecting text into arbitrary Windows apps sounds trivial. It's not. Different apps handle keyboard events differently. Some respond to SendInput, some need WM_CHAR messages, some (looking at you, certain Electron apps) need both. I ended up building a small compatibility layer that tries methods in order and falls back gracefully.
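The layer itself is mostly ordering and error collection. A minimal sketch of that try-in-order logic; the injector names here (SendInput, WM_CHAR, clipboard) label hypothetical stand-ins for the native bindings, not a real API:

```javascript
// Try each injection method in order until one succeeds.
// `injectors` is an ordered list of [name, fn] pairs, where each fn
// takes the text and throws on failure. Returns the name of the
// method that worked, which is handy for debugging per-app quirks.
function injectText(text, injectors) {
  const errors = [];
  for (const [name, inject] of injectors) {
    try {
      inject(text);
      return name;
    } catch (err) {
      errors.push(`${name}: ${err.message}`);
    }
  }
  throw new Error(`all injection methods failed:\n${errors.join("\n")}`);
}
```

Ordering the methods from least to most intrusive means well-behaved apps get the clean path and only the awkward ones pay for the fallback.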
Latency matters more than accuracy
I assumed accuracy would be the thing users cared most about. I was wrong. Latency was the real killer. If there's more than ~1.5 seconds between you stopping speaking and the text appearing, the UX feels broken — even if the transcription is perfect. Groq's speed advantage over OpenAI Whisper here is dramatic. On identical audio clips, Groq returns in ~300ms vs ~1200ms on OpenAI's API. That gap is the entire difference between the app feeling native and feeling laggy.
Background audio capture on Windows is a minefield
Capturing audio while other apps are running means navigating Windows audio session management. I hit exclusivity conflicts with certain pro audio setups. The fix was adding a configurable audio device selector — power users who have weird audio routing can specify exactly which device to use.
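The selector logic boils down to "use the configured device if it still exists, otherwise fall back to the default." A sketch, with the enumerated device shape assumed:

```javascript
// Pick the capture device to use. `devices` is the enumerated list
// (objects with id, name, isDefault); `preferredId` comes from user
// settings. Falls back to the system default if the configured device
// is gone (unplugged interface, changed routing, etc.).
function pickCaptureDevice(devices, preferredId) {
  if (preferredId) {
    const match = devices.find((d) => d.id === preferredId);
    if (match) return match;
  }
  return devices.find((d) => d.isDefault) ?? devices[0];
}
```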
System tray UX has its own conventions
Windows users have strong expectations about tray apps. They should:
- Start minimized
- Show a meaningful context menu on right-click
- Not hijack focus
- Not spawn a console window
Violate any of these and people feel like something is wrong, even if they can't articulate why.
What I'd do differently
Offline fallback. When Groq's API is unreachable (VPN, firewall, offline), the app just fails. I'm adding a local Whisper model fallback — heavier, slower, but it works without internet.
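The shape I have in mind is a simple try-remote-then-local chain. A sketch only; `remoteTranscribe` and `localTranscribe` are placeholders for the Groq call and a local Whisper runtime:

```javascript
// Prefer the fast remote API; if it fails (VPN, firewall, offline),
// fall back to the slower local model instead of failing outright.
async function transcribeWithFallback(audio, remoteTranscribe, localTranscribe) {
  try {
    return { text: await remoteTranscribe(audio), source: "groq" };
  } catch (err) {
    // A real implementation should narrow this to network errors;
    // an auth failure, for instance, ought to surface to the user.
    return { text: await localTranscribe(audio), source: "local" };
  }
}
```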
Better onboarding. First-run experience is terrible. I dumped people into a settings screen. Users want to press a button and hear it work within 30 seconds. I'm rebuilding the first-run flow to be a literal one-click demo.
Usage analytics (opt-in). I have no idea which features people actually use. Adding privacy-respecting, opt-in telemetry to guide future decisions.
The pricing model
I landed on $9/month. The reasoning:
- Groq's Whisper API costs are roughly $0.04–0.08/hour of audio depending on volume
- Heavy users might do 2–3 hours/day of dictation, but most do 15–30 minutes
- 30 min/day × 30 days = 15 hours/month, and 15 hours × $0.06 ≈ $0.90 in API cost
- $9 gives enough margin to support, improve, and not go bankrupt
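The back-of-envelope math above, as code (the rates are my estimates from the list, not published Groq pricing tiers):

```javascript
// Estimate monthly API cost for a given usage pattern.
// ratePerHour is the estimated Whisper price in USD per audio hour.
function monthlyApiCost(minutesPerDay, daysPerMonth, ratePerHour) {
  const hours = (minutesPerDay * daysPerMonth) / 60;
  return hours * ratePerHour;
}

const typical = monthlyApiCost(30, 30, 0.06);  // ≈ $0.90
const heavy = monthlyApiCost(150, 30, 0.08);   // 2.5 h/day at the high rate ≈ $6.00
```

Even the heavy case stays under the $9 price, which is where the margin comes from.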
Surprisingly, the price has not been the objection I expected. The objection is trust — people want to know their audio isn't being stored or sold. I now have a privacy page and a clear statement on first run: audio is sent to Groq for transcription and immediately discarded. Groq's own privacy policy backs this up.
Numbers so far
This is an indie project, not a funded startup. Early days. But the retention among people who actually adopt it into their workflow is strong — the ones who stick past day 3 are still using it a month later. That's the signal I'm building toward.
Want to try it?
If you type a lot and wish you could just talk — dictate.app is $9/month and there's a free trial. It works on Windows 10 and 11. No cloud accounts, no OAuth, just a Groq API key you bring yourself (or use the managed version where I handle the key).
Happy to answer questions about the build in the comments. The Windows audio API rabbit holes alone could fill another post.