<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: How Minds Work</title>
    <description>The latest articles on DEV Community by How Minds Work (@howmindswork).</description>
    <link>https://dev.to/howmindswork</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3916981%2F9333884d-c60a-4532-b3c2-143c8296d951.png</url>
      <title>DEV Community: How Minds Work</title>
      <link>https://dev.to/howmindswork</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/howmindswork"/>
    <language>en</language>
    <item>
      <title>The $0.02/Hour AI That Replaced My $700 Dragon NaturallySpeaking</title>
      <dc:creator>How Minds Work</dc:creator>
      <pubDate>Thu, 07 May 2026 05:39:20 +0000</pubDate>
      <link>https://dev.to/howmindswork/the-002hour-ai-that-replaced-my-700-dragon-naturallyspeaking-4jd0</link>
      <guid>https://dev.to/howmindswork/the-002hour-ai-that-replaced-my-700-dragon-naturallyspeaking-4jd0</guid>
      <description>&lt;p&gt;I bought Dragon NaturallySpeaking Professional in 2019. It was $700. I justified it as a productivity investment. I used it for about three months before I stopped.&lt;/p&gt;

&lt;p&gt;Not because it was bad. Because it was annoying.&lt;/p&gt;

&lt;p&gt;Here is the honest comparison between Dragon and what I am using now.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Dragon Experience
&lt;/h2&gt;

&lt;p&gt;Dragon is impressive software. The accuracy on trained profiles is legitimately excellent — better than anything else available in 2019, and the desktop dictation market has not exactly exploded since then.&lt;/p&gt;

&lt;p&gt;But the friction is real:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training time.&lt;/strong&gt; Dragon asks you to read passages for 10-30 minutes to build your voice profile. The more you train, the better it gets. That is fine for people who dictate hours per day. For occasional use, it is a tax that never feels worth it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Process coupling.&lt;/strong&gt; Dragon works best when it is deeply integrated — Dragon-aware apps, dictation commands, custom vocabulary. When you work across many apps (browser, Slack, VS Code, terminals), the experience is inconsistent. Some windows work great. Some do not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The update cycle.&lt;/strong&gt; Nuance (now part of Microsoft) sold Dragon as a perpetual license but charged for major version upgrades. Each new major release was another $300-400 if you wanted the improvements. The subscription version (Dragon Anywhere) is $15/month — $180/year — for a cloud product that still requires a desktop client.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CPU load.&lt;/strong&gt; Dragon runs a language model continuously in the background. On older hardware, you feel it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Groq Whisper Costs
&lt;/h2&gt;

&lt;p&gt;Groq's Whisper API pricing (as of 2026) is $0.02 per hour of audio transcribed.&lt;/p&gt;

&lt;p&gt;Let that settle in.&lt;/p&gt;

&lt;p&gt;If you dictate aggressively — say, 2 hours of actual speaking per workday, 5 days a week — that is $0.04/day, $0.20/week, roughly $10/year.&lt;/p&gt;

&lt;p&gt;Most people dictate far less than that. A realistic number for someone using voice for Slack messages, quick notes, and occasional longer documents is probably 15-30 minutes of audio per day. At $0.02 per hour, that works out to roughly $0.15-$0.30/month.&lt;/p&gt;

&lt;p&gt;There is no subscription. No annual renewal. No upgrade required to access the current model. You pay per second of audio, you get the transcription, done.&lt;/p&gt;
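&lt;p&gt;If you want to sanity-check that math yourself, the arithmetic fits in a few lines (a quick sketch, not part of any app):&lt;/p&gt;

```javascript
// Sanity check of the pricing math above. Groq bills transcription at a
// flat rate per hour of audio ($0.02/hour at the time of writing).
const GROQ_RATE_PER_HOUR = 0.02;

// Monthly cost for a given amount of dictation per day.
function monthlyCost(minutesPerDay, daysPerMonth = 30) {
  const hoursPerMonth = (minutesPerDay / 60) * daysPerMonth;
  return hoursPerMonth * GROQ_RATE_PER_HOUR;
}

console.log(monthlyCost(15).toFixed(2)); // light use: about $0.15/month
console.log(monthlyCost(30).toFixed(2)); // heavier use: about $0.30/month
```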

&lt;h2&gt;
  
  
  Accuracy: Honest Numbers
&lt;/h2&gt;

&lt;p&gt;Dragon (trained) vs Groq Whisper (zero training):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Everyday speech:&lt;/strong&gt; Dragon wins by a small margin, maybe 1-2%. Both are in the high 90s. The difference is not meaningful in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical terms:&lt;/strong&gt; This surprised me. Whisper handles technical vocabulary well out of the box — API names, programming terms, product names. Dragon required adding custom vocabulary for anything unusual. Whisper seems to have absorbed enough technical text in training to handle most of what a developer or knowledge worker would say.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Names and proper nouns:&lt;/strong&gt; Dragon wins here, especially after training. Whisper sometimes mishears uncommon names. This is the most noticeable accuracy gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accents and speaking styles:&lt;/strong&gt; Whisper is trained on a huge multilingual dataset. It handles non-native English speakers and regional accents noticeably better than Dragon did in my testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Punctuation:&lt;/strong&gt; Both add punctuation automatically. Whisper's punctuation is slightly more erratic. Dragon's dictation commands ("period," "new line") give more control. Whisper does not take inline commands.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Lose
&lt;/h2&gt;

&lt;p&gt;Being honest about the gaps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No voice commands. Dragon lets you say "select that" or "scratch that" or "new line." Whisper gives you text, nothing else.&lt;/li&gt;
&lt;li&gt;No continuous dictation mode. Dragon can run in always-listening mode. Whisper is push-to-talk.&lt;/li&gt;
&lt;li&gt;Slightly lower accuracy on proper nouns without training data.&lt;/li&gt;
&lt;li&gt;Latency of 0.5-1.5 seconds per utterance (network round trip). Dragon processes locally so latency is near-zero on good hardware.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What You Gain
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Zero setup. No training, no profiles, no installation of a 4GB application.&lt;/li&gt;
&lt;li&gt;Works on any machine. The app is small, the model lives in the cloud.&lt;/li&gt;
&lt;li&gt;Works across every application. Dictate into Slack, VS Code, terminals, browsers — anything with a text input.&lt;/li&gt;
&lt;li&gt;Costs almost nothing.&lt;/li&gt;
&lt;li&gt;No vendor lock-in to a perpetual license that may not be supported in future OS versions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;If you dictate for a living — medical transcription, legal work, all-day heavy use — Dragon's accuracy and command system may still justify the price.&lt;/p&gt;

&lt;p&gt;For everyone else: a Groq Whisper-powered app is faster to set up, cheaper to run, works everywhere, and is accurate enough that you will not notice the difference on a normal day.&lt;/p&gt;

&lt;p&gt;The app I switched to is &lt;a href="https://dictate-app.pages.dev" rel="noopener noreferrer"&gt;Dictate for Windows&lt;/a&gt;. It uses Groq Whisper under the hood, runs in the system tray, and gets out of the way. The hotkey is the whole interface.&lt;/p&gt;

&lt;p&gt;I have not thought about Dragon since.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>windows</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Stop Typing Your Slack Messages — Use Your Voice Instead (Windows)</title>
      <dc:creator>How Minds Work</dc:creator>
      <pubDate>Thu, 07 May 2026 05:33:24 +0000</pubDate>
      <link>https://dev.to/howmindswork/stop-typing-your-slack-messages-use-your-voice-instead-windows-5hd5</link>
      <guid>https://dev.to/howmindswork/stop-typing-your-slack-messages-use-your-voice-instead-windows-5hd5</guid>
      <description>&lt;p&gt;I type fast. Around 90 WPM on a good day. But even at that speed, I am constantly falling behind in Slack.&lt;/p&gt;

&lt;p&gt;Slack is a different kind of typing. It is not flowing prose — it is reactive, rapid-fire, context-switching every two minutes. By the time I have typed out a coherent response, three more messages have arrived and the thread has moved on without me.&lt;/p&gt;

&lt;p&gt;So I started using my voice instead. Here is what I learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Voice Is Faster for Slack and Teams
&lt;/h2&gt;

&lt;p&gt;The average person speaks at 130–150 words per minute. Even fast typists rarely sustain 100 WPM in real-world conditions (as opposed to speed-test bursts). But more importantly, speaking is &lt;em&gt;thinking out loud&lt;/em&gt; — it bypasses the translation layer between brain and fingers.&lt;/p&gt;

&lt;p&gt;For short reactive messages — "yeah sounds good, let us jump on a call at 2" or "can you share the doc again? I cannot find it" — voice is dramatically faster. You say it, it appears, you send it. No backspacing, no autocorrect disasters, no hunting for the right emoji.&lt;/p&gt;

&lt;p&gt;For longer messages like project updates or async explanations, the advantage compounds. A 3-paragraph Slack message that would take 2 minutes to type takes about 40 seconds to dictate.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Does Not Work: Browser Extensions
&lt;/h2&gt;

&lt;p&gt;The first thing most people try is a Chrome extension. There are several voice-to-text extensions on the Chrome Web Store, and they work fine — for Gmail, Google Docs, and other browser-based text fields.&lt;/p&gt;

&lt;p&gt;But Slack's desktop app is not a browser. It is an Electron app running in its own process, outside Chrome's reach. Browser extensions can only inject into web pages in the Chrome renderer. They have no access to the desktop application's text input fields.&lt;/p&gt;

&lt;p&gt;Same goes for Teams. The desktop version is also Electron-based. Your Chrome extension will not see it.&lt;/p&gt;

&lt;p&gt;Windows' built-in Speech Recognition (the one you enable in Settings &amp;gt; Time &amp;amp; Language &amp;gt; Speech) can technically dictate into any window, but it is slow to activate, requires training, and the accuracy is noticeably worse than modern AI transcription — especially for technical terms, names, or anything with punctuation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works on Windows
&lt;/h2&gt;

&lt;p&gt;The approach that works is a dedicated Windows dictation tool that operates at the OS level — not inside a browser, but as a system-wide layer that can inject text into any focused application.&lt;/p&gt;

&lt;p&gt;Here is the setup that has been working for me:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Press a hotkey anywhere&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You are in Slack, Teams, VS Code, Notepad, whatever. You press a global shortcut (I use &lt;code&gt;Ctrl+Shift+Space&lt;/code&gt;). A small overlay appears — nothing intrusive, just a mic indicator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Speak naturally&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You say your message. The audio is sent to Groq's Whisper API for transcription. This takes about 1–2 seconds for a sentence, less than a second for short phrases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Text is injected directly&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The transcribed text is typed into whatever window was active when you pressed the hotkey. In Slack, it lands in the message box. You review it, press Enter.&lt;/p&gt;

&lt;p&gt;This works because the tool uses Windows accessibility APIs (specifically UI Automation) to interact with the active window — not browser injection. It can reach desktop apps, terminal windows, chat apps, anything with a text input.&lt;/p&gt;

&lt;h2&gt;
  
  
  Accuracy in Real Use
&lt;/h2&gt;

&lt;p&gt;Groq's Whisper model is genuinely impressive. In my testing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Common Slack phrases: ~99% accuracy&lt;/li&gt;
&lt;li&gt;Technical terms (API, GitHub, Kubernetes): ~96% accuracy&lt;/li&gt;
&lt;li&gt;Names and proper nouns: ~92% accuracy (drops with unusual names)&lt;/li&gt;
&lt;li&gt;Punctuation: handled automatically based on speech patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I occasionally have to fix a word, but it is faster than typing the whole message.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tool That Does This
&lt;/h2&gt;

&lt;p&gt;The app I have been using is &lt;a href="https://dictate-app.pages.dev" rel="noopener noreferrer"&gt;Dictate for Windows&lt;/a&gt;. It is a lightweight Electron app that runs in the system tray — you do not even know it is there until you need it. Press the hotkey, speak, done.&lt;/p&gt;

&lt;p&gt;It uses Groq's Whisper API under the hood, which means the transcription cost is almost nothing — fractions of a cent per message. You pay for what you use, no subscription required.&lt;/p&gt;

&lt;p&gt;If you are spending more than 30% of your workday in Slack or Teams, this is worth trying. The setup takes about 5 minutes and the habit clicks within a day or two.&lt;/p&gt;

&lt;p&gt;Your keyboard will thank you.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>windows</category>
      <category>devtools</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Building a Windows App That Injects Text Into Any Application — What I Learned</title>
      <dc:creator>How Minds Work</dc:creator>
      <pubDate>Thu, 07 May 2026 05:32:42 +0000</pubDate>
      <link>https://dev.to/howmindswork/building-a-windows-app-that-injects-text-into-any-application-what-i-learned-5477</link>
      <guid>https://dev.to/howmindswork/building-a-windows-app-that-injects-text-into-any-application-what-i-learned-5477</guid>
      <description>&lt;p&gt;I spent the last few months building a voice dictation app for Windows. The pitch is simple: press a hotkey anywhere, speak, and the transcribed text appears in whatever you were typing into — Slack, VS Code, Notepad, a terminal.&lt;/p&gt;

&lt;p&gt;Simple pitch. Surprisingly gnarly implementation. Here is what I ran into.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Problem: Text Injection
&lt;/h2&gt;

&lt;p&gt;The first question is how to get text &lt;em&gt;into&lt;/em&gt; an arbitrary application. You have a few options:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SendKeys / keybd_event&lt;/strong&gt; — The oldest approach. Simulate keypresses one character at a time. It works, but it is fragile. Fast injection can drop characters. Some applications intercept keystroke events and treat simulated input differently from real input. Rich text editors (Slack, for example) sometimes swallow synthetic keystrokes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clipboard + Paste&lt;/strong&gt; — Write text to the clipboard, then send &lt;code&gt;Ctrl+V&lt;/code&gt;. Faster than character-by-character SendKeys, more reliable for long strings. Downside: it clobbers whatever the user had on the clipboard. Users notice this. It also fails in apps that block clipboard paste in specific fields (some password managers, some login forms).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UI Automation (UIA)&lt;/strong&gt; — The Windows accessibility framework. You query the active window for its automation element, find the focused text control, and call &lt;code&gt;SetValue&lt;/code&gt; or &lt;code&gt;InsertText&lt;/code&gt; on it. This is the right tool for the job. It works with the application's actual text model, not just the keyboard event pipeline.&lt;/p&gt;

&lt;p&gt;I ended up using a combination: UI Automation as the primary method, with a clipboard-paste fallback for apps that do not expose full UIA support.&lt;/p&gt;
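&lt;p&gt;The decision logic is simple enough to sketch. The predicate names here (&lt;code&gt;supportsUIA&lt;/code&gt;, &lt;code&gt;isElevated&lt;/code&gt;, &lt;code&gt;allowsPaste&lt;/code&gt;) are illustrative stand-ins, not a real API:&lt;/p&gt;

```javascript
// Hypothetical sketch of the fallback order described above.
function chooseInjectionMethod(target) {
  // UIA cannot automate an elevated process from a normal-integrity
  // process, so synthetic keystrokes are the only (partial) option there.
  if (target.isElevated) return "sendkeys";
  // Primary path: work with the app's actual text model via UI Automation.
  if (target.supportsUIA) return "uia";
  // Fallback: clipboard paste is reliable for long strings, but it
  // clobbers whatever the user had on the clipboard.
  if (target.allowsPaste) return "clipboard";
  // Last resort: simulate keystrokes character by character.
  return "sendkeys";
}
```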

&lt;h2&gt;
  
  
  Windows UI Automation in Practice
&lt;/h2&gt;

&lt;p&gt;The UIA COM interfaces are available from any language that can call Win32/COM. From Electron (Node.js), I used &lt;code&gt;node-ffi-napi&lt;/code&gt; to call into &lt;code&gt;UIAutomationCore.dll&lt;/code&gt; directly. There are also npm packages such as &lt;code&gt;uiautomation&lt;/code&gt;, though the bindings are thin.&lt;/p&gt;

&lt;p&gt;The flow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. User presses hotkey
2. Store foreground window handle (GetForegroundWindow)
3. Record focused element (IUIAutomation::GetFocusedElement)
4. Start recording audio
5. User releases hotkey (or silence detected)
6. Send audio to Whisper API
7. Receive transcription
8. Restore focus to stored element
9. Call IValueProvider::SetValue or ITextProvider::InsertText
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 8 is important. By the time transcription comes back (1–2 seconds), the user may have clicked elsewhere. You need to restore focus to the original element before injecting, otherwise text goes to the wrong place.&lt;/p&gt;
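&lt;p&gt;The capture-and-restore idea looks roughly like this, written against a hypothetical &lt;code&gt;os&lt;/code&gt; adapter whose methods stand in for the real Win32/UIA calls:&lt;/p&gt;

```javascript
// Minimal sketch of focus tracking for asynchronous injection.
// `os` is a hypothetical adapter: getFocusedElement, setFocus and
// insertText are stand-ins for the real UIA calls, not actual APIs.
function makeDictationSession(os) {
  // Snapshot the focused element the moment the hotkey fires.
  const target = os.getFocusedElement();
  return {
    inject(text) {
      // Transcription took 1-2 seconds; the user may have clicked away.
      // Restore focus to the original element before injecting.
      if (os.getFocusedElement() !== target) {
        os.setFocus(target);
      }
      os.insertText(target, text);
    },
  };
}
```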

&lt;h2&gt;
  
  
  The Elevated Process Problem
&lt;/h2&gt;

&lt;p&gt;UI Automation has a security restriction: a process running at normal integrity cannot automate a process running at high integrity (elevated/administrator). This means if the user has an elevated terminal open and tries to dictate into it, the injection silently fails.&lt;/p&gt;

&lt;p&gt;The clean fix is to run your own process at high integrity. But that requires a UAC prompt on launch, which is a terrible user experience for a background tray app.&lt;/p&gt;

&lt;p&gt;The workaround I settled on: detect when the target is elevated (compare integrity levels via &lt;code&gt;GetTokenInformation&lt;/code&gt;), fall back to SendKeys in that case, and show a tooltip explaining the limitation. Not perfect, but honest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrating Groq Whisper
&lt;/h2&gt;

&lt;p&gt;For transcription, I chose Groq's Whisper API over running Whisper locally. The reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local Whisper (even &lt;code&gt;whisper.cpp&lt;/code&gt;) adds 500ms–2s of latency on mid-range hardware&lt;/li&gt;
&lt;li&gt;Groq's API returns in under a second for typical voice inputs&lt;/li&gt;
&lt;li&gt;Cost is approximately $0.02 per hour of audio at current pricing — negligible for dictation use&lt;/li&gt;
&lt;li&gt;No GPU required on the client machine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The audio pipeline is straightforward in Electron: &lt;code&gt;navigator.mediaDevices.getUserMedia&lt;/code&gt; for capture, encode to FLAC or MP3 (I use &lt;code&gt;lamejs&lt;/code&gt; for MP3 in the browser context), then a standard &lt;code&gt;multipart/form-data&lt;/code&gt; POST to &lt;code&gt;https://api.groq.com/openai/v1/audio/transcriptions&lt;/code&gt;.&lt;/p&gt;
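&lt;p&gt;The request itself is nothing exotic. A sketch of the upload step, assuming the audio has already been encoded into a &lt;code&gt;Blob&lt;/code&gt;:&lt;/p&gt;

```javascript
// Build the multipart/form-data request for Groq's OpenAI-compatible
// transcription endpoint. Field names follow the documented API shape.
const GROQ_URL = "https://api.groq.com/openai/v1/audio/transcriptions";

function buildTranscriptionRequest(audioBlob, apiKey) {
  const form = new FormData();
  form.append("file", audioBlob, "clip.mp3"); // the encoded recording
  form.append("model", "whisper-large-v3");
  return {
    url: GROQ_URL,
    options: {
      method: "POST",
      headers: { Authorization: "Bearer " + apiKey },
      body: form,
    },
  };
}

// Usage: const req = buildTranscriptionRequest(blob, key);
// then fetch(req.url, req.options) and read .text from the JSON response.
```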

&lt;p&gt;One thing worth knowing: Groq Whisper returns the full transcription as a single string. If you want word-level timestamps (useful for editing), you need to request &lt;code&gt;verbose_json&lt;/code&gt; response format and parse the segments.&lt;/p&gt;
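&lt;p&gt;A hypothetical helper for that case might look like this (the &lt;code&gt;start&lt;/code&gt;, &lt;code&gt;end&lt;/code&gt; and &lt;code&gt;text&lt;/code&gt; fields match Whisper's documented verbose output shape):&lt;/p&gt;

```javascript
// Flatten a verbose_json response into timestamped segment records.
function extractSegments(response) {
  return response.segments.map(function (seg) {
    return { start: seg.start, end: seg.end, text: seg.text.trim() };
  });
}
```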

&lt;h2&gt;
  
  
  Language and Runtime Choice
&lt;/h2&gt;

&lt;p&gt;I chose Electron because the app needed a system tray icon, global hotkey registration, and native Windows API access — and I wanted to move fast. The global hotkey is registered via &lt;code&gt;globalShortcut&lt;/code&gt; in Electron's main process. The UIA calls go through a small native addon.&lt;/p&gt;

&lt;p&gt;Electron apps are large (~150MB unpacked). That is the tradeoff. For a background utility that runs all day and stays out of the way, it is acceptable.&lt;/p&gt;

&lt;p&gt;If I were doing it again with more time, I would look at Tauri. The bundle size is dramatically smaller and the Rust backend makes Win32 interop cleaner. The tradeoff is a harder dev experience and fewer community examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Would Do Differently
&lt;/h2&gt;

&lt;p&gt;The biggest mistake early on was trusting SendKeys as the primary injection method. I spent two weeks tuning delay timings and handling edge cases before switching to UI Automation. UIA should have been first.&lt;/p&gt;

&lt;p&gt;The second mistake was not handling the focus/restore step from the start. Users reported text appearing in the wrong window and it took me longer than it should have to understand the race condition.&lt;/p&gt;

&lt;p&gt;If you are building something similar, start with UI Automation, implement focus tracking immediately, and treat SendKeys as a last resort. The accessibility APIs exist precisely for this use case.&lt;/p&gt;

&lt;p&gt;The finished app is &lt;a href="https://dictate-app.pages.dev" rel="noopener noreferrer"&gt;Dictate for Windows&lt;/a&gt; if you want to see the end result.&lt;/p&gt;

</description>
      <category>windows</category>
      <category>javascript</category>
      <category>electron</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Groq vs OpenAI Whisper: Real Benchmarks for Voice Transcription (2026)</title>
      <dc:creator>How Minds Work</dc:creator>
      <pubDate>Thu, 07 May 2026 05:21:49 +0000</pubDate>
      <link>https://dev.to/howmindswork/groq-vs-openai-whisper-real-benchmarks-for-voice-transcription-2026-46lk</link>
      <guid>https://dev.to/howmindswork/groq-vs-openai-whisper-real-benchmarks-for-voice-transcription-2026-46lk</guid>
      <description>&lt;p&gt;I've been building &lt;a href="https://dictate.app" rel="noopener noreferrer"&gt;dictate.app&lt;/a&gt; — a Windows dictation tool — and the biggest decision early on was which Whisper API to use. I ran both Groq and OpenAI through real-world testing. Here's what the numbers actually look like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Whisper APIs, Not Local Models
&lt;/h2&gt;

&lt;p&gt;Local Whisper (running on your machine) is free but slow unless you have a GPU. For a dictation tool where latency is everything, you want a hosted API. The two main options in 2026 are OpenAI's Whisper endpoint and Groq's Whisper endpoint.&lt;/p&gt;

&lt;p&gt;Both run the same underlying model family (Whisper large-v3). The difference is infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Latency: The Real-World Numbers
&lt;/h2&gt;

&lt;p&gt;I tested with audio clips of varying lengths — 5 seconds, 15 seconds, 30 seconds, and 60 seconds — and measured round-trip time from sending the request to receiving the transcription.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Clip Length&lt;/th&gt;
&lt;th&gt;Groq&lt;/th&gt;
&lt;th&gt;OpenAI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5 seconds&lt;/td&gt;
&lt;td&gt;~180ms&lt;/td&gt;
&lt;td&gt;~750ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15 seconds&lt;/td&gt;
&lt;td&gt;~210ms&lt;/td&gt;
&lt;td&gt;~820ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;30 seconds&lt;/td&gt;
&lt;td&gt;~260ms&lt;/td&gt;
&lt;td&gt;~1100ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;60 seconds&lt;/td&gt;
&lt;td&gt;~380ms&lt;/td&gt;
&lt;td&gt;~1800ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Groq is consistently &lt;strong&gt;4-5x faster&lt;/strong&gt;. For a dictation app, this is the difference between feeling instant and feeling like you're waiting.&lt;/p&gt;

&lt;p&gt;The latency gap comes from Groq's LPU (Language Processing Unit) hardware. These chips are purpose-built for inference and deliver dramatically lower time-to-first-token compared to GPU clusters.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Call Each API
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Groq Whisper
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;Groq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;groq-sdk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;groq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Groq&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;GROQ_API_KEY&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;transcribeWithGroq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audioFilePath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;transcription&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;groq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transcriptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createReadStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audioFilePath&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;whisper-large-v3&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;en&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;response_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Groq latency: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;ms`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  OpenAI Whisper
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OPENAI_API_KEY&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;transcribeWithOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audioFilePath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;transcription&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transcriptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createReadStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audioFilePath&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;whisper-1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;en&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;response_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`OpenAI latency: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;ms`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API shapes are nearly identical — switching between them is about 3 lines of code.&lt;/p&gt;
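&lt;p&gt;A minimal sketch of what that switch looks like in practice; the base URLs and model names below reflect the two APIs as I used them, but verify them against each provider's current docs:&lt;/p&gt;

```javascript
// The only per-provider differences: base URL, API key variable, and model name.
// Groq exposes an OpenAI-compatible endpoint, so the rest of the call is identical.
function transcriptionConfig(provider) {
  if (provider === "groq") {
    return {
      baseURL: "https://api.groq.com/openai/v1",
      apiKeyEnv: "GROQ_API_KEY",
      model: "whisper-large-v3",
    };
  }
  return {
    baseURL: "https://api.openai.com/v1",
    apiKeyEnv: "OPENAI_API_KEY",
    model: "whisper-1",
  };
}
```

&lt;p&gt;Pass the result into the SDK constructor and the transcription call itself stays byte-for-byte the same.&lt;/p&gt;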

&lt;h2&gt;
  
  
  Cost Comparison
&lt;/h2&gt;

&lt;p&gt;This is where Groq wins by a landslide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Groq Whisper pricing:&lt;/strong&gt; $0.02 per hour of audio&lt;br&gt;
&lt;strong&gt;OpenAI Whisper pricing:&lt;/strong&gt; $0.006 per minute = $0.36 per hour of audio&lt;/p&gt;

&lt;p&gt;That's an &lt;strong&gt;18x cost difference&lt;/strong&gt; for the same model.&lt;/p&gt;

&lt;p&gt;For a power user dictating 2 hours a day:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Groq: $0.04/day, $1.20/month&lt;/li&gt;
&lt;li&gt;OpenAI: $0.72/day, $21.60/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a SaaS app with 1,000 users each dictating 30 minutes a day:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Groq: ~$300/month&lt;/li&gt;
&lt;li&gt;OpenAI: ~$5,400/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unless you're already deeply locked into the OpenAI ecosystem, the cost math is hard to ignore.&lt;/p&gt;
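&lt;p&gt;The math above is easy to reproduce. A quick sketch, working in cents per hour of audio so the arithmetic stays exact:&lt;/p&gt;

```javascript
// Rates in cents per hour of audio:
// Groq $0.02/hr = 2 cents; OpenAI $0.006/min = $0.36/hr = 36 cents.
function monthlyCostCents(centsPerHour, hoursPerUserPerDay, users = 1, days = 30) {
  return centsPerHour * hoursPerUserPerDay * users * days;
}

console.log(monthlyCostCents(2, 2));          // power user on Groq: 120 cents = $1.20/month
console.log(monthlyCostCents(36, 2));         // power user on OpenAI: 2160 cents = $21.60/month
console.log(monthlyCostCents(2, 0.5, 1000));  // 1,000-user SaaS on Groq: 30000 cents = $300/month
console.log(monthlyCostCents(36, 0.5, 1000)); // on OpenAI: 540000 cents = $5,400/month
```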
&lt;h2&gt;
  
  
  Accuracy Comparison
&lt;/h2&gt;

&lt;p&gt;This is where things get more nuanced. Both providers serve large Whisper models (Groq runs large-v3, while OpenAI's &lt;code&gt;whisper-1&lt;/code&gt; endpoint is based on large-v2), so accuracy should be similar in theory. In practice, I noticed differences on:&lt;/p&gt;
&lt;h3&gt;
  
  
  Technical Terms and Proper Nouns
&lt;/h3&gt;

&lt;p&gt;I tested dictating content with technical vocabulary — programming terms, product names, developer jargon.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Groq:&lt;/strong&gt; Occasionally struggles with very niche technical terms, especially compound words and camelCase concepts spoken aloud.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI:&lt;/strong&gt; Marginally better on highly technical vocabulary, likely due to fine-tuning or post-processing on their side.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For everyday English, both are excellent. For dictating code-heavy content, the gap is real but small.&lt;/p&gt;
&lt;h3&gt;
  
  
  Punctuation and Formatting
&lt;/h3&gt;

&lt;p&gt;Neither API auto-inserts punctuation without prompting. You need to say "period", "comma", etc. or post-process with an LLM. This is the same for both.&lt;/p&gt;
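&lt;p&gt;If you go the spoken-command route, the post-processing can be as simple as a substitution table. A rough sketch (a real implementation would also need capitalization handling and an escape for dictating the literal words "period" and "comma"):&lt;/p&gt;

```javascript
// Map spoken punctuation commands to punctuation marks.
// The leading \s* also swallows the space before the command word.
const SPOKEN = [
  [/\s*\bnew line\b/gi, "\n"],
  [/\s*\bperiod\b/gi, "."],
  [/\s*\bcomma\b/gi, ","],
  [/\s*\bquestion mark\b/gi, "?"],
];

function applySpokenPunctuation(text) {
  let out = text;
  for (const [pattern, mark] of SPOKEN) out = out.replace(pattern, mark);
  return out;
}

console.log(applySpokenPunctuation("hello comma world period"));
// "hello, world."
```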
&lt;h3&gt;
  
  
  Noise Handling
&lt;/h3&gt;

&lt;p&gt;Both handle moderate background noise well. Neither is great with significant ambient noise — you'll want to denoise before sending if your recording environment is rough.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Streaming Question
&lt;/h2&gt;

&lt;p&gt;Neither Groq nor OpenAI Whisper supports true streaming transcription through these REST APIs. You send a complete audio file, wait, get text back. For a dictation tool, this means you need to chunk your audio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Record in chunks, transcribe each chunk&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;CHUNK_DURATION_MS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// 5-second chunks&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;startChunkedDictation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;onTranscript&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;currentChunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

  &lt;span class="nx"&gt;recorder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;data&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nf"&gt;setInterval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;splice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;audioBuffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;transcribeWithGroq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audioBuffer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;onTranscript&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="nx"&gt;CHUNK_DURATION_MS&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Groq's ~200ms latency, a 5-second chunk transcribes in ~200ms after the chunk ends — giving you text about 5.2 seconds behind real-time. With OpenAI's ~800ms latency, that's 5.8 seconds. Not a huge difference at this chunk size, but if you shorten chunks to 2-3 seconds for lower latency, the difference grows.&lt;/p&gt;
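&lt;p&gt;The delay arithmetic is simple: the last word spoken in a chunk waits for the chunk to close, then for the API round trip:&lt;/p&gt;

```javascript
// Worst-case delay between speaking a word and seeing its text,
// for the final word in a chunk.
function textDelayMs(chunkMs, apiLatencyMs) {
  return chunkMs + apiLatencyMs;
}

console.log(textDelayMs(5000, 200)); // Groq, 5s chunks: 5200
console.log(textDelayMs(5000, 800)); // OpenAI, 5s chunks: 5800
console.log(textDelayMs(2000, 200)); // Groq, 2s chunks: 2200
console.log(textDelayMs(2000, 800)); // OpenAI, 2s chunks: 2800
```

&lt;p&gt;At 2-second chunks the API latency becomes a meaningful fraction of the total, which is where the provider gap starts to be felt.&lt;/p&gt;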

&lt;h2&gt;
  
  
  My Recommendation
&lt;/h2&gt;

&lt;p&gt;For most dictation and voice-to-text use cases in 2026: &lt;strong&gt;use Groq&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4-5x lower latency&lt;/li&gt;
&lt;li&gt;18x lower cost&lt;/li&gt;
&lt;li&gt;Accuracy is equivalent for 95% of use cases&lt;/li&gt;
&lt;li&gt;API is near-identical to OpenAI's — easy to switch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The only reason to choose OpenAI Whisper:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're already paying for an OpenAI subscription and usage is low&lt;/li&gt;
&lt;li&gt;Your use case involves heavy technical jargon where that marginal accuracy edge matters&lt;/li&gt;
&lt;li&gt;You need OpenAI's ecosystem integrations (Assistants API, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://dictate.app" rel="noopener noreferrer"&gt;dictate.app&lt;/a&gt; uses Groq as the primary transcription backend with OpenAI as a fallback. In production, we've seen Groq handle over 95% of requests with no issues.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Benchmarks run in April 2026 from a US-East server. Latency figures are median across 50 requests per category. Your numbers may vary based on geography and API load.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>tutorial</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How I Inject Text Into Any Windows App (Including Elevated Processes)</title>
      <dc:creator>How Minds Work</dc:creator>
      <pubDate>Thu, 07 May 2026 05:21:06 +0000</pubDate>
      <link>https://dev.to/howmindswork/how-i-inject-text-into-any-windows-app-including-elevated-processes-4jl6</link>
      <guid>https://dev.to/howmindswork/how-i-inject-text-into-any-windows-app-including-elevated-processes-4jl6</guid>
      <description>&lt;p&gt;Building a dictation app for Windows sounds simple until you try to actually get text into other applications. After shipping &lt;a href="https://dictate.app" rel="noopener noreferrer"&gt;dictate.app&lt;/a&gt;, I learned more about Windows text injection than I ever wanted to know. Here's the full picture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;You've transcribed audio to text. Now you need to insert that text wherever the user's cursor is — Notepad, VS Code, Excel, a chat app, a browser field, or a terminal running as Administrator. Each app handles input differently. Some block you entirely.&lt;/p&gt;

&lt;p&gt;There is no single API that works everywhere. You need a layered approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Method 1: SendInput (Win32)
&lt;/h2&gt;

&lt;p&gt;The most direct route. &lt;code&gt;SendInput&lt;/code&gt; injects keyboard events at the OS level, simulating actual keypresses.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Electron / Node.js using ffi-napi to call Win32&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ffi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ffi-napi&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ref&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ref-napi&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;user32&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ffi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user32&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;SendInput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;uint&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;uint&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pointer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;int&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;sendChar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;char&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;INPUT_KEYBOARD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;KEYEVENTF_UNICODE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mh"&gt;0x0004&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;KEYEVENTF_KEYUP&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mh"&gt;0x0002&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Build INPUT struct for keydown + keyup&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;buf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// 2 INPUT structs&lt;/span&gt;
  &lt;span class="c1"&gt;// ... fill struct fields&lt;/span&gt;
  &lt;span class="nx"&gt;user32&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SendInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works well for standard apps. The catch: it types character by character, which is slow for long transcriptions and can misfire if the user moves focus mid-injection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Works for:&lt;/strong&gt; Most desktop apps running at the same privilege level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fails for:&lt;/strong&gt; Elevated processes (apps running as Administrator), games with anti-cheat, some terminals.&lt;/p&gt;

&lt;h2&gt;
  
  
  Method 2: UI Automation (UIAutomation API)
&lt;/h2&gt;

&lt;p&gt;Microsoft's UIAutomation framework lets you interact with app controls directly — no simulated keypresses. You find the focused element and set its value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Using edge-js or a native addon to call UIAutomation COM interfaces&lt;/span&gt;
&lt;span class="c1"&gt;// Pseudocode — actual implementation uses COM interop&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;focusedElement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;automation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GetFocusedElement&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;valuePattern&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;focusedElement&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GetCurrentPattern&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;UIA_ValuePatternId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;valuePattern&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;valuePattern&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SetValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is cleaner than SendInput — it sets the value atomically, no per-character latency. Accessibility tools like screen readers use this same path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Works for:&lt;/strong&gt; Apps that expose UIA ValuePattern — most native Windows controls, some Electron apps, Office.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fails for:&lt;/strong&gt; Custom-drawn controls, Chromium-based apps (they partially support UIA but it's inconsistent), elevated processes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Method 3: WM_PASTE (Windows Messages)
&lt;/h2&gt;

&lt;p&gt;Another approach: put text on the clipboard, then send &lt;code&gt;WM_PASTE&lt;/code&gt; directly to the target window.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;clipboard&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;electron&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;user32&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ffi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user32&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;PostMessage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;bool&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pointer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;uint&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pointer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pointer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
  &lt;span class="na"&gt;GetForegroundWindow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pointer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;pasteText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;clipboard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hwnd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;user32&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GetForegroundWindow&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;WM_PASTE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mh"&gt;0x0302&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;user32&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PostMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hwnd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;WM_PASTE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is fast and reliable for text editors, but many apps ignore &lt;code&gt;WM_PASTE&lt;/code&gt; entirely. Rich text editors handle it differently from plain text fields. And if the user has something important on their clipboard — it's now gone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Works for:&lt;/strong&gt; Notepad, WordPad, some chat apps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fails for:&lt;/strong&gt; Browsers, terminals, most modern apps that handle paste internally.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hard Part: Elevated Processes
&lt;/h2&gt;

&lt;p&gt;Here's where things get painful. Windows has a security boundary called UIPI — User Interface Privilege Isolation. An app running at medium integrity level (normal user) &lt;strong&gt;cannot send input events&lt;/strong&gt; to a process running at high integrity (Administrator).&lt;/p&gt;

&lt;p&gt;This means if the user has a terminal open as Admin, or a system utility elevated via UAC, &lt;code&gt;SendInput&lt;/code&gt; calls silently fail. No error. The keystrokes just vanish.&lt;/p&gt;

&lt;p&gt;UIAutomation has the same restriction. Cross-process UIA calls across integrity levels are blocked.&lt;/p&gt;

&lt;p&gt;Your options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Run your own app as Administrator&lt;/strong&gt; — terrible UX, requires UAC prompt on launch, massive security footprint&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use a system-level hook&lt;/strong&gt; — requires a kernel driver or at minimum an elevated service, complex to sign and deploy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clipboard injection&lt;/strong&gt; — the practical solution&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Reliable Fallback: Clipboard Injection
&lt;/h2&gt;

&lt;p&gt;When everything else fails, clipboard-based injection works across privilege boundaries because clipboard access is not subject to UIPI.&lt;/p&gt;

&lt;p&gt;The flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Save the current clipboard contents&lt;/li&gt;
&lt;li&gt;Write the transcribed text to clipboard&lt;/li&gt;
&lt;li&gt;Send &lt;code&gt;Ctrl+V&lt;/code&gt; via &lt;code&gt;SendInput&lt;/code&gt; (this works even to elevated windows — keyboard events from a lower-privilege app CAN reach elevated apps via SendInput, only window messages are blocked)&lt;/li&gt;
&lt;li&gt;Restore the original clipboard contents after a short delay
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;clipboard&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;electron&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;injectViaClipboard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Save original&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;original&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;clipboard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readText&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// Write transcription&lt;/span&gt;
  &lt;span class="nx"&gt;clipboard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Small delay to ensure clipboard is set&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Send Ctrl+V&lt;/span&gt;
  &lt;span class="nf"&gt;sendKeyCombination&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ctrl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;v&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Restore after paste completes&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;clipboard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;original&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait — I said &lt;code&gt;SendInput&lt;/code&gt; fails for elevated processes. That's true for individual character keystrokes in many cases, but &lt;code&gt;Ctrl+V&lt;/code&gt; as a synthesized keystroke still reaches elevated windows because it goes through the global keyboard input queue, not window message routing. The behavior is subtle and depends on the specific Windows version and app.&lt;/p&gt;

&lt;p&gt;In practice, clipboard + Ctrl+V is the most reliable method across the widest range of apps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tradeoffs of clipboard injection:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Briefly overwrites clipboard (restored after ~150ms, but race conditions exist)&lt;/li&gt;
&lt;li&gt;Doesn't work if the app has a custom paste handler that ignores Ctrl+V&lt;/li&gt;
&lt;li&gt;If the app is slow to respond, the original clipboard restore can happen before the paste completes&lt;/li&gt;
&lt;/ul&gt;
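&lt;p&gt;The first race is partly avoidable: only restore if the clipboard still holds what you injected. A sketch with the clipboard interface injected for testability (Electron's &lt;code&gt;clipboard&lt;/code&gt; has matching &lt;code&gt;readText&lt;/code&gt;/&lt;code&gt;writeText&lt;/code&gt; methods):&lt;/p&gt;

```javascript
// Race-aware clipboard restore: if the user (or another app) changed the
// clipboard since we wrote to it, leave their content alone.
function restoreClipboardIfUnchanged(clip, injectedText, originalText) {
  if (clip.readText() === injectedText) {
    clip.writeText(originalText);
    return true; // restored
  }
  return false; // clipboard changed underneath us; do not clobber it
}
```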

&lt;h2&gt;
  
  
  What dictate.app Does
&lt;/h2&gt;

&lt;p&gt;The injection order in &lt;a href="https://dictate.app" rel="noopener noreferrer"&gt;dictate.app&lt;/a&gt; is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Try UIAutomation ValuePattern (fastest, no clipboard disruption)&lt;/li&gt;
&lt;li&gt;Fall back to SendInput character-by-character (works for most apps)&lt;/li&gt;
&lt;li&gt;Fall back to clipboard injection (handles elevated processes and edge cases)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The fallback chain runs automatically. Users never see it — they just see their text appear.&lt;/p&gt;
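&lt;p&gt;The chain itself can be sketched as a loop over method functions that each report success or failure; this mirrors the ordering above but is not dictate.app's actual source:&lt;/p&gt;

```javascript
// Try each injection method in order until one succeeds.
// Each method is an async function returning true on success; a method that
// throws is treated the same as one that returns false.
async function injectText(text, methods) {
  for (const method of methods) {
    try {
      if (await method(text)) return method.name; // report which method won
    } catch {
      // a failing method just means we move on to the next one
    }
  }
  throw new Error("all injection methods failed");
}
```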

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;The UIAutomation path deserves more investment. For apps that support it, it's the cleanest solution — atomic, fast, no clipboard side effects. The challenge is that Chromium-based apps (Electron, Chrome, Edge) have inconsistent UIA support, and a huge percentage of Windows apps are now Electron-based.&lt;/p&gt;

&lt;p&gt;For truly bulletproof injection across all scenarios including kernel-level game anti-cheat and maximum-security environments, a signed kernel driver is the real answer. But that's a significant engineering and signing overhead that's hard to justify for a productivity tool.&lt;/p&gt;

&lt;p&gt;Clipboard injection with careful save/restore covers 95%+ of real-world cases. The other 5% tends to be niche enough that users don't file bug reports.&lt;/p&gt;




&lt;p&gt;If you're building something that needs to inject text into Windows apps, I hope this saves you the week of debugging it cost me. And if you just want dictation that works — &lt;a href="https://dictate.app" rel="noopener noreferrer"&gt;dictate.app&lt;/a&gt; handles all of this for you.&lt;/p&gt;

</description>
      <category>windows</category>
      <category>javascript</category>
      <category>electron</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Why I switched from Dragon NaturallySpeaking to Whisper API (and built my own app)</title>
      <dc:creator>How Minds Work</dc:creator>
      <pubDate>Thu, 07 May 2026 04:55:54 +0000</pubDate>
      <link>https://dev.to/howmindswork/why-i-switched-from-dragon-naturallyspeaking-to-whisper-api-and-built-my-own-app-53bi</link>
      <guid>https://dev.to/howmindswork/why-i-switched-from-dragon-naturallyspeaking-to-whisper-api-and-built-my-own-app-53bi</guid>
      <description>&lt;h1&gt;
  
  
  Why I switched from Dragon NaturallySpeaking to Whisper API (and built my own app)
&lt;/h1&gt;

&lt;p&gt;I used Dragon NaturallySpeaking for years. It was the gold standard — everyone said so. Then I spent a weekend with Whisper and realized the gap had closed in a way Nuance wasn't advertising.&lt;/p&gt;

&lt;p&gt;This post is for people evaluating modern speech-to-text options for real work. I'll go technical where it matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Dragon gets right
&lt;/h2&gt;

&lt;p&gt;Let's be fair. Dragon's strengths are real:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;On-device processing&lt;/strong&gt;: No audio leaves your machine. For legal, medical, or confidential work, this matters enormously.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Commands and macros&lt;/strong&gt;: "Click File", "Select that", "Delete previous word" — Dragon's voice command layer is genuinely powerful and has no Whisper equivalent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-session accuracy&lt;/strong&gt;: Dragon can adapt to your voice over time. It learns your vocabulary, your accent, your quirks. Whisper doesn't personalize.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Windows integration depth&lt;/strong&gt;: Dragon hooks deep into Office apps with application-specific plugins.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you need voice commands to control your whole computer, Dragon is still the answer. This comparison is purely about transcription quality for dictating text.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Whisper changed the math
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Accuracy on technical vocabulary
&lt;/h3&gt;

&lt;p&gt;Dragon struggles with words it hasn't been trained on. You can add custom vocabulary, but it's a friction point every time you hit a new term. Whisper's approach is fundamentally different — it was trained on 680,000 hours of multilingual audio from across the internet, which means it's seen an enormous variety of technical vocabulary, names, and jargon already.&lt;/p&gt;

&lt;p&gt;Testing on a sample of 50 developer-typical sentences (variable names spoken aloud, API endpoint names, library references):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dragon: ~88% word accuracy&lt;/li&gt;
&lt;li&gt;Whisper Large v3 (via Groq): ~96% word accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gap matters most at the edges — the uncommon words where errors are most disruptive.&lt;/p&gt;
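&lt;p&gt;For transparency, here is roughly how a word-accuracy number like that can be computed. This positional version is a simplification; published benchmarks use edit-distance-based word error rate:&lt;/p&gt;

```javascript
// Fraction of reference words that the hypothesis got right at the same
// position (a rough stand-in for 1 - WER; no insertion/deletion alignment).
function wordAccuracy(reference, hypothesis) {
  const ref = reference.toLowerCase().split(/\s+/);
  const hyp = hypothesis.toLowerCase().split(/\s+/);
  const hits = ref.filter((word, i) => hyp[i] === word).length;
  return hits / ref.length;
}

console.log(wordAccuracy("fetch the user record", "fetch the user records"));
// 0.75
```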

&lt;h3&gt;
  
  
  The setup cost
&lt;/h3&gt;

&lt;p&gt;Dragon requires a training session. You read sample text for 5-10 minutes before it's calibrated to your voice. Whisper needs nothing. You hit record and it just works, for any speaker.&lt;/p&gt;

&lt;h3&gt;
  
  
  Price
&lt;/h3&gt;

&lt;p&gt;Dragon Professional Individual: &lt;strong&gt;$500 one-time&lt;/strong&gt; (or $15/month subscription). Updates have historically cost money.&lt;/p&gt;

&lt;p&gt;Groq Whisper API: &lt;strong&gt;$0.04/hour of audio&lt;/strong&gt;. At 30 min/day of dictation that's roughly $0.60/month in API costs.&lt;/p&gt;
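&lt;p&gt;That monthly figure is straight arithmetic on the hourly rate:&lt;/p&gt;

```javascript
// Monthly API cost at a given daily dictation load, using Groq's ~$0.04/hour
// Whisper pricing quoted above. Function name is mine.
function monthlyCost(minutesPerDay, dollarsPerHour = 0.04, daysPerMonth = 30) {
  return (minutesPerDay / 60) * daysPerMonth * dollarsPerHour;
}

// monthlyCost(30)  ≈ $0.60/month
// monthlyCost(120) ≈ $2.40/month, even for heavy users
```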

&lt;p&gt;The managed version I built (&lt;a href="https://dictate-app.pages.dev" rel="noopener noreferrer"&gt;dictate.app&lt;/a&gt;) wraps this for $9/month.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Whisper API call actually looks like
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Groq&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;groq-sdk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;groq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Groq&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;GROQ_API_KEY&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;transcribeAudio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audioFilePath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;audioFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createReadStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audioFilePath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;transcription&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;groq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transcriptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;audioFile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;whisper-large-v3&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;response_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;verbose_json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;en&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
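&lt;p&gt;One practical note before wiring this up: the hosted endpoint caps upload size, on the order of 25 MB per file in the docs at the time of writing. Treat that constant as an assumption and verify it against Groq's current documentation; a cheap pre-check avoids a wasted round trip:&lt;/p&gt;

```javascript
// Guard against oversized uploads before calling the API.
// MAX_UPLOAD_BYTES reflects the ~25 MB cap documented for the hosted
// endpoint; confirm the current limit before relying on it.
const MAX_UPLOAD_BYTES = 25 * 1024 * 1024;

function assertUploadable(path, sizeBytes) {
  if (sizeBytes > MAX_UPLOAD_BYTES) {
    throw new Error(`${path}: ${sizeBytes} bytes exceeds the upload cap; split the audio first`);
  }
  return true;
}
```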



&lt;p&gt;For real-time feel, you chunk the audio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;streamTranscription&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audioStream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;CHUNK_MS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nx"&gt;audioStream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;data&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;SAMPLE_RATE&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;CHUNK_MS&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;transcribeChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
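&lt;p&gt;One caveat in that sketch: awaiting inside the &lt;code&gt;data&lt;/code&gt; handler doesn't guarantee ordering, because a later chunk's API response can come back before an earlier one. Serializing emission through a promise chain fixes it (&lt;code&gt;transcribe&lt;/code&gt; stands in for the API call; the helper name is mine):&lt;/p&gt;

```javascript
// Serializes per-chunk transcription so text is emitted in spoken order,
// even if a later chunk's API response arrives before an earlier one.
// `transcribe` stands in for the Groq call; `emit` writes the text out.
function makeOrderedEmitter(transcribe, emit) {
  let tail = Promise.resolve();
  return function enqueue(chunk) {
    tail = tail
      .then(() => transcribe(chunk))
      .then((text) => emit(text + " "))
      .catch((err) => emit(`[chunk failed: ${err.message}] `));
    return tail;
  };
}
```

&lt;p&gt;This trades a little latency on fast responses for output that never scrambles mid-sentence.&lt;/p&gt;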



&lt;p&gt;Latency comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Avg latency (5s clip)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Groq&lt;/td&gt;
&lt;td&gt;~280ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;~1100ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local Whisper (GPU)&lt;/td&gt;
&lt;td&gt;~400ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local Whisper (CPU)&lt;/td&gt;
&lt;td&gt;~8000ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Groq's LPU hardware is the reason for those numbers — not software tricks.&lt;/p&gt;
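&lt;p&gt;The numbers in the table came from repeated timed calls against the same clip, along these lines (the measurement harness is mine; &lt;code&gt;fn&lt;/code&gt; is any async transcription call):&lt;/p&gt;

```javascript
// Times an async call over several trials and returns the mean latency in
// milliseconds. process.hrtime.bigint() gives nanosecond resolution in Node.
async function meanLatencyMs(fn, trials = 5) {
  let totalMs = 0;
  for (let i = 0; i < trials; i++) {
    const start = process.hrtime.bigint();
    await fn();
    totalMs += Number(process.hrtime.bigint() - start) / 1e6;
  }
  return totalMs / trials;
}

// Hypothetical usage:
// await meanLatencyMs(() => groq.audio.transcriptions.create({ ... }));
```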

&lt;h2&gt;
  
  
  The tradeoffs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I miss Dragon's commands.&lt;/strong&gt; Voice commands for formatting and navigation are genuinely powerful. Whisper transcribes only — no control layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I don't miss Dragon's software.&lt;/strong&gt; Massive install, dated UI, fragile updates. Whisper is a REST endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Privacy is a real tradeoff.&lt;/strong&gt; Audio leaves the machine via Groq's API. Groq's policy says it's not stored after transcription, but if you're in a regulated industry, Dragon's on-device model is still the compliance-safe choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I built
&lt;/h2&gt;

&lt;p&gt;After this evaluation I built &lt;a href="https://dictate-app.pages.dev" rel="noopener noreferrer"&gt;dictate.app&lt;/a&gt; — a Windows system tray app wrapping Groq's Whisper with a hotkey interface. Press a key, talk, release, text appears wherever your cursor is. $9/month, Windows 10 and 11.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;p&gt;Voice commands + compliance + on-device: Dragon.&lt;/p&gt;

&lt;p&gt;High-accuracy transcription at low cost with zero setup: Whisper via Groq, and it's not close anymore.&lt;/p&gt;

&lt;p&gt;For pure dictation, Whisper won.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>windows</category>
      <category>ai</category>
      <category>javascript</category>
    </item>
    <item>
      <title>I built a Windows dictation app with Groq Whisper — here's what I learned</title>
      <dc:creator>How Minds Work</dc:creator>
      <pubDate>Thu, 07 May 2026 04:54:30 +0000</pubDate>
      <link>https://dev.to/howmindswork/i-built-a-windows-dictation-app-with-groq-whisper-heres-what-i-learned-1l1c</link>
      <guid>https://dev.to/howmindswork/i-built-a-windows-dictation-app-with-groq-whisper-heres-what-i-learned-1l1c</guid>
      <description>&lt;h1&gt;
  
  
  I built a Windows dictation app with Groq Whisper — here's what I learned
&lt;/h1&gt;

&lt;p&gt;I've been a bad typist my whole life. Not slow — just error-prone. I spend more time correcting than creating. So a few months ago I decided to build my own Windows dictation app powered by Groq's Whisper API. What shipped is &lt;a href="https://dictate-app.pages.dev" rel="noopener noreferrer"&gt;dictate.app&lt;/a&gt;, and the journey taught me more than I expected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not just use Windows built-in dictation?
&lt;/h2&gt;

&lt;p&gt;Windows has had dictation since Windows 10. It works okay — until it doesn't. The accuracy drops on technical vocabulary, it doesn't handle punctuation well without training, and you can't pipe the output anywhere cleanly. I wanted something that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Worked in any app, not just Microsoft ones&lt;/li&gt;
&lt;li&gt;Had real-time transcription, not batch&lt;/li&gt;
&lt;li&gt;Used a modern model, not a 2018-era acoustic model&lt;/li&gt;
&lt;li&gt;Cost almost nothing per use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Groq's Whisper API checked every box.&lt;/p&gt;

&lt;h2&gt;
  
  
  The technical stack
&lt;/h2&gt;

&lt;p&gt;The app is a lightweight Windows system tray application. Here's the core flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Press a hotkey (customizable)&lt;/li&gt;
&lt;li&gt;Audio is captured from the default mic using the Windows audio APIs&lt;/li&gt;
&lt;li&gt;Audio is chunked and sent to Groq's Whisper endpoint&lt;/li&gt;
&lt;li&gt;Transcribed text is injected directly into whatever input field is focused&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Groq API call itself is dead simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;transcription&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;groq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transcriptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;audioFile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;whisper-large-v3&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;response_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;en&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The model does the heavy lifting. The tricky parts were all Windows-specific.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I got surprised
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Simulating keystrokes is harder than it looks
&lt;/h3&gt;

&lt;p&gt;Injecting text into arbitrary Windows apps sounds trivial. It's not. Different apps handle keyboard events differently. Some respond to &lt;code&gt;SendInput&lt;/code&gt;, some need &lt;code&gt;WM_CHAR&lt;/code&gt; messages, some (looking at you, certain Electron apps) need both. I ended up building a small compatibility layer that tries methods in order and falls back gracefully.&lt;/p&gt;
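&lt;p&gt;The layer's shape, stripped of the Win32 plumbing (the method functions here are placeholders for the native bindings; only the fallback logic is shown):&lt;/p&gt;

```javascript
// Tries each injection method in order until one reports success.
// In the real app the methods wrap SendInput, WM_CHAR posting, and a
// clipboard-paste fallback via native bindings; here they are stand-ins.
async function injectText(text, methods) {
  for (const method of methods) {
    try {
      if (await method(text)) return method.name || "unnamed";
    } catch {
      // This method crashed against this app; fall through to the next.
    }
  }
  throw new Error("no injection method accepted the text");
}
```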

&lt;h3&gt;
  
  
  Latency matters more than accuracy
&lt;/h3&gt;

&lt;p&gt;I assumed accuracy would be the thing users cared most about. I was wrong. Latency was the real killer. If there's more than ~1.5 seconds between you stopping speaking and the text appearing, the UX feels broken — even if the transcription is perfect. Groq's speed advantage over OpenAI Whisper here is dramatic. On identical audio clips, Groq returns in ~300ms vs ~1200ms on OpenAI's API. That gap is the entire difference between the app feeling native and feeling laggy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Background audio capture on Windows is a minefield
&lt;/h3&gt;

&lt;p&gt;Capturing audio while other apps are running means navigating Windows audio session management. I hit exclusivity conflicts with certain pro audio setups. The fix was adding a configurable audio device selector — power users who have weird audio routing can specify exactly which device to use.&lt;/p&gt;

&lt;h3&gt;
  
  
  System tray UX has its own conventions
&lt;/h3&gt;

&lt;p&gt;Windows users have strong expectations about tray apps. They should:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start minimized&lt;/li&gt;
&lt;li&gt;Show a meaningful context menu on right-click&lt;/li&gt;
&lt;li&gt;Not hijack focus&lt;/li&gt;
&lt;li&gt;Not spawn a console window&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Violate any of these and people feel like something is wrong, even if they can't articulate why.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Offline fallback.&lt;/strong&gt; When Groq's API is unreachable (VPN, firewall, offline), the app just fails. I'm adding a local Whisper model fallback — heavier, slower, but it works without internet.&lt;/p&gt;
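&lt;p&gt;The fallback I'm adding reduces to this shape (both transcribe functions are placeholders for the real Groq call and a local Whisper model):&lt;/p&gt;

```javascript
// Prefer the hosted API; fall back to a local model when the call fails.
// `groqTranscribe` and `localTranscribe` are stand-ins for the real
// implementations. Returning the source lets the UI flag degraded mode.
async function transcribeWithFallback(audio, groqTranscribe, localTranscribe) {
  try {
    return { text: await groqTranscribe(audio), source: "groq" };
  } catch {
    // Unreachable API (VPN, firewall, offline): take the slower local path.
    return { text: await localTranscribe(audio), source: "local" };
  }
}
```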

&lt;p&gt;&lt;strong&gt;Better onboarding.&lt;/strong&gt; First-run experience is terrible. I dumped people into a settings screen. Users want to press a button and hear it work within 30 seconds. I'm rebuilding the first-run flow to be a literal one-click demo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Usage analytics (opt-in).&lt;/strong&gt; I have no idea which features people actually use. Adding privacy-respecting, opt-in telemetry to guide future decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pricing model
&lt;/h2&gt;

&lt;p&gt;I landed on $9/month. The reasoning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Groq's Whisper API costs are roughly $0.04–0.08/hour of audio depending on volume&lt;/li&gt;
&lt;li&gt;Heavy users might do 2–3 hours/day of dictation, but most do 15–30 minutes&lt;/li&gt;
&lt;li&gt;At 30 min/day × 30 days = 15 hours/month × $0.06 = ~$0.90 API cost&lt;/li&gt;
&lt;li&gt;$9 gives enough margin to support, improve, and not go bankrupt&lt;/li&gt;
&lt;/ul&gt;
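&lt;p&gt;In code form, the margin math above:&lt;/p&gt;

```javascript
// Gross margin per subscriber: price minus API cost at a given usage level.
// Rates and usage mirror the bullet list above; function name is mine.
function monthlyMargin(priceDollars, minutesPerDay, dollarsPerHour, daysPerMonth = 30) {
  const apiCost = (minutesPerDay / 60) * daysPerMonth * dollarsPerHour;
  return priceDollars - apiCost;
}

// monthlyMargin(9, 30, 0.06)  ≈ $8.10 for a typical user;
// monthlyMargin(9, 180, 0.06) ≈ $3.60 even for a 3-hour/day power user.
```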

&lt;p&gt;Surprisingly, the price has not been the objection I expected. The objection is trust — people want to know their audio isn't being stored or sold. I now have a privacy page and a clear statement on first run: audio is sent to Groq for transcription and immediately discarded. Groq's own privacy policy backs this up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Numbers so far
&lt;/h2&gt;

&lt;p&gt;This is an indie project, not a funded startup. Early days. But the retention among people who actually adopt it into their workflow is strong — the ones who stick past day 3 are still using it a month later. That's the signal I'm building toward.&lt;/p&gt;

&lt;h2&gt;
  
  
  Want to try it?
&lt;/h2&gt;

&lt;p&gt;If you type a lot and wish you could just talk — &lt;a href="https://dictate-app.pages.dev" rel="noopener noreferrer"&gt;dictate.app&lt;/a&gt; is $9/month and there's a free trial. It works on Windows 10 and 11. No cloud accounts, no OAuth, just a Groq API key you bring yourself (or use the managed version where I handle the key).&lt;/p&gt;

&lt;p&gt;Happy to answer questions about the build in the comments. The Windows audio API rabbit holes alone could fill another post.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>windows</category>
      <category>ai</category>
      <category>javascript</category>
    </item>
  </channel>
</rss>
