<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: How Minds Work</title>
    <description>The latest articles on DEV Community by How Minds Work (@howmindswork).</description>
    <link>https://dev.to/howmindswork</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3916981%2F9333884d-c60a-4532-b3c2-143c8296d951.png</url>
      <title>DEV Community: How Minds Work</title>
      <link>https://dev.to/howmindswork</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/howmindswork"/>
    <language>en</language>
    <item>
      <title>The $0.02/Hour AI That Replaced My $700 Dragon NaturallySpeaking</title>
      <dc:creator>How Minds Work</dc:creator>
      <pubDate>Thu, 07 May 2026 05:39:20 +0000</pubDate>
      <link>https://dev.to/howmindswork/the-002hour-ai-that-replaced-my-700-dragon-naturallyspeaking-4jd0</link>
      <guid>https://dev.to/howmindswork/the-002hour-ai-that-replaced-my-700-dragon-naturallyspeaking-4jd0</guid>
      <description>&lt;p&gt;I bought Dragon NaturallySpeaking Professional in 2019. It was $700. I justified it as a productivity investment. I used it for about three months before I stopped.&lt;/p&gt;

&lt;p&gt;Not because it was bad. Because it was annoying.&lt;/p&gt;

&lt;p&gt;Here is the honest comparison between Dragon and what I am using now.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Dragon Experience
&lt;/h2&gt;

&lt;p&gt;Dragon is impressive software. The accuracy on trained profiles is legitimately excellent — better than anything else available in 2019, and the desktop dictation market has not exactly exploded since then.&lt;/p&gt;

&lt;p&gt;But the friction is real:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training time.&lt;/strong&gt; Dragon asks you to read passages for 10-30 minutes to build your voice profile. The more you train, the better it gets. That is fine for people who dictate hours per day. For occasional use, it is a tax that never feels worth it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Process coupling.&lt;/strong&gt; Dragon works best when it is deeply integrated — Dragon-aware apps, dictation commands, custom vocabulary. When you work across many apps (browser, Slack, VS Code, terminals), the experience is inconsistent. Some windows work great. Some do not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The update cycle.&lt;/strong&gt; Nuance (now part of Microsoft) sold Dragon as a perpetual license but charged for major version upgrades. Each new major release was another $300-400 if you wanted the improvements. The subscription version (Dragon Anywhere) is $15/month — $180/year — for a cloud product that still requires a desktop client.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CPU load.&lt;/strong&gt; Dragon runs a language model continuously in the background. On older hardware, you feel it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Groq Whisper Costs
&lt;/h2&gt;

&lt;p&gt;Groq's Whisper API pricing (as of 2026) is $0.02 per hour of audio transcribed.&lt;/p&gt;

&lt;p&gt;Let that settle in.&lt;/p&gt;

&lt;p&gt;If you dictate aggressively — say, 2 hours of actual speaking per workday, 5 days a week — that is $0.04/day, $0.20/week, roughly $10/year.&lt;/p&gt;

&lt;p&gt;Most people dictate far less than that. A realistic number for someone using voice for Slack messages, quick notes, and occasional longer documents is probably 15-30 minutes of audio per day. At $0.02 per hour, that works out to roughly $0.15-$0.30/month.&lt;/p&gt;

&lt;p&gt;There is no subscription. No annual renewal. No upgrade required to access the current model. You pay per second of audio, you get the transcription, done.&lt;/p&gt;
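&lt;p&gt;If you want to sanity-check that math yourself, the arithmetic fits in a few lines (a quick sketch, not part of any app):&lt;/p&gt;

```javascript
// Sanity check of the pricing math above. Groq bills transcription at a
// flat rate per hour of audio ($0.02/hour at the time of writing).
const GROQ_RATE_PER_HOUR = 0.02;

// Monthly cost for a given amount of dictation per day.
function monthlyCost(minutesPerDay, daysPerMonth = 30) {
  const hoursPerMonth = (minutesPerDay / 60) * daysPerMonth;
  return hoursPerMonth * GROQ_RATE_PER_HOUR;
}

console.log(monthlyCost(15).toFixed(2)); // light use: about $0.15/month
console.log(monthlyCost(30).toFixed(2)); // heavier use: about $0.30/month
```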

&lt;h2&gt;
  
  
  Accuracy: Honest Numbers
&lt;/h2&gt;

&lt;p&gt;Dragon (trained) vs Groq Whisper (zero training):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Everyday speech:&lt;/strong&gt; Dragon wins by a small margin, maybe 1-2%. Both are in the high 90s. The difference is not meaningful in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical terms:&lt;/strong&gt; This surprised me. Whisper handles technical vocabulary well out of the box — API names, programming terms, product names. Dragon required adding custom vocabulary for anything unusual. Whisper seems to have absorbed enough technical text in training to handle most of what a developer or knowledge worker would say.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Names and proper nouns:&lt;/strong&gt; Dragon wins here, especially after training. Whisper sometimes mishears uncommon names. This is the most noticeable accuracy gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accents and speaking styles:&lt;/strong&gt; Whisper is trained on a huge multilingual dataset. It handles non-native English speakers and regional accents noticeably better than Dragon did in my testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Punctuation:&lt;/strong&gt; Both add punctuation automatically. Whisper's punctuation is slightly more erratic. Dragon's dictation commands ("period," "new line") give more control. Whisper does not take inline commands.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Lose
&lt;/h2&gt;

&lt;p&gt;Being honest about the gaps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No voice commands. Dragon lets you say "select that" or "scratch that" or "new line." Whisper gives you text, nothing else.&lt;/li&gt;
&lt;li&gt;No continuous dictation mode. Dragon can run in always-listening mode. Whisper is push-to-talk.&lt;/li&gt;
&lt;li&gt;Slightly lower accuracy on proper nouns without training data.&lt;/li&gt;
&lt;li&gt;Latency of 0.5-1.5 seconds per utterance (network round trip). Dragon processes locally so latency is near-zero on good hardware.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What You Gain
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Zero setup. No training, no profiles, no installation of a 4GB application.&lt;/li&gt;
&lt;li&gt;Works on any machine. The app is small, the model lives in the cloud.&lt;/li&gt;
&lt;li&gt;Works across every application. Dictate into Slack, VS Code, terminals, browsers — anything with a text input.&lt;/li&gt;
&lt;li&gt;Costs almost nothing.&lt;/li&gt;
&lt;li&gt;No vendor lock-in to a perpetual license that may not be supported in future OS versions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;If you dictate for a living — medical transcription, legal work, all-day heavy use — Dragon's accuracy and command system may still justify the price.&lt;/p&gt;

&lt;p&gt;For everyone else: a Groq Whisper-powered app is faster to set up, cheaper to run, works everywhere, and is accurate enough that you will not notice the difference on a normal day.&lt;/p&gt;

&lt;p&gt;The app I switched to is &lt;a href="https://dictate-app.pages.dev" rel="noopener noreferrer"&gt;Dictate for Windows&lt;/a&gt;. It uses Groq Whisper under the hood, runs in the system tray, and gets out of the way. The hotkey is the whole interface.&lt;/p&gt;

&lt;p&gt;I have not thought about Dragon since.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>windows</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Stop Typing Your Slack Messages — Use Your Voice Instead (Windows)</title>
      <dc:creator>How Minds Work</dc:creator>
      <pubDate>Thu, 07 May 2026 05:33:24 +0000</pubDate>
      <link>https://dev.to/howmindswork/stop-typing-your-slack-messages-use-your-voice-instead-windows-5hd5</link>
      <guid>https://dev.to/howmindswork/stop-typing-your-slack-messages-use-your-voice-instead-windows-5hd5</guid>
      <description>&lt;p&gt;I type fast. Around 90 WPM on a good day. But even at that speed, I am constantly falling behind in Slack.&lt;/p&gt;

&lt;p&gt;Slack is a different kind of typing. It is not flowing prose — it is reactive, rapid-fire, context-switching every two minutes. By the time I have typed out a coherent response, three more messages have arrived and the thread has moved on without me.&lt;/p&gt;

&lt;p&gt;So I started using my voice instead. Here is what I learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Voice Is Faster for Slack and Teams
&lt;/h2&gt;

&lt;p&gt;The average person speaks at 130–150 words per minute. Even fast typists rarely sustain 100 WPM in real-world conditions (as opposed to speed-test bursts). But more importantly, speaking is &lt;em&gt;thinking out loud&lt;/em&gt; — it bypasses the translation layer between brain and fingers.&lt;/p&gt;

&lt;p&gt;For short reactive messages — "yeah sounds good, let us jump on a call at 2" or "can you share the doc again? I cannot find it" — voice is dramatically faster. You say it, it appears, you send it. No backspacing, no autocorrect disasters, no hunting for the right emoji.&lt;/p&gt;

&lt;p&gt;For longer messages like project updates or async explanations, the advantage compounds. A 3-paragraph Slack message that would take 2 minutes to type takes about 40 seconds to dictate.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Does Not Work: Browser Extensions
&lt;/h2&gt;

&lt;p&gt;The first thing most people try is a Chrome extension. There are several voice-to-text extensions on the Chrome Web Store, and they work fine — for Gmail, Google Docs, and other browser-based text fields.&lt;/p&gt;

&lt;p&gt;But Slack's desktop app is not a browser. It is an Electron app running in its own process, outside Chrome's reach. Browser extensions can only inject into web pages in the Chrome renderer. They have no access to the desktop application's text input fields.&lt;/p&gt;

&lt;p&gt;Same goes for Teams. The desktop version is also Electron-based. Your Chrome extension will not see it.&lt;/p&gt;

&lt;p&gt;Windows' built-in Speech Recognition (the one you enable in Settings &amp;gt; Time &amp;amp; Language &amp;gt; Speech) can technically dictate into any window, but it is slow to activate, requires training, and the accuracy is noticeably worse than modern AI transcription — especially for technical terms, names, or anything with punctuation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works on Windows
&lt;/h2&gt;

&lt;p&gt;The approach that works is a dedicated Windows dictation tool that operates at the OS level — not inside a browser, but as a system-wide layer that can inject text into any focused application.&lt;/p&gt;

&lt;p&gt;Here is the setup that has been working for me:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Press a hotkey anywhere&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You are in Slack, Teams, VS Code, Notepad, whatever. You press a global shortcut (I use &lt;code&gt;Ctrl+Shift+Space&lt;/code&gt;). A small overlay appears — nothing intrusive, just a mic indicator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Speak naturally&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You say your message. The audio is sent to Groq's Whisper API for transcription. This takes about 1–2 seconds for a sentence, less than a second for short phrases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Text is injected directly&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The transcribed text is typed into whatever window was active when you pressed the hotkey. In Slack, it lands in the message box. You review it, press Enter.&lt;/p&gt;

&lt;p&gt;This works because the tool uses Windows accessibility APIs (specifically UI Automation) to interact with the active window — not browser injection. It can reach desktop apps, terminal windows, chat apps, anything with a text input.&lt;/p&gt;

&lt;h2&gt;
  
  
  Accuracy in Real Use
&lt;/h2&gt;

&lt;p&gt;Groq's Whisper model is genuinely impressive. In my testing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Common Slack phrases: ~99% accuracy&lt;/li&gt;
&lt;li&gt;Technical terms (API, GitHub, Kubernetes): ~96% accuracy&lt;/li&gt;
&lt;li&gt;Names and proper nouns: ~92% accuracy (drops with unusual names)&lt;/li&gt;
&lt;li&gt;Punctuation: handled automatically based on speech patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I occasionally have to fix a word, but it is faster than typing the whole message.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tool That Does This
&lt;/h2&gt;

&lt;p&gt;The app I have been using is &lt;a href="https://dictate-app.pages.dev" rel="noopener noreferrer"&gt;Dictate for Windows&lt;/a&gt;. It is a lightweight Electron app that runs in the system tray — you do not even know it is there until you need it. Press the hotkey, speak, done.&lt;/p&gt;

&lt;p&gt;It uses Groq's Whisper API under the hood, which means the transcription cost is almost nothing — fractions of a cent per message. You pay for what you use, no subscription required.&lt;/p&gt;

&lt;p&gt;If you are spending more than 30% of your workday in Slack or Teams, this is worth trying. The setup takes about 5 minutes and the habit clicks within a day or two.&lt;/p&gt;

&lt;p&gt;Your keyboard will thank you.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>windows</category>
      <category>devtools</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Building a Windows App That Injects Text Into Any Application — What I Learned</title>
      <dc:creator>How Minds Work</dc:creator>
      <pubDate>Thu, 07 May 2026 05:32:42 +0000</pubDate>
      <link>https://dev.to/howmindswork/building-a-windows-app-that-injects-text-into-any-application-what-i-learned-5477</link>
      <guid>https://dev.to/howmindswork/building-a-windows-app-that-injects-text-into-any-application-what-i-learned-5477</guid>
      <description>&lt;p&gt;I spent the last few months building a voice dictation app for Windows. The pitch is simple: press a hotkey anywhere, speak, and the transcribed text appears in whatever you were typing into — Slack, VS Code, Notepad, a terminal.&lt;/p&gt;

&lt;p&gt;Simple pitch. Surprisingly gnarly implementation. Here is what I ran into.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Problem: Text Injection
&lt;/h2&gt;

&lt;p&gt;The first question is how to get text &lt;em&gt;into&lt;/em&gt; an arbitrary application. You have a few options:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SendKeys / keybd_event&lt;/strong&gt; — The oldest approach. Simulate keypresses one character at a time. It works, but it is fragile. Fast injection can drop characters. Some applications intercept keystroke events and treat simulated input differently from real input. Rich text editors (Slack, for example) sometimes swallow synthetic keystrokes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clipboard + Paste&lt;/strong&gt; — Write text to the clipboard, then send &lt;code&gt;Ctrl+V&lt;/code&gt;. Faster than character-by-character SendKeys, more reliable for long strings. Downside: it clobbers whatever the user had on the clipboard. Users notice this. It also fails in apps that block clipboard paste in specific fields (some password managers, some login forms).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UI Automation (UIA)&lt;/strong&gt; — The Windows accessibility framework. You query the active window for its automation element, find the focused text control, and call &lt;code&gt;SetValue&lt;/code&gt; or &lt;code&gt;InsertText&lt;/code&gt; on it. This is the right tool for the job. It works with the application's actual text model, not just the keyboard event pipeline.&lt;/p&gt;

&lt;p&gt;I ended up using a combination: UI Automation as the primary method, with a clipboard-paste fallback for apps that do not expose full UIA support.&lt;/p&gt;
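&lt;p&gt;The decision logic is simple enough to sketch. The predicate names here (&lt;code&gt;supportsUIA&lt;/code&gt;, &lt;code&gt;isElevated&lt;/code&gt;, &lt;code&gt;allowsPaste&lt;/code&gt;) are illustrative stand-ins, not a real API:&lt;/p&gt;

```javascript
// Hypothetical sketch of the fallback order described above.
function chooseInjectionMethod(target) {
  // UIA cannot automate an elevated process from a normal-integrity
  // process, so synthetic keystrokes are the only (partial) option there.
  if (target.isElevated) return "sendkeys";
  // Primary path: work with the app's actual text model via UI Automation.
  if (target.supportsUIA) return "uia";
  // Fallback: clipboard paste is reliable for long strings, but it
  // clobbers whatever the user had on the clipboard.
  if (target.allowsPaste) return "clipboard";
  // Last resort: simulate keystrokes character by character.
  return "sendkeys";
}
```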

&lt;h2&gt;
  
  
  Windows UI Automation in Practice
&lt;/h2&gt;

&lt;p&gt;The UIA COM interfaces are available from any language that can call Win32/COM. From Electron (Node.js), I used &lt;code&gt;node-ffi-napi&lt;/code&gt; to call into &lt;code&gt;UIAutomationCore.dll&lt;/code&gt; directly. There are also npm packages such as &lt;code&gt;uiautomation&lt;/code&gt;, though the bindings are thin.&lt;/p&gt;

&lt;p&gt;The flow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. User presses hotkey
2. Store foreground window handle (GetForegroundWindow)
3. Record focused element (IUIAutomation::GetFocusedElement)
4. Start recording audio
5. User releases hotkey (or silence detected)
6. Send audio to Whisper API
7. Receive transcription
8. Restore focus to stored element
9. Call IValueProvider::SetValue or ITextProvider::InsertText
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 8 is important. By the time transcription comes back (1–2 seconds), the user may have clicked elsewhere. You need to restore focus to the original element before injecting, otherwise text goes to the wrong place.&lt;/p&gt;
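&lt;p&gt;The capture-and-restore idea looks roughly like this, written against a hypothetical &lt;code&gt;os&lt;/code&gt; adapter whose methods stand in for the real Win32/UIA calls:&lt;/p&gt;

```javascript
// Minimal sketch of focus tracking for asynchronous injection.
// `os` is a hypothetical adapter: getFocusedElement, setFocus and
// insertText are stand-ins for the real UIA calls, not actual APIs.
function makeDictationSession(os) {
  // Snapshot the focused element the moment the hotkey fires.
  const target = os.getFocusedElement();
  return {
    inject(text) {
      // Transcription took 1-2 seconds; the user may have clicked away.
      // Restore focus to the original element before injecting.
      if (os.getFocusedElement() !== target) {
        os.setFocus(target);
      }
      os.insertText(target, text);
    },
  };
}
```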

&lt;h2&gt;
  
  
  The Elevated Process Problem
&lt;/h2&gt;

&lt;p&gt;UI Automation has a security restriction: a process running at normal integrity cannot automate a process running at high integrity (elevated/administrator). This means if the user has an elevated terminal open and tries to dictate into it, the injection silently fails.&lt;/p&gt;

&lt;p&gt;The clean fix is to run your own process at high integrity. But that requires a UAC prompt on launch, which is a terrible user experience for a background tray app.&lt;/p&gt;

&lt;p&gt;The workaround I settled on: detect when the target is elevated (compare integrity levels via &lt;code&gt;GetTokenInformation&lt;/code&gt;), fall back to SendKeys in that case, and show a tooltip explaining the limitation. Not perfect, but honest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrating Groq Whisper
&lt;/h2&gt;

&lt;p&gt;For transcription, I chose Groq's Whisper API over running Whisper locally. The reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local Whisper (even &lt;code&gt;whisper.cpp&lt;/code&gt;) adds 500ms–2s of latency on mid-range hardware&lt;/li&gt;
&lt;li&gt;Groq's API returns in under a second for typical voice inputs&lt;/li&gt;
&lt;li&gt;Cost is approximately $0.02 per hour of audio at current pricing — negligible for dictation use&lt;/li&gt;
&lt;li&gt;No GPU required on the client machine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The audio pipeline is straightforward in Electron: &lt;code&gt;navigator.mediaDevices.getUserMedia&lt;/code&gt; for capture, encode to FLAC or MP3 (I use &lt;code&gt;lamejs&lt;/code&gt; for MP3 in the browser context), then a standard &lt;code&gt;multipart/form-data&lt;/code&gt; POST to &lt;code&gt;https://api.groq.com/openai/v1/audio/transcriptions&lt;/code&gt;.&lt;/p&gt;
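&lt;p&gt;The request itself is nothing exotic. A sketch of the upload step, assuming the audio has already been encoded into a &lt;code&gt;Blob&lt;/code&gt;:&lt;/p&gt;

```javascript
// Build the multipart/form-data request for Groq's OpenAI-compatible
// transcription endpoint. Field names follow the documented API shape.
const GROQ_URL = "https://api.groq.com/openai/v1/audio/transcriptions";

function buildTranscriptionRequest(audioBlob, apiKey) {
  const form = new FormData();
  form.append("file", audioBlob, "clip.mp3"); // the encoded recording
  form.append("model", "whisper-large-v3");
  return {
    url: GROQ_URL,
    options: {
      method: "POST",
      headers: { Authorization: "Bearer " + apiKey },
      body: form,
    },
  };
}

// Usage: const req = buildTranscriptionRequest(blob, key);
// then fetch(req.url, req.options) and read .text from the JSON response.
```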

&lt;p&gt;One thing worth knowing: Groq Whisper returns the full transcription as a single string. If you want word-level timestamps (useful for editing), you need to request &lt;code&gt;verbose_json&lt;/code&gt; response format and parse the segments.&lt;/p&gt;
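&lt;p&gt;A hypothetical helper for that case might look like this (the &lt;code&gt;start&lt;/code&gt;, &lt;code&gt;end&lt;/code&gt; and &lt;code&gt;text&lt;/code&gt; fields match Whisper's documented verbose output shape):&lt;/p&gt;

```javascript
// Flatten a verbose_json response into timestamped segment records.
function extractSegments(response) {
  return response.segments.map(function (seg) {
    return { start: seg.start, end: seg.end, text: seg.text.trim() };
  });
}
```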

&lt;h2&gt;
  
  
  Language and Runtime Choice
&lt;/h2&gt;

&lt;p&gt;I chose Electron because the app needed a system tray icon, global hotkey registration, and native Windows API access — and I wanted to move fast. The global hotkey is registered via &lt;code&gt;globalShortcut&lt;/code&gt; in Electron's main process. The UIA calls go through a small native addon.&lt;/p&gt;

&lt;p&gt;Electron apps are large (~150MB unpacked). That is the tradeoff. For a background utility that runs all day and stays out of the way, it is acceptable.&lt;/p&gt;

&lt;p&gt;If I were doing it again with more time, I would look at Tauri. The bundle size is dramatically smaller and the Rust backend makes Win32 interop cleaner. The tradeoff is a harder dev experience and fewer community examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Would Do Differently
&lt;/h2&gt;

&lt;p&gt;The biggest mistake early on was trusting SendKeys as the primary injection method. I spent two weeks tuning delay timings and handling edge cases before switching to UI Automation. UIA should have been first.&lt;/p&gt;

&lt;p&gt;The second mistake was not handling the focus/restore step from the start. Users reported text appearing in the wrong window and it took me longer than it should have to understand the race condition.&lt;/p&gt;

&lt;p&gt;If you are building something similar, start with UI Automation, implement focus tracking immediately, and treat SendKeys as a last resort. The accessibility APIs exist precisely for this use case.&lt;/p&gt;

&lt;p&gt;The finished app is &lt;a href="https://dictate-app.pages.dev" rel="noopener noreferrer"&gt;Dictate for Windows&lt;/a&gt; if you want to see the end result.&lt;/p&gt;

</description>
      <category>windows</category>
      <category>javascript</category>
      <category>electron</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Groq vs OpenAI Whisper: Real Benchmarks for Voice Transcription (2026)</title>
      <dc:creator>How Minds Work</dc:creator>
      <pubDate>Thu, 07 May 2026 05:21:49 +0000</pubDate>
      <link>https://dev.to/howmindswork/groq-vs-openai-whisper-real-benchmarks-for-voice-transcription-2026-46lk</link>
      <guid>https://dev.to/howmindswork/groq-vs-openai-whisper-real-benchmarks-for-voice-transcription-2026-46lk</guid>
      <description>&lt;p&gt;I've been building &lt;a href="https://dictate.app" rel="noopener noreferrer"&gt;dictate.app&lt;/a&gt; — a Windows dictation tool — and the biggest decision early on was which Whisper API to use. I ran both Groq and OpenAI through real-world testing. Here's what the numbers actually look like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Whisper APIs, Not Local Models
&lt;/h2&gt;

&lt;p&gt;Local Whisper (running on your machine) is free but slow unless you have a GPU. For a dictation tool where latency is everything, you want a hosted API. The two main options in 2026 are OpenAI's Whisper endpoint and Groq's Whisper endpoint.&lt;/p&gt;

&lt;p&gt;Both run the same underlying model family (Whisper large-v3). The difference is infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Latency: The Real-World Numbers
&lt;/h2&gt;

&lt;p&gt;I tested with audio clips of varying lengths — 5 seconds, 15 seconds, 30 seconds, and 60 seconds — and measured round-trip time from sending the request to receiving the transcription.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Clip Length&lt;/th&gt;
&lt;th&gt;Groq&lt;/th&gt;
&lt;th&gt;OpenAI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5 seconds&lt;/td&gt;
&lt;td&gt;~180ms&lt;/td&gt;
&lt;td&gt;~750ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15 seconds&lt;/td&gt;
&lt;td&gt;~210ms&lt;/td&gt;
&lt;td&gt;~820ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;30 seconds&lt;/td&gt;
&lt;td&gt;~260ms&lt;/td&gt;
&lt;td&gt;~1100ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;60 seconds&lt;/td&gt;
&lt;td&gt;~380ms&lt;/td&gt;
&lt;td&gt;~1800ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Groq is consistently &lt;strong&gt;4-5x faster&lt;/strong&gt;. For a dictation app, this is the difference between feeling instant and feeling like you're waiting.&lt;/p&gt;

&lt;p&gt;The latency gap comes from Groq's LPU (Language Processing Unit) hardware. These chips are purpose-built for inference and deliver dramatically lower time-to-first-token compared to GPU clusters.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Call Each API
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Groq Whisper
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;Groq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;groq-sdk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;groq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Groq&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;GROQ_API_KEY&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;transcribeWithGroq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audioFilePath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;transcription&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;groq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transcriptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createReadStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audioFilePath&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;whisper-large-v3&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;en&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;response_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Groq latency: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;ms`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  OpenAI Whisper
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OPENAI_API_KEY&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;transcribeWithOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audioFilePath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;transcription&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transcriptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createReadStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audioFilePath&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;whisper-1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;en&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;response_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`OpenAI latency: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;ms`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API shapes are nearly identical — switching between them is about 3 lines of code.&lt;/p&gt;
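&lt;p&gt;A minimal sketch of what that switch looks like in practice; the base URLs and model names below reflect the two APIs as I used them, but verify them against each provider's current docs:&lt;/p&gt;

```javascript
// The only per-provider differences: base URL, API key variable, and model name.
// Groq exposes an OpenAI-compatible endpoint, so the rest of the call is identical.
function transcriptionConfig(provider) {
  if (provider === "groq") {
    return {
      baseURL: "https://api.groq.com/openai/v1",
      apiKeyEnv: "GROQ_API_KEY",
      model: "whisper-large-v3",
    };
  }
  return {
    baseURL: "https://api.openai.com/v1",
    apiKeyEnv: "OPENAI_API_KEY",
    model: "whisper-1",
  };
}
```

&lt;p&gt;Pass the result into the SDK constructor and the transcription call itself stays byte-for-byte the same.&lt;/p&gt;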

&lt;h2&gt;
  
  
  Cost Comparison
&lt;/h2&gt;

&lt;p&gt;This is where Groq wins by a landslide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Groq Whisper pricing:&lt;/strong&gt; $0.02 per hour of audio&lt;br&gt;
&lt;strong&gt;OpenAI Whisper pricing:&lt;/strong&gt; $0.006 per minute = $0.36 per hour of audio&lt;/p&gt;

&lt;p&gt;That's an &lt;strong&gt;18x cost difference&lt;/strong&gt; for the same model.&lt;/p&gt;

&lt;p&gt;For a power user dictating 2 hours a day:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Groq: $0.04/day, $1.20/month&lt;/li&gt;
&lt;li&gt;OpenAI: $0.72/day, $21.60/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a SaaS app with 1,000 users each dictating 30 minutes a day:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Groq: ~$300/month&lt;/li&gt;
&lt;li&gt;OpenAI: ~$5,400/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unless you're already deeply locked into the OpenAI ecosystem, the cost math is hard to ignore.&lt;/p&gt;
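&lt;p&gt;The math above is easy to reproduce. A quick sketch, working in cents per hour of audio so the arithmetic stays exact:&lt;/p&gt;

```javascript
// Rates in cents per hour of audio:
// Groq $0.02/hr = 2 cents; OpenAI $0.006/min = $0.36/hr = 36 cents.
function monthlyCostCents(centsPerHour, hoursPerUserPerDay, users = 1, days = 30) {
  return centsPerHour * hoursPerUserPerDay * users * days;
}

console.log(monthlyCostCents(2, 2));          // power user on Groq: 120 cents = $1.20/month
console.log(monthlyCostCents(36, 2));         // power user on OpenAI: 2160 cents = $21.60/month
console.log(monthlyCostCents(2, 0.5, 1000));  // 1,000-user SaaS on Groq: 30000 cents = $300/month
console.log(monthlyCostCents(36, 0.5, 1000)); // on OpenAI: 540000 cents = $5,400/month
```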
&lt;h2&gt;
  
  
  Accuracy Comparison
&lt;/h2&gt;

&lt;p&gt;This is where things get more nuanced. Both providers serve large Whisper models (Groq runs large-v3, while OpenAI's &lt;code&gt;whisper-1&lt;/code&gt; endpoint is based on large-v2), so accuracy should be similar in theory. In practice, I noticed differences on:&lt;/p&gt;
&lt;h3&gt;
  
  
  Technical Terms and Proper Nouns
&lt;/h3&gt;

&lt;p&gt;I tested dictating content with technical vocabulary — programming terms, product names, developer jargon.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Groq:&lt;/strong&gt; Occasionally struggles with very niche technical terms, especially compound words and camelCase concepts spoken aloud.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI:&lt;/strong&gt; Marginally better on highly technical vocabulary, likely due to fine-tuning or post-processing on their side.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For everyday English, both are excellent. For dictating code-heavy content, the gap is real but small.&lt;/p&gt;
&lt;h3&gt;
  
  
  Punctuation and Formatting
&lt;/h3&gt;

&lt;p&gt;Neither API auto-inserts punctuation without prompting. You need to say "period", "comma", etc. or post-process with an LLM. This is the same for both.&lt;/p&gt;
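&lt;p&gt;If you go the spoken-command route, the post-processing can be as simple as a substitution table. A rough sketch (a real implementation would also need capitalization handling and an escape for dictating the literal words "period" and "comma"):&lt;/p&gt;

```javascript
// Map spoken punctuation commands to punctuation marks.
// The leading \s* also swallows the space before the command word.
const SPOKEN = [
  [/\s*\bnew line\b/gi, "\n"],
  [/\s*\bperiod\b/gi, "."],
  [/\s*\bcomma\b/gi, ","],
  [/\s*\bquestion mark\b/gi, "?"],
];

function applySpokenPunctuation(text) {
  let out = text;
  for (const [pattern, mark] of SPOKEN) out = out.replace(pattern, mark);
  return out;
}

console.log(applySpokenPunctuation("hello comma world period"));
// "hello, world."
```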
&lt;h3&gt;
  
  
  Noise Handling
&lt;/h3&gt;

&lt;p&gt;Both handle moderate background noise well. Neither is great with significant ambient noise — you'll want to denoise before sending if your recording environment is rough.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Streaming Question
&lt;/h2&gt;

&lt;p&gt;Neither Groq nor OpenAI Whisper supports true streaming transcription through these REST APIs. You send a complete audio file, wait, get text back. For a dictation tool, this means you need to chunk your audio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Record in chunks, transcribe each chunk&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;CHUNK_DURATION_MS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// 5-second chunks&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;startChunkedDictation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;onTranscript&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;currentChunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

  &lt;span class="nx"&gt;recorder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;data&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nf"&gt;setInterval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;currentChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;splice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;audioBuffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;transcribeWithGroq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audioBuffer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;onTranscript&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="nx"&gt;CHUNK_DURATION_MS&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Groq's ~200ms latency, a 5-second chunk transcribes in ~200ms after the chunk ends — giving you text about 5.2 seconds behind real-time. With OpenAI's ~800ms latency, that's 5.8 seconds. Not a huge difference at this chunk size, but if you shorten chunks to 2-3 seconds for lower latency, the difference grows.&lt;/p&gt;
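&lt;p&gt;The delay arithmetic is simple: the last word spoken in a chunk waits for the chunk to close, then for the API round trip:&lt;/p&gt;

```javascript
// Worst-case delay between speaking a word and seeing its text,
// for the final word in a chunk.
function textDelayMs(chunkMs, apiLatencyMs) {
  return chunkMs + apiLatencyMs;
}

console.log(textDelayMs(5000, 200)); // Groq, 5s chunks: 5200
console.log(textDelayMs(5000, 800)); // OpenAI, 5s chunks: 5800
console.log(textDelayMs(2000, 200)); // Groq, 2s chunks: 2200
console.log(textDelayMs(2000, 800)); // OpenAI, 2s chunks: 2800
```

&lt;p&gt;At 2-second chunks the API latency becomes a meaningful fraction of the total, which is where the provider gap starts to be felt.&lt;/p&gt;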

&lt;h2&gt;
  
  
  My Recommendation
&lt;/h2&gt;

&lt;p&gt;For most dictation and voice-to-text use cases in 2026: &lt;strong&gt;use Groq&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4-5x lower latency&lt;/li&gt;
&lt;li&gt;18x lower cost&lt;/li&gt;
&lt;li&gt;Accuracy is equivalent for 95% of use cases&lt;/li&gt;
&lt;li&gt;API is near-identical to OpenAI's — easy to switch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The only reason to choose OpenAI Whisper:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're already paying for an OpenAI subscription and usage is low&lt;/li&gt;
&lt;li&gt;Your use case involves heavy technical jargon where that marginal accuracy edge matters&lt;/li&gt;
&lt;li&gt;You need OpenAI's ecosystem integrations (Assistants API, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://dictate.app" rel="noopener noreferrer"&gt;dictate.app&lt;/a&gt; uses Groq as the primary transcription backend with OpenAI as a fallback. In production, we've seen Groq handle over 95% of requests with no issues.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Benchmarks run in April 2026 from a US-East server. Latency figures are median across 50 requests per category. Your numbers may vary based on geography and API load.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>tutorial</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How I Inject Text Into Any Windows App (Including Elevated Processes)</title>
      <dc:creator>How Minds Work</dc:creator>
      <pubDate>Thu, 07 May 2026 05:21:06 +0000</pubDate>
      <link>https://dev.to/howmindswork/how-i-inject-text-into-any-windows-app-including-elevated-processes-4jl6</link>
      <guid>https://dev.to/howmindswork/how-i-inject-text-into-any-windows-app-including-elevated-processes-4jl6</guid>
      <description>&lt;p&gt;Building a dictation app for Windows sounds simple until you try to actually get text into other applications. After shipping &lt;a href="https://dictate.app" rel="noopener noreferrer"&gt;dictate.app&lt;/a&gt;, I learned more about Windows text injection than I ever wanted to know. Here's the full picture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;You've transcribed audio to text. Now you need to insert that text wherever the user's cursor is — Notepad, VS Code, Excel, a chat app, a browser field, or a terminal running as Administrator. Each app handles input differently. Some block you entirely.&lt;/p&gt;

&lt;p&gt;There is no single API that works everywhere. You need a layered approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Method 1: SendInput (Win32)
&lt;/h2&gt;

&lt;p&gt;The most direct route. &lt;code&gt;SendInput&lt;/code&gt; injects keyboard events at the OS level, simulating actual keypresses.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Electron / Node.js using ffi-napi to call Win32&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ffi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ffi-napi&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ref&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ref-napi&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;user32&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ffi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user32&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;SendInput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;uint&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;uint&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pointer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;int&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;sendChar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;char&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;INPUT_KEYBOARD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;KEYEVENTF_UNICODE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mh"&gt;0x0004&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;KEYEVENTF_KEYUP&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mh"&gt;0x0002&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Build INPUT struct for keydown + keyup&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;buf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// 2 INPUT structs&lt;/span&gt;
  &lt;span class="c1"&gt;// ... fill struct fields&lt;/span&gt;
  &lt;span class="nx"&gt;user32&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SendInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works well for standard apps. The catch: it types character by character, which is slow for long transcriptions and can misfire if the user moves focus mid-injection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Works for:&lt;/strong&gt; Most desktop apps running at the same privilege level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fails for:&lt;/strong&gt; Elevated processes (apps running as Administrator), games with anti-cheat, some terminals.&lt;/p&gt;

&lt;h2&gt;
  
  
  Method 2: UI Automation (UIAutomation API)
&lt;/h2&gt;

&lt;p&gt;Microsoft's UIAutomation framework lets you interact with app controls directly — no simulated keypresses. You find the focused element and set its value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Using edge-js or a native addon to call UIAutomation COM interfaces&lt;/span&gt;
&lt;span class="c1"&gt;// Pseudocode — actual implementation uses COM interop&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;focusedElement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;automation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GetFocusedElement&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;valuePattern&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;focusedElement&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GetCurrentPattern&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;UIA_ValuePatternId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;valuePattern&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;valuePattern&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SetValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is cleaner than SendInput — it sets the value atomically, no per-character latency. Accessibility tools like screen readers use this same path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Works for:&lt;/strong&gt; Apps that expose UIA ValuePattern — most native Windows controls, some Electron apps, Office.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fails for:&lt;/strong&gt; Custom-drawn controls, Chromium-based apps (they partially support UIA but it's inconsistent), elevated processes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Method 3: WM_PASTE (Windows Messages)
&lt;/h2&gt;

&lt;p&gt;Another approach: put text on the clipboard, then send &lt;code&gt;WM_PASTE&lt;/code&gt; directly to the target window.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;clipboard&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;electron&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;user32&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ffi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user32&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;PostMessage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;bool&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pointer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;uint&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pointer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pointer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
  &lt;span class="na"&gt;GetForegroundWindow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pointer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;pasteText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;clipboard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hwnd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;user32&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GetForegroundWindow&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;WM_PASTE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mh"&gt;0x0302&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;user32&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PostMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hwnd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;WM_PASTE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is fast and reliable for text editors, but many apps ignore &lt;code&gt;WM_PASTE&lt;/code&gt; entirely. Rich text editors handle it differently from plain text fields. And if the user has something important on their clipboard — it's now gone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Works for:&lt;/strong&gt; Notepad, WordPad, some chat apps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fails for:&lt;/strong&gt; Browsers, terminals, most modern apps that handle paste internally.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hard Part: Elevated Processes
&lt;/h2&gt;

&lt;p&gt;Here's where things get painful. Windows has a security boundary called UIPI — User Interface Privilege Isolation. An app running at medium integrity level (normal user) &lt;strong&gt;cannot send input events&lt;/strong&gt; to a process running at high integrity (Administrator).&lt;/p&gt;

&lt;p&gt;This means if the user has a terminal open as Admin, or a system utility elevated via UAC, &lt;code&gt;SendInput&lt;/code&gt; calls silently fail. No error. The keystrokes just vanish.&lt;/p&gt;

&lt;p&gt;UIAutomation has the same restriction. Cross-process UIA calls across integrity levels are blocked.&lt;/p&gt;

&lt;p&gt;Your options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Run your own app as Administrator&lt;/strong&gt; — terrible UX, requires UAC prompt on launch, massive security footprint&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use a system-level hook&lt;/strong&gt; — requires a kernel driver or at minimum an elevated service, complex to sign and deploy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clipboard injection&lt;/strong&gt; — the practical solution&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Reliable Fallback: Clipboard Injection
&lt;/h2&gt;

&lt;p&gt;When everything else fails, clipboard-based injection works across privilege boundaries because clipboard access is not subject to UIPI.&lt;/p&gt;

&lt;p&gt;The flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Save the current clipboard contents&lt;/li&gt;
&lt;li&gt;Write the transcribed text to clipboard&lt;/li&gt;
&lt;li&gt;Send &lt;code&gt;Ctrl+V&lt;/code&gt; via &lt;code&gt;SendInput&lt;/code&gt; (this works even to elevated windows — keyboard events from a lower-privilege app CAN reach elevated apps via SendInput, only window messages are blocked)&lt;/li&gt;
&lt;li&gt;Restore the original clipboard contents after a short delay
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;clipboard&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;electron&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;injectViaClipboard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Save original&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;original&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;clipboard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readText&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// Write transcription&lt;/span&gt;
  &lt;span class="nx"&gt;clipboard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Small delay to ensure clipboard is set&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Send Ctrl+V&lt;/span&gt;
  &lt;span class="nf"&gt;sendKeyCombination&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ctrl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;v&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Restore after paste completes&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;clipboard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;original&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait — I said &lt;code&gt;SendInput&lt;/code&gt; fails for elevated processes. That's true for individual character keystrokes in many cases, but &lt;code&gt;Ctrl+V&lt;/code&gt; as a synthesized keystroke still reaches elevated windows because it goes through the global keyboard input queue, not window message routing. The behavior is subtle and depends on the specific Windows version and app.&lt;/p&gt;

&lt;p&gt;In practice, clipboard + Ctrl+V is the most reliable method across the widest range of apps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tradeoffs of clipboard injection:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Briefly overwrites clipboard (restored after ~150ms, but race conditions exist)&lt;/li&gt;
&lt;li&gt;Doesn't work if the app has a custom paste handler that ignores Ctrl+V&lt;/li&gt;
&lt;li&gt;If the app is slow to respond, the original clipboard restore can happen before the paste completes&lt;/li&gt;
&lt;/ul&gt;
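&lt;p&gt;The first race is partly avoidable: only restore if the clipboard still holds what you injected. A sketch with the clipboard interface injected for testability (Electron's &lt;code&gt;clipboard&lt;/code&gt; has matching &lt;code&gt;readText&lt;/code&gt;/&lt;code&gt;writeText&lt;/code&gt; methods):&lt;/p&gt;

```javascript
// Race-aware clipboard restore: if the user (or another app) changed the
// clipboard since we wrote to it, leave their content alone.
function restoreClipboardIfUnchanged(clip, injectedText, originalText) {
  if (clip.readText() === injectedText) {
    clip.writeText(originalText);
    return true; // restored
  }
  return false; // clipboard changed underneath us; do not clobber it
}
```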

&lt;h2&gt;
  
  
  What dictate.app Does
&lt;/h2&gt;

&lt;p&gt;The injection order in &lt;a href="https://dictate.app" rel="noopener noreferrer"&gt;dictate.app&lt;/a&gt; is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Try UIAutomation ValuePattern (fastest, no clipboard disruption)&lt;/li&gt;
&lt;li&gt;Fall back to SendInput character-by-character (works for most apps)&lt;/li&gt;
&lt;li&gt;Fall back to clipboard injection (handles elevated processes and edge cases)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The fallback chain runs automatically. Users never see it — they just see their text appear.&lt;/p&gt;
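&lt;p&gt;The chain itself can be sketched as a loop over method functions that each report success or failure; this mirrors the ordering above but is not dictate.app's actual source:&lt;/p&gt;

```javascript
// Try each injection method in order until one succeeds.
// Each method is an async function returning true on success; a method that
// throws is treated the same as one that returns false.
async function injectText(text, methods) {
  for (const method of methods) {
    try {
      if (await method(text)) return method.name; // report which method won
    } catch {
      // a failing method just means we move on to the next one
    }
  }
  throw new Error("all injection methods failed");
}
```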

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;The UIAutomation path deserves more investment. For apps that support it, it's the cleanest solution — atomic, fast, no clipboard side effects. The challenge is that Chromium-based apps (Electron, Chrome, Edge) have inconsistent UIA support, and a huge percentage of Windows apps are now Electron-based.&lt;/p&gt;

&lt;p&gt;For truly bulletproof injection across all scenarios including kernel-level game anti-cheat and maximum-security environments, a signed kernel driver is the real answer. But that's a significant engineering and signing overhead that's hard to justify for a productivity tool.&lt;/p&gt;

&lt;p&gt;Clipboard injection with careful save/restore covers 95%+ of real-world cases. The other 5% tends to be niche enough that users don't file bug reports.&lt;/p&gt;




&lt;p&gt;If you're building something that needs to inject text into Windows apps, I hope this saves you the week of debugging it cost me. And if you just want dictation that works — &lt;a href="https://dictate.app" rel="noopener noreferrer"&gt;dictate.app&lt;/a&gt; handles all of this for you.&lt;/p&gt;

</description>
      <category>windows</category>
      <category>javascript</category>
      <category>electron</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Why I switched from Dragon NaturallySpeaking to Whisper API (and built my own app)</title>
      <dc:creator>How Minds Work</dc:creator>
      <pubDate>Thu, 07 May 2026 04:55:54 +0000</pubDate>
      <link>https://dev.to/howmindswork/why-i-switched-from-dragon-naturallyspeaking-to-whisper-api-and-built-my-own-app-53bi</link>
      <guid>https://dev.to/howmindswork/why-i-switched-from-dragon-naturallyspeaking-to-whisper-api-and-built-my-own-app-53bi</guid>
      <description>&lt;h1&gt;
  
  
  Why I switched from Dragon NaturallySpeaking to Whisper API (and built my own app)
&lt;/h1&gt;

&lt;p&gt;I used Dragon NaturallySpeaking for years. It was the gold standard — everyone said so. Then I spent a weekend with Whisper and realized the gap had closed in a way Nuance wasn't advertising.&lt;/p&gt;

&lt;p&gt;This post is for people evaluating modern speech-to-text options for real work. I'll go technical where it matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Dragon gets right
&lt;/h2&gt;

&lt;p&gt;Let's be fair. Dragon's strengths are real:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;On-device processing&lt;/strong&gt;: No audio leaves your machine. For legal, medical, or confidential work, this matters enormously.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Commands and macros&lt;/strong&gt;: "Click File", "Select that", "Delete previous word" — Dragon's voice command layer is genuinely powerful and has no Whisper equivalent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-session accuracy&lt;/strong&gt;: Dragon can adapt to your voice over time. It learns your vocabulary, your accent, your quirks. Whisper doesn't personalize.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Windows integration depth&lt;/strong&gt;: Dragon hooks deep into Office apps with application-specific plugins.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you need voice commands to control your whole computer, Dragon is still the answer. This comparison is purely about transcription quality for dictating text.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Whisper changed the math
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Accuracy on technical vocabulary
&lt;/h3&gt;

&lt;p&gt;Dragon struggles with words it hasn't been trained on. You can add custom vocabulary, but it's a friction point every time you hit a new term. Whisper's approach is fundamentally different — it was trained on 680,000 hours of multilingual audio from across the internet, which means it's seen an enormous variety of technical vocabulary, names, and jargon already.&lt;/p&gt;

&lt;p&gt;Testing on a sample of 50 developer-typical sentences (variable names spoken aloud, API endpoint names, library references):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dragon: ~88% word accuracy&lt;/li&gt;
&lt;li&gt;Whisper Large v3 (via Groq): ~96% word accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gap matters most at the edges — the uncommon words where errors are most disruptive.&lt;/p&gt;
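&lt;p&gt;For transparency, here is roughly how a word-accuracy number like that can be computed. This positional version is a simplification; published benchmarks use edit-distance-based word error rate:&lt;/p&gt;

```javascript
// Fraction of reference words that the hypothesis got right at the same
// position (a rough stand-in for 1 - WER; no insertion/deletion alignment).
function wordAccuracy(reference, hypothesis) {
  const ref = reference.toLowerCase().split(/\s+/);
  const hyp = hypothesis.toLowerCase().split(/\s+/);
  const hits = ref.filter((word, i) => hyp[i] === word).length;
  return hits / ref.length;
}

console.log(wordAccuracy("fetch the user record", "fetch the user records"));
// 0.75
```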

&lt;h3&gt;
  
  
  The setup cost
&lt;/h3&gt;

&lt;p&gt;Dragon requires a training session. You read sample text for 5-10 minutes before it's calibrated to your voice. Whisper needs nothing. You hit record and it just works, for any speaker.&lt;/p&gt;

&lt;h3&gt;
  
  
  Price
&lt;/h3&gt;

&lt;p&gt;Dragon Professional Individual: &lt;strong&gt;$500 one-time&lt;/strong&gt; (or $15/month subscription). Updates have historically cost money.&lt;/p&gt;

&lt;p&gt;Groq Whisper API: &lt;strong&gt;$0.04/hour of audio&lt;/strong&gt;. At 30 min/day of dictation that's roughly $0.60/month in API costs.&lt;/p&gt;
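&lt;p&gt;That monthly figure is straight arithmetic on the hourly rate:&lt;/p&gt;

```javascript
// Monthly API cost at a given daily dictation load, using Groq's ~$0.04/hour
// Whisper pricing quoted above. Function name is mine.
function monthlyCost(minutesPerDay, dollarsPerHour = 0.04, daysPerMonth = 30) {
  return (minutesPerDay / 60) * daysPerMonth * dollarsPerHour;
}

// monthlyCost(30)  ≈ $0.60/month
// monthlyCost(120) ≈ $2.40/month, even for heavy users
```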

&lt;p&gt;The managed version I built (&lt;a href="https://dictate-app.pages.dev" rel="noopener noreferrer"&gt;dictate.app&lt;/a&gt;) wraps this for $9/month.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Whisper API call actually looks like
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Groq&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;groq-sdk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;groq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Groq&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;GROQ_API_KEY&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;transcribeAudio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audioFilePath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;audioFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createReadStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audioFilePath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;transcription&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;groq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transcriptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;audioFile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;whisper-large-v3&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;response_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;verbose_json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;en&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
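&lt;p&gt;One practical note before wiring this up: the hosted endpoint caps upload size, on the order of 25 MB per file in the docs at the time of writing. Treat that constant as an assumption and verify it against Groq's current documentation; a cheap pre-check avoids a wasted round trip:&lt;/p&gt;

```javascript
// Guard against oversized uploads before calling the API.
// MAX_UPLOAD_BYTES reflects the ~25 MB cap documented for the hosted
// endpoint; confirm the current limit before relying on it.
const MAX_UPLOAD_BYTES = 25 * 1024 * 1024;

function assertUploadable(path, sizeBytes) {
  if (sizeBytes > MAX_UPLOAD_BYTES) {
    throw new Error(`${path}: ${sizeBytes} bytes exceeds the upload cap; split the audio first`);
  }
  return true;
}
```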



&lt;p&gt;For real-time feel, you chunk the audio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;streamTranscription&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audioStream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;CHUNK_MS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nx"&gt;audioStream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;data&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;SAMPLE_RATE&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;CHUNK_MS&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;transcribeChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
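&lt;p&gt;One caveat in that sketch: awaiting inside the &lt;code&gt;data&lt;/code&gt; handler doesn't guarantee ordering, because a later chunk's API response can come back before an earlier one. Serializing emission through a promise chain fixes it (&lt;code&gt;transcribe&lt;/code&gt; stands in for the API call; the helper name is mine):&lt;/p&gt;

```javascript
// Serializes per-chunk transcription so text is emitted in spoken order,
// even if a later chunk's API response arrives before an earlier one.
// `transcribe` stands in for the Groq call; `emit` writes the text out.
function makeOrderedEmitter(transcribe, emit) {
  let tail = Promise.resolve();
  return function enqueue(chunk) {
    tail = tail
      .then(() => transcribe(chunk))
      .then((text) => emit(text + " "))
      .catch((err) => emit(`[chunk failed: ${err.message}] `));
    return tail;
  };
}
```

&lt;p&gt;This trades a little latency on fast responses for output that never scrambles mid-sentence.&lt;/p&gt;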



&lt;p&gt;Latency comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Avg latency (5s clip)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Groq&lt;/td&gt;
&lt;td&gt;~280ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;~1100ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local Whisper (GPU)&lt;/td&gt;
&lt;td&gt;~400ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local Whisper (CPU)&lt;/td&gt;
&lt;td&gt;~8000ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Groq's LPU hardware is the reason for those numbers — not software tricks.&lt;/p&gt;
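&lt;p&gt;The numbers in the table came from repeated timed calls against the same clip, along these lines (the measurement harness is mine; &lt;code&gt;fn&lt;/code&gt; is any async transcription call):&lt;/p&gt;

```javascript
// Times an async call over several trials and returns the mean latency in
// milliseconds. process.hrtime.bigint() gives nanosecond resolution in Node.
async function meanLatencyMs(fn, trials = 5) {
  let totalMs = 0;
  for (let i = 0; i < trials; i++) {
    const start = process.hrtime.bigint();
    await fn();
    totalMs += Number(process.hrtime.bigint() - start) / 1e6;
  }
  return totalMs / trials;
}

// Hypothetical usage:
// await meanLatencyMs(() => groq.audio.transcriptions.create({ ... }));
```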

&lt;h2&gt;
  
  
  The tradeoffs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I miss Dragon's commands.&lt;/strong&gt; Voice commands for formatting and navigation are genuinely powerful. Whisper transcribes only — no control layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I don't miss Dragon's software.&lt;/strong&gt; Massive install, dated UI, fragile updates. Whisper is a REST endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Privacy is a real tradeoff.&lt;/strong&gt; Audio leaves the machine via Groq's API. Groq's policy says it's not stored after transcription, but if you're in a regulated industry, Dragon's on-device model is still the compliance-safe choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I built
&lt;/h2&gt;

&lt;p&gt;After this evaluation I built &lt;a href="https://dictate-app.pages.dev" rel="noopener noreferrer"&gt;dictate.app&lt;/a&gt; — a Windows system tray app wrapping Groq's Whisper with a hotkey interface. Press a key, talk, release, text appears wherever your cursor is. $9/month, Windows 10 and 11.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;p&gt;Voice commands + compliance + on-device: Dragon.&lt;/p&gt;

&lt;p&gt;High-accuracy transcription at low cost with zero setup: Whisper via Groq, and it's not close anymore.&lt;/p&gt;

&lt;p&gt;For pure dictation, Whisper won.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>windows</category>
      <category>ai</category>
      <category>javascript</category>
    </item>
    <item>
      <title>I built a Windows dictation app with Groq Whisper — here's what I learned</title>
      <dc:creator>How Minds Work</dc:creator>
      <pubDate>Thu, 07 May 2026 04:54:30 +0000</pubDate>
      <link>https://dev.to/howmindswork/i-built-a-windows-dictation-app-with-groq-whisper-heres-what-i-learned-1l1c</link>
      <guid>https://dev.to/howmindswork/i-built-a-windows-dictation-app-with-groq-whisper-heres-what-i-learned-1l1c</guid>
      <description>&lt;h1&gt;
  
  
  I built a Windows dictation app with Groq Whisper — here's what I learned
&lt;/h1&gt;

&lt;p&gt;I've been a bad typist my whole life. Not slow — just error-prone. I spend more time correcting than creating. So a few months ago I decided to build my own Windows dictation app powered by Groq's Whisper API. What shipped is &lt;a href="https://dictate-app.pages.dev" rel="noopener noreferrer"&gt;dictate.app&lt;/a&gt;, and the journey taught me more than I expected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not just use Windows built-in dictation?
&lt;/h2&gt;

&lt;p&gt;Windows has had dictation since Windows 10. It works okay — until it doesn't. The accuracy drops on technical vocabulary, it doesn't handle punctuation well without training, and you can't pipe the output anywhere cleanly. I wanted something that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Worked in any app, not just Microsoft ones&lt;/li&gt;
&lt;li&gt;Had real-time transcription, not batch&lt;/li&gt;
&lt;li&gt;Used a modern model, not a 2018-era acoustic model&lt;/li&gt;
&lt;li&gt;Cost almost nothing per use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Groq's Whisper API checked every box.&lt;/p&gt;

&lt;h2&gt;
  
  
  The technical stack
&lt;/h2&gt;

&lt;p&gt;The app is a lightweight Windows system tray application. Here's the core flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Press a hotkey (customizable)&lt;/li&gt;
&lt;li&gt;Audio is captured from the default mic using the Windows audio APIs&lt;/li&gt;
&lt;li&gt;Audio is chunked and sent to Groq's Whisper endpoint&lt;/li&gt;
&lt;li&gt;Transcribed text is injected directly into whatever input field is focused&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Groq API call itself is dead simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;transcription&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;groq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transcriptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;audioFile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;whisper-large-v3&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;response_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;en&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The model does the heavy lifting. The tricky parts were all Windows-specific.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I got surprised
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Simulating keystrokes is harder than it looks
&lt;/h3&gt;

&lt;p&gt;Injecting text into arbitrary Windows apps sounds trivial. It's not. Different apps handle keyboard events differently. Some respond to &lt;code&gt;SendInput&lt;/code&gt;, some need &lt;code&gt;WM_CHAR&lt;/code&gt; messages, some (looking at you, certain Electron apps) need both. I ended up building a small compatibility layer that tries methods in order and falls back gracefully.&lt;/p&gt;
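&lt;p&gt;The layer's shape, stripped of the Win32 plumbing (the method functions here are placeholders for the native bindings; only the fallback logic is shown):&lt;/p&gt;

```javascript
// Tries each injection method in order until one reports success.
// In the real app the methods wrap SendInput, WM_CHAR posting, and a
// clipboard-paste fallback via native bindings; here they are stand-ins.
async function injectText(text, methods) {
  for (const method of methods) {
    try {
      if (await method(text)) return method.name || "unnamed";
    } catch {
      // This method crashed against this app; fall through to the next.
    }
  }
  throw new Error("no injection method accepted the text");
}
```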

&lt;h3&gt;
  
  
  Latency matters more than accuracy
&lt;/h3&gt;

&lt;p&gt;I assumed accuracy would be the thing users cared most about. I was wrong. Latency was the real killer. If there's more than ~1.5 seconds between you stopping speaking and the text appearing, the UX feels broken — even if the transcription is perfect. Groq's speed advantage over OpenAI Whisper here is dramatic. On identical audio clips, Groq returns in ~300ms vs ~1200ms on OpenAI's API. That gap is the entire difference between the app feeling native and feeling laggy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Background audio capture on Windows is a minefield
&lt;/h3&gt;

&lt;p&gt;Capturing audio while other apps are running means navigating Windows audio session management. I hit exclusivity conflicts with certain pro audio setups. The fix was adding a configurable audio device selector — power users who have weird audio routing can specify exactly which device to use.&lt;/p&gt;

&lt;h3&gt;
  
  
  System tray UX has its own conventions
&lt;/h3&gt;

&lt;p&gt;Windows users have strong expectations about tray apps. They should:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start minimized&lt;/li&gt;
&lt;li&gt;Show a meaningful context menu on right-click&lt;/li&gt;
&lt;li&gt;Not hijack focus&lt;/li&gt;
&lt;li&gt;Not spawn a console window&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Violate any of these and people feel like something is wrong, even if they can't articulate why.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Offline fallback.&lt;/strong&gt; When Groq's API is unreachable (VPN, firewall, offline), the app just fails. I'm adding a local Whisper model fallback — heavier, slower, but it works without internet.&lt;/p&gt;
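&lt;p&gt;The fallback I'm adding reduces to this shape (both transcribe functions are placeholders for the real Groq call and a local Whisper model):&lt;/p&gt;

```javascript
// Prefer the hosted API; fall back to a local model when the call fails.
// `groqTranscribe` and `localTranscribe` are stand-ins for the real
// implementations. Returning the source lets the UI flag degraded mode.
async function transcribeWithFallback(audio, groqTranscribe, localTranscribe) {
  try {
    return { text: await groqTranscribe(audio), source: "groq" };
  } catch {
    // Unreachable API (VPN, firewall, offline): take the slower local path.
    return { text: await localTranscribe(audio), source: "local" };
  }
}
```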

&lt;p&gt;&lt;strong&gt;Better onboarding.&lt;/strong&gt; First-run experience is terrible. I dumped people into a settings screen. Users want to press a button and hear it work within 30 seconds. I'm rebuilding the first-run flow to be a literal one-click demo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Usage analytics (opt-in).&lt;/strong&gt; I have no idea which features people actually use. Adding privacy-respecting, opt-in telemetry to guide future decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pricing model
&lt;/h2&gt;

&lt;p&gt;I landed on $9/month. The reasoning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Groq's Whisper API costs are roughly $0.04–0.08/hour of audio depending on volume&lt;/li&gt;
&lt;li&gt;Heavy users might do 2–3 hours/day of dictation, but most do 15–30 minutes&lt;/li&gt;
&lt;li&gt;At 30 min/day × 30 days = 15 hours/month × $0.06 = ~$0.90 API cost&lt;/li&gt;
&lt;li&gt;$9 gives enough margin to support, improve, and not go bankrupt&lt;/li&gt;
&lt;/ul&gt;
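&lt;p&gt;In code form, the margin math above:&lt;/p&gt;

```javascript
// Gross margin per subscriber: price minus API cost at a given usage level.
// Rates and usage mirror the bullet list above; function name is mine.
function monthlyMargin(priceDollars, minutesPerDay, dollarsPerHour, daysPerMonth = 30) {
  const apiCost = (minutesPerDay / 60) * daysPerMonth * dollarsPerHour;
  return priceDollars - apiCost;
}

// monthlyMargin(9, 30, 0.06)  ≈ $8.10 for a typical user;
// monthlyMargin(9, 180, 0.06) ≈ $3.60 even for a 3-hour/day power user.
```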

&lt;p&gt;Surprisingly, the price has not been the objection I expected. The objection is trust — people want to know their audio isn't being stored or sold. I now have a privacy page and a clear statement on first run: audio is sent to Groq for transcription and immediately discarded. Groq's own privacy policy backs this up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Numbers so far
&lt;/h2&gt;

&lt;p&gt;This is an indie project, not a funded startup. Early days. But the retention among people who actually adopt it into their workflow is strong — the ones who stick past day 3 are still using it a month later. That's the signal I'm building toward.&lt;/p&gt;

&lt;h2&gt;
  
  
  Want to try it?
&lt;/h2&gt;

&lt;p&gt;If you type a lot and wish you could just talk — &lt;a href="https://dictate-app.pages.dev" rel="noopener noreferrer"&gt;dictate.app&lt;/a&gt; is $9/month and there's a free trial. It works on Windows 10 and 11. No cloud accounts, no OAuth, just a Groq API key you bring yourself (or use the managed version where I handle the key).&lt;/p&gt;

&lt;p&gt;Happy to answer questions about the build in the comments. The Windows audio API rabbit holes alone could fill another post.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>windows</category>
      <category>ai</category>
      <category>javascript</category>
    </item>
  </channel>
</rss>
