How Minds Work

Building a Windows App That Injects Text Into Any Application — What I Learned

I spent the last few months building a voice dictation app for Windows. The pitch is simple: press a hotkey anywhere, speak, and the transcribed text appears in whatever you were typing into — Slack, VS Code, Notepad, a terminal.

Simple pitch. Surprisingly gnarly implementation. Here is what I ran into.

The Core Problem: Text Injection

The first question is how to get text into an arbitrary application. You have a few options:

SendKeys / keybd_event — The oldest approach. Simulate keypresses one character at a time. It works, but it is fragile. Fast injection can drop characters. Some applications intercept keystroke events and treat simulated input differently from real input. Rich text editors (Slack, for example) sometimes swallow synthetic keystrokes.

Clipboard + Paste — Write text to the clipboard, then send Ctrl+V. Faster than character-by-character SendKeys, more reliable for long strings. Downside: it clobbers whatever the user had on the clipboard. Users notice this. It also fails in apps that block clipboard paste in specific fields (some password managers, some login forms).

UI Automation (UIA) — The Windows accessibility framework. You query the active window for its automation element, find the focused text control, and call SetValue or InsertText on it. This is the right tool for the job. It works with the application's actual text model, not just the keyboard event pipeline.

I ended up using a combination: UI Automation as the primary method, with a clipboard-paste fallback for apps that do not expose full UIA support.
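
As a rough illustration of the clipboard fallback, here is a sketch using Electron's clipboard module plus robotjs for the synthetic Ctrl+V. The keystroke library is an assumption (the app may use something else), and note that restoring the clipboard this way only round-trips plain text.

```typescript
// Clipboard-paste fallback: stash the user's clipboard, paste our text,
// then put the original contents back. robotjs is an assumption here; any
// library that can send a synthetic Ctrl+V works the same way. readText/
// writeText only preserves plain text, not images or rich content.
import { clipboard } from "electron";
import robot from "robotjs";

async function pasteViaClipboard(text: string): Promise<void> {
  const previous = clipboard.readText();

  clipboard.writeText(text);
  robot.keyTap("v", "control"); // synthetic Ctrl+V into the focused app

  // Give the target a moment to read the clipboard before restoring it.
  await new Promise((resolve) => setTimeout(resolve, 300));
  clipboard.writeText(previous);
}
```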

Windows UI Automation in Practice

The UIA COM interfaces are available from any language that can call Win32/COM. From Electron (Node.js), I used node-ffi-napi to call into UIAutomation.dll directly. There are also the windows-focus-assist and uiautomation npm packages, though the bindings are thin.

The flow looks like this:

1. User presses hotkey
2. Store foreground window handle (GetForegroundWindow)
3. Record focused element (IUIAutomation::GetFocusedElement)
4. Start recording audio
5. User releases hotkey (or silence detected)
6. Send audio to Whisper API
7. Receive transcription
8. Restore focus to stored element
9. Call IValueProvider::SetValue or ITextProvider::InsertText
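
Condensed into a main-process sketch, steps 1–7 look roughly like this. The `uia`, `recorder`, and `transcribe` names are illustrative stand-ins (a hypothetical native addon over the Win32/UIA calls, an audio-capture helper, and the Groq request sketched later), not the app's real API; steps 8–9 are sketched just below.

```typescript
// Steps 1–7 in the Electron main process. `uia`, `recorder`, and `transcribe`
// are illustrative stand-ins: a hypothetical native addon over the Win32/UIA
// calls, an audio-capture helper, and the Groq request sketched later.
declare const uia: {
  getForegroundWindow(): number;   // wraps Win32 GetForegroundWindow
  getFocusedElement(): unknown;    // wraps IUIAutomation::GetFocusedElement
};
declare const recorder: { start(): void; stop(): Promise<Buffer> };
declare function transcribe(audio: Buffer): Promise<string>;
declare function injectIntoTarget(text: string): void; // steps 8–9, sketched below

let targetWindow = 0;         // HWND captured at hotkey press
let targetElement: unknown;   // focused UIA element captured at hotkey press

function onHotkeyDown(): void {
  targetWindow = uia.getForegroundWindow(); // step 2
  targetElement = uia.getFocusedElement();  // step 3
  recorder.start();                         // step 4
}

async function onHotkeyUp(): Promise<void> {
  const audio = await recorder.stop();      // step 5
  const text = await transcribe(audio);     // steps 6–7
  injectIntoTarget(text);                   // steps 8–9
}
```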

Step 8 is important. By the time transcription comes back (1–2 seconds), the user may have clicked elsewhere. You need to restore focus to the original element before injecting; otherwise the text goes to the wrong place.
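
Restated as code, the restore-then-inject step looks like this. Same illustrative addon surface as the sketch above; the real app routes these calls through a native module.

```typescript
// Steps 8–9: restore focus to the element captured at hotkey press, then
// inject. The addon surface is illustrative, not the app's actual API.
declare const uia: {
  setForegroundWindow(hwnd: number): boolean;        // Win32 SetForegroundWindow
  focusElement(element: unknown): void;              // IUIAutomationElement::SetFocus
  insertText(element: unknown, text: string): void;  // SetValue / InsertText per the flow above
};
declare let targetWindow: number;
declare let targetElement: unknown;

function injectIntoTarget(text: string): void {
  // The user may have clicked elsewhere while transcription was in flight.
  uia.setForegroundWindow(targetWindow);
  uia.focusElement(targetElement);
  uia.insertText(targetElement, text);
}
```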

The Elevated Process Problem

UI Automation has a security restriction: a process running at normal integrity cannot automate a process running at high integrity (elevated/administrator). This means if the user has an elevated terminal open and tries to dictate into it, the injection silently fails.

The clean fix is to run your own process at high integrity. But that requires a UAC prompt on launch, which is a terrible user experience for a background tray app.

The workaround I settled on: detect when the target is elevated (compare integrity levels via GetTokenInformation), fall back to SendKeys in that case, and show a tooltip explaining the limitation. Not perfect, but honest.
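
In code, that workaround is a small branch. The helper names here are illustrative only; the elevation check itself lives in native code, going through OpenProcessToken and GetTokenInformation with TokenIntegrityLevel.

```typescript
// Choose an injection path based on the target's integrity level.
// All four helpers are illustrative; the elevation check is native code
// (OpenProcessToken + GetTokenInformation(TokenIntegrityLevel)).
declare function isElevated(hwnd: number): boolean;
declare function injectViaUia(hwnd: number, text: string): void;
declare function injectViaSendKeys(text: string): void;
declare function showTrayTooltip(message: string): void;

function injectText(targetHwnd: number, text: string): void {
  if (isElevated(targetHwnd)) {
    // UIA from a medium-integrity process can't drive this window.
    injectViaSendKeys(text);
    showTrayTooltip("Target app is elevated; dictation may be less reliable there.");
    return;
  }
  injectViaUia(targetHwnd, text);
}
```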

Integrating Groq Whisper

For transcription, I chose Groq's Whisper API over running Whisper locally. The reasons:

  • Local Whisper (even whisper.cpp) adds 500ms–2s of latency on mid-range hardware
  • Groq's API returns in under a second for typical voice inputs
  • Cost is approximately $0.02 per hour of audio at current pricing — negligible for dictation use
  • No GPU required on the client machine

The audio pipeline is straightforward in Electron: navigator.mediaDevices.getUserMedia for capture, encode to FLAC or MP3 (I use lamejs for MP3 in the browser context), then a standard multipart/form-data POST to https://api.groq.com/openai/v1/audio/transcriptions.

One thing worth knowing: Groq Whisper returns the full transcription as a single string. If you want word-level timestamps (useful for editing), you need to request verbose_json response format and parse the segments.
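
Putting that together, a minimal request from the Node side looks roughly like this. The model name and the GROQ_API_KEY environment variable are assumptions on my part; check Groq's docs for current model IDs.

```typescript
// Minimal transcription request from the Electron main process (Node 18+,
// which ships global fetch/FormData/Blob). The model name is an assumption;
// check Groq's model list for what's current.
async function transcribe(audio: Buffer): Promise<string> {
  const form = new FormData();
  form.append("file", new Blob([audio], { type: "audio/mpeg" }), "dictation.mp3");
  form.append("model", "whisper-large-v3");
  form.append("response_format", "verbose_json"); // segments + timestamps

  const res = await fetch("https://api.groq.com/openai/v1/audio/transcriptions", {
    method: "POST",
    headers: { Authorization: `Bearer ${process.env.GROQ_API_KEY}` },
    body: form,
  });
  if (!res.ok) throw new Error(`Transcription failed: ${res.status}`);

  const result = (await res.json()) as { text: string; segments?: unknown[] };
  // result.text is the full transcription; result.segments has per-segment timing.
  return result.text;
}
```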

Language and Runtime Choice

I chose Electron because the app needed a system tray icon, global hotkey registration, and native Windows API access — and I wanted to move fast. The global hotkey is registered via globalShortcut in Electron's main process. The UIA calls go through a small native addon.
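
The registration itself is small. The accelerator below is just an example, and because globalShortcut only reports the key press, detecting the release for push-to-talk needs a separate mechanism (a keyboard hook in the native addon, for instance).

```typescript
// Register the global dictation hotkey in Electron's main process.
// The accelerator is an example. globalShortcut only fires on press, so
// release detection (for push-to-talk) has to come from elsewhere.
import { app, globalShortcut } from "electron";

declare function onHotkeyDown(): void; // starts capture, as in the flow sketch

app.whenReady().then(() => {
  const registered = globalShortcut.register("Control+Shift+Space", onHotkeyDown);
  if (!registered) {
    console.error("Global hotkey registration failed (accelerator already in use?)");
  }
});

app.on("will-quit", () => {
  globalShortcut.unregisterAll();
});
```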

Electron apps are large (~150MB unpacked). That is the tradeoff. For a background utility that runs all day and stays out of the way, it is acceptable.

If I were doing it again with more time, I would look at Tauri. The bundle size is dramatically smaller and the Rust backend makes Win32 interop cleaner. The tradeoff is a harder dev experience and fewer community examples.

What I Would Do Differently

The biggest mistake early on was trusting SendKeys as the primary injection method. I spent two weeks tuning delay timings and handling edge cases before switching to UI Automation. UIA should have been first.

The second mistake was not handling the focus/restore step from the start. Users reported text appearing in the wrong window and it took me longer than it should have to understand the race condition.

If you are building something similar, start with UI Automation, implement focus tracking immediately, and treat SendKeys as a last resort. The accessibility APIs exist precisely for this use case.

The finished app is Dictate for Windows if you want to see the end result.
