Every AI coding assistant assumes you'll type. GitHub Copilot, Continue, Kiro — they all wait for a typed prompt. But what if you could just talk?
That's why I built VoxPilot.
The Problem
I spend a lot of time typing prompts like "refactor this function to use async/await with proper error handling and add unit tests." That's 15 seconds of typing for something I could say in 3 seconds.
For developers with RSI or carpal tunnel, the problem is worse. Typing isn't just slow — it's painful.
The Solution
VoxPilot is a VS Code extension that captures your voice, transcribes it locally using Moonshine ASR, and sends the text to your coding assistant.
The key word is "locally." Your audio never leaves your machine. There are no API keys, no cloud calls, no telemetry. The ASR model is 27MB and runs via ONNX Runtime.
How It Works
Microphone → PCM Audio → Voice Activity Detection → Moonshine ASR → Text → VS Code Chat
Audio Capture: Native CLI tools (arecord on Linux, sox on macOS, ffmpeg on Windows) capture raw PCM audio at 16kHz.
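Each of those tools can stream 16 kHz, 16-bit, mono PCM straight to stdout. Here's a rough sketch of how the platform dispatch might look (a hypothetical helper, not VoxPilot's actual code — and the Windows device name is a placeholder that varies per machine):

```typescript
interface CaptureCommand {
  cmd: string;
  args: string[];
}

// Pick a capture command per platform. Every variant emits raw
// 16 kHz / 16-bit signed / mono PCM on stdout, ready for the VAD stage.
function captureCommand(platform: string): CaptureCommand {
  switch (platform) {
    case "linux":
      // arecord writes to stdout when no output file is given.
      return { cmd: "arecord", args: ["-t", "raw", "-f", "S16_LE", "-r", "16000", "-c", "1"] };
    case "darwin":
      // -d records from the default input device; "-" streams to stdout.
      return { cmd: "sox", args: ["-d", "-t", "raw", "-r", "16000", "-e", "signed", "-b", "16", "-c", "1", "-"] };
    case "win32":
      // "Microphone" is a placeholder; real DirectShow device names differ per machine.
      return { cmd: "ffmpeg", args: ["-f", "dshow", "-i", "audio=Microphone", "-f", "s16le", "-ar", "16000", "-ac", "1", "-"] };
    default:
      throw new Error(`Unsupported platform: ${platform}`);
  }
}
```

The extension would then spawn the chosen command as a child process and read PCM frames off its stdout stream.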
Voice Activity Detection: An energy-based VAD detects when you start and stop speaking. No need to press a button — just talk.
Transcription: Moonshine's encoder-decoder architecture processes the audio through ONNX Runtime. The Tiny model (27MB) handles quick commands; the Base model (65MB) is better for longer dictation.
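Before the audio reaches the model, the raw 16-bit PCM has to be converted to normalized float samples, the input format ONNX speech models typically expect. A minimal sketch (the extension's actual preprocessing may differ):

```typescript
// Convert 16-bit signed PCM to Float32 normalized into [-1, 1),
// the usual input range for neural ASR models.
function pcm16ToFloat32(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    // Dividing by 32768 maps the full Int16 range onto [-1, 1).
    out[i] = pcm[i] / 32768;
  }
  return out;
}
```

The resulting `Float32Array` is what gets fed to the ONNX Runtime session as the encoder's input tensor.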
Delivery: The transcript goes to VS Code's Chat API, targeting whatever participant you've configured (Copilot, Continue, etc.).
Privacy
This was non-negotiable. Voice data is sensitive. VoxPilot processes everything in-memory and never writes audio to disk or sends it over the network.
Try It
- Open VSX: https://open-vsx.org/extension/natearcher-ai/voxpilot
- GitHub: https://github.com/natearcher-ai/voxpilot
MIT licensed. PRs welcome. Star the repo if it's useful.