xiaocai oh
I Built a Context-Aware Voice Input Tool for macOS — 100% On-Device, Zero Cloud

Every voice input tool I've tried on Mac has the same problem: it doesn't know what I'm doing.

I'm writing Swift code and say "optional." The recognizer gives me the English adjective. I'm drafting an email about OKR targets and say "retention." It transcribes something phonetically similar but semantically wrong — because it has no idea I'm looking at a quarterly business review.

So I asked: what if the recognizer already knew your context before you started speaking?

That question led to ambient-voice — an open-source macOS voice input system where every layer runs on Apple-native frameworks, everything stays on your device, and screen context is injected into the recognizer at transcription time.

The Stack: 100% Apple-Native

| Capability | Framework |
| --- | --- |
| Speech recognition | SpeechAnalyzer |
| Screen capture | ScreenCaptureKit |
| OCR | Vision |
| Text injection | Accessibility API + CGEvent |
| Speaker diarization | FluidAudio (CoreML) |
| Hotkey listening | CGEventTap |

No Whisper in the transcription path. No Electron. No cloud APIs. No third-party dependencies for core functionality.

Why this matters:

  1. On-device processing. Your audio never leaves your Mac. No network calls, no telemetry, no cloud storage.
  2. Zero cost. No subscriptions, no per-minute charges. The Neural Engine is already in your Mac.
  3. Automatic improvement. When Apple improves SpeechAnalyzer in macOS 27, ambient-voice gets better without code changes.

The Core Mechanism: Context Biasing

When you press the hotkey, two things happen simultaneously:

  1. Audio capture begins — AVCaptureSession feeds audio to SpeechAnalyzer
  2. Screen context capture — ScreenCaptureKit grabs the focused window, Vision OCR extracts visible text, keywords get injected into SpeechAnalyzer's AnalysisContext

By the time your first word reaches the recognizer, it already knows what's on your screen.
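To make the biasing step concrete, here is a minimal, self-contained sketch of how OCR output from the focused window could be distilled into a keyword list for the recognizer. The function name, thresholds, and ranking are illustrative assumptions for this post, not ambient-voice's actual implementation:

```swift
import Foundation

// Hypothetical helper: turn raw OCR text from the focused window into a
// small, deduplicated keyword list suitable for biasing a recognizer.
// Names and thresholds are illustrative, not the project's real code.
func extractKeywords(from ocrText: String, limit: Int = 50) -> [String] {
    // Split on non-letter characters and normalize case.
    let tokens = ocrText
        .components(separatedBy: CharacterSet.letters.inverted)
        .map { $0.lowercased() }
        .filter { $0.count >= 3 }   // drop short noise tokens

    // Rank by frequency so the most prominent on-screen terms win.
    var counts: [String: Int] = [:]
    for token in tokens { counts[token, default: 0] += 1 }

    return counts
        .sorted { $0.value > $1.value }
        .prefix(limit)
        .map { $0.key }
}

let screenText = "Q3 objectives: retention rate 92%, churn reduction, retention target"
let keywords = extractKeywords(from: screenText, limit: 5)
// "retention" appears twice on screen, so it ranks first.
print(keywords.first ?? "")
```

The resulting list is what would be handed to the recognizer's contextual-biasing input before the first audio buffer arrives.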

Example: You're replying to an email about OKR targets. Your screen shows "retention rate," "Q3 objectives," "churn reduction." You say "change the retention target." Without context biasing, "retention" gets mis-transcribed. With it, the recognizer sees "retention" in the AnalysisContext, and the ambiguity resolves correctly — on the first pass.

This isn't post-processing applied after the fact: the bias shapes the recognizer's first decoding pass. Prevention, not correction.

Self-Improving Data Loop

Every transcription session automatically generates training data:

  • Each transcription logs to voice-history.jsonl
  • A 30-second observation window captures your corrections via the Accessibility API
  • Whisper re-transcribes the audio as a high-quality reference
  • The three outputs merge with weighted scoring → QLoRA fine-tuning of a local model

The system improves without requiring any effort from you: a strong model (Whisper) distills its quality into the small on-device model.
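The merge step above can be sketched as picking a training label from competing transcript sources. The struct, source names, and weights below are assumptions made up for this example, not the project's real schema:

```swift
import Foundation

// Hypothetical sketch of the weighted merge: three transcript sources for
// the same utterance, each with a trust weight. The highest-weighted
// non-empty candidate becomes the fine-tuning label.
struct Candidate {
    let source: String   // e.g. "live", "correction", "whisper"
    let text: String
    let weight: Double
}

func chooseLabel(_ candidates: [Candidate]) -> String {
    // Prefer the highest-weighted candidate that actually has text.
    candidates
        .filter { !$0.text.isEmpty }
        .max { $0.weight < $1.weight }?
        .text ?? ""
}

let candidates = [
    Candidate(source: "live",       text: "change the retention target",        weight: 0.5),
    Candidate(source: "correction", text: "change the retention target to 92%", weight: 1.0),
    Candidate(source: "whisper",    text: "change the retention target",        weight: 0.8),
]
// The user's own correction carries the most weight.
print(chooseLabel(candidates))
```

A real pipeline would score per-segment rather than per-utterance, but the principle is the same: the user's observed correction outranks the offline reference, which outranks the live transcript.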

Meeting Mode

Press ⌘M to start recording. Real-time transcription in a floating panel. When you stop, FluidAudio performs on-device speaker diarization.

Output: a Markdown file with timestamps, speaker labels, and full text. Every word stays on your Mac.
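The output format is simple enough to sketch end to end. The `Segment` struct and Markdown layout here are assumptions for illustration, not ambient-voice's exact schema:

```swift
import Foundation

// Illustrative sketch: turn diarized segments (speaker label + timestamp
// + text) into a Markdown meeting note.
struct Segment {
    let speaker: String
    let start: TimeInterval   // seconds from meeting start
    let text: String
}

func timestamp(_ t: TimeInterval) -> String {
    let s = Int(t)
    return String(format: "%02d:%02d", s / 60, s % 60)
}

func renderMarkdown(title: String, segments: [Segment]) -> String {
    var lines = ["# \(title)", ""]
    for seg in segments {
        lines.append("**[\(timestamp(seg.start))] \(seg.speaker):** \(seg.text)")
    }
    return lines.joined(separator: "\n")
}

let md = renderMarkdown(title: "Q3 Planning", segments: [
    Segment(speaker: "Speaker 1", start: 0,  text: "Let's review the OKRs."),
    Segment(speaker: "Speaker 2", start: 75, text: "Retention is at 92%."),
])
print(md)
```

Since everything up to this point ran on-device, the Markdown file is the only artifact the meeting produces, and it never leaves your Mac.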

Hardest Bugs (Solved with Claude Code)

Most of ambient-voice was developed with Claude Code using structured "Skills" — domain knowledge documents that capture the why and what, letting Claude figure out the how.

The trickiest problems had no Stack Overflow answers:

  • Bluetooth audio silence → rewrote capture pipeline around AVCaptureSession
  • Swift 6 concurrency crashes → CGEventTap with DispatchQueue bridging
  • Accessibility permissions resetting on build → switched to Apple Development certificate signing

Try It

ambient-voice is MIT licensed: github.com/Marvinngg/ambient-voice

Requirements: macOS 26 (Tahoe)+, Apple Silicon (M1+).

If you care about privacy-first voice input or building on Apple's latest frameworks — stars, issues, and PRs welcome.
