Every voice input tool I've tried on Mac has the same problem: it doesn't know what I'm doing.
I'm writing Swift code and say "optional." The recognizer gives me the English adjective. I'm drafting an email about OKR targets and say "retention." It transcribes something phonetically similar but semantically wrong — because it has no idea I'm looking at a quarterly business review.
So I asked: what if the recognizer already knew your context before you started speaking?
That question led to ambient-voice — an open-source macOS voice input system where every layer runs on Apple-native frameworks, everything stays on your device, and screen context is injected into the recognizer at transcription time.
## The Stack: 100% Apple-Native
| Capability | Framework |
|---|---|
| Speech recognition | SpeechAnalyzer |
| Screen capture | ScreenCaptureKit |
| OCR | Vision |
| Text injection | Accessibility API + CGEvent |
| Speaker diarization | FluidAudio (CoreML) |
| Hotkey listening | CGEventTap |
No Whisper. No Electron. No cloud APIs. No third-party dependencies for core functionality.
Why this matters:
- On-device processing. Your audio never leaves your Mac. No network calls, no telemetry, no cloud storage.
- Zero cost. No subscriptions, no per-minute charges. The Neural Engine is already in your Mac.
- Automatic improvement. When Apple improves SpeechAnalyzer in macOS 27, ambient-voice gets better without code changes.
## The Core Mechanism: Context Biasing
When you press the hotkey, two things happen simultaneously:
- Audio capture begins — AVCaptureSession feeds audio to SpeechAnalyzer
- Screen context capture — ScreenCaptureKit grabs the focused window, Vision OCR extracts the visible text, and the resulting keywords are injected into SpeechAnalyzer's `AnalysisContext`
By the time your first word reaches the recognizer, it already knows what's on your screen.
Example: You're replying to an email about OKR targets. Your screen shows "retention rate," "Q3 objectives," "churn reduction." You say "change the retention target." Without context biasing, "retention" gets mis-transcribed. With it, the recognizer sees "retention" in the AnalysisContext, and the ambiguity resolves correctly — on the first pass.
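The quality of the bias list depends on how the raw OCR output is distilled. A minimal sketch of that step — turning OCR'd lines into a deduplicated, capped keyword list — might look like this (the function name, stop-word list, and cap are illustrative assumptions, not the project's actual implementation):

```swift
import Foundation

// Hypothetical sketch: distill raw OCR lines from the focused window into
// a compact keyword list suitable for biasing a recognizer.
func extractBiasKeywords(from ocrLines: [String],
                         maxKeywords: Int = 50) -> [String] {
    // Tiny illustrative stop-word list; a real one would be larger.
    let stopWords: Set<String> = ["the", "and", "for", "with", "that", "this"]
    var seen = Set<String>()
    var keywords: [String] = []

    for line in ocrLines {
        // Split on anything that is not a letter or digit.
        let tokens = line.split { !$0.isLetter && !$0.isNumber }
        for token in tokens {
            let word = token.lowercased()
            // Skip short tokens, stop words, and duplicates.
            guard word.count >= 3,
                  !stopWords.contains(word),
                  seen.insert(word).inserted else { continue }
            keywords.append(String(token))
            if keywords.count == maxKeywords { return keywords }
        }
    }
    return keywords
}
```

Feeding the recognizer a short, deduplicated list rather than the full screen dump keeps the bias focused on the vocabulary that actually matters.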
This isn't post-processing correction; it's prevention. The wrong word never reaches your document.
## Self-Improving Data Loop
Every transcription session automatically generates training data:
- Each transcription logs to `voice-history.jsonl`
- A 30-second observation window captures your corrections via the Accessibility API
- Whisper re-transcribes the audio as a high-quality reference
- The three outputs merge with weighted scoring → QLoRA fine-tuning of a local model
The system improves without requiring any effort from you: a strong-model-distills-to-small-model architecture.
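One way the weighted merge of the three outputs could work is weighted voting, where agreement between sources compounds. This is a sketch under assumed weights (the `Candidate` type, the specific weights, and the voting scheme are illustrative, not the project's actual scoring):

```swift
import Foundation

// Hypothetical sketch of the merge step: three candidate transcripts are
// weighted by source, and identical candidates pool their weight so that
// agreement between sources wins.
struct Candidate {
    let text: String
    let weight: Double
}

func mergeTranscripts(live: String, corrected: String?, whisper: String) -> String {
    var candidates = [
        Candidate(text: live, weight: 1.0),    // on-device first pass
        Candidate(text: whisper, weight: 2.0), // strong-model reference
    ]
    if let corrected {
        // User corrections observed via Accessibility are the strongest signal.
        candidates.append(Candidate(text: corrected, weight: 4.0))
    }
    // Pool weights of identical texts, then pick the highest-scoring one.
    var scores: [String: Double] = [:]
    for c in candidates { scores[c.text, default: 0] += c.weight }
    return scores.max { $0.value < $1.value }!.text
}
```

The appeal of a scheme like this is that an explicit user correction outranks the automated sources, while two automated sources that agree still outrank a single dissenting one.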
## Meeting Mode
Press ⌘M to start recording. Real-time transcription in a floating panel. When you stop, FluidAudio performs on-device speaker diarization.
Output: a Markdown file with timestamps, speaker labels, and full text. Every word stays on your Mac.
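The serialization step is straightforward once diarization has produced labeled segments. A minimal sketch, assuming a simple segment shape (the `Segment` type and the exact Markdown layout are illustrative, not the project's actual output format):

```swift
import Foundation

// Hypothetical segment shape: speaker label, start time, and text,
// as produced by a diarization pass.
struct Segment {
    let speaker: String
    let start: TimeInterval
    let text: String
}

// Render seconds as mm:ss for inline timestamps.
func timestamp(_ t: TimeInterval) -> String {
    let total = Int(t)
    return String(format: "%02d:%02d", total / 60, total % 60)
}

// Render diarized segments as a Markdown document.
func renderMeetingNotes(title: String, segments: [Segment]) -> String {
    var lines = ["# \(title)", ""]
    for s in segments {
        lines.append("**[\(timestamp(s.start))] \(s.speaker):** \(s.text)")
    }
    return lines.joined(separator: "\n")
}
```

Plain Markdown keeps the transcript greppable and diff-friendly, which matters once meeting notes accumulate.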
## Hardest Bugs (Solved with Claude Code)
Most of ambient-voice was developed with Claude Code using structured "Skills" — domain knowledge documents that capture the why and what, letting Claude figure out the how.
The trickiest problems had no Stack Overflow answers:
- Bluetooth audio silence → rewrote capture pipeline around AVCaptureSession
- Swift 6 concurrency crashes → CGEventTap with DispatchQueue bridging
- Accessibility permissions resetting on build → switched to Apple Development certificate signing
## Try It
ambient-voice is MIT licensed: github.com/Marvinngg/ambient-voice
Requirements: macOS 26 (Tahoe)+, Apple Silicon (M1+).
If you care about privacy-first voice input or building on Apple's latest frameworks — stars, issues, and PRs welcome.