Last month I open-sourced ambient-voice — a macOS voice input tool built entirely on Apple-native frameworks. The headline feature was context biasing: it OCRs your screen before you speak, so the recognizer already knows your domain.
But the other headline feature — a self-improving distillation pipeline — turned out to be over-engineered. Here's what we changed in v2, and what we learned.
The v1 Pipeline (RIP)
Audio → Whisper re-transcription ──┐
                                   ├─→ Merge → QLoRA → ollama
User correction capture (30s) ─────┘
Three problems:
Whisper was a GPU tax. Re-transcribing 30 min of audio → 2 hours on a GPU server. Most users don't have spare compute for background distillation.
Correction capture was noisy. Users edit text for many reasons — rephrasing, restructuring, deleting. Only a fraction of edits are actual recognition error corrections. The training data was polluted.
The feedback loop never closed. The pipeline needed dozens of data points, then a training run, then a model deploy before anything changed. Too slow for anyone to see improvement.
The v2 Pipeline
dictionary.json + raw transcription → Gemini correction → QLoRA → ollama
That's it.
dictionary.json
{ "terms": ["Sharpe ratio", "MPLS", "Claude Code", "MCP", "QLoRA", "ollama"] }
You list your domain-specific terms. The distillation pipeline sends the raw SpeechAnalyzer transcription + your dictionary to Gemini. Gemini returns a corrected version respecting your vocabulary. The pair becomes QLoRA training data.
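The prompt-assembly step can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the helper names (`load_terms`, `build_correction_prompt`) and the exact prompt wording are assumptions; only the dictionary.json schema comes from the example above.

```python
import json


def load_terms(path: str = "dictionary.json") -> list[str]:
    """Read the user's domain terms from dictionary.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["terms"]


def build_correction_prompt(raw_transcript: str, terms: list[str]) -> str:
    """Assemble the correction request sent to Gemini.

    The model is asked to fix recognition errors only, preferring the
    user's listed spellings; the (raw, corrected) pair then becomes one
    QLoRA training example.
    """
    return (
        "Correct speech recognition errors in the transcript below. "
        "Change nothing else. Prefer these domain terms where they "
        "plausibly match: " + ", ".join(terms)
        + "\n\nTranscript:\n" + raw_transcript
    )
```

Keeping the instruction narrow ("change nothing else") matters: the whole point of v2 is that the training signal stays clean.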
Why this works better than "automatic learning":
The user's real pain was never "the system doesn't learn from my corrections." It was "certain terms never come out right." dictionary.json targets that pain directly — zero noise, exact user intent.
What Got Deleted
- WhisperTranscriber — entire module removed
- CorrectionCapture — removed
- CorrectionStore — removed
- Dual-path merge logic — removed
- GPU server dependency — gone
~30% code reduction. The cron job ("run every 10 minutes") became "run pipeline.sh when you want to."
Evaluation Framework
v2 ships with proper benchmarks. On Mac Mini M4:
AliMeeting (real Chinese meeting recordings):
- Nearfield (headset): ~25% CER
- Farfield (single channel from the 8-ch array): 40% CER (high overlap, no beamforming)
AMI (English meetings):
- FluidAudio speaker diarization: 23.2% DER average
- Processing speed: 130x real-time
End-to-end: 30 min meeting → 20-30s processing. Peak memory < 1GB. Runs on 8GB MacBook Air.
Not SOTA — but fully on-device, zero cost, no network calls.
Community PRs
Two external contributions merged:
- TextInjector clipboard restore bug fix
- OpenSSL 3.x certificate script compatibility
An MIT project getting outside PRs at two weeks old — that's the best validation metric.
Architecture Overview
Daily dictation flow:
Right Option key
→ ScreenCaptureKit + Vision OCR (context extraction)
→ SpeechAnalyzer (transcription with context bias)
→ Local LLM polish (ollama)
→ Paste to focused app
Improvement flow:
dictionary.json + voice-history.jsonl
→ Gemini distillation
→ QLoRA fine-tuning
→ Deploy to ollama
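The Gemini → QLoRA handoff amounts to serializing (raw, corrected) pairs as instruction-tuning records. A sketch under assumptions: the instruction/input/output JSONL layout shown here is a common fine-tuning format, not necessarily the exact schema ambient-voice writes, and `to_training_jsonl` is a hypothetical name.

```python
import json


def to_training_jsonl(pairs: list[tuple[str, str]], out_path: str) -> int:
    """Write (raw, corrected) pairs as one JSONL record per line.

    Each record pairs the raw SpeechAnalyzer output with the
    dictionary-guided correction, ready for a QLoRA trainer.
    Returns the number of records written.
    """
    with open(out_path, "w", encoding="utf-8") as f:
        for raw, corrected in pairs:
            rec = {
                "instruction": "Fix speech recognition errors in this dictation.",
                "input": raw,
                "output": corrected,
            }
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return len(pairs)
```

Because the corrections are dictionary-targeted, even a small file like this carries a dense signal; there is no noisy-edit filtering step to run first.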
The Takeaway
We replaced an ML pipeline with a JSON file and got better results. The lesson: capture user intent explicitly, don't infer it from noisy behavioral signals.
Complex systems are seductive. Simple systems ship.
GitHub: github.com/Marvinngg/ambient-voice
License: MIT
Requirements: macOS 26 (Tahoe), Apple Silicon (M1+)
If you tried v1: git pull && make install.
If you didn't: now is a better time to start.
PRs and issues welcome.