Last month I open-sourced ambient-voice — a macOS voice input tool built entirely on Apple-native frameworks. The headline feature was context biasing: it OCRs your screen before you speak, so the recognizer already knows your domain.
But the other headline feature — a self-improving distillation pipeline — turned out to be over-engineered. Here's what we changed in v2, and what we learned.
The v1 Pipeline (RIP)
Audio → Whisper re-transcription ──┐
                                   ├─→ Merge → QLoRA → ollama
User correction capture (30s) ─────┘
Three problems:
Whisper was a GPU tax. Re-transcribing 30 min of audio → 2 hours on a GPU server. Most users don't have spare compute for background distillation.
Correction capture was noisy. Users edit text for many reasons — rephrasing, restructuring, deleting. Only a fraction of edits are actual recognition error corrections. The training data was polluted.
The feedback loop never closed. The pipeline needed dozens of data points, then a training run, then a model deploy before anything changed. Too slow for anyone to see improvement.
The v2 Pipeline
dictionary.json + raw transcription → Gemini correction → QLoRA → ollama
That's it.
dictionary.json
{ "terms": ["Sharpe ratio", "MPLS", "Claude Code", "MCP", "QLoRA", "ollama"] }
You list your domain-specific terms. The distillation pipeline sends the raw SpeechAnalyzer transcription + your dictionary to Gemini. Gemini returns a corrected version respecting your vocabulary. The pair becomes QLoRA training data.
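The prompt-assembly step can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the helper names (`load_terms`, `build_correction_prompt`) and the exact prompt wording are assumptions; only the dictionary.json schema comes from the example above.

```python
import json


def load_terms(path: str = "dictionary.json") -> list[str]:
    """Read the user's domain terms from dictionary.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["terms"]


def build_correction_prompt(raw_transcript: str, terms: list[str]) -> str:
    """Assemble the correction request sent to Gemini.

    The model is asked to fix recognition errors only, preferring the
    user's listed spellings; the (raw, corrected) pair then becomes one
    QLoRA training example.
    """
    return (
        "Correct speech recognition errors in the transcript below. "
        "Change nothing else. Prefer these domain terms where they "
        "plausibly match: " + ", ".join(terms)
        + "\n\nTranscript:\n" + raw_transcript
    )
```

Keeping the instruction narrow ("change nothing else") matters: the whole point of v2 is that the training signal stays clean.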
Why this works better than "automatic learning":
The user's real pain was never "the system doesn't learn from my corrections." It was "certain terms never come out right." dictionary.json targets that pain directly — zero noise, exact user intent.
What Got Deleted
- WhisperTranscriber — entire module removed
- CorrectionCapture — removed
- CorrectionStore — removed
- Dual-path merge logic — removed
- GPU server dependency — gone
~30% code reduction. The cron job ("run every 10 minutes") became "run pipeline.sh when you want to."
Evaluation Framework
v2 ships with proper benchmarks. On Mac Mini M4:
AliMeeting (real Chinese meeting recordings):
- Nearfield (headset): ~25% CER
- Farfield (single channel from the 8-ch array): 40% CER (high overlap, no beamforming)
AMI (English meetings):
- FluidAudio speaker diarization: 23.2% DER average
- Processing speed: 130x real-time
End-to-end: 30 min meeting → 20-30s processing. Peak memory < 1GB. Runs on 8GB MacBook Air.
Not SOTA — but fully on-device, zero cost, no network calls.
Community PRs
Two external contributions merged:
- TextInjector clipboard restore bug fix
- OpenSSL 3.x certificate script compatibility
An MIT project getting outside PRs at two weeks old — that's the best validation metric.
Architecture Overview
Daily dictation flow:
Right Option key
→ ScreenCaptureKit + Vision OCR (context extraction)
→ SpeechAnalyzer (transcription with context bias)
→ Local LLM polish (ollama)
→ Paste to focused app
Improvement flow:
dictionary.json + voice-history.jsonl
→ Gemini distillation
→ QLoRA fine-tuning
→ Deploy to ollama
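The Gemini → QLoRA handoff amounts to serializing (raw, corrected) pairs as instruction-tuning records. A sketch under assumptions: the instruction/input/output JSONL layout shown here is a common fine-tuning format, not necessarily the exact schema ambient-voice writes, and `to_training_jsonl` is a hypothetical name.

```python
import json


def to_training_jsonl(pairs: list[tuple[str, str]], out_path: str) -> int:
    """Write (raw, corrected) pairs as one JSONL record per line.

    Each record pairs the raw SpeechAnalyzer output with the
    dictionary-guided correction, ready for a QLoRA trainer.
    Returns the number of records written.
    """
    with open(out_path, "w", encoding="utf-8") as f:
        for raw, corrected in pairs:
            rec = {
                "instruction": "Fix speech recognition errors in this dictation.",
                "input": raw,
                "output": corrected,
            }
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return len(pairs)
```

Because the corrections are dictionary-targeted, even a small file like this carries a dense signal; there is no noisy-edit filtering step to run first.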
The Takeaway
We replaced an ML pipeline with a JSON file and got better results. The lesson: capture user intent explicitly, don't infer it from noisy behavioral signals.
Complex systems are seductive. Simple systems ship.
GitHub: github.com/Marvinngg/ambient-voice
License: MIT
Requirements: macOS 26 (Tahoe), Apple Silicon (M1+)
If you tried v1: git pull && make install.
If you didn't: now is a better time to start.
PRs and issues welcome.