
xiaocai oh
ambient-voice v2: How Deleting Whisper and Adding a JSON File Made Our Voice Pipeline Better

Last month I open-sourced ambient-voice — a macOS voice input tool built entirely on Apple-native frameworks. The headline feature was context biasing: it OCRs your screen before you speak, so the recognizer already knows your domain.

But the other headline feature — a self-improving distillation pipeline — turned out to be over-engineered. Here's what we changed in v2, and what we learned.

The v1 Pipeline (RIP)

Audio → Whisper re-transcription ──┐
                                    ├─→ Merge → QLoRA → ollama
User correction capture (30s) ─────┘

Three problems:

  1. Whisper was a GPU tax. Re-transcribing 30 min of audio → 2 hours on a GPU server. Most users don't have spare compute for background distillation.

  2. Correction capture was noisy. Users edit text for many reasons — rephrasing, restructuring, deleting. Only a fraction of edits are actual recognition error corrections. The training data was polluted.

  3. The feedback loop never closed. It needed dozens of data points, then a training run, then a model deploy before anything changed. Too slow for anyone to see improvement.

The v2 Pipeline

dictionary.json + raw transcription → Gemini correction → QLoRA → ollama

That's it.

dictionary.json

{ "terms": ["Sharpe ratio", "MPLS", "Claude Code", "MCP", "QLoRA", "ollama"] }

You list your domain-specific terms. The distillation pipeline sends the raw SpeechAnalyzer transcription + your dictionary to Gemini. Gemini returns a corrected version respecting your vocabulary. The pair becomes QLoRA training data.
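The step above is roughly: load the dictionary, fold it into a correction prompt for Gemini, and keep the (raw, corrected) pair as a training example. A minimal sketch in Python; the function names and prompt wording are mine, not the project's actual implementation, and the Gemini call itself is omitted:

```python
import json

def build_correction_prompt(raw_transcript: str, dictionary_path: str) -> str:
    """Assemble the prompt sent to Gemini: the raw SpeechAnalyzer output
    plus the user's domain vocabulary from dictionary.json."""
    with open(dictionary_path) as f:
        terms = json.load(f)["terms"]
    return (
        "Correct this speech transcription. Prefer these domain terms "
        "where the audio plausibly matches them: " + ", ".join(terms)
        + "\n\nTranscription:\n" + raw_transcript
    )

def to_training_pair(raw: str, corrected: str) -> dict:
    """One QLoRA training example: raw transcription in, corrected text out."""
    return {"input": raw, "output": corrected}
```

The point is that the dictionary does all the targeting: Gemini only ever fixes the terms you explicitly listed, so every training pair encodes exactly the vocabulary you care about.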

Why this works better than "automatic learning":

The user's real pain was never "the system doesn't learn from my corrections." It was "certain terms never come out right." dictionary.json targets that pain directly — zero noise, exact user intent.

What Got Deleted

  • WhisperTranscriber — entire module removed
  • CorrectionCapture — removed
  • CorrectionStore — removed
  • Dual-path merge logic — removed
  • GPU server dependency — gone

~30% code reduction. The cron job ("run every 10 minutes") became "run pipeline.sh when you want to."

Evaluation Framework

v2 ships with proper benchmarks. On Mac Mini M4:

AliMeeting (real Chinese meeting recordings):

  • Nearfield (headset): ~25% CER
  • Far-field (single channel from the 8-mic array): ~40% CER (heavy speaker overlap, no beamforming)

AMI (English meetings):

  • FluidAudio speaker diarization: 23.2% DER average
  • Processing speed: 130x real-time

End-to-end: 30 min meeting → 20-30s processing. Peak memory < 1GB. Runs on 8GB MacBook Air.

Not SOTA — but fully on-device, zero cost, no network calls.

Community PRs

Two external contributions merged:

  • TextInjector clipboard restore bug fix
  • OpenSSL 3.x certificate script compatibility

An MIT project getting outside PRs at two weeks old — that's the best validation metric.

Architecture Overview

Daily dictation flow:

Right Option key
  → ScreenCaptureKit + Vision OCR (context extraction)
  → SpeechAnalyzer (transcription with context bias)
  → Local LLM polish (ollama)
  → Paste to focused app

Improvement flow:

dictionary.json + voice-history.jsonl
  → Gemini distillation
  → QLoRA fine-tuning
  → Deploy to ollama
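The glue between those stages is just JSONL in, JSONL out. A hedged sketch of what that serialization step might look like; the field names (`input`/`output`) are an assumption, chosen because most QLoRA fine-tuning scripts accept that shape:

```python
import json

def load_history(path: str) -> list:
    """Read voice-history.jsonl: one JSON object per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def write_training_jsonl(pairs, out_path: str) -> int:
    """Serialize (raw, corrected) pairs one JSON object per line,
    ready to hand to a QLoRA fine-tuning script. Returns the count."""
    n = 0
    with open(out_path, "w") as f:
        for raw, corrected in pairs:
            f.write(json.dumps({"input": raw, "output": corrected},
                               ensure_ascii=False) + "\n")
            n += 1
    return n
```

Because every stage reads and writes plain JSONL, you can inspect or hand-edit the training data between steps, which is exactly the kind of transparency the v1 pipeline lacked.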

The Takeaway

We replaced an ML pipeline with a JSON file and got better results. The lesson: capture user intent explicitly, don't infer it from noisy behavioral signals.

Complex systems are seductive. Simple systems ship.


GitHub: github.com/Marvinngg/ambient-voice
License: MIT
Requirements: macOS 26 (Tahoe), Apple Silicon (M1+)

If you tried v1: git pull && make install.
If you didn't: now is a better time to start.

PRs and issues welcome.
