BrethofAI

Posted on May 25

The local voice stack that beats the cloud at its own benchmarks

#localai #privacy #opensource #ai

Brethof Voice Pro 2.0 — offline voice-to-text and 38-language translation, 100% on your machine.

Every major dictation tool — Dragon, Otter, Google, Apple, the cloud transcription service of the week — captures your voice on your machine, streams it to a data centre, transcribes it there, and sends text back. Sometimes the audio is stored. Sometimes it trains a model. Sometimes it's 'anonymised', a word that stopped meaning much years ago.

Watch what people actually dictate and you see why that matters: medical notes, legal drafts, interviews with named sources, therapy summaries, deal memos, personal journals. The most sensitive text a person produces — uploaded by default, often against HIPAA, GDPR, or plain decency, because there was no alternative.

Brethof Voice Pro is the alternative, and 2.0 is the release where 'local' stops being a compromise: it transcribes, translates, dictates into any app, and trains on your own voice — all on your hardware, with no cloud mode to forget to switch off.

The engine: GGUF + llama.cpp, 5–7× faster than Whisper

Voice Pro runs Qwen3-ASR on llama.cpp with GGUF-quantised models. What that buys you:

5–7× faster transcription than Whisper, with a ~400 ms cold start — weights are memory-mapped, so the first hotkey press after a reboot is already listening.
An 83 MB install on Windows (161 MB on Linux) — one binary that runs on CPU, NVIDIA, AMD, and Intel GPUs via Vulkan. No CUDA-only lock-in, no runtime wheels to match to your hardware.
A genuinely state-of-the-art base model. Qwen3-ASR posts 1.84% average word error rate across a 10-language test and 4.5% on English — where OpenAI's Whisper Large-v3 sits at 7.4%. Its language identification is 97.9% accurate across 30 languages, vs Whisper Large-v3's 94.1%.

Smaller, faster, and more accurate than the model everyone benchmarks against — running entirely on your box.
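
For readers new to the metric: word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal illustration, assuming lowercase normalisation and plain whitespace tokenisation. Published benchmarks apply more elaborate text normalisation, and this is not the scoring script behind the figures above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row in memory.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

One wrong word in a six-word sentence gives a WER of 1/6, or about 16.7%. A 4.5% WER works out to roughly one error every 22 words.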

What's new in 2.0: offline translation

The headline feature is translation that never leaves your machine, across 38 languages, powered by Tencent's Hunyuan-MT2 (open-sourced May 2026). It earns the billing: the Hunyuan-MT line took first place in 30 of 31 categories at WMT25, and MT2 is a step beyond it — its translation quality is comparable to Google Gemini 3.1 Pro on the FLORES-200 benchmark (XCOMET-XXL), in a model small enough to run on your own GPU.

We benchmarked both tiers ourselves using COMET-22 (higher is better) across EN↔Polish plus EN→Chinese, EN→German, and EN→Arabic:

Tier	Size on disk	COMET-22
Fast (1.8B)	~1 GB	87.6
Quality (7B)	~4.3 GB	89.0

Both run locally: sub-second on a GPU, and the Fast tier is sub-second even on CPU. Because each engine has its own device setting, you can run ASR on one GPU and translation on another, or pin the 7B model to CPU on a VRAM-tight laptop.

Translation shows up everywhere transcription does:

Transcribe popup — a 'Translate to' dropdown on file, mic, and system-audio capture.
Voice keyboard — pick one or several targets; it types the translation (one per line, inline, or primary-only).
Subtitle translator — translate every cue of an SRT/VTT, keep the timings, optional bilingual mode (source line with the translation beneath).
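
The subtitle flow above can be sketched in a few lines: split the file into cues, translate only the text, and write the index and timing lines back untouched. This is an illustrative sketch, not Voice Pro's code. `translate_srt` and the `translate` callable are hypothetical names standing in for whatever translation backend you plug in.

```python
import re
from typing import Callable

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:02,500".
TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def translate_srt(srt_text: str, translate: Callable[[str], str],
                  bilingual: bool = False) -> str:
    """Translate cue text in an SRT document, leaving indices and timings as-is."""
    out_cues = []
    # Cues are separated by one or more blank lines.
    for cue in re.split(r"\n\s*\n", srt_text.strip()):
        lines = cue.splitlines()
        # Locate the timing line; everything up to it is header, the rest is text.
        t = next((i for i, ln in enumerate(lines) if TIMING.match(ln.strip())), None)
        if t is None:
            out_cues.append(cue)  # not a well-formed cue: pass through unchanged
            continue
        header = lines[: t + 1]
        text = " ".join(lines[t + 1:])  # join wrapped lines so the translator sees one sentence
        translated = translate(text)
        # Bilingual mode keeps the source line with the translation beneath it.
        body = [text, translated] if bilingual else [translated]
        out_cues.append("\n".join(header + body))
    return "\n\n".join(out_cues) + "\n"
```

Joining wrapped cue lines before translating is a design choice in this sketch: it gives the translator a whole sentence at the cost of the original line breaks.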

The core, end to end

Transcription takes three inputs in one popup: an audio or video file (drag-and-drop; it pulls the track out of mp4/mkv/mov/webm and a dozen more formats), the microphone, or system audio — whatever is playing on your speakers, so you can capture a meeting, a browser tab, or a video. Output is plain text or SRT with timestamps; add the optional Forced Aligner for word-level timestamps.

Good for: transcribing interviews, turning a recorded talk into subtitles, capturing a call you're in without a bot joining the room.

The voice keyboard is push-to-talk dictation into any focused app. Default F9, hold-to-talk or toggle, optional right-mouse trigger; it injects text at the OS level — editor, browser, terminal, chat box. Turn on live translation and you speak English while it types Polish.

Good for: dictating commit messages into your IDE, replying in a language you read better than you write, drafting hands-free.

Hotwords do two jobs from one field: they bias ASR toward your brand names and jargon (so 'VFIO' stops becoming 'VEAF1'), and they pin terminology for the translator. Noise reduction (DeepFilter) is included but off by default — it hurts quality on short clean clips, so it's there for noisy rooms when you need it.

Train it on your own voice — and beat the big model

This is the part the cloud can't do. Every time you correct a misheard word, the audio-and-correction pair is saved to a local dataset, and the main window shows your running sample count. One click runs a LoRA fine-tune (it auto-selects an NVIDIA CUDA backend if you have one, CPU otherwise), then merges and exports the result to GGUF — and you switch to your personal model right from the main screen.

Does it actually work? We fine-tuned the small 0.6B model on about 11 hours of Polish. It scored 6.10% WER, beating Whisper Large-v3's 8.40% on the same audio. A model a fraction of the size, adapted on-device to one language and voice, outperforming the big general model. Nothing left the machine to get there.
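
To put those two figures on one scale, the relative error reduction is the gap divided by the baseline. A small worked calculation using only the numbers quoted above; the helper name is just for illustration.

```python
def relative_error_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline's errors that the new model removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Whisper Large-v3 at 8.40% WER vs the fine-tuned 0.6B model at 6.10% WER.
reduction = relative_error_reduction(8.40, 6.10)
print(f"{reduction:.1%}")  # 27.4%
```

In other words, the fine-tuned model makes a bit more than a quarter fewer word errors than the baseline on that audio.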

Good for: strong accents, field vocabulary (medical, legal, engineering), or simply grinding your error rate down over a few weeks of normal use.

For developers: the MCP server

Voice Pro ships as a Model Context Protocol server — 19 tools exposing ASR and translation to any MCP agent: Claude Desktop, Claude Code, Cursor, Cline, OpenClaw, Hermes. Same binary, just --mcp; transport is stdio, so there's no port, no localhost binding, no firewall prompt:

{
  "mcpServers": {
    "brethof-voice": { "command": "brethof-voice", "args": ["--mcp"] }
  }
}

Now your agent can transcribe files, record and transcribe the mic, translate text and SRTs, switch compute devices, and manage voice profiles — locally, with no API keys and no per-minute billing. 'Transcribe this interview and give me a German SRT' becomes a fully offline operation.
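
If you want to see what an agent does under the hood, the sketch below speaks MCP's stdio transport by hand: newline-delimited JSON-RPC 2.0 messages, an `initialize` handshake, then `tools/list` to discover what the server offers. It assumes `brethof-voice` is on your PATH, as in the config above. The client name, the protocol version string, and the helper names are illustrative assumptions, and in practice an MCP SDK handles all of this for you. Tool names are deliberately not hard-coded, since the server reports them itself.

```python
import json
import subprocess
from typing import Optional


def rpc(method: str, params: Optional[dict] = None,
        msg_id: Optional[int] = None) -> str:
    """Serialise one JSON-RPC 2.0 message as a single line (no id = notification)."""
    msg: dict = {"jsonrpc": "2.0", "method": method}
    if msg_id is not None:
        msg["id"] = msg_id
    if params is not None:
        msg["params"] = params
    return json.dumps(msg) + "\n"


def list_tools(command: list) -> list:
    """Spawn an MCP server over stdio and return the tool names it advertises."""
    proc = subprocess.Popen(command, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True)
    assert proc.stdin is not None and proc.stdout is not None

    def send(line: str) -> None:
        proc.stdin.write(line)
        proc.stdin.flush()

    def wait_for(msg_id: int) -> dict:
        # Skip blank lines and notifications until the reply with our id arrives.
        for line in proc.stdout:
            if not line.strip():
                continue
            msg = json.loads(line)
            if msg.get("id") == msg_id:
                return msg
        raise RuntimeError("server closed the stream before replying")

    try:
        send(rpc("initialize", {
            "protocolVersion": "2024-11-05",  # assumed; use a version your server supports
            "capabilities": {},
            "clientInfo": {"name": "example-client", "version": "0.1"},  # hypothetical
        }, msg_id=1))
        wait_for(1)
        send(rpc("notifications/initialized"))
        send(rpc("tools/list", msg_id=2))
        reply = wait_for(2)
        return [tool["name"] for tool in reply["result"]["tools"]]
    finally:
        proc.terminate()


# Usage (requires the binary to be installed):
#   print(list_tools(["brethof-voice", "--mcp"]))
```

Because everything travels over the child process's stdin and stdout, the exchange never opens a socket.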

Good for: agent pipelines that process audio without shipping it to a third party, batch subtitle jobs, and voice-driven tooling you actually control.

Languages, stated honestly

No rounded-up number:

Transcription: 30 selectable languages + 22 Chinese dialects the model recognises automatically (52 languages and dialects in total), plus auto-detect.
Translation: 38 languages via Hunyuan-MT2.
23 languages work in both directions — speak it, see it written, then see it in any of the others.

They don't perfectly overlap (ASR handles Danish, Greek, Finnish, and Swedish that translation doesn't; translation handles Hindi, Bengali, Tamil, and Ukrainian that ASR doesn't surface), so the feature tour publishes the full per-language table with a tick in each column. No asterisks.
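
The counts above fit together by inclusion-exclusion. A quick check using only the figures in this section, assuming the 23 is the overlap between the 30 selectable ASR languages and the 38 translation languages. The named languages in the comments are the ones listed above; the full per-language sets live in the feature tour and are not reproduced here.

```python
asr_languages = 30          # selectable transcription languages
translation_languages = 38  # translation languages via Hunyuan-MT2
both = 23                   # supported by ASR and by translation

asr_only = asr_languages - both                  # includes Danish, Greek, Finnish, Swedish
translation_only = translation_languages - both  # includes Hindi, Bengali, Tamil, Ukrainian
either = asr_languages + translation_languages - both

print(asr_only, translation_only, either)  # 7 15 45
```

So under that reading, 7 languages are transcription-only, 15 are translation-only, and 45 distinct languages are covered by at least one engine, before counting the Chinese dialects.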

The privacy guarantee

No cloud mode. There is no toggle to send audio to a server for better accuracy. Your CPU or GPU is the only option.
No telemetry. No usage stats, no crash phone-home. The only network calls are a license check, an update check, and the model downloads you trigger — all documented, all disableable.
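Audio never hits disk during transcription. The buffer lives in RAM and is freed the moment the text is produced, so there is nothing to leak and nothing to recover. The one exception is the voice-training dataset: the audio-and-correction pairs saved when you correct a misheard word are stored locally, on your own machine.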

Your voice is the most personal data you generate. It shouldn't leave your machine unless you explicitly send it somewhere. That isn't a tagline — it's why the product exists.

Platforms

Linux x86_64 — Ubuntu 22.04+, Fedora 38+, Arch, Debian 12+, CachyOS, openSUSE; X11 and Wayland; a single portable binary, no install.
Windows x64 — 10 (21H2+) and 11; per-user graphical installer, no admin rights.
macOS — not yet; on the roadmap, no ETA.

It runs CPU-only on 8 GB of RAM with an AVX2 chip. For GPU acceleration you need Vulkan 1.2+ drivers — which means NVIDIA, AMD, and Intel Arc all work from the same build, not just CUDA cards.
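
On Linux you can check the CPU-only requirements yourself before downloading. The sketch below is a rough pre-flight check, not part of Voice Pro: it assumes an x86_64 Linux host with the usual `/proc/cpuinfo` and `/proc/meminfo` layout, and the function names are made up for illustration.

```python
from pathlib import Path


def has_avx2(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists avx2."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "avx2" in value.split():
            return True
    return False


def total_ram_gib(meminfo_text: str) -> float:
    """Parse the MemTotal line (reported in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])
            return kib / (1024 ** 2)
    raise ValueError("MemTotal line not found")


def check_linux_host() -> tuple:
    """Read the real files and return (avx2_supported, ram_in_gib). Linux only."""
    cpuinfo = Path("/proc/cpuinfo").read_text()
    meminfo = Path("/proc/meminfo").read_text()
    return has_avx2(cpuinfo), total_ram_gib(meminfo)
```

A machine sold as 8 GB typically reports slightly less than 8 GiB here, because firmware and integrated graphics reserve some memory, so treat the RAM figure as approximate.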

Try it

Pay once, own it forever — no subscription. There's a 14-day free trial with every feature unlocked and no credit card. Download for Linux or Windows at brethof.ai/voice.

Local. Private. Slightly opinionated.
