DEV Community: BrethofAI

The local voice stack that beats the cloud at its own benchmarks

BrethofAI — Mon, 25 May 2026 20:57:05 +0000

Brethof Voice Pro 2.0 — offline voice-to-text and 38-language translation, 100% on your machine.

Every major dictation tool — Dragon, Otter, Google, Apple, the cloud transcription service of the week — captures your voice on your machine, streams it to a data centre, transcribes it there, and sends text back. Sometimes the audio is stored. Sometimes it trains a model. Sometimes it's 'anonymised', a word that stopped meaning much years ago.

Watch what people actually dictate and you see why that matters: medical notes, legal drafts, interviews with named sources, therapy summaries, deal memos, personal journals. The most sensitive text a person produces — uploaded by default, often against HIPAA, GDPR, or plain decency, because there was no alternative.

Brethof Voice Pro is the alternative, and 2.0 is the release where 'local' stops being a compromise: it transcribes, translates, dictates into any app, and trains on your own voice — all on your hardware, with no cloud mode to forget to switch off.

The engine: GGUF + llama.cpp, 5–7× faster than Whisper

Voice Pro runs Qwen3-ASR on llama.cpp with GGUF-quantised models. What that buys you:

5–7× faster transcription than Whisper, with a ~400 ms cold start — weights are memory-mapped, so the first hotkey press after a reboot is already listening.
An 83 MB install on Windows (161 MB on Linux) — one binary that runs on CPU, NVIDIA, AMD, and Intel GPUs via Vulkan. No CUDA-only lock-in, no runtime wheels to match to your hardware.
A genuinely state-of-the-art base model. Qwen3-ASR posts 1.84% average word error rate across a 10-language test and 4.5% on English — where OpenAI's Whisper Large-v3 sits at 7.4%. Its language identification is 97.9% accurate across 30 languages, vs Whisper Large-v3's 94.1%.
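
To make the memory-mapping point concrete, here is a minimal sketch of reading a GGUF header through `mmap`. It assumes the GGUF v2+ layout (4-byte magic, little-endian uint32 version, then two uint64 counts); the function name is made up for illustration, and this is not Voice Pro's own loader.

```python
import mmap
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file.

    Assumes GGUF v2 or later, where both counts are 64-bit.
    Only the first 24 bytes are read; the rest of the mapping is paged in
    by the OS on demand, which is why mapping a large file is cheap.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            if mm[:4] != GGUF_MAGIC:
                raise ValueError("not a GGUF file")
            # Little-endian: uint32 version, uint64 tensors, uint64 metadata pairs.
            version, tensor_count, kv_count = struct.unpack_from("<IQQ", mm, 4)
            return version, tensor_count, kv_count
```

Mapping a multi-gigabyte model this way costs roughly the same as mapping a tiny one, because pages are only loaded when something reads them.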

Smaller, faster, and more accurate than the model everyone benchmarks against — running entirely on your box.

What's new in 2.0: offline translation

The headline feature is translation that never leaves your machine, across 38 languages, powered by Tencent's Hunyuan-MT2 (open-sourced May 2026). It earns the billing: the Hunyuan-MT line took first place in 30 of 31 categories at WMT25, and MT2 is a step beyond it — its translation quality is comparable to Google Gemini 3.1 Pro on the FLORES-200 benchmark (XCOMET-XXL), in a model small enough to run on your own GPU.

We benchmarked both tiers ourselves — COMET-22, higher is better, across EN↔Polish, EN→Chinese, German, and Arabic:

Tier	Size on disk	COMET-22
Fast (1.8B)	~1 GB	87.6
Quality (7B)	~4.3 GB	89.0

Both run locally: sub-second on a GPU, and the Fast tier is sub-second even on CPU. Device selection is per engine, so you can run ASR on one GPU and translation on another, or pin the 7B model to CPU on a VRAM-tight laptop.

Translation shows up everywhere transcription does:

Transcribe popup — a 'Translate to' dropdown on file, mic, and system-audio capture.
Voice keyboard — pick one or several targets; it types the translation (one per line, inline, or primary-only).
Subtitle translator — translate every cue of an SRT/VTT, keep the timings, optional bilingual mode (source line with the translation beneath).
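
The subtitle translator's contract (translate each cue, keep the timings, optionally stack source over translation) can be sketched in a few lines. This is an illustration only: `translate` is a hypothetical callback standing in for whatever translation engine you have, and the parser handles plain SRT, not VTT.

```python
import re
from typing import Callable, List, NamedTuple


class Cue(NamedTuple):
    index: int
    timing: str  # e.g. "00:00:01,000 --> 00:00:03,500", kept verbatim
    text: str    # may span several lines


def parse_srt(srt: str) -> List[Cue]:
    cues = []
    normalised = srt.replace("\r\n", "\n").strip()
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalised):
        lines = block.splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # skip malformed blocks rather than guess
        cues.append(Cue(int(lines[0].strip()), lines[1].strip(), "\n".join(lines[2:])))
    return cues


def translate_srt(srt: str, translate: Callable[[str], str], bilingual: bool = False) -> str:
    """Translate every cue's text; index and timing lines are left untouched."""
    out = []
    for cue in parse_srt(srt):
        translated = translate(cue.text)
        # Bilingual mode: source line(s) first, translation beneath.
        body = f"{cue.text}\n{translated}" if bilingual else translated
        out.append(f"{cue.index}\n{cue.timing}\n{body}")
    return "\n\n".join(out) + "\n"
```

Because the timing line is copied rather than re-parsed, the output stays in sync with the original video no matter how the translated text changes in length.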

The core, end to end

Transcription takes three inputs in one popup: an audio or video file (drag-and-drop; it pulls the track out of mp4/mkv/mov/webm and a dozen more formats), the microphone, or system audio — whatever is playing on your speakers, so you can capture a meeting, a browser tab, or a video. Output is plain text or SRT with timestamps; add the optional Forced Aligner for word-level timestamps.

Good for: transcribing interviews, turning a recorded talk into subtitles, capturing a call you're in without a bot joining the room.

The voice keyboard is push-to-talk dictation into any focused app. Default F9, hold-to-talk or toggle, optional right-mouse trigger; it injects text at the OS level — editor, browser, terminal, chat box. Turn on live translation and you speak English while it types Polish.

Good for: dictating commit messages into your IDE, replying in a language you read better than you write, drafting hands-free.

Hotwords do two jobs from one field: they bias ASR toward your brand names and jargon (so 'VFIO' stops becoming 'VEAF1'), and they pin terminology for the translator. Noise reduction (DeepFilter) is included but off by default — it hurts quality on short clean clips, so it's there for noisy rooms when you need it.

Train it on your own voice — and beat the big model

This is the part the cloud can't do. Every time you correct a misheard word, the audio-and-correction pair is saved to a local dataset, and the main window shows your running sample count. One click runs a LoRA fine-tune (it auto-selects an NVIDIA CUDA backend if you have one, CPU otherwise), then merges and exports the result to GGUF — and you switch to your personal model right from the main screen.
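
For a feel of what "saved to a local dataset" can look like, here is a minimal sketch of an append-only correction log. The directory layout, file names, and field names are all assumptions for illustration; the post does not describe Voice Pro's actual on-disk format.

```python
import json
from pathlib import Path


def save_correction(dataset_dir, audio_bytes, heard, corrected):
    """Store one clip with its misheard and corrected text; return the sample count.

    Hypothetical layout: <dataset_dir>/audio/NNNNNN.wav plus a manifest.jsonl
    with one JSON object per sample.
    """
    root = Path(dataset_dir)
    (root / "audio").mkdir(parents=True, exist_ok=True)
    manifest = root / "manifest.jsonl"

    # The running sample count is simply the number of manifest lines so far.
    count = 0
    if manifest.exists():
        with manifest.open(encoding="utf-8") as f:
            count = sum(1 for _ in f)

    clip = root / "audio" / f"{count:06d}.wav"
    clip.write_bytes(audio_bytes)

    record = {"audio": clip.name, "heard": heard, "text": corrected}
    with manifest.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return count + 1
```

A flat manifest like this is easy to feed to a fine-tuning job later, and everything stays in one folder you can inspect or delete.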

Does it actually work? We fine-tuned the small 0.6B model on about 11 hours of Polish. It scored 6.10% WER, beating Whisper Large-v3's 8.40% on the same audio: a model a fraction of the size, adapted on-device to one language and voice, outperforming the big general model. Nothing left the machine to get there.
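
If you want to check numbers like these on your own recordings, word error rate is just word-level edit distance divided by the reference length. The sketch below is the textbook definition, assuming whitespace tokenisation with no text normalisation; it is not the exact scoring script behind the figures above.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or free match
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error that the improved system removes."""
    return (baseline - improved) / baseline
```

Plugging in the two figures above, `relative_reduction(8.40, 6.10)` comes out at roughly 0.27, meaning the fine-tuned model makes about 27% fewer word errors than the baseline on that audio.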

Good for: strong accents, field vocabulary (medical, legal, engineering), or simply grinding your error rate down over a few weeks of normal use.

For developers: the MCP server

Voice Pro ships as a Model Context Protocol server — 19 tools exposing ASR and translation to any MCP agent: Claude Desktop, Claude Code, Cursor, Cline, OpenClaw, Hermes. Same binary, just --mcp; transport is stdio, so there's no port, no localhost binding, no firewall prompt:

{
  "mcpServers": {
    "brethof-voice": { "command": "brethof-voice", "args": ["--mcp"] }
  }
}

Now your agent can transcribe files, record and transcribe the mic, translate text and SRTs, switch compute devices, and manage voice profiles — locally, with no API keys and no per-minute billing. 'Transcribe this interview and give me a German SRT' becomes a fully offline operation.
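
If you'd rather poke at the server without an agent in the loop, stdio transport is just newline-delimited JSON-RPC 2.0 on stdin and stdout. The sketch below sends the standard MCP handshake and asks for the tool list. Treat the details as assumptions: the client name and protocol version string are placeholders (a server may negotiate a different version), and the individual tool names are not listed in this post.

```python
import json
import subprocess


def jsonrpc(method, params=None, msg_id=None):
    """Build one JSON-RPC 2.0 message; leave msg_id unset for a notification."""
    msg = {"jsonrpc": "2.0", "method": method}
    if msg_id is not None:
        msg["id"] = msg_id
    if params is not None:
        msg["params"] = params
    return msg


def handshake_messages():
    """initialize, then the initialized notification, then tools/list."""
    return [
        jsonrpc("initialize", {
            "protocolVersion": "2024-11-05",  # assumed; server may answer with another
            "capabilities": {},
            "clientInfo": {"name": "example-client", "version": "0.1"},
        }, msg_id=1),
        jsonrpc("notifications/initialized"),
        jsonrpc("tools/list", msg_id=2),
    ]


def list_tool_names(command=("brethof-voice", "--mcp")):
    """Spawn the server, run the handshake, and return the advertised tool names."""
    with subprocess.Popen(list(command), stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE, encoding="utf-8") as proc:
        try:
            reply = {}
            for msg in handshake_messages():
                proc.stdin.write(json.dumps(msg) + "\n")
                proc.stdin.flush()
                if "id" not in msg:
                    continue  # notifications get no reply
                while True:  # skip anything that is not the answer to this request
                    line = proc.stdout.readline()
                    if not line:
                        raise RuntimeError("server closed its stdout")
                    reply = json.loads(line)
                    if reply.get("id") == msg["id"]:
                        break
                if "error" in reply:
                    raise RuntimeError(reply["error"])
            return [tool["name"] for tool in reply["result"]["tools"]]
        finally:
            proc.terminate()
```

For real work an MCP client library is the better choice; the point here is only that there is no socket involved, just a child process and two pipes.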

Good for: agent pipelines that process audio without shipping it to a third party, batch subtitle jobs, and voice-driven tooling you actually control.

Languages, stated honestly

No rounded-up number:

Transcription: 30 selectable languages + 22 Chinese dialects the model recognises automatically (52 languages and dialects in total), plus auto-detect.
Translation: 38 languages via Hunyuan-MT2.
23 languages work in both directions — speak it, see it written, then see it in any of the others.

They don't perfectly overlap (ASR handles Danish, Greek, Finnish, and Swedish that translation doesn't; translation handles Hindi, Bengali, Tamil, and Ukrainian that ASR doesn't surface), so the feature tour publishes the full per-language table with a tick in each column. No asterisks.

The privacy guarantee

No cloud mode. There is no toggle to send audio to a server for better accuracy. Your CPU or GPU is the only option.
No telemetry. No usage stats, no crash phone-home. The only network calls are a license check, an update check, and the model downloads you trigger — all documented, all disableable.
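Audio never hits disk during transcription. The buffer lives in RAM and is freed the moment the text is produced, so there is nothing to leak and nothing to recover. The one exception is fine-tuning data: when you correct a misheard word, that audio-and-correction pair is saved to a local dataset on your own machine.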

Your voice is the most personal data you generate. It shouldn't leave your machine unless you explicitly send it somewhere. That isn't a tagline — it's why the product exists.

Platforms

Linux x86_64 — Ubuntu 22.04+, Fedora 38+, Arch, Debian 12+, CachyOS, openSUSE; X11 and Wayland; a single portable binary, no install.
Windows x64 — 10 (21H2+) and 11; per-user graphical installer, no admin rights.
macOS — not yet; on the roadmap, no ETA.

It runs CPU-only on 8 GB of RAM with an AVX2 chip. For GPU acceleration you need Vulkan 1.2+ drivers — which means NVIDIA, AMD, and Intel Arc all work from the same build, not just CUDA cards.

Try it

Pay once, own it forever — no subscription. There's a 14-day free trial with every feature unlocked and no credit card. Download for Linux or Windows at brethof.ai/voice.

Local. Private. Slightly opinionated.

Don't summarize your memory — search it

BrethofAI — Fri, 22 May 2026 15:54:00 +0000

Every long session with an AI coding agent eventually hits the same wall: the context window fills up, the conversation gets compacted, and a summary takes the place of what actually happened. Summaries are lossy by design. The decision you made three sessions ago, the reason you ruled out approach B, the exact path you fixed last Tuesday — quietly gone, because something decided they weren't important enough to keep.

I got tired of re-explaining my own project to my own assistant. So I built brethof-mind: long-term memory for Claude Code (and Claude Desktop), built on SurrealDB. The core idea is in the title — instead of summarizing your history down to fit, keep all of it and search it.

It's open source (MIT), runs 100% on your machine, and talks to no external API.

🔗 https://github.com/BrethofAI/brethof-mind

Two memories, not one

Most "memory" tools give you a single bucket of notes. brethof-mind keeps two layers, because they answer different questions:

Curated memory: the things you decide are worth pinning, such as architecture decisions, locked rules, project status, and bugs and their fixes. Small, high-signal, hand- or agent-curated.
Full chat archive: every session you've ever had, stored verbatim and searchable. This is the safety net, because when a summary would have dropped a detail, the raw exchange is still there to retrieve.

The curated layer answers "what did we decide?" The archive answers "what did we actually say back in March?" Together they mean a compaction is no longer a memory wipe — it's just the working context shrinking while the real record stays intact.

Three ways to search

Different questions want different retrieval. brethof-mind exposes all three over MCP:

Full-text (BM25) — when you know the words. SurrealDB's full-text index, lowercased + stemmed.
Vector similarity (HNSW) — when you know the meaning but not the words. Embeddings come from fastembed (all-MiniLM-L6-v2, 384-dim) — local, fast, no API key.
Graph traversal — records link to each other (decision → supersedes → decision; episode → covers → topic), so you can walk relationships, not just match text.

There are 7 MCP tools in total — semantic_search, search_memory, search_chat, query_raw, save_memory, save_commit, load_project — so the agent can pick the right retrieval for the question instead of being stuck with one.
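
To show what the full-text path rewards, here is a small pure-Python BM25 ranker. It is a sketch of the scoring formula only, with the usual default parameters (`k1=1.2`, `b=0.75`) and naive lowercase whitespace tokenisation; SurrealDB's index does this work for you and also stems, which this toy version does not.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return document indices ordered from best to worst BM25 match."""
    if not docs:
        return []
    tokenised = [d.lower().split() for d in docs]
    n_docs = len(tokenised)
    avg_len = sum(len(t) for t in tokenised) / n_docs or 1.0
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))

    scores = []
    for tokens in tokenised:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms weigh more; the +1 keeps the IDF non-negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates, and long documents are penalised.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
```

This is the "when you know the words" case: a query that shares no literal terms with a record scores zero, which is exactly the gap the vector search fills.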

100% local stack

No cloud, no telemetry, no keys leaving your box:

SurrealDB for storage (vector + full-text + graph in one engine).
fastembed for embeddings, on CPU.
FastMCP over stdio for the server.
Credentials via env vars; projects configured in a simple projects.json.

Install

git clone https://github.com/BrethofAI/brethof-mind
cd brethof-mind

# 1. Bring up SurrealDB
docker compose up -d

# 2. Configure
cp .env.example .env                    # set DB creds
cp projects.example.json projects.json
python mcp-server/scripts/init_db.py    # create namespace + schema + indexes

# 3. Register the MCP server with Claude Code (claude mcp add ...)
# 4. Drop the hooks into your Claude settings (see settings.example.json)

Full steps are in the README.

The hooks are where it gets nice

The MCP tools are useful on demand, but the hooks make memory ambient — you don't have to remember to remember:

SessionStart loads the relevant project memory into context the moment you open a session.
UserPromptSubmit nudges the agent to search memory first before answering questions about past decisions.
Stop archives the session into the searchable chat history when you're done.
A commit hook records each commit as a memory record, so your project history and your conversation history live in the same searchable place.

The result: start a fresh session and your agent already knows where the project stands — no re-briefing.
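
To give a sense of how little a hook needs to do, here is a minimal sketch of a Stop-style archiver. It assumes the hook receives a JSON payload on stdin with `session_id` and `transcript_path` fields pointing at a JSONL transcript, which is my understanding of Claude Code's hook input. It writes to a plain folder rather than SurrealDB, so treat it as an illustration, not brethof-mind's actual hook; the archive location is made up.

```python
import json
import sys
from pathlib import Path


def archive_session(payload, archive_dir):
    """Copy one session transcript, line by line, into archive_dir.

    Returns the number of transcript entries archived. Entries are kept
    verbatim; lines that are blank or not valid JSON are skipped.
    """
    transcript = Path(payload["transcript_path"]).expanduser()
    out_dir = Path(archive_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{payload['session_id']}.jsonl"

    kept = 0
    with transcript.open(encoding="utf-8") as src, \
            out_path.open("w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError:
                continue
            dst.write(line + "\n")
            kept += 1
    return kept


def main():
    # Hypothetical entry point: call main() at the bottom of the file
    # when installing this as a hook script.
    payload = json.load(sys.stdin)
    count = archive_session(payload, Path.home() / ".chat-archive")
    print(f"archived {count} entries", file=sys.stderr)
```

The real hook's job is the same shape: take the session the agent just finished, store it whole, and let search decide later what mattered.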

Works with

Claude Code and Claude Desktop today (Desktop runs the Claude Code engine under the hood, so it gets the full hooks experience). OpenClaw and Hermes integrations are next.

Why it's free

brethof-mind is MIT and will stay free. It comes from the team behind Brethof Voice Pro (local, offline voice-to-text) and follows the same principle: your data stays on your machine. This is the tooling we use ourselves, and we're sharing it because the "summarize-your-memory" default deserves a better answer.

If you try it, I'd genuinely like feedback on the hook design — that's the part with the most room to get smarter.

https://github.com/BrethofAI/brethof-mind

Transcribe, Translate, Timestamps, Fine-tune, MCP server ALL in 1 app

BrethofAI — Fri, 22 May 2026 11:00:59 +0000

Brethof Voice Pro v2.0.0 — offline speech-to-text, translation, and a voice keyboard in one app

Your voice, transcribed and translated on your own machine. No cloud, no subscription, nothing leaving your laptop.

Brethof Voice Pro just hit v2.0.0 — and it's no longer "just" dictation. It's a full local voice + translation layer for your desktop and your AI stack.

What it does

🎙️ Transcribe 30 languages + 22 Chinese dialects (52 in total) — from a file, your mic, or your system audio (meetings, videos, anything playing through your speakers).
🌍 Translate across 38 languages, fully offline. Choose the fast model or the quality one (Tencent's Hunyuan-MT — #1 in 30 of 31 categories at WMT25).
⌨️ Voice keyboard — dictate into any app with a hotkey. New in v2.0: pick several target languages and it types the translation as you speak.
📝 Timestamped subtitles — export SRT/VTT, sentence- or word-level.
🧠 Fine-tune it on you — every correction trains a personal model (one-click LoRA), so it learns your voice and your vocabulary.
🔌 MCP server — 19 tools so Claude or any MCP client can drive transcription and translation inside your own pipelines.

Why it's different

100% local. Audio and translations never leave your machine. Linux & Windows; GPU optional — runs on a plain laptop CPU.
All in one app. Transcribe, translate, subtitle, dictate, fine-tune — no stitching five tools together.
No subscription. Pay once, own it forever. 14-day free trial, no credit card.

👉 brethof.ai/voice

I Built a $49 Voice-to-Text App That Never Touches the Cloud

BrethofAI — Thu, 16 Apr 2026 07:32:31 +0000

Why I Built Brethof Voice Pro

I got tired of two things:

Voice-to-text tools that only work well in English
Every good STT solution either costs $700 (Dragon) or charges a monthly fee and sends your voice to the cloud

So I built something different.

What It Does

Brethof Voice Pro is a desktop app (Windows + Linux) that converts speech to text using AI — entirely on your device. Press Ctrl+D, speak, text appears where your cursor is.

Zero cloud calls during transcription. Models download once on first launch, then it works fully offline.

The Tech Stack

Qwen3-ASR engine — 1.84% word error rate across 10 languages (arXiv 2601.21337)
GGUF models via llama.cpp — 6 quantization tiers from 1 GB to 3.2 GB
Vulkan GPU acceleration — works on NVIDIA, AMD, Intel, and CPU-only
DeepFilter noise reduction
36 languages with auto-detection

Why It Matters

If you speak Thai, Polish, Arabic, Vietnamese, Korean, or dozens of other languages, there has been no good voice-to-text option for you. Dragon doesn't support most languages. Google's API charges per minute and requires internet. Whisper's accuracy on non-English languages is mediocre.

Qwen3-ASR changes this. State-of-the-art accuracy across 36 languages, running locally on consumer hardware.

Pricing

$49 one-time. Perpetual license. No subscription.

Compare: Dragon $699. Otter.ai $100-240/year. Google STT per-minute cloud pricing.

14-day free trial, no credit card: brethof.com

Happy to answer questions about the architecture, GGUF quantization, or Vulkan vs CUDA tradeoffs.