Aura Technologies
I Built a Local Voice-to-Text App with Rust, Tauri 2.0, whisper.cpp, and llama.cpp — Here's How

I got tired of paying $15/month to send my voice to someone else's server.

Wispr Flow is a great product — I used it for months. But one day I opened Wireshark out of curiosity and watched my audio clips leave my machine, hit a cloud endpoint, and come back as text. Every sentence I dictated — emails to my wife, Slack messages to coworkers, notes about half-baked startup ideas — all of it routed through infrastructure I didn't control.

That was the moment I decided to build my own. Fully local. No cloud. No subscription. Just a hotkey, a microphone, and local AI models doing the work on my own hardware.

The result is MumbleFlow — a local voice-to-text desktop app built with Tauri 2.0, whisper.cpp, and llama.cpp. It runs on macOS, Windows, and Linux, costs $5 once, and never sends a single byte of audio off your machine.

Here's how I built it.

The Architecture (Big Picture)

The pipeline is deceptively simple:

```
Fn key held → mic capture → whisper.cpp (STT) → llama.cpp (cleanup) → text injected at cursor
```

Under the hood, there are four layers:

  1. Tauri 2.0 shell — the desktop app framework, handling the window, system tray, hotkey registration, and IPC between the frontend and backend
  2. Rust backend — the core logic: audio capture, model management, pipeline orchestration
  3. whisper.cpp — OpenAI's Whisper model compiled to C/C++, called from Rust via FFI bindings, running inference on GPU (Metal on macOS, CUDA on NVIDIA)
  4. llama.cpp — a local LLM (typically a small quantized model like Qwen 2.5 3B) that takes raw transcription and cleans it into proper text

No Node.js runtime. No Python. No Docker. One binary, two model files, zero network calls.
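Stripped of the framework glue, the whole pipeline reduces to two fallible stages chained together. A minimal sketch; the trait names (`SpeechToText`, `TextCleanup`) are illustrative, not MumbleFlow's actual API:

```rust
// Two stages, each of which can fail independently. In the real app the
// implementors wrap whisper.cpp and llama.cpp; here they are just traits.
trait SpeechToText {
    fn transcribe(&self, samples: &[f32]) -> Result<String, String>;
}

trait TextCleanup {
    fn cleanup(&self, raw: &str) -> Result<String, String>;
}

// The whole pipeline: raw audio in, cleaned text out.
fn run_pipeline(
    stt: &dyn SpeechToText,
    llm: &dyn TextCleanup,
    samples: &[f32],
) -> Result<String, String> {
    let raw = stt.transcribe(samples)?;
    llm.cleanup(&raw)
}
```

Everything else (hotkeys, buffering, injection) is plumbing around these two calls.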

Why Tauri Over Electron

I know, I know — "why not Electron" is a tired debate. But for this project it wasn't even close.

Wispr Flow's Electron-based competitor (not naming names) idles at 400MB of RAM. MumbleFlow idles at ~45MB. When you're also loading ML models into memory, every megabyte of framework overhead matters.

Tauri 2.0 gave me:

  • Rust backend natively — no bridge tax between "the app framework" and "the real code." The backend is the app.
  • ~8MB bundle for the app shell (before models). Electron would add 150MB+ just for Chromium.
  • Native OS integration — Tauri 2.0's plugin system for things like global hotkeys, notifications, and system tray is clean and well-documented.
  • Security model — Tauri's allowlist-based IPC means the webview can only call explicitly permitted Rust functions. For a privacy-focused app, this matters philosophically too.

The tradeoff? Tauri's webview rendering isn't pixel-identical across platforms (it uses the OS webview — WebKit on macOS, WebView2 on Windows, WebKitGTK on Linux). For a utility app with a minimal UI, that's fine. For a design tool, maybe not.

```rust
// Tauri 2.0 command — called from the frontend via IPC
#[tauri::command]
async fn transcribe_audio(
    state: tauri::State<'_, AppState>,
    audio_data: Vec<f32>,
) -> Result<String, String> {
    let raw_text = state.whisper
        .transcribe(&audio_data)
        .map_err(|e| e.to_string())?;

    let cleaned = state.llm
        .cleanup_text(&raw_text)
        .await
        .map_err(|e| e.to_string())?;

    Ok(cleaned)
}
```

Integrating whisper.cpp in Rust

This is where it gets fun. whisper.cpp is Georgi Gerganov's C/C++ port of OpenAI's Whisper — and it's fast. On Metal (Apple Silicon), it runs the small model in real-time. On CUDA, even faster.

The Rust integration uses FFI bindings (via whisper-rs, which wraps the C API). The flow:

  1. Load the model once at startup — this takes 1-3 seconds depending on the model size and whether it's loading into GPU VRAM.
  2. Capture audio from the default input device using cpal (a cross-platform audio library for Rust).
  3. Buffer the audio while the hotkey is held.
  4. Run inference when the hotkey is released.

```rust
use anyhow::Result;
use whisper_rs::{WhisperContext, WhisperContextParameters, FullParams, SamplingStrategy};

fn init_whisper(model_path: &str) -> Result<WhisperContext> {
    let mut params = WhisperContextParameters::default();
    params.use_gpu(true); // Metal on macOS, CUDA on NVIDIA

    Ok(WhisperContext::new_with_params(model_path, params)?)
}

fn transcribe(ctx: &WhisperContext, audio: &[f32]) -> Result<String> {
    let mut params = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
    params.set_language(Some("en"));
    params.set_no_timestamps(true);
    params.set_single_segment(true);

    let mut state = ctx.create_state()?;
    state.full(params, audio)?;

    let num_segments = state.full_n_segments()?;
    let mut text = String::new();
    for i in 0..num_segments {
        text.push_str(&state.full_get_segment_text(i)?);
    }

    Ok(text.trim().to_string())
}
```

The GPU acceleration was the biggest performance win. On CPU, the small model takes ~3 seconds for a 10-second clip. With Metal acceleration on an M1, the same clip processes in ~400ms. With CUDA on an RTX 3060, it's closer to 250ms.

One gotcha: audio sample rate. Whisper expects 16kHz mono float32, but most microphones capture at 44.1kHz or 48kHz, so you need a resampling step. I use rubato for high-quality sample-rate conversion with negligible added latency.
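To make the resampling step concrete, here's a deliberately naive linear-interpolation downsampler. This is not what rubato does (a proper resampler band-limits the signal first so high frequencies don't alias into the audible range, which matters for Whisper's accuracy), but it shows the shape of the conversion:

```rust
/// Naive downsampler from `src_rate` to Whisper's expected 16 kHz mono.
/// Illustrative only: linear interpolation with no anti-aliasing filter.
fn resample_to_16k(input: &[f32], src_rate: u32) -> Vec<f32> {
    const TARGET: u32 = 16_000;
    if src_rate == TARGET || input.is_empty() {
        return input.to_vec();
    }
    let ratio = src_rate as f64 / TARGET as f64; // e.g. 3.0 for 48 kHz input
    let out_len = (input.len() as f64 / ratio) as usize;
    let mut out = Vec::with_capacity(out_len);
    for i in 0..out_len {
        let pos = i as f64 * ratio;       // fractional position in the input
        let idx = pos as usize;
        let frac = (pos - idx as f64) as f32;
        let a = input[idx];
        let b = *input.get(idx + 1).unwrap_or(&a);
        out.push(a + (b - a) * frac);     // interpolate between neighbors
    }
    out
}
```

A 48kHz capture shrinks to exactly one third of its samples; rubato's sinc-based resamplers do the same job with proper filtering.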

Adding llama.cpp for Smart Text Cleanup

Raw Whisper output is... raw. You get things like:

"so um basically what I wanted to say was that the the meeting is at like 3 pm tomorrow and uh we should probably bring the the documents"

Nobody wants to paste that into an email. That's where llama.cpp comes in.

I run a small quantized LLM (Qwen 2.5 3B Q4_K_M — about 2GB) locally through llama.cpp bindings. The prompt is simple:

```
Clean up this transcribed speech. Fix grammar, remove filler words,
add punctuation. Keep the original meaning and tone. Output only
the cleaned text, nothing else.

Input: {raw_whisper_output}
```

The output:

"The meeting is at 3 PM tomorrow. We should bring the documents."

The LLM step adds ~200-400ms depending on the input length and your hardware. For most dictation (a sentence or two), it's barely noticeable. The total pipeline — audio capture, whisper inference, LLM cleanup — typically completes in under a second on any machine with a decent GPU.

```rust
// Simplified — `LlamaContext` and `GenerateParams` are thin wrappers
// around the llama.cpp bindings; the actual implementation handles
// streaming and context management.
async fn cleanup_text(llm: &LlamaContext, raw: &str) -> Result<String> {
    let prompt = format!(
        "Clean up this transcribed speech. Fix grammar, remove filler words, \
         add punctuation. Keep the original meaning and tone. Output only \
         the cleaned text.\n\nInput: {raw}\n\nOutput:"
    );

    let response = llm.generate(&prompt, GenerateParams {
        max_tokens: 256,
        temperature: 0.1,  // Low temp = deterministic cleanup
        stop: vec!["\n".into()],
    }).await?;

    Ok(response.trim().to_string())
}
```

Why not just use Whisper with a larger model? Because Whisper is a transcription model — it's optimized to faithfully reproduce what you said, filler words and all. An LLM understands intent and can restructure text intelligently. The two-model pipeline consistently produces better output than either model alone.

The Hotkey + Text Injection Pipeline

This is the part that took the most iteration. The goal: press Fn (or any configured hotkey), speak, release, and have clean text appear wherever your cursor is — in any app, any text field, anywhere.

The pipeline:

  1. Global hotkey registration — Tauri 2.0's global-shortcut plugin handles this. The key press starts audio capture; the key release stops it and triggers the pipeline.
  2. Audio capture — cpal grabs audio from the default input device, buffering PCM float32 samples.
  3. Whisper inference — the buffered audio goes to whisper.cpp.
  4. LLM cleanup — raw text goes to llama.cpp.
  5. Text injection — the cleaned text is "typed" into whatever app has focus.
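Steps 1-4 boil down to a small state machine around the hotkey. A simplified sketch (the names are mine; in the real app, `push_samples` is fed from a cpal callback and the release handler hands the buffer to the whisper/llama pipeline on a worker thread):

```rust
/// Press-to-talk lifecycle: Idle until the hotkey goes down, buffering
/// samples while it's held, draining the buffer when it's released.
enum Recorder {
    Idle,
    Recording(Vec<f32>),
}

impl Recorder {
    fn on_press(&mut self) {
        // Start a fresh buffer; ignore key-repeat events while recording.
        if matches!(self, Recorder::Idle) {
            *self = Recorder::Recording(Vec::new());
        }
    }

    fn push_samples(&mut self, samples: &[f32]) {
        // Samples arriving while idle are discarded.
        if let Recorder::Recording(buf) = self {
            buf.extend_from_slice(samples);
        }
    }

    /// Returns the buffered audio to hand to the pipeline, if any.
    fn on_release(&mut self) -> Option<Vec<f32>> {
        match std::mem::replace(self, Recorder::Idle) {
            Recorder::Recording(buf) if !buf.is_empty() => Some(buf),
            _ => None,
        }
    }
}
```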

Step 5 is where platform hell begins.

Cross-Platform Challenges

macOS

On macOS, text injection uses CGEventCreateKeyboardEvent from Core Graphics. You simulate keystrokes one character at a time. Sounds simple — except macOS Accessibility permissions gate all synthetic input. MumbleFlow needs the user to grant Accessibility access in System Preferences, or nothing works. Every macOS developer knows this dance.

There's also a fun gotcha with macOS's clipboard approach (copy-paste injection via Cmd+V): some apps detect programmatic paste events and block them. Keystroke simulation is more reliable but slower for long text.

Windows

Windows is actually the most straightforward here. SendInput from the Win32 API lets you inject keystrokes globally. No special permissions needed (though some games and secure input fields block synthetic input). Unicode support requires using KEYEVENTF_UNICODE flags, which took a while to get right for non-ASCII characters.
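The subtlety with KEYEVENTF_UNICODE is that each INPUT structure carries exactly one UTF-16 code unit in its `wScan` field, so any character outside the Basic Multilingual Plane (emoji, some CJK) expands to two inputs, one per surrogate half. A quick standard-library-only illustration of that expansion:

```rust
/// The UTF-16 code units a string expands to. Each unit becomes one
/// KEYEVENTF_UNICODE input on Windows.
fn utf16_units(text: &str) -> Vec<u16> {
    text.encode_utf16().collect()
}

/// High (leading) surrogates occupy 0xD800..=0xDBFF; seeing one means
/// the character spans two inputs and both halves must be sent in order.
fn is_high_surrogate(unit: u16) -> bool {
    (0xD800..=0xDBFF).contains(&unit)
}
```

Iterating `encode_utf16()` (as the injection code below does) handles the split automatically; the bugs appear if you iterate `chars()` and try to stuff a whole `char` into one input.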

Linux

Linux is... Linux. X11 has XSendEvent and XTest, but Wayland deliberately blocks synthetic input from arbitrary processes (for security reasons — which I respect, but it makes this use case painful). On Wayland, you need compositor-specific protocols like wlr-virtual-pointer or zwp_virtual_keyboard_v1, and not all compositors support them.

The current approach: detect the display server at runtime and use the appropriate injection method. It works on GNOME and KDE (the two biggest Wayland compositors) and all X11 setups.
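The detection itself can be as simple as inspecting the session's environment. A hedged sketch (not MumbleFlow's exact logic; Wayland is checked first because XWayland sessions set both variables):

```rust
#[derive(Debug, PartialEq)]
enum DisplayServer {
    Wayland,
    X11,
    Unknown,
}

/// Chooses the injection backend from the values of WAYLAND_DISPLAY and
/// DISPLAY. Taking them as parameters keeps the logic testable.
fn pick_backend(wayland_display: Option<&str>, x11_display: Option<&str>) -> DisplayServer {
    match (wayland_display, x11_display) {
        (Some(_), _) => DisplayServer::Wayland, // XWayland also sets DISPLAY
        (None, Some(_)) => DisplayServer::X11,
        (None, None) => DisplayServer::Unknown,
    }
}
```

At startup you'd call this with `env::var("WAYLAND_DISPLAY").ok().as_deref()` and `env::var("DISPLAY").ok().as_deref()`.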

```rust
// Platform-specific text injection (simplified; error handling elided)
#[cfg(target_os = "macos")]
fn inject_text(text: &str) -> Result<()> {
    use core_graphics::event::{CGEvent, CGEventTapLocation};
    use core_graphics::event_source::{CGEventSource, CGEventSourceStateID};

    let source = CGEventSource::new(CGEventSourceStateID::HIDSystemState)
        .expect("failed to create event source");
    for ch in text.chars() {
        // Key-down event whose Unicode payload is set explicitly — no
        // mapping from characters to physical keycodes needed.
        let event = CGEvent::new_keyboard_event(source.clone(), 0, true)
            .expect("failed to create keyboard event");
        event.set_string(&ch.to_string());
        event.post(CGEventTapLocation::HID);
    }
    Ok(())
}

#[cfg(target_os = "windows")]
fn inject_text(text: &str) -> Result<()> {
    use windows::Win32::UI::Input::KeyboardAndMouse::*;
    for ch in text.encode_utf16() {
        let input = INPUT {
            r#type: INPUT_KEYBOARD,
            Anonymous: INPUT_0 {
                ki: KEYBDINPUT {
                    wScan: ch,
                    dwFlags: KEYEVENTF_UNICODE,
                    ..Default::default()
                },
            },
        };
        unsafe { SendInput(&[input], std::mem::size_of::<INPUT>() as i32) };
        // The real code also sends the matching KEYEVENTF_KEYUP event.
    }
    Ok(())
}
```

Performance Numbers

Real benchmarks on real hardware — no cherry-picking:

| Metric | M1 MacBook Air | i7 + RTX 3060 | Ryzen 5 (CPU only) |
| --- | --- | --- | --- |
| Whisper inference (10s clip, small model) | ~400ms | ~250ms | ~3.1s |
| LLM cleanup (1-2 sentences) | ~200ms | ~150ms | ~800ms |
| Total pipeline (press → paste) | ~700ms | ~500ms | ~4.2s |
| Idle RAM usage | ~45MB | ~50MB | ~45MB |
| RAM with models loaded | ~1.8GB | ~2.1GB | ~1.8GB |
| App bundle size (without models) | 8MB | 12MB | 10MB |

The CPU-only path is noticeably slower — about 4 seconds for the full pipeline. Usable, but not the "instant" feel you get with GPU acceleration. If you have any Apple Silicon Mac or an NVIDIA GPU, the experience is sub-second and feels like magic.

What's Next

MumbleFlow is live and stable, but there's more to build:

  • Custom vocabularies — domain-specific terms (medical, legal, code) that Whisper tends to fumble
  • Multi-language support — Whisper supports 99 languages; MumbleFlow currently defaults to English but the foundation is there
  • Voice commands — "delete that," "new paragraph," "capitalize"
  • Streaming transcription — show partial results while you're still speaking (currently it processes after you release the hotkey)
  • Smaller models — experimenting with distilled Whisper variants that could bring CPU-only latency under 2 seconds

Try It

If you're a developer who dictates code comments, writes docs, drafts messages, or just wants to stop typing sometimes — MumbleFlow might be what you're looking for.

$5 one-time. Fully local. No subscription. No cloud. No telemetry.

It's a Wispr Flow alternative that respects your privacy and your wallet. Your voice data never leaves your machine — not because of a privacy policy, but because there's literally no networking code in the transcription pipeline.

Check it out at mumble.helix-co.com →


If you found this useful, I'd appreciate a ❤️ or a share. Building local-first AI tools is a hill I'm willing to die on, and the more developers who care about this stuff, the better the ecosystem gets.
