DEV Community

Cover image for Voice Typing Anywhere on macOS — I Built Apex Voice with mlx-whisper, Amazon Bedrock, and Strands Agents
Yuuki Yamashita
Yuuki Yamashita

Posted on

Voice Typing Anywhere on macOS — I Built Apex Voice with mlx-whisper, Amazon Bedrock, and Strands Agents

A weekend project that turned into a daily-driver: a macOS menu-bar app that lets you talk into any input field — Slack, browser, Notes, anywhere — powered by local Whisper and Amazon Bedrock.

TL;DR

I built Apex Voice, an open-source macOS voice typing tool. It listens through your microphone, transcribes speech offline with mlx-whisper, and inserts the result wherever your cursor is. With Amazon Bedrock layered on top, it can also polish the text, translate it, or execute agent actions like "add a reminder" or "summarize this page for me."

Why I Built It

macOS already has built-in dictation, and there are great commercial tools like Aqua Voice. So why bother?

  1. Built-in dictation doesn't reliably work in every app, and Japanese accuracy is uneven.
  2. Aqua Voice is polished but closed and paid.
  3. I wanted to own the stack: pick my model, my post-processing, my agent tools. And to learn by building.

What It Does

The core loop:

Mic → VAD → mlx-whisper → (optional Bedrock post-process) → Clipboard → ⌘V into any app
Enter fullscreen mode Exit fullscreen mode

On top of that:

  • Post-process modes powered by Claude Haiku 4.5 on Bedrock: polish, formal, translate, bullets.
  • Agent mode via Strands Agents: one utterance can trigger multiple tool calls — add a reminder, open a calendar event, fetch a webpage and summarize it.
  • Vocabulary learning with Amazon Bedrock AgentCore Memory: proper nouns and domain terms accumulate over time and get injected as Whisper's initial_prompt, so accuracy improves with use.

Architecture

The whole thing runs as a Python process managed by launchd. Local-only by default; Bedrock features kick in when you enable them.

The main pipeline (mic → whisper → insertion) is fully offline. Bedrock and AgentCore Memory are auxiliary — they make the experience richer but the app works without them.

The AWS Side

Three Bedrock-shaped pieces:

Piece Role
Amazon Bedrock (Claude Haiku 4.5) Post-processing (rewrite/translate/bullets), agent classification, page summarization
Strands Agents Defines tool schemas and orchestrates multi-step calls to Claude
Amazon Bedrock AgentCore Memory Persists extracted vocabulary across sessions; injected as Whisper prompt hints

I picked Haiku 4.5 because the post-processing happens every utterance. Sub-second latency matters more than top-tier reasoning here. For the agent mode, Haiku is still strong enough to pick the right tool from ~10 options reliably.

The companion web app (apex-voice-web) runs on Vercel with a Python serverless function that calls Bedrock for URL classification and summarization. It uses Upstash Redis for a live history feed. The macOS app fires history entries to it via a non-blocking POST.

The Hard Part Wasn't AI. It Was Packaging.

I lost almost a full day to py2app.

The plan was obvious: setup.py py2appdist/Apex Voice.app → drop it in /Applications → done. Reality:

[12:12:22] 認識エラー: bad local file header:
  '/Users/.../Apex Voice.app/Contents/Resources/lib/python312.zip'
Enter fullscreen mode Exit fullscreen mode

py2app bundles your dependencies into a python312.zip, and mlx-whisper's native extensions don't survive being zipped. I tried zip_include_packages: [] — not a valid py2app option. I tried a shell-script launcher inside the .app — macOS warned the user about needing Rosetta. I tried a hand-compiled arm64 launcher binary — that worked, but every iteration meant re-granting Accessibility permission because the bundle signature changed.

The fix: stop building a .app entirely.

I switched to a LaunchAgent plist (com.yamashita.apexvoice.plist) that points directly at the venv's Python and the script. Drop it in ~/Library/LaunchAgents/, launchctl load, done. With KeepAlive: true, it auto-restarts on crash. The "restart" menu item just calls rumps.quit_application() and lets launchd bring it back.

Two small touches kept it feeling like a proper app:

import setproctitle
setproctitle.setproctitle("Apex Voice")
Enter fullscreen mode Exit fullscreen mode

So Activity Monitor shows "Apex Voice" instead of "python3.12". And for the menu-bar icon, I rendered SF Symbols (waveform.and.mic / mic.fill) to PNG once and loaded them as a template image — they auto-adapt to light/dark mode.

What I Learned

  • Don't fight the OS packaging story when you have a better path. A LaunchAgent + venv beat py2app on every axis: simpler, more stable, easier to update.
  • Pick the model for the latency profile, not the leaderboard. Haiku 4.5 wins here because users feel every 500 ms in a voice typing loop.
  • Bluetooth mic quality is an OS problem, not an app problem. When the mic engages HFP mode, sample rate drops to 8 kHz across the board. No app-level fix exists on macOS. (Windows handles this slightly better in some driver combos, but the underlying HFP/A2DP tradeoff is universal.)
  • Accessibility-driven keystroke injection (osascript-Cmd-V) is the right primitive. It works in every app I've tried — Slack, browsers, native editors. No per-app integration needed.

What's Next

  • Vocabulary learning UI — show the user what AgentCore Memory has learned.
  • Windows port — swap mlx-whisper for faster-whisper, rumps for pystray. The core loop is OS-agnostic.
  • Approval-gated agent actions — wire it into Aegis, a Slack-based approval plane I'm building for AI agents.

Try It

git clone https://github.com/yama3133/apex-voice.git
cd apex-voice
/opt/homebrew/bin/python3.12 -m venv .venv
.venv/bin/pip install -r requirements.txt
# Run directly:
.venv/bin/python voicetype.py
# Or install as a LaunchAgent (auto-start, auto-restart):
# Edit com.yamashita.apexvoice.plist to point at your clone path
cp com.yamashita.apexvoice.plist ~/Library/LaunchAgents/
launchctl load ~/Library/LaunchAgents/com.yamashita.apexvoice.plist
Enter fullscreen mode Exit fullscreen mode

You'll need an Apple Silicon Mac (mlx is Apple Silicon only) and AWS credentials if you want post-processing and agent features.


If you build something similar, or hit the same py2app wall I did, I'd love to hear about it. Code, issues, and PRs welcome on GitHub.

Top comments (0)