
Anton Yakutovich


Polly wants a transcript: giving agents ears and a voice, on your own machine

Half the messages I send my coding agents these days start life as a voice note. I'm walking the dog, an idea lands, I mumble it into my phone, and later something turns it into text an agent can actually act on. It's a great workflow — right up until you notice where the audio goes to become text.

Because the default answer to "transcribe this" is still: ship it to someone's cloud. Whisper API, AWS Transcribe, Deepgram, Google Speech-to-Text. Your voice — which is about as personal as data gets — leaves the building, runs through their model, on their meter, under a privacy policy they can rewrite on a Tuesday. And when you want the round trip — text back to speech — it's the same story: AWS Polly, ElevenLabs, another key, another bill.

Meanwhile my laptop has a Neural Engine sitting mostly idle. Whisper-class models run locally just fine now. So why is the audio leaving at all?

That itch turned into Kesha Voice Kit 🦜 — and Polly can keep her transcript.

What it is

Kesha (yes, named after the cartoon parrot; "Свободу попугаям!", or "Freedom to the parrots!", is literally the demo clip) is a local-first voice toolkit: speech-to-text and back, no cloud, no account, no API key. The whole thing is one ~20 MB Rust binary: no Python, no ffmpeg, no native Node addons to babysit.

  • Transcribe in 25 languages — up to ~19× faster than Whisper on Apple Silicon, ~2.5× on CPU
  • Speak back in 9 languages, auto-picking a voice from the text's language
  • On a Mac it runs on the Apple Neural Engine via CoreML; everywhere else it falls back to ONNX
  • Models are never auto-downloaded — you ask for them once, explicitly, and every weight is pinned by SHA-256 so a mirror can't quietly swap it
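To make the pinning idea concrete, here is a minimal sketch of checking a downloaded file against a pinned SHA-256 digest. It is not Kesha's actual install code (the engine is Rust); the function name and the demo bytes are hypothetical and only illustrate the general technique.

```python
import hashlib
import hmac


def matches_pin(data: bytes, pinned_sha256_hex: str) -> bool:
    """Return True only if `data` hashes to the pinned SHA-256 digest."""
    actual = hashlib.sha256(data).hexdigest()
    # Constant-time comparison; normalise case so upper-case pins also match.
    return hmac.compare_digest(actual, pinned_sha256_hex.lower())


# Hypothetical demo: the "pin" is recorded once from known-good bytes.
good_weights = b"pretend these are model weights"
pin = hashlib.sha256(good_weights).hexdigest()

print(matches_pin(good_weights, pin))                # the genuine file passes
print(matches_pin(good_weights + b"tampered", pin))  # a swapped file does not
```

The point of the design is that the digest ships with the tool rather than with the download, so a mirror that serves different bytes fails the check instead of being trusted.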

The CLI is a thin Bun wrapper; the engine is the Rust binary it shells out to. Pipe-friendly by design — transcript on stdout, errors on stderr.
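If you're calling a pipe-friendly CLI like this from a script, the stdout/stderr split means you can capture the two streams separately. Below is a small sketch in Python; `run_cli` is a hypothetical helper, and the demo uses a stand-in child process instead of `kesha` so that it runs without anything installed.

```python
import subprocess
import sys
from typing import List, Tuple


def run_cli(argv: List[str]) -> Tuple[int, str, str]:
    """Run a command and return (exit_code, stdout, stderr) separately."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


# Stand-in for something like ["kesha", "audio.ogg"]: it writes a
# "transcript" to stdout and a diagnostic to stderr.
stand_in = [
    sys.executable,
    "-c",
    "import sys; print('hello world'); sys.stderr.write('demo warning')",
]

exit_code, transcript, diagnostics = run_cli(stand_in)
print("transcript:", transcript.strip())
print("diagnostics:", diagnostics)
```

With the real tool you would pass `["kesha", "audio.ogg"]` as `argv`; the transcript then arrives in the second element and never mixes with error text.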

Quick start

bun add -g @drakulavich/kesha-voice-kit
kesha install                 # downloads engine + models (explicit — never automatic)

kesha audio.ogg               # → transcript to stdout

Want it to talk?

kesha install --tts           # opt-in voices (~990 MB)
kesha say "Свободу попугаям!" > freedom.wav

That's it. No OPENAI_API_KEY, no region to pick, no spend alert to set up.

What's new in 1.22.0

The release I just cut adds two things I'd wanted for a while.

Multilingual voices. Text-to-speech used to be English + Russian and not much else. 1.22.0 wires up the multilingual Kokoro voices on Apple Silicon, so kesha say now covers Spanish, French, Italian, Portuguese and more — all on the Neural Engine, all offline.

Stable error codes everywhere. Every failure path now prints a machine-readable line:

error [E_MODEL_MISSING]: TTS models not installed. Run: kesha install --tts

kesha --error-codes-json dumps the whole taxonomy. If you're driving Kesha from a script or an agent, you no longer have to grep prose to find out why it bailed. There's even a test that fails CI if a code exists in the binary but not in the docs — drift is a build break, not a surprise.
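As an illustration of what "no grepping prose" can look like on the calling side, here is a sketch that pulls the code out of an error line. The pattern is inferred only from the example line shown in this post (an `error [CODE]: message` shape with an `E_` prefix), so treat it as an assumption rather than a specification; `parse_error_line` and `CliError` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape, based on the example in the post: error [E_SOMETHING]: message
ERROR_LINE = re.compile(r"^error \[(?P<code>E_[A-Z0-9_]+)\]: (?P<message>.*)$")


class CliError(NamedTuple):
    code: str
    message: str


def parse_error_line(line: str) -> Optional[CliError]:
    """Return the code and message from one stderr line, or None if it doesn't match."""
    match = ERROR_LINE.match(line.strip())
    if match is None:
        return None
    return CliError(match["code"], match["message"])


err = parse_error_line(
    "error [E_MODEL_MISSING]: TTS models not installed. Run: kesha install --tts"
)
if err is not None and err.code == "E_MODEL_MISSING":
    # A script or agent can branch on the code instead of matching on wording.
    print("models missing; hint:", err.message)
```

Branching on `err.code` keeps the caller stable even if the human-readable message is reworded later.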

A gotcha I'll save you from (and why I made it louder)

Here's the kind of thing that only shows up once real users point it at real text.

I added the multilingual voices, ran kesha say on a line of Hindi… and got noise. Not wrong words — actual garbage audio. No error, no warning. The most confident kind of broken.

The root cause is buried a layer down. The on-device Kokoro path phonemizes text with an English-only grapheme-to-phoneme model. Feed it Latin script and it's happy. Feed it Devanagari, kana, or Han characters and it doesn't fail — it just produces phonemes that mean nothing, and the model dutifully sings them.

I had three options:

Option | Verdict
Emit the garbage audio | No. Confidently wrong is the worst failure mode there is.
Quietly transliterate to Latin and guess | Fragile, surprising, hides the real gap
Refuse with a clear, coded error | Chosen

So now non-Latin text aimed at a Latin-only voice stops immediately:

error [E_SCRIPT_UNSUPPORTED]: voice 'hi' cannot phonemize Devanagari text;
it only supports Latin-script input. Romanize the text, or use a voice
whose engine supports Devanagari.

Exit code 4, a stable code, an actionable hint. Fail fast beats fail quietly, every single time — especially for an agent that can't hear that the WAV it got back is nonsense. Real multilingual G2P for those scripts is tracked as an open issue; until it lands, the tool tells you the truth instead of humming gibberish.
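For a feel of how a fail-fast script guard can work, here is a rough sketch that finds the first letter outside the Latin script before any synthesis is attempted. This is not Kesha's actual check: it leans on Unicode character names as a crude proxy for script (so Han shows up as "CJK"), and `first_non_latin_script` is a hypothetical helper.

```python
import unicodedata
from typing import Optional


def first_non_latin_script(text: str) -> Optional[str]:
    """Return a rough script label for the first non-Latin letter, or None."""
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, spaces and combining marks
        name = unicodedata.name(ch, "")
        # The first word of the Unicode name approximates the script,
        # e.g. "DEVANAGARI LETTER NA" gives "DEVANAGARI".
        script = name.split(" ", 1)[0] if name else "UNKNOWN"
        if script != "LATIN":
            return script
    return None


for sample in ["Olá, ça va?", "नमस्ते", "こんにちは"]:
    script = first_non_latin_script(sample)
    if script is None:
        print(f"{sample!r}: Latin only, safe for a Latin-only voice")
    else:
        print(f"{sample!r}: refuse early, found {script} text")
```

A production check would use real Unicode script properties rather than name prefixes, but the control flow is the same: detect the mismatch up front and return an error instead of audio.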

Plugging it into agents

The reason any of this exists is that I wanted my agents to hear and speak without phoning home. So Kesha speaks the protocols they do:

  • MCP: kesha mcp exposes transcribe / synthesize / list-voices as tools to any MCP client (Claude, Cursor, Codex, Gemini)
  • OpenClaw — drop-in skill so your agent grows ears
  • Hermes — local STT/TTS through command providers
  • Raycast on macOS, and a programmatic @drakulavich/kesha-voice-kit/core API if you'd rather call it from a Bun program
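For the MCP route, a client invokes a tool by sending a JSON-RPC `tools/call` request. The sketch below only builds such a message; the tool name `transcribe` comes from the list above, but the argument key `path` is a guess for illustration, so check the tool's advertised input schema for the real parameter names.

```python
import json
from typing import Any, Dict


def tools_call_request(request_id: int, tool: str, arguments: Dict[str, Any]) -> str:
    """Build an MCP tools/call request as a JSON-RPC 2.0 message string."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "tools/call",
            "params": {"name": tool, "arguments": arguments},
        }
    )


# "path" is a hypothetical argument name, not taken from Kesha's docs.
print(tools_call_request(1, "transcribe", {"path": "voice-note.ogg"}))
```

In practice your MCP client (Claude, Cursor, and so on) builds and sends this for you; the sketch is only meant to show how little sits between the agent and the local binary.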

A voice note in, a transcript out, an answer spoken back — and the audio never left the laptop.

Honest about the edges

It's not magic. Diarization and the multilingual voices are Apple-Silicon-only today (Linux/Windows get a clear error, not a crash). The first TTS download is ~990 MB — local models aren't free, they're just yours. And as above, true G2P for non-Latin scripts isn't here yet. I'd rather ship the limitation with a loud error than paper over it.

It's MIT, it's on GitHub and npm, and bun add -g @drakulavich/kesha-voice-kit is the whole install.

What does your local-first setup look like these days — and what's still quietly phoning home in your stack that you wish wasn't? 🦜

Top comments (3)

Gilder Miller

I agree with the local-first direction. Cloud still helps when accuracy or language coverage matters.
The script issue is the important part here since silent failure is worse than a hard stop.
On routing, are you choosing CoreML vs ONNX strictly by platform or doing runtime checks per model?

Anton Yakutovich

Routing is by platform by default, but you can use the ONNX path on macOS if you want.

Gilder Miller

Totally makes sense to stick with platform defaults for reliability. Runtime switching per model can get tricky fast, especially with subtle behavior differences. The ONNX option on macOS is a nice flexibility for testing or edge cases. Thanks.