Anton Yakutovich

Polly wants a transcript: giving agents ears and a voice, on your own machine

Half the messages I send my coding agents these days start life as a voice note. I'm walking the dog, an idea lands, I mumble it into my phone, and later something turns it into text an agent can actually act on. It's a great workflow — right up until you notice where the audio goes to become text.

Because the default answer to "transcribe this" is still: ship it to someone's cloud. Whisper API, AWS Transcribe, Deepgram, Google Speech-to-Text. Your voice — which is about as personal as data gets — leaves the building, runs through their model, on their meter, under a privacy policy they can rewrite on a Tuesday. And when you want the round trip — text back to speech — it's the same story: AWS Polly, ElevenLabs, another key, another bill.

Meanwhile my laptop has a Neural Engine sitting mostly idle. Whisper-class models run locally just fine now. So why is the audio leaving at all?

That itch turned into Kesha Voice Kit 🦜 — and Polly can keep her transcript.

What it is

Kesha (yes, named after the cartoon parrot; "Свободу попугаям!", "Freedom for parrots!", is literally the demo clip) is a local-first voice toolkit: speech-to-text and back, no cloud, no account, no API key. The whole thing is one ~20 MB Rust binary: no Python, no ffmpeg, no native Node addons to babysit.

  • Transcribe in 25 languages — up to ~19× faster than Whisper on Apple Silicon, ~2.5× on CPU
  • Speak back in 9 languages, auto-picking a voice from the text's language
  • On a Mac it runs on the Apple Neural Engine via CoreML; everywhere else it falls back to ONNX
  • Models are never auto-downloaded — you ask for them once, explicitly, and every weight is pinned by SHA-256 so a mirror can't quietly swap it
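To make the pinning idea concrete, here is a minimal Python sketch of checking a downloaded file against a pinned SHA-256 digest. It is not Kesha's actual code (the engine is Rust), and the names sha256_of and verify_pinned are made up for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read fixed-size chunks so large model files never sit in memory whole.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pinned(path: Path, expected_hex: str) -> None:
    """Raise ValueError if the file's digest differs from the pinned one."""
    actual = sha256_of(path)
    if actual != expected_hex.lower():
        raise ValueError(
            f"checksum mismatch for {path.name}: "
            f"expected {expected_hex.lower()}, got {actual}"
        )
```

The point of pinning is that the expected digest ships with the tool rather than with the download, so a mirror that swaps the file cannot also swap the value it is checked against.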

The CLI is a thin Bun wrapper; the engine is the Rust binary it shells out to. Pipe-friendly by design — transcript on stdout, errors on stderr.

Quick start

bun add -g @drakulavich/kesha-voice-kit
kesha install                 # downloads engine + models (explicit — never automatic)

kesha audio.ogg               # → transcript to stdout
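Because the transcript goes to stdout and errors go to stderr, a script can capture the two separately. A minimal Python sketch, assuming kesha is on your PATH; run_cli is a hypothetical helper, not something Kesha ships.

```python
import subprocess
from typing import Sequence, Tuple


def run_cli(argv: Sequence[str]) -> Tuple[int, str, str]:
    """Run a command and return (exit_code, stdout, stderr) as text."""
    proc = subprocess.run(
        list(argv),
        capture_output=True,  # keep stdout and stderr as separate streams
        text=True,
        encoding="utf-8",
        check=False,          # let the caller decide what a non-zero exit means
    )
    return proc.returncode, proc.stdout, proc.stderr


# Example call (assumes kesha is installed and audio.ogg exists):
# code, transcript, errors = run_cli(["kesha", "audio.ogg"])
```

Keeping check=False means a failed run still hands back its stderr text, which is where the diagnostic lives.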

Want it to talk?

kesha install --tts           # opt-in voices (~990 MB)
kesha say "Свободу попугаям!" > freedom.wav

That's it. No OPENAI_API_KEY, no region to pick, no spend alert to set up.

What's new in 1.22.0

The release I just cut adds two things I'd wanted for a while.

Multilingual voices. Text-to-speech used to be English + Russian and not much else. 1.22.0 wires up the multilingual Kokoro voices on Apple Silicon, so kesha say now covers Spanish, French, Italian, Portuguese and more — all on the Neural Engine, all offline.

Stable error codes everywhere. Every failure path now prints a machine-readable line:

error [E_MODEL_MISSING]: TTS models not installed. Run: kesha install --tts

kesha --error-codes-json dumps the whole taxonomy. If you're driving Kesha from a script or an agent, you no longer have to grep prose to find out why it bailed. There's even a test that fails CI if a code exists in the binary but not in the docs — drift is a build break, not a surprise.
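If you want to branch on the code from a script, something like the following Python sketch would do. The line format is inferred only from the sample error lines in this post, so the regex (including the E_ prefix) is an assumption; kesha --error-codes-json remains the authoritative list of codes. The names KeshaError and parse_error are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape, based on the sample lines: error [E_SOME_CODE]: message
ERROR_LINE = re.compile(r"^error \[(?P<code>E_[A-Z0-9_]+)\]: (?P<message>.*)$")


class KeshaError(NamedTuple):
    code: str
    message: str


def parse_error(stderr_text: str) -> Optional[KeshaError]:
    """Return the first coded error found in stderr text, or None."""
    for line in stderr_text.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            # Only the first line of the message is kept; continuation
            # lines, if any, are ignored in this sketch.
            return KeshaError(match["code"], match["message"])
    return None
```

An agent can then react to the code itself, for example by running the install command when it sees E_MODEL_MISSING, instead of pattern-matching the prose.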

A gotcha I'll save you from (and why I made it louder)

Here's the kind of thing that only shows up once real users point it at real text.

I added the multilingual voices, ran kesha say on a line of Hindi… and got noise. Not wrong words — actual garbage audio. No error, no warning. The most confident kind of broken.

The root cause is buried a layer down. The on-device Kokoro path phonemizes text with an English-only grapheme-to-phoneme model. Feed it Latin script and it's happy. Feed it Devanagari, kana, or Han characters and it doesn't fail — it just produces phonemes that mean nothing, and the model dutifully sings them.

I had three options:

Option | Verdict
--- | ---
Emit the garbage audio | No. Confidently wrong is the worst failure mode there is.
Quietly transliterate to Latin and guess | Fragile, surprising, hides the real gap
Refuse with a clear, coded error | Yes. This is the one I went with.

So now non-Latin text aimed at a Latin-only voice stops immediately:

error [E_SCRIPT_UNSUPPORTED]: voice 'hi' cannot phonemize Devanagari text;
it only supports Latin-script input. Romanize the text, or use a voice
whose engine supports Devanagari.

Exit code 4, a stable code, an actionable hint. Fail fast beats fail quietly, every single time — especially for an agent that can't hear that the WAV it got back is nonsense. Real multilingual G2P for those scripts is tracked as an open issue; until it lands, the tool tells you the truth instead of humming gibberish.
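For a feel of what such a guard involves, here is a rough Python sketch of a "is this text Latin-script only?" check. It is not Kesha's implementation: it leans on the first word of each character's Unicode name as a stand-in for the real Unicode Script property, which is a heuristic (a few symbols that count as letters, such as the micro sign, will be misreported). The function names are hypothetical.

```python
import unicodedata
from typing import Optional


def first_non_latin_script(text: str) -> Optional[str]:
    """Return a rough script label for the first non-Latin letter, or None."""
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, spaces, combining marks
        name = unicodedata.name(ch, "")
        # e.g. "DEVANAGARI LETTER NA" -> "DEVANAGARI"
        script = name.split(" ", 1)[0] if name else "UNKNOWN"
        if script != "LATIN":
            return script
    return None


def require_latin(text: str) -> None:
    """Raise ValueError instead of passing non-Latin text downstream."""
    script = first_non_latin_script(text)
    if script is not None:
        raise ValueError(f"unsupported script: {script}")
```

The check runs before any audio is synthesized, which is what turns silent garbage into an immediate, explainable failure.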

Plugging it into agents

The reason any of this exists is that I wanted my agents to hear and speak without phoning home. So Kesha speaks the protocols they do:

  • MCP: kesha mcp exposes transcribe / synthesize / list-voices as tools to any MCP client (Claude, Cursor, Codex, Gemini)
  • OpenClaw — drop-in skill so your agent grows ears
  • Hermes — local STT/TTS through command providers
  • Raycast on macOS, and a programmatic @drakulavich/kesha-voice-kit/core API if you'd rather call it from a Bun program

A voice note in, a transcript out, an answer spoken back — and the audio never left the laptop.

Honest about the edges

It's not magic. Diarization and the multilingual voices are Apple-Silicon-only today (Linux/Windows get a clear error, not a crash). The first TTS download is ~990 MB — local models aren't free, they're just yours. And as above, true G2P for non-Latin scripts isn't here yet. I'd rather ship the limitation with a loud error than paper over it.

It's MIT, it's on GitHub and npm, and bun add -g @drakulavich/kesha-voice-kit is the whole install.

What does your local-first setup look like these days — and what's still quietly phoning home in your stack that you wish wasn't? 🦜
