I hold a belief that software is heading toward a voice-first future. Not voice as a gimmick bolted onto an app, but voice as a genuine interface. The kind where talking to your tools feels as natural as talking to a colleague.
The spark for this project came from something small. I was in a live voice session with Claude, the kind where you talk and it talks back in real time, no typing at all. Something about that experience stuck with me. It felt like a glimpse of where things are going.
Around the same time, I had been wanting an excuse to properly play with local AI models. I have always been drawn to the idea of running models on your own machine, no API calls, no cloud dependency, just you and your hardware. I had read about it, dabbled a little, but never gone deep.
So I combined the two curiosities into a weekend project. Build a voice coding agent. Run the speech parts locally. Let Claude be the brain.
That project became Mabara.
What Mabara Actually Does
Mabara lets you talk to your codebase the way you would talk to a coworker. You hold down a key, ask a question or give an instruction, and it responds out loud. It can explain how your project is structured, make changes to your files when you approve them by voice, and if it does something you do not like, you just say "revert that" and it undoes it.
The clever part is where the thinking happens versus where the listening and speaking happen. Claude, running through the Claude Agent SDK, is the brain. It understands your code and decides what to do. But the ears and the mouth, the part that hears you and the part that talks back, run entirely on my own laptop. No cloud speech service. Just my CPU.
And that laptop is not a powerful one. It is an i5-10210U with 8 GB of RAM and no GPU. Not the machine anyone benchmarks AI on. That constraint is actually the whole reason this story is interesting.
This is really a story about measuring things, and about learning to trust data over assumptions.
Disclaimer: I am not an expert on speech models or real-time systems. I am sharing what I measured and what it taught me. If you see something I got wrong, please tell me. I am here to learn.
Lesson One: The Benchmark Is the Truth, Everything Else Is Marketing
Here is a thing everyone "knows" in the AI world. Quantizing a model to int8 makes it faster on CPU. Smaller numbers, faster math. Every optimization guide says so.
So when my text-to-speech model was too slow, I downloaded the int8 version, fully expecting a win.
It ran roughly 2.5 times SLOWER than the full-precision model.
Not slower by a little. The full precision model hit 1.6x real-time on my CPU. The int8 version managed 0.6x, which is slower than the speech it was supposed to be producing.
Understanding why taught me more about hardware than any course I have taken. Int8 speedups depend on special CPU instructions, called VNNI on Intel chips, that handle quantized math natively. My Comet Lake CPU does not have them. Without those instructions, the CPU has to quantize, dequantize, and shuffle formats on every single operation. That is pure overhead with zero benefit.
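If you want to check your own machine before downloading an int8 model, here is a minimal sketch, assuming Linux, where the kernel lists CPU feature flags in /proc/cpuinfo. The helper names `cpu_flags` and `has_vnni` are mine for illustration and are not part of Mabara. As far as I know, the flags to look for are `avx512_vnni` and `avx_vnni`.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Every core repeats the same list, so the first match is enough.
            return set(value.split())
    return set()


def has_vnni(flags: set) -> bool:
    """True if the CPU advertises either flavour of VNNI."""
    return bool(flags & {"avx512_vnni", "avx_vnni"})


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            print("VNNI available:", has_vnni(cpu_flags(handle.read())))
    except OSError:
        print("Could not read /proc/cpuinfo (probably not Linux)")
```

A missing flag does not prove int8 will lose on your machine. It is only a hint that the benchmark deserves extra suspicion.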
The pattern repeated soon after. Distil-Whisper, marketed as six times faster than Whisper, was exactly zero percent faster for my use case. Push-to-talk clips are short, and short clips spend almost all their time in the encoder, which distillation does not shrink. That famous speedup only shows up on hour-long recordings.
A tuning parameter that was supposed to cut Whisper's processing window did nothing at all. I benchmarked three different values. Identical times, down to the decimal.
By the end of day one I had a personal rule. Nothing ships in Mabara unless it wins a benchmark on this exact laptop. Not a benchmark from someone else's README. Mine.
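The numbers in this section are real-time factors: seconds of audio produced divided by seconds spent producing it. Here is a rough sketch of how such a benchmark could look. The `synthesize` callable is a hypothetical stand-in for whatever model you are testing, assumed to return the duration of the audio it generated. It is not Mabara's actual harness.

```python
import statistics
import time
from typing import Callable, Dict


def real_time_factor(synthesize: Callable[[str], float], text: str, runs: int = 5) -> float:
    """Median of (seconds of audio produced) / (wall-clock seconds spent).

    Above 1.0 the model is faster than real time; below 1.0 the voice
    falls behind the speech it is supposed to be producing.
    """
    synthesize(text)  # warm-up run, so one-time loading cost is not measured
    factors = []
    for _ in range(runs):
        start = time.perf_counter()
        audio_seconds = synthesize(text)
        elapsed = time.perf_counter() - start
        factors.append(audio_seconds / elapsed)
    return statistics.median(factors)


def pick_winner(candidates: Dict[str, Callable[[str], float]], text: str) -> str:
    """Name of the candidate with the highest real-time factor on this machine."""
    scores = {name: real_time_factor(fn, text) for name, fn in candidates.items()}
    return max(scores, key=scores.get)
```

Taking the median over several runs is a deliberate choice here: on a laptop, a single run can be thrown off by thermal throttling or a background process.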
Lesson Two: I Traded a Beautiful Voice for 1.2 Seconds, and It Was the Easiest Decision of the Build
The voice journey humbled me.
I started with Kokoro, an 82 million parameter model that genuinely sounds human. On my CPU it synthesizes at roughly 1.0x real-time. That number sounds fine until you understand what it actually means. The system generates speech exactly as fast as it speaks it. There is zero margin for error. Every sentence boundary becomes a coin flip between smooth speech and an awkward gap.
So I switched to Piper, a smaller and admittedly more robotic-sounding model that runs at 7x real-time. The gaps disappeared completely. But the voice bothered me. Every reply reminded me I was talking to a machine, not a colleague.
Then I found Supertonic, a 66 million parameter model that sounds nearly as good as Kokoro and runs at 2x real-time on my hardware. I spent a couple of hours integrating it properly. Picked a voice I liked. Tuned the pacing until it felt right.
Then I actually used it. Within twenty minutes I switched back to Piper.
Here is why. Supertonic needed about 1.6 seconds to produce the first word of every reply. Piper needed 0.4 seconds. That 1.2 second difference does not sound like much written down. But inside an actual conversation, it is the difference between talking to a colleague and waiting for a system to catch up.
I had happily tolerated Piper's robotic edge for a full day of building. I could not tolerate 1.2 extra seconds of silence for twenty minutes of actually using it.
That taught me something I now treat as a rule for conversational interfaces. Response latency beats voice beauty. Users say they want natural. Their patience says they want fast. Watch the patience, not the words.
A small aside. I am Nigerian, and I genuinely wanted a Nigerian-accented voice for this. There is a lovely project called YarnGPT that does exactly this, built by a young Nigerian engineer whose work I admire. But it uses an autoregressive language model architecture for text-to-speech, which my CPU simply cannot run in real time. It is on my someday list.
Lesson Three: A Voice Agent Is a Pipeline, and Every Stage Must Overlap
The single biggest perceived speed win of the whole weekend had nothing to do with which model I chose.
Early on, Mabara waited for Claude's complete response before speaking a single word. For a long answer that meant thirty seconds of silence, followed by an entire essay all at once. Unusable.
The fix is treating the whole thing as a pipeline. Claude streams text token by token. The moment a full sentence exists, it gets sent to the speech synthesizer. The moment that sentence's audio exists, playback starts, while the second sentence is still being synthesized and the third sentence is still being generated by Claude. Three stages, all running simultaneously.
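The three overlapping stages can be sketched with two threads and two queues. This is a simplified illustration, not Mabara's code. `synthesize` and `play` are hypothetical callables you would supply, and the sentence splitter is deliberately naive (it will split on abbreviations like "e.g.").

```python
import queue
import re
import threading
from typing import Callable, Iterable, Iterator

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
_DONE = object()  # sentinel telling the next stage to stop


def sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as the text after it starts arriving."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        for finished in parts[:-1]:  # all but the last part are complete
            yield finished
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # whatever remains when the stream ends


def run_pipeline(tokens: Iterable[str], synthesize: Callable, play: Callable) -> None:
    text_q: queue.Queue = queue.Queue()
    audio_q: queue.Queue = queue.Queue()

    def synth_stage() -> None:
        while (sentence := text_q.get()) is not _DONE:
            audio_q.put(synthesize(sentence))
        audio_q.put(_DONE)

    def play_stage() -> None:
        while (audio := audio_q.get()) is not _DONE:
            play(audio)

    workers = [threading.Thread(target=synth_stage), threading.Thread(target=play_stage)]
    for worker in workers:
        worker.start()
    # This thread is the first stage: it consumes the token stream.
    for sentence in sentences(tokens):
        text_q.put(sentence)
    text_q.put(_DONE)
    for worker in workers:
        worker.join()
```

With `synthesize=str.upper` and `play=print`, feeding it a list of tokens prints each sentence in order while later tokens are still being consumed.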
Speech now starts after the first sentence instead of after the entire response. The total time to finish is the same. The felt experience is completely different.
The same thinking applied everywhere else in the system. The microphone never fully closes. It keeps a rolling pre-roll buffer, because opening the device only after the key is pressed clips your first syllable and ruins transcription. Sentences get batched into single synthesis calls when the queue backs up, because every call carries fixed overhead regardless of length. And every piece of queued audio carries an epoch number. You interrupt Mabara mid-sentence by simply holding the talk key again. When you do, it shuts up within 0.2 seconds, and any stale audio from the previous epoch gets silently discarded instead of playing awkwardly after you have already started talking.
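Two of these ideas are small enough to sketch: the pre-roll buffer and the epoch check. The class names `PreRoll` and `EpochGate` are made up for this illustration and the real implementation may look quite different. The sketch assumes audio arrives as chunks of bytes.

```python
import collections
import threading


class PreRoll:
    """Keep only the most recent microphone chunks while the talk key is up."""

    def __init__(self, max_chunks: int) -> None:
        self._chunks = collections.deque(maxlen=max_chunks)

    def feed(self, chunk: bytes) -> None:
        self._chunks.append(chunk)  # the oldest chunk falls off automatically

    def drain(self) -> bytes:
        """On key press: hand back the buffered audio to prepend to the recording."""
        data = b"".join(self._chunks)
        self._chunks.clear()
        return data


class EpochGate:
    """Tag queued audio with an epoch so that interrupts invalidate it."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._epoch = 0

    def current(self) -> int:
        with self._lock:
            return self._epoch

    def interrupt(self) -> None:
        with self._lock:
            self._epoch += 1  # everything tagged with an older epoch is now stale

    def should_play(self, audio_epoch: int) -> bool:
        with self._lock:
            return audio_epoch == self._epoch
```

The playback thread would call `should_play` just before playing each clip and drop anything stale. That is cheaper than trying to find and empty every queue at the moment of the interrupt.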
None of this is exotic engineering. It is simply accepting that in real-time systems, waiting for anything to finish completely before starting the next thing is almost always the wrong design. At least in my experience.
Lesson Four: If the Interface Is Trust, Then Engineer the Trust
A voice agent that edits your code has a unique problem. You cannot see what it is about to do before it does it. The entire interface is confidence itself. So the safety model became the part of Mabara I am proudest of.
Reads are free, no approval needed. But every file edit and every shell command gets spoken out loud, something like "I would like to edit the file page.tsx, do you approve?", and requires a verbal yes from me. Saying "yes for the whole task" approves the rest of that task's edits, because approving fifteen individual edits one by one is misery. Shell commands always ask, every single time, because that is where the truly unrecoverable mistakes live.
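The approval rules above fit in a few lines. This is a sketch under my own naming (`Action`, `TaskApprovals`), and it matches spoken replies by exact phrase, which is a simplification. Real transcripts are messier.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    READ = "read"
    EDIT = "edit"
    SHELL = "shell"


@dataclass
class TaskApprovals:
    """Approval state for one task; create a fresh one for each new task."""

    edits_approved_for_task: bool = False

    def needs_spoken_approval(self, action: Action) -> bool:
        if action is Action.READ:
            return False  # reads are free
        if action is Action.SHELL:
            return True  # shell commands ask every single time
        return not self.edits_approved_for_task

    def record_reply(self, action: Action, reply: str) -> bool:
        """Return True if the action may proceed."""
        reply = reply.strip().lower()
        approved = reply in ("yes", "yes for the whole task")
        # Blanket approval only ever applies to edits, never to shell commands.
        if approved and action is Action.EDIT and reply == "yes for the whole task":
            self.edits_approved_for_task = True
        return approved
```

Keeping the blanket flag on a per-task object means it cannot leak into the next task by accident.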
Underneath all of this, git does the real safety work. Mabara refuses to edit anything outside a git repository. The first approved edit of any task automatically triggers a checkpoint. Saying "revert that" restores everything the last task touched. That includes restoring my own uncommitted files from a backup rather than deleting them, an edge case that would have destroyed real work in the first week if I had not caught it. Saying "commit this" turns a good task into a proper git commit, containing only the files that task actually touched.
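The uncommitted-files edge case is easier to see in code. The sketch below illustrates only the backup-and-restore idea, using plain file copies in memory instead of git, so it is a stand-in for the real mechanism and not a description of it. `TaskCheckpoint` is a hypothetical name.

```python
from pathlib import Path
from typing import Dict, List, Optional


class TaskCheckpoint:
    """Remember each file's pre-task contents the first time a task touches it."""

    def __init__(self) -> None:
        # None means the file did not exist before the task.
        self._before: Dict[Path, Optional[bytes]] = {}

    def before_edit(self, path: Path) -> None:
        if path not in self._before:  # only the first touch is the true "before"
            self._before[path] = path.read_bytes() if path.exists() else None

    def touched(self) -> List[Path]:
        """Files this task changed, e.g. the only ones a commit should include."""
        return sorted(self._before)

    def revert(self) -> None:
        for path, original in self._before.items():
            if original is None:
                path.unlink(missing_ok=True)  # the task created it, so remove it
            else:
                path.write_bytes(original)  # brings back uncommitted work too
        self._before.clear()
```

The point of snapshotting the working copy, rather than resetting to the last commit, is that a reset would throw away edits I had made by hand but not yet committed.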
I learned the trust lesson from the model side too, not just the tooling side. I tried running Mabara on a smaller, cheaper model to save on cost. Within an hour it confidently told me my backend was built on a PDF generation library. The real backend was FastAPI, plainly declared in requirements.txt, a file it apparently had not bothered to read. It answered from vibes and documentation prose instead of actual code.
The fix was partly a prompt rule: code is the truth, documentation is intentions, verify before answering. But the deeper lesson stuck with me. A wrong answer delivered confidently in a warm human voice is the most expensive output a tool can produce. I moved back to the larger model as the default. Speed is negotiable. Trust is not.
The Honest Part
Mabara was pair-built with Claude Code. The AI wrote most of the actual lines of code. I want to be straightforward about that, because I think the genuinely interesting part of this story is the division of labor, not the line count.
Every decision in this project was mine. Which benchmarks to run. Which numbers to actually believe. When to abandon two hours of integration work because twenty minutes of real use disagreed with it. The AI never once pushed back on the int8 hype or the Distil-Whisper hype. The benchmarks did. I chose to run them and I chose to listen to what they said.
I came away from this weekend thinking that this is what engineering looks like now. The typing is cheap. The judgment is the job.
Why I Built This
I built Mabara because I wanted to test a belief I hold, that voice is going to become a real interface layer for software, not a novelty. And I wanted to get my hands genuinely dirty with local models instead of just reading about them.
Both things happened. I now understand the practical tradeoffs of local speech models in a way no article could have taught me. And I have a small, honest proof that voice-first tooling is buildable on modest hardware if you are willing to measure everything and trust nothing you have not tested yourself.
Mabara is open source on my GitHub. One Python file, one process, five threads, no framework. Every default in it won a fight on my specific laptop, which means on your machine some of them might lose. So the losers are all still there behind flags, waiting for your benchmarks to decide.
Have you ever benchmarked a "known" optimization and watched it lose? Or built something voice-driven yourself? I would genuinely love to hear what surprised you.
