Phillip Gray

Posted on Jun 27

Building a Local-First Voice Copilot for the Shell with HoldSpeak and Ollama

#python #cli #voice #ollama

The Promise: A Private, Voice-Activated Shell

The dream of a voice-activated command line is compelling: speak a command, see it executed. But for many developers, piping terminal input through a cloud-based API is a non-starter. That is where a project like karolswdev/HoldSpeak, a cross-platform tool for local voice typing, comes in. Could it be the core of a truly local-first, push-to-talk shell assistant? I paired it with Ollama and a local llama3.2 model to find out.

The goal was simple: hold a key, speak a command like "list files by size," release the key, and have the correct shell command appear, gated by a final confirmation prompt. This project turned out to be a tale of two stacks: one for voice that was surprisingly clean, and one for language that revealed the sharp edges of the local-first promise.

Building the Demo

To test this idea, I built a small Python script to tie these components together. You can find the complete code for this experiment, including the prompt engineering, in my demo project on GitHub: voice-activated-shell-demo.

Setup Instructions

Recreating this local-first voice assistant involves a few distinct steps:

Install HoldSpeak from Source: Since we need to use it as a library, clone the repository and install it in editable mode.
```
git clone https://github.com/karolswdev/HoldSpeak.git
cd HoldSpeak
pip3 install -e .
```
Install and Run Ollama: Use Homebrew (on macOS) to install the Ollama CLI, then start the server.
```
brew install ollama
ollama serve
```
Pull a Local LLM: In a separate terminal, pull a small, capable model. I used llama3.2.
```
ollama pull llama3.2
```
Grant Permissions (macOS): To allow the hotkey listener to work, your terminal application (e.g., iTerm, Terminal.app) must be given Accessibility permissions in System Settings > Privacy & Security > Accessibility.
Run the Demo Script: With the setup complete, you can run the final Python script that integrates all these components.
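Before running anything, a small preflight can confirm the pieces above are in place. This is a minimal sketch, not part of the demo repository: it assumes Ollama is listening on its default address (localhost:11434) and uses the server's /api/tags endpoint to list pulled models. The function names are mine, for illustration only.

```python
import json
import shutil
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default address


def model_is_pulled(tags_payload: dict, model: str) -> bool:
    """Return True if `model` appears in a /api/tags response payload."""
    names = [m.get("name", "") for m in tags_payload.get("models", [])]
    # Ollama reports names with a tag suffix, e.g. "llama3.2:latest".
    return any(n == model or n.split(":")[0] == model for n in names)


def preflight(model: str = "llama3.2") -> list[str]:
    """Collect human-readable problems; an empty list means ready to go."""
    problems = []
    if shutil.which("ollama") is None:
        problems.append("ollama binary not found on PATH")
    try:
        with urllib.request.urlopen(f"{OLLAMA_URL}/api/tags", timeout=2) as resp:
            payload = json.load(resp)
    except (OSError, ValueError) as exc:  # connection refused, timeout, bad JSON
        problems.append(f"Ollama server not usable: {exc}")
    else:
        if not model_is_pulled(payload, model):
            problems.append(f"model {model!r} has not been pulled")
    return problems
```

Calling preflight() and printing whatever it returns gives a quick answer to "did I forget ollama serve?" before the microphone ever gets involved.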

Finding the Seams in HoldSpeak

HoldSpeak presents itself as an application, but my goal was to use it as a library. The first step, installation, was a simple pip3 install -e . from the repository root. The second step hit a wall: import holdspeak gives you no documented, usable API, so there was no obvious way to know what to call. The path forward was to grep the source code.

This source-diving quickly revealed the clean seams I needed:

AudioRecorder (holdspeak/audio.py): A simple class to capture audio from the default microphone.
Transcriber (holdspeak/transcribe.py): A wrapper that intelligently selects a local backend (in my case, MLX Whisper on macOS) to convert audio into text.

The core components were there, but finding them was the tax. The project has extensive product documentation, yet none of it covers a programmatic API.
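The same kind of source-diving can be done from Python instead of grep. The sketch below walks any installed package and lists the public classes each submodule defines, using only the standard library. The helper name is mine, and note that it imports every submodule it finds, so import-time side effects will run.

```python
import importlib
import inspect
import pkgutil


def public_classes(package_name: str) -> dict[str, list[str]]:
    """Map each importable submodule to the public classes it defines."""
    package = importlib.import_module(package_name)
    module_names = [package_name]
    # Only packages have __path__; a single-file module has no submodules.
    search_path = getattr(package, "__path__", [])
    for info in pkgutil.walk_packages(search_path, package_name + "."):
        module_names.append(info.name)

    found = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception:
            continue  # optional backends may fail to import; skip them
        classes = [
            cls_name
            for cls_name, cls in inspect.getmembers(module, inspect.isclass)
            # Keep classes defined in this module, not ones it merely imports.
            if cls.__module__ == name and not cls_name.startswith("_")
        ]
        if classes:
            found[name] = classes
    return found
```

Run against holdspeak, a listing like this should surface the modules named above, assuming they import cleanly on your machine. It tells you what exists, not how to call it, so reading the source is still required.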

Assembling the Local Stack

With the voice components identified, I set up the other half of the local stack: the Large Language Model. Standing up a local LLM with Ollama is one of the easiest parts of the modern AI landscape: brew install ollama, ollama serve, and ollama pull llama3.2. No accounts, no API keys.

The friction appeared immediately, not in setup, but in correctness. A smoke test asking for "list files sorted by size" returned ls -lh | sort -k1, a plausible but incorrect command: sort -k1 sorts on the first column of the listing, which is the permissions string, not the size. A correct command is ls -lhS. This early result established the central tension of the project: the local LLM was easy to install but unreliable out of the box.
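For reference, a smoke test like this needs very little code. The sketch below talks to Ollama's /api/generate endpoint with streaming turned off and reads the "response" field of the reply. The prompt wording and the helper names are illustrative assumptions, not the prompt used in the demo project.

```python
import json
import urllib.request

OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"  # default address

# Illustrative instruction; the demo's actual prompt is not reproduced here.
INSTRUCTION = "Reply with a single shell command and nothing else."


def build_payload(request_text: str, model: str = "llama3.2") -> bytes:
    """Encode a non-streaming generate request as JSON bytes."""
    body = {
        "model": model,
        "prompt": f"{INSTRUCTION}\nRequest: {request_text}\nCommand:",
        "stream": False,  # ask for one JSON object instead of a token stream
    }
    return json.dumps(body).encode("utf-8")


def extract_command(response_text: str) -> str:
    """Drop Markdown fences and backticks; keep the first non-empty line."""
    lines = [ln.strip() for ln in response_text.strip().splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith("```")]
    return lines[0].strip("`").strip() if lines else ""


def ask_for_command(request_text: str) -> str:
    """Send the request to a running Ollama server and return the command."""
    req = urllib.request.Request(
        OLLAMA_GENERATE_URL,
        data=build_payload(request_text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return extract_command(json.load(resp)["response"])
```

Note that extract_command only cleans up formatting. Nothing here checks whether the command does what was asked, which is exactly the gap the smoke test exposed.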

In contrast, wiring up HoldSpeak's AudioRecorder and Transcriber was almost anticlimactic. Twelve lines of Python were all it took to capture and transcribe audio. When I spoke "list files by size," Whisper returned the exact text, 'List files by size', flawlessly. The local stack had two very different halves: speech-to-text was crisp and effortless; text-to-command was already the weak link.

Adding Push-to-Talk and a Safety Gate

More source-diving uncovered a third seam: HotkeyListener (holdspeak/hotkey.py). This class provided a clean callback API for push-to-talk functionality, defaulting to the Right Option key. Integrating it was trivial, but it surfaced an undocumented OS-level hurdle: on macOS, my terminal (iTerm) needed Accessibility permissions. The listener failed silently until I granted the permission and completely restarted the application.
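The push-to-talk logic itself is a small state machine: start recording on key down, stop and transcribe on key up. The sketch below shows that shape with two stand-in interfaces. The method names start, stop, and transcribe are my assumptions for illustration, not HoldSpeak's confirmed signatures, so check holdspeak/audio.py and holdspeak/transcribe.py before adapting it.

```python
from typing import Callable, Protocol


class Recorder(Protocol):
    # Hypothetical interface; the real AudioRecorder methods may differ.
    def start(self) -> None: ...
    def stop(self) -> object: ...  # returns the captured audio


class SpeechToText(Protocol):
    # Hypothetical interface; the real Transcriber methods may differ.
    def transcribe(self, audio: object) -> str: ...


class PushToTalk:
    """Record while the key is held; transcribe and hand off on release."""

    def __init__(
        self,
        recorder: Recorder,
        transcriber: SpeechToText,
        on_text: Callable[[str], None],
    ) -> None:
        self._recorder = recorder
        self._transcriber = transcriber
        self._on_text = on_text
        self._recording = False

    def on_press(self) -> None:
        if self._recording:  # ignore key auto-repeat while the key is held
            return
        self._recording = True
        self._recorder.start()

    def on_release(self) -> None:
        if not self._recording:  # release without a matching press
            return
        self._recording = False
        audio = self._recorder.stop()
        text = self._transcriber.transcribe(audio).strip()
        if text:  # skip empty transcriptions from accidental taps
            self._on_text(text)
```

The two methods are meant to be registered as the press and release callbacks of whatever hotkey listener you use. Keeping the state in one place also makes the loop testable without a microphone or Accessibility permissions.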

With the hotkey, recorder, transcriber, and Ollama all wired together, the final piece was a confirmation loop. Before executing any command with subprocess, a simple input("run? [y/N]") provided a critical safety gate.
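A gate like that can be sketched in a few lines. This version is an illustration rather than the demo's exact code: it defaults to "no" on anything other than y or yes, and takes the prompt and runner as parameters so it can be exercised without a terminal.

```python
import subprocess
from typing import Callable, Optional


def confirm_and_run(
    command: str,
    ask: Callable[[str], str] = input,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> Optional[subprocess.CompletedProcess]:
    """Show the proposed command and run it only after an explicit yes."""
    print(f"proposed: {command}")
    answer = ask("run? [y/N] ").strip().lower()
    if answer not in ("y", "yes"):
        print("skipped")
        return None
    # shell=True is what lets pipes like `ls | head` work, and it is also
    # why the command must be shown and confirmed before it runs.
    return run(command, shell=True)
```

Pressing Enter with no input skips the command, which matches the capital N in the prompt.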

The very first real run proved why it was so necessary. I held Right Option and said, "show me the 5 largest files." The LLM returned ls -lhS | tail -n 5. It looked correct, so I hit y.

It was wrong. ls -lhS sorts files from largest to smallest, so tail -n 5 returns the 5 smallest files from that list; the largest ones sit at the head, not the tail. The confirmation prompt can protect me from commands that look dangerous, but not from ones that are plausible and subtly incorrect. I had approved my own bug.
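The head-versus-tail mix-up is easy to see with a toy listing. The file names and sizes below are made up for illustration; the slicing mirrors what tail -n 5 and head -n 5 do to a list that is already sorted largest first.

```python
# Hypothetical file sizes in bytes, standing in for a directory listing.
sizes = {
    "video.mp4": 900,
    "db.sqlite": 700,
    "photo.png": 500,
    "app.log": 300,
    "notes.md": 200,
    "todo.txt": 100,
    ".gitignore": 10,
}

# Equivalent of sorting with -S: largest first.
by_size_desc = sorted(sizes, key=sizes.get, reverse=True)

tail_5 = by_size_desc[-5:]  # what `| tail -n 5` keeps: the last five rows
head_5 = by_size_desc[:5]   # what `| head -n 5` keeps: the first five rows

print("tail:", tail_5)
print("head:", head_5)
```

The tail slice starts at photo.png and ends at .gitignore, while the head slice starts at video.mp4. On a real listing, ls -l also prints a leading "total" line, so a head-based fix has to account for one extra row.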

Evaluation: A Great Voice Library, A Fragile Copilot

Should you build a voice CLI copilot on HoldSpeak today? Yes for the voice layer, but with major caveats about relying on a small, local LLM for autonomy.

HoldSpeak as a Library

I would use HoldSpeak again. Once you find the entry points (AudioRecorder, Transcriber, and HotkeyListener), they are well-shaped components that compose into a working push-to-talk loop in under 100 lines. The local Whisper transcription was fast, accurate, and required zero configuration to use the MLX backend on Apple Silicon. The single biggest improvement HoldSpeak could make is documenting this programmatic path. A short "using HoldSpeak as a library" guide would have erased most of this project's friction.

The Local-First Tradeoff

This demo illustrates the tradeoff of a fully local AI stack. The "no cloud" setup is fast, cheap, and private, but that convenience comes at the cost of correctness. The llama3.2 model repeatedly produced plausible-but-wrong commands. Prompt engineering can fix the failure cases you identify in advance, but that is patching known failures one at a time, not building general reliability.

The confirm-before-execute gate is mandatory. It makes the tool safe, but it does not make it correct. It protects you from commands you recognize as wrong, but does nothing for the ones that look right, which is where small models often fail. A voice interface, by its nature, encourages speed over the careful inspection of an LLM's work.

HoldSpeak is a high-quality voice library that I would recommend, and it is waiting to be discovered. The local-first shell copilot I built on top of it, however, remains a great demo rather than a shippable tool. To make it real, you'd need a larger, more capable model (defeating some of the purpose of a small local setup) or a UI that explains why a command was chosen, giving the user a better chance to catch the subtle bugs a confirmation prompt can't.
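As a starting point for that "explain why" idea, the model can be asked for a command and a one-line rationale together. This is an untested direction, not something the demo does: the sketch assumes Ollama's "format": "json" request option, and the JSON keys command and explanation are names I made up for the prompt.

```python
import json
from dataclasses import dataclass


@dataclass
class Suggestion:
    command: str
    explanation: str


def build_explained_payload(request_text: str, model: str = "llama3.2") -> dict:
    """Request body for /api/generate that asks for JSON output."""
    return {
        "model": model,
        "prompt": (
            'Answer as JSON with keys "command" and "explanation". '
            "The explanation must say what each flag and pipe stage does.\n"
            f"Request: {request_text}"
        ),
        "format": "json",  # ask Ollama to constrain the reply to valid JSON
        "stream": False,
    }


def parse_suggestion(raw: str) -> Suggestion:
    """Parse the model's reply; refuse anything without a usable command."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    command = str(data.get("command", "")).strip()
    if not command:
        raise ValueError("model returned no command")
    return Suggestion(command, str(data.get("explanation", "")).strip())
```

Showing the explanation next to the confirmation prompt gives the user something to disagree with. It does not make the model right, and a small model can explain a wrong command just as confidently.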

DEV Community