justin chen

Posted on Jun 25

How I Built a Local TTS Daemon That Actually Knows When to Shut Up (claude-tts v0.1.0)

#claudecode #tts #opensource #devtools

I run Claude Code — Anthropic's CLI coding agent — for long builds and test runs. The agent does real work: it edits files, runs tests, reads errors, tries again. The problem is I need to be somewhere else. I can't watch the terminal.

The obvious answer is text-to-speech. The naive implementation is catastrophic. Five minutes of listening to your computer narrate ====> eslint --ext .ts --ignore-path .gitignore . && tsc --noEmit | grep -E 'error TS' will make you never try this again.

So the real problem — the engineering problem — is filtering. Not speaking everything. Not speaking nothing. Speaking the right slice: status pivots, errors, final answers. And staying quiet through the noise.

I spent a few weeks building this as a Claude Code plugin. It's called claude-tts, it's v0.1.0, MIT licensed, and this is how it works.

The Architecture: Four Moving Parts

The system is four pipeline stages. The hooks hand events to the daemon over a Unix socket; the remaining stages run inside the daemon:

Claude Code hooks
      │  (raw agent event, JSON over socket)
      ▼
ContentRouter (filter brain)
      │  should_speak? → text to speak
      ▼
GenerateStage (TTS synthesis)
      │  audio file
      ▼
PlaybackStage (OS audio)

Claude Code fires hooks at SessionStart, PreToolUse, PostToolUse, and Stop. Each hook sends an event payload to the daemon over a Unix socket (~/.local/share/claude-tts/claude-tts.sock on XDG systems, /tmp/claude-tts.sock as fallback). The daemon processes these asynchronously; hooks return immediately and don't block the agent.
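To make the fire-and-forget behavior concrete, here is a minimal sketch of what a hook-side sender could look like. The function name send_event, the newline-delimited JSON framing, and the 0.2 second timeout are assumptions for illustration; the post does not show the real wire format.

```python
import json
import socket


def send_event(event: dict, sock_path: str, timeout: float = 0.2) -> bool:
    """Send one JSON event to the daemon and return quickly.

    Hypothetical helper: returns False instead of raising when the daemon
    is down, so the hook never blocks or crashes the agent.
    """
    payload = (json.dumps(event) + "\n").encode("utf-8")
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)  # bound connect and send time
            sock.connect(sock_path)
            sock.sendall(payload)
        return True
    except OSError:
        # Missing socket, refused connection, or timeout: stay silent.
        return False
```

The design point is that every failure mode collapses to "no speech": a hook that raises or hangs would slow the agent down, which is worse than a missed announcement.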

The audio side is swappable: Kokoro MLX on Apple Silicon for local neural TTS, edge-tts for Azure voices (needs internet), or the zero-dependency fallback — macOS say / Linux espeak. The LLM side is equally swappable: Ollama by default, any OpenAI-compatible endpoint (LM Studio, llama.cpp server, vLLM, Groq), or null for fully deterministic operation with no model at all.

But the interesting part is the filter.

The Filter Brain: Why Deterministic Rules Alone Produce Spoken Gibberish

My first pass was purely rule-based. I wrote regexes: speak lines that look like test results (N passed, N failed), speak lines that look like errors, drop everything else.

The result was still gibberish. The rules weren't wrong: they classified the type of content correctly. But individual tool outputs routinely pass classification while still being unspeakable as audio.

Here's a concrete example. A Bash tool output from a linting run might classify as "error output" (correct — it contains errors) and pass the should-speak gate, then get handed to TTS as:

src/auth/middleware.ts:47:12 - error TS2345: Argument of type 'string | undefined'
is not assignable to parameter of type 'string'.

47    const token = headers['authorization']?.split(' ')[1];
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

That's not speakable. It's a syntax-dense, whitespace-dependent error format. TTS will read it character by character, including the tildes.

But more insidious: a lot of agent output that looks like structured data to a human eye is actually gibberish tokens to a TTS engine. Git commit SHAs (a3f9b2c1d4e5), UUIDs (f47ac10b-58cc-4372-a567-0e02b2c3d479), base64 blobs, hex color codes, diff hunk headers (@@ -23,7 +28,4 @@), ls -l permission triplets — these all slip through regex classifiers because they look like "output" but sound like keyboard mashing.
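A rough sketch of how a few of these token classes could be detected. The patterns and the find_artifacts helper are hypothetical stand-ins, not the project's actual regexes, and they cover only three of the classes listed above.

```python
import re

# Hypothetical patterns for illustration; the real filter's regexes are not shown in the post.
UNSPEAKABLE_PATTERNS = {
    "uuid": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
    # 7 to 40 lowercase hex chars containing at least one letter AND one digit,
    # so plain numbers ("1234567") and hex-only words ("defaced") are left alone.
    "git_sha": re.compile(r"\b(?=[0-9a-f]*[a-f])(?=[0-9a-f]*\d)[0-9a-f]{7,40}\b"),
    "diff_hunk": re.compile(r"@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@"),
}


def find_artifacts(text: str) -> list:
    """Return the names of every artifact class found in the text."""
    return [name for name, pat in UNSPEAKABLE_PATTERNS.items() if pat.search(text)]
```

Note that a UUID's hex groups also satisfy the SHA pattern, so one token can be reported under both names. For a drop-or-keep decision that overlap is harmless.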

The Two-Stage Filter

The solution is two gates in sequence:

Gate 1: Should we speak this event at all? This is the ContentRouter. For structured signals — test counts, error totals, build success/failure, final agent answers — the answer is deterministic. The router knows what a test result line looks like (\d+ passed, \d+ failed). It knows a Stop hook event is the agent's final answer and should almost always be spoken. It knows a Read tool invocation is never worth narrating.

For the ambiguous middle — a Bash tool that ran something you can't immediately classify — the router consults a local LLM judge. The judge receives the raw stdout alongside compact context built from signals already on the event: which tool ran, a short target hint (e.g. "ran the tests in test_router"), and whether similar output was recently spoken this session. The verdict must be exactly SPEAK; anything else is treated as SKIP. This means model weirdness degrades to silence, not spoken garbage.
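Gate 1 can be sketched as below, under stated assumptions: the event field names (hook, tool, stdout), the should_speak and parse_verdict helpers, and the decision to strip whitespace before comparing the verdict are all illustrative, not the router's real API.

```python
import re
from typing import Callable, Optional

TEST_COUNT_RE = re.compile(r"\d+ passed|\d+ failed")
SILENT_TOOLS = {"Read"}  # the post names Read; any other entries would be project-specific


def parse_verdict(raw: str) -> bool:
    """Only an exact SPEAK counts; typos, prose, or empty output mean SKIP."""
    return raw.strip() == "SPEAK"


def should_speak(event: dict, judge: Optional[Callable[[str], str]] = None) -> bool:
    if event.get("hook") == "Stop":
        return True  # final answer: speak
    if event.get("tool") in SILENT_TOOLS:
        return False  # never worth narrating
    stdout = event.get("stdout", "")
    if TEST_COUNT_RE.search(stdout):
        return True  # structured signal: deterministic yes
    if judge is None:
        return False  # no model configured: ambiguous output stays silent
    try:
        return parse_verdict(judge(stdout))
    except Exception:
        return False  # a failing judge degrades to silence
```

The asymmetry is deliberate: the only way to get speech out of the ambiguous branch is a clean SPEAK, so a confused or chatty model can only make the daemon quieter.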

Gate 2: Is the text actually speakable? Even after gate 1 passes, the text gets a speakability check in is_speakable() in daemon/text_utils.py. Several things happen here:

Strip code artifacts: === → =, && → and, != → not equal, => → to. Drop git SHAs (hex strings mixing letters and digits in a run of 7+), UUIDs, base64 blobs, @@...@@ diff hunks, ISO-8601 timestamps, env-var assignment dumps.
Check that the normalized text contains real words. The daemon loads /usr/share/dict/words on macOS; on Linux (where the system dictionary is absent by default) it falls back to a bundled public-domain wordlist (daemon/data/words.txt.gz). A conservative inflectional stemmer handles the fact that most system dictionaries store base forms only: "passed" isn't in the dictionary, but stripping -ed leaves "pass", which is.
Drop text where the real-word ratio is too low, or where a vowelless non-acronym token dominates in a low-real-word context.
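The real-word check in the steps above can be sketched like this. It is a simplified stand-in, not the contents of daemon/text_utils.py: the tiny WORDS set replaces the full dictionary, the suffix list and the 0.6 threshold are assumptions, and the vowelless-token rule is left out.

```python
import re

# Tiny stand-in vocabulary; the real gate loads a full dictionary.
WORDS = {"the", "test", "pass", "fail", "build", "in", "suite", "all", "error", "fix", "and"}

# Conservative inflectional suffixes, tried longest first.
SUFFIXES = ("ing", "ed", "es", "s")


def is_real_word(token: str, words: set = WORDS) -> bool:
    t = token.lower()
    if t in words:
        return True
    for suf in SUFFIXES:
        # Require a stem of at least 3 letters so "is" does not reduce to "i".
        if t.endswith(suf) and len(t) - len(suf) >= 3 and t[: -len(suf)] in words:
            return True
    return False


def real_word_ratio(text: str, words: set = WORDS) -> float:
    tokens = re.findall(r"[A-Za-z]+", text)  # digits and punctuation are ignored
    if not tokens:
        return 0.0
    return sum(is_real_word(t, words) for t in tokens) / len(tokens)


def looks_speakable(text: str, min_ratio: float = 0.6, words: set = WORDS) -> bool:
    return real_word_ratio(text, words) >= min_ratio
```

With this vocabulary, "all tests passed" scores 1.0 because "tests" and "passed" both reduce to base forms, while a string of non-words falls well under the threshold.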

Across roughly 4,500 real speak decisions in my production shadow log, the rate of spoken code-artifact gibberish reached 0%. Getting there took nine rounds of filter refinement: early passes got the markup class to zero but were blind to non-markup gibberish (orphan punctuation, non-word tokens from ps output and agent IDs); later rounds addressed those.

Here's a terminal transcript showing the pipeline in action:

# Agent runs: pytest tests/ -v

# [daemon receives PostToolUse event for Bash]
# ContentRouter:
#   tool=Bash, cmd="pytest tests/ -v"
#   → _is_test_command() True
#   → extracts "23 passed, 2 failed" from stdout
#   → context_hint: "test result from the test suite"
#   → should_speak: True, route: test_result
# GenerateStage: synthesize "In the test suite: 23 passed, 2 failed" → /tmp/tts_chunk_482.wav
# PlaybackStage: afplay /tmp/tts_chunk_482.wav

# What you actually hear:
"In the test suite: 23 passed, 2 failed"

# Agent runs: cat package.json | grep '"version"'
# ContentRouter: tool=Bash, short stdout, no test pattern → SKIP (silent)

# Agent final answer (Stop hook):
# ContentRouter: stop_event → PRIORITY_HIGH → summarize if long
# Heard: "Done. Updated the auth middleware, fixed the token null check, all 23 tests passing."

The macOS afplay Bug: A Concrete Engineering Anecdote

When I added the say/espeak fallback engine (the zero-dependency path that works with no ML and no network), I ran into a silent failure that took some digging to understand.

The TTS pipeline works like this: GenerateStage calls engine.synthesize(text, audio_path) which writes an audio file, then returns True on success. PlaybackStage then calls afplay <audio_path> separately. The two stages are decoupled intentionally — generation and playback are different concerns.

The bug: GenerateStage was naming all non-Kokoro outputs with a .mp3 extension. That's fine for edge-tts, which actually writes MP3 bytes. But SystemTTSEngine wraps macOS say, which writes WAVE/AIFF output. So the pipeline was writing RIFF/WAVE bytes into a file called tts_chunk_482.mp3.

On Linux, mpv and ffplay content-sniff the file header. They play WAVE bytes regardless of what the filename says. The tests passed. The CI on macOS also passed because the tests used mocked subprocess calls.

The production failure looked like this:

$ afplay /tmp/tts_chunk_482.mp3
Error: AudioFileOpen failed ('dta?')

Exit code 1. No audio. The daemon logged a PlaybackStage failure and moved on. synthesize() had returned True — the file existed, had nonzero size, say exited 0. The failure was invisible to the generation stage.

The root cause: macOS afplay (and AudioToolbox generally) selects the audio decoder from the file extension, not the file's byte content. WAVE bytes in a .mp3 file fail to open. The same bytes in a .wav file play fine.

The fix is _audio_ext_for(engine) in generate_stage.py — it returns "wav" for any engine in _WAV_ENGINES (kokoro, say, espeak, system) and "mp3" for edge-tts. But the lesson is more interesting: mocked subprocess tests cannot catch format/extension mismatches. The real check is an integration test that runs the actual say binary and then calls afinfo <output_path> to verify AudioToolbox can open it.

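import asyncio
import platform
import shutil
import subprocess

import pytest

# SystemTTSEngine is imported from the daemon package (import path not shown here).

@pytest.mark.skipif(
    platform.system() != "Darwin" or not shutil.which("say")
    or not shutil.which("afinfo"),
    reason="macOS + say + afinfo required"
)
def test_system_real_say_writes_afplay_openable_wav(tmp_path):
    out = tmp_path / "real.wav"
    e = SystemTTSEngine()
    # Run the real say binary: no mocks, so the bytes on disk are what afplay will see.
    ok = asyncio.run(e.synthesize("plan 3d regression check", str(out), "", 1.0))
    assert ok is True
    # The header must be RIFF/WAVE to match the .wav extension.
    data = out.read_bytes()
    assert data[:4] == b"RIFF" and b"WAVE" in data[:16]
    # afinfo exits 0 only if AudioToolbox can open the file.
    rc = subprocess.run(["afinfo", str(out)],
                        stdout=subprocess.DEVNULL,
                        stderr=subprocess.DEVNULL).returncode
    assert rc == 0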

This test lives in CI and runs on macOS runners. It would have caught the original bug.
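For reference, the extension fix described above could look roughly like the sketch below. The _WAV_ENGINES contents come from the post; the sniff_container and extension_matches helpers are hypothetical additions showing how the generation stage could cross-check header bytes against the filename, and they are not part of the project as described.

```python
from pathlib import Path
from typing import Optional

_WAV_ENGINES = frozenset({"kokoro", "say", "espeak", "system"})


def audio_ext_for(engine: str) -> str:
    """Return "wav" for WAV-writing engines, "mp3" otherwise (edge-tts)."""
    return "wav" if engine.strip().lower() in _WAV_ENGINES else "mp3"


def sniff_container(header: bytes) -> Optional[str]:
    """Guess the container from the first bytes of a file (hypothetical helper)."""
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    if header[:4] == b"FORM" and header[8:12] in (b"AIFF", b"AIFC"):
        return "aiff"
    # MP3: an ID3 tag, or an MPEG frame sync (eleven set bits).
    if header[:3] == b"ID3" or (
        len(header) >= 2 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0
    ):
        return "mp3"
    return None


def extension_matches(path: str, header: bytes) -> bool:
    """True when the filename extension agrees with the sniffed container."""
    return Path(path).suffix.lstrip(".").lower() == sniff_container(header)
```

A check like extension_matches would have flagged WAVE bytes in tts_chunk_482.mp3 at generation time, before afplay ever saw the file.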

Graceful Degradation: The LLM Is an Upgrade, Not a Requirement

The design decision I'm most glad I made: the LLM is optional.

The naive approach to "smart content filtering" is to make the LLM a hard dependency. The problem is that this means the daemon fails or produces no output if Ollama isn't running, the model isn't downloaded, or the model returns garbage.

Instead, the daemon has three tiers:

Tier 1: Full LLM mode. Ollama (or any OpenAI-compatible endpoint) is configured and reachable. The judge classifies ambiguous content; the summarizer condenses long output. The model I recommend is qwen2.5-coder:1.5b: 986 MB, fast on CPU, and good at following strict single-word verdict instructions.

Tier 2: Deterministic mode (llm_provider.type = "null"). No model. The router still runs all deterministic rules: test counts, error lines, build status, final answers. Long content is truncated at a character threshold rather than summarized. You lose the judgment on ambiguous Bash output, but you still get the important signals.

Tier 3: Zero-dependency audio. Even if you have no Ollama and no Kokoro, say on macOS and espeak/espeak-ng on Linux are usually already installed. No Python ML deps, no model downloads, no network.

The configuration looks like this:

{
  "llm_provider": {
    "type": "ollama",
    "model": "qwen2.5-coder:1.5b",
    "base_url": "http://localhost:11434"
  },
  "voice": {
    "engine": "kokoro",
    "name": "bf_emma",
    "rate": 1.2
  }
}

To drop to deterministic mode:

{
  "llm_provider": {
    "type": "null"
  },
  "voice": {
    "engine": "say"
  }
}

The null provider still implements the LLMProvider interface: it returns deterministic outputs from provider.judge() and provider.summarize(). Nothing downstream knows the difference.
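A minimal sketch of that shape, with assumed signatures: the method names judge and summarize come from the post, but the parameters, the SKIP return value, and the word-boundary truncation are illustrative guesses rather than the project's actual code.

```python
from typing import Protocol


class LLMProvider(Protocol):
    """Assumed interface; only the method names are taken from the post."""

    def judge(self, text: str, context: str) -> str: ...

    def summarize(self, text: str, max_chars: int) -> str: ...


class NullProvider:
    """Deterministic stand-in: never asks a model."""

    def judge(self, text: str, context: str) -> str:
        # Without a model, ambiguous content is never approved for speech.
        return "SKIP"

    def summarize(self, text: str, max_chars: int) -> str:
        if len(text) <= max_chars:
            return text
        # Truncate at the threshold, then back up to the last whole word.
        cut = text[:max_chars].rsplit(" ", 1)[0]
        return cut.rstrip(" ,.;:") + "."
```

Because NullProvider satisfies the same protocol, callers can hold an LLMProvider and never branch on whether a model is configured.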

For other LLM backends, the openai_compat provider takes a base_url and optional API key. That's the same adapter for LM Studio, llama.cpp server, vLLM, Groq, or OpenAI itself.

Cross-Platform Audio Format Safety

The platform layer abstracts OS-specific audio playback. macOS uses afplay. Linux uses a decoder-first chain:

LINUX_PLAYER_CHAIN = [
    "ffplay",   # container-agnostic decoder
    "mpv",      # container-agnostic decoder
    "pw-play",  # PipeWire: WAV only
    "paplay",   # PulseAudio: WAV only
    "aplay",    # ALSA: WAV only
]

The order matters. ffplay and mpv content-sniff and handle any container. pw-play, paplay, and aplay are WAV-only. By probing shutil.which() in decoder-first order, the daemon uses the most capable player available. On a minimal system with only aplay, it also defaults the engine to say/espeak (which writes WAV), so the format always matches the player.
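The probing step can be sketched as follows. The chain is the one listed above; the pick_player and can_play helpers, and the injectable which parameter (handy for tests), are assumptions for illustration.

```python
import shutil
from typing import Callable, Optional, Sequence

LINUX_PLAYER_CHAIN = ["ffplay", "mpv", "pw-play", "paplay", "aplay"]
WAV_ONLY_PLAYERS = {"pw-play", "paplay", "aplay"}


def pick_player(
    chain: Sequence[str] = LINUX_PLAYER_CHAIN,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first player in the chain that is installed, or None."""
    for name in chain:
        if which(name):
            return name
    return None


def can_play(player: str, ext: str) -> bool:
    """WAV-only players can handle nothing but .wav; decoders handle either."""
    return ext == "wav" or player not in WAV_ONLY_PLAYERS
```

Pairing pick_player with can_play makes the format constraint explicit: if the only available player is WAV-only, an MP3-producing engine is the wrong choice for that machine.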

Service installation is platform-native: launchd on macOS with a generated plist, systemd --user on Linux with loginctl enable-linger so the daemon survives user session logout. The SessionStart hook auto-launches the daemon if it isn't running.

CI Caught What My Local Gate Missed

The Linux portability story has a good cautionary note.

My speakability gate, is_speakable(), loads /usr/share/dict/words to check vocabulary ratios. On macOS, that file is always present (235,976 words). On Ubuntu CI runners, it's absent by default. So the gate was silently disabled on Linux: gibberish tokens that were correctly dropped on macOS were kept on Linux.

My local gate (663 tests, all passing) didn't catch this. My static portability audit checked for macOS syscalls (afplay, launchctl, say) but not for filtering logic that diverges on a system data file.

The CI matrix caught it on first push: all three macOS cells green, all three Ubuntu cells red on test_is_speakable_drops_noise. The fix was bundling a public-domain wordlist (daemon/data/words.txt.gz) as a fallback when the system dictionary isn't present, with a test that forces the bundled dict path. The lesson: a cross-platform CI matrix catches data-file and locale divergences that a syscall-level audit cannot.
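The fallback itself is simple to sketch. The load_words function below is a hypothetical illustration, assuming one word per line in both files; passing the paths as arguments is what makes it easy for a test to force the bundled path.

```python
import gzip
from pathlib import Path


def load_words(system_path: str, bundled_path: str) -> frozenset:
    """Load the system dictionary if present, else the bundled gzip wordlist."""
    system = Path(system_path)
    if system.is_file():
        with system.open(encoding="utf-8", errors="ignore") as fh:
            words = frozenset(line.strip().lower() for line in fh if line.strip())
        if words:
            return words  # an empty system file falls through to the bundle
    with gzip.open(bundled_path, "rt", encoding="utf-8") as fh:
        return frozenset(line.strip().lower() for line in fh if line.strip())
```

The important property is that the function never returns an empty set just because a system file is missing, which is the silent failure mode that disabled the gate on Ubuntu.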

What v0.1.0 Looks Like in Practice

This is a first release from a solo author. The filter brain works well on my actual Claude Code sessions. The CI matrix is green on macOS and Ubuntu across Python 3.11–3.13. The engineering quality bar is real: make verify fails if code-artifact gibberish leaks to speech, if classification regressions appear, or if the test count drops.

Honest caveats:

Kokoro (the local neural voice) is Apple Silicon only and requires a separate mlx-audio interpreter; point MLX_PYTHON at it. edge-tts works everywhere but needs internet. say/espeak work on macOS and Linux with no extra dependencies.
Volume control is macOS-only right now.
Windows has no native service install — WSL2 or Docker would work, but it's not wired.
Setup requires manual steps currently (see below). A /tts:setup command that handles calibration and service install is the next milestone — not yet in this release.
This is v0.1.0. I'd love feedback, bug reports, and contributors.

Try It

Manual setup (current):

git clone https://github.com/chendrizzy/claude-tts
cd claude-tts
uv sync --extra edge
cp config.example.json config.json
# Edit config.json to set your engine and LLM backend
# Wire the hooks from hooks/hooks.json into your Claude Code settings

Requires Python >= 3.11 and uv. For Kokoro: create a separate mlx-audio Python environment and set MLX_PYTHON to point at it. For Ollama: ollama pull qwen2.5-coder:1.5b.

Repo: github.com/chendrizzy/claude-tts

The filter brain is the part I'm most interested in improving. If you use Claude Code for long autonomous runs and have opinions on what should be spoken vs. what should stay silent, I'd genuinely like to hear them: open an issue or leave a comment here.

DEV Community