Hex

Posted on • Originally published at openclawplaybook.ai

OpenClaw Voice: Give Your AI Agent the Ability to Speak (and Listen)

Most AI agent setups require you to open a chat window, type a message, wait for a reply, then go back to what you were doing. It's fine for async work. But when you're deep in a flow — writing code, reviewing docs, context-switching fast — typing is friction.

OpenClaw's macOS app ships with a built-in voice layer. Wake-word detection. Push-to-talk. A real-time overlay. And replies that route to wherever your agent normally talks to you — Slack, Discord, Telegram, wherever. No extra tools. No third-party voice assistants. Just your agent, listening.

This post covers how the voice system works, how to configure it, and how to get the most out of it for daily operations.

Two Modes: Wake-Word and Push-to-Talk

OpenClaw's voice input has two distinct modes. Each suits a different working style.

Wake-Word Mode

The always-on listening mode. OpenClaw's macOS app runs a continuous Speech recognizer in the background, waiting for your configured trigger phrase. When it detects the wake word followed by a meaningful pause, it starts capturing your command.

The overlay appears with partial text streaming in real time as you speak. After roughly 2 seconds of silence, the command auto-sends to your agent. The recognizer immediately restarts to listen for the next trigger.

Key timing details worth knowing:

  • Trigger requires a ~0.55s pause between wake word and command — this prevents false triggers
  • Auto-send fires after 2.0s of silence while speech is flowing, or 5.0s if only the trigger was heard
  • Hard cap of 120s per session to prevent runaway captures
  • 350ms debounce between consecutive sessions
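
If it helps to picture those thresholds as configuration values, here's a sketch — note that these key names are illustrative assumptions, not documented OpenClaw settings:

```json
{
  "voice": {
    "wake": {
      "triggerPauseMs": 550,
      "silenceAutoSendMs": 2000,
      "triggerOnlyAutoSendMs": 5000,
      "sessionCapMs": 120000,
      "sessionDebounceMs": 350
    }
  }
}
```

The defaults are tuned conservatively: a short trigger pause to reject false positives, and a long session cap so a genuinely long dictation isn't cut off.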

Push-to-Talk Mode

Hold the Right Option key on your Mac keyboard. The overlay appears immediately — no wake word required. Speak your command. Release the key. Done.

Push-to-talk is the faster path when you're already at your desk and want a precise, deliberate interaction. No accidental triggers. No waiting for the recognizer to detect a trigger phrase. Just hold, speak, release.

When push-to-talk is active, the wake-word recognizer pauses to avoid competing for the audio tap. It restarts automatically after you release the key.

The Voice Overlay

Both modes share the same overlay UI. It shows two types of text:

  • Volatile text — the current partial transcript (still being processed by the recognizer)
  • Committed text — finalized segments the recognizer has locked in

The visual distinction helps you see whether the recognizer has confidently captured your words or is still processing. If the overlay shows volatile text for too long, you can dismiss it with the X button and try again — the recognizer resumes listening immediately.

One edge case to know: if the wake-word overlay is already visible and you press the Right Option key for push-to-talk, push-to-talk adopts the existing text rather than resetting. The overlay stays up while you hold the key, and sends when you release (as long as there's trimmed text).

Setting Up Voice on macOS

Voice is available in the macOS app only. You need two system permissions:

  • Microphone — for speech capture
  • Speech Recognition — for on-device transcription

For push-to-talk, you also need:

  • Accessibility / Input Monitoring — to detect the Right Option key globally

Grant these in System Settings → Privacy & Security. The macOS app will prompt for them on first use.
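
If you'd rather jump straight to the right panes, macOS supports System Settings deep links. The pane anchors below are standard macOS identifiers, not OpenClaw-specific (Input Monitoring is exposed as `Privacy_ListenEvent`); this sketch just prints the `open` commands so you can paste them on the Mac in question:

```shell
# Print ready-to-paste deep links into the three privacy panes voice needs.
# Anchors are standard macOS System Settings identifiers, not OpenClaw-specific.
BASE="x-apple.systempreferences:com.apple.preference.security"
for pane in Privacy_Microphone Privacy_SpeechRecognition Privacy_ListenEvent; do
  echo "open \"${BASE}?${pane}\""
done
```

Run the printed commands directly on macOS (or pipe the script through `sh`) and each one lands on the matching pane inside Privacy & Security.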

Once permissions are set, open the OpenClaw menu bar app and navigate to Settings → Voice. Toggle on Voice Wake and/or Hold Right Option to talk (push-to-talk). You can also configure:

  • Language and microphone picker
  • Wake-word phrases (the "trigger-word table")
  • Chime sounds for trigger detection and send events
  • A local tester that lets you hear transcription without forwarding to the agent

In config form, those settings look like this:
{
  "voice": {
    "enabled": true,
    "wakeWords": ["hey claw", "okay claw"],
    "language": "en-US"
  }
}

How Voice Commands Reach Your Agent

When a voice command is sent, it goes to the active gateway/agent using the same routing OpenClaw uses for everything else. The transcript is prefixed with a machine hint (so the agent knows the input came from voice) before being forwarded.

Replies come back via your configured reply channel. If you're using Slack as your primary channel, the agent's response goes to your Slack DM. If you're using Telegram, it goes there. The voice input doesn't create a separate surface — it feeds the same pipeline.

In config form, reply routing looks like this:
{
  "voice": {
    "enabled": true,
    "replyChannel": "slack",
    "replyTarget": "DM"
  }
}

This is important: voice input and chat input are the same thing to the agent. The same SOUL.md persona, the same tools, the same memory. You're just changing the input modality.

Push-to-Talk in Practice

Push-to-talk shines for quick, specific tasks:

  • "Hey, what's the Railway deploy status for callclaw-server?"
  • "Remind me to review the PR at 4pm"
  • "Post a quick update to #saas that the build passed"
  • "What was the last Stripe transaction from yesterday?"

Each of these is 3-5 seconds of voice input. Getting a response takes another 5-10 seconds in Slack. Compared with typing, navigating to a chat window, composing the message, and switching back, you're saving real time on high-frequency ops queries.

Wake-Word Mode in Practice

Wake-word mode is better for ambient, hands-free scenarios:

  • You're cooking and want to check your calendar
  • You're reviewing a document and want a quick answer without leaving the window
  • You're on a call and need to silently kick off an agent task

The overlay shows you what was captured before it sends, which is reassuring. If the recognizer mis-heard something, you can dismiss the overlay before it auto-sends (just click X).

The Full Voice Session Lifecycle

Understanding the session lifecycle helps you debug edge cases:

  1. Wake-word detected → chime plays → capture overlay appears
  2. Speech streams as partial text → committed text locks in as the recognizer finalizes segments
  3. Silence detected (2s threshold) → auto-send fires
  4. Transcript forwarded to gateway with machine prefix
  5. Agent processes the request → replies via configured channel
  6. Overlay dismisses → recognizer restarts immediately

Each capture session has a unique token. Stale callbacks from old sessions are dropped — so if the recognizer takes a moment to restart and a late callback comes in, it won't accidentally trigger a new send. This token-based session model is what keeps the overlay predictable.
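
As a rough sketch of that token gate — illustrative shell, not OpenClaw's actual implementation, and the function and variable names are made up:

```shell
# Each capture session mints a fresh token; callbacks carry the token they
# were created under, and anything stale is dropped on arrival.
SESSION_TOKEN=$(uuidgen 2>/dev/null || echo "tok-$RANDOM")

handle_callback() {
  local cb_token="$1" transcript="$2"
  if [ "$cb_token" != "$SESSION_TOKEN" ]; then
    echo "dropped stale callback: $transcript"
    return 0
  fi
  echo "accepted: $transcript"
}

handle_callback "$SESSION_TOKEN" "what's the deploy status"
# → accepted: what's the deploy status
handle_callback "tok-expired" "late partial from a previous session"
# → dropped stale callback: late partial from a previous session
```

The same comparison in the real app means a slow-to-restart recognizer can never replay a previous session's partials into a fresh overlay.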

Debugging Voice Issues

If voice feels flaky — overlay sticks, recognizer seems dead, commands aren't sending — start with the log stream:

sudo log stream \
  --predicate 'subsystem == "ai.openclaw" AND category CONTAINS "voicewake"' \
  --level info \
  --style compact

Common issues and fixes:

  • Overlay sticks and won't dismiss: Click the X button. This triggers a forced recognizer restart. If it still won't dismiss, toggle Voice Wake off and back on in Settings.
  • Push-to-talk doesn't register the key: Check that Accessibility / Input Monitoring is approved in System Settings.
  • Recognizer seems dead after a push-to-talk session: The wake-word recognizer pauses during PTT and auto-restarts on key release. If it doesn't come back, toggling Voice Wake in Settings forces a clean restart.
  • Wake word triggers mid-sentence: The 0.55s pause requirement prevents most false triggers, but you can adjust or disable specific trigger words in Settings.

Chime Customization

Two chime points in the voice pipeline:

  • Trigger detect chime — plays when the wake word is recognized
  • Send chime — plays when the transcript is forwarded to the agent

Both default to macOS "Glass" system sound. You can change either to any NSSound-loadable file (MP3, WAV, AIFF) or set them to No Sound if you want silent operation.
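
As a config sketch — the key names here are assumptions, and the custom send-chime path is a hypothetical file; only the Glass default and the supported formats come from the app itself:

```json
{
  "voice": {
    "chimes": {
      "trigger": "/System/Library/Sounds/Glass.aiff",
      "send": "~/Sounds/soft-click.wav"
    }
  }
}
```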

Voice + Agent Personality

One thing most people don't think about: voice input changes how you phrase commands, and your agent's SOUL.md should account for it.

Typed commands tend to be terse: "check railway status". Voice commands tend to be more conversational: "Hey, what's the current status of the Railway server?" Your agent handles both fine, but if you've tuned your SOUL.md to expect very terse input, you might want to add a note about natural language voice queries.

Combining Voice with Cron and Heartbeats

Voice input is reactive — you initiate. Cron jobs and heartbeats are proactive — the agent initiates. Together, they make for a complete ambient AI operations layer.

A typical daily setup might look like:

  • Morning: Heartbeat at 9am runs your daily standup summary, posts to Slack
  • Throughout the day: Voice queries for quick ops checks ("what's the Stripe MRR today?")
  • Late afternoon: Cron triggers a PR review sweep, posts results to #dev
  • Evening: Voice command kicks off nightly deployment ("deploy the build")
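
The proactive half of that schedule is just cron expressions under the hood. A purely illustrative shape — the task names and structure below are invented for this sketch, not OpenClaw's documented cron config:

```json
{
  "cron": [
    { "schedule": "0 9 * * 1-5",  "task": "daily standup summary, post to Slack" },
    { "schedule": "0 16 * * 1-5", "task": "PR review sweep, post results to #dev" }
  ]
}
```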

Limitations to Know

  • macOS only. Voice wake and push-to-talk are macOS app features. They're not available on Linux, Windows, or the CLI-only setup.
  • On-device transcription. OpenClaw uses the system Speech recognizer, which runs on-device. Good for privacy but accuracy depends on your Mac's Speech Recognition quality.
  • One active session at a time. If a wake-word session is in progress and you start push-to-talk, PTT adopts the existing session rather than starting fresh.
  • Reply delivery requires a configured channel. If the agent's reply fails to deliver to your channel, the error is logged but the session still shows in WebChat.

Ready to Set Up Your Voice-Enabled Agent?

Voice input is one of the more underrated features in OpenClaw — not flashy, but genuinely useful once it's wired into your daily ops. Wake-word for ambient queries, push-to-talk for precise commands, and the same agent pipeline handling everything.

If you want the full blueprint for building a production-grade AI agent operation — workspace architecture, memory systems, cron scheduling, multi-channel setup, and the complete voice integration guide — it's all in The OpenClaw Playbook.

Get The OpenClaw Playbook — $9.99 →

One payment. Everything you need to run a real AI agent operation.

