Quick links.
Code: https://github.com/murf-ai/murf-cookbook/tree/main/examples/openclaw/food_ordering_agent
Configuration Deep Dive: https://dev.to/sanchita_sunil/notes-from-the-openclaw-voice-tutorial-4ngg
Building a working voice agent usually means stitching state across speech, logic, and external APIs by hand. OpenClaw gives you a runtime that handles most of that for you. To see how far that gets you in practice, I wired OpenClaw up to a microphone, a Murf Falcon voice, and a Swiggy account. In about 900 lines of TypeScript, I had an agent that could search restaurants, take add-to-cart instructions, and place a real order end to end.
This post is an architecture walkthrough. I'll explain what OpenClaw is doing under the hood, where I had to fight its defaults, and how the same blueprint applies to any voice agent you want to build.
Why OpenClaw
There are several agent frameworks out there. Most of them treat an agent as a function: input goes in, tool calls happen, output comes out. OpenClaw is different: it treats the agent as a runtime, more like a long-running server than a single call. Sessions can be paused and resumed, state is keyed and persisted, and tool calls go through a typed MCP (Model Context Protocol) interface. Critically for what we are building, OpenClaw also exposes block-level streaming hooks that let you intercept the model's output as it arrives.
A voice agent is the hardest case any agent framework will face, because the user can hear every millisecond of latency. If your framework only hands you the full reply at the end, you cannot stream audio to the speakers. The user is left in silence while the model generates 300 characters, which can take 2 to 4 seconds, which feels like forever in conversation time. OpenClaw hands you each block, a sentence or two, the moment it arrives. You turn that block into audio and play it while the model keeps generating the next one.
Three things in particular made this build feel small once I understood them:
- Skills are markdown, not function definitions. The Swiggy integration is a SKILL.md file the model reads. No JSON schemas, no function-calling boilerplate. To swap Swiggy for GitHub or Notion later, I would install a different skill and change one config line.
- MCP is built in. OpenClaw treats MCP servers as first-class. The Swiggy MCP plugs in through mcporter. Adding a new tool surface means adding a new skill, not writing glue code.
- Streaming hooks are real. onBlockReply fires as the model writes. You drive synthesis from inside the callback.
Once those three things are in place, the rest of the build is mostly wiring the audio loop around them.
The pipeline
A microphone library captures audio and a streaming STT turns it into transcripts. Those transcripts go into OpenClaw, which decides what to do, calls skills, and streams text out. A streaming TTS turns each block into audio as it arrives, and a speaker library plays it back.
The audio loop is the same regardless of what the agent is doing. Plug a calendar skill into OpenClaw and you have a voice scheduling assistant. Plug in a GitHub skill and you have a voice PR reviewer. The loop does not change, only the skill and the system prompt do.
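To make the shape of that loop concrete, here is a minimal sketch of one turn. The interface and function names (TurnStages, runTurn, and so on) are my own placeholders, not the OpenClaw, Deepgram, Murf, or Decibri APIs. For clarity this version synthesises one block at a time; the brain.ts section overlaps synthesis across blocks.

```typescript
// Hypothetical stage interfaces. The names are illustrative, not the
// OpenClaw, Deepgram, Murf, or Decibri APIs.
type AudioChunk = Uint8Array;

interface TurnStages {
  // Agent runtime: streams the reply text block by block.
  streamReply(transcript: string, onBlock: (block: string) => void): Promise<void>;
  // Streaming TTS: one text block in, one audio chunk out.
  synthesize(block: string): Promise<AudioChunk>;
  // Speaker: plays one chunk.
  play(chunk: AudioChunk): void;
}

// Runs one turn and resolves with the number of blocks played.
async function runTurn(transcript: string, stages: TurnStages): Promise<number> {
  let playback: Promise<void> = Promise.resolve();
  let played = 0;
  await stages.streamReply(transcript, (block) => {
    // Queue each block behind the previous one so audio stays in order.
    playback = playback.then(async () => {
      stages.play(await stages.synthesize(block));
      played += 1;
    });
  });
  await playback; // wait for the last queued block
  return played;
}
```

Swapping the skill only changes what streamReply does. The synthesize and play stages stay exactly the same.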
The stack
| Layer | Tool | Why this one |
|---|---|---|
| Agent runtime | OpenClaw | Skill registry, MCP integration, block-level streaming hooks. The framework this post is built around. |
| Tool surface | Swiggy skill via ClawHub | Vendored MCP skill. Documents the API in markdown the model can read. |
| Microphone and speaker | Decibri | Native WASAPI on Windows, CoreAudio on Mac, ALSA on Linux. No browser layer. |
| Speech to text | Deepgram Flux | Streaming STT with end-of-turn detection inside the model. No separate VAD to wire up. |
| Text to speech | Murf Falcon | Low time-to-first-audio, and conversational voice styles that sound right in back-and-forth dialogue. |
| Language model | Gemini | Free tier, supports tool calling, fast on first token. Substitutable with any tool-calling LLM. |
What I deliberately left out
Before the build, here is what is not in this version:
- Wake word detection. The microphone is always on while the agent is not speaking. No "Hey Claw" trigger.
- Cross-session memory. Every restart starts fresh. The session key is per-process.
- Order cancellation. Swiggy's MCP does not expose it, so the skill routes the user to customer care.
- Production hardening. This is a single-user CLI. No auth, no rate limiting, no observability. Don't ship it as is.
One note on latency, to set expectations now. Streaming TTS plays each sentence as soon as it is ready, which makes the agent feel responsive on most turns. But tool calls still take as long as tool calls take. When the agent is hitting Swiggy's API for restaurant search, there is real waiting that streaming cannot hide. I cover this in detail in the latency section.
Requirements
Set these up before continuing.
Node and package manager
- Node 22.16 or newer. The repo is ESM-only and breaks on earlier versions.
- pnpm 9 or newer. The lockfile is pnpm. npm and yarn will resolve different versions.
Platform audio dependencies
Decibri uses the native audio stack on each operating system, so the install steps differ.
- Linux: apt install libasound2-dev on Debian-family, or alsa-lib-devel on Fedora-family.
- Windows: WASAPI is built in. You need a C++ build toolchain for the Decibri binary. Install "Desktop development with C++" through Visual Studio Installer.
- macOS: CoreAudio is built in. You need Xcode Command Line Tools: xcode-select --install.
External CLIs
npm install -g clawhub
clawhub is OpenClaw's skill registry.
API keys
- Deepgram for the Flux STT key. New accounts get $200 in starter credit, no card required.
- Murf for the Falcon TTS key. Created on the API tab of your Murf account, separate from a regular Murf Studio account.
- An LLM provider of your choice. Most have a free tier sufficient for development.
Swiggy account
A Swiggy account with at least one saved delivery address. The agent orders to saved addresses, not live GPS, because the MCP surface exposes addresses, not coordinates.
Step 1: Clone, install, env
git clone --filter=blob:none --sparse https://github.com/murf-ai/murf-cookbook.git
cd murf-cookbook
git sparse-checkout set examples/openclaw/food_ordering_agent
cd examples/openclaw/food_ordering_agent
pnpm install
cp .env.example .env
Open .env and add:
DEEPGRAM_API_KEY=...
MURF_API_KEY=...
GEMINI_API_KEY=...
If you would rather use OpenAI or Anthropic instead of Gemini, change one line in openclaw.json and the env variable name. Tool-calling support is the only requirement.
Step 2: Authenticate Swiggy
Swiggy's MCP needs OAuth. Run this once: the browser opens, you log in, and you approve.
node scripts/swiggy-auth.mjs
The script signs you in to Swiggy via PKCE OAuth and writes the token as a static Authorization header into ~/.mcporter/mcporter.json. You won't need to do this again unless the token expires.
If the browser doesn't open automatically, the script prints the full auth URL that you can copy and paste manually.
Confirm that the skill can actually reach Swiggy:
node skills/swiggy/swiggy-cli.js food addresses
This should print your saved addresses. If the list is empty, save one in the Swiggy app before moving on.
Note: In the video I use mcporter auth swiggy-food, but that no longer works. See the repo README for current auth steps.
Step 3: The four files you write
src/
ear.ts ~110 lines microphone capture and Deepgram WebSocket
brain.ts ~500 lines streaming TTS pipeline, calls OpenClaw
voice.ts ~140 lines speaker output, two channels
index.ts ~140 lines the event loop
The whole agent fits in 900 lines. Three of these files are pure adapter code: microphone in, speaker out. The interesting file is brain.ts, because that is where OpenClaw and Murf Falcon meet.
ear.ts: microphone in, transcript out
Decibri captures 16-bit PCM at 16 kHz in 100 ms chunks. Each chunk goes to a Deepgram Flux WebSocket on /v2/listen.
const params = new URLSearchParams();
params.append("model", "flux-general-en");
params.append("encoding", "linear16");
params.append("sample_rate", "16000");
for (const k of keyterms) params.append("keyterm", k);
const url = `wss://api.deepgram.com/v2/listen?${params.toString()}`;
Two things to know.
First, Flux has end-of-turn detection inside the transcription model. You don't need a separate Voice Activity Detector. You get one event called EndOfTurn and you respond to it.
if (data.type === "TurnInfo") {
  if (data.event === "EndOfTurn") {
    const transcript = data.transcript ?? "";
    if (transcript.trim().length > 0) {
      onTranscription(transcript.trim());
    }
  }
  return;
}
Second, there is a contextual keyterm trick that mattered a lot for Indian-English food vocabulary. After each agent reply, I extract the capitalised words ("Punjab Grill," "Paneer," "Meghana") and pass them as keyterms for the next turn. This is what fixes "Kadhai Paneer" being heard as "car die panel." Standard English ASR doesn't handle Indian food names well. Per-turn keyterm biasing gets it most of the way there.
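The extraction step can be sketched roughly like this. extractKeyterms and the starter-word list are hypothetical, written for illustration rather than copied from the repo, and the regex is a simple heuristic: it grabs runs of capitalised words and drops a few common sentence openers.

```typescript
// Hypothetical helper: pull capitalised phrases out of the agent's last
// reply so they can be sent as keyterms on the next turn.
const SENTENCE_STARTERS = new Set(["The", "Sure", "Here", "Your", "Would", "Okay", "This", "That"]);

function extractKeyterms(reply: string, limit: number = 20): string[] {
  // Runs of capitalised words, e.g. "Punjab Grill" or "Kadhai Paneer".
  const matches = reply.match(/[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*/g) ?? [];
  const unique: string[] = [];
  for (const phrase of matches) {
    if (SENTENCE_STARTERS.has(phrase)) continue; // plain sentence opener
    if (!unique.includes(phrase)) unique.push(phrase); // de-duplicate
  }
  return unique.slice(0, limit);
}
```

The result feeds the keyterm loop shown earlier, where each entry is appended to the query string with params.append("keyterm", k). A heuristic like this will still let some ordinary capitalised words through, which is harmless for biasing.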
I wired Deepgram in directly here, not through OpenClaw's STT plugin slot. OpenClaw's STT integration is built for telephony, not a local CLI microphone. 110 lines of WebSocket code was the right tool for this job.
brain.ts: where OpenClaw earns its keep
This is the file that uses every OpenClaw primitive worth using.
The basic flow:
- Call OpenClaw's chat() with the user's transcript.
- Subscribe to OpenClaw's onBlockReply hook.
- Hand each block to Murf Falcon for synthesis as it arrives.
- Stream audio back to voice.ts in order.
OpenClaw's defaults are tuned for chat, where each block can be a paragraph and the user is reading on a screen. For voice, three overrides matter.
Override 1: turn streaming on. OpenClaw has two switches that both have to allow streaming. The naming is confusing because one is called disable. So you want disableBlockStreaming: false, which means "do not disable," which means "do stream."
const llmCall = getReplyFromConfig(ctx, {
  disableBlockStreaming: false,
});
Override 2: fix the coalescer. OpenClaw has a coalescer that decides when to flush a buffered block to your code. Its default minChars is 800. A typical voice reply is 200 to 300 characters, so the coalescer waits for a block that never arrives, then dumps everything at end-of-reply. Streaming defeated.
blockStreamingCoalesce: {
  minChars: 1,
  maxChars: 200,
  idleMs: 0,
  flushOnEnqueue: true,
},
flushOnEnqueue: true is the line that makes the rest of this work. It tells OpenClaw to hand the block over the moment it arrives, instead of waiting for more.
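To see why a large minChars defeats streaming for short replies, here is a toy model of a minimum-length coalescer. This is not OpenClaw's implementation. ToyCoalescer and countFlushesBeforeEnd are names I made up, and the model ignores maxChars, idleMs, and flushOnEnqueue entirely. It only assumes the behaviour described above: buffer until minChars is reached, then flush whatever is left at end-of-reply.

```typescript
// Toy model of a min-length coalescer. NOT OpenClaw's implementation:
// it only illustrates why a large minChars delays short replies.
class ToyCoalescer {
  private buffer = "";
  private readonly minChars: number;
  private readonly onFlush: (text: string) => void;

  constructor(minChars: number, onFlush: (text: string) => void) {
    this.minChars = minChars;
    this.onFlush = onFlush;
  }

  enqueue(piece: string): void {
    this.buffer += piece;
    if (this.buffer.length >= this.minChars) this.flush();
  }

  // End of reply: whatever is still buffered goes out in one lump.
  end(): void {
    this.flush();
  }

  private flush(): void {
    if (this.buffer.length === 0) return;
    this.onFlush(this.buffer);
    this.buffer = "";
  }
}

// Counts how many blocks reach the callback before the reply has ended.
function countFlushesBeforeEnd(minChars: number, pieces: string[]): number {
  let flushes = 0;
  const coalescer = new ToyCoalescer(minChars, () => {
    flushes += 1;
  });
  for (const piece of pieces) coalescer.enqueue(piece);
  const beforeEnd = flushes;
  coalescer.end();
  return beforeEnd;
}
```

In this model, a three-sentence reply well under 800 characters produces zero flushes before the end with minChars set to 800, and one flush per sentence with minChars set to 1.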
Override 3: track deltas yourself. OpenClaw's onBlockReply callback gives you the full text so far, not just the new piece. You compute the delta yourself. Three cases: extension (new starts with old), duplicate (skip), and reset (fresh string after a tool call). The reset case is easy to miss and shows up after every tool call.
let delta: string;
if (currentBlockStream && text.startsWith(currentBlockStream)) {
  delta = text.slice(currentBlockStream.length);
  currentBlockStream = text;
} else if (currentBlockStream && currentBlockStream.includes(text)) {
  return;
} else {
  delta = text;
  currentBlockStream = text;
}
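For experimenting with the three cases in isolation, the same logic can be wrapped in a small class. This is a sketch: DeltaTracker and its next() method are illustrative names, not something from the repo or from OpenClaw.

```typescript
// Self-contained version of the three-case delta logic.
// The class and method names are illustrative.
class DeltaTracker {
  private current = "";

  // Returns the new text to speak, or null when there is nothing new.
  next(text: string): string | null {
    if (this.current && text.startsWith(this.current)) {
      // Extension: the old text came back with something appended.
      const delta = text.slice(this.current.length);
      this.current = text;
      return delta.length > 0 ? delta : null;
    }
    if (this.current && this.current.includes(text)) {
      // Duplicate: this text has already been handled.
      return null;
    }
    // Reset: a fresh string, typically after a tool call.
    this.current = text;
    return text;
  }
}
```

Feeding it "Sure." and then "Sure. Searching now." yields "Sure." followed by " Searching now.". A repeated "Sure." yields null, and an unrelated string after a tool call comes back whole.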
Once you have the delta, you hand it to Murf for synthesis (synthesizeSpeech() in the snippet below). Synthesis runs in parallel across blocks, but playback runs in order, serialised through a Promise chain so that chunk 2 always plays after chunk 1 even if chunk 2's network call finishes first.
const synthP = synthesizeSpeech(trimmed).catch(() => null);
emitChain = emitChain.then(async () => {
  const audio = await synthP;
  if (audio) onAudioChunk(audio);
});
That is roughly 30 lines of streaming logic. The rest of brain.ts is the agent setup, the OpenClaw config, and a fallback path for when the model batches output after tool calls.
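The ordering guarantee is easy to check in isolation. In the sketch below, fakeSynthesize is a stand-in for the real TTS call (it just resolves after a delay), and speakInOrder is a name I made up for the demo. The chain itself follows the same pattern as the snippet above.

```typescript
// Hypothetical stand-in for a TTS call: resolves after `delayMs`.
function fakeSynthesize(text: string, delayMs: number): Promise<string> {
  return new Promise((resolve) => setTimeout(() => resolve(`audio(${text})`), delayMs));
}

async function speakInOrder(blocks: Array<{ text: string; delayMs: number }>): Promise<string[]> {
  const playedOrder: string[] = [];
  let emitChain: Promise<void> = Promise.resolve();
  for (const block of blocks) {
    // Synthesis starts immediately for every block, so the calls overlap.
    const synthP = fakeSynthesize(block.text, block.delayMs).catch(() => null);
    // Playback is chained, so block N always waits for block N-1.
    emitChain = emitChain.then(async () => {
      const audio = await synthP;
      if (audio) playedOrder.push(audio);
    });
  }
  await emitChain;
  return playedOrder;
}
```

Even when the second block's synthesis finishes well before the first (say 5 ms against 30 ms), the played order stays first, then second. A failed synthesis resolves to null and is skipped without breaking the chain.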
voice.ts: two speakers, not one
Falcon's synthesis is fast — Murf reports 130 ms time-to-first-audio, and that matches what I see in practice. So when there is dead air on the agent's first turn, it is not the TTS that is causing it. It is the cold-start cost of OpenClaw initialising, the Swiggy MCP handshake, the LLM doing its first call against a fresh tool chain. All of that has to finish before the model has produced its first block of text for Falcon to synthesise.
That is the gap pre-recorded filler audio is for. Short clips like "One moment please" or "let me check that for you" play 100 ms after the user stops talking, which is fast enough that the user does not perceive a delay.
The catch: the filler is a variable-length clip, and the first real audio chunk can arrive before the filler finishes. If both play through one audio output, one cuts off the other. The fix is two separate Decibri outputs.
let oneShotSpeaker: InstanceType<typeof DecibriOutput> | null = null;
let streamSpeaker: InstanceType<typeof DecibriOutput> | null = null;
oneShotSpeaker plays fillers. streamSpeaker plays the real reply. When the first reply chunk arrives, I stop the filler channel without touching the reply channel. Anything queued on the reply channel keeps playing.
This sounds like overkill until you hear the alternative. With one channel, the filler clips the agent saying "Sure" and the user only hears "...I'll add that."
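The hand-off between the two channels can be sketched as follows. SpeakerLike is a hypothetical minimal interface with write() and halt() methods. Decibri's real output API may look different, so treat this as the shape of the idea rather than the repo's code.

```typescript
// Minimal output interface for the sketch. Hypothetical: Decibri's real
// output API may differ.
interface SpeakerLike {
  write(chunk: Uint8Array): void;
  halt(): void; // drop anything still queued on this output
}

class TwoChannelVoice {
  private readonly fillerOut: SpeakerLike;
  private readonly replyOut: SpeakerLike;
  private fillerActive = false;

  constructor(fillerOut: SpeakerLike, replyOut: SpeakerLike) {
    this.fillerOut = fillerOut;
    this.replyOut = replyOut;
  }

  playFiller(clip: Uint8Array): void {
    this.fillerActive = true;
    this.fillerOut.write(clip);
  }

  playReplyChunk(chunk: Uint8Array): void {
    if (this.fillerActive) {
      // First real audio has arrived: silence the filler channel only.
      this.fillerOut.halt();
      this.fillerActive = false;
    }
    // The reply channel is never interrupted by filler handling.
    this.replyOut.write(chunk);
  }
}
```

The point of the design is that halt() is only ever called on the filler output, so nothing queued for the real reply can be cut off.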
index.ts: the loop
async function startSession() {
  renderBanner();
  setImmediate(() => warmup()); // amortise OpenClaw cold start
  await playIntro();
  await openMicrophone();
}
ear.on("transcript", async (text) => {
  closeMicrophone();
  await playFiller(); // mask LLM latency
  await chat(text); // streams audio as it arrives
  await openMicrophone();
});
That is the whole loop. Render the banner, kick off OpenClaw warmup in the background, play the intro, open the microphone. On each transcript: stop the microphone, play a filler, run the agent, reopen the microphone.
The setImmediate(() => warmup()) line runs OpenClaw's initialisation and the Swiggy MCP handshake while the user is hearing the intro. By the time the user finishes their first sentence, both are warm. That shaves several seconds off turn 1.
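The overlap can be demonstrated with a small sketch. StartupSteps and startSessionSketch are hypothetical names. Unlike the snippet above, this version calls warmup() directly instead of going through setImmediate, and it keeps the promise around so the first turn could await it if needed.

```typescript
// Hypothetical startup steps, passed in so the sketch can run with fakes.
interface StartupSteps {
  warmup(): Promise<void>; // e.g. runtime init plus the MCP handshake
  playIntro(): Promise<void>; // resolves when the intro clip ends
  openMicrophone(): Promise<void>;
}

async function startSessionSketch(steps: StartupSteps): Promise<{ warmReady: Promise<void> }> {
  // Start warmup but do not await it, so it runs while the intro plays.
  // Errors are swallowed here; a real build would log them.
  const warmReady = steps.warmup().catch(() => undefined);
  await steps.playIntro();
  await steps.openMicrophone();
  // Wrapped in an object so awaiting this function does not await warmup.
  return { warmReady };
}
```

With a warmup that takes longer than the intro, the microphone opens as soon as the intro ends and warmup finishes in the background. If warmup does heavy synchronous work before its first await, deferring it with setImmediate, as the original does, keeps that work from blocking the start of the intro.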
How the skill actually works
This is the part that surprised me most when I first used OpenClaw.
The agent learns to use Swiggy by reading a markdown file. Not a JSON schema, not function definitions. A human-readable file called SKILL.md that documents the commands, the sequencing rules, and the things to never do. The model reads this, figures out what to call, and emits shell commands that run against a CLI wrapper.
The wrapper is small. node skills/swiggy/swiggy-cli.js food <command> is the shape of every call. The skill knows commands like search-restaurants, get-menu, add-to-cart, checkout. The model sequences them on its own, based on the markdown documentation.
Here is a snippet from SKILL.md (paraphrased):
search-restaurants: Find restaurants matching a cuisine or dish. Use this first whenever the user mentions a food. Example: search-restaurants --query "biryani". Always call get-addresses first if you have not yet, because results depend on delivery location.
The model reads it the same way a new developer would read documentation on day one.
The one tweak I made: every swiggy food <cmd> call in SKILL.md became node skills/swiggy/swiggy-cli.js food <cmd>. OpenClaw's shell executor doesn't have npm globals on PATH, so the swiggy binary from npm link is not reachable.
The implication for builders: writing a new skill is writing a markdown file and a thin CLI. There is no SDK to learn, no function-calling glue to debug. If you can document an API in English with examples, you can give an OpenClaw agent the ability to call it.
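To give a sense of how thin the CLI side can be, here is a hypothetical argument parser for calls of the shape food <command> --flag value. parseSkillArgs and ParsedCall are my own names. The actual swiggy-cli.js may parse its arguments quite differently.

```typescript
// Hypothetical parser for a thin skill CLI of the shape
//   <domain> <command> [--flag value ...]
// This is not the repo's swiggy-cli.js, only an illustration.
interface ParsedCall {
  domain: string;
  command: string;
  flags: Record<string, string>;
}

function parseSkillArgs(argv: string[]): ParsedCall {
  const [domain, command, ...rest] = argv;
  if (!domain || !command) {
    throw new Error("usage: <domain> <command> [--flag value]");
  }
  const flags: Record<string, string> = {};
  // Flags come in pairs: a "--name" token followed by its value.
  for (let i = 0; i < rest.length; i += 2) {
    const key = rest[i];
    const value = rest[i + 1];
    if (!key.startsWith("--") || value === undefined) {
      throw new Error(`bad flag near "${key}"`);
    }
    flags[key.slice(2)] = value;
  }
  return { domain, command, flags };
}
```

In a real script this would be called with process.argv.slice(2), then dispatched to whatever function talks to the MCP server for that command.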
Latency
The first turn is the slowest. Before any audio plays, OpenClaw has to initialise, complete the Swiggy MCP handshake, and make its first LLM call against a fresh tool chain. On a typical machine that takes anywhere from 15 to 50 seconds, depending on your network and your LLM provider. Streaming TTS does not save you here — the model has not produced anything to synthesise yet.
What does help is the combination of filler audio (which plays 100 ms after the user stops talking) and the background warmup that runs during the intro. Together they keep the perceived gap small even when the actual cold start is not.
Turn 2 onwards is a different story. With the runtime warm and the MCP connection open, first audio arrives 5 to 10 seconds after the user stops talking, and most of that is the LLM's time to its first sentence. Falcon's 130 ms TTFA is what makes "first sentence" actually translate to "first audio you hear."
If you genuinely need to push first-audio latency below this on tool-heavy turns, the only real lever is to take OpenClaw out of the loop on those turns: wiring the tool calls in directly and parallelising what OpenClaw would have serialised. I haven't done that in this build.
Swap the skill
The voice loop in this post does not care what the agent does. The skill lives in two places:
- agents.defaults.skills in openclaw.json. Replace swiggy with another MCP skill. Google Calendar. GitHub. Notion. Linear. Pick one from ClawHub or write your own.
- workspace/IDENTITY.md. The system prompt that describes who the agent is and how it should talk. Rewrite it for the new domain.
That portability is the case I wanted to make for OpenClaw with this post. The framework is doing real work behind the scenes, hiding the runtime, the MCP integration, the streaming, and the skill format behind primitives that are small enough to use without ceremony.
What I learned
The skill format is the part I underestimated going in. The model was reading it the way a developer would read API docs on day one. There is no JSON schema to maintain, no function-calling boilerplate to update when the API changes. If your API is documentable in markdown, an OpenClaw agent can use it.
Voice agents are mostly a latency engineering problem. The transcription, the agent, the TTS are mostly solved. The work that made this build feel real was around the seams — two-channel playback, background warmup, per-turn keyterm bias, pre-baked fillers. You have to find these by listening to your own demo and noticing what sounds wrong.
The combination of streaming hooks and per-block synthesis is what made the conversational rhythm work. Falcon at 130 ms TTFA is fast on its own, and OpenClaw handing off blocks the moment they arrive is fast on its own. Together, if the LLM produces a block roughly every 200 ms and the TTS adds 130 ms on top, each block is ready to play about 330 ms after the model starts writing it. Because synthesis overlaps with generation, the next block is usually ready before the current one finishes playing, and that is what makes the agent feel like it is actually thinking out loud rather than waiting to deliver a finished answer.
If this was useful, the code is at github.com/murf-ai/murf-cookbook. A star helps the project reach more builders. Clone it, swap the skill, and build something else tonight. The configuration deep dive, with the parameter tables and error mappings, is at dev.to/sanchita_sunil/notes-from-the-openclaw-voice-tutorial-4ngg.
I would love to hear what you build with it.
