Sanchita Sunil

Posted on Jun 3 • Edited on Jun 4

Notes from the OpenClaw Voice Tutorial

#ai #openclaw #voice #typescript

This is a companion to the food-ordering agent tutorial video (https://www.youtube.com/watch?v=ypqzB093VLc). The video walks you through cloning the repo and placing a real Swiggy order with your voice. This post fills in the parts the video pointed at but did not have time to cover:

Every Deepgram Flux parameter, what it does, and how the event model behaves
Why OpenClaw's block streaming defaults are wrong for voice, and which ones to flip
Falcon voice and locale compatibility, and how to swap voices without breaking things
Streaming-pipeline bugs that show up after setup, with their root causes

Repo: https://github.com/murf-ai/murf-cookbook/tree/main/examples/openclaw/food_ordering_agent
Video: https://www.youtube.com/watch?v=ypqzB093VLc

OpenClaw treats an agent as a runtime, not a prompt. A runtime is a program that runs continuously and remembers state between calls, like a server. A prompt, in contrast, is a single block of text sent to the model. The difference matters because OpenClaw can pause, resume, and track sessions across many turns.

That model works well for chat. Voice is where it starts to break down.

A microphone does not produce text. It produces audio frames (small chunks of raw sound data). A speaker cannot wait for the full reply before playing anything. The user will hear silence and assume the agent is broken. The same tool-call delay that is invisible in a chat UI becomes obvious dead air the moment the user can hear it.

Every piece of OpenClaw still works for voice. You just have to point each piece at the voice use case on purpose, instead of relying on the chat-friendly defaults. The next three sections walk through which defaults to change and why.

Requirements

If you do not have these, set them up before continuing.

Node and package manager

Node 22.16 or newer. The repo is ESM-only and breaks on earlier versions.
pnpm 9 or newer. The lockfile is pnpm. npm and yarn will resolve different versions.

Platform audio dependencies

Decibri uses the native audio stack on each operating system, so the install steps differ.

Linux: apt install libasound2-dev on Debian-family distros, or alsa-lib-devel on Fedora-family. Required at install time.
Windows: WASAPI is built in. You need a C++ build toolchain for the Decibri binary. Install "Desktop development with C++" through Visual Studio Installer.
macOS: CoreAudio is built in. You need Xcode Command Line Tools: xcode-select --install.

External CLIs

clawhub. OpenClaw's skill registry. The Swiggy skill in this repo is vendored, so you do not strictly need clawhub to run the agent, but you will need it if you want to fetch other skills later.

API keys

Deepgram for the Flux STT key. New accounts get $200 in starter credit, no card required.
Murf for the Falcon TTS key. This is created on the API tab of your Murf account, separate from a regular Murf Studio account.
An LLM provider of your choice. Most have a free tier sufficient for development.

Swiggy

A Swiggy account with at least one saved delivery address. The agent orders to saved addresses, not live GPS, because the MCP surface exposes addresses, not coordinates.

Update: The Swiggy auth flow has changed since the video was recorded. mcporter auth swiggy-food no longer works. Swiggy MCP now requires an approved client_id and uses a custom PKCE script instead. Run node scripts/swiggy-auth.mjs. See the repo README for current steps.

Deepgram Flux

Flux is the STT we use in this build. There are several streaming STTs that work for voice agents; Flux is the one wired up here, and the parts below are the configuration you need to get right regardless of which API you go with.

One concept worth covering before the parameters: turn-taking. This is the decision of when the user has stopped talking and the agent should respond. Many streaming STT APIs hand back partial transcripts and leave turn-taking to your code, which usually means adding a separate Voice Activity Detector (VAD) that listens for silence. Flux does turn-taking inside the transcription model and emits structured events for it, so for this build we do not need a separate VAD.

Endpoint

An endpoint is the URL path you connect to on a server. Flux only works on /v2/listen. The older /v1/listen endpoint will silently reject the model parameter. You will spend an hour wondering why nothing transcribes.

const params = new URLSearchParams();
params.append("model", "flux-general-en");
params.append("encoding", "linear16");
params.append("sample_rate", "16000");
for (const k of keyterms) params.append("keyterm", k);
const url = `wss://api.deepgram.com/v2/listen?${params.toString()}`;

Use URLSearchParams to build the URL. It encodes spaces in multi-word keyterms correctly (as +). If you build the query string by hand and use %20 instead, Deepgram will close the connection without telling you why. This is the most common setup bug.
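To see the encoding concretely, here is a minimal sketch that wraps the snippet above in a function and prints the result. The buildFluxUrl name and the sample keyterms are illustrative and not taken from the repo.

```typescript
// Hypothetical helper: builds the Flux WebSocket URL from a keyterm list.
function buildFluxUrl(keyterms: string[]): string {
  const params = new URLSearchParams();
  params.append("model", "flux-general-en");
  params.append("encoding", "linear16");
  params.append("sample_rate", "16000");
  // append (not set), so each keyterm becomes its own repeated parameter.
  for (const k of keyterms) params.append("keyterm", k);
  return `wss://api.deepgram.com/v2/listen?${params.toString()}`;
}

const fluxUrl = buildFluxUrl(["Paneer Butter Masala", "Punjab Grill"]);
// URLSearchParams serialises the space in each keyterm as "+", not "%20".
console.log(fluxUrl);
```

Printing the URL once at startup is a cheap way to confirm what is actually being sent before you start debugging the socket.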

Parameters

The audio format below uses the term PCM, which means pulse-code modulation. It is the standard way to represent raw audio as numbers. linear16 means each sample is a 16-bit number stored in little-endian byte order. Most audio libraries use this format by default.

Parameter	Value used	What it does
`model`	`flux-general-en`	Flux English. Use `flux-general-multi` for multilingual.
`encoding`	`linear16`	16-bit PCM audio. Must match what your microphone library outputs.
`sample_rate`	`16000`	16 kHz audio. Decibri captures at this rate by default.
`keyterm`	repeated	Vocabulary biasing. Up to 100 keyterms per connection.
`eager_eot_threshold`	not set	Enables EagerEndOfTurn events at this confidence. Off in this repo.

You can also pass eot_threshold to tune end-of-turn sensitivity. The default works well for short food-ordering sentences. If your agent handles longer thinking-out-loud utterances, raise it.
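If your audio source hands you floating-point samples instead of 16-bit integers, the linear16 format described above can be produced with a small conversion step. This is a sketch under that assumption; the floatToLinear16 name is hypothetical, and you do not need it if your microphone library already outputs 16-bit PCM.

```typescript
// Converts float samples in [-1, 1] to 16-bit signed little-endian PCM bytes.
function floatToLinear16(samples: Float32Array): Uint8Array {
  const out = new Uint8Array(samples.length * 2);
  const view = new DataView(out.buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, samples[i] ?? 0));
    // Negative values scale toward -32768, positive values toward 32767.
    const int16 = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
    view.setInt16(i * 2, int16, true); // true = little-endian
  }
  return out;
}

const pcmBytes = floatToLinear16(new Float32Array([0, 1, -1]));
console.log(Array.from(pcmBytes));
```

Passing true to setInt16 pins the byte order explicitly, so the output does not depend on the host machine's endianness.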

The Flux events we use

Flux sends five event types on its TurnInfo stream. The repo only consumes one of them, but the others are worth knowing because you will probably want some of them later.

Update. Partial transcript, updated as the user keeps talking. Useful if you want a live transcript display. Not used here.
StartOfTurn. The user just started speaking. This is where you would handle barge-in (cutting off the agent if it is still talking, so the user can interrupt). Not connected here.
EndOfTurn. High confidence the user is done. This is the only event the repo uses. When it fires, the transcript goes to the LLM and the agent starts generating a reply.
EagerEndOfTurn. Medium confidence the user is done. Off by default. If you turn it on (with eager_eot_threshold), the agent can start drafting a reply early. Saves some delay at the cost of more LLM calls because some drafts get thrown away.
TurnResumed. Only fires after an EagerEndOfTurn. Means the user was not actually done, and any draft you started should be discarded.

if (data.type === "TurnInfo") {
  if (data.event === "EndOfTurn") {
    const transcript: string = data.transcript ?? "";
    if (transcript.trim().length > 0) {
      onTranscription(transcript.trim());
    }
  }
  return;
}
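If you later turn on eager end-of-turn, the handler grows from one branch to a small state machine. The sketch below shows one way to route the five events listed above. The TurnHandlers callbacks are hypothetical, and the message shape beyond type, event, and transcript is an assumption, so treat this as a starting point and not as the repo's code.

```typescript
type FluxEvent = "Update" | "StartOfTurn" | "EagerEndOfTurn" | "TurnResumed" | "EndOfTurn";

interface TurnInfoMessage {
  type: "TurnInfo";
  event: FluxEvent;
  transcript?: string;
}

// Hypothetical callbacks supplied by the caller.
interface TurnHandlers {
  bargeIn: () => void;                      // stop agent playback
  startDraft: (transcript: string) => void; // speculative LLM call
  cancelDraft: () => void;                  // discard the speculative call
  commit: (transcript: string) => void;     // final transcript for the turn
}

function handleTurnInfo(msg: TurnInfoMessage, h: TurnHandlers): void {
  const transcript = (msg.transcript ?? "").trim();
  switch (msg.event) {
    case "StartOfTurn":
      h.bargeIn();
      break;
    case "EagerEndOfTurn":
      if (transcript) h.startDraft(transcript);
      break;
    case "TurnResumed":
      // The user kept talking, so any draft is stale.
      h.cancelDraft();
      break;
    case "EndOfTurn":
      if (transcript) h.commit(transcript);
      break;
    case "Update":
      break; // partial transcript, ignored in this sketch
  }
}
```

Keeping the draft and commit paths separate makes the cost trade-off visible: every startDraft that is followed by cancelDraft is an LLM call you paid for and threw away.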

Keyterm biasing for Indian-English food vocabulary

Deepgram lets you pass up to 100 keyterms per connection. Keyterms tell the model "if you hear something close to one of these words, lean toward this spelling." Most apps set keyterms once at connect time using a fixed vocabulary.

Flux's Configure control message lets you update keyterms on every turn. The repo uses this to bias the next turn toward whatever proper nouns the agent just said.

function extractContextualKeyterms(text: string): string[] {
  const tokens = text
    .replace(/[.,!?;:()"']/g, " ")
    .split(/\s+/)
    .filter((w) => w.length >= 3 && /^[A-Z]/.test(w) && !KEYTERM_STOPWORDS.has(w));
  return [...new Set(tokens)];
}

The idea is simple. If the agent just said "Paneer Butter Masala from Punjab Grill," the user's reply is much more likely to contain those words than some random restaurant name. So we extract the capitalised words from the agent's last reply and use them as bias for the next turn.

For Indian-English food vocabulary, where standard English speech recognition struggles the most, this one feature is the difference between the agent hearing "Kadhai Paneer" and hearing "car die panel."
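Here is the extraction run end to end on a sample reply. The stopword set, the mergeKeyterms helper, and the sample sentence are assumptions for illustration, because the repo's KEYTERM_STOPWORDS is not shown in this post. The extraction logic mirrors the function above.

```typescript
// Hypothetical stopword set: capitalised words that are not worth biasing on.
const KEYTERM_STOPWORDS = new Set(["The", "Should", "Would", "Your", "Sure", "And"]);
const MAX_KEYTERMS = 100; // per-connection limit quoted above

function extractContextualKeyterms(text: string): string[] {
  const tokens = text
    .replace(/[.,!?;:()"']/g, " ")
    .split(/\s+/)
    .filter((w) => w.length >= 3 && /^[A-Z]/.test(w) && !KEYTERM_STOPWORDS.has(w));
  return Array.from(new Set(tokens)); // de-duplicate, keep first-seen order
}

// Hypothetical merge: contextual terms go first so they survive the cap.
function mergeKeyterms(contextual: string[], base: string[]): string[] {
  return Array.from(new Set([...contextual, ...base])).slice(0, MAX_KEYTERMS);
}

const agentReply = "I found Paneer Butter Masala from Punjab Grill. Should I add it?";
const contextualTerms = extractContextualKeyterms(agentReply);
const nextTurnTerms = mergeKeyterms(contextualTerms, ["Biryani", "Paneer"]);
console.log(contextualTerms, nextTurnTerms);
```

Note that the capital-letter test also catches the first word of every sentence, which is why a stopword set is needed at all.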

Cost

Deepgram bills Flux per second of streaming audio. As of early 2026, the pay-as-you-go rate sits in the range of $0.0077 to $0.015 per minute, depending on the plan and region. Check Deepgram's pricing page for current numbers. New accounts get $200 in starter credit.

A rough cost estimate for the food-ordering agent:

Average turn: 3 seconds of user speech, microphone open during user speech only
Per-turn STT cost: 3 seconds at the higher end of the range, roughly $0.00075
Ten-turn ordering session: under one cent for STT

You will run out of patience for testing long before you run out of $200 of credit.
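The arithmetic behind that estimate, as a small sketch. The rates are the range quoted above and may be out of date, and the sttCost helper is illustrative.

```typescript
// Per-minute rates quoted in this post; check Deepgram's pricing page for current numbers.
const RATE_LOW_PER_MIN = 0.0077;
const RATE_HIGH_PER_MIN = 0.015;

// Cost in dollars for a given number of seconds of streamed audio.
function sttCost(seconds: number, ratePerMinute: number): number {
  return (seconds / 60) * ratePerMinute;
}

const perTurnHigh = sttCost(3, RATE_HIGH_PER_MIN);      // one 3-second turn
const sessionHigh = sttCost(3 * 10, RATE_HIGH_PER_MIN); // ten turns
const sessionLow = sttCost(3 * 10, RATE_LOW_PER_MIN);
// How many ten-turn sessions the starter credit covers at the higher rate.
const sessionsPerCredit = Math.floor(200 / sessionHigh);

console.log({ perTurnHigh, sessionHigh, sessionLow, sessionsPerCredit });
```

Even at the higher rate, the starter credit covers tens of thousands of ten-turn sessions, as long as the microphone is only streaming while the user speaks.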

Block streaming

OpenClaw was built for chat first. Its block streaming was tuned for long replies on a screen. In that setup, each block (a unit of text the model sends back) might be a whole paragraph. For voice, each block should be a sentence or two. Every millisecond between "LLM produced text" and "speaker plays sound" is silence the user can hear.

The defaults are wrong for voice. Until you change them, OpenClaw quietly holds onto your blocks instead of sending them to your code right away.

First, turn streaming on

OpenClaw has two settings that control block streaming:

blockStreamingDefault in the config (the channel-wide default)
disableBlockStreaming at the call site (the override for one call)

Both have to allow streaming, or it will not happen.

const llmCall = getReplyFromConfig(ctx, {
  disableBlockStreaming: false,
});

The naming is confusing. The option is called disable, so false means "do not disable." Which means "do stream." So you want disableBlockStreaming: false. Read it twice if needed.

Fix the coalescer

The coalescer is the component that decides when to send a buffered block to your code. To buffer means to hold onto something until enough has built up. To send the buffered content onward is to flush it.

The coalescer's default minChars setting is 800. A typical voice reply is 200 to 300 characters. So with the default, the coalescer waits for an 800-character block that will never arrive. It gives up at the end of the reply and dumps everything at once. Streaming defeated.
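A toy model makes the failure mode easier to see. The function below is not OpenClaw's coalescer; it is a simplified stand-in that only models a minChars threshold plus an end-of-reply flush, and it ignores idle timers and every other setting.

```typescript
// Toy model: buffer incoming blocks, flush once the buffer reaches minChars,
// and flush whatever is left when the reply ends.
function simulateMinChars(blocks: string[], minChars: number): string[] {
  const flushed: string[] = [];
  let buffer = "";
  for (const block of blocks) {
    buffer += block;
    if (buffer.length >= minChars) {
      flushed.push(buffer);
      buffer = "";
    }
  }
  if (buffer) flushed.push(buffer); // end-of-reply flush
  return flushed;
}

const voiceReply = [
  "Sure, adding one Paneer Butter Masala. ",
  "Anything else? ",
  "Your total is 420 rupees.",
];
const chatDefault = simulateMinChars(voiceReply, 800); // one flush, at the very end
const voiceTuned = simulateMinChars(voiceReply, 1);    // one flush per sentence
console.log(chatDefault.length, voiceTuned.length);
```

With an 800-character threshold the three sentences arrive as a single flush after the reply is complete, which is exactly the behaviour that defeats streaming for short voice replies.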

Override it like this (brain.ts lines 96 to 109):

blockStreamingChunk: {
  minChars: 1,
  maxChars: 200,
  breakPreference: "sentence",
},
blockStreamingCoalesce: {
  minChars: 1,
  maxChars: 200,
  idleMs: 0,
  flushOnEnqueue: true,
},

The line that matters most is flushOnEnqueue: true. It tells the coalescer to send the block to your code the moment it arrives, without waiting. Every other override is necessary, but useless without this one.

Track deltas yourself

A callback is a function that OpenClaw calls when something happens, like a new block arriving. OpenClaw's onBlockReply callback is given the full text so far, not just the new piece. So you have to figure out what is new yourself. The new piece is called the delta.

Here is how the repo computes it (brain.ts lines 486 to 501):

let delta: string;
if (currentBlockStream && text.startsWith(currentBlockStream)) {
  delta = text.slice(currentBlockStream.length);
  currentBlockStream = text;
} else if (currentBlockStream && currentBlockStream.includes(text)) {
  return; // already covered
} else {
  delta = text;
  currentBlockStream = text;
}

There are three cases here, and the third is the one that matters most:

Extension. The new text starts with the old text. The delta is just the part at the end. Easy.
Duplicate. The same block got reported twice. Skip it.
Reset. The new text has nothing to do with the old text. This happens after a tool call finishes. OpenClaw starts a fresh block stream, and the new text is a brand-new string. Without this branch, you would either lose the new block or join it incorrectly to the old one.
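The three cases are easier to test when the logic is wrapped in a closure. This is a sketch of the same branching as the snippet above; the createDeltaTracker name and the sample strings are hypothetical.

```typescript
// Returns a function that maps each cumulative text to its new delta,
// or null when the text is already covered by what was seen before.
function createDeltaTracker(): (text: string) => string | null {
  let currentBlockStream = "";
  return (text: string): string | null => {
    if (currentBlockStream && text.startsWith(currentBlockStream)) {
      // Extension: only the tail is new. An exact repeat yields "",
      // so callers should skip empty deltas.
      const delta = text.slice(currentBlockStream.length);
      currentBlockStream = text;
      return delta;
    }
    if (currentBlockStream && currentBlockStream.includes(text)) {
      return null; // Duplicate: already covered.
    }
    // Reset: unrelated text, for example after a tool call.
    currentBlockStream = text;
    return text;
  };
}

const nextDelta = createDeltaTracker();
const deltas = [
  nextDelta("Sure."),              // first block
  nextDelta("Sure. Adding naan."), // extension
  nextDelta("Sure."),              // duplicate of an earlier prefix
  nextDelta("Order placed."),      // reset
];
console.log(deltas);
```

Creating one tracker per reply keeps state from leaking between turns.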

The empty payload.text quirk

When block streaming is actually working, payload.text in the final reply is an empty string. This is not a bug.

OpenClaw has a check called shouldDropFinalPayloads that removes the text from the final payload once it has already been streamed. This avoids sending the same text twice. The repo handles this by collecting text in its own buffer (canonicalText) as chunks arrive. It only falls back to payload.text if the buffer is empty:

if (!canonicalText && payloadText) canonicalText = payloadText;

Murf Falcon

Synthesis is the technical word for generating audio from text. Murf Falcon is the TTS model used in this build. Murf reports a model latency of 55 ms and a time-to-first-audio of 130 ms, at $0.01 per 1,000 characters, which works out to roughly 1 cent per minute of generated audio.

Turn off OpenClaw's built-in TTS

OpenClaw ships with its own TTS pipeline. By default it runs in auto: "on" mode, which produces one final audio file at the end of a reply. That mode is incompatible with per-block streaming, so we turn it off (openclaw.json lines 30 to 47):

"tts": {
  "provider": "murf",
  "auto": "off",
  "mode": "final",
  "providers": {
    "murf": {
      "voiceId": "en-IN-anusha",
      "model": "FALCON",
      "locale": "en-IN",
      "style": "Conversational"
    }
  }
}

With auto: "off", the Murf provider stays loaded and configured. But your code is now in charge of synthesis. You call murfProvider.synthesize() directly on each block.

Voice and locale compatibility

A locale is a code that identifies a language and region together, like en-IN for English in India or es-MX for Spanish in Mexico.

Falcon supports voices across many languages, but each voice is bound to its locale. If you set voiceId to an English voice and locale to hi-IN, the API rejects the request. Changing just one of the two when swapping voices is the usual way to end up with a mismatched pair, so change both together.

Voice ID prefix	Locale	Notes
`en-IN-*`	`en-IN`	Indian English. Used in this repo.
`en-US-*`	`en-US`	American English.
`en-UK-*`	`en-UK`	British English.
`hi-IN-*`	`hi-IN`	Hindi.
`es-ES-*`	`es-ES`	Spanish (Spain).
`es-MX-*`	`es-MX`	Spanish (Mexico). Different voices than Spain.

The full list is in Murf's API docs. Before you change voiceId in openclaw.json, query /v1/speech/voices?model=FALCON and pick a voice and its matching locale together.
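A cheap guard against a mismatched pair is to check the voice ID prefix against the locale before you ever call the API. The sketch below assumes voice IDs follow the prefix pattern shown in the table; both function names are hypothetical, and the check may be stricter than what the API itself allows.

```typescript
// Extracts the locale prefix from a voice ID such as "en-IN-anusha".
// Assumes the `<lang>-<REGION>-<name>` pattern from the table above.
function localeFromVoiceId(voiceId: string): string | null {
  const match = /^([a-z]{2}-[A-Z]{2})-.+$/.exec(voiceId);
  return match?.[1] ?? null;
}

// True when the voice ID prefix and the configured locale agree.
function voiceMatchesLocale(voiceId: string, locale: string): boolean {
  return localeFromVoiceId(voiceId) === locale;
}

console.log(voiceMatchesLocale("en-IN-anusha", "en-IN")); // matching pair
console.log(voiceMatchesLocale("en-IN-anusha", "hi-IN")); // mismatched pair
```

Running a check like this when the config loads turns a confusing runtime failure into an immediate, readable error.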

Pick the right voice style

Falcon exposes a style parameter. Pick Conversational for agent work. A voice that sounds great reading an audiobook usually sounds wrong in a back-and-forth conversation. Promotional and Narration styles sound theatrical when the agent is saying short things like "Sure, anything else?"

Two speaker outputs

The pre-recorded filler audio masks the cold-start delay by playing while the LLM is still thinking. The problem is that the filler clip has a variable length, and the first real audio chunk can arrive before the filler finishes.

If you play both through the same audio output, one of two bad things happens:

The filler cuts off the start of the real reply, or
The reply cuts off the end of the filler.

The fix is two separate audio outputs (voice.ts lines 10 to 11):

let oneShotSpeaker: InstanceType<typeof DecibriOutput> | null = null;
let streamSpeaker: InstanceType<typeof DecibriOutput> | null = null;

oneShotSpeaker plays fillers. streamSpeaker plays the actual reply. When the first reply chunk arrives, stopOneShotPlayback() stops the filler channel without touching the reply channel. Anything already queued on the reply channel keeps playing.
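The hand-off between the two channels can be sketched without any audio hardware. The Speaker interface and the TwoChannelPlayback class below are hypothetical stand-ins, and the real DecibriOutput API may look different.

```typescript
// Hypothetical minimal speaker interface.
interface Speaker {
  write(chunk: Uint8Array): void;
  stop(): void;
}

class TwoChannelPlayback {
  private readonly oneShot: Speaker;
  private readonly stream: Speaker;
  private fillerActive = false;

  constructor(oneShot: Speaker, stream: Speaker) {
    this.oneShot = oneShot;
    this.stream = stream;
  }

  playFiller(clip: Uint8Array): void {
    this.fillerActive = true;
    this.oneShot.write(clip);
  }

  // Called for every chunk of the real reply.
  playReplyChunk(chunk: Uint8Array): void {
    if (this.fillerActive) {
      // Stop only the filler channel, and only once, on the first reply chunk.
      this.oneShot.stop();
      this.fillerActive = false;
    }
    this.stream.write(chunk);
  }
}
```

Because the reply channel is never stopped in this path, audio already queued on it keeps playing while the filler is cut.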

Synthesise in parallel, play back in order

There are two layers of parallelism worth understanding.

Within a single block. Murf splits long input into chunks of up to 1500 characters and synthesises them at the same time on its own infrastructure. You do not have to do anything for this.

Across blocks. The repo starts synthesis calls the moment each block arrives. So multiple blocks can be synthesising at the same time. But the audio plays back in order through a Promise chain:

const dispatchChunk = (text: string) => {
  const trimmed = text.trim();
  if (!trimmed) return;
  if (!streamingEnabled) return;
  const synthP = synthesizeSpeech(trimmed).catch(() => null);
  emitChain = emitChain.then(async () => {
    const audio = await synthP;
    if (audio) {
      streamedAnyAudio = true;
      onAudioChunk!(audio);
    }
  });
};

synthesizeSpeech() starts the Murf network call right away. emitChain.then() waits until the previous chunk has been handed to the speaker before awaiting and playing the current one. So if chunk 2 finishes synthesising before chunk 1, because chunk 1's network call is slower, chunk 2 still plays second. Never first.
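You can watch the ordering guarantee hold with a fake synthesiser and no network at all. In this sketch fakeSynthesize and the delays are invented for the demonstration; only the chaining pattern matches the snippet above.

```typescript
// Fake synthesiser: resolves after `ms` milliseconds. Stands in for the Murf call.
function fakeSynthesize(text: string, ms: number): Promise<string> {
  return new Promise((resolve) => setTimeout(() => resolve(`audio(${text})`), ms));
}

const played: string[] = [];
let orderedChain: Promise<void> = Promise.resolve();

function dispatch(text: string, ms: number): void {
  // Synthesis starts immediately, so calls overlap.
  const synthP = fakeSynthesize(text, ms).catch(() => null);
  // Each link runs only after every earlier link has emitted its audio.
  orderedChain = orderedChain.then(async () => {
    const audio = await synthP;
    if (audio) played.push(audio);
  });
}

dispatch("first", 60);  // slow synthesis
dispatch("second", 10); // fast synthesis, finishes before "first"

orderedChain.then(() => console.log(played));
```

The second chunk is ready long before the first, yet it is only emitted after the first link in the chain has run.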

Streaming-pipeline bugs and their root causes

The video has a short error table for the bugs you hit during setup. This section covers the ones specific to the streaming pipeline that show up later, when the agent is mostly working.

WebSocket closes with code 1008 the moment audio starts

Code 1008 means "policy violation," which Deepgram uses for invalid API keys. Check DEEPGRAM_API_KEY in your environment, and check the Deepgram console for remaining credit.

WebSocket closes with code 1011 partway through a session

Code 1011 means "internal server error," but in practice the most common cause is running out of credit mid-session. Top up and retry.
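Logging a hint next to the raw close code saves a trip back to this page. The mapping below only encodes the two causes described in this post; the function name is hypothetical, and other causes are possible for both codes.

```typescript
// Maps a WebSocket close code to the most likely cause seen in this setup.
function describeDeepgramClose(code: number): string {
  switch (code) {
    case 1008:
      return "Policy violation: check DEEPGRAM_API_KEY and remaining credit.";
    case 1011:
      return "Internal error: often credit running out mid-session. Top up and retry.";
    case 1000:
      return "Normal closure.";
    default:
      return `Unrecognised close code ${code}.`;
  }
}

console.log(describeDeepgramClose(1008));
```

Call it from the socket's close handler and log the result together with the numeric code, so the original value is never lost.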

Transcripts come back empty even though audio is sending

Three things to check, in order:

Sample rate. sample_rate in the URL must match your microphone's actual rate. The repo captures at 16000. If your system is recording at 44100 or 48000, you have to resample before sending.
Encoding. The encoding parameter and the audio format must match. linear16 expects 16-bit signed little-endian PCM.
Model. model must be flux-general-en or flux-general-multi. No other model name works on /v2/listen.
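For the sample-rate case, a minimal downsampler for whole-number ratios such as 48000 to 16000 looks like the sketch below. It averages each group of input samples, which is a crude filter and not a production-quality resampler, and it deliberately refuses ratios like 44100 to 16000 that need a proper resampling library. The function name is hypothetical.

```typescript
// Naive downsampler for integer ratios (for example 48000 -> 16000).
function downsampleInt16(input: Int16Array, fromRate: number, toRate: number): Int16Array {
  const ratio = fromRate / toRate;
  if (!Number.isInteger(ratio) || ratio < 1) {
    throw new Error(`Unsupported ratio ${fromRate} -> ${toRate}`);
  }
  const out = new Int16Array(Math.floor(input.length / ratio));
  for (let i = 0; i < out.length; i++) {
    // Average `ratio` consecutive samples into one output sample.
    let sum = 0;
    for (let j = 0; j < ratio; j++) sum += input[i * ratio + j] ?? 0;
    out[i] = Math.round(sum / ratio);
  }
  return out;
}

const downsampled = downsampleInt16(Int16Array.from([3, 6, 9, 30, 60, 90]), 48000, 16000);
console.log(Array.from(downsampled));
```

Trailing samples that do not fill a complete group are dropped, so feed it buffers whose length is a multiple of the ratio.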

The agent's first sentence plays, then nothing

This is the coalescer holding onto your blocks. If you did not override flushOnEnqueue, the first block flushes but nothing after it streams. Check brain.ts for the coalesce override.

Audio plays out of order

The Promise chain in dispatchChunk is what keeps playback in order. If you removed the emitChain.then(...) wrapper, or emit each chunk from its own synthesis promise (for example, from callbacks gathered under Promise.all), chunks will play in synthesis-completion order instead of arrival order. Put the chain back.

The agent talks over itself

This means the filler kept playing after the real reply started. Check that stopOneShotPlayback() runs on the first chunk of the real reply, not at the end of the reply.

Voice cuts off mid-sentence

Falcon synthesis can fail silently for a single chunk. The .catch(() => null) in dispatchChunk protects you from one failed chunk crashing the whole reply. But if too many chunks fail, the user hears gaps. Log the failures and check Murf's status page.
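One way to make those failures visible is to wrap the synthesis function so that errors are logged and counted before they are turned into null. The wrapper takes the synthesis function as an argument, so this sketch assumes nothing about the Murf client API; the withFailureLog name is hypothetical.

```typescript
// Wraps a synthesis function so failures are logged and counted
// instead of disappearing into a bare catch.
function withFailureLog<T>(synthesize: (text: string) => Promise<T>): {
  run: (text: string) => Promise<T | null>;
  failures: () => number;
} {
  let failed = 0;
  return {
    run: (text: string) =>
      synthesize(text).catch((err: unknown) => {
        failed += 1;
        console.error(`synthesis failed for chunk "${text.slice(0, 40)}":`, err);
        return null; // same contract as the original .catch(() => null)
      }),
    failures: () => failed,
  };
}
```

The reply still survives a single bad chunk, but a rising failure count now tells you when gaps in the audio come from synthesis and not from playback.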

ALSA errors on Linux

On minimal Linux installs the ALSA development headers have to be installed before the Decibri package will build. apt install libasound2-dev covers it on Debian-family distros, and alsa-lib-devel on Fedora-family. If the install completes but the device is not found at runtime, the default ALSA device is probably pointing at an output that does not exist.

No audio on Windows

Decibri on Windows uses WASAPI. If your default output device is a Bluetooth headset that is not currently connected, the stream opens silently and no audio plays. Switch the default device in Sound settings, or set the output device explicitly in code.

Silent failure on macOS

The first run asks for microphone permission. If you deny it, subsequent runs fail silently. The agent will appear to start normally and the WebSocket will connect, but no audio frames reach Deepgram. Check microphone permissions in System Settings under Privacy and Security.

Extending the agent to something that is not Swiggy

It takes two changes.

Swap the skill. The agents.defaults.skills array in openclaw.json is the list of MCP skills the agent can call. Remove the Swiggy skill, add a different one. A calendar scheduler imports a Google Calendar MCP skill. A GitHub PR merger imports the GitHub MCP skill. A Notion assistant imports the Notion MCP skill. The runtime does not change.

Rewrite the identity. workspace/IDENTITY.md is the system prompt. It describes who the agent is, what it does, what it refuses to do, and how it should format replies. Rewriting this file changes the agent's personality and its understanding of the task.

For a calendar scheduler, you would describe an assistant that looks up free slots and confirms bookings. For a PR merger, you would describe a reviewer that summarises diffs and merges when checks pass.

Everything else stays. The audio pipeline, the streaming coalescer, the keyterm bias, the two-channel playback. That is the value of keeping the voice layer separate from the agent layer. The voice layer does not care what the agent is doing.

What this pipeline does not fix

Turn 1 latency is not solved. Time-to-first-audio on a cold start is mostly caused by tool chains and LLM time-to-first-token, not by synthesis. The slow path still includes OpenClaw's cold start, the Swiggy MCP setup, and the LLM's first-token delay. Streaming synthesis cannot hide that. The filler audio can. That is why it is there.

Getting to true sub-second first audio on turn 1 would require starting the OpenClaw runtime ahead of time, keeping the MCP connection alive across sessions, and starting tool calls before the user finishes speaking. None of those are in this repo. What is in this repo is the pattern that makes the problem manageable: split the audio pipeline from the agent pipeline, stream what can be streamed, mask the rest with fillers, and measure the result.

Turn 2 onwards is a different story. With the runtime warm and the MCP connection open, first audio arrives 5 to 10 seconds after the user stops talking. Falcon plus block streaming are why. That is the number that makes the agent usable in practice. The cold-start number is what makes every tutorial-shaped demo look slower than it will be in production.

Block streaming, Falcon, and contextual keyterm biasing are three improvements that build on each other. Each does less than a demo suggests. Together they do more than any one of them alone. That is usually how voice pipelines work.

Resources:
Murf Plugin: https://clawhub.ai/plugins/openclaw-murf-tts
Murf Falcon: https://murf.ai/api/dashboard
OpenClaw: https://openclaw.ai/
Clawhub: https://clawhub.ai/
Deepgram: https://console.deepgram.com/
