<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sanchita Sunil</title>
    <description>The latest articles on DEV Community by Sanchita Sunil (@sanchita_sunil).</description>
    <link>https://dev.to/sanchita_sunil</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3965632%2F98cd6e11-e12d-4f4c-8f6a-e29d3340e76a.png</url>
      <title>DEV Community: Sanchita Sunil</title>
      <link>https://dev.to/sanchita_sunil</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sanchita_sunil"/>
    <language>en</language>
    <item>
      <title>I Gave OpenClaw a Voice and It Ordered Me Dinner</title>
      <dc:creator>Sanchita Sunil</dc:creator>
      <pubDate>Wed, 03 Jun 2026 09:09:15 +0000</pubDate>
      <link>https://dev.to/sanchita_sunil/i-gave-openclaw-a-voice-and-it-ordered-me-dinner-40og</link>
      <guid>https://dev.to/sanchita_sunil/i-gave-openclaw-a-voice-and-it-ordered-me-dinner-40og</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick links.&lt;/strong&gt;&lt;br&gt;
Code: &lt;a href="https://github.com/murf-ai/murf-cookbook/tree/main/examples/openclaw/food_ordering_agent" rel="noopener noreferrer"&gt;https://github.com/murf-ai/murf-cookbook/tree/main/examples/openclaw/food_ordering_agent&lt;/a&gt;&lt;br&gt;
Video Walkthrough: &lt;a href="https://www.youtube.com/watch?v=ypqzB093VLc" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=ypqzB093VLc&lt;/a&gt;&lt;br&gt;
Configuration Deep Dive: &lt;a href="https://dev.to/sanchita_sunil/notes-from-the-openclaw-voice-tutorial-4ngg"&gt;https://dev.to/sanchita_sunil/notes-from-the-openclaw-voice-tutorial-4ngg&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Building a working voice agent usually means stitching state across speech, logic, and external APIs by hand. OpenClaw gives you a runtime that handles most of that for you. To see how far that gets you in practice, I wired OpenClaw up to a microphone, a Murf Falcon voice, and a Swiggy account. In about 900 lines of TypeScript, I had an agent that could search restaurants, take add-to-cart instructions, and place a real order end to end.&lt;/p&gt;

&lt;p&gt;This post is an architecture walkthrough. I'll explain what OpenClaw is doing under the hood, where I had to fight its defaults, and how the same blueprint applies to any voice agent you want to build.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why OpenClaw
&lt;/h2&gt;

&lt;p&gt;There are several agent frameworks out there. Most of them treat an agent as a function: input goes in, tool calls happen, output comes out. OpenClaw is different: it treats the agent as a runtime, more like a long-running server than a single call. Sessions can be paused and resumed, state is keyed and persisted, and tool calls go through a typed MCP (Model Context Protocol) interface. Critically for what we are building, OpenClaw also exposes block-level streaming hooks that let you intercept the model's output as it arrives.&lt;/p&gt;

&lt;p&gt;A voice agent is the hardest case an agent framework will face, because the user hears every millisecond of latency. If your framework only hands you the full reply at the end, you cannot stream audio to the speakers. The user sits in silence while the model generates 300 characters, which can take 2 to 4 seconds and feels like forever in conversation. OpenClaw instead hands you each block, a sentence or two, the moment it arrives. You turn that block into audio and play it while the model keeps generating the next one.&lt;/p&gt;
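
&lt;p&gt;Here is a rough back-of-the-envelope version of that argument, as a sketch. The function name and the 60-character first block are illustrative assumptions. The 300 characters and the 2 to 4 second range come from the paragraph above, and the 130 ms is the time-to-first-audio figure Murf reports for Falcon, quoted later in this post.&lt;/p&gt;

```typescript
// Illustrative arithmetic only: estimates how long the user waits before
// hearing anything, given how many characters must be generated first.
function timeToFirstAudioMs(
  charsBeforeAudio: number, // characters the model must emit before synthesis can start
  charsPerSecond: number,   // assumed model generation speed
  ttsFirstAudioMs: number   // TTS time-to-first-audio
): number {
  return Math.round((charsBeforeAudio * 1000) / charsPerSecond) + ttsFirstAudioMs;
}

// 300 characters in 4 seconds is 75 characters per second (slow end of the range).
const slowCharsPerSecond = 300 / 4;

// Waiting for the full 300-character reply before synthesising anything.
const waitForFullReply = timeToFirstAudioMs(300, slowCharsPerSecond, 130);

// Synthesising the first block as soon as it arrives (60 characters is an assumption).
const waitForFirstBlock = timeToFirstAudioMs(60, slowCharsPerSecond, 130);

console.log(waitForFullReply, waitForFirstBlock); // prints 4130 930
```

&lt;p&gt;Under these assumptions, block-level streaming cuts the initial silence from roughly 4.1 seconds to under one second at the slow end of the range.&lt;/p&gt;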

&lt;p&gt;Three things in particular made this build feel small once I understood them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Skills are markdown, not function definitions.&lt;/strong&gt; The Swiggy integration is a &lt;code&gt;SKILL.md&lt;/code&gt; file the model reads. No JSON schemas, no function-calling boilerplate. To swap Swiggy for GitHub or Notion later, I would install a different skill and change one config line.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP is built in.&lt;/strong&gt; OpenClaw treats MCP servers as first-class. The Swiggy MCP plugs in through &lt;code&gt;mcporter&lt;/code&gt;. Adding a new tool surface means adding a new skill, not writing glue code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming hooks are real.&lt;/strong&gt; &lt;code&gt;onBlockReply&lt;/code&gt; fires as the model writes. You drive synthesis from inside the callback.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once those three things are in place, the rest of the build is mostly wiring the audio loop around them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuic8v8ztkc7faiveqfm1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuic8v8ztkc7faiveqfm1.png" alt="Food Ordering Voice Agent Architecture Diagram" width="799" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A microphone library captures audio and a streaming STT turns it into transcripts. Those transcripts go into OpenClaw, which decides what to do, calls skills, and streams text out. A streaming TTS turns each block into audio as it arrives, and a speaker library plays it back.&lt;/p&gt;

&lt;p&gt;The audio loop is the same regardless of what the agent is doing. Plug a calendar skill into OpenClaw and you have a voice scheduling assistant. Plug in a GitHub skill and you have a voice PR reviewer. The loop does not change; only the skill and the system prompt do.&lt;/p&gt;

&lt;h2&gt;
  
  
  The stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Why this one&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Agent runtime&lt;/td&gt;
&lt;td&gt;OpenClaw&lt;/td&gt;
&lt;td&gt;Skill registry, MCP integration, block-level streaming hooks. The framework this post is built around.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool surface&lt;/td&gt;
&lt;td&gt;Swiggy skill via ClawHub&lt;/td&gt;
&lt;td&gt;Vendored MCP skill. Documents the API in markdown the model can read.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Microphone and speaker&lt;/td&gt;
&lt;td&gt;Decibri&lt;/td&gt;
&lt;td&gt;Native WASAPI on Windows, CoreAudio on macOS, ALSA on Linux. No browser layer.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speech to text&lt;/td&gt;
&lt;td&gt;Deepgram Flux&lt;/td&gt;
&lt;td&gt;Streaming STT with end-of-turn detection inside the model. No separate VAD to wire up.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text to speech&lt;/td&gt;
&lt;td&gt;Murf Falcon&lt;/td&gt;
&lt;td&gt;Low time-to-first-audio, and conversational voice styles that sound right in back-and-forth dialogue.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Language model&lt;/td&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;Free tier, supports tool calling, fast on first token. Substitutable with any tool-calling LLM.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What I deliberately left out
&lt;/h2&gt;

&lt;p&gt;Before the build, here is what is not in this version:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Wake word detection.&lt;/strong&gt; The microphone is always on while the agent is not speaking. No "Hey Claw" trigger.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-session memory.&lt;/strong&gt; Every restart starts fresh. The session key is per-process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order cancellation.&lt;/strong&gt; Swiggy's MCP does not expose it, so the skill routes the user to customer care.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production hardening.&lt;/strong&gt; This is a single-user CLI. No auth, no rate limiting, no observability. Don't ship it as is.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A note on latency, to set expectations early. Streaming TTS plays each sentence as soon as it is ready, which makes the agent feel responsive on most turns. But tool calls still take as long as they take. When the agent is querying Swiggy's API for a restaurant search, there is real waiting that streaming cannot hide. I cover this in detail in the latency section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;p&gt;Set these up before continuing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Node and package manager&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node 22.16 or newer. The repo is ESM-only and breaks on earlier versions.&lt;/li&gt;
&lt;li&gt;pnpm 9 or newer. The lockfile is pnpm. npm and yarn will resolve different versions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Platform audio dependencies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Decibri uses the native audio stack on each operating system, so the install steps differ.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux: &lt;code&gt;apt install libasound2-dev&lt;/code&gt; on Debian-family distributions, or &lt;code&gt;dnf install alsa-lib-devel&lt;/code&gt; on Fedora-family.&lt;/li&gt;
&lt;li&gt;Windows: WASAPI is built in. You need a C++ build toolchain for the Decibri binary. Install "Desktop development with C++" through Visual Studio Installer.&lt;/li&gt;
&lt;li&gt;macOS: CoreAudio is built in. You need Xcode Command Line Tools: &lt;code&gt;xcode-select --install&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;External CLIs&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; clawhub
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;clawhub&lt;/code&gt; is OpenClaw's skill registry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API keys&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deepgram for the Flux STT key. New accounts get $200 in starter credit, no card required.&lt;/li&gt;
&lt;li&gt;Murf for the Falcon TTS key. Created on the API tab of your Murf account, separate from a regular Murf Studio account.&lt;/li&gt;
&lt;li&gt;An LLM provider of your choice. Most have a free tier sufficient for development.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Swiggy account&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A Swiggy account with at least one saved delivery address. The agent orders to saved addresses, not live GPS, because the MCP surface exposes addresses, not coordinates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Clone, install, env
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone &lt;span class="nt"&gt;--filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;blob:none &lt;span class="nt"&gt;--sparse&lt;/span&gt; https://github.com/murf-ai/murf-cookbook.git
&lt;span class="nb"&gt;cd &lt;/span&gt;murf-cookbook
git sparse-checkout &lt;span class="nb"&gt;set &lt;/span&gt;examples/openclaw/food_ordering_agent
&lt;span class="nb"&gt;cd &lt;/span&gt;examples/openclaw/food_ordering_agent
pnpm &lt;span class="nb"&gt;install
cp&lt;/span&gt; .env.example .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;.env&lt;/code&gt; and add:&lt;br&gt;
&lt;code&gt;DEEPGRAM_API_KEY=...&lt;/code&gt;&lt;br&gt;
&lt;code&gt;MURF_API_KEY=...&lt;/code&gt;&lt;br&gt;
&lt;code&gt;GEMINI_API_KEY=...&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If you would rather use OpenAI or Anthropic instead of Gemini, change one line in &lt;code&gt;openclaw.json&lt;/code&gt; and the env variable name. Tool-calling support is the only requirement.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 2: Authenticate Swiggy
&lt;/h2&gt;

&lt;p&gt;Swiggy's MCP needs OAuth. Run this once: the browser opens, you log in, and you approve.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;node scripts/swiggy-auth.mjs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script signs you into Swiggy via PKCE OAuth and writes the token as a static &lt;code&gt;Authorization&lt;/code&gt; header into &lt;code&gt;~/.mcporter/mcporter.json&lt;/code&gt;. You won't need to do this again unless the token expires.&lt;/p&gt;

&lt;p&gt;If the browser doesn't open automatically, the script prints the full auth URL that you can copy and paste manually.&lt;/p&gt;

&lt;p&gt;Confirm that the skill can actually reach Swiggy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;node skills/swiggy/swiggy-cli.js food addresses
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should print your saved addresses. If the list is empty, save one in the Swiggy app before moving on.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; In the video I use &lt;code&gt;mcporter auth swiggy-food&lt;/code&gt;, which no longer works. See the &lt;a href="https://github.com/murf-ai/murf-cookbook/tree/main/examples/openclaw/food_ordering_agent" rel="noopener noreferrer"&gt;repo README&lt;/a&gt; for current auth steps.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Step 3: The four files you write
&lt;/h2&gt;

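&lt;ul&gt;
&lt;li&gt;&lt;code&gt;src/ear.ts&lt;/code&gt; (~110 lines): microphone capture and Deepgram WebSocket&lt;/li&gt;
&lt;li&gt;&lt;code&gt;src/brain.ts&lt;/code&gt; (~500 lines): streaming TTS pipeline, calls OpenClaw&lt;/li&gt;
&lt;li&gt;&lt;code&gt;src/voice.ts&lt;/code&gt; (~140 lines): speaker output, two channels&lt;/li&gt;
&lt;li&gt;&lt;code&gt;src/index.ts&lt;/code&gt; (~140 lines): the event loop&lt;/li&gt;
&lt;/ul&gt;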

&lt;p&gt;The whole agent fits in roughly 900 lines. Three of these files are adapter code: microphone in, speaker out, and the event loop that ties them together. The interesting file is &lt;code&gt;brain.ts&lt;/code&gt;, because that is where OpenClaw and Murf Falcon meet.&lt;/p&gt;
&lt;h3&gt;
  
  
  ear.ts: microphone in, transcript out
&lt;/h3&gt;

&lt;p&gt;Decibri captures 16-bit PCM at 16 kHz in 100 ms chunks. Each chunk goes to a Deepgram Flux WebSocket on &lt;code&gt;/v2/listen&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;URLSearchParams&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;model&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;flux-general-en&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;encoding&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;linear16&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sample_rate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;16000&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;k&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;keyterms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;keyterm&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;k&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`wss://api.deepgram.com/v2/listen?&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things to know.&lt;/p&gt;

&lt;p&gt;First, Flux has end-of-turn detection inside the transcription model. You don't need a separate Voice Activity Detector. You listen for a &lt;code&gt;TurnInfo&lt;/code&gt; message whose &lt;code&gt;event&lt;/code&gt; is &lt;code&gt;EndOfTurn&lt;/code&gt; and respond to it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;TurnInfo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;EndOfTurn&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;transcript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transcript&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nf"&gt;onTranscription&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Second, there is a contextual keyterm trick that mattered a lot for Indian-English food vocabulary. After each agent reply, I extract the capitalised words ("Punjab Grill," "Paneer," "Meghana") and pass them as keyterms for the next turn. This is what fixes "Kadhai Paneer" being heard as "car die panel." Standard English ASR doesn't handle Indian food names well. Per-turn keyterm biasing gets it most of the way there.&lt;/p&gt;
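
&lt;p&gt;A minimal sketch of that extraction step, assuming a simple regex is good enough. &lt;code&gt;extractKeyterms&lt;/code&gt; and its &lt;code&gt;limit&lt;/code&gt; parameter are hypothetical names for illustration, not the repo's actual code. The pattern is deliberately naive: it also picks up ordinary sentence-initial words such as "Sure".&lt;/p&gt;

```typescript
// Hypothetical helper, not the repo's implementation.
// Collects runs of capitalised words from the agent's last reply so they
// can be passed as keyterms on the next turn.
function extractKeyterms(reply: string, limit: number = 20): string[] {
  // A run is one or more capitalised words separated by single spaces,
  // e.g. "Punjab Grill" or "Paneer".
  const matches: string[] = reply.match(/\b[A-Z][a-z]+(?: [A-Z][a-z]+)*/g) ?? [];
  // De-duplicate while keeping first-seen order, then cap the list size.
  return Array.from(new Set(matches)).slice(0, limit);
}

const reply = "I found Punjab Grill and Meghana Foods. Punjab Grill has Kadhai Paneer.";
console.log(extractKeyterms(reply).join(", "));
// prints Punjab Grill, Meghana Foods, Kadhai Paneer
```

&lt;p&gt;The returned list is what would feed the &lt;code&gt;keyterm&lt;/code&gt; query parameters shown earlier when the next WebSocket URL is built.&lt;/p&gt;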

&lt;p&gt;I wired Deepgram in directly here, not through OpenClaw's STT plugin slot. OpenClaw's STT integration is built for telephony, not a local CLI microphone. 110 lines of WebSocket code was the right tool for this job.&lt;/p&gt;

&lt;h3&gt;
  
  
  brain.ts: where OpenClaw earns its keep
&lt;/h3&gt;

&lt;p&gt;This is the file that uses every OpenClaw primitive worth using.&lt;/p&gt;

&lt;p&gt;The basic flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Call OpenClaw's &lt;code&gt;chat()&lt;/code&gt; with the user's transcript.&lt;/li&gt;
&lt;li&gt;Subscribe to OpenClaw's &lt;code&gt;onBlockReply&lt;/code&gt; hook.&lt;/li&gt;
&lt;li&gt;Hand each block to Murf Falcon for synthesis as it arrives.&lt;/li&gt;
&lt;li&gt;Stream audio back to &lt;code&gt;voice.ts&lt;/code&gt; in order.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;OpenClaw's defaults are tuned for chat, where each block can be a paragraph and the user is reading on a screen. For voice, three overrides matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Override 1: turn streaming on.&lt;/strong&gt; OpenClaw has two switches that both have to allow streaming. The naming is confusing because one is called &lt;code&gt;disable&lt;/code&gt;. So you want &lt;code&gt;disableBlockStreaming: false&lt;/code&gt;, which means "do not disable," which means "do stream."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;llmCall&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getReplyFromConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;disableBlockStreaming&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Override 2: fix the coalescer.&lt;/strong&gt; OpenClaw has a coalescer that decides when to flush a buffered block to your code. Its default &lt;code&gt;minChars&lt;/code&gt; is 800. A typical voice reply is 200 to 300 characters, so the coalescer waits for a block that never arrives, then dumps everything at end-of-reply. Streaming defeated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;blockStreamingCoalesce&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;minChars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;maxChars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;idleMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;flushOnEnqueue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;flushOnEnqueue: true&lt;/code&gt; is the line that makes the rest of this work. It tells OpenClaw to hand the block over the moment it arrives, instead of waiting for more.&lt;/p&gt;
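
&lt;p&gt;To see why the default threshold defeats streaming, here is a toy model. It is not OpenClaw's coalescer: &lt;code&gt;simulateCoalescer&lt;/code&gt; is a hypothetical function that only models a minimum-size threshold plus an end-of-reply flush, which is the behaviour described above.&lt;/p&gt;

```typescript
// Toy model only: this is NOT OpenClaw's coalescer, just enough logic to
// show how a minimum-size threshold interacts with short replies.
function simulateCoalescer(chunks: string[], minChars: number): string[] {
  const flushed: string[] = [];
  let buffer = "";
  for (const chunk of chunks) {
    buffer += chunk;
    // Flush as soon as the buffer reaches the threshold.
    if (buffer.length >= minChars) {
      flushed.push(buffer);
      buffer = "";
    }
  }
  // Whatever is left is handed over when the reply ends.
  if (buffer.length !== 0) flushed.push(buffer);
  return flushed;
}

const chunks = ["Sure. ", "I found three places nearby. ", "Want to hear the first one?"];
console.log(simulateCoalescer(chunks, 800).length); // prints 1
console.log(simulateCoalescer(chunks, 1).length);   // prints 3
```

&lt;p&gt;With a threshold of 800, the short reply never reaches it, so everything arrives as one block at the end. With a threshold of 1, each chunk is handed over as it arrives.&lt;/p&gt;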

&lt;p&gt;&lt;strong&gt;Override 3: track deltas yourself.&lt;/strong&gt; OpenClaw's &lt;code&gt;onBlockReply&lt;/code&gt; callback gives you the full text so far, not just the new piece. You compute the delta yourself. Three cases: extension (new starts with old), duplicate (skip), and reset (fresh string after a tool call). The reset case is easy to miss and shows up after every tool call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;currentBlockStream&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;currentBlockStream&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;currentBlockStream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;currentBlockStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;currentBlockStream&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;currentBlockStream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;currentBlockStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
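
&lt;p&gt;The same three cases can be written as a small pure function, which makes them easy to test in isolation. This is a sketch: &lt;code&gt;computeDelta&lt;/code&gt; and &lt;code&gt;DeltaResult&lt;/code&gt; are hypothetical names, and the function returns the updated stream instead of mutating a shared variable as the snippet above does.&lt;/p&gt;

```typescript
// Hypothetical pure-function version of the delta logic above.
type DeltaResult = { delta: string; stream: string };

function computeDelta(previous: string, text: string): DeltaResult {
  // First block of the reply: everything is new.
  if (previous.length === 0) return { delta: text, stream: text };
  // Extension: the new text starts with what we already have.
  if (text.startsWith(previous)) return { delta: text.slice(previous.length), stream: text };
  // Duplicate: we have already seen this text, so there is nothing to speak.
  if (previous.includes(text)) return { delta: "", stream: previous };
  // Reset: a fresh string, for example after a tool call.
  return { delta: text, stream: text };
}

let stream = "";
for (const text of ["Sure.", "Sure. Adding it now.", "Sure.", "Your cart has 2 items."]) {
  const result = computeDelta(stream, text);
  stream = result.stream;
  console.log(JSON.stringify(result.delta));
}
// prints "Sure.", then " Adding it now.", then "", then "Your cart has 2 items."
```

&lt;p&gt;An empty delta means there is nothing new to synthesise, so the caller can skip it.&lt;/p&gt;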



&lt;p&gt;Once you have the delta, you send it to Murf for synthesis (&lt;code&gt;synthesizeSpeech()&lt;/code&gt; in the snippet below). Synthesis runs in parallel across blocks, but playback runs in order: it is serialised through a Promise chain, so chunk 2 always plays after chunk 1 even if chunk 2's network call finishes first.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;synthP&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;synthesizeSpeech&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;trimmed&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;emitChain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;emitChain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;audio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;synthP&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nf"&gt;onAudioChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is roughly 30 lines of streaming logic. The rest of &lt;code&gt;brain.ts&lt;/code&gt; is the agent setup, the OpenClaw config, and a fallback path for when the model batches output after tool calls.&lt;/p&gt;

&lt;h3&gt;
  
  
  voice.ts: two speakers, not one
&lt;/h3&gt;

&lt;p&gt;Falcon's synthesis is fast: Murf reports 130 ms time-to-first-audio, and that matches what I see in practice. So when there is dead air on the agent's first turn, the TTS is not the cause. It is the cold-start cost: OpenClaw initialising, the Swiggy MCP handshake, and the LLM making its first call against a fresh tool chain. All of that has to finish before the model produces its first block of text for Falcon to synthesise.&lt;/p&gt;

&lt;p&gt;That is the gap pre-recorded filler audio is for. Short clips like "One moment please" or "Let me check that for you" play 100 ms after the user stops talking, which is fast enough that the user does not perceive a delay.&lt;/p&gt;

&lt;p&gt;The catch: the filler is a variable-length clip, and the first real audio chunk can arrive before the filler finishes. If both play through one audio output, one cuts off the other. The fix is two separate Decibri outputs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;oneShotSpeaker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;InstanceType&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;DecibriOutput&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;streamSpeaker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;InstanceType&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;DecibriOutput&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;oneShotSpeaker&lt;/code&gt; plays fillers. &lt;code&gt;streamSpeaker&lt;/code&gt; plays the real reply. When the first reply chunk arrives, I stop the filler channel without touching the reply channel. Anything queued on the reply channel keeps playing.&lt;/p&gt;

&lt;p&gt;This sounds like overkill until you hear the alternative. With one channel, the filler clips the agent saying "Sure" and the user only hears "...I'll add that."&lt;/p&gt;

&lt;h3&gt;
  
  
  index.ts: the loop
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;startSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;renderBanner&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="nf"&gt;setImmediate&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;warmup&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;      &lt;span class="c1"&gt;// amortise OpenClaw cold start&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;playIntro&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;openMicrophone&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;ear&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;transcript&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;closeMicrophone&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;playFiller&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;                  &lt;span class="c1"&gt;// mask LLM latency&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;                    &lt;span class="c1"&gt;// streams audio as it arrives&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;openMicrophone&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the whole loop. Render the banner, kick off OpenClaw warmup in the background, play the intro, open the microphone. On each transcript: stop the microphone, play a filler, run the agent, reopen the microphone.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;setImmediate(() =&amp;gt; warmup())&lt;/code&gt; line runs OpenClaw's initialisation and the Swiggy MCP handshake while the user is hearing the intro. By the time the user finishes their first sentence, both are warm. That shaves several seconds off turn 1.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the skill actually works
&lt;/h2&gt;

&lt;p&gt;This is the part that surprised me most when I first used OpenClaw.&lt;/p&gt;

&lt;p&gt;The agent learns to use Swiggy by reading a markdown file. Not a JSON schema, not function definitions. A human-readable file called &lt;code&gt;SKILL.md&lt;/code&gt; that documents the commands, the sequencing rules, and the things to never do. The model reads this, figures out what to call, and emits shell commands that run against a CLI wrapper.&lt;/p&gt;

&lt;p&gt;The wrapper is small. &lt;code&gt;node skills/swiggy/swiggy-cli.js food &amp;lt;command&amp;gt;&lt;/code&gt; is the shape of every call. The skill knows commands like &lt;code&gt;search-restaurants&lt;/code&gt;, &lt;code&gt;get-menu&lt;/code&gt;, &lt;code&gt;add-to-cart&lt;/code&gt;, &lt;code&gt;checkout&lt;/code&gt;. The model sequences them on its own, based on the markdown documentation.&lt;/p&gt;

&lt;p&gt;Here is a snippet from &lt;code&gt;SKILL.md&lt;/code&gt; (paraphrased):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;search-restaurants&lt;/strong&gt;: Find restaurants matching a cuisine or dish. Use this first whenever the user mentions a food. Example: &lt;code&gt;search-restaurants --query "biryani"&lt;/code&gt;. Always call &lt;code&gt;get-addresses&lt;/code&gt; first if you have not yet, because results depend on delivery location.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model reads it the same way a new developer would read documentation on day one.&lt;/p&gt;

&lt;p&gt;The one tweak I made: every &lt;code&gt;swiggy food &amp;lt;cmd&amp;gt;&lt;/code&gt; call in &lt;code&gt;SKILL.md&lt;/code&gt; became &lt;code&gt;node skills/swiggy/swiggy-cli.js food &amp;lt;cmd&amp;gt;&lt;/code&gt;. OpenClaw's shell executor doesn't have npm globals on PATH, so the &lt;code&gt;swiggy&lt;/code&gt; binary from &lt;code&gt;npm link&lt;/code&gt; is not reachable.&lt;/p&gt;

&lt;p&gt;The implication for builders: writing a new skill is writing a markdown file and a thin CLI. There is no SDK to learn, no function-calling glue to debug. If you can document an API in English with examples, you can give an OpenClaw agent the ability to call it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Latency
&lt;/h2&gt;

&lt;p&gt;The first turn is the slowest. Before any audio plays, OpenClaw has to initialise, complete the Swiggy MCP handshake, and make its first LLM call against a fresh tool chain. On a typical machine that takes anywhere from 15 to 50 seconds, depending on your network and your LLM provider. Streaming TTS does not save you here — the model has not produced anything to synthesise yet.&lt;/p&gt;

&lt;p&gt;What does help is the combination of filler audio (which plays 100 ms after the user stops talking) and the background warmup that runs during the intro. Together they keep the perceived gap small even when the actual cold start is not.&lt;/p&gt;

&lt;p&gt;Turn 2 onwards is a different story. With the runtime warm and the MCP connection open, first audio arrives 5 to 10 seconds after the user stops talking, and most of that is the LLM's time to its first sentence. Falcon's 130 ms time to first audio (TTFA) is what makes "first sentence" actually translate to "first audio you hear."&lt;/p&gt;

&lt;p&gt;If you genuinely need to push first-audio latency below this on tool-heavy turns, the only real lever is to take OpenClaw out of the loop on those turns: wire the tool calls in directly and parallelise what OpenClaw would have serialised. I haven't done that in this build.&lt;/p&gt;

&lt;h2&gt;
  
  
  Swap the skill
&lt;/h2&gt;

&lt;p&gt;The voice loop in this post does not care what the agent does. Swapping the skill means editing two places:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;agents.defaults.skills&lt;/code&gt; in &lt;code&gt;openclaw.json&lt;/code&gt;. Replace &lt;code&gt;swiggy&lt;/code&gt; with another MCP skill. Google Calendar. GitHub. Notion. Linear. Pick one from ClawHub or write your own.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;workspace/IDENTITY.md&lt;/code&gt;. The system prompt that describes who the agent is and how it should talk. Rewrite it for the new domain.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That portability is the case I wanted to make for OpenClaw with this post. The framework is doing real work behind the scenes, hiding the runtime, the MCP integration, the streaming, and the skill format behind primitives that are small enough to use without ceremony.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;p&gt;The skill format is the part I underestimated going in. The model reads it the way a developer would read API docs on day one. There is no JSON schema to maintain, no function-calling boilerplate to update when the API changes. If your API is documentable in markdown, an OpenClaw agent can use it.&lt;/p&gt;

&lt;p&gt;Voice agents are mostly a latency engineering problem. The transcription, the agent, the TTS are mostly solved. The work that made this build feel real was around the seams — two-channel playback, background warmup, per-turn keyterm bias, pre-baked fillers. You have to find these by listening to your own demo and noticing what sounds wrong.&lt;/p&gt;

&lt;p&gt;The combination of streaming hooks and per-block synthesis is what made the conversational rhythm work. Falcon at 130 ms TTFA is fast on its own, and OpenClaw handing off blocks the moment they arrive is fast on its own. Together, if the LLM produces text in roughly 200 ms chunks and the TTS adds 130 ms on top, the first audio for a chunk is ready about 330 ms after the LLM starts writing it. Because synthesis of one chunk overlaps with generation of the next, later chunks are ready roughly every 200 ms. That is faster than the speaker can play them back, so the audio queue does not run dry, and it is what makes the agent feel like it is actually thinking out loud rather than waiting to deliver a finished answer.&lt;/p&gt;

&lt;p&gt;If this was useful, the code is at &lt;a href="https://github.com/murf-ai/murf-cookbook/tree/main/examples/openclaw/food_ordering_agent" rel="noopener noreferrer"&gt;github.com/murf-ai/murf-cookbook&lt;/a&gt;. A star helps the project reach more builders. Clone it, swap the skill, and build something else tonight. The configuration deep dive, with the parameter tables and error mappings, is at &lt;a href="https://dev.to/sanchita_sunil/notes-from-the-openclaw-voice-tutorial-4ngg"&gt;dev.to/sanchita_sunil/notes-from-the-openclaw-voice-tutorial-4ngg&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I would love to hear what you build with it.&lt;/p&gt;

</description>
      <category>voice</category>
      <category>ai</category>
      <category>openclaw</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Notes from the OpenClaw Voice Tutorial</title>
      <dc:creator>Sanchita Sunil</dc:creator>
      <pubDate>Wed, 03 Jun 2026 08:29:41 +0000</pubDate>
      <link>https://dev.to/sanchita_sunil/notes-from-the-openclaw-voice-tutorial-4ngg</link>
      <guid>https://dev.to/sanchita_sunil/notes-from-the-openclaw-voice-tutorial-4ngg</guid>
      <description>&lt;p&gt;This is a companion to the food-ordering agent tutorial video (You can find the video here: &lt;a href="https://www.youtube.com/watch?v=ypqzB093VLc" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=ypqzB093VLc&lt;/a&gt;). The video walks you through cloning the repo and placing a real Swiggy order with your voice. This post fills in the parts the video pointed at but did not have time to cover:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Every Deepgram Flux parameter, what it does, and how the event model behaves&lt;/li&gt;
&lt;li&gt;Why OpenClaw's block streaming defaults are wrong for voice, and which ones to flip&lt;/li&gt;
&lt;li&gt;Falcon voice and locale compatibility, and how to swap voices without breaking things&lt;/li&gt;
&lt;li&gt;Streaming-pipeline bugs that show up after setup, with their root causes&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/murf-ai/murf-cookbook/tree/main/examples/openclaw/food_ordering_agent" rel="noopener noreferrer"&gt;https://github.com/murf-ai/murf-cookbook/tree/main/examples/openclaw/food_ordering_agent&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Video:&lt;/strong&gt; &lt;a href="https://www.youtube.com/watch?v=ypqzB093VLc" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=ypqzB093VLc&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;OpenClaw treats an agent as a runtime, not a prompt. A runtime is a program that runs continuously and remembers state between calls, like a server. A prompt, in contrast, is a single block of text sent to the model. The difference matters because OpenClaw can pause, resume, and track sessions across many turns.&lt;/p&gt;

&lt;p&gt;That model works well for chat. Voice is where it starts to break down.&lt;/p&gt;

&lt;p&gt;A microphone does not produce text. It produces audio frames (small chunks of raw sound data). A speaker cannot wait for the full reply before playing anything. The user will hear silence and assume the agent is broken. The same tool-call delay that is invisible in a chat UI becomes obvious dead air the moment the user can hear it.&lt;/p&gt;

&lt;p&gt;Every piece of OpenClaw still works for voice. You just have to point each piece at the voice use case on purpose, instead of relying on the chat-friendly defaults. The next three sections walk through which defaults to change and why.&lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;p&gt;If you do not have these, set them up before continuing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Node and package manager&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node 22.16 or newer. The repo is ESM-only and breaks on earlier versions.&lt;/li&gt;
&lt;li&gt;pnpm 9 or newer. The lockfile is pnpm's, so npm and yarn may resolve different versions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Platform audio dependencies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Decibri uses the native audio stack on each operating system, so the install steps differ.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux: &lt;code&gt;apt install libasound2-dev&lt;/code&gt; on Debian-family distros, or &lt;code&gt;dnf install alsa-lib-devel&lt;/code&gt; on Fedora-family. Required at install time.&lt;/li&gt;
&lt;li&gt;Windows: WASAPI is built in. You need a C++ build toolchain for the Decibri binary. Install "Desktop development with C++" through Visual Studio Installer.&lt;/li&gt;
&lt;li&gt;macOS: CoreAudio is built in. You need Xcode Command Line Tools: &lt;code&gt;xcode-select --install&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;External CLIs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;clawhub&lt;/code&gt;. OpenClaw's skill registry. The Swiggy skill in this repo is vendored, so you do not strictly need &lt;code&gt;clawhub&lt;/code&gt; to run the agent, but you will need it if you want to fetch other skills later.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;API keys&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deepgram for the Flux STT key. New accounts get $200 in starter credit, no card required.&lt;/li&gt;
&lt;li&gt;Murf for the Falcon TTS key. This is created on the API tab of your Murf account, separate from a regular Murf Studio account.&lt;/li&gt;
&lt;li&gt;An LLM provider of your choice. Most have a free tier sufficient for development.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Swiggy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A Swiggy account with at least one saved delivery address. The agent orders to saved addresses, not live GPS, because the MCP surface exposes addresses, not coordinates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; The Swiggy auth flow has changed since the video was recorded. &lt;code&gt;mcporter auth swiggy-food&lt;/code&gt; no longer works — Swiggy MCP now requires an approved &lt;code&gt;client_id&lt;/code&gt; and uses a custom PKCE script instead. Run &lt;code&gt;node scripts/swiggy-auth.mjs&lt;/code&gt;. See the &lt;a href="https://github.com/murf-ai/murf-cookbook/tree/main/examples/openclaw/food_ordering_agent" rel="noopener noreferrer"&gt;repo README&lt;/a&gt; for current steps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deepgram Flux
&lt;/h2&gt;

&lt;p&gt;Flux is the STT we use in this build. There are several streaming STTs that work for voice agents; Flux is the one wired up here, and the parts below are the configuration you need to get right regardless of which API you go with.&lt;/p&gt;

&lt;p&gt;One concept worth covering before the parameters: turn-taking. This is the decision of when the user has stopped talking and the agent should respond. Many streaming STT APIs hand back partial transcripts and leave turn-taking to your code, which usually means adding a separate Voice Activity Detector (VAD) that listens for silence. Flux does turn-taking inside the transcription model and emits structured events for it, so for this build we do not need a separate VAD.&lt;/p&gt;

&lt;h3&gt;
  
  
  Endpoint
&lt;/h3&gt;

&lt;p&gt;An endpoint is the URL path you connect to on a server. Flux only works on &lt;code&gt;/v2/listen&lt;/code&gt;. The older &lt;code&gt;/v1/listen&lt;/code&gt; endpoint will silently reject the model parameter. You will spend an hour wondering why nothing transcribes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;URLSearchParams&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;model&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;flux-general-en&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;encoding&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;linear16&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sample_rate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;16000&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;k&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;keyterms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;keyterm&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;k&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`wss://api.deepgram.com/v2/listen?&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;URLSearchParams&lt;/code&gt; to build the URL. It encodes spaces in multi-word keyterms correctly (as &lt;code&gt;+&lt;/code&gt;). If you build the query string by hand and use &lt;code&gt;%20&lt;/code&gt; instead, Deepgram will close the connection without telling you why. This is the most common setup bug.&lt;/p&gt;
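
&lt;p&gt;To see the difference between the two encodings, here is a small sketch using only standard JavaScript built-ins. The keyterm value is just an example.&lt;/p&gt;

```typescript
// URLSearchParams serialises with form-urlencoding, so a space becomes "+".
const params = new URLSearchParams();
params.append("keyterm", "Paneer Butter Masala");
const formEncoded = params.toString();
console.log(formEncoded); // keyterm=Paneer+Butter+Masala

// encodeURIComponent percent-encodes instead, so the same space becomes "%20".
const handBuilt = "keyterm=" + encodeURIComponent("Paneer Butter Masala");
console.log(handBuilt); // keyterm=Paneer%20Butter%20Masala
```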

&lt;h3&gt;
  
  
  Parameters
&lt;/h3&gt;

&lt;p&gt;The audio format below uses the term PCM, which means pulse-code modulation. It is the standard way to represent raw audio as numbers. &lt;code&gt;linear16&lt;/code&gt; means each sample is a 16-bit signed integer stored in little-endian byte order. Most audio libraries use this format by default.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Value used&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;model&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;flux-general-en&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Flux English. Use &lt;code&gt;flux-general-multi&lt;/code&gt; for multilingual.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;encoding&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;linear16&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;16-bit PCM audio. Must match what your microphone library outputs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sample_rate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;16000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;16 kHz audio. Decibri captures at this rate by default.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;keyterm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;repeated&lt;/td&gt;
&lt;td&gt;Vocabulary biasing. Up to 100 keyterms per connection.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;eager_eot_threshold&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;not set&lt;/td&gt;
&lt;td&gt;Enables EagerEndOfTurn events at this confidence. Off in this repo.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
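
&lt;p&gt;As a byte-level illustration of the &lt;code&gt;linear16&lt;/code&gt; row, here is a rough sketch. &lt;code&gt;floatToLinear16&lt;/code&gt; is a hypothetical helper, not part of the repo or of Decibri, and it assumes float samples in the range -1 to 1 on a single channel.&lt;/p&gt;

```typescript
// Hypothetical helper: pack float samples in [-1, 1] as 16-bit signed
// little-endian PCM, which is the layout "linear16" describes.
function floatToLinear16(samples: number[]): Uint8Array {
  const bytes = new Uint8Array(samples.length * 2); // 2 bytes per sample
  const view = new DataView(bytes.buffer);
  samples.forEach((sample, i) => {
    const clamped = Math.max(-1, Math.min(1, sample));
    // Scale to the int16 range; the final `true` selects little-endian.
    view.setInt16(i * 2, Math.round(clamped * 32767), true);
  });
  return bytes;
}

const pcm = floatToLinear16([0, 1, -1]);
console.log(Array.from(pcm)); // 0, 0, 255, 127, 1, 128

// Assuming one channel, 16 kHz audio carries 16000 * 2 = 32000 bytes per second.
const bytesPerSecond = 16000 * 2;
console.log(bytesPerSecond); // 32000
```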

&lt;p&gt;You can also pass &lt;code&gt;eot_threshold&lt;/code&gt; to tune end-of-turn sensitivity. The default works well for short food-ordering sentences. If your agent handles longer thinking-out-loud utterances, raise it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Flux events we use
&lt;/h3&gt;

&lt;p&gt;Flux sends five event types on its TurnInfo stream. The repo only consumes one of them, but the others are worth knowing because you will probably want some of them later.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Update.&lt;/strong&gt; Partial transcript, updated as the user keeps talking. Useful if you want a live transcript display. Not used here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;StartOfTurn.&lt;/strong&gt; The user just started speaking. This is where you would handle barge-in (cutting off the agent if it is still talking, so the user can interrupt). Not connected here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EndOfTurn.&lt;/strong&gt; High confidence the user is done. This is the only event the repo uses. When it fires, the transcript goes to the LLM and the agent starts generating a reply.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EagerEndOfTurn.&lt;/strong&gt; Medium confidence the user is done. Off by default. If you turn it on (with &lt;code&gt;eager_eot_threshold&lt;/code&gt;), the agent can start drafting a reply early. Saves some delay at the cost of more LLM calls because some drafts get thrown away.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TurnResumed.&lt;/strong&gt; Only fires after an EagerEndOfTurn. Means the user was not actually done, and any draft you started should be discarded.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;TurnInfo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;EndOfTurn&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transcript&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nf"&gt;onTranscription&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
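
&lt;p&gt;If you do turn on &lt;code&gt;eager_eot_threshold&lt;/code&gt;, the draft-and-discard flow described in the list could be sketched roughly like this. &lt;code&gt;DraftTracker&lt;/code&gt; and its log are hypothetical and not in the repo; only the event names come from Flux. Committing a draft only when the final transcript matches it is an assumption of this sketch.&lt;/p&gt;

```typescript
type FluxEvent = "Update" | "StartOfTurn" | "EagerEndOfTurn" | "TurnResumed" | "EndOfTurn";

// Hypothetical tracker: remembers a speculative draft between events.
class DraftTracker {
  private draftFor: string | null = null;
  readonly log: string[] = [];

  handle(event: FluxEvent, transcript: string): void {
    switch (event) {
      case "EagerEndOfTurn":
        // Medium confidence: start drafting a reply early.
        this.draftFor = transcript;
        this.log.push(`draft started: ${transcript}`);
        break;
      case "TurnResumed":
        // The user kept talking, so the early draft is stale.
        this.draftFor = null;
        this.log.push("draft discarded");
        break;
      case "EndOfTurn":
        // High confidence: reuse the draft only if it matches the final text.
        this.log.push(
          this.draftFor === transcript
            ? `draft committed: ${transcript}`
            : `fresh reply: ${transcript}`,
        );
        this.draftFor = null;
        break;
      default:
        // Update and StartOfTurn are ignored in this sketch.
        break;
    }
  }
}

const tracker = new DraftTracker();
tracker.handle("EagerEndOfTurn", "add one biryani");
tracker.handle("TurnResumed", "add one biryani and");
tracker.handle("EagerEndOfTurn", "add one biryani and a lassi");
tracker.handle("EndOfTurn", "add one biryani and a lassi");
console.log(tracker.log);
```

&lt;p&gt;In this run the first draft is thrown away and the second one is committed, which is the extra LLM call the eager mode costs you.&lt;/p&gt;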



&lt;h3&gt;
  
  
  Keyterm biasing for Indian-English food vocabulary
&lt;/h3&gt;

&lt;p&gt;Deepgram lets you pass up to 100 keyterms per connection. Keyterms tell the model "if you hear something close to one of these words, lean toward this spelling." Most apps set keyterms once at connect time using a fixed vocabulary.&lt;/p&gt;

&lt;p&gt;Flux's &lt;code&gt;Configure&lt;/code&gt; control message lets you update keyterms on every turn. The repo uses this to bias the next turn on whatever proper nouns the agent just said.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;extractContextualKeyterms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;.,!?;:()"'&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;+/&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="sr"&gt;/^&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;A-Z&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;KEYTERM_STOPWORDS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;w&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The idea is simple. If the agent just said "Paneer Butter Masala from Punjab Grill," the user's reply is much more likely to contain those words than some random restaurant name. So we extract the capitalised words from the agent's last reply and use them as bias for the next turn.&lt;/p&gt;

&lt;p&gt;For Indian-English food vocabulary, where standard English speech recognition struggles the most, this one feature is the difference between the agent hearing "Kadhai Paneer" and hearing "car die panel."&lt;/p&gt;
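
&lt;p&gt;One way to combine per-turn terms with a fixed vocabulary while staying inside the 100-keyterm limit is sketched below. &lt;code&gt;mergeKeyterms&lt;/code&gt; and &lt;code&gt;BASE_KEYTERMS&lt;/code&gt; are hypothetical names, and putting the per-turn terms first is a design choice of this sketch, not something the repo is confirmed to do.&lt;/p&gt;

```typescript
const MAX_KEYTERMS = 100; // per-connection limit quoted above

// Hypothetical fixed vocabulary; a real list would come from your own menu data.
const BASE_KEYTERMS: string[] = ["Biryani", "Paneer", "Dosa"];

// Put the freshest terms first, drop duplicates, then trim to the limit.
function mergeKeyterms(turnTerms: string[], baseTerms: string[]): string[] {
  const unique = Array.from(new Set([...turnTerms, ...baseTerms]));
  return unique.slice(0, MAX_KEYTERMS);
}

const merged = mergeKeyterms(
  ["Paneer", "Butter", "Masala", "Punjab", "Grill"],
  BASE_KEYTERMS,
);
console.log(merged);
// Paneer, Butter, Masala, Punjab, Grill, Biryani, Dosa
```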

&lt;h3&gt;
  
  
  Cost
&lt;/h3&gt;

&lt;p&gt;Deepgram bills Flux per second of streaming audio. As of early 2026, the pay-as-you-go rate sits in the range of $0.0077 to $0.015 per minute, depending on the plan and region. Check Deepgram's pricing page for current numbers. New accounts get $200 in starter credit.&lt;/p&gt;

&lt;p&gt;A rough cost estimate for the food-ordering agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average turn: 3 seconds of user speech, microphone open during user speech only&lt;/li&gt;
&lt;li&gt;Per-turn STT cost: 3 seconds at the higher end of the range, roughly $0.00075&lt;/li&gt;
&lt;li&gt;Ten-turn ordering session: under one cent for STT&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You will run out of patience for testing long before you run out of $200 of credit.&lt;/p&gt;
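
&lt;p&gt;The arithmetic behind that estimate, as a quick sketch. The figures are the ones from the list above, with the rate taken from the top of the quoted range; treat them as illustrative and check current pricing.&lt;/p&gt;

```typescript
// Figures taken from the estimate above; the rate is the top of the quoted range.
const ratePerMinuteUsd = 0.015;
const secondsPerTurn = 3;
const turnsPerSession = 10;
const creditUsd = 200;

const costPerTurnUsd = (secondsPerTurn / 60) * ratePerMinuteUsd;
const costPerSessionUsd = costPerTurnUsd * turnsPerSession;
const sessionsCoveredByCredit = Math.floor(creditUsd / costPerSessionUsd);

console.log(costPerTurnUsd.toFixed(5)); // 0.00075
console.log(costPerSessionUsd.toFixed(4)); // 0.0075
console.log(sessionsCoveredByCredit); // 26666
```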

&lt;h2&gt;
  
  
  Block streaming
&lt;/h2&gt;

&lt;p&gt;OpenClaw was built for chat first. Its block streaming was tuned for long replies on a screen. In that setup, each block (a unit of text the model sends back) might be a whole paragraph. For voice, each block should be a sentence or two. Every millisecond between "LLM produced text" and "speaker plays sound" is silence the user can hear.&lt;/p&gt;

&lt;p&gt;The defaults are wrong for voice. Until you change them, OpenClaw quietly holds onto your blocks instead of sending them to your code right away.&lt;/p&gt;

&lt;h3&gt;
  
  
  First, turn streaming on
&lt;/h3&gt;

&lt;p&gt;OpenClaw has two settings that control block streaming:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;blockStreamingDefault&lt;/code&gt; in the config (the channel-wide default)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;disableBlockStreaming&lt;/code&gt; at the call site (the override for one call)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both have to allow streaming, or it will not happen.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;llmCall&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getReplyFromConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;disableBlockStreaming&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The naming is confusing. The option is called &lt;code&gt;disable&lt;/code&gt;, so &lt;code&gt;false&lt;/code&gt; means "do not disable," which in turn means "do stream." So you want &lt;code&gt;disableBlockStreaming: false&lt;/code&gt;. Read it twice if needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix the coalescer
&lt;/h3&gt;

&lt;p&gt;The coalescer is the component that decides when to send a buffered block to your code. To buffer means to hold onto something until enough has built up. To send the buffered content onward is to flush it.&lt;/p&gt;

&lt;p&gt;The coalescer's default &lt;code&gt;minChars&lt;/code&gt; setting is 800. A typical voice reply is 200 to 300 characters. So with the default, the coalescer waits for an 800-character block that will never arrive. It gives up at the end of the reply and dumps everything at once. Streaming defeated.&lt;/p&gt;

&lt;p&gt;Override it like this (&lt;code&gt;brain.ts&lt;/code&gt; lines 96 to 109):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;blockStreamingChunk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;minChars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;maxChars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;breakPreference&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sentence&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="nx"&gt;blockStreamingCoalesce&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;minChars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;maxChars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;idleMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;flushOnEnqueue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The line that matters most is &lt;code&gt;flushOnEnqueue: true&lt;/code&gt;. It tells the coalescer to send the block to your code the moment it arrives, without waiting. Every other override is necessary, but useless without this one.&lt;/p&gt;
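
&lt;p&gt;To see why the defaults defeat streaming for short replies, here is a toy model of a coalescer. It is a sketch, not OpenClaw's implementation: &lt;code&gt;ToyCoalescer&lt;/code&gt; and its methods are hypothetical, and only the option names &lt;code&gt;minChars&lt;/code&gt; and &lt;code&gt;flushOnEnqueue&lt;/code&gt; come from the config above.&lt;/p&gt;

```typescript
// Toy model only: NOT OpenClaw's coalescer. It illustrates how a
// minimum-size threshold interacts with short voice replies.
type ToyOptions = { minChars: number; flushOnEnqueue: boolean };

class ToyCoalescer {
  buffer = "";
  flushed: string[] = [];
  opts: ToyOptions;

  constructor(opts: ToyOptions) {
    this.opts = opts;
  }

  // Called each time a new piece of text is produced.
  enqueue(piece: string): void {
    this.buffer += piece;
    if (this.opts.flushOnEnqueue || this.buffer.length >= this.opts.minChars) {
      this.flush();
    }
  }

  // Called once at the end of the reply: whatever is left goes out at once.
  end(): void {
    this.flush();
  }

  flush(): void {
    if (this.buffer.length > 0) {
      this.flushed.push(this.buffer);
      this.buffer = "";
    }
  }
}

const sentences = ["Sure. ", "I found three places nearby. ", "Want the top one?"];

function run(opts: ToyOptions): string[] {
  const coalescer = new ToyCoalescer(opts);
  for (const s of sentences) coalescer.enqueue(s);
  coalescer.end();
  return coalescer.flushed;
}

// Threshold of 800: nothing flushes until the reply ends.
const withDefaults = run({ minChars: 800, flushOnEnqueue: false });
// Overrides from the config above: every piece flushes on arrival.
const withOverrides = run({ minChars: 1, flushOnEnqueue: true });
console.log(withDefaults.length, withOverrides.length);
```

&lt;p&gt;In this model the 800-character threshold produces a single block at the end of the reply, while the overrides produce one block per sentence, which is what the synthesis step needs.&lt;/p&gt;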

&lt;h3&gt;
  
  
  Track deltas yourself
&lt;/h3&gt;

&lt;p&gt;A callback is a function that OpenClaw calls when something happens, like a new block arriving. OpenClaw's &lt;code&gt;onBlockReply&lt;/code&gt; callback is given the full text so far, not just the new piece. So you have to figure out what is new yourself. The new piece is called the delta.&lt;/p&gt;

&lt;p&gt;Here is how the repo computes it (&lt;code&gt;brain.ts&lt;/code&gt; lines 486 to 501):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;currentBlockStream&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;currentBlockStream&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;currentBlockStream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;currentBlockStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;currentBlockStream&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;currentBlockStream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// already covered&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;currentBlockStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are three cases here, and the third is the one that matters most:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Extension.&lt;/strong&gt; The new text starts with the old text. The delta is just the part at the end. Easy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duplicate.&lt;/strong&gt; The same block got reported twice. Skip it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reset.&lt;/strong&gt; The new text has nothing to do with the old text. This happens after a tool call finishes. OpenClaw starts a fresh block stream, and the new text is a brand-new string. Without this branch, you would either lose the new block or join it incorrectly to the old one.&lt;/li&gt;
&lt;/ol&gt;
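
&lt;p&gt;The three cases can be exercised in isolation with a small self-contained version of the same logic. This is a sketch for experimenting: &lt;code&gt;DeltaTracker&lt;/code&gt; is a hypothetical name (the repo keeps this state inline in the callback), and the sample strings are made up.&lt;/p&gt;

```typescript
// Self-contained version of the three-branch delta logic.
// `DeltaTracker` is a hypothetical wrapper, not a name from the repo.
class DeltaTracker {
  current = "";

  // Returns the new text to speak, or null when there is nothing new.
  next(text: string): string | null {
    if (this.current !== "") {
      if (text.startsWith(this.current)) {
        // Extension: keep only the tail that has not been seen yet.
        const delta = text.slice(this.current.length);
        this.current = text;
        return delta === "" ? null : delta;
      }
      if (this.current.includes(text)) {
        // Duplicate: already covered by what was streamed earlier.
        return null;
      }
    }
    // Reset (or the very first block): the whole text is new.
    this.current = text;
    return text;
  }
}

const tracker = new DeltaTracker();
const spoken: string[] = [];
const updates = [
  "Let me check.",
  "Let me check. One moment.", // extension
  "Let me check.", // duplicate of text already seen
  "I found two restaurants.", // fresh stream, as after a tool call
];
for (const text of updates) {
  const delta = tracker.next(text);
  if (delta !== null) spoken.push(delta);
}
console.log(spoken);
```

&lt;p&gt;One small difference from the snippet above: this version returns &lt;code&gt;null&lt;/code&gt; for an empty extension instead of an empty string, so the caller does not need its own empty check.&lt;/p&gt;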

&lt;h3&gt;
  
  
  The empty payload.text quirk
&lt;/h3&gt;

&lt;p&gt;When block streaming is actually working, &lt;code&gt;payload.text&lt;/code&gt; in the final reply is an empty string. This is not a bug.&lt;/p&gt;

&lt;p&gt;OpenClaw has a check called &lt;code&gt;shouldDropFinalPayloads&lt;/code&gt; that removes the text from the final payload once it has already been streamed. This avoids sending the same text twice. The repo handles this by collecting text in its own buffer (&lt;code&gt;canonicalText&lt;/code&gt;) as chunks arrive. It only falls back to &lt;code&gt;payload.text&lt;/code&gt; if the buffer is empty:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;canonicalText&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;payloadText&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;canonicalText&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;payloadText&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
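
&lt;p&gt;As a minimal sketch of that rule, the helper below collects streamed deltas and only consults the final payload when nothing was streamed. &lt;code&gt;ReplyText&lt;/code&gt; and its method names are hypothetical; only &lt;code&gt;canonicalText&lt;/code&gt; and &lt;code&gt;payloadText&lt;/code&gt; mirror names from the repo.&lt;/p&gt;

```typescript
// Hypothetical helper showing the "own buffer first, payload second" rule.
class ReplyText {
  canonicalText = "";

  // Call for every streamed delta.
  append(delta: string): void {
    this.canonicalText += delta;
  }

  // Call once with the final payload text, which may be empty.
  finish(payloadText: string): string {
    if (!this.canonicalText) {
      if (payloadText) this.canonicalText = payloadText;
    }
    return this.canonicalText;
  }
}

// Streaming worked: the final payload is empty, the buffer holds the text.
const streamed = new ReplyText();
streamed.append("Your order ");
streamed.append("is placed.");
const fromBuffer = streamed.finish("");

// Streaming did not happen: nothing was buffered, so use the payload.
const unstreamed = new ReplyText();
const fromPayload = unstreamed.finish("Your order is placed.");

console.log(fromBuffer === fromPayload);
```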



&lt;h2&gt;
  
  
  Murf Falcon
&lt;/h2&gt;

&lt;p&gt;Synthesis is the technical word for generating audio from text. Murf Falcon is the TTS model used in this build. Murf reports a model latency of 55 ms and a time-to-first-audio of 130 ms. Pricing is $0.01 per 1,000 characters, or roughly 1 cent per minute of generated audio.&lt;/p&gt;

&lt;h3&gt;
  
  
  Turn off OpenClaw's built-in TTS
&lt;/h3&gt;

&lt;p&gt;OpenClaw ships with its own TTS pipeline. By default it runs in &lt;code&gt;auto: "on"&lt;/code&gt; mode, which produces one final audio file at the end of a reply. That mode is incompatible with per-block streaming, so we turn it off (&lt;code&gt;openclaw.json&lt;/code&gt; lines 30 to 47):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"tts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"murf"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"auto"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"off"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"final"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"providers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"murf"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"voiceId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"en-IN-anusha"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FALCON"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"locale"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"en-IN"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"style"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Conversational"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;auto: "off"&lt;/code&gt;, the Murf provider stays loaded and configured. But your code is now in charge of synthesis. You call &lt;code&gt;murfProvider.synthesize()&lt;/code&gt; directly on each block.&lt;/p&gt;

&lt;h3&gt;
  
  
  Voice and locale compatibility
&lt;/h3&gt;

&lt;p&gt;A locale is a code that identifies a language and region together, like &lt;code&gt;en-IN&lt;/code&gt; for English in India or &lt;code&gt;es-MX&lt;/code&gt; for Spanish in Mexico.&lt;/p&gt;

&lt;p&gt;Falcon supports voices across many languages, but each voice is bound to its locale. If you set &lt;code&gt;voiceId&lt;/code&gt; to an English voice and &lt;code&gt;locale&lt;/code&gt; to &lt;code&gt;hi-IN&lt;/code&gt;, the API rejects the request. So if you change just one of the two when swapping voices, synthesis fails and the agent goes quiet with no obvious error.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Voice ID prefix&lt;/th&gt;
&lt;th&gt;Locale&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;en-IN-*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;en-IN&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Indian English. Used in this repo.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;en-US-*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;en-US&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;American English.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;en-UK-*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;en-UK&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;British English.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hi-IN-*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;hi-IN&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hindi.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;es-ES-*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;es-ES&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Spanish (Spain).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;es-MX-*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;es-MX&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Spanish (Mexico). Different voices than Spain.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The full list is in Murf's API docs. Before you change &lt;code&gt;voiceId&lt;/code&gt; in &lt;code&gt;openclaw.json&lt;/code&gt;, query &lt;code&gt;/v1/speech/voices?model=FALCON&lt;/code&gt; and pick a voice and its matching locale together.&lt;/p&gt;
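
&lt;p&gt;A cheap guard against changing only one of the two is to check the pair at startup. The sketch below assumes that a voice ID begins with its locale, as in &lt;code&gt;en-IN-anusha&lt;/code&gt; and the prefixes in the table. That is an assumption based on the table, not something the API guarantees, and both function names are hypothetical.&lt;/p&gt;

```typescript
// Assumption: a voice ID begins with its locale, as in "en-IN-anusha".
// `localeOfVoice` and `assertVoiceMatchesLocale` are hypothetical helpers.
function localeOfVoice(voiceId: string): string {
  const parts = voiceId.split("-");
  if (parts.length > 2) return parts[0] + "-" + parts[1];
  throw new Error("Unexpected voiceId format: " + voiceId);
}

function assertVoiceMatchesLocale(voiceId: string, locale: string): void {
  const expected = localeOfVoice(voiceId);
  if (expected !== locale) {
    throw new Error(
      "voiceId " + voiceId + " belongs to " + expected + ", but locale is " + locale
    );
  }
}

// Matching pair from openclaw.json: passes without throwing.
assertVoiceMatchesLocale("en-IN-anusha", "en-IN");

// Mismatched pair: fails loudly at startup instead of at synthesis time.
let mismatchMessage = "";
try {
  assertVoiceMatchesLocale("en-IN-anusha", "hi-IN");
} catch (err) {
  mismatchMessage = (err as Error).message;
}
console.log(mismatchMessage);
```

&lt;p&gt;The voices endpoint remains the source of truth. A check like this only catches the common mistake of editing one field and forgetting the other.&lt;/p&gt;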

&lt;h3&gt;
  
  
  Pick the right voice style
&lt;/h3&gt;

&lt;p&gt;Falcon exposes a &lt;code&gt;style&lt;/code&gt; parameter. Pick &lt;code&gt;Conversational&lt;/code&gt; for agent work. A voice that sounds great reading an audiobook usually sounds wrong in a back-and-forth conversation. Promotional and Narration styles sound theatrical when the agent is saying short things like "Sure, anything else?"&lt;/p&gt;

&lt;h3&gt;
  
  
  Two speaker outputs
&lt;/h3&gt;

&lt;p&gt;The pre-recorded filler audio masks the cold-start delay by playing while the LLM is still thinking. The problem is that filler clips vary in length, so the first real audio chunk can arrive before the filler finishes.&lt;/p&gt;

&lt;p&gt;If you play both through the same audio output, one of two bad things happens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The filler cuts off the start of the real reply, or&lt;/li&gt;
&lt;li&gt;The reply cuts off the end of the filler.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix is two separate audio outputs (&lt;code&gt;voice.ts&lt;/code&gt; lines 10 to 11):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;oneShotSpeaker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;InstanceType&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;DecibriOutput&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;streamSpeaker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;InstanceType&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;DecibriOutput&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;oneShotSpeaker&lt;/code&gt; plays fillers. &lt;code&gt;streamSpeaker&lt;/code&gt; plays the actual reply. When the first reply chunk arrives, &lt;code&gt;stopOneShotPlayback()&lt;/code&gt; stops the filler channel without touching the reply channel. Anything already queued on the reply channel keeps playing.&lt;/p&gt;
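
&lt;p&gt;The channel separation can be seen in isolation with mock outputs. This sketch does not use Decibri's API: &lt;code&gt;MockChannel&lt;/code&gt;, &lt;code&gt;playFiller&lt;/code&gt;, and &lt;code&gt;playReplyChunk&lt;/code&gt; are hypothetical stand-ins that only track what is queued on each channel.&lt;/p&gt;

```typescript
// Stand-in for an audio output. This is NOT Decibri's API, only a mock
// with a queue so the two-channel idea can be tested without hardware.
class MockChannel {
  queue: string[] = [];

  play(clip: string): void {
    this.queue.push(clip);
  }

  stop(): void {
    this.queue = [];
  }
}

const fillerChannel = new MockChannel(); // plays pre-recorded fillers
const replyChannel = new MockChannel(); // plays synthesised reply chunks
let replyStarted = false;

function playFiller(clip: string): void {
  fillerChannel.play(clip);
}

function playReplyChunk(chunk: string): void {
  if (!replyStarted) {
    // First real audio: silence the filler, leave the reply queue alone.
    replyStarted = true;
    fillerChannel.stop();
  }
  replyChannel.play(chunk);
}

playFiller("filler-clip");
playReplyChunk("chunk-1");
playReplyChunk("chunk-2");
console.log(fillerChannel.queue.length, replyChannel.queue);
```

&lt;p&gt;Because stopping the filler never touches the reply queue, neither clip can truncate the other.&lt;/p&gt;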

&lt;h3&gt;
  
  
  Synthesise in parallel, play back in order
&lt;/h3&gt;

&lt;p&gt;There are two layers of parallelism worth understanding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Within a single block.&lt;/strong&gt; Murf splits long input into chunks of up to 1500 characters and synthesises them at the same time on its own infrastructure. You do not have to do anything for this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Across blocks.&lt;/strong&gt; The repo starts synthesis calls the moment each block arrives. So multiple blocks can be synthesising at the same time. But the audio plays back in order through a Promise chain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dispatchChunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;trimmed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;trimmed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;streamingEnabled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;synthP&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;synthesizeSpeech&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;trimmed&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;emitChain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;emitChain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;audio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;synthP&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;streamedAnyAudio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nx"&gt;onAudioChunk&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;synthesizeSpeech()&lt;/code&gt; starts the Murf network call right away. &lt;code&gt;emitChain.then()&lt;/code&gt; queues playback of the current chunk behind every earlier chunk. So if chunk 2's synthesis finishes before chunk 1's, say because chunk 1's network call was slower, chunk 2 still plays second. Never first.&lt;/p&gt;
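
&lt;p&gt;The ordering guarantee is easy to check with a fake synthesiser. In this sketch, &lt;code&gt;fakeSynthesize&lt;/code&gt; stands in for the Murf call and the delays are made up so that the second chunk finishes first. Only the &lt;code&gt;emitChain&lt;/code&gt; pattern is taken from the repo.&lt;/p&gt;

```typescript
function sleep(ms: number) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Fake synthesis: resolves after `ms` milliseconds. Stands in for the
// real network call; the delays exist only to force out-of-order completion.
async function fakeSynthesize(text: string, ms: number) {
  await sleep(ms);
  return "audio(" + text + ")";
}

const played: string[] = [];
let emitChain = Promise.resolve();

function dispatchChunk(text: string, ms: number): void {
  // Start synthesis immediately, so chunks overlap on the network.
  const synthP = fakeSynthesize(text, ms).catch(() => null);
  // Queue playback behind every earlier chunk.
  emitChain = emitChain.then(async () => {
    const audio = await synthP;
    if (audio) played.push(audio);
  });
}

dispatchChunk("one", 40); // slower synthesis
dispatchChunk("two", 5); // finishes first, still plays second

// Resolves once every queued chunk has been emitted.
const allPlayed = emitChain.then(() => {
  console.log(played);
  return played;
});
```

&lt;p&gt;Both fake calls run concurrently, so the total wait is close to the slower of the two rather than their sum, while the order in &lt;code&gt;played&lt;/code&gt; follows dispatch order.&lt;/p&gt;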

&lt;h2&gt;
  
  
  Streaming-pipeline bugs and their root causes
&lt;/h2&gt;

&lt;p&gt;The video has a short error table for the bugs you hit during setup. This section covers the ones specific to the streaming pipeline that show up later, when the agent is mostly working.&lt;/p&gt;

&lt;h3&gt;
  
  
  WebSocket closes with code 1008 the moment audio starts
&lt;/h3&gt;

&lt;p&gt;Code 1008 means "policy violation," which Deepgram uses for invalid API keys. Check &lt;code&gt;DEEPGRAM_API_KEY&lt;/code&gt; in your environment, and check the Deepgram console for remaining credit.&lt;/p&gt;

&lt;h3&gt;
  
  
  WebSocket closes with code 1011 partway through a session
&lt;/h3&gt;

&lt;p&gt;Code 1011 means "internal server error," but in practice the most common cause is running out of credit mid-session. Top up and retry.&lt;/p&gt;
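
&lt;p&gt;Both close codes are easier to act on if the close handler prints a hint rather than a bare number. A minimal sketch, with the hints taken from the two sections above. &lt;code&gt;explainClose&lt;/code&gt; is a hypothetical helper, and the wiring in the comment assumes a standard &lt;code&gt;WebSocket&lt;/code&gt; close event.&lt;/p&gt;

```typescript
// Maps the two close codes discussed above to an actionable hint.
// `explainClose` is a hypothetical helper, not part of the repo.
function explainClose(code: number): string {
  switch (code) {
    case 1008:
      return "Policy violation: check DEEPGRAM_API_KEY and remaining credit.";
    case 1011:
      return "Internal error: often credit running out mid-session. Top up and retry.";
    default:
      return "WebSocket closed with code " + code + ".";
  }
}

// With a standard WebSocket this could be wired up as:
//   socket.addEventListener("close", (event) => console.error(explainClose(event.code)));
console.log(explainClose(1008));
console.log(explainClose(1011));
```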

&lt;h3&gt;
  
  
  Transcripts come back empty even though audio is sending
&lt;/h3&gt;

&lt;p&gt;Three things to check, in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Sample rate.&lt;/strong&gt; &lt;code&gt;sample_rate&lt;/code&gt; in the URL must match your microphone's actual rate. The repo captures at 16000. If your system is recording at 44100 or 48000, you have to resample before sending.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encoding.&lt;/strong&gt; The &lt;code&gt;encoding&lt;/code&gt; parameter and the audio format must match. &lt;code&gt;linear16&lt;/code&gt; expects 16-bit signed little-endian PCM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model.&lt;/strong&gt; &lt;code&gt;model&lt;/code&gt; must be &lt;code&gt;flux-general-en&lt;/code&gt; or &lt;code&gt;flux-general-multi&lt;/code&gt;. No other model name works on &lt;code&gt;/v2/listen&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
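
&lt;p&gt;The checklist above can be turned into code that makes the three settings hard to get wrong. This is a sketch under assumptions: the function names are hypothetical, the host is Deepgram's public API host, and &lt;code&gt;floatToPcm16&lt;/code&gt; only applies if your capture library hands you float samples between -1 and 1.&lt;/p&gt;

```typescript
// Builds the listen URL from the parameters in the checklist.
// `buildListenUrl` is a hypothetical helper, not a name from the repo.
function buildListenUrl(sampleRate: number, model: string): string {
  const allowed = ["flux-general-en", "flux-general-multi"];
  if (!allowed.includes(model)) {
    throw new Error("Unsupported model for /v2/listen: " + model);
  }
  const url = new URL("wss://api.deepgram.com/v2/listen");
  url.searchParams.set("model", model);
  url.searchParams.set("encoding", "linear16");
  url.searchParams.set("sample_rate", String(sampleRate));
  return url.toString();
}

// linear16 means 16-bit signed little-endian PCM, so float samples
// have to be clamped and converted before they are sent.
function floatToPcm16(samples: Float32Array): Uint8Array {
  const out = new DataView(new ArrayBuffer(samples.length * 2));
  samples.forEach((sample, i) => {
    const clamped = Math.max(-1, Math.min(1, sample));
    out.setInt16(i * 2, Math.round(clamped * 32767), true); // true = little-endian
  });
  return new Uint8Array(out.buffer);
}

const listenUrl = buildListenUrl(16000, "flux-general-en");
const pcm = floatToPcm16(new Float32Array([1, -1, 0]));
console.log(listenUrl, pcm.length);
```

&lt;p&gt;Pass the rate your audio actually has once it leaves your code. If the device records at 44100 or 48000, resample first and then pass 16000.&lt;/p&gt;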

&lt;h3&gt;
  
  
  The agent's first sentence plays, then nothing
&lt;/h3&gt;

&lt;p&gt;This is the coalescer holding onto your blocks. If you did not override &lt;code&gt;flushOnEnqueue&lt;/code&gt;, the first block flushes but nothing after it streams. Check &lt;code&gt;brain.ts&lt;/code&gt; for the coalesce override.&lt;/p&gt;

&lt;h3&gt;
  
  
  Audio plays out of order
&lt;/h3&gt;

&lt;p&gt;The Promise chain in &lt;code&gt;dispatchChunk&lt;/code&gt; is what keeps playback in order. If you removed the &lt;code&gt;emitChain.then(...)&lt;/code&gt; wrapper, or replaced it with code that plays each chunk as soon as its own synthesis promise resolves, chunks will play in synthesis-completion order instead of arrival order. Put the chain back.&lt;/p&gt;

&lt;h3&gt;
  
  
  The agent talks over itself
&lt;/h3&gt;

&lt;p&gt;This means the filler kept playing after the real reply started. Check that &lt;code&gt;stopOneShotPlayback()&lt;/code&gt; runs on the first chunk of the real reply, not at the end of the reply.&lt;/p&gt;

&lt;h3&gt;
  
  
  Voice cuts off mid-sentence
&lt;/h3&gt;

&lt;p&gt;Falcon synthesis can fail silently for a single chunk. The &lt;code&gt;.catch(() =&amp;gt; null)&lt;/code&gt; in &lt;code&gt;dispatchChunk&lt;/code&gt; protects you from one failed chunk crashing the whole reply. But if too many chunks fail, the user hears gaps. Log the failures and check Murf's status page.&lt;/p&gt;
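
&lt;p&gt;One way to log failures without giving up the protection is to count and report each failed chunk before returning &lt;code&gt;null&lt;/code&gt;. The sketch below uses a fake synthesiser that fails on demand. &lt;code&gt;flakySynthesize&lt;/code&gt;, &lt;code&gt;synthesizeLogged&lt;/code&gt;, and &lt;code&gt;failedChunks&lt;/code&gt; are hypothetical names, not part of the repo.&lt;/p&gt;

```typescript
let failedChunks = 0;

// Fake synthesis that fails whenever the text contains "fail".
// It stands in for the real TTS call so the sketch runs offline.
async function flakySynthesize(text: string) {
  if (text.includes("fail")) {
    throw new Error("synthesis failed for: " + text);
  }
  return "audio(" + text + ")";
}

// Same outcome as a catch handler that returns null, but every
// failure is counted and logged so gaps can be traced afterwards.
async function synthesizeLogged(text: string) {
  try {
    return await flakySynthesize(text);
  } catch (err) {
    failedChunks += 1;
    console.warn("TTS chunk failed (" + failedChunks + " so far):", (err as Error).message);
    return null;
  }
}

async function demo() {
  const ok = await synthesizeLogged("Sure, adding that.");
  const bad = await synthesizeLogged("please fail");
  return { ok, bad, failedChunks };
}

const demoResult = demo();
demoResult.then((result) => console.log(result));
```

&lt;p&gt;A running count also gives you a threshold to act on, for example falling back to a spoken apology once several chunks in one reply have failed.&lt;/p&gt;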

&lt;h3&gt;
  
  
  ALSA errors on Linux
&lt;/h3&gt;

&lt;p&gt;On minimal Linux installs, the ALSA development headers have to be installed before the npm package will build. &lt;code&gt;apt install libasound2-dev&lt;/code&gt; covers it on Debian-family distributions. If the install completes but the device is not found at runtime, the default ALSA device is probably pointing at an output that does not exist.&lt;/p&gt;

&lt;h3&gt;
  
  
  No audio on Windows
&lt;/h3&gt;

&lt;p&gt;Decibri on Windows uses WASAPI. If your default output device is a Bluetooth headset that is not currently connected, the stream opens silently and no audio plays. Switch the default device in Sound settings, or set the output device explicitly in code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Silent failure on macOS
&lt;/h3&gt;

&lt;p&gt;The first run asks for microphone permission. If you deny it, subsequent runs fail silently. The agent will appear to start normally and the WebSocket will connect, but no audio frames reach Deepgram. Check microphone permissions in System Settings under Privacy and Security.&lt;/p&gt;

&lt;h2&gt;
  
  
  Extending the agent to something that is not Swiggy
&lt;/h2&gt;

&lt;p&gt;It takes two changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Swap the skill.&lt;/strong&gt; The &lt;code&gt;agents.defaults.skills&lt;/code&gt; array in &lt;code&gt;openclaw.json&lt;/code&gt; is the list of MCP skills the agent can call. Remove the Swiggy skill, add a different one. A calendar scheduler imports a Google Calendar MCP skill. A GitHub PR merger imports the GitHub MCP skill. A Notion assistant imports the Notion MCP skill. The runtime does not change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rewrite the identity.&lt;/strong&gt; &lt;code&gt;workspace/IDENTITY.md&lt;/code&gt; is the system prompt. It describes who the agent is, what it does, what it refuses to do, and how it should format replies. Rewriting this file changes the agent's personality and its understanding of the task.&lt;/p&gt;

&lt;p&gt;For a calendar scheduler, you would describe an assistant that looks up free slots and confirms bookings. For a PR merger, you would describe a reviewer that summarises diffs and merges when checks pass.&lt;/p&gt;

&lt;p&gt;Everything else stays. The audio pipeline, the streaming coalescer, the keyterm bias, the two-channel playback. That is the value of keeping the voice layer separate from the agent layer. The voice layer does not care what the agent is doing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this pipeline does not fix
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Turn 1 latency is not solved.&lt;/strong&gt; Time-to-first-audio on a cold start is mostly caused by tool chains and LLM time-to-first-token, not by synthesis. The slow path still includes OpenClaw's cold start, the Swiggy MCP setup, and the LLM's first-token delay. Streaming synthesis cannot hide that. The filler audio can. That is why it is there.&lt;/p&gt;

&lt;p&gt;Getting to true sub-second first audio on turn 1 would require starting the OpenClaw runtime ahead of time, keeping the MCP connection alive across sessions, and starting tool calls before the user finishes speaking. None of those are in this repo. What is in this repo is the pattern that makes the problem manageable: split the audio pipeline from the agent pipeline, stream what can be streamed, mask the rest with fillers, and measure the result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Turn 2 onwards is a different story.&lt;/strong&gt; With the runtime warm and the MCP connection open, first audio arrives 5 to 10 seconds after the user stops talking. Falcon plus block streaming are why. That is the number that makes the agent usable in practice. The cold-start number is what makes every tutorial-shaped demo look slower than it will be in production.&lt;/p&gt;

&lt;p&gt;Block streaming, Falcon, and contextual keyterm biasing are three improvements that build on each other. Each does less than a demo suggests. Together they do more than any one of them alone. That is usually how voice pipelines work.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openclaw</category>
      <category>voice</category>
      <category>typescript</category>
    </item>
  </channel>
</rss>
