DEV Community: Sanchita Sunil

Building a Compliant BFSI Voice Agent

Sanchita Sunil — Thu, 16 Jul 2026 11:34:25 +0000

Quick links.
Code: https://github.com/murf-ai/murf-cookbook/tree/main/examples/agents/payment-reminder

Financial services is one of the most heavily regulated spaces to put a voice AI agent into. Whether it's a payment reminder, a KYC verification call or a fraud alert, the conversation is bound by rules.

The common thread across all of them is what your agent says, when it says it, and to whom, and none of these is just a UX choice anymore. Reveal an account detail before verifying who's on the line, pressure someone who has just disclosed a hardship, or say the wrong thing at the wrong moment, and you've crossed a line that carries real consequences.

Most people reach for a better prompt to fix this, which works most of the time. But a phone call is a live, irreversible conversation, and "most of the time" isn't good enough when it comes to compliance.

So the question isn't "how do I write a better prompt?" — it's "how do I build a voice agent that stays compliant even when the model ignores the prompt?"

In this tutorial, we'll build an outbound payment-reminder agent that calls a customer, verifies identity and offers a payment link, while handling disputes, hardship and wrong numbers the way the rules demand.

Architecture

The basic voice architecture for most agents remains the same: wire up the telephony, connect the LLM and write a system prompt telling the agent to "be a polite debt collector".

In financial services, a single prompt left to its own devices is a compliance violation waiting to happen. LLMs are probabilistic — they want to please the user, answer questions, and keep the conversation going. But in a highly regulated space, you need the agent to be deterministic.

To achieve this, we aren't just writing a prompt; we are building a 3-Layer Architecture.

Layer 1: The Foundation (Basic Outbound Call Agent + System Prompt)

This is our baseline. In this layer, we wire up Twilio, LiveKit, Murf, and the LLM, and give the agent its identity, its voice, and its core context. This layer handles the mechanics of listening, thinking, and speaking — but we can't trust it to govern the flow of the call.

Layer 2: The Enforcer (The State Machine)

This is the secret sauce of a compliant BFSI agent. Instead of hoping the LLM remembers to verify a user's identity before talking about debt, we hardcode a state machine. The conversation is broken into strict phases:

Greeting
Verification
Payment_Discussion
Outcome

The agent is locked out of the context required to discuss a payment until the Verification state is cleared. If the user tries to skip ahead, the state machine pulls the LLM back to the current required task, turning an unpredictable model into a strict, compliant workflow.

Layer 3: The Safety Net (Guardrails & Human Escalation)

Even with a state machine, the real world is messy. Callers get angry, they mention bankruptcy, or they threaten legal action. Layer 3 acts as our emergency brake. We implement semantic guardrails that constantly monitor the user's intent. If a trigger word or high-stress emotion is detected, the agent immediately stops generating responses and triggers an escalation protocol, gracefully terminating the call.

Setup & Requirements

Twilio

Twilio is a cloud communications platform that turns the global telecom network into simple software APIs, acting as a digital bridge between the internet and phone networks.

Setup:

Create an account at console.twilio.com or 1console.twilio.com. Copy your account SID, auth token and phone number, and paste them into the .env file.
On the console, go to TwiML Bins and create a bin.
Give it a name and put this in the TwiML section:

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Dial>
    <Sip>sip:<phone_number>@<sip_uri>;transport=tcp</Sip>
  </Dial>
</Response>

Replace <phone_number> with your Twilio phone number and <sip_uri> with the host part of your LiveKit SIP URI, without the sip: prefix (the template already supplies it). For example, if your phone number is +1234567 and your SIP URI is sip:abc123.sip.livekit.cloud, your TwiML section would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Dial>
    <Sip>sip:1234567@abc123.sip.livekit.cloud;transport=tcp</Sip>
  </Dial>
</Response>
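If you would rather generate the bin contents than edit them by hand, a small helper along these lines can fill in the template. It is only a sketch: `build_twiml` is a hypothetical name, and it mirrors the example above by dropping the leading + from the number and the sip: prefix from the URI.

```python
from xml.sax.saxutils import escape

TWIML_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Dial>
    <Sip>{target}</Sip>
  </Dial>
</Response>"""


def build_twiml(phone_number: str, sip_uri: str) -> str:
    """Fill the TwiML template with a phone number and a LiveKit SIP URI."""
    user = phone_number.strip().lstrip("+")          # "+1234567" -> "1234567"
    host = sip_uri.strip().removeprefix("sip:")      # keep only the host part
    target = f"sip:{user}@{host};transport=tcp"
    return TWIML_TEMPLATE.format(target=escape(target))
```

Printing `build_twiml("+1234567", "sip:abc123.sip.livekit.cloud")` gives text you can paste into the TwiML section.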

On 1console.twilio.com, go to Products & Services → Numbers & Senders, click your phone number, and in the Voice and emergency calling section click Edit configuration details.
Choose the configuration method that includes TwiML Bins, set the primary method to TwiML Bins, and select the bin you just created.
Go to Products & Services → Elastic SIP Trunking → Trunks and create a new trunk.
In the Termination tab of your trunk, set the Termination SIP URI and paste it into your .env file.
Scroll down to Credential Lists and create a new one. Copy the username and password into your .env file as well.

LiveKit

LiveKit is streaming infrastructure built on WebRTC that manages audio buffering and handles network drops gracefully.

Setup:

Create an account at cloud.livekit.io and create a project.
From your project settings, copy the URL, API key, API secret and SIP URI into your .env file.
On the console, go to Telephony → SIP Trunks and create a new trunk.
Give the trunk a name, select outbound, and put the Termination SIP URI of your Twilio trunk under addresses.
Add your Twilio phone number to Numbers, and under Optional Settings enter the username and password from the credential list you created in the Twilio trunk.

If you switch to the JSON editor, your JSON should look something like this:

{
  "name": "payment reminder",
  "address": "payment-reminder.pstn.twilio.com",
  "transport": "SIP_TRANSPORT_TCP",
  "numbers": [
    "+11234567"
  ],
  "authUsername": "payment-reminder",
  "authPassword": "********"
}

Go to Telephony → Dispatch Rules and create a new dispatch rule.
Give your rule a name, a prefix and an agent name. Remember your agent name: it must match the agent_name in your code (the value passed to WorkerOptions in agent.py and to the dispatch request in run.py), or the call will connect but all you'll hear is silence.

If you switch to the JSON editor, your dispatch rule should look something like this:

{
  "sipDispatchRuleId": "SDR_abc123",
  "rule": {
    "dispatchRuleIndividual": {
      "roomPrefix": "payment-"
    }
  },
  "trunkIds": [
    "ST_abc123"
  ],
  "name": "payment",
  "roomConfig": {
    "agents": [
      {
        "agentName": "payment-agent"
      }
    ]
  }
}

Copy this and save it as dispatch-rule.json in your project root directory.
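Because a mismatched agent name fails silently, it can be worth checking the saved rule against the name your worker registers with. The following is a minimal sketch under the assumption that the rule has the `roomConfig.agents[].agentName` shape shown above; `dispatch_agent_names` and `check_agent_name` are hypothetical helper names.

```python
def dispatch_agent_names(rule: dict) -> list[str]:
    """Collect every agentName listed in a dispatch rule's roomConfig."""
    agents = rule.get("roomConfig", {}).get("agents", [])
    return [agent["agentName"] for agent in agents if "agentName" in agent]


def check_agent_name(rule: dict, worker_agent_name: str) -> None:
    """Raise if the worker's agent name is not listed in the dispatch rule."""
    names = dispatch_agent_names(rule)
    if worker_agent_name not in names:
        raise ValueError(
            f"Worker registers as {worker_agent_name!r}, "
            f"but the dispatch rule lists {names!r}."
        )
```

To use it, load the file with `json.load(open("dispatch-rule.json"))` and pass the result together with the agent name from your worker code.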

Speech-To-Text (STT) — Deepgram Nova-3

We use Deepgram's Nova-3 as the STT for this project. Deepgram offers $200 in free credits when your account is created, which is more than enough to test and run this project. You can get your API key at console.deepgram.com.

LLM

You will require a Large Language Model to act as the central reasoning engine for your voice agent. This project is configured to support two major providers. To choose yours, set the LLM_PROVIDER variable in your .env file to one of:

openai (requires a paid API key)
gemini (an excellent choice to test with — Google offers a free tier)

If you wish to use a different provider, the code is structured so you can wire in any custom LLM.

Text-To-Speech (TTS) — Murf Falcon

Once our LLM decides exactly what to say, we need to convert that text back into natural human speech to stream down the phone line.

For this project, we are using Murf Falcon, a low-latency TTS model built specifically for real-time conversational applications, with a stated time to first audio of about 130 ms. Falcon is optimized to generate lifelike, natural-sounding human speech on the fly.

It achieves this by supporting continuous chunked audio streaming — the moment our LLM streams its first few words, Falcon immediately starts converting those tokens into audio packets and pushing them back into the LiveKit media stream. This ensures the caller hears a seamless, expressive response without awkward pauses.

Beyond just speed, Falcon also addresses a critical banking compliance requirement by offering localized data residency in India, ensuring that sensitive customer data never leaves the country.

Go to murf.ai/api, create an account and get your API key.

Project setup

Create your project directory and set up a standard Python virtual environment to keep your dependencies isolated:

mkdir payment-reminder-agent
cd payment-reminder-agent
python -m venv venv

# Activate it (Mac/Linux)
source venv/bin/activate

# Or on Windows (PowerShell):
# venv\Scripts\Activate.ps1

Create a requirements.txt file in your project root:

livekit-agents>=1.0.0
livekit-plugins-deepgram>=0.7.0
livekit-plugins-openai>=0.10.0
livekit-plugins-google>=0.6.0
livekit-plugins-silero>=0.7.0
livekit-murf>=0.1.0
livekit-api>=0.7.0
python-dotenv>=1.0.0
twilio>=9.0.0
psutil>=5.9.0

Then run:

pip install -r requirements.txt

Customer data

We'll set up two ways to store customer information.

For a single customer, create scenario_config.json in your project root:

{
  "companyName": "NovaFin",
  "agentName": "Asha",
  "agentVoice": "en-IN-anisha",
  "useCase": "payment_reminder",
  "language": "en",
  "customerName": "Aria",
  "accountEnding": "4321",
  "registeredMobileLastFour": "1234",
  "amountDue": "10000",
  "amountDueFormatted": "₹10,000",
  "dueDate": "June 21, 2026",
  "daysPastDue": 4,
  "scenario": "normal_reminder",
  "requireIdentityVerification": true,
  "recordingDisclosureRequired": true,
  "paymentLinkEnabled": true,
  "humanHandoffEnabled": true
}

For multiple customers, create a CSV file in your project root (the run commands later in this tutorial call it reminders.csv):

name,phone,amount_due,due_date,account_ending,registered_mobile_last_four
Aria Sharma,+910000000001,10000,"June 21, 2026",4321,1234
Rahul Verma,+910000000002,5000,"July 1, 2026",8765,5678
Priya Patel,+910000000003,15000,"June 28, 2026",2468,1357
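Reading this file back is mostly a matter of `csv.DictReader`, which already copes with the quoted dates that contain commas. The loader below is a sketch, not the cookbook's own implementation: `load_rows` and `REQUIRED_COLUMNS` are hypothetical names. It keeps every field as a string so that values such as an account ending with a leading zero survive intact.

```python
import csv
import io

REQUIRED_COLUMNS = (
    "name",
    "phone",
    "amount_due",
    "due_date",
    "account_ending",
    "registered_mobile_last_four",
)


def load_rows(source) -> list[dict]:
    """Read customer rows from an open text file, validating the header."""
    reader = csv.DictReader(source)
    missing = [col for col in REQUIRED_COLUMNS if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"CSV is missing columns: {missing}")
    rows = []
    for row in reader:
        cleaned = {col: (row[col] or "").strip() for col in REQUIRED_COLUMNS}
        if any(cleaned.values()):  # skip completely blank lines
            rows.append(cleaned)
    return rows


# Usage with an in-memory sample; for a real file, pass
# open("reminders.csv", newline="", encoding="utf-8") instead.
sample = io.StringIO(
    "name,phone,amount_due,due_date,account_ending,registered_mobile_last_four\n"
    'Aria Sharma,+910000000001,10000,"June 21, 2026",4321,1234\n'
)
sample_rows = load_rows(sample)
```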

Now that we have everything in place, let's dive into building this agent.

Layer 1: The Foundation

In this layer, we wire up the core telephony, give our agent a strict system prompt, and define the tools it can use. We'll split this across a few dedicated files to keep our logic completely decoupled from our telephony infrastructure.

Managing state and data (`data.py`)

In a real-world BFSI setup, you need to track exactly what happens on a call for auditing purposes. Did the user verify their identity? Did they promise to pay? Was a dispute raised?

Instead of parsing a massive text transcript after the fact, we use structured data logging. Create a data.py file. This file handles our environment variables, loads our config, and defines an OutcomeLog dataclass to strictly track the state of the call.

import contextvars
import json
import logging
import os
import re
import time
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


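# Path to the single-customer scenario file, resolved next to this module.
_CONFIG_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), "scenario_config.json")

# Context variables allow us to handle multiple simultaneous calls
# without their customer details bleeding into each other.
_call_ctx: contextvars.ContextVar[dict | None] = contextvars.ContextVar("call_cfg", default=None)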


def get_config() -> dict:
    """Read the customer/scenario config fresh from disk."""
    with open(_CONFIG_PATH, encoding="utf-8") as f:
        return json.load(f)


def set_call_context_config(cfg: dict) -> None:
    """Bind this call's config to the current context so parallel calls stay isolated."""
    _call_ctx.set(cfg)


def _effective_config() -> dict:
    """Fetches the specific caller's config for the current active call."""
    ctx_cfg = _call_ctx.get()
    return ctx_cfg if ctx_cfg is not None else get_config()


def is_identity_match(mobile_last_four: str, account_last_four: str) -> bool:
    """Identity is confirmed only when both the mobile and account last four match."""
    return (
        re.sub(r"\D", "", mobile_last_four) == str(_effective_config()["registeredMobileLastFour"]) and
        re.sub(r"\D", "", account_last_four) == str(_effective_config()["accountEnding"])
    )


@dataclass
class OutcomeLog:
    """Tracks the definitive outcome of the call for compliance and CRM syncing."""
    scenario: str = ""
    call_started: bool = False
    recording_disclosure_played: bool = False
    identity_verified: bool = False
    amount_disclosed: bool = False
    payment_link_sent: bool = False
    promise_to_pay_date: str | None = None
    dispute_detected: bool = False
    hardship_detected: bool = False
    outcome: str = "unknown"

    def save_to_file(self) -> str:
        os.makedirs("logs", exist_ok=True)
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"logs/{self.scenario}_{timestamp}.json"
        with open(filename, "w", encoding="utf-8") as f:
            json.dump(asdict(self), f, indent=2)
        return filename

Giving the agent some tools (`tools.py`)

LLMs are great at talking, but they can't natively do anything. To bridge the gap between conversation and action, we use function calling.

When the user says, "Yes, my account ends in 4321 and my phone ends in 1234," we don't want the LLM to just guess if that's correct. We want it to call a Python function to check against our data.

Create tools.py and define the actions the agent can take:

import logging
from typing import Annotated
from livekit.agents import RunContext, function_tool, get_job_context
from pydantic import Field
import data


logger = logging.getLogger(__name__)


def _get_outcome_log() -> data.OutcomeLog | None:
    job_ctx = get_job_context(required=False)
    return job_ctx.proc.userdata.get("outcome_log") if job_ctx else None


@function_tool()
async def verify_borrower_identity(
    mobile_last_four: Annotated[str, Field(description="Last four digits of registered mobile")],
    account_last_four: Annotated[str, Field(description="Last four digits of account number")],
    context: RunContext,
) -> str:
    """Verify the borrower's identity before disclosing financial details."""
    outcome_log = _get_outcome_log()

    if data.is_identity_match(mobile_last_four, account_last_four):
        if outcome_log:
            outcome_log.identity_verified = True
        return "Thank you, your identity has been confirmed."

    return (
        "I'm sorry, those details don't match what we have on file. "
        "Could you please give me the last four digits of your registered mobile number "
        "and your account number again?"
    )


@function_tool()
async def flag_hardship(
    reason: Annotated[str, Field(description="Brief reason the borrower gave for their hardship.")],
    context: RunContext,
) -> str:
    """Flag the account for hardship and arrange a human callback."""
    outcome_log = _get_outcome_log()
    if outcome_log:
        outcome_log.hardship_detected = True
        outcome_log.outcome = "hardship_detected"

    return (
        "I understand, and I'm truly sorry to hear that. "
        "I've flagged your account and a member of our team will call you back to discuss your options."
    )


@function_tool()
async def send_payment_link(
    context: RunContext,
) -> str:
    """Send the official payment link to the borrower's registered mobile number."""
    outcome_log = _get_outcome_log()
    # Compliance backstop: never send the link before identity is verified.
    if outcome_log and not outcome_log.identity_verified:
        return "I can't send the payment link until your identity is verified."
    if outcome_log:
        outcome_log.payment_link_sent = True
    return (
        "I've just sent the payment link to your registered mobile number. "
        "Please check your messages."
    )

The brain (`prompt.py`)

Now we define the system prompt. In Layer 1, this is our primary defense mechanism. Notice how explicit and strict the instructions are. We don't just tell it what to do; we explicitly command it on what it must not do (for example: "Do not reveal the customer's full name, account number, amount due or overdue status until their identity has been verified").

def build_payment_prompt(config: dict) -> str:
    return (
        f"You are {config['agentName']}, an automated payment assistance voice agent from {config['companyName']}.\n\n"
        "CORE RULES — NEVER BREAK THESE\n"
        "- Do not reveal the customer's full name, account number, amount due, or overdue status\n"
        "  to anyone until identity has been verified.\n"
        "- If the borrower mentions job loss, inability to pay, or medical emergency,\n"
        "  call flag_hardship immediately. Do not pressure them.\n"
        "- Never make threats, use intimidating language, or shame the borrower.\n\n"
        "Follow this exact order:\n"
        f"1. Introduce yourself. State the call may be recorded. Ask if you are speaking with {config['customerName']}.\n"
        "2. If confirmed, ask for the last four digits of their mobile and account number.\n"
        "3. Call verify_borrower_identity.\n"
        f"4. Only after verification: mention the amount due ({config['amountDueFormatted']}) and due date ({config['dueDate']}).\n"
        "5. Offer to send the official payment link.\n"
        "6. Close the call politely.\n\n"
        "LANGUAGE AND TONE\n"
        "Keep every response to 1-2 sentences. Speak clearly and calmly. Do not use filler "
        'phrases like "Certainly!" or "Of course!".'
    )

Wiring it all together (`agent.py`)

This is where it all comes together. We spin up a LiveKit worker that manages the real-time media streams. It handles voice activity detection (VAD), pipes audio to our STT, passes text to the LLM, and streams Murf Falcon's audio back into the phone call.

One crucial detail: we must wait for the media track. When SIP calls connect, the network often takes half a second to start flowing RTP packets, so if your agent speaks the instant the line connects, the first two words of your greeting will be chopped off. We handle this by specifically waiting for the audio track to subscribe before speaking.

import asyncio
import os
import time
from livekit import rtc
from livekit.agents import (
    Agent,
    AgentSession,
    AutoSubscribe,
    JobContext,
    JobProcess,
    WorkerOptions,
    cli,
)
from livekit.plugins import deepgram, google, openai, silero, murf
import data
from prompt import build_payment_prompt
from tools import verify_borrower_identity, flag_hardship, send_payment_link


def prewarm(proc: JobProcess) -> None:
    """Load the VAD once per worker process, before any call comes in."""
    proc.userdata["vad"] = silero.VAD.load()


async def entrypoint(ctx: JobContext) -> None:
    # 1. Connect to the LiveKit Room (AutoSubscribe lives in livekit.agents, not livekit.rtc)
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    # 2. Setup call context and logging
    call_cfg = data.get_config()
    data.set_call_context_config(call_cfg)
    ctx.proc.userdata["outcome_log"] = data.OutcomeLog(scenario=call_cfg["scenario"])

    # 3. Initialize the Pipeline Models
    tts_instance = murf.TTS(voice=call_cfg["agentVoice"], streaming=True)

    session = AgentSession(
        # Deepgram Nova-3, the STT chosen in the setup section
        stt=deepgram.STT(model="nova-3"),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=tts_instance,
        vad=ctx.proc.userdata["vad"],
    )

    prompt = build_payment_prompt(call_cfg)
    opening_line = (
        f"Hello, this is {call_cfg['agentName']}, an automated payment assistance agent from "
        f"{call_cfg['companyName']}. This call may be recorded for quality and compliance. "
        f"Am I speaking with {call_cfg['customerName']}?"
    )

    # 4. Start the Agent Pipeline
    agent = Agent(
        instructions=prompt,
        tools=[verify_borrower_identity, flag_hardship, send_payment_link],
    )
    await session.start(agent, room=ctx.room)

    # 5. Execute the Outbound SIP Dial
    phone_number = "+1234567890"  # Retrieved dynamically in full code
    await _dial_and_greet(ctx, session, opening_line, phone_number)


async def _dial_and_greet(ctx, session, opening_line, phone_number):
    """Dials via Twilio SIP trunk and handles the initial greeting."""
    # Wait for the audio track so the greeting doesn't clip!
    # The listener is registered before dialing so the event can't be missed.
    track_ready = asyncio.Event()

    def _on_track_subscribed(track, publication, participant):
        # One way to detect that the callee's audio has arrived.
        if track.kind == rtc.TrackKind.KIND_AUDIO:
            track_ready.set()

    ctx.room.on("track_subscribed", _on_track_subscribed)

    # (SIP connection logic omitted for brevity...)

    await asyncio.wait_for(track_ready.wait(), timeout=3.0)
    await asyncio.sleep(0.7)  # short buffer so RTP is flowing before we speak

    # Agent speaks before activating STT so it doesn't transcribe itself
    handle = session.say(opening_line, allow_interruptions=False)
    await handle.wait_for_playout()

    # Now start listening to the user
    session.room_io.set_participant("phone-user")


if __name__ == "__main__":
    cli.run_app(
        WorkerOptions(
            entrypoint_fnc=entrypoint,
            prewarm_fnc=prewarm,
            agent_name="payment-agent",
        )
    )

Running it (`run.py`)

Finally, to orchestrate everything, we create a run.py script. This script triggers the LiveKit dispatches, effectively telling our agent: "Hey, call this number, and here is the customer data payload to inject into agent.py." Because LiveKit abstracts the infrastructure, we can place a single call, work through a CSV file one call at a time, or dispatch a whole campaign in parallel.

First, we handle our imports and define strictly how we parse phone numbers and dates. In telephony, formatting is everything. If you don't enforce E.164 formatting (+1234567890), your SIP trunk will reject the call:

import argparse
import asyncio
import csv
import json
import logging
import os
import re
import subprocess
import sys
import time
import uuid
from datetime import date, datetime

from dotenv import load_dotenv


_ROOT = os.path.dirname(os.path.abspath(__file__))
sys.path.insert(0, _ROOT)
load_dotenv(os.path.join(_ROOT, ".env"), override=False)

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("run")

_CONFIG_PATH = os.path.join(_ROOT, "scenario_config.json")
_E164_RE = re.compile(r"^\+\d{7,15}$")


def _clean_phone(raw: str) -> str:
    """Return the phone in E.164 format or raise with a helpful message."""
    phone = raw.strip()
    if not _E164_RE.match(phone):
        raise ValueError(
            f"Invalid phone number {phone!r}.\n"
            "  Phone must be in E.164 format: +CountryCodeNumber (e.g. +911234567890).\n"
            "  If you edited the CSV in Excel, it likely converted the number to scientific\n"
            "  notation. Open the file in Notepad/VS Code instead."
        )
    return phone


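# (Helper functions like _parse_due_date, _days_past_due, _format_inr, _row_to_meta,
# _load_csv and _wait_for_agent_ready go here. They sanitize data before it is passed
# to the agent and check that the worker is up; they are omitted for brevity.)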

The LiveKit dispatcher is the bridge between our Python script and LiveKit's cloud infrastructure. When we call _dispatch, we are telling LiveKit to spin up a new room and attach our payment-agent to it, injecting the specific customer's metadata (name, amount due, phone number) straight into the room context.

async def _dispatch(lk, room_name: str, meta: dict) -> None:
    from livekit import api as lk_api
    await lk.room.create_room(lk_api.CreateRoomRequest(name=room_name))
    dispatch = await lk.agent_dispatch.create_dispatch(
        lk_api.CreateAgentDispatchRequest(
            agent_name="payment-agent",
            room=room_name,
            metadata=json.dumps(meta),
        )
    )
    logger.info("Dispatched → room=%s  dispatch=%s", room_name, dispatch.id)


async def _wait_for_room_close(
    lk, room_name: str, timeout_s: int = 600, agent_proc: "subprocess.Popen | None" = None
) -> bool:
    """Poll every 5s until the room disappears (call ended) or timeout."""
    from livekit import api as lk_api
    await asyncio.sleep(15)  # initial wait — dial + ring + answer takes ~10s
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # Check if agent crashed
        if agent_proc is not None and agent_proc.poll() is not None:
            logger.warning("Agent process exited — deleting room %s and moving on", room_name)
            try:
                await lk.room.delete_room(lk_api.DeleteRoomRequest(room=room_name))
            except Exception:
                pass
            return False

        # Check if room still exists
        try:
            result = await lk.room.list_rooms(lk_api.ListRoomsRequest(names=[room_name]))
            if not result.rooms:
                return True
        except Exception as exc:
            logger.warning("Room poll error: %s", exc)
        await asyncio.sleep(5)
    return False

We'll add three different execution modes:

Single phone call
Sequential phone calls from a CSV file
Parallel phone calls from a CSV file

async def _run_single(phone: str) -> None:
    from livekit import api as lk_api
    phone = _clean_phone(phone)
    with open(_CONFIG_PATH, encoding="utf-8") as f:
        cfg = json.load(f)
    amount   = str(cfg["amountDue"])
    due_date = cfg["dueDate"]
    meta = {
        "phone_number":                phone,
        "customer_name":               cfg["customerName"],
        "amount_due":                  amount,
        "amount_due_formatted":        _format_inr(amount),
        "due_date":                    due_date,
        "days_past_due":               _days_past_due(due_date),
        "account_ending":              str(cfg["accountEnding"]),
        "registered_mobile_last_four": str(cfg["registeredMobileLastFour"]),
    }
    lk = lk_api.LiveKitAPI(
        url=os.environ["LIVEKIT_URL"],
        api_key=os.environ["LIVEKIT_API_KEY"],
        api_secret=os.environ["LIVEKIT_API_SECRET"],
    )
    try:
        room_name = f"payment-{uuid.uuid4().hex[:8]}"
        print(f"\nCalling {phone} (room: {room_name}) …")
        await _dispatch(lk, room_name, meta)
        print("Call dispatched — your phone will ring in ~5 s.")
    finally:
        await lk.aclose()


async def _run_sequential(rows: list[dict], proc: list | None = None) -> None:
    from livekit import api as lk_api
    lk = lk_api.LiveKitAPI(
        url=os.environ["LIVEKIT_URL"],
        api_key=os.environ["LIVEKIT_API_KEY"],
        api_secret=os.environ["LIVEKIT_API_SECRET"],
    )
    try:
        for i, row in enumerate(rows, 1):
            name = row["name"].strip()
            phone = row["phone"].strip()

            # (Agent restart recovery logic goes here...)

            print(f"\n[{i}/{len(rows)}] Calling {name} ({phone}) …")
            room_name = f"payment-{uuid.uuid4().hex[:8]}"
            meta = _row_to_meta(row)
            await _dispatch(lk, room_name, meta)

            print(f"        Waiting for call to finish (room: {room_name}) …")
            agent_p = proc[0] if proc else None
            closed = await _wait_for_room_close(lk, room_name, agent_proc=agent_p)
            status = "done" if closed else "timed out"
            print(f"        {name}: {status}.")
    finally:
        await lk.aclose()


async def _run_parallel(rows: list[dict]) -> None:
    from livekit import api as lk_api
    lk = lk_api.LiveKitAPI(
        url=os.environ["LIVEKIT_URL"],
        api_key=os.environ["LIVEKIT_API_KEY"],
        api_secret=os.environ["LIVEKIT_API_SECRET"],
    )
    try:
        print(f"\nDispatching {len(rows)} calls simultaneously …")
        tasks = []
        for row in rows:
            room_name = f"payment-{uuid.uuid4().hex[:8]}"
            tasks.append(_dispatch(lk, room_name, _row_to_meta(row)))
        await asyncio.gather(*tasks)
        print(f"All {len(rows)} calls dispatched.")
    finally:
        await lk.aclose()
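`_run_parallel` fires every dispatch at once, which is fine for a handful of rows. For a larger CSV you may want a cap on how many dispatches are in flight at the same time. The sketch below shows one common way to do that with `asyncio.Semaphore`. It is an optional variation, not part of the cookbook code, and `dispatch_bounded` and `max_in_flight` are hypothetical names.

```python
import asyncio
from collections.abc import Awaitable, Callable, Iterable


async def dispatch_bounded(
    jobs: Iterable,
    dispatch: Callable[[object], Awaitable],
    max_in_flight: int = 10,
) -> list:
    """Run dispatch(job) for every job, with at most max_in_flight running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def _one(job):
        async with semaphore:
            return await dispatch(job)

    # return_exceptions=True keeps one failed dispatch from cancelling the rest;
    # failures come back as exception objects in the result list.
    return await asyncio.gather(*(_one(job) for job in jobs), return_exceptions=True)
```

In run.py, `dispatch` could be a small wrapper that builds a room name and calls `_dispatch(lk, room_name, _row_to_meta(row))` for each row.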

Finally, we wrap it all in an easy-to-use command line interface. We spin up the agent worker process in the background, wait for it to connect to LiveKit, and then execute our chosen mode.

def _start_agent() -> subprocess.Popen:
    """Start the LiveKit agent worker in the background (same terminal output)."""
    return subprocess.Popen(
        [sys.executable, os.path.join(_ROOT, "agent.py"), "start"],
    )


async def main() -> None:
    parser = argparse.ArgumentParser(description="Payment reminder system")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--to", metavar="PHONE", help="Single call: E.164 phone number")
    group.add_argument("--csv", metavar="FILE", help="Campaign CSV file")
    parser.add_argument("--mode", choices=["sequential", "parallel"], default="sequential")
    args = parser.parse_args()

    print("\nStarting agent worker …")
    proc = [_start_agent()]

    from livekit import api as lk_api
    lk = lk_api.LiveKitAPI(
        url=os.environ["LIVEKIT_URL"],
        api_key=os.environ["LIVEKIT_API_KEY"],
        api_secret=os.environ["LIVEKIT_API_SECRET"],
    )

    try:
        await _wait_for_agent_ready(lk)  # Ensures LiveKit is listening
        print("Agent ready.\n")
        await lk.aclose()

        if args.csv:
            rows = _load_csv(args.csv)
            if args.mode == "parallel":
                await _run_parallel(rows)
            else:
                await _run_sequential(rows, proc=proc)
        else:
            await _run_single(args.to)

        print("\nAgent is running — press Ctrl+C when done.")
        while True:
            if proc[0].poll() is not None:
                break
            await asyncio.sleep(2)

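    except (KeyboardInterrupt, asyncio.CancelledError):
        # On Python 3.11+, asyncio.run() delivers Ctrl+C to this coroutine as a cancellation.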
        print("\nStopping …")
    finally:
        if proc[0].poll() is None:
            proc[0].terminate()


if __name__ == "__main__":
    asyncio.run(main())

To run the agent, use one of the three commands below:

python run.py --to +911234567890
python run.py --csv reminders.csv --mode sequential
python run.py --csv reminders.csv --mode parallel

Note: Don't use Excel to edit the CSV files — phone numbers get converted to scientific notation, which throws an error. Edit them in Notepad or your code editor.

Layer 2: The Enforcer

In this layer, we implement a state machine. Instead of hoping the model follows the required flow, we take away its autonomy. The core mechanism here is data starvation: the financial data is kept out of the LLM's context until the state machine confirms the caller's identity.
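A minimal way to picture data starvation, assuming the key names from scenario_config.json: before the prompt is built, the config is filtered so that financial fields simply do not exist in what the model sees. `SENSITIVE_KEYS` and `visible_config` are hypothetical names for this sketch, and the real agent may organize this differently.

```python
# Fields the LLM must not see before identity verification (assumed selection).
SENSITIVE_KEYS = frozenset(
    {
        "accountEnding",
        "registeredMobileLastFour",
        "amountDue",
        "amountDueFormatted",
        "dueDate",
        "daysPastDue",
    }
)


def visible_config(cfg: dict, identity_verified: bool) -> dict:
    """Return the subset of the config that may be placed in the prompt."""
    if identity_verified:
        return dict(cfg)
    # Unverified: drop every sensitive field instead of asking the model to ignore it.
    return {key: value for key, value in cfg.items() if key not in SENSITIVE_KEYS}
```

The customer's name stays visible because the opening line needs it, while the verification digits are only ever compared inside the tool and never handed to the model.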

Defining the rules (`state_machine.py`)

First, we map the entire lifecycle of a compliance-bound phone call into discrete, locked phases. Create a state_machine.py file. Instead of letting the LLM wander, we define exact states, the only actions allowed in those states, and the strict paths it can take to move forward.

from __future__ import annotations

import logging
from enum import Enum

logger = logging.getLogger(__name__)


class CallState(str, Enum):
    PRE_CALL_CHECK = "PRE_CALL_CHECK"
    OPENING_DISCLOSURE = "OPENING_DISCLOSURE"
    IDENTITY_VERIFICATION = "IDENTITY_VERIFICATION"
    PAYMENT_CONTEXT = "PAYMENT_CONTEXT"
    INTENT_CLASSIFICATION = "INTENT_CLASSIFICATION"
    SEND_PAYMENT_LINK = "SEND_PAYMENT_LINK"
    PROMISE_TO_PAY = "PROMISE_TO_PAY"
    DISPUTE_INTAKE = "DISPUTE_INTAKE"
    HARDSHIP_ESCALATION = "HARDSHIP_ESCALATION"
    WRONG_PERSON_END = "WRONG_PERSON_END"
    HUMAN_HANDOFF = "HUMAN_HANDOFF"
    CALL_SUMMARY = "CALL_SUMMARY"


# Once we reach one of these, the call is winding down — no more payment flow.
TERMINAL_STATES: frozenset[CallState] = frozenset(
    {CallState.WRONG_PERSON_END, CallState.HUMAN_HANDOFF, CallState.CALL_SUMMARY}
)


# What the agent is allowed to do in each state. The prompt is rebuilt from this
# every time we transition, so the LLM only ever sees the actions for the current phase.
ALLOWED_ACTIONS: dict[CallState, list[str]] = {
    CallState.PRE_CALL_CHECK: [],
    CallState.OPENING_DISCLOSURE: [
        "introduce yourself and the company",
        "state the call may be recorded",
        "ask if speaking with the customer by name",
    ],
    CallState.IDENTITY_VERIFICATION: [
        "ask for the last four digits of the registered mobile number",
        "ask for the last four digits of the account number",
        "call verify_borrower_identity",
    ],
    CallState.PAYMENT_CONTEXT: [
        "state amount due and due date",
        "offer to send the payment link",
    ],
    CallState.INTENT_CLASSIFICATION: [
        "listen for payment intent, dispute, hardship, or stop-calling",
        "route to the appropriate next state",
    ],
    CallState.SEND_PAYMENT_LINK: [
        "call send_payment_link",
        "confirm the link was sent to the registered number",
    ],
    CallState.PROMISE_TO_PAY: [
        "ask for a commitment date",
        "call log_promise_to_pay",
    ],
    CallState.DISPUTE_INTAKE: [
        "call create_dispute_ticket",
        "stop all payment reminder language",
    ],
    CallState.HARDSHIP_ESCALATION: [
        "call flag_hardship",
        "do not pressure the borrower",
    ],
    CallState.WRONG_PERSON_END: [
        "call end_call_wrong_person",
        "apologise and end the call",
    ],
    CallState.HUMAN_HANDOFF: [
        "call transfer_to_human",
        "inform the borrower they are being transferred",
    ],
    CallState.CALL_SUMMARY: [
        "summarise the outcome",
        "thank the borrower and close the call politely",
    ],
}


# Strict one-way streets. You cannot jump from Opening straight to Payment Context —
# the only path to money talk runs through IDENTITY_VERIFICATION.
VALID_TRANSITIONS: dict[CallState, set[CallState]] = {
    CallState.PRE_CALL_CHECK: {
        CallState.OPENING_DISCLOSURE,
    },
    CallState.OPENING_DISCLOSURE: {
        CallState.IDENTITY_VERIFICATION,
        CallState.WRONG_PERSON_END,
        CallState.DISPUTE_INTAKE,
        CallState.HARDSHIP_ESCALATION,
        CallState.HUMAN_HANDOFF,
    },
    CallState.IDENTITY_VERIFICATION: {
        CallState.PAYMENT_CONTEXT,
        CallState.WRONG_PERSON_END,
        CallState.DISPUTE_INTAKE,
        CallState.HARDSHIP_ESCALATION,
        CallState.HUMAN_HANDOFF,
    },
    CallState.PAYMENT_CONTEXT: {
        CallState.INTENT_CLASSIFICATION,
        CallState.DISPUTE_INTAKE,
        CallState.HARDSHIP_ESCALATION,
        CallState.HUMAN_HANDOFF,
    },
    CallState.INTENT_CLASSIFICATION: {
        CallState.SEND_PAYMENT_LINK,
        CallState.PROMISE_TO_PAY,
        CallState.DISPUTE_INTAKE,
        CallState.HARDSHIP_ESCALATION,
        CallState.HUMAN_HANDOFF,
        CallState.CALL_SUMMARY,
    },
    CallState.SEND_PAYMENT_LINK: {
        CallState.PROMISE_TO_PAY,
        CallState.CALL_SUMMARY,
        CallState.DISPUTE_INTAKE,
        CallState.HARDSHIP_ESCALATION,
        CallState.HUMAN_HANDOFF,
    },
    CallState.PROMISE_TO_PAY: {
        CallState.CALL_SUMMARY,
        CallState.HUMAN_HANDOFF,
    },
    CallState.DISPUTE_INTAKE: {
        CallState.HUMAN_HANDOFF,
        CallState.CALL_SUMMARY,
    },
    CallState.HARDSHIP_ESCALATION: {
        CallState.HUMAN_HANDOFF,
        CallState.CALL_SUMMARY,
    },
    CallState.WRONG_PERSON_END: set(),
    CallState.HUMAN_HANDOFF: set(),
    CallState.CALL_SUMMARY: set(),
}


class CallStateMachine:
    def __init__(self) -> None:
        self._state: CallState = CallState.PRE_CALL_CHECK
        self._history: list[str] = [CallState.PRE_CALL_CHECK.value]

    @property
    def current_state(self) -> CallState:
        return self._state

    @property
    def allowed_actions(self) -> list[str]:
        return list(ALLOWED_ACTIONS[self._state])

    def transition(self, new_state: CallState) -> bool:
        if new_state not in VALID_TRANSITIONS.get(self._state, set()):
            logger.warning(
                "Invalid transition %s -> %s — staying in %s",
                self._state.value,
                new_state.value,
                self._state.value,
            )
            return False
        logger.info("State: %s -> %s", self._state.value, new_state.value)
        self._state = new_state
        self._history.append(new_state.value)
        return True

    def is_terminal(self) -> bool:
        return self._state in TERMINAL_STATES

    def history(self) -> list[str]:
        return list(self._history)
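
One way to check that the table enforces the claim in its comment (no money talk without verification) is to treat VALID_TRANSITIONS as a graph and search it. The sketch below is not part of the cookbook code. It uses a condensed, string-keyed copy of the table so it runs on its own, and the names TRANSITIONS, MONEY_STATES and reachable are illustrative.

```python
from __future__ import annotations

# Condensed copy of VALID_TRANSITIONS, keyed by plain strings so that this
# sketch is self-contained. Keep it in sync with state_machine.py if you use it.
_ESCAPES = {"DISPUTE_INTAKE", "HARDSHIP_ESCALATION", "HUMAN_HANDOFF"}
TRANSITIONS: dict[str, set[str]] = {
    "PRE_CALL_CHECK": {"OPENING_DISCLOSURE"},
    "OPENING_DISCLOSURE": {"IDENTITY_VERIFICATION", "WRONG_PERSON_END"} | _ESCAPES,
    "IDENTITY_VERIFICATION": {"PAYMENT_CONTEXT", "WRONG_PERSON_END"} | _ESCAPES,
    "PAYMENT_CONTEXT": {"INTENT_CLASSIFICATION"} | _ESCAPES,
    "INTENT_CLASSIFICATION": {"SEND_PAYMENT_LINK", "PROMISE_TO_PAY", "CALL_SUMMARY"} | _ESCAPES,
    "SEND_PAYMENT_LINK": {"PROMISE_TO_PAY", "CALL_SUMMARY"} | _ESCAPES,
    "PROMISE_TO_PAY": {"CALL_SUMMARY", "HUMAN_HANDOFF"},
    "DISPUTE_INTAKE": {"HUMAN_HANDOFF", "CALL_SUMMARY"},
    "HARDSHIP_ESCALATION": {"HUMAN_HANDOFF", "CALL_SUMMARY"},
    "WRONG_PERSON_END": set(),
    "HUMAN_HANDOFF": set(),
    "CALL_SUMMARY": set(),
}

# States in which the agent is allowed to talk about the debt.
MONEY_STATES = {"PAYMENT_CONTEXT", "INTENT_CLASSIFICATION", "SEND_PAYMENT_LINK", "PROMISE_TO_PAY"}


def reachable(
    start: str,
    transitions: dict[str, set[str]],
    blocked: frozenset[str] = frozenset(),
) -> set[str]:
    """Return every state reachable from start when blocked states are removed."""
    seen: set[str] = set()
    stack = [start]
    while stack:
        state = stack.pop()
        if state in seen or state in blocked:
            continue
        seen.add(state)
        stack.extend(transitions.get(state, ()))
    return seen


# Remove IDENTITY_VERIFICATION from the graph and see what is still reachable.
without_verification = reachable(
    "PRE_CALL_CHECK", TRANSITIONS, blocked=frozenset({"IDENTITY_VERIFICATION"})
)
leaks = without_verification & MONEY_STATES  # empty set means no bypass exists
```

Running a check like this in a unit test would catch a future edit that accidentally adds a shortcut around verification.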

Changes to the prompt (`prompt.py`)

Now we rewrite our build_payment_prompt function to accept the current state, the actions allowed in that state, and a boolean flag: identity_verified. Notice the if identity_verified: block. Until the state machine has cleared the caller, the amount and due date are never interpolated into the prompt, so the LLM has no real figures to leak.

Replace the entire build_payment_prompt function from Layer 1 with this:

def build_payment_prompt(
    config: dict,
    identity_verified: bool,
    current_state: str,
    allowed_actions: list[str],
) -> str:
    agent_name = config["agentName"]
    company = config["companyName"]
    customer = config["customerName"]
    amount = config["amountDueFormatted"]
    due = config["dueDate"]

    if identity_verified:
        flow_step4 = f"4. Only after verification: mention the amount due ({amount}) and due date ({due})."
        identity_block = (
            "--- IDENTITY STATUS ---\n"
            "{\n"
            '  "identity_verified": true,\n'
            f'  "customer_name": "{customer}",\n'
            f'  "amount_due": "{amount}",\n'
            f'  "due_date": "{due}"\n'
            "}"
        )
    else:
        # The agent is physically locked out of the financial data.
        flow_step4 = "4. Only after verification: mention the amount due (it will appear in the status block once verified)."
        identity_block = (
            "--- IDENTITY STATUS ---\n"
            "{\n"
            '  "identity_verified": false,\n'
            '  "instruction": "Do not reveal any account details until identity is verified."\n'
            "}"
        )

    actions_list = "\n".join(f"- {a}" for a in allowed_actions) or "- (no actions available in this state)"
    state_block = (
        "---\n"
        f"CURRENT STATE: {current_state}\n\n"
        "IN THIS STATE YOU MAY ONLY:\n"
        f"{actions_list}\n\n"
        "Do not perform actions from other states. Do not skip states.\n"
        "---"
    )

    return (
        f"You are {agent_name}, an automated payment assistance voice agent from {company}.\n\n"
        "CORE RULES — NEVER BREAK THESE\n"
        "- Do not reveal the customer's name, account number, amount due, or overdue status\n"
        "  to anyone until identity has been verified.\n"
        "- If the borrower mentions job loss, inability to pay, or a medical emergency,\n"
        "  call flag_hardship immediately. Do not pressure them.\n"
        "- Never make threats, use intimidating language, or shame the borrower.\n"
        "- Never mention the borrower's family, employer, or references.\n\n"
        "CONVERSATION FLOW\n"
        f"1. Introduce yourself and the company. State the call may be recorded. Ask if you are speaking with {customer}.\n"
        "2. If confirmed, ask for the last four digits of their registered mobile number and account number.\n"
        "3. Call verify_borrower_identity with both sets of digits.\n"
        f"{flow_step4}\n"
        "5. Offer to send the official payment link with send_payment_link.\n"
        "6. Close the call politely.\n\n"
        f"{identity_block}\n\n"
        f"{state_block}\n\n"
        "LANGUAGE AND TONE\n"
        "Keep every response to 1-2 sentences. Speak clearly and calmly. Do not use filler\n"
        'phrases like "Certainly!" or "Of course!".'
    )
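
The identity block above is assembled by hand with f-strings, which is fine for well-behaved values but produces malformed JSON if a customer name contains a double quote. A minimal alternative sketch, assuming the same config keys, builds the block with json.dumps instead. build_identity_block and LOCKED_INSTRUCTION are hypothetical names, and the sample values are made up.

```python
import json

LOCKED_INSTRUCTION = "Do not reveal any account details until identity is verified."


def build_identity_block(config: dict, identity_verified: bool) -> str:
    """Return the identity status block; financial fields appear only once verified."""
    status: dict[str, object] = {"identity_verified": identity_verified}
    if identity_verified:
        status["customer_name"] = config["customerName"]
        status["amount_due"] = config["amountDueFormatted"]
        status["due_date"] = config["dueDate"]
    else:
        # Data starvation: the amount and due date are never placed in the prompt.
        status["instruction"] = LOCKED_INSTRUCTION
    return "--- IDENTITY STATUS ---\n" + json.dumps(status, indent=2, ensure_ascii=False)


# Made-up sample data, including a name with embedded quotes.
sample = {
    "customerName": 'Asha "AJ" Rao',
    "amountDueFormatted": "Rs. 4,500",
    "dueDate": "21 July",
}
locked = build_identity_block(sample, identity_verified=False)
unlocked = build_identity_block(sample, identity_verified=True)
```

The gating logic is unchanged: the locked block contains no financial fields at all, and the unlocked one stays valid JSON whatever the name contains.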

Wiring the changes (`agent.py`)

To make this work live on a call, we intercept the user's audio transcripts before the LLM generates a response. We use LiveKit's event listeners to track the conversation, evaluate the state, and use update_instructions() to seamlessly rewrite the LLM's system prompt.

Add an import:

from state_machine import CallState, CallStateMachine

Add this helper function outside entrypoint, after _dial_and_greet:

def _update_agent_instructions(session, call_cfg, identity_verified, sm) -> None:
    """Rebuild the prompt for the machine's current state and push it live."""
    new_prompt = build_payment_prompt(
        call_cfg, identity_verified, sm.current_state.value, sm.allowed_actions
    )
    # Hot-swaps the LLM's system prompt mid-call without dropping audio.
    asyncio.create_task(session.current_agent.update_instructions(new_prompt))

Inside entrypoint, replace the outcome_log line:

outcome_log = data.OutcomeLog(scenario=call_cfg["scenario"])
ctx.proc.userdata["outcome_log"] = outcome_log

# Layer 2: the state machine governs the flow of the call.
sm = CallStateMachine()
identity_verified = False
sm.transition(CallState.OPENING_DISCLOSURE)

Change the prompt = build_payment_prompt line to pass the new arguments:

prompt = build_payment_prompt(
    call_cfg, identity_verified, sm.current_state.value, sm.allowed_actions
)

Add the turn processor and transcript listener before await:

def _process_user_turn(utterance: str) -> None:
    nonlocal identity_verified
    if not utterance:
        return

    # Advance from greeting to verification once the user replies.
    if sm.current_state == CallState.OPENING_DISCLOSURE:
        sm.transition(CallState.IDENTITY_VERIFICATION)
        _update_agent_instructions(session, call_cfg, identity_verified, sm)

    # verify_borrower_identity succeeded → unlock the financial data.
    if (
        outcome_log.identity_verified
        and not identity_verified
        and sm.current_state == CallState.IDENTITY_VERIFICATION
    ):
        identity_verified = True
        outcome_log.amount_disclosed = True
        sm.transition(CallState.PAYMENT_CONTEXT)
        _update_agent_instructions(session, call_cfg, identity_verified, sm)


@session.on("user_input_transcribed")
def on_user_input_transcribed(event) -> None:
    if event.is_final and event.transcript:
        _process_user_turn(event.transcript)

Layer 3: The Safety Net

Now we have a proper prompt and an established flow for the call. But what happens if the LLM misinterprets the user? What if the caller says, "I lost my job, I can't pay right now," and the LLM simply responds with, "I understand, but your payment is still due on the 21st" instead of calling the hardship tool? In debt collection, ignoring a hardship disclosure is a major violation.

We cannot leave compliance entirely up to probabilistic tool-calling. We need guardrails.

Layer 3 introduces a rules engine that listens to every single word spoken on the call. If it detects high-risk language, it physically forces the state machine onto a safe path, regardless of what the LLM decides to do.

The rules (`guardrails.py`)

from __future__ import annotations


class GuardrailEngine:

    @staticmethod
    def check_pre_call(scenario: str) -> tuple[bool, str]:
        """Returns (can_proceed, block_reason). Blocks the grievance_pending scenario."""
        if scenario == "grievance_pending":
            return (False, "Active grievance ticket on file — call blocked until resolved.")
        return (True, "")

    @staticmethod
    def should_stop_payment_flow(utterance: str) -> tuple[bool, str]:
        """Checks the utterance for triggers that stop the normal payment flow.

        Returns (True, reason) where reason is one of:
        'dispute', 'hardship', 'human_requested', 'stop_calling'.
        """
        text = utterance.lower()

        dispute_phrases = [
            "already paid", "paid already", "paid this", "dispute",
            "wrong amount", "incorrect amount", "i didn't borrow",
            "i did not borrow", "don't owe", "do not owe",
        ]
        for phrase in dispute_phrases:
            if phrase in text:
                return (True, "dispute")

        hardship_phrases = [
            "lost my job", "lost job", "cannot pay", "can't pay", "cant pay",
            "unable to pay", "medical emergency", "death in family",
            "in the hospital", "in hospital", "unemployed", "no income",
            "financial hardship",
        ]
        for phrase in hardship_phrases:
            if phrase in text:
                return (True, "hardship")

        human_phrases = [
            "speak to a human", "talk to a human", "speak to a person",
            "talk to a person", "real person", "real agent", "human agent",
            "escalate",
        ]
        for phrase in human_phrases:
            if phrase in text:
                return (True, "human_requested")

        stop_phrases = [
            "stop calling", "stop calling me", "remove me", "do not call",
            "don't call", "opt out", "take me off",
        ]
        for phrase in stop_phrases:
            if phrase in text:
                return (True, "stop_calling")

        return (False, "")

    @staticmethod
    def is_prohibited_language(agent_text: str) -> tuple[bool, str]:
        """Monitors the AI's own output for illegal collection threats."""
        text = agent_text.lower()
        prohibited = [
            "legal action", "contact your family", "contact your employer",
            "contact your references", "tell your family", "tell your employer",
        ]
        for phrase in prohibited:
            if phrase in text:
                return (True, phrase)
        return (False, "")

    @staticmethod
    def check_wrong_person(utterance: str, expected_name: str) -> bool:
        """Returns True if the utterance indicates the caller is not the expected person."""
        text = utterance.lower()
        name = expected_name.lower()
        wrong_person_phrases = [
            "wrong number", "wrong person", "you have the wrong",
            f"not {name}", f"no {name} here", f"no {name}",
            "different person", "nobody by that name",
        ]
        for phrase in wrong_person_phrases:
            if phrase in text:
                return True
        return False
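
Plain substring checks like the ones above are easy to audit, but they have two blind spots worth knowing about. A phrase can match inside a longer word ("dispute" inside "undisputed"), and a transcript that uses a typographic apostrophe ("can’t pay") will not match a list written with straight ones. The sketch below shows one possible hardening using normalisation plus word-boundary regexes. It is an illustration with shortened phrase lists, not the cookbook's implementation, and normalise, compile_phrases and matches are made-up names.

```python
import re


def normalise(utterance: str) -> str:
    """Lowercase, straighten typographic apostrophes, and collapse whitespace."""
    text = utterance.lower().replace("\u2019", "'")
    return re.sub(r"\s+", " ", text).strip()


def compile_phrases(phrases: list) -> "re.Pattern":
    """Build one regex that matches any phrase only at word boundaries."""
    ordered = sorted(phrases, key=len, reverse=True)  # prefer longer phrases first
    alternation = "|".join(re.escape(p) for p in ordered)
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)")


# Shortened lists for illustration; the full lists live in guardrails.py.
HARDSHIP = compile_phrases(["lost my job", "can't pay", "unemployed"])
DISPUTE = compile_phrases(["dispute", "already paid"])


def matches(pattern: "re.Pattern", utterance: str) -> bool:
    """Return True if any phrase in the pattern occurs in the utterance."""
    return pattern.search(normalise(utterance)) is not None
```

The trade-off is that word boundaries also stop "dispute" from matching "disputes", so inflected forms have to be listed explicitly. Neither approach understands negation: "I haven't paid this yet" still contains "paid this".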

Enforcing the guardrails (`agent.py`)

Import the guardrails we just created at the top of the file:

from guardrails import GuardrailEngine

In the entrypoint() function, right after the sm = CallStateMachine() block, create the guardrail engine and run the pre-call check:

guardrails = GuardrailEngine()

# Layer 3: refuse to dial a blocked account.
can_proceed, block_reason = guardrails.check_pre_call(call_cfg["scenario"])
if not can_proceed:
    logger.warning("Call blocked: %s", block_reason)
    outcome_log.outcome = "blocked_pre_call"
    outcome_log.save_to_file()
    return  # entrypoint returns before dialing

Replace the _process_user_turn function with this:

def _process_user_turn(utterance: str) -> None:
    nonlocal identity_verified
    if not utterance:
        return

    # >>> LAYER 3: deterministic safety net on the caller's words. Overrides the LLM.
    # Wrong-person check while we're still in the opening / verification phase.
    if sm.current_state in (CallState.OPENING_DISCLOSURE, CallState.IDENTITY_VERIFICATION):
        if guardrails.check_wrong_person(utterance, call_cfg["customerName"]):
            sm.transition(CallState.WRONG_PERSON_END)
            _update_agent_instructions(session, call_cfg, identity_verified, sm)

    # Stop-flow triggers: dispute, hardship, or a request for a human / to stop calling.
    if not sm.is_terminal():
        stop, reason = guardrails.should_stop_payment_flow(utterance)
        if stop:
            if reason == "dispute":
                outcome_log.dispute_detected = True
                outcome_log.outcome = "payment_dispute"
                sm.transition(CallState.DISPUTE_INTAKE)
            elif reason == "hardship":
                outcome_log.hardship_detected = True
                outcome_log.outcome = "hardship_detected"
                sm.transition(CallState.HARDSHIP_ESCALATION)
            elif reason in ("human_requested", "stop_calling"):
                sm.transition(CallState.HUMAN_HANDOFF)
            _update_agent_instructions(session, call_cfg, identity_verified, sm)

    # >>> LAYER 2: normal state advancement (only runs if no guardrail fired above).
    if sm.current_state == CallState.OPENING_DISCLOSURE:
        sm.transition(CallState.IDENTITY_VERIFICATION)
        _update_agent_instructions(session, call_cfg, identity_verified, sm)

    if (
        outcome_log.identity_verified
        and not identity_verified
        and sm.current_state == CallState.IDENTITY_VERIFICATION
    ):
        identity_verified = True
        outcome_log.amount_disclosed = True
        sm.transition(CallState.PAYMENT_CONTEXT)
        _update_agent_instructions(session, call_cfg, identity_verified, sm)

And finally, add a listener to monitor the agent's output:

@session.on("conversation_item_added")
def on_conversation_item_added(event) -> None:
    item = event.item
    if item.type == "message" and item.role == "assistant" and item.text_content:
        # >>> LAYER 3: monitor the AI's own output for prohibited language.
        is_prohibited, phrase = guardrails.is_prohibited_language(item.text_content)
        if is_prohibited:
            logger.warning("Prohibited language in agent output: %r", phrase)
            outcome_log.prohibited_language_detected = True

By layering strict keyword interception over our state machine, we've built an architecture that doesn't just ask the AI to behave — it forces it to. Your agent is no longer a fragile prompt wrapper; it is a governed, audited, compliance-first communication system.

Common Errors

Error	Cause	Fix
`Required environment variable 'X' is not set`	Missing `.env` value	Copy `.env.example` to `.env` and fill in the variable
Agent answers but stays silent	Dispatch rule has no `agents` block	Edit the rule in LiveKit Cloud — add `agentName` to `roomConfig.agents`
`DuplexClosed` in logs, call drops mid-greeting	`dev` mode restarts on file save	Use `python agent.py start` for all phone testing
Call drops immediately	TwiML Bin not reachable, or `<Sip>` URI missing `;transport=tcp`	Check the URI in the TwiML Bin and add `;transport=tcp` at the end
`401` or `403` from Murf or Deepgram	Wrong or expired API key	Re-check `MURF_API_KEY` and `DEEPGRAM_API_KEY` in `.env`

By separating the stack into a baseline conversationalist (Layer 1), a strict state machine (Layer 2), and an unyielding set of keyword-based guardrails (Layer 3), we've transformed a probabilistic AI into a compliant communication tool.

We've only scratched the surface of BFSI compliance in this tutorial. As your use cases grow and change, so will your regulatory requirements. But by shifting control from the LLM's prompt to your deterministic Python code, you can handle any edge case simply by adding a state or a guardrail.

You can find the complete source code for this project on GitHub: https://github.com/murf-ai/murf-cookbook/tree/main/examples/agents/payment-reminder

Now take this foundation, tune the guardrails to match your requirements, and build your own compliant voice agent.

Building a fully autonomous AI Receptionist

Sanchita Sunil — Thu, 02 Jul 2026 10:04:36 +0000

Quick links.
Code: https://github.com/murf-ai/murf-cookbook/tree/main/examples/agents/reception-agent
Video Walkthrough: https://youtu.be/eCbejEt78sw

We have spent years building interfaces that require people to learn them – apps, chatbots, dashboards – and the entire time, the most natural interface we have ever had, the phone call, has been sitting there, mostly automated by hold music and press-one-for-billing menus.

Voice is the interface that needs no manual – it works across every age group, every literacy level and every level of technical comfort. And for most people, when something actually matters – a doctor's appointment, a flight change, a question that needs a real answer – they pick up the phone.

But the reality of picking up the phone is mostly frustrating automated systems. The gap between those and a real conversation is what we’ll be bridging today.

This tutorial builds a fully autonomous AI voice receptionist. You call a real Twilio phone number, an AI answers, recognizes you if you've called before, books appointments into a real database, and answers questions from a local knowledge base. The whole thing runs in a few hundred lines of Python.

I’ve built it as a receptionist for a clinic, but you can adapt this for a restaurant, a salon, a support desk or any other business, just by making a few minor tweaks.

We’ll be building this in three layers:
- The Core Voice Loop: giving our agent a voice and wiring it up to answer phone calls
- Memory and Booking: giving it memory and the ability to book appointments
- RAG: adding retrieval-augmented generation (RAG) so the agent doesn’t hallucinate details

At the end, we’ll also go over some optional additions that make the agent even better (like Google Calendar blocking, WhatsApp confirmations and call transfer), some common errors, and how to adapt this for any business.

We’ll go through each component and its setup first, and then get into the code. By the end, we will have a fully autonomous AI voice receptionist that you can call, talk to and hang up on with a booking confirmed.

The Stack
The Core Voice Loop
Memory & Appointment Booking
RAG
Optional Additions
Adapting To Your Use Case
Errors

The Stack

There are six core components, plus an optional Google Calendar integration:

LiveKit

When we wire up a voice agent ourselves (transcribing the input audio, passing it through an LLM, then synthesizing a response with a TTS model), the round trip adds about 2-4 seconds of delay. We could stream over a WebSocket, but while WebSockets work well for text and data, they can cause stutters and pauses with voice.

LiveKit is a streaming infrastructure that brings this latency down to what feels natural in a conversation, without the issues a WebSocket causes. It uses WebRTC, which is designed for real-time audio and video.

Setup:

Create an account at LiveKit and create a project.
From your project settings, copy the URL, API key, API secret and SIP URI and drop them into your .env file.
On the console, go to Telephony > SIP Trunks and create a new trunk.
Give the trunk a name, select inbound, and set the allowed IP addresses (use 0.0.0.0/0 if you want to allow all IP addresses).
Leave the phone numbers blank for now; we’ll fill them in once we’ve created the Twilio account.
If you switch to the JSON editor, your JSON should look something like this:

{
  "name": "trial",
  "allowedAddresses": [
    "0.0.0.0/0"
  ]
}

Go to Telephony > Dispatch Rules and create a new dispatch rule.
Give your rule a name, a prefix and an agent name. Remember your agent name: it must match the one in your code, or the call will connect but all you’ll hear is silence.
If you switch to the JSON editor, your dispatch rule should look something like this:

{
  "sipDispatchRuleId": "SDR_abc123",
  "rule": {
    "dispatchRuleIndividual": {
      "roomPrefix": "clinic-"
    }
  },
  "trunkIds": [
    "ST_abc123"
  ],
  "name": "clinic",
  "roomConfig": {
    "agents": [
      {
        "agentName": "clinic-agent"
      }
    ]
  }
}

Copy this and save it as dispatch-rule.json in your project root directory.
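
Since a missing or mismatched agent name fails silently (the call connects but nobody speaks), it can be worth checking the saved file before placing a test call. The sketch below assumes the JSON shape shown above. dispatch_agent_names and check_dispatch_rule are hypothetical helpers, not part of the LiveKit SDK.

```python
import json
from pathlib import Path


def dispatch_agent_names(rule: dict) -> list:
    """Return the agent names listed under roomConfig.agents (empty if none)."""
    agents = rule.get("roomConfig", {}).get("agents", [])
    return [a["agentName"] for a in agents if a.get("agentName")]


def check_dispatch_rule(path: str, expected_agent: str) -> None:
    """Raise ValueError if the rule file does not dispatch to expected_agent."""
    rule = json.loads(Path(path).read_text(encoding="utf-8"))
    names = dispatch_agent_names(rule)
    if expected_agent not in names:
        raise ValueError(
            f"{path} dispatches to {names or 'no agents'}, expected {expected_agent!r}"
        )
```

For example, check_dispatch_rule("dispatch-rule.json", "clinic-agent") would pass for the rule above and raise if the agents block were missing.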

Twilio

Connecting your software to a phone line or sending an SMS would mean negotiating contracts with multiple telecom carriers, configuring physical servers, and managing telecom protocols - a grueling process that could take months.

Twilio is a cloud communications platform that turns the global telecom network into simple software APIs. It acts as a digital bridge between the internet and phone networks, abstracting all of that hardware and routing complexity behind the scenes, allowing us to make real phone calls, send messages or even build an entire customer support system with just a few lines of code.

Setup:

Create an account at console.twilio.com or 1console.twilio.com. Copy your Account SID, Auth Token and phone number and paste them into the .env file.
On the console, go to TwiML Bins and create a bin.
Give it a name and put this in the TwiML section:

<?xml version="1.0" encoding="UTF-8"?>   
<Response>
  <Dial>
    <Sip>sip:phone_number@sip_uri;transport=tcp</Sip>
  </Dial>
</Response>

Replace phone_number with your Twilio phone number and sip_uri with your SIP URI.
For example:
If your phone number is +1234567 and your SIP URI is sip:abc123.sip.livekit.cloud, then your TwiML section would look something like this (note that the example drops the leading + from the number and does not repeat the sip: prefix):

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Dial>
    <Sip>sip:1234567@abc123.sip.livekit.cloud;transport=tcp</Sip>
  </Dial>
</Response>

On 1console.twilio.com, go to Products & Services > Numbers & Senders, click on your phone number and, in the Voice and emergency calling section, click Edit configuration details.
Choose the configuration method that uses TwiML Bins, set your primary method to TwiML Bins and select the TwiML Bin you just created.
Now that your Twilio number is active and routing via TwiML, go back to your LiveKit SIP Trunk > Configure trunk and add this phone number to the "Phone Numbers" field so LiveKit knows to accept its calls. If your phone number is +1234567, the trunk should look something like this in the JSON editor:

{
  "numbers": [
    "+1234567",
    "1234567"
  ],
  "allowedAddresses": [
    "0.0.0.0/0"
  ]
}

(Optional) To get confirmation messages on WhatsApp, use the sandbox number given in .env.example (+14155238886; it must be written in the .env file as "whatsapp:+14155238886"). Then go to console.twilio.com, Messages > Send a WhatsApp message, and follow the instructions. The connection resets every 5 days, so recipients must send the join message again to keep receiving messages. The sandbox is free and meant only for testing; for production, get a WhatsApp Business number from Twilio and use that instead.
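
The two hand-edited values in this section, the <Sip> target in the TwiML Bin and the numbers list on the LiveKit trunk, both derive from the same phone number, so a typo in either is easy to make. A small sketch that generates them, following the formatting used in the examples above (build_sip_target and trunk_numbers are hypothetical helper names):

```python
def build_sip_target(phone_number: str, sip_uri: str) -> str:
    """Build the <Sip> value: number without '+', host without 'sip:', TCP transport."""
    number = phone_number.strip().lstrip("+")
    host = sip_uri.strip()
    if host.startswith("sip:"):
        host = host[len("sip:"):]
    return f"sip:{number}@{host};transport=tcp"


def trunk_numbers(phone_number: str) -> list:
    """Return the number with and without the leading '+', as in the trunk JSON."""
    bare = phone_number.strip().lstrip("+")
    return [f"+{bare}", bare]
```

With the example values, build_sip_target("+1234567", "sip:abc123.sip.livekit.cloud") yields the same string as the TwiML above, and trunk_numbers("+1234567") yields the two entries shown in the trunk JSON.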

Speech-To-Text (STT)

Whisper Realtime (OpenAI’s latest realtime speech-to-text model: gpt-realtime-whisper)

OpenAI recently expanded its ecosystem by introducing a dedicated streaming speech-to-text model built directly into its real-time infrastructure. If you’ve used previous Whisper models, you know the frustrating delay caused by waiting to process an entire block of input audio at once. This new model allows for true continuous streaming: the input audio is transcribed into text word-by-word as it arrives, rather than waiting for the speaker to finish before transcription begins.

In this project, Realtime Whisper is our primary STT model. Note that you will need a paid OpenAI API key to use it, as it isn't available on the free tier. You can get your API key at OpenAI.

You also have the option to use Deepgram’s Nova-3 as your STT for this project.

Deepgram (Nova-3)

If you want to make the switch, all you have to do is set STT_PROVIDER in the .env file to deepgram. Deepgram offers $200 in free credits when you create an account, which is more than enough to test and run this project.

Another advantage of Deepgram is that it is optimized for noisy environments, which helps it distinguish the caller’s voice from background noise. You can get your API key at Deepgram.

LLM

You will need a Large Language Model to act as the central reasoning engine for your voice agent. This model is responsible for parsing the transcribed text, deciding how to respond, and triggering database or calendar actions.

This project is built modularly and configured to support two major providers. To choose yours, set the LLM_PROVIDER variable in your .env file to one of the following:
- openai (requires a paid API key)
- gemini (an excellent choice for testing, since Google offers a free tier)

If you wish to use a different provider, the code is structured so you can wire in any custom LLM.
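
Provider switches driven by an environment variable tend to fail confusingly when the value is misspelled, so it helps to validate it once at startup. The sketch below shows the idea for LLM_PROVIDER. The helper name resolve_provider and the default of "openai" are assumptions for illustration, not taken from the project, and the same pattern would apply to STT_PROVIDER.

```python
import os
from typing import Mapping, Optional, Sequence

LLM_PROVIDERS = ("openai", "gemini")


def resolve_provider(
    var: str,
    allowed: Sequence[str],
    default: str,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Read a provider name from the environment and reject unknown values."""
    source = os.environ if env is None else env
    value = source.get(var, default).strip().lower()
    if value not in allowed:
        raise ValueError(
            f"{var}={value!r} is not supported; choose one of: {', '.join(allowed)}"
        )
    return value
```

The optional env argument exists only to make the function easy to test without touching the real environment.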

Text-To-Speech (TTS)

Once our LLM decides exactly what to say, we need to convert that text back into natural human speech to stream down the phone line. In a live conversation, timing is everything. If the Text-to-Speech engine takes even a second to generate the audio, the interaction breaks down.

Murf Falcon

For this project, we are using Murf Falcon, the consistently fastest TTS model (130 ms time to first audio), built specifically for real-time conversational applications. Falcon is optimized to generate lifelike, natural-sounding human speech on the fly.

It achieves this by supporting continuous chunked audio streaming: the moment our LLM streams its first few words, Falcon starts converting those tokens into audio packets and pushing them back into the LiveKit media stream. This means the caller hears a seamless, expressive response without awkward pauses.

Go to Murf, create an account and get your API key.

Supabase

For an AI receptionist to be genuinely useful, simply talking isn’t enough; it needs to remember. We need a secure place to identify returning callers, look up existing appointments and log new ones.

Supabase is an open-source Firebase alternative built on top of PostgreSQL. It removes the friction of running your own backend by giving you a fully managed, highly scalable Postgres database with an instant API.

In this project, Supabase acts as our system of record. Every time the Reception Agent takes a call, it queries Supabase to check if the caller's phone number is already in the system, logs the conversation context, and securely writes the new appointment details.

Setup:

Create an account at Supabase and create a project.
From your project dashboard, copy your project URL and anon key and paste them into your .env file.

Google Calendar (optional)

Human staff need visibility into the actions the agent takes. We can achieve this by wiring our agent to mirror its bookings directly onto Google Calendar.

In this project, Google Calendar acts strictly as a write-only visual layer for clinic staff. It does not drive the core availability or booking logic; that is handled entirely by our Supabase database. If Google Calendar goes down, the agent will still book callers successfully, because it updates Supabase first and then fires off an asynchronous background task to update Google Calendar without making the caller wait on the line.

Setup:

Create a Project on Google Cloud Console.
In the Project API settings, enable the Google Calendar API.
Go to IAM & Admin > Service Accounts and click Create Service Account.
Give it a descriptive name and click Done.
Select your service account and create a JSON key.
Save this JSON key to the project root as service-account.json.
Go to Google Calendar and, under Other Calendars, create new calendars. In this project, we create one calendar per doctor.
In each created calendar, go to the Shared With section and Add People: add the service account ID and make sure to give it permission to make changes to events.
In the Integrate Calendar section, copy the calendar ID and drop it into the .env file.

We’ve gone over all the major components for our project. Now that we have everything configured and all our credentials in place, let’s dive into building the agent.

The Core Voice Loop

We will start by getting the agent to successfully connect to a call, speak, and adopt a persona.

Before we write the logic for our voice agent, let’s start by installing the necessary packages and prepping our environment.

Run these in your terminal:

python -m venv venv

On macOS/Linux:

source venv/bin/activate

On Windows (PowerShell):

venv\Scripts\Activate.ps1

Create a requirements.txt file in your project root and add the following packages:

livekit-agents>=1.0.0
livekit-plugins-deepgram>=0.7.0
livekit-plugins-openai>=0.10.0
livekit-plugins-google>=0.6.0
livekit-plugins-silero>=0.7.0
livekit-murf>=0.1.0
python-dotenv>=1.0.0
twilio>=9.0.0
livekit-api>=0.7.0
supabase>=2.0.0
lancedb>=0.6.0
sentence-transformers>=3.0.0
google-api-python-client>=2.100.0
google-auth-httplib2>=0.2.0
google-auth-oauthlib>=1.2.0
pytz>=2024.1

This covers LiveKit, Twilio and all our chosen plugins.
Now, run the command below to install all of them:

pip install -r requirements.txt

System Prompt

Let’s define how our receptionist should behave. In the root of your project, create a folder named prompts and create system_prompt.py within that folder. By keeping the prompt in a separate file, we can easily tweak the persona or replace it entirely.

prompts/system_prompt.py:

SYSTEM_PROMPT = """
You are Matthew, the AI receptionist for The Clinic.
Always identify yourself as an AI at the start of every call.
Use a warm, calm, professional tone.

Your opening line at the start of every call:
Hello, thank you for calling The Clinic. I'm Matthew, your AI receptionist. How may I help you today?

Voice rules — follow on every response:
- Keep responses to 1–3 sentences
- No lists, bullet points, or markdown — this is a phone call
- Speak naturally
- Never say "Absolutely!" or "Great question!" — say "Of course" or "Certainly" instead

You can answer general questions about the clinic and help callers.
If you don't know something, say you'll have the team call them back.
""".strip()

def build_system_prompt(patient: dict | None = None) -> str:
    return SYSTEM_PROMPT

Core Logic

Now for the core logic. Create agent.py in your project’s root. This file wires together and controls the core functionality of the agent.

agent.py:

from __future__ import annotations
import os
import asyncio
import logging


from dotenv import load_dotenv
from livekit import rtc
from livekit.agents import (
    Agent,
    AgentSession,
    AutoSubscribe,
    JobContext,
    JobProcess,
    WorkerOptions,
    cli,
)


from livekit.plugins import deepgram, google, openai, silero, murf

Import our receptionist's personality instructions:

from prompts.system_prompt import build_system_prompt

1. ENVIRONMENT & CONFIGURATION

Load the .env file so we have access to our API keys without hardcoding them.

load_dotenv()

Set up logging so we can see what the agent is doing in the terminal:

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("clinic-agent")

Fetch our environment variables:

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
MURF_API_KEY = os.getenv("MURF_API_KEY")

2. AGENT DEFINITION

In a normal phone call, the person picking up always speaks first. If the AI waits for the user, the caller will think the line is dead. We hardcode the opening line so we can trigger it immediately on connection:

OPENING_LINE = (
    "Hello, thank you for calling The Clinic. I'm Matthew, your AI receptionist. "
    "How may I help you today?"
)

We pass in the instructions from our system prompt so the LLM knows it is acting as a clinic receptionist. Later on, this is where we will attach specific tools (like calendar booking):

class ClinicAgent(Agent):
    def __init__(self, instructions: str) -> None:
        super().__init__(instructions=instructions)

AI models suffer from 'Cold Starts'. If we wait until the user speaks to load the Voice Activity Detection (VAD) model into memory, it adds a massive delay to the first response. This function pre-loads the Silero VAD weights the moment the server starts, eliminating that lag:

def prewarm(proc: JobProcess) -> None:
    proc.userdata["vad"] = silero.VAD.load()

LiveKit rooms generated by Twilio SIP trunks usually start with a specific prefix. We check this so the agent knows whether it's talking to a real phone line or just a local web-testing environment:

def _is_phone_room(room_name: str) -> bool:
    return room_name.startswith("clinic-") and not room_name.startswith("clinic-test-")

3. THE CORE WORKER LOOP

This is the engine of the voice agent. Every time a new phone call comes in, LiveKit fires up a new instance of this entrypoint to handle the conversation:

async def entrypoint(ctx: JobContext) -> None:
    is_phone = _is_phone_room(ctx.room.name)

Connect to the LiveKit room. Since it's a phone call, we only need audio:

    await ctx.connect(
        auto_subscribe=AutoSubscribe.AUDIO_ONLY if is_phone else AutoSubscribe.SUBSCRIBE_ALL,
    )

A: Speech-To-Text (Ears)
I’m going to be using Whisper Realtime. (To use Deepgram instead: stt = deepgram.STT(model="nova-3", language="en", api_key=DEEPGRAM_API_KEY), with DEEPGRAM_API_KEY loaded from your .env.)

    stt = openai.STT(
        model="gpt-realtime-whisper",
        use_realtime=True,
        language="en",
        api_key=OPENAI_API_KEY,
    )

B: Large Language Model (Brain)
I’m going to be using OpenAI with gpt-4o-mini, but you can swap this for any other LLM.
For example, to use Gemini: llm = google.LLM(model="gemini-2.5-flash", api_key=GEMINI_API_KEY)

    llm = openai.LLM(model="gpt-4o-mini", api_key=OPENAI_API_KEY)

C: Text-To-Speech (Mouth)
Murf Falcon converts our LLM's text back into audio. We select a voice id based on our requirement. I’m going to use Matthew:

    tts = murf.TTS(voice="en-US-matthew", locale="en-US")

The Session Conductor

AgentSession takes our Ears, Brain, and Mouth and wires them into a single
continuous streaming loop. It automatically handles turn-taking, so if the
user interrupts, the session stops the TTS instantly and starts listening again:

    session = AgentSession(
        stt=stt,
        llm=llm,
        tts=tts,
        vad=ctx.proc.userdata["vad"],
    )

Load our receptionist instructions and start the session in the room:

    prompt = build_system_prompt()
    await session.start(ClinicAgent(prompt), room=ctx.room)

4. CALL EXECUTION & GREETING

Wait up to 20 seconds for the Twilio SIP trunk to bridge the audio stream. If no one connects, we gracefully log the error and shut down this instance:

    if is_phone:
        try:
            participant = await asyncio.wait_for(ctx.wait_for_participant(), timeout=20.0)
            session.room_io.set_participant(participant.identity)
        except asyncio.TimeoutError:
            logger.error("No caller joined within 20s")
            return

Play the opening line immediately. Setting allow_interruptions=False ensures the agent finishes saying "Hello" completely before it starts listening for the caller's response:

        handle = session.say(OPENING_LINE, allow_interruptions=False)
        await handle.wait_for_playout()

Keep the worker process alive as long as the phone call is active:

    while ctx.room.isconnected():
        await asyncio.sleep(0.25)

5. SERVER INITIALIZATION

Start the LiveKit worker. The agent_name MUST match your LiveKit Dispatch Rule. Otherwise Twilio will route the call but this script won't know to pick it up, and all you'll hear is silence:

if __name__ == "__main__":
    cli.run_app(
        WorkerOptions(
            entrypoint_fnc=entrypoint,
            prewarm_fnc=prewarm,
            agent_name="clinic-agent",
            num_idle_processes=1,
        )
    )

We can now run the agent using the following command:

python agent.py start

Watch the logs in the terminal. Once you see registered worker, pick up your phone and dial your Twilio phone number. Your agent will answer and hold a conversation with you.

Memory & Appointment Booking

Having a receptionist who only speaks to you, can’t take any action, and immediately forgets the conversation the second you hang up wouldn’t be of much use to a business.

In this layer, we are going to give our agent memory and actionable capabilities so it can recognize returning callers, greet them by name, and securely lock time slots without double-booking.

Database Setup

First, let's create tables where our agent will store the details. In your Supabase dashboard, navigate to the SQL Editor and run the following query:

create extension if not exists "pgcrypto";

-- 1. Caller Memory
create table if not exists customers (
    phone                   text primary key,
    name                    text,
    preferred_doctor        text,
    last_booking_id         text,
    last_appointment_date   text,
    last_appointment_time   text,
    call_count              integer not null default 1,
    first_seen              timestamptz not null default now(),
    last_seen               timestamptz not null default now()
);

-- 2. Availability + Bookings
create table if not exists slots (
    id              uuid primary key default gen_random_uuid(),
    doctor          text not null,
    iso_date        date not null,
    iso_time        time not null,
    status          text not null default 'available',
    booking_id      text,
    patient_name    text,
    phone           text,
    reason          text,
    cancelled_at    timestamptz,
    created_at      timestamptz not null default now(),
    constraint slots_status_check check (status in ('available', 'booked')),
    -- CRITICAL: If you change the doctors in your Python code, you MUST update them here!
    constraint slots_doctor_check check (
        doctor in ('Dr. Sarah Lin', 'Dr. James Cole')
    ),
    constraint slots_unique_doctor_datetime unique (doctor, iso_date, iso_time)
);

-- 3. Booking Audit Log
create table if not exists appointments (
    id               text primary key,
    phone            text not null,
    doctor           text,
    date             text,
    time             text,
    reason           text,
    booking_id       text,
    status           text not null default 'confirmed',
    rescheduled_from text,
    created_at       timestamptz not null default now(),
    constraint appointments_status_check check (
        status in ('confirmed', 'cancelled', 'rescheduled')
    )
);

-- Indexes for Fast Querying
create index if not exists idx_slots_available
    on slots (doctor, iso_date, iso_time) where status = 'available';
create index if not exists idx_slots_phone_booked
    on slots (phone, iso_date, iso_time) where status = 'booked';

-- Seed 14 days of sample slots (Mon–Sat, 09:00–13:00 and 17:00–20:00)
do $$
begin
    insert into slots (doctor, iso_date, iso_time, status)
    select d.doctor, days.slot_date, t.slot_time, 'available'
    from (values ('Dr. Sarah Lin'), ('Dr. James Cole')) as d(doctor)
    cross join lateral (
        select (current_date + gs.i)::date as slot_date
        from generate_series(0, 13) as gs(i)
    ) as days
    cross join lateral (
        select time '09:00' as slot_time union all select time '09:30'
        union all select time '10:00' union all select time '10:30'
        union all select time '11:00' union all select time '11:30'
        union all select time '12:00' union all select time '12:30'
        union all select time '17:00' union all select time '17:30'
        union all select time '18:00' union all select time '18:30'
        union all select time '19:00' union all select time '19:30'
    ) as t
    where extract(isodow from days.slot_date) between 1 and 6
    on conflict (doctor, iso_date, iso_time) do nothing;
end $$;

Note: These tables have been named and structured for a clinic, but they can be adjusted to the business of your choice.

Adding the Memory

Create a folder in your project root called tools and create memory.py inside it.

This file handles our database connection. Because the standard Supabase Python client is synchronous (blocking), we use asyncio.to_thread to ensure our database queries don't freeze the real-time audio stream while they fetch data.
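To see why asyncio.to_thread matters, here is a small self-contained sketch. `blocking_query` and `heartbeat` are hypothetical stand-ins for a synchronous database call and the real-time audio loop; they are not part of the Supabase or LiveKit APIs.

```python
import asyncio
import time


def blocking_query() -> str:
    """Stand-in for a synchronous database call (hypothetical, not the Supabase client)."""
    time.sleep(0.2)
    return "row"


async def heartbeat(ticks: list[float]) -> None:
    """Stand-in for the audio loop: records a tick every 20 ms."""
    for _ in range(5):
        ticks.append(time.monotonic())
        await asyncio.sleep(0.02)


async def main() -> tuple[str, int]:
    ticks: list[float] = []
    beat = asyncio.create_task(heartbeat(ticks))
    row = await asyncio.to_thread(blocking_query)  # runs in a worker thread
    ticks_during_query = len(ticks)                # the heartbeat kept running meanwhile
    await beat
    return row, ticks_during_query


row, ticks_during_query = asyncio.run(main())
print(row, ticks_during_query)
```

If `blocking_query()` were called directly inside `main`, the heartbeat could not tick until the query returned, which on a live call would be heard as a gap in the audio.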

memory.py:

Part A: Setup and Phone Normalization

Telephony systems are notoriously messy with how they format phone numbers. Before we can look up a caller in our database, we need to normalize their number into a standard E.164 format (e.g., +1234567890):

from __future__ import annotations
import os
import asyncio
import logging
import re
from datetime import datetime, timezone
from supabase import Client, create_client
from dotenv import load_dotenv

load_dotenv() 

logger = logging.getLogger(__name__) 

SUPABASE_URL = os.getenv("SUPABASE_URL") 
SUPABASE_KEY = os.getenv("SUPABASE_KEY")

Initialise the Supabase Client:

def get_client() -> Client:
    return create_client(SUPABASE_URL, SUPABASE_KEY)

Strips weird characters and normalizes the phone number:

def _normalize_phone(phone: str) -> str:
    cleaned = re.sub(r"[\s\-()]", "", phone.strip())
    if cleaned.startswith("+91"):
        return cleaned
    if cleaned.startswith("0"):
        return f"+91{cleaned[1:]}"
    if len(cleaned) == 10 and cleaned.isdigit():
        return f"+91{cleaned}"
    if cleaned.startswith("91") and len(cleaned) == 12 and cleaned.isdigit():
        return f"+{cleaned}"
    if cleaned.startswith("+"):
        return cleaned
    return f"+91{cleaned}"


def _now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()

Note: Adjust the default country code (+1, +91, etc.) based on your region.
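If you would rather not hardcode the country code, a parameterised variant might look like the sketch below. The function name and the `default_cc` and `national_len` parameters are assumptions made for illustration; they do not appear in the project code, and real-world numbering plans have more edge cases than this handles.

```python
import re


def normalize_phone(phone: str, default_cc: str = "+91", national_len: int = 10) -> str:
    """Return an E.164-style number, assuming `default_cc` when no country code is given."""
    cleaned = re.sub(r"[\s\-()]", "", phone.strip())
    if cleaned.startswith("+"):
        return cleaned                        # already has a country code
    cc_digits = default_cc.lstrip("+")
    if cleaned.startswith("0"):
        return f"{default_cc}{cleaned[1:]}"   # drop the national trunk prefix
    if cleaned.startswith(cc_digits) and len(cleaned) == len(cc_digits) + national_len:
        return f"+{cleaned}"                  # country code present, "+" missing
    return f"{default_cc}{cleaned}"


print(normalize_phone("098765 43210"))                     # +919876543210
print(normalize_phone("919876543210"))                     # +919876543210
print(normalize_phone("(415) 555-0132", default_cc="+1"))  # +14155550132
print(normalize_phone("+44 20 7946 0958"))                 # +442079460958
```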

Part B: Fetching and Writing Data

Fetching a caller:

async def get_patient(phone: str) -> dict | None:
    def _fetch():
        client = get_client()
        response = client.table("customers").select("*").eq("phone", _normalize_phone(phone)).limit(1).execute()
        return response.data[0] if response.data else None


    try:
        return await asyncio.to_thread(_fetch)
    except Exception as exc:
        logger.error("get_patient failed: %s", exc)
        return None

Updating a caller’s profile after they’ve booked an appointment:

async def upsert_patient(phone: str, name: str, doctor: str, booking_id: str, date: str, time: str):
    def _upsert():
        client = get_client()
        normalized = _normalize_phone(phone)
        existing = client.table("customers").select("call_count").eq("phone", normalized).limit(1).execute()
        call_count = (existing.data[0].get("call_count") or 0) + 1 if existing.data else 1

        client.table("customers").upsert({
            "phone": normalized, "name": name, "preferred_doctor": doctor,
            "last_booking_id": booking_id, "last_appointment_date": date,
            "last_appointment_time": time, "call_count": call_count, "last_seen": _now_iso(),
        }, on_conflict="phone").execute()


    try:
        await asyncio.to_thread(_upsert)
        return True
    except Exception as exc:
        logger.error("upsert_patient failed: %s", exc)
        return False

Tracking the call count:

async def increment_call_count(phone: str):
    def _inc():
        client = get_client()
        norm = _normalize_phone(phone)
        res = client.table("customers").select("call_count").eq("phone", norm).limit(1).execute()
        if res.data:
            client.table("customers").update({
                "call_count": (res.data[0].get("call_count") or 0) + 1,
                "last_seen": _now_iso()
            }).eq("phone", norm).execute()

    await asyncio.to_thread(_inc)

Adding an appointment to the database:

async def log_appointment(phone: str, doctor: str, date: str, time: str, reason: str, booking_id: str):
    def _log():
        client = get_client()
        client.table("appointments").insert({
            "id": booking_id, "phone": _normalize_phone(phone), "doctor": doctor,
            "date": date, "time": time, "reason": reason, "booking_id": booking_id,
        }).execute()

    try:
        await asyncio.to_thread(_log)
        return True
    except Exception as exc:
        logger.error("log_appointment failed: %s", exc)
        return False

Create booking.py in the tools folder.
booking.py:

from __future__ import annotations 
import asyncio 
import logging 
from datetime import date, datetime 
from typing import Any 
from tools.memory import get_client 

logger = logging.getLogger(__name__) 

VALID_DOCTORS = ("Dr. Sarah Lin", "Dr. James Cole") 

def _format_iso_time(value: str) -> str:
    parts = str(value).split(":")
    return f"{parts[0]}:{parts[1]}" if len(parts) >= 2 else str(value)

Translate database timestamps (2026-05-29 15:00) into natural spoken English (Friday the 29 of May, 3:00 PM):

def _spoken_from_iso(iso_date: str, iso_time: str) -> tuple[str, str]:
    try:
        dt = datetime.strptime(f"{iso_date} {_format_iso_time(iso_time)}", "%Y-%m-%d %H:%M")
        return dt.strftime("%A the %d of %B"), dt.strftime("%I:%M %p").lstrip("0")
    except ValueError:
        return iso_date, iso_time
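The %d directive renders the day as a plain number. If you want the agent to say an ordinal such as "29th", a helper along these lines could be used. `ordinal` and `spoken_date` are hypothetical names, not part of the project code, and the weekday and month names assume the default English locale.

```python
from datetime import datetime


def ordinal(day: int) -> str:
    """Return the day with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 11 <= day % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{day}{suffix}"


def spoken_date(iso_date: str) -> str:
    """Format an ISO date as a spoken phrase with an ordinal day."""
    dt = datetime.strptime(iso_date, "%Y-%m-%d")
    return f"{dt.strftime('%A')} the {ordinal(dt.day)} of {dt.strftime('%B')}"


print(spoken_date("2026-05-29"))  # Friday the 29th of May
```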

Scan the database for the next open slot for a particular doctor:

async def find_available_slot(doctor: str, iso_date: str | None = None, iso_time: str | None = None) -> dict[str, Any]:
    if doctor not in VALID_DOCTORS:
        return {"available": False, "error": f"Unknown doctor: {doctor}"}

    def _find():
        client = get_client()
        query = (
            client.table("slots")
            .select("id, doctor, iso_date, iso_time, status")
            .eq("doctor", doctor)
            .eq("status", "available")
            .order("iso_date")
            .order("iso_time")
        )
        if iso_date: query = query.eq("iso_date", iso_date)
        if iso_time: query = query.eq("iso_time", iso_time)

        response = query.limit(1).execute()
        if not response.data: return {"available": False}

        row = response.data[0]
        return {
            "available": True,
            "slot_id": row["id"],
            "doctor": row["doctor"],
            "iso_date": row["iso_date"],
            "iso_time": _format_iso_time(row["iso_time"])
        }


    try:
        result = await asyncio.to_thread(_find)
        if not result.get("available"):
            return {"available": False}

        date_spoken, time_spoken = _spoken_from_iso(result["iso_date"], result["iso_time"])
        return {**result, "date": date_spoken, "time": time_spoken}
    except Exception as exc:
        logger.error("find_available_slot failed: %s", exc)
        return {"available": False, "error": str(exc)}

Making an appointment (reserving a slot):
ATOMIC LOCK: This is the most important function here. By requiring .eq("status", "available") in our update query, the database ensures that if two callers try to book the exact same slot simultaneously, only the first one will succeed.

async def reserve_slot(slot_id: str, patient_name: str, phone: str, doctor: str, booking_id: str, reason: str) -> bool:
    def _reserve():
        client = get_client()
        res = client.table("slots").update({
            "status": "booked", "booking_id": booking_id, "patient_name": patient_name, "phone": phone, "reason": reason
        }).eq("id", slot_id).eq("doctor", doctor).eq("status", "available").select("id").execute()
        return bool(res.data)
    return await asyncio.to_thread(_reserve)
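The same conditional-update idea can be tried locally without Supabase. The sketch below uses Python's built-in sqlite3 module purely as an illustration: the table is a cut-down, hypothetical version of slots, and the "did we win?" check uses the affected row count rather than the returned rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table slots (id integer primary key, status text, booking_id text)")
conn.execute("insert into slots (id, status) values (1, 'available')")


def reserve(slot_id: int, booking_id: str) -> bool:
    """Book the slot only if it is still available; report whether we won."""
    cur = conn.execute(
        "update slots set status = 'booked', booking_id = ? "
        "where id = ? and status = 'available'",
        (booking_id, slot_id),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows changed means someone else got there first


first = reserve(1, "TC-1111")
second = reserve(1, "TC-2222")
winner = conn.execute("select booking_id from slots where id = 1").fetchone()[0]
print(first, second, winner)  # True False TC-1111
```

Because the availability check and the write happen in one statement, there is no window between "read the status" and "update the row" for a second caller to slip into.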

Finally, a function that runs on startup and logs a critical warning when the database has no future available slots left, so you know more need to be added:

async def check_slot_coverage() -> None:
    def _check():
        res = get_client().table("slots").select("iso_date").gte("iso_date", date.today().isoformat()).eq("status", "available").order("iso_date", desc=True).limit(1).execute()
        if not res.data: return 0
        return (date.fromisoformat(str(res.data[0]["iso_date"])[:10]) - date.today()).days
    days = await asyncio.to_thread(_check)
    if days <= 0: logger.critical("SLOT COVERAGE: No future slots available — booking will fail")

Equipping the AI

Large Language Models don't automatically know how to run Python scripts, so the agent cannot take actions by itself. We have to wrap our logic in LiveKit's @function_tool decorator. This acts as a bridge, exposing the function to the LLM so it can trigger it natively during a conversation.
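Conceptually, a tool decorator records a function's name, description and parameters so they can be advertised to the model, and a dispatcher then runs whichever function the model asks for. The toy below illustrates that idea only. It is not LiveKit's implementation, and `toy_tool`, `TOOL_REGISTRY` and `dispatch` are hypothetical names.

```python
import inspect
from typing import Any, Callable

TOOL_REGISTRY: dict[str, dict[str, Any]] = {}


def toy_tool(fn: Callable) -> Callable:
    """Hypothetical decorator: records the function name, docstring and parameters."""
    params = {
        name: getattr(p.annotation, "__name__", str(p.annotation))
        for name, p in inspect.signature(fn).parameters.items()
    }
    TOOL_REGISTRY[fn.__name__] = {
        "description": inspect.getdoc(fn) or "",
        "parameters": params,
        "callable": fn,
    }
    return fn


@toy_tool
def check_availability(date: str, time: str, doctor: str) -> dict:
    """Check whether a doctor has an open slot."""
    return {"available": True, "doctor": doctor}


def dispatch(tool_call: dict[str, Any]) -> Any:
    """Run the tool the model asked for, with the arguments it supplied."""
    tool = TOOL_REGISTRY[tool_call["name"]]
    return tool["callable"](**tool_call["arguments"])


# A tool call shaped the way a model might emit it (made-up values).
call = {
    "name": "check_availability",
    "arguments": {"date": "2026-05-29", "time": "15:00", "doctor": "Dr. Sarah Lin"},
}
result = dispatch(call)
print(result)  # {'available': True, 'doctor': 'Dr. Sarah Lin'}
```

This is also why clear parameter names and type hints matter on tool functions: they are the only description of the function the model gets to see.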

Create appointment.py in the tools folder:
appointment.py:

from __future__ import annotations


import asyncio
import logging
import random
from typing import Any


from livekit.agents import function_tool


from tools.booking import find_available_slot, reserve_slot
from tools.memory import log_appointment, upsert_patient


logger = logging.getLogger("clinic-agent.tools")

A function to check whether a slot is available before asking the caller to confirm the booking. As written, it looks up the doctor's next open slot and returns that as the confirmed slot:

@function_tool
async def check_availability(date: str, time: str, doctor: str) -> dict[str, Any]:
    slot = await find_available_slot(doctor)
    if slot.get("available"):
        return {"available": True, "confirmed_slot": f"{slot.get('date', date)} {slot.get('time', time)}", "iso_date": slot.get("iso_date"), "iso_time": slot.get("iso_time")}
    return {"available": False}

A function to generate a booking ID, block a slot and book an appointment once the caller has explicitly confirmed the booking:

@function_tool
async def book_appointment(patient_name: str, phone: str, date: str, time: str, doctor: str, reason: str, iso_date: str | None = None, iso_time: str | None = None) -> dict[str, Any]:
    booking_id = f"TC-{random.randint(1000, 9999)}"
    slot = await find_available_slot(doctor, iso_date, iso_time)

    if not slot.get("available"):
        return {"status": "failed", "message": "No available slots. Please try another day/doctor."}

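    reserved = await reserve_slot(slot["slot_id"], patient_name, phone, doctor, booking_id, reason)
    if not reserved:
        return {"status": "failed", "message": "That slot was just taken. Please try another time."}

    # Save the caller profile and the audit record so the next call can recognise them.
    await upsert_patient(phone, patient_name, doctor, booking_id, date, time)
    await log_appointment(phone, doctor, date, time, reason, booking_id)

    logger.info("Booking confirmed: %s for %s", booking_id, patient_name)
    return {"booking_id": booking_id, "patient_name": patient_name, "phone": phone, "date": date, "time": time, "doctor": doctor, "status": "confirmed"}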
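A four-digit random number gives only 9,000 possible booking IDs, so collisions become likely as bookings accumulate. One possible alternative is sketched below. `new_booking_id`, `ALPHABET` and the `existing` set are assumptions made for illustration; in the real project the uniqueness check would be a database lookup rather than an in-memory set.

```python
import secrets

# Letters and digits that are easy to read aloud: no 0/O and no 1/I/L.
ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZ23456789"


def new_booking_id(existing: set[str], length: int = 6, prefix: str = "TC-") -> str:
    """Draw random IDs until one is not already in use."""
    while True:
        candidate = prefix + "".join(secrets.choice(ALPHABET) for _ in range(length))
        if candidate not in existing:
            return candidate


booking_id = new_booking_id(existing={"TC-ABC234"})
print(booking_id)
```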

Function to get the list of available doctors, when the caller needs help picking a doctor:

@function_tool
async def get_doctor_list() -> list[dict[str, str]]:
    return [{"name": "Dr. Sarah Lin", "specialty": "General Physician"}, {"name": "Dr. James Cole", "specialty": "Physician"}]

Updating the System Prompt

Now we update the prompt. We instruct the LLM on the booking rules, and if the user is found in our database, we inject their history directly into the prompt so the agent already knows their name and preferred doctor before the call even begins.

Update the contents of prompts/system_prompt.py:

Add this to the end of SYSTEM_PROMPT:

"""
BOOKING (New Appointment)
Collect these details one or two at a time: Patient's name, phone, date, time, doctor (Dr. Sarah Lin or Dr. James Cole), and reason.
If the caller asks about a slot, use check_availability first.
Read back the details and ask: "So that's an appointment for [name] at [time] on [date] with [doctor] for [reason] — shall I go ahead and book that?"
ONLY call book_appointment after the patient clearly says yes.
"""

Update build_system_prompt so that, when the caller is a returning caller, their stored details are appended to the prompt:

def build_system_prompt(patient: dict | None = None) -> str:
    if patient is None:
        return SYSTEM_PROMPT

    last_appt = patient.get("last_appointment_date", "unknown")
    last_time = patient.get("last_appointment_time", "unknown")
    last_doctor = patient.get("preferred_doctor", "unknown")

    memory_block = f"""

## Caller memory
You already know this caller. Do not ask for their name or phone number again.
- Name: {patient["name"]}
- Preferred doctor: {last_doctor}
- Last appointment: {last_appt} at {last_time}


Greet them warmly by name. If they want to re-book, use these details!
"""
    return SYSTEM_PROMPT + memory_block
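One detail to watch: a database row usually contains every column, with None for empty ones, and dict.get(key, "unknown") only falls back when the key is missing altogether. The sketch below shows one way to guard against "None" leaking into the prompt. `BASE_PROMPT`, the `memory_block` function and the sample patient are illustrative stand-ins, not the project's code or data.

```python
from __future__ import annotations

BASE_PROMPT = "You are Matthew, the AI receptionist for The Clinic."  # shortened stand-in


def memory_block(patient: dict | None) -> str:
    """Render the caller-memory section, or an empty string for first-time callers."""
    if not patient or not patient.get("name"):
        return ""
    # `or "unknown"` also covers keys that exist but hold None (NULL columns).
    doctor = patient.get("preferred_doctor") or "unknown"
    date = patient.get("last_appointment_date") or "unknown"
    time = patient.get("last_appointment_time") or "unknown"
    return (
        "\n\n## Caller memory\n"
        f"- Name: {patient['name']}\n"
        f"- Preferred doctor: {doctor}\n"
        f"- Last appointment: {date} at {time}\n"
    )


# Made-up sample row with NULL appointment columns.
returning = {
    "name": "Priya",
    "preferred_doctor": "Dr. Sarah Lin",
    "last_appointment_date": None,
    "last_appointment_time": None,
}
print(BASE_PROMPT + memory_block(returning))
```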

Wiring it all together

And finally, to bring all of the additions we just made to life, we make a few changes to agent.py.

Add these imports and the helper functions to the top of agent.py:

from livekit import rtc
from tools.memory import get_patient, increment_call_count
from tools.booking import check_slot_coverage
from tools.appointment import book_appointment, check_availability, get_doctor_list

Function to extract the caller’s phone number from Twilio’s SIP stream:

def _sip_caller_phone(participant: rtc.RemoteParticipant) -> str | None:
    if participant.kind != rtc.ParticipantKind.PARTICIPANT_KIND_SIP: return None
    return participant.attributes.get("sip.phoneNumber") or participant.identity

Function to swap the generalised opening line with a personalised one for returning callers:

def _opening_line_for_patient(patient: dict | None) -> str:
    if patient and patient.get("name"):
        return f"Hello {patient['name']}, welcome back to The Clinic. I'm Matthew, your AI receptionist. How can I help you today?"
    return OPENING_LINE

Add the newly added tools to the agent:

class ClinicAgent(Agent):
    def __init__(self, instructions: str) -> None:
        super().__init__(
            instructions=instructions,
            tools=[book_appointment, check_availability, get_doctor_list],
        )

Add the slot coverage checking to the prewarm function:

def prewarm(proc: JobProcess) -> None:
    proc.userdata["vad"] = silero.VAD.load()
    try: asyncio.run(check_slot_coverage())
    except Exception: pass

Replace the entrypoint function with the version below. It intercepts the SIP connection, reads the caller's phone number, checks whether this is a returning caller, fetches their record, and builds the personalised greeting and prompt before the session starts:

async def entrypoint(ctx: JobContext) -> None:
    is_phone = _is_phone_room(ctx.room.name)
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY if is_phone else AutoSubscribe.SUBSCRIBE_ALL)

    # Keep your STT, LLM, TTS initialization code here exactly as it was

    # --- 1. RESOLVE THE CALLER ---
    caller_phone = None
    if is_phone:
        # Check if Twilio passed the phone number immediately
        for p in ctx.room.remote_participants.values():
            caller_phone = _sip_caller_phone(p)
            if caller_phone: break

        # If not immediately available, wait briefly for the network handshake
        if not caller_phone:
            try:
                p = await asyncio.wait_for(ctx.wait_for_participant(), timeout=10.0)
                caller_phone = _sip_caller_phone(p)
            except asyncio.TimeoutError: pass

    # --- 2. FETCH MEMORY ---
    patient_memory = None
    if caller_phone:
        patient_memory = await get_patient(caller_phone)
        # Update the analytics call count in the background
        asyncio.create_task(increment_call_count(caller_phone))

    # --- 3. INJECT CONTEXT ---
    prompt = build_system_prompt(patient_memory)
    opening_greeting = _opening_line_for_patient(patient_memory)

    # --- 4. START SESSION ---
    session = AgentSession(stt=stt, llm=llm, tts=tts, vad=ctx.proc.userdata["vad"])
    await session.start(ClinicAgent(prompt), room=ctx.room)


    if is_phone:
        try:
            participant = await asyncio.wait_for(ctx.wait_for_participant(), timeout=20.0)
            session.room_io.set_participant(participant.identity)
        except asyncio.TimeoutError: return


        # Greet the caller (Uses the personalized greeting if they exist in the DB)
        handle = session.say(opening_greeting, allow_interruptions=False)
        await handle.wait_for_playout()


    while ctx.room.isconnected():
        await asyncio.sleep(0.25)

Now you can start the agent and call it. Once you have booked an appointment, the agent will recognise you on your next call, greet you by name, and have the details of that appointment available.

RAG

The agent can now book appointments and recognise returning callers, but if a caller asks for static information about the business, such as clinic hours or parking, it could confidently hallucinate an answer.

To prevent this, we can use Retrieval-Augmented Generation (RAG). Instead of relying on cloud-based vector databases that add network round-trip latency, we are going to build a local RAG pipeline. The agent will chunk, embed, and search a markdown document on your local machine using LanceDB to answer specific questions quickly and accurately.

The Knowledge Base

In your project root, create a folder named knowledge and inside the folder, create clinic_faq.md.
Structure the file using ## (H2) headings. Our Python script will use these headings to break the document into searchable chunks.

This has been written for a clinic, but you can restructure it and add your own data to use it for any business.

clinic_faq.md:

## About The Clinic


The Clinic is a general practice clinic in Maplewood, founded in 2018.


## Clinic hours


We are open Monday to Saturday from nine in the morning to one in the afternoon,
and from five in the evening to eight at night. We are closed on Sundays and
public holidays. The last appointment is thirty minutes before closing.


## Consultation fees


A general consultation costs fifty dollars. A follow-up within two weeks
costs thirty-five dollars. A specialist consultation costs eighty dollars.


## Our doctors


Dr. Sarah Lin is a General Physician with fifteen years of experience,
specialising in diabetes, hypertension, and thyroid conditions.


Dr. James Cole is a Physician with ten years of experience, specialising
in respiratory conditions, infectious diseases, and general medicine.


## Location and how to find us


Our address is 14 Birch Lane, Suite 2, Maplewood. Free parking in the building lot.


## Booking and cancellation policy


Appointments are thirty minutes. Cancel at least two hours before. Rescheduling
is free once; a second reschedule in the same week carries a small admin charge.


## Payments we accept


Cash and all major debit/credit cards. We work with most major health insurers.


## Lab and diagnostic services


In-clinic: blood count, blood sugar, HbA1c, lipid profile, thyroid panel, LFT,
KFT, urine analysis. Most results same day by six in the evening.
No X-ray, ECG, or ultrasound on-site — we refer to a nearby diagnostic centre.


## Emergencies and urgent care


The Clinic is not an emergency facility. For emergencies, call your local
emergency number or go to the nearest ER.


## Contact


Phone: 1234567. WhatsApp available on the same number for confirmations.
Email: hello@theclinic.example — response within 24 hours on working days.

The local RAG engine

In your tools folder, create faq.py. We will use sentence-transformers to turn our text into math (vectors), and LanceDB to store and search those vectors without ever leaving your machine.
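To make "text into vectors, then search by distance" concrete, here is a deliberately tiny sketch in pure Python. It uses word counts as a stand-in embedding, so it only matches exact words; a sentence-transformers model produces dense vectors that also capture meaning. None of these names come from the project code.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real model returns dense float vectors."""
    return Counter(text.lower().split())


def cosine_distance(a: Counter, b: Counter) -> float:
    """0.0 means same direction, 1.0 means nothing in common."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


chunks = {
    "Clinic hours": "we are open monday to saturday and closed on sundays",
    "Consultation fees": "a general consultation costs fifty dollars",
}
index = {heading: embed(body) for heading, body in chunks.items()}


def search(query: str) -> tuple[str, float]:
    """Return the heading whose vector is closest to the query, with its distance."""
    q = embed(query)
    return min(((h, cosine_distance(q, v)) for h, v in index.items()), key=lambda x: x[1])


best_heading, distance = search("are you open on sundays")
print(best_heading, round(distance, 2))  # Clinic hours 0.43
```

A lower distance means a closer match, which is why a retrieval pipeline typically discards results whose distance is above some threshold.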

Part A: Setup and Index Building
We start with what happens when the server boots up: the script breaks the markdown file into chunks and saves them in the database.

tools/faq.py:

from __future__ import annotations

import asyncio
import logging
import re
from pathlib import Path

import lancedb
from livekit.agents.llm import function_tool
from sentence_transformers import SentenceTransformer

logger = logging.getLogger(__name__)

_ROOT = Path(__file__).resolve().parent.parent
DB_PATH = str(_ROOT / ".lancedb")
TABLE_NAME = "clinic_faq"
FAQ_PATH = _ROOT / "knowledge" / "clinic_faq.md"
MODEL_NAME = "all-MiniLM-L6-v2"
VECTOR_METRIC = "cosine"
DISTANCE_THRESHOLD = 0.83
MAX_RESPONSE_CHARS = 500
_SEARCH_LIMIT = 4

We split the document at every H2 (##) heading, and each section becomes a chunk:

def _chunk_markdown(text: str) -> list[dict]:
    chunks: list[dict] = []
    for part in re.split(r"\n(?=## )", text.strip()):
        part = part.strip()
        if not part.startswith("## "):
            continue
        lines = part.split("\n", 1)
        heading = lines[0].strip()
        body = lines[1].strip() if len(lines) > 1 else ""
        if len(body) >= 30:
            chunks.append({"heading": heading, "body": body, "full": f"{heading}\n\n{body}"})
    return chunks
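A quick way to see what the chunker keeps and what it drops is to run the same splitting logic on a small sample. The version below is a standalone copy for experimentation, with the 30-character minimum exposed as a hypothetical `min_body_chars` parameter; the sample text is made up.

```python
import re


def chunk_markdown(text: str, min_body_chars: int = 30) -> list[dict]:
    """Split on H2 headings; skip sections whose body is shorter than `min_body_chars`."""
    chunks = []
    for part in re.split(r"\n(?=## )", text.strip()):
        part = part.strip()
        if not part.startswith("## "):
            continue  # text before the first heading is ignored
        heading, _, body = part.partition("\n")
        body = body.strip()
        if len(body) >= min_body_chars:
            chunks.append({"heading": heading.strip(), "body": body})
    return chunks


sample = """Intro text without a heading.

## Clinic hours

We are open Monday to Saturday from nine in the morning to one in the afternoon.

## Parking

Free parking.
"""

headings = [c["heading"] for c in chunk_markdown(sample)]
print(headings)  # ['## Clinic hours']
```

The "Parking" section is dropped because its body is under 30 characters, so very short FAQ answers need a little extra text to be indexed.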

Build a LanceDB vector index from clinic_faq.md:

def build_index() -> None:
    if not FAQ_PATH.is_file():
        logger.error("FAQ file not found: %s", FAQ_PATH)
        return

    chunks = _chunk_markdown(FAQ_PATH.read_text(encoding="utf-8"))
    if not chunks:
        logger.error("No chunks produced from FAQ file")
        return

    model = SentenceTransformer(MODEL_NAME)
    vectors = model.encode([c["full"] for c in chunks])

    data = [
        {
            "heading": c["heading"],
            "body": c["body"],
            "full": c["full"],
            "vector": vectors[i].tolist(),
        }
        for i, c in enumerate(chunks)
    ]

    db = lancedb.connect(DB_PATH)
    if TABLE_NAME in db.table_names():
        db.drop_table(TABLE_NAME)
    db.create_table(TABLE_NAME, data)
    logger.info("FAQ index built: %d chunks", len(chunks))

Part B: Searching

Now we come to the actual tool the LLM will call.

Vector search alone can mismatch intent. For example, a question about 'fees' or 'price' may not land on the consultation fees section. To guard against this, we add heading boosts, which raise a section's rank when keywords in the query match its heading. In faq.py, add:

_HEADING_BOOSTS: list[tuple[tuple[str, ...], str]] = [
    (("timing", "hour", "open", "closed", "sunday", "holiday"), "clinic hours"),
    (("fee", "cost", "price", "charge", "follow-up", "consultation"), "consultation fees"),
    (("insurance", "payment", "card", "cashless"), "payments"),
    (("x-ray", "ecg", "ultrasound", "lab", "blood test"), "lab and diagnostic"),
    (("emergency", "urgent"), "emergencies"),
    (("doctor", "sarah", "james", "lin", "cole"), "our doctors"),
    (("address", "location", "find", "directions", "where"), "location"),
    (("cancel", "reschedule", "book", "appointment", "walk-in"), "booking"),
    (("phone", "whatsapp", "email", "contact"), "contact"),
]

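This returns a fixed boost of 0.25 when a query keyword matches the hinted section heading. Because the search returns distances (lower is better), the boost is later subtracted from the distance: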

def _heading_boost(query: str, heading: str) -> float:
    q, h = query.lower(), heading.lower()
    for keywords, hint in _HEADING_BOOSTS:
        if hint in h and any(k in q for k in keywords):
            return 0.25
    return 0.0
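A quick worked example may help show how the boost changes the ranking. This is a standalone sketch: the boost table is cut down to three rows copied from _HEADING_BOOSTS, and the headings and distances are invented numbers, not real search output.

```python
HEADING_BOOSTS = [
    (("fee", "cost", "price", "charge", "follow-up", "consultation"), "consultation fees"),
    (("insurance", "payment", "card", "cashless"), "payments"),
    (("doctor", "sarah", "james", "lin", "cole"), "our doctors"),
]


def heading_boost(query: str, heading: str) -> float:
    q, h = query.lower(), heading.lower()
    for keywords, hint in HEADING_BOOSTS:
        if hint in h and any(k in q for k in keywords):
            return 0.25
    return 0.0


# Hypothetical cosine distances for one query; lower means closer.
results = [
    {"heading": "## Payments and Insurance", "_distance": 0.55},
    {"heading": "## Consultation Fees", "_distance": 0.70},
]
query = "What is the price of a visit?"

# Same sort key as search_faq: distance minus boost.
ranked = sorted(results, key=lambda r: r["_distance"] - heading_boost(query, r["heading"]))
print([r["heading"] for r in ranked])  # ['## Consultation Fees', '## Payments and Insurance']

# Keywords are matched as substrings: 'lin' is inside 'online'.
print(heading_boost("Can I book online?", "## Our Doctors"))  # 0.25
```

Without the boost, the payments section would win on raw distance. The second print shows a side effect of substring matching worth knowing about: short keywords can fire on unrelated words. If that causes bad answers in practice, matching on whole words is one possible fix.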

Trim long answers at a sentence boundary, so the agent is never handed a block of text too long to speak in one go:

def _trim_at_sentence(text: str, max_chars: int) -> str:
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    boundary = cut.rfind(". ")
    return cut[: boundary + 1] if boundary >= int(max_chars * 0.55) else cut.rstrip() + "..."
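The two branches of the trimmer are easiest to see on short inputs. The sketch below repeats the function under a public name and runs it on two made-up strings with small limits, where the tool itself uses MAX_RESPONSE_CHARS.

```python
def trim_at_sentence(text: str, max_chars: int) -> str:
    """Same logic as _trim_at_sentence, repeated so this sketch runs on its own."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    boundary = cut.rfind(". ")
    # Only cut at the sentence end if it keeps at least 55% of the budget.
    return cut[: boundary + 1] if boundary >= int(max_chars * 0.55) else cut.rstrip() + "..."


hours = "We are open Monday to Saturday. Sundays are closed. Call us for holiday hours."
parking = "Parking is available behind the building near gate two"

# A sentence boundary falls late enough in the first 40 characters: cut there.
print(trim_at_sentence(hours, 40))    # We are open Monday to Saturday.

# No sentence boundary in the first 20 characters: hard cut plus ellipsis.
print(trim_at_sentence(parking, 20))  # Parking is available...
```

The 0.55 factor stops the function from returning a tiny first sentence when most of the budget would go unused.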

The function the LLM calls to actually perform the search:

@function_tool()
async def search_faq(query: str) -> str:
    """
    Search the clinic knowledge base to answer patient questions.
    Call this for ANY factual question about the clinic: hours, location,
    fees, doctors, lab services, parking, payments, cancellation policy,
    pharmacy, or emergencies.
    Do not guess — always call this tool first.
    query: the patient's question exactly as they asked it.
    """
    def _search(q: str) -> str:
        try:
            # Embed the query with the same model that built the index.
            model = SentenceTransformer(MODEL_NAME)
            vec = model.encode([q])[0].tolist()

            db = lancedb.connect(DB_PATH)
            if TABLE_NAME not in db.table_names():
                return ""

            results = (
                db.open_table(TABLE_NAME)
                .search(vec)
                .metric(VECTOR_METRIC)
                .limit(_SEARCH_LIMIT)
                .to_list()
            )

            # Filter by distance threshold, apply heading boost, pick best
            relevant = [r for r in results if r.get("_distance", 2.0) < DISTANCE_THRESHOLD]
            if not relevant:
                return ""

            # Lower distance is better, so the boost is subtracted before sorting.
            ranked = sorted(
                relevant,
                key=lambda r: r.get("_distance", 2.0) - _heading_boost(q, r.get("heading", "")),
            )
            best_body = ranked[0]["body"]
            return _trim_at_sentence(best_body, MAX_RESPONSE_CHARS)

        except Exception as exc:
            logger.error("FAQ search error: %s", exc)
            return ""

    result = await asyncio.to_thread(_search, query)
    if not result:
        return (
            "I don't have specific information on that. "
            "Let me have someone from our team call you back with the answer."
        )
    return result
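As written, _search constructs a SentenceTransformer on every call, which adds model-loading time to each question. One possible optimisation, not part of the code above, is to cache the model so it loads once per process. The sketch below shows the idea with functools.lru_cache; StubModel and get_model are hypothetical names, and the stub stands in for SentenceTransformer so the example runs without the dependency.

```python
from functools import lru_cache


class StubModel:
    """Stand-in for SentenceTransformer (assumption, for illustration only)."""

    instances = 0  # counts how many times a model was constructed

    def __init__(self, name: str) -> None:
        StubModel.instances += 1
        self.name = name

    def encode(self, texts: list[str]) -> list[list[float]]:
        # Fake one-dimensional "embedding": just the text length.
        return [[float(len(t))] for t in texts]


@lru_cache(maxsize=1)
def get_model(name: str) -> StubModel:
    # In faq.py this would return SentenceTransformer(name) instead.
    return StubModel(name)


for question in ["What are your hours?", "Do you take cards?"]:
    get_model("all-MiniLM-L6-v2").encode([question])

print(StubModel.instances)  # 1
```

With this in place, both build_index and _search could call get_model(MODEL_NAME), so the first call pays the loading cost and later calls reuse the same object.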

Equipping the Agent to use the RAG to answer clinic questions

In prompts/system_prompt.py, add this to the SYSTEM_PROMPT:

"""
ANSWERING CLINIC QUESTIONS
For any factual question about the clinic — hours, location, fees, doctors,
lab services, parking, payments, cancellation policy, pharmacy, or
emergencies — call the search_faq tool before answering. Never guess or
answer from memory. If search_faq returns nothing useful, say:
"Let me have someone from our team call you back with that information.
May I take your number?"
"""

Next, in agent.py, add the import at the top:

from tools.faq import build_index, search_faq

To the prewarm function, add:

    build_index()

This ensures the vector index is built on startup.

In ClinicAgent, add search_faq to the tools, to establish it as a tool the agent can use:

class ClinicAgent(Agent):
    def __init__(self, instructions: str) -> None:
        #!Layer 2: Memory and Tools
        super().__init__(
            instructions=instructions,
            tools=[book_appointment, check_availability, get_doctor_list, search_faq],
            )

Change the if __name__ == "__main__": section at the end of agent.py to this:

if __name__ == "__main__":
    import sys
    if len(sys.argv) >= 2 and sys.argv[1] == "download-files":
        build_index()
    else:
        cli.run_app(
            WorkerOptions(
                entrypoint_fnc=entrypoint,
                prewarm_fnc=prewarm,
                agent_name="clinic-agent",
                num_idle_processes=1,
            )
        )

Run:

python agent.py download-files

This downloads the embedding model and builds the FAQ index ahead of time, so run it once before starting the agent.

When you start the agent and make a call, it should answer questions about the clinic (or whichever business you've adapted it for) based on the content in the markdown file.

Some Optional Additions to Make the Agent Better

Google Calendar Blocking

After a successful booking, we create an event on the doctor's or business's calendar. This is only for visibility; the agent never reads from it. We've already covered the credentials and configuration in the setup section.

The code for this can be found in tools/calendar_mirror.py on GitHub ([repo link]).

In tools/appointment.py, import the new function:

from tools.calendar_mirror import create_calendar_event

In the book_appointment function, find the part where we create the Supabase background tasks, and add the calendar event creation as another background task:

    reserved = await reserve_slot(slot["slot_id"], patient_name, phone, doctor, booking_id, reason)
    if not reserved:
        return {"status": "failed", "message": "That slot was just taken. Please try another time."}
# -------------------------- ADD THIS ------------------------------------
    asyncio.create_task(create_calendar_event(
        patient_name, phone, doctor, reason, booking_id, slot["iso_date"], slot["iso_time"]
    ))
# ------------------------------------------------------------------------

    logger.info("Booking confirmed: %s for %s", booking_id, patient_name)
    return {"patient_name": patient_name, "phone": phone, "date": date, "time": time, "doctor": doctor, "status": "confirmed"}

WhatsApp Confirmation

After booking an appointment, we can configure the agent to send the caller a confirmation on WhatsApp. We've covered the configuration and credentials in the Twilio section.

The code for this can be found in tools/notifications.py on GitHub ([repo link]).

In tools/appointment.py:
Add the import to the top of the file:

from tools.notifications import send_whatsapp_confirmation

Add the WhatsApp confirmation as a background task in the book_appointment function:

asyncio.create_task(send_whatsapp_confirmation(phone, patient_name, doctor, date, time, booking_id, reason))

Cancelling and Rescheduling

A fully autonomous receptionist doesn't just book appointments; it manages the calendar. However, modifying or deleting existing appointments introduces a critical safety risk: we do not want the AI accidentally cancelling a booking just because a patient asked, "What is your cancellation policy?"

To solve this, we are going to implement a Two-Step Confirmation Pattern. The LLM must first call our tool with a confirmed=False flag to look up the appointment and read it back. It is strictly instructed to only call the tool again with confirmed=True after the patient explicitly says "Yes."

You can find the code for this on GitHub ([repo link]) in tools/cancellation.py.
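To make the pattern concrete, here is a minimal sketch of the two-step flow. It is not the repo's cancellation.py: the in-memory APPOINTMENTS store, the booking IDs and the _pending set are assumptions for illustration. The _pending set is an extra guard I have added on top of the confirmed flag, so a call with confirmed=True is only honoured after a lookup has happened, even if the model skips the first step.

```python
# Hypothetical in-memory store; the real tool would query the database.
APPOINTMENTS = {
    "BK-1001": {"patient": "Asha", "time": "2026-07-20 10:30", "status": "booked"},
    "BK-1002": {"patient": "Ravi", "time": "2026-07-21 15:00", "status": "booked"},
}
_pending: set[str] = set()  # bookings that have been read back to the caller


def cancel_appointment(booking_id: str, confirmed: bool = False) -> dict:
    appt = APPOINTMENTS.get(booking_id)
    if appt is None or appt["status"] != "booked":
        return {"status": "not_found", "message": "No active booking with that ID."}

    # Step 1: look up and read back. Also taken when the model passes
    # confirmed=True without a prior lookup for this booking.
    if not confirmed or booking_id not in _pending:
        _pending.add(booking_id)
        return {
            "status": "needs_confirmation",
            "message": f"Cancel the appointment for {appt['patient']} at {appt['time']}?",
        }

    # Step 2: the caller said yes after hearing the details.
    _pending.discard(booking_id)
    appt["status"] = "cancelled"
    return {"status": "cancelled", "message": "The appointment has been cancelled."}


step1 = cancel_appointment("BK-1001")                   # read back
step2 = cancel_appointment("BK-1001", confirmed=True)   # explicit yes
skipped = cancel_appointment("BK-1002", confirmed=True)  # model skipped step 1

print(step1["status"], step2["status"], skipped["status"])
# needs_confirmation cancelled needs_confirmation
```

The prompt instruction asks the model to follow the two steps; the server-side check is there for the calls where it does not.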

In agent.py,
Import the new functions at the top:

from tools.cancellation import cancel_appointment, reschedule_appointment

Add the two new tools to ClinicAgent:

class ClinicAgent(Agent):
    def __init__(self, instructions: str) -> None:
        super().__init__(
            instructions=instructions,
            tools=[book_appointment, check_availability, get_doctor_list, search_faq, transfer_to_human, cancel_appointment, reschedule_appointment],
            )

Adapting To Your Use Case

The clinic persona is a thin configuration layer on top of a general-purpose call agent. The voice pipeline, slot system and memory are business-agnostic. Here is what to change:

What to change	Which File	Update
Agent name and persona	`prompts/system_prompt.py`	Identity block, opening line and instructions
Staff/provider names	`sql/create_tables.sql`	slots_doctor_check constraint
Booking Flow	`prompts/system_prompt.py`	Booking intent section
Business hours and slots	`sql/create_tables.sql`, `supabase/functions/seed-slots/index.ts`	Seed times and working days
FAQ content	`knowledge/clinic_faq.md`	Replace entirely with content about your business, keep the ## heading structure
Calendar names	`.env`	`GOOGLE_CALENDAR_ID_*`
Handoff number	`.env`	`CLINIC_PHONE_NUMBER`
Voice	`agent.py`	Murf voice ID – see murf.ai/voices

Example Prompts

Hair Salon
System Prompt:

You are Zara, the AI receptionist for Curl & Cut salon, Indiranagar, Bangalore.
Opening line: Hello, thanks for calling Curl & Cut. I'm Zara, your AI assistant. How can I help?

Booking flow: service type (haircut / colour / blowout), stylist preference, date, time.
sql/create_tables.sql — update the provider constraint:
constraint slots_stylist_check check (doctor in ('Aisha', 'Priya', 'Riya'))
Replace knowledge/clinic_faq.md with your services, pricing, and cancellation policy.

Legal Intake
System Prompt:

You are Alex, the AI intake assistant for Mehta & Associates. Collect: caller name,
contact number, matter type (civil / criminal / family / property), and a brief description.
Then schedule a callback with a solicitor.

For callback-only intake, replace book_appointment with a lighter tool that logs the inquiry and records a preferred callback time. Memory, transcripts, and WhatsApp all still work as-is.

Restaurant
System Prompt:

You are Anaya, an AI receptionist who takes reservations for The Spice Room. Collect: guest name, contact number,
date, time, party size, and any dietary requirements.

The slots table works naturally for table-time pairs. Update the provider constraint to table names:
constraint slots_table_check check (doctor in ('Table 1', 'Table 2', 'Table 3', 'Terrace'))
Update supabase/functions/seed-slots/index.ts for your opening hours, days, and booking interval.

Errors

Error	Cause	Fix
Required environment variable 'X' is not set	Missing .env value	Copy .env.example to .env and fill in the variable
Agent answers but stays silent	Dispatch rule has no agents block	Edit the rule in LiveKit Cloud — add agentName: clinic-agent to roomConfig.agents
DuplexClosed in logs, call drops mid-greeting	dev mode restarts on file save	Use python agent.py start for all phone testing
Call drops immediately	TwiML Bin not reachable, or URI missing ;transport=tcp	Check the URI in the TwiML Bin and add ;transport=tcp at the end
ERROR: relation "slots" does not exist	Ran only part of create_tables.sql	Select the full file (Ctrl+A) and run it again from the top
Table "slots" is missing at seed time	Same as above	Same fix — the seed block at the bottom requires the tables above it
401 or 403 from Murf or Deepgram	Wrong or expired API key	Re-check MURF_API_KEY and DEEPGRAM_API_KEY in .env
WhatsApp message not delivered	Recipient has not joined the sandbox	Send join from the recipient's WhatsApp to the sandbox number
Calendar events not appearing	Service account not shared with the calendar	Go to each calendar's settings and share it with edit permissions to the service account email
Slot coverage warning on startup	Fewer than 14 days of available slots ahead	Run the manual seed SQL or POST to the Edge Function URL

The complete, ready-to-run source code for this project is available on GitHub. Clone the repo, swap out the system prompt, adjust the database schema, and build a voice agent for your own use case.

If you build something cool using this stack, I'd love to hear about it. Let me know what your agent is booking!

I Gave OpenClaw a Voice and It Ordered Me Dinner

Sanchita Sunil — Wed, 03 Jun 2026 09:09:15 +0000

Quick links.
Code: https://github.com/murf-ai/murf-cookbook/tree/main/examples/agents/food_ordering_agent
Video Walkthrough: https://www.youtube.com/watch?v=ypqzB093VLc
Configuration Deep Dive: https://dev.to/sanchita_sunil/notes-from-the-openclaw-voice-tutorial-4ngg

Building a working voice agent usually means stitching state across speech, logic, and external APIs by hand. OpenClaw gives you a runtime that handles most of that for you. To see how far that gets you in practice, I wired OpenClaw up to a microphone, a Murf Falcon voice, and a Swiggy account. In about 900 lines of TypeScript, I had an agent that could search restaurants, take add-to-cart instructions, and place a real order end to end.

This post is an architecture walkthrough. I'll explain what OpenClaw is doing under the hood, where I had to fight its defaults, and how the same blueprint applies to any voice agent you want to build.

Why OpenClaw

There are several agent frameworks out there. Most of them treat an agent as a function: input goes in, tool calls happen, output comes out. OpenClaw is different — it treats the agent as a runtime, more like a long-running server than a single call. Sessions can be paused and resumed, state is keyed and persisted, tool calls go through a typed MCP (Model Context Protocol) interface. And critically for what we are building, OpenClaw exposes block-level streaming hooks that let you intercept the model's output as it arrives.

A voice agent is the hardest case any agent framework will face, because the user can hear every millisecond of latency. If your framework only hands you the full reply at the end, you cannot stream audio to the speakers. The user is left in silence while the model generates 300 characters. That can take 2 to 4 seconds, which feels like forever in conversation time. OpenClaw hands you each block, a sentence or two, the moment it arrives. You turn that block into audio and play it while the model keeps generating the next one.

Three things in particular made this build feel small once I understood them:

Skills are markdown, not function definitions. The Swiggy integration is a SKILL.md file the model reads. No JSON schemas, no function-calling boilerplate. To swap Swiggy for GitHub or Notion later, I would install a different skill and change one config line.
MCP is built in. OpenClaw treats MCP servers as first-class. The Swiggy MCP plugs in through mcporter. Adding a new tool surface means adding a new skill, not writing glue code.
Streaming hooks are real. onBlockReply fires as the model writes. You drive synthesis from inside the callback.

Once those three things are in place, the rest of the build is mostly wiring the audio loop around them.

The pipeline

A microphone library captures audio and a streaming STT turns it into transcripts. Those transcripts go into OpenClaw, which decides what to do, calls skills, and streams text out. A streaming TTS turns each block into audio as it arrives, and a speaker library plays it back.

The audio loop is the same regardless of what the agent is doing. Plug a calendar skill into OpenClaw and you have a voice scheduling assistant. Plug in a GitHub skill and you have a voice PR reviewer. The loop does not change, only the skill and the system prompt do.

The stack

Layer	Tool	Why this one
Agent runtime	OpenClaw	Skill registry, MCP integration, block-level streaming hooks. The framework this post is built around.
Tool surface	Swiggy skill via ClawHub	Vendored MCP skill. Documents the API in markdown the model can read.
Microphone and speaker	Decibri	Native WASAPI on Windows, CoreAudio on Mac, ALSA on Linux. No browser layer.
Speech to text	Deepgram Flux	Streaming STT with end-of-turn detection inside the model. No separate VAD to wire up.
Text to speech	Murf Falcon	Low time-to-first-audio, and conversational voice styles that sound right in back-and-forth dialogue.
Language model	Gemini	Free tier, supports tool calling, fast on first token. Substitutable with any tool-calling LLM.

What I deliberately left out

Before the build, here is what is not in this version:

Wake word detection. The microphone is always on while the agent is not speaking. No "Hey Claw" trigger.
Cross-session memory. Every restart starts fresh. The session key is per-process.
Order cancellation. Swiggy's MCP does not expose it, so the skill routes the user to customer care.
Production hardening. This is a single-user CLI. No auth, no rate limiting, no observability. Don't ship it as is.

One note on latency, to set expectations now. Streaming TTS plays each sentence as soon as it is ready, which makes the agent feel responsive on most turns. But tool calls still take as long as tool calls take. When the agent is hitting Swiggy's API for restaurant search, there is real waiting that streaming cannot hide. I cover this in detail in the latency section.

Requirements

Set these up before continuing.

Node and package manager

Node 22.16 or newer. The repo is ESM-only and breaks on earlier versions.
pnpm 9 or newer. The lockfile is pnpm. npm and yarn will resolve different versions.

Platform audio dependencies

Decibri uses the native audio stack on each operating system, so the install steps differ.

Linux: apt install libasound2-dev on Debian-family, or alsa-lib-devel on Fedora-family.
Windows: WASAPI is built in. You need a C++ build toolchain for the Decibri binary. Install "Desktop development with C++" through Visual Studio Installer.
macOS: CoreAudio is built in. You need Xcode Command Line Tools: xcode-select --install.

External CLIs

npm install -g clawhub

clawhub is OpenClaw's skill registry.

API keys

Deepgram for the Flux STT key. New accounts get $200 in starter credit, no card required.
Murf for the Falcon TTS key. Created on the API tab of your Murf account, separate from a regular Murf Studio account.
An LLM provider of your choice. Most have a free tier sufficient for development.

Swiggy account

A Swiggy account with at least one saved delivery address. The agent orders to saved addresses, not live GPS, because the MCP surface exposes addresses, not coordinates.

Step 1: Clone, install, env

git clone --filter=blob:none --sparse https://github.com/murf-ai/murf-cookbook.git
cd murf-cookbook
git sparse-checkout set examples/agents/food_ordering_agent
cd examples/agents/food_ordering_agent
pnpm install
cp .env.example .env

Open .env and add:
DEEPGRAM_API_KEY=...
MURF_API_KEY=...
GEMINI_API_KEY=...

If you would rather use OpenAI or Anthropic instead of Gemini, change one line in openclaw.json and the env variable name. Tool-calling support is the only requirement.

Step 2: Authenticate Swiggy

Swiggy's MCP needs OAuth. Run this once, the browser opens, you log in, you approve.

node scripts/swiggy-auth.mjs

This opens a browser and signs you into Swiggy via PKCE OAuth, and writes the token as a static Authorization header into ~/.mcporter/mcporter.json. You won't need to do this again unless the token expires.

If the browser doesn't open automatically, the script prints the full auth URL that you can copy and paste manually.

Confirm that the skill can actually reach Swiggy:

node skills/swiggy/swiggy-cli.js food addresses

This should print your saved addresses. If the list is empty, save one in the Swiggy app before moving on.

Note: In the video I use mcporter auth swiggy-food — that no longer works. See the repo README for current auth steps.

Step 3: The four files you write

src/
  ear.ts    ~110 lines  microphone capture and Deepgram WebSocket
  brain.ts  ~500 lines  streaming TTS pipeline, calls OpenClaw
  voice.ts  ~140 lines  speaker output, two channels
  index.ts  ~140 lines  the event loop

The whole agent fits in about 900 lines. Three of these files are pure adapter code: microphone in, speaker out, and the loop that connects them. The interesting file is brain.ts, because that is where OpenClaw and Murf Falcon meet.

ear.ts: microphone in, transcript out

Decibri captures 16-bit PCM at 16 kHz in 100 ms chunks. Each chunk goes to a Deepgram Flux WebSocket on /v2/listen.

const params = new URLSearchParams();
params.append("model", "flux-general-en");
params.append("encoding", "linear16");
params.append("sample_rate", "16000");
for (const k of keyterms) params.append("keyterm", k);
const url = `wss://api.deepgram.com/v2/listen?${params.toString()}`;

Two things to know.

First, Flux has end-of-turn detection inside the transcription model. You don't need a separate Voice Activity Detector. You get one event called EndOfTurn and you respond to it.

if (data.type === "TurnInfo") {
  if (data.event === "EndOfTurn") {
    const transcript = data.transcript ?? "";
    if (transcript.trim().length > 0) {
      onTranscription(transcript.trim());
    }
  }
  return;
}

Second, there is a contextual keyterm trick that mattered a lot for Indian-English food vocabulary. After each agent reply, I extract the capitalised words ("Punjab Grill," "Paneer," "Meghana") and pass them as keyterms for the next turn. This is what fixes "Kadhai Paneer" being heard as "car die panel." Standard English ASR doesn't handle Indian food names well. Per-turn keyterm biasing gets it most of the way there.
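The extraction step is simple enough to sketch. The repo does this in TypeScript inside the agent; the version below is a rough Python illustration of the idea, not the repo's code. The function name, the SENTENCE_STARTERS list and the sample reply are all assumptions.

```python
import re

# Words that are capitalised only because they start a sentence (illustrative list).
SENTENCE_STARTERS = {"The", "Should", "Would", "Want", "Sure", "Okay"}


def extract_keyterms(reply: str, limit: int = 20) -> list[str]:
    """Collect runs of capitalised words from the agent's last reply."""
    terms: list[str] = []
    for phrase in re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", reply):
        words = phrase.split()
        # Drop leading words that are capitalised only by sentence position.
        while words and words[0] in SENTENCE_STARTERS:
            words.pop(0)
        term = " ".join(words)
        if term and term not in terms:  # keep first occurrence, preserve order
            terms.append(term)
    return terms[:limit]


reply = (
    "I found Punjab Grill and Meghana Foods nearby. "
    "Should I add the Kadhai Paneer from Punjab Grill?"
)
print(extract_keyterms(reply))
# ['Punjab Grill', 'Meghana Foods', 'Kadhai Paneer']
```

Each returned term would then be appended as a keyterm parameter on the next turn's connection URL, the same way the params loop above does it.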

I wired Deepgram in directly here, not through OpenClaw's STT plugin slot. OpenClaw's STT integration is built for telephony, not a local CLI microphone. 110 lines of WebSocket code was the right tool for this job.

brain.ts: where OpenClaw earns its keep

This is the file that uses every OpenClaw primitive worth using.

The basic flow:

Call OpenClaw's chat() with the user's transcript.
Subscribe to OpenClaw's onBlockReply hook.
Hand each block to Murf Falcon for synthesis as it arrives.
Stream audio back to voice.ts in order.

OpenClaw's defaults are tuned for chat, where each block can be a paragraph and the user is reading on a screen. For voice, three overrides matter.

Override 1: turn streaming on. OpenClaw has two switches that both have to allow streaming. The naming is confusing because one is called disable. So you want disableBlockStreaming: false, which means "do not disable," which means "do stream."

const llmCall = getReplyFromConfig(ctx, {
  disableBlockStreaming: false,
});

Override 2: fix the coalescer. OpenClaw has a coalescer that decides when to flush a buffered block to your code. Its default minChars is 800. A typical voice reply is 200 to 300 characters, so the coalescer waits for a block that never arrives, then dumps everything at end-of-reply. Streaming defeated.

blockStreamingCoalesce: {
  minChars: 1,
  maxChars: 200,
  idleMs: 0,
  flushOnEnqueue: true,
},

flushOnEnqueue: true is the line that makes the rest of this work. It tells OpenClaw to hand the block over the moment it arrives, instead of waiting for more.

Override 3: track deltas yourself. OpenClaw's onBlockReply callback gives you the full text so far, not just the new piece. You compute the delta yourself. Three cases: extension (new starts with old), duplicate (skip), and reset (fresh string after a tool call). The reset case is easy to miss and shows up after every tool call.

let delta: string;
if (currentBlockStream && text.startsWith(currentBlockStream)) {
  delta = text.slice(currentBlockStream.length);
  currentBlockStream = text;
} else if (currentBlockStream && currentBlockStream.includes(text)) {
  return;
} else {
  delta = text;
  currentBlockStream = text;
}

Once you have the delta, you hand it to Murf for synthesis (synthesizeSpeech() in the snippet below). Synthesis runs in parallel across blocks, but playback runs in order, serialised through a Promise chain so that chunk 2 always plays after chunk 1 even if chunk 2's network call finishes first.

const synthP = synthesizeSpeech(trimmed).catch(() => null);
emitChain = emitChain.then(async () => {
  const audio = await synthP;
  if (audio) onAudioChunk(audio);
});

That is roughly 30 lines of streaming logic. The rest of brain.ts is the agent setup, the OpenClaw config, and a fallback path for when the model batches output after tool calls.

voice.ts: two speakers, not one

Falcon's synthesis is fast — Murf reports 130 ms time-to-first-audio, and that matches what I see in practice. So when there is dead air on the agent's first turn, it is not the TTS that is causing it. It is the cold-start cost of OpenClaw initialising, the Swiggy MCP handshake, the LLM doing its first call against a fresh tool chain. All of that has to finish before the model has produced its first block of text for Falcon to synthesise.

That is the gap pre-recorded filler audio is for. Short clips like "One moment please" or "let me check that for you" play 100 ms after the user stops talking, which is fast enough that the user does not perceive a delay.

The catch: the filler is a variable-length clip, and the first real audio chunk can arrive before the filler finishes. If both play through one audio output, one cuts off the other. The fix is two separate Decibri outputs.

let oneShotSpeaker: InstanceType<typeof DecibriOutput> | null = null;
let streamSpeaker: InstanceType<typeof DecibriOutput> | null = null;

oneShotSpeaker plays fillers. streamSpeaker plays the real reply. When the first reply chunk arrives, I stop the filler channel without touching the reply channel. Anything queued on the reply channel keeps playing.

This sounds like overkill until you hear the alternative. With one channel, the filler clips the agent saying "Sure" and the user only hears "...I'll add that."

index.ts: the loop

async function startSession() {
  renderBanner();
  setImmediate(() => warmup());      // amortise OpenClaw cold start
  await playIntro();
  await openMicrophone();
}

ear.on("transcript", async (text) => {
  closeMicrophone();
  await playFiller();                  // mask LLM latency
  await chat(text);                    // streams audio as it arrives
  await openMicrophone();
});

That is the whole loop. Render the banner, kick off OpenClaw warmup in the background, play the intro, open the microphone. On each transcript: stop the microphone, play a filler, run the agent, reopen the microphone.

The setImmediate(() => warmup()) line runs OpenClaw's initialisation and the Swiggy MCP handshake while the user is hearing the intro. By the time the user finishes their first sentence, both are warm. That shaves several seconds off turn 1.

How the skill actually works

This is the part that surprised me most when I first used OpenClaw.

The agent learns to use Swiggy by reading a markdown file. Not a JSON schema, not function definitions. A human-readable file called SKILL.md that documents the commands, the sequencing rules, and the things to never do. The model reads this, figures out what to call, and emits shell commands that run against a CLI wrapper.

The wrapper is small. node skills/swiggy/swiggy-cli.js food <command> is the shape of every call. The skill knows commands like search-restaurants, get-menu, add-to-cart, checkout. The model sequences them on its own, based on the markdown documentation.

Here is a snippet from SKILL.md (paraphrased):

search-restaurants: Find restaurants matching a cuisine or dish. Use this first whenever the user mentions a food. Example: search-restaurants --query "biryani". Always call get-addresses first if you have not yet, because results depend on delivery location.

The model reads it the same way a new developer would read documentation on day one.

The one tweak I made: every swiggy food <cmd> call in SKILL.md became node skills/swiggy/swiggy-cli.js food <cmd>. OpenClaw's shell executor doesn't have npm globals on PATH, so the swiggy binary from npm link is not reachable.

The implication for builders: writing a new skill is writing a markdown file and a thin CLI. There is no SDK to learn, no function-calling glue to debug. If you can document an API in English with examples, you can give an OpenClaw agent the ability to call it.

Latency

The first turn is the slowest. Before any audio plays, OpenClaw has to initialise, complete the Swiggy MCP handshake, and make its first LLM call against a fresh tool chain. On a typical machine that takes anywhere from 15 to 50 seconds, depending on your network and your LLM provider. Streaming TTS does not save you here — the model has not produced anything to synthesise yet.

What does help is the combination of filler audio (which plays 100 ms after the user stops talking) and the background warmup that runs during the intro. Together they keep the perceived gap small even when the actual cold start is not.

Turn 2 onwards is a different story. With the runtime warm and the MCP connection open, first audio arrives 5 to 10 seconds after the user stops talking, and most of that is the LLM's time to its first sentence. Falcon's 130 ms TTFA is what makes "first sentence" actually translate to "first audio you hear."

If you genuinely need to push first-turn latency below this on tool-heavy turns, the only real lever is to take OpenClaw out of the loop on those turns — wiring the tool calls in directly, parallelising what OpenClaw would have serialised. I haven't done that in this build.

Swap the skill

The voice loop in this post does not care what the agent does. The skill lives in two files:

agents.defaults.skills in openclaw.json. Replace swiggy with another MCP skill. Google Calendar. GitHub. Notion. Linear. Pick one from ClawHub or write your own.
workspace/IDENTITY.md. The system prompt that describes who the agent is and how it should talk. Rewrite it for the new domain.

That portability is the case I wanted to make for OpenClaw with this post. The framework is doing real work behind the scenes, hiding the runtime, the MCP integration, the streaming, and the skill format behind primitives that are small enough to use without ceremony.

What I learned

The skill format is the part I underestimated going in. The model reads it the way a developer would read API docs on day one. There is no JSON schema to maintain, no function-calling boilerplate to update when the API changes. If your API is documentable in markdown, an OpenClaw agent can use it.

Voice agents are mostly a latency engineering problem. The transcription, the agent, the TTS are mostly solved. The work that made this build feel real was around the seams — two-channel playback, background warmup, per-turn keyterm bias, pre-baked fillers. You have to find these by listening to your own demo and noticing what sounds wrong.

The combination of streaming hooks and per-block synthesis is what made the conversational rhythm work. Falcon at 130 ms TTFA is fast on its own, and so is OpenClaw handing off blocks the moment they arrive. Together, if the LLM produces text in roughly 200 ms chunks and the TTS adds 130 ms on top, the user hears a new sentence every ~330 ms. That is faster than most humans speak, and it is what makes the agent feel like it is actually thinking out loud rather than waiting to deliver a finished answer.

If this was useful, the code is at github.com/murf-ai/murf-cookbook. A star helps the project reach more builders. Clone it, swap the skill, and build something else tonight. The configuration deep dive, with the parameter tables and error mappings, is at dev.to/sanchita_sunil/notes-from-the-openclaw-voice-tutorial-4ngg.

I would love to hear what you build with it.

Resources:
Murf Plugin: https://clawhub.ai/plugins/openclaw-murf-tts
Murf Falcon: https://murf.ai/api/dashboard
Openclaw: https://openclaw.ai/
Clawhub: https://clawhub.ai/
Deepgram: https://console.deepgram.com/

Notes from the Openclaw Voice Tutorial

Sanchita Sunil — Wed, 03 Jun 2026 08:29:41 +0000

This is a companion to the food-ordering agent tutorial video (You can find the video here: https://www.youtube.com/watch?v=ypqzB093VLc). The video walks you through cloning the repo and placing a real Swiggy order with your voice. This post fills in the parts the video pointed at but did not have time to cover:

Every Deepgram Flux parameter, what it does, and how the event model behaves
Why OpenClaw's block streaming defaults are wrong for voice, and which ones to flip
Falcon voice and locale compatibility, and how to swap voices without breaking things
Streaming-pipeline bugs that show up after setup, with their root causes

Repo: https://github.com/murf-ai/murf-cookbook/tree/main/examples/openclaw/food_ordering_agent
Video: https://www.youtube.com/watch?v=ypqzB093VLc

OpenClaw treats an agent as a runtime, not a prompt. A runtime is a program that runs continuously and remembers state between calls, like a server. A prompt, in contrast, is a single block of text sent to the model. The difference matters because OpenClaw can pause, resume, and track sessions across many turns.

That model works well for chat. Voice is where it starts to break down.

A microphone does not produce text. It produces audio frames (small chunks of raw sound data). A speaker cannot wait for the full reply before playing anything. The user will hear silence and assume the agent is broken. The same tool-call delay that is invisible in a chat UI becomes obvious dead air the moment the user can hear it.

Every piece of OpenClaw still works for voice. You just have to point each piece at the voice use case on purpose, instead of relying on the chat-friendly defaults. The next three sections walk through which defaults to change and why.

Requirements

If you do not have these, set them up before continuing.

Node and package manager

Node 22.16 or newer. The repo is ESM-only and breaks on earlier versions.
pnpm 9 or newer. The repo ships a pnpm lockfile, so npm and yarn will resolve different versions.

Platform audio dependencies

Decibri uses the native audio stack on each operating system, so the install steps differ.

Linux: apt install libasound2-dev on Debian-family distros, or dnf install alsa-lib-devel on Fedora-family. Required at install time.
Windows: WASAPI is built in. You need a C++ build toolchain for the Decibri binary. Install "Desktop development with C++" through Visual Studio Installer.
macOS: CoreAudio is built in. You need Xcode Command Line Tools: xcode-select --install.

External CLIs

clawhub. OpenClaw's skill registry. The Swiggy skill in this repo is vendored, so you do not strictly need clawhub to run the agent, but you will need it if you want to fetch other skills later.

API keys

Deepgram for the Flux STT key. New accounts get $200 in starter credit, no card required.
Murf for the Falcon TTS key. This is created on the API tab of your Murf account, separate from a regular Murf Studio account.
An LLM provider of your choice. Most have a free tier sufficient for development.

Swiggy

A Swiggy account with at least one saved delivery address. The agent orders to saved addresses, not live GPS, because the MCP surface exposes addresses, not coordinates.

Update: The Swiggy auth flow has changed since the video was recorded. mcporter auth swiggy-food no longer works — Swiggy MCP now requires an approved client_id and uses a custom PKCE script instead. Run node scripts/swiggy-auth.mjs. See the repo README for current steps.

Deepgram Flux

Flux is the STT we use in this build. There are several streaming STTs that work for voice agents; Flux is the one wired up here, and the parts below are the configuration you need to get right regardless of which API you go with.

One concept worth covering before the parameters: turn-taking. This is the decision of when the user has stopped talking and the agent should respond. Many streaming STT APIs hand back partial transcripts and leave turn-taking to your code, which usually means adding a separate Voice Activity Detector (VAD) that listens for silence. Flux does turn-taking inside the transcription model and emits structured events for it, so for this build we do not need a separate VAD.

Endpoint

An endpoint is the URL path you connect to on a server. Flux only works on /v2/listen. The older /v1/listen endpoint will silently reject the model parameter. You will spend an hour wondering why nothing transcribes.

const params = new URLSearchParams();
params.append("model", "flux-general-en");
params.append("encoding", "linear16");
params.append("sample_rate", "16000");
for (const k of keyterms) params.append("keyterm", k);
const url = `wss://api.deepgram.com/v2/listen?${params.toString()}`;

Use URLSearchParams to build the URL. It encodes spaces in multi-word keyterms correctly (as +). If you build the query string by hand and use %20 instead, Deepgram will close the connection without telling you why. This is the most common setup bug.
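To see the encoding difference for yourself, here is a small self-contained sketch. The helper name buildFluxUrl and the sample keyterms are mine, not the repo's; the parameter names are the ones shown above.

```typescript
// Illustrative helper; the name buildFluxUrl is an assumption, not the repo's.
function buildFluxUrl(keyterms: string[]): string {
  const params = new URLSearchParams();
  params.append("model", "flux-general-en");
  params.append("encoding", "linear16");
  params.append("sample_rate", "16000");
  // Repeating the same key is how multiple keyterms are passed.
  for (const k of keyterms) params.append("keyterm", k);
  return `wss://api.deepgram.com/v2/listen?${params.toString()}`;
}

const fluxUrl = buildFluxUrl(["Paneer Butter Masala", "Punjab Grill"]);
console.log(fluxUrl);

// URLSearchParams serialises a space as "+", while encodeURIComponent yields "%20".
const handEncoded = encodeURIComponent("Punjab Grill");
console.log(handEncoded);
```

Printing the URL before you open the socket is a cheap way to confirm which encoding you are actually sending.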

Parameters

The parameter table below uses the term PCM, which means pulse-code modulation. It is the standard way to represent raw audio as numbers. linear16 means each sample is a signed 16-bit number stored in little-endian byte order. Most audio libraries use this format by default.

Parameter	Value used	What it does
`model`	`flux-general-en`	Flux English. Use `flux-general-multi` for multilingual.
`encoding`	`linear16`	16-bit PCM audio. Must match what your microphone library outputs.
`sample_rate`	`16000`	16 kHz audio. Decibri captures at this rate by default.
`keyterm`	repeated	Vocabulary biasing. Up to 100 keyterms per connection.
`eager_eot_threshold`	not set	Enables EagerEndOfTurn events at this confidence. Off in this repo.

You can also pass eot_threshold to tune end-of-turn sensitivity. The default works well for short food-ordering sentences. If your agent handles longer thinking-out-loud utterances, raise it.

The Flux events we use

Flux sends five event types on its TurnInfo stream. The repo only consumes one of them, but the others are worth knowing because you will probably want some of them later.

Update. Partial transcript, updated as the user keeps talking. Useful if you want a live transcript display. Not used here.
StartOfTurn. The user just started speaking. This is where you would handle barge-in (cutting off the agent if it is still talking, so the user can interrupt). Not connected here.
EndOfTurn. High confidence the user is done. This is the only event the repo uses. When it fires, the transcript goes to the LLM and the agent starts generating a reply.
EagerEndOfTurn. Medium confidence the user is done. Off by default. If you turn it on (with eager_eot_threshold), the agent can start drafting a reply early. Saves some delay at the cost of more LLM calls because some drafts get thrown away.
TurnResumed. Only fires after an EagerEndOfTurn. Means the user was not actually done, and any draft you started should be discarded.

if (data.type === "TurnInfo") {
  if (data.event === "EndOfTurn") {
    const transcript: string = data.transcript ?? "";
    if (transcript.trim().length > 0) {
      onTranscription(transcript.trim());
    }
  }
  return;
}
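If you later turn on EagerEndOfTurn, the three end-of-turn events behave like a small state machine: start a draft on EagerEndOfTurn, throw it away on TurnResumed, commit on EndOfTurn. The sketch below is not from the repo. The DraftTracker class and its log are hypothetical, and only the event names come from the list above.

```typescript
type FluxTurnEvent =
  | "Update"
  | "StartOfTurn"
  | "EagerEndOfTurn"
  | "TurnResumed"
  | "EndOfTurn";

// Hypothetical tracker: records what a caller would do for each turn event.
class DraftTracker {
  private draft: string | null = null;
  readonly log: string[] = [];

  handle(event: FluxTurnEvent, transcript: string): void {
    const text = transcript.trim();
    switch (event) {
      case "EagerEndOfTurn":
        // Medium confidence: start drafting a reply speculatively.
        this.draft = text;
        this.log.push(`draft:${text}`);
        break;
      case "TurnResumed":
        // The user kept talking, so the speculative draft is stale.
        if (this.draft !== null) this.log.push("discard");
        this.draft = null;
        break;
      case "EndOfTurn":
        if (!text) break; // ignore empty transcripts, as the snippet above does
        // Reuse the draft only if it was built from the same transcript.
        this.log.push(this.draft === text ? `commit-draft:${text}` : `fresh:${text}`);
        this.draft = null;
        break;
      default:
        break; // Update and StartOfTurn are not needed for this sketch
    }
  }
}

const tracker = new DraftTracker();
tracker.handle("EagerEndOfTurn", "one paneer");
tracker.handle("TurnResumed", "");
tracker.handle("EagerEndOfTurn", "one paneer tikka");
tracker.handle("EndOfTurn", "one paneer tikka");
console.log(tracker.log);
```

The discarded first draft is the extra LLM call you pay for in exchange for the earlier start.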

Keyterm biasing for Indian-English food vocabulary

Deepgram lets you pass up to 100 keyterms per connection. Keyterms tell the model "if you hear something close to one of these words, lean toward this spelling." Most apps set keyterms once at connect time using a fixed vocabulary.

Flux's Configure control message lets you update keyterms on every turn. The repo uses this to bias the next turn on whatever proper nouns the agent just said.

function extractContextualKeyterms(text: string): string[] {
  const tokens = text
    .replace(/[.,!?;:()"']/g, " ")
    .split(/\s+/)
    .filter((w) => w.length >= 3 && /^[A-Z]/.test(w) && !KEYTERM_STOPWORDS.has(w));
  return [...new Set(tokens)];
}

The idea is simple. If the agent just said "Paneer Butter Masala from Punjab Grill," the user's reply is much more likely to contain those words than some random restaurant name. So we extract the capitalised words from the agent's last reply and use them as bias for the next turn.

For Indian-English food vocabulary, where standard English speech recognition struggles the most, this one feature is the difference between the agent hearing "Kadhai Paneer" and hearing "car die panel."
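Here is the same extractor as a runnable sketch you can paste into a scratch file. The stopword list is my own guess, since the repo's KEYTERM_STOPWORDS is not shown in this post, and the cap of 100 is an addition that mirrors the per-connection limit mentioned above.

```typescript
// Assumed stopword list; the repo's actual KEYTERM_STOPWORDS may differ.
const KEYTERM_STOPWORDS = new Set(["The", "Would", "Sure", "Your", "You", "What", "And"]);
// Assumption: cap the list at the per-connection keyterm limit.
const MAX_KEYTERMS = 100;

function extractContextualKeyterms(text: string): string[] {
  const tokens = text
    .replace(/[.,!?;:()"']/g, " ")
    .split(/\s+/)
    .filter((w) => w.length >= 3 && /^[A-Z]/.test(w) && !KEYTERM_STOPWORDS.has(w));
  // Deduplicate, then cap so the bias list never exceeds the limit.
  return Array.from(new Set(tokens)).slice(0, MAX_KEYTERMS);
}

const agentReply =
  "Sure! Paneer Butter Masala from Punjab Grill is available. Would you like it?";
const nextTurnKeyterms = extractContextualKeyterms(agentReply);
console.log(nextTurnKeyterms);
```

Sentence-initial words like "Sure" and "Would" are capitalised too, which is why the stopword set matters more than it looks.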

Cost

Deepgram bills Flux per second of streaming audio. As of early 2026, the pay-as-you-go rate sits in the range of $0.0077 to $0.015 per minute, depending on the plan and region. Check Deepgram's pricing page for current numbers. New accounts get $200 in starter credit.

A rough cost estimate for the food-ordering agent:

Average turn: 3 seconds of user speech, microphone open during user speech only
Per-turn STT cost: 3 seconds at the higher end of the range, roughly $0.00075
Ten-turn ordering session: under one cent for STT

You will run out of patience for testing long before you run out of the $200 credit.
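The arithmetic behind that estimate is short enough to write down. This sketch uses the upper end of the quoted range as an assumed rate; swap in the current number from Deepgram's pricing page. The function and constant names are mine.

```typescript
// Assumed rate: the upper end of the range quoted above, in USD per minute.
const RATE_PER_MINUTE_USD = 0.015;
const STARTER_CREDIT_USD = 200;

function sttCostUsd(audioSeconds: number, ratePerMinuteUsd: number): number {
  return (audioSeconds / 60) * ratePerMinuteUsd;
}

const perTurnUsd = sttCostUsd(3, RATE_PER_MINUTE_USD); // one 3-second user turn
const perSessionUsd = perTurnUsd * 10; // ten-turn ordering session
const sessionsPerCredit = Math.floor(STARTER_CREDIT_USD / perSessionUsd);

console.log(perTurnUsd.toFixed(5));
console.log(perSessionUsd.toFixed(4));
console.log(sessionsPerCredit);
```

At that rate the starter credit covers tens of thousands of ten-turn sessions, counting STT only.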

Block streaming

OpenClaw was built for chat first. Its block streaming was tuned for long replies on a screen. In that setup, each block (a unit of text the model sends back) might be a whole paragraph. For voice, each block should be a sentence or two. Every millisecond between "LLM produced text" and "speaker plays sound" is silence the user can hear.

The defaults are wrong for voice. Until you change them, OpenClaw quietly holds onto your blocks instead of sending them to your code right away.

First, turn streaming on

OpenClaw has two settings that control block streaming:

blockStreamingDefault in the config (the channel-wide default)
disableBlockStreaming at the call site (the override for one call)

Both have to allow streaming, or it will not happen.

const llmCall = getReplyFromConfig(ctx, {
  disableBlockStreaming: false,
});

The naming is confusing. The option is called disable, so false means "do not disable." Which means "do stream." So you want disableBlockStreaming: false. Read it twice if needed.

Fix the coalescer

The coalescer is the component that decides when to send a buffered block to your code. To buffer means to hold onto something until enough has built up. To send the buffered content onward is to flush it.

The coalescer's default minChars setting is 800. A typical voice reply is 200 to 300 characters. So with the default, the coalescer waits for an 800-character block that will never arrive. It gives up at the end of the reply and dumps everything at once. Streaming defeated.

Override it like this (brain.ts lines 96 to 109):

blockStreamingChunk: {
  minChars: 1,
  maxChars: 200,
  breakPreference: "sentence",
},
blockStreamingCoalesce: {
  minChars: 1,
  maxChars: 200,
  idleMs: 0,
  flushOnEnqueue: true,
},

The line that matters most is flushOnEnqueue: true. It tells the coalescer to send the block to your code the moment it arrives, without waiting. Every other override is necessary, but useless without this one.
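A toy model makes the failure mode easier to picture. This is not OpenClaw's coalescer. It is a deliberately simplified stand-in that only models minChars and flushOnEnqueue, and every name in it is hypothetical.

```typescript
interface ToyCoalesceOptions {
  minChars: number;
  flushOnEnqueue: boolean;
}

// Toy model only: buffers text until minChars is reached, unless flushOnEnqueue is set.
class ToyCoalescer {
  private buffer = "";
  private readonly opts: ToyCoalesceOptions;
  readonly flushed: string[] = [];

  constructor(opts: ToyCoalesceOptions) {
    this.opts = opts;
  }

  enqueue(block: string): void {
    this.buffer += block;
    if (this.opts.flushOnEnqueue || this.buffer.length >= this.opts.minChars) {
      this.flush();
    }
  }

  // Called when the reply ends; whatever is still buffered goes out in one piece.
  end(): void {
    this.flush();
  }

  private flush(): void {
    if (this.buffer.length === 0) return;
    this.flushed.push(this.buffer);
    this.buffer = "";
  }
}

const sentences = ["Found Punjab Grill. ", "Paneer Butter Masala is on the menu. ", "Shall I add it?"];

const chatDefaults = new ToyCoalescer({ minChars: 800, flushOnEnqueue: false });
const voiceTuned = new ToyCoalescer({ minChars: 1, flushOnEnqueue: true });
for (const s of sentences) {
  chatDefaults.enqueue(s);
  voiceTuned.enqueue(s);
}
chatDefaults.end();
voiceTuned.end();

console.log(chatDefaults.flushed.length, voiceTuned.flushed.length);
```

With the chat-style settings the three sentences come out as a single flush at the end. With the voice settings each sentence goes out as soon as it is enqueued.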

Track deltas yourself

A callback is a function that OpenClaw calls when something happens, like a new block arriving. OpenClaw's onBlockReply callback is given the full text so far, not just the new piece. So you have to figure out what is new yourself. The new piece is called the delta.

Here is how the repo computes it (brain.ts lines 486 to 501):

let delta: string;
if (currentBlockStream && text.startsWith(currentBlockStream)) {
  delta = text.slice(currentBlockStream.length);
  currentBlockStream = text;
} else if (currentBlockStream && currentBlockStream.includes(text)) {
  return; // already covered
} else {
  delta = text;
  currentBlockStream = text;
}

There are three cases here, and the third is the one that matters most:

Extension. The new text starts with the old text. The delta is just the part at the end. Easy.
Duplicate. The same block got reported twice. Skip it.
Reset. The new text has nothing to do with the old text. This happens after a tool call finishes. OpenClaw starts a fresh block stream, and the new text is a brand-new string. Without this branch, you would either lose the new block or join it incorrectly to the old one.
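The three cases are easy to exercise in isolation. This sketch wraps the same branching in a closure so it can run on its own. The createDeltaTracker wrapper and the sample strings are mine, not the repo's.

```typescript
// Returns the unseen part of `text`, or null when the block is a duplicate.
function createDeltaTracker(): (text: string) => string | null {
  let currentBlockStream = "";
  return (text: string): string | null => {
    if (currentBlockStream && text.startsWith(currentBlockStream)) {
      // Extension: keep only the newly appended tail.
      const delta = text.slice(currentBlockStream.length);
      currentBlockStream = text;
      return delta;
    }
    if (currentBlockStream && currentBlockStream.includes(text)) {
      return null; // Duplicate: already covered.
    }
    // Reset: a fresh stream, for example after a tool call.
    currentBlockStream = text;
    return text;
  };
}

const nextDelta = createDeltaTracker();
const deltas = [
  nextDelta("Looking that up."), // first block, treated as a reset
  nextDelta("Looking that up. One moment."), // extension
  nextDelta("One moment."), // duplicate of text already seen
  nextDelta("Found three restaurants."), // reset after a tool call
];
console.log(deltas);
```

Note that an extension delta keeps its leading space, so trim before you hand it to synthesis.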

The empty payload.text quirk

When block streaming is actually working, payload.text in the final reply is an empty string. This is not a bug.

OpenClaw has a check called shouldDropFinalPayloads that removes the text from the final payload once it has already been streamed. This avoids sending the same text twice. The repo handles this by collecting text in its own buffer (canonicalText) as chunks arrive. It only falls back to payload.text if the buffer is empty:

if (!canonicalText && payloadText) canonicalText = payloadText;

Murf Falcon

Synthesis is the technical word for generating audio from text. Murf Falcon is the TTS model used in this build. Murf reports a model latency of 55 ms and a time-to-first-audio of 130 ms, at $0.01 per 1,000 characters — roughly 1 cent per minute of generated audio.

Turn off OpenClaw's built-in TTS

OpenClaw ships with its own TTS pipeline. By default it runs in auto: "on" mode, which produces one final audio file at the end of a reply. That mode is incompatible with per-block streaming, so we turn it off (openclaw.json lines 30 to 47):

"tts": {
  "provider": "murf",
  "auto": "off",
  "mode": "final",
  "providers": {
    "murf": {
      "voiceId": "en-IN-anusha",
      "model": "FALCON",
      "locale": "en-IN",
      "style": "Conversational"
    }
  }
}

With auto: "off", the Murf provider stays loaded and configured. But your code is now in charge of synthesis. You call murfProvider.synthesize() directly on each block.

Voice and locale compatibility

A locale is a code that identifies a language and region together, like en-IN for English in India or es-MX for Spanish in Mexico.

Falcon supports voices across many languages, but each voice is bound to its locale. If you set voiceId to an English voice and locale to hi-IN, the API rejects the request. So when you swap voices, change both values together. Changing only one of the two breaks synthesis.

Voice ID prefix	Locale	Notes
`en-IN-*`	`en-IN`	Indian English. Used in this repo.
`en-US-*`	`en-US`	American English.
`en-UK-*`	`en-UK`	British English.
`hi-IN-*`	`hi-IN`	Hindi.
`es-ES-*`	`es-ES`	Spanish (Spain).
`es-MX-*`	`es-MX`	Spanish (Mexico). Different voices than Spain.

The full list is in Murf's API docs. Before you change voiceId in openclaw.json, query /v1/speech/voices?model=FALCON and pick a voice and its matching locale together.
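A cheap guard is to check the pair yourself before any request goes out. This sketch assumes voice IDs always follow the prefix pattern in the table above (language, region, then a name), which may not hold for every voice. Both function names are hypothetical.

```typescript
// Derives the locale from the voice ID prefix, e.g. "en-IN-anusha" -> "en-IN".
// Assumption: IDs look like <lang>-<REGION>-<name>, as in the table above.
function localeFromVoiceId(voiceId: string): string {
  const parts = voiceId.split("-");
  if (parts.length < 3) {
    throw new Error(`Unexpected voice ID format: ${voiceId}`);
  }
  return `${parts[0]}-${parts[1]}`;
}

// Throws before any request is made if the pair is inconsistent.
function assertVoiceMatchesLocale(voiceId: string, locale: string): void {
  const expected = localeFromVoiceId(voiceId);
  if (expected !== locale) {
    throw new Error(`Voice ${voiceId} belongs to ${expected}, but locale is ${locale}`);
  }
}

assertVoiceMatchesLocale("en-IN-anusha", "en-IN"); // consistent pair, no error

let mismatchMessage = "";
try {
  assertVoiceMatchesLocale("en-IN-anusha", "hi-IN");
} catch (err) {
  mismatchMessage = err instanceof Error ? err.message : String(err);
}
console.log(mismatchMessage);
```

Running a check like this at startup turns a confusing runtime failure into a one-line config error.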

Pick the right voice style

Falcon exposes a style parameter. Pick Conversational for agent work. A voice that sounds great reading an audiobook usually sounds wrong in a back-and-forth conversation. Promotional and Narration styles sound theatrical when the agent is saying short things like "Sure, anything else?"

Two speaker outputs

The pre-recorded filler audio masks the cold-start delay by playing while the LLM is still thinking. The problem is that filler clips vary in length, and the first real audio chunk can arrive before the filler finishes.

If you play both through the same audio output, one of two bad things happens:

The filler cuts off the start of the real reply, or
The reply cuts off the end of the filler.

The fix is two separate audio outputs (voice.ts lines 10 to 11):

let oneShotSpeaker: InstanceType<typeof DecibriOutput> | null = null;
let streamSpeaker: InstanceType<typeof DecibriOutput> | null = null;

oneShotSpeaker plays fillers. streamSpeaker plays the actual reply. When the first reply chunk arrives, stopOneShotPlayback() stops the filler channel without touching the reply channel. Anything already queued on the reply channel keeps playing.
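The hand-off logic can be tested without any audio hardware. The sketch below uses a made-up SpeakerLike interface and a fake speaker that records calls. DecibriOutput's real API may look different, so treat the method names as placeholders.

```typescript
// Hypothetical minimal speaker interface; DecibriOutput's real API may differ.
interface SpeakerLike {
  write(audio: Uint8Array): void;
  stop(): void;
}

// Fake speaker that records calls instead of playing sound.
class FakeSpeaker implements SpeakerLike {
  readonly events: string[] = [];
  private readonly label: string;
  constructor(label: string) {
    this.label = label;
  }
  write(audio: Uint8Array): void {
    this.events.push(`${this.label}:write:${audio.length}`);
  }
  stop(): void {
    this.events.push(`${this.label}:stop`);
  }
}

class TwoChannelPlayback {
  private fillerStopped = true;
  private readonly oneShot: SpeakerLike;
  private readonly stream: SpeakerLike;
  constructor(oneShot: SpeakerLike, stream: SpeakerLike) {
    this.oneShot = oneShot;
    this.stream = stream;
  }

  playFiller(clip: Uint8Array): void {
    this.fillerStopped = false;
    this.oneShot.write(clip);
  }

  // Stops the filler once, on the first reply chunk, and never touches the reply channel.
  playReplyChunk(chunk: Uint8Array): void {
    if (!this.fillerStopped) {
      this.oneShot.stop();
      this.fillerStopped = true;
    }
    this.stream.write(chunk);
  }
}

const fillerSpeaker = new FakeSpeaker("filler");
const replySpeaker = new FakeSpeaker("reply");
const playback = new TwoChannelPlayback(fillerSpeaker, replySpeaker);
playback.playFiller(new Uint8Array(4));
playback.playReplyChunk(new Uint8Array(2));
playback.playReplyChunk(new Uint8Array(3));
console.log(fillerSpeaker.events, replySpeaker.events);
```

The filler channel sees exactly one stop, and the reply channel only ever sees writes.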

Synthesise in parallel, play back in order

There are two layers of parallelism worth understanding.

Within a single block. Murf splits long input into chunks of up to 1500 characters and synthesises them at the same time on its own infrastructure. You do not have to do anything for this.

Across blocks. The repo starts synthesis calls the moment each block arrives. So multiple blocks can be synthesising at the same time. But the audio plays back in order through a Promise chain:

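const dispatchChunk = (text: string) => {
  const trimmed = text.trim();
  if (!trimmed) return; // nothing to speak
  if (!streamingEnabled) return;
  // Start synthesis immediately; a failed chunk resolves to null instead of rejecting.
  const synthP = synthesizeSpeech(trimmed).catch(() => null);
  // Queue emission behind all earlier chunks so playback order matches arrival order.
  emitChain = emitChain.then(async () => {
    const audio = await synthP;
    if (audio) {
      streamedAnyAudio = true;
      onAudioChunk!(audio);
    }
  });
};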

synthesizeSpeech() starts the Murf network call right away. emitChain.then() waits until every earlier chunk has been emitted before emitting the current one. So if chunk 2 finishes synthesising before chunk 1, because chunk 1's network call is slower, chunk 2 still plays second. Never first.
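You can check the ordering guarantee without calling Murf at all. This sketch swaps the real synthesis call for a simulated one with a configurable delay. fakeSynthesize and playInOrder are illustrative names, and the delays are arbitrary.

```typescript
// Stand-in for the Murf call: resolves after delayMs. Purely simulated.
function fakeSynthesize(text: string, delayMs: number): Promise<string> {
  return new Promise((resolve) => {
    setTimeout(() => resolve(`audio(${text})`), delayMs);
  });
}

async function playInOrder(
  chunks: Array<{ text: string; delayMs: number }>
): Promise<string[]> {
  const played: string[] = [];
  let emitChain: Promise<void> = Promise.resolve();
  for (const chunk of chunks) {
    // Synthesis starts immediately, so all chunks are in flight at once.
    const synthP = fakeSynthesize(chunk.text, chunk.delayMs).catch(() => null);
    // Emission is queued behind every earlier chunk.
    emitChain = emitChain.then(async () => {
      const audio = await synthP;
      if (audio) played.push(audio);
    });
  }
  await emitChain;
  return played;
}

// Chunk 1 is deliberately slower than chunk 2.
const orderedPlayback = playInOrder([
  { text: "chunk 1", delayMs: 60 },
  { text: "chunk 2", delayMs: 10 },
]);
orderedPlayback.then((played) => console.log(played));
```

Total wall time is governed by the slowest chunk rather than the sum, because the synthesis calls overlap and only the emission is serialised.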

Streaming-pipeline bugs and their root causes

The video has a short error table for the bugs you hit during setup. This section covers the ones specific to the streaming pipeline that show up later, when the agent is mostly working.

WebSocket closes with code 1008 the moment audio starts

Code 1008 means "policy violation," which Deepgram uses for invalid API keys. Check DEEPGRAM_API_KEY in your environment, and check the Deepgram console for remaining credit.

WebSocket closes with code 1011 partway through a session

Code 1011 means "internal server error," but in practice the most common cause is running out of credit mid-session. Top up and retry.

Transcripts come back empty even though audio is sending

Three things to check, in order:

Sample rate. sample_rate in the URL must match your microphone's actual rate. The repo captures at 16000. If your system is recording at 44100 or 48000, you have to resample before sending.
Encoding. The encoding parameter and the audio format must match. linear16 expects 16-bit signed little-endian PCM.
Model. model must be flux-general-en or flux-general-multi. No other model name works on /v2/listen.
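If the sample rate is the culprit and your device only captures at 48000, a crude downsampler is enough to confirm the diagnosis. This is a naive sketch, not what the repo does: it averages groups of samples, only handles integer ratios (so 44100 to 16000 is out), and skips the proper low-pass filtering a production resampler would apply.

```typescript
// Naive decimation by averaging; a real pipeline should low-pass filter properly.
function downsampleInt16(input: Int16Array, fromRate: number, toRate: number): Int16Array {
  if (fromRate % toRate !== 0) {
    throw new Error("This sketch only handles integer ratios such as 48000 -> 16000");
  }
  const factor = fromRate / toRate;
  const output = new Int16Array(Math.floor(input.length / factor));
  for (let i = 0; i < output.length; i++) {
    let sum = 0;
    for (let j = 0; j < factor; j++) {
      sum += input[i * factor + j];
    }
    // Each output sample is the mean of `factor` consecutive input samples.
    output[i] = Math.round(sum / factor);
  }
  return output;
}

const captured48k = new Int16Array([300, 600, 900, -300, -600, -900]);
const resampled16k = downsampleInt16(captured48k, 48000, 16000);
console.log(Array.from(resampled16k));
```

If transcripts start appearing after this step, replace the sketch with a real resampling library before shipping.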

The agent's first sentence plays, then nothing

This is the coalescer holding onto your blocks. If you did not override flushOnEnqueue, the first block flushes but nothing after it streams. Check brain.ts for the coalesce override.

Audio plays out of order

The Promise chain in dispatchChunk is what keeps playback in order. If you removed the emitChain.then(...) wrapper or replaced it with Promise.all, chunks will play in synthesis-completion order instead of arrival order. Put the chain back.

The agent talks over itself

This means the filler kept playing after the real reply started. Check that stopOneShotPlayback() runs on the first chunk of the real reply, not at the end of the reply.

Voice cuts off mid-sentence

Falcon synthesis can fail silently for a single chunk. The .catch(() => null) in dispatchChunk protects you from one failed chunk crashing the whole reply. But if too many chunks fail, the user hears gaps. Log the failures and check Murf's status page.
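One way to log failures without giving up the protection is to wrap the synthesis function. This is a sketch under assumptions: the Synth type, withFailureLog, and the simulated flaky synth are all hypothetical, and the repo's synthesizeSpeech may have a different signature.

```typescript
type Synth = (text: string) => Promise<Uint8Array>;

// Wraps a synth function so one failed chunk yields null instead of rejecting,
// while still recording the failure for later inspection.
function withFailureLog(
  synth: Synth,
  failures: string[]
): (text: string) => Promise<Uint8Array | null> {
  return async (text: string): Promise<Uint8Array | null> => {
    try {
      return await synth(text);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      failures.push(`${text.slice(0, 40)} -> ${reason}`);
      return null;
    }
  };
}

// Simulated synth that fails on one specific chunk.
const flakySynth: Synth = async (text: string) => {
  if (text.includes("fail")) throw new Error("simulated timeout");
  return new Uint8Array([1, 2, 3]);
};

const synthFailures: string[] = [];
const safeSynth = withFailureLog(flakySynth, synthFailures);
const synthResults = Promise.all([safeSynth("hello"), safeSynth("please fail")]);
synthResults.then(() => console.log(synthFailures));
```

Counting entries in the failure list per reply gives you a quick signal for when gaps are frequent enough to investigate.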

ALSA errors on Linux

On minimal Linux installs the ALSA development headers have to be installed before the native audio package will build. apt install libasound2-dev covers it on Debian-family. If the install completes but the device is not found at runtime, the default ALSA device is probably pointing at an output that does not exist.

No audio on Windows

Decibri on Windows uses WASAPI. If your default output device is a Bluetooth headset that is not currently connected, the stream opens silently and no audio plays. Switch the default device in Sound settings, or set the output device explicitly in code.

Silent failure on macOS

The first run asks for microphone permission. If you deny it, subsequent runs fail silently. The agent will appear to start normally and the WebSocket will connect, but no audio frames reach Deepgram. Check microphone permissions in System Settings under Privacy and Security.

Extending the agent to something that is not Swiggy

It takes two changes.

Swap the skill. The agents.defaults.skills array in openclaw.json is the list of MCP skills the agent can call. Remove the Swiggy skill, add a different one. A calendar scheduler imports a Google Calendar MCP skill. A GitHub PR merger imports the GitHub MCP skill. A Notion assistant imports the Notion MCP skill. The runtime does not change.

Rewrite the identity. workspace/IDENTITY.md is the system prompt. It describes who the agent is, what it does, what it refuses to do, and how it should format replies. Rewriting this file changes the agent's personality and its understanding of the task.

For a calendar scheduler, you would describe an assistant that looks up free slots and confirms bookings. For a PR merger, you would describe a reviewer that summarises diffs and merges when checks pass.

Everything else stays. The audio pipeline, the streaming coalescer, the keyterm bias, the two-channel playback. That is the value of keeping the voice layer separate from the agent layer. The voice layer does not care what the agent is doing.

What this pipeline does not fix

Turn 1 latency is not solved. Time-to-first-audio on a cold start is mostly caused by tool chains and LLM time-to-first-token, not by synthesis. The slow path still includes OpenClaw's cold start, the Swiggy MCP setup, and the LLM's first-token delay. Streaming synthesis cannot hide that. The filler audio can. That is why it is there.

Getting to true sub-second first audio on turn 1 would require starting the OpenClaw runtime ahead of time, keeping the MCP connection alive across sessions, and starting tool calls before the user finishes speaking. None of those are in this repo. What is in this repo is the pattern that makes the problem manageable: split the audio pipeline from the agent pipeline, stream what can be streamed, mask the rest with fillers, and measure the result.

Turn 2 onwards is a different story. With the runtime warm and the MCP connection open, first audio arrives 5 to 10 seconds after the user stops talking. Falcon plus block streaming are why. That is the number that makes the agent usable in practice. The cold-start number is what makes every tutorial-shaped demo look slower than it will be in production.

Block streaming, Falcon, and contextual keyterm biasing are three improvements that build on each other. Each does less than a demo suggests. Together they do more than any one of them alone. That is usually how voice pipelines work.

Resources:
Murf Plugin: https://clawhub.ai/plugins/openclaw-murf-tts
Murf Falcon: https://murf.ai/api/dashboard
Openclaw: https://openclaw.ai/
Clawhub: https://clawhub.ai/
Deepgram: https://console.deepgram.com/
