Code and Trust

Posted on Jun 9 • Originally published at codeandtrust.com

Giving Your AI Agent a Phone Number: Twilio vs Vapi vs Retell vs Self-Hosted (2026)

#ai #twilio #webdev #opensource

There are four ways to give an AI agent inbound phone calls in 2026: (1) raw Twilio Media Streams with your own STT/LLM/TTS pipeline, (2) a managed voice-AI platform such as Vapi, Retell, or Bland that abstracts the telephony layer, (3) a self-hosted agent framework with built-in voice support such as the OpenClaw native voice-call plugin, or (4) a purpose-built gateway-routing connector such as clawcall. Each option trades control for complexity in a different way. This guide maps the trade-offs so you can pick the right stack for your use case.

Why This Decision Matters More Than It Looks

Adding a phone number to an AI agent sounds like a two-line config change. In practice, four non-obvious decisions determine whether your agent is genuinely useful on a call or just an expensive voice menu:

Tool access during the call. Can the agent check the caller's calendar, query a database, or send a follow-up message mid-conversation? Most managed platforms only expose this via a custom tool-call webhook -- which means your agent's tool surface is still your problem to wire.
Latency floor. End-to-end round-trip (caller speech -> STT -> agent -> TTS -> audio back) determines whether the conversation feels natural (~700ms is acceptable) or robotic (>1.5s kills UX). Each layer adds latency; managed platforms reduce your engineering burden but do not always reduce latency.
Cost per minute at scale. A self-hosted stack at 10 calls/day has negligible cost. At 10,000 calls/day, model selection and platform fees become the dominant cost drivers -- sometimes by an order of magnitude.
Auditability. Regulated industries need a complete transcript of every call, every tool invocation, and every decision. Some managed platforms make this hard; self-hosted stacks make it trivial.

The comparison below covers each option across all four dimensions.

Option 1: Raw Twilio Media Streams (DIY Pipeline)

Twilio's Media Streams API gives you a raw WebSocket audio feed from an inbound call -- 8 kHz mu-law audio, base64-encoded inside JSON messages, bidirectional. You own everything above the transport layer: speech-to-text (STT), agent routing, text-to-speech (TTS), and audio injection.
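
To make the audio format concrete: each media frame carries 8-bit mu-law samples, while much DSP code expects 16-bit linear PCM. Below is a minimal sketch of the standard G.711 mu-law expansion. The function names are illustrative, and if your STT provider accepts mu-law input directly you can skip this step entirely.

```javascript
// Standard G.711 mu-law expansion: one 8-bit mu-law byte -> one signed 16-bit PCM sample.
function mulawToPcm16(byte) {
  const u = ~byte & 0xff;              // mu-law bytes are transmitted bit-inverted
  const sign = u & 0x80;               // top bit: sign
  const exponent = (u >> 4) & 0x07;    // next three bits: segment
  const mantissa = u & 0x0f;           // low four bits: step within the segment
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Decode a base64 payload (as carried in a Media Streams "media" message)
// into an Int16Array of PCM samples.
function decodeMulawPayload(base64Payload) {
  const bytes = Buffer.from(base64Payload, 'base64');
  const pcm = new Int16Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) {
    pcm[i] = mulawToPcm16(bytes[i]);
  }
  return pcm;
}
```

The byte 0xff decodes to silence (0) and 0x80 to the largest positive sample (32124), which is a quick sanity check when wiring this into a pipeline.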

Typical stack: Twilio (telephony) -> Deepgram nova-2-phonecall (streaming STT) -> your LLM/agent (inference) -> ElevenLabs Turbo v2 or Google TTS (streaming TTS) -> Twilio audio injection.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal TwiML to open a Media Stream -->
<Response>
  <Connect>
    <Stream url="wss://your-host.com/voice/stream">
      <Parameter name="callSid" value="{{ CallSid }}" />
    </Stream>
  </Connect>
</Response>
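
Once the stream is open, Twilio sends JSON messages over the WebSocket with an `event` field (`connected`, `start`, `media`, `stop`). The sketch below handles those messages as a pure function so it can be tested without a live socket. The `state` object and the returned `{ type: ... }` shapes are illustrative choices, not part of any library.

```javascript
// Handle one inbound Media Streams message. `state` is a plain object owned by the caller.
function handleTwilioMessage(raw, state) {
  const msg = JSON.parse(raw);
  switch (msg.event) {
    case 'connected':
      return { type: 'connected' };
    case 'start':
      // The start message carries the stream and call identifiers plus any <Parameter> values.
      state.streamSid = msg.start.streamSid;
      state.callSid = msg.start.callSid;
      state.params = msg.start.customParameters || {};
      return { type: 'started', callSid: state.callSid };
    case 'media':
      // payload is base64-encoded 8 kHz mu-law audio; forward the bytes to your STT client.
      return { type: 'audio', audio: Buffer.from(msg.media.payload, 'base64') };
    case 'stop':
      return { type: 'stopped' };
    default:
      return { type: 'ignored', event: msg.event };
  }
}

// Build an outbound audio frame to play TTS audio back to the caller.
function outboundMedia(streamSid, mulawBuffer) {
  return JSON.stringify({
    event: 'media',
    streamSid,
    media: { payload: mulawBuffer.toString('base64') },
  });
}

// Build a "clear" frame, which drops audio Twilio has buffered (useful for barge-in).
function outboundClear(streamSid) {
  return JSON.stringify({ event: 'clear', streamSid });
}
```

In a real handler you would call `handleTwilioMessage` from the socket's message callback and send the strings built by `outboundMedia` and `outboundClear` back on the same socket.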

Latency breakdown (realistic 2026 numbers):

Deepgram nova-2-phonecall end-of-speech detection: 200-300ms
LLM first-token latency (GPT-4o Realtime or your model): 150-400ms
ElevenLabs Turbo v2 first audio byte: 200-350ms
Total perceived latency: 550ms-1.1s with streaming TTS; longer if TTS is not streamed
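
The total is simply the sum of the per-stage ranges. A small sketch of that arithmetic, using the figures from the breakdown above (network transit is not modeled, so treat the result as a floor):

```javascript
// Per-stage latency ranges in milliseconds, taken from the breakdown above.
const stages = [
  { name: 'STT end-of-speech detection', min: 200, max: 300 },
  { name: 'LLM first token', min: 150, max: 400 },
  { name: 'TTS first audio byte', min: 200, max: 350 },
];

// Sum the best-case and worst-case figures across all stages.
function latencyBudget(stageList) {
  return stageList.reduce(
    (total, s) => ({ min: total.min + s.min, max: total.max + s.max }),
    { min: 0, max: 0 }
  );
}

const budget = latencyBudget(stages);
console.log(`${budget.min}-${budget.max} ms`); // 550-1050 ms
```

The worst case of 1050ms is what the breakdown rounds to 1.1s.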

Cost per minute (GPT-4o, Deepgram, ElevenLabs):

Twilio inbound: ~$0.0085/min
Deepgram nova-2: ~$0.0043/min
GPT-4o (2 exchanges/min, ~700 tokens total): ~$0.007/min
ElevenLabs Turbo: ~$0.003/min
Total: ~$0.023/min -- or roughly $1.38/hour of talk time
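
The same sum as a reusable sketch, so you can swap in your own provider rates. The component names are illustrative; the numbers are the estimates listed above.

```javascript
// Per-minute cost estimates in USD, copied from the list above.
const perMinuteUsd = {
  twilioInbound: 0.0085,
  deepgramStt: 0.0043,
  llm: 0.007,
  tts: 0.003,
};

// Add up every component to get the all-in per-minute cost.
function totalPerMinute(components) {
  return Object.values(components).reduce((sum, cost) => sum + cost, 0);
}

// Round to a fixed number of decimal places for display.
function round(value, places) {
  const factor = 10 ** places;
  return Math.round(value * factor) / factor;
}

const perMinute = totalPerMinute(perMinuteUsd);
console.log(round(perMinute, 4));      // 0.0228
console.log(round(perMinute * 60, 2)); // 1.37
```

The unrounded sum gives about $1.37/hour; the $1.38 figure comes from rounding the per-minute total to $0.023 before multiplying by 60.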

When to choose raw Twilio: You need every call turn to flow through a custom agent runtime (maximum tool fidelity), you require a full audit log of every transcript and tool invocation, you're building a multi-party or conferencing scenario, or you need barge-in with custom silence detection tuning. The engineering investment is real: plan 2-4 weeks to build and harden the WebSocket handler, STT pipeline, and TTS injection before you have something production-worthy.

Option 2: Managed Voice-AI Platforms (Vapi, Retell, Bland)

Managed platforms bundle telephony, STT, LLM routing, and TTS into a single API. You configure a voice agent via a dashboard or API, provide a system prompt and tool definitions, and the platform handles the call. The major options in 2026:

Vapi

Vapi is the most developer-focused of the managed platforms. It supports custom LLM endpoints (you can point it at your own model), custom tool-call webhooks (so your agent's tools remain accessible), and a wide range of STT/TTS providers. Pricing as of mid-2026: $0.05/min plus provider costs (STT + TTS + your LLM). For a GPT-4o-backed agent, all-in cost is roughly $0.08-0.10/min -- 3-4x higher than a self-hosted stack at the same quality level.

Vapi's tool-call webhook model is the right architectural choice for integrating with an existing agent: Vapi sends a POST to your endpoint when the LLM decides to invoke a tool, you run the tool on your own infrastructure, and return the result. This is the closest managed-platform approximation of a full gateway-turn routing model.
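
The receiving side of that webhook can be sketched as a small dispatcher. The `{ name, arguments }` request shape, the `{ ok, result }` response shape, and both example tools are assumptions made for illustration, not Vapi's documented schema; map them to the payload your platform actually sends.

```javascript
// Registry of tools that run on your own infrastructure (both are placeholders).
const tools = {
  get_calendar: async ({ date }) => ({ date, events: [] }),
  send_sms: async ({ to, body }) => ({ to, queued: body.length > 0 }),
};

// Handle one tool-call request and always return a structured result,
// so the platform gets a usable answer even when the tool fails.
async function handleToolCall(request, registry = tools) {
  const known = Object.prototype.hasOwnProperty.call(registry, request.name);
  if (!known) {
    return { ok: false, error: `unknown tool: ${request.name}` };
  }
  try {
    const result = await registry[request.name](request.arguments || {});
    return { ok: true, result };
  } catch (err) {
    return { ok: false, error: String(err.message || err) };
  }
}
```

You would call `handleToolCall` from your HTTP route and serialize its return value as the response body. Returning an error object instead of throwing keeps a failed tool from stalling the call.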

Retell

Retell AI focuses on conversational naturalness -- it ships barge-in, interruption handling, and filler-word suppression out of the box. Pricing: $0.07/min plus provider costs. Retell supports custom LLM endpoints and tool calls via webhook, similar to Vapi. Its primary differentiator is the conversation flow quality for sales and support use cases where turn-taking naturalness matters most.

Bland

Bland AI targets high-volume outbound calling at aggressive pricing (~$0.09/min all-in for a standard agent). It has a simpler tool-call model than Vapi or Retell and less flexibility on LLM provider. For inbound use cases where you need deep tool integration, Bland is the weakest option of the three.

Managed platform comparison table:

Platform	Price/min (all-in est.)	Custom LLM	Tool-call webhook	Audit log	Best for
Vapi	~$0.08-0.10	Yes	Yes (full)	Call recordings + transcripts	Dev-first, custom tool integration
Retell	~$0.09-0.12	Yes	Yes (full)	Transcripts	Conversational naturalness, sales/support
Bland	~$0.09-0.11	Limited	Partial	Basic	High-volume outbound

When to choose a managed platform: You want to ship an inbound voice agent in days, not weeks; your team does not have the bandwidth to operate WebSocket infrastructure; you're not running an existing self-hosted agent runtime that needs deep tool integration. The cost premium (~3-5x self-hosted) is often worth it at small to medium call volumes when engineering time is the real constraint.

Option 3: OpenClaw Native Voice-Call Plugin

If you're already running a self-hosted OpenClaw gateway, the voice-call plugin gives you inbound calls with minimal additional infrastructure. The plugin integrates directly with the gateway process and -- as of PR #71272 -- exposes the openclaw_agent_consult tool to the realtime voice session, so your agent's full tool surface is available during calls.

See the full setup guide for step-by-step configuration. Key points for the comparison:

Cost: You pay only Twilio + STT + TTS + LLM API costs â€” no platform markup. Roughly $0.020â€“0.025/min with GPT-4o, comparable to the raw Twilio DIY path.
Latency: Realtime mode (end-to-end audio) delivers sub-500ms conversational feel. The consult-tool path for tool access adds one internal hop but stays well under 1s for simple queries.
Setup complexity: Config-only for existing OpenClaw users. Requires a publicly reachable HTTPS webhook (ngrok/Cloudflare Tunnel for local dev, a VPS or cloud run for production).
Tool access: Set realtime.toolPolicy: "safe-read-only" for calendar reads and memory queries; "owner" for full tool surface (use with care on public lines).
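
To illustrate what a tool policy like this implies, here is a conceptual sketch of policy-based filtering. Only the two policy names come from the text above; the tool catalogue, the `readOnly` flag, and the filtering logic are hypothetical and are not OpenClaw's implementation.

```javascript
// Hypothetical tool catalogue; `readOnly` marks tools with no side effects.
const catalogue = [
  { name: 'calendar.read', readOnly: true },
  { name: 'memory.query', readOnly: true },
  { name: 'sms.send', readOnly: false },
];

// Return the tool names a caller may reach under a given policy.
function toolsForPolicy(policy, tools) {
  switch (policy) {
    case 'safe-read-only':
      return tools.filter((t) => t.readOnly).map((t) => t.name);
    case 'owner':
      return tools.map((t) => t.name);
    default:
      return []; // unknown policy: expose nothing
  }
}

console.log(toolsForPolicy('safe-read-only', catalogue)); // [ 'calendar.read', 'memory.query' ]
```

Defaulting to an empty list for unrecognized policies is a deliberate fail-closed choice, which matters on a phone line that anyone can dial.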

When to choose the native OpenClaw plugin: You're already running an OpenClaw gateway and want the simplest path to voice. The plugin is the right first step â€” less infrastructure than the DIY path, no platform markup, and tool access via the consult tool.

Option 4: Full Gateway-Turn Routing (clawcall)

clawcall is an open-source Twilio Media Streams connector that routes every call turn through OpenClaw's /agent/turn endpoint â€” the same full gateway agent turn used by chat and API calls. Unlike the native plugin's realtime mode (which runs a self-contained audio session with tool access via the consult bridge), clawcall routes audio through STT -> full agent turn -> TTS. Every exchange is a first-class agent turn with complete session history, memory writes, and multi-step tool chains.

// clawcall routes the full pipeline:
// Twilio WebSocket -> Deepgram STT -> OpenClaw /agent/turn -> ElevenLabs TTS -> Twilio audio inject

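import express from 'express'
import expressWs from 'express-ws'
import { handleVoiceStream } from 'clawcall'

// express-ws adds the app.ws() method used below.
// If you already have an Express gateway server, reuse its app instead of creating one.
const app = express()
expressWs(app)

// Express WebSocket route -- this is the URL the TwiML <Stream> points at
app.ws('/voice/stream', (ws, req) => {
  handleVoiceStream(ws, {
    gatewayUrl: process.env.OPENCLAW_GATEWAY_URL,
    deepgramApiKey: process.env.DEEPGRAM_API_KEY,
    elevenLabsApiKey: process.env.ELEVENLABS_API_KEY,
    allowlist: ['+18432965626'],   // caller ID allowlist
    sessionPrefix: 'voice',        // sessionId scoped per call SID
  })
})

app.listen(process.env.PORT || 3000)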
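
The `allowlist` and `sessionPrefix` options suggest two small pieces of logic: checking the caller ID and deriving a per-call session ID. The sketch below shows one way that could work. It is not clawcall's internal code; the function names, the normalization rule, and the `prefix:callSid` format are assumptions.

```javascript
// Normalize a caller ID to "+" followed by digits, dropping spaces and punctuation.
function normalizeCallerId(raw) {
  const digits = String(raw).replace(/[^\d]/g, '');
  return digits ? `+${digits}` : '';
}

// True only when the normalized caller ID matches an allowlist entry.
function isAllowedCaller(from, allowlist) {
  const caller = normalizeCallerId(from);
  return caller !== '' && allowlist.some((n) => normalizeCallerId(n) === caller);
}

// One session per call: prefix plus the Twilio call SID.
function sessionIdFor(prefix, callSid) {
  return `${prefix}:${callSid}`;
}

console.log(isAllowedCaller('+1 (555) 555-0100', ['+15555550100'])); // true
console.log(sessionIdFor('voice', 'CA123'));                         // voice:CA123
```

Caller ID can be spoofed, so an allowlist like this is a convenience filter rather than authentication; pair it with a restrictive tool policy on any line exposed to the public.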

clawcall trade-offs vs. the native plugin:

Dimension	Native plugin + consult tool	clawcall (full gateway-turn routing)
Conversational latency	Very low (~300-500ms, end-to-end audio)	Medium (~700ms-1.3s, STT + agent turn + TTS)
Tool access	Good -- consult tool reaches gateway tools via one hop	Full -- every turn is a complete agent turn with all tools
Session history	Realtime session context only	Full gateway session + memory writes persist after call
Audit log	Plugin logs + gateway logs	Complete transcript per turn in gateway session history
Infrastructure	Config only (existing gateway)	WebSocket service + STT client + TTS client (clawcall handles it)
Multi-tool chains per utterance	Limited (consult tool runs one embedded turn)	Full (agent turn can invoke multiple tools in sequence)

When to choose clawcall: You need every call turn to persist into the agent's long-term memory after the call; you're building a regulated-industry deployment that requires a complete per-turn audit log; your use case involves multi-step tool chains within a single caller utterance (e.g., "check my calendar, book the slot, and send a confirmation text"); or you want the same observability on voice calls that you have on chat sessions.

Full Comparison Table

Option	Setup effort	Cost/min (est.)	Latency	Tool access	Audit	Best for
Raw Twilio DIY	High (2-4 weeks)	~$0.023	550ms-1.1s	Full (you build it)	Full (you build it)	Max control, custom pipelines
Vapi	Low (hours-days)	~$0.08-0.10	600ms-1.2s	Via tool-call webhook	Transcripts + recordings	Ship fast, custom tool integration
Retell	Low (hours-days)	~$0.09-0.12	500ms-1.0s	Via tool-call webhook	Transcripts	Natural conversation, sales/support
Bland	Low (hours)	~$0.09-0.11	600ms-1.5s	Partial	Basic	High-volume outbound
OpenClaw native plugin	Low (config only)	~$0.020-0.025	300-500ms	Good (consult tool)	Plugin + gateway logs	Existing OpenClaw users, lowest latency
clawcall (gateway-turn)	Medium (days)	~$0.023	700ms-1.3s	Full (direct agent turn)	Full (per-turn session history)	Max tool fidelity, audit, memory persistence

Decision Guide: Which One Should You Use?

Work through these questions to narrow your choice:

Are you already running an OpenClaw gateway? Start with the native plugin. If you later need full audit or memory persistence, add clawcall.
Do you need to ship this week, not this month? Use Vapi or Retell. Accept the 3-5x cost premium as the price of speed.
Must every call turn, including its tool invocations, persist in long-term memory? Use clawcall or the raw DIY path. The native realtime mode does not write memory after each utterance.
Is latency the primary UX constraint? OpenClaw native plugin (realtime mode) is the lowest-latency self-hosted option. Retell is the lowest-latency managed option.
Do you need a complete per-turn audit log for compliance? Raw DIY or clawcall -- you own the data. Managed platforms retain call data on their infrastructure.
Are you running more than 1,000 calls/day? Model the all-in cost carefully. At 1,000 calls/day x 5 min average, Vapi costs ~$400/day (at $0.08/min) vs. ~$115/day self-hosted (at $0.023/min) -- roughly a $104,000/year difference for a single agent.
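
The arithmetic behind that last comparison is easy to rerun with your own volumes. This sketch uses the low end of the Vapi estimate ($0.08/min) and the self-hosted estimate ($0.023/min) from the tables above; the function name is illustrative.

```javascript
// Daily cost in USD, rounded to cents.
function dailyCostUsd(callsPerDay, avgMinutes, ratePerMinute) {
  return Math.round(callsPerDay * avgMinutes * ratePerMinute * 100) / 100;
}

const managed = dailyCostUsd(1000, 5, 0.08);     // managed platform estimate
const selfHosted = dailyCostUsd(1000, 5, 0.023); // self-hosted estimate
const annualDifference = Math.round((managed - selfHosted) * 365);

console.log(managed, selfHosted, annualDifference); // 400 115 104025
```

At the high end of the Vapi range ($0.10/min), the daily figure rises to $500 and the gap widens accordingly.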

Frequently Asked Questions

The most common questions about choosing between Twilio, Vapi, Retell, and self-hosted voice AI agent approaches cluster around cost at scale, latency differences, tool access architecture, compliance, and how to migrate from managed to self-hosted as call volume grows.

Q1: Can I use Vapi or Retell with an OpenClaw agent?

A: Yes -- both Vapi and Retell support custom LLM endpoints and tool-call webhooks. You can point Vapi at a proxy that translates its LLM request format to OpenClaw's /agent/turn endpoint. This gives you Vapi's telephony and conversation management with OpenClaw's tool surface behind it. The latency penalty is an extra round-trip to your gateway, but the integration is architecturally clean. For most teams, this is only worth doing if you're already on Vapi for other reasons -- if you're starting fresh with OpenClaw, the native plugin or clawcall is simpler.
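
A sketch of what such a proxy's translation layer might look like, assuming the platform sends an OpenAI-style chat-completions request. The agent-turn payload fields (`sessionId`, `input`) and the session-ID scheme are hypothetical; check your gateway's actual request format before relying on them.

```javascript
// Translate an OpenAI-style chat request into a single agent-turn request.
// Only the latest user message is forwarded, on the assumption that the
// gateway keeps its own session history.
function toAgentTurn(chatRequest, callId) {
  const messages = chatRequest.messages || [];
  const lastUser = [...messages].reverse().find((m) => m.role === 'user');
  if (!lastUser) {
    throw new Error('no user message in request');
  }
  return { sessionId: `voice:${callId}`, input: lastUser.content };
}

// Wrap the agent's reply text in a minimal chat-completion-shaped response.
function toChatCompletion(replyText) {
  return {
    object: 'chat.completion',
    choices: [
      {
        index: 0,
        message: { role: 'assistant', content: replyText },
        finish_reason: 'stop',
      },
    ],
  };
}
```

The proxy would call `toAgentTurn`, POST the result to the gateway, and return `toChatCompletion(reply)` to the platform. If the platform expects streamed responses, the reply would need to be sent as streamed chunks instead, which this sketch does not cover.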

Q2: What is the latency difference between managed platforms and self-hosted in practice?

A: Managed platforms like Vapi and Retell have invested heavily in reducing latency and are competitive with self-hosted stacks: both typically deliver 500ms-1.2s end-to-end. The OpenClaw native plugin's realtime mode (end-to-end audio, no STT/TTS round-trips in your infrastructure) can reach 300-500ms -- marginally faster, but the difference is perceptible only in high-cadence conversation flows. For most use cases, managed platform latency is acceptable.

Q3: How do I keep costs under control as call volume grows?

A: The biggest lever is model selection. On a self-hosted stack, replacing GPT-4o with GPT-4o-mini or a locally hosted model served through Ollama cuts LLM cost by 80-90%. Since the LLM accounts for roughly $0.007 of the ~$0.023/min total estimated earlier, that lowers the all-in per-minute cost by about a quarter. On managed platforms, you're largely locked into their cost structure. A practical migration path: start on Vapi to ship fast, then migrate to the OpenClaw native plugin or clawcall when monthly call cost exceeds the engineering cost of the migration (typically around 500-1,000 calls/day).
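
The migration threshold can be estimated as a break-even calculation. In this sketch the $20,000 migration cost is a made-up placeholder, the rates are the estimates used earlier in this guide, and a 30-day month is assumed.

```javascript
// Months until the per-minute savings repay a one-off migration cost.
function breakEvenMonths(migrationCostUsd, callsPerDay, avgMinutes, managedRate, selfHostedRate) {
  const monthlySavings = callsPerDay * avgMinutes * (managedRate - selfHostedRate) * 30;
  if (monthlySavings <= 0) {
    return Infinity; // self-hosting is not cheaper, so the migration never pays back
  }
  return migrationCostUsd / monthlySavings;
}

// Hypothetical example: $20,000 of engineering time, 500 calls/day at 5 minutes each.
const months = breakEvenMonths(20000, 500, 5, 0.08, 0.023);
console.log(months.toFixed(1));
```

With these inputs the savings are about $4,275 per month, so the migration pays for itself in a little under five months; doubling the call volume halves that.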

Q4: Do managed platforms comply with HIPAA / SOC 2?

A: Vapi and Retell both publish SOC 2 Type II certifications and offer Business Associate Agreements (BAAs) for HIPAA-covered use cases. Bland's compliance posture is less mature as of mid-2026. For regulated industries where call recordings must stay on your own infrastructure, self-hosted is the only option that meets that requirement -- managed platforms retain call recordings and transcripts on their servers even with a BAA.

Q5: What happens to call data on managed platforms when I cancel?

A: Vapi and Retell both offer data export and deletion policies, but data portability is not their priority -- you'll need to pull transcripts via API before canceling. Self-hosted stacks avoid this entirely: your session history lives in your database and is under your control from day one.

Q6: Can I add SMS to any of these options?

A: Twilio supports SMS on the same account and phone number as voice -- adding SMS to any Twilio-based setup is a separate webhook. For a step-by-step guide to adding two-way SMS to a self-hosted OpenClaw agent alongside voice, see How to Add SMS to Your Self-Hosted AI Agent (Twilio + OpenClaw). Managed platforms like Vapi and Retell are voice-only; SMS handling is out of scope for them.

Build the Right Voice Architecture with Code and Trust

Choosing the right telephony stack for your AI agent is an architecture decision that affects cost, latency, tool access, and compliance posture for the lifetime of the product. Getting it wrong means either overpaying a managed platform as you scale, or under-building a DIY stack that breaks under real call volume.

Code and Trust's AI implementation practice includes voice agent architecture review as a standard deliverable -- we map your call volume, tool requirements, compliance constraints, and engineering capacity to a concrete stack recommendation, with a cost model for each option. If you'd rather start with a structured assessment before committing to a build, the AI Audit is the right first step.

For the OpenClaw-specific voice setup guide -- including the native plugin configuration, the clawcall connector, and the tool-access architecture decision -- see How to Give Your Self-Hosted AI Agent Inbound Phone Calls.
