DEV Community

Frank Fu

Posted on • Originally published at frankfu.blog

Building Real-time Voice Conversations with ElevenLabs WebSocket API: A Complete Development Guide

Recently, I’ve been researching real-time voice conversation implementations and discovered that ElevenLabs Agents Platform provides a very powerful WebSocket API. After some exploration, I completed a real-time voice conversation demo that can run directly in the browser. Today, I’ll share the implementation details and usage experience of this project.

1. Why Choose ElevenLabs?

Before we begin, you might wonder why I chose ElevenLabs over other solutions. I compared ElevenLabs with the OpenAI Realtime API and found that ElevenLabs has unique advantages in voice selection, model flexibility, and other areas; I’ll elaborate on this comparison later in the article.

2. Project Overview

Demo link: https://demo.navtalk.ai/11labs/en/index.html

This demo is implemented based on the ElevenLabs Agents Platform WebSocket API and supports:

✅ Complete WebSocket connection management

✅ Real-time voice input and output

✅ Text message support

✅ Rich custom configuration options

✅ Complete message handling mechanism

The entire project can run directly in the browser without a backend server, making it perfect for rapid prototyping and learning.

3. Core Features

3.1 Complete WebSocket Connection

The project implements complete WebSocket connection management, including:

▪ Automatic signature URL retrieval

▪ Secure WSS connection establishment

▪ Comprehensive connection status and error handling

3.2 Real-time Voice Conversation

Voice processing is the core functionality, including:

▪ Microphone audio capture

▪ 16kHz PCM audio encoding

▪ Real-time audio stream transmission

▪ Agent audio playback

3.3 Complete Message Handling

Supports all message types provided by ElevenLabs:

▪ `conversation_initiation_metadata` – Session initialization

▪ `user_transcript` – User speech-to-text

▪ `agent_response` – Agent text response

▪ `agent_response_correction` – Agent response correction

▪ `audio` – Agent audio response

▪ `interruption` – Interruption detection

▪ `ping/pong` – Heartbeat detection

▪ `client_tool_call` – Tool call support

▪ `contextual_update` – Context update

▪ `vad_score` – Voice activity detection score
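To show how these message types can be routed in practice, here is a minimal dispatcher sketch. The handler wiring is mine, and the payload shape assumed for `user_transcript` (a nested `user_transcription_event` object) follows my reading of the ElevenLabs docs; treat both as assumptions rather than demo code.

```javascript
// Minimal message dispatcher for the ElevenLabs WebSocket (a sketch).
// Unknown message types are silently ignored.
function createDispatcher(handlers) {
  return function dispatch(rawMessage) {
    const msg = JSON.parse(rawMessage);
    const handler = handlers[msg.type];
    if (handler) handler(msg);
    return msg.type; // handy for logging/tests
  };
}

// Example wiring: collect transcripts as they arrive.
const transcripts = [];
const dispatch = createDispatcher({
  // Assumed payload shape: { user_transcription_event: { user_transcript } }
  user_transcript: (m) => transcripts.push(m.user_transcription_event.user_transcript),
});

dispatch(JSON.stringify({
  type: 'user_transcript',
  user_transcription_event: { user_transcript: 'Hello there' },
}));
```

In the real demo, `dispatch` would be the WebSocket’s `onmessage` handler, with one entry per message type in the list above.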

3.4 Text Message Support

In addition to voice input, it also supports sending text messages to the Agent, with a very practical feature: text messages can interrupt the Agent’s ongoing voice response, making conversations more natural.

3.5 Custom Configuration

Provides rich configuration options:

▪ Custom Agent Prompt

▪ Custom first message

▪ Language override

▪ TTS voice ID override

▪ Dynamic variable support

▪ Custom LLM parameters (temperature / max_tokens)

4. Detailed Usage Instructions

4.1 Prepare Configuration

4.1.1 Open File

Simply open the link https://demo.navtalk.ai/11labs/en/index.html in your browser to get started.

4.1.2 Required Configuration Items

API Key (xi-api-key):

▪ Your ElevenLabs API Key

▪ Format: `sk-…`, sent to the server via the `xi-api-key` request header

▪ How to obtain: Log in to the [ElevenLabs Console](https://elevenlabs.io/app/settings/api-keys) and create or view an API Key

Agent ID:

▪ ElevenLabs Agent ID

▪ Format: `agent_…`

▪ How to obtain: Create or view an Agent on the [ElevenLabs Agents page](https://elevenlabs.io/app/agents), then copy the Agent ID

4.1.3 Optional Configuration Items (in interface order)

Custom Prompt:

▪ Override the Agent’s default prompt

▪ Leave empty to use the default prompt from Agent configuration

▪ Can be used to temporarily modify the Agent’s behavior and conversation style

First Message:

▪ The first sentence the Agent says after connection

▪ Leave empty to use the default first message from Agent configuration

▪ Example: “Hello, I’m your AI assistant. How can I help you?”

Language:

▪ Override the Agent’s default language setting

▪ Supported language codes: `en` (English), `zh` (Chinese), `es` (Spanish), `fr` (French), `de` (German), `ja` (Japanese), etc.

▪ Leave empty to use the default language from Agent configuration

TTS Voice:

▪ Override the Agent’s default voice setting

▪ Select different voice IDs from the dropdown menu

▪ Leave empty to use the default voice from Agent configuration

▪ Note: You need to fill in the API Key first to load the voice list

Dynamic Variables:

▪ Used to dynamically replace variable placeholders in the Prompt during conversation

▪ Format: JSON object, for example `{"user_name": "John", "greeting": "Hello"}`

▪ Use case: When the Agent’s Prompt contains variables (such as `{{user_name}}`, `{{greeting}}`), you can pass actual values through dynamic variables

▪ Example:

      {
        "user_name": "John",
        "company": "ABC Company",
        "product": "Smart Assistant"
      }

▪ If the Agent’s Prompt contains `Hello, {{user_name}}, welcome to use {{product}}`, the dynamic variables will automatically replace it with `Hello, John, welcome to use Smart Assistant`

▪ Leave empty to not use dynamic variables
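The substitution itself is performed server-side by ElevenLabs, but its semantics are easy to illustrate. The sketch below reimplements the `{{variable}}` replacement client-side purely for demonstration; unknown placeholders are left untouched:

```javascript
// Illustration only: fill {{name}} placeholders in a prompt template from a
// dynamic-variables object. ElevenLabs does this server-side; this sketch
// just shows the expected behavior.
function fillTemplate(template, vars) {
  return template.replace(/\{\{(\w+)\}\}/g, (match, name) =>
    name in vars ? String(vars[name]) : match // leave unknown placeholders as-is
  );
}

const prompt = 'Hello, {{user_name}}, welcome to use {{product}}';
const vars = { user_name: 'John', company: 'ABC Company', product: 'Smart Assistant' };
console.log(fillTemplate(prompt, vars));
// → "Hello, John, welcome to use Smart Assistant"
```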

LLM Temperature:

▪ Controls the randomness and creativity of LLM text generation

▪ Value range: 0.0 – 2.0

▪ Lower values produce more deterministic and consistent output (more conservative); higher values produce more random and creative output (more flexible)

▪ Recommended value: 0.7 – 1.0 (balanced creativity and consistency)

▪ Leave empty to use the default value from Agent configuration

LLM Max Tokens:

▪ Limits the maximum number of tokens for a single LLM response

▪ Value range: Positive integers

▪ Used to control response length and avoid overly long replies

▪ Leave empty to use the default value from Agent configuration

4.2 Start Conversation

1. Click the “Connect and Start Conversation” button

2. The browser will request microphone permission, please allow it

3. Recording will start automatically after successful connection

4. Start speaking, and the Agent will respond in real-time

4.3 Function Operations

▪ Stop Recording: Stop sending audio but keep the connection

▪ Disconnect: Completely disconnect the WebSocket connection

▪ Text Message: Enter a message in the text input box and send it

5. API Documentation Reference

The demo implementation is based on the [ElevenLabs Agents Platform WebSocket API](https://elevenlabs.io/docs/agents-platform/api-reference/agents-platform/websocket).

5.1 WebSocket Endpoint

wss://api.elevenlabs.io/v1/convai/conversation?agent_id={agent_id}

5.2 Complete Call Flow

5.2.1 Connection Establishment Phase

Step 1: Establish WebSocket Connection

Client → Server: Establish WebSocket connection

wss://api.elevenlabs.io/v1/convai/conversation?agent_id={agent_id}

Step 2: Send Initialization Data

▪ Immediately after successful connection, send `conversation_initiation_client_data` message

▪ Contains Agent configuration overrides (optional), dynamic variables (optional), custom LLM parameters (optional)

▪ Wait for server to return `conversation_initiation_metadata` event
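A sketch of that initialization message is below. The field names (`conversation_config_override`, `custom_llm_extra_body`, `dynamic_variables`) follow the WebSocket API documentation as I understand it; every value shown is a sample placeholder, and all override fields are optional:

```javascript
// Sample conversation_initiation_client_data message. Every override here
// is optional; omit a field to keep the Agent's configured default.
const initMessage = {
  type: 'conversation_initiation_client_data',
  conversation_config_override: {
    agent: {
      prompt: { prompt: 'You are a helpful voice assistant.' }, // custom prompt
      first_message: "Hello, I'm your AI assistant. How can I help you?",
      language: 'en',
    },
    tts: { voice_id: 'YOUR_VOICE_ID' }, // placeholder voice ID
  },
  dynamic_variables: { user_name: 'John' },
  custom_llm_extra_body: { temperature: 0.7, max_tokens: 150 },
};

// Sent once, immediately after the socket opens:
// ws.send(JSON.stringify(initMessage));
```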

Step 3: Receive Session Metadata

▪ Server returns `conversation_initiation_metadata` event

▪ Content to handle:

  – Save `conversation_id` (for subsequent session management)

  – Record audio format information (`agent_output_audio_format`, `user_input_audio_format`)

  – Start audio capture (call `getUserMedia` to get microphone permission)

5.2.2 Conversation Phase

Audio Input Flow:

User speaks → Microphone capture → Audio processing (downsample to 16kHz) → Convert to 16-bit PCM → Base64 encode → Send user_audio_chunk
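The encoding steps in that pipeline (Float32 → 16-bit PCM → Base64) can be sketched as below. `Buffer` keeps the sketch runnable in Node; in the browser you would build the Base64 string with `btoa()` over the same bytes. Typed arrays use platform byte order, which is little-endian on all mainstream platforms:

```javascript
// Clamp Float32 samples to [-1, 1] and scale to signed 16-bit PCM.
function floatTo16BitPcm(float32) {
  const pcm = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i])); // clamp out-of-range samples
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;        // -1 → -32768, 1 → 32767
  }
  return pcm;
}

// Wrap the PCM bytes as a Base64 user_audio_chunk payload.
function encodeChunk(float32) {
  const pcm = floatTo16BitPcm(float32);
  const bytes = Buffer.from(pcm.buffer, pcm.byteOffset, pcm.byteLength);
  return { user_audio_chunk: bytes.toString('base64') };
}
```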

Server Response Flow:

Server receives audio → Speech recognition (ASR) → Send user_transcript → LLM processing → Generate response → Send agent_response → TTS synthesis → Send audio chunks

Key Event Handling Sequence:

1. When user speaks:

   ▪ Continuously send `user_audio_chunk` (send once every 4096 samples)

   ▪ Server processes audio stream, may return `vad_score` (voice activity detection score)

2. Server recognizes user speech:

   ▪ Receive `user_transcript` event

   ▪ Can display what the user said in the UI (for debugging)

3. Server generates response:

   ▪ Receive `agent_response` event

   ▪ Can display the Agent’s text response in the UI

   ▪ May receive `agent_response_correction` (if the Agent corrects the response)

4. Server sends audio:

   ▪ Receive `audio` event (may occur multiple times, streamed)

   ▪ Processing method:

      – Decode Base64 audio data

      – Add to audio playback queue

      – Play audio chunks in order

5. Interruption handling:

   ▪ If the user sends a new message while the Agent is speaking, may receive `interruption` event

   ▪ Need to immediately stop current audio playback and clear the audio queue
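The interruption contract boils down to a playback queue that can be flushed at any moment. A minimal sketch (the class name and decode/play details are illustrative):

```javascript
// Minimal playback-queue sketch: audio chunks queue up in arrival order,
// and an `interruption` event flushes everything still pending.
class PlaybackQueue {
  constructor() { this.chunks = []; }
  enqueue(base64Audio) { this.chunks.push(base64Audio); }
  next() { return this.chunks.shift(); } // next chunk to decode and play
  interrupt() { this.chunks.length = 0; } // stop playback, drop pending audio
}

const queue = new PlaybackQueue();
queue.enqueue('AAAA'); // placeholder Base64 chunks
queue.enqueue('BBBB');
queue.interrupt();     // e.g. on receiving an `interruption` event
```

In the real demo, `interrupt()` would also stop whatever `AudioContext` source is currently playing, not just clear the queue.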

5.2.3 Heartbeat Maintenance Phase

Heartbeat Mechanism:

▪ Server periodically sends `ping` event

▪ Need to immediately respond with `pong` message, containing the same `event_id`

▪ Used to keep connection alive and detect connection status
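Building the pong reply is a one-liner; the only rule is that `event_id` must be echoed back unchanged so the server can match the round trip. The `ping_event` payload shape below is my reading of the docs:

```javascript
// Build the pong reply for a received ping message. The event_id is echoed
// back unchanged (assumed payload shape: { ping_event: { event_id } }).
function buildPong(pingMessage) {
  return JSON.stringify({ type: 'pong', event_id: pingMessage.ping_event.event_id });
}

// Usage inside the message handler:
// if (msg.type === 'ping') ws.send(buildPong(msg));
```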

5.2.4 Tool Call Flow (if enabled)

Tool Call Steps:

1. Server sends `client_tool_call` event

2. Processing flow:

   ▪ Parse tool call information (`tool_name`, `parameters`, `tool_call_id`)

   ▪ Execute the corresponding tool/function

   ▪ Send `client_tool_result` to return results

3. Server continues processing, may send new `agent_response` and `audio`
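Those three steps can be sketched as one handler. The field names (`tool_name`, `tool_call_id`, `parameters`, `is_error`) follow my reading of the API; the tool registry itself is illustrative, and real tools may well be async:

```javascript
// Execute a client tool call and reply with client_tool_result.
// `tools` maps tool names to plain functions; `send` transmits a string.
function handleToolCall(msg, tools, send) {
  const { tool_name, tool_call_id, parameters } = msg.client_tool_call;
  let result, isError = false;
  try {
    result = tools[tool_name](parameters); // real tools may be async
  } catch (err) {
    result = String(err); // report failures back instead of dropping them
    isError = true;
  }
  send(JSON.stringify({
    type: 'client_tool_result',
    tool_call_id, // must match the incoming call
    result,
    is_error: isError,
  }));
}
```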

5.2.5 Context Update Flow (if enabled)

Context Update:

▪ Client can actively send `contextual_update` to update conversation context

▪ Server may also send `contextual_update` event

▪ Handle context updates according to business requirements

5.2.6 Text Message Flow

Send Text Message:

▪ Client sends `user_message` event

▪ Feature: Can interrupt the Agent’s ongoing audio response (a capability unique to ElevenLabs)

▪ Processing method:

  – If the Agent is playing audio, immediately stop playback (receive `interruption` event)

  – Wait for server to process text message and return new response

5.2.7 Connection Close Phase

Normal Close:

▪ Stop sending audio (call `stopRecording`)

▪ Close WebSocket connection

▪ Release audio resources (close AudioContext, stop MediaStream)

Exception Handling:

▪ Listen to WebSocket `error` and `close` events

▪ Implement reconnection logic (optional)

▪ Clean up all resources

5.3 Detailed Event Handling

5.3.1 Events Client Needs to Handle

| Event Type | When Received | Required Handling | Optional Operations |
| --- | --- | --- | --- |
| `conversation_initiation_metadata` | After connection established | Save `conversation_id`, start recording | Display session information |
| `user_transcript` | After user speaks | — | Display what the user said |
| `agent_response` | After Agent generates response | — | Display Agent text response |
| `agent_response_correction` | When Agent corrects response | — | Display correction information |
| `audio` | After Agent audio synthesis | Decode and play audio | Display playback status |
| `interruption` | When interruption detected | Stop playback, clear queue | Display interruption prompt |
| `ping` | Server heartbeat detection | Immediately send `pong` | — |
| `client_tool_call` | When Agent needs to call a tool | Execute tool and return result | Display tool call information |
| `vad_score` | During voice activity detection | — | Visualize voice activity |

5.3.2 When Client Sends Messages

| Message Type | Send Timing | Frequency |
| --- | --- | --- |
| `conversation_initiation_client_data` | Immediately after connection established | Once |
| `user_audio_chunk` | Continuously during recording | High frequency (approximately every 250ms) |
| `user_message` | When user inputs text | On demand |
| `user_activity` | When notifying the server of user activity | On demand |
| `pong` | Immediately upon receiving `ping` | Once per `ping` |
| `client_tool_result` | After tool execution completes | On demand |
| `contextual_update` | When updating context | On demand |

6. Audio Format Requirements

ElevenLabs has clear requirements for audio format:

▪ Sample Rate: 16kHz

▪ Channels: Mono

▪ Encoding: 16-bit PCM

▪ Format: Base64 encoded binary data

7. Technical Implementation

7.1 Audio Processing Flow

1. Capture: Use `getUserMedia` API to get microphone audio stream

2. Process: Use `AudioContext` and `ScriptProcessorNode` (deprecated in favor of `AudioWorklet`, but simple for a demo) to process audio

3. Downsample: If sample rate is not 16kHz, automatically downsample

4. Encode: Convert Float32 audio data to 16-bit PCM

5. Transmit: Base64 encode the PCM data and send it via WebSocket
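The downsampling step (3) can be sketched with simple linear interpolation. This is a naive resampler for illustration; a production one would low-pass filter first to avoid aliasing:

```javascript
// Naive linear-interpolation downsampler from the AudioContext's native
// rate (often 44.1 or 48 kHz) to the 16 kHz ElevenLabs expects.
function downsample(input, fromRate, toRate = 16000) {
  if (fromRate === toRate) return input;
  const ratio = fromRate / toRate;
  const outLength = Math.floor(input.length / ratio);
  const output = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;                              // fractional source index
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1); // clamp at the end
    const frac = pos - left;
    output[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return output;
}
```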

7.2 Audio Playback Flow

1. Receive: Receive Base64 encoded audio from WebSocket

2. Decode: Base64 decode to binary data

3. Play: Try to play as MP3 first; if that fails, fall back to raw PCM playback

8. ElevenLabs vs OpenAI Realtime API Detailed Comparison

During development, I also researched OpenAI Realtime API and found that both platforms have their own characteristics. Below is my detailed comparison:

8.1 Quick Comparison Overview

| Comparison Item | ElevenLabs Agents Platform | OpenAI Realtime API |
| --- | --- | --- |
| Multimodal Support | ❌ Not supported (no camera recognition / image input) | ✅ Supported (GPT-4o) |
| Voice Selection | ✅ 100+ preset voices, supports voice cloning | ⚠ 10 preset voices |
| LLM Models | ✅ Multi-model support (ElevenLabs, OpenAI, Google, Anthropic) | ✅ GPT-4o, GPT-4o-mini |
| Knowledge Base | ✅ Supported | ✅ Supported (via Assistants API) |
| Function Call | ✅ Supported | ✅ Supported |
| Text Interrupt AI Response | ✅ Supported (a text message can interrupt the AI’s ongoing response) | ❌ Not supported |
| Latency | ✅ Depends on model (163ms–3.87s) | ✅ Low (300–800ms) |
| Pricing | 💰 Per-minute billing (by model, $0.0033–$0.1956/minute) | 💰 Per-token billing (GPT-4o-mini more economical) |

For detailed comparison information, please see the detailed explanations of each feature point below.

8.2 Detailed Comparison of Key Points

8.2.1 Multimodal Support (Camera Recognition)

| Platform | Support Status | Details | References |
| --- | --- | --- | --- |
| ElevenLabs Agents Platform | ❌ Currently not supported | Focuses on voice conversation; no visual input (camera/image recognition) | ElevenLabs Agents Platform WebSocket API Documentation |
| OpenAI Realtime API | ✅ Supported (via GPT-4o) | Supports visual input; can process images and video frames, enabling real-time camera recognition. GPT-4o natively supports multimodal input | OpenAI Realtime API Documentation; OpenAI GPT-4o Vision Capabilities |

Explanation: OpenAI Realtime API is based on GPT-4o model, supports multimodal input, and can process image and video content. ElevenLabs currently focuses on voice conversation scenarios and does not support visual input.

Reference Sources:

▪ ElevenLabs: Official WebSocket API Documentation – Does not mention visual input support

▪ OpenAI: Realtime API Official Documentation – Supports GPT-4o multimodal capabilities

8.2.2 Voice Selection Comparison

| Platform | Voice Count | Voice Characteristics | Customization | References |
| --- | --- | --- | --- | --- |
| ElevenLabs Agents Platform | ✅ 100+ preset voices | High quality, multilingual, supports emotional expression and voice cloning | Custom voice ID, emotion control, tone adjustment, voice cloning | ElevenLabs Voice Library; ElevenLabs Voice Cloning |
| OpenAI Realtime API | ⚠ Limited selection (10 voices) | Mainly relies on the TTS API, which provides 10 preset voices (alloy, echo, fable, onyx, nova, shimmer, …) | Limited voice control; no voice cloning | OpenAI TTS Documentation; OpenAI TTS Voice List |

Detailed Comparison:

ElevenLabs: Provides over 100 preset voices, covering multiple languages, ages, genders, and styles. Supports voice cloning, can create custom voices from a small number of samples. Supports emotion and tone control, can adjust voice expression. High voice quality, suitable for professional applications.

OpenAI: TTS API provides 10 preset voices (alloy, echo, fable, onyx, nova, shimmer…), relatively limited selection. Does not support voice cloning, weak voice control capability.

Reference Sources:

▪ OpenAI: TTS API Documentation – Lists 10 available voices

▪ ElevenLabs: Official Voice Library – Shows large number of preset voices

▪ ElevenLabs: Voice Cloning Documentation – Supports custom voice cloning

8.2.3 Supported LLM Models

| Platform | Supported Models | Model Characteristics | References |
| --- | --- | --- | --- |
| ElevenLabs Agents Platform | ✅ Multi-model support | ElevenLabs proprietary models plus multiple third-party models (OpenAI, Google, Anthropic, etc.); choose according to your needs; supports custom LLM parameters | ElevenLabs Agents Documentation; ElevenLabs LLM Configuration |
| OpenAI Realtime API | ✅ GPT-4o, GPT-4o-mini | GPT-4o (multimodal, stronger capabilities) and GPT-4o-mini (lightweight, faster, lower cost); models can be switched | OpenAI Realtime API Models; OpenAI Model Comparison |

List of Models Supported by ElevenLabs Agents Platform:

ElevenLabs Proprietary Models:

▪ GLM-4.5-Air: Suitable for agentic use cases, latency ~631ms, cost ~$0.0600/minute

▪ Qwen3-30B-A3B: Ultra-low latency, latency ~163ms, cost ~$0.0168/minute

▪ GPT-OSS-120B: Experimental model (OpenAI open-source model), latency ~314ms, cost ~$0.0126/minute

Other Provider Models (available on ElevenLabs platform):

OpenAI Models:

▪ GPT-5 series: GPT-5 (latency ~1.14s, cost ~$0.0826/minute), GPT-5.1, GPT-5 Mini (latency ~855ms, cost ~$0.0165/minute), GPT-5 Nano (latency ~788ms, cost ~$0.0033/minute)

▪ GPT-4.1 series: GPT-4.1 (latency ~803ms, cost ~$0.1298/minute), GPT-4.1 Mini, GPT-4.1 Nano (latency ~478ms, cost ~$0.0065/minute)

▪ GPT-4o (latency ~771ms, cost ~$0.1623/minute), GPT-4o Mini (latency ~738ms, cost ~$0.0097/minute)

▪ GPT-4 Turbo (latency ~1.28s, cost ~$0.6461/minute), GPT-3.5 Turbo (latency ~494ms, cost ~$0.0323/minute)

Google Models:

▪ Gemini 3 Pro Preview (latency ~3.87s, cost ~$0.1310/minute)

▪ Gemini 2.5 Flash (latency ~752ms, cost ~$0.0097/minute), Gemini 2.5 Flash Lite (latency ~505ms, cost ~$0.0065/minute)

▪ Gemini 2.0 Flash (latency ~564ms, cost ~$0.0065/minute), Gemini 2.0 Flash Lite (latency ~547ms, cost ~$0.0049/minute)

Anthropic Models:

▪ Claude Sonnet 4.5 (latency ~1.5s, cost ~$0.1956/minute), Claude Sonnet 4 (latency ~1.31s, cost ~$0.1956/minute)

▪ Claude Haiku 4.5 (latency ~703ms, cost ~$0.0652/minute)

▪ Claude 3.7 Sonnet (latency ~1.12s, cost ~$0.1956/minute), Claude 3.5 Sonnet (latency ~1.14s, cost ~$0.1956/minute)

▪ Claude 3 Haiku (latency ~608ms, cost ~$0.0163/minute)

Custom Models:

▪ Supports adding custom LLMs

*(Screenshot omitted: the list of selectable LLM models in the ElevenLabs Agents Platform, including latency and pricing information.)*

Detailed Explanation:

ElevenLabs: Provides rich model selection, including proprietary models and models from multiple third-party providers. Users can choose the most suitable model based on latency, cost, and functional requirements. Supports customizing LLM parameters (such as temperature, max_tokens) through `custom_llm_extra_body`.

OpenAI: Clearly supports GPT-4o (supports multimodal, stronger reasoning capabilities) and GPT-4o-mini (faster, lower cost), users can choose according to needs. Both models support real-time conversation.

Reference Sources:

▪ ElevenLabs: [Agents Platform Documentation](https://elevenlabs.io/docs/agents-platform) – Model selection interface

▪ ElevenLabs: [WebSocket API – Custom LLM Parameters](https://elevenlabs.io/docs/agents-platform/api-reference/agents-platform/websocket#custom-llm-extra-body)

▪ OpenAI: [Realtime API Documentation](https://platform.openai.com/docs/guides/realtime) – Supports GPT-4o and GPT-4o-mini

▪ OpenAI: [Model Comparison Documentation](https://platform.openai.com/docs/models) – Detailed model information

8.2.4 Knowledge Base Support

| Platform | Knowledge Base Support | Implementation | References |
| --- | --- | --- | --- |
| ElevenLabs Agents Platform | ✅ Supported | Knowledge base integration through Agent configuration: upload documents and set up a knowledge base, which the Agent references in conversations | ElevenLabs Agents Documentation; ElevenLabs Agent Configuration |
| OpenAI Realtime API | ✅ Supported (via Assistants API or function calling) | Integrate a knowledge base through the Assistants API (file upload, vector storage), or reach external data sources and APIs through function calling | OpenAI Assistants API; OpenAI Function Calling |

Detailed Explanation:

ElevenLabs: Supports knowledge base functionality in Agent configuration, can upload documents for Agent reference. Knowledge base content will be automatically referenced in conversations.

OpenAI: Can create assistants with knowledge base through Assistants API (supports file upload and vector storage), or access external data sources and APIs through function calling, achieving more flexible knowledge retrieval.

Reference Sources:

▪ ElevenLabs: [Agents Platform Documentation](https://elevenlabs.io/docs/agents-platform) – Mentions knowledge base support

▪ ElevenLabs: [Agent Configuration Documentation](https://elevenlabs.io/docs/agents-platform/agent-configuration) – Knowledge base configuration instructions

▪ OpenAI: [Assistants API Documentation](https://platform.openai.com/docs/assistants) – Knowledge base and file upload functionality

▪ OpenAI: [Function Calling Documentation](https://platform.openai.com/docs/guides/function-calling) – External data access

8.2.5 Function Call Support

| Platform | Support Status | Implementation | References |
| --- | --- | --- | --- |
| ElevenLabs Agents Platform | ✅ Supported | Tool calling via the `client_tool_call` and `client_tool_result` message types; tools are defined in the Agent | ElevenLabs WebSocket API – Tool Calling; ElevenLabs Agent Tool Configuration |
| OpenAI Realtime API | ✅ Supported | Function calling via `tool_calls` and `tool_results` events; tools can be defined per session | OpenAI Realtime API – Function Calling; OpenAI Function Calling Guide |

Detailed Comparison:

ElevenLabs: Uses `client_tool_call` event to request client to execute tools, returns results through `client_tool_result`. Tools are defined in Agent configuration.

OpenAI: Uses standard function calling mechanism, triggered through `tool_calls` event, returns results through `tool_results`. Supports dynamically defining tools in sessions.

Reference Sources:

▪ ElevenLabs: [WebSocket API – client_tool_call](https://elevenlabs.io/docs/agents-platform/api-reference/agents-platform/websocket#client-tool-call) – Tool calling implementation

▪ ElevenLabs: [Agent Configuration](https://elevenlabs.io/docs/agents-platform/agent-configuration) – Tool definition

▪ OpenAI: [Realtime API Function Calling](https://platform.openai.com/docs/guides/realtime/function-calling) – Real-time API tool calling

▪ OpenAI: [Function Calling Guide](https://platform.openai.com/docs/guides/function-calling) – Detailed implementation instructions

8.2.6 Text Interrupt AI Response

| Platform | Support Status | Details | References |
| --- | --- | --- | --- |
| ElevenLabs Agents Platform | ✅ Supported | Sending a text message (`user_message`) can interrupt the AI’s ongoing voice response, for more natural conversational interaction | ElevenLabs WebSocket API – User Message |
| OpenAI Realtime API | ❌ Not supported | A text message cannot interrupt the AI’s ongoing response; you must wait for the current response to complete | OpenAI Realtime API Documentation |

Detailed Comparison:

ElevenLabs: Supports interrupting AI’s ongoing response by sending text messages. When user sends text message while AI is speaking, AI will immediately stop current response and process new text input, making conversations more natural and smooth, similar to interruption behavior in real human conversations.

OpenAI: Does not support text message interruption feature. If AI is responding, text messages sent by user need to wait for current response to complete before being processed, which may affect conversation fluency and real-time performance.

Use Cases:

ElevenLabs: Suitable for scenarios requiring fast interaction and interruption, such as real-time customer service, quick Q&A, etc.

OpenAI: Suitable for scenarios requiring complete responses, but interaction may not be flexible enough

8.2.7 Latency Comparison

| Platform | Latency Performance | Optimization Features | References |
| --- | --- | --- | --- |
| ElevenLabs Agents Platform | ✅ Depends on model selection | Latency ranges from 163ms to 3.87s depending on the chosen LLM. Low-latency models such as Qwen3-30B-A3B (~163ms) suit real-time interaction; high-performance models such as GPT-5 (~1.14s) or Claude Sonnet (~1.5s) are slower but more capable. Supports streaming responses | ElevenLabs Agents Platform Documentation; ElevenLabs WebSocket API |
| OpenAI Realtime API | ✅ Low latency | Real-time streaming responses; latency typically 300–800ms (depends on model and network); GPT-4o-mini is usually faster | OpenAI Realtime API Documentation; OpenAI Performance Optimization |

Detailed Explanation:

ElevenLabs: Latency depends on the selected LLM model. If selecting low-latency models (such as Qwen3-30B-A3B ~163ms, GPT-3.5 Turbo ~494ms), latency can be very low, suitable for real-time interaction. If selecting high-performance models (such as GPT-5 ~1.14s, Claude Sonnet ~1.5s), latency will be higher but reasoning capabilities stronger. Supports streaming audio response, reducing first-byte latency.

OpenAI: Latency is relatively stable, GPT-4o-mini usually responds faster than GPT-4o. Supports streaming response optimization.

Actual latency will be affected by the following factors:

– Network conditions and geographic location

– Model selection (ElevenLabs platform has multiple models to choose from, OpenAI mainly GPT-4o vs GPT-4o-mini)

– Request complexity

– Server load

The above data are typical values, actual performance may vary depending on usage scenarios.

Reference Sources:

▪ ElevenLabs: [Agents Platform Documentation](https://elevenlabs.io/docs/agents-platform) – Emphasizes low-latency optimization

▪ OpenAI: [Realtime API Documentation](https://platform.openai.com/docs/guides/realtime) – Real-time performance description

▪ OpenAI: [Latency Optimization Guide](https://platform.openai.com/docs/guides/realtime/optimizing-latency) – Performance optimization recommendations

8.2.8 Pricing Comparison

| Platform | Billing Method | Price Details | References |
| --- | --- | --- | --- |
| ElevenLabs Agents Platform | 💰 Per conversation minute (by selected model) | Price depends on the selected LLM model and usually bundles fees for voice synthesis, speech recognition, and LLM calls. See the “Supported LLM Models” section above for per-model prices | ElevenLabs Pricing Page; ElevenLabs Billing Instructions |
| OpenAI Realtime API | 💰 Per token plus audio duration | GPT-4o: input $2.50/1M tokens, output $10/1M tokens. GPT-4o-mini: input $0.15/1M tokens, output $0.60/1M tokens. Audio input/output: $0.015/minute (prices may change over time) | OpenAI Pricing Page; OpenAI Realtime API Pricing |

Detailed Comparison:

ElevenLabs: Uses per-conversation minute billing model, price depends on selected LLM model. Usually includes comprehensive fees for voice synthesis, speech recognition, and LLM calls, billing method is simple and clear. For specific model prices, please refer to the “Supported LLM Models” section above.

OpenAI: Uses per-token billing model, prices vary significantly between different models:

  – GPT-4o-mini: More economical, suitable for high-frequency usage scenarios

  – GPT-4o: Stronger functionality but higher price, suitable for scenarios requiring multimodal or stronger reasoning capabilities

  – Audio processing billed separately per minute

Cost Estimation Examples (for reference only):

▪ Short conversation (5 minutes, approximately 1,000 tokens): OpenAI GPT-4o-mini approximately $0.0015 + $0.075 = $0.0765

▪ Long conversation (30 minutes, approximately 5,000 tokens): OpenAI GPT-4o-mini approximately $0.0075 + $0.45 = $0.4575

Recommendations: Choose the appropriate platform based on actual usage scenarios and budget:

– If mainly using voice conversation with high usage volume, ElevenLabs’ per-minute billing may be simpler, can choose different models according to needs to balance cost and performance

– If need multimodal capabilities or stronger LLM capabilities, OpenAI may be more suitable

– For high-frequency usage, GPT-4o-mini is usually more economical

Reference Sources:

▪ ElevenLabs: [Official Pricing Page](https://elevenlabs.io/pricing) – Latest pricing information

▪ ElevenLabs: [Agents Platform Documentation](https://elevenlabs.io/docs/agents-platform) – Billing instructions

▪ OpenAI: [Official Pricing Page](https://platform.openai.com/pricing) – Latest pricing information (2024-2025)

▪ OpenAI: [Realtime API Documentation](https://platform.openai.com/docs/guides/realtime) – Billing details

9. Conclusion

ElevenLabs Agents Platform WebSocket API provides powerful support for real-time voice conversations. Through this demo, I implemented complete real-time voice conversation functionality, including audio capture, processing, transmission, and playback.

Compared to OpenAI Realtime API, ElevenLabs has obvious advantages in voice selection, model flexibility, and other aspects, especially suitable for scenarios requiring specific voices or voice cloning. However, if multimodal capabilities are needed, OpenAI may be a better choice.

If you also want to try implementing real-time voice conversations, this demo should provide a good starting point. The project code is open source, and you can use it directly or extend it based on this foundation.
