Hi, Dev Community!
A month ago, a client approached me with an interesting challenge:
Build an AI-powered outbound calling agent that can handle 50+ concurrent calls, sound indistinguishable from a human, and cost around $0.01 per minute.
The agent was meant for outbound sales calls, so realism, latency, and cost efficiency were absolutely critical.
This post walks through how I approached the problem, the tradeoffs I evaluated, and how I ended up building a scalable, human-sounding AI caller using Telnyx, OpenAI, and ElevenLabs.
**Project Requirements**
The requirements were non-negotiable:
- Concurrency - Handle more than 50 simultaneous outbound calls without degradation.
- Human-like Voice - Call recipients should believe they are speaking with a real human agent.
- Ultra-Low Latency - Response delays longer than ~300–500 ms break the illusion of talking to a person.
- Low Cost - As a startup MVP, the target cost was ~$0.01 per minute, including telephony, speech recognition, LLM processing, and TTS.
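To make the latency requirement concrete, here is a minimal sketch of a per-turn latency budget check. The stage names and millisecond values are placeholders invented for illustration, not measurements from this project; only the ~500 ms upper bound comes from the requirements above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageLatency:
    """One pipeline stage and its assumed per-turn latency."""
    name: str
    millis: float  # placeholder value, not a measurement


def total_latency_ms(stages):
    """Sum the latency of every stage in a single conversational turn."""
    return sum(stage.millis for stage in stages)


def within_budget(stages, budget_ms=500.0):
    """Return True if the whole turn fits inside the latency budget."""
    return total_latency_ms(stages) <= budget_ms


# Hypothetical numbers for illustration only.
EXAMPLE_STAGES = [
    StageLatency("speech_to_text", 150.0),
    StageLatency("llm_first_token", 200.0),
    StageLatency("text_to_speech_first_audio", 100.0),
    StageLatency("network_and_telephony", 40.0),
]

if __name__ == "__main__":
    print(total_latency_ms(EXAMPLE_STAGES), within_budget(EXAMPLE_STAGES))
```

A check like this is useful mainly as a planning tool: if the stages add up to more than the budget on paper, no amount of tuning a single stage will save the call experience.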
**Twilio vs Telnyx: The First Big Decision**
I had extensive experience with Twilio and initially leaned toward it out of familiarity. However, when I started breaking down the pricing and architecture requirements, a few things became clear:
Twilio
- Excellent documentation and ecosystem
- Higher per-minute call costs
- Streaming and media handling can get expensive at scale
- Cost balloons quickly with 50+ concurrent calls
Telnyx
- Significantly cheaper call rates
- Native two-way media streaming
- Better suited for real-time audio pipelines
- More control over low-level call handling
After reviewing both APIs and doing some rough cost modeling, Telnyx was the clear winner for this use case.
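The kind of rough cost modeling mentioned above can be sketched in a few lines. Everything here is an assumption for illustration: the component names, the per-minute rates, and the call volumes are made-up placeholders, not real Twilio, Telnyx, OpenAI, or ElevenLabs prices. Only the ~$0.01 per minute target comes from the requirements.

```python
import math

TARGET_USD_PER_MINUTE = 0.01  # target from the project requirements


def blended_cost_per_minute(rates):
    """Add up the per-minute cost of every component in the pipeline."""
    return sum(rates.values())


def fits_target(rates, target=TARGET_USD_PER_MINUTE):
    """True if the blended rate is at or below the target (with float tolerance)."""
    cost = blended_cost_per_minute(rates)
    return cost < target or math.isclose(cost, target)


def campaign_cost(rates, concurrent_calls, minutes_per_call):
    """Total cost of one batch of simultaneous calls."""
    return blended_cost_per_minute(rates) * concurrent_calls * minutes_per_call


# Hypothetical rate cards (USD per minute). Placeholder numbers only.
PROVIDER_A = {"telephony": 0.004, "stt": 0.002, "llm": 0.002, "tts": 0.002}
PROVIDER_B = {"telephony": 0.014, "stt": 0.002, "llm": 0.002, "tts": 0.002}

if __name__ == "__main__":
    for name, rates in (("A", PROVIDER_A), ("B", PROVIDER_B)):
        print(name, fits_target(rates), campaign_cost(rates, 50, 3))
```

Plugging in real quotes for each component makes it obvious which line item dominates, which is usually where the provider decision gets made.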
**System Architecture Overview**
Here’s the high-level architecture I ended up with:
Telnyx
- Initiates outbound calls
- Streams audio bi-directionally in real time
ElevenLabs (Speech-to-Text)
- Real-time transcription of the call recipient’s voice
- Low latency is critical here
OpenAI API
- Receives transcribed text
- Generates context-aware conversational responses
- Maintains conversation state
ElevenLabs (Text-to-Speech)
- Converts LLM output into ultra-realistic speech
- Key to making the AI sound human
Back to Telnyx
- Streams the ElevenLabs audio back into the live call
All of this runs asynchronously to support high concurrency without blocking.
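The shape of that asynchronous pipeline can be sketched with `asyncio`. This is a minimal illustration, not the actual implementation: `transcribe`, `generate_reply`, and `synthesize` are hypothetical stand-ins for the speech-to-text, LLM, and text-to-speech calls, and they fake their work locally instead of calling Telnyx, OpenAI, or ElevenLabs. The semaphore cap of 50 is also an assumption chosen to mirror the concurrency requirement.

```python
import asyncio

MAX_CONCURRENT_CALLS = 50  # assumed cap, mirroring the 50-call requirement


async def transcribe(audio_chunk: bytes) -> str:
    """Hypothetical STT stage: pretends the audio bytes are UTF-8 text."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return audio_chunk.decode("utf-8")


async def generate_reply(history: list, text: str) -> str:
    """Hypothetical LLM stage: records the turn and returns a canned reply."""
    await asyncio.sleep(0)
    history.append(text)  # per-call conversation state
    return f"reply to: {text}"


async def synthesize(text: str) -> bytes:
    """Hypothetical TTS stage: pretends the text bytes are audio."""
    await asyncio.sleep(0)
    return text.encode("utf-8")


async def handle_call(call_id: int, inbound: list, limiter: asyncio.Semaphore) -> list:
    """Run one call through STT -> LLM -> TTS, one audio chunk at a time."""
    async with limiter:  # at most MAX_CONCURRENT_CALLS calls in flight
        history = []
        outbound = []
        for chunk in inbound:
            text = await transcribe(chunk)
            reply = await generate_reply(history, text)
            outbound.append(await synthesize(reply))
        return outbound


async def run_calls(n_calls: int) -> list:
    """Start every call as a coroutine and wait for all of them."""
    limiter = asyncio.Semaphore(MAX_CONCURRENT_CALLS)
    sample_audio = [b"hello", b"not interested"]
    calls = [handle_call(i, sample_audio, limiter) for i in range(n_calls)]
    return await asyncio.gather(*calls)


if __name__ == "__main__":
    results = asyncio.run(run_calls(60))
    print(len(results), "calls completed")
```

Because every stage awaits instead of blocking, one event loop can interleave many calls; the semaphore simply queues any calls beyond the cap until a slot frees up.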
This concludes the first part. Next time, I will talk about my development experience and how I was able to build a working prototype within a week.