Building a Human-Like AI Outbound Calling Agent at ~$0.01/Minute - part 1

#webdev #programming #ai #devops

Hi. Dev Community!
A month ago, a client approached me with an interesting challenge:

Build an AI-powered outbound calling agent that can handle 50+ concurrent calls, sound indistinguishable from a human, and cost around $0.01 per minute.

The agent was meant for outbound sales calls, so realism, latency, and cost efficiency were absolutely critical.

This post walks through how I approached the problem, the tradeoffs I evaluated, and how I ended up building a scalable, human-sounding AI caller using Telnyx, OpenAI, and ElevenLabs.

Project Requirements

The requirements were non-negotiable:

Concurrency - Handle more than 50 simultaneous outbound calls without degradation.
Human-like Voice Call - recipients should believe they are speaking with a real human agent.
Ultra-Low Latency - Delays longer than ~300–500ms completely break the illusion.
Low Cost - As a startup MVP, the target cost was ~$0.01 per minute, including telephony, speech recognition, LLM processing, and TTS.

Twilio vs Telnyx: The First Big Decision

I had extensive experience with Twilio and initially leaned toward it out of familiarity. However, when I started breaking down the pricing and architecture requirements, a few things became clear:

Twilio

Excellent documentation and ecosystem
Higher per-minute call costs
Streaming and media handling can get expensive at scale
Cost balloons quickly with 50+ concurrent calls

Telnyx

Significantly cheaper call rates
Native two-way media streaming
Better suited for real-time audio pipelines
More control over low-level call handling

After reviewing both APIs and doing some rough cost modeling, Telnyx was the clear winner for this use case.

System Architecture Overview

Here’s the high-level architecture I ended up with:

Telnyx

Initiates outbound calls
Streams audio bi-directionally in real time

ElevenLabs (Speech-to-Text)

Real-time transcription of the caller’s voice
Low latency is critical here

OpenAI API

Receives transcribed text
Generates context-aware conversational responses
Maintains conversation state

ElevenLabs (Text-to-Speech)

Converts LLM output into ultra-realistic speech
Key to making the AI sound human

Back to Telnyx

Stream ElevenLabs voice back into the live call

All of this runs asynchronously to support high concurrency without blocking.

This concludes the first part. Next time, I will talk about my development experience and how I was able to build a working prototype within a week.

Top comments (1)

Jane Mayfield • Feb 5

Really interesting approach - and honestly, a pretty bold set of constraints 😄
Handling 50+ concurrent calls with human-level realism, sub-500ms latency, and ~$0.01/min sounds challenging.......
Curious to see how you solved the real-world issues around latency spikes, conversation state at scale, and cost control in practice. Looking forward to Part 2 and the details of how the prototype performed in production 🚀