DEV Community

Cover image for Building a Human-Like AI Outbound Calling Agent at ~$0.01/Minute - part 1
James campbell
James campbell

Posted on

Building a Human-Like AI Outbound Calling Agent at ~$0.01/Minute - part 1

Hi. Dev Community!
A month ago, a client approached me with an interesting challenge:

Build an AI-powered outbound calling agent that can handle 50+ concurrent calls, sound indistinguishable from a human, and cost around $0.01 per minute.

The agent was meant for outbound sales calls, so realism, latency, and cost efficiency were absolutely critical.

This post walks through how I approached the problem, the tradeoffs I evaluated, and how I ended up building a scalable, human-sounding AI caller using Telnyx, OpenAI, and ElevenLabs.

**

Project Requirements

**

The requirements were non-negotiable:

  1. Concurrency - Handle more than 50 simultaneous outbound calls without degradation.
  2. Human-like Voice Call - recipients should believe they are speaking with a real human agent.
  3. Ultra-Low Latency - Delays longer than ~300–500ms completely break the illusion.
  4. Low Cost - As a startup MVP, the target cost was ~$0.01 per minute, including telephony, speech recognition, LLM processing, and TTS.

Twilio vs Telnyx: The First Big Decision

I had extensive experience with Twilio and initially leaned toward it out of familiarity. However, when I started breaking down the pricing and architecture requirements, a few things became clear:

Twilio

  • Excellent documentation and ecosystem
  • Higher per-minute call costs
  • Streaming and media handling can get expensive at scale
  • Cost balloons quickly with 50+ concurrent calls

Telnyx

  • Significantly cheaper call rates
  • Native two-way media streaming
  • Better suited for real-time audio pipelines
  • More control over low-level call handling

After reviewing both APIs and doing some rough cost modeling, Telnyx was the clear winner for this use case.

System Architecture Overview

Here’s the high-level architecture I ended up with:

Telnyx

  • Initiates outbound calls
  • Streams audio bi-directionally in real time

ElevenLabs (Speech-to-Text)

  • Real-time transcription of the caller’s voice
  • Low latency is critical here

OpenAI API

  • Receives transcribed text
  • Generates context-aware conversational responses
  • Maintains conversation state

ElevenLabs (Text-to-Speech)

  • Converts LLM output into ultra-realistic speech
  • Key to making the AI sound human

Back to Telnyx

  • Stream ElevenLabs voice back into the live call

All of this runs asynchronously to support high concurrency without blocking.

This concludes the first part. Next time, I will talk about my development experience and how I was able to build a working prototype within a week.

Top comments (0)