James campbell

Building a Human-Like AI Outbound Calling Agent at ~$0.01/Minute - Part 2

Continuing from yesterday, I want to share what actually broke, slowed me down, and forced real design decisions during development.

The three constraints that dominated every technical choice were simple to state but hard to satisfy: human-like voice quality, ultra-low latency, and the lowest possible cost. I deliberately set concurrency aside at first and focused on designing a clean, end-to-end workflow for a single call. Fortunately, Telnyx’s documentation was solid enough to get the basics in place quickly, which let me spend my time on the harder problems that don’t show up in happy-path examples.

The first major challenge was voicemail detection, and it turned out to be more critical than expected. Because this was outbound sales, roughly 950 out of every 1,000 calls were answered by voicemail. Every second spent talking to a machine instead of hanging up immediately was pure wasted cost. Telnyx does offer voicemail detection, but in practice the accuracy wasn’t reliable enough, largely because many users record custom voicemail greetings using their own voice. That makes it extremely difficult to distinguish a real person from a voicemail system using simple heuristics. To solve this, I shifted the problem from “detect voicemail” to “detect human intent.” Instead of relying on carrier-level detection, I analyzed the first response after the call was answered and built a lightweight detection system based on common voicemail phrases and speech patterns. This allowed early termination of most voicemail calls and resulted in massive cost savings.
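A stripped-down version of that first-response check looks roughly like this. The phrase list, the length heuristic, and the `classify_first_utterance` helper are illustrative rather than the production rules, and the transcript is assumed to come from whatever ASR layer is listening when the call is answered:

```python
import re

# Illustrative phrases that commonly appear in voicemail greetings; the real
# list was tuned against call transcripts and is larger than this sample.
VOICEMAIL_PATTERNS = [
    r"\bleave (a|your) message\b",
    r"\bafter the (tone|beep)\b",
    r"\b(you have|you've) reached\b",
    r"\bnot available (right now|at the moment)\b",
    r"\bplease record\b",
    r"\bvoice ?mail\b",
]

# Short, conversational openers are a strong signal of a live human.
HUMAN_OPENERS = {"hello", "hi", "hey", "yes", "speaking", "who is this"}

def classify_first_utterance(transcript: str) -> str:
    """Return 'voicemail', 'human', or 'unknown' for the first utterance."""
    text = transcript.lower().strip()

    # Scripted greetings tend to be long and keyword-heavy.
    if any(re.search(p, text) for p in VOICEMAIL_PATTERNS):
        return "voicemail"

    # A live person usually answers with a short greeting, not a monologue.
    if len(text.split()) <= 4 and any(text.startswith(o) for o in HUMAN_OPENERS):
        return "human"

    return "unknown"  # keep listening for another utterance before deciding

# Example: hang up early when the greeting looks like a machine.
if classify_first_utterance("Hi, you've reached Jamie. Please leave a message.") == "voicemail":
    print("hangup")  # in the real pipeline this triggers immediate call termination
```

The important part is the early exit: a "voicemail" verdict ends the call within the first second or two of the greeting, which is where most of the cost savings came from.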

The second challenge was bidirectional audio streaming with Telnyx. While Telnyx does support real-time two-way streaming, there was no clear, end-to-end official documentation showing how to wire everything together for a low-latency conversational pipeline. I had to analyze multiple streaming approaches, test different buffering strategies, and experiment with real-time audio handling to find something that worked reliably. After several iterations, I was able to build a stable bidirectional stream with end-to-end latency around 0.5 seconds, which was low enough to preserve the illusion of a natural conversation.
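To make the shape of that pipeline concrete, here is a minimal asyncio/WebSocket sketch of a per-call stream handler. The JSON fields (`event`, `media.payload` carrying base64 audio) follow the common media-streaming message shape rather than Telnyx’s exact schema, and `process_audio_chunk` is a hypothetical stand-in for the ASR/LLM/TTS stages, so treat this as a shape, not the implementation:

```python
import asyncio
import base64
import json

import websockets  # pip install websockets (recent versions take a single handler argument)

async def process_audio_chunk(chunk: bytes):
    """Placeholder for the real ASR -> LLM -> TTS stages."""
    yield chunk  # echo the caller's audio back, just to keep the sketch runnable

async def handle_call_stream(ws):
    """One WebSocket connection == one call's bidirectional audio stream."""
    async for raw in ws:
        msg = json.loads(raw)

        if msg.get("event") == "media":
            # Decode the inbound audio chunk and hand it to the pipeline;
            # stream synthesized reply audio back on the same socket.
            inbound = base64.b64decode(msg["media"]["payload"])
            async for reply_chunk in process_audio_chunk(inbound):
                await ws.send(json.dumps({
                    "event": "media",
                    "media": {"payload": base64.b64encode(reply_chunk).decode()},
                }))

        elif msg.get("event") == "stop":
            break  # carrier ended the stream; tear down this call's pipeline

async def main():
    # The telephony provider connects to this WebSocket endpoint per call.
    async with websockets.serve(handle_call_stream, "0.0.0.0", 8080):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())
```

Keeping the reply path streaming (sending small synthesized chunks as soon as they exist, rather than waiting for a full sentence of TTS) is what made the ~0.5 second round trip feel conversational.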

The third major challenge was concurrency. Supporting 50+ simultaneous calls wasn’t something you could bolt on later—it had to be designed in from the beginning. The key decision here was to make every component stateless. No shared session memory, no blocking calls, no centralized bottlenecks. Each call became an independent pipeline that could scale horizontally. This design choice made it possible to handle dozens of concurrent conversations without degradation.
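The gist of that design, sketched with asyncio: each call is one task that owns all of its own state, and nothing lives in shared memory. The stage functions below are hypothetical stubs standing in for the real ASR, LLM, and TTS components:

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class CallContext:
    """All state for one call lives here, owned by exactly one task."""
    call_id: str
    transcript: list[str] = field(default_factory=list)

# The three stubs below stand in for the real ASR, LLM, and TTS stages.
async def listen(ctx: CallContext):
    for utterance in ("hello", "sure, tell me more"):
        await asyncio.sleep(0.1)  # simulate streaming speech recognition
        yield utterance

async def generate_reply(ctx: CallContext) -> str:
    return f"[reply #{len(ctx.transcript)} for call {ctx.call_id}]"

async def speak(ctx: CallContext, reply: str) -> None:
    await asyncio.sleep(0.1)  # simulate TTS playback on the call leg

async def run_call_pipeline(call_id: str) -> None:
    # Each call gets its own context object; no reads or writes touch shared
    # memory, so pipelines never block or contend with each other.
    ctx = CallContext(call_id=call_id)
    async for utterance in listen(ctx):
        ctx.transcript.append(utterance)
        await speak(ctx, await generate_reply(ctx))

async def main() -> None:
    # 50+ concurrent calls are just 50+ independent tasks; scaling further
    # means running more copies of this process, since nothing is shared.
    await asyncio.gather(*(run_call_pipeline(f"call-{i}") for i in range(50)))

if __name__ == "__main__":
    asyncio.run(main())
```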

That’s essentially where version 1.0 landed.

Is it perfect? Not even close. But I don’t believe a first version needs to be. As long as the core loop works—low latency, human-sounding voice, correct call handling, and acceptable cost—you can move forward, gather feedback, and iterate. That’s exactly what we did.

If you’re interested in detailed implementation strategies, architectural diagrams, or actual code, feel free to reach out.
