AlloTech AI

Posted on Jun 19

How We Built a Voice AI Agent That Handles 200+ Calls/Day Without a Human

#agents #ai #python #showdev

At AlloTech AI, we run a voice AI agent — codename Hermes — that handles inbound and outbound calls for small businesses. No hold music. No "press 1 for sales." Just a voice that sounds human, understands context, and takes action.

Here's how we built it, what broke along the way, and what actually works at scale.

The Stack

We didn't invent anything here. We stitched together the best tools for each layer:

Telephony: Telnyx (WebRTC + SIP, programmable call routing)
Speech-to-Text: Deepgram Nova-2 (streaming, <300ms first token)
LLM: Claude Sonnet (tool use, low hallucination rate on structured tasks)
Text-to-Speech: ElevenLabs (cloned voice, ~200ms latency with streaming)
Orchestration: FastAPI + asyncio (Python, deployed on a single GPU VM)
Memory: Redis for session state, Postgres for call logs

Total per-call cost at 200 calls/day: ~$0.08–$0.14 CAD depending on call length.

The Hard Problems

1. Interruption Handling

Users interrupt. Always. A voice agent that can't handle barge-in sounds robotic and broken.

Our solution: we stream audio in 100ms chunks and run a Voice Activity Detection (VAD) model in parallel. The moment VAD detects user speech mid-response, we:

Kill the TTS stream
Flush the LLM output buffer
Re-inject the user's new utterance as context
Resume with a fresh completion

This dropped our "agent ignoring user" complaints from ~18% of calls to under 2%.

2. Tool-Call Latency

Hermes books appointments, looks up order status, and creates tickets. Each tool call adds latency. Our target: keep total response time under 1.2 seconds.

What we did:

Parallel tool execution where possible (fetch availability + customer profile simultaneously)
Streamed partial TTS while tool results were still coming in ("Let me check that for you..." buys 800ms)
Cached frequent lookups (business hours, menu items) in Redis with 5-minute TTL

Result: P95 response latency sits at 1.1s. P99 is 2.3s (outliers are Postgres cold queries).

3. Silence Detection

What does your agent do when the user goes silent? Ours used to just... wait. Callers hung up thinking the call dropped.

Fix: after 2.5s of silence post-question, Hermes says a natural filler ("Take your time" or "Still there?"). After 5s, it offers to call back. After 8s, it ends the call gracefully.

4. Persona Consistency

Claude is excellent at following a system prompt, but long calls drift. By call minute 4–5, the agent would occasionally drop the client's business name or use generic phrasing.

Solution: we inject a "persona anchor" into the context every 6 turns — a compressed reminder of the agent's identity, the business it represents, and the current call goal. Drift dropped to near zero.

Results

Metric	Before	After
Avg handle time	4m 12s	2m 48s
Call completion rate	71%	89%
Escalation to human	34%	11%
Caller satisfaction (CSAT)	3.2/5	4.4/5

What's Next

We're working on:

Multilingual support (French/English switching mid-call — critical for Montreal)
Emotion detection (route to human if caller sounds frustrated)
Post-call summaries pushed directly to client CRMs

If you're building voice AI or want to see Hermes in action, reach out: allotech.ai

AlloTech AI — Montreal-based AI automation for SMBs.

DEV Community