
Autor Technologies Inc.

How We Built a Production Voice AI Agent in Under 8 Weeks (With Twilio + Anthropic Claude)

Earlier this year, we shipped Loquent — a production conversational AI platform that handles real phone calls, books appointments, processes patient follow-ups, and verifies insurance — completely autonomously, 24/7.

We built it in under 8 weeks.

This isn't a tutorial about building a toy chatbot. This is a breakdown of what it actually takes to get voice AI into production — the architecture decisions, the hard lessons, and the specific stack we used.


The Problem We Were Solving

Healthcare and dental clinics miss a massive percentage of inbound calls. Front desks get overwhelmed during peak hours. Patients call after hours and get voicemail. Appointments slip through.

The ask: build an AI system that could handle inbound and outbound calls — booking appointments, confirming details, following up with patients, verifying insurance — without a human in the loop.

Not a demo. Not a prototype. Production. Real patients. Real clinics. Real calls.


The Stack

Before diving into architecture, here's what we ended up with:

Layer                 Technology
Voice / Telephony     Twilio Voice + Media Streams
Speech-to-Text        Deepgram Streaming
LLM                   Anthropic Claude (claude-sonnet)
Text-to-Speech        ElevenLabs
Backend               NestJS + Python
Frontend Dashboard    Next.js + React
Database              PostgreSQL + Prisma
Queue                 Redis + BullMQ
Cloud                 AWS (ECS, RDS, ElastiCache)
Integrations          HubSpot, Salesforce, Zendesk

We evaluated OpenAI's Realtime API, but at the time its latency and reliability at production call volumes weren't where we needed them. We went with the Deepgram → Claude → ElevenLabs pipeline instead, which gave us full control over each layer.


Architecture Overview

Caller dials in
     ↓
Twilio receives call → webhook fires to our backend
     ↓
Twilio Media Stream opens WebSocket to our server
     ↓
Audio chunks stream in real-time → Deepgram STT
     ↓
Transcript fed to Claude with conversation context + clinic data
     ↓
Claude response → ElevenLabs TTS → audio streamed back via Twilio
     ↓
Actions extracted (book appointment, send confirmation, update CRM)

The whole loop needs to complete in under 1.5 seconds to feel natural. That's the hard constraint everything else is built around.


The Latency Problem

This was the hardest engineering challenge. Users tolerate maybe 1–2 seconds of silence before it feels broken. We were dealing with:

  • Deepgram STT: ~200–400ms
  • Claude inference: ~400–800ms
  • ElevenLabs TTS first-chunk: ~300–500ms
  • Twilio playback: ~100ms

That's already pushing 2 seconds before any network overhead.

What we did:

1. Stream everything. We don't wait for a complete Claude response before starting TTS. The moment Claude starts outputting tokens, we pipe them to ElevenLabs sentence by sentence. The first audio chunk starts playing while Claude is still generating.
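Here's a minimal sketch of that sentence-level chunking in Python. The splitting is naive punctuation matching; a real splitter also has to handle abbreviations like "Dr." so it doesn't cut sentences early:

```python
import re

def sentence_chunks(token_stream):
    """Group streamed LLM tokens into sentences so TTS can start
    on the first complete sentence instead of the full response."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush whenever sentence-ending punctuation is followed
        # by whitespace or end-of-buffer.
        while True:
            match = re.search(r"[.!?](\s|$)", buffer)
            if not match:
                break
            sentence = buffer[:match.end()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence  # -> hand off to TTS here
    if buffer.strip():
        yield buffer.strip()

# Tokens arrive incrementally from the model:
tokens = ["Your ", "appointment ", "is ", "booked. ", "See ", "you ", "Tuesday!"]
print(list(sentence_chunks(tokens)))
# -> ['Your appointment is booked.', 'See you Tuesday!']
```

The first sentence can be in the TTS pipeline while the rest of the response is still generating.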

2. End-of-utterance detection. We use Deepgram's endpointing, but also built our own silence detection layer. Aggressive endpointing cuts off users mid-sentence. Too conservative and the response feels laggy. We tuned this per use case — a patient confirming an appointment has different speech patterns than one describing symptoms.
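A toy version of the custom silence-detection layer, assuming fixed-size audio frames and a simple per-frame energy measure (the frame size, energy floor, and silence durations here are illustrative, not our tuned values):

```python
def utterance_ended(frame_energies, silence_ms_required,
                    frame_ms=20, energy_floor=300):
    """Return True once the trailing frames have stayed below the
    energy floor for long enough. silence_ms_required is the knob
    tuned per use case: short for a quick yes/no confirmation,
    longer for a patient describing symptoms with pauses."""
    needed = silence_ms_required // frame_ms
    if len(frame_energies) < needed:
        return False
    return all(e < energy_floor for e in frame_energies[-needed:])

# 200ms of speech followed by 500ms of silence -> utterance over
energies = [500] * 10 + [100] * 25
print(utterance_ended(energies, silence_ms_required=400))  # -> True
```

In production this runs alongside Deepgram's endpointing; either signal can close the utterance.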

3. Claude prompt engineering for speed. Verbose responses kill latency. We prompt Claude to be concise, speak like a receptionist, and never use filler phrases that add tokens without value. We also give it explicit response format guidance — short sentences, direct answers.

4. Pre-warm everything. ElevenLabs has cold start latency. We keep connections warm with keepalive pings. Same with our database pool.

With all of this in place, we got average response latency down to ~900ms, with occasional spikes to 1.4s. At that level, the conversation feels natural.


Designing the Claude Prompt

This took more iteration than the infrastructure. The system prompt has to do a lot:

You are [Clinic Name]'s AI receptionist. Your name is [Agent Name].

You are speaking with a patient over the phone. Be warm, professional, 
and concise. Speak in short sentences. Never say "Certainly!" or 
"Absolutely!" or similar filler phrases.

CLINIC CONTEXT:
- Name: [Clinic Name]
- Hours: [Hours]
- Providers: [Provider list with availability]
- Services: [Service list]

CURRENT PATIENT CONTEXT:
[Injected dynamically: patient name, upcoming appointments, 
last visit, insurance status]

AVAILABLE ACTIONS:
[JSON schema of actions Claude can trigger: book_appointment, 
cancel_appointment, send_confirmation, transfer_to_human, etc.]

RULES:
- If you cannot handle the request, transfer to a human. Never guess.
- Confirm all bookings by repeating back date, time, and provider.
- Never discuss billing details — transfer to billing team.
- If the patient seems distressed, offer to transfer immediately.

The key insight: Claude needs to know what it can and cannot do. An AI that tries to handle everything and fails is worse than one that gracefully transfers when out of scope.

We use Claude's tool use (function calling) for actions — booking, cancelling, sending confirmations. This gives us clean structured outputs instead of trying to parse intent from natural language.
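A sketch of what that looks like in the shape the Anthropic Messages API expects for tools, plus the dispatch back to our backend. The schema fields and handler are illustrative, not our production definitions:

```python
# Tool definition in the Anthropic Messages API tool format.
# The input_schema here is a simplified, hypothetical version.
BOOKING_TOOLS = [
    {
        "name": "book_appointment",
        "description": "Book a patient into a confirmed open slot.",
        "input_schema": {
            "type": "object",
            "properties": {
                "provider": {"type": "string"},
                "start_time": {"type": "string", "description": "ISO 8601"},
            },
            "required": ["provider", "start_time"],
        },
    },
]

def dispatch(tool_use, handlers):
    """Route a tool_use content block from Claude to a backend handler
    and build the tool_result block we send back on the next turn."""
    handler = handlers[tool_use["name"]]
    return {
        "type": "tool_result",
        "tool_use_id": tool_use["id"],
        "content": handler(**tool_use["input"]),
    }

handlers = {"book_appointment": lambda provider, start_time:
            f"Booked {provider} at {start_time}"}
result = dispatch(
    {"id": "toolu_1", "name": "book_appointment",
     "input": {"provider": "Dr. Lee", "start_time": "2024-06-04T14:00"}},
    handlers,
)
print(result["content"])  # -> Booked Dr. Lee at 2024-06-04T14:00
```

The structured `input` dict is what makes this reliable: no regexing intent out of prose.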


Multi-Tenant Architecture

Loquent serves multiple clinics, each with their own:

  • Phone number(s)
  • Providers and availability
  • Booking rules and constraints
  • Brand voice and agent name
  • CRM integration

The system prompt is dynamically assembled per-call using the clinic's configuration. We built a dashboard where clinic admins can update their agent's name, working hours, provider list, and escalation rules without touching code.
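A simplified version of that per-call prompt assembly, with a hypothetical clinic config shape (our real config carries far more fields):

```python
PROMPT_TEMPLATE = """You are {clinic_name}'s AI receptionist. Your name is {agent_name}.

CLINIC CONTEXT:
- Name: {clinic_name}
- Hours: {hours}
- Providers: {providers}
"""

def build_system_prompt(clinic):
    """Assemble the per-call system prompt from a clinic's stored
    configuration, so the dashboard edits take effect on the next call."""
    return PROMPT_TEMPLATE.format(
        clinic_name=clinic["name"],
        agent_name=clinic["agent_name"],
        hours=clinic["hours"],
        providers=", ".join(clinic["providers"]),
    )

config = {
    "name": "Maple Dental",
    "agent_name": "Ava",
    "hours": "Mon-Fri 9am-5pm",
    "providers": ["Dr. Lee", "Dr. Park"],
}
print(build_system_prompt(config))
```

Patient context and the action schema get appended the same way at call time.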

Each clinic gets isolated data — separate database schemas, separate API credentials, separate call logs.


The Integrations Layer

The hard part isn't the AI — it's making the AI useful by connecting it to real data.

Appointment booking: We built adapters for common dental/healthcare practice management systems. The adapter pattern let us add new integrations without touching the core engine.
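The adapter pattern in miniature: the core engine only ever sees the interface, and each practice management system gets its own concrete class. The in-memory adapter below is a stand-in where a real one would call the vendor's API:

```python
from abc import ABC, abstractmethod

class SchedulingAdapter(ABC):
    """Interface the core engine talks to; one concrete adapter
    per practice management system."""
    @abstractmethod
    def get_availability(self, provider: str) -> list: ...
    @abstractmethod
    def book(self, provider: str, slot: str) -> bool: ...

class InMemoryAdapter(SchedulingAdapter):
    """Stand-in for tests; a production adapter would wrap the
    PMS vendor's HTTP API behind the same two methods."""
    def __init__(self, slots):
        self.slots = slots

    def get_availability(self, provider):
        return self.slots.get(provider, [])

    def book(self, provider, slot):
        if slot in self.slots.get(provider, []):
            self.slots[provider].remove(slot)
            return True
        return False
```

Adding a new integration means writing one new subclass; the conversation engine doesn't change.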

CRM sync: After every call, we write a structured summary back to HubSpot or Salesforce — caller ID, intent, outcome, booking details, and a Claude-generated call summary. This is actually one of the features clinics love most.

Confirmation messages: Post-call, we trigger SMS/email confirmations via Twilio Messaging and SendGrid. Patients get a confirmation within 30 seconds of booking.
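The confirmation itself is templated text; the send is a standard Twilio Messaging call, sketched in the comment below with hypothetical variable names:

```python
def confirmation_sms(patient_name, provider, when):
    """Body for the post-booking SMS. The reply keyword is
    illustrative, not our production opt-out flow."""
    return (f"Hi {patient_name}, you're confirmed with {provider} "
            f"on {when}. Reply C to cancel.")

print(confirmation_sms("Sam", "Dr. Lee", "Tuesday at 2pm"))

# In production (twilio-python call shape; sid/token/numbers are
# placeholders from the clinic's config):
# from twilio.rest import Client
# Client(account_sid, auth_token).messages.create(
#     to=patient_phone, from_=clinic_number,
#     body=confirmation_sms(patient_name, provider, when))
```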


What Broke in Production (And How We Fixed It)

Problem 1: Callers interrupting the AI mid-sentence

Users naturally interrupt. The AI was finishing its sentence before responding to the interruption, which felt robotic.

Fix: Barge-in detection. When Deepgram detects speech while TTS is playing, we immediately stop audio playback, flush the TTS buffer, and re-run inference with the new input. Feels much more natural.
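At a pseudocode level, the barge-in state machine is small. This is a simplified sketch; the real version coordinates with the Twilio stream and cancels in-flight TTS requests:

```python
class BargeInController:
    """Stop TTS playback the moment the caller starts speaking."""
    def __init__(self):
        self.tts_playing = False
        self.tts_buffer = []

    def on_tts_chunk(self, chunk):
        # Audio queued for playback to the caller.
        self.tts_playing = True
        self.tts_buffer.append(chunk)

    def on_caller_speech(self):
        """Called when the STT layer reports speech. If we were
        talking, flush everything and re-run inference on the new
        input; otherwise just keep listening."""
        if self.tts_playing:
            self.tts_playing = False
            self.tts_buffer.clear()
            return "interrupt"
        return "listen"
```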

Problem 2: Claude hallucinating availability

Early builds had Claude generating appointment times that didn't exist. Patients were being told "Tuesday at 2pm" when the provider wasn't available.

Fix: Availability is never in the prompt. Instead it's a tool call. Claude calls get_availability(provider, date_range) and we return actual real-time slots. Claude can only offer what the function returns.
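The fix boils down to a tool handler that can only return real slots. The schedule shape and ISO-string comparison here are illustrative:

```python
def get_availability(provider, date_range, schedule):
    """Tool handler behind the get_availability tool call. Claude
    never sees availability in the prompt, so the only times it can
    offer are the ones this function returns from live data."""
    start, end = date_range
    return [slot for slot in schedule.get(provider, [])
            if start <= slot <= end]

schedule = {"Dr. Lee": ["2024-06-04T14:00", "2024-06-10T09:00"]}
print(get_availability("Dr. Lee",
                       ("2024-06-01T00:00", "2024-06-07T23:59"),
                       schedule))
# -> ['2024-06-04T14:00']
```

If the function returns an empty list, the agent says so and offers alternatives, instead of inventing a Tuesday at 2pm.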

Problem 3: Long calls running up costs

Some patients would keep the AI on the phone indefinitely — confused, or just chatty. Unbounded calls = unbounded cost.

Fix: Configurable max duration per clinic. At 10 minutes, the AI politely offers to transfer to a human or calls back. Average call length is now 2.5 minutes.
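The cap itself is just a per-clinic timer checked on every turn (the injectable clock is for testability):

```python
import time

class CallTimer:
    """Per-clinic cap on call length; at the limit the agent offers
    a transfer or a callback instead of running up STT/LLM/TTS cost."""
    def __init__(self, max_seconds, clock=time.monotonic):
        self.clock = clock
        self.started = clock()
        self.max_seconds = max_seconds

    def over_limit(self):
        return self.clock() - self.started >= self.max_seconds
```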

Problem 4: Noisy environments destroying STT accuracy

Callers in cars, waiting rooms, restaurants. Background noise crushed Deepgram accuracy.

Fix: Deepgram's noise suppression model + a fallback that asks the caller to repeat if confidence drops below threshold. "I'm sorry, I didn't quite catch that — could you repeat that for me?"
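The fallback is a one-line gate in the transcript handler. The 0.6 threshold below is illustrative, not our tuned value:

```python
RETRY_LINE = ("I'm sorry, I didn't quite catch that. "
              "Could you repeat that for me?")

def handle_transcript(text, confidence, threshold=0.6):
    """If STT confidence is too low, ask the caller to repeat
    rather than feeding a garbled transcript to the LLM.
    Returns the retry line, or None when confident enough."""
    if confidence < threshold:
        return RETRY_LINE
    return None  # pass text through to Claude
```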


Numbers After Launch

  • Calls handled: Thousands of automated calls per month
  • Average handle time: 2.5 minutes
  • Transfer rate: ~18% (calls that go to a human)
  • Booking completion rate: ~74% of calls that started with booking intent
  • Uptime: 99.7%
  • Average response latency: ~900ms

The clinics running Loquent have effectively eliminated missed after-hours calls. One client told us their front desk used to spend the first hour of every morning re-booking patients who couldn't get through the day before. Loquent eliminated that entirely.


What We'd Do Differently

Start with tool use from day one. We initially tried to have Claude make decisions through natural language reasoning. Switching to structured tool calls for all actions made the system dramatically more reliable.

Invest in evals earlier. We didn't set up proper evaluation pipelines until week 5. Building a test call suite in week 1 would have caught several issues earlier.

Separate the conversation engine from the telephony layer sooner. The abstraction between "what the AI is doing" and "how the call works" should be clean from the start. We refactored this at week 6 and it made everything better.


What's Next

We're now extending Loquent to handle outbound campaigns — appointment reminders, recall messaging, post-visit follow-ups. The same architecture works; you just flip the direction of the call.

We're also exploring multi-agent setups where a triage agent hands off to specialist agents (billing, clinical questions, booking) with full context preservation.


If you're building something similar or want to talk through the architecture, we're at getloquent.com and autor.ca.

Happy to answer questions in the comments.


Autor is a Toronto-based AI development studio. We build custom AI agents, voice assistants, and full-stack AI products for businesses. autor.ca
