Autor Technologies Inc.

We Built a Voice AI Receptionist in 8 Weeks — Every Decision We Made and Why

Eight weeks. That's how long it took our team at Autor to go from "we should build a voice AI receptionist for healthcare clinics" to handling live patient calls in production. Not a demo. Not a prototype collecting dust on a staging server. A real system answering real phones at real dental and healthcare clinics across Canada.

The product is called Loquent. It now handles thousands of automated calls per month, 24/7, for healthcare and dental clients. But the interesting part isn't what it does today — it's the 47 decisions we made in those 8 weeks that determined whether it would work at all.

Here's every major technical and product decision we made, why we made it, and what we'd change if we did it again.

Week 1–2: Picking the Stack

The first decision was the voice pipeline. We needed three things: a way to receive phone calls, a way to convert speech to text, and a way to convert text back to speech. Simple enough on paper.

For telephony, we went with Twilio. Not because it's the cheapest — it's not — but because we'd shipped 50+ products and knew Twilio's edge cases. When you're building something that needs to be in production in 8 weeks, you don't gamble on infrastructure you haven't battle-tested. Twilio's media streams gave us real-time audio over WebSocket, which was critical for keeping latency low.
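
If you haven't used Media Streams, the wiring is simpler than it sounds. Here's a minimal sketch of the receiving side (handler names are placeholders, not our production code):

```typescript
// Minimal sketch: receiving a Twilio Media Stream over WebSocket.
// Twilio connects to this server (pointed at via <Stream> in TwiML) and sends
// JSON messages containing base64-encoded 8kHz mu-law audio frames.
import { WebSocketServer, WebSocket } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (socket: WebSocket) => {
  socket.on("message", (raw) => {
    const msg = JSON.parse(raw.toString());

    switch (msg.event) {
      case "start":
        // streamSid identifies this media stream, callSid the phone call.
        console.log("stream started", msg.start.streamSid, msg.start.callSid);
        break;
      case "media": {
        // Roughly 20ms of mu-law audio per frame; forward it to speech-to-text.
        const audio = Buffer.from(msg.media.payload, "base64");
        forwardToSpeechToText(audio); // placeholder for the rest of the pipeline
        break;
      }
      case "stop":
        console.log("stream ended");
        break;
    }
  });
});

function forwardToSpeechToText(chunk: Buffer): void {
  // e.g. write into a Deepgram live socket (see the next sketch)
}
```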

For speech-to-text, we chose Deepgram. We tested Google Speech-to-Text, AWS Transcribe, and Deepgram head-to-head with 200 sample audio clips from actual clinic phone calls (with permission). Deepgram won on two axes: accuracy on medical terminology and latency. Their streaming API returned partial transcripts in under 300ms consistently. Google was close on accuracy but added 150–200ms more latency. In voice AI, that 200ms is the difference between a conversation that feels natural and one that feels like talking to a bad VoIP connection.
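
Deepgram's live API is a WebSocket you stream raw audio into and get JSON transcripts out of. A rough sketch of the connection, with illustrative parameters for Twilio's 8kHz mu-law audio:

```typescript
// Sketch: opening a Deepgram live-transcription socket and forwarding
// Twilio's mu-law frames into it. Query parameters are illustrative.
import { WebSocket } from "ws";

const dgUrl =
  "wss://api.deepgram.com/v1/listen" +
  "?encoding=mulaw&sample_rate=8000&channels=1" +
  "&interim_results=true&punctuate=true";

const dg = new WebSocket(dgUrl, {
  headers: { Authorization: `Token ${process.env.DEEPGRAM_API_KEY}` },
});

dg.on("message", (raw) => {
  const result = JSON.parse(raw.toString());
  const alt = result.channel?.alternatives?.[0];
  if (!alt?.transcript) return;

  // Interim results arrive continuously; is_final marks the settled transcript.
  if (result.is_final) {
    handleFinalTranscript(alt.transcript, alt.confidence);
  } else {
    handleInterimTranscript(alt.transcript, alt.confidence);
  }
});

// This is the forwardToSpeechToText placeholder from the previous sketch.
function forwardToSpeechToText(chunk: Buffer): void {
  if (dg.readyState === WebSocket.OPEN) dg.send(chunk);
}

function handleFinalTranscript(text: string, confidence: number): void {}
function handleInterimTranscript(text: string, confidence: number): void {}
```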

For the LLM brain — the part that actually understands what the caller wants and decides what to say — we went with Anthropic Claude. We'd used GPT-4 extensively on other projects, but Claude gave us two things we needed: more predictable instruction-following for complex system prompts, and better handling of the conversational nuance healthcare calls require. When a patient says "I think I need to come in but I'm not sure," Claude was measurably better at handling that ambiguity with the right mix of helpfulness and appropriate medical caution.
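
Each conversational turn ultimately becomes one Messages API call with a tightly scoped system prompt. A simplified sketch (the model name and prompt text are placeholders, not our production values):

```typescript
// Sketch: one conversational turn as a single Claude Messages call.
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function nextTurn(
  history: { role: "user" | "assistant"; content: string }[],
): Promise<string> {
  const response = await anthropic.messages.create({
    model: "claude-3-5-sonnet-latest", // placeholder model name
    max_tokens: 300, // spoken replies should be short
    system:
      "You are a receptionist for a dental clinic. Keep replies brief and " +
      "conversational. Never give medical advice; offer to book an appointment instead.",
    messages: history,
  });

  const block = response.content[0];
  return block.type === "text" ? block.text : "";
}
```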

Text-to-speech was ElevenLabs. We tested 6 providers. ElevenLabs had the most natural-sounding voices and, critically, a streaming API that let us start playing audio before the full response was generated. This shaved another 400ms off perceived latency.

The backend is NestJS with TypeScript, running on AWS with Docker. Our API layer handles the orchestration between all these services. We use PostgreSQL with Prisma for call logs, appointment data, and conversation history. The frontend dashboard for clinic staff is Next.js on Vercel.

Total time on stack decisions: 4 days. We spent 3 of those days on benchmarking speech providers because that's where we had the least prior experience.

Week 3–4: The Latency Problem

Here's what nobody tells you about building voice AI: the technical challenge isn't making it smart. It's making it fast.

A normal human conversation has about 200ms of silence between turns. Our first end-to-end prototype had 2.4 seconds of latency from the moment the caller stopped speaking to when they heard the AI respond. That's brutal. Callers were hanging up or talking over the AI.

We broke the latency down:

  • Speech-to-text finalization: ~400ms
  • LLM inference (Claude): ~800ms
  • Text-to-speech generation: ~600ms
  • Network overhead between services: ~600ms

Every one of those had to come down. Here's what we did.

Speech-to-text: We switched from waiting for final transcripts to acting on interim transcripts with a confidence threshold above 0.85. This let us start LLM inference 300ms earlier on average, at the cost of occasionally sending a slightly wrong transcript. We added a correction mechanism that would interrupt and re-route if the final transcript meaningfully differed from the interim one. In practice, this happened on less than 3% of utterances.
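
The gating logic itself is small. A sketch of the idea, using the 0.85 threshold from above and placeholder helper names:

```typescript
// Sketch: acting on high-confidence interim transcripts, with a correction
// path if the final transcript meaningfully disagrees.
const CONFIDENCE_THRESHOLD = 0.85;

let speculativeTranscript: string | null = null;

function handleInterimTranscript(text: string, confidence: number): void {
  if (confidence >= CONFIDENCE_THRESHOLD && speculativeTranscript === null) {
    speculativeTranscript = text;
    startLlmInference(text); // kick off Claude early on the interim transcript
  }
}

function handleFinalTranscript(text: string): void {
  if (speculativeTranscript !== null && meaningfullyDiffers(speculativeTranscript, text)) {
    cancelLlmInference();    // abandon the speculative turn
    startLlmInference(text); // re-run on what the caller actually said
  }
  speculativeTranscript = null;
}

// A deliberately crude notion of "meaningfully differs", for illustration only.
function meaningfullyDiffers(a: string, b: string): boolean {
  return a.trim().toLowerCase() !== b.trim().toLowerCase();
}

function startLlmInference(text: string): void {}
function cancelLlmInference(): void {}
```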

LLM inference: We couldn't make Claude faster, but we could make it produce less. We restructured our prompts to front-load the critical response content. Instead of "think step by step and then respond," we used a format where Claude would output the spoken response first, then its reasoning. We also aggressively cached common conversation patterns — things like greeting responses, hold requests, and appointment confirmations. About 40% of conversational turns hit the cache.
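
Conceptually the cache is a lookup that short-circuits the LLM for turns we've seen a thousand times. A toy version with illustrative keys, phrases, and matcher:

```typescript
// Sketch: a response cache for high-frequency, low-variance turns
// (greetings, hold requests, appointment confirmations).
const turnCache = new Map<string, string>([
  ["greeting", "Thanks for calling the clinic! How can I help you today?"],
  ["hold_request", "Of course, take your time. I'm here when you're ready."],
  ["booking_confirmed", "Perfect, you're all booked. Anything else I can help with?"],
]);

async function respond(
  utterance: string,
  callLlm: (u: string) => Promise<string>,
): Promise<string> {
  const key = matchCommonPattern(utterance);
  const cached = key ? turnCache.get(key) : undefined;
  if (cached) return cached; // cache hit: no LLM call, no inference latency

  return callLlm(utterance); // novel turn: fall through to Claude
}

// Deliberately naive matching, shown only to illustrate the idea.
function matchCommonPattern(utterance: string): string | null {
  const u = utterance.toLowerCase();
  if (/\b(hold on|one (sec|second|moment)|just a (sec|second|moment))\b/.test(u)) {
    return "hold_request";
  }
  return null;
}
```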

Text-to-speech: Streaming. Instead of generating the full audio clip and then playing it, we streamed audio chunks as they were generated. The caller hears the first word within 200ms of generation starting.
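
A sketch of what that looks like against ElevenLabs' streaming endpoint (the voice ID, model ID, and downstream chunk handler are placeholders):

```typescript
// Sketch: stream TTS audio and hand off chunks as they arrive, instead of
// waiting for the full clip.
async function streamSpeech(
  text: string,
  onAudioChunk: (chunk: Uint8Array) => void,
): Promise<void> {
  const voiceId = "YOUR_VOICE_ID"; // placeholder
  const res = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream`,
    {
      method: "POST",
      headers: {
        "xi-api-key": process.env.ELEVENLABS_API_KEY ?? "",
        "Content-Type": "application/json",
      },
      // In practice you also request an output format the phone network can
      // actually play (Twilio wants 8kHz mu-law), not the default encoding.
      body: JSON.stringify({ text, model_id: "eleven_turbo_v2" }),
    },
  );

  if (!res.ok || !res.body) throw new Error(`TTS request failed: ${res.status}`);

  // Read the response body incrementally; each chunk goes straight back onto
  // the caller's media stream as soon as it lands.
  const reader = res.body.getReader();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    onAudioChunk(value);
  }
}
```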

Network: We co-located all services in the same AWS region (ca-central-1, because Canadian healthcare data stays in Canada — more on that later). We also moved from HTTP request/response to persistent WebSocket connections between our services.

Final latency after optimization: 800ms average. Not perfect, but well within the range where conversations feel natural. Callers stopped hanging up.

Week 5–6: Making It Actually Useful

A fast AI that says the wrong thing is worse than a slow AI that says the right thing. Week 5 was about prompt engineering and conversation design.

We spent two full days sitting in a dental clinic's front office, listening to actual receptionist calls and documenting every type of conversation. We categorized 14 distinct call types, from appointment booking to insurance verification to emergency triage. Each one needed different handling logic.

The critical insight: we didn't try to make one mega-prompt handle everything. Instead, we built a routing layer. The first few seconds of each call go through a lightweight classifier that determines the call type, then routes to a specialized prompt and conversation flow for that type. This meant each individual prompt could be simpler and more reliable.
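
Structurally, the router is a small classification step that picks a call type and loads a matching prompt. A sketch with illustrative call types and a stubbed classifier:

```typescript
// Sketch: classify the call type up front, then use a focused prompt for it.
type CallType =
  | "appointment_booking"
  | "insurance_verification"
  | "emergency_triage"
  | "general_inquiry";

const promptsByCallType: Record<CallType, string> = {
  appointment_booking: "You book and reschedule appointments. ...",
  insurance_verification: "You answer insurance and coverage questions. ...",
  emergency_triage: "You triage urgent issues and escalate to a human when in doubt. ...",
  general_inquiry: "You are a general clinic receptionist. ...",
};

async function routeCall(openingUtterance: string): Promise<{
  callType: CallType;
  systemPrompt: string;
}> {
  const callType = await classifyCallType(openingUtterance);
  return { callType, systemPrompt: promptsByCallType[callType] };
}

// Placeholder: in the simplest version this is one small, constrained LLM call
// that must answer with exactly one of the known labels.
async function classifyCallType(utterance: string): Promise<CallType> {
  return "appointment_booking";
}
```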

For appointment booking — about 60% of all calls — we integrated directly with the clinic's scheduling software through their API. The AI doesn't just take a message; it actually checks availability and books the appointment in real time. This was the feature that made clinic owners go from "interesting demo" to "shut up and take my money."

We also built the HubSpot and Salesforce integrations during this phase. Every call gets logged with a full transcript, caller sentiment, call type, and outcome. Clinic managers can see exactly what's happening with their phone lines without listening to recordings.

Week 7: The 18% Problem

By week 7, we had something that worked. But 18% of calls were being transferred to human staff because the AI couldn't handle them. We wrote a whole separate article about what those 18% had in common (that's next week's post), but the short version: edge cases around insurance questions, multi-party calls, and callers who were genuinely distressed.

We made a deliberate decision: we would not try to get that 18% down to zero. Some calls should go to humans. A patient who just got a scary diagnosis and is calling to schedule a follow-up doesn't want to talk to an AI, no matter how good it is. We built robust handoff logic that transfers calls smoothly with full context, so the human receptionist knows exactly what's already been discussed.
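
Mechanically, the handoff can be as simple as redirecting the live Twilio call to TwiML that dials the front desk, after pushing the conversation summary somewhere staff can see it. A sketch (the phone number and the context-saving helper are placeholders):

```typescript
// Sketch: hand an in-progress call to a human with context attached.
import twilio from "twilio";

const client = twilio(process.env.TWILIO_ACCOUNT_SID, process.env.TWILIO_AUTH_TOKEN);

async function handOffToHuman(callSid: string, conversationSummary: string): Promise<void> {
  // Make the context visible to staff before the desk phone rings
  // (dashboard entry, CRM note, whatever the clinic already uses).
  await saveHandoffContext(callSid, conversationSummary);

  const voice = new twilio.twiml.VoiceResponse();
  voice.say("One moment while I connect you with our front desk.");
  voice.dial("+15555550123"); // placeholder front-desk number

  // Redirect the live call to the new TwiML.
  await client.calls(callSid).update({ twiml: voice.toString() });
}

async function saveHandoffContext(callSid: string, summary: string): Promise<void> {}
```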

This is one of the decisions I'm most proud of. The temptation in AI product development is to automate everything. But knowing where to draw the line is what makes the product trustworthy.

Week 8: Going Live

Production deployment was its own adventure. We did a phased rollout: the AI handled calls only during off-hours for the first clinic, then gradually expanded to business hours over 5 days.

The things that broke in production that didn't break in testing:

Background noise. Our test calls were recorded in quiet offices. Real calls come from cars, restaurants, playgrounds. We added a noise gate and retuned our Deepgram configuration with their noise cancellation features.

Accents and languages. Toronto is one of the most multicultural cities in the world. Callers speak English with every accent imaginable, and some prefer to start in another language entirely. We added language detection in the first 3 seconds of each call and now route non-English calls to human staff (multilingual AI is on our roadmap).

Caller expectations. Some callers figured out they were talking to AI and started testing it — asking trick questions, trying to confuse it, or just saying "give me a human." We added explicit handling for these cases. If someone asks for a human, they get one immediately. No persuasion, no "but I can help you with that."
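
That handling is deliberately dumb: a phrase check that runs before any LLM call and triggers the same handoff path from week 7. A sketch, with an illustrative and intentionally over-broad phrase list:

```typescript
// Sketch: an explicit escape hatch, checked before anything else.
const HUMAN_REQUEST =
  /\b(speak|talk) to (a |an )?(human|person|someone)\b|\b(real person|give me a human)\b/i;

function wantsHuman(utterance: string): boolean {
  return HUMAN_REQUEST.test(utterance);
}

// If it matches, go straight to the handoff path. No persuasion.
```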

What We'd Do Differently

If we were starting Loquent today, three things would change.

First, we'd invest in end-to-end latency monitoring from day one. We built our monitoring piecemeal and it cost us debugging time when production latency spiked at 3am (we wrote about that too — see our Week 3 article).
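
Even something as basic as a per-turn timing record, emitted to whatever metrics backend you already run, makes those 3am spikes attributable in one query. A sketch of the kind of record we mean (field names are illustrative):

```typescript
// Sketch: per-stage latency for every conversational turn.
interface TurnTiming {
  callSid: string;
  sttMs: number;   // speech end -> usable transcript
  llmMs: number;   // transcript -> first token of the reply
  ttsMs: number;   // reply text -> first audio byte
  totalMs: number; // speech end -> caller hears audio
}

function recordTurn(timing: TurnTiming): void {
  // Emit to whatever you already use (CloudWatch, Prometheus, a Postgres table).
  console.log(JSON.stringify({ metric: "turn_latency", ...timing }));
}
```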

Second, we'd use a multi-model approach from the start. Not every conversational turn needs Claude's full reasoning capability. Simple acknowledgments ("Got it, let me check that for you") could come from a smaller, faster model. We're implementing this now, and it's cutting our average latency by another 150ms.
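
The routing itself doesn't need to be clever: pick a model per turn based on whether that turn needs reasoning. A sketch (the model names are examples, not a statement of what Loquent runs in production):

```typescript
// Sketch: route simple acknowledgments to a smaller, faster model and keep
// the larger model for turns that need actual reasoning.
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

async function generateReply(utterance: string, needsReasoning: boolean): Promise<string> {
  const response = await anthropic.messages.create({
    model: needsReasoning ? "claude-3-5-sonnet-latest" : "claude-3-5-haiku-latest",
    max_tokens: needsReasoning ? 300 : 60,
    system: "You are a clinic receptionist. Keep replies short and natural.",
    messages: [{ role: "user", content: utterance }],
  });

  const block = response.content[0];
  return block.type === "text" ? block.text : "";
}
```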

Third, we'd build the analytics dashboard before the AI itself. We spent the first two weeks of production flying blind on call quality because our dashboard wasn't ready. The data was being logged but we couldn't see patterns until we built the visualization layer.

The Numbers After 6 Months

Loquent now handles thousands of calls per month across multiple clinics. The 82% automation rate has held steady. Patient satisfaction scores for AI-handled calls are within 5% of human-handled calls. Clinics using Loquent report that their human receptionists spend 60% less time on routine calls and can focus on patients who actually need personal attention.

We built Loquent with a team of senior engineers, no offshore work, no handoffs between teams. The same people who designed the architecture wrote the code and debugged the production issues. That's how we shipped in 8 weeks.

Key Takeaways

  1. Latency is the make-or-break metric for voice AI. Your AI can be the smartest system ever built, but if it takes 2 seconds to respond, callers will hang up. Optimize for speed before you optimize for intelligence.

  2. Don't pick infrastructure you haven't used before when you're on a tight timeline. We chose Twilio over newer, cheaper alternatives because we knew its failure modes. That decision alone probably saved us a week of debugging.

  3. Build specialized conversation flows, not one giant prompt. A routing layer with focused prompts beats a single do-everything prompt on reliability, latency, and maintainability.

  4. Know where to draw the automation line. The 18% of calls we route to humans aren't a failure — they're a feature. Trustworthy AI knows its limits.

  5. Co-locate everything and measure latency end-to-end. Every network hop between services adds latency. In voice AI, those milliseconds are the product experience.


Autor is a Toronto-based AI development studio that builds production AI systems. Loquent, our voice AI platform, handles thousands of automated calls monthly for healthcare clients across Canada.

If you're building something similar, we'd love to hear about it. Reach out at hello@autor.ca or visit autor.ca.
