6 Months of Running a Production Voice AI — What Changed, What Broke, What We'd Rebuild

#ai #webdev #machinelearning #typescript

Six months ago we pushed Loquent — our voice AI receptionist for healthcare and dental clinics — into production. It now handles thousands of automated calls per month across multiple clinics, 24/7, in both English and French. Here's everything that actually happened once real patients started talking to it.

This isn't a launch post. We wrote that already. This is the unglamorous sequel: the parts where our assumptions were wrong, where vendors changed things under us, and where we looked at our own architecture and thought "why did we do it that way?"

The System at Month Zero

Quick context on the stack we shipped. Loquent runs on Twilio for telephony, Deepgram for speech-to-text, Anthropic Claude for conversation logic, and ElevenLabs for text-to-speech. The backend is NestJS with PostgreSQL and Prisma, deployed on AWS with Docker. We built the whole thing in 8 weeks.

At launch we had a single healthcare client running about 400 calls per week. The system handled appointment booking, cancellations, insurance verification routing, and basic triage — determining whether a patient needed to speak with a human or could be handled automatically.

Our target was an 85% full-automation rate. We hit 82% in week one, which felt close enough to ship.

What Changed

The prompt architecture got rewritten twice. Our initial approach was a single massive system prompt — roughly 4,000 tokens — that covered every scenario. It worked for one clinic with one specialty. By month two we had three clinics with different booking rules, insurance requirements, and operating hours. The monolithic prompt became unmaintainable.

We moved to a modular prompt system where each clinic gets a base conversation scaffold, and clinic-specific rules (hours, procedures, insurance logic) are injected as structured data rather than prose. The prompts dropped to about 1,200 tokens of core logic plus 300-800 tokens of clinic config. Latency improved by roughly 200ms on first response because Claude was processing less context.
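To make the scaffold-plus-config split concrete, here is a minimal sketch. The `ClinicConfig` fields, the scaffold wording, and the `buildSystemPrompt` helper are all hypothetical names chosen for illustration, not the production schema.

```typescript
// Hypothetical per-clinic config. Field names are illustrative only.
interface ClinicConfig {
  clinicId: string;
  timezone: string;
  hours: Record<string, string>; // e.g. { mon: "08:00-17:00" }
  acceptedInsurers: string[];
  bookableProcedures: string[];
}

// Shared scaffold: conversation rules that do not vary by clinic.
const BASE_SCAFFOLD: string = [
  "You are a phone receptionist for a clinic.",
  "Follow CLINIC_CONFIG exactly; never invent hours, procedures, or insurers.",
  "If a request falls outside CLINIC_CONFIG, offer to transfer the caller to staff.",
].join("\n");

// Inject clinic rules as structured data (JSON) instead of prose.
function buildSystemPrompt(config: ClinicConfig): string {
  return `${BASE_SCAFFOLD}\n\nCLINIC_CONFIG:\n${JSON.stringify(config, null, 2)}`;
}

const exampleClinic: ClinicConfig = {
  clinicId: "clinic-001",
  timezone: "America/Toronto",
  hours: { mon: "08:00-17:00", tue: "08:00-17:00" },
  acceptedInsurers: ["Example Insurer A"],
  bookableProcedures: ["cleaning", "checkup"],
};

const systemPrompt = buildSystemPrompt(exampleClinic);
```

Because the clinic rules live in a typed object, adding a clinic in this sketch means adding a config record rather than editing the shared scaffold.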

The second rewrite happened at month four when we added French language support for Quebec clinics. Instead of duplicating prompts, we built a language-agnostic intent layer and pushed all patient-facing text into a template system. This made adding new languages a config change instead of a prompt engineering project.
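One way to picture the intent layer plus template system is a lookup keyed by intent and locale. This is a sketch under assumptions: the intent names, the template strings, and the `render` helper are invented for illustration.

```typescript
type Locale = "en" | "fr";
type Intent = "confirm_booking" | "offer_transfer";

// Patient-facing text lives in data, keyed by language-agnostic intent.
const TEMPLATES: Record<Intent, Record<Locale, string>> = {
  confirm_booking: {
    en: "You're booked for {date} at {time}.",
    fr: "Votre rendez-vous est confirmé pour le {date} à {time}.",
  },
  offer_transfer: {
    en: "I can connect you with a team member.",
    fr: "Je peux vous mettre en contact avec un membre de l'équipe.",
  },
};

// Fill {slot} placeholders; unknown slots are left visible so gaps are easy to spot.
function render(intent: Intent, locale: Locale, slots: Record<string, string> = {}): string {
  return TEMPLATES[intent][locale].replace(
    /\{(\w+)\}/g,
    (match, key: string) => slots[key] ?? match,
  );
}

const frenchConfirmation = render("confirm_booking", "fr", { date: "12 mars", time: "14 h" });
// "Votre rendez-vous est confirmé pour le 12 mars à 14 h."
```

In this shape, supporting another language would mean extending `Locale` and adding strings, while the conversation logic keeps working with intents only.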

Deepgram's model updates changed our accuracy numbers overnight. Twice in six months, Deepgram pushed model updates that shifted our transcription accuracy. The first time it improved — dental terminology recognition jumped from about 71% to 83%. The second time, three weeks later, a different update introduced regressions on Quebec French accents. Our automation rate in Montreal dropped 9 points in a single day.

We now pin specific Deepgram model versions in production and test new versions against a saved corpus of 500 real call recordings before promoting. This added a week to our vendor update cycle but eliminated surprise regressions.
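A promotion gate like this can be sketched as a word error rate comparison over the saved corpus. Everything here is an assumption for illustration: the `CorpusSample` shape, the use of word error rate as the metric, and the 0.01 regression tolerance. The transcripts are assumed to have been produced offline by running each model version over the recordings.

```typescript
interface CorpusSample {
  reference: string; // human-verified transcript
  pinned: string;    // output of the currently pinned model version
  candidate: string; // output of the version under evaluation
}

// Word error rate: word-level edit distance divided by reference length.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;
  // prev[j] holds the distance between the ref words seen so far and hyp[0..j).
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// Promote only if the candidate's mean WER is not meaningfully worse.
function shouldPromote(samples: CorpusSample[], maxRegression = 0.01): boolean {
  if (samples.length === 0) return false; // no evidence, no promotion
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const pinnedWer = mean(samples.map((s) => wordErrorRate(s.reference, s.pinned)));
  const candidateWer = mean(samples.map((s) => wordErrorRate(s.reference, s.candidate)));
  return candidateWer <= pinnedWer + maxRegression;
}
```

Running the same gate per language or per clinic, rather than over the whole corpus at once, would be one way to catch a regression that only affects a single accent group.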

Call volumes tripled, but not where we expected. We planned for steady growth across all clinics. Instead, one dental group ran a local ad campaign without telling us, and it tripled their call volume in a week. Our Twilio concurrent call limit was 15. They hit 23 simultaneous calls on a Tuesday morning.

The overflow calls got busy signals. We didn't even know it was happening until the clinic called us directly. Now we have alerting on concurrent call counts, queue depth, and Twilio capacity headroom. We also built an auto-scaling config that bumps concurrent limits when utilization crosses 70%.
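The utilization check behind that kind of alerting can be sketched as a small pure function. The `checkCapacity` name and the status labels are hypothetical; only the 70% threshold and the 15-call limit come from the story above. How the limit actually gets raised with the telephony provider is out of scope here.

```typescript
type CapacityStatus = "ok" | "scale_up" | "saturated";

interface CapacityReading {
  utilization: number; // active calls as a fraction of the concurrent limit
  status: CapacityStatus;
}

function checkCapacity(
  activeCalls: number,
  concurrentLimit: number,
  scaleUpAt = 0.7, // threshold from the text: act once utilization crosses 70%
): CapacityReading {
  if (concurrentLimit <= 0) throw new RangeError("concurrentLimit must be positive");
  const utilization = activeCalls / concurrentLimit;
  if (utilization >= 1) return { utilization, status: "saturated" }; // callers hear busy signals
  if (utilization >= scaleUpAt) return { utilization, status: "scale_up" };
  return { utilization, status: "ok" };
}

// The Tuesday-morning incident: 23 attempted calls against a limit of 15.
const incident = checkCapacity(23, 15); // status: "saturated"
```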

What Broke

The "I'll hold" problem. We didn't anticipate how many patients would say "I'll hold" or "I'll wait" when told a human wasn't available. Our conversation logic treated silence as a disconnect signal after 8 seconds. Patients waiting for a human would go quiet, get disconnected, call back, get the AI again, and get increasingly frustrated.

We found this pattern in 6% of all calls — roughly 40 calls per week across our clinics. The fix was a dedicated hold state with periodic check-ins ("I'm still here, a team member will be with you shortly") and extended silence tolerance of 45 seconds. Transfer-to-human success rate went from 74% to 91%.
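The difference between the two silence policies can be sketched as a state-dependent timeout. The 8-second and 45-second limits come from the text; the 15-second check-in interval and all names are assumptions made for illustration.

```typescript
type CallState = "conversation" | "on_hold";
type SilenceAction = "none" | "check_in" | "disconnect";

// Silence tolerance per state: 8 s in normal conversation, 45 s while holding.
const SILENCE_LIMIT_MS: Record<CallState, number> = {
  conversation: 8_000,
  on_hold: 45_000,
};

// Assumed interval between "I'm still here" prompts; the post does not give one.
const CHECK_IN_EVERY_MS = 15_000;

function onSilence(
  state: CallState,
  silentForMs: number,
  msSinceLastCheckIn: number,
): SilenceAction {
  if (silentForMs >= SILENCE_LIMIT_MS[state]) return "disconnect";
  if (state === "on_hold" && msSinceLastCheckIn >= CHECK_IN_EVERY_MS) return "check_in";
  return "none";
}
```

The key point of the sketch is that a quiet caller in `on_hold` is treated as waiting, not as gone, so ten seconds of silence there produces no action while the same silence in `conversation` ends the call.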

ElevenLabs latency spikes during peak hours. Between 9am and 11am Eastern — prime appointment-booking time — ElevenLabs response times would occasionally spike from our baseline 180ms to 600-900ms. Patients experienced this as the AI "pausing" mid-conversation, which eroded trust.

We built a TTS response cache for common phrases (greetings, confirmations, hold messages) that eliminated synthesis latency for about 35% of all spoken responses. For the remaining dynamic responses, we added a streaming playback pipeline that starts speaking before the full audio is generated. Combined, these brought worst-case perceived latency down to about 300ms.
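A phrase cache of this kind can be sketched as a map from text to synthesized audio. The `Synthesize` function type is a stand-in for whatever call produces audio from the TTS vendor; it is not a real SDK signature, and the class name is hypothetical.

```typescript
// Stand-in for the vendor call that turns text into audio bytes.
type Synthesize = (text: string) => Promise<Uint8Array>;

class TtsCache {
  private readonly synthesize: Synthesize;
  // Store promises so concurrent requests for one phrase share a single synthesis.
  private readonly cache = new Map<string, Promise<Uint8Array>>();

  constructor(synthesize: Synthesize) {
    this.synthesize = synthesize;
  }

  // Fixed phrases (greetings, confirmations, hold messages) are cacheable;
  // dynamic responses bypass the cache and go straight to synthesis.
  getAudio(text: string, cacheable: boolean): Promise<Uint8Array> {
    if (!cacheable) return this.synthesize(text);
    const key = text.trim();
    const hit = this.cache.get(key);
    if (hit) return hit;
    const audio = this.synthesize(text);
    this.cache.set(key, audio);
    // Do not keep failures around: evict so the next request retries.
    audio.catch(() => this.cache.delete(key));
    return audio;
  }

  // Pre-generate common phrases at startup so the first caller also gets a hit.
  warm(phrases: string[]): Promise<Uint8Array[]> {
    return Promise.all(phrases.map((p) => this.getAudio(p, true)));
  }
}
```

Streaming playback for the dynamic responses is a separate mechanism and is not shown here.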

The insurance verification rabbit hole. Our original insurance check was simple: ask the patient for their insurance provider and policy number, confirm it's in the clinic's accepted list. Then clinics started asking us to do real-time eligibility checks. We built an integration with a clearinghouse API, and it worked — until it didn't.

The clearinghouse had a 4-second average response time. Four seconds of silence on a phone call feels like an eternity. We tried filling the gap with "Let me check that for you" and hold music snippets, but the UX was terrible. We ended up moving insurance verification to an async flow: the AI collects the information, confirms it'll be verified before the appointment, and the actual check happens after the call. Patient satisfaction scores went up. Clinic staff workload went down.
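The async flow can be sketched as a queue that records details during the call and runs the slow check afterward. The job shape, the status values, and the `EligibilityCheck` function type are assumptions; the clearinghouse API itself is not described in this post, so it appears only as an injected function.

```typescript
type VerificationStatus = "pending" | "eligible" | "ineligible" | "error";

interface VerificationJob {
  appointmentId: string;
  provider: string;
  policyNumber: string;
  status: VerificationStatus;
}

// Stand-in for the slow clearinghouse request.
type EligibilityCheck = (job: VerificationJob) => Promise<boolean>;

class VerificationQueue {
  private readonly jobs: VerificationJob[] = [];

  // During the call: record the details and return immediately, no network wait.
  enqueue(appointmentId: string, provider: string, policyNumber: string): VerificationJob {
    const job: VerificationJob = { appointmentId, provider, policyNumber, status: "pending" };
    this.jobs.push(job);
    return job;
  }

  // After the call: run the checks where a 4-second response costs nothing.
  async processPending(check: EligibilityCheck): Promise<VerificationJob[]> {
    const pending = this.jobs.filter((j) => j.status === "pending");
    for (const job of pending) {
      try {
        job.status = (await check(job)) ? "eligible" : "ineligible";
      } catch {
        job.status = "error"; // leave for staff follow-up instead of failing silently
      }
    }
    return pending;
  }
}
```

A production version would presumably persist jobs in the database rather than in memory, so that pending checks survive a restart.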

What We'd Rebuild

The conversation state machine. We built state management as a simple linear flow: greeting → intent detection → information collection → action → confirmation → goodbye. Real conversations aren't linear. Patients interrupt, backtrack, ask unrelated questions mid-booking, and change their minds.

We patched this with increasingly complex branching logic, and it works, but it's brittle. If we rebuilt from scratch, we'd use a graph-based conversation model where each node is an intent with defined entry/exit conditions and any node can transition to any other node based on what the patient says. We're about 60% through this rebuild now.
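The graph idea can be sketched with nodes that carry their own entry and exit conditions. The node set, the slot names, and the `step` function are hypothetical and much simpler than a real rebuild would be.

```typescript
interface TurnContext {
  intent: string;                // intent detected for the latest utterance
  slots: Record<string, string>; // entities collected so far
}

interface IntentNode {
  id: string;
  canEnter: (ctx: TurnContext) => boolean; // entry condition
  isDone: (ctx: TurnContext) => boolean;   // exit condition
}

const NODES: IntentNode[] = [
  { id: "book", canEnter: (c) => c.intent === "book", isDone: (c) => "date" in c.slots && "time" in c.slots },
  { id: "cancel", canEnter: (c) => c.intent === "cancel", isDone: (c) => "appointmentId" in c.slots },
  { id: "hours", canEnter: (c) => c.intent === "ask_hours", isDone: () => true },
  { id: "transfer", canEnter: (c) => c.intent === "human", isDone: () => false },
];

// Any node can be reached from any other: the latest utterance decides.
// If no entry condition matches, the conversation stays where it is.
function step(currentId: string, ctx: TurnContext): { nodeId: string; done: boolean } {
  const target = NODES.find((n) => n.canEnter(ctx)) ?? NODES.find((n) => n.id === currentId);
  if (!target) throw new Error(`unknown node: ${currentId}`);
  return { nodeId: target.id, done: target.isDone(ctx) };
}

// A patient asks about opening hours in the middle of booking.
const interrupted = step("book", { intent: "ask_hours", slots: { date: "2025-03-12" } });
// { nodeId: "hours", done: true }
```

Because the collected slots travel with the context, returning to `book` after the interruption does not lose the date the patient already gave.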

The monitoring stack. We started with basic CloudWatch logging and a Slack alert channel. That was fine for 400 calls a week. At our current volume, we need real-time dashboards showing automation rate by clinic, average call duration, transfer reasons, transcription confidence scores, and TTS latency — all broken down by time of day and language.

We bolted on a custom analytics pipeline at month three, but it's a collection of Lambda functions and a Grafana dashboard that takes more effort to maintain than it took to build. We'd invest in a proper observability layer from day one if we did it again. Probably Datadog with custom metrics, though the cost at our call volume would need careful management.

The testing infrastructure. We ship prompt changes the way most teams ship code — PR, review, merge, deploy. But we didn't have automated regression testing for conversation quality until month four. Before that, someone on the team would manually call the system and run through scenarios.

We now have a test harness that replays 200 real call transcripts against any prompt change and flags regressions in intent detection, entity extraction, and task completion rate. Building this earlier would have prevented at least three production incidents that each affected several hundred calls.
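The core of a replay harness like this can be sketched as a comparison between a baseline and a candidate over labeled transcripts. The `IntentClassifier` type is a synchronous stand-in for calling the model under a given prompt, and the case shape and report fields are assumptions. Only intent detection is covered; entity extraction and task completion would need their own checks.

```typescript
interface TranscriptCase {
  id: string;
  utterance: string;
  expectedIntent: string; // label taken from a reviewed real call
}

// Stand-in for "run this utterance through the model with prompt X".
type IntentClassifier = (utterance: string) => string;

interface RegressionReport {
  baselineAccuracy: number;
  candidateAccuracy: number;
  regressedIds: string[]; // cases the baseline got right and the candidate got wrong
}

function compareClassifiers(
  cases: TranscriptCase[],
  baseline: IntentClassifier,
  candidate: IntentClassifier,
): RegressionReport {
  let baselineHits = 0;
  let candidateHits = 0;
  const regressedIds: string[] = [];
  for (const c of cases) {
    const wasRight = baseline(c.utterance) === c.expectedIntent;
    const isRight = candidate(c.utterance) === c.expectedIntent;
    if (wasRight) baselineHits++;
    if (isRight) candidateHits++;
    if (wasRight && !isRight) regressedIds.push(c.id);
  }
  const total = cases.length || 1; // avoid dividing by zero on an empty corpus
  return {
    baselineAccuracy: baselineHits / total,
    candidateAccuracy: candidateHits / total,
    regressedIds,
  };
}
```

Listing the regressed case IDs, not just the accuracy delta, matters because a candidate can fix some cases and break others while the overall number barely moves.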

The Numbers at Month Six

Here's where we stand today compared to launch:

Automation rate: 82% → 89% (target was 85%)
Average call duration: 3m 42s → 2m 51s
Patient satisfaction (post-call survey): 3.8/5 → 4.3/5
Transfer-to-human rate: 18% → 11%
First-response latency (p95): 1.4s → 0.8s
Monthly call volume: ~1,600 → ~5,200

The single biggest driver of improvement wasn't a change to the speech or language models. It was the modular prompt system, which let us tune each clinic's AI behavior without risking breakage at other clinics. Configuration over code.

Five Things We'd Tell Someone Building This Today

Pin your vendor model versions. Every speech-to-text and LLM provider ships updates that can change your product's behavior without warning. Control when you adopt changes.
Build your test corpus from real calls immediately. From day one, save anonymized call recordings. You'll need them for regression testing within weeks, not months.
Design for the unhappy path first. The 11% of calls that need a human are more important than the 89% that don't. A bad transfer experience destroys all the goodwill the AI built.
Async everything that takes more than 2 seconds. Silence on a phone call is death. If a backend operation takes time, collect the info and process it after the call.
Invest in per-client configuration early. Your second client will have different rules than your first. Build the config system before you need it, because you'll need it sooner than you think.

If you're building something similar, we'd love to hear about it. Reach out at hello@autor.ca or visit autor.ca.