Last October, we hit a milestone at Autor that I didn't see coming: Loquent, our production voice AI platform, processed its 10,000th automated healthcare call. Instead of celebrating, we did what any team of engineers would do — we pulled the data, locked ourselves in a room for a week, and tore apart every single pattern we could find.
What we discovered changed how we build voice AI. Some of it confirmed our assumptions. Most of it didn't.
The Setup
For context, Loquent handles automated calls for healthcare and dental clinics across Canada. We're talking appointment scheduling, confirmations, cancellations, insurance verification questions, and general intake routing. The system runs 24/7 on a stack built with Twilio for telephony, Anthropic Claude for conversation intelligence, Deepgram for speech-to-text, and ElevenLabs for text-to-speech. We built the first version in under 8 weeks and have been iterating on it for the past six months.
The 10,000 calls in this dataset span 14 clinic clients — a mix of dental offices, family practices, and specialist clinics in Ontario and British Columbia. Call durations ranged from 12 seconds (hang-ups) to 14 minutes (complex scheduling with insurance questions). The median call was 2 minutes and 38 seconds.
Here's what the data told us.
Finding 1: 73% of Calls Follow Just 4 Patterns
We categorized every call by intent. Out of the dozens of potential reasons someone calls a clinic, four patterns dominated:
- Appointment booking: 31%
- Appointment confirmation/change: 24%
- Cancellation: 11%
- "Am I covered for this?" (insurance/billing questions): 7%
That's 73% of all inbound call volume handled by four well-defined flows. The remaining 27% was a grab bag — prescription refill requests, referral follow-ups, directions to the clinic, and a surprising number of people just wanting to talk to "a real person" about nothing specific.
This matters because it means you don't need a general-purpose conversational AI to handle the majority of healthcare front-desk calls. You need four really good, tightly scoped flows with clean handoff logic for everything else. We spent months trying to make Loquent handle every possible conversation gracefully. The data told us to stop doing that and instead make those four flows bulletproof.
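The four-flows-plus-handoff idea can be sketched as a thin routing layer. This is an illustrative toy, not Loquent's actual code: `classify_intent` here is a keyword stand-in for what would really be an LLM classifier, and the flow names are invented.

```python
# Four tightly scoped flows plus a catch-all handoff.
# Flow keys and classify_intent are illustrative stand-ins.

CORE_FLOWS = {
    "book": "appointment booking",
    "confirm_change": "appointment confirmation/change",
    "cancel": "cancellation",
    "insurance": "insurance/billing question",
}

def classify_intent(transcript: str) -> str:
    """Stand-in for an LLM intent classifier; returns a flow key or 'other'."""
    keywords = {
        "book": ["book an appointment", "schedule an appointment", "new appointment"],
        "confirm_change": ["confirm", "reschedule", "move my appointment"],
        "cancel": ["cancel"],
        "insurance": ["covered", "insurance", "billing"],
    }
    text = transcript.lower()
    for flow, words in keywords.items():
        if any(w in text for w in words):
            return flow
    return "other"

def route_call(transcript: str) -> str:
    intent = classify_intent(transcript)
    if intent in CORE_FLOWS:
        return f"flow:{intent}"    # one of the four optimized flows
    return "handoff:human"         # clean escalation for the long tail
```

The point of the shape: the general handler does nothing clever. Anything outside the four flows goes straight to a human, which is where the 27% grab bag belongs.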
Finding 2: Latency Tolerance Is Exactly 1.8 Seconds
We measured caller drop-off rates against our system's response latency — the time between when a caller finishes speaking and when the AI begins its response. The data was clear: at 1.2 seconds or less, drop-off rates were near zero. Between 1.2 and 1.8 seconds, drop-off crept up slightly. Above 1.8 seconds, we saw a cliff. Callers either hung up or started talking over the AI, derailing the conversation.
1.8 seconds. That's your budget for the entire pipeline: speech-to-text transcription, LLM inference, text-to-speech generation, and audio delivery back through Twilio. In practice, this means we run Deepgram's streaming transcription (adds ~300ms), Claude Haiku for most routine responses (adds ~400-600ms), and ElevenLabs with their Turbo v2 model (adds ~350ms). In the worst case that's ~1,250ms of processing, leaving roughly 550ms for network overhead and audio buffering before we're in the danger zone.
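A simple way to keep that constraint honest is to treat each response turn as a timed budget. The stage estimates below are the article's rough figures, and the `TurnTimer` class is a sketch of the idea, not production code:

```python
import time

BUDGET_MS = 1800  # the drop-off cliff

# Rough per-stage estimates from our measurements (upper end for the LLM).
STAGE_ESTIMATES_MS = {
    "stt_streaming": 300,    # Deepgram streaming finalization
    "llm_inference": 600,    # Claude Haiku, routine responses
    "tts_generation": 350,   # ElevenLabs Turbo v2
}

class TurnTimer:
    """Tracks elapsed time for one response turn against the 1.8 s cliff."""

    def __init__(self):
        self.start = time.monotonic()

    def elapsed_ms(self) -> float:
        return (time.monotonic() - self.start) * 1000

    def remaining_ms(self) -> float:
        return BUDGET_MS - self.elapsed_ms()

    def over_budget(self) -> bool:
        return self.remaining_ms() <= 0

# Sanity check: processing alone uses ~1,250 ms, leaving ~550 ms of
# headroom for network overhead and audio delivery.
headroom = BUDGET_MS - sum(STAGE_ESTIMATES_MS.values())
```

Starting a `TurnTimer` when the caller stops speaking and checking `remaining_ms()` before each stage is what lets you decide, mid-turn, whether to take the fast path or start covering the wait.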
For complex queries where we need Claude Sonnet's reasoning — like disambiguating between similar appointment types or handling multi-step insurance questions — we've built a "thinking buffer" that plays a natural filler phrase ("Let me check that for you...") to buy an extra 2-3 seconds. This single trick reduced our complex-query drop-off rate by 41%.
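The thinking-buffer pattern is easy to express with asyncio: race the slow model against a timeout, and only play the filler if it loses. `ask_sonnet` and `play_audio` below are hypothetical stand-ins for the real LLM and telephony calls:

```python
import asyncio

FILLER_THRESHOLD_S = 1.2   # start the filler well before the 1.8 s cliff
FILLER_PHRASE = "Let me check that for you..."

async def respond_with_thinking_buffer(question, ask_sonnet, play_audio):
    """Answer with the heavier model; cover slow responses with a filler.

    ask_sonnet: async callable returning the model's answer (hypothetical).
    play_audio: async callable that speaks a phrase to the caller (hypothetical).
    """
    answer_task = asyncio.create_task(ask_sonnet(question))
    try:
        # Fast path: the answer arrives before the caller notices the delay.
        # shield() keeps the underlying task alive if the timeout fires.
        return await asyncio.wait_for(asyncio.shield(answer_task),
                                      timeout=FILLER_THRESHOLD_S)
    except asyncio.TimeoutError:
        # Slow path: buy 2-3 seconds with a natural filler phrase,
        # then deliver the real answer when it lands.
        await play_audio(FILLER_PHRASE)
        return await answer_task
```

The `asyncio.shield` call matters: without it, the timeout would cancel the in-flight model request instead of letting it finish behind the filler.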
Finding 3: Morning Callers Are 2.3x More Patient Than Afternoon Callers
This one surprised us. We segmented call behavior by time of day and found a pattern so consistent it changed our system design.
Callers between 8am and 11am had an average interaction length of 3 minutes 12 seconds and tolerated longer AI response times before dropping off. Callers between 2pm and 5pm averaged 1 minute 54 seconds and were significantly more likely to request a human transfer.
Our theory: morning callers are often calling during a planned moment — they're at their desk, coffee in hand, checking things off a list. Afternoon callers are squeezing in a call between meetings or during a break. They want speed.
We now dynamically adjust Loquent's behavior based on time of day. Afternoon calls get shorter confirmations, faster routing, and more aggressive escalation to human staff. Morning calls get slightly more conversational, exploratory flows. This alone improved our afternoon completion rate by 18%.
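Structurally, the time-of-day adaptation is just a profile lookup on the call's local hour. The thresholds below are invented to show the shape of the idea, not our tuned values:

```python
from dataclasses import dataclass

# Illustrative behavior profiles; field values are made up for the sketch.
@dataclass(frozen=True)
class BehaviorProfile:
    confirmation_style: str        # "full" vs "short" confirmations
    escalation_after_misses: int   # confused exchanges before human transfer
    allow_exploratory_flow: bool   # longer, more conversational branches

MORNING = BehaviorProfile("full", 3, True)       # 8am-11am: patient callers
AFTERNOON = BehaviorProfile("short", 2, False)   # 2pm-5pm: callers want speed
DEFAULT = BehaviorProfile("full", 2, False)

def profile_for(hour: int) -> BehaviorProfile:
    """Pick a behavior profile from the caller's local hour (0-23)."""
    if 8 <= hour < 11:
        return MORNING
    if 14 <= hour < 17:
        return AFTERNOON
    return DEFAULT
```

Keeping the profile immutable and resolving it once at call start means the rest of the pipeline just reads fields like `confirmation_style` instead of sprinkling time checks everywhere.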
Finding 4: The "Second Sentence" Problem
Here's a pattern we almost missed. In 34% of calls where the AI's first response was correct and helpful, the caller still asked to speak to a human. We dug into the transcripts and found the issue wasn't accuracy — it was the AI's second sentence.
The AI would correctly answer the question, then add a follow-up that felt robotic or presumptuous. Things like: "Is there anything else I can help you with today?" delivered in the exact same cadence as a phone tree. Or worse, immediately pivoting to: "I can also help you with appointment scheduling, prescription inquiries, or billing questions."
Real receptionists don't do this. They pause. They let the caller process. They read the room.
We rewrote our prompt engineering to include explicit "breath" instructions — moments where the AI generates a brief pause and waits for the caller to lead. We also cut the generic menu-style follow-ups entirely. The result: human transfer requests after successful first responses dropped from 34% to 12%.
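One cheap guardrail alongside the prompt change is a post-generation check that flags menu-style follow-ups for regeneration. The instruction wording and banned phrases below are our own illustration, not the production prompt:

```python
# Hypothetical sketch of the "breath" instructions and a lint-style check
# on generated responses; wording is illustrative.
BREATH_INSTRUCTIONS = """\
After answering, stop. Do not list other things you can help with.
If the caller is silent, wait a beat before speaking again.
Prefer a short, natural acknowledgement over a menu-style follow-up.
"""

BANNED_FOLLOW_UPS = [
    "is there anything else i can help you with",
    "i can also help you with",
]

def violates_second_sentence_rule(response: str) -> bool:
    """True if the response ends in a phone-tree-style follow-up."""
    text = response.lower()
    return any(phrase in text for phrase in BANNED_FOLLOW_UPS)
```

A failed check can trigger either a regeneration or simply truncating the response at the first sentence, which in practice is often all the caller wanted.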
Finding 5: 6% of Callers Will Try to Break Your AI (And That's Fine)
We identified a consistent 6% of callers across all clinics who deliberately tested the AI. They'd ask trick questions, try to confuse it, speak in fragments, or demand things the AI clearly couldn't do. We affectionately call these "stress-test callers" internally.
Early on, we tried to make the system handle these gracefully — clever redirects, patient re-prompts, escalation paths. We burned weeks on it. The data showed us something freeing: these callers almost always called back within 24 hours and had a normal, productive interaction the second time. They were curious, not hostile.
We now let these calls fail gracefully with a simple "I want to make sure you get the help you need — let me connect you with the team" after two confused exchanges. No heroics. Our engineering time is better spent on the 94%.
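The two-strikes rule fits in a few lines of per-call state. This is a sketch of the logic as described above, with a hypothetical `CallState` wrapper:

```python
HANDOFF_PHRASE = ("I want to make sure you get the help you need — "
                  "let me connect you with the team.")
MAX_CONFUSED_EXCHANGES = 2

class CallState:
    """Minimal per-call state for the graceful-failure rule (illustrative)."""

    def __init__(self):
        self.confused_exchanges = 0

    def record_turn(self, understood: bool):
        """Return the handoff phrase after two confused exchanges, else None."""
        if understood:
            self.confused_exchanges = 0   # any productive turn resets the count
            return None
        self.confused_exchanges += 1
        if self.confused_exchanges >= MAX_CONFUSED_EXCHANGES:
            return HANDOFF_PHRASE
        return None
```

No heroics: one confused turn gets a re-prompt, two gets a warm transfer.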
What This Changed For Us
After this analysis, we made three architectural decisions that shaped Loquent's next iteration:
1. Flow specialization over generalization. We rebuilt our four core flows from scratch, each with its own optimized prompt chain, latency budget, and escalation logic. The "general conversation" handler became a thin routing layer, not a Swiss Army knife.
2. Time-aware behavior. Loquent now adapts its conversational style, response length, and escalation thresholds based on time of day. The morning version and the afternoon version are meaningfully different systems.
3. Silence as a feature. We invested heavily in teaching the AI when not to talk. Strategic pauses, shorter confirmations, and eliminating the "anything else?" reflex made the system feel less like a phone tree and more like a receptionist who respects your time.
The Numbers After the Rebuild
Six weeks after implementing these changes across all 14 clinics:
- Overall call completion rate: 74% → 82%
- Average call duration: 2:38 → 2:11
- Human transfer requests: 22% → 14%
- Client satisfaction (post-call survey): 3.4/5 → 4.1/5
- Peak hour handling capacity: up 23% (same infrastructure cost)
None of these improvements came from a better model or a fancier tech stack. They came from reading our own data honestly and being willing to simplify.
Key Takeaways
Most healthcare voice AI problems are scope problems, not intelligence problems. You don't need AGI to book a dental cleaning. You need four flows that work perfectly and clean handoffs for everything else.
Latency isn't a "nice to have" metric — it's the metric. Every millisecond above 1.8 seconds costs you callers. Architect your entire pipeline around this constraint from day one.
Time of day changes caller behavior more than you'd expect. Build your system to adapt, or you're leaving completion rate on the table.
The AI's second sentence matters more than the first. Getting the answer right is table stakes. How the AI handles the moment after the answer determines whether the caller stays or bounces.
Not every edge case deserves engineering time. The 6% who stress-test your system will come back. Focus your effort on the 94% who just want their appointment booked.
If you're building something similar, we'd love to hear about it. Reach out at hello@autor.ca or visit autor.ca.