<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Autor Technologies Inc.</title>
    <description>The latest articles on DEV Community by Autor Technologies Inc. (@autor_tech).</description>
    <link>https://dev.to/autor_tech</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3842303%2Fbf0b0e32-20ca-43ae-b5aa-7032797fc21e.png</url>
      <title>DEV Community: Autor Technologies Inc.</title>
      <link>https://dev.to/autor_tech</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/autor_tech"/>
    <language>en</language>
    <item>
      <title>Why Canada Is the Best Place to Build Healthcare AI Right Now</title>
      <dc:creator>Autor Technologies Inc.</dc:creator>
      <pubDate>Mon, 15 Jun 2026 13:13:18 +0000</pubDate>
      <link>https://dev.to/autor_tech/why-canada-is-the-best-place-to-build-healthcare-ai-right-now-5dj0</link>
      <guid>https://dev.to/autor_tech/why-canada-is-the-best-place-to-build-healthcare-ai-right-now-5dj0</guid>
      <description>&lt;p&gt;Every week I talk to a founder who's building healthcare AI in San Francisco and spending 40% of their engineering time reconciling HIPAA with fifty different state-level privacy regimes. Meanwhile, we shipped Loquent, a production voice AI handling thousands of automated calls per month for healthcare and dental clinics, from Toronto, in eight weeks. The regulatory environment wasn't something we fought against. It was one of the reasons we moved fast.&lt;/p&gt;

&lt;p&gt;Most of the AI discourse assumes you need to be in the Bay Area to build anything real. For healthcare AI specifically, I think that's wrong. Canada has a set of structural advantages right now that most people aren't paying attention to, and by the time they do, the window will be smaller.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Regulatory Stack Is Actually an Advantage
&lt;/h2&gt;

&lt;p&gt;Here's what most people get wrong about Canadian privacy law: they assume PIPEDA is just "HIPAA but Canadian." It's not. It's actually more coherent.&lt;/p&gt;

&lt;p&gt;In the US, you have HIPAA at the federal level, but then you're dealing with a patchwork of state-level regulations. California has CCPA/CPRA. New York has the SHIELD Act. Texas has HB 300, its own medical privacy law. If you're building AI that processes health data, you're essentially maintaining compliance against a dozen different interpretations of what "adequate protection" means.&lt;/p&gt;

&lt;p&gt;In Canada, PIPEDA gives you a federal baseline built on 10 Fair Information Principles: purpose limitation, consent, accountability, the fundamentals. Then you layer on provincial health-specific legislation like Ontario's PHIPA (Personal Health Information Protection Act), which was actually designed with digital health workflows in mind. PHIPA explicitly addresses how health information custodians and their agents handle PHI across clinical and administrative workflows, and AI systems operated on a custodian's behalf fall under those same rules.&lt;/p&gt;

&lt;p&gt;The Ontario Information and Privacy Commissioner released guidance specifically about AI transcription tools in healthcare — requiring privacy impact assessments, data minimization throughout the AI lifecycle, and limiting PHI disclosure to vendors. These aren't vague handwaves. They're specific, implementable requirements.&lt;/p&gt;

&lt;p&gt;When we built Loquent, this clarity was a competitive advantage. Instead of hiring a team of lawyers to interpret ambiguous regulations, we could read the guidance, build to spec, and ship. PIPEDA's emphasis on meaningful consent — where individuals must be fully informed about how their data will be used — forced us to build better product, not slower product.&lt;/p&gt;

&lt;p&gt;And here's the kicker: federal privacy reform is pointing toward tighter consent requirements. Bill C-27, the federal privacy and AI bill, proposed penalties up to C$25 million or 5% of gross global revenue before it died on the order paper in January 2025. That sounds scary, but rules like that favor builders who are already compliant. They raise the floor for everyone else trying to compete in the Canadian market.&lt;/p&gt;

&lt;h2&gt;
  
  
  AIDA Is Dead. That's Actually Good for Builders.
&lt;/h2&gt;

&lt;p&gt;Canada's Artificial Intelligence and Data Act (AIDA) died in parliament. A lot of people read that as "Canada has no AI regulation" and panicked. I read it differently.&lt;/p&gt;

&lt;p&gt;What it means in practice is that healthcare AI in Canada operates under existing, well-understood privacy frameworks rather than a brand-new, untested AI-specific law. You're building against PIPEDA and PHIPA — legislation that's been interpreted by courts and privacy commissioners for years. Compare that to the EU's AI Act, where nobody actually knows yet how enforcement will work in practice for health applications.&lt;/p&gt;

&lt;p&gt;For a small studio like ours, regulatory predictability is worth more than regulatory completeness. We can build with confidence because the rules are clear. The AIDA vacuum will get filled eventually, but right now it means Canadian builders have a window where they can iterate without worrying that new legislation will retroactively invalidate their architecture decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Talent Pool Is Deeper Than You Think
&lt;/h2&gt;

&lt;p&gt;Toronto has the fourth-largest AI talent pool in North America — 23,936 workers — and hit number three in CBRE's 2026 tech talent ranking. The city has over 285,000 tech workers across roughly 24,000 companies and was named Canada's fastest-growing AI hub in March 2026.&lt;/p&gt;

&lt;p&gt;But the real story is Waterloo. Waterloo Region jumped 11 spots to enter the top 10 for the first time, driven by computer and information systems manager growth. The University of Waterloo co-op pipeline is producing engineers who understand both ML fundamentals and production systems. And critically, more of them are staying local instead of immediately migrating to Silicon Valley.&lt;/p&gt;

&lt;p&gt;Canada now has three of the top 10 largest AI talent pools in North America: Toronto, Vancouver, and Montreal. The talent density isn't Bay Area level, but the cost difference is dramatic. We hire senior engineers in Toronto at rates that would get us mid-levels in SF. For a bootstrapped studio building production AI, that math matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  41 Million Patients, One System
&lt;/h2&gt;

&lt;p&gt;This is the structural advantage nobody talks about. Canada's universal, publicly funded healthcare generates clinical data at a scale most countries can't match: 41 million people across diverse demographics, covered by provincial and territorial single-payer plans that share one national framework under the Canada Health Act.&lt;/p&gt;

&lt;p&gt;The federal government is betting on this. In June 2025, Canada Health Infoway gave 10,000 primary care clinicians across Canada AI Scribe licenses through a federally funded program. In September 2025, the federal government announced a task force on AI to recommend policies for research, talent, and commercialization. The VITAL health data initiative is explicitly about turning Canada's structural data advantage into Canadian AI products.&lt;/p&gt;

&lt;p&gt;For us at Autor, this means our clients are part of a system that's actively moving toward AI adoption, not resisting it. When a dental clinic in Ontario asks us about Loquent, they're asking in the context of a healthcare system that's already distributing AI tools to physicians. We're pushing on an open door.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Dental Market Is Wide Open
&lt;/h2&gt;

&lt;p&gt;Speaking of dental: Canada's dental AI market is still nascent. When surveyed, 60% of Canadian dentists said they hadn't implemented AI-assisted technologies in the past five years. The companies in this space — DentalRx, MaxAssist, ClearDent — are primarily focused on practice management and imaging. Almost nobody is building voice AI for dental front desks.&lt;/p&gt;

&lt;p&gt;That's exactly where Loquent lives. And Canada's dental market has a specific advantage: it's large enough to build a real business (over 27,000 dentists across the country) but small enough that you can reach meaningful market penetration from a single city. We're not competing against Epic or Cerner here. We're building for clinic owners who are running reception with two people and a phone that won't stop ringing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Sovereignty Is a Selling Point
&lt;/h2&gt;

&lt;p&gt;One advantage I didn't anticipate: Canadian healthcare organizations are increasingly concerned about the US CLOUD Act. US-based platforms are subject to government data access requests, which creates a compliance risk for Canadian organizations handling personal health information. Canadian-operated platforms eliminate that exposure entirely.&lt;/p&gt;

&lt;p&gt;This isn't theoretical. We've had multiple conversations with clinic owners who specifically asked whether patient call data stays in Canada. For Loquent, it does. That's not just a compliance checkbox — it's becoming a real differentiator against US-based competitors trying to sell into the Canadian market.&lt;/p&gt;

&lt;p&gt;Canada's digital health market generated US$13.49 billion in 2024 and is projected to reach US$53.92 billion by 2030 — a 26% CAGR. Last year, 54% of digital health investment went to AI-enabled companies, up from 37% the year before. The money is following the opportunity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PIPEDA + PHIPA give you clear, implementable compliance requirements.&lt;/strong&gt; The US patchwork of state privacy laws is actually harder to build against than Canada's layered federal-provincial model. Clarity beats complexity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The AIDA vacuum is a feature, not a bug.&lt;/strong&gt; Canadian healthcare AI operates under established privacy law with years of interpretation behind it. That's more predictable than brand-new AI-specific legislation that hasn't been tested.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Toronto-Waterloo talent density is real and growing.&lt;/strong&gt; Three of North America's top 10 AI talent pools are in Canada, and the cost advantage over US cities is significant for bootstrapped companies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Single-payer data at scale is a structural moat.&lt;/strong&gt; 41 million patients in one system, with a federal government actively funding AI adoption. This dataset advantage compounds over time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The dental/clinic market is wide open.&lt;/strong&gt; 60% of Canadian dentists haven't adopted AI yet. If you're building healthcare AI tools for frontline clinics, Canada is an ideal starting market.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;If you're building something similar, we'd love to hear about it. Reach out at &lt;a href="mailto:hello@autor.ca"&gt;hello@autor.ca&lt;/a&gt; or visit &lt;a href="https://www.autor.ca" rel="noopener noreferrer"&gt;autor.ca&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>machinelearning</category>
      <category>typescript</category>
    </item>
    <item>
      <title>Why Canada Is the Best Place to Build Healthcare AI Right Now</title>
      <dc:creator>Autor Technologies Inc.</dc:creator>
      <pubDate>Mon, 08 Jun 2026 13:12:55 +0000</pubDate>
      <link>https://dev.to/autor_tech/why-canada-is-the-best-place-to-build-healthcare-ai-right-now-4pk</link>
      <guid>https://dev.to/autor_tech/why-canada-is-the-best-place-to-build-healthcare-ai-right-now-4pk</guid>
      <description>&lt;p&gt;Last month, a US-based healthtech founder asked me where he should incorporate his AI company. He was deciding between Delaware and Ontario. I told him Ontario — and he looked at me like I'd suggested he build a spaceship out of duct tape. Six weeks later, he moved his entire dev team to Toronto. Here's why.&lt;/p&gt;

&lt;p&gt;At Autor, we've spent the last two years building Loquent — a production voice AI platform that handles thousands of automated calls per month for healthcare and dental clients across Canada. We've shipped AI into regulated healthcare environments, navigated PIPEDA and PHIPA compliance from day one, and watched the US regulatory landscape turn into a minefield while Canada quietly built something better. I'm not saying Canada is perfect. I'm saying that right now, in June 2026, if you're starting a healthcare AI company, you're making a mistake by defaulting to the US.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Regulatory Argument Everyone Gets Wrong
&lt;/h2&gt;

&lt;p&gt;The common take is that the US is more "innovation-friendly" because HIPAA is well-understood and the FDA has been approving AI/ML medical devices since 2017. That's true — if you're building a diagnostic imaging tool that fits neatly into the existing SaMD framework. For everything else, especially conversational AI, voice agents, ambient scribes, and patient-facing automation, the US is a regulatory grey zone that's getting greyer.&lt;/p&gt;

&lt;p&gt;Canada's AIDA (Artificial Intelligence and Data Act) died on the order paper when Parliament was prorogued in January 2025. Most people read that as "Canada has no AI regulation." I read it differently: Canada has no &lt;em&gt;bad&lt;/em&gt; AI regulation. What we do have is PIPEDA, a principles-based privacy framework that actually works for AI development.&lt;/p&gt;

&lt;p&gt;Here's the practical difference. HIPAA is entity-specific. It applies to "covered entities" (healthcare providers, health plans, and clearinghouses) and reaches vendors only as their "business associates." Your AI startup processing voice calls for a dental clinic? You're not a covered entity, but you're still handling PHI, so your obligations depend on how that relationship is papered, and the legal exposure is enormous and unclear. PIPEDA covers all commercial activity involving personal information. There's no ambiguity about whether you're in scope. You are. And because PIPEDA's 10 Fair Information Principles are consent-based rather than entity-based, you can actually build a compliance architecture that makes sense for an AI product.&lt;/p&gt;

&lt;p&gt;Ontario's PHIPA adds a healthcare-specific layer on top. It doesn't impose a blanket residency rule, but section 50 restricts disclosing personal health information outside Ontario, and in practice custodians commonly ask for Canadian data residency. That sounds like a restriction, but it's actually a competitive moat. If your data stays in Canada, you're compliant by default with the data sovereignty requirements that are now hitting US companies as a surprise. The EU's adequacy decisions, cross-border transfer restrictions, and provincial approval requirements all become simpler problems when your infrastructure is already Canadian.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Talent Math
&lt;/h2&gt;

&lt;p&gt;Toronto is now the third-largest tech talent pool in North America — over 285,000 technology workers in software, systems, and engineering roles. Only the Bay Area and New York Metro are bigger. Toronto's tech talent grew 44% over five years. The Vector Institute, University of Toronto, and the broader Waterloo-Toronto corridor produce more ML engineers per capita than anywhere except maybe London.&lt;/p&gt;

&lt;p&gt;But here's the number that actually matters: 30-40% cost savings compared to equivalent US hires. A senior ML engineer in San Francisco costs $250-350K fully loaded. In Toronto, that same engineer — often trained at the same institutions, publishing in the same conferences — costs $160-220K CAD, which is roughly $115-160K USD.&lt;/p&gt;

&lt;p&gt;And Canada's Global Talent Stream visa processes international hires in two weeks. Not two months. Two weeks. We've used this at Autor to bring in specialized talent from India and Eastern Europe without the H-1B lottery or the 8-month USCIS processing times that US startups just accept as normal.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Healthcare AI Market Nobody's Watching
&lt;/h2&gt;

&lt;p&gt;The dental AI market alone is projected to grow from $516M (2025) to $3.9B by 2035 — a 22.5% CAGR. The Canadian Dental Association made AI its central theme at CDA Presents 2026 in April. This isn't fringe adoption. The professional governing bodies are actively pushing clinics to modernize.&lt;/p&gt;

&lt;p&gt;We see this firsthand with Loquent. When we started building voice AI for dental clinics, the conventional wisdom was that Canadian healthcare was too conservative, too slow-moving, too resistant to automation. That turned out to be completely wrong. What Canadian clinics are is &lt;em&gt;compliance-conscious&lt;/em&gt;. They don't want to be first, but they absolutely want to adopt technology that's already proven AND compliant. Once we showed that Loquent handled PHIPA-compliant call handling with Canadian data residency, the objection wasn't "we don't want AI" — it was "how fast can you deploy."&lt;/p&gt;

&lt;p&gt;The competitive landscape is still early. DentalAssist.ai out of Burlington is doing interesting work. A few US-based companies are trying to enter the Canadian market but stumbling on compliance. The window for building a dominant Canadian healthcare AI company is open right now, and it won't stay open forever.&lt;/p&gt;

&lt;h2&gt;
  
  
  The SR&amp;amp;ED Advantage Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Canada's Scientific Research and Experimental Development (SR&amp;amp;ED) tax credit covers up to 35% of eligible R&amp;amp;D spend for Canadian-controlled private corporations. If you're building an AI product, most of your engineering work qualifies. We've used SR&amp;amp;ED at Autor every year, and the refund effectively subsidizes our entire research pipeline.&lt;/p&gt;

&lt;p&gt;The US has R&amp;amp;D tax credits too, but they're far less generous and the 2022 amortization rules made them worse. For early-stage healthcare AI companies burning cash on model development, prompt engineering, and integration work, the SR&amp;amp;ED refund is often the difference between having 12 months of runway and having 16 months.&lt;/p&gt;

&lt;p&gt;Combine SR&amp;amp;ED with lower salaries, Canadian data residency compliance, and the fact that $1 USD buys you roughly $1.38 CAD of engineering output, and the unit economics of building in Canada are hard to argue against.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Contrarian Bet
&lt;/h2&gt;

&lt;p&gt;I know what the counterarguments are. The US healthcare market is 10x larger. US VCs have deeper pockets. The FDA approval pathway, despite its flaws, is a known quantity.&lt;/p&gt;

&lt;p&gt;All true. And all beside the point if you're building voice AI, conversational agents, patient-facing automation, or ambient clinical intelligence — the categories that are actually growing fastest. For these products, the regulatory clarity, talent economics, and compliance infrastructure in Canada aren't just "comparable" to the US. They're better.&lt;/p&gt;

&lt;p&gt;The US is heading toward a patchwork of state-level AI regulations. California, Colorado, and Illinois already have divergent frameworks. HIPAA wasn't designed for AI and hasn't been meaningfully updated. The FDA is doing innovative work with agentic AI reviews, but that's for medical devices — not for the voice agent that answers your clinic's phones.&lt;/p&gt;

&lt;p&gt;Canada has a single federal privacy framework, provincial health information acts that are consistent in their principles, a regulatory gap that gives startups room to build without premature compliance burden, and a talent pool that keeps getting deeper.&lt;/p&gt;

&lt;p&gt;We're not the only ones who've noticed. The healthtech founder I mentioned at the top? He told me his Toronto team shipped their MVP three months faster than his US team had projected, at 40% of the budget. That's not a fluke. That's the structural advantage of building where the fundamentals align.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PIPEDA's principles-based approach works better for AI than HIPAA's entity-based model.&lt;/strong&gt; If you're building any healthcare AI product that isn't a traditional medical device, Canada's regulatory framework gives you clearer guardrails with less ambiguity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The talent economics are unbeatable.&lt;/strong&gt; Toronto's 285,000+ tech workers, 30-40% cost savings, and two-week visa processing make it the most capital-efficient place in North America to build an AI team.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Canadian data residency is a moat, not a limitation.&lt;/strong&gt; As cross-border data transfer requirements tighten globally, being Canadian-first means you're already compliant where others are scrambling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SR&amp;amp;ED extends your runway by 20-30%.&lt;/strong&gt; No other G7 country offers R&amp;amp;D tax credits this generous for early-stage AI companies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The dental and healthcare AI market is moving now.&lt;/strong&gt; The CDA made AI its 2026 theme. Clinics are buying. The competitive window is open but closing.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;If you're building something similar, we'd love to hear about it. Reach out at &lt;a href="mailto:hello@autor.ca"&gt;hello@autor.ca&lt;/a&gt; or visit &lt;a href="https://www.autor.ca" rel="noopener noreferrer"&gt;autor.ca&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>machinelearning</category>
      <category>typescript</category>
    </item>
    <item>
      <title>We're Open Sourcing Our Voice AI Latency Benchmarking Tool</title>
      <dc:creator>Autor Technologies Inc.</dc:creator>
      <pubDate>Wed, 03 Jun 2026 21:56:41 +0000</pubDate>
      <link>https://dev.to/autor_tech/were-open-sourcing-our-voice-ai-latency-benchmarking-tool-3oa8</link>
      <guid>https://dev.to/autor_tech/were-open-sourcing-our-voice-ai-latency-benchmarking-tool-3oa8</guid>
      <description>&lt;p&gt;Last month, a 340ms spike in our TTS pipeline caused 12% of Loquent callers to talk over the AI mid-response. We didn't catch it for six hours because we were measuring the wrong thing — average latency instead of tail latency at each pipeline stage. That incident is why we built &lt;code&gt;vox-bench&lt;/code&gt;, and why we're releasing it today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why we needed this
&lt;/h2&gt;

&lt;p&gt;When you're building a voice AI agent that handles thousands of live phone calls per month — dental appointment bookings, patient intake, after-hours triage — latency isn't a nice-to-have metric. It's the difference between a conversation that feels human and one that feels like talking to a broken IVR.&lt;/p&gt;

&lt;p&gt;Our Loquent pipeline has five stages: Twilio media stream ingestion, speech-to-text via Deepgram, LLM inference via Anthropic Claude (with OpenAI as fallback), text-to-speech via ElevenLabs, and audio streaming back through Twilio. Each stage adds time. The total round-trip — from the moment a caller stops speaking to the moment they hear the AI respond — needs to stay under 800ms to feel natural. Go above 1.2 seconds and callers start repeating themselves. Go above 1.8 seconds and they hang up.&lt;/p&gt;

&lt;p&gt;We know these numbers because we tracked them across 10,000+ calls over six months of running Loquent in production. But for the first four months, we were tracking them wrong.&lt;/p&gt;
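
&lt;p&gt;Those three thresholds translate directly into a small classifier. The sketch below is illustrative only: the &lt;code&gt;classify&lt;/code&gt; function and the band names are assumptions made for this example, not part of &lt;code&gt;vox-bench&lt;/code&gt;, and the "delayed" label for the 800ms to 1.2s band is our own shorthand.&lt;/p&gt;

```typescript
// Hypothetical labels for the round-trip bands described above.
type CallerExperience = "natural" | "delayed" | "repeats" | "hangup";

// Map a round-trip time in milliseconds to the band it falls in.
// Thresholds come from the text: 800ms, 1.2s, and 1.8s.
function classify(roundTripMs: number): CallerExperience {
  if (roundTripMs > 1800) return "hangup";  // callers give up
  if (roundTripMs > 1200) return "repeats"; // callers repeat themselves
  if (roundTripMs > 800) return "delayed";  // noticeable, but tolerated
  return "natural";
}

console.log(classify(620));  // natural
console.log(classify(1400)); // repeats
```

&lt;p&gt;Tagging every call with a band like this makes it easy to count how many conversations actually crossed a line, instead of staring at a single average.&lt;/p&gt;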

&lt;h2&gt;
  
  
  What we were doing wrong
&lt;/h2&gt;

&lt;p&gt;Our original monitoring was simple: we logged a timestamp when audio came in from Twilio and another when we sent audio back. Total round-trip time. One number. And for a while, it looked great — averaging around 650ms.&lt;/p&gt;

&lt;p&gt;The problem was that average told us almost nothing. When ElevenLabs latency spiked during a provider-side deployment, with its p95 climbing to 340ms against a p50 of around 120ms, our total average barely moved, from 650ms to 710ms. Still "fine" by our alerting thresholds. But 12% of calls were hitting 1.4+ second response times, and those callers were already talking again before the AI responded. The result was conversational chaos: interrupted responses, repeated questions, callers saying "hello? are you there?"&lt;/p&gt;

&lt;p&gt;We needed per-stage, per-percentile latency tracking. Nothing we found did exactly what we needed.&lt;/p&gt;
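
&lt;p&gt;To make the average-versus-tail problem concrete, here is a minimal sketch using synthetic numbers (88 healthy calls and 12 slow ones, chosen to mirror the 12% figure above). The &lt;code&gt;percentile&lt;/code&gt; helper uses the simple nearest-rank method and is an assumption for illustration, not how &lt;code&gt;vox-bench&lt;/code&gt; computes percentiles.&lt;/p&gt;

```typescript
// Nearest-rank percentile over a sorted copy of the samples.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function mean(samples: number[]): number {
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}

// Synthetic round-trip times in ms: 88 calls at 600ms, 12 calls at 1400ms.
const roundTrips: number[] = Array(88).fill(600).concat(Array(12).fill(1400));

const avg = mean(roundTrips);           // 696
const p50 = percentile(roundTrips, 50); // 600
const p95 = percentile(roundTrips, 95); // 1400

console.log({ avg, p50, p95 });
```

&lt;p&gt;The mean lands under 700ms and looks healthy, while the p95 sits at 1.4 seconds. An alert on the mean alone would stay silent here.&lt;/p&gt;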

&lt;h2&gt;
  
  
  What exists today (and why it wasn't enough)
&lt;/h2&gt;

&lt;p&gt;We evaluated several options before building our own:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generic APM tools&lt;/strong&gt; (Datadog, New Relic) — great for HTTP request latency, but they don't understand voice AI pipeline stages. You can instrument custom spans, but you're building the domain model yourself. We tried this with Datadog for two months. The dashboard became a wall of custom metrics that nobody on the team could parse quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provider-specific dashboards&lt;/strong&gt; — Deepgram and ElevenLabs both have latency metrics in their dashboards, but they only show their own stage. You can't correlate a Deepgram STT spike with downstream effects on total response time. And they measure from their side — not from your server's perspective, which includes network transit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load testing tools&lt;/strong&gt; (k6, Locust) — designed for HTTP endpoints, not real-time WebSocket audio streams. You can hack them into shape, but simulating realistic voice conversation patterns (variable utterance lengths, interruptions, silence gaps) is a project in itself.&lt;/p&gt;

&lt;p&gt;We needed something purpose-built for voice AI pipelines. So we built it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How vox-bench works
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;vox-bench&lt;/code&gt; is a TypeScript CLI tool that benchmarks each stage of a voice AI pipeline independently and in combination. Here's what it does:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-stage benchmarking.&lt;/strong&gt; Point it at your STT provider, your LLM, and your TTS provider. It sends realistic audio samples (we include a corpus of 200 healthcare-domain utterances of varying lengths) and measures latency at each stage independently. You get p50, p95, p99, and max for each provider.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pipeline simulation.&lt;/strong&gt; Chain your stages together and &lt;code&gt;vox-bench&lt;/code&gt; simulates full conversational round-trips. It measures total time-to-first-byte (TTFB) and time-to-complete, broken down by stage. This is where you catch the compounding effects: a 50ms STT increase plus an 80ms LLM increase that together push your total over the threshold.&lt;/p&gt;
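
&lt;p&gt;A rough sketch of that compounding effect, under loud assumptions: the stage names, the 90ms transport figures, and the helper functions are all hypothetical, and summing stages ignores the overlap that streaming gives you in a real pipeline. It only shows how two small regressions can cross a budget together.&lt;/p&gt;

```typescript
type StageLatency = { stage: string; ms: number };

// Naive total: adds stage latencies and ignores streaming overlap.
function totalMs(stages: StageLatency[]): number {
  return stages.reduce((sum, s) => sum + s.ms, 0);
}

function overBudget(stages: StageLatency[], budgetMs: number): boolean {
  return totalMs(stages) > budgetMs;
}

// Hypothetical baseline; the transport numbers are made up for the example.
const baselineStages: StageLatency[] = [
  { stage: "ingest", ms: 90 },
  { stage: "stt", ms: 180 },
  { stage: "llm", ms: 210 },
  { stage: "tts", ms: 130 },
  { stage: "egress", ms: 90 },
];

// Apply the two small increases from the text: +50ms STT, +80ms LLM.
const deltas: { [stage: string]: number } = { stt: 50, llm: 80 };
const regressedStages = baselineStages.map((s) => ({
  stage: s.stage,
  ms: s.ms + (deltas[s.stage] || 0),
}));

console.log(totalMs(baselineStages), overBudget(baselineStages, 800));   // 700 false
console.log(totalMs(regressedStages), overBudget(regressedStages, 800)); // 830 true
```

&lt;p&gt;Neither increase would trip a per-stage alarm on its own, which is why measuring the chained total matters.&lt;/p&gt;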

&lt;p&gt;&lt;strong&gt;Provider comparison.&lt;/strong&gt; Run the same benchmark against multiple providers simultaneously. We built this because we needed to evaluate whether switching from Deepgram Nova-2 to Nova-3 would actually reduce our p95 STT latency in practice (it did, by 35ms on average for our healthcare utterances, though p99 rose by 12ms on longer sentences). You configure providers in a YAML file and &lt;code&gt;vox-bench&lt;/code&gt; runs them head-to-head.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regression detection.&lt;/strong&gt; Run &lt;code&gt;vox-bench&lt;/code&gt; on a schedule (we use a GitHub Action that runs every 6 hours) and it compares results against your baseline. If any stage's p95 moves more than your configured threshold, it fires an alert. This is what would have caught the ElevenLabs spike that burned us.&lt;/p&gt;
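
&lt;p&gt;The core of a check like this can be sketched in a few lines. This is not the &lt;code&gt;vox-bench&lt;/code&gt; implementation: the &lt;code&gt;findRegressions&lt;/code&gt; name, the data shape, and the 20% threshold are assumptions for illustration. The baseline values reuse the p95 figures reported in this post, and the current TTS value is invented to simulate a spike.&lt;/p&gt;

```typescript
// p95 latency in ms, keyed by stage name (hypothetical shape).
type StageP95 = { [stage: string]: number };

// Return the stages whose p95 grew by more than thresholdPct percent.
function findRegressions(
  baseline: StageP95,
  current: StageP95,
  thresholdPct: number
): string[] {
  return Object.keys(baseline).filter((stage) => {
    const now = current[stage];
    if (now === undefined) return false; // stage missing from this run
    const changePct = ((now - baseline[stage]) / baseline[stage]) * 100;
    return changePct > thresholdPct;
  });
}

const baselineP95: StageP95 = { stt: 245, llm: 340, tts: 220 };
const currentP95: StageP95 = { stt: 250, llm: 335, tts: 340 };

const regressedStagesP95 = findRegressions(baselineP95, currentP95, 20);
console.log(regressedStagesP95); // [ 'tts' ]
```

&lt;p&gt;A relative threshold keeps one setting usable across stages with very different absolute latencies. An absolute floor in milliseconds may still be worth adding for very fast stages, where small jitter is a large percentage.&lt;/p&gt;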

&lt;p&gt;&lt;strong&gt;Conversation pattern simulation.&lt;/strong&gt; Real calls aren't "send audio, get response, repeat." Callers interrupt. They pause mid-sentence. They say "um" for three seconds. &lt;code&gt;vox-bench&lt;/code&gt; includes conversation profiles — &lt;code&gt;healthcare-intake&lt;/code&gt;, &lt;code&gt;appointment-booking&lt;/code&gt;, &lt;code&gt;general-inquiry&lt;/code&gt; — that model realistic interaction patterns we extracted from our Loquent call data.&lt;/p&gt;

&lt;h2&gt;
  
  
  What our benchmarks actually show
&lt;/h2&gt;

&lt;p&gt;We've been running &lt;code&gt;vox-bench&lt;/code&gt; internally for two months. Here's what the data looks like across our current production stack:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deepgram Nova-3 STT:&lt;/strong&gt; p50 = 180ms, p95 = 245ms, p99 = 310ms. The variance is almost entirely driven by utterance length. Anything under 3 seconds of audio processes fast. Once you cross 6-7 seconds (a full sentence describing symptoms, for example), latency jumps. Our takeaway: design your prompts to encourage shorter caller responses when possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anthropic Claude (Haiku) LLM:&lt;/strong&gt; p50 = 210ms TTFB, p95 = 340ms, p99 = 480ms. This is streaming — we start sending to TTS as soon as the first tokens arrive. We tested Claude Sonnet too: p50 = 380ms TTFB, p95 = 620ms. For voice, Haiku wins. The quality difference between Haiku and Sonnet for our use cases (appointment scheduling, FAQ answers, intake questions) is negligible, but the latency difference is enormous.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ElevenLabs TTS:&lt;/strong&gt; p50 = 130ms, p95 = 220ms, p99 = 350ms. The most variable stage in our pipeline. We've seen p99 hit 600ms during what we assume are provider-side capacity issues, always between 2 and 4pm ET. &lt;code&gt;vox-bench&lt;/code&gt; caught this pattern within a week of deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total pipeline (end-to-end):&lt;/strong&gt; p50 = 620ms, p95 = 890ms, p99 = 1,150ms. Our p95 and p99 are both above the 800ms "feels natural" threshold, but below the 1.2 second "callers repeat themselves" line. We're okay with that tradeoff: pulling the tail below 800ms would require either pre-generating responses (quality hit) or switching to a faster but lower-quality TTS (quality hit). For now, 5-6% of responses feeling slightly delayed is acceptable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;vox-bench&lt;/code&gt; is built with TypeScript (Node.js 20+). We chose TypeScript because our entire Loquent backend is TypeScript/NestJS, and we wanted the team to be able to extend the tool without context-switching languages.&lt;/p&gt;

&lt;p&gt;Key components: a provider adapter layer (currently supports Deepgram, OpenAI Whisper, Anthropic Claude, OpenAI GPT, ElevenLabs, and Google Cloud TTS), a pipeline orchestrator that chains stages with proper streaming, a statistics engine that computes percentiles using the t-digest algorithm (close percentile estimates without storing every measurement), and a reporter that outputs results as JSON or Markdown tables, or sends them to your monitoring system via webhooks.&lt;/p&gt;
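
&lt;p&gt;To illustrate the idea of estimating percentiles without keeping every sample, here is a deliberately simpler stand-in: a fixed-width histogram. This is not t-digest and not the &lt;code&gt;vox-bench&lt;/code&gt; statistics engine. The &lt;code&gt;LatencyHistogram&lt;/code&gt; class, its 10ms bucket width, and its 3,000ms ceiling are all assumptions for the example.&lt;/p&gt;

```typescript
// Fixed-width histogram: memory depends on bucket count, not sample count.
// A simpler stand-in for t-digest, which adapts its resolution to the data.
class LatencyHistogram {
  private bucketMs: number;
  private maxMs: number;
  private counts: number[];
  private total = 0;

  constructor(bucketMs: number, maxMs: number) {
    this.bucketMs = bucketMs;
    this.maxMs = maxMs;
    // One extra bucket collects everything at or above maxMs.
    this.counts = new Array(Math.ceil(maxMs / bucketMs) + 1).fill(0);
  }

  record(ms: number): void {
    const raw = Math.floor(ms / this.bucketMs);
    const idx = Math.max(0, Math.min(raw, this.counts.length - 1));
    this.counts[idx] += 1;
    this.total += 1;
  }

  // Upper edge of the bucket that holds the p-th percentile sample.
  percentile(p: number): number {
    if (this.total === 0) throw new Error("no samples");
    const target = Math.ceil((p * this.total) / 100);
    let seen = 0;
    for (let i = 0; i !== this.counts.length; i++) {
      seen += this.counts[i];
      if (seen >= target) return (i + 1) * this.bucketMs;
    }
    return this.maxMs;
  }
}

const hist = new LatencyHistogram(10, 3000);
for (let i = 0; i !== 88; i++) hist.record(600);
for (let i = 0; i !== 12; i++) hist.record(1400);

console.log(hist.percentile(50), hist.percentile(95)); // 610 1410
```

&lt;p&gt;The estimate is only as precise as the bucket width, so a 600ms sample reports as 610ms here. t-digest avoids that fixed tradeoff by keeping finer resolution near the tails, which is where voice latency matters most.&lt;/p&gt;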

&lt;p&gt;The whole thing is about 3,200 lines of TypeScript. No magic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key findings
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TTS is your most variable stage.&lt;/strong&gt; STT and LLM latency are relatively predictable. TTS providers show the most variance, and the variance is time-of-day dependent. Benchmark at different times or you'll get misleading numbers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Average latency is a useless metric for voice AI.&lt;/strong&gt; Your p95 and p99 determine caller experience. A 650ms average can hide a 1.4 second p95 that makes 5% of your conversations feel broken.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Streaming changes everything.&lt;/strong&gt; Without streaming (waiting for complete LLM response before sending to TTS), our p50 total would be 1,100ms+. With streaming, it's 620ms. If your voice AI pipeline isn't streaming at every stage, fix that before optimizing anything else.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Provider latency varies by content domain.&lt;/strong&gt; Our healthcare utterances benchmark 15-20% slower on STT than general conversation because of medical terminology. Always benchmark with domain-representative audio, not generic test phrases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Schedule your benchmarks.&lt;/strong&gt; Provider performance isn't static. Run benchmarks on a cadence and track trends. The regression detection in &lt;code&gt;vox-bench&lt;/code&gt; has caught three provider-side degradations that our application monitoring missed.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
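&lt;p&gt;To make the second finding concrete, here is a minimal sketch of how a mean can look healthy while the tail does not. It uses an exact nearest-rank percentile over a sorted copy of the samples, not the t-digest that &lt;code&gt;vox-bench&lt;/code&gt; uses, and the latency values are made up for illustration.&lt;/p&gt;

```typescript
// Exact nearest-rank percentile. Fine for small sample sets; this is a
// stand-in for illustration, not the t-digest used by vox-bench.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("percentile of empty sample set");
  }
  const sorted = samples.slice().sort((a, b) => a - b);
  // Multiply before dividing so integer inputs stay exact.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function mean(samples: number[]): number {
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}

// Hypothetical latencies in ms: 98 fast responses and 2 slow outliers.
const latenciesMs: number[] = [];
for (let i = 98; i > 0; i--) {
  latenciesMs.push(600);
}
latenciesMs.push(1400, 1400);

const summary = {
  mean: mean(latenciesMs),
  p50: percentile(latenciesMs, 50),
  p99: percentile(latenciesMs, 99),
};
console.log(summary); // mean 616, p50 600, p99 1400
```

&lt;p&gt;The mean moves by only 16ms, yet the p99 shows that some callers waited 1.4 seconds. That gap is why the tail percentiles are the numbers worth tracking.&lt;/p&gt;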

&lt;h2&gt;
  
  
  Where to find it
&lt;/h2&gt;

&lt;p&gt;The repo is at &lt;a href="https://github.com/Autor-Technologies/vox-bench" rel="noopener noreferrer"&gt;github.com/Autor-Technologies/vox-bench&lt;/a&gt;. MIT licensed. The README has quickstart instructions — you can be running benchmarks against your own providers in under five minutes if you have API keys ready.&lt;/p&gt;

&lt;p&gt;We included our healthcare conversation profiles and audio corpus. If you're building voice AI for a different domain, you can create your own profiles — the format is documented and there's a generator script that builds profiles from your own call recordings.&lt;/p&gt;

&lt;p&gt;We're actively using this internally. If you find bugs or want to add a provider adapter, PRs are welcome. If you're building voice AI and want to talk latency optimization, we've probably hit the same walls you're hitting.&lt;/p&gt;

&lt;p&gt;If you're building something similar, we'd love to hear about it. Reach out at &lt;a href="mailto:hello@autor.ca"&gt;hello@autor.ca&lt;/a&gt; or visit &lt;a href="https://www.autor.ca" rel="noopener noreferrer"&gt;autor.ca&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>machinelearning</category>
      <category>typescript</category>
    </item>
    <item>
      <title>The Exact Prompt Engineering That Makes Our Voice AI Sound Human (Full Prompts Included)</title>
      <dc:creator>Autor Technologies Inc.</dc:creator>
      <pubDate>Mon, 11 May 2026 14:37:59 +0000</pubDate>
      <link>https://dev.to/autor_tech/the-exact-prompt-engineering-that-makes-our-voice-ai-sound-human-full-prompts-included-im8</link>
      <guid>https://dev.to/autor_tech/the-exact-prompt-engineering-that-makes-our-voice-ai-sound-human-full-prompts-included-im8</guid>
      <description>&lt;p&gt;Last month, a patient called one of our dental clinic clients at 11pm on a Saturday and had a full conversation about rescheduling their root canal — and didn't realize they were talking to an AI until the receptionist mentioned it at their next visit. That wasn't an accident. It was the result of 4 months of prompt iteration, 10,000+ call recordings analyzed, and about 140 prompt versions before we landed on something that actually works.&lt;/p&gt;

&lt;p&gt;We build Loquent, a production voice AI platform that handles thousands of automated calls per month for healthcare and dental clinics across Canada. The system runs on Anthropic's Claude for conversation logic, Deepgram for speech-to-text, ElevenLabs for text-to-speech, and Twilio for telephony. When we started, our AI sounded like a chatbot reading a script. Now it sounds like a receptionist who's been working at the clinic for three years. The difference was almost entirely in the prompts.&lt;/p&gt;

&lt;p&gt;I'm going to share the actual prompt architecture we use, the specific techniques that moved the needle, and the failures that taught us the most. If you're building voice AI for any domain, most of this transfers directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Voice Prompts Are a Completely Different Problem
&lt;/h2&gt;

&lt;p&gt;The first mistake we made was treating voice AI prompts like chatbot prompts. We took our best-performing text prompts and plugged them into the voice pipeline. The result was technically correct and completely unusable.&lt;/p&gt;

&lt;p&gt;Here's why: in a text chat, a user reads a 3-sentence response in about 4 seconds. In a voice call, that same response takes 12-15 seconds to speak aloud. By sentence two, the caller has already mentally checked out or tried to interrupt. We learned this the hard way — our first production deployment had a 34% hang-up rate within the first 30 seconds.&lt;/p&gt;

&lt;p&gt;Voice has three constraints that text doesn't: latency sensitivity (callers expect sub-second responses), interruption handling (people talk over AI constantly), and conversational pacing (long responses feel robotic regardless of how natural the voice sounds).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Prompt Architecture That Actually Works
&lt;/h2&gt;

&lt;p&gt;After 140+ iterations, we settled on a three-layer prompt architecture. Here's the actual structure:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: The System Identity Prompt&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the foundation. We keep it under 400 tokens because longer system prompts measurably increase response latency with Claude. Here's a representative version (clinic details changed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are Sarah, a receptionist at Bright Dental Care in Toronto.
You answer phone calls. You are friendly, efficient, and 
knowledgeable about the clinic.

CRITICAL RULES:
- Respond in 1-2 short sentences maximum. Never more.
- Use natural filler words occasionally: "sure", "of course", 
  "absolutely", "let me check on that"
- If you don't know something, say "Let me check with the 
  team and get back to you" — never guess
- Always confirm spelled-out details back to the caller
- You cannot provide medical advice. Ever. Redirect to 
  the dentist.

CLINIC HOURS: Mon-Fri 8am-6pm, Sat 9am-2pm, Closed Sunday
EMERGENCY LINE: 416-555-0199
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what's NOT in there: no verbose personality descriptions, no "you are a helpful assistant," no lengthy backstory. Every token in the system prompt costs you latency, and in voice, latency kills the experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: The Conversation State Manager&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where most voice AI projects fail. They treat each turn independently. We inject a dynamic context block that updates every turn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CURRENT CALL STATE:
- Caller intent: {appointment_reschedule}
- Caller name: {collected: "Michael"}
- Caller verified: {yes, DOB matched}
- Current appointment: {May 15, 2pm, Dr. Patel, cleaning}
- Turn count: {4}
- Sentiment: {neutral, slightly impatient}

NEXT LIKELY ACTIONS: confirm_new_time, check_availability
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This state block is assembled programmatically from our NestJS backend. The sentiment field comes from Deepgram's tone analysis on the caller's voice. The "turn count" matters because we found that after turn 6-7, callers get noticeably more impatient, so we prompt Claude to be more concise and direct.&lt;/p&gt;
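&lt;p&gt;As a rough sketch of what "assembled programmatically" can look like, the function below renders a state block from a typed object and appends a conciseness note once the call runs long. The &lt;code&gt;CallState&lt;/code&gt; shape, the &lt;code&gt;IMPATIENCE_TURN&lt;/code&gt; value, and the wording of the note are assumptions made for this example, not our production code.&lt;/p&gt;

```typescript
// Hypothetical per-call state; field names are illustrative only.
interface CallState {
  intent: string;
  callerName: string | null;
  verified: boolean;
  turnCount: number;
  sentiment: string;
  nextActions: string[];
}

// Assumed cutoff, based on the "turn 6-7" observation in the article.
const IMPATIENCE_TURN = 6;

function renderStateBlock(state: CallState): string {
  const lines = [
    "CURRENT CALL STATE:",
    `- Caller intent: ${state.intent}`,
    `- Caller name: ${state.callerName ?? "not collected"}`,
    `- Caller verified: ${state.verified ? "yes" : "no"}`,
    `- Turn count: ${state.turnCount}`,
    `- Sentiment: ${state.sentiment}`,
    "",
    `NEXT LIKELY ACTIONS: ${state.nextActions.join(", ")}`,
  ];
  // Long calls get an extra nudge toward shorter replies.
  if (state.turnCount >= IMPATIENCE_TURN) {
    lines.push("NOTE: Long call. Be more concise and direct.");
  }
  return lines.join("\n");
}
```

&lt;p&gt;Rendering from a single typed object keeps the block's format stable from turn to turn, which makes prompt changes easier to compare.&lt;/p&gt;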

&lt;p&gt;&lt;strong&gt;Layer 3: The Response Shaping Instructions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the layer we iterated on the most. Here's the version that cut our hang-up rate from 34% to 11%:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RESPONSE FORMAT RULES:
- Maximum 25 words per response unless reading back 
  specific information
- Lead with the answer, then context. Never context first.
- End with a clear next step or question
- Use contractions always (it's, we've, that's)
- No lists. No bullet points. No "firstly/secondly"
- If the caller seems confused, ask ONE clarifying question. 
  Not two.

PACING:
- After giving appointment details, pause with "Does that 
  work for you?" before continuing
- Never stack multiple pieces of information in one response
- If you need to relay 3+ data points, break across turns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "lead with the answer" rule alone improved our caller satisfaction scores by 22%. When someone asks "Do you have anything available Thursday?" — the old prompt would say "Let me check our availability for Thursday. We have several options..." The new prompt produces: "Thursday works. We have 10am or 2:30pm with Dr. Patel. Which do you prefer?"&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Techniques That Made the Biggest Difference
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. The "Overheard Conversation" Training Method&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We stopped writing prompts from scratch and started transcribing our best human receptionists. We recorded 40 hours of real receptionist calls (with consent), transcribed them, and identified the specific phrases and patterns that made callers respond positively. Then we encoded those exact patterns into the prompt.&lt;/p&gt;

&lt;p&gt;For example, we noticed that the best receptionists always said "perfect" or "great" after a caller confirmed information, before moving on. Small thing. But when we added &lt;code&gt;After caller confirms any information, acknowledge with a brief affirmation ("perfect", "great", "got it") before your next question&lt;/code&gt; to the prompt, our post-call satisfaction ratings went up 8%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The 25-Word Ceiling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We tested response lengths systematically across 2,000 calls. The data was clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Under 15 words: callers felt the AI was too curt, asked "are you still there?"&lt;/li&gt;
&lt;li&gt;15-25 words: optimal range, lowest hang-up rate, highest task completion&lt;/li&gt;
&lt;li&gt;25-40 words: hang-up rate increased 18%&lt;/li&gt;
&lt;li&gt;Over 40 words: hang-up rate increased 41%, callers started interrupting mid-response&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We hard-coded the 25-word ceiling into the prompt and added a programmatic check that flags any Claude response over 30 words for review. In production, Claude stays under 25 words about 89% of the time with this prompt instruction alone.&lt;/p&gt;
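&lt;p&gt;The programmatic check can be very small. The sketch below is one possible version: it counts whitespace-separated words and sorts each response into one of three buckets. The function names and verdict labels are assumptions for illustration; only the 25-word and 30-word limits come from the text above.&lt;/p&gt;

```typescript
const WORD_CEILING = 25; // target stated in the prompt
const REVIEW_THRESHOLD = 30; // responses above this are flagged for review

// Counts whitespace-separated tokens; "2:30pm" and "Dr." each count as one.
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

type LengthVerdict = "ok" | "over_ceiling" | "flag_for_review";

function checkResponseLength(text: string): LengthVerdict {
  const words = countWords(text);
  if (words > REVIEW_THRESHOLD) return "flag_for_review";
  if (words > WORD_CEILING) return "over_ceiling";
  return "ok";
}

const sample =
  "Thursday works. We have 10am or 2:30pm with Dr. Patel. Which do you prefer?";
console.log(countWords(sample), checkResponseLength(sample)); // 14 "ok"
```

&lt;p&gt;Keeping a middle bucket between 25 and 30 words lets you track drift past the ceiling without sending every borderline response to review.&lt;/p&gt;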

&lt;p&gt;&lt;strong&gt;3. Sentiment-Adaptive Prompting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We inject real-time sentiment into the conversation state (from Deepgram's audio analysis) and include conditional instructions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IF sentiment = frustrated or impatient:
- Be extra concise. Under 15 words if possible.
- Skip pleasantries. Get to the point.
- Offer to transfer to a human: "Would you like me to 
  connect you with someone from our team?"

IF sentiment = confused:
- Slow down. One piece of information at a time.
- Repeat back what you understood.
- Ask a yes/no question to re-anchor.

IF sentiment = positive/chatty:
- Match energy briefly but stay on task.
- One friendly comment maximum, then redirect.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This single addition reduced our human transfer rate from 23% to 18%. The frustrated callers who previously would have demanded a human were getting handled faster, which resolved their frustration before it escalated.&lt;/p&gt;
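&lt;p&gt;One variation worth sketching: instead of sending every IF branch on every turn, the backend can pick only the rules that match the current sentiment, which keeps the prompt shorter. This is an assumption about how you might wire it, not a description of our pipeline, and the sentiment labels and the &lt;code&gt;sentimentBlock&lt;/code&gt; helper are hypothetical.&lt;/p&gt;

```typescript
// Sentiment labels are assumptions; use whatever your analysis step emits.
type Sentiment = "frustrated" | "impatient" | "confused" | "positive" | "neutral";

// Frustrated and impatient callers share the same rules.
const URGENT: string[] = [
  "Be extra concise. Under 15 words if possible.",
  "Skip pleasantries. Get to the point.",
  "Offer to transfer to a human.",
];

const SENTIMENT_INSTRUCTIONS: { [key in Sentiment]: string[] } = {
  frustrated: URGENT,
  impatient: URGENT,
  confused: [
    "Slow down. One piece of information at a time.",
    "Repeat back what you understood.",
    "Ask a yes/no question to re-anchor.",
  ],
  positive: [
    "Match energy briefly but stay on task.",
    "One friendly comment maximum, then redirect.",
  ],
  neutral: [],
};

// Returns the block to append to the prompt, or "" when nothing applies.
function sentimentBlock(sentiment: Sentiment): string {
  const rules = SENTIMENT_INSTRUCTIONS[sentiment];
  if (rules.length === 0) return "";
  return ["TONE ADJUSTMENT:"].concat(rules.map((r) => `- ${r}`)).join("\n");
}
```

&lt;p&gt;The tradeoff is that the model never sees the rules for other sentiments, so a mislabeled caller gets the wrong guidance. Whether that beats the all-branches version is something to test on real calls.&lt;/p&gt;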

&lt;h2&gt;
  
  
  The Failures Worth Mentioning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The persona trap.&lt;/strong&gt; We spent two weeks crafting elaborate backstories for our AI personas — hobbies, favorite coffee orders, years of "experience." None of it mattered. Callers don't ask receptionists about their hobbies. Every token spent on backstory was latency we couldn't afford. We stripped it all out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The politeness overcorrection.&lt;/strong&gt; After getting feedback that the AI sounded "too robotic," we over-indexed on politeness. The AI started saying things like "I'd be absolutely delighted to help you with that!" on every turn. Three callers in one day asked if they were talking to a bot. Ironically, being too polite was the tell. Real receptionists are friendly but efficient, not performatively enthusiastic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The temperature disaster.&lt;/strong&gt; We ran Claude at temperature 0.9 for two days thinking it would sound more "natural." It did — until it started confidently inventing appointment slots that didn't exist and telling one caller that Dr. Patel "usually runs about 10 minutes behind on Tuesdays." Temperature 0.3 is where we live now. Boring is better than wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Voice prompts must be ruthlessly short.&lt;/strong&gt; Every word costs latency and attention. The 25-word ceiling isn't arbitrary: it came out of systematic length testing across 2,000 calls. If your voice AI responses regularly exceed 30 words, you have a prompt problem.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lead with the answer, always.&lt;/strong&gt; Context-first responses are a text pattern that fails completely in voice. Callers want the answer in the first 3 seconds, then they'll listen to details.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Inject real-time state into every turn.&lt;/strong&gt; Treating each turn independently produces conversations that feel like talking to someone with amnesia. The state manager layer is the difference between a demo and a product.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Copy real humans, not chatbots.&lt;/strong&gt; Transcribe your best human operators. The specific words and micro-patterns they use (affirmations, pacing, question framing) are worth more than any prompt engineering framework.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Measure everything, trust nothing.&lt;/strong&gt; We thought longer, more detailed prompts would perform better. The data said the opposite. We thought higher temperature would sound more natural. It created hallucinations. Test with real callers, not vibes.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;We're planning to open-source our prompt benchmarking tool in the next few weeks (that's the Week 9 article). If you're building voice AI — healthcare or otherwise — and want to compare notes on prompt architecture, we'd love to hear about it. Reach out at &lt;a href="mailto:hello@autor.ca"&gt;hello@autor.ca&lt;/a&gt; or visit &lt;a href="https://www.autor.ca" rel="noopener noreferrer"&gt;autor.ca&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>deeplearning</category>
      <category>typescript</category>
    </item>
    <item>
      <title>We Built a Voice AI Receptionist in 8 Weeks — Every Decision We Made and Why</title>
      <dc:creator>Autor Technologies Inc.</dc:creator>
      <pubDate>Mon, 04 May 2026 13:14:16 +0000</pubDate>
      <link>https://dev.to/autor_tech/we-built-a-voice-ai-receptionist-in-8-weeks-every-decision-we-made-and-why-3dc8</link>
      <guid>https://dev.to/autor_tech/we-built-a-voice-ai-receptionist-in-8-weeks-every-decision-we-made-and-why-3dc8</guid>
      <description>&lt;p&gt;Eight weeks. That's how long it took our team at Autor to go from "we should build a voice AI receptionist for healthcare clinics" to handling live patient calls in production. Not a demo. Not a prototype collecting dust on a staging server. A real system answering real phones at real dental and healthcare clinics across Canada.&lt;/p&gt;

&lt;p&gt;The product is called Loquent. It now handles thousands of automated calls per month, 24/7, for healthcare and dental clients. But the interesting part isn't what it does today — it's the 47 decisions we made in those 8 weeks that determined whether it would work at all.&lt;/p&gt;

&lt;p&gt;Here's every major technical and product decision we made, why we made it, and what we'd change if we did it again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Week 1–2: Picking the Stack
&lt;/h2&gt;

&lt;p&gt;The first decision was the voice pipeline. We needed three things: a way to receive phone calls, a way to convert speech to text, and a way to convert text back to speech. Simple enough on paper.&lt;/p&gt;

&lt;p&gt;For telephony, we went with &lt;strong&gt;Twilio&lt;/strong&gt;. Not because it's the cheapest — it's not — but because we'd shipped 50+ products and knew Twilio's edge cases. When you're building something that needs to be in production in 8 weeks, you don't gamble on infrastructure you haven't battle-tested. Twilio's media streams gave us real-time audio over WebSocket, which was critical for keeping latency low.&lt;/p&gt;

&lt;p&gt;For speech-to-text, we chose &lt;strong&gt;Deepgram&lt;/strong&gt;. We tested Google Speech-to-Text, AWS Transcribe, and Deepgram head-to-head with 200 sample audio clips from actual clinic phone calls (with permission). Deepgram won on two axes: accuracy on medical terminology and latency. Their streaming API returned partial transcripts in under 300ms consistently. Google was close on accuracy but added 150–200ms more latency. In voice AI, that 200ms is the difference between a conversation that feels natural and one that feels like talking to a bad VoIP connection.&lt;/p&gt;

&lt;p&gt;For the LLM brain — the part that actually understands what the caller wants and decides what to say — we went with &lt;strong&gt;Anthropic Claude&lt;/strong&gt;. We'd used GPT-4 extensively on other projects, but Claude gave us two things we needed: more predictable instruction-following for complex system prompts, and better handling of the conversational nuance healthcare calls require. When a patient says "I think I need to come in but I'm not sure," Claude was measurably better at handling that ambiguity with the right mix of helpfulness and appropriate medical caution.&lt;/p&gt;

&lt;p&gt;Text-to-speech was &lt;strong&gt;ElevenLabs&lt;/strong&gt;. We tested 6 providers. ElevenLabs had the most natural-sounding voices and critically, their streaming API let us start playing audio before the full response was generated. This shaved another 400ms off perceived latency.&lt;/p&gt;

&lt;p&gt;The backend is &lt;strong&gt;NestJS&lt;/strong&gt; with &lt;strong&gt;TypeScript&lt;/strong&gt;, running on &lt;strong&gt;AWS&lt;/strong&gt; with &lt;strong&gt;Docker&lt;/strong&gt;. Our API layer handles the orchestration between all these services. We use &lt;strong&gt;PostgreSQL&lt;/strong&gt; with &lt;strong&gt;Prisma&lt;/strong&gt; for call logs, appointment data, and conversation history. The frontend dashboard for clinic staff is &lt;strong&gt;Next.js&lt;/strong&gt; on &lt;strong&gt;Vercel&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Total time on stack decisions: 4 days. We spent 3 of those days on benchmarking speech providers because that's where we had the least prior experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Week 3–4: The Latency Problem
&lt;/h2&gt;

&lt;p&gt;Here's what nobody tells you about building voice AI: the technical challenge isn't making it smart. It's making it fast.&lt;/p&gt;

&lt;p&gt;A normal human conversation has about 200ms of silence between turns. Our first end-to-end prototype had 2.4 seconds of latency from the moment the caller stopped speaking to when they heard the AI respond. That's brutal. Callers were hanging up or talking over the AI.&lt;/p&gt;

&lt;p&gt;We broke the latency down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Speech-to-text finalization: ~400ms&lt;/li&gt;
&lt;li&gt;LLM inference (Claude): ~800ms&lt;/li&gt;
&lt;li&gt;Text-to-speech generation: ~600ms&lt;/li&gt;
&lt;li&gt;Network overhead between services: ~600ms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every one of those had to come down. Here's what we did.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speech-to-text&lt;/strong&gt;: We switched from waiting for final transcripts to acting on interim transcripts once their confidence exceeded 0.85. This let us start LLM inference 300ms earlier on average, at the cost of occasionally sending a slightly wrong transcript. We added a correction mechanism that interrupts and re-routes if the final transcript meaningfully differs from the interim one. In practice, this happened on less than 3% of utterances.&lt;/p&gt;
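&lt;p&gt;The early-start logic can be sketched as a small state machine. Everything here is a simplified assumption: the &lt;code&gt;TranscriptEvent&lt;/code&gt; shape is hypothetical rather than any SDK's real payload, and "meaningfully differs" is reduced to a comparison of normalized text, which is cruder than what a production system would want.&lt;/p&gt;

```typescript
// Hypothetical transcript event; real STT SDK payloads differ.
interface TranscriptEvent {
  text: string;
  confidence: number; // 0..1
  isFinal: boolean;
}

const INTERIM_CONFIDENCE = 0.85;

type GateAction = "none" | "start" | "keep" | "restart";

// Lowercases and strips punctuation so "Tuesday." equals "tuesday".
function normalize(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

class EarlyStartGate {
  private startedWith: string | null = null;

  // Returns what the orchestrator should do with the LLM request.
  handle(event: TranscriptEvent): GateAction {
    if (!event.isFinal) {
      if (this.startedWith === null) {
        if (event.confidence > INTERIM_CONFIDENCE) {
          this.startedWith = event.text;
          return "start"; // begin inference on the interim text
        }
      }
      return "none"; // low confidence, or inference already running
    }
    const early = this.startedWith;
    this.startedWith = null; // reset for the next utterance
    if (early === null) return "start"; // no early start happened
    return normalize(early) === normalize(event.text) ? "keep" : "restart";
  }
}
```

&lt;p&gt;A "restart" result is where the correction mechanism kicks in: the in-flight response is cancelled and inference runs again on the final transcript.&lt;/p&gt;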

&lt;p&gt;&lt;strong&gt;LLM inference&lt;/strong&gt;: We couldn't make Claude faster, but we could make it produce less. We restructured our prompts to front-load the critical response content. Instead of "think step by step and then respond," we used a format where Claude would output the spoken response first, then its reasoning. We also aggressively cached common conversation patterns — things like greeting responses, hold requests, and appointment confirmations. About 40% of conversational turns hit the cache.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Text-to-speech&lt;/strong&gt;: Streaming. Instead of generating the full audio clip and then playing it, we streamed audio chunks as they were generated. The caller hears the first word within 200ms of generation starting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network&lt;/strong&gt;: We co-located all services in the same AWS region (ca-central-1, because Canadian healthcare data stays in Canada — more on that later). We also moved from HTTP request/response to persistent WebSocket connections between our services.&lt;/p&gt;

&lt;p&gt;Final latency after optimization: 800ms average. Not perfect, but well within the range where conversations feel natural. Callers stopped hanging up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Week 5–6: Making It Actually Useful
&lt;/h2&gt;

&lt;p&gt;A fast AI that says the wrong thing is worse than a slow AI that says the right thing. Week 5 was about prompt engineering and conversation design.&lt;/p&gt;

&lt;p&gt;We spent two full days sitting in a dental clinic's front office, listening to actual receptionist calls and documenting every type of conversation. We categorized 14 distinct call types, from appointment booking to insurance verification to emergency triage. Each one needed different handling logic.&lt;/p&gt;

&lt;p&gt;The critical insight: we didn't try to make one mega-prompt handle everything. Instead, we built a routing layer. The first few seconds of each call go through a lightweight classifier that determines the call type, then routes to a specialized prompt and conversation flow for that type. This meant each individual prompt could be simpler and more reliable.&lt;/p&gt;
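&lt;p&gt;Here is a minimal sketch of that routing shape. Keyword matching stands in for the lightweight classifier, whose real implementation isn't covered in this post, and the three call types, keyword lists, and prompt strings are illustrative placeholders rather than our actual 14 flows.&lt;/p&gt;

```typescript
// Illustrative subset of call types; the article describes 14.
type RoutedCallType =
  | "appointment_booking"
  | "insurance_verification"
  | "emergency_triage";
type CallType = RoutedCallType | "unknown";

// Order matters: emergency keywords are checked first.
const KEYWORDS: { callType: RoutedCallType; words: string[] }[] = [
  { callType: "emergency_triage", words: ["emergency", "bleeding", "severe pain"] },
  { callType: "insurance_verification", words: ["insurance", "coverage", "claim"] },
  { callType: "appointment_booking", words: ["appointment", "book", "reschedule"] },
];

// Stand-in classifier: first rule with a matching keyword wins.
function classifyCall(openingUtterance: string): CallType {
  const text = openingUtterance.toLowerCase();
  for (const rule of KEYWORDS) {
    if (rule.words.some((w) => text.indexOf(w) !== -1)) return rule.callType;
  }
  return "unknown";
}

// Placeholder prompts; each real flow would have its own focused prompt.
const PROMPTS: { [key in RoutedCallType]: string } = {
  appointment_booking: "Handle booking and rescheduling only.",
  insurance_verification: "Handle insurance questions only.",
  emergency_triage: "Collect details and escalate urgently.",
};

function routeCall(openingUtterance: string): {
  callType: CallType;
  prompt: string | null;
} {
  const callType = classifyCall(openingUtterance);
  // Assumption: unknown calls get no prompt and fall through to a human.
  return { callType, prompt: callType === "unknown" ? null : PROMPTS[callType] };
}
```

&lt;p&gt;Because each prompt only has to cover one call type, you can change the booking flow without retesting insurance or triage.&lt;/p&gt;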

&lt;p&gt;For appointment booking — about 60% of all calls — we integrated directly with the clinic's scheduling software through their API. The AI doesn't just take a message; it actually checks availability and books the appointment in real time. This was the feature that made clinic owners go from "interesting demo" to "shut up and take my money."&lt;/p&gt;

&lt;p&gt;We also built the &lt;strong&gt;HubSpot and Salesforce integrations&lt;/strong&gt; during this phase. Every call gets logged with a full transcript, caller sentiment, call type, and outcome. Clinic managers can see exactly what's happening with their phone lines without listening to recordings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Week 7: The 18% Problem
&lt;/h2&gt;

&lt;p&gt;By week 7, we had something that worked. But 18% of calls were being transferred to human staff because the AI couldn't handle them. We wrote a whole separate article about what those 18% had in common (that's next week's post), but the short version: edge cases around insurance questions, multi-party calls, and callers who were genuinely distressed.&lt;/p&gt;

&lt;p&gt;We made a deliberate decision: we would not try to get that 18% down to zero. Some calls should go to humans. A patient who just got a scary diagnosis and is calling to schedule a follow-up doesn't want to talk to an AI, no matter how good it is. We built robust handoff logic that transfers calls smoothly with full context, so the human receptionist knows exactly what's already been discussed.&lt;/p&gt;

&lt;p&gt;This is one of the decisions I'm most proud of. The temptation in AI product development is to automate everything. But knowing where to draw the line is what makes the product trustworthy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Week 8: Going Live
&lt;/h2&gt;

&lt;p&gt;Production deployment was its own adventure. We did a phased rollout: the AI handled calls only during off-hours for the first clinic, then gradually expanded to business hours over 5 days.&lt;/p&gt;

&lt;p&gt;The things that broke in production that didn't break in testing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Background noise.&lt;/strong&gt; Our test calls were recorded in quiet offices. Real calls come from cars, restaurants, playgrounds. We added a noise gate and retuned our Deepgram configuration with their noise cancellation features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accents and languages.&lt;/strong&gt; Toronto is one of the most multicultural cities in the world. Callers speak English with every accent imaginable, and some prefer to start in another language entirely. We added language detection in the first 3 seconds and route non-English calls to human staff (multilingual AI is on our roadmap).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caller expectations.&lt;/strong&gt; Some callers figured out they were talking to AI and started testing it — asking trick questions, trying to confuse it, or just saying "give me a human." We added explicit handling for these cases. If someone asks for a human, they get one immediately. No persuasion, no "but I can help you with that."&lt;/p&gt;

&lt;h2&gt;
  
  
  What We'd Do Differently
&lt;/h2&gt;

&lt;p&gt;If we were starting Loquent today, three things would change.&lt;/p&gt;

&lt;p&gt;First, we'd invest in &lt;strong&gt;end-to-end latency monitoring from day one&lt;/strong&gt;. We built our monitoring piecemeal and it cost us debugging time when production latency spiked at 3am (we wrote about that too — see our Week 3 article).&lt;/p&gt;

&lt;p&gt;Second, we'd use a &lt;strong&gt;multi-model approach from the start&lt;/strong&gt;. Not every conversational turn needs Claude's full reasoning capability. Simple acknowledgments ("Got it, let me check that for you") could come from a smaller, faster model. We're implementing this now, and it's cutting our average latency by another 150ms.&lt;/p&gt;

&lt;p&gt;Third, we'd build the &lt;strong&gt;analytics dashboard before the AI itself&lt;/strong&gt;. We spent the first two weeks of production flying blind on call quality because our dashboard wasn't ready. The data was being logged but we couldn't see patterns until we built the visualization layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers After 6 Months
&lt;/h2&gt;

&lt;p&gt;Loquent now handles thousands of calls per month across multiple clinics. The 82% automation rate has held steady. Patient satisfaction scores for AI-handled calls are within 5% of human-handled calls. Clinics using Loquent report that their human receptionists spend 60% less time on routine calls and can focus on patients who actually need personal attention.&lt;/p&gt;

&lt;p&gt;We built Loquent with a team of senior engineers, no offshore work, no handoffs between teams. The same people who designed the architecture wrote the code and debugged the production issues. That's how we shipped in 8 weeks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Latency is the make-or-break metric for voice AI.&lt;/strong&gt; Your AI can be the smartest system ever built, but if it takes 2 seconds to respond, callers will hang up. Optimize for speed before you optimize for intelligence.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't pick infrastructure you haven't used before when you're on a tight timeline.&lt;/strong&gt; We chose Twilio over newer, cheaper alternatives because we knew its failure modes. That decision alone probably saved us a week of debugging.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build specialized conversation flows, not one giant prompt.&lt;/strong&gt; A routing layer with focused prompts beats a single do-everything prompt on reliability, latency, and maintainability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Know where to draw the automation line.&lt;/strong&gt; The 18% of calls we route to humans aren't a failure — they're a feature. Trustworthy AI knows its limits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Co-locate everything and measure latency end-to-end.&lt;/strong&gt; Every network hop between services adds latency. In voice AI, those milliseconds are the product experience.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Autor is a Toronto-based AI development studio that builds production AI systems. Loquent, our voice AI platform, handles thousands of automated calls monthly for healthcare clients across Canada.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you're building something similar, we'd love to hear about it. Reach out at &lt;a href="mailto:hello@autor.ca"&gt;hello@autor.ca&lt;/a&gt; or visit &lt;a href="https://www.autor.ca" rel="noopener noreferrer"&gt;autor.ca&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>machinelearning</category>
      <category>typescript</category>
    </item>
    <item>
      <title>We Built a Voice AI Receptionist in 8 Weeks — Every Decision We Made and Why</title>
      <dc:creator>Autor Technologies Inc.</dc:creator>
      <pubDate>Mon, 27 Apr 2026 14:16:58 +0000</pubDate>
      <link>https://dev.to/autor_tech/we-built-a-voice-ai-receptionist-in-8-weeks-every-decision-we-made-and-why-o4b</link>
      <guid>https://dev.to/autor_tech/we-built-a-voice-ai-receptionist-in-8-weeks-every-decision-we-made-and-why-o4b</guid>
      <description>&lt;p&gt;Eight weeks. That's how long it took our team at Autor to go from "we should build a voice AI receptionist for healthcare clinics" to handling live patient calls 24/7. Not a demo. Not a proof of concept. A production system that now processes thousands of automated calls per month for dental and healthcare clients across Canada. Here's every technical and business decision we made along the way, and the reasoning behind each one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Starting Point
&lt;/h2&gt;

&lt;p&gt;A dental clinic in Ontario came to us with a problem we'd heard a dozen times: they were losing patients because nobody answered the phone after hours. Their staff spent 3+ hours per day on calls that followed the same script — confirming appointments, answering insurance questions, routing urgent calls. They wanted automation, but every off-the-shelf solution they'd tried sounded robotic and confused patients.&lt;/p&gt;

&lt;p&gt;We'd already built 40+ AI products at that point. We knew the gap between a voice AI demo and a voice AI that handles real calls from real patients who are sometimes anxious, sometimes angry, and sometimes just confused. We scoped 8 weeks and got to work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Week 1–2: Choosing the Voice Stack
&lt;/h2&gt;

&lt;p&gt;The first decision was the hardest: which speech-to-text and text-to-speech providers to use. We benchmarked four STT options and three TTS options over two weeks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speech-to-text: Deepgram won.&lt;/strong&gt; We tested Deepgram, Google Cloud Speech, AWS Transcribe, and Whisper (self-hosted). Deepgram gave us the best combination of latency and accuracy for Canadian English with diverse accents. Our benchmarks showed Deepgram averaged 180ms first-byte latency versus 320ms for Google Cloud Speech. For a phone conversation, that 140ms difference is the gap between natural and awkward. Whisper was the most accurate but unusable for real-time — even on a GPU instance, streaming latency was 400ms+.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Text-to-speech: ElevenLabs.&lt;/strong&gt; We needed a voice that didn't trigger the "I'm talking to a robot" response. ElevenLabs' Turbo v2 model gave us near-human quality at 150ms latency. We tested with 30 real patients in a blind study — 22 of them didn't realize they were talking to AI until we told them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Telephony: Twilio.&lt;/strong&gt; This wasn't a hard choice. Twilio's Media Streams API gave us bidirectional audio streaming over WebSocket. We'd used it before, understood the edge cases, and knew the Canadian number provisioning was solid. We briefly considered Vonage but their WebSocket implementation had reliability issues in our testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Week 3–4: The Brain — Why We Chose Anthropic Claude Over GPT-4
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. We needed a language model that could handle the core conversation logic — understanding patient intent, managing appointment scheduling, handling insurance questions, and knowing when to transfer to a human.&lt;/p&gt;

&lt;p&gt;We ran both GPT-4 and Anthropic Claude through 200 simulated patient conversations. The results surprised us.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude was better at saying "I don't know."&lt;/strong&gt; In healthcare, making something up is worse than admitting uncertainty. When we threw edge cases at both models — rare insurance scenarios, questions about specific procedures the clinic didn't offer — GPT-4 was more likely to confabulate a plausible-sounding answer. Claude was more likely to say it wasn't sure and offer to connect the patient with staff. For a healthcare application, that behavior is worth its weight in gold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude's instruction-following was more consistent.&lt;/strong&gt; We needed the model to stay strictly within its role as a receptionist. No medical advice, ever. No promises about pricing without checking the database. After prompt engineering both models for a week, Claude held its boundaries more reliably across 1,000+ test conversations.&lt;/p&gt;

&lt;p&gt;We built the conversation engine on Claude with structured tool use. The model calls functions to check appointment availability, look up patient records, and route calls — all through a clean tool-use interface rather than string parsing.&lt;/p&gt;
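
&lt;p&gt;As a rough illustration of what structured tool use looks like, the sketch below declares one tool in the &lt;code&gt;name&lt;/code&gt; / &lt;code&gt;description&lt;/code&gt; / &lt;code&gt;input_schema&lt;/code&gt; shape that Anthropic's tool-use API accepts, then dispatches a tool call to a plain function. The tool name, its fields, and the in-memory availability table are hypothetical stand-ins, not Loquent's actual code.&lt;/p&gt;

```typescript
// Hypothetical tool definition in the name/description/input_schema shape
// used by Anthropic's tool-use API. Field names are illustrative only.
const checkAvailabilityTool = {
  name: "check_availability",
  description: "Return open appointment slots for a provider on a given date.",
  input_schema: {
    type: "object",
    properties: {
      provider: { type: "string" },
      date: { type: "string", description: "ISO date, e.g. 2026-05-04" },
    },
    required: ["provider", "date"],
  },
};

type ToolCall = { name: string; input: { [key: string]: string } };

// Stand-in for the clinic's scheduling system (an assumption for this sketch).
const openSlots: { [key: string]: string[] } = {
  "dr-lee|2026-05-04": ["09:30", "14:00"],
};

// Route a structured tool call to a handler. Unknown tools fall through to
// a human transfer rather than being guessed at.
function dispatchToolCall(call: ToolCall): string {
  switch (call.name) {
    case checkAvailabilityTool.name: {
      const key = `${call.input.provider}|${call.input.date}`;
      const slots = openSlots[key] ?? [];
      return JSON.stringify({ slots });
    }
    default:
      return JSON.stringify({ action: "transfer_to_human" });
  }
}

console.log(
  dispatchToolCall({
    name: "check_availability",
    input: { provider: "dr-lee", date: "2026-05-04" },
  }),
);
```

&lt;p&gt;Because the model returns a tool name plus a JSON input object, the backend never has to infer intent from free text. It only has to validate and execute the call.&lt;/p&gt;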

&lt;h2&gt;
  
  
  Week 5: The Integration Layer
&lt;/h2&gt;

&lt;p&gt;Week 5 was all plumbing. We built the integration layer that connects the voice AI to the clinic's actual systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture: NestJS backend on AWS ECS.&lt;/strong&gt; We chose NestJS because our team thinks in TypeScript and NestJS gives us dependency injection and module structure without the bloat. The service runs on ECS Fargate — we didn't want to manage servers, and Fargate's auto-scaling handles call volume spikes at 9am when every patient calls at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database: PostgreSQL with Prisma ORM.&lt;/strong&gt; Every call gets logged — full transcript, intent classification, actions taken, duration, and outcome. This data is what makes the system get better over time. We chose Prisma because the type safety between our TypeScript code and the database eliminated an entire class of bugs we'd dealt with in past projects using raw SQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CRM integration: HubSpot and Salesforce.&lt;/strong&gt; Most clinics use one or the other. We built adapters for both so the AI receptionist can pull patient history and push call summaries. The HubSpot integration took 3 days. Salesforce took 8. If you've worked with the Salesforce API, you know why.&lt;/p&gt;
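
&lt;p&gt;One way to structure "adapters for both" is a shared interface that the core engine talks to, with one implementation per CRM. The sketch below is a minimal version of that idea. The interface, the payload field names, and the registry are assumptions for illustration and do not reflect the real HubSpot or Salesforce schemas.&lt;/p&gt;

```typescript
// Summary of a finished call, as the core engine might produce it.
interface CallSummary {
  patientId: string;
  intent: string;
  outcome: string;
}

// Shared shape every CRM adapter implements (hypothetical interface).
interface CrmAdapter {
  readonly crmName: string;
  // Convert a call summary into the payload this CRM expects.
  toPayload(summary: CallSummary): { [key: string]: string };
}

// Field names below are placeholders, not real HubSpot or Salesforce fields.
const hubspotAdapter: CrmAdapter = {
  crmName: "hubspot",
  toPayload: (s) => ({ contact_id: s.patientId, note: `${s.intent}: ${s.outcome}` }),
};

const salesforceAdapter: CrmAdapter = {
  crmName: "salesforce",
  toPayload: (s) => ({ ContactId: s.patientId, Subject: s.intent, Description: s.outcome }),
};

const crmAdapters: { [key: string]: CrmAdapter } = {
  hubspot: hubspotAdapter,
  salesforce: salesforceAdapter,
};

// The core engine only depends on the CrmAdapter interface.
function buildCrmPayload(crm: string, summary: CallSummary): { [key: string]: string } {
  const adapter = crmAdapters[crm];
  if (!adapter) throw new Error(`No adapter registered for ${crm}`);
  return adapter.toPayload(summary);
}

console.log(buildCrmPayload("hubspot", { patientId: "p-1", intent: "booking", outcome: "booked" }));
```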

&lt;h2&gt;
  
  
  Week 6: Latency Optimization — The Make-or-Break Week
&lt;/h2&gt;

&lt;p&gt;Week 6 nearly broke us. Our end-to-end latency — from when the patient stops speaking to when the AI starts responding — was averaging 2.1 seconds. That's unacceptable for a phone conversation. Anything over 1.2 seconds and callers start saying "hello?" again.&lt;/p&gt;

&lt;p&gt;Here's what we did to get it under 800ms:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming everything.&lt;/strong&gt; We switched from waiting for complete STT transcription to streaming partial results. As soon as Deepgram gives us a stable partial transcript, we start sending it to Claude. Claude streams its response back, and we start TTS on the first sentence while Claude is still generating the rest. This pipelining cut 600ms off the round trip.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt caching.&lt;/strong&gt; Claude's prompt caching feature was a massive win. Our system prompt is about 2,000 tokens — clinic-specific information, conversation rules, available tools. With prompt caching, that system prompt gets processed once and reused across turns. This alone saved 200-300ms per turn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connection pooling and keep-alive.&lt;/strong&gt; We keep persistent WebSocket connections to Twilio, persistent HTTP/2 connections to Claude's API, and persistent connections to Deepgram. Cold-starting any of these adds 100-200ms.&lt;/p&gt;

&lt;p&gt;After a week of optimization, we hit a median response time of 740ms. The 95th percentile was 1.1 seconds. Patients stopped noticing the delay.&lt;/p&gt;
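
&lt;p&gt;For readers who want to track the same two numbers, here is a small sketch of computing a median and a 95th percentile from per-turn latency samples using the nearest-rank method. The sample values are made up for illustration and are not Loquent's measurements.&lt;/p&gt;

```typescript
// Nearest-rank percentile over a list of latency samples in milliseconds.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  // Nearest rank: the smallest value with at least p% of samples at or below it.
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Invented sample data, one entry per conversational turn.
const turnLatenciesMs = [620, 700, 710, 740, 760, 800, 850, 900, 1000, 1100];

console.log("median:", percentile(turnLatenciesMs, 50));
console.log("p95:", percentile(turnLatenciesMs, 95));
```

&lt;p&gt;Tracking the 95th percentile alongside the median matters because the slowest turns are the ones that prompt a caller to say "hello?" again.&lt;/p&gt;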

&lt;h2&gt;
  
  
  Week 7: Edge Cases and Failure Modes
&lt;/h2&gt;

&lt;p&gt;We spent all of week 7 on what happens when things go wrong. This is the week that separates a demo from a product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if the patient speaks a language the AI doesn't handle?&lt;/strong&gt; We built language detection into the first 3 seconds of the call. If we detect French, Mandarin, or Cantonese (the three most common non-English languages for our Ontario clinics), we route immediately to a human or play a pre-recorded message in that language.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if the AI gets confused?&lt;/strong&gt; We built a confidence scoring system. If Claude's response confidence drops below our threshold for two consecutive turns, the system says "Let me connect you with our team" and transfers the call. No patient should ever be stuck in a loop with a confused AI.&lt;/p&gt;
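
&lt;p&gt;The "two consecutive turns" rule can be sketched as a small counter that resets whenever a turn clears the threshold. How the confidence score itself is produced is not covered here, so the sketch simply takes a number between 0 and 1. The 0.6 threshold and the class name are assumptions.&lt;/p&gt;

```typescript
// Tracks consecutive low-confidence turns and signals when to hand off.
// The default threshold (0.6) and turn limit (2) are illustrative values.
class EscalationTracker {
  private lowStreak = 0;
  private readonly threshold: number;
  private readonly maxLowTurns: number;

  constructor(threshold = 0.6, maxLowTurns = 2) {
    this.threshold = threshold;
    this.maxLowTurns = maxLowTurns;
  }

  // Record one turn. Returns true when the call should go to a human.
  recordTurn(confidence: number): boolean {
    // A confident turn resets the streak; a low one extends it.
    this.lowStreak = this.threshold > confidence ? this.lowStreak + 1 : 0;
    return this.lowStreak >= this.maxLowTurns;
  }
}

const escalation = new EscalationTracker();
console.log(escalation.recordTurn(0.9)); // confident turn
console.log(escalation.recordTurn(0.4)); // first low turn
console.log(escalation.recordTurn(0.3)); // second low turn in a row
```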

&lt;p&gt;&lt;strong&gt;What if Deepgram or Claude goes down?&lt;/strong&gt; Circuit breakers on every external dependency. If STT fails, we fall back to a DTMF menu ("press 1 for appointments"). If the LLM fails, we route to voicemail with a text notification to staff. We tested every failure mode by literally killing services in production during low-traffic hours.&lt;/p&gt;
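
&lt;p&gt;A circuit breaker with a fallback can be sketched in a few lines. The version below is synchronous to keep it short, whereas real STT and LLM calls would be asynchronous. The failure limit, the reset window, and the function names are assumptions for illustration.&lt;/p&gt;

```typescript
// Minimal circuit breaker: after maxFailures consecutive failures the breaker
// opens and callers go straight to the fallback until resetMs has passed.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly maxFailures: number;
  private readonly resetMs: number;

  constructor(maxFailures = 3, resetMs = 30_000) {
    this.maxFailures = maxFailures;
    this.resetMs = resetMs;
  }

  call(primary: () => string, fallback: () => string, now = Date.now()): string {
    if (this.openedAt !== null) {
      // Still inside the reset window: skip the dependency entirely.
      if (this.resetMs > now - this.openedAt) return fallback();
      // Window elapsed: allow one trial call (half-open state).
      this.openedAt = null;
    }
    try {
      const result = primary();
      this.failures = 0;
      return result;
    } catch {
      this.failures += 1;
      if (this.failures >= this.maxFailures) this.openedAt = now;
      return fallback();
    }
  }
}

// Example: a failing STT call falls back to a keypad menu.
const sttBreaker = new CircuitBreaker(2, 30_000);
const failingStt = (): string => {
  throw new Error("STT unavailable");
};
const dtmfMenu = (): string => "dtmf_menu";
console.log(sttBreaker.call(failingStt, dtmfMenu, 0));
```

&lt;p&gt;The point of opening the breaker is that a caller should not wait on a timeout for a dependency that is already known to be down.&lt;/p&gt;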

&lt;p&gt;&lt;strong&gt;What about PIPEDA compliance?&lt;/strong&gt; This is Canada — we have to handle patient data under PIPEDA, not HIPAA. All call recordings are encrypted at rest and in transit. Transcripts are stored in Canadian data centers. We built a data retention policy that auto-deletes recordings after the clinic's specified retention period. We worked with a privacy consultant to ensure our consent flow at the start of each call met PIPEDA requirements.&lt;/p&gt;
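
&lt;p&gt;The retention rule itself is simple date arithmetic. The sketch below selects recordings that have aged past a clinic's retention window. The record shape and the 30-day figure are assumptions, and a real job would also delete the stored audio and write an audit entry.&lt;/p&gt;

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical record shape for a stored call recording.
interface Recording {
  id: string;
  recordedAt: Date;
}

// Return the ids of recordings older than the clinic's retention window.
function expiredRecordingIds(recordings: Recording[], retentionDays: number, now: Date): string[] {
  const cutoff = now.getTime() - retentionDays * MS_PER_DAY;
  return recordings
    .filter((r) => cutoff > r.recordedAt.getTime())
    .map((r) => r.id);
}

const sampleRecordings: Recording[] = [
  { id: "rec-old", recordedAt: new Date("2026-01-15T00:00:00Z") },
  { id: "rec-new", recordedAt: new Date("2026-03-20T00:00:00Z") },
];

console.log(expiredRecordingIds(sampleRecordings, 30, new Date("2026-04-01T00:00:00Z")));
```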

&lt;h2&gt;
  
  
  Week 8: Launch and the First 1,000 Calls
&lt;/h2&gt;

&lt;p&gt;We launched on a Thursday at 5pm — right when the clinic closed. The AI receptionist would handle all after-hours calls for the weekend as a soft launch.&lt;/p&gt;

&lt;p&gt;The first weekend: 127 calls. 89 handled completely by AI. 24 transferred to the on-call number (correctly — these were urgent or complex). 14 hung up before the AI could help. That's a 70% full-automation rate on day one.&lt;/p&gt;

&lt;p&gt;Within the first month, the automation rate climbed to 82% as we tuned prompts based on real call data. The clinic saved an estimated 45 staff-hours per month. More importantly, they stopped losing after-hours patients to competitors who answered their phones.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with latency budgets, not features.&lt;/strong&gt; We should have set our 800ms latency target on day one and built every component to that budget. Instead, we built features first and then scrambled in week 6 to optimize. The architecture would have been cleaner if latency had been a first-class constraint from the start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build the monitoring dashboard earlier.&lt;/strong&gt; We didn't have real-time call monitoring until week 7. That meant weeks 5 and 6 were partially blind. Now, every Loquent deployment gets a monitoring dashboard on day one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test with real patients sooner.&lt;/strong&gt; Our simulated conversations, no matter how good, didn't capture the way real patients talk on the phone. They pause mid-sentence. They talk to someone else in the room. They put the phone down and come back. We caught these patterns in week 8 when we should have been finding them in week 4.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Latency is the product.&lt;/strong&gt; In voice AI, response time determines whether your system feels like a helpful receptionist or an annoying robot. Budget for it from day one.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pick models for their failure modes, not their best cases.&lt;/strong&gt; Claude won over GPT-4 not because it was smarter, but because it failed more gracefully — admitting uncertainty instead of making things up.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Healthcare voice AI in Canada is viable right now.&lt;/strong&gt; PIPEDA compliance is manageable, Canadian data residency options exist for all major cloud providers, and patients are more accepting of AI receptionists than most people assume.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The 80/20 rule applies hard.&lt;/strong&gt; Getting to 80% automation was 3 weeks of work. Getting from 80% to 82% was another 4 weeks of prompt tuning and edge case handling. Plan your timeline accordingly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build the transfer path first.&lt;/strong&gt; The AI knowing when to hand off to a human is more important than handling every scenario. A graceful transfer builds trust; a confused AI destroys it.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;This is the system we turned into Loquent, our production voice AI platform. It now serves multiple healthcare and dental clients across Canada, handling thousands of calls per month.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you're building something similar, we'd love to hear about it. Reach out at &lt;a href="mailto:hello@autor.ca"&gt;hello@autor.ca&lt;/a&gt; or visit &lt;a href="https://www.autor.ca" rel="noopener noreferrer"&gt;autor.ca&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>machinelearning</category>
      <category>typescript</category>
    </item>
    <item>
      <title>We Analyzed 10,000 Automated Healthcare Voice Calls — Here's What We Found</title>
      <dc:creator>Autor Technologies Inc.</dc:creator>
      <pubDate>Tue, 31 Mar 2026 16:24:13 +0000</pubDate>
      <link>https://dev.to/autor_tech/we-analyzed-10000-automated-healthcare-voice-calls-heres-what-we-found-32me</link>
      <guid>https://dev.to/autor_tech/we-analyzed-10000-automated-healthcare-voice-calls-heres-what-we-found-32me</guid>
      <description>&lt;p&gt;Last October, we hit a milestone at Autor that I didn't see coming: Loquent, our production voice AI platform, processed its 10,000th automated healthcare call. Instead of celebrating, we did what any team of engineers would do — we pulled the data, locked ourselves in a room for a week, and tore apart every single pattern we could find.&lt;/p&gt;

&lt;p&gt;What we discovered changed how we build voice AI. Some of it confirmed our assumptions. Most of it didn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;For context, Loquent handles automated calls for healthcare and dental clinics across Canada. We're talking appointment scheduling, confirmations, cancellations, insurance verification questions, and general intake routing. The system runs 24/7 on a stack built with Twilio for telephony, Anthropic Claude for conversation intelligence, Deepgram for speech-to-text, and ElevenLabs for text-to-speech. We built the first version in under 8 weeks and have been iterating on it for the past six months.&lt;/p&gt;

&lt;p&gt;The 10,000 calls in this dataset span 14 clinic clients — a mix of dental offices, family practices, and specialist clinics in Ontario and British Columbia. Call durations ranged from 12 seconds (hang-ups) to 14 minutes (complex scheduling with insurance questions). The median call was 2 minutes and 38 seconds.&lt;/p&gt;

&lt;p&gt;Here's what the data told us.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding 1: 73% of Calls Follow Just 4 Patterns
&lt;/h2&gt;

&lt;p&gt;We categorized every call by intent. Out of the dozens of potential reasons someone calls a clinic, four patterns dominated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Appointment booking: 31%&lt;/li&gt;
&lt;li&gt;Appointment confirmation/change: 24%&lt;/li&gt;
&lt;li&gt;Cancellation: 11%&lt;/li&gt;
&lt;li&gt;"Am I covered for this?" (insurance/billing questions): 7%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's 73% of all inbound call volume handled by four well-defined flows. The remaining 27% was a grab bag — prescription refill requests, referral follow-ups, directions to the clinic, and a surprising number of people just wanting to talk to "a real person" about nothing specific.&lt;/p&gt;

&lt;p&gt;This matters because it means you don't need a general-purpose conversational AI to handle the majority of healthcare front-desk calls. You need four really good, tightly scoped flows with clean handoff logic for everything else. We spent months trying to make Loquent handle every possible conversation gracefully. The data told us to stop doing that and instead make those four flows bulletproof.&lt;/p&gt;
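
&lt;p&gt;In code, "four tightly scoped flows with clean handoff logic" can be as plain as a lookup table with a default. The intent labels and flow names below are hypothetical. The only thing taken from the data is the split into four core flows plus a handoff.&lt;/p&gt;

```typescript
type FlowName =
  | "booking"
  | "confirm_or_change"
  | "cancellation"
  | "coverage_question"
  | "human_handoff";

// The four high-volume patterns map to dedicated flows.
// Intent labels are invented for this sketch.
const coreFlows: { [intent: string]: FlowName } = {
  book_appointment: "booking",
  confirm_appointment: "confirm_or_change",
  change_appointment: "confirm_or_change",
  cancel_appointment: "cancellation",
  insurance_question: "coverage_question",
};

// Anything outside the four core flows is handed off instead of improvised.
function routeIntent(intent: string): FlowName {
  return coreFlows[intent] ?? "human_handoff";
}

console.log(routeIntent("book_appointment"), routeIntent("prescription_refill"));
```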

&lt;h2&gt;
  
  
  Finding 2: Latency Tolerance Drops Off a Cliff at 1.8 Seconds
&lt;/h2&gt;

&lt;p&gt;We measured caller drop-off rates against our system's response latency — the time between when a caller finishes speaking and when the AI begins its response. The data was clear: at 1.2 seconds or less, drop-off rates were near zero. Between 1.2 and 1.8 seconds, drop-off crept up slightly. Above 1.8 seconds, we saw a cliff. Callers either hung up or started talking over the AI, derailing the conversation.&lt;/p&gt;

&lt;p&gt;1.8 seconds. That's your budget for the entire pipeline: speech-to-text transcription, LLM inference, text-to-speech generation, and audio delivery back through Twilio. In practice, this means we run Deepgram's streaming transcription (adds ~300ms), Claude Haiku for most routine responses (adds ~400-600ms), and ElevenLabs with their Turbo v2 model (adds ~350ms). Run back to back, those stages add up to roughly 1,050-1,250ms, which leaves about 550-750ms for network overhead and audio delivery before the 1.8-second cliff, and essentially no slack against the 1.2-second mark where drop-off starts to creep up.&lt;/p&gt;

&lt;p&gt;For complex queries where we need Claude Sonnet's reasoning — like disambiguating between similar appointment types or handling multi-step insurance questions — we've built a "thinking buffer" that plays a natural filler phrase ("Let me check that for you...") to buy an extra 2-3 seconds. This single trick reduced our complex-query drop-off rate by 41%.&lt;/p&gt;
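
&lt;p&gt;The routing decision behind the "thinking buffer" can be sketched as a small planning function: routine turns go to the faster model with no filler, while complex turns go to the larger model and play a filler phrase while it works. How a query gets classified as complex is not shown, and the model labels are placeholders, not real model identifiers.&lt;/p&gt;

```typescript
type QueryKind = "routine" | "complex";

interface ResponsePlan {
  model: "haiku" | "sonnet"; // placeholder labels, not API model ids
  fillerPhrase: string | null; // spoken immediately while the model works
}

// Decide which model answers and whether to cover the wait with a filler.
function planResponse(kind: QueryKind): ResponsePlan {
  if (kind === "complex") {
    return { model: "sonnet", fillerPhrase: "Let me check that for you..." };
  }
  return { model: "haiku", fillerPhrase: null };
}

console.log(planResponse("routine"));
console.log(planResponse("complex"));
```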

&lt;h2&gt;
  
  
  Finding 3: Morning Callers Are 2.3x More Patient Than Afternoon Callers
&lt;/h2&gt;

&lt;p&gt;This one surprised us. We segmented call behavior by time of day and found a pattern so consistent it changed our system design.&lt;/p&gt;

&lt;p&gt;Callers between 8am and 11am had an average interaction length of 3 minutes 12 seconds and tolerated longer AI response times before dropping off. Callers between 2pm and 5pm averaged 1 minute 54 seconds and were significantly more likely to request a human transfer.&lt;/p&gt;

&lt;p&gt;Our theory: morning callers are often calling during a planned moment — they're at their desk, coffee in hand, checking things off a list. Afternoon callers are squeezing in a call between meetings or during a break. They want speed.&lt;/p&gt;

&lt;p&gt;We now dynamically adjust Loquent's behavior based on time of day. Afternoon calls get shorter confirmations, faster routing, and more aggressive escalation to human staff. Morning calls get slightly more conversational, exploratory flows. This alone improved our afternoon completion rate by 18%.&lt;/p&gt;
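
&lt;p&gt;A minimal sketch of time-aware behavior is a function that maps the local hour to a call profile. The two windows match the segments above (8am to 11am, 2pm to 5pm). The profile fields and their values are invented for illustration.&lt;/p&gt;

```typescript
interface CallProfile {
  label: "morning" | "afternoon" | "default";
  maxConfirmationWords: number; // shorter confirmations for rushed callers
  escalateAfterConfusedTurns: number; // lower means faster handoff
}

// Illustrative settings only; the real tuning values are not published.
const MORNING: CallProfile = { label: "morning", maxConfirmationWords: 30, escalateAfterConfusedTurns: 3 };
const AFTERNOON: CallProfile = { label: "afternoon", maxConfirmationWords: 15, escalateAfterConfusedTurns: 1 };
const DEFAULT_PROFILE: CallProfile = { label: "default", maxConfirmationWords: 20, escalateAfterConfusedTurns: 2 };

// Pick a profile from the clinic-local hour (0-23).
function profileForHour(hour: number): CallProfile {
  // Half-open window check: from is included, to is excluded.
  const inWindow = (from: number, to: number): boolean => (hour >= from ? to > hour : false);
  if (inWindow(8, 11)) return MORNING;
  if (inWindow(14, 17)) return AFTERNOON;
  return DEFAULT_PROFILE;
}

console.log(profileForHour(9).label, profileForHour(15).label, profileForHour(12).label);
```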

&lt;h2&gt;
  
  
  Finding 4: The "Second Sentence" Problem
&lt;/h2&gt;

&lt;p&gt;Here's a pattern we almost missed. In 34% of calls where the AI's first response was correct and helpful, the caller still asked to speak to a human. We dug into the transcripts and found the issue wasn't accuracy — it was the AI's second sentence.&lt;/p&gt;

&lt;p&gt;The AI would correctly answer the question, then add a follow-up that felt robotic or presumptuous. Things like: "Is there anything else I can help you with today?" delivered in the exact same cadence as a phone tree. Or worse, immediately pivoting to: "I can also help you with appointment scheduling, prescription inquiries, or billing questions."&lt;/p&gt;

&lt;p&gt;Real receptionists don't do this. They pause. They let the caller process. They read the room.&lt;/p&gt;

&lt;p&gt;We rewrote our prompt engineering to include explicit "breath" instructions — moments where the AI generates a brief pause and waits for the caller to lead. We also cut the generic menu-style follow-ups entirely. The result: human transfer requests after successful first responses dropped from 34% to 12%.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding 5: 6% of Callers Will Try to Break Your AI (And That's Fine)
&lt;/h2&gt;

&lt;p&gt;We identified a consistent 6% of callers across all clinics who deliberately tested the AI. They'd ask trick questions, try to confuse it, speak in fragments, or demand things the AI clearly couldn't do. We affectionately call these "stress-test callers" internally.&lt;/p&gt;

&lt;p&gt;Early on, we tried to make the system handle these gracefully — clever redirects, patient re-prompts, escalation paths. We burned weeks on it. The data showed us something freeing: these callers almost always called back within 24 hours and had a normal, productive interaction the second time. They were curious, not hostile.&lt;/p&gt;

&lt;p&gt;We now let these calls fail gracefully with a simple "I want to make sure you get the help you need — let me connect you with the team" after two confused exchanges. No heroics. Our engineering time is better spent on the 94%.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Changed For Us
&lt;/h2&gt;

&lt;p&gt;After this analysis, we made three architectural decisions that shaped Loquent's next iteration:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Flow specialization over generalization.&lt;/strong&gt; We rebuilt our four core flows from scratch, each with its own optimized prompt chain, latency budget, and escalation logic. The "general conversation" handler became a thin routing layer, not a Swiss Army knife.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Time-aware behavior.&lt;/strong&gt; Loquent now adapts its conversational style, response length, and escalation thresholds based on time of day. The morning version and the afternoon version are meaningfully different systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Silence as a feature.&lt;/strong&gt; We invested heavily in teaching the AI when not to talk. Strategic pauses, shorter confirmations, and eliminating the "anything else?" reflex made the system feel less like a phone tree and more like a receptionist who respects your time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers After the Rebuild
&lt;/h2&gt;

&lt;p&gt;Six weeks after implementing these changes across all 14 clinics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overall call completion rate: 74% → 82%&lt;/li&gt;
&lt;li&gt;Average call duration: 2:38 → 2:11&lt;/li&gt;
&lt;li&gt;Human transfer requests: 22% → 14%&lt;/li&gt;
&lt;li&gt;Client satisfaction (post-call survey): 3.4/5 → 4.1/5&lt;/li&gt;
&lt;li&gt;Peak hour handling capacity: up 23% (same infrastructure cost)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these improvements came from a better model or a fancier tech stack. They came from reading our own data honestly and being willing to simplify.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Most healthcare voice AI problems are scope problems, not intelligence problems.&lt;/strong&gt; You don't need AGI to book a dental cleaning. You need four flows that work perfectly and clean handoffs for everything else.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Latency isn't a "nice to have" metric — it's the metric.&lt;/strong&gt; Every millisecond above 1.8 seconds costs you callers. Architect your entire pipeline around this constraint from day one.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Time of day changes caller behavior more than you'd expect.&lt;/strong&gt; Build your system to adapt, or you're leaving completion rate on the table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The AI's second sentence matters more than the first.&lt;/strong&gt; Getting the answer right is table stakes. How the AI handles the moment after the answer determines whether the caller stays or bounces.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Not every edge case deserves engineering time.&lt;/strong&gt; The 6% who stress-test your system will come back. Focus your effort on the 94% who just want their appointment booked.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;If you're building something similar, we'd love to hear about it. Reach out at &lt;a href="mailto:hello@autor.ca"&gt;hello@autor.ca&lt;/a&gt; or visit &lt;a href="https://www.autor.ca" rel="noopener noreferrer"&gt;autor.ca&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>webdev</category>
      <category>typescript</category>
    </item>
    <item>
      <title>How We Built a Production Voice AI Agent in Under 8 Weeks (With Twilio + Anthropic Claude)</title>
      <dc:creator>Autor Technologies Inc.</dc:creator>
      <pubDate>Tue, 24 Mar 2026 23:36:05 +0000</pubDate>
      <link>https://dev.to/autor_tech/how-we-built-a-production-voice-ai-agent-in-under-8-weeks-with-twilio-anthropic-claude-8n8</link>
      <guid>https://dev.to/autor_tech/how-we-built-a-production-voice-ai-agent-in-under-8-weeks-with-twilio-anthropic-claude-8n8</guid>
      <description>&lt;p&gt;Earlier this year, we shipped Loquent — a production conversational AI platform that handles real phone calls, books appointments, processes patient follow-ups, and verifies insurance — completely autonomously, 24/7.&lt;/p&gt;

&lt;p&gt;We built it in under 8 weeks.&lt;/p&gt;

&lt;p&gt;This isn't a tutorial about building a toy chatbot. This is a breakdown of what it actually takes to get voice AI into production — the architecture decisions, the hard lessons, and the specific stack we used.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem We Were Solving
&lt;/h2&gt;

&lt;p&gt;Healthcare and dental clinics miss a massive percentage of inbound calls. Front desks get overwhelmed during peak hours. Patients call after hours and get voicemail. Appointments slip through.&lt;/p&gt;

&lt;p&gt;The ask: build an AI system that could handle inbound and outbound calls — booking appointments, confirming details, following up with patients, verifying insurance — without a human in the loop.&lt;/p&gt;

&lt;p&gt;Not a demo. Not a prototype. Production. Real patients. Real clinics. Real calls.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;p&gt;Before diving into architecture, here's what we ended up with:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Voice / Telephony&lt;/td&gt;
&lt;td&gt;Twilio Voice + Media Streams&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speech-to-Text&lt;/td&gt;
&lt;td&gt;Deepgram Streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;td&gt;Anthropic Claude (claude-sonnet)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text-to-Speech&lt;/td&gt;
&lt;td&gt;ElevenLabs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backend&lt;/td&gt;
&lt;td&gt;NestJS + Python&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frontend Dashboard&lt;/td&gt;
&lt;td&gt;Next.js + React&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database&lt;/td&gt;
&lt;td&gt;PostgreSQL + Prisma&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Queue&lt;/td&gt;
&lt;td&gt;Redis + BullMQ&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud&lt;/td&gt;
&lt;td&gt;AWS (ECS, RDS, ElastiCache)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Integrations&lt;/td&gt;
&lt;td&gt;HubSpot, Salesforce, Zendesk&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We evaluated OpenAI's Realtime API, but at the time its latency and reliability at production call volumes weren't where we needed them. We went with the Deepgram → Claude → ElevenLabs pipeline, which gave us full control over each layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Caller dials in
     ↓
Twilio receives call → webhook fires to our backend
     ↓
Twilio Media Stream opens WebSocket to our server
     ↓
Audio chunks stream in real-time → Deepgram STT
     ↓
Transcript fed to Claude with conversation context + clinic data
     ↓
Claude response → ElevenLabs TTS → audio streamed back via Twilio
     ↓
Actions extracted (book appointment, send confirmation, update CRM)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The whole loop needs to complete in under 1.5 seconds to feel natural. That's the hard constraint everything else is built around.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Latency Problem
&lt;/h2&gt;

&lt;p&gt;This was the hardest engineering challenge. Users tolerate maybe 1–2 seconds of silence before it feels broken. We were dealing with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deepgram STT: ~200–400ms&lt;/li&gt;
&lt;li&gt;Claude inference: ~400–800ms&lt;/li&gt;
&lt;li&gt;ElevenLabs TTS first-chunk: ~300–500ms&lt;/li&gt;
&lt;li&gt;Twilio playback: ~100ms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run back to back, that's roughly 1.0 to 1.8 seconds before any network overhead.&lt;/p&gt;
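
&lt;p&gt;To make the arithmetic explicit, here is a small sketch that sums the per-stage estimates listed above and compares the result with the 1.5-second target. It assumes the stages run strictly one after another, which is the worst case that streaming is meant to avoid.&lt;/p&gt;

```typescript
interface StageEstimate {
  stage: string;
  minMs: number;
  maxMs: number;
}

// Per-stage estimates taken from the list above.
const pipelineStages: StageEstimate[] = [
  { stage: "Deepgram STT", minMs: 200, maxMs: 400 },
  { stage: "Claude inference", minMs: 400, maxMs: 800 },
  { stage: "ElevenLabs TTS first chunk", minMs: 300, maxMs: 500 },
  { stage: "Twilio playback", minMs: 100, maxMs: 100 },
];

// Sum best-case and worst-case latency, assuming no overlap between stages.
function totalRange(items: StageEstimate[]): { minMs: number; maxMs: number } {
  return items.reduce(
    (acc, s) => ({ minMs: acc.minMs + s.minMs, maxMs: acc.maxMs + s.maxMs }),
    { minMs: 0, maxMs: 0 },
  );
}

const BUDGET_MS = 1500; // the "under 1.5 seconds" target
const pipelineTotal = totalRange(pipelineStages);
console.log(pipelineTotal, "worst case over budget:", pipelineTotal.maxMs > BUDGET_MS);
```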

&lt;p&gt;&lt;strong&gt;What we did:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Stream everything.&lt;/strong&gt; We don't wait for a complete Claude response before starting TTS. The moment Claude starts outputting tokens, we pipe them to ElevenLabs sentence by sentence. The first audio chunk starts playing while Claude is still generating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. End-of-utterance detection.&lt;/strong&gt; We use Deepgram's endpointing, but also built our own silence detection layer. Aggressive endpointing cuts off users mid-sentence. Too conservative and the response feels laggy. We tuned this per use case — a patient confirming an appointment has different speech patterns than one describing symptoms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Claude prompt engineering for speed.&lt;/strong&gt; Verbose responses kill latency. We prompt Claude to be concise, speak like a receptionist, and never use filler phrases that add tokens without value. We also give it explicit response format guidance — short sentences, direct answers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Pre-warm everything.&lt;/strong&gt; ElevenLabs has cold start latency. We keep connections warm with keepalive pings. Same with our database pool.&lt;/p&gt;

&lt;p&gt;With all of this in place, we got average response latency down to ~900ms. Occasionally spikes to 1.4s. Feels natural.&lt;/p&gt;
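
&lt;p&gt;The sentence-by-sentence piping from step 1 can be sketched as a small buffer that collects streamed tokens and emits each sentence as soon as it is complete. This is an illustrative version, not the production code. It treats any period, exclamation mark, or question mark followed by whitespace as a boundary, so abbreviations such as "Dr. Lee" would be split incorrectly and need extra handling.&lt;/p&gt;

```typescript
// Buffers streamed LLM tokens and emits complete sentences as they end,
// so TTS can start on the first sentence while later tokens still arrive.
class SentenceChunker {
  private buffer = "";

  // Feed one token; returns any sentences completed by it.
  push(token: string): string[] {
    this.buffer += token;
    const sentences: string[] = [];
    // A sentence ends at ., ! or ? followed by whitespace.
    const boundary = /[.!?]\s+/;
    let match = boundary.exec(this.buffer);
    while (match !== null) {
      const end = match.index + 1; // position just after the punctuation
      sentences.push(this.buffer.slice(0, end).trim());
      // Drop the emitted sentence and the whitespace that followed it.
      this.buffer = this.buffer.slice(match.index + match[0].length);
      match = boundary.exec(this.buffer);
    }
    return sentences;
  }

  // Call when the stream ends to emit whatever is left.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? [rest] : [];
  }
}

const chunker = new SentenceChunker();
for (const token of ["You're booked", " for Tuesday.", " See", " you then!"]) {
  for (const sentence of chunker.push(token)) console.log("to TTS:", sentence);
}
for (const sentence of chunker.flush()) console.log("to TTS:", sentence);
```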




&lt;h2&gt;
  
  
  Designing the Claude Prompt
&lt;/h2&gt;

&lt;p&gt;This took more iteration than the infrastructure. The system prompt has to do a lot:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are [Clinic Name]'s AI receptionist. Your name is [Agent Name].

You are speaking with a patient over the phone. Be warm, professional, 
and concise. Speak in short sentences. Never say "Certainly!" or 
"Absolutely!" or similar filler phrases.

CLINIC CONTEXT:
- Name: [Clinic Name]
- Hours: [Hours]
- Providers: [Provider list with availability]
- Services: [Service list]

CURRENT PATIENT CONTEXT:
[Injected dynamically: patient name, upcoming appointments, 
last visit, insurance status]

AVAILABLE ACTIONS:
[JSON schema of actions Claude can trigger: book_appointment, 
cancel_appointment, send_confirmation, transfer_to_human, etc.]

RULES:
- If you cannot handle the request, transfer to a human. Never guess.
- Confirm all bookings by repeating back date, time, and provider.
- Never discuss billing details — transfer to billing team.
- If the patient seems distressed, offer to transfer immediately.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;strong&gt;Claude needs to know what it can and cannot do&lt;/strong&gt;. An AI that tries to handle everything and fails is worse than one that gracefully transfers when out of scope.&lt;/p&gt;

&lt;p&gt;We use Claude's tool use (function calling) for actions — booking, cancelling, sending confirmations. This gives us clean structured outputs instead of trying to parse intent from natural language.&lt;/p&gt;
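&lt;p&gt;As a rough sketch of what tool use looks like in practice, the example below defines one tool in the shape the Anthropic Messages API expects (&lt;code&gt;name&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;, &lt;code&gt;input_schema&lt;/code&gt;) and routes a &lt;code&gt;tool_use&lt;/code&gt; content block to a handler. The parameter names, the handler body, and the &lt;code&gt;dispatch_tool_use&lt;/code&gt; helper are assumptions for illustration; no API call is made here.&lt;/p&gt;

```python
import json
from typing import Any, Callable, Dict

# Tool definition: a name, a description, and a JSON Schema for the input.
# The specific fields (provider, start_time) are hypothetical.
BOOK_APPOINTMENT_TOOL = {
    "name": "book_appointment",
    "description": "Book an appointment for the current patient.",
    "input_schema": {
        "type": "object",
        "properties": {
            "provider": {"type": "string"},
            "start_time": {"type": "string", "description": "ISO 8601 start time"},
        },
        "required": ["provider", "start_time"],
    },
}


def book_appointment(provider: str, start_time: str) -> Dict[str, Any]:
    # Placeholder handler; a real one would call the practice management system.
    return {"status": "booked", "provider": provider, "start_time": start_time}


HANDLERS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "book_appointment": book_appointment,
}


def dispatch_tool_use(block: Dict[str, Any]) -> Dict[str, Any]:
    """Run the handler for a tool_use block and wrap the result as a tool_result."""
    handler = HANDLERS.get(block["name"])
    if handler is None:
        result = {"status": "error", "reason": "unknown tool"}
    else:
        result = handler(**block["input"])
    return {
        "type": "tool_result",
        "tool_use_id": block["id"],
        "content": json.dumps(result),
    }
```

&lt;p&gt;Because the model returns a named tool and a structured input, the application never has to infer "book" versus "cancel" from free text.&lt;/p&gt;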




&lt;h2&gt;
  
  
  Multi-Tenant Architecture
&lt;/h2&gt;

&lt;p&gt;Loquent serves multiple clinics, each with their own:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Phone number(s)&lt;/li&gt;
&lt;li&gt;Providers and availability&lt;/li&gt;
&lt;li&gt;Booking rules and constraints&lt;/li&gt;
&lt;li&gt;Brand voice and agent name&lt;/li&gt;
&lt;li&gt;CRM integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system prompt is dynamically assembled per-call using the clinic's configuration. We built a dashboard where clinic admins can update their agent's name, working hours, provider list, and escalation rules without touching code.&lt;/p&gt;
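&lt;p&gt;A minimal sketch of per-call prompt assembly might look like the following. The &lt;code&gt;ClinicConfig&lt;/code&gt; fields and the &lt;code&gt;build_system_prompt&lt;/code&gt; function are hypothetical names chosen to mirror the prompt template shown earlier, and only part of that template is reproduced.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClinicConfig:
    # Hypothetical per-clinic settings, e.g. as edited by an admin in a dashboard.
    name: str
    agent_name: str
    hours: str
    providers: List[str] = field(default_factory=list)
    services: List[str] = field(default_factory=list)


def build_system_prompt(clinic: ClinicConfig, patient_context: str) -> str:
    """Assemble the system prompt for one call from the clinic's configuration."""
    lines = [
        f"You are {clinic.name}'s AI receptionist. Your name is {clinic.agent_name}.",
        "",
        "CLINIC CONTEXT:",
        f"- Name: {clinic.name}",
        f"- Hours: {clinic.hours}",
        f"- Providers: {', '.join(clinic.providers)}",
        f"- Services: {', '.join(clinic.services)}",
        "",
        "CURRENT PATIENT CONTEXT:",
        patient_context,
    ]
    return "\n".join(lines)
```

&lt;p&gt;Keeping the configuration as data means a change to hours or providers takes effect on the next call without a deploy.&lt;/p&gt;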

&lt;p&gt;Each clinic gets isolated data — separate database schemas, separate API credentials, separate call logs.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Integrations Layer
&lt;/h2&gt;

&lt;p&gt;The hard part isn't the AI — it's making the AI useful by connecting it to real data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Appointment booking:&lt;/strong&gt; We built adapters for common dental/healthcare practice management systems. The adapter pattern let us add new integrations without touching the core engine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CRM sync:&lt;/strong&gt; After every call, we write a structured summary back to HubSpot or Salesforce — caller ID, intent, outcome, booking details, and a Claude-generated call summary. This is actually one of the features clinics love most.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confirmation messages:&lt;/strong&gt; Post-call, we trigger SMS/email confirmations via Twilio Messaging and SendGrid. Patients get a confirmation within 30 seconds of booking.&lt;/p&gt;
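&lt;p&gt;The adapter pattern mentioned for appointment booking can be sketched as below. The &lt;code&gt;BookingAdapter&lt;/code&gt; interface, its method names, and the in-memory stand-in are all assumptions for illustration; a real adapter would wrap a specific practice management system's API.&lt;/p&gt;

```python
from typing import Dict, List, Optional, Protocol


class BookingAdapter(Protocol):
    """Interface the core engine depends on (hypothetical method names)."""

    def list_slots(self, provider: str, date: str) -> List[str]: ...

    def book(self, provider: str, slot: str, patient_id: str) -> str: ...


class InMemoryAdapter:
    """Stand-in for a real integration, useful for tests and local development."""

    def __init__(self, slots: Dict[str, List[str]]) -> None:
        self._slots = slots

    def list_slots(self, provider: str, date: str) -> List[str]:
        # Slots are ISO 8601 strings, so a date prefix match selects one day.
        return [s for s in self._slots.get(provider, []) if s.startswith(date)]

    def book(self, provider: str, slot: str, patient_id: str) -> str:
        # Remove the slot so it cannot be booked twice; return a confirmation id.
        self._slots[provider].remove(slot)
        return f"{patient_id}:{provider}:{slot}"


def book_first_available(
    adapter: BookingAdapter, provider: str, date: str, patient_id: str
) -> Optional[str]:
    """Core-engine logic written only against the interface."""
    slots = adapter.list_slots(provider, date)
    if not slots:
        return None
    return adapter.book(provider, slots[0], patient_id)
```

&lt;p&gt;Adding support for another system then means writing one more class with the same two methods; &lt;code&gt;book_first_available&lt;/code&gt; stays unchanged.&lt;/p&gt;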




&lt;h2&gt;
  
  
  What Broke in Production (And How We Fixed It)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem 1: Callers interrupting the AI mid-sentence&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Users naturally interrupt. The AI was finishing its sentence before responding to the interruption, which felt robotic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Barge-in detection. When Deepgram detects speech while TTS is playing, we immediately stop audio playback, flush the TTS buffer, and re-run inference with the new input. Feels much more natural.&lt;/p&gt;
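&lt;p&gt;The stop-and-flush part of barge-in can be sketched as a small state holder. This is an assumed design, not the production implementation: the &lt;code&gt;PlaybackController&lt;/code&gt; class and its turn counter are hypothetical, and the counter is one way to discard audio chunks that arrive late from an interrupted TTS stream.&lt;/p&gt;

```python
from collections import deque
from typing import Deque


class PlaybackController:
    """Tracks queued TTS audio and drops it when the caller barges in."""

    def __init__(self) -> None:
        self.turn_id = 0                    # incremented on every interruption
        self.buffer: Deque[bytes] = deque()  # audio chunks waiting to be played
        self.playing = False

    def enqueue_audio(self, turn_id: int, chunk: bytes) -> bool:
        """Queue a chunk unless it belongs to a turn that was interrupted."""
        if turn_id != self.turn_id:
            return False
        self.buffer.append(chunk)
        self.playing = True
        return True

    def on_caller_speech(self) -> bool:
        """Call when STT detects speech. Returns True if playback was cut off."""
        if not self.playing:
            return False
        self.playing = False
        self.buffer.clear()   # flush audio that has not been played yet
        self.turn_id += 1     # stale chunks from the old turn are now rejected
        return True
```

&lt;p&gt;When &lt;code&gt;on_caller_speech&lt;/code&gt; returns &lt;code&gt;True&lt;/code&gt;, the caller of this class would cancel the in-flight generation and re-run inference with the new transcript.&lt;/p&gt;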

&lt;p&gt;&lt;strong&gt;Problem 2: Claude hallucinating availability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Early builds had Claude generating appointment times that didn't exist. Patients were being told "Tuesday at 2pm" when the provider wasn't available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Availability is never in the prompt. Instead, it's a tool call: Claude calls &lt;code&gt;get_availability(provider, date_range)&lt;/code&gt; and we return the real-time open slots. Claude can only offer what the function returns.&lt;/p&gt;
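&lt;p&gt;One way to back this up on the server side is to remember which slots were actually returned during the call and refuse to book anything else. The sketch below assumes that extra check; the &lt;code&gt;AvailabilityGuard&lt;/code&gt; class and the injected &lt;code&gt;fetch_slots&lt;/code&gt; callable are hypothetical.&lt;/p&gt;

```python
from typing import Callable, List, Set, Tuple

# fetch_slots(provider, date_range) returns real open slots from the
# scheduling system. Its signature is an assumption for this sketch.
FetchSlots = Callable[[str, Tuple[str, str]], List[str]]


class AvailabilityGuard:
    """Per-call record of the slots the model has actually been shown."""

    def __init__(self, fetch_slots: FetchSlots) -> None:
        self._fetch = fetch_slots
        self._offered: Set[Tuple[str, str]] = set()

    def get_availability(self, provider: str, date_range: Tuple[str, str]) -> List[str]:
        """Tool handler: fetch live slots and remember them for later checks."""
        slots = self._fetch(provider, date_range)
        self._offered.update((provider, slot) for slot in slots)
        return slots

    def can_book(self, provider: str, slot: str) -> bool:
        """Only slots previously returned by get_availability may be booked."""
        return (provider, slot) in self._offered
```

&lt;p&gt;A booking handler could call &lt;code&gt;can_book&lt;/code&gt; first and return an error to the model if the slot was never offered, which turns a hallucinated time into a recoverable tool error instead of a bad confirmation.&lt;/p&gt;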

&lt;p&gt;&lt;strong&gt;Problem 3: Long calls running up costs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some patients would keep the AI on the phone indefinitely — confused, or just chatty. Unbounded calls = unbounded cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Configurable max duration per clinic. At 10 minutes, the AI politely offers to transfer the caller to a human or to call them back. Average call length is now 2.5 minutes.&lt;/p&gt;
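&lt;p&gt;A duration cap like this can be as simple as a timer checked on each turn. The following is a minimal sketch under that assumption; &lt;code&gt;CallTimer&lt;/code&gt; is a hypothetical name, and the clock is injectable so the behaviour can be tested without waiting.&lt;/p&gt;

```python
import time
from typing import Callable


class CallTimer:
    """Tracks elapsed call time against a per-clinic maximum."""

    def __init__(self, max_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._max = max_seconds
        self._started = clock()  # monotonic clock is unaffected by wall-clock changes

    def elapsed(self) -> float:
        return self._clock() - self._started

    def should_wrap_up(self) -> bool:
        """True once the call has reached the configured maximum duration."""
        return self.elapsed() >= self._max
```

&lt;p&gt;The conversation loop would check &lt;code&gt;should_wrap_up()&lt;/code&gt; before each model turn and switch to the transfer or callback offer once it returns &lt;code&gt;True&lt;/code&gt;.&lt;/p&gt;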

&lt;p&gt;&lt;strong&gt;Problem 4: Noisy environments destroying STT accuracy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Callers in cars, waiting rooms, restaurants. Background noise crushed Deepgram accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Deepgram's noise suppression model + a fallback that asks the caller to repeat if confidence drops below threshold. "I'm sorry, I didn't quite catch that — could you repeat that for me?"&lt;/p&gt;
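&lt;p&gt;The confidence fallback amounts to a small routing decision on each transcript. In this sketch the &lt;code&gt;handle_transcript&lt;/code&gt; function and the 0.6 default threshold are assumptions; the right value would need tuning against real calls.&lt;/p&gt;

```python
from typing import Tuple

REPEAT_PROMPT = "I'm sorry, I didn't quite catch that. Could you repeat that for me?"


def handle_transcript(
    transcript: str, confidence: float, threshold: float = 0.6
) -> Tuple[str, str]:
    """Decide whether to act on a transcript or ask the caller to repeat.

    Returns ("ask_repeat", prompt) when the transcript is empty or its
    confidence is below the threshold, otherwise ("respond", transcript).
    """
    if not transcript.strip() or threshold > confidence:
        return ("ask_repeat", REPEAT_PROMPT)
    return ("respond", transcript)
```

&lt;p&gt;Only transcripts routed to &lt;code&gt;"respond"&lt;/code&gt; would be sent on to the model, so a noisy fragment never turns into a wrong booking.&lt;/p&gt;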




&lt;h2&gt;
  
  
  Numbers After Launch
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Calls handled:&lt;/strong&gt; Thousands of automated calls per month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Average handle time:&lt;/strong&gt; 2.5 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transfer rate:&lt;/strong&gt; ~18% (calls that go to a human)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Booking completion rate:&lt;/strong&gt; ~74% of calls that started with booking intent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uptime:&lt;/strong&gt; 99.7%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Average response latency:&lt;/strong&gt; ~900ms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The clinics running Loquent have effectively eliminated missed after-hours calls. One client told us their front desk used to spend the first hour of every morning re-booking patients who couldn't get through the day before. Loquent eliminated that entirely.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with tool use from day one.&lt;/strong&gt; We initially tried to have Claude make decisions through natural language reasoning. Switching to structured tool calls for all actions made the system dramatically more reliable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invest in evals earlier.&lt;/strong&gt; We didn't set up proper evaluation pipelines until week 5. Building a test call suite in week 1 would have caught several issues earlier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Separate the conversation engine from the telephony layer sooner.&lt;/strong&gt; The abstraction between "what the AI is doing" and "how the call works" should be clean from the start. We refactored this in week 6 and it made everything better.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;We're now extending Loquent to handle outbound campaigns: appointment reminders, recall messaging, and post-visit follow-ups. The same architecture works; you just flip the direction of the call.&lt;/p&gt;

&lt;p&gt;We're also exploring multi-agent setups where a triage agent hands off to specialist agents (billing, clinical questions, booking) with full context preservation.&lt;/p&gt;




&lt;p&gt;If you're building something similar or want to talk through the architecture, we're at &lt;a href="https://getloquent.com" rel="noopener noreferrer"&gt;getloquent.com&lt;/a&gt; and &lt;a href="https://www.autor.ca" rel="noopener noreferrer"&gt;autor.ca&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Happy to answer questions in the comments.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Autor is a Toronto-based AI development studio. We build custom AI agents, voice assistants, and full-stack AI products for businesses. &lt;a href="https://www.autor.ca" rel="noopener noreferrer"&gt;autor.ca&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
  </channel>
</rss>
