<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pritesh Kumar</title>
    <description>The latest articles on DEV Community by Pritesh Kumar (@priteshkr).</description>
    <link>https://dev.to/priteshkr</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3824033%2Fcfd12cee-b8ea-4c4f-8d95-3fd96af7163f.png</url>
      <title>DEV Community: Pritesh Kumar</title>
      <link>https://dev.to/priteshkr</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/priteshkr"/>
    <language>en</language>
    <item>
      <title>Where Voice AI Agents Are Actually Getting Used in 2026</title>
      <dc:creator>Pritesh Kumar</dc:creator>
      <pubDate>Fri, 17 Apr 2026 13:45:27 +0000</pubDate>
      <link>https://dev.to/dograh/where-voice-ai-agents-are-actually-getting-used-in-2026-49oe</link>
      <guid>https://dev.to/dograh/where-voice-ai-agents-are-actually-getting-used-in-2026-49oe</guid>
      <description>&lt;p&gt;Voice AI has moved past the demo phase. After watching hundreds of deployments across our customer base and the broader ecosystem, I wanted to put together a practical list of where voice agents are actually earning their keep right now, and where the ROI is strong enough to justify real production budgets.&lt;/p&gt;

&lt;p&gt;The list below is the subset that keeps showing up in real pipelines and real P&amp;amp;Ls.&lt;/p&gt;

&lt;h2&gt;Customer Support&lt;/h2&gt;

&lt;p&gt;Tier-1 support is the single biggest deployed voice AI use case today. Voice agents handle password resets, order status checks, account balance lookups, policy questions, and similar high-volume repetitive queries. The value is straightforward: deflect 40-60% of inbound calls away from human agents, answer in the language the caller speaks, operate 24/7. Most teams start here because the data already exists in their CRM or knowledge base, and the workflows are well understood.&lt;/p&gt;

&lt;h2&gt;Lead Screening and Qualification&lt;/h2&gt;

&lt;p&gt;Inbound leads from ads, forms, and content marketing usually sit in a queue for hours before a human gets to them. By that point, intent has dropped significantly. Voice agents now pick up the call within seconds, qualify against BANT or a custom rubric, book the meeting straight into the sales rep's calendar, and log everything in HubSpot or Salesforce. This is the highest-velocity use case I have seen for B2B teams. The math is easy: a qualified meeting has a known value in the CRM, and answering in 30 seconds instead of 4 hours multiplies that yield.&lt;/p&gt;

&lt;h2&gt;Collections and Renewals in Fintech&lt;/h2&gt;

&lt;p&gt;This is where I have seen some of the strongest unit economics. Banks, lenders, and insurance companies run enormous outbound collections operations with razor-thin per-call economics. Voice agents handle reminders, soft collections, payment plan negotiation, drop-off recovery, and renewal nudges at a fraction of the cost of a human BPO. The volumes are high, the scripts are compliance-heavy (which AI handles consistently), and the conversion lift from reaching a borrower in their preferred language at the right time of day is real.&lt;/p&gt;

&lt;h2&gt;Cold Calling and Outbound Sales&lt;/h2&gt;

&lt;p&gt;The ROI here is very good if you get the targeting right. Voice agents can run thousands of outbound dials a day, handle objections, qualify interest, and hand off warm prospects to a human closer. The catch is that bad targeting plus AI dialing equals spam complaints at scale, so list hygiene and opt-in matter more than the tech itself. Teams that get this right see cost per meeting drop by 5-10x.&lt;/p&gt;

&lt;h2&gt;Appointment Setting in Healthcare&lt;/h2&gt;

&lt;p&gt;Hospitals, dental clinics, and specialty practices deal with huge no-show rates and constant rebooking churn. Voice agents handle appointment confirmations, reminders, rescheduling, and prep instructions like pre-op fasting rules. The operational impact is immediate: front desk staff get their attention back for patients physically in the clinic, and call handling capacity goes up overnight.&lt;/p&gt;

&lt;h2&gt;Reception Desks, Restaurants, and Local Services&lt;/h2&gt;

&lt;p&gt;Any local business with a phone number that rings all day is a candidate. Restaurants take reservations and handle takeaway orders, dental clinics book and confirm visits, salons do intake. The ticket size per business is small, but the cumulative market is enormous. This category will eventually absorb the most total call volume, even if each individual deployment is modest.&lt;/p&gt;

&lt;h2&gt;What Comes Next&lt;/h2&gt;

&lt;p&gt;The interesting wave ahead is in regulated industries like KYC verification, insurance claims intake (FNOL), patient engagement beyond appointments, and legal intake flows. These need stronger guardrails, better audit trails, tighter integration into systems of record, and clear compliance boundaries. That is where the platform layer matters, and where closed black-box APIs start to hit walls that open, inspectable stacks handle gracefully.&lt;/p&gt;

&lt;p&gt;If you are evaluating voice AI for your own business, start with the use case where you already have volume, a clear script, a measurable outcome, and a team ready to handle the tail exceptions. Skip the speculative experiments. The wins are in the boring, high-frequency calls you make every day already.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>automation</category>
      <category>nlp</category>
    </item>
    <item>
      <title>How I built a full-fledged open-source AI calling platform and got a million impressions in almost a month 🤯</title>
      <dc:creator>Pritesh Kumar</dc:creator>
      <pubDate>Tue, 14 Apr 2026 13:41:36 +0000</pubDate>
      <link>https://dev.to/priteshkr/how-i-built-a-full-fledged-open-source-ai-calling-platform-and-got-a-million-impressions-in-almost-5c4m</link>
      <guid>https://dev.to/priteshkr/how-i-built-a-full-fledged-open-source-ai-calling-platform-and-got-a-million-impressions-in-almost-5c4m</guid>
      <description>&lt;p&gt;We published &lt;a href="https://www.dograh.com/" rel="noopener noreferrer"&gt;Dograh&lt;/a&gt; on &lt;a href="https://www.reddit.com/r/buildinpublic/comments/1shn39p/quit_a_chill_job_after_my_previous_startup_got/" rel="noopener noreferrer"&gt;Reddit&lt;/a&gt; a few days ago and the response surprised us. A self-hostable, &lt;a href="https://github.com/dograh-hq/dograh" rel="noopener noreferrer"&gt;open-source alternative to Vapi &lt;/a&gt; was something many developers had been waiting for.&lt;br&gt;
Since then we've gotten a lot of questions about how we actually built it - what the architecture looks like, what decisions we made, and what we'd do differently. Here's the honest walkthrough.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fux0z0iec55f7e0fpx8iz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fux0z0iec55f7e0fpx8iz.png" alt=" " width="613" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it stands&lt;/strong&gt;: an agency self-hosts our stack and is building its third client on top of it, now looking to migrate its existing clients over from other platforms. One of India's largest fintech companies is using our S2S support for a voice AI use case that is currently in development.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzcjb3r5tok6ayccbbllj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzcjb3r5tok6ayccbbllj.png" alt=" " width="571" height="711"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspq3oyam4rc7xgdkf844.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspq3oyam4rc7xgdkf844.png" alt=" " width="584" height="690"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The core idea - provider abstraction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first decision we made was that Dograh should never care which provider is running behind any layer. Every voice agent needs four things: something to handle the phone call, something to transcribe speech, something to think, and something to talk back. Each of these is an abstract layer in Dograh, and any provider plugs in without touching anything else.&lt;/p&gt;

&lt;p&gt;For the real-time pipeline, we built on a custom fork of Pipecat. We chose Pipecat over LiveKit because of its architectural simplicity - everything is a processor, and events and data flow through the pipeline. Each processor can either act on an event asynchronously or forward it to the next one. That model makes it easy to reason about what's happening at any point in the call.&lt;/p&gt;

&lt;p&gt;We also built our own telephony integration layer on top of this rather than relying on existing abstractions. That turned out to be the right call. It let us build things like human call transfers, where the transfer is exposed as a tool option in the LLM context - the model decides when to hand off based on the conversation, not a hardcoded rule.&lt;/p&gt;

&lt;p&gt;Dograh supports today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Telephony: Twilio, Vonage, Cloudonix, Telnyx, Vobiz, Asterisk ARI&lt;/li&gt;
&lt;li&gt;STT: Deepgram, Speechmatics, Sarvam, OpenAI Whisper&lt;/li&gt;
&lt;li&gt;LLM: OpenAI, Gemini, Groq, OpenRouter, Azure, AWS Bedrock, and fully self-hosted&lt;/li&gt;
&lt;li&gt;TTS: ElevenLabs, Cartesia, Deepgram, OpenAI TTS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Swapping any of these is a config change.&lt;/p&gt;
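
&lt;p&gt;To make the abstraction concrete, here is a minimal sketch of what a pluggable STT layer can look like. The class and registry names are illustrative, not Dograh's actual interfaces, and the same shape repeats for telephony, LLM, and TTS:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative sketch of a provider-agnostic STT layer (not Dograh's real classes).
from abc import ABC, abstractmethod


class STTProvider(ABC):
    """Anything that can turn a chunk of call audio into text."""

    @abstractmethod
    async def transcribe(self, audio: bytes):
        ...


class DeepgramSTT(STTProvider):
    async def transcribe(self, audio: bytes):
        ...  # call Deepgram's streaming API here


class WhisperSTT(STTProvider):
    async def transcribe(self, audio: bytes):
        ...  # call a self-hosted Whisper server here


STT_REGISTRY = {"deepgram": DeepgramSTT, "whisper": WhisperSTT}


def build_stt(config):
    # Swapping providers is just a config change, e.g. {"stt": "whisper"}.
    return STT_REGISTRY[config["stt"]]()
&lt;/code&gt;&lt;/pre&gt;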

&lt;p&gt;&lt;strong&gt;Speech-to-speech support&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Beyond the STT-LLM-TTS pipeline, we've added support for true speech-to-speech models. Right now that means Gemini 3.1 Flash Live. S2S collapses the three-step loop into a single model call, which gets you sub-300ms latency (theoretically) and more natural turn-taking. Barge-in handling works out of the box. We're planning to build more robust support for locally hosted S2S models in the short term.&lt;/p&gt;

&lt;p&gt;We also support distributed tracing with &lt;a href="https://www.dynatrace.com/monitoring/technologies/opentelemetry/?utm_source=google&amp;amp;utm_medium=cpc&amp;amp;utm_term=opentelemetry&amp;amp;utm_campaign=apac-p1-in-en-observability-tcpa&amp;amp;utm_content=none&amp;amp;utm_campaign_id=16272465256&amp;amp;gad_source=1&amp;amp;gad_campaignid=16272465256&amp;amp;gbraid=0AAAAADk5-tXLwixMknINiqeVjEVtVSkLQ&amp;amp;gclid=Cj0KCQjwy_fOBhC6ARIsAHKFB78DjYQwf4B2Pvx9SyVffMcr_X27h3csvswUMbar11cJFxFia6pxUE8aAjIoEALw_wcB" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt;, with a solid integration into &lt;a href="https://langfuse.com/" rel="noopener noreferrer"&gt;Langfuse&lt;/a&gt;. If you want full observability across every LLM call, tool invocation, and latency breakdown - it's already there.&lt;/p&gt;
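
&lt;p&gt;If you have not wired this up before, the OpenTelemetry side is only a few lines. The snippet below is a generic illustration of per-stage spans rather than Dograh's actual instrumentation - the span names and attributes are made up, and the console exporter stands in for Langfuse or any OTLP backend:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Generic OpenTelemetry spans around each pipeline stage (illustrative names).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("voice-pipeline")


def handle_turn(audio_chunk):
    # One span per stage gives you a per-turn latency breakdown.
    with tracer.start_as_current_span("turn") as turn_span:
        turn_span.set_attribute("call.id", "illustrative-call-id")
        with tracer.start_as_current_span("stt.transcribe"):
            text = "transcript goes here"
        with tracer.start_as_current_span("llm.generate"):
            reply = "model reply to: " + text
        with tracer.start_as_current_span("tts.synthesize"):
            pass
        return reply
&lt;/code&gt;&lt;/pre&gt;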

&lt;p&gt;&lt;strong&gt;Hybrid voice - the thing we're most proud of&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pure TTS agents have a latency and naturalness problem. Every response gets generated fresh, which takes time, and even the best TTS models sound slightly synthetic on predictable phrases.&lt;br&gt;
We built a hybrid voice mode to fix this. The LLM picks from a library of pre-recorded human voice clips for common responses - greetings, confirmations, transitions - and only falls back to TTS when it needs to improvise. The predictable parts play instantly because there's no generation happening. Dynamic parts use TTS or a voice clone. The result is lower latency, lower cost, and a more natural-sounding agent on the parts of the call that matter most for first impressions. We can also play a recording while the agent transitions to a new node or while a tool call runs, so the caller hears something instead of dead air during the gap.&lt;/p&gt;
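
&lt;p&gt;Stripped down to its core, the clip-or-TTS decision looks roughly like this. The function and library names are placeholders, not Dograh's real API:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hybrid voice sketch: play a pre-recorded clip when one exists for the phrase,
# fall back to TTS otherwise. Names are placeholders.
from pathlib import Path

CLIP_LIBRARY = {
    "greeting": Path("clips/greeting.wav"),
    "confirm_booking": Path("clips/confirm_booking.wav"),
    "transition_hold": Path("clips/transition_hold.wav"),
}


async def speak(phrase_id, dynamic_text, tts, transport):
    clip = CLIP_LIBRARY.get(phrase_id)
    if clip is not None and clip.exists():
        # Pre-recorded human audio: zero generation latency.
        await transport.play_audio(clip.read_bytes())
    else:
        # Improvised content: fall back to TTS (or a cloned voice).
        audio = await tts.synthesize(dynamic_text)
        await transport.play_audio(audio)
&lt;/code&gt;&lt;/pre&gt;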

&lt;p&gt;&lt;strong&gt;Our current stack&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rather than explain each layer in isolation, here's the full picture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;FastAPI&lt;/strong&gt; for the backend. Our workload is heavily I/O bound - concurrent calls, real-time audio streaming, multiple async API calls in flight at once. FastAPI's async Python support handles this well within a single process.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Next.js&lt;/strong&gt; for the UI, deployed on Vercel.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt; as the primary database, shipped with Docker Compose. We also use it for vector embeddings, so there's no separate vector store to run.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Arq&lt;/strong&gt; for async task management and cron jobs. It handles our scheduled call queue and background workers cleanly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MinIO&lt;/strong&gt; for S3-compatible file storage - transcripts, recordings, anything that needs to persist beyond a call.&lt;/li&gt;
&lt;/ul&gt;
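
&lt;p&gt;As one concrete example of the Arq piece, a scheduled-call worker plus a nightly cron job can be wired up roughly like this - the job functions and settings are placeholders, not our actual workers:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of an arq worker for a scheduled call queue; names are placeholders.
from arq import cron
from arq.connections import RedisSettings


async def dial_scheduled_calls(ctx):
    # Look up calls that are due now and kick off the outbound dials.
    ...


async def cleanup_stale_recordings(ctx):
    # Nightly housekeeping job.
    ...


class WorkerSettings:
    functions = [dial_scheduled_calls]
    cron_jobs = [cron(cleanup_stale_recordings, hour=3, minute=0)]
    redis_settings = RedisSettings(host="localhost", port=6379)

# Run with: arq myworker.WorkerSettings  (module name is a placeholder)
&lt;/code&gt;&lt;/pre&gt;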

&lt;p&gt;&lt;strong&gt;Call traces and QA - the part most platforms skip&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a call fails in production, a recording and a transcript are not enough. You need to know what the STT heard on turn 4, what the LLM decided, which tool it called, what the latency was at each step, and whether the caller interrupted mid-response. Without that, you're guessing.&lt;br&gt;
Every call in Dograh generates a full per-turn trace. It's the unit of debugging, not the recording. When something goes wrong you open the trace, find the turn, see exactly what happened, fix the prompt, and re-run. No support tickets to a vendor who can't show you the internals.&lt;br&gt;
After every call, &lt;a href="https://github.com/dograh-hq/dograh" rel="noopener noreferrer"&gt;Dograh&lt;/a&gt; also runs automatic post-call QA - checking sentiment, whether the user got confused, whether the agent followed its instructions, and what actions fired. You don't need to listen to 200 recordings to find where things broke.&lt;/p&gt;
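
&lt;p&gt;For a sense of what "per-turn trace" means in practice, the record for each turn carries roughly these fields (the names below are illustrative, not Dograh's schema):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative shape of a per-turn trace record; not Dograh's actual schema.
from dataclasses import dataclass, field


@dataclass
class TurnTrace:
    turn_index: int
    stt_transcript: str          # what the STT actually heard
    llm_response: str            # what the model decided to say
    tool_calls: list = field(default_factory=list)  # tools invoked this turn
    stt_latency_ms: float = 0.0
    llm_latency_ms: float = 0.0
    tts_latency_ms: float = 0.0
    interrupted: bool = False    # did the caller barge in mid-response?
&lt;/code&gt;&lt;/pre&gt;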

&lt;p&gt;&lt;strong&gt;Why we open-sourced it&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We built this because we kept hitting the same frustrations. You get two choices today: pay a platform fee to one of the 300+ voice AI companies for a comfy UI, or build directly on &lt;a href="https://www.pipecat.ai/" rel="noopener noreferrer"&gt;Pipecat&lt;/a&gt; or &lt;a href="https://livekit.io/" rel="noopener noreferrer"&gt;LiveKit&lt;/a&gt;, where every prompt tweak requires a code change and a redeployment. For anyone shipping for clients or any production use case, that's a constant bottleneck.&lt;br&gt;
We wanted a platform where the code is yours, the data stays in your infrastructure, and debugging means reading a trace, not filing a ticket.&lt;br&gt;
Dograh is BSD-2 licensed and self-hosted via Docker Compose. There's no per-minute platform fee because you own the platform.&lt;br&gt;
A star genuinely helps us more than we can explain.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/dograh-hq/dograh" rel="noopener noreferrer"&gt;Star Dograh&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>python</category>
    </item>
    <item>
      <title>We analyzed 10,000 voice AI calls. The LLM was rarely the problem.</title>
      <dc:creator>Pritesh Kumar</dc:creator>
      <pubDate>Sat, 28 Mar 2026 13:54:27 +0000</pubDate>
      <link>https://dev.to/dograh/we-analyzed-10000-voice-ai-calls-the-llm-was-rarely-the-problem-3dod</link>
      <guid>https://dev.to/dograh/we-analyzed-10000-voice-ai-calls-the-llm-was-rarely-the-problem-3dod</guid>
      <description>&lt;p&gt;We built &lt;a href="https://github.com/dograh-hq/dograh" rel="noopener noreferrer"&gt;Dograh OSS&lt;/a&gt;, an open-source voice AI platform. When we started, we assumed most failures would come from the LLM - bad answers, missed intent, prompt edge cases. So we spent a lot of early effort there.&lt;/p&gt;

&lt;p&gt;Then we looked at the data. We ran automated QA where an LLM reviews every turn in every call and tags what went right and wrong, and we spent hours listening to calls ourselves. Across roughly 10,000 calls spanning customer support, appointment booking, and lead qualification, the failure picture looked nothing like what we expected.&lt;/p&gt;

&lt;p&gt;The problems that showed up again and again were about the phone call as a medium. Timing, audio physics, and infrastructure designed decades before LLMs existed.&lt;/p&gt;

&lt;p&gt;Here is what we found, roughly ranked by frequency.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure area&lt;/th&gt;
&lt;th&gt;Share of problematic calls&lt;/th&gt;
&lt;th&gt;Primary driver&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;STT / word error rate&lt;/td&gt;
&lt;td&gt;~38%&lt;/td&gt;
&lt;td&gt;Low-quality telephony audio and accent variation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First-8-second chaos&lt;/td&gt;
&lt;td&gt;~34%&lt;/td&gt;
&lt;td&gt;Greeting latency, barge-in, variable user behavior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interruption handling&lt;/td&gt;
&lt;td&gt;~28%&lt;/td&gt;
&lt;td&gt;Filler words breaking flow, context switching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extended silence&lt;/td&gt;
&lt;td&gt;~22%&lt;/td&gt;
&lt;td&gt;Users distracted, fetching info, handing off phone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool call latency&lt;/td&gt;
&lt;td&gt;~19%&lt;/td&gt;
&lt;td&gt;LLM turns plus external API latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM failure modes&lt;/td&gt;
&lt;td&gt;~15%&lt;/td&gt;
&lt;td&gt;Hallucinations, instruction drift, latency trade-offs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Broken escalation&lt;/td&gt;
&lt;td&gt;~11%&lt;/td&gt;
&lt;td&gt;No clear human handoff path&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These categories overlap. A lot of bad calls had two or three of these compounding each other.&lt;/p&gt;

&lt;h2&gt;STT is worse than you think&lt;/h2&gt;

&lt;p&gt;STT failures were the single biggest contributor to broken calls, and the one most consistently underestimated by teams building voice AI for the first time.&lt;/p&gt;

&lt;p&gt;Standard telephony runs at 8 kHz audio. That is genuinely low quality. It strips away acoustic detail that helps distinguish consonants, so "schedule" becomes "school" in a transcript and "Sunday" comes through as "someday." Add background noise, speakerphone, or a non-native accent and word error rates climb fast.&lt;/p&gt;

&lt;p&gt;The thing about WER is that errors compound. "I need to school my appointment for someday" should be understood as "I need to schedule my appointment for Sunday," and a well-prompted LLM will figure that out. When three or four words are garbled in the same turn though, contextual inference falls apart. We saw calls where entire turns came through as near-gibberish.&lt;/p&gt;

&lt;p&gt;There is also a dimension STT completely misses. Transcription captures words but it does not capture how those words are said. A frustrated "fine" and a genuinely agreeable "fine" are two very different things, but they look identical in a transcript. Tone helps disambiguate words that sound similar. When a caller sounds confused or hesitant, a human listener picks up on that and adjusts. STT gives you a flat string of text and the LLM works with it blind.&lt;/p&gt;

&lt;p&gt;No STT provider has uniform accuracy across all accents either. The gap between a provider's accuracy on American English versus Nigerian English or heavily accented Spanish can be 15 to 25 percentage points. If your users call from diverse regions, picking one STT provider and moving on is going to hurt you.&lt;/p&gt;

&lt;p&gt;Two mitigations made a real difference in our testing. First, adding a custom vocabulary to your STT engine - and I mean beyond domain jargon. If you are building an order management bot, add the common words your callers actually say: "order," "ID," "payment," "cancel," "account," "address." Thirty to forty frequently used words. The STT engine listens for these with extra weight and it meaningfully reduces errors on the terms that matter most.&lt;/p&gt;
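
&lt;p&gt;Most STT engines expose this as keyword boosting. The sketch below uses Deepgram-style &lt;code&gt;keywords&lt;/code&gt; query parameters with boost weights; other providers expose the same idea under different names, so treat it as an illustration rather than a drop-in snippet:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Keyword boosting sketch. Deepgram-style "keywords=word:boost" parameters
# are shown here; check your STT engine's docs for the equivalent option.
from urllib.parse import urlencode

DOMAIN_VOCABULARY = {
    "order": 2, "ID": 2, "payment": 2, "cancel": 3,
    "account": 2, "address": 2, "refund": 3, "invoice": 2,
}

params = [("model", "nova-2"), ("language", "en")]
params += [("keywords", f"{word}:{boost}") for word, boost in DOMAIN_VOCABULARY.items()]

stt_url = "wss://api.deepgram.com/v1/listen?" + urlencode(params)
print(stt_url)
&lt;/code&gt;&lt;/pre&gt;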

&lt;p&gt;Second, tell the LLM to expect transcription errors. A single line in your system prompt acknowledging that the caller's words may contain transcription noise, and asking the model to use contextual reasoning before responding, reduces the downstream impact of STT failures significantly. The LLM stops treating garbled transcripts as literal input and starts being smarter about what the caller probably meant.&lt;/p&gt;
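
&lt;p&gt;The change is literally a sentence or two. Here is one version of that instruction - the wording is ours and should be adapted to your agent:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# One example of the "expect transcription noise" instruction; wording is ours.
SYSTEM_PROMPT = (
    "You are a phone support agent. The user's messages come from a "
    "speech-to-text engine over a low-quality phone line, so they may "
    "contain transcription errors. If a word looks out of place, infer "
    "the most likely intended word from context instead of taking the "
    "transcript literally, and ask a short clarifying question only when "
    "the meaning is genuinely ambiguous."
)
&lt;/code&gt;&lt;/pre&gt;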

&lt;h2&gt;The first 8 seconds are where calls die&lt;/h2&gt;

&lt;p&gt;About a third of all problematic calls failed before the real conversation even started.&lt;/p&gt;

&lt;p&gt;A lot happens in those first 8 to 10 seconds. Callers are still deciding whether they want to engage. Many have not fully shifted their attention to the call. Some are already talking before the bot finishes its greeting, and others wait much longer than expected, unsure whether the silence means the system is broken. This is also the window where callers most frequently realize they are talking to a bot, and many immediately start testing it - interrupting, asking off-script questions, being deliberately vague. The variance in behavior is just much higher here than at any other point in the call.&lt;/p&gt;

&lt;p&gt;Greeting latency makes everything worse. A gap of more than a second or two between the call connecting and the first word of audio is enough for many callers to assume things are broken and hang up. Pre-generating and caching your greeting audio instead of synthesizing it fresh every time removes an entire class of failures here.&lt;/p&gt;
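
&lt;p&gt;A minimal sketch of that caching, assuming a generic async TTS client (the function and transport names are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Pre-generate and cache the greeting at startup instead of synthesizing it
# on every call. tts.synthesize and transport.play_audio are placeholder names.
GREETING_TEXT = "Hi, you've reached Acme Dental. How can I help you today?"
_greeting_audio = None


async def warm_greeting_cache(tts):
    global _greeting_audio
    _greeting_audio = await tts.synthesize(GREETING_TEXT)


async def on_call_connected(transport):
    # First audio goes out immediately - no TTS round trip in the caller's
    # first second on the line.
    await transport.play_audio(_greeting_audio)
&lt;/code&gt;&lt;/pre&gt;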

&lt;p&gt;What works: keep greetings short, under six seconds. Consider disabling barge-in for just the first 3 to 4 seconds if your greeting contains information the caller needs to hear. Re-enable interruption after that. The period where barge-in causes the most problems is right at the start.&lt;/p&gt;

&lt;p&gt;In Dograh, each workflow node has its own "Allow Interruption" toggle, so you can switch off interruption on the starting node for a short introduction and re-enable it for the rest of the conversation.&lt;/p&gt;

&lt;h2&gt;Interruptions, silence, and dead air&lt;/h2&gt;

&lt;p&gt;Most barge-in documentation focuses on detecting when a user starts speaking and stopping the TTS output. That part is reasonably well solved. The harder problem is what happens to the conversation after the interruption.&lt;/p&gt;

&lt;p&gt;One pattern we saw constantly: a caller says "uh huh" or "yeah" while the bot is talking. The bot interprets this as a genuine turn, stops speaking, and tries to process it as new user input. The LLM produces a response to what is essentially a non-sentence and the conversation breaks. The caller has to re-explain what they wanted.&lt;/p&gt;

&lt;p&gt;A related pattern involves context switching - a caller interrupts to ask "wait, does that include weekends?" The bot handles the question fine but loses track of what it was explaining before. The original task gets dropped without resolution.&lt;/p&gt;

&lt;p&gt;Both problems are about how interruption events are handled in the conversation state. The fix is designing explicit logic for what happens after an interruption. Differentiate between filler sounds and substantive speech, and track unresolved conversational threads so the bot can come back to them.&lt;/p&gt;
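
&lt;p&gt;A crude version of the filler check is enough to stop most of these breaks. The word list and the agent methods below are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Crude backchannel filter: ignore filler acknowledgements instead of treating
# them as a new turn. The word list is illustrative; tune it on your transcripts.
BACKCHANNEL_PHRASES = {
    "uh huh", "mm hmm", "yeah", "ok", "okay", "right", "sure", "got it",
}


def is_backchannel(transcript):
    cleaned = " ".join(transcript.lower().strip(" .,!?").split())
    return cleaned in BACKCHANNEL_PHRASES


async def on_user_speech(agent, transcript):
    # agent.is_speaking and agent.process_user_turn are placeholder names.
    if agent.is_speaking and is_backchannel(transcript):
        return  # acknowledgement only: keep talking, don't yield the turn
    await agent.process_user_turn(transcript)  # substantive speech: yield the turn
&lt;/code&gt;&lt;/pre&gt;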

&lt;p&gt;Silence is closely related. Real callers go quiet for perfectly legitimate reasons - checking an email for an order ID, handing the phone to someone else, looking something up. A caller fetching information might go quiet for 20 to 40 seconds. Most voice AI systems interpret this as confusion or abandonment and respond with prompts or just hang up.&lt;/p&gt;

&lt;p&gt;A graduated response works much better: a brief neutral filler after 5 seconds, a gentle "still there?" check at 15 seconds, a real choice at 45 seconds ("I can hold, or you are welcome to call back"), and a graceful close with callback instructions only after 90 seconds or more.&lt;/p&gt;
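
&lt;p&gt;In code, the ladder is just a mapping from elapsed silence to an action. The thresholds below mirror the ones above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Graduated silence ladder; thresholds mirror the ones described above.
import bisect

SILENCE_THRESHOLDS = [5, 15, 45, 90]   # seconds of silence
SILENCE_ACTIONS = [
    None,              # under 5s: do nothing
    "neutral_filler",  # 5s+: brief "take your time"
    "check_in",        # 15s+: "still there?"
    "offer_choice",    # 45s+: "I can hold, or you're welcome to call back"
    "graceful_close",  # 90s+: close with callback instructions
]


def silence_action(silence_seconds):
    # bisect_right counts how many thresholds have already been crossed.
    return SILENCE_ACTIONS[bisect.bisect_right(SILENCE_THRESHOLDS, silence_seconds)]
&lt;/code&gt;&lt;/pre&gt;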

&lt;h2&gt;Tool call latency creates its own kind of dead air&lt;/h2&gt;

&lt;p&gt;When a voice agent needs to look something up in a CRM or check a calendar slot, a tool call gets triggered. In practice that means at least one additional LLM turn to interpret the result, plus the external API latency. The total gap can easily reach 3 to 5 seconds, and callers at that point genuinely don't know if the call is still alive.&lt;/p&gt;

&lt;p&gt;We are adding a "playback speech" feature in Dograh that lets you configure a pre-recorded audio clip to play while a tool call executes. This fills the silence without the LLM having to generate a response and it keeps the caller engaged. Beyond that, pre-fetching data you know will be needed at call start - account details, prior call history, caller ID lookups - keeps common tool calls out of the live response path entirely.&lt;/p&gt;
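
&lt;p&gt;Conceptually, the playback trick is just running the tool call and a short filler clip concurrently. A minimal asyncio sketch with placeholder names:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Run the tool call and a short filler clip concurrently so the caller never
# sits in silence. transport.play_audio and crm_lookup are placeholder names.
import asyncio


async def run_tool_with_filler(transport, filler_audio, tool_coro):
    tool_task = asyncio.create_task(tool_coro)   # start the lookup immediately
    await transport.play_audio(filler_audio)     # caller hears something meanwhile
    return await tool_task                       # then wait for the result


# Usage (illustrative):
# result = await run_tool_with_filler(transport, filler_clip, crm_lookup(customer_id))
&lt;/code&gt;&lt;/pre&gt;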

&lt;h2&gt;LLM failures and broken escalation&lt;/h2&gt;

&lt;p&gt;LLM failures in production voice AI are real but probably smaller than you would expect. Hallucination gets the most attention but it accounts for a smaller share of bad calls than the quieter failures. Models stop following instructions partway through long calls. They generate empty responses that cause the bot to say nothing at all. They produce answers that are technically coherent but completely disconnected from the previous turn.&lt;/p&gt;

&lt;p&gt;The trade-off between model size and latency matters here too. Smaller models respond faster, which helps perceived call quality, but they follow complex instructions less reliably. Larger models handle nuanced prompts better but their response latency feeds right back into the dead-air problem.&lt;/p&gt;

&lt;p&gt;Escalation failures have a lower frequency but they are the most consequential category on this list. The callers asking for a human almost always have the hardest, most urgent problems. They have already tried self-service and it didn't work. When the bot can't detect they want to escalate - because they phrased it in a way the intent detection doesn't recognize - or when the escalation path drops the context so the human agent starts from scratch, that caller's experience is about as bad as it gets.&lt;/p&gt;

&lt;p&gt;Escalation should be a first-class destination in every workflow. The phrases callers use to ask for a human are wildly more varied than any training set anticipates, and the transfer needs to carry the full conversation transcript to the receiving agent.&lt;/p&gt;
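
&lt;p&gt;One way to make that concrete is to expose escalation as a tool the model can always call and to ship the transcript along with the transfer. A sketch using the OpenAI-style function-calling schema, with a placeholder transfer function:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Escalation as an always-available tool. The schema follows the OpenAI-style
# function-calling format; the transfer function is a placeholder.
ESCALATE_TOOL = {
    "type": "function",
    "function": {
        "name": "escalate_to_human",
        "description": (
            "Transfer the caller to a human agent. Use this whenever the "
            "caller asks for a person in any phrasing, sounds frustrated, "
            "or has a problem you cannot resolve."
        ),
        "parameters": {
            "type": "object",
            "properties": {"reason": {"type": "string"}},
            "required": ["reason"],
        },
    },
}


async def escalate_to_human(call, reason, transcript):
    # The receiving agent gets the full conversation so far, not a cold start.
    await call.transfer(queue="support", context={"reason": reason, "transcript": transcript})
&lt;/code&gt;&lt;/pre&gt;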

&lt;h2&gt;The pattern&lt;/h2&gt;

&lt;p&gt;Voice AI breaks at transitions. The first seconds of a call, the moment a user interrupts, the gap while a tool is running, the point where a caller needs a human. These are the edges where the system's assumptions about how conversations work meet how people actually behave on phone calls.&lt;/p&gt;

&lt;p&gt;Teams that focus almost entirely on LLM response quality and treat these transitions as secondary concerns tend to ship agents that sound good in demos but disappoint in production. The calls that held up well in our data were the ones with the most deliberate handling of everything that happens between LLM responses.&lt;/p&gt;




&lt;p&gt;We are building Dograh as the open-source alternative to Vapi, Retell, and Bland. No per-minute fees, no vendor lock-in, deploy on your own infra. Happy to answer any questions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/dograh-hq/dograh" rel="noopener noreferrer"&gt;Star us on GitHub&lt;/a&gt; | &lt;a href="https://app.dograh.com" rel="noopener noreferrer"&gt;Try the cloud version&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>voiceai</category>
      <category>webdev</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
