<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pritesh Kumar</title>
    <description>The latest articles on DEV Community by Pritesh Kumar (@priteshkr).</description>
    <link>https://dev.to/priteshkr</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3824033%2Fcfd12cee-b8ea-4c4f-8d95-3fd96af7163f.png</url>
      <title>DEV Community: Pritesh Kumar</title>
      <link>https://dev.to/priteshkr</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/priteshkr"/>
    <language>en</language>
    <item>
      <title>Keeping government call data in-country: a self-hosted voice AI architecture</title>
      <dc:creator>Pritesh Kumar</dc:creator>
      <pubDate>Fri, 03 Jul 2026 11:44:10 +0000</pubDate>
      <link>https://dev.to/dograh/keeping-government-call-data-in-country-a-self-hosted-voice-ai-architecture-2gjk</link>
      <guid>https://dev.to/dograh/keeping-government-call-data-in-country-a-self-hosted-voice-ai-architecture-2gjk</guid>
      <description>&lt;p&gt;If you're wiring up a voice agent for a government helpline, the first question is where the audio goes, before you ever get to which model sounds best. A recording of someone calling about their benefits eligibility or a criminal record is about as sensitive as data gets, and the moment it leaves the country you have a procurement problem no feature list can fix.&lt;/p&gt;

&lt;p&gt;Most hosted voice platforms make that leave-the-country decision for you. You point your phone line at their API, and every recording, every transcript, and every field the agent extracts lands in their cloud, usually somewhere you don't control and often on another continent. For a private startup that's fine. For a public agency bound by data residency rules, it's a non-starter before accuracy even enters the conversation.&lt;/p&gt;

&lt;p&gt;So here's the architecture that keeps all of it in-country.&lt;/p&gt;

&lt;h2&gt;
  
  
  The data path is the whole design
&lt;/h2&gt;

&lt;p&gt;A voice call moves through three model stages. Speech comes in and gets transcribed. The transcript goes to a language model that decides what to say and what to pull from your systems. The reply gets turned back into speech. Each of those stages is a place the audio or its text can leak out to a third party.&lt;/p&gt;

&lt;p&gt;The fix is to run all three on infrastructure you have already accredited. Whisper handles transcription locally. An open-weight LLM (Llama, Qwen, whatever your security team has cleared) does the reasoning on the same network. An open TTS voice speaks the reply. The caller's audio hits your servers, gets processed, and the extracted fields land in your own case system. Nothing about the call travels to a model vendor, so no call data ever crosses the border.&lt;/p&gt;
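
&lt;p&gt;To make the three-stage shape concrete, here is a minimal sketch in Python. The class name, the method name, and the stub stages are all hypothetical and stand in for whatever local Whisper, LLM, and TTS wrappers you actually run. It is not Dograh's API. The only point is that each stage is a plain function call on your own hardware.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LocalVoicePipeline:
    """Three stages, each a plain callable running on hardware you control."""
    transcribe: Callable[[bytes], str]   # stand-in for a local Whisper wrapper
    reason: Callable[[str], str]         # stand-in for a local open-weight LLM
    synthesize: Callable[[str], bytes]   # stand-in for a local open TTS voice

    def handle_turn(self, caller_audio):
        # Audio in, text through the model, audio back out. No stage calls
        # anything outside this process.
        transcript = self.transcribe(caller_audio)
        reply_text = self.reason(transcript)
        reply_audio = self.synthesize(reply_text)
        return transcript, reply_text, reply_audio


# Hypothetical stub stages so the sketch runs without any model installed.
def stub_stt(audio):
    return audio.decode("utf-8")


def stub_llm(transcript):
    return f"You said: {transcript}"


def stub_tts(text):
    return text.encode("utf-8")


pipeline = LocalVoicePipeline(stub_stt, stub_llm, stub_tts)
transcript, reply_text, reply_audio = pipeline.handle_turn(b"check my benefits status")
print(reply_text)
```

&lt;p&gt;Swapping a stub for a real model changes one constructor argument, and the data path stays the same.&lt;/p&gt;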

&lt;p&gt;That's the part hosted SaaS can't give you, no matter how many compliance badges sit on the pricing page. If the inference runs in their cloud, the data goes to their cloud.&lt;/p&gt;

&lt;h2&gt;
  
  
  Colocate for latency, not just sovereignty
&lt;/h2&gt;

&lt;p&gt;Keeping the models in-country has a second payoff infra people care about, which is latency. Voice is unforgiving. A caller notices half a second of dead air. If your STT is in one region, your LLM in another, and your TTS somewhere else, you pay round-trip time over and over, plus a trip out to a hosted vendor and back.&lt;/p&gt;

&lt;p&gt;Colocating the transcription, the LLM, and the voice on the same box or the same rack collapses that. The audio round-trip stays inside a single network hop, so the agent can pace itself and wait for a slow-speaking caller, handling barge-in without the awkward gaps that make people hang up. Data residency and low latency end up wanting the same thing, which is the whole stack running close together on hardware you own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open source is what makes it auditable
&lt;/h2&gt;

&lt;p&gt;Here's the part that closes a government security review. When the model weights and the pipeline code are open, your team can read exactly what the agent does with a call. Where the recording gets written. How long it's kept. What gets sent to which internal API. Which fields get extracted and where they land.&lt;/p&gt;

&lt;p&gt;You can't audit a closed vendor's pipeline. You get a SOC 2 report and a promise. With an open stack the review becomes a code review, and the answer to "who can subpoena this recording, where does it get replicated, how long does it live" stays inside your own retention and access rules where an auditor can actually check.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Dograh fits
&lt;/h2&gt;

&lt;p&gt;This is the constraint Dograh is built around. It's BSD-2 licensed and fully self-hostable, so you run the whole agent in your own environment and bring your own models: Whisper for transcription, an open LLM for reasoning, an open voice for playback. You pay for infrastructure instead of a per-minute platform fee, which for a high-volume, steady government line changes the yearly number a lot. Most hosted platforms meter around 5 to 7 cents a minute just for the platform layer before any AI usage, climbing toward 15 cents all-in at volume. Own the stack and that meter stops running.&lt;/p&gt;

&lt;p&gt;The buying decision for citizen data comes down to whether the audio ever leaves the building. Self-hosting with open models is how you keep it inside.&lt;/p&gt;

&lt;p&gt;If you want the fuller picture, the six citizen-facing use cases this architecture serves plus the cost math and the CJIS and FISMA angles, &lt;a href="https://www.dograh.com/hub/blogs/ai-voice-agents-public-services" rel="noopener noreferrer"&gt;we get into it in the full write-up&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>privacy</category>
      <category>security</category>
    </item>
    <item>
      <title>Why On-Prem Will Win Enterprise Voice AI (Hosted Can't Follow)</title>
      <dc:creator>Pritesh Kumar</dc:creator>
      <pubDate>Thu, 02 Jul 2026 20:50:36 +0000</pubDate>
      <link>https://dev.to/dograh/why-on-prem-will-win-enterprise-voice-ai-hosted-cant-follow-15n2</link>
      <guid>https://dev.to/dograh/why-on-prem-will-win-enterprise-voice-ai-hosted-cant-follow-15n2</guid>
      <description>&lt;p&gt;A vendor can add "on-premise" to a sales deck in about five minutes. Shipping it is a different problem, and for most voice AI platforms it is an unsolvable one. The blocker has nothing to do with roadmap or engineering effort. It comes down to the license on the models running underneath the product.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This is a condensed, independently written version of our full deep-dive. &lt;a href="https://www.dograh.com/hub/blogs/on-prem-enterprise-voice-ai" rel="noopener noreferrer"&gt;Read the complete article on Dograh&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here is the short version for anyone weighing this decision. For regulated buyers in healthcare and finance, running voice AI on your own infrastructure is often the only compliant option, because call audio and personal data cannot leave your network without tripping HIPAA or residency rules like GDPR. True on-prem keeps every layer inside your perimeter. It only works with open models you can actually download, and it eliminates the per-minute vendor fee.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part most compliance calls skip
&lt;/h2&gt;

&lt;p&gt;You cannot colocate a model whose weights you are not allowed to hold. Closed providers sell access to a model, never the model itself, so there is no artifact to install on a server you own. When a hosted per-minute vendor says "colocation," the best it can do is pick a cloud region near your other services, and your audio still leaves the building to land on someone else's machines.&lt;/p&gt;

&lt;p&gt;That single fact decides the whole architecture. A product built on closed APIs can bolt on a private-cloud tier, and the sensitive processing still happens on hardware you do not control. If the models are closed, on-prem is a marketing word. If the models are open, on-prem is something an auditor can actually verify.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "on-prem" actually means
&lt;/h2&gt;

&lt;p&gt;The deployment shapes are not equal. A fully hosted setup, the default for most per-minute platforms, runs everything on the vendor's servers while you connect over an API or a SIP trunk. A private-cloud or VPC deployment drops the vendor's software into your own cloud account, which narrows exposure, though the underlying models often still call out to the provider's endpoints. True on-prem, sometimes called colocation, runs the entire pipeline inside your perimeter, so speech-to-text, the language model, speech synthesis, and telephony all sit on hardware you own or rent with no call data crossing the boundary. Only that last shape satisfies the strictest residency rules.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why regulated buyers are forcing the move
&lt;/h2&gt;

&lt;p&gt;The pressure is coming from compliance and finance, not from engineering. The &lt;a href="https://www.grandviewresearch.com/industry-analysis/ai-voice-agents-market-report" rel="noopener noreferrer"&gt;AI voice agents market&lt;/a&gt; was worth 2.54 billion dollars in 2025 and is on track for 35.24 billion by 2033, so the volume of sensitive audio moving through these systems is climbing fast. Gartner expects more than 75 percent of European and Middle Eastern enterprises to move workloads into &lt;a href="https://www.truefoundry.com/blog/geopatriation" rel="noopener noreferrer"&gt;sovereign solutions&lt;/a&gt; by 2030, up from under 5 percent in 2025.&lt;/p&gt;

&lt;p&gt;HIPAA makes the stakes concrete. A compliant voice deployment needs a signed Business Associate Agreement at every layer, and &lt;a href="https://www.getprosper.ai/blog/hipaa-compliant-voice-ai-providers-healthcare-guide" rel="noopener noreferrer"&gt;Prosper AI's 2026 analysis&lt;/a&gt; counts up to five separate agreements, with civil penalties reaching 2,190,294 dollars per violation category per year. &lt;a href="https://www.ibm.com/reports/data-breach" rel="noopener noreferrer"&gt;IBM's 2025 Cost of a Data Breach Report&lt;/a&gt; puts healthcare at 7.42 million dollars per breach, the costliest sector for the fourteenth year running. Self-hosting removes the problem at the root, because when every model runs inside your perimeter there is no third party to sign a BAA with and no audio leaving your network.&lt;/p&gt;

&lt;h2&gt;
  
  
  The open stack that makes it real
&lt;/h2&gt;

&lt;p&gt;A fully self-hosted pipeline is buildable today from open components. Whisper and Voxtral handle speech-to-text on your own GPUs. Open language models such as Llama and Qwen serve through vLLM or Ollama. Kokoro and Piper generate natural speech locally, with Coqui and Chatterbox as further options. Telephony sits on Asterisk and standard SIP trunking, with ARI for low-level call control. Running these together on one server or in one availability zone is what colocation actually buys you, and since every network hop adds delay, keeping the models next to each other is one of the biggest levers for &lt;a href="https://www.dograh.com/hub/blogs/speech-latency" rel="noopener noreferrer"&gt;sub-800ms speech latency&lt;/a&gt;.&lt;/p&gt;
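
&lt;p&gt;A rough way to see why hops matter is to add them up against the 800 ms target. The sketch below is illustrative only. The stage timings, the hop count, and both round-trip figures are made-up placeholders, not measurements. Only the 800 ms target comes from the text above.&lt;/p&gt;

```python
# Illustrative per-turn latency budget. Every number except TARGET_MS is a
# hypothetical placeholder chosen for the example, not a benchmark.
TARGET_MS = 800


def turn_latency_ms(stage_ms, hops, rtt_ms):
    """Model compute time plus one network round trip per hop."""
    return sum(stage_ms.values()) + hops * rtt_ms


stage_ms = {"stt": 150, "llm": 300, "tts": 120}  # assumed compute times

colocated = turn_latency_ms(stage_ms, hops=3, rtt_ms=1)      # same rack
cross_region = turn_latency_ms(stage_ms, hops=3, rtt_ms=90)  # three regions

for label, total in (("colocated", colocated), ("cross-region", cross_region)):
    headroom = TARGET_MS - total
    print(f"{label}: {total} ms per turn, headroom {headroom} ms")
```

&lt;p&gt;With these assumed inputs the compute time is identical in both cases, and the network alone decides whether the turn fits the budget.&lt;/p&gt;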

&lt;h2&gt;
  
  
  The bill hosted vendors do not print
&lt;/h2&gt;

&lt;p&gt;Per-minute pricing scales linearly with every call. &lt;a href="https://www.ringly.io/blog/voice-ai-pricing" rel="noopener noreferrer"&gt;Ringly.io's 2026 pricing data&lt;/a&gt; puts the all-in cost of a hosted deployment at 0.12 to 0.25 dollars per minute once speech, model, voice, and telephony stack on the platform fee, with the platform fee alone around 5 to 7 cents a minute. Run 1,000 minutes a day, every day, and the all-in bill lands between roughly 44,000 and 91,000 dollars a year, of which about 18,000 to 26,000 dollars is the platform fee alone, and it climbs with every new campaign. Self-hosting turns that meter into a fixed infrastructure line, so the marginal cost of another minute sits close to zero.&lt;/p&gt;
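
&lt;p&gt;The yearly figure is easy to recompute for your own volume. This small Python sketch uses the per-minute rates quoted above and assumes the line runs 365 days a year. The function name is just for illustration, and rates are kept in whole cents to avoid floating-point rounding.&lt;/p&gt;

```python
MINUTES_PER_DAY = 1_000
DAYS_PER_YEAR = 365  # assumption: the line runs every day of the year


def yearly_cost_dollars(cents_per_minute):
    """Yearly spend in whole dollars for a given per-minute rate in cents."""
    minutes_per_year = MINUTES_PER_DAY * DAYS_PER_YEAR
    return cents_per_minute * minutes_per_year // 100


all_in = (yearly_cost_dollars(12), yearly_cost_dollars(25))        # 0.12 to 0.25 dollars
platform_only = (yearly_cost_dollars(5), yearly_cost_dollars(7))   # 5 to 7 cents

print(f"all-in: {all_in[0]:,} to {all_in[1]:,} dollars a year")
print(f"platform fee alone: {platform_only[0]:,} to {platform_only[1]:,} dollars a year")
```

&lt;p&gt;Change MINUTES_PER_DAY to your own call volume to see how the hosted bill scales.&lt;/p&gt;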

&lt;h2&gt;
  
  
  Why per-minute vendors cannot follow
&lt;/h2&gt;

&lt;p&gt;Hosted platforms will offer a private-cloud tier and a thick stack of compliance documents, and both help. What they cannot offer is a stack you fully own, because the models underneath are closed, so the residency guarantee ends at the model boundary and the meter keeps running.&lt;/p&gt;

&lt;p&gt;This is the gap &lt;a href="https://github.com/dograh-hq/dograh" rel="noopener noreferrer"&gt;Dograh&lt;/a&gt; was built to close. It is an open-source voice agent platform under a BSD-2 license, self-hostable from the ground up. You can colocate an open stack and bring your own keys for any commercial model, or drop commercial models entirely and run open weights end to end. There is no per-minute platform fee, and because the whole system is open, data residency and auditability arrive with the deployment instead of a contract addendum.&lt;/p&gt;

&lt;p&gt;The reason a hosted rival cannot copy this is structural. In &lt;a href="https://www.swfte.com/blog/avoid-ai-vendor-lock-in-enterprise-guide" rel="noopener noreferrer"&gt;a 2026 cloud computing survey&lt;/a&gt;, 94 percent of organizations reported concern about vendor lock-in, and only 6 percent believed they could switch their main AI provider without serious disruption. A per-minute vendor benefits from that friction, because a customer who cannot leave keeps paying. Hand that customer open weights on their own hardware with no meter, and there is very little company left to bill. Announcing an on-premise option is easy. Switching off the billing that funds the business is not.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to check before you move
&lt;/h2&gt;

&lt;p&gt;Start with the license, since a platform you can install and run yourself is auditable in a way a closed product never is. Check the model layer next, and ask whether you can bring open weights at every step and whether any stage quietly falls back to a closed API that ships audio out. Confirm telephony can run on your own SIP or Asterisk setup so the call path stays internal. Then follow the money, because a real on-prem option turns cost into fixed infrastructure with no per-minute fee riding on top, and a vendor that cannot remove the meter is not handing you a deployment you own.&lt;/p&gt;
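
&lt;p&gt;The model-layer check can be partly automated. As a sketch, assuming your deployment keeps its stage endpoints in a config you can read, the Python below flags any stage whose URL does not resolve to a literal address inside an accredited subnet. The subnet, the endpoint URLs, and the function names are all hypothetical.&lt;/p&gt;

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical accredited subnet and stage endpoints, for illustration only.
ACCREDITED_NETWORKS = [ipaddress.ip_network("10.0.4.0/24")]
STAGE_ENDPOINTS = {
    "stt": "http://10.0.4.11:9000/transcribe",
    "llm": "http://10.0.4.12:8000/v1",
    "tts": "http://10.0.4.13:5002/speak",
    "llm_fallback": "https://api.example.com/v1",
}


def stays_inside(url):
    """True only if the URL host is a literal IP inside an accredited subnet."""
    host = urlparse(url).hostname
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name cannot be verified from the config alone, so flag it.
        return False
    return any(address in network for network in ACCREDITED_NETWORKS)


def stages_leaving_perimeter(endpoints):
    """Names of stages whose endpoint is not provably inside the perimeter."""
    return sorted(name for name, url in endpoints.items() if not stays_inside(url))


print(stages_leaving_perimeter(STAGE_ENDPOINTS))
```

&lt;p&gt;A config check like this only catches what is declared, so it complements a network-level egress rule and does not replace one.&lt;/p&gt;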

&lt;h2&gt;
  
  
  Glossary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Colocation.&lt;/strong&gt; Hosting speech-to-text, the language model, speech synthesis, and telephony on the same server or availability zone to cut network hops. It only works with open models you can self-host.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data residency.&lt;/strong&gt; The requirement that call audio and personal data physically stay inside a specific country or jurisdiction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business Associate Agreement (BAA).&lt;/strong&gt; A HIPAA contract that makes a vendor legally liable for protecting patient data. A hosted voice stack needs one at every layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Geopatriation.&lt;/strong&gt; Moving cloud and AI workloads back inside national borders to satisfy sovereignty rules.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Can I self-host closed-source voice AI models?
&lt;/h3&gt;

&lt;p&gt;No. Closed providers sell API access, not the model weights, so there is nothing to install on your own hardware. With closed models, colocation only means choosing a nearby cloud region, and your audio still leaves your network. Only open models run fully on-prem.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the open-source stack for on-prem voice AI?
&lt;/h3&gt;

&lt;p&gt;A self-hosted pipeline typically uses Whisper or Voxtral for speech-to-text, an open language model like Llama or Qwen served through vLLM or Ollama, and Kokoro or Piper for text-to-speech. Telephony runs on Asterisk and SIP. Every layer stays on hardware you control.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much does self-hosted voice AI cost compared to per-minute vendors?
&lt;/h3&gt;

&lt;p&gt;Hosted voice AI runs about 0.12 to 0.25 dollars per minute all-in, and that meter scales with every call. Self-hosting converts the cost into fixed infrastructure, so the marginal cost of another minute is close to zero. Running open models instead of commercial APIs lowers the bill further.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is on-prem deployment required for HIPAA voice AI?
&lt;/h3&gt;

&lt;p&gt;It is not strictly required, though it removes the hardest part of HIPAA compliance. A hosted stack needs a signed BAA at every layer, up to five separate agreements. Self-hosting keeps patient audio inside your own perimeter, so there is no third party to contract with in the first place.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.dograh.com/hub/blogs/on-prem-enterprise-voice-ai" rel="noopener noreferrer"&gt;www.dograh.com/hub/blogs/on-prem-enterprise-voice-ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>4 open-source tools to build production-ready AI voice agents 🎙️🚀</title>
      <dc:creator>Pritesh Kumar</dc:creator>
      <pubDate>Thu, 23 Apr 2026 01:20:48 +0000</pubDate>
      <link>https://dev.to/priteshkr/4-open-source-tools-to-build-production-ready-ai-voice-agents-49p2</link>
      <guid>https://dev.to/priteshkr/4-open-source-tools-to-build-production-ready-ai-voice-agents-49p2</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We built this because we kept hitting the same frustrations. You've got only two choices today. One, you pay a platform fee to any of the 300+ voice AI companies for a comfy UI. Two, you build directly on a framework like Pipecat or LiveKit, where every prompt tweak means a code change and a redeployment. For anyone shipping for clients or any production use case, that's a constant bottleneck.&lt;br&gt;
We wanted a platform where the code is yours, the data stays in your infrastructure, and debugging means reading a trace, not filing a ticket.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Dograh 👑&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I've built voice agents before, but when it came to shipping them to production, I couldn't find a platform that got me up and running in 2 minutes, until we started building Dograh.&lt;br&gt;
It's an open-source voice AI platform with a visual workflow builder, built-in telephony, and post-call analytics out of the box. It's an alternative to Vapi, Retell, and Bland, but self-hostable and BSD-2 licensed.&lt;br&gt;
You get a canvas where you connect nodes instead of writing Python, so prompt tweaks don't mean a redeploy. Voicemail detection, call transfer, variable extraction, knowledge base, and CRM connectors all come standard. Same feature set whether you self-host or use the managed cloud.&lt;br&gt;
It has native support for BYOK (bring your own key) across every layer. Deepgram or Whisper for STT, ElevenLabs or Kokoro for TTS, and any LLM for the brain. Want to run everything locally? Swap in self-hosted models through the UI, no code required.&lt;br&gt;
Check it out: &lt;a href="https://docs.dograh.com/getting-started" rel="noopener noreferrer"&gt;https://docs.dograh.com/getting-started&lt;/a&gt;&lt;br&gt;
YouTube link: &lt;a href="https://www.youtube.com/watch?v=sxiSp4JXqws" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=sxiSp4JXqws&lt;/a&gt;&lt;br&gt;
Star the Dograh repo ⭐ → &lt;a href="https://github.com/dograh-hq/dograh" rel="noopener noreferrer"&gt;https://github.com/dograh-hq/dograh&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj5d5iw0newtdf2tdinjv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj5d5iw0newtdf2tdinjv.jpg" alt=" " width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Pipecat&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building a voice AI prototype is one thing, but owning the audio pipeline in production is a different ball game. Pipecat is the Python framework from the Daily.co team for engineers who want full control over how audio frames move through an agent.&lt;br&gt;
The framework handles STT, voice activity detection, LLM, and TTS as composable stages. Integration coverage is wide, including Deepgram, ElevenLabs, Cartesia, Kokoro, Whisper, Gemini, and several dozen others. Pipecat Cloud is available if you want to skip the ops side. Of the three frameworks on this list, Pipecat is the one I'd bet on in the long term if you're comfortable with Python and want to own the pipeline.&lt;br&gt;
The tradeoff is that Pipecat doesn't ship anything above the framework layer: no visual builder, no post-call analytics, no CRM connectors, no QA tooling. Every change to conversation logic means editing Python, committing, and redeploying. Fine if you have an engineering team with the bandwidth to build the platform layer on top. Rough if you want a working system on day one.&lt;br&gt;
Check it out: &lt;a href="https://docs.pipecat.ai/overview/introduction" rel="noopener noreferrer"&gt;https://docs.pipecat.ai/overview/introduction&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Star the Pipecat repo ⭐ → &lt;a href="https://github.com/pipecat-ai/pipecat" rel="noopener noreferrer"&gt;https://github.com/pipecat-ai/pipecat&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7t7i47zc3iu64eapwmlq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7t7i47zc3iu64eapwmlq.png" alt=" " width="800" height="233"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. LiveKit Agents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building voice AI without battle-tested real-time infrastructure is a disaster waiting to happen. Audio is unforgiving and the moment you have packet loss, multi-party rooms, or browser-to-browser calls, rolling your own transport layer becomes a nightmare.&lt;br&gt;
LiveKit Agents, a WebRTC-native voice framework from LiveKit, is built on top of their widely used real-time media server.&lt;br&gt;
It's organised as composable pieces, including the core media server, the Agents framework for voice AI logic, and LiveKit SIP for PSTN bridging.&lt;br&gt;
In addition, they offer a managed cloud option if you don't want to run the media server yourself, handling scaling, geographic distribution, and SIP trunking for you.&lt;br&gt;
The easiest way to get started is to use the SDK.&lt;br&gt;
The tradeoff is the same as Pipecat. Code-first SDK, no visual interface, no built-in analytics or CRM tooling. Getting a call out the door means wiring up the media server, the agent worker, and the SIP bridge separately. LiveKit Agents is overkill unless you're already using LiveKit for something else, or you genuinely need WebRTC multi-party. For a standard inbound or outbound phone agent, Pipecat is simpler, and Dograh is faster to ship.&lt;br&gt;
For more, refer to their documentation: &lt;a href="https://docs.livekit.io/intro/overview/" rel="noopener noreferrer"&gt;https://docs.livekit.io/intro/overview/&lt;/a&gt;&lt;br&gt;
Star the LiveKit Agents repository ⭐ → &lt;a href="https://github.com/livekit" rel="noopener noreferrer"&gt;https://github.com/livekit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fww4beot2bbcqat9cbj0l.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fww4beot2bbcqat9cbj0l.jpg" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Vocode&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building a voice AI prototype is one thing, but inheriting a dead codebase is another. What can be a bigger time sink than picking a framework that looks alive in search results but is actually abandoned?&lt;br&gt;
Vocode was one of the earlier Python libraries in this space and introduced useful abstractions when it launched. Active development has largely stopped, with minimal commits for well over a year, unanswered issues, and an architecture that predates speech-to-speech models and sub-500ms pipelines.&lt;br&gt;
Building a new production system on Vocode means inheriting technical debt without an active maintainer behind it. Don't. Start with Dograh, Pipecat, or LiveKit instead.&lt;br&gt;
Check it out here: &lt;a href="https://docs.vocode.dev/welcome" rel="noopener noreferrer"&gt;https://docs.vocode.dev/welcome&lt;/a&gt;&lt;br&gt;
Star the GitHub repository: &lt;a href="https://github.com/vocodedev" rel="noopener noreferrer"&gt;https://github.com/vocodedev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr6r8cbyiqj1hioca7kq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr6r8cbyiqj1hioca7kq.png" alt=" " width="280" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Dograh&lt;/th&gt;
&lt;th&gt;Pipecat&lt;/th&gt;
&lt;th&gt;LiveKit Agents&lt;/th&gt;
&lt;th&gt;Vocode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pricing&lt;/td&gt;
&lt;td&gt;Free OSS + optional cloud&lt;/td&gt;
&lt;td&gt;Free OSS&lt;/td&gt;
&lt;td&gt;Free OSS&lt;/td&gt;
&lt;td&gt;Free OSS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visual workflow builder&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hostable&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BYOK for STT, TTS, LLM&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production features (tools, QA, telephony)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Thanks for reading the post.&lt;/p&gt;

&lt;p&gt;Let me know in the comments below if any other open-source voice AI tools or frameworks have helped you ship agents to production.&lt;/p&gt;

</description>
      <category>python</category>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Where Voice AI Agents Are Actually Getting Used in 2026</title>
      <dc:creator>Pritesh Kumar</dc:creator>
      <pubDate>Fri, 17 Apr 2026 13:45:27 +0000</pubDate>
      <link>https://dev.to/dograh/where-voice-ai-agents-are-actually-getting-used-in-2026-49oe</link>
      <guid>https://dev.to/dograh/where-voice-ai-agents-are-actually-getting-used-in-2026-49oe</guid>
      <description>&lt;p&gt;Voice AI has moved past the demo phase. After watching hundreds of deployments across our customer base and the broader ecosystem, I wanted to put together a practical list of where voice agents are actually earning their keep right now, and where the ROI is strong enough to justify real production budgets.&lt;/p&gt;

&lt;p&gt;The list below is the subset that keeps showing up in real pipelines and real P&amp;amp;Ls.&lt;/p&gt;

&lt;h2&gt;
  
  
  Customer Support
&lt;/h2&gt;

&lt;p&gt;Tier-1 support is the single biggest deployed voice AI use case today. Voice agents handle password resets, order status checks, account balance lookups, policy questions, and similar high-volume repetitive queries. The value is straightforward: deflect 40-60% of inbound calls away from human agents, answer in the language the caller speaks, operate 24/7. Most teams start here because the data already exists in their CRM or knowledge base, and the workflows are well understood.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lead Screening and Qualification
&lt;/h2&gt;

&lt;p&gt;Inbound leads from ads, forms, and content marketing usually sit in a queue for hours before a human gets to them. By that point, intent has dropped significantly. Voice agents now pick up the call within seconds, qualify against BANT or a custom rubric, book the meeting straight into the sales rep's calendar, and log everything in HubSpot or Salesforce. This is the highest-velocity use case for B2B teams I have seen. The math is easy: a qualified meeting has a known value in the CRM, and answering in 30 seconds instead of 4 hours multiplies that yield.&lt;/p&gt;

&lt;h2&gt;
  
  
  Collections and Renewals in Fintech
&lt;/h2&gt;

&lt;p&gt;This is where I have seen some of the strongest unit economics. Banks, lenders, and insurance companies run enormous outbound collections operations with razor-thin per-call economics. Voice agents handle reminders, soft collections, payment plan negotiation, drop-off recovery, and renewal nudges at a fraction of the cost of a human BPO. The volumes are high, the scripts are compliance-heavy (which AI handles consistently), and the conversion lift from reaching a borrower in their preferred language at the right time of day is real.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cold Calling and Outbound Sales
&lt;/h2&gt;

&lt;p&gt;The ROI here is very good if you get the targeting right. Voice agents can run thousands of outbound dials a day, handle objections, qualify interest, and hand off warm prospects to a human closer. The catch is that bad targeting plus AI dialing equals spam complaints at scale, so list hygiene and opt-in matter more than the tech itself. Teams that get this right see cost per meeting drop by 5-10x.&lt;/p&gt;

&lt;h2&gt;
  
  
  Appointment Setting in Healthcare
&lt;/h2&gt;

&lt;p&gt;Hospitals, dental clinics, and specialty practices deal with huge no-show rates and constant rebooking churn. Voice agents handle appointment confirmations, reminders, rescheduling, and prep instructions like pre-op fasting rules. The operational impact is immediate: front desk staff get their attention back for patients physically in the clinic, and call handling capacity goes up overnight.&lt;/p&gt;

&lt;h2&gt;
  
  
  Receptions, Restaurants, and Local Services
&lt;/h2&gt;

&lt;p&gt;Any local business with a phone number that rings all day is a candidate. Restaurants take reservations and handle takeaway orders, dental clinics book and confirm visits, salons do intake. The ticket size per business is small, but the cumulative market is enormous. This category will eventually absorb the most total call volume, even if each individual deployment is modest.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Comes Next
&lt;/h2&gt;

&lt;p&gt;The interesting wave ahead is in regulated industries like KYC verification, insurance claims intake (FNOL), patient engagement beyond appointments, and legal intake flows. These need stronger guardrails, better audit trails, tighter integration into systems of record, and clear compliance boundaries. That is where the platform layer matters, and where closed black-box APIs start to hit walls that open, inspectable stacks handle gracefully.&lt;/p&gt;

&lt;p&gt;If you are evaluating voice AI for your own business, start with the use case where you already have volume, a clear script, a measurable outcome, and a team ready to handle the tail exceptions. Skip the speculative experiments. The wins are in the boring, high-frequency calls you make every day already.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>automation</category>
      <category>nlp</category>
    </item>
    <item>
      <title>How I built a full fledged open source AI calling platform and got a million impressions in almost a month .....🤯</title>
      <dc:creator>Pritesh Kumar</dc:creator>
      <pubDate>Tue, 14 Apr 2026 13:41:36 +0000</pubDate>
      <link>https://dev.to/priteshkr/how-i-built-a-full-fledged-open-source-ai-calling-platform-and-got-a-million-impressions-in-almost-5c4m</link>
      <guid>https://dev.to/priteshkr/how-i-built-a-full-fledged-open-source-ai-calling-platform-and-got-a-million-impressions-in-almost-5c4m</guid>
      <description>&lt;p&gt;We published &lt;a href="https://www.dograh.com/" rel="noopener noreferrer"&gt;Dograh&lt;/a&gt; on &lt;a href="https://www.reddit.com/r/buildinpublic/comments/1shn39p/quit_a_chill_job_after_my_previous_startup_got/" rel="noopener noreferrer"&gt;Reddit&lt;/a&gt; a few days ago and the response surprised us. A self-hostable, &lt;a href="https://github.com/dograh-hq/dograh" rel="noopener noreferrer"&gt;open-source alternative to Vapi&lt;/a&gt; was something many developers had been waiting for.&lt;br&gt;
Since then we've gotten a lot of questions about how we actually built it - what the architecture looks like, what decisions we made, and what we'd do differently. Here's the honest walkthrough.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fux0z0iec55f7e0fpx8iz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fux0z0iec55f7e0fpx8iz.png" alt=" " width="613" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it stands&lt;/strong&gt;: an agency self-hosts our stack and is building its third client on top of it, now looking to migrate its existing clients over from other platforms. One of India's largest fintech companies is using our S2S support for a voice AI use case that is currently in development.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzcjb3r5tok6ayccbbllj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzcjb3r5tok6ayccbbllj.png" alt=" " width="571" height="711"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspq3oyam4rc7xgdkf844.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspq3oyam4rc7xgdkf844.png" alt=" " width="584" height="690"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The core idea - provider abstraction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first decision we made was that Dograh should never care which provider is running behind any layer. Every voice agent needs four things: something to handle the phone call, something to transcribe speech, something to think, and something to talk back. Each of these is an abstract layer in Dograh. Any provider plugs in without touching anything else.&lt;br&gt;
For the real-time pipeline, we built on a custom fork of Pipecat. We chose Pipecat over LiveKit because of its architectural simplicity - everything is a processor, and events and data flow through the pipeline. Each processor can either act on the event asynchronously or forward it to the next one. That model makes it easy to reason about what's happening at any point in the call.&lt;br&gt;
We also built our own telephony integration layer on top of this, rather than relying on existing abstractions. That turned out to be the right call. It let us build things like human call transfers, where the transfer is exposed as a tool option in the LLM context - the model decides when to hand off based on the conversation, not a hardcoded rule.&lt;br&gt;
Today, Dograh supports:&lt;br&gt;
Telephony: Twilio, Vonage, Cloudonix, Telnyx, Vobiz, Asterisk ARI&lt;br&gt;
STT: Deepgram, Speechmatics, Sarvam, OpenAI Whisper&lt;br&gt;
LLM: OpenAI, Gemini, Groq, OpenRouter, Azure, AWS Bedrock, and fully self-hosted&lt;br&gt;
TTS: ElevenLabs, Cartesia, Deepgram, OpenAI TTS&lt;br&gt;
Swapping any of these is a config change.&lt;/p&gt;
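
&lt;p&gt;To make the "config change" idea concrete, here is a minimal sketch of a provider abstraction in Python. The names (&lt;code&gt;STTProvider&lt;/code&gt;, &lt;code&gt;STT_REGISTRY&lt;/code&gt;, &lt;code&gt;build_stt&lt;/code&gt;) and the fake vendor classes are hypothetical illustrations, not Dograh's actual code:&lt;/p&gt;

```python
from typing import Callable, Dict, Protocol


class STTProvider(Protocol):
    """Anything that can turn audio bytes into text."""

    def transcribe(self, audio: bytes) -> str: ...


class FakeDeepgram:
    # Stand-in for a real vendor client; no network calls are made.
    def transcribe(self, audio: bytes) -> str:
        return f"deepgram:{len(audio)} bytes"


class FakeWhisper:
    # Second stand-in, to show that the caller does not change.
    def transcribe(self, audio: bytes) -> str:
        return f"whisper:{len(audio)} bytes"


# Hypothetical registry: config string -> provider factory.
STT_REGISTRY: Dict[str, Callable[[], STTProvider]] = {
    "deepgram": FakeDeepgram,
    "whisper": FakeWhisper,
}


def build_stt(config: dict) -> STTProvider:
    """Build the STT provider named in config; the pipeline never sees the vendor."""
    name = config["stt"]
    if name not in STT_REGISTRY:
        raise ValueError(f"unknown STT provider: {name}")
    return STT_REGISTRY[name]()
```

&lt;p&gt;In a layout like this, switching vendors means changing the &lt;code&gt;"stt"&lt;/code&gt; value in the config; the code that calls &lt;code&gt;transcribe&lt;/code&gt; stays the same.&lt;/p&gt;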

&lt;p&gt;&lt;strong&gt;Speech-to-speech support&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Beyond the STT-LLM-TTS pipeline, we've added support for true speech-to-speech models. Right now that means Gemini 3.1 Flash Live. S2S collapses the three-step loop into a single model call, which gets you sub-300ms latency (theoretically) and more natural turn-taking. Barge-in handling works out of the box. We're planning to build more robust support for locally hosted S2S models in the short term.&lt;/p&gt;

&lt;p&gt;We also support distributed tracing with &lt;a href="https://www.dynatrace.com/monitoring/technologies/opentelemetry/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt;, with a solid integration into &lt;a href="https://langfuse.com/" rel="noopener noreferrer"&gt;Langfuse&lt;/a&gt;. If you want full observability across every LLM call, tool invocation, and latency breakdown - it's already there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid voice - the thing we're most proud of&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pure TTS agents have a latency and naturalness problem. Every response gets generated fresh, which takes time, and even the best TTS models sound slightly synthetic on predictable phrases.&lt;br&gt;
We built a hybrid voice mode to fix this. The LLM picks from a library of pre-recorded human voice clips for common responses - greetings, confirmations, transitions - and only falls back to TTS when it needs to improvise. The predictable parts play instantly because there's no generation happening. Dynamic parts use TTS or a voice clone. The result is lower latency, lower cost, and a more natural-sounding agent on the parts of the call that matter most for first impressions. We can also play a recording while the agent transitions to a new node or makes a tool call, so callers are not left waiting in silence.&lt;/p&gt;
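
&lt;p&gt;The clip-or-TTS decision can be sketched roughly as follows. This assumes the LLM returns an optional clip identifier alongside its text; &lt;code&gt;CLIP_LIBRARY&lt;/code&gt;, &lt;code&gt;Playback&lt;/code&gt;, and &lt;code&gt;choose_playback&lt;/code&gt; are hypothetical names for illustration only:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Playback:
    source: str   # "clip" or "tts"
    payload: str  # clip file path, or the text to synthesize


# Hypothetical clip library: clip id -> pre-recorded audio file.
CLIP_LIBRARY: Dict[str, str] = {
    "greeting": "clips/greeting.wav",
    "confirm": "clips/confirm.wav",
    "one_moment": "clips/one_moment.wav",
}


def choose_playback(clip_id: Optional[str], text: str) -> Playback:
    """Use a recorded clip when the LLM names one that exists, else fall back to TTS."""
    if clip_id is not None and clip_id in CLIP_LIBRARY:
        return Playback("clip", CLIP_LIBRARY[clip_id])
    return Playback("tts", text)
```

&lt;p&gt;An unknown or missing clip id falls through to TTS, so a bad pick by the model degrades to the normal path instead of producing silence.&lt;/p&gt;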

&lt;p&gt;&lt;strong&gt;Our current stack&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rather than explain each layer in isolation, here's the full picture:&lt;br&gt;
FastAPI for the backend. Our workload is heavily I/O bound - concurrent calls, real-time audio streaming, multiple async API calls in flight at once. FastAPI's async Python support handles this well within a single process.&lt;br&gt;
Next.js for the UI, deployed on Vercel.&lt;br&gt;
PostgreSQL as the primary database, shipped with Docker Compose. We also use it for vector embeddings, so there's no separate vector store to run.&lt;br&gt;
Arq for async task management and cron jobs. It handles our scheduled call queue and background workers cleanly.&lt;br&gt;
MinIO for S3-compatible file storage - transcripts, recordings, anything that needs to persist beyond a call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Call traces and QA - the part most platforms skip&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a call fails in production, a recording and a transcript are not enough. You need to know what the STT heard on turn 4, what the LLM decided, which tool it called, what the latency was at each step, and whether the caller interrupted mid-response. Without that, you're guessing.&lt;br&gt;
Every call in Dograh generates a full per-turn trace. It's the unit of debugging, not the recording. When something goes wrong you open the trace, find the turn, see exactly what happened, fix the prompt, and re-run. No support tickets to a vendor who can't show you the internals.&lt;br&gt;
After every call, &lt;a href="https://github.com/dograh-hq/dograh" rel="noopener noreferrer"&gt;Dograh&lt;/a&gt; also runs automatic post-call QA - checking sentiment, whether the user got confused, whether the agent followed its instructions, and what actions fired. You don't need to listen to 200 recordings to find where things broke.&lt;/p&gt;
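
&lt;p&gt;As a rough picture of what a per-turn trace record might hold, here is a small sketch. The schema and field names are assumptions made for illustration, not Dograh's real trace format:&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TurnTrace:
    """One conversational turn, captured end to end (hypothetical schema)."""

    turn: int
    stt_text: str                  # what the STT heard
    llm_text: str                  # what the LLM decided to say
    tool_calls: List[str] = field(default_factory=list)
    latency_ms: Dict[str, float] = field(default_factory=dict)  # per stage
    interrupted: bool = False      # caller barged in mid-response


def slowest_stage(trace: TurnTrace) -> str:
    """Return the stage that added the most latency (needs a non-empty latency_ms)."""
    return max(trace.latency_ms, key=trace.latency_ms.get)
```

&lt;p&gt;With records like this, "what happened on turn 4" becomes a lookup rather than a listening session.&lt;/p&gt;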

&lt;p&gt;&lt;strong&gt;Why we open-sourced it&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We built this because we kept hitting the same frustrations. You have only two choices today. One, you pay a platform fee to any of the 300+ voice AI companies for a comfy UI. Two, you build directly on &lt;a href="https://www.pipecat.ai/" rel="noopener noreferrer"&gt;Pipecat&lt;/a&gt; or &lt;a href="https://livekit.io/" rel="noopener noreferrer"&gt;LiveKit&lt;/a&gt;, where every prompt tweak requires a code change and a redeployment. For anyone shipping for clients or any production use case, that's a constant bottleneck.&lt;br&gt;
We wanted a platform where the code is yours, the data stays in your infrastructure, and debugging means reading a trace, not filing a ticket.&lt;br&gt;
Dograh is BSD-2 licensed. Self-hosted via Docker Compose. No per-minute platform fee because you own the platform.&lt;br&gt;
A star genuinely helps us more than we can explain.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/dograh-hq/dograh" rel="noopener noreferrer"&gt;Star Dograh&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>python</category>
    </item>
    <item>
      <title>We analyzed 10,000 voice AI calls. The LLM was rarely the problem.</title>
      <dc:creator>Pritesh Kumar</dc:creator>
      <pubDate>Sat, 28 Mar 2026 13:54:27 +0000</pubDate>
      <link>https://dev.to/dograh/we-analyzed-10000-voice-ai-calls-the-llm-was-rarely-the-problem-3dod</link>
      <guid>https://dev.to/dograh/we-analyzed-10000-voice-ai-calls-the-llm-was-rarely-the-problem-3dod</guid>
      <description>&lt;p&gt;We built &lt;a href="https://github.com/dograh-hq/dograh" rel="noopener noreferrer"&gt;Dograh OSS&lt;/a&gt;, an open-source voice AI platform. When we started, we assumed most failures would come from the LLM - bad answers, missed intent, prompt edge cases. So we spent a lot of early effort there.&lt;/p&gt;

&lt;p&gt;Then we looked at the data. We ran automated QA where an LLM reviews every turn in every call and tags what went right and wrong, and we spent hours listening to calls ourselves. Across roughly 10,000 calls spanning customer support, appointment booking, and lead qualification, the failure picture looked nothing like what we expected.&lt;/p&gt;

&lt;p&gt;The problems that showed up again and again were about the phone call as a medium. Timing, audio physics, and infrastructure designed decades before LLMs existed.&lt;/p&gt;

&lt;p&gt;Here is what we found, roughly ranked by frequency.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure area&lt;/th&gt;
&lt;th&gt;Share&lt;/th&gt;
&lt;th&gt;Primary driver&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;STT / word error rate&lt;/td&gt;
&lt;td&gt;~38%&lt;/td&gt;
&lt;td&gt;Low-quality telephony audio and accent variation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First-8-second chaos&lt;/td&gt;
&lt;td&gt;~34%&lt;/td&gt;
&lt;td&gt;Greeting latency, barge-in, variable user behavior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interruption handling&lt;/td&gt;
&lt;td&gt;~28%&lt;/td&gt;
&lt;td&gt;Filler words breaking flow, context switching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extended silence&lt;/td&gt;
&lt;td&gt;~22%&lt;/td&gt;
&lt;td&gt;Users distracted, fetching info, handing off phone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool call latency&lt;/td&gt;
&lt;td&gt;~19%&lt;/td&gt;
&lt;td&gt;LLM turns plus external API latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM failure modes&lt;/td&gt;
&lt;td&gt;~15%&lt;/td&gt;
&lt;td&gt;Hallucinations, instruction drift, latency trade-offs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Broken escalation&lt;/td&gt;
&lt;td&gt;~11%&lt;/td&gt;
&lt;td&gt;No clear human handoff path&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These categories overlap. A lot of bad calls had two or three of these compounding each other.&lt;/p&gt;

&lt;h2&gt;
  
  
  STT is worse than you think
&lt;/h2&gt;

&lt;p&gt;STT failures were the single biggest contributor to broken calls, and the one most consistently underestimated by teams building voice AI for the first time.&lt;/p&gt;

&lt;p&gt;Standard telephony runs at 8 kHz audio. That is genuinely low quality. It strips away acoustic detail that helps distinguish consonants, so "schedule" becomes "school" in a transcript and "Sunday" comes through as "someday." Add background noise, speakerphone, or a non-native accent and word error rates climb fast.&lt;/p&gt;

&lt;p&gt;The thing about WER is that errors compound. "I need to school my appointment for someday" should be understood as "I need to schedule my appointment for Sunday," and a well-prompted LLM will figure that out. When three or four words are garbled in the same turn though, contextual inference falls apart. We saw calls where entire turns came through as near-gibberish.&lt;/p&gt;

&lt;p&gt;There is also a dimension STT completely misses. Transcription captures words but it does not capture how those words are said. A frustrated "fine" and a genuinely agreeable "fine" are two very different things, but they look identical in a transcript. Tone helps disambiguate words that sound similar. When a caller sounds confused or hesitant, a human listener picks up on that and adjusts. STT gives you a flat string of text and the LLM works with it blind.&lt;/p&gt;

&lt;p&gt;No STT provider has uniform accuracy across all accents either. The gap between a provider's accuracy on American English versus Nigerian English or heavily accented Spanish can be 15 to 25 percentage points. If your users call from diverse regions, picking one STT provider and moving on is going to hurt you.&lt;/p&gt;

&lt;p&gt;Two mitigations made a real difference in our testing. First, adding a custom vocabulary to your STT engine - and I mean beyond domain jargon. If you are building an order management bot, add the common words your callers actually say: "order," "ID," "payment," "cancel," "account," "address." Thirty to forty frequently used words. The STT engine listens for these with extra weight and it meaningfully reduces errors on the terms that matter most.&lt;/p&gt;

&lt;p&gt;Second, tell the LLM to expect transcription errors. A single line in your system prompt acknowledging that the caller's words may contain transcription noise, and asking the model to use contextual reasoning before responding, reduces the downstream impact of STT failures significantly. The LLM stops treating garbled transcripts as literal input and starts being smarter about what the caller probably meant.&lt;/p&gt;
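
&lt;p&gt;A minimal sketch of that second mitigation. The wording of the note and the helper name &lt;code&gt;build_system_prompt&lt;/code&gt; are illustrative assumptions; tune the text for your own agent:&lt;/p&gt;

```python
# Illustrative wording only; adjust for your domain and model.
STT_NOISE_NOTE = (
    "The caller's words come from speech-to-text on phone audio and may "
    "contain transcription errors. Infer the most likely intended meaning "
    "from context before responding, and ask a short clarifying question "
    "if the meaning is still unclear."
)


def build_system_prompt(base_prompt: str) -> str:
    """Append the transcription-noise note to an existing system prompt."""
    return f"{base_prompt.strip()}\n\n{STT_NOISE_NOTE}"
```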

&lt;h2&gt;
  
  
  The first 8 seconds are where calls die
&lt;/h2&gt;

&lt;p&gt;About a third of all problematic calls failed before the real conversation even started.&lt;/p&gt;

&lt;p&gt;A lot happens in those first 8 to 10 seconds. Callers are still deciding whether they want to engage. Many have not fully shifted their attention to the call. Some are already talking before the bot finishes its greeting, and others wait much longer than expected, unsure whether the silence means the system is broken. This is also the window where callers most frequently realize they are talking to a bot, and many immediately start testing it - interrupting, asking off-script questions, being deliberately vague. The variance in behavior is just much higher here than at any other point in the call.&lt;/p&gt;

&lt;p&gt;Greeting latency makes everything worse. A gap of more than a second or two between the call connecting and the first word of audio is enough for many callers to assume things are broken and hang up. Pre-generating and caching your greeting audio instead of synthesizing it fresh every time removes an entire class of failures here.&lt;/p&gt;

&lt;p&gt;What works: keep greetings short, under six seconds. Consider disabling barge-in for just the first 3 to 4 seconds if your greeting contains information the caller needs to hear. Re-enable interruption after that. The period where barge-in causes the most problems is right at the start.&lt;/p&gt;

&lt;p&gt;In Dograh, each workflow node has its own "Allow Interruption" toggle, so you can switch off interruption on the starting node for a short introduction and re-enable it for the rest of the conversation.&lt;/p&gt;
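
&lt;p&gt;If you are wiring this up yourself, the same idea can be expressed as a time-based rule. This is a generic sketch, not Dograh's implementation; the function name, the action labels, and the 3-second default are assumptions:&lt;/p&gt;

```python
def handle_caller_speech(
    elapsed_s: float, bot_speaking: bool, protected_s: float = 3.0
) -> str:
    """Decide what to do when the caller starts talking.

    elapsed_s is the time since the greeting started playing.
    """
    if not bot_speaking:
        return "process_turn"          # normal turn, nothing to interrupt
    if elapsed_s >= protected_s:
        return "stop_tts_and_listen"   # barge-in is enabled again
    return "keep_playing"              # still inside the protected greeting
```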

&lt;h2&gt;
  
  
  Interruptions, silence, and dead air
&lt;/h2&gt;

&lt;p&gt;Most barge-in documentation focuses on detecting when a user starts speaking and stopping the TTS output. That part is reasonably well solved. The harder problem is what happens to the conversation after the interruption.&lt;/p&gt;

&lt;p&gt;One pattern we saw constantly: a caller says "uh huh" or "yeah" while the bot is talking. The bot interprets this as a genuine turn, stops speaking, and tries to process it as new user input. The LLM produces a response to what is essentially a non-sentence and the conversation breaks. The caller has to re-explain what they wanted.&lt;/p&gt;

&lt;p&gt;A related pattern involves context switching - a caller interrupts to ask "wait, does that include weekends?" The bot handles the question fine but loses track of what it was explaining before. The original task gets dropped without resolution.&lt;/p&gt;

&lt;p&gt;Both problems are about how interruption events are handled in the conversation state. The fix is designing explicit logic for what happens after an interruption. Differentiate between filler sounds and substantive speech, and track unresolved conversational threads so the bot can come back to them.&lt;/p&gt;
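
&lt;p&gt;Both pieces can be sketched in a few lines. The filler word list and the &lt;code&gt;ThreadTracker&lt;/code&gt; class below are simplified assumptions for illustration; a production system would likely also use audio duration and timing signals, not just the words:&lt;/p&gt;

```python
import re
from typing import List, Optional

# Hypothetical filler list; extend it per language and audience.
FILLERS = {"uh", "huh", "um", "hmm", "mm", "yeah", "yep", "ok", "okay", "right"}


def is_backchannel(utterance: str) -> bool:
    """True when the utterance consists only of filler words."""
    words = re.findall(r"[a-z']+", utterance.lower())
    return len(words) > 0 and all(w in FILLERS for w in words)


class ThreadTracker:
    """Remembers what the bot was explaining when a real interruption landed."""

    def __init__(self) -> None:
        self._pending: List[str] = []

    def interrupted(self, current_topic: str) -> None:
        self._pending.append(current_topic)

    def resume(self) -> Optional[str]:
        # Most recent unresolved topic first, or None if nothing is pending.
        return self._pending.pop() if self._pending else None
```

&lt;p&gt;Apply the filler check only while the bot is speaking. A "yeah" said in reply to a direct yes/no question is a real answer and should be processed as a turn.&lt;/p&gt;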

&lt;p&gt;Silence is closely related. Real callers go quiet for perfectly legitimate reasons - checking an email for an order ID, handing the phone to someone else, looking something up. A caller fetching information might go quiet for 20 to 40 seconds. Most voice AI systems interpret this as confusion or abandonment and respond with prompts or just hang up.&lt;/p&gt;

&lt;p&gt;A graduated response works much better: a brief neutral filler after 5 seconds, a gentle "still there?" check at 15 seconds, a real choice at 45 seconds ("I can hold, or you are welcome to call back"), and a graceful close with callback instructions only after 90 seconds or more.&lt;/p&gt;
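
&lt;p&gt;The thresholds above map naturally onto a small lookup. This sketch assumes a timer that periodically reports how long the caller has been silent; the action labels and the &lt;code&gt;due_actions&lt;/code&gt; helper are hypothetical:&lt;/p&gt;

```python
from typing import List, Tuple

# (seconds of silence, action) pairs taken from the schedule described above.
SILENCE_STEPS: List[Tuple[float, str]] = [
    (5.0, "neutral_filler"),
    (15.0, "still_there_check"),
    (45.0, "offer_hold_or_callback"),
    (90.0, "close_with_callback_instructions"),
]


def due_actions(prev_s: float, now_s: float) -> List[str]:
    """Actions whose threshold was crossed between the previous check and now."""
    return [
        action
        for threshold, action in SILENCE_STEPS
        if now_s >= threshold > prev_s
    ]
```

&lt;p&gt;Comparing against the previous check means each step fires once per silence, instead of repeating on every timer tick.&lt;/p&gt;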

&lt;h2&gt;
  
  
  Tool call latency creates its own kind of dead air
&lt;/h2&gt;

&lt;p&gt;When a voice agent needs to look something up in a CRM or check a calendar slot, a tool call gets triggered. In practice that means at least one additional LLM turn to interpret the result, plus the external API latency. The total gap can easily reach 3 to 5 seconds, and callers at that point genuinely don't know if the call is still alive.&lt;/p&gt;

&lt;p&gt;We are adding a "playback speech" feature in Dograh that lets you configure a pre-recorded audio clip to play while a tool call executes. This fills the silence without the LLM having to generate a response and it keeps the caller engaged. Beyond that, pre-fetching data you know will be needed at call start - account details, prior call history, caller ID lookups - keeps common tool calls out of the live response path entirely.&lt;/p&gt;
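
&lt;p&gt;The general pattern of filling the gap can be sketched with plain &lt;code&gt;asyncio&lt;/code&gt;. This is not the Dograh feature itself; &lt;code&gt;call_with_filler&lt;/code&gt;, its parameters, and the grace period are assumptions for illustration:&lt;/p&gt;

```python
import asyncio
from typing import Awaitable, Callable


async def call_with_filler(
    tool: Callable[[], Awaitable[str]],
    play_clip: Callable[[], Awaitable[None]],
    grace_s: float = 0.7,
) -> str:
    """Run the tool; if it is still pending after grace_s, play a filler clip."""
    task = asyncio.ensure_future(tool())
    done, _pending = await asyncio.wait({task}, timeout=grace_s)
    if not done:
        # The tool is slow: cover the silence while it keeps running.
        await play_clip()
    return await task
```

&lt;p&gt;The short grace period keeps fast tool calls from triggering a clip the caller does not need to hear.&lt;/p&gt;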

&lt;h2&gt;
  
  
  LLM failures and broken escalation
&lt;/h2&gt;

&lt;p&gt;LLM failures in production voice AI are real but probably smaller than you would expect. Hallucination gets the most attention but it accounts for a smaller share of bad calls than the quieter failures. Models stop following instructions partway through long calls. They generate empty responses that cause the bot to say nothing at all. They produce answers that are technically coherent but completely disconnected from the previous turn.&lt;/p&gt;

&lt;p&gt;The trade-off between model size and latency matters here too. Smaller models respond faster, which helps perceived call quality, but they follow complex instructions less reliably. Larger models handle nuanced prompts better but their response latency feeds right back into the dead-air problem.&lt;/p&gt;

&lt;p&gt;Escalation failures have a lower frequency but they are the most consequential category on this list. The callers asking for a human almost always have the hardest, most urgent problems. They have already tried self-service and it didn't work. When the bot can't detect they want to escalate - because they phrased it in a way the intent detection doesn't recognize - or when the escalation path drops the context so the human agent starts from scratch, that caller's experience is about as bad as it gets.&lt;/p&gt;

&lt;p&gt;Escalation should be a first-class destination in every workflow. The phrases callers use to ask for a human are wildly more varied than any training set anticipates, and the transfer needs to carry the full conversation transcript to the receiving agent.&lt;/p&gt;
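
&lt;p&gt;A minimal sketch of a handoff payload that carries the conversation along with the transfer. The field names and the &lt;code&gt;build_handoff&lt;/code&gt; helper are hypothetical; the point is only that the human agent receives the transcript and the reason, not a cold call:&lt;/p&gt;

```python
from typing import Dict, List


def build_handoff(turns: List[Dict[str, str]], reason: str) -> dict:
    """Bundle everything the human agent needs so the caller does not start over."""
    last_user = next(
        (t["text"] for t in reversed(turns) if t["role"] == "user"), ""
    )
    return {
        "reason": reason,                  # why the bot is escalating
        "last_user_utterance": last_user,  # quick context for the agent
        "transcript": list(turns),         # full conversation, in order
    }
```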

&lt;h2&gt;
  
  
  The pattern
&lt;/h2&gt;

&lt;p&gt;Voice AI breaks at transitions. The first seconds of a call, the moment a user interrupts, the gap while a tool is running, the point where a caller needs a human. These are the edges where the system's assumptions about how conversations work meet how people actually behave on phone calls.&lt;/p&gt;

&lt;p&gt;Teams that focus almost entirely on LLM response quality and treat these transitions as secondary concerns tend to ship agents that sound good in demos but disappoint in production. The calls that held up well in our data were the ones with the most deliberate handling of everything that happens between LLM responses.&lt;/p&gt;




&lt;p&gt;We are building Dograh as the open-source alternative to Vapi, Retell, and Bland. No per-minute fees, no vendor lock-in, deploy on your own infra. Feel free to ask any questions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/dograh-hq/dograh" rel="noopener noreferrer"&gt;Star us on GitHub&lt;/a&gt; | &lt;a href="https://app.dograh.com" rel="noopener noreferrer"&gt;Try the cloud version&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>voiceai</category>
      <category>webdev</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
