<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dograh AI</title>
    <description>The latest articles on DEV Community by Dograh AI (@dograh).</description>
    <link>https://dev.to/dograh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F12702%2F7ca75fbe-0efc-4495-a3f1-c9d0cd08bb1e.png</url>
      <title>DEV Community: Dograh AI</title>
      <link>https://dev.to/dograh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dograh"/>
    <language>en</language>
    <item>
      <title>We analyzed 10,000 voice AI calls. The LLM was rarely the problem.</title>
      <dc:creator>Pritesh Kumar</dc:creator>
      <pubDate>Sat, 28 Mar 2026 13:54:27 +0000</pubDate>
      <link>https://dev.to/dograh/we-analyzed-10000-voice-ai-calls-the-llm-was-rarely-the-problem-3dod</link>
      <guid>https://dev.to/dograh/we-analyzed-10000-voice-ai-calls-the-llm-was-rarely-the-problem-3dod</guid>
      <description>&lt;p&gt;We built &lt;a href="https://github.com/dograh-hq/dograh" rel="noopener noreferrer"&gt;Dograh OSS&lt;/a&gt;, an open-source voice AI platform. When we started, we assumed most failures would come from the LLM - bad answers, missed intent, prompt edge cases. So we spent a lot of early effort there.&lt;/p&gt;

&lt;p&gt;Then we looked at the data. We ran automated QA where an LLM reviews every turn in every call and tags what went right and wrong, and we spent hours listening to calls ourselves. Across roughly 10,000 calls spanning customer support, appointment booking, and lead qualification, the failure picture looked nothing like what we expected.&lt;/p&gt;

&lt;p&gt;The problems that showed up again and again were about the phone call as a medium. Timing, audio physics, and infrastructure designed decades before LLMs existed.&lt;/p&gt;

&lt;p&gt;Here is what we found, roughly ranked by frequency.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure area&lt;/th&gt;
&lt;th&gt;Share&lt;/th&gt;
&lt;th&gt;Primary driver&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;STT / word error rate&lt;/td&gt;
&lt;td&gt;~38%&lt;/td&gt;
&lt;td&gt;Low-quality telephony audio and accent variation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First-8-second chaos&lt;/td&gt;
&lt;td&gt;~34%&lt;/td&gt;
&lt;td&gt;Greeting latency, barge-in, variable user behavior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interruption handling&lt;/td&gt;
&lt;td&gt;~28%&lt;/td&gt;
&lt;td&gt;Filler words breaking flow, context switching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extended silence&lt;/td&gt;
&lt;td&gt;~22%&lt;/td&gt;
&lt;td&gt;Users distracted, fetching info, handing off phone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool call latency&lt;/td&gt;
&lt;td&gt;~19%&lt;/td&gt;
&lt;td&gt;LLM turns plus external API latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM failure modes&lt;/td&gt;
&lt;td&gt;~15%&lt;/td&gt;
&lt;td&gt;Hallucinations, instruction drift, latency trade-offs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Broken escalation&lt;/td&gt;
&lt;td&gt;~11%&lt;/td&gt;
&lt;td&gt;No clear human handoff path&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These categories overlap. A lot of bad calls had two or three of these compounding each other.&lt;/p&gt;

&lt;h2&gt;STT is worse than you think&lt;/h2&gt;

&lt;p&gt;STT failures were the single biggest contributor to broken calls, and the one most consistently underestimated by teams building voice AI for the first time.&lt;/p&gt;

&lt;p&gt;Standard telephony audio is sampled at 8 kHz, which is genuinely low quality. It strips away acoustic detail that helps distinguish consonants, so "schedule" becomes "school" in a transcript and "Sunday" comes through as "someday." Add background noise, speakerphone, or a non-native accent and word error rates climb fast.&lt;/p&gt;

&lt;p&gt;The thing about WER is that errors compound. "I need to school my appointment for someday" should be understood as "I need to schedule my appointment for Sunday," and a well-prompted LLM will figure that out. When three or four words are garbled in the same turn though, contextual inference falls apart. We saw calls where entire turns came through as near-gibberish.&lt;/p&gt;

&lt;p&gt;There is also a dimension STT completely misses. Transcription captures words but it does not capture how those words are said. A frustrated "fine" and a genuinely agreeable "fine" are two very different things, but they look identical in a transcript. Tone helps disambiguate words that sound similar. When a caller sounds confused or hesitant, a human listener picks up on that and adjusts. STT gives you a flat string of text and the LLM works with it blind.&lt;/p&gt;

&lt;p&gt;No STT provider has uniform accuracy across all accents either. The gap between a provider's accuracy on American English versus Nigerian English or heavily accented Spanish can be 15 to 25 percentage points. If your users call from diverse regions, picking one STT provider and moving on is going to hurt you.&lt;/p&gt;

&lt;p&gt;Two mitigations made a real difference in our testing. First, adding a custom vocabulary to your STT engine - and I mean beyond domain jargon. If you are building an order management bot, add the common words your callers actually say: "order," "ID," "payment," "cancel," "account," "address." Thirty to forty frequently used words. The STT engine listens for these with extra weight and it meaningfully reduces errors on the terms that matter most.&lt;/p&gt;
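&lt;p&gt;As a concrete sketch of what that vocabulary hint can look like: the exact parameter name varies by STT provider (Deepgram, for example, has keyword boosting; other engines name it differently), so treat the "keywords" field below as illustrative rather than any specific API.&lt;/p&gt;

```python
# Sketch of weighted vocabulary hints for an STT request. The parameter
# name "keywords" is illustrative - check your provider's docs for the
# real field, but the word:weight shape is common across engines.

COMMON_TERMS = ["order", "ID", "payment", "cancel", "account", "address"]

def build_stt_options(base_options, terms, boost=2.0):
    """Return a copy of the STT config with weighted vocabulary hints."""
    options = dict(base_options)
    # Many engines accept "word:weight" pairs; a higher weight nudges
    # the decoder toward these words when the audio is ambiguous.
    options["keywords"] = [f"{term}:{boost}" for term in terms]
    return options

opts = build_stt_options({"model": "nova-2", "language": "en"}, COMMON_TERMS)
print(opts["keywords"][:2])  # ['order:2.0', 'ID:2.0']
```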

&lt;p&gt;Second, tell the LLM to expect transcription errors. A single line in your system prompt acknowledging that the caller's words may contain transcription noise, and asking the model to use contextual reasoning before responding, reduces the downstream impact of STT failures significantly. The LLM stops treating garbled transcripts as literal input and starts being smarter about what the caller probably meant.&lt;/p&gt;
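&lt;p&gt;One illustrative wording of that prompt line - the phrasing is ours, not a canonical prompt, so adapt it to your domain:&lt;/p&gt;

```python
# Hedged example of the "expect transcription noise" system-prompt line.
# The wording below is illustrative, not a tested canonical prompt.

STT_NOISE_HINT = (
    "The user's messages come from a phone-call transcript and may "
    "contain speech-to-text errors. If a word seems out of place "
    "(for example 'school' instead of 'schedule'), infer the likely "
    "intent from context, and ask a short confirming question when unsure."
)

def build_system_prompt(base_prompt):
    """Append the transcription-noise caveat to any agent prompt."""
    return base_prompt + "\n\n" + STT_NOISE_HINT

print("speech-to-text errors" in build_system_prompt("You are a booking agent."))
# True
```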

&lt;h2&gt;The first 8 seconds are where calls die&lt;/h2&gt;

&lt;p&gt;About a third of all problematic calls failed before the real conversation even started.&lt;/p&gt;

&lt;p&gt;A lot happens in those first 8 to 10 seconds. Callers are still deciding whether they want to engage. Many have not fully shifted their attention to the call. Some are already talking before the bot finishes its greeting, and others wait much longer than expected, unsure whether the silence means the system is broken. This is also the window where callers most frequently realize they are talking to a bot, and many immediately start testing it - interrupting, asking off-script questions, being deliberately vague. The variance in behavior is just much higher here than at any other point in the call.&lt;/p&gt;

&lt;p&gt;Greeting latency makes everything worse. A gap of more than a second or two between the call connecting and the first word of audio is enough for many callers to assume things are broken and hang up. Pre-generating and caching your greeting audio instead of synthesizing it fresh every time removes an entire class of failures here.&lt;/p&gt;

&lt;p&gt;What works: keep greetings short, under six seconds. Consider disabling barge-in for just the first 3 to 4 seconds if your greeting contains information the caller needs to hear. Re-enable interruption after that. The period where barge-in causes the most problems is right at the start.&lt;/p&gt;

&lt;p&gt;In Dograh, each workflow node has its own "Allow Interruption" toggle, so you can switch off interruption on the starting node for a short introduction and re-enable it for the rest of the conversation.&lt;/p&gt;

&lt;h2&gt;Interruptions, silence, and dead air&lt;/h2&gt;

&lt;p&gt;Most barge-in documentation focuses on detecting when a user starts speaking and stopping the TTS output. That part is reasonably well solved. The harder problem is what happens to the conversation after the interruption.&lt;/p&gt;

&lt;p&gt;One pattern we saw constantly: a caller says "uh huh" or "yeah" while the bot is talking. The bot interprets this as a genuine turn, stops speaking, and tries to process it as new user input. The LLM produces a response to what is essentially a non-sentence and the conversation breaks. The caller has to re-explain what they wanted.&lt;/p&gt;

&lt;p&gt;A related pattern involves context switching - a caller interrupts to ask "wait, does that include weekends?" The bot handles the question fine but loses track of what it was explaining before. The original task gets dropped without resolution.&lt;/p&gt;

&lt;p&gt;Both problems are about how interruption events are handled in the conversation state. The fix is designing explicit logic for what happens after an interruption. Differentiate between filler sounds and substantive speech, and track unresolved conversational threads so the bot can come back to them.&lt;/p&gt;
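&lt;p&gt;A minimal sketch of that filler-versus-substantive split - the word list and the thread tracking below are illustrative, not Dograh's actual implementation:&lt;/p&gt;

```python
# Illustrative sketch: distinguish backchannel sounds from real turns,
# and remember the dropped thread when a genuine interruption lands.

FILLERS = {"uh huh", "uh-huh", "mhm", "mm-hmm", "yeah",
           "ok", "okay", "right", "sure"}

def is_backchannel(transcript):
    """Treat short acknowledgement sounds as 'keep talking', not a new turn."""
    text = transcript.lower().strip().rstrip(".!,?")
    return text in FILLERS

def handle_interruption(transcript, state):
    """Decide what an interruption means for the conversation state."""
    if is_backchannel(transcript):
        return "resume"             # keep playing the current TTS response
    # Substantive speech: track the unresolved thread so the bot can
    # return to it after answering the caller's question.
    state["pending_topics"].append(state["current_topic"])
    return "yield"                  # stop TTS, process as a real user turn

state = {"current_topic": "weekday availability", "pending_topics": []}
print(handle_interruption("uh huh", state))                             # resume
print(handle_interruption("wait, does that include weekends?", state))  # yield
```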

&lt;p&gt;Silence is closely related. Real callers go quiet for perfectly legitimate reasons - checking an email for an order ID, handing the phone to someone else, looking something up. A caller fetching information might go quiet for 20 to 40 seconds. Most voice AI systems interpret this as confusion or abandonment and respond with prompts or just hang up.&lt;/p&gt;

&lt;p&gt;A graduated response works much better: a brief neutral filler after 5 seconds, a gentle "still there?" check at 15 seconds, a real choice at 45 seconds ("I can hold, or you are welcome to call back"), and a graceful close with callback instructions only after 90 seconds or more.&lt;/p&gt;
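&lt;p&gt;The thresholds below are the ones from the paragraph above; the table-driven structure is an illustrative sketch, not a specific implementation:&lt;/p&gt;

```python
# Graduated silence handling as a simple threshold table. Thresholds
# come from the article; the code shape is ours.

SILENCE_POLICY = [
    (90.0, "close"),         # graceful close with callback instructions
    (45.0, "offer_choice"),  # "I can hold, or you are welcome to call back"
    (15.0, "check_in"),      # gentle "still there?"
    (5.0, "filler"),         # brief neutral filler
]

def silence_action(seconds_quiet):
    """Map a stretch of caller silence to the next escalation step."""
    for threshold, action in SILENCE_POLICY:
        if seconds_quiet >= threshold:
            return action
    return "wait"  # under 5 seconds is a normal conversational pause

print(silence_action(7))    # filler
print(silence_action(20))   # check_in
print(silence_action(120))  # close
```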

&lt;h2&gt;Tool call latency creates its own kind of dead air&lt;/h2&gt;

&lt;p&gt;When a voice agent needs to look something up in a CRM or check a calendar slot, a tool call gets triggered. In practice that means at least one additional LLM turn to interpret the result, plus the external API latency. The total gap can easily reach 3 to 5 seconds, and callers at that point genuinely don't know if the call is still alive.&lt;/p&gt;

&lt;p&gt;We are adding a "playback speech" feature in Dograh that lets you configure a pre-recorded audio clip to play while a tool call executes. This fills the silence without the LLM having to generate a response and it keeps the caller engaged. Beyond that, pre-fetching data you know will be needed at call start - account details, prior call history, caller ID lookups - keeps common tool calls out of the live response path entirely.&lt;/p&gt;
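&lt;p&gt;The "fill the silence while the tool runs" pattern can be sketched with asyncio. Note that play_clip and the clip name are hypothetical helpers for illustration, not Dograh's API:&lt;/p&gt;

```python
import asyncio

async def call_tool_with_filler(tool_coro, play_clip, min_gap=1.0):
    """Run a tool call; if it outlasts min_gap seconds, play a holding clip."""
    task = asyncio.ensure_future(tool_coro)
    try:
        # Fast path: the tool finished before the caller notices a gap.
        return await asyncio.wait_for(asyncio.shield(task), timeout=min_gap)
    except asyncio.TimeoutError:
        # Slow path: fill the dead air with a pre-recorded clip (no TTS
        # generation delay), then deliver the real result.
        await play_clip("one_moment.wav")
        return await task

async def _demo():
    async def lookup():                 # stand-in for a slow CRM query
        await asyncio.sleep(0.2)
        return {"status": "found"}
    async def play_clip(name):          # stand-in for streaming cached audio
        print("playing", name)
    return await call_tool_with_filler(lookup(), play_clip, min_gap=0.05)

print(asyncio.run(_demo()))
```

asyncio.shield is what keeps the tool call alive when the timeout fires: only the wait is cancelled, not the lookup itself.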

&lt;h2&gt;LLM failures and broken escalation&lt;/h2&gt;

&lt;p&gt;LLM failures in production voice AI are real but probably smaller than you would expect. Hallucination gets the most attention but it accounts for a smaller share of bad calls than the quieter failures. Models stop following instructions partway through long calls. They generate empty responses that cause the bot to say nothing at all. They produce answers that are technically coherent but completely disconnected from the previous turn.&lt;/p&gt;

&lt;p&gt;The trade-off between model size and latency matters here too. Smaller models respond faster, which helps perceived call quality, but they follow complex instructions less reliably. Larger models handle nuanced prompts better but their response latency feeds right back into the dead-air problem.&lt;/p&gt;

&lt;p&gt;Escalation failures have a lower frequency but they are the most consequential category on this list. The callers asking for a human almost always have the hardest, most urgent problems. They have already tried self-service and it didn't work. When the bot can't detect they want to escalate - because they phrased it in a way the intent detection doesn't recognize - or when the escalation path drops the context so the human agent starts from scratch, that caller's experience is about as bad as it gets.&lt;/p&gt;

&lt;p&gt;Escalation should be a first-class destination in every workflow. The phrases callers use to ask for a human are wildly more varied than any training set anticipates, and the transfer needs to carry the full conversation transcript to the receiving agent.&lt;/p&gt;
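&lt;p&gt;A toy version of that escalation path. The phrase list here is deliberately small, which is exactly the failure mode described above - production systems need fuzzier or LLM-based intent detection - and the handoff payload fields are illustrative:&lt;/p&gt;

```python
# Toy escalation detection and context-preserving handoff. The phrase
# list and payload field names are illustrative, not a real schema.

ESCALATION_HINTS = ["human", "agent", "person", "representative",
                    "operator", "speak to someone", "real person"]

def wants_human(transcript):
    """Naive keyword check; real callers phrase this far more variably."""
    text = transcript.lower()
    return any(hint in text for hint in ESCALATION_HINTS)

def escalate(call_state):
    """Hand off with full context so the human doesn't start from scratch."""
    return {
        "action": "transfer",
        "transcript": call_state["turns"],       # full conversation so far
        "summary": call_state.get("summary", ""),
    }

print(wants_human("can I just talk to a real person please"))  # True
print(wants_human("what time do you open on Sundays"))         # False
```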

&lt;h2&gt;The pattern&lt;/h2&gt;

&lt;p&gt;Voice AI breaks at transitions. The first seconds of a call, the moment a user interrupts, the gap while a tool is running, the point where a caller needs a human. These are the edges where the system's assumptions about how conversations work meet how people actually behave on phone calls.&lt;/p&gt;

&lt;p&gt;Teams that focus almost entirely on LLM response quality and treat these transitions as secondary concerns tend to ship agents that sound good in demos but disappoint in production. The calls that held up well in our data were the ones with the most deliberate handling of everything that happens between LLM responses.&lt;/p&gt;




&lt;p&gt;We are building Dograh as the open-source alternative to Vapi, Retell, and Bland. No per-minute fees, no vendor lock-in, deploy on your own infra. Questions are welcome.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/dograh-hq/dograh" rel="noopener noreferrer"&gt;Star us on GitHub&lt;/a&gt; | &lt;a href="https://app.dograh.com" rel="noopener noreferrer"&gt;Try the cloud version&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>voiceai</category>
      <category>webdev</category>
      <category>opensource</category>
    </item>
    <item>
      <title>your voice agent can talk. it has no idea what it said.</title>
      <dc:creator>Hariom Yadav</dc:creator>
      <pubDate>Sat, 28 Mar 2026 09:06:42 +0000</pubDate>
      <link>https://dev.to/dograh/your-voice-agent-can-talk-it-has-no-idea-what-it-said-3gm3</link>
      <guid>https://dev.to/dograh/your-voice-agent-can-talk-it-has-no-idea-what-it-said-3gm3</guid>
<description>&lt;p&gt;&lt;strong&gt;TLDR&lt;/strong&gt;: Dograh is an open-source voice AI platform - an OSS alternative to Vapi. Self-hostable, no per-minute fees, visual workflow builder, full call traces per turn. Choose any LLM, STT, and TTS provider. One Docker command to run.&lt;br&gt;
Your voice agent made 2000 calls last night. 180 failed. 110 hit answering machines and kept talking anyway. 40 transferred to the wrong department. 30 said something your prompt definitely didn't tell it to say.&lt;br&gt;
You have a call recording. You have a transcript. But you have no idea which turn went wrong, what the LLM actually decided, whether the STT heard something different from what was said, or why latency spiked on call 47 but not call 48.&lt;br&gt;
Voice agents are getting deployed everywhere right now. We haven't spent nearly enough time giving builders the visibility to know what their agents are actually doing.   &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;duct tape as voice AI infra&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're building a voice agent today, your stack probably looks something like this: Twilio for telephony, Vapi or Retell as the orchestration layer, Deepgram for speech-to-text, ElevenLabs for text-to-speech, OpenAI for the LLM, your own logic for answering machine detection, and some observability to debug when something breaks.&lt;br&gt;
Six vendors. It works. Kind of.&lt;br&gt;
Until Vapi's per-minute fee eats 70% of your margin. Until a call fails silently and you have no turn-level trace to show why. Until your enterprise client says "we can't send call recordings to a third-party cloud." Until you need to change one prompt and you're back to redeploying the whole stack.&lt;br&gt;
The real problem isn't any single vendor. There's no single layer connecting all of them. Every component is a black box. When something goes wrong between them, you're guessing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;tracing is the layer your voice agent is missing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional voice AI platforms are built around per-minute billing. You sign up, connect a Twilio number, write a prompt, hope it works, and get a bill. That's fine for demos. It's wrong for production agents.&lt;br&gt;
Dograh gives your voice agent a complete, observable, self-hostable runtime. Every call generates a full trace - not just a transcript. For every turn you get: what the STT heard, what the LLM decided, which tool was called, what the latency was, what the TTS said, and whether the human interrupted. The call trace is the unit of debugging, not the recording.&lt;br&gt;
The mental model is a debugger, not a phone bill. When a call goes wrong, you open the trace, find the turn, see exactly what happened, fix the prompt, and re-run. No support tickets to a vendor who can't show you the internals.&lt;br&gt;
Dograh is BSD-2 licensed. Self-hosted via docker-compose. There is no per-minute platform fee because you own the platform.&lt;br&gt;
GitHub: &lt;a href="https://github.com/dograh-hq/dograh" rel="noopener noreferrer"&gt;https://github.com/dograh-hq/dograh&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feihjcrs04zzycwfaauh2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feihjcrs04zzycwfaauh2.png" alt=" " width="604" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;what dograh gives you out of the box&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dograh does a lot because voice agents need a lot. The important thing is that it's modular - every layer is swappable without touching the rest of the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Telephony&lt;/strong&gt; works via Twilio, Vonage, and Cloudonix for both inbound and outbound calling. You bring your own numbers. If you're on a PBX, Cloudonix connects directly to your existing SIP trunk. You own the telephony layer with no vendor lock-in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;STT&lt;/strong&gt; supports Deepgram, Speechmatics, Sarvam, and OpenAI Whisper. You pick the model per-agent or per-call depending on language, speed, or accuracy needs. Indian English works better on Sarvam. Low-latency real-time transcription works better on Deepgram Nova-3. High accuracy on noisy calls works better on Whisper. You swap without rewriting any logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM&lt;/strong&gt; support covers OpenAI, Google Gemini, Groq, OpenRouter, Azure, AWS Bedrock, and fully self-hosted models. The agent doesn't care which model responds - the interface is the same across all of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TTS&lt;/strong&gt; runs on ElevenLabs, Cartesia, Deepgram, and OpenAI TTS. There's also a hybrid voice mode that's worth calling out separately. Instead of generating every response with TTS, the LLM picks from a library of pre-recorded human voice clips for common responses and only falls back to TTS when it needs to improvise. This cuts latency because there's no generation delay, cuts cost because you're using less TTS, and sounds more human because it literally is a human recording for the predictable parts. For dynamic text, it falls back to the cloned-voice TTS.&lt;br&gt;
Here’s a quick tutorial on this hybrid approach&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=1uZqhG0_cIo" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=1uZqhG0_cIo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Call traces&lt;/strong&gt; are the thing that actually changes how you debug. Every turn is logged with the STT input, LLM output, tool calls, TTS output, and timing at each step. These aren't logs dumped to a file - they're structured and queryable. This is what debugging production voice agents actually requires, and it's the thing that's missing everywhere else.&lt;/p&gt;
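&lt;p&gt;For a sense of what a structured per-turn record can look like - these field names are our illustration of the categories just listed, not Dograh's actual schema:&lt;/p&gt;

```python
# Illustrative per-turn trace record. Field names mirror the categories
# described above (STT input, LLM output, tool calls, TTS, timing,
# interruption), but this is a sketch, not Dograh's schema.

from dataclasses import dataclass, asdict

@dataclass
class TurnTrace:
    turn: int             # position in the call
    stt_heard: str        # what the STT engine transcribed
    llm_output: str       # what the LLM decided to say
    tool_called: str      # tool name, or "" if none
    tts_said: str         # what was actually spoken
    latency_ms: int       # end-to-end turn latency
    interrupted: bool     # did the caller barge in mid-response

trace = TurnTrace(turn=3, stt_heard="cancel my order",
                  llm_output="Sure, which order ID?", tool_called="",
                  tts_said="Sure, which order ID?", latency_ms=820,
                  interrupted=False)
print(asdict(trace)["latency_ms"])  # 820
```

Structured records like this are what make traces queryable ("show me every turn over 2000 ms where the caller interrupted") rather than just readable.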

&lt;p&gt;&lt;strong&gt;what people are actually building with dograh&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lead screening and follow-ups&lt;/strong&gt;. The agent calls a list, detects answering machines using a custom detection prompt, disconnects on voicemail, and books when it reaches a human. The trace shows every call - what speech was detected, what the agent said, where drop-off happened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Outbound sales&lt;/strong&gt;. The agent pulls CRM data before each call and injects it into the prompt. It qualifies, handles objections, and transfers to a human rep when intent is high. It updates the CRM automatically so the rep already knows what was said before they pick up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inbound support&lt;/strong&gt;. The agent handles tier-1 support - order status, appointment changes, basic troubleshooting. When it can't resolve, it transfers with a conversation summary already written to the CRM. The human agent picks up with full context, not a blank screen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-language outbound&lt;/strong&gt;. One agent, multiple languages. Sarvam for Hindi, Deepgram for English and Spanish. The agent detects language on the first turn and switches STT and TTS provider automatically. No separate agent per language, no separate infrastructure per market.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;other things built into dograh&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Beyond the call itself, Dograh has a few other things built in worth knowing about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automatic QA and post-call analysis&lt;/strong&gt;. &lt;br&gt;
After every call, Dograh checks what happened automatically. It looks at sentiment, whether the user got confused, whether the agent followed its instructions, and what actions actually fired. You don't need to listen to 200 recordings to find problems. It surfaces where things went wrong - whether the agent sounded off, missed a question, or skipped a step in the flow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dedicated telephony with integrated dialer and Asterisk ARI&lt;/strong&gt;. Telephony is built in, not bolted on. You get low-latency calling across regions and a dialer that works out of the box for both inbound and outbound. No separate systems to stitch together. For teams that need deeper control, Dograh supports Asterisk ARI - you can plug into existing telephony infrastructure, customize call logic at a low level, and build more complex flows. Flexible enough for serious production deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API key rotation&lt;/strong&gt;. &lt;br&gt;
Add multiple API keys for any LLM, STT, or TTS provider and Dograh rotates between them automatically. No custom hacks needed to stay under rate limits or handle concurrency at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Looptalk (coming soon)&lt;/strong&gt;. AI-to-AI call testing. You spin up a test caller with a persona - "skeptical prospect", "fast talker", "non-native English speaker" - and run it against your production agent at scale. Every simulated call leaves a full trace. You find edge cases before real customers do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;why open source matters here&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Closed-source voice AI platforms have a structural problem. When your agent breaks, you file a ticket. The platform tells you what they can see. You don't get the internals, you don't get the turn-level trace, and you can't fix it yourself.&lt;br&gt;
Better support doesn't fix that. It's a fundamental property of closed infrastructure.&lt;br&gt;
With Dograh you run the platform. The trace is yours. The data never leaves your VPC unless you want it to. When something breaks at 3am, you look at the trace. You don't wait for a vendor to respond.&lt;br&gt;
This is also why BSD-2 matters. Not AGPL, not SSPL, not "open core with enterprise features behind a paywall." BSD-2 means you can fork it, modify it, white-label it, embed it in a commercial product, and deploy it for clients without owing anyone a license fee. The code is yours.&lt;br&gt;
The per-minute fee model in closed platforms creates a genuinely adversarial relationship - the platform makes more money when your calls are longer or more frequent. Dograh's business model is managed hosting on app.dograh.com for teams that don't want to self-host. The self-hosted version is completely free and always will be.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;get started&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Github link - &lt;a href="https://github.com/dograh-hq/dograh" rel="noopener noreferrer"&gt;https://github.com/dograh-hq/dograh&lt;/a&gt;&lt;br&gt;
Runs on any Linux host or Apple Silicon Mac. The default config works for local dev. Drop your LLM, STT, and TTS API keys in the .env file and swap providers without touching code.&lt;br&gt;
&lt;strong&gt;dograh&lt;/strong&gt; - open-source voice AI runtime, full call traces, self-hostable&lt;br&gt;
&lt;strong&gt;docs.dograh.com&lt;/strong&gt; - setup, provider config, AMD, call transfer, knowledge base&lt;br&gt;
&lt;strong&gt;app.dograh.com&lt;/strong&gt; - managed hosting if you don't want to run the infra&lt;br&gt;
Every provider is pluggable. Every call leaves a trace. The platform is yours.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>opensource</category>
      <category>agents</category>
    </item>
    <item>
      <title>The Open-Source Voice AI Stack Every Developer Should Know in 2026</title>
      <dc:creator>Hariom Yadav</dc:creator>
      <pubDate>Sat, 21 Mar 2026 09:35:05 +0000</pubDate>
      <link>https://dev.to/dograh/the-open-source-voice-ai-stack-every-developer-should-know-in-2026-4363</link>
      <guid>https://dev.to/dograh/the-open-source-voice-ai-stack-every-developer-should-know-in-2026-4363</guid>
      <description>&lt;p&gt;"Voice AI just had its "ChatGPT moment."&lt;br&gt;
A year ago, building a voice agent meant stitching together five different APIs and paying multiple vendors per minute of conversation. Today the open-source ecosystem has genuinely caught up - and it's moving fast.&lt;br&gt;
I've been deep in this rabbit hole building Dograh, an open-source voice agent platform like n8n. This post is basically the research I wish existed when I started. Here's the full OSS stack - from raw audio all the way to a deployed phone agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Stack at a Glance&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A production voice agent has five layers:
Telephony / Transport  -&amp;gt;  Twilio, Vonage, WebRTC
STT (Speech-to-Text)   -&amp;gt;  Parakeet, Canary Qwen, Silero VAD
LLM                    -&amp;gt;  GPT-4o, Claude, Llama 3
TTS (Text-to-Speech)   -&amp;gt;  Chatterbox, Kokoro, XTTS-v2
Orchestration          -&amp;gt;  Dograh, Pipecat, LiveKit Agents

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every single layer now has solid open-source options. Let's go through them one by one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speech-to-Text&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're building anything real-time, you need something built for streaming from the ground up. Whisper was a great model in 2022 and has held up reasonably well for the voice agent use case; try Whisper Turbo first before reaching for alternatives. &lt;br&gt;
The best option right now for English real-time transcription is NVIDIA's Parakeet TDT 0.6B V2. It sits at #3 on the Hugging Face Open ASR leaderboard with a 6.05% WER, but the number that actually matters for voice agents is its RTFx score of 3386 - meaning it can process audio roughly 3000x faster than real-time. On a T4 GPU it's extremely affordable to run. It handles punctuation, capitalization, and word-level timestamps out of the box.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Python&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;nemo.collections.asr&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nemo_asr&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nemo_asr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ASRModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nvidia/parakeet-tdt-0.6b-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;transcript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If accuracy matters more than raw speed - say you're transcribing medical calls or anything where a wrong word is costly - NVIDIA's Canary Qwen 2.5B is the current accuracy leader on the Open ASR leaderboard at 5.63% WER. It combines ASR with LLM capabilities under the hood, which helps a lot with context and unusual vocabulary. The tradeoff is it's heavier to run and not as snappy for real-time use.&lt;br&gt;
Either way, pair your STT model with Silero VAD. It's a small Voice Activity Detection model that tells your agent when someone is actually speaking. Without it you're either cutting people off mid-sentence or waiting awkwardly for them to finish. Every real-time voice pipeline needs this.&lt;/p&gt;
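&lt;p&gt;To make the VAD role concrete, here is a toy segmenter over per-frame speech probabilities. Real pipelines feed audio frames to a model like Silero VAD to get these probabilities; this sketch only shows what the gating logic does with them:&lt;/p&gt;

```python
# Toy utterance segmenter, to show where VAD sits in the pipeline: it
# turns a continuous stream into utterances so STT only runs on actual
# speech. The probabilities would come from a model like Silero VAD.

def segment_utterances(frame_probs, threshold=0.5, silence_frames=3):
    """Return (start, end) frame spans that contain speech."""
    spans, start, quiet = [], None, 0
    for i, p in enumerate(frame_probs):
        if p >= threshold:
            if start is None:
                start = i       # speech onset
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= silence_frames:               # enough trailing
                spans.append((start, i - quiet + 1))  # silence: close it
                start, quiet = None, 0
    if start is not None:
        spans.append((start, len(frame_probs)))  # stream ended mid-speech
    return spans

probs = [0.1, 0.9, 0.8, 0.9, 0.1, 0.1, 0.1, 0.7, 0.9, 0.2]
print(segment_utterances(probs))  # [(1, 4), (7, 10)]
```

The silence_frames hangover is the part that prevents cutting callers off mid-sentence: a single quiet frame doesn't end the turn, only a sustained pause does.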

&lt;p&gt;&lt;strong&gt;Text-to-Speech&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Chatterbox from Resemble AI is the most exciting TTS release of the past year. It hits commercial-grade quality in blind tests, supports voice cloning, and has built-in audio watermarking for responsible use. If you're building anything customer-facing, this is probably your best open-source option right now.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Python&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torchaudio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chatterbox.tts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatterboxTTS&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ChatterboxTTS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;wav&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello! This is Chatterbox speaking.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;torchaudio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wav&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For multilingual voice cloning, XTTS-v2 from Coqui is the go-to. It supports zero-shot cloning across 20+ languages with just a 6-second reference clip. Works well for audiobook tools, multilingual assistants, and anything where you need a consistent voice across languages.&lt;br&gt;
If latency is your main constraint, look at Kokoro. It's only 82M parameters, runs on CPU, and can hit under 100ms on consumer hardware. The quality isn't Chatterbox-level but for edge deployments or high-throughput scenarios it's hard to beat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Orchestration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the layer most developers underestimate. Orchestration ties STT, LLM, and TTS together and handles all the hard real-time stuff - barge-in when the user interrupts, turn detection, audio streaming, silence handling. Getting this wrong is what makes voice agents feel robotic even when the individual models are great.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dograh&lt;/strong&gt; is what I've been building. Think of it as the n8n of voice AI - a visual workflow builder where you can wire up your entire agent flow without touching code. It's the most direct open-source alternative to Vapi and Retell, and unlike those it's fully self-hostable with no per-minute markup.&lt;br&gt;
It's pretty mature at this point. You get telephony out of the box via Twilio, Vonage, and Cloudonix, inbound and outbound calling, the visual workflow builder, and one-command setup. All the plumbing is already there - knowledge base, dictionary, KYC, voicemail detection, variable extraction, automated call QA, multilingual support, and transfer to a human agent. Agents can handle inbound and outbound calls, or be deployed as a widget on your website. You just bring your own LLM, STT, and TTS keys and plug them in.&lt;br&gt;
curl -fsSL &lt;a href="https://raw.githubusercontent.com/dograh-hq/dograh/main/install.sh" rel="noopener noreferrer"&gt;https://raw.githubusercontent.com/dograh-hq/dograh/main/install.sh&lt;/a&gt; | bash&lt;/p&gt;

&lt;p&gt;The visual workflow builder is the big differentiator from raw frameworks like Pipecat or LiveKit Agents. Changing agent behavior means dragging a node, not editing Python and redeploying. For teams that want to iterate fast on agent logic without a full engineering cycle every time, that's a pretty big deal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pipecat&lt;/strong&gt; is a Python framework built by Daily.co. It treats audio as a stream of typed frames and lets you build a pipeline of processors in sequence. It's transport-agnostic and gives you fine-grained control over every step. That control comes at a cost though - every time you want to change agent behavior, you're editing Python code, redeploying, and hoping nothing broke in the pipeline. For a team without dedicated voice engineering experience, the iteration loop gets slow fast.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Python&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;silero_vad&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;deepgram_stt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;openai_llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cartesia_tts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;output&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;LiveKit Agents&lt;/strong&gt; has a cleaner API and abstracts away the WebRTC infrastructure, which makes the initial setup quicker. But the same problem applies - your agent logic lives in code. Any prompt change, flow tweak, or new use case means a code change and a redeploy. It's genuinely a good framework if you have engineers who live in this stuff full-time, but it's not something a small team can move fast with.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VoicePipelineAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;vad&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;silero&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;VAD&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;stt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;deepgram&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;STT&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;tts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cartesia&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TTS&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both Pipecat and LiveKit Agents are solid if you want maximum control and have the engineering bandwidth to match. If you don't, you'll spend more time maintaining infrastructure than actually improving your agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick Comparison&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Dograh&lt;/th&gt;
&lt;th&gt;Pipecat&lt;/th&gt;
&lt;th&gt;LiveKit Agents&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Type&lt;/td&gt;
&lt;td&gt;Platform&lt;/td&gt;
&lt;td&gt;Framework&lt;/td&gt;
&lt;td&gt;Framework&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visual Workflow Builder&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frontend UI&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Telephony&lt;/td&gt;
&lt;td&gt;Twilio, Vonage, Cloudonix&lt;/td&gt;
&lt;td&gt;Twilio&lt;/td&gt;
&lt;td&gt;SIP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hostable&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup Time&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;td&gt;Hours&lt;/td&gt;
&lt;td&gt;Hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bring Your Own LLM&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open-source&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The Mistake Most People Make&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The biggest trap I see developers fall into when building voice agents: treating voice like chat with a microphone attached.&lt;br&gt;
It's a completely different problem. The hard parts aren't the models - they're the real-time engineering. When do you cut off the STT and start the LLM? What happens when the user interrupts the agent mid-sentence? How do you handle answering machines on outbound calls? What about codec mismatches - PSTN phone lines use 8kHz u-law, but most STT models expect 16kHz PCM? These are the things that will actually bite you in production.&lt;/p&gt;
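&lt;p&gt;To make the codec point concrete: in production you'd hand this to ffmpeg or a proper resampler, but here is a pure-Python sketch of what the 8kHz G.711 u-law to 16kHz PCM conversion actually involves (the helper names are mine, not from any library):&lt;/p&gt;

```python
# Pure-Python sketch: decode G.711 u-law bytes to signed 16-bit PCM,
# then naively upsample 8 kHz -> 16 kHz by linear interpolation.
# Real pipelines use ffmpeg/soxr; helper names here are illustrative.

def ulaw_to_pcm16(byte: int) -> int:
    """Decode one G.711 u-law byte to a signed 16-bit sample."""
    u = ~byte & 0xFF
    t = (((u & 0x0F) << 3) + 0x84) << ((u & 0x70) >> 4)
    return (0x84 - t) if (u & 0x80) else (t - 0x84)

def upsample_2x(samples: list[int]) -> list[int]:
    """8 kHz -> 16 kHz by inserting the midpoint between neighbours."""
    out = []
    for i, s in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.extend([s, (s + nxt) // 2])
    return out

frame = bytes([0xFF, 0x7F, 0x00])          # a tiny u-law "frame"
pcm_8k = [ulaw_to_pcm16(b) for b in frame]
pcm_16k = upsample_2x(pcm_8k)
print(pcm_8k)        # [0, 0, -32124] - 0xFF/0x7F are +/-zero, 0x00 is full negative
print(len(pcm_16k))  # 6 - twice as many samples after upsampling
```

Every inbound PSTN frame goes through something like this before your STT model ever sees it, which is why sample-rate assumptions bite so hard.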

&lt;p&gt;If you're just starting out, prototype with Dograh, Vapi, or Retell. They're all fast to set up and handle a lot of edge cases well. But once you hit serious volume or need custom logic, the open-source stack should be your default - Dograh included - and the cost difference is real. Running your own stack costs under $0.02 per minute, while managed platforms charge $0.10 to $0.15.&lt;/p&gt;
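&lt;p&gt;Back-of-the-envelope, at a hypothetical 50,000 minutes a month (the volume is made up; the per-minute rates are the ones above):&lt;/p&gt;

```python
# Illustrative cost comparison - 50,000 min/month is a made-up volume;
# the per-minute rates are the ones quoted in the post.
minutes = 50_000
self_hosted = 0.02
managed_low, managed_high = 0.10, 0.15

save_low = minutes * (managed_low - self_hosted)
save_high = minutes * (managed_high - self_hosted)
print(f"${save_low:,.0f} to ${save_high:,.0f} saved per month")
# $4,000 to $6,500 saved per month
```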

&lt;p&gt;&lt;strong&gt;A Starter Stack That's Completely Free&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you want to get something running this weekend with zero API bills:&lt;br&gt;
VAD - Silero VAD&lt;br&gt;
STT - Parakeet TDT 0.6B V2 running locally&lt;br&gt;
LLM - Llama 3.1 via Groq's free tier&lt;br&gt;
TTS - Kokoro running locally&lt;br&gt;
Orchestration - Dograh &lt;br&gt;
Total infra cost - basically zero. Latency - under 500ms end-to-end is very achievable with a mid-range GPU.&lt;/p&gt;
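&lt;p&gt;For intuition, here is one way that 500ms budget might break down (every number is an illustrative guess, not a benchmark):&lt;/p&gt;

```python
# Illustrative latency budget for the free stack above - every figure
# is a rough guess, not a measured benchmark.
budget_ms = {
    "VAD + endpointing": 60,
    "STT first transcript (Parakeet, local GPU)": 100,
    "LLM first token (Groq)": 150,
    "TTS first audio chunk (Kokoro)": 80,
    "network + audio buffering": 110,
}
total = sum(budget_ms.values())
print(f"{total} ms end-to-end")  # 500 ms end-to-end
```

The takeaway is that no single stage can hog the budget; shaving 50ms off any one of them is felt directly in how natural the turn-taking sounds.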

&lt;p&gt;&lt;strong&gt;What Are You Building?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Curious what stacks people are actually running in production. Is anyone using Kokoro for real-time agents? The latency numbers look great on paper but I haven't seen many production writeups.&lt;br&gt;
Drop your stack in the comments.&lt;/p&gt;

&lt;p&gt;I'm building &lt;strong&gt;Dograh&lt;/strong&gt; - an open-source alternative to Vapi and Retell. If you're tired of vendor lock-in, check it out and star it if it's useful.&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>opensource</category>
      <category>ai</category>
      <category>python</category>
    </item>
  </channel>
</rss>
