DEV Community: Dograh AI

Keeping government call data in-country: a self-hosted voice AI architecture

Pritesh Kumar — Fri, 03 Jul 2026 11:44:10 +0000

If you're wiring up a voice agent for a government helpline, the first question is where the audio goes, before you ever get to which model sounds best. A recording of someone calling about their benefits eligibility or a criminal record is about as sensitive as data gets, and the moment it leaves the country you have a procurement problem no feature list can fix.

Most hosted voice platforms make that leave-the-country decision for you. You point your phone line at their API, and every recording, every transcript, and every field the agent extracts lands in their cloud, usually somewhere you don't control and often on another continent. For a private startup that's fine. For a public agency bound by data residency rules, it's a non-starter before accuracy even enters the conversation.

So here's the architecture that keeps all of it in-country.

The data path is the whole design

A voice call moves through three model stages. Speech comes in and gets transcribed. The transcript goes to a language model that decides what to say and what to pull from your systems. The reply gets turned back into speech. Each of those stages is a place the audio or its text can leak out to a third party.

The fix is to run all three on infrastructure you already accredit. Whisper handles transcription locally. An open-weight LLM (Llama, Qwen, whatever your security team has cleared) does the reasoning on the same network. An open TTS voice speaks the reply. The caller's audio hits your servers, gets processed, and the extracted fields land in your own case system. Nothing about the call travels to a model vendor, so the border never gets crossed because there's nothing crossing it.

That's the part hosted SaaS can't give you, no matter how many compliance badges sit on the pricing page. If the inference runs in their cloud, the data goes to their cloud.

Colocate for latency, not just sovereignty

Keeping the models in-country has a second payoff infra people care about, which is latency. Voice is unforgiving. A caller notices half a second of dead air. If your STT is in one region, your LLM in another, and your TTS somewhere else, you pay round-trip time over and over, plus a trip out to a hosted vendor and back.

Colocating the transcription, the LLM, and the voice on the same box or the same rack collapses that. The audio round-trip stays inside a single network hop, so the agent can pace itself and wait for a slow-speaking caller, handling barge-in without the awkward gaps that make people hang up. Data residency and low latency end up wanting the same thing, which is the whole stack running close together on hardware you own.

Open source is what makes it auditable

Here's the part that closes a government security review. When the model weights and the pipeline code are open, your team can read exactly what the agent does with a call. Where the recording gets written. How long it's kept. What gets sent to which internal API. Which fields get extracted and where they land.

You can't audit a closed vendor's pipeline. You get a SOC 2 report and a promise. With an open stack the review becomes a code review, and the answer to "who can subpoena this recording, where does it get replicated, how long does it live" stays inside your own retention and access rules where an auditor can actually check.

Where Dograh fits

This is the constraint Dograh is built around. It's BSD-2 licensed and fully self-hostable, so you run the whole agent in your own environment and bring your own models. Whisper for transcription, an open LLM for reasoning, an open voice for playback. You pay for infrastructure instead of a per-minute platform fee, which for a high, steady government line changes the yearly number a lot. Most hosted platforms meter around 5 to 7 cents a minute just for the platform layer before any AI usage, climbing toward 15 cents all-in at volume. Own the stack and that meter stops running.

The buying decision for citizen data comes down to whether the audio ever leaves the building. Self-hosting with open models is how you keep it inside.

If you want the fuller picture, the six citizen-facing use cases this architecture serves plus the cost math and the CJIS and FISMA angles, we get into it in the full write-up.

Why On-Prem Will Win Enterprise Voice AI (Hosted Can't Follow)

Pritesh Kumar — Thu, 02 Jul 2026 20:50:36 +0000

A vendor can add "on-premise" to a sales deck in about five minutes. Shipping it is a different problem, and for most voice AI platforms it is an unsolvable one. The blocker has nothing to do with roadmap or engineering effort. It comes down to the license on the models running underneath the product.

This is a condensed, independently written version of our full deep-dive. Read the complete article on Dograh.

Here is the short version for anyone weighing this decision. For regulated buyers in healthcare and finance, running voice AI on your own infrastructure is often the only compliant option, because call audio and personal data cannot leave your network without tripping HIPAA or residency rules like GDPR. True on-prem keeps every layer inside your perimeter. It only works with open models you can actually download, and it deletes the per-minute vendor fee.

The part most compliance calls skip

You cannot colocate a model whose weights you are not allowed to hold. Closed providers sell access to a model, never the model itself, so there is no artifact to install on a server you own. When a hosted per-minute vendor says "colocation," the best it can do is pick a cloud region near your other services, and your audio still leaves the building to land on someone else's machines.

That single fact decides the whole architecture. A product built on closed APIs can bolt on a private-cloud tier, and the sensitive processing still happens on hardware you do not control. If the models are closed, on-prem is a marketing word. If the models are open, on-prem is something an auditor can actually verify.

What "on-prem" actually means

The deployment shapes are not equal. A fully hosted setup, the default for most per-minute platforms, runs everything on the vendor's servers while you connect over an API or a SIP trunk. A private-cloud or VPC deployment drops the vendor's software into your own cloud account, which narrows exposure, though the underlying models often still call out to the provider's endpoints. True on-prem, sometimes called colocation, runs the entire pipeline inside your perimeter, so speech-to-text, the language model, speech synthesis, and telephony all sit on hardware you own or rent with no call data crossing the boundary. Only that last shape satisfies the strictest residency rules.

Why regulated buyers are forcing the move

The pressure is coming from compliance and finance, not from engineering. The AI voice agents market was worth 2.54 billion dollars in 2025 and is on track for 35.24 billion by 2033, so the volume of sensitive audio moving through these systems is climbing fast. Gartner expects more than 75 percent of European and Middle Eastern enterprises to move workloads into sovereign solutions by 2030, up from under 5 percent in 2025.

HIPAA makes the stakes concrete. A compliant voice deployment needs a signed Business Associate Agreement at every layer, and Prosper AI's 2026 analysis counts up to five separate agreements, with civil penalties reaching 2,190,294 dollars per violation category per year. IBM's 2025 Cost of a Data Breach Report puts healthcare at 7.42 million dollars per breach, the costliest sector for the fourteenth year running. Self-hosting removes the problem at the root, because when every model runs inside your perimeter there is no third party to sign a BAA with and no audio leaving your network.

The open stack that makes it real

A fully self-hosted pipeline is buildable today from open components. Whisper and Voxtral handle speech-to-text on your own GPUs. Open language models such as Llama and Qwen serve through vLLM or Ollama. Kokoro and Piper generate natural speech locally, with Coqui and Chatterbox as further options. Telephony sits on Asterisk and standard SIP trunking, with ARI for low-level call control. Running these together on one server or in one availability zone is what colocation actually buys you, and since every network hop adds delay, keeping the models next to each other is one of the biggest levers for sub-800ms speech latency.

The bill hosted vendors do not print

Per-minute pricing scales linearly with every call. Ringly.io's 2026 pricing data puts the all-in cost of a hosted deployment at 0.12 to 0.25 dollars per minute once speech, model, voice, and telephony stack on the platform fee, with the platform fee alone around 5 to 7 cents a minute. Run 1,000 minutes a day and you land between 15,000 and 30,000 dollars a year, climbing with every new campaign. Self-hosting turns that meter into a fixed infrastructure line, so the marginal cost of another minute sits close to zero.

Why per-minute vendors cannot follow

Hosted platforms will offer a private-cloud tier and a thick stack of compliance documents, and both help. What they cannot offer is a stack you fully own, because the models underneath are closed, so the residency guarantee ends at the model boundary and the meter keeps running.

This is the gap Dograh was built to close. It is an open-source voice agent platform under a BSD-2 license, self-hostable from the ground up. You can colocate an open stack and bring your own keys for any commercial model, or drop commercial models entirely and run open weights end to end. There is no per-minute platform fee, and because the whole system is open, data residency and auditability arrive with the deployment instead of a contract addendum.

The reason a hosted rival cannot copy this is structural. In a 2026 cloud computing survey, 94 percent of organizations reported concern about vendor lock-in, and only 6 percent believed they could switch their main AI provider without serious disruption. A per-minute vendor benefits from that friction, because a customer who cannot leave keeps paying. Hand that customer open weights on their own hardware with no meter, and there is very little company left to bill. Announcing an on-premise option is easy. Switching off the billing that funds the business is not.

What to check before you move

Start with the license, since a platform you can install and run yourself is auditable in a way a closed product never is. Check the model layer next, and ask whether you can bring open weights at every step and whether any stage quietly falls back to a closed API that ships audio out. Confirm telephony can run on your own SIP or Asterisk setup so the call path stays internal. Then follow the money, because a real on-prem option turns cost into fixed infrastructure with no per-minute fee riding on top, and a vendor that cannot remove the meter is not handing you a deployment you own.

Glossary

Colocation. Hosting speech-to-text, the language model, speech synthesis, and telephony on the same server or availability zone to cut network hops. It only works with open models you can self-host.

Data residency. The requirement that call audio and personal data physically stay inside a specific country or jurisdiction.

Business Associate Agreement (BAA). A HIPAA contract that makes a vendor legally liable for protecting patient data. A hosted voice stack needs one at every layer.

Geopatriation. Moving cloud and AI workloads back inside national borders to satisfy sovereignty rules.

FAQ

Can I self-host closed-source voice AI models?

No. Closed providers sell API access, not the model weights, so there is nothing to install on your own hardware. With closed models, colocation only means choosing a nearby cloud region, and your audio still leaves your network. Only open models run fully on-prem.

What is the open-source stack for on-prem voice AI?

A self-hosted pipeline typically uses Whisper or Voxtral for speech-to-text, an open language model like Llama or Qwen served through vLLM or Ollama, and Kokoro or Piper for text-to-speech. Telephony runs on Asterisk and SIP. Every layer stays on hardware you control.

How much does self-hosted voice AI cost compared to per-minute vendors?

Hosted voice AI runs about 0.12 to 0.25 dollars per minute all-in, and that meter scales with every call. Self-hosting converts the cost into fixed infrastructure, so the marginal cost of another minute is close to zero. Running open models instead of commercial APIs lowers the bill further.

Is on-prem deployment required for HIPAA voice AI?

It is not strictly required, though it removes the hardest part of HIPAA compliance. A hosted stack needs a signed BAA at every layer, up to five separate agreements. Self-hosting keeps patient audio inside your own perimeter, so there is no third party to contract with in the first place.

Originally published at www.dograh.com/hub/blogs/on-prem-enterprise-voice-ai

Where Voice AI Agents Are Actually Getting Used in 2026

Pritesh Kumar — Fri, 17 Apr 2026 13:45:27 +0000

Voice AI has moved past the demo phase. After watching hundreds of deployments across our customer base and the broader ecosystem, I wanted to put together a practical list of where voice agents are actually earning their keep right now, and where the ROI is strong enough to justify real production budgets.

The list below is the subset that keeps showing up in real pipelines and real P&Ls.

Customer Support

Tier-1 support is the single biggest deployed voice AI use case today. Voice agents handle password resets, order status checks, account balance lookups, policy questions, and similar high-volume repetitive queries. The value is straightforward: deflect 40-60% of inbound calls away from human agents, answer in the language the caller speaks, operate 24/7. Most teams start here because the data already exists in their CRM or knowledge base, and the workflows are well understood.

Lead Screening and Qualification

Inbound leads from ads, forms, and content marketing usually sit in a queue for hours before a human gets to them. By that point, intent has dropped significantly. Voice agents now pick up the call within seconds, qualify against BANT or a custom rubric, book the meeting straight into the sales rep calendar, and log everything in HubSpot or Salesforce. This is the highest-velocity use case for B2B teams I have seen. The math is easy: a qualified meeting has a known value in the CRM, and answering in 30 seconds instead of 4 hours multiplies that yield.

Collections and Renewals in Fintech

This is where I have seen some of the strongest unit economics. Banks, lenders, and insurance companies run enormous outbound collections operations with razor-thin per-call economics. Voice agents handle reminders, soft collections, payment plan negotiation, drop-off recovery, and renewal nudges at a fraction of the cost of a human BPO. The volumes are high, the scripts are compliance-heavy (which AI handles consistently), and the conversion lift from reaching a borrower in their preferred language at the right time of day is real.

Cold Calling and Outbound Sales

The ROI here is very good if you get the targeting right. Voice agents can run thousands of outbound dials a day, handle objections, qualify interest, and hand off warm prospects to a human closer. The catch is that bad targeting plus AI dialing equals spam complaints at scale, so list hygiene and opt-in matter more than the tech itself. Teams that get this right see cost per meeting drop by 5-10x.

Appointment Setting in Healthcare

Hospitals, dental clinics, and specialty practices deal with huge no-show rates and constant rebooking churn. Voice agents handle appointment confirmations, reminders, rescheduling, and prep instructions like pre-op fasting rules. The operational impact is immediate: front desk staff get their attention back for patients physically in the clinic, and call handling capacity goes up overnight.

Receptions, Restaurants, and Local Services

Any local business with a phone number that rings all day is a candidate. Restaurants take reservations and handle takeaway orders, dental clinics book and confirm visits, salons do intake. The ticket size per business is small, but the cumulative market is enormous. This category will eventually absorb the most total call volume, even if each individual deployment is modest.

What Comes Next

The interesting wave ahead is in regulated industries like KYC verification, insurance claims intake (FNOL), patient engagement beyond appointments, and legal intake flows. These need stronger guardrails, better audit trails, tighter integration into systems of record, and clear compliance boundaries. That is where the platform layer matters, and where closed black-box APIs start to hit walls that open, inspectable stacks handle gracefully.

If you are evaluating voice AI for your own business, start with the use case where you already have volume, a clear script, a measurable outcome, and a team ready to handle the tail exceptions. Skip the speculative experiments. The wins are in the boring, high-frequency calls you make every day already.

We analyzed 10,000 voice AI calls. The LLM was rarely the problem.

Pritesh Kumar — Sat, 28 Mar 2026 13:54:27 +0000

We built Dograh OSS, an open-source voice AI platform. When we started, we assumed most failures would come from the LLM - bad answers, missed intent, prompt edge cases. So we spent a lot of early effort there.

Then we looked at the data. We ran automated QA where an LLM reviews every turn in every call and tags what went right and wrong, and we spent hours listening to calls ourselves. Across roughly 10,000 calls spanning customer support, appointment booking, and lead qualification, the failure picture looked nothing like what we expected.

The problems that showed up again and again were about the phone call as a medium. Timing, audio physics, and infrastructure designed decades before LLMs existed.

Here is what we found, roughly ranked by frequency.

Failure area	Share	Primary driver
STT / word error rate	~38%	Low-quality telephony audio and accent variation
First-8-second chaos	~34%	Greeting latency, barge-in, variable user behavior
Interruption handling	~28%	Filler words breaking flow, context switching
Extended silence	~22%	Users distracted, fetching info, handing off phone
Tool call latency	~19%	LLM turns plus external API latency
LLM failure modes	~15%	Hallucinations, instruction drift, latency trade-offs
Broken escalation	~11%	No clear human handoff path

These categories overlap. A lot of bad calls had two or three of these compounding each other.

STT is worse than you think

STT failures were the single biggest contributor to broken calls, and the one most consistently underestimated by teams building voice AI for the first time.

Standard telephony runs at 8 kHz audio. That is genuinely low quality. It strips away acoustic detail that helps distinguish consonants, so "schedule" becomes "school" in a transcript and "Sunday" comes through as "someday." Add background noise, speakerphone, or a non-native accent and word error rates climb fast.

The thing about WER is that errors compound. "I need to school my appointment for someday" should be understood as "I need to schedule my appointment for Sunday," and a well-prompted LLM will figure that out. When three or four words are garbled in the same turn though, contextual inference falls apart. We saw calls where entire turns came through as near-gibberish.

There is also a dimension STT completely misses. Transcription captures words but it does not capture how those words are said. A frustrated "fine" and a genuinely agreeable "fine" are two very different things, but they look identical in a transcript. Tone helps disambiguate words that sound similar. When a caller sounds confused or hesitant, a human listener picks up on that and adjusts. STT gives you a flat string of text and the LLM works with it blind.

No STT provider has uniform accuracy across all accents either. The gap between a provider's accuracy on American English versus Nigerian English or heavily accented Spanish can be 15 to 25 percentage points. If your users call from diverse regions, picking one STT provider and moving on is going to hurt you.

Two mitigations made a real difference in our testing. First, adding a custom vocabulary to your STT engine - and I mean beyond domain jargon. If you are building an order management bot, add the common words your callers actually say: "order," "ID," "payment," "cancel," "account," "address." Thirty to forty frequently used words. The STT engine listens for these with extra weight and it meaningfully reduces errors on the terms that matter most.

Second, tell the LLM to expect transcription errors. A single line in your system prompt acknowledging that the caller's words may contain transcription noise, and asking the model to use contextual reasoning before responding, reduces the downstream impact of STT failures significantly. The LLM stops treating garbled transcripts as literal input and starts being smarter about what the caller probably meant.

The first 8 seconds are where calls die

About a third of all problematic calls failed before the real conversation even started.

A lot happens in those first 8 to 10 seconds. Callers are still deciding whether they want to engage. Many have not fully shifted their attention to the call. Some are already talking before the bot finishes its greeting, and others wait much longer than expected, unsure whether the silence means the system is broken. This is also the window where callers most frequently realize they are talking to a bot, and many immediately start testing it - interrupting, asking off-script questions, being deliberately vague. The variance in behavior is just much higher here than at any other point in the call.

Greeting latency makes everything worse. A gap of more than a second or two between the call connecting and the first word of audio is enough for many callers to assume things are broken and hang up. Pre-generating and caching your greeting audio instead of synthesizing it fresh every time removes an entire class of failures here.

What works: keep greetings short, under six seconds. Consider disabling barge-in for just the first 3 to 4 seconds if your greeting contains information the caller needs to hear. Re-enable interruption after that. The period where barge-in causes the most problems is right at the start.

In Dograh, each workflow node has its own "Allow Interruption" toggle, so you can switch off interruption on the starting node for a short introduction and re-enable it for the rest of the conversation.

Interruptions, silence, and dead air

Most barge-in documentation focuses on detecting when a user starts speaking and stopping the TTS output. That part is reasonably well solved. The harder problem is what happens to the conversation after the interruption.

One pattern we saw constantly: a caller says "uh huh" or "yeah" while the bot is talking. The bot interprets this as a genuine turn, stops speaking, and tries to process it as new user input. The LLM produces a response to what is essentially a non-sentence and the conversation breaks. The caller has to re-explain what they wanted.

A related pattern involves context switching - a caller interrupts to ask "wait, does that include weekends?" The bot handles the question fine but loses track of what it was explaining before. The original task gets dropped without resolution.

Both problems are about how interruption events are handled in the conversation state. The fix is designing explicit logic for what happens after an interruption. Differentiate between filler sounds and substantive speech, and track unresolved conversational threads so the bot can come back to them.

Silence is closely related. Real callers go quiet for perfectly legitimate reasons - checking an email for an order ID, handing the phone to someone else, looking something up. A caller fetching information might go quiet for 20 to 40 seconds. Most voice AI systems interpret this as confusion or abandonment and respond with prompts or just hang up.

A graduated response works much better: a brief neutral filler after 5 seconds, a gentle "still there?" check at 15 seconds, a real choice at 45 seconds ("I can hold, or you are welcome to call back"), and a graceful close with callback instructions only after 90 seconds or more.

Tool call latency creates its own kind of dead air

When a voice agent needs to look something up in a CRM or check a calendar slot, a tool call gets triggered. In practice that means at least one additional LLM turn to interpret the result, plus the external API latency. The total gap can easily reach 3 to 5 seconds, and callers at that point genuinely don't know if the call is still alive.

We are adding a "playback speech" feature in Dograh that lets you configure a pre-recorded audio clip to play while a tool call executes. This fills the silence without the LLM having to generate a response and it keeps the caller engaged. Beyond that, pre-fetching data you know will be needed at call start - account details, prior call history, caller ID lookups - keeps common tool calls out of the live response path entirely.

LLM failures and broken escalation

LLM failures in production voice AI are real but probably smaller than you would expect. Hallucination gets the most attention but it accounts for a smaller share of bad calls than the quieter failures. Models stop following instructions partway through long calls. They generate empty responses that cause the bot to say nothing at all. They produce answers that are technically coherent but completely disconnected from the previous turn.

The trade-off between model size and latency matters here too. Smaller models respond faster, which helps perceived call quality, but they follow complex instructions less reliably. Larger models handle nuanced prompts better but their response latency feeds right back into the dead-air problem.

Escalation failures have a lower frequency but they are the most consequential category on this list. The callers asking for a human almost always have the hardest, most urgent problems. They have already tried self-service and it didn't work. When the bot can't detect they want to escalate - because they phrased it in a way the intent detection doesn't recognize - or when the escalation path drops the context so the human agent starts from scratch, that caller's experience is about as bad as it gets.

Escalation should be a first-class destination in every workflow. The phrases callers use to ask for a human are wildly more varied than any training set anticipates, and the transfer needs to carry the full conversation transcript to the receiving agent.

The pattern

Voice AI breaks at transitions. The first seconds of a call, the moment a user interrupts, the gap while a tool is running, the point where a caller needs a human. These are the edges where the system's assumptions about how conversations work meet how people actually behave on phone calls.

Teams that focus almost entirely on LLM response quality and treat these transitions as secondary concerns tend to ship agents that sound good in demos but disappoint in production. The calls that held up well in our data were the ones with the most deliberate handling of everything that happens between LLM responses.

We are building Dograh as the open-source alternative to Vapi, Retell, and Bland. No per-minute fees, no vendor lock-in, deploy on your own infra. Ask anny queries

Star us on GitHub | Try the cloud version

your voice agent can talk. it has no idea what it said.

Hariom Yadav — Sat, 28 Mar 2026 09:06:42 +0000

TLDR: Dograh is an open-source voice AI platform- an OSS alternative to Vapi. Self-hostable, no per-minute fees, visual workflow builder, full call traces per turn. Choose any LLM, STT, and TTS provider. one docker command to run.
Your voice agent made 2000 calls last night. 180 failed. 110 hit answering machines and kept talking anyway. 40 transferred to the wrong department. 30 said something your prompt definitely didn't tell it to say.
You have a call recording. You have a transcript. But you have no idea which turn went wrong, what the LLM actually decided, whether the STT heard something different from what was said, or why latency spiked on call 47 but not call 48.
Voice agents are getting deployed everywhere right now. We haven't spent nearly enough time giving builders the visibility to know what their agents are actually doing.

duct tape as voice AI infra

If you're building a voice agent today, your stack probably looks something like this: Twilio for telephony, Vapi or Retell as the orchestration layer, Deepgram for speech-to-text, ElevenLabs for text-to-speech, OpenAI for the LLM, your own logic for answering machine detection, and some observability to debug when something breaks.
Six vendors. It works. Kind of.
Until Vapi's per-minute fee eats 70% of your margin. Until a call fails silently and you have no turn-level trace to show why. Until your enterprise client says "we can't send call recordings to a third-party cloud." Until you need to change one prompt and you're back to redeploying the whole stack.
The real problem isn't any single vendor. There's no single layer connecting all of them. Every component is a black box. When something goes wrong between them, you're guessing.

tracing is the layer your voice agent is missing

Traditional voice AI platforms are built around per-minute billing. You sign up, connect a Twilio number, write a prompt, hope it works, and get a bill. That's fine for demos. It's wrong for production agents.
Dograh gives your voice agent a complete, observable, self-hostable runtime. Every call generates a full trace - not just a transcript. For every turn you get: what the STT heard, what the LLM decided, which tool was called, what the latency was, what the TTS said, and whether the human interrupted. The call trace is the unit of debugging, not the recording.
The mental model is a debugger, not a phone bill. When a call goes wrong, you open the trace, find the turn, see exactly what happened, fix the prompt, and re-run. No support tickets to a vendor who can't show you the internals.
Dograh is BSD-2 licensed. Self-hosted via docker-compose. There is no per-minute platform fee because you own the platform.
GitHub: https://github.com/dograh-hq/dograh

what dograh gives you out of the box

Dograh does a lot because voice agents need a lot. The important thing is that it's modular - every layer is swappable without touching the rest of the system.

Telephony- works via Twilio, Vonage, and Cloudonix for both inbound and outbound calling. You bring your own numbers. If you're on a PBX, Cloudonix connects directly to your existing SIP trunk. You own the telephony layer with no vendor lock-in.

STT- supports Deepgram, Speechmatics, Sarvam, and OpenAI Whisper. You pick the model per-agent or per-call depending on language, speed, or accuracy needs. Indian English works better on Sarvam. Low-latency real-time transcription works better on Deepgram Nova-3. High accuracy on noisy calls works better on Whisper. You swap without rewriting any logic.

LLM -support covers OpenAI, Google Gemini, Groq, OpenRouter, Azure, AWS Bedrock, and fully self-hosted models. The agent doesn't care which model responds - the interface is the same across all of them.

TTS- runs on ElevenLabs, Cartesia, Deepgram, and OpenAI TTS. There's also a hybrid voice mode that's worth calling out separately. Instead of generating every response with TTS, the LLM picks from a library of pre-recorded human voice clips for common responses and only falls back to TTS when it needs to improvise. This cuts latency because there's no generation delay, cuts cost because you're using less TTS, and sounds more human because it literally is a human recording for the predictable parts. And when you have a dynamic text, fallback to the voice clones TTS.
Here’s a quick tutorial on this hybrid approach
https://www.youtube.com/watch?v=1uZqhG0_cIo

Call traces -are the thing that actually changes how you debug. Every turn is logged with the STT input, LLM output, tool calls, TTS output, and timing at each step. These aren't logs dumped to a file - they're structured and queryable. This is what debugging production voice agents actually requires, and it's the thing that's missing everywhere else.

what people are actually building with dograh

Lead screening and follow-ups. The agent calls a list, detects answering machines using a custom detection prompt, disconnects on voicemail, and books when it reaches a human. The trace shows every call - what speech was detected, what the agent said, where drop-off happened.

Outbound sales. The agent pulls CRM data before each call and injects it into the prompt. It qualifies, handles objections, and transfers to a human rep when intent is high. It updates the CRM automatically so the rep already knows what was said before they pick up.

Inbound support. The agent handles tier-1 support - order status, appointment changes, basic troubleshooting. When it can't resolve, it transfers with a conversation summary already written to the CRM. The human agent picks up with full context, not a blank screen.

Multi-language outbound. One agent, multiple languages. Sarvam for Hindi, Deepgram for English and Spanish. The agent detects language on the first turn and switches STT and TTS provider automatically. No separate agent per language, no separate infrastructure per market.

other things built into dograh

Beyond the call itself, Dograh has a few other things built in worth knowing about.

Automatic QA and post-call analysis.
After every call, Dograh checks what happened automatically. It looks at sentiment, whether the user got confused, whether the agent followed its instructions, and what actions actually fired. You don't need to listen to 200 recordings to find problems. It surfaces where things went wrong - whether the agent sounded off, missed a question, or skipped a step in the flow.

Dedicated telephony with integrated dialer and Asterisk ARI. Telephony is built in, not bolted on. You get low-latency calling across regions and a dialer that works out of the box for both inbound and outbound. No separate systems to stitch together. For teams that need deeper control, Dograh supports Asterisk ARI - you can plug into existing telephony infrastructure, customize call logic at a low level, and build more complex flows. Flexible enough for serious production deployments.

API key rotation.
Add multiple API keys for any LLM, STT, or TTS provider and Dograh rotates between them automatically. No custom hacks needed to stay under rate limits or handle concurrency at scale.

Looptalk (coming soon). AI-to-AI call testing. You spin up a test caller with a persona - "skeptical prospect", "fast talker", "non-native English speaker" - and run it against your production agent at scale. Every simulated call leaves a full trace. You find edge cases before real customers do.

why Open Source Matters Here

Closed-source voice AI platforms have a structural problem. When your agent breaks, you file a ticket. The platform tells you what they can see. You don't get the internals, you don't get the turn-level trace, and you can't fix it yourself.
Better support doesn't fix that. It's a fundamental property of closed infrastructure.
With Dograh you run the platform. The trace is yours. The data never leaves your VPC unless you want it to. When something breaks at 3am, you look at the trace. You don't wait for a vendor to respond.
This is also why BSD-2 matters. Not AGPL, not SSPL, not "open core with enterprise features behind a paywall." BSD-2 means you can fork it, modify it, white-label it, embed it in a commercial product, and deploy it for clients without owing anyone a license fee. The code is yours.
The per-minute fee model in closed platforms creates a genuinely adversarial relationship - the platform makes more money when your calls are longer or more frequent. Dograh's business model is managed hosting on app.dograh.com for teams that don't want to self-host. The self-hosted version is completely free and always will be.

get started

Github link - https://github.com/dograh-hq/dograh
Runs on any Linux host or Apple Silicon Mac. The default config works for local dev. Drop your LLM, STT, and TTS API keys in the .env file and swap providers without touching code.
dograh - open-source voice AI runtime, full call traces, self-hostable
docs.dograh.com - setup, provider config, AMD, call transfer, knowledge base
app.dograh.com - managed hosting if you don't want to run the infra
Every provider is pluggable. Every call leaves a trace. The platform is yours.

The Open-Source Voice AI Stack Every Developer Should Know in 2026

Hariom Yadav — Sat, 21 Mar 2026 09:35:05 +0000

"Voice AI just had its "ChatGPT moment."
A year ago, building a voice agent meant stitching together five different APIs and paying multiple vendors per minute of conversation. Today the open-source ecosystem has genuinely caught up - and it's moving fast.
I've been deep in this rabbit hole building Dograh, an open-source voice agent platform like n8n. This post is basically the research I wish existed when I started. Here's the full OSS stack - from raw audio all the way to a deployed phone agent.

The Stack at a Glance

A production voice agent has five layers:
Telephony / Transport  ->  Twilio, Vonage, WebRTC
STT (Speech-to-Text)   ->  Parakeet, Canary Qwen, Silero VAD
LLM                    ->  GPT-4o, Claude, Llama 3
TTS (Text-to-Speech)   ->  Chatterbox, Kokoro, XTTS-v2
Orchestration          ->  Dograh, Pipecat, LiveKit Agents

Every single layer now has solid open-source options. Let's go through them one by one.

Speech-to-Text

If you're building anything real-time, you need something built for streaming from the ground up. Whisper was a great model in 2022 and has kept up well with the voice agent use case. You may try whisper turbo first before other alternatives.
The best option right now for English real-time transcription is NVIDIA's Parakeet TDT 0.6B V2. It sits at #3 on the Hugging Face Open ASR leaderboard with a 6.05% WER, but the number that actually matters for voice agents is its RTFx score of 3386 - meaning it can process audio roughly 3000x faster than real-time. On a T4 GPU it's extremely affordable to run. It handles punctuation, capitalization, and word-level timestamps out of the box.

Python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
transcript = model.transcribe(["audio.wav"])
print(transcript[0])

If accuracy matters more than raw speed - say you're transcribing medical calls or anything where a wrong word is costly - NVIDIA's Canary Qwen 2.5B is the current accuracy leader on the Open ASR leaderboard at 5.63% WER. It combines ASR with LLM capabilities under the hood, which helps a lot with context and unusual vocabulary. The tradeoff is it's heavier to run and not as snappy for real-time use.
Either way, pair your STT model with Silero VAD. It's a small Voice Activity Detection model that tells your agent when someone is actually speaking. Without it you're either cutting people off mid-sentence or waiting awkwardly for them to finish. Every real-time voice pipeline needs this.

Text-to-Speech

Chatterbox from Resemble AI is the most exciting TTS release of the past year. It hits commercial-grade quality in blind tests, supports voice cloning, and has built-in audio watermarking for responsible use. If you're building anything customer-facing, this is probably your best open-source option right now.

Python
import torchaudio
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
wav = model.generate("Hello! This is Chatterbox speaking.")
torchaudio.save("output.wav", wav, model.sr)

For multilingual voice cloning, XTTS-v2 from Coqui is the go-to. It supports zero-shot cloning across 20+ languages with just a 6-second reference clip. Works well for audiobook tools, multilingual assistants, and anything where you need a consistent voice across languages.
If latency is your main constraint, look at Kokoro. It's only 82M parameters, runs on CPU, and can hit under 100ms on consumer hardware. The quality isn't Chatterbox-level but for edge deployments or high-throughput scenarios it's hard to beat.

Orchestration

This is the layer most developers underestimate. Orchestration ties STT, LLM, and TTS together and handles all the hard real-time stuff - barge-in when the user interrupts, turn detection, audio streaming, silence handling. Getting this wrong is what makes voice agents feel robotic even when the individual models are great.

Dograh is what I've been building. Think of it as the n8n of voice AI - a visual workflow builder where you can wire up your entire agent flow without touching code. It's the most direct open-source alternative to Vapi and Retell, and unlike those it's fully self-hostable with no per-minute markup.
It's pretty mature at this point. You get telephony out of the box via Twilio, Vonage, and Cloudonix, inbound and outbound calling, a fully built workflow builder, and one-command setup. All the plumbing is already there - knowledge base, dictionary, KYC, voicemail detection, variable extraction, QA you calls, multilingual, transfer call to human. It can be an inbound and outbound call as well as be deployed as a widget on your website. And you just bring your own LLM, STT, and TTS keys and plug them in.
curl -fsSL https://raw.githubusercontent.com/dograh-hq/dograh/main/install.sh | bash

The visual workflow builder is the big differentiator from raw frameworks like Pipecat or LiveKit Agents. Changing agent behavior means dragging a node, not editing Python and redeploying. For teams that want to iterate fast on agent logic without a full engineering cycle every time, that's a pretty big deal.

Pipecat is a Python framework built by Daily.co. It treats audio as a stream of typed frames and lets you build a pipeline of processors in sequence. It's transport-agnostic and gives you fine-grained control over every step. That control comes at a cost though - every time you want to change agent behavior, you're editing Python code, redeploying, and hoping nothing broke in the pipeline. For a team without dedicated voice engineering experience, the iteration loop gets slow fast.

Python
pipeline = Pipeline([
    transport.input(),
    silero_vad,
    deepgram_stt,
    openai_llm,
    cartesia_tts,
    transport.output(),
])

LiveKit/Agents has a cleaner API and abstracts away the WebRTC infrastructure, which makes the initial setup quicker. But the same problem applies - your agent logic lives in code. Any prompt change, flow tweak, or new use case means a code change and a redeploy. It's genuinely a good framework if you have engineers who live in this stuff full-time, but it's not something a small team can move fast with.

python
agent = VoicePipelineAgent(
    vad=silero.VAD.load(),
    stt=deepgram.STT(),
    llm=openai.LLM(),
    tts=cartesia.TTS(),
)

Both Pipecat and LiveKit Agents are solid if you want maximum control and have the engineering bandwidth to match. If you don't, you'll spend more time maintaining infrastructure than actually improving your agent.

Quick Comparison

Feature	Dograh	Pipecat	LiveKit Agents
Type	Platform	Framework	Framework
Visual Workflow Builder	Yes	No	No
Frontend UI	Yes	No	No
Telephony	Twilio, Vonage, Cloudonix	Twilio	SIP
Self-hostable	Yes	Yes	Yes
Setup Time	Minutes	Hours	Hours
Bring Your Own LLM	Yes	Yes	Yes
Open-source	Yes	Yes	Yes

The Mistake Most People Make

The biggest trap I see developers fall into when building voice agents: treating voice like chat with a microphone attached.
It's a completely different problem. The hard parts aren't the models - they're the real-time engineering. When do you cut off the STT and start the LLM? What happens when the user interrupts the agent mid-sentence? How do you handle answering machines on outbound calls? What about codec mismatches - PSTN phone lines use 8kHz u-law, but most STT models expect 16kHz PCM? These are the things that will actually bite you in production.

If you're just starting out, use either Dograh or Vapi or Retell to prototype. They're all fast and handle a lot of edge cases well. But once you hit serious volume or need custom logic, the open-source stack should be your default. Choose Dograh - and the cost difference is real. Running your own stack costs under $0.02 per minute. Managed platforms charge $0.10 to $0.15.

A Starter Stack That's Completely Free

If you want to get something running this weekend with zero API bills:
VAD - Silero VAD
STT - Parakeet TDT 0.6B V2 running locally
LLM - Llama 3.1 via Groq's free tier
TTS - Kokoro running locally
Orchestration - Dograh
Total infra cost - basically zero. Real latency - very achievable under 500ms end-to-end with a mid-range GPU.

What Are You Building?

Curious what stacks people are actually running in production. Is anyone using Kokoro for real-time agents? The latency numbers look great on paper but I haven't seen many production writeups.
Drop your stack in the comments.

I'm building Dograh - an open-source alternative to Vapi and Retell. If you're tired of vendor lock-in, check it out and star it if it's useful.