DEV Community: Pritesh Kumar

4 open-source tools to build production-ready AI voice agents 🎙️🚀

Pritesh Kumar — Thu, 23 Apr 2026 01:20:48 +0000

TL;DR:

We built this because we kept hitting the same frustrations. You've got only two choices today. One, you pay a platform fee to any of the 300+ voice AI companies for a comfy UI. Or you build directly on Pipecat or LiveKit, where every prompt tweak means a code change and a redeployment. For anyone shipping for clients or any production use case, that's a constant bottleneck.
We wanted a platform where the code is yours, the data stays in your infrastructure, and debugging means reading a trace, not filing a ticket.

1. Dograh 👑

I've built voice agents before, but when it came to shipping them to production, I couldn't find a platform that got me up and running in 2 minutes - until we started building Dograh.
It's an open-source voice AI platform with a visual workflow builder, built-in telephony, and post-call analytics out of the box. Think of it as an alternative to Vapi, Retell, and Bland, but self-hostable and BSD-2 licensed.
You get a canvas where you connect nodes instead of writing Python, so prompt tweaks don't mean a redeploy. Voicemail detection, call transfer, variable extraction, knowledge base, and CRM connectors all come standard. Same feature set whether you self-host or use the managed cloud.
It has native support for BYOK (bring your own key) across every layer. Deepgram or Whisper for STT, ElevenLabs or Kokoro for TTS, and any LLM for the brain. Want to run everything locally? Swap in self-hosted models through the UI, no code required.
Check it out: https://docs.dograh.com/getting-started
YouTube link: https://www.youtube.com/watch?v=sxiSp4JXqws
Star the Dograh repo ⭐ → https://github.com/dograh-hq/dograh

2. Pipecat

Building a voice AI prototype is one thing, but owning the audio pipeline in production is a different ball game. Pipecat is the Python framework from the Daily.co team for engineers who want full control over how audio frames move through an agent.
The framework handles STT, voice activity detection, LLM, and TTS as composable stages. Integration coverage is wide, including Deepgram, ElevenLabs, Cartesia, Kokoro, Whisper, Gemini, and several dozen others. Pipecat Cloud is available if you want to skip the ops side. Of the three frameworks on this list, Pipecat is the one I'd bet on in the long term if you're comfortable with Python and want to own the pipeline.
The tradeoff is that Pipecat doesn't ship anything above the framework layer: no visual builder, no post-call analytics, no CRM connectors, no QA tooling. Every change to conversation logic means editing Python, committing, and redeploying. Fine if you have an engineering team with the bandwidth to build the platform layer on top. Rough if you want a working system on day one.
Check it out: https://docs.pipecat.ai/overview/introduction

Star the Pipecat repo ⭐ → https://github.com/pipecat-ai/pipecat

3. LiveKit Agents

Building voice AI without battle-tested real-time infrastructure is a disaster waiting to happen. Audio is unforgiving and the moment you have packet loss, multi-party rooms, or browser-to-browser calls, rolling your own transport layer becomes a nightmare.
LiveKit Agents, a WebRTC-native voice framework from LiveKit, is built on top of their widely used real-time media server.
It's organised as composable pieces, including the core media server, the Agents framework for voice AI logic, and LiveKit SIP for PSTN bridging.
In addition, they offer a managed cloud option if you don't want to run the media server yourself, handling scaling, geographic distribution, and SIP trunking for you.
The easiest way to get started is to use the SDK.
The tradeoff is the same as Pipecat. Code-first SDK, no visual interface, no built-in analytics or CRM tooling. Getting a call out the door means wiring up the media server, the agent worker, and the SIP bridge separately. LiveKit Agents is overkill unless you're already using LiveKit for something else, or you genuinely need WebRTC multi-party. For a standard inbound or outbound phone agent, Pipecat is simpler, and Dograh is faster to ship.
For more, refer to their documentation: https://docs.livekit.io/intro/overview/
Star the LiveKit Agents repository ⭐ → https://github.com/livekit

4. Vocode

Building a voice AI prototype is one thing, but inheriting a dead codebase is another. What can be a bigger time sink than picking a framework that looks alive in search results but is actually abandoned?
Vocode was one of the earlier Python libraries in this space and introduced useful abstractions when it launched. Active development has largely stopped, with minimal commits for well over a year, unanswered issues, and an architecture that predates speech-to-speech models and sub-500ms pipelines.
Building a new production system on Vocode means inheriting technical debt without an active maintainer behind it. Don't. Start with Dograh, Pipecat, or LiveKit instead.
Check it out: https://docs.vocode.dev/welcome
Star the GitHub repository: https://github.com/vocodedev

Feature	Dograh	Pipecat	LiveKit Agents	Vocode
Pricing	Free OSS + optional cloud	Free OSS	Free OSS	Free OSS
Visual workflow builder	Yes	No	No	No
Self-hostable	Yes	Yes	Yes	Yes
BYOK for STT, TTS, LLM	Yes	Yes	Yes	Yes
Production features (tools, QA, telephony)	Yes	No	No	No

Thanks for reading the post.

Let me know in the comments below if any other open-source voice AI tools or frameworks have helped you ship agents to production.

Where Voice AI Agents Are Actually Getting Used in 2026

Pritesh Kumar — Fri, 17 Apr 2026 13:45:27 +0000

Voice AI has moved past the demo phase. After watching hundreds of deployments across our customer base and the broader ecosystem, I wanted to put together a practical list of where voice agents are actually earning their keep right now, and where the ROI is strong enough to justify real production budgets.

The list below is the subset that keeps showing up in real pipelines and real P&Ls.

Customer Support

Tier-1 support is the single biggest deployed voice AI use case today. Voice agents handle password resets, order status checks, account balance lookups, policy questions, and similar high-volume repetitive queries. The value is straightforward: deflect 40-60% of inbound calls away from human agents, answer in the language the caller speaks, operate 24/7. Most teams start here because the data already exists in their CRM or knowledge base, and the workflows are well understood.

Lead Screening and Qualification

Inbound leads from ads, forms, and content marketing usually sit in a queue for hours before a human gets to them. By that point, intent has dropped significantly. Voice agents now pick up the call within seconds, qualify against BANT or a custom rubric, book the meeting straight into the sales rep calendar, and log everything in HubSpot or Salesforce. This is the highest-velocity use case for B2B teams I have seen. The math is easy: a qualified meeting has a known value in the CRM, and answering in 30 seconds instead of 4 hours multiplies that yield.

Collections and Renewals in Fintech

This is where I have seen some of the strongest unit economics. Banks, lenders, and insurance companies run enormous outbound collections operations with razor-thin per-call economics. Voice agents handle reminders, soft collections, payment plan negotiation, drop-off recovery, and renewal nudges at a fraction of the cost of a human BPO. The volumes are high, the scripts are compliance-heavy (which AI handles consistently), and the conversion lift from reaching a borrower in their preferred language at the right time of day is real.

Cold Calling and Outbound Sales

The ROI here is very good if you get the targeting right. Voice agents can run thousands of outbound dials a day, handle objections, qualify interest, and hand off warm prospects to a human closer. The catch is that bad targeting plus AI dialing equals spam complaints at scale, so list hygiene and opt-in matter more than the tech itself. Teams that get this right see cost per meeting drop by 5-10x.

Appointment Setting in Healthcare

Hospitals, dental clinics, and specialty practices deal with huge no-show rates and constant rebooking churn. Voice agents handle appointment confirmations, reminders, rescheduling, and prep instructions like pre-op fasting rules. The operational impact is immediate: front desk staff get their attention back for patients physically in the clinic, and call handling capacity goes up overnight.

Receptions, Restaurants, and Local Services

Any local business with a phone number that rings all day is a candidate. Restaurants take reservations and handle takeaway orders, dental clinics book and confirm visits, salons do intake. The ticket size per business is small, but the cumulative market is enormous. This category will eventually absorb the most total call volume, even if each individual deployment is modest.

What Comes Next

The interesting wave ahead is in regulated industries like KYC verification, insurance claims intake (FNOL), patient engagement beyond appointments, and legal intake flows. These need stronger guardrails, better audit trails, tighter integration into systems of record, and clear compliance boundaries. That is where the platform layer matters, and where closed black-box APIs start to hit walls that open, inspectable stacks handle gracefully.

If you are evaluating voice AI for your own business, start with the use case where you already have volume, a clear script, a measurable outcome, and a team ready to handle the tail exceptions. Skip the speculative experiments. The wins are in the boring, high-frequency calls you make every day already.

How I built a full-fledged open-source AI calling platform and got a million impressions in almost a month 🤯

Pritesh Kumar — Tue, 14 Apr 2026 13:41:36 +0000

We published Dograh on Reddit a few days ago and the response surprised us. A self-hostable, open-source alternative to Vapi was something many developers had been waiting for.
Since then we've gotten a lot of questions about how we actually built it - what the architecture looks like, what decisions we made, and what we'd do differently. Here's the honest walkthrough.

Where it stands: an agency self-hosts our stack and is building its third client on top of it, now looking to migrate its existing clients over from other platforms. One of India's largest fintech companies is using our S2S support for a voice AI use case that is currently in development.

The core idea - provider abstraction

The first decision we made was that Dograh should never care which provider is running behind any layer. Every voice agent needs four things: something to handle the phone call, something to transcribe speech, something to think, and something to talk back. Each of these is an abstract layer in Dograh. Any provider plugs in without touching anything else.
For the real-time pipeline, we built on a custom fork of Pipecat. We chose Pipecat over LiveKit because of its architectural simplicity - everything is a processor, and events and data flow through the pipeline. Each processor can either act on the event asynchronously or forward it to the next one. That model makes it easy to reason about what's happening at any point in the call.
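To make the processor model concrete, here is a minimal sketch in plain Python. It deliberately does not use Pipecat's real classes: `Frame`, `Processor`, `BlankDropper`, `Normalizer`, and `run_pipeline` are hypothetical names made up for illustration. Each stage either acts on a frame, forwards it unchanged, or swallows it.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    """A unit of data or an event moving through the pipeline (hypothetical)."""
    kind: str
    payload: str


class Processor:
    """Base stage: by default it forwards every frame untouched."""

    async def process(self, frame: Frame) -> Optional[Frame]:
        return frame


class BlankDropper(Processor):
    """Swallows transcript frames that contain only whitespace."""

    async def process(self, frame: Frame) -> Optional[Frame]:
        if frame.kind == "transcript" and not frame.payload.strip():
            return None  # downstream stages never see this frame
        return frame


class Normalizer(Processor):
    """Trims and lowercases transcript frames; forwards everything else."""

    async def process(self, frame: Frame) -> Optional[Frame]:
        if frame.kind == "transcript":
            return Frame(frame.kind, frame.payload.strip().lower())
        return frame


async def run_pipeline(stages: List[Processor], frames: List[Frame]) -> List[Frame]:
    """Push each frame through every stage in order."""
    out: List[Frame] = []
    for frame in frames:
        current: Optional[Frame] = frame
        for stage in stages:
            current = await stage.process(current)
            if current is None:
                break  # a stage dropped the frame
        if current is not None:
            out.append(current)
    return out
```

Because every stage has the same shape, you can reason about any point in the call by asking which frames have reached which stage.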
We also built our own telephony integration layer on top of this, rather than relying on existing abstractions. That turned out to be the right call. It let us build things like human call transfers, where the transfer is exposed as a tool option in the LLM context - the model decides when to hand off based on the conversation, not a hardcoded rule.
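As a rough illustration of what "transfer as a tool" can look like, the sketch below uses the JSON-schema function format that OpenAI-style chat APIs accept. The tool name `transfer_to_human`, its parameters, and the `handle_tool_call` helper are assumptions for this example, not Dograh's actual schema.

```python
import json

# Hypothetical tool definition placed in the LLM context alongside other tools.
TRANSFER_TOOL = {
    "type": "function",
    "function": {
        "name": "transfer_to_human",
        "description": (
            "Hand the live call to a human agent when the caller asks for a "
            "person or the request is outside what the agent can resolve."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "reason": {
                    "type": "string",
                    "description": "Short summary of why the call is being transferred.",
                },
                "department": {
                    "type": "string",
                    "enum": ["support", "billing", "sales"],
                },
            },
            "required": ["reason"],
        },
    },
}


def handle_tool_call(name: str, arguments_json: str) -> dict:
    """Turn a model-issued tool call into an instruction for the telephony layer."""
    if name != TRANSFER_TOOL["function"]["name"]:
        raise ValueError(f"unknown tool: {name}")
    args = json.loads(arguments_json)
    return {
        "action": "transfer",
        "department": args.get("department", "support"),
        "reason": args["reason"],
    }
```

The point of the design is that the handoff decision lives in the model's tool choice, while the telephony layer only has to execute a small, well-defined instruction.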
Dograh supports today:
Telephony: Twilio, Vonage, Cloudonix, Telnyx, Vobiz, Asterisk ARI
STT: Deepgram, Speechmatics, Sarvam, OpenAI Whisper
LLM: OpenAI, Gemini, Groq, OpenRouter, Azure, AWS Bedrock, and fully self-hosted
TTS: ElevenLabs, Cartesia, Deepgram, OpenAI TTS
Swapping any of these is a config change.
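Here is a minimal sketch of how "swap a provider with a config change" can work for one layer. It is an assumption about the general pattern, not Dograh's code: `TextToSpeech`, `ProviderA`, `ProviderB`, `TTS_REGISTRY`, and `build_tts` are placeholder names, and the two providers are stand-ins that do no real synthesis.

```python
from typing import Callable, Dict, Protocol


class TextToSpeech(Protocol):
    """The abstract TTS layer: anything with this method can plug in."""

    def synthesize(self, text: str) -> bytes: ...


class ProviderA:
    """Stand-in provider; a real one would call a vendor SDK here."""

    def synthesize(self, text: str) -> bytes:
        return b"A:" + text.encode("utf-8")


class ProviderB:
    """Second stand-in provider with the same interface."""

    def synthesize(self, text: str) -> bytes:
        return b"B:" + text.encode("utf-8")


# Config value -> factory. Adding a provider means adding one entry here.
TTS_REGISTRY: Dict[str, Callable[[], TextToSpeech]] = {
    "provider_a": ProviderA,
    "provider_b": ProviderB,
}


def build_tts(config: dict) -> TextToSpeech:
    """Pick the TTS implementation named in the config."""
    name = config["tts"]["provider"]
    try:
        return TTS_REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unsupported TTS provider: {name}") from None
```

The rest of the pipeline only ever calls `synthesize`, so changing `config["tts"]["provider"]` is the whole migration.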

Speech-to-speech support

Beyond the STT-LLM-TTS pipeline, we've added support for true speech-to-speech models. Right now that means Gemini 3.1 Flash Live. S2S collapses the three-step loop into a single model call, which gets you sub-300ms latency (theoretically) and more natural turn-taking. Barge-in handling works out of the box. We're planning to build more robust support for locally hosted S2S models in the short term.

We also support distributed tracing with OpenTelemetry, with a solid integration into Langfuse. If you want full observability across every LLM call, tool invocation, and latency breakdown - it's already there.

Hybrid voice - the thing we're most proud of

Pure TTS agents have a latency and naturalness problem. Every response gets generated fresh, which takes time, and even the best TTS models sound slightly synthetic on predictable phrases.
We built a hybrid voice mode to fix this. The LLM picks from a library of pre-recorded human voice clips for common responses - greetings, confirmations, transitions - and only falls back to TTS when it needs to improvise. The predictable parts play instantly because there's no generation happening. Dynamic parts use TTS or a voice clone. The result is lower latency, lower cost, and a more natural-sounding agent on the parts of the call that matter most for first impressions. We can also play a recording while the agent transitions to a new node or makes a tool call, so the caller hears something instead of waiting through silence.
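The clip-or-TTS decision can be sketched in a few lines. This is a simplified illustration under assumptions: the LLM is assumed to return a dict with an optional `clip_id` and a `text` field, and `CLIP_LIBRARY`, `Playback`, and `render_turn` are hypothetical names, not Dograh's internals.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical clip library: clip id -> pre-recorded audio bytes.
CLIP_LIBRARY: Dict[str, bytes] = {
    "greeting": b"<recorded: hi, thanks for calling>",
    "confirm": b"<recorded: got it>",
    "one_moment": b"<recorded: one moment please>",
}


@dataclass
class Playback:
    source: str  # "clip" or "tts"
    audio: bytes


def render_turn(llm_choice: dict, tts: Callable[[str], bytes]) -> Playback:
    """Play a recorded clip when the model picked a known one, else synthesize."""
    clip_id: Optional[str] = llm_choice.get("clip_id")
    if clip_id in CLIP_LIBRARY:
        # No generation step, so this audio can start playing immediately.
        return Playback("clip", CLIP_LIBRARY[clip_id])
    # Unknown or missing clip id: fall back to synthesizing the model's text.
    return Playback("tts", tts(llm_choice["text"]))
```

Falling back to TTS on an unknown clip id keeps the agent talking even if the model names a clip that does not exist.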

Our current stack

Rather than explain each layer in isolation, here's the full picture:
FastAPI for the backend. Our workload is heavily I/O bound - concurrent calls, real-time audio streaming, multiple async API calls in flight at once. FastAPI's async Python support handles this well within a single process.
Next.js for the UI, deployed on Vercel.
PostgreSQL as the primary database, shipped with Docker Compose. We also use it for vector embeddings, so there's no separate vector store to run.
Arq for async task management and cron jobs. It handles our scheduled call queue and background workers cleanly.
MinIO for S3-compatible file storage - transcripts, recordings, anything that needs to persist beyond a call.

Call traces and QA - the part most platforms skip

When a call fails in production, a recording and a transcript are not enough. You need to know what the STT heard on turn 4, what the LLM decided, which tool it called, what the latency was at each step, and whether the caller interrupted mid-response. Without that, you're guessing.
Every call in Dograh generates a full per-turn trace. It's the unit of debugging, not the recording. When something goes wrong you open the trace, find the turn, see exactly what happened, fix the prompt, and re-run. No support tickets to a vendor who can't show you the internals.
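To show what a per-turn trace buys you, here is a small sketch of the kind of record involved. The field names and helpers (`TurnTrace`, `slowest_turn`, `interrupted_turns`) are assumptions for illustration and will not match Dograh's actual trace schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TurnTrace:
    """One conversational turn, with what each stage saw and how long it took."""
    turn: int
    stt_text: str                      # what the STT heard
    llm_text: str                      # what the LLM decided to say
    tool_called: Optional[str] = None  # which tool fired, if any
    stt_ms: int = 0
    llm_ms: int = 0
    tts_ms: int = 0
    interrupted: bool = False          # did the caller barge in mid-response?

    @property
    def total_ms(self) -> int:
        return self.stt_ms + self.llm_ms + self.tts_ms


def slowest_turn(trace: List[TurnTrace]) -> TurnTrace:
    """Find the turn with the largest end-to-end latency."""
    return max(trace, key=lambda t: t.total_ms)


def interrupted_turns(trace: List[TurnTrace]) -> List[int]:
    """List the turn numbers where the caller interrupted."""
    return [t.turn for t in trace if t.interrupted]
```

With records like these, "what happened on turn 4" becomes a lookup instead of a listening session.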
After every call, Dograh also runs automatic post-call QA - checking sentiment, whether the user got confused, whether the agent followed its instructions, and what actions fired. You don't need to listen to 200 recordings to find where things broke.

Why we open-sourced it

We built this because we kept hitting the same frustrations. You've got only two choices today. One, you pay a platform fee to any of the 300+ voice AI companies for a comfy UI. Or you build directly on Pipecat or LiveKit, where every prompt tweak means a code change and a redeployment. For anyone shipping for clients or any production use case, that's a constant bottleneck.
We wanted a platform where the code is yours, the data stays in your infrastructure, and debugging means reading a trace, not filing a ticket.
Dograh is BSD-2 licensed. Self-hosted via Docker Compose. No per-minute platform fee because you own the platform.
A star genuinely helps us more than we can explain.

Star Dograh

We analyzed 10,000 voice AI calls. The LLM was rarely the problem.

Pritesh Kumar — Sat, 28 Mar 2026 13:54:27 +0000

We built Dograh OSS, an open-source voice AI platform. When we started, we assumed most failures would come from the LLM - bad answers, missed intent, prompt edge cases. So we spent a lot of early effort there.

Then we looked at the data. We ran automated QA where an LLM reviews every turn in every call and tags what went right and wrong, and we spent hours listening to calls ourselves. Across roughly 10,000 calls spanning customer support, appointment booking, and lead qualification, the failure picture looked nothing like what we expected.

The problems that showed up again and again were about the phone call as a medium. Timing, audio physics, and infrastructure designed decades before LLMs existed.

Here is what we found, roughly ranked by frequency.

Failure area	Share	Primary driver
STT / word error rate	~38%	Low-quality telephony audio and accent variation
First-8-second chaos	~34%	Greeting latency, barge-in, variable user behavior
Interruption handling	~28%	Filler words breaking flow, context switching
Extended silence	~22%	Users distracted, fetching info, handing off phone
Tool call latency	~19%	LLM turns plus external API latency
LLM failure modes	~15%	Hallucinations, instruction drift, latency trade-offs
Broken escalation	~11%	No clear human handoff path

These categories overlap. A lot of bad calls had two or three of these compounding each other.
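A quick bit of arithmetic shows how much overlap there is. This assumes the shares are percentages of problematic calls and that every problematic call got at least one tag; under those assumptions the sum divided by 100 is the average number of failure tags per bad call.

```python
# Shares copied from the table above, in percent.
SHARES = {
    "STT / word error rate": 38,
    "First-8-second chaos": 34,
    "Interruption handling": 28,
    "Extended silence": 22,
    "Tool call latency": 19,
    "LLM failure modes": 15,
    "Broken escalation": 11,
}

total_percent = sum(SHARES.values())
avg_tags_per_bad_call = total_percent / 100

print(total_percent)          # 167
print(avg_tags_per_bad_call)  # 1.67
```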

STT is worse than you think

STT failures were the single biggest contributor to broken calls, and the one most consistently underestimated by teams building voice AI for the first time.

Standard telephony audio is sampled at 8 kHz, which keeps only frequencies below about 4 kHz. That is genuinely low quality. It strips away acoustic detail that helps distinguish consonants, so "schedule" becomes "school" in a transcript and "Sunday" comes through as "someday." Add background noise, speakerphone, or a non-native accent and word error rates climb fast.

The thing about WER is that errors compound. "I need to school my appointment for someday" should be understood as "I need to schedule my appointment for Sunday," and a well-prompted LLM will figure that out. When three or four words are garbled in the same turn though, contextual inference falls apart. We saw calls where entire turns came through as near-gibberish.
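For reference, word error rate is the word-level edit distance between what was said and what was transcribed, divided by the number of words actually said. The sketch below is a plain textbook implementation (the function name `word_error_rate` is ours) applied to the example sentence above, where 2 of 8 words are wrong.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


said = "I need to schedule my appointment for Sunday"
heard = "I need to school my appointment for someday"
print(word_error_rate(said, heard))  # 0.25
```

A 25% WER on this turn is still recoverable from context. The trouble starts when the wrong words are the ones carrying the intent.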

There is also a dimension STT completely misses. Transcription captures words but it does not capture how those words are said. A frustrated "fine" and a genuinely agreeable "fine" are two very different things, but they look identical in a transcript. Tone helps disambiguate words that sound similar. When a caller sounds confused or hesitant, a human listener picks up on that and adjusts. STT gives you a flat string of text and the LLM works with it blind.

No STT provider has uniform accuracy across all accents either. The gap between a provider's accuracy on American English versus Nigerian English or heavily accented Spanish can be 15 to 25 percentage points. If your users call from diverse regions, picking one STT provider and moving on is going to hurt you.

Two mitigations made a real difference in our testing. First, adding a custom vocabulary to your STT engine - and I mean beyond domain jargon. If you are building an order management bot, add the common words your callers actually say: "order," "ID," "payment," "cancel," "account," "address." Thirty to forty frequently used words. The STT engine listens for these with extra weight and it meaningfully reduces errors on the terms that matter most.

Second, tell the LLM to expect transcription errors. A single line in your system prompt acknowledging that the caller's words may contain transcription noise, and asking the model to use contextual reasoning before responding, reduces the downstream impact of STT failures significantly. The LLM stops treating garbled transcripts as literal input and starts being smarter about what the caller probably meant.
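Both mitigations are small enough to sketch. The keyword list and the prompt wording below are examples, not a recommended canonical text, and `build_keyterms` and `build_system_prompt` are hypothetical helpers. How you pass the keyword list to your STT engine depends on the provider, so that step is left out.

```python
from typing import Iterable, List

# Example everyday words for an order management bot (from the text above).
COMMON_WORDS = ["order", "ID", "payment", "cancel", "account", "address"]

# Example wording; tune it for your own agent.
STT_NOISE_NOTE = (
    "The caller's words come from speech-to-text on phone audio and may contain "
    "transcription errors. Use the conversation context to infer what the caller "
    "most likely meant before responding, and ask a short clarifying question "
    "if you are unsure."
)


def build_keyterms(words: Iterable[str], limit: int = 40) -> List[str]:
    """De-duplicate case-insensitively, keep order, and cap the list size."""
    seen = set()
    result: List[str] = []
    for word in words:
        cleaned = word.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            result.append(cleaned)
    return result[:limit]


def build_system_prompt(base_prompt: str) -> str:
    """Append the transcription-noise note to an existing system prompt."""
    return f"{base_prompt.strip()}\n\n{STT_NOISE_NOTE}"
```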

The first 8 seconds are where calls die

About a third of all problematic calls failed before the real conversation even started.

A lot happens in those first 8 to 10 seconds. Callers are still deciding whether they want to engage. Many have not fully shifted their attention to the call. Some are already talking before the bot finishes its greeting, and others wait much longer than expected, unsure whether the silence means the system is broken. This is also the window where callers most frequently realize they are talking to a bot, and many immediately start testing it - interrupting, asking off-script questions, being deliberately vague. The variance in behavior is just much higher here than at any other point in the call.

Greeting latency makes everything worse. A gap of more than a second or two between the call connecting and the first word of audio is enough for many callers to assume things are broken and hang up. Pre-generating and caching your greeting audio instead of synthesizing it fresh every time removes an entire class of failures here.
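Caching the greeting is simple to sketch. The version below keeps audio in memory and assumes a `synthesize(text, voice)` callable that you supply; `GreetingCache` is a hypothetical name, and a production setup would more likely store the audio on disk or in object storage.

```python
import hashlib
from typing import Callable, Dict


class GreetingCache:
    """Synthesize each distinct greeting once, then serve it from memory."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        self._synthesize = synthesize
        self._store: Dict[str, bytes] = {}

    def get(self, text: str, voice: str) -> bytes:
        # Key on both text and voice so a voice change regenerates the audio.
        key = hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self._synthesize(text, voice)
        return self._store[key]
```

Warm the cache at deploy time (call `get` once per greeting) so the first real caller never pays the synthesis cost.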

What works: keep greetings short, under six seconds. Consider disabling barge-in for just the first 3 to 4 seconds if your greeting contains information the caller needs to hear. Re-enable interruption after that. The period where barge-in causes the most problems is right at the start.

In Dograh, each workflow node has its own "Allow Interruption" toggle, so you can switch off interruption on the starting node for a short introduction and re-enable it for the rest of the conversation.
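If your stack has no per-node toggle, the same idea can be approximated with a timer. This is a sketch under assumptions: `interruption_allowed` and its parameters are hypothetical, and the 3-second default is just a value inside the 3 to 4 second range suggested above.

```python
def interruption_allowed(
    elapsed_s: float,
    node_allows_interruption: bool = True,
    protected_window_s: float = 3.0,
) -> bool:
    """Decide whether caller speech should cut off the agent's audio.

    elapsed_s is the time since the greeting started playing.
    """
    if not node_allows_interruption:
        return False  # the node-level switch always wins
    return elapsed_s >= protected_window_s
```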

Interruptions, silence, and dead air

Most barge-in documentation focuses on detecting when a user starts speaking and stopping the TTS output. That part is reasonably well solved. The harder problem is what happens to the conversation after the interruption.

One pattern we saw constantly: a caller says "uh huh" or "yeah" while the bot is talking. The bot interprets this as a genuine turn, stops speaking, and tries to process it as new user input. The LLM produces a response to what is essentially a non-sentence and the conversation breaks. The caller has to re-explain what they wanted.

A related pattern involves context switching - a caller interrupts to ask "wait, does that include weekends?" The bot handles the question fine but loses track of what it was explaining before. The original task gets dropped without resolution.

Both problems are about how interruption events are handled in the conversation state. The fix is designing explicit logic for what happens after an interruption. Differentiate between filler sounds and substantive speech, and track unresolved conversational threads so the bot can come back to them.
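A rough sketch of both pieces follows. The filler list is a small example and would need tuning per language and use case, and `is_backchannel` and `ThreadTracker` are hypothetical names. Real systems often also look at utterance duration and whether the agent was speaking at the time.

```python
import re
from typing import List, Optional

# Example filler phrases; extend for your callers and languages.
BACKCHANNELS = {
    "uh huh", "uh-huh", "yeah", "yep", "mm hmm", "mhm",
    "ok", "okay", "right", "sure", "i see",
}


def is_backchannel(utterance: str) -> bool:
    """True when the utterance is only a filler acknowledgement."""
    normalized = re.sub(r"[^a-z\s-]", "", utterance.lower())
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return normalized in BACKCHANNELS


class ThreadTracker:
    """Remembers what the agent was doing when a real interruption arrived."""

    def __init__(self) -> None:
        self._pending: List[str] = []

    def interrupt(self, current_topic: str) -> None:
        self._pending.append(current_topic)

    def resume(self) -> Optional[str]:
        """Return the most recent unfinished topic, if any."""
        return self._pending.pop() if self._pending else None
```

Apply the filler check only to speech that arrives while the agent is talking. A "yeah" that answers a direct yes/no question is a real turn and should be processed normally.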

Silence is closely related. Real callers go quiet for perfectly legitimate reasons - checking an email for an order ID, handing the phone to someone else, looking something up. A caller fetching information might go quiet for 20 to 40 seconds. Most voice AI systems interpret this as confusion or abandonment and respond with prompts or just hang up.

A graduated response works much better: a brief neutral filler after 5 seconds, a gentle "still there?" check at 15 seconds, a real choice at 45 seconds ("I can hold, or you are welcome to call back"), and a graceful close with callback instructions only after 90 seconds or more.
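The ladder above maps cleanly onto a threshold function. The action names here are made up for the sketch, and a real implementation would also make sure each step fires only once per silence.

```python
def silence_action(silent_seconds: float) -> str:
    """Map how long the caller has been quiet to the next agent behaviour."""
    if silent_seconds >= 90:
        return "close_with_callback"  # graceful close with callback instructions
    if silent_seconds >= 45:
        return "offer_choice"         # "I can hold, or you are welcome to call back"
    if silent_seconds >= 15:
        return "check_in"             # gentle "still there?"
    if silent_seconds >= 5:
        return "filler"               # brief neutral filler
    return "wait"
```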

Tool call latency creates its own kind of dead air

When a voice agent needs to look something up in a CRM or check a calendar slot, a tool call gets triggered. In practice that means at least one additional LLM turn to interpret the result, plus the external API latency. The total gap can easily reach 3 to 5 seconds, and callers at that point genuinely don't know if the call is still alive.

We are adding a "playback speech" feature in Dograh that lets you configure a pre-recorded audio clip to play while a tool call executes. This fills the silence without the LLM having to generate a response and it keeps the caller engaged. Beyond that, pre-fetching data you know will be needed at call start - account details, prior call history, caller ID lookups - keeps common tool calls out of the live response path entirely.
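The general technique of covering a slow tool call with audio can be sketched with `asyncio`. This is not Dograh's implementation: `call_tool_with_filler`, `tool`, and `play_clip` are hypothetical, and the 0.8-second threshold is an arbitrary example value.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def call_tool_with_filler(
    tool: Callable[[], Awaitable[Any]],
    play_clip: Callable[[], Awaitable[None]],
    filler_after_s: float = 0.8,
) -> Any:
    """Run a tool call and play a clip only if it is taking noticeably long."""
    task = asyncio.create_task(tool())
    # Give fast tools a chance to finish before any audio is played.
    done, _pending = await asyncio.wait({task}, timeout=filler_after_s)
    if not done:
        # The tool keeps running in the background while the clip plays.
        await play_clip()
    return await task
```

In this sketch the result is returned only after the clip finishes, so keep the clip short. Fast tool calls return without any filler at all.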

LLM failures and broken escalation

LLM failures in production voice AI are real but probably smaller than you would expect. Hallucination gets the most attention but it accounts for a smaller share of bad calls than the quieter failures. Models stop following instructions partway through long calls. They generate empty responses that cause the bot to say nothing at all. They produce answers that are technically coherent but completely disconnected from the previous turn.

The trade-off between model size and latency matters here too. Smaller models respond faster, which helps perceived call quality, but they follow complex instructions less reliably. Larger models handle nuanced prompts better but their response latency feeds right back into the dead-air problem.

Escalation failures have a lower frequency but they are the most consequential category on this list. The callers asking for a human almost always have the hardest, most urgent problems. They have already tried self-service and it didn't work. When the bot can't detect they want to escalate - because they phrased it in a way the intent detection doesn't recognize - or when the escalation path drops the context so the human agent starts from scratch, that caller's experience is about as bad as it gets.

Escalation should be a first-class destination in every workflow. The phrases callers use to ask for a human are wildly more varied than any training set anticipates, and the transfer needs to carry the full conversation transcript to the receiving agent.
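One way to make sure context survives the transfer is to treat the handoff as a payload with the transcript as a required field. The structure below is a hypothetical example (`Turn`, `Handoff`, and `build_handoff` are our names), not a format any particular contact-center system expects.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class Turn:
    speaker: str  # "caller" or "agent"
    text: str


@dataclass
class Handoff:
    call_id: str
    reason: str
    transcript: List[Turn]
    collected: Dict[str, str] = field(default_factory=dict)  # e.g. order ID


def build_handoff(
    call_id: str,
    reason: str,
    transcript: List[Turn],
    collected: Optional[Dict[str, str]] = None,
) -> dict:
    """Build the payload the human agent sees when the call arrives."""
    if not transcript:
        raise ValueError("refusing to escalate without a transcript")
    handoff = Handoff(call_id, reason, list(transcript), dict(collected or {}))
    return asdict(handoff)
```

Refusing to build a handoff without a transcript turns "the human starts from scratch" into a bug you catch in testing rather than on a live call.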

The pattern

Voice AI breaks at transitions. The first seconds of a call, the moment a user interrupts, the gap while a tool is running, the point where a caller needs a human. These are the edges where the system's assumptions about how conversations work meet how people actually behave on phone calls.

Teams that focus almost entirely on LLM response quality and treat these transitions as secondary concerns tend to ship agents that sound good in demos but disappoint in production. The calls that held up well in our data were the ones with the most deliberate handling of everything that happens between LLM responses.

We are building Dograh as the open-source alternative to Vapi, Retell, and Bland. No per-minute fees, no vendor lock-in, deploy on your own infra. Ask us any questions in the comments.

Star us on GitHub | Try the cloud version