TLDR: Dograh is an open-source voice AI platform - an OSS alternative to Vapi. Self-hostable, no per-minute fees, visual workflow builder, full call traces per turn. Choose any LLM, STT, and TTS provider. One Docker command to run.
Your voice agent made 2000 calls last night. 180 failed. 110 hit answering machines and kept talking anyway. 40 transferred to the wrong department. 30 said something your prompt definitely didn't tell it to say.
You have a call recording. You have a transcript. But you have no idea which turn went wrong, what the LLM actually decided, whether the STT heard something different from what was said, or why latency spiked on call 47 but not call 48.
Voice agents are getting deployed everywhere right now. We haven't spent nearly enough time giving builders the visibility to know what their agents are actually doing.
duct tape as voice AI infra
If you're building a voice agent today, your stack probably looks something like this: Twilio for telephony, Vapi or Retell as the orchestration layer, Deepgram for speech-to-text, ElevenLabs for text-to-speech, OpenAI for the LLM, your own logic for answering machine detection, and some observability to debug when something breaks.
Six vendors. It works. Kind of.
Until Vapi's per-minute fee eats 70% of your margin. Until a call fails silently and you have no turn-level trace to show why. Until your enterprise client says "we can't send call recordings to a third-party cloud." Until you need to change one prompt and you're back to redeploying the whole stack.
The real problem isn't any single vendor. There's no single layer connecting all of them. Every component is a black box. When something goes wrong between them, you're guessing.
dograh is the layer your voice agent is missing
Traditional voice AI platforms are built around per-minute billing. You sign up, connect a Twilio number, write a prompt, hope it works, and get a bill. That's fine for demos. It's wrong for production agents.
Dograh gives your voice agent a complete, observable, self-hostable runtime. Every call generates a full trace - not just a transcript. For every turn you get: what the STT heard, what the LLM decided, which tool was called, what the latency was, what the TTS said, and whether the human interrupted. The call trace is the unit of debugging, not the recording.
The mental model is a debugger, not a phone bill. When a call goes wrong, you open the trace, find the turn, see exactly what happened, fix the prompt, and re-run. No support tickets to a vendor who can't show you the internals.
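To make the debugger mental model concrete, here is a minimal sketch of what a per-turn trace record could look like. The field names (`stt_text`, `llm_output`, `latency_ms`, etc.) are illustrative, not Dograh's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class TurnTrace:
    """One conversational turn in a call trace (hypothetical schema)."""
    turn: int
    stt_text: str           # what the STT engine heard
    llm_output: str         # what the LLM decided to say
    tool_calls: list = field(default_factory=list)  # tools invoked this turn
    tts_text: str = ""      # what was actually spoken back
    latency_ms: int = 0     # end-to-end latency for the turn
    interrupted: bool = False  # did the human barge in?

# With structured turns, the failing moment is a record, not a hunch:
bad_turn = TurnTrace(
    turn=7,
    stt_text="I said cancel, not confirm",
    llm_output="Great, your order is confirmed!",
    latency_ms=2400,
)
assert "cancel" in bad_turn.stt_text and "confirmed" in bad_turn.llm_output
```

Once the turn is a record like this, "fix the prompt and re-run" is a diff against a known input, not a guess from a recording.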
Dograh is BSD-2 licensed. Self-hosted via docker-compose. There is no per-minute platform fee because you own the platform.
GitHub: https://github.com/dograh-hq/dograh
what dograh gives you out of the box
Dograh does a lot because voice agents need a lot. The important thing is that it's modular - every layer is swappable without touching the rest of the system.
Telephony - works via Twilio, Vonage, and Cloudonix for both inbound and outbound calling. You bring your own numbers. If you're on a PBX, Cloudonix connects directly to your existing SIP trunk. You own the telephony layer with no vendor lock-in.
STT - supports Deepgram, Speechmatics, Sarvam, and OpenAI Whisper. You pick the model per-agent or per-call depending on language, speed, or accuracy needs. Indian English works better on Sarvam. Low-latency real-time transcription works better on Deepgram Nova-3. High accuracy on noisy calls works better on Whisper. You swap without rewriting any logic.
LLM - support covers OpenAI, Google Gemini, Groq, OpenRouter, Azure, AWS Bedrock, and fully self-hosted models. The agent doesn't care which model responds - the interface is the same across all of them.
TTS - runs on ElevenLabs, Cartesia, Deepgram, and OpenAI TTS. There's also a hybrid voice mode that's worth calling out separately. Instead of generating every response with TTS, the LLM picks from a library of pre-recorded human voice clips for common responses and only falls back to TTS when it needs to improvise. This cuts latency because there's no generation delay, cuts cost because you're using less TTS, and sounds more human because it literally is a human recording for the predictable parts. When the text is dynamic, it falls back to voice-cloned TTS.
Here's a quick tutorial on this hybrid approach:
https://www.youtube.com/watch?v=1uZqhG0_cIo
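The clip-first, TTS-fallback logic can be sketched in a few lines. This is an illustration of the idea, not Dograh's actual API - the clip keys, file paths, and `synthesize_tts` placeholder are all assumptions:

```python
# Hybrid voice sketch: prefer pre-recorded human clips for canned lines,
# fall back to TTS only when the response is dynamic.

CLIP_LIBRARY = {
    "greeting": "clips/hi_this_is_alex.wav",
    "hold_on": "clips/one_moment_please.wav",
    "goodbye": "clips/thanks_for_your_time.wav",
}

def synthesize_tts(text: str) -> str:
    # Placeholder for a real TTS call (ElevenLabs, Cartesia, ...).
    return f"tts:{text}"

def speak(response: str, clip_key=None) -> str:
    """Return an audio source: a human recording if one exists, else TTS."""
    if clip_key and clip_key in CLIP_LIBRARY:
        return CLIP_LIBRARY[clip_key]   # zero generation latency
    return synthesize_tts(response)     # improvised / dynamic text

print(speak("Hi, this is Alex!", clip_key="greeting"))  # clips/hi_this_is_alex.wav
print(speak("Your balance is $84.20"))                  # tts:Your balance is $84.20
```

The predictable parts of a call (greeting, hold, goodbye) are exactly the parts a caller hears most, which is why serving them from real recordings moves the perceived quality so much.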
Call traces - are the thing that actually changes how you debug. Every turn is logged with the STT input, LLM output, tool calls, TTS output, and timing at each step. These aren't logs dumped to a file - they're structured and queryable. This is what debugging production voice agents actually requires, and it's the thing that's missing everywhere else.
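"Queryable" means questions like "why did latency spike on call 47 but not call 48?" become filters instead of audio review. A minimal sketch, assuming turn records are plain dicts (the field names are hypothetical):

```python
# Query sketch over structured turn records (hypothetical field names).
calls = {
    "call_47": [{"turn": 1, "latency_ms": 900}, {"turn": 2, "latency_ms": 3100}],
    "call_48": [{"turn": 1, "latency_ms": 850}, {"turn": 2, "latency_ms": 920}],
}

def slow_turns(calls, threshold_ms=2000):
    """Return (call_id, turn) pairs whose latency exceeded the threshold."""
    return [
        (call_id, t["turn"])
        for call_id, turns in calls.items()
        for t in turns
        if t["latency_ms"] > threshold_ms
    ]

print(slow_turns(calls))  # [('call_47', 2)]
```

The same filter pattern works for any per-turn field: interrupted turns, failed tool calls, STT/LLM mismatches.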
what people are actually building with dograh
Lead screening and follow-ups. The agent calls a list, detects answering machines using a custom detection prompt, disconnects on voicemail, and books when it reaches a human. The trace shows every call - what speech was detected, what the agent said, where drop-off happened.
Outbound sales. The agent pulls CRM data before each call and injects it into the prompt. It qualifies, handles objections, and transfers to a human rep when intent is high. It updates the CRM automatically so the rep already knows what was said before they pick up.
Inbound support. The agent handles tier-1 support - order status, appointment changes, basic troubleshooting. When it can't resolve, it transfers with a conversation summary already written to the CRM. The human agent picks up with full context, not a blank screen.
Multi-language outbound. One agent, multiple languages. Sarvam for Hindi, Deepgram for English and Spanish. The agent detects language on the first turn and switches STT and TTS provider automatically. No separate agent per language, no separate infrastructure per market.
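The per-language routing described above amounts to a small lookup keyed on the detected first-turn language. This mapping is an illustration of the pattern, not Dograh's actual configuration format:

```python
# Language-based provider routing sketch (illustrative mapping).
PROVIDER_ROUTES = {
    "hi": {"stt": "sarvam", "tts": "sarvam"},        # Hindi
    "en": {"stt": "deepgram", "tts": "elevenlabs"},  # English
    "es": {"stt": "deepgram", "tts": "elevenlabs"},  # Spanish
}

def route_for(first_turn_language: str) -> dict:
    """Pick STT/TTS providers from the detected first-turn language."""
    return PROVIDER_ROUTES.get(first_turn_language, PROVIDER_ROUTES["en"])

print(route_for("hi"))  # {'stt': 'sarvam', 'tts': 'sarvam'}
```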
other things built into dograh
Beyond the call itself, Dograh has a few other things built in worth knowing about.
Automatic QA and post-call analysis.
After every call, Dograh checks what happened automatically. It looks at sentiment, whether the user got confused, whether the agent followed its instructions, and what actions actually fired. You don't need to listen to 200 recordings to find problems. It surfaces where things went wrong - whether the agent sounded off, missed a question, or skipped a step in the flow.
Dedicated telephony with integrated dialer and Asterisk ARI.
Telephony is built in, not bolted on. You get low-latency calling across regions and a dialer that works out of the box for both inbound and outbound. No separate systems to stitch together. For teams that need deeper control, Dograh supports Asterisk ARI - you can plug into existing telephony infrastructure, customize call logic at a low level, and build more complex flows. Flexible enough for serious production deployments.
API key rotation.
Add multiple API keys for any LLM, STT, or TTS provider and Dograh rotates between them automatically. No custom hacks needed to stay under rate limits or handle concurrency at scale.
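Key rotation at its simplest is a round-robin over the configured keys. Dograh handles this internally; the sketch below just shows the idea, with made-up key names:

```python
from itertools import cycle

class KeyRotator:
    """Cycle through multiple API keys so no single key absorbs all traffic."""
    def __init__(self, keys):
        self._cycle = cycle(keys)

    def next_key(self) -> str:
        return next(self._cycle)

rotator = KeyRotator(["sk-key-1", "sk-key-2", "sk-key-3"])
print([rotator.next_key() for _ in range(4)])
# ['sk-key-1', 'sk-key-2', 'sk-key-3', 'sk-key-1']
```

With N keys you get roughly N times the effective rate limit, which is usually the cheapest way to handle concurrency spikes.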
Looptalk (coming soon). AI-to-AI call testing. You spin up a test caller with a persona - "skeptical prospect", "fast talker", "non-native English speaker" - and run it against your production agent at scale. Every simulated call leaves a full trace. You find edge cases before real customers do.
why open source matters here
Closed-source voice AI platforms have a structural problem. When your agent breaks, you file a ticket. The platform tells you what they can see. You don't get the internals, you don't get the turn-level trace, and you can't fix it yourself.
Better support doesn't fix that. It's a fundamental property of closed infrastructure.
With Dograh you run the platform. The trace is yours. The data never leaves your VPC unless you want it to. When something breaks at 3am, you look at the trace. You don't wait for a vendor to respond.
This is also why BSD-2 matters. Not AGPL, not SSPL, not "open core with enterprise features behind a paywall." BSD-2 means you can fork it, modify it, white-label it, embed it in a commercial product, and deploy it for clients without owing anyone a license fee. The code is yours.
The per-minute fee model in closed platforms creates a genuinely adversarial relationship - the platform makes more money when your calls are longer or more frequent. Dograh's business model is managed hosting on app.dograh.com for teams that don't want to self-host. The self-hosted version is completely free and always will be.
get started
GitHub link - https://github.com/dograh-hq/dograh
Runs on any Linux host or Apple Silicon Mac. The default config works for local dev. Drop your LLM, STT, and TTS API keys in the .env file and swap providers without touching code.
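A quickstart might look like the following. This is a hedged sketch - the env var names are illustrative, so check docs.dograh.com for the actual configuration:

```shell
# Quickstart sketch; env var names below are examples, not Dograh's real ones.
git clone https://github.com/dograh-hq/dograh && cd dograh

# Drop provider keys into .env (names illustrative):
#   OPENAI_API_KEY=sk-...
#   DEEPGRAM_API_KEY=...
#   ELEVENLABS_API_KEY=...

docker compose up -d   # the one Docker command
```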
dograh - open-source voice AI runtime, full call traces, self-hostable
docs.dograh.com - setup, provider config, AMD, call transfer, knowledge base
app.dograh.com - managed hosting if you don't want to run the infra
Every provider is pluggable. Every call leaves a trace. The platform is yours.