We published Dograh on Reddit a few days ago and the response surprised us. A self-hostable, open-source alternative to Vapi was something many developers had been waiting for.
Since then we've gotten a lot of questions about how we actually built it - what the architecture looks like, what decisions we made, and what we'd do differently. Here's the honest walkthrough.
Where it stands today: an agency self-hosts our stack, is building its third client on top of it, and is now looking to migrate its existing clients over from other platforms. One of India's largest fintech companies is using our speech-to-speech (S2S) support for a voice AI use case that is currently in development.
The core idea - provider abstraction
The first decision we made was that Dograh should never care which provider is running behind any layer. Every voice agent needs four things: something to handle the phone call, something to transcribe speech, something to think, and something to talk back. Each of these is an abstract layer in Dograh. Any provider plugs in without touching anything else.
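To make the idea concrete, here's a minimal sketch of one layer as an abstract interface (the class and function names below are hypothetical, not Dograh's actual API):

```python
from abc import ABC, abstractmethod

# Hypothetical sketch of the layer abstraction; Dograh's real
# interfaces and class names will differ.
class STTProvider(ABC):
    """Anything that turns audio bytes into text."""

    @abstractmethod
    def transcribe(self, audio: bytes) -> str: ...

class DeepgramSTT(STTProvider):
    def transcribe(self, audio: bytes) -> str:
        return f"<deepgram transcript of {len(audio)} bytes>"

class WhisperSTT(STTProvider):
    def transcribe(self, audio: bytes) -> str:
        return f"<whisper transcript of {len(audio)} bytes>"

# The rest of the pipeline only ever sees STTProvider, so adding
# or swapping a provider never touches calling code.
PROVIDERS: dict[str, type[STTProvider]] = {
    "deepgram": DeepgramSTT,
    "whisper": WhisperSTT,
}

def make_stt(name: str) -> STTProvider:
    return PROVIDERS[name]()
```

The same pattern repeats for telephony, LLM, and TTS: one abstract base class per layer, one registry entry per provider.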
For the real-time pipeline, we built on a custom fork of Pipecat. We chose Pipecat over LiveKit because of its architectural simplicity - everything is a processor, and events and data flow through the pipeline. Each processor can either act on the event asynchronously or forward it to the next one. That model makes it easy to reason about what's happening at any point in the call.
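A toy model of that "everything is a processor" idea (illustrative only, not Pipecat's actual API): each processor either acts on a frame or forwards it to the next one.

```python
import asyncio

# Toy processor chain; frames flow downstream, and each processor
# either transforms, consumes, or passes them through.
class Processor:
    def __init__(self):
        self.next: "Processor | None" = None

    async def push(self, frame: dict):
        if self.next:
            await self.next.process(frame)

    async def process(self, frame: dict):
        await self.push(frame)  # default: pass through untouched

class Uppercase(Processor):
    """Example transform: act on text frames, forward everything."""
    async def process(self, frame: dict):
        if frame.get("type") == "text":
            frame = {**frame, "text": frame["text"].upper()}
        await self.push(frame)

class Sink(Processor):
    """End of the pipeline: collect what arrives."""
    def __init__(self):
        super().__init__()
        self.frames: list[dict] = []

    async def process(self, frame: dict):
        self.frames.append(frame)

def pipeline(*procs: Processor) -> Processor:
    for a, b in zip(procs, procs[1:]):
        a.next = b
    return procs[0]
```

Because every stage has the same shape, reasoning about a live call reduces to asking "which processor saw this frame, and what did it do with it?"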
We also built our own telephony integration layer on top of this, rather than relying on existing abstractions. That turned out to be the right call. It let us build things like human call transfers, where the transfer is exposed as a tool option in the LLM context - the model decides when to hand off based on the conversation, not a hardcoded rule.
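A transfer tool exposed to the model might look like the following OpenAI-style schema (the tool name and parameters here are hypothetical, not Dograh's actual definition):

```python
# Hypothetical OpenAI-style tool schema; Dograh's actual tool name
# and parameters may differ.
TRANSFER_TOOL = {
    "type": "function",
    "function": {
        "name": "transfer_to_human",
        "description": (
            "Hand the call to a human agent when the caller asks for one "
            "or the conversation goes beyond what the agent can handle."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "reason": {
                    "type": "string",
                    "description": "Why the transfer is happening.",
                },
            },
            "required": ["reason"],
        },
    },
}
```

Because the transfer is just another tool in the context, the model's own judgment of the conversation decides when to invoke it.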
Dograh supports today:
Telephony: Twilio, Vonage, Cloudonix, Telnyx, Vobiz, Asterisk ARI
STT: Deepgram, Speechmatics, Sarvam, OpenAI Whisper
LLM: OpenAI, Gemini, Groq, OpenRouter, Azure, AWS Bedrock, and fully self-hosted
TTS: ElevenLabs, Cartesia, Deepgram, OpenAI TTS
Swapping any of these is a config change.
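As a sketch of what "a config change" means in practice (the keys and values below are hypothetical, not Dograh's real config schema):

```python
# Hypothetical config shape; Dograh's real keys and values may differ.
AGENT_CONFIG = {
    "telephony": "twilio",
    "stt": "deepgram",
    "llm": "openai",
    "tts": "elevenlabs",
}

def swap(config: dict, layer: str, provider: str) -> dict:
    """Return a new config with one layer's provider replaced."""
    return {**config, layer: provider}
```

Swapping TTS from ElevenLabs to Cartesia is a one-key change; no pipeline code is touched.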
Speech-to-speech support
Beyond the STT-LLM-TTS pipeline, we've added support for true speech-to-speech models. Right now that means Gemini 3.1 Flash Live. S2S collapses the three-step loop into a single model call, which gets you sub-300 ms latency (in theory) and more natural turn-taking. Barge-in handling works out of the box. In the short term, we're planning to build more robust support for locally hosted S2S models.
We also support distributed tracing with OpenTelemetry, with a solid integration into Langfuse. If you want full observability across every LLM call, tool invocation, and latency breakdown - it's already there.
Hybrid voice - the thing we're most proud of
Pure TTS agents have a latency and naturalness problem. Every response gets generated fresh, which takes time, and even the best TTS models sound slightly synthetic on predictable phrases.
We built a hybrid voice mode to fix this. The LLM picks from a library of pre-recorded human voice clips for common responses - greetings, confirmations, transitions - and only falls back to TTS when it needs to improvise. The predictable parts play instantly because there's no generation happening; dynamic parts use TTS or a voice clone. The result is lower latency, lower cost, and a more natural-sounding agent on the parts of the call that matter most for first impressions. We can also play a recording while the agent transitions to a new node or makes a tool call, so users hear something instead of dead air during the wait.
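The core selection logic is simple. A sketch of the clip-or-TTS decision (the clip library, paths, and normalization rule here are hypothetical):

```python
# Sketch of the hybrid idea: serve a pre-recorded clip for known
# phrases, fall back to TTS for anything dynamic. The library
# contents and normalization rule are hypothetical.
CLIP_LIBRARY = {
    "hello, thanks for calling!": "clips/greeting.wav",
    "one moment please.": "clips/hold.wav",
    "got it.": "clips/confirm.wav",
}

def render_response(text: str) -> tuple[str, str]:
    """Return ("clip", path) on a library hit, ("tts", text) otherwise."""
    key = text.strip().lower()
    if key in CLIP_LIBRARY:
        return ("clip", CLIP_LIBRARY[key])
    return ("tts", text)
```

Library hits skip generation entirely, which is where the latency win comes from.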
Our current stack
Rather than explain each layer in isolation, here's the full picture:
FastAPI for the backend. Our workload is heavily I/O bound - concurrent calls, real-time audio streaming, multiple async API calls in flight at once. FastAPI's async Python support handles this well within a single process.
Next.js for the UI, deployed on Vercel.
PostgreSQL as the primary database, shipped with Docker Compose. We also use it for vector embeddings, so there's no separate vector store to run.
Arq for async task management and cron jobs. It handles our scheduled call queue and background workers cleanly.
MinIO for S3-compatible file storage - transcripts, recordings, anything that needs to persist beyond a call.
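The I/O-bound point about FastAPI is worth making concrete. In a voice pipeline, most wall-clock time is spent waiting on provider APIs, so concurrent awaits are nearly free. A stdlib-only sketch (with a stand-in `fetch` instead of real provider calls):

```python
import asyncio
import time

# Stand-in for an STT/LLM/TTS network call; the sleep simulates
# I/O wait, which is where voice pipelines spend their time.
async def fetch(name: str) -> str:
    await asyncio.sleep(0.1)
    return name

async def main() -> tuple[list[str], float]:
    start = time.perf_counter()
    # Three calls in flight at once take about as long as one.
    results = await asyncio.gather(fetch("stt"), fetch("llm"), fetch("tts"))
    return list(results), time.perf_counter() - start
```

Sequential calls would take ~0.3 s here; concurrent ones finish in ~0.1 s, and the same property is what lets one async process juggle many live calls.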
Call traces and QA - the part most platforms skip
When a call fails in production, a recording and a transcript are not enough. You need to know what the STT heard on turn 4, what the LLM decided, which tool it called, what the latency was at each step, and whether the caller interrupted mid-response. Without that, you're guessing.
Every call in Dograh generates a full per-turn trace. It's the unit of debugging, not the recording. When something goes wrong you open the trace, find the turn, see exactly what happened, fix the prompt, and re-run. No support tickets to a vendor who can't show you the internals.
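A hypothetical shape for one of those per-turn records (Dograh's actual trace schema will differ, but these are the fields debugging needs):

```python
from dataclasses import dataclass, field

# Hypothetical per-turn trace record; field names are illustrative.
@dataclass
class TurnTrace:
    turn: int
    stt_text: str                       # what the STT heard
    llm_response: str                   # what the model decided to say
    tool_calls: list[str] = field(default_factory=list)
    latency_ms: dict = field(default_factory=dict)  # per-step timings
    interrupted: bool = False           # did the caller barge in?

def find_turn(trace: list[TurnTrace], turn: int) -> TurnTrace:
    """Jump straight to the turn that went wrong."""
    return next(t for t in trace if t.turn == turn)
```

The point is that the trace, not the recording, is the unit you query: "show me turn 4" is a lookup, not a listening session.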
After every call, Dograh also runs automatic post-call QA - checking sentiment, whether the user got confused, whether the agent followed its instructions, and what actions fired. You don't need to listen to 200 recordings to find where things broke.
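As a toy illustration of what a post-call QA pass produces (the keyword heuristics below are a deliberate simplification; the real checks are far richer than string matching):

```python
# Toy QA pass with hypothetical keyword heuristics; real post-call
# QA evaluates the full transcript, not surface markers.
CONFUSION_MARKERS = ("what do you mean", "i don't understand", "sorry?")

def qa_report(user_turns: list[str]) -> dict:
    text = " ".join(t.lower() for t in user_turns)
    return {
        "user_confused": any(m in text for m in CONFUSION_MARKERS),
        "turn_count": len(user_turns),
    }
```

Aggregated over a day's calls, even a report this coarse points you at the handful of recordings actually worth listening to.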
Why we open-sourced it
We built this because we kept hitting the same frustrations. Today you have only two choices: pay a platform fee to one of the 300+ voice AI companies for a comfortable UI, or build directly on Pipecat or LiveKit, where every prompt tweak requires a code change and a redeployment. For anyone shipping for clients, or for any production use case, that's a constant bottleneck.
We wanted a platform where the code is yours, the data stays in your infrastructure, and debugging means reading a trace, not filing a ticket.
Dograh is BSD-2 licensed. Self-hosted via Docker Compose. No per-minute platform fee because you own the platform.
A star genuinely helps us more than we can explain.



Hello Community. I'm one of the maintainers of Dograh (github.com/dograh-hq/dograh). Please feel free to ask me any questions about open source, voice AI, MCPs, or self-hosting AI models. Happy to answer anything you've got. :)