Voice AI is everywhere right now. AI phone agents, voice assistants, automated support calls. On the surface it looks like the problem is already solved.
But if you actually try to build a voice AI agent yourself, the reality feels very different.
The models are good. Speech recognition works well. Text-to-speech sounds natural. But once you start connecting all these pieces together, things get messy very quickly.
Over the last few months we experimented with several voice stacks. Most of them ended up looking something like this:
Speech-to-text
↓
LLM reasoning
↓
Tool calls
↓
Text-to-speech
↓
Telephony
Each piece works on its own. The trouble starts when you try to run the whole pipeline together in a real product.
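Strung together naively, that pipeline becomes one blocking call per stage. A minimal sketch with stubbed services (every function body here is a placeholder, not a real SDK call; in production each stage is a separate network hop with its own latency and failure modes):

```python
# Stubbed service calls; real versions would hit STT, LLM and TTS APIs.
def speech_to_text(audio: bytes) -> str:
    return "what meetings do i have today"

def llm_reason(text: str) -> str:
    return f"You asked: {text}"

def text_to_speech(text: str) -> bytes:
    return text.encode()

def handle_turn(audio: bytes) -> bytes:
    """One conversational turn, run strictly in sequence.

    Each stage stands in for a separate external service, so every
    hop adds latency and another place for the turn to fail.
    """
    text = speech_to_text(audio)
    reply = llm_reason(text)
    return text_to_speech(reply)
```

The strictly sequential shape is exactly what makes the latency problem below so hard: nothing downstream can start until the previous service has fully returned.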
The real problems with voice AI
Once we started building real voice agents, a few problems kept showing up again and again.
Latency breaks the experience
Voice is very different from chat. People expect responses almost instantly. Even small delays feel awkward in a conversation.
A typical pipeline often looks like this:
Speech recognition: ~500 ms
LLM reasoning: 1-2 seconds
Tool execution: depends on the API
Text-to-speech: 300-800 ms
By the time everything finishes, the user may wait 3-5 seconds for a response. That gap is enough to make the interaction feel slow.
In text chat this delay is acceptable. In voice it feels broken.
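Those ballpark figures add up quickly. A back-of-the-envelope check, taking the high end of each range from the list above (the tool-execution number is a guess, since it depends entirely on the API being called):

```python
# Rough worst-case end-to-end latency for the sequential pipeline.
# Numbers are the ballpark figures from the text, not measurements.
stage_latency_ms = {
    "speech_recognition": 500,
    "llm_reasoning": 2000,   # 1-2 s, taking the high end
    "tool_execution": 1500,  # assumed; depends on the external API
    "text_to_speech": 800,   # 300-800 ms, taking the high end
}

total_ms = sum(stage_latency_ms.values())
print(f"worst case: {total_ms / 1000:.1f} s")  # worst case: 4.8 s
```

Nearly five seconds before the user hears anything, and that is with every service behaving well.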
Too many moving parts
A simple voice agent often needs multiple systems working together.
For example you might need:
speech recognition
an LLM
text-to-speech
telephony infrastructure
external APIs
Each service has its own API, authentication flow and failure cases. Debugging becomes painful because problems can happen at any step in the pipeline.
Very quickly you realise you are spending more time managing infrastructure than actually building the agent.
Integrations take a lot of effort
Most useful agents need access to real tools. Think about things like Gmail, Slack or a calendar.
Once you try to connect these systems you run into another layer of complexity. You need to handle OAuth flows, manage tokens, deal with API schemas and handle rate limits.
This work repeats for every new tool you add.
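The per-tool boilerplate can be sketched as a small interface. All names and fields here are invented for illustration; this is not an actual Dograh or provider API:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Tool:
    """The pieces that repeat for every new integration."""
    name: str
    call: Callable[[dict], dict]      # the actual API request
    refresh_token: Callable[[], str]  # OAuth token lifecycle
    schema: dict = field(default_factory=dict)  # request/response shape
    rate_limit_per_min: int = 60

def invoke(tool: Tool, args: dict) -> dict:
    # A real wrapper would also need retries, backoff when the rate
    # limit is hit, and token refresh on 401 responses.
    tool.refresh_token()
    return tool.call(args)
```

Multiply this by Gmail, Slack, and a calendar provider, and the integration layer quickly dwarfs the agent logic itself.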
Why we started Dograh
While working on these problems we kept asking a simple question:
Why is there no clean orchestration layer for voice agents?
Developers should be able to focus on building agent logic instead of constantly wiring together different services.
This idea eventually led us to start building Dograh.
Dograh is an open source platform designed to make it easier to build AI agents that interact with real tools.
The goal is simple. Instead of manually connecting every service, developers should have a unified layer that manages the workflow.
A simplified pipeline might look like this:
User voice input
↓
Speech processing
↓
LLM reasoning
↓
Tool execution
↓
Voice response
The agent logic stays clean while the system handles the infrastructure behind the scenes.
A simple example
Imagine a voice assistant that helps with your daily work.
You ask:
"What meetings do I have today?"
The system converts your speech into text. The LLM understands the request and checks the calendar. The result is summarised and spoken back to you.
The same agent could also summarise emails or search Slack conversations.
Once multiple tools start working together, the assistant becomes much more useful.
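That flow can be reduced to a toy dispatcher. The calendar data is hard-coded and the keyword check merely stands in for the LLM's tool-selection step:

```python
def get_meetings() -> list[str]:
    # Stub; a real agent would call a calendar API here.
    return ["09:30 stand-up", "14:00 design review"]

TOOLS = {"calendar": get_meetings}

def answer(question: str) -> str:
    # A real agent lets the LLM pick the tool; this keyword match
    # is just a placeholder for that routing decision.
    if "meeting" in question.lower():
        meetings = TOOLS["calendar"]()
        return f"You have {len(meetings)} meetings: " + ", ".join(meetings)
    return "Sorry, I can't help with that yet."

print(answer("What meetings do I have today?"))
# You have 2 meetings: 09:30 stand-up, 14:00 design review
```

Adding email or Slack support means registering more entries in the tool table, not rewriting the agent loop.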
What we are learning
Working on voice AI has changed how we think about building agents.
Voice interfaces are very sensitive to delays. If responses are slow, users lose patience quickly.
Another lesson is that orchestration matters a lot. A powerful model alone does not make the system reliable. The surrounding infrastructure plays a huge role.
Debugging also becomes harder with voice systems. Without good logs and monitoring it is difficult to understand why something failed.
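One cheap habit that helps: wrap every pipeline stage so its duration and failures are logged uniformly. A minimal sketch using Python's standard `logging` and `contextlib` (the stage names are illustrative):

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("voice-agent")

@contextmanager
def stage(name: str):
    """Log how long a pipeline stage takes, so a slow or failing
    turn can be traced to a specific hop."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        log.exception("stage=%s failed", name)
        raise
    finally:
        ms = (time.perf_counter() - start) * 1000
        log.info("stage=%s duration_ms=%.0f", name, ms)

with stage("speech_to_text"):
    time.sleep(0.01)  # stand-in for the real STT call
```

With one line of timing per stage, a 5-second turn stops being a mystery and becomes a question of which hop ate the budget.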
What we want to explore next
We are still early in building Dograh. The goal is to make experimentation with voice agents easier for developers.
In future posts we plan to share what we learn while building:
voice agent architectures
orchestration patterns for tools
real production challenges
If you are building voice agents or experimenting with voice AI, we would love to hear what tools and approaches you are using.