Hariom Yadav for Dograh AI

Why Building Voice AI Agents Is Still So Hard (and Why We Started Dograh)

Voice AI is everywhere right now. AI phone agents, voice assistants, automated support calls. On the surface it looks like the problem is already solved.
But if you actually try to build a voice AI agent yourself, the reality feels very different.
The models are good. Speech recognition works well. Text-to-speech sounds natural. But once you start connecting all these pieces together, things get messy very quickly.
Over the last few months we experimented with several voice stacks. Most of them ended up looking something like this:

Speech-to-text
↓
LLM reasoning
↓
Tool calls
↓
Text-to-speech
↓
Telephony

Each piece works on its own. The trouble starts when you try to run the whole pipeline together in a real product.
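The strictly sequential shape of that pipeline is easy to sketch. The stage functions below are hypothetical stand-ins for real STT/LLM/TTS services, not any actual API:

```python
def speech_to_text(audio: bytes) -> str:
    # Stand-in for a real speech recognition call.
    return "what meetings do I have today"

def llm_reason(text: str) -> str:
    # Stand-in for an LLM call (possibly with tool calls inside).
    return f"Checking your calendar for: {text}"

def text_to_speech(text: str) -> bytes:
    # Stand-in for a TTS call returning audio bytes.
    return text.encode("utf-8")

def handle_turn(audio: bytes) -> bytes:
    """One conversational turn, run strictly in sequence --
    each stage blocks until the previous one finishes."""
    transcript = speech_to_text(audio)
    reply = llm_reason(transcript)
    return text_to_speech(reply)

audio_out = handle_turn(b"\x00\x01")  # fake audio frames
```

Each stage blocking on the previous one is exactly why the latencies below add up instead of overlapping.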

The real problems with voice AI

Once we started building real voice agents, a few problems kept showing up again and again.

Latency breaks the experience

Voice is very different from chat. People expect responses almost instantly. Even small delays feel awkward in a conversation.

A typical pipeline often looks like this:

Speech recognition ~500 ms
LLM reasoning 1-2 s
Tool execution depends on the API
Text-to-speech 300-800 ms

By the time everything finishes, the user may wait 3-5 seconds for a response. That gap is enough to make the interaction feel slow.

In text chat this delay is acceptable. In voice it feels broken.
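Plugging midpoints into the figures above shows how quickly one turn adds up. The tool-execution number here is an assumption, since it depends entirely on the external API:

```python
# Rough per-turn latency budget (milliseconds), using midpoints
# of the ranges quoted above. Tool execution is an assumed figure.
stages = {
    "speech_recognition": 500,
    "llm_reasoning": 1500,   # midpoint of 1-2 s
    "tool_execution": 1000,  # assumed; varies widely per API
    "text_to_speech": 550,   # midpoint of 300-800 ms
}
total_ms = sum(stages.values())
print(total_ms / 1000)  # ~3.55 s, before network and telephony overhead
```

And that is before adding network round trips or telephony transport on top.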

Too many moving parts

A simple voice agent often needs multiple systems working together.
For example you might need:

speech recognition
an LLM
text-to-speech
telephony infrastructure
external APIs

Each service has its own API, authentication flow and failure cases. Debugging becomes painful because problems can happen at any step in the pipeline.
Very quickly you realise you are spending more time managing infrastructure than actually building the agent.

Integrations take a lot of effort

Most useful agents need access to real tools. Think about things like Gmail, Slack or a calendar.
Once you try to connect these systems you run into another layer of complexity. You need to handle OAuth flows, manage tokens, deal with API schemas and handle rate limits.
This work repeats for every new tool you add.
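The token bookkeeping alone is the kind of thing you rewrite for every tool. A minimal sketch of it, with a fake refresh function standing in for a real OAuth token endpoint:

```python
import time

class TokenManager:
    """Minimal sketch of the refresh bookkeeping every OAuth-backed
    tool integration ends up needing. Not any real library's API."""

    def __init__(self, refresh_fn, margin_s: float = 60.0):
        self._refresh_fn = refresh_fn  # hypothetical: returns (token, expires_in_s)
        self._margin_s = margin_s      # refresh this long before expiry
        self._token = None
        self._expires_at = 0.0

    def get(self) -> str:
        # Refresh if we have no token or it is close to expiring.
        if self._token is None or time.time() >= self._expires_at - self._margin_s:
            self._token, expires_in = self._refresh_fn()
            self._expires_at = time.time() + expires_in
        return self._token

# Fake refresh function so the sketch runs without a network call.
calls = []
def fake_refresh():
    calls.append(1)
    return f"token-{len(calls)}", 3600

tm = TokenManager(fake_refresh)
tm.get()
tm.get()  # cached -- no second refresh within the expiry window
```

Multiply this by rate limits, API schemas and per-provider quirks, and each new tool becomes a mini-project.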

Why we started Dograh

While working on these problems we kept asking a simple question: why is there no clean orchestration layer for voice agents?
Developers should be able to focus on building agent logic instead of constantly wiring together different services.
This idea eventually led us to start building Dograh.

Dograh is an open source platform designed to make it easier to build AI agents that interact with real tools.
The goal is simple. Instead of manually connecting every service, developers should have a unified layer that manages the workflow.

A simplified pipeline might look like this:

User voice input
↓
Speech processing
↓
LLM reasoning
↓
Tool execution
↓
Voice response

The agent logic stays clean while the system handles the infrastructure behind the scenes.
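To make that concrete, here is a toy sketch of the separation we mean. The names (`Runner`, `register_tool`) are illustrative only, not Dograh's actual API:

```python
from typing import Callable, Dict

class Runner:
    """Toy orchestration layer: owns the service wiring so the
    agent itself can stay a plain function."""

    def __init__(self):
        self._tools: Dict[str, Callable[[str], str]] = {}

    def register_tool(self, name: str, fn: Callable[[str], str]) -> None:
        self._tools[name] = fn

    def run(self, agent, text: str) -> str:
        # A real runner would also wrap STT/TTS, retries and
        # telemetry around this call.
        return agent(text, self._tools)

def my_agent(text: str, tools) -> str:
    # Agent logic: just decide which tool to use.
    if "meeting" in text:
        return tools["calendar"](text)
    return "Sorry, I can't help with that."

runner = Runner()
runner.register_tool("calendar", lambda q: "You have 2 meetings today.")
reply = runner.run(my_agent, "What meetings do I have today?")
```

The point is the shape: `my_agent` knows nothing about OAuth, audio or telephony, only which tool answers which request.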

A simple example

Imagine a voice assistant that helps with your daily work.
You ask: "What meetings do I have today?"
The system converts your speech into text. The LLM understands the request and checks the calendar. The result is summarised and spoken back to you.
The same agent could also summarise emails or search Slack conversations.
Once multiple tools start working together, the assistant becomes much more useful.

What we are learning

Working on voice AI has changed how we think about building agents.
Voice interfaces are very sensitive to delays. If responses are slow, users lose patience quickly.
Another lesson is that orchestration matters a lot. A powerful model alone does not make the system reliable. The surrounding infrastructure plays a huge role.
Debugging also becomes harder with voice systems. Without good logs and monitoring it is difficult to understand why something failed.
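In practice, the smallest thing that helps is one structured log line per pipeline stage, tagged with a call ID, so a failed call can be traced stage by stage afterwards. A minimal sketch (the field names are our own convention, nothing standard):

```python
import json
import time
import uuid

def log_stage(call_id: str, stage: str, started: float, ok: bool, **extra) -> dict:
    """Emit one structured log line for a single pipeline stage."""
    record = {
        "call_id": call_id,
        "stage": stage,
        "latency_ms": round((time.time() - started) * 1000),
        "ok": ok,
        **extra,
    }
    print(json.dumps(record))
    return record

call_id = str(uuid.uuid4())
t0 = time.time()
transcript = "what meetings do I have today"  # pretend STT result
record = log_stage(call_id, "speech_to_text", t0, ok=True, chars=len(transcript))
```

With every stage logged this way, "the call felt slow" becomes "LLM reasoning took 2.3 s on turn 4", which is something you can actually fix.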

What we want to explore next

We are still early in building Dograh. The goal is to make experimentation with voice agents easier for developers.
In future posts we plan to share what we learn while building:

voice agent architectures
orchestration patterns for tools
real production challenges

If you are building voice agents or experimenting with voice AI, I would love to hear what tools and approaches you are using.
