I have been experimenting with AI agents for a while now but this time, I wanted to build a Voice AI Agent. I won't lie, it does feel intimidating (if you have never built one before).
Voice AI agents are becoming more common these days, so I took the chance to learn the core components and principles and understand how everything fits together.
This post covers everything I picked up: the building blocks of Voice AI Agents and a step-by-step guide to building, testing and deploying one. I have also listed popular platforms along with some real-world use cases to learn from.
If you have been curious about how voice agents work (or want to build your own), this might help you get started. It took me just 30 minutes to deploy the agent to my portfolio.
What is covered?
In summary, we are going to cover these topics in detail.
- What is a Voice AI Agent?
- Popular tools and platforms for building one.
- A step-by-step guide to build your first voice Agent from scratch.
- Practical use cases with examples.
1. What is a Voice AI Agent?
You might already be familiar with AI Agents which are computer programs that can understand tasks, think for themselves and take actions on their own.
Voice AI Agents take a step further by combining speech and reasoning capabilities.
They are autonomous systems that listen to your voice, understand what you are saying (using speech-to-text), generate a response using Large Language Models (LLMs) like GPT-4 and speak the answer back to you using a synthetic voice (text-to-speech).
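Conceptually, the loop is simple. Here is a minimal sketch of that STT → LLM → TTS pipeline in TypeScript; the function types are hypothetical stand-ins for whichever providers you pick:

```ts
// A hypothetical sketch of the core loop. Each function would wrap a real
// provider (e.g. Deepgram for STT, GPT-4 for the LLM, ElevenLabs for TTS).
type STT = (audio: Buffer) => Promise<string>;
type LLM = (text: string) => Promise<string>;
type TTS = (text: string) => Promise<Buffer>;

async function handleTurn(
  audioIn: Buffer,
  transcribe: STT,
  think: LLM,
  synthesize: TTS
): Promise<Buffer> {
  const userText = await transcribe(audioIn);  // 1. speech-to-text
  const replyText = await think(userText);     // 2. LLM decides what to say
  return synthesize(replyText);                // 3. text-to-speech, played back
}
```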
There are mainly two types:
- Inbound agents: answer calls and respond when someone reaches out.
- Outbound agents: proactively make calls to deliver messages, reminders or carry out tasks.
Unlike traditional virtual assistants (such as Siri), Voice AI Agents can perform multi-step, complex tasks such as:
- answering customer phone calls
- initiating outbound campaigns
- providing support via a voice widget on your site
- speaking English, Arabic or any other supported language
If you are curious about the tech behind these agents, here are two great reads:
What exactly is an AI Voice Agent? by Deepgram
How do Voice AI Agents work? by Picovoice
The best part is that you don't need to be an expert to build one. There are many tools like VoiceHub by DataQueue and Retell AI that make it super easy to create and test a working voice agent in just a few minutes.
Let's explore some of those platforms next.
2. Popular tools and platforms for building one.
If you want to build your own Voice AI Agent, there are several frameworks and tools available to help you get started. Picking the right one depends on:
Language and regional support. For example, many platforms don't handle Arabic (MENA) or Indian English well.
Whether you are comfortable writing code or prefer no-code platforms.
Whether you want custom setups (more control, takes longer) or something faster.
Whether your focus is on mobile-based agents or web experiences.
I believe choosing the right platform is less about what's "best" and more about what fits your use case.
Here are the most popular platforms:
- VoiceHub by DataQueue - Easiest way to build voice agents without writing code. It connects LLMs to phone calls, lets you define workflows and deploy quickly. Bonus: MENA region support is solid (unlike many others). I will be using this one in the next section.
- Rime - lets you build conversational AI apps with both voice and text. Good for more advanced voice flows, supports integrations and has a polished UI.
- Vapi - Build phone-based voice agents using LLMs and connect them to real numbers. Offers a simple API and UI for call flows, used for scheduling, Q&A bots and hotlines.
- Retell AI - Specializes in phone call automation. Lets you create voice agents that can have real-time conversations over phone lines.
- LiveKit - Open-source platform for real-time audio/video development. It doesn't include AI by itself, but it gives you the live voice infrastructure to build on.
- Twilio Voice + OpenAI + ElevenLabs - A more flexible setup using Twilio for phone/audio input, GPT-4 for responses, and ElevenLabs for speech. Requires coding but gives you full control.
There are also specialized platforms: Deepgram is recommended for highly accurate speech-to-text (STT), while ElevenLabs is popular for realistic text-to-speech (TTS) that generates natural-sounding voices. You can then enable your agent to make and receive phone calls through services like Twilio.
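To give you a feel for the DIY route, here is a hedged sketch of the Twilio + OpenAI stack from the list above, assuming a Node/Express server, the official openai npm package and a Twilio number whose Voice webhook points at POST /voice. It is not production code: add XML-escaping, error handling and call-state tracking before using it for real.

```ts
import express from "express";
import OpenAI from "openai";

const app = express();
app.use(express.urlencoded({ extended: false })); // Twilio posts form-encoded data
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

app.post("/voice", async (req, res) => {
  // <Gather input="speech"> sends the caller's transcribed words as SpeechResult;
  // on the very first webhook hit there is no speech yet, so we just greet.
  const userSpeech: string = req.body.SpeechResult ?? "";
  let reply = "Hi! How can I help you today?";

  if (userSpeech) {
    const completion = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: [
        { role: "system", content: "You are a concise, friendly phone agent." },
        { role: "user", content: userSpeech },
      ],
    });
    reply = completion.choices[0].message.content ?? reply;
  }

  // Respond with TwiML: speak the reply, then listen for the next utterance.
  // Swapping <Say> for <Play> with an ElevenLabs-generated audio file is how
  // you would get a more natural voice out of this same loop.
  res.type("text/xml").send(
    `<Response>
       <Gather input="speech" action="/voice" method="POST">
         <Say>${reply}</Say>
       </Gather>
     </Response>`
  );
});

app.listen(3000);
```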
It totally depends on your use case, but I'm choosing VoiceHub to create the voice agent. It will be using ElevenLabs voices & OpenAI GPT-4o as a model under the hood.
3. A step-by-step guide to build your first voice Agent from scratch.
It's finally time to build a real agent. I will be using VoiceHub because of its fast setup, easy third-party integrations and solid support for the MENA region.
I went through the official docs, tested everything hands-on and documented the key steps, so you don't have to get stuck in jargon.
Step 1: Sign up for the dashboard
Sign up and visit the dashboard at voicehub.dataqueue.ai.
Hereβs what the dashboard looks like. We will walk through each part as we go.
Click on the + New Agent button to start. You can either create a blank agent or use an existing template. I'm using an existing template to make things easier to follow.
Once created, you will land inside the agent's workspace.
There are several useful tabs here including Call logs, Phonebook, Analytics, Provider keys and more.
The most useful is the knowledge base (RAG), which helps in the intelligent retrieval of information during conversations to provide accurate responses.
If you switch to the configuration option in the sidebar, you will see everything you can tweak. You should change the language to English (the default is Arabic). Here is a brief overview of the tabs:
- Models: choose your STT and LLM providers
- Voices: choose how your agent sounds. You can test voices by typing something and hearing it out loud
- Pathway: build your agent's logic visually or with a global prompt
- VoIP: assign phone numbers for your agent to receive or make calls
- Analysis: decide how to tag calls, track how they went and check sentiment
- Widget: add a voice chat interface to your website and customize how it looks
- White Labeling: set up your own branding, logo and custom domain for your team
As you can see on the top right, you can toggle between two modes:
a) DataQueue Mode (Optimized stack for MENA)
- default mode when creating a new agent
- uses DataQueue's fully optimized models for speech recognition (STT), conversation logic (LLM) and speech synthesis (TTS).
- designed for MENA use cases where accuracy, latency, sentiment detection matter most.
- voice setup and tuning are handled via the DQ Configs tab.
Note: you cannot manually override model providers in this mode; that's only possible in Custom Mode, covered next.
b) Custom Mode (Provider Flexibility)
Custom Mode gives you full flexibility in model selection and configuration. Here are the supported providers:
- STT Providers: Google, Deepgram, Gladia, Speechmatics, Azure
- TTS Providers: ElevenLabs, Deepgram, LMNT, Cartesia, Rime AI, Azure, OpenAI, Google
- LLM Providers: OpenAI, Groq, Claude (Anthropic), Cohere, DeepSeek, Ollama, Grok
Make sure to switch the language to en-US in the STT settings under the Models tab.
You can also perform side-by-side benchmarking to compare different setups and identify the optimal configuration for your specific use case.
There are thousands of voices available across different accents and tones. Third-party integrations (TTS providers) are also easy to set up.
The customization options are huge, perfect for developers/teams who want full control over voice, prompts and model selection.
Step 2: Building the logic
They provide two approaches for defining how your agent behaves:
a) Global Prompt
A single prompt that guides the agent's entire behavior (similar to system prompts in traditional LLM apps). Use this when your agent only needs to answer general questions or operate reactively.
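To make that concrete, a global prompt can be just a few plain-English lines. Here is a hypothetical example (Acme Dental is made up, not a VoiceHub template):

```text
You are the phone receptionist for Acme Dental.
Greet the caller, answer questions about opening hours and services,
and offer to book an appointment when relevant.
Keep every answer under two sentences. If you are unsure about
something, say so and offer to transfer the caller to a human.
```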
b) Conversational Pathway
It's a visual drag-and-drop builder to define complex flows, variables and decision logic using connected nodes. This is what I will be using (seems easier).
I recommend using this when:
- You need branching logic (such as verification → escalate → book → end)
- You want to extract variables (date, location, etc.)
- You want fine control over what the agent says and when
Yes, it's possible to combine them. You can start with a global prompt and add the conversation flow later or build the flow first and use the global prompt as a backup.
You can add different nodes to the pathway, each serving a specific purpose.
For instance, the default node speaks a message and waits for a reply, while the End Call node simply ends the conversation.
Click a node to open customization options. You can define specific behaviors, conditions or plug in knowledge base lookups.
Step 3: Testing it out
You can test your agent with Start Test Call or Start Test Chat. You just need to grant microphone permissions on the website and the assistant will respond based on your flow.
This sample had only two nodes so the agent replies once and ends the call.
You can also perform QA Testing to simulate conversational scenarios and evaluate how well an agent actually behaves. It produces pass/fail results per test scenario and lets you identify weaknesses before deploying to production.
Example test case: "Hi, I want to schedule a new appointment next Monday at 3 pm."
Result:
- ✅ PASS: Agent confirms the correct date/time with the proper tone
- ❌ FAIL: Agent ignores the time or responds vaguely
You will also get full call logs to analyze past conversations.
Step 4: Deployment options
I think we all can agree that local testing is easier. Just build the workflow, test it out and voila. But what's the purpose if we cannot take it to real users?
VoiceHub makes this super easy. Navigate to Configuration > Widget and you will get a unique embed code for your website, along with options to customize the look, position and welcome message.
It will look something like this.
```html
<dq-voice agent-id="your-agent-id" env="https://voicehub.dataqueue.ai/"></dq-voice>
<script src="https://voicehub.dataqueue.ai/DqVoiceWidget.js"></script>
```
I tried it on my Next.js portfolio website and it worked properly. If you directly place it just before the closing `</body>` tag, you will probably get an error: `Property 'dq-voice' does not exist on type 'JSX.IntrinsicElements'`.
To fix that, follow these steps:
a) `<dq-voice>` is a custom element, so for TypeScript/React to treat it as a valid JSX tag, we need to add this to a new declaration file (`src/types/custom-elements.d.ts`):
```ts
declare namespace JSX {
  interface IntrinsicElements {
    'dq-voice': React.DetailedHTMLProps<
      React.HTMLAttributes<HTMLElement>,
      HTMLElement
    >
  }
}
```
b) In your project's root `tsconfig.json`, add the declaration file to the `include` array so TypeScript loads it and no longer complains about the unknown tag:
"include": [
"src/types/custom-elements.d.ts",
"next-env.d.ts",
Β Β "**/*.ts",
Β Β "**/*.tsx",
".next/types/**/*.ts",
],
c) Now just insert the widget into your Next.js `layout.tsx`. The `<dq-voice>` element mounts the custom widget, and the Next.js `Script` component ensures the widget script loads after the page is interactive (safe and efficient).
<dq-voice agent-id="id"></dq-voice>
<Script
Β Β src="https://voicehub.dataqueue.ai/DqVoiceWidget.js"
Β Β strategy="afterInteractive"
/>
As soon as you run the server, the widget will be live and a pop-up will ask for the microphone permission.
You can have a usual conversation based on your workflow and it will listen/respond accordingly.
And just like that, I built and deployed my first Voice AI Agent!
You can also deploy on their cloud, on your own private infrastructure, or use a hybrid deployment (which optimizes infrastructure and can reduce server and GPU costs by up to 90%).
I tried some more advanced flows too but covering it would have made things a little bit confusing so I decided to leave it out.
I cannot cover all the things so I recommend checking the official docs and trying it out yourself. If you are still wondering about real-world use cases, check out the next section.
4. Practical use cases with examples.
Once you are familiar with Voice AI Agents, it's easy to see how powerful they can be (especially when used to automate workflows).
Here are a few real-world use cases that show what's possible:
✅ AI agent deployed in an international airport to support passengers with disabilities
This is the use case that will blow your mind. The DataQueue team officially deployed the VoiceHub AI agent at Queen Alia International Airport in Amman, Jordan.
The agent is designed to support passengers with disabilities, ensuring they receive the assistance they need within 5 minutes.
Here is the demo video!
They are rolling out similar projects across airports in MENA and Europe, creating a positive impact at airports by handling customer support, accessibility and real-time responsiveness.
✅ Auto call Agent for Internal Status Checks (Engineering & Ops Teams)
Startups usually move fast, and teams (infra, DevOps, logistics ops) often need to stay updated on ongoing issues, service status or even deployment logs.
Instead of Slack reminders or waiting for someone to check dashboards, a voice agent can actively call team members, summarize the current situation and log any updates or confirmations.
The flow can look something like this:
- A cron job triggers every 2 hours.
- The agent calls the on-call engineer with a status update: "Hi, just checking in. The latest deployment finished with 2 minor warnings. Do you want me to notify QA or hold off?"
- The engineer responds with "Hold off until we patch" → the agent logs the response to Jira or an internal dashboard via API.
- If there is no answer → it falls back to SMS or call escalation.
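As a rough idea of the logging step, here is a hypothetical sketch that posts the engineer's answer as a Jira comment via the Jira Cloud REST API (v2); the issue key, domain and credentials are placeholders:

```ts
// Hypothetical sketch: log a voice-agent response as a Jira comment.
// JIRA_DOMAIN, JIRA_EMAIL, JIRA_API_TOKEN and the issue key are placeholders.
async function logToJira(issueKey: string, engineerResponse: string) {
  const auth = Buffer.from(
    `${process.env.JIRA_EMAIL}:${process.env.JIRA_API_TOKEN}`
  ).toString("base64");

  // Jira Cloud REST API v2 accepts a plain-string comment body
  await fetch(
    `https://${process.env.JIRA_DOMAIN}/rest/api/2/issue/${issueKey}/comment`,
    {
      method: "POST",
      headers: {
        Authorization: `Basic ${auth}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        body: `Voice agent check-in: "${engineerResponse}"`,
      }),
    }
  );
}

// e.g. logToJira("OPS-123", "Hold off until we patch");
```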
✅ Voice Agent to make cold Email more personal (via Call)
This is a very exciting workflow for sales teams that want to warm up cold emails before sending a pitch.
Instead of sending a generic email blast, the agent calls the lead, confirms if they are open to receiving information and gets some light qualifying data without a human SDR involved.
The flow can look something like this:
- Lead data is pulled from the CRM.
- The voice agent calls: "Hey, I'm helping xyz company learn more about founders in the fintech space. Just a quick 1-minute call, are you still working on xyz?"
- It captures 2-3 data points (interest level, industry fit, team size) using variable extraction.
- If the response is positive → the lead is marked warm → a tailored intro email is generated and sent via a marketing tool.
The result is a more personal, context-aware email that's far more likely to convert.
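The variable-extraction step is the interesting part. As a hedged sketch (not VoiceHub's internal implementation), here is how you could pull those data points out of a call transcript with the OpenAI SDK's JSON mode:

```ts
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Hypothetical helper: extract qualifying data points from a call transcript.
async function extractLeadData(transcript: string) {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    response_format: { type: "json_object" }, // force valid JSON output
    messages: [
      {
        role: "system",
        content:
          "Extract interest_level (low/medium/high), industry_fit (boolean) " +
          "and team_size (number or null) from the call transcript. " +
          "Respond with a JSON object using exactly those keys.",
      },
      { role: "user", content: transcript },
    ],
  });
  return JSON.parse(completion.choices[0].message.content ?? "{}");
}
```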
I used to think building voice AI agents required a lot of custom engineering, but it's accessible with the right set of tools.
This was just a basic version so there's still a lot more to explore.
If you have any questions, feedback or end up building something cool, please share in the comments.
Have a great day! Until next time :)
You can check my work at anmolbaranwal.com. Thank you for reading!
Top comments (32)
Incredible as usual.
I gotta try this out soon.
Thank you
Appreciate you reading this Divya. You can make one for free with the credits you get. I'm checking out a few other platforms as well, may write about it soon :)
I am planning on it.
Thank you for this article.
Looking forward to those articles, if you write them .
This is amazing Anmol! Thanks for sharing!
thanks for reading Ndeye! If you end up creating an agent, you should definitely write about it.
Great job! This is super helpful!
means a lot. thanks for reading!
This is amazing brother!!
Appreciate you saying that Ali. I'm also looking into 11ai (recently launched by ElevenLabs).
Can it be integrated with n8n?
I don't think there is a direct integration with n8n (as per the docs) but we might be able to do it indirectly using a webhook.
Please can you try and let me know if it's possible?
Love how clearly you broke everything down too...inspiring stuff
Feels great to hear that Parag! I spent a lot of time learning everything (since this was my first time too) and tried my best to explain the stuff.
Amazing, can it take info from call and push into my db ?
Yeah, you can use the Webhook node to collect the data and send it to your endpoint. Then parse the JSON payload & use any ORM to write data in your DB.
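A rough sketch of that (the endpoint, payload fields and Prisma model here are assumptions — check VoiceHub's docs for the actual webhook shape):

```ts
import express from "express";
import { PrismaClient } from "@prisma/client";

const app = express();
app.use(express.json());
const prisma = new PrismaClient();

// Hypothetical endpoint receiving the Webhook node's JSON payload.
// The payload shape and the `callLog` model are assumptions.
app.post("/voicehub-webhook", async (req, res) => {
  const { callId, transcript, extractedVars } = req.body;
  await prisma.callLog.create({
    data: { callId, transcript, vars: JSON.stringify(extractedVars ?? {}) },
  });
  res.sendStatus(200);
});

app.listen(4000);
```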
I am a beginner, without any prior knowledge can I complete this???
I didn't have much knowledge in this space which is why I wrote this.. so others can understand the fundamental concepts. After reading this, I'm sure you can do it easily. And if you want to build something more advanced, just refer to the docs.
thanksss
good article, love the airport video!
thanks. I'm so happy you noticed that :D
It was a shorts video so it wasn't embedding properly which is probably why most people missed it.
Love it! Have to try this at some point. Tried something similar 3 years ago, but the technology just wasn't there to make it useful.
Yeah, I was researching and found some crazy useful platforms out there. Some are a bit technical, others are easier so the barrier to entry in this space is dropping really fast.