I have been experimenting with AI agents for a while now but this time, I wanted to build a Voice AI Agent. I won't lie, it does feel intimidating (if you have never built one before).
Voice AI agents are becoming more common these days, so I took the chance to learn the core components and principles and understand how everything fits together.
This post covers everything I picked up: the building blocks of Voice AI Agents and a step-by-step guide to building, testing and deploying one. I have also listed popular platforms along with some real-world use cases to learn from.
If you have been curious about how voice agents work (or want to build your own), this might help you get started. It took me just 30 minutes to deploy the agent to my portfolio.
What is covered?
In summary, we are going to cover these topics in detail.
- What is a Voice AI Agent?
- Popular tools and platforms for building one.
- A step-by-step guide to build your first voice Agent from scratch.
- Practical use cases with examples.
1. What is a Voice AI Agent?
You might already be familiar with AI Agents which are computer programs that can understand tasks, think for themselves and take actions on their own.
Voice AI Agents take a step further by combining speech and reasoning capabilities.
They are autonomous systems that listen to your voice, understand what you are saying (using speech-to-text), generate a response using Large Language Models (LLMs) like GPT-4 and speak the answer back to you using a synthetic voice (text-to-speech).
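Conceptually, the loop is simple. Here is a minimal sketch of that STT → LLM → TTS pipeline in TypeScript; the function types are hypothetical stand-ins for whichever providers you pick:

```ts
// A hypothetical sketch of the core loop. Each function would wrap a real
// provider (e.g. Deepgram for STT, GPT-4 for the LLM, ElevenLabs for TTS).
type STT = (audio: Buffer) => Promise<string>;
type LLM = (text: string) => Promise<string>;
type TTS = (text: string) => Promise<Buffer>;

async function handleTurn(
  audioIn: Buffer,
  transcribe: STT,
  think: LLM,
  synthesize: TTS
): Promise<Buffer> {
  const userText = await transcribe(audioIn);  // 1. speech-to-text
  const replyText = await think(userText);     // 2. LLM decides what to say
  return synthesize(replyText);                // 3. text-to-speech, played back
}
```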
There are mainly two types:
- Inbound agents: answer calls and respond when someone reaches out.
- Outbound agents: proactively make calls to deliver messages, reminders or carry out tasks.
Unlike traditional virtual assistants (such as Siri), Voice AI Agents can perform multi-step, complex tasks such as:
- answering customer phone calls
- initiating outbound campaigns
- providing support via a voice widget on your site
- speaking English, Arabic or any other supported language
If you are curious about the tech behind these agents, here are two great reads:
What exactly is an AI Voice Agent? by Deepgram
How do Voice AI Agents work? by Picovoice
The best part is that you don't need to be an expert to build one. There are many tools like VoiceHub by DataQueue and Retell AI that make it super easy to create and test a working voice agent in just a few minutes.
Let's explore some of those platforms next.
2. Popular tools and platforms for building one.
If you want to build your own Voice AI Agent, there are several frameworks and tools available to help you get started. Picking the right one depends on:
Language and regional support. For example, many platforms don't handle Arabic (MENA) or Indian English well.
Whether you are comfortable writing code or prefer no-code platforms.
Whether you want custom setups (more control, takes longer) or something faster.
Whether your focus is on mobile-based agents or web experiences.
I believe choosing the right platform is less about what's "best" and more about what fits your use case.
Here are the most popular platforms:
- VoiceHub by DataQueue - Easiest way to build voice agents without writing code. It connects LLMs to phone calls, lets you define workflows and deploy quickly. Bonus: MENA region support is solid (unlike many others). I will be using this one in the next section.
- Rime - lets you build conversational AI apps with both voice and text. Good for more advanced voice flows, supports integrations and has a polished UI.
- Vapi - Build phone-based voice agents using LLMs and connect them to real numbers. Offers a simple API and UI for call flows, used for scheduling, Q&A bots and hotlines.
- Retell AI - Specializes in phone call automation. Lets you create voice agents that can have real-time conversations over phone lines.
- LiveKit - Open-source platform for real-time audio/video development. It doesn't include AI by itself, but it gives you the live voice infrastructure to build on.
- Twilio Voice + OpenAI + ElevenLabs - A more flexible setup using Twilio for phone/audio input, GPT-4 for responses, and ElevenLabs for speech. Requires coding but gives you full control.
There are also specialized platforms: Deepgram is recommended for highly accurate speech-to-text (STT), while ElevenLabs is popular for realistic text-to-speech (TTS) that generates natural-sounding voices. You can then enable your agent to make and receive phone calls through services like Twilio.
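To give you a feel for the DIY route, here is a hedged sketch of the Twilio + OpenAI stack from the list above, assuming a Node/Express server, the official openai npm package and a Twilio number whose Voice webhook points at POST /voice. It is not production code: add XML-escaping, error handling and call-state tracking before using it for real.

```ts
import express from "express";
import OpenAI from "openai";

const app = express();
app.use(express.urlencoded({ extended: false })); // Twilio posts form-encoded data
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

app.post("/voice", async (req, res) => {
  // <Gather input="speech"> sends the caller's transcribed words as SpeechResult;
  // on the very first webhook hit there is no speech yet, so we just greet.
  const userSpeech: string = req.body.SpeechResult ?? "";
  let reply = "Hi! How can I help you today?";

  if (userSpeech) {
    const completion = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: [
        { role: "system", content: "You are a concise, friendly phone agent." },
        { role: "user", content: userSpeech },
      ],
    });
    reply = completion.choices[0].message.content ?? reply;
  }

  // Respond with TwiML: speak the reply, then listen for the next utterance.
  // Swapping <Say> for <Play> with an ElevenLabs-generated audio file is how
  // you would get a more natural voice out of this same loop.
  res.type("text/xml").send(
    `<Response>
       <Gather input="speech" action="/voice" method="POST">
         <Say>${reply}</Say>
       </Gather>
     </Response>`
  );
});

app.listen(3000);
```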
It totally depends on your use case, but I'm choosing VoiceHub to create the voice agent. It will be using ElevenLabs voices & OpenAI GPT-4o as a model under the hood.
3. A step-by-step guide to build your first voice Agent from scratch.
It's finally time to build a real agent. I will be using VoiceHub because of its fast setup, easy third-party integrations and solid support for the MENA region.
I went through the official docs, tested everything hands-on and documented the key steps, so you don't have to get stuck in jargon.
Step 1: Sign up for the dashboard
Sign up and visit the dashboard at voicehub.dataqueue.ai.
Hereβs what the dashboard looks like. We will walk through each part as we go.
Click on the + New Agent button to start. You can either create a blank agent or use an existing template. I'm using an existing template to make things easier to follow.
Once created, you will land inside the agent's workspace.
There are several useful tabs here including Call logs, Phonebook, Analytics, Provider keys and more.
The most useful is the knowledge base (RAG), which helps in the intelligent retrieval of information during conversations to provide accurate responses.
If you switch to the configuration option in the sidebar, you will see everything you can tweak. You should change the language to English (the default is Arabic). Here is a brief overview of the tabs:
- Models: choose your STT and LLM providers
- Voices: choose how your agent sounds. You can test voices by typing something and hearing it out loud
- Pathway: build your agent's logic visually or with a global prompt
- VoIP: assign phone numbers for your agent to receive or make calls
- Analysis: decide how to tag calls, track how they went and check sentiment
- Widget: add a voice chat interface to your website and customize how it looks
- White Labeling: set up your own branding, logo and custom domain for your team
As you can see on the top right, you can toggle between two modes:
a) DataQueue Mode (Optimized stack for MENA)
- default mode when creating a new agent
- uses DataQueue's fully optimized models for speech recognition (STT), conversation logic (LLM) and speech synthesis (TTS).
- designed for MENA use cases where accuracy, latency, sentiment detection matter most.
- voice setup and tuning are handled via the DQ Configs tab.
Note: you cannot manually override model providers in this mode; that's only possible in Custom Mode, covered next.
b) Custom Mode (Provider Flexibility)
Custom Mode gives you full flexibility in model selection and configuration. Here are the supported providers:
- STT Providers: Google, Deepgram, Gladia, Speechmatics, Azure
- TTS Providers: ElevenLabs, Deepgram, LMNT, Cartesia, Rime AI, Azure, OpenAI, Google
- LLM Providers: OpenAI, Groq, Claude (Anthropic), Cohere, DeepSeek, Ollama, Grok
Make sure to switch the language to en-US in the STT settings under the Models tab.
You can also perform side-by-side benchmarking to compare different setups and identify the optimal configuration for your specific use case.
There are thousands of voices available across different accents and tones. Third-party integrations (TTS providers) are also easy to set up.
The customization options are huge, perfect for developers/teams who want full control over voice, prompts and model selection.
Step 2: Building the logic
They provide two approaches for defining how your agent behaves:
a) Global Prompt
A single prompt that guides the agent's entire behavior (similar to system prompts in traditional LLM apps). Use this when your agent only needs to answer general questions or operate reactively.
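To make that concrete, a global prompt can be just a few plain-English lines. Here is a hypothetical example (Acme Dental is made up, not a VoiceHub template):

```text
You are the phone receptionist for Acme Dental.
Greet the caller, answer questions about opening hours and services,
and offer to book an appointment when relevant.
Keep every answer under two sentences. If you are unsure about
something, say so and offer to transfer the caller to a human.
```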
b) Conversational Pathway
It's a visual drag-and-drop builder to define complex flows, variables and decision logic using connected nodes. This is what I will be using (seems easier).
I recommend using this when:
- You need branching logic (such as verification → escalate → book → end)
- You want to extract variables (date, location, etc.)
- You want fine control over what the agent says and when
Yes, it's possible to combine them. You can start with a global prompt and add the conversation flow later or build the flow first and use the global prompt as a backup.
You can add different nodes to the pathway, each serving a specific purpose.
For instance, the default node speaks a message and waits for a reply, while the End Call node simply ends the conversation.
Click a node to open customization options. You can define specific behaviors, conditions or plug in knowledge base lookups.
Step 3: Testing it out
You can test your agent with Start Test Call or Start Test Chat. You just need to grant microphone permissions on the website and the assistant will respond based on your flow.
This sample had only two nodes so the agent replies once and ends the call.
You can also perform QA Testing to simulate conversational scenarios and evaluate how well an agent actually behaves. It produces pass/fail results per test scenario and lets you identify weaknesses before deploying to production.
Example test case: "Hi, I want to schedule a new appointment next Monday at 3 pm."
Result:
- ✅ PASS: Agent confirms the correct date/time with the proper tone
- ❌ FAIL: Agent ignores the time or responds vaguely
You will also get full call logs to analyze past conversations.
Step 4: Deployment options
I think we all can agree that local testing is easier. Just build the workflow, test it out and voila. But what's the purpose if we cannot take it to real users?
VoiceHub makes this super easy. Navigate to Configuration > Widget and you will get a unique embed code for your website, along with options to customize the look, position and welcome message.
It will look something like this.
```html
<dq-voice agent-id="your-agent-id" env="https://voicehub.dataqueue.ai/"></dq-voice>
<script src="https://voicehub.dataqueue.ai/DqVoiceWidget.js"></script>
```
I tried it on my Next.js portfolio website and it worked properly. If you directly place it just before the closing `</body>` tag, you will probably get an error: `Property 'dq-voice' does not exist on type 'JSX.IntrinsicElements'`.
To fix that, follow these steps:
a) `<dq-voice>` is a custom element, so for TypeScript/React to treat it as a valid JSX tag, we need to add this to a new declaration file (`src/types/custom-elements.d.ts`):
```ts
declare namespace JSX {
  interface IntrinsicElements {
    'dq-voice': React.DetailedHTMLProps<
      React.HTMLAttributes<HTMLElement>,
      HTMLElement
    >
  }
}
```
b) In your project's root `tsconfig.json`, add the declaration file to the `include` array so TypeScript loads it and no longer complains about the unknown tag:
"include": [
"src/types/custom-elements.d.ts",
"next-env.d.ts",
Β Β "**/*.ts",
Β Β "**/*.tsx",
".next/types/**/*.ts",
],
c) Now just insert the widget into your Next.js `layout.tsx`. The `<dq-voice>` element mounts the custom widget, and the Next.js `Script` component ensures the widget script loads after the page is interactive (safe and efficient).
<dq-voice agent-id="id"></dq-voice>
<Script
Β Β src="https://voicehub.dataqueue.ai/DqVoiceWidget.js"
Β Β strategy="afterInteractive"
/>
As soon as you run the server, the widget will be live and a pop-up will ask for the microphone permission.
You can have a usual conversation based on your workflow and it will listen/respond accordingly.
And just like that, I built and deployed my first Voice AI Agent!
You can also deploy on their cloud, on your own private infrastructure, or use a hybrid deployment (which optimizes infrastructure and can reduce server and GPU costs by up to 90%).
I tried some more advanced flows too but covering it would have made things a little bit confusing so I decided to leave it out.
I cannot cover all the things so I recommend checking the official docs and trying it out yourself. If you are still wondering about real-world use cases, check out the next section.
4. Practical use cases with examples.
Once you are familiar with Voice AI Agents, it's easy to see how powerful they can be (especially when used to automate workflows).
Here are a few real-world use cases that show what's possible:
✅ AI agent deployed in an international airport to support passengers with disabilities
This is the use case that will blow your mind. The DataQueue team officially deployed the VoiceHub AI agent at Queen Alia International Airport in Amman, Jordan.
The agent is designed to support passengers with disabilities, ensuring they receive the assistance they need within 5 minutes.
Here is the demo video!
They are rolling out similar projects across airports in MENA and Europe, creating a positive impact at airports by handling customer support, accessibility and real-time responsiveness.
✅ Auto call Agent for Internal Status Checks (Engineering & Ops Teams)
Startups usually move fast, and teams (infra, DevOps, logistics ops) often need to stay updated on ongoing issues, service status or even deployment logs.
Instead of Slack reminders or waiting for someone to check dashboards, a voice agent can actively call team members, summarize the current situation and log any updates or confirmations.
The flow can look something like this:
- A cron job triggers every 2 hours.
- The agent calls the on-call engineer with a status update: "Hi, just checking in. The latest deployment finished with 2 minor warnings. Do you want me to notify QA or hold off?"
- The engineer responds with "Hold off until we patch" → the agent logs the response to Jira or an internal dashboard via API.
- If there is no answer → it falls back to SMS or call escalation.
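As a rough idea of the logging step, here is a hypothetical sketch that posts the engineer's answer as a Jira comment via the Jira Cloud REST API (v2); the issue key, domain and credentials are placeholders:

```ts
// Hypothetical sketch: log a voice-agent response as a Jira comment.
// JIRA_DOMAIN, JIRA_EMAIL, JIRA_API_TOKEN and the issue key are placeholders.
async function logToJira(issueKey: string, engineerResponse: string) {
  const auth = Buffer.from(
    `${process.env.JIRA_EMAIL}:${process.env.JIRA_API_TOKEN}`
  ).toString("base64");

  // Jira Cloud REST API v2 accepts a plain-string comment body
  await fetch(
    `https://${process.env.JIRA_DOMAIN}/rest/api/2/issue/${issueKey}/comment`,
    {
      method: "POST",
      headers: {
        Authorization: `Basic ${auth}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        body: `Voice agent check-in: "${engineerResponse}"`,
      }),
    }
  );
}

// e.g. logToJira("OPS-123", "Hold off until we patch");
```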
✅ Voice Agent to make cold Email more personal (via Call)
This is a very exciting workflow for sales teams that want to warm up cold emails before sending a pitch.
Instead of sending a generic email blast, the agent calls the lead, confirms if they are open to receiving information and gets some light qualifying data without a human SDR involved.
The flow can look something like this:
- Lead data is pulled from the CRM.
- The voice agent calls: "Hey, I'm helping xyz company learn more about founders in the fintech space. Just a quick 1-minute call, are you still working on xyz?"
- It captures 2-3 data points (interest level, industry fit, team size) using variable extraction.
- If the response is positive → the lead is marked warm → a tailored intro email is generated and sent via a marketing tool.
The result is a more personal, context-aware email that's far more likely to convert.
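The variable-extraction step is the interesting part. As a hedged sketch (not VoiceHub's internal implementation), here is how you could pull those data points out of a call transcript with the OpenAI SDK's JSON mode:

```ts
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Hypothetical helper: extract qualifying data points from a call transcript.
async function extractLeadData(transcript: string) {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    response_format: { type: "json_object" }, // force valid JSON output
    messages: [
      {
        role: "system",
        content:
          "Extract interest_level (low/medium/high), industry_fit (boolean) " +
          "and team_size (number or null) from the call transcript. " +
          "Respond with a JSON object using exactly those keys.",
      },
      { role: "user", content: transcript },
    ],
  });
  return JSON.parse(completion.choices[0].message.content ?? "{}");
}
```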
I used to think building voice AI agents required a lot of custom engineering, but it's accessible with the right set of tools.
This was just a basic version so there's still a lot more to explore.
If you have any questions, feedback or end up building something cool, please share in the comments.
Have a great day! Until next time :)
You can check my work at anmolbaranwal.com. Thank you for reading!
Top comments (32)
Incredible as usual.
I gotta try this out soon.
Thank you
Appreciate you reading this Divya. You can make one for free with the credits you get. I'm checking out a few other platforms as well, may write about it soon :)
I am planning on it.
Thank you for this article.
Looking forward to those articles, if you write them .
This is amazing Anmol! Thanks for sharing!
thanks for reading Ndeye! If you end up creating an agent, you should definitely write about it.
Great job! This is super helpful!
means a lot. thanks for reading!
This is amazing brother!!
Appreciate you saying that Ali. I'm also looking into 11ai (recently launched by ElevenLabs).
Can it be integrated with n8n?
I don't think there is a direct integration with n8n (as per the docs) but we might be able to do it indirectly using a webhook.
Please can you try and let me know if it's possible?
Love how clearly you broke everything down too...inspiring stuff
Feels great to hear that Parag! I spent a lot of time learning everything (since this was my first time too) and tried my best to explain the stuff.
Amazing, can it take info from call and push into my db ?
Yeah, you can use the Webhook node to collect the data and send it to your endpoint. Then parse the JSON payload & use any ORM to write data in your DB.
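A rough sketch of that (the endpoint, payload fields and Prisma model here are assumptions — check VoiceHub's docs for the actual webhook shape):

```ts
import express from "express";
import { PrismaClient } from "@prisma/client";

const app = express();
app.use(express.json());
const prisma = new PrismaClient();

// Hypothetical endpoint receiving the Webhook node's JSON payload.
// The payload shape and the `callLog` model are assumptions.
app.post("/voicehub-webhook", async (req, res) => {
  const { callId, transcript, extractedVars } = req.body;
  await prisma.callLog.create({
    data: { callId, transcript, vars: JSON.stringify(extractedVars ?? {}) },
  });
  res.sendStatus(200);
});

app.listen(4000);
```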
I am a beginner, without any prior knowledge can I complete this???
I didn't have much knowledge in this space which is why I wrote this.. so others can understand the fundamental concepts. After reading this, I'm sure you can do it easily. And if you want to build something more advanced, just refer to the docs.
thanksss
good article, love the airport video!
thanks. I'm so happy you noticed that :D
It was a shorts video so it wasn't embedding properly which is probably why most people missed it.
Love it! Have to try this at some point. Tried something similar 3 years ago, but the technology just wasn't there to make it useful.
Yeah, I was researching and found some crazy useful platforms out there. Some are a bit technical, others are easier so the barrier to entry in this space is dropping really fast.