Chaitrali Kakde

Posted on Dec 3

How to Build an AI WhatsApp Voice Agent with VideoSDK: Step-by-Step Guide

#webdev #programming #ai #tutorial

VideoSDK makes it extremely simple for developers to build real-time conversational AI agents that run over any communication channel including web, mobile, telephony, and now WhatsApp voice calls.

With VideoSDK’s SIP Gateway, you can connect WhatsApp calls directly into your AI agent without managing telephony infrastructure, media servers, SIP stacks, codecs, or real-time streaming pipelines. VideoSDK handles everything end-to-end so you can focus on your conversation logic.

This guide walks you through building a WhatsApp AI Voice Agent powered by VideoSDK, where all call processing, audio streaming, routing, and agent execution happen inside the VideoSDK platform.

What You Can Build With VideoSDK SIP Gateway

Using VideoSDK’s Agent SDK + SIP Gateway, you can build:

AI customer support agents
Appointment-booking assistants
Product recommendation bots
Voice-driven automation
Multi-turn conversational agents
Custom IVR logic, decision trees, or LLM-driven flows

All of these run in real time with millisecond-level audio streaming latency.

How VideoSDK Handles a WhatsApp Voice Call

When a WhatsApp user initiates a call, the VideoSDK platform handles the entire pipeline:

The call is forwarded via SIP from the Meta Business Platform.
VideoSDK SIP Gateway receives the call and negotiates media.
VideoSDK applies your configured Routing Rules.
Your VideoSDK AI Agent is spun up or assigned automatically.
The Agent receives real-time audio and processes it using STT → LLM → TTS.
VideoSDK streams audio back to the caller with ultra-low latency.
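To make the STT → LLM → TTS step concrete, here is a toy model of a single conversational turn. It is only a conceptual sketch: `run_turn` and the three `fake_*` stages are hypothetical stand-ins invented for illustration, not part of the VideoSDK SDK, and a real agent streams audio continuously rather than handling one buffer at a time.

```python
from typing import Callable


def run_turn(
    audio_in: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """Run one caller utterance through speech-to-text, the LLM, and text-to-speech."""
    transcript = stt(audio_in)    # caller audio -> text
    reply_text = llm(transcript)  # text -> reply text
    return tts(reply_text)        # reply text -> audio for the caller


# Hypothetical stand-in stages so the sketch runs without any provider.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = run_turn(b"hello", fake_stt, fake_llm, fake_tts)
```

In the realtime pipeline used later in this guide, a single model covers all three stages, so you never wire them together by hand.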

Prerequisites

To let VideoSDK receive WhatsApp calls, you must configure SIP forwarding on the Meta platform.

This is a one-time setup and requires:

Meta Business Manager
WhatsApp Business Account (WABA)
A verified phone number
Meta Developer App with whatsapp_business_management permission
A permanent user access token

Once SIP forwarding is enabled, VideoSDK becomes the call destination for your WhatsApp number.

Integrating inbound/outbound WhatsApp calls requires updating your number's settings via the Meta Graph API. This guide covers the process in Part 3: Enable WhatsApp SIP Forwarding. For a deeper understanding of the API, refer to the official Meta Graph API overview.

Part 1: Build and Run Your Custom Voice Agent

Step 1: Project Setup

Create a dedicated directory for your AI agent project and add the following files:

your-agent/
 ├── .env                  # Stores your API keys
 ├── requirements.txt      # Lists Python dependencies
 └── main.py               # Your agent logic

This structure keeps your configuration clean and your code easy to manage as the agent grows.

Step 2: Add Credentials & Dependencies

1. Add Credentials

Inside your .env file, add your API keys:

VIDEOSDK_AUTH_TOKEN="your_videosdk_token_here"
GOOGLE_API_KEY="your_google_api_key_here"

VideoSDK Auth Token: Get it from your VideoSDK dashboard.
Google API Key: Required for Gemini Realtime STT/LLM/TTS (if using the Google plugin).
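A missing key usually only shows up once a call comes in, so it can help to check for both values at startup. The helper below is a minimal sketch using only the standard library; `missing_credentials` is a hypothetical name, not something provided by VideoSDK.

```python
import os

# The two keys this guide stores in .env.
REQUIRED_KEYS = ("VIDEOSDK_AUTH_TOKEN", "GOOGLE_API_KEY")


def missing_credentials(env=os.environ):
    """Return the names of required keys that are unset or blank."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
```

You could call it in main.py right after `load_dotenv()` and exit with a clear message if the returned list is not empty.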

2. Install Dependencies

Add the required dependencies to requirements.txt:

videosdk-agents==0.0.45
videosdk-plugins-google==0.0.45
python-dotenv==1.1.1

Step 3: Create Your AI Agent Logic

The code below uses the realtime pipeline. If you want to configure the STT, LLM, and TTS providers separately, use a cascading pipeline instead of the realtime pipeline. Add the following to main.py:

import asyncio, os, traceback, logging
from videosdk.agents import Agent, AgentSession, RealTimePipeline, JobContext, WorkerJob, Options
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
from dotenv import load_dotenv

logging.basicConfig(level=logging.INFO)
load_dotenv()

# Define the agent's behavior and personality
class MyWhatsappAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a friendly and helpful assistant answering WhatsApp calls. Keep your responses concise and clear.",
        )
    async def on_enter(self) -> None:
        await self.session.say("Hello! You've reached the VideoSDK assistant. How can I help you today?")
    async def on_exit(self) -> None:
        await self.session.say("Thank you for calling. Goodbye!")

async def start_session(context: JobContext):

    model = GeminiRealtime(
        model="gemini-2.0-flash-live-001",
        api_key=os.getenv("GOOGLE_API_KEY"),
        config=GeminiLiveConfig(voice="Leda", response_modalities=["AUDIO"])
    )

    pipeline = RealTimePipeline(model=model)
    session = AgentSession(agent=MyWhatsappAgent(), pipeline=pipeline)

    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()

if __name__ == "__main__":
    try:
        options = Options(
            agent_id="agent1",  # CRITICAL: Unique ID for routing
            register=True,      # REQUIRED: Register with VideoSDK for telephony
            max_processes=10,
        )
        job = WorkerJob(entrypoint=start_session, options=options)
        job.start()
    except Exception:
        traceback.print_exc()

Step 4: Run the Agent

# Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install packages
pip install -r requirements.txt

# Run the agent
python main.py

Part 2: Configure VideoSDK Gateways and Routing

1. Configure an Inbound Gateway

Purchase a Number and Create a SIP Trunk in Twilio

Log in to your Twilio Console.
Purchase a phone number if you don't already have one.
Create a new SIP Trunk in the Twilio Voice section.

Configure Inbound Gateway in VideoSDK

Open the VideoSDK Dashboard.
Go to Telephony > Inbound Gateway.

Click Add Gateway and enter your Twilio number to create an inbound gateway.

After creation, you will see an Inbound Gateway URI (e.g., sip:your-org-id.sip.videosdk.live). Copy this URI.

Configure Twilio SIP Trunk Origination

In your Twilio SIP Trunk, go to the Origination section.
Add the copied Inbound Gateway URI as the Origination target.
Save your changes.

2. Configure an Outbound Gateway

Configure Twilio SIP Trunk Termination

In your Twilio SIP Trunk, go to the Termination section.
Set up the Termination SIP URI (the address VideoSDK will use for outbound calls).
Add allowed IP addresses and set up authentication credentials (username and password) for the trunk.

Configure Outbound Gateway in VideoSDK

In the VideoSDK Dashboard, go to Telephony > Outbound Gateway.
Click Add Gateway and enter the Twilio Termination URI and authentication credentials.

Save the gateway.

3. Add Routing Rules

Go to Telephony > Routing Rules and click Add.

Configure the rule:
- Gateway: Select the inbound or outbound gateway you just created.
- Numbers: Add the phone number associated with the gateway.
- Dispatch: Choose Agent.
- Agent Type: Set to Self Hosted.
- Agent ID: Enter agent1. This must match the agent_id in your main.py file.
Click Create to save the rule.
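Why the Agent ID has to match is easier to see with a small model of the lookup. This is a conceptual sketch only: `RoutingRule` and `resolve_agent` are hypothetical names, and the logic is an assumption about how matching behaves, not VideoSDK's actual implementation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class RoutingRule:
    number: str    # phone number attached to the gateway
    agent_id: str  # Agent ID entered in the routing rule


def resolve_agent(
    rules: Iterable[RoutingRule],
    registered_agent_ids: Set[str],
    dialed_number: str,
) -> Optional[str]:
    """Return the agent_id that should take the call, or None if nothing matches."""
    for rule in rules:
        if rule.number == dialed_number and rule.agent_id in registered_agent_ids:
            return rule.agent_id
    return None


rules = [RoutingRule(number="+15550100", agent_id="agent1")]
matched = resolve_agent(rules, {"agent1"}, "+15550100")
unmatched = resolve_agent(rules, {"some-other-id"}, "+15550100")
```

Here `matched` is the agent ID and `unmatched` is `None`. In this model, a rule whose Agent ID differs from the `agent_id` passed to `Options` in main.py finds no registered agent to take the call.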

Part 3: Enable WhatsApp SIP Forwarding

Now, we'll instruct Meta to forward incoming WhatsApp calls to your VideoSDK Inbound Gateway. This is done via the Meta Graph API.

Step 1: API Request

Use the following curl command to update your WhatsApp phone number's settings:

curl --location 'https://graph.facebook.com/v19.0/{{phone_number_id}}/settings' \
--header 'Authorization: Bearer {{access_token}}' \
--header 'Content-Type: application/json' \
--data '{ "calling": { "status": "ENABLED", "sip": { "status": "ENABLED", "servers": [ { "hostname": "9WXXXXXXX.sip.videosdk.live" } ] }, "srtp_key_exchange_protocol": "DTLS" } }'

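Replace the placeholders:

{{phone_number_id}}: Your WhatsApp Business Phone Number ID from the Meta dashboard.
{{access_token}}: A valid User or System User access token with whatsapp_business_management permission.
9WXXXXXXX.sip.videosdk.live: The hostname of your VideoSDK Inbound Gateway, that is, the Inbound Gateway URI from Part 2 without the sip: prefix.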
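If you would rather script this step, the same request can be built in Python with the standard library. This is a minimal sketch that mirrors the curl command above. The function names are hypothetical, and it assumes the hostname Meta expects is the Inbound Gateway URI with the `sip:` prefix removed.

```python
import json
import urllib.request

GRAPH_API_BASE = "https://graph.facebook.com/v19.0"


def hostname_from_gateway_uri(uri: str) -> str:
    """Strip a leading 'sip:' from an Inbound Gateway URI, if present."""
    return uri[len("sip:"):] if uri.startswith("sip:") else uri


def build_sip_settings_request(phone_number_id, access_token, sip_hostname):
    """Build (but do not send) the POST request that enables SIP forwarding."""
    payload = {
        "calling": {
            "status": "ENABLED",
            "sip": {
                "status": "ENABLED",
                "servers": [{"hostname": sip_hostname}],
            },
            "srtp_key_exchange_protocol": "DTLS",
        }
    }
    return urllib.request.Request(
        url=f"{GRAPH_API_BASE}/{phone_number_id}/settings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen()` sends it. Building the request separately lets you inspect the URL and body before anything reaches Meta.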

Time to Talk! Test Your Agent

Keep Your Agent Running

Make sure your main.py script is still running locally before making or receiving calls. The agent must be active to handle any communication.

Receive an Inbound Call

Ensure your main.py script is still running locally.
Using a different WhatsApp account, place a voice call to your WhatsApp Business number.
Your local agent will answer, and you'll hear its greeting. Start a conversation!

Make an Outbound Call

To have your agent initiate a call to a WhatsApp number, use the VideoSDK SIP Call API.

curl --request POST \
--url https://api.videosdk.live/v2/sip/call \
--header 'Authorization: YOUR_VIDEOSDK_TOKEN' \
--header 'Content-Type: application/json' \
--data '{ "gatewayId": "your_outbound_gateway_id", "sipCallTo": "whatsapp_number_to_call" }'

This commands your agent to dial out through your configured outbound gateway.
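The same outbound call can be triggered from Python. This is a minimal sketch of the curl request above using only the standard library; `build_outbound_call_request` is a hypothetical helper name, and the payload fields are the ones shown in the curl command.

```python
import json
import urllib.request

SIP_CALL_URL = "https://api.videosdk.live/v2/sip/call"


def build_outbound_call_request(videosdk_token, gateway_id, number_to_call):
    """Build (but do not send) the POST request that starts an outbound call."""
    payload = {"gatewayId": gateway_id, "sipCallTo": number_to_call}
    return urllib.request.Request(
        url=SIP_CALL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": videosdk_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Send it with `urllib.request.urlopen(request)` once your agent is running and the outbound gateway is saved.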

You’ve now seen how to build an AI-powered WhatsApp Voice Agent using VideoSDK, from setting up your Python agent locally to connecting it with real WhatsApp phone numbers through VideoSDK’s SIP Gateway. With the Realtime Pipeline doing the heavy lifting, your agent can answer WhatsApp calls instantly, process live audio with STT → LLM → TTS, and deliver natural, low-latency conversations without any telephony infrastructure on your end.

Try it yourself: Clone this setup and customize your own AI voice agent today.
Explore more: Check out the VideoSDK documentation for more features.
Build smarter assistants: Experiment with different voices, languages, and AI models to create a unique experience.
Resources: For a video walkthrough of this setup, see https://youtu.be/KWfCWE8S_4U?si=f08FfapQkVCfrlGh

We’d love to hear from you!

Did you manage to set up your first AI WhatsApp agent in Python?
What challenges did you face while integrating with SIP providers like Twilio?

👉 Share your thoughts, roadblocks, or success stories in the comments or join our Discord community. We’re excited to learn from your journey and help you build even better AI-powered communication tools!
