<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Chaitrali Kakde</title>
    <description>The latest articles on DEV Community by Chaitrali Kakde (@chaitrali_kakde).</description>
    <link>https://dev.to/chaitrali_kakde</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3526274%2F4eedffba-502b-483d-b9d2-d0ea6a83cc45.jpg</url>
      <title>DEV Community: Chaitrali Kakde</title>
      <link>https://dev.to/chaitrali_kakde</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/chaitrali_kakde"/>
    <language>en</language>
    <item>
      <title>Build AI Voice Agent with Gemini 3.1 Flash Live and VideoSDK</title>
      <dc:creator>Chaitrali Kakde</dc:creator>
      <pubDate>Tue, 31 Mar 2026 13:20:13 +0000</pubDate>
      <link>https://dev.to/video-sdk/build-ai-voice-agent-with-gemini-31-flash-live-and-videosdk-12ep</link>
      <guid>https://dev.to/video-sdk/build-ai-voice-agent-with-gemini-31-flash-live-and-videosdk-12ep</guid>
      <description>&lt;p&gt;Google just launched Gemini 3.1 Flash Live Preview, its most capable real-time voice and audio model yet. If you're building AI voice agents, conversational apps, or anything that needs low-latency audio intelligence, this model is a big deal. And with VideoSDK's Python SDK, plugging it into your app takes just a few minutes.&lt;/p&gt;

&lt;p&gt;In this blog, we'll walk through what the new model can do, and then build a working voice agent step by step using VideoSDK.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's New in Gemini 3.1 Flash Live Preview
&lt;/h2&gt;

&lt;p&gt;Google describes this as its "highest-quality audio and voice model yet," and there are a few things that actually back that up.&lt;/p&gt;

&lt;p&gt;It's built for real-time, audio-first experiences. Unlike models that convert speech to text and then process it, Gemini 3.1 Flash Live works audio-to-audio, meaning it hears you and responds as audio, keeping the conversation feeling natural and fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Here's what stands out:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Lower latency than before. Compared to 2.5 Flash Native Audio, this model is noticeably faster. Fewer awkward pauses, snappier responses. That matters a lot when you're building voice agents where delays break the experience.&lt;/li&gt;
&lt;li&gt;It actually understands how you say things. The model picks up on acoustic nuances such as pitch, pace, and tone, so it can tell when you're asking a casual question vs. when you sound urgent or confused.&lt;/li&gt;
&lt;li&gt;Better background noise handling. It filters out noise more effectively, which means it works in real environments, not just quiet studios.&lt;/li&gt;
&lt;li&gt;Multilingual out of the box. Over 90 languages supported for real-time conversations.&lt;/li&gt;
&lt;li&gt;Longer conversation memory. It can follow the thread of a conversation for twice as long as the previous generation. So your agent won't "forget" what was said earlier in a long session.&lt;/li&gt;
&lt;li&gt;Tool use during live conversations. This one is huge for agent builders. The model can now trigger external tools (APIs, functions, searches) while a live conversation is happening, not just at the end of a turn.&lt;/li&gt;
&lt;li&gt;Multimodal awareness. It handles audio and video inputs together, so you can build agents that respond to what they see and hear at the same time.&lt;/li&gt;
&lt;li&gt;The model ID is &lt;code&gt;gemini-3.1-flash-live-preview&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
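&lt;p&gt;To make the tool-use point concrete, here is a rough sketch of a function declaration in the JSON-schema shape the Gemini Live API uses for tool calling. The &lt;code&gt;get_weather&lt;/code&gt; tool and its parameters are illustrative examples, not part of any SDK:&lt;/p&gt;

```python
# Illustrative only: a tool declaration in the function_declarations
# format the Gemini Live API accepts. get_weather is a made-up example,
# not a real API or SDK helper.
def make_weather_tool():
    return {
        "function_declarations": [
            {
                "name": "get_weather",
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string", "description": "City name"},
                    },
                    "required": ["city"],
                },
            }
        ]
    }
```

When the model decides mid-conversation that it needs the tool, it emits a function call with these arguments; your code runs the function and streams the result back into the live session.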

&lt;h2&gt;
  
  
  Building a Voice Agent with VideoSDK
&lt;/h2&gt;

&lt;p&gt;VideoSDK gives you everything you need to wire Gemini 3.1 Flash Live into a real voice application. Here's how to get set up from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Create and Activate a Python Virtual Environment
&lt;/h2&gt;

&lt;p&gt;First, create a clean Python environment so your project dependencies stay isolated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python3&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;venv&lt;/span&gt; &lt;span class="n"&gt;venv&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Activate it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;macOS/Linux&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="n"&gt;venv&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nb"&gt;bin&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;activate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Windows&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;venv&lt;/span&gt;\&lt;span class="n"&gt;Scripts&lt;/span&gt;\&lt;span class="n"&gt;activate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see &lt;code&gt;(venv)&lt;/code&gt; in your terminal prompt, which means you're good to go.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Set Up Your Environment Variables
&lt;/h2&gt;

&lt;p&gt;Create a .env file in your project root and add your API keys:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;VIDEOSDK_AUTH_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;your_videosdk_token_here&lt;/span&gt;
&lt;span class="n"&gt;GOOGLE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;your_google_api_key_here&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can get your VideoSDK auth token from the VideoSDK dashboard and your Google API key from Google AI Studio.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; when &lt;code&gt;GOOGLE_API_KEY&lt;/code&gt; is set in your .env file, do not pass &lt;code&gt;api_key&lt;/code&gt; as a parameter in your code; the SDK picks it up automatically.&lt;/p&gt;
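&lt;p&gt;As a quick sanity check, you can verify both variables are actually visible to your process before the agent starts. This small helper is our own convenience, not part of the SDK, and it assumes the .env file has already been loaded (e.g. via python-dotenv):&lt;/p&gt;

```python
import os

# The two variables this tutorial's .env file defines.
REQUIRED_VARS = ("VIDEOSDK_AUTH_TOKEN", "GOOGLE_API_KEY")

def missing_env_vars(required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not os.getenv(name)]

# Example guard you might place at the top of main.py:
# if missing_env_vars():
#     raise SystemExit(f"Missing keys: {missing_env_vars()}")
```

Failing fast here is nicer than a cryptic authentication error deep inside the session startup.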

&lt;h2&gt;
  
  
  Step 3: Install the Required Packages
&lt;/h2&gt;

&lt;p&gt;Install VideoSDK's agents SDK along with the Google plugin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;videosdk-agents[google]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Create Your Agent (main.py)
&lt;/h2&gt;

&lt;p&gt;Create a file called main.py in your project folder and paste in the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AgentSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RoomOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;WorkerJob&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.plugins.google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GeminiRealtime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GeminiLiveConfig&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%(asctime)s - %(name)s - %(levelname)s - %(message)s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handlers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StreamHandler&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyVoiceAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You Are VideoSDK&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s Voice Agent.You are a helpful voice assistant that can answer questions and help with tasks.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_enter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;say&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, how can I help you today?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;say&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Goodbye!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MyVoiceAgent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GeminiRealtime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3.1-flash-live-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# When GOOGLE_API_KEY is set in .env - DON'T pass api_key parameter
&lt;/span&gt;        &lt;span class="c1"&gt;# api_key="AIXXXXXXXXXXXXXXXXXXXX", 
&lt;/span&gt;        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;GeminiLiveConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;voice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Leda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, and Zephyr.
&lt;/span&gt;            &lt;span class="n"&gt;response_modalities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AUDIO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wait_for_participant&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_until_shutdown&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;make_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;room_options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RoomOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="c1"&gt;# room_id="&amp;lt;room_id&amp;gt;", # Replace it with your actual room_id
&lt;/span&gt;        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Gemini Realtime Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;playground&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;room_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;room_options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WorkerJob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entrypoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;start_session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jobctx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;make_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  To run the agent:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you run this command, a playground URL will appear in your terminal. You can use this URL to interact with your AI agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Can You Build With This?
&lt;/h2&gt;

&lt;p&gt;Gemini 3.1 Flash Live + VideoSDK opens up a pretty wide range of real-world use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer support voice bots. Replace or supplement your call center with agents that actually understand tone and can handle multilingual customers in real time.&lt;/li&gt;
&lt;li&gt;AI meeting assistants. Agents that join calls, take notes, answer questions from participants, and trigger follow-up actions mid-conversation.&lt;/li&gt;
&lt;li&gt;Healthcare intake agents. Voice-based triage agents that collect patient information, ask follow-up questions, and route to the right department all in a natural spoken conversation.&lt;/li&gt;
&lt;li&gt;Language tutors. Real-time conversation partners that catch pronunciation issues, adjust their pace based on the learner, and respond naturally.&lt;/li&gt;
&lt;li&gt;Voice-controlled IoT and home automation. Agents that listen continuously, understand context, and trigger device actions through tool use all in sub-second response times.&lt;/li&gt;
&lt;li&gt;Live interview prep tools. Candidates practice answering questions aloud and get spoken feedback instantly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Gemini 3.1 Flash Live Preview is a meaningful step forward for real-time voice AI. The improvements in latency, noise handling, multilingual support, and especially live tool use make it a strong foundation for production voice agents.&lt;/p&gt;

&lt;p&gt;VideoSDK wraps all of that into a clean Python SDK that gets you from zero to a running agent in a handful of lines. Whether you're prototyping or building something you intend to ship, the setup here gives you a solid starting point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps and Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Check the &lt;a href="https://docs.videosdk.live/ai_agents/plugins/realtime/google-live-api" rel="noopener noreferrer"&gt;Gemini 3.1 implementation docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Learn how to &lt;a href="https://docs.videosdk.live/ai_agents/deployments/introduction" rel="noopener noreferrer"&gt;deploy your agents&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;👉 Share your thoughts, roadblocks, or success stories in the comments, or join our &lt;a href="https://dub.sh/yDV95i6" rel="noopener noreferrer"&gt;Discord community&lt;/a&gt;. We’re excited to learn from your journey and help you build even better AI-powered communication.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>gemini</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Google Gemini 3.1 Flash Live is more powerful than you think</title>
      <dc:creator>Chaitrali Kakde</dc:creator>
      <pubDate>Tue, 31 Mar 2026 10:27:49 +0000</pubDate>
      <link>https://dev.to/chaitrali_kakde/building-real-time-ai-voice-agents-with-google-gemini-31-flash-live-and-videosdk-26f5</link>
      <guid>https://dev.to/chaitrali_kakde/building-real-time-ai-voice-agents-with-google-gemini-31-flash-live-and-videosdk-26f5</guid>
      <description>&lt;p&gt;Google just launched Gemini 3.1 Flash Live Preview, its most capable real-time voice and audio model yet. If you're building AI voice agents, conversational apps, or anything that needs low-latency audio intelligence, this model is a big deal. And with VideoSDK's Python SDK, plugging it into your app takes just a few minutes.&lt;/p&gt;

&lt;p&gt;In this blog, we'll walk through what the new model can do, and then build a working voice agent step by step using VideoSDK.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's New in Gemini 3.1 Flash Live Preview
&lt;/h2&gt;

&lt;p&gt;Google describes this as its "highest-quality audio and voice model yet," and there are a few things that actually back that up.&lt;/p&gt;

&lt;p&gt;It's built for real-time, audio-first experiences. Unlike models that convert speech to text and then process it, Gemini 3.1 Flash Live works audio-to-audio, meaning it hears you and responds as audio, keeping the conversation feeling natural and fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Here's what stands out:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Lower latency than before. Compared to 2.5 Flash Native Audio, this model is noticeably faster. Fewer awkward pauses, snappier responses. That matters a lot when you're building voice agents where delays break the experience.&lt;/li&gt;
&lt;li&gt;It actually understands how you say things. The model picks up on acoustic nuances such as pitch, pace, and tone, so it can tell when you're asking a casual question vs. when you sound urgent or confused.&lt;/li&gt;
&lt;li&gt;Better background noise handling. It filters out noise more effectively, which means it works in real environments, not just quiet studios.&lt;/li&gt;
&lt;li&gt;Multilingual out of the box. Over 90 languages supported for real-time conversations.&lt;/li&gt;
&lt;li&gt;Longer conversation memory. It can follow the thread of a conversation for twice as long as the previous generation. So your agent won't "forget" what was said earlier in a long session.&lt;/li&gt;
&lt;li&gt;Tool use during live conversations. This one is huge for agent builders. The model can now trigger external tools (APIs, functions, searches) while a live conversation is happening, not just at the end of a turn.&lt;/li&gt;
&lt;li&gt;Multimodal awareness. It handles audio and video inputs together, so you can build agents that respond to what they see and hear at the same time.&lt;/li&gt;
&lt;li&gt;The model ID is &lt;code&gt;gemini-3.1-flash-live-preview&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Building a Voice Agent with VideoSDK
&lt;/h2&gt;

&lt;p&gt;VideoSDK gives you everything you need to wire Gemini 3.1 Flash Live into a real voice application. Here's how to get set up from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Create and Activate a Python Virtual Environment
&lt;/h2&gt;

&lt;p&gt;First, create a clean Python environment so your project dependencies stay isolated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python3&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;venv&lt;/span&gt; &lt;span class="n"&gt;venv&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Activate it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;macOS/Linux&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="n"&gt;venv&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nb"&gt;bin&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;activate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Windows&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;venv&lt;/span&gt;\&lt;span class="n"&gt;Scripts&lt;/span&gt;\&lt;span class="n"&gt;activate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see &lt;code&gt;(venv)&lt;/code&gt; in your terminal prompt, which means you're good to go.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Set Up Your Environment Variables
&lt;/h2&gt;

&lt;p&gt;Create a .env file in your project root and add your API keys:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;VIDEOSDK_AUTH_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;your_videosdk_token_here&lt;/span&gt;
&lt;span class="n"&gt;GOOGLE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;your_google_api_key_here&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can get your VideoSDK auth token from the VideoSDK dashboard and your Google API key from Google AI Studio.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; when &lt;code&gt;GOOGLE_API_KEY&lt;/code&gt; is set in your .env file, do not pass &lt;code&gt;api_key&lt;/code&gt; as a parameter in your code; the SDK picks it up automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Install the Required Packages
&lt;/h2&gt;

&lt;p&gt;Install VideoSDK's agents SDK along with the Google plugin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;videosdk-agents[google]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Create Your Agent (main.py)
&lt;/h2&gt;

&lt;p&gt;Create a file called main.py in your project folder and paste in the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AgentSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RoomOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;WorkerJob&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.plugins.google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GeminiRealtime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GeminiLiveConfig&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%(asctime)s - %(name)s - %(levelname)s - %(message)s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handlers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StreamHandler&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyVoiceAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You Are VideoSDK&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s Voice Agent.You are a helpful voice assistant that can answer questions and help with tasks.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_enter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;say&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, how can I help you today?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;say&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Goodbye!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MyVoiceAgent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GeminiRealtime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3.1-flash-live-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# When GOOGLE_API_KEY is set in .env - DON'T pass api_key parameter
&lt;/span&gt;        &lt;span class="c1"&gt;# api_key="AIXXXXXXXXXXXXXXXXXXXX", 
&lt;/span&gt;        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;GeminiLiveConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;voice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Leda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, and Zephyr.
&lt;/span&gt;            &lt;span class="n"&gt;response_modalities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AUDIO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wait_for_participant&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_until_shutdown&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;make_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;room_options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RoomOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="c1"&gt;# room_id="&amp;lt;room_id&amp;gt;", # Replace it with your actual room_id
&lt;/span&gt;        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Gemini Realtime Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;playground&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;room_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;room_options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WorkerJob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entrypoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;start_session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jobctx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;make_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Run the Agent
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you run this command, a playground URL will appear in your terminal. You can use this URL to interact with your AI agent.&lt;/p&gt;
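&lt;p&gt;The agent above expects GOOGLE_API_KEY to be set in a .env file (see the comment in the code). If you prefer not to add the python-dotenv dependency, a minimal loader sketch looks like this; load_dotenv() from python-dotenv is the standard alternative:&lt;/p&gt;

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: put KEY=value pairs into os.environ.

    python-dotenv's load_dotenv() does the same job (and more); this is
    just a dependency-free sketch for local development.
    """
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and lines without KEY=value shape
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault: never clobber variables set in the real environment
            os.environ.setdefault(key.strip(), value.strip().strip('"'))
```

&lt;p&gt;Call load_env() at the top of main.py, before constructing GeminiRealtime, and leave the api_key parameter out as the comment in the code suggests.&lt;/p&gt;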

&lt;h2&gt;
  
  
  What Can You Build With This?
&lt;/h2&gt;

&lt;p&gt;Gemini 3.1 Flash Live + VideoSDK opens up a pretty wide range of real-world use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer support voice bots. Replace or supplement your call center with agents that actually understand tone and can handle multilingual customers in real time.&lt;/li&gt;
&lt;li&gt;AI meeting assistants. Agents that join calls, take notes, answer questions from participants, and trigger follow-up actions mid-conversation.&lt;/li&gt;
&lt;li&gt;Healthcare intake agents. Voice-based triage agents that collect patient information, ask follow-up questions, and route to the right department, all in a natural spoken conversation.&lt;/li&gt;
&lt;li&gt;Language tutors. Real-time conversation partners that catch pronunciation issues, adjust their pace based on the learner, and respond naturally.&lt;/li&gt;
&lt;li&gt;Voice-controlled IoT and home automation. Agents that listen continuously, understand context, and trigger device actions through tool use, all with sub-second response times.&lt;/li&gt;
&lt;li&gt;Live interview prep tools. Candidates practice answering questions aloud and get spoken feedback instantly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Gemini 3.1 Flash Live Preview is a meaningful step forward for real-time voice AI. The improvements in latency, noise handling, multilingual support, and especially live tool use make it a strong foundation for production voice agents.&lt;/p&gt;

&lt;p&gt;VideoSDK wraps all of that into a clean Python SDK that gets you from zero to a running agent in a handful of lines. Whether you're prototyping or building something you intend to ship, the setup here gives you a solid starting point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps and Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Check the &lt;a href="https://docs.videosdk.live/ai_agents/plugins/realtime/google-live-api" rel="noopener noreferrer"&gt;Gemini 3.1 implementation docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Learn how to &lt;a href="https://docs.videosdk.live/ai_agents/deployments/introduction" rel="noopener noreferrer"&gt;deploy your agents&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;👉 Share your thoughts, roadblocks, or success stories in the comments or join our &lt;a href="https://dub.sh/yDV95i6" rel="noopener noreferrer"&gt;Discord community&lt;/a&gt; ↗. We’re excited to learn from your journey and help you build even better AI-powered communication tools!&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Why You Don’t Need 3 API Keys to Build an AI Voice Agent</title>
      <dc:creator>Chaitrali Kakde</dc:creator>
      <pubDate>Fri, 20 Feb 2026 09:54:33 +0000</pubDate>
      <link>https://dev.to/chaitrali_kakde/why-you-dont-need-3-api-keys-to-build-an-ai-voice-agent-3jmm</link>
      <guid>https://dev.to/chaitrali_kakde/why-you-dont-need-3-api-keys-to-build-an-ai-voice-agent-3jmm</guid>
      <description>&lt;p&gt;Building AI voice agents used to mean juggling multiple providers — one for speech-to-text (STT), another for language models (LLMs), and yet another for text-to-speech (TTS). Each came with separate API keys, dashboards, billing, quotas, integration headaches, and failure points. The result? Powerful systems but slow to build, hard to maintain, and painful to scale.&lt;/p&gt;

&lt;p&gt;Today, that complexity is no longer necessary.&lt;/p&gt;

&lt;p&gt;With Inferencing in VideoSDK AI Voice Agents, you don’t need three different API keys or vendor accounts. Everything (STT, LLM, TTS, and realtime models) runs through a single unified platform, directly inside your voice pipeline, via the Agent Runtime Dashboard and the Python Agents SDK.&lt;/p&gt;

&lt;p&gt;Inferencing works seamlessly with both the CascadingPipeline and the RealtimePipeline, giving you the flexibility to build modular, staged agents or fully streaming, low-latency voice experiences. Whether you need incremental transcripts, tool-calling workflows, or native realtime audio conversations, VideoSDK lets you do it all — without the API chaos.&lt;/p&gt;

&lt;p&gt;Sign up to the &lt;a href="https://dub.sh/zXYQt7V" rel="noopener noreferrer"&gt;VideoSDK Dashboard&lt;/a&gt; to try Inference.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is VideoSDK Inference?
&lt;/h2&gt;

&lt;p&gt;VideoSDK Inference is a managed gateway that gives you access to multiple AI models without requiring your own API keys for providers like Sarvam AI or Google Gemini.&lt;/p&gt;

&lt;p&gt;Authentication, routing, retries, and billing are handled by VideoSDK; usage is simply charged against your VideoSDK account balance.&lt;/p&gt;
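&lt;p&gt;To make the retry point concrete: conceptually, the gateway runs logic like this between your pipeline and each provider, so transient errors never reach your agent code. This is a sketch of the general technique, not VideoSDK internals:&lt;/p&gt;

```python
import time

def call_with_retries(request_fn, max_attempts=3, base_delay=0.5):
    """Retry a provider call with exponential backoff.

    A managed inference gateway runs logic like this on your behalf, so
    pipeline code never sees transient provider errors. Conceptual
    sketch only, not VideoSDK's implementation.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except ConnectionError:
            # Last attempt: give up and surface the error
            if attempt == max_attempts - 1:
                raise
            # Back off: base_delay, 2x, 4x, ... between attempts
            time.sleep(base_delay * (2 ** attempt))
```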

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fme0goxxv5c6xjv2hv17p.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fme0goxxv5c6xjv2hv17p.webp" alt=" " width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supported Categories&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;STT: Sarvam, Google, Deepgram&lt;/li&gt;
&lt;li&gt;LLMs: Google Gemini&lt;/li&gt;
&lt;li&gt;TTS: Sarvam, Google, Cartesia&lt;/li&gt;
&lt;li&gt;Realtime: Gemini Native Audio&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Inferencing via &lt;a href="https://dub.sh/zXYQt7V" rel="noopener noreferrer"&gt;Agent Runtime Dashboard&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Inferencing in VideoSDK is now fully accessible through the dashboard, giving developers direct control over model selection and pipeline configuration without needing to manage infrastructure manually.&lt;/p&gt;

&lt;p&gt;Sign up at the &lt;a href="https://dub.sh/zXYQt7V" rel="noopener noreferrer"&gt;VideoSDK Dashboard&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dub.sh/zXYQt7V" rel="noopener noreferrer"&gt;From the dashboard, developers can&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select STT, LLM, TTS, or Realtime models and enable them in the pipeline with a single click.&lt;/li&gt;
&lt;li&gt;Switch providers instantly, allowing rapid experimentation and iteration.&lt;/li&gt;
&lt;li&gt;Attach deployment endpoints for web or telephony, making the agent immediately accessible to users.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this approach, ideas move from configuration to live, interactive conversations in minutes, making it possible to test new workflows, swap models, or iterate on conversational design almost instantly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inferencing via Code (Agents SDK)
&lt;/h2&gt;

&lt;p&gt;With VideoSDK Inferencing, developers can now integrate STT, LLM, TTS, and Realtime models directly into their voice agents, all handled inside VideoSDK. This enables rapid experimentation, modular pipelines, and low-latency real-time conversations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;The Inference plugin is included in the core VideoSDK Agents SDK. Install it via:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;videosdk&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;agents&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Importing Inference Classes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can import the Inference classes directly from videosdk.agents.inference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from videosdk.agents.inference import STT, LLM, TTS, Realtime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CascadingPipeline Example
&lt;/h2&gt;

&lt;p&gt;The CascadingPipeline is ideal for modular, stage-by-stage processing. Here’s an example of building a simple agent using STT, LLM, and TTS via the VideoSDK Inference Gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pipeline = CascadingPipeline(
        stt=STT.sarvam(model_id="saarika:v2.5", language="en-IN"),
        llm=LLM.google(model="gemini-2.5-flash"),
        tts=TTS.sarvam(model_id="bulbul:v2", speaker="anushka", language="en-IN"),
        vad=SileroVAD()
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
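&lt;p&gt;To see what stage-by-stage processing buys you, here is a toy model of the cascading flow with plain Python stand-ins (the function names are illustrative, not the SDK's API): each stage finishes before the next starts, so any one of them can be swapped independently.&lt;/p&gt;

```python
def fake_stt(audio):
    """Stand-in for STT.sarvam(...): audio in, transcript out."""
    return audio.decode("utf-8")

def fake_llm(transcript):
    """Stand-in for LLM.google(...): transcript in, reply text out."""
    return "You said: " + transcript

def fake_tts(text):
    """Stand-in for TTS.sarvam(...): reply text in, audio out."""
    return text.encode("utf-8")

def cascading_pipeline(audio):
    # Each stage completes before the next begins; swapping a provider
    # means replacing exactly one function, leaving the others untouched.
    transcript = fake_stt(audio)
    reply = fake_llm(transcript)
    return fake_tts(reply)
```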



&lt;h2&gt;
  
  
  RealTimePipeline Example
&lt;/h2&gt;

&lt;p&gt;For low-latency, fully streaming voice agents, the RealTimePipeline handles Realtime inference with minimal delay. Here’s an example using Gemini Live Native Audio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RealTimePipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Realtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gemini&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash-native-audio-preview-12-2025&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;voice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Puck&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;language_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en-US&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;response_modalities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AUDIO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
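&lt;p&gt;The realtime path has a different shape: rather than running discrete stages, it streams audio through one open session, so output can begin before all the input has arrived. A toy generator-based sketch of that property (stand-ins, not the real RealTimePipeline):&lt;/p&gt;

```python
def realtime_pipeline(audio_chunks):
    """Toy streaming loop: yield a response chunk per input chunk.

    A real Realtime model (e.g. Gemini Native Audio) holds one open
    session and streams audio both ways; the property shown here is
    that output starts while input is still arriving.
    """
    for chunk in audio_chunks:
        # In a real session this would be the model's incremental audio
        yield chunk.upper()
```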



&lt;p&gt;With this approach, developers retain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full programmatic control over pipeline stages, model parameters, and execution behavior.&lt;/li&gt;
&lt;li&gt;Modular provider replacement, making it easy to swap STT, LLM, or TTS engines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: a fully configurable, production-ready AI voice agent that can be deployed in minutes.&lt;/p&gt;
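&lt;p&gt;Modular provider replacement can be pictured as a registry keyed by provider name, so swapping engines is a one-line change. The names below are illustrative only; the SDK exposes providers as TTS.sarvam(...), TTS.google(...), and so on.&lt;/p&gt;

```python
# Hypothetical registry of TTS engines keyed by provider name; in the
# real SDK these would be TTS.sarvam(...), TTS.google(...), etc.
TTS_PROVIDERS = {
    "sarvam": lambda: "bulbul:v2",
    "google": lambda: "google-tts",
    "cartesia": lambda: "cartesia-tts",
}

def make_tts(provider):
    """Pick a TTS engine by name; unknown providers fail loudly."""
    if provider not in TTS_PROVIDERS:
        raise ValueError("unknown TTS provider: " + provider)
    return TTS_PROVIDERS[provider]()
```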

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Voice AI is no longer limited by model capability. It’s limited by how fast you can deploy it. With Inferencing in VideoSDK AI Voice Agents, deployment becomes effortless. Whether through the dashboard or programmatically via the SDK, you can build, select, enable, and go live in minutes.&lt;/p&gt;

&lt;p&gt;The era of modular, low-latency, real-time voice agents is here. With Inferencing, your ideas move from concept to conversation faster than ever.&lt;/p&gt;

&lt;p&gt;Build. Select. Configure. Go live.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources and Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;For more information, read the &lt;a href="https://docs.videosdk.live/ai_agents/plugins/inference/videosdk-inference" rel="noopener noreferrer"&gt;Inference documentation&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;Learn how to &lt;a href="https://docs.videosdk.live/ai_agents/deployments/introduction" rel="noopener noreferrer"&gt;deploy your AI Agents&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;Sign up at &lt;a href="https://dub.sh/zXYQt7V" rel="noopener noreferrer"&gt;VideoSDK Dashboard&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;👉 Share your thoughts, roadblocks, or success stories in the comments or join our &lt;a href="https://dub.sh/yDV95i6" rel="noopener noreferrer"&gt;Discord community&lt;/a&gt; ↗. We’re excited to learn from your journey and help you build even better AI-powered communication tools!&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Introducing Phone Numbers: Build AI Telephony Agents in 60 seconds</title>
      <dc:creator>Chaitrali Kakde</dc:creator>
      <pubDate>Tue, 17 Feb 2026 07:13:43 +0000</pubDate>
      <link>https://dev.to/chaitrali_kakde/introducing-phone-numbers-build-ai-telephony-agents-in-60-seconds-3b33</link>
      <guid>https://dev.to/chaitrali_kakde/introducing-phone-numbers-build-ai-telephony-agents-in-60-seconds-3b33</guid>
      <description>&lt;p&gt;Today, we’re launching &lt;a href="https://thrive.zohopublic.in/aref/ccUcTqaXyb/KJotBDtEI" rel="noopener noreferrer"&gt;VideoSDK Phone Numbers&lt;/a&gt; a first-party telephony capability that lets you connect voice agents and calling workflows directly to the phone network, without relying on third-party providers like Twilio.&lt;/p&gt;

&lt;p&gt;You can now purchase phone numbers straight from the &lt;a href="https://thrive.zohopublic.in/aref/ccUcTqaXyb/KJotBDtEI" rel="noopener noreferrer"&gt;VideoSDK Dashboard&lt;/a&gt;, attach them to your agent or calling logic, and go live in minutes.&lt;/p&gt;

&lt;p&gt;No external accounts. No SIP trunk configuration. No extra setup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4crcadaxwdn31znch5h.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4crcadaxwdn31znch5h.webp" alt=" " width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Fewer Hops. Better Calls.
&lt;/h2&gt;

&lt;p&gt;Until now, developers building phone-based voice experiences on VideoSDK typically had to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sign up with a third-party telephony provider (like Twilio)&lt;/li&gt;
&lt;li&gt;Purchase and manage phone numbers externally&lt;/li&gt;
&lt;li&gt;Configure SIP trunks and credentials&lt;/li&gt;
&lt;li&gt;Integrate and debug multiple systems before making the first call&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By offering native phone numbers directly within VideoSDK, we remove those extra layers and make phone-based voice apps significantly easier to build and operate, in just three steps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzy7n7jaeyrqmc9lu2y0o.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzy7n7jaeyrqmc9lu2y0o.webp" alt=" " width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;Getting started with &lt;a href="https://thrive.zohopublic.in/aref/ccUcTqaXyb/KJotBDtEI" rel="noopener noreferrer"&gt;VideoSDK Phone Numbers&lt;/a&gt; is simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Buy a Phone Number: Search and purchase available phone numbers directly from the VideoSDK Dashboard.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Attach Your Agent or Logic: Associate the phone number with your voice agent or inbound call routing logic so incoming calls are handled automatically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Go Live: Call the number and your agent answers instantly.&lt;br&gt;
Manage numbers, update routing, and monitor usage all from one place.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/K4g51dBkrrk"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Can Build
&lt;/h2&gt;

&lt;p&gt;Phone calls remain critical across many industries. With VideoSDK Phone Numbers, you can spin up phone-based voice agents in minutes for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sales qualification and lead routing&lt;/li&gt;
&lt;li&gt;Order tracking and delivery updates&lt;/li&gt;
&lt;li&gt;Appointment booking and status updates&lt;/li&gt;
&lt;li&gt;Restaurant and service business call automation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your application involves real-time voice and the phone network, this feature is built for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Sign up at the &lt;a href="https://thrive.zohopublic.in/aref/ccUcTqaXyb/KJotBDtEI" rel="noopener noreferrer"&gt;VideoSDK Dashboard&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Have feedback or ideas? Comment below or join the &lt;a href="https://dub.sh/yDV95i6" rel="noopener noreferrer"&gt;VideoSDK Discord community&lt;/a&gt; ↗ to build and ship AI call agents faster. We’re excited to learn from your journey and help you build even better AI-powered communication tools!&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>twilio</category>
      <category>agents</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Announcing VideoSDK Inference: One Magic API for Every Voice AI Model</title>
      <dc:creator>Chaitrali Kakde</dc:creator>
      <pubDate>Tue, 17 Feb 2026 07:08:59 +0000</pubDate>
      <link>https://dev.to/video-sdk/announcing-videosdk-inference-one-magic-api-for-every-voice-ai-model-41bg</link>
      <guid>https://dev.to/video-sdk/announcing-videosdk-inference-one-magic-api-for-every-voice-ai-model-41bg</guid>
      <description>&lt;p&gt;Building AI voice agents has always been powerful, but slow. You had models STT, LLMs, TTS and the tools to use them. But maintaining accounts across multiple vendors for speech recognition, language models, and speech synthesis, each with its own keys, quotas, billing, and APIs was a major challenge.&lt;/p&gt;

&lt;p&gt;Today, that changes.&lt;/p&gt;

&lt;p&gt;We’re thrilled to announce Inferencing in &lt;a href="https://thrive.zohopublic.in/aref/ccUcTqaXyb/KJotBDtEI" rel="noopener noreferrer"&gt;VideoSDK AI Voice Agents&lt;/a&gt;, a unified way to run STT, LLM, TTS, and Realtime models directly inside your voice pipeline, without managing multiple accounts, through the Agent Runtime Dashboard and the Python Agents SDK.&lt;/p&gt;

&lt;p&gt;Inferencing works in both the &lt;a href="https://docs.videosdk.live/ai_agents/core-components/cascading-pipeline" rel="noopener noreferrer"&gt;CascadingPipeline&lt;/a&gt; and the &lt;a href="https://docs.videosdk.live/ai_agents/core-components/realtime-pipeline" rel="noopener noreferrer"&gt;RealtimePipeline&lt;/a&gt;, giving you full flexibility to build modular or fully streaming voice agents. Whether you want incremental transcripts, staged execution, or fully native realtime audio, Inferencing makes it easy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Supported Categories
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;STT: Sarvam, Google, Deepgram&lt;/li&gt;
&lt;li&gt;LLMs: Google Gemini&lt;/li&gt;
&lt;li&gt;TTS: Sarvam, Google, Cartesia&lt;/li&gt;
&lt;li&gt;Realtime: Gemini Native Audio&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  What is VideoSDK Inference?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://thrive.zohopublic.in/aref/ccUcTqaXyb/KJotBDtEI" rel="noopener noreferrer"&gt;VideoSDK Inference&lt;/a&gt; is a managed gateway that gives you access to multiple AI models. All without providing your own API keys for providers like Sarvam AI or Google Gemini.&lt;/p&gt;

&lt;p&gt;Authentication, routing, retries, and billing are handled by VideoSDK; usage is simply charged against your VideoSDK account balance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14ls0kk7kadxoln35gt4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14ls0kk7kadxoln35gt4.png" alt=" " width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Inferencing via &lt;a href="https://thrive.zohopublic.in/aref/ccUcTqaXyb/KJotBDtEI" rel="noopener noreferrer"&gt;Agent Runtime Dashboard&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Inferencing in VideoSDK is now fully accessible through the dashboard, giving developers direct control over model selection and pipeline configuration without needing to manage infrastructure manually.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://thrive.zohopublic.in/aref/ccUcTqaXyb/KJotBDtEI" rel="noopener noreferrer"&gt;From the dashboard&lt;/a&gt;, developers can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select STT, LLM, TTS, or Realtime models and enable them in the pipeline with a single click.&lt;/li&gt;
&lt;li&gt;Switch providers instantly, allowing rapid experimentation and iteration.&lt;/li&gt;
&lt;li&gt;Attach deployment endpoints for web or telephony, making the agent immediately accessible to users.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this approach, ideas move from configuration to live, interactive conversations in minutes, making it possible to test new workflows, swap models, or iterate on conversational design almost instantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Watch this video to get started&lt;/strong&gt; 👇&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/2XvhVJETSrU"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Inferencing via &lt;a href="https://dub.sh/q854DMD" rel="noopener noreferrer"&gt;Code (Agents SDK)&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;With VideoSDK Inferencing, developers can now integrate STT, LLM, TTS, and Realtime models directly into their voice agents, all handled inside VideoSDK. This enables rapid experimentation, modular pipelines, and low-latency real-time conversations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Inference plugin is included in the core VideoSDK Agents SDK. Install it via:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;videosdk&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;agents&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CascadingPipeline Example
&lt;/h2&gt;

&lt;p&gt;The CascadingPipeline is ideal for modular, stage-by-stage processing. Here’s an example of building a simple agent using STT, LLM, and TTS via the VideoSDK Inference Gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CascadingPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;stt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;STT&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sarvam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;saarika:v2.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en-IN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;google&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;tts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TTS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sarvam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bulbul:v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;speaker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anushka&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en-IN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;vad&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;SileroVAD&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RealTimePipeline Example&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For low-latency, fully streaming voice agents, the RealTimePipeline handles Realtime inference with minimal delay. Here’s an example using Gemini Live Native Audio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RealTimePipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Realtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gemini&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash-native-audio-preview-12-2025&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;voice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Puck&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;language_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en-US&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;response_modalities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AUDIO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With this approach, developers retain:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full programmatic control over pipeline stages, model parameters, and execution behavior.&lt;/li&gt;
&lt;li&gt;Modular provider replacement, making it easy to swap STT, LLM, or TTS engines.&lt;/li&gt;
&lt;/ul&gt;
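&lt;p&gt;The modular-replacement point can be sketched in plain Python. These classes are illustrative stand-ins, not the actual VideoSDK interfaces; the idea is simply that any object exposing the same method can fill a pipeline slot.&lt;/p&gt;

```python
# Conceptual sketch of modular provider replacement -- illustrative
# stand-in classes, NOT the actual VideoSDK interfaces.

class ProviderA_STT:
    def transcribe(self, audio: bytes) -> str:
        return "transcript from provider A"

class ProviderB_STT:
    def transcribe(self, audio: bytes) -> str:
        return "transcript from provider B"

class SimplePipeline:
    def __init__(self, stt):
        self.stt = stt  # any provider with a .transcribe() method fits

    def run(self, audio: bytes) -> str:
        return self.stt.transcribe(audio)

# Swapping the STT engine is a one-line change:
print(SimplePipeline(ProviderA_STT()).run(b"pcm-audio"))
print(SimplePipeline(ProviderB_STT()).run(b"pcm-audio"))
```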

&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt; a fully configurable, production-ready AI voice agent that can be deployed in minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Voice AI is no longer limited by model capability. It’s limited by how fast you can deploy it. With Inferencing in VideoSDK AI Voice Agents, deployment becomes effortless. Whether through the dashboard or programmatically via the SDK, you can build, select, enable, and go live in minutes.&lt;/p&gt;

&lt;p&gt;The era of modular, low-latency, real-time voice agents is here. With Inferencing, your ideas move from concept to conversation faster than ever.&lt;/p&gt;

&lt;p&gt;Build. Select. Configure. Go live.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources and Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;For more information, read the &lt;a href="https://docs.videosdk.live/ai_agents/plugins/inference/videosdk-inference" rel="noopener noreferrer"&gt;Inference documentation&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;Learn how to &lt;a href="https://docs.videosdk.live/ai_agents/deployments/introduction" rel="noopener noreferrer"&gt;deploy your AI Agents&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Sign up at &lt;a href="https://thrive.zohopublic.in/aref/ccUcTqaXyb/KJotBDtEI" rel="noopener noreferrer"&gt;VideoSDK Dashboard&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;👉 Share your thoughts, roadblocks, or success stories in the comments or join our &lt;a href="https://dub.sh/yDV95i6" rel="noopener noreferrer"&gt;Discord community&lt;/a&gt; ↗. We’re excited to learn from your journey and help you build even better AI-powered communication tools!&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Introducing Phone Numbers: Build AI Telephony Agents in 60 seconds</title>
      <dc:creator>Chaitrali Kakde</dc:creator>
      <pubDate>Tue, 17 Feb 2026 06:55:33 +0000</pubDate>
      <link>https://dev.to/video-sdk/introducing-phone-numbers-build-ai-telephony-agents-in-60-seconds-4la5</link>
      <guid>https://dev.to/video-sdk/introducing-phone-numbers-build-ai-telephony-agents-in-60-seconds-4la5</guid>
<description>&lt;p&gt;Today, we’re launching &lt;a href="https://thrive.zohopublic.in/aref/ccUcTqaXyb/KJotBDtEI" rel="noopener noreferrer"&gt;VideoSDK Phone Numbers&lt;/a&gt;, a first-party telephony capability that lets you connect voice agents and calling workflows directly to the phone network, without relying on third-party providers like Twilio.&lt;/p&gt;

&lt;p&gt;You can now purchase phone numbers straight from the &lt;a href="https://thrive.zohopublic.in/aref/ccUcTqaXyb/KJotBDtEI" rel="noopener noreferrer"&gt;VideoSDK Dashboard&lt;/a&gt;, attach them to your agent or calling logic, and go live in minutes.&lt;/p&gt;

&lt;p&gt;No external accounts. No SIP trunk configuration. No extra setup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4crcadaxwdn31znch5h.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4crcadaxwdn31znch5h.webp" alt=" " width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Fewer Hops. Better Calls.
&lt;/h2&gt;

&lt;p&gt;Until now, developers building phone-based voice experiences on VideoSDK typically had to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sign up with a third-party telephony provider (like Twilio)&lt;/li&gt;
&lt;li&gt;Purchase and manage phone numbers externally&lt;/li&gt;
&lt;li&gt;Configure SIP trunks and credentials&lt;/li&gt;
&lt;li&gt;Integrate and debug multiple systems before making the first call&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By offering native phone numbers directly within VideoSDK, we remove those extra layers and make phone-based voice apps significantly easier to build and operate, in just three steps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzy7n7jaeyrqmc9lu2y0o.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzy7n7jaeyrqmc9lu2y0o.webp" alt=" " width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;Getting started with &lt;a href="https://thrive.zohopublic.in/aref/ccUcTqaXyb/KJotBDtEI" rel="noopener noreferrer"&gt;VideoSDK Phone Numbers&lt;/a&gt; is simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Buy a Phone Number: Search and purchase available phone numbers directly from the VideoSDK Dashboard.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Attach Your Agent or Logic: Associate the phone number with your voice agent or inbound call routing logic so incoming calls are handled automatically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Go Live: Call the number and your agent answers instantly.&lt;br&gt;
Manage numbers, update routing, and monitor usage, all from one place.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/K4g51dBkrrk"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Can Build
&lt;/h2&gt;

&lt;p&gt;Phone calls remain critical across many industries. With VideoSDK Phone Numbers, you can spin up phone-based voice agents in minutes for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sales qualification and lead routing&lt;/li&gt;
&lt;li&gt;Order tracking and delivery updates&lt;/li&gt;
&lt;li&gt;Appointment booking and status updates&lt;/li&gt;
&lt;li&gt;Restaurant and service business call automation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your application involves real-time voice and the phone network, this feature is built for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Sign up at the &lt;a href="https://thrive.zohopublic.in/aref/ccUcTqaXyb/KJotBDtEI" rel="noopener noreferrer"&gt;VideoSDK Dashboard&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Have feedback or ideas? Comment below or join our Discord to build and ship AI call agents faster with &lt;a href="https://dub.sh/yDV95i6" rel="noopener noreferrer"&gt;VideoSDK Discord community&lt;/a&gt; ↗. We’re excited to learn from your journey and help you build even better AI-powered communication tools!&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>twilio</category>
      <category>agents</category>
      <category>techtalks</category>
    </item>
    <item>
      <title>Introducing the Gladia Speech to Text Plugin in VideoSDK</title>
      <dc:creator>Chaitrali Kakde</dc:creator>
      <pubDate>Wed, 21 Jan 2026 07:41:00 +0000</pubDate>
      <link>https://dev.to/chaitrali_kakde/introducing-the-gladia-speech-to-text-plugin-in-videosdk-4c27</link>
      <guid>https://dev.to/chaitrali_kakde/introducing-the-gladia-speech-to-text-plugin-in-videosdk-4c27</guid>
<description>&lt;p&gt;Speech-to-text is the first and most critical step in any voice agent. If transcription is slow or inaccurate, everything downstream, from reasoning to responses, suffers. Gladia STT is built for real-time transcription with strong multilingual support, fast partial results, and robust handling of code-switching.&lt;/p&gt;

&lt;p&gt;In this guide, we’ll walk through how to integrate Gladia STT with the VideoSDK Agents SDK and use it as a reliable input layer for voice-driven applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Gladia STT?
&lt;/h2&gt;

&lt;p&gt;Many voice agents operate in environments where users switch languages mid-sentence or expect instant feedback while speaking. Gladia is optimized for these scenarios. It provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low-latency transcription&lt;/li&gt;
&lt;li&gt;Support for multiple languages&lt;/li&gt;
&lt;li&gt;Automatic code-switching&lt;/li&gt;
&lt;li&gt;Partial transcripts for faster turn detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes it a strong choice for real-time agents, live calls, and interactive voice applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Transcription&lt;/strong&gt;: Gladia streams transcription results as audio is processed, reducing perceived latency in conversations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multilingual Support&lt;/strong&gt;: You can specify one or more languages, making it suitable for global or multilingual users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code-Switching&lt;/strong&gt;: Gladia can automatically detect and switch languages within the same conversation without manual intervention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partial Transcripts&lt;/strong&gt;: By enabling partial transcripts, agents can start reasoning before the user finishes speaking, improving responsiveness.&lt;/li&gt;
&lt;/ul&gt;
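&lt;p&gt;The partial-transcript point is worth making concrete: the agent can act on an in-progress draft and commit only when the engine marks an utterance final. A toy illustration of that control flow (pure Python, independent of the Gladia or VideoSDK APIs):&lt;/p&gt;

```python
# Toy illustration of interim (partial) vs. final transcript handling.
# Not the Gladia/VideoSDK API -- just the control-flow idea: refine a
# draft on partials, commit an utterance only when it is marked final.

def handle_transcripts(events):
    """events: list of (text, is_final) tuples streamed from an STT engine."""
    draft = ""
    finals = []
    for text, is_final in events:
        if is_final:
            finals.append(text)  # commit the finished utterance
            draft = ""
        else:
            draft = text         # keep refining the in-progress draft
    return finals, draft

finals, draft = handle_transcripts([
    ("bonjour", False),
    ("bonjour, book a", False),
    ("bonjour, book a table", True),  # code-switched utterance finalized
    ("for two", False),
])
```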

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;Install the Gladia-STT VideoSDK Agents package:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install "videosdk-plugins-gladia"&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Authentication
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Sign up at Gladia: signup link&lt;/li&gt;
&lt;li&gt;Sign up at VideoSDK: authentication token&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;GLADIA_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;your_api_key_here&lt;/span&gt;
&lt;span class="n"&gt;VIDEOSDK_AUTH_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;your_videosdk_token_here&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When using environment variables, you don’t need to pass the API key directly in code; the SDK reads it automatically.&lt;/p&gt;
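&lt;p&gt;The usual pattern behind this behavior is an explicit-argument-or-environment fallback. The sketch below illustrates the convention; it is an assumption about how the SDK resolves the key, not its exact internals.&lt;/p&gt;

```python
import os

# Common explicit-argument-or-environment fallback pattern.
# Illustrative of the convention only, not VideoSDK's exact internals.

def resolve_api_key(explicit=None, env_var="GLADIA_API_KEY"):
    key = explicit or os.environ.get(env_var)
    if not key:
        raise ValueError(f"Provide api_key or set {env_var}")
    return key

os.environ["GLADIA_API_KEY"] = "demo-key"
print(resolve_api_key())            # falls back to the environment
print(resolve_api_key("explicit"))  # an explicit argument wins
```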

&lt;p&gt;&lt;strong&gt;Importing Gladia STT&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.plugins.gladia&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GladiaSTT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Basic Usage Example
&lt;/h2&gt;

&lt;p&gt;Below is a minimal example showing how to configure Gladia STT and attach it to a cascading pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.plugins.gladia&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GladiaSTT&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CascadingPipeline&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize the Gladia STT model
&lt;/span&gt;&lt;span class="n"&gt;stt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GladiaSTT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-gladia-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;languages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;code_switching&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;receive_partial_transcripts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#  Add stt to a cascading pipeline
&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CascadingPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;stt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This setup enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time transcription&lt;/li&gt;
&lt;li&gt;Automatic language switching&lt;/li&gt;
&lt;li&gt;Partial transcripts for faster downstream processing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Configuration Options
&lt;/h2&gt;

&lt;p&gt;Gladia STT provides fine-grained control over transcription behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;languages&lt;/code&gt;: List of language codes to detect (e.g., ["en", "fr"])&lt;/li&gt;
&lt;li&gt;&lt;code&gt;code_switching&lt;/code&gt;: Enables automatic language switching&lt;/li&gt;
&lt;li&gt;&lt;code&gt;receive_partial_transcripts&lt;/code&gt;: Streams interim results for lower latency&lt;/li&gt;
&lt;li&gt;&lt;code&gt;model&lt;/code&gt;: STT model to use (default: "solaria-1")&lt;/li&gt;
&lt;li&gt;&lt;code&gt;input_sample_rate&lt;/code&gt;: Incoming audio sample rate&lt;/li&gt;
&lt;li&gt;&lt;code&gt;output_sample_rate&lt;/code&gt;: Processing sample rate&lt;/li&gt;
&lt;li&gt;&lt;code&gt;encoding&lt;/code&gt;: Audio encoding format&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bit_depth&lt;/code&gt;: Audio bit depth&lt;/li&gt;
&lt;li&gt;&lt;code&gt;channels&lt;/code&gt;: Number of audio channels (mono or stereo)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These parameters let you tune accuracy, latency, and compatibility with your audio pipeline.&lt;/p&gt;
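&lt;p&gt;Gathering the options above in one place, a fuller configuration might look like this. The parameter names come from the list above; the concrete values (sample rates, encoding, bit depth) are illustrative assumptions, not documented defaults.&lt;/p&gt;

```python
# Illustrative Gladia STT configuration, collected as a plain dict.
# Parameter names follow the option list above; the concrete values
# are assumptions chosen for illustration, not documented defaults.

gladia_config = {
    "languages": ["en", "fr"],            # languages to detect
    "code_switching": True,               # switch languages mid-conversation
    "receive_partial_transcripts": True,  # stream interim results
    "model": "solaria-1",                 # default model per the list above
    "input_sample_rate": 16000,           # incoming audio sample rate (Hz)
    "output_sample_rate": 16000,          # processing sample rate (Hz)
    "encoding": "wav/pcm",                # audio encoding format (assumed)
    "bit_depth": 16,                      # audio bit depth
    "channels": 1,                        # mono audio
}

# The dict would then be unpacked into the constructor,
# e.g. GladiaSTT(**gladia_config).
print(sorted(gladia_config))
```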

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Gladia STT provides a strong foundation for real-time voice agents by combining speed, accuracy, and multilingual flexibility. When integrated with VideoSDK’s agent pipelines, it enables agents to listen effectively even in dynamic, multilingual conversations. A reliable STT layer like Gladia helps ensure that downstream reasoning and responses stay accurate, responsive, and consistent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources and Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Explore the &lt;a href="https://docs.videosdk.live/ai_agents/plugins/stt/gladia" rel="noopener noreferrer"&gt;Gladia STT documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Learn how to &lt;a href="https://docs.videosdk.live/ai_agents/deployments/introduction" rel="noopener noreferrer"&gt;deploy your AI Agents.&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Explore more: Check out the &lt;a href="https://dub.sh/gGWIuJh" rel="noopener noreferrer"&gt;VideoSDK documentation&lt;/a&gt; for more features.&lt;/li&gt;
&lt;li&gt;👉 Share your thoughts, roadblocks, or success stories in the comments or join our &lt;a href="https://dub.sh/yDV95i6" rel="noopener noreferrer"&gt;Discord community&lt;/a&gt; ↗. We’re excited to learn from your journey and help you build even better AI-powered communication tools!&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>api</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to enable Voice Mail Detection in AI Voice Agents</title>
      <dc:creator>Chaitrali Kakde</dc:creator>
      <pubDate>Fri, 09 Jan 2026 10:46:25 +0000</pubDate>
      <link>https://dev.to/chaitrali_kakde/how-to-enable-voice-mail-detection-in-ai-voice-agents-1k34</link>
      <guid>https://dev.to/chaitrali_kakde/how-to-enable-voice-mail-detection-in-ai-voice-agents-1k34</guid>
<description>&lt;p&gt;When building outbound calling workflows, one of the most common challenges is handling unanswered calls that get redirected to voicemail systems. Without proper detection, an AI agent may continue speaking unnecessarily or wait indefinitely, leading to wasted resources and a poor user experience.&lt;/p&gt;

&lt;p&gt;Voice Mail Detection in VideoSDK solves this problem by automatically identifying voicemail scenarios and allowing your agent to take the appropriate action such as leaving a message or ending the call gracefully.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Problem This Solves
&lt;/h2&gt;

&lt;p&gt;In outbound calling workflows, unanswered calls are often routed to voicemail systems. Without detection, agents may continue speaking or wait unnecessarily.&lt;/p&gt;

&lt;p&gt;Voice Mail Detection lets you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detect voicemail systems automatically&lt;/li&gt;
&lt;li&gt;Control how your agent responds&lt;/li&gt;
&lt;li&gt;End calls cleanly after voicemail handling&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Enabling Voice Mail Detection
&lt;/h2&gt;

&lt;p&gt;To use voicemail detection, import and add VoiceMailDetector to your agent configuration, and register a callback that defines how voicemail should be handled.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VoiceMailDetector&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.plugins.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAILLM&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;voice_mail_callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Voice Mail message received:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;voicemail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VoiceMailDetector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;OpenAILLM&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;custom_callback_voicemail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;voice_mail_detector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;voicemail&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
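&lt;p&gt;Conceptually, detection amounts to classifying the first few seconds of the callee's speech. The toy version below uses a keyword heuristic purely for illustration; the real VoiceMailDetector delegates this judgment to the LLM you pass in, within the listening &lt;code&gt;duration&lt;/code&gt; window.&lt;/p&gt;

```python
# Toy voicemail classifier: a keyword heuristic over the opening utterance.
# Purely illustrative -- VoiceMailDetector actually delegates this decision
# to the configured LLM rather than matching fixed phrases.

VOICEMAIL_CUES = ("leave a message", "after the tone", "not available", "voicemail")

def looks_like_voicemail(opening_utterance: str) -> bool:
    text = opening_utterance.lower()
    return any(cue in text for cue in VOICEMAIL_CUES)

print(looks_like_voicemail("Hi, you've reached Sam. Please leave a message."))  # True
print(looks_like_voicemail("Hello? Who is this?"))                              # False
```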



&lt;h2&gt;
  
  
  Full Working Example
&lt;/h2&gt;

&lt;p&gt;To set up incoming call handling, outbound calling, and routing rules, check out the Quick Start Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AgentSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CascadingPipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;WorkerJob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ConversationFlow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RoomOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;VoiceMailDetector&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.plugins.deepgram&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DeepgramSTT&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.plugins.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAILLM&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.plugins.elevenlabs&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ElevenLabsTTS&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.plugins.silero&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SileroVAD&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.plugins.turn_detector&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TurnDetector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pre_download_model&lt;/span&gt;

&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%(asctime)s - %(name)s - %(levelname)s - %(message)s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handlers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StreamHandler&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;
&lt;span class="nf"&gt;pre_download_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;VoiceAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful voice assistant that can answer questions.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_enter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;say&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, how can I help you today?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;say&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Goodbye!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;entrypoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VoiceAgent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;conversation_flow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConversationFlow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;CascadingPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;stt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;DeepgramSTT&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;OpenAILLM&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;tts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ElevenLabsTTS&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;vad&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;SileroVAD&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;turn_detector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;TurnDetector&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;voice_mail_callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Voice Mail message received:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;voice_mail_detector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VoiceMailDetector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;OpenAILLM&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;voice_mail_callback&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;conversation_flow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;conversation_flow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;voice_mail_detector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;voice_mail_detector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wait_for_participant&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_until_shutdown&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;make_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;room_options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RoomOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Voice Mail Detector Test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;playground&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;room_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;room_options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WorkerJob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entrypoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;entrypoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jobctx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;make_context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_AGENT_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_processes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8081&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Voice Mail Detection ensures your AI agent handles unanswered calls intelligently. By automatically detecting voicemail and triggering the right action, it prevents wasted time, improves call efficiency, and makes outbound workflows reliable and production-ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources and Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Explore the voice mail detection implementation on GitHub.&lt;/li&gt;
&lt;li&gt;Read the voice mail detection docs.&lt;/li&gt;
&lt;li&gt;To set up inbound calls, outbound calls, and routing rules, check out the Quick Start Example.&lt;/li&gt;
&lt;li&gt;Learn how to deploy your AI Agents.&lt;/li&gt;
&lt;li&gt;Explore more: Check out the VideoSDK documentation for more features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Share your thoughts, roadblocks, or success stories in the comments or join our Discord community ↗. We’re excited to learn from your journey and help you build even better AI-powered communication tools!&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>automation</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to enable DTMF Events in Telephony AI Agent</title>
      <dc:creator>Chaitrali Kakde</dc:creator>
      <pubDate>Mon, 05 Jan 2026 08:49:02 +0000</pubDate>
      <link>https://dev.to/chaitrali_kakde/how-to-enable-dtmf-events-in-telephony-ai-agent-49g7</link>
      <guid>https://dev.to/chaitrali_kakde/how-to-enable-dtmf-events-in-telephony-ai-agent-49g7</guid>
      <description>&lt;p&gt;Not every caller wants to speak to a voice agent. In many call scenarios, users expect to press a key to make a selection, confirm an action, or move forward in a call flow. This is especially common in menu-based systems, short responses, or situations where speech recognition may not be reliable.&lt;/p&gt;

&lt;p&gt;DTMF (Dual-Tone Multi-Frequency) input gives voice agents a clear and predictable way to handle these interactions. When a caller presses a key on their phone, the agent receives that input instantly and can use it to control the call flow or trigger application logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In this post, we’ll explore how DTMF events can be used in a VideoSDK-powered voice agent, starting from common interaction patterns and moving into how the system processes keypad input in real time.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DTMF Event Detection&lt;/strong&gt;: The agent detects key presses (0–9, *, #) from the caller during a call session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Processing&lt;/strong&gt;: Each key press generates a DTMF event that is delivered to the agent immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Callback Integration&lt;/strong&gt;: A user-defined callback function handles incoming DTMF events.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action Execution&lt;/strong&gt;: The agent acts on the received input, for example to build IVR flows, collect user selections, or trigger application logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Enabling DTMF Events
&lt;/h2&gt;

&lt;p&gt;DTMF event detection can be enabled in two ways:&lt;/p&gt;

&lt;p&gt;1) &lt;strong&gt;Via Dashboard&lt;/strong&gt;:&lt;br&gt;
When creating or editing a SIP gateway in the VideoSDK dashboard, enable the DTMF option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe07d1ij7a15v1pqhq2qs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe07d1ij7a15v1pqhq2qs.png" alt=" " width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) &lt;strong&gt;Via API&lt;/strong&gt;:&lt;br&gt;
Set the enableDtmf parameter to true when creating or updating a SIP gateway using the API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;curl&lt;/span&gt;    &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;H&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Authorization: $YOUR_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; \ 
  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;H&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Content-Type: application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; \ 
  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; : &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Twilio Inbound Gateway&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enableDtmf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; : &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numbers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; : [&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;+0123456789&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]

  }&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; \ 
  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;XPOST&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;videosdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;live&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;v2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;sip&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;inbound&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;gateways&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
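&lt;p&gt;If you prefer to stay in Python, the same gateway request can be built with the standard library. This is a sketch mirroring the curl call above; the endpoint and payload fields come from that example, the &lt;code&gt;build_gateway_request&lt;/code&gt; helper name is ours, and the token is a placeholder:&lt;/p&gt;

```python
import json
import urllib.request


def build_gateway_request(token: str) -> urllib.request.Request:
    """Build the same POST request as the curl example above."""
    payload = {
        "name": "Twilio Inbound Gateway",
        "enableDtmf": True,  # enable DTMF event detection on this gateway
        "numbers": ["+0123456789"],
    }
    return urllib.request.Request(
        "https://api.videosdk.live/v2/sip/inbound-gateways",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": token, "Content-Type": "application/json"},
        method="POST",
    )


# Send it with: urllib.request.urlopen(build_gateway_request("YOUR_TOKEN"))
```

&lt;p&gt;Since this is a plain JSON POST, any HTTP client (requests, httpx, and so on) works just as well.&lt;/p&gt;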



&lt;p&gt;Once enabled, DTMF events will be detected and published for all calls routed through that gateway.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To set up inbound calls, outbound calls, and routing rules, check out the &lt;a href="https://docs.videosdk.live/telephony/ai-telephony-agent-quick-start#part-2-connect-your-agent-to-the-phone-network" rel="noopener noreferrer"&gt;Quick Start Example&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Step 2: Implementation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DTMFHandler&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;entrypoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dtmf_callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;digit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;digit&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;instructions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a Sales Representative. Your goal is to sell our products&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;say&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Routing you to Sales. Hi, I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m from Sales. How can I help you today?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;digit&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;instructions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a Support Specialist. Your goal is to help customers with technical issues.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;say&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Routing you to Support. Hi, I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m from Support. What issue are you facing?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;say&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid input. Press 1 for Sales or 2 for Support.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;dtmf_handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DTMFHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dtmf_callback&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;dtmf_handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dtmf_handler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
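&lt;p&gt;The callback above acts on each key press in isolation. Many IVR flows need multi-digit input, such as an account number, which means buffering digits until a terminator key arrives. Here is a minimal sketch in plain Python; the &lt;code&gt;DigitCollector&lt;/code&gt; class and its &lt;code&gt;#&lt;/code&gt; terminator are illustrative helpers, not part of the SDK, but &lt;code&gt;on_digit&lt;/code&gt; has the same shape as the callback you would hand to &lt;code&gt;DTMFHandler&lt;/code&gt;:&lt;/p&gt;

```python
import asyncio


class DigitCollector:
    """Buffers DTMF key presses until a terminator key arrives."""

    def __init__(self, terminator: str = "#"):
        self.terminator = terminator
        self._digits: list[str] = []
        self._complete: asyncio.Future | None = None

    async def on_digit(self, digit: str) -> None:
        # Feed each DTMF event into the buffer; resolve on the terminator.
        if digit == self.terminator:
            if self._complete and not self._complete.done():
                self._complete.set_result("".join(self._digits))
        else:
            self._digits.append(digit)

    async def collect(self, timeout: float = 10.0) -> str:
        # Wait for a complete entry; raise TimeoutError if the caller stalls.
        self._digits.clear()
        self._complete = asyncio.get_running_loop().create_future()
        return await asyncio.wait_for(self._complete, timeout)


async def demo() -> str:
    collector = DigitCollector()
    entry = asyncio.create_task(collector.collect(timeout=1.0))
    await asyncio.sleep(0)                 # let collect() install its future
    for key in ["4", "2", "1", "7", "#"]:  # simulated key presses
        await collector.on_digit(key)
    return await entry


print(asyncio.run(demo()))  # prints 4217
```

&lt;p&gt;The timeout matters in practice: callers abandon menus mid-entry, and without it the agent would wait on a future that never resolves.&lt;/p&gt;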



&lt;h2&gt;
  
  
  Full Working Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AgentSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CascadingPipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;WorkerJob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ConversationFlow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RoomOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;DTMFHandler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.plugins.deepgram&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DeepgramSTT&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.plugins.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAILLM&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.plugins.elevenlabs&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ElevenLabsTTS&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.plugins.silero&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SileroVAD&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.plugins.turn_detector&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TurnDetector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pre_download_model&lt;/span&gt;

&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%(asctime)s - %(name)s - %(levelname)s - %(message)s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handlers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StreamHandler&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;
&lt;span class="nf"&gt;pre_download_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;VoiceAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful voice assistant that can answer questions.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_enter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;say&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, how can I help you today?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;say&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Goodbye!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;entrypoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VoiceAgent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;conversation_flow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConversationFlow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;CascadingPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;stt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;DeepgramSTT&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;OpenAILLM&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;tts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ElevenLabsTTS&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;vad&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;SileroVAD&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;turn_detector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;TurnDetector&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dtmf_callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DTMF message received:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;dtmf_handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DTMFHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dtmf_callback&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;conversation_flow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;conversation_flow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;dtmf_handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dtmf_handler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wait_for_participant&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_until_shutdown&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;make_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;room_options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RoomOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DTMF Agent Test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;playground&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;room_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;room_options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WorkerJob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entrypoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;entrypoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jobctx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;make_context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_AGENT_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_processes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8081&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By enabling DTMF detection and handling events at the agent level, you can build predictable call flows, guide users through menus, and trigger application logic without interrupting the call experience. When combined with voice input, DTMF gives you more control over how users interact with your agent.&lt;/p&gt;

&lt;p&gt;This makes DTMF a practical addition to any voice agent that needs clear, deterministic user input during a call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources and Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Explore the &lt;a href="https://github.com/videosdk-live/agents-quickstart/blob/main/DTMF%20Handler/dtmf_handler.py" rel="noopener noreferrer"&gt;DTMF event implementation example&lt;/a&gt; for the full code.&lt;/li&gt;
&lt;li&gt;To set up inbound calls, outbound calls, and routing rules, check out the &lt;a href="https://docs.videosdk.live/telephony/ai-telephony-agent-quick-start#part-2-connect-your-agent-to-the-phone-network" rel="noopener noreferrer"&gt;Quick Start Example&lt;/a&gt;.
&lt;/li&gt;
&lt;li&gt;Learn how to &lt;a href="https://docs.videosdk.live/ai_agents/deployments/introduction" rel="noopener noreferrer"&gt;deploy your AI Agents.&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Explore more: Check out the &lt;a href="https://dub.sh/gGWIuJh" rel="noopener noreferrer"&gt;VideoSDK documentation&lt;/a&gt; for more features.&lt;/li&gt;
&lt;li&gt;👉 Share your thoughts, roadblocks, or success stories in the comments or join our &lt;a href="https://dub.sh/yDV95i6" rel="noopener noreferrer"&gt;Discord community &lt;/a&gt;↗. We’re excited to learn from your journey and help you build even better AI-powered communication tools!&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webdev</category>
      <category>python</category>
      <category>sip</category>
      <category>agents</category>
    </item>
    <item>
      <title>How to enable preemptive response in AI Voice Agents</title>
      <dc:creator>Chaitrali Kakde</dc:creator>
      <pubDate>Mon, 29 Dec 2025 11:35:17 +0000</pubDate>
      <link>https://dev.to/chaitrali_kakde/how-to-enable-preemptive-response-in-ai-voice-agents-4j75</link>
      <guid>https://dev.to/chaitrali_kakde/how-to-enable-preemptive-response-in-ai-voice-agents-4j75</guid>
      <description>&lt;p&gt;When it comes to voice AI, the real challenge isn’t speed it’s timing&lt;/p&gt;

&lt;p&gt;A response that arrives a second too late feels unnatural. That tiny pause is enough to remind users they’re talking to a machine. Humans don’t wait for sentences to end; we anticipate intent and respond at the right moment. Traditional voice agents don’t: they wait for silence, and that’s what makes conversations feel slow.&lt;/p&gt;

&lt;p&gt;Preemptive Response fixes this by letting voice agents start understanding and preparing responses while the user is still speaking.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Preemptive Response?
&lt;/h2&gt;

&lt;p&gt;Preemptive Response is a capability that allows a voice agent to start understanding a user’s intent before they finish speaking.&lt;/p&gt;

&lt;p&gt;As the user talks, the Speech-to-Text engine emits partial transcripts in real time. These partial results are enough for the agent to begin reasoning early, instead of waiting for the full sentence and a moment of silence.&lt;/p&gt;

&lt;p&gt;The goal isn’t to interrupt the user; it’s to be ready at the right moment.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Preemptive Response Works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fywycvw5t3p8rkzzzsvuu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fywycvw5t3p8rkzzzsvuu.png" alt=" " width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User audio is streamed to the STT, which generates partial transcripts.&lt;/li&gt;
&lt;li&gt;These partial transcripts are immediately sent to the LLM to enable preemptive (early) responses.&lt;/li&gt;
&lt;li&gt;The LLM output is then passed to the TTS to generate the spoken response.&lt;/li&gt;
&lt;/ul&gt;
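&lt;p&gt;The effect of this pipeline is easiest to see on a toy timeline. The sketch below fakes the moving parts with &lt;code&gt;asyncio&lt;/code&gt; sleeps (all names and timings are invented for illustration; no SDK or network involved): the reactive agent waits for the full utterance before calling the LLM, while the preemptive one restarts a draft on every partial and is already mid-inference when speech ends.&lt;/p&gt;

```python
import asyncio
import time

PARTIALS = ["what is", "what is the", "what is the weather"]


async def fake_llm(prompt: str) -> str:
    # Stand-in for an LLM call with ~0.3 s of model latency.
    await asyncio.sleep(0.3)
    return f"answer to: {prompt}"


async def reactive_agent() -> float:
    # Waits for the whole utterance (one partial every 0.1 s), then reasons.
    start = time.perf_counter()
    await asyncio.sleep(0.1 * len(PARTIALS))  # user still speaking
    await fake_llm(PARTIALS[-1])
    return time.perf_counter() - start


async def preemptive_agent() -> float:
    # Starts reasoning on each partial, restarting when the transcript grows.
    start = time.perf_counter()
    task = None
    for partial in PARTIALS:
        if task:
            task.cancel()  # transcript changed; drop the stale draft
        task = asyncio.create_task(fake_llm(partial))
        await asyncio.sleep(0.1)  # wait for the next partial
    await task  # the last partial equals the final transcript here
    return time.perf_counter() - start


reactive = asyncio.run(reactive_agent())
preemptive = asyncio.run(preemptive_agent())
print(f"reactive: {reactive:.2f}s  preemptive: {preemptive:.2f}s")
```

&lt;p&gt;With three partials 100 ms apart and a 300 ms model call, the preemptive run finishes roughly 100 ms sooner; in a real agent the savings grow with utterance length, because inference overlaps speech instead of following it.&lt;/p&gt;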

&lt;h2&gt;
  
  
  Enabling Preemptive Response
&lt;/h2&gt;

&lt;p&gt;To enable this feature, set the enable_preemptive_generation flag to True when initializing your STT plugin (e.g., DeepgramSTTV2).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.plugins.deepgram&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DeepgramSTTV2&lt;/span&gt;

&lt;span class="n"&gt;stt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DeepgramSTTV2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;enable_preemptive_generation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once enabled, partial transcripts start flowing automatically and your agent begins preparing responses earlier by design.&lt;/p&gt;

&lt;p&gt;Currently, preemptive response generation is limited to Deepgram’s STT implementation and is available only in the Flux model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A VideoSDK authentication token (generate one from &lt;a href="https://thrive.zohopublic.in/aref/ccUcTqaXyb/KJotBDtEI" rel="noopener noreferrer"&gt;app.videosdk.live&lt;/a&gt;; follow the guide to generate a VideoSDK token)&lt;/li&gt;
&lt;li&gt;A &lt;a href="https://thrive.zohopublic.in/aref/ccUcTqaXyb/KJotBDtEI" rel="noopener noreferrer"&gt;VideoSDK meeting ID &lt;/a&gt;(you can generate one using the Create Room API or through the &lt;a href="https://thrive.zohopublic.in/aref/ccUcTqaXyb/KJotBDtEI" rel="noopener noreferrer"&gt;VideoSDK dashboard&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Python 3.12 or higher&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Install dependencies&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Want to use a different provider? Check out our plugins for &lt;a href="https://docs.videosdk.live/ai_agents/plugins/stt/openai" rel="noopener noreferrer"&gt;STT&lt;/a&gt;, &lt;a href="https://docs.videosdk.live/ai_agents/plugins/llm/openai" rel="noopener noreferrer"&gt;LLM&lt;/a&gt;, and &lt;a href="https://docs.videosdk.live/ai_agents/plugins/tts/eleven-labs" rel="noopener noreferrer"&gt;TTS&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Set API Keys in .env&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;DEEPGRAM_API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Your Deepgram API Key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;OPENAI_API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Your OpenAI API Key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;ELEVENLABS_API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Your ElevenLabs API Key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;VIDEOSDK_AUTH_TOKEN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VideoSDK Auth token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;API Keys&lt;/strong&gt; - Get API keys from &lt;a href="https://console.deepgram.com/" rel="noopener noreferrer"&gt;Deepgram ↗&lt;/a&gt;, &lt;a href="https://platform.openai.com/api-keys" rel="noopener noreferrer"&gt;OpenAI ↗&lt;/a&gt;, &lt;a href="https://elevenlabs.io/app/settings/api-keys" rel="noopener noreferrer"&gt;ElevenLabs ↗&lt;/a&gt; &amp;amp; the &lt;a href="https://app.videosdk.live" rel="noopener noreferrer"&gt;VideoSDK Dashboard ↗&lt;/a&gt;, then follow the guide to &lt;a href="https://docs.videosdk.live/ai_agents/authentication-and-token" rel="noopener noreferrer"&gt;generate a VideoSDK token&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Full Working Example
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AgentSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CascadingPipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RoomOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;WorkerJob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ConversationFlow&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.plugins.silero&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SileroVAD&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.plugins.turn_detector&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TurnDetector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pre_download_model&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.plugins.deepgram&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DeepgramSTTV2&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.plugins.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAILLM&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;videosdk.plugins.elevenlabs&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ElevenLabsTTS&lt;/span&gt;

&lt;span class="c1"&gt;# Pre-download the Turn Detector model to avoid delays during startup
&lt;/span&gt;&lt;span class="nf"&gt;pre_download_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyVoiceAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful voice assistant that can answer questions and help with tasks.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_enter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;say&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello! How can I help you today?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;say&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Goodbye!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Create the agent and conversation flow
&lt;/span&gt;    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MyVoiceAgent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;conversation_flow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConversationFlow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Define the pipeline with Preemptive Generation enabled
&lt;/span&gt;    &lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CascadingPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;stt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;DeepgramSTTV2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flux-general-en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;enable_preemptive_generation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# Enable low-latency partials
&lt;/span&gt;        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;OpenAILLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;tts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ElevenLabsTTS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eleven_flash_v2_5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;vad&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;SileroVAD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.35&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;turn_detector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;TurnDetector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Initialize the session
&lt;/span&gt;    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;conversation_flow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;conversation_flow&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="c1"&gt;# Keep the session running
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Clean up resources
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shutdown&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;make_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;room_options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RoomOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VideoSDK Cascaded Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;playground&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;room_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;room_options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WorkerJob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entrypoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;start_session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jobctx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;make_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Run the Python script&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also run the script in console mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt; &lt;span class="n"&gt;console&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Preemptive Response enabled, the voice agent no longer waits for speech to end. It begins processing intent as audio arrives, reducing latency and keeping conversations natural. The result is a responsive, end-to-end voice experience that feels fluid in real time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Explore the &lt;a href="https://dub.sh/WRhwbz3" rel="noopener noreferrer"&gt;preemptive-response-docs&lt;/a&gt; for more information.&lt;/li&gt;
&lt;li&gt;Learn how to &lt;a href="https://dub.sh/GB2p2A4" rel="noopener noreferrer"&gt;deploy your AI Agents.&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Visit &lt;a href="https://dub.sh/StzJRCB" rel="noopener noreferrer"&gt;Deepgram's flux documentation.&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>deepgram</category>
      <category>voiceai</category>
      <category>webdev</category>
    </item>
    <item>
      <title>How to enable DTMF Events in Telephony AI Agent</title>
      <dc:creator>Chaitrali Kakde</dc:creator>
      <pubDate>Mon, 29 Dec 2025 11:07:35 +0000</pubDate>
      <link>https://dev.to/chaitrali_kakde/how-to-enable-dtmf-events-in-telephony-ai-agent-h5n</link>
      <guid>https://dev.to/chaitrali_kakde/how-to-enable-dtmf-events-in-telephony-ai-agent-h5n</guid>
      <description>&lt;p&gt;Not every caller wants to speak to a voice agent. In many call scenarios, users expect to press a key to make a selection, confirm an action, or move forward in a call flow. This is especially common in menu-based systems, short responses, or situations where speech recognition may not be reliable&lt;/p&gt;

&lt;p&gt;DTMF (Dual-Tone Multi-Frequency) input gives voice agents a clear and predictable way to handle these interactions. When a caller presses a key on their phone, the agent receives that input instantly and can use it to control the call flow or trigger application logic.&lt;/p&gt;

&lt;p&gt;In this post, we’ll explore how DTMF events can be used in a VideoSDK-powered voice agent, starting from common interaction patterns and moving into how the system processes keypad input in real time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Typical Interaction Patterns Using DTMF
&lt;/h2&gt;

&lt;p&gt;DTMF input is commonly used at decision points in a call, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Selecting options from a call menu&lt;/li&gt;
&lt;li&gt;Confirming or canceling an action&lt;/li&gt;
&lt;li&gt;Providing short numeric input&lt;/li&gt;
&lt;li&gt;Navigating between steps in a call flow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These interactions are simple, fast, and familiar to callers, which makes them a good fit for structured voice experiences.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DTMF Event Detection&lt;/strong&gt;: The agent detects key presses (0–9, *, #) from the caller during a call session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Processing&lt;/strong&gt;: Each key press generates a DTMF event that is delivered to the agent immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Callback Integration&lt;/strong&gt;: A user-defined callback function handles incoming DTMF events.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action Execution&lt;/strong&gt;: The agent executes actions or triggers workflows based on the received DTMF input like building IVR flows, collecting user input, or triggering actions in your application.&lt;/li&gt;
&lt;/ul&gt;
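&lt;p&gt;The "Action Execution" step above often boils down to a lookup from key press to call-flow step. Here is an illustrative fragment (not the VideoSDK API) showing that mapping with a fallback for unmapped keys:&lt;/p&gt;

```python
# Illustrative DTMF routing table (not the VideoSDK API): map key presses
# (0-9, *, #) to call-flow steps, with a fallback for unmapped keys.
MENU = {
    "1": "sales",
    "2": "support",
    "0": "operator",
    "#": "main_menu",
}

def route_dtmf(key: str) -> str:
    """Return the next call-flow step for a key press."""
    return MENU.get(key, "invalid")

print(route_dtmf("1"))   # sales
print(route_dtmf("9"))   # invalid
```

&lt;p&gt;A table like this keeps the menu logic declarative: adding a new branch is a one-line change rather than another elif in the callback.&lt;/p&gt;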

&lt;h2&gt;
  
  
  Step 1: Enabling DTMF Events
&lt;/h2&gt;

&lt;p&gt;DTMF event detection can be enabled in two ways:&lt;/p&gt;

&lt;h2&gt;
  
  
  Via Dashboard:
&lt;/h2&gt;

&lt;p&gt;When creating or editing a SIP gateway in the &lt;a href="https://thrive.zohopublic.in/aref/ccUcTqaXyb/KJotBDtEI" rel="noopener noreferrer"&gt;VideoSDK dashboard&lt;/a&gt;, enable the DTMF option.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkgcd6ltlnc3wliilp1r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkgcd6ltlnc3wliilp1r.png" alt=" " width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Via API:
&lt;/h2&gt;

&lt;p&gt;Set the enableDtmf parameter to true when creating or updating a SIP gateway using the API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl    -H 'Authorization: $YOUR_TOKEN' \ 
  -H 'Content-Type: application/json' \ 
  -d '{
    "name" : "Twilio Inbound Gateway",
    "enableDtmf" : "true",
    "numbers" : ["+0123456789"]

  }' \ 
  -XPOST https://api.videosdk.live/v2/sip/inbound-gateways
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once enabled, DTMF events will be detected and published for all calls routed through that gateway.&lt;/p&gt;
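&lt;p&gt;Since the rest of this post uses Python, here is the same gateway-creation request expressed with the standard library. Only the endpoint and JSON fields come from the curl example above; the helper names are illustrative.&lt;/p&gt;

```python
import json
import urllib.request

def gateway_payload(name: str, numbers: list, enable_dtmf: bool = True) -> dict:
    """JSON body for POST /v2/sip/inbound-gateways (fields from the curl example)."""
    return {"name": name, "enableDtmf": enable_dtmf, "numbers": numbers}

def create_inbound_gateway(token: str, payload: dict) -> dict:
    """Send the create-gateway request; `token` is your VideoSDK auth token."""
    req = urllib.request.Request(
        "https://api.videosdk.live/v2/sip/inbound-gateways",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": token, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:  # performs the network call
        return json.loads(resp.read())

payload = gateway_payload("Twilio Inbound Gateway", ["+0123456789"])
# create_inbound_gateway("YOUR_VIDEOSDK_AUTH_TOKEN", payload)  # needs a real token
```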

&lt;h2&gt;
  
  
  Step 2: Implementation
&lt;/h2&gt;

&lt;p&gt;To set up inbound calls, outbound calls, and routing rules, check out the Quick Start Example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from videosdk.agents import AgentSession, DTMFHandler

async def entrypoint(ctx: JobContext):

    async def dtmf_callback(digit: int):
        if digit == 1:
            agent.instructions = "You are a Sales Representative. Your goal is to sell our products"
            await agent.session.say(
                "Routing you to Sales. Hi, I'm from Sales. How can I help you today?"
            )
        elif digit == 2:
            agent.instructions = "You are a Support Specialist. Your goal is to help customers with technical issues."
            await agent.session.say(
                "Routing you to Support. Hi, I'm from Support. What issue are you facing?"
            )
        else:
            await agent.session.say(
                "Invalid input. Press 1 for Sales or 2 for Support."
            )

    dtmf_handler = DTMFHandler(dtmf_callback)

    session = AgentSession(
        dtmf_handler = dtmf_handler,
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Full Working Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import logging
from videosdk.agents import Agent, AgentSession, CascadingPipeline, WorkerJob, ConversationFlow, JobContext, RoomOptions, Options, DTMFHandler
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", handlers=[logging.StreamHandler()])
pre_download_model()
class VoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant that can answer questions."
        )
    async def on_enter(self) -&amp;gt; None:
        await self.session.say("Hello, how can I help you today?")

    async def on_exit(self) -&amp;gt; None:
        await self.session.say("Goodbye!")

async def entrypoint(ctx: JobContext):

    agent = VoiceAgent()
    conversation_flow = ConversationFlow(agent)

    pipeline=CascadingPipeline(
        stt=DeepgramSTT(),
        llm=OpenAILLM(),
        tts=ElevenLabsTTS(),
        vad=SileroVAD(),
        turn_detector=TurnDetector()
    )

    async def dtmf_callback(message):
        print("DTMF message received:", message)

    dtmf_handler = DTMFHandler(dtmf_callback)

    session = AgentSession(
        agent=agent, 
        pipeline=pipeline,
        conversation_flow=conversation_flow,
        dtmf_handler = dtmf_handler,
    )

    await session.start(wait_for_participant=True, run_until_shutdown=True)

def make_context() -&amp;gt; JobContext:
    room_options = RoomOptions(name="DTMF Agent Test", playground=True)
    return JobContext(room_options=room_options) 

if __name__ == "__main__":
    job = WorkerJob(entrypoint=entrypoint, jobctx=make_context, options=Options(agent_id="YOUR_AGENT_ID", max_processes=2, register=True, host="localhost", port=8081))
    job.start()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By enabling DTMF detection and handling events at the agent level, you can build predictable call flows, guide users through menus, and trigger application logic without interrupting the call experience. When combined with voice input, DTMF gives you more control over how users interact with your agent.&lt;/p&gt;

&lt;p&gt;This makes DTMF a practical addition to any voice agent that needs clear, deterministic user input during a call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources and Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Explore the &lt;a href="https://github.com/videosdk-live/agents-quickstart/blob/main/DTMF%20Handler/dtmf_handler.py" rel="noopener noreferrer"&gt;dtmf-implementation&lt;/a&gt; on github.&lt;/li&gt;
&lt;li&gt;To set up inbound calls, outbound calls, and routing rules, check out the &lt;a href="https://docs.videosdk.live/telephony/ai-telephony-agent-quick-start#part-2-connect-your-agent-to-the-phone-network" rel="noopener noreferrer"&gt;Quick Start Example.&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Learn how to &lt;a href="https://docs.videosdk.live/ai_agents/deployments/introduction" rel="noopener noreferrer"&gt;deploy your AI Agents.&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Explore more: Check out the &lt;a href="https://docs.videosdk.live/telephony/introduction" rel="noopener noreferrer"&gt;VideoSDK documentation&lt;/a&gt; for more features.&lt;/li&gt;
&lt;li&gt;👉 Share your thoughts, roadblocks, or success stories in the comments or join our &lt;a href="https://dub.sh/yDV95i6" rel="noopener noreferrer"&gt;Discord community ↗&lt;/a&gt;. We’re excited to learn from your journey and help you build even better AI-powered communication tools!&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Transfer Calls in AI Voice Agents Using SIP Telephony</title>
      <dc:creator>Chaitrali Kakde</dc:creator>
      <pubDate>Mon, 29 Dec 2025 10:47:01 +0000</pubDate>
      <link>https://dev.to/chaitrali_kakde/how-to-transfer-calls-in-ai-voice-agents-using-sip-telephony-533a</link>
      <guid>https://dev.to/chaitrali_kakde/how-to-transfer-calls-in-ai-voice-agents-using-sip-telephony-533a</guid>
      <description>&lt;p&gt;When someone calls a business, they’re usually looking for a quick and clear resolution not a conversation with a bot that can’t help them move forward. Sometimes it’s a billing issue. Sometimes it’s a sales inquiry. And sometimes, they just want to speak to a real person&lt;/p&gt;

&lt;p&gt;A well-designed AI voice agent knows when to assist and when to step aside. That’s where Call Transfer in VideoSDK comes in. It allows your AI agent to transfer an ongoing SIP call to the right person, without disconnecting the caller or breaking the flow of the conversation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Call Transfer
&lt;/h2&gt;

&lt;p&gt;Call Transfer enables an AI agent to move an active SIP call to another phone number without ending the session. From the caller’s perspective, the transition is automatic, the call continues without dropping, there’s no need to redial, and the conversation flows naturally without awkward pauses or repeated explanations.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The agent evaluates the user’s intent to determine when a call transfer is required and then triggers the function tool.&lt;/li&gt;
&lt;li&gt;When the function tool is triggered, it tells the system to move the call to another phone number.&lt;/li&gt;
&lt;li&gt;The ongoing SIP call is forwarded to the new number instantly, without disconnecting or redialing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How To Trigger Call Transfer
&lt;/h2&gt;

&lt;p&gt;To set up incoming call handling, outbound calling, and routing rules, check out the &lt;a href="https://docs.videosdk.live/telephony/ai-telephony-agent-quick-start#part-2-connect-your-agent-to-the-phone-network" rel="noopener noreferrer"&gt;Quick Start Example&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from videosdk.agents import Agent, function_tool, 

class CallTransferAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are the Call Transfer Agent. Help transfer ongoing calls to a new number using the transfer_call tool.",
        )

    async def on_enter(self) -&amp;gt; None:
        await self.session.say("Hello Buddy, How can I help you today?")

    async def on_exit(self) -&amp;gt; None:
        await self.session.say("Goodbye Buddy, Thank you for calling!")

    @function_tool
    async def transfer_call(self) -&amp;gt; None:
        """Transfer the call to the provided number"""
        token = os.getenv("VIDEOSDK_AUTH_TOKEN")
        transfer_to = os.getenv("CALL_TRANSFER_TO")
        return await self.session.call_transfer(token, transfer_to)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Full Working Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import logging
import os

from videosdk.agents import (Agent, AgentSession, CascadingPipeline, function_tool, WorkerJob, ConversationFlow, JobContext, RoomOptions, Options)
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.google import GoogleLLM
from videosdk.plugins.cartesia import CartesiaTTS
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    handlers=[logging.StreamHandler()]
)

# Pre-download turn detector model
pre_download_model()


class CallTransferAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions=(
                "You are the Call Transfer Agent. "
                "Help transfer ongoing calls to a new number using the transfer_call tool."
            )
        )

    async def on_enter(self) -&amp;gt; None:
        await self.session.say("Hello Buddy, How can I help you today?")

    async def on_exit(self) -&amp;gt; None:
        await self.session.say("Goodbye Buddy, Thank you for calling!")

    @function_tool
    async def transfer_call(self) -&amp;gt; None:
        """Transfer the call to the provided number"""
        token = os.getenv("VIDEOSDK_AUTH_TOKEN")
        transfer_to = os.getenv("CALL_TRANSFER_TO")
        return await self.session.call_transfer(token, transfer_to)


async def entrypoint(ctx: JobContext):
    agent = CallTransferAgent()
    conversation_flow = ConversationFlow(agent)

    pipeline = CascadingPipeline(
        stt=DeepgramSTT(),
        llm=GoogleLLM(),
        tts=CartesiaTTS(),
        vad=SileroVAD(),
        turn_detector=TurnDetector()
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    await session.start(wait_for_participant=True, run_until_shutdown=True)


def make_context() -&amp;gt; JobContext:
    room_options = RoomOptions(name="Call Transfer Agent", playground=True)
    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(
        entrypoint=entrypoint,
        jobctx=make_context,
        options=Options(
            agent_id="YOUR_AGENT_ID",
            register=True,
            host="localhost",
            port=8081
        )
    )
    job.start()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
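&lt;p&gt;With the environment variables exported, the agent starts like any Python script (assuming the file above is saved as &lt;code&gt;call_transfer.py&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python call_transfer.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Since &lt;code&gt;RoomOptions&lt;/code&gt; is created with &lt;code&gt;playground=True&lt;/code&gt;, you should be able to try the agent in VideoSDK's browser playground before wiring it up to a real SIP number.&lt;/p&gt;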



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Call Transfer transforms your AI voice agent from a simple responder into a capable call-handling system. By automatically routing ongoing SIP calls to the right person, it ensures users never experience dropped calls or awkward handoffs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources and Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Visit the &lt;a href="https://thrive.zohopublic.in/aref/ccUcTqaXyb/8ljdhUfmV" rel="noopener noreferrer"&gt;VideoSDK dashboard&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Explore the &lt;a href="https://github.com/videosdk-live/agents-quickstart/blob/main/Call%20Transfer/call_transfer.py" rel="noopener noreferrer"&gt;call transfer implementation&lt;/a&gt; on GitHub.&lt;/li&gt;
&lt;li&gt;To set up inbound calls, outbound calls, and routing rules, check out the &lt;a href="https://docs.videosdk.live/telephony/ai-telephony-agent-quick-start#part-2-connect-your-agent-to-the-phone-network" rel="noopener noreferrer"&gt;Quick Start Example.&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Learn how to &lt;a href="https://docs.videosdk.live/ai_agents/deployments/introduction" rel="noopener noreferrer"&gt;deploy your AI Agents.&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Explore more: Check out the &lt;a href="https://docs.videosdk.live/telephony/introduction" rel="noopener noreferrer"&gt;VideoSDK documentation&lt;/a&gt; for more features.&lt;/li&gt;
&lt;li&gt;👉 Share your thoughts, roadblocks, or success stories in the comments or join our &lt;a href="https://dub.sh/yDV95i6" rel="noopener noreferrer"&gt;Discord community ↗&lt;/a&gt;. We’re excited to learn from your journey and help you build even better AI-powered communication tools!&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>agents</category>
      <category>sip</category>
    </item>
  </channel>
</rss>
