DEV Community

Chaitrali Kakde
How to enable preemptive response in AI Voice Agents

When it comes to voice AI, the real challenge isn't speed; it's timing.

A response that arrives a second too late feels unnatural. That tiny pause is enough to remind users they're talking to a machine. Humans don't wait for sentences to end; we anticipate intent and respond at the right moment. Traditional voice agents don't. They wait for silence, and that's what makes conversations feel slow.

Preemptive Response fixes this by letting voice agents start understanding and preparing responses while the user is still speaking.

What Is Preemptive Response?

Preemptive Response is a capability that allows a voice agent to start understanding a user’s intent before they finish speaking.

As the user talks, the Speech-to-Text engine emits partial transcripts in real time. These partial results are enough for the agent to begin reasoning early, instead of waiting for the full sentence and a moment of silence.

The goal isn't to interrupt the user; it's to be ready at the right moment.

How Preemptive Response Works

  • User audio is streamed to the STT, which generates partial transcripts.
  • These partial transcripts are immediately sent to the LLM to enable preemptive (early) responses.
  • The LLM output is then passed to the TTS to generate the spoken response.
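The flow above can be sketched with plain asyncio. Everything in this snippet (the fake partial-transcript stream and the draft-answer step) is a hypothetical stand-in to illustrate the timing, not the VideoSDK API:

```python
import asyncio

async def stt_partials():
    # Stand-in for an STT stream emitting partial transcripts as audio arrives.
    for partial in ["what's the", "what's the weather", "what's the weather in Paris"]:
        await asyncio.sleep(0.05)  # audio arriving over time
        yield partial

async def run_turn() -> str:
    draft = ""
    async for partial in stt_partials():
        # Begin reasoning on every partial instead of waiting for silence.
        draft = f"Prepared answer for: {partial!r}"
    # End of turn: the already-prepared draft goes straight to TTS,
    # instead of the LLM starting from scratch here.
    return draft

print(asyncio.run(run_turn()))
```

The key point is that the "reasoning" work happens inside the loop, while audio is still streaming, so by the time the turn ends there is almost nothing left to wait for.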

Enabling Preemptive Response

To enable this feature, set the enable_preemptive_generation flag to True when initializing your STT plugin (e.g., DeepgramSTTV2).

from videosdk.plugins.deepgram import DeepgramSTTV2

stt = DeepgramSTTV2(
    enable_preemptive_generation=True
)

Once enabled, partial transcripts start flowing automatically and your agent begins preparing responses earlier by design.

Currently, preemptive response generation is limited to Deepgram’s STT implementation and is available only in the Flux model.

Implementation

Prerequisites

Install dependencies

pip install "videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]"

Want to use a different provider? Check out our plugins for STT, LLM, and TTS.

Set API Keys in .env

DEEPGRAM_API_KEY="Your Deepgram API Key"
OPENAI_API_KEY="Your OpenAI API Key"
ELEVENLABS_API_KEY="Your ElevenLabs API Key"
VIDEOSDK_AUTH_TOKEN="Your VideoSDK Auth Token"

Get your API keys from Deepgram ↗, OpenAI ↗, ElevenLabs ↗, and the VideoSDK Dashboard ↗, and follow the guide to generate a VideoSDK auth token.
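Before starting the agent, it can help to fail fast if a key is missing from the environment. This is a hypothetical helper, not something the SDK provides (tools such as python-dotenv can load the .env file itself):

```python
import os

REQUIRED_KEYS = ["DEEPGRAM_API_KEY", "OPENAI_API_KEY",
                 "ELEVENLABS_API_KEY", "VIDEOSDK_AUTH_TOKEN"]

def missing_keys(keys=REQUIRED_KEYS):
    # Return the names of any required keys absent from the environment.
    return [k for k in keys if not os.getenv(k)]

if missing_keys():
    print(f"Missing environment variables: {missing_keys()}")
```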

Full Working Example

import asyncio
import os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTTV2
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model to avoid delays during startup
pre_download_model()

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a helpful voice assistant that can answer questions and help with tasks.")

    async def on_enter(self):
        await self.session.say("Hello! How can I help you today?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # 1. Create the agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # 2. Define the pipeline with Preemptive Generation enabled
    pipeline = CascadingPipeline(
        stt=DeepgramSTTV2(
            model="flux-general-en",
            enable_preemptive_generation=True  # Enable low-latency partials
        ),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    # 3. Initialize the session
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running
        await asyncio.Event().wait()
    finally:
        # Clean up resources
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Run the Python Script

python main.py

You can also run the script in console mode:

python main.py console

With Preemptive Response enabled, the voice agent no longer waits for speech to end. It begins processing intent as audio arrives, reducing latency and keeping conversations natural. The result is a responsive, end-to-end voice experience that feels fluid in real time.
