Behram

Posted on Jan 16

Gemini 2.5 Native Audio + LiveKit: A Production Setup Guide

#livekit #gemini #voiceai #webrtc

10-minute tutorials make voice AI look simple:

const token = await getToken(roomName);
await room.connect(url, token);
// Done! ✨

Production reality is different:

Who creates the room?
How does the agent identify users?
What happens when failures occur?
How do you prevent duplicate agents?

After running 1,000+ voice AI sessions in production with LiveKit + Gemini Realtime on Next.js and Cloud Run, here's what the tutorials skip.

This article covers: JWT auth, auto-dispatch patterns, audio subscription timing, greeting guards, and production robustness.

Prerequisites: Basic Next.js, LiveKit Cloud account, Firebase Auth configured.

The Complete Flow (11 Steps)

Before diving into code, understand the full sequence:

┌─────────────┐
│   Browser   │  1. User clicks "Start Interview"
└──────┬──────┘
       │ 2. POST /api/start-bot
       │    Authorization: Bearer <firebase-token>
       ▼
┌─────────────────┐
│  Next.js API    │  3. Verify token with Firebase
│ /api/start-bot  │  4. Create LiveKit room
└────────┬────────┘  5. Generate JWT + metadata
         │           6. Return token to browser
         ▼
┌───────────────────────┐
│   LiveKit Cloud       │  7. Browser connects
│ wss://your.livekit... │  8. Auto-dispatch agent
└───────────────────────┘
         │
         ▼
┌───────────────────────┐
│  Cloud Run Worker     │  9. entrypoint() runs
│  (Voice Agent)        │  10. Load user state from DB
└───────────────────────┘  11. Generate personalized greeting

This isn't just "connect to a room"—it's an orchestrated sequence where order matters.

Let's break down each part with the mistakes I made.

Part 1: The JWT Token Factory (Backend API)

❌ Mistake #1: Exposing Credentials to Browser

I've seen this in production codebases:

// NEVER DO THIS
const token = new AccessToken(
  process.env.NEXT_PUBLIC_LIVEKIT_API_KEY,  // ❌ EXPOSED IN BROWSER
  process.env.NEXT_PUBLIC_LIVEKIT_SECRET,   // ❌ SECURITY DISASTER
);

Why it's terrible:

Anyone can create tokens for ANY room
Attackers can impersonate users
Zero authentication

✅ The Right Way: Server-Side Token Generation

Create a Next.js API route that verifies the user BEFORE creating tokens:

// frontend/src/app/api/start-bot/route.ts

import { NextRequest, NextResponse } from 'next/server';
import { RoomServiceClient, AccessToken } from 'livekit-server-sdk';
import { adminAuth } from '@/lib/firebase-admin';

export async function POST(req: NextRequest) {
  // STEP 1: Verify Firebase token (The Gatekeeper)
  const authHeader = req.headers.get('Authorization');
  if (!authHeader?.startsWith('Bearer ')) {
    return NextResponse.json({ error: 'Unauthorized' }, { status: 401 });
  }

  const idToken = authHeader.split('Bearer ')[1];

  let user_id: string;
  try {
    // Server-side verification with Firebase Admin SDK
    const decodedToken = await adminAuth.verifyIdToken(idToken);
    user_id = decodedToken.uid;  // ⭐ REAL verified user ID
  } catch (authError) {
    return NextResponse.json({ error: 'Invalid Token' }, { status: 401 });
  }

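  // STEP 2: LiveKit setup
  const livekitUrl = process.env.LIVEKIT_URL;
  const apiKey = process.env.LIVEKIT_API_KEY;
  const apiSecret = process.env.LIVEKIT_API_SECRET;

  // Fail fast if the server is misconfigured (also narrows the types to string)
  if (!livekitUrl || !apiKey || !apiSecret) {
    return NextResponse.json({ error: 'Server misconfigured' }, { status: 500 });
  }

  const roomService = new RoomServiceClient(livekitUrl, apiKey, apiSecret);

  // STEP 3: Create room
  // Unguessable name from the Web Crypto global; Math.random() is not cryptographically secure
  const room_name = `room-${crypto.randomUUID()}`;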

  await roomService.createRoom({
    name: room_name,
    emptyTimeout: 60,      // Auto-cleanup after 60s
    maxParticipants: 2,    // 1 user + 1 agent only
  });

  // STEP 4: Generate token with verified identity
  const at = new AccessToken(apiKey, apiSecret, {
    identity: user_id,  // ⭐ Embedded in token
  });

  at.addGrant({
    roomJoin: true,
    room: room_name,
  });

  const token = await at.toJwt();

  return NextResponse.json({
    success: true,
    token,
    room_name,
    identity: user_id
  });
}

Key Decisions:

✅ Server generates tokens (credentials never touch browser)

✅ Firebase verifies user BEFORE creating room

✅ User ID embedded as identity in LiveKit token

✅ Random room names (users can't guess existing rooms)
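
To sanity-check what the route above hands to the browser, it can help to decode the token payload. The sketch below is Python (the language of the agent worker) and uses only the standard library. It builds a stand-in HS256 token instead of calling the LiveKit SDK, and the key, secret, UID, and room name are made up. It assumes LiveKit's documented claim layout, where the identity travels in `sub` and the grants sit under `video`.

```python
import base64
import hashlib
import hmac
import json


def b64url(raw: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Decode base64url, restoring the padding that JWTs strip."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(claims: dict, secret: str) -> str:
    """Build a stand-in HS256 JWT (illustration only, not the LiveKit SDK)."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    signature = hmac.new(secret.encode(), signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(signature)}"


def verify_and_decode(token: str, secret: str) -> dict:
    """Check the signature, then return the payload claims."""
    header, payload, signature = token.split(".")
    expected = hmac.new(
        secret.encode(), f"{header}.{payload}".encode(), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, b64url_decode(signature)):
        raise ValueError("bad signature")
    return json.loads(b64url_decode(payload))


# Hypothetical values standing in for the real API key, secret, and Firebase UID.
demo_claims = {
    "iss": "APIdemoKey",
    "sub": "firebase-uid-123",
    "video": {"roomJoin": True, "room": "room-abc123"},
}
demo_token = sign_hs256(demo_claims, "demo-secret")
decoded = verify_and_decode(demo_token, "demo-secret")
```

Anyone holding the token can read these claims, but changing the identity or room invalidates the signature unless they also hold the API secret. That is why the secret has to stay on the server.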

Part 2: Explicit vs Auto-Dispatch

Now that you have a room and token, how does the agent join?

Pattern A: Manual Dispatch (What I Used First)

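// Create token
const token = await at.toJwt();

// Separately dispatch the agent (AgentDispatchClient is exported by livekit-server-sdk)
const dispatchClient = new AgentDispatchClient(livekitUrl, apiKey, apiSecret);
await dispatchClient.createDispatch(room_name, 'my-agent');

return { token };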

Problems I Hit:

Requires 2 API calls to LiveKit
Race condition: User joins before agent
If dispatch fails, user sits in empty room

Pattern B: Auto-Dispatch via JWT ⭐ (Production Choice)

import { RoomConfiguration, RoomAgentDispatch } from '@livekit/protocol';

// Attach dispatch config to user's token
at.roomConfig = new RoomConfiguration({
  agents: [
    new RoomAgentDispatch({
      agentName: 'noah-voice-agent',
    }),
  ],
});

const token = await at.toJwt();

Why it's better:

✅ Atomic: Agent dispatch happens when user joins

✅ LiveKit handles retries (more reliable)

✅ One API call instead of two

✅ No race conditions

Real Impact: Reduced agent dispatch failures from ~5% to <0.1%.
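
A dispatch entry can also carry a metadata string, which the worker sees on the job (`ctx.job.metadata` in the Python agents SDK). The post does not show how its metadata is structured, so the JSON shape and the `interview_type` key below are assumptions for illustration. The point of the sketch is only that the worker should tolerate missing or malformed metadata instead of crashing at startup.

```python
import json


def parse_dispatch_metadata(raw):
    """Return dispatch metadata as a dict, or {} when it is missing or malformed."""
    if not raw:
        return {}
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a ValueError subclass
        return {}
    # Only a JSON object is useful here; lists, numbers, etc. are ignored
    return parsed if isinstance(parsed, dict) else {}


# Hypothetical payloads: one well-formed, one empty, one broken
good = parse_dispatch_metadata('{"interview_type": "onboarding"}')
empty = parse_dispatch_metadata("")
broken = parse_dispatch_metadata("{not json")
```

Note that dispatch by name only reaches a worker registered under the same agent name, so `noah-voice-agent` has to match on both sides.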

Part 3: Agent Entry Point - The Critical Pattern

Your agent receives a job when the user joins. Here's where most tutorials fail you.

❌ Mistake #2: Waiting for Participant Before Starting Session

This looks logical but breaks in production:

async def entrypoint(ctx: JobContext):
    await ctx.connect()

    # Wait for user to appear
    participant = await ctx.wait_for_participant()
    user_id = participant.identity

    # Load their data from database
    data = await db.load_user_state(user_id)
    agent = MyAgent(user_id, data)

    # Start session
    session = AgentSession(llm=...)
    await session.start(room=ctx.room, agent=agent)  # ❌ TOO LATE!

Symptoms:

Agent connects but can't hear user
Logs show: subscribed=False
Works in local testing (lucky timing)
Fails in production randomly

Root Cause: Audio subscription happens inside session.start(). If you wait for participant identity first, you miss the subscription window.

✅ The Fix: Start Session First, Then Personalize

async def entrypoint(ctx: JobContext):
    # STEP 1: Create placeholder agent
    placeholder_db = DatabaseService("pending")
    agent = InterviewAgent("pending", placeholder_db, None)

    # STEP 2: Create session
    session = AgentSession(
        llm=google.realtime.RealtimeModel(
            model="gemini-live-2.5-flash-native-audio",
            voice="Puck",
            instructions=agent.instructions,
            vertexai=True,
        ),
    )

    # STEP 3: Start session IMMEDIATELY (SDK subscribes to audio here)
    await session.start(room=ctx.room, agent=agent)
    logger.info("✅ Session started - audio pipeline active")

    # STEP 4: NOW get participant (session already listening)
    participant = None
    for _ in range(30):  # 30s timeout
        if ctx.room.remote_participants:
            participant = list(ctx.room.remote_participants.values())[0]
            break
        await asyncio.sleep(1)

    if not participant:
        logger.error("⚠️ No participant after 30s")
        return

    user_id = participant.identity  # ⭐ This is Firebase UID from token
    logger.info(f"✅ User identity: {user_id}")

    # STEP 5: Hydrate agent with real data
    agent.user_id = user_id
    agent.db_service = DatabaseService(user_id)

    try:
        initial_data = await agent.db_service.get_candidate_data()
    except Exception as e:
        logger.error(f"❌ DB failed: {e}")
        initial_data = {}  # Fallback to fresh session

    agent.initial_data = initial_data

    # STEP 6: Determine phase (new vs resume)
    agent.current_phase = agent._determine_initial_phase(initial_data)

    # STEP 7: Generate personalized greeting
    greeting = agent.get_greeting_instruction()
    await session.generate_reply(instructions=greeting)
    logger.info("✅ Greeting triggered")

The Pattern: Start → Listen → Identify → Personalize → Greet

Impact: 100% audio subscription success rate.
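
The participant wait in Step 4 can be pulled into a small reusable helper. This is a minimal sketch, not part of the LiveKit SDK: `wait_until`, `first_participant`, and `FakeRoom` are hypothetical names, and `FakeRoom` only mimics the `remote_participants` mapping so the example runs without a real room. It uses a monotonic deadline so the timeout stays accurate even if individual sleeps run long.

```python
import asyncio
import time


async def wait_until(predicate, timeout=30.0, interval=0.1):
    """Poll predicate() until it returns a truthy value or the timeout elapses.

    Returns the truthy value, or None on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = predicate()
        if value:
            return value
        if time.monotonic() >= deadline:
            return None
        await asyncio.sleep(interval)


class FakeRoom:
    """Stand-in for ctx.room; only mimics the remote_participants mapping."""

    def __init__(self):
        self.remote_participants = {}


def first_participant(room):
    """Return any remote participant, or None if the room is still empty."""
    return next(iter(room.remote_participants.values()), None)


async def demo():
    room = FakeRoom()

    async def join_later():
        # Simulate the user joining shortly after the agent started listening
        await asyncio.sleep(0.05)
        room.remote_participants["firebase-uid-123"] = "participant-object"

    joiner = asyncio.create_task(join_later())
    found = await wait_until(lambda: first_participant(room), timeout=1.0, interval=0.01)
    await joiner

    # A room nobody joins should time out and yield None
    empty_room = FakeRoom()
    missing = await wait_until(lambda: first_participant(empty_room), timeout=0.05, interval=0.01)
    return found, missing
```

In the entrypoint this would be called after `session.start()`, for example `participant = await wait_until(lambda: first_participant(ctx.room))`, keeping the Start → Listen → Identify order intact.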

Part 4: The Greeting Guard Pattern

Even with audio working, there's another trap.

❌ Mistake #3: Tools Fire During Greeting

What happens:

Agent: "Hi, I'm Noah, your AI career c—"
User: "Hello!" (eager)
Gemini: *calls process_user_response() mid-greeting*
Database: *saves garbage data*
Agent: *confused about conversation state*

✅ Solution: Greeting Guard Flag

class InterviewAgent(Agent):
    # Same constructor arguments as the entrypoint in Part 3
    def __init__(self, user_id, db_service, initial_data):
        super().__init__(instructions=SYSTEM_PROMPT)
        self.user_id = user_id
        self.db_service = db_service
        self.initial_data = initial_data
        self.greeting_complete = False  # ⭐ Start locked

    @function_tool()
    async def process_user_response(self, data: dict):
        # GUARD: Block tool execution during greeting
        if not self.greeting_complete:
            return "SYSTEM: Wait for greeting to complete. Do not process yet."

        # Normal logic continues here
        await self.db_service.save(data)
        ...

# Unlock tools after greeting completes
@session.on("agent_state_changed")
def on_state_changed(event):
    if event.old_state == "speaking" and event.new_state == "listening":
        if not agent.greeting_complete:
            agent.greeting_complete = True
            logger.info("✅ Greeting done, tools unlocked")

Why it works:

Tool calls that arrive mid-greeting return early instead of running
Tools unlock only after state transition: speaking → listening
No premature database writes

Real Data: Eliminated 100% of corrupted session starts.
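
The guard logic can be exercised on its own, without LiveKit or Gemini in the loop. The sketch below isolates it in a hypothetical `GreetingGuard` class; the state names mirror the `speaking` and `listening` values used above, and `save` stands in for the database write.

```python
class GreetingGuard:
    """Tracks whether the first speaking -> listening transition has happened."""

    def __init__(self):
        self.greeting_complete = False  # start locked

    def on_state_changed(self, old_state: str, new_state: str) -> None:
        # Only the end of a spoken turn unlocks tools; other transitions are ignored
        if old_state == "speaking" and new_state == "listening":
            self.greeting_complete = True

    def run_tool(self, save, data):
        """Call save(data) only once the greeting is done; otherwise refuse."""
        if not self.greeting_complete:
            return "SYSTEM: Wait for greeting to complete. Do not process yet."
        save(data)
        return "saved"


saved = []  # stand-in for the database
guard = GreetingGuard()

# Tool call during the greeting: refused, nothing written
early = guard.run_tool(saved.append, {"answer": "Hello!"})
saved_after_early = list(saved)

# An unrelated transition does not unlock the tools
guard.on_state_changed("initializing", "listening")
still_locked = not guard.greeting_complete

# Greeting finishes: tools unlock and the next call is saved
guard.on_state_changed("speaking", "listening")
late = guard.run_tool(saved.append, {"answer": "I'm a backend engineer"})
```

Keeping the flag logic this small makes it cheap to unit-test the exact transitions that unlock the tools.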

Part 5: Robustness Patterns

Production code needs fallbacks. Here are patterns from actual failures.

Pattern 1: Timeout with Fallback

# Don't wait forever for participant
participant = None
for attempt in range(30):
    if ctx.room.remote_participants:
        participant = list(ctx.room.remote_participants.values())[0]
        break
    await asyncio.sleep(1)

if not participant:
    # FALLBACK: Log and gracefully exit
    logger.error("⚠️ No participant after 30s")
    # generate_reply() takes keyword arguments only
    await session.generate_reply(
        instructions="Say: I can't hear anyone. Please refresh and try again."
    )
    return  # Exit cleanly

Pattern 2: Database Connection Fallback

try:
    initial_data = await db.get_candidate_data()
except Exception as e:
    logger.error(f"❌ DB connection failed: {e}")
    # FALLBACK: Start fresh session instead of crashing
    initial_data = {}
    # generate_reply() takes keyword arguments only
    await session.generate_reply(
        instructions="Say: Hi! Let's start fresh today. Tell me about your background."
    )
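
The try/except above covers a database that fails fast, but a slow database can stall the greeting just as badly. A minimal sketch of a time-bounded load, assuming a 3-second budget is acceptable; `load_state_or_default` is a hypothetical helper and the loaders in `demo` are stand-ins for `db.get_candidate_data`.

```python
import asyncio


async def load_state_or_default(loader, timeout=3.0):
    """Await loader() with a time limit.

    Returns (data, True) on success, or ({}, False) on any error or timeout.
    """
    try:
        return await asyncio.wait_for(loader(), timeout=timeout), True
    except Exception:  # includes asyncio.TimeoutError
        return {}, False


async def demo():
    async def ok():
        return {"phase": "resume"}

    async def slow():
        await asyncio.sleep(1)  # longer than the timeout used below
        return {"phase": "never"}

    async def broken():
        raise ConnectionError("db down")

    return (
        await load_state_or_default(ok),
        await load_state_or_default(slow, timeout=0.01),
        await load_state_or_default(broken),
    )
```

The boolean lets the caller pick between the personalized greeting and the "start fresh" greeting without inspecting the data itself.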

Pattern 3: Duplicate Agent Prevention

Problem: Health check fails → Cloud Run keeps retrying → Multiple agents speak simultaneously

Solution:

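import os
from http.server import HTTPServer

from dotenv import load_dotenv

# Don't override Cloud Run's PORT variable
load_dotenv(override=False)  # ⭐ Critical

# Health check server MUST bind to 0.0.0.0
# HealthCheckHandler: a BaseHTTPRequestHandler subclass that answers GET with 200
def start_health_server():
    port = int(os.getenv('PORT', 8080))
    httpd = HTTPServer(
        ('0.0.0.0', port),  # ⭐ NOT 'localhost'
        HealthCheckHandler
    )
    httpd.serve_forever()  # Blocks: run this in a background thread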

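Why: Cloud Run sends health checks and requests to the container on the port in $PORT, so the server must listen on all interfaces (0.0.0.0). A server bound to localhost is unreachable from outside the container, the check fails, and Cloud Run keeps restarting the instance, which is how zombie agents appear.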
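
For completeness, here is a self-contained version of the health server that can run next to the agent. It is a sketch under a few assumptions: the handler body, the daemon thread, and the `probe` helper are illustrative and not taken from the original project. The thread matters because `serve_forever()` blocks, so calling it directly would stop the agent worker from ever starting.

```python
import http.client
import os
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthCheckHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Any GET counts as a health probe in this sketch
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the logs


def start_health_server(port=None):
    """Serve health checks on all interfaces in a background daemon thread."""
    if port is None:
        port = int(os.getenv("PORT", "8080"))
    httpd = HTTPServer(("0.0.0.0", port), HealthCheckHandler)
    threading.Thread(target=httpd.serve_forever, daemon=True).start()
    return httpd


def probe(port):
    """Send one GET to the local server and return (status, body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=2)
    try:
        conn.request("GET", "/")
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()
```

Passing `port=0` asks the OS for a free port, which is handy in tests; in Cloud Run the default path reads `$PORT`.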

Part 6: Testing the Full Flow

Local Development Setup

# Terminal 1: Cloud SQL Proxy (if using Cloud SQL)
./cloud-sql-proxy PROJECT:REGION:INSTANCE --port 5432

# Terminal 2: Voice Agent
cd voice-agent
source .venv/bin/activate
python src/main.py dev

# Terminal 3: Frontend
cd frontend
npm run dev

Critical Test Cases

Happy Path: User joins → Agent greets → Conversation flows
User Refresh: User closes tab mid-call → Reconnects → Session resumes
Database Timeout: DB slow → Agent uses fallback greeting
No Participant: Room created but user never joins → Agent exits gracefully
Network Drop: User loses connection → Reconnects → Conversation continues

Lessons Learned: The DO/DON'T Checklist

DO ✅

Generate tokens server-side (never expose credentials)
Embed user identity in JWT token
Use auto-dispatch via roomConfig (more reliable than manual)
Start session FIRST, get identity AFTER
Guard tools during greeting with state flag
Add timeouts and fallbacks everywhere
Bind health server to 0.0.0.0 for Cloud Run
Test with database failures and network drops

DON'T ❌

Put LiveKit secrets in NEXT_PUBLIC_* variables
Wait for participant before starting session
Allow tools to execute during greeting
Assume database is always available
Skip health check server (Cloud Run requires it)
Override Cloud Run's $PORT environment variable
Deploy without testing the full flow locally first

From Demo to Production

The gap between LiveKit tutorials and production isn't just code—it's robustness thinking.

Tutorials assume happy paths (user joins, everything works)
Production has 10 failure modes per integration point

After 3 weeks of debugging these issues in production, I learned:

Audio subscription is timing-sensitive (start session first)
Auto-dispatch beats manual dispatch (atomic operations win)
State guards prevent race conditions (greeting flag pattern)
Fallbacks save user experience (DB down? Start fresh)
Health checks matter on Cloud Run (bind to 0.0.0.0)

What's Next?

In the next article, I'll cover:

Tool calling patterns with complex state machines
Database-backed session resume
The 80% CPU problem (and how I fixed it)

Building production voice AI?

Check my GitHub for code examples
👉 Subscribe to my Substack for more real-world patterns

What's your biggest voice AI integration pain point? Comment below. 👇
