10-minute tutorials make voice AI look simple:
const token = await getToken(roomName);
await room.connect(url, token);
// Done! ✨
Production reality is different:
- Who creates the room?
- How does the agent identify users?
- What happens when failures occur?
- How do you prevent duplicate agents?
After shipping 1000+ voice AI sessions using LiveKit + Gemini Realtime on Next.js/Cloud Run, here's what tutorials skip.
This article covers: JWT auth, auto-dispatch patterns, audio subscription timing, greeting guards, and production robustness.
Prerequisites: Basic Next.js, LiveKit Cloud account, Firebase Auth configured.
The Complete Flow (11 Steps)
Before diving into code, understand the full sequence:
┌─────────────┐
│ Browser │ 1. User clicks "Start Interview"
└──────┬──────┘
│ 2. POST /api/start-bot
│ Authorization: Bearer <firebase-token>
▼
┌─────────────────┐
│ Next.js API │ 3. Verify token with Firebase
│ /api/start-bot │ 4. Create LiveKit room
└────────┬────────┘ 5. Generate JWT + metadata
│ 6. Return token to browser
▼
┌───────────────────────┐
│ LiveKit Cloud │ 7. Browser connects
│ wss://your.livekit... │ 8. Auto-dispatch agent
└───────────────────────┘
│
▼
┌───────────────────────┐
│ Cloud Run Worker │ 9. entrypoint() runs
│ (Voice Agent) │ 10. Load user state from DB
└───────────────────────┘ 11. Generate personalized greeting
This isn't just "connect to a room"—it's an orchestrated sequence where order matters.
Let's break down each part with the mistakes I made.
Part 1: The JWT Token Factory (Backend API)
❌ Mistake #1: Exposing Credentials to Browser
I've seen this in production codebases:
// NEVER DO THIS
const token = new AccessToken(
  process.env.NEXT_PUBLIC_LIVEKIT_API_KEY, // ❌ EXPOSED IN BROWSER
  process.env.NEXT_PUBLIC_LIVEKIT_SECRET,  // ❌ SECURITY DISASTER
);
Why it's terrible:
- Anyone can create tokens for ANY room
- Attackers can impersonate users
- Zero authentication
✅ The Right Way: Server-Side Token Generation
Create a Next.js API route that verifies the user BEFORE creating tokens:
// frontend/src/app/api/start-bot/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { RoomServiceClient, AccessToken } from 'livekit-server-sdk';
import { randomUUID } from 'crypto';
import { adminAuth } from '@/lib/firebase-admin';

export async function POST(req: NextRequest) {
  // STEP 1: Verify Firebase token (The Gatekeeper)
  const authHeader = req.headers.get('Authorization');
  if (!authHeader?.startsWith('Bearer ')) {
    return NextResponse.json({ error: 'Unauthorized' }, { status: 401 });
  }
  const idToken = authHeader.split('Bearer ')[1];

  let user_id: string;
  try {
    // Server-side verification with Firebase Admin SDK
    const decodedToken = await adminAuth.verifyIdToken(idToken);
    user_id = decodedToken.uid; // ⭐ REAL verified user ID
  } catch (authError) {
    return NextResponse.json({ error: 'Invalid Token' }, { status: 401 });
  }

  // STEP 2: LiveKit setup (fail fast if env vars are missing)
  const livekitUrl = process.env.LIVEKIT_URL;
  const apiKey = process.env.LIVEKIT_API_KEY;
  const apiSecret = process.env.LIVEKIT_API_SECRET;
  if (!livekitUrl || !apiKey || !apiSecret) {
    return NextResponse.json({ error: 'Server misconfigured' }, { status: 500 });
  }
  const roomService = new RoomServiceClient(livekitUrl, apiKey, apiSecret);

  // STEP 3: Create room (crypto-strength name, unlike Math.random, can't be guessed)
  const room_name = `room-${randomUUID()}`;
  await roomService.createRoom({
    name: room_name,
    emptyTimeout: 60,   // Auto-cleanup after 60s
    maxParticipants: 2, // 1 user + 1 agent only
  });

  // STEP 4: Generate token with verified identity
  const at = new AccessToken(apiKey, apiSecret, {
    identity: user_id, // ⭐ Embedded in token
  });
  at.addGrant({
    roomJoin: true,
    room: room_name,
  });
  const token = await at.toJwt();

  return NextResponse.json({
    success: true,
    token,
    room_name,
    identity: user_id,
  });
}
Key Decisions:
✅ Server generates tokens (credentials never touch browser)
✅ Firebase verifies user BEFORE creating room
✅ User ID embedded as identity in LiveKit token
✅ Random room names (users can't guess existing rooms)
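For completeness, here's a minimal sketch of the browser side that consumes this route. It assumes the Firebase client SDK and the livekit-client package; the function name and flow are illustrative, not the article's exact code:

// Hypothetical client-side sketch
import { Room } from 'livekit-client';
import { getAuth } from 'firebase/auth';

async function startInterview(): Promise<Room> {
  // 1. Get the signed-in user's Firebase ID token
  const idToken = await getAuth().currentUser!.getIdToken();

  // 2. Ask our backend for a LiveKit token (credentials stay server-side)
  const res = await fetch('/api/start-bot', {
    method: 'POST',
    headers: { Authorization: `Bearer ${idToken}` },
  });
  if (!res.ok) throw new Error('Failed to start session');
  const { token } = await res.json();

  // 3. Connect with the short-lived token. The LiveKit URL is NOT a secret,
  //    so (unlike the API key/secret) it's fine as a NEXT_PUBLIC_* variable.
  const room = new Room();
  await room.connect(process.env.NEXT_PUBLIC_LIVEKIT_URL!, token);

  // 4. Publish the microphone so the agent can hear the user
  await room.localParticipant.setMicrophoneEnabled(true);
  return room;
}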
Part 2: Explicit vs Auto-Dispatch
Now that you have a room and token, how does the agent join?
Pattern A: Manual Dispatch (What I Used First)
// Create token
const token = await at.toJwt();

// Separately dispatch the agent (note: this goes through
// AgentDispatchClient in livekit-server-sdk, not RoomServiceClient)
const dispatchClient = new AgentDispatchClient(livekitUrl, apiKey, apiSecret);
await dispatchClient.createDispatch(room_name, 'my-agent');

return { token };
Problems I Hit:
- Requires 2 API calls to LiveKit
- Race condition: User joins before agent
- If dispatch fails, user sits in empty room
Pattern B: Auto-Dispatch via JWT ⭐ (Production Choice)
import { RoomConfiguration, RoomAgentDispatch } from '@livekit/protocol';

// Attach dispatch config to user's token
at.roomConfig = new RoomConfiguration({
  agents: [
    new RoomAgentDispatch({
      agentName: 'noah-voice-agent',
    }),
  ],
});
const token = await at.toJwt();
Why it's better:
✅ Atomic: Agent dispatch happens when user joins
✅ LiveKit handles retries (more reliable)
✅ One API call instead of two
✅ No race conditions
Real Impact: Reduced agent dispatch failures from ~5% to <0.1%.
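A bonus of Pattern B: RoomAgentDispatch also accepts a metadata string, which is one way to hand the verified user ID (or session config) to the agent at dispatch time. A sketch, assuming the same token setup as above; the session_type field is illustrative:

// Sketch: passing verified context to the agent via dispatch metadata
at.roomConfig = new RoomConfiguration({
  agents: [
    new RoomAgentDispatch({
      agentName: 'noah-voice-agent',
      // JSON string the Python agent can read back from ctx.job.metadata
      metadata: JSON.stringify({ user_id, session_type: 'interview' }),
    }),
  ],
});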
Part 3: Agent Entry Point - The Critical Pattern
Your agent receives a job when the user joins. Here's where most tutorials fail you.
❌ Mistake #2: Waiting for Participant Before Starting Session
This looks logical but breaks in production:
async def entrypoint(ctx: JobContext):
    await ctx.connect()

    # Wait for user to appear
    participant = await wait_for_participant(ctx.room)
    user_id = participant.identity

    # Load their data from database
    data = await db.load_user_state(user_id)
    agent = MyAgent(user_id, data)

    # Start session
    session = AgentSession(llm=...)
    await session.start(room=ctx.room, agent=agent)  # ❌ TOO LATE!
Symptoms:
- Agent connects but can't hear user
- Logs show subscribed=False
- Works in local testing (lucky timing)
- Fails in production randomly
Root Cause: Audio subscription happens inside session.start(). If you wait for participant identity first, you miss the subscription window.
✅ The Fix: Start Session First, Then Personalize
async def entrypoint(ctx: JobContext):
    # STEP 1: Create placeholder agent
    placeholder_db = DatabaseService("pending")
    agent = InterviewAgent("pending", placeholder_db, None)

    # STEP 2: Create session
    session = AgentSession(
        llm=google.realtime.RealtimeModel(
            model="gemini-live-2.5-flash-native-audio",
            voice="Puck",
            instructions=agent.instructions,
            vertexai=True,
        ),
    )

    # STEP 3: Start session IMMEDIATELY (SDK subscribes to audio here)
    await session.start(room=ctx.room, agent=agent)
    # Join the room (in livekit-agents 1.x this comes after session.start)
    await ctx.connect()
    logger.info("✅ Session started - audio pipeline active")

    # STEP 4: NOW get participant (session already listening)
    participant = None
    for _ in range(30):  # 30s timeout
        if ctx.room.remote_participants:
            participant = list(ctx.room.remote_participants.values())[0]
            break
        await asyncio.sleep(1)

    if not participant:
        logger.error("⚠️ No participant after 30s")
        return

    user_id = participant.identity  # ⭐ This is the Firebase UID from the token
    logger.info(f"✅ User identity: {user_id}")

    # STEP 5: Hydrate agent with real data
    agent.user_id = user_id
    agent.db_service = DatabaseService(user_id)
    try:
        initial_data = await agent.db_service.get_candidate_data()
    except Exception as e:
        logger.error(f"❌ DB failed: {e}")
        initial_data = {}  # Fallback to fresh session
    agent.initial_data = initial_data

    # STEP 6: Determine phase (new vs resume)
    agent.current_phase = agent._determine_initial_phase(initial_data)

    # STEP 7: Generate personalized greeting
    greeting = agent.get_greeting_instruction()
    await session.generate_reply(instructions=greeting)
    logger.info("✅ Greeting triggered")
The Pattern: Start → Listen → Identify → Personalize → Greet
Impact: 100% audio subscription success rate.
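The _determine_initial_phase helper referenced in STEP 6 isn't shown above. A minimal sketch of what it might look like, assuming phases are stored as plain strings in the candidate data (the phase names and the last_phase key are placeholders):

# Hypothetical sketch - the real phase logic depends on your data model
def _determine_initial_phase(self, initial_data: dict) -> str:
    """Decide whether this is a brand-new session or a resume."""
    if not initial_data:
        return "intro"  # Nothing in the DB: start from scratch
    # Resume from wherever the last session stopped
    return initial_data.get("last_phase", "intro")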
Part 4: The Greeting Guard Pattern
Even with audio working, there's another trap.
❌ Mistake #3: Tools Fire During Greeting
What happens:
Agent: "Hi, I'm Noah, your AI career c—"
User: "Hello!" (eager)
Gemini: *calls process_user_response() mid-greeting*
Database: *saves garbage data*
Agent: *confused about conversation state*
✅ Solution: Greeting Guard Flag
class InterviewAgent(Agent):
    def __init__(self):
        super().__init__(instructions=SYSTEM_PROMPT)
        self.greeting_complete = False  # ⭐ Start locked

    @function_tool()
    async def process_user_response(self, data: dict):
        # GUARD: Block tool execution during greeting
        if not self.greeting_complete:
            return "SYSTEM: Wait for greeting to complete. Do not process yet."
        # Normal logic continues here
        await self.db.save(data)
        ...

# Unlock tools after greeting completes
@session.on("agent_state_changed")
def on_state_changed(event):
    if event.old_state == "speaking" and event.new_state == "listening":
        if not agent.greeting_complete:
            agent.greeting_complete = True
            logger.info("✅ Greeting done, tools unlocked")
Why it works:
- Agent speaks full greeting uninterrupted
- Tools unlock only after the speaking→listening state transition
- No premature database writes
Real Data: Eliminated 100% of corrupted session starts.
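The get_greeting_instruction method (used in STEP 7 of Part 3) can then branch on the phase determined earlier. A sketch, with the phase names and wording as placeholders:

# Hypothetical sketch - greeting varies by the phase determined above
def get_greeting_instruction(self) -> str:
    if self.current_phase == "intro":
        return (
            "Greet the user warmly as Noah, introduce yourself as their "
            "AI career coach, and ask about their background."
        )
    # Resuming: acknowledge the previous session instead of restarting
    return (
        f"Welcome the user back. You previously stopped at the "
        f"'{self.current_phase}' phase; briefly recap and continue from there."
    )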
Part 5: Robustness Patterns
Production code needs fallbacks. Here are patterns from actual failures.
Pattern 1: Timeout with Fallback
# Don't wait forever for participant
participant = None
for attempt in range(30):
    if ctx.room.remote_participants:
        participant = list(ctx.room.remote_participants.values())[0]
        break
    await asyncio.sleep(1)

if not participant:
    # FALLBACK: Log and gracefully exit
    logger.error("⚠️ No participant after 30s")
    await session.generate_reply(
        instructions="I can't hear anyone. Please refresh and try again."
    )
    return  # Exit cleanly
Pattern 2: Database Connection Fallback
try:
    initial_data = await db.get_candidate_data()
except Exception as e:
    logger.error(f"❌ DB connection failed: {e}")
    # FALLBACK: Start fresh session instead of crashing
    initial_data = {}
    await session.generate_reply(
        instructions="Hi! Let's start fresh today. Tell me about your background."
    )
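If the database is merely slow rather than down, a small retry-with-backoff wrapper can sit in front of that fallback. A sketch (the attempt counts and delays are arbitrary, not values from my production setup):

import asyncio
import logging

logger = logging.getLogger(__name__)

async def with_retries(coro_factory, attempts: int = 3, base_delay: float = 0.5):
    """Retry an async call with exponential backoff before giving up."""
    for attempt in range(attempts):
        try:
            return await coro_factory()
        except Exception as e:
            if attempt == attempts - 1:
                raise  # Let the caller's fallback handle it
            delay = base_delay * (2 ** attempt)
            logger.warning(f"Attempt {attempt + 1} failed ({e}), retrying in {delay}s")
            await asyncio.sleep(delay)

# Usage: the empty-dict fallback only kicks in after 3 failed attempts
# initial_data = await with_retries(lambda: db.get_candidate_data())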
Pattern 3: Duplicate Agent Prevention
Problem: Health check fails → Cloud Run keeps retrying → Multiple agents speak simultaneously
Solution:
# Health check server MUST bind to 0.0.0.0
def start_health_server():
    port = int(os.getenv('PORT', 8080))
    httpd = HTTPServer(
        ('0.0.0.0', port),  # ⭐ NOT 'localhost'
        HealthCheckHandler
    )
    httpd.serve_forever()

# Don't override Cloud Run's PORT variable
load_dotenv(override=False)  # ⭐ Critical
Why: Cloud Run expects health checks on 0.0.0.0. If you bind to localhost, it fails and creates zombie agents.
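The HealthCheckHandler referenced above is just a few lines with Python's standard library; a minimal sketch:

from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthCheckHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Cloud Run only needs a 200 to consider the instance healthy
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.end_headers()
        self.wfile.write(b'OK')

    def log_message(self, format, *args):
        pass  # Keep health-check noise out of the logs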
Part 6: Testing the Full Flow
Local Development Setup
# Terminal 1: Cloud SQL Proxy (if using Cloud SQL)
./cloud-sql-proxy PROJECT:REGION:INSTANCE --port 5432
# Terminal 2: Voice Agent
cd voice-agent
source .venv/bin/activate
python src/main.py dev
# Terminal 3: Frontend
cd frontend
npm run dev
Critical Test Cases
- Happy Path: User joins → Agent greets → Conversation flows
- User Refresh: User closes tab mid-call → Reconnects → Session resumes
- Database Timeout: DB slow → Agent uses fallback greeting
- No Participant: Room created but user never joins → Agent exits gracefully
- Network Drop: User loses connection → Reconnects → Conversation continues
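Before wiring up the UI, you can smoke-test the token route from Part 1 directly (copy a Firebase ID token from your browser's devtools; the port assumes the default Next.js dev server):

# Replace <ID_TOKEN> with a real Firebase ID token from your app
curl -X POST http://localhost:3000/api/start-bot \
  -H "Authorization: Bearer <ID_TOKEN>"
# Expected shape: {"success":true,"token":"...","room_name":"room-...","identity":"<uid>"}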
Lessons Learned: The DO/DON'T Checklist
DO ✅
- Generate tokens server-side (never expose credentials)
- Embed user identity in JWT token
- Use auto-dispatch via roomConfig (more reliable than manual)
- Start session FIRST, get identity AFTER
- Guard tools during greeting with state flag
- Add timeouts and fallbacks everywhere
- Bind the health server to 0.0.0.0 for Cloud Run
- Test with database failures and network drops
DON'T ❌
- Put LiveKit secrets in NEXT_PUBLIC_* variables
- Wait for participant before starting the session
- Allow tools to execute during greeting
- Assume database is always available
- Skip health check server (Cloud Run requires it)
- Override Cloud Run's $PORT environment variable
- Deploy without testing the full flow locally first
From Demo to Production
The gap between LiveKit tutorials and production isn't just code—it's robustness thinking.
- Tutorials assume happy paths (user joins, everything works)
- Production has 10 failure modes per integration point
After 3 weeks of debugging these issues in production, I learned:
- Audio subscription is timing-sensitive (start session first)
- Auto-dispatch beats manual dispatch (atomic operations win)
- State guards prevent race conditions (greeting flag pattern)
- Fallbacks save user experience (DB down? Start fresh)
- Health checks matter on Cloud Run (bind to 0.0.0.0)
What's Next?
In the next article, I'll cover:
- Tool calling patterns with complex state machines
- Database-backed session resume
- The 80% CPU problem (and how I fixed it)
Building production voice AI?
What's your biggest voice AI integration pain point? Comment below. 👇