I recently explored building an AI voice agent for technical interviews — the kind that can actually hold a conversation, ask follow-up questions, and adapt in real-time.
Turns out, getting voice latency right and making the conversation feel natural is harder than it looks. Here's what I learned 👇
Why Voice Agents for Interviews?
Traditional interviews don't scale well:
- Resource-intensive (coordinating schedules, interviewer availability)
- Inconsistent (every interviewer has their own style)
- Hard to audit (what actually happened in that hour?)
Voice agents can help with:
- Scalability - Interview 100 candidates simultaneously
- Consistency - Same evaluation criteria for everyone
- Real-time feedback - Immediate scoring and analytics
- Auditability - Full transcripts and traces of every session
The Tech Stack
LiveKit for Real-Time Audio
LiveKit handles the gnarly parts of voice infrastructure:
- Ultra-low latency streaming
- Turn detection (knowing when to interrupt vs when to wait)
- Integration with LLMs and TTS engines
- Scales from prototype to production
Why Real-Time Matters
You can't fake low latency. If there's a 2-second gap between the candidate finishing an answer and the agent's follow-up, the whole flow breaks. LiveKit's WebRTC foundation keeps things snappy.
Building the Agent
Here's the simplified architecture:
1. Agent Definition
from livekit.agents import Agent

class InterviewAgent(Agent):
    def __init__(self, jd: str) -> None:
        super().__init__(
            instructions=f"""You are a professional interviewer.
            The job description is: {jd}
            Ask relevant interview questions, listen to answers,
            and follow up as a real interviewer would."""
        )
The agent adapts its questions based on the job description you provide. Simple but effective.
2. Adding Tool Use (Web Search)
from livekit.agents import function_tool
from tavily import TavilyClient

# Defined inside InterviewAgent so the LLM can invoke it as a tool
@function_tool()
async def web_search(self, query: str) -> str:
    tavily_client = TavilyClient(api_key=TAVILY_API_KEY)
    # include_answer=True asks Tavily to populate the 'answer' field
    response = tavily_client.search(query=query, search_depth="basic", include_answer=True)
    return response.get("answer", "No results found.")
This lets the agent look up technical details on the fly. If a candidate mentions a framework the agent isn't familiar with, it can search and ask informed follow-ups.
3. Session Management
from livekit import agents
from livekit.agents import AgentSession
from livekit.plugins import google

async def entrypoint(ctx: agents.JobContext):
    print("🎤 Welcome! Paste your Job Description below.")
    jd = input("JD: ")

    # Gemini's realtime model handles speech in and out natively
    session = AgentSession(
        llm=google.beta.realtime.RealtimeModel(
            model="gemini-2.0-flash-exp",
            voice="Puck",
        ),
    )

    # ctx.room is the room this agent job was dispatched to
    await session.start(room=ctx.room, agent=InterviewAgent(jd))
    await session.generate_reply(
        instructions="Greet the candidate and start the interview."
    )
What Makes This Tricky
1. Turn-taking is an unsolved problem
Humans interrupt each other naturally. Agents struggle with:
- When to let the candidate finish
- When to jump in with a clarifying question
- How to handle awkward silences
LiveKit's turn detection helps, but you still need to tune sensitivity.
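For a sense of what that tuning looks like, here's a minimal sketch using LiveKit's Silero VAD plugin. The parameter values are illustrative starting points, not recommendations:

from livekit.plugins import silero

# Illustrative values, not recommendations: a longer min_silence_duration
# makes the agent wait longer before deciding the candidate is done;
# a higher activation_threshold makes background noise less likely to
# register as speech.
vad = silero.VAD.load(
    min_silence_duration=0.8,
    activation_threshold=0.6,
)
# Pass vad=vad into the AgentSession from step 3 to apply it.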
2. Latency compounds quickly
- Speech-to-text: ~200ms
- LLM inference: ~500-1000ms
- Text-to-speech: ~300ms
That's 1-1.5 seconds best case, before any additional processing. Logging, evaluation, and tool calls all add to it, and it compounds on every turn.
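To see where your own pipeline spends that budget, it helps to time each stage per turn. Here's a minimal, self-contained sketch; the three stage functions are stubs standing in for real STT/LLM/TTS calls:

import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str, budget: dict):
    # Record elapsed wall-clock time (in ms) for one pipeline stage.
    start = time.perf_counter()
    try:
        yield
    finally:
        budget[stage] = (time.perf_counter() - start) * 1000

# Stubs standing in for real STT/LLM/TTS calls.
def transcribe() -> str: time.sleep(0.2); return "tell me more"
def generate(text: str) -> str: time.sleep(0.7); return "sure..."
def synthesize(text: str) -> bytes: time.sleep(0.3); return b"audio"

budget: dict = {}
with timed("stt", budget):
    text = transcribe()
with timed("llm", budget):
    reply = generate(text)
with timed("tts", budget):
    synthesize(reply)
print({k: f"{v:.0f}ms" for k, v in budget.items()})  # e.g. {'stt': '200ms', ...}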
3. Context management
Interview sessions can be 30-60 minutes. That's a lot of conversation history to keep in context without:
- Blowing your token budget
- Degrading response quality
- Missing important details from earlier in the conversation
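One common mitigation, sketched below (not what the agent above does): keep the last N turns verbatim and fold older turns into a running summary. summarize() is a stub where you'd plug in an LLM call:

KEEP_RECENT = 12  # most recent turns kept verbatim

def summarize(text: str) -> str:
    # Stub: in practice, an LLM call that condenses older turns.
    return text[-500:]

def compact_history(turns: list[str], summary: str) -> tuple[list[str], str]:
    # Fold everything except the last KEEP_RECENT turns into the summary.
    if len(turns) <= KEEP_RECENT:
        return turns, summary
    old, recent = turns[:-KEEP_RECENT], turns[-KEEP_RECENT:]
    summary = summarize(summary + "\n" + "\n".join(old))
    return recent, summary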
Observability: The Unsexy But Critical Part
When your agent asks a weird question or misunderstands an answer, you need to know why.
I instrumented the agent to log:
- Full conversation traces
- Tool calls and their results
- Latency breakdowns per turn
- When the agent decided to interrupt vs wait
This makes debugging 10x easier. Instead of "the agent behaved weird," you can pinpoint "the web search timed out, so it hallucinated instead."
import logging

def on_event(event: str, data: dict):
    trace_id = data.get("trace_id")
    if event == "trace.started":
        logging.debug(f"Trace started - ID: {trace_id}")
    elif event == "trace.ended":
        logging.debug(f"Trace ended - ID: {trace_id}")
Running It
Setup:
# Install dependencies (quote the extras so your shell doesn't expand the brackets)
pip install livekit "livekit-agents[google]" tavily-python

# Set environment variables
export LIVEKIT_URL=your_livekit_url
export LIVEKIT_API_KEY=your_api_key
export LIVEKIT_API_SECRET=your_api_secret
export TAVILY_API_KEY=your_tavily_key
export GOOGLE_API_KEY=your_google_key
Launch:
python interview_agent.py
The agent creates a room, gives you a join link, and starts the interview when you connect.
What I'd Add Next
Multi-agent panels: Simulate a panel interview with multiple interviewers asking from different angles.
Real-time scoring: Evaluate answers as the interview progresses, not just at the end.
Resume parsing: Pull details from the candidate's resume to personalize questions.
Code challenges: For technical roles, integrate a live coding environment.
Emotion detection: Analyze tone and sentiment to gauge candidate confidence/stress.
The Bigger Picture
Voice agents aren't just for interviews. The same patterns apply to:
- Customer support (handling calls at scale)
- Sales qualification (pre-screening leads)
- Healthcare triage (initial symptom assessment)
- Education (tutoring and assessment)
The hard parts are universal:
- Low-latency voice processing
- Natural turn-taking
- Context management over long sessions
- Debugging probabilistic behavior
Lessons Learned
1. Test with real humans early
Synthetic test cases don't capture the messiness of real conversations. Get feedback from actual users ASAP.
2. Latency budgets are tight
Every millisecond matters. Optimize aggressively or the conversation feels robotic.
3. Observability is non-negotiable
You can't improve what you can't measure. Log everything, then filter down to what matters.
4. Voice is different from chat
What works in text-based agents often breaks in voice. Verbosity, pacing, and interruption handling are completely different problems.
Try It Yourself
The full code is on GitHub (link in comments). You'll need:
- LiveKit account (they have a free tier)
- Google Cloud for Gemini + TTS
- Tavily API for web search
If you're curious about the observability/evaluation side, I'm working on this at Maxim AI. Still figuring it out, but happy to share what I learn.
What's your experience with voice agents? Where do you think they'll have the most impact? Let me know in the comments 👇