Navya Yadav

Building a Real-Time AI Interview Agent with Voice

I recently explored building an AI voice agent for technical interviews — the kind that can actually hold a conversation, ask follow-up questions, and adapt in real-time.

Turns out, getting voice latency right and making the conversation feel natural is harder than it looks. Here's what I learned 👇


Why Voice Agents for Interviews?

Traditional interviews don't scale well:

  • Resource-intensive (coordinating schedules, interviewer availability)
  • Inconsistent (every interviewer has their own style)
  • Hard to audit (what actually happened in that hour?)

Voice agents can help with:

  • Scalability - Interview 100 candidates simultaneously
  • Consistency - Same evaluation criteria for everyone
  • Real-time feedback - Immediate scoring and analytics
  • Auditability - Full transcripts and traces of every session

The Tech Stack

LiveKit for Real-Time Audio

LiveKit handles the gnarly parts of voice infrastructure:

  • Ultra-low latency streaming
  • Turn detection (knowing when to interrupt vs when to wait)
  • Integration with LLMs and TTS engines
  • Scales from prototype to production

Why Real-Time Matters

You can't fake low latency. If there's a 2-second gap between "Tell me about your experience" and the candidate starting to answer, the whole flow breaks. LiveKit's WebRTC foundation keeps things snappy.


Building the Agent

Here's the simplified architecture:

1. Agent Definition

from livekit.agents import Agent


class InterviewAgent(Agent):
    def __init__(self, jd: str) -> None:
        # Bake the job description into the system instructions so the
        # agent tailors its questions to the role.
        super().__init__(
            instructions=f"""You are a professional interviewer.
            The job description is: {jd}

            Ask relevant interview questions, listen to answers,
            and follow up as a real interviewer would."""
        )

The agent adapts its questions based on the job description you provide. Simple but effective.

2. Adding Tool Use (Web Search)

from livekit.agents import function_tool
from tavily import TavilyClient  # TAVILY_API_KEY is read from the environment

# Defined inside InterviewAgent so the model can call it mid-conversation
@function_tool()
async def web_search(self, query: str) -> str:
    # "basic" search depth keeps latency low during the interview
    tavily_client = TavilyClient(api_key=TAVILY_API_KEY)
    response = tavily_client.search(query=query, search_depth="basic")
    return response.get("answer", "No results found.")

This lets the agent look up technical details on the fly. If a candidate mentions a framework the agent isn't familiar with, it can search and ask informed follow-ups.

3. Session Management

from livekit import agents
from livekit.agents import AgentSession
from livekit.plugins import google


async def entrypoint(ctx: agents.JobContext):
    print("🎤 Welcome! Paste your Job Description below.")
    jd = input("JD: ")

    session = AgentSession(
        llm=google.beta.realtime.RealtimeModel(
            model="gemini-2.0-flash-exp",
            voice="Puck",
        ),
    )

    # The job context already carries the room the worker was dispatched to,
    # so there's no need to mint a room name ourselves.
    await session.start(room=ctx.room, agent=InterviewAgent(jd))
    await ctx.connect()
    await session.generate_reply(
        instructions="Greet the candidate and start the interview."
    )

What Makes This Tricky

1. Turn-taking is an unsolved problem

Humans interrupt each other naturally. Agents struggle with:

  • When to let the candidate finish
  • When to jump in with a clarifying question
  • How to handle awkward silences

LiveKit's turn detection helps, but you still need to tune sensitivity.
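
For example, LiveKit's Silero VAD plugin exposes how long a pause must last before the agent treats a turn as finished. A minimal sketch (parameter name based on my read of the plugin docs; verify against your livekit-agents version):

from livekit.plugins import silero

# Higher min_silence_duration = the agent waits longer before treating a
# pause as end-of-turn: fewer interruptions, slightly slower replies.
vad = silero.VAD.load(min_silence_duration=0.8)

You'd then pass vad into AgentSession alongside the LLM.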

2. Latency compounds quickly

  • Speech-to-text: ~200ms
  • LLM inference: ~500-1000ms
  • Text-to-speech: ~300ms

That's 1-1.5 seconds best case. Any additional processing (logging, evaluation, tool calls) adds up fast.
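
One way to see where the time actually goes is to time each stage of every turn. A minimal stdlib sketch (transcribe is a hypothetical STT call, just for illustration):

import logging
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str):
    # Log wall-clock time for one pipeline stage (STT, LLM, TTS, tool call)
    start = time.perf_counter()
    try:
        yield
    finally:
        logging.info("%s took %.0f ms", stage, (time.perf_counter() - start) * 1000)

# Usage:
# with timed("stt"):
#     text = transcribe(audio_chunk)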

3. Context management

Interview sessions can be 30-60 minutes. That's a lot of conversation history to keep in context without:

  • Blowing your token budget
  • Degrading response quality
  • Missing important details from earlier in the conversation
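
One common mitigation, sketched here under simplifying assumptions: keep the most recent turns verbatim and fold everything older into a running summary. The summarize below is a naive placeholder (in practice it would be a cheap LLM call):

MAX_RECENT_TURNS = 20

def summarize(previous_summary: str, turns: list[dict]) -> str:
    # Placeholder: in practice, compress old turns with a cheap LLM call
    lines = [f"{t['role']}: {t['text']}" for t in turns]
    return (previous_summary + "\n" + "\n".join(lines)).strip()

def trim_history(history: list[dict], summary: str) -> tuple[list[dict], str]:
    # Keep the last MAX_RECENT_TURNS verbatim; fold the rest into the summary
    if len(history) <= MAX_RECENT_TURNS:
        return history, summary
    summary = summarize(summary, history[:-MAX_RECENT_TURNS])
    return history[-MAX_RECENT_TURNS:], summary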

Observability: The Unsexy But Critical Part

When your agent asks a weird question or misunderstands an answer, you need to know why.

I instrumented the agent to log:

  • Full conversation traces
  • Tool calls and their results
  • Latency breakdowns per turn
  • When the agent decided to interrupt vs wait

This makes debugging 10x easier. Instead of "the agent behaved weird," you can pinpoint "the web search timed out, so it hallucinated instead."

import logging

def on_event(event: str, data: dict):
    # Hook for trace lifecycle events emitted by the instrumentation layer
    if event == "trace.started":
        logging.debug(f"Trace started - ID: {data['trace_id']}")
    elif event == "trace.ended":
        logging.debug(f"Trace ended - ID: {data['trace_id']}")

Running It

Setup:

# Install dependencies
pip install livekit "livekit-agents[google]" tavily-python

# Set environment variables
export LIVEKIT_URL=your_livekit_url
export LIVEKIT_API_KEY=your_api_key
export LIVEKIT_API_SECRET=your_api_secret
export TAVILY_API_KEY=your_tavily_key
export GOOGLE_API_KEY=your_google_key

Launch:

python interview_agent.py

The agent creates a room, gives you a join link, and starts the interview when you connect.


What I'd Add Next

Multi-agent panels: Simulate a panel interview with multiple interviewers asking from different angles.

Real-time scoring: Evaluate answers as the interview progresses, not just at the end.

Resume parsing: Pull details from the candidate's resume to personalize questions.

Code challenges: For technical roles, integrate a live coding environment.

Emotion detection: Analyze tone and sentiment to gauge candidate confidence/stress.


The Bigger Picture

Voice agents aren't just for interviews. The same patterns apply to:

  • Customer support (handling calls at scale)
  • Sales qualification (pre-screening leads)
  • Healthcare triage (initial symptom assessment)
  • Education (tutoring and assessment)

The hard parts are universal:

  • Low-latency voice processing
  • Natural turn-taking
  • Context management over long sessions
  • Debugging probabilistic behavior

Lessons Learned

1. Test with real humans early

Synthetic test cases don't capture the messiness of real conversations. Get feedback from actual users ASAP.

2. Latency budgets are tight

Every millisecond matters. Optimize aggressively or the conversation feels robotic.

3. Observability is non-negotiable

You can't improve what you can't measure. Log everything, then filter down to what matters.

4. Voice is different from chat

What works in text-based agents often breaks in voice. Verbosity, pacing, and interruption handling are completely different problems.


Try It Yourself

The full code is on GitHub (link in comments). You'll need:

  • LiveKit account (they have a free tier)
  • Google Cloud for Gemini + TTS
  • Tavily API for web search

If you're curious about the observability/evaluation side, I'm working on this at Maxim AI. Still figuring it out, but happy to share what I learn.


What's your experience with voice agents? Where do you think they'll have the most impact? Let me know in the comments 👇
