Navya Yadav

Building a Real-Time AI Interview Agent with Voice

I recently explored building an AI voice agent for technical interviews — the kind that can actually hold a conversation, ask follow-up questions, and adapt in real-time.

Turns out, getting voice latency right and making the conversation feel natural is harder than it looks. Here's what I learned 👇


Why Voice Agents for Interviews?

Traditional interviews don't scale well:

  • Resource-intensive (coordinating schedules, interviewer availability)
  • Inconsistent (every interviewer has their own style)
  • Hard to audit (what actually happened in that hour?)

Voice agents can help with:

  • Scalability - Interview 100 candidates simultaneously
  • Consistency - Same evaluation criteria for everyone
  • Real-time feedback - Immediate scoring and analytics
  • Auditability - Full transcripts and traces of every session

The Tech Stack

LiveKit for Real-Time Audio

LiveKit handles the gnarly parts of voice infrastructure:

  • Ultra-low latency streaming
  • Turn detection (knowing when to interrupt vs when to wait)
  • Integration with LLMs and TTS engines
  • Scales from prototype to production

Why Real-Time Matters

You can't fake low latency. If there's a 2-second gap between "Tell me about your experience" and the candidate starting to answer, the whole flow breaks. LiveKit's WebRTC foundation keeps things snappy.


Building the Agent

Here's the simplified architecture:

1. Agent Definition

from livekit.agents import Agent


class InterviewAgent(Agent):
    def __init__(self, jd: str) -> None:
        # Bake the job description into the system instructions so the
        # agent tailors its questions to the role.
        super().__init__(
            instructions=f"""You are a professional interviewer.
            The job description is: {jd}

            Ask relevant interview questions, listen to answers,
            and follow up as a real interviewer would."""
        )

The agent adapts its questions based on the job description you provide. Simple but effective.

2. Adding Tool Use (Web Search)

from livekit.agents import function_tool
from tavily import TavilyClient  # TAVILY_API_KEY is read from the environment

# Defined inside InterviewAgent so the model can call it mid-conversation
@function_tool()
async def web_search(self, query: str) -> str:
    # "basic" search depth keeps latency low during the interview
    tavily_client = TavilyClient(api_key=TAVILY_API_KEY)
    response = tavily_client.search(query=query, search_depth="basic")
    return response.get("answer", "No results found.")

This lets the agent look up technical details on the fly. If a candidate mentions a framework the agent isn't familiar with, it can search and ask informed follow-ups.

3. Session Management

from livekit import agents
from livekit.agents import AgentSession
from livekit.plugins import google


async def entrypoint(ctx: agents.JobContext):
    print("🎤 Welcome! Paste your Job Description below.")
    jd = input("JD: ")

    session = AgentSession(
        llm=google.beta.realtime.RealtimeModel(
            model="gemini-2.0-flash-exp",
            voice="Puck",
        ),
    )

    # The job context already carries the room the worker was dispatched to,
    # so there's no need to mint a room name ourselves.
    await session.start(room=ctx.room, agent=InterviewAgent(jd))
    await ctx.connect()
    await session.generate_reply(
        instructions="Greet the candidate and start the interview."
    )

What Makes This Tricky

1. Turn-taking is an unsolved problem

Humans interrupt each other naturally. Agents struggle with:

  • When to let the candidate finish
  • When to jump in with a clarifying question
  • How to handle awkward silences

LiveKit's turn detection helps, but you still need to tune sensitivity.
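
For example, LiveKit's Silero VAD plugin exposes how long a pause must last before the agent treats a turn as finished. A minimal sketch (parameter name based on my read of the plugin docs; verify against your livekit-agents version):

from livekit.plugins import silero

# Higher min_silence_duration = the agent waits longer before treating a
# pause as end-of-turn: fewer interruptions, slightly slower replies.
vad = silero.VAD.load(min_silence_duration=0.8)

You'd then pass vad into AgentSession alongside the LLM.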

2. Latency compounds quickly

  • Speech-to-text: ~200ms
  • LLM inference: ~500-1000ms
  • Text-to-speech: ~300ms

That's 1-1.5 seconds best case. Any additional processing (logging, evaluation, tool calls) adds up fast.
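
One way to see where the time actually goes is to time each stage of every turn. A minimal stdlib sketch (transcribe is a hypothetical STT call, just for illustration):

import logging
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str):
    # Log wall-clock time for one pipeline stage (STT, LLM, TTS, tool call)
    start = time.perf_counter()
    try:
        yield
    finally:
        logging.info("%s took %.0f ms", stage, (time.perf_counter() - start) * 1000)

# Usage:
# with timed("stt"):
#     text = transcribe(audio_chunk)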

3. Context management

Interview sessions can be 30-60 minutes. That's a lot of conversation history to keep in context without:

  • Blowing your token budget
  • Degrading response quality
  • Missing important details from earlier in the conversation
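
One common mitigation, sketched here under simplifying assumptions: keep the most recent turns verbatim and fold everything older into a running summary. The summarize below is a naive placeholder (in practice it would be a cheap LLM call):

MAX_RECENT_TURNS = 20

def summarize(previous_summary: str, turns: list[dict]) -> str:
    # Placeholder: in practice, compress old turns with a cheap LLM call
    lines = [f"{t['role']}: {t['text']}" for t in turns]
    return (previous_summary + "\n" + "\n".join(lines)).strip()

def trim_history(history: list[dict], summary: str) -> tuple[list[dict], str]:
    # Keep the last MAX_RECENT_TURNS verbatim; fold the rest into the summary
    if len(history) <= MAX_RECENT_TURNS:
        return history, summary
    summary = summarize(summary, history[:-MAX_RECENT_TURNS])
    return history[-MAX_RECENT_TURNS:], summary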

Observability: The Unsexy But Critical Part

When your agent asks a weird question or misunderstands an answer, you need to know why.

I instrumented the agent to log:

  • Full conversation traces
  • Tool calls and their results
  • Latency breakdowns per turn
  • When the agent decided to interrupt vs wait

This makes debugging 10x easier. Instead of "the agent behaved weird," you can pinpoint "the web search timed out, so it hallucinated instead."

import logging

def on_event(event: str, data: dict):
    # Hook for trace lifecycle events emitted by the instrumentation layer
    if event == "trace.started":
        logging.debug(f"Trace started - ID: {data['trace_id']}")
    elif event == "trace.ended":
        logging.debug(f"Trace ended - ID: {data['trace_id']}")

Running It

Setup:

# Install dependencies
pip install livekit "livekit-agents[google]" tavily-python

# Set environment variables
export LIVEKIT_URL=your_livekit_url
export LIVEKIT_API_KEY=your_api_key
export LIVEKIT_API_SECRET=your_api_secret
export TAVILY_API_KEY=your_tavily_key
export GOOGLE_API_KEY=your_google_key

Launch:

python interview_agent.py

The agent creates a room, gives you a join link, and starts the interview when you connect.


What I'd Add Next

Multi-agent panels: Simulate a panel interview with multiple interviewers asking from different angles.

Real-time scoring: Evaluate answers as the interview progresses, not just at the end.

Resume parsing: Pull details from the candidate's resume to personalize questions.

Code challenges: For technical roles, integrate a live coding environment.

Emotion detection: Analyze tone and sentiment to gauge candidate confidence/stress.


The Bigger Picture

Voice agents aren't just for interviews. The same patterns apply to:

  • Customer support (handling calls at scale)
  • Sales qualification (pre-screening leads)
  • Healthcare triage (initial symptom assessment)
  • Education (tutoring and assessment)

The hard parts are universal:

  • Low-latency voice processing
  • Natural turn-taking
  • Context management over long sessions
  • Debugging probabilistic behavior

Lessons Learned

1. Test with real humans early

Synthetic test cases don't capture the messiness of real conversations. Get feedback from actual users ASAP.

2. Latency budgets are tight

Every millisecond matters. Optimize aggressively or the conversation feels robotic.

3. Observability is non-negotiable

You can't improve what you can't measure. Log everything, then filter down to what matters.

4. Voice is different from chat

What works in text-based agents often breaks in voice. Verbosity, pacing, and interruption handling are completely different problems.


Try It Yourself

The full code is on GitHub (link in comments). You'll need:

  • LiveKit account (they have a free tier)
  • Google Cloud for Gemini + TTS
  • Tavily API for web search

If you're curious about the observability/evaluation side, I'm working on this at Maxim AI. Still figuring it out, but happy to share what I learn.


What's your experience with voice agents? Where do you think they'll have the most impact? Let me know in the comments 👇
