Esha Agarwal
Building a Real-Time AI Technical Interviewer with Gemini Live

I've been preparing for technical interviews lately and realized how hard it is to get quality practice. Mock interviews with friends are hard to schedule, platforms like LeetCode give no conversational feedback, and hiring a coach is expensive. That frustration led me to build AI Mock Interviewer during the Gemini Live Agent Challenge.

What It Does

You open the app in your browser, click Start, and an AI interviewer named Alex begins asking you technical questions by voice, just like a real interviewer. You answer out loud. Alex listens, asks follow-ups, then gives you a coding problem. While you write code in your editor, Alex watches your screen every 5 seconds and asks questions about your approach, complexity, and edge cases. When the interview wraps, you get a full scored report covering Technical Knowledge, Problem Solving, Code Quality, and Communication, with specific coaching for each area.

How I Built It

The core is Google Gemini Live (gemini-2.5-flash-native-audio) for real-time bidirectional audio and vision. The browser captures microphone audio at 16 kHz via an AudioWorklet and grabs a JPEG screen frame every 5 seconds, both streamed to a FastAPI backend over a WebSocket. The backend forwards everything to Gemini Live and plays back the AI's audio response at 24 kHz.
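
The post doesn't show the wire format, so here's a minimal sketch of how the browser-to-backend messages might be framed. The `type` field names, JSON envelope, and base64 encoding are my assumptions, not the project's actual protocol:

```python
import base64
import json

AUDIO_IN_RATE = 16_000   # mic capture rate (AudioWorklet)
AUDIO_OUT_RATE = 24_000  # Gemini Live playback rate

def encode_screen_frame(jpeg_bytes: bytes) -> str:
    """Wrap a JPEG screen capture as a JSON WebSocket message."""
    return json.dumps({
        "type": "screen",
        "mime_type": "image/jpeg",
        "data": base64.b64encode(jpeg_bytes).decode("ascii"),
    })

def decode_message(raw: str) -> tuple[str, bytes]:
    """Unpack a browser message into (kind, payload) for forwarding to Gemini."""
    msg = json.loads(raw)
    return msg["type"], base64.b64decode(msg["data"])
```

A single framing like this lets one WebSocket carry both audio chunks and screen frames, with the backend dispatching on `type`.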

The Hardest Part — VAD Pipeline

Gemini Live requires explicit stream_end signals to know when the user has finished speaking. I built a custom Voice Activity Detection (VAD) system from scratch: it computes RMS energy on every 256ms PCM chunk, detects sustained silence, and applies a cooldown after the AI finishes speaking so echo can't trigger false stream_ends.

```python
import audioop  # stdlib; one way to get RMS energy from 16-bit PCM

# VAD — detect silence and fire stream_end
rms = audioop.rms(chunk, 2)  # sample width 2 = 16-bit audio
if rms < SILENCE_RMS:
    silent_chunks += 1
    if (silent_chunks >= SILENCE_CHUNKS
            and not stream_ended
            and candidate_speech_seen
            and past_cooldown):
        await to_gemini.put(("stream_end", b""))
        stream_ended = True
else:
    silent_chunks = 0  # any speech resets the silence counter
```

I also built barge-in support — if the candidate speaks loudly enough while Alex is talking, the audio gets interrupted immediately so the conversation feels natural.
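
The barge-in check can be sketched as a small pure function. The threshold and chunk count here are hypothetical values I've chosen for illustration; requiring several consecutive loud chunks keeps a single spike (or speaker echo) from interrupting Alex:

```python
SPEECH_RMS = 1200      # hypothetical barge-in threshold, above the normal speech floor
BARGE_IN_CHUNKS = 3    # require sustained loudness, not a single spike or echo

def detect_barge_in(rms_values: list[int], ai_speaking: bool) -> bool:
    """True if the candidate talks over the AI loudly enough to interrupt it."""
    if not ai_speaking:
        return False
    recent = rms_values[-BARGE_IN_CHUNKS:]
    return len(recent) == BARGE_IN_CHUNKS and all(r >= SPEECH_RMS for r in recent)
```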

Keeping the Interview Structured — Tool State Machine

Without guardrails, Gemini would call end_interview before asking a single question, or ask five questions in one turn. The solution was a tool-based state machine in which every action gets BLOCKED with an error message if its prerequisites aren't met. Gemini self-corrects based on the blocked response and tries the correct sequence instead.

```python
def end_interview() -> str:
    if len(_state["behavioral_notes"]) < 2:
        return "BLOCKED: need at least 2 behavioral notes first."
    if not _state["coding_phase_started"]:
        return "BLOCKED: coding phase not started yet."
    if not _state["timer_checked"]:
        return "BLOCKED: call check_timer() first."
    if not _state["closing_spoken"]:
        return "BLOCKED: speak your closing sentence first."
    _state["complete"] = True
    return "INTERVIEW_COMPLETE."
```

This pattern (let the model decide, validate the decision, return structured errors) is what makes agentic systems reliable in production.

Report Generation

At the end of each session, Gemini 2.5 Flash generates a scored coaching report from the notes logged during the interview. The report covers overall score, specific strengths, areas to improve, and actionable next steps — delivered as markdown rendered directly in the browser.
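
A minimal sketch of that final step, using the google-genai Python SDK. The prompt wording and the shape of the notes are my assumptions; only the model name comes from the post:

```python
def build_report_prompt(notes: list[str]) -> str:
    """Assemble the interviewer's logged notes into a single grading prompt."""
    joined = "\n".join(f"- {n}" for n in notes)
    return (
        "You are an interview coach. Based on these notes, write a markdown "
        "report scoring Technical Knowledge, Problem Solving, Code Quality, "
        "and Communication, with an overall score, specific strengths, areas "
        "to improve, and actionable next steps.\n\nNotes:\n" + joined
    )

def generate_report(notes: list[str]) -> str:
    from google import genai  # pip install google-genai
    client = genai.Client()   # reads GEMINI_API_KEY from the environment
    resp = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=build_report_prompt(notes),
    )
    return resp.text  # markdown, rendered directly in the browser
```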

Deployment

The backend runs on Google Cloud Run: serverless, scales to zero when idle, and supports long-lived WebSocket connections with a 3600-second timeout. Deployment is a single gcloud command.

Key Challenges

1011 WebSocket errors — Gemini Live closes idle connections. Fixed with a keepalive loop sending silent audio every 20 seconds with guards to prevent it firing during tool calls or active speech.
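
The keepalive loop might look like the sketch below. The state keys (`tool_call_active`, `speaking`, `complete`) and the silent-chunk size are my guesses at the project's internals; the 20-second interval and the guards come from the post:

```python
import asyncio

KEEPALIVE_SECS = 20
SILENT_CHUNK = b"\x00" * 512  # ~16 ms of silence at 16 kHz, 16-bit mono

async def keepalive(to_gemini: asyncio.Queue, state: dict,
                    interval: float = KEEPALIVE_SECS) -> None:
    """Periodically push silent audio so Gemini Live doesn't drop the socket."""
    while not state.get("complete"):
        await asyncio.sleep(interval)
        # Guards: never inject silence during a tool call or while anyone speaks.
        if state.get("tool_call_active") or state.get("speaking"):
            continue
        await to_gemini.put(("audio", SILENT_CHUNK))
```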

Duplicate audio — Gemini sometimes sends the same sentence twice as
separate audio turns. Fixed by comparing each turn's transcript against the previous and sending an interrupted signal to the browser to discard duplicates.

Model vocalising tool calls — Gemini occasionally speaks tool names aloud instead of calling them silently. Fixed with transcript cleaning that strips function-call syntax and auto-rescue logic that detects and executes spoken tool calls directly in Python.
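
A rough sketch of that auto-rescue logic, assuming tools are kept in a name-to-callable dict. The regex and the registry shape are illustrative, not the project's actual code:

```python
import re

# Matches spoken call syntax like "end_interview()" appearing in a transcript.
TOOL_CALL_RE = re.compile(r"\b([a-z_]+)\s*\(\s*\)")

def rescue_tool_calls(transcript: str, tools: dict) -> str:
    """Execute any tool the model spoke instead of calling, then strip the syntax."""
    for name in TOOL_CALL_RE.findall(transcript):
        if name in tools:
            tools[name]()  # auto-rescue: run the spoken tool directly in Python
    return TOOL_CALL_RE.sub("", transcript).strip()
```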

Premature interview endings — Without guardrails Gemini would end the
interview after 30 seconds. Fixed with a closing_spoken gate that only
unblocks end_interview after a turn containing actual closing words.

What I Learned

Gemini Live is powerful but needs careful state management on the application side. The challenge isn't the AI — it's building reliable guardrails in an async real-time environment where timing matters at the millisecond level. Pure "let the LLM decide everything" doesn't work in production. Structured tool validation and state machines are essential for agentic systems that need to follow a reliable flow.

This was one of the most challenging and rewarding builds I've done. Real-time audio engineering is unforgiving — but when it works, it feels like magic.

Links


I created this content for the purposes of entering the Gemini Live Agent
Challenge hackathon. #GeminiLiveAgentChallenge
