Disclosure: I created this blog post as part of my submission to the Gemini Live Agent Challenge hackathon. The project, InterviewAce, was built specifically for this competition using Google AI models and Google Cloud. #GeminiLiveAgentChallenge
Live Demo · GitHub · Demo Video
The Problem Nobody Talks About
Everyone knows technical interviews are hard. But here is what nobody says out loud: most people never actually practice them.
Not because they are lazy. Because real practice is expensive and inaccessible:
- Professional mock interview services charge $150–$300 per session
- Asking friends to interview you is awkward and rarely useful
- AI chatbots give you text responses, but real interviews are not text conversations
And here is the deeper issue: the things that fail candidates are not the answers, they are the delivery. The "um"s and "uh"s. The slouched posture. The rambling answer that never reaches a conclusion. The eye contact that breaks every time you think.
No text-based AI tool has ever addressed this. Until now.
What I Built: InterviewAce
InterviewAce is a real-time, multimodal AI interview coach that puts you in a pixel-perfect Google Meet replica with an AI hiring manager, Coach Ace, who simultaneously:
- Speaks to you with sub-500ms voice latency via Gemini 2.5 Flash Native Audio
- Watches your body language live through your webcam
- Detects filler words in real time ("um", "uh", "like", "you know")
- Scores your answers across Confidence, Clarity, Content, and STAR structure
- Searches Google live to provide grounded, company-specific context
- Generates a full performance report with a downloadable transcript
No typing. No text boxes. Just a real conversation with an AI that actually watches and listens.
Tech Stack at a Glance
| Layer | Technology |
|---|---|
| AI Agent | Google ADK + Gemini 2.5 Flash Native Audio |
| Live Streaming | Gemini Live API (bidiGenerateContent) |
| Backend | Python, FastAPI, Uvicorn, WebSockets |
| Frontend | Vanilla JavaScript, Web Audio API, MediaDevices API |
| Grounding | ADK built-in google_search + local knowledge base |
| Infrastructure | Google Cloud Run, Docker, Cloud Build |
System Architecture
Here is the complete picture of how every component connects, from your microphone all the way to Gemini and back:
+----------------------------------------------------------------------+
|                        BROWSER (Vanilla JS)                          |
|                                                                      |
|   Microphone (PCM 16kHz) --+                                         |
|   Camera (JPEG 1fps) ------+--> WebSocket Client <--------+          |
|                                                           |          |
|   Audio Player     <--------------------------------------+          |
|   Closed Captions  <------ Audio + Images + JSON ---------+          |
|   Live Analytics   <--------------------------------------+          |
+-----------------------------------+----------------------------------+
                                    |
                                WebSocket
                                    |
+-----------------------------------+----------------------------------+
|                      FASTAPI BACKEND (Python)                        |
|                                                                      |
|   WebSocket Server (main.py)                                         |
|              |                                                       |
|              v                                                       |
|   LiveRequestQueue --> ADK Runner --> InMemorySessionService         |
+----------------------------------------------------------------------+
          |                                        ^
     Bidi Stream                        Audio + Tool Results
          v                                        |
+----------------------------------------------------------------------+
|                         GOOGLE ADK AGENT                             |
|                                                                      |
|   Gemini 2.5 Flash Native Audio + Vision                             |
|              |                                                       |
|              v  Autonomous Tool Calls (silent, every 2-3 answers)    |
|   +--------------------------------------------------------------+  |
|   |                       11 CUSTOM TOOLS                        |  |
|   |                                                              |  |
|   |  TIER 1 - Core Analysis:                                     |  |
|   |    save_session_feedback     detect_filler_words             |  |
|   |    analyze_body_language     evaluate_star_method            |  |
|   |                                                              |  |
|   |  TIER 2 - Deep Coaching:                                     |  |
|   |    analyze_voice_confidence  get_improvement_tips            |  |
|   |    fetch_grounding_data      adjust_difficulty_level         |  |
|   |                                                              |  |
|   |  TIER 3 - Session Reporting:                                 |  |
|   |    get_session_history       save_session_recording          |  |
|   |    generate_session_report                                   |  |
|   |                                                              |  |
|   |  GROUNDING: google_search (ADK built-in)                     |  |
|   +--------------------------------------------------------------+  |
+----------------------------------------------------------------------+
                                    |
+-----------------------------------+----------------------------------+
|                           GOOGLE CLOUD                               |
|                                                                      |
|   Cloud Run (Serverless Container) + Container Registry              |
+----------------------------------------------------------------------+
Data Flow: Step by Step
STEP 1 - User speaks
Mic → PCM audio (16kHz) → WebSocket → FastAPI → LiveRequestQueue

STEP 2 - Camera streams
Webcam → JPEG frame (1fps, 320×240) → WebSocket → FastAPI → LiveRequestQueue

STEP 3 - Gemini responds
LiveRequestQueue → bidiGenerateContent → Gemini 2.5 Flash
Gemini audio bytes → WebSocket → Browser AudioPlayer → User hears voice

STEP 4 - Background tools fire (silently, every 2-3 answers)
Gemini calls → detect_filler_words()
Gemini calls → analyze_body_language()
Gemini calls → evaluate_star_method()
Gemini calls → save_session_feedback()
Tool results → JSON side-channel → WebSocket → Sidebar updates live

STEP 5 - Transcription
Gemini → Input + Output transcription → Closed Captions rendered in UI

STEP 6 - Session ends
User clicks End Interview
  → generate_session_report()
  → Full modal: scores, breakdown, downloadable transcript
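Steps 1 and 2 send audio chunks and JPEG frames over the same WebSocket, so the backend needs some framing to tell them apart. Here is a minimal sketch of one possible scheme; the one-byte type tag and the `pack_message`/`parse_message` names are my illustration, not necessarily InterviewAce's actual wire format:

```python
# Hypothetical framing: one type byte followed by the raw payload.
AUDIO, IMAGE = 0x01, 0x02

def pack_message(msg_type: int, payload: bytes) -> bytes:
    """Prefix the payload with a one-byte type tag (browser side)."""
    return bytes([msg_type]) + payload

def parse_message(message: bytes) -> dict:
    """Split an incoming frame into type + payload (backend side)."""
    tag, payload = message[0], message[1:]
    if tag == AUDIO:
        return {"type": "audio", "chunk": payload}
    if tag == IMAGE:
        return {"type": "image", "frame": payload}
    raise ValueError(f"unknown message tag: {tag}")
```

Whatever the real format is, the key property is the same: the backend can demultiplex a single binary stream into audio and image branches without a second connection.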
Project File Structure
IntyerviewBit/
├── README.md
├── cloudbuild.yaml                # Google Cloud Build CI/CD
└── interviewace/
    ├── Dockerfile                 # Cloud Run container
    ├── .env.example
    ├── requirements.txt
    └── app/
        ├── main.py                # FastAPI + WebSocket server
        ├── interview_coach_agent/
        │   ├── agent.py           # ADK Agent + 11 tools registered
        │   ├── prompts.py         # Coach Ace persona + instructions
        │   ├── tools.py           # All 11 custom tool implementations
        │   └── grounding_data.py  # Verified local coaching knowledge base
        └── static/
            ├── index.html         # Single-page Google Meet replica
            ├── css/
            │   └── style.css      # Complete Meet-style CSS
            └── js/
                ├── app.js             # Main app logic + WebSocket client
                ├── audio-player.js    # PCM audio playback engine
                ├── audio-recorder.js  # Mic capture + 48kHz→16kHz downsample
                └── camera.js          # Adaptive webcam frame capture
The Agent: Coach Ace
COACH ACE: FULL TOOL MAP

TIER 1 - Core Analysis (fires silently every 2-3 answers)

| Tool | What It Does |
|---|---|
| save_session_feedback | Scores 4 dimensions 0-100: Confidence, Clarity, Content, Body Language |
| detect_filler_words | Counts um / uh / like / you know; updates the live sidebar counter + tips |
| analyze_body_language | Rates posture, eye contact, and expression from the live camera frame |
| evaluate_star_method | Checks S-T-A-R answer structure; lights up the S/T/A/R badges in real time |

TIER 2 - Deep Coaching

| Tool | What It Does |
|---|---|
| analyze_voice_confidence | Pace, volume, tone, and pause analysis |
| get_improvement_tips | Targeted coaching per weakness |
| fetch_grounding_data | Pulls from the verified local knowledge base |
| adjust_difficulty_level | Scales question difficulty up or down |

TIER 3 - Session Management

| Tool | What It Does |
|---|---|
| get_session_history | Retrieves scores from past sessions |
| save_session_recording | Persists the transcript + all metrics |
| generate_session_report | Builds the full post-interview breakdown |

GROUNDING

| Tool | What It Does |
|---|---|
| google_search (ADK built-in) | Live web search for company-specific interview facts |
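To make the Tier 1 shape concrete, here is a minimal sketch of what a scoring tool like save_session_feedback could look like. The signature, clamping, and overall average are my guesses for illustration, not the repo's actual implementation:

```python
def save_session_feedback(confidence: int, clarity: int,
                          content: int, body_language: int) -> dict:
    """Hypothetical sketch: clamp each dimension to 0-100 and
    return the payload the sidebar would render."""
    scores = {
        "confidence": confidence,
        "clarity": clarity,
        "content": content,
        "body_language": body_language,
    }
    # Defend against out-of-range values from the model
    clamped = {k: max(0, min(100, v)) for k, v in scores.items()}
    overall = round(sum(clamped.values()) / len(clamped))
    return {**clamped, "overall": overall}
```

The point of the plain-dict return is that ADK tools hand structured results back to the model and, in InterviewAce's case, to the JSON side-channel feeding the live analytics sidebar.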
Architecture: The Dual Grounding System
Early in development, Coach Ace would confidently hallucinate Amazon Leadership Principles or invent Google interview formats. I fixed it with two grounding layers:
CANDIDATE ASKS: "What is Google's interview process like?"
                         |
                         v
              +-------------------+
              |  GROUNDING ROUTER |
              +---------+---------+
                        |
           +------------+------------+
           v                         v
+----------------------+  +----------------------+
| fetch_grounding_     |  | google_search()      |
| data()               |  | ADK built-in         |
|                      |  |                      |
| LOCAL KNOWLEDGE      |  | LIVE WEB SEARCH      |
| BASE                 |  |                      |
| (grounding_data.py)  |  | Searches for real,   |
|                      |  | current company      |
| Covers:              |  | interview info       |
|  - STAR method       |  |                      |
|  - Body language     |  | Prevents             |
|  - Voice delivery    |  | hallucination of     |
|  - Common mistakes   |  | company-specific     |
|  - Coaching tips     |  | facts                |
+-----------+----------+  +----------+-----------+
            |                        |
            +------------+-----------+
                         v
                GROUNDED RESPONSE
               (accurate + verified)
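The routing decision itself can be as simple as a topic lookup. This sketch is illustrative only: the keyword heuristic, `LOCAL_KB`, and `route_grounding` are my inventions, and in InterviewAce the model itself chooses between the two tools:

```python
# Toy stand-in for grounding_data.py's verified knowledge base
LOCAL_KB = {
    "star method": "Situation, Task, Action, Result: structure every answer.",
    "body language": "Sit upright and keep steady eye contact with the camera.",
}

def route_grounding(question: str) -> str:
    """Route coaching fundamentals to the local KB,
    company-specific questions to live web search."""
    q = question.lower()
    for topic, answer in LOCAL_KB.items():
        if topic in q:
            return f"[local KB] {answer}"
    # Anything not covered locally goes to google_search
    return f"[google_search] {question}"
```

The division of labour is the important part: stable coaching knowledge is served from a verified local source, while volatile company facts always come from live search.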
Code Deep Dive
1. The WebSocket Bridge (main.py)
import asyncio
import uuid

from fastapi import WebSocket
from google.adk.agents import LiveRequestQueue
from google.adk.sessions import InMemorySessionService
from google.genai.types import Blob

# (app, runner, run_agent and parse_message are defined elsewhere in main.py)

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()

    session_service = InMemorySessionService()
    session = await session_service.create_session(
        app_name="interviewace",
        user_id="candidate",
        session_id=str(uuid.uuid4()),
    )
    live_request_queue = LiveRequestQueue()

    # Start the ADK runner: talks to the Gemini Live API in the background
    runner_task = asyncio.create_task(
        run_agent(runner, live_request_queue, websocket, session)
    )

    try:
        async for message in websocket.iter_bytes():
            data = parse_message(message)
            if data["type"] == "audio":
                live_request_queue.send_realtime(
                    Blob(data=data["chunk"], mime_type="audio/pcm;rate=16000")
                )
            elif data["type"] == "image":
                live_request_queue.send_realtime(
                    Blob(data=data["frame"], mime_type="image/jpeg")
                )
    finally:
        runner_task.cancel()
2. The ADK Agent (agent.py)
from google.adk.agents import Agent
from google.adk.tools import google_search
from .tools import (
    save_session_feedback, detect_filler_words,
    analyze_body_language, evaluate_star_method,
    analyze_voice_confidence, get_improvement_tips,
    fetch_grounding_data, adjust_difficulty_level,
    get_session_history, save_session_recording,
    generate_session_report,
)

coach_ace = Agent(
    name="coach_ace",
    model="gemini-2.5-flash-preview-native-audio-dialog",
    description="Senior AI hiring manager for real-time mock interviews",
    instruction=COACH_ACE_PROMPT,
    tools=[
        save_session_feedback, detect_filler_words,
        analyze_body_language, evaluate_star_method,
        analyze_voice_confidence, get_improvement_tips,
        fetch_grounding_data, adjust_difficulty_level,
        get_session_history, save_session_recording,
        generate_session_report,
        google_search,  # ADK built-in grounding tool
    ],
)
3. Silent Background Tool (Example)
import re

def detect_filler_words(transcript: str, session_id: str) -> dict:
    """
    Fires autonomously every 2-3 answers.
    User never hears a pause: runs between turns.
    """
    filler_patterns = ["um", "uh", "like", "you know", "basically", "literally"]
    text = transcript.lower()
    # Word-boundary match so "like" isn't counted inside "likely"
    counts = {
        f: len(re.findall(rf"\b{re.escape(f)}\b", text))
        for f in filler_patterns
    }
    counts = {f: n for f, n in counts.items() if n > 0}
    total = sum(counts.values())
    update_session_analytics(session_id, "filler_words", {
        "total": total,
        "breakdown": counts,
        "coaching_tip": get_filler_tip(total),
    })
    return {"filler_count": total, "breakdown": counts}
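The counting core is easy to exercise in isolation. This standalone version (the `count_fillers` name is mine, not from the repo) uses word boundaries so that, for example, "like" is not counted inside "likely":

```python
import re

def count_fillers(transcript: str) -> dict:
    """Count whole-word filler occurrences in a transcript."""
    fillers = ["um", "uh", "like", "you know", "basically", "literally"]
    text = transcript.lower()
    counts = {
        f: len(re.findall(rf"\b{re.escape(f)}\b", text))
        for f in fillers
    }
    # Keep only fillers that actually occurred
    return {f: n for f, n in counts.items() if n > 0}

# count_fillers("Um, I basically, like, fixed it. I likely would again.")
# -> {"um": 1, "like": 1, "basically": 1}
```

Naive substring counting would report "like" twice in that sentence because of "likely", which is exactly the kind of false positive you do not want feeding a live coaching tip.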
4. PCM Audio Engine (audio-recorder.js)
class AudioRecorder {
  constructor(onAudioData) {
    this.onAudioData = onAudioData;
    this.targetSampleRate = 16000; // Gemini Live expects 16kHz
  }

  async start() {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const context = new AudioContext(); // Native rate is typically 44.1/48kHz
    const source = context.createMediaStreamSource(stream); // feed the mic into the graph
    const scriptProcessor = context.createScriptProcessor(4096, 1, 1);

    scriptProcessor.onaudioprocess = (event) => {
      const inputData = event.inputBuffer.getChannelData(0);
      // Downsample native rate -> 16kHz before sending to Gemini
      const downsampled = this.downsample(inputData, context.sampleRate, 16000);
      const pcm16 = this.floatTo16BitPCM(downsampled);
      this.onAudioData(pcm16); // -> WebSocket -> FastAPI -> Gemini
    };

    source.connect(scriptProcessor);
    scriptProcessor.connect(context.destination);
  }

  downsample(buffer, fromRate, toRate) {
    // Nearest-neighbour resampling: pick every (fromRate/toRate)-th sample
    const ratio = fromRate / toRate;
    const result = new Float32Array(Math.round(buffer.length / ratio));
    for (let i = 0; i < result.length; i++) {
      result[i] = buffer[Math.round(i * ratio)];
    }
    return result;
  }

  floatTo16BitPCM(float32) {
    // Minimal implementation of the helper referenced above:
    // convert [-1, 1] floats to signed 16-bit PCM
    const pcm = new Int16Array(float32.length);
    for (let i = 0; i < float32.length; i++) {
      const s = Math.max(-1, Math.min(1, float32[i]));
      pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
    }
    return pcm;
  }
}
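The nearest-neighbour downsampling is easy to sanity-check outside the browser. Here is the same logic ported to Python (my port, not code from the repo), verified on a 48kHz to 16kHz ratio:

```python
def downsample(buffer, from_rate, to_rate):
    """Nearest-neighbour resample, mirroring audio-recorder.js."""
    ratio = from_rate / to_rate
    n = round(len(buffer) / ratio)
    return [buffer[round(i * ratio)] for i in range(n)]

samples = list(range(48))           # one millisecond of audio at 48kHz
out = downsample(samples, 48000, 16000)
assert len(out) == 16               # 3:1 ratio -> a third of the samples
assert out[:4] == [0, 3, 6, 9]      # every third sample survives
```

Nearest-neighbour resampling skips any low-pass filtering, so it can alias; for speech sent to a speech model that trade-off is usually acceptable, and it keeps the hot path allocation-free and fast.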
5. Adaptive Webcam Capture (camera.js)
class CameraCapture {
  captureFrame() {
    const canvas = document.createElement('canvas');
    canvas.width = 320;   // Low-res: body language doesn't need 1080p
    canvas.height = 240;
    const ctx = canvas.getContext('2d');
    ctx.drawImage(this.videoElement, 0, 0, 320, 240);

    // JPEG 60% quality: sufficient for vision, minimal bandwidth
    canvas.toBlob(
      (blob) => blob.arrayBuffer().then(buf => this.onFrame(buf)),
      'image/jpeg',
      0.6
    );
  }

  adaptFrameRate(networkQuality) {
    // Drop to 0.33fps (1 frame / 3 sec) under poor network
    this.fps = Math.max(0.33, Math.min(1, networkQuality));
  }
}
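The adaptive frame rate is just a bounded mapping from a 0-1 network-quality score to a capture rate. The same clamp in Python, for clarity (my port; `adapt_fps` is not a name from the repo):

```python
def adapt_fps(network_quality: float) -> float:
    """Clamp webcam capture rate between 0.33fps and 1fps."""
    return max(0.33, min(1.0, network_quality))

assert adapt_fps(0.9) == 0.9    # decent network: close to 1 frame/sec
assert adapt_fps(0.1) == 0.33   # poor network: 1 frame every 3 seconds
assert adapt_fps(5.0) == 1.0    # quality score capped at 1fps
```

The floor matters more than the ceiling: even on a bad connection, Gemini keeps getting at least one frame every three seconds, so body-language analysis degrades gracefully instead of stopping.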
The UI: Google Meet Replica
+------------------------------------------------------------------------+
| InterviewAce [Google logo]                    12:34      [Participants] |
+----------------------------------------------+-------------------------+
|                                              |  LIVE ANALYTICS         |
|  +----------------+  +----------------+      |                         |
|  |                |  |                |      |  Confidence  ####-      |
|  |   COACH ACE    |  |     ELENA      |      |  Clarity     ###--      |
|  | AI Interviewer |  |  AI Notetaker  |      |  STAR Score  ###--      |
|  |                |  |                |      |  Body Lang.  ####-      |
|  | [Volume Rings] |  | [Volume Rings] |      |                         |
|  +----------------+  +----------------+      |  Filler Words: 3        |
|                                              |    um(2) uh(1)          |
|  +------------------------------------+      |                         |
|  |                                    |      |  Eye Contact   ✓        |
|  |               YOU                  |      |  Posture       ✓        |
|  |          (Live Webcam)             |      |  Expression    ✓        |
|  |                                    |      |                         |
|  | [Equalizer bars animate when mic]  |      |  STAR Badges:           |
|  +------------------------------------+      |  [S✓][T✓][A✓][R✓]       |
|                                              |                         |
|  CC: "...tell me about a time you had        |  Tip: Use 'I' not 'we'  |
|  to debug a production issue..."             |                         |
+----------------------------------------------+-------------------------+
| [Mic] [Cam] [CC] [Chat] [People]                       [End Interview] |
+------------------------------------------------------------------------+
Built in 100% Vanilla JavaScript: no React, no Vue, no Angular. Framework render cycles add scheduling overhead on the main thread, and that overhead can disturb PCM audio timing. When streaming 16kHz audio, even a few milliseconds of jitter is audible.
Cloud Deployment Pipeline
Developer pushes to GitHub
        |
        v
Google Cloud Build (cloudbuild.yaml)
        |
        ├── docker build -t gcr.io/PROJECT/interviewace .
        ├── docker push → Container Registry
        └── gcloud run deploy interviewace
              ├── Region: us-central1
              ├── Memory: 1Gi
              ├── Port: 8080
              ├── session-affinity: TRUE   ← critical for WebSockets
              └── allow-unauthenticated: TRUE
        |
        v
Cloud Run Serverless Container
  ├── Scales to zero when idle (cost: $0)
  ├── Handles WebSocket connections persistently
  └── Scales instantly on demand
Key lesson learned the hard way: Cloud Run requires --session-affinity for any WebSocket-based app. Without it, the load balancer can route mid-session requests to a different container instance, breaking your persistent connection. This cost me hours to debug.
Challenges I Ran Into
1. Real-time audio + vision sync
Streaming 16kHz PCM audio and JPEG frames simultaneously over one WebSocket without dropped frames required bandwidth-adaptive throttling and decoupled queues.
2. Barge-in handling
When the user speaks mid-response, the agent must stop cleanly without corrupting the audio buffer. Getting reliable barge-in via the ADK streaming layer took multiple iterations.
3. Invisible tool calls
All background analysis tools need to feel completely invisible. I tuned them to fire silently between turns and stream analytics to the UI via a JSON side-channel on the same WebSocket.
4. Company hallucination
Early versions confidently invented interview formats. Fixed with two-layer grounding: verified local knowledge base + ADK's built-in Google Search.
5. Audio latency in production
Getting end-to-end voice latency under 500ms on Cloud Run required tuning buffer sizes, optimising the PCM pipeline, and keeping ASGI async throughout.
Key Lessons
- Native audio models are fundamentally different from text models. Design your system for async bidirectional streaming from the ground up.
- Grounding is non-negotiable for agentic apps. Prompt engineering alone is not enough: even capable models hallucinate domain facts without grounding.
- Vanilla JS outperforms frameworks for latency-sensitive audio/video. Full control over the audio pipeline timing matters at the PCM level.
- ADK LiveRequestQueue needs careful queue management. Keep audio, vision, and tool-call result streams strictly decoupled.
- Always add session-affinity to Cloud Run WebSocket services. Stateless load balancing breaks persistent connections.
Try It Yourself
Live Demo: https://interviewace-117780891544.us-central1.run.app/
No API key. No signup. No cost. Click the link, allow mic + camera, and start your mock interview.
GitHub: https://github.com/SameerAliKhan-git/IntyerviewBit
Demo Video: https://youtu.be/JrjhgB5Ib_0
Built for the Gemini Live Agent Challenge using Google ADK · Gemini Live API · Google Cloud Run
#GeminiLiveAgentChallenge #GoogleAI #Gemini #ADK #GoogleCloud #Python #WebDev

