Sameer Ali khan

InterviewAce: Real-Time AI Mock Interviews with Gemini Live API & Google ADK

📢 Disclosure: I created this blog post as part of my submission to the Gemini Live Agent Challenge hackathon. The project, InterviewAce, was built specifically for this competition using Google AI models and Google Cloud. #GeminiLiveAgentChallenge


🔗 Live Demo · 💻 GitHub · 📹 Demo Video


The Problem Nobody Talks About

Everyone knows technical interviews are hard. But here is what nobody says out loud: most people never actually practice them.

Not because they are lazy. Because real practice is expensive and inaccessible:

  • Professional mock interview services charge $150–$300 per session
  • Asking friends to interview you is awkward and rarely useful
  • AI chatbots give you text responses, but real interviews are not text conversations

And here is the deeper issue: what fails candidates is often not the answers but the delivery. The "um"s and "uh"s. The slouched posture. The rambling answer that never reaches a conclusion. The eye contact that breaks every time you think.

No text-based AI tool has ever addressed this. Until now.


What I Built: InterviewAce

InterviewAce is a real-time, multimodal AI interview coach that puts you in a pixel-perfect Google Meet replica with an AI hiring manager called Coach Ace, who simultaneously:

  • 🗣️ Speaks to you with sub-500ms voice latency via Gemini 2.5 Flash Native Audio
  • 👀 Watches your body language live through your webcam
  • 🎤 Detects filler words in real time ("um", "uh", "like", "you know")
  • 📊 Scores your answers across Confidence, Clarity, Content and STAR structure
  • 🔍 Searches Google live to ground company-specific context in current sources
  • 📝 Generates a full performance report with downloadable transcript

No typing. No text boxes. Just a real conversation with an AI that actually watches and listens.


Tech Stack at a Glance

Layer            Technology
AI Agent         Google ADK + Gemini 2.5 Flash Native Audio
Live Streaming   Gemini Live API (bidiGenerateContent)
Backend          Python, FastAPI, Uvicorn, WebSockets
Frontend         Vanilla JavaScript, Web Audio API, MediaDevices API
Grounding        ADK built-in google_search + local knowledge base
Infrastructure   Google Cloud Run, Docker, Cloud Build

System Architecture

Here is the complete picture of how every component connects, from your microphone all the way to Gemini and back:


┌──────────────────────────────────────────────────────────────────────┐
│                      🖥️  BROWSER (Vanilla JS)                        │
│                                                                      │
│  🎤 Microphone (PCM 16kHz)  ──┐                                      │
│  📷 Camera (JPEG 1fps)  ──────┼──▶  WebSocket Client ◀──────────┐    │
│                               │         │                       │    │
│  🔊 Audio Player  ◀───────────┼─────────┘                       │    │
│  💬 Closed Captions ◀─────────┤      Audio + Images + JSON      │    │
│  📊 Live Analytics ◀──────────┘                                 │    │
└─────────────────────────────────────────────────────────────────┼────┘
                                                                  │ WebSocket
┌─────────────────────────────────────────────────────────────────▼────┐
│                    ⚙️  FASTAPI BACKEND (Python)                      │
│                                                                      │
│   WebSocket Server (main.py)                                         │
│          │                                                           │
│          ▼                                                           │
│   LiveRequestQueue ──▶ ADK Runner ──▶ InMemorySessionService         │
└──────────────────────────────────────────────────────────────────────┘
                    │                        ▲
           Bidi Stream                  Audio + Tool Results
                    │                        │
┌───────────────────▼────────────────────────┴─────────────────────────┐
│                      🤖  GOOGLE ADK AGENT                            │
│                                                                      │
│     Gemini 2.5 Flash Native Audio + Vision                           │
│          │                                                           │
│          ▼  Autonomous Tool Calls (silent, every 2-3 answers)        │
│   ┌──────────────────────────────────────────────────────────────┐   │
│   │                   🔧 11 CUSTOM TOOLS                         │   │
│   │                                                              │   │
│   │  TIER 1 - Core Analysis:                                     │   │
│   │   save_session_feedback  │  detect_filler_words              │   │
│   │   analyze_body_language  │  evaluate_star_method             │   │
│   │                                                              │   │
│   │  TIER 2 - Deep Coaching:                                     │   │
│   │   analyze_voice_confidence  │  get_improvement_tips          │   │
│   │   fetch_grounding_data      │  adjust_difficulty_level       │   │
│   │                                                              │   │
│   │  TIER 3 - Session Reporting:                                 │   │
│   │   get_session_history  │  save_session_recording             │   │
│   │   generate_session_report                                    │   │
│   │                                                              │   │
│   │  GROUNDING:  google_search (ADK built-in)                    │   │
│   └──────────────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────────────┘
                               │
┌──────────────────────────────▼───────────────────────────────────────┐
│                        ☁️  GOOGLE CLOUD                              │
│                                                                      │
│      Cloud Run (Serverless Container)  +  Container Registry         │
└──────────────────────────────────────────────────────────────────────┘

Data Flow: Step by Step

STEP 1 ── User speaks
          Mic → PCM audio (16kHz) → WebSocket → FastAPI → LiveRequestQueue

STEP 2 ── Camera streams
          Webcam → JPEG frame (1fps, 320×240) → WebSocket → FastAPI → LiveRequestQueue

STEP 3 ── Gemini responds
          LiveRequestQueue → bidiGenerateContent → Gemini 2.5 Flash
          Gemini audio bytes → WebSocket → Browser AudioPlayer → User hears voice

STEP 4 ── Background tools fire (silently, every 2-3 answers)
          Gemini calls → detect_filler_words()
          Gemini calls → analyze_body_language()
          Gemini calls → evaluate_star_method()
          Gemini calls → save_session_feedback()
          Tool results → JSON side-channel → WebSocket → Sidebar updates live

STEP 5 ── Transcription
          Gemini → Input + Output transcription → Closed Captions rendered in UI

STEP 6 ── Session ends
          User clicks End Interview
          → generate_session_report()
          → Full modal: scores, breakdown, downloadable transcript
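The JSON side-channel in Step 4 can be pictured as a small envelope around each tool result. This is a minimal illustration, not the project's actual wire format: the `make_sidebar_event` and `is_audio_frame` helpers and the `type`, `tool`, `payload`, and `sent_at` keys are assumptions made for the example.

```python
import json
import time


def make_sidebar_event(tool_name: str, result: dict) -> str:
    """Wrap a tool result in a JSON envelope for the browser sidebar.

    Hypothetical format: the key names are illustrative assumptions.
    """
    envelope = {
        "type": "analytics",     # lets the client tell analytics from captions
        "tool": tool_name,       # e.g. "detect_filler_words"
        "payload": result,       # the dict returned by the tool
        "sent_at": time.time(),  # server timestamp in seconds
    }
    return json.dumps(envelope)


def is_audio_frame(message) -> bool:
    """Binary frames carry PCM audio; text frames carry JSON events."""
    return isinstance(message, (bytes, bytearray))


event = make_sidebar_event("detect_filler_words", {"filler_count": 3})
decoded = json.loads(event)
```

Keeping audio in binary frames and analytics in text frames lets the browser branch on frame type before it parses anything.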

Project File Structure

IntyerviewBit/
├── README.md
├── cloudbuild.yaml                     ← Google Cloud Build CI/CD
└── interviewace/
    ├── Dockerfile                      ← Cloud Run container
    ├── .env.example
    ├── requirements.txt
    └── app/
        ├── main.py                     ← FastAPI + WebSocket server
        ├── interview_coach_agent/
        │   ├── agent.py                ← ADK Agent + 11 tools registered
        │   ├── prompts.py              ← Coach Ace persona + instructions
        │   ├── tools.py                ← All 11 custom tool implementations
        │   └── grounding_data.py       ← Verified local coaching knowledge base
        └── static/
            ├── index.html              ← Single-page Google Meet replica
            ├── css/
            │   └── style.css           ← Complete Meet-style CSS
            └── js/
                ├── app.js              ← Main app logic + WebSocket client
                ├── audio-player.js     ← PCM audio playback engine
                ├── audio-recorder.js   ← Mic capture + 48kHz→16kHz downsample
                └── camera.js           ← Adaptive webcam frame capture

The Agent: Coach Ace

COACH ACE - FULL TOOL MAP
────────────────────────────────────────────────────────────────────

TIER 1 - Core Analysis  (fires silently every 2-3 answers)
┌───────────────────────────┬──────────────────────────────────────┐
│ Tool                      │ What It Does                         │
├───────────────────────────┼──────────────────────────────────────┤
│ save_session_feedback     │ Scores 4 dimensions 0-100:           │
│                           │ Confidence, Clarity, Content,        │
│                           │ Body Language                        │
├───────────────────────────┼──────────────────────────────────────┤
│ detect_filler_words       │ Counts um / uh / like / you know.    │
│                           │ Updates live sidebar counter + tips  │
├───────────────────────────┼──────────────────────────────────────┤
│ analyze_body_language     │ Rates posture, eye contact,          │
│                           │ expression from live camera frame    │
├───────────────────────────┼──────────────────────────────────────┤
│ evaluate_star_method      │ Checks S-T-A-R answer structure.     │
│                           │ Lights up S T A R badges in real time│
└───────────────────────────┴──────────────────────────────────────┘

TIER 2 - Deep Coaching
┌───────────────────────────┬──────────────────────────────────────┐
│ analyze_voice_confidence  │ Pace, volume, tone, pause analysis   │
│ get_improvement_tips      │ Targeted coaching per weakness       │
│ fetch_grounding_data      │ Pulls from verified local KB         │
│ adjust_difficulty_level   │ Scales question difficulty up/down   │
└───────────────────────────┴──────────────────────────────────────┘

TIER 3 - Session Management
┌───────────────────────────┬──────────────────────────────────────┐
│ get_session_history       │ Retrieves scores from past sessions  │
│ save_session_recording    │ Persists transcript + all metrics    │
│ generate_session_report   │ Builds full post-interview breakdown │
└───────────────────────────┴──────────────────────────────────────┘

GROUNDING
┌───────────────────────────┬──────────────────────────────────────┐
│ google_search             │ ADK built-in. Live web search for    │
│ (ADK built-in)            │ company-specific interview facts     │
└───────────────────────────┴──────────────────────────────────────┘
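As an illustration of the 0-100 scoring that save_session_feedback reports, here is a minimal sketch that clamps each dimension into range and derives an overall figure. The `normalize_scores` name, the lower-case keys, and the unweighted average are assumptions; the post does not say how the project combines its scores.

```python
DIMENSIONS = ("confidence", "clarity", "content", "body_language")


def normalize_scores(raw: dict) -> dict:
    """Clamp each dimension to 0-100 and add an unweighted overall average.

    Hypothetical helper: a dimension missing from `raw` is treated as 0.
    """
    scores = {}
    for name in DIMENSIONS:
        value = float(raw.get(name, 0))
        scores[name] = max(0.0, min(100.0, value))
    total = sum(scores[name] for name in DIMENSIONS)
    scores["overall"] = round(total / len(DIMENSIONS), 1)
    return scores


# Out-of-range model output (140, -5) is pulled back into 0-100.
result = normalize_scores(
    {"confidence": 80, "clarity": 140, "content": 60, "body_language": -5}
)
```

Clamping on the server keeps the sidebar bars sane even when the model returns a value outside the expected range.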


The Dual Grounding System

Early in development, Coach Ace would confidently hallucinate Amazon Leadership Principles or invent Google interview formats. I fixed it with two grounding layers:

CANDIDATE ASKS: "What is Google's interview process like?"
                              │
                              ▼
                ┌─────────────────────────┐
                │     GROUNDING ROUTER    │
                └────────┬────────────────┘
                         │
            ┌────────────┴────────────┐
            ▼                         ▼
┌─────────────────────┐   ┌─────────────────────┐
│ fetch_grounding_    │   │  google_search()    │
│ data()              │   │  ADK built-in       │
│                     │   │                     │
│ LOCAL KNOWLEDGE     │   │  LIVE WEB SEARCH    │
│ BASE                │   │                     │
│ (grounding_data.py) │   │  Searches for real, │
│                     │   │  current company    │
│ Covers:             │   │  interview info     │
│ • STAR method       │   │                     │
│ • Body language     │   │  Reduces            │
│ • Voice delivery    │   │  hallucination of   │
│ • Common mistakes   │   │  company-specific   │
│ • Coaching tips     │   │  facts              │
└──────────┬──────────┘   └──────────┬──────────┘
           │                         │
           └────────────┬────────────┘
                        ▼
             GROUNDED RESPONSE
             (accurate + verified)
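The two grounding layers can be made explicit in a few lines. In the project the choice between tools is presumably made by the model following its instructions; the sketch below hard-codes that decision purely for illustration. `LOCAL_KB`, `lookup_local`, `route_grounding`, and `fake_search` are hypothetical names, and the knowledge-base entries are placeholders rather than content from grounding_data.py.

```python
from typing import Callable, Optional

# Stand-in for grounding_data.py; the entries are placeholders.
LOCAL_KB = {
    "star method": "Structure answers as Situation, Task, Action, Result.",
    "filler words": "Pause silently instead of saying 'um' or 'uh'.",
}


def lookup_local(query: str) -> Optional[str]:
    """Return local coaching advice if a known topic appears in the query."""
    lowered = query.lower()
    for topic, advice in LOCAL_KB.items():
        if topic in lowered:
            return advice
    return None


def route_grounding(query: str, web_search: Callable[[str], str]) -> dict:
    """Prefer the local knowledge base; fall back to live search."""
    local = lookup_local(query)
    if local is not None:
        return {"source": "local_kb", "answer": local}
    return {"source": "web_search", "answer": web_search(query)}


def fake_search(query: str) -> str:
    """Placeholder for a real search call."""
    return f"[search results for: {query}]"


coaching = route_grounding("How do I use the STAR method?", fake_search)
company = route_grounding("What is Google's interview process like?", fake_search)
```

Generic coaching questions stay on the local layer, and company-specific ones fall through to search.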

Code Deep Dive

1. The WebSocket Bridge (main.py)

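import asyncio
import uuid

from fastapi import FastAPI, WebSocket
from google.adk.agents import LiveRequestQueue
from google.adk.sessions import InMemorySessionService
from google.genai.types import Blob

app = FastAPI()

# runner, run_agent and parse_message are project helpers that this excerpt does not show.


@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()

    session_service = InMemorySessionService()
    session = await session_service.create_session(
        app_name="interviewace",
        user_id="candidate",
        session_id=str(uuid.uuid4())
    )

    live_request_queue = LiveRequestQueue()

    # Start ADK runner: talks to Gemini Live API in the background
    runner_task = asyncio.create_task(
        run_agent(runner, live_request_queue, websocket, session)
    )

    try:
        async for message in websocket.iter_bytes():
            data = parse_message(message)

            if data["type"] == "audio":
                live_request_queue.send_realtime(
                    Blob(data=data["chunk"], mime_type="audio/pcm;rate=16000")
                )
            elif data["type"] == "image":
                live_request_queue.send_realtime(
                    Blob(data=data["frame"], mime_type="image/jpeg")
                )
    finally:
        # Close the queue so the live stream ends, then stop the background task
        live_request_queue.close()
        runner_task.cancel()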
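The bridge relies on parse_message, which the post does not show. One way it could work, assuming the browser prepends a one-byte tag to every binary frame, is sketched below. The tag values and the framing itself are assumptions made for this example; only the returned `type`, `chunk`, and `frame` keys come from the code above.

```python
AUDIO_TAG = 0x01  # hypothetical tag for a PCM audio chunk
IMAGE_TAG = 0x02  # hypothetical tag for a JPEG frame


def parse_message(message: bytes) -> dict:
    """Split a binary frame into a type tag and its payload."""
    if not message:
        raise ValueError("empty frame")
    tag, payload = message[0], message[1:]
    if tag == AUDIO_TAG:
        return {"type": "audio", "chunk": payload}
    if tag == IMAGE_TAG:
        return {"type": "image", "frame": payload}
    raise ValueError(f"unknown frame tag: {tag:#04x}")


audio_msg = parse_message(bytes([AUDIO_TAG]) + b"\x00\x10")
image_msg = parse_message(bytes([IMAGE_TAG]) + b"\xff\xd8")
```

A single tag byte keeps the framing cheap: the server never has to base64-decode or JSON-parse the high-volume audio path.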

2. The ADK Agent (agent.py)

from google.adk.agents import Agent
from google.adk.tools import google_search
from .prompts import COACH_ACE_PROMPT  # persona + instructions live in prompts.py
from .tools import (
    save_session_feedback, detect_filler_words,
    analyze_body_language, evaluate_star_method,
    analyze_voice_confidence, get_improvement_tips,
    fetch_grounding_data, adjust_difficulty_level,
    get_session_history, save_session_recording,
    generate_session_report,
)

coach_ace = Agent(
    name="coach_ace",
    model="gemini-2.5-flash-preview-native-audio-dialog",
    description="Senior AI hiring manager - real-time mock interviews",
    instruction=COACH_ACE_PROMPT,
    tools=[
        save_session_feedback, detect_filler_words,
        analyze_body_language, evaluate_star_method,
        analyze_voice_confidence, get_improvement_tips,
        fetch_grounding_data, adjust_difficulty_level,
        get_session_history, save_session_recording,
        generate_session_report,
        google_search,   # ADK built-in grounding tool
    ],
)

3. Silent Background Tool (Example)

import re

FILLER_PATTERNS = ["um", "uh", "like", "you know", "basically", "literally"]


def detect_filler_words(transcript: str, session_id: str) -> dict:
    """
    Fires autonomously every 2-3 answers.
    Runs between turns, so the user never hears a pause.
    """
    text = transcript.lower()
    counts = {}
    for filler in FILLER_PATTERNS:
        # Word boundaries keep "um" from matching inside words like "summary"
        hits = len(re.findall(rf"\b{re.escape(filler)}\b", text))
        if hits:
            counts[filler] = hits
    total = sum(counts.values())

    # update_session_analytics and get_filler_tip are project helpers not shown here
    update_session_analytics(session_id, "filler_words", {
        "total": total,
        "breakdown": counts,
        "coaching_tip": get_filler_tip(total)
    })

    return {"filler_count": total, "breakdown": counts}
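Filler detection is sensitive to how a match is defined. Plain substring counting also counts "um" inside "summed" and "like" inside "unlike", while a word-boundary regex counts only standalone fillers. The sketch below compares the two approaches; `count_substring` and `count_whole_words` are illustrative names, not functions from the project.

```python
import re

FILLERS = ["um", "uh", "like", "you know"]


def count_substring(text: str, fillers: list) -> dict:
    """Naive approach: counts every occurrence, even inside other words."""
    lowered = text.lower()
    return {f: lowered.count(f) for f in fillers if lowered.count(f)}


def count_whole_words(text: str, fillers: list) -> dict:
    """Word-boundary approach: counts only standalone fillers."""
    lowered = text.lower()
    counts = {}
    for filler in fillers:
        hits = len(re.findall(rf"\b{re.escape(filler)}\b", lowered))
        if hits:
            counts[filler] = hits
    return counts


sample = "Um, I summed the columns, uh, unlike the previous team."
naive = count_substring(sample, FILLERS)
strict = count_whole_words(sample, FILLERS)
```

On this sample the naive count reports three "um"s and one "like", while the word-boundary count reports one "um" and no "like". Neither approach can tell the filler "like" from the verb, so that word will still be overcounted in real answers.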

4. PCM Audio Engine (audio-recorder.js)

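class AudioRecorder {
    constructor(onAudioData) {
        this.onAudioData = onAudioData;
        this.targetSampleRate = 16000; // Gemini Live expects 16kHz
    }

    async start() {
        const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
        const context = new AudioContext(); // Native rate is usually 44.1kHz or 48kHz
        const source = context.createMediaStreamSource(stream);

        // ScriptProcessorNode is deprecated in favor of AudioWorklet but still widely supported
        const scriptProcessor = context.createScriptProcessor(4096, 1, 1);
        scriptProcessor.onaudioprocess = (event) => {
            const inputData = event.inputBuffer.getChannelData(0);
            // Downsample from the native rate to 16kHz before sending to Gemini
            const downsampled = this.downsample(inputData, context.sampleRate, this.targetSampleRate);
            const pcm16 = this.floatTo16BitPCM(downsampled);
            this.onAudioData(pcm16); // → WebSocket → FastAPI → Gemini
        };

        source.connect(scriptProcessor);
        scriptProcessor.connect(context.destination);
    }

    // Nearest-sample decimation; floor keeps every read inside the buffer
    downsample(buffer, fromRate, toRate) {
        const ratio = fromRate / toRate;
        const result = new Float32Array(Math.floor(buffer.length / ratio));
        for (let i = 0; i < result.length; i++) {
            result[i] = buffer[Math.floor(i * ratio)];
        }
        return result;
    }

    // Assumed implementation: clamp to [-1, 1] and scale to signed 16-bit
    floatTo16BitPCM(samples) {
        const pcm = new Int16Array(samples.length);
        for (let i = 0; i < samples.length; i++) {
            const s = Math.max(-1, Math.min(1, samples[i]));
            pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
        }
        return pcm.buffer;
    }
}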
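For readers who want to check the conversion outside the browser, the same two steps (decimate to the target rate, then convert floats to 16-bit PCM) can be sketched in Python. The function names are illustrative, and like the JavaScript version this skips the low-pass filter that a production resampler would apply before decimating.

```python
import struct


def downsample(samples, from_rate: int, to_rate: int) -> list:
    """Nearest-sample decimation with no anti-aliasing filter."""
    if to_rate >= from_rate:
        return list(samples)
    ratio = from_rate / to_rate
    out_len = int(len(samples) / ratio)
    # int() floors the index, so every read stays inside the input
    return [samples[int(i * ratio)] for i in range(out_len)]


def float_to_pcm16(samples) -> bytes:
    """Clamp floats to [-1, 1] and pack as little-endian signed 16-bit."""
    ints = []
    for sample in samples:
        clamped = max(-1.0, min(1.0, sample))
        ints.append(int(clamped * 32767))
    return struct.pack(f"<{len(ints)}h", *ints)


frame = [0.0, 0.5, 1.0, -1.0, 2.0, -0.25]
down = downsample(frame, 48000, 16000)  # keeps every third sample
pcm = float_to_pcm16(down)
```

At 16kHz with 16-bit samples, one second of mono audio is 32,000 bytes, which is the payload size the WebSocket has to sustain.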

5. Adaptive Webcam Capture (camera.js)

class CameraCapture {
    captureFrame() {
        const canvas = document.createElement('canvas');
        canvas.width = 320;  // Low-res: body language doesn't need 1080p
        canvas.height = 240;

        const ctx = canvas.getContext('2d');
        ctx.drawImage(this.videoElement, 0, 0, 320, 240);

        // JPEG 60% quality: sufficient for vision, minimal bandwidth
        canvas.toBlob(
            (blob) => blob.arrayBuffer().then(buf => this.onFrame(buf)),
            'image/jpeg',
            0.6
        );
    }

    adaptFrameRate(networkQuality) {
        // Drop to 0.33fps (1 frame/3 sec) under poor network
        this.fps = Math.max(0.33, Math.min(1, networkQuality));
    }
}
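The clamp in adaptFrameRate translates into a capture interval as follows. This is a sketch under the assumption that networkQuality is a score between 0 and 1 that doubles as the target frame rate; `frame_interval_seconds` is a hypothetical helper, not part of camera.js.

```python
def frame_interval_seconds(network_quality: float,
                           min_fps: float = 0.33,
                           max_fps: float = 1.0) -> float:
    """Clamp the quality score to a frame rate and return seconds per frame."""
    fps = max(min_fps, min(max_fps, network_quality))
    return 1.0 / fps


healthy = frame_interval_seconds(1.0)    # one frame per second
degraded = frame_interval_seconds(0.5)   # one frame every two seconds
worst = frame_interval_seconds(0.0)      # floor: roughly one frame every three seconds
```

The floor matters more than the ceiling: without it, a quality score of zero would stop the video feed entirely and the body-language tools would have nothing to look at.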

The UI: Google Meet Replica

┌──────────────────────────────────────────────────────────────────────┐
│  InterviewAce  [Google logo]          ⏱ 12:34      [Participants]    │
├─────────────────────────────────────────────┬────────────────────────┤
│                                             │   📊 LIVE ANALYTICS    │
│  ┌─────────────────┐  ┌─────────────────┐   │                        │
│  │                 │  │                 │   │  Confidence  ████▓░    │
│  │  COACH ACE      │  │  ELENA          │   │  Clarity     ███░░░    │
│  │  AI Interviewer │  │  AI Notetaker   │   │  STAR Score  ████░░    │
│  │                 │  │                 │   │  Body Lang.  ███▓░░    │
│  │ [Volume Rings]  │  │ [Volume Rings]  │   │                        │
│  └─────────────────┘  └─────────────────┘   │  🎤 Filler Words: 3    │
│                                             │  um(2)  uh(1)          │
│  ┌──────────────────────────────────────┐   │                        │
│  │                                      │   │  👁 Eye Contact  ●●    │
│  │            YOU                       │   │  🧍 Posture      ●○    │
│  │        (Live Webcam)                 │   │  😊 Expression   ●●    │
│  │                                      │   │                        │
│  │  [Equalizer bars animate when mic]   │   │  STAR Badges:          │
│  └──────────────────────────────────────┘   │  [S✓][T✓][A○][R○]      │
│                                             │                        │
│  💬 CC: "...tell me about a time you had    │  💡 Use 'I' not 'we'   │
│          to debug a production issue..."    │                        │
├─────────────────────────────────────────────┴────────────────────────┤
│  [🎤 Mic]  [📷 Cam]  [CC]  [Chat]  [People]    [🔴 End Interview]    │
└──────────────────────────────────────────────────────────────────────┘

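Built in 100% Vanilla JavaScript: no React, no Vue, no Angular. I wanted direct control over the audio pipeline, with no framework rendering work competing for the main thread. The capture callback runs on that thread, so any stall there shows up as an audible gap or click in the 16kHz PCM stream.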


Cloud Deployment Pipeline

Developer pushes to GitHub
         │
         ▼
Google Cloud Build (cloudbuild.yaml)
         │
         ├── docker build -t gcr.io/PROJECT/interviewace .
         ├── docker push → Container Registry
         └── gcloud run deploy interviewace
                  │
                  ├── Region:         us-central1
                  ├── Memory:         1Gi
                  ├── Port:           8080
                  ├── session-affinity: TRUE  ← critical for WebSockets
                  └── allow-unauthenticated:  TRUE
         │
         ▼
Cloud Run Serverless Container
         ├── Scales to zero when idle  (cost: $0)
         ├── Handles WebSocket connections persistently
         └── Scales instantly on demand

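Key lesson learned the hard way: enable --session-affinity on Cloud Run for any WebSocket-based app that keeps session state in memory. An open WebSocket stays on one instance, but when a client reconnects (after a network drop or the request timeout), the load balancer may route it to a different container instance that knows nothing about the session. Session affinity is best-effort, but it makes landing on the same instance far more likely. This cost me hours to debug.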


Challenges I Ran Into

1. Real-time audio + vision sync
Streaming 16kHz PCM audio and JPEG frames simultaneously over one WebSocket without dropped frames required bandwidth-adaptive throttling and decoupled queues.

2. Barge-in handling
When the user speaks mid-response, the agent must stop cleanly without corrupting the audio buffer. Getting reliable barge-in via the ADK streaming layer took multiple iterations.

3. Invisible tool calls
All background analysis tools need to feel completely invisible. I tuned them to fire silently between turns and stream analytics to the UI via a JSON side-channel on the same WebSocket.

4. Company hallucination
Early versions confidently invented interview formats. Fixed with two-layer grounding: verified local knowledge base + ADK's built-in Google Search.

5. Audio latency in production
Getting end-to-end voice latency under 500ms on Cloud Run required tuning buffer sizes, optimising the PCM pipeline, and keeping ASGI async throughout.
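
The "decoupled queues" idea from challenge 1 can be sketched with asyncio: audio goes through a FIFO queue because every chunk matters, while video keeps only the newest frame so a slow link never builds a backlog. `LatestFrameSlot` is a hypothetical class written for this example, not code from the project.

```python
import asyncio


class LatestFrameSlot:
    """Holds only the newest video frame; older unsent frames are dropped."""

    def __init__(self):
        self._frame = None
        self._event = asyncio.Event()
        self.dropped = 0

    def put(self, frame: bytes) -> None:
        if self._frame is not None:
            self.dropped += 1  # an unsent frame is being overwritten
        self._frame = frame
        self._event.set()

    async def get(self) -> bytes:
        await self._event.wait()
        frame, self._frame = self._frame, None
        self._event.clear()
        return frame


async def demo():
    audio = asyncio.Queue()    # FIFO: every audio chunk is delivered in order
    video = LatestFrameSlot()  # latest-wins: stale frames are discarded
    for i in range(3):
        audio.put_nowait(f"audio-{i}".encode())
        video.put(f"frame-{i}".encode())
    sent_audio = [audio.get_nowait() for _ in range(audio.qsize())]
    sent_frame = await video.get()
    return sent_audio, sent_frame, video.dropped


sent_audio, sent_frame, dropped = asyncio.run(demo())
```

With three chunks and three frames queued before the sender runs, all three audio chunks go out in order but only the last frame does. Dropping stale frames is acceptable for body-language analysis, whereas dropping audio would be audible.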


Key Lessons

  • Native audio models are fundamentally different from text models. Design your system for async bidirectional streaming from the ground up.
  • Grounding is non-negotiable for agentic apps. Prompt engineering alone is not enough: even capable models hallucinate domain facts without grounding.
  • Vanilla JS served this latency-sensitive audio/video pipeline better than a framework would have. Full control over the audio pipeline timing matters at the PCM level.
  • ADK LiveRequestQueue needs careful queue management. Keep audio, vision, and tool-call result streams strictly decoupled.
  • Always add session-affinity to Cloud Run WebSocket services. Without it, a reconnecting client can land on an instance that does not hold its session.

Try It Yourself

🔗 Live Demo: https://interviewace-117780891544.us-central1.run.app/

No API key. No signup. No cost. Click the link, allow mic + camera, and start your mock interview.

💻 GitHub: https://github.com/SameerAliKhan-git/IntyerviewBit

📹 Demo Video: https://youtu.be/JrjhgB5Ib_0


Built for the Gemini Live Agent Challenge using Google ADK · Gemini Live API · Google Cloud Run

#GeminiLiveAgentChallenge #GoogleAI #Gemini #ADK #GoogleCloud #Python #WebDev
