Disclosure: I created this blog post as part of my submission to the Gemini Live Agent Challenge hackathon. The project, InterviewAce, was built specifically for this competition using Google AI models and Google Cloud. #GeminiLiveAgentChallenge
Live Demo · GitHub · Demo Video
The Problem Nobody Talks About
Everyone knows technical interviews are hard. But here is what nobody says out loud: most people never actually practice them.
Not because they are lazy. Because real practice is expensive and inaccessible:
- Professional mock interview services charge $150–$300 per session
- Asking friends to interview you is awkward and rarely useful
- AI chatbots give you text responses, but real interviews are not text conversations
And here is the deeper issue: the things that fail candidates are not the answers, they are the delivery. The "um"s and "uh"s. The slouched posture. The rambling answer that never reaches a conclusion. The eye contact that breaks every time you think.
No text-based AI tool has ever addressed this. Until now.
What I Built: InterviewAce
InterviewAce is a real-time, multimodal AI interview coach that puts you in a pixel-perfect Google Meet replica with an AI hiring manager, Coach Ace, who simultaneously:
- Speaks to you with sub-500ms voice latency via Gemini 2.5 Flash Native Audio
- Watches your body language live through your webcam
- Detects filler words in real time ("um", "uh", "like", "you know")
- Scores your answers across Confidence, Clarity, Content, and STAR structure
- Searches Google live to provide grounded, company-specific context
- Generates a full performance report with a downloadable transcript
No typing. No text boxes. Just a real conversation with an AI that actually watches and listens.
Tech Stack at a Glance
| Layer | Technology |
|---|---|
| AI Agent | Google ADK + Gemini 2.5 Flash Native Audio |
| Live Streaming | Gemini Live API (bidiGenerateContent) |
| Backend | Python, FastAPI, Uvicorn, WebSockets |
| Frontend | Vanilla JavaScript, Web Audio API, MediaDevices API |
| Grounding | ADK built-in google_search + local knowledge base |
| Infrastructure | Google Cloud Run, Docker, Cloud Build |
System Architecture
Here is the complete picture of how every component connects, from your microphone all the way to Gemini and back:
+----------------------------------------------------------------------+
|                        BROWSER (Vanilla JS)                          |
|                                                                      |
|   Microphone (PCM 16kHz) --+                                         |
|   Camera (JPEG 1fps) ------+--> WebSocket Client <--------+          |
|                                                           |          |
|   Audio Player     <--------------------------------------+          |
|   Closed Captions  <------ Audio + Images + JSON ---------+          |
|   Live Analytics   <--------------------------------------+          |
+-----------------------------------+----------------------------------+
                                    |
                                WebSocket
                                    |
+-----------------------------------+----------------------------------+
|                      FASTAPI BACKEND (Python)                        |
|                                                                      |
|   WebSocket Server (main.py)                                         |
|              |                                                       |
|              v                                                       |
|   LiveRequestQueue --> ADK Runner --> InMemorySessionService         |
+----------------------------------------------------------------------+
          |                                        ^
     Bidi Stream                        Audio + Tool Results
          v                                        |
+----------------------------------------------------------------------+
|                         GOOGLE ADK AGENT                             |
|                                                                      |
|   Gemini 2.5 Flash Native Audio + Vision                             |
|              |                                                       |
|              v  Autonomous Tool Calls (silent, every 2-3 answers)    |
|   +--------------------------------------------------------------+  |
|   |                       11 CUSTOM TOOLS                        |  |
|   |                                                              |  |
|   |  TIER 1 - Core Analysis:                                     |  |
|   |    save_session_feedback     detect_filler_words             |  |
|   |    analyze_body_language     evaluate_star_method            |  |
|   |                                                              |  |
|   |  TIER 2 - Deep Coaching:                                     |  |
|   |    analyze_voice_confidence  get_improvement_tips            |  |
|   |    fetch_grounding_data      adjust_difficulty_level         |  |
|   |                                                              |  |
|   |  TIER 3 - Session Reporting:                                 |  |
|   |    get_session_history       save_session_recording          |  |
|   |    generate_session_report                                   |  |
|   |                                                              |  |
|   |  GROUNDING: google_search (ADK built-in)                     |  |
|   +--------------------------------------------------------------+  |
+----------------------------------------------------------------------+
                                    |
+-----------------------------------+----------------------------------+
|                           GOOGLE CLOUD                               |
|                                                                      |
|   Cloud Run (Serverless Container) + Container Registry              |
+----------------------------------------------------------------------+
Data Flow: Step by Step
STEP 1 - User speaks
Mic → PCM audio (16kHz) → WebSocket → FastAPI → LiveRequestQueue

STEP 2 - Camera streams
Webcam → JPEG frame (1fps, 320×240) → WebSocket → FastAPI → LiveRequestQueue

STEP 3 - Gemini responds
LiveRequestQueue → bidiGenerateContent → Gemini 2.5 Flash
Gemini audio bytes → WebSocket → Browser AudioPlayer → User hears voice

STEP 4 - Background tools fire (silently, every 2-3 answers)
Gemini calls → detect_filler_words()
Gemini calls → analyze_body_language()
Gemini calls → evaluate_star_method()
Gemini calls → save_session_feedback()
Tool results → JSON side-channel → WebSocket → Sidebar updates live

STEP 5 - Transcription
Gemini → Input + Output transcription → Closed Captions rendered in UI

STEP 6 - Session ends
User clicks End Interview
  → generate_session_report()
  → Full modal: scores, breakdown, downloadable transcript
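Steps 1 and 2 send audio chunks and JPEG frames over the same WebSocket, so the backend needs some framing to tell them apart. Here is a minimal sketch of one possible scheme; the one-byte type tag and the `pack_message`/`parse_message` names are my illustration, not necessarily InterviewAce's actual wire format:

```python
# Hypothetical framing: one type byte followed by the raw payload.
AUDIO, IMAGE = 0x01, 0x02

def pack_message(msg_type: int, payload: bytes) -> bytes:
    """Prefix the payload with a one-byte type tag (browser side)."""
    return bytes([msg_type]) + payload

def parse_message(message: bytes) -> dict:
    """Split an incoming frame into type + payload (backend side)."""
    tag, payload = message[0], message[1:]
    if tag == AUDIO:
        return {"type": "audio", "chunk": payload}
    if tag == IMAGE:
        return {"type": "image", "frame": payload}
    raise ValueError(f"unknown message tag: {tag}")
```

Whatever the real format is, the key property is the same: the backend can demultiplex a single binary stream into audio and image branches without a second connection.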
Project File Structure
IntyerviewBit/
├── README.md
├── cloudbuild.yaml                # Google Cloud Build CI/CD
└── interviewace/
    ├── Dockerfile                 # Cloud Run container
    ├── .env.example
    ├── requirements.txt
    └── app/
        ├── main.py                # FastAPI + WebSocket server
        ├── interview_coach_agent/
        │   ├── agent.py           # ADK Agent + 11 tools registered
        │   ├── prompts.py         # Coach Ace persona + instructions
        │   ├── tools.py           # All 11 custom tool implementations
        │   └── grounding_data.py  # Verified local coaching knowledge base
        └── static/
            ├── index.html         # Single-page Google Meet replica
            ├── css/
            │   └── style.css      # Complete Meet-style CSS
            └── js/
                ├── app.js             # Main app logic + WebSocket client
                ├── audio-player.js    # PCM audio playback engine
                ├── audio-recorder.js  # Mic capture + 48kHz→16kHz downsample
                └── camera.js          # Adaptive webcam frame capture
The Agent: Coach Ace
COACH ACE: FULL TOOL MAP

TIER 1 - Core Analysis (fires silently every 2-3 answers)

| Tool | What It Does |
|---|---|
| save_session_feedback | Scores 4 dimensions 0-100: Confidence, Clarity, Content, Body Language |
| detect_filler_words | Counts um / uh / like / you know; updates the live sidebar counter + tips |
| analyze_body_language | Rates posture, eye contact, and expression from the live camera frame |
| evaluate_star_method | Checks S-T-A-R answer structure; lights up the S/T/A/R badges in real time |

TIER 2 - Deep Coaching

| Tool | What It Does |
|---|---|
| analyze_voice_confidence | Pace, volume, tone, and pause analysis |
| get_improvement_tips | Targeted coaching per weakness |
| fetch_grounding_data | Pulls from the verified local knowledge base |
| adjust_difficulty_level | Scales question difficulty up or down |

TIER 3 - Session Management

| Tool | What It Does |
|---|---|
| get_session_history | Retrieves scores from past sessions |
| save_session_recording | Persists the transcript + all metrics |
| generate_session_report | Builds the full post-interview breakdown |

GROUNDING

| Tool | What It Does |
|---|---|
| google_search (ADK built-in) | Live web search for company-specific interview facts |
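To make the Tier 1 shape concrete, here is a minimal sketch of what a scoring tool like save_session_feedback could look like. The signature, clamping, and overall average are my guesses for illustration, not the repo's actual implementation:

```python
def save_session_feedback(confidence: int, clarity: int,
                          content: int, body_language: int) -> dict:
    """Hypothetical sketch: clamp each dimension to 0-100 and
    return the payload the sidebar would render."""
    scores = {
        "confidence": confidence,
        "clarity": clarity,
        "content": content,
        "body_language": body_language,
    }
    # Defend against out-of-range values from the model
    clamped = {k: max(0, min(100, v)) for k, v in scores.items()}
    overall = round(sum(clamped.values()) / len(clamped))
    return {**clamped, "overall": overall}
```

The point of the plain-dict return is that ADK tools hand structured results back to the model and, in InterviewAce's case, to the JSON side-channel feeding the live analytics sidebar.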
Architecture: The Dual Grounding System
Early in development, Coach Ace would confidently hallucinate Amazon Leadership Principles or invent Google interview formats. I fixed it with two grounding layers:
CANDIDATE ASKS: "What is Google's interview process like?"
                         |
                         v
              +-------------------+
              |  GROUNDING ROUTER |
              +---------+---------+
                        |
           +------------+------------+
           v                         v
+----------------------+  +----------------------+
| fetch_grounding_     |  | google_search()      |
| data()               |  | ADK built-in         |
|                      |  |                      |
| LOCAL KNOWLEDGE      |  | LIVE WEB SEARCH      |
| BASE                 |  |                      |
| (grounding_data.py)  |  | Searches for real,   |
|                      |  | current company      |
| Covers:              |  | interview info       |
|  - STAR method       |  |                      |
|  - Body language     |  | Prevents             |
|  - Voice delivery    |  | hallucination of     |
|  - Common mistakes   |  | company-specific     |
|  - Coaching tips     |  | facts                |
+-----------+----------+  +----------+-----------+
            |                        |
            +------------+-----------+
                         v
                GROUNDED RESPONSE
               (accurate + verified)
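The routing decision itself can be as simple as a topic lookup. This sketch is illustrative only: the keyword heuristic, `LOCAL_KB`, and `route_grounding` are my inventions, and in InterviewAce the model itself chooses between the two tools:

```python
# Toy stand-in for grounding_data.py's verified knowledge base
LOCAL_KB = {
    "star method": "Situation, Task, Action, Result: structure every answer.",
    "body language": "Sit upright and keep steady eye contact with the camera.",
}

def route_grounding(question: str) -> str:
    """Route coaching fundamentals to the local KB,
    company-specific questions to live web search."""
    q = question.lower()
    for topic, answer in LOCAL_KB.items():
        if topic in q:
            return f"[local KB] {answer}"
    # Anything not covered locally goes to google_search
    return f"[google_search] {question}"
```

The division of labour is the important part: stable coaching knowledge is served from a verified local source, while volatile company facts always come from live search.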
Code Deep Dive
1. The WebSocket Bridge (main.py)
import asyncio
import uuid

from fastapi import WebSocket
from google.adk.agents import LiveRequestQueue
from google.adk.sessions import InMemorySessionService
from google.genai.types import Blob

# (app, runner, run_agent and parse_message are defined elsewhere in main.py)

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()

    session_service = InMemorySessionService()
    session = await session_service.create_session(
        app_name="interviewace",
        user_id="candidate",
        session_id=str(uuid.uuid4()),
    )
    live_request_queue = LiveRequestQueue()

    # Start the ADK runner: talks to the Gemini Live API in the background
    runner_task = asyncio.create_task(
        run_agent(runner, live_request_queue, websocket, session)
    )

    try:
        async for message in websocket.iter_bytes():
            data = parse_message(message)
            if data["type"] == "audio":
                live_request_queue.send_realtime(
                    Blob(data=data["chunk"], mime_type="audio/pcm;rate=16000")
                )
            elif data["type"] == "image":
                live_request_queue.send_realtime(
                    Blob(data=data["frame"], mime_type="image/jpeg")
                )
    finally:
        runner_task.cancel()
2. The ADK Agent (agent.py)
from google.adk.agents import Agent
from google.adk.tools import google_search
from .tools import (
    save_session_feedback, detect_filler_words,
    analyze_body_language, evaluate_star_method,
    analyze_voice_confidence, get_improvement_tips,
    fetch_grounding_data, adjust_difficulty_level,
    get_session_history, save_session_recording,
    generate_session_report,
)

coach_ace = Agent(
    name="coach_ace",
    model="gemini-2.5-flash-preview-native-audio-dialog",
    description="Senior AI hiring manager for real-time mock interviews",
    instruction=COACH_ACE_PROMPT,
    tools=[
        save_session_feedback, detect_filler_words,
        analyze_body_language, evaluate_star_method,
        analyze_voice_confidence, get_improvement_tips,
        fetch_grounding_data, adjust_difficulty_level,
        get_session_history, save_session_recording,
        generate_session_report,
        google_search,  # ADK built-in grounding tool
    ],
)
3. Silent Background Tool (Example)
import re

def detect_filler_words(transcript: str, session_id: str) -> dict:
    """
    Fires autonomously every 2-3 answers.
    User never hears a pause: runs between turns.
    """
    filler_patterns = ["um", "uh", "like", "you know", "basically", "literally"]
    text = transcript.lower()
    # Word-boundary match so "like" isn't counted inside "likely"
    counts = {
        f: len(re.findall(rf"\b{re.escape(f)}\b", text))
        for f in filler_patterns
    }
    counts = {f: n for f, n in counts.items() if n > 0}
    total = sum(counts.values())
    update_session_analytics(session_id, "filler_words", {
        "total": total,
        "breakdown": counts,
        "coaching_tip": get_filler_tip(total),
    })
    return {"filler_count": total, "breakdown": counts}
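The counting core is easy to exercise in isolation. This standalone version (the `count_fillers` name is mine, not from the repo) uses word boundaries so that, for example, "like" is not counted inside "likely":

```python
import re

def count_fillers(transcript: str) -> dict:
    """Count whole-word filler occurrences in a transcript."""
    fillers = ["um", "uh", "like", "you know", "basically", "literally"]
    text = transcript.lower()
    counts = {
        f: len(re.findall(rf"\b{re.escape(f)}\b", text))
        for f in fillers
    }
    # Keep only fillers that actually occurred
    return {f: n for f, n in counts.items() if n > 0}

# count_fillers("Um, I basically, like, fixed it. I likely would again.")
# -> {"um": 1, "like": 1, "basically": 1}
```

Naive substring counting would report "like" twice in that sentence because of "likely", which is exactly the kind of false positive you do not want feeding a live coaching tip.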
4. PCM Audio Engine (audio-recorder.js)
class AudioRecorder {
  constructor(onAudioData) {
    this.onAudioData = onAudioData;
    this.targetSampleRate = 16000; // Gemini Live expects 16kHz
  }

  async start() {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const context = new AudioContext(); // Native rate is typically 44.1/48kHz
    const source = context.createMediaStreamSource(stream); // feed the mic into the graph
    const scriptProcessor = context.createScriptProcessor(4096, 1, 1);

    scriptProcessor.onaudioprocess = (event) => {
      const inputData = event.inputBuffer.getChannelData(0);
      // Downsample native rate -> 16kHz before sending to Gemini
      const downsampled = this.downsample(inputData, context.sampleRate, 16000);
      const pcm16 = this.floatTo16BitPCM(downsampled);
      this.onAudioData(pcm16); // -> WebSocket -> FastAPI -> Gemini
    };

    source.connect(scriptProcessor);
    scriptProcessor.connect(context.destination);
  }

  downsample(buffer, fromRate, toRate) {
    // Nearest-neighbour resampling: pick every (fromRate/toRate)-th sample
    const ratio = fromRate / toRate;
    const result = new Float32Array(Math.round(buffer.length / ratio));
    for (let i = 0; i < result.length; i++) {
      result[i] = buffer[Math.round(i * ratio)];
    }
    return result;
  }

  floatTo16BitPCM(float32) {
    // Minimal implementation of the helper referenced above:
    // convert [-1, 1] floats to signed 16-bit PCM
    const pcm = new Int16Array(float32.length);
    for (let i = 0; i < float32.length; i++) {
      const s = Math.max(-1, Math.min(1, float32[i]));
      pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
    }
    return pcm;
  }
}
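The nearest-neighbour downsampling is easy to sanity-check outside the browser. Here is the same logic ported to Python (my port, not code from the repo), verified on a 48kHz to 16kHz ratio:

```python
def downsample(buffer, from_rate, to_rate):
    """Nearest-neighbour resample, mirroring audio-recorder.js."""
    ratio = from_rate / to_rate
    n = round(len(buffer) / ratio)
    return [buffer[round(i * ratio)] for i in range(n)]

samples = list(range(48))           # one millisecond of audio at 48kHz
out = downsample(samples, 48000, 16000)
assert len(out) == 16               # 3:1 ratio -> a third of the samples
assert out[:4] == [0, 3, 6, 9]      # every third sample survives
```

Nearest-neighbour resampling skips any low-pass filtering, so it can alias; for speech sent to a speech model that trade-off is usually acceptable, and it keeps the hot path allocation-free and fast.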
5. Adaptive Webcam Capture (camera.js)
class CameraCapture {
  captureFrame() {
    const canvas = document.createElement('canvas');
    canvas.width = 320;   // Low-res: body language doesn't need 1080p
    canvas.height = 240;
    const ctx = canvas.getContext('2d');
    ctx.drawImage(this.videoElement, 0, 0, 320, 240);

    // JPEG 60% quality: sufficient for vision, minimal bandwidth
    canvas.toBlob(
      (blob) => blob.arrayBuffer().then(buf => this.onFrame(buf)),
      'image/jpeg',
      0.6
    );
  }

  adaptFrameRate(networkQuality) {
    // Drop to 0.33fps (1 frame / 3 sec) under poor network
    this.fps = Math.max(0.33, Math.min(1, networkQuality));
  }
}
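The adaptive frame rate is just a bounded mapping from a 0-1 network-quality score to a capture rate. The same clamp in Python, for clarity (my port; `adapt_fps` is not a name from the repo):

```python
def adapt_fps(network_quality: float) -> float:
    """Clamp webcam capture rate between 0.33fps and 1fps."""
    return max(0.33, min(1.0, network_quality))

assert adapt_fps(0.9) == 0.9    # decent network: close to 1 frame/sec
assert adapt_fps(0.1) == 0.33   # poor network: 1 frame every 3 seconds
assert adapt_fps(5.0) == 1.0    # quality score capped at 1fps
```

The floor matters more than the ceiling: even on a bad connection, Gemini keeps getting at least one frame every three seconds, so body-language analysis degrades gracefully instead of stopping.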
The UI: Google Meet Replica
+------------------------------------------------------------------------+
| InterviewAce [Google logo]                    12:34      [Participants] |
+----------------------------------------------+-------------------------+
|                                              |  LIVE ANALYTICS         |
|  +----------------+  +----------------+      |                         |
|  |                |  |                |      |  Confidence  ####-      |
|  |   COACH ACE    |  |     ELENA      |      |  Clarity     ###--      |
|  | AI Interviewer |  |  AI Notetaker  |      |  STAR Score  ###--      |
|  |                |  |                |      |  Body Lang.  ####-      |
|  | [Volume Rings] |  | [Volume Rings] |      |                         |
|  +----------------+  +----------------+      |  Filler Words: 3        |
|                                              |    um(2) uh(1)          |
|  +------------------------------------+      |                         |
|  |                                    |      |  Eye Contact   ✓        |
|  |               YOU                  |      |  Posture       ✓        |
|  |          (Live Webcam)             |      |  Expression    ✓        |
|  |                                    |      |                         |
|  | [Equalizer bars animate when mic]  |      |  STAR Badges:           |
|  +------------------------------------+      |  [S✓][T✓][A✓][R✓]       |
|                                              |                         |
|  CC: "...tell me about a time you had        |  Tip: Use 'I' not 'we'  |
|  to debug a production issue..."             |                         |
+----------------------------------------------+-------------------------+
| [Mic] [Cam] [CC] [Chat] [People]                       [End Interview] |
+------------------------------------------------------------------------+
Built in 100% Vanilla JavaScript: no React, no Vue, no Angular. Framework render cycles add scheduling overhead on the main thread, and that overhead can disturb PCM audio timing. When streaming 16kHz audio, even a few milliseconds of jitter is audible.
Cloud Deployment Pipeline
Developer pushes to GitHub
        |
        v
Google Cloud Build (cloudbuild.yaml)
        |
        ├── docker build -t gcr.io/PROJECT/interviewace .
        ├── docker push → Container Registry
        └── gcloud run deploy interviewace
              ├── Region: us-central1
              ├── Memory: 1Gi
              ├── Port: 8080
              ├── session-affinity: TRUE   ← critical for WebSockets
              └── allow-unauthenticated: TRUE
        |
        v
Cloud Run Serverless Container
  ├── Scales to zero when idle (cost: $0)
  ├── Handles WebSocket connections persistently
  └── Scales instantly on demand
Key lesson learned the hard way: Cloud Run requires --session-affinity for any WebSocket-based app. Without it, the load balancer can route mid-session requests to a different container instance, breaking your persistent connection. This cost me hours to debug.
Challenges I Ran Into
1. Real-time audio + vision sync
Streaming 16kHz PCM audio and JPEG frames simultaneously over one WebSocket without dropped frames required bandwidth-adaptive throttling and decoupled queues.
2. Barge-in handling
When the user speaks mid-response, the agent must stop cleanly without corrupting the audio buffer. Getting reliable barge-in via the ADK streaming layer took multiple iterations.
3. Invisible tool calls
All background analysis tools need to feel completely invisible. I tuned them to fire silently between turns and stream analytics to the UI via a JSON side-channel on the same WebSocket.
4. Company hallucination
Early versions confidently invented interview formats. Fixed with two-layer grounding: verified local knowledge base + ADK's built-in Google Search.
5. Audio latency in production
Getting end-to-end voice latency under 500ms on Cloud Run required tuning buffer sizes, optimising the PCM pipeline, and keeping ASGI async throughout.
Key Lessons
- Native audio models are fundamentally different from text models. Design your system for async bidirectional streaming from the ground up.
- Grounding is non-negotiable for agentic apps. Prompt engineering alone is not enough: even capable models hallucinate domain facts without grounding.
- Vanilla JS outperforms frameworks for latency-sensitive audio/video. Full control over the audio pipeline timing matters at the PCM level.
- ADK LiveRequestQueue needs careful queue management. Keep audio, vision, and tool-call result streams strictly decoupled.
- Always add session-affinity to Cloud Run WebSocket services. Stateless load balancing breaks persistent connections.
Try It Yourself
Live Demo: https://interviewace-117780891544.us-central1.run.app/
No API key. No signup. No cost. Click the link, allow mic + camera, and start your mock interview.
GitHub: https://github.com/SameerAliKhan-git/IntyerviewBit
Demo Video: https://youtu.be/JrjhgB5Ib_0
Built for the Gemini Live Agent Challenge using Google ADK · Gemini Live API · Google Cloud Run
#GeminiLiveAgentChallenge #GoogleAI #Gemini #ADK #GoogleCloud #Python #WebDev

