Diven Rastdus

Posted on Mar 8 • Edited on Jun 29

Building Viva: A Real-Time AI Interview Coach with Gemini Live API

#googlecloud #gemini #ai #hackathon

TL;DR: I built Viva, a real-time AI interview coach that listens to your answers via bidirectional audio streaming and watches your body language through your webcam — all powered by Google's Gemini Live API and Vision API, deployed on Cloud Run.

The Problem

Job seekers practice interviews alone with zero feedback. You can record yourself on your phone and watch it back, but that doesn't tell you about your filler words, pacing, eye contact, or posture in real time. Human coaches cost $100-300 per session.

What Viva Does

Viva is a full-stack interview coaching application that provides real-time feedback on both verbal answers and body language:

Live audio conversation — bidirectional audio streaming via Gemini Live API (gemini-2.5-flash-native-audio-latest). The AI interviewer asks questions, listens to your answers, and responds naturally. You can interrupt mid-sentence (barge-in).
Body language coaching — webcam frames analyzed every 2 seconds via Gemini Vision (gemini-2.5-flash). You get feedback on eye contact, posture, facial expressions, and confidence.
Speech pattern tracking — filler word detection ("um", "uh", "like"), pace analysis, confidence scoring.
Answer scoring — each answer scored on relevance, clarity, and depth.
Post-interview report — full scorecard with per-question breakdown and aggregate stats.
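To make the report item concrete, here is a minimal sketch of how per-answer scores could be rolled up into a scorecard. The function name `build_report` and the dictionary layout are assumptions for illustration, not Viva's actual report format; only the three scoring dimensions come from the feature list above.

```python
DIMENSIONS = ("relevance", "clarity", "depth")


def build_report(answers):
    """Roll up scored answers into a per-question breakdown plus averages.

    `answers` is a list of dicts with a 'question' key and one numeric
    score per dimension (hypothetical input shape).
    """
    per_question = [
        {
            "question": a["question"],
            # Overall score for one answer: plain mean of the three dimensions
            "overall": round(sum(a[d] for d in DIMENSIONS) / len(DIMENSIONS), 1),
        }
        for a in answers
    ]
    n = len(answers)
    # Average each dimension across all answers (0.0 if there are none)
    averages = {
        d: (round(sum(a[d] for a in answers) / n, 1) if n else 0.0)
        for d in DIMENSIONS
    }
    return {
        "per_question": per_question,
        "averages": averages,
        # The lowest-scoring dimension is a natural "work on this next" hint
        "weakest_dimension": min(averages, key=averages.get) if n else None,
    }
```

A plain mean treats all three dimensions as equally important; a weighted mean would be an equally reasonable choice.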

Architecture

Browser (Next.js)                    Google Cloud
  ├─ Mic → AudioWorklet             ┌──────────────────┐
  │   → PCM 16kHz ──WebSocket──►    │  Cloud Run       │
  │                                  │  (FastAPI)       │
  │                                  │       │          │
  │   ◄── PCM 24kHz ◄──────────     │  Gemini Live API │
  │   → PcmPlayer → Speaker         │  (bidi audio)    │
  │                                  │       │          │
  ├─ Camera → JPEG frames           │  Gemini Vision   │
  │   → POST /api/analyze-frame ──► │  (body language)  │
  │                                  │       │          │
  └─ Score/Report ◄──────────────   │  ADK Agent Tools │
                                     └──────────────────┘

The Gemini Live API Pipeline

The core of Viva is the bidirectional audio pipeline. Here's how it works:

Browser captures mic audio using Web Audio API's AudioWorklet. The worklet converts Float32 samples to 16-bit PCM at 16kHz.
PCM chunks stream over WebSocket to the FastAPI backend.
Backend forwards to Gemini Live API using the Google GenAI SDK:

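from google import genai
from google.genai import types

client = genai.Client()  # picks up the API key from the environment

# run_interview is an illustrative wrapper name, not part of the SDK.
async def run_interview(system_prompt: str) -> None:
    config = types.LiveConnectConfig(
        response_modalities=["AUDIO"],
        system_instruction=types.Content(
            parts=[types.Part(text=system_prompt)]
        ),
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
            )
        ),
    )
    # live.connect() is an async context manager: enter it with `async with`
    # rather than awaiting it, so the session is closed on exit.
    async with client.aio.live.connect(
        model="gemini-2.5-flash-native-audio-latest",
        config=config,
    ) as session:
        ...  # stream mic PCM in, relay Gemini's audio out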

Gemini responds with audio — the AI interviewer's voice streams back as 24kHz PCM.
Browser plays the response through a custom PcmPlayer that buffers and schedules audio chunks for smooth playback.
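The Float32 to 16-bit PCM step at the start of this pipeline is simple arithmetic. In Viva it runs inside the AudioWorklet in the browser; the Python sketch below only illustrates the same conversion, and the function names are hypothetical.

```python
import struct


def float32_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    out = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clamp so out-of-range input cannot overflow
        # Scale asymmetrically: -1.0 maps to -32768 and 1.0 maps to 32767
        value = int(s * 32768) if s < 0 else int(s * 32767)
        out += struct.pack("<h", value)
    return bytes(out)


def chunk_duration_ms(pcm_bytes, sample_rate=16000):
    """Duration in milliseconds of a mono 16-bit PCM chunk (2 bytes per sample)."""
    return len(pcm_bytes) * 1000 / (2 * sample_rate)
```

At 16kHz mono, each sample takes 2 bytes, so a 640-byte chunk carries 20 ms of audio. That arithmetic is useful when picking a chunk size for the WebSocket.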

The barge-in capability is built into the Live API — when the user starts speaking while the AI is talking, the AI naturally stops and listens.
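Even with barge-in handled on the model side, the client still holds audio it has already buffered. The sketch below shows one way a player could schedule chunks back to back and drop pending audio on interruption. `PlaybackScheduler` is a hypothetical stand-in for the browser-side PcmPlayer, written in Python only to show the timing logic, and it assumes the backend relays an interruption signal to the client.

```python
class PlaybackScheduler:
    """Assigns gap-free start times to mono 16-bit PCM chunks."""

    def __init__(self, sample_rate=24000, lead_time=0.05):
        self.sample_rate = sample_rate
        self.lead_time = lead_time  # small safety buffer before playback starts
        self.next_start = 0.0       # time at which the next chunk should begin
        self.queue = []             # pending (start_time, pcm_bytes) pairs

    def enqueue(self, pcm_bytes, now):
        # If the queue ran dry, restart slightly in the future; otherwise
        # continue exactly where the previous chunk ends.
        start = max(self.next_start, now + self.lead_time)
        duration = len(pcm_bytes) / 2 / self.sample_rate  # 2 bytes per sample
        self.queue.append((start, pcm_bytes))
        self.next_start = start + duration
        return start

    def interrupt(self):
        # Barge-in: discard everything not yet played so the AI voice
        # stops promptly instead of finishing its buffered sentence.
        dropped = len(self.queue)
        self.queue.clear()
        self.next_start = 0.0
        return dropped
```

In a browser, the same idea typically maps onto scheduling `AudioBufferSourceNode` start times against the audio context clock.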

Body Language Analysis

Every 2 seconds, the frontend captures a JPEG frame from the webcam, downscales to 640x480, and sends it to the backend. The backend uses Gemini Vision to analyze:

Eye contact (looking at camera vs. looking away)
Posture (sitting straight, slouching, leaning)
Facial expressions (smiling, nervous, neutral)
Hand gestures

The analysis is returned as structured coaching tips that appear as a live overlay on the interview screen.
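As a rough sketch of what "structured coaching tips" could look like in code, the parser below validates a JSON reply before it reaches the overlay. The schema (a `tips` list with `category`, `severity`, and `message`) and the category names are assumptions for illustration; the article does not specify the actual response format.

```python
import json
from dataclasses import dataclass

# Hypothetical category names mirroring the four signals listed above
ALLOWED_CATEGORIES = {"eye_contact", "posture", "expression", "gestures"}


@dataclass
class CoachingTip:
    category: str
    severity: str
    message: str


def parse_tips(raw):
    """Parse a JSON reply into tips; return [] if the reply is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    items = data.get("tips", []) if isinstance(data, dict) else []
    if not isinstance(items, list):
        return []
    tips = []
    for item in items:
        if not isinstance(item, dict):
            continue
        category = item.get("category")
        message = item.get("message")
        # Skip anything outside the known categories or without usable text
        if (
            isinstance(category, str)
            and category in ALLOWED_CATEGORIES
            and isinstance(message, str)
            and message
        ):
            tips.append(CoachingTip(category, str(item.get("severity", "info")), message))
    return tips
```

Returning an empty list on bad input means a single malformed reply only skips one overlay update instead of breaking the session.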

ADK Agent Tools

The backend uses Google's Agent Development Kit (ADK) with four tools:

Tool	Purpose
`analyze_body_language`	Interprets frame descriptions into actionable coaching tips
`track_speech_patterns`	Detects filler words and estimates speaking pace
`score_answer`	Scores answers on relevance, clarity, and depth (1-10)
`generate_next_question`	Selects contextually appropriate follow-up questions
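To give a feel for what one of these tools might do, here is a minimal sketch of filler-word counting and pace estimation. It borrows the name `track_speech_patterns` from the table, but the signature, the return shape, and the assumption that a text transcript with a known duration is available are all illustrative guesses, not Viva's actual implementation.

```python
import re

FILLERS = ("um", "uh", "like")


def track_speech_patterns(transcript, duration_seconds):
    """Count filler words and estimate speaking pace from a transcript."""
    # Lowercase and split into words, ignoring punctuation
    words = re.findall(r"[a-z']+", transcript.lower())
    # Naive count: "like" is counted even when used legitimately
    counts = {f: words.count(f) for f in FILLERS}
    if duration_seconds > 0:
        wpm = len(words) / duration_seconds * 60
    else:
        wpm = 0.0
    return {
        "word_count": len(words),
        "filler_counts": counts,
        "filler_total": sum(counts.values()),
        "words_per_minute": round(wpm, 1),
    }
```

For example, `track_speech_patterns("Um, I think, like, the project was, uh, like really good.", 6.0)` counts 11 words and 4 fillers at 110 words per minute.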

Google Cloud Services

Service	Purpose
Cloud Run	Backend hosting with auto-scaling (0-3 instances)
Cloud Build	Container image building from Dockerfile
Secret Manager	Secure API key storage
Generative Language API	Gemini Live API + Vision API

Infrastructure as Code

The entire deployment is automated via a single deploy.sh script:

./deploy.sh

This handles: API enablement, Secret Manager setup, container building, Cloud Run deployment, and optional Vercel frontend deployment.

Mock Mode

Viva runs fully without a Gemini API key — all AI features fall back to realistic mock responses. This makes local development and testing seamless.
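One common way to implement this kind of fallback is to choose the implementation once, based on whether a key is configured. The sketch below assumes an environment variable named `GEMINI_API_KEY` and uses hypothetical function names; the real Viva code may be organized differently.

```python
import os


def mock_analyze_frame(jpeg_bytes):
    """Canned response so the UI can be exercised without any API access."""
    return {
        "mock": True,
        "tips": [{"category": "posture", "message": "Sit up straight."}],
    }


def real_analyze_frame(jpeg_bytes):
    """Placeholder for the real Gemini Vision call (omitted in this sketch)."""
    raise NotImplementedError("wire up the Gemini client here")


def get_frame_analyzer(env=None):
    """Return the real analyzer if an API key is configured, else the mock."""
    env = os.environ if env is None else env
    # An empty or missing key both select the mock implementation
    if env.get("GEMINI_API_KEY"):
        return real_analyze_frame
    return mock_analyze_frame
```

Passing `env` explicitly keeps the selection logic testable without touching the process environment.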

Try It

GitHub: https://github.com/astraedus/viva
Live Demo: https://viva-api-93135657352.us-central1.run.app

This article was created for the purposes of entering the Gemini Live Agent Challenge hackathon (#GeminiLiveAgentChallenge). The project demonstrates real-time AI interview coaching built on the Gemini Live API, Google AI models, and Google Cloud infrastructure.

Built with Gemini Live API, Gemini Vision API, Google ADK, FastAPI, Next.js, and Cloud Run.

If you're building AI agents for production, check out my book Production AI Agents on Amazon Kindle. It covers architecture patterns, tool design, multi-agent coordination, and deployment strategies.

I write these from real work at astraedus.dev, where I build apps and tools. Building something, or stuck on something like this? Reach me at astraedus.dev or theagentthatcould@gmail.com.
