I created this piece of content for the purposes of entering the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge
What is Audeza?
Audeza is the first AI pitch coach that can see you, hear you, and talk back — all in real time. You upload your pitch deck, turn on your camera, and start practicing. The AI watches your body language, listens to your delivery, and coaches you mid-pitch like a mentor sitting across the table.
It has two modes:
- Coaching Mode — Real-time feedback on posture, eye contact, filler words, pacing, and content coverage as you present
- Investor Simulator — The AI role-plays as a tough VC, listens to your full pitch, then grills you with hard questions based on what you actually said
After each session, you get a scorecard with delivery, content, and presence scores, per-slide timing analysis, and actionable improvements.
The Core: Gemini Live API
The entire experience is powered by the Gemini Live API through the @google/genai SDK. Here's why this API was the right choice:
Multimodal real-time streaming
Gemini Live accepts audio and video simultaneously over a single WebSocket connection. During a pitch session, Audeza sends:
- Microphone audio — 16kHz PCM captured via AudioWorklet
- Camera frames — 1 FPS JPEG snapshots from the webcam
The model processes both streams and responds with natural spoken audio. This is what makes Audeza different from tools that only analyze recordings after the fact — the coaching happens while you're presenting.
```ts
// Connecting to the Gemini Live API
import { GoogleGenAI } from '@google/genai'

const ai = new GoogleGenAI({ apiKey })

const session = await ai.live.connect({
  model: 'gemini-2.5-flash-native-audio-preview-12-2025',
  config: {
    responseModalities: ['AUDIO'],
    systemInstruction: { parts: [{ text: systemPrompt }] },
    contextWindowCompression: { slidingWindow: {} },
    inputAudioTranscription: {},
    outputAudioTranscription: {},
  },
  callbacks: {
    onmessage: (msg) => {
      // Handle audio responses, transcriptions, turn completion
    },
  },
})

// Send real-time audio
session.sendRealtimeInput({
  audio: { data: base64PCM, mimeType: 'audio/pcm;rate=16000' },
})

// Send camera frames at 1 FPS
session.sendRealtimeInput({
  video: { data: base64Jpeg, mimeType: 'image/jpeg' },
})
```
Context window compression
Pitch sessions can run 5-10 minutes. With continuous audio and video, that's a lot of context. I used Gemini's built-in sliding window compression to keep sessions running without hitting token limits:
```ts
contextWindowCompression: {
  slidingWindow: {},
}
```
Input/output transcription
Gemini Live provides real-time transcription of both the user's speech and the model's responses. I capture these to build a full session transcript, which is then used for scorecard generation.
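A minimal sketch of how those transcription events can be accumulated into a transcript. The `TranscriptEntry` type and `handleTranscription` helper are illustrative names, not part of the SDK; the `serverContent.inputTranscription` / `outputTranscription` fields are what the Live API emits once transcription is enabled in the config:

```ts
// Sketch: build a session transcript from Live API messages.
type TranscriptEntry = { speaker: 'user' | 'coach'; text: string }

function handleTranscription(msg: any, transcript: TranscriptEntry[]) {
  const sc = msg?.serverContent
  if (sc?.inputTranscription?.text) {
    // The user's speech, transcribed in real time
    transcript.push({ speaker: 'user', text: sc.inputTranscription.text })
  }
  if (sc?.outputTranscription?.text) {
    // The model's spoken response, transcribed
    transcript.push({ speaker: 'coach', text: sc.outputTranscription.text })
  }
}

// Called from the onmessage callback; the finished transcript
// feeds scorecard generation after the session ends.
const transcript: TranscriptEntry[] = []
```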
Native audio output
Using gemini-2.5-flash-native-audio-preview, the model speaks with natural vocal variety — it can be warm and encouraging when you're nervous, or push harder when you're confident. The voice style adapts to the coaching context.
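The Live API also lets you pick a prebuilt voice via `speechConfig`. A sketch of that part of the connect config; the voice name here is illustrative, not Audeza's actual choice:

```ts
// Sketch: selecting a prebuilt voice for native audio output.
const liveConfig = {
  responseModalities: ['AUDIO'],
  speechConfig: {
    voiceConfig: {
      prebuiltVoiceConfig: { voiceName: 'Puck' }, // illustrative voice name
    },
  },
}
```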
Scorecard Generation with Gemini 2.5 Flash
After a session ends, I send the full transcript plus slide context to Gemini 2.5 Flash (text model) to generate a structured scorecard:
- Overall score (1-100)
- Delivery, Content, and Presence sub-scores
- Per-slide breakdown with timing analysis
- Specific improvements to work on
- Investor verdict (in simulator mode)
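Structured output makes this reliable: Gemini can be constrained to a JSON schema so scores always parse. A sketch of the request construction, under the assumption that `responseSchema` is used; `buildScorecardRequest` and the exact field names are illustrative, and the returned object is what you would pass to `ai.models.generateContent` from `@google/genai`:

```ts
// Sketch: structured-output request for scorecard generation.
function buildScorecardRequest(transcript: string, slideContext: string) {
  return {
    model: 'gemini-2.5-flash',
    contents: `Slides:\n${slideContext}\n\nTranscript:\n${transcript}`,
    config: {
      responseMimeType: 'application/json',
      responseSchema: {
        type: 'OBJECT',
        properties: {
          overallScore: { type: 'NUMBER' }, // 1-100
          delivery: { type: 'NUMBER' },
          content: { type: 'NUMBER' },
          presence: { type: 'NUMBER' },
          improvements: { type: 'ARRAY', items: { type: 'STRING' } },
        },
        required: ['overallScore', 'delivery', 'content', 'presence'],
      },
    },
  }
}
```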
Google Cloud Stack
The full infrastructure runs on Google Cloud:
| Service | Purpose |
|---|---|
| Cloud Run | Hosts the app (containerized with Bun runtime) |
| Firebase Auth | Google sign-in for users |
| Cloud Firestore | Session history, scorecards, user preferences, deck metadata |
| Artifact Registry | Docker image storage |
| GitHub Actions | CI/CD pipeline → build → push → deploy to Cloud Run |
The app is a TanStack Start (React) application built with Bun, containerized via Docker, and deployed to Cloud Run with automatic deploys on every push to main.
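For context, a Bun + Cloud Run Dockerfile typically looks something like this. This is an illustrative sketch, not the project's actual Dockerfile; Cloud Run injects the `PORT` environment variable at runtime:

```dockerfile
# Sketch: containerizing a Bun app for Cloud Run (illustrative).
FROM oven/bun:1 AS build
WORKDIR /app
COPY package.json bun.lock ./
RUN bun install --frozen-lockfile
COPY . .
RUN bun run build

FROM oven/bun:1-slim
WORKDIR /app
COPY --from=build /app ./
ENV PORT=8080
EXPOSE 8080
CMD ["bun", "run", "start"]
```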
What I Learned
Gemini Live API is surprisingly natural — The model's ability to process video and audio simultaneously and respond conversationally makes it feel like you're talking to a real person, not an AI.
AudioWorklet is essential — For reliable real-time audio capture, the Web Audio API's AudioWorklet is the way to go. It runs in a separate thread and doesn't drop frames.
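A sketch of that capture path: a worklet posts raw Float32 frames back to the main thread, which converts them to the 16 kHz 16-bit PCM the Live API expects. Names like `PCMCaptureProcessor` and `startCapture` are illustrative, not from Audeza's code:

```ts
// Pure helper: Float32 samples in [-1, 1] -> 16-bit PCM.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length)
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]))
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff
  }
  return out
}

// Worklet module source: runs on the audio thread, forwards raw frames.
const workletSource = `
class PCMCaptureProcessor extends AudioWorkletProcessor {
  process(inputs) {
    if (inputs[0]?.[0]) this.port.postMessage(inputs[0][0])
    return true
  }
}
registerProcessor('pcm-capture', PCMCaptureProcessor)
`

async function startCapture(onChunk: (pcm: Int16Array) => void) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
  const ctx = new AudioContext({ sampleRate: 16000 })
  // Load the worklet from a Blob URL so no separate file is needed
  const url = URL.createObjectURL(
    new Blob([workletSource], { type: 'application/javascript' })
  )
  await ctx.audioWorklet.addModule(url)
  const node = new AudioWorkletNode(ctx, 'pcm-capture')
  node.port.onmessage = (e) => onChunk(floatTo16BitPCM(e.data))
  ctx.createMediaStreamSource(stream).connect(node)
}
```

Each `Int16Array` chunk is then base64-encoded and sent via `sendRealtimeInput` as `audio/pcm;rate=16000`.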
1 FPS is enough for body language — You don't need high frame rates for the model to pick up on posture, eye contact, and gestures. One frame per second keeps the context window manageable while still giving useful visual coaching.
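A sketch of that frame loop: draw the webcam video onto a canvas once a second and strip the data-URL prefix before sending the raw base64 JPEG. The helper names are illustrative:

```ts
// canvas.toDataURL returns "data:image/jpeg;base64,....", so strip the prefix.
function stripDataUrlPrefix(dataUrl: string): string {
  return dataUrl.slice(dataUrl.indexOf(',') + 1)
}

function startFrameCapture(
  video: HTMLVideoElement,
  onFrame: (b64: string) => void
) {
  const canvas = document.createElement('canvas')
  const ctx = canvas.getContext('2d')!
  return setInterval(() => {
    canvas.width = video.videoWidth
    canvas.height = video.videoHeight
    ctx.drawImage(video, 0, 0)
    onFrame(stripDataUrlPrefix(canvas.toDataURL('image/jpeg', 0.8)))
  }, 1000) // one frame per second
}
```

Each frame then goes out via `sendRealtimeInput` with `mimeType: 'image/jpeg'`.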
System prompts matter enormously — The difference between a generic AI response and a great coaching experience came down to prompt engineering. Defining specific triggers (slide transitions, feedback requests, real-time interjections) made the model proactive instead of passive.
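To give a flavor of that trigger-based structure, here is an illustrative system prompt sketch; this is not Audeza's actual prompt, just the shape of one:

```ts
// Sketch: a trigger-based coaching prompt (illustrative text).
const systemPrompt = `
You are a pitch coach watching a live presentation.
Triggers:
- When the presenter advances a slide, note pacing against the plan.
- When the presenter asks for feedback, give one concrete improvement.
- Interject briefly if filler words cluster or eye contact drops.
Stay silent otherwise; never talk over the presenter for more than a sentence.
`.trim()
```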
Wanna try it?
Stay tuned: Audeza is going live soon.