I created this piece of content for the purposes of entering the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge
What is Audeza?
Audeza is the first AI pitch coach that can see you, hear you, and talk back — all in real time. You upload your pitch deck, turn on your camera, and start practicing. The AI watches your body language, listens to your delivery, and coaches you mid-pitch like a mentor sitting across the table.
It has two modes:
- Coaching Mode — Real-time feedback on posture, eye contact, filler words, pacing, and content coverage as you present
- Investor Simulator — The AI role-plays as a tough VC, listens to your full pitch, then grills you with hard questions based on what you actually said
After each session, you get a scorecard with delivery, content, and presence scores, per-slide timing analysis, and actionable improvements.
The Core: Gemini Live API
The entire experience is powered by the Gemini Live API through the @google/genai SDK. Here's why this API was the right choice:
Multimodal real-time streaming
Gemini Live accepts audio and video simultaneously over a single WebSocket connection. During a pitch session, Audeza sends:
- Microphone audio — 16kHz PCM captured via AudioWorklet
- Camera frames — 1 FPS JPEG snapshots from the webcam
The model processes both streams and responds with natural spoken audio. This is what makes Audeza different from tools that only analyze recordings after the fact — the coaching happens while you're presenting.
```ts
// Connecting to the Gemini Live API
import { GoogleGenAI } from '@google/genai'

const ai = new GoogleGenAI({ apiKey })

const session = await ai.live.connect({
  model: 'gemini-2.5-flash-native-audio-preview-12-2025',
  config: {
    responseModalities: ['AUDIO'],
    systemInstruction: { parts: [{ text: systemPrompt }] },
    contextWindowCompression: { slidingWindow: {} },
    inputAudioTranscription: {},
    outputAudioTranscription: {},
  },
  callbacks: {
    onmessage: (msg) => {
      // Handle audio responses, transcriptions, turn completion
    },
  },
})

// Send real-time audio
session.sendRealtimeInput({
  audio: { data: base64PCM, mimeType: 'audio/pcm;rate=16000' },
})

// Send camera frames at 1 FPS
session.sendRealtimeInput({
  video: { data: base64Jpeg, mimeType: 'image/jpeg' },
})
```
Context window compression
Pitch sessions can run 5-10 minutes. With continuous audio and video, that's a lot of context. I used Gemini's built-in sliding window compression to keep sessions running without hitting token limits:
```ts
contextWindowCompression: {
  slidingWindow: {},
}
```
Input/output transcription
Gemini Live provides real-time transcription of both the user's speech and the model's responses. I capture these to build a full session transcript, which is then used for scorecard generation.
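A minimal sketch of how those transcription events can be accumulated into a transcript. The `TranscriptEntry` type and `handleTranscription` helper are illustrative names, not part of the SDK; the `serverContent.inputTranscription` / `outputTranscription` fields are what the Live API emits once transcription is enabled in the config:

```ts
// Sketch: build a session transcript from Live API messages.
type TranscriptEntry = { speaker: 'user' | 'coach'; text: string }

function handleTranscription(msg: any, transcript: TranscriptEntry[]) {
  const sc = msg?.serverContent
  if (sc?.inputTranscription?.text) {
    // The user's speech, transcribed in real time
    transcript.push({ speaker: 'user', text: sc.inputTranscription.text })
  }
  if (sc?.outputTranscription?.text) {
    // The model's spoken response, transcribed
    transcript.push({ speaker: 'coach', text: sc.outputTranscription.text })
  }
}

// Called from the onmessage callback; the finished transcript
// feeds scorecard generation after the session ends.
const transcript: TranscriptEntry[] = []
```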
Native audio output
Using gemini-2.5-flash-native-audio-preview, the model speaks with natural vocal variety — it can be warm and encouraging when you're nervous, or push harder when you're confident. The voice style adapts to the coaching context.
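The Live API also lets you pick a prebuilt voice via `speechConfig`. A sketch of that part of the connect config; the voice name here is illustrative, not Audeza's actual choice:

```ts
// Sketch: selecting a prebuilt voice for native audio output.
const liveConfig = {
  responseModalities: ['AUDIO'],
  speechConfig: {
    voiceConfig: {
      prebuiltVoiceConfig: { voiceName: 'Puck' }, // illustrative voice name
    },
  },
}
```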
Scorecard Generation with Gemini 2.5 Flash
After a session ends, I send the full transcript plus slide context to Gemini 2.5 Flash (text model) to generate a structured scorecard:
- Overall score (1-100)
- Delivery, Content, and Presence sub-scores
- Per-slide breakdown with timing analysis
- Specific improvements to work on
- Investor verdict (in simulator mode)
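Structured output makes this reliable: Gemini can be constrained to a JSON schema so scores always parse. A sketch of the request construction, under the assumption that `responseSchema` is used; `buildScorecardRequest` and the exact field names are illustrative, and the returned object is what you would pass to `ai.models.generateContent` from `@google/genai`:

```ts
// Sketch: structured-output request for scorecard generation.
function buildScorecardRequest(transcript: string, slideContext: string) {
  return {
    model: 'gemini-2.5-flash',
    contents: `Slides:\n${slideContext}\n\nTranscript:\n${transcript}`,
    config: {
      responseMimeType: 'application/json',
      responseSchema: {
        type: 'OBJECT',
        properties: {
          overallScore: { type: 'NUMBER' }, // 1-100
          delivery: { type: 'NUMBER' },
          content: { type: 'NUMBER' },
          presence: { type: 'NUMBER' },
          improvements: { type: 'ARRAY', items: { type: 'STRING' } },
        },
        required: ['overallScore', 'delivery', 'content', 'presence'],
      },
    },
  }
}
```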
Google Cloud Stack
The full infrastructure runs on Google Cloud:
| Service | Purpose |
|---|---|
| Cloud Run | Hosts the app (containerized with Bun runtime) |
| Firebase Auth | Google sign-in for users |
| Cloud Firestore | Session history, scorecards, user preferences, deck metadata |
| Artifact Registry | Docker image storage |
| GitHub Actions | CI/CD pipeline → build → push → deploy to Cloud Run |
The app is a TanStack Start (React) application built with Bun, containerized via Docker, and deployed to Cloud Run with automatic deploys on every push to main.
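For context, a Bun + Cloud Run Dockerfile typically looks something like this. This is an illustrative sketch, not the project's actual Dockerfile; Cloud Run injects the `PORT` environment variable at runtime:

```dockerfile
# Sketch: containerizing a Bun app for Cloud Run (illustrative).
FROM oven/bun:1 AS build
WORKDIR /app
COPY package.json bun.lock ./
RUN bun install --frozen-lockfile
COPY . .
RUN bun run build

FROM oven/bun:1-slim
WORKDIR /app
COPY --from=build /app ./
ENV PORT=8080
EXPOSE 8080
CMD ["bun", "run", "start"]
```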
What I Learned
Gemini Live API is surprisingly natural — The model's ability to process video and audio simultaneously and respond conversationally makes it feel like you're talking to a real person, not an AI.
AudioWorklet is essential — For reliable real-time audio capture, the Web Audio API's AudioWorklet is the way to go. It runs in a separate thread and doesn't drop frames.
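A sketch of that capture path: a worklet posts raw Float32 frames back to the main thread, which converts them to the 16 kHz 16-bit PCM the Live API expects. Names like `PCMCaptureProcessor` and `startCapture` are illustrative, not from Audeza's code:

```ts
// Pure helper: Float32 samples in [-1, 1] -> 16-bit PCM.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length)
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]))
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff
  }
  return out
}

// Worklet module source: runs on the audio thread, forwards raw frames.
const workletSource = `
class PCMCaptureProcessor extends AudioWorkletProcessor {
  process(inputs) {
    if (inputs[0]?.[0]) this.port.postMessage(inputs[0][0])
    return true
  }
}
registerProcessor('pcm-capture', PCMCaptureProcessor)
`

async function startCapture(onChunk: (pcm: Int16Array) => void) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
  const ctx = new AudioContext({ sampleRate: 16000 })
  // Load the worklet from a Blob URL so no separate file is needed
  const url = URL.createObjectURL(
    new Blob([workletSource], { type: 'application/javascript' })
  )
  await ctx.audioWorklet.addModule(url)
  const node = new AudioWorkletNode(ctx, 'pcm-capture')
  node.port.onmessage = (e) => onChunk(floatTo16BitPCM(e.data))
  ctx.createMediaStreamSource(stream).connect(node)
}
```

Each `Int16Array` chunk is then base64-encoded and sent via `sendRealtimeInput` as `audio/pcm;rate=16000`.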
1 FPS is enough for body language — You don't need high frame rates for the model to pick up on posture, eye contact, and gestures. One frame per second keeps the context window manageable while still giving useful visual coaching.
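A sketch of that frame loop: draw the webcam video onto a canvas once a second and strip the data-URL prefix before sending the raw base64 JPEG. The helper names are illustrative:

```ts
// canvas.toDataURL returns "data:image/jpeg;base64,....", so strip the prefix.
function stripDataUrlPrefix(dataUrl: string): string {
  return dataUrl.slice(dataUrl.indexOf(',') + 1)
}

function startFrameCapture(
  video: HTMLVideoElement,
  onFrame: (b64: string) => void
) {
  const canvas = document.createElement('canvas')
  const ctx = canvas.getContext('2d')!
  return setInterval(() => {
    canvas.width = video.videoWidth
    canvas.height = video.videoHeight
    ctx.drawImage(video, 0, 0)
    onFrame(stripDataUrlPrefix(canvas.toDataURL('image/jpeg', 0.8)))
  }, 1000) // one frame per second
}
```

Each frame then goes out via `sendRealtimeInput` with `mimeType: 'image/jpeg'`.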
System prompts matter enormously — The difference between a generic AI response and a great coaching experience came down to prompt engineering. Defining specific triggers (slide transitions, feedback requests, real-time interjections) made the model proactive instead of passive.
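To give a flavor of that trigger-based structure, here is an illustrative system prompt sketch; this is not Audeza's actual prompt, just the shape of one:

```ts
// Sketch: a trigger-based coaching prompt (illustrative text).
const systemPrompt = `
You are a pitch coach watching a live presentation.
Triggers:
- When the presenter advances a slide, note pacing against the plan.
- When the presenter asks for feedback, give one concrete improvement.
- Interject briefly if filler words cluster or eye contact drops.
Stay silent otherwise; never talk over the presenter for more than a sentence.
`.trim()
```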
Wanna try it?
Stay tuned: Audeza is going live soon.