Abhijith
How I Built an AI Conversation Coach with the Gemini Live API for the Gemini Live Agent Challenge

Why I Built This

I'm socially anxious. Not the cute "I'm introverted" kind, but the actual scrambling-for-words, wishing-the-ground-would-open-up-and-swallow-me kind.

So I moved to Ireland for grad school. Figured a fresh start, new country, new people. Maybe things would be different. A few months in, I went to a tech conference. Perfect opportunity to network, I thought. I psyched myself up.

Then it happened. I got there, saw a group of people chatting, and my brain just... shut down. Froze completely. I wanted to join the conversation so badly, but I just couldn't do it. I spent half the conference wandering around, pretending to look at the schedule.

The frustrating part? It wasn't because I had nothing to say. I work in tech, I find the problems interesting, I wanted to talk to these people. The problem was pure lack of practice. I'd never actually practiced starting conversations with strangers. And it showed.


The Idea

So what if there was a way to practice this? Like, actually practice talking to people without the fear of judgment. A realistic AI person who could act distracted, busy, or skeptical, the way real people at conferences actually are.

IceBreaker is basically that. You pick a scenario (a casual conversation, a tech mixer, a cold intro at a founder booth, whatever you want) and practice out loud. The AI responds in real time, like an actual person. It can be warm or guarded depending on the difficulty level.

While you're talking, it analyzes your audio and video in real-time and gives you live tips. Things like "try asking a question here" or "you said 'um' 5 times that minute." After you're done, you get a full breakdown with scores: how much you talked vs. listened, how many questions you asked, filler words, body language confidence, your sentiment trend through the conversation, and how well you recovered from awkward moments.
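To make the debrief numbers concrete, here's a minimal sketch of how metrics like these could be computed from a session transcript. This is illustrative only (the function and filler list are my assumptions, not IceBreaker's actual code; in the app, Gemini reports the scores itself):

```python
# Hypothetical debrief metrics from a transcript of (speaker, text) turns,
# where "user" is the person practicing. Not IceBreaker's actual code.

FILLERS = {"um", "uh"}  # illustrative; a real list would be longer

def debrief_metrics(turns):
    user_words = sum(len(t.split()) for s, t in turns if s == "user")
    ai_words = sum(len(t.split()) for s, t in turns if s == "ai")
    total = (user_words + ai_words) or 1
    questions = sum(t.count("?") for s, t in turns if s == "user")
    fillers = sum(
        1
        for s, t in turns if s == "user"
        for w in t.lower().replace(",", "").replace(".", "").split()
        if w in FILLERS
    )
    return {
        "talk_ratio": round(user_words / total, 2),  # talked vs. listened
        "questions_asked": questions,
        "filler_words": fillers,
    }
```

The same idea extends to sentiment trend and recovery-from-awkwardness, which need the model's judgment rather than simple counting.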

Plus a dashboard to track your progress over multiple sessions. Watch your scores improve over time.


Building It

This wasn't a simple chatbot. I needed real-time audio conversations that actually felt like talking to a person. Which meant WebSockets, streaming audio, video frames, and multiple systems all talking to each other.

Here's what I used:

  • Frontend: React 19 + Vite + Tailwind + Recharts (for the debrief charts)
  • AI engine: Google Gemini 2.5 Flash via the Gemini Live API
  • Backend: Python + FastAPI on Cloud Run
  • Database: Firestore
  • Deployment: Docker, Cloud Build, Vercel

The tricky part was the real-time bit. The browser opens a WebSocket directly to Gemini, streams audio to it (16 kHz), gets audio back (24 kHz), and also sends video frames (~1 per second) so Gemini can see your body language.
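The client-side pacing behind that can be sketched in a few lines. This is an illustrative sketch (constants and names are my assumptions, not the app's code): audio goes out as small 16 kHz PCM chunks, and video frames get throttled to roughly one per second.

```python
# Illustrative pacing logic for a Gemini Live client (assumed names/values):
# 16 kHz 16-bit mono PCM in, video frames throttled to ~1 per second.

AUDIO_RATE_HZ = 16_000   # input sample rate Gemini Live expects
CHUNK_MS = 100           # send a chunk of mic audio every 100 ms

def audio_chunk_size_bytes(sample_rate=AUDIO_RATE_HZ, chunk_ms=CHUNK_MS):
    # 16-bit mono PCM: 2 bytes per sample
    return sample_rate * chunk_ms // 1000 * 2

class FrameThrottler:
    """Decide whether a captured video frame should be sent upstream."""

    def __init__(self, interval_s=1.0):
        self.interval_s = interval_s
        self.last_sent = float("-inf")

    def should_send(self, now_s):
        if now_s - self.last_sent >= self.interval_s:
            self.last_sent = now_s
            return True
        return False
```

Capping video at about one frame per second keeps bandwidth low while still giving the model enough signal to read posture and body language.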

Two function calls handle the feedback loop:

  • submit_tip(): After you speak, Gemini calls this to send live coaching
  • submit_metrics(): At the end, Gemini calls this to calculate your scores
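In the Gemini API, these show up as function declarations the model can invoke. Here's a hedged sketch of what the two declarations might look like; the field names and enums are illustrative guesses, not IceBreaker's actual schemas:

```python
# Illustrative function declarations in the JSON-schema style the Gemini
# API uses for tools. Parameter details are assumptions, not the app's
# real schemas.
SUBMIT_TIP = {
    "name": "submit_tip",
    "description": "Send one short, in-the-moment coaching tip to the user.",
    "parameters": {
        "type": "object",
        "properties": {
            "tip": {"type": "string"},
            "category": {
                "type": "string",
                "enum": ["questions", "filler_words", "body_language", "pacing"],
            },
        },
        "required": ["tip", "category"],
    },
}

SUBMIT_METRICS = {
    "name": "submit_metrics",
    "description": "Report end-of-session scores for the debrief.",
    "parameters": {
        "type": "object",
        "properties": {
            "talk_listen_ratio": {"type": "number"},
            "questions_asked": {"type": "integer"},
            "filler_word_count": {"type": "integer"},
            "confidence_score": {"type": "number"},
        },
        "required": ["talk_listen_ratio", "questions_asked"],
    },
}

TOOLS = [{"function_declarations": [SUBMIT_TIP, SUBMIT_METRICS]}]
```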

Why Function Calls Mattered

Here's something I learned the hard way: asking an LLM to output JSON mixed in with regular text is a mess. I started by having Gemini output tips and metrics as plain text and then parsing them. It was unreliable. Sometimes it misquoted what you said. Sometimes the JSON wasn't valid. Sometimes it just forgot what you were asking for.

Then I switched to function calling. Instead of asking Gemini to "output this as JSON," I gave it actual functions to call. Now the coaching tips and metrics come through as clean, structured data. No parsing guesswork, no hallucinations. It just works.
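Because each tool call arrives as a structured name plus arguments, handling it becomes a plain dictionary dispatch instead of text parsing. A minimal sketch (the handler names here are hypothetical):

```python
# Hypothetical dispatch for function calls coming back from the model:
# each call is already structured as a (name, args) pair.

def handle_tip(args):
    # e.g. push the tip to the UI over the WebSocket
    return {"shown": args["tip"]}

def handle_metrics(args):
    # e.g. persist the scores for the debrief screen
    return {"saved": sorted(args)}

HANDLERS = {"submit_tip": handle_tip, "submit_metrics": handle_metrics}

def dispatch(call_name, call_args):
    handler = HANDLERS.get(call_name)
    if handler is None:
        raise ValueError(f"unknown tool: {call_name}")
    return handler(call_args)
```

Either the model calls a function you declared, or it doesn't; there's no half-valid JSON to recover from.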


What Went Wrong (And How I Fixed It)

The AI Kept Forgetting Things

Early on, if your internet dropped for a second, the AI would lose context. It'd restart the conversation or start making stuff up. Imagine you're halfway through pitching your startup and suddenly the AI forgets what you said two turns ago. It was bad.

I fixed this by saving sessions on the backend. Now if you drop connection, you can pick up right where you left off.
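The shape of that fix is simple: keep the transcript server-side, keyed by session, and replay a recent slice of it into the model on reconnect. A sketch of the idea (in-memory here for brevity; the real app persists to Firestore, and the class/method names are mine):

```python
# Sketch of backend session persistence so a dropped connection can
# resume with context. In-memory for illustration; IceBreaker's backend
# uses Firestore. Names are hypothetical.
import time
from dataclasses import dataclass, field

@dataclass
class Session:
    session_id: str
    scenario: str
    transcript: list = field(default_factory=list)
    updated_at: float = field(default_factory=time.time)

class SessionStore:
    def __init__(self):
        self._sessions = {}

    def append_turn(self, session_id, scenario, speaker, text):
        s = self._sessions.setdefault(session_id, Session(session_id, scenario))
        s.transcript.append((speaker, text))
        s.updated_at = time.time()

    def resume_context(self, session_id, last_n=10):
        """Rebuild a prompt-sized slice of history after a reconnect."""
        s = self._sessions.get(session_id)
        return s.transcript[-last_n:] if s else []
```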

The AI Sounded Like a Chatbot

Getting Gemini to sound like an actual person at a conference (not helpful, not overly eager) took a lot of tweaking. The same prompt would sometimes produce a warm, natural response and other times something super formal and robotic. It would ask too many questions at once, or use phrases nobody actually says.

I spent a lot of time on prompt engineering. Testing different personas, different conversational styles, getting more specific about what "natural" means. It's still not perfect, but it's way better.

Parsing Was Killing Me

Before I switched to function calls, I was asking Gemini to output coaching tips as text, then trying to parse that. It was fragile. The model would sometimes rewrite what you said, sometimes format things weird. I'd built all this parsing logic and it still broke constantly.

Once I switched to function calling, it was night and day. Clean JSON every time. No more parsing headaches.


Things I'm Actually Proud Of

Getting Gemini Live to work end-to-end. Real-time audio conversations that don't feel like you're talking to a bot. That's genuinely hard. WebSockets, audio streaming, keeping state, managing latency: a lot can go wrong. I made it work.

Building something people can actually use. Not just a demo that works once. I mean error handling, reconnection logic, data persistence, tracking progress over time. The boring stuff that makes a product actually useful.

The feedback system actually helps. You don't just get random stats. The coaching tips arrive in the moment, and the debrief metrics are actually actionable. You can see exactly where you improved.


What I Actually Learned

Real-time streaming is fragile. Audio dropping, video lag, WebSocket timeouts—there are so many places where things can break. It's not as simple as "just stream the data." You have to think about buffers, reconnection, graceful degradation.
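One concrete piece of that: reconnecting after a drop. A common pattern is exponential backoff with jitter, so a flaky network doesn't hammer the server with instant retries. A minimal sketch (the constants are illustrative, not what the app uses):

```python
# Sketch of exponential backoff with jitter for WebSocket reconnects.
# Constants are illustrative defaults, not IceBreaker's actual values.
import random

def backoff_delays(attempts, base=0.5, cap=30.0, jitter=0.0):
    """Seconds to wait before each reconnect attempt, capped at `cap`."""
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))
        delays.append(delay + random.uniform(0, jitter))
    return delays
```

Pair this with the server-side session store, and a reconnect becomes "wait, redial, replay recent context" instead of a lost conversation.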

Never ask an LLM for JSON and try to parse it. This is a hard lesson. The model will sometimes output valid JSON, sometimes not, sometimes it'll add comments, sometimes it'll mess up the schema. Function calling is the right answer. Give the model an actual function to call, not a text format to output.

Prompt engineering is really hard. I thought I could write a prompt once and be done. Nope. The same prompt produces different outputs depending on temperature, context, the moon phase, who knows. It takes iteration, testing, examples, and luck. Don't underestimate it.

Shipping matters more than having the perfect feature. I could spend months perfecting the persona AI. But shipping an 80% solution that people can use and give feedback on is way more valuable. You learn what actually matters from real users, not from theorizing.


Next Steps

More personas and scenarios. Right now there are a few conversation types. I want to expand that—different industries, different difficulty levels, different people types.

Let people create custom scenarios. Instead of me pre-building everything, what if you could describe an event you're going to, describe the kind of person you expect to meet, and have IceBreaker generate a practice session tailored to that? Way more useful.


That's It

Building this thing taught me that good products come from real problems. I built it because I was frustrated with my own anxiety, not because I wanted to solve "networking" for everyone. And honestly? My anxiety is still there. I'm still nervous at conferences.

But now I can practice. I can work on it. And maybe that's the difference between just being stuck with something and actually being able to improve.

If you're like me, if you struggle with social stuff and want to get better, try it out. If you have feedback, let me know.

Try IceBreaker
View Code
