“And if we look at the key metrics, we can clearly see the... the...”
Sound familiar? You rehearse for days. You know your slides inside out. Then you step on stage, the nerves hit, and your mind goes blank. No one can help you in that moment — your coach is backstage, your notes are too far away to read, and the audience is already drifting.
Every presentation tool I’ve ever used gives you feedback after the fact. Record yourself, upload the video, wait for analysis. That’s like a GPS that tells you “you missed your turn” five minutes after the exit. Useful for next time, useless right now.
When Google released the Gemini 2.5 Flash Live API — bidirectional, multimodal, sub-second latency streaming — I realized something had fundamentally changed. For the first time, an AI could watch you, listen to you, and respond to you while you’re still talking. Not after the recording. Not in a text box. In real time, in your ear.
That’s how Squared was born.
Not Just a Coach — a Navigator
There are plenty of AI speech coaches out there. They record you, analyze the footage, and hand you a report: “Your pace was 180 WPM, you said ‘um’ 14 times, your posture dropped on slide 7.” Useful information. But it’s all retrospective — a postmortem of a presentation that already happened. Each session exists in isolation. The coach doesn’t remember that you’ve struggled with slide 7 three times before. It doesn’t know that last Tuesday you found a great recovery phrase that saved that exact transition. It evaluates a single run, gives you a score, and forgets everything.
Squared is different because it’s not just a coach — it’s a navigator. A coach evaluates your performance. A navigator rides with you in real time, knows the road ahead, remembers the turns where you’ve gotten lost before, and guides you through them as they come. Squared carries context across every session — it knows your history, your weak spots, your best recoveries — and it uses all of that during the presentation itself, not after.
Squared has two modes that mirror how people actually prepare and deliver presentations:
Rehearsal Mode is where the coaching happens — but it’s coaching with memory. The AI watches you through your camera, listens to your delivery, and actively interrupts with spoken feedback. Too fast? It tells you. Lost eye contact for the third time on this slide? It calls it out — and reminds you this has been a recurring problem across your last three rehearsals, not just this one. You can ask it questions mid-rehearsal: “How should I emphasize this point?” and get a response informed by everything it has learned about your delivery patterns. This is still coaching, yes — but coaching that accumulates knowledge and uses it to guide you forward, not just grade you.
Presentation Mode is where the navigator takes the wheel. This is the real stage — a live audience, real stakes. The AI still watches and listens, but it never speaks. Instead, it displays a visual HUD — micro-prompts, rescue text when you lose your train of thought, pace indicators, confidence metrics — all visible only to you. It knows which slides are fragile based on your rehearsals. It has a game plan prepared before you even start. Your audience sees a confident speaker. You see your navigation system working, anticipating the road ahead and keeping you on course.
Two Agents, One Stage
The architecture decision I’m most proud of is running two Gemini Live API sessions in parallel.
The first is the Delivery Agent. It processes your audio stream and video frames, tracks five live metrics (eye contact, pace, posture, filler words, confidence), and generates either spoken feedback (rehearsal) or silent visual cues (presentation). It uses Gemini’s tool calling to send structured updateIndicators payloads that the UI renders in real time.
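To make that concrete, here is a sketch of what an updateIndicators tool declaration could look like. The field names (eyeContact, pace, and so on) are my guesses based on the five metrics listed above — the post doesn’t show Squared’s actual schema:

```typescript
// Hypothetical tool declaration for the Delivery Agent's structured feedback.
// Field names are illustrative; the real Squared schema isn't shown here.
const updateIndicatorsTool = {
  name: "updateIndicators",
  description:
    "Push live delivery metrics and optional HUD text to the presenter's UI.",
  parameters: {
    type: "object",
    properties: {
      eyeContact: { type: "number", description: "0-1 score" },
      pace: { type: "number", description: "words per minute" },
      posture: { type: "number", description: "0-1 score" },
      fillerWords: { type: "integer", description: "count in current window" },
      confidence: { type: "number", description: "0-1 aggregate score" },
      rescueText: { type: "string", description: "optional HUD micro-prompt" },
    },
    required: ["eyeContact", "pace", "posture", "fillerWords", "confidence"],
  },
};
```

The model never renders UI itself — it just emits this payload, and the frontend decides how to draw it.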
The second is the Audience Agent. In presentation mode, when you’re sharing your screen with a video call audience, a separate Gemini Live session monitors the audience video feed. It tracks engagement levels, raised hands, confused expressions, and chat reactions. Its observations flow into the Delivery Agent’s context as [AudienceAgent] messages.
Why does this matter? Because when you’re deep in a presentation, it’s physically impossible to monitor your audience. Even with multiple monitors, your attention is on your content. The Audience Agent gives you something no human presenter has ever had: real-time awareness of how your audience is responding, surfaced as gentle indicators in your HUD.
The hardest part was preventing context pollution between the two agents. Two independent Gemini sessions generating tool calls in parallel can easily interfere with each other. The solution was a state composition layer — composeDualAgentOverlayState() merges delivery and audience feedback into a unified overlay, with filtering to ensure each agent’s signals stay clean.
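The core idea can be sketched like this — the types and the filtering rule are my reconstruction of the pattern, not Squared’s actual code:

```typescript
// Reconstruction of the state-composition idea: each agent writes only to its
// own namespace, so parallel tool calls can never clobber each other's state.
type DeliverySignals = { pace?: number; rescueText?: string };
type AudienceSignals = { engagement?: number; raisedHands?: number };

interface OverlayState {
  delivery: DeliverySignals;
  audience: AudienceSignals;
}

function composeDualAgentOverlayState(
  current: OverlayState,
  source: "delivery" | "audience",
  update: DeliverySignals | AudienceSignals
): OverlayState {
  // Filtering: an agent's update is merged only into its own slice of state,
  // so a burst of audience observations can't overwrite delivery metrics.
  if (source === "delivery") {
    return { ...current, delivery: { ...current.delivery, ...update } };
  }
  return { ...current, audience: { ...current.audience, ...update } };
}
```

Because the merge is pure and namespaced, the order in which the two sessions’ tool calls arrive stops mattering.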
The Memory That Makes It Personal
Most AI tools are stateless. Every session starts from zero. This is fine for a chatbot, but terrible for a presentation coach. If I struggle with slide 7 every single rehearsal, the AI should know that before I get to slide 7.
Squared builds what I call pattern memory — and it exists because of lucky timing. Midway through development, Google released Gemini Embedding 2. I was already thinking about how to make sessions feel connected rather than isolated, and suddenly there was a lightweight, high-quality embedding model purpose-built for semantic retrieval. That’s what pushed me to actually build it.
Here’s how it works. After each session, the system chunks the experience into 45-second time windows. Each chunk is classified as one of three types: transcript windows (what you said), flagged moments (where something went wrong), and recovery phrases (what worked when you recovered). These chunks get embedded using the Gemini Embedding API — gemini-embedding-2-preview with 256-dimensional vectors — and stored in PostgreSQL with the pgvector extension on Google Cloud SQL.
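A minimal sketch of the windowing and classification step might look like this. The event shape and classification priority are assumptions — the post only says sessions are chunked into 45-second windows of three types:

```typescript
// Sketch of the 45-second windowing step. Event shape is hypothetical.
type ChunkKind = "transcript" | "flagged_moment" | "recovery_phrase";

interface SessionEvent {
  tSec: number;        // seconds from session start
  text: string;
  flagged?: boolean;   // something went wrong here
  recovery?: boolean;  // a phrase that rescued the moment
}

interface MemoryChunk {
  kind: ChunkKind;
  startSec: number;
  text: string;
}

function chunkSession(events: SessionEvent[], windowSec = 45): MemoryChunk[] {
  // Bucket events into fixed 45s windows.
  const windows: { [w: number]: SessionEvent[] } = {};
  for (const e of events) {
    const w = Math.floor(e.tSec / windowSec);
    (windows[w] ??= []).push(e);
  }
  // Classify each window: a recovery outranks a flag, a flag outranks plain
  // transcript (priority is my assumption).
  const chunks: MemoryChunk[] = [];
  for (const key of Object.keys(windows)) {
    const w = Number(key);
    const evs = windows[w];
    const kind: ChunkKind = evs.some((e) => e.recovery)
      ? "recovery_phrase"
      : evs.some((e) => e.flagged)
      ? "flagged_moment"
      : "transcript";
    chunks.push({
      kind,
      startSec: w * windowSec,
      text: evs.map((e) => e.text).join(" "),
    });
  }
  return chunks;
}
```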
When you start a new session, Squared retrieves the most semantically similar past moments via cosine similarity search and injects them into the Gemini system prompt. The result is an AI that says things like: “That was a very fast pace and you lost eye contact several times. This has been a recurring issue for this slide. Try to slow down and remember the key phrase: rehearsal mode is an interactive AI voice coach.”
It remembers what worked, what didn’t, and where you struggled. Every rehearsal builds on the last.
For environments where the Gemini Embedding API isn’t available, I built a deterministic fallback using SHA256-based pseudo-random embeddings. The similarity search quality degrades, but the system never crashes.
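One way to build such a deterministic fallback — the digest-stretching approach here is my guess at the technique; only the SHA256 basis and the 256-dimension target come from the text above:

```typescript
import { createHash } from "crypto";

// Sketch of a deterministic fallback embedder: hash the text, then expand the
// digest into a fixed-size pseudo-random vector. Same text always yields the
// same vector, so cosine search still "works", just without real semantics.
function fallbackEmbedding(text: string, dim = 256): number[] {
  const vec: number[] = [];
  let counter = 0;
  while (vec.length < dim) {
    // Re-hash with a counter to stretch the 32 digest bytes to any dimension.
    const digest = createHash("sha256")
      .update(text)
      .update(String(counter++))
      .digest();
    for (let i = 0; i < digest.length && vec.length < dim; i++) {
      vec.push(digest[i] / 255 - 0.5); // map [0,255] onto [-0.5, 0.5]
    }
  }
  return vec;
}
```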
Reading Your Body, Locally
My original plan was to let Gemini handle everything — including eye contact and posture analysis from the video frames I was already streaming. The model can see me, so why not ask it to evaluate my body language directly?
The problem is how the Live API behaves in START_OF_ACTIVITY_INTERRUPTS mode (which Rehearsal Mode relies on for real-time voice feedback). When the model is actively generating a response and gets interrupted by new user input, it cancels the current generation and starts over. This is exactly what you want for natural conversation — but it means that continuous background analysis becomes unreliable. I’d send video frames asking for eye contact and posture data, and the responses would arrive inconsistently or not at all, because the model kept getting interrupted by my speech. The data was too sparse and too delayed to drive real-time metrics.
So I moved visual analysis out of Gemini entirely and into the browser. Squared runs MediaPipe FaceLandmarker and PoseLandmarker locally to track:
Eye contact — not a simple binary “looking at camera / not looking.” The system tracks iris positioning relative to eye width for both eyes independently, measures vertical gaze direction, and applies multi-factor severity scoring. This catches the difference between briefly glancing at notes and completely losing eye contact with the audience.
Posture — three independent metrics: head drop, shoulder tilt, and lateral lean, each normalized and aggregated into a single score. The system detects the difference between a confident stance and a slow slide into slouching.
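The aggregation for posture might be sketched like this — the weights and the penalty formula are invented for illustration; only the three-metric structure comes from the description above:

```typescript
// Sketch of the posture aggregation. Weights are hypothetical.
interface PostureMetrics {
  headDrop: number;      // 0 = upright, 1 = fully dropped
  shoulderTilt: number;  // 0 = level, 1 = maximally tilted
  lateralLean: number;   // 0 = centered, 1 = leaning hard
}

function postureScore(m: PostureMetrics): number {
  // Each metric is already normalized to [0,1]; aggregate as a weighted
  // penalty and return a single "good posture" score in [0,1].
  const penalty =
    0.5 * m.headDrop + 0.25 * m.shoulderTilt + 0.25 * m.lateralLean;
  return Math.max(0, 1 - penalty);
}
```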
Raw MediaPipe landmarks jitter significantly frame-to-frame. Without smoothing, the UI flickers constantly — your eye contact indicator would bounce between green and red every 100ms. The solution is a majority voting system: a rolling window of 12 samples with a 60% consensus threshold. The displayed value only changes when a clear majority agrees. Combined with a per-user calibration phase (25 baseline readings, normalized using median absolute deviation), this gives stable, personalized visual signals.
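The smoothing and calibration can be sketched as two small functions. The 12-sample window, 60% threshold, and median-absolute-deviation normalization come from the text above; everything else is an assumption:

```typescript
// Majority voting: the displayed boolean only flips when at least 60% of the
// rolling window disagrees with the currently shown value.
function majorityVote(
  window: boolean[],
  current: boolean,
  threshold = 0.6
): boolean {
  const disagree = window.filter((v) => v === !current).length;
  return disagree / window.length >= threshold ? !current : current;
}

function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Normalize a live reading against the 25-sample calibration baseline using
// median absolute deviation, which is robust to outlier frames.
function normalizeMAD(value: number, baseline: number[]): number {
  const m = median(baseline);
  const mad = median(baseline.map((x) => Math.abs(x - m)));
  return mad === 0 ? 0 : (value - m) / mad;
}
```

With a 12-sample window at MediaPipe frame rates, a single noisy frame can never flip an indicator — only a sustained change can.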
MediaPipe turned out to be the right tool for this job — it’s fast, runs entirely client-side, and gives me frame-by-frame landmark data that no cloud API could match at this latency. Still, it would be incredibly powerful if the Gemini Live API supported a mode where the model could process video frames continuously while maintaining an active voice conversation — something like background analysis that doesn’t get cancelled by interruptions. That would open up a whole class of real-time multimodal applications.
Keeping the Connection Alive
The AudioWorklet itself — capturing PCM at 16kHz and streaming it to Gemini — worked on the first try. The real challenge was keeping that stream reliable.
The first surprise was a race condition. The audio worklet runs in a separate thread, posting PCM buffers back to the main thread, where they get base64-encoded and sent through the WebSocket. But if the session disconnects while a Promise callback is still in flight, you’re sending audio into a dead socket. Early versions would silently fail or throw. The fix was a withOpenSession guard — a wrapper that checks connection state before every send — which became the central pattern for all real-time communication in the app.
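The pattern boils down to something like this — the session interface is assumed, not Squared’s actual type:

```typescript
// Reconstruction of the withOpenSession pattern: every send is wrapped in a
// connection-state check so late Promise callbacks can't hit a dead socket.
interface LiveSession {
  isOpen(): boolean;
  sendRealtimeInput(chunk: { data: string; mimeType: string }): void;
}

function withOpenSession(
  session: LiveSession,
  send: (s: LiveSession) => void
): boolean {
  // Check *at send time*, not at capture time — the socket may have died
  // while a base64-encode step or Promise callback was still in flight.
  if (!session.isOpen()) return false;
  send(session);
  return true;
}
```

Every audio packet, video frame, and text message goes through this one gate, so a dropped connection degrades to silently skipped sends instead of exceptions.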
And then there’s the bigger problem: the entire WebSocket can die. During a rehearsal, a dropped connection is annoying. During a live presentation, it could be catastrophic — the HUD goes blank, your navigation disappears, and you’re suddenly alone on stage. Squared captures resumption handles from the Gemini Live API and implements automatic reconnection with exponential backoff (up to 3 attempts). When the connection drops mid-speech, the system reconnects with full context preservation. The user doesn’t notice. The coaching continues.
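The backoff schedule itself is trivial to sketch — the 3-attempt cap is from the description above, while the base delay and ceiling are my guesses:

```typescript
// Sketch of the reconnection schedule: exponential backoff, capped at three
// attempts. Base delay and ceiling are illustrative values.
function backoffDelays(
  maxAttempts = 3,
  baseMs = 500,
  capMs = 8000
): number[] {
  return Array.from({ length: maxAttempts }, (_, i) =>
    Math.min(baseMs * Math.pow(2, i), capMs)
  );
}
```

On each successful reconnect, the stored Gemini resumption handle is passed back so the session resumes with its prior context instead of starting cold.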
The withOpenSession guard and the reconnection logic ended up being two sides of the same coin: the small-scale problem (individual audio packets hitting a dead socket) and the large-scale problem (the entire session going down) both needed the same design principle — never trust that the connection is alive, always verify, and recover silently.
One API, Two Personalities
Rehearsal Mode needs Gemini to actively interrupt you with spoken feedback. Presentation Mode needs Gemini to stay completely silent and only communicate through tool calls.
Both modes use the same Gemini 2.5 Flash Live API, but with fundamentally different configurations. Rehearsal uses START_OF_ACTIVITY_INTERRUPTS — the AI can cut in at any moment. Presentation uses NO_INTERRUPTION — the AI processes everything but never produces audio output, routing all feedback through updateIndicators tool calls instead.
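Side by side, the two configurations might look roughly like this. The enum spellings follow the post; the surrounding option names are illustrative rather than the exact SDK shape:

```typescript
// Sketch of the two Live API configurations. Option names are assumptions.
const rehearsalConfig = {
  responseModalities: ["AUDIO"],
  realtimeInputConfig: {
    // The model may cut in with spoken feedback at any moment.
    activityHandling: "START_OF_ACTIVITY_INTERRUPTS",
  },
  tools: [{ functionDeclarations: [/* updateIndicators */] }],
};

const presentationConfig = {
  // Audio must stay enabled (see "What I Learned"), but the stream is
  // discarded — all feedback flows through updateIndicators tool calls.
  responseModalities: ["AUDIO"],
  realtimeInputConfig: {
    activityHandling: "NO_INTERRUPTION",
  },
  tools: [{ functionDeclarations: [/* updateIndicators */] }],
};
```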
Tuning the system instructions to achieve both behaviors from the same API took more iteration than I expected. The Presentation Mode prompt engineering was especially delicate: the AI needs to be opinionated enough to generate useful micro-prompts and rescue text, but disciplined enough to never attempt audio output that would play through the speaker during a live talk.
Production on Google Cloud
Squared isn’t a localhost demo. It runs as a production application on Google Cloud, and every infrastructure choice has a reason tied to the product.
Cloud SQL with pgvector over a managed vector database like Pinecone — because pattern memory is tightly coupled with session data (runs, slides, feedback). Keeping embeddings in the same PostgreSQL instance that stores everything else means a single transaction can save a coaching moment and its vector representation together. No sync issues, no eventual consistency, no extra service to manage. The database sits behind a private VPC Connector — never exposed to the public internet.
Ephemeral Live API tokens over shipping the Gemini API key to the browser and hiding it behind a CORS policy. The server mints short-lived tokens (30-minute TTL) via ai.authTokens.create(), so even if a token leaks from the client, it expires before anyone can abuse it. The API key never leaves the server.
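The minting endpoint can be sketched like this, with the SDK client injected so the shape is clear. The call name ai.authTokens.create comes from the text above; the option fields and return shape here are assumptions:

```typescript
// Sketch of server-side ephemeral token minting. SDK option/return shapes
// are hypothetical; the client is injected to keep the sketch self-contained.
interface TokenClient {
  authTokens: {
    create(opts: { config: { expireTime: string } }): Promise<{ name: string }>;
  };
}

// Compute an ISO-8601 expiry timestamp ttlMinutes from `now`.
function ttlExpiry(ttlMinutes: number, now: number = Date.now()): string {
  return new Date(now + ttlMinutes * 60_000).toISOString();
}

async function mintEphemeralToken(
  ai: TokenClient,
  ttlMinutes = 30
): Promise<{ token: string; expiresAt: string }> {
  const expireTime = ttlExpiry(ttlMinutes);
  const created = await ai.authTokens.create({ config: { expireTime } });
  // Only the short-lived token ever reaches the browser; the API key stays
  // server-side.
  return { token: created.name, expiresAt: expireTime };
}
```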
Cloud Run for auto-scaling without managing infrastructure — Squared’s load is spiky (rehearsals cluster before big presentations), and paying for idle servers makes no sense. Terraform defines the entire stack — 379 lines that provision everything from the database to IAM roles. A fresh deployment from zero to running takes one terraform apply.
The Meta-Demo
There’s a moment in the Squared demo video that I didn’t plan, but that ended up capturing what this project is about better than anything I could have scripted.
I’m presenting Squared — explaining its features, showing how it works — and Squared itself is my navigator for the presentation. It’s watching me through the camera. It’s listening to my delivery. And when I start rushing through the pattern memory explanation — the part I know best and therefore speed through every time — it catches me. A live tip appears in the HUD: slow down, this slide has been fragile in past rehearsals.
The recursion is the point. Squared is a tool for presenting, being used to present itself, and doing its job in the process. It’s not a staged demo with pre-recorded responses. It’s the real system, running live, navigating its own creator through the presentation about it.
From first rehearsal to final stage. That’s the whole product in one moment.
What I Learned
The Live API is powerful, but you have to design around its constraints — not pretend they don’t exist. START_OF_ACTIVITY_INTERRUPTS mode cancels in-flight generation on every new input, which makes it perfect for natural conversation but unreliable for continuous background analysis. I learned this the hard way when eye contact and posture data from Gemini came back sparse and delayed. The fix — moving visual analysis to MediaPipe locally — made the system faster and more robust than the original plan. Sometimes the best architecture comes from a workaround you were forced into.
Timing matters more than planning. Pattern memory — the feature that arguably defines Squared — wasn’t in the original design. It became possible because Google released Gemini Embedding 2 midway through development. The lesson: leave room in your architecture for features you haven’t thought of yet. If the embedding pipeline hadn’t been easy to plug into the existing session flow, I would have shipped without memory, and the product would have been fundamentally weaker.
Building on a new API means shipping workarounds. Presentation Mode should only need text-based tool call responses — no audio output at all. But Modality.TEXT doesn’t work as expected with the Live API yet, so the mode runs with Modality.AUDIO enabled and silently discards the audio stream. It’s wasteful, but it works. The important thing is designing so that when the platform catches up, the workaround swaps out cleanly.
Links
Live Web Demo
Mac App
Source Code
Demo Video
Devpost Submission