Anil Kumar

How Gemini Live API Turns a Dancer's Body Into a Real-Time AI Prompt — No Keyboard Required

Built for the #GeminiLiveAgentChallenge · Live demo: nritya-ardance.web.app · Source: github.com/anil9973/nritya


Nritya Demo — Gemini Live Agent narrating Bharatanatyam in real time


The Problem Gemini Just Solved

Bharatanatyam — India's 2,000-year-old classical dance — encodes an entire spoken language in the body. 108 hand gestures. 9 emotional states. Complete mythological narratives told without a single word.

And yet, for nearly every performance ever staged, the audience has understood almost none of it.

Pre-written subtitles require preparation. Human interpreters require choreography. Turn-based AI APIs introduce 2–5 seconds of latency — the dancer completes three gestures before the first response arrives.

The Gemini Live API changed the calculus entirely.

Nritya is the first system that watches a live Bharatanatyam performance, identifies each gesture the moment it is held, and narrates its 2,000-year-old meaning to the audience — in the same breath the dancer performs it. No preparation. No human interpreter. No perceptible delay.

The dancer's body is the only prompt Nritya needs.


Why This Required the Gemini Live API Specifically

This is not a "we used AI" story. Three architectural requirements made Gemini Live the only viable foundation:

1. Sub-300ms multimodal perception

Every other vision API is turn-based. Send image → wait → receive response. At 2–5 seconds round trip, a classical dancer has already completed the gesture, transitioned to the next, and begun a third. The narration would always lag the performance by several beats.

The Gemini Live API maintains a persistent bidirectional WebSocket. A pose-lock event fires the moment both wrist velocities drop below 3px/frame. A JPEG frame and skeletal JSON reach Gemini before the dancer has exhaled. Narration begins within 300ms of the pose completing.

That latency number is not a marketing claim — it is the architectural requirement that made the product possible.
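The velocity gate itself is simple. A minimal sketch, assuming per-frame wrist coordinates in pixels from a pose tracker (the function and field names are illustrative, not Nritya's actual code):

```javascript
// Illustrative sketch of the velocity-gated pose lock — not Nritya's source.
// `prev`/`curr` are wrist positions {x, y} from two consecutive video frames.
const LOCK_THRESHOLD = 3; // px/frame — the threshold described above

function wristVelocity(prev, curr) {
  // Euclidean distance the wrist moved between consecutive frames
  return Math.hypot(curr.x - prev.x, curr.y - prev.y);
}

function isPoseLocked(prevLeft, currLeft, prevRight, currRight) {
  // Fire the lock only when BOTH wrists have nearly stopped
  return wristVelocity(prevLeft, currLeft) < LOCK_THRESHOLD &&
         wristVelocity(prevRight, currRight) < LOCK_THRESHOLD;
}
```

In the real pipeline this check runs inside the requestVideoFrameCallback loop, and a true result triggers the JPEG capture and send.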

2. Simultaneous tool calls and live audio

Every other approach requires a choice: the AI either speaks or acts. It calls a tool and waits for confirmation, then speaks. Or it speaks and defers the tool response.

The Gemini Live API's interleaved output stream eliminates this constraint. In a single turn, Gemini:

  • Calls trigger_mudra_lock — sacred geometry snaps to the dancer's wrist
  • Calls update_story_card — Sanskrit translation pushes to the AR HUD
  • Simultaneously streams PCM16 narration at 24kHz

The audience sees the visual feedback and hears the poetry at the same moment. Neither waits for the other. This simultaneity is not achievable with any sequential API.
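On the client side, handling an interleaved turn reduces to routing each function call to a local handler while the audio stream keeps playing. A hedged sketch of that dispatch — the message shape follows @google/genai's Live callbacks (message.toolCall.functionCalls), but the handler bodies are stubs, not Nritya's actual AR code:

```javascript
// Illustrative tool-call dispatcher — handler internals are stand-ins.
const toolHandlers = {
  trigger_mudra_lock: (args) => ({ geometry: args.geometry }),            // draw AR overlay
  update_story_card:  (args) => ({ shown: Boolean(args.sanskritTitle) }), // update HUD
};

function dispatchToolCalls(message) {
  const calls = (message.toolCall && message.toolCall.functionCalls) || [];
  return calls.map((fc) => {
    const handler = toolHandlers[fc.name];
    return {
      id: fc.id,
      name: fc.name,
      response: handler
        ? { result: "ok", ...handler(fc.args) }
        : { result: "error", detail: "unknown tool" },
    };
  });
}
```

The returned array is what gets passed back through session.sendToolResponse while narration continues uninterrupted.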

3. Persistent creative persona across the full performance

The Gemini Live API maintains conversational context for the entire session. Nritya's system prompt defines a complete theatrical persona — the Sutradhara, the ancient thread-holder of Indian theater — and Gemini inhabits it from the opening gesture to the final bow. The AI remembers which Rasas have already been narrated, avoids repeating the same poetic framing, and escalates its language when the performance reaches its climactic moments.

That consistency is only possible with a persistent session. Stateless APIs would start fresh at every gesture.
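Concretely, the persona lives in the session config's system instruction, so it rides along for the entire session rather than being resent per turn. A hypothetical sketch — the prompt text here is illustrative, far shorter than Nritya's actual Sutradhara prompt:

```javascript
// Hypothetical session config — the instruction text is a stand-in,
// not Nritya's real persona prompt.
const config = {
  responseModalities: ["AUDIO"], // narration is streamed speech
  systemInstruction:
    "You are the Sutradhara, the ancient thread-holder of Indian theater. " +
    "Narrate each mudra's mythological meaning in one or two poetic sentences. " +
    "Track which Rasas you have already narrated and never reuse a framing. " +
    "Escalate your language as the performance approaches its climax.",
};
```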


The Architecture

```mermaid
sequenceDiagram
    participant D as 💃 Dancer
    participant B as Browser Client
    participant F as Firebase Functions
    participant G as Gemini Live API
    participant V as Vertex AI Imagen 3

    D->>B: Holds Alapadma mudra
    Note over B: Wrist velocity drops below 3px/frame
    B->>F: POST /generateLiveToken
    Note over F: authTokens.create() — uses:1, TTL 30min<br/>API key stays server-side
    F-->>B: { token: "ya29.ephemeral..." }
    B->>G: WebSocket wss://...?access_token={token}
    Note over G: Token consumed on connect — replay impossible
    B->>G: sendRealtimeInput { JPEG + skeletal JSON }
    G->>B: toolCall: trigger_mudra_lock(LOTUS_OUTLINE)
    G->>B: toolCall: update_story_card(sanskritTitle, narrativeText)
    G-->>B: PCM16 audio stream begins simultaneously
    Note over B: Sacred geometry + Story Card + Voice<br/>all arrive in the same turn
    G->>V: image_gen_prompt → Tanjore artwork
    V-->>B: base64 PNG → GSAP crossfade
```

The full stack is Google Cloud native — Gemini Live API, Vertex AI Imagen 3, Firebase Functions, Firebase Hosting. Zero third-party AI services.


Three Technical Decisions Worth Copying

Ephemeral Token Security (Production Pattern)

The Gemini API key never reaches the browser. Firebase Functions mint a one-use token with a 30-minute TTL. The browser opens the WebSocket directly using this token, which is invalidated on connection.

```javascript
// Firebase Function — generateLiveToken
import { GoogleGenAI } from "@google/genai";

// API key is supplied via the server environment — it never ships to the browser
const client = new GoogleGenAI({});
const expireTime = new Date(Date.now() + 30 * 60 * 1000).toISOString();

const token = await client.authTokens.create({
  config: {
    uses: 1,        // single-use: invalidated the moment a session connects
    expireTime,     // token itself expires after 30 minutes
    newSessionExpireTime: new Date(Date.now() + 60 * 1000).toISOString(),
    httpOptions: { apiVersion: "v1alpha" },
  },
});

res.json({ token: token.accessToken });
```
```javascript
// Browser — connect with the ephemeral token, not the raw API key
const { token } = await fetch(VITE_FIREBASE_FUNCTION_URL).then(r => r.json());

this._ai = new GoogleGenAI({ accessToken: token });
this._session = await this._ai.live.connect({ model, config, callbacks });
// Token consumed here — permanently invalidated
```

This is not a hackathon shortcut. This is the production security pattern Google recommends for any client-facing Gemini Live deployment.

NON_BLOCKING Tool Calls — The Simultaneity Key

The critical declaration that enables simultaneous narration and tool execution:

```javascript
import { Behavior, FunctionResponseScheduling } from "@google/genai";

// Tool declared as NON_BLOCKING — Gemini continues speaking while the client processes
{
  name:     "update_story_card",
  behavior: Behavior.NON_BLOCKING,  // narration never pauses for this tool
  parameters: { /* ... */ }
}

// Tool response with SILENT scheduling — Gemini resumes from where it left off
session.sendToolResponse({
  functionResponses: [{
    id:   fc.id,
    name: fc.name,
    response: {
      result:     "ok",
      scheduling: FunctionResponseScheduling.SILENT,
    },
  }]
});
```

Without NON_BLOCKING, every update_story_card call would pause the narration until the client responded. The audience would hear a gap between sentences — a glitch that breaks the theatrical immersion completely.

Dual mediaChunks — Multimodal Context in One Message

Each frame send carries both visual and structured data as separate chunks:

```javascript
session.sendRealtimeInput({
  mediaChunks: [
    {
      // Clean video frame — 512×512 JPEG, video-only (no canvas trails)
      // Gemini identifies the Mudra from visual confirmation alone
      mimeType: "image/jpeg",
      data:     jpegBase64,
    },
    {
      // Skeletal context — wrist velocity, Aramandi depth, session state
      // Sent as base64-encoded JSON text alongside the image
      mimeType: "text/plain",
      data:     btoa(JSON.stringify({
        event_trigger:          "VELOCITY_DROP_MUDRA_LOCK",
        wrist_velocity_l:       meta.wristVelocityL,
        wrist_velocity_r:       meta.wristVelocityR,
        aramandi_stance_active: meta.aramandiActive,
        current_rasa:           meta.currentRasa,
        session_time_ms:        meta.sessionTimeMs,
      })),
    },
  ],
});
```

Gemini reads both parts as a single multimodal input. The image provides visual confirmation of the gesture. The JSON provides physical context — velocity, stance depth, emotional state — that helps Gemini calibrate its narration without requiring a separate API call.


Performance Numbers

| Metric | Value | How |
|---|---|---|
| Pose lock → Gemini receives frame | ~80ms | requestVideoFrameCallback loop at 60fps, JPEG capture on lock |
| Gemini first tool call → client | ~200ms | Persistent WSS, no connection overhead |
| Tool call → AR HUD update | ~16ms | requestAnimationFrame, Svelte $state |
| Full end-to-end (pose → narration begins) | < 300ms | All three stages in sequence |
| Imagen 3 artwork generation | 4–8s | Background fetch, shimmer placeholder shown |

The 300ms end-to-end figure is what makes the product feel live rather than reactive. Human perception of audio-visual sync breaks down around 80ms — at 300ms, the narration arrives before the dancer has transitioned to the next gesture. The performance and the AI stay in the same moment.
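As a sanity check, the three sequential stages from the table sum comfortably under that budget:

```javascript
// Stage latencies from the table above, in milliseconds (sequential path)
const stages = {
  poseLockToFrame:     80,  // rVFC loop + JPEG capture
  geminiFirstToolCall: 200, // persistent WebSocket round trip
  hudUpdate:           16,  // one animation frame
};
const endToEnd = Object.values(stages).reduce((a, b) => a + b, 0);
console.log(endToEnd); // 296 — under the 300ms budget
```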


What This Demonstrates About Gemini Live

The standard Gemini Live demo is voice conversation — a user speaks, the AI responds. Nritya demonstrates a fundamentally different interaction paradigm: structured physical gesture as a continuous prompt stream.

The dancer never speaks to the AI. The AI never waits to be asked. The system operates on velocity-triggered perception — a frame is sent precisely when the gesture completes, and Gemini responds with a fully orchestrated AR experience: geometry, translation, narration, and generated artwork arriving simultaneously.

This opens a category of applications that simply did not exist before the Gemini Live API's multimodal streaming capability:

  • Live sports commentary — gesture recognition triggering instant tactical analysis
  • Sign language interpretation — ASL detection with real-time spoken translation
  • Surgical guidance — instrument position triggering contextual procedural notes
  • Industrial inspection — technician hand position triggering relevant manual sections

In each case, the human body is the prompt. The AI is the always-watching collaborator. The interaction model requires no keyboard, no microphone, no deliberate input — only action.

Bharatanatyam is the proof of concept. The paradigm is universal.


Resources

  • Live demo: nritya-ardance.web.app
  • Source: github.com/anil9973/nritya


Built for the Gemini Live Agent Challenge · #GeminiLiveAgentChallenge
The dancer has always been speaking. Gemini is finally listening.
