Built for the #GeminiLiveAgentChallenge · Live demo: nritya-ardance.web.app · Source: github.com/anil9973/nritya
The Problem Gemini Just Solved
Bharatanatyam — India's 2,000-year-old classical dance — encodes an entire spoken language in the body. 108 hand gestures. 9 emotional states. Complete mythological narratives told without a single word.
At every performance ever staged, the audience has understood almost none of it.
Pre-written subtitles require preparation. Human interpreters require choreography. Turn-based AI APIs introduce 2–5 seconds of latency — the dancer completes three gestures before the first response arrives.
The Gemini Live API changed the calculus entirely.
Nritya is the first system that watches a live Bharatanatyam performance, identifies each gesture the moment it is held, and narrates its 2,000-year-old meaning to the audience — in the same breath the dancer performs it. No preparation. No human interpreter. No perceptible delay.
The dancer's body is the only prompt Nritya needs.
Why This Required the Gemini Live API Specifically
This is not a "we used AI" story. Three architectural requirements made Gemini Live the only viable foundation:
1. Sub-300ms multimodal perception
Every other vision API is turn-based. Send image → wait → receive response. At 2–5 seconds round trip, a classical dancer has already completed the gesture, transitioned to the next, and begun a third. The narration would always lag the performance by several beats.
The Gemini Live API maintains a persistent bidirectional WebSocket. A pose-lock event fires the moment both wrist velocities drop below 3px/frame. A JPEG frame and skeletal JSON reach Gemini before the dancer has exhaled. Narration begins within 300ms of the pose completing.
That latency number is not a marketing claim — it is the architectural requirement that made the product possible.
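The pose-lock trigger itself is simple arithmetic over successive frames. A minimal sketch of the idea, assuming per-frame wrist coordinates from a pose tracker (the `PoseLockDetector` name and coordinate fields are illustrative, not taken from the Nritya source):

```javascript
// Minimal pose-lock detector: fires once when BOTH wrist velocities
// drop below a threshold (px/frame), mirroring the 3px/frame rule.
const LOCK_THRESHOLD = 3; // px per frame

class PoseLockDetector {
  constructor(threshold = LOCK_THRESHOLD) {
    this.threshold = threshold;
    this.prev = null; // previous { lx, ly, rx, ry } wrist positions
    this.locked = false;
  }

  // Returns true exactly once, on the frame where the pose locks.
  update({ lx, ly, rx, ry }) {
    let fired = false;
    if (this.prev) {
      const vL = Math.hypot(lx - this.prev.lx, ly - this.prev.ly);
      const vR = Math.hypot(rx - this.prev.rx, ry - this.prev.ry);
      const still = vL < this.threshold && vR < this.threshold;
      if (still && !this.locked) {
        this.locked = true;
        fired = true; // capture JPEG + skeletal JSON, send to Gemini here
      } else if (!still) {
        this.locked = false; // dancer moved again: re-arm the trigger
      }
    }
    this.prev = { lx, ly, rx, ry };
    return fired;
  }
}
```

Calling `update()` from a `requestVideoFrameCallback` loop gives the event semantics described above: one send per held gesture, re-armed as soon as the hands move again.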
2. Simultaneous tool calls and live audio
Every other approach requires a choice: the AI either speaks or acts. It calls a tool and waits for confirmation, then speaks. Or it speaks and defers the tool response.
The Gemini Live API's interleaved output stream eliminates this constraint. In a single turn, Gemini:
- Calls trigger_mudra_lock — sacred geometry snaps to the dancer's wrist
- Calls update_story_card — Sanskrit translation pushes to the AR HUD
- Simultaneously streams PCM16 narration at 24kHz
The audience sees the visual feedback and hears the poetry at the same moment. Neither waits for the other. This simultaneity is not achievable with any sequential API.
3. Persistent creative persona across the full performance
The Gemini Live API maintains conversational context for the entire session. Nritya's system prompt defines a complete theatrical persona — the Sutradhara, the ancient thread-holder of Indian theater — and Gemini inhabits it from the opening gesture to the final bow. The AI remembers which Rasas have already been narrated, avoids repeating the same poetic framing, and escalates its language when the performance reaches its climactic moments.
That consistency is only possible with a persistent session. Stateless APIs would start fresh at every gesture.
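A persistent persona like this is configured once, when the session opens. A hedged sketch of what that looks like with the @google/genai Live API — the model name, persona wording, and callback body here are illustrative, not copied from the Nritya source:

```javascript
// One Live session spans the whole performance; the system instruction
// defines the Sutradhara persona once, and server-side session context
// carries it from the opening gesture to the final bow.
import { GoogleGenAI, Modality } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const session = await ai.live.connect({
  model: "gemini-2.0-flash-live-001", // any Live-capable model
  config: {
    responseModalities: [Modality.AUDIO],
    systemInstruction:
      "You are the Sutradhara, the ancient thread-holder of Indian theater. " +
      "Narrate each mudra's meaning in poetic English. Never repeat a poetic " +
      "framing for a Rasa you have already narrated in this performance.",
  },
  callbacks: {
    onmessage: (msg) => {
      /* route tool calls + PCM16 audio chunks to the AR HUD and speaker */
    },
  },
});
```

Because the session object is never torn down between gestures, the "which Rasas have I already narrated" memory lives in Gemini's own context window, not in client state.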
The Architecture
sequenceDiagram
participant D as 💃 Dancer
participant B as Browser Client
participant F as Firebase Functions
participant G as Gemini Live API
participant V as Vertex AI Imagen 3
D->>B: Holds Alapadma mudra
Note over B: Wrist velocity drops below 3px/frame
B->>F: POST /generateLiveToken
Note over F: authTokens.create() — uses:1, TTL 30min<br/>API key stays server-side
F-->>B: { token: "ya29.ephemeral..." }
B->>G: WebSocket wss://...?access_token={token}
Note over G: Token consumed on connect — replay impossible
B->>G: sendRealtimeInput { JPEG + skeletal JSON }
G->>B: toolCall: trigger_mudra_lock(LOTUS_OUTLINE)
G->>B: toolCall: update_story_card(sanskritTitle, narrativeText)
G-->>B: PCM16 audio stream begins simultaneously
Note over B: Sacred geometry + Story Card + Voice<br/>all arrive in the same turn
G->>V: image_gen_prompt → Tanjore artwork
V-->>B: base64 PNG → GSAP crossfade
The full stack is Google Cloud native — Gemini Live API, Vertex AI Imagen 3, Firebase Functions, Firebase Hosting. Zero third-party AI services.
Three Technical Decisions Worth Copying
Ephemeral Token Security (Production Pattern)
The Gemini API key never reaches the browser. Firebase Functions mint a one-use token with a 30-minute TTL. The browser opens the WebSocket directly using this token, which is invalidated on connection.
// Firebase Function — generateLiveToken
import { onRequest } from "firebase-functions/v2/https";
import { GoogleGenAI } from "@google/genai";

const client = new GoogleGenAI({}); // API key read from server-side env — never shipped to the browser

export const generateLiveToken = onRequest(async (req, res) => {
  const expireTime = new Date(Date.now() + 30 * 60 * 1000).toISOString(); // 30-minute TTL
  const token = await client.authTokens.create({
    config: {
      uses: 1, // single-use: consumed the moment the WebSocket connects
      expireTime,
      newSessionExpireTime: new Date(Date.now() + 60 * 1000).toISOString(), // session must start within 1 minute
      httpOptions: { apiVersion: "v1alpha" },
    },
  });
  res.json({ token: token.accessToken });
});
// Browser — connect with ephemeral token, not raw API key
const { token } = await fetch(VITE_FIREBASE_FUNCTION_URL).then(r => r.json());
this._ai = new GoogleGenAI({ accessToken: token });
this._session = await this._ai.live.connect({ model, config, callbacks });
// Token consumed here — permanently invalidated
This is not a hackathon shortcut. This is the production security pattern Google recommends for any client-facing Gemini Live deployment.
NON_BLOCKING Tool Calls — The Simultaneity Key
The critical declaration that enables simultaneous narration and tool execution:
// Tool declared as NON_BLOCKING — Gemini continues speaking while the client processes
import { Behavior, FunctionResponseScheduling } from "@google/genai";

{
  name: "update_story_card",
  behavior: Behavior.NON_BLOCKING, // narration never pauses for this tool
  parameters: { /* ... */ },
}
// Tool response with SILENT scheduling — Gemini resumes from where it left off
session.sendToolResponse({
functionResponses: [{
id: fc.id,
name: fc.name,
response: {
result: "ok",
scheduling: FunctionResponseScheduling.SILENT,
},
}]
});
Without NON_BLOCKING, every update_story_card call would pause the narration until the client responded. The audience would hear a gap between sentences — a glitch that breaks the theatrical immersion completely.
Dual mediaChunks — Multimodal Context in One Message
Each frame send carries both visual and structured data as separate chunks:
session.sendRealtimeInput({
mediaChunks: [
{
// Clean video frame — 512×512 JPEG, video-only (no canvas trails)
// Gemini identifies the Mudra from visual confirmation alone
mimeType: "image/jpeg",
data: jpegBase64,
},
{
// Skeletal context — wrist velocity, Aramandi depth, session state
// Sent as base64-encoded JSON text alongside the image
mimeType: "text/plain",
data: btoa(JSON.stringify({
event_trigger: "VELOCITY_DROP_MUDRA_LOCK",
wrist_velocity_l: meta.wristVelocityL,
wrist_velocity_r: meta.wristVelocityR,
aramandi_stance_active: meta.aramandiActive,
current_rasa: meta.currentRasa,
session_time_ms: meta.sessionTimeMs,
})),
},
],
});
Gemini reads both parts as a single multimodal input. The image provides visual confirmation of the gesture. The JSON provides physical context — velocity, stance depth, emotional state — that helps Gemini calibrate its narration without requiring a separate API call.
Performance Numbers
| Metric | Value | How |
|---|---|---|
| Pose lock → Gemini receives frame | ~80ms | requestVideoFrameCallback (rVFC) loop at 60fps, JPEG capture on lock |
| Gemini first tool call → client | ~200ms | Persistent WSS, no connection overhead |
| Tool call → AR HUD update | ~16ms | requestAnimationFrame, Svelte $state |
| Full end-to-end (pose → narration begins) | < 300ms | All three stages in sequence |
| Imagen 3 artwork generation | 4–8s | Background fetch, shimmer placeholder shown |
The 300ms end-to-end figure is what makes the product feel live rather than reactive. Human perception of audio-visual sync breaks down around 80ms — at 300ms, the narration arrives before the dancer has transitioned to the next gesture. The performance and the AI stay in the same moment.
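The end-to-end figure follows directly from the table — a trivial sanity check on the latency budget (stage names here are just labels for the table rows above):

```javascript
// Summing the three sequential stages from the performance table:
// pose lock → frame sent, Gemini's first tool call, and the HUD update.
const stages = {
  captureAndSend: 80,      // ~80ms: rVFC loop + JPEG capture on lock
  geminiFirstToolCall: 200, // ~200ms: persistent WSS, no handshake cost
  hudUpdate: 16,            // ~16ms: one requestAnimationFrame tick
};
const total = Object.values(stages).reduce((a, b) => a + b, 0);
console.log(total); // 296 — inside the <300ms end-to-end target
```

The margin is thin but real: any per-gesture connection handshake (typically hundreds of milliseconds) would blow the budget on its own, which is why the persistent WebSocket is the load-bearing piece.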
What This Demonstrates About Gemini Live
The standard Gemini Live demo is voice conversation — a user speaks, the AI responds. Nritya demonstrates a fundamentally different interaction paradigm: structured physical gesture as a continuous prompt stream.
The dancer never speaks to the AI. The AI never waits to be asked. The system operates on velocity-triggered perception — a frame is sent precisely when the gesture completes, and Gemini responds with a fully orchestrated AR experience: geometry, translation, narration, and generated artwork arriving simultaneously.
This opens a category of applications that simply did not exist before the Gemini Live API's multimodal streaming capability:
- Live sports commentary — gesture recognition triggering instant tactical analysis
- Sign language interpretation — ASL detection with real-time spoken translation
- Surgical guidance — instrument position triggering contextual procedural notes
- Industrial inspection — technician hand position triggering relevant manual sections
In each case, the human body is the prompt. The AI is the always-watching collaborator. The interaction model requires no keyboard, no microphone, no deliberate input — only action.
Bharatanatyam is the proof of concept. The paradigm is universal.
Resources
- Live Demo: nritya-ardance.web.app
- Source Code: github.com/anil9973/nritya
- Gemini Live API Docs: ai.google.dev/gemini-api/docs/live
- Ephemeral Token Pattern: ai.google.dev/gemini-api/docs/ephemeral-tokens
- Firebase AI Logic: firebase.google.com/docs/ai-logic
Built for the Gemini Live Agent Challenge · #GeminiLiveAgentChallenge
The dancer has always been speaking. Gemini is finally listening.
