How Gemini Live API Makes Real-Time Psychological Support in Space Technically Possible
Disclosure: I created this content for the purposes of entering the Gemini Live Agent Challenge hosted by Google. All technical claims reflect the actual production implementation of MAITRI, deployed on Google Cloud Run and Firebase Hosting. View the repository →
The Constraint That Determined Everything
India is sending humans to space for the first time. Gaganyaan will orbit 400km above Earth, and every 90 minutes, for 45 minutes, the spacecraft passes beyond every relay satellite. Complete blackout. No ground control. No communication of any kind.
An astronaut experiencing psychological distress during a blackout window has no one to reach. This is not a solvable communication problem — it is a physical reality of orbital mechanics.
ISRO formally identified this gap in problem statement PS-ID-25175: real-time psychological support for isolated astronauts. I built MAITRI as a working implementation against that specification, entirely on Google Cloud.
The first architectural question was brutal: what does it mean for an AI to already be there?
A chatbot responds to requests. A REST completion API adds 3–4 seconds of round-trip latency per exchange. In a psychological support context, 3 seconds of silence after someone reaches out is not a delay — it is the pause that confirms they are alone.
Gemini Live's sub-300ms bidirectional audio-video stream is the only technology that eliminates that pause. Not reduces it. Eliminates it. Every subsequent technical decision in MAITRI's architecture follows from that single constraint.
What Gemini Live Makes Possible (That Nothing Else Does)
Before diving into the implementation, it's worth being explicit about why Gemini Live is not interchangeable with other APIs here.
| Approach | Latency | Why It Fails |
|---|---|---|
| REST completion API | 3,000ms+ | Perceptible silence = perceived rejection |
| WebSocket chat stream | 800ms+ | Still detectable delay in conversation |
| Gemini Live bidirectional | <300ms | Below human perception threshold |
The 300ms threshold matters clinically. NASA's psychological research on ISS astronauts shows that conversational responsiveness — the feeling of being heard — breaks down above 400ms of response latency. Gemini Live is not a faster API. It is a different category of interaction: one where the model is a continuous presence, not a request-response endpoint.
Designing MAITRI for presence rather than answers required entirely different architecture patterns.
The Architecture: Three Gemini Products, One Mission
MAITRI uses three products from the Gemini ecosystem, each doing precisely what it does best:
Gemini Live API → Real-time voice conversation (presence)
Gemini Flash → Affect scoring from video (analysis)
Google ADK → Protocol orchestration (intelligence)
Here is how they connect:
```
Android (Astronaut)
   │
   ├─ Audio (mic) ────► LiveKit WebRTC ────► Python Worker (Cloud Run)
   │                                                │
   └─ Video (camera) ───────────────────────────────┤
                                                    │
                                     ┌──────────────┴──────────────┐
                                     │                             │
                             LiveRequestQueue               Affect Pipeline
                                     │                     (every 5 seconds)
                                     ▼                             │
                                Google ADK                         ▼
                                     │                       Gemini Flash
                                     ▼                       (Vertex AI)
                             Gemini Live API                       │
                                     │                             ▼
                                     └─────────────► ProtocolStateMachine
                                                            │
                                                   ┌────────┴────────┐
                                                   │                 │
                                               Firestore          Pub/Sub
                                                   │
                                            Svelte Dashboard
                                           (Firebase Hosting)
```
The critical architectural detail: the affect pipeline is completely isolated from the Live session. Gemini Flash scores the astronaut's emotional state from video frames every 5 seconds — the conversation never knows this is happening. MAITRI does not randomly comment on the camera feed. The scoring is invisible. The effect is not.
The Audio Bridge: Getting Voice Into Gemini Live
The first implementation challenge is a format mismatch. LiveKit WebRTC delivers audio at 48kHz. Gemini Live requires 16kHz mono PCM. The response comes back at 24kHz and needs to be published back to LiveKit at 48kHz.
This is the ingress pipeline:
```python
async def forward_remote_audio_stream_to_adk_live_queue(
    audio_stream: rtc.AudioStream,
    live_request_queue: LiveRequestQueue,
    ingress_resampler: rtc.AudioResampler,
) -> None:
    forwarded_frame_count: int = 0
    async for event in audio_stream:
        incoming_frame: rtc.AudioFrame = event.frame
        # Resample 48kHz → 16kHz — Gemini Live requirement
        resampled_frames = ingress_resampler.push(incoming_frame)
        for resampled_frame in resampled_frames:
            audio_blob = genai_types.Blob(
                mime_type="audio/pcm;rate=16000",
                data=bytes(resampled_frame.data),
            )
            live_request_queue.send_realtime(audio_blob)
            forwarded_frame_count += 1
```
LiveRequestQueue is ADK's abstraction over the Gemini Live WebSocket — it handles reconnection, session management, and tool call dispatch transparently. Every 20ms audio frame from the astronaut's microphone arrives in Gemini Live with no buffering layer adding latency.
Zero audio processing on the spacecraft. Full intelligence on Google Cloud.
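The egress leg mirrors this in reverse: Gemini's 24 kHz output is resampled back up to 48 kHz before being published to LiveKit. To make the rate conversion itself concrete, here is a deliberately naive pure-Python sketch of the 48 kHz → 16 kHz leg. This is an illustration, not MAITRI code: `naive_downsample_48k_to_16k` is a made-up name, and a real resampler such as LiveKit's `rtc.AudioResampler` applies an anti-aliasing filter rather than simply averaging sample groups.

```python
def naive_downsample_48k_to_16k(samples: list[int]) -> list[int]:
    """Illustrative 3:1 decimation (48 kHz -> 16 kHz): average each
    group of three PCM samples. Production resamplers low-pass
    filter first to avoid aliasing; this sketch only shows the
    rate arithmetic."""
    out: list[int] = []
    usable = len(samples) - len(samples) % 3
    for i in range(0, usable, 3):
        out.append(sum(samples[i:i + 3]) // 3)
    return out

# A 20 ms frame at 48 kHz is 960 mono samples; at 16 kHz it is 320.
frame_48k = [0] * 960
print(len(naive_downsample_48k_to_16k(frame_48k)))  # → 320
```

The same 3:1 ratio explains why the frame cadence is unaffected: each 20 ms frame stays 20 ms long, it just carries a third as many samples.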
The Affect Pipeline: Gemini Flash as a Silent Observer
Every 5 seconds, a single video frame is extracted from the LiveKit video track, converted to JPEG, and sent to Gemini Flash on Vertex AI with a structured prompt:
```python
async def _score_single_frame(self, frame: rtc.VideoFrame) -> AffectScore | None:
    if self._scoring_lock.locked():
        return None  # drop-if-busy — never queue frames
    async with self._scoring_lock:
        jpeg_bytes = self._convert_rgba_to_jpeg(frame)
        b64 = base64.b64encode(jpeg_bytes).decode()
        response = await asyncio.to_thread(
            self._gemini_client.models.generate_content,
            model="gemini-2.0-flash",
            contents=[
                {
                    "parts": [
                        {"inline_data": {"mime_type": "image/jpeg", "data": b64}},
                        {"text": AFFECT_SCORING_PROMPT},
                    ]
                }
            ],
        )
        return AffectScore.model_validate_json(response.text)
```
The asyncio.Lock with drop-if-busy is intentional. If Gemini Flash takes longer than 5 seconds on a frame (network spike, API latency), the next frame is dropped rather than queued. A stale affect score from 10 seconds ago is worse than no score — it might trigger a false state transition.
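The drop-if-busy pattern can be demonstrated standalone. This is a minimal sketch, not MAITRI code: `DropIfBusyScorer` is an illustrative name, and `asyncio.sleep` stands in for the Gemini Flash call.

```python
import asyncio

class DropIfBusyScorer:
    """Minimal sketch of drop-if-busy: if a previous scoring call
    is still running, the new frame is discarded rather than
    queued behind it."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self.scored = 0
        self.dropped = 0

    async def score(self, frame: str, duration: float) -> None:
        if self._lock.locked():   # previous call still in flight
            self.dropped += 1     # drop — never queue
            return
        async with self._lock:
            await asyncio.sleep(duration)  # stand-in for the Gemini call
            self.scored += 1

async def main() -> tuple[int, int]:
    scorer = DropIfBusyScorer()
    slow = asyncio.create_task(scorer.score("frame-1", 0.05))
    await asyncio.sleep(0.01)            # let frame-1 take the lock
    await scorer.score("frame-2", 0.05)  # dropped: lock still held
    await slow
    return scorer.scored, scorer.dropped

print(asyncio.run(main()))  # → (1, 1)
```

The key detail is checking `lock.locked()` before entering `async with`: entering the context manager directly would queue the caller behind the lock, which is exactly the behavior this pattern exists to avoid.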
The returned AffectScore carries arousal, valence, and confidence — three numbers that drive the entire protocol state machine.
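For readers who want the shape of that object, here is a simplified stand-in for the model. The production code uses pydantic (`AffectScore.model_validate_json`); this dataclass version only shows the three fields the state machine consumes, and the 0–100 scale for arousal and valence is an assumption inferred from the `/100` normalization in the state machine below.

```python
import json
from dataclasses import dataclass

@dataclass
class AffectScore:
    """Simplified stand-in for the pydantic model used in MAITRI."""
    arousal: float     # assumed 0-100: activation level read from the frame
    valence: float     # assumed 0-100: positive/negative emotional tone
    confidence: float  # 0-1: model confidence in the reading

    @classmethod
    def from_json(cls, raw: str) -> "AffectScore":
        payload = json.loads(raw)
        return cls(
            arousal=float(payload["arousal"]),
            valence=float(payload["valence"]),
            confidence=float(payload["confidence"]),
        )

score = AffectScore.from_json(
    '{"arousal": 72.5, "valence": 31.0, "confidence": 0.88}'
)
print(score)
```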
The State Machine: ADK Protocol Intelligence
MAITRI's three-state protocol maps directly to clinical psychological triage logic:
```python
class ProtocolStateMachine:
    _STATE_SEVERITY: dict[str, int] = {
        "BASELINE_MONITORING": 0,
        "ANOMALY_FLAGGED": 1,
        "ACTIVE_INTERVENTION": 2,
    }

    def _severity(self, state: str) -> int:
        return self._STATE_SEVERITY[state]

    async def evaluate_score_and_transition_if_needed(
        self, score: AffectScore
    ) -> str | None:
        """Returns alert_id UUID if transition occurred, None otherwise."""
        async with self._transition_lock:
            arousal_dev = abs(score.arousal - self._baseline_arousal) / 100
            valence_dev = abs(score.valence - self._baseline_valence) / 100
            max_dev = max(arousal_dev, valence_dev)
            if max_dev >= INTERVENTION_THRESHOLD:
                target = "ACTIVE_INTERVENTION"
            elif max_dev >= ANOMALY_THRESHOLD:
                target = "ANOMALY_FLAGGED"
            else:
                return None  # No transition
            # Latch upward only — never auto-downgrade
            if self._severity(target) <= self._severity(self._current_state):
                return None
            alert_id = str(uuid.uuid4())
            self._current_state = target
            await self._write_to_firestore(score, alert_id)
            return alert_id
```
The latching rule is the most important safety constraint in the system. ACTIVE_INTERVENTION never auto-resets. A psychological crisis that resolves in 30 seconds still happened — auto-reset would mask the event from ground control. Only a credentialed ground controller issuing POST /api/reset-session clears the state.
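The transition logic above can be distilled into a pure function to show the latch in isolation. Note the threshold values here are illustrative: the post only names the constants `ANOMALY_THRESHOLD` and `INTERVENTION_THRESHOLD` without giving their values.

```python
SEVERITY = {"BASELINE_MONITORING": 0, "ANOMALY_FLAGGED": 1, "ACTIVE_INTERVENTION": 2}
ANOMALY_THRESHOLD = 0.25       # illustrative value only
INTERVENTION_THRESHOLD = 0.50  # illustrative value only

def evaluate(current: str, arousal: float, valence: float,
             base_arousal: float, base_valence: float) -> str:
    """Return the next protocol state given a new affect score."""
    max_dev = max(abs(arousal - base_arousal), abs(valence - base_valence)) / 100
    if max_dev >= INTERVENTION_THRESHOLD:
        target = "ACTIVE_INTERVENTION"
    elif max_dev >= ANOMALY_THRESHOLD:
        target = "ANOMALY_FLAGGED"
    else:
        return current
    # Latch upward only: a calmer score never downgrades the state.
    return target if SEVERITY[target] > SEVERITY[current] else current

# Large deviation from baseline escalates:
print(evaluate("BASELINE_MONITORING", 90.0, 50.0, 30.0, 50.0))  # → ACTIVE_INTERVENTION
# A return to baseline does NOT de-escalate:
print(evaluate("ACTIVE_INTERVENTION", 30.0, 50.0, 30.0, 50.0))  # → ACTIVE_INTERVENTION
```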
The Killer Feature: Mid-Session Hint Injection
When the state machine transitions, MAITRI needs Gemini to change its behavior — immediately, without interrupting the conversation, without restarting the session.
ADK's LiveRequestQueue exposes exactly this capability:
```python
async def inject_system_context_hint_into_live_session(
    live_request_queue: LiveRequestQueue,
    system_context_message: str,
) -> None:
    hint = f"[SYSTEM_CONTEXT: {system_context_message}]"
    text_blob = genai_types.Blob(
        mime_type="text/plain",
        data=hint.encode("utf-8"),
    )
    live_request_queue.send_realtime(text_blob)
```
And the hint builders, one per state transition:
```python
def build_anomaly_flagged_context_hint(score: AffectScore) -> str:
    return (
        f"SYSTEM_CONTEXT: ANOMALY_FLAGGED — "
        f"arousal={score.arousal:.1f}, valence={score.valence:.1f}. "
        f"Shift to grounding mode. Use structured breathing or "
        f"cognitive reframing. Do not alarm. Do not diagnose."
    )

def build_active_intervention_context_hint(score: AffectScore) -> str:
    return (
        f"SYSTEM_CONTEXT: ACTIVE_INTERVENTION — "
        f"arousal={score.arousal:.1f}, valence={score.valence:.1f}. "
        f"Deliver one grounding sentence. Then fall silent. "
        f"Ground control has been alerted."
    )
```
No session restart. No silence. The conversation continues — Gemini simply becomes more focused. The astronaut never experiences a gap.
This pattern — continuous presence with in-flight context adjustment — is what separates Gemini Live from every other API in this space.
The Alert Architecture: One Event, Every Channel
When ACTIVE_INTERVENTION triggers, three things happen simultaneously from a single EventDispatcher:
```python
async def handle_affect_score(self, affect_score: AffectScore) -> None:
    # 1. Always: publish affect overlay to Android
    await self._publish_affect_update(affect_score, timestamp_utc)
    # 2. Evaluate state machine
    alert_id = await self._state_machine.evaluate_score_and_transition_if_needed(
        affect_score
    )
    if alert_id is None:
        return
    new_state = self._state_machine.current_state
    # 3. State change → DataChannel to Android (reliable=True)
    await self._publish_state_change(affect_score, new_state, alert_id, timestamp_utc)
    # 4. Inject hint into live Gemini session
    await self._inject_protocol_hint(new_state, affect_score)
    # 5. ACTIVE_INTERVENTION only: critical alert
    if new_state == "ACTIVE_INTERVENTION":
        await self._publish_critical_alert(affect_score, alert_id, timestamp_utc)
        dispatch_critical_alert_to_pubsub(payload)  # fire-and-forget
```
The same alert_id UUID flows through every channel — DataChannel to Android, Pub/Sub to ground systems, Firestore to the SSE stream. The Svelte dashboard deduplicates on alert_id. One event. Every channel. Zero duplication regardless of which arrives first.
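The deduplication the dashboard performs is small enough to sketch. The actual dashboard is Svelte; this Python version is an illustration of the logic only, with `AlertDeduplicator` being a hypothetical name.

```python
class AlertDeduplicator:
    """Sketch of alert_id dedup: the same alert may arrive via
    DataChannel, Pub/Sub, and Firestore/SSE in any order — only
    the first copy is surfaced to the operator."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def accept(self, alert_id: str) -> bool:
        if alert_id in self._seen:
            return False  # duplicate from another channel — suppress
        self._seen.add(alert_id)
        return True

dedup = AlertDeduplicator()
print(dedup.accept("a1b2-c3d4"))  # → True  (first arrival wins)
print(dedup.accept("a1b2-c3d4"))  # → False (same event, second channel)
```

Because every channel carries the identical UUID, dedup is a set-membership check rather than any kind of payload comparison.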
Why Cloud Pub/Sub and not a direct HTTP call to the dashboard?
Cloud Pub/Sub decouples the critical alert from the audio pipeline. A direct HTTP call would block the asyncio event loop for 200ms+ — causing audible stuttering in Gemini's voice on the astronaut's device at the exact moment they need it most. Fire-and-forget via asyncio.create_task(asyncio.to_thread(...)) keeps the audio pipeline unblocked.
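The fire-and-forget pattern is easy to verify in isolation. This is a self-contained sketch (not MAITRI code) in which `time.sleep(0.2)` stands in for a blocking Pub/Sub publish; the point is that the event loop regains control in microseconds, not 200 ms.

```python
import asyncio
import time

def slow_publish(payload: dict) -> None:
    """Stand-in for a blocking publish call taking ~200 ms."""
    time.sleep(0.2)

async def main() -> float:
    start = time.monotonic()
    # Fire-and-forget: the blocking call runs on a worker thread,
    # so the event loop (and the audio pipeline) never stalls.
    task = asyncio.create_task(
        asyncio.to_thread(slow_publish, {"alert_id": "demo"})
    )
    elapsed = time.monotonic() - start  # measured before the publish finishes
    await task  # awaited here only so the demo exits cleanly
    return elapsed

elapsed = asyncio.run(main())
print(f"loop blocked for {elapsed:.4f}s")  # far below the 0.2s publish time
```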
The Ground Dashboard: Firestore → SSE
The ground control dashboard at maitri-astronaut.web.app receives events via a Server-Sent Events stream that merges two sources:
```python
@router.get("/api/session-status")
async def session_status_sse(request: Request) -> StreamingResponse:
    snapshot_queue: asyncio.Queue = asyncio.Queue()
    loop = asyncio.get_running_loop()

    # Firestore onSnapshot — requires sync client
    sync_db = get_firestore_sync_client()
    doc_ref = sync_db.collection("sessions").document("GAGANYAAN_01")

    def _on_snapshot(snapshots, changes, read_time):
        # Runs in Firestore SDK thread — thread-safe handoff to asyncio
        for snapshot in snapshots:
            event = _build_state_change_sse_dict(snapshot, str(uuid.uuid4()))
            loop.call_soon_threadsafe(snapshot_queue.put_nowait, event)

    unsubscribe = doc_ref.on_snapshot(_on_snapshot)

    # Merge Firestore snapshots + affect telemetry queue
    async def generator():
        try:
            async for chunk in _sse_event_generator(request, snapshot_queue):
                yield chunk
        finally:
            unsubscribe()  # Clean teardown on client disconnect

    return StreamingResponse(generator(), media_type="text/event-stream")
```
onSnapshot propagates every Firestore write to the dashboard in under 100ms. The dashboard never polls. State changes appear the moment they happen — protocol transitions, affect score updates, intervention triggers — all driven by Firestore's real-time listener without a separate WebSocket server.
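For reference, the wire format those queued events end up in is plain SSE framing. A minimal sketch (the function name `format_sse_event` is illustrative, not taken from the repository):

```python
import json

def format_sse_event(event: dict, event_name: str = "state_change") -> str:
    """Illustrative Server-Sent Events framing: an `event:` line,
    a `data:` line carrying the JSON payload, and a blank line
    terminating the frame."""
    return f"event: {event_name}\ndata: {json.dumps(event)}\n\n"

frame = format_sse_event({"state": "ANOMALY_FLAGGED", "alert_id": "a1b2"})
print(frame)
```

On the browser side, `EventSource` dispatches each such frame as a named event, which is all the Svelte dashboard needs to update in place.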
Eight Google Cloud Services, Each Earning Its Place
| Service | Why This, Not an Alternative |
|---|---|
| Gemini Live API | Sub-300ms presence — no other API achieves this |
| Vertex AI — Gemini Flash | Structured affect scoring in one prompt — no custom ML model required |
| Google ADK | `LiveRequestQueue` hint injection — no equivalent in third-party frameworks |
| Cloud Run | Single-process: FastAPI + worker share memory, no IPC |
| Firestore | `onSnapshot` drives dashboard without polling — latency under 100ms |
| Cloud Pub/Sub | Decouples critical alert from audio pipeline — no event loop blocking |
| Cloud Storage | Single JSON blob per session — directly ingestible by BigQuery |
| Cloud Monitoring | Custom `affect_arousal` + `affect_valence` gauges every 5 seconds |
8 Google Cloud services. Each genuinely wired. None aspirational.
Automated Deployment: Cloud Build + GitHub Actions
Every commit to main ships to production:
```yaml
# backend/cloudbuild.yaml — three steps
steps:
  - name: gcr.io/cloud-builders/docker
    args: [build, --tag=gcr.io/$PROJECT_ID/maitri:$SHORT_SHA, .]
  - name: gcr.io/cloud-builders/docker
    args: [push, gcr.io/$PROJECT_ID/maitri:$SHORT_SHA]
  - name: gcr.io/google.com/cloudsdktool/cloud-sdk
    entrypoint: gcloud
    args:
      - run
      - deploy
      - maitri
      - --image=gcr.io/$PROJECT_ID/maitri:$SHORT_SHA
      - --region=us-east4
      - --min-instances=1  # Always warm
      - --memory=2Gi       # Audio + video parallel pipelines
      - --concurrency=1    # One WebRTC session per instance
      - --timeout=3600     # Full session support
```
The Svelte dashboard deploys to Firebase Hosting via GitHub Actions on every merge. Backend and dashboard ship independently — a frontend change never risks a backend restart.
git push → Cloud Build → Docker build → Artifact Registry → Cloud Run
git push → GitHub Actions → npm run build → Firebase Hosting
Automated deployment on every commit via Cloud Build and GitHub Actions.
Responsible AI by Design
MAITRI is a first-responder signal layer. It is not a therapeutic agent.
The ACTIVE_INTERVENTION state has exactly one permitted output: a single grounding sentence, then silence, then a ground alert. No clinical advice. No diagnosis. No attempt to resolve the situation autonomously.
```python
def build_active_intervention_context_hint(score: AffectScore) -> str:
    return (
        "SYSTEM_CONTEXT: ACTIVE_INTERVENTION. "
        "Deliver one grounding sentence. Then fall silent. "
        "Ground control has been alerted. Do not continue the conversation."
    )
```
The constraint is not in the system prompt as a suggestion. It is enforced by the protocol state — when ACTIVE_INTERVENTION triggers, hint injection fires before the next Gemini response is generated. The model receives the constraint as its next context frame.
Responsible AI by design: MAITRI augments ground psychologists, never replaces them.
What Gemini Live Unlocks for the Next Generation of AI Applications
The patterns in MAITRI are not space-specific. The fundamental capability — a sub-300ms bidirectional stream that maintains continuous presence, accepts mid-session context injection, and processes voice and vision simultaneously — applies to any domain where AI needs to be with someone rather than responding to them.
Healthcare monitoring. Crisis intervention. High-isolation research deployments. Anywhere humans go where the latency of a REST API is the difference between presence and absence.
Gemini Live is not a faster way to build chatbots. It is the foundation for a different category of AI application — one where the model is continuously aware and the interaction is ambient rather than transactional.
MAITRI is one implementation of what that looks like. The architecture is open, the repository is public, and the deployment runs on Google Cloud infrastructure available to every developer reading this.
India's first astronauts deserve the best AI that exists today. MAITRI is built on the technology that makes it possible.
Live demo: maitri-astronaut.web.app
Repository: github.com/anil9973/maitri
Architecture: ARCHITECTURE.md
#GeminiLiveAgentChallenge #GoogleCloud #GeminiLive #MAITRI #Python #Android