This post was created for the purposes of entering the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge
What I Built
ARCHITECT is a real-time AI interior design assistant. You point your phone camera at any room, talk to the agent naturally, and it generates photorealistic redesigns — all in real-time, all through voice.
The core premise: what if you had a talented interior designer who could literally see your room, understand your style preferences from a conversation, and instantly show you a reimagined version? That's ARCHITECT.
GitHub: https://github.com/Antimatter543/architect
Why Gemini Live API Was the Right Choice
Most AI voice assistants are turn-based: you speak, you wait, it responds. Gemini's Live API is different — it's a persistent bidirectional stream where audio, video frames, and tool calls all flow simultaneously. This enabled an interaction pattern that wasn't possible before:
- User walks through their living room while talking
- Agent sees the room continuously (camera frames streamed at 1 fps)
- Agent calls analyze_room() to capture spatial data while still listening
- User says "make it Japandi" mid-sentence
- Agent immediately starts generating a redesign image while responding vocally
The single WebSocket carries everything: 16kHz PCM audio in, 24kHz PCM audio out, JPEG frames in, JSON events, and binary image payloads out. There's no "please hold while I process" — it's genuinely live.
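For concreteness, the audio going over the wire is signed 16-bit little-endian PCM. A minimal Python sketch of the float-to-PCM step (the real conversion happens in the frontend worklet; the function name here is illustrative, not from the repo):

```python
import struct

def float32_to_pcm16(samples: list[float]) -> bytes:
    """Convert [-1.0, 1.0] float samples to 16-bit little-endian PCM bytes."""
    out = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))               # clamp to the valid range
        out += struct.pack("<h", int(s * 32767))  # scale to signed 16-bit
    return bytes(out)
```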
The Technical Architecture
Backend: FastAPI + Google ADK
The agent is built with Google's ADK (LlmAgent) wrapping Gemini 2.0 Flash Live as the underlying model. ADK handles the agent loop; Gemini handles multimodal understanding and tool call orchestration.
Five FunctionTool instances hang off the agent:
@FunctionTool
async def analyze_room(description: str, style_tags: list[str]) -> dict:
"""Analyze the visible room and extract spatial/design data."""
# Stores analysis to Firestore, namespaced by user_id
...
@FunctionTool
async def generate_redesign(style: str, room_analysis_id: str) -> str:
"""Generate a photorealistic redesign using Imagen 3."""
# Calls gemini-2.0-flash-exp-image-generation
# Uploads to Cloud Storage, returns public URL
...
@FunctionTool
async def search_furniture(style: str, room_type: str) -> list[dict]:
"""Find matching furniture with prices from real retailers."""
...
ADK's docstring-based schema inference is underrated — you write a clear docstring and it generates the JSON schema for tool calling automatically. No manual tools array.
The WebSocket Protocol
The interesting architectural detail is the binary framing. Everything goes over one WebSocket:
[JSON header bytes] [0x00 null byte] [payload bytes]
For audio frames: header is {"type":"audio"}, payload is raw PCM.
For camera frames: header is {"type":"frame"}, payload is JPEG bytes.
For server-to-client audio: same protocol in reverse.
This lets the frontend handle audio, video, and events all in one onmessage handler without multiplexing connections.
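The framing is simple enough to express in a few lines. A Python sketch of the encode/decode step (function names are mine, not the repo's). Note that a JSON encoder escapes control characters, so the header can never contain a raw null byte, which makes the first 0x00 an unambiguous delimiter:

```python
import json

def encode_frame(header: dict, payload: bytes) -> bytes:
    # [JSON header bytes] [0x00 null byte] [payload bytes]
    return json.dumps(header).encode("utf-8") + b"\x00" + payload

def decode_frame(message: bytes) -> tuple[dict, bytes]:
    # The first null byte terminates the JSON header; the rest is payload.
    sep = message.index(0)
    header = json.loads(message[:sep].decode("utf-8"))
    return header, message[sep + 1:]
```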
Frontend: React + AudioWorklets
The audio pipeline was the most technically demanding piece. The browser captures microphone audio at 48kHz; Gemini expects 16kHz PCM. The playback side does the reverse: 24kHz → 48kHz.
Both conversions run in AudioWorklets — dedicated audio threads that don't block the main thread. This keeps the UI responsive while audio streams continuously.
// Capture worklet: 48kHz → 16kHz downsampling
class CaptureProcessor extends AudioWorkletProcessor {
process(inputs) {
const input = inputs[0][0]; // 128 samples at 48kHz
if (!input) return true; // keep the processor alive when no input is connected yet
// Downsample 3:1 with simple averaging
const downsampled = new Float32Array(Math.floor(input.length / 3));
for (let i = 0; i < downsampled.length; i++) {
downsampled[i] = (input[i*3] + input[i*3+1] + input[i*3+2]) / 3;
}
this.port.postMessage(downsampled);
return true;
}
}
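The playback side's 24kHz → 48kHz upsampling is the mirror image. A Python sketch of the idea (the real version runs in a JS worklet; linear interpolation here stands in as one reasonable choice, not necessarily what the repo does):

```python
def upsample_2x_linear(samples: list[float]) -> list[float]:
    """24 kHz -> 48 kHz: insert a linearly interpolated sample between neighbors."""
    out = []
    for i in range(len(samples) - 1):
        out.append(samples[i])
        out.append((samples[i] + samples[i + 1]) / 2)  # midpoint between neighbors
    if samples:
        out.append(samples[-1])
        out.append(samples[-1])  # hold the last sample to keep an exact 2:1 length
    return out
```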
Google Cloud Services Used
- Cloud Run — hosts the FastAPI backend with session affinity (essential for persistent WebSocket connections — without --session-affinity, load balancing breaks them)
- Cloud Storage — stores Imagen 3-generated redesign images, served via public URLs
- Firestore — persists room analyses, design history, and shopping lists per user session
- Cloud Build — automated deployment pipeline in deploy/cloudbuild.yaml
- Secret Manager — stores API keys and Auth0 credentials, injected into Cloud Run at deploy time
The deployment is fully automated — one gcloud builds submit command builds the Docker image, pushes it to Container Registry, deploys to Cloud Run with all secrets wired in, and builds + deploys the React frontend to a Cloud Storage static site.
What I Learned
Gemini Live API's multimodal simultaneity is genuinely new. Most voice APIs handle audio only. Most vision APIs are stateless image uploads. Gemini Live lets you send audio and video frames and receive audio and trigger tool calls all in the same session. The design space this opens up is significant.
ADK's FunctionTool pattern is clean. The docstring → JSON schema inference means your tool documentation is your tool definition. There's no separate schema to maintain.
Cloud Run session affinity is not optional for WebSocket apps. First deployment worked fine locally, broke in production because Cloud Run was load balancing across instances mid-session. --session-affinity flag fixed it — but it's buried in the docs.
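For reference, the flag goes on the deploy command itself (service, image, and region names below are placeholders, not the repo's actual values):

```shell
gcloud run deploy architect-backend \
  --image gcr.io/PROJECT_ID/architect-backend \
  --region us-central1 \
  --session-affinity
```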
AudioWorklet precision matters for speech recognition. Naive downsampling (taking every Nth sample) introduced aliasing artifacts that degraded Gemini's speech recognition. Averaging three input samples per output sample, which acts as a crude low-pass filter ahead of the 3:1 decimation, made a noticeable difference.
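The difference between the two approaches, sketched in Python (the shipped code is the JS worklet above; a production resampler would use a proper low-pass filter, so the averaging here is the same crude-but-effective shortcut):

```python
def downsample_naive(samples: list[float]) -> list[float]:
    """Take every 3rd sample: cheap, but frequencies above the new
    Nyquist limit fold back into the signal as aliasing artifacts."""
    return samples[::3]

def downsample_avg(samples: list[float]) -> list[float]:
    """Average each group of 3 samples: a crude low-pass filter
    applied before the 3:1 decimation."""
    n = len(samples) // 3
    return [sum(samples[3 * i:3 * i + 3]) / 3 for i in range(n)]
```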
Architecture Diagram
WebSocket Auth + Data Flow
Running It
# Backend
cd backend
cp .env.example .env # fill in GOOGLE_API_KEY, etc.
uvicorn app.main:app --reload --port 8080
# Frontend
cd frontend
cp .env.example .env # fill in VITE_AUTH0_* vars
npm run dev
Cloud deployment:
# One-time setup
bash deploy/setup.sh
# Deploy
gcloud builds submit --config=deploy/cloudbuild.yaml \
--substitutions=_AUTH0_CLIENT_ID=your-client-id .
ARCHITECT is a submission for the Gemini Live Agent Challenge — building agents that truly see, hear, and create in real-time. The full source is at https://github.com/Antimatter543/architect.
#GeminiLiveAgentChallenge