Perill

Building ARCHITECT: Real-Time AI Interior Design with Gemini Live API + Google ADK

This post was created as an entry for the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge


What I Built

ARCHITECT is a real-time AI interior design assistant. You point your phone camera at any room, talk to the agent naturally, and it generates photorealistic redesigns — all in real-time, all through voice.

The core premise: what if you had a talented interior designer who could literally see your room, understand your style preferences from a conversation, and instantly show you a reimagined version? That's ARCHITECT.

GitHub: https://github.com/Antimatter543/architect

Why Gemini Live API Was the Right Choice

Most AI voice assistants are turn-based: you speak, you wait, it responds. Gemini's Live API is different — it's a persistent bidirectional stream where audio, video frames, and tool calls all flow simultaneously. This enabled an interaction pattern that wasn't possible before:

  • User walks through their living room while talking
  • Agent sees the room continuously (camera frames streamed at 1fps)
  • Agent calls analyze_room() to capture spatial data while still listening
  • User says "make it Japandi" mid-sentence
  • Agent immediately starts generating a redesign image while responding vocally

The single WebSocket carries everything: 16kHz PCM audio in, 24kHz PCM audio out, JPEG frames in, JSON events, and binary image payloads out. There's no "please hold while I process" — it's genuinely live.

The Technical Architecture

Backend: FastAPI + Google ADK

The agent is built with Google's ADK (LlmAgent) wrapping Gemini 2.0 Flash Live as the underlying model. ADK handles the agent loop; Gemini handles multimodal understanding and tool call orchestration.

Five FunctionTool instances hang off the agent:

@FunctionTool
async def analyze_room(description: str, style_tags: list[str]) -> dict:
    """Analyze the visible room and extract spatial/design data."""
    # Stores analysis to Firestore, namespaced by user_id
    ...

@FunctionTool  
async def generate_redesign(style: str, room_analysis_id: str) -> str:
    """Generate a photorealistic redesign using Imagen 3."""
    # Calls gemini-2.0-flash-exp-image-generation
    # Uploads to Cloud Storage, returns public URL
    ...

@FunctionTool
async def search_furniture(style: str, room_type: str) -> list[dict]:
    """Find matching furniture with prices from real retailers."""
    ...

ADK's docstring-based schema inference is underrated — you write a clear docstring and it generates the JSON schema for tool calling automatically. No manual tools array.

The WebSocket Protocol

The interesting architectural detail is the binary framing. Everything goes over one WebSocket:

[JSON header bytes] [0x00 null byte] [payload bytes]

For audio frames: header is {"type":"audio"}, payload is raw PCM.
For camera frames: header is {"type":"frame"}, payload is JPEG bytes.
For server-to-client audio: same protocol in reverse.

This lets the frontend handle audio, video, and events all in one onmessage handler without multiplexing connections.
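As a sketch of how that framing can be produced and parsed (the helper names here are hypothetical, not ARCHITECT's actual code), note that a JSON header can never contain a literal null byte, so the first `0x00` unambiguously marks the header/payload boundary:

```python
import json

def encode_message(header: dict, payload: bytes) -> bytes:
    """Frame a message as [JSON header bytes][0x00][payload bytes]."""
    header_bytes = json.dumps(header).encode("utf-8")
    return header_bytes + b"\x00" + payload

def decode_message(data: bytes) -> tuple[dict, bytes]:
    """Split a frame at the first null byte and parse the header."""
    sep = data.index(0)  # first 0x00 — JSON text never contains one
    header = json.loads(data[:sep].decode("utf-8"))
    return header, data[sep + 1:]

# Round-trip a camera frame
msg = encode_message({"type": "frame"}, b"\xff\xd8...jpeg bytes")
header, payload = decode_message(msg)
```

The same two functions cover every message type, which is what keeps the frontend's single `onmessage` handler simple.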

Frontend: React + AudioWorklets

The audio pipeline was the most technically demanding piece. The browser captures microphone audio at 48kHz; Gemini expects 16kHz PCM. The playback side does the reverse: 24kHz → 48kHz.

Both conversions run in AudioWorklets — dedicated audio threads that don't block the main thread. This keeps the UI responsive while audio streams continuously.

Audio Pipeline: Mic 48kHz → CaptureWorklet 3:1 downsample → 16kHz PCM → WebSocket → Gemini → 24kHz PCM → PlaybackWorklet → Speaker 48kHz

// Capture worklet: 48kHz → 16kHz downsampling
class CaptureProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const input = inputs[0][0]; // 128 samples per quantum at 48kHz
    // Downsample 3:1 by averaging each group of three samples.
    // (128 isn't divisible by 3, so 2 samples are dropped per quantum;
    //  a production version would carry the remainder into the next block.)
    const downsampled = new Float32Array(Math.floor(input.length / 3));
    for (let i = 0; i < downsampled.length; i++) {
      downsampled[i] = (input[i*3] + input[i*3+1] + input[i*3+2]) / 3;
    }
    this.port.postMessage(downsampled);
    return true;
  }
}
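The playback side does the inverse at a clean 1:2 ratio. The actual worklet is JavaScript; here is a plain-Python illustration (hypothetical helper, not the real frontend code) of linear-interpolation upsampling from 24kHz to 48kHz:

```python
def upsample_2x(samples: list[float]) -> list[float]:
    """Upsample 24kHz -> 48kHz by inserting a linearly interpolated
    sample between each pair of input samples."""
    out = []
    for i, s in enumerate(samples):
        out.append(s)
        # Midpoint between this sample and the next
        # (the final sample is simply repeated).
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.append((s + nxt) / 2)
    return out

# 4 input samples -> 8 output samples
print(upsample_2x([0.0, 1.0, 0.0, -1.0]))
# → [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -1.0]
```

Linear interpolation is audibly adequate for speech playback; a production resampler would typically use a windowed-sinc filter instead.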

Google Cloud Services Used

  • Cloud Run — hosts the FastAPI backend with session affinity (essential for persistent WebSocket connections — without --session-affinity, load balancing breaks them)
  • Cloud Storage — stores Imagen 3-generated redesign images, served via public URLs
  • Firestore — persists room analyses, design history, and shopping lists per user session
  • Cloud Build — automated deployment pipeline in deploy/cloudbuild.yaml
  • Secret Manager — stores API keys and Auth0 credentials, injected into Cloud Run at deploy time

The deployment is fully automated — one gcloud builds submit command builds the Docker image, pushes it to Container Registry, deploys to Cloud Run with all secrets wired in, and builds + deploys the React frontend to a Cloud Storage static site.

What I Learned

Gemini Live API's multimodal simultaneity is genuinely new. Most voice APIs handle audio only. Most vision APIs are stateless image uploads. Gemini Live lets you send audio and video frames and receive audio and trigger tool calls all in the same session. The design space this opens up is significant.

ADK's FunctionTool pattern is clean. The docstring → JSON schema inference means your tool documentation is your tool definition. There's no separate schema to maintain.

Cloud Run session affinity is not optional for WebSocket apps. The first deployment worked fine locally but broke in production, because Cloud Run was load balancing across instances mid-session. The --session-affinity flag fixed it — but it's buried in the docs.

AudioWorklet precision matters for speech recognition. Naive downsampling (keeping every Nth sample) introduced aliasing artifacts that degraded Gemini's speech recognition. Averaging each group of three input samples into one output sample — effectively a crude low-pass filter applied before the 3:1 decimation — made a noticeable difference.
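The difference is easy to see in a plain-Python sketch (illustrative only, not the worklet code). A tone whose period matches the decimation ratio aliases straight down to DC under naive decimation, while averaging cancels it:

```python
def decimate_naive(samples: list[float], ratio: int = 3) -> list[float]:
    """Keep every Nth sample — energy above the new Nyquist
    frequency folds back into the output as aliasing."""
    return samples[::ratio]

def decimate_averaged(samples: list[float], ratio: int = 3) -> list[float]:
    """Average each group of N samples first — a crude low-pass
    filter that suppresses what the naive version lets through."""
    n = len(samples) // ratio
    return [sum(samples[i * ratio:(i + 1) * ratio]) / ratio for i in range(n)]

# A period-3 tone: naive decimation aliases it to a constant DC
# offset; averaging cancels it entirely.
hf = [1.0, -1.0, 0.0, 1.0, -1.0, 0.0]
print(decimate_naive(hf))     # → [1.0, 1.0]
print(decimate_averaged(hf))  # → [0.0, 0.0]
```

The averaged version is exactly what the CaptureProcessor above does per group of three samples.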

Architecture Diagram

System Architecture: Browser React hooks → WebSocket → FastAPI Cloud Run → ADK LlmAgent → Gemini Live → Firestore + Cloud Storage

WebSocket Auth + Data Flow

WebSocket Auth + Data Flow sequence: Frontend connects, sends JWT, backend verifies, live session with audio/video/tool calls

Running It

# Backend
cd backend
cp .env.example .env  # fill in GOOGLE_API_KEY, etc.
uvicorn app.main:app --reload --port 8080

# Frontend  
cd frontend
cp .env.example .env  # fill in VITE_AUTH0_* vars
npm run dev

Cloud deployment:

# One-time setup
bash deploy/setup.sh

# Deploy
gcloud builds submit --config=deploy/cloudbuild.yaml \
  --substitutions=_AUTH0_CLIENT_ID=your-client-id .

ARCHITECT is a submission for the Gemini Live Agent Challenge — building agents that truly see, hear, and create in real-time. The full source is at https://github.com/Antimatter543/architect.

#GeminiLiveAgentChallenge
