I created this blog post to enter the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge
The Problem
We forget things. All the time. Not in some big philosophical way. In the most basic, embarrassing way. What did you eat yesterday? What was that person's name at the conference? What did your manager say two Fridays back?
I kept seeing the same thing on Reddit, in productivity forums, and in real conversations: "I take in so much information every day and I remember almost none of it." We take notes we never look at again. We bookmark articles we never reopen. We record meetings we never re-watch.
The problem isn't capture. We have more capture tools than ever. The problem is retrieval. Our memories are flat, unsearchable, disconnected from each other.
Then I Found Gemini Live
I started playing with the Gemini Live 2.5 Flash API. And something clicked.
A model that holds a persistent, real-time audio session. You can interrupt it mid-sentence and it recovers. It calls tools asynchronously without breaking the conversation. It matches your tone, your energy, your pace through affective dialogue. It accepts screen sharing and video alongside voice. And it's fast. Really, genuinely fast.
I realized I could build something that doesn't just capture memories. It could be your memory. A system that listens alongside you, pulls out what matters on its own, organizes it in space, and lets you walk through it and talk to it. No keyboard. No typing prompts.
That system is Rayan - Your 3D Memory Palace.
What Rayan Actually Is
Rayan turns everything you hear, see, and say into a 3D Memory Palace you can explore. It's not a notes app. It's not a chatbot with a search bar. It's a fully rendered Three.js environment you navigate in first person. Rooms, walls, glowing objects, doors. Every object on every wall is a memory that Rayan extracted, categorized, and placed there for you, in real time, while you were just going about your day.
Two persistent Gemini Live voice agents run the whole thing.
CaptureAgent listens alongside you. Run it during a lecture, a meeting, a podcast, while browsing the web with screen share. It passively analyzes what it hears and sees. When it detects a concept worth keeping (confidence >= 0.7), it silently extracts it. It generates a title, summary, keywords, classifies the type, creates an embedding, and hands it to the Memory Architect for room placement. A new 3D artifact shows up on your palace wall in real time. You don't press anything. You don't type anything. You just live, and the palace builds itself.
RecallAgent is your voice companion inside the palace. Walk up to any room, any artifact, and just ask. "What did I learn about transformers last week?" or "Quiz me on everything in my Biology room." It searches your memories semantically, grounds every answer in what you've actually captured (it literally cannot hallucinate things not in your palace), and speaks back to you. It navigates rooms, highlights artifacts, pulls up related memories as it talks. It can create new memories, edit existing ones, generate visual mind maps, do web searches. All by voice, mid-conversation.
The palace isn't a metaphor. It's a real 3D space you walk through.
Why a Memory Palace
The method of loci is one of the oldest memory techniques in existence. The idea is simple. Place the things you want to remember in specific locations within a familiar space, then mentally walk through that space to retrieve them. The spatial context ("this fact was on the north wall of the library, next to the crystal orb about neural networks") gives you retrieval cues that flat lists and folders never will.
Rayan makes that literal. Your memories aren't rows in a database. They're glowing hologram panels, floating books, crystal orbs, speech bubbles, and framed screenshots spread across themed 3D rooms that you navigate through. The spatial encoding isn't a gimmick. It's the core retrieval mechanism, backed by voice and semantic search.
Two Modes
Capture. Your Always-On Memory Companion.
When you start a Capture session, Rayan opens a persistent Gemini Live connection (gemini-live-2.5-flash-native-audio) and starts listening. You get three input approaches.
Voice only. Just talk. Rayan listens to your microphone, pulls out concepts from your speech, and builds your palace as you speak.
Screen share. Share your screen and Rayan becomes your study partner. It sees your slides, your browser tabs, your documents. It extracts key information from both what it sees and what you say about it. When it spots a good diagram or slide, it autonomously calls take_screenshot, uploads the image to Cloud Storage, and places it as a framed visual artifact on your palace wall.
Camera. Point your webcam at a whiteboard, a textbook, your environment. Rayan processes the video frames alongside your voice.
You control how aggressively it captures. Want it taking notes every few seconds? Adjust the cadence. Want it to listen longer before synthesizing multiple memories on the same topic? You can do that too.
As Capture runs, new artifacts appear in your 3D palace in real time. No page refresh, no manual save. The WebSocket connection pushes palace_update events the instant an artifact is created, and the Three.js scene renders it live.
Every extraction goes through smart deduplication. New captures are cosine-compared against everything saved in the current session. Near-duplicates (similarity >= 0.90) are merged, not duplicated. The palace stays clean.
Recall. Your Voice-Navigable Second Brain.
In Recall mode, a second Gemini Live agent connects and becomes your conversational guide through the palace. You speak naturally. No wake words, no rigid commands. It understands context, follows up on previous statements, handles interruptions gracefully through Gemini Live's built-in VAD and the interrupted server event, and executes tools mid-conversation.
Here's what Recall can do, all by voice.
Navigate rooms. "Take me to my Machine Learning room" triggers navigate_to_room, and your camera flies there.
Highlight artifacts. "Show me what I captured about attention mechanisms" triggers highlight_artifact, and the relevant 3D object scales up and glows.
Answer questions. Every answer is grounded in your actual memories via semantic search. Rayan cannot make up information that isn't in your palace.
Create new memories. "Remember that the deadline is March 20th" creates a new artifact mid-conversation.
Edit memories. "Update my notes on the project, the scope changed to include mobile" modifies an existing artifact.
Synthesize rooms. "Synthesize this room" generates a creative AI mind map image that visually summarizes every memory in the current room, rendered directly on the 3D wall.
Web search. "Look up the latest paper on mixture of experts" runs a grounded web search and can save findings as enrichment artifacts.
Bird's-eye view. "Show me the map" toggles to an overview camera so you can see your whole palace layout.
The critical piece here is semantic grounding. Every time you enter a room, navigate to an artifact, or ask a question, the RecallAgent runs a real-time semantic search.
- Your query (or the current artifact's summary) is embedded via Vertex AI text-embedding-005 into a 768-dimensional vector.
- That vector is cosine-compared against every stored artifact embedding in Firestore.
- The top 8 most semantically relevant memories are injected into the live system prompt under a MEMORIES section.
- The system prompt enforces "ONLY use information from the provided MEMORIES section. NEVER hallucinate or invent information. Cite which artifact/room the information comes from."
- On every room navigation and artifact highlight, update_context() re-runs the search and injects fresh memories mid-conversation via send_client_content. No reconnection needed.
This isn't RAG tacked onto a chatbot. It's a voice-driven retrieval system where the grounding context updates continuously as you move through your palace.
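A minimal sketch of that retrieve-then-inject step, assuming artifacts are plain dicts with inline embeddings. Field names and formatting are illustrative, not the exact production schema.

```python
import numpy as np

def top_k_memories(query_emb, artifacts, k=8):
    """Rank stored artifacts by cosine similarity to the query embedding."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(artifacts, key=lambda a: cos(query_emb, a["embedding"]), reverse=True)
    return ranked[:k]

def build_memories_section(memories):
    """Format retrieved artifacts into the MEMORIES block injected into the prompt."""
    lines = ["MEMORIES:"]
    for m in memories:
        lines.append(f"- [{m['room']}] {m['title']}: {m['summary']}")
    return "\n".join(lines)
```

The resulting text block is what would be pushed into the live session on each navigation, so the agent always cites rooms and artifacts it can actually see.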
How People Actually Use This
Rayan isn't a productivity app you try once and forget. It's a persistent second brain you build over weeks and months.
Learning. Run Capture during any lecture or online course. Your palace auto-fills with searchable concepts as you listen. Read aloud or screen-share a textbook and Rayan clusters key ideas into rooms by topic. Walk your palace before a test and ask Recall to quiz you. Capture browser tabs and articles across a research session and let Recall surface connections between them. Capture vocabulary in context for language learning. Revisit your palace daily for spaced repetition that's more durable than flashcards because of the spatial encoding.
Work. Run Capture during meetings. Action items, decisions, and names get auto-extracted. Capture everything discussed during client onboarding so Recall knows the context as well as you do. Capture architecture discussions and technical decisions so Recall can answer "why did we do it this way?" months later. Build a room per direct report and Recall surfaces what you discussed last time before every one-on-one.
Creative work. Capture sources, quotes, and ideas, then Recall helps you cite and cross-reference while writing. Capture lore and character decisions for worldbuilding and Recall keeps your fictional world consistent. Capture every brainstorming idea and Recall finds the patterns across messy ideation.
Personal life. Capture travel recommendations and Recall answers "what was that restaurant someone mentioned?" Capture doctor conversations and health research. Capture birthdays, preferences, and conversations so you remember what matters to people.
Power features. Recall can save new memories during a voice conversation without leaving the palace. Everything persists across sessions, so your palace from six months ago is fully searchable today. Generate an AI mind map of any room on demand. Watch new 3D artifacts appear live as Capture runs.
The Technology Stack
Rayan is built entirely on Google's AI and cloud ecosystem. Let me walk through every layer.
Four Gemini Models, Four Roles
| Model | What it does |
|---|---|
| gemini-live-2.5-flash-native-audio | Real-time two-way voice streaming for CaptureAgent and RecallAgent |
| gemini-2.5-flash | Text generation for the Memory Architect (categorization), Narrator Agent (narration scripts), and general AI tasks |
| gemini-2.5-flash-image | Creative synthesis. Generates styled mind map images that visually summarize a room's memories |
| text-embedding-005 | 768-dimensional semantic embeddings for every artifact, powering cosine similarity search and grounding |
Three things made Gemini Live the only real option for this project.
Persistent sessions. Gemini Live holds a long-running WebSocket connection. The CaptureAgent session can run for an entire hour-long lecture without reconnecting. This isn't request-response. It's a stateful, living conversation.
Native tool calling. Both agents register tools (navigate, highlight, create artifact, screenshot, web search, etc.) that Gemini calls autonomously mid-conversation. The tools execute asynchronously. Gemini doesn't freeze while waiting for a tool result. It keeps talking and works the result in when it arrives.
Affective dialogue. enable_affective_dialog=True means Gemini adjusts its tone, pacing, and empathy based on your emotional cues. When you sound excited, Rayan matches that energy. When you're quietly focused, it stays subdued. This is the difference between a tool and a companion.
Firebase
| Service | What Rayan uses it for |
|---|---|
| Firebase Authentication | Google Sign-In on the frontend. One click, you get a Firebase ID token, verified server-side via Firebase Admin SDK on every WebSocket connection and REST request. No passwords, no custom auth flows. |
| Firebase Hosting | The React + Three.js frontend is built as a static SPA and deployed to Firebase Hosting. CDN distribution, SSL, SPA routing all handled. |
| Firebase Analytics | Frontend analytics tracking engagement and feature usage. |
Firebase was the natural choice for auth because it integrates with every other Google Cloud service Rayan uses. The ID token from Firebase Auth is the same identity that Firestore, Cloud Storage, and Cloud Run understand natively. One identity system across the whole stack.
Google Cloud Infrastructure
| Service | What Rayan uses it for |
|---|---|
| Cloud Firestore | Primary database. Every room, artifact, capture session, and user profile lives here. Firestore's document model maps perfectly to the palace hierarchy, users/{userId}/rooms/{roomId}/artifacts/{artifactId}. Embeddings (768-float arrays) are stored inline on each artifact document. No separate vector database needed at current scale. |
| Cloud Storage | Two buckets. The media bucket stores screenshots captured by CaptureAgent and mind map images generated by the synthesis service. The frontend bucket hosts the static SPA build. |
| Cloud Run | The FastAPI backend runs as a containerized service. Session affinity is enabled. This is critical because both agents maintain long-lived WebSocket connections to Gemini Live. If Cloud Run routed requests to different instances, those sessions would break. Session affinity ensures all traffic from a connected user sticks to the same container. Configured at 2 vCPU, 2 GiB memory, min 1 / max 10 instances. |
| Vertex AI | Powers text-embedding-005 for generating 768-dimensional embeddings. Also provides the client library for all Gemini API calls via the google-genai SDK with Vertex AI backend. |
| IAM and Service Accounts | A dedicated rayan-backend service account with exactly three roles. roles/datastore.user for Firestore, roles/storage.objectAdmin for Cloud Storage, roles/aiplatform.user for Vertex AI. Least privilege. |
Why Google Cloud Specifically
I'll be direct about this. Rayan could not have been built this cleanly on another cloud provider. The reason is integration density. Look at what happens when a user speaks a sentence during a Capture session.
- Audio arrives at Cloud Run via WebSocket.
- Cloud Run forwards it to the Gemini Live API (same Google network, minimal latency).
- Gemini detects a concept and calls the capture_concept tool.
- The backend generates an embedding via Vertex AI text-embedding-005 (same network).
- The artifact is written to Firestore (same network, same service account).
- If a screenshot is involved, it goes to Cloud Storage (same network, same service account).
- The Memory Architect categorizes it via gemini-2.5-flash (same network).
- The frontend, hosted on Firebase Hosting, receives the update over the same WebSocket.
Every hop is Google-to-Google. No cross-cloud latency, no credential translation, no API gateway stitching. The service account that Cloud Run uses is the same identity that Firestore, Storage, Vertex AI, and the Gemini API all trust. For a real-time voice agent where latency kills the experience, that coherence matters more than anything.
The Agent Architecture
CaptureAgent
This is the most complex piece of Rayan. It holds a persistent async with client.aio.live.connect() context that stays alive for the entire capture session, potentially 60+ minutes.
Initialization looks like this.
from google.genai.types import (
    LiveConnectConfig, PrebuiltVoiceConfig, SpeechConfig, VoiceConfig,
)

config = LiveConnectConfig(
    response_modalities=["AUDIO"],
    enable_affective_dialog=True,
    system_instruction=system_prompt,
    tools=CAPTURE_LIVE_TOOLS,
    speech_config=SpeechConfig(voice_config=VoiceConfig(
        prebuilt_voice_config=PrebuiltVoiceConfig(voice_name="Aoede")
    )),
)
Audio and video stream concurrently. The frontend sends two data channels over WebSocket. Microphone audio at 16kHz mono PCM in ~100ms chunks, captured via AudioWorklet. And video frames as JPEG-encoded screen captures or webcam frames. Both are forwarded to the Gemini Live session simultaneously.
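The chunk sizing falls out of the audio format: at 16 kHz mono with 16-bit samples, a 100 ms chunk is 16,000 × 2 × 0.1 = 3,200 bytes of raw PCM before base64 encoding. A small sketch of that arithmetic, with illustrative helper names:

```python
import base64

SAMPLE_RATE = 16_000   # Hz, mono
BYTES_PER_SAMPLE = 2   # 16-bit linear PCM
CHUNK_MS = 100

def chunk_size_bytes(ms: int = CHUNK_MS) -> int:
    """Raw PCM bytes in one chunk at 16 kHz mono, 16-bit."""
    return SAMPLE_RATE * BYTES_PER_SAMPLE * ms // 1000

def encode_chunk(pcm: bytes) -> str:
    """Base64-encode a PCM chunk for transport over the WebSocket."""
    return base64.b64encode(pcm).decode("ascii")
```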
Extraction is autonomous. The CaptureAgent doesn't wait to be told what to remember. Its system prompt instructs it to continuously analyze both streams. When it detects something worth capturing (confidence >= 0.7), it calls the capture_concept tool on its own with a title (max 8 words), summary (50-150 words), type classification (one of 20+ types like lecture, insight, moment, goal, emotion), keywords, and a confidence score.
Deduplication happens before saving. The new concept's embedding is cosine-compared against every artifact already captured this session. If similarity >= 0.90, it merges into the existing artifact instead of creating a duplicate.
Then the Memory Architect takes over. It decides where the artifact belongs based on cosine similarity against existing room embeddings. Similarity >= 0.75 means auto-assign to the best room, no confirmation needed. Between 0.50 and 0.75 means suggest a room match and show a confirmation prompt. Below 0.50 means suggest creating a new room entirely.
Registered tools for CaptureAgent include capture_concept, create_artifact, create_room, take_screenshot, edit_artifact, delete_artifact, web_search, navigate_to_room, and end_session.
RecallAgent
This is the conversational companion inside the 3D palace. It also holds a persistent Gemini Live session, but it's focused on retrieval and interaction rather than extraction.
The key innovation is the semantic context pipeline. The RecallAgent maintains a continuously updated context.
On session start, semantic search returns the top 8 most relevant memories across the entire palace. These get injected into the system prompt. On room navigation, when you or the agent navigate to a new room, update_context() re-runs semantic search scoped to that room and injects fresh memories via send_client_content(). Mid-conversation, no reconnection. On artifact highlight, when discussing a specific artifact, its summary becomes a search query to find related memories.
So the RecallAgent's knowledge of your palace is always fresh and contextually relevant. It doesn't go stale after session start.
Interruption handling feels natural. Gemini Live's built-in Voice Activity Detection catches when you start speaking while Rayan is still talking. The interrupted server event fires and the RecallAgent gracefully stops. It feels like a real conversation, not turn-based Q&A.
Registered tools include navigate_to_room, navigate_to_map_view, navigate_horizontal, highlight_artifact, create_artifact, edit_artifact, delete_artifact, create_room, synthesize_room, web_search, end_session, and close_artifact.
Narrator Agent
When you click an artifact in the 3D palace outside of a live voice session, the Narrator Agent activates. It loads the artifact from Firestore, finds the top 5 related artifacts via semantic search, then generates a narration script via gemini-2.5-flash. The script has a specific structure. An opening of about 5 seconds that says something like "This is from your machine learning study session..." Then 20-30 seconds of core content synthesized in conversational language. Then 5-10 seconds of connections to related memories. Then a 5-second invitation to explore further.
If the narration contains a diagram trigger ([DIAGRAM: type|title|description]), it generates visual diagrams too. Then it synthesizes the text into voice audio via Gemini Live and returns everything. Audio, text, diagrams, and related artifact links.
Memory Architect
This handles categorization. It uses gemini-2.5-flash (text, not live) to decide where each captured concept belongs.
The algorithm embeds the concept's title and summary via text-embedding-005, computes cosine similarity against every existing room's topicEmbedding, and applies thresholds. If creating a new room, it infers the name from concept keywords and assigns a random style from 10 options. Library, lab, gallery, garden, workshop, museum, observatory, sanctuary, studio, dojo.
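Those thresholds reduce to a small decision function. This is a hedged sketch with made-up field names, not the Memory Architect's actual code; it assumes the 0.75 / 0.50 cutoffs described above.

```python
import numpy as np

AUTO_ASSIGN = 0.75  # at or above: place in the best-matching room, no confirmation
SUGGEST = 0.50      # between SUGGEST and AUTO_ASSIGN: ask the user to confirm

def place_concept(concept_emb, rooms):
    """Decide placement: auto-assign, suggest with confirmation, or new room."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    if not rooms:
        return ("new_room", None)
    best = max(rooms, key=lambda r: cos(concept_emb, r["topicEmbedding"]))
    score = cos(concept_emb, best["topicEmbedding"])
    if score >= AUTO_ASSIGN:
        return ("auto_assign", best["id"])
    if score >= SUGGEST:
        return ("suggest", best["id"])
    return ("new_room", None)
```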
The result is that your palace structures itself. You never manually create rooms or drag things into folders. The architecture emerges from the semantic relationships in your knowledge.
The 3D Palace
The 3D palace isn't a visualization layer on top of a database. It's the primary interface.
Scene Architecture
Built with React Three Fiber (React renderer for Three.js) and @react-three/drei utilities.
Canvas (full-screen, dark background #060614)
├── Environment (HDRI ambient lighting)
├── PalaceGround (500x500 tiled plane)
├── Lobby (central hub at world origin, 12x12 units)
│ └── Doors (one per room, cycling through 4 walls)
├── Room (8x8x4 default dimensions)
│ ├── Walls (north, south, east, west)
│ ├── Floor and Ceiling
│ ├── Door portals (to connected rooms)
│ └── Artifacts (type-based 3D renderers)
└── Camera Controls
├── FirstPersonControls (WASD + mouse look)
└── OverviewControls (OrbitControls, bird's-eye)
16+ Distinct 3D Artifact Types
Every artifact looks different based on its type.
Floating books for document, lecture, and lesson artifacts. Hologram panels for concepts and insights. Crystal orbs (icosahedrons with orbiting particles) for enrichment artifacts. Speech bubbles for conversations. Framed images for screenshots and synthesis images mounted on walls. And 20+ unique GLB models including a brain, question mark, coffee cup, milestone trophy, heart, dream cloud, tree, headphones, cash stack, exam paper, speaker, warning sign, and a hamburger.
Instanced Rendering for Performance
With potentially hundreds of artifacts in a room, rendering each one individually would kill performance. Rayan uses instanced rendering.
BookInstancedRenderer clones the document GLB model per artifact, sharing geometry and textures. One draw call instead of N. OrbInstancedRenderer uses InstancedMesh for both orbs and their particles. Two draw calls instead of 6N.
This keeps the palace smooth even with dense rooms.
Camera System
Two modes with smooth transitions. First-person gives you WASD movement with mouse look, constrained to room bounds with wall collision detection. This is how you experience the palace. Overview gives you a bird's-eye OrbitControls view at 55 units height with 45-degree FOV. This is how you survey the layout.
Room transitions use flyTo() for smooth interpolation from current position to target.
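The wall-collision constraint amounts to clamping the player position to the room footprint. A rough sketch assuming the default 8x8 room and a small wall margin (the constant values are illustrative):

```python
def clamp_to_room(x: float, z: float, half_w: float = 4.0, half_d: float = 4.0,
                  margin: float = 0.3) -> tuple[float, float]:
    """Constrain a first-person position to room bounds (8x8 footprint by default)."""
    x = max(-half_w + margin, min(half_w - margin, x))
    z = max(-half_d + margin, min(half_d - margin, z))
    return x, z
```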
Real-Time Communication
Rayan's real-time behavior runs through a single WebSocket connection per user at /ws/{userId}.
Authentication
Client connects and sends { type: "auth", token: "<Firebase ID token>" }. Backend verifies via Firebase Admin SDK. Connection established.
60+ Message Types
The WebSocket carries a typed protocol.

Client to server:

- capture_start, video_frame, capture_voice_chunk, capture_end for the capture lifecycle.
- live_session_start, audio_chunk, live_session_end for recall.
- context_update to notify RecallAgent of room navigation.
- ping every 30 seconds as a heartbeat.

Server to client:

- palace_update for real-time palace mutations (rooms added, artifacts added/updated/removed, connections, lobby doors).
- capture_ack when a concept is extracted.
- capture_audio and capture_text for Rayan's spoken acknowledgments during capture.
- capture_user_text for transcription of user speech.
- live_audio and live_text for Rayan's voice responses during recall.
- live_tool_call for tool invocations (navigate, highlight, synthesize).
- live_interrupted when the user cuts in.
- artifact_recall for narration and diagrams when you click an artifact.
- room_suggestion for room placement suggestions.
- enrichment_update for web search results.
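On the client, handling this protocol is a dispatch on the type field. Here is a minimal sketch of applying two of those server events to local state; the message shapes are illustrative, not the exact production schema.

```python
def apply_server_message(state: dict, msg: dict) -> dict:
    """Apply one server event to a minimal client-side palace state.

    Message shapes here are hypothetical examples, not the real protocol schema.
    """
    kind = msg.get("type")
    if kind == "palace_update":
        for artifact in msg.get("artifacts_added", []):
            state["artifacts"][artifact["id"]] = artifact
    elif kind == "live_interrupted":
        state["audio_queue"].clear()  # drop queued playback when the user cuts in
    return state
```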
Audio Pipeline
Browser to server. getUserMedia() at 16kHz mono. An AudioWorklet (pcm-processor.js) processes raw PCM. About 100ms chunks get base64-encoded and sent over the WebSocket. Echo cancellation and noise suppression are enabled.
Server to browser. Gemini Live returns base64-encoded Linear16 PCM at 24kHz. The client wraps raw PCM in a WAV header (44 bytes). The browser's decodeAudioData() parses it. Web Audio API schedules sequential chunks for gapless playback. Audio plays as it arrives. No waiting for the full response.
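The 44-byte header is plain RIFF/WAVE bookkeeping. A sketch of how a client might build it for 24 kHz mono Linear16 before handing the buffer to decodeAudioData() (illustrative, shown in Python rather than the frontend's JavaScript):

```python
import struct

def wav_header(pcm_len: int, sample_rate: int = 24_000,
               channels: int = 1, bits: int = 16) -> bytes:
    """Build a 44-byte RIFF/WAVE header for raw Linear16 PCM of length pcm_len."""
    byte_rate = sample_rate * channels * bits // 8
    block_align = channels * bits // 8
    return (
        b"RIFF" + struct.pack("<I", 36 + pcm_len) + b"WAVE"
        + b"fmt " + struct.pack("<IHHIIHH", 16, 1, channels, sample_rate,
                                byte_rate, block_align, bits)
        + b"data" + struct.pack("<I", pcm_len)
    )
```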
Infrastructure as Code with Terraform
The entire Google Cloud infrastructure is defined in a single Terraform file (infrastructure/terraform/main.tf) and provisioned with one command.
terraform apply \
-var="project_id=your-project-id" \
-var="backend_image=gcr.io/your-project-id/rayan-backend:latest"
That single command provisions everything below.
Service Account and IAM
resource "google_service_account" "backend" {
  account_id   = "rayan-backend"
  display_name = "Rayan Backend Service Account"
}

# Three roles, least privilege
resource "google_project_iam_member" "firestore" { role = "roles/datastore.user" }
resource "google_project_iam_member" "storage"   { role = "roles/storage.objectAdmin" }
resource "google_project_iam_member" "vertex_ai" { role = "roles/aiplatform.user" }
Cloud Run Service
resource "google_cloud_run_v2_service" "backend" {
  template {
    session_affinity = true  # Critical for long-lived Gemini Live sessions
    containers {
      image = var.backend_image
      resources {
        limits = { cpu = "2", memory = "2Gi" }
      }
    }
    scaling {
      min_instance_count = 1   # Always warm
      max_instance_count = 10  # Scale under load
    }
  }
}
Firestore Database
resource "google_firestore_database" "default" {
  type        = "FIRESTORE_NATIVE"
  location_id = var.region  # us-central1
}
Cloud Storage Buckets
# Media bucket (screenshots, mind maps)
resource "google_storage_bucket" "media" {
  name     = "rayan-media-${var.project_id}"
  location = "US"
  cors { origin = ["*"] }
}

# Frontend hosting bucket
resource "google_storage_bucket" "frontend" {
  name = "rayan-frontend-${var.project_id}"
  website { main_page_suffix = "index.html" }
}
One command. Full infrastructure. Reproducible, version-controlled, reviewable.
Frontend and Backend Stacks
Frontend
| Library | Version | What it does |
|---|---|---|
| React | 18.3.1 | UI framework |
| Three.js | 0.170.0 | 3D rendering engine |
| @react-three/fiber | 8.17.0 | React renderer for Three.js |
| @react-three/drei | 9.122.0 | Utilities (useGLTF, OrbitControls, Environment) |
| Zustand | 5.0.0 | State management, 6 stores (auth, palace, camera, capture, voice, transition) |
| Firebase SDK | 11.0.0 | Authentication |
| Framer Motion | 12.35.0 | UI animations |
| GSAP | 3.12.5 | Camera transition timelines |
| Tailwind CSS | 3.4.19 | Utility-first styling |
| Lucide React | 0.577.0 | Icons |
| React Router DOM | 6.27.0 | SPA routing |
The 6 Zustand stores cleanly separate concerns. authStore for Firebase user state. palaceStore for rooms, artifacts, current room, layout. cameraStore for position, orientation, overview mode, flyTo transitions. captureStore for capture session state, audio stream, transcript, extraction messages. voiceStore for recall session state, audio playback queue, tool activity. transitionStore for room transition animations.
Backend
| Library | What it does |
|---|---|
| FastAPI | Async web framework with WebSocket support |
| google-genai | Gemini Live SDK (persistent audio sessions, tool calling) |
| google-adk | Google Agent Development Kit |
| google-cloud-firestore | Async Firestore client |
| google-cloud-aiplatform | Vertex AI embeddings |
| firebase-admin | Server-side token verification |
| numpy | Cosine similarity computation |
| httpx | Async HTTP client for web search |
| beautifulsoup4 | HTML parsing for web search results |
| websockets | WebSocket protocol |
| pydantic | Data validation and serialization |
Creative Synthesis
One of my favorite features. When you ask Rayan to "synthesize this room," here's what happens.
The synthesis_service fetches all non-synthesis artifacts in the current room. It builds a prompt that includes the room name and style, a style-specific color palette (library gets warm amber and aged parchment, lab gets midnight blue and neon cyan, gallery gets lavender and rose gold), every artifact's title and keywords, mood hints derived from artifact types (emotion-heavy rooms get "warmth, feeling" hints), and color hints from individual artifact fields.
The prompt goes to gemini-2.5-flash-image with response_modalities=["Text", "Image"]. The model generates a creative, styled mind map image. Not a diagram. Something that actually looks good. The PNG gets extracted, uploaded to Cloud Storage at syntheses/{roomId}/{uuid}.png, and made publicly accessible. A synthesis artifact gets created and placed on the south wall, centered.
Each synthesis is unique to the room's theme. Library rooms get warm parchment textures with scholarly connections. Lab rooms get holographic panels with neon data flows. Gallery rooms get painterly brushstrokes. The instruction to the model says "Draw visible relationships. Make it beautiful enough to hang on a wall."
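The prompt assembly can be sketched as straightforward string building. The palette strings below come from the description above; the function and field names are illustrative, not the synthesis_service's actual code.

```python
# Palettes per room style, per the descriptions above (subset shown)
STYLE_PALETTES = {
    "library": "warm amber and aged parchment",
    "lab": "midnight blue and neon cyan",
    "gallery": "lavender and rose gold",
}

def build_synthesis_prompt(room_name: str, style: str, artifacts: list[dict]) -> str:
    """Assemble the image-generation prompt from room metadata and artifact titles."""
    palette = STYLE_PALETTES.get(style, "neutral tones")
    lines = [
        f"Create a styled mind map for the room '{room_name}' ({style}).",
        f"Color palette: {palette}.",
        "Memories to include:",
    ]
    for a in artifacts:
        lines.append(f"- {a['title']} ({', '.join(a['keywords'])})")
    lines.append("Draw visible relationships. Make it beautiful enough to hang on a wall.")
    return "\n".join(lines)
```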
Data Model
Artifact
Every memory in Rayan is an Artifact.
class Artifact(BaseModel):
    id: str                          # artifact_{uuid}
    roomId: str                      # Parent room
    type: ArtifactType               # 20+ types (lecture, insight, moment, goal, emotion...)
    visual: ArtifactVisual           # 3D rendering type (floating_book, crystal_orb, hologram_frame...)
    position: Position3D             # x, y, z within room
    title: str
    keywords: list[str]
    summary: str                     # 50-150 word description
    fullContent: Optional[str]       # Extended content
    embedding: list[float]           # 768-dim vector from text-embedding-005
    sourceMediaUrl: Optional[str]    # Screenshot or mind map image URL
    capturedAt: datetime
    captureSessionId: Optional[str]
    enrichments: list[str]           # IDs of enrichment artifacts
    relatedArtifacts: list[str]      # Cross-links to related memories
    color: Optional[str]             # Hex color hint for rendering
    wall: Optional[str]              # north, south, east, west, or center
Room
class Room(BaseModel):
    id: str                        # room_{uuid}
    name: str                      # "Machine Learning", "Travel Plans"
    style: str                     # library, lab, gallery, garden, workshop, etc.
    position: Position3D           # World coordinates
    dimensions: Dimensions3D       # Default 8x8x4
    topicKeywords: list[str]       # ["AI", "neural networks"]
    topicEmbedding: list[float]    # 768-dim for room matching
    artifactCount: int
    summary: str                   # Derived from artifact summaries
    firstMemoryAt: datetime        # Earliest artifact
    lastMemoryAt: datetime         # Most recent artifact
10 Room Styles
Each style defines the visual aesthetic and influences synthesis art. Library (scholarly, warm amber). Lab (scientific, midnight blue with neon). Gallery (artistic, lavender and rose gold). Garden (organic, emerald and bioluminescent). Workshop (practical, charcoal and molten orange). Museum (historical, classical). Observatory (visionary, deep space). Sanctuary (emotional, soft and reflective). Studio (creative, vibrant). Dojo (disciplined, minimal).
Deployment
Backend to Cloud Run
# Build and push container
gcloud builds submit --tag gcr.io/$PROJECT_ID/rayan-backend .
# Deploy with session affinity
gcloud run deploy rayan-backend \
--image gcr.io/$PROJECT_ID/rayan-backend \
--region us-central1 \
--allow-unauthenticated \
--session-affinity \
--set-env-vars GOOGLE_CLOUD_PROJECT=$PROJECT_ID,MEDIA_BUCKET=rayan-media-$PROJECT_ID
Session affinity is the key flag. Without it, Cloud Run load-balances WebSocket connections across instances, which breaks the persistent Gemini Live sessions.
Frontend to Firebase Hosting
npm run build
firebase deploy --only hosting
Firebase Hosting serves the static SPA with CDN distribution and SSL.
Things I Learned Building with Gemini Live
Session affinity is non-negotiable
Gemini Live sessions are stateful WebSocket connections. If your infrastructure doesn't guarantee that subsequent messages from the same client hit the same server instance, your sessions break silently. Cloud Run's --session-affinity flag fixed this.
Affective dialog changes everything
I built Rayan initially with enable_affective_dialog=False. It worked, but it felt mechanical. Flipping that single boolean changed the whole experience. Rayan became something you wanted to talk to. The pacing changes, the tone shifts, the subtle empathy. It's the difference between a tool and a companion.
Tool calling is asynchronous, and that's powerful
Unlike traditional function calling where the model waits for a response, Gemini Live's tool calls are non-blocking. The model keeps talking while the tool executes in the background. When the result arrives, it gets injected via send_client_content() and the model works it in naturally. This means Rayan can say "Let me navigate you to your Biology room" and start talking about Biology while the navigation animation is still playing.
Grounding must be continuous, not one-shot
My first implementation loaded memories at session start and never refreshed them. After navigating three rooms, Rayan's context was stale. The fix was re-running semantic search on every room navigation and artifact interaction, injecting fresh memories mid-conversation. Gemini Live's send_client_content() makes this possible without reconnection.
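The refresh-on-navigation loop can be sketched like this. Everything here is a stand-in: `semantic_search` is a naive keyword-overlap stub in place of the real embedding search, and assigning `self.context` is where the real system would call `send_client_content()`.

```python
def semantic_search(memories, query, top_k=2):
    """Stub: rank stored memories by naive keyword overlap with the query."""
    scored = [(sum(w in m.lower() for w in query.lower().split()), m)
              for m in memories]
    return [m for score, m in sorted(scored, reverse=True)[:top_k] if score > 0]

class GroundedSession:
    def __init__(self, memories):
        self.memories = memories
        self.context = []  # what the agent currently "sees"

    def on_navigate(self, room):
        # Re-ground on every navigation instead of relying on
        # session-start context; inject via send_client_content()
        # in the real system.
        self.context = semantic_search(self.memories, room)
        return self.context
```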
Embeddings inline on documents scale surprisingly well
I originally planned to use a separate vector database. But storing 768-float embeddings directly on Firestore documents and doing cosine similarity in Python works fine up to thousands of artifacts. The simplicity is worth it. The Vertex AI Vector Search Index is already provisioned in Terraform for when scale demands it.
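The core of that approach is just cosine similarity over the inline vectors. A minimal stdlib-only sketch, assuming artifacts are (id, embedding) pairs pulled from Firestore (the real embeddings are 768 floats from text-embedding-005; the toy vectors below are for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, artifacts, k=3):
    """Rank (doc_id, embedding) pairs by similarity to the query vector."""
    ranked = sorted(artifacts, key=lambda a: cosine(query_vec, a[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

A linear scan like this is O(n) per query, which is exactly why it holds up at thousands of artifacts but wants an ANN index at millions.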
What's Next
Rayan works today. But the vision goes beyond what a hackathon timeline allows.
Vertex AI Vector Search at Scale
Right now, semantic search loads all artifacts from Firestore and computes cosine similarity in Python. This works at hundreds or low thousands of artifacts. Next step is activating the Vertex AI Vector Search Index already provisioned in Terraform, moving to approximate nearest neighbor search that handles millions of embeddings with sub-millisecond latency. This also opens the door to hybrid search, combining semantic similarity with keyword matching and temporal filters at the index level.
Mobile Companion App
The 3D palace works great on desktop, but real life happens on your phone. A mobile companion app (React Native or Flutter) would let you run Capture sessions from your pocket during walks, commutes, or in-person conversations, syncing everything back to your palace. The mobile experience would focus on voice-first interaction with a simplified 2D room view. The full 3D palace stays on desktop.
Collaborative Palaces
A shared palace for a study group, a project team, or a couple. Multiple users contributing memories to shared rooms, with RecallAgent understanding multi-user context. "What did Sarah capture about the API design?" The architecture already supports multi-user Firestore paths. The agent context and permission model need extension.
Spaced Repetition Engine
The palace structure is inherently spatial, which already helps memory. Adding a spaced repetition layer where Rayan proactively surfaces memories about to fade from your recall curve would turn the palace into an active learning system. "You haven't visited your Organic Chemistry room in 12 days. Want me to quiz you on the key reactions?"
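A first cut of that surfacing logic could use a simple exponential forgetting curve. This is a hedged sketch of the idea, not a planned implementation; `stability_days` and the threshold are made-up knobs:

```python
import math

def recall_strength(days_since_visit: float, stability_days: float = 7.0) -> float:
    """Exponential forgetting curve: strength decays from 1.0 toward 0
    as the last visit recedes. stability_days is a hypothetical knob."""
    return math.exp(-days_since_visit / stability_days)

def rooms_to_surface(rooms: dict, threshold: float = 0.3) -> list:
    """Rooms whose estimated recall strength has dropped below threshold.
    rooms maps room name -> days since last visit."""
    return [name for name, days in rooms.items()
            if recall_strength(days) < threshold]
```

With these numbers, a room untouched for 12 days scores about 0.18 and gets surfaced, matching the "Organic Chemistry" prompt above.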
Persistent Cross-Session Agent Memory
Right now, each Capture and Recall session starts fresh (though the palace itself persists). Adding persistent agent memory, where Rayan remembers how you like to be spoken to, which topics you care about most, your learning style, and your naming conventions, would make the companion feel truly personal over months of use.
Try It
Rayan is live at rayan-memory.web.app. Sign in with Google, start a Capture session, and speak. Watch your 3D palace build itself in real time. Then switch to Recall and walk through your memories.
The whole project is built on:
- Gemini Live API (gemini-live-2.5-flash-native-audio) for the real-time voice agents
- Gemini 2.5 Flash for memory categorization and narration
- Gemini 2.5 Flash Image for creative mind map synthesis
- Vertex AI text-embedding-005 for semantic grounding
- Cloud Run for the backend, with session affinity
- Firestore as the primary database
- Cloud Storage for media
- Firebase Hosting for the frontend
- Firebase Authentication for Google Sign-In
- Terraform for one-command infrastructure
A 3D memory palace that listens, remembers, and speaks back.
I created this content for the purposes of entering the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge
GitHub github.com/yelnady/rayan
Developer g.dev/yelnady