
Corporeal
How I Built a Multimodal AI Virtual Stager with the Gemini API and Cloud Run

The Problem with Empty Rooms

If you have ever tried to sell a house, you know the golden rule: empty homes sit on the market longer and sell for less. Buyers struggle to visualize the potential of a blank space.

As a Senior Data Scientist who is also fully licensed in real estate and insurance (yes, I can build the AI to stage your house, legally sell it to you, and write the insurance policy all in one go 😂), I knew there had to be a better, cheaper way than paying thousands of dollars for physical furniture staging.

For the Gemini Live Agent Challenge, I decided to build the Open House AI Storyteller—a full-stack, multimodal AI agent that takes a simple photo of an empty room and instantly generates a photorealistic staged image, a compelling marketing narrative (with a built-in Feng Shui expert mode!), and a studio-quality audio voiceover.

Here is a breakdown of the architecture, the tech stack, and the biggest roadblock I hit while orchestrating multiple AI models in Node.js.


🛠️ The Tech Stack

  • Frontend: React, Vite, Tailwind CSS, Framer Motion
  • Backend: Node.js, Express
  • Database: Supabase (for lead capture and deduplication)
  • Infrastructure: Google Cloud Run
  • AI & APIs:
    • Gemini 3.1 Flash Image (for pixel-perfect virtual staging)
    • Gemini 2.5 Flash (for rapid text generation and Feng Shui analysis)
    • Google Cloud Text-to-Speech (for the audio tour)

🏗️ The Architecture: Splitting the Brain

My initial approach was to send a single, massive prompt to the gemini-3.1-flash-image-preview model, asking it to both draw the staged room and write the marketing copy.

The result? A frozen server and an infinite loading spinner on my frontend.

I quickly learned a crucial lesson about multimodal AI: dedicated image models only output pixels. They will completely ignore text-generation instructions, and if your code is await-ing a text response that never comes, your API call will hang until the server crashes.

To fix this, I implemented a "split-brain" architecture in my Express backend. I separated the tasks so each model did exactly what it was best at:
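In outline, the split-brain pipeline chains three independent model calls. Here is a minimal sketch (the helper names `stageRoom`, `writeNarrative`, and `synthesizeVoice` are hypothetical, and the calls are injected so the orchestration stays testable):

```javascript
// Split-brain orchestration: each model does one job, in sequence.
async function stageListing(imageBuffer, { stageRoom, writeNarrative, synthesizeVoice }) {
  // Step 1: the image model returns pixels only — no text is expected here.
  const stagedImage = await stageRoom(imageBuffer);

  // Step 2: the multimodal text model sees the room and writes the copy.
  const story = await writeNarrative(imageBuffer);

  // Step 3: TTS turns the copy into an MP3 buffer for the frontend.
  const audio = await synthesizeVoice(story);

  return { stagedImage, story, audio };
}
```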

Step 1: The Vision Request
I used sharp to compress the user's uploaded image to an optimal size, then sent it to the gemini-3.1-flash-image model with a strict visual prompt. To protect my Cloud Run instance from hanging on transient network spikes, I wrapped the API call in a custom 60-second timeout function with a smart retry loop.
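The timeout-and-retry guard can be sketched as a pair of small helpers. This is my own minimal version, not the post's actual code; the function names and defaults are assumptions:

```javascript
// Reject if the wrapped promise takes longer than `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Retry a flaky async call a few times, each attempt under its own timeout.
async function withRetry(fn, { attempts = 3, timeoutMs = 60_000 } = {}) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await withTimeout(fn(), timeoutMs);
    } catch (err) {
      lastError = err; // transient network spike — try again
    }
  }
  throw lastError;
}

// Usage around the image call (sharp compression and the Gemini call omitted):
// const staged = await withRetry(() => callGeminiImage(compressedBase64), { timeoutMs: 60_000 });
```

Wrapping every model call this way means a hung upstream request costs at most one timeout window instead of taking the Cloud Run instance down with it.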

Step 2: The Narrative Request
The second the image finished generating, I immediately triggered gemini-2.5-flash. I passed it the same base64 image along with a dedicated copywriter prompt. Because 2.5 Flash is multimodal, it could actually "see" the empty room layout and write a highly accurate description of how the new furniture layout maximized Qi flow and adhered to Feng Shui principles.
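The narrative request pairs the base64 image with a text prompt as two parts of one multimodal call. A sketch of how that request could be built, assuming the `@google/genai` SDK (the prompt wording and builder function are mine, not the post's exact code):

```javascript
// Hypothetical copywriter prompt — the real one lives in the author's backend.
const COPYWRITER_PROMPT = `You are a luxury real-estate copywriter and Feng Shui expert.
Describe how the staged furniture layout maximizes Qi flow in this room.`;

// Build the generateContent payload: image part first, then the instruction.
function buildNarrativeRequest(base64Image) {
  return {
    model: 'gemini-2.5-flash',
    contents: [
      { inlineData: { mimeType: 'image/jpeg', data: base64Image } },
      { text: COPYWRITER_PROMPT },
    ],
  };
}

// Usage (requires the @google/genai package and a GEMINI_API_KEY):
// const { GoogleGenAI } = require('@google/genai');
// const ai = new GoogleGenAI({});
// const res = await ai.models.generateContent(buildNarrativeRequest(b64));
// const story = res.text;
```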

Step 3: The Voiceover
Finally, the generated text was passed to the Google Cloud TTS API to generate an MP3 buffer, which was sent back to the React frontend alongside the image and the story.
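The TTS step boils down to one `synthesizeSpeech` request. A sketch of the request shape, assuming the `@google-cloud/text-to-speech` client library (the specific voice name is my assumption, not necessarily the one the app uses):

```javascript
// Build a synthesizeSpeech request that returns an MP3 buffer.
function buildTtsRequest(storyText) {
  return {
    input: { text: storyText },
    voice: { languageCode: 'en-US', name: 'en-US-Neural2-F' }, // voice choice is an assumption
    audioConfig: { audioEncoding: 'MP3' },
  };
}

// Usage (requires the @google-cloud/text-to-speech package and GCP credentials):
// const textToSpeech = require('@google-cloud/text-to-speech');
// const client = new textToSpeech.TextToSpeechClient();
// const [response] = await client.synthesizeSpeech(buildTtsRequest(story));
// response.audioContent is the MP3 buffer sent back to the React frontend.
```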
