
Building Aura: A Multimodal Smart Home Operated by Gemini Live 🌌

💡 The Problem with Smart Homes

Smart homes today are often fragmented and reactive. You speak into a puck on the wall, and it toggles a light on a screen. There is no continuous awareness.

For the Gemini Live Agent Challenge 2026, I wanted to build something that feels alive. Inspired by futuristic sci-fi interfaces, I built Aura, a central AI pilot that doesn't just hear you: it sees your environment at the same time and translates that awareness into a living, responsive Ambient Dashboard.


🚀 What is Aura?

Aura is a fully multimodal smart home operating system built on bidirectional WebSockets, with backpressure handling to keep the continuous stream low-latency.
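To make "backpressure handling" concrete, here is a minimal sketch of one common client-side approach: skipping stale webcam frames whenever the socket's send buffer backs up, using the standard WebSocket `bufferedAmount` property. The threshold and helper names are my own illustration, not taken from Aura's actual code.

```javascript
// Sketch: drop outgoing video frames while the socket is congested.
// MAX_BUFFERED_BYTES is an illustrative threshold, not from Aura's repo.
const MAX_BUFFERED_BYTES = 256 * 1024;

function shouldSendFrame(socket) {
  // bufferedAmount = bytes queued locally but not yet flushed to the network.
  return socket.bufferedAmount < MAX_BUFFERED_BYTES;
}

function pushFrame(socket, frameBase64) {
  if (!shouldSendFrame(socket)) {
    return false; // skip this frame; a fresher one arrives shortly anyway
  }
  socket.send(JSON.stringify({ type: 'video', data: frameBase64 }));
  return true;
}
```

Dropping frames (rather than queuing them) is the right trade-off for live vision: the model always sees the newest frame instead of a growing backlog of stale ones.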

Unlike previous generations of voice assistants that rely on turn-taking (Speech-to-Text ➔ LLM ➔ Text-to-Speech), Aura streams raw audio and webcam frames concurrently using the @google/genai Node SDK.


🛠️ The Architecture

I engineered a decoupled, reactive container pipeline deployed on Google Cloud Run.
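The pipeline described implies a relay sitting between the browser and Gemini Live. As a hedged sketch (the `{ type, data }` envelope and the `routeClientMessage` helper are my own illustration, not Aura's actual wire protocol), the relay's core job is demultiplexing client messages into the audio and video inputs the upstream session consumes:

```javascript
// Illustrative relay-side demux: classify incoming browser messages into
// the realtime inputs an upstream Live session would consume.
// The envelope shape { type, data } is an assumed format, not Aura's actual one.
function routeClientMessage(raw) {
  const msg = JSON.parse(raw);
  switch (msg.type) {
    case 'audio':
      // Raw PCM mic chunks, base64-encoded (Live audio input is 16 kHz PCM).
      return { kind: 'realtimeAudio', mimeType: 'audio/pcm;rate=16000', data: msg.data };
    case 'video':
      // JPEG webcam frames, base64-encoded.
      return { kind: 'realtimeVideo', mimeType: 'image/jpeg', data: msg.data };
    default:
      return { kind: 'ignored' };
  }
}
```

Keeping this routing in one pure function makes the relay easy to test without a live socket or an API key.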


⚡ Secret Sauce: Native Visual Concurrency

The biggest challenge I ran into was rendering standard 16:9 webcam buffers onto a square canvas without distorting the aspect ratio. The model can hallucinate if you squash its visual context!

I fixed this by computing a letterbox scale on the canvas context before every frame push:

// Frontend scaling: fit the frame inside a 600×600 square, preserving aspect ratio
const scale = Math.min(600 / video.videoWidth, 600 / video.videoHeight);
// Center the scaled frame; the leftover margins become letterbox padding
const x = (600 - video.videoWidth * scale) / 2;
const y = (600 - video.videoHeight * scale) / 2;
ctx.drawImage(video, x, y, video.videoWidth * scale, video.videoHeight * scale);
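That scaling math can be factored into a pure helper (my own refactor, not from the repo) so it can be unit-tested independently of the DOM:

```javascript
// Compute the letterbox placement of a srcW×srcH frame inside a
// dstSize×dstSize square, preserving aspect ratio and centering the result.
function letterbox(srcW, srcH, dstSize) {
  const scale = Math.min(dstSize / srcW, dstSize / srcH);
  const w = srcW * scale;
  const h = srcH * scale;
  return {
    x: (dstSize - w) / 2,
    y: (dstSize - h) / 2,
    width: w,
    height: h,
  };
}

// Browser-side usage:
// const box = letterbox(video.videoWidth, video.videoHeight, 600);
// ctx.drawImage(video, box.x, box.y, box.width, box.height);
```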

🚨 Visual Ambient States (The "Wow" Factor)

Dashboard views shouldn't just list data. When Aura triggers a smart decision, the entire viewport adapts using CSS variable overrides:

  • 💡 .lights-off (Ambient Dimming): the whole viewport darkens to a deep #06080E, with neon-glow edges framing the screen.
  • 🚨 .emergency-global (Strobe Alerting): repeating red and white background flashes demand immediate attention.
  • 🌡️ Thermal Card Shading: thermostat cards pulse with amber overlays whose gradient tracks the current reading.
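To make the state switching concrete, here is a hedged sketch of mapping AI decision events to the ambient classes above. The class names follow the post; the event names and the `ambientClassFor` helper are my own illustration, not Aura's actual API.

```javascript
// Map AI decision events to the ambient CSS classes described above.
// The event keys ('lightsOff', 'emergency') are illustrative, not Aura's.
const AMBIENT_CLASSES = {
  lightsOff: 'lights-off',
  emergency: 'emergency-global',
};

function ambientClassFor(event) {
  return AMBIENT_CLASSES[event] ?? null;
}

// Browser-side usage (requires a DOM):
// const cls = ambientClassFor(decision.event);
// document.body.className = cls ?? '';
```

Swapping a single class on `<body>` lets the CSS variable overrides cascade to every card at once, so the visual state change stays a one-line operation in JavaScript.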

🎥 Check out the Demo Video!
https://www.youtube.com/watch?v=Vm2iGpAuexQ


📂 Source Code

The code is 100% open source and available on GitHub: 👉 https://github.com/karthidec/gemini-agent-challenge.git


⚠️ Contest Disclaimer

This project is an entry for the Google Gemini Live Agent Challenge 2026, built on the @google/genai SDK's continuous WebSocket streaming.

What do you think of this continuous audio/vision ambient approach for smart environments? Let me know in the comments below! 🌌✨

Top comments (1)

VICTOR KIMUTAI

Great implementation. Continuous multimodal streams over bidirectional WebSockets are definitely the future of interactive AI systems. The canvas scaling approach to preserve visual context is a smart solution; distorted frames can definitely degrade model perception. The ambient state system reacting to AI decisions is also a really interesting UX layer on top of the model intelligence.