Introduction
Cloud platforms are incredibly powerful — but navigating them can be confusing even for experienced developers.
Recently, while setting up a static website on Google Cloud Storage, I realized how easy it is to make small mistakes:
- A permission checkbox hidden deep in the UI
- A configuration buried under another menu
- A setting that appears correct but fails silently
In many cases, developers spend more time searching the console UI than actually building their application.
That led to a question:
What if an AI assistant could watch your screen, listen to your question, and guide you step‑by‑step through cloud configuration?
That idea became CloudGuide, a multimodal AI agent built with Google AI models and Google Cloud.
This project was developed specifically for the Google AI Hackathon, and this post explains how it works under the hood.
The Idea
CloudGuide is a voice-enabled AI assistant that helps users configure cloud resources in real time.
Instead of reading documentation or watching tutorials, users can simply say:
“Help me deploy a website on Google Cloud.”
The AI agent then:
- Watches the user’s screen
- Listens to voice input
- Speaks instructions
- Highlights UI elements to click
- Verifies steps using real Google Cloud APIs
The goal is to turn complex cloud configuration into a guided interactive experience.
Key Capabilities
CloudGuide combines several multimodal capabilities:
Screen Understanding
The system captures periodic screenshots of the user's screen and sends them to the AI model for interpretation.
Voice Interaction
Users communicate naturally through a microphone, asking questions or requesting help.
Real-Time Voice Responses
The AI responds with native audio output using Gemini’s audio model.
API Grounding
Instead of trusting screenshots alone, the system verifies actions using the Google Cloud Storage API.
Visual Click Guidance
The system highlights the exact UI element users need to click.
This dramatically reduces confusion when navigating complex cloud interfaces.
System Architecture
The system consists of three primary layers:
- Client
- Backend
- AI Model
Client (User Machine)
The client runs locally and handles:
- Screen capture
- Microphone input
- Speaker output
- Browser highlighting
Technologies used:
- mss – screen capture
- pyaudio – audio streaming
- Playwright + Chrome DevTools Protocol – UI highlighting
The client streams screenshots and audio to the backend via WebSocket.
Backend (Google Cloud Run)
The backend is deployed on Google Cloud Run and built with FastAPI.
Responsibilities include:
- Managing WebSocket connections
- Streaming data to the AI model
- Executing tool calls
- Verifying cloud configuration through APIs
All multimodal input flows through a request queue before being sent to the AI model.
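The queue stage can be sketched with an asyncio.Queue that serializes audio and image chunks in arrival order before handing them to the model session. This is a minimal illustration, not CloudGuide's actual implementation; the names and the `send_to_model` callback are assumptions:

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Chunk:
    kind: str      # "audio" or "image"
    payload: bytes


class RequestQueue:
    """Funnels multimodal input through one ordered stream so chunks
    reach the model session in the order they arrived."""

    def __init__(self):
        self._queue: asyncio.Queue[Chunk] = asyncio.Queue()

    async def put(self, chunk: Chunk) -> None:
        await self._queue.put(chunk)

    async def forward(self, send_to_model, stop: asyncio.Event) -> None:
        # Drain the queue and hand each chunk to the model session.
        while not stop.is_set():
            try:
                chunk = await asyncio.wait_for(self._queue.get(), timeout=0.1)
            except asyncio.TimeoutError:
                continue
            await send_to_model(chunk)
```

A single consumer task per session keeps interleaved screenshots and audio from racing each other into the model stream.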
AI Model (Gemini Live)
CloudGuide uses the Gemini Live API with:
gemini-2.5-flash-native-audio-latest
This model supports:
- Real-time audio streaming
- Image understanding
- Tool calling
- Natural voice output
The model processes audio and screenshots together within a single streaming session.
This creates a natural conversational experience.
Tool Calling and API Grounding
One challenge with vision-based AI systems is that screenshots can be misleading.
For example:
A UI might show a resource as created even though the underlying API operation failed.
To solve this, CloudGuide uses tool calling.
Example tools include:
- check_bucket()
- list_bucket_files()
- check_bucket_permissions()
- diagnose_bucket_issues()
These tools query the Google Cloud Storage API directly.
This allows the AI agent to verify that each step actually worked.
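On the backend, tool dispatch can be as simple as a name-to-function table. The sketch below uses the tool names from this post, but the bodies are stubs; the real implementations would call the google-cloud-storage client (e.g. `Client().lookup_bucket(...)`), and the error-wrapping convention is an assumption:

```python
from typing import Any, Callable


def check_bucket(bucket_name: str) -> dict[str, Any]:
    # Stub for illustration: the real version queries the
    # Google Cloud Storage API to confirm the bucket exists.
    return {"exists": True, "name": bucket_name}


TOOLS: dict[str, Callable[..., dict[str, Any]]] = {
    "check_bucket": check_bucket,
}


def execute_tool_call(name: str, args: dict[str, Any]) -> dict[str, Any]:
    """Routes a model-issued tool call to the matching Python function
    and returns a JSON-serializable result for the model to read."""
    fn = TOOLS.get(name)
    if fn is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return fn(**args)
    except Exception as exc:  # surface failures to the model, don't crash
        return {"error": str(exc)}
```

Returning errors as data rather than raising lets the agent explain a failed verification to the user instead of silently dropping the turn.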
Visual UI Guidance
Another major feature is element highlighting.
Before asking the user to click something, the system highlights the UI element.
A pulsing rectangle appears around the button or input field.
This is implemented using:
- Playwright
- Chrome DevTools Protocol
The backend sends highlight instructions, and the client injects an overlay into the browser.
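One way to build that overlay is to generate a small JavaScript snippet on the Python side and evaluate it in the page (for example via Playwright's `page.evaluate`). The selector, colors, and timing below are illustrative, not CloudGuide's exact implementation:

```python
def build_highlight_js(selector: str) -> str:
    """Returns JavaScript that draws a pulsing rectangle around the
    first element matching `selector`, then removes it after 8s."""
    return f"""
    (() => {{
      const el = document.querySelector({selector!r});
      if (!el) return false;
      if (!document.getElementById('cg-pulse-style')) {{
        const st = document.createElement('style');
        st.id = 'cg-pulse-style';
        st.textContent = '@keyframes cg-pulse {{ 50% {{ opacity: 0.4; }} }}';
        document.head.appendChild(st);
      }}
      const rect = el.getBoundingClientRect();
      const box = document.createElement('div');
      box.style.cssText = `position:fixed; pointer-events:none; z-index:999999;
        left:${{rect.left - 4}}px; top:${{rect.top - 4}}px;
        width:${{rect.width + 8}}px; height:${{rect.height + 8}}px;
        border:3px solid #4285f4; border-radius:6px;
        animation:cg-pulse 1s ease-in-out infinite;`;
      document.body.appendChild(box);
      setTimeout(() => box.remove(), 8000);  // auto-clear the highlight
      return true;
    }})()
    """
```

Setting `pointer-events:none` is the important detail: the overlay must never intercept the click it is pointing at.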
Challenges Encountered
Audio Feedback Loops
When the AI speaks through speakers, the microphone can pick up the sound and send it back to the model.
This creates a feedback loop where the model responds to itself.
Using headphones mitigates this issue, but future improvements could include built‑in echo cancellation.
Voice Activity Detection
The Live API sometimes interprets background noise as speech.
Adding voice activity detection would improve reliability.
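As a rough illustration of the idea, an energy gate on raw PCM frames can filter the quietest background noise before anything is streamed. A production system would use a proper VAD (such as WebRTC's); the threshold here is arbitrary:

```python
import math


def is_speech(samples: list[int], threshold: float = 500.0) -> bool:
    """Crude energy-based voice activity check for 16-bit PCM samples:
    treats a frame as speech only if its RMS energy exceeds `threshold`."""
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms > threshold
```

Energy alone cannot tell speech from a slammed door, which is why a trained VAD model is the real fix.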
UI Changes
Cloud interfaces evolve frequently.
During development, certain UI paths moved or appeared differently than documented, which required adjustments to the workflow detection logic.
Deployment
The backend runs on Google Cloud Run and is deployed through a CI/CD pipeline using:
- Cloud Build
- Container Registry
- Docker
Deployment is automated through a simple script that builds and deploys the backend service.
This makes the system scalable and accessible from anywhere.
What I Learned
Building this project revealed several key insights:
Multimodal AI is powerful
Combining screen understanding with voice interaction creates a much more natural interface.
API grounding improves reliability
Vision alone is not enough. Verifying system state using APIs is essential.
Voice interfaces still need infrastructure improvements
Capabilities like echo cancellation and voice activity detection will make voice agents significantly more robust.
Final Thoughts
Cloud platforms are incredibly capable, but their complexity often slows developers down.
Projects like CloudGuide explore a new paradigm:
AI agents that guide users through complex systems in real time.
By combining:
- Google AI models
- Gemini Live streaming
- Google Cloud Run
- Real API grounding
we can create assistants that truly understand what users are doing.
This project was built with Google AI models and Google Cloud services, and this post was written specifically for entering the Google AI Hackathon.