Introduction
Cloud platforms are incredibly powerful — but navigating them can be confusing even for experienced developers.
Recently, while setting up a static website on Google Cloud Storage, I realized how easy it is to make small mistakes:
- A permission checkbox hidden deep in the UI
- A configuration buried under another menu
- A setting that appears correct but fails silently
In many cases, developers spend more time searching the console UI than actually building their application.
That led to a question:
What if an AI assistant could watch your screen, listen to your question, and guide you step‑by‑step through cloud configuration?
That idea became CloudGuide, a multimodal AI agent built with Google AI models and Google Cloud.
This project was developed specifically for the Google AI Hackathon, and this post explains how it works under the hood.
The Idea
CloudGuide is a voice-enabled AI assistant that helps users configure cloud resources in real time.
Instead of reading documentation or watching tutorials, users can simply say:
“Help me deploy a website on Google Cloud.”
The AI agent then:
- Watches the user’s screen
- Listens to voice input
- Speaks instructions
- Highlights UI elements to click
- Verifies steps using real Google Cloud APIs
The goal is to turn complex cloud configuration into a guided interactive experience.
Key Capabilities
CloudGuide combines several multimodal capabilities:
Screen Understanding
The system captures periodic screenshots of the user's screen and sends them to the AI model for interpretation.
Voice Interaction
Users communicate naturally through a microphone, asking questions or requesting help.
Real-Time Voice Responses
The AI responds with native audio output using Gemini’s audio model.
API Grounding
Instead of trusting screenshots alone, the system verifies actions using the Google Cloud Storage API.
Visual Click Guidance
The system highlights the exact UI element users need to click.
This dramatically reduces confusion when navigating complex cloud interfaces.
System Architecture
The system consists of three primary layers:
- Client
- Backend
- AI Model
Client (User Machine)
The client runs locally and handles:
- Screen capture
- Microphone input
- Speaker output
- Browser highlighting
Technologies used:
- mss – screen capture
- pyaudio – audio streaming
- Playwright + Chrome DevTools Protocol – UI highlighting
The client streams screenshots and audio to the backend via WebSocket.
Backend (Google Cloud Run)
The backend is deployed on Google Cloud Run and built with FastAPI.
Responsibilities include:
- Managing WebSocket connections
- Streaming data to the AI model
- Executing tool calls
- Verifying cloud configuration through APIs
All multimodal input flows through a request queue before being sent to the AI model.
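The queue stage can be sketched with an asyncio.Queue that serializes audio and image chunks in arrival order before handing them to the model session. This is a minimal illustration, not CloudGuide's actual implementation; the names and the `send_to_model` callback are assumptions:

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Chunk:
    kind: str      # "audio" or "image"
    payload: bytes


class RequestQueue:
    """Funnels multimodal input through one ordered stream so chunks
    reach the model session in the order they arrived."""

    def __init__(self):
        self._queue: asyncio.Queue[Chunk] = asyncio.Queue()

    async def put(self, chunk: Chunk) -> None:
        await self._queue.put(chunk)

    async def forward(self, send_to_model, stop: asyncio.Event) -> None:
        # Drain the queue and hand each chunk to the model session.
        while not stop.is_set():
            try:
                chunk = await asyncio.wait_for(self._queue.get(), timeout=0.1)
            except asyncio.TimeoutError:
                continue
            await send_to_model(chunk)
```

A single consumer task per session keeps interleaved screenshots and audio from racing each other into the model stream.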
AI Model (Gemini Live)
CloudGuide uses the Gemini Live API with:
gemini-2.5-flash-native-audio-latest
This model supports:
- Real-time audio streaming
- Image understanding
- Tool calling
- Natural voice output
The model processes audio and screenshots together within a single streaming session.
This creates a natural conversational experience.
Tool Calling and API Grounding
One challenge with vision-based AI systems is that screenshots can be misleading.
For example:
A UI might show a resource as created even though the underlying API operation failed.
To solve this, CloudGuide uses tool calling.
Example tools include:
- check_bucket()
- list_bucket_files()
- check_bucket_permissions()
- diagnose_bucket_issues()
These tools query the Google Cloud Storage API directly.
This allows the AI agent to verify that each step actually worked.
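On the backend, tool dispatch can be as simple as a name-to-function table. The sketch below uses the tool names from this post, but the bodies are stubs; the real implementations would call the google-cloud-storage client (e.g. `Client().lookup_bucket(...)`), and the error-wrapping convention is an assumption:

```python
from typing import Any, Callable


def check_bucket(bucket_name: str) -> dict[str, Any]:
    # Stub for illustration: the real version queries the
    # Google Cloud Storage API to confirm the bucket exists.
    return {"exists": True, "name": bucket_name}


TOOLS: dict[str, Callable[..., dict[str, Any]]] = {
    "check_bucket": check_bucket,
}


def execute_tool_call(name: str, args: dict[str, Any]) -> dict[str, Any]:
    """Routes a model-issued tool call to the matching Python function
    and returns a JSON-serializable result for the model to read."""
    fn = TOOLS.get(name)
    if fn is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return fn(**args)
    except Exception as exc:  # surface failures to the model, don't crash
        return {"error": str(exc)}
```

Returning errors as data rather than raising lets the agent explain a failed verification to the user instead of silently dropping the turn.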
Visual UI Guidance
Another major feature is element highlighting.
Before asking the user to click something, the system highlights the UI element.
A pulsing rectangle appears around the button or input field.
This is implemented using:
- Playwright
- Chrome DevTools Protocol
The backend sends highlight instructions, and the client injects an overlay into the browser.
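One way to build that overlay is to generate a small JavaScript snippet on the Python side and evaluate it in the page (for example via Playwright's `page.evaluate`). The selector, colors, and timing below are illustrative, not CloudGuide's exact implementation:

```python
def build_highlight_js(selector: str) -> str:
    """Returns JavaScript that draws a pulsing rectangle around the
    first element matching `selector`, then removes it after 8s."""
    return f"""
    (() => {{
      const el = document.querySelector({selector!r});
      if (!el) return false;
      if (!document.getElementById('cg-pulse-style')) {{
        const st = document.createElement('style');
        st.id = 'cg-pulse-style';
        st.textContent = '@keyframes cg-pulse {{ 50% {{ opacity: 0.4; }} }}';
        document.head.appendChild(st);
      }}
      const rect = el.getBoundingClientRect();
      const box = document.createElement('div');
      box.style.cssText = `position:fixed; pointer-events:none; z-index:999999;
        left:${{rect.left - 4}}px; top:${{rect.top - 4}}px;
        width:${{rect.width + 8}}px; height:${{rect.height + 8}}px;
        border:3px solid #4285f4; border-radius:6px;
        animation:cg-pulse 1s ease-in-out infinite;`;
      document.body.appendChild(box);
      setTimeout(() => box.remove(), 8000);  // auto-clear the highlight
      return true;
    }})()
    """
```

Setting `pointer-events:none` is the important detail: the overlay must never intercept the click it is pointing at.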
Challenges Encountered
Audio Feedback Loops
When the AI speaks through speakers, the microphone can pick up the sound and send it back to the model.
This creates a feedback loop where the model responds to itself.
Using headphones mitigates this issue, but future improvements could include built‑in echo cancellation.
Voice Activity Detection
The Live API sometimes interprets background noise as speech.
Adding voice activity detection would improve reliability.
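As a rough illustration of the idea, an energy gate on raw PCM frames can filter the quietest background noise before anything is streamed. A production system would use a proper VAD (such as WebRTC's); the threshold here is arbitrary:

```python
import math


def is_speech(samples: list[int], threshold: float = 500.0) -> bool:
    """Crude energy-based voice activity check for 16-bit PCM samples:
    treats a frame as speech only if its RMS energy exceeds `threshold`."""
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms > threshold
```

Energy alone cannot tell speech from a slammed door, which is why a trained VAD model is the real fix.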
UI Changes
Cloud interfaces evolve frequently.
During development, certain UI paths moved or appeared differently than documented, which required adjustments to the workflow detection logic.
Deployment
The backend runs on Google Cloud Run and is deployed through a CI/CD pipeline using:
- Cloud Build
- Container Registry
- Docker
Deployment is automated through a simple script that builds and deploys the backend service.
This makes the system scalable and accessible from anywhere.
What I Learned
Building this project revealed several key insights:
Multimodal AI is powerful
Combining screen understanding with voice interaction creates a much more natural interface.
API grounding improves reliability
Vision alone is not enough. Verifying system state using APIs is essential.
Voice interfaces still need infrastructure improvements
Capabilities like echo cancellation and voice activity detection will make voice agents significantly more robust.
Final Thoughts
Cloud platforms are incredibly capable, but their complexity often slows developers down.
Projects like CloudGuide explore a new paradigm:
AI agents that guide users through complex systems in real time.
By combining:
- Google AI models
- Gemini Live streaming
- Google Cloud Run
- Real API grounding
we can create assistants that truly understand what users are doing.
This project was built with Google AI models and Google Cloud services, and this post was written specifically for entering the Google AI Hackathon.