This is a submission for the Google AI Studio Multimodal Challenge
What I Built
I built Vocal Vista, an immersive travel and exploration applet.
It's designed to change how we discover new places.
Instead of just typing into a search bar, Vocal Vista lets you explore the world with your voice.
The problem I wanted to solve was making location discovery more natural and engaging.
I envisioned a tool where you could simply say a place's name and instantly see it come to life.
Vocal Vista combines AI-generated imagery, detailed descriptions, and a voice-controlled interactive map.
It's more than a search tool; it's a dynamic window to any location on Earth, right from your browser.
Demo
Here's a link to the app in action:
Link to the deployed Vocal Vista applet
And here are a few snapshots of the experience.
The Welcome Screen
Users are greeted with a clean, modern interface.
They can choose to either speak or type their destination.
Interactive Exploration
Once a location is searched, the results page is a rich dashboard.
It features a large, interactive map, a gallery of stunning AI-generated images, and detailed information cards.
Voice-Controlled Map
This is where the magic happens.
This video shows how a user can say "show me cafes nearby" or "zoom in two levels," and the map responds instantly.
How I Used Google AI Studio
Google AI Studio was the command center for this project.
It was essential for prototyping and perfecting the AI interactions.
I heavily relied on it to craft the prompts for the Gemini 2.5 Flash model.
My goal was to get reliable, structured JSON output for both location details and map commands.
In the Studio, I could experiment with different schemas and prompt phrasing until the model's responses were consistently accurate.
This iterative process was crucial.
For example, I fine-tuned the mapCommandSchema
to ensure the AI could differentiate between a search query like "restaurants" and a zoom command like "zoom out."
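To give a concrete sense of what that looks like, here is a minimal sketch of a map-command schema using the @google/genai SDK's structured-output support. The field names mirror the JSON example later in this post; the specific enum values, model call, and API-key handling are my illustration, not the app's exact code.

```typescript
import { GoogleGenAI, Type } from "@google/genai";

// How the API key is injected depends on your build setup.
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Hypothetical sketch: the real mapCommandSchema and intent names may differ.
const mapCommandSchema = {
  type: Type.OBJECT,
  properties: {
    intent: {
      type: Type.STRING,
      enum: ["SEARCH", "ZOOM_IN", "ZOOM_OUT", "DIRECTIONS"], // assumed values
    },
    value: { type: Type.STRING },  // e.g. "restaurants" or "2"
    origin: { type: Type.STRING }, // only used for DIRECTIONS
  },
  required: ["intent", "value"],
};

async function interpretCommand(transcript: string) {
  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash",
    contents: `Interpret this map command: "${transcript}"`,
    config: {
      responseMimeType: "application/json",
      responseSchema: mapCommandSchema,
    },
  });
  return JSON.parse(response.text ?? "{}");
}
```

With a schema like this, "restaurants" and "zoom out" come back as different intents that the app can act on deterministically.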
I also used the Studio to test prompts for the Imagen 4.0 model.
This helped me find the right words to generate beautiful, photorealistic images that truly capture the essence of a location.
Multimodal Features
Vocal Vista is built on a foundation of multimodality.
It translates user intent across different formats—voice, text, images, and structured data.
1. Speech-to-Action
This is the core of the experience.
A user's spoken words are transcribed and then interpreted by Gemini 2.5 Flash.
The model doesn't just see text; it understands intent.
A command like "Show me how to get here from the Golden Gate Bridge" is converted into a structured JSON object:
{ "intent": "DIRECTIONS", "value": "...", "origin": "Golden Gate Bridge" }
This allows the app to control the map with remarkable precision, all from a natural, spoken sentence.
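The post doesn't include the app's map-control code, so here is a simplified, hypothetical dispatcher to illustrate the idea: the structured command from Gemini is switched on and translated into calls on a thin map wrapper. The MapView interface and intent names below are assumptions.

```typescript
// Hypothetical thin wrapper around the app's map; the real interface may differ.
interface MapView {
  getZoom(): number;
  setZoom(level: number): void;
  search(query: string): void;                      // e.g. a Places text search
  route(origin: string, destination: string): void; // e.g. a Directions request
}

type MapCommand = {
  intent: "SEARCH" | "ZOOM_IN" | "ZOOM_OUT" | "DIRECTIONS";
  value: string;
  origin?: string;
};

function applyCommand(map: MapView, cmd: MapCommand): void {
  switch (cmd.intent) {
    case "ZOOM_IN":
      map.setZoom(map.getZoom() + (Number(cmd.value) || 1));
      break;
    case "ZOOM_OUT":
      map.setZoom(map.getZoom() - (Number(cmd.value) || 1));
      break;
    case "SEARCH":
      map.search(cmd.value); // "cafes nearby", "restaurants", ...
      break;
    case "DIRECTIONS":
      map.route(cmd.origin ?? "current location", cmd.value);
      break;
  }
}
```

In the real app the wrapper would delegate to the interactive map's zoom, places-search, and directions features.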
2. Text-to-Image
To make discovery more visual, Vocal Vista uses Imagen 4.0.
When a user searches for a location, the app generates a gallery of four unique, high-quality images.
This turns a simple text query like "The Great Wall of China" into a vibrant, visual mood board, giving an immediate feel for the place.
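As a rough sketch of how such a gallery could be produced with the @google/genai SDK (the model id, prompt wording, and helper below are my assumptions, not Vocal Vista's exact code):

```typescript
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Hypothetical helper: generate a four-image gallery for a searched location.
async function generateGallery(locationName: string): Promise<string[]> {
  const response = await ai.models.generateImages({
    model: "imagen-4.0-generate-001", // assumed Imagen 4.0 model id
    prompt: `Photorealistic travel photograph of ${locationName}, natural light, wide angle`,
    config: { numberOfImages: 4 },
  });
  // Convert the returned base64 bytes into data URLs the gallery can render.
  return (response.generatedImages ?? []).map(
    (img) => `data:image/png;base64,${img.image?.imageBytes ?? ""}`
  );
}

// Example: await generateGallery("The Great Wall of China");
```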
3. Text-to-Structured-Data
The app also transforms a simple location name into a rich, organized travel guide.
Gemini 2.5 Flash takes the query and generates a detailed JSON object containing:
- A compelling description.
- A list of key landmarks.
- Practical navigation advice.
- Precise geographic coordinates.
This multimodal step turns a basic input into a wealth of useful, beautifully formatted information. It enhances the user experience by making complex data easy to digest and turning simple curiosity into genuine discovery.
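For readers curious what such a schema might look like, here is a hedged sketch of a location-details response schema in the same style as the map-command one above; the field names are inferred from the list in this section, not copied from the app's source.

```typescript
import { Type } from "@google/genai";

// Hypothetical location-details schema; Vocal Vista's actual field names may differ.
const locationDetailsSchema = {
  type: Type.OBJECT,
  properties: {
    description: { type: Type.STRING },                           // compelling description
    landmarks: { type: Type.ARRAY, items: { type: Type.STRING } }, // key landmarks
    navigationAdvice: { type: Type.STRING },                       // practical navigation tips
    coordinates: {
      type: Type.OBJECT,
      properties: {
        lat: { type: Type.NUMBER },
        lng: { type: Type.NUMBER },
      },
      required: ["lat", "lng"],
    },
  },
  required: ["description", "landmarks", "navigationAdvice", "coordinates"],
};
```

Passed as the responseSchema with responseMimeType set to "application/json", a schema like this lets the app render the travel guide and drop a map pin directly from a single Gemini call.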