This is a submission for the Google AI Studio Multimodal Challenge
## What I Built
I built the Synesthesia Simulator, an AI-powered applet designed to translate sound and imagery into a unified, cross-sensory artistic experience. It creatively simulates the neurological trait of synesthesia, allowing users to see music as color and hear pictures as melodies.
The applet provides a creative and exploratory space for users to discover novel connections between their senses. You can upload an audio file, an image file, or both, and the AI generates:
- A Descriptive Scene – A vivid, artistic narrative describing the blended sensory experience.
- Creative Prompts – Inspiring ideas for writing, art, or reflection based on the output.
- A Generated Vision – A unique AI-generated image visually representing the fusion of sound and/or visuals.
- Creative Chat – An interactive chat session with a creative AI assistant, primed with the context of your generated experience, to explore ideas further.
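Under the hood, the first three outputs arrive from the model as one structured JSON object. The field names below are the ones the app uses; the values are illustrative only:

```json
{
  "descriptiveScene": "Silver piano notes ripple across a violet dusk, each chord blooming into soft halos of light...",
  "creativePrompts": [
    "Write a short poem about a color you can hear.",
    "Sketch the texture of this melody."
  ],
  "imageGenerationInstruction": "Abstract glowing waves of violet and silver flowing in rhythm with deep piano chords."
}
```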
My goal was to create a tool that not only showcases advanced AI but also serves as a source of inspiration — particularly for creative and neurodiverse individuals who may naturally think in cross-sensory ways. It's not a medical tool, but a canvas for imagination.
## Demo
Live Applet Link:
➡️ Launch the Synesthesia Simulator Here
Screenshots & Walkthrough:
Here’s the main interface where you can upload an audio file and an image:
After processing, the applet presents the AI's synesthetic interpretation alongside a newly generated piece of art. The app includes a built-in audio visualizer that reacts to your music, with customizable color schemes:
Additional features include the Creative Chat, where a context-aware creative AI assistant helps you explore the experience further:
## How I Used Google AI Studio
Google AI Studio and the Gemini API power this entire experience. I combined multiple models in a seamless pipeline to handle complex multimodal tasks:
- **Gemini 2.5 Flash (Multimodal Understanding):**
  - The core of the simulator.
  - Handles the system prompt, user prompt, audio file bytes, and image file bytes in a single request.
  - Outputs structured JSON (`descriptiveScene`, `creativePrompts`, `imageGenerationInstruction`) for reliable integration into the UI.
- **Imagen 4.0 (Image Generation):**
  - Translates the `imageGenerationInstruction` from Gemini into tangible artwork.
  - Creates visuals that embody the cross-sensory interpretation.
- **Gemini 2.5 Flash (Conversational AI):**
  - Powers the Creative Chat.
  - A new chat session is initialized with the `descriptiveScene` and `creativePrompts` as context.
  - Turns the assistant into a creative partner, offering deeper exploration of the user's generated experience.
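The first two stages of this pipeline can be sketched with the google-genai Python SDK. This is a minimal sketch for illustration, not the applet's own code: the file handling, prompt wording, and the Imagen model id are my assumptions.

```python
# Sketch of the Gemini -> Imagen pipeline (google-genai Python SDK).
# Prompt text and the Imagen model id are illustrative assumptions.
import json
import os

REQUIRED_KEYS = {"descriptiveScene", "creativePrompts", "imageGenerationInstruction"}

def parse_synesthesia_response(raw: str) -> dict:
    """Validate the structured JSON the model is instructed to return."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"response is missing keys: {sorted(missing)}")
    return data

def simulate(audio_bytes: bytes, image_bytes: bytes) -> dict:
    # SDK imported lazily so the parsing helper stays testable offline.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[
            types.Part.from_bytes(data=audio_bytes, mime_type="audio/mpeg"),
            types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
            "Blend these inputs into one synesthetic experience.",
        ],
        config=types.GenerateContentConfig(
            system_instruction=(
                "You are a synesthesia simulator. Respond only with JSON "
                "containing descriptiveScene, creativePrompts, and "
                "imageGenerationInstruction."
            ),
            response_mime_type="application/json",
        ),
    )
    return parse_synesthesia_response(response.text)

def generate_vision(result: dict) -> bytes:
    """Feed Gemini's image instruction to Imagen and return raw image bytes."""
    from google import genai

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    images = client.models.generate_images(
        model="imagen-4.0-generate-001",  # assumed model id
        prompt=result["imageGenerationInstruction"],
    )
    return images.generated_images[0].image.image_bytes
```

Requesting `response_mime_type="application/json"` is what makes the structured output reliable enough to feed straight into the UI and into Imagen.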
## Multimodal Features
The multimodal capabilities of Gemini are what make this applet possible:
- **Cross-Modal Understanding:**
  - Goes beyond analyzing audio and images separately.
  - Interprets the emotional tone of melodies, maps rhythms to textures, and links color palettes to musical patterns.
  - Produces the descriptive scene that defines the synesthetic simulation.
- **Sense-Blending for Generation:**
  - Uses cross-modal insights to drive Imagen prompts.
  - Example: “Abstract glowing waves of violet and silver flowing in rhythm with deep piano chords.”
  - Generates a true synthesis of the sound and visual inputs.
- **Contextual Conversation:**
  - The Creative Chat expands the experience.
  - Users can ask: “What does the color red sound like in this song?” or “Tell me a story based on the third creative prompt.”
  - The assistant responds with context-aware, imaginative answers.
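Priming the chat with the generated experience can be sketched as follows. Again a hedged sketch with the google-genai Python SDK: the exact wording of the system instruction is my assumption.

```python
# Sketch: seed a chat session with the generated synesthetic experience.
# The system-instruction wording here is an illustrative assumption.
import os

def build_chat_context(scene: str, prompts: list) -> str:
    """Fold descriptiveScene and creativePrompts into one system instruction."""
    bullets = "\n".join(f"- {p}" for p in prompts)
    return (
        "You are a creative partner. The user just generated this "
        f"synesthetic experience:\n{scene}\n\nCreative prompts:\n{bullets}"
    )

def start_creative_chat(scene: str, prompts: list):
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    return client.chats.create(
        model="gemini-2.5-flash",
        config=types.GenerateContentConfig(
            system_instruction=build_chat_context(scene, prompts)
        ),
    )

# Usage:
# chat = start_creative_chat(scene, prompts)
# reply = chat.send_message("What does the color red sound like in this song?")
```

Because the context is baked into the session at creation, every follow-up question is answered against the same generated experience.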
✨ Thank you for checking out my project!
Submission by: @sarthak_bhardwaj_05aba55d