This is a submission for the Google AI Studio Multimodal Challenge
## What I Built
I built the Synesthesia Simulator, an AI-powered applet designed to translate sound and imagery into a unified, cross-sensory artistic experience. It creatively simulates the neurological trait of synesthesia, allowing users to see music as color and hear pictures as melodies.
The applet provides a creative and exploratory space for users to discover novel connections between their senses. You can upload an audio file, an image file, or both, and the AI generates:
- A Descriptive Scene – A vivid, artistic narrative describing the blended sensory experience.
- Creative Prompts – Inspiring ideas for writing, art, or reflection based on the output.
- A Generated Vision – A unique AI-generated image visually representing the fusion of sound and/or visuals.
- Creative Chat – An interactive chat session with a creative AI assistant, primed with the context of your generated experience, to explore ideas further.
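Under the hood, the first three outputs arrive from the model as one structured JSON object. The field names below are the ones the app uses; the values are illustrative only:

```json
{
  "descriptiveScene": "Silver piano notes ripple across a violet dusk, each chord blooming into soft halos of light...",
  "creativePrompts": [
    "Write a short poem about a color you can hear.",
    "Sketch the texture of this melody."
  ],
  "imageGenerationInstruction": "Abstract glowing waves of violet and silver flowing in rhythm with deep piano chords."
}
```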
My goal was to create a tool that not only showcases advanced AI but also serves as a source of inspiration — particularly for creative and neurodiverse individuals who may naturally think in cross-sensory ways. It's not a medical tool, but a canvas for imagination.
## Demo
Live Applet Link:
➡️ Launch the Synesthesia Simulator Here
Screenshots & Walkthrough:
Here’s the main interface where you can upload an audio file and an image:
After processing, the applet presents the AI's synesthetic interpretation alongside a newly generated piece of art. The app includes a built-in audio visualizer that reacts to your music, with customizable color schemes:
Additional features include the Creative Chat, where a context-aware creative AI assistant helps you explore the experience further:
## How I Used Google AI Studio
Google AI Studio and the Gemini API power this entire experience. I combined multiple models in a seamless pipeline to handle complex multimodal tasks:
- **Gemini 2.5 Flash (Multimodal Understanding):**
  - The core of the simulator.
  - Handles the system prompt, user prompt, audio file bytes, and image file bytes in a single request.
  - Outputs structured JSON (`descriptiveScene`, `creativePrompts`, `imageGenerationInstruction`) for reliable integration into the UI.
- **Imagen 4.0 (Image Generation):**
  - Translates the `imageGenerationInstruction` from Gemini into tangible artwork.
  - Creates visuals that embody the cross-sensory interpretation.
- **Gemini 2.5 Flash (Conversational AI):**
  - Powers the Creative Chat.
  - A new chat session is initialized with the `descriptiveScene` and `creativePrompts` as context.
  - Turns the assistant into a creative partner, offering deeper exploration of the user's generated experience.
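The first two stages of this pipeline can be sketched with the google-genai Python SDK. This is a minimal sketch for illustration, not the applet's own code: the file handling, prompt wording, and the Imagen model id are my assumptions.

```python
# Sketch of the Gemini -> Imagen pipeline (google-genai Python SDK).
# Prompt text and the Imagen model id are illustrative assumptions.
import json
import os

REQUIRED_KEYS = {"descriptiveScene", "creativePrompts", "imageGenerationInstruction"}

def parse_synesthesia_response(raw: str) -> dict:
    """Validate the structured JSON the model is instructed to return."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"response is missing keys: {sorted(missing)}")
    return data

def simulate(audio_bytes: bytes, image_bytes: bytes) -> dict:
    # SDK imported lazily so the parsing helper stays testable offline.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[
            types.Part.from_bytes(data=audio_bytes, mime_type="audio/mpeg"),
            types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
            "Blend these inputs into one synesthetic experience.",
        ],
        config=types.GenerateContentConfig(
            system_instruction=(
                "You are a synesthesia simulator. Respond only with JSON "
                "containing descriptiveScene, creativePrompts, and "
                "imageGenerationInstruction."
            ),
            response_mime_type="application/json",
        ),
    )
    return parse_synesthesia_response(response.text)

def generate_vision(result: dict) -> bytes:
    """Feed Gemini's image instruction to Imagen and return raw image bytes."""
    from google import genai

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    images = client.models.generate_images(
        model="imagen-4.0-generate-001",  # assumed model id
        prompt=result["imageGenerationInstruction"],
    )
    return images.generated_images[0].image.image_bytes
```

Requesting `response_mime_type="application/json"` is what makes the structured output reliable enough to feed straight into the UI and into Imagen.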
## Multimodal Features
The multimodal capabilities of Gemini are what make this applet possible:
- **Cross-Modal Understanding:**
  - Goes beyond analyzing audio and images separately.
  - Interprets the emotional tone of melodies, maps rhythms to textures, and links color palettes to musical patterns.
  - Produces the descriptive scene that defines the synesthetic simulation.
- **Sense-Blending for Generation:**
  - Uses cross-modal insights to drive Imagen prompts.
  - Example: “Abstract glowing waves of violet and silver flowing in rhythm with deep piano chords.”
  - Generates a true synthesis of the sound and visual inputs.
- **Contextual Conversation:**
  - The Creative Chat expands the experience.
  - Users can ask: “What does the color red sound like in this song?” or “Tell me a story based on the third creative prompt.”
  - The assistant responds with context-aware, imaginative answers.
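Priming the chat with the generated experience can be sketched as follows. Again a hedged sketch with the google-genai Python SDK: the exact wording of the system instruction is my assumption.

```python
# Sketch: seed a chat session with the generated synesthetic experience.
# The system-instruction wording here is an illustrative assumption.
import os

def build_chat_context(scene: str, prompts: list) -> str:
    """Fold descriptiveScene and creativePrompts into one system instruction."""
    bullets = "\n".join(f"- {p}" for p in prompts)
    return (
        "You are a creative partner. The user just generated this "
        f"synesthetic experience:\n{scene}\n\nCreative prompts:\n{bullets}"
    )

def start_creative_chat(scene: str, prompts: list):
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    return client.chats.create(
        model="gemini-2.5-flash",
        config=types.GenerateContentConfig(
            system_instruction=build_chat_context(scene, prompts)
        ),
    )

# Usage:
# chat = start_creative_chat(scene, prompts)
# reply = chat.send_message("What does the color red sound like in this song?")
```

Because the context is baked into the session at creation, every follow-up question is answered against the same generated experience.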
✨ Thank you for checking out my project!
Submission by: @sarthak_bhardwaj_05aba55d