This is a submission for the Google AI Studio Multimodal Challenge
What I Built
PicMoods is a web application that explores the concept of digital synesthesia, translating the mood and aesthetics of an image into a completely original audiovisual experience.
Users upload an image that inspires them, and PicMoods orchestrates a multi-step AI pipeline to compose a unique piece of music and a corresponding video. It's a tool for creative exploration, allowing anyone to discover the hidden melody within a photograph or piece of art.
The entire creative process is powered by the Gemini API, with all audio and video rendering handled client-side using Tone.js and ffmpeg.wasm. The app also features a local, in-browser gallery using IndexedDB to save, replay, and download your favorite compositions.
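To make the client-side rendering concrete, here is a minimal sketch of how styled frames and a rendered audio track could be muxed into an MP4 with ffmpeg.wasm. It assumes the 0.12.x `@ffmpeg/ffmpeg` API; the file names, timings, and the `composeVideo` helper are my own illustrations, not PicMoods' actual code:

```ts
import { FFmpeg } from "@ffmpeg/ffmpeg";
import { fetchFile } from "@ffmpeg/util";

// Illustrative helper: mux pre-rendered PNG frames and a WAV track into an MP4.
async function composeVideo(frames: Blob[], audio: Blob): Promise<Blob> {
  const ffmpeg = new FFmpeg();
  await ffmpeg.load(); // downloads the wasm core on first use

  // Write each slideshow frame into ffmpeg's in-memory virtual filesystem.
  for (let i = 0; i < frames.length; i++) {
    await ffmpeg.writeFile(`frame${i}.png`, await fetchFile(frames[i]));
  }
  await ffmpeg.writeFile("audio.wav", await fetchFile(audio));

  // One frame every 3 seconds, H.264 video + AAC audio, stop at the shorter stream.
  await ffmpeg.exec([
    "-framerate", "1/3", "-i", "frame%d.png",
    "-i", "audio.wav",
    "-c:v", "libx264", "-pix_fmt", "yuv420p",
    "-c:a", "aac", "-shortest",
    "out.mp4",
  ]);

  const data = await ffmpeg.readFile("out.mp4");
  return new Blob([data], { type: "video/mp4" });
}
```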
Demo
URL: https://picmoods-315248990502.us-west1.run.app/
In the demo, you can see the full user journey:
1. A user uploads a vibrant picture of a city at night.
2. They click "Compose Music," and the app displays real-time progress as it moves through the AI pipeline.
3. In under a minute, a video player appears, featuring a Ken Burns-style slideshow of 10 surreal, AI-generated variations of the original cityscape.
4. Playing alongside the video is an upbeat, synth-based melody, perfectly matching the energetic and electric mood of the image.
5. The user then saves the final MP4 composition to their local gallery.
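The gallery in that last step is plain IndexedDB. As a minimal sketch of how saving a finished composition might look (the database and store names are my own placeholders, not the app's actual schema):

```ts
// Minimal sketch of an IndexedDB gallery store (names are placeholders).
function openGallery(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open("picmoods-gallery", 1);
    req.onupgradeneeded = () =>
      req.result.createObjectStore("compositions", { keyPath: "id", autoIncrement: true });
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

async function saveComposition(video: Blob, mood: string): Promise<void> {
  const db = await openGallery();
  db.transaction("compositions", "readwrite")
    .objectStore("compositions")
    .add({ mood, video, createdAt: Date.now() }); // Blobs are storable as-is
}
```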
How I Used Google AI Studio
I leveraged two different Gemini models to create a sophisticated, chained pipeline where the output of one AI task becomes the creative input for the next.
- gemini-2.5-flash for Mood Analysis & Music Composition
This model was the core "brain" of the composition process. I used it for two distinct tasks:
Mood Analysis: The first call is a classic multimodal query. The model receives the user's image and a simple text prompt asking it to describe the primary mood in 2-5 words. This extracted mood (e.g., "dark and mysterious" or "joyful and energetic") acts as the creative director for the music.
Structured Music Generation: The second, more complex call feeds the original image and the newly generated mood back to the model. Using a strict responseSchema, I prompted Gemini to return a JSON object containing everything needed for the audiovisual experience:
- Musical metadata like tempo and instrument.
- A full array of notes in Tone.js format ({note, duration, time}).
- The complete musical score in ABC notation for visual display.
This demonstrates Gemini's powerful ability to perform creative tasks while adhering to a required data structure, which is critical for application development.
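Here is a hedged sketch of what this two-call flow might look like with the @google/genai JS SDK. The prompts and the schema below are simplified guesses at the shapes described above, not PicMoods' exact ones:

```ts
import { GoogleGenAI, Type } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.API_KEY });

type ImagePart = { inlineData: { mimeType: string; data: string } };

// Call 1: multimodal mood analysis. Image plus a short text prompt in, a few words out.
async function analyzeMood(image: ImagePart): Promise<string> {
  const res = await ai.models.generateContent({
    model: "gemini-2.5-flash",
    contents: [image, { text: "Describe the primary mood of this image in 2-5 words." }],
  });
  return res.text ?? "";
}

// Call 2: the image and the extracted mood go back in; a schema-constrained JSON score comes out.
async function composeMusic(image: ImagePart, mood: string) {
  const res = await ai.models.generateContent({
    model: "gemini-2.5-flash",
    contents: [image, { text: `Compose a short piece of music matching this mood: ${mood}` }],
    config: {
      responseMimeType: "application/json",
      responseSchema: {
        type: Type.OBJECT,
        properties: {
          tempo: { type: Type.NUMBER },
          instrument: { type: Type.STRING },
          abcNotation: { type: Type.STRING },
          notes: {
            type: Type.ARRAY,
            items: {
              type: Type.OBJECT,
              properties: {
                note: { type: Type.STRING },     // e.g. "C4"
                duration: { type: Type.STRING }, // e.g. "8n"
                time: { type: Type.STRING },     // e.g. "0:0:2"
              },
            },
          },
        },
      },
    },
  });
  return JSON.parse(res.text ?? "{}");
}
```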
- gemini-2.5-flash-image-preview for Visual Storytelling
To create the visual part of the video, I used the image generation capabilities of gemini-2.5-flash-image-preview. The application runs the user's original image through the model 10 times, each time with a different creative text prompt (e.g., "Reimagine this as a vintage, sepia-toned photograph" or "Apply a beautiful watercolor painting effect"). This generates a sequence of 10 thematically linked but stylistically unique images that form the visual narrative of the final video.
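A sketch of that variation loop, reusing the `ai` client and `ImagePart` type from the previous snippet. The prompt list and helper name are illustrative, and I am assuming the documented pattern of reading generated images back as inlineData parts:

```ts
// Illustrative style prompts; the app's actual ten may differ.
const stylePrompts = [
  "Reimagine this as a vintage, sepia-toned photograph.",
  "Apply a beautiful watercolor painting effect.",
  // ...eight more styles
];

// Runs the source image through the image model once per prompt.
async function generateVariations(image: ImagePart): Promise<string[]> {
  const variations: string[] = [];
  for (const prompt of stylePrompts) {
    const res = await ai.models.generateContent({
      model: "gemini-2.5-flash-image-preview",
      contents: [image, { text: prompt }],
    });
    // Generated images come back as base64-encoded inlineData parts.
    for (const part of res.candidates?.[0]?.content?.parts ?? []) {
      if (part.inlineData?.data) variations.push(part.inlineData.data);
    }
  }
  return variations;
}
```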
Multimodal Features
PicMoods is built from the ground up on a foundation of multimodal interactions, chaining them together to create a result that is greater than the sum of its parts.
Image-to-Text (Mood Analysis): The process starts by interpreting visual data to produce descriptive text. The model analyzes the pixels, colors, and composition of the input image to generate a concise summary of its emotional tone.
Input: (Image, "Analyze the mood" Text)
Output: Text (e.g., "peaceful and serene")
Image-and-Text-to-Structured-Data (Music Composition): This is the core creative step. The model doesn't just look at the image or the text; it synthesizes both. It considers the visual context of the image through the lens of the textual mood prompt to generate a complex, structured JSON object representing a full musical piece.
Input: (Image, "Compose music for this mood: ..." Text)
Output: Structured JSON { tempo, instrument, notes, abcNotation }
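Because the notes array already follows Tone.js conventions, playback of that JSON is nearly direct. A minimal sketch, assuming a recent Tone.js and using a generic PolySynth in place of the app's real instrument mapping:

```ts
import * as Tone from "tone";

type NoteEvent = { note: string; duration: string; time: string };

// Plays the Gemini-composed score; a PolySynth stands in for instrument selection.
async function playPiece(piece: { tempo: number; notes: NoteEvent[] }) {
  await Tone.start(); // browsers require a user gesture before audio can start

  const synth = new Tone.PolySynth(Tone.Synth).toDestination();
  Tone.getTransport().bpm.value = piece.tempo;

  // Tone.Part schedules each { time, note, duration } event on the transport.
  new Tone.Part((time, ev: NoteEvent) => {
    synth.triggerAttackRelease(ev.note, ev.duration, time);
  }, piece.notes).start(0);

  Tone.getTransport().start();
}
```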
Image-and-Text-to-Image (Visual Variation): To build the video, the app leverages multimodality for visual art generation. By repeatedly combining the source image with different artistic prompts, it creates a diverse set of new images that all share the same foundational subject matter.
Input: (Image, "Render this in a dreamlike style" Text)
Output: Image
This pipeline is a powerful demonstration of how different multimodal capabilities can be stacked to build a complex and creative application.
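Conceptually, the whole app reduces to a short chain of the sketches above. Every helper here is hypothetical glue for illustration (toInlineData, renderAudio, and base64ToBlob in particular are not defined anywhere), not PicMoods' actual code:

```ts
// Hypothetical end-to-end glue tying the earlier sketches together.
async function composeFromImage(upload: Blob): Promise<Blob> {
  const image = await toInlineData(upload);           // hypothetical: base64-encode the upload
  const mood = await analyzeMood(image);              // image -> text
  const piece = await composeMusic(image, mood);      // image + text -> structured JSON score
  const frames = await generateVariations(image);     // image + prompts -> 10 styled images
  const audio = await renderAudio(piece);             // hypothetical: offline Tone.js render to WAV
  return composeVideo(frames.map(base64ToBlob), audio); // ffmpeg.wasm muxes the final MP4
}
```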