<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shayan Banerjee</title>
    <description>The latest articles on DEV Community by Shayan Banerjee (@shayan_banerjee_000a15fb8).</description>
    <link>https://dev.to/shayan_banerjee_000a15fb8</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3502476%2F9f8afefa-d57b-43d3-8281-d2e4a1febae3.png</url>
      <title>DEV Community: Shayan Banerjee</title>
      <link>https://dev.to/shayan_banerjee_000a15fb8</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shayan_banerjee_000a15fb8"/>
    <language>en</language>
    <item>
      <title>PicMoods: An AI Synesthesia Experience</title>
      <dc:creator>Shayan Banerjee</dc:creator>
      <pubDate>Mon, 15 Sep 2025 06:59:40 +0000</pubDate>
      <link>https://dev.to/shayan_banerjee_000a15fb8/picmoods-an-ai-synesthesia-experience-3pff</link>
      <guid>https://dev.to/shayan_banerjee_000a15fb8/picmoods-an-ai-synesthesia-experience-3pff</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-ai-studio-2025-09-03"&gt;Google AI Studio Multimodal Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;What I Built&lt;/h2&gt;

&lt;p&gt;PicMoods is a web application that explores the concept of digital synesthesia, translating the mood and aesthetics of a visual image into a completely original audiovisual experience.&lt;/p&gt;

&lt;p&gt;Users upload an image that inspires them, and PicMoods orchestrates a multi-step AI pipeline to compose a unique piece of music and a corresponding video. It's a tool for creative exploration, allowing anyone to discover the hidden melody within a photograph or piece of art.&lt;/p&gt;

&lt;p&gt;The entire creative process is powered by the Gemini API, with all audio and video rendering handled client-side using Tone.js and ffmpeg.wasm. The app also features a local, in-browser gallery, built on IndexedDB, for saving, replaying, and downloading your favorite compositions.&lt;/p&gt;

&lt;h2&gt;Demo&lt;/h2&gt;

&lt;p&gt;URL: &lt;a href="https://picmoods-315248990502.us-west1.run.app/" rel="noopener noreferrer"&gt;https://picmoods-315248990502.us-west1.run.app/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the demo, you can see the full user journey:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A user uploads a vibrant picture of a city at night.&lt;/li&gt;
&lt;li&gt;They click "Compose Music," and the app displays real-time progress as it moves through the AI pipeline.&lt;/li&gt;
&lt;li&gt;In under a minute, a video player appears, featuring a Ken Burns-style slideshow of 10 surreal, AI-generated variations of the original cityscape.&lt;/li&gt;
&lt;li&gt;Playing alongside the video is an upbeat, synth-based melody, perfectly matching the energetic and electric mood of the image.&lt;/li&gt;
&lt;li&gt;The user then saves the final MP4 composition to their local gallery.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;How I Used Google AI Studio&lt;/h2&gt;

&lt;p&gt;I leveraged two different Gemini models to create a sophisticated, chained pipeline where the output of one AI task becomes the creative input for the next.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;gemini-2.5-flash for Mood Analysis &amp;amp; Music Composition. This model was the core "brain" of the composition process, and I used it for two distinct tasks:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Mood Analysis: The first call is a classic multimodal query. The model receives the user's image and a simple text prompt asking it to describe the primary mood in 2-5 words. This extracted mood (e.g., "dark and mysterious" or "joyful and energetic") acts as the creative director for the music.&lt;/p&gt;
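&lt;p&gt;That first call can be sketched as follows, assuming the Google Gen AI JavaScript SDK; the prompt wording and helper name are illustrative rather than the app's actual source:&lt;/p&gt;

```javascript
// Build the multimodal "parts" payload for the mood-analysis request:
// the image travels as base64-encoded inline data next to the text prompt.
function buildMoodRequest(base64Image, mimeType) {
  return [
    { inlineData: { mimeType: mimeType, data: base64Image } },
    { text: "Describe the primary mood of this image in 2-5 words." },
  ];
}

// Hypothetical call shape with @google/genai (network code, shown for shape only):
//   const ai = new GoogleGenAI({ apiKey: API_KEY });
//   const res = await ai.models.generateContent({
//     model: "gemini-2.5-flash",
//     contents: buildMoodRequest(imgB64, "image/png"),
//   });
//   const mood = res.text.trim(); // e.g. "joyful and energetic"
```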

&lt;p&gt;Structured Music Generation: The second, more complex call feeds the original image and the newly generated mood back to the model. Using a strict responseSchema, I prompted Gemini to return a JSON object containing everything needed for the audiovisual experience:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Musical metadata like tempo and instrument.&lt;/li&gt;
&lt;li&gt;A full array of notes in Tone.js format ({note, duration, time}).&lt;/li&gt;
&lt;li&gt;The complete musical score in ABC Notation for visual display.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This demonstrates Gemini's powerful ability to perform creative tasks while adhering to a required data structure, which is critical for application development.&lt;/p&gt;
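&lt;p&gt;A plain validator makes the expected shape concrete. The field names come from the post; in the actual call this contract would be expressed as a responseSchema (with responseMimeType set to "application/json"), so the validator here is an illustrative stand-in:&lt;/p&gt;

```javascript
// Check that a composition object matches the structure the app
// requests from Gemini: tempo, instrument, notes[], abcNotation.
function isValidComposition(c) {
  if (typeof c.tempo !== "number") return false;
  if (typeof c.instrument !== "string") return false;
  if (typeof c.abcNotation !== "string") return false;
  if (!Array.isArray(c.notes)) return false;
  return c.notes.every(function (n) {
    if (typeof n.note !== "string") return false;     // e.g. "C4"
    if (typeof n.duration !== "string") return false; // e.g. "4n"
    if (typeof n.time !== "number") return false;     // seconds offset
    return true;
  });
}
```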

&lt;ol start="2"&gt;
&lt;li&gt;gemini-2.5-flash-image-preview for Visual Storytelling. To create the visual part of the video, I used the image generation capabilities of gemini-2.5-flash-image-preview. The application runs the user's original image through the model 10 times, each time with a different creative text prompt (e.g., "Reimagine this as a vintage, sepia-toned photograph" or "Apply a beautiful watercolor painting effect"). This generates a sequence of 10 thematically linked but stylistically unique images that form the visual narrative of the final video.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;Multimodal Features&lt;/h2&gt;

&lt;p&gt;PicMoods is built from the ground up on a foundation of multimodal interactions, chaining them together to create a result that is greater than the sum of its parts.&lt;/p&gt;

&lt;p&gt;Image-to-Text (Mood Analysis): The process starts by interpreting visual data to produce descriptive text. The model analyzes the pixels, colors, and composition of the input image to generate a concise summary of its emotional tone.&lt;/p&gt;

&lt;p&gt;Input: (Image, "Analyze the mood" Text)&lt;br&gt;
Output: Text (e.g., "peaceful and serene")&lt;/p&gt;

&lt;p&gt;Image-and-Text-to-Structured-Data (Music Composition): This is the core creative step. The model doesn't just look at the image or the text; it synthesizes both. It considers the visual context of the image through the lens of the textual mood prompt to generate a complex, structured JSON object representing a full musical piece.&lt;/p&gt;

&lt;p&gt;Input: (Image, "Compose music for this mood: ..." Text)&lt;br&gt;
Output: Structured JSON { tempo, instrument, notes, abcNotation }&lt;br&gt;
Image-and-Text-to-Image (Visual Variation): To build the video, the app leverages multimodality for visual art generation. By repeatedly combining the source image with different artistic prompts, it creates a diverse set of new images that all share the same foundational subject matter.&lt;/p&gt;

&lt;p&gt;Input: (Image, "Render this in a dreamlike style" Text)&lt;br&gt;
Output: Image&lt;/p&gt;

&lt;p&gt;This pipeline is a powerful demonstration of how different multimodal capabilities can be stacked to build a complex and creative application.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>googleaichallenge</category>
      <category>ai</category>
      <category>gemini</category>
    </item>
  </channel>
</rss>
