This is a submission for the Google AI Studio Multimodal Challenge
What I Built
The Digital Storyteller is an interactive applet that uses Google's Gemini models to create imaginative stories from images. Users upload an image and, optionally, provide a text prompt to guide the narrative. The app generates a short, creative story, then converts it into an audio file that can be played back, transforming a static image into a dynamic, narrated tale.
Demo
How I Used Google AI Studio
I used Google AI Studio as the development environment for this applet and called the Gemini models directly through the Gemini API. The app relies on gemini-2.5-flash to interpret the image and text prompt and generate the story, then uses the gemini-2.5-flash-preview-tts model to synthesize audio from that text. The app was built locally and is ready to deploy to a platform like Cloud Run, as the challenge requires.
Multimodal Features
This applet demonstrates multimodal functionality in two key ways:
Multimodal Content Understanding: The app takes two different modalities as input: an image (visual) and a text prompt. It uses gemini-2.5-flash to understand both inputs and combine them into a single, cohesive text output (the story).
Multimodal Content Generation: After the story is created, the app uses the gemini-2.5-flash-preview-tts model to convert the text of the story into audio data. This showcases the ability to generate new content in a different modality from a text input, providing a richer, more engaging user experience.
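The text-to-speech step could look like the sketch below, again assuming the google-genai Python SDK. The TTS model returns raw 16-bit PCM samples (24 kHz), so a small helper wraps them in a WAV container for browser playback; the voice name "Kore" and the helper names are illustrative assumptions, not the app's actual code:

```python
import io
import os
import wave

def pcm_to_wav(pcm: bytes, rate: int = 24000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM samples in a WAV header so audio players accept them."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(pcm)
    return buf.getvalue()

def narrate(story: str) -> bytes:
    """Convert story text to playable WAV bytes via the Gemini TTS model."""
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-tts",
        contents=story,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
                )
            ),
        ),
    )
    # The audio arrives as inline PCM data on the first candidate part.
    pcm = response.candidates[0].content.parts[0].inline_data.data
    return pcm_to_wav(pcm)
```

Keeping the WAV wrapping separate from the API call makes the playback path easy to test without network access.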