This is a submission for the Google AI Studio Multimodal Challenge
What I Built
The Digital Storyteller is an interactive applet that uses Google's Gemini models to create imaginative stories from images. Users upload an image and, optionally, provide a text prompt to guide the narrative. The app generates a short, creative story, then converts it into an audio file that can be played back, transforming a static image into a dynamic, narrated tale.
Demo
How I Used Google AI Studio
I used Google AI Studio as the development environment for this applet and called the Gemini models directly through the Gemini API. The app relies on gemini-2.5-flash to interpret the image and text prompt and generate the story, then uses the gemini-2.5-flash-preview-tts model to synthesize audio from that text. The app was built locally and is ready to deploy to a platform like Cloud Run, as the challenge requires.
Multimodal Features
This applet demonstrates multimodal functionality in two key ways:
Multimodal Content Understanding: The app takes two different modalities as input: an image (visual) and a text prompt. It uses gemini-2.5-flash to understand both inputs and combine them into a single, cohesive text output (the story).
Multimodal Content Generation: After the story is created, the app uses the gemini-2.5-flash-preview-tts model to convert the text of the story into audio data. This showcases the ability to generate new content in a different modality from a text input, providing a richer, more engaging user experience.
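The text-to-speech step could look like the sketch below, again assuming the google-genai Python SDK. The TTS model returns raw 16-bit PCM samples (24 kHz), so a small helper wraps them in a WAV container for browser playback; the voice name "Kore" and the helper names are illustrative assumptions, not the app's actual code:

```python
import io
import os
import wave

def pcm_to_wav(pcm: bytes, rate: int = 24000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM samples in a WAV header so audio players accept them."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(pcm)
    return buf.getvalue()

def narrate(story: str) -> bytes:
    """Convert story text to playable WAV bytes via the Gemini TTS model."""
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-tts",
        contents=story,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
                )
            ),
        ),
    )
    # The audio arrives as inline PCM data on the first candidate part.
    pcm = response.candidates[0].content.parts[0].inline_data.data
    return pcm_to_wav(pcm)
```

Keeping the WAV wrapping separate from the API call makes the playback path easy to test without network access.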