DEV Community


This AI Tells the Story Behind Any Historical Photo or Video

This is a submission for the Google AI Studio Multimodal Challenge

What I Built

I built the Historical Photo/Video Narrator, an interactive applet designed to bring the past to life. This tool allows users to upload historical photos and videos to generate rich, AI-powered narratives that uncover the stories hidden within the frames.

But it doesn't stop at storytelling. The applet also features a powerful "Re-imagine" function. After learning about the context of an image (or capturing a specific frame from a video), users can edit the photo using simple text prompts. Want to see what that 1920s street scene would look like on a sunny day? Or add a splash of color to a black-and-white portrait? The Historical Narrator makes it possible, creating a unique bridge between historical appreciation and creative expression.

The core experience is about transforming passive consumption of historical media into an active, engaging, and educational journey, with all creations saved locally in the browser for future viewing.
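The "saved locally in the browser" part can be sketched roughly like this. This is a minimal illustration, not the applet's actual code: `Creation`, `saveCreation`, and the storage key are hypothetical names, and the record shape is my guess at what such an app would persist.

```typescript
// Illustrative sketch of local persistence (assumed names, not the app's real API).
// A Storage-like interface keeps the logic testable outside the browser;
// window.localStorage satisfies it at runtime.

interface Creation {
  id: string;
  narrative: string;
  imageDataUrl: string; // original or re-imagined image as a data URL
  createdAt: number;
}

type StorageLike = {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
};

const STORAGE_KEY = "historical-narrator-creations";

function saveCreation(store: StorageLike, creation: Creation): Creation[] {
  const existing: Creation[] = JSON.parse(store.getItem(STORAGE_KEY) ?? "[]");
  const updated = [creation, ...existing]; // newest first
  store.setItem(STORAGE_KEY, JSON.stringify(updated));
  return updated;
}
```

In the browser you would simply pass `window.localStorage` as the store; the gallery view then reads the same key back on load.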

Demo

Full Video Demo

To showcase the full video processing and frame-capture capabilities, here is a short video of the project in action:

Here’s a walkthrough of the experience:

1. Upload Your Media: The app starts with a clean, simple interface for uploading an image or a video file.

2. Generate the Narrative: Once a photo is uploaded, Gemini analyzes the visual content and generates a compelling historical narrative. Users can even listen to the story using the built-in text-to-speech feature.

3. Capture & Re-imagine: For videos, you can pause and capture a specific frame. For any image or captured frame, you can then enter a text prompt to modify it.

4. View the Result: The app presents the original and the newly generated image side-by-side, instantly showing the power of your creative direction combined with AI.
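Step 3 (capturing a frame and handing it to the model) can be sketched as follows. These are assumed helper names, not the applet's actual implementation: `captureFrame` is browser-only (it draws the paused video onto an offscreen canvas), and `dataUrlToInlineData` converts the resulting data URL into the `{ mimeType, data }` shape the Gemini API expects for inline images.

```typescript
// Split a base64 data URL into the mimeType/base64 pair used for inline image parts.
function dataUrlToInlineData(dataUrl: string): { mimeType: string; data: string } {
  const match = dataUrl.match(/^data:(.+?);base64,(.*)$/);
  if (!match) throw new Error("Expected a base64-encoded data URL");
  return { mimeType: match[1], data: match[2] };
}

// Browser-only: grab the frame the user paused on as a JPEG data URL.
function captureFrame(video: HTMLVideoElement): string {
  const canvas = document.createElement("canvas");
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  canvas.getContext("2d")!.drawImage(video, 0, 0);
  return canvas.toDataURL("image/jpeg", 0.9); // "data:image/jpeg;base64,..."
}
```

The captured frame then flows through the same "Re-imagine" path as a directly uploaded image.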

Source Code
Link to Google AI Studio

How I Used Google AI Studio

Google AI Studio was the backbone of this project, allowing me to rapidly prototype and deploy a sophisticated multimodal application. I leveraged two key Gemini models:

  1. gemini-2.5-flash: I chose this model for the core narrative generation due to its incredible speed and powerful multimodal understanding. By providing it with an image or video file and a carefully crafted system prompt ("You are a historian and captivating storyteller..."), I could reliably generate high-quality, context-aware narratives that truly enhance the source media.

  2. gemini-2.5-flash-image-preview: This model is the engine behind the "Re-imagine" feature. Its image editing capabilities are phenomenal. The API was straightforward to implement; I passed the source image and the user's text prompt to the model, configuring the response to ensure it returned an edited image. This allowed for an intuitive and powerful creative tool within the app.
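Reading the edited image back out of the response might look like the sketch below. The nesting (`candidates` → `content` → `parts`, with `inlineData` carrying base64 image bytes) follows the public Gemini API response shape; the helper and interface names are mine, not the applet's:

```typescript
// Minimal response types covering only the fields this sketch touches.
interface InlinePart {
  inlineData?: { mimeType: string; data: string };
  text?: string;
}
interface GenerateResponse {
  candidates?: { content?: { parts?: InlinePart[] } }[];
}

// Return the first inline image in the response, or null if the model
// replied with text only (e.g. a refusal).
function extractEditedImage(
  response: GenerateResponse
): { mimeType: string; data: string } | null {
  for (const part of response.candidates?.[0]?.content?.parts ?? []) {
    if (part.inlineData) return part.inlineData;
  }
  return null;
}
```

On the request side, my setup sent the source image part together with the user's prompt and asked for image output via the response configuration, so the model would return an edited image rather than a description of one.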

The entire development and deployment process was streamlined through Google AI Studio, making it possible to go from concept to a fully functional, deployed applet efficiently.

Multimodal Features

The applet is built around two core multimodal functionalities that work in tandem to create a cohesive user experience.

  1. Multimodal Understanding (Media-to-Text): The primary feature is the app's ability to interpret visual media (images/videos) and translate that understanding into descriptive text. This is more than just object detection; it's about context, atmosphere, and historical inference.

    • Why it enhances the user experience: It adds a profound layer of depth and discovery. A static, silent photo is transformed into a gateway to a potential story, making history feel immediate and accessible. It turns a simple gallery viewer into an educational and storytelling tool.
  2. Multimodal Generation (Image + Text-to-Image): The "Re-imagine" feature allows for creative input on top of the historical analysis. It takes two distinct modalities—an existing image and a new text prompt from the user—and merges them to generate a completely new visual artifact.

    • Why it enhances the user experience: This fosters a deeper, more personal connection with the media. After learning the story behind a photo, the user is invited to become part of the creative process. This interactive loop of "learn, then create" is incredibly engaging and provides a unique way to explore history and "what if" scenarios visually.

Top comments (2)

Prema Ananda

Hey there!
Long time no see!
Really nice, clean, and straightforward project!
Meanwhile, I've overcomplicated mine so much that now I'm not even sure if I'll manage to finish it... 😅

Nikoloz Turazashvili (@axrisi)

hey!
haha, yes, I was busy raising capital for my startup :)
now I've found some time on weekends to do something I enjoy.

thanks for the kind words. Looking forward to seeing your submission. <3