VisionGen

This is a submission for the Google AI Studio Multimodal Challenge

What I Built

VisionGen is a next-gen tool for "Video-to-Video" generation. Because plain prompting often doesn't give you the results you want, I cooked up an a-la-carte JSON prompting applet that automates prompt generation for the most accurate results.

The applet takes the pain out of time-consuming video annotation by automating object detection, tracking, and scene segmentation with precision. That kind of annotation is normally used for computer vision training, but VisionGen skips the training step and puts it to immediate practical use: creating new videos from a reference.

This helps you "get it right the first time": if a generated video isn't what you want and you have to regenerate it, you pay for both attempts. VisionGen is designed to maximize your chances of getting the perfect video on the first try.

Demo

VisionGen

Video Analysis Features

  • Consistent Object Tracking → Increase confidence threshold if objects are missed.
  • Bounding Boxes → With object classes that can be filtered or excluded.
  • Contextual Descriptions → Can be edited or modified for final output.
  • Transcriptions → Provides temporal cues based on timestamps.
  • Timeline Visualization → Jump to a specific moment in the video by clicking on the text.
  • Scene Segmentation → Automatic detection of scene changes and storyline.
  • Screenshots → The secret hidden ingredient exclusive to VisionGen's workflow.

Multimodal Features

🎥 Video Understanding and Generation

Gemini is used to understand temporal relationships and object movements. It generates a coherent video that maintains object consistency and follows the narrative provided.

🔍 Object Tracking with Occlusion Handling

Maintains consistent object IDs throughout videos, even when objects are:

  • Temporarily occluded (hidden behind other objects)
  • Partially visible
  • Leaving and re-entering the frame

The model interpolates positions based on trajectory before and after occlusions, ensuring continuous tracking.
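
To illustrate the idea (this is a sketch of the technique, not the applet's actual code, and the Detection shape is an assumption):

// Assumed annotation shape: one timestamped bounding box for a tracked object.
interface Detection {
  timestamp: number;                      // seconds into the video
  bbox: [number, number, number, number]; // [x, y, width, height] in pixels
}

// Linearly interpolate the box at time t between the last detection before an
// occlusion and the first detection after it, keeping the object ID continuous.
function interpolateOcclusion(before: Detection, after: Detection, t: number): Detection {
  const ratio = (t - before.timestamp) / (after.timestamp - before.timestamp);
  const bbox = before.bbox.map(
    (value, i) => value + (after.bbox[i] - value) * ratio
  ) as Detection["bbox"];
  return { timestamp: t, bbox };
}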

🎬 Scene Segmentation

The AI identifies distinct scene changes with precise timestamps and descriptions, enabling users to understand the overall structure quickly.
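
As a rough illustration, a scene segment can be thought of as a start/end pair plus a description (the field names here are assumptions, not the applet's exact schema):

// Assumed shape of one detected scene; the applet's actual fields may differ.
interface SceneSegment {
  start: number;       // seconds
  end: number;         // seconds
  description: string; // what the scene shows
}

const scenes: SceneSegment[] = [
  { start: 0, end: 12.4, description: "Wide establishing shot of a rainy street." },
  { start: 12.4, end: 29.5, description: "Close-up of the driver inside the car." },
];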

⚙️ Configurable Analysis Parameters

Users can customize analysis settings to balance detail and processing speed, as sketched after this list:

  • Confidence Threshold: Filter out lower-confidence detections.
  • Frame Rate: Control analysis granularity for different video lengths.
  • Time Range Focusing: Analyze specific segments of longer videos.
  • Add/Remove Audio: Optional audio for balancing cost or to overlay your own audio.
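
A minimal sketch of what such a settings object could look like, with illustrative field names rather than the applet's exact API:

// Illustrative analysis options; names are assumptions, not the applet's exact API.
interface AnalysisSettings {
  confidenceThreshold: number;                // e.g. 0.6 — drop detections below this score
  frameRate: number;                          // frames sampled per second of video
  timeRange?: { start: number; end: number }; // optional segment (in seconds) to analyze
  includeAudio: boolean;                      // drop audio to save cost, or overlay your own later
}

const defaultSettings: AnalysisSettings = {
  confidenceThreshold: 0.6,
  frameRate: 1,
  includeAudio: true,
};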

How I Used Google AI Studio

Built entirely using Google AI Studio:

  1. Gemini 2.5 Flash Integration: video understanding that analyzes uploaded files frame by frame, extracting detailed annotations and building a narrative.
  2. veo-2.0-generate-001 endpoint (default): the applet is designed to be model-agnostic and has two endpoints configured, the second being veo-3.0-fast-generate-preview.
  3. GoogleGenAI SDK: for communication with the Gemini and Veo APIs using structured prompts for both analysis and generation.
  4. Cloud Run Deployment: for a scalable, secure deployment.

The application is designed to communicate with Google's models, including Veo, directly from the browser using the official @google/genai JavaScript SDK.
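
For context, a browser-side round trip with the SDK looks roughly like this (a sketch with placeholder prompts and variables, not the applet's exact code):

import { GoogleGenAI } from "@google/genai";

declare const userApiKey: string;  // supplied by the user in Settings (BYOK)
declare const videoBase64: string; // the uploaded video, Base64-encoded
declare const jsonPrompt: string;  // the structured prompt built from the analysis

const ai = new GoogleGenAI({ apiKey: userApiKey });

// 1. Analysis: ask Gemini to describe the uploaded video.
const analysis = await ai.models.generateContent({
  model: "gemini-2.5-flash",
  contents: [
    { inlineData: { mimeType: "video/mp4", data: videoBase64 } },
    { text: "Describe the objects, scenes and narrative of this video." },
  ],
});

// 2. Generation: send the structured prompt to Veo and poll the long-running operation.
let operation = await ai.models.generateVideos({
  model: "veo-2.0-generate-001",
  prompt: jsonPrompt,
});
while (!operation.done) {
  await new Promise((resolve) => setTimeout(resolve, 10_000));
  operation = await ai.operations.getVideosOperation({ operation });
}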

How it works

1. Standard JSON Prompting:

JSON Prompt Generation: automatically transforms the video analysis data into a structured, narrative JSON prompt, ready for video generation.

The standard feature of the applet is interactive video generation, allowing users to review and, if needed, edit the AI-generated script before creating a new video using a selected model.

✨ 2. Advanced JSON Prompting:

The JSON prompt was upgraded to collect the video analysis data and assemble the intermediate pieces of an a-la-carte JSON prompt, including:

  • model
  • prompt
  • negative prompt
  • seed
  • keyframes (screenshots)
  • transcription

This new JSON prompt includes a narrative, excluded objects (negative prompts), keyframes from the source video, and transcription data, giving the AI maximum context from the original video and improving the final output's adherence to your vision.


Why JSON Prompts?

The raw Video Analysis data is a spreadsheet of disconnected facts:

timestamp: 1.2s, object: 'car', bbox: [...]
timestamp: 1.3s, object: 'person', bbox: [...]
timestamp: 1.4s, object: 'car', bbox: [...]

generateNarrativeForVideo uses the Gemini text model as a "scriptwriter": we ask it to convert the raw data into a structured array of NarrativePoint objects, which is then combined with the rest of the metadata into the final a-la-carte prompt:

{
  "model": "veo-3.0-fast-generate-preview",
  "prompt": "Joker appears from the right side of the frame...",
  "negativePrompt": "...",
  "seed": 94272,
  "keyframes": [
    {
      "timestamp": 44.4,
      "image": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ..." // The very long Base64 string
    },
    {
      "timestamp": 48,
      "image": "data:image/jpeg;base64,/9j/2wBDAAYEBQYFBAY..." // Another very long string
    }
  ],
  "transcription": "[29.47s - 30.07s] Hey Arthur..."
}

This JSON acts as a chronological shot list, or script. It forces the AI to organize the otherwise chaotic events and to use the source video as the reference for the new one.
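
A minimal sketch of what the scriptwriter call could look like using the SDK's structured-output support (the NarrativePoint fields and the prompt text are assumptions, not the applet's exact implementation):

import { GoogleGenAI, Type } from "@google/genai";

// Assumed shape of one entry in the chronological "shot list".
interface NarrativePoint {
  timestamp: number;   // seconds into the source video
  description: string; // what happens at this moment
}

declare const userApiKey: string;      // supplied by the user in Settings
declare const rawAnalysisData: string; // the timestamp/object/bbox rows shown above

const ai = new GoogleGenAI({ apiKey: userApiKey });

// Ask Gemini to act as a scriptwriter and return JSON matching a schema.
const response = await ai.models.generateContent({
  model: "gemini-2.5-flash",
  contents: `Turn these detections into a chronological shot list:\n${rawAnalysisData}`,
  config: {
    responseMimeType: "application/json",
    responseSchema: {
      type: Type.ARRAY,
      items: {
        type: Type.OBJECT,
        properties: {
          timestamp: { type: Type.NUMBER },
          description: { type: Type.STRING },
        },
        required: ["timestamp", "description"],
      },
    },
  },
});

const narrative: NarrativePoint[] = JSON.parse(response.text ?? "[]");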


☭ Division of Labor Works Better Together

1. Seeding

What is seeding and why is it included in the metadata?

The prompt is the destination, and the seed provides the path the AI takes to get there.

Seeding is for consistency. A seed is a reference value assigned to a prompt so the model adheres to specific details in the reference data; by following the same "path," it reproduces the same output each time, even with minor prompt variations.

If you can predict the result, you can manipulate small details of the narrative. For example, you can add, remove, or modify a prompt's details, like the color of a car, without unintentionally changing the car type or the direction it's driving in.
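
Here is a sketch of how a seed and negative prompt might travel alongside the prompt (the config fields follow the SDK's generation config; whether a particular model honors a seed can vary):

import { GoogleGenAI } from "@google/genai";

declare const userApiKey: string;
const ai = new GoogleGenAI({ apiKey: userApiKey });

// Re-running with the same seed and a nearly identical prompt should follow the
// same "path": editing "red" to "blue" changes the car's color without silently
// changing the car type or the direction it drives in.
const operation = await ai.models.generateVideos({
  model: "veo-2.0-generate-001",
  prompt: "A red car drives down a rain-soaked street at night.",
  config: {
    negativePrompt: "cartoon, low quality",
    seed: 94272, // same seed → same composition, camera path and timing
  },
});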

  2. 📸 Chaining

One API call to a model like Veo generates an 8-second video segment. Each subsequent call requires a new prompt and starts from a brand-new first frame, which can break the context between segments.

Solution: we pass a screenshot of the previous segment's last frame, Base64-encoded, into the next step. The new prompt describes what should happen next, continuing from that screenshot.

This approach grounds the AI, forcing it to adopt the original video's color palette, lighting, object style, and composition. This process eliminates ambiguity and is the single most important addition for consistency.

This process can be repeated for as many segments as you need to reach your desired video length.
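
Below is a sketch of a single chaining step, assuming the previous segment's last frame has already been captured as a Base64 JPEG (for example via a canvas screenshot):

import { GoogleGenAI } from "@google/genai";

declare const userApiKey: string;
declare const lastFrameBase64: string; // Base64 JPEG of the previous segment's final frame

const ai = new GoogleGenAI({ apiKey: userApiKey });

// The screenshot grounds the next segment: the new clip starts from this image,
// and the prompt only needs to describe what happens next.
const nextSegment = await ai.models.generateVideos({
  model: "veo-2.0-generate-001",
  prompt: "The car pulls over to the curb and the driver steps out into the rain.",
  image: { imageBytes: lastFrameBase64, mimeType: "image/jpeg" },
});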


📊 Export Formats:

  • YOLO Format: Optimized for object detection model training (conversion sketched after this list).
  • COCO JSON: Compatible with popular computer vision frameworks.
  • A-la-carte JSON: A detailed JSON file with scripts and keyframes.
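
For reference, a YOLO label is one line per object: the class index followed by a bounding box normalized to the frame size. A pixel-space-to-YOLO conversion could look roughly like this (the bbox layout is an assumption):

// Convert a pixel-space [x, y, width, height] box into one YOLO label line:
// "<classId> <xCenter> <yCenter> <width> <height>", all coordinates normalized to 0–1.
function toYoloLine(
  classId: number,
  bbox: [number, number, number, number],
  frameWidth: number,
  frameHeight: number
): string {
  const [x, y, w, h] = bbox;
  const xCenter = (x + w / 2) / frameWidth;
  const yCenter = (y + h / 2) / frameHeight;
  const normalized = [xCenter, yCenter, w / frameWidth, h / frameHeight]
    .map((v) => v.toFixed(6))
    .join(" ");
  return `${classId} ${normalized}`;
}

// toYoloLine(0, [100, 50, 200, 120], 1920, 1080)
//   → "0 0.104167 0.101852 0.104167 0.111111"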

Persistent Storage: All metadata is saved locally along with the analysis, so it will be restored when you load a project from your history.


To use the applet and generate videos, Bring Your Own API Key and update it in Settings. Your API key is saved safely in your own web browser using localStorage.
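
A sketch of the BYOK storage idea, with illustrative key names rather than the applet's actual layout:

// The API key stays in the user's browser (localStorage) rather than on a server.
const API_KEY_STORAGE = "visiongen.apiKey"; // illustrative key name

function saveApiKey(key: string): void {
  localStorage.setItem(API_KEY_STORAGE, key);
}

function loadApiKey(): string | null {
  return localStorage.getItem(API_KEY_STORAGE);
}

// Project history works the same way: the analysis and its metadata are
// serialized to JSON and restored when a project is loaded from history.
function saveProject(id: string, project: unknown): void {
  localStorage.setItem(`visiongen.project.${id}`, JSON.stringify(project));
}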
