Nan He
Cutto — Viral Video Replicator

Upload any viral video. Get a brand-new version featuring your products—same energy, your brand.


Inspiration

Every brand wants to go viral. The problem is that going viral isn't random — viral videos follow repeatable formulas: a specific hook in the first 2 seconds, a precise editing rhythm, a visual language that triggers engagement. But most brands don't have the budget to hire a creative director who can reverse-engineer those formulas and rebuild them from scratch for every product launch.

We built Cutto to answer one question: what if AI could watch a viral video, understand exactly why it works, and recreate it for any brand?


What it does

Cutto is a Creative Storyteller agent that takes two inputs — a viral video and a brand description — and produces a complete brand video that replicates the original's visual style, pacing, editing rhythm, and emotional energy, featuring your products.

The experience has five steps:

1. Upload a viral video. Any short-form video from TikTok, Instagram Reels, or YouTube Shorts. Cutto uploads it to Google Cloud Storage and immediately begins analysis.

2. Gemini analyzes the viral formula. Gemini 3.1 Pro watches the video at 6 FPS — a higher frame rate than the default 1 FPS, chosen specifically because fast-cut viral videos have scene changes that 1 FPS would miss entirely. Gemini extracts: the hook strategy (what happens in the first 2-3 seconds to prevent scrolling), the editing rhythm and cut frequency, the visual style including color grading and camera movement, the audio energy profile, a full scene-by-scene storyboard with timing and continuity flags, and the transition type between each scene. All of this streams back to the frontend in real time via Server-Sent Events, so the user watches the analysis build up live.
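The streaming side of this step can be sketched as a plain generator that frames each extracted analysis field as its own Server-Sent Event. The function and field names here are illustrative, not the production code:

```python
import json

def sse_event(event_type: str, payload: dict) -> str:
    """Format one Server-Sent Event frame: an event name plus a JSON data line."""
    return f"event: {event_type}\ndata: {json.dumps(payload)}\n\n"

def stream_analysis(fields):
    """Yield each analysis field (hook, cut frequency, storyboard scene, ...)
    as its own SSE frame so the frontend can render the analysis progressively."""
    for name, value in fields:
        yield sse_event("analysis", {"field": name, "value": value})
    yield sse_event("done", {})
```

On the frontend, each `analysis` event fills in one section of the live analysis view as it arrives.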

3. Browse the original storyboard. The original video's keyframes are extracted using FFmpeg and displayed in a 3D film-strip carousel — each frame synced with Gemini's scene description and visual analysis. The carousel uses CSS 3D transforms to create a physical drum/cylinder effect, with the center card forward-facing and side cards rotated inward.
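Keyframe extraction like this usually comes down to one ffmpeg invocation per scene timestamp. A minimal sketch of building that command (the exact flags used in Cutto may differ):

```python
def keyframe_cmd(video_path: str, timestamp: float, out_path: str) -> list[str]:
    """Build an ffmpeg command that grabs a single frame at `timestamp` seconds.
    Seeking before -i is fast input seeking; -frames:v 1 stops after one frame."""
    return [
        "ffmpeg", "-y",
        "-ss", f"{timestamp:.3f}",
        "-i", video_path,
        "-frames:v", "1",
        "-q:v", "2",          # high-quality JPEG output
        out_path,
    ]
```

Running one of these per scene-start timestamp from the storyboard yields the frames the carousel displays.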

4. Create a Director Script with interleaved storyboard images. The user enters their brand description and any style adjustments. Cutto calls Gemini 3.1 Pro to generate a complete director script that adapts the viral formula to the brand's specific products — same number of scenes, same durations, same structural pacing, but with brand-appropriate actions and visual language. What makes this step distinctive is what happens next: simultaneously, gemini-3.1-flash-image-preview generates a concept storyboard image for each scene using Gemini's interleaved text + image output capability. The text script and the images stream back together in a single SSE response, arriving interleaved — exactly as the Creative Storyteller category requires. As each image arrives, it fills into the film carousel, replacing the original video's frames with concept art for the new brand video. The user can scroll through the carousel and read the director script for each scene before committing to generation.
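Handling an interleaved stream comes down to inspecting each streamed part and routing it to the right SSE event. The `.text` / `.inline_data` attribute names follow the google-genai response schema; the event shapes are our own convention, shown here as a sketch:

```python
import base64
import json

def dispatch_part(part) -> str:
    """Turn one streamed response part into an SSE frame: text parts become
    director-script events, inline image parts (scene concept art) become
    base64-encoded image events the carousel can render directly."""
    if getattr(part, "text", None):
        return f"data: {json.dumps({'type': 'script', 'text': part.text})}\n\n"
    blob = part.inline_data  # carries .mime_type and raw .data bytes
    b64 = base64.b64encode(blob.data).decode()
    return f"data: {json.dumps({'type': 'image', 'mime': blob.mime_type, 'b64': b64})}\n\n"
```

Because text and image parts arrive in generation order, simply forwarding them in sequence preserves the interleaving the frontend relies on.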

5. Generate with Veo 3.1 and smart editing. Once the user confirms the director script, Cutto identifies which consecutive scenes form physically continuous action sequences — for example, "open jewelry box → take out ring → slide onto finger" is one continuous sequence. Rather than generating each scene independently (which would produce inconsistent objects and lighting across cuts), Cutto generates the first scene as an 8-second base clip with Veo 3.1 Fast, then uses Veo's Scene Extension to extend the clip for each subsequent scene in the group. This ensures the jewelry box, the ring, the hands, and the lighting are all physically consistent across cuts, because they were all generated in the same continuous clip. Independent scenes are generated separately. After all clips are generated, Gemini watches each clip at 8 FPS and finds the best cut point based on the director script's cut requirements. FFmpeg composites the clips with per-scene transitions (cut, fade, dissolve, wipe) into the final video, which is uploaded to Cloud Storage and returned as a signed URL.
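The grouping step above is a small pure function once each scene carries a continuity flag from the director script. A sketch, with the flag name (`continuous_with_previous`) chosen here for illustration:

```python
def group_scenes(scenes: list[dict]) -> list[list[dict]]:
    """Group consecutive scenes that form one physically continuous action
    sequence. A truthy 'continuous_with_previous' flag merges the scene into
    the open group; anything else starts a new group."""
    groups: list[list[dict]] = []
    for scene in scenes:
        if groups and scene.get("continuous_with_previous"):
            groups[-1].append(scene)
        else:
            groups.append([scene])
    return groups
```

Each resulting group then maps to one base clip plus Scene Extensions; single-scene groups are generated independently.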


How we built it

Backend: Python 3.13 + FastAPI, deployed on Google Cloud Run with 2GB RAM and a 3600-second timeout (necessary because Veo generation takes several minutes per clip). All AI calls go through the google-genai SDK. The backend is stateless — job state is held in an in-memory dict, sufficient for a hackathon demo.

Frontend: Next.js 15 + TypeScript + Tailwind CSS v4, deployed on Vercel. The UI uses a glassmorphism design system with a deep purple gradient, frosted glass cards, and the Syne + Inter font pairing. All Gemini analysis and director script generation uses SSE streaming — the frontend reads the stream as it arrives and progressively updates the UI.

AI pipeline:

  • gemini-3.1-pro-preview at 6 FPS for video analysis and director script generation
  • gemini-3.1-flash-image-preview for storyboard concept image generation (interleaved output)
  • veo-3.1-fast-generate-preview for clip generation and Scene Extension
  • gemini-3.1-pro-preview at 8 FPS for finding best cut points in generated clips

Storage: Google Cloud Storage (gs://cutto-videos) for all video assets, generated clips, and extracted keyframes.


Challenges we ran into

Visual consistency across scenes. When Veo generates each scene independently, physically related scenes produce different objects — the ring in scene 2 looks different from the ring in scene 3 even if the prompt says the same thing. The solution was continuous scene groups: identify sequences of physically connected actions during director script generation, generate them as one extended clip using Veo Scene Extension, and then use Gemini to find each individual scene's cut point within the long clip. This gives visual consistency without sacrificing per-scene control over timing.
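Finding each scene's cut point inside the long clip needs a search window per scene. One way to derive those windows is to lay the director script's target durations end to end, then let Gemini refine the exact cut inside each span. A minimal sketch under that assumption:

```python
def scene_spans(durations: list[float]) -> list[tuple[float, float]]:
    """Map per-scene target durations onto (start, end) spans inside one
    extended clip. Each span is the search window handed to the cut-point
    analysis for that scene."""
    spans, t = [], 0.0
    for d in durations:
        spans.append((t, t + d))
        t += d
    return spans
```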

Physical realism in Veo prompts. Early generations produced common AI artifacts: objects materializing from nothing, jewelry wrapping itself onto wrists without hands, rings sliding onto fingers from the side rather than the tip. The root cause was prompts that described outcomes ("the bracelet is fastened") without describing the physical process. We added a set of physical realism rules to the director script prompt that enforce: initial frame state description, complete physical causality for every action, and explicit contact mechanics. This significantly reduced artifact frequency.
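In practice this amounts to appending a fixed rule block to every per-scene Veo prompt. The rule wording below paraphrases the three rules described above; it is not the exact production text:

```python
PHYSICAL_REALISM_RULES = """\
- Describe the initial frame state before any action begins.
- Give every action complete physical causality (hand enters frame, grips, lifts).
- Specify explicit contact mechanics (which fingers touch which surface, from where).
"""

def build_scene_prompt(action: str) -> str:
    """Append the physical realism rules to a per-scene action description
    before sending it to Veo."""
    return f"{action}\n\nPhysical realism requirements:\n{PHYSICAL_REALISM_RULES}"
```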

SSE streaming from FastAPI. FastAPI's StreamingResponse, combined with asyncio.as_completed for parallel image generation, had edge cases where exceptions raised inside the generator were swallowed by Starlette's exception handling rather than propagating cleanly. The fix was to wrap each generator step in try/except and emit explicit error SSE events.
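The wrapping pattern can be sketched as a decorator-style async generator that converts any exception into an error event the client can display, instead of a silently truncated stream:

```python
import json

async def safe_sse(gen):
    """Wrap an async SSE generator so exceptions surface as explicit
    'error' events rather than being swallowed inside StreamingResponse."""
    try:
        async for frame in gen:
            yield frame
    except Exception as exc:  # report everything to the client
        yield f"event: error\ndata: {json.dumps({'message': str(exc)})}\n\n"
```

The wrapped generator is what gets handed to StreamingResponse, so the frontend always receives either a `done` event or an `error` event.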

FFmpeg clip concatenation. Using -c copy for concatenation caused clips to silently drop when keyframe boundaries didn't align — a common issue with Veo-generated MP4s. Switching to libx264 re-encode for all concatenation fixed this at the cost of some processing time.
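The resulting concatenation command looks roughly like this, using ffmpeg's concat demuxer with a forced libx264 re-encode (exact encoder flags are illustrative):

```python
def concat_cmd(clip_list_path: str, out_path: str) -> list[str]:
    """Concat demuxer with a libx264 re-encode. Re-encoding (instead of
    -c copy) normalizes keyframe boundaries across Veo-generated MP4s so
    no clip is silently dropped at the joins."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",
        "-i", clip_list_path,   # text file listing clips: file 'clip_01.mp4' ...
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        out_path,
    ]
```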


Accomplishments that we're proud of

The interleaved output step is the most satisfying part of the demo. Watching text scene descriptions and concept images arrive simultaneously — populating the film carousel in real time, replacing the original video's frames with AI-generated concept art — captures exactly what "Creative Storyteller" means. It's not a text generator or an image generator; it's an agent that produces a mixed-media storyboard as a single coherent output stream.

The continuous scene group approach also ended up being a genuinely better architecture than we expected. Not only does it solve visual consistency, it also gives Gemini more context when finding cut points — analyzing a 22-second clip that contains three related scenes is more accurate than analyzing three separate 8-second clips with no awareness of each other.


What we learned

Physical realism in generative video is almost entirely a prompt engineering problem. Veo is capable of generating physically plausible footage — the key is giving it enough information about the starting state of the scene and the complete physical mechanism of every action. "A hand picks up a ring" produces artifacts. "The ring already rests in the open pink velvet box. A hand enters from the right, the index finger and thumb pinch the band, and lift the ring straight upward out of the cushion slot" produces clean footage.

We also learned that 1 FPS video sampling for analysis is insufficient for fast-cut viral content. Bumping to 6 FPS meaningfully improved scene detection and continuity flag accuracy — at the cost of higher token usage, but worth it for a tool whose output quality depends entirely on accurate analysis.


What's next for Cutto

  • Custom asset injection: Let users upload product photos that Veo uses as reference images for consistency across all generated clips
  • Audio generation: Veo 3.1 supports native audio generation — add background music and sound effects that match the original video's audio energy
  • Multi-platform export: Generate in 9:16 (Reels/TikTok), 1:1 (feed), and 16:9 (YouTube) in one pass
  • Campaign batching: Generate a full week of content variations from a single viral reference video

Built with

gemini-3.1-pro-preview · gemini-3.1-flash-image-preview · veo-3.1-fast-generate-preview · Google Cloud Run · Google Cloud Storage · Vertex AI · FastAPI · Next.js 15 · Tailwind CSS v4 · FFmpeg · google-genai SDK >= 1.65.0

This project was created for the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge
