As developers, we are used to chaining APIs to get a desired output. In the world of Generative AI, a similar pattern emerges: Model Chaining.
Creating high-quality AI videos often requires orchestrating a workflow, not just typing in text and hitting the "generate!" button. Today, I’m going to walk through a specific stack: Gemini 2.5 Pro (for reasoning/prompting), NanoBanana (for base image generation), and Veo 3.1 (for image-to-video), to simulate a hyper-realistic doorbell security camera feed of a very cute fennec fox playing with LEGO bricks.
Below is a breakdown of how I went from a blank slate to a coherent video, the prompts I used in Google AI Studio, and a critique of the generated video output. Let's get started! 😄
The model chain
- Gemini. Used to reason through the visual aesthetics and generate the complex prompts needed for the image and video generation models.
- NanoBanana. Used to generate the initial static image asset, in portrait mode (9:16).
- Veo 3.1 Fast. Used to apply physics and motion to the static image asset, also in portrait mode (9:16).
Phase 1: The Base Image
The hardest part of image generation is getting the "vibes" and the character consistency right. For this example, I wanted a specific medium — a grainy, night-vision security camera from the perspective of a home doorbell.
Instead of guessing keywords, I asked Gemini to act as a prompt engineer for the task at hand. I provided the concept ("Fennec fox, Legos, night, doorbell cam") and asked it to write a prompt for an image model.
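If you'd rather script this step than click through the AI Studio UI, here's a minimal sketch using the google-genai Python SDK. The meta-prompt wording and the GEMINI_API_KEY environment variable are my assumptions, not a record of what I actually typed:

```python
import os

from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

# Ask Gemini to act as the prompt engineer: concept in, image prompt out.
meta_prompt = (
    "You are a prompt engineer for a text-to-image model. "
    "Concept: fennec fox, Legos, night, doorbell cam. "
    "Write one detailed prompt that captures a grainy night-vision "
    "doorbell-camera aesthetic. Return only the prompt text."
)

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=meta_prompt,
)
image_prompt = response.text.strip()
print(image_prompt)
```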
The Gemini-Generated Prompt:
A grainy, low-quality doorbell camera snapshot at night. Infrared night vision aesthetic with a slight monochromatic green tint. A wide-angle fisheye lens view looking down at a front porch welcome mat. A cute fennec fox with large ears is sitting on the mat, looking up at the camera with glowing reflective eyes. The fox is surrounded by scattered LEGO bricks. The LEGO bricks are arranged on the floor to clearly spell out the word "HI :)" in block letters. Digital overlay text in the corner says "FRONT DOOR - LIVE" and the current timestamp.
Why this works:
- Artifact Injection: Words like "grainy," "low-quality," and "monochromatic green tint" prevent the model from making the image too clean or artistic. It forces realism through imperfection.
- Camera Specs: Specifying "fisheye lens" and "looking down" enforces the correct perspective distortion typical of Ring/Nest cameras.
The Result:
NanoBanana output a near-perfect static image. The lighting was flat (typical of IR), the eyes glowed (retro-reflection), and the angle was distinctively "doorbell."
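For readers who prefer the API, the NanoBanana call looks roughly like this. I'm assuming the model ID gemini-2.5-flash-image (NanoBanana's API name) and the image_config aspect-ratio option; both may vary by SDK version, so check the docs before running it:

```python
import os

from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

image_prompt = "A grainy, low-quality doorbell camera snapshot at night..."  # full prompt above

response = client.models.generate_content(
    model="gemini-2.5-flash-image",  # assumed API name for NanoBanana
    contents=image_prompt,
    config=types.GenerateContentConfig(
        image_config=types.ImageConfig(aspect_ratio="9:16"),  # portrait mode
    ),
)

# The generated image comes back as inline bytes on a response part.
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        with open("fox_doorbell.png", "wb") as f:
            f.write(part.inline_data.data)
```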
Phase 2: The Animation
If you simply tell a video model "make it move," current models tend to hallucinate wild camera pans or morph the subject. You need to provide direction. To do this, I fed the static image back into Gemini and asked it to write a prompt for animating the image. After looking over the candidate prompts Gemini produced, I selected one that focused on interaction and physics.
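In code, this round trip is just a multimodal call: pass the image and the instruction together. Again, a sketch under my own assumptions (filename, instruction wording):

```python
import os

from google import genai
from PIL import Image

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

# Hand the static frame back to Gemini and ask for animation prompt candidates.
frame = Image.open("fox_doorbell.png")
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        frame,
        "Write three candidate prompts for an image-to-video model that "
        "animate this frame. Focus on the subject's movement and physics; "
        "keep the camera static.",
    ],
)
print(response.text)  # pick the candidate that emphasizes interaction
```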
The Video Prompt:
The cute fennec fox looks down from the camera towards the LEGO bricks on the mat. It gently extends one front paw and nudges a loose LEGO brick near the "HI", sliding it slightly across the mat. The fox then looks back up at the camera with a playful, innocent expression. Its ears twitch. The camera remains static.
I fed this prompt and the static image into Veo 3.1 Fast.
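Through the API, Veo runs as a long-running operation that you poll. Here's a hedged sketch of that flow; the model ID veo-3.1-fast-generate-preview is my guess, so check the current model list before relying on it:

```python
import os
import time

from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

video_prompt = "The cute fennec fox looks down from the camera..."  # full prompt above

# Kick off image-to-video generation; this returns a long-running operation.
operation = client.models.generate_videos(
    model="veo-3.1-fast-generate-preview",  # assumed ID for Veo 3.1 Fast
    prompt=video_prompt,
    image=types.Image.from_file(location="fox_doorbell.png"),
    config=types.GenerateVideosConfig(aspect_ratio="9:16"),
)

# Poll until the video is ready, then download it.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("fox_doorbell.mp4")
```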
Phase 3: Analyzing the Veo Output
Let’s look at the resulting video file and analyze the execution against the prompt:
Wins
Temporal coherence (lighting and texture):
The most impressive aspect is the consistency of the night-vision texture. The "grain" doesn't shimmer uncontrollably, and the monochromatic green remains stable throughout the 7 seconds. The fur texture on the fox changes naturally as it moves, rather than boiling or morphing.
The "Fisheye" effect:
Veo 3.1 respected the distortion of the original image. When the fox leans down and back up, it moves within the 3D space of that distorted lens. It doesn't flatten out.
Ear dynamics:
The prompt specifically asked for "ears twitch." Veo nailed this. The ears move independently and reactively, which is a critical trait of fennec foxes. This adds a layer of biological realism to the generated movement.
Camera locking:
The prompt specified "The camera remains static." This is crucial. Early video models often added unnecessary pans or zooms. Veo kept the frame locked, reinforcing the "mounted security camera" aesthetic.
Bugs
Object Permanence (the LEGOs):
While the prompt asked the fox to "nudge a loose LEGO," the model struggled with rigid-body physics. Instead of a clean slide, the LEGOs near the paws tend to morph or "melt" slightly as the fox interacts with them. The "HI" text also loses integrity, shifting into abstract shapes by the end of the clip.
Motion Interpretation:
The prompt asked for a gentle paw extension. The model interpreted this more as a "pounce" or a head-dive. The fox dips its whole upper body down rather than isolating the paw. While cute, it’s a deviation from the specific articulation requested.
Text Overlay (OCR Hallucination):
The original image had a crisp timestamp. As soon as motion begins, the text overlay ("FRONT DOOR - LIVE") becomes unstable. Video models still struggle to keep text overlays static while animating the pixels behind them. The timestamp blurs and fails to count up logically.
The "Welcome" Mat:
If you look closely at the mat, the text (presumably "WELCOME") is geometrically inconsistent. As the fox moves over it, the letters seem to shift their orientation slightly, revealing that the model treats the mat as a texture rather than a flat plane in 3D space.
TL;DR
Using an LLM like Gemini to generate prompts for media models is a massive efficiency booster! And while Veo 3.1 Fast demonstrates an incredible understanding of lighting, texture, and biological movement (the ears!), it can, like all current video models, still struggle with rigid object interaction (LEGOs) and static text overlays.
Quick tips: Be specific about camera angles and lighting in your text-to-image phase. In the video phase, focus your prompts on the subject's movement, but expect some fluidity in the background objects. And use Gemini 2.5 Pro to help with prompting.
