Anup Karanjkar

Posted on Jun 5 • Originally published at wowhow.cloud

Veo 3.1 Developer Guide: Timestamp Prompting, Multi-Shot Video, and Full API (June 2026)

#veo31 #googleveo #veo31fastgeneratepre

Veo 3.1 shipped with six capabilities that Veo 3 lacked, and timestamp prompting, the ability to direct multiple shots inside a single 8-second generation call, is the one that changes how you structure every production pipeline. The model reached Vertex AI general availability in late May 2026 and hit the Gemini API developer tier in early June. If you're still generating clips the same way you did with Veo 3, you're leaving cost efficiency and creative control on the table.

This is the complete developer guide: three model variants with exact pricing, working Python code for every major API pattern, and an honest rundown of what still doesn't work.

Three Models, Three Price Points

Veo 3.1 ships as three distinct models. The quality difference between Standard and Fast is real but narrow for most social and web content; the difference between Fast and Lite is more significant. All three generate synchronized native audio by default.

Model	API ID	Price/sec	8s clip cost	Best for

Standard delivers the cinematic quality Google demonstrated at I/O 2026. Fast is roughly 70% of Standard quality at 37% of the cost — indistinguishable to most viewers on a phone screen, but visible under technical review in fine detail and hair. Lite is the iteration tier: good enough to verify shot composition, timing, and prompt intent before you commit Fast or Standard budget to the final version.

All three support 1080p output at 16:9 and 9:16 aspect ratios. Duration options are 4, 6, or 8 seconds per generation call. Disabling audio cuts roughly 33% off the per-second rate on any tier, which is useful for clips where you're adding a post-production soundtrack anyway.

One pricing gotcha: the Gemini API documentation for Veo 3.1 Lite shows a "$0.05 per video" number that multiple developers in the Google AI forum have flagged as misleading. The billing is per-second, not per-video, so an 8-second Lite clip costs $0.40, not $0.05. Check the seconds you are actually billed for on a small test run before committing to volume.
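
As a quick sanity check on per-second billing, the sketch below estimates the cost of a single call. The per-second rates are assumptions inferred from figures quoted in this guide (an 8-second Lite clip at $0.40, an 8-second Fast clip at $1.20, and an audio-off saving of roughly 33%), not values read from a price sheet. The clip_cost helper is hypothetical, so confirm the numbers against your own billing export before relying on them.

```python
# Per-second rates in USD with audio on. These are assumptions inferred from
# the figures quoted in this guide, not an official price list.
ASSUMED_RATE_PER_SEC = {"standard": 0.40, "fast": 0.15, "lite": 0.05}
ALLOWED_DURATIONS = (4, 6, 8)   # seconds per generation call
AUDIO_OFF_DISCOUNT = 0.33       # "roughly 33%", an approximation


def clip_cost(tier: str, seconds: int, audio: bool = True) -> float:
    """Estimate the cost of one generation call, billed per second."""
    if seconds not in ALLOWED_DURATIONS:
        raise ValueError(f"duration must be one of {ALLOWED_DURATIONS}")
    rate = ASSUMED_RATE_PER_SEC[tier]
    if not audio:
        rate *= 1 - AUDIO_OFF_DISCOUNT
    return round(rate * seconds, 2)


if __name__ == "__main__":
    # Print audio-on and audio-off estimates for an 8-second clip per tier.
    for tier in ASSUMED_RATE_PER_SEC:
        print(tier, clip_cost(tier, 8), clip_cost(tier, 8, audio=False))
```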

API Setup: Your First Generation

Veo 3.1 requires version 1.52+ of the google-genai Python SDK and a Gemini API key with Paid Tier access. It is not available on the free tier. Video generation is asynchronous: unlike the synchronous image API, calls return an operation object that you poll until completion:

import time
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_GEMINI_API_KEY")

operation = client.models.generate_videos(
    model="veo-3.1-fast-generate-preview",
    prompt=(
        "A wide establishing shot of a neon-lit Tokyo street at 2am, rain falling, "
        "reflections shimmering in puddles. Slow pan right. "
        "SFX: Rain on pavement, distant traffic, a single bicycle bell."
    ),
    config=types.GenerateVideosConfig(
        aspect_ratio="16:9",
        resolution="1080p",
        duration_seconds=8,
        enhance_prompt=True,
    )
)

# Fast tier typically completes in 60-90 seconds
while not operation.done:
    time.sleep(15)
    operation = client.operations.get(operation)

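# Surface generation errors instead of failing later on a missing result
if operation.error:
    raise RuntimeError(f"Video generation failed: {operation.error}")

video = operation.response.generated_videos[0]
# With the Gemini API, download the file first; the bytes are not inlined
client.files.download(file=video.video)
video.video.save("output.mp4")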

The enhance_prompt flag rewrites your prompt with additional cinematography detail before sending it to the model. It improves output quality on vague prompts but reduces precision on highly crafted ones. Set it to False if you've spent time engineering a specific prompt and want the model to interpret it literally. For quick exploration where quality matters more than control, leave it at True.

Standard tier generation takes 3–5 minutes per clip on average. Budget your timeout accordingly: the 15-second polling interval used above is tighter than Standard needs, and 30 seconds is more appropriate.
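
One way to make that budget explicit is to wrap the polling loop in a helper with a per-tier interval and a hard timeout. This is a minimal sketch: wait_for_video, its refresh callback, and the interval and timeout values are illustrative assumptions based on the timings quoted in this guide, not SDK defaults.

```python
import time
from typing import Any, Callable

# Illustrative polling settings in seconds. These are assumptions loosely
# based on the timings quoted in this guide, not SDK defaults.
POLL_SETTINGS = {
    "fast": {"interval": 15, "timeout": 300},
    "standard": {"interval": 30, "timeout": 900},
}


def wait_for_video(
    operation: Any,
    refresh: Callable[[Any], Any],
    interval: float,
    timeout: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Any:
    """Poll until operation.done is true or the timeout elapses.

    refresh takes the current operation and returns an updated one.
    sleep and clock are injectable so the loop can be tested offline.
    """
    deadline = clock() + timeout
    while not operation.done:
        if clock() >= deadline:
            raise TimeoutError(f"generation not finished after {timeout}s")
        sleep(interval)
        operation = refresh(operation)
    return operation


# Possible usage with the client and operation from the first example:
# operation = wait_for_video(
#     operation, client.operations.get, **POLL_SETTINGS["standard"]
# )
```

Passing the refresh function in, rather than calling the client directly, keeps the helper independent of the SDK and lets you test the timeout path without spending money on a real generation.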

Timestamp Prompting: One Call, Multiple Shots

Veo 3.1 understands temporal structure using a [MM:SS-MM:SS] notation inside a single prompt. You can direct four distinct shots within one 8-second generation call:

prompt = """[00:00-00:02] Wide shot from above: a lone hiker cresting a mountain ridge at golden hour.
SFX: Wind, boots on gravel.

[00:02-00:04] Close-up of the hiker's face, eyes narrowing against the light, slight smile.
Shallow depth of field. SFX: Wind fades.

[00:04-00:06] Reverse shot, hiker's POV: a vast valley stretching to the horizon,
mist on the far peaks. SFX: Distant birdsong begins.

[00:06-00:08] Slow crane pull-back, the hiker silhouetted against the sunset sky.
Ambient: quiet orchestral swell, building."""

operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    prompt=prompt,
    config=types.GenerateVideosConfig(
        aspect_ratio="16:9",
        resolution="1080p",
        duration_seconds=8,
        enhance_prompt=False,  # keep our structured prompt intact
    )
)

The model makes editorial decisions about cut timing and transition style — you don't specify whether it's a hard cut or a dissolve. What you control is camera position, subject action, and audio for each segment. The output is a single continuous MP4 with no visible seam between shots when your prompts are internally consistent.

The failure mode is inconsistent context rather than inconsistent subjects. If your first segment specifies "a male hiker in a red jacket" and a later segment describes just "the hiker," the model maintains the subject reasonably well. But any segment that accidentally implies a different context, such as a different time of day or a different location, gets interpreted literally. The model follows instructions; it doesn't infer that segment 3 is meant to follow segment 2 temporally unless the prompt makes that explicit. Be redundant about scene continuity details across segments.
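
One way to enforce that redundancy is to build the prompt from structured data and repeat a shared continuity line in every segment. The sketch below is a hypothetical helper, not part of any SDK. The Shot fields are assumptions for illustration, and the Continuity: label is ordinary prompt text rather than a documented keyword.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    start: int          # segment start, in seconds
    end: int            # segment end, in seconds
    description: str    # camera position and subject action
    audio: str = ""     # e.g. "SFX: Wind, boots on gravel."


def _stamp(seconds: int) -> str:
    """Format whole seconds as MM:SS."""
    return f"{seconds // 60:02d}:{seconds % 60:02d}"


def build_timestamp_prompt(shots: list[Shot], continuity: str, duration: int = 8) -> str:
    """Join shots into one prompt, repeating the continuity line per segment.

    Raises ValueError if the segments leave a gap, overlap, or do not
    cover exactly `duration` seconds.
    """
    expected_start = 0
    blocks = []
    for shot in shots:
        if shot.start != expected_start or shot.end <= shot.start:
            raise ValueError(f"segment {shot.start}-{shot.end} leaves a gap or overlap")
        lines = [
            f"[{_stamp(shot.start)}-{_stamp(shot.end)}] {shot.description}",
            f"Continuity: {continuity}",
        ]
        if shot.audio:
            lines.append(shot.audio)
        blocks.append("\n".join(lines))
        expected_start = shot.end
    if expected_start != duration:
        raise ValueError(f"segments end at {expected_start}s, expected {duration}s")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(build_timestamp_prompt(
        [
            Shot(0, 4, "Wide shot: a lone hiker cresting a ridge.", "SFX: Wind."),
            Shot(4, 8, "Close-up of the hiker's face, slight smile."),
        ],
        continuity="same male hiker, red jacket, golden hour, same ridge",
    ))
```

Validating the time ranges before the call is cheap insurance: a gap or overlap in the segments costs nothing to catch locally and a full generation to discover afterwards.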

First and Last Frame: Controlled Transitions

You provide a starting image and an ending image; the model generates the motion between them with synchronized audio. The main use cases are product transitions, before/after comparisons, and scene changes where you need visual precision at both endpoints.

def load_image_bytes(path: str) -> bytes:
    """Read an image file as raw bytes (no base64 encoding is needed here)."""
    with open(path, "rb") as f:
        return f.read()

first_frame = types.Image(
    image_bytes=load_image_bytes("empty_mug.png"),
    mime_type="image/png",
)
last_frame = types.Image(
    image_bytes=load_image_bytes("full_mug.png"),
    mime_type="image/png",
)

operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    prompt=(
        "A smooth cinematic transition: the empty coffee mug fills with steaming espresso, "
        "steam curling upward. SFX: Espresso machine hiss, liquid settling."
    ),
    image=first_frame,  # starting frame is passed as the input image
    config=types.GenerateVideosConfig(
        last_frame=last_frame,  # ending frame the model should arrive at
        aspect_ratio="16:9",
        duration_seconds=6,
    )
)

Two things that bite developers here. First: your source images must match the requested aspect ratio. A 16:9 aspect_ratio config with a square source image produces awkward cropping, not letterboxing. Crop your input images before the call, not after. Second: the model respects the first frame strongly but treats the last frame as a guide rather than a hard constraint. For 4-second clips, endpoint adherence is tighter. For 8-second clips, expect the model to take more liberty with how it arrives at the final frame. If the ending frame precision matters — product close-up, specific text on screen — use a 4-second duration and run a few variants.
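
For the first point, the crop box is simple arithmetic. The sketch below computes the largest centered crop for a target aspect ratio. The center_crop_box helper is hypothetical, and centering is an assumption: an off-center subject needs a different anchor. The returned tuple uses the (left, upper, right, lower) order that Pillow's Image.crop expects.

```python
def center_crop_box(
    width: int, height: int, ratio_w: int = 16, ratio_h: int = 9
) -> tuple[int, int, int, int]:
    """Return (left, upper, right, lower) for the largest centered crop
    of a width x height image with aspect ratio ratio_w:ratio_h."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Compare width/height against ratio_w/ratio_h using integers only.
    if width * ratio_h > height * ratio_w:
        # Image is too wide: keep the full height and trim the sides.
        new_w = height * ratio_w // ratio_h
        new_h = height
    else:
        # Image is too tall or already correct: keep the full width.
        new_w = width
        new_h = width * ratio_h // ratio_w
    left = (width - new_w) // 2
    upper = (height - new_h) // 2
    return (left, upper, left + new_w, upper + new_h)


if __name__ == "__main__":
    # A square 1024 x 1024 source cropped for a 16:9 request.
    print(center_crop_box(1024, 1024))   # (0, 224, 1024, 800)
    # A portrait source cropped for a 9:16 request.
    print(center_crop_box(1200, 1920, 9, 16))
```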

Ingredients to Video: Character Consistency Across Clips

Before Veo 3.1, maintaining consistent character appearance across multiple separate generation calls required careful prompt engineering and produced visible drift after 3–4 clips. Ingredients to Video fixes this by accepting reference images as additional input. The model anchors character appearance, style, and setting to your provided references.

def load_image_bytes(path: str) -> bytes:
    """Read an image file as raw bytes."""
    with open(path, "rb") as f:
        return f.read()

def asset_reference(path: str) -> types.VideoGenerationReferenceImage:
    """Wrap an image file as an asset reference for Veo."""
    return types.VideoGenerationReferenceImage(
        image=types.Image(
            image_bytes=load_image_bytes(path),
            mime_type="image/png",
        ),
        reference_type="asset",
    )

operation = client.models.generate_videos(
    model="veo-3.1-fast-generate-preview",
    prompt=(
        "Using the provided detective and office images: medium shot of the detective "
        "behind his desk. He looks up and says in a weary voice, "
        "'Of all the offices in this town, you had to walk into mine.'"
    ),
    config=types.GenerateVideosConfig(
        reference_images=[
            asset_reference("detective_character.png"),  # character
            asset_reference("office_setting.png"),       # setting
        ],
        aspect_ratio="16:9",
        duration_seconds=8,
    )
)

Three reference images is the practical ceiling. Beyond three, the model starts averaging features across references in ways that produce blended, uncanny characters. Two is usually sufficient for one character plus one setting. If you need a second character in the scene, describe them in the prompt rather than adding a third reference image. The recommended workflow is to generate character and setting references first using Gemini 2.5 Flash Image, or Nano Banana Pro when you need higher fidelity, then feed those into Veo 3.1 for the actual video.

Audio Prompting Syntax

Veo 3.1 generates audio by default, and the prompting syntax is explicit enough that learning it pays off immediately. Three patterns:

Dialogue uses quotation marks around the spoken text with the speaker described before the quote: A woman says, "We have to leave now." The model infers a voice that matches the described character. Multiple speakers work within the same prompt: A man in a suit says, "Sign here." The woman shakes her head: "Not yet." The model will cast two distinct voices and position them in the stereo field based on any camera angle cues in your prompt.

Sound effects use the SFX: prefix: SFX: thunder cracks in the distance. Timing relative to the visual action is inferred from context — if your prompt shows a character slamming a door followed by SFX: door slam echoes, the model places the sound at the door-slam action. You can't set millisecond timing, but the inference is accurate enough for editorial use.

Ambient audio uses the Ambient noise: or Ambient: prefix, for example: Ambient noise: the quiet hum of a starship bridge, crew murmur, distant alerts. This sets the background bed for the entire clip. Combining Ambient with SFX produces layered audio: the background ambience plus discrete event sounds on top.
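
If you assemble prompts programmatically, the three patterns can be kept consistent with a small formatter. The sketch below is a hypothetical helper, not an SDK function. It only concatenates strings using the conventions described above, and the ordering of the parts (visual, dialogue, SFX, ambient) is an assumption.

```python
from typing import Optional, Sequence, Tuple


def compose_prompt(
    visual: str,
    dialogue: Sequence[Tuple[str, str]] = (),
    sfx: Sequence[str] = (),
    ambient: Optional[str] = None,
) -> str:
    """Assemble one prompt from visual, dialogue, SFX, and ambient parts."""
    parts = [visual.strip()]
    # Dialogue: speaker description first, then the line in double quotes.
    for speaker, line in dialogue:
        parts.append(f'{speaker} says, "{line}"')
    # Discrete event sounds, one SFX: entry each.
    for effect in sfx:
        parts.append(f"SFX: {effect}")
    # A single background bed for the whole clip.
    if ambient:
        parts.append(f"Ambient noise: {ambient}")
    return " ".join(parts)


if __name__ == "__main__":
    print(compose_prompt(
        "Medium shot of a woman at an airlock door.",
        dialogue=[("A woman", "We have to leave now.")],
        sfx=["thunder cracks in the distance."],
        ambient="the quiet hum of a starship bridge.",
    ))
```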

To disable audio entirely, set generate_audio=False in your GenerateVideosConfig. This saves approximately $0.13/sec on Standard and $0.05/sec on Fast. For any clip where you're adding a post-production music track or voice-over, disabling generated audio avoids paying for audio you won't use.

Pricing in Practice

At 8 seconds per clip, a Fast generation costs $1.20, so 100 generations a day comes to $120 a day, or about $3,600 over a 30-day month. That's a real production budget, and most pipelines can reduce it without meaningful quality loss.

The cheapest workflow that maintains publishable quality: use Lite for all iteration and Fast only for approved shots. If you run 4 Lite drafts per final clip before approving: 4 × $0.40 + 1 × $1.20 = $2.80 per published 8-second clip, versus 5 × $1.20 = $6.00 if you iterated on Fast throughout. Across 100 published clips per month, that's $280 versus $600, a saving of about 53%.

The second lever is audio. If your workflow adds music or voice-over in post, turning off audio generation on all Lite iterations cuts those iteration costs by roughly 33%, which takes the 4 drafts from $1.60 to about $1.07 per published clip. If nothing in the pipeline needs generated audio, the same 33% applies to every call: 100 daily Fast generations drop from about $3,600/month to about $2,400/month, roughly $14,400 back over a year.
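
The draft-then-final arithmetic generalizes to any number of drafts. The sketch below is a hypothetical helper that takes per-clip costs as parameters. Its defaults are the 8-second Lite and Fast clip prices quoted in this guide, and the 33% audio-off discount is the guide's approximation, so treat the output as an estimate.

```python
def cost_per_published_clip(
    drafts: int,
    draft_cost: float = 0.40,          # 8-second Lite clip, per this guide
    final_cost: float = 1.20,          # 8-second Fast clip, per this guide
    draft_audio: bool = True,
    audio_off_discount: float = 0.33,  # "roughly 33%", an approximation
) -> float:
    """Estimate the cost of N draft generations plus one final generation."""
    if drafts < 0:
        raise ValueError("drafts must be non-negative")
    per_draft = draft_cost
    if not draft_audio:
        # Only the drafts run without audio; the final keeps its audio.
        per_draft = draft_cost * (1 - audio_off_discount)
    return round(drafts * per_draft + final_cost, 2)


if __name__ == "__main__":
    # 4 Lite drafts plus 1 Fast final, with draft audio on and then off.
    print(cost_per_published_clip(4))                     # 2.8
    print(cost_per_published_clip(4, draft_audio=False))  # 2.27
```

To model iterating on a single tier throughout, pass that tier's clip price as both draft_cost and final_cost.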

Standard is justified when the clip goes directly to client or broadcast without post color grading. The visible quality gap shows under technical scrutiny — hair detail, fine textures, complex lighting interactions. For casual viewing on mobile, Fast is indistinguishable to most audiences. Run your actual prompts on both tiers and evaluate at your target display size before committing to Standard for production.

What Veo 3.1 Still Can't Do

The SynthID watermark is non-optional. It's invisible to the eye but detectable by Google's verification tools and increasingly by third-party detection services. If client contracts specify undetectable AI generation, Veo 3.1 doesn't comply. This isn't a bug; Google has been explicit that all Veo output will carry SynthID permanently.

The Add/Remove Object feature — available through AI Studio but not the generate_videos API — still runs on Veo 2 internally. No audio. Google hasn't announced a timeline for migrating editing features to Veo 3.1.

Native long-form generation doesn't exist in the API. The 60-second clips referenced in Google's marketing are assembled from sequential 8-second calls chained in post. Character consistency across chained clips is good using Ingredients to Video, but you're managing a reference image library and re-prompting context for every segment. There is no single API call that produces a 60-second video.
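
Planning the chain is simple arithmetic over the allowed per-call durations of 4, 6, and 8 seconds. The sketch below is a hypothetical planner that splits a target runtime into call lengths, preferring 8-second calls. It only handles planning: generation, reference images, and stitching in post are out of scope.

```python
def plan_segments(total_seconds: int) -> list[int]:
    """Split a target runtime into per-call durations of 4, 6, or 8 seconds.

    Prefers 8-second calls and uses one or two shorter calls for the
    remainder. Only even totals of at least 4 seconds can be covered exactly.
    """
    if total_seconds < 4 or total_seconds % 2:
        raise ValueError("total must be an even number of seconds, at least 4")
    full, rem = divmod(total_seconds, 8)
    if rem == 0:
        return [8] * full
    if rem == 2:
        # A 2-second call is not allowed, so swap one 8 for 6 + 4.
        return [8] * (full - 1) + [6, 4]
    # Remainder is 4 or 6, which is itself an allowed duration.
    return [8] * full + [rem]


if __name__ == "__main__":
    plan = plan_segments(60)
    print(plan)        # [8, 8, 8, 8, 8, 8, 8, 4]
    print(len(plan))   # 8 separate generation calls for a 60-second video
```

Each entry in the plan is one generation call with its own prompt and reference images, which is where the bookkeeping described above comes from.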

Extreme close-up human faces still produce occasional artifacts. The uncanny valley problem is reduced versus Veo 3, but not eliminated. For shots where a face is the primary subject, generate 3–4 variants and select. For mid-ground or wider framing, this is a non-issue.

Where Veo 3.1 Fits Right Now

Three scenarios where it's the right tool today.

Content pipelines at volume. If you need 20+ short-form clips per day — product highlights, social teasers, ad variants — Fast tier at roughly 90-second generation time is the only API-native path that fits a real production cadence. Sora 2 and Kling 2 are comparable on quality but slower on average; neither matches Fast's throughput for pipeline use.

Multi-shot narrative from a single call. Timestamp prompting has no direct equivalent in competitor models as of early June 2026. If you need a structured 4-shot sequence in a single 8-second clip, this is the only model with that built in.

Character-consistent series. The Ingredients to Video workflow produces noticeably better character consistency than prompt-only approaches in Runway ML or Pika. If you're building multi-episode content with recurring characters, this matters enough to drive model selection on its own.

Where it's not the right call: photorealistic portrait close-ups (Veo 3.1 Standard is adequate but not market-leading), highly stylized 2D animation aesthetics (Kling 2 performs better there), and any context where AI attribution must remain undisclosed. There is also no free tier, so there's no cost-free way to evaluate it before your first invoice. Budget at least $50 for a realistic evaluation run against your actual prompts.

