Google released two new generative-media models this week—Nano Banana 2 Lite for fast, cheap image generation and Gemini Omni Flash for short video generation and editing—designed for volume and speed rather than peak quality. The models target builders who need to produce large quantities of images and short clips quickly and affordably, signaling a shift in AI-generation economics.
Key facts
- What: A lightweight version of Google's image model now makes a picture in about four seconds for a fraction of a cent, while a new video model lets developers edit clips by talking to it.
- When: 2026-06-30
- Primary source: read the source
Both models were detailed in a blog post, where Google positioned them for builders who prioritize throughput over a single polished render.
Nano Banana 2 Lite, a stripped-down version of Google's image generator, produces a picture from a text prompt in about four seconds and costs roughly three cents per thousand images. That pricing is the real story. At those numbers, generating images stops being a treat you dole out carefully and becomes something you can do by the thousand: testing a hundred variations of a design, auto-illustrating every item in a catalog, or letting an app generate imagery on the fly for each user. When a capability gets an order of magnitude cheaper, people do not just do the old thing for less money; they do new things that were never affordable before. Google is clearly betting that cheap-and-fast unlocks a different class of use than slow-and-pristine. The model is available in Google's developer studio and its main AI programming interface, and it is rolling out to consumer products like the Gemini app and search.
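To make those numbers concrete, here is a back-of-envelope sketch using only the two figures quoted above. The function names and sample workloads are illustrative, the currency is assumed to be US dollars, and the time estimate assumes one request at a time, since nothing here says how many requests can run in parallel.

```python
# Back-of-envelope figures taken from the article; treat them as approximate.
PRICE_USD_PER_THOUSAND = 0.03   # "roughly three cents per thousand images" (USD assumed)
SECONDS_PER_IMAGE = 4.0         # "about four seconds" per picture


def batch_cost_usd(n_images: int) -> float:
    """Estimated cost in US dollars for generating n_images."""
    return n_images * PRICE_USD_PER_THOUSAND / 1000


def sequential_hours(n_images: int) -> float:
    """Estimated wall-clock hours if images are generated one at a time.

    Assumption: no parallel requests. Real throughput depends on rate
    limits and concurrency, which the article does not state.
    """
    return n_images * SECONDS_PER_IMAGE / 3600


if __name__ == "__main__":
    # Hypothetical workloads echoing the examples in the text.
    for label, n in [("100 design variations", 100),
                     ("50,000-item catalog", 50_000)]:
        print(f"{label}: ${batch_cost_usd(n):.3f}, "
              f"{sequential_hours(n):.1f} h if run sequentially")
```

Under these assumptions, illustrating a 50,000-item catalog costs about $1.50 but would take roughly 56 hours one image at a time. At this price the practical limit is more likely to be throughput than budget.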
Gemini Omni Flash, the more novel of the two, handles video—generating and, more interestingly, editing short clips of up to ten seconds. Google bills it as the first time developers get programmable access to conversational video editing. Traditional video editing means a timeline, tracks, and a mouse: you scrub to a frame and manually change things. Conversational editing means you describe the change in plain words—make the sky darker, slow the middle down, remove the person on the left—and the model produces the revised clip. Doing that through an API means a developer can bake that ability into their own app, so their users can revise video by talking rather than learning editing software. Combined with the fast image model, Google is sketching an end-to-end pipeline: generate a picture in seconds, turn it into a short clip, then refine the clip by conversation. It is available in the same developer surfaces plus Google's video-creation tool.
The honest caveat is that the "lite" and "flash" labels are doing a lot of quiet work. A four-second image model priced to run by the thousand is, almost by definition, making tradeoffs against the slower, pricier flagship: in fine detail, in how reliably it renders text inside an image, in handling unusual or complex prompts. Ten-second clips are short, and the hardest parts of video generation (keeping a character consistent, physics that do not melt, coherence across a longer scene) get harder the longer the clip. None of that makes these models less useful; it means they are specialized tools for a specific job. The winners will be the builders who match the tool to the task: reach for cheap-and-fast when volume and iteration speed matter, and save the expensive flagship for the single hero image or the shot that has to be flawless. What Google actually shipped this week is less a leap in quality than a shift in economics, and in this field, the economics are often what decides which ideas get built at all.
Originally published on Ground Truth, where every claim is checked against the primary source.