ww-w.ai

Posted on May 20

Google I/O Review (3/5) — Gemini Omni Is a Learned Physics Engine

#ai #google #machinelearning #video

Gemini Omni Is a Learned Physics Engine — Like Unity, But the Rules Aren't Coded

Google I/O 2026 Review — Part 3 of 5

Most video generation models fake physics. They learn what gravity looks like — a ball falls, a cloth drapes — and reproduce the visual pattern. Push the scene past what the training data covered and things break. A marble doesn't bounce right. Shadows point the wrong way after a lighting edit. Swap a background and the character morphs into someone else.

Gemini Omni does something different. It maintains physics and identity across frames — not because someone coded gravity = 9.8 into the system, but because the model built an internal representation of how the physical world works.

That distinction matters more than the demo reel suggests.

The Demos That Stopped the Room

Three demos at I/O 2026 showed what Omni can do.

Hand-drawn character to animation. Someone sketched a character on paper, uploaded it, and Omni turned it into a 10-second animated story. Not a static image with parallax — an actual animation with movement, expression changes, and a coherent scene.

Marble physics. A marble bounced down a chain-reaction track. Gravity pulled it at what looked like the right rate. Bounce trajectories matched the angle of impact. Each bounce produced a distinct sound, including a bell ring at the end. The physics didn't look approximate. They looked simulated.

Claymation protein folding. A single prompt generated an educational video showing protein folding in claymation style. The clay texture stayed consistent across the sequence. The folding motion followed biologically plausible mechanics. One prompt. No keyframes. No rigging.

One reviewer at ChatPRD called it "the most impressive demo of the day." Having watched the full keynote and the hands-on sessions, I think that's fair.

What Makes This Different from Sora

Every video generation model can produce impressive isolated clips. The test is what happens when you edit.

Change the background in a Sora-generated scene, and the character often drifts: subtle changes to face shape, clothing color, body proportions. The model doesn't seem to know the character is supposed to stay the same. It behaves as if it were matching the look of nearby frames rather than tracking that this is the same entity.

Omni maintains identity after edits. Swap the background from a forest to a kitchen. Change the lighting from warm to cold. Replace a prop. The character stays the same — same face, same proportions, same clothing. Google's claim is that the model maintains a persistent representation of objects and their properties, independent of the scene context.

This is arguably the hardest problem in video generation, and a big reason most generated videos feel uncanny. They look right for 3 seconds. Then something shifts.
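One way to put a number on "something shifts" is to compare each frame's character embedding against the first frame. The sketch below is a minimal illustration, not a published benchmark: it assumes you already have one embedding vector per frame from some appearance or face encoder of your choice, and the function names and the 0.9 threshold are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def first_drift_frame(embeddings, threshold=0.9):
    """Index of the first frame whose similarity to frame 0 falls below
    the threshold, or None if identity holds for the whole clip."""
    reference = embeddings[0]
    for index, embedding in enumerate(embeddings):
        if cosine_similarity(reference, embedding) < threshold:
            return index
    return None


# Hypothetical per-frame embeddings: the character slowly turns into someone else.
drifting_clip = [[1.0, 0.0], [0.99, 0.1], [0.8, 0.6], [0.0, 1.0]]
stable_clip = [[1.0, 0.0]] * 5

print(first_drift_frame(drifting_clip))
print(first_drift_frame(stable_clip))
```

Run the same check before and after a background or lighting edit, and the frame index where similarity collapses gives a rough, comparable measure of identity persistence.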

The Unity Analogy — And Why It Matters

Here is the mental model I keep coming back to.

In Unity or Unreal, physics works because engineers wrote the rules. Rigidbody.AddForce() applies Newtonian mechanics. Collision detection uses mathematical bounding volumes. Gravity is a constant. The engine simulates a world by executing code.
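As a rough illustration of what "engineers wrote the rules" means, here is a minimal Python sketch of hand-coded physics for a bouncing ball. It is not Unity's implementation: the names `step` and `simulate`, the `restitution` value, and the fixed time step are all assumptions chosen for the example.

```python
GRAVITY = -9.8  # m/s^2, a hard-coded constant, as in a game engine


def step(height, velocity, dt=0.01, restitution=0.8):
    """Advance a ball by one time step using semi-implicit Euler integration."""
    velocity += GRAVITY * dt  # gravity changes velocity
    height += velocity * dt   # velocity changes position
    if height < 0.0:          # collision with the floor at height 0
        height = 0.0
        velocity = -velocity * restitution  # bounce, losing some energy
    return height, velocity


def simulate(height, velocity=0.0, steps=1000):
    """Apply the hand-written rules repeatedly and return the heights."""
    trajectory = []
    for _ in range(steps):
        height, velocity = step(height, velocity)
        trajectory.append(height)
    return trajectory


heights = simulate(height=2.0)
```

Every behavior here traces back to a line someone typed. Change `GRAVITY` and the world changes with it, and the same starting height always produces the same trajectory.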

Omni does something conceptually similar — it maintains physics across frames — but through a different mechanism. The rules aren't coded. They're learned. The model internalized how gravity, light, momentum, and material properties behave by processing enormous amounts of video data. It built what researchers call a world model: an internal representation of physical laws that it applies when generating new frames.
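To see what "learned, not coded" can mean at the smallest possible scale, here is a toy Python sketch that recovers gravitational acceleration purely from observed frames. It says nothing about how Omni actually works, since nothing in this post describes its internals at that level. The synthetic "video", the function names, the 25 fps frame spacing, and the noise level are all assumptions made for the example.

```python
import random


def observe_fall(g=-9.8, dt=0.04, frames=25, noise=0.0002, seed=0):
    """Synthetic 'video': noisy heights of a dropped object, one per frame."""
    rng = random.Random(seed)
    return [10.0 + 0.5 * g * (i * dt) ** 2 + rng.gauss(0.0, noise)
            for i in range(frames)]


def fit_acceleration(heights, dt=0.04):
    """Least-squares estimate of a constant acceleration from frames alone.

    For constant acceleration a, the second difference of positions
    equals a * dt**2, so the best-fit a is the mean second difference
    divided by dt**2.
    """
    second_diffs = [heights[i + 1] - 2 * heights[i] + heights[i - 1]
                    for i in range(1, len(heights) - 1)]
    return sum(second_diffs) / len(second_diffs) / dt ** 2


# The value -9.8 never appears in fit_acceleration; it comes out of the data.
estimated_g = fit_acceleration(observe_fall())
print(f"estimated gravity: {estimated_g:.2f} m/s^2")
```

The gap between this and a video model is enormous, but the direction is the same: the constant is an output of fitting observations, not an input written by an engineer.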

Think of it this way:

| Aspect | Game engine (Unity) | Learned physics (Omni) |
|---|---|---|
| Physics rules | Explicitly coded (`F = ma`) | Implicitly learned from data |
| Object identity | Tracked via object IDs | Maintained via internal representation |
| Edit behavior | Deterministic: same input, same output | Probabilistic, but consistent within a generation |
| Novel scenarios | Only what the code handles | Generalizes from training data patterns |
| Failure mode | Crashes or glitches visibly | Degrades subtly (uncanny valley) |

The game engine approach has known limits and known strengths. You can trust the physics because you wrote the physics. The learned approach trades that certainty for generality — it can handle scenarios nobody anticipated, because it doesn't need someone to write the collision handler first.

The phrase from my full I/O review keeps sticking with me: "Like Unity, but the rules aren't coded. They're understood."

Practical Impact: Who Cares Beyond the Demo Reel

Three concrete use cases where this changes cost structures.

YouTube thumbnails and short-form video. A solo creator who currently pays $200-500 for a 30-second product animation can describe the scene in a prompt. If Omni delivers even 70% of the quality at near-zero marginal cost, the economics of content production shift for every small creator and indie team.

Product walkthrough videos. SaaS companies spend $5,000-15,000 per explainer video (script, motion graphics, voiceover, revisions). A world model that understands object permanence means you can generate a walkthrough, swap the UI screenshots for the next version, and the video stays coherent. The revision cycle collapses.

Educational content. The claymation protein-folding demo is not a party trick. If a biology teacher can prompt "show me mitosis in stop-motion clay style, 30 seconds" and get something accurate enough for a classroom, that's a production studio in a text box.

The common thread: Omni reduces the cost of visual storytelling from "hire a team" to "write a paragraph." Not for Hollywood. Not for AAA games. For the long tail of content that nobody could afford to produce before.

What It Can't Do Yet

This section matters more than the demo reel.

It's still in preview. Google showed curated demos on stage. We have not seen the failure cases — the weird hand, the physics glitch, the moment where identity drifts on frame 87. Every generative model looks incredible in a keynote. The question is what happens on the 50th generation you run on your own.

Long-form is unproven. The demos were 10 seconds. What happens at one minute? Two minutes? Five? World models degrade over time — small errors in frame N compound by frame N+100. Whether Omni maintains coherence over longer durations is an open question. Omni Flash clips are capped at 10 seconds; Sora supports up to 60.
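The compounding argument is easy to check with back-of-the-envelope arithmetic. The Python sketch below assumes a simple multiplicative model in which every frame adds the same small relative error. That model, the 24 fps frame rate, and the 0.1% per-frame figure are hypothetical and are not measurements of Omni or Sora.

```python
import math


def compounded_drift(per_frame_error, frames):
    """Relative drift after `frames` steps if each step scales the
    deviation by (1 + per_frame_error)."""
    return (1.0 + per_frame_error) ** frames - 1.0


def frames_until_drift(per_frame_error, threshold):
    """Smallest frame count at which compounded drift reaches the threshold."""
    return math.ceil(math.log(1.0 + threshold) / math.log(1.0 + per_frame_error))


FPS = 24                 # assumed frame rate, not a published figure
PER_FRAME_ERROR = 0.001  # hypothetical 0.1% drift per frame

for seconds in (10, 60):
    drift = compounded_drift(PER_FRAME_ERROR, seconds * FPS)
    print(f"{seconds:>3}s clip: {drift:.0%} accumulated drift")
```

Under these made-up numbers, a per-frame error that is tolerable in a 10-second clip becomes severe by 60 seconds, which is why clip length is the variable to watch.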

Production-grade quality is not validated. "Impressive demo" and "I can ship this to customers" are different bars. Color accuracy, resolution consistency, artifact rates under varied prompts — none of these have been tested at scale by external users.

The pricing is unknown. A world model that generates physically consistent video is computationally expensive. If Omni pricing follows the Flash trajectory — where prices have climbed steeply across Flash generations — the cost math could limit adoption to enterprises.

Where This Fits in the Bigger Picture

Omni is not a video editor. It's not a motion graphics tool. It's a world simulator that outputs video. That framing changes what you compare it to.

Sora and Runway are video generators — they turn text into pixels. Omni is closer to a physics engine that happens to render its output as video frames. The difference is whether the system understands the scene or merely paints it.

If that understanding holds up outside curated demos — and that's a genuine if — the implications go beyond content creation. Robotics simulation, architectural visualization, scientific modeling, game prototyping. Any field that needs "show me what would happen if..." becomes a potential use case.

For now, it's a preview. An impressive one. But a preview.

What I'm watching for next: Public API access, pricing, and the first independent benchmarks on identity persistence across 60+ second clips. The demo set a bar. The product needs to clear it.

If you're tracking Gemini Omni or have tested other world-model approaches, I'd like to hear what you've seen. Comments or GitHub.
