Turn around, and the world disappears

#worldmodels #videogeneration #robotics #benchmark

AI world models — systems meant to simulate environments with persistent objects — fail a basic test of object permanence: when the camera pans away and back, they reset the scene to the last frame they saw rather than maintaining a continuous model of what happened off-screen. A new benchmark shows that larger models tend to forget worse, indicating the problem is structural, not a matter of scale.

Key facts

What: AI video models that are supposed to "understand" a 3D scene only remember what's on screen: pan away and back, and the scene has reset. Bigger models tend to do worse on this test, not better.
When: 2026-06-19
Primary source: arXiv 2606.20545

The benchmark uses a test anyone can picture: show the model a scene, pan the camera away, then pan back. A cat mid-leap toward the bed should, by the time you return, be on the bed, or at least somewhere plausible given that a second has passed. Instead, the model snaps everything back to how it last saw it. The cat is still frozen mid-jump, short of the bed. A door someone pushed open is closed again. A knocked-over stack of blocks is neatly restacked. The world didn't keep running while you weren't watching; it silently reset to the last remembered frame.
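To make the pan-away-and-back test concrete, here is a minimal sketch of how a single return view might be labeled. It is an illustration only, not the benchmark's actual metric: the `Observation` type, the `classify_return_view` function, the one-dimensional position, and the tolerance value are all hypothetical names and simplifications chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Position of one tracked object along a single axis (hypothetical units)."""
    position: float


def classify_return_view(last_seen, expected, returned, tol=0.1):
    """Label what a model shows when the camera pans back.

    last_seen: object state in the final frame before the pan away
    expected:  where the object should plausibly be after the off-screen gap
    returned:  what the model actually renders on return
    tol:       assumed tolerance for calling two positions "the same"
    """
    if abs(returned.position - expected.position) <= tol:
        return "persisted"  # the world kept running off-screen
    if abs(returned.position - last_seen.position) <= tol:
        return "reset"      # snapped back to the last remembered frame
    return "other"          # neither: the object ended up somewhere else


# The cat example: last seen mid-leap at 0.4, expected on the bed at 1.0.
cat_last_seen = Observation(0.4)
cat_expected = Observation(1.0)

frozen_cat = classify_return_view(cat_last_seen, cat_expected, Observation(0.4))
landed_cat = classify_return_view(cat_last_seen, cat_expected, Observation(1.0))
```

Under these assumptions, the frozen cat is labeled a reset and the cat that landed on the bed is labeled persisted. A real evaluation would have to compare rendered video rather than a single number, which this sketch deliberately leaves out.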

The most surprising result is which models do this worst. Scaling up tends to make the forgetting worse, not better — a strong clue that the problem isn't insufficient capacity. It's structural: these systems excel at painting whatever is in frame right now and have no real place to store the parts that have scrolled off-screen. They're less like a mind holding a scene in memory and more like an extraordinarily talented improviser who only knows what's directly in front of them — ask about the corner they just turned away from and there's simply nowhere it was written down.

The gap is concrete. Picture a kitchen robot: a cup rolls behind the toaster, a person reaches in front of the camera, and when the view clears, a model with no memory doesn't think "the cup is still behind the toaster" — it re-paints the scene from scratch. The cup may be gone, back where it started, or somewhere new entirely. You cannot plan a reliable grab against a world that rewrites itself every time something blocks the view. The same goes for a game: walk down a corridor, turn around, and the room you just left has silently rearranged its furniture.

This connects to a quietly important theme in the week's research. A separate paper on giving robots real spatial tools lands on the same missing ingredient from a different angle — persistent memory of where things are across multiple glances — while another argues robots might skip the imagined video entirely and plan from a single still frame, sidestepping the forgetting problem rather than solving it. Three groups, three directions, all circling the same gap. When that happens, it usually means a real weakness has been found rather than a one-off complaint.

The researchers argue that fixing this needs a genuinely different ingredient — something that acts as a persistent "state of the world," a memory the model writes to and reads back, kept separate from the picture it happens to be drawing at any moment. Today's models fold "what's true about the scene" and "what pixels go on screen right now" into one step, and the truth gets overwritten every time the picture changes. Splitting those apart — a lasting ledger of the world plus a renderer that draws from it — is the direction several teams are now pointing.
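The split between a lasting ledger and a renderer can be sketched in a few lines. This is a toy under stated assumptions, not the researchers' proposal or any real model's architecture: `Body`, `WorldLedger`, and `render` are hypothetical names, the world is one-dimensional, and "rendering" just reports which objects fall inside the camera's view.

```python
from dataclasses import dataclass, field


@dataclass
class Body:
    x: float          # position along one axis
    vx: float = 0.0   # velocity, in units per step


@dataclass
class WorldLedger:
    """Persistent scene state, updated every step whether or not it is visible."""
    bodies: dict = field(default_factory=dict)

    def step(self, dt=1.0):
        # Advance every body, including those the camera cannot see.
        for body in self.bodies.values():
            body.x += body.vx * dt


def render(ledger, view_min, view_max):
    """Read-only renderer: report the bodies inside the camera's view."""
    return {
        name: body.x
        for name, body in ledger.bodies.items()
        if view_min <= body.x <= view_max
    }


ledger = WorldLedger({"cat": Body(x=0.0, vx=0.5)})

before = render(ledger, -1.0, 1.0)  # camera on the cat
ledger.step()
away = render(ledger, 5.0, 7.0)     # camera panned away: the cat is not drawn
ledger.step()
after = render(ledger, -1.0, 1.0)   # camera back: the cat has kept moving
```

The point of the design is that `render` never writes to the ledger, so looking away cannot overwrite what is true about the scene. When the camera returns, the cat is where two steps of motion put it, not where it was last drawn.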

The practical stakes are clear. A model that forgets the room the instant you look elsewhere can still make a gorgeous six-second clip — genuinely useful for film and art. But it can't serve as the dependable imagination inside a robot deciding where to reach, or a game world you can explore and trust to stay consistent. DeepMind's Genie 2, for instance, can turn a single still image into a little 3D world you can walk around in — but for any of that to be useful, the world has to stay put when you look away. This benchmark turns a vague intuition — "these things don't really understand space" — into a specific, measurable failure that the next wave of research now has to beat.

The usual caveat applies: the work is days old and measures one particular kind of forgetting, so it's a sharp diagnosis rather than the final word, and a system that fails this test isn't worthless at everything else. But it's the kind of clean, almost playful experiment — turn around and see if the world is still there — that tends to stick, because anyone can understand exactly what's being asked, and exactly how today's models come up short.

Originally published on Ground Truth, where every claim is checked against the primary source.

