This AI predicts how objects move by tracking shapes, not pixels

#worldmodels #physics #diffusion #robotics

A new model called PhysiFormer predicts how objects will move in real three-dimensional space rather than predicting how a scene will look in pixels. Built on a diffusion transformer, it learns physical behavior from data alone—without hand-coded rules—and still respects rigidity and conserved momentum better than the previous standard approach. Its predictions are geometry-aware and viewpoint-independent, the qualities a robot needs when its camera sits somewhere the training data never covered.

Key facts

What: PhysiFormer forecasts physical motion as 3D meshes in world coordinates, and it recovers rigidity and momentum conservation without anyone hand-coding the laws of physics.
When: 2026-06-27
Primary source: arXiv 2606.27364

Most AI that predicts the physical world works by predicting pictures. Feed in video frames and the model guesses the next frames, pixel by pixel. That approach has a quiet weakness: a pixel-based model doesn't really know that a coffee mug is a solid object. It knows what a mug tends to look like from one camera angle. Move the camera and its grasp of the mug's actual shape can fall apart, because it never represented the shape—only the photograph of it.

PhysiFormer represents an object the way a graphics or engineering program would: as a mesh, a connected web of points in 3D coordinates that defines its surface. Give it the starting positions and velocities of those points, plus what the object is made of (rigid like a wooden block, or elastic like a rubber ball), and it predicts where every point travels next. It forecasts the motion of the thing itself, in world coordinates, not the appearance of the thing from a particular viewpoint. Change where you stand and the prediction doesn't break, because the model was never relying on the view.

A pixel-based predictor is like an artist at a game sketching what the next photo of a bouncing ball will look like. PhysiFormer is like a physics student tracking the ball's actual position and speed and saying where it'll be a moment later. The artist can be fooled by lighting, angle, and shadow. The student is reasoning about the ball itself, so the answer holds from any seat in the stadium.
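To make the input and output concrete, here is a minimal sketch of that kind of state: per-vertex positions and velocities plus a material label. The `MeshState` layout and the `constant_velocity_step` function are hypothetical names made up for illustration. The step function is a trivial stand-in that just moves each point along its velocity; it is not PhysiFormer, which learns this mapping from data.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class MeshState:
    """One object's state in world coordinates (hypothetical layout)."""
    positions: List[Vec3]    # one 3D point per mesh vertex
    velocities: List[Vec3]   # one 3D velocity per mesh vertex
    material: str            # e.g. "rigid" or "elastic"


def constant_velocity_step(state: MeshState, dt: float) -> MeshState:
    """Stand-in for a learned predictor: move each vertex along its velocity.

    This is NOT the paper's model; it only shows the shape of the task
    (mesh state in, next mesh state out).
    """
    new_positions = [
        (px + vx * dt, py + vy * dt, pz + vz * dt)
        for (px, py, pz), (vx, vy, vz) in zip(state.positions, state.velocities)
    ]
    return MeshState(new_positions, list(state.velocities), state.material)


# A triangle moving along +x at 2 units per second.
triangle = MeshState(
    positions=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    velocities=[(2.0, 0.0, 0.0)] * 3,
    material="rigid",
)
next_triangle = constant_velocity_step(triangle, dt=0.5)
```

Nothing in this state mentions a camera, which is the point: the prediction lives in the object's own coordinates.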

The genuinely surprising part is what PhysiFormer doesn't need. Researchers who build physics-aware AI usually bake in rules by hand: force the model to keep rigid objects rigid, force it to respect cause and effect. PhysiFormer skips most of that. Its diffusion transformer, from the same family of models behind modern image and video generators, learns by repeatedly turning noise into structure. It learns physical behavior from data alone, and still comes out respecting rigidity and conserved momentum better than the previous standard approach. It also handles many objects at once gracefully, treating them in a way that doesn't care which object you list first. That is how the real world works, since a pile of blocks has no official ordering.
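What "respecting rigidity and conserved momentum" means can be checked with two simple measurements on any predicted motion. The sketch below is one plausible way to score them, assuming point masses at the mesh vertices. The function names are made up for illustration and these are not necessarily the metrics the authors report.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def rigidity_error(before: Sequence[Vec3], after: Sequence[Vec3]) -> float:
    """Largest change in any pairwise vertex distance.

    A rigid motion (translation or rotation) keeps every distance fixed,
    so the error is zero up to floating-point rounding.
    """
    worst = 0.0
    for i, j in combinations(range(len(before)), 2):
        d_before = math.dist(before[i], before[j])
        d_after = math.dist(after[i], after[j])
        worst = max(worst, abs(d_after - d_before))
    return worst


def total_momentum(masses: Sequence[float], velocities: Sequence[Vec3]) -> Vec3:
    """Sum of mass * velocity over all points, one component per axis."""
    return tuple(
        sum(m * v[axis] for m, v in zip(masses, velocities)) for axis in range(3)
    )


# Rigidity: sliding a triangle keeps its shape, stretching it does not.
before = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
translated = [(x + 3.0, y, z) for x, y, z in before]
stretched = [(2.0 * x, y, z) for x, y, z in before]
translated_error = rigidity_error(before, translated)
stretched_error = rigidity_error(before, stretched)

# Momentum: two equal masses swap velocities in a head-on elastic collision.
masses = [1.0, 1.0]
momentum_before = total_momentum(masses, [(2.0, 0.0, 0.0), (0.0, 0.0, 0.0)])
momentum_after = total_momentum(masses, [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)])
```

A model with hand-coded constraints gets low scores on checks like these by construction. The claim here is that a model trained only on data ends up scoring well on them anyway.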

PhysiFormer is also probabilistic. It doesn't commit to one single future but can sample several plausible ones. That matches reality, where a teetering stack of objects could topple in more than one believable way. A model that admits this uncertainty is more honest, and more useful for planning, than one that fakes a single confident answer.
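Sampling several futures instead of one can be pictured with a toy example. The code below is only an illustration of the idea: a seeded random draw stands in for one sample from a learned distribution, and the `lean` parameter and function names are invented for this sketch rather than taken from the paper.

```python
import random
from collections import Counter
from typing import List


def sample_topple_direction(rng: random.Random, lean: float) -> str:
    """Toy stochastic future: a teetering stack falls left or right.

    `lean` in [-1, 1] biases the outcome. This stands in for drawing one
    sample from a learned model and is not the paper's sampler.
    """
    p_right = 0.5 + 0.5 * lean
    return "right" if rng.random() < p_right else "left"


def sample_futures(n: int, lean: float, seed: int) -> List[str]:
    """Draw n independent futures from the same starting condition."""
    rng = random.Random(seed)
    return [sample_topple_direction(rng, lean) for _ in range(n)]


# A stack leaning slightly to the right: both outcomes appear, one more often.
futures = sample_futures(n=1000, lean=0.2, seed=0)
counts = Counter(futures)
```

A planner can use the spread in `counts` directly, for example by treating a stack that falls either way about equally often as too risky to reach past.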

Predicting physical interactions is foundational for robots that manipulate objects, for graphics and animation that need to look right rather than just plausible, and for any design tool that has to simulate how materials behave. Doing it in coordinate space rather than pixel space means the predictions are geometry-aware and viewpoint-independent—exactly the qualities a robot needs when its camera is in a different spot than the camera in the training data. PhysiFormer arrived as part of a larger surge of world-model research this week, and it represents one of the cleaner ideas in that wave: stop predicting the photograph, start predicting the thing.
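The viewpoint point is easy to see with a bare-bones pinhole camera. In this sketch (an unrotated camera looking down +z, with a hypothetical `project` helper), the same world-space position lands on different image coordinates depending on where the camera sits, while the world-space value itself never changes.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def project(point: Vec3, camera_position: Vec3,
            focal_length: float = 1.0) -> Tuple[float, float]:
    """Pinhole projection for a camera looking down +z with no rotation."""
    x = point[0] - camera_position[0]
    y = point[1] - camera_position[1]
    z = point[2] - camera_position[2]
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (focal_length * x / z, focal_length * y / z)


# The quantity a coordinate-space model predicts: one position in the world.
ball_world = (1.0, 2.0, 4.0)

# Two cameras looking at the same ball from different spots.
camera_a = (0.0, 0.0, 0.0)
camera_b = (1.0, 0.0, 2.0)

pixel_a = project(ball_world, camera_a)
pixel_b = project(ball_world, camera_b)
```

A pixel-space model has to learn `pixel_a` and `pixel_b` as if they were different situations. A coordinate-space model only ever deals with `ball_world`, and any camera's view can be derived from it afterward.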

The honest caveat: representing the world as 3D meshes assumes you can get those meshes in the first place, which is straightforward in simulation and much harder from a raw camera feed in a messy real kitchen. The results are reported on the authors' own evaluations against autoregressive baselines, and a method that shines on controlled object-motion tasks still has to prove itself on the clutter and noise of the real world. But the core bet—that geometry beats appearance for understanding physics—is a compelling one, and it's a direction worth watching as world models mature.

Originally published on Ground Truth, where every claim is checked against the primary source.
