Researchers introduce a supervision technique that helps vision-language models imagine unseen perspectives, outperforming traditional text-based reasoning approaches.
A team of researchers has developed a novel training methodology designed to address a persistent weakness in vision-language models: their inability to reason about spatial information that falls outside the visible frame. The approach, called Imaginative Perception Tokens, offers a fundamentally different way to teach AI systems to mentally simulate alternative viewpoints and occluded environments.
The challenge that motivated this work is straightforward yet significant. Current multimodal AI models perform admirably when analyzing what appears directly in an image, but they falter when tasked with inferring spatial relationships that require mental visualization. Consider a robot needing to navigate around an obstacle it cannot see, or a system asked to count objects visible from multiple angles when only one perspective is available. These scenarios demand something beyond pattern recognition: they require the model to construct an internal representation of three-dimensional space.
A New Training Signal for Spatial Understanding
According to the paper posted on arXiv, researchers from institutions including the University of Washington and Google developed three benchmark datasets totaling approximately 20,000 examples to evaluate spatial reasoning capabilities. The tasks include perspective-taking scenarios where a model must describe what would be visible from a different vantage point, path-tracing exercises requiring navigation through partially hidden routes, and multiview counting that demands reconciling observations from multiple angles.
The core innovation lies in the Imaginative Perception Tokens themselves. Rather than asking models to generate images or write lengthy explanations about unseen spaces, the system generates intermediate representations that encode what the model would perceive under alternative spatial configurations. These tokens maintain consistency with the observable input while extending reasoning into the unobserved domain.
Superior Performance Without Image Generation

The results demonstrate meaningful improvements across evaluation metrics. On multiview counting tasks, the approach achieved an accuracy gain of 3.4 percent. Performance on path-tracing benchmarks proved competitive with powerful closed-source commercial models, despite running on openly available architectures.
Perhaps most intriguingly, this token-based supervision consistently outperformed textual chain-of-thought training, a popular alternative where models are prompted to explain their reasoning step by step through language. The researchers suggest this discrepancy reflects a fundamental mismatch: spatial reasoning may be inherently difficult to express and process through linguistic tokens, whereas direct perceptual representations align more naturally with how vision systems operate.
The approach also proved compatible with label-only supervision: combining perceptual token training with minimal textual guidance produced stronger results than either method alone. By contrast, adding chain-of-thought reasoning to this combination often degraded performance, indicating that forcing spatial computation through linguistic channels introduces unnecessary complexity.
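To make the difference between these supervision regimes concrete, here is a minimal sketch of how the training target for a single example might be laid out under each one. It is an illustration under stated assumptions, not the researchers' implementation: the `<perc>` delimiters, the `Example` fields, the placeholder token IDs, and the `build_target` helper are all hypothetical, and the article does not say whether perceptual tokens and the label share one sequence or are mixed at the dataset level.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical delimiters; the article does not name the actual special tokens.
PERC_START, PERC_END = "<perc>", "</perc>"


@dataclass
class Example:
    question: str
    label: str                    # final answer, e.g. an object count
    perception_tokens: List[str]  # placeholder IDs standing in for the imagined view
    rationale: str                # textual chain-of-thought explanation


def build_target(ex: Example, mode: str) -> List[str]:
    """Return the token sequence the model would be trained to emit."""
    if mode == "label_only":
        # Supervise the final answer only.
        return [ex.label]
    if mode == "cot":
        # Supervise a written explanation, then the answer.
        return ex.rationale.split() + [ex.label]
    if mode == "perception":
        # Supervise intermediate perceptual tokens, then the answer.
        return [PERC_START, *ex.perception_tokens, PERC_END, ex.label]
    raise ValueError(f"unknown mode: {mode}")


# Made-up example for illustration only.
ex = Example(
    question="How many chairs are there across both views?",
    label="4",
    perception_tokens=["p17", "p3", "p88"],
    rationale="Two chairs are visible here and two more are behind the table.",
)

for mode in ("label_only", "cot", "perception"):
    print(mode, build_target(ex, mode))
```

In this framing, the reported finding is that the third layout trains better than the second, which is consistent with the suggestion that spatial content is awkward to route through words.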
Implications for Embodied AI
This research carries practical implications for robotics and autonomous systems that must navigate and reason about physical environments. By developing interpretable intermediate representations rather than opaque learned patterns, the approach also offers advantages for model transparency and debugging.
The work suggests that how information is represented during training profoundly shapes what tasks a model can excel at. As AI systems take on increasingly sophisticated spatial reasoning responsibilities, matching the supervision signal to the underlying computational challenge may prove as important as the model architecture itself.
This article was originally published on AI Glimpse.