
Simon Paxton

Originally published at novaknown.com

DeepSeek Forces Visual Reasoning Through Points and Boxes

DeepSeek has released an open-source visual reasoning framework called Thinking with Visual Primitives. According to 36Kr, the system changes how a multimodal model is asked to reason: instead of describing an image in loose language, the model has to work through explicit visual units like point coordinates and bounding boxes.

That is a much more concrete bet than “better multimodal understanding.” It pushes reasoning closer to measurement. When a model says “the object is near the left side,” language can blur the geometry; when it has to point to coordinates or mark a box, the error has less room to hide.

What DeepSeek’s Thinking with Visual Primitives Actually Changes

The release is a framework, not just a vague claim of improved perception. 36Kr reports that DeepSeek unveiled Thinking with Visual Primitives alongside a multimodal model and a technical report, and released the work as open source.

The interesting part is the representation layer. The model is not only generating words about what it sees. It is being made to reason through visual primitives — basic spatial elements such as points and boxes.

That sounds small, but it changes the failure mode. A lot of visual reasoning errors happen when the model jumps too quickly from pixels to prose. It can produce a fluent sentence that sounds right while quietly dropping the actual layout of the scene.

With visual primitives, the model has to show more of its work. If a task depends on location, size, or relative position, a coordinate is harder to fudge than a sentence.
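
To make that concrete, here is a minimal sketch of the two answer styles. The schema is hypothetical — the 36Kr report does not spell out DeepSeek's actual output format — but it shows why one answer is checkable and the other is not:

```python
from dataclasses import dataclass

@dataclass
class Point:
    x: float  # pixel coordinates, origin at the image's top-left
    y: float

@dataclass
class Box:
    x_min: float
    y_min: float
    x_max: float
    y_max: float

# Language-first output: fluent, but nothing here can be verified
# without going back to the pixels.
text_answer = "The mug is near the left side of the table."

# Primitive-grounded output (hypothetical schema): every claim maps
# back to geometry that can be scored against ground truth.
primitive_answer = {
    "object": "mug",
    "box": Box(x_min=42.0, y_min=118.0, x_max=96.0, y_max=180.0),
    "handle": Point(x=91.0, y=150.0),
}
```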

Why Visual Primitives Beat Vague Descriptions for Spatial Tasks

The core claim 36Kr makes is specific: the framework improves spatial reasoning by requiring precise visual data points instead of vague natural-language descriptions. In practice, that means the model has to anchor its reasoning in things that can be checked.

Take a simple spatial task. “Which object is closest to the top-right corner?” A language-first system might narrate the scene and guess based on a rough impression. A primitive-based system can mark candidate objects with bounding boxes, compare positions, and reason from those coordinates.
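
Boxes make that comparison mechanical rather than impressionistic. Here is a minimal sketch — hypothetical detections, not DeepSeek's code — of how the answer can be computed and checked:

```python
import math

# Hypothetical detections: label -> (x_min, y_min, x_max, y_max) in pixels,
# with the origin at the image's top-left corner.
boxes = {
    "lamp": (520.0, 30.0, 600.0, 140.0),
    "book": (100.0, 80.0, 220.0, 160.0),
    "cup":  (560.0, 200.0, 610.0, 260.0),
}
image_width = 640.0  # the top-right corner sits at (image_width, 0)

def center(box):
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)

def distance_to_top_right(box):
    cx, cy = center(box)
    return math.hypot(image_width - cx, cy)

closest = min(boxes, key=lambda name: distance_to_top_right(boxes[name]))
print(closest)  # "lamp" — its center is nearest the top-right corner
```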

Or imagine “point to the handle of the mug.” The phrase “the handle is on the side” is descriptive, but it is not an answer you can directly score. A point coordinate is.
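
And scoring a point is trivially automatable. A hypothetical check — a predicted point either lands in the ground-truth region or it does not:

```python
def point_in_box(point, box):
    """Hit test: does a predicted point land inside a ground-truth box?"""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max

# Hypothetical ground-truth region for "point to the handle of the mug".
handle_box = (88.0, 135.0, 104.0, 168.0)

print(point_in_box((91.0, 150.0), handle_box))  # True — directly scoreable
# "The handle is on the side" offers no equivalent pass/fail check.
```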

That distinction matters because language is compressive. It throws away detail on purpose. Humans do this constantly and get away with it because we share context. Models often do not. They replace measurement with summary, and summary is where a lot of hallucination-like behavior starts.

This is the same broad instinct behind work to reduce LLM hallucinations: force the system to stay attached to observable structure for as long as possible. Here, the observable structure is spatial.

There is a nice symmetry with research on zero-shot world models, too. If you want a model to reason about a scene, you need a representation that preserves the scene. Text alone often smooths over exactly the information you care about.

Why Open Source Matters for Visual Reasoning

DeepSeek released Thinking with Visual Primitives as open source, according to 36Kr. For this kind of work, that matters more than usual.

A lot of multimodal claims are hard to inspect. You get a demo, a benchmark headline, maybe a polished sample image — but not the machinery that tells you what changed. Open-sourcing a visual reasoning framework gives researchers and builders something much more useful: a way to test whether the representation itself is doing the work.

That opens up a few concrete paths:

  • Researchers can compare primitive-based reasoning against language-only baselines.
  • Builders can inspect where coordinate constraints help and where they add overhead.
  • The community can try alternative visual primitives, scoring methods, or training recipes (a standard scoring baseline is sketched after this list).
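
On the scoring front, there is at least one well-established baseline to build from: intersection over union, the standard overlap score for bounding boxes. A minimal reference implementation (a generic metric, not something specific to DeepSeek's framework):

```python
def iou(box_a, box_b):
    """Intersection over union for two (x_min, y_min, x_max, y_max) boxes."""
    ax_min, ay_min, ax_max, ay_max = box_a
    bx_min, by_min, bx_max, by_max = box_b
    # Overlap rectangle; width/height clamp to zero if the boxes are disjoint.
    inter_w = max(0.0, min(ax_max, bx_max) - max(ax_min, bx_min))
    inter_h = max(0.0, min(ay_max, by_max) - max(ay_min, by_min))
    inter = inter_w * inter_h
    area_a = (ax_max - ax_min) * (ay_max - ay_min)
    area_b = (bx_max - bx_min) * (by_max - by_min)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 0.142857...
```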

This is where open-source AI keeps getting more interesting. Once the representation layer is visible, progress is not limited to whoever owns the API. Other teams can copy, modify, and pressure-test the idea directly.

And this specific idea is worth pressure-testing. If multimodal models keep getting larger without getting more grounded, they will keep producing confident spatial mistakes. A framework built around points and boxes is a direct attempt to fix that at the level where the mistake starts.

What the DeepSeek-PKU-Tsinghua Collaboration Signals

36Kr says the work was developed in collaboration with Peking University and Tsinghua University. That does not just add prestige. It suggests a research direction.

This looks like an effort to treat multimodal reasoning as a representation problem, not only a scale problem. Bigger models can help, but there is a different question underneath: what internal units should a model think with when the task is visual and spatial?

DeepSeek’s answer here is unusually explicit. Use primitives that map back to the image. Make reasoning legible in geometric terms. Reduce the amount of hidden translation from scene to text.

That is a strong signal because it points away from the “just let the model narrate more” approach. If that collaboration keeps producing work in this vein, expect more systems that mix symbolic-looking structure with end-to-end multimodal learning.

It is also a useful contrast with a lot of current multimodal product rhetoric. Plenty of systems claim to “understand images.” Far fewer specify the units of that understanding. DeepSeek, PKU, and Tsinghua are at least making a falsifiable bet: that visual primitives are a better substrate for some reasoning tasks than free-form language.

Key Takeaways

  • DeepSeek released an open-source visual reasoning framework called Thinking with Visual Primitives.
  • The framework pushes a multimodal model to reason with point coordinates and bounding boxes, not just textual descriptions.
  • That matters most for spatial reasoning, where language often blurs position, size, and relative layout.
  • The open-source release lets researchers test whether grounded representations improve multimodal performance in practice.
  • The collaboration with Peking University and Tsinghua University signals serious interest in changing the representation layer, not just scaling model size.

The Open Question

The open question is whether forcing models to think in points and boxes will stay useful as tasks get more abstract — or whether grounded primitives will turn out to be the missing layer multimodal systems needed all along.


