S-Agent: Spatial Tool-Use Makes an 8B Agent Rival GPT-5.4 on Spatial Reasoning: Spatio-Temporal Evidence Accumulation

#agents #ai #llm #machinelearning

What: S-Agent is an agent framework for spatial reasoning over multi-view images and video, where a vision-language model acts as a planner that directs a hierarchy of spatial tools to build one shared 3-D model of the scene — what the paper calls spatio-temporal evidence accumulation.

Why: Spatial questions — how far apart, how many, which way is it facing — need geometry that no single flat frame carries. Moving that geometry out of the model's head and into an explicit 3-D store lets an 8-billion-parameter agent rival GPT-5.4 and Gemini 3 on spatial reasoning.

vs prior: A standard VLM does frame-by-frame reasoning: it re-derives the whole 3-D scene inside its context on every frame and loses track as the camera moves. S-Agent instead keeps the scene in an external Scene Memory that tools refine, frame after frame.

Think of it as

A detective rebuilding a room as a 3-D scale model from a stack of photos.

             A STACK OF PHOTOS (multi-view frames)
                          │
            ┌─────────────┴─────────────┐
            │                           │
    frame-by-frame                S-Agent detective:
    re-imagines the               each photo adds one
    room every photo              measurement to one
    (the picture slips)           shared 3-D scale model
            │                           │
            ▼                           ▼
    ✗ count drifts to 11          ✓ re-sightings collapse;
      (3, 3, 2, 3 seen)             count resolves to 4

the VLM planner = the lead detective who decides the next measurement, but never measures by hand
spatial tools & experts = the specialists called in — one spots each object, one lifts it into 3-D, one measures distance and angle
the multi-view frames = the stack of flat photos, each shot from a different angle
Scene Memory = the 3-D scale model on the table that every photo refines
Agent Memory = the case notebook holding the reasoning so far
evidence accumulation = each photo adds one measurement to the model; the answer is read off the model, not re-imagined

Quick glossary

VLM (vision-language model) — A model that takes images or video plus text and reasons over both. In S-Agent the VLM is not the thing that answers — it is the planner that decides which spatial tool to call next.

Semantic planner — The role the VLM plays: it reads the question and the scene so far and chooses the next action — ground this object, lift that one into 3-D, measure this distance — rather than computing the geometry itself.

Spatial tools & experts — A hierarchy of specialized tools the planner directs: 2-D object grounding, depth/geometry that lifts a 2-D detection into a 3-D position, and measurement (counting, distance, orientation).

Spatio-temporal evidence accumulation — The core idea: each frame contributes partial geometric evidence, and the tools aggregate it across space and time into one continuous 3-D world — so the answer is built up, not guessed from one view.

Scene Memory — An external store holding the evolving 3-D state of the world — where each object sits, how big, which way it faces. It is refined as frames arrive, and the planner reads its state instead of re-deriving it.

Agent Memory — The second store: the reasoning context — what the planner has asked, which tools ran, what is still unknown. Scene Memory is what the world looks like; Agent Memory is what the agent has done about it.

Training-free — The tool hierarchy improves spatial benchmarks with no weight updates at all. Fine-tuning an 8B model on the traces it produces then yields S-Agent-8B.
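The split between the two stores in the glossary can be sketched as plain data structures. This is a minimal illustration, not the paper's schema: the class names, the fields (`position`, `size`, `yaw_deg`), and the `upsert` method are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    """One object in the running 3-D model (hypothetical schema)."""
    label: str
    position: tuple[float, float, float]  # metres, shared world frame
    size: tuple[float, float, float]      # width, height, depth in metres
    yaw_deg: float                        # facing direction around the vertical axis


@dataclass
class SceneMemory:
    """What the world looks like: the evolving 3-D state."""
    objects: dict[str, SceneObject] = field(default_factory=dict)

    def upsert(self, object_id: str, obj: SceneObject) -> None:
        # A later frame replaces the earlier estimate stored under the same id.
        self.objects[object_id] = obj


@dataclass
class AgentMemory:
    """What the agent has done about it: the reasoning context."""
    question: str
    tool_calls: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)


scene = SceneMemory()
notes = AgentMemory(
    question="How far is the chair from the door?",
    open_questions=["chair position", "door position"],
)

# A tool result lands in Scene Memory; the bookkeeping lands in Agent Memory.
scene.upsert("chair_0", SceneObject("chair", (1.0, 0.0, 2.0), (0.5, 0.9, 0.5), 90.0))
notes.tool_calls.append("ground(chair)")
notes.open_questions.remove("chair position")
```

Keeping the two stores as separate types makes the boundary explicit: tools write geometry to one, and the planner records its own progress in the other.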

The news. On June 18, 2026, researchers posted S-Agent to arXiv, an LLM-agent framework for spatial intelligence over multi-view images and video. Instead of reasoning frame-by-frame, a vision-language model acts as a semantic planner that directs a hierarchy of spatial tools: it grounds objects in 2-D, lifts them into 3-D, and aggregates geometric evidence across frames. The tool hierarchy already improves multiple spatial benchmarks with no training; after fine-tuning on its own traces, S-Agent-8B rivals GPT-5.4 and Gemini 3 on spatial reasoning.

Picture a detective walking into a room they have never seen, handed a thick stack of photos — the same space shot from a dozen angles. The hopeless way to work is to riffle through the photos and try to picture the whole room: where the chair sits relative to the door, how far the table is from the window, which way the lamp is turned. Flat photos do not carry that, and the mental image slips with every page. What actually works is to build a small 3-D scale model on the table, and let each photo add one measurement to it — then answer every question by looking at the model, not by re-imagining it from the stack. That scale model is Scene Memory, the detective deciding what to measure next is the VLM planner, and "add one measurement per photo" is spatio-temporal evidence accumulation.

Underneath the metaphor, S-Agent moves the 3-D scene out of the model's context window and into an explicit store. A plain VLM asked a spatial question sees only a sequence of flat frames and must rebuild the geometry in its head at every step, which is exactly the frame-by-frame approach that loses track as the camera moves. S-Agent instead casts the VLM as a planner that directs a hierarchy of tools in an orchestrator-and-workers arrangement: one tool grounds each object in 2-D, another lifts it into a 3-D position, another measures. Their outputs land in Scene Memory, the running 3-D model, while the planner's own reasoning lives in Agent Memory, keeping what the world looks like separate from what the agent has done.
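The division of labour described above can be sketched as a small loop. This is an illustration under stated assumptions, not S-Agent's implementation: `plan_next` is a rule-based stand-in for the VLM planner, `ground_2d` reads canned detections instead of running a detector, the depth values and camera intrinsics are made up, and lifting by pinhole back-projection is one common technique that the paper may or may not use.

```python
import math

# Hypothetical pinhole intrinsics in pixels (focal lengths and principal point).
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0


def ground_2d(frame, label):
    """Stand-in for a 2-D grounding expert: returns the pixel centre of `label`."""
    return frame["detections"][label]


def lift_3d(pixel, depth_m):
    """Back-project a pixel with known depth into camera-frame 3-D coordinates."""
    u, v = pixel
    return ((u - CX) * depth_m / FX, (v - CY) * depth_m / FY, depth_m)


def measure_distance(a, b):
    """Measurement expert: Euclidean distance between two 3-D points."""
    return math.dist(a, b)


def plan_next(scene_memory, targets):
    """Stand-in for the VLM planner: chooses the next action, computes no geometry."""
    for label in targets:
        if label not in scene_memory:
            return ("locate", label)
    return ("measure", targets)


# One canned frame: pixel centres and depths are made-up illustrative values.
frame = {
    "detections": {"chair": (420.0, 240.0), "door": (220.0, 240.0)},
    "depth": {"chair": 2.0, "door": 2.0},
}

scene_memory = {}   # label -> 3-D position written by the tools
agent_memory = []   # log of the planner's decisions
targets = ("chair", "door")
answer = None

while answer is None:
    action, arg = plan_next(scene_memory, targets)
    agent_memory.append(f"{action}:{arg}")
    if action == "locate":
        pixel = ground_2d(frame, arg)
        scene_memory[arg] = lift_3d(pixel, frame["depth"][arg])
    else:
        answer = measure_distance(scene_memory[arg[0]], scene_memory[arg[1]])

print(round(answer, 3))  # distance in metres between chair and door
```

With several views, each lifted point would also need the camera pose to land in a shared world frame; the sketch stays in a single camera frame to keep it short.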

Because the geometry now accumulates in a store the tools update, the same loop runs training-free: it changes how the agent acts, not the weights. The contrast with a code-as-action spatial agent such as SpatialClaw is instructive. Both move beyond asking one VLM to answer directly from the frames, but where that agent writes executable code as its action, S-Agent routes the work to typed spatial experts and a shared 3-D memory.

| Approach | Where the 3-D scene lives | Spatial-reasoning result |
| --- | --- | --- |
| Large VLM answering directly (e.g. GPT-5.4) | re-derived in the model's context, every frame | strong, but heavyweight |
| Code-as-action agent (SpatialClaw) | a stateful kernel the agent writes code against | +11.2 pts to 59.9% across 20 benchmarks |
| S-Agent (planner + spatial tools + Scene Memory) | an explicit 3-D model the tools refine | 8B rivals GPT-5.4 & Gemini 3 |

Where it earns its keep

Picture a four-frame clip of a kitchen and the question "how many chairs?" A frame-by-frame counter tallies sightings: say it spots 3, then 3, then 2, then 3 chairs. With no shared model it cannot tell which sightings are the same chair seen again, so its count can drift toward 11, the sum of all sightings. S-Agent places each detected chair at a 3-D coordinate in Scene Memory, so re-sightings from new angles collapse onto the same point and the count resolves to 4. (The four-frame count is illustrative; only the 8-billion-parameter scale, the training-free gains, and the GPT-5.4 / Gemini 3 parity come from the paper.) That is the whole bet of accumulating evidence into one 3-D store rather than re-reasoning each flat frame: the geometry stops slipping, and an 8B agent reaches the neighborhood of frontier models built at far larger scale.
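The chair example can be worked through in a few lines. This is a sketch under assumptions, not the paper's association method: the coordinates are invented, sightings are assumed to be already lifted into a shared world frame, and greedy nearest-neighbour merging with a fixed `MERGE_RADIUS_M` is a simple stand-in for whatever matching S-Agent's tools perform.

```python
import math

# Hypothetical threshold: sightings closer than this are treated as one chair.
MERGE_RADIUS_M = 0.3

# Illustrative world-frame chair sightings per frame, in metres (3, 3, 2, 3).
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
    [(1.02, 0.0, 0.01), (0.0, 0.0, 0.98), (1.0, 0.0, 1.0)],
    [(0.01, 0.0, 0.02), (0.99, 0.0, 1.01)],
    [(0.0, 0.0, 1.01), (1.01, 0.0, 0.99), (0.98, 0.0, 0.0)],
]


def accumulate(frames, radius):
    """Merge sightings into objects stored as (mean_position, n_sightings)."""
    objects = []
    for sightings in frames:
        for point in sightings:
            # Index of the nearest object already in the model, if any.
            best = min(
                range(len(objects)),
                key=lambda i: math.dist(objects[i][0], point),
                default=None,
            )
            if best is not None and math.dist(objects[best][0], point) <= radius:
                mean, n = objects[best]
                # Running mean: the re-sighting refines the stored position.
                refined = tuple((m * n + p) / (n + 1) for m, p in zip(mean, point))
                objects[best] = (refined, n + 1)
            else:
                # Nothing nearby, so this sighting is a new object.
                objects.append((point, 1))
    return objects


naive_count = sum(len(sightings) for sightings in frames)
merged = accumulate(frames, MERGE_RADIUS_M)

print(naive_count, len(merged))  # prints: 11 4
```

The per-object sighting counts still add up to 11, so no evidence is thrown away; it is attributed to four positions instead of being counted as eleven chairs.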

FAQ

What is spatio-temporal evidence accumulation?

It is the core mechanism of S-Agent: instead of answering a spatial question from one flat frame, the agent treats each frame as partial geometric evidence and aggregates it across space and time into a single continuous 3-D model of the scene. A vision-language model acts as a planner that directs spatial tools — grounding objects in 2-D, lifting them into 3-D, measuring distance, count, and orientation — and the answer is read off the assembled 3-D model rather than re-imagined from the frames.
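Reading an answer off the assembled model, rather than off the frames, can look like the queries below. This is a hedged sketch: the `scene` dictionary, its field names, the ground-plane simplification, and the yaw convention (degrees counter-clockwise from the +x axis) are all assumptions made for illustration.

```python
import math

# Hypothetical assembled scene on the ground plane (metres, yaw in degrees).
scene = {
    "chair_0": {"label": "chair", "xy": (0.0, 0.0), "yaw_deg": 90.0},
    "chair_1": {"label": "chair", "xy": (2.0, 0.0), "yaw_deg": 0.0},
    "lamp_0": {"label": "lamp", "xy": (0.0, 3.0), "yaw_deg": 0.0},
}


def count(scene, label):
    """How many objects carry this label."""
    return sum(1 for obj in scene.values() if obj["label"] == label)


def distance(scene, a, b):
    """Straight-line distance between two stored objects."""
    return math.dist(scene[a]["xy"], scene[b]["xy"])


def relative_bearing_deg(scene, observer, target):
    """Angle of `target` relative to where `observer` faces.

    0 means straight ahead; positive values are to the observer's left.
    """
    ox, oy = scene[observer]["xy"]
    tx, ty = scene[target]["xy"]
    absolute = math.degrees(math.atan2(ty - oy, tx - ox))
    # Wrap the difference into the range [-180, 180).
    return (absolute - scene[observer]["yaw_deg"] + 180.0) % 360.0 - 180.0


print(count(scene, "chair"))
print(distance(scene, "chair_0", "lamp_0"))
print(relative_bearing_deg(scene, "chair_1", "lamp_0"))
```

Each question becomes a lookup plus a little arithmetic on stored coordinates, which is why the answer stays stable no matter which frame the object was last seen in.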

Why does S-Agent matter?

It shows that spatial intelligence is gated by how the scene is represented, not just by model size. By moving the 3-D scene out of the model's context and into an explicit Scene Memory that tools refine frame by frame, an 8-billion-parameter agent rivals GPT-5.4 and Gemini 3 on spatial reasoning, and the tool hierarchy improves benchmarks training-free — before any fine-tuning at all.

How is S-Agent different from a vision-language model answering directly?

A VLM answering directly does frame-by-frame reasoning: it must re-derive the whole 3-D scene inside its context on every frame, which is lossy and slips as the camera moves. S-Agent recasts the VLM as a planner that directs a hierarchy of spatial tools and keeps the evolving 3-D state in an external Scene Memory, with the reasoning context held separately in Agent Memory — so geometry accumulates instead of being re-guessed each step.

Originally posted on Learn AI Visually.