David Evans

What Is Learn-to-Steer? NVIDIA’s 2025 Spatial Fix for Text-to-Image Diffusion

Text-to-image diffusion models have become the workhorses of generative imaging. They can paint photorealistic scenes, mimic art styles, and blend concepts in ways that were science fiction a few years ago. Yet they stumble embarrassingly on a skill that even small children master: basic spatial reasoning.

Ask a state-of-the-art model for “a dog to the right of a teddy bear” and you often get:

  • The dog on the left
  • One of the objects missing
  • Or a bizarre hybrid where dog and teddy are fused into a single creature

These failures become more severe for unusual compositions like “a giraffe above an airplane”. Traditional fixes range from expensive fine-tuning to brittle, hand-written loss functions at inference time—but both options come with significant downsides.

NVIDIA’s Learn-to-Steer framework (accepted to WACV 2026) proposes a different path: instead of hard-coding spatial rules or retraining the entire model, it learns a data-driven objective that can “steer” diffusion at inference time. The method reads the model’s own cross-attention maps, trains a lightweight classifier to detect spatial relations, and then uses that classifier’s gradient as a learned loss to nudge the generation towards layouts that match the prompt.

In this blog, we’ll unpack:

  • What makes spatial reasoning so fragile in current diffusion models
  • How Learn-to-Steer learns spatial constraints from the model itself
  • How it steers images during generation without changing model weights
  • The headline gains reported on spatial benchmarks like GenEval and T2I-CompBench
  • The trade-offs in compute cost and generality, and what this implies for future generative systems

Why Spatial Reasoning Fails in Text-to-Image Diffusion

What Makes Spatial Relations So Difficult for Diffusion Models?

Modern diffusion models (e.g., Stable Diffusion, Flux) are excellent at deciding what should appear in an image—objects, styles, textures—but much less reliable at deciding where those objects should go.

Several factors contribute:

Weak supervision of spatial language

  • Training data rarely comes with precise annotations like “object A is left of object B”.
  • Captions often describe content loosely, so phrases like “on top of” or “to the right of” are under-specified.

Entangled visual concepts

  • When two objects frequently co-occur, models may treat them as a single visual blob.
  • This leads to object fusion, where a “cat on a bookshelf” becomes a cat-bookshelf chimera.

Benchmark saturation without spatial coverage

  • Many standard text-to-image benchmarks emphasize realism and style, not relational accuracy.
  • Models can score highly while still being spatially confused.

Empirical studies confirm three recurring failure modes on spatial benchmarks:

  • Incorrect placement: Objects appear in the wrong relative position.
  • Missing entities: One or more requested objects never appear.
  • Merged entities: Two objects get mashed into a single, incoherent form.

The model “knows” the objects you asked for, but it doesn’t reliably understand where to place them.

Why Fine-Tuning and Handcrafted Losses Are Not Enough

Two broad strategies have tried to patch this gap:

Fine-tuning for spatial awareness

  • Retrain the diffusion model on datasets with explicit layouts or spatial annotations.
  • Methods like COMPASS show that this can significantly improve spatial accuracy.
  • But this comes at a cost: expensive retraining, sensitivity to dataset bias, and often regressions in other capabilities such as color fidelity or counting.

Handcrafted test-time losses

  • At inference, inject extra loss terms that penalize spatial errors (e.g., overlapping activation maps, incorrect ordering).
  • These losses must be manually designed to approximate relations like “left of” or “above”.
  • In practice, these heuristics are fragile, often over-fitting simple cases and failing on more complex layouts.
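
For intuition, a handcrafted "left of" heuristic often boils down to comparing where two objects' attention mass sits horizontally. The sketch below is a toy example of that style of loss, not any specific method's formula.

```python
import torch

def handcrafted_left_of_loss(map_a: torch.Tensor, map_b: torch.Tensor) -> torch.Tensor:
    # map_a, map_b: (H, W) attention maps, treated as soft spatial distributions
    W = map_a.shape[-1]
    xs = torch.arange(W, dtype=map_a.dtype)
    center_a = (map_a.sum(dim=0) * xs).sum() / map_a.sum()   # x-centroid of object A
    center_b = (map_b.sum(dim=0) * xs).sum() / map_b.sum()   # x-centroid of object B
    # Positive loss whenever A's centroid is not to the left of B's centroid.
    return torch.relu(center_a - center_b)

print(handcrafted_left_of_loss(torch.rand(16, 16), torch.rand(16, 16)))
```

Every new relation needs its own hand-tuned formula like this one, which is where the brittleness comes from.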

In short, we’ve lacked a solution that is:

  • Data-driven rather than rule-based
  • Plug-and-play at inference time (no full retraining)
  • Targeted enough to improve spatial reasoning without damaging other strengths

This is where Learn-to-Steer enters.

How Learn-to-Steer Works: Data-Driven Steering at Inference

How Cross-Attention Maps Provide a Spatial Signal

During diffusion, at each denoising step, the model computes cross-attention maps that connect text tokens to image regions. For a prompt like “a dog to the right of a teddy bear”, you can think of:

  • One set of attention maps for “dog”
  • Another set for “teddy bear”
  • Additional context around words like “right” or “of”

These maps form a rich, high-dimensional signal describing where in the image the model currently believes each word should manifest. Prior work has used cross-attention to locate objects or edit images; Learn-to-Steer goes further by treating them as a feature space in which spatial relations can be learned.
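
To make this concrete, here is a minimal PyTorch sketch of how cross-attention turns a prompt token into a spatial map. The tensor shapes and the token index are illustrative assumptions, not values from any particular model.

```python
import torch

# Illustrative shapes only: a 16x16 latent grid, 77 text tokens, a 64-dim head.
image_queries = torch.randn(1, 16 * 16, 64)   # queries come from the image latents
text_keys = torch.randn(1, 77, 64)            # keys come from the prompt embedding

# Cross-attention: each image location distributes its attention over text tokens.
scores = image_queries @ text_keys.transpose(1, 2) / 64 ** 0.5   # (1, 256, 77)
attn = scores.softmax(dim=-1)

# The column for a given token index is that word's spatial map.
dog_token_idx = 2                             # hypothetical position of "dog" in the prompt
dog_map = attn[0, :, dog_token_idx].reshape(16, 16)
print(dog_map.shape)   # torch.Size([16, 16]): where the model currently "looks" for "dog"
```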

How a Relation Classifier Becomes a Learned Loss

The core idea of Learn-to-Steer is to train a small relation classifier that takes cross-attention maps for two objects and predicts the spatial relation between them (left-of, right-of, above, below, etc.).

The pipeline looks like this:

Collect supervision

  • Use images where the true relation between object A and object B is known (from datasets like GQA and synthetic layouts).
  • For each image, invert it through the diffusion model with a descriptive prompt to recover cross-attention maps for the relevant tokens.

Train a classifier on attention patterns

  • Input: attention maps for object A and object B.
  • Output: predicted relation (e.g., “A is left of B”).
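
A minimal PyTorch sketch of what such a classifier could look like is shown below. The architecture, map resolution, and relation set are assumptions for illustration; the paper's exact classifier is not reproduced here.

```python
import torch
import torch.nn as nn

RELATIONS = ["left_of", "right_of", "above", "below"]

class RelationClassifier(nn.Module):
    def __init__(self, num_relations: int = len(RELATIONS)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, padding=1),  # 2 channels: map for A, map for B
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, num_relations),
        )

    def forward(self, map_a: torch.Tensor, map_b: torch.Tensor) -> torch.Tensor:
        # map_a, map_b: (batch, H, W) attention maps for the two object tokens
        x = torch.stack([map_a, map_b], dim=1)           # (batch, 2, H, W)
        return self.net(x)                               # relation logits

logits = RelationClassifier()(torch.rand(4, 16, 16), torch.rand(4, 16, 16))
print(logits.shape)   # torch.Size([4, 4])
```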

Naively, however, this leads to a subtle but serious issue: relation leakage.

How Dual Inversion Solves the “Relation Leakage” Problem

If you always invert images with a correct prompt (e.g., “a dog to the left of a cat”), hints about the word “left” can leak into the attention patterns. A naïve classifier might then “cheat” by reading out linguistic artefacts instead of learning genuine visual geometry.

To prevent this, Learn-to-Steer uses a dual inversion strategy:

  • For each image with a true relation (say, dog left of cat), create two prompts:
    • A positive prompt with the correct relation (“dog to the left of a cat”).
    • A negative prompt with an incorrect relation (“dog above a cat”).
  • Run inversion with both prompts, obtaining two sets of attention maps.
  • Label both sets with the true relation (left-of), because that is what the image actually depicts.

The classifier sees pairs of attention maps that share the same underlying geometry but differ in the relation words used in the prompt. To succeed, it must ignore the unreliable linguistic cue and zero in on the geometric evidence in the attention patterns. This breaks the leakage shortcut and yields a classifier that actually understands “left-of” in terms of where things appear in the model’s internal vision.
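
In code, the labeling scheme might look like the sketch below, where `invert_and_get_attention` is a hypothetical stand-in for inversion plus attention extraction and the prompt templates are illustrative.

```python
import torch

def invert_and_get_attention(image, prompt):
    # Stub for illustration: the real pipeline would run diffusion inversion with
    # `prompt` and collect cross-attention maps for the two object tokens.
    return torch.rand(2, 16, 16)

PROMPT_TEMPLATES = {
    "left_of": "a {a} to the left of a {b}",
    "right_of": "a {a} to the right of a {b}",
    "above": "a {a} above a {b}",
    "below": "a {a} below a {b}",
}

def dual_inversion_examples(image, obj_a, obj_b, true_rel, wrong_rel):
    """Return two (attention_maps, label) training examples that share one image."""
    pos_maps = invert_and_get_attention(image, PROMPT_TEMPLATES[true_rel].format(a=obj_a, b=obj_b))
    neg_maps = invert_and_get_attention(image, PROMPT_TEMPLATES[wrong_rel].format(a=obj_a, b=obj_b))
    # Both examples carry the *true* relation as the label, so the classifier
    # cannot rely on the relation word in the prompt, only on the geometry.
    return [(pos_maps, true_rel), (neg_maps, true_rel)]

examples = dual_inversion_examples(image=None, obj_a="dog", obj_b="cat",
                                   true_rel="left_of", wrong_rel="above")
print([label for _, label in examples])   # ['left_of', 'left_of']
```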

To improve robustness, NVIDIA combines:

  • Real images (complex, natural scenes)
  • Synthetic images (simpler, cleaner attention patterns akin to generation scenarios)

How Learn-to-Steer Guides Images During Generation

Step-by-Step: From Prompt to Steered Latent

Once the relation classifier is trained, Learn-to-Steer uses it at inference time as a learned objective:

Parse the spatial prompt

  • Extract subject, relation, and object from the text (e.g., subject = dog, relation = right-of, object = teddy bear).
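
A toy parser for the single-relation case might look like this; the post does not specify how Learn-to-Steer actually parses prompts, so the phrase patterns and relation names below are assumptions.

```python
import re

# Maps surface phrases to canonical relation labels; illustrative only.
RELATION_PHRASES = {
    "to the right of": "right_of",
    "to the left of": "left_of",
    "above": "above",
    "below": "below",
}

def parse_spatial_prompt(prompt: str):
    for phrase, relation in RELATION_PHRASES.items():
        m = re.search(rf"an? (.+?) {phrase} an? (.+)", prompt)
        if m:
            return m.group(1), relation, m.group(2)   # (subject, relation, object)
    return None

print(parse_spatial_prompt("a dog to the right of a teddy bear"))
# ('dog', 'right_of', 'teddy bear')
```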

Run diffusion as usual—but with checkpoints

  • As the model denoises latent noise into an image, periodically extract cross-attention maps for the subject and object tokens.

Evaluate spatial correctness

  • Feed these maps into the relation classifier, which outputs a probability distribution over relations.
  • Compare this distribution to the desired relation from the prompt, and compute a loss (e.g., cross-entropy).

Backpropagate into the latent

  • Compute the gradient of this loss with respect to the latent representation at that timestep.
  • Nudge the latent in the direction that increases the classifier’s confidence in the correct relation.

Continue the diffusion process

  • Let the denoising proceed from the adjusted latent.
  • Repeat this steering a number of times (often during the earlier half of the diffusion steps).
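
Putting those steps together, one steering update might look like the sketch below. Here `relation_classifier` and `get_attention_maps` are hypothetical stand-ins (the trained classifier and a hook that runs the denoiser on the current latent so gradients can flow back to it), and the step size is an arbitrary illustrative value.

```python
import torch
import torch.nn.functional as F

RELATIONS = ["left_of", "right_of", "above", "below"]

def steer_latent(latent, relation_classifier, get_attention_maps,
                 target_relation: str, step_size: float = 0.1):
    """One steering update: nudge the latent toward the requested relation."""
    latent = latent.detach().requires_grad_(True)

    # 1) Read the model's current spatial beliefs for the subject and object tokens.
    #    `get_attention_maps` must run the denoiser on `latent` so gradients can flow.
    map_subject, map_object = get_attention_maps(latent)        # each (H, W)

    # 2) Ask the learned classifier which relation those maps depict.
    logits = relation_classifier(map_subject[None], map_object[None])

    # 3) Penalize disagreement with the relation requested in the prompt.
    target = torch.tensor([RELATIONS.index(target_relation)])
    loss = F.cross_entropy(logits, target)

    # 4) Step the latent in the direction that raises the classifier's confidence
    #    in the correct relation, then hand it back to the denoising loop.
    (grad,) = torch.autograd.grad(loss, latent)
    return (latent - step_size * grad).detach()
```

The same update is simply repeated at a subset of timesteps, typically the earlier ones, before letting the remaining denoising steps run unmodified.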

Support for Multiple Architectures and Relations

A key advantage of Learn-to-Steer is that it’s architecture-agnostic:

  • It has been demonstrated on both UNet-based models (like Stable Diffusion 1.4/2.1) and MMDiT-style models (like Flux).
  • The only requirement is access to a text-image alignment signal (cross-attention or similar).

It can also handle prompts with multiple constraints, such as:

“A frog above a sneaker below a teapot.”

Here, Learn-to-Steer alternates which relation it optimizes across timesteps:

  • At one timestep, optimize the frog–sneaker relation.
  • At another, optimize the sneaker–teapot relation.
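
A minimal sketch of that alternation, assuming a simple round-robin schedule (the exact scheduling rule is not spelled out here) and the hypothetical `steer_latent` helper from the earlier sketch:

```python
constraints = [
    ("frog", "above", "sneaker"),
    ("sneaker", "below", "teapot"),
]

num_steering_steps = 25   # e.g., steer during the earlier half of a 50-step schedule
for t in range(num_steering_steps):
    subject, relation, obj = constraints[t % len(constraints)]
    # latent = steer_latent(latent, classifier, get_maps_for(subject, obj), relation)
    print(t, subject, relation, obj)
```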
