SB Lee

Posted on May 19

DreamZero vs Motus

#ai #worldmodel #vla #llm

I. Autoregressive & Bidirectional Unified Generation

DreamZero and Motus are the two most representative works in the field of World Action Models (WAMs). Both conduct joint video + action generation, adopt flow matching, and use chunk-based generation. However, they differ completely in generation paradigms, training mechanisms, alignment methods, and inference logic.

DreamZero adopts autoregressive causal generation (only looks at the past, not the future).

It generates chunk by chunk in chronological order.
It strictly follows causal attention masking.
At each time step, it can only see historical observations, not the future.
It naturally conforms to robots' physical timing logic.
It is suitable for closed-loop control and real-time inference.

Motus adopts a flexible unified framework (capable of looking at the future, non-causal).

Motus builds on a UniDiffuser-style unified multimodal framework whose underlying architecture is bidirectional and non-causal, and it supports switching among 5 task modes.
These include: VLA (action-only generation), WM (world model: future video conditioned on actions), IDM (inverse dynamics: actions inferred from video), VGM (video generation model: video-only generation), and Video-Action Joint generation.
In video-action joint generation mode, Motus adopts a bidirectional non-causal architecture to generate the entire future video chunk and action chunk in one go, achieving temporal alignment via video sparsification and action densification.
In VLA mode, Motus only outputs actions without generating videos; the video side remains in a pure noise state with no video generation performed.
In this mode there is no bidirectional global modeling, as the video side contains no content for bidirectional interaction.
Its inference relies on the interaction between the action expert and the understanding expert, rather than complete tri-modal joint attention.
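As a rough sketch of how a UniDiffuser-style framework can switch modes, each modality can be given a role: denoised from noise, supplied as a clean condition, or left as pure noise. The role names, the `MODES` table, and the 0-to-1 timestep range below are illustrative assumptions, not Motus's actual configuration, and only three of the five modes are shown.

```python
# Hypothetical role labels for a modality during inference.
NOISE = "pure_noise"         # never denoised, carries no content
CLEAN = "clean_condition"    # given as an observed, noise-free input
DENOISE = "denoise"          # generated by iterative denoising

# Assumed role assignment for three modes described in the text.
MODES = {
    "VLA":   {"video": NOISE,   "action": DENOISE},  # action-only output
    "IDM":   {"video": CLEAN,   "action": DENOISE},  # actions inferred from video
    "JOINT": {"video": DENOISE, "action": DENOISE},  # video and action together
}

def initial_timestep(role, max_t=1.0):
    """Noise level a modality starts from (0.0 = clean, max_t = pure noise)."""
    return 0.0 if role == CLEAN else max_t

def generated_modalities(mode):
    """Modalities that are actually produced as output in the given mode."""
    return sorted(name for name, role in MODES[mode].items() if role == DENOISE)

for mode in MODES:
    print(mode, generated_modalities(mode))
```

Under this assumption, VLA mode outputs only actions because the video side never leaves its pure-noise state.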

TIP: Architecturally, Motus is bidirectional and non-causal.
Functionally, Motus is a flexible unified framework.
Motus does not adhere to non-causality in all cases. Its default architecture is bidirectional, built on the bidirectional, non-causal UniDiffuser, but causality can be restricted via masking.

II. Chunk-Based Generation Mechanism

Both use chunks, but their logics are entirely different.

DreamZero: Autoregressive Chunk Generation

Each chunk depends on the output of the previous chunk.
Training uses chunk-wise teacher forcing.
Inference uses KV caching, computing only the current chunk.
Each chunk corresponds to a fixed duration of action.
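The two DreamZero phases listed above can be illustrated with a toy sketch. Chunks are plain Python values here, the list `cache` merely stands in for cached keys and values, and `predict_next` is a hypothetical placeholder for the model; none of these names come from DreamZero.

```python
def teacher_forcing_pairs(gt_chunks):
    """Training: each target chunk is paired with a ground-truth history."""
    return [(gt_chunks[:i], gt_chunks[i]) for i in range(1, len(gt_chunks))]

def autoregressive_rollout(first_chunk, num_new, predict_next):
    """Inference: each new chunk depends on everything generated so far."""
    cache = [first_chunk]                    # stand-in for cached keys/values
    for _ in range(num_new):
        cache.append(predict_next(cache))    # only the new chunk is computed
    return cache[1:]

pairs = teacher_forcing_pairs(["c0", "c1", "c2"])
rollout = autoregressive_rollout(0, 3, lambda cache: cache[-1] + 1)
print(pairs)
print(rollout)
```

The difference is where the history comes from: ground truth during training, the model's own outputs during rollout.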

Motus: Global Bidirectional Chunk Modeling, Multi-Step Denoising Generation

During training: Temporal relationships of all chunks are modeled at once with Tri-modal Joint Attention. All frames and action tokens are mutually visible (bidirectional and non-causal), conditioned on real historical observations, and independent of autoregressive generation results.
During inference: Multi-step iterative denoising is performed based on Rectified Flow (as shown in Algorithm 6, where τ gradually decreases from Tτ to 1). The entire future sequence is generated jointly rather than chunk by chunk, but this takes multiple denoising steps, not a single forward pass.
No autoregressive KV caching: Due to the non-causal architecture, there is no need to maintain historical KV caching as DreamZero does. However, the diffusion denoising process requires multi-step computation, with inference overhead concentrated on global attention calculation within a single step.
Video downsampling alignment is required: Long sequences are generated in one pass, so video frames are sampled more sparsely than actions (e.g., 1:6 sparsification) to avoid token imbalance.
Suitable for unified modeling, not long-horizon closed-loop tasks.
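To make the multi-step denoising concrete, here is a minimal Euler sampler for a rectified-flow style model. It assumes the straight-line path x_t = (1 - t) * data + t * noise, and it uses a hypothetical oracle velocity that already knows the target, where a real model would use a learned network. The function names and step count are illustrative only.

```python
import random

def euler_sample(velocity, x_noise, num_steps):
    """Integrate from t=1 (pure noise) down to t=0 (data) in num_steps steps."""
    x = list(x_noise)
    dt = 1.0 / num_steps
    for k in range(num_steps, 0, -1):       # step index decreases, like tau
        t = k / num_steps
        v = velocity(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x

# Toy "whole future sequence", flattened to three numbers.
target = [0.5, -1.0, 2.0]

def oracle_velocity(x, t):
    """On the straight path, velocity = noise - data = (x_t - data) / t."""
    return [(xi - ti) / t for xi, ti in zip(x, target)]

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(oracle_velocity, noise, num_steps=10)
print(sample)
```

Every step updates all positions of the sequence at once, which is why the cost concentrates in global attention within each step rather than in a growing cache.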

How Motus Achieves Global Temporal Dependencies

Motus’s global temporal dependencies are not generated step-by-step in chronological order like DreamZero, which only looks at the past. Instead:
The entire temporal sequence of "past, present, future" is fed into the model at once.
It adopts Tri-modal Joint Attention, concatenating the QKV of the three experts in the attention layer to achieve cross-modal information fusion.
Each expert retains independent feedforward networks and normalization layers. All frames and action tokens are mutually visible and influential, enabling unified modeling of the entire temporal sequence.
It adopts UniDiffuser-style global bidirectional temporal modeling. During training and inference, it concatenates the following into one long token sequence, which is fed into the Transformer:
[Initial frames, all future frames to be generated, all future actions to be generated]
For example: [Historical Frames], [Future Frame 1], [Future Frame 2], [Future Frame 3], [Action 1], [Action 2], [Action 3], [Action 4]
It is not generated frame-by-frame or chunk-by-chunk; instead, the entire temporal sequence is input and modeled at once.
Motus uses Tri-modal Joint Attention with a simple rule: all temporal positions are mutually visible, with no causal restrictions.

Historical Frames → Can see future frames
Future Frames → Can see historical frames
Actions → Can see all video frames (past + future)
Video Frames → Can see all actions

It does not use lower triangular causal masking, which is completely opposite to GPT and DreamZero.
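The contrast between the two visibility rules can be drawn as attention masks. This is a schematic sketch: the chunk sizes are made up, and letting tokens inside one chunk see each other in the causal case is an assumption rather than a documented detail of DreamZero.

```python
def block_causal_mask(chunk_sizes):
    """mask[i][j] is True when token i may attend to token j.

    A token sees its own chunk and every earlier chunk, never a later one.
    """
    chunk_of = [c for c, size in enumerate(chunk_sizes) for _ in range(size)]
    n = len(chunk_of)
    return [[chunk_of[j] <= chunk_of[i] for j in range(n)] for i in range(n)]

def full_mask(n):
    """Bidirectional case: every token sees every other token."""
    return [[True] * n for _ in range(n)]

def show(mask):
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)

causal = block_causal_mask([2, 2, 2])   # three chunks of two tokens each
full = full_mask(6)
print(show(causal))
print()
print(show(full))
```

The first printout is block lower-triangular, while the second is completely filled.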
Motus adopts a deliberately designed Video-Sparse, Action-Dense strategy, setting the video frame rate to 1/6 of the action frequency. This offsets the imbalance in token counts between video and actions, preventing the model from overfitting to video generation at the expense of action prediction.
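A quick token count shows why the 1:6 ratio matters. The numbers are purely illustrative: 48 actions per sequence, 64 tokens per video frame, and one token per action are all assumptions chosen for the arithmetic, not values from the Motus paper.

```python
def token_budget(num_actions, tokens_per_frame, actions_per_frame=6):
    """Count video and action tokens when one frame is kept per N actions."""
    num_frames = num_actions // actions_per_frame
    return {
        "video_frames": num_frames,
        "video_tokens": num_frames * tokens_per_frame,
        "action_tokens": num_actions,   # assumes one token per action
        # index of the action step that each kept frame lines up with
        "aligned_action_index": [k * actions_per_frame for k in range(num_frames)],
    }

sparse = token_budget(num_actions=48, tokens_per_frame=64)
dense = token_budget(num_actions=48, tokens_per_frame=64, actions_per_frame=1)
print(sparse)
print(dense["video_tokens"])
```

With these assumed numbers, sparsification cuts video tokens from 3072 to 512 while the 48 action tokens are unchanged, so actions make up a much larger share of the sequence.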

Regarding Downsampling

Whether video downsampling is required is not a manually chosen strategy but a natural consequence of the model’s generation paradigm.
Bidirectional and autoregressive models differ fundamentally in this characteristic, which directly affects the authenticity of video timing and the accuracy of action alignment.
Bidirectional generation models (such as Motus, UniDiffuser, and mainstream video generation architectures) adopt a global synchronous generation mechanism. The model processes the complete video sequence in one pass, with all frames mutually visible and influencing each other.
Past frames can attend to future frames, and future frames can act backward on past frames, forming a globally unified temporal representation.
To achieve full-sequence joint modeling, the model must use fixed-length video chunks during both training and inference, because the global attention mechanism requires a predefined sequence length to construct an attention matrix of fixed size.
If the sequence length is not fixed, the size of the attention map changes dynamically, making training unstable.
Meanwhile, the bidirectional diffusion denoising process requires synchronous noise addition and denoising for the entire video. Temporal positional encoding, noise scheduling, and sampling steps all depend on a unified sequence length.
Thus, regardless of when generation starts, a bidirectional model can only produce sequences of the fixed length set during training. When the task duration does not match the supported length, the video must be downsampled or stretched, which inevitably distorts the native frame rate and temporal structure.
In contrast, autoregressive models follow strict causal generation. The model generates future content chunk-by-chunk based solely on historical observations and continuously records historical information via KV caching.
It does not need to see the complete sequence in advance or construct a global attention map in one pass. Thus, it can keep extending generation chunk by chunk, within the limits of cache capacity, without downsampling or distorting the original frame rate.
This characteristic enables autoregressive architectures to maintain the true temporal relationship of videos, ensuring precise natural alignment between video frames and actions, and providing a more reliable temporal foundation for robot closed-loop control.
Bidirectional models must use fixed lengths and rely on downsampling, with inference extendable via sliding windows. Autoregressive models support longer contexts via KV caching but are also limited by cache capacity.
DreamZero’s advantage lies in preserving the native frame rate (no downsampling alignment), not in supporting arbitrary lengths.

III. Video + Action Joint Generation

DreamZero: Single Model, Single DiT, Joint Denoising

Video and action share the same DiT backbone.
Same layers, same attention, same temporal sequence.
Video and action share denoising time steps.
Jointly optimized with the same flow matching loss.
Video guides action, action aligns with video, deeply coupled.
Minimal architecture, no expert separation.
The mechanism for video-guided action adopts implicit inverse dynamics.
Its action prediction relies entirely on the shared attention mechanism, learning actions end-to-end from generated videos with higher physical alignment.

Motus: Mixture-of-Transformers (MoT), Tri-Modal Joint Attention

Three independent experts: video expert (Wan2.2), understanding expert (Qwen-VL), action expert (trained separately).
Tri-modal joint attention per layer.
Separate time steps and noise schedules for video and action.
Unified under the UniDiffuser framework.
Supports 5 mode switches, with looser modal coupling.
Video guides action through an explicit inverse dynamics mode.
The model first generates complete future videos, then infers actions accordingly. This step-by-step logic may be more stable and interpretable for certain tasks.
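The tri-modal joint attention in the list above can be sketched in pure Python. Each expert supplies its own queries, keys, and values, which are concatenated for one shared attention and then split back per expert. The input layout, tiny dimensions, and helper names are assumptions for illustration; the real model uses learned projections and far larger tensors.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def joint_attention(expert_qkv):
    """expert_qkv maps expert name -> (Q, K, V), each a list of token vectors."""
    names = list(expert_qkv)
    Q = [q for n in names for q in expert_qkv[n][0]]
    K = [k for n in names for k in expert_qkv[n][1]]
    V = [v for n in names for v in expert_qkv[n][2]]
    scale = 1.0 / math.sqrt(len(K[0]))
    out = []
    for q in Q:  # every query attends over the tokens of all experts
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in K])
        out.append([sum(w * v[d] for w, v in zip(weights, V))
                    for d in range(len(V[0]))])
    result, start = {}, 0
    for n in names:  # hand each expert back its own token outputs
        count = len(expert_qkv[n][0])
        result[n] = out[start:start + count]
        start += count
    return result

def toks(n, val):
    return [[val, val + 1.0] for _ in range(n)]

inputs = {
    "video": (toks(2, 0.1), toks(2, 0.2), toks(2, 1.0)),
    "understanding": (toks(1, 0.3), toks(1, 0.4), toks(1, 2.0)),
    "action": (toks(3, 0.5), toks(3, 0.6), toks(3, 3.0)),
}
outputs = joint_attention(inputs)
print({name: len(vals) for name, vals in outputs.items()})
```

In this layout only the attention step is shared. Feedforward and normalization layers would stay separate per expert, consistent with the expert separation described for Motus.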

IV. Differences in Flow Matching

DreamZero: Joint Flow Matching + Teacher Forcing

Video and action share time steps.
Inputs real historical frames during training to avoid error accumulation.
Centered on video prediction; actions derived from visual futures.
Chunk-wise teacher forcing (inputs real history chunk-by-chunk to predict the next chunk).

Motus: Rectified Flow + Decoupled Time Steps

Separate time steps for video and action.
Aims for unified modeling.
Supports multiple task mode switches.
Global one-shot teacher forcing (inputs real initial frames + history, predicts the entire future in one pass).
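The difference between shared and decoupled time steps shows up when building one flow matching training example. The sketch assumes the common straight-line convention x_t = (1 - t) * clean + t * noise with velocity target noise - clean. Uniform timestep sampling and all function names are illustrative assumptions, not the exact recipe of either paper.

```python
import random

def interpolate(clean, noise, t):
    """Noisy sample on the straight path between clean data and noise."""
    return [(1 - t) * c + t * n for c, n in zip(clean, noise)]

def velocity_target(clean, noise):
    """Regression target for the velocity on that path."""
    return [n - c for c, n in zip(clean, noise)]

def sample_timesteps(rng, decoupled):
    """Shared: one t for both modalities. Decoupled: independent draws."""
    t_video = rng.random()
    t_action = rng.random() if decoupled else t_video
    return t_video, t_action

rng = random.Random(0)
shared = sample_timesteps(rng, decoupled=False)
separate = sample_timesteps(rng, decoupled=True)

video_clean, action_clean = [1.0, 2.0], [0.5]
video_noise = [rng.gauss(0, 1) for _ in video_clean]
action_noise = [rng.gauss(0, 1) for _ in action_clean]
noisy_video = interpolate(video_clean, video_noise, separate[0])
noisy_action = interpolate(action_clean, action_noise, separate[1])
print(shared, separate)
```

With decoupled draws the model sees combinations such as clean video with noisy actions, which is what lets a single network serve several task modes.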

V. How Video Guides Action Generation

DreamZero: Vision-First → Implicit Inverse Dynamics

First predicts future video (visual planning).
Infers actions from video via shared attention, end-to-end learning.
Actions strictly follow video planning; video quality directly determines action quality.

Motus: Dual-Path Collaboration → Explicit Inverse Dynamics + Joint Generation

Path 1 (IDM Mode): First generates complete future video, then explicitly infers actions.
Path 2 (Video-Action Joint Mode): Video and action generated simultaneously, mutually constrained.
Video and action are collaborative, with no mandatory video priority → more flexible, but weaker physical alignment than DreamZero.

VI. Inference Mechanism

DreamZero: Closed-Loop Real-Time Inference

After each execution step, predicted frames in the KV cache are replaced with real observations.
This keeps prediction errors from accumulating across steps.
Explicitly optimized for 7 Hz asynchronous closed-loop control, with a dedicated mechanism for suppressing error accumulation.
Fast inference, deployable.
Designed for closed-loop control; preventing error accumulation is one of its core design goals, naturally suitable for long-horizon tasks.
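A toy one-dimensional rollout shows why swapping predictions for real observations matters. The dynamics, the biased predictor, and the list standing in for the KV cache are all hypothetical; the sketch only shows how the per-step error behaves under each policy.

```python
def rollout(true_states, predict_next, closed_loop):
    """Return the absolute prediction error at every step."""
    cache = [true_states[0]]            # stand-in for the KV cache
    errors = []
    for t in range(1, len(true_states)):
        predicted = predict_next(cache[-1])
        errors.append(abs(predicted - true_states[t]))
        # Closed loop: store the real observation instead of the prediction.
        cache.append(true_states[t] if closed_loop else predicted)
    return errors

true_states = [float(t) for t in range(6)]   # true dynamics: +1.0 per step

def biased_model(state):
    """Toy predictor that overshoots by 0.1 on every step."""
    return state + 1.1

open_err = rollout(true_states, biased_model, closed_loop=False)
closed_err = rollout(true_states, biased_model, closed_loop=True)
print([round(e, 2) for e in open_err])
print([round(e, 2) for e in closed_err])
```

In the open-loop run the error grows with every step, whereas in the closed-loop run it stays at the single-step bias because each prediction starts from a real observation.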

Motus: Open-Loop Generation

Generates all video + action in one pass.
No ground truth injection, no closed loop.
Errors accumulate.
Motus’s flexible unified framework may incur additional inference overhead, such as slower generation, but no specific latency data is reported in the paper.
Architecturally, it focuses more on a unified generative framework. Its inference method is closer to one-shot open-loop planning.
While short-term tasks can be handled via short-term prediction with open-loop execution, long-horizon tasks requiring long-term visual feedback are theoretically more prone to error accumulation.

VII. Training & Deployment Difficulty

DreamZero

Simple architecture, single DiT, easy to reproduce.
Strict requirements for timing and alignment.
Stable training, runs on real robots.
Requires high compute power.
Has mature system-level optimizations (asynchronous execution, quantization, DiT caching), achieving real-time control with 150 ms latency.

Motus

Complex architecture, three experts, difficult integration.
Multi-stage training pipeline, data pyramid.
Deployment optimization details are sparsely documented in the paper.

VIII. Pre-Training Data Source & Scale

In terms of video backbone pre-training, both models leverage internet-scale video data to acquire basic spatiotemporal priors. DreamZero uses Wan2.1 as the backbone, while Motus initializes based on the newer Wan2.2 5B model.
Despite sharing the same origin, their data strategies during robot fine-tuning follow completely different technical paths.
DreamZero adopts a highly focused real robot data strategy, using approximately 500 hours of real-world teleoperation data, all from the AgiBot G1 dual-arm robot, collected across 22 different environments with a strict "task diversity over repetition" principle.
Its data strategy explicitly limits the number of demonstrations per task, retiring a task once it has been collected 50 times to force a broader task distribution and enhance zero-shot generalization.
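The per-task cap can be pictured as a simple filter over collected episodes. This is only a sketch of the idea: the task names and the `(task, episode_id)` record format are hypothetical, and the actual collection pipeline is not described at this level of detail.

```python
from collections import Counter

def cap_per_task(episodes, max_per_task=50):
    """Keep at most max_per_task episodes for each task, in arrival order."""
    counts = Counter()
    kept = []
    for task, episode_id in episodes:
        if counts[task] < max_per_task:
            counts[task] += 1
            kept.append((task, episode_id))
    return kept

# Hypothetical log: one over-collected task and one rare task.
episodes = [("fold_towel", i) for i in range(80)]
episodes += [("open_drawer", i) for i in range(10)]

kept = cap_per_task(episodes)
per_task = Counter(task for task, _ in kept)
print(dict(per_task))
```

Capping the frequent task leaves more of the collection budget for tasks that have not been seen yet.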
For cross-embodiment transfer, DreamZero uses very little video-only data, including 20 minutes of unlabeled video from the YAM robot and 12 minutes of human-view video, which significantly improves performance on unseen tasks.
In contrast, Motus adopts a larger six-layer data pyramid structure, forming a full-level, large-scale heterogeneous training system spanning web videos, human videos, simulation data, task-agnostic data, multi-robot mixed data, and target robot data.
Its cross-embodiment and generalization capabilities rely on large-scale heterogeneous data from diverse sources such as Egodex, AgiBot, RDT, RoboMind, and RoboTwin.
Motus dedicates Level 4 of its six-layer data pyramid to task-agnostic data.
This data is generated by random sampling in the target robot’s action space, independent of specific tasks, to align latent action space with real robot control signals.
This design is critical for the stability of latent action learning, while DreamZero’s data strategy does not emphasize such alignment data.
Overall, DreamZero follows a small, refined, real-data-first, diversity-focused data path, better aligned with the low-data-cost requirements of real robot deployment.
Motus follows a large-scale, multi-source, pyramid-style data path, relying on massive heterogeneous data for capability coverage. The two have fundamentally different data philosophies and deployment approaches.

IX. Action Representation

DreamZero uses the robot’s native joint space as action representation, directly outputting relative joint positions corresponding one-to-one with the robot body, typically around 14 dimensions.
This representation requires no additional spatial mapping, with the action space fully aligned with the real robot control space.
The model learns via inverse dynamics, directly mapping video predictions to executable joint control signals. Action decoding is simple and direct, highly aligned with real robot physical execution logic.
A minor detail: DreamZero’s joint space refers to relative joint positions, not raw joint angles.
This suggests that, while targeting physical execution alignment, it still normalizes actions to adapt to different robot bodies.
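One plausible reading of "relative joint positions" is that each action in a chunk is an offset from the joint state at the start of the chunk. The sketch below follows that assumption. The reference point, the two-joint example (the text mentions roughly 14 dimensions), and the function names are illustrative and not taken from DreamZero.

```python
def to_relative(current, targets):
    """Express each target joint vector as an offset from the current joints."""
    return [[t - c for t, c in zip(target, current)] for target in targets]

def to_absolute(current, relative_chunk):
    """Recover executable joint targets from offsets and the current joints."""
    return [[c + d for c, d in zip(current, delta)] for delta in relative_chunk]

current_joints = [0.10, -0.50]                    # radians, two joints for brevity
target_chunk = [[0.12, -0.45], [0.15, -0.40]]     # absolute targets for two steps

relative_chunk = to_relative(current_joints, target_chunk)
recovered = to_absolute(current_joints, relative_chunk)
print(relative_chunk)
```

Offsets like these stay in a small range around zero regardless of the arm's current pose, which is one reason a relative representation is easier to normalize.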
In contrast, Motus uses latent action as a unified representation.
It first extracts pixel-level motion information via optical flow, then compresses it into a fixed-dimensional latent action representation via DC-AE.
This optical-flow-based action space is independent of specific robot morphology, naturally enabling cross-embodiment unified expression.
During execution, Motus decodes latent actions into corresponding robot control signals.
In short, DreamZero learns inverse dynamics for specific robots, focusing on precise execution.
Motus learns general motion priors, focusing on cross-domain generalization.
The two represent two distinct technical paths: physical execution alignment versus general motion abstraction.
