Reversal Q-Learning: Teaching Offline RL to Work with Flow-Matching Policies

#python #machinelearning #deeplearning #ai

Flow matching has become one of the more useful tools in the generative modeling toolkit. It trains faster than diffusion models, produces high-quality samples, and handles multimodal distributions well — which makes it attractive for modeling robot actions, where the "right" move in a given situation might not be a single point but a whole family of plausible behaviors.

The catch is that combining flow matching with reinforcement learning is genuinely hard, especially in the offline setting where you only have a fixed dataset and no ability to collect new experience. A new paper from Aditya Oberai, Seohong Park, and Sergey Levine — Reversal Q-Learning (RQL) — proposes a clean solution to this problem, and the core idea is elegant enough to be worth understanding in detail.

Why Flow Matching and Offline RL Don't Play Well Together

To understand the problem RQL solves, it helps to know what flow matching actually does. A flow-matching policy learns a vector field that transports samples from a simple noise distribution toward the target action distribution. At inference time, you start from a noise sample and numerically integrate the vector field over F steps to produce an action. More steps follow the learned flow more accurately, but each step costs another evaluation of the network.
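To make the sampling loop concrete, here is a minimal sketch in plain Python. The vector field is a made-up scalar function standing in for a trained network, and `sample_action` with its Euler integrator is an illustrative choice, not RQL's actual code.

```python
import math

def vector_field(state, x, t):
    # Hypothetical stand-in for a learned network v(s, x, t).
    # This toy field ignores t; a real policy would condition on it.
    return state - 0.5 * x

def sample_action(state, noise, num_flow_steps):
    """Euler-integrate dx/dt = v(s, x, t) from t=0 to t=1 in F steps."""
    h = 1.0 / num_flow_steps
    x = noise
    for k in range(num_flow_steps):
        x = x + h * vector_field(state, x, k * h)
    return x

state, noise = 0.5, 1.5  # arbitrary toy values

# Closed-form solution of this particular toy ODE at t=1, used only as a reference.
exact = 2.0 * state + (noise - 2.0 * state) * math.exp(-0.5)

for num_flow_steps in (2, 10, 1000):
    action = sample_action(state, noise, num_flow_steps)
    print(num_flow_steps, action, abs(action - exact))
```

The printed error shrinks as F grows, which is the trade-off in miniature: a finer integration tracks the flow more closely, but needs more calls to the vector field.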

When you want to improve this policy using reinforcement learning, you need to assign credit to actions based on their downstream returns. In offline RL, you do this with Q-functions estimated from a static dataset. The problem is that the dataset contains raw (state, action) pairs — it has no record of the intermediate flow steps that produced those actions. The flow steps are invisible.

One principled way to handle this is the expanded MDP framework: treat each of the F flow refinement steps as a separate action in a longer Markov decision process. This makes the flow steps explicit and lets you apply standard Q-learning. But it creates two new problems:

Dataset incompatibility. Your offline dataset doesn't contain the intermediate flow states. You can't directly apply Q-learning to transitions that don't exist in your data.
The curse of horizon. Expanding the MDP by a factor of F means your effective planning horizon grows by F. Temporal difference (TD) learning accumulates bias over long horizons, so value estimates become unreliable.
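Both problems are easy to see in a small sketch. The layout below is one hypothetical way to represent expanded-MDP transitions (the dictionary keys and the `expand_transition` helper are assumptions for illustration, not the paper's definitions), and the flow states are made-up numbers.

```python
def expand_transition(state, flow_states, reward, next_state):
    """Split one environment transition into F flow-step transitions.

    flow_states is [x^0, ..., x^F], where x^F is the executed action.
    The layout is an illustrative assumption, not the paper's exact definition.
    """
    num_flow_steps = len(flow_states) - 1
    expanded = []
    for k in range(num_flow_steps):
        is_last = k == num_flow_steps - 1
        expanded.append({
            "state": (state, flow_states[k], k),
            "next_flow_state": flow_states[k + 1],
            # Reward arrives only when the finished action reaches the environment.
            "reward": reward if is_last else 0.0,
            "next_env_state": next_state if is_last else state,
        })
    return expanded

# An offline dataset row only stores (s, a, r, s'). Everything in flow_states
# except the final entry (the action) would be missing from real data.
flow_states = [0.3, 0.6, 0.8, 1.0]  # made-up x^0..x^3, with a = x^3
steps = expand_transition(state=0.0, flow_states=flow_states, reward=1.0, next_state=0.1)

task_horizon = 200  # T, an arbitrary illustrative value
print(len(steps))                 # F transitions per environment step
print(task_horizon * len(steps))  # decision steps per episode in the expanded MDP
```

The helper cannot be called on raw dataset rows because `flow_states` is not recorded, and the episode length it implies grows from T to T × F.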

Previous approaches worked around these issues by using weighted regression, distillation, or rejection sampling — all of which either discard information or introduce their own approximation errors.

The RQL Solution: Reverse the Flow to Reconstruct What Happened

RQL's key insight is that deterministic flow ODEs are reversible. Take an action a that the dataset records for state s and treat it as the endpoint of the flow: you can then run the flow ODE backwards to recover the entire sequence of intermediate states x^0, x^1, ..., x^F (with x^F = a) that the current policy would pass through to produce it.

Formally, for any transition (s, a, r, s') in the offline dataset, RQL solves the reverse ODE:

d/df θ(s, x, f) = -v(s, θ(s, x, f), f)

where v is the learned vector field. This reconstructs the "virtual" on-policy trajectory through flow space — the exact sequence of intermediate states the current policy would have taken to produce action a from state s.
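Here is a minimal sketch of that reconstruction on a toy problem. The scalar vector field, the RK4 integrator, and the function names are all assumptions made for illustration; the paper's solver and network may differ.

```python
import math

def vector_field(state, x, t):
    # Hypothetical smooth stand-in for the learned vector field v(s, x, t).
    return math.sin(state + x) + 0.5 * t

def rk4_step(state, x, t, h):
    """One classical Runge-Kutta step of dx/dt = v(s, x, t); h may be negative."""
    k1 = vector_field(state, x, t)
    k2 = vector_field(state, x + 0.5 * h * k1, t + 0.5 * h)
    k3 = vector_field(state, x + 0.5 * h * k2, t + 0.5 * h)
    k4 = vector_field(state, x + h * k3, t + h)
    return x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def forward_flow(state, noise, num_flow_steps):
    """Integrate from t=0 to t=1 and return [x^0, ..., x^F]."""
    h = 1.0 / num_flow_steps
    xs = [noise]
    for k in range(num_flow_steps):
        xs.append(rk4_step(state, xs[-1], k * h, h))
    return xs

def reverse_flow(state, action, num_flow_steps):
    """Integrate from t=1 back to t=0, starting at the action; return [x^0, ..., x^F]."""
    h = 1.0 / num_flow_steps
    xs = [action]
    for k in range(num_flow_steps, 0, -1):
        xs.append(rk4_step(state, xs[-1], k * h, -h))
    xs.reverse()
    return xs

# Pretend (s, a) came from the offline dataset.
state, dataset_action = 0.2, 0.7
virtual = reverse_flow(state, dataset_action, num_flow_steps=20)

# Replaying the current policy from the recovered noise should land on the same action.
replay = forward_flow(state, virtual[0], num_flow_steps=20)
print(abs(replay[-1] - dataset_action))
```

With a numerical integrator the round trip is only approximately exact, so the replayed action matches the dataset action up to discretization error rather than bit for bit.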

These virtual trajectories are deterministic and on-policy with respect to the current flow policy. That's what makes them useful: because they're on-policy, you can apply multi-step returns across the flow steps without introducing off-policy bias. And because they're deterministic, the multi-step returns are exact rather than sampled estimates.

Collapsing the Horizon

The second innovation addresses the curse of horizon. Because the virtual trajectories are deterministic and on-policy, RQL can use multi-step returns to skip over intermediate flow steps entirely. Instead of estimating a value function over a horizon of T × F steps (where T is the task horizon and F is the number of flow steps), RQL collapses the effective horizon back down to T.

This works because the intermediate flow steps don't interact with the environment — they're purely internal to the policy's generation process. The reward signal only arrives at the end of a full action, not after each flow step. So you can treat the entire flow generation as a single "macro-action" for the purposes of value estimation, while still training the individual flow steps using the expanded MDP structure.
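A small sketch shows why the multi-step return removes the extra horizon. It assumes that internal flow steps earn zero reward and are not discounted, and that `next_value` is whatever bootstrap the critic supplies at the next environment state. Those conventions and both function names are illustrative assumptions, not details taken from the paper.

```python
def flow_step_targets(reward, discount, next_value, num_flow_steps):
    """Multi-step target for each intermediate flow state x^0..x^(F-1).

    Assumes (for illustration) that internal flow steps earn zero reward and
    are not discounted, so the return from x^k to the end of the environment
    step is the same for every k.
    """
    targets = []
    for k in range(num_flow_steps):
        remaining = num_flow_steps - k            # flow steps left before the env step
        internal_rewards = [0.0] * (remaining - 1)
        n_step_return = sum(internal_rewards) + reward + discount * next_value
        targets.append(n_step_return)
    return targets

def bootstraps_per_episode(task_horizon, num_flow_steps, collapse):
    """Count how many bootstrapped backups separate the first step from the last."""
    return task_horizon if collapse else task_horizon * num_flow_steps

targets = flow_step_targets(reward=1.0, discount=0.99, next_value=10.0, num_flow_steps=4)
print(targets)
print(bootstraps_per_episode(200, 10, collapse=False))
print(bootstraps_per_episode(200, 10, collapse=True))
```

Every flow step regresses toward a target that bootstraps once per environment step, so value errors compound over T backups instead of T × F.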

The result is that RQL gets the expressiveness benefits of training the full flow policy step-by-step, without paying the value estimation cost of a T × F horizon.

What This Avoids

RQL avoids several costly alternatives:

No backpropagation through time. BPTT through the entire ODE integration is expensive and numerically unstable for long chains.
No distillation. Distilling the flow policy into a one-step approximation loses expressiveness.
No rejection sampling. Filtering offline data by Q-value wastes data and doesn't directly optimize the policy.

Instead, RQL trains the full flow policy directly against the Q-function and sidesteps the instability of BPTT.

Empirical Results

The authors evaluate RQL on 50 simulated robotic tasks, covering locomotion and manipulation environments. RQL achieves the best average performance among state-of-the-art flow-based offline RL methods, with particularly strong results on long-horizon tasks where the curse of horizon would otherwise hurt competitors most.

The implementation is in JAX and is available on GitHub, which makes it relatively accessible for researchers working in the offline RL space.

Why This Matters for Robotics

Offline RL is especially important for robotics because collecting online experience is expensive, slow, and sometimes unsafe. A large dataset of robot demonstrations — even imperfect ones — lets offline RL extract a policy that improves on the demonstrations by optimizing for reward rather than just imitating behavior.

Flow matching is attractive for robot policies because robot actions are often multimodal: there might be several equally valid ways to grasp an object, and a unimodal Gaussian policy would average over them into an invalid action. RQL makes it practical to combine expressive flow policies with offline RL without the approximations that previous methods required.
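The averaging failure is easy to reproduce with a toy example. The numbers below are made up: two valid grasp angles at -1 and +1, a handful of noisy demonstrations around each, and a tolerance for what counts as a valid grasp.

```python
valid_grasps = [-1.0, 1.0]                       # two equally good modes (made-up values)
demos = [-1.02, -0.98, -1.0, 1.01, 0.99, 1.0]    # noisy demonstrations around both modes

def is_valid(action, tol=0.1):
    """An action counts as valid if it lands close to one of the grasp modes."""
    return any(abs(action - g) <= tol for g in valid_grasps)

# The maximum-likelihood mean of a single Gaussian fit is just the sample mean.
gaussian_mean = sum(demos) / len(demos)

print(all(is_valid(a) for a in demos))  # every demonstration is a valid grasp
print(is_valid(gaussian_mean))          # the averaged action falls between the modes
```

A policy that outputs the mean lands near zero, between the two grasps, even though every demonstration it was fit to was valid.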

The Broader Context

RQL fits into a growing body of work on training generative policies with RL. Related approaches include GenPO (which uses exact diffusion inversion for on-policy RL) and FMER (which uses advantage-weighted regression with flow policies). What distinguishes RQL is its focus on the offline setting and its use of ODE reversibility to avoid the dataset incompatibility problem entirely. The expanded MDP framework itself is not new, but applying it offline required the virtual trajectory construction that RQL introduces.

Summary

Reversal Q-Learning addresses a concrete technical obstacle: how to apply Q-learning to flow-matching policies using offline data that doesn't contain intermediate flow states. The solution — run the flow ODE in reverse to reconstruct virtual on-policy trajectories, then use multi-step returns to collapse the expanded horizon — is technically clean and empirically effective. For researchers working at the intersection of generative models and offline RL, it's a useful addition to the toolkit.

The paper is available at arxiv.org/abs/2606.17551 and the code at github.com/aoberai/rql.