Recent methods for reinforcement learning from verifiable rewards (RLVR) now beat GRPO baselines by roughly two to four average points on math benchmarks, and much of the advantage comes from assigning credit at far finer granularity than whole-response scores. By turning verification into token- and subproblem-level signals, the newest methods extract learning from progress that would otherwise be discarded.
Before these papers, reinforcement learning for reasoning relied on a single scalar reward per generated answer. GRPO and similar RLHF-style pipelines treated the whole response as the unit of credit, which made credit assignment noisy and left hard problems stuck in “gradient dead zones”: when every sampled response to a problem fails, all rewards in the group are identical and the update carries no signal. No mechanism existed to reward partial solves or to isolate the effect of a single token on the final verdict.
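To make the dead zone concrete, here is a minimal Python sketch of the group-relative advantage usually associated with GRPO: each response's reward minus the group mean, divided by the group standard deviation. The function names and the `eps` value are illustrative assumptions, not code from any of the papers discussed here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """One scalar advantage per sampled response, normalized within the group.

    `rewards` holds one verifier score per response to the same prompt.
    `eps` is an assumed small constant that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantage, num_tokens):
    """Whole-response credit: every token inherits the same advantage."""
    return [advantage] * num_tokens


# A hard problem where all four sampled responses fail verification.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# The same problem once a single response passes.
one_pass = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

In the all-fail group every advantage is zero, so nothing is learned from that prompt even if some responses got most of the way there. Once one response passes, it receives a positive advantage and the others a negative one, but every token inside a response still shares the same value.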
DelTA’s discriminative token credit assignment reshapes the RL update into a linear discriminator over token-gradient vectors, amplifying side-specific directions while suppressing shared noise. “DelTA consistently outperforms all same‑scale RL baselines on both Qwen3-8B-Base and Qwen3-14B-Base, achieving the best result on every benchmark and the highest average score at both scales” [1]. The paper reports average gains of 3.26 points for the 8B model and 2.62 points for the 14B model across seven math suites, turning a marginal RL improvement into a systematic boost.
SCRL converts a reasoning chain into verifiable subproblems and normalizes rewards at each position, so that the longest consecutively solved subproblem sequence determines the advantage. “The gain is especially clear on Qwen3-4B, where SCRL reaches an average score of 35.0%, improving over the second‑best baseline QuestA (32.0%) by 3.0 points and over vanilla GRPO (30.9%) by 4.1 points” [2]. Across the same seven benchmarks the method adds 4.1 average points for the 4B model and 1.9 points for the 14B model, and on hard AIME/IMO sets it lifts pass@1 by 3.7 points and pass@64 by 4.6 points.
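One way to read that description is sketched below in Python. It is an interpretation of the summary above, not SCRL's published algorithm: the binary per-position reward, the group normalization at each position, and all function names are assumptions made for illustration.

```python
from statistics import mean, pstdev


def solved_prefix_length(solved_flags):
    """Length of the longest run of verified subproblems starting at the first."""
    count = 0
    for ok in solved_flags:
        if not ok:
            break
        count += 1
    return count


def positionwise_advantages(group_flags, eps=1e-6):
    """Assumed scheme: at position k, a rollout earns reward 1 if its solved
    prefix extends past k, and rewards are normalized across the group
    separately at each position.
    """
    num_positions = len(group_flags[0])
    prefixes = [solved_prefix_length(flags) for flags in group_flags]
    advantages = [[0.0] * num_positions for _ in group_flags]
    for k in range(num_positions):
        rewards = [1.0 if p > k else 0.0 for p in prefixes]
        mu, sigma = mean(rewards), pstdev(rewards)
        for i, r in enumerate(rewards):
            advantages[i][k] = (r - mu) / (sigma + eps)
    return advantages


# Three rollouts on a problem split into three subproblems.
# No rollout solves the whole chain, yet the first two make partial progress.
group = [
    [True, True, False],
    [True, False, True],   # the late success does not extend the solved prefix
    [False, False, False],
]
adv = positionwise_advantages(group)
```

Under these assumptions the first two positions still produce non-zero advantages even though every rollout fails the full problem, which is the kind of partial progress a single whole-response reward would discard.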
RELEX shows that RLVR trajectories live in an almost one-dimensional subspace, so most of the performance gain can be captured by a rank-1 projection that grows near-linearly with training steps. “Specifically, we find that the majority of downstream performance gains are captured by a rank‑1 approximation of the parameter deltas, where the magnitude of this projection evolves near‑linearly with training steps” [3]. Extrapolating from only 15–20% of the usual RLVR steps, RELEX matches GRPO on the in-domain MATH benchmark for Qwen2.5-Math-1.5B (71.6% vs 71.5%) and slightly exceeds it for Qwen3-4B-Base (85.6% vs 85.5%), but falls short for Qwen3-8B-Base (87.4% vs 88.5%). It also beats RLVR on five out-of-domain tests.
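The idea can be illustrated with a small, dependency-free Python sketch. It is a toy under stated assumptions rather than RELEX's actual procedure: parameters are flattened into one short vector, the dominant direction is found by power iteration over the checkpoint deltas, and the projection magnitude is fitted with an ordinary least-squares line. All function names are hypothetical.

```python
import math
from statistics import mean


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_direction(deltas, iters=100):
    """Dominant right singular vector of the delta matrix via power iteration."""
    v = list(deltas[-1])  # start from the latest delta, assumed non-zero
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("last checkpoint equals the base parameters")
    v = [x / norm for x in v]
    for _ in range(iters):
        coeffs = [dot(row, v) for row in deltas]                # D v
        w = [sum(c * row[j] for c, row in zip(coeffs, deltas))  # D^T (D v)
             for j in range(len(v))]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return v


def fit_line(xs, ys):
    """Least-squares slope and intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def extrapolate(theta0, checkpoints, steps, target_step):
    """Project deltas on one direction, fit its growth, extend to target_step."""
    deltas = [[w - w0 for w, w0 in zip(theta, theta0)] for theta in checkpoints]
    v = top_direction(deltas)
    magnitudes = [dot(d, v) for d in deltas]
    slope, intercept = fit_line(steps, magnitudes)
    c = slope * target_step + intercept
    return [w0 + c * vj for w0, vj in zip(theta0, v)]


# Toy trajectory: parameters drift along (0.6, 0.8) at 0.01 units per step.
theta0 = [0.0, 0.0]
steps = [10, 20, 30]
checkpoints = [[0.006 * s, 0.008 * s] for s in steps]
predicted = extrapolate(theta0, checkpoints, steps, target_step=100)
```

Because the toy trajectory is exactly rank-1 and linear, the extrapolated parameters land on the point the drift would reach at step 100. Real training deltas are only approximately low-rank, which is why the quality of the extrapolation is an empirical question.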
The three papers leave open questions about scalability and universality. DelTA’s centroid reweighting still risks being dominated by high‑frequency formatting tokens, so its discriminative edge may shrink on longer, more heterogeneous sequences. SCRL depends on high‑quality reference chains; constructing those for novel domains could re‑introduce costly annotation. RELEX assumes a linear, rank‑1 evolution that has only been demonstrated on math‑oriented backbones; whether the same simplicity holds for dialog or retrieval‑augmented models remains to be seen.
If fine-grained verification truly captures the lion’s share of RLVR learning, developers should make token- or subproblem-level credit pipelines the default in place of monolithic reward wrappers. RELEX’s cheap extrapolation further suggests that a short RLVR run followed by extrapolation can yield comparable checkpoints at a fraction of the compute budget, which could speed the rollout of more reliable reasoning across deployed LLM services. The next wave of RL-enhanced models will likely be judged not by how many steps they train, but by how sharply they can slice the verification signal.
References
- [1] DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards
- [2] From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning
- [3] You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories