What: The Cursor Composer 2.5 release blog introduces targeted textual feedback RL — a constructed short hint inserted at a specific span in a long agent rollout turns the resulting model distribution into a teacher, and an on-policy distillation KL loss replaces the diffuse end-of-rollout reward with a localized training signal.
Why: Modern coding-agent RL runs over rollouts of 100,000+ tokens — one scalar reward at the end has to credit or blame every token equally, and the per-decision signal drowns in noise. Targeted hints localize the gradient so the policy learns exactly the move the coach pointed at.
vs prior: Standard end-of-rollout scalar reward spreads one bit of credit across the whole rollout. Targeted textual feedback picks a target span, writes a short hint into local context, treats the hinted distribution as a teacher, and applies KL distillation only on that span — a much denser per-token signal where it matters (illustrative orders of magnitude in the worked example below).
Think of it as a basketball coach reviewing 2 hours of game tape.

                ONE 100K-TOKEN ROLLOUT
                           │
             ┌─────────────┴─────────────┐
             │                           │
    ┌────────▼─────────┐        ┌────────▼─────────┐
    │  Scalar at end   │        │  Hint at minute  │
    │ (one -1 reward)  │        │ 47 (target span) │
    └────────┬─────────┘        └────────┬─────────┘
             │                           │
    credit ÷ 100k tokens         KL on a ~50-token
    ≈ 0.001% per token          span ≈ 2% per token
             │                           │
             ▼                           ▼
     ✗ averaged drift           ✓ localized credit
      across the game            where it mattered

- rollout = 2 hours of game tape (one game = one 100k-token episode)
- end-of-rollout scalar reward = the final scoreboard — one number for the whole game
- targeted textual hint = the coach pointing at minute 47 and saying 'shoot the corner three here'
- teacher distribution = the coach's chalkboard sketch of the corrected play
- on-policy distillation KL = the player drilling that one moment until their action matches the chalkboard
Quick glossary
Rollout — One full episode produced by sampling from the policy — for a coding agent that's an entire session of tool calls, file edits, and model outputs, often stretching to 100,000+ tokens. The unit of RL training data here is a whole rollout, not a single response.
Credit assignment — The problem of deciding which decisions inside a rollout actually deserve the reward. With one scalar at the end, every token gets the same blame; the gradient can't tell the move that mattered from the moves that didn't.
Target message — The specific model message the trainer picks out of the rollout for hinted distillation. The hint goes into local context around this message; the resulting distribution becomes the teacher for that span.
On-policy distillation — Distilling a teacher's distribution into a student while the student keeps generating its own rollouts (rather than imitating fixed teacher trajectories). The policy distribution at the target span is what gets matched.
KL divergence — A standard measure of how far one probability distribution is from another; it is asymmetric, so it is not a true distance. Here it measures how far the student's next-token distribution is from the teacher's. Used as the localized loss on the target span; gradients elsewhere come from the broader RL objective.
RLVR — Reinforcement Learning with Verifiable Rewards. The dominant post-training setup for modern reasoning and coding agents — a deterministic checker grades each rollout 0 or 1.
Sharded Muon / HSDP — The optimizer / parallelism layer Cursor swapped in to make this training run efficient at their scale. They are mentioned in the source as enabling infrastructure, not the focus of this explainer — covered separately under cost & latency framing.
The news. On May 18, 2026, Cursor released Composer 2.5, a "substantial intelligence and behavior upgrade" built on the same Moonshot Kimi K2.5 open-source checkpoint as Composer 2. The release blog details three training-stack changes; the headline one is targeted RL with textual feedback — a credit-assignment trick for very long agent rollouts. The same post also flags a companion frontier model being trained from scratch with SpaceXAI on Colossus 2 at roughly 1M H100-equivalents (~10× Composer 2.5's total compute).
Picture a basketball coach reviewing two hours of game tape after a one-point loss. The scoreboard says −1. That number is accurate feedback on the team's performance, and useless for fixing anything. Which possession lost the game? The blown switch in the second quarter? The bad rotation at minute 47? The missed corner three? The scoreboard cannot distinguish, and a player who only ever sees the scoreboard learns by drifting in an averaged direction across every minute of every game. That is exactly what an end-of-rollout scalar reward does to an agent trained on 100k-token rollouts: one number, ~100,000 tokens, ~10⁻⁵ credit per token, and the signal on any individual move is lost in noise. This is the long-rollout version of the credit-assignment problem every RL textbook opens with.
Targeted textual feedback is the coach pulling out a clipboard. The trainer picks out the specific target message in the rollout that went wrong — a single tool call, a single response, a single multi-line plan — and constructs a short hint describing the desired improvement: "be more concise here", "check the file's imports first", "shoot the corner three". The hint gets inserted into the model's local context around that message. Now the model, with the hint in front of it, produces a different next-token distribution at that span — a distribution that reflects what the coach actually wanted. That distribution becomes the teacher. The original policy, running with the original (hint-free) context, is the student. An on-policy distillation KL loss moves the student's probabilities toward the teacher's, but only over the target span — the rest of the trajectory is still learning from the broader RL objective in parallel.
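The span-limited distillation step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Cursor's implementation: the function names, the toy logits, and the choice of KL(teacher ‖ student) as the direction are all hypothetical, since the release blog does not specify them. The teacher logits stand in for the model's outputs with the hint in context, and the student logits for the same positions without it.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def span_distillation_loss(student_logits, teacher_logits, span_start, span_end):
    """Mean per-token KL(teacher || student) over [span_start, span_end).

    Positions outside the span contribute nothing to this term.
    The KL direction is an assumption made for this sketch.
    """
    if not 0 <= span_start < span_end <= len(student_logits):
        raise ValueError("target span must lie inside the rollout")
    total = 0.0
    for t in range(span_start, span_end):
        teacher = softmax(teacher_logits[t])  # hint present in local context
        student = softmax(student_logits[t])  # original, hint-free context
        total += kl_divergence(teacher, student)
    return total / (span_end - span_start)

# Toy rollout: 6 positions, vocabulary of 3 tokens (hypothetical numbers).
student_logits = [[0.0, 0.0, 0.0] for _ in range(6)]
teacher_logits = [[0.0, 0.0, 0.0] for _ in range(6)]
teacher_logits[2] = [2.0, 0.0, 0.0]  # the hint shifts the teacher here...
teacher_logits[3] = [2.0, 0.0, 0.0]  # ...and here (the target span)
teacher_logits[5] = [0.0, 3.0, 0.0]  # outside the span, so it is ignored

loss = span_distillation_loss(student_logits, teacher_logits, 2, 4)
print(f"span KL loss: {loss:.4f}")
```

In a real training loop the student term would carry gradients and the teacher term would be held fixed; the sketch only shows which positions enter the loss.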
The shape of the gradient signal is what matters here. End-of-rollout scalar reward is one number trying to teach roughly 100,000 tokens. Targeted textual feedback is a small target span of dense, position-specific gradient sitting inside the same rollout. Once a span has been annotated with a hint, the team can in principle derive a localized training signal from that span directly, rather than sampling whole new rollouts just to move one scalar reward.
Where the wall-clock signal density actually shows up
Hold three variables fixed. One 100,000-token rollout. One scalar reward at the end (say, −1: the rollout failed). One identifiable target span of ~50 tokens that the trainer judges to be the load-bearing moment. With end-of-rollout scalar RL alone, that single −1 has to distribute its gradient across all 100,000 tokens, so the per-token share of the signal is on the order of 1 / 100,000 = 0.001%. Stack a hundred such rollouts and the model is averaging over 10⁷ tokens to learn one consistent lesson. With targeted textual feedback, the same rollout gets a localized KL loss on the 50-token target span, where the per-token share is on the order of 1 / 50 = 2%, roughly 2,000× the diffuse scalar's share on those same tokens. All of these figures are illustrative orders of magnitude, not measured gradient norms. The broader RL gradient still applies over the full trajectory; the localized loss is additive, not a replacement.
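The arithmetic in this worked example is easy to check directly. The snippet below only reproduces the back-of-envelope division; the variable names are made up for illustration, and "share" here means one unit of signal divided evenly across tokens, which is a simplification rather than a measurement of gradient magnitude.

```python
rollout_tokens = 100_000   # one long agent rollout
span_tokens = 50           # the target span the trainer flagged
n_rollouts = 100           # rollouts stacked to learn one lesson

diffuse_share = 1 / rollout_tokens           # scalar reward spread over every token
span_share = 1 / span_tokens                 # localized loss spread over the span only
ratio = rollout_tokens / span_tokens         # how much denser the span signal is
tokens_averaged = rollout_tokens * n_rollouts

print(f"diffuse per-token share: {diffuse_share:.3%}")
print(f"span per-token share:    {span_share:.0%}")
print(f"density ratio:           {ratio:,.0f}x")
print(f"tokens averaged over:    {tokens_averaged:,}")
```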
| Property | End-of-rollout scalar | Targeted textual feedback |
|---|---|---|
| Signal location | One number at trajectory end | ~50-token span anywhere in the rollout |
| Per-token credit (~100k rollout) | ~10⁻⁵ (illustrative) | ~2% on the target span (illustrative) |
| Cost to produce one signal | One full rollout (~minutes, setup-dependent) | One constructed short hint (~seconds, setup-dependent) |
| Loss form | Policy gradient on scalar reward | On-policy distillation KL on the hinted span |
| Replaces broader RL objective? | n/a | No — runs alongside, additive |
| Sweet spot | Short rollouts with clear scalar outcomes (illustrative, setup-dependent) | Long agent rollouts where one moment was load-bearing (illustrative, setup-dependent) |
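The "runs alongside, additive" row can be made concrete with a toy combined objective. This is a sketch under assumptions: the REINFORCE-style surrogate, the per-token averaging, and the `kl_weight` mixing coefficient are all hypothetical choices for illustration; the source only says the localized term is added to the broader RL objective, not how the two are weighted.

```python
def combined_loss(token_logprobs, advantage, span_kl, kl_weight=1.0):
    """Scalar-reward term over the whole rollout plus a KL term on the span.

    token_logprobs: log-probabilities of the sampled tokens, one per position.
    advantage:      one scalar for the whole rollout (e.g. reward minus baseline).
    span_kl:        per-token KL values for the target span only.
    kl_weight:      hypothetical mixing coefficient; not given in the source.
    """
    if not token_logprobs or not span_kl:
        raise ValueError("need at least one rollout token and one span token")
    # REINFORCE-style surrogate: every token shares the same scalar advantage.
    policy_term = -advantage * sum(token_logprobs) / len(token_logprobs)
    # Localized term: averaged over the span only, so it is not diluted
    # by the length of the rest of the rollout.
    span_term = sum(span_kl) / len(span_kl)
    return policy_term + kl_weight * span_term

# Toy numbers: a 1,000-token rollout that failed, with a 50-token target span.
rollout_logprobs = [-1.0] * 1000
span_kl = [0.5] * 50
total = combined_loss(rollout_logprobs, advantage=-1.0, span_kl=span_kl)
print(f"combined loss: {total:.3f}")
```

Setting `kl_weight=0.0` recovers the plain scalar-reward objective, which is one way to read the table's point that the hint term is layered on top rather than swapped in.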
This is structurally the same move that took the field from RLHF to RLVR: once an inference-time procedure — long agent rollouts here, test-time search there — becomes load-bearing, the post-training stage has to be redesigned around what that procedure actually needs from the policy. Coding agents need localized corrections, not averaged ones, because every long rollout has a few moments that matter and a lot of moments that don't.
It is also a different fix from the training-inference mismatch diagnostic and from window-level RL for speculative drafters. Both of those localize the gradient in time too, but the localization comes from algorithmic structure (windowed sampling, mismatch isolation). Cursor's lever is a constructed short hint at a chosen target message — a textual description of the desired correction — which lets the training pass focus the gradient at the span the trainer flagged, not at fixed-size windows.
Related explainers
- RLVR — Reinforcement Learning with Verifiable Rewards — the post-training setup targeted textual feedback augments
- CoPD — co-evolving policy distillation — different on-policy distillation setup with parallel coevolving teachers
- PPOW — window-level RL for speculative drafters — algorithmic (not human-hinted) localization of the RL gradient
- TIM — Training-Inference Mismatch — diagnostic for what goes wrong when the long-rollout gradient is mismatched
- VPO — vector-reward advantage vs GRPO collapse — different advantage-estimator fix for diversity-vs-collapse during RL
FAQ
What is targeted textual feedback RL?
A credit-assignment technique introduced in Cursor's Composer 2.5 release. The trainer identifies a specific target message inside a long agent rollout, writes a short hint describing the desired improvement, and inserts that hint into the model's local context around the target. The resulting hint-conditioned distribution becomes a teacher; the original (hint-free) policy is the student. An on-policy distillation KL loss moves the student toward the teacher only over the target span, giving a localized training signal while the broader RL objective still applies over the full trajectory. The technique is aimed at long agent rollouts (100,000+ tokens) where one end-of-rollout scalar reward provides too little credit per token to meaningfully shape any individual decision.
Why doesn't a single end-of-rollout reward work for agent training?
It works fine for short rollouts with clear outcomes — that's where modern policy-gradient RL was developed. The problem is rollout length. A coding agent generating 100,000 tokens before the verifier grades the result has spread one scalar across 100,000 decisions, so the per-token gradient signal is roughly 10⁻⁵. Stacking rollouts averages over even more tokens to learn one consistent correction. Agents trained this way drift in an averaged direction across thousands of moves, which is exactly what you don't want for tasks where a single decision (the right tool call, the right plan revision, the right place to stop) is the load-bearing moment.
How does Cursor's approach differ from RLHF or RLVR?
RLHF and RLVR are about where the reward signal comes from: human preferences for RLHF, deterministic verifiers for RLVR. Targeted textual feedback is about where an additional training signal is applied. Cursor still uses an RLVR-style outer loop with verifiable scalar outcomes; the new piece sits inside the loss function as an additional localized term. The teacher is not a separate reward model. It is the same model running with a textual hint in local context, and the KL distillation moves the hint-free student toward that hint-conditioned distribution on the target span only. This makes it cheap to layer on top of an existing RLVR setup: same rollouts, same verifier, same checkpoint, just an additional loss term on annotated spans.
Originally posted on Learn AI Visually.