Shoaibali Mir

Posted on May 31

Your RL Agent Failed a 12-Step Task. Which Step Was Wrong? (The Supervision Problem in Agentic RL)

#machinelearning #reinforcementlearning #llm #aws

About this series.

I'm going to take a fresh paper - Self-Distilled Agentic Reinforcement Learning (SDAR, arXiv:2605.15155) - and architect it end to end on AWS: the system design, the actual gate code, the evaluation plan, and a brutally honest cost model.

What I'm not going to do is wave a benchmark number around.

Reproducing a paper like this costs thousands in GPU time, and I'd rather show you the machinery than a screenshot you can't audit. The design is the deliverable.

This is Part 1.

A small, infuriating problem

Picture an LLM agent working a web-shopping task. It reads the goal, searches, clicks a category, filters, opens a product, compares, adds to cart - twelve steps in all. At the end, it has bought the wrong thing.

So you do what reinforcement learning tells you to do: you score the trajectory. Reward = 0. Bad agent.

Now answer this: which of the twelve steps was actually wrong?

Maybe step 3, the search query, was fine and step 9, a filter choice, doomed everything.
Maybe steps 1–11 were brilliant and step 12 fat-fingered the wrong button.
Your single scalar reward has no idea. It punishes all twelve equally, including every step that was actually correct.
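To make the blame-spreading concrete, here is a minimal sketch. It assumes a hypothetical 12-step episode in which only step 9 was wrong; `broadcast_reward` and `step_was_correct` are illustrative names of mine, not anything from the paper.

```python
def broadcast_reward(num_steps: int, trajectory_reward: float) -> list[float]:
    """Give every step the same trajectory-level reward (no per-step credit)."""
    return [trajectory_reward] * num_steps


# Hypothetical ground truth that training never gets to see:
# only step 9 (index 8) was actually wrong.
step_was_correct = [True] * 12
step_was_correct[8] = False

per_step_signal = broadcast_reward(num_steps=12, trajectory_reward=0.0)

# Every step receives the identical signal, so the correct steps are
# indistinguishable from the one that sank the episode.
distinct_signals = set(per_step_signal)
wrongly_blamed = sum(step_was_correct)

print(distinct_signals)  # {0.0}
print(wrongly_blamed)    # 11
```

Nothing in `per_step_signal` points at step 9. That information simply isn't in a single scalar.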

That's the supervision problem in agentic RL, and it's the thing this whole series is about.

Why "just use RL" isn't enough for agents

RL has become the default way to post-train LLM agents. The catch is that the reward usually lands at the trajectory level - one number for the entire multi-step episode.

For a single-turn task ("answer this question"), that's tolerable; the action and the outcome are close together. For a long-horizon agent - ten, twenty, fifty turns of searching, calling tools, and reacting to an environment - it's a disaster of credit assignment. The signal is too coarse to tell the model which decisions earned the reward and which torpedoed it.

You can throw more episodes at it and let statistics sort the credit out eventually. But "eventually" on a 30-turn task burns a lot of expensive compute, and training tends to become unstable along the way.

What you actually want is denser feedback: a signal at the level of individual tokens, not the whole episode.

Enter the teacher: On-Policy Self-Distillation

One way to get that denser signal is a technique called On-Policy Self-Distillation (OPSD).

The idea, in plain terms: alongside your student model (the one being trained), you run a teacher - the same model, but handed privileged context the student doesn't get. Think of extra hints: relevant skills, retrieved knowledge, a peek at what good looks like. Because the teacher is better-informed, its token-by-token probability distribution is a richer target than a single end-of-episode reward.

The student then learns to imitate the teacher at the token level. Dense feedback. Problem solved?
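Here's a rough sketch of what "token-level signal" can mean in practice. I'm assuming the signal is the log-probability gap between teacher and student on the token the student actually sampled; this post doesn't pin down the paper's exact form, so treat the function names and the toy logits as illustrative only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_log_prob_gaps(student_logits, teacher_logits, sampled_tokens):
    """Per-position gap: log p_teacher(token) - log p_student(token),
    evaluated on the token the student sampled (that's the on-policy part)."""
    gaps = []
    for s_logits, t_logits, tok in zip(student_logits, teacher_logits, sampled_tokens):
        p_student = softmax(s_logits)[tok]
        p_teacher = softmax(t_logits)[tok]
        gaps.append(math.log(p_teacher) - math.log(p_student))
    return gaps


# Toy example: vocabulary of 3 tokens, 2 positions.
# The student is uniform; the privileged teacher has opinions.
student_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
teacher_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
sampled_tokens = [0, 0]  # what the student actually emitted

gaps = token_log_prob_gaps(student_logits, teacher_logits, sampled_tokens)
# Position 0: teacher likes token 0 more than the student does -> positive gap.
# Position 1: teacher prefers token 2, student emitted token 0 -> negative gap.
print(gaps)
```

One number per token instead of one number per episode: that's the density the trajectory reward was missing.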

For single-turn settings, largely yes. For multi-turn agents, the wheels come off.

Why OPSD breaks on multi-turn agents

Two failure modes, and they're worth understanding before any code:

1. Instability compounds across turns.
In a multi-turn episode, the student's small mistakes at turn 1 change the state at turn 2, which changes turn 3, and so on. The teacher is reacting to an increasingly drifted situation. The dense signal that was supposed to stabilize training starts amplifying the wobble instead. More turns, more compounding, more chaos.

2. The teacher's "no" is often noise, not truth.
Here's the subtle one. The teacher's advantage is its privileged context - but that context comes from skill retrieval and utilization, and those aren't perfect. When the teacher rejects a token (says "the student should be less likely to do this"), it might be right... or the teacher might just have retrieved a bad skill. If you treat every teacher rejection as gospel and penalize the student for it, you're training on noise.

So you can't treat the teacher's positive endorsements and negative rejections the same way. Positives ("yes, do more of this") are relatively trustworthy. Negatives ("no, don't") are suspect. They need asymmetric handling.

SDAR's bet, in one sentence

This is where the paper's contribution lands:

Keep RL as the primary optimization backbone. Bolt OPSD on as a gated auxiliary objective - strengthen distillation on the teacher's confident positive guidance, and softly attenuate its noisy negative rejections.

The mechanism that does this is a sigmoid gate sitting on top of a detached token-level signal. Confident positive-gap tokens get the distillation signal amplified. Negative rejections get turned down - softened, not blindly obeyed.
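A minimal sketch of that gating idea, under loud assumptions: the gate is a plain sigmoid of the teacher-student gap, and `sharpness` and `aux_coef` are hypothetical knobs I made up for illustration. The paper's actual formula and hyperparameters may differ. In a real framework the gap would be detached (for example with PyTorch's `.detach()`) so no gradient flows through the gate; plain Python floats carry no gradient, so that part is implicit here.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gate_weights(gaps, sharpness=1.0):
    """Map each (detached) teacher-student gap to a weight in (0, 1).
    Positive gaps push the weight toward 1, negative gaps toward 0."""
    return [sigmoid(sharpness * g) for g in gaps]


def combined_loss(rl_loss, distill_losses, gaps, aux_coef=0.1, sharpness=1.0):
    """RL loss stays primary; the gated distillation term is auxiliary."""
    weights = gate_weights(gaps, sharpness)
    gated = sum(w * d for w, d in zip(weights, distill_losses)) / len(distill_losses)
    return rl_loss + aux_coef * gated


gaps = [2.0, 0.0, -2.0]           # confident yes, neutral, rejection
weights = gate_weights(gaps)      # roughly 0.88, 0.5, 0.12
total = combined_loss(rl_loss=1.0, distill_losses=[1.0, 1.0, 1.0], gaps=gaps)
print([round(w, 2) for w in weights], round(total, 3))
```

Note that the rejection at gap = -2 isn't zeroed out; it keeps about 12% of its weight. That's the "softened, not blindly obeyed" behavior.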

RL still flies the plane. OPSD is a co-pilot whose advice you weight by how much you trust it.

Here's the contrast that the whole paper hinges on:

TRAJECTORY REWARD (vanilla RL)
  [step1 step2 step3 ... step12]  ──►  reward = 0
   (all twelve steps blamed equally)

DENSE OPSD (naive)
  teacher says yes/no on every token  ──►  trust it all
   (but multi-turn drift + noisy "no"s destabilize training)

SDAR (gated)
  teacher says yes/no  ──►  sigmoid gate  ──►
   amplify confident YES, soften suspicious NO
   (RL stays primary; dense signal is filtered, not swallowed)

The reported payoff (their numbers, not mine): meaningful gains over plain GRPO on ALFWorld, WebShop, and Search-QA, and - this is the part I find more interesting than the headline percentage - it avoids the training instability that naive GRPO+OPSD falls into.

Why this is an AWS problem, not a Bedrock checkbox

If you're reading this as an AWS practitioner, here's the part that matters: you cannot do this through managed fine-tuning.

SDAR needs a custom RL loop with four models in play at once - the trained actor, a frozen reference for the KL term, a rollout engine, and that privileged teacher branch - plus a live multi-turn environment to act in. That's a verl-agent/OpenRLHF-shaped workload running on GPU instances you provision and babysit, not a Bedrock fine-tuning job you submit and forget.

Which means the real questions are infrastructure questions: How many 80GB cards does a four-model setup actually need? Where does the teacher branch live? What does a converging run cost when an idle GPU node burns four figures over a forgotten weekend?
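As a taste of that memory math, here's a back-of-envelope sketch. Every input is an assumption on my part: a hypothetical 7B-parameter model, bf16 weights at 2 bytes per parameter, mixed-precision Adam at roughly 16 bytes per parameter for the trained actor, and each of the four roles holding its own copy of the weights. It ignores activations, KV cache, and fragmentation, so read the result as a floor, not a sizing recommendation.

```python
import math

BYTES_BF16 = 2               # inference-only copy: bf16 weights
BYTES_TRAIN_PER_PARAM = 16   # bf16 weights + grads (4) + fp32 master, momentum, variance (12)
GPU_MEMORY_GB = 80


def weights_and_optimizer_gb(num_params: float) -> dict:
    """Static memory per role, in GB. Assumes no weight sharing between roles."""
    gb = 1e9
    return {
        "actor (trained)": num_params * BYTES_TRAIN_PER_PARAM / gb,
        "reference (frozen)": num_params * BYTES_BF16 / gb,
        "rollout engine": num_params * BYTES_BF16 / gb,
        "teacher branch": num_params * BYTES_BF16 / gb,
    }


budget = weights_and_optimizer_gb(7e9)   # hypothetical 7B model
total_gb = sum(budget.values())
min_cards = math.ceil(total_gb / GPU_MEMORY_GB)

for role, size in budget.items():
    print(f"{role:20s} {size:6.1f} GB")
print(f"total {total_gb:.1f} GB -> at least {min_cards} x {GPU_MEMORY_GB}GB cards")
```

Even under these generous assumptions the trained actor dominates the budget, and the real number only goes up once activations and KV cache enter the picture.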

That's Part 2 - the full system diagram and the cost model, before a single GPU is rented.

Next in the series: "Architecting SDAR on AWS" - the component map, the service mapping, the memory math, and what it would actually cost to run.

If you've fought trajectory-reward credit assignment on your own agents, I want to hear how it went - drop it in the comments.

Shoaibali Mir

I'm an engineer with 5+ years of experience spanning DevOps, Data, Cloud, and AI/ML engineering. Alongside full-time work, I'm pursuing a master's degree in AI/ML at BITS Pilani.

Top comments (2)

Harjot Singh • May 31

Which step was wrong is the credit-assignment problem and it's the quiet killer of multi-step agents. A 12-step task that fails gives you one bit of signal (it failed) spread across twelve decisions, and a sparse terminal reward can't tell you whether step 3 doomed the run or step 11 fumbled an otherwise-good plan. Without per-step attribution you're training (or debugging) on noise, reinforcing whole trajectories that happened to succeed including their bad steps, and penalizing good steps that happened to be on a failing path.

The supervision problem is really a granularity problem: you need a reward or a check at the step level, not just the outcome.

Two angles that help. Process supervision: judge each step's local correctness (was this a reasonable action given the state) rather than only the final result, which gives dense signal and localizes the failure. And verifiable intermediate states: where a step has a checkable postcondition, you get ground truth for free instead of relying on a learned judge that can be wrong.

Reward the right steps, not just the lucky trajectories. That assign-credit-at-the-step-not-the-outcome instinct is core to how I think about agent reliability in Moonshift. Are you leaning toward process supervision with a step-level judge, or instrumenting checkable intermediate states where the task allows it?

Shoaibali Mir • Jun 1

This is a great breakdown, thanks for sharing your thoughts!

My preference lands on process supervision as the primary signal, with verifiable intermediate states as anchors wherever the task gives you a free ground-truth check. The two work well together - instrument what’s verifiable, use a step-level judge to fill the gaps.

The tricky part is the step-level judge can get it wrong, and if you train on every signal it gives you, you’re also learning from its errors. SDAR’s gate is essentially a confidence filter - amplify the steps it’s sure about, dial back the ones it isn’t.

Curious about Moonshift - are your failures clustering around specific steps or spreading across the trajectory? That usually tells you whether anchors alone get you most of the way there.