Recap. In Part 1 we landed on the core idea of SDAR (arXiv:2605.15155): keep RL as the backbone, bolt on a privileged teacher for dense token-level guidance, and put a sigmoid gate between them so the student amplifies the teacher's confident advice and softens its noisy rejections. We also said the quiet part out loud - this is not a Bedrock fine-tuning checkbox.
This part is the blueprint. The whole system in one diagram, mapped to AWS services, with the memory math that picks your instance type and a cost model honest enough to explain why I'm designing this instead of running it.
Why "four models" is the headline
Most people picture RL fine-tuning as one model getting better. SDAR is four models in the same loop, and that single fact drives every infrastructure decision that follows.
Here's what's actually resident during training:
┌─────────────────────── ROLLOUT ────────────────────┐
│ │
[ ACTOR ] ──acts──► [ ENVIRONMENT ] ──state/reward──► [ ACTOR ]
(trained) (ALFWorld / WebShop / Search-QA)
│
│ per-token signals
▼
[ REFERENCE ] ── KL penalty (keeps the actor from drifting)
(frozen copy)
│
[ TEACHER ] ── privileged context ──► token-level target
(frozen, sees extra skills the actor doesn't)
│
▼
[ SIGMOID GATE ] ──► amplify confident YES, soften noisy NO
│
▼
total_loss = RL_loss + λ · gated_distillation
Four model copies in memory at once: the actor (being trained), a frozen reference for the KL term, the rollout engine that generates trajectories, and the privileged teacher. Vanilla GRPO already carries the first three.
SDAR's contribution adds the fourth - and the teacher's privileged context is longer than the student's, so its KV cache is heavier too.
Hold that thought; it's the reason you can't prototype this on a spare laptop GPU.
Mapping the system to AWS
None of this is exotic AWS. It's a single-node GPU training job with a few satellites:
| Component | AWS service | Why |
|---|---|---|
| Trainer (actor + reference + rollout + teacher) | EC2/SageMaker GPU node - one p4d.24xlarge (8× A100 40GB), p4de.24xlarge (8× A100 80GB), or p5.48xlarge (8× H100 80GB) | Four models need real HBM; you want them on one node to avoid cross-node weight-shuffling |
| RL orchestration | verl-agent (or OpenRLHF) on that node | Purpose-built for multi-turn agent↔environment RL with GRPO; both decouple the RL algorithm from the agent execution loop |
| Environments | ECS / Fargate containers | ALFWorld, WebShop and Search-QA are CPU-bound text environments - containerize the existing repos, don't reinvent them |
| Privileged skill store | DynamoDB | The teacher's "extra context" is a skill lookup; low-latency key-value fits. (Approximate it if the paper's library isn't public.) |
| Checkpoints | S3 | Mandatory for spot resume - more on that below |
| Retrieval for Search-QA | Bedrock Knowledge Bases / Kendra | If you do the search environment, this is the retrieval backend |
The one design choice worth defending: the teacher is inference-only and stateless. It never updates. That makes it the ideal candidate to run on spot capacity while the actor trains on on-demand - a real cost lever I'll come back to.
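Spot capacity only works if the job can pick up where it left off, so here is a minimal sketch of the resume decision. It assumes a hypothetical checkpoint key layout of `<prefix>/step_<number>/` in S3; that layout, and the function names `latest_checkpoint` and `resume_step`, are illustrative and not something verl-agent or the paper prescribes.

```python
import re
from typing import Iterable, Optional, Tuple

# Hypothetical key layout: <prefix>/step_<zero-padded int>/
# (an assumption for this sketch, not a verl-agent convention).
_STEP_RE = re.compile(r"step_(\d+)$")


def latest_checkpoint(keys: Iterable[str]) -> Optional[Tuple[int, str]]:
    """Return (step, key) for the highest-numbered checkpoint, or None."""
    best: Optional[Tuple[int, str]] = None
    for key in keys:
        match = _STEP_RE.search(key.rstrip("/"))
        if match is None:
            continue  # not a checkpoint directory (configs, logs, ...)
        step = int(match.group(1))
        if best is None or step > best[0]:
            best = (step, key)
    return best


def resume_step(keys: Iterable[str]) -> int:
    """Step to start from after an interruption: last saved step + 1, or 0."""
    found = latest_checkpoint(keys)
    return 0 if found is None else found[0] + 1


if __name__ == "__main__":
    example_keys = [
        "runs/sdar/config.json",
        "runs/sdar/step_000100/",
        "runs/sdar/step_000250/",
    ]
    print(latest_checkpoint(example_keys), resume_step(example_keys))
```

In a real job the keys would come from listing the bucket prefix (for example with boto3's `list_objects_v2`); that call is left out so the sketch runs offline. The frozen teacher needs none of this, which is part of why it is the easy piece to put on spot.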
The memory math that picks your instance
This is the part people skip and then get an OOM at 2 a.m. Let's do it up front.
A Qwen2-7B checkpoint is roughly 15 GB in bf16 - just the weights. Now train it:
- Actor: weights + gradients + optimizer states. Adam keeps two fp32 moment tensors, 8 bytes per parameter, so roughly 60 GB for a 7B model. With bf16 weights and gradients on top, the actor alone needs on the order of 90 GB before activations - which is why frameworks shard it across GPUs with FSDP rather than fitting it on one card.
- Reference: another ~15 GB frozen copy for the KL term.
- Rollout engine (vLLM): yet another weight copy plus a KV cache that grows with sequence length.
- Teacher (SDAR's addition): a fourth copy, with a bigger KV cache because the privileged context is longer.
Net effect: SDAR runs roughly 30–40% heavier than same-size GRPO. A faithful 7B run realistically wants all eight 80 GB cards on a single node. People have reported that even on an H100 80 GB, full fine-tuning of larger models simply OOMs - the headroom isn't there.
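The bullet arithmetic above can be sketched as a back-of-envelope calculator. This is a rough estimate under stated assumptions, not a profiler: it counts only static state (bf16 weights and gradients, two fp32 Adam moments), assumes the teacher is the same size as the actor, and ignores activations, KV caches and any fp32 master weights. The parameter count of 7.6e9 for Qwen2-7B is approximate, and the function name is hypothetical.

```python
GB = 1e9  # decimal gigabytes, matching the ~15 GB weight figure


def static_memory_gb(params: float, with_teacher: bool = True) -> dict:
    """Static memory per resident model copy, in GB.

    Assumptions (not from the paper): bf16 weights and gradients,
    two fp32 Adam moment tensors, teacher the same size as the actor.
    Activations, KV caches and fp32 master weights are NOT counted.
    """
    weights = params * 2 / GB          # bf16: 2 bytes per parameter
    breakdown = {
        "actor_weights": weights,
        "actor_grads": weights,        # bf16 gradients
        "actor_adam_moments": params * 4 * 2 / GB,  # two fp32 tensors
        "reference": weights,          # frozen copy for the KL term
        "rollout_engine": weights,     # weight copy held by the sampler
    }
    if with_teacher:
        breakdown["teacher"] = weights  # SDAR's fourth copy
    breakdown["total"] = sum(breakdown.values())
    return breakdown


if __name__ == "__main__":
    sdar = static_memory_gb(7.6e9, with_teacher=True)
    grpo = static_memory_gb(7.6e9, with_teacher=False)
    for name, gb in sdar.items():
        print(f"{name:>20}: {gb:6.1f} GB")
    print(f"GRPO total: {grpo['total']:.1f} GB, SDAR total: {sdar['total']:.1f} GB")
```

Static state is only the floor. The teacher's longer privileged context shows up in the KV cache and activation terms this sketch deliberately leaves out, so treat the totals as a lower bound when sizing a node.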
The practical consequence for a budget build: you don't start at 7B. You prototype at Qwen2.5-1.5B or 3B with LoRA, where the debug loop is minutes instead of hours and the whole thing fits on far fewer GPUs. You only scale up once the gate logic is provably correct. (That's exactly the order Part 3 will follow.)
What it would actually cost
Public list pricing, US regions. Verify in your own region before you trust these: AWS cut P4/P5 instance prices by up to 45% in mid-2025, so the numbers move.
| Item | Figure |
|---|---|
| p4d.24xlarge (8× A100 40GB), on-demand | ~$32.77 / hr |
| Same node on spot | ~50–70% cheaper → roughly $10–16 / hr |
| One converging single-environment run | multi-turn RL is slow - 2–3 days of 8-GPU wall-clock |
| → cost of that one run, on-demand | ~$1,570–2,360 |
| The faithful comparison (GRPO vs naive GRPO+OPSD vs SDAR) | 3× that, plus failed runs |
| Cautionary tale | an idle p4d left running over a weekend ≈ $1,573 |
So the honest range for a minimal reproduction - one environment, a small model, LoRA, three configs, plus the inevitable false starts - is ~$2k–6k on-demand, ~$1.5k–3k on spot. The full paper (three environments × two model families × multiple scales × baselines) is comfortably into five figures.
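The per-run figures in the table are plain rate-times-hours arithmetic, which is worth keeping as a script so you can plug in your own region's price. A minimal sketch, using the $32.77 / hr list figure quoted above; the function names and the 50–70% spot discount bounds are taken from this post or made up for illustration, not from an AWS API.

```python
HOURS_PER_DAY = 24


def run_cost(hourly_rate: float, days: float) -> float:
    """Cost of keeping one node up for `days` of wall-clock time."""
    return hourly_rate * HOURS_PER_DAY * days


def spot_rate(on_demand_rate: float, discount: float) -> float:
    """Hourly spot price given a fractional discount (0.5 means 50% cheaper)."""
    return on_demand_rate * (1.0 - discount)


if __name__ == "__main__":
    on_demand = 32.77  # $/hr figure quoted in the table; verify before use

    # One converging run: 2 to 3 days on the full node.
    print(f"on-demand run: ${run_cost(on_demand, 2):,.0f} to ${run_cost(on_demand, 3):,.0f}")

    # Spot at 50-70% off, same wall-clock (ignores interruption overhead).
    cheap, dear = spot_rate(on_demand, 0.7), spot_rate(on_demand, 0.5)
    print(f"spot rate: ${cheap:.2f} to ${dear:.2f} per hour")

    # Three configs (GRPO, naive GRPO+OPSD, SDAR), before any failed runs.
    print(f"three configs: ${3 * run_cost(on_demand, 2):,.0f} to ${3 * run_cost(on_demand, 3):,.0f}")
```

The spot lines ignore time lost to interruptions and restarts, so real spot savings land somewhat below the headline discount.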
That's not a side-project budget. Which is precisely why this series architects the system and implements the mechanism, and treats the benchmarked run as the clearly-labelled thing it is: future work gated on compute.
The trade-off matrix (decide these before you provision)
| Lever | Cheap / PoC | Faithful / Expensive |
|---|---|---|
| Fine-tuning | LoRA adapters | Full fine-tuning |
| Environments | ALFWorld only | All three |
| Model size | 1.5B → 3B | 7B+ |
| Capacity | Spot (with checkpointing) | On-demand |
| Teacher placement | Spot, inference-only | Co-located on the training node |
Every row is a 2–10× cost decision. The Cheap / PoC column is what makes a reproduction attemptable on a small budget; the Faithful / Expensive column is what makes it publishable as a benchmark. Knowing which column you're in before you launch a job is the difference between a $300 experiment and a $3,000 surprise.
Where this leaves us
We now have a system that's buildable, a memory budget that picks the hardware, and a cost model that explains the constraint honestly. What we don't have yet is the thing that makes SDAR more than "GRPO with a teacher bolted on": the gate.
The entire contribution of this paper reduces to about fifteen lines of loss code. In Part 3, I'll put them on the page - the detached log-prob gap, the sigmoid gate, the asymmetric handling of positives and negatives, and where exactly it slots into a verl-agent training step. Plus the traps that turn those fifteen lines into a NaN if you get the detach() wrong.
Next: "Implementing the Gate" - the core SDAR mechanism, in code.
Running multi-model RL loops on AWS already? I'd like to hear how you placed the reference and rollout copies - comments are open.