AC‑ODM, an RL‑driven data scheduler, is reported to lift MMLU accuracy by 27.5% relative and to reach a 2.23× higher HumanEval pass@1, with virtually no extra compute [1]. The scheduler learns a policy that decides, at each training step, how many examples from each data source to present to the model. The policy is updated online, alongside the model, and the reported cost is only a 0.4% wall‑clock increase per step.
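To make "how many examples from each source" concrete, here is a minimal sketch of turning per‑source scores into a batch allocation. It is not the paper's implementation: the source names, the logits, and the helper names `mixing_probabilities` and `allocate_batch` are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def mixing_probabilities(logits):
    """Softmax over per-source logits, giving sampling probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def allocate_batch(source_names, logits, batch_size, rng):
    """Draw how many examples each source contributes to one batch."""
    probs = mixing_probabilities(logits)
    draws = rng.choices(source_names, weights=probs, k=batch_size)
    counts = Counter(draws)
    # Report every source, including those that drew zero examples.
    return {name: counts.get(name, 0) for name in source_names}


if __name__ == "__main__":
    rng = random.Random(0)
    names = ["web", "code", "books"]  # hypothetical data sources
    logits = [0.0, 1.0, -0.5]         # hypothetical policy outputs
    print(allocate_batch(names, logits, batch_size=32, rng=rng))
```

With uniform logits this reduces to uniform mixing, so a static baseline is just the special case where the logits never change.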
Before AC‑ODM, most LLM pre‑training pipelines relied on static or uniform mixing of source corpora, assuming that larger models or longer training were the only ways to close downstream gaps. Researchers experimented with hand‑crafted curricula, but those schedules lacked feedback from the model’s evolving gradients. Consequently, improvements from smarter data allocation remained anecdotal.
AC‑ODM delivers those gains by learning a policy that allocates examples across tasks on the fly. “On Pythia-1B, it reaches optimal validation perplexity using up to 66% fewer training steps than competitive baselines, delivering a 27.5% relative improvement in MMLU accuracy and a 2.23× higher pass@1 on HumanEval, all while incurring a virtually negligible (0.4%) per-step wall-clock increase and only 2% additional memory overhead.” [1] This translates to a 7.2% absolute lift in 0‑shot MMLU accuracy and a more than two‑fold jump in HumanEval pass@1, on essentially the same hardware budget.
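The relative and absolute MMLU figures above can be cross‑checked with a line of arithmetic. The sketch below only rearranges the two numbers quoted in this post; the implied baseline is a derived value, not one taken from the paper, and the 100,000‑step budget is a hypothetical input.

```python
def implied_baseline(absolute_gain_pts, relative_gain):
    """Baseline accuracy implied by an absolute gain (points) and a relative gain (fraction)."""
    return absolute_gain_pts / relative_gain


def remaining_steps(baseline_steps, fraction_fewer):
    """Steps needed if a run uses `fraction_fewer` fewer steps than the baseline."""
    return round(baseline_steps * (1.0 - fraction_fewer))


if __name__ == "__main__":
    base = implied_baseline(7.2, 0.275)
    print(f"implied baseline MMLU accuracy: {base:.1f}%")
    print(f"implied improved MMLU accuracy: {base + 7.2:.1f}%")
    # Hypothetical budget, to show what "up to 66% fewer steps" means.
    print("steps at 66% fewer:", remaining_steps(100_000, 0.66))
```

If both figures are right, the baseline sits at roughly 26% accuracy and the improved model at roughly 33%. That is worth keeping in mind when reading the relative number, since four‑way multiple choice has a 25% chance floor.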
The study leaves open how the approach scales beyond a 1 B‑parameter backbone. All reported numbers come from a Pythia‑1B experiment, and the paper does not present results on larger, production‑scale models [1]. The proxy mode, which transfers a policy learned on a small model to a larger target, introduces an extra training phase; the paper does not report a quantified cost for this phase. An open question is whether the same relative gains survive when the model’s capacity dwarfs the data‑mixing policy’s representational power.
If the reported efficiency carries over to larger models, replacing uniform sampling with an AC‑ODM scheduler is a strong candidate for the default in pre‑training scripts. Practitioners can drop in a small RL‑policy module, keep the memory footprint within about 2% of the baseline, and re‑run standard benchmarks to check whether the gains reproduce on their own setup. The community ought to treat data mixing as a tunable hyper‑parameter on par with model depth, rather than an afterthought.
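As a rough picture of what such a drop‑in module could look like, the sketch below learns mixing weights online with a REINFORCE‑style update and a running‑mean baseline. This is a deliberately simplified stand‑in, not AC‑ODM itself: the paper's critic, reward definition, and update rule are not reproduced here, and `MixingPolicy`, `toy_reward`, and every constant are hypothetical.

```python
import math
import random


class MixingPolicy:
    """Softmax policy over data sources, updated online from scalar rewards."""

    def __init__(self, num_sources, lr=0.1, baseline_decay=0.9):
        self.logits = [0.0] * num_sources  # start from uniform mixing
        self.lr = lr
        self.baseline_decay = baseline_decay
        self.baseline = 0.0  # running mean reward; stands in for a learned critic

    def probabilities(self):
        m = max(self.logits)
        exps = [math.exp(x - m) for x in self.logits]
        total = sum(exps)
        return [e / total for e in exps]

    def sample(self, rng):
        probs = self.probabilities()
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]

    def update(self, source, reward):
        probs = self.probabilities()
        advantage = reward - self.baseline
        for i, p in enumerate(probs):
            # Gradient of log pi(source) with respect to logit i.
            grad = (1.0 if i == source else 0.0) - p
            self.logits[i] += self.lr * advantage * grad
        d = self.baseline_decay
        self.baseline = d * self.baseline + (1.0 - d) * reward


def toy_reward(source, rng):
    """Hypothetical reward: in real training this might be a loss reduction."""
    means = [0.1, 0.5, 0.2]
    return means[source] + rng.gauss(0.0, 0.05)


if __name__ == "__main__":
    rng = random.Random(0)
    policy = MixingPolicy(num_sources=3)
    for _ in range(3000):
        s = policy.sample(rng)  # choose which source feeds this step
        policy.update(s, toy_reward(s, rng))
    print([round(p, 3) for p in policy.probabilities()])
```

In a real training loop the sampled source would select the next batch and the reward would come from the model, so the policy state stays tiny relative to the LLM. Whether that matches the overhead reported for AC‑ODM depends on details this sketch does not model.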