In the previous post, I described the curation philosophy for IDFU's rejected-side dataset — why I avoid synthetic bug generation, why stub detection matters, why "honest failures" are hard to come by.
A few people asked the obvious question: does it work?
This post is the answer. Not as a marketing pitch, but as a breakdown. Aggregate scores hide more than they reveal, and I want to show what's actually changing under the hood.
The setup
- Base model: Qwen2.5-Coder-3B-Instruct (trained on 92 programming languages)
- Method: DPO via TRL with LoRA
- Data: 500 preference pairs from the IDFU dataset
- Eval: HumanEval, pass@1, three random seeds (42, 123, 7)
- Hardware: RTX 4060, single-GPU, ~3-4 hours of training per seed
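For anyone who wants to reproduce the eval line above, the flow is the standard HumanEval one: generate a completion per problem, dump a JSONL file, and let the official harness execute the tests. Here is a minimal sketch of that loop; the prompting and decoding choices in it are assumptions, not my exact generation settings:

```python
# Minimal HumanEval pass@1 loop (illustrative; decoding settings are assumptions,
# not the exact generation config behind the numbers in this post).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from human_eval.data import read_problems, write_jsonl

model_id = "Qwen/Qwen2.5-Coder-3B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

samples = []
for task_id, problem in read_problems().items():
    inputs = tok(problem["prompt"], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    completion = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    samples.append({"task_id": task_id, "completion": completion})

write_jsonl("samples.jsonl", samples)
# Then run `evaluate_functional_correctness samples.jsonl` from the human-eval
# package to get pass@1; repeat per checkpoint/seed and average.
```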
A note on the data shape, since I left this implicit in the previous post. Each preference pair has a chosen side and a rejected side, both produced inside IDFU. The rejected side is what Part 1 was about — honest failures, curated rather than synthesized. The chosen side comes from IDFU's internal certified pool — samples that passed all of IDFU's validation gates and are tracked separately from the commercially-released portion. (The gate composition itself is proprietary; this article describes only what the pool produces, not how it's filtered.)
So both sides of the pair are produced and verified by the same internal pipeline, against the same quality bar. The contrast that DPO trains against isn't "model output vs. ideal answer," it's "honest failure vs. honest success on the same task," with both sides held to the same internal standard. That's the configuration that produced the numbers below.
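For readers who haven't seen the previous post, here is the shape of a single pair, schematically. The field names are the standard TRL DPO columns, not necessarily IDFU's actual schema:

```python
# One preference pair, schematically. Field names follow the standard TRL DPO format;
# they are not claimed to be IDFU's actual column names.
pair = {
    "prompt": "Implement a function that ...",      # the task, shared by both sides
    "chosen": "def solve(...):\n    ...",           # from the internal certified pool
    "rejected": "def solve(...):\n    ...",         # an honest failure on the same task
}
# TRL's DPOTrainer expects exactly these three text fields per row.
```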
The hyperparameters were deliberately conservative. No sweep, no exotic tricks:
- LoRA: r=16, alpha=32, dropout=0.05
- Target modules: q/k/v/o/gate/up/down_proj (Qwen standard)
- DPO: beta=0.1
- Optimizer: learning_rate=5e-5
- Batch: size=1, grad_accum=4 (effective batch 4)
- Epochs: 3 (= 375 optimizer steps for 500 pairs)
- Quantization: 4-bit NF4 + bf16 compute (bitsandbytes)
- gradient_checkpointing=True
- max_length=2048, max_prompt_length=512
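For concreteness, here is how that list maps onto TRL + PEFT + bitsandbytes. Treat it as a reconstruction from the settings above, not my actual training script; the dataset path and the TRL version details are assumptions:

```python
# Reconstruction of the training setup from the hyperparameter list above.
# A sketch under assumptions, not the actual script; "pairs.jsonl" is a placeholder.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

model_id = "Qwen/Qwen2.5-Coder-3B-Instruct"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb,
                                             device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

args = DPOConfig(
    output_dir="idfu-dpo", beta=0.1, learning_rate=5e-5,
    per_device_train_batch_size=1, gradient_accumulation_steps=4,
    num_train_epochs=3, gradient_checkpointing=True, bf16=True,
    max_length=2048, max_prompt_length=512, seed=42,
)

# Rows with "prompt" / "chosen" / "rejected" text fields, as sketched earlier.
train = load_dataset("json", data_files="pairs.jsonl", split="train")

trainer = DPOTrainer(model=model, args=args, train_dataset=train,
                     processing_class=tokenizer,  # `tokenizer=` in older TRL releases
                     peft_config=peft_config)
trainer.train()
```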
Three seeds because one seed is a coincidence, two seeds is suggestive, and three seeds with consistent direction starts to look like signal. I'd run more if I could afford the GPU time, but training and data generation share the same card, and curation can't pause for a week.
The aggregate
Pass@1 across seeds:
| Seed | pass@1 | Δ vs. base (pp) |
|---|---|---|
| 42 | 84.1% | +3.66 |
| 123 | 84.1% | +3.66 |
| 7 | 83.5% | +3.05 |
| Mean ± std | 83.94 ± 0.35% | +3.46 ± 0.35 |
Two seeds landed on the exact same delta, which I noticed and double-checked because it looked too clean. The HumanEval problem set is small (164 problems), failures move in integer counts, and ties at the same delta happen more often than people expect. Seed=7 was lower but in the same direction.
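To make the integer-count point concrete, here is the arithmetic behind the deltas (these are net changes, derived from the percentages above):

```python
# HumanEval has 164 problems, so one problem flipping moves pass@1 by 100/164 pp.
N = 164
pp_per_problem = 100 / N                 # ~0.61 pp per problem
print(round(3.66 / pp_per_problem))      # seeds 42 and 123: +3.66 pp -> net 6 problems
print(round(3.05 / pp_per_problem))      # seed 7:           +3.05 pp -> net 5 problems
# Deltas are quantized to multiples of ~0.61 pp, which is why two seeds tie exactly.
```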
Not earth-shattering. But the headline number isn't really the point.
Where the improvement actually comes from
I categorized every HumanEval failure by exception type. Five categories cover everything:
| Failure category | Base count | DPO count (mean ± std) | Δ | Relative |
|---|---|---|---|---|
| ASSERTION_FAIL | 23 | 18.67 ± 0.58 | -4.33 | -19% |
| NAME_ERROR | 6 | 3.67 ± 0.58 | -2.33 | -39% |
| OTHER_RUNTIME | 2 | 2.67 ± 0.58 | +0.67 | +33% (n=2) |
| SYNTAX_ERROR | 0 | 0.33 ± 0.58 | +0.33 | (1/3 seeds) |
| TYPE_ERROR | 1 | 1.00 ± 0.00 | 0.00 | 0% |
| Total failures | 32 | 26.34 | -5.66 | -18% |
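The bucketing itself is mechanical: each failure's exception type maps to one of the five categories. A sketch of the classifier I mean (the harness output format and matching rules are simplified here):

```python
# Map a failed problem's exception type to one of the five buckets in the table above.
# Assumes the harness reports the exception name per failure; details simplified.
def categorize_failure(exc_name: str) -> str:
    if exc_name == "AssertionError":
        return "ASSERTION_FAIL"    # an assert fired, in the tests or in the model's own code
    if exc_name == "NameError":
        return "NAME_ERROR"        # undefined name, usually a missing import
    if exc_name == "SyntaxError":
        return "SYNTAX_ERROR"      # output didn't parse at all
    if exc_name == "TypeError":
        return "TYPE_ERROR"
    return "OTHER_RUNTIME"         # everything else (IndexError, ValueError, ...)
```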
A few things stand out.
NAME_ERROR drops 39%. This is the cleanest signal. NAME_ERROR on HumanEval is typically the model using a stdlib name without importing it (e.g., math.ceil without import math, or re.match without import re), or referencing an undefined helper. One concrete case I traced was HumanEval/115 (max_fill), where the base model called math.ceil with no import; after DPO training, the model added the import inline. The IDFU rejected pool has a lot of these missing-import patterns, because they're a common failure mode when a model fakes confidence on partial knowledge. The transfer to HumanEval was the most direct prediction I had going in, and it's the one that landed hardest.
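To make the pattern concrete, here is the shape of that fix, paraphrased rather than quoted from the model outputs:

```python
# Paraphrase of the HumanEval/115 (max_fill) case, not the models' verbatim outputs.

# Base model, schematically: correct arithmetic, but math is never imported,
# so calling the function raises NameError and the problem counts as a failure.
base_output = '''
def max_fill(grid, capacity):
    return sum(math.ceil(sum(row) / capacity) for row in grid)
'''

# DPO model, schematically: same logic, with the import included inline.
dpo_output = '''
import math

def max_fill(grid, capacity):
    return sum(math.ceil(sum(row) / capacity) for row in grid)
'''
```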
ASSERTION_FAIL drops 19%. This one I didn't expect, and the mechanism is interesting enough to be worth tracing.
I went through the 8 newly-passing problems on seed=42 by hand. Six of them were genuine algorithmic improvements — the base model's logic was wrong, the DPO model's logic was right. Standard stuff.
The other two (HumanEval/116 and /123) had identical algorithmic content between base and DPO outputs. The actual difference was that the base model wrote a self-test at the end of the function ("if this fails, raise"), and that self-test crashed at import time on the test harness, registering as ASSERTION_FAIL. The DPO model produced clean function-only output, no embedded self-tests, so import succeeded and the harness's actual tests ran and passed.
That's a behavioral artifact, not an algorithmic improvement. It reflects the chosen-side distribution: clean idiomatic Python without embedded test scaffolding. DPO learned to not tack on self-tests, and two of the eight ASSERTION_FAIL fixes are downstream of that.
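Schematically, the failure mode looks like this (an illustration of the pattern with a made-up function, not the actual /116 or /123 outputs):

```python
# Base-style output: a function plus a module-level self-test. When the self-test
# encodes a wrong expectation, the assert fires at import time, before the problem's
# real tests ever run, and the whole problem is logged as ASSERTION_FAIL.
def even_digit_numbers(nums):
    # hypothetical task: keep numbers whose digits are all even, sorted
    return sorted(n for n in nums if all(int(d) % 2 == 0 for d in str(n)))

# Harmless when its expectation is right, fatal at import time when it isn't:
assert even_digit_numbers([15, 33, 1422, 1]) == []

# DPO-style output: the function only, no embedded scaffolding, so the module
# imports cleanly and the harness's own tests decide pass/fail.
```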
So if I want to be honest about the breakdown: the real algorithmic improvement is roughly +2.4 pp out of the +3.46 pp aggregate. The rest is the model learning to output the kind of thing the chosen pool actually contained. That's still a real and useful effect — nobody wants self-tests injected into their library code — but it's not the same thing as "the model got smarter."
This is the kind of detail that aggregate scores erase. I think it matters.
TYPE_ERROR doesn't move at all. Three seeds, all returning exactly 1 failure. This is one specific HumanEval problem that the base model fails on and the DPO model also fails on, in the same way. DPO didn't touch it, didn't make it worse, didn't make it better. A clean null result.
OTHER_RUNTIME and SYNTAX_ERROR are noise. The absolute counts are 0-3 across seeds. A single failure shifting between categories produces large relative percentages on tiny denominators. I'd want a much bigger eval set to say anything about these.
What this is and isn't
What it isn't: a refutation of large-data RLHF. 500 samples is not going to compete with a 50,000-sample run done properly. The improvement is a few percentage points, not transformative.
What it is, I think: a signal that what you put in the preference pairs matters more than how much. AP2O reports up to ~+3% pass@k improvement on coding benchmarks (per the paper's abstract; see arXiv:2510.02393 for the full per-benchmark breakdown). This run got into the same ballpark with 500 pairs. The data efficiency ratio probably doesn't survive scaling — there's almost certainly a point of diminishing returns where the curation work stops paying off and you just need volume. But at the small-data end, where most solo developers actually live, the curation seems to do real work.
The other thing I take from the breakdown is that the improvements aren't coming from gaming the eval. If 500 pairs of curated data caused the model to start failing in new ways — spike of SYNTAX_ERROR, models refusing to attempt — I'd be worried about distribution collapse. None of that shows up. The two categories that move (NAME_ERROR, ASSERTION_FAIL) move down. The categories that don't move stay put. That's roughly the shape I'd want.
What I'm doing next
Currently in production: a Tokenization/BPE-focused specialty pack, ETA a few days. The benchmark methodology used here — 3 seeds, full failure breakdown, manual inspection of the deltas — will be re-run on each new specialty as it ships. I'd rather publish slow and verified than fast and aggregate-only.
If anyone has run similar breakdowns on their own DPO experiments, I'd genuinely like to compare notes. The thing I can't tell from one run is whether the NAME_ERROR / ASSERTION_FAIL split is specific to IDFU's curation choices or a general property of curated DPO with this kind of chosen-pool composition. That's a question with real implications for what to put in the pairs, and I don't have enough data to answer it alone.
The dataset
A 100-row sample is up at huggingface.co/datasets/namakoo/idfu-verified-code if you want to see what the curated data actually looks like. The full set is available there as well.
If you're working on DPO or RLHF training and want to compare notes, find me on Twitter (@namakoo123) or reply here.
Curation continues. The next snapshot is being collected as you read this.
