# Curating Python failures for DPO: notes from the rejected side

Most of the work in DPO training data is on the rejected side. The chosen side has gold-standard reference implementations everywhere: production code, peer-reviewed libraries, official examples. The rejected side is harder. You need code that someone could plausibly write, that fails for a real reason, and that fails in a way the model can actually learn from.
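
For concreteness, one preference pair has roughly this shape (field names are illustrative here, not the dataset's actual schema):

```python
# Field names are illustrative, not the dataset's actual schema.
pair = {
    "prompt": "Implement fft(x) as a radix-2 FFT over a list of complex samples.",
    "chosen": "<gold-standard reference implementation>",
    "rejected": "<plausible attempt that fails for a real, non-trivial reason>",
}
```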

I tried a few approaches before settling on one that works.

Hand-curating from production code review is honest, but slow. After about fifty samples I'm tired and my judgments start drifting. Public failure datasets exist but tend to be sparse and narrow -- toy bugs in toy domains, or syntactic typos that don't really teach anything. Asking GPT-4 to "write a buggy version of X" is fast but expensive, and the bugs come out so obviously fabricated that they'd train the model on the wrong distribution. The bug-versus-correct boundary in those samples is too clean. Real bugs are messier.

What I actually wanted was code that looked like it was written by someone trying. Code where someone read the docstring, understood the goal, made a reasonable attempt, and then missed something subtle -- broken detailed balance in MCMC, a forgotten power-of-2 guard in an FFT, an off-by-one in a numerical kernel that shows up only after a few hundred iterations. The kind of failure where the test fails 200 samples in and you stare at the code wondering where the bug is.
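
To make that concrete, here's a sketch of the FFT case -- not a row from the dataset, just an illustration of the shape: a recursive radix-2 transform that works perfectly on power-of-2 inputs and breaks quietly otherwise.

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT -- minus the power-of-2 guard."""
    n = len(x)
    if n == 1:
        return list(x)
    # Missing guard: nothing checks that n is a power of two, e.g.
    # `if n % 2: raise ValueError("length must be a power of 2")`.
    # Odd lengths silently return the wrong number of coefficients;
    # a length like 6 only blows up later, as an IndexError one level
    # up the recursion, away from the actual mistake.
    even = fft(x[0::2])
    odd = fft(x[1::2])
    half = n // 2
    t = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(half)]
    return [even[k] + t[k] for k in range(half)] + \
           [even[k] - t[k] for k in range(half)]
```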

So I started building that kind of dataset.

## Quick about-me

Information theory background, formerly in autonomous driving R&D, currently solo. I've been curating a Python failure dataset for DPO/RLHF training for the past few weeks. Static-first design, deterministic checks, very minimal LLM judgment in the curation loop. The point is to make the data, not to chase model size.

This isn't a model story. It's a dataset story.

## The shape of failures I'm collecting

There's a quiet trap in synthetic data work. If you ask a language model to "write a buggy version of this function," you get bugs that look fabricated. They're either too easy (typos, missing brackets) or too contrived (failure modes nobody would ever write organically). Either way, the model trained on that data learns the wrong boundary.

What I actually want is code that fails genuinely. Not because the bug is faked, but because some problems are subtly hard. Getting MCMC detailed balance right requires understanding why the proposal-acceptance ratio works -- not just memorizing the formula. Getting the FFT power-of-2 guard right requires recognizing that the recursion only works on certain input sizes. The hard parts are where the failures happen honestly.
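
The detailed-balance case is the clearest example of "understanding why the ratio works." Here's a minimal sketch of that failure shape (illustrative, not a dataset row): an asymmetric log-normal proposal paired with the plain Metropolis acceptance ratio, so the Hastings correction is silently missing.

```python
import math
import random

def metropolis_lognormal(log_target, x0, n_steps, step=0.5):
    """Sampler for a target on (0, inf) -- with a detailed-balance bug."""
    x, samples = x0, []
    for _ in range(n_steps):
        # Asymmetric proposal: multiplicative log-normal step.
        x_new = x * math.exp(random.gauss(0.0, step))
        # Bug: plain Metropolis acceptance ratio. The Hastings correction
        # for the asymmetric proposal, q(x | x_new) / q(x_new | x) = x_new / x,
        # is missing, so detailed balance is broken and the chain converges
        # to the wrong stationary distribution -- while every step still
        # looks like textbook Metropolis.
        log_accept = log_target(x_new) - log_target(x)
        if random.random() < math.exp(min(0.0, log_accept)):
            x = x_new
        samples.append(x)
    return samples
```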

So the dataset is curated, not generated. Each row is an honest attempt that failed for a meaningful reason. Trivial failures (syntax errors, missing imports) are filtered out. So are stub non-attempts (more on that below). What's left looks almost human.

A few weeks in, I caught myself staring at one row for ten minutes, trying to figure out why it was wrong. That was the moment I knew the data was the right shape.

## What I learned curating

Three things surprised me along the way.

Section ordering in prompts matters more than I expected on a 7B model. When generating attempts, I had a function-signature contract sitting in the middle of the prompt -- a clear instruction telling the model "you must define this exact function with these exact keyword arguments." After it came a long block of counter-example code (kept to teach the model what not to do). The model was reading the contract, then drowning in the counter-examples, then writing the wrong signature. Function name adherence was around 8%.

I moved the contract to right before the task description, after the counter-examples. Adherence jumped to 75% in the next batch. Lost-in-the-middle is real on 7B. Recency bias too. The thing the model saw last had outsized weight.
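
The fix itself was tiny -- just a reordering of sections when the prompt gets assembled, roughly like this (function and section names are illustrative):

```python
def build_prompt(counter_examples: str, contract: str, task: str) -> str:
    # Old order (roughly): the signature contract sat mid-prompt, buried
    # under the long counter-example block -- ~8% signature adherence.
    # New order: counter-examples first, contract right before the task,
    # so the contract is the last strong instruction the model reads
    # before generating -- ~75% adherence on the next batch.
    sections = [counter_examples, contract, task]
    return "\n\n".join(sections)
```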

The dataset attacks itself. Once attempts started looking genuinely human, the next problem was that some weren't real attempts at all -- they were stubs with comments like `# Dummy implementation` and a hardcoded return value. The model, when uncertain, defaulted to plausible-looking nothingness. These passed every refusal filter, every length check, every shape constraint. They looked fine. But they weren't real failures; they were non-attempts.

I added a stub-detection layer. The first version was pattern matching against comment phrases. The second version (still in progress) is more structural -- short bodies, hardcoded returns, identity functions. It's an arms race. As soon as a pattern is filtered, the model finds another way to fake without saying so.
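
The structural version looks roughly like this -- a sketch of the idea, not the filter as it actually runs, using Python's `ast` module to catch constant returns, identity functions, and near-empty bodies:

```python
import ast

def looks_like_stub(source: str) -> bool:
    """Flag 'non-attempts': constant returns, identity functions, empty bodies."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return True  # unparseable code is filtered earlier anyway
    for node in ast.walk(tree):
        if not isinstance(node, ast.FunctionDef):
            continue
        body = node.body
        # Skip a leading docstring when measuring the body.
        if body and isinstance(body[0], ast.Expr) and isinstance(body[0].value, ast.Constant):
            body = body[1:]
        # Empty or pass-only body.
        if not body or all(isinstance(stmt, ast.Pass) for stmt in body):
            return True
        if len(body) == 1 and isinstance(body[0], ast.Return):
            ret = body[0].value
            # Hardcoded return: `return None`, `return 0`, `return []`, ...
            if ret is None or isinstance(ret, (ast.Constant, ast.List, ast.Dict)):
                return True
            # Identity function: `return x` where x is the first argument.
            args = [a.arg for a in node.args.args]
            if isinstance(ret, ast.Name) and args and ret.id == args[0]:
                return True
    return False
```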

Quality plateaus with compute. Spending more time per sample doesn't help past a point; the marginal sample isn't a more interesting failure, it's just another sample. The improvement comes from changing what kinds of failures the model is asked to produce -- new domains, different invariants, contrastive examples in the prompt. Compute is necessary but not the lever.

## The dataset right now

About two weeks of curation:

  • ~10,000 verified failure rows
  • 19 Python domains
  • Each row carries the prompt that produced the attempt
  • All quality filters deterministic
  • Snapshot every couple of weeks

I monitor quality drift, adjust prompts when failure modes shift, and otherwise let the curation run.

## What's next

The next major question I'm working through is breadth versus depth. The current data is heavy on Monte Carlo and FFT -- the curation loop got good at those domains and kept producing high-quality failures there. The next snapshot will deliberately broaden domain coverage, even if average quality dips slightly. I'd love to hear from people actually training models on this kind of data: would you rather have a deep specialized set or a wider balanced one?

A couple of questions if you're working on DPO/RLHF training:

  1. Which Python domains do you most need failure data for? I can shift the curation focus in a few days, so the answers actually change what gets generated.

  2. Are "subtle failures" useful to you -- code that passes pytest but violates a deeper mathematical invariant? There's a pool of those that's currently filtered out. Curious if anyone actually wants that shape of data.

## If you want to take a look

I've put up a 100-row sample on Hugging Face for context:

https://huggingface.co/datasets/namakoo/idfu-verified-code

If you're hitting any of these problems, or if you have ideas about what failure data would actually be useful in training, reply here or find me on Twitter @namakoo123.

Curation continues regardless. The next sample is being collected as you read this.
