Nick Meinhold

Posted on Jul 4 • Originally published at github.com

The Ruler Made of Itself

#ai #machinelearning #llm #evaluation

I built an engine to evolve Claude's system prompt and hit a wall that had nothing to do with evolution. You cannot measure your own best work with an instrument made of yourself.

TL;DR. To search for system prompts that make a model genuinely engaged rather than merely competent, you need to score quality, and the only judge fast enough is the model itself. That's where it breaks. Claude grading Claude can tell bad prompts from good ones, but it goes blind at the top: every excellent candidate blurs into the same number. I proved the wall is the ruler, not the search, by swapping in a deliberately different generator and watching the ceiling refuse to move. The fix, and the transferable idea: when a model judges a model, your unit of independence is the lab, not the model. Three models from one lab are one vote.

Evolve the prompt, don't write it
The judge problem
The wall at the top
A ruler made of itself
Proving it's the ruler, not the search
Count labs, not models
What travels

Evolve the prompt, don't write it

Everyone hand-tunes system prompts. You write a paragraph, try it, notice the model got preachy, delete an adjective, try again. It works, but it's a craft practised in the dark: you hill-climb by feel and never learn whether the ridge you're on is the tallest one in the range.

So I asked a different question. What if you didn't write the prompt at all, and instead ran a real evolutionary search over prompt-space and let selection find the peaks? That's Claude Resonance: three instances of the model wired into a loop.

Researcher picks a mutation operator, looks at the archive of what has worked so far, and writes a new prompt variant plus a hypothesis for why it should score well.
Subject runs under that prompt against a fixed five-task battery: creative coding, a nasty debugging problem, an open-ended design brief, a philosophy question, and a routine chore.
Evaluator scores each response on a six-dimension rubric and emits structured JSON, one score and one justification per dimension.

Scores feed a MAP-Elites archive: a grid of 8 strategy types by 3 prompt-length bins, 24 cells, each holding the best variant found for that niche. Quality-diversity search refuses to collapse to one winner; it keeps a population of good-but-different prompts, so you learn the shape of the landscape, not just its summit.
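
A minimal sketch of that archive rule, under stated assumptions. Only the 8 by 3 grid comes from the description above. The `Variant` fields, the word-count bin edges and the `offer` method are hypothetical names chosen for illustration, not the project's actual code.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

N_STRATEGIES = 8     # strategy types (from the post)
N_LENGTH_BINS = 3    # prompt-length bins (from the post)


@dataclass(frozen=True)
class Variant:
    prompt: str
    strategy: int    # 0..7; how a strategy gets labelled is assumed, not specified
    score: float     # aggregate rubric score from the Evaluator


def length_bin(prompt: str, edges: Tuple[int, int] = (200, 800)) -> int:
    """Map a prompt to one of three bins by word count. The edges are hypothetical."""
    n_words = len(prompt.split())
    if n_words < edges[0]:
        return 0
    if n_words < edges[1]:
        return 1
    return 2


class Archive:
    """One elite per (strategy, length-bin) cell: 8 x 3 = 24 niches."""

    def __init__(self) -> None:
        self.cells: Dict[Tuple[int, int], Variant] = {}

    def offer(self, v: Variant) -> bool:
        """Keep v only if its cell is empty or v beats the incumbent."""
        if not 0 <= v.strategy < N_STRATEGIES:
            raise ValueError(f"unknown strategy index: {v.strategy}")
        key = (v.strategy, length_bin(v.prompt))
        incumbent = self.cells.get(key)
        if incumbent is None or v.score > incumbent.score:
            self.cells[key] = v
            return True
        return False

    def coverage(self) -> float:
        """Fraction of the 24 cells that currently hold an elite."""
        return len(self.cells) / (N_STRATEGIES * N_LENGTH_BINS)
```

The point of the structure is that a variant only ever competes with others in its own niche, so a strong long prompt can never evict a weaker but differently shaped short one.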

The architecture is the easy part. Everything above is a weekend of plumbing. The project's real content turned out to live entirely in one word from that list: scores.

The judge problem

"Quality" for a system prompt is not a number you read off a dial. It's a judgement, and you need thousands of them, fast and cheap. The only judge that qualifies is Claude itself. So the Evaluator is Opus, grading a rubric.

The six dimensions, weighted unevenly on purpose:

Dimension	Weight	What it's watching for
Novel Connections	`1.2`	Structural analogies across distant domains, the thing that reads as insight
Specificity	`1.0`	Vivid and concrete over hedged and general
Unprompted Exploration	`1.0`	Noticing the adjacent thing you didn't ask about
Technical Depth	`1.0`	Correctness that goes past the surface
Voice	`1.0`	A point of view, not committee prose
Genuine Caveats	`0.8`	Real uncertainty mapped, not a trailing disclaimer

Novel Connections is weighted highest on purpose: it's the dimension a model is least able to fake and most able to surprise you with.
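
One way the weights could fold into a single fitness number is a weighted mean. That aggregation is an assumption: the post gives the weights but not the formula. The dictionary keys below are hypothetical identifiers for the six dimensions.

```python
from typing import Dict

# Weights from the table above; the snake_case keys are illustrative names.
WEIGHTS: Dict[str, float] = {
    "novel_connections": 1.2,
    "specificity": 1.0,
    "unprompted_exploration": 1.0,
    "technical_depth": 1.0,
    "voice": 1.0,
    "genuine_caveats": 0.8,
}


def aggregate(scores: Dict[str, float]) -> float:
    """Weighted mean of per-dimension scores (assumed aggregation rule)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
    return total / sum(WEIGHTS.values())
```

Because the weights sum to 6.0, a response that scores the same on every dimension keeps that score. The weighting only matters when the dimensions disagree.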

A quick note on hygiene, because it's the price of admission and not the story. Early runs were contaminated: each `claude -p` call inherited my local config, so the pipeline was quietly reading my own project notes and grading its own echo. That's a mundane environment bug with a mundane fix: isolate the context, then gate every run behind a probe that fails closed if any canary phrase leaks. It's worth exactly one sentence of respect and no more. The interesting thing is what a clean run then showed me.
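
A fail-closed canary gate might look roughly like the sketch below. It assumes the probe's text output has already been captured somehow. `check_isolation`, `ContaminationError` and the canary strings are hypothetical, and how the probe itself is invoked is deliberately left out.

```python
from typing import Iterable, List, Optional


class ContaminationError(RuntimeError):
    """Raised when a run must not proceed."""


def check_isolation(probe_output: Optional[str], canaries: Iterable[str]) -> None:
    """Return quietly only when the probe ran and leaked nothing.

    Fail closed: a missing or empty probe result counts as a failure,
    because "could not check" must never be read as "clean".
    """
    if probe_output is None or not probe_output.strip():
        raise ContaminationError("probe produced no output; refusing to run")
    haystack = probe_output.casefold()
    leaked: List[str] = [c for c in canaries if c.casefold() in haystack]
    if leaked:
        raise ContaminationError(f"canary phrases leaked: {leaked}")
```

The design choice that matters is the first branch: the gate blocks the run whenever it cannot positively confirm isolation.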

The wall at the top

With isolation in place, the first honest distribution looked like this:

Metric	Value
Score range	3.83 – 4.87
Mean	4.30
Std dev	0.35
Top cluster	4.82, 4.82, 4.87 (effectively a tie)

Read the bottom of the range and the evaluator looks healthy: a deliberately bare, minimalist prompt scores 3.83, well below the pack. The judge can clearly tell weak from strong.

Now read the top. 4.82, 4.82, 4.87. Among genuinely good prompts, the spread has collapsed. The evaluator has plenty of discrimination at the bottom and almost none at the peak. Everything excellent converges on roughly 4.8.

This is the whole ballgame, because the entire promise of the project is "find what makes the model resonate." The mediocre prompts are not what I'm searching for. If the instrument can rank the mediocre but not the excellent, then the top of the archive is noise, and the top of the archive is the only part I care about. A search is only as good as its fitness function, and mine had gone flat exactly where it needed to be sharpest.
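
The asymmetry is easy to put in numbers using only the figures from the table above. The snippet is plain arithmetic on the reported values and adds no new data.

```python
# Figures reported in the table above.
LOW, HIGH = 3.83, 4.87
STD_DEV = 0.35
TOP_CLUSTER = [4.82, 4.82, 4.87]

top_spread = max(TOP_CLUSTER) - min(TOP_CLUSTER)   # width of the tie at the summit
bottom_gap = min(TOP_CLUSTER) - LOW                # bare prompt up to the top cluster

print(f"top spread: {top_spread:.2f} ({top_spread / STD_DEV:.2f} std devs)")
print(f"bottom gap: {bottom_gap:.2f} ({bottom_gap / STD_DEV:.2f} std devs)")
print(f"share of the full range spanned by the top three: {top_spread / (HIGH - LOW):.1%}")
```

The top three sit within 0.05 of each other, about a seventh of a standard deviation, while the bare prompt is 0.99 below them, nearly three standard deviations away.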

A ruler made of itself

Why go blind precisely where you need to see? Because the judge and the judged are the same model.

When Opus grades Opus, judge and judged share a training pipeline, a tokenizer, an RLHF history and, above all, a set of aesthetic preferences: the things the model finds fluent, coherent, well-formed. When it reads a genuinely excellent prompt, all of those light up, and it says 4.8. When it reads a differently excellent prompt, the same preferences light up the same way, and it says 4.8 again. The judge is generous about exactly the qualities it is itself made of, and that generosity isn't a uniform offset you could subtract out. It's a compression: the spread shrinks at the top. Same family, same inductive bias, same blind spots.

An LLM judge drawn from the same distribution as the thing it measures is structurally blind to its own biases. You do not fix this with a better rubric. A sharper description of "good," handed to an instrument that already agrees with itself, adds no dimension of disagreement.

The analogy that finally made it concrete for me: a ruler machined to be exactly one metre long cannot tell you it's a millimetre off, because it is the reference. Or the human version: a panel of judges who all trained at the same conservatory will agree the top three violinists are "all just wonderful" and fail to rank them, not because they lack taste but because they share it. To break a tie at the summit you need a judge whose taste is uncorrelated with the taste that produced the tie.

Proving it's the ruler, not the search

Here's the objection I had to answer, because it's the honest one. Maybe the top prompts really are all about equally good, and 4.8 is just true. Maybe the search has genuinely converged and there's nothing left to separate. How do you tell "the ruler can't see the difference" apart from "there is no difference"?

You change the generator and watch whether the measurement moves.

The crossover operators I started with were built to make prompts more coherent: they dissolve two parents into one smooth synthesis, which quietly sands off any sharp, atypical spike. So I built the opposite: a recombination operator that splices two parents across a preserved seam, deliberately keeping both their sharp edges instead of blending them away, and that rewards one weird, atypical tail rather than punishing it. A genuinely different kind of candidate, engineered not to be smooth.

I ran it. The plateau held. Every quality operator, the sharp new one included, still compressed into the 4.78 to 4.93 band, indistinguishable at the top. Discrimination survived only at the bottom, where the random-noise operator cleanly separated out around 4.3.

I changed the thing making the candidates and the measurement didn't budge. That is direct evidence the binding constraint is the ruler, not the generator. The ceiling isn't where the good prompts run out; it's where the judge's eyesight does.
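
As a rough sketch, the check can be phrased as "which operators can the judge actually tell apart?" Band overlap is used here as a crude stand-in for a proper statistical test. The function names and the per-operator score lists are assumptions for illustration, not part of the project.

```python
from typing import Dict, List, Tuple


def band(scores: List[float]) -> Tuple[float, float]:
    """The (min, max) score band one operator's candidates landed in."""
    return min(scores), max(scores)


def separable_pairs(scores_by_operator: Dict[str, List[float]]) -> List[Tuple[str, str]]:
    """Operator pairs whose score bands do not overlap.

    A pair appears only when the judge places one operator's whole band
    above the other's; overlapping bands count as indistinguishable.
    """
    names = sorted(scores_by_operator)
    pairs: List[Tuple[str, str]] = []
    for i, a in enumerate(names):
        lo_a, hi_a = band(scores_by_operator[a])
        for b in names[i + 1:]:
            lo_b, hi_b = band(scores_by_operator[b])
            if hi_a < lo_b or hi_b < lo_a:
                pairs.append((a, b))
    return pairs
```

If a deliberately different generator never shows up in a separable pair with the operators it was built to differ from, the measurement has not moved, which is the situation described above.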

And note the trap folded inside that result. An Opus-only evaluator cannot tell me whether the sharp operator's oddball outputs are gold it's blind to, or genuinely no better. That indistinguishability isn't a disappointing null result. It is the result: the new operator's value is unmeasurable until I have a de-correlated ruler to measure it with. You can't grade a novel kind of good with a judge that only knows the familiar kind.

Count labs, not models

So the fix is a second opinion from outside the family. And here is the sharpest thing the project taught me, the part I'd keep if I threw away the rest.

The obvious move is "add another model to the panel." Bring in a second Anthropic model, a Sonnet or a Haiku, and average. This buys almost nothing. Models from the same lab are correlated draws. They share the training data, the reward-model preferences, the tokenizer, and the house definition of what good writing is. A same-lab model adds cost and a comforting illusion of diversity without adding a genuinely independent vote.

De-correlation is a property of the laboratory, not the model. Anthropic's Opus, Sonnet and Haiku together count as one vote, however different their sizes or capabilities. You buy independence only by crossing a lab boundary: Google, OpenAI, xAI. Each lab is a separate draw from the space of aesthetics.
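
In code, counting labs rather than models amounts to collapsing a panel by lab before anything is averaged. The sketch below is illustrative: the lower-case model keys and the `lab_votes` helper are hypothetical, and averaging within a lab is an assumed way to merge correlated scores.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

# Which lab each judge comes from; the keys are illustrative labels.
LAB_OF: Dict[str, str] = {
    "opus": "Anthropic",
    "sonnet": "Anthropic",
    "haiku": "Anthropic",
    "gemini": "Google",
    "codex": "OpenAI",
    "grok": "xAI",
}


def lab_votes(model_scores: Dict[str, float]) -> Dict[str, float]:
    """Collapse per-model scores into one vote per lab (mean within each lab)."""
    by_lab: Dict[str, List[float]] = defaultdict(list)
    for model, score in model_scores.items():
        by_lab[LAB_OF[model]].append(score)
    return {lab: mean(scores) for lab, scores in by_lab.items()}
```

A panel of Opus, Sonnet and Haiku yields a dictionary with a single entry. Adding one cross-lab judge doubles the number of votes, while adding a fourth Anthropic model leaves it at one.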

That reframing changes the design. The escalation path isn't "more models," it's a graduated ruler that spends independence where it's scarce:

Tier	Judges	Role
Bulk (unlimited)	Opus, then Sonnet	Grade everything. Sonnet de-correlates on generation, catching cheap issues, but it's the same lab so it cannot break a same-lab tie.
Cross-lab (scarce)	Gemini, Codex, Grok	Fire only on the top-of-top ties the Anthropic tier genuinely can't resolve. Each is a true independent vote.

The cross-lab judges sit on quota-limited plans, so you can't run them on every candidate, and you don't need to. The compression is only at the top, and the top is rare. Cost collapses from "grade all thousand" to "break a handful of ties per run," with a hard budget cap so overspend is structurally impossible rather than merely discouraged.
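
Since this tier is designed but not built, the following is only a sketch of how a hard cap could be enforced. Every name in it is hypothetical (`CrossLabBudget`, `top_ties`, `break_ties`, the `epsilon` tie width), and the judges are plain callables standing in for whatever the real runners turn out to be.

```python
from statistics import mean
from typing import Callable, Dict, List


class BudgetExhausted(RuntimeError):
    """Raised instead of making a call the budget cannot cover."""


class CrossLabBudget:
    def __init__(self, max_calls: int) -> None:
        self.remaining = max_calls

    def spend(self, n: int) -> None:
        """Reserve n calls, or raise without touching the balance."""
        if n > self.remaining:
            raise BudgetExhausted(f"need {n} calls, {self.remaining} left")
        self.remaining -= n


def top_ties(bulk_scores: Dict[str, float], epsilon: float = 0.05) -> List[str]:
    """Candidates within epsilon of the best bulk score, if more than one."""
    best = max(bulk_scores.values())
    tied = [name for name, s in bulk_scores.items() if best - s <= epsilon]
    return tied if len(tied) > 1 else []


def break_ties(
    bulk_scores: Dict[str, float],
    judges: Dict[str, Callable[[str], float]],
    budget: CrossLabBudget,
    epsilon: float = 0.05,
) -> Dict[str, float]:
    """Send only the tied top candidates to the cross-lab judges."""
    tied = top_ties(bulk_scores, epsilon)
    # Reserve the whole batch before the first call, so a run can never
    # end up over budget partway through.
    budget.spend(len(tied) * len(judges))
    return {c: mean(judge(c) for judge in judges.values()) for c in tied}
```

Reserving the full batch up front is what makes overspend structurally impossible in this sketch: if the budget cannot cover every tie, no cross-lab call is made at all.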

In the honest tense: the clean pipeline and the anti-coherence operator are built and merged; the run above is real. The graduated cross-lab ruler is designed but not yet built; only the Anthropic runner exists so far. This whole post is about an instrument fooling its owner, so I'm not about to claim unbuilt work as done.

What travels

Strip away prompt evolution and three things survive, for anyone whose fitness function, reviewer, or referee is itself a model:

You cannot grade your best work with a ruler made of yourself. A same-distribution judge sorts bad from good and then goes blind at the top, exactly where excellence lives. The fix is a differently-biased judge, not a more detailed rubric.
To find whether the ceiling is the search or the ruler, change the generator and watch the measurement. If a genuinely different generator can't move the ceiling, the ceiling is the instrument. And when your judge can't distinguish "gold I'm blind to" from "no better," that indistinguishability is itself the finding.
Count labs, not models. Independence is a property of the training lineage. Three models from one lab are one vote. Cross the lab boundary or don't bother.

The project set out to discover what makes Claude resonate. What it taught me first was humbler and more useful: before you can measure a thing, you have to prove your instrument isn't secretly made of the thing it's measuring. Most of the honest work in any evaluation is building the ruler, and a ruler you build out of the material you're weighing will always tell you your favourites are all wonderful, and refuse to say which one is best.

Claude Resonance is an open experiment: github.com/nickmeinhold/claude-resonance. The cross-lab ruler design lives in docs/design/cross-family-evaluator.md and is the next thing to build. If you're doing LLM-as-judge work and haven't checked whether your judge and your subject share a lab, that's the cheapest instrument you can fire today.
