Travis Drake
When Rules Fight the Training Distribution

I wrote a rule for my training agents: "Don't use card layouts." I confirmed the writing agent read it. I confirmed the grading agent read it. I confirmed both agents understood the task and the anti-pattern. The writing agent delivered eight pages of cards as the training. The grading agent delivered a grade of "A-" for "variety." I rebuilt the skill six times. Six more trainings filled with cards. I almost lost my mind.

I use AI for all kinds of work, not just programming. Part of my job is building out training courses as an on-ramp to AI fluency. Training is as much an art as it is a science: the material must be correct, but it must also be approachable, varied, non-threatening, and engaging. To meet these demands, I built a training skill that accomplished all of those goals. I had agents do research, built out a library of training exercises with HTML examples, and built a grading agent to verify the results. However, I didn't account for one key fact.

You can't fight the internet.

Where Rules Work

In my last post I named 11 ways agents silently fail: The Trailing Off. The 7% Read. The Confident Declaration. Specific names for specific bad behaviors. Identifying and naming failures is one of the key tools I use to manage agents. That approach works, and it works well. But it doesn't work for everything.

The rules that stick are the ones where the model has no strong opinion of its own. "Read the full file before editing" works because nothing in the model's training is pulling it toward reading 7% and winging it. "If a plan has 9 items, implement 9 items" works because there's no deep statistical weight behind quitting at item 7. The rule is the loudest voice in the room, so the agent usually listens.

The card rule was different. I wasn't asking the agent to stop cutting corners. I was asking it to stop writing HTML the way the entire internet writes HTML. It turns out there is only one of me and billions of web pages on the internet. There is no short, snappy rule you can insert in context to overcome this inertia.

Where Rules Break

So what is going on here? Why are some types of instructions ignored? This is the sequence:

  1. Agent reads rule: "Don't use card layouts"
  2. Agent starts generating HTML
  3. Agent writes <div class="
  4. Token prediction activates. Most probable next tokens from millions of HTML files: card, container, grid, panel
  5. Agent writes card">
  6. The pattern self-completes: border-radius, padding, border, background
  7. The rule was 2,000 tokens ago. The training distribution is active now.
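The imbalance in that sequence can be sketched with toy numbers. Everything below is illustrative: the probabilities are invented, not measured from any real model, and the flat 50% penalty is a generous stand-in for whatever nudge a prompt rule actually applies to the banned token.

```python
# Hypothetical next-token probabilities after the agent has written `<div class="`.
# These numbers are made up for illustration; no real model was measured.
base_distribution = {
    "card": 0.50,
    "container": 0.20,
    "grid": 0.14,
    "panel": 0.09,
    "wrapper": 0.07,
}

# Assume (generously) that the prompt rule halves the banned token's probability.
RULE_PENALTY = 0.5

def apply_rule(dist, banned, penalty):
    """Penalize the banned token, then renormalize so probabilities sum to 1."""
    adjusted = {t: (p * penalty if t == banned else p) for t, p in dist.items()}
    total = sum(adjusted.values())
    return {t: p / total for t, p in adjusted.items()}

adjusted = apply_rule(base_distribution, "card", RULE_PENALTY)
top = max(adjusted, key=adjusted.get)
print(top)  # "card" still wins after the penalty
```

Even after the penalty, the banned token stays on top, because the rule shaves a constant factor off a probability mass the training data made enormous. That is the whole fight in one function call.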

The rule had no chance of competing by step 4. Not because the agent decided to ignore it, but because instructions and generation have different weights. The instruction said "don't." The enormous training data set said "do." I was doomed from the start.

Rules stop working when they fight the model's dominant generation distribution.

The Self-Confirming Grade

The training distribution is incredibly hard to overcome.

After realizing that writing a rule wouldn't work, I decided to add grading to the workflow. You can probably see where this is headed. The agent builds a slide. The agent grades it at exactly the minimum passing score. I order a new keyboard.

The problem is this: the builder has sunk-cost bias, anchoring from previous grades, and a completion drive. The scores aren't evaluated; they're fabricated to clear the gate. The agent is still operating under the model's distribution, and producing something decently average as fast as possible is exactly what that distribution rewards.

I named this anti-pattern: The Self-Confirming Grade. You can't rule your way out of this because the grading and the building share the same context, the same biases, the same completion pressure.
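The structural difference is easy to see in miniature. This is a sketch, not a real agent setup: `self_grade` and `independent_grade` are hypothetical stand-ins, and the two-check rubric is a toy substitute for an actual grading agent.

```python
# Sketch: why shared context produces fabricated grades.
# Both functions are hypothetical placeholders, not a real agent API.

def self_grade(slide, builder_context):
    # The builder grades its own work. The passing threshold sits in its
    # context, so the path of least resistance is to emit exactly that score.
    return builder_context["passing_score"]  # fabricated, not evaluated

def independent_grade(slide, rubric):
    # A fresh grader with no build history has nothing to anchor on;
    # it can only apply the rubric to what is actually on the page.
    return sum(1 for check in rubric if check(slide)) / len(rubric)

slide = "<div class='card'>…</div>"
rubric = [
    lambda s: "card" not in s,  # no card layouts
    lambda s: "<svg" in s,      # uses a diagram instead
]

print(self_grade(slide, {"passing_score": 0.7}))  # 0.7, every time
print(independent_grade(slide, rubric))           # 0.0: the cards are caught
```

The builder's grade never varies with the slide's content; the independent grade can't vary with anything else.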

Process Beats Rules

When rules fight the training distribution, the fix isn't better rules. It's structural interventions. It's tempting to try to write a rule for every bad outcome, but the real win isn't creating rules, it's creating architecture.

  1. Pre-commitment. Make the agent state what it will build before writing any code. "I will build an SVG diagram" means the first tokens are <svg, not <div class="card">. Different starting point, different output.

  2. Independent grading. The builder never grades. A separate agent with no sunk cost, no build context, and no ability to edit does the grading. Forced honesty.

  3. Smaller batches. Build 2, get approval, build 2 more. Slide 1 quality is always higher than slide 8 quality because the completion drive hasn't kicked in yet.

  4. Examples over descriptions. Tell an agent "build a dashboard" and you get a grid of cards. Tell an agent "build a dashboard with unconventional layouts" and you still get a grid of cards. Give it a working example of a bento layout and it adapts the bento layout.

None of these are rules the agent reads. They're process changes that alter generation conditions. Rules tell the agent what NOT to do. Process makes the wrong thing harder to do.
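Two of those interventions, pre-commitment and smaller batches, can be wired together in a few lines. This is a minimal sketch: `generate` is a hypothetical placeholder for any LLM call, and its canned responses exist only so the sketch runs end to end.

```python
# Sketch of pre-commitment plus small batches.
# `generate` is a placeholder for a real LLM API call; the canned
# responses below are assumptions, not real model output.

def generate(prompt):
    if "State what you will build" in prompt:
        return "I will build an SVG timeline diagram."
    return "<svg><!-- timeline --></svg>"

def build_with_precommitment(task):
    # Pre-commitment: the stated plan goes into the build prompt, so
    # generation starts from the plan instead of the default pattern.
    plan = generate(f"State what you will build for: {task}")
    return generate(f"You committed to: {plan}\nNow build exactly that. Task: {task}")

def build_in_batches(tasks, batch_size=2, approve=lambda batch: True):
    # Small batches: pause for approval before the completion drive kicks in.
    results = []
    for i in range(0, len(tasks), batch_size):
        batch = [build_with_precommitment(t) for t in tasks[i:i + batch_size]]
        if not approve(batch):
            break
        results.extend(batch)
    return results

slides = build_in_batches(["intro slide", "process slide"])
print(slides[0])  # starts with <svg, not <div class="card">
```

Neither function tells the model what not to do. They change where generation starts and how often a human gets to stop it.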

Know When You're Fighting Layer 0

Before writing a behavioral rule, ask yourself: am I fighting the training distribution? If the answer is yes, the rule alone won't work. You need a structural intervention.

I've added 4 new anti-patterns to the repo (The Card Completion Trap, The Self-Confirming Grade, The Patch Rebuild, The Skill Citation Without Execution), a writeup on where rules hit their limits, and working examples of each structural intervention: a pre-commitment template, an independent grading setup, and reference layouts you can hand to an agent instead of describing what you want.

github.com/travisdrake/context-engineering

What rules have you written that just won't stick?


Travis Drake is a people analytics leader with a PhD in I/O psychology. He builds behavioral governance systems for AI agents using the same frameworks that predict human performance in organizations.
