DEV Community

DtoTHEmoon


Why Your AI Agent Keeps Making the Same Mistakes (It's Not the Model)

Does this sound familiar?

Your AI just fixed a bug. Two weeks later, the exact same bug is back.

You deploy something, and you have no idea if it actually worked — so you manually test it.

You've written 100 lines of rules in your config file, but the AI still ignores half of them.

Every new chat session, you re-explain the same context from scratch.

I ran into all four of these problems while building an internal AI quoting system for a healthcare company, with no technical background of my own. After months of debugging, I realized that none of these were model problems. They were Harness problems.


What is Harness Engineering?

Harness Engineering is the discipline of building the scaffolding around your AI — the rules, constraints, verification scripts, and knowledge structures that make it produce consistent, reliable output.

Without Harness, even the best model will drift, forget, and repeat the same mistakes.

The data backs this up: research shows that 80% of Agent quality failures come from Harness gaps, not model limitations. And in one benchmark, all 15 models tested improved significantly when only the Harness changed and the models themselves stayed the same.

The problem is: most people don't know what their Harness is missing. They just know something feels broken.


The framework: two dimensions, not six steps

After studying real production failures and building my own system from scratch, I organized Harness Engineering into two dimensions.

Vertical Quality Layers (Q) — required for every project

| Layer | Name | What it solves |
|-------|------|----------------|
| Q1 | SPEC | AI knows what to build, what not to build, and how to verify |
| Q2 | Rules + Security | Hard business limits + security red lines, equally mandatory |
| Q3 | Skills | Repetitive workflows standardized with counter-examples |
| Q4 | Scripts (unified gate) | Nothing is "done" until scripts pass |

Horizontal Scale Layers (S) — enable only when needed

| Layer | Name | When to enable |
|-------|------|----------------|
| S1 | Context | Sessions losing coherence after ~20 turns |
| S2 | dev-map + Memory | Project iterating 2+ months, AI re-inventing solutions |
| S3 | Multi-Agent | Single agent consistently failing on long task chains |
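
To make "enable only when needed" concrete, here is a minimal sketch that turns the S-layer triggers above into a checklist function. It is only an illustration: `ProjectSignals`, its field names, and `scale_layers_to_enable` are hypothetical, and you still have to judge each signal yourself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProjectSignals:
    # All field names are hypothetical, chosen for this sketch only.
    sessions_lose_coherence: bool   # e.g. answers drift after ~20 turns
    months_iterating: float         # how long the project has been evolving
    ai_reinvents_solutions: bool    # AI rebuilds things that already exist
    long_chains_keep_failing: bool  # one agent repeatedly fails long task chains


def scale_layers_to_enable(signals: ProjectSignals) -> List[str]:
    """Return the S layers whose trigger condition is met."""
    layers = []
    if signals.sessions_lose_coherence:
        layers.append("S1 Context")
    if signals.months_iterating >= 2 and signals.ai_reinvents_solutions:
        layers.append("S2 dev-map + Memory")
    if signals.long_chains_keep_failing:
        layers.append("S3 Multi-Agent")
    return layers
```

A three-week-old project with no symptoms gets an empty list back, which is the point: no pain, no extra layer.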

The key insight: Q4 is not step four. It's the exit gate for every layer. Code changes, doc updates, multi-agent outputs — all must pass Q4 before anything counts as done.

Most people skip Q4 entirely. That's why the same bug keeps coming back.
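
One way to picture a unified gate is a single function that runs every check and only reports "done" when all of them pass. The sketch below is an assumption about how you might wire this up, not how Rein or any particular tool does it; `run_gate` and the check names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[], bool]


def run_gate(checks: Dict[str, Check]) -> Tuple[bool, List[str]]:
    """Run every check and return (done, names_of_failed_checks).

    A check that raises an exception counts as a failure, so a crashing
    script can never be mistaken for a passing one.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return (not failed, failed)


# Hypothetical usage: the work only counts as done when nothing failed.
done, failed = run_gate({
    "service_starts": lambda: True,
    "pricing_baseline": lambda: False,  # stand-in for a real business check
})
```

Running all checks instead of stopping at the first failure is a deliberate choice here: the AI gets the full list of what is still broken in one pass.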


What I built: Rein

Rein is an open-source Skill for Claude Code (and any agent supporting the SKILL.md standard) that acts as a silent Harness Engineering advisor throughout your project.

It watches your conversations for patterns — not keywords — and speaks up only when it detects a real gap. When everything's fine, it stays silent. Silence is a feature.

What it detects automatically:

  • Repeated failures (same bug fixed twice → missing Rule or regression test)
  • Context loss (re-explaining background every session → incomplete project docs)
  • Scale shifts (internal tool going external → time to harden your Harness)
  • Cost spikes (API bill climbing → identifies token waste sources)
  • Over-engineering (more config, slower shipping → tells you what to delete)

Test results: 97% pass rate across 16 scenarios with Rein vs 52% without.

The biggest gap was in root cause diagnosis: 92% accuracy with Rein, 24% without.


A real example from my project

My verify.sh only checked if the service started. It didn't check if the business logic was correct.

So when the AI "fixed" a pricing calculation bug, it passed my verification — service was running — but the actual calculation was still wrong. Same bug, two weeks later.

After adding a business baseline check (call a known correct quote request, compare against expected output), that class of bug disappeared entirely.

This is Q4. Not just "is the service alive?" but "is the output actually correct?"
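
Here is a rough sketch of what such a business baseline check can look like. It is not my actual verify script: `check_baseline`, the `quote_fn` callable, and the request fields (`unit_price`, `quantity`, `total`) are all hypothetical stand-ins for whatever your real quote endpoint accepts and returns.

```python
from typing import Any, Callable, Dict, List


def check_baseline(
    quote_fn: Callable[[Dict[str, Any]], Dict[str, float]],
    request: Dict[str, Any],
    expected: Dict[str, float],
    tolerance: float = 0.01,
) -> List[str]:
    """Send a known request and compare each expected field to the answer.

    Returns a list of mismatch messages; an empty list means the check passed.
    """
    actual = quote_fn(request)
    mismatches = []
    for field, want in expected.items():
        got = actual.get(field)
        if got is None or abs(got - want) > tolerance:
            mismatches.append(f"{field}: expected {want}, got {got}")
    return mismatches


# Hypothetical quote functions: one correct, one that ignores quantity.
def good_quote(req: Dict[str, Any]) -> Dict[str, float]:
    return {"total": req["unit_price"] * req["quantity"]}


def buggy_quote(req: Dict[str, Any]) -> Dict[str, float]:
    return {"total": req["unit_price"]}


known_request = {"unit_price": 12.5, "quantity": 4}
known_answer = {"total": 50.0}
```

Both quote functions "start" fine, so a liveness check passes either way. Only the baseline comparison tells them apart.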


Install

git clone https://github.com/DtoTHEmoon/rein-skill.git ~/.claude/skills/rein

Restart your agent. Rein activates automatically — no commands needed.

Also works with: OpenClaw, Codex CLI, Gemini CLI, Cursor, and any agent supporting SKILL.md.


The core philosophy

Start minimal. Add only when you have a real pain point. And know when to subtract — Rein will tell you when your Harness is getting in your own way.

If your scaffolding is slowing you down, it's time to cut.

GitHub: github.com/DtoTHEmoon/rein-skill
