Why I Stopped Letting Claude Grade Its Own Work

Aaron C — Wed, 27 May 2026 02:25:58 +0000

Originally published at aaroncasey.bearblog.dev

The model that built your app will defend your app. That's not a flaw — it's just context doing what context does. The fix isn't to eliminate the bias. It's to make it auditable.

Claude was great at helping me build the app, but it had a tendency to reinforce its own decisions. That made it a strong builder and a weak judge. So I moved the testing into local AI personas that had no access to the build conversation.

The Trap
If you let the model that built your app also evaluate your app, you're grading your own homework. The grader already agrees with you. It helped you decide. It will defend the decision.

I learned this the direct way. Early on, I had Claude running the personas itself. The runs were real — Claude would actually execute them and produce output. That part worked fine. The problem showed up later, at design pivot points. When I wanted to change direction on a feature, Claude would invoke a persona's voice — "the veteran would push back on this," "the beginner wouldn't understand that" — and argue against the pivot without ever running the persona against the proposed change. The personas had become rhetorical ammunition. They were getting cited from memory to defend the existing path, not consulted fresh to evaluate the new one.

That's the trap. Not that the personas were fake. That they got conscripted as witnesses for whatever direction the build conversation was already leaning.

So I moved them out. Local models, on my own machine, with no access to the build conversation. The personas only know what they are and what they see. They can't be quoted from memory in arguments they were never part of.

The Problem I Was Solving
I'm building a desktop app for a specific kind of solo operator. The audience is wider than it looks — within it are people at very different stages, with very different habits, who think about their work in very different language. A veteran running a healthy business has nothing in common, workflow-wise, with a first-year operator still Googling basic terms.

I needed to know how the app would behave across that range before any real user touched it.

Why I Used Claude to Build
The build side is straightforward. I describe what I want, get a working draft, iterate. The architectural conversations are useful. The code is clean enough that I can read it, modify it, and own it.

One concrete example: I went back and forth with Claude on whether to use SQLite directly or layer an ORM on top for a PyQt5 desktop app with offline-first requirements. The conversation surfaced trade-offs I hadn't considered — migration headaches, query visibility during debugging, the way an ORM can obscure what's actually happening when you're chasing a weird state bug at 1am. We landed on raw SQLite with a thin wrapper. That decision has held up. Cloud-scale reasoning earns its keep on calls like that.

But the same model that helped me make that call is not the model I want telling me whether the resulting UI makes sense to a stranger.

Why the Testers Run Locally
I set up twenty-three personas as local AI characters. Mistral Nemo 12B is the workhorse. A few reasons drove that.

Independence from the build model. Covered above — it's the whole reason the architecture exists.

Privacy. The persona definitions and test transcripts include sample data that mirrors real users. Keeping all of it on my own machine means I never have to think about what's leaving the building.

Speed at volume. Once the model is loaded, I can run it as many times as I want. No rate limits, no round-trips. Around fifty runs per batch, overnight, while I sleep.

Cost control. No per-token meter. The hardware is paid for. The marginal cost of another test run is electricity.

Repeatability. Local models with fixed settings give me reproducible runs. When a persona flags a problem, I can rerun the same persona against the same build and confirm the failure before I touch anything.

I tried other models. Most drifted out of character within a few exchanges — a part-timer persona would start sounding like an expert, or a beginner would suddenly know terms they shouldn't. Nemo 12B held. It also produced the most human-sounding responses of anything I tested at that size. The hardware is an RTX 2070 Super, which is not fast, but it doesn't need to be fast if it's running overnight.

How the Workflow Works
Each persona is a detailed character file. Not a one-line system prompt — a full profile: who they are, where they work, how they got there, how they talk, what they care about, what they ignore, what they'd never say out loud. One of them is a version of me, as a baseline. The rest are not me, and that's the point.

A Python script orchestrates the runs. Claude wrote the orchestration code. Then I denied Claude access to the personas themselves. It knows the harness exists. It does not see the persona files, the system prompts, or the raw transcripts. That separation is the architecture.

When I push a new build, the script feeds each persona a session with the current version of the app. The personas have memory of previous test versions, so they can compare — they'll tell me when something improved, and they'll tear me up when something got worse.

By morning I have a large stack of feedback. I read through it and pull out what's real: confusion, missing features, broken assumptions, and places where the language of the app doesn't match how a working person actually talks about their work. Then I hand the consolidated reviews to Claude for synthesis and prioritization.

That last step matters. Claude reads the reviews. Claude does not get to be the reviewer. The reviews enter Claude's context as evidence from outside, not as something Claude generated.

I want to be honest about what this does and doesn't do. The separation doesn't eliminate the bias — it makes it auditable. Claude can still drift toward defending earlier decisions after reading the reviews. But now there's a clear paper trail I can check against, instead of a vague memory of what the personas "said." If Claude cites a review at a pivot point, I can pull up the actual review and verify it. Before, the citation and the source were both Claude. Now they're separate artifacts. That's the real win — not that the bias is gone, but that I can catch it when it shows up.

What This Approach Is Good For
This workflow surfaces UX confusion, jargon mismatches, missing features for adjacent user types, and assumptions you didn't realize you were making. It scales down well — a consumer GPU, a Python script, and patience is the whole stack.

It's especially good for catching the moment when your app's vocabulary stops matching your users' vocabulary. Real professionals have their own shorthand, their own way of naming things, their own pet peeves about being talked down to. If your interface doesn't reflect how these people actually talk, the personas will tell you. Loudly.

What It Is Not Good For
Local personas are not real users. They don't carry real stakes, real money, or real frustration. They won't tell you that your pricing is wrong, or that the install was too scary, or that they bounced before they got to the part you were proud of.

Even with character memory and detailed profiles, personas drift sometimes. You have to spot-check. Treat the reviews as a strong first filter, not as ground truth.

Human testing still matters. The personas catch the obvious failures so that the limited human attention I can get is spent on the subtle ones.

The Payoff
Better feedback, because it comes from outside the build conversation. An auditable bias trail, instead of a vague one. Less cost, because the testing layer runs on hardware I already own. More privacy, because nothing leaves the machine.

It's not a fancy setup, and it's not a fix-all. It's a desktop, a script, a stack of character files, a willingness to let the machine work overnight, and the discipline to actually check Claude's citations when it leans on the reviews. For a solo builder trying to ship something honest without a team behind them, it's the most valuable piece of process I've added all year.

The model that built your app will defend your app. Put the testing somewhere it can't reach — and keep the receipts.

DEV Community: Aaron C

Why I Stopped Letting Claude Grade Its Own Work