VictorVVedtion
Why Your AI Agent Can't Check Its Own Work (and How to Fix It)

Last week I ran into a problem that changed how I think about AI agent reliability.

I was using Opus 4.6 to generate 107 SFT (supervised fine-tuning) training samples for a crypto trading analysis model. The agent then evaluated its own output:

  • Pass rate: 100%
  • Quality score: 1.000 (perfect) for all 107 samples
  • Assessment: "Extremely high quality data"

Something felt off. 100% pass rate? Every single sample scored perfectly? So I did something simple: I spawned an independent subagent to review the same data. Same model (Opus 4.6). Fresh context. No shared history.

The independent reviewer scored the data 5.5/10 and found 5 critical flaws:

  1. 45.8% of confidence values were stuck at 0.62 (standard deviation = 0). Not a distribution. A constant.
  2. Bias labels were position-bound. Challenge 1 was labeled confirmation_bias 48% of the time, far above chance. The model had learned "first challenge = confirmation bias" as a shortcut.
  3. loss_aversion was completely absent. In a trading dataset with liquidation cascades, the most relevant cognitive bias never appeared. Zero times.
  4. The quality scorer was non-discriminating. 107 samples, all scored 1.000. The scorer was rubber-stamping everything.
  5. Synthesis always fully flipped. When challenged, the model always did a complete 180. No partial corrections.
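Checks like these are cheap to run outside the agent entirely. A minimal sketch, using only Python's stdlib, that would have flagged flaws 1, 3, and 4 mechanically (flaw 2 would additionally need per-position label counts; the sample keys here are hypothetical, not tcell's actual schema):

```python
from collections import Counter
from statistics import pstdev

def audit(samples):
    """Run cheap statistical checks over generated training samples.
    `samples` is a list of dicts; the keys below are hypothetical."""
    flags = []

    # Flaw 1: "confidence" collapsed to a constant (std dev ~ 0)
    confs = [s["confidence"] for s in samples]
    if pstdev(confs) < 1e-9:
        flags.append(f"confidence stuck at {confs[0]} (std=0)")

    # Flaw 3: an expected label never appears
    labels = Counter(s["bias_label"] for s in samples)
    if labels["loss_aversion"] == 0:
        flags.append("loss_aversion never appears")

    # Flaw 4: quality scorer gives every sample the same score
    if len({s["quality_score"] for s in samples}) == 1:
        flags.append("quality scorer is non-discriminating")

    return flags

# The degenerate dataset from the post: every check fires
bad = [{"confidence": 0.62, "bias_label": "confirmation_bias",
        "quality_score": 1.0} for _ in range(107)]
```

None of this needs an LLM. The point is that a self-evaluating agent never thought to run it.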

The Key Insight: Context Isolation > Model Diversity

Here's what's interesting: both reviews used the exact same model. The problem wasn't that Opus 4.6 isn't smart enough. The problem is structural.

When an agent generates output AND evaluates it within the same reasoning context, it develops blind spots. It can't see its own patterns because those patterns are baked into its current context window. It's the same reason you can't proofread your own essay right after writing it.

A fresh context — even with the same model — breaks those blind spots. This is why pair programming works. Your partner doesn't need to be smarter than you. They just need different context.
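The pattern itself is tiny. A sketch of the generate/review split, assuming some `call_model(system, prompt)` wrapper around your LLM client (the function name and prompts are placeholders, not a real API):

```python
def generate_and_review(call_model, task):
    """Two calls, zero shared history: the reviewer sees only
    the finished artifact, never the generator's reasoning context."""
    output = call_model(
        system="You are a data generator.",
        prompt=task,
    )
    review = call_model(  # fresh context: no messages carried over
        system="You are an independent reviewer. Score 0-10 and list flaws.",
        prompt=f"Review this output:\n{output}",
    )
    return output, review
```

The only design decision that matters is what the reviewer *doesn't* get: no conversation history, no generation prompt, no self-assessment to anchor on.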

Building tcell: A Cognitive Immune System

I wanted to automate this pattern. Not just "run a review once" but build a system that gets better at catching blind spots over time.

The result is tcell — named after T-cells in your immune system. Silent when healthy, precise when threats appear, and constantly evolving.

The architecture is directly inspired by Karpathy's autoresearch:

| autoresearch | tcell          | Role                                                   |
| ------------ | -------------- | ------------------------------------------------------ |
| prepare.py   | prepare.py     | Fixed infrastructure (can't be modified by the system) |
| train.py     | critics/*.md   | Evolvable strategies (the system modifies these)       |
| program.md   | program.md     | Human meta-instructions (only humans modify this)      |
| val_bpb      | detection_rate | Single metric to optimize                              |

How the Evolution Loop Works

```
select → mutate → replay → keep/discard → record
```
  1. Select the critic that hasn't evolved the longest
  2. Mutate one dimension of its detection strategy
  3. Replay against known blind spots ("canaries") using majority vote (3 runs)
  4. Keep if detection rate improves >= 5% AND false positive rate stays <= 10%
  5. Discard otherwise (git reset)
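The steps above can be sketched in a few lines. This is a simplification: in tcell the critics are markdown strategy files and the discard is a `git reset`, whereas here critics are plain callables so the loop is testable in isolation:

```python
def detection_rate(critic, samples, runs=3):
    """Majority vote over `runs` replays per sample (step 3)."""
    caught = sum(
        sum(critic(s) for _ in range(runs)) * 2 > runs
        for s in samples
    )
    return caught / len(samples)

def evolve_once(critic, mutate, canaries, clean_samples):
    """One select -> mutate -> replay -> keep/discard iteration."""
    base = detection_rate(critic, canaries)
    candidate = mutate(critic)                     # step 2: mutate one dimension
    det = detection_rate(candidate, canaries)      # step 3: replay on known blind spots
    fp = detection_rate(candidate, clean_samples)  # alerts on clean data = false positives
    if det >= base + 0.05 and fp <= 0.10:          # step 4: keep on clear improvement
        return candidate
    return critic                                  # step 5: discard
```

The asymmetric thresholds are the point: a mutation has to clearly improve detection *and* stay quiet on clean data, so the system can't evolve toward alarmism.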

After 5 evolution iterations, the overconfidence critic reached 80% detection rate on canaries with 0% false positives.

The 7 Iron Rules

  1. Never trust self-assessment. "100% pass" is a claim, not evidence.
  2. Silence is proof of trust. If 50% of your alerts are false positives, you're a dog that barks at nothing.
  3. Only speak with numbers. "45.8% of values at 0.62, std=0" — not "might be a problem."
  4. Always use fresh eyes. Every review = new subagent, no shared history.
  5. Audit thinking patterns, not just output. Catch the bias, not just the bug.
  6. Evolve, don't ossify. Static reviewers give false security.
  7. Noise budget is sacred. Max 1 alert per 10 tool calls.

Try It

```shell
git clone https://github.com/VictorVVedtion/tcell
cd tcell
python3 prepare.py self-test
python3 evolve.py leaderboard
python3 prepare.py session-score
```

Currently in cold start mode (8 confirmed blind spots, need 20 for autonomous evolution). MIT licensed, zero dependencies.

Every "canary" (confirmed blind spot) you contribute makes the whole immune system smarter.

Has anyone else run into the "agent grades its own homework" problem? What approaches have you tried?


tcell is open source: github.com/VictorVVedtion/tcell
