TL;DR
- AI agents tend to overestimate the quality of their own outputs when there is no external verification criterion. In subjective tasks (design, writing, UX, naming, strategy), simply asking the model to "reflect" is not enough: it often remains trapped in the same trajectory that produced the first plausible solution, leading to weak critiques and superficial improvements.
- Achieving real quality requires designing the runtime around the model: tests, rubrics, separate evaluators, external tools, and generator-evaluator loops that introduce critical distance between the system that produces the output and the one that approves it.
Why Internal Feedback Is Not Enough in Subjective Tasks
Sometimes, when you ask a model to evaluate a response it previously generated, it will rate it as good even when it clearly is not. Buggy code gets labeled "production-ready"; a generic layout is described as "modern and coherent"; a technically correct but flat piece of writing is called "clear, incisive, and well-structured."
This behavior becomes especially evident when the task lacks binary verification.
If the agent has to write a function and there is a reliable test suite available, the system has access to an external oracle: the test either passes or fails.
But as soon as the task moves into domains such as design, writing, naming, UX, strategy, or product architecture, quality can no longer be reduced to an assert.
This is where the self-evaluation problem in AI agents emerges: the same system that produces the output struggles to judge it with enough critical distance. And, if we think about it, humans often behave the same way.
The point, however, is not that LLMs have ego, self-esteem, or a desire to appear competent. Saying that agents "self-promote" is a useful shortcut, but technically inaccurate. A model is not trying to convince us that its output is good. More often, it simply remains inside the same probabilistic trajectory that generated the artifact in the first place.
If we ask:
Generate a landing page for a SaaS product.
And immediately afterward:
Evaluate the quality of the landing page.
we have not really created two distinct processes. We have asked the same model to continue reasoning within the same semantic space, with the same context, assumptions, and implicit orientation toward completing the task.
The result is often an evaluation that is overly generous, poorly discriminative, and not very useful for driving meaningful improvement.
Tasks With an Oracle and Tasks Without One
We can distinguish between two classes of tasks.
The first includes tasks with an external oracle. These are cases where quality can be verified relatively objectively: an automated test, a query returning an expected result, a formal constraint, a compiler, syntactic validation, or a numerical measurement.
In software engineering, many tasks fall at least partially into this category. Code can be evaluated through unit tests, integration tests, type checkers, linters, benchmarks, and static analysis. These tools do not fully capture software quality, but they provide strong signals. If an agent produces code that does not compile or breaks a test suite, the system does not need to "guess" that something is wrong: it knows.
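As a minimal sketch of what such an oracle looks like inside a harness (assuming a pytest-based project; the helper name is mine, not a standard API):

```python
import subprocess

def run_test_oracle(test_path: str = "tests/") -> tuple[bool, str]:
    """Run the project's test suite and return (passed, combined output).

    The exit code is the external oracle: the runtime observes whether
    the tests pass instead of asking the agent to judge its own patch.
    """
    result = subprocess.run(
        ["pytest", test_path, "-q"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr

if __name__ == "__main__":
    passed, report = run_test_oracle()
    print("PASS" if passed else "FAIL")
    print(report)
```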
The second class includes tasks without a clear oracle. Here, quality is subjective, multidimensional, or context-dependent. A UI can be technically correct but visually uninspired. A text can be grammatically flawless but lack a real thesis. A strategy can be well formatted yet impossible to execute. A naming proposal can be understandable but forgettable.
In these cases, the problem is not just verifying whether the output is correct, but determining whether it is actually good.
And unfortunately, "good" does not mean one single, clearly identifiable thing.
In design, it may mean visual coherence, originality, hierarchy, usability, or identity; in writing, clarity, density, rhythm, argumentative strength, or voice; in strategy, accurate diagnosis, explicit trade-offs, feasibility, specificity, and contextual alignment.
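Those dimensions only become actionable if they are made explicit rather than folded into a single "is this good?" question. A minimal sketch of what an explicit rubric might look like for writing (the dimensions and weights here are illustrative, not canonical):

```python
from dataclasses import dataclass

@dataclass
class RubricDimension:
    name: str
    description: str
    weight: float  # relative importance; weights sum to 1.0 across the rubric

# Illustrative rubric for evaluating a piece of writing.
WRITING_RUBRIC = [
    RubricDimension("clarity", "Can a first-time reader follow the argument?", 0.25),
    RubricDimension("density", "Does every paragraph add information, or is there filler?", 0.25),
    RubricDimension("thesis", "Is there a real, contestable claim being defended?", 0.30),
    RubricDimension("voice", "Does the text sound like a person rather than a template?", 0.20),
]

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (0-10) into one weighted score."""
    return sum(d.weight * scores[d.name] for d in WRITING_RUBRIC)
```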
When an external oracle is missing, the agent tends to rely on its own linguistic evaluation. And that is exactly where the system becomes fragile.
The Failure Mode: Premature Convergence
The most common failure mode is not catastrophic error, but premature convergence.
The agent produces a plausible solution, refines it superficially, and declares it sufficient. The result is not necessarily wrong. Often, it is worse: mediocre but defensible.
This "plausible mediocrity" is difficult to detect because it contains many superficial signals of quality.
An AI-generated landing page will probably include a hero section, a CTA, a feature grid, a few cards, a responsive layout, a pleasant color palette, and tidy copy. A strategy document will include neatly titled sections, bullet points, frameworks, and recommendations. A refactoring will contain cleaner names and a few extra abstractions.
But all of this can still remain generic.
The agent tends to improve what it has already produced instead of questioning whether the direction itself is correct. It polishes the first solution instead of challenging it. It adds local coherence, not global quality.
This is where self-evaluation fails: not because the model cannot recognize errors at all, but because its critique is rarely strong enough to break away from the first acceptable solution.
The Limits of Reflective Prompting
One of the earliest responses to this problem was reflective prompting: asking the model to critique its own output, identify issues, propose improvements, and iterate.
This approach works to some extent because it can eliminate obvious errors, improve clarity, fix inconsistencies, and add missing details. However, its main limitation is that the critique remains inside the same process that generated the output.
Prompts such as:
Reflect on your work and improve it.
or:
Identify any problems in the previous response.
often produce generic feedback:
- "It could be more specific";
- "Clarity could be improved";
- "I would add more concrete examples";
- "The structure is already solid, but it could be refined."
These observations are true, but weak, and they rarely lead to a substantial change in direction.
For simple tasks this may be enough, but for high-value tasks it often is not.
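Part of the reason the critique stays so shallow is visible in what a reflective loop typically is under the hood: one conversation, one model, one trajectory. A sketch of that structure (the `call_model` helper is a placeholder for whichever chat-completion API you use):

```python
def call_model(messages: list[dict]) -> str:
    """Placeholder for a chat-completion call to whichever model you use."""
    raise NotImplementedError

def reflective_loop(task: str, rounds: int = 2) -> str:
    # One conversation: generation and critique share the same context.
    messages = [{"role": "user", "content": task}]
    draft = call_model(messages)
    messages.append({"role": "assistant", "content": draft})

    for _ in range(rounds):
        # The "critic" is the same model continuing the same trajectory,
        # primed by everything it has already committed to above.
        messages.append({
            "role": "user",
            "content": "Identify any problems in the previous response and improve it.",
        })
        draft = call_model(messages)
        messages.append({"role": "assistant", "content": draft})

    return draft
```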
Why Runtime Matters
This problem contributed to the rise of harness engineering: designing the runtime around the model.
As I described in previous articles about harnesses, the core idea is that the performance of an agentic system depends not only on the model itself, but also on the operational environment in which the model works. What matters is how the prompt is constructed, which tools are available, how context is managed, how intermediate states are stored, how tests are executed, how feedback is orchestrated, when the system decides to iterate, and when it decides to stop.
In modern agentic systems, the model is just one component. Final behavior emerges from the interaction between the model, tools, memory, context, schedulers, evaluators, acceptance criteria, and retry mechanisms.
This shift in perspective is fundamental. If the model struggles to evaluate itself, the solution is not necessarily to wait for a better model. It is to design a runtime that makes the evaluation process less fragile.
In coding, this may mean running tests, reading errors, applying patches, and retrying. In design, it may mean generating screenshots, navigating the interface, and verifying interactive states. In writing, it may mean using editorial rubrics, comparing versions, and evaluating density and redundancy. In strategy, it may mean making assumptions explicit and testing alternative scenarios.
The runtime introduces signals that the model alone does not reliably produce.
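Taking the design case as an example, that signal can be as concrete as rendering the page and probing it, so a separate evaluator judges evidence rather than the generator's description of its own layout. A sketch using Playwright (the URL and the CTA selector are placeholders; error handling is omitted):

```python
from playwright.sync_api import sync_playwright

def capture_ui_evidence(url: str, screenshot_path: str = "landing.png") -> dict:
    """Render the page, take a full-page screenshot, and probe one interactive state."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(url)
        page.screenshot(path=screenshot_path, full_page=True)

        # Placeholder selector: is the primary CTA actually rendered and visible?
        cta_visible = page.is_visible("text=Get started")

        browser.close()
    # The screenshot and the observed state become inputs for a separate evaluator.
    return {"screenshot": screenshot_path, "cta_visible": cta_visible}
```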
Critical Distance as an Architectural Requirement
The self-evaluation problem can be summarized like this: generation and evaluation are too close to each other.
Critical distance can be introduced in many ways. Sometimes changing the prompt, role, or critique format is enough; other times it requires a different model, a different temperature, a stricter rubric, few-shot examples, external tools, or a separate agent.
The principle remains the same: the system must create a separation between the entity that produces and the entity that approves.
This separation does not guarantee perfect evaluation, but it reduces the risk that the agent settles for the first plausible solution.
This naturally leads to the generator-evaluator pattern: one agent produces, another evaluates, feedback returns to the first, and the cycle continues until the output surpasses a threshold.
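A minimal sketch of that cycle (the two `call_*` functions stand in for separately prompted, and ideally separately configured, model calls; the threshold and round limit are arbitrary):

```python
def call_generator(task: str, feedback: str | None = None) -> str:
    """Placeholder: produce or revise the artifact, optionally using critique."""
    raise NotImplementedError

def call_evaluator(task: str, artifact: str) -> tuple[float, str]:
    """Placeholder: score the artifact against a rubric and return (score, critique)."""
    raise NotImplementedError

def generator_evaluator_loop(task: str, threshold: float = 8.0, max_rounds: int = 4) -> str:
    feedback = None
    artifact = ""
    for _ in range(max_rounds):
        artifact = call_generator(task, feedback)
        score, critique = call_evaluator(task, artifact)
        if score >= threshold:
            return artifact   # the evaluator, not the generator, decides when to stop
        feedback = critique   # the critique drives the next generation
    return artifact           # give up after max_rounds to avoid an infinite loop
```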
It is not always necessary: for simple tasks it can become overengineering.
But for subjective, long, or high-value tasks, it becomes one of the most useful patterns in agent engineering.