eleonorarocchi

Posted on May 22

Generator-Evaluator Loops for AI Agents

#ai #agents #llm

TL;DR

Separating the generator from the evaluator improves quality and reduces premature self-validation.
The loop works best when feedback is explicit and based on clear rubrics, especially for subjective or complex tasks.
It is useful when the task has high value; for simple or easily testable tasks, it can become overengineering.

How to Separate Production and Evaluation in Tasks Without Ground Truth

If self-evaluation is a recurring failure mode in AI agents, the most natural solution is to separate the role of producing from the role of evaluating.

This is the principle behind generator-evaluator loops: one agent produces an output, another evaluates it according to explicit criteria, feedback is sent back to the generator, and the system iterates until the result reaches an acceptable threshold.

The pattern is simple to describe but powerful, especially in domains without a stable ground truth: design, writing, UX, naming, strategy, documentation, software architecture, and complex refactoring.

In these cases, however, it is not enough to ask the model to "do better." You need to build an environment where improvement is driven by separate, structured, and repeatable critique.

From a Single Agent to an Agentic System

A single agent tends to merge three different roles: it plans the work, produces the output, and finally evaluates the result.

This fusion is convenient but fragile. While it may work for short tasks, in longer ones it often leads to premature convergence, because the same agent chooses a direction, develops it, then positively self-evaluates and closes the task.

To create a more robust system, these roles should be separated into at least three distinct agents:

the planner, which breaks the request into manageable parts and defines priorities, constraints, and work sequence;
the generator, which produces the artifact (code, an interface, a document, a proposal, a strategy, and so on);
the evaluator, which judges the output without having participated in its generation.
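As a minimal sketch of this separation, the three roles can be written as independent interfaces. All names here (`Planner`, `Generator`, `Evaluator`, `Verdict`, and the toy classes) are hypothetical and exist only for illustration; in a real system each method would wrap a model call. The point is that the evaluator's signature only accepts the subtask and the artifact, so it cannot see how the generator got there.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class Verdict:
    score: float   # 0.0 to 1.0, higher is better
    feedback: str  # critique handed back to the generator


class Planner(Protocol):
    def plan(self, request: str) -> list[str]: ...


class Generator(Protocol):
    def generate(self, subtask: str, feedback: Optional[str] = None) -> str: ...


class Evaluator(Protocol):
    # Sees only the subtask and the artifact, never the generator's reasoning.
    def evaluate(self, subtask: str, artifact: str) -> Verdict: ...


# Toy stand-ins (assumptions, not real agents) that satisfy the interfaces.
class SplitPlanner:
    def plan(self, request: str) -> list[str]:
        # One subtask per semicolon-separated clause.
        return [part.strip() for part in request.split(";") if part.strip()]


class EchoGenerator:
    def generate(self, subtask: str, feedback: Optional[str] = None) -> str:
        draft = f"Draft for: {subtask}"
        return draft if feedback is None else f"{draft} (revised: {feedback})"


class LengthEvaluator:
    def evaluate(self, subtask: str, artifact: str) -> Verdict:
        # Crude proxy for quality: longer drafts score higher, capped at 1.0.
        score = min(len(artifact) / 40, 1.0)
        return Verdict(score, "" if score >= 1.0 else "Add more detail.")
```

Because each role is a separate object with its own objective, any one of them can be swapped or recalibrated without touching the other two.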

This separation resembles a miniature organizational structure, and like one, it reduces coupling between decision-making, production, and approval. As in a company, each role has a different objective, and this difference introduces useful friction.

The Analogy with GANs

The most immediate parallel is with Generative Adversarial Networks (GANs).

In a classic GAN, there are two neural networks: a generator and a discriminator. The generator produces synthetic data, for example images, while the discriminator receives both real and generated data and attempts to distinguish between them. In this way, the generator improves by trying to produce outputs plausible enough to fool the discriminator, while the discriminator improves by becoming better at detecting artificial outputs.

This idea has been applied across many domains: image generation, face synthesis, super-resolution, computer vision, synthetic data generation, image-to-image translation, music, and video. Examples include models such as StyleGAN, CycleGAN, TextGAN, and MuseGAN, which show how the generative-adversarial principle can be adapted to different forms of content.

For AI agents, the analogy is useful because it captures an architectural intuition: a generative system improves when exposed to separate judgment.

The comparison has limits, however, because a generator-evaluator loop is not a GAN in the technical sense.

In real GANs, there is a mathematical loss function: the discriminator is trained on both real and generated samples, while the generator receives an optimization signal through gradients that flow back through the discriminator.
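For reference, the loss in question is usually written as the minimax objective from the original GAN formulation. The symbols are standard notation rather than anything defined in this article: $G$ is the generator, $D$ the discriminator, $p_{\text{data}}$ the distribution of real data, and $p_z$ the noise distribution the generator samples from.

```latex
\[
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
\]
```

Both terms are differentiable, which is what lets gradients update the generator directly. Nothing in an agent loop plays that role.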

In AI agents, by contrast, feedback is linguistic and heuristic: the evaluator does not directly update the generator's weights, nor does it necessarily have access to a dataset of real examples.

When an evaluator judges a landing page, a piece of writing, or a strategy, it is not distinguishing "true" from "false" in the GAN sense. It is estimating how closely the output aligns with a set of preferences, criteria, and conventions.

Making Subjectivity Evaluatable

The core challenge is turning vague judgments into observable criteria.

Saying "this UI is ugly" does not help an agent improve, just as saying the same thing to a human designer would not.

Saying "the visual hierarchy does not sufficiently distinguish between the primary action and secondary content" is much more useful.

An effective evaluator needs what are commonly called evaluation rubrics.

In frontend design, for example, a rubric might include four dimensions:

Design quality, which measures whether the result feels like a coherent system or merely an assembly of components. It evaluates visual identity, creative direction, color usage, typography, rhythm, and layout.
Originality, which measures whether intentional choices are present or whether the output looks derived from templates, library defaults, and generic patterns (this includes recognizing AI slop: predictable gradients, white cards, interchangeable hero sections, stock icons, personality-less compositions).
Craft, which measures execution: spacing, contrast, alignment, typographic hierarchy, color consistency, and attention to detail.
Functionality, which measures usability: whether users understand what to do, find the primary actions, navigate without ambiguity, and complete intended tasks.

These criteria do not all need to carry the same weight. In many cases, models are already reasonably strong in craft and functionality; they produce orderly layouts, readable text, and understandable structures. The recurring issue is often the lack of originality and direction.

For this reason, an evaluator focused on real quality should penalize not only errors, but also blandness.
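One way to make this concrete is a weighted rubric score. The sketch below uses the four dimensions above; the specific weights, the 0-10 scale, and the two sample score sets are assumptions chosen for illustration, not recommended values.

```python
from __future__ import annotations

# Hypothetical weights: design quality and originality count more than
# craft and functionality, where models tend to be strong already.
WEIGHTS = {
    "design_quality": 0.35,
    "originality": 0.35,
    "craft": 0.15,
    "functionality": 0.15,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (each 0-10) into one weighted score."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing rubric dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# A tidy but generic page versus a rougher page with a clear direction.
bland = {"design_quality": 5, "originality": 3, "craft": 9, "functionality": 9}
bold = {"design_quality": 8, "originality": 8, "craft": 7, "functionality": 7}

print(round(weighted_score(bland), 2), round(weighted_score(bold), 2))
```

With an unweighted average the two pages would land close together (6.5 versus 7.5); with these weights the bland one drops to about 5.5 while the bold one stays near 7.7, so blandness is penalized even when nothing is technically wrong.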

The same principle can be applied to other domains.

For writing, a rubric might evaluate thesis strength, argumentative structure, informational density, rhythm, specificity, voice, redundancy, and the ability to anticipate objections.

An AI-generated text may be correct, fluent, and well formatted, yet still fail to say anything memorable. In that case, a useful evaluator should be able to distinguish between superficial clarity and real argumentative value.

For strategy, a rubric might evaluate diagnostic quality, explicit assumptions, trade-offs, feasibility, prioritization, risks, dependencies, metrics, and contextual alignment.

And what about code? Beyond tests, a rubric should assess simplicity, maintainability, consistency with the codebase, error handling, extensibility, technical debt introduced, readability, and impact on existing abstractions.

The Operational Loop

A typical generator-evaluator loop follows a simple sequence:

the system receives a specification;
the planner breaks it into subtasks and defines success criteria;
the generator produces a first version of the output;
the evaluator inspects the artifact using explicit criteria and, when possible, real tools such as browsers, test runners, screenshots, parsers, linters, metrics, documentation, or access to the codebase;
the evaluator assigns scores, identifies issues, and produces actionable feedback;
the generator receives the feedback and iterates.

The cycle continues until the output exceeds a threshold, the budget is exhausted, or the system determines that human intervention is required.
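The sequence and its three stopping conditions can be sketched as a small control loop. This is a simplified illustration under stated assumptions: `run_loop`, `Evaluation`, `LoopResult`, and the toy generator and evaluator are hypothetical names, the planner step is omitted, and the callables stand in for real model calls.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Evaluation:
    score: float              # 0.0 to 1.0
    feedback: str             # what the generator should change
    needs_human: bool = False # evaluator can escalate instead of iterating


@dataclass
class LoopResult:
    artifact: str
    score: float
    iterations: int
    stop_reason: str  # "threshold", "budget", or "human"


def run_loop(
    spec: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    evaluate: Callable[[str, str], Evaluation],
    threshold: float = 0.8,
    max_iterations: int = 5,
) -> LoopResult:
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    artifact: Optional[str] = None
    feedback: Optional[str] = None
    for iteration in range(1, max_iterations + 1):
        artifact = generate(spec, artifact, feedback)
        evaluation = evaluate(spec, artifact)
        if evaluation.needs_human:
            return LoopResult(artifact, evaluation.score, iteration, "human")
        if evaluation.score >= threshold:
            return LoopResult(artifact, evaluation.score, iteration, "threshold")
        feedback = evaluation.feedback
    return LoopResult(artifact, evaluation.score, max_iterations, "budget")


# Toy stand-ins: the "artifact" is a list of words, the rubric is a checklist.
REQUIRED = ["headline", "cta", "pricing"]


def toy_generate(spec: str, previous: Optional[str], feedback: Optional[str]) -> str:
    if previous is None or feedback is None:
        return "headline"
    return f"{previous} {feedback.split()[-1]}"  # apply the requested fix


def toy_evaluate(spec: str, artifact: str) -> Evaluation:
    missing = [word for word in REQUIRED if word not in artifact.split()]
    score = 1 - len(missing) / len(REQUIRED)
    return Evaluation(score, f"add {missing[0]}" if missing else "")


result = run_loop("landing page", toy_generate, toy_evaluate)
print(result.stop_reason, result.iterations)  # threshold 3
```

Lowering `max_iterations` to 2 in this toy run makes the loop stop with `stop_reason == "budget"` and return the best it has, which is the behavior you want to observe and log in a real harness.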

The core of the loop is actionable feedback, because the evaluator must not only detect failure, but also make it correctable.
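A possible way to enforce this is to give feedback a structure instead of free text. In the sketch below, the `Issue` fields, the severity scale, and the output format are all assumptions; the idea is simply that an issue without a location or a suggested fix is filtered out before it reaches the generator.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Issue:
    criterion: str   # rubric dimension the issue belongs to
    location: str    # where in the artifact the problem occurs
    problem: str     # what is wrong, stated observably
    suggestion: str  # what the generator should change
    severity: int    # 1 = minor, 3 = blocking


def is_actionable(issue: Issue) -> bool:
    """An issue is actionable only if it says where and how to fix it."""
    return bool(issue.location.strip()) and bool(issue.suggestion.strip())


def render_feedback(issues: list[Issue]) -> str:
    """Format actionable issues for the generator, most severe first."""
    actionable = sorted(
        (issue for issue in issues if is_actionable(issue)),
        key=lambda issue: issue.severity,
        reverse=True,
    )
    return "\n".join(
        f"[{issue.criterion}] {issue.location}: {issue.problem} -> {issue.suggestion}"
        for issue in actionable
    )


issues = [
    Issue("craft", "footer", "links have low contrast", "darken link color", 1),
    Issue("design quality", "", "this UI is ugly", "", 2),  # vague, dropped
    Issue(
        "functionality",
        "hero section",
        "primary action is not distinguished from secondary content",
        "make the primary button larger and give it the accent color",
        3,
    ),
]
print(render_feedback(issues))
```

The vague "this UI is ugly" entry never reaches the generator, while the two concrete issues arrive ordered by severity.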

A generator-evaluator loop is fundamentally a matter of harness engineering, because the evaluator must have access to the right tools.

That said, an evaluator does not automatically improve simply because it is separated from the generator. For it to work, it must be calibrated using explicit criteria, scoring scales, potentially few-shot examples, thresholds, and similar mechanisms.
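As an illustration of what calibration can look like in practice, the sketch below assembles an evaluator prompt from a criterion, an anchored scoring scale, and few-shot examples. The function name, the prompt wording, and the sample scale are hypothetical; a real harness would tune all of them against human judgments.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ScoredExample:
    artifact: str   # short description or excerpt of a past output
    score: int      # score a trusted reviewer assigned to it
    rationale: str  # why it earned that score


def build_evaluator_prompt(
    criterion: str,
    scale: dict[int, str],
    examples: list[ScoredExample],
    artifact: str,
) -> str:
    # Few-shot examples must use scores that exist on the scale.
    for example in examples:
        if example.score not in scale:
            raise ValueError(f"example score {example.score} is not on the scale")
    lines = [f"Criterion: {criterion}", "Scoring scale:"]
    lines += [f"  {score}: {scale[score]}" for score in sorted(scale)]
    lines.append("Calibration examples:")
    lines += [
        f"  Score {example.score}: {example.artifact} ({example.rationale})"
        for example in examples
    ]
    lines += [
        "Artifact to evaluate:",
        artifact,
        "Reply with one score from the scale and a one-sentence rationale.",
    ]
    return "\n".join(lines)


scale = {
    1: "template-like, no intentional choices",
    3: "some intentional choices, mostly generic",
    5: "distinct, coherent creative direction",
}
examples = [
    ScoredExample(
        "purple gradient hero with three white cards",
        1,
        "library defaults throughout",
    ),
]
prompt = build_evaluator_prompt("Originality", scale, examples, "<artifact under review>")
print(prompt)
```

Anchoring each score to a description and showing at least one scored example gives the evaluator a fixed reference point, so its scores drift less between runs.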

When to Use This Pattern

A generator-evaluator loop has a cost: more model calls, more tokens, more latency, more orchestration, and more complexity. As a result, it is not worth applying everywhere.

As a rule of thumb, if a reliable automated test exists, it is usually better to use it.

And if the task is simple or low-value, a complex loop may genuinely become overengineering.
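These rules of thumb can be written down as a small routing helper, together with a rough count of model calls per run. Both functions are hypothetical simplifications: the call count assumes one generator call and one evaluator call per iteration plus an optional planning call, and the "single pass" fallback for low-value tasks is an assumption of this sketch.

```python
def model_calls(iterations: int, with_planner: bool = True) -> int:
    """Rough number of model calls for one run of the loop."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    return 2 * iterations + (1 if with_planner else 0)


def choose_approach(has_reliable_test: bool, high_value: bool) -> str:
    """Pick the cheapest approach that fits the task."""
    if has_reliable_test:
        return "automated test"
    if high_value:
        return "generator-evaluator loop"
    return "single pass"


print(model_calls(5))                 # 11 calls instead of 1
print(choose_approach(False, True))   # generator-evaluator loop
```

Even this crude count makes the trade-off visible: five iterations with a planner cost eleven model calls where a single-agent pass costs one.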

DEV Community