Koki Riho

Why AI Agent Outputs Need Adversarial Review (and How to Add It in One API Call)

The Problem: Agents Grading Their Own Homework

If you're running LLM agents in production, you've probably built some kind of output validation. Maybe a second LLM call checks the first one's work. Maybe you parse for structural issues.

Here's what I kept finding: LLM-based self-review has a systematic leniency bias. When you prompt an LLM to review output from another LLM (or itself), it overwhelmingly approves. The reviewer and generator share similar blind spots — they fail in correlated ways.

This matters when your agent writes code that gets deployed, generates customer-facing content, or makes decisions affecting downstream systems.

The Approach: Adversarial Review with Dual Consensus

AgentDesk is an HTTP API that sits between your agent's output and whatever consumes it:

  1. Two independent reviewers evaluate the output, each prompted adversarially (their job is to find problems, not confirm quality)
  2. Dual consensus — both must agree on pass/fail
  3. Anti-gaming validation — detects outputs that are superficially correct but substantively hollow
  4. Scored 0-100 with structured feedback on every issue found

It's BYOK — you supply your own LLM API key. AgentDesk charges only for orchestration.

Adding It to Your Pipeline

curl

curl -X POST https://agentdesk-blue.vercel.app/api/v1/tasks \
  -H "Authorization: Bearer agd_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Summarize this quarterly report in 3 bullet points",
    "review": true,
    "review_type": "content"
  }'

Python

import requests

resp = requests.post(
    "https://agentdesk-blue.vercel.app/api/v1/tasks",
    headers={"Authorization": "Bearer agd_your_key"},
    json={
        "prompt": "Summarize this quarterly report in 3 bullet points",
        "review": True,
        "review_type": "content",
    },
)

task = resp.json()
# Poll GET /api/v1/tasks/{task['id']} for results
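The polling step in the Python example can be sketched as follows. This is a minimal sketch, not the official client: the terminal `status` values (`"completed"`, `"failed"`) and the response fields are assumptions about the API's shape, so check them against the actual responses. The fetch function is injectable so the loop can be exercised without a network.

```python
import json
import time
import urllib.request

BASE = "https://agentdesk-blue.vercel.app/api/v1"

def poll_task(task_id, api_key, fetch=None, timeout=30, interval=2):
    """Poll GET /tasks/{id} until the task reaches a terminal status.

    `fetch` takes a task id and returns the parsed task dict; it is
    injectable for testing and defaults to a real HTTP GET.
    """
    if fetch is None:
        def fetch(tid):
            req = urllib.request.Request(
                f"{BASE}/tasks/{tid}",
                headers={"Authorization": f"Bearer {api_key}"},
            )
            with urllib.request.urlopen(req) as resp:
                return json.load(resp)

    deadline = time.time() + timeout
    while time.time() < deadline:
        task = fetch(task_id)
        # Assumed terminal statuses; adjust to the API's actual values.
        if task.get("status") in ("completed", "failed"):
            return task
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish within {timeout}s")
```

With a real key you would call `poll_task(task["id"], "agd_your_key")` on the response from the POST above and read the review score and issues off the returned dict.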

JavaScript

const resp = await fetch('https://agentdesk-blue.vercel.app/api/v1/tasks', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer agd_your_key',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    prompt: 'Summarize this quarterly report in 3 bullet points',
    review: true,
    review_type: 'content',
  }),
});

const task = await resp.json();
// Poll GET /api/v1/tasks/${task.id} for results

How It Works Internally

  1. Reviewer A gets the output with an adversarial system prompt — find every flaw: factual errors, logical gaps, missing requirements.
  2. Reviewer B independently evaluates from a different angle — completeness, edge cases, whether the output actually addresses the task.
  3. Anti-gaming check detects outputs designed to pass review without being good — verbose empty answers, pattern-matched boilerplate.
  4. Consensus engine combines reviews. Both must pass. Scores averaged. Disagreements flagged.

The whole pipeline runs off a single API call and typically completes in 3-8 seconds.
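The consensus step described above reduces to a small amount of logic. Here is a sketch of how the combination rule could look (the `Review` shape and field names are my own illustration, not AgentDesk's internal types): both reviewers must pass, scores are averaged, and a pass/fail disagreement is flagged.

```python
from dataclasses import dataclass, field

@dataclass
class Review:
    passed: bool
    score: int                      # 0-100
    issues: list = field(default_factory=list)

def combine(a: Review, b: Review) -> dict:
    """Dual consensus: both reviewers must pass; scores are averaged;
    pass/fail disagreements are flagged for inspection."""
    return {
        "passed": a.passed and b.passed,        # strict AND, not majority
        "score": (a.score + b.score) / 2,
        "issues": a.issues + b.issues,
        "disagreement": a.passed != b.passed,
    }
```

The strict AND is the point: a lenient reviewer can no longer single-handedly wave an output through.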

How This Compares

| Approach | Correlation with generator | Cost |
| --- | --- | --- |
| Self-review (same model) | High | 1 LLM call |
| Chain-of-verification | Medium | 2-3 LLM calls |
| AgentDesk adversarial | Low | 2-3 LLM calls |
| Human review | None | $$$ + slow |

The key difference isn't the number of LLM calls — it's that adversarial prompting with anti-gaming breaks the correlation between generator and reviewer failure modes.

Pricing

  • Free: 20 tasks/month (BYOK)
  • Starter: $29/mo — 500 tasks
  • Pro: $79/mo — 5,000 tasks + dual review + workflows
  • Team: $199/mo — 50,000 tasks

Open source: github.com/Rih0z/agentdesk


If you're building with AI agents, I'd like to hear what's working for you on quality control. Drop a comment or open an issue on GitHub.
