The Problem: Agents Grading Their Own Homework
If you're running LLM agents in production, you've probably built some kind of output validation. Maybe a second LLM call checks the first one's work. Maybe you parse for structural issues.
Here's what I kept finding: LLM-based self-review has a systematic leniency bias. When you prompt an LLM to review output from another LLM (or itself), it overwhelmingly approves. The reviewer and generator share similar blind spots — they fail in correlated ways.
This matters when your agent writes code that gets deployed, generates customer-facing content, or makes decisions affecting downstream systems.
The Approach: Adversarial Review with Dual Consensus
AgentDesk is an HTTP API that sits between your agent's output and whatever consumes it:
- Two independent reviewers evaluate the output, each prompted adversarially (their job is to find problems, not confirm quality)
- Dual consensus — both must agree on pass/fail
- Anti-gaming validation — detects outputs that are superficially correct but substantively hollow
- Scored 0-100 with structured feedback on every issue found
It's BYOK — you supply your own LLM API key. AgentDesk charges only for orchestration.
Adding It to Your Pipeline
curl

```bash
curl -X POST https://agentdesk-blue.vercel.app/api/v1/tasks \
  -H "Authorization: Bearer agd_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Summarize this quarterly report in 3 bullet points",
    "review": true,
    "review_type": "content"
  }'
```
Python

```python
import requests

resp = requests.post(
    "https://agentdesk-blue.vercel.app/api/v1/tasks",
    headers={"Authorization": "Bearer agd_your_key"},
    json={
        "prompt": "Summarize this quarterly report in 3 bullet points",
        "review": True,
        "review_type": "content",
    },
)
task = resp.json()
# Poll GET /api/v1/tasks/{task['id']} for results
```
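Since the task runs asynchronously, the create call returns an `id` that you poll for the final result. Here's a minimal polling loop as a sketch; the `status` field and its terminal values (`"completed"`, `"failed"`) are assumptions, not documented schema, so adapt them to the actual response shape:

```python
import time

import requests

API = "https://agentdesk-blue.vercel.app/api/v1"
HEADERS = {"Authorization": "Bearer agd_your_key"}


def wait_for_task(task_id: str, timeout: float = 60.0, interval: float = 2.0) -> dict:
    """Poll GET /api/v1/tasks/{task_id} until the task finishes or we time out.

    The 'status' field and its terminal values are assumptions; check the
    real API schema before relying on them.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = requests.get(f"{API}/tasks/{task_id}", headers=HEADERS)
        resp.raise_for_status()
        task = resp.json()
        if task.get("status") in ("completed", "failed"):  # assumed terminal states
            return task
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish within {timeout}s")
```

A 2-second interval against a 3-8 second typical runtime means you usually get a result within two or three polls.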
JavaScript

```javascript
const resp = await fetch('https://agentdesk-blue.vercel.app/api/v1/tasks', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer agd_your_key',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    prompt: 'Summarize this quarterly report in 3 bullet points',
    review: true,
    review_type: 'content',
  }),
});
const task = await resp.json();
// Poll GET /api/v1/tasks/${task.id} for results
```
How It Works Internally
- Reviewer A gets the output with an adversarial system prompt — find every flaw: factual errors, logical gaps, missing requirements.
- Reviewer B independently evaluates from a different angle — completeness, edge cases, whether the output actually addresses the task.
- Anti-gaming check detects outputs designed to pass review without being good — verbose empty answers, pattern-matched boilerplate.
- Consensus engine combines reviews. Both must pass. Scores averaged. Disagreements flagged.
The whole pipeline completes in a single API call, typically 3-8 seconds.
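The consensus step can be sketched in a few lines. This is an illustrative reimplementation, not AgentDesk's actual code; the `Review` shape and the disagreement threshold are assumptions:

```python
from dataclasses import dataclass


@dataclass
class Review:
    passed: bool       # reviewer's pass/fail verdict
    score: int         # 0-100 quality score
    issues: list[str]  # structured feedback on problems found


def consensus(a: Review, b: Review, disagree_gap: int = 30) -> dict:
    """Combine two independent reviews: both must pass, scores are
    averaged, and verdict splits or large score gaps are flagged."""
    return {
        "passed": a.passed and b.passed,    # dual consensus: both must agree
        "score": (a.score + b.score) // 2,  # averaged score
        "flagged": a.passed != b.passed or abs(a.score - b.score) >= disagree_gap,
        "issues": a.issues + b.issues,
    }
```

For example, a 90/pass from Reviewer A combined with a 40/fail from Reviewer B fails overall and gets flagged as a disagreement, which is exactly the case where you'd want a human to look.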
How This Compares
| Approach | Correlation with generator | Cost |
|---|---|---|
| Self-review (same model) | High | 1 LLM call |
| Chain-of-verification | Medium | 2-3 LLM calls |
| AgentDesk adversarial | Low | 2-3 LLM calls |
| Human review | None | $$$ + slow |
The key difference isn't the number of LLM calls — it's that adversarial prompting with anti-gaming breaks the correlation between generator and reviewer failure modes.
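The correlation point can be made concrete with a toy probability model. The numbers below are illustrative assumptions, not measurements: a reviewer that shares the generator's blind spots misses roughly the same defects the generator produced, while two independent adversarial reviewers miss a given defect roughly independently, so a defect slips through only when both miss it.

```python
# Toy model: probability that a defect slips through review.
# All numbers are illustrative assumptions, not measurements.
p_miss = 0.30  # chance one reviewer misses a given defect

# Correlated self-review: the reviewer shares the generator's blind
# spots, so a defect is missed at roughly the single-reviewer rate.
p_slip_correlated = p_miss

# Dual consensus with independent reviewers: the defect slips only
# if BOTH miss it, so the rates multiply.
p_slip_independent = p_miss * p_miss

print(f"correlated self-review miss rate:  {p_slip_correlated:.2f}")   # 0.30
print(f"dual independent review miss rate: {p_slip_independent:.2f}")  # 0.09
```

Independence is the whole game here: if the two reviewers fail in correlated ways, the multiplication doesn't apply, which is why the reviewers are prompted from different angles.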
Pricing
- Free: 20 tasks/month (BYOK)
- Starter: $29/mo — 500 tasks
- Pro: $79/mo — 5,000 tasks + dual review + workflows
- Team: $199/mo — 50,000 tasks
Open source: github.com/Rih0z/agentdesk
If you're building with AI agents, I'd like to hear what's working for you on quality control. Drop a comment or open an issue on GitHub.