The Results from APIEval-20: What Surprised Us, What Didn't, and What It Means

#api #agents #ai #analytics

APIEval-20 went live two months ago. It is an open benchmark that evaluates how well an AI agent can find bugs in a real API when given only a JSON schema and one example payload, with no source code, no documentation, and no hints about where failures are planted.

Since then, we have spent several weeks running seven systems through it: three general-purpose LLMs (GPT-5, Claude Sonnet 4.6, Gemini 2.5 Pro), three coding agents (Claude Code, Cursor, GitHub Copilot), and KushoAI. These are the results we found most interesting, including the ones that surprised us most.

The Black-Box Constraint

Every system in this evaluation received exactly two inputs: a JSON schema and one valid sample payload. No source code. No documentation beyond the schema. No hints about where failures were planted.

This mirrors how testing usually starts in practice: you get a spec before you get full context. An AI testing tool needs to earn its keep in that environment.

Finding 1: Simple Bugs Are Solved

Missing required fields, null values, wrong types and empty arrays. Nearly every system we evaluated handles these now. The weakest tool in our benchmark still detected 63% of simple bugs.

Catching simple bugs should no longer be the bar you use to evaluate an AI testing tool. If your demo shows a tool catching a missing required field, that tells you nothing meaningful.
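To make the simple tier concrete, here is a minimal Python sketch of the kind of schema-level mutations involved. It assumes a flat JSON-Schema-style dict with `required` and `properties` keys. The helper name `simple_mutations`, the example schema, and the sample payload are all hypothetical and are not taken from APIEval-20.

```python
import copy

def simple_mutations(schema, payload):
    """Yield (label, mutated_payload) pairs for schema-level mutations.

    Hypothetical helper: only handles top-level fields of a flat schema.
    """
    # One deliberately wrong value per declared JSON type.
    wrong_type = {
        "string": 12345,
        "integer": "not-a-number",
        "number": "not-a-number",
        "boolean": "maybe",
        "array": "not-an-array",
        "object": "not-an-object",
    }
    # Missing required fields.
    for field in schema.get("required", []):
        mutated = copy.deepcopy(payload)
        mutated.pop(field, None)
        yield f"missing:{field}", mutated
    for field, spec in schema.get("properties", {}).items():
        if field not in payload:
            continue
        # Null values.
        mutated = copy.deepcopy(payload)
        mutated[field] = None
        yield f"null:{field}", mutated
        # Wrong types.
        declared = spec.get("type")
        if declared in wrong_type:
            mutated = copy.deepcopy(payload)
            mutated[field] = wrong_type[declared]
            yield f"wrong_type:{field}", mutated
        # Empty arrays.
        if declared == "array":
            mutated = copy.deepcopy(payload)
            mutated[field] = []
            yield f"empty_array:{field}", mutated

# Hypothetical schema and sample payload, for illustration only.
SCHEMA = {
    "type": "object",
    "required": ["order_id", "items"],
    "properties": {
        "order_id": {"type": "string"},
        "items": {"type": "array"},
        "quantity": {"type": "integer"},
    },
}
SAMPLE = {"order_id": "ord_1", "items": ["sku_1"], "quantity": 2}

MUTATIONS = dict(simple_mutations(SCHEMA, SAMPLE))
```

Every mutation here can be derived mechanically from the schema alone, which is one plausible reason this tier no longer separates tools.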

Finding 2: The Complexity Cliff Is Large and Real

This is where the evaluation got interesting. We categorized planted bugs across three tiers: simple (schema mutation), moderate (field semantics), and complex (cross-field business logic).

The drop from simple to complex bugs is dramatic across almost every system. General-purpose LLMs fell from ~70% detection on simple bugs to ~30% on complex ones. Coding agents dropped from ~80% to ~53%. KushoAI dropped from 93% to 76%, the smallest cliff in the evaluation.
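The size of each drop follows directly from the figures above. A short Python sketch of the arithmetic, using the approximate group-level rates quoted in this post (the variable names are ours, and the "retained" ratio is a derived illustration rather than a benchmark metric):

```python
# (simple %, complex %) detection rates as quoted above; the first two are approximate.
RATES = {
    "General-purpose LLMs": (70, 30),
    "Coding agents": (80, 53),
    "KushoAI": (93, 76),
}

# Absolute drop in percentage points from the simple to the complex tier.
DROPS = {name: simple - cx for name, (simple, cx) in RATES.items()}

# Share of simple-tier detection retained on the complex tier, in whole percent.
RETAINED = {name: round(100 * cx / simple) for name, (simple, cx) in RATES.items()}

for name in RATES:
    print(f"{name}: -{DROPS[name]} points, retains {RETAINED[name]}%")
```

That works out to drops of roughly 40, 27, and 17 percentage points respectively.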

The complex bugs are the ones that matter in production. A refund amount that exceeds the original transaction. A recurring event rule that conflicts with an exception date. An SMS notification channel enabled before verification is complete. In each case, every individual field is valid. The failure lives in the relationship between fields.
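The refund case can be sketched in a few lines of Python. This is an illustration under stated assumptions: the field names `original_amount` and `refund_amount`, the sample payload, and both helper functions are hypothetical and do not come from the benchmark's APIs.

```python
import copy

def refund_exceeds_original(payload, amount_field="original_amount",
                            refund_field="refund_amount"):
    """Return a copy whose refund is one unit larger than the original amount."""
    mutated = copy.deepcopy(payload)
    mutated[refund_field] = mutated[amount_field] + 1
    return mutated

def passes_field_checks(payload, fields=("original_amount", "refund_amount")):
    """Per-field check only: each amount is a non-negative number."""
    return all(
        isinstance(payload[f], (int, float)) and payload[f] >= 0
        for f in fields
    )

# Hypothetical sample payload, for illustration only.
SAMPLE_REFUND = {"transaction_id": "txn_1", "original_amount": 100, "refund_amount": 40}

CASE = refund_exceeds_original(SAMPLE_REFUND)
FIELDS_OK = passes_field_checks(CASE)                             # each field still looks valid
RELATION_BROKEN = CASE["refund_amount"] > CASE["original_amount"]  # but the pair is not
```

A generator that only mutates fields one at a time has no reason to produce this payload, because nothing in the schema says the two amounts are related.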

Finding 3: Prompt Engineering Improves Breadth, Not Depth

"Just write a better prompt" is the default response when AI-generated tests underperform. Better prompts do help; they produce more field coverage, cleaner JSON, and more boundary value tests.

But they don't close the gap on complex bugs. A prompt chain that asks a coding agent to infer a test strategy, generate tests, and then review its own gaps still produced a 53% complex-bug detection rate for the best-performing coding agent (Claude Code). The ceiling isn't about instructions. It's about whether the system models conditional relationships between fields as a structural capability rather than a prompting one.
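For readers unfamiliar with the pattern, a prompt chain of this shape might look like the sketch below. It is not the chain used in the evaluation: the prompt wording is invented, and `call_model` is a hypothetical stand-in for whatever function sends a prompt to a model and returns its text reply.

```python
def run_prompt_chain(call_model, schema_text, sample_text):
    """Three-step chain: infer a strategy, generate tests, then review gaps.

    `call_model` is any callable taking a prompt string and returning a string
    (an assumption; plug in your own client).
    """
    # Step 1: ask for a test strategy from the two black-box inputs.
    strategy = call_model(
        "Infer a test strategy for this API.\n"
        f"Schema:\n{schema_text}\nSample payload:\n{sample_text}"
    )
    # Step 2: generate tests that follow the strategy.
    tests = call_model(
        f"Generate test payloads that follow this strategy:\n{strategy}"
    )
    # Step 3: ask the model to review its own output for coverage gaps.
    review = call_model(
        f"Review these tests, list coverage gaps, and add tests for them:\n{tests}"
    )
    return {"strategy": strategy, "tests": tests, "review": review}
```

Each step only rephrases what the model is asked to do; none of them gives the system new information about how fields relate to each other.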

Finding 4: Variance Is the Hidden CI/CD Metric

Run-to-run consistency rarely shows up in tool evaluations. It should. A tool that produces a strong suite in one run and a weak one in the next creates review overhead that compounds across hundreds of endpoints. KushoAI had the lowest standard deviation across runs (±0.03). Gemini 2.5 Pro had the highest (±0.10). For teams integrating AI-generated tests into automated pipelines, this matters as much as peak performance.
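If you want to track this for your own tool, the measurement is simple. A minimal sketch, assuming each run yields one score between 0 and 1 and using the sample standard deviation; the per-run scores below are made up for illustration and are not benchmark data, and the exact statistic used in our report may differ.

```python
from statistics import mean, stdev

def run_consistency(scores):
    """Return (mean, sample standard deviation) of per-run scores."""
    return mean(scores), stdev(scores)

# Hypothetical per-run scores for two tools with the same average.
STEADY = [0.80, 0.82, 0.78, 0.80]
ERRATIC = [0.95, 0.65, 0.90, 0.70]

steady_mean, steady_sd = run_consistency(STEADY)
erratic_mean, erratic_sd = run_consistency(ERRATIC)

print(f"steady:  mean={steady_mean:.2f} sd={steady_sd:.3f}")
print(f"erratic: mean={erratic_mean:.2f} sd={erratic_sd:.3f}")
```

Both hypothetical tools average the same score, but only one of them would give you a suite you can trust without re-reviewing it on every run.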

The Conflict-of-Interest Question

KushoAI is both one of the evaluated systems and the organization that ran this evaluation. We've tried to address that directly: the methodology, all workflow definitions, and the repeated-run setup are published. Scoring is execution-based: a generated test either triggers a planted bug in the live reference API or it doesn't. Evaluator discretion is minimal by design.
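As a rough illustration of what execution-based scoring means, the sketch below computes a detection rate from execution results. It is a simplification under our own assumptions: the function `detection_rate`, the bug IDs, and the result format are hypothetical, and the published evaluation code is the authoritative definition of the score.

```python
def detection_rate(planted_bugs, triggered_by_test):
    """Fraction of planted bugs triggered by at least one generated test.

    `triggered_by_test` maps a test name to the set of planted-bug IDs that
    the reference API reported as triggered when that test was executed.
    """
    triggered = set()
    for bug_ids in triggered_by_test.values():
        triggered.update(bug_ids)
    # Only count IDs that are actually planted bugs.
    detected = triggered & set(planted_bugs)
    return len(detected) / len(planted_bugs)

# Hypothetical execution results, for illustration only.
PLANTED = {"B1", "B2", "B3", "B4"}
RESULTS = {
    "test_missing_field": {"B1"},
    "test_refund_over_original": {"B1", "B3"},
    "test_happy_path": set(),
}

RATE = detection_rate(PLANTED, RESULTS)
```

Nothing in this calculation depends on a human judging whether a test "looks good": a bug is either triggered during execution or it is not.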

Run It Yourself

The dataset is on Hugging Face. The evaluation code is on GitHub. If you have a testing tool, internal or commercial, you can run it against APIEval-20 and compare your results against ours. That's the point.

We're interested in results that challenge our findings.