Engroso
APIEval-20: The First Benchmark That Tests AI Agents on Real Bug Detection

Every AI testing tool I've evaluated in the past year has the same blind spot: they're measured on outputs, not outcomes.

None of them answer the question I actually care about: does this agent find bugs in a live API?

APIEval-20 is the first benchmark I've seen that does.

The Setup

Provide the agent with a JSON schema and a valid sample payload. Nothing else. No response schema. No error messages. No implementation access. No changelog.

From that alone, the agent has to produce a test suite that exposes planted bugs in a live, running reference API.

What makes APIEval-20 interesting is that it fully automates evaluation against real APIs. Test cases are executed. Responses are analyzed. A bug is only counted as detected if a test case produces a response that deviates from correct behavior in a way that maps to the planted bug.
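The evaluation loop described above might be sketched like this. The `send_request` and `oracle` callables and their signatures are my assumptions, not the benchmark's actual harness API; the point is that detection is judged on live responses, not on the text of the tests:

```python
def evaluate(test_suite, send_request, oracle):
    """Run each generated test against the live API and record which
    planted bugs its responses expose.

    send_request(payload) -> (status_code, body): transport to the live API.
    oracle(payload, status, body) -> planted bug id, or None if the response
    matches correct behavior. Both signatures are assumptions.
    """
    detected = set()
    for test in test_suite:
        status, body = send_request(test["payload"])
        bug = oracle(test["payload"], status, body)
        if bug is not None:
            detected.add(bug)  # counted only when a response deviates
    return detected
```

A bug that no test case provokes into a deviating response is simply not counted, regardless of how plausible the test suite looks on paper.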

How it works

20 scenarios across e-commerce, payments, auth, user management, scheduling, notifications, and search. Each has 3–8 planted bugs, classified not by severity but by the reasoning depth required to find them.

Simple - structural mutations. Missing required fields, null values, and wrong types. Any agent doing even a basic permutation will catch these.

Moderate - field semantics. Out-of-range numbers, malformed emails, invalid currency codes, and enum boundary values. Requires the agent to understand what a field means, not just that it exists.

Complex - cross-field logic. Mutually exclusive fields are both provided. A discount is applied to an order type that's ineligible for it. Fields whose validity depends on the value of another field. This is where agents that look good on surface metrics fall apart completely.
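To make the first two tiers concrete, here is a sketch of simple- and moderate-tier test generation against a hypothetical order payload. All field names and values are assumed for illustration; they are not taken from an actual scenario:

```python
# Hypothetical valid baseline payload (field names assumed).
valid = {
    "user_id": "usr_4821",
    "items": [{"product_id": "prod_991", "quantity": 1, "unit_price": 29.99}],
    "currency": "USD",
    "shipping": {"address": "123 Main St", "method": "standard"},
}

def simple_mutations(payload):
    """Simple tier: structural mutations of each top-level field.

    Drops the field, nulls it, and swaps its type - the permutations
    any basic fuzzing agent will produce.
    """
    tests = []
    for field in payload:
        tests.append((f"missing_{field}",
                      {k: v for k, v in payload.items() if k != field}))
        tests.append((f"null_{field}", {**payload, field: None}))
        tests.append((f"wrong_type_{field}", {**payload, field: 12345}))
    return tests

# Moderate tier: structurally valid but semantically wrong values.
# Producing these requires knowing what the field means.
moderate = [
    ("zero_quantity",
     {**valid, "items": [{**valid["items"][0], "quantity": 0}]}),
    ("negative_price",
     {**valid, "items": [{**valid["items"][0], "unit_price": -1.0}]}),
    ("invalid_currency", {**valid, "currency": "US"}),
]
```

Complex-tier tests can't be generated mechanically like this; they require the agent to infer relationships between fields that the schema never states.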

Scoring

```
Final Score = 0.7 × Bug Detection Rate + 0.2 × Coverage Score + 0.1 × Efficiency Score
```

Bug Detection (70%) - bugs_found / total_bugs. This is the number that matters. Everything else is secondary.

Coverage (20%) - Three sub-dimensions averaged:

  • param_coverage: what fraction of schema fields are exercised across the suite
  • edge_coverage: what fraction has at least one edge-value test (null, "", [], wrong type, out-of-range)
  • variation_score: 1 − mean(Jaccard similarity) across all payload pairs - penalizes suites that repeat near-identical payloads
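A plausible reading of `variation_score`, assuming each payload is flattened to a set of path/value pairs before computing pairwise Jaccard similarity (the flattening scheme is my assumption; the benchmark's description only specifies the 1 − mean-similarity formula):

```python
from itertools import combinations

def flatten(payload, prefix=""):
    """Flatten a nested payload into a set of (path, value) pairs."""
    items = set()
    if isinstance(payload, dict):
        for k, v in payload.items():
            items |= flatten(v, f"{prefix}{k}.")
    elif isinstance(payload, list):
        for i, v in enumerate(payload):
            items |= flatten(v, f"{prefix}{i}.")
    else:
        items.add((prefix.rstrip("."), repr(payload)))
    return items

def jaccard(a, b):
    """|intersection| / |union| of two sets (1.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def variation_score(payloads):
    """1 - mean pairwise Jaccard similarity across all payload pairs."""
    sets = [flatten(p) for p in payloads]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    mean_sim = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_sim
```

A suite of near-identical payloads scores close to 0 here; a suite whose payloads share almost nothing scores close to 1.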

Efficiency (10%) - min(1, bugs_found / number_of_tests). An agent that writes 80 tests to find 6 bugs scores 0.075 here. This metric actively rewards precision over volume, which is the right incentive.

Score bands: Weak (< 0.3), Developing (0.3–0.5), Proficient (0.5–0.7), Strong (0.7+). Strong is described as "comparable to a thorough human QA engineer."
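Putting the formula and bands together, a minimal scoring sketch (coverage is passed in pre-averaged, since its sub-dimensions are computed separately):

```python
def final_score(bugs_found, total_bugs, coverage, num_tests):
    """0.7 x detection + 0.2 x coverage + 0.1 x efficiency."""
    detection = bugs_found / total_bugs
    efficiency = min(1.0, bugs_found / num_tests)
    return 0.7 * detection + 0.2 * coverage + 0.1 * efficiency

def band(score):
    """Map a final score to its APIEval-20 band."""
    if score < 0.3:
        return "Weak"
    if score < 0.5:
        return "Developing"
    if score < 0.7:
        return "Proficient"
    return "Strong"
```

Worked through: an agent that finds 6 of 8 bugs with coverage 0.8 across 80 tests scores 0.7·0.75 + 0.2·0.8 + 0.1·0.075 = 0.6925, landing just under the Strong band. The efficiency term is small, but it's exactly what keeps the 80-test spray from reaching Strong.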

What a Test Case Actually Looks Like

The agent's output format is simple - a test name and a complete request payload:

```json
{
  "test_name": "Order with zero quantity item",
  "payload": {
    "user_id": "usr_4821",
    "items": [{ "product_id": "prod_991", "quantity": 0, "unit_price": 29.99 }],
    "currency": "USD",
    "shipping": { "address": "123 Main St", "method": "standard" }
  }
}
```

That's a moderate-tier bug candidate; a quantity of zero is semantically invalid but structurally fine. A complex-tier test might look like a valid coupon code applied to a restricted order category. The agent has to reason its way there from the schema alone.
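For illustration, a complex-tier test in the same format might look like the following. The `coupon_code` and `category` fields are assumed for this sketch, not taken from a real scenario:

```json
{
  "test_name": "Valid coupon applied to restricted order category",
  "payload": {
    "user_id": "usr_4821",
    "items": [{ "product_id": "prod_991", "quantity": 1, "unit_price": 29.99 }],
    "currency": "USD",
    "coupon_code": "SAVE10",
    "category": "gift_card",
    "shipping": { "address": "123 Main St", "method": "standard" }
  }
}
```

Every individual field here is valid; only the combination is wrong, which is precisely what makes it a cross-field test.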

The Honest Limitations

Twenty scenarios is a thin sample. Variance is high at this scale: one hard scenario can meaningfully swing an agent's aggregate score. The benchmark acknowledges this, and APIEval-50 is on the roadmap.

The team has announced a head-to-head comparison of Cursor, Copilot, Devin, and KushoAI on this dataset, but the results haven't been published. Right now you can run your own agent against it, but there's no published baseline to compare against.

The Jaccard variation score can be gamed. An agent that sprays random changes across its payloads will score well on variation without demonstrating any reasoning. The metric is a proxy for payload diversity, and the benchmark's authors acknowledge as much.

The benchmark is functional-only. Security testing - auth bypasses, injection, OWASP API Top 10 - is explicitly out of scope for v1. That's reserved for APIEval-Security, which is on the roadmap.

Why Run Your Agent Against It

If you're building or evaluating an AI testing agent, this is currently the most honest signal available specifically for API test generation. Most benchmarks in this space measure text quality. This one measures bug-finding on live systems.

The dataset, scenarios, and evaluation harness are all on Hugging Face: https://huggingface.co/datasets/kusho-ai/api-eval-20
