Originally published at blog.kusho.ai

Evaluating API Test Generation Across Leading AI Tools

ChatGPT, Claude, Claude Code, Cursor, Copilot — same spec, same input, measured across test count, coverage quality, and engineering time.

Every major tool can generate API tests. The question is: how many tests, how good, and at what cost in engineering time?

To find out, we ran a structured study using the Stripe Payments API as the benchmark, specifically the POST /v1/payment_intents endpoint for single-API tests, and a representative slice of the full Stripe spec for whole-spec tests.

We scored each approach across four dimensions: field coverage, test type depth, security coverage, and semantic accuracy.

What a Truly Exhaustive Suite Actually Covers

Before looking at the results, it's worth being precise about what "exhaustive" means. For a single endpoint like POST /v1/payment_intents, a complete suite requires:

  • Happy path tests across all valid enum values and field combinations
  • Null and missing tests for every field, required and optional
  • Format tests (invalid emails, overflowed strings, wrong types)
  • Semantic tests (e.g., amount must be a positive integer in the smallest currency unit; statement_descriptor has a hard 22-character limit)
  • Security tests (SQL injection and XSS) for every user-controlled string field, not just one or two
  • Boundary conditions across all numeric and string fields

That benchmark requires roughly 40–50 tests for this single endpoint alone.
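To make those categories concrete, here is a minimal pytest sketch of a few of them. The create_payment_intent helper and placeholder key are assumptions for readability, not output from any of the tools tested:

```python
import pytest
import requests

BASE_URL = "https://api.stripe.com/v1/payment_intents"
API_KEY = "sk_test_placeholder"  # stand-in; substitute a real test-mode key


def create_payment_intent(**overrides):
    """POST a known-valid baseline payload with per-test overrides applied."""
    payload = {"amount": 1099, "currency": "usd"}
    payload.update(overrides)
    return requests.post(BASE_URL, data=payload, auth=(API_KEY, ""))


# Semantic: amount must be a positive integer in the smallest currency unit
@pytest.mark.parametrize("bad_amount", [0, -100, 10.99, "ten"])
def test_amount_semantics_rejected(bad_amount):
    assert create_payment_intent(amount=bad_amount).status_code == 400


# Boundary: statement_descriptor has a hard 22-character limit
def test_statement_descriptor_at_limit():
    assert create_payment_intent(statement_descriptor="A" * 22).status_code == 200


def test_statement_descriptor_over_limit():
    assert create_payment_intent(statement_descriptor="A" * 23).status_code == 400
```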

Chat LLMs (ChatGPT, Claude)

A one-shot prompt against the fully resolved endpoint definition produced 6–8 tests, a workable starting structure, but well short of exhaustive. Coverage gaps were consistent: 2–3 fields tested for null/empty while the rest were silently skipped; one SQL injection test in the suite rather than one per user-controlled field; minimal semantic tests for fields like statement_descriptor or amount.
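For contrast, per-field security coverage parametrizes the attack across every user-controlled string field rather than hard-coding one token test. A sketch reusing the create_payment_intent helper above; the field list is illustrative, not exhaustive:

```python
STRING_FIELDS = [
    "description",
    "receipt_email",
    "statement_descriptor",
    "shipping[name]",
    "shipping[address][line1]",
]
ATTACKS = ["' OR 1=1 --", "<script>alert(1)</script>"]


# 5 fields x 2 payloads = 10 cases, instead of a single token injection test
@pytest.mark.parametrize("attack", ATTACKS)
@pytest.mark.parametrize("field", STRING_FIELDS)
def test_injection_handled_safely(field, attack):
    resp = create_payment_intent(**{field: attack})
    # Accepting or rejecting the value is fine; a 5xx means the payload
    # reached something fragile
    assert resp.status_code < 500
```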

For a full spec, chat LLMs are not a realistic option: Stripe's spec spans hundreds of endpoints, and each one would need its $ref chains resolved and its definition pasted into the chat by hand.

Scores: 4/10 (single API), 2.5/10 (full spec)

LLM Coding Tools (Claude Code, Cursor, GitHub Copilot)

A genuine step up. $ref resolution and file creation are handled automatically. A one-shot prompt produced 7–9 tests per endpoint, the same coverage ceiling as chat LLMs, but with far less friction.

For whole-spec generation, a single prompt covering all endpoints produced output that looks complete: every endpoint has a file, every file has tests. What's missing is depth. No null/empty tests for optional fields. No format tests for receipt_email. No unit-semantic tests for amount. No per-field security coverage.
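As one example of the missing depth, format tests for receipt_email would look something like this sketch (same hypothetical helper as above):

```python
@pytest.mark.parametrize("bad_email", [
    "not-an-email",        # no @ at all
    "user@",               # missing domain
    "@example.com",        # domain-only
    "a" * 300 + "@x.com",  # very long address
])
def test_receipt_email_format_rejected(bad_email):
    assert create_payment_intent(receipt_email=bad_email).status_code == 400
```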

The most meaningful improvement came from a detailed ~400-word prompt that explicitly defines what "exhaustive" means, specifies currency-unit semantics, includes per-field injection tests, and covers format edge cases. With that prompt and two to three review-and-fix passes, scores climbed to 6.5/10. The catch: that process takes 6–8 hours of engineering time for a single well-documented spec, plus ongoing maintenance every time the spec changes.
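For a sense of what such a prompt covers, here is a condensed paraphrase built from the categories above; it is an illustration, not the study's actual prompt:

```
For every endpoint in the spec, generate tests covering:
- every valid enum value and representative field combinations
- null and missing cases for every field, required and optional
- format cases for typed fields (emails, URLs, dates, IDs)
- semantic rules (amount is a positive integer in the smallest currency
  unit; statement_descriptor is capped at 22 characters)
- SQL injection and XSS cases for every user-controlled string field
- boundary values for every numeric and string constraint
```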

Scores: 5/10 (single API), 4.5/10 (full spec), 6.5/10 (engineered prompt)

KushoAI: What a Purpose-Built Pipeline Looks Like

The same POST /v1/payment_intents endpoint that produced 7–9 tests from a one-shot coding tool produced 47 tests from KushoAI without prompt engineering, follow-up passes, or manual review. Across the full Stripe spec, that pattern held: 800+ tests, versus the 120–150 that coding tools produced in a single pass.

Those 47 tests covered:

  • All valid enum values for capture_method, currency, and payment_method_types
  • Null and missing tests for every field — required and optional
  • Format tests for receipt_email (invalid formats, missing @, domain-only, very long addresses)
  • Semantic tests for amount (zero, negative, non-integer, correct smallest-currency-unit representation)
  • statement_descriptor boundary tests (22 chars, 23 chars, special characters, empty string)
  • SQL injection and XSS for every user-controlled string field
  • Nested object tests for shipping and address sub-fields

Time to exhaustive output for the full Stripe spec: ~30 minutes.

Score: 9/10 across all four dimensions.

Comparison Table

| Approach | Tests (single API) | Tests (full spec) | Score (single API) | Score (full spec) | Engineering time |
|---|---|---|---|---|---|
| Chat LLMs (ChatGPT, Claude) | 6–8 | not practical | 4/10 | 2.5/10 | manual per-endpoint prompting |
| Coding tools (Claude Code, Cursor, Copilot) | 7–9 | 120–150 | 5/10 | 4.5/10 (6.5/10 with engineered prompt) | 6–8 hours plus ongoing maintenance |
| KushoAI | 47 | 800+ | 9/10 | 9/10 | ~30 minutes |

Why the Gap Exists and Why It Compounds on Real Specs

General-purpose LLMs optimize for endpoint breadth over scenario depth. When covering an entire spec in one pass, they produce a wide, structurally complete suite that is thin on each individual endpoint. Explicit prompt instructions help but don't fully close the gap: SQL injection tests appear for some fields, not all; semantic tests improve but still miss several edge cases.

The deeper issue is context. With a 300-endpoint production spec, you can't fit more than a handful of endpoints into a single prompt without losing field detail on the rest. The model starts dropping fields; coverage for endpoints that appear later in the context is consistently thinner than for those that appear early.
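This is the problem that per-endpoint context splitting solves: resolve the $refs first, then generate against one fully resolved operation at a time. A rough sketch of the idea, using the jsonref library and a hypothetical generate_tests function; it is an assumption about the general approach, not KushoAI's actual pipeline:

```python
import jsonref  # pip install jsonref; resolves $ref pointers transparently

with open("stripe_openapi.json") as f:
    spec = jsonref.load(f)

HTTP_VERBS = {"get", "post", "put", "patch", "delete"}

# One generation call per operation: each endpoint arrives fully resolved,
# so endpoints late in the spec get the same field detail as early ones.
for path, methods in spec["paths"].items():
    for verb, operation in methods.items():
        if verb in HTTP_VERBS:
            generate_tests(path, verb, operation)  # hypothetical generator
```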

At real production scale (200–300 endpoints, 20–30 fields per payload on average, deeply nested $ref chains, polymorphic types), the 6–8 hour estimate for a single clean public API becomes several days of work, before accounting for ongoing maintenance.

The Takeaway

LLM coding tools are genuinely useful for API test generation, and with enough prompt engineering and iteration, they can reach reasonable quality. The question is whether your team has the bandwidth to build and own that workflow.

If the goal is exhaustive coverage without the infrastructure overhead, the path is a pipeline built specifically for this problem: one that does per-field semantic analysis, handles $ref resolution and context splitting automatically, and produces consistent output regardless of spec size.

The Stripe benchmark was a relatively easy case: a clean, well-documented public spec. Plan accordingly for what you're actually testing.


This post is based on "AI Tools for API Test Generation: A Comparative Workflow Study — 2026", published by KushoAI. KushoAI builds AI-powered test generation for engineering teams. If you want the full methodology, scoring rubric, and raw data breakdown, the complete study is available at the link above.
