After 9 years of writing test cases manually, I built an AI tool that generates them from User Stories. The first version used a single API call. The output looked reasonable until I tried to automate it.
"Verify the system works correctly." What does that mean in Playwright?
"Enter valid data and submit." What data? Which field? What's the expected state after submit?
Single-pass AI treats test case writing like creative writing. But test cases are engineering artifacts. They need specific values, verifiable assertions, and steps an automation engineer can translate to code without asking questions.
So I rebuilt the pipeline with three passes. The quality jumped from 4-5/10 to 8-9/10 consistently. Here's what I learned.
The single-pass problem
Give any LLM a User Story and ask for test cases. You get a reasonable-looking list. But look at what's actually there:
Vague assertions — "Verify the system displays correct results." What results? Where? How do I assert that?
Missing coverage — 8 acceptance criteria in the story, 3 test cases in the output. Five requirements untested.
No priority differentiation — every test case is Priority 1. When the build breaks and you have 10 minutes, which ones do you run?
Placeholder data — "Enter a valid email." My automation script needs user@example.com, not a description of what to enter.
Merged scenarios — three distinct AC collapsed into one test. When it fails, which requirement is broken?
This isn't a prompt engineering problem. I spent weeks tweaking prompts. The real issue is structural: one pass doesn't have enough context to generate AND review simultaneously.
Three passes: Worker, Judge, Optimizer
Here's what CasePilot does instead.
Pass 1 — Worker
The Worker generates initial test cases from full context:
- User Story title, description, acceptance criteria
- Discussion comments (filtered: human only, no bot noise)
- Project Knowledge (tech stack, business rules, UI patterns)
- Wiki/Confluence pages linked to the project
- Parent Epic context (if the story is part of a larger feature)
- Existing test cases (to avoid generating duplicates)
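The context above might be assembled into a single prompt roughly like this. A minimal sketch, with hypothetical field names, not CasePilot's actual schema:

```typescript
// Hypothetical shape of the context the Worker receives.
interface StoryContext {
  title: string;
  description: string;
  acceptanceCriteria: string[];
  comments: { author: string; isBot: boolean; text: string }[];
  projectKnowledge: string[];   // tech stack, business rules, UI patterns
  wikiPages: string[];
  epicContext?: string;
  existingTestTitles: string[]; // used to avoid duplicates
}

function buildWorkerPrompt(ctx: StoryContext): string {
  // Filter out bot noise before the comments reach the model.
  const humanComments = ctx.comments.filter((c) => !c.isBot);
  return [
    `# User Story: ${ctx.title}`,
    ctx.description,
    `## Acceptance Criteria`,
    ...ctx.acceptanceCriteria.map((ac, i) => `${i + 1}. ${ac}`),
    `## Discussion (human comments only)`,
    ...humanComments.map((c) => `${c.author}: ${c.text}`),
    `## Project Knowledge`,
    ...ctx.projectKnowledge,
    `## Linked Wiki/Confluence pages`,
    ...ctx.wikiPages,
    ...(ctx.epicContext ? [`## Parent Epic`, ctx.epicContext] : []),
    `## Existing tests (do NOT duplicate)`,
    ...ctx.existingTestTitles.map((t) => `- ${t}`),
  ].join("\n");
}
```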
The Worker prompt instructs the model to think like a mid-level QA automation engineer, not a writer. Each acceptance criterion gets its own test. Test data uses concrete values, not placeholders.
The Worker also applies ISTQB test design techniques directly in the prompt:
- Boundary Value Analysis — min, min+1, max-1, max for every numeric field
- Equivalence Partitioning — valid class, invalid class, edge class
- Decision Table Testing — combinations of conditions for complex logic
- State Transition Testing — valid and invalid workflow transitions
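The first two techniques are mechanical enough to sketch in a few lines. This is an illustration of the idea, not CasePilot's code:

```typescript
// Boundary Value Analysis for a numeric field with an allowed
// [min, max] range: the four valid-side boundary values named above.
// (min - 1 and max + 1 would add the invalid side.)
function boundaryValues(min: number, max: number): number[] {
  return [min, min + 1, max - 1, max];
}

// Equivalence Partitioning: one representative value per class.
function equivalencePartitions(min: number, max: number) {
  return {
    valid: Math.floor((min + max) / 2), // well inside the range
    invalid: max + 10,                  // clearly outside
    edge: min,                          // on the boundary
  };
}
```

For a quantity field limited to 1-100, that yields boundary tests for 1, 2, 99, and 100, plus one representative per equivalence class.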
Pass 2 — Judge
The Judge receives the Worker's output plus the original User Story. It reviews like a QA Lead reviewing a pull request:
- Can each test be translated directly into a test method?
- Are assertions programmatically verifiable?
- Are coverage gaps filled?
- Are there duplicate or overlapping tests?
The Judge rewrites vague tests, adds missing edge cases, removes unnecessary ones, and scores the overall quality 1-10.
Real example: Worker generates 11 test cases for a registration form. Judge consolidates three email-validation tests into one parameterized test, removes a redundant "form displays correctly" check, adds a missing duplicate-email test. Result: 7 tests, quality score 9/10.
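Part of that review is mechanical enough that it doesn't even need an LLM. A sketch of a vague-assertion check, with an illustrative (not exhaustive) phrase list and a deliberately rough heuristic:

```typescript
// Flags expected results an automation engineer cannot assert on.
const VAGUE_PHRASES = [
  "works correctly",
  "displays correct",
  "as expected",
  "successfully",
  "appropriate",
];

function isVagueAssertion(expected: string): boolean {
  const text = expected.toLowerCase();
  // Vague if it uses a weasel phrase, or names no concrete value
  // (no quoted string, currency amount, or number to assert on).
  return (
    VAGUE_PHRASES.some((p) => text.includes(p)) || !/["'$\d]/.test(text)
  );
}
```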
Pass 3 — Optimizer
For sets of 3+ test cases, the Optimizer analyzes the full suite:
- Duplicate steps — "Navigate to login page" appears in 6 tests. Extract to shared precondition.
- Overlapping coverage — Test 3 and Test 7 both verify the same error message. Merge or differentiate.
- Suggested groups — Tests 1, 2, 5 share the same setup. Group them under "Authenticated User" precondition.
The Optimizer doesn't change the tests. It gives you insights on how to structure your test suite when you automate.
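The duplicate-step analysis in particular is a simple frequency count. A sketch of the idea (not the actual Optimizer implementation):

```typescript
interface TestCase {
  title: string;
  steps: string[];
}

// Steps repeated across several tests are candidates for a shared
// precondition, like the "Navigate to login page" example above.
function sharedStepCandidates(
  tests: TestCase[],
  minCount = 3,
): Map<string, string[]> {
  const byStep = new Map<string, string[]>();
  for (const t of tests) {
    for (const s of new Set(t.steps)) {
      byStep.set(s, [...(byStep.get(s) ?? []), t.title]);
    }
  }
  // Keep only steps that appear in at least minCount tests.
  return new Map([...byStep].filter(([, titles]) => titles.length >= minCount));
}
```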
What this looks like in practice
A User Story about applying discount codes at checkout. 8 acceptance criteria: valid percentage coupon, invalid coupon, expired coupon, empty cart, multiple coupons, coupon removal, minimum order amount, case-insensitive codes.
Single-pass output:
3 generic test cases, all Priority 1, 1-2 steps each. "Apply a valid coupon and verify discount." No test data. No edge cases.
Three-pass output:
8 specific test cases. Mixed P1/P2/P3. Each has 3-5 steps with concrete data:
Title: [Checkout] should reject expired coupon code with clear error message
Category: negative
Priority: 2
Preconditions:
- User is logged in with items in cart (total: $150.00)
- Coupon "SUMMER2024" exists but expired on 2024-12-31
Steps:
1. Navigate to checkout page
Expected: Cart shows $150.00 total
2. Enter "SUMMER2024" in coupon field and click Apply
Expected: Error message "This coupon has expired" displayed
Test Data: coupon = "SUMMER2024"
3. Verify cart total remains $150.00
Expected: No discount applied, total unchanged
An automation engineer reads this and starts writing code. No questions needed.
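The steps-and-expected structure maps almost one-to-one onto an automated check. A sketch against a minimal browser-driver interface so it stays self-contained; in real Playwright you would use `page.goto`, locators, and `expect` instead, and the selectors here are hypothetical:

```typescript
// Tiny stand-in for a browser driver (Playwright's real API is richer).
interface Driver {
  goto(url: string): Promise<void>;
  fill(selector: string, value: string): Promise<void>;
  click(selector: string): Promise<void>;
  textOf(selector: string): Promise<string>;
}

// Each numbered step becomes an action; each Expected becomes a check.
async function expiredCouponIsRejected(page: Driver): Promise<boolean> {
  await page.goto("/checkout");                              // Step 1
  if ((await page.textOf("#cart-total")) !== "$150.00") return false;
  await page.fill("#coupon-code", "SUMMER2024");             // Step 2
  await page.click("#apply-coupon");
  if ((await page.textOf("[role=alert]")) !== "This coupon has expired")
    return false;
  return (await page.textOf("#cart-total")) === "$150.00";   // Step 3
}
```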
Five things I learned building this
1. Token budget matters more than prompt engineering.
I spent weeks tweaking prompts. The real breakthrough was increasing max output tokens from 4,096 to 8,192. The AI was literally running out of space to finish generating test cases. It would produce 3 good tests and then stop because the response was truncated. Not a quality problem. A capacity problem.
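A back-of-envelope capacity check makes the problem obvious. Assuming the common rough heuristic of ~4 characters per token for English text (the per-test size is an illustrative guess):

```typescript
// Will the output token cap fit the whole suite?
function estimateOutputTokens(
  testCount: number,
  avgCharsPerTest = 1200, // rough size of one structured test case
): number {
  return Math.ceil((testCount * avgCharsPerTest) / 4); // ~4 chars/token
}
```

Eleven structured tests at ~1,200 characters each is already ~3,300 tokens of content, uncomfortably close to a 4,096 cap before any JSON overhead.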
2. The model follows examples, not instructions.
"Generate at least one test per acceptance criterion" — ignored.
"Each test must have 3-5 steps with specific expected results" — partially followed.
Adding a concrete JSON example in the system prompt with 3 steps, specific assertions, real test data, and a [Feature Area] prefix fixed everything instantly. The AI pattern-matches on examples far more reliably than parsing natural language instructions.
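What that looks like in practice: embed one complete, concrete example in the system prompt and let the model copy its shape. The exact schema below is illustrative, not CasePilot's:

```typescript
// A concrete few-shot example the model can pattern-match on.
const FEW_SHOT_EXAMPLE = JSON.stringify(
  {
    title: "[Checkout] should apply 10% coupon to cart total",
    category: "positive",
    priority: 1,
    steps: [
      { action: "Navigate to checkout with cart total $100.00", expected: "Cart shows $100.00" },
      { action: 'Enter "SAVE10" in coupon field and click Apply', expected: 'Banner "Coupon applied" displayed' },
      { action: "Verify cart total", expected: "Total shows $90.00" },
    ],
    testData: { coupon: "SAVE10" },
  },
  null,
  2,
);

const SYSTEM_PROMPT =
  `Generate test cases as JSON objects matching this example exactly:\n${FEW_SHOT_EXAMPLE}`;
```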
3. Post-processing catches what prompts can't enforce.
The AI won't always:
- Add [Feature Area] prefixes to titles
- Distribute tests across positive/negative/edge categories
- Include all ISTQB technique labels
Code-based post-processing handles these reliably. Trust AI for content, trust code for formatting. My pipeline has a postProcess step that enforces category distribution, adds feature area tags, scores flakiness risk, and flags shallow tests (fewer than 3 steps).
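A sketch of two of those code-enforced rules, title prefixing and shallow-test flagging (the real postProcess step does more):

```typescript
interface GeneratedTest {
  title: string;
  category: "positive" | "negative" | "edge";
  steps: { action: string; expected: string }[];
  flags: string[];
}

function postProcess(tests: GeneratedTest[], featureArea: string): GeneratedTest[] {
  return tests.map((t) => ({
    ...t,
    // Enforce the [Feature Area] prefix the model often forgets.
    title: t.title.startsWith(`[${featureArea}]`)
      ? t.title
      : `[${featureArea}] ${t.title}`,
    // Flag shallow tests (fewer than 3 steps) for human review.
    flags: t.steps.length < 3 ? [...t.flags, "shallow"] : t.flags,
  }));
}
```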
4. The Judge pass pays for itself.
Three API calls cost roughly three times as much as one. But the quality difference means users generate once instead of regenerating three times. Net token cost is actually lower. And the Judge catches real issues: a Worker test that says "Verify the page loads" gets rewritten to "Verify the checkout page displays cart items with prices, quantities, and subtotal matching the cart state."
5. Speed vs quality is a false tradeoff.
The three-pass pipeline takes 30-60 seconds on GPT-5.4. Users are fine waiting one minute for test cases they can actually automate. They are not fine getting instant results they have to rewrite manually.
I added a three-phase progress bar showing Worker, Judge, Optimizer passes so users see progress instead of staring at a spinner. Perception of speed matters more than actual speed.
Beyond test cases
The same two-pass pattern (Worker + Judge) powers three tools now:
- CasePilot — test case generation from User Stories
- BugPilot — structured bug reports from vague descriptions (repro steps, severity, root cause, impact radius)
- StoryPilot — complete User Story enrichment from a title (description, AC, priority, story points, risks, DoD)
The pattern works because review is fundamentally different from generation. The Worker creates. The Judge evaluates against the source material. Two different cognitive tasks that don't combine well in a single prompt.
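The shared skeleton is small: same structure, different prompts per tool. A sketch with placeholder prompt text, not the production code:

```typescript
type LlmCall = (systemPrompt: string, input: string) => Promise<string>;

// Worker generates; Judge reviews the draft AGAINST the source material.
async function workerJudgePipeline(llm: LlmCall, source: string): Promise<string> {
  const draft = await llm(
    "You are a mid-level QA automation engineer. Generate test cases...",
    source,
  );
  // The Judge must see both the draft and the original story,
  // otherwise it cannot check coverage against the requirements.
  return llm(
    "You are a QA lead. Review, rewrite, and score the draft...",
    `SOURCE:\n${source}\n\nDRAFT:\n${draft}`,
  );
}
```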
Try it
CasePilot is on the Azure DevOps Marketplace and coming to Jira. Free tier: 20 test cases/month, no credit card.
If you want to use the flakiness prediction and boundary value generation in your own test framework, I open-sourced those as a standalone npm package: npm install @iklab/testkit. Zero dependencies, works with Jest, Vitest, Playwright, anything.
I'm interested in how other people handle AI output quality for structured data. The three-pass approach works for test cases. Does it generalize to other domains where AI output needs to be precise and actionable? Let me know in the comments.
Ihor Kosheliev — Senior QA Automation Engineer. Building AI tools for QA at iklab.dev.

