DEV Community

张振

Qualix: semantic coverage gates for AI-generated code

[Image: Qualix demo, where the tests pass but the 500 USD boundary is missing]

AI coding agents write tests. The tests pass. Coverage is green. And then the bug ships.

Here is a concrete example. A PRD says:

Requests at or above 500 USD require manager and finance approval.

A generated test suite might contain:

from decimal import Decimal

# classify() is the function under test; import it from your own module.

def test_low_amount():
    # 120 USD → manager approval only
    assert classify(Decimal("120")) == "manager_only"

def test_high_amount():
    # 600 USD → finance required
    assert classify(Decimal("600")) == "finance_required"

Both tests pass. Branch coverage is green. The implementation, however, uses > 500 instead of >= 500. The case of exactly 500 USD — which the PRD says requires finance approval — silently routes to the wrong path.

This is not a clever edge case. It is the boundary the PRD explicitly defined. A coverage tool reports it as fine.
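To make the gap concrete, here is a minimal sketch. The body of `classify` is an assumption, since the post does not show the implementation; only the `> 500` comparison and the return labels come from the description above.

```python
from decimal import Decimal

THRESHOLD = Decimal("500")  # boundary stated in the PRD


def classify(amount: Decimal) -> str:
    # Hypothetical implementation with the bug described above:
    # the PRD says "at or above 500", so this should be `>=`.
    if amount > THRESHOLD:
        return "finance_required"
    return "manager_only"


# The two generated tests pass and both branches are executed,
# so a branch coverage report has nothing to complain about.
assert classify(Decimal("120")) == "manager_only"
assert classify(Decimal("600")) == "finance_required"

# The boundary the PRD explicitly defines goes down the wrong path.
boundary_result = classify(Decimal("500"))
print(boundary_result)  # manager_only, but the PRD requires finance approval
```

Changing `>` to `>=` fixes the behavior, but nothing in the existing suite would have told you to do so.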

What Qualix does

Qualix is a quality gate that starts from the requirement, not from the code.

Given the PRD, it extracts semantic expectations (SEs) — the business behaviors that tests should prove:

  • SE-003: a request at exactly 500 USD requires manager AND finance approval

It then audits the test suite against those expectations. If no test exercises amount == 500, SE-003 is flagged as PARTIAL or MISSING, regardless of what line coverage says.
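The idea behind that audit can be sketched in a few lines. This is not Qualix's actual implementation or data model: the class names, the `COVERED` status, and the exact meaning given to PARTIAL and MISSING here are assumptions made for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class SemanticExpectation:
    # Hypothetical shape of an SE item: one input the tests must exercise
    # and the outcome the PRD demands for it.
    se_id: str
    required_input: Decimal
    expected_outcome: str


@dataclass(frozen=True)
class ObservedTest:
    # What a test was seen to do: the input it used and the outcome it
    # asserted (None if it only executed the code without asserting).
    name: str
    amount: Decimal
    asserted_outcome: Optional[str]


def audit(se: SemanticExpectation, tests: list[ObservedTest]) -> str:
    hits = [t for t in tests if t.amount == se.required_input]
    if not hits:
        return "MISSING"  # no test touches the required input at all
    if any(t.asserted_outcome == se.expected_outcome for t in hits):
        return "COVERED"  # input exercised and the right outcome asserted
    return "PARTIAL"      # input exercised, but the requirement is not proven


se_003 = SemanticExpectation("SE-003", Decimal("500"), "finance_required")
suite = [
    ObservedTest("test_low_amount", Decimal("120"), "manager_only"),
    ObservedTest("test_high_amount", Decimal("600"), "finance_required"),
]

status = audit(se_003, suite)
print(se_003.se_id, status)  # SE-003 MISSING
```

The point of the sketch is that the verdict depends only on the requirement and on what the tests assert, never on which lines of the implementation were executed.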

The full pipeline:

Q01  Structure the PRD into traceable REQ/BR/SE items
Q05a Design test targets from those semantics (EUT matrix)
Q05b Generate test code (optional, needs compile gate)
Q06  Audit the test suite against the original SE items
Q07  Structured code review tied back to requirement IDs

It works with your existing test runner. It does not replace pytest, JUnit, or Jest. It sits above them and answers a different question: did the tests prove the requirement, or just execute the code?

Why now

AI coding agents have made it cheap to generate code and tests, and that moves the bottleneck. The question is no longer "can we write this?" but "does this actually do what the product asked for?"

Line coverage was designed for a world in which tests were hand-written. In that world, the developer who wrote the test usually understood the requirement. The test was evidence of understanding.

AI-generated tests are evidence of execution. The model generates what it can infer from the code. If the code has a logic error at the boundary, the generated test will probably pass — because the test is generated from the same (wrong) implementation.
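A small sketch of why this happens. It assumes a deliberately simplified generator that takes whatever the code currently returns as the expected value; real agents are less mechanical, and every name below is hypothetical.

```python
from decimal import Decimal


def classify(amount: Decimal) -> str:
    # Hypothetical buggy implementation: `>` where the PRD requires `>=`.
    return "finance_required" if amount > Decimal("500") else "manager_only"


def expected_from_code(amount: Decimal) -> str:
    # Expectation inferred by observing the implementation.
    return classify(amount)


def expected_from_prd(amount: Decimal) -> str:
    # Expectation taken from the requirement: "at or above 500 USD".
    return "finance_required" if amount >= Decimal("500") else "manager_only"


inputs = [Decimal("120"), Decimal("500"), Decimal("600")]

# Tests whose expectations come from the code can never disagree with it,
# even when the boundary input is included.
code_derived_failures = [a for a in inputs if classify(a) != expected_from_code(a)]

# Tests whose expectations come from the PRD expose the boundary bug.
prd_derived_failures = [a for a in inputs if classify(a) != expected_from_prd(a)]

print(code_derived_failures)  # []
print(prd_derived_failures)   # [Decimal('500')]
```

Adding more inputs does not help the code-derived suite; only an expectation that originates outside the implementation can fail.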

Qualix is an attempt to inject the requirement back into the loop, at a point where it can still catch the gap before the code ships.

Current state

  • Apache 2.0, public alpha (0.2.0a1 on PyPI)
  • Java has the deepest path; TypeScript, Go, and Python are supported at a basic level
  • GitHub Actions composite action for CI gate integration
  • Real-world results: in three production Java services, Q06 found 18 EUT targets with assertion gaps that line coverage did not flag (16 partial, 2 missing)
pip install qualix
./scripts/run_expense_demo.sh   # no API key needed, shows pre-computed findings

What I am looking for

Feedback on:

  • Does the semantic coverage framing make sense? Is there a clearer way to explain the gap between line coverage and business-rule verification?
  • Java is the strongest path today. What language / framework would make this immediately useful for your team?
  • The current workflow requires an AI coding agent (Claude Code, Codex, Gemini CLI). Is the friction too high for evaluation?

The GitHub repo is at: https://github.com/alexangelzhang/qualix
