Alexey Spinov

Posted on Jun 22 • Originally published at finops.spinov.online

Audit AI-Generated Tests: Half of Green CI Proves Nothing

#ai #testing #python #agents

To audit AI-generated tests, score how many mirror the code instead of checking it. Green CI proves your tests agree with the code, not that it is correct — and when one agent writes both, they often just mirror it. mirror_audit.py reads the test source with ast, never runs it, and scored a one-pass suite at 50.0%, exit 1.

AI disclosure: I drafted this with an AI writing assistant. The tool, the three fixtures, and every number below come from a real local run on Python 3.13.5, stdlib only. I ran it, checked the exit codes, hashed the STDOUT twice to confirm it's byte-for-byte deterministic, and edited every line before publishing.

A passing test feels like evidence. It usually is less than you think.

Here's the trap, said plainly. A test asserts that the code does what the test expects. If the same author (human or agent) writes both the code and the expectation in one sitting, the expectation is shaped by the code. The test passes because it was written to pass. Run it green a thousand times and you've confirmed one thing: the suite agrees with the implementation. Not that the implementation is right. Those are different claims, and the green checkmark hides the difference.

This got sharper the moment agents started shipping whole pull requests, impl and tests together, one diff, one author. The checkmark didn't get more trustworthy. The thing producing it changed, and our trust in it didn't.

What I actually measured

TL;DR.

A green test proves the suite agrees with the code, not that the code is correct. Same-author code-plus-tests makes that gap wide.
mirror_audit.py reads the test file's ast (it never executes anything), flags four mirror patterns per test, and counts a test as a mirror when ≥2 of 4 fire.
On a deliberately mirror-shaped suite: mirror-ratio 50.0%, 4 of 8 tests, exit 1 (CI fail).
On an honest suite (negative cases, independent expectations, boundaries): 0.0%, exit 0 (pass). The claim is falsifiable, and it passed the honest one.
mirror-ratio is not a bug-rate. It measures missing independent signal, not the presence of a bug. More on that below, because it's the line that keeps this honest.
Stdlib ast only. No API key, no network, nothing executed. Bad input → exit 2.

I'll give you the run before the argument, because the run is the argument.

The contrarian bit: "tests pass" is not "code works"

Most writing about AI-generated tests stops at coverage. Did the agent write tests? Do they pass? Ship it. That's the easy 80%. The expensive 20% is whether any of those passing tests could have failed for a real reason: whether there's an independent oracle anywhere, or just the code checking itself in a mirror.

Three shapes show up over and over in one-pass test output:

The recompute. assert apply_discount(200, 10) == round(200 * (1 - 10/100), 2). The right-hand side is the implementation's own formula, retyped. It can't disagree with the code. It's f(x) == f(x) wearing a costume.
The golden literal copied from a run. assert apply_discount(100, 25) == 75.0, where 75.0 was lifted from running the code once, not derived independently. You've pinned the test to whatever the code did on day one, bug included.
The smoke test. parse_iso_date("2026-06-23") with no assert, or assert result is not None. It goes green if the function returns anything. The signal is roughly zero.

None of these are wrong. They're just not checks. And critically: you can spot all of them statically, in the source, before a single test runs, before merge.
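As a minimal sketch, the three shapes might look like this in a test file. The bodies of apply_discount and parse_iso_date below are stand-ins assumed for illustration (the real fixture is not reproduced in this post); only the shape of each test matters.

```python
from datetime import date


# Hypothetical stand-ins for the code under test (assumed, not the real fixture).
def apply_discount(price: float, pct: float) -> float:
    return round(price * (1 - pct / 100), 2)


def parse_iso_date(text: str) -> date:
    return date.fromisoformat(text)


def test_recompute():
    # The right-hand side retypes the implementation's formula,
    # so it can never disagree with the code.
    assert apply_discount(200, 10) == round(200 * (1 - 10 / 100), 2)


def test_golden_copied_from_run():
    # 75.0 was pasted from a run of the code, which pins the test
    # to whatever the code did that day.
    assert apply_discount(100, 25) == 75.0


def test_smoke():
    # Passes as long as the call returns anything other than None.
    result = parse_iso_date("2026-06-23")
    assert result is not None
```

All three go green against the stand-in code, and none of them would go red if the formula itself were wrong in a way the test also copied.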

The falsifiable claim, stated so it can lose: if you take a suite written to mirror its implementation and a suite written to check it independently, a static reader of the test source should score the mirror suite high and the honest one low. If it flags the honest suite too, the tool is just a green-test-hater and useless. It didn't flag the honest suite. The mirror suite came back 50.0%, the honest suite 0.0%. The tool drew a line between them. The run is right below.

The run

mirror_audit.py takes one or more Python test files (and optionally the implementation file, to catch recompute and copied-golden patterns). It parses each with ast.parse, walks for def test_* functions, and applies four deterministic flags. It never imports or runs your tests. It reads them the way a reviewer skims a diff, only it doesn't get tired.
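The parse-and-walk step described above can be sketched roughly as follows. This is not the tool's actual source; collect_test_functions and SAMPLE are hypothetical names, and the sketch takes a source string rather than a file path to stay self-contained.

```python
import ast


def collect_test_functions(source: str) -> list[str]:
    """Return the names of test_* functions in source without executing it."""
    tree = ast.parse(source)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name.startswith("test_")
    ]


# Hypothetical test file content, held as text and never imported.
SAMPLE = '''
def helper():
    return 1

def test_adds():
    assert helper() + 1 == 2

class TestThing:
    def test_method(self):
        raise SystemExit("never runs: the source is only parsed")
'''

print(sorted(collect_test_functions(SAMPLE)))
```

Because ast.parse only builds a syntax tree, the SystemExit inside test_method is never raised. That is the property that makes a static reader safe to point at an untrusted PR.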

First, the suite an agent emits next to its own code: happy paths, recomputes, a couple of # generated stamps, two smoke tests.

$ python3 mirror_audit.py fixtures/tests_mirror.py --impl fixtures/impl_under_test.py
mirror_audit  (static, offline, read-only; no tests executed)
impl-oracle   : on (impl_under_test.py)
tests scanned : 8
mirror tests  : 4  (>= 2 of 4 flags)
mirror-ratio  : 50.0%   gate 30%   FAIL
flag tally    :
  no_negative_case    : 8
  assert_mirrors_impl : 1
  no_real_assert      : 2
  self_grading        : 2
mirror tests (file::test  flags):
  tests_mirror.py::test_apply_discount_again  [no_negative_case, assert_mirrors_impl]
  tests_mirror.py::test_apply_discount_golden  [no_negative_case, self_grading]
  tests_mirror.py::test_parse_iso_date_smoke  [no_negative_case, no_real_assert]
  tests_mirror.py::test_parse_iso_date_type  [no_negative_case, no_real_assert, self_grading]
note: mirror-ratio measures MISSING INDEPENDENT SIGNAL, not bug-rate.
exit: 1

50.0%. Exit 1. Half the green tests carry no independent signal, and the gate fails the build. Note the asymmetry in the tally. Every test trips no_negative_case (not one of the eight checks an error or a boundary), but a single flag isn't enough. A test has to be a mirror on two axes before it's called one. That threshold is what keeps a merely-shallow test from being branded a fake.
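The arithmetic behind that verdict is small enough to sketch. Two things here are assumptions, not confirmed by the output: that the gate fails on a ratio strictly above 30%, and that an empty scan counts as bad input (exit 2). The function names are hypothetical.

```python
def mirror_ratio(mirror_tests: int, tests_scanned: int) -> float:
    """Percentage of scanned tests that were classified as mirrors."""
    if tests_scanned <= 0:
        raise ValueError("no tests scanned")
    return 100.0 * mirror_tests / tests_scanned


def exit_code(mirror_tests: int, tests_scanned: int, gate_pct: float = 30.0) -> int:
    """0 = pass, 1 = gate failed, 2 = bad input (assumed mapping)."""
    try:
        ratio = mirror_ratio(mirror_tests, tests_scanned)
    except ValueError:
        return 2
    # Assumption: strictly greater than the gate fails the build.
    return 1 if ratio > gate_pct else 0


print(mirror_ratio(4, 8), exit_code(4, 8))  # the mirror suite: 4 of 8 tests
print(mirror_ratio(0, 5), exit_code(0, 5))  # the honest suite: 0 of 5 tests
```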

Now the honest suite: same code under test, but with negative cases, a hand-written expectation table, and boundary asserts.

$ python3 mirror_audit.py fixtures/tests_honest.py --impl fixtures/impl_under_test.py
tests scanned : 5
mirror tests  : 0  (>= 2 of 4 flags)
mirror-ratio  : 0.0%   gate 30%   pass
flag tally    :
  no_negative_case    : 2
  assert_mirrors_impl : 0
  no_real_assert      : 2
  self_grading        : 0
exit: 0

0.0%. Exit 0. Four of the five honest tests do trip a single flag (the tally shows two no_negative_case hits and two no_real_assert hits: a happy path here, a smoke-ish shape there), and the tool still passes them, because none of them is a mirror on two axes. That's the falsification holding: the auditor doesn't punish a suite for being green. It punishes a suite for being green and having nothing that could go red for a real reason.
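For contrast, here is a sketch of what the three honest shapes (a hand-written expectation table, a boundary contract, a negative case) might look like. The implementation stand-ins are assumed, and this apply_discount clamps both ends so the sketch runs green. The negative case uses plain try/except to stay stdlib-only; with pytest you would reach for pytest.raises.

```python
from datetime import date


# Hypothetical stand-ins (assumed): a version that clamps both ends.
def apply_discount(price: float, pct: float) -> float:
    pct = max(0.0, min(pct, 100.0))
    return round(price * (1 - pct / 100), 2)


def parse_iso_date(text: str) -> date:
    return date.fromisoformat(text)


# Expectations worked out by hand, not copied from a run of the code.
EXPECTED = [
    (200, 10, 180.0),
    (80, 25, 60.0),
    (50, 100, 0.0),
]


def test_expectation_table():
    for price, pct, want in EXPECTED:
        assert apply_discount(price, pct) == want


def test_discount_never_raises_price():
    # Boundary contract: holds for any percentage, including silly ones.
    for pct in (-50, 0, 150):
        assert 0 <= apply_discount(100, pct) <= 100


def test_bad_date_is_rejected():
    # Negative case: an impossible date must raise, not return something.
    try:
        parse_iso_date("2026-13-40")
    except ValueError:
        return
    raise AssertionError("expected ValueError for an impossible date")
```

Each of these can go red for a reason that does not depend on how the implementation happens to compute its answer.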

The part that makes it concrete: the bug

A ratio is abstract. Here's the bug it's standing in for. The implementation under test has one real edge defect: apply_discount clamps the top of the percentage but forgets the bottom, so a negative discount inflates the price.

$ python3 fixtures/prove_bug.py
mirror recompute: apply_discount(100,-50)=150.0 == recompute(150.0) -> True  (test passes, green CI)
honest contract : apply_discount(100,-50)=150.0 <= 100 -> False  (test FAILS - bug caught)

BUG present: a -50% 'discount' returns 150.0 for a $100 item.
The mirror assert agreed with the bug. The honest contract caught it.

A minus-50% "discount" charges $150 for a $100 item. The mirror test computes the expected value with the same broken formula, gets $150, asserts 150 == 150, goes green. The honest test asserts a contract (a discount must never raise the price), gets 150 <= 100, goes red. Same code, same input. One suite blesses the bug; the other catches it. mirror_audit.py doesn't run either of these. It just tells you, statically, which suite you've got before you trust its checkmark.
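The same contrast, reduced to a few lines. The apply_discount body below is an assumed reconstruction of the defect as described (upper clamp present, lower clamp missing), not the actual fixture, and mirror_check / honest_check are hypothetical helper names.

```python
# Assumed reconstruction of the defect: the top of the percentage
# is clamped, the bottom is not.
def apply_discount(price: float, pct: float) -> float:
    pct = min(pct, 100)  # upper clamp present
    # missing: pct = max(pct, 0)
    return round(price * (1 - pct / 100), 2)


def mirror_check(price: float, pct: float) -> bool:
    # Recomputes the expectation with the same (broken) formula.
    expected = round(price * (1 - min(pct, 100) / 100), 2)
    return apply_discount(price, pct) == expected


def honest_check(price: float, pct: float) -> bool:
    # Contract: a discount must never raise the price.
    return apply_discount(price, pct) <= price


print(apply_discount(100, -50))  # 150.0
print(mirror_check(100, -50))    # True: the mirror agrees with the bug
print(honest_check(100, -50))    # False: the contract catches it
```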

The four flags, and why two

Each flag fires on a real ast node, deterministically. No model, no heuristics-that-drift, no randomness: same file in, same verdict out, every time.

no_negative_case. No pytest.raises / assertRaises, no relational boundary assert (<=, >=), no error contract anywhere in the body. Happy path only. This is the most common one and the weakest on its own; plenty of fine tests are happy-path. That's exactly why one flag isn't a verdict.
assert_mirrors_impl. An equality assert where both sides call the implementation: f(x) == f(x), or result == recompute_with_impl(x). There's no independent oracle; the expectation is the code. (This one needs the --impl file to know which names are "the implementation.")
no_real_assert. No assert at all (a pure smoke test), or only tautological ones: assert x is not None, assert isinstance(...), assertTrue(True). Green, signal ≈ 0.
self_grading (the self-grading marker). A per-test # generated / # auto stamp on this specific test, or an equality assert against a numeric literal that also appears in the implementation source. The intent is to catch a golden value copied out of the code rather than derived, but read it as a collision heuristic, not proof: it can't distinguish "copied from the code" from "the honest answer happened to be 100, which is also in the code." It's the noisiest of the four (more on that in the caveats).
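To make one of these concrete, here is a rough sketch of how an "implementation on both sides" check could be written over the AST. It is an illustration under assumptions, not the tool's code: the helper names are hypothetical, it only recognises direct calls by bare name, and it does not follow a variable such as result back to the call that produced it.

```python
import ast


def impl_function_names(impl_source: str) -> set[str]:
    """Names of functions defined at the top level of the implementation."""
    return {
        node.name
        for node in ast.parse(impl_source).body
        if isinstance(node, ast.FunctionDef)
    }


def calls_impl(expr: ast.AST, impl_names: set[str]) -> bool:
    """True if expr contains a direct call to an implementation function."""
    return any(
        isinstance(n, ast.Call)
        and isinstance(n.func, ast.Name)
        and n.func.id in impl_names
        for n in ast.walk(expr)
    )


def assert_mirrors_impl(test_source: str, impl_names: set[str]) -> bool:
    """True if some `assert a == b` calls the implementation on both sides."""
    for node in ast.walk(ast.parse(test_source)):
        if not (isinstance(node, ast.Assert) and isinstance(node.test, ast.Compare)):
            continue
        cmp = node.test
        if len(cmp.ops) == 1 and isinstance(cmp.ops[0], ast.Eq):
            if calls_impl(cmp.left, impl_names) and calls_impl(
                cmp.comparators[0], impl_names
            ):
                return True
    return False


IMPL = "def apply_discount(price, pct):\n    return price\n"
MIRROR = "def test_a():\n    assert apply_discount(1, 2) == apply_discount(1, 2)\n"
INDEPENDENT = "def test_b():\n    assert apply_discount(200, 10) == 180.0\n"

names = impl_function_names(IMPL)
print(assert_mirrors_impl(MIRROR, names), assert_mirrors_impl(INDEPENDENT, names))
```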

A test is a mirror when ≥2 of 4 fire. The threshold is the whole design. A single shallow signal is common and forgivable; two at once is the signature of a test written to agree rather than to check. I picked 2 because it cleanly separated my two fixtures. It's a starting line, not a law. Tune it to your own suites and tell me where you land.
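A sketch of the two-of-four rule, using simplified versions of just two flags. Everything here is assumed for illustration: the helper names, the exact tautology list, and the decision to ignore unittest-style assertEqual calls. The real tool has four flags and may define each differently.

```python
import ast

RAISES_NAMES = {"raises", "assertRaises"}
BOUNDARY_OPS = (ast.LtE, ast.GtE)


def _call_name(call: ast.Call) -> str:
    f = call.func
    if isinstance(f, ast.Attribute):
        return f.attr
    return f.id if isinstance(f, ast.Name) else ""


def has_negative_case(fn: ast.FunctionDef) -> bool:
    for node in ast.walk(fn):
        # pytest.raises(...) or self.assertRaises(...)
        if isinstance(node, ast.Call) and _call_name(node) in RAISES_NAMES:
            return True
        # relational boundary assert such as `assert x <= limit`
        if isinstance(node, ast.Assert) and isinstance(node.test, ast.Compare):
            if any(isinstance(op, BOUNDARY_OPS) for op in node.test.ops):
                return True
    return False


def is_tautology(test: ast.expr) -> bool:
    # `x is not None`
    if isinstance(test, ast.Compare) and len(test.ops) == 1:
        if isinstance(test.ops[0], ast.IsNot):
            rhs = test.comparators[0]
            return isinstance(rhs, ast.Constant) and rhs.value is None
    # `isinstance(...)`
    if isinstance(test, ast.Call):
        return _call_name(test) == "isinstance"
    return False


def has_real_assert(fn: ast.FunctionDef) -> bool:
    asserts = [n for n in ast.walk(fn) if isinstance(n, ast.Assert)]
    return any(not is_tautology(a.test) for a in asserts)


def flags_for(fn: ast.FunctionDef) -> set[str]:
    flags = set()
    if not has_negative_case(fn):
        flags.add("no_negative_case")
    if not has_real_assert(fn):
        flags.add("no_real_assert")
    return flags


def is_mirror(flags: set[str], threshold: int = 2) -> bool:
    return len(flags) >= threshold


SAMPLE = '''
def test_smoke():
    result = parse_iso_date("2026-06-23")
    assert result is not None

def test_happy():
    assert apply_discount(200, 10) == 180.0

def test_contract():
    assert apply_discount(100, 10) <= 100

def test_raises():
    with pytest.raises(ValueError):
        parse_iso_date("nope")
'''

verdicts = {
    fn.name: (sorted(flags_for(fn)), is_mirror(flags_for(fn)))
    for fn in ast.parse(SAMPLE).body
    if isinstance(fn, ast.FunctionDef)
}
for name, verdict in verdicts.items():
    print(name, verdict)
```

Only test_smoke collects two flags and is called a mirror. test_happy and test_raises each trip one flag and pass, which is the "shallow but forgivable" case the threshold exists for.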

One honesty rule baked in: mirror-ratio is not a bug-rate

This is the line I will not let you walk away without. The mirror-ratio does not estimate how many bugs you have. It estimates how much of your green CI carries no independent signal: how much of it is the suite nodding along with the code. A 50% mirror-ratio means half your passing tests couldn't have caught a wrong answer if there were one. It does not mean half your code is buggy. You could have a 50% mirror-ratio over perfectly correct code (lucky) or a 5% mirror-ratio over broken code that your five honest tests happened to miss. The tool measures the quality of the check, not the correctness of the code. The output literally prints note: mirror-ratio measures MISSING INDEPENDENT SIGNAL, not bug-rate. on every run so nobody, including me, can quote it as a defect count.

If I dressed this up as "half your code is broken," that would be the exact overclaim the tool exists to catch. So I won't.

Where the outside numbers land (and where they don't)

I went looking for whether this matters beyond a toy fixture. Three external findings hold up to a primary source; I'm putting them in the body, attributed, never as my result and never in the headline.

Veracode, 2025 GenAI Code Security Report: across 80 curated coding tasks run through 100+ LLMs, 45% of generated samples failed the security test and introduced an OWASP Top-10 weakness; Java was the worst at a 72% failure rate (Veracode). Read it precisely: that's a security-failure rate on benchmark tasks, not "45% of all AI code is exploitable." Still, that's a lot of code shipping behind a green checkmark.
ICSE 2026 (SEIP track), "Vibe Coding in Practice" by Fawzy, Tahir & Blincoe (a grey-literature review of 101 practitioner sources and 518 firsthand accounts) finds that QA practices are frequently overlooked, and skipping testing is the single most common behavior, often by handing verification back to the same AI tool that wrote the code (arXiv 2510.00328). That last clause is the whole problem in one sentence.
"Rethinking Verification for LLM Code Generation" (Ma et al., arXiv 2507.06920, July 2025) finds that model-built evaluation suites tend to be homogeneous — "a limited number of homogeneous test cases, resulting in subtle faults going undetected," in their words — and proposes human-LLM collaboration to widen coverage. Read into our setting, that's the same blind spot showing up on both sides of the assert: a narrow, same-shaped suite can't catch the faults its own narrowness hides.

And one I'm flagging as weak so you can discount it: CodeRabbit, an AI code-review vendor, reported ~1.7x more "issues" in AI-coauthored PRs than human ones across 470 open-source PRs (CodeRabbit). I'd take that with a grain of salt. The "issues" were graded by CodeRabbit's own product, on a small sample, and they sell AI review. Interesting direction, not a fact to lean on. I'm including it and its conflict of interest because leaving it out would be cherry-picking, and putting it in unqualified would be the same.

This is not the runtime one. It's the pre-merge one

I've written before about an agent that returns 200 and lies: a runtime check that walks an execution span-trace and refuses to accept a success the agent never achieved. People will assume this is the same thing. It isn't, and the difference matters.

That one runs after execution, on a trace of what happened: status flags, payloads, the effect on the world. This one runs before anything executes, on the source of the test files in a pull request: no tests run, no spans read, no runtime at all. mirror_audit.py is pure ast over text. Different layer (static vs runtime), different input (test source vs span-trace), different metric (mirror-ratio vs share of empty-payload successes). One asks "did this run actually do the thing?" The other asks "could this test have caught it if it didn't?" You'd want both, but they're not the same tool wearing two hats.

It sits inside the same idea as the pre-execution gate: catch the problem before you ship it, fail the build, not the incident review. It's a cousin of the deterministic pre-gate I built for an LLM judge: same shape, a 0/1/2 exit you can drop straight into CI, just a different question asked of a different artifact. And it's the inverse of the token-waste probe after a failure. There, the signal is a real failure you're burning money past; here, the danger is a false success, a green that isn't earned.

What this is NOT (so I don't oversell it)

It does not find bugs. It finds tests that couldn't find bugs. A clean 0% mirror-ratio means your tests have independent signal, not that your code is correct. You can mirror-audit your way to a great suite that still misses the one case nobody wrote.
It does not read git blame or PR metadata, so it can't prove the same author wrote both files. It has no concept of authorship at all — it flags purely on what's in the test source, and the "same author wrote both" framing is the motivation for the metric, not something the tool detects. I'm not going to invent authorship I can't see.
It's conservative without the impl file. Drop the --impl argument and the recompute and golden-literal checks go dark. On the same mirror suite it scores 37.5% instead of 50.0%, because it can no longer tell that a literal matches one in the code. Dropping context lowers the score, never raises it. (That's not a blanket "never over-reports" guarantee — see the next caveat.) The output header tells you which mode you're in.
The golden-literal check can over-flag on a collision. It fires when an equality assert pins an expected value that also appears as a literal in the impl source — but it can't tell "copied from the code" from "happened to match." An honest, hand-derived assert apply_discount(200, 50) == 100 trips it just because 100 shows up in the implementation. So a suite full of legitimate small-integer expectations (0, 1, 100) can score higher than it deserves with --impl on. It's a syntactic collision heuristic, not proof a value was copied. Read the flagged tests, don't trust the flag blindly — and if your domain is all round numbers, weight self_grading lower.
It resolves at the test-function level, not the line. A test with one genuine assert buried under three smoke calls can still pass the audit. It's a triage signal for a reviewer, not a proof of suite quality.
The four flags are heuristics on syntax, not semantics. A sufficiently clever mirror (an assert that recomputes the impl through an indirection the AST can't follow) will slip past. It catches the common shapes that show up in one-pass output, which is most of them, not all of them.
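The golden-literal collision from the caveats above is easy to reproduce in a sketch. This is an assumed, simplified version of the check (hypothetical function names, right-hand-side literals only, negative literals not handled), and the IMPL text is a stand-in rather than the real fixture.

```python
import ast


def _is_number(node: ast.AST) -> bool:
    return (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    )


def numeric_literals(source: str) -> set[float]:
    """Every int/float literal that appears anywhere in source."""
    return {float(n.value) for n in ast.walk(ast.parse(source)) if _is_number(n)}


def golden_collisions(test_source: str, impl_source: str) -> list[float]:
    """Expected values in `assert ... == <literal>` that also occur in the impl."""
    impl_nums = numeric_literals(impl_source)
    hits = []
    for node in ast.walk(ast.parse(test_source)):
        if isinstance(node, ast.Assert) and isinstance(node.test, ast.Compare):
            cmp = node.test
            if len(cmp.ops) == 1 and isinstance(cmp.ops[0], ast.Eq):
                rhs = cmp.comparators[0]
                if _is_number(rhs) and float(rhs.value) in impl_nums:
                    hits.append(float(rhs.value))
    return hits


# Stand-in implementation text (assumed); note the literal 100 inside it.
IMPL = (
    "def apply_discount(price, pct):\n"
    "    pct = min(pct, 100)\n"
    "    return round(price * (1 - pct / 100), 2)\n"
)
HAND_DERIVED = "def test_half_off():\n    assert apply_discount(200, 50) == 100\n"
NO_COLLISION = "def test_ten_off():\n    assert apply_discount(200, 10) == 180.0\n"

print(golden_collisions(HAND_DERIVED, IMPL))  # flagged, although derived by hand
print(golden_collisions(NO_COLLISION, IMPL))  # not flagged
```

The first test is honest and still collides, purely because 100 appears in the implementation. That is the reason to read the flagged tests rather than trust the flag.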

It's deterministic, though, which is the one thing it promises and keeps: same test file, same verdict, every run. I hashed the STDOUT twice on both fixtures and got identical sha256 each time (5047bf48… for the mirror suite, 84fcdb73… for the honest one). A CI gate you can't reproduce isn't a gate.
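A determinism check of that kind can be sketched with hashlib and subprocess. The helper names are hypothetical, and the demo command is a stand-in so the sketch runs anywhere; substitute the real mirror_audit.py invocation to check the actual tool.

```python
import hashlib
import subprocess
import sys


def stdout_sha256(cmd: list[str]) -> str:
    """Run cmd and return the sha256 hex digest of its raw STDOUT bytes."""
    # check=False: a non-zero exit (a failing gate, say) is a valid result here.
    proc = subprocess.run(cmd, capture_output=True, check=False)
    return hashlib.sha256(proc.stdout).hexdigest()


def is_deterministic(cmd: list[str], runs: int = 2) -> bool:
    """True if every run of cmd produced byte-identical STDOUT."""
    digests = {stdout_sha256(cmd) for _ in range(runs)}
    return len(digests) == 1


# Stand-in command (assumed) that prints a fixed line.
demo = [sys.executable, "-c", "print('mirror-ratio  : 50.0%')"]
print(is_deterministic(demo))
```

Hashing raw bytes rather than decoded text means a stray timestamp, a reordered dict, or a changed line ending all show up as a different digest.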

Run it on a real suite from an agent PR and tell me your mirror-ratio. I'm genuinely curious what the distribution looks like in the wild, because my 50% is a fixture I built to be obvious, and real suites will be messier. What's the most mirror-shaped test you've ever merged: a recompute, a copied golden, a smoke test with assert result? Drop it in the comments, I read every one. Follow for the next number from the next run.

Top comments (1)

Luis • Jun 22

This is a really important distinction: green CI doesn’t mean correctness, it means agreement between code and its tests.
The idea of measuring “missing independent signal” instead of bug rate is the key insight here — it reframes test quality as an oracle problem, not a coverage problem.
In practice, the most dangerous tests are the ones that always pass because they were written in the same mental loop as the implementation.