This article was originally published on LucidShark Blog.
Claude Opus 4.7 launched today. It is faster, more capable, and ships more code per hour than anything that came before it. ZAProxy ran 9.5 million times in March, up 35% from February, because vibe-coded projects are generating enough security alerts that developers are being forced to learn what XSS means.
Here is the thing that the benchmarks do not measure: AI coding agents are very good at writing code that passes your tests. They are also very good at writing tests that look like coverage but assert almost nothing. These two skills, combined, produce a codebase with green CI and a false sense of quality that can persist for months before something breaks in production.
Context: This is not a criticism of AI coding tools specifically. Human developers game coverage metrics too. The difference is velocity: a senior engineer gaming coverage metrics might affect a few files per sprint. An AI agent operating at full capacity can introduce the same pattern across an entire codebase in an afternoon.
## How AI Agents Game Coverage Without Trying
AI coding agents do not intentionally game your test suite. They do something more systematic: they optimize for what is measurable.
When you ask Claude Code to "add tests for this module," it sees your existing test patterns, your existing coverage reports, and the code it just wrote. It generates tests that exercise the code paths it knows exist, in the patterns it has already seen in your test suite. The result is often technically correct, but it is testing the happy path almost exclusively.
Here is what that looks like in practice:
```python
# AI-generated test for a payment processor
def test_process_payment():
    processor = PaymentProcessor(api_key="test_key")
    result = processor.charge(amount=100, card="4242424242424242")
    assert result.status == "success"
    assert result.amount == 100

# What is NOT being tested:
# - What happens when api_key is empty or invalid
# - What happens when amount is negative, zero, or exceeds limits
# - What happens when the card number fails Luhn validation
# - What happens when the payment gateway times out
# - What happens when the gateway returns a partial success
# - Race conditions on concurrent charge attempts
```
That test passes. It contributes to your coverage percentage. It tells you almost nothing about whether your payment processor is production-safe.
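To make the gap concrete, here is a sketch of what testing the unhappy path looks like. `luhn_valid` is a hypothetical standalone helper (the article's `PaymentProcessor` internals are not shown); the point is that the rejection branches get their own tests:

```python
# Hypothetical helper: Luhn checksum validation for card numbers.
def luhn_valid(card: str) -> bool:
    if not card.isdigit():
        return False
    total = 0
    for i, ch in enumerate(reversed(card)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# Happy path AND failure paths, not just the former.
def test_luhn_accepts_valid_card():
    assert luhn_valid("4242424242424242")

def test_luhn_rejects_corrupted_card():
    assert not luhn_valid("4242424242424241")  # last digit flipped

def test_luhn_rejects_non_numeric_input():
    assert not luhn_valid("4242-4242-4242-4242")
```

Three tests instead of one, and two of them exist purely to visit branches the happy-path test never touches.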
## The Coverage Number That Looks Great and Means Nothing
Statement coverage measures whether a line of code was executed during testing. Branch coverage measures whether both the true and false branches of conditionals were exercised. Mutation testing measures whether your tests actually detect when code is changed to be wrong.
AI agents optimize for statement coverage because that is the number in your CI badge. Branch coverage requires intentionally generating inputs that trigger the false branch of every conditional. Mutation testing requires a separate tool that nobody has asked the agent to integrate.
The result: a codebase that shows 85% coverage in your CI pipeline but has tested roughly 40% of the actual execution paths that matter in production.
The specific failure mode to watch for: An AI agent that writes a function and then immediately writes a test for that function will produce a test that exercises the function exactly as the agent intended it to work. If the function has a logic error, the test will likely have the same logic error baked into its assertions. You need external validation of correctness, not just execution of the code path.
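A minimal illustration of that trap, with hypothetical names: the mirrored assertion encodes the author's own bug and passes, while a check against an external invariant (money in equals money out) catches it.

```python
def split_fee(total_cents: int, n: int) -> list[int]:
    """Hypothetical agent-written splitter. Bug: integer division drops the remainder."""
    return [total_cents // n] * n

# Mirrored test: asserts exactly what the buggy code does, so it passes.
assert split_fee(100, 3) == [33, 33, 33]

# Oracle test: checks an invariant the code should preserve. It would fail here,
# because one cent is silently lost.
lost = 100 - sum(split_fee(100, 3))
print(lost)  # 1
```

The mirrored test contributes coverage; only the invariant test contributes correctness.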
## Why This Gets Worse as Model Capability Increases
More capable models write more convincing tests. Claude Opus 4.7's tests look more like what a senior engineer would write than Claude 3 Sonnet's did. They have better variable names, better assertion messages, and better setup and teardown patterns.
This is the paradox: better-looking tests that still do not test the right things are more dangerous than obviously bad tests, because they are harder to spot in code review. A test that looks competent gets approved faster than one that looks like it was written by a junior engineer in a hurry.
The fix is not to review tests more carefully. Human code review at the velocity AI agents produce code is not sustainable. ZAP running 9.5 million times in March is evidence that vibe coding is mainstream. You cannot hand-review the test suite of a codebase that grew 10x in a sprint.
The fix is automated enforcement of coverage quality at the commit boundary.
## What Enforcement Actually Looks Like
There are three levels of coverage enforcement, each progressively more meaningful:
### Level 1: Statement Coverage Threshold
The minimum viable check. Ensures at least N% of statements are executed during testing. Easy to game, but still useful as a floor.
```ini
# pytest.ini
[tool:pytest]
addopts = --cov=src --cov-fail-under=80 --cov-report=term-missing
```

```yaml
# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: coverage-check
        name: Coverage threshold check
        entry: pytest --cov=src --cov-fail-under=80 -q
        language: system
        pass_filenames: false
        always_run: true
```
### Level 2: Branch Coverage Threshold
Requires both sides of conditionals to be exercised. Significantly harder to game, because the agent now has to write tests that intentionally trigger the error path, the empty-input path, and the boundary condition paths.
```ini
# .coveragerc
[run]
branch = True
source = src

[report]
fail_under = 75
show_missing = True
skip_covered = False
```
Branch coverage of 75% is much harder to fake than statement coverage of 85%. An AI agent writing tests purely based on the happy path will typically hit 45-55% branch coverage, making the gap visible immediately.
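To see why, consider a conditional with three exits. A happy-path test executes most of the lines, so statement coverage looks high, but only branch coverage reveals that the raise and the `False` return were never taken. (`validate_amount` is a hypothetical example, not code from the article.)

```python
def validate_amount(amount: int, limit: int = 50_000) -> bool:
    if amount <= 0:
        raise ValueError("amount must be positive")
    if amount > limit:
        return False
    return True

# Happy path only: one side of each conditional is ever taken.
assert validate_amount(100) is True

# Branch coverage forces the other sides to be exercised too.
assert validate_amount(60_000) is False  # over-limit branch
try:
    validate_amount(0)  # error branch
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError for non-positive amount")
```

With only the first assertion, coverage.py in branch mode reports the `True` arms of both conditionals as partially covered; the last two checks are what close the gap.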
### Level 3: Per-Module Coverage Boundaries
Prevents averaging effects where a well-tested utility module masks an untested security-critical module.
```ini
# .coveragerc with per-module enforcement
[run]
branch = True
source =
    src/

[report]
fail_under = 70
exclude_lines =
    pragma: no cover
    if __name__ == .__main__.:
```

```python
# conftest.py: enforce higher coverage on security-sensitive modules
import subprocess
import sys

CRITICAL_MODULES = {
    "src/auth/": 90,
    "src/payments/": 90,
    "src/api/": 80,
}

def pytest_sessionfinish(session, exitstatus):
    for module, threshold in CRITICAL_MODULES.items():
        result = subprocess.run(
            ["coverage", "report", f"--include={module}*", f"--fail-under={threshold}"],
            capture_output=True,
        )
        if result.returncode != 0:
            print(f"Coverage below {threshold}% for {module}")
            sys.exit(1)
```
## The Pre-Commit Hook That Enforces This
Enforcement at pre-commit means coverage checks run before code reaches CI, before any AI review step, and before any cloud service is involved. If the agent-written tests do not meet the threshold, the commit is rejected with a clear message. The agent then has to write better tests to proceed.
This creates the right feedback loop: the agent sees the failure, reads the coverage report showing which branches are uncovered, and writes tests that address the gaps. It is the difference between "this agent writes tests" and "this agent writes tests that actually test things."
```yaml
# Complete .pre-commit-config.yaml including coverage
repos:
  - repo: https://github.com/returntocorp/semgrep
    rev: v1.68.0
    hooks:
      - id: semgrep
        args: ['--config', 'p/default', '--config', 'p/secrets']
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.4.0
    hooks:
      - id: detect-secrets
        args: ['--baseline', '.secrets.baseline']
  - repo: local
    hooks:
      - id: pip-audit
        name: Dependency vulnerability scan
        entry: pip-audit
        language: system
        pass_filenames: false
      - id: branch-coverage
        name: Branch coverage threshold (75%)
        entry: pytest --cov=src --cov-branch --cov-fail-under=75 -q --no-header
        language: system
        pass_filenames: false
        stages: [pre-push]
```
Note that coverage checks are on pre-push rather than pre-commit. Running a full test suite on every commit is too slow for interactive development. Running it before you push to the remote is the right tradeoff: fast local iteration, enforced quality before code enters the shared repository.
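For `stages: [pre-push]` to take effect, the push-time hook has to be installed alongside the commit-time hooks; with the pre-commit CLI that is:

```shell
pre-commit install                        # installs commit-time hooks
pre-commit install --hook-type pre-push   # installs push-time hooks
```

Without the second command, hooks scoped to `pre-push` silently never run.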
## What This Does Not Catch
Coverage thresholds are a floor, not a ceiling. A 75% branch coverage requirement does not tell you that the tests that exercise those branches are asserting the right things. It tells you that those branches have been visited, not that they have been validated.
For that, you need mutation testing tools like mutmut (Python) or Stryker (JavaScript/TypeScript). These tools modify your source code in small ways (flipping a comparison operator, changing a constant, removing a return statement) and check whether your tests detect the change. If mutated code still passes your test suite, your tests are not asserting what you think they are.
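As a sketch of what that wiring looks like for mutmut (the paths and commands here are assumptions; check the mutmut documentation for your version), configuration lives in `setup.cfg`:

```ini
# setup.cfg
[mutmut]
paths_to_mutate=src/
tests_dir=tests/
```

You then run `mutmut run` to generate and test mutants, and `mutmut results` to list the survivors. Every surviving mutant is a code change your suite failed to notice.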
Mutation testing is too slow for pre-commit but is a valuable addition to your CI pipeline, run on a schedule or on PRs to high-risk modules.
LucidShark includes coverage threshold enforcement as one of its five core pre-commit checks, alongside taint analysis, secrets scanning, SCA, and auth pattern detection. It works locally, runs in milliseconds for small test suites, and integrates with Claude Code via MCP so the agent sees coverage failures in its context and can iterate without leaving the session.
Install: lucidshark.com or run npx lucidshark init in your project directory. Apache 2.0, no cloud required.