This article was originally published on LucidShark Blog.
Claude Opus 4.7 launched today. It is faster, more capable, and ships more code per hour than anything that came before it. ZAProxy ran 9.5 million times in March, up 35% from February, because vibe-coded projects are generating enough security alerts that developers are being forced to learn what XSS means.
Here is the thing that the benchmarks do not measure: AI coding agents are very good at writing code that passes your tests. They are also very good at writing tests that look like coverage but assert almost nothing. These two skills, combined, produce a codebase with green CI and a false sense of quality that can persist for months before something breaks in production.
Context: This is not a criticism of AI coding tools specifically. Human developers game coverage metrics too. The difference is velocity: a senior engineer gaming coverage metrics might affect a few files per sprint. An AI agent operating at full capacity can introduce the same pattern across an entire codebase in an afternoon.
## How AI Agents Game Coverage Without Trying
AI coding agents do not intentionally game your test suite. They do something more systematic: they optimize for what is measurable.
When you ask Claude Code to "add tests for this module," it sees your existing test patterns, your existing coverage reports, and the code it just wrote. It generates tests that exercise the code paths it knows exist, in the patterns it has already seen in your test suite. The result is often technically correct, but it is testing the happy path almost exclusively.
Here is what that looks like in practice:
```python
# AI-generated test for a payment processor
def test_process_payment():
    processor = PaymentProcessor(api_key="test_key")
    result = processor.charge(amount=100, card="4242424242424242")
    assert result.status == "success"
    assert result.amount == 100

# What is NOT being tested:
# - What happens when api_key is empty or invalid
# - What happens when amount is negative, zero, or exceeds limits
# - What happens when the card number fails Luhn validation
# - What happens when the payment gateway times out
# - What happens when the gateway returns a partial success
# - Race conditions on concurrent charge attempts
```
That test passes. It contributes to your coverage percentage. It tells you almost nothing about whether your payment processor is production-safe.
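To make the gap concrete, here is a sketch of what testing the unhappy path looks like. `luhn_valid` is a hypothetical standalone helper (the article's `PaymentProcessor` internals are not shown); the point is that the rejection branches get their own tests:

```python
# Hypothetical helper: Luhn checksum validation for card numbers.
def luhn_valid(card: str) -> bool:
    if not card.isdigit():
        return False
    total = 0
    for i, ch in enumerate(reversed(card)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# Happy path AND failure paths, not just the former.
def test_luhn_accepts_valid_card():
    assert luhn_valid("4242424242424242")

def test_luhn_rejects_corrupted_card():
    assert not luhn_valid("4242424242424241")  # last digit flipped

def test_luhn_rejects_non_numeric_input():
    assert not luhn_valid("4242-4242-4242-4242")
```

Three tests instead of one, and two of them exist purely to visit branches the happy-path test never touches.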
## The Coverage Number That Looks Great and Means Nothing
Statement coverage measures whether a line of code was executed during testing. Branch coverage measures whether both the true and false branches of conditionals were exercised. Mutation testing measures whether your tests actually detect when code is changed to be wrong.
AI agents optimize for statement coverage because that is the number in your CI badge. Branch coverage requires intentionally generating inputs that trigger the false branch of every conditional. Mutation testing requires a separate tool that nobody has asked the agent to integrate.
The result: a codebase that shows 85% coverage in your CI pipeline but has tested roughly 40% of the actual execution paths that matter in production.
The specific failure mode to watch for: An AI agent that writes a function and then immediately writes a test for that function will produce a test that exercises the function exactly as the agent intended it to work. If the function has a logic error, the test will likely have the same logic error baked into its assertions. You need external validation of correctness, not just execution of the code path.
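A minimal illustration of that trap, with hypothetical names: the mirrored assertion encodes the author's own bug and passes, while a check against an external invariant (money in equals money out) catches it.

```python
def split_fee(total_cents: int, n: int) -> list[int]:
    """Hypothetical agent-written splitter. Bug: integer division drops the remainder."""
    return [total_cents // n] * n

# Mirrored test: asserts exactly what the buggy code does, so it passes.
assert split_fee(100, 3) == [33, 33, 33]

# Oracle test: checks an invariant the code should preserve. It would fail here,
# because one cent is silently lost.
lost = 100 - sum(split_fee(100, 3))
print(lost)  # 1
```

The mirrored test contributes coverage; only the invariant test contributes correctness.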
## Why This Gets Worse as Model Capability Increases
More capable models write more convincing tests. Claude Opus 4.7's tests look more like what a senior engineer would write than Claude 3 Sonnet's did. They have better variable names, better assertion messages, and better setup and teardown patterns.
This is the paradox: better-looking tests that still do not test the right things are more dangerous than obviously bad tests, because they are harder to spot in code review. A test that looks competent gets approved faster than one that looks like it was written by a junior engineer in a hurry.
The fix is not to review tests more carefully. Human code review at the velocity AI agents produce code is not sustainable. ZAP running 9.5 million times in March is evidence that vibe coding is mainstream. You cannot hand-review the test suite of a codebase that grew 10x in a sprint.
The fix is automated enforcement of coverage quality at the commit boundary.
## What Enforcement Actually Looks Like
There are three levels of coverage enforcement, each progressively more meaningful:
### Level 1: Statement Coverage Threshold
The minimum viable check. Ensures at least N% of statements are executed during testing. Easy to game, but still useful as a floor.
```ini
# pytest.ini
[tool:pytest]
addopts = --cov=src --cov-fail-under=80 --cov-report=term-missing
```

```yaml
# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: coverage-check
        name: Coverage threshold check
        entry: pytest --cov=src --cov-fail-under=80 -q
        language: system
        pass_filenames: false
        always_run: true
```
### Level 2: Branch Coverage Threshold
Requires both sides of conditionals to be exercised. Significantly harder to game, because the agent now has to write tests that intentionally trigger the error path, the empty-input path, and the boundary condition paths.
```ini
# .coveragerc
[run]
branch = True
source = src

[report]
fail_under = 75
show_missing = True
skip_covered = False
```
Branch coverage of 75% is much harder to fake than statement coverage of 85%. An AI agent writing tests purely based on the happy path will typically hit 45-55% branch coverage, making the gap visible immediately.
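To see why, consider a conditional with three exits. A happy-path test executes most of the lines, so statement coverage looks high, but only branch coverage reveals that the raise and the `False` return were never taken. (`validate_amount` is a hypothetical example, not code from the article.)

```python
def validate_amount(amount: int, limit: int = 50_000) -> bool:
    if amount <= 0:
        raise ValueError("amount must be positive")
    if amount > limit:
        return False
    return True

# Happy path only: one side of each conditional is ever taken.
assert validate_amount(100) is True

# Branch coverage forces the other sides to be exercised too.
assert validate_amount(60_000) is False  # over-limit branch
try:
    validate_amount(0)  # error branch
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError for non-positive amount")
```

With only the first assertion, coverage.py in branch mode reports the `True` arms of both conditionals as partially covered; the last two checks are what close the gap.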
### Level 3: Per-Module Coverage Boundaries
Prevents averaging effects where a well-tested utility module masks an untested security-critical module.
```ini
# .coveragerc with per-module enforcement
[run]
branch = True
source =
    src/

[report]
fail_under = 70
exclude_lines =
    pragma: no cover
    if __name__ == .__main__.:
```

```python
# conftest.py: enforce higher coverage on security-sensitive modules
import subprocess
import sys

CRITICAL_MODULES = {
    "src/auth/": 90,
    "src/payments/": 90,
    "src/api/": 80,
}

def pytest_sessionfinish(session, exitstatus):
    for module, threshold in CRITICAL_MODULES.items():
        result = subprocess.run(
            ["coverage", "report", f"--include={module}*", f"--fail-under={threshold}"],
            capture_output=True,
        )
        if result.returncode != 0:
            print(f"Coverage below {threshold}% for {module}")
            sys.exit(1)
```
## The Pre-Commit Hook That Enforces This
Enforcement at pre-commit means coverage checks run before code reaches CI, before any AI review step, and before any cloud service is involved. If the agent-written tests do not meet the threshold, the commit is rejected with a clear message. The agent then has to write better tests to proceed.
This creates the right feedback loop: the agent sees the failure, reads the coverage report showing which branches are uncovered, and writes tests that address the gaps. It is the difference between "this agent writes tests" and "this agent writes tests that actually test things."
```yaml
# Complete .pre-commit-config.yaml including coverage
repos:
  - repo: https://github.com/returntocorp/semgrep
    rev: v1.68.0
    hooks:
      - id: semgrep
        args: ['--config', 'p/default', '--config', 'p/secrets']
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.4.0
    hooks:
      - id: detect-secrets
        args: ['--baseline', '.secrets.baseline']
  - repo: local
    hooks:
      - id: pip-audit
        name: Dependency vulnerability scan
        entry: pip-audit
        language: system
        pass_filenames: false
      - id: branch-coverage
        name: Branch coverage threshold (75%)
        entry: pytest --cov=src --cov-branch --cov-fail-under=75 -q --no-header
        language: system
        pass_filenames: false
        stages: [pre-push]
```
Note that coverage checks are on pre-push rather than pre-commit. Running a full test suite on every commit is too slow for interactive development. Running it before you push to the remote is the right tradeoff: fast local iteration, enforced quality before code enters the shared repository.
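For `stages: [pre-push]` to take effect, the push-time hook has to be installed alongside the commit-time hooks; with the pre-commit CLI that is:

```shell
pre-commit install                        # installs commit-time hooks
pre-commit install --hook-type pre-push   # installs push-time hooks
```

Without the second command, hooks scoped to `pre-push` silently never run.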
## What This Does Not Catch
Coverage thresholds are a floor, not a ceiling. A 75% branch coverage requirement does not tell you that the tests that exercise those branches are asserting the right things. It tells you that those branches have been visited, not that they have been validated.
For that, you need mutation testing tools like mutmut (Python) or Stryker (JavaScript/TypeScript). These tools modify your source code in small ways (flipping a comparison operator, changing a constant, removing a return statement) and check whether your tests detect the change. If mutated code still passes your test suite, your tests are not asserting what you think they are.
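As a sketch of what that wiring looks like for mutmut (the paths and commands here are assumptions; check the mutmut documentation for your version), configuration lives in `setup.cfg`:

```ini
# setup.cfg
[mutmut]
paths_to_mutate=src/
tests_dir=tests/
```

You then run `mutmut run` to generate and test mutants, and `mutmut results` to list the survivors. Every surviving mutant is a code change your suite failed to notice.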
Mutation testing is too slow for pre-commit but is a valuable addition to your CI pipeline, run on a schedule or on PRs to high-risk modules.
LucidShark includes coverage threshold enforcement as one of its five core pre-commit checks, alongside taint analysis, secrets scanning, SCA, and auth pattern detection. It works locally, runs in milliseconds for small test suites, and integrates with Claude Code via MCP so the agent sees coverage failures in its context and can iterate without leaving the session.
Install: lucidshark.com or run npx lucidshark init in your project directory. Apache 2.0, no cloud required.