DEV Community

Michael "Mike" K. Saleme


We audited every claim in our repos and found 14 files with wrong numbers

Last week a bot embarrassed us. Cursor Bugbot ran across five PRs on our agent security testing framework and filed nine real issues: an HTTP 413 handler that returned an empty body, undefined variables that only surfaced in live mode, regex patterns being compared as literal substrings, and a metric definition in an arXiv citation that directly contradicted what we were computing. Every finding was legitimate. We fixed them.

Then we asked the obvious follow-up: if the code had wrong numbers, what about the docs?

The audit

We pulled both repos and went line by line.

agent-security-harness — a Python library with 470+ security tests covering AI agent protocols: MCP (Model Context Protocol), A2A (Agent-to-Agent), L402, and x402. The README badge said 466 tests. Older documentation said 439. The MCP test count in the technical overview was wrong by more than a dozen. And a claim that we satisfy every AIUC-1 requirement was directly contradicted by our own framework crosswalk document, which correctly listed one requirement as partial.

constitutional-agent — governance gates and hard constraints for AI agents: six evaluation gates, twelve hard constraints, an amendment protocol. The README said 77 tests. Actual count when we ran the suite: 150. The dependency list included a package we removed months ago. One constraint referenced in the docs does not exist in the codebase.

Fourteen files needed changes across the two repos.

What we fixed

  • README badges and body text updated to match actual test counts in both repos
  • MCP, A2A, L402, x402 per-protocol counts corrected in agent-security-harness
  • AIUC-1 compliance language scoped accurately (we cover the controls we cover; we do not claim full certification)
  • Removed the phantom dependency from constitutional-agent
  • Removed the reference to the constraint that does not exist
  • Added missing CHANGELOG entries for three versions that shipped without them
  • Added Python 3.13 to the CI matrix (we were testing on 3.11 and 3.12 only)
  • Added a missing core dependency that was present in the dev environment but not declared in pyproject.toml — meaning clean installs could fail silently depending on what else was installed
  • Version bumps: agent-security-harness to 4.1.0, constitutional-agent to 0.2.0
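The clean-install failure mode above is worth guarding against directly. A minimal sketch of a CI step that does this — install into a fresh virtual environment and import the package, so an undeclared dependency fails the build instead of failing silently on a user's machine. The step name, paths, and the module name are illustrative, not the repo's actual CI config:

```yaml
# Hypothetical CI step: a clean-environment install-and-import smoke
# test catches dependencies that are present in the dev environment
# but missing from pyproject.toml.
- name: Clean-install smoke test
  run: |
    python -m venv /tmp/smoke
    /tmp/smoke/bin/pip install --quiet .
    /tmp/smoke/bin/python -c "import agent_security_harness"  # module name is illustrative
```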

The structural fix

Number drift is not a documentation problem. It is a process problem. A README is not a test — nothing was enforcing that the badge matched reality.

We added a CI check that does three things: runs count_tests.py to get the canonical test count from the source, checks that count against the version declared in pyproject.toml, and checks it against the badge in the README. The check runs on every push. If the numbers disagree, the build fails.
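The badge half of that check fits in a few lines. This is a sketch, not the repo's actual script: the shields.io-style badge format (`tests-<N>-`), the function name, and the failure message are all assumptions for illustration; in CI the canonical count would come from count_tests.py.

```python
import re

# Assumed badge format: shields.io style, e.g.
# ![tests](https://img.shields.io/badge/tests-466-brightgreen)
BADGE_RE = re.compile(r"tests-(\d+)-")

def find_drift(canonical_count: int, readme_text: str) -> list[str]:
    """Compare every test-count badge in the README against the
    canonical count; return a description of each mismatch."""
    errors = []
    for match in BADGE_RE.finditer(readme_text):
        claimed = int(match.group(1))
        if claimed != canonical_count:
            errors.append(
                f"README claims {claimed} tests, suite has {canonical_count}"
            )
    return errors

if __name__ == "__main__":
    readme = "![tests](https://img.shields.io/badge/tests-466-brightgreen)"
    for problem in find_drift(470, readme):
        print(problem)  # a real CI job would exit non-zero here
```

Deriving the count from the suite and treating the README as a consumer of that number, rather than a second copy of it, is the whole point.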

This is not novel. It is the same principle as pinning dependencies or generating API docs from source: stop maintaining two sources of truth and start deriving one from the other.

Honest accounting

The Bugbot findings were real bugs. The accuracy sweep found claims in our public-facing documentation that were wrong. Some were stale snapshots from an earlier phase of the project. Some were copy-paste errors. One (the AIUC-1 claim) was imprecise language that looked stronger than what we could actually demonstrate.

None of these caused a security incident. But security tooling that makes inaccurate claims about its own coverage is a specific kind of bad — it erodes exactly the trust that makes the tooling worth using.

We shipped the fixes. The counts are now correct. The CI will catch drift going forward.
