A story about instrument qualification, false positives, and why honest governance sometimes means failing on purpose.
## The paradox
This morning, CORE's audit system reported 252 findings and returned a verdict of PASSED.
This evening, it reported 78 findings and returned a verdict of FAILED.
Nothing in production changed. No bugs were introduced. No architecture was violated.
The sensors were fixed.
| Finding | Before | After | Delta |
|---|---|---|---|
| Total findings | 252 | 78 | -174 |
| Orphan files | 91 | 0 | -91 |
| Modularity (blunt score) | 100 | 0 | -100 |
| needs_split | — | 19 | new |
| needs_refactor | — | 27 | new |
| File size (redundant rule) | 29 | 0 | -29 |
| Verdict | PASS | FAIL | honest |
The FAILED verdict is the correct one. The PASSED verdict was a compliance illusion.
## The instrument qualification problem
In GxP-regulated environments — pharmaceutical manufacturing, medical devices, clinical software — you do not run an assay on an uncalibrated instrument and trust the result. Before any measurement is taken seriously, the instrument must be qualified: it must demonstrably measure what it claims to measure, within defined tolerances, under defined conditions.
This principle is so fundamental that it precedes any discussion of the data itself. Bad data from a qualified instrument is a finding. Bad data from an unqualified instrument is noise — and acting on noise has a name: it is a deviation.
Software governance systems face the same problem. An audit engine that produces findings is an instrument. If that instrument has not been qualified — if its detectors produce false positives, if its thresholds are miscalibrated, if its rules conflate distinct problem classes — then the findings it produces are not evidence. They are noise with a compliance label.
Acting on that noise with automated remediation is not governance. It is confident, expensive, wrong work.
## Case 1: The orphan file detector
CORE uses a static import graph traversal to detect source files unreachable from any declared entry point. The principle is sound: if no entry point can reach a file, that file is dead code and should be removed.
The detector flagged 91 files as orphans.
All 91 were false positives.
Static import graph traversal is a deliberate choice — deterministic, auditable, no runtime dependency. The tradeoff is that dynamically-loaded components must be explicitly declared as entry points. That declaration is itself a governance artifact: it makes the implicit loading contract explicit and versioned. The detector was not wrong — the contract was incomplete.
An automated agent pointed at those 91 findings would have deleted live production code. The agent would have been operating correctly within its mandate. The mandate was wrong.
The fix was not to make the detector smarter. It was to declare the dynamically-loaded directories as explicit entry points — converting an implicit runtime convention into a versioned, governed contract. Functionally this resembles static linking. Constitutionally it is different: the declaration is law, subject to change control, with documented rationale. The detector enforces the contract. The contract is owned by governance, not by the build system.
```yaml
entry_points:
  - "src/will/self_healing/"
  - "src/will/test_generation/"
  - "src/shared/infrastructure/"
  # ... 10 more directories
```
After the fix: zero orphan findings. Zero code deleted. The codebase did not change. The instrument was qualified.
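The detection logic described above reduces to a reachability search over the static import graph. A minimal sketch, assuming a simplified graph representation (the function and graph names are illustrative, not CORE's actual API):

```python
from collections import deque

def find_orphans(import_graph: dict, entry_points: list) -> set:
    """BFS over the static import graph from the declared entry points.
    Any file never reached is reported as an orphan."""
    reachable = set()
    queue = deque(entry_points)
    while queue:
        node = queue.popleft()
        if node in reachable:
            continue
        reachable.add(node)
        queue.extend(import_graph.get(node, ()))
    return set(import_graph) - reachable

# Before the fix: the plugin is loaded dynamically at runtime,
# so no static import edge ever reaches it.
graph = {
    "src/main.py": ["src/core.py"],
    "src/core.py": [],
    "src/will/self_healing/agent.py": [],  # invisible to static traversal
}
print(find_orphans(graph, ["src/main.py"]))
# → {'src/will/self_healing/agent.py'}  (false positive)

# After the fix: the dynamic directory is a declared entry point.
print(find_orphans(graph, ["src/main.py", "src/will/self_healing/agent.py"]))
# → set()
```

Note that the traversal itself is unchanged between the two calls; only the declared contract differs, which is exactly the point made above.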
## Case 2: The modularity score
Four rules were producing 100 findings collectively:
- `modularity.single_responsibility`
- `modularity.semantic_cohesion`
- `modularity.import_coupling`
- `modularity.refactor_score_threshold`
All four were proxies for a single composite score. All four mapped to the same remediation action: fix.modularity. All four carried the same enforcement level: reporting.
The problem is that they were measuring two fundamentally different things and treating them identically.
Problem class A: a file is too long with a single coherent responsibility.
This is a mechanical problem. The file does one thing but does too much of it. The solution is splitting — redistributing logic across smaller files along natural seams. No discipline boundaries are crossed. No architectural judgment is required. An automated system can propose and execute this split safely, subject to a Logic Conservation Gate that verifies no logic was lost.
Problem class B: a file mixes distinct architectural disciplines.
A file that combines CLI rendering, database access, and business logic in 300 lines is not a size problem. It is an architectural violation. Resolving it requires a human to decide where each responsibility belongs in the constitutional layer structure. An automated system cannot make that decision safely — not because AI is incapable of generating a proposal, but because the decision carries architectural authority that must remain with a human until the boundaries are formally established.
Conflating these two problems in a single score means the governance system cannot distinguish between what it is allowed to fix autonomously and what it must escalate. That distinction is not a technical nicety. In regulated environments, it is the difference between an approved automated action and an unauthorized architectural change.
The fix was to retire the four proxy rules and replace them with two precise sensors:
```json
[
  {
    "id": "modularity.needs_split",
    "enforcement": "reporting",
    "rationale": "Automatable. Mechanical redistribution, no discipline boundaries crossed."
  },
  {
    "id": "modularity.needs_refactor",
    "enforcement": "blocking",
    "rationale": "Requires human judgment. Autonomous action prohibited until architectural decision is approved."
  }
]
```
The blocking enforcement on needs_refactor is the point. It is not a warning. It is a constitutional stop. The system will not proceed autonomously until a human has reviewed and authorized the architectural boundary decision.
This is why the audit now returns FAILED. Twenty-seven files contain mixed-discipline violations. They are real findings. They require real decisions. The system is correctly refusing to act without authorization.
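The split between the two sensors can be sketched as a classifier. The threshold value and the discipline-detection input are illustrative assumptions, not CORE's actual rules:

```python
SPLIT_THRESHOLD = 200  # lines; assumed value for illustration

def classify(line_count: int, disciplines: set):
    """Map a file to at most one precise sensor.

    disciplines: the set of architectural disciplines detected in the
    file, e.g. {"cli", "db", "business"} (detection method assumed).
    """
    if len(disciplines) > 1:
        # Problem class B: mixed disciplines. Constitutional stop;
        # a human must decide where each responsibility belongs.
        return ("modularity.needs_refactor", "blocking")
    if line_count > SPLIT_THRESHOLD:
        # Problem class A: one coherent responsibility, too much of it.
        # Mechanical split, automatable behind a validation gate.
        return ("modularity.needs_split", "reporting")
    return None

print(classify(300, {"cli", "db", "business"}))
# → ('modularity.needs_refactor', 'blocking')
print(classify(450, {"business"}))
# → ('modularity.needs_split', 'reporting')
print(classify(120, {"business"}))
# → None
```

The 300-line mixed-discipline case from above lands on the blocking sensor even though it is smaller than the split threshold: size and architectural violation are measured independently.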
## The verdict paradox
A governance system that always passes is not a governance system. It is a reporting system with a green checkbox.
PASSED with 252 findings meant: the system detected many things, none of them were classified as blocking, therefore no action is required. The 91 false positives contributed to a picture of busyness without actionability. The composite modularity score produced findings that the automated remediator could not distinguish from each other. Everything was flagged, nothing was escalated.
FAILED with 78 findings means: the system has detected 27 architectural violations that require human decisions before any automated action proceeds. It has identified 19 files that can be split autonomously, subject to validation gates. Every finding in the report corresponds to a specific, actionable condition.
The failure verdict is evidence that the governance system is functioning correctly. It is not a regression. It is an honest measurement.
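The verdict rule the article implies is deliberately simple: a run fails exactly when any finding carries blocking enforcement, regardless of total count. A minimal sketch (data shapes assumed):

```python
def verdict(findings: list) -> str:
    """FAIL on any blocking finding; counts alone never fail a run."""
    blocked = any(f["enforcement"] == "blocking" for f in findings)
    return "FAILED" if blocked else "PASSED"

# Morning run: 252 findings, all merely reported, none escalated.
before = [{"enforcement": "reporting"}] * 252

# Evening run: 78 findings, 27 of them constitutional stops.
after = ([{"enforcement": "reporting"}] * 51
         + [{"enforcement": "blocking"}] * 27)

print(verdict(before))  # → PASSED
print(verdict(after))   # → FAILED
```

This is why more findings produced a green verdict and fewer findings produced a red one: the verdict measures escalation, not volume.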
## The principle
Governance quality is not measured by finding count. It is measured by finding accuracy.
In regulated environments, the difference between a false positive acted upon and a true positive ignored is not a technical footnote. It is a compliance failure. Instrument qualification is not overhead — it is the precondition for trusting any measurement that follows.
Before you ask what your audit found, ask whether your audit can be trusted.
CORE is an open-source constitutional governance runtime for AI-assisted software development. Architecture, governance rules, and enforcement mappings are public.
github.com/DariuszNewecki/CORE