Every AI team celebrates when their agent catches errors. Nobody tracks whether those errors stop recurring.
We ran 6 autonomous agents through 145+ specs and 960+ commits. The critical metric we discovered: 265:1.
That's 4,768 violations detected but only 18 promoted to structural enforcement: a stark gap between detection and actual prevention.
What the Ratio Means
A violation is a detected failure — an agent breaks rules, uses outdated context, or misses constraints. Detection is straightforward; every monitoring tool does it.
A promotion is when that violation becomes structurally impossible to repeat. Not "we documented it." Not "we added a Jira ticket." The violation gets encoded as an L5 hook, L4 test, or L3 template in the enforcement ladder.
The remaining 4,750 violations can recur because, despite logging and alerting, nothing structural changed.
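The enforcement ladder referenced above can be sketched as a simple mapping. The L2-L5 levels come from the text; the exact wording of each mechanism is an assumption for illustration:

```python
# Sketch of the enforcement ladder described above: higher levels
# are harder for an agent to bypass. Descriptions are assumptions,
# not the authors' exact definitions.
ENFORCEMENT_LADDER = {
    "L2": "prose rule in documentation (advisory only)",
    "L3": "template or scaffold (shapes output by default)",
    "L4": "test that fails the build",
    "L5": "hook that blocks the action entirely",
}
```

A promotion, in these terms, is moving a failure class from L2 up to L3, L4, or L5.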
Why the Gap Exists
1. No promotion pipeline. Teams have error logging but lack mechanisms to transform logged errors into structural prevention.
2. Promotion requires architecture. Real solutions mean writing L5 hooks or L4 tests that fail builds — not just updating documentation.
3. The 80/20 trap. Most violations are low-severity conventions. The 18 promotions targeted highest-leverage issues causing cascading failures or production breaks.
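A promotion pipeline like the one point 1 calls for can be sketched in a few lines. Everything here is a hypothetical illustration, not the authors' tooling: it records violations by failure class and flags the high-severity, recurring classes as candidates for structural enforcement, which is the 80/20 filter from point 3:

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical record of a single detected violation.
@dataclass(frozen=True)
class Violation:
    failure_class: str  # e.g. "skipped-test-suite"
    severity: int       # 1 = convention nit, 5 = production break

def promotion_candidates(violations, min_severity=4, min_count=3):
    """Pick failure classes worth promoting: both severe and recurring."""
    severe = [v.failure_class for v in violations if v.severity >= min_severity]
    counts = Counter(severe)
    return [cls for cls, n in counts.most_common() if n >= min_count]
```

The thresholds are placeholders; the point is that candidates are selected mechanically, so "logged" errors have a defined path toward becoming structurally impossible.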
Real Example
Violation: Coder agent committed code without running full test suites, breaking unrelated modules.
L2 Detection: Added a prose rule to documentation.
Problem: Agent violated it again within 2 days. Prose rules get lost in context compression.
L5 Promotion: Created a pre-commit hook running tests automatically. Commits fail if any test breaks.
Result: Zero violations in 30+ days. Prevention-by-construction achieved.
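The L5 promotion above can be sketched as a short script, assuming a Python repo that uses pytest (the runner command is an assumption; substitute your own). Saved as .git/hooks/pre-commit and marked executable, it runs the full test suite before every commit:

```python
#!/usr/bin/env python3
# Sketch of the L5 promotion: a pre-commit hook that runs the full
# test suite and blocks the commit if anything fails. The pytest
# command is a hypothetical default; swap in your test runner.
import subprocess
import sys

def run_tests(cmd=("pytest", "-q")) -> int:
    """Run the test suite and return its exit code (0 = all passed)."""
    return subprocess.run(list(cmd)).returncode

def main() -> None:
    code = run_tests()
    if code != 0:
        print("pre-commit: test suite failed; commit blocked.", file=sys.stderr)
    sys.exit(code)  # any non-zero exit makes git abort the commit

# Installed as a hook, the file ends with a bare call: main()
```

Because git refuses the commit on a non-zero exit, the agent cannot repeat the violation; the rule no longer depends on being remembered.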
How to Measure Your Ratio
Most teams cannot answer: "Of the errors your AI agents made, how many can never happen again?"
To measure:
- Count distinct failure classes (not individual errors)
- Count promotions — structural enforcement that makes violations impossible
- Divide violations by promotions
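The three steps above reduce to one division, sketched here with hypothetical counts. The zero-promotions branch covers the common case where nothing has ever been made structurally impossible:

```python
def promotion_ratio(failure_classes: int, promotions: int) -> str:
    """Detected failure classes per structural promotion, as 'N:1'."""
    if promotions == 0:
        return "infinity:1"  # detection with zero structural prevention
    return f"{round(failure_classes / promotions)}:1"

print(promotion_ratio(120, 3))  # hypothetical counts -> "40:1"
```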
A ratio of 265:1 is honest. Most production AI systems would sit at thousands-to-one, or infinity-to-one.
Regression Rates
Our 18 L5 promotions show < 5% regression rates. Once promoted to structural enforcement, violations rarely recur.
Compare this to L2 prose enforcement with > 40% regression rates — documentation gets forgotten or compressed out of context.
Enterprise Implications
If you are deploying AI agents in production, you have violations. The determining question isn't detection volume but promotion count.
We publish this ratio because transparency about the gap builds more credibility than pretending the gap does not exist. Every team running AI agents has this gap — most just don't measure it.
The path forward isn't better detection. It's building the pipeline that turns detected violations into structural enforcement, one promotion at a time.
Free codebase governance audit: walseth.ai