Most code review pipelines are designed to answer one question: is this code correct?
Linters check syntax. Static analysers check for known vulnerability patterns. Dependency scanners check for CVEs. Security scanners check for OWASP Top 10 failures. All of them operate at the code level. They look at what the code does in isolation.
None of them answer a different question that becomes increasingly urgent once your codebase grows, your team uses AI tools, and you are approaching your first serious audit or investor due diligence. That question is: does this codebase do what you said it would do?
I have been thinking about where this fits in a modern review pipeline, because the answer is not obvious.
Where it fits
The standard pipeline looks roughly like this:
1. Developer writes code (increasingly AI-assisted)
2. PR created, automated linter and formatter run
3. Code review (human and AI assist)
4. CI pipeline: unit tests, integration tests
5. Security scanner (Semgrep, Snyk, CodeQL)
6. Dependency audit (OSV, npm audit)
7. Merge to main
Every step from 2 to 6 (the linter run through the dependency audit) checks code quality, correctness, and security. This is what I am calling Level 1 verification. It is necessary. It is increasingly well-tooled. It is not sufficient.
What is missing is a layer I am calling intent verification. In the academic literature, the gap it targets is sometimes called requirements technical debt, defined as "the distance between the ideal value of the specification and the actual implementation of the system" (ScienceDirect).
The practical version of this question is: if you wrote a one-paragraph description of what your product was designed to do, how much of your codebase can be demonstrably traced back to that intent?
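To make that question concrete, here is a minimal sketch of what "traceable to intent" could mean as a number. Everything in it is an assumption for illustration: the `IntentClause` and `Module` types, the clause IDs, the file paths, and the choice to weight by lines of code are all hypothetical, and the hard part (deciding which clauses a module really serves) is taken as given.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class IntentClause:
    """One auditable statement of what the product is meant to do (hypothetical)."""
    clause_id: str
    text: str


@dataclass
class Module:
    """A unit of the codebase and the intent clauses it claims to serve."""
    path: str
    lines_of_code: int
    traces_to: set[str] = field(default_factory=set)


def traceability(modules: list[Module], clauses: list[IntentClause]) -> dict:
    """Summarise how much code maps to declared intent, weighted by size."""
    known = {c.clause_id for c in clauses}
    total = sum(m.lines_of_code for m in modules)
    # A module counts as traced only if it points at a clause that exists.
    traced = sum(m.lines_of_code for m in modules if m.traces_to & known)
    covered = set().union(*(m.traces_to for m in modules)) & known
    return {
        "traced_ratio": traced / total if total else 0.0,
        "untraced_modules": [m.path for m in modules if not (m.traces_to & known)],
        "unimplemented_clauses": sorted(known - covered),
    }


# Illustrative data only; none of these names come from a real project.
clauses = [
    IntentClause("I1", "Users can export invoices as PDF"),
    IntentClause("I2", "All payment data is encrypted at rest"),
]
modules = [
    Module("billing/export.py", 300, {"I1"}),
    Module("analytics/tracker.py", 100),            # serves no declared clause
    Module("billing/legacy_sync.py", 100, {"I9"}),  # points at an unknown clause
]
report = traceability(modules, clauses)
```

The sketch surfaces two different kinds of drift: code that exists without a declared reason (`untraced_modules`) and declared intent that no code claims to implement (`unimplemented_clauses`). Neither is a bug in the Level 1 sense, which is why linters and scanners do not report it.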
Why AI tools make this harder
Here is the counterintuitive part of this problem.
AI coding tools are remarkably good at writing locally correct code. They produce functions that compile, pass unit tests, and in many cases handle edge cases well. The problem is that they have no awareness of what the product is supposed to do at the system level. They respond to the prompt, not to the design intent.
When a developer manually implements a feature, the friction between reading the spec and writing the code creates an implicit checkpoint. When an AI tool generates the implementation from an informal prompt, that checkpoint disappears. The code can be locally excellent and systemically misaligned.
Sonar's 2026 research found that 42% of committed code is now AI-assisted and that technical debt increased by 30% to 41% in organisations that adopted AI coding tools. Some of that debt is at the code quality level. Some of it is at the intent level, where systems technically work but have drifted from their declared design.
Veracode's 2025 GenAI Code Security Report, which tested over 100 large language models across 80 coding tasks, found that 45% of AI-generated code introduces OWASP Top 10 vulnerabilities. That is a code quality finding. The intent drift problem sits a layer above it and is not captured by the same tools.
What intent-level verification actually requires
To verify intent alignment, you need three things that most code review pipelines do not have.
First, a declared intent. This is the plain-language description of what the product was designed to do. Most products have fragments of this scattered across a pitch deck, a product spec, and a README, but rarely a structured, auditable intent statement.
Second, a coverage mechanism that can evaluate the codebase against that intent across domains. Security is one domain. Architecture is another. Compliance mapping is a third. You need something that can read sections of the codebase and evaluate whether those sections reflect the declared intent, not just whether the code is technically correct.
Third, independent verification. If a single AI model is evaluating code that was likely written with the assistance of a similar AI model trained on similar data, you have a validation loop. The evaluator inherits the blind spots of the author. You need architectural diversity: multiple models from different providers, evaluated independently and then reconciled.
This is the architecture IntentGuard uses. The audit runs across multiple independent AI models that must reach consensus before a finding is surfaced. This is not just a confidence filter. It is a structural mechanism for reducing the blind spot problem. If one model's training has a gap in a particular area, the others are unlikely to share the same gap.
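The consensus idea can be sketched in a few lines. This is not IntentGuard's implementation, which is not described here; it is a simplified illustration under stated assumptions: the `Finding` type, the model names, and the `quorum` threshold are hypothetical, and findings are matched by exact equality, whereas a real system would need some way to reconcile findings that are worded differently.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    section: str    # part of the codebase the finding refers to
    dimension: str  # e.g. "security", "architecture"
    verdict: str    # e.g. "misaligned"


def consensus(findings_by_model: dict[str, set[Finding]], quorum: int) -> list[Finding]:
    """Keep only findings reported independently by at least `quorum` models."""
    votes: dict[Finding, set[str]] = defaultdict(set)
    for model, findings in findings_by_model.items():
        for finding in findings:
            votes[finding].add(model)
    kept = [f for f, models in votes.items() if len(models) >= quorum]
    return sorted(kept, key=lambda f: (f.section, f.dimension, f.verdict))


# Hypothetical outputs from three models run separately on the same codebase.
session_gap = Finding("auth/session.py", "security", "misaligned")
schema_gap = Finding("reports/schema.py", "architecture", "misaligned")
findings_by_model = {
    "model_a": {session_gap, schema_gap},
    "model_b": {session_gap},
    "model_c": {session_gap},
}
surfaced = consensus(findings_by_model, quorum=3)
```

With `quorum=3` only the session finding is surfaced; the schema finding, raised by a single model, is held back. Whether consensus should mean unanimity or a majority is a design choice this sketch leaves open through the `quorum` parameter.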
What this looks like in practice
When IntentGuard runs an audit, the output is not a list of bugs. It is a mapping of sections of the codebase evaluated against a set of dimensions that together answer the intent question: architecture coherence, security posture, compliance surface, dependency risk, and whether AI components are declared and governed where present.
For a pre-Series A founder heading into technical due diligence, the output is five persona-specific reports: Executive, Developer, Auditor, Investor, and TCO. Each answers a different version of the same question.
For an IT auditor mapping a system against ISO 27001, SOC 2, or DORA, the output is audit-grade evidence: not a developer's self-assessment, but a consensus finding from independent model evaluation.
Where this fits in your pipeline
Intent-level verification is not a replacement for Level 1 tooling. You still run Semgrep. You still run OSV. You still do code review. Those tools answer their questions well.
Intent-level verification sits above that pipeline. It is not something you run on every PR. It is something you run before a significant milestone: a fundraising round, a compliance audit, a new enterprise customer. It answers the question that Level 1 tools are not designed to answer.
The debt Vogels named is real. The intent-level debt, Level 2 in this framing, is older and deeper. Both accumulate every time AI writes code your team has not fully verified against what you said you were building.