DEV Community: Olebeng

Intent-Level Verification: The Audit Layer Above Your Code Review Pipeline

Olebeng — Tue, 19 May 2026 16:43:10 +0000

Most code review pipelines are designed to answer one question: is this code correct?

Linters check syntax. Static analysers check for known vulnerability patterns. Dependency scanners check for CVEs. Security scanners check for OWASP Top 10 failures. All of them operate at the code level. They look at what the code does in isolation.

None of them answer a different question that becomes increasingly urgent once your codebase grows, your team uses AI tools, and you are approaching your first serious audit or investor due diligence. That question is: does this codebase do what you said it would do?
I have been thinking about where this fits in a modern review pipeline, because the answer is not obvious.

Where it fits
The standard pipeline looks roughly like this:

Developer writes code (increasingly AI-assisted)
PR created, automated linter and formatter run
Code review (human and AI assist)
CI pipeline: unit tests, integration tests
Security scanner (Semgrep, Snyk, CodeQL)
Dependency audit (OSV, npm audit)
Merge to main

Every step from 2 to 6 checks code quality, correctness, and security. This is what I am calling Level 1 verification. It is necessary. It is increasingly well-tooled. It is not sufficient.

What is missing is a layer I am calling intent verification. In academic literature this is sometimes called requirements technical debt, defined as "the distance between the ideal value of the specification and the actual implementation of the system." (ScienceDirect.)

The practical version of this question is: if you wrote a one-paragraph description of what your product was designed to do, how much of your codebase can be demonstrably traced back to that intent?

Why AI tools make this harder

Here is the counterintuitive part of this problem.

AI coding tools are remarkably good at writing locally correct code. They produce functions that compile, pass unit tests, and in many cases handle edge cases well. The problem is that they have no awareness of what the product is supposed to do at the system level. They respond to the prompt, not to the design intent.

When a developer manually implements a feature, the friction between reading the spec and writing the code creates an implicit checkpoint. When an AI tool generates the implementation from an informal prompt, that checkpoint disappears. The code can be locally excellent and systemically misaligned.

Sonar's 2026 research found that 42% of committed code is now AI-assisted and that technical debt increased by 30% to 41% in organisations that adopted AI coding tools. Some of that debt is at the code quality level. Some of it is at the intent level, where systems technically work but have drifted from their declared design.

Veracode's 2025 GenAI Code Security Report, which tested over 100 large language models across 80 coding tasks, found that 45% of AI-generated code introduces OWASP Top 10 vulnerabilities. That is a code quality finding. The intent drift problem sits a layer above it and is not captured by the same tools.

What intent-level verification actually requires

To verify intent alignment, you need three things that most code review pipelines do not have.

First, a declared intent. This is the plain-language description of what the product was designed to do. Most products have fragments of this scattered across a pitch deck, a product spec, and a README, but rarely a structured, auditable intent statement.

Second, a coverage mechanism that can evaluate the codebase against that intent across domains. Security is one domain. Architecture is another. Compliance mapping is a third. You need something that can read sections of the codebase and evaluate whether those sections reflect the declared intent, not just whether the code is technically correct.

Third, independent verification. If a single AI model is evaluating code that was likely written with the assistance of a similar AI model trained on similar data, you have a validation loop. The evaluator inherits the blind spots of the author. You need architectural diversity: multiple models from different providers, evaluated independently and then reconciled.

This is the architecture IntentGuard uses. The audit runs across multiple independent AI models that must reach consensus before a finding is surfaced. This is not just a confidence filter. It is a structural mechanism for reducing the blind spot problem. If one model's training has a gap in a particular area, the others are unlikely to share the same gap.

What this looks like in practice

When IntentGuard runs an audit, the output is not a list of bugs. It is a mapping of sections of the codebase evaluated against a set of dimensions that together answer the intent question: architecture coherence, security posture, compliance surface, dependency risk, and whether AI components are declared and governed where present.

For a pre-Series A founder heading into technical due diligence, the output is five persona-specific reports: Executive, Developer, Auditor, Investor, and TCO. Each answers a different version of the same question.
For an IT auditor mapping a system against ISO 27001 or SOC 2 or DORA, the output is audit-grade evidence, not a developer's self-assessment, but a consensus finding from independent model evaluation.

Where this fits in your pipeline

Intent-level verification is not a replacement for Level 1 tooling. You still run Semgrep. You still run OSV. You still do code review. Those tools answer their questions well.

Intent-level verification sits above that pipeline. It is not something you run on every PR. It is something you run before a significant milestone: a fundraising round, a compliance audit, a new enterprise customer. It answers the question that Level 1 tools are not designed to answer.

The debt Vogels named is real. Level 2 is older and deeper. Both are accumulating every time AI writes code your team has not fully verified against what you said you were building.

How a TypeScript RPC library got classified as EU AI Act High Risk. A technical post-mortem.

Olebeng — Tue, 12 May 2026 12:30:18 +0000

This is a post-mortem on a specific failure mode in AI-powered code audit tooling. It involves tRPC, the EU AI Act, and a word that means two different things depending on context.

tRPC is a TypeScript-first RPC framework. Its transformer components handle data serialisation. Not machine learning. Not neural networks. Data serialisation.

An automated intent audit classified it as EU AI Act High Risk.
Here is exactly what happened.

The classification mechanism

AI-powered audit tools that evaluate codebases against regulatory frameworks need to determine which frameworks apply. For the EU AI Act, the relevant question is whether the system is an AI system and, if so, whether it falls into a high-risk category.

That determination is based on two inputs: the product description the user provides, and any AI or ML components detected in the code.

When the product description contains language that overlaps with EU AI Act high-risk domain vocabulary — health, medical, employment, education, critical infrastructure, transport — the system classifies the codebase as potentially high-risk and evaluates it accordingly.

The specific overlap

The second tRPC audit used a description that mentioned "transport layer" and "WebSocket transport" — accurate technical descriptions of how tRPC handles communication. The word "transport" overlapped with the critical infrastructure domain in the EU AI Act classifier.

Classification: High Risk. The LLM was then given context that included the High Risk classification and instructed to produce findings about it. Five AI Governance findings appeared in the report, all referencing EU AI Act obligations for high-risk AI systems. A TypeScript serialisation library with no AI components.

The same report stated, separately: "No AI components detected — EU AI Act: Not Applicable."

Both outputs were in the same document.

The fix

The third audit used a description that explicitly disambiguated the transformer components: data serialisation utilities, not AI architectures. The codebase contains no machine learning models, no LLM integrations, no AI decision-making components. The word "transformer" refers to data format transformation (superjson, devalue) not the attention mechanism in neural networks.

EU AI Act: Not Applicable. ISO 42001: Not Applicable. Genuine findings surfaced.

Three lessons for developers

One: when writing a product description for an AI-powered audit, be explicit about what components are not, not only what they are. Terminology overlap with regulated domain vocabulary is common and non-obvious.

Two: internal contradiction in an AI-generated report is not a minor formatting issue. "No AI components detected" and "EU AI Act High Risk" cannot both be true. If you see a contradiction like this, investigate the description rather than the findings.

Three: the transformer naming collision is a real and growing problem. HuggingFace's transformers library, tRPC's pluggable transformers, and the attention mechanism in ML architectures all use the same word in entirely different contexts. Any audit tool doing keyword-based AI component detection needs to handle this explicitly.

The scores

Audit v1 — 80 Healthy. Original description.
Audit v2 — 47.6 Critical. "transport" triggered High Risk classification.
Audit v3 — 61.5 Needs Attention. Corrected description. Accurate output.

The code did not change across all three audits.

The EU AI Act classified a TypeScript data serialisation library as High Risk. Here is what happened.

Olebeng — Wed, 06 May 2026 14:09:55 +0000

On 21 April I audited trpc/trpc, the TypeScript library for building end-to-end type-safe APIs. Score came back at 80. Healthy. Three High findings, 58% confirmation rate.

On 24 April I re-audited with a corrected product description. Score dropped to 47.6. Critical Risk. Three new High findings under AI Governance appeared in the sections evaluated by the AI Governance agent.

The reason: tRPC's "transformer" components were classified as High Risk under the EU AI Act.

tRPC has no machine learning components. It does not process model outputs. It does not make AI decisions. The transformer in tRPC's codebase is a data serialisation utility that handles how data is encoded and decoded across the client-server boundary. The word "transformer" is used in its original computer science sense, predating the AI context by decades.

What the three High findings stated

High AI Governance: High-risk AI system classification under EU AI Act without declared controls. The codebase is classified as high-risk due to transformer-based data processing, but lacks declared controls for transparency and risk management. Cited to packages/openapi/test/heyapi.test.ts:1–10.
High AI Governance: Missing output handling controls for AI data serialisation. Transformer components process serialised data without output validation, violating OWASP LLM05:2025. Cited to packages/openapi/test/heyapi.test.ts:10–15.
High AI Governance: EU AI Act High Risk classification — data transformation lacks specific risk mitigation.

Is this finding correct?

The honest answer is: it is technically defensible under a literal reading of the EU AI Act framework text, but a human auditor with full context would likely classify it differently.

The AI Governance agent evaluated the codebase against the framework text. The framework defines "AI system" broadly enough that automated evaluation of a codebase containing transformer-named components produces this result. The LLM models that evaluated the chunks received the EU AI Act risk level classification built from the intent model and reached consistent conclusions.

The tRPC confirmations in the same report tell a different story about the codebase: "No AI/ML Components Detected — EU AI Act Classification: Not Applicable" appeared as a confirmed finding alongside the High Risk classification. Both the confirmation and the violation came from the same analysis. The High Risk finding prevailed in scoring because of severity weighting rules.

This is not a product defect. It illustrates a genuine ambiguity in how AI governance frameworks apply to modern software. The EU AI Act definitions were written before transformer architecture became the dominant pattern in software naming conventions. The gap between "this component shares a name with AI architecture" and "this component is an AI system" requires human interpretation that automated analysis cannot yet consistently provide.

What this means for TypeScript developers

If your TypeScript codebase contains components named transformers, models, agents, pipelines, or inference, an automated AI governance evaluation will flag them for EU AI Act compliance review. That does not mean your codebase is non-compliant. It means the product description must explicitly state which components are AI systems and which are not.

A corrected description for tRPC that explicitly declares transformer components as data serialisation utilities with no AI characteristics would likely produce a different classification. I will publish that result when the re-audit runs.

The broader point stands regardless: as AI governance frameworks move from policy documents to enforcement instruments, the boundary between software that falls under them and software that does not will need to be stated explicitly in documentation, not inferred from code structure. IntentGuard surfaces where that documentation is missing.

Waitlist at intentguard.dev.

We audited the same codebase twice. The score went down. The audit got better. Here is why.

Olebeng — Wed, 29 Apr 2026 08:26:45 +0000

Score Down, Audit Better

On 12 April I ran an Intent Audit on envelope-zero/backend, an open-source Go REST API for personal envelope budgeting. The score came back at 79 with three Critical findings: no authentication at the API layer, no encryption for financial data, and an unprotected Delete Everything endpoint.

On 25 April I re-audited the same codebase with a corrected product description. The score dropped to 71.5. The three Critical findings became two High findings. The confirmation rate went from 57% to 67%. The Technical Readiness Score went from 70 to 76. Architecture maturity went from Level 2 to Level 3.

The code did not change between the two audits. Here is what did, and why it produced a more accurate result.

How an Intent Audit actually works

An Intent Audit operates on two separate inputs simultaneously. The first is the stated intent: what you say the codebase is designed to do, derived from the product description you provide. The second is the implementation evidence: what the code analysis independently surfaces in the sections of the codebase evaluated for each domain.

The audit produces both outputs and measures the distance between them. A finding is not simply "this code has a problem." A finding is "this code does not do what it was stated to do" or "this code has a characteristic that creates risk given its stated purpose."

The product description does not control what the code analysis finds. It establishes the intent baseline against which findings are contextualised. A precise description produces findings calibrated to the actual system. A generic description produces findings calibrated to a generic system that may not match what was actually built.

What changed between the two audits

The first audit used a minimal product description. Without context about the deployment model, the system type, or the specific compliance obligations that apply, the intent model evaluated the codebase as a generic financial API. Three findings were classified as Critical against that generic baseline.

The second description stated precisely what this system is: a self-hosted Go REST API for personal envelope budgeting, deployed on private infrastructure, with specific compliance obligations under GDPR Art. 32 and OWASP ASVS. Intent model confidence went from 74% to 82%.

With a more accurate intent baseline, two of the three Critical findings were reclassified. They were not wrong findings. They were correctly identified characteristics of the codebase that, when evaluated against the actual stated purpose of the system, carried lower severity than a generic Critical classification implied.

The Delete Everything endpoint at internal/controllers/v4/cleanup.go:13–18 remained in both audits. The code analysis identified it independently in both scans. The correct description did not make it go away. It made the other findings more precise, so this one stands out as it should.

The practical lesson

Before submitting a codebase for an Intent Audit, write the product description as the primary input it is. State what the system is and what it is not. State what data it handles, with specific sensitivity classification. State the compliance obligations that apply by name. State what the system deliberately does not implement and what it delegates to other layers. If there are AI components, name them explicitly.

A description that answers those questions establishes an accurate intent baseline. The audit then measures the gap between that baseline and what the code analysis finds in the sections evaluated. That gap is the finding set that is worth acting on.

IntentGuard is in final pre-launch hardening. Waitlist at intentguard.dev.

We audited the same codebase twice. The score went down. The audit got better. Here is why.

Olebeng — Tue, 28 Apr 2026 08:50:47 +0000

Score Down, Audit Better

On 12 April I ran an Intent Audit on envelope-zero/backend. The score came back at 79 with three Critical findings. No authentication at the API layer. No encryption for financial data. An unprotected Delete Everything endpoint.

The code did not change between the two audits. Here is what did.

Why the product description is the primary input

An Intent Audit does not just run static analysis against a codebase. It first builds an intent model. This is a structured representation of what the codebase is supposed to do, derived from the product description you provide alongside the codebase's own README and documentation. The intent model determines what the findings are evaluated against.

The first audit used a minimal product description that described a generic REST API. The second used a precise description that stated the deployment model (self-hosted Go binary, personal infrastructure), the compliance obligations that apply (GDPR Art. 32, OWASP ASVS), the specific audit concerns (authentication, encryption, destructive endpoint access control), and what the codebase deliberately does not handle (multi-tenancy, payment processing, PII beyond financial records). Intent model confidence went from 74% to 82%.

Why the score went down when the findings improved

The three Critical findings in the first audit were severity overclassifications that the corrected description resolved. "No authentication at the API layer" was Critical in the first audit. In the second, the corrected description gave the Intent Agent the context to evaluate whether authentication was expected at this specific layer given the deployment model. The finding was reclassified as a Medium architecture observation.

The Delete Everything endpoint remained. Both audits identified it. Both confirmed it across two independent models. The corrected description did not make it go away. It made it clearer, placing it correctly as a High compliance finding under OWASP ASVS V4.2 rather than as a generic Critical risk.

A Critical finding carries a higher score deduction than a High finding under CVSS-derived scoring. Removing two overclassified Criticals reduced the deduction. That is why the score dropped.

The practical lesson

The time you spend writing a precise product description before running an Intent Audit is the highest-leverage work in the entire process. A description that answers these questions consistently produces better findings:

What is this exactly — a library, a deployed application, an API, a framework? What data does it handle, with specific sensitivity classification? What compliance obligations apply by name? What does it deliberately not implement, and what does it delegate to the application layer or platform? Are there AI components, declared specifically?

A generic description produces findings calibrated to a generic system. A precise description produces findings calibrated to what this specific codebase was actually designed to do. The score is a summary. Getting the description right is the prerequisite for a summary that means something.

IntentGuard is in final pre-launch hardening. Waitlist at intentguard.dev.

The Lovable breach is not a vibe coding story. It is a verification story.

Olebeng — Thu, 23 Apr 2026 07:19:16 +0000

On 20 April 2026, a security researcher posted that any free account on Lovable — the AI coding platform valued at $6.6 billion — could access another user's source code, database credentials, AI chat histories, and live customer data. The vulnerability had been reported 76 days earlier and was never properly escalated.

Within 24 hours, the coverage framed it as a vibe coding security crisis. The framing is understandable. It is also imprecise. And the imprecision matters, because it points to the wrong solution.

What the vulnerability actually was

The flaw was a Broken Object Level Authorisation vulnerability — BOLA. Ranked #1 in the OWASP API Security Top 10. The API checked whether a user was authenticated. It did not check whether that authenticated user had permission to access the specific resource being requested. Five API calls from a free account was enough to retrieve another user's full project.

BOLA is not exotic. It is the most prevalent API security failure in production systems globally — which is exactly why OWASP ranks it first. It appears consistently in manual penetration tests, automated scans, and incident disclosures. What made the Lovable case notable was the scale: every project created before November 2025 was potentially affected. And the timeline: a backend permissions change on February 3rd accidentally re-enabled access to public project chats, researchers reported it on February 22nd and March 3rd, both reports were closed without escalation, and the vulnerability remained open until public disclosure on April 20th.

Why this keeps happening

The vibe coding framing suggests the problem is AI generating insecure code. That is part of it — between 40% and 62% of AI-generated code contains security vulnerabilities depending on the study, and Georgia Tech tracked 35 CVEs attributed to AI coding tools in March 2026 alone. But the Lovable platform-level vulnerability was not in the generated code. It was in the platform's own API authorisation layer.

The deeper pattern — the one that connects the platform incident to the generated code incidents — is the same in both cases: there is no systematic verification that the code enforces what the product is supposed to do.

When an AI generates a full-stack application from a natural language prompt, it optimises for functional correctness. The app loads. The data displays. The user flow works. What the AI does not do is reason about your threat model. It does not check whether the access control logic it generated matches the ownership boundaries you intended. It does not verify that the authentication function it wrote actually blocks what it is supposed to block. One of the Lovable incidents from February 2026 involved exactly that failure — inverted authentication logic that granted anonymous users full access while blocking authenticated ones. The intent was to restrict access. The implementation did the opposite. The code ran without errors.

The verification layer that does not exist yet

Every major AI coding platform generates code. None of them systematically verify that what was generated enforces what the product was designed to do.

This is the gap. Not the code quality gap — there are SAST tools for that. Not the dependency vulnerability gap — there are SCA tools for that. The gap between stated product intent and actual code behaviour — the question of whether the system you described is the system that was built — has no systematic answer in the current tooling landscape.

That is the gap I have spent the last year building IntentGuard to close. It is an automated code audit platform that reads your stated product intent against your actual codebase and produces structured findings — with file paths, line references, and framework control mappings — on every place where the two diverge. It maps against 16 compliance frameworks including OWASP API Top 10 on every audit. BOLA is covered. Not because of the Lovable incident — it was covered before I read this story. It is covered because BOLA is #1 on OWASP API Top 10, which means it belongs in every serious compliance audit of every API-facing codebase.

What you should do right now

If you have projects on any vibe-coded platform created before late 2025:
Rotate every API key and database credential. Do not wait to confirm whether you were specifically affected — assume you were and rotate now. The cost of rotating credentials you did not need to rotate is low. The cost of not rotating credentials that were exposed is not.

Audit your Supabase row-level security. The majority of vibe-coded platforms provision Supabase backends with RLS disabled. If you have a Supabase project connected to a Lovable, Bolt.new, or similar app, check whether RLS is enabled on every table that stores user data. The Supabase dashboard shows this per table.

Scan for hardcoded secrets in your source code. Tools like GitGuardian, Gitleaks, or TruffleHog will run against your repository and flag credentials embedded in code. This takes less than ten minutes to set up.

For every API endpoint that returns or modifies user-specific data: write a test that authenticates as User A, requests a resource belonging to User B, and verifies the response is a 403. This is the single most effective test for BOLA and it is the one that most vibe-coded applications skip entirely.

The broader point

The Lovable incident is one data point in a consistent trend. The trend is that code is being generated and shipped at a rate that human review cannot match, into production systems that handle real user data, without any systematic verification that the code enforces what the product was designed to do.

That is a solvable problem. The solution is not to slow down AI-assisted development — that ship has sailed. The solution is to build the verification layer that currently does not exist.

IntentGuard is in final pre-launch hardening. Waitlist at intentguard.dev if this is directly relevant to what you are building.

Why running every compliance framework on every codebase is wrong - and how we fixed it

Olebeng — Tue, 14 Apr 2026 06:59:33 +0000

When we first built the compliance agent in IntentGuard, it ran every framework against every codebase.

The result was technically thorough and practically useless.

A Go REST API with no payment processing was being evaluated against PCI DSS. A Python data pipeline with no personal data handling was generating GDPR findings. A non-AI internal tool was receiving EU AI Act violations as its most prominent output.

The findings were not wrong, exactly. They were irrelevant. And in audit contexts, irrelevant findings are worse than no findings - they train reviewers to ignore output, which is the opposite of what you want.

The problem with framework-agnostic scanning

Most compliance tools apply frameworks uniformly. You select the frameworks you want evaluated, and the tool checks the codebase against all of them equally. This approach has a surface-level logic to it - better to check more than less.

The problem is that compliance frameworks are not generic. PCI DSS applies to systems that process payment card data. HIPAA applies to systems handling protected health information. DORA - the EU's Digital Operational Resilience Act - applies to financial sector entities providing ICT services. Running these frameworks against a codebase that does not fall within their scope produces noise, not signal.

Worse: when a finding from an inapplicable framework appears at the same severity as a finding from an applicable one, the auditor has to mentally filter. That filtering work defeats the purpose of automation.

How we addressed it

Before any LLM call, we now run a deterministic classification step. It reads the intent model — the structured representation of what the product was designed to do — and classifies each framework as applicable or not applicable based on what the codebase actually is.

The classification is deterministic: no probability, no inference, no LLM. It looks for specific signals in the product description and inferred architecture. A codebase described as processing financial account data and using PCI DSS relevant patterns gets PCI DSS evaluated. One that does not, does not.

When a framework is not applicable, the compliance agent is instructed to produce a single informational finding: "[Framework] — Not applicable to this codebase." Not a critical violation. Not a high severity gap. An informational acknowledgement that the framework was considered and excluded.

The result is a compliance grid that reflects the codebase's actual regulatory context — not a generic checklist applied uniformly to everything.

Why this matters for the findings you get

Five frameworks are universal — they apply to every codebase regardless of type: ISO 27001, SOC 2, OWASP ASVS L2, NIST CSF, and CIS Controls v8.

These are the baseline for any modern software system.

The remaining eleven frameworks are conditional. GDPR activates on personal data handling. DORA activates on financial sector context. HIPAA activates on health data signals. OWASP API Top 10 activates on REST or GraphQL API patterns.

This means an IT auditor reviewing a financial services platform gets a compliance grid dominated by the frameworks that matter to their client — not one where ISO 42001 and EU AI Act appear at the top because those happen to be in the list.

The scope question

The obvious challenge with deterministic scoping is edge cases. A codebase that does not explicitly declare payment processing but accepts card numbers through a generic input handler would not trigger PCI DSS through intent model signals alone — it would surface through the Security Agent's findings instead.

This is by design. The scoping step uses the intent model, which comes from the product description the user provides. If the description is accurate, the scoping is accurate. If the description is incomplete, the user is told the confidence is low and prompted to provide more context.

The Security Agent, the Dependency Agent, and the Architecture Agent all run regardless of framework scoping. A PCI DSS relevant vulnerability will still appear as a security finding even if PCI DSS framework evaluation is scoped out. The framework compliance grid and the security finding list are separate outputs from separate agents.

Building IntentGuard in public from Johannesburg. If you have worked on compliance tooling and have thoughts on the framework scoping problem — particularly around edge cases — I would like to hear them in the comments.

The concepts discussed are my own, the presentation and formating of this post is enhanced by an AI assitant.

Olebeng · Founder, IntentGuard · intentguard.dev

Why we only accept .txt for document uploads - and why that is the right call for now

Olebeng — Mon, 06 Apr 2026 16:45:26 +0000

IntentGuard lets users upload specification documents alongside their repository when submitting an audit. The Intent Agent uses these documents — a product requirements document, an architecture spec, an API reference — to build a higher-confidence model of what the codebase was supposed to do before reading a single line of code.

Currently, we only accept .txt files.

Every few days someone asks why. The honest answer is worth a post.

PDF is not a text format

When you open a PDF in a viewer, you see clean, readable text. What the viewer is actually doing is interpreting a stream of rendering instructions — glyph positions, font mappings, coordinate transforms — and reconstructing what looks like text from absolute positions on a page.
pdfminer.six, the standard Python library for PDF text extraction, reverses this process. It reads the rendering instructions, maps glyphs to Unicode characters using whatever font encoding the PDF creator chose, and attempts to reconstruct reading order from the x/y coordinates of each glyph.

This works well for simple, single-column, machine-generated PDFs. For anything more complex — multi-column layouts, tables, scanned documents, PDFs exported from tools that embed fonts as bitmaps — the extracted text can look plausible while being subtly corrupted. Column order gets swapped. Table cells merge. Headers appear in the middle of paragraphs.
Corrupted structure passed to an intent analysis pipeline does not produce an obvious error. It produces quietly wrong intent claims — which is worse.

The security concern

PDFs can contain embedded JavaScript, OpenAction triggers that fire on open, malicious stream objects, and external URI references. Processing untrusted PDFs without a purpose-built sandboxed parser is a real attack surface. pdfminer has had CVEs. Handling untrusted binary formats in a pipeline that processes proprietary codebases is not a decision to make under time pressure.

DOCX has a different surface: Office Open XML relationships to external resources, embedded objects, and macro containers. python-docx handles the common case cleanly but edge cases involving embedded objects or external references require careful sanitisation before any content reaches the analysis layer.

Why .txt is not a cop-out

A plain text file is deterministic. There is no binary parsing, no font mapping, no coordinate reconstruction, no embedded objects. It goes into the chunker directly. Its encoding is validated at upload. Its size is enforced client-side at 50KB per file, up to five files.

The result is that a founder who pastes their product spec into a .txt file gets more reliable intent analysis than one who uploads a beautifully formatted PDF that extracts poorly. Readable structure matters more than file format.

What is coming

PDF and DOCX upload support is in the Phase D roadmap. The correct approach is a purpose-built extraction pipeline with: sandboxed processing, content validation before the text reaches the chunker, encoding normalisation, and its own test suite. It deserves a dedicated build session and a security review — not a quick dependency add before launch.

Until then: .txt, and it works well.

Building IntentGuard in public from Johannesburg 🇿🇦. If you have built document ingestion pipelines that handle untrusted binary input safely, I'd like to hear how you approached the sandboxing problem.

The concepts discussed are my own, the presentation and formating of this post is enhanced by an AI assitant.

Olebeng · Founder, IntentGuard · intentguard.dev

Why the same codebase should always produce the same audit score

Olebeng — Thu, 02 Apr 2026 05:04:11 +0000

There is a failure mode in AI-powered analysis tools that does not get talked about enough, and we ran into it directly.

When you submit the same repository twice — same commit, same inputs, same everything — you should get the same score. If the score changes between runs, the audit is not an audit. It is a random sample.

Early in testing, we observed score variance across consecutive runs on identical inputs. Not small variance. Meaningful swings — enough to change the risk interpretation of a codebase entirely. A score that sits in one category on one run and a different category on the next is worse than useless for the people who depend on it most: founders preparing investor materials, compliance leads building audit evidence, CTOs making remediation decisions.

This is a structural problem with LLM-based analysis, not an implementation bug, and it has a structural cause.

Where the variance comes from

Large language models are probabilistic by default. They sample from a probability distribution when generating output. The "temperature" setting controls how much randomness is introduced — higher temperature means more creative, more varied output. Lower temperature means more consistent, more deterministic output.

For creative tasks — writing, ideation, brainstorming — temperature is a feature. For security analysis, compliance mapping, and architectural assessment, temperature is a liability.

An LLM running at a non-zero temperature will produce slightly different findings on the same code across consecutive runs. Different findings feed into the scoring model. Different scores come out. The same codebase looks different on Tuesday than it did on Monday for no reason that reflects anything about the code.

The fix and what it requires

Setting temperature to zero eliminates sampling randomness. Given the same inputs, the model produces the same outputs. That is the starting point.
But there is a second layer of variance that temperature alone does not solve: finding confidence weighting. When multiple independent models analyse the same code, they may reach different conclusions on borderline cases. How those disagreements are resolved affects the final score — and if the resolution is inconsistent, variance returns through a different door.

IntentGuard uses a consensus pipeline across up to four independent AI models per finding. For the scoring model to be deterministic, the consensus logic itself must be deterministic — the same set of model votes must always produce the same confidence-weighted outcome.

We use CVSS v3.1-derived severity scoring as the foundation. CVSS is an industry standard specifically designed for this purpose: reproducible, quantifiable risk scores that two different analysts, given the same evidence, will calculate the same way. Mapping LLM-generated findings to CVSS-derived scores gives the scoring model a deterministic anchor — the same evidence produces the same deduction, every time.

Why this matters more for some users than others

For a developer running a quick check, score consistency is a nice-to-have. For the use cases IntentGuard is built for, it is non-negotiable. A VC performing technical due diligence on a portfolio company needs to know that the score they see reflects the actual state of the codebase — not the state it happened to be in on the particular run they triggered. A compliance lead building audit evidence needs findings that are reproducible and defensible. A founder preparing investor materials cannot present a Technical Readiness Score that might have read differently yesterday.

Deterministic scoring is what separates an analytical instrument from a magic eight ball.

The test that now passes

The gate we set for ourselves was simple: submit the same repository three times in succession with identical inputs and confirm the score is identical across all three runs.

That gate is now passing. 368 automated tests, including the determinism checks, are green.

Building IntentGuard in public from Johannesburg 🇿🇦. If deterministic analysis in multi-model AI pipelines is something you have thought about — whether you agree with the approach or see gaps — I would like to hear it in the comments.

The concepts discussed are my own, the presentation and formating of this post is enhanced by an AI text editor.

Olebeng · Founder, IntentGuard · intentguard.dev

We read the spec before we read the code. Here is why that changes everything.

Olebeng — Tue, 24 Mar 2026 07:30:08 +0000

When a repository is submitted to IntentGuard, the first thing the pipeline does is nothing that any other code analysis tool does.

It does not read the code.

It reads what the code was supposed to do.

That single design decision — reading intent before reading implementation — is the architectural foundation everything else is built on. I want to explain why we made it, what it requires, and what it changes about the findings you get out the other side.

The question nobody was asking automatically

Every code analysis tool in existence — static analysers, linters, security scanners, SAST platforms — starts from the same place. It reads the code and asks: what is in here? What patterns are dangerous? What vulnerabilities exist?

These are useful questions. There are excellent tools answering them.
The question none of them ever asked is: does this code do what it was designed to do?

Not "is this code clean?" Not "is this code secure?" But: does this implementation reflect the product that was specified, promised to users, committed to investors, and stated in the compliance documents?

That is a different question. And it turns out, you cannot answer it if you start from the code — because the code itself cannot tell you what it was supposed to be.

Pass 1 — Building the intent model

The first pass of the Intent Agent never receives source code. This is an architectural constraint, not a configuration option.

It receives the human-stated intent: the product description the user writes at audit time, the README, any specification documents that have been uploaded, and the repository file tree — directory structure and file names only, no content.

From these inputs, it constructs what we call the Intent Model — a structured representation of what this product was designed to do. What features were claimed. What non-functional properties were promised. What deployment context was assumed. What compliance obligations were stated.
The Intent Model is the baseline. Every finding in an IntentGuard audit is anchored to a claim in the Intent Model — not a pattern in the code, not a rule in a rulebook, but a specific thing the product was supposed to do or be.

There is an important epistemic reason why Pass 1 never reads the code. If it did, it would build an intent model anchored to what the code does — and would naturally generate claims that match the implementation. That defeats the entire purpose. The intent model must come from human-stated intent, not from what the code actually contains. The gap between those two things is the product.

When the inputs are rich — a detailed description, a thorough README, uploaded specification documents — the resulting Intent Model is high confidence and highly specific. When the inputs are thin — a two-sentence description and no documentation — the Intent Model is weaker, and the audit report says so explicitly. Garbage in, limited analysis out. We tell users when this is the case rather than pretending otherwise.

Pass 2 — Comparing intent against evidence

Pass 2 receives the Intent Model and does something that is not sending the entire codebase to a language model.

It retrieves semantically relevant code chunks.

For each claim in the Intent Model, we embed the claim and retrieve the code most likely to confirm or contradict it — using vector similarity against the embedded code chunks stored at ingestion time. The model never sees the full codebase. It sees the code that is most relevant to each specific intent claim.

This matters for two reasons. First, it is faster and cheaper than full-codebase analysis. Second, and more importantly, it produces better results — because a model asked to evaluate one specific claim against relevant evidence will outperform a model given thousands of lines of unrelated code and asked to find everything wrong with it.

For each intent claim, Pass 2 produces one of two finding types:
confirmation or violation.

A confirmation means the code evidence supports the claim. The feature was implemented as stated. The architectural constraint was respected. The compliance obligation is present in the implementation.

A violation means the code contradicts the claim. The feature was stated but not implemented. The architectural constraint was declared and silently ignored. The compliance obligation exists in the spec and is absent from the code.

Both types matter. This is one of the things that makes IntentGuard structurally different from tools that only report problems — 30 to 40 percent of every audit report is confirmations, because knowing what is solid is just as useful as knowing what needs fixing. A codebase where 85 percent of intent claims are confirmed is not a failing codebase. It is a codebase with a known, bounded set of gaps. That is a very different thing to work with.

Why this changes what findings mean

Most security and code analysis findings are context-free. "Hardcoded credential detected at line 47" is a finding about the code. It is real and it matters.

An IntentGuard finding is different. It is a finding about the relationship between the code and the intent behind it.

"This product stated that all user data would be processed in the EU. The database connection string defaults to a US-East endpoint" is not just a configuration finding. It is an intent mismatch — the code contradicts a specific commitment that was made about the product.

That is a categorically different kind of finding. It has different stakeholders, different urgency, and different remediation logic. A developer finding the first one fixes a config. An exec or investor seeing the second one understands a business risk.

After Pass 2 completes, the Intent Model is passed to five specialist agents — Architecture, Security, Compliance, AI Governance, and Dependency — each of which independently audits the codebase against that shared baseline. None of them receive each other's outputs. All of them work from the same Intent Model.

That shared baseline is what makes the findings from different agents comparable, composable, and trustworthy.

The part that surprised us most

When we started running audits on AI-generated codebases, we expected to find security issues. We expected to find dependency vulnerabilities. We expected to find compliance gaps.

What we did not expect was how consistent the intent drift pattern was.
Codebases built with AI coding assistants — Cursor, Copilot, Claude, Gemini — tend to implement features correctly in isolation. Individual functions work. Tests pass. The CI pipeline is green.

But over iterations, the implementation drifts from the intent. Architectural constraints that were stated in the original design are quietly reversed by an AI assistant that did not have that context. Compliance obligations that were present in the product description are absent from the implementation because they were never included in a prompt. Data flows that were specified as EU-only end up routing through US infrastructure because the assistant made a sensible default choice without knowing the regulatory requirement.

None of this shows up in a security scan. None of it triggers a linting rule. It only surfaces when you compare the code against the intent — which is exactly what the two-pass pipeline was designed to do.

Building IntentGuard in public from Johannesburg 🇿🇦. If you are thinking about the intent-vs-implementation gap in AI-generated codebases, or have questions about the retrieval architecture, I would like to hear from you in the comments.

The concepts discussed are my own, the presentation and formating of this post is enhanced by an AI text editor.

Olebeng · Founder, IntentGuard · intentguard.dev

Hello Dev.to - we are building the world's first automated Intent Audit platform

Olebeng — Tue, 17 Mar 2026 16:05:02 +0000

Hi Dev.to

I am Olebeng, a solo founder based in Johannesburg, South Africa, and this is the first post from the IntentGuard account.

I want to start by being direct about what we are, what we are not, and why I think the problem we are solving matters to this community specifically.

What IntentGuard is

IntentGuard is an automated Intent Audit platform.

That is a category that does not exist yet. We are building it.

The core question we answer is one that no tool has ever been able to answer automatically:

Does your code do what it was supposed to do?

Not "does your code have vulnerabilities?" Not "does your code pass your linting rules?" Those questions already have excellent tools answering them.

The question nobody has answered automatically is whether your code still reflects the intent behind it — the product description, the architecture decisions, the compliance obligations, the promises made to users.
That gap is what IntentGuard audits.

Why this matters right now

If you have been building with Cursor, Copilot, Claude, or any AI coding assistant, you already know the speed is extraordinary. You can go from idea to working prototype in hours.

What you might not know yet - but will find out at the worst possible moment - is that AI-generated code has a specific failure mode that no existing tool catches: intent drift.

The code works. The tests pass. The CI pipeline is green.

But the code no longer reflects what the product was designed to do. Data flows that were never supposed to exist. Compliance obligations that were stated in the spec and silently dropped in implementation. Architecture decisions that made sense in week one and were quietly reversed by an AI assistant in week six.

This is not a criticism of AI coding tools. It is the next problem to solve.

What we have built so far

IntentGuard is eight sessions into a ten-session build. Here is where we are:

A two-pass Intent Agent that constructs a model of what a product was supposed to do — before reading a single line of code
Five specialist agents (Architecture, Security, Compliance, AI Governance, Dependency) that each independently audit the codebase against that intent model
A multi-LLM consensus pipeline — up to 4 independent models per finding, so no single model's hallucination makes it into a report
Four persona-specific reports from one scan: Executive, Developer, Auditor, Investor

I am building this in public because I think the architecture decisions we have made - particularly around the intent reconstruction pipeline and the zero-data-retention sandbox - are worth discussing openly.

What I will be posting here

Technical articles. How the Intent Agent actually works. How we do deterministic diffing without hallucinated PRs. How we enforce multi-LLM consensus without producing contradictory outputs. Real architecture decisions with real trade-offs.

No marketing. No "10 reasons you need IntentGuard." If the technical work is not interesting enough to stand on its own, no amount of copy will fix that.

If you are building with AI coding tools, dealing with vibe-coded codebases, Investing is start-ups or thinking about the intent-vs-implementation gap - I would like to hear from you.

What is the hardest part of maintaining alignment between what you intended to build and what the code actually does?

The concepts discussed are my own, the presentation and formating of this post is enhanced by an AI text editor.

Olebeng
Founder, IntentGuard · intentguard.dev