Your Test Suite Now Mostly Proves the AI Agrees With Itself

Hung Nguyen Van — Sat, 30 May 2026 08:45:28 +0000

Picture a renewal call. The client is happy with the work. Then they ask one fair question.

"This feature here. Show me the requirement it came from, and the test that proves it does what we asked."

You know the suite is green. You know coverage is high. And you realize you can't actually answer. Not quickly, not with evidence. You can show that the tests pass. You can't show that the code does what the spec said.

That gap used to be tiny. In the AI coding era it has quietly become the most expensive thing in your codebase, and almost nobody is measuring it.

How a green build became a feeling instead of a fact

For most of software's history, the spec, the code, and the test came from different minds. A person wrote the requirement. A person wrote the code. Tests sat there as an outside check. When all three lined up, the agreement meant something, because three independent readings had converged.

Now one model reads the spec, writes the code, and writes the test in the same breath. If it misreads the requirement, the code is wrong and the test is wrong in the exact same way. The test passes. Everything is green. The spec is broken and the screen tells you you're fine.

A passing test used to mean an independent check agrees the code is correct. Today it usually means the model is consistent with itself. Most of your green suite is the AI grading its own homework.
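Here is a minimal sketch of that failure in Python. Everything in it is hypothetical: the requirement ("orders over $100 ship free"), the threshold, and the function names are invented for illustration. Assume the model reads "over $100" as "$100 or more" and then writes both the code and the test from that same reading.

```python
from decimal import Decimal

# Hypothetical threshold taken from the imagined spec line "orders over $100 ship free".
FREE_SHIPPING_THRESHOLD = Decimal("100.00")


def qualifies_for_free_shipping(order_total: Decimal) -> bool:
    # The misreading: "over $100" means strictly greater,
    # but the model read it as "100 or more" and wrote >=.
    return order_total >= FREE_SHIPPING_THRESHOLD


def model_written_check() -> bool:
    # Written from the same misreading, so it encodes the same boundary.
    return qualifies_for_free_shipping(Decimal("100.00")) is True


def independent_spec_check() -> bool:
    # Written from the spec text alone: exactly $100 is not "over $100".
    return qualifies_for_free_shipping(Decimal("100.00")) is False


print("model-written check passes:", model_written_check())
print("independent spec check passes:", independent_spec_check())
```

The first check is green and the second is not. Only the second one was ever reading the spec.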

The cruel part is that every tool you'd reach for to catch this is the one the problem already corrupted. Coverage tells you which lines ran, never whether they do the right thing. The test runner confirms the code matches the test, which is the precise thing that's now suspect. Linters check style. Your ticket tracker never touches the code. Each tool reads one side of the triangle and trusts the other two. The one cross-check that ever mattered, a second independent reading, is exactly what the AI removed.
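The coverage point is easy to reproduce. The sketch below uses a made-up discount rule and made-up function names. Both checks execute every line of the function, so a line-coverage tool would report the same number for each, yet only one of them notices the bug.

```python
def apply_discount(price: float, percent: float) -> float:
    # Deliberate bug for the example: adds the discount instead of subtracting it.
    return price + price * percent / 100


def check_that_only_runs_the_code() -> None:
    # Executes every line of apply_discount, so line coverage is complete.
    result = apply_discount(200.0, 10.0)
    assert isinstance(result, float)  # checks the type, not the value


def check_that_verifies_the_rule() -> None:
    # Same lines executed, but this one asserts the behaviour the rule describes.
    assert apply_discount(200.0, 10.0) == 180.0


check_that_only_runs_the_code()  # passes despite the bug
try:
    check_that_verifies_the_rule()
except AssertionError:
    print("value check failed: the discount rule is not implemented correctly")
```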

So you ship on a feeling. The bill arrives later, in someone else's environment. The business rule that was never really implemented. The requirement that drifted three sprints ago and took the old green tests with it. The audit question that turns a confident team silent.

The one question that still means anything

There's a single question left that survives all of this, and it's narrower than the ones teams usually ask.

For this requirement, is there real code that implements it, and a real test that exercises that code? Proven by evidence, not claimed by a label.

Answer that for every requirement and the fear drains out of the room. The tautology can't hide, because a test that only claims to cover something proves nothing on its own. A high coverage number can't bury a weak group of requirements, because you're reading requirement by requirement, not one comfortable average. Drift shows up the moment alignment breaks, not the moment a customer finds it. And the renewal-call question stops being a threat. It becomes a screen you turn around and point at.
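To see how an average buries a weak spot, take some illustrative numbers. The requirement names and percentages below are invented for this sketch, not measurements from any real project.

```python
# Hypothetical requirements mapped to the share of their code that tests execute.
coverage_by_requirement = {
    "REQ-1 login": 0.98,
    "REQ-2 search": 0.97,
    "REQ-3 export": 0.95,
    "REQ-4 refund rules": 0.30,
}

# The one comfortable number a dashboard would show.
average = sum(coverage_by_requirement.values()) / len(coverage_by_requirement)
print(f"average coverage: {average:.0%}")

# The same data read requirement by requirement.
weakest = min(coverage_by_requirement, key=coverage_by_requirement.get)
print(f"weakest requirement: {weakest} at {coverage_by_requirement[weakest]:.0%}")
```

The average lands at about 80 percent, which looks healthy. The refund rules, the requirement most likely to cost money, sit at 30 percent underneath it.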

The trouble is that nothing in a normal toolchain answers it. Knowing whether spec, code, and test agree means reading all three as separate things and checking them against each other. No tool most teams own does that.

What changes when you can actually see it

This is the entire reason DQA exists. It reads the spec, the code, and the tests as three independent sources and tells a team, requirement by requirement, whether they truly line up.

The line it refuses to blur is proven versus declared. A test that says it covers a requirement is declared, and declared is what an AI can fake all day. A requirement where evidence shows real code implements it and a real test exercises that code is proven. Only proven counts. That one rule is what beats the model grading its own homework.
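One way to picture the proven versus declared rule is as a small predicate. This is a sketch under my own assumptions, not DQA's internals: the `ObservedTest` type, its fields, and the function names are all hypothetical. The idea is that a label alone never counts, and the test must actually run the implementing code and assert on its behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservedTest:
    declared_requirement: str            # what the test's label claims to cover
    functions_executed: frozenset[str]   # what the test actually ran
    has_value_assertion: bool            # asserts on behaviour, not just "no crash"


def is_proven(requirement: str, implementing_functions: set[str], test: ObservedTest) -> bool:
    """Proven needs evidence: the test runs the implementing code and asserts on it."""
    if test.declared_requirement != requirement:
        return False
    runs_real_code = bool(implementing_functions & test.functions_executed)
    return runs_real_code and test.has_value_assertion


implementing = {"billing.apply_refund"}  # hypothetical function behind REQ-4
labelled_only = ObservedTest("REQ-4", frozenset({"billing.format_receipt"}), True)
exercised = ObservedTest("REQ-4", frozenset({"billing.apply_refund"}), True)

print("labelled only:", is_proven("REQ-4", implementing, labelled_only))
print("exercised:", is_proven("REQ-4", implementing, exercised))
```

Both tests declare REQ-4. Only the second one touches the code that implements it, so only the second one is proven.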

What comes back isn't another percentage to feel good about. It's a plain list. What's fully aligned, what's only partial, what has nothing real behind it, sorted worst first, so the weakest spot is the first thing you see instead of the thing an average hides.
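As a rough sketch of what such a list could look like in code, assuming three statuses and invented requirement names (none of this is DQA's actual output format):

```python
from enum import IntEnum


class Alignment(IntEnum):
    NOTHING_REAL = 0  # no real code or no real test behind the requirement
    PARTIAL = 1       # some evidence, but the chain is incomplete
    ALIGNED = 2       # real code and a real test that exercises it


def worst_first(report: dict[str, Alignment]) -> list[tuple[str, Alignment]]:
    # Sort by status first, then by name so ties come out in a stable order.
    return sorted(report.items(), key=lambda item: (item[1], item[0]))


# Hypothetical report for four requirements.
report = {
    "REQ-1 login": Alignment.ALIGNED,
    "REQ-2 search": Alignment.PARTIAL,
    "REQ-3 export": Alignment.ALIGNED,
    "REQ-4 refund rules": Alignment.NOTHING_REAL,
}

for name, status in worst_first(report):
    print(f"{status.name:<13} {name}")
```

The ordering is the whole point: the requirement with nothing real behind it is the first line you read.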

That turns the renewal call into a different conversation. One version ends with "I'll get back to you" and a quiet scramble through Jira. The other ends with you turning the screen around. Same client, same question, completely different business.

AI will write most new code within a couple of years. That isn't the risk. The risk is shipping it while still trusting a green test the way we did when a second human wrote it.

The shops that come out of this era with their reputations intact won't be the ones that adopted AI fastest. They'll be the ones who could still say, on any given commit, exactly which requirements had real code and a real test behind them, and prove it without flinching.

So, your last release. Do you know which requirements are actually aligned, or do you only know that the tests passed?

I'm running a free gap report for 3 dev shops this month. Send 1 repo and 1 spec, and I'll send back which requirements are proven-aligned, which aren't, and which "covered" tests don't actually prove anything. No pitch, no commitment. Comment "gap" or DM me.

tautology problem — AI confirming itself.

Hung Nguyen Van — Wed, 13 May 2026 01:48:57 +0000

Yesterday I posted about senior devs spending 25 minutes reviewing a single AI-generated PR. Someone DMed me: "Just replace the senior with an AI reviewer." That's the trap.

AI writes the code. AI writes the tests. AI reviews the code. Three layers, each one "smart." The problem: all three share the same source of reasoning.

If the AI misreads the spec, the code is wrong, the tests pass against the wrong code, and the review approves the wrong code. All three layers green. Spec still violated.

This is the tautology problem — AI confirming itself.

In April 2026, Anthropic published a postmortem most people didn't read carefully. They admitted that AI-generated regressions in their own codebase slipped past human review, automated review, unit tests, end-to-end tests, automated verification, and dogfooding. Anthropic's full stack still missed it.

If Anthropic's stack can't catch it, the honest question for any team shipping AI-assisted code is: how much is your stack actually catching?

The industry has tried several approaches. None of them solves tautology:

Test frameworks (Jest, Pytest…) — tests written by the same AI, same source
Linters / SAST (SonarQube, Semgrep) — don't read the spec, only pattern-match code
AI code review (Copilot, CodeRabbit, Qodo) — review code-vs-codebase, not code-vs-original-spec
Manual senior review — doesn't scale, returns you to 25 min/PR (see yesterday's post)

This is why we built DQA — a Trust Layer for AI-generated code. Not a fifth review tool. A structurally different layer.

DQA compiles rules directly from the spec document, with no AI interpretation in the loop. Every commit the AI ships gets cross-checked:

Does this feature trace back to an original requirement?
Does it violate any structural constraint?
Is there a signed, timestamped evidence chain for audit?

It sits between "AI writes code" and "code merges to production." A third party, structurally independent — not sharing the same source of reasoning as code-AI, test-AI, or review-AI.
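For readers wondering what a "signed, timestamped evidence chain" can look like in practice, here is one generic way to build one: each record carries a timestamp, points at the previous record's signature, and is signed with an HMAC. This is an illustrative sketch using only Python's standard library. The record fields, the function names, and the key are my assumptions for the example, and it is not a description of how DQA stores its evidence.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone

# Hypothetical demo key; a real key would live in a secret store.
SIGNING_KEY = b"demo-key-not-for-production"


def _sign(payload: dict) -> str:
    # Canonical JSON so the same payload always yields the same signature.
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hmac.new(SIGNING_KEY, body, hashlib.sha256).hexdigest()


def append_record(chain: list[dict], commit: str, requirement: str, verdict: str) -> dict:
    payload = {
        "commit": commit,
        "requirement": requirement,
        "verdict": verdict,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "previous": chain[-1]["signature"] if chain else "",
    }
    record = {**payload, "signature": _sign(payload)}
    chain.append(record)
    return record


def verify_chain(chain: list[dict]) -> bool:
    previous = ""
    for record in chain:
        payload = {k: v for k, v in record.items() if k != "signature"}
        if payload["previous"] != previous:
            return False  # a record was removed, inserted, or reordered
        if not hmac.compare_digest(record["signature"], _sign(payload)):
            return False  # a record was edited after signing
        previous = record["signature"]
    return True


evidence: list[dict] = []
append_record(evidence, "a1b2c3d", "REQ-4", "not proven")
append_record(evidence, "e4f5a6b", "REQ-4", "proven")
print("chain intact:", verify_chain(evidence))

evidence[0]["verdict"] = "proven"  # someone rewrites history
print("chain intact after edit:", verify_chain(evidence))
```

Editing an old verdict breaks the signature on that record, so the audit trail fails verification instead of quietly agreeing with the new story.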

If you're actively shipping AI-assisted code to production and want to compare notes on the verification problems your team is hitting, DM me.

I'm in conversations with three dev teams this week, ~30 min each. No pitch deck. You share your pain, I share patterns from other teams. If it fits, I'll suggest a next step. If not, you walk away with 30 minutes of insight into how others are handling this.

👉 DM me or comment "DM" — I'll message you first.

DEV Community: Hung Nguyen Van
