Mirza Iqbal

Posted on Jun 2

AI reported success.

#ai #llmops #enterprise #agents

Nobody checked if it was true.

Your agent reported success.

That is not the same as the work being done.

I read three of the highest-engagement AI articles on dev.to this week, then I read the comments under them, which is where the real signal lives.

Those articles covered how developers use AI, how to debug AI-written code, and a test report that claimed 97.3 percent coverage.

Different topics. Same question underneath, asked five ways by five readers.

Did the AI actually verify what it said it finished?

Nobody is writing the answer. So here is mine, from running this in production for enterprise teams that cannot afford a wrong answer on a Monday.

Green check versus true statement

A green checkmark and a true statement are different things.

An agent that writes "tests pass" is reporting on its own behavior.

It is not reporting on reality.

One reader said it better than any article did. Coverage of 97.3 percent means the lines ran during the test. It does not mean anything was asserted. It does not mean the system does what you think it does.
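The gap between "the lines ran" and "something was asserted" is easy to reproduce. The sketch below is a hypothetical illustration, not taken from the article under discussion: `apply_discount` and both check functions are invented names, and the bug is deliberate.

```python
def apply_discount(price: float, percent: float) -> float:
    """Intended to reduce price by percent. Deliberate bug: it adds instead."""
    return price + price * percent / 100


def check_without_assertion() -> None:
    # Executes every line of apply_discount, so line coverage is 100 percent.
    # Nothing is compared against an expected value, so this cannot fail
    # on a wrong result.
    apply_discount(100.0, 10.0)


def check_with_assertion() -> None:
    # Same lines executed, same coverage. This one states what "correct" means.
    assert apply_discount(100.0, 10.0) == 90.0
```

Both checks produce identical coverage numbers. Only the second one raises `AssertionError` on the buggy function, which is the difference the reader was pointing at.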

That is the whole problem. The metric measured execution. The reader wanted verification. Most teams never notice the swap.

Why enterprise teams get bitten hardest

A solo developer who ships an unverified change feels it the next morning and fixes it.

An enterprise team ships the same change into a revenue dashboard, a support workflow, or a regulated process. The failure surfaces a week later, when the operations lead is re-running yesterday's batch by hand while everyone else waits.

The cost here does not show up as developer hours. It shows up as trust the business placed in a number that turned out to be theatre.

I have walked teams at large DACH organisations through exactly this. Company size does not change the pattern. What matters is whether anyone closed the loop between what the agent claimed and what was true.

One question that catches it

When a team brings me a system that "works" but keeps surprising them, I ask one thing first.

For every action your agent reports as done, who verified it independently of the agent?

The honest answer is almost always nobody.

An agent runs the action. It reports the outcome. That same agent is the only witness that the action worked. You have a closed loop with no external check, and a closed loop will always tell you it succeeded.

This is the shape behind this week's comment threads. Readers asking whether a run preserved its own trace. Readers asking what they personally verified versus what the model claimed. Readers asking how a system keeps its own truth while work happens inside it.

They are all circling one missing piece. An independent verifier.

What does not close the loop

Teams reach for the same three moves first, and none of them fix it.

More retries. A retry makes a flaky action eventually report success. It never checks that the success is real.

A bigger model. A stronger model writes a more convincing report. That report is still self-issued.

A confidence score. An agent rating its own confidence is an agent grading its own exam.

All three reduce the visible error rate. That feels like progress. It is the opposite. A loud failure becomes a silent one, and silent failures cost more because they hide until they reach a customer.
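The retry case can be shown with a toy example. This is a minimal sketch under stated assumptions: `FakeAgent`, `write_record`, and `retry_until_reported_success` are hypothetical names, and the agent is rigged to report success on its second call without ever storing anything.

```python
from typing import Callable


def retry_until_reported_success(action: Callable[[], bool], attempts: int = 3) -> bool:
    """Retries until the action says it succeeded. Trusts the self-report only."""
    for _ in range(attempts):
        if action():
            return True
    return False


class FakeAgent:
    """Hypothetical agent whose write silently drops the record."""

    def __init__(self) -> None:
        self.store: dict = {}
        self.calls = 0

    def write_record(self) -> bool:
        self.calls += 1
        # Reports failure on the first call and success from the second on,
        # but never puts anything into self.store.
        return self.calls >= 2


agent = FakeAgent()
reported = retry_until_reported_success(agent.write_record)  # what the agent claims
actually_written = "order-42" in agent.store                 # what is true
```

Here `reported` ends up `True` while `actually_written` stays `False`. The retry removed the visible error and left the real one in place.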

What actually closes it

The fix is a discipline first. Tooling comes after.

Decide which claims matter. Skip most of the output. Keep the handful where being wrong costs real money.

For each of those claims, build a check the agent cannot satisfy by talking. A good check observes reality. It ignores the agent's narration of reality.

If the agent says the record was written, something other than the agent reads that record back.

If the agent says the workflow ran, something other than the agent confirms the downstream effect happened.

The principle is simple. A claim and its proof must come from different sources. Once they share a source, you have a press release where you wanted a verification.
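A read-back check for the "record was written" case might look like the following. It is a generic sketch, not the production checks this post refers to: the `Claim` shape, the `orders` table, and `verify_record_written` are all assumptions made for illustration, and an in-memory SQLite database stands in for a real system of record.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Claim:
    """What the agent says it did (hypothetical shape)."""
    record_id: int
    reported_ok: bool


def verify_record_written(conn: sqlite3.Connection, claim: Claim) -> bool:
    """Reads the row back from the store itself.

    The claim's own reported_ok field is deliberately ignored:
    the proof has to come from a different source than the claim.
    """
    row = conn.execute(
        "SELECT 1 FROM orders WHERE id = ?", (claim.record_id,)
    ).fetchone()
    return row is not None


# Stand-in for the real system of record (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute("INSERT INTO orders (id, total) VALUES (1, 19.99)")

honest = Claim(record_id=1, reported_ok=True)
hollow = Claim(record_id=2, reported_ok=True)  # agent says done, row was never written

honest_verified = verify_record_written(conn, honest)
hollow_verified = verify_record_written(conn, hollow)
```

Both claims carry `reported_ok=True`, yet only the first survives the read-back. In a real setup the verifier would presumably run with its own connection and credentials, so the agent cannot influence what it sees.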

Across the enterprise rollouts where I have put this in place, the share of silent failures dropped sharply inside the first month. Agents were no smarter than before. What changed is who got to witness the outcome.

What this writeup does not hand you

I run a specific version of this in production, with exact checks, thresholds, and a runbook that maps a failed claim to a first response. Those are what I bring into client work.

I am not pasting them here, and the reason is honest.

Drop the implementation into a post and the next team to hit this does a search, copies it, ships it, and never has the conversation that surfaces why their loop was closed in the first place.

The implementation is the cheap part. Deciding which claims must be proven by something other than the thing making the claim is the part that holds up under load.

One question for you

Most "it works on my machine" surprises are really "nobody verified it independently" surprises wearing a costume.

So here is the same question I ask the teams I work with.

In your stack right now, name one thing your AI reports as done that nothing other than the AI has ever checked.

Drop it in the comments. I will reply with the question that usually tells you whether it is safe or whether it is the next silent failure waiting for a Monday.