Mr Chandravanshi

Inside Systems 01: Your Verification Process Did Not Break. It Was Replaced.

Why fluent AI outputs quietly replace verification habits without announcing the change

The Rule Vikram Built

Vikram had a rule he followed for seven years.

Never forward a number you did not personally trace to its source.

Not because his manager told him. Because in 2017 he sent a procurement estimate to a client that was built on a number a colleague had provided without sourcing it, and the error cost him three weeks of repair work and one relationship that never fully recovered. The rule came from that. It was specific, operational, and his.

By late 2024, he had been using an AI drafting tool for five months. He was faster. His first drafts were cleaner. The sourcing rule had quietly stopped applying to outputs the tool produced, not as a decision, not as an exception he consciously granted, but as a gap that opened without being named.

He did not notice the gap. The outputs were good.

The Substitution Happened Quietly

The illusion operating here is precise enough to name in one sentence.

Fluency looks like diligence, and diligence is what verification is supposed to confirm, so fluency quietly substitutes for the confirmation without either party agreeing to the substitution.

That sentence is not an accusation. It is a description of how calibration actually works under repeated exposure to high-quality outputs.

The tool produces language that has the texture of careful work. Complete sentences, confident framing, appropriate hedges in the right places, specific-sounding figures. Every surface signal that trained readers associate with reliable material is present. The signal that is absent, the one that says "this was independently checked against a primary source," does not have a surface form. It was never visible. You inferred it from the other signals.

For seven years, those signals correlated strongly enough with actual verification that treating them as proxies was reasonable. Then the tool arrived and produced the same surface signals from a completely different process.

The inference kept running. The correlation had broken.

What The Architecture Actually Learned

Gary Marcus has made this point repeatedly in technical contexts: the gap between a system producing correct-shaped output and a system that has any access to whether the output is correct is not a limitation to be patched in the next version.

It is the architecture.

These systems were trained on text. Text that was mostly written by people who had verified their claims before writing them. The training absorbed the surface form of verified writing, the vocabulary, the cadence, the citation-adjacent phrasing, without absorbing the verification process that produced it.

When you read the output, you receive all the surface markers of diligent work. The work behind those markers is not there. It was never part of what was learned.

This is not a criticism of the tool. It is a description of what it is. The problem is not using a tool that produces fluent unverified text. The problem is using a tool that produces fluent unverified text and having your judgment respond as though the verification happened.

Verification Became The Expensive Part

The mechanism that makes this persistent is not carelessness. It is something more structural.

Verification is expensive relative to reading. It requires locating primary sources, cross-referencing specific claims, catching the specific failure modes of specific subject areas. It is the slow part of intellectual work, the part that does not scale, the part that does not get faster with practice in the same way that writing does.

When a tool compresses the writing phase dramatically, the ratio changes. If writing a first draft took two hours and verification took one, the ratio was 2:1. When the tool compresses the writing phase to twenty minutes, the same absolute verification time flips the ratio to 1:3, with verification now three times the cost of writing. Verification becomes the dominant cost.
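A back-of-envelope sketch makes the flip concrete. The hours below are the illustrative figures from the paragraph above, not measurements of any real workflow:

```python
# Illustrative figures from the paragraph above, not measurements.
draft_hours_before = 2.0      # hand-writing a first draft
draft_hours_after = 20 / 60   # tool-assisted draft: roughly twenty minutes
verify_hours = 1.0            # verification cost does not change

print(f"before: writing:verification = {draft_hours_before / verify_hours:.0f}:1")
print(f"after:  writing:verification = 1:{verify_hours / draft_hours_after:.0f}")
# before: writing:verification = 2:1
# after:  writing:verification = 1:3
```

Nothing about verification got slower. It only became the largest line item once everything around it got cheaper.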

Under cost pressure, the behavior that gets reduced is the expensive one. This is not a moral failure. It is how systems respond to changed cost ratios. The behavior looks the same from outside: the output still arrives, still gets sent, still performs the function. What changed is invisible in the output itself.

What changed is how much of the process that used to precede the output is still happening.

What Expertise Is Actually Built From

Ted Chiang would locate the deeper problem one layer below the verification question.

Expertise in any analytical field is partly the accumulated history of catching your own errors before they leave your hands. The researcher who learned not to trust rounded numbers in secondary sources because one wrong rounded number sent her down a three-week dead end in 2019. The engineer who runs a specific sanity check on every cost estimate because he learned in 2021 that a particular class of error was invisible until the project was already committed.

That knowledge is not transferable through instruction. It is built through the specific experience of being wrong in recoverable situations, where the cost was high enough to register but not high enough to be catastrophic.

When the tool produces the first draft, the occasions for that experience change. You are no longer building from raw material and encountering your own errors on the way to a finished product. You are editing a finished-seeming product and encountering errors, when you encounter them, in a different cognitive position. The editor's relationship to a draft's errors is different from the author's. Both catch things. They do not build the same instincts.

The instinct Vikram built in 2017 required being the person who passed the unsourced number. The tool creates conditions where that experience is less likely to occur. Which means the conditions that would update or reinforce that instinct are also less likely to occur.

The instinct does not disappear. It atrophies through reduced exercise in the specific conditions that maintain it.

The Calibration Problem

The cost that is already running is not located in any single output.

It is located in the calibration itself.

Vikram's judgment about what needs verification is being trained on tool outputs. When an output is accurate, he files: this kind of output does not require checking. When an output is accurate again, he files the same. Over five months of mostly accurate outputs, the prior probability that any given output needs checking has shifted downward without a deliberate decision.

This is rational updating on available evidence. The problem is that the evidence set is missing the failure cases that would correct it. The tool's errors are not randomly distributed across outputs. They cluster in specific domains, specific types of claims, specific question shapes where the training produced confident-sounding text that does not correspond to verifiable fact. Those failure modes are not visible in the surface form of the output. You encounter them by checking, and the rate at which you check has been declining.

The calibration is updating on a sample that excludes the data most relevant to calibration.
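A minimal simulation shows why that sample bias is self-reinforcing. Every number here is an assumption chosen for illustration, a five percent error rate and a check rate that relaxes with each clean result, not a measurement of any real tool:

```python
import random

random.seed(0)

# Assumed, illustrative parameters; not measurements of any real tool.
ERROR_RATE = 0.05     # errors exist but carry no surface signal
check_prob = 0.90     # start by verifying almost everything
caught = missed = 0

for _ in range(500):
    is_error = random.random() < ERROR_RATE
    if random.random() < check_prob:
        if is_error:
            caught += 1
            check_prob = 0.90                          # a caught error restores vigilance
        else:
            check_prob = max(0.05, check_prob - 0.01)  # each clean check relaxes it
    elif is_error:
        missed += 1  # never observed, so it never updates the calibration

print(f"final check rate: {check_prob:.2f}")
print(f"errors caught: {caught}, errors missed: {missed}")
```

The point is not the numbers. It is the asymmetry: clean checks are visible and lower the check rate, while missed errors are invisible and cannot raise it.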

The Identity Still Feels Intact

The identity this behavior protects is worth naming directly.

Vikram considers himself someone who does not cut corners on accuracy. This is not self-flattery in his case. Seven years of careful work back it up, including one expensive correction that produced a lasting rule. The professional identity is real and earned.

The behavior of not applying the sourcing rule to tool outputs is not experienced as cutting a corner. It is experienced as appropriate adaptation to a new tool. A different kind of process, not a reduced one.

Both things are simultaneously true: the identity is accurate, and the behavior contradicts what the identity would require if applied consistently. They coexist because the new process feels substantively different from what the rule was designed to prevent, even when the failure mode it prevents is identical.

The rule was built to catch unsourced numbers passed to clients. The tool produces unsourced numbers with excellent surface form. The rule does not automatically transfer because the experience of using the tool does not feel like the experience that created the rule.

The Internal Argument

Here is the internal argument that does not resolve cleanly.

One side: the tool is a productivity tool; using it well requires calibrating trust to actual performance; the outputs are reliable enough that applying the full verification protocol to every output would eliminate the efficiency gain that justifies using it; treating the tool with blanket suspicion is not a practical working posture.

Other side: the failure modes are invisible in the surface form; the calibration is updating on a biased sample; the instincts being maintained through reduced exercise are the ones most needed for the cases the tool gets wrong; and those cases are exactly the ones where no surface signal flags the need for checking.

Both arguments are structurally sound. Neither defeats the other.

The working resolution most people land on is implicit rather than explicit: they verify when something feels uncertain and trust when it does not. The problem with that resolution is that the tool's errors frequently do not feel uncertain. They arrive feeling like the outputs that were correct, because they were produced by the same process.

Feeling uncertain is not a reliable trigger for verification when the signal is absent in the class of errors that matter most.

The Practical Adjustment

The practical adjustment is narrower than it sounds.

Not full verification of every output. That eliminates the tool's value and is not what careful work requires even without the tool.

One question, asked after reading any output that will be passed forward: what specifically would need to be true for this to be correct, and have I confirmed any of it independently?

The question does not require answering yes. It requires being asked.

Vikram's 2017 rule was not "verify everything." It was "never forward a number you did not personally trace." The precision of the rule is what made it operational. It covered the specific failure mode he had encountered. It did not require treating everything with equal suspicion.

The equivalent for tool outputs is a similarly narrow rule, derived from the actual failure modes of the actual tool you are using, applied to the specific class of claims where those failure modes cluster.

That rule does not exist in generic form. You have to build it from your own failure cases.

Which means you have to be present when the failure cases occur.

Vikram's March Incident

Vikram found his failure case in March, eight months into regular use.

A unit cost figure in a supplier analysis. Accurate-sounding, appropriately specific, surrounded by a well-constructed paragraph. Wrong by 23 percent. The paragraph was so well constructed that he had read it as confirmation of a figure he vaguely remembered rather than as a claim requiring independent verification.

He traced the error. He updated his practice. He now asks the question after every output that contains specific quantitative claims.

The adjustment took one incident to make and five minutes per output to run.

The incident cost two days of rework and one client conversation he would have preferred not to have.

The question he did not ask for eight months was: what would I need to confirm for this to be usable?

He knew how to ask it. The tool had trained him, through eight months of mostly accurate outputs, to ask it less often.

That is what the tool actually did to his process. Not through any design intent. Through the same mechanism that any reliable system uses to reduce the perceived need for verification: it was right often enough that checking felt redundant, until the specific case where it was not.
