Zephyre

Posted on Jun 28

Verification Cost Is the Real AI Coding Cost

#programming #ai

I used to ask a simple question when routing coding tasks across models:

Which model is strong enough for this?

That question is still useful, but it is not the first one I ask anymore.

The better first question is:

How quickly can I verify the output?

That changed the way I use low-cost models. I do not treat them as weaker replacements for my main coding model. I treat them as useful workers for tasks where the verification path is short.

Level 1: Can I inspect the output directly?

Some tasks are cheap to review because the output is visible.

Examples:

README cleanup
usage examples
comments
changelog notes
small formatting scripts
issue templates

If the model writes a bad README paragraph, I can see it. If it adds vague wording, I can delete it. The failure is annoying, but it is cheap.

This is where low-cost models are useful.

Level 2: Can I run a test?

The next best category is testable work.

If I can describe the expected behavior and run a test suite, I am more willing to route the first draft to a cheaper model.

But the prompt needs boundaries.

Instead of:

Add tests for this helper.

I would write:

Add tests for empty input, null input, duplicate values, invalid config, default config, and normal input. Do not change runtime code.

The difference is small, but it forces the model to work inside a verification frame.

Level 3: Can I manually verify it?

Some tasks do not have automated tests, but still have a clear manual check.

Examples:

CLI output formatting
config examples
migration dry-run notes
small data conversion scripts

For these, I ask the model to include:

how to run it
what input to use
what output to expect
which edge cases to check

If the model cannot explain how to verify its own output, I do not trust the patch.

Level 4: Could it change hidden behavior?

This is where I slow down.

Small refactors are often more dangerous than they look.

The diff may be short. The code may look cleaner. But the behavior might change in a fallback path, a default value, a permission check, or a compatibility branch.

I raise the risk level when a task touches:

fallbacks
defaults
routing
permissions
billing
rate limits
migrations
backwards compatibility

These failures are not always obvious in the code review. You need context to notice them.

My current routing rule

I route by verification cost:

Low verification cost: low-cost model can draft it.
Medium verification cost: low-cost model can draft, human edits.
High verification cost: strong model may help, but tests and human review are required.

This rule is more useful than “small task vs large task.”

A small task can be expensive if it is hard to verify.

The point

Low-cost AI coding models are not useless.

They are useful when the work is easy to inspect, easy to test, or easy to roll back.

The expensive part of AI coding is not always generation.

Often, it is trust.

Top comments (4)

Mike Czerwinski • Jun 28

Routing by verification cost is the right axis, but the ladder has a rung that quietly inverts it. At Level 2 and 3 you let the cheap model draft the patch and then write its own tests or explain how to verify itself. That does not lower verification cost, it relocates the fabrication into the test. The same generator now authors both the code and the check that vouches for it, and a model that writes a green test for the behavior it thought it wrote will hand you both at once. Level 4 is where this bites hardest: a hidden-behavior change ships under a passing self-authored test, because the test asserts the intended behavior, not the changed one. Verification cost is only real when the verifier does not share the generator's author. A cheap model writing its own passing test lowers the apparent cost, not the trust. The rung that matters is not how cheap the check is to run. It is who wrote it.

Zephyre • Jul 8

Thanks. This is the exact distinction I should have made clearer.

I agree that letting the same cheap model write the patch and then write the test can make verification look cheaper without making the result more trustworthy. In that setup, the test may only confirm the behavior the model believed it implemented.

The routing rule I would use now is:

Low-cost generation is useful only when the verification evidence is independent, cheap, and repeatable.

That evidence can be:

an existing test that was already part of the project;
a reproduction from a bug report;
human-written acceptance criteria;
real logs or input/output samples;
a separate reviewer checking the diff risk;
a human acceptance step for defaults, permissions, billing, migrations, or fallback behavior.

If the same model writes the patch and defines the passing condition, I would treat the result as a draft, not as verified work.

So the ladder needs one more question before "can I run a test?":

Who wrote the verifier, and what evidence is it using?

I am turning this into a small proof-log case because it is a useful boundary: a cheap model can draft the code, but it should not be the only source of truth for why the code is safe.

Mike Czerwinski • Jul 8

Good ladder. One more split inside "independent": independent of the model that generated the code isn't the same as independent of the person who reviewed it. Human-written acceptance criteria drafted by the same engineer doing the diff review is independent of generation but not of the reviewer's own blind spots, which is a narrower guarantee than the word "independent" implies. Worth naming which independence you're claiming when you write the criteria down.

Zephyre • Jul 10

Thanks, that distinction is worth naming explicitly.

I used "independent" too loosely. There are at least three different claims:

Independent of generation: the same model that wrote the patch did not also define the passing condition.
Independent of reviewer bias: the acceptance criteria were not only invented during the same review by the same person trying to approve the diff.
Independent evidence source: the check is anchored in something outside the generated patch, such as an existing test, a repro, real logs, a user-visible behavior, or a product constraint.

Those are not equivalent.

For small AI coding tasks, I would now write the verification field more narrowly:

What is this evidence independent of?

If the answer is only "independent of the model that generated the code," that is still useful, but it is a weaker guarantee than "independent of both the generator and the reviewer's fresh assumptions."

So the routing rule becomes:

A cheap model can draft the change when the verification claim is explicit and the evidence source is named.

Otherwise I should call it a draft, not verified work.