DEV Community

Zephyre

Verification Cost Is the Real AI Coding Cost

I used to ask a simple question when routing coding tasks across models:

Which model is strong enough for this?

That question is still useful, but it is not the first one I ask anymore.

The better first question is:

How quickly can I verify the output?

That changed the way I use low-cost models. I do not treat them as weaker replacements for my main coding model. I treat them as useful workers for tasks where the verification path is short.

Level 1: Can I inspect the output directly?

Some tasks are cheap to review because the output is visible.

Examples:

  • README cleanup
  • usage examples
  • comments
  • changelog notes
  • small formatting scripts
  • issue templates

If the model writes a bad README paragraph, I can see it. If it adds vague wording, I can delete it. The failure is annoying, but it is cheap.

This is where low-cost models are useful.

Level 2: Can I run a test?

The next best category is testable work.

If I can describe the expected behavior and run a test suite, I am more willing to route the first draft to a cheaper model.

But the prompt needs boundaries.

Instead of:

Add tests for this helper.

I would write:

Add tests for empty input, null input, duplicate values, invalid config, default config, and normal input. Do not change runtime code.

The difference is small, but it forces the model to work inside a verification frame.
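Here is a rough sketch of what the bounded prompt could produce. The helper `dedupe` and its `case_sensitive` option are hypothetical, defined inline only so the example runs on its own. In a real repository the helper already exists and the tests import it.

```python
import unittest


# Hypothetical helper, included only to make the sketch self-contained.
# In practice this is existing runtime code that the model must not change.
def dedupe(values, config=None):
    """Return values with duplicates removed, keeping first-seen order."""
    if values is None:
        raise TypeError("values must not be None")
    config = {"case_sensitive": True} if config is None else config
    if not isinstance(config.get("case_sensitive"), bool):
        raise ValueError("case_sensitive must be a bool")
    seen, result = set(), []
    for value in values:
        key = value if config["case_sensitive"] else value.lower()
        if key not in seen:
            seen.add(key)
            result.append(value)
    return result


# One test per case named in the prompt, so a reviewer can tick them off.
class DedupeTests(unittest.TestCase):
    def test_empty_input(self):
        self.assertEqual(dedupe([]), [])

    def test_null_input(self):
        with self.assertRaises(TypeError):
            dedupe(None)

    def test_duplicate_values(self):
        self.assertEqual(dedupe(["a", "b", "a"]), ["a", "b"])

    def test_invalid_config(self):
        with self.assertRaises(ValueError):
            dedupe(["a"], config={"case_sensitive": "yes"})

    def test_default_config(self):
        # The assumed default is case-sensitive, so "A" and "a" both survive.
        self.assertEqual(dedupe(["A", "a"]), ["A", "a"])

    def test_normal_input(self):
        self.assertEqual(dedupe(["x", "y", "z"]), ["x", "y", "z"])
```

Running `python -m unittest` against a file like this gives a short checklist: six named cases, six tests, and a diff that should touch only the test file.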

Level 3: Can I manually verify it?

Some tasks do not have automated tests, but still have a clear manual check.

Examples:

  • CLI output formatting
  • config examples
  • migration dry-run notes
  • small data conversion scripts

For these, I ask the model to include:

  1. how to run it
  2. what input to use
  3. what output to expect
  4. which edge cases to check

If the model cannot explain how to verify its own output, I do not trust the patch.
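The four-item checklist above can be sketched as a small record that flags what the model left out. The class name, field names, and the sample file names are all illustrative assumptions, not part of any tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VerificationNote:
    """The four things a patch should explain before it is trusted."""
    how_to_run: str = ""
    input_to_use: str = ""
    expected_output: str = ""
    edge_cases: List[str] = field(default_factory=list)

    def missing_parts(self) -> List[str]:
        """Return the checklist items the note leaves blank."""
        missing = []
        if not self.how_to_run.strip():
            missing.append("how to run it")
        if not self.input_to_use.strip():
            missing.append("what input to use")
        if not self.expected_output.strip():
            missing.append("what output to expect")
        if not self.edge_cases:
            missing.append("which edge cases to check")
        return missing

    def is_complete(self) -> bool:
        return not self.missing_parts()


# Hypothetical note for a small data conversion script.
note = VerificationNote(
    how_to_run="python convert.py sample.csv",
    input_to_use="sample.csv with three rows",
    expected_output="",
    edge_cases=["empty file"],
)
print(note.missing_parts())  # ['what output to expect']
```

A note that fails `is_complete()` is a cheap signal to send the patch back before spending any time on the manual check itself.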

Level 4: Could it change hidden behavior?

This is where I slow down.

Small refactors are often more dangerous than they look.

The diff may be short. The code may look cleaner. But the behavior might change in a fallback path, a default value, a permission check, or a compatibility branch.

I raise the risk level when a task touches:

  • fallbacks
  • defaults
  • routing
  • permissions
  • billing
  • rate limits
  • migrations
  • backwards compatibility

These failures are not always obvious in a code review. You need context to notice them.
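As a first-pass filter, the list above could be turned into a crude keyword check on a task description or diff summary. This is only a sketch: the keyword choices are my assumptions, substring matching will miss things and raise false alarms, and it does not replace the context a reviewer brings.

```python
from typing import List

# Risk areas from the list above, each mapped to assumed trigger substrings.
RISK_KEYWORDS = {
    "fallbacks": ("fallback",),
    "defaults": ("default",),
    "routing": ("routing", "route"),
    "permissions": ("permission",),
    "billing": ("billing",),
    "rate limits": ("rate limit", "rate_limit"),
    "migrations": ("migration", "migrate"),
    "backwards compatibility": ("backward", "compat"),
}


def risk_areas_touched(text: str) -> List[str]:
    """Return the risk areas whose keywords appear in the given text."""
    lowered = text.lower()
    return [
        area
        for area, needles in RISK_KEYWORDS.items()
        if any(needle in lowered for needle in needles)
    ]


task = "Simplify the default timeout and remove the legacy fallback branch"
print(risk_areas_touched(task))  # ['fallbacks', 'defaults']
```

Any non-empty result is a reason to raise the risk level, even when the diff is short.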

My current routing rule

I route by verification cost:

  • Low verification cost: low-cost model can draft it.
  • Medium verification cost: low-cost model can draft, human edits.
  • High verification cost: strong model may help, but tests and human review are required.

This rule is more useful than “small task vs large task.”

A small task can be expensive if it is hard to verify.
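The rule above can be sketched as a lookup keyed only on verification cost. The enum, the `Route` record, and the label strings are illustrative names I made up for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class VerificationCost(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class Route:
    drafter: str                 # which kind of model writes the first draft
    follow_up: Tuple[str, ...]   # steps required after the draft


# Direct transcription of the three-tier rule.
ROUTES = {
    VerificationCost.LOW: Route("low-cost model", ()),
    VerificationCost.MEDIUM: Route("low-cost model", ("human edits",)),
    VerificationCost.HIGH: Route("strong model", ("tests", "human review")),
}


def route(cost: VerificationCost) -> Route:
    """Pick a route from verification cost alone, ignoring task size."""
    return ROUTES[cost]


print(route(VerificationCost.HIGH).follow_up)  # ('tests', 'human review')
```

Task size is deliberately not an input here: a two-line change to a permission check still lands in the `HIGH` row.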

The point

Low-cost AI coding models are not useless.

They are useful when the work is easy to inspect, easy to test, or easy to roll back.

The expensive part of AI coding is not always generation.

Often, it is trust.

Top comments (1)

Mike Czerwinski

Routing by verification cost is the right axis, but the ladder has a rung that quietly inverts it. At Level 2 and 3 you let the cheap model draft the patch and then write its own tests or explain how to verify itself. That does not lower verification cost, it relocates the fabrication into the test. The same generator now authors both the code and the check that vouches for it, and a model that writes a green test for the behavior it thought it wrote will hand you both at once. Level 4 is where this bites hardest: a hidden-behavior change ships under a passing self-authored test, because the test asserts the intended behavior, not the changed one. Verification cost is only real when the verifier does not share the generator's author. A cheap model writing its own passing test lowers the apparent cost, not the trust. The rung that matters is not how cheap the check is to run. It is who wrote it.