DEV Community: Kair Akhmettayev

Why Cheaper AI Code Generation Does Not Necessarily Reduce

Kair Akhmettayev — Wed, 24 Jun 2026 20:11:38 +0000

In many AI-assisted workflows, code generation is no longer the only bottleneck. Assistants read repositories, edit files, run commands, and write tests. Agentic systems plan, call tools, retrieve more context, and assemble an answer over several steps or several models.

After each run, however, the engineer is left with the same question:

What was actually checked, what did the model merely assume, and how much of this result can I rely on before merge?

Producing plausible code has become cheaper. Checking its foundations has not necessarily followed. Comparing AI tools only by token price, generation speed, or agent count therefore misses the engineering decision that matters: the path from a request to a justified merge decision.

This article asks three questions:

Does AI reduce the total cost of a decision once model calls, review, rework, and escaped-error risk are counted?
Which part of that cost is targeted by routing, retrieval, multi-model deliberation, and automated checks?
What would a separate verification layer need to produce, and how could its value be falsified rather than merely claimed?

1. The verification tax

The productivity evidence is mixed. METR ran a randomized controlled trial with 16 experienced open-source developers performing 246 real tasks in mature repositories they knew well, using early-2025 tooling. With AI, tasks took 19% longer on average [1].

In February 2026, METR reported that newer data probably shows a larger uplift, but explicitly called the signal unreliable. The raw estimate for returning developers was a -18% change in completion time with a confidence interval of [-38%, +9%]; for newly recruited developers it was -4% with [-15%, +9%], where negative means speedup. Both intervals include zero effect [2].

The honest conclusion is neither "AI always speeds developers up" nor "AI always slows them down." Productivity depends on tool maturity, repository familiarity, task shape, context acquisition, and the cost of checking the result.

The 2025 DORA report provides a different, observational view of nearly 5,000 technology professionals: 90% use AI at work, more than 80% perceive a productivity gain, but 30% have little or no trust in AI-generated code. AI adoption is positively associated with delivery throughput and product performance and negatively associated with delivery stability [9]. This is not a causal estimate. It is consistent with a systems hypothesis: faster local generation may increase downstream load if testing and delivery controls do not scale with change volume.

A separate synthesis of seven Google studies found that 39% of external developers trust GenAI output quality only slightly or not at all. Perceived rigor of review and testing, and developer control over where AI is used, were positively associated with trust [7].

Review itself is not only defect-finding. In Bacchelli and Bird's study of 200 Microsoft review threads and 570 comments, code improvements accounted for 29% of comments and defects for 14%. The authors identify understanding the context and the change as central to review and record knowledge transfer as an outcome in its own right [3].

Automation should therefore be judged not only by bugs found, but by how much it reduces the cost of building a correct mental model of the change.

An illustrative review-load model

Assume a team handles 20 PRs per week and an average review takes 30 minutes:

20 PR × 0.5 h = 10 reviewer-hours / week

If AI doubles throughput while review cost per PR stays fixed:

40 PR × 0.5 h = 20 reviewer-hours / week

If AI-assisted PRs become wider and review time rises by 25%:

40 PR × 0.625 h = 25 reviewer-hours / week

Scenario	PR/wk	Review/PR	Review load
Pre-AI	20	30 min	10 h
2× throughput	40	30 min	20 h
2× throughput + wider PRs	40	37.5 min	25 h

This is a sensitivity model, not a market statistic. It shows the mechanism: faster generation may move work from writing to checking rather than remove it.
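The three scenarios can be recomputed with a few lines of Python. This is only a sketch of the sensitivity model above; the function name and the scenario labels are illustrative, and the inputs are the assumed numbers from the text, not measurements.

```python
def weekly_review_hours(prs_per_week: int, review_minutes_per_pr: float) -> float:
    """Reviewer-hours per week for a given PR volume and per-PR review time."""
    return prs_per_week * review_minutes_per_pr / 60.0


# (PRs per week, review minutes per PR) for each illustrative scenario
scenarios = {
    "Pre-AI": (20, 30.0),
    "2x throughput": (40, 30.0),
    "2x throughput + wider PRs": (40, 30.0 * 1.25),
}

for name, (prs, minutes) in scenarios.items():
    print(f"{name}: {weekly_review_hours(prs, minutes):.1f} h/week")
```

Changing either input shows how quickly review load grows when throughput rises faster than review capacity.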

2. The total cost of an engineering decision

The token bill is not the total cost. Define the expected cost of one decision:

C_total = C_model + C_tools + R_hour × (T_review + T_rework) + P_escape × L_escape

C_model: model calls;
C_tools: CI, sandbox, retrieval, and other compute;
R_hour: internal cost of one engineering hour;
T_review: time to an apply/review/reject decision;
T_rework: expected time to fix issues found before merge;
P_escape: probability that a material error passes review;
L_escape: expected loss from such an escape.

Take an illustrative baseline: C_model = $5, review takes 60 minutes, and R_hour = $80. Set tools, rework, and risk aside temporarily:

C_total = $5 + $80 = $85
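For readers who prefer code to notation, the formula can be written as a small Python sketch. The class and field names are my own shorthand for the symbols above, and the defaults reproduce the illustrative baseline only.

```python
from dataclasses import dataclass


@dataclass
class DecisionCost:
    c_model: float            # model calls, USD
    c_tools: float = 0.0      # CI, sandbox, retrieval, other compute, USD
    r_hour: float = 80.0      # internal cost of one engineering hour, USD
    t_review_h: float = 1.0   # hours to an apply/review/reject decision
    t_rework_h: float = 0.0   # expected hours of pre-merge rework
    p_escape: float = 0.0     # probability a material error passes review
    l_escape: float = 0.0     # expected loss from such an escape, USD

    def total(self) -> float:
        labor = self.r_hour * (self.t_review_h + self.t_rework_h)
        risk = self.p_escape * self.l_escape
        return self.c_model + self.c_tools + labor + risk


baseline = DecisionCost(c_model=5.0)
print(f"C_total = ${baseline.total():.2f}")
```

Tools, rework, and risk default to zero here only because the baseline sets them aside; any of them can be restored by passing the corresponding field.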

The ceiling on pure model-bill optimization

If model calls are a fraction

f = C_model / C_total

then optimizing only the model bill while holding workload, quality, review, rework, and risk fixed — even reducing model cost to zero — lowers C_total by at most f.

At the reference numbers:

f = 5 / 85 = 5.9%

This is not a ceiling on routing's total effect. A weaker cheap model may raise retries, T_rework, and P_escape; a good router may cut latency and failed calls. It is an accounting observation: when the model bill is a small part of the total, optimizing that line alone cannot solve a review-bound bottleneck.

Cutting review from 60 to 40 minutes produces a different scale of change:

C_total = $5 + $80 × (40/60) = $58.33
Saving = ($85 - $58.33) / $85 = 31.4%

Change	Model	Review	`C_total`	Saving
Baseline	$5.00	$80.00	$85.00	—
Model calls halved	$2.50	$80.00	$82.50	2.9%
Review 60→40 min	$5.00	$53.33	$58.33	31.4%
Both	$2.50	$53.33	$55.83	34.3%

In autonomous agentic loops with little human oversight, f may be large and routing can become the main economic lever. In workflows constrained by costly human review, f is lower. The relevant question is which term actually dominates the total cost.
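The table and the share f can be reproduced with a short Python sketch. It assumes the same simplified baseline (no tools, rework, or risk term), and the helper name is illustrative.

```python
R_HOUR = 80.0  # illustrative internal cost of one engineering hour, USD


def c_total(model_usd: float, review_min: float) -> float:
    """Simplified total: model bill plus review time priced at R_HOUR."""
    return model_usd + R_HOUR * review_min / 60.0


base = c_total(5.0, 60)
rows = {
    "Model calls halved": c_total(2.5, 60),
    "Review 60->40 min": c_total(5.0, 40),
    "Both": c_total(2.5, 40),
}

f = 5.0 / base  # share of the total that is model bill
print(f"model share f = {f:.1%}")
for name, cost in rows.items():
    print(f"{name}: ${cost:.2f}, saving {(base - cost) / base:.1%}")
```

Raising the model bill or lowering the review time in this sketch is a quick way to see which term dominates for a given workflow.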

3. Different systems control different parts of the cost

Modern AI systems often look similar: agents, orchestration, retrieval, a judge, and synthesis. Similar shape does not imply the same job.

Routing: Kilo Gateway and RouteLLM

Kilo exposes an OpenAI-compatible endpoint, access to many models, BYOK, usage tracking, spend limits, and organization controls [11]. ByteByteGo describes routing on a known mode — planning, coding, debugging — with user-selected tiers and a server-updated model map. The reported Kilo figures — roughly one-third lower average request cost, 80–90% of requests not requiring frontier models, a greater-than-10× tier gap, and an estimated $87K quarterly overspend from misrouting routine traffic — are vendor-reported and not independently verified [8].

An idealized model shows the potential scale. If top-tier is needed for 15% of steps and other requests cost 10% of top-tier:

relative_cost = 0.15 × 1 + 0.85 × 0.10 = 0.235
relative reduction = 1 - 0.235 = 76.5%

RouteLLM provides primary research evidence for the trade-off: a 3.66× cost-saving ratio at 95% of GPT-4's MT-Bench score for a GPT-4/Mixtral-8×7B pair, equivalent to 72.7% relative cost reduction (1 - 1/3.66) [12]. Its cost model uses short single-turn prompts and benchmark score as quality. It is not a coding-agent loop or evidence that a repository change is safe.
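Both calculations above are simple enough to check in code. The sketch below is an idealized two-tier model with hypothetical function names; it ignores retries, quality loss, and latency, which the text already flags as real effects.

```python
def blended_relative_cost(top_tier_share: float, cheap_cost_ratio: float) -> float:
    """Average cost per step relative to sending every step to the top tier."""
    if not 0.0 <= top_tier_share <= 1.0:
        raise ValueError("top_tier_share must be between 0 and 1")
    return top_tier_share * 1.0 + (1.0 - top_tier_share) * cheap_cost_ratio


def reduction_from_saving_ratio(ratio: float) -> float:
    """Convert an 'N times cheaper' ratio into a relative cost reduction."""
    return 1.0 - 1.0 / ratio


rel = blended_relative_cost(top_tier_share=0.15, cheap_cost_ratio=0.10)
print(f"relative cost {rel:.3f}, reduction {1 - rel:.1%}")
print(f"3.66x saving ratio = {reduction_from_saving_ratio(3.66):.1%} reduction")
```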

Routing answers which model should run a step and what that call costs. It does not by itself establish that the evidence behind an engineering claim is sufficient.

Agentic RAG: sufficient context

Google describes a multi-agent RAG system with a dedicated Sufficient Context Agent. It compares the query, retrieved snippets, and a draft, names missing information, and can trigger another retrieval pass. Google reports up to 34% higher accuracy than standard RAG on factuality datasets [4].

The Sufficient Context research exposes a broader failure mode: models often answer incorrectly rather than abstain when context is insufficient. Guided abstention improved correctness among answered cases by 2–10% for Gemini, GPT, and Gemma [5].

This supports a sufficient-context loop, but it is not a measured reduction in T_rework or P_escape for software development. A codebase is not merely a document corpus; it contains runtime behavior, callers, invariants, and migrations.

Multi-model deliberation: consensus is not proof

OpenRouter Fusion runs a parallel panel of 1–8 models. A judge returns a structured comparison of consensus, contradictions, partial coverage, unique insights, and blind spots; a final model writes the answer. The documentation describes the pipeline but does not provide an independent effectiveness benchmark [10].

Google Research compared 180 agent configurations and produced a useful counterexample to "more agents = more reliable." Independent topology amplified errors by up to 17.2×, while centralized coordination held amplification to 4.4×. Multi-agent coordination improved the parallelizable Finance-Agent result by 80.9%, but every multi-agent variant degraded the sequential PlanCraft result by 39–70%. The authors' predictive model selected the optimal architecture for 87% of unseen configurations [6].

This evaluation did not contain repository code review, so the numbers cannot be assigned to a particular product. The engineering hypothesis is narrower: value depends on topology, task decomposability, a centralized gate, and the quality of evidence handoffs — not on agent count alone.

Tests and static analysis

SAST, DAST, CodeQL, Semgrep, unit tests, and mutation tests provide repeatable checks of explicitly encoded properties under controlled inputs, configuration, and environment. Their quality is bounded by coverage, false positives, false negatives, and flakiness.

They are necessary, but do not always reveal that a model never opened the relevant file, built a conclusion on a false assumption, or tested an implementation detail instead of a system invariant. Green checks are not proof of complete intent.

4. Side by side

Approach	Primary problem	Unit of decision	Main output	Does not solve by itself
Kilo / model routing	Model access, cost, policy	Model request	Completion + cost data	Trust in an engineering change
Agentic RAG	Incomplete context	Sufficiency of retrieved context	Grounded answer	Patch safety and codebase invariants
Fusion / multi-model	Fragility of one model's answer	Agreement/disagreement	Consensus + contradictions	Factual checking of repository claims
Tests / static	Formalizable properties	Test/rule result	Pass/fail + diagnostics	Intent, assumptions, completeness
Verification artifact	Hidden checking area	Merge decision	Evidence boundaries + verdict	A correctness guarantee or replacement for engineers

These systems are not necessarily direct competitors. Routing manages model-call cost. Agentic RAG tests context sufficiency. Multi-model deliberation surfaces disagreement. Tests check formalized properties. A verification artifact should connect those signals to a decision about how far a candidate is supported.

5. Trust debt and hidden checking work

Suppose an engineering answer contains a set of material claims:

C = {c1, c2, ..., cn}

For each claim, a reviewer needs to know whether it is supported by evidence, contradicted, or still an assumption. A rough diagnostic metric is:

evidence_coverage = supported_claims / total_material_claims

If an answer contains 20 material claims and sufficient evidence exists for 12:

evidence_coverage = 12 / 20 = 60%

The remaining 40% are not necessarily wrong. They are the area a reviewer still needs to inspect. If a tool does not expose that area, the engineer first has to discover it and only then verify it. That is hidden verification work.
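A minimal Python sketch of this bookkeeping is shown below. The type names, the three statuses, and the example claims are illustrative assumptions; deciding what counts as a material claim or as sufficient evidence is the hard part and is not modeled here.

```python
from dataclasses import dataclass
from enum import Enum


class ClaimStatus(Enum):
    SUPPORTED = "supported"
    CONTRADICTED = "contradicted"
    ASSUMED = "assumed"


@dataclass(frozen=True)
class Claim:
    text: str
    status: ClaimStatus


def evidence_coverage(claims: list[Claim]) -> float:
    """supported_claims / total_material_claims."""
    if not claims:
        raise ValueError("coverage is undefined without material claims")
    supported = sum(c.status is ClaimStatus.SUPPORTED for c in claims)
    return supported / len(claims)


def open_review_area(claims: list[Claim]) -> list[Claim]:
    """Claims a reviewer still has to inspect: everything not supported."""
    return [c for c in claims if c.status is not ClaimStatus.SUPPORTED]


# Hypothetical answer: 12 supported claims and 8 assumptions.
claims = [Claim(f"claim {i}", ClaimStatus.SUPPORTED) for i in range(12)]
claims += [Claim(f"claim {i}", ClaimStatus.ASSUMED) for i in range(12, 20)]

print(f"evidence_coverage = {evidence_coverage(claims):.0%}")
print(f"claims left for manual review: {len(open_review_area(claims))}")
```

The useful output is the second list, not the percentage: it is the explicit search area the reviewer would otherwise have to discover.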

The goal of a verification layer is not to declare an answer absolutely correct. It is to:

connect material claims to checkable evidence;
expose relevant targets that were and were not inspected;
separate assumptions from supported conclusions;
preserve critique and rejected hypotheses;
surface open production and PR risks;
narrow the manual search area without hiding uncertainty.

Review remains. The search area should become smaller.

6. When extra verification pays for itself

Ignoring risk for a moment, an extra check costing ΔC pays for itself when it saves at least:

T_break_even = ΔC / R_hour

At R_hour = $80:

Extra cost/run	Required review saving
$2	1.5 min
$5	3.75 min
$10	7.5 min
$20	15 min
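The break-even rows follow directly from the formula. A small Python sketch, with an illustrative function name and the same assumed R_hour of $80:

```python
def break_even_minutes(extra_cost_usd: float, r_hour: float = 80.0) -> float:
    """Review minutes an extra check must save to pay for itself (risk ignored)."""
    return extra_cost_usd * 60.0 / r_hour


for extra in (2, 5, 10, 20):
    print(f"${extra} extra per run -> {break_even_minutes(extra):g} min of review saved")
```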

Now restore expected loss. Reducing P_escape by 0.1 percentage point — from 1.0% to 0.9%, for example — at L_escape = $10,000 yields:

(0.010 - 0.009) × $10,000 = $10 expected saving per run

`L_escape`	Saving/run	Saving/month at 100 runs
$1,000	$1	$100
$10,000	$10	$1,000
$100,000	$100	$10,000
$1,000,000	$1,000	$100,000

This is an expected-loss model, not a measured product outcome and not literal insurance. Its point is that expensive verification can still be economically rational when a small reduction in failure probability protects against a large loss.
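The expected-loss rows can be recomputed the same way. This Python sketch uses the illustrative probabilities from the text (1.0% to 0.9%) and a hypothetical function name; it is arithmetic on assumed inputs, not an estimate of any real failure rate.

```python
def expected_saving_per_run(p_before: float, p_after: float, l_escape: float) -> float:
    """Expected loss avoided per run when escape probability drops."""
    return (p_before - p_after) * l_escape


RUNS_PER_MONTH = 100  # assumed volume, as in the table

for loss in (1_000, 10_000, 100_000, 1_000_000):
    per_run = expected_saving_per_run(0.010, 0.009, loss)
    monthly = per_run * RUNS_PER_MONTH
    print(f"L_escape=${loss:,}: ${per_run:,.0f}/run, ${monthly:,.0f}/month")
```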

The main KPI of such a layer is the change in C_total at an unchanged or stricter quality and risk bar. Token savings alone are insufficient.

7. One implementation used to test the hypothesis

One implementation we are building and evaluating is Ündes. Multiple models, critique, consensus, and synthesis are mechanisms. The product object being tested is a reviewable artifact that aims to preserve:

the proposed solution or code candidate;
the evidence it rests on;
relevant targets that were and were not checked;
assumptions and claims that could not be proven;
critique and rejected hypotheses;
open production and PR risks;
recommended next checks;
a trust verdict.

The current state must be separated from the target model. The runtime normalizes verdicts to PATCH_SAFE or DIAGNOSTIC and stores a separate patch-safe boolean. Today it lands on DIAGNOSTIC / patch-safe=false more often than not. The phrases "safe to apply", "needs review", and "insufficient evidence" are human-facing interpretations of a trust boundary, not three implemented runtime enums.
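To make the shape of such an artifact concrete, here is a hypothetical Python sketch. It is not the actual Ündes schema: the class, the field names, and the example values are assumptions derived from the list above, and only the two verdict names and the separate patch-safe boolean come from the text.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    PATCH_SAFE = "PATCH_SAFE"
    DIAGNOSTIC = "DIAGNOSTIC"


@dataclass
class ReviewArtifact:
    candidate: str                                        # proposed solution or patch
    evidence: list[str] = field(default_factory=list)
    checked_targets: list[str] = field(default_factory=list)
    unchecked_targets: list[str] = field(default_factory=list)
    unproven_assumptions: list[str] = field(default_factory=list)
    rejected_hypotheses: list[str] = field(default_factory=list)
    open_risks: list[str] = field(default_factory=list)
    next_checks: list[str] = field(default_factory=list)
    verdict: Verdict = Verdict.DIAGNOSTIC                 # conservative default
    patch_safe: bool = False

    def manual_review_area(self) -> list[str]:
        """Everything the artifact itself says a human still has to look at."""
        return [*self.unchecked_targets, *self.unproven_assumptions, *self.open_risks]


# Hypothetical example; file names and claims are made up for illustration.
artifact = ReviewArtifact(
    candidate="guard against empty invoice list",
    evidence=["billing/invoice.py: total() iterates without a length check"],
    checked_targets=["billing/invoice.py"],
    unchecked_targets=["billing/export.py"],
    unproven_assumptions=["no caller relies on the current exception"],
)
print(artifact.verdict.value, artifact.patch_safe, artifact.manual_review_area())
```

Defaulting to DIAGNOSTIC and patch_safe=False is a design choice in this sketch: the artifact has to earn a stronger verdict rather than lose it.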

Routing is also not a hidden automatic cost optimizer. Operators explicitly declare providers, models, and per-stage overrides. Single-model mode is opt-in and reports the absence of cross-model assurance. The accurate description is configurable, operator-controlled routing.

This does not establish product superiority. It identifies an implementation of an architectural hypothesis that still needs a comparative benchmark.

What the internal telemetry says

Across two internal evaluation runs, we measured input tokens spent before the first targeted seam-fetch (tokensBeforeFirstSeamFetch):

Run	Total input tokens	Before first targeted seam-fetch	Share
A	322,807	170,162	52.7%
B	352,432	183,876	52.2%
Weighted	675,239	354,038	52.4%

The metric does not mark the first evidence of any kind: a context pack and observed files were available earlier. It marks the first targeted probe of a specific seam. Both runs ended in DIAGNOSTIC, not trusted output.

Two observations are not a benchmark. They do not establish token or time savings. They frame a measurable hypothesis: targeted evidence acquisition starts late, so some reasoning may happen before key premises are tested.
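The shares in the telemetry table, including the token-weighted row, can be recomputed from the raw counts. A short Python sketch using only the numbers reported above:

```python
# run -> (total input tokens, input tokens before first targeted seam-fetch)
runs = {
    "A": (322_807, 170_162),
    "B": (352_432, 183_876),
}

for name, (total, before) in runs.items():
    print(f"Run {name}: {before / total:.1%}")

total_all = sum(total for total, _ in runs.values())
before_all = sum(before for _, before in runs.values())
print(f"Weighted: {before_all:,} / {total_all:,} = {before_all / total_all:.1%}")
```

The weighted row divides summed tokens rather than averaging the two percentages, so a longer run counts for more.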

8. A falsifiable benchmark

A minimum comparative protocol could be:

5 public repositories across different stacks
20 tasks per repository
4 workflow variants
2 independent repeats
Total: 5 × 20 × 4 × 2 = 800 runs

Workflow variants:

Strong single-model coding assistant.
Multi-model deliberation without a repository trust artifact.
Verification workflow in single-model mode.
Verification workflow in multi-model mode.
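The run matrix can be enumerated explicitly so that every run has an identity before anything is executed. In this Python sketch the repository names are placeholders and the variant labels are shortened versions of the list above.

```python
from itertools import product

repos = [f"repo-{i}" for i in range(1, 6)]   # placeholders for 5 public repositories
tasks = range(1, 21)                          # 20 tasks per repository
variants = [
    "single-model assistant",
    "multi-model deliberation",
    "verification, single-model",
    "verification, multi-model",
]
repeats = (1, 2)                              # 2 independent repeats

run_matrix = list(product(repos, tasks, variants, repeats))
print(len(run_matrix))       # total number of planned runs
print(run_matrix[0])         # one run: (repo, task, variant, repeat)
```

Fixing the matrix up front makes it harder to quietly drop inconvenient runs later.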

Metric	What it measures
Evidence coverage	Share of material claims tied to checkable evidence
Unchecked relevant targets	Missed files, callers, and seams
Unsupported-claim rate	Claims emitted without sufficient grounding
Missed-risk count	Ground-truth risks absent from the output
False-confidence rate	Confident verdict on a wrong candidate
False-patch-safe	Unsafe result that passed the gate
Avoidable-DIAGNOSTIC	Correct candidate rejected by an evidence-acquisition defect
Reviewer minutes	Time to an apply/review/reject decision
Model cost	Actual call cost
Time to first targeted seam-fetch	When targeted seam checking started

Do not collapse these into one composite score. A cheap unsafe answer does not become better for being cheap, and an expensive "insufficient evidence" verdict can be the correct result.
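One way to keep the metrics separate is to report each as its own count or total. The Python sketch below is a hypothetical scoring helper: the record fields assume that each run has already been labeled against ground truth, and it covers only four of the metrics in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunOutcome:
    gate_patch_safe: bool              # the workflow's own verdict
    candidate_safe: bool               # ground-truth label from human review
    evidence_acquisition_defect: bool  # rejection traced to missing evidence fetch
    reviewer_minutes: float
    model_cost_usd: float


def summarize(outcomes: list[RunOutcome]) -> dict[str, float]:
    """Report metrics side by side; deliberately no composite score."""
    return {
        "false_patch_safe": sum(o.gate_patch_safe and not o.candidate_safe for o in outcomes),
        "avoidable_diagnostic": sum(
            (not o.gate_patch_safe) and o.candidate_safe and o.evidence_acquisition_defect
            for o in outcomes
        ),
        "reviewer_minutes": sum(o.reviewer_minutes for o in outcomes),
        "model_cost_usd": sum(o.model_cost_usd for o in outcomes),
    }


# Made-up outcomes, for illustration only.
sample = [
    RunOutcome(True, True, False, 12.0, 1.50),
    RunOutcome(True, False, False, 8.0, 0.40),   # unsafe result passed the gate
    RunOutcome(False, True, True, 25.0, 2.10),   # correct candidate rejected
]
print(summarize(sample))
```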

9. Limits of what is proven

METR's 19% slowdown is a specific RCT with early-2025 tools, experienced maintainers, and familiar mature repositories, not a universal result [1].
METR's newer intervals include zero effect and are described as unreliable by the authors [2].
Google's +34% concerns Agentic RAG factuality, not patch safety [4].
Multi-agent topology can improve or degrade results; consensus does not prove factual correctness [6].
Kilo figures reported by ByteByteGo are vendor-reported [8].
Two internal runs are too few for a performance claim.
A trust verdict is not a correctness guarantee; it must be evaluated through calibration, false confidence, false-patch-safe, and missed risks.

10. Conclusion

Routing can materially reduce the model bill, especially in autonomous agentic loops. Agentic RAG checks whether retrieved context is sufficient. Multi-model deliberation surfaces consensus and contradictions, but its effect depends on topology and task shape. Tests and static analysis check formalized properties.

One engineering question remains:

How far is the candidate supported by evidence, and what still needs human verification before merge?

Cheap inference, fast review, and a convincing artifact are worthless if they raise false confidence. The research hypothesis is therefore:

The value of a verification layer is determined not by how much code it generates, but by how much it narrows hidden checking work without increasing false confidence.

Until a comparative benchmark is run, this remains a grounded architectural hypothesis with working telemetry — not a proven productivity claim.

References

More Context Does Not Mean More Trust

Kair Akhmettayev — Thu, 21 May 2026 12:53:04 +0000

After publishing my previous post about engineering trust in AI coding, someone asked me a question that cannot be answered honestly in one sentence:

Why would a model start hallucinating more right after increasing context length and reducing batch size, and then become normal again after 20-30 minutes?

The first thing to say is: without inference-stack logs, we cannot prove the exact cause.

But I would not explain it as "the model needed thirty minutes to learn". During ordinary inference, the model is not training. If the weights did not change, the model itself did not become smarter after half an hour.

The more likely explanation is that the serving mode changed. That is still a hypothesis, not a proven conclusion.

In other words, the problem may not have been only "model quality". It may have been how the production inference system behaved under a new workload.

More Context Is Not a Free Upgrade

It is tempting to think:

Give the model more context and it will make fewer mistakes.

Sometimes that is true. But not always.

If the system is actually sending more relevant input tokens to the model, a longer context window increases the chance that the necessary information reaches the model at all. But it does not guarantee that the model will use the entire context equally well.

There is a well-known effect often called "lost in the middle". In "Lost in the Middle: How Language Models Use Long Contexts", the authors evaluated multi-document QA and key-value retrieval. In those tasks and on the models studied, performance often dropped when the relevant information was placed in the middle of a long context, and was better when it was closer to the beginning or the end of the input.

So "the context got longer" does not automatically mean "the answer became more trustworthy".

Sometimes a long prompt simply gives the important fact a larger place to hide.
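A position-sensitivity check of this kind can be set up on your own prompts. The Python sketch below only builds the contexts; the function name, filler documents, and the "needle" fact are hypothetical, and sending each variant to a model and scoring the answers is left out.

```python
def build_context(filler_docs: list[str], needle: str, position: str) -> str:
    """Place one relevant document at the start, middle, or end of the fillers."""
    docs = list(filler_docs)
    index = {"start": 0, "middle": len(docs) // 2, "end": len(docs)}[position]
    docs.insert(index, needle)
    return "\n\n".join(docs)


fillers = [f"Document {i}: unrelated release notes." for i in range(1, 5)]
needle = "Document X: the retry limit for the billing client is 3."

contexts = {pos: build_context(fillers, needle, pos) for pos in ("start", "middle", "end")}
for pos, text in contexts.items():
    print(pos, "->", text.split("\n\n").index(needle))
```

If answers degrade only in the "middle" variant, placement of evidence is a more likely culprit than context length as such.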

Context Length Changes More Than the Prompt

In production inference, a request is not only a semantic object. There is also a runtime profile:

prefill;
decode;
KV cache;
dynamic batching;
GPU memory pressure;
truncation policy;
timeout policy;
fallback paths;
scheduler behavior.

When requests actually start carrying more input tokens, prefill becomes more expensive. The system has to process more prompt tokens before generation begins. Memory pressure and KV-cache pressure increase. The set of requests that can fit into a batch may change.

If batch size is reduced at the same time, the shape of computation changes as well.

With greedy decoding or a fixed seed, the same runtime, and the same backend version, we usually expect reproducibility. But production inference does not always guarantee bitwise-identical behavior even for mathematically equivalent operations. PyTorch documents this class of issue in its Numerical Accuracy notes.

Serving frameworks also treat this as a real concern. At the time of writing, vLLM has a beta batch-invariance mode: under supported conditions, it is meant to make output deterministic and independent of batch size or request order in the batch. The existence of such a mode is a useful reminder that batching is not always just a performance detail.

Why the Effect Might Disappear After 20-30 Minutes

We should not pretend to know the cause without telemetry. But there are several plausible hypotheses.

1. Cache or Runtime Warm-Up

One possible hypothesis: if the stack uses prefix or prompt caching, and the traffic contains repeated prefixes, the first requests may have a worse latency or cache-hit profile. That alone does not prove an increase in hallucination rate. But it can indirectly lead to truncation, timeout, or fallback if there is a wrapper above the model with timeout-based degradation or a similar fallback policy.

A similar idea applies to runtime caches, memory pools, and allocator behavior. After a configuration change, early requests may run in a less stable profile, while later requests see smoother latency. That is a hypothesis to test with metrics, not an explanation to accept by default.

From the outside, this can look like:

The model was confused for the first half hour, then became normal.

But in that version of the story, the model did not stabilize. The system around the model did.

2. Dynamic Batching May Have Reached a More Stable Workload

After a batch-size change, the system may spend some time processing a mixed stream of requests: short, long, old, new, and differently structured contexts.

Dynamic batching builds batches from live traffic. If batch composition changes, latency and memory pressure can change as well. In some stacks, the shape of the batch may also affect bit-level or numeric reproducibility. That does not mean every such difference becomes a semantically different answer.

Again, this does not mean "the model got worse". It may be a transitional serving-state issue.

3. Truncation, Timeout, or Fallback May Have Fired

In practice, this is a class of causes worth checking before concluding "the model hallucinated".

The model may look as if it invented a fact. But sometimes the simpler root cause is that the fact never reached the prompt.

This can happen at several levels when the system is not a bare model server but a RAG, tool-use, or agent pipeline, especially if a wrapper above the model server shortens evidence, skips a tool step, or takes a fallback path under latency or resource limits.

For example:

the retriever did not return part of the evidence;
the wrapper truncated the context;
a tool returned an incomplete result;
part of the system block was shortened;
the request went through a fallback path;
the response was cut by an output limit.

From the outside, it looks like a semantic failure. Internally, it may be a delivery failure.
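A cheap first test for a delivery failure is to compare what the retriever returned with what was actually sent to the model. The Python sketch below assumes you can log both; the function name and the sample data are hypothetical, and it uses exact substring matching, so a wrapper that reformats snippets would need normalization first.

```python
def undelivered_evidence(final_prompt: str, snippets: dict[str, str]) -> list[str]:
    """Return ids of retrieved snippets that do not appear verbatim in the prompt."""
    return [sid for sid, text in snippets.items() if text.strip() not in final_prompt]


# Made-up example: the wrapper cut the context after the first snippet.
snippets = {
    "doc-1": "The retry limit for the billing client is 3.",
    "doc-2": "Timeouts are configured per region in settings.yaml.",
}
final_prompt = "Context:\nThe retry limit for the billing client is 3.\n\nQuestion: where are timeouts set?"

print(undelivered_evidence(final_prompt, snippets))
```

A non-empty result means the model could not have used that evidence, whatever its answer looked like.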

How to Test This Instead of Guessing

To prove the cause, you need to log not only model answers, but also the mode in which those answers were produced.

At minimum:

input tokens;
output tokens;
prefill latency;
decode latency;
time to first token;
effective batch size;
prefix or prompt cache hit rate;
KV-cache pressure or eviction rate;
context truncation events;
timeout events;
fallback events;
model version;
precision or quantization;
temperature, top_p, seed;
the first divergent token for the same prompt.
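One way to capture these fields is a single structured record per request. The Python sketch below is a hypothetical schema: the field names and sample values are my own, and which of them a given serving stack can actually expose has to be checked against that stack.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class InferenceRecord:
    input_tokens: int
    output_tokens: int
    prefill_latency_ms: float
    decode_latency_ms: float
    time_to_first_token_ms: float
    effective_batch_size: int
    prefix_cache_hit_rate: float
    kv_cache_eviction_rate: float
    context_truncated: bool
    timed_out: bool
    fallback_used: bool
    model_version: str
    precision: str
    temperature: float
    top_p: float
    seed: Optional[int]
    first_divergent_token: Optional[int] = None  # filled in by a later comparison


# Illustrative values only.
record = InferenceRecord(
    input_tokens=18_400, output_tokens=512,
    prefill_latency_ms=930.0, decode_latency_ms=4100.0, time_to_first_token_ms=980.0,
    effective_batch_size=4, prefix_cache_hit_rate=0.0, kv_cache_eviction_rate=0.12,
    context_truncated=True, timed_out=False, fallback_used=False,
    model_version="model-v1", precision="fp16",
    temperature=0.0, top_p=1.0, seed=1234,
)
log_line = json.dumps(asdict(record), sort_keys=True)
print(log_line)
```

Emitting one JSON line per request keeps the answer and the mode that produced it in the same place.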

The experiment also needs controls: the same prompt, the same retrieval snapshot, the same model and backend version, the same sampling parameters, and a fixed seed where the backend supports it.

A simple matrix:

short context + batch size 1;
long context + batch size 1;
long context + batch size N;
long context + dynamic batching;
long context right after a cold restart;
long context after warm-up.

If divergences appear mostly under long context + cold runtime + dynamic batching, that is an argument for an infrastructure hypothesis worth checking with repeated runs and metrics. It is not proof by itself.

The cause still has to be confirmed with latency, cache, truncation, fallback, and first-divergent-token data.

If divergences remain after warm-up and under batch size 1, then you need to look at the prompt itself, retrieval, placement of evidence, and the model.
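The first divergent token is easy to compute once output token ids are logged per condition. The Python sketch below uses made-up token ids and condition labels from the matrix; it assumes the same long prompt and sampling settings across the compared runs.

```python
from itertools import zip_longest
from typing import Optional, Sequence


def first_divergent_index(reference: Sequence[int], candidate: Sequence[int]) -> Optional[int]:
    """Index of the first differing token, or None if the sequences are identical."""
    for i, (a, b) in enumerate(zip_longest(reference, candidate)):
        if a != b:  # also triggers when one sequence ends early
            return i
    return None


# Reference: long context + batch size 1. Token ids are illustrative.
reference = [11, 42, 7, 99, 3]
outputs = {
    "long context + batch size N": [11, 42, 7, 99, 3],
    "long context + dynamic batching": [11, 42, 8, 15, 3],
    "long context right after a cold restart": [11, 42, 7],
    "long context after warm-up": [11, 42, 7, 99, 3],
}

for condition, tokens in outputs.items():
    print(f"{condition}: {first_divergent_index(reference, tokens)}")
```

An early divergence index points at a different computation path; a late one, or a truncated output, points more toward output limits or timeouts.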

Where Undes Fits

This leads to a broader engineering point: a final answer is not enough. We need to understand how the answer was produced.

Undes does not control the provider's inference stack directly. It cannot look inside another provider's KV cache or scheduler. But it can help at a different layer:

show which files and snippets were read, sent into context, and recorded as evidence;
record which claims are supported by evidence;
surface places where the model referenced an unconfirmed method or file;
separate grounded findings from assumed implementation;
preserve rejected hypotheses;
keep open checks visible instead of hiding them inside confident prose;
mark an answer as not patch-safe when the verification is incomplete.

This is not magical hallucination suppression. A more precise statement is:

Undes makes unchecked hallucinations harder to hide.

If a model invents a method, a properly instrumented run can raise that as a warning. If a dependency was not read, it should become an open check. If an important objection disappeared between phases, it can become a transition signal.

Some unsupported claims stop being just polished sentences in a final answer. They become visible as warnings, open checks, or diagnostic status that an engineer can inspect and close.

Why "More Context" Is Not the Same as "More Trust"

You can give a model more text. But trust does not come from the amount of text.

Trust comes from a verifiable link between:

the request;
the context that was read;
evidence;
claims;
critique;
unresolved risks;
the final answer.

Long context helps only when the system can show what from that context was used and what remained unverified.

Otherwise, a longer prompt may simply become a larger place for mistakes to hide.

Closing

AI coding is not only about model quality. It is about the quality of the entire engineering loop around the model: context delivery, retrieval, batching, evidence, checks, and trust status.

The next layer of AI tooling is not just "give the model more context".

The next layer is reducing the chance that a model can present unverified output as verified output without a warning or diagnostic status.

AI agents can generate code now. Undes helps you decide whether to trust it.

AI can write code fast now. The harder part is knowing when to trust it. That’s what this article is about: evidence, assumptions, rejected ideas, and reviewable engineering decisions.

Kair Akhmettayev — Tue, 19 May 2026 17:25:50 +0000

AI Coding Is Fast Now. Engineering Trust Still Has to Be Earned.

Kair Akhmettayev — Tue, 19 May 2026 17:08:22 +0000

AI tools have dramatically increased the speed of software development.

That is a fact.

Today, a model can write a function or method in minutes, sketch out tests, suggest a migration, explain an error, propose a refactoring plan, or draft an initial architecture decision.

This no longer feels like magic.

It is becoming a normal part of engineering work.

But speed has introduced another problem:

we have lost confidence.

And I do not mean only confidence in code quality.

I mean confidence that the code will actually work correctly and reliably.

A team receives an AI-generated answer: confident, coherent, often useful.

But the main question for developers is no longer whether AI can suggest something.

It can.

The real question is different:

Can we trust that suggestion?

The problem is not that AI makes mistakes

Everyone makes mistakes:

people;
tests;
documentation;
static analyzers;
models.

The problem with AI-generated answers is different:

they often make mistakes beautifully.

An answer can look logically structured, well-written, and very convincing. It may use the right terminology, sound professional, and even include code snippets that look completely valid.

But that is not always enough for a reliable engineering decision.

A developer or tech lead still needs to understand:

which files the model actually considered;
which facts from the codebase the answer is based on;
which assumptions were made without evidence;
which hypotheses were considered and rejected;
which checks are still open;
whether this can be merged, or whether it is only a diagnostic conclusion.

Without that visibility, an AI answer becomes a new kind of technical debt.

The model saved time by producing the first version.

But it pushed the verification burden back onto the team: figuring out where the answer contains facts, where it contains assumptions, where the risks are, and where it is simply a confident guess.

A confident answer is not the same as a verified answer

In a regular chat interface, the final answer often looks like the final truth.

The model says:

Here is the root cause.

Here is the fix.

Here are the tests.

And for simple cases, that may be enough.

But in a real project, details matter:

Was a neighboring call-site missed?
Did a contract change in another module?
Is the fix based on a file the model never read?
Did the model mix existing code with code it invented itself?
Did it present an assumption as a confirmed fact?
Was important criticism lost on the way to the final answer?

These are not edge cases.

This is everyday engineering work.

That is why the problem with AI coding in teams is not only the quality of the model.

The bigger problem is the lack of a verifiable process around the answer.

What a good AI engineering artifact should contain

If an AI answer is used in engineering work, it should look more like a reviewable artifact than a polished chat message.

A useful artifact should show the following.

1. What is being proposed

Not a vague statement like:

improve validation

But specific files, functions, tests, and the boundaries of the change.

2. What evidence from the codebase supports the answer

The model should show which files or code fragments confirm its conclusions.

3. Which assumptions are still assumptions

If behavior was not confirmed by the code that was actually read, this must be stated clearly.

4. Which hypotheses were rejected

This is just as important as the final conclusion.

A good investigation shows not only what turned out to be true, but also what was checked and ruled out.

5. Which checks remain open

Some things cannot be honestly closed without additional files, tests, running the project, or a human decision.

That is not a failure if the system says it explicitly.

6. Trust status

The result should distinguish between:

this can be considered a patch candidate;
this is useful diagnostics, but not a merge-ready patch.

This kind of format changes the role of an AI answer.

It stops being just generated text and becomes an engineering decision that can be reviewed.
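The six parts listed above map naturally onto a small data structure. This is a sketch under assumptions: the class names, field names, and sample values are illustrative and are not taken from Undes or any other tool.

```python
from dataclasses import dataclass, field
from enum import Enum


class TrustStatus(Enum):
    PATCH_CANDIDATE = "patch candidate"
    DIAGNOSTIC_ONLY = "diagnostic only, not merge-ready"


@dataclass
class Evidence:
    path: str      # file that was actually read
    excerpt: str   # fragment that supports the conclusion


@dataclass
class ReviewArtifact:
    proposal: str                                   # 1. what is being proposed
    changed_files: list[str]
    evidence: list[Evidence] = field(default_factory=list)        # 2.
    assumptions: list[str] = field(default_factory=list)          # 3.
    rejected_hypotheses: list[str] = field(default_factory=list)  # 4.
    open_checks: list[str] = field(default_factory=list)          # 5.
    status: TrustStatus = TrustStatus.DIAGNOSTIC_ONLY             # 6.

    def render(self) -> str:
        """Produce a plain-text report a reviewer can scan before merge."""
        lines = [f"Proposal: {self.proposal}", f"Status: {self.status.value}"]
        sections = {
            "Changed files": self.changed_files,
            "Evidence": [f"{e.path}: {e.excerpt}" for e in self.evidence],
            "Assumptions": self.assumptions,
            "Rejected hypotheses": self.rejected_hypotheses,
            "Open checks": self.open_checks,
        }
        for title, items in sections.items():
            lines.append(f"{title}:")
            lines.extend(f"  - {item}" for item in items or ["(none)"])
        return "\n".join(lines)


# Made-up example values, for illustration only.
artifact = ReviewArtifact(
    proposal="Reject empty carts in validate_order()",
    changed_files=["orders/validation.py", "tests/test_validation.py"],
    evidence=[Evidence("orders/validation.py", "def validate_order(cart): ...")],
    assumptions=["billing never calls validate_order() directly"],
    rejected_hypotheses=["the bug is in cart serialization"],
    open_checks=["run the integration suite against billing"],
)
print(artifact.render())
```

Defaulting the status to the weaker value is a deliberate choice in this sketch: an artifact has to earn "patch candidate" instead of starting there.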

Verification should be part of generation

One might say:

Fine, let the model write the answer first, and then we’ll ask it to check itself.

For small tasks, that works.

Sometimes.

But once the task becomes more serious, after-the-fact verification quickly runs into limitations:

the model may defend its own previous answer;
some evidence may already be lost from the context;
criticism may remain as prose, but never affect the final result;
open checks may be softened to make the final answer look cleaner;
generated code may not make it into the final answer in full.

That is why verification should be part of the process, not an optional step at the end.

Especially not something a developer only remembers after the problem has already happened.

We need a process where different agents or model roles do different things:

some propose a solution;
others criticize it;
a separate step synthesizes the overall conclusion;
the system checks evidence and open items;
the final answer receives a trust status.

What matters is this:

using multiple AI roles does not automatically make the answer correct.

The value is not in:

models argued, so now it must be right

The value is that the argument, evidence, risks, rejected hypotheses, and limitations do not disappear.

They become part of the final artifact.
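A minimal sketch of such a role-based flow is shown here. The role functions are stubs standing in for model calls, and every name (`Record`, `run_workflow`, the critic functions) is hypothetical. The point is only that critiques are stored in the record and carried into the result instead of being dropped during synthesis.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Record:
    task: str
    proposal: str = ""
    critiques: List[str] = field(default_factory=list)
    open_checks: List[str] = field(default_factory=list)
    conclusion: str = ""


def run_workflow(task: str,
                 propose: Callable[[str], str],
                 critics: List[Callable[[str], List[str]]],
                 synthesize: Callable[[Record], str]) -> Record:
    """Run proposal, critique, and synthesis while keeping every critique."""
    record = Record(task=task)
    record.proposal = propose(task)
    for critic in critics:
        record.critiques.extend(critic(record.proposal))
    # In this sketch each critique stays open until a later check or a human closes it.
    record.open_checks = list(record.critiques)
    record.conclusion = synthesize(record)
    return record


# Stub roles with canned output; a real system would call models here.
def propose(task):
    return f"Add a null check in parse_config() to fix: {task}"


def contract_critic(proposal):
    return ["callers of parse_config() were not inspected"]


def coverage_critic(proposal):
    return ["no test covers the empty-file case"]


def synthesize(record):
    return f"{record.proposal} ({len(record.open_checks)} open checks)"


record = run_workflow("crash on empty config", propose,
                      [contract_critic, coverage_critic], synthesize)
print(record.conclusion)
for check in record.open_checks:
    print("OPEN:", check)
```

Nothing here makes the proposal correct. The structure only guarantees that the objections survive into the final result, where a reviewer can see them.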

This is exactly why I am building Undes

Undes is a local-first AI engineering CLI that does not simply generate an engineering answer.

It generates the answer together with verification.

The idea is simple:

AI generates.

Undes verifies.

A single prompt should not produce just “a model answer”.

It should produce a verifiable engineering result:

proposed implementation or diagnostic answer;
evidence from the codebase;
assumptions;
rejected hypotheses;
risks;
open checks;
trust / patch-safety status.

Undes builds a structured workflow around the task:

proposal;
critique;
synthesis;
evidence checks;
risk review;
final artifact.

It is not trying to replace Cursor, Claude Code, Copilot, or other AI coding tools.

Those tools are useful.

They accelerate generation.

Undes focuses on a different layer:

making AI-generated engineering answers more trustworthy and more useful for teams.

Why local-first matters

For an engineering trust tool, it matters where the code lives.

The community version of Undes is designed as a local-first CLI:

the code is read locally;
the user configures access to model providers;
the result stays on the developer’s machine.

This does not mean that no LLM calls are made. The models themselves may be cloud-hosted or local.

But the process itself runs locally on the developer’s machine.

For many teams, this is an important boundary.

A trust-focused engineering tool should not begin with:

Upload your entire codebase to us.

What Undes does not promise

There is an important point here.

Undes does not promise magical correctness.

It does not turn AI into a formal verifier.

It does not replace:

tests;
code review;
CI;
security review;
engineering responsibility.

In fact, the strength of this approach is honesty:

if there is not enough evidence, the result should be diagnostic;
if there is an unresolved risk, it should be visible;
if generated code is based on an assumption, that should be stated;
if the task requires a human decision, the system should not pretend everything is closed.

For a team, this is more practical than a polished but overconfident answer.
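The four rules above can be written as a small decision function. This is a sketch under assumptions: the function name, its parameters, and the status strings are illustrative and do not describe how Undes computes its status.

```python
from typing import List, Tuple


def trust_status(evidence_count: int,
                 unresolved_risks: List[str],
                 unconfirmed_assumptions: List[str],
                 needs_human_decision: bool) -> Tuple[str, List[str]]:
    """Return a status plus the visible reasons that keep it from being stronger."""
    reasons = []
    if evidence_count == 0:
        reasons.append("not enough evidence from the codebase")
    reasons += [f"unresolved risk: {r}" for r in unresolved_risks]
    reasons += [f"code rests on assumption: {a}" for a in unconfirmed_assumptions]
    if needs_human_decision:
        reasons.append("requires a human decision")
    # Any remaining reason downgrades the result to diagnostic.
    status = "diagnostic" if reasons else "patch candidate"
    return status, reasons


# Illustrative inputs only.
status, reasons = trust_status(
    evidence_count=3,
    unresolved_risks=["migration may lock the orders table"],
    unconfirmed_assumptions=[],
    needs_human_decision=False,
)
print(status)
for reason in reasons:
    print(" -", reason)
```

Returning the reasons alongside the status keeps the downgrade explainable: the reviewer sees why the result is diagnostic, not just that it is.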

Where this is especially useful

This approach is not needed for every small question.

If you just need to quickly recall syntax or draft a throwaway script, a regular chat is enough.

Undes makes sense where the cost of a mistake is higher:

feature implementation;
bug fixes in an unfamiliar part of the project;
migration planning;
architecture decision review;
incident investigation;
refactoring that may break neighboring contracts;
codebase onboarding, where it is important to separate facts from assumptions.

In these cases, a fast answer is only half of the value.

The other half is understanding how well that answer is proven.

What should the next step in AI coding look like?

The first wave of AI coding tools made generation accessible.

The next step is to make AI-generated engineering work verifiable.

Not because models are bad.

But because good engineering teams do not trust a result just because it sounds confident.

They look at:

evidence;
risks;
contracts;
tests;
open checks;
boundaries of applicability.

AI tools should help us not only write faster, but also make fewer mistakes.

That is the direction I want to take Undes.

Try it

I am exploring this direction in the community version of Undes, an experimental local-first AI engineering CLI.

The most useful first test is simple:

take a small real task in your repository and look not only at the final answer, but also at the trust signals around it:

which evidence was used;
which assumptions remain;
which hypotheses were rejected;
which checks are still open;
what trust status the result received.

For me, the most valuable feedback is whether the artifact exposes enough signal for a real engineering review before merge.

Because the goal is not just another polished AI answer.

The goal is an AI-generated engineering answer you can actually trust.

Disclosure: this article is based on my own experience building Undes. I used AI assistance for English translation and editing, and reviewed the final text before publishing.


AI can write code fast now. The harder part is knowing when to trust it. That’s what this article is about: evidence, assumptions, rejected ideas, and reviewable engineering decisions.
