<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kair Akhmettayev</title>
    <description>The latest articles on DEV Community by Kair Akhmettayev (@kair_akhmettayev_0a8ba408).</description>
    <link>https://dev.to/kair_akhmettayev_0a8ba408</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2576886%2F67a45907-9268-4a9a-8d1c-93e361a9ba26.jpg</url>
      <title>DEV Community: Kair Akhmettayev</title>
      <link>https://dev.to/kair_akhmettayev_0a8ba408</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kair_akhmettayev_0a8ba408"/>
    <language>en</language>
    <item>
      <title>More Context Does Not Mean More Trust</title>
      <dc:creator>Kair Akhmettayev</dc:creator>
      <pubDate>Thu, 21 May 2026 12:53:04 +0000</pubDate>
      <link>https://dev.to/kair_akhmettayev_0a8ba408/more-context-does-not-mean-more-trust-185e</link>
      <guid>https://dev.to/kair_akhmettayev_0a8ba408/more-context-does-not-mean-more-trust-185e</guid>
      <description>&lt;p&gt;After publishing my &lt;a href="https://dev.to/kair_akhmettayev_0a8ba408/ai-coding-is-fast-now-engineering-trust-still-has-to-be-earned-40ok"&gt;previous post&lt;/a&gt; about engineering trust in AI coding, someone asked me a question that cannot be answered honestly in one sentence:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Why would a model start hallucinating more right after increasing context length and reducing batch size, and then become normal again after 20-30 minutes?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The first thing to say is: without inference-stack logs, we cannot prove the exact cause.&lt;/p&gt;

&lt;p&gt;But I would not explain it as "the model needed thirty minutes to learn". During ordinary inference, the model is not training. If the weights did not change, the model itself did not become smarter after half an hour.&lt;/p&gt;

&lt;p&gt;The more likely explanation is that the serving mode changed. That is still a hypothesis, not a proven conclusion.&lt;/p&gt;

&lt;p&gt;In other words, the problem may not have been only "model quality". It may have been how the production inference system behaved under a new workload.&lt;/p&gt;

&lt;h2&gt;
  
  
  More Context Is Not a Free Upgrade
&lt;/h2&gt;

&lt;p&gt;It is tempting to think:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Give the model more context and it will make fewer mistakes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sometimes that is true. But not always.&lt;/p&gt;

&lt;p&gt;If the system is actually sending more relevant input tokens to the model, a longer context window increases the chance that the necessary information reaches the model at all. But it does not guarantee that the model will use the entire context equally well.&lt;/p&gt;

&lt;p&gt;There is a well-known effect often called "lost in the middle". In &lt;a href="https://arxiv.org/abs/2307.03172" rel="noopener noreferrer"&gt;"Lost in the Middle: How Language Models Use Long Contexts"&lt;/a&gt;, the authors evaluated multi-document QA and key-value retrieval. In those tasks and on the models studied, performance often dropped when the relevant information was placed in the middle of a long context, and was better when it was closer to the beginning or the end of the input.&lt;/p&gt;

&lt;p&gt;So "the context got longer" does not automatically mean "the answer became more trustworthy".&lt;/p&gt;

&lt;p&gt;Sometimes a long prompt simply gives the important fact a larger place to hide.&lt;/p&gt;
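
&lt;p&gt;One way to check whether placement matters for your own model and task is to build the same prompt with the key fact at different positions and compare the answers. The sketch below is a minimal illustration, not the paper's benchmark: &lt;code&gt;build_probe&lt;/code&gt;, the filler passages, and the sample fact are hypothetical names and data.&lt;/p&gt;

```python
def build_probe(needle: str, fillers: list[str], position: str) -> str:
    """Place the needle document at the start, middle, or end of the filler documents."""
    # Hypothetical helper: maps a position label to an insertion index.
    index = {"start": 0, "middle": len(fillers) // 2, "end": len(fillers)}[position]
    docs = fillers[:index] + [needle] + fillers[index:]
    return "\n\n".join(f"Document {i + 1}: {doc}" for i, doc in enumerate(docs))


# Illustrative data only; real probes would use task-relevant distractors.
fillers = [f"Filler passage number {n}." for n in range(6)]
needle = "The deploy key rotates every 90 days."
probes = {pos: build_probe(needle, fillers, pos) for pos in ("start", "middle", "end")}

for pos, prompt in probes.items():
    position = [needle in doc for doc in prompt.split("\n\n")].index(True) + 1
    print(f"{pos}: needle is document {position} of {len(fillers) + 1}")
```

&lt;p&gt;Sending each probe with the same question and sampling settings gives a rough, task-specific view of whether answers degrade when the fact sits in the middle.&lt;/p&gt;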

&lt;h2&gt;
  
  
  Context Length Changes More Than the Prompt
&lt;/h2&gt;

&lt;p&gt;In production inference, a request is not only a semantic object. There is also a runtime profile:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;prefill;&lt;/li&gt;
&lt;li&gt;decode;&lt;/li&gt;
&lt;li&gt;KV cache;&lt;/li&gt;
&lt;li&gt;dynamic batching;&lt;/li&gt;
&lt;li&gt;GPU memory pressure;&lt;/li&gt;
&lt;li&gt;truncation policy;&lt;/li&gt;
&lt;li&gt;timeout policy;&lt;/li&gt;
&lt;li&gt;fallback paths;&lt;/li&gt;
&lt;li&gt;scheduler behavior.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When requests actually start carrying more input tokens, prefill becomes more expensive. The system has to process more prompt tokens before generation begins. Memory pressure and KV-cache pressure increase. The set of requests that can fit into a batch may change.&lt;/p&gt;
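
&lt;p&gt;To get a feel for the KV-cache pressure, a back-of-the-envelope estimate helps. The sketch below assumes standard attention with one key and one value vector per KV head, per layer, per token. The model shape is hypothetical, and the estimate ignores paged allocation, cache quantization, and other serving-specific optimizations.&lt;/p&gt;

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Estimate KV-cache size for one sequence under standard attention."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 16-bit values.
shape = dict(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)

for tokens in (4_096, 32_768):
    mib = kv_cache_bytes(tokens, **shape) / 1024**2
    print(f"{tokens} tokens -> {mib:.0f} MiB per sequence")
```

&lt;p&gt;Under these assumptions the cache grows linearly with context length, so an eight-times longer prompt needs eight times the cache for that one sequence, which leaves less room for other requests in the same batch.&lt;/p&gt;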

&lt;p&gt;If batch size is reduced at the same time, the shape of computation changes as well.&lt;/p&gt;

&lt;p&gt;With greedy decoding or a fixed seed, the same runtime, and the same backend version, we usually expect reproducibility. But production inference does not always guarantee bitwise-identical behavior even for mathematically equivalent operations. PyTorch documents this class of issue in its &lt;a href="https://docs.pytorch.org/docs/2.9/notes/numerical_accuracy.html" rel="noopener noreferrer"&gt;Numerical Accuracy&lt;/a&gt; notes.&lt;/p&gt;

&lt;p&gt;Serving frameworks also treat this as a real concern. At the time of writing, vLLM has a &lt;a href="https://docs.vllm.ai/usage/batch_invariance.html" rel="noopener noreferrer"&gt;beta batch-invariance mode&lt;/a&gt;: under supported conditions, it is meant to make output deterministic and independent of batch size or request order in the batch. The existence of such a mode is a useful reminder that batching is not always just a performance detail.&lt;/p&gt;
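
&lt;p&gt;The underlying reason is that floating-point addition is not associative: the same numbers summed in a different order can give slightly different results. The toy example below shows only that arithmetic property. It is not a reproduction of any specific kernel or serving stack, where the order of a reduction may depend on batch shape.&lt;/p&gt;

```python
import math

values = [0.1, 0.2, 0.3]

# Same three numbers, two groupings of the additions.
grouped_left = (values[0] + values[1]) + values[2]
grouped_right = values[0] + (values[1] + values[2])

print(grouped_left == grouped_right)              # False on IEEE 754 doubles
print(math.isclose(grouped_left, grouped_right))  # True: the gap is tiny
```

&lt;p&gt;A difference this small usually changes nothing. It matters only when two candidate tokens have nearly equal scores, which is why it should be treated as a hypothesis to measure rather than a default explanation.&lt;/p&gt;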

&lt;h2&gt;
  
  
  Why the Effect Might Disappear After 20-30 Minutes
&lt;/h2&gt;

&lt;p&gt;We should not pretend to know the cause without telemetry. But there are several plausible hypotheses.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Cache or Runtime Warm-Up
&lt;/h3&gt;

&lt;p&gt;One possible hypothesis: if the stack uses prefix or prompt caching and the traffic contains repeated prefixes, the first requests after the change may see more cache misses and higher latency. That alone does not prove an increase in hallucination rate. But if a wrapper above the model applies timeout-based degradation or a similar fallback policy, the slower requests can indirectly trigger truncation, timeouts, or fallbacks.&lt;/p&gt;

&lt;p&gt;A similar idea applies to runtime caches, memory pools, and allocator behavior. After a configuration change, early requests may run in a less stable profile, while later requests see smoother latency. That is a hypothesis to test with metrics, not an explanation to accept by default.&lt;/p&gt;

&lt;p&gt;From the outside, this can look like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The model was confused for the first half hour, then became normal.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But in that version of the story, the model did not stabilize. The system around the model did.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Dynamic Batching May Have Reached a More Stable Workload
&lt;/h3&gt;

&lt;p&gt;After a batch-size change, the system may spend some time processing a mixed stream of requests: short, long, old, new, and differently structured contexts.&lt;/p&gt;

&lt;p&gt;Dynamic batching builds batches from live traffic. If batch composition changes, latency and memory pressure can change as well. In some stacks, the shape of the batch may also affect bit-level or numeric reproducibility. That does not mean every such difference becomes a semantically different answer.&lt;/p&gt;

&lt;p&gt;Again, this does not mean "the model got worse". It may be a transitional serving-state issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Truncation, Timeout, or Fallback May Have Fired
&lt;/h3&gt;

&lt;p&gt;In practice, this is a class of causes worth checking before concluding "the model hallucinated".&lt;/p&gt;

&lt;p&gt;The model may look as if it invented a fact. But sometimes the simpler root cause is that the fact never reached the prompt.&lt;/p&gt;

&lt;p&gt;If we are talking not about a bare model server but about a RAG, tool-use, or agent pipeline, this can happen at several levels, especially if a wrapper above the model server shortens evidence, skips a tool step, or takes a fallback path under latency or resource limits.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the retriever did not return part of the evidence;&lt;/li&gt;
&lt;li&gt;the wrapper truncated the context;&lt;/li&gt;
&lt;li&gt;a tool returned an incomplete result;&lt;/li&gt;
&lt;li&gt;part of the system block was shortened;&lt;/li&gt;
&lt;li&gt;the request went through a fallback path;&lt;/li&gt;
&lt;li&gt;the response was cut by an output limit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From the outside, it looks like a semantic failure. Internally, it may be a delivery failure.&lt;/p&gt;
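
&lt;p&gt;A delivery failure of this kind is cheap to detect if the final prompt is logged. The sketch below is a simplified illustration: &lt;code&gt;missing_evidence&lt;/code&gt;, the evidence ids, and the character-budget truncation are hypothetical, and verbatim substring matching will not work for pipelines that re-chunk or reformat evidence.&lt;/p&gt;

```python
def missing_evidence(final_prompt: str, evidence: dict[str, str]) -> list[str]:
    """Return ids of evidence snippets that are not present verbatim in the prompt."""
    return [eid for eid, text in evidence.items() if text not in final_prompt]


# Hypothetical retrieval result and a wrapper that truncates to a character budget.
evidence = {
    "doc-1": "Retries are capped at 3 attempts.",
    "doc-2": "The cap was raised to 5 in release 2.4.",
}
assembled = "Answer using the evidence.\n" + "\n".join(evidence.values())
final_prompt = assembled[:70]  # simulated truncation

print(missing_evidence(final_prompt, evidence))
```

&lt;p&gt;Here the second snippet is cut off, so an answer that says "3 attempts" would look like a hallucination even though the model never saw the correction.&lt;/p&gt;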

&lt;h2&gt;
  
  
  How to Test This Instead of Guessing
&lt;/h2&gt;

&lt;p&gt;To prove the cause, you need to log not only model answers, but also the mode in which those answers were produced.&lt;/p&gt;

&lt;p&gt;At minimum:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;input tokens;&lt;/li&gt;
&lt;li&gt;output tokens;&lt;/li&gt;
&lt;li&gt;prefill latency;&lt;/li&gt;
&lt;li&gt;decode latency;&lt;/li&gt;
&lt;li&gt;time to first token;&lt;/li&gt;
&lt;li&gt;effective batch size;&lt;/li&gt;
&lt;li&gt;prefix or prompt cache hit rate;&lt;/li&gt;
&lt;li&gt;KV-cache pressure or eviction rate;&lt;/li&gt;
&lt;li&gt;context truncation events;&lt;/li&gt;
&lt;li&gt;timeout events;&lt;/li&gt;
&lt;li&gt;fallback events;&lt;/li&gt;
&lt;li&gt;model version;&lt;/li&gt;
&lt;li&gt;precision or quantization;&lt;/li&gt;
&lt;li&gt;temperature, top_p, seed;&lt;/li&gt;
&lt;li&gt;the first divergent token for the same prompt.&lt;/li&gt;
&lt;/ul&gt;
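
&lt;p&gt;As a rough sketch, most of these fields fit into one structured log record per request. The field names and sample values below are illustrative and not a standard schema. Server-level metrics such as KV-cache eviction rate, and derived values such as the first divergent token, would typically be collected or computed separately.&lt;/p&gt;

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class InferenceRecord:
    """One log row per request; field names are illustrative, not a standard schema."""
    request_id: str
    model_version: str
    precision: str
    input_tokens: int
    output_tokens: int
    time_to_first_token_ms: float
    prefill_ms: float
    decode_ms: float
    effective_batch_size: int
    prefix_cache_hit: bool
    truncated: bool
    timed_out: bool
    fallback_used: bool
    temperature: float
    top_p: float
    seed: Optional[int]


# Hypothetical sample values for a single long-context request.
record = InferenceRecord(
    request_id="req-001", model_version="model-a", precision="bf16",
    input_tokens=18_432, output_tokens=512,
    time_to_first_token_ms=2150.0, prefill_ms=2100.0, decode_ms=9400.0,
    effective_batch_size=4, prefix_cache_hit=False,
    truncated=True, timed_out=False, fallback_used=False,
    temperature=0.0, top_p=1.0, seed=1234,
)
line = json.dumps(asdict(record), sort_keys=True)
print(line)
```

&lt;p&gt;Writing one JSON line per request makes it possible to join answers with the serving conditions they were produced under.&lt;/p&gt;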

&lt;p&gt;The experiment also needs controls: the same prompt, the same retrieval snapshot, the same model and backend version, the same sampling parameters, and a fixed seed where the backend supports it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A simple matrix:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;short context + batch size 1;&lt;/li&gt;
&lt;li&gt;long context + batch size 1;&lt;/li&gt;
&lt;li&gt;long context + batch size N;&lt;/li&gt;
&lt;li&gt;long context + dynamic batching;&lt;/li&gt;
&lt;li&gt;long context right after a cold restart;&lt;/li&gt;
&lt;li&gt;long context after warm-up.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If divergences appear mostly under long context + cold runtime + dynamic batching, that is an argument for an infrastructure hypothesis worth checking with repeated runs and metrics. It is not proof by itself.&lt;/p&gt;

&lt;p&gt;The cause still has to be confirmed with latency, cache, truncation, fallback, and first-divergent-token data.&lt;/p&gt;

&lt;p&gt;If divergences remain after warm-up and under batch size 1, then you need to look at the prompt itself, retrieval, placement of evidence, and the model.&lt;/p&gt;
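
&lt;p&gt;Comparing runs across the matrix comes down to finding where two outputs for the same prompt first differ. A minimal sketch, assuming outputs are available as token lists; the function name and the sample tokens are hypothetical:&lt;/p&gt;

```python
from itertools import zip_longest
from typing import Optional, Sequence


def first_divergent_index(reference: Sequence[str], candidate: Sequence[str]) -> Optional[int]:
    """Index of the first position where two token sequences differ, or None if identical."""
    for i, (a, b) in enumerate(zip_longest(reference, candidate)):
        if a != b:
            return i
    return None


# Hypothetical token outputs for the same prompt under different serving conditions.
baseline = ["The", "cap", "is", "3", "attempts", "."]
cold_dynamic = ["The", "cap", "is", "5", "attempts", "."]
warm_batch_1 = ["The", "cap", "is", "3", "attempts", "."]

print(first_divergent_index(baseline, cold_dynamic))
print(first_divergent_index(baseline, warm_batch_1))
```

&lt;p&gt;An early divergence on a factual token points toward missing or different input. A late divergence on phrasing is more consistent with small numeric differences, though neither reading is conclusive on its own.&lt;/p&gt;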

&lt;h2&gt;
  
  
  Where Undes Fits
&lt;/h2&gt;

&lt;p&gt;This leads to a broader engineering point: a final answer is not enough. We need to understand how the answer was produced.&lt;/p&gt;

&lt;p&gt;Undes does not control the provider's inference stack directly. It cannot look inside another provider's KV cache or scheduler. But it can help at a different layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;show which files and snippets were read, sent into context, and recorded as evidence;&lt;/li&gt;
&lt;li&gt;record which claims are supported by evidence;&lt;/li&gt;
&lt;li&gt;surface places where the model referenced an unconfirmed method or file;&lt;/li&gt;
&lt;li&gt;separate grounded findings from assumed implementation;&lt;/li&gt;
&lt;li&gt;preserve rejected hypotheses;&lt;/li&gt;
&lt;li&gt;keep open checks visible instead of hiding them inside confident prose;&lt;/li&gt;
&lt;li&gt;mark an answer as not patch-safe when the verification is incomplete.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not magical hallucination suppression. A more precise statement is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Undes makes unchecked hallucinations harder to hide.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If a model invents a method, a properly instrumented run can raise that as a warning. If a dependency was not read, it should become an open check. If an important objection disappeared between phases, it can become a transition signal.&lt;/p&gt;

&lt;p&gt;Some unsupported claims stop being just polished sentences in a final answer. They become visible as warnings, open checks, or diagnostic status that an engineer can inspect and close.&lt;/p&gt;
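
&lt;p&gt;To make the idea concrete, here is a deliberately small sketch of turning unread dependencies into open checks. It is an illustration of the principle only and does not show how Undes is implemented; &lt;code&gt;open_checks&lt;/code&gt;, the claim format, and the file paths are hypothetical.&lt;/p&gt;

```python
def open_checks(claims: list[tuple[str, str]], files_read: set[str]) -> list[str]:
    """Turn claims that cite unread files into explicit open checks."""
    return [
        f"Read {path} to confirm: {text}"
        for text, path in claims
        if path not in files_read
    ]


# Hypothetical run: each claim is (statement, file it relies on).
files_read = {"src/retry.py"}
claims = [
    ("retry() caps attempts at 3", "src/retry.py"),
    ("the client sets a 30 s timeout", "src/client.py"),
]

for check in open_checks(claims, files_read):
    print(check)
```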

&lt;h2&gt;
  
  
  Why "More Context" Is Not the Same as "More Trust"
&lt;/h2&gt;

&lt;p&gt;You can give a model more text. But trust does not come from the amount of text.&lt;/p&gt;

&lt;p&gt;Trust comes from a verifiable link between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the request;&lt;/li&gt;
&lt;li&gt;the context that was read;&lt;/li&gt;
&lt;li&gt;evidence;&lt;/li&gt;
&lt;li&gt;claims;&lt;/li&gt;
&lt;li&gt;critique;&lt;/li&gt;
&lt;li&gt;unresolved risks;&lt;/li&gt;
&lt;li&gt;the final answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Long context helps only when the system can show what from that context was used and what remained unverified.&lt;/p&gt;

&lt;p&gt;Otherwise, a longer prompt may simply become a larger place for mistakes to hide.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;AI coding is not only about model quality. It is about the quality of the entire engineering loop around the model: context delivery, retrieval, batching, evidence, checks, and trust status.&lt;/p&gt;

&lt;p&gt;The next layer of AI tooling is not just "give the model more context".&lt;/p&gt;

&lt;p&gt;The next layer is reducing the chance that a model can present unverified output as verified output without a warning or diagnostic status.&lt;/p&gt;

&lt;p&gt;AI agents can generate code now. Undes helps you decide whether to trust it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2307.03172" rel="noopener noreferrer"&gt;"Lost in the Middle: How Language Models Use Long Contexts"&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.vllm.ai/usage/batch_invariance.html" rel="noopener noreferrer"&gt;vLLM Batch Invariance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.vllm.ai/en/v0.13.0/design/prefix_caching/" rel="noopener noreferrer"&gt;vLLM Automatic Prefix Caching&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.pytorch.org/docs/2.9/notes/numerical_accuracy.html" rel="noopener noreferrer"&gt;PyTorch Numerical Accuracy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I am building &lt;a href="https://github.com/HominoITea/undes" rel="noopener noreferrer"&gt;Undes&lt;/a&gt;, a local-first AI engineering CLI that generates and verifies engineering answers. If your team is already using AI coding agents, I would like feedback on which trust signals you still need before acting on generated code in a real workflow.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>AI can write code fast now. The harder part is knowing when to trust it.
That’s what this article is about: evidence, assumptions, rejected ideas, and reviewable engineering decisions.</title>
      <dc:creator>Kair Akhmettayev</dc:creator>
      <pubDate>Tue, 19 May 2026 17:25:50 +0000</pubDate>
      <link>https://dev.to/kair_akhmettayev_0a8ba408/ai-can-write-code-fast-now-the-harder-part-is-knowing-when-to-trust-it-thats-what-this-article-cfi</link>
      <guid>https://dev.to/kair_akhmettayev_0a8ba408/ai-can-write-code-fast-now-the-harder-part-is-knowing-when-to-trust-it-thats-what-this-article-cfi</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/kair_akhmettayev_0a8ba408/ai-coding-is-fast-now-engineering-trust-still-has-to-be-earned-40ok"&gt;AI Coding Is Fast Now. Engineering Trust Still Has to Be Earned.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kair Akhmettayev, May 19. 6 min read. Tags: #ai, #softwareengineering, #codereview, #programming.&lt;/p&gt;
</description>
    </item>
    <item>
      <title>AI Coding Is Fast Now. Engineering Trust Still Has to Be Earned.</title>
      <dc:creator>Kair Akhmettayev</dc:creator>
      <pubDate>Tue, 19 May 2026 17:08:22 +0000</pubDate>
      <link>https://dev.to/kair_akhmettayev_0a8ba408/ai-coding-is-fast-now-engineering-trust-still-has-to-be-earned-40ok</link>
      <guid>https://dev.to/kair_akhmettayev_0a8ba408/ai-coding-is-fast-now-engineering-trust-still-has-to-be-earned-40ok</guid>
      <description>&lt;p&gt;AI tools have dramatically increased the speed of software development.&lt;/p&gt;

&lt;p&gt;That is a fact.&lt;/p&gt;

&lt;p&gt;Today, a model can write a function or method in minutes, sketch out tests, suggest a migration, explain an error, propose a refactoring plan, or draft an initial architecture decision.&lt;/p&gt;

&lt;p&gt;This no longer feels like magic.&lt;/p&gt;

&lt;p&gt;It is becoming a normal part of engineering work.&lt;/p&gt;

&lt;p&gt;But speed has introduced another problem:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;we have lost confidence.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And I do not mean only confidence in code quality.&lt;/p&gt;

&lt;p&gt;I mean confidence that the code will actually work correctly and reliably.&lt;/p&gt;

&lt;p&gt;A team receives an AI-generated answer: confident, coherent, often useful.&lt;/p&gt;

&lt;p&gt;But the main question for developers is no longer whether AI can suggest something.&lt;/p&gt;

&lt;p&gt;It can.&lt;/p&gt;

&lt;p&gt;The real question is different:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can we trust that suggestion?&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The problem is not that AI makes mistakes
&lt;/h2&gt;

&lt;p&gt;Everyone makes mistakes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;people;&lt;/li&gt;
&lt;li&gt;tests;&lt;/li&gt;
&lt;li&gt;documentation;&lt;/li&gt;
&lt;li&gt;static analyzers;&lt;/li&gt;
&lt;li&gt;models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem with AI-generated answers is different:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;they often make mistakes beautifully.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An answer can look logically structured, well-written, and very convincing. It may use the right terminology, sound professional, and even include code snippets that look completely valid.&lt;/p&gt;

&lt;p&gt;But that is not always enough for a reliable engineering decision.&lt;/p&gt;

&lt;p&gt;A developer or tech lead still needs to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which files the model actually considered;&lt;/li&gt;
&lt;li&gt;which facts from the codebase the answer is based on;&lt;/li&gt;
&lt;li&gt;which assumptions were made without evidence;&lt;/li&gt;
&lt;li&gt;which hypotheses were considered and rejected;&lt;/li&gt;
&lt;li&gt;which checks are still open;&lt;/li&gt;
&lt;li&gt;whether this can be merged, or whether it is only a diagnostic conclusion.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without that visibility, an AI answer becomes a new kind of technical debt.&lt;/p&gt;

&lt;p&gt;The model saved time by producing the first version.&lt;/p&gt;

&lt;p&gt;But it pushed the verification burden back onto the team: figuring out where the answer contains facts, where it contains assumptions, where the risks are, and where it is simply a confident guess.&lt;/p&gt;




&lt;h2&gt;
  
  
  A confident answer is not the same as a verified answer
&lt;/h2&gt;

&lt;p&gt;In a regular chat interface, the final answer often looks like the final truth.&lt;/p&gt;

&lt;p&gt;The model says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Here is the root cause.&lt;br&gt;&lt;br&gt;
Here is the fix.&lt;br&gt;&lt;br&gt;
Here are the tests.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And for simple cases, that may be enough.&lt;/p&gt;

&lt;p&gt;But in a real project, details matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Was a neighboring call-site missed?&lt;/li&gt;
&lt;li&gt;Did a contract change in another module?&lt;/li&gt;
&lt;li&gt;Is the fix based on a file the model never read?&lt;/li&gt;
&lt;li&gt;Did the model mix existing code with code it invented itself?&lt;/li&gt;
&lt;li&gt;Did it present an assumption as a confirmed fact?&lt;/li&gt;
&lt;li&gt;Was important criticism lost on the way to the final answer?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not edge cases.&lt;/p&gt;

&lt;p&gt;This is everyday engineering work.&lt;/p&gt;

&lt;p&gt;That is why the problem with AI coding in teams is not only the quality of the model.&lt;/p&gt;

&lt;p&gt;The bigger problem is the lack of a verifiable process around the answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a good AI engineering artifact should contain
&lt;/h2&gt;

&lt;p&gt;If an AI answer is used in engineering work, it should look more like a reviewable artifact than a polished chat message.&lt;/p&gt;

&lt;p&gt;A useful artifact should show the following.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. What is being proposed
&lt;/h3&gt;

&lt;p&gt;Not a vague statement like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;improve validation&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But specific files, functions, tests, and the boundaries of the change.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. What evidence from the codebase supports the answer
&lt;/h3&gt;

&lt;p&gt;The model should show which files or code fragments confirm its conclusions.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Which assumptions are still assumptions
&lt;/h3&gt;

&lt;p&gt;If behavior was not confirmed by the code that was actually read, this must be stated clearly.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Which hypotheses were rejected
&lt;/h3&gt;

&lt;p&gt;This is just as important as the final conclusion.&lt;/p&gt;

&lt;p&gt;A good investigation shows not only what turned out to be true, but also what was checked and ruled out.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Which checks remain open
&lt;/h3&gt;

&lt;p&gt;Some things cannot be honestly closed without additional files, tests, running the project, or a human decision.&lt;/p&gt;

&lt;p&gt;That is not a failure if the system says it explicitly.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Trust status
&lt;/h3&gt;

&lt;p&gt;The result should distinguish between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;this can be considered a patch candidate;&lt;/li&gt;
&lt;li&gt;this is useful diagnostics, but not a merge-ready patch.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This kind of format changes the role of an AI answer.&lt;/p&gt;

&lt;p&gt;It stops being just generated text and becomes an engineering decision that can be reviewed.&lt;/p&gt;
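
&lt;p&gt;As a rough sketch, the six parts above could be carried in a small data structure. The class, field names, and status rule below are assumptions made for illustration, not a schema from any specific tool. The rule is intentionally conservative: without evidence, or with any unconfirmed assumption or open check, the result stays diagnostic.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    """Hypothetical shape of a reviewable answer; not a schema from any specific tool."""
    proposal: str
    evidence: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)
    rejected_hypotheses: list[str] = field(default_factory=list)
    open_checks: list[str] = field(default_factory=list)

    def trust_status(self) -> str:
        # Assumed rule: missing evidence, unconfirmed assumptions, or open checks keep it diagnostic.
        if self.evidence and not self.assumptions and not self.open_checks:
            return "patch-candidate"
        return "diagnostic-only"


# Illustrative content; the file and function names are made up.
artifact = Artifact(
    proposal="Validate page_size in list_orders() before building the query",
    evidence=["api/orders.py: list_orders() passes page_size straight to the query"],
    rejected_hypotheses=["Validation already happens in middleware"],
    open_checks=["Confirm no other caller relies on page_size=0"],
)
print(artifact.trust_status())
```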




&lt;h2&gt;
  
  
  Verification should be part of generation
&lt;/h2&gt;

&lt;p&gt;One might say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Fine, let the model write the answer first, and then we’ll ask it to check itself.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For small tasks, that works.&lt;/p&gt;

&lt;p&gt;Sometimes.&lt;/p&gt;

&lt;p&gt;But once the task becomes more serious, after-the-fact verification quickly runs into limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the model may defend its own previous answer;&lt;/li&gt;
&lt;li&gt;some evidence may already be lost from the context;&lt;/li&gt;
&lt;li&gt;criticism may remain as prose, but never affect the final result;&lt;/li&gt;
&lt;li&gt;open checks may be softened to make the final answer look cleaner;&lt;/li&gt;
&lt;li&gt;generated code may not make it into the final answer in full.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why verification should be part of the process, not an optional step at the end.&lt;/p&gt;

&lt;p&gt;Especially not something a developer only remembers after the problem has already happened.&lt;/p&gt;

&lt;p&gt;We need a process where different agents or model roles do different things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;some propose a solution;&lt;/li&gt;
&lt;li&gt;others criticize it;&lt;/li&gt;
&lt;li&gt;a separate step synthesizes the overall conclusion;&lt;/li&gt;
&lt;li&gt;the system checks evidence and open items;&lt;/li&gt;
&lt;li&gt;the final answer receives a trust status.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What matters is this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;using multiple AI roles does not automatically make the answer correct.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The value is not in:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;models argued, so now it must be right&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The value is that the argument, evidence, risks, rejected hypotheses, and limitations do not disappear.&lt;/p&gt;

&lt;p&gt;They become part of the final artifact.&lt;/p&gt;




&lt;h2&gt;
  
  
  This is exactly why I am building &lt;a href="https://github.com/HominoITea/undes" rel="noopener noreferrer"&gt;&lt;strong&gt;Undes&lt;/strong&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Undes&lt;/strong&gt; is a local-first AI engineering CLI that does not simply generate an engineering answer.&lt;/p&gt;

&lt;p&gt;It generates the answer together with verification.&lt;/p&gt;

&lt;p&gt;The idea is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AI generates.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Undes&lt;/strong&gt; verifies.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A single prompt should not produce just “a model answer”.&lt;/p&gt;

&lt;p&gt;It should produce a verifiable engineering result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;proposed implementation or diagnostic answer;&lt;/li&gt;
&lt;li&gt;evidence from the codebase;&lt;/li&gt;
&lt;li&gt;assumptions;&lt;/li&gt;
&lt;li&gt;rejected hypotheses;&lt;/li&gt;
&lt;li&gt;risks;&lt;/li&gt;
&lt;li&gt;open checks;&lt;/li&gt;
&lt;li&gt;trust / patch-safety status.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Undes&lt;/strong&gt; builds a structured workflow around the task:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;proposal;&lt;/li&gt;
&lt;li&gt;critique;&lt;/li&gt;
&lt;li&gt;synthesis;&lt;/li&gt;
&lt;li&gt;evidence checks;&lt;/li&gt;
&lt;li&gt;risk review;&lt;/li&gt;
&lt;li&gt;final artifact.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is not trying to replace Cursor, Claude Code, Copilot, or other AI coding tools.&lt;/p&gt;

&lt;p&gt;Those tools are useful.&lt;/p&gt;

&lt;p&gt;They accelerate generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Undes&lt;/strong&gt; focuses on a different layer:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;making AI-generated engineering answers more trustworthy and more useful for teams.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why local-first matters
&lt;/h2&gt;

&lt;p&gt;For an engineering trust tool, it matters where the code lives.&lt;/p&gt;

&lt;p&gt;The community version of &lt;strong&gt;Undes&lt;/strong&gt; is designed as a local-first CLI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the code is read locally;&lt;/li&gt;
&lt;li&gt;the user configures access to model providers;&lt;/li&gt;
&lt;li&gt;the result stays on the developer’s machine.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This does not mean no LLM is called: requests still go to whichever model providers the user configures, cloud-based or local.&lt;/p&gt;

&lt;p&gt;But the process itself, from reading the code to writing the result, runs on the developer’s machine.&lt;/p&gt;

&lt;p&gt;For many teams, this is an important boundary.&lt;/p&gt;

&lt;p&gt;A trust-focused engineering tool should not begin with:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Upload your entire codebase to us.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What &lt;strong&gt;Undes&lt;/strong&gt; does not promise
&lt;/h2&gt;

&lt;p&gt;There is an important point here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Undes&lt;/strong&gt; does not promise magical correctness.&lt;/p&gt;

&lt;p&gt;It does not turn AI into a formal verifier.&lt;/p&gt;

&lt;p&gt;It does not replace:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tests;&lt;/li&gt;
&lt;li&gt;code review;&lt;/li&gt;
&lt;li&gt;CI;&lt;/li&gt;
&lt;li&gt;security review;&lt;/li&gt;
&lt;li&gt;engineering responsibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In fact, the strength of this approach is honesty:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if there is not enough evidence, the result should be diagnostic;&lt;/li&gt;
&lt;li&gt;if there is an unresolved risk, it should be visible;&lt;/li&gt;
&lt;li&gt;if generated code is based on an assumption, that should be stated;&lt;/li&gt;
&lt;li&gt;if the task requires a human decision, the system should not pretend everything is closed.&lt;/li&gt;
&lt;/ul&gt;
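
&lt;p&gt;The four rules above can be sketched as a small status function. This is a hedged illustration under my own assumptions: &lt;code&gt;Draft&lt;/code&gt;, &lt;code&gt;review_status&lt;/code&gt;, and the status strings are hypothetical names, not the real Undes logic.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Draft:
    # Hypothetical container for what is known about a generated answer.
    evidence: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)
    unresolved_risks: list[str] = field(default_factory=list)
    needs_human_decision: bool = False


def review_status(draft: Draft) -> tuple[str, list[str]]:
    """Return a status plus the notes a reviewer must see."""
    # Assumptions and unresolved risks are always surfaced, never hidden.
    notes = [f"assumption: {a}" for a in draft.assumptions]
    notes += [f"unresolved risk: {r}" for r in draft.unresolved_risks]
    if draft.needs_human_decision:
        notes.append("human decision required")

    # Not enough evidence: the result stays diagnostic.
    if not draft.evidence:
        return "diagnostic", notes
    # Open risks or a pending human decision: do not pretend it is closed.
    if draft.unresolved_risks or draft.needs_human_decision:
        return "open", notes
    return "ready_for_review", notes


status, notes = review_status(
    Draft(evidence=["handler.py reads the header"], assumptions=["header is always set"])
)
```

&lt;p&gt;Even in the best case the function returns &lt;code&gt;ready_for_review&lt;/code&gt; and not "correct": the final call still belongs to tests, review, and the engineer.&lt;/p&gt;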

&lt;p&gt;For a team, this is more practical than a polished but overconfident answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where this is especially useful
&lt;/h2&gt;

&lt;p&gt;This approach is not needed for every small question.&lt;/p&gt;

&lt;p&gt;If you just need to quickly recall syntax or draft a throwaway script, a regular chat is enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Undes&lt;/strong&gt; makes sense where the cost of a mistake is higher:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;feature implementation;&lt;/li&gt;
&lt;li&gt;bug fixes in an unfamiliar part of the project;&lt;/li&gt;
&lt;li&gt;migration planning;&lt;/li&gt;
&lt;li&gt;architecture decision review;&lt;/li&gt;
&lt;li&gt;incident investigation;&lt;/li&gt;
&lt;li&gt;refactoring that may break neighboring contracts;&lt;/li&gt;
&lt;li&gt;codebase onboarding, where it is important to separate facts from assumptions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In these cases, a fast answer is only half of the value.&lt;/p&gt;

&lt;p&gt;The other half is understanding how well that answer is proven.&lt;/p&gt;




&lt;h2&gt;
  
  
  What should the next step in AI coding look like?
&lt;/h2&gt;

&lt;p&gt;The first wave of AI coding tools made generation accessible.&lt;/p&gt;

&lt;p&gt;The next step is to make AI-generated engineering work verifiable.&lt;/p&gt;

&lt;p&gt;Not because models are bad.&lt;/p&gt;

&lt;p&gt;But because good engineering teams do not trust a result just because it sounds confident.&lt;/p&gt;

&lt;p&gt;They look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;evidence;&lt;/li&gt;
&lt;li&gt;risks;&lt;/li&gt;
&lt;li&gt;contracts;&lt;/li&gt;
&lt;li&gt;tests;&lt;/li&gt;
&lt;li&gt;open checks;&lt;/li&gt;
&lt;li&gt;boundaries of applicability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI tools should help us not only write faster, but also make fewer mistakes.&lt;/p&gt;

&lt;p&gt;That is the direction I want to move &lt;strong&gt;Undes&lt;/strong&gt; in.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;I am exploring this direction in &lt;a href="https://github.com/HominoITea/undes" rel="noopener noreferrer"&gt;the community version of &lt;strong&gt;Undes&lt;/strong&gt;&lt;/a&gt;, an experimental local-first AI engineering CLI.&lt;/p&gt;

&lt;p&gt;The most useful first test is simple:&lt;/p&gt;

&lt;p&gt;Take a small real task in your repository and look not only at the final answer, but also at the trust signals around it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which evidence was used;&lt;/li&gt;
&lt;li&gt;which assumptions remain;&lt;/li&gt;
&lt;li&gt;which hypotheses were rejected;&lt;/li&gt;
&lt;li&gt;which checks are still open;&lt;/li&gt;
&lt;li&gt;what trust status the result received.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For me, the most valuable feedback is whether the artifact exposes enough signal for a real engineering review before merge.&lt;/p&gt;

&lt;p&gt;Because the goal is not just another polished AI answer.&lt;/p&gt;

&lt;p&gt;The goal is an AI-generated engineering answer you can actually trust.&lt;/p&gt;




&lt;p&gt;Disclosure: this article is based on my own experience building Undes. I used AI assistance for English translation and editing, and reviewed the final text before publishing.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>codereview</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
