
Kair Akhmettayev


More Context Does Not Mean More Trust

After publishing my previous post about engineering trust in AI coding, someone asked me a question that cannot be answered honestly in one sentence:

Why would a model start hallucinating more right after increasing context length and reducing batch size, and then become normal again after 20-30 minutes?

The first thing to say is: without inference-stack logs, we cannot prove the exact cause.

But I would not explain it as "the model needed thirty minutes to learn". During ordinary inference, the model is not training. If the weights did not change, the model itself did not become smarter after half an hour.

The more likely explanation is that the serving mode changed. That is still a hypothesis, not a proven conclusion.

In other words, the problem may not have been only "model quality". It may have been how the production inference system behaved under a new workload.

More Context Is Not a Free Upgrade

It is tempting to think:

Give the model more context and it will make fewer mistakes.

Sometimes that is true. But not always.

If the system is actually sending more relevant input tokens to the model, a longer context window increases the chance that the necessary information reaches the model at all. But it does not guarantee that the model will use the entire context equally well.

There is a well-known effect often called "lost in the middle". In "Lost in the Middle: How Language Models Use Long Contexts", the authors evaluated multi-document QA and key-value retrieval. In those tasks and on the models studied, performance often dropped when the relevant information was placed in the middle of a long context, and was better when it was closer to the beginning or the end of the input.
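A minimal sketch of how such a placement probe could be built for your own stack. The function name `build_probe`, the filler documents, and the needle sentence are all made up for illustration; this is not the paper's benchmark, only the same idea in miniature.

```python
def build_probe(filler_docs, needle, position):
    """Insert `needle` among `filler_docs` at a relative position in [0.0, 1.0]."""
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0.0 and 1.0")
    index = round(position * len(filler_docs))
    docs = filler_docs[:index] + [needle] + filler_docs[index:]
    return "\n\n".join(docs)


# Hypothetical filler and needle; swap in real documents from your own corpus.
filler = [f"Document {i}: unrelated text." for i in range(10)]
needle = "Document X: the deploy key rotates every 90 days."

# Same content, three placements: start, middle, end.
probes = {pos: build_probe(filler, needle, pos) for pos in (0.0, 0.5, 1.0)}
```

Sending each probe with the same question and comparing answers shows whether placement alone changes the result on your model.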

So "the context got longer" does not automatically mean "the answer became more trustworthy".

Sometimes a long prompt simply gives the important fact a larger place to hide.

Context Length Changes More Than the Prompt

In production inference, a request is not only a semantic object. There is also a runtime profile:

  • prefill;
  • decode;
  • KV cache;
  • dynamic batching;
  • GPU memory pressure;
  • truncation policy;
  • timeout policy;
  • fallback paths;
  • scheduler behavior.

When requests actually start carrying more input tokens, prefill becomes more expensive. The system has to process more prompt tokens before generation begins. Memory pressure and KV-cache pressure increase. The set of requests that can fit into a batch may change.
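A rough back-of-the-envelope sketch of why KV-cache pressure grows with input length. The model shape below is hypothetical, and the formula assumes a plain, unquantized KV cache with no paging or sharing, so treat the numbers as an order-of-magnitude estimate rather than a measurement of any real server.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV-cache size for one sequence (assumes fp16 values by default)."""
    # The factor 2 is there because both keys and values are stored per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model shape, chosen only to make the arithmetic concrete.
shape = dict(layers=32, kv_heads=8, head_dim=128)

for tokens in (4_096, 32_768):
    gib = kv_cache_bytes(tokens, **shape) / 2**30
    print(f"{tokens:>6} tokens -> {gib:.2f} GiB of KV cache per sequence")
```

Under these assumed dimensions, a 4,096-token sequence needs about 0.5 GiB and a 32,768-token sequence about 4 GiB, which is why fewer long requests fit into the same batch.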

If batch size is reduced at the same time, the shape of computation changes as well.

With greedy decoding or a fixed seed, the same runtime, and the same backend version, we usually expect reproducibility. But production inference does not always guarantee bitwise-identical behavior even for mathematically equivalent operations. PyTorch documents this class of issue in its Numerical Accuracy notes.

Serving frameworks also treat this as a real concern. At the time of writing, vLLM has a beta batch-invariance mode: under supported conditions, it is meant to make output deterministic and independent of batch size or request order in the batch. The existence of such a mode is a useful reminder that batching is not always just a performance detail.
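The underlying reason is easy to show without a GPU. Floating-point addition is not associative, so a reduction that groups the same numbers differently can produce a slightly different result. The sketch below uses plain Python floats; it illustrates the class of effect, not how any particular kernel or serving framework reduces values.

```python
# The same three numbers, added in two different groupings.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right)                   # 0.6000000000000001
print(right_to_left)                   # 0.6
print(left_to_right == right_to_left)  # False
```

A difference this small usually changes nothing. It matters only when two candidate tokens have nearly equal scores, which is why it shows up as rare divergence rather than constant noise.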

Why the Effect Might Disappear After 20-30 Minutes

We should not pretend to know the cause without telemetry. But there are several plausible hypotheses.

1. Cache or Runtime Warm-Up

One possible hypothesis: if the stack uses prefix or prompt caching, and the traffic contains repeated prefixes, the first requests may have a worse latency or cache-hit profile. That alone does not prove an increase in hallucination rate. But it can indirectly lead to truncation, timeout, or fallback if there is a wrapper above the model with timeout-based degradation or a similar fallback policy.

A similar idea applies to runtime caches, memory pools, and allocator behavior. After a configuration change, early requests may run in a less stable profile, while later requests see smoother latency. That is a hypothesis to test with metrics, not an explanation to accept by default.

From the outside, this can look like:

The model was confused for the first half hour, then became normal.

But in that version of the story, the model did not stabilize. The system around the model did.

2. Dynamic Batching May Have Reached a More Stable Workload

After a batch-size change, the system may spend some time processing a mixed stream of requests: short, long, old, new, and differently structured contexts.

Dynamic batching builds batches from live traffic. If batch composition changes, latency and memory pressure can change as well. In some stacks, the shape of the batch may also affect bit-level or numeric reproducibility. That does not mean every such difference becomes a semantically different answer.

Again, this does not mean "the model got worse". It may be a transitional serving-state issue.

3. Truncation, Timeout, or Fallback May Have Fired

In practice, this is a class of causes worth checking before concluding "the model hallucinated".

The model may look as if it invented a fact. But sometimes the simpler root cause is that the fact never reached the prompt.

If we are talking not about a bare model server but about a RAG, tool-use, or agent pipeline, this can happen at several levels, especially when a wrapper above the model server shortens evidence, skips a tool step, or takes a fallback path under latency or resource limits.

For example:

  • the retriever did not return part of the evidence;
  • the wrapper truncated the context;
  • a tool returned an incomplete result;
  • part of the system block was shortened;
  • the request went through a fallback path;
  • the response was cut by an output limit.

From the outside, it looks like a semantic failure. Internally, it may be a delivery failure.
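A minimal sketch of a wrapper that makes this kind of delivery failure visible instead of silent. Everything here is hypothetical: `pack_context`, `PackedContext`, and the whitespace token counter are stand-ins, and a real pipeline would use the model's own tokenizer and its own truncation policy.

```python
from dataclasses import dataclass, field


@dataclass
class PackedContext:
    chunks: list
    dropped: list = field(default_factory=list)

    @property
    def truncated(self):
        # True when at least one chunk never reached the prompt.
        return bool(self.dropped)


def pack_context(chunks, budget, count_tokens=lambda text: len(text.split())):
    """Keep chunks in order until one does not fit; record everything after it."""
    packed = PackedContext(chunks=[])
    used = 0
    full = False
    for chunk in chunks:
        cost = count_tokens(chunk)
        if not full and used + cost <= budget:
            packed.chunks.append(chunk)
            used += cost
        else:
            full = True
            packed.dropped.append(chunk)
    return packed


evidence = [
    "system rules apply here",
    "retrieved fact: timeout is 30 seconds",
    "tool output: ok",
]
packed = pack_context(evidence, budget=8)
print(packed.truncated, packed.dropped)
```

With a budget of 8 whitespace tokens, only the first chunk fits. If the model then "invents" the timeout value, the log shows that the fact was dropped before the model ever saw it.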

How to Test This Instead of Guessing

To prove the cause, you need to log not only model answers, but also the mode in which those answers were produced.

At minimum:

  • input tokens;
  • output tokens;
  • prefill latency;
  • decode latency;
  • time to first token;
  • effective batch size;
  • prefix or prompt cache hit rate;
  • KV-cache pressure or eviction rate;
  • context truncation events;
  • timeout events;
  • fallback events;
  • model version;
  • precision or quantization;
  • temperature, top_p, seed;
  • the first divergent token for the same prompt.
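One possible shape for such a log record, sketched as a dataclass serialized to one JSON line per request. The field names and example values are assumptions for illustration, not a standard schema or the output of any specific serving framework.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class InferenceRecord:
    request_id: str
    model_version: str
    precision: str
    input_tokens: int
    output_tokens: int
    prefill_ms: float
    decode_ms: float
    ttft_ms: float
    effective_batch_size: int
    prefix_cache_hit: bool
    truncated: bool
    timed_out: bool
    fallback_used: bool
    temperature: float
    top_p: float
    seed: Optional[int] = None


# Hypothetical values, only to show what one logged line could contain.
record = InferenceRecord(
    request_id="req-001", model_version="example-model-v1", precision="fp16",
    input_tokens=28_000, output_tokens=512,
    prefill_ms=1840.0, decode_ms=9200.0, ttft_ms=1905.0,
    effective_batch_size=2, prefix_cache_hit=False,
    truncated=True, timed_out=False, fallback_used=False,
    temperature=0.0, top_p=1.0, seed=1234,
)
line = json.dumps(asdict(record), sort_keys=True)
print(line)
```

Writing one such line next to every answer is what later lets you ask "were the bad answers the truncated ones?" instead of guessing.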

The experiment also needs controls: the same prompt, the same retrieval snapshot, the same model and backend version, the same sampling parameters, and a fixed seed where the backend supports it.

A simple matrix:

  1. short context + batch size 1;
  2. long context + batch size 1;
  3. long context + batch size N;
  4. long context + dynamic batching;
  5. long context right after a cold restart;
  6. long context after warm-up.
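The matrix above can be written down as data so that every condition gets the same number of repeated runs. This is a sketch under assumptions: the `Condition` fields and the `divergence_rate` helper are hypothetical, and for rows 5 and 6 the batching mode is left as "unspecified" because the list does not fix it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    context: str   # "short" or "long"
    batching: str  # "1", "N", "dynamic", or "unspecified"
    runtime: str   # "warm", "cold", or "unspecified"


MATRIX = [
    Condition("short", "1", "unspecified"),
    Condition("long", "1", "unspecified"),
    Condition("long", "N", "unspecified"),
    Condition("long", "dynamic", "unspecified"),
    Condition("long", "unspecified", "cold"),
    Condition("long", "unspecified", "warm"),
]


def divergence_rate(outputs, baseline):
    """Fraction of repeated runs whose output differs from the baseline."""
    if not outputs:
        raise ValueError("need at least one run")
    return sum(out != baseline for out in outputs) / len(outputs)


# Hypothetical outputs from four repeated runs of one condition.
print(divergence_rate(["a", "a", "b", "a"], baseline="a"))  # 0.25
```

Comparing divergence rates per condition, rather than eyeballing single answers, is what makes the matrix useful.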

If divergences appear mostly under long context + cold runtime + dynamic batching, that is an argument for an infrastructure hypothesis worth checking with repeated runs and metrics. It is not proof by itself.

The cause still has to be confirmed with latency, cache, truncation, fallback, and first-divergent-token data.

If divergences remain after warm-up and under batch size 1, then you need to look at the prompt itself, retrieval, placement of evidence, and the model.
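The "first divergent token" signal used above is simple to compute once both runs are stored as token lists. A minimal sketch, assuming you already have the generated token IDs or token strings for two runs of the same prompt; the function name is made up for illustration.

```python
def first_divergent_index(tokens_a, tokens_b):
    """Return the first index where two token sequences differ, or None if equal."""
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    if len(tokens_a) != len(tokens_b):
        # One run is a strict prefix of the other: they diverge where it ends.
        return min(len(tokens_a), len(tokens_b))
    return None


run_cold = ["The", "timeout", "is", "60", "seconds"]
run_warm = ["The", "timeout", "is", "30", "seconds"]
print(first_divergent_index(run_cold, run_warm))  # 3
```

An early divergence on a factual token points toward missing or different input. A late divergence on a near-synonym is more consistent with small numeric differences.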

Where Undes Fits

This leads to a broader engineering point: a final answer is not enough. We need to understand how the answer was produced.

Undes does not control the provider's inference stack directly. It cannot look inside another provider's KV cache or scheduler. But it can help at a different layer:

  • show which files and snippets were read, sent into context, and recorded as evidence;
  • record which claims are supported by evidence;
  • surface places where the model referenced an unconfirmed method or file;
  • separate grounded findings from assumed implementation;
  • preserve rejected hypotheses;
  • keep open checks visible instead of hiding them inside confident prose;
  • mark an answer as not patch-safe when the verification is incomplete.

This is not magical hallucination suppression. A more precise statement is:

Undes makes unchecked hallucinations harder to hide.

If a model invents a method, a properly instrumented run can raise that as a warning. If a dependency was not read, it should become an open check. If an important objection disappeared between phases, it can become a transition signal.

Some unsupported claims stop being just polished sentences in a final answer. They become visible as warnings, open checks, or diagnostic status that an engineer can inspect and close.

Why "More Context" Is Not the Same as "More Trust"

You can give a model more text. But trust does not come from the amount of text.

Trust comes from a verifiable link between:

  • the request;
  • the context that was read;
  • evidence;
  • claims;
  • critique;
  • unresolved risks;
  • the final answer.

Long context helps only when the system can show what from that context was used and what remained unverified.

Otherwise, a longer prompt may simply become a larger place for mistakes to hide.
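To make the idea of a verifiable link concrete, here is a small hypothetical data model. It is not the Undes schema or API; the class names, fields, and the `is_verified` rule are assumptions, sketched only to show how a claim without evidence can be surfaced mechanically.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    evidence_ids: list = field(default_factory=list)


@dataclass
class AnswerTrace:
    request: str
    context_read: dict  # evidence id -> snippet that was actually read
    claims: list
    unresolved_risks: list = field(default_factory=list)

    def unsupported_claims(self):
        # A claim counts as supported only if it cites evidence that was read.
        return [
            c for c in self.claims
            if not any(e in self.context_read for e in c.evidence_ids)
        ]

    def is_verified(self):
        return not self.unsupported_claims() and not self.unresolved_risks


trace = AnswerTrace(
    request="Why does the retry loop never stop?",
    context_read={"ev1": "retry.py: while True: ..."},
    claims=[
        Claim("The loop has no exit condition.", ["ev1"]),
        Claim("Config sets max_retries to 0.", ["ev2"]),  # ev2 was never read
    ],
)
print(trace.is_verified(), [c.text for c in trace.unsupported_claims()])
```

The second claim cites evidence that never entered the context, so the trace is not verified, no matter how confident the final prose sounds.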

Closing

AI coding is not only about model quality. It is about the quality of the entire engineering loop around the model: context delivery, retrieval, batching, evidence, checks, and trust status.

The next layer of AI tooling is not just "give the model more context".

The next layer is reducing the chance that a model can present unverified output as verified output without a warning or diagnostic status.

AI agents can generate code now. Undes helps you decide whether to trust it.



I am building Undes, a local-first AI engineering CLI that generates and verifies engineering answers. If your team is already using AI coding agents, I would like feedback on which trust signals you still need before acting on generated code in a real workflow.
