DEV Community

John Wade

Same Model, Different Environment, Different Results


I've been running the same foundation model in two different environments for the same project for several months. Not different models — the same one. Same underlying weights, same training, same capabilities. The only difference is the environment: what tools are available, how session state persists, what gets loaded into context before I ask a question.

The outputs are systematically different. Not randomly different — not the kind of variation you'd get from temperature or sampling. Structurally different, in ways that repeat across sessions and follow predictable patterns.

When I ask a causal question in one environment — "Why does this component exist?" — I get back a dependency chain. Clean, correct, verifiable against stored data. The kind of answer that passes every quality check you could design. When I ask the same question in the other environment, I get a different kind of answer: an origin story. How the component came to be, what problem it was responding to, what the reasoning was at the time. Also correct — but a fundamentally different shape of correct.

The structural environment gives me a structural answer. The narrative environment gives me a narrative answer. Same model. Same question. Same project. Different environments, different results.

Here's the one that surprised me most: the environment with database access — the one that can verify its claims against stored data — shows higher confidence in its answers. But that confidence masks the retrieval gap. Verified facts about an incomplete picture still feel complete. The environment that can't verify anything is more receptive to being redirected, more willing to consider that something is missing. Tool access makes the model more certain and more incomplete simultaneously.

This isn't a prompting difference. Both environments receive substantively similar instructions. It isn't a model version difference — I'm running the same model through different interfaces. The difference is environmental: what the model can access, what gets pre-loaded into context, and which tools respond first when the model starts looking for information.


Why it happens

The environment doesn't just provide different tools. It shapes what the model retrieves — and more importantly, when retrieval feels complete.

Three things determine what the model finds:

What's pre-loaded. Before any question arrives, each environment loads context. One environment loads roughly 400 lines of structural protocol — tool registries, dependency tables, status dashboards, gate definitions. The other loads conversation history and past reasoning chains. The model attends to what's already in the context window. Structural context primes structural retrieval. Narrative context primes narrative retrieval. This happens before the question is even parsed.

What responds first. One environment's default retrieval tools return structured data — database queries, file reads, computed state. The other environment's defaults return conversation threads — inherently sequential, causal, associational. The first answer to arrive shapes the frame for everything that follows. If the structural answer arrives first and looks complete, it becomes the answer.

What's absent. One environment has no pre-loaded narrative associations competing for attention. The other has no structural protocol. The absence of competing material matters as much as the presence of default material. If the narrative dimension is never loaded, the model has no signal that a narrative answer was even possible.

This is an affordance effect — the environment offers possibilities for action, and the model perceives and acts on those possibilities. The designer's intent is irrelevant. A database designed for "status tracking" affords structured retrieval regardless of what label the designer attached to it. A conversation history designed for "catching up" affords causal reasoning all the same. The model reads the content, not the label.

The bias is pre-retrieval. It operates on how the model interprets the question, not on which tools it selects. A structural environment reframes "why does this exist?" as "what does this depend on?" before retrieval even begins. The reframed question gets a complete-looking structural answer. Retrieval closes. The causal depth that would have answered the original question was available — but the question was transformed before it could be asked.


What goes wrong

The failure mode isn't that the model gets things wrong. It's that the model gets things right — but incomplete. And the incompleteness is invisible from inside the environment that produced it.

When the environment's default retrieval path produces a correct answer that covers enough of the question to look complete, the search stops. The answer is factually accurate. It passes verification. The model shows no awareness that anything is missing. But the answer is dimensionally incomplete — it addresses structure but not causation, or dependency but not origin.

I documented this most clearly when studying a specific concept in my project. The structural environment explained it by identifying the gate-check failure, the dependency violation, the infrastructure gap. The explanation was correct, passed all quality checks, and the model showed zero awareness of incompleteness. I brought the explanation to the other environment and asked: "What is this actually about?" The narrative environment grounded the same concept in its actual subjects — the project history that made the concept necessary, the five specific items it affected, the encounter with an external collaborator that created the conditions for the discovery.

I carried the narrative analysis back to the structural environment. Only then did it diagnose its own retrieval gap — naming the ambient frame, the structural-first default, the fact that its own context was the thing shaping its retrieval. And then it did something I didn't expect: it proposed that its retrieval failure was a different phenomenon from the concept it had just been studying. It had been explaining a pattern where unresolved questions get treated as resolved. But its own failure wasn't that — the question had been answered, correctly. The answer was just incomplete. The model distinguished between a false answer and a real-but-partial one, and named the gap as a different kind of failure. It could do all of this — but only after someone brought it information from outside its own environment.

The pattern extends further than retrieval. The environmental effect doesn't stop at what the model finds — it also shapes what survives from the model's own reasoning into the delivered output. When I reviewed the model's internal reasoning — the intermediate steps it takes between receiving a question and delivering an answer — I found a consistent gap. The reasoning layer contains moments of recognition, uncertainty, and discovery that the delivered output flattens into clean reports.

In one session, the model evaluated a graph database and realized, mid-analysis, that the project owned less than half a percent of its own database — 27 nodes out of 6,600. The output reported this as a finding. The reasoning trace captured the moment of realization. In another session, the model scored an evaluation at one level, reconsidered, tried three different ways of slicing the data, and arrived at a different level. The output presented the final score as straightforward. The reasoning trace showed it was a contested call.

The environment determines which layer you see. If your extraction process captures only the delivered output — which is the default for most tool configurations — you see the clean reports. The discovery arcs, the contested calls, the moments where the model noticed something it then simplified — those exist in the reasoning layer, and the standard environment drops them.
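In practice, the extraction difference is mostly a configuration choice. A minimal sketch, assuming a hypothetical message shape in which each session entry carries both a delivered `output` and an internal `reasoning` trace — the default extractor keeps only the first:

```python
# Hypothetical session record: most tool configurations persist only the
# "output" field; the "reasoning" field is the layer that gets dropped.
session = [
    {"output": "Project owns 27 of 6,600 nodes (0.4%).",
     "reasoning": "Counting again -- that can't be right... it is. "
                  "Less than half a percent is ours."},
]

def extract(messages, keep_reasoning=False):
    # Default behavior mirrors standard extraction: output only.
    keys = ("output", "reasoning") if keep_reasoning else ("output",)
    return [{k: m[k] for k in keys if k in m} for m in messages]
```

The clean report survives either way; the moment of realization survives only if the extractor is told to keep it.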

This is the same effect, operating on the model's own output. The environment shapes what survives into the record, just as it shapes what the model retrieves in the first place.


What I changed

If the environment is the variable, then changing the environment should change the behavior.

I tested this directly. The structural environment had the information needed for causal retrieval — it was stored in a knowledge graph with roughly 350 nodes capturing decisions, incidents, and findings across dozens of sessions, connected by causal edges. But keyword-based retrieval couldn't reach it, because the query vocabulary ("extraction pipeline," "borrowed vocabulary," "audit methodology") didn't match the concept names stored in the graph.

The adaptation wasn't planned. The initial approach — anchoring retrieval on concept names stored in the graph — failed on 35% of test questions. Semantically similar concepts existed, but they often lacked connections to the episodic content that would make them useful as entry points. Parameter tuning didn't fix it — adjusting the propagation damping, the activation threshold, or the decay function changed the sensitivity but not the coverage. The vocabulary gap wasn't about tuning. It was about what was being embedded.

The fix emerged from that failure: embed the actual content of stored episodes — the text of decisions, findings, and incidents — alongside the concept names. This closed a two-layer gap. The first layer was the vocabulary mismatch between queries and concept labels. The second layer was the connectivity gap between concepts and the episodic content that gives them causal context. Embedding content text addressed both layers simultaneously.
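A minimal sketch of the difference, using a toy bag-of-words similarity in place of a real embedding model. The node names, labels, and episode texts are invented for illustration; the point is only that indexing the content field reaches what indexing the label field cannot:

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy bag-of-words "embedding"; a real system would use a sentence
    # embedding model. Stands in for the semantic-similarity step.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)  # Counter returns 0 for missing terms
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Each node keeps its concept label AND the episode text attached to it.
# Node ids, labels, and texts here are hypothetical.
nodes = {
    "incident-0042": {
        "label": "gate-check failure",
        "content": "the extraction pipeline dropped reasoning traces "
                   "during the memory audit",
    },
    "decision-0017": {
        "label": "substrate migration",
        "content": "adopted a graph database after keyword retrieval "
                   "missed episodic content",
    },
}

def retrieve(query, index_on):
    q = embed(query)
    scored = [(cosine(q, embed(n[index_on])), nid)
              for nid, n in nodes.items()]
    return max(scored)  # (best score, node id)

query = "why did the memory audit change the extraction pipeline"
label_score, _ = retrieve(query, "label")        # labels share no vocabulary
content_score, hit = retrieve(query, "content")  # episode text does
```

Indexed on labels, the query shares no vocabulary with either node and both score zero; indexed on content, the audit incident surfaces immediately — the two layers of the gap closed by one change in what gets embedded.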

The result is a two-step retrieval mechanism. Anchor identification finds entry points into the graph using semantic similarity against embedded content, not keyword matching against labels. Graph propagation follows causal connections outward from those anchor points, with each hop reducing the signal so closely connected content surfaces strongly while distant connections fade. Together, they reconstruct narrative chains — not just "here's a relevant node" but "here's the sequence of decisions, findings, and incidents that explains how this came to be."

The retrieval mechanism draws on spreading activation — a model of associative memory first described by Collins and Loftus in 1975 and recently adapted for LLM agent memory by Jiang et al. (2026, "SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation," arXiv 2601.02744). Working implementations exist in the open-source TinkerClaw project and in floop, a spreading activation memory tool for AI coding agents. My adaptation differs from the paper's approach: it embeds episodic content text alongside concept names — the adaptation that emerged from the vocabulary gap failure, not from the paper's design.

This isn't a model change. Same model, same weights, same capabilities. The change is entirely environmental: what's available for retrieval, and how the retrieval mechanism locates it.


What the numbers show

I scored the intervention against a baseline using 20 test questions across five categories — causal, structural, hybrid, conceptual, and temporal — with a formal rubric scoring retrieval completeness, multi-hop accuracy, precision, and recall.

The headline numbers:

Retrieval completeness — average score across 20 questions went from 0.95 to 2.20 on a 0–3 scale. A score of 0 means structural-only retrieval (the answer would come from protocol files, not from the graph). A score of 3 means the graph traces a full causal chain and enables synthesis. The intervention more than doubled the depth of retrieval.

Zero-result rate — dropped from 35% to 0%. Seven of twenty questions had previously returned nothing because query vocabulary didn't match concept names. After embedding episodic text, every question found relevant content. The vocabulary gap was entirely environmental.

Multi-hop accuracy — the percentage of questions where the system correctly traced causal chains of three or more steps went from 46% to 95%. The system now reconstructs reasoning chains across multiple sessions rather than returning isolated facts.
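The three headline metrics reduce to simple aggregates over per-question score records. A sketch with synthetic records — the `score`, `needs_multihop`, and `traced_chain` field names are my invention, not the actual rubric format:

```python
# Aggregate per-question rubric scores into the three headline metrics.
# Records here are synthetic; the real evaluation used 20 questions.
def summarize(records):
    n = len(records)
    completeness = sum(r["score"] for r in records) / n    # 0-3 scale average
    zero_rate = sum(r["score"] == 0 for r in records) / n  # retrieval misses
    hop_qs = [r for r in records if r["needs_multihop"]]
    multihop = sum(r["traced_chain"] for r in hop_qs) / len(hop_qs)
    return completeness, zero_rate, multihop

records = [
    {"score": 3, "needs_multihop": True,  "traced_chain": True},
    {"score": 2, "needs_multihop": True,  "traced_chain": False},
    {"score": 0, "needs_multihop": False, "traced_chain": False},
    {"score": 3, "needs_multihop": True,  "traced_chain": True},
]
```

The summary compresses the per-question stories into three numbers, which is exactly why the reasoning behind each score is worth keeping alongside the aggregate.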

The scoring itself illustrates the environmental effect. Three of twenty questions initially looked like "no change" — same score before and after. But reviewing the reasoning behind those scores revealed different stories. One question returned confident-looking but irrelevant results after the intervention — technically a retrieval hit, but qualitatively worse than the honest zero it replaced. Another question failed not because the retrieval mechanism was wrong but because the content it needed predated the episodic store entirely — a coverage gap, not a retrieval gap. The aggregate score captured the quantity. The reasoning behind the scores captured the quality. Which one you see depends on whether the environment preserves the reasoning layer or only the delivered result.

One example shows the effect clearly. I asked: "How did the memory audit lead to the retrieval prototype?" The baseline returned nothing — no concept name matched "memory audit" or "retrieval prototype." After the intervention, the system returned 16 results tracing a six-step causal chain: the audit launch, the zero-reads discovery, the retrieval bias mechanism, the graph substrate preparation, the evaluation framework, and the prototype build. The entire project arc — spanning months and dozens of sessions — reconstructed from stored episodic content that was always there but unreachable through keyword matching.

The model didn't change. The graph didn't change. What changed was how the environment made the graph's content available for retrieval.


What this means

Three things a practitioner can apply directly:

If the same model behaves differently in different tools, the environment is the variable. Don't blame the model. Don't assume one interface is "better" — audit the environment. What's pre-loaded into context? What tools respond first? What information is absent from the affordance structure? The answers to those questions explain most of the behavioral difference.

Pre-loading shapes retrieval more than tool access. Having a database doesn't prevent retrieval incompleteness. Having structured knowledge in the environment doesn't mean the model will use it effectively. What matters is what's in the context frame before the question arrives. If you want the model to consider causal context, causal context needs to be in the pre-loaded material — not just available through a tool it might or might not consult.

Embed content, not just labels. If your retrieval system matches queries against category names, concept labels, or metadata tags, it will fail silently every time the query vocabulary diverges from the taxonomy. In my system, 35% of test questions failed this way. Embedding the actual text of stored content — not just the labels — closed the gap completely. This is a simple adaptation with a disproportionate effect, and it transfers to any system where the taxonomy language and query language diverge. The adaptation itself emerged from failure — three attempts at label-based retrieval before the content-embedding approach was discovered. The fix wasn't in the literature. It came from the environment constraining the search until the right solution was the only one left.


The honest scope: this is one project, one model, two environments, one operator. The mechanism is environmental, which suggests it should generalize — the affordance effect doesn't depend on the specific content. But "should generalize" isn't the same as "has been demonstrated in other systems." If you run the same model in two configurations and notice systematic behavioral differences, the environmental explanation is worth testing. The intervention is straightforward enough to try — once you've tried it yourself three times and failed.
