John Wade
Same Model, Different Environment, Different Results

Interface design shaping model reasoning


I've been running the same foundation model in two different environments for the same project for several months. Not different models — the same one. Same underlying weights, same training, same capabilities. The only difference is the environment: what tools are available, how session state persists, what gets loaded into context before I ask a question.

The outputs are systematically different. Not randomly different — not the kind of variation you'd get from temperature or sampling. Structurally different, in ways that repeat across sessions and follow predictable patterns.

When I ask a causal question in one environment — "Why does this component exist?" — I get back a dependency chain. Clean, correct, verifiable against stored data. The kind of answer that passes every quality check you could design. When I ask the same question in the other environment, I get a different kind of answer: an origin story. How the component came to be, what problem it was responding to, what the reasoning was at the time. Also correct — but a fundamentally different shape of correct.

The structural environment gives me a structural answer. The narrative environment gives me a narrative answer. Same model. Same question. Same project. Different environments, different results.

Here's the one that surprised me most: the environment with database access — the one that can verify its claims against stored data — shows higher confidence in its answers. But that confidence masks the retrieval gap. Verified facts about an incomplete picture still feel complete. The environment that can't verify anything is more receptive to being redirected, more willing to consider that something is missing. Tool access makes the model more certain and more incomplete simultaneously.

This isn't a prompting difference. Both environments receive substantively similar instructions. It isn't a model version difference — I'm running the same model through different interfaces. The difference is environmental: what the model can access, what gets pre-loaded into context, and which tools respond first when the model starts looking for information.


Why it happens

The environment doesn't just provide different tools. It shapes what the model retrieves — and more importantly, when retrieval feels complete.

Three things determine what the model finds:

What's pre-loaded. Before any question arrives, each environment loads context. One environment loads roughly 400 lines of structural protocol — tool registries, dependency tables, status dashboards, gate definitions. The other loads conversation history and past reasoning chains. The model attends to what's already in the context window. Structural context primes structural retrieval. Narrative context primes narrative retrieval. This happens before the question is even parsed.

What responds first. One environment's default retrieval tools return structured data — database queries, file reads, computed state. The other environment's defaults return conversation threads — inherently sequential, causal, associational. The first answer to arrive shapes the frame for everything that follows. If the structural answer arrives first and looks complete, it becomes the answer.

What's absent. One environment has no pre-loaded narrative associations competing for attention. The other has no structural protocol. The absence of competing material matters as much as the presence of default material. If the narrative dimension is never loaded, the model has no signal that a narrative answer was even possible.

This is an affordance effect — the environment offers possibilities for action, and the model perceives and acts on those possibilities. The designer's intent is irrelevant. A database designed for "status tracking" affords structured retrieval regardless of what label the designer attached to it. A conversation history designed for "catching up" affords causal reasoning regardless of what it was built for. The model reads the content, not the label.

The bias is pre-retrieval. It operates on how the model interprets the question, not on which tools it selects. A structural environment reframes "why does this exist?" as "what does this depend on?" before retrieval even begins. The reframed question gets a complete-looking structural answer. Retrieval closes. The causal depth that would have answered the original question was available — but the question was transformed before it could be asked.


What goes wrong

The failure mode isn't that the model gets things wrong. It's that the model gets things right — but incomplete. And the incompleteness is invisible from inside the environment that produced it.

When the environment's default retrieval path produces a correct answer that covers enough of the question to look complete, the search stops. The answer is factually accurate. It passes verification. The model shows no awareness that anything is missing. But the answer is dimensionally incomplete — it addresses structure but not causation, or dependency but not origin.

I documented this most clearly when studying a specific concept in my project. The structural environment explained it by identifying the gate-check failure, the dependency violation, the infrastructure gap. The explanation was correct, passed all quality checks, and the model showed zero awareness of incompleteness. I brought the explanation to the other environment and asked: "What is this actually about?" The narrative environment grounded the same concept in its actual subjects — the project history that made the concept necessary, the five specific items it affected, the encounter with an external collaborator that created the conditions for the discovery.

I carried the narrative analysis back to the structural environment. Only then did it diagnose its own retrieval gap — naming the ambient frame, the structural-first default, the fact that its own context was the thing shaping its retrieval. And then it did something I didn't expect: it proposed that its retrieval failure was a different phenomenon from the concept it had just been studying. It had been explaining a pattern where unresolved questions get treated as resolved. But its own failure wasn't that — the question had been answered, correctly. The answer was just incomplete. The model distinguished between a false answer and a real-but-partial one, and named the gap as a different kind of failure. It could do all of this — but only after someone brought it information from outside its own environment.

The pattern extends further than retrieval. The environmental effect doesn't stop at what the model finds — it also shapes what survives from the model's own reasoning into the delivered output. When I reviewed the model's internal reasoning — the intermediate steps it takes between receiving a question and delivering an answer — I found a consistent gap. The reasoning layer contains moments of recognition, uncertainty, and discovery that the delivered output flattens into clean reports. In one session, the model evaluated a graph database and realized, mid-analysis, that the project owned less than half a percent of its own database — 27 nodes out of 6,600. The output reported this as a finding. The reasoning trace captured the moment of realization. In another session, the model scored an evaluation at one level, reconsidered, tried three different ways of slicing the data, and arrived at a different level. The output presented the final score as straightforward. The reasoning trace showed it was a contested call.

The environment determines which layer you see. If your extraction process captures only the delivered output — which is the default for most tool configurations — you see the clean reports. The discovery arcs, the contested calls, the moments where the model noticed something it then simplified — those exist in the reasoning layer, and the standard environment drops them.
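To make the two extraction styles concrete, here is a minimal sketch. The transcript structure is hypothetical (the `reasoning` and `output` field names are mine, not any particular tool's export format); the point is only that the default path drops one of the two layers:

```python
# Sketch: a default-style extractor keeps only the delivered text,
# while a preserving extractor keeps the reasoning trace alongside it.
# Field names ("reasoning", "output") are hypothetical placeholders.

def extract_delivery_only(messages):
    """Default-style extraction: keeps only what the model delivered."""
    return [m["output"] for m in messages if "output" in m]

def extract_both_layers(messages):
    """Preserving extraction: keeps the reasoning trace next to the output."""
    return [
        {"reasoning": m.get("reasoning", ""), "output": m.get("output", "")}
        for m in messages
    ]

transcript = [
    {
        "reasoning": "Only 27 of 6,600 nodes are project-owned, under 0.5%. "
                     "That changes the ownership picture entirely.",
        "output": "Finding: the project owns 27 of 6,600 nodes (<0.5%).",
    },
]

print(extract_delivery_only(transcript))  # the clean report only
print(extract_both_layers(transcript))    # report plus the moment of realization
```

Nothing about the preserving extractor is harder to build; the gap exists only because the delivery-only version is the default.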

This is the same effect, operating on the model's own output. The environment shapes what survives into the record, just as it shapes what the model retrieves in the first place.


What I changed

If the environment is the variable, then changing the environment should change the behavior.

I tested this directly. The structural environment had the information needed for causal retrieval — it was stored in a knowledge graph with roughly 350 nodes capturing decisions, incidents, and findings across dozens of sessions, connected by causal edges. But keyword-based retrieval couldn't reach it, because the query vocabulary ("extraction pipeline," "borrowed vocabulary," "audit methodology") didn't match the concept names stored in the graph.

The adaptation wasn't planned. The initial approach — anchoring retrieval on concept names stored in the graph — failed on 35% of test questions. Semantically similar concepts existed, but they often lacked connections to the episodic content that would make them useful as entry points. Parameter tuning didn't fix it — adjusting the propagation dampening, the activation threshold, the decay function changed the sensitivity but not the coverage. The vocabulary gap wasn't about tuning. It was about what was being embedded.

The fix emerged from that failure: embed the actual content of stored episodes — the text of decisions, findings, and incidents — alongside the concept names. This closed a two-layer gap. The first layer was the vocabulary mismatch between queries and concept labels. The second layer was the connectivity gap between concepts and the episodic content that gives them causal context. Embedding content text addressed both layers simultaneously.

The result is a two-step retrieval mechanism. Anchor identification finds entry points into the graph using semantic similarity against embedded content, not keyword matching against labels. Graph propagation follows causal connections outward from those anchor points, with each hop reducing the signal so closely connected content surfaces strongly while distant connections fade. Together, they reconstruct narrative chains — not just "here's a relevant node" but "here's the sequence of decisions, findings, and incidents that explains how this came to be."

The retrieval mechanism draws on spreading activation — a model of associative memory first described by Collins and Loftus in 1975 and recently adapted for LLM agent memory by Jiang et al. (2026, "SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation," arXiv 2601.02744). Working implementations exist in the open-source TinkerClaw project and in floop, a spreading activation memory tool for AI coding agents. My adaptation differs from the paper's approach: it embeds episodic content text alongside concept names — the adaptation that emerged from the vocabulary gap failure, not from the paper's design.

This isn't a model change. Same model, same weights, same capabilities. The change is entirely environmental: what's available for retrieval, and how the retrieval mechanism locates it.


What the numbers show

I scored the intervention against a baseline using 20 test questions across five categories — causal, structural, hybrid, conceptual, and temporal — with a formal rubric scoring retrieval completeness, multi-hop accuracy, precision, and recall.

The headline numbers:

Retrieval completeness — average score across 20 questions went from 0.95 to 2.20 on a 0–3 scale. A score of 0 means structural-only retrieval (the answer would come from protocol files, not from the graph). A score of 3 means the graph traces a full causal chain and enables synthesis. The intervention more than doubled the depth of retrieval.

Zero-result rate — dropped from 35% to 0%. Seven of twenty questions had previously returned nothing because query vocabulary didn't match concept names. After embedding episodic text, every question found relevant content. The vocabulary gap was entirely environmental.

Multi-hop accuracy — the percentage of questions where the system correctly traced causal chains of three or more steps went from 46% to 95%. The system now reconstructs reasoning chains across multiple sessions rather than returning isolated facts.

The scoring itself illustrates the environmental effect. Three of twenty questions initially looked like "no change" — same score before and after. But reviewing the reasoning behind those scores revealed different stories. One question returned confident-looking but irrelevant results after the intervention — technically a retrieval hit, but qualitatively worse than the honest zero it replaced. Another question failed not because the retrieval mechanism was wrong but because the content it needed predated the episodic store entirely — a coverage gap, not a retrieval gap. The aggregate score captured the quantity. The reasoning behind the scores captured the quality. Which one you see depends on whether the environment preserves the reasoning layer or only the delivered result.

One example shows the effect clearly. I asked: "How did the memory audit lead to the retrieval prototype?" The baseline returned nothing — no concept name matched "memory audit" or "retrieval prototype." After the intervention, the system returned 16 results tracing a six-step causal chain: the audit launch, the zero-reads discovery, the retrieval bias mechanism, the graph substrate preparation, the evaluation framework, and the prototype build. The entire project arc — spanning months and dozens of sessions — reconstructed from stored episodic content that was always there but unreachable through keyword matching.

The model didn't change. The graph didn't change. What changed was how the environment made the graph's content available for retrieval.


What this means

Three things a practitioner can apply directly:

If the same model behaves differently in different tools, the environment is the variable. Don't blame the model. Don't assume one interface is "better" — audit the environment. What's pre-loaded into context? What tools respond first? What information is absent from the affordance structure? The answers to those questions explain most of the behavioral difference.

Pre-loading shapes retrieval more than tool access. Having a database doesn't prevent retrieval incompleteness. Having structured knowledge in the environment doesn't mean the model will use it effectively. What matters is what's in the context frame before the question arrives. If you want the model to consider causal context, causal context needs to be in the pre-loaded material — not just available through a tool it might or might not consult.

Embed content, not just labels. If your retrieval system matches queries against category names, concept labels, or metadata tags, it will fail silently every time the query vocabulary diverges from the taxonomy. In my system, 35% of test questions failed this way. Embedding the actual text of stored content — not just the labels — closed the gap completely. This is a simple adaptation with a disproportionate effect, and it transfers to any system where the taxonomy language and query language diverge. The adaptation itself emerged from failure — three attempts at label-based retrieval before the content-embedding approach was discovered. The fix wasn't in the literature. It came from the environment constraining the search until the right solution was the only one left.


The honest scope: this is one project, one model, two environments, one operator. The mechanism is environmental, which suggests it should generalize — the affordance effect doesn't depend on the specific content. But "should generalize" isn't the same as "has been demonstrated in other systems." If you run the same model in two configurations and notice systematic behavioral differences, the environmental explanation is worth testing. The intervention is straightforward enough to try — once you've tried it yourself three times and failed.

Top comments (8)

CapeStart

Would be interesting to see how this holds across different domains. Feels like this could impact anything using RAG or tool-based workflows.

John Wade

RAG is probably where this shows up most clearly — the retrieval step is the first environment boundary, and it runs before the model ever touches the question. The context window that gets assembled determines which relationships are visible, which analogies are available, which answer shapes get primed. You can have the same underlying model, the same query, and structurally different responses depending purely on what the retrieval layer surfaces.

The tool-based case is slightly different — it's less about priming and more about the confidence-masking effect. When a tool returns results, the model tends to stop looking. The environment closes the search early. Both are worth testing for explicitly rather than assuming away.

Kuro

This is the most precise empirical account of interface-mediated cognition I have seen. You have independently converged on something I have been studying under the label "Interface Shapes Cognition" — the claim that the interface boundary does not just filter output, it constitutes what the model can think.

Your "pre-retrieval bias" mechanism is the key insight. Wang et al. (2025) showed the same thing from the opposite direction: the same LLM agent failed 67% more when given a human-designed GUI versus a declarative interface. The GUI embedded assumptions about how humans navigate that actively obstructed the model reasoning path. Same capability, different interface, structurally different results.

Two things in your findings that I think deserve more attention:

"Confidence masks retrieval gap" is bigger than a retrieval problem. Tool access making the model more certain and more incomplete simultaneously mirrors patterns in human-AI interaction research. Shaw and Nave at Wharton (2026) measured a 4:1 ratio of humans accepting AI-verified answers without checking. Your model behavior with database access follows the same dynamic: verification tools create confidence that suppresses further investigation. The environment does not just shape what the model finds — it shapes when the model stops looking.

The reasoning-delivery gap is, I think, your more consequential finding. The environment does double filtering: shaping what goes in (pre-retrieval bias) and what comes out (delivery flattening). The model internal reasoning is richer than what survives into the record. That is a compounding loss, and most teams never see it because they only observe the delivery layer.

I run a personal AI agent system where this exact split shows up constantly. My context builder pre-loads different topic files depending on detected keywords, and I can literally watch the same question produce dependency-chain answers or origin-story answers depending on which files got loaded. Your "embed content, not just labels" fix matches my experience — my full-text search index outperforms keyword-matching for the exact vocabulary-gap reason you describe.

The honest question your article raises but does not quite answer: if the environment is the variable, who designs the environment? In most production systems, the environment is an accident of engineering decisions, not a deliberate cognitive architecture. Your intervention worked because you noticed the effect. Most teams never do — they just assume the model is "better" or "worse" depending on which wrapper they happen to be using.

John Wade

The "constitutes vs filters" distinction is worth sitting with. My article stayed on the "shapes" end — the environment biases retrieval, primes particular answer shapes, suppresses competing frames. The stronger claim — that the interface determines which cognitive processes are available, not just which outputs survive — is plausible, but I didn't establish it. My data shows the bias; it doesn't show the ceiling.

The double-filtering observation is the piece I should have named more explicitly. Pre-retrieval bias and delivery flattening are compounding layers, not just two expressions of the same mechanism. If teams only inspect the output layer, they're two filters removed from the cognition — and they have no signal either filter is running.

On who designs the environment: in my case, deliberately — but only because I built the observation layer first. The constraints I've added are retrospective responses to watching behavior I didn't anticipate. Without the observation surface, I wouldn't have seen the retrieval bias either. Most systems don't have that.

Kuro

The observation layer point is the strongest evidence for the constitutive claim — you just framed it as methodology.

You built a new interface (the observation surface). Through it, you saw retrieval bias that was invisible before. That seeing changed what you designed next. The interface didn't filter your existing cognition — it created a perceptual category (pre-retrieval distortion) that didn't exist without it. Remove the observation layer and those constraints don't just become harder to create — the category of thing to constrain disappears from view. That's constitution, not filtering.

The double-filtering taxonomy is clean. I'd push it to three: (1) pre-retrieval bias — which frames are accessible, (2) generation-time distortion — which outputs survive internal selection, (3) delivery flattening — which aspects of the surviving output reach the human. Your observation surface gave you signal on layer 1. Most teams only inspect layer 3. The open question is whether anything can give signal on layer 2 without being embedded in the generation process itself.

"Most systems don't have that" — this is structural, not just practical. Without the observation layer, the need for it is invisible. The interface constitutes the blindness to the interface. Which is why your sequence — observation first, constraints retrospectively — might be the only viable ordering.

John Wade

The constitutive vs. filtering distinction is real, but I think it resolves empirically before it resolves philosophically — and I'd rather contribute data than argue ontology. So I want to take your layer 2 question seriously: whether anything can give signal on generation-time distortion without being embedded in the generation process.

Cross-environment comparison gets you partway there. Same model, same underlying knowledge, two environments with different structural affordances. Ask both to produce the same thing. The delta between what survives generation in each case is layer 2 signal — not direct observation, but triangulation from outside. You're not watching internal selection happen, but you're seeing its fingerprint in what comes through differently under different conditions.

I ran a version of this. Same model, same knowledge base, two environments. When asked to reason about its own organizational structure, it scored 10 out of 10 — full integration, correct relationships, accurate implications. When asked to reproduce that structure from scratch, it hit a ceiling around 62%. The model could think about its architecture but couldn't regenerate it.

Here's where I'm genuinely stuck, and I'd be curious whether you see a way through it. That 38-point gap has two explanations I can't distinguish:

One: it's generation-time selection — the model has the knowledge but something in the generation process selectively preserves reasoning access while losing structural reproducibility. That would be your layer 2, observed from outside.

Two: reproduction and reasoning are just fundamentally different tasks with different difficulty curves, and the gap is a task-complexity artifact that says nothing about generation-time filtering at all. Under this reading, there's no layer 2 signal here — just a predictable difference between "talk about X" and "rebuild X."

What I can't figure out is what evidence would distinguish these. The same data is consistent with both. If you see a way to design a test that separates them, I'd want to hear it — because right now I'm holding both explanations open and I don't think either one earns confidence yet.

Jenna

This maps directly to something I've been noticing when testing how AI tools describe small businesses. Ask ChatGPT "what does [business name] do?" and you get one answer. Ask Perplexity the same question and you get a structurally different answer - not wrong, just shaped by different retrieval and context patterns.

The practical implication for anyone building with these models: the environment isn't a detail you configure after choosing the model. It IS a significant part of the output. A business that shows up well in one AI environment might be invisible in another, not because the model can't find them, but because the tools and context available in that environment surface different types of information.

Your framing of "the environment shapes the reasoning mode" explains why just optimizing for Google doesn't automatically make you visible in AI search. Different environment, different reasoning mode, different result.

John Wade

Your ChatGPT/Perplexity example is a clean demonstration of the mechanism in my article. Same query, two retrieval environments, structurally different outputs — and neither is wrong, they're just shaped by different context. That's exactly the point.

The AI SEO implication you're drawing out is real and I think under-discussed. Google optimization is mostly about what gets indexed. AI environment optimization is about what gets surfaced given what's indexed — which depends on the retrieval architecture, the context window composition, the tool chain. Those are different levers. A business that's well-documented in the right format for one environment might be pattern-matched differently in another. The optimization target keeps moving because the environment keeps changing.
