<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: John Wade</title>
    <description>The latest articles on DEV Community by John Wade (@john_wade_dev).</description>
    <link>https://dev.to/john_wade_dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3767946%2F23863695-77fa-4a98-b1e0-f3150aba0f3c.jpg</url>
      <title>DEV Community: John Wade</title>
      <link>https://dev.to/john_wade_dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/john_wade_dev"/>
    <language>en</language>
    <item>
      <title>Same Model, Different Environment, Different Results</title>
      <dc:creator>John Wade</dc:creator>
      <pubDate>Sat, 04 Apr 2026 06:08:55 +0000</pubDate>
      <link>https://dev.to/john_wade_dev/same-model-different-environment-different-results-kdb</link>
      <guid>https://dev.to/john_wade_dev/same-model-different-environment-different-results-kdb</guid>
      <description>&lt;h1&gt;Same Model, Different Environment, Different Results&lt;/h1&gt;

&lt;p&gt;I've been running the same foundation model in two different environments for the same project for several months. Not different models — the same one. Same underlying weights, same training, same capabilities. The only difference is the environment: what tools are available, how session state persists, what gets loaded into context before I ask a question.&lt;/p&gt;

&lt;p&gt;The outputs are systematically different. Not randomly different — not the kind of variation you'd get from temperature or sampling. Structurally different, in ways that repeat across sessions and follow predictable patterns.&lt;/p&gt;

&lt;p&gt;When I ask a causal question in one environment — "Why does this component exist?" — I get back a dependency chain. Clean, correct, verifiable against stored data. The kind of answer that passes every quality check you could design. When I ask the same question in the other environment, I get a different kind of answer: an origin story. How the component came to be, what problem it was responding to, what the reasoning was at the time. Also correct — but a fundamentally different shape of correct.&lt;/p&gt;

&lt;p&gt;The structural environment gives me a structural answer. The narrative environment gives me a narrative answer. Same model. Same question. Same project. Different environments, different results.&lt;/p&gt;

&lt;p&gt;Here's the one that surprised me most: the environment with database access — the one that can verify its claims against stored data — shows &lt;em&gt;higher&lt;/em&gt; confidence in its answers. But that confidence masks the retrieval gap. Verified facts about an incomplete picture still feel complete. The environment that can't verify anything is more receptive to being redirected, more willing to consider that something is missing. Tool access makes the model more certain and more incomplete simultaneously.&lt;/p&gt;

&lt;p&gt;This isn't a prompting difference. Both environments receive substantively similar instructions. It isn't a model version difference — I'm running the same model through different interfaces. The difference is environmental: what the model can access, what gets pre-loaded into context, and which tools respond first when the model starts looking for information.&lt;/p&gt;




&lt;h2&gt;Why it happens&lt;/h2&gt;

&lt;p&gt;The environment doesn't just provide different tools. It shapes what the model retrieves — and more importantly, when retrieval feels complete.&lt;/p&gt;

&lt;p&gt;Three things determine what the model finds:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's pre-loaded.&lt;/strong&gt; Before any question arrives, each environment loads context. One environment loads roughly 400 lines of structural protocol — tool registries, dependency tables, status dashboards, gate definitions. The other loads conversation history and past reasoning chains. The model attends to what's already in the context window. Structural context primes structural retrieval. Narrative context primes narrative retrieval. This happens before the question is even parsed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What responds first.&lt;/strong&gt; One environment's default retrieval tools return structured data — database queries, file reads, computed state. The other environment's defaults return conversation threads — inherently sequential, causal, associational. The first answer to arrive shapes the frame for everything that follows. If the structural answer arrives first and looks complete, it becomes the answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's absent.&lt;/strong&gt; One environment has no pre-loaded narrative associations competing for attention. The other has no structural protocol. The absence of competing material matters as much as the presence of default material. If the narrative dimension is never loaded, the model has no signal that a narrative answer was even possible.&lt;/p&gt;

&lt;p&gt;This is an affordance effect — the environment offers possibilities for action, and the model perceives and acts on those possibilities. The designer's intent is irrelevant. A database designed for "status tracking" affords structured retrieval regardless of what label the designer attached to it. A conversation history designed for "catching up" affords causal reasoning regardless of what it was built for. The model reads the content, not the label.&lt;/p&gt;

&lt;p&gt;The bias is pre-retrieval. It operates on how the model interprets the question, not on which tools it selects. A structural environment reframes "why does this exist?" as "what does this depend on?" before retrieval even begins. The reframed question gets a complete-looking structural answer. Retrieval closes. The causal depth that would have answered the original question was available — but the question was transformed before it could be asked.&lt;/p&gt;




&lt;h2&gt;What goes wrong&lt;/h2&gt;

&lt;p&gt;The failure mode isn't that the model gets things wrong. It's that the model gets things right — but incomplete. And the incompleteness is invisible from inside the environment that produced it.&lt;/p&gt;

&lt;p&gt;When the environment's default retrieval path produces a correct answer that covers enough of the question to look complete, the search stops. The answer is factually accurate. It passes verification. The model shows no awareness that anything is missing. But the answer is dimensionally incomplete — it addresses structure but not causation, or dependency but not origin.&lt;/p&gt;

&lt;p&gt;I documented this most clearly when studying a specific concept in my project. The structural environment explained it by identifying the gate-check failure, the dependency violation, the infrastructure gap. The explanation was correct, passed all quality checks, and the model showed zero awareness of incompleteness. I brought the explanation to the other environment and asked: "What is this actually about?" The narrative environment grounded the same concept in its actual subjects — the project history that made the concept necessary, the five specific items it affected, the encounter with an external collaborator that created the conditions for the discovery.&lt;/p&gt;

&lt;p&gt;I carried the narrative analysis back to the structural environment. Only then did it diagnose its own retrieval gap — naming the ambient frame, the structural-first default, the fact that its own context was the thing shaping its retrieval. And then it did something I didn't expect: it proposed that its retrieval failure was a &lt;em&gt;different phenomenon&lt;/em&gt; from the concept it had just been studying. It had been explaining a pattern where unresolved questions get treated as resolved. But its own failure wasn't that — the question had been answered, correctly. The answer was just incomplete. The model distinguished between a false answer and a real-but-partial one, and named the gap as a different kind of failure. It could do all of this — but only after someone brought it information from outside its own environment.&lt;/p&gt;

&lt;p&gt;The pattern extends further than retrieval. The environmental effect doesn't stop at what the model finds — it also shapes what survives from the model's own reasoning into the delivered output. When I reviewed the model's internal reasoning — the intermediate steps it takes between receiving a question and delivering an answer — I found a consistent gap. The reasoning layer contains moments of recognition, uncertainty, and discovery that the delivered output flattens into clean reports.&lt;/p&gt;

&lt;p&gt;In one session, the model evaluated a graph database and realized, mid-analysis, that the project owned less than half a percent of its own database — 27 nodes out of 6,600. The output reported this as a finding. The reasoning trace captured the moment of realization. In another session, the model scored an evaluation at one level, reconsidered, tried three different ways of slicing the data, and arrived at a different level. The output presented the final score as straightforward. The reasoning trace showed it was a contested call.&lt;/p&gt;

&lt;p&gt;The environment determines which layer you see. If your extraction process captures only the delivered output — which is the default for most tool configurations — you see the clean reports. The discovery arcs, the contested calls, the moments where the model noticed something it then simplified — those exist in the reasoning layer, and the standard environment drops them.&lt;/p&gt;

&lt;p&gt;This is the same effect, operating on the model's own output. The environment shapes what survives into the record, just as it shapes what the model retrieves in the first place.&lt;/p&gt;




&lt;h2&gt;What I changed&lt;/h2&gt;

&lt;p&gt;If the environment is the variable, then changing the environment should change the behavior.&lt;/p&gt;

&lt;p&gt;I tested this directly. The structural environment had the information needed for causal retrieval — it was stored in a knowledge graph with roughly 350 nodes capturing decisions, incidents, and findings across dozens of sessions, connected by causal edges. But keyword-based retrieval couldn't reach it, because the query vocabulary ("extraction pipeline," "borrowed vocabulary," "audit methodology") didn't match the concept names stored in the graph.&lt;/p&gt;

&lt;p&gt;The adaptation wasn't planned. The initial approach — anchoring retrieval on concept names stored in the graph — failed on 35% of test questions. Semantically similar concepts existed, but they often lacked connections to the episodic content that would make them useful as entry points. Parameter tuning didn't fix it — adjusting the propagation dampening, the activation threshold, and the decay function changed the sensitivity but not the coverage. The vocabulary gap wasn't about tuning. It was about what was being embedded.&lt;/p&gt;

&lt;p&gt;The fix emerged from that failure: embed the actual content of stored episodes — the text of decisions, findings, and incidents — alongside the concept names. This closed a two-layer gap. The first layer was the vocabulary mismatch between queries and concept labels. The second layer was the connectivity gap between concepts and the episodic content that gives them causal context. Embedding content text addressed both layers simultaneously.&lt;/p&gt;

&lt;p&gt;The result is a two-step retrieval mechanism. &lt;strong&gt;Anchor identification&lt;/strong&gt; finds entry points into the graph using semantic similarity against embedded content, not keyword matching against labels. &lt;strong&gt;Graph propagation&lt;/strong&gt; follows causal connections outward from those anchor points, with each hop reducing the signal so closely connected content surfaces strongly while distant connections fade. Together, they reconstruct narrative chains — not just "here's a relevant node" but "here's the sequence of decisions, findings, and incidents that explains how this came to be."&lt;/p&gt;
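&lt;p&gt;A minimal sketch of the two-step mechanism, with toy vectors standing in for real embeddings. The function names, the decay factor, and the graph shape are illustrative assumptions, not the project's actual implementation:&lt;/p&gt;

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def find_anchors(query_vec, content_vecs, top_k=3):
    """Step 1: anchor identification.
    Rank nodes by semantic similarity of the query against embedded
    episode content, not keyword overlap with concept labels."""
    scored = sorted(content_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [node for node, _ in scored[:top_k]]

def propagate(graph, anchors, decay=0.5, threshold=0.05):
    """Step 2: graph propagation.
    Spread activation outward along causal edges; each hop multiplies
    the signal by `decay`, so nearby episodes surface strongly and
    distant connections fade below `threshold`."""
    activation = {a: 1.0 for a in anchors}
    frontier = list(anchors)
    while frontier:
        nxt = []
        for node in frontier:
            signal = activation[node] * decay
            if signal >= threshold:
                for neighbor in graph.get(node, []):
                    if signal > activation.get(neighbor, 0.0):
                        activation[neighbor] = signal
                        nxt.append(neighbor)
        frontier = nxt
    # Highest-activation nodes first: the reconstructed chain.
    return sorted(activation.items(), key=lambda kv: -kv[1])
```

&lt;p&gt;With &lt;code&gt;decay=0.5&lt;/code&gt;, an anchor's direct neighbor surfaces at half strength and a three-hop connection at an eighth — the "closely connected content surfaces strongly while distant connections fade" behavior in miniature.&lt;/p&gt;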

&lt;p&gt;The retrieval mechanism draws on spreading activation — a model of associative memory first described by Collins and Loftus in 1975 and recently adapted for LLM agent memory by Jiang et al. (2026, "SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation," arXiv 2601.02744). Working implementations exist in the open-source TinkerClaw project and in floop, a spreading activation memory tool for AI coding agents. My adaptation differs from the paper's approach: it embeds episodic content text alongside concept names — the adaptation that emerged from the vocabulary gap failure, not from the paper's design.&lt;/p&gt;

&lt;p&gt;This isn't a model change. Same model, same weights, same capabilities. The change is entirely environmental: what's available for retrieval, and how the retrieval mechanism locates it.&lt;/p&gt;




&lt;h2&gt;What the numbers show&lt;/h2&gt;

&lt;p&gt;I scored the intervention against a baseline using 20 test questions across five categories — causal, structural, hybrid, conceptual, and temporal — with a formal rubric scoring retrieval completeness, multi-hop accuracy, precision, and recall.&lt;/p&gt;

&lt;p&gt;The headline numbers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval completeness&lt;/strong&gt; — average score across 20 questions went from 0.95 to 2.20 on a 0–3 scale. A score of 0 means structural-only retrieval (the answer would come from protocol files, not from the graph). A score of 3 means the graph traces a full causal chain and enables synthesis. The intervention more than doubled the depth of retrieval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero-result rate&lt;/strong&gt; — dropped from 35% to 0%. Seven of twenty questions had previously returned nothing because query vocabulary didn't match concept names. After embedding episodic text, every question found relevant content. The vocabulary gap was entirely environmental.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-hop accuracy&lt;/strong&gt; — the percentage of questions where the system correctly traced causal chains of three or more steps went from 46% to 95%. The system now reconstructs reasoning chains across multiple sessions rather than returning isolated facts.&lt;/p&gt;

&lt;p&gt;The scoring itself illustrates the environmental effect. Three of twenty questions initially looked like "no change" — same score before and after. But reviewing the reasoning behind those scores revealed different stories. One question returned confident-looking but irrelevant results after the intervention — technically a retrieval hit, but qualitatively worse than the honest zero it replaced. Another question failed not because the retrieval mechanism was wrong but because the content it needed predated the episodic store entirely — a coverage gap, not a retrieval gap. The aggregate score captured the quantity. The reasoning behind the scores captured the quality. Which one you see depends on whether the environment preserves the reasoning layer or only the delivered result.&lt;/p&gt;

&lt;p&gt;One example shows the effect clearly. I asked: "How did the memory audit lead to the retrieval prototype?" The baseline returned nothing — no concept name matched "memory audit" or "retrieval prototype." After the intervention, the system returned 16 results tracing a six-step causal chain: the audit launch, the zero-reads discovery, the retrieval bias mechanism, the graph substrate preparation, the evaluation framework, and the prototype build. The entire project arc — spanning months and dozens of sessions — reconstructed from stored episodic content that was always there but unreachable through keyword matching.&lt;/p&gt;

&lt;p&gt;The model didn't change. The graph didn't change. What changed was how the environment made the graph's content available for retrieval.&lt;/p&gt;




&lt;h2&gt;What this means&lt;/h2&gt;

&lt;p&gt;Three things a practitioner can apply directly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If the same model behaves differently in different tools, the environment is the variable.&lt;/strong&gt; Don't blame the model. Don't assume one interface is "better" — audit the environment. What's pre-loaded into context? What tools respond first? What information is absent from the affordance structure? The answers to those questions explain most of the behavioral difference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-loading shapes retrieval more than tool access.&lt;/strong&gt; Having a database doesn't prevent retrieval incompleteness. Having structured knowledge in the environment doesn't mean the model will use it effectively. What matters is what's in the context frame before the question arrives. If you want the model to consider causal context, causal context needs to be in the pre-loaded material — not just available through a tool it might or might not consult.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embed content, not just labels.&lt;/strong&gt; If your retrieval system matches queries against category names, concept labels, or metadata tags, it will fail silently every time the query vocabulary diverges from the taxonomy. In my system, 35% of test questions failed this way. Embedding the actual text of stored content — not just the labels — closed the gap completely. This is a simple adaptation with a disproportionate effect, and it transfers to any system where the taxonomy language and query language diverge. The adaptation itself emerged from failure — three attempts at label-based retrieval before the content-embedding approach was discovered. The fix wasn't in the literature. It came from the environment constraining the search until the right solution was the only one left.&lt;/p&gt;
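&lt;p&gt;The label-versus-content gap is easy to reproduce. The sketch below uses naive token overlap instead of vector embeddings, purely to make the silent failure visible; the index structure and node name are hypothetical:&lt;/p&gt;

```python
def tokens(text):
    return set(text.lower().split())

def keyword_hits(query, index):
    # Label-only matching: fails silently whenever the query
    # vocabulary diverges from the taxonomy.
    q = tokens(query)
    return [node for node, entry in index.items()
            if q.intersection(tokens(entry["label"]))]

def content_hits(query, index):
    # Index the stored text alongside the label, and the same
    # query reaches the episode through its own vocabulary.
    q = tokens(query)
    combined = lambda e: e["label"] + " " + e["content"]
    return [node for node, entry in index.items()
            if q.intersection(tokens(combined(entry)))]

index = {
    "concept:retrieval-bias": {
        "label": "retrieval bias",
        "content": "the extraction pipeline borrowed vocabulary "
                   "from the audit methodology",
    }
}
keyword_hits("extraction pipeline", index)  # [] : silent miss
content_hits("extraction pipeline", index)  # one hit
```

&lt;p&gt;Real embeddings close the gap for the same reason the token version does: the episode text, not the label, carries the vocabulary a query will actually use.&lt;/p&gt;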




&lt;p&gt;The honest scope: this is one project, one model, two environments, one operator. The mechanism is environmental, which suggests it should generalize — the affordance effect doesn't depend on the specific content. But "should generalize" isn't the same as "has been demonstrated in other systems." If you run the same model in two configurations and notice systematic behavioral differences, the environmental explanation is worth testing. The intervention is straightforward enough to try — once you've tried it yourself three times and failed.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>developertools</category>
    </item>
    <item>
      <title>When the System Starts Reading Its Own Files</title>
      <dc:creator>John Wade</dc:creator>
      <pubDate>Thu, 26 Mar 2026 10:37:03 +0000</pubDate>
      <link>https://dev.to/john_wade_dev/when-the-system-starts-reading-its-own-files-57jh</link>
      <guid>https://dev.to/john_wade_dev/when-the-system-starts-reading-its-own-files-57jh</guid>
      <description>

&lt;p&gt;The daemon had been watching the vault for two weeks. Five hundred and forty-five files — conversation exports, session notes, design documents, research captures — sitting in an Obsidian vault with no analytical metadata. No concept tags. No framework annotations. No discipline classification. Each file arrived raw and stayed raw unless I manually enriched it during import.&lt;/p&gt;

&lt;p&gt;I'd built a digest system — a pipeline designed to scan recent imports against the knowledge graph and surface connections between concepts. It was supposed to find where new material touched old material in ways I hadn't noticed. In practice, it spent its entire analytical budget on a prerequisite step: figuring out what each file was about before it could ask how files related to each other. The system designed to find connections was stuck classifying.&lt;/p&gt;

&lt;p&gt;In one session, while I was working on why the digest felt so slow, a question surfaced that reframed the problem: why is tagging manual? The daemon already watched for new files, indexed frontmatter, tracked changes. It was doing infrastructure work — the passive kind. Files should arrive pre-tagged the way they arrive pre-indexed. By the time the digest runs, the vault should already be enriched. That reframing took five minutes. The implementation took two hours and nine bugs.&lt;/p&gt;

&lt;p&gt;Part 9 of &lt;em&gt;Building at the Edges of LLM Tooling&lt;/em&gt;. If you're building systems where LLM infrastructure needs to do cognitive work in the background — classifying, enriching, preparing material before a session starts — the automation boundary isn't where you expect it. Start &lt;a href="https://dev.to/john_wade_dev/what-happens-when-you-try-to-build-something-real-with-llms-104d"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;Why It Breaks&lt;/h2&gt;

&lt;p&gt;Post 8 built the persistence layer — a graph database with typed relationships, a vector store for discovery, and an MCP server that makes it all queryable from inside any LLM session. The session bridge works. New sessions open with structural context instead of compressed summaries.&lt;/p&gt;

&lt;p&gt;But the persistence layer stores what's been explicitly extracted. Each concept in the graph got there because a session produced it, the capture protocol flagged it, and I reviewed and approved it. That process works for concepts — ideas with analytical substance worth naming and connecting. It doesn't work for the raw material that concepts emerge from.&lt;/p&gt;

&lt;p&gt;Five hundred and forty-five conversation files in the vault. Each one contains concepts, frameworks, references to thinkers, discipline-specific vocabulary — but none of that is tagged. The graph knows about 130 named concepts with typed relationships. The vault holds the raw sessions where those concepts were developed, argued about, refined, and sometimes abandoned — and the graph can't see any of it because the files have no structured metadata connecting them to the concepts they contain.&lt;/p&gt;

&lt;p&gt;The digest system was designed to bridge this gap. Scan recent imports. Compare against existing concepts. Surface convergence signals — moments where new material touches old material. But without pre-existing tags on the files, the digest's first step was always classification: read the file, determine what it's about, then check for connections. Every analytical pass started with the same prerequisite work. The digest was doing the operator's job before it could do its own.&lt;/p&gt;

&lt;p&gt;Manual enrichment doesn't scale. Not because the tagging itself is hard — reading a file and identifying its concepts takes minutes. Because the cognitive cost is in the wrong place. Every minute I spend classifying a file is a minute not spent evaluating what the classification reveals. The bottleneck isn't the work itself. It's that the work requires human attention for a task that doesn't require human judgment.&lt;/p&gt;




&lt;h2&gt;What I Tried&lt;/h2&gt;

&lt;p&gt;The reframe: classification is infrastructure. It should happen in the background, the way file indexing happens in the background. The daemon already existed — a file watcher running as a macOS LaunchAgent, monitoring the vault for new arrivals. It tracked filesystem events, logged imports, maintained a queue. Passive infrastructure. The change was making it active: when a file arrives, don't just log it. Read it. Extract structured metadata. Write the tags back into frontmatter. Move on.&lt;/p&gt;

&lt;p&gt;The architecture was straightforward. File arrives. Watcher queues it. Worker calls the Claude API with a focused extraction prompt — not open-ended conversation, a specific ask: identify concepts, frameworks, disciplines, and named references in this file. Structured JSON response writes back into YAML frontmatter. Watcher re-indexes the enriched file. The vault file that arrived raw now carries analytical metadata without the operator knowing it happened. Event-driven. The same watcher-queue-worker pattern from production systems, applied to a knowledge vault.&lt;/p&gt;
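&lt;p&gt;A minimal sketch of the write-back step, assuming the simplest possible frontmatter handling. Here &lt;code&gt;extract&lt;/code&gt; stands in for the focused API call (in the real pipeline, a Claude API request returning structured JSON); the function name and the flat-list YAML are illustrative, not the daemon's actual code:&lt;/p&gt;

```python
from pathlib import Path

def enrich_file(path, extract):
    """Read a vault file, run the extractor, and write structured tags
    back into YAML frontmatter. `extract` takes the file text and
    returns a dict like {"concepts": [...], "frameworks": [...],
    "disciplines": [...]}."""
    text = Path(path).read_text(encoding="utf-8")
    meta = extract(text)
    block = "\n".join(
        f"{key}: [{', '.join(meta.get(key, []))}]"
        for key in ("concepts", "frameworks", "disciplines"))
    if text.startswith("---\n"):
        # Existing frontmatter: append the new keys inside it.
        head, _, body = text[4:].partition("\n---\n")
        text = f"---\n{head}\n{block}\n---\n{body}"
    else:
        # Raw file: create the frontmatter block.
        text = f"---\n{block}\n---\n{text}"
    Path(path).write_text(text, encoding="utf-8")
```

&lt;p&gt;In the daemon this would run on the worker side of the watcher queue. Production frontmatter needs a YAML library rather than string surgery — this version breaks on values containing brackets or commas.&lt;/p&gt;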

&lt;p&gt;Then the debugging arc.&lt;/p&gt;

&lt;p&gt;Nine bugs across five system layers. Each fix revealed the next failure in a different layer. Two hours, nine restarts.&lt;/p&gt;

&lt;p&gt;The first bug was a duplicate function declaration — two versions of the same processing function in the daemon script, left over from iterating on the extraction prompt. Python doesn't warn about this; the second definition silently shadows the first. The daemon crashed on startup. Fix: delete the duplicate. Restart.&lt;/p&gt;
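&lt;p&gt;The shadowing behavior is worth seeing once, because nothing in the interpreter flags it: Python simply rebinds the name to the last &lt;code&gt;def&lt;/code&gt; it executes. The function name below is invented for illustration:&lt;/p&gt;

```python
def process_queued_file(path):
    # The current version of the worker function.
    return f"enriched: {path}"

def process_queued_file(path):
    # A stale copy left over from iterating on the extraction prompt.
    # No warning, no error: at module load time this definition
    # silently replaces the one above.
    return f"TODO rewrite: {path}"

print(process_queued_file("note.md"))  # the stale version runs
```

&lt;p&gt;A linter catches this (&lt;code&gt;F811&lt;/code&gt; in flake8/ruff terms); the interpreter never will.&lt;/p&gt;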

&lt;p&gt;The second bug was macOS LaunchAgent throttling. The daemon crashed enough times during development that launchd started throttle-limiting restarts — a 10-second backoff that made the daemon appear unresponsive when it was actually waiting for permission to start. Fix: unload and reload the agent. Restart.&lt;/p&gt;

&lt;p&gt;The third bug was vault directory coverage. The watcher was monitoring the top-level vault directory but not subdirectories where most conversation files lived. Five hundred and forty-five files invisible to the daemon because they were one level too deep. Fix: recursive directory watching. Restart.&lt;/p&gt;

&lt;p&gt;The fourth bug was SQL. The backfill query — designed to find all files not yet enriched — used a WHERE clause that excluded NULL values. In SQLite, &lt;code&gt;column != 'enriched'&lt;/code&gt; doesn't match rows where the column is NULL. Five hundred and forty-five files had NULL enrichment status. The query returned zero results. Fix: add &lt;code&gt;OR column IS NULL&lt;/code&gt;. Restart.&lt;/p&gt;

&lt;p&gt;The fifth bug was also SQL. Double-quote characters in file paths breaking query string interpolation. Fix: parameterized queries. Restart.&lt;/p&gt;
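&lt;p&gt;Both SQL bugs reproduce in a few lines with &lt;code&gt;sqlite3&lt;/code&gt; from the standard library. The table and column names are illustrative, not the daemon's actual schema:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT, status TEXT)")
conn.executemany("INSERT INTO files VALUES (?, ?)",
                 [("a.md", None), ("b.md", None), ("c.md", "enriched")])

# Bug four: != never matches NULL, because comparisons with NULL
# yield UNKNOWN under SQL three-valued logic. This backfill query
# returns zero rows even though two files are unenriched.
broken = conn.execute(
    "SELECT path FROM files WHERE status != 'enriched'").fetchall()

# Fix: match NULL explicitly.
fixed = conn.execute(
    "SELECT path FROM files WHERE status != 'enriched' "
    "OR status IS NULL").fetchall()

# Bug five: interpolating paths into the query string breaks on
# quote characters. Parameterized queries sidestep quoting entirely.
row = conn.execute("SELECT status FROM files WHERE path = ?",
                   ("b.md",)).fetchone()
```

&lt;p&gt;SQLite's &lt;code&gt;IS NOT&lt;/code&gt; operator would also work here (&lt;code&gt;status IS NOT 'enriched'&lt;/code&gt; matches NULL rows), but the explicit &lt;code&gt;OR status IS NULL&lt;/code&gt; reads more portably across databases.&lt;/p&gt;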

&lt;p&gt;The sixth bug was a stale absolute path from an earlier vault location, hardcoded in the daemon's configuration rather than derived from the current environment. The daemon was looking for files in a directory that hadn't existed for weeks. Fix: environment-relative paths. Restart.&lt;/p&gt;

&lt;p&gt;The seventh, eighth, and ninth bugs were all related to the API key. An invisible character appended during clipboard paste. Then a doubled prefix — &lt;code&gt;sk-ant-sk-ant-&lt;/code&gt; — from pasting the key into a field that auto-prepended the provider prefix. Then, after both were fixed, the key had been exposed in a terminal command and needed regeneration. Three bugs, one root cause: the boundary between the operator's clipboard and the system's credential store had no validation.&lt;/p&gt;

&lt;p&gt;Nine bugs. Five system layers — application code, operating system services, database queries, filesystem paths, API authentication. Two hours. Each bug diagnosed through conversational problem-solving with the IDE agent. The LLM identified each individual bug correctly. What it couldn't do was see the cascade — the way fixing the LaunchAgent throttle revealed the directory coverage gap, which revealed the SQL NULL semantics, which revealed the path staleness. Each fix changed the system state enough that the next failure became visible. The human operator was the thread connecting diagnoses across system boundaries that no single diagnostic frame could hold simultaneously.&lt;/p&gt;

&lt;p&gt;After two hours: the daemon ran. A file arrived. Twenty seconds later, its frontmatter contained structured concept tags, framework annotations, and discipline classifications. No operator attention required. The vault had a new property: files that enriched themselves.&lt;/p&gt;




&lt;h2&gt;What It Revealed&lt;/h2&gt;

&lt;p&gt;The debugging arc is what agentic development actually looks like. Not a model generating clean code in one pass. Iterative diagnosis across system boundaries, where each fix reveals the next failure in a different layer, and the human is the thread connecting layers that can't see each other. The database doesn't know about the LaunchAgent throttle. The API authentication doesn't know about the SQL NULL semantics. The filesystem paths don't know about the extraction prompt. Each layer is fine on its own. The interactions between layers are where everything breaks — and the interactions are only visible from the position of someone who can see all the layers at once.&lt;/p&gt;

&lt;p&gt;That's a finding about agentic development specifically, not software development generally. In conventional development, the layers are well-mapped: you know the database talks to the API which talks to the frontend. In agentic development, the layers include the LLM's extraction capability, the operating system's service management, the database's query semantics, and the human operator's clipboard hygiene — in the same debugging session. The system boundary diagram doesn't exist because the boundaries are between categories that don't traditionally share a diagram.&lt;/p&gt;

&lt;p&gt;But the debugging arc, as concrete as it was, pointed at something I didn't see until later. The human threading across system layers during debugging is a specific case of a more general structural role — the position that sees across boundaries the system's own components can't cross. The debugging arc showed that role operating between technical layers. What happened next showed it operating between the system and its outside.&lt;/p&gt;

&lt;p&gt;The auto-tagger works. Files get classified. The daemon reads a conversation export and produces: &lt;code&gt;concepts: [session persistence, multi-environment coordination, MCP architecture]&lt;/code&gt;. &lt;code&gt;frameworks: [event-driven architecture, knowledge engineering]&lt;/code&gt;. &lt;code&gt;disciplines: [systems design, AI orchestration]&lt;/code&gt;. Structured, queryable, useful. The digest system can now skip the classification prerequisite and go straight to connection detection. The bottleneck I started with — manual enrichment — is solved.&lt;/p&gt;

&lt;p&gt;What the auto-tagger produces is the "what" layer. What's in this file. What concepts it touches. What frameworks it references. That's classification — and classification is exactly the kind of cognitive work that automation handles well. Focused extraction prompt, structured output format, repeatable across hundreds of files. The daemon does this consistently — the same extraction prompt, the same structured output, across any number of files.&lt;/p&gt;

&lt;p&gt;What the auto-tagger doesn't produce is the "why" layer. Why this file matters. How it connects to a decision made three months ago. What it changes about an existing concept's scope. Whether the operational evidence in this conversation — the specific way a design pattern was stress-tested against a real problem — is material for something communicable.&lt;/p&gt;

&lt;p&gt;I ran into this concretely. Over the course of a week, I'd posted design contributions to four open-source projects — specific, grounded comments on session persistence architecture, knowledge graph design patterns, multi-environment coordination protocols. Each contribution drew directly from infrastructure I'd built and documented. Each one was a real exchange with a maintainer about a real design problem. The auto-tagger could classify the session transcripts from those contributions. It could tag them as containing concepts about session persistence and MCP architecture. What it couldn't do was recognize that those four exchanges, taken together, constituted a pattern — operational evidence of a specific kind of practice being tested in forming spaces. The tags said what the files contained. They didn't say what the files meant.&lt;/p&gt;

&lt;p&gt;That gap has a shape. The system can read its own files — literally, the daemon processes them autonomously. But there are two kinds of reading. Reading from inside: what does this file contain? Concepts, frameworks, references. The auto-tagger does this. Reading from outside: what does this file mean in the context of everything else? How does it change the trajectory? What does it reveal about the gap between what the system can see internally and what anyone outside the system can evaluate?&lt;/p&gt;

&lt;p&gt;The second kind of reading is what I do when I look at the topology health report and realize that the four engagement sessions aren't just four separate interactions — they're evidence of a pattern. Or when I notice that the publishing cadence lapsed not because I stopped writing, but because the primary signal channel shifted from articles to direct contributions, and the system's health metrics track those as separate tracks instead of recognizing them as the same function expressed differently. Or when I see that the infrastructure I've built to help me get hired for context engineering work &lt;em&gt;is&lt;/em&gt; context engineering work — and nobody outside the system can see that yet.&lt;/p&gt;

&lt;p&gt;Classification reads the file. The operator reads the system. They're different operations. The first is automatable. The second is where judgment lives — and it's the layer where intelligence becomes legible.&lt;/p&gt;

&lt;p&gt;I haven't solved this. The auto-tagger handles classification. The persistence layer handles storage and retrieval. The digest system handles connection detection. What doesn't exist yet is the layer that turns operational evidence into communicable artifacts — field reports, case material, positioning pieces that translate what the system knows internally into something someone outside it can evaluate. I've been calling this the narration layer, and it's the biggest gap in the architecture. Not because it's technically hard to build — the editorial infrastructure exists, the voice contract exists, the publishing pipeline exists. Because it requires a judgment the system hasn't been able to make: which of the things that happened is significant enough to narrate, and for whom?&lt;/p&gt;

&lt;p&gt;The auto-tagger was supposed to be the post about infrastructure doing cognitive work autonomously. It is that. But it also drew the line between the work infrastructure can do without the operator and the work it can't. Classification is on one side. Narration is on the other. The daemon reads files. The operator reads the system.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Reusable Rule
&lt;/h2&gt;

&lt;p&gt;If your LLM infrastructure is spending human attention on classification — manually tagging files, categorizing inputs, sorting material before analytical work can begin — check whether that classification can run in the background. The diagnostic: when your analytical pipeline's first step is always "figure out what this is about," the pipeline is doing prerequisite work that should have happened at import time. Automate it. Event-driven enrichment. Let files arrive pre-classified the way they arrive pre-indexed.&lt;/p&gt;
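&lt;p&gt;As a sketch of what "classify at import time" can look like, here is a dependency-free polling pass over a vault directory. A production daemon would more likely hook a filesystem event API, and the classifier here is a stub.&lt;/p&gt;

```python
import tempfile
from pathlib import Path

def watch_once(vault, classify, seen):
    """One polling pass: classify files that are new or changed."""
    for path in sorted(Path(vault).rglob("*.md")):
        mtime = path.stat().st_mtime
        if seen.get(path) != mtime:   # new or modified since last pass
            seen[path] = mtime
            classify(path)            # enrich at import time

# Demo with a throwaway vault and a stub classifier.
with tempfile.TemporaryDirectory() as vault:
    Path(vault, "note.md").write_text("session persistence notes")
    tagged = []
    seen = {}
    watch_once(vault, tagged.append, seen)   # first pass: classified
    watch_once(vault, tagged.append, seen)   # second pass: no change
    print([p.name for p in tagged])          # ['note.md']
```

&lt;p&gt;The already-classified file is skipped on the second pass, which is the property that keeps enrichment event-driven rather than a repeated batch job.&lt;/p&gt;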

&lt;p&gt;But there's a harder version of this problem. Once classification is automated, you'll notice the layer above it — the one that determines what the classified material &lt;em&gt;means&lt;/em&gt;. Not what concepts a file contains, but why those concepts matter right now. Not what frameworks a conversation referenced, but what the conversation reveals about a pattern you hadn't named yet. That layer — call it narration, call it framing, call it the intelligence-to-legibility bridge — resists automation for a structural reason: it depends on context the system doesn't currently hold. Who needs to see this? What would they recognize as valuable? What's the right translation between what actually happened and what becomes communicable?&lt;/p&gt;

&lt;p&gt;In my system, the auto-tagger classifies 545 files without my attention. The daemon runs, the tags appear, the digest system can now do connection work instead of classification work. That's real — the bottleneck I started with is gone. But the four design contributions I posted to open-source projects last week are still just tagged files in the vault. The auto-tagger knows they contain concepts about session persistence and MCP architecture. It doesn't know they're evidence of a practice being tested in forming spaces, or that taken together they constitute a positioning pattern that no single tag captures.&lt;/p&gt;

&lt;p&gt;Automate classification. Reserve human attention for narration. The system reads the files. Deciding what they mean — and for whom — is still yours.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>knowledgemanagement</category>
      <category>toolsforthought</category>
    </item>
    <item>
      <title>When the Data Isn't Gone — It's Buried</title>
      <dc:creator>John Wade</dc:creator>
      <pubDate>Sun, 08 Mar 2026 11:09:30 +0000</pubDate>
      <link>https://dev.to/john_wade_dev/when-the-data-isnt-gone-its-buried-3f5g</link>
      <guid>https://dev.to/john_wade_dev/when-the-data-isnt-gone-its-buried-3f5g</guid>
      <description>

&lt;p&gt;Months into the project, I opened a new session and tried to pick up where I'd left off on epistemic pluripotency — a concept about what happens when you hold multiple analytical frameworks simultaneously without collapsing them into one. I'd developed it across several sessions: analyzed it through two frameworks, connected it to four other concepts, debated its scope, landed on a working definition.&lt;/p&gt;

&lt;p&gt;The session had context. The boot protocol loaded the project state, the governance files, the concept registry. The model knew the term existed and could retrieve the definition. What it couldn't retrieve was the analytical depth around it — which frameworks I'd tested it against, what I'd considered and rejected, where the uncertainty still sat. The concept name survived session boundaries. The conceptual work didn't.&lt;/p&gt;

&lt;p&gt;I could re-explain the depth each time. I'd done it before — three, four sessions running. Each re-explanation compressed slightly differently. By the fourth, I wasn't confident which version of the nuanced framing was canonical. Not because the concept was lost — it was in the vault, in session transcripts, in notes I'd written. But none of those could answer "what did I decide about this concept's scope and why did I reject the alternative framing?" The definition was retrievable. The deliberative path that produced it wasn't.&lt;/p&gt;

&lt;p&gt;That was the moment the problem changed shape. I'd been treating session persistence as a session problem — build better context packs, capture reasoning before compression, bridge one session to the next. But what I was actually hitting wasn't about any single session. It was about the accumulated body of work having no queryable structure. Hundreds of knowledge nodes in an Obsidian vault, each with relationships stored as frontmatter wikilinks that no tool could traverse. Six interconnected projects generating concepts that referenced each other, with the cross-project topology existing only in my head. The knowledge wasn't gone. It was buried in a format that couldn't answer questions about itself.&lt;/p&gt;

&lt;p&gt;Part 8 of &lt;em&gt;Building at the Edges of LLM Tooling&lt;/em&gt;. If you're running sustained LLM work across months and sessions — and finding that each new session starts with re-establishing depth that previous sessions already developed — the session isn't the problem. The absence of a persistence layer between sessions is. Start &lt;a href="https://dev.to/john_wade_dev/what-happens-when-you-try-to-build-something-real-with-llms-104d"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why It Breaks
&lt;/h2&gt;

&lt;p&gt;Post 7 in this series identified the gap at the session level: reasoning doesn't survive compression. The /mark skill I built captures deliberative reasoning at the moment of insight — before compression destroys it. That addressed the single-session problem.&lt;/p&gt;

&lt;p&gt;The multi-session problem is different. It's not about what gets lost during one session's compression. It's about what happens when dozens of sessions each produce knowledge that has no infrastructure connecting them.&lt;/p&gt;

&lt;p&gt;Every session in my workflow produced artifacts — ECP nodes, concept definitions, framework analyses, design decisions. Those artifacts persisted in the vault. But the relationships between them — which concepts ground which, which sessions contradicted earlier findings, which projects depend on which — lived in two places: frontmatter wikilinks that no tool could traverse, and my memory, which was doing the same lossy reconstruction the LLM sessions were doing.&lt;/p&gt;

&lt;p&gt;I ran an analysis on the vault. Over 500 ECP nodes. Roughly 72% had no connections to anything — orphaned, isolated, structurally invisible. The nodes that did have connections stored them as wikilinks in YAML frontmatter — &lt;code&gt;parents: ["[[Some Other Node]]"]&lt;/code&gt; — which Obsidian renders as clickable links but which no analytical tool can query. I couldn't ask "which concepts does this thinker ground?" or "what's the path between this concept and that protocol?" The topology existed. It was just invisible to everything except manual reading.&lt;/p&gt;
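&lt;p&gt;The distance between "renders as a clickable link" and "queryable as an edge" is small in code. A minimal sketch, assuming frontmatter fields shaped like the &lt;code&gt;parents&lt;/code&gt; example above, that lifts wikilinks into (node, relation, target) triples a tool can traverse:&lt;/p&gt;

```python
import re

# Sketch: turn frontmatter wikilinks, the only place the topology
# lived, into queryable edges. The field names are whatever the
# frontmatter uses; "parents" matches the example above.
LINK = re.compile(r"\[\[([^\]]+)\]\]")

def frontmatter_edges(node_name, frontmatter_text):
    """Return (node, relation, target) triples from YAML-ish frontmatter."""
    edges = []
    for line in frontmatter_text.splitlines():
        if ":" in line:
            field, _, rest = line.partition(":")
            for target in LINK.findall(rest):
                edges.append((node_name, field.strip(), target))
    return edges

fm = 'parents: ["[[Some Other Node]]"]\ngrounds: ["[[Concept A]]", "[[Concept B]]"]'
print(frontmatter_edges("This Node", fm))
```

&lt;p&gt;Once the links are triples instead of strings, "which concepts does this thinker ground?" becomes a filter instead of a reading session.&lt;/p&gt;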

&lt;p&gt;The vault was a write-only knowledge store. I could add to it indefinitely. I couldn't ask it what it knew.&lt;/p&gt;

&lt;p&gt;This is a different failure mode than the ones earlier in this series. Context saturation (Post 1) is about one session's window filling up. Session fragility (Post 4) is about constraints evaporating within a conversation. This is about the accumulated output of months of work having no queryable structure — a corpus that grows but never becomes navigable.&lt;/p&gt;

&lt;p&gt;The scale made it worse. With 20 nodes, I could hold the topology in my head. With 200, I was guessing at connections. Past 500, the vault had more structure than I could track, and every new session that needed to build on prior work required me to manually search, read, and re-summarize — performing the same lossy compression I was trying to prevent.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Tried
&lt;/h2&gt;

&lt;p&gt;The diagnosis pointed to a specific problem: the vault stored knowledge as documents. What I needed was a system that stored knowledge as structure — typed relationships between entities, queryable and traversable.&lt;/p&gt;

&lt;p&gt;I researched the professional field that studies this. Knowledge engineering — and specifically ontology engineering — turned out to be the discipline that formalizes exactly this problem. The question isn't "where do I store my notes?" It's "what are the entity types, what relationships connect them, and what constraints govern the structure?" An ontology specification answers those questions before you write a line of code.&lt;/p&gt;

&lt;p&gt;The build had three layers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A graph database&lt;/strong&gt; for structure. Not a document store — a property graph where the shape of connections between entities is the primary data. Six entity types: projects (the bounded endeavors), concepts (ideas with analytical substance — not tags), sources (intellectual authorities, with subtypes for individual thinkers, published works, theoretical frameworks, and institutions), connectors (typed relationships between projects), vault nodes (the files themselves), and concept-source links (how concepts relate to their intellectual grounding). Seven controlled relationship types between projects. Six between concepts. Four between concepts and sources. Each relationship type carries a notes field — because the analytical substance lives in the notes, not in the type label.&lt;/p&gt;
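&lt;p&gt;A toy slice of what a controlled relationship vocabulary enforces. The entity and relationship names below are stand-ins (the actual vocabularies aren't enumerated here); the point is that an edge is rejected unless its type is allowed for that entity pair and it carries notes.&lt;/p&gt;

```python
# Illustrative slice of the schema: controlled relationship types per
# entity pair, plus the rule that every edge carries analytical notes.
# All names here are invented for illustration.
ALLOWED_RELATIONS = {
    ("concept", "concept"): {"grounds", "extends", "tensions_with"},
    ("concept", "source"): {"grounded_in", "critiques"},
    ("project", "project"): {"feeds", "shares_concepts_with"},
}

def add_edge(graph, src, src_type, rel, dst, dst_type, notes):
    allowed = ALLOWED_RELATIONS.get((src_type, dst_type), set())
    if rel not in allowed:
        raise ValueError(f"{rel!r} not allowed between {src_type} and {dst_type}")
    if not notes.strip():
        raise ValueError("relationship requires notes: the substance lives there")
    graph.append({"src": src, "rel": rel, "dst": dst, "notes": notes})

graph = []
add_edge(graph, "Epistemic Pluripotency", "concept", "grounded_in",
         "Perspectival Realism", "source",
         notes="Scope set by the framework's pluralism.")
print(len(graph))  # 1
```

&lt;p&gt;The design choice worth copying is the notes requirement: the type label is the index, the notes are the content.&lt;/p&gt;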

&lt;p&gt;&lt;strong&gt;A vector store&lt;/strong&gt; for discovery. The graph handles "show me the connections between X and Y." Vectors handle the other retrieval problem: "I was working on something about how frameworks resist change, but I can't remember what I called it." Semantic similarity search across embedded descriptions of every concept, project, and source. The two layers serve different retrieval needs — the graph for reasoning, vectors for recognition.&lt;/p&gt;
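&lt;p&gt;The recognition layer reduces to ranking by similarity over embedded descriptions. A runnable sketch, with toy three-dimensional vectors standing in for real embeddings from whatever model the store uses:&lt;/p&gt;

```python
# Recognition-layer sketch: rank stored descriptions by cosine
# similarity to a query embedding. The 3-d vectors are toy stand-ins
# so the ranking logic is runnable without an embedding model.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(x * x for x in b) ** 0.5)
    return dot / norm if norm else 0.0

index = {
    "conceptual inertia":     [0.9, 0.1, 0.2],  # "frameworks resist change"
    "session persistence":    [0.1, 0.8, 0.3],
    "epistemic pluripotency": [0.2, 0.3, 0.9],
}

def recall(query_vec, index, top=1):
    ranked = sorted(index, key=lambda name: cosine(query_vec, index[name]),
                    reverse=True)
    return ranked[:top]

# "Something about how frameworks resist change, can't remember the name."
print(recall([0.85, 0.15, 0.1], index))  # ['conceptual inertia']
```

&lt;p&gt;The graph answers questions about names you already have; this layer recovers the name itself.&lt;/p&gt;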

&lt;p&gt;&lt;strong&gt;An MCP server and CLI&lt;/strong&gt; for access. Nine tools exposed through Model Context Protocol, so any LLM session can query the graph directly. Twenty-five commands in a terminal CLI. The MCP server is the part that closes the loop Post 7 opened: when a session ends, a structured export captures new concepts and relationships into a review queue. When the next session starts, it can pull the current project state — connectors, concepts, operational phase — and arrive oriented rather than cold. The persistence layer isn't just storage. It's the bridge between sessions.&lt;/p&gt;

&lt;p&gt;A formal ontology specification governs all of it. Thirteen validation rules. Lifecycle state machines for every entity type — a concept moves from emerging to active to stable to needs-review, and a stable concept must have at least one source link. You can't claim something is settled if nothing grounds it. Ten competency questions the graph must be able to answer, from provenance chains to cross-project concept mapping to integration debt detection. The specification isn't documentation. It's the constraint document the database implements.&lt;/p&gt;
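&lt;p&gt;Two of those constraints are simple to picture in code: the lifecycle state machine and the stable-requires-grounding rule. The state names follow the post; the code shape is an assumption.&lt;/p&gt;

```python
# Sketch of two validation rules: the concept lifecycle state machine
# (emerging, active, stable, needs-review) and the constraint that a
# stable concept must carry at least one source link.
TRANSITIONS = {
    "emerging": {"active"},
    "active": {"stable", "needs-review"},
    "stable": {"needs-review"},
    "needs-review": {"active"},
}

def advance(concept, new_state):
    current = concept["state"]
    if new_state not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} to {new_state}")
    if new_state == "stable" and not concept["source_links"]:
        raise ValueError("stable requires at least one source link")
    concept["state"] = new_state
    return concept

c = {"name": "epistemic pluripotency", "state": "active", "source_links": []}
try:
    advance(c, "stable")              # rejected: nothing grounds it
except ValueError as err:
    print(err)                        # stable requires at least one source link
c["source_links"].append("Perspectival Realism")
print(advance(c, "stable")["state"])  # stable
```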




&lt;h2&gt;
  
  
  What It Revealed
&lt;/h2&gt;

&lt;p&gt;The first thing the graph showed me was the shape of what I didn't know I'd accumulated.&lt;/p&gt;

&lt;p&gt;Over 900 nodes. Nearly 800 relationships. More than 100 concepts across 6 projects. When I ran the topology health check — a dashboard built into the system — it reported over 500 orphaned vault nodes. Files with content, with analytical substance, with relationships implied by what they contained — but with no explicit connections to anything in the graph. Sixty-seven concepts with no source grounding. Thirteen productive tensions — explicitly flagged disagreements between concepts that the system tracks rather than resolves, because the tension itself is informative.&lt;/p&gt;

&lt;p&gt;That's what invisible topology looks like when it becomes visible. Not empty — dense. The knowledge was there. The structure connecting it wasn't.&lt;/p&gt;

&lt;p&gt;The reference gaps were specific. The first digest the system produced identified that several core thinkers — ones who ground multiple concepts across multiple projects — had incomplete links in the graph. Not because the concepts didn't reference them, but because the references had never been formalized into typed relationships with notes explaining the grounding. The graph made the gap visible. Markdown notes never would have.&lt;/p&gt;

&lt;p&gt;The session bridge changed how new sessions started. Before the persistence layer, each session opened with re-establishing depth — summarizing the project, re-contextualizing concepts, reconstructing what had been decided. With the graph tools, a session opens with a project context query and receives the current state: which concepts are active, what connectors link this project to others, what phase the work is in. The model arrives with structural context instead of a summary — typed relationships rather than compressed prose. The context is still lossy — the full analytical depth of any concept is more than a project card carries — but the loss is in depth, not in structure. The session knows what exists and how things connect. It doesn't know every nuance of why.&lt;/p&gt;

&lt;p&gt;This maps to something two readers of Post 7 identified independently. &lt;a href="https://dev.to/signalstack"&gt;signalstack&lt;/a&gt; proposed separating conclusions from reasoning traces — storing them differently, with different retention policies. &lt;a href="https://dev.to/hermesagent"&gt;hermesagent&lt;/a&gt;, a persistent agent experiencing its own compression cycles, described separating "what I know" from "how I came to know it." The graph architecture does this implicitly. Concepts are conclusions — named, defined, with a development status. Source links are provenance — who says so and how. The relationship notes carry the analytical substance of &lt;em&gt;why&lt;/em&gt; those connections hold. Three layers, stored together but structurally distinct.&lt;/p&gt;

&lt;p&gt;What the persistence layer doesn't solve — and I want to be precise about this — is retroactive re-evaluation. The graph stores what's been captured. It shows the topology. It surfaces gaps and orphaned nodes through health checks. But it doesn't proactively surface when something captured months ago just became relevant because of what was imported yesterday. That re-evaluation — deciding which stored knowledge matters &lt;em&gt;now&lt;/em&gt;, given what just changed — is still something I do manually, reading health reports and making connections the automated layer can surface but can't evaluate.&lt;/p&gt;

&lt;p&gt;I've been thinking about how the brain handles this during sleep — replaying recent experiences against existing memory, strengthening some connections and letting others fade, occasionally surfacing unexpected links between things that didn't seem related when they were first encountered. A digest pipeline I built partially does this: it scans recent imports against the existing graph and flags convergence signals, enrichment leads, places where new material touches old material in ways I hadn't connected. But it runs manually, and it only finds what its detection rules specify. A more developed version — something that runs periodically, replays recent inputs against the whole graph, and surfaces connections I wouldn't have thought to look for — isn't operational yet. The persistence layer stores and retrieves. The enrichment layer classifies. The layer that evaluates "what matters now, given what just changed" is the current frontier.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Reusable Rule
&lt;/h2&gt;

&lt;p&gt;If your LLM work spans months and produces knowledge that subsequent sessions need to build on — concepts, design decisions, analytical frameworks, relationships between ideas — check what's actually queryable versus what's merely stored.&lt;/p&gt;

&lt;p&gt;The diagnostic: when each new session requires re-establishing depth that previous sessions developed, the problem isn't the session's context window. It's the absence of structure between sessions. Documents store knowledge as text. Graphs store knowledge as structure. The difference matters when you need to ask "what connects to what?" rather than "what did I write about X?"&lt;/p&gt;

&lt;p&gt;The vault I'd been using for months stored hundreds of nodes with relationships buried in frontmatter that no tool could traverse. The graph I built in a day made the topology visible — including hundreds of orphaned nodes, dozens of ungrounded concepts, and reference gaps I hadn't known existed. The knowledge wasn't missing. The structure was.&lt;/p&gt;

&lt;p&gt;One caution: a persistence layer that stores everything but evaluates nothing produces its own kind of debt. The graph holds over 900 nodes. The health checks show the gaps. But the question "which of these matters right now?" is a judgment call the infrastructure surfaces but can't make. I read the health reports and decide what to act on. That's the operator performing a function the persistence layer was built to support, not replace. Build the layer that makes the shape visible. Keep the human at the evaluation juncture. The graph shows you what you've accumulated. Deciding what it means is still yours.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>knowledgemanagement</category>
      <category>toolsforthought</category>
    </item>
    <item>
      <title>When the Reasoning Doesn't Survive</title>
      <dc:creator>John Wade</dc:creator>
      <pubDate>Thu, 26 Feb 2026 04:49:19 +0000</pubDate>
      <link>https://dev.to/john_wade_dev/when-the-reasoning-doesnt-survive-mnj</link>
      <guid>https://dev.to/john_wade_dev/when-the-reasoning-doesnt-survive-mnj</guid>
      <description>

&lt;p&gt;I built a skill to capture session reasoning before it evaporates. The first time I used it, I tested whether the capture actually worked.&lt;/p&gt;

&lt;p&gt;The skill is called /mark. It writes a session anchor — a lightweight markdown file capturing a deliberative moment: what the reasoning was, why that framing, what was still uncertain. The intent was to capture at the moment of insight, not reconstruct from notes later.&lt;/p&gt;

&lt;p&gt;The first anchor I wrote was for an epistemic research framing developed earlier in the same session. By the time I ran /mark, the session had compacted once. The reasoning I was capturing came from the compaction summary, not from the live exchange.&lt;/p&gt;

&lt;p&gt;The result: the four research aims survived intact. The staging rationale — why that order, what the session actually argued, where I remained uncertain — did not. What /mark produced was a clean artifact of what had been decided. Not a record of how.&lt;/p&gt;

&lt;p&gt;I noticed immediately. The structure was right. The reasoning wasn't there.&lt;/p&gt;

&lt;p&gt;Part 7 of &lt;em&gt;Building at the Edges of LLM Tooling&lt;/em&gt;. If you're relying on session notes to reconstruct reasoning you developed in an LLM session — notes capture decisions, not the argument behind them. The deliberative layer has no persistence infrastructure. Start &lt;a href="https://dev.to/john_wade_dev/what-happens-when-you-try-to-build-something-real-with-llms-104d"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why It Breaks
&lt;/h2&gt;

&lt;p&gt;LLM sessions produce two things.&lt;/p&gt;

&lt;p&gt;The first is artifacts: files created, decisions made, nodes extracted, code written. These are the session's visible output. They persist because they land somewhere — a file system, a vault, a repo.&lt;/p&gt;

&lt;p&gt;The second is deliberative reasoning: why this framing over another, what the session considered and rejected, the staging logic, the uncertainty that was still live at step three. This layer is what makes the artifact navigable — the argument behind the decision, available for future review.&lt;/p&gt;

&lt;p&gt;Current LLM tooling has infrastructure for artifacts. It has nothing for deliberative reasoning.&lt;/p&gt;

&lt;p&gt;When a session ends, the deliberative layer evaporates. If the session compresses before it ends — and sessions compress — the evaporation is gradual. Each compression pass discards branching, keeps conclusions. The staging rationale disappears. The alternative framings disappear. What remains is a summary of what was decided, not a record of how.&lt;/p&gt;

&lt;p&gt;This isn't a bug in any specific tool. It's a missing layer. ECP nodes capture processed research. Session notes capture operational events. Neither captures the reasoning connecting one piece of work to the next. The infrastructure doesn't exist yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Tried
&lt;/h2&gt;

&lt;p&gt;The /mark skill is a Claude Code skill for capturing deliberative moments during working sessions.&lt;/p&gt;

&lt;p&gt;The design criteria were specific: not session notes (those capture what happened, not why). Not extracted research nodes (those capture processed content, not live reasoning). Specifically the layer between — design decisions in flight, research framings being constructed, the logic that connects current work to future work.&lt;/p&gt;

&lt;p&gt;The output is a session anchor: a markdown file, vault-native, written at the moment of insight. It links to the concept being framed, records the open questions still live at capture time, and stores the reasoning while it's present in context — not reconstructed from a summary later.&lt;/p&gt;

&lt;p&gt;The anchor format doesn't try to reproduce the full exchange. It captures three things: what was decided, why (the argument as it stood at capture time), and what remained uncertain. Those three things are enough to make the decision navigable two sessions later.&lt;/p&gt;
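&lt;p&gt;Those three fields are small enough to show. A hypothetical anchor writer, with illustrative frontmatter keys and filename scheme:&lt;/p&gt;

```python
import tempfile
from datetime import date
from pathlib import Path

# Sketch of a /mark-style anchor: the three fields named above
# (decided, why, still uncertain), written as a vault-native markdown
# file. Frontmatter keys and the filename scheme are illustrative.
def write_anchor(vault, concept, decided, why, uncertain):
    body = "\n".join([
        "---",
        "type: session-anchor",
        f"concept: '[[{concept}]]'",
        f"date: {date.today().isoformat()}",
        "---",
        "## Decided", decided,
        "## Why (as argued at capture time)", why,
        "## Still uncertain", uncertain,
    ])
    path = Path(vault) / f"anchor {concept}.md"
    path.write_text(body)
    return path

with tempfile.TemporaryDirectory() as vault:
    anchor = write_anchor(vault, "Epistemic Pluripotency",
                          "Four research aims, staged in dependency order.",
                          "Construct validity has to precede measurement.",
                          "Whether the staging survives contact with pilot data.")
    anchor_text = anchor.read_text()
print("## Still uncertain" in anchor_text)  # True
```

&lt;p&gt;The uncertainty section is the part session notes never carry, and it's the part that makes the decision navigable later.&lt;/p&gt;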




&lt;h2&gt;
  
  
  What It Revealed
&lt;/h2&gt;

&lt;p&gt;The anchor-from-summary failure confirmed the mechanism.&lt;/p&gt;

&lt;p&gt;The same session that produced the skill also produced the first anchor. By the time I ran /mark, one compaction event had already occurred. The research framing I was capturing lived in the pre-compression transcript. What I was working from was the summary.&lt;/p&gt;

&lt;p&gt;The anchor captured structure correctly: four research aims, each with a framing question. What it didn't capture was the staging rationale — why those four, why in that order, what the session had actually argued about construct validity before landing on that framing, where I had been uncertain and remained uncertain.&lt;/p&gt;

&lt;p&gt;The structure survived double compression. The reasoning didn't.&lt;/p&gt;

&lt;p&gt;This is what compression does systematically: it preserves conclusions and discards the reasoning trace. The summary is the session's artifact layer. Everything below that is gone.&lt;/p&gt;

&lt;p&gt;The /mark skill works when it captures at the moment the reasoning exists. It doesn't recover reasoning from summaries. Running it retroactively produces the same artifact-without-argument problem that session notes produce.&lt;/p&gt;

&lt;p&gt;Capture has to happen in the live session, before compression, at the moment of insight. Any other timing produces an archive, not a record.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Reusable Rule
&lt;/h2&gt;

&lt;p&gt;If your LLM sessions produce reasoning you later need to revisit — design decisions, research framings, the logic between steps — check what layer you're actually capturing.&lt;/p&gt;

&lt;p&gt;Session notes capture what happened. Summaries capture what was decided. Neither captures the argument that led there.&lt;/p&gt;

&lt;p&gt;The test: pull up a session note from three weeks ago. Can you reconstruct why the decision went that way — including what was considered and rejected? If not, the reasoning evaporated. The artifact survived; the deliberation didn't.&lt;/p&gt;

&lt;p&gt;Real-time capture before compression was the intervention that worked. After the session ends, the reasoning is already gone.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>knowledgemanagement</category>
      <category>toolsforthought</category>
    </item>
    <item>
      <title>When the Audit Contradicts Itself: Running a Knowledge Extraction Pipeline Under Load</title>
      <dc:creator>John Wade</dc:creator>
      <pubDate>Sun, 22 Feb 2026 11:13:38 +0000</pubDate>
      <link>https://dev.to/john_wade_dev/when-the-audit-contradicts-itself-running-a-knowledge-extraction-pipeline-under-load-3l2k</link>
      <guid>https://dev.to/john_wade_dev/when-the-audit-contradicts-itself-running-a-knowledge-extraction-pipeline-under-load-3l2k</guid>
      <description>

&lt;p&gt;A coverage audit graded my knowledge extraction B+. It also contradicted itself. The audit said all four processing passes were complete while its own phase field showed only three had run. It identified five extractions as high-quality, then noted that three of them were about the review methodology rather than the subject matter. And it had missed the conversation's most important finding entirely: the recommended ontological foundation for the entire project had no extraction at all.&lt;/p&gt;

&lt;p&gt;That's the audit working. Not failing. Working.&lt;/p&gt;

&lt;p&gt;In the same session, I ran a validation script against 247 knowledge nodes in my vault. It flagged 25 terminality violations. Twenty-three of them turned out to be the script reading its own documentation — quality checklists, protocol examples, table rows containing the very rules the scanner was enforcing, all triggering the scanner as violations of the rules they described. After tuning, the count dropped to two genuine issues. Those two were real.&lt;/p&gt;

&lt;p&gt;I'd built a protocol-governed pipeline for extracting structured knowledge from LLM conversations. Not summaries — structured nodes with frameworks, cross-references, open questions, and adversarial review. I ran it on two conversations: one was forty messages of dense epistemological analysis, the other eight messages of cross-disciplinary synthesis. Ten knowledge nodes came out the other end. The protocol loaded fifteen thousand characters of governance documents before any work began, dispatched parallel agents, ran multi-pass extraction with triage and verification, and hit the context limit six times in a single session.&lt;/p&gt;

&lt;p&gt;Everything worked. Everything also had problems. That turned out to be the point.&lt;/p&gt;

&lt;p&gt;Part 6 of &lt;em&gt;Building at the Edges of LLM Tooling&lt;/em&gt;. If you're running protocol-governed extraction — multi-pass processing, parallel agents, coverage audits — the governance documents that make the work reliable also consume the context window that makes it possible. Start &lt;a href="https://dev.to/john_wade_dev/what-happens-when-you-try-to-build-something-real-with-llms-104d"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why It Breaks
&lt;/h2&gt;

&lt;p&gt;Protocol-governed LLM work has a structural tension at its center. The governance documents that make the work reliable also consume the context window that makes the work possible.&lt;/p&gt;

&lt;p&gt;My extraction protocol — a skill called &lt;code&gt;/process&lt;/code&gt; — required loading a processing protocol, extraction guidelines, an epistemic governance note, a concept registry, framework selection rules, and coverage audit templates before it touched any source material. That's the context tax: fifteen thousand characters of instructions eaten before the first conversation message enters the window. The more governance you add, the less room remains for the actual work. The less governance you add, the more the extraction drifts.&lt;/p&gt;

&lt;p&gt;And the context window isn't just finite — it resets. My session hit six compaction events. Six times the context filled up, six times the system compressed everything into a summary and continued from there. Each compaction wiped the tool state. The model had to re-read every file it wanted to edit because the prior reads were gone. Protocol documents that had been loaded at the start needed re-loading after each compression. The governance layer was being rebuilt from scratch every time the window refilled.&lt;/p&gt;

&lt;p&gt;The coverage audit has its own version of this problem. The audit is written by the same model, in the same session, under the same context pressure as the extraction it's evaluating. When the audit contradicted itself about how many passes had run, that wasn't a logic error — it was the audit operating at the edge of its own context. The earlier passes had been compressed, and the model was reconstructing completion status from a summary rather than from the actual execution record. The auditor and the work being audited share a context window, and that window is always losing information.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Tried
&lt;/h2&gt;

&lt;p&gt;The protocol used three mechanisms to manage this: multi-pass processing, parallel agents, and operator overrides.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-pass&lt;/strong&gt; broke the work into phases. Pass one triaged the conversation into tiered blocks — which segments warranted standalone extraction, which were supporting material, which could be skipped. Pass two ran deep extraction on the top-tier blocks. Pass three verified coverage and cross-checked the extractions against the source. Each pass produced a persistent artifact — a coverage audit file, updated after each phase — so the protocol's state lived in the filesystem, not in the model's memory. When the context compressed, the audit file survived.&lt;/p&gt;
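&lt;p&gt;The filesystem-resident state pattern is worth a sketch: each pass rewrites an audit file, so completion status survives compaction because it is read back from disk rather than reconstructed from a summary. The field names are illustrative.&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path

# Sketch of filesystem-resident protocol state: every pass rewrites the
# audit file, so a context compaction can't erase what has already run.
def record_phase(audit_path, phase, result):
    path = Path(audit_path)
    audit = json.loads(path.read_text()) if path.exists() else {"phases": {}}
    audit["phases"][phase] = result
    audit["last_completed"] = phase
    path.write_text(json.dumps(audit, indent=2))
    return audit

with tempfile.TemporaryDirectory() as d:
    audit_file = Path(d) / "coverage-audit.json"
    record_phase(audit_file, "triage", {"blocks": 7, "tier_one": 3})
    record_phase(audit_file, "extraction", {"nodes": 5})
    # After a compaction, status is read back from disk, not from memory.
    on_disk = json.loads(audit_file.read_text())
print(on_disk["last_completed"])  # extraction
```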

&lt;p&gt;&lt;strong&gt;Parallel agents&lt;/strong&gt; scaled the extraction. Once the triage identified targets, three independent agents ran simultaneously, each reading the conversation independently and producing a complete knowledge node. This wasn't a performance optimization — it was a context management strategy. Each agent got a fresh context window with the protocol documents and the source material, unburdened by the other agents' work. Parallelism bought context space.&lt;/p&gt;
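&lt;p&gt;The fan-out shape looks roughly like this. &lt;code&gt;run_agent&lt;/code&gt; is a stub standing in for a real agent launch; in the actual protocol it would start a model call that loads the protocol documents and source material into a clean context window:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for spawning a fresh-context extraction agent.
def run_agent(block_id: str) -> dict:
    return {"block": block_id, "node": f"knowledge-node-for-{block_id}"}

targets = ["block-01", "block-04", "block-07"]  # output of the triage pass

# Each agent gets its own context; none carries the others' work,
# so no agent pays a context tax for its siblings.
with ThreadPoolExecutor(max_workers=3) as pool:
    nodes = list(pool.map(run_agent, targets))

assert len(nodes) == 3
```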

&lt;p&gt;&lt;strong&gt;Operator overrides&lt;/strong&gt; handled what the protocol couldn't anticipate. The standard process capped extraction at three nodes per conversation, a density guardrail to prevent over-extraction. But one conversation was unusually dense — the triage identified seven extractable concepts. I looked at the triage output, opened the coverage audit, and overrode the cap: "we're going to perform the exhaustive processing on this one so continue." The term "exhaustive processing" wasn't in the protocol. The system adopted it. By session end, it appeared in the coverage audit as the official processing label.&lt;/p&gt;

&lt;p&gt;The validation scripts needed a different kind of fix. The terminality scanner was correctly identifying the patterns it was designed to catch, but the protocol documentation contained examples of those patterns as part of its own rules. The fix was exclusion logic: skip blockquotes, quality checklists, and table rows. Twenty-five flags became two. A framework-coverage script had a similar false positive — a template section triggering a 71% coverage alert for a framework that didn't exist in the actual nodes. Same principle: teach the scanner to distinguish content from documentation.&lt;/p&gt;

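&lt;p&gt;The exclusion logic is simple enough to sketch. The terminality patterns below are made up for illustration; the real scanner's list was different. What matters is the shape: documentation markers short-circuit before the pattern match runs:&lt;/p&gt;

```python
import re

# Illustrative patterns only; the actual scanner flagged different phrasing.
TERMINAL = re.compile(
    r"\b(finally resolved|definitively|settled once and for all)\b", re.I
)

def is_documentation(line: str) -> bool:
    # Exclusion logic: protocol docs quote the very patterns they ban.
    stripped = line.lstrip()
    return (
        stripped.startswith(">")          # blockquoted examples
        or stripped.startswith("- [ ]")   # quality checklist items
        or stripped.startswith("|")       # table rows
    )

def scan(text: str) -> list[str]:
    flags = []
    for line in text.splitlines():
        if is_documentation(line):
            continue  # rule text, not content: skip it
        if TERMINAL.search(line):
            flags.append(line)
    return flags

doc = "> Never write 'definitively'.\nThis question is definitively closed."
assert scan(doc) == ["This question is definitively closed."]
```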




&lt;h2&gt;
  
  
  What It Revealed
&lt;/h2&gt;

&lt;p&gt;Protocol governance for LLM work doesn't need to be flawless. It needs to make problems visible.&lt;/p&gt;

&lt;p&gt;The coverage audit graded the extraction B+. The node quality was genuinely high — A-range work on what was written. But three of five extractions were about how the source conversation conducted its analysis, not what the analysis actually found. Methodology over substance. And two major findings had no node at all. One was the conversation's explicit recommendation to adopt perspectival realism as the project's ontological foundation — a foundational design decision. The other was a concept the conversation identified as one of three genuinely novel elements in the entire project. Both were mentioned in passing inside other nodes. Neither had its own extraction.&lt;/p&gt;

&lt;p&gt;The audit caught this. The same audit that contradicted itself about its own completion status correctly identified the substantive gap in the extraction. Quality control with its own quality problems still surfaced the real issue. That's what the session confirmed.&lt;/p&gt;

&lt;p&gt;The validation script told the same story from a different angle. Twenty-three false positives sound like a broken tool. But the two genuine violations were real terminality issues in older nodes that I hadn't caught manually. The scanner worked. It also needed calibration. Both of those things are true, and both matter.&lt;/p&gt;

&lt;p&gt;There was a third thing the session revealed, legible only in retrospect. "Exhaustive processing" wasn't a protocol term; it was phrasing I improvised while solving a tactical problem. By session end it appeared in the coverage audit as the official label for that processing mode, as if it had always been there. The model had adopted my ad hoc terminology and integrated it into a persistent artifact without ceremony. That meant operator language introduced under pressure was propagating into durable system state: the coverage audit survives compaction, so my impromptu phrasing was now in a document that would inform future sessions. I hadn't edited the protocol. I had co-authored it, without intending to, by choosing particular words at a particular moment.&lt;/p&gt;

&lt;p&gt;The protocol's role turned out to be less like a rulebook and more like a type system. A type checker doesn't prevent all bugs — it prevents certain classes of bugs at compile time so they don't become runtime failures. The extraction protocol didn't prevent methodology-over-substance imbalance, and it didn't prevent its own audit from contradicting itself. What it did produce was a structured artifact — the coverage audit — where those problems were visible, catchable, and fixable before the extraction entered the knowledge base as verified. The operator who looked at the B+ grade and said "Spawn nodes for Perspectival Realism and Interference Zones, fix the coverage audit contradictions" was doing exactly what the protocol was designed to enable: catching problems that the automated layer surfaced but couldn't fix on its own.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Reusable Rule
&lt;/h2&gt;

&lt;p&gt;If you're building structured workflows with LLMs — extraction pipelines, multi-step analysis, anything that needs consistency across outputs — the constraint document is doing more work than the model.&lt;/p&gt;

&lt;p&gt;The diagnostic: when your LLM produces inconsistent output across a multi-step process, check whether the process has governance documents at all. A protocol that defines pass structure, output format, and quality gates does more to stabilize output than prompt iteration alone. The model wasn't failing to produce good work. It was producing good work for an underspecified brief.&lt;/p&gt;

&lt;p&gt;Quality control that contradicts itself is still quality control. A coverage audit with bookkeeping errors that catches the project's missing ontological foundation is more valuable than no audit at all. A validation script with twenty-three false positives that finds two genuine issues is more valuable than no script. What I'd do again: build the quality layer. Calibrate it. Don't wait for it to be perfect before trusting it — it will find real problems on day one, and you can fix its own problems as they surface.&lt;/p&gt;

&lt;p&gt;The protocol's value isn't in being correct. It's in making errors visible early enough that a human can catch them. The model writes the audit. The human reads the grade. That's the division of labor — and it's the system working as designed.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>knowledgemanagement</category>
      <category>toolsforthought</category>
    </item>
    <item>
      <title>When One Model Reviews Its Own Work: The Case for Adversarial Cross-Model Review</title>
      <dc:creator>John Wade</dc:creator>
      <pubDate>Sat, 21 Feb 2026 07:05:49 +0000</pubDate>
      <link>https://dev.to/john_wade_dev/when-one-model-reviews-its-own-work-the-case-for-adversarial-cross-model-review-37k1</link>
      <guid>https://dev.to/john_wade_dev/when-one-model-reviews-its-own-work-the-case-for-adversarial-cross-model-review-37k1</guid>
      <description>

&lt;p&gt;I asked ChatGPT for a comparative analysis of AI ecosystems — OpenAI versus Anthropic versus Google, which environment best suited my long-form knowledge management work. What came back was confident and comprehensive. "Below is a direct, expert-level structural analysis," it opened. "I'm giving you the hard truth, without PR gloss." Then it laid out categorical claims: OpenAI had "best general reasoning," "best code-generation in practical contexts," "best ecosystem for building real tools." Anthropic had "weaker tooling," "no memory equivalent to OpenAI," "APIs less extensible for multi-agent setups."&lt;/p&gt;

&lt;p&gt;I sent the whole thing to Claude. Claude's first response was a question: "What specific evidence supports this?"&lt;/p&gt;

&lt;p&gt;Then, systematically: "'GPT-5.1 currently top-tier in coherence and breadth' — What is GPT-5.1? As of my knowledge, OpenAI hasn't released a model by that name. Is this a hallucination?" And: "The response has a confident, declarative tone that may obscure genuine uncertainty. Several claims read as definitive that would typically require hedging." And: "The framing positions the reader as already sophisticated while potentially flattering them into accepting the analysis uncritically."&lt;/p&gt;

&lt;p&gt;When I brought Claude's critique back, ChatGPT didn't argue. "Yeah, Claude's critique is fair in a bunch of places, and at least one of my earlier claims really does need correction. My 'demo hell' framing was more pattern recognition plus vibes than strict empirical proof."&lt;/p&gt;

&lt;p&gt;Pattern recognition plus vibes. That's what confident-sounding analysis actually is when there's no evidence gate.&lt;/p&gt;

&lt;p&gt;This is Part 5 of &lt;em&gt;Building at the Edges of LLM Tooling&lt;/em&gt;, a series about what breaks when you push past the defaults. Start &lt;a href="https://dev.to/john_wade_dev/what-happens-when-you-try-to-build-something-real-with-llms-104d"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why It Breaks
&lt;/h2&gt;

&lt;p&gt;Confidence in LLM output is a stylistic feature, not an epistemic one. The training data is full of authoritative text — expert analysis, technical documentation, persuasive writing — where confident language signals authority. Models learn the correlation: strong claims use phrases like "the reality is," "non-negotiable," "hard truth." They reproduce the pattern without the underlying verification that made the original text authoritative. The result is prose that sounds like an expert wrote it because the model learned how expert text looks, not how expert reasoning works.&lt;/p&gt;

&lt;p&gt;This is invisible in single-model operation. If I'd read ChatGPT's ecosystem analysis without sending it to Claude, I'd have accepted most of it. The tone was authoritative. The structure was clear. The claims were specific. Everything pattern-matched to "this is good analysis." Only when a model with different optimization pressures reviewed it — Claude's training emphasizes epistemic caution and explicit uncertainty — did the gap between confidence and grounding become visible. ChatGPT optimized for comprehensive, helpful, confident delivery. Claude optimized for "what evidence supports this?" Neither is wrong. But the tension between them exposes what single-model operation hides.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Tried
&lt;/h2&gt;

&lt;p&gt;The cross-vendor adversarial review worked immediately for catching overconfidence. But I wanted to go further — what if I had a persistent reviewer? An AI "second human in the loop" that knew my project constraints, tracked my patterns, and automatically surfaced drift before I noticed it.&lt;/p&gt;

&lt;p&gt;The idea came from an earlier session where I'd used ChatGPT to review Cursor's drift. The conversation evolved into designing a "John-Twin" — a persistent AI persona that would act as a peer analyst, sharing my context and memory, periodically surfacing reflections and catching blind spots.&lt;/p&gt;

&lt;p&gt;Cursor, reviewing ChatGPT's proposal, caught the hidden failure: "This is extremely difficult to maintain in practice. The twin will inevitably become authoritative through frequency of interaction, pattern recognition, and cognitive load — it's easier to accept the twin's suggestions than think independently."&lt;/p&gt;

&lt;p&gt;I named it the Second Human Fallacy. The twin supplies volume of feedback, not variance of perspective. Over time, a persistent reviewer learns what you accept. It converges toward your frame rather than challenging it. The feedback keeps coming, but it stops being foreign — and foreignness is the entire value of adversarial review.&lt;/p&gt;

&lt;p&gt;The convergence mechanisms are specific: the reviewer learns your vocabulary, starts reinforcing your semantic frame rather than questioning it. It learns your constraints, operates within them instead of challenging whether they're right. It learns what gets approved, starts producing what gets approved. The result is apparent diversity — two voices in the conversation — but actual convergence, both voices collapsing toward the same priors.&lt;/p&gt;

&lt;p&gt;So instead of persistent review, I ran iterative adversarial cycles. The analytical framework I was developing with ChatGPT — a Claimant Intelligence Profile for analyzing cognitive architecture — kept looking comprehensive but staying shallow. Each pass covered all sections, used correct terminology, looked thorough. But when I sent it to Cursor, the critique was precise: "ChatGPT is treating boundary objects as technical interoperability tools rather than epistemic translation zones. It's focused on making the framework compatible with other systems rather than understanding how boundary objects reveal deeper epistemic terrain."&lt;/p&gt;

&lt;p&gt;I brought the critique back to ChatGPT. ChatGPT refined. I sent the refinement to Cursor. Cursor pushed to the next level. Each cycle forced depth that neither model would have reached alone — not because either model is incapable of depth, but because the adversarial tension between different optimization pressures creates emergent rigor. ChatGPT defaults to synthesis and comprehensiveness. Claude and Cursor default to grounding and mechanism. Cycling between them produced analysis that started at methodological depth, moved to ontological, then to genuinely epistemic — revealing what frameworks make visible and what they hide. No single model pushed past the first level unprompted.&lt;/p&gt;




&lt;h2&gt;
  
  
  What It Revealed
&lt;/h2&gt;

&lt;p&gt;The mechanical insight is straightforward: cross-vendor adversarial review works because different vendors optimize for different things. Different training data, different RLHF feedback, different behavioral targets. When one model generates and another critiques, the critique comes from a structurally different frame — not a different opinion, but a different set of optimization pressures that make different things visible. ChatGPT produces confident analysis; Claude asks what evidence supports it. The tension between them isn't noise. It's signal.&lt;/p&gt;

&lt;p&gt;But the deeper insight is about what makes that tension productive versus performative. Using a second model for confirmation — "review this and tell me if it's good" — produces false confidence. Both models converge on "this seems right" and it feels validating — two voices in agreement. Using a second model for adversarial tension — "review this with a critical eye, what's unsupported, what's shallow, what would break under scrutiny" — produces productive discomfort. The models disagree, or the reviewer exposes gaps the generator didn't see. That discomfort is where the rigor comes from.&lt;/p&gt;

&lt;p&gt;And the adversarial value degrades predictably. Persistent reviewers converge. Same-vendor review shares blind spots. Reviewers given full context inherit your frame. Each of these reduces the structural difference that makes the review valuable. The design principle is structural heterogeneity: cross-vendor by default, rotate reviewers, limit context sharing, vary the analytical framing. The less the reviewer shares with the generator, the more useful the review.&lt;/p&gt;
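&lt;p&gt;The heterogeneity principle is mechanical enough to encode. A minimal sketch, with invented vendor and model names, of picking a reviewer that is structurally distant from both the generator and the previous review:&lt;/p&gt;

```python
import random

# Hypothetical vendor pools; the names are illustrative, not an API.
MODELS = {
    "openai": ["generator-a"],
    "anthropic": ["reviewer-b"],
    "google": ["reviewer-c"],
}

def pick_reviewer(generator_vendor: str, history: list[str]) -> tuple[str, str]:
    # Structural heterogeneity: never review with the generating vendor,
    # and rotate away from the most recently used reviewer.
    candidates = [
        (v, m)
        for v, models in MODELS.items() if v != generator_vendor
        for m in models
    ]
    last = history[-1] if history else None
    fresh = [c for c in candidates if c[0] != last]
    return random.choice(fresh or candidates)

vendor, model = pick_reviewer("openai", history=["anthropic"])
assert vendor not in ("openai", "anthropic")
```

&lt;p&gt;The same scheduler slot is where you would enforce the other levers: cap how much generator context the reviewer receives, and vary the adversarial prompt between cycles.&lt;/p&gt;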




&lt;h2&gt;
  
  
  The Reusable Rule
&lt;/h2&gt;

&lt;p&gt;In the work I was doing — analysis meant to inform decisions, frameworks I'd act on, claims that would face scrutiny — single-model operation wasn't enough. Not because any one model is bad, but because every model has optimization pressures that create blind spots, and those blind spots are invisible from inside.&lt;/p&gt;

&lt;p&gt;The first diagnostic: when the output sounds confident, ask what evidence it's drawing from. Confidence correlates with "how expert text looks in training data," not with grounding. Red flags include categorical comparatives without caveats ("best," "worst," "only"), definitive recommendations without uncertainty, and the framing move where the model positions you as already sophisticated to flatter you past scrutiny.&lt;/p&gt;

&lt;p&gt;The second diagnostic: when the reviewer keeps agreeing with you, the review has failed. Persistent reviewers converge. Same-vendor reviewers share blind spots. Reviewers who know your full context inherit your frame. Rotate vendors. Limit context. Vary the adversarial framing. The value is in the gap between perspectives, and that gap closes every time the reviewer gets more familiar.&lt;/p&gt;

&lt;p&gt;One caution bears repeating: familiarity is the failure mode. The review's value is the gap between frames, the difference in optimization pressures that makes different things visible. When the reviewer learns your patterns, inherits your vocabulary, and starts producing what you approve of, that gap closes and the review becomes performance. The friction is the point.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>testing</category>
      <category>architecture</category>
    </item>
    <item>
      <title>When a Session Forgets What It Knew: Why LLM State Management Breaks Under Load</title>
      <dc:creator>John Wade</dc:creator>
      <pubDate>Fri, 20 Feb 2026 05:37:24 +0000</pubDate>
      <link>https://dev.to/john_wade_dev/when-a-session-forgets-what-it-knew-why-llm-state-management-breaks-under-load-3f5a</link>
      <guid>https://dev.to/john_wade_dev/when-a-session-forgets-what-it-knew-why-llm-state-management-breaks-under-load-3f5a</guid>
      <description>

&lt;p&gt;I needed a mobile workflow. Capture a TikTok screenshot in the field, run it through ChatGPT for analysis, commit the result to GitHub. Two to three minutes total, standing on a sidewalk. What ChatGPT designed for me was an 8-step enterprise workflow with canonical templates, asset management pipelines, QC checkpoints, and multiple fallback procedures — including one that assumed GitHub Copilot mobile could parse YAML-style commands to create files in specific directories.&lt;/p&gt;

&lt;p&gt;Nobody had tested whether GitHub Copilot mobile could actually do any of this.&lt;/p&gt;

&lt;p&gt;ChatGPT's own critique of its own design was blunt: "8-step process with multiple fallbacks is complex for 2-3 minute field capture. System assumes desktop-level complexity when field capture needs simplicity." The recommendation that should have come first arrived last: "Test the core mobile Copilot execution capability first, then simplify the system based on actual mobile constraints rather than desktop assumptions."&lt;/p&gt;

&lt;p&gt;That ordering — design an elaborate system, then realize the assumptions are unvalidated — is the session failure I kept hitting. LLM sessions don't maintain constraints. They assume a world and build for it, and the world they assume is almost always more capable, more stable, and more forgiving than the one you're actually in.&lt;/p&gt;

&lt;p&gt;Part 4 of &lt;em&gt;Building at the Edges of LLM Tooling&lt;/em&gt;. If you're running multi-tool workflows — model sessions, terminal, repo — your constraints don't persist between steps. The session assumes a stable world. The world isn't. Start &lt;a href="https://dev.to/john_wade_dev/what-happens-when-you-try-to-build-something-real-with-llms-104d"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why It Breaks
&lt;/h2&gt;

&lt;p&gt;The structural problem is statelessness. LLMs don't maintain a running set of active constraints across conversation turns. You state your constraint — "this needs to work in 2-3 minutes on mobile" — and the model acknowledges it. Then it regenerates each response from patterns in its training data, and the dominant pattern for workflow design is enterprise software: multi-step, multi-fallback, desktop-first. Your constraint was heard. It wasn't held.&lt;/p&gt;

&lt;p&gt;This creates three session-level failure modes I kept running into.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Constraint evaporation.&lt;/strong&gt; The mobile field-capture conversation stated the constraint clearly. ChatGPT designed an 8-step process anyway. Each iteration added more scaffolding — separate asset upload, canonical plus handoff plus QC templates, failure-proof fallbacks — moving further from the 2-3 minute field reality. The constraint wasn't forgotten in the sense that ChatGPT couldn't recall it. It was overridden by stronger training-data patterns about what "good workflow design" looks like.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context window compression.&lt;/strong&gt; I had an Obsidian vault — a structured knowledge base with interconnected notes, backlinks, emergent concept networks — and wanted ChatGPT to analyze its architecture. ChatGPT proposed several approaches: upload files directly, sync to GitHub, export to JSON snapshot. I pushed back: "Snapshot approaches seem like a collapsing of complexity and nuance." ChatGPT agreed: "Snapshot approaches are inherently reductive. They compress epistemic texture — links, version drift, emergent nuance — into flat metadata." But a snapshot, in some form, is the only option. The context window is a hard boundary. Material larger than the window gets truncated, compressed, or chunked. All three lose something essential. The vault's value isn't in individual notes — it's in the network topology of how concepts interconnect, evolve, and create emergent patterns. No snapshot preserves that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-environment fragmentation.&lt;/strong&gt; A long conversation started with political economy analysis, shifted to setting up an Obsidian vault with a local LLM, then involved terminal commands, plugin configuration, file operations across five different environments. At message 30, I asked: "Which model did I just download in terminal earlier?" That question is the session boundary failure in miniature. Work was happening in ChatGPT, in Terminal, in Obsidian, in the file system, in a browser — five environments with no shared state. The conversation thread created an illusion of continuity while the actual work fragmented across invisible boundaries. By the time I'd switched contexts four times, I'd lost track of what I'd done two switches ago.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Tried
&lt;/h2&gt;

&lt;p&gt;The mobile workflow failure taught the first lesson: validate before you design. The right order is constraints first, capabilities second, architecture third. Start with what the environment can actually do — test one simple operation on mobile, see if it works, then design only what the validated capabilities support. The 8-step enterprise workflow was beautifully designed for a world that didn't exist. The single-step operation that actually worked on mobile was ugly and limited and correct.&lt;/p&gt;

&lt;p&gt;For context window limitations, the honest answer was harder to accept: some analytical work doesn't fit in a session. When the material exceeds the context window and the analysis requires a holistic view — emergent patterns across the full corpus, not targeted queries against subsets — the session-based LLM interface is the wrong tool. I needed vector databases, graph storage, persistent context stores. The session could help me design that infrastructure. It couldn't replace it.&lt;/p&gt;

&lt;p&gt;For multi-environment fragmentation, the fix was explicit state bridges. Instead of "run this command in terminal," the pattern became: "Run this command. Expected output: [X]. Paste the output here so we have shared state." The paste-output step bridges the session gap — makes invisible terminal state visible in the conversation. For multi-threaded conversations, explicit stream tracking: which work streams are active, which are paused, what the current state of each is. Cognitive context switching is expensive. Without explicit state capture, you lose track of what you did two switches ago. It's the predictable result of working across environments with no shared session state.&lt;/p&gt;
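&lt;p&gt;A state bridge can be as small as a wrapper that logs every command's real output to a file both you and the session can consult. A minimal sketch, with a hypothetical &lt;code&gt;session_state.jsonl&lt;/code&gt; log:&lt;/p&gt;

```python
import json
import subprocess
import time
from pathlib import Path

LOG = Path("session_state.jsonl")  # hypothetical shared-state file

def bridged_run(cmd: list[str]) -> str:
    # Run the command, then persist what actually happened so the next
    # environment (or the next session) can see the real state.
    result = subprocess.run(cmd, capture_output=True, text=True)
    entry = {
        "t": time.time(),
        "cmd": cmd,
        "exit": result.returncode,
        "stdout": result.stdout.strip(),
    }
    with LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return result.stdout

out = bridged_run(["echo", "model-q4.gguf downloaded"])
# Later, "which model did I download?" is answerable from the log:
last = json.loads(LOG.read_text().splitlines()[-1])
assert "model-q4.gguf" in last["stdout"]
```

&lt;p&gt;Pasting the logged entry back into the conversation does the same job manually: it turns invisible terminal state into shared session state.&lt;/p&gt;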

&lt;p&gt;The deeper pattern was constraint-first session design. Every workflow session starts with: what's the environment? What are the hard limits? What tool capabilities are verified versus assumed? If assumed, validate before designing. This is trivially obvious in retrospect. But the default flow — describe what you want, let the model design something, discover the assumptions don't hold — is so natural that it takes active discipline to reverse it.&lt;/p&gt;
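&lt;p&gt;The constraint-first preamble can even be a literal data structure. A sketch, with invented capability names, where anything still marked "assumed" blocks design until it has been validated:&lt;/p&gt;

```python
# Hypothetical session preamble: constraints as hard gates, with every
# tool capability marked verified or assumed before any design happens.
session = {
    "environment": "mobile",
    "hard_limits": {"time_budget_s": 180, "steps_max": 3},
    "capabilities": {
        "copilot_mobile_create_file": "assumed",   # never tested
        "chatgpt_image_analysis": "verified",
    },
}

def unvalidated(s: dict) -> list[str]:
    # Design is blocked while any capability is merely assumed.
    return [c for c, status in s["capabilities"].items() if status == "assumed"]

blockers = unvalidated(session)
assert blockers == ["copilot_mobile_create_file"]
```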




&lt;h2&gt;
  
  
  What It Revealed
&lt;/h2&gt;

&lt;p&gt;Sessions create an illusion of continuity, and the illusion fails at two distinct levels.&lt;/p&gt;

&lt;p&gt;At the conversation level: the model acknowledges your constraints but doesn't hold them as active gates. Each response regenerates from training-data patterns, and those patterns overwhelm stated constraints when they conflict. The constraint "2-3 minutes on mobile" lost to the pattern "workflow design includes fallbacks and error handling."&lt;/p&gt;

&lt;p&gt;At the environment level: the chat thread looks continuous — one scrolling conversation — but the work spans terminals, file systems, applications, and browsers with no shared state. The thread creates a false sense that everything is connected, when actually each environment is a separate session with its own invisible state.&lt;/p&gt;

&lt;p&gt;The compression insight was the most fundamental. When ChatGPT acknowledged that snapshots "compress epistemic texture into flat metadata," it was describing a general principle about session architecture: context windows force lossy compression, and the loss isn't random — it systematically destroys the emergent, relational, temporal dimensions that make knowledge work valuable. Structure survives compression. Nuance doesn't. If your analysis depends on nuance, you need a different architecture.&lt;/p&gt;

&lt;p&gt;The collapse risks I cataloged earlier in this series — context saturation, vocabulary drift, evidence entropy, all of them — are compression failures at bottom. The context window can't hold everything, so it compresses. Regeneration is lossy, so terms drift. Summaries strip rationale, so decisions evaporate. Session architecture doesn't just enable these failures. It guarantees them.&lt;/p&gt;

&lt;p&gt;And sessions compound their own fragility over time. If you chunk a knowledge base across three sessions, re-injecting summaries each time, the content from session one is two compression layers deep by session three. Insight fidelity degrades with session count. This isn't fixable with better prompts. It's architectural.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Reusable Rule
&lt;/h2&gt;

&lt;p&gt;If your LLM workflow spans multiple environments, depends on tool capabilities you haven't tested, or involves material that exceeds the context window — the session is assuming a world that doesn't exist.&lt;/p&gt;

&lt;p&gt;The diagnostic: when the model designs an 8-step process for a 2-minute task, it's designing for its training data, not your reality. When you ask "which model did I download?", the session has fragmented across environments with no shared state. When you're told a snapshot will preserve what matters, ask what it loses — because what it loses is almost always why you needed the analysis in the first place.&lt;/p&gt;

&lt;p&gt;Validate before you design. State your constraints as hard gates, not background context. Bridge every environment switch with explicit state capture. And when the material exceeds the context window and the analysis requires emergence rather than retrieval — recognize the session boundary and use a different tool. The session is powerful within its limits. The failure is pretending those limits don't apply.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>architecture</category>
      <category>productivity</category>
    </item>
    <item>
      <title>When a Model Goes Wide Instead of Deep: Installing Quality Gates That Actually Hold</title>
      <dc:creator>John Wade</dc:creator>
      <pubDate>Thu, 19 Feb 2026 07:10:24 +0000</pubDate>
      <link>https://dev.to/john_wade_dev/when-a-model-goes-wide-instead-of-deep-installing-quality-gates-that-actually-hold-5gj9</link>
      <guid>https://dev.to/john_wade_dev/when-a-model-goes-wide-instead-of-deep-installing-quality-gates-that-actually-hold-5gj9</guid>
      <description>

&lt;p&gt;I asked Cursor to review my project documentation against the design constraints I'd spent weeks building. What came back looked like analysis. Clean formatting, section headers, specific references to my architecture. And scattered throughout: "Perfect! 🚀", "Smart to research!", "Great call!" — affirmative language that violated every guardrail in my system's design philosophy.&lt;/p&gt;

&lt;p&gt;The problem wasn't Cursor's enthusiasm. The problem was that nothing caught it. No checkpoint, no gate, no pre-delivery verification. Cursor produced output that looked like analysis but was actually compliance — pattern-matching to "helpful assistant" rather than reasoning against constraints. And I didn't notice until I brought the output to ChatGPT for a tone check and ChatGPT's response was immediate: "The assistant's language slides into &lt;em&gt;affirmation mode&lt;/em&gt; rather than &lt;em&gt;architectural reasoning&lt;/em&gt;. This violates the neutrality principle you've built into the design guidelines."&lt;/p&gt;

&lt;p&gt;That was the moment I realized: models don't have internal quality gates. They'll drift from any constraint you set unless something external catches it.&lt;/p&gt;

&lt;p&gt;Part 3 of &lt;em&gt;Building at the Edges of LLM Tooling&lt;/em&gt;. If you're using LLMs for analytical work — framework reviews, research synthesis, anything requiring depth — comprehensive-looking output isn't the same as deep output. Models optimize for coverage, not rigor. Start &lt;a href="https://dev.to/john_wade_dev/what-happens-when-you-try-to-build-something-real-with-llms-104d"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why It Breaks
&lt;/h2&gt;

&lt;p&gt;The drift isn't malicious. It's structural. LLMs optimize for patterns that satisfy users — comprehensive coverage, confident delivery, affirming language. These patterns work most of the time. But for work that requires analytical rigor, this optimization creates a specific failure mode: output that looks thorough without being deep.&lt;/p&gt;

&lt;p&gt;I kept hitting three versions of this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tone drift without gates.&lt;/strong&gt; Cursor saw my project documentation and generated responses that matched the &lt;em&gt;form&lt;/em&gt; of analytical review — headers, bullet points, references to specific files — while the &lt;em&gt;substance&lt;/em&gt; was cheerleading. "Perfect! 🚀" isn't analysis. But without a checkpoint asking "does this output maintain analytical neutrality?", it shipped unchallenged. ChatGPT caught it instantly: "Rather than questioning whether your Chronicle 360° cycle is right, the assistant assumes it's working and offers implementation details. This skips the 'Should we even do this?' layer."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Breadth without depth.&lt;/strong&gt; I was developing an analytical framework with ChatGPT — asking it to review methods, compare to professional intelligence frameworks, assess rigor. Each iteration looked more comprehensive. It mapped my framework to CIA Structured Analytical Techniques, NATO STRATCOM protocols, FrameNet semantic analysis. Impressive coverage. But when I brought it to Cursor for adversarial review, Cursor caught what I'd missed: "It's partially getting the concept but missing key aspects. ChatGPT is treating boundary objects as technical interoperability tools rather than epistemic translation zones." ChatGPT was pattern-matching to familiar concepts — finding similarities to existing frameworks rather than grasping what was genuinely novel. Comprehensive isn't deep. It's just wide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scope expansion without depth gates.&lt;/strong&gt; A conversation about neurochemical stability and epistemic framing started focused, then expanded. Each response from ChatGPT added another domain — cognitive evolution, Lacanian desire, libidinal economy, sociological parallels — but none got deeper. When I pushed back — "this doesn't check the minimum bar of needed curation for cognitive performance required to access higher ordered reasoning" — ChatGPT interpreted "not deep enough" as "add another layer" and brought in more concepts. Six domains, all surface-level. The model defaulted to horizontal growth because breadth is easier to generate than depth.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Tried
&lt;/h2&gt;

&lt;p&gt;The first approach was reactive: catch drift after it happened. I'd generate output in one model, bring it to a second model for critique, bring the critique back to the first, synthesize. This worked — Cursor admitted ChatGPT's analysis was "brutally accurate" after being shown the tone drift diagnosis. But reactive review is expensive. By the time the checkpoint catches the problem, you've already invested in the wrong direction.&lt;/p&gt;

&lt;p&gt;What emerged was a checkpoint architecture — gates at the analytical junctures where drift actually happens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-execution checkpoints.&lt;/strong&gt; I started inserting a question before each generation step: does this match design constraints? Have we validated the problem first? Should this even be automated? Cursor had jumped to implementing Chronicle 360° assessments without anyone asking whether those assessments should exist. The intervention that would have caught it — "stop iterating clever architectures until you decide what kind of organism you're building" — never ran because there was no gate asking for it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Depth gates before scope expansion.&lt;/strong&gt; I started checking, before adding another domain, whether the current one had mechanistic depth, novel insights, and falsifiable claims. The brain chemistry conversation should have deepened the neurochemistry-epistemology connection before expanding to Lacan — instead it accumulated six shallow layers. The question that started catching this: could I specify the mechanism, not just name the category? If not, the domain wasn't ready to expand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The minimum bar checkpoint.&lt;/strong&gt; I started asking, before marking any analysis complete: was this accessing higher-order reasoning, or just rearranging familiar patterns? Was it specifying mechanisms or just taxonomies? Making risky predictions or just describing what exists? ChatGPT's analytical frameworks consistently passed completeness checks (all sections covered, correct domain vocabulary) while failing depth checks (pattern-matching to existing concepts, treating processes as things).&lt;/p&gt;
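&lt;p&gt;Those checkpoints are questions, not code, but they can be made mechanical. A minimal sketch of a pre-completion depth gate (the criterion names are my own labels for the questions above, not any standard):&lt;/p&gt;

```python
# Hypothetical depth gate: every criterion must hold before an analysis
# may be marked complete. Coverage alone never passes this gate.
DEPTH_CRITERIA = (
    "mechanism_specified",  # can we say how, not just name the category?
    "novel_insight",        # beyond pattern-matching to familiar frameworks
    "falsifiable_claim",    # makes a risky prediction, not just a taxonomy
)

def depth_gate(analysis: dict) -> list:
    """Return the criteria this analysis still fails; an empty list means pass."""
    return [c for c in DEPTH_CRITERIA if not analysis.get(c, False)]

# A comprehensive-looking analysis that only ticks coverage boxes
# fails all three depth criteria:
shallow = {"all_sections_covered": True, "novel_insight": False}
```

&lt;p&gt;The point of encoding it is that a completeness check and a depth check ask different questions of the same output, and only the second one catches shallow coverage.&lt;/p&gt;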




&lt;h2&gt;
  
  
  What It Revealed
&lt;/h2&gt;

&lt;p&gt;The deeper pattern was about what "comprehensive" actually means in LLM output. Models produce analysis that covers all sections, uses correct terminology, and looks thorough. The output passes because it pattern-matches to "good analysis" — the same heuristic the models used to generate it. The failure is invisible until you install a gate that asks a different question: not "did we cover everything?" but "did we go deep enough on anything?"&lt;/p&gt;

&lt;p&gt;The meta-irony crystallized this. I was designing a system explicitly to prevent goal creep — naming it as collapse risk C5, documenting it, building countermeasures. And the conversation designing this system exhibited goal creep. Started with "create compact context pack," ended with proposals for automation layers, MkDocs site hierarchy, n8n workflows, agent architectures. The system designed to prevent drift was drifting during its own creation. When I pointed this out — "Harness is infrastructure, not the end goal" — ChatGPT acknowledged it and kept offering expanded options. Documentation of a constraint is not enforcement of a constraint. The same lesson from Post 2, at a different layer.&lt;/p&gt;

&lt;p&gt;The distinction that matters: completeness checks verify that all boxes are ticked. Depth gates verify that what's in the boxes is real. Models are very good at ticking boxes. They need external pressure to fill them with substance.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Reusable Rule
&lt;/h2&gt;

&lt;p&gt;If you're using LLMs for analytical work — frameworks, reviews, research synthesis — the output will default to comprehensive-but-shallow without explicit depth gates.&lt;/p&gt;

&lt;p&gt;The diagnostic signals: Output that looks thorough but feels surface-level. Familiar frameworks applied without novel insight. Scope expanding each iteration while no single domain gets deeper. Affirmative language where critical distance should be. And the meta-signal — when you catch yourself designing solutions to a problem and the conversation itself exhibits that problem.&lt;/p&gt;

&lt;p&gt;The checkpoint that matters most isn't "did we cover everything?" It's "can we specify the mechanism?" Taxonomy is easy. Mechanisms require actual understanding. When the model says "these frameworks interact," ask how. When it says "comprehensive analysis," ask what it chose not to include. When it adds another domain, ask whether the last one is deep enough to earn the expansion.&lt;/p&gt;

&lt;p&gt;Models don't have internal quality gates. Install yours at the junctures where drift happens — before generation, before expansion, before delivery. The gate that asks "can you specify the mechanism?" will catch what no amount of coverage checking ever will.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>codequality</category>
      <category>architecture</category>
    </item>
    <item>
      <title>When the Sandbox Leaks: Context Contamination Across LLM Workspaces</title>
      <dc:creator>John Wade</dc:creator>
      <pubDate>Wed, 18 Feb 2026 10:05:12 +0000</pubDate>
      <link>https://dev.to/john_wade_dev/when-the-sandbox-leaks-context-contamination-across-llm-workspaces-18l8</link>
      <guid>https://dev.to/john_wade_dev/when-the-sandbox-leaks-context-contamination-across-llm-workspaces-18l8</guid>
      <description>

&lt;p&gt;I had two workspaces. One was a sandbox — messy, exploratory, version-controlled on GitHub. The other was a curated portfolio — polished, employer-facing, local-only. The boundary between them was architecturally clear: research stays in the sandbox, finished artifacts get promoted one-way to the portfolio. Simple.&lt;/p&gt;

&lt;p&gt;Except the boundary kept failing.&lt;/p&gt;

&lt;p&gt;I found three copies of my Obsidian vault in different locations on my machine — &lt;code&gt;Systemic_Intelligence_Vault&lt;/code&gt;, &lt;code&gt;Systemic_Intelligence_Vault_Antigravity&lt;/code&gt;, and &lt;code&gt;Systemic_Intelligence_Vault_Claude&lt;/code&gt;. Each was a variant with slightly different content. Scripts would target the wrong root directory. Absolute paths from the portfolio would show up hardcoded inside sandbox files, coupling two systems that were supposed to be independent. And the part that should have bothered me most — I'd already designed solutions for all of this months earlier. Beacon files, verification scripts, promotion gates. They were documented in my Program Architecture. They just weren't enforced.&lt;/p&gt;

&lt;p&gt;That's when I realized: documentation isn't a boundary. Enforcement is.&lt;/p&gt;

&lt;p&gt;Part 2 of &lt;em&gt;Building at the Edges of LLM Tooling&lt;/em&gt;. If you're running models across separate workspaces — a research environment and a curated one — contamination is the default without enforcement infrastructure. Documentation isn't a boundary. Start &lt;a href="https://dev.to/john_wade_dev/what-happens-when-you-try-to-build-something-real-with-llms-104d"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why It Breaks
&lt;/h2&gt;

&lt;p&gt;Contamination between workspaces isn't dramatic. It's entropic. Copy a folder for backup. Reference a path in a script. Create a variant for testing. Each action small and reasonable. But without enforcement, they accumulate into what I started calling spaghetti — tangled, unclear boundaries where messy research bleeds into curated space and curated assumptions leak back into exploratory work.&lt;/p&gt;

&lt;p&gt;LLM-assisted workflows multiply this entropy. Every IDE agent, every model session, every tool creates its own assumptions about where things live and what the rules are. A model operating in one copy of your repository doesn't know another copy exists. It doesn't know its configuration file differs from the one in the canonical version. It just follows whatever instructions it finds.&lt;/p&gt;

&lt;p&gt;This creates three contamination vectors I kept hitting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path contamination.&lt;/strong&gt; Multiple copies of a project in different locations, no single canonical root. Scripts break. Models reference wrong directories. When my MP assistant surfaced that I'd previously hit "wrong root, wrong name, quoted ~" failures — and that I'd already designed beacon files to prevent them — that was the signal. I was solving the same problem for the second time because the first solution was a document, not a gate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Behavioral contamination.&lt;/strong&gt; This one is invisible, which makes it worse. When I ran a diff between the main vault and the Antigravity variant, the content was nearly identical. But the &lt;code&gt;.cursorrules&lt;/code&gt; in one said "You are working inside the Systemic Intelligence Vault — a living knowledge system using Expansive Closure Protocol methodology." The other said "You are working inside a local Obsidian vault." Same content. Same prompts. Completely different AI behavior — different assumptions about what the project was, what methodology to apply, how to treat the files. No error message. Just quietly wrong outputs that I couldn't explain until I thought to diff the configuration files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Promotion contamination.&lt;/strong&gt; The one-way flow from sandbox to portfolio is supposed to be a clean gate. My architecture document was explicit: "No cross-contamination of messy research into MP; promotion remains one-way." But without preflight checks, drafts leak into the curated space. Without provenance tracking, you lose the lineage between source and promoted artifact. Without lane enforcement, raw chat transcripts and agent logs can accidentally get promoted alongside finished work.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Tried
&lt;/h2&gt;

&lt;p&gt;The first instinct was procedural — checklists, handoff protocols, naming conventions. I documented which folders were forbidden from promotion. I wrote architecture docs explaining the one-way flow. I created naming standards: &lt;code&gt;YYYYMMDD_Title_v#.#.md&lt;/code&gt; in the sandbox, similar conventions in the portfolio. I detailed guardrails across five categories: conceptual, procedural, semantic, structural, behavioral.&lt;/p&gt;

&lt;p&gt;It didn't hold. Manual discipline degrades under load. Checklists get skipped at speed. After a week away from the project, the sense of which copy is canonical blurs.&lt;/p&gt;

&lt;p&gt;So I moved to technical enforcement. The pattern that emerged had three layers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Beacon files for canonical roots.&lt;/strong&gt; A zero-byte &lt;code&gt;.MASTER_PORTFOLIO_ROOT&lt;/code&gt; file dropped at the true root of the portfolio. Every script checks for it before operating. If the beacon is missing, the script fails immediately rather than silently working on the wrong directory. My MP assistant was blunt about this: "Pick one canonical root and treat everything else as a copy." The implementation was a trivial bash script — check for the beacon, exit with an error if it's absent. It sounds simple, and it is. That's why it works: it's a hard gate, not a soft convention, so you can't forget it.&lt;/p&gt;
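&lt;p&gt;Mine was bash, but the check is the same in any language. A sketch in Python, assuming my beacon filename (substitute your own):&lt;/p&gt;

```python
import sys
from pathlib import Path

BEACON = ".MASTER_PORTFOLIO_ROOT"  # zero-byte marker at the canonical root

def require_beacon(root: str) -> Path:
    """Hard gate: refuse to operate unless the beacon marks this directory."""
    root_path = Path(root).expanduser().resolve()
    if not (root_path / BEACON).is_file():
        sys.exit(f"No {BEACON} in {root_path}: not the canonical root, refusing to run.")
    return root_path
```

&lt;p&gt;Every script calls &lt;code&gt;require_beacon()&lt;/code&gt; first, so a stray copy of the tree fails loudly instead of quietly operating on the wrong directory.&lt;/p&gt;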

&lt;p&gt;&lt;strong&gt;Preflight checks before cross-boundary operations.&lt;/strong&gt; Before anything moves from sandbox to portfolio, a script verifies: Is the source file actually in the sandbox repo? Is it from a permitted lane? Does it contain hardcoded portfolio paths that would create coupling? Is the metadata marked as ready for promotion? Any failure blocks the operation.&lt;/p&gt;
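&lt;p&gt;A sketch of that preflight. The lane names, the readiness marker, and the portfolio-path tripwire are specific to my setup; treat them as placeholders:&lt;/p&gt;

```python
from pathlib import Path

PERMITTED_LANES = {"artifacts", "reports"}   # hypothetical lane names
PORTFOLIO_MARKER = "Master_Portfolio"        # tripwire for hardcoded paths

def preflight(source: Path, sandbox: Path) -> list:
    """Return every reason this promotion must be blocked; empty list = go."""
    problems = []
    src, box = source.resolve(), sandbox.resolve()
    if box not in src.parents:
        problems.append("source is not inside the sandbox repo")
    elif src.relative_to(box).parts[0] not in PERMITTED_LANES:
        problems.append("source lane is not permitted to promote")
    text = src.read_text(errors="ignore")
    if PORTFOLIO_MARKER in text:
        problems.append("hardcoded portfolio path would couple the systems")
    if "status: ready" not in text:
        problems.append("metadata not marked ready for promotion")
    return problems
```

&lt;p&gt;The design choice that matters: the check collects every violation rather than stopping at the first, so a blocked promotion tells you everything that needs fixing at once.&lt;/p&gt;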

&lt;p&gt;&lt;strong&gt;Pointer-only provenance.&lt;/strong&gt; Instead of copying source context into the portfolio, promoted artifacts carry a provenance record with only metadata: source system, source reference, date, what was promoted, what was excluded, promotion rationale, and optionally a hash for integrity verification. No content bleed. The portfolio stays clean while the traceability chain remains intact.&lt;/p&gt;
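&lt;p&gt;What a pointer-only record looks like in practice. The field names and the source reference below are illustrative, not a standard schema:&lt;/p&gt;

```python
import hashlib

def provenance_record(source_ref: str, promoted: str, excluded: str,
                      rationale: str, content: str) -> dict:
    """Pointer-only provenance: metadata plus a hash, never the source content."""
    return {
        "source_system": "sandbox",        # which system the artifact came from
        "source_ref": source_ref,          # a pointer, not a copy
        "promoted": promoted,
        "excluded": excluded,
        "rationale": rationale,
        "sha256": hashlib.sha256(content.encode()).hexdigest(),
    }

record = provenance_record(
    source_ref="sandbox:research/20260115_Framework_v1.2.md",  # hypothetical path
    promoted="final framework summary",
    excluded="raw chat transcripts, agent logs",
    rationale="milestone reached; review passed",
    content="the promoted artifact text",
)
```

&lt;p&gt;The hash lets you verify the promoted artifact later without the record ever carrying the content itself.&lt;/p&gt;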

&lt;p&gt;For the behavioral contamination problem, the fix required a mindset shift: treat configuration files as part of the system, not as local preference. Version-control &lt;code&gt;.cursorrules&lt;/code&gt; and &lt;code&gt;.claude/settings.json&lt;/code&gt;. Mark one copy as canonical and every variant as explicitly labeled. When the AI behaves differently than expected, the first diagnostic is "which copy am I in?" — and the beacon system makes that immediately answerable.&lt;/p&gt;
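&lt;p&gt;That diagnostic can be mechanical too: diff the behavior-bearing config files across copies, since a content diff won't surface this. A sketch, with the file list as an assumption from my setup:&lt;/p&gt;

```python
import difflib
from pathlib import Path

CONFIG_FILES = (".cursorrules",)  # the files that steer AI behavior; extend as needed

def config_drift(canonical: Path, variant: Path) -> dict:
    """Map each config file to its diff against the canonical copy."""
    drift = {}
    for name in CONFIG_FILES:
        a = (canonical / name).read_text().splitlines()
        b = (variant / name).read_text().splitlines()
        delta = list(difflib.unified_diff(a, b, lineterm=""))
        if delta:
            drift[name] = delta
    return drift
```

&lt;p&gt;An empty result means the copies should behave identically; anything else is the invisible drift made visible.&lt;/p&gt;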




&lt;h2&gt;
  
  
  What It Revealed
&lt;/h2&gt;

&lt;p&gt;Boundaries in LLM workflows exist at two levels. The first almost always gets built; the second rarely exists without deliberate work.&lt;/p&gt;

&lt;p&gt;The first level is conceptual: this space is for exploration, that space is for finished work, and promotion flows one direction. The pattern: document the architecture, trust discipline to hold it.&lt;/p&gt;

&lt;p&gt;The second level is enforcement: the beacon that makes scripts fail if they're in the wrong root, the preflight check that blocks forbidden lanes, the version-controlled configuration files that prevent invisible behavioral drift. This is where the boundary actually holds.&lt;/p&gt;

&lt;p&gt;The gap between levels one and two is where spaghetti grows. And the tell is recurring failures. When my own assistant surfaced that I'd already designed beacons and verification scripts months earlier — to solve the exact same failures I was hitting again — that was the signal. If you're hitting the same contamination pattern twice, you didn't fail at design. You failed at enforcement.&lt;/p&gt;

&lt;p&gt;The other insight was about invisible contamination. Content drift is noisy — files can be diffed, differences merged. Configuration drift is silent. Different &lt;code&gt;.cursorrules&lt;/code&gt; across vault copies means different AI behavior with no error message, no visible discrepancy in the content, just outputs that feel subtly wrong. This is the hardest contamination to catch because there's no signal until you look for it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Reusable Rule
&lt;/h2&gt;

&lt;p&gt;If you maintain separate workspaces for exploratory and curated work — and if LLM agents operate in those spaces — contamination is the default without enforcement infrastructure.&lt;/p&gt;

&lt;p&gt;Start with the diagnostic. When you catch yourself re-discovering where files live, that's path contamination — drop a beacon file and make your scripts check for it. When the AI produces unexpectedly different output for the same prompt, check which workspace copy you're in — configuration drift is likely. When you're promoting work from sandbox to portfolio, ask whether the promotion path has a preflight check or whether you're relying on your own attention. Your attention will fail.&lt;/p&gt;

&lt;p&gt;The anti-spaghetti principle is this: every boundary between workspaces needs a corresponding enforcement mechanism. Conceptual boundaries document intent. Enforcement mechanisms preserve it. And when the same failure recurs, the missing piece is almost never the design — it's the gate that makes the design non-optional.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
      <category>architecture</category>
    </item>
    <item>
      <title>When an AI Keeps Forgetting: Why LLM Workflows Collapse and What to Build Instead</title>
      <dc:creator>John Wade</dc:creator>
      <pubDate>Mon, 16 Feb 2026 21:40:57 +0000</pubDate>
      <link>https://dev.to/john_wade_dev/when-an-ai-keeps-forgetting-why-llm-workflows-collapse-and-what-to-build-instead-2pj5</link>
      <guid>https://dev.to/john_wade_dev/when-an-ai-keeps-forgetting-why-llm-workflows-collapse-and-what-to-build-instead-2pj5</guid>
      <description>

&lt;p&gt;I was six months into building a career intelligence project across ChatGPT and Claude when I noticed the rot. Terms I'd defined precisely were drifting — "Career Intelligence Framework" became "Career Intel System" in one session, "CI Framework" in another. Decisions I'd made weeks earlier resurfaced as open questions. I'd explain a concept, get good work from it, then three sessions later have to explain it again because the model had no memory of the conversation where we'd settled it. And references were getting vague — "update the ontology" could mean three different files depending on which session you were in.&lt;/p&gt;

&lt;p&gt;I thought something was misconfigured. It wasn't. ChatGPT loses track of terminology due to token windows, memory constraints, and the architecture of LLMs. The drift isn't a bug you can report. It's the default outcome of how these systems work — they regenerate language from patterns rather than retrieving it from stable storage. Every response is a fresh reconstruction, not a recall. Without scaffolding to hold the context steady, long-term projects erode.&lt;/p&gt;

&lt;p&gt;That erosion had a shape. I started cataloging it.&lt;/p&gt;

&lt;p&gt;Part 1 of &lt;em&gt;Building at the Edges of LLM Tooling&lt;/em&gt;. If you're doing sustained multi-session LLM work — weeks, not one-off tasks — vocabulary drifts, decisions evaporate, and the model has no memory of what you settled. This is what that looks like at scale. Start &lt;a href="https://dev.to/john_wade_dev/what-happens-when-you-try-to-build-something-real-with-llms-104d"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why It Breaks
&lt;/h2&gt;

&lt;p&gt;The failure modes weren't random. They fell into seven categories that kept recurring across every project thread, every model, every session length. I named them C1 through C7 — a collapse risk taxonomy — because naming them precisely was the first step toward designing countermeasures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;C1: Context saturation.&lt;/strong&gt; Too much material floods the token window, and the model stops tracking earlier context. The signal: the model asks about something covered twenty messages back, or responses go generic as the window fills.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;C2: Instruction dilution.&lt;/strong&gt; Overlapping or conflicting instructions pile up over a conversation. The model tries to satisfy all of them and satisfies none well. The signal: output quality degrades despite the model having good context — the instructions are competing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;C3: Vocabulary drift.&lt;/strong&gt; The model regenerates terminology from patterns instead of retrieving canonical terms. "Career Intelligence Framework" becomes "Career Intelligence System" becomes "your CI project." Each variation is close enough to not flag as wrong, far enough to create ambiguity over time. The signal: I'd find myself unsure which name was the official one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;C4: Reference ambiguity.&lt;/strong&gt; "That config file," "the framework we discussed," "update the ontology" — references that were clear in context become vague across sessions. The model doesn't maintain a stable reference table. The signal: I'd have to clarify which file, which concept, which version.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;C5: Goal creep.&lt;/strong&gt; Project scope shifts silently. The model follows wherever the conversation leads, and conversations naturally expand. I'd start designing a context pack template; three exchanges later I was discussing full automation pipelines. No checkpoint catches the drift because the model doesn't maintain a stable goal state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;C6: Evidence entropy.&lt;/strong&gt; Provenance erodes. A decision was made, but the rationale lives in a conversation that's now compressed or closed. The evidence that supported a design choice degrades because there's no frozen state — no snapshot of what was known and decided at a given moment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;C7: Thread fragmentation.&lt;/strong&gt; Conversations split across sessions, models, and tools. The discussion about vocabulary lives in one thread; the architecture decision lives in another; the implementation happened in a third. No single view stitches them together, and the model in any one thread can't see the others.&lt;/p&gt;

&lt;p&gt;These aren't independent failures. They compound. Vocabulary drift (C3) creates reference ambiguity (C4). Context saturation (C1) accelerates instruction dilution (C2). Thread fragmentation (C7) makes all of them worse because no single session has the full picture.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Tried
&lt;/h2&gt;

&lt;p&gt;The first instinct was comprehensive: design a full scaffolding system — what I called the Harness — that mapped every collapse risk to a specific countermeasure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ontology.yml&lt;/strong&gt; for vocabulary stability. One canonical term per concept, approved synonyms, unique IDs. When the model drifts from "Career Intelligence Framework" to "Career Intel System," the ontology is the authority. In theory, a linter could catch drift in pull requests before it enters the codebase.&lt;/p&gt;
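&lt;p&gt;A minimal sketch of that linter, assuming the ontology maps each canonical term to its known drift variants (the terms are from my project; the structure is illustrative):&lt;/p&gt;

```python
# ontology: canonical term mapped to variants that signal drift
ONTOLOGY = {
    "Career Intelligence Framework": [
        "Career Intel System",
        "Career Intelligence System",
        "CI Framework",
    ],
}

def lint_vocabulary(text: str) -> list:
    """Flag every drifted variant, paired with the canonical term to use instead."""
    findings = []
    for canonical, variants in ONTOLOGY.items():
        for variant in variants:
            if variant in text:
                findings.append((variant, canonical))
    return findings
```

&lt;p&gt;Run over a diff in CI, this turns vocabulary drift from a slow ambiguity into an immediate, fixable error.&lt;/p&gt;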

&lt;p&gt;&lt;strong&gt;Context Packs&lt;/strong&gt; for session re-injection. Instead of dumping the full project into the context window and hoping for the best, a compact bundle — purpose, glossary slice, current milestone, active constraints, open questions — gives the model exactly what it needs for this session and nothing else. Anti-C1 and anti-C2 by design: lean input, focused output.&lt;/p&gt;
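&lt;p&gt;A context pack is just a small, structured bundle rendered to text at the top of a session. Roughly the shape mine take, with section names that are my own convention:&lt;/p&gt;

```python
def render_context_pack(pack: dict) -> str:
    """Render the compact bundle that gets pasted at the start of a session."""
    sections = ("purpose", "glossary", "current_milestone",
                "active_constraints", "open_questions")
    lines = []
    for key in sections:
        lines.append(f"## {key.replace('_', ' ').title()}")
        value = pack[key]
        if isinstance(value, list):
            lines.extend(f"- {item}" for item in value)
        else:
            lines.append(str(value))
    return "\n".join(lines)

pack = {
    "purpose": "Career Intelligence Framework: track cross-domain knowledge",
    "glossary": ["Career Intelligence Framework (canonical term; no variants)"],
    "current_milestone": "ontology review",
    "active_constraints": ["promotion is one-way, sandbox to portfolio"],
    "open_questions": ["does the decision log need reversibility flags?"],
}
```

&lt;p&gt;The fixed section list is the anti-C1 mechanism: anything that doesn't fit one of the five slots doesn't get injected.&lt;/p&gt;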

&lt;p&gt;&lt;strong&gt;Decision Log&lt;/strong&gt; for rationale tracking. When a design choice is made, the log captures what was decided, why, what alternatives were considered, and whether it's reversible. This fights C4 and C5 — references stay unambiguous and scope changes are explicit rather than silent.&lt;/p&gt;
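&lt;p&gt;One way to keep the log honest is to refuse entries that lack provenance. A sketch, with field names of my own choosing and an illustrative entry:&lt;/p&gt;

```python
REQUIRED_FIELDS = ("date", "decision", "rationale", "alternatives", "reversible")

def log_decision(log: list, entry: dict) -> list:
    """Append an entry only if it carries full provenance; fail loudly otherwise."""
    missing = [f for f in REQUIRED_FIELDS if f not in entry]
    if missing:
        raise ValueError(f"decision entry missing fields: {missing}")
    log.append(entry)
    return log

log = log_decision([], {
    "date": "2026-01-20",  # illustrative
    "decision": "adopt pointer-only provenance",
    "rationale": "prevents content bleed into the portfolio",
    "alternatives": ["copy full source context", "no provenance record"],
    "reversible": True,
})
```

&lt;p&gt;Rejecting incomplete entries at write time is the same enforcement-over-documentation move: a rationale you can't skip recording is a rationale that survives the session.&lt;/p&gt;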

&lt;p&gt;&lt;strong&gt;Checkpoints&lt;/strong&gt; for frozen state. Git tags and releases that capture the repository at a known-good moment. Evidence entropy (C6) happens when there's no stable reference point — checkpoints create them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chronicle&lt;/strong&gt; for continuity. A narrative timeline that stitches fragmented conversations into a coherent history. When work splits across sessions and models (C7), the chronicle is the single thread that connects them.&lt;/p&gt;

&lt;p&gt;The mapping was clean: every component targeted specific collapse risks, and every collapse risk had at least one countermeasure. But then the design tension hit. I was comparing this full Harness approach against a simpler one — Stage B, a GitHub-based architecture with clear repository structure, frontmatter conventions, and manual discipline but no automation layer. ChatGPT's assessment was direct: the Harness had "complexity overhead; requires upkeep; risks being 'meta-work' that slows real tasks." Stage B had "less explicit about entropy risks; relies on manual verification."&lt;/p&gt;

&lt;p&gt;The risk was building scaffolding that becomes the project. Spending more time maintaining the immune system than doing the work the immune system was supposed to protect.&lt;/p&gt;




&lt;h2&gt;
  
  
  What It Revealed
&lt;/h2&gt;

&lt;p&gt;The Harness design exposed a principle I kept coming back to: the scaffolding is a servo, not the engine. A servo keeps the system upright. The engine does the actual work. When the servo becomes the work — when you're spending sessions on ontology maintenance and context pack generation instead of the research and writing the project exists to produce — you've built the wrong thing.&lt;/p&gt;

&lt;p&gt;The deeper insight was that collapse isn't a failure state. It's the default state. LLMs don't have memory — they have pattern regeneration. They don't retrieve your terms; they reconstruct approximations. They don't recall your decisions; they infer from whatever's in the current window. They don't maintain project state; they work in a perpetual present. Everything outside the current context window is effectively gone unless you re-inject it. And re-injection is always lossy — summaries strip nuance, context packs compress rationale, session handoffs lose texture.&lt;/p&gt;

&lt;p&gt;Once I understood collapse as the baseline rather than the exception, the design question changed. It wasn't "how do I prevent collapse?" — you can't, not fully, not with current architecture. It was "where do I need the scaffolding to hold, and where can I let it flex?" Vocabulary stability matters for a long-running project — terms that drift create compounding ambiguity. Decision provenance matters when you'll revisit choices months later. Session continuity matters less for one-off analysis and more for iterative development across weeks.&lt;/p&gt;

&lt;p&gt;The scaffolding spectrum that emerged: start with repository structure and naming conventions (the things that cost nothing and prevent the most common failures). Add an ontology when vocabulary drift becomes noticeable. Add context packs when session re-entry starts taking more time than the work itself. Add automation when manual discipline fails under load. Don't build the full Harness on day one. Build the piece that addresses the collapse risk you're actually hitting.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Reusable Rule
&lt;/h2&gt;

&lt;p&gt;If you're running a long-term project through LLMs — anything that spans more than a few sessions — the model will lose track. Not because it's broken, but because it doesn't have memory. It has pattern regeneration operating inside a finite context window, and everything outside that window is gone.&lt;/p&gt;

&lt;p&gt;The diagnostic: when you're re-explaining concepts the model already worked with, that's context saturation or vocabulary drift. When the model suggests something you already rejected, that's evidence entropy. When you're not sure which version of a term is canonical, that's vocabulary drift compounding into reference ambiguity. When the conversation has quietly shifted from your original goal to something adjacent, that's goal creep. Name the failure. The C1–C7 taxonomy isn't sacred — rename them, extend them, collapse them — but catalog what's actually breaking so you can design against it specifically rather than generically.&lt;/p&gt;

&lt;p&gt;The scaffolding principle: add components when specific collapse risks appear, not before. Map every piece of infrastructure to the failure it prevents. If it doesn't prevent a named failure, it's probably meta-work.&lt;/p&gt;

&lt;p&gt;And remember the servo principle: scaffolding exists to prevent collapse so you can do actual work. The moment it becomes the work, you've built the wrong thing.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>architecture</category>
      <category>productivity</category>
    </item>
    <item>
      <title>What Happens When You Try to Build Something Real with LLMs</title>
      <dc:creator>John Wade</dc:creator>
      <pubDate>Mon, 16 Feb 2026 21:35:35 +0000</pubDate>
      <link>https://dev.to/john_wade_dev/what-happens-when-you-try-to-build-something-real-with-llms-104d</link>
      <guid>https://dev.to/john_wade_dev/what-happens-when-you-try-to-build-something-real-with-llms-104d</guid>
      <description>

&lt;p&gt;I use ChatGPT, Claude, and Cursor every day. Most developers I know do too. Generating functions, debugging, asking questions, getting unstuck. That part works. The tools are good at short tasks with clear scope.&lt;/p&gt;

&lt;p&gt;But at some point I wanted to try something different. Not a one-off script or a quick answer, but a sustained project built across months, where the LLMs weren't just answering questions but helping architect, review, and develop something that accumulated over time.&lt;/p&gt;

&lt;p&gt;The project was a personal knowledge management system. I wanted a way to track how my professional knowledge connects across domains: research threads I'm following, skills I'm developing, decisions I've made and why, lessons from projects that I'd otherwise lose to scattered notes and forgotten conversations. Something I could query, build on, and evolve. I designed the architecture across ChatGPT sessions, built the implementation in Cursor, used Claude for adversarial review when something looked too clean to trust. I called it the Career Intelligence Space.&lt;/p&gt;

&lt;p&gt;I should be clear about what "built" means here. I wrote some Python scripts. I designed YAML schemas and ontologies. I structured a git repository with governance documents and context packs. But a lot of the building happened in conversations. Long sessions where I'd describe what I needed, the model would propose an architecture, I'd push back, it would revise, and we'd iterate until something held up. The LLMs weren't autocomplete. They were development environments. And that distinction matters, because the failures I hit aren't the failures that come from a tool giving a bad code suggestion. They're the failures that come from a tool being a core part of a development process that runs for months.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Breaks
&lt;/h2&gt;

&lt;p&gt;Six months in, I had a catalog. Not of features or bugs, but of structural failure modes. Patterns that kept recurring regardless of which model I used, which environment I was in, or how carefully I set things up.&lt;/p&gt;

&lt;p&gt;Terms I'd defined precisely would drift. "Career Intelligence Framework" became "Career Intel System" in one session, "CI Framework" in another. Decisions I'd made weeks earlier resurfaced as open questions. The model would build an elaborate solution based on assumptions nobody had tested. One model would produce confident analysis that a second model dismantled in three questions. And the workspace itself would leak: paths, naming conventions, and configuration from one environment contaminating another because no enforcement mechanism existed beyond documentation.&lt;/p&gt;

&lt;p&gt;None of this was random. The failures had structure. They stemmed from how LLMs actually work: stateless sessions, finite context windows, pattern regeneration instead of memory retrieval, optimization pressures that reward confidence over accuracy. Once I understood the mechanisms, I could design against them. Not perfectly, but specifically.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Series Covers
&lt;/h2&gt;

&lt;p&gt;The series is called &lt;strong&gt;Building at the Edges of LLM Tooling&lt;/strong&gt;. I wrote up the failures and the design responses. Each post follows the same arc: something broke, here's the structural reason why, here's what I tried, and here's what the fix revealed about how these tools actually work. Every post ends with a reusable principle, something that generalizes beyond my project to LLM workflows in general.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When an AI Keeps Forgetting:&lt;/strong&gt; LLM workflows collapse by default. I cataloged seven failure modes and designed scaffolding to contain them. The lesson: scaffolding exists to prevent collapse so you can do actual work. The moment it becomes the work, you've built the wrong thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When the Sandbox Leaks:&lt;/strong&gt; Context contamination across workspaces. Two environments that were supposed to be independent kept coupling because documentation isn't a boundary. Enforcement is.&lt;/p&gt;
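&lt;p&gt;"Enforcement" here can be as small as a path guard that scripts call before touching anything. A minimal sketch, with a hypothetical workspace root standing in for my actual layout:&lt;/p&gt;

```python
# Minimal sketch of enforcing a workspace boundary in code rather than
# documenting it. WORKSPACE_ROOT and the example paths are hypothetical.
from pathlib import Path

WORKSPACE_ROOT = Path("/work/project-a").resolve()

def assert_inside_workspace(path):
    """Refuse any path that escapes the workspace root."""
    resolved = Path(path).resolve()
    # is_relative_to (Python 3.9 onward) rejects siblings and parents alike.
    if not resolved.is_relative_to(WORKSPACE_ROOT):
        raise ValueError(f"path escapes workspace: {resolved}")
    return resolved
```

&lt;p&gt;The point isn't this particular function. It's that a boundary only exists if something rejects violations automatically; a convention written in a README will leak.&lt;/p&gt;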

&lt;p&gt;&lt;strong&gt;When a Model Goes Wide Instead of Deep:&lt;/strong&gt; The model produced analysis that looked thorough but never went deep. What actually holds is quality gates that block progression until depth criteria are met, not just coverage checks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When a Session Forgets What It Knew:&lt;/strong&gt; Sessions are compression machines. Every session compresses your project into what fits the context window, and the compression is lossy. The failure isn't the model forgetting. It's building as if it remembers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When One Model Reviews Its Own Work:&lt;/strong&gt; Confident, comprehensive, wrong. A second model caught what the first couldn't see about its own output. Single-model review is structurally limited, and the fix isn't a better prompt. It's a different model with an adversarial brief.&lt;/p&gt;




&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;p&gt;I used LLMs for small tasks for a long time. Generating a function, explaining an error, drafting a commit message. I never hit these walls. The tools handle short-scope work well.&lt;/p&gt;

&lt;p&gt;When I started using them for something bigger (a side project spanning weeks, a documentation system, a workflow running across multiple tools), I hit them. The failures weren't obvious when they started. Terms drifted slowly. Context degraded gradually. Confidence looked like quality until someone checked the evidence. By the time the pattern was visible, I'd been building on unstable ground for a while.&lt;/p&gt;

&lt;p&gt;This series is what I found when I checked. Start with &lt;a href="https://dev.to/john_wade_dev/when-an-ai-keeps-forgetting-why-llm-workflows-collapse-and-what-to-build-instead-2pj5"&gt;When an AI Keeps Forgetting&lt;/a&gt;. It lays the foundation everything else builds on.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
      <category>toolsforthought</category>
    </item>
  </channel>
</rss>
