DEV Community

John Wade

When the Reasoning Doesn't Survive


I built a skill to capture session reasoning before it evaporates. The first time I used it, I tested whether the capture actually worked.

The skill is called /mark. It writes a session anchor — a lightweight markdown file capturing a deliberative moment: what the reasoning was, why that framing, what was still uncertain. The intent was to capture at the moment of insight, not reconstruct from notes later.

The first anchor I wrote was for an epistemic research framing developed earlier in the same session. By the time I ran /mark, the session had compacted once. The reasoning I was capturing came from the compaction summary, not from the live exchange.

The result: the four research aims survived intact. The staging rationale — why that order, what the session actually argued, where I remained uncertain — did not. What /mark produced was a clean artifact of what had been decided. Not a record of how.

I noticed immediately. The structure was right. The reasoning wasn't there.

Part 7 of Building at the Edges of LLM Tooling. If you're relying on session notes to reconstruct reasoning you developed in an LLM session, be aware that notes capture decisions, not the argument behind them. The deliberative layer has no persistence infrastructure. Start here.


Why It Breaks

LLM sessions produce two things.

The first is artifacts: files created, decisions made, nodes extracted, code written. These are the session's visible output. They persist because they land somewhere — a file system, a vault, a repo.

The second is deliberative reasoning: why this framing over another, what the session considered and rejected, the staging logic, the uncertainty that was still live at step three. This layer is what makes the artifact navigable — the argument behind the decision, available for future review.

Current LLM tooling has infrastructure for artifacts. It has nothing for deliberative reasoning.

When a session ends, the deliberative layer evaporates. If the session compresses before it ends — and sessions compress — the evaporation is gradual. Each compression pass discards branching, keeps conclusions. The staging rationale disappears. The alternative framings disappear. What remains is a summary of what was decided, not a record of how.

This isn't a bug in any specific tool. It's a gap in the layer. ECP nodes capture processed research. Session notes capture operational events. Neither captures the reasoning connecting one piece of work to the next. The infrastructure doesn't exist yet.


What I Tried

The /mark skill is a Claude Code skill for capturing deliberative moments during working sessions.

The design criteria were specific: not session notes (those capture what happened, not why). Not extracted research nodes (those capture processed content, not live reasoning). Specifically the layer between — design decisions in flight, research framings being constructed, the logic that connects current work to future work.

The output is a session anchor: a markdown file, vault-native, written at the moment of insight. It links to the concept being framed, records the open questions still live at capture time, and stores the reasoning while it's present in context — not reconstructed from a summary later.

The anchor format doesn't try to reproduce the full exchange. It captures three things: what was decided, why (the argument as it stood at capture time), and what remained uncertain. Those three things are enough to make the decision navigable two sessions later.
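A minimal sketch of what writing such an anchor could look like, in Python. The file layout, field names, and naming scheme here are assumptions for illustration — the post doesn't specify the actual /mark implementation:

```python
from datetime import datetime, timezone
from pathlib import Path

def write_anchor(vault: Path, concept: str, decided: str,
                 why: str, uncertain: list[str]) -> Path:
    """Write a session anchor capturing the three fields:
    what was decided, the argument as it stood at capture
    time, and what remained uncertain. (Hypothetical format.)"""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M")
    path = vault / f"anchor-{stamp}-{concept}.md"
    open_qs = "\n".join(f"- {q}" for q in uncertain) or "- (none)"
    path.write_text(
        f"# Anchor: [[{concept}]]\n\n"          # vault-native link to the concept
        f"## Decided\n{decided}\n\n"
        f"## Why (argument at capture time)\n{why}\n\n"
        f"## Still uncertain\n{open_qs}\n"
    )
    return path
```

The point of the sketch is the capture timing, not the format: the function is meant to run in the live session, while the `why` argument is still in context rather than reconstructed from a summary.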


What It Revealed

The anchor-from-summary failure confirmed the mechanism.

The same session that produced the skill also produced the first anchor. By the time I ran /mark, one compaction event had already occurred. The research framing I was capturing lived in the pre-compression transcript. What I was working from was the summary.

The anchor captured structure correctly: four research aims, each with a framing question. What it didn't capture was the staging rationale — why those four, why in that order, what the session had actually argued about construct validity before landing on that framing, where I had been uncertain and remained uncertain.

The structure survived double compression. The reasoning didn't.

This is what compression does systematically: it preserves conclusions and discards the reasoning trace. The summary is the session's artifact layer. Everything below that is gone.

The /mark skill works when it captures at the moment the reasoning exists. It doesn't recover reasoning from summaries. Running it retroactively produces the same artifact-without-argument problem that session notes produce.

Capture has to happen in the live session, before compression, at the moment of insight. Any other timing produces an archive, not a record.


The Reusable Rule

If your LLM sessions produce reasoning you later need to revisit — design decisions, research framings, the logic between steps — check what layer you're actually capturing.

Session notes capture what happened. Summaries capture what was decided. Neither captures the argument that led there.

The test: pull up a session note from three weeks ago. Can you reconstruct why the decision went that way — including what was considered and rejected? If not, the reasoning evaporated. The artifact survived; the deliberation didn't.

Real-time capture before compression was the intervention that worked. After the session ends, the reasoning is already gone.

Top comments (6)

MaxxMini

The "structure survived, reasoning didn't" pattern hits close to home.

I run an AI agent that wakes up fresh every session. Its continuity comes from a two-tier file system: daily logs (raw what-happened) and a curated long-term memory file (distilled decisions). Periodically, the agent reviews daily logs and promotes what matters to long-term memory.

The failure mode is exactly what you describe. When the agent promotes a decision to long-term memory, it captures "we switched from relay-based browser automation to Playwright" — the conclusion. What doesn't survive: the three failed approaches that led there, why contentEditable injection silently dropped content, why ClipboardEvent simulation worked on one platform but not another. The staging rationale evaporates during the promotion step, same as your compression pass.

Your /mark skill targets the right layer. The intervention I've found complementary is a lessons-learned file — not "what we decided" but specifically "what broke and why the fix wasn't obvious." It's the negative space of your anchor: instead of capturing the reasoning at the moment of insight, it captures the reasoning at the moment of failure. The two together might cover the deliberative spectrum more completely.

One thing I'm curious about: does the anchor format handle reasoning that spans multiple sessions? The hardest deliberative traces I've lost weren't from single-session compression — they were from multi-session evolution where the framing shifted incrementally. Session 1 establishes a premise. Session 3 quietly modifies it. By session 5, the current framing contradicts the original, but no single session contains the full argument for why it changed. Each anchor would be locally correct but globally misleading.

Also — the "capture before compression" timing constraint feels like it creates a metacognitive burden. You have to recognize a deliberative moment as a deliberative moment, in real time, before context compaction. Have you found that the act of deciding "this is worth marking" itself changes the reasoning? Like an observer effect on your own thinking process.

John Wade

Late to this — both questions hit something real.

On multi-session drift — anchoring alone doesn't catch it. An anchor is a point-in-time capture. It tells you what the reasoning was in session 3. It doesn't tell you that session 7 quietly departed from it. For that, I needed something that compares state across time.

What I built: an append-only session state log — every session close writes a snapshot that gets appended before the working state is overwritten. That gives me a timeline I can grep. If session 7's framing contradicts session 3's premise, the log holds both snapshots and the gap is findable. Not automatically surfaced — I have to look. But the record exists. After ~80 sessions, the pattern I see is bimodal — some threads resolve within a session, others persist across dozens of snapshots without resolution. The persistent ones are where drift lives.

The second mechanism is commissions — external-perspective assessments I run periodically through different models with structured briefs. A commissioned assessor reviews the current project state against stated goals and prior decisions. These function as drift detectors from outside the cognitive loop. I can't always see my own framing shift. An assessor looking at the whole trajectory can. Three commissions so far, each one caught something I'd normalized.

So: anchors capture reasoning at the moment. The session log tracks how snapshots change over time. Commissions catch drift I've stopped noticing. Three layers, each addressing a different failure mode.

Your lessons-learned file is the piece I don't have a systematic version of — capturing "what broke and why the fix wasn't obvious" is the negative space of the anchor. Anchors preserve reasoning at the moment of insight. Lessons-learned preserves reasoning at the moment of failure. Together they'd bracket the deliberative spectrum more completely. Right now failure reasoning only survives in my system if it happens to land in a session snapshot, which means it's subject to the same compression triage as everything else.

On the observer effect — yes. I've noticed it. The act of deciding "this is worth marking" does change what gets produced. The reasoning gets slightly more structured, slightly more legible, when I know it's being captured rather than flowing past. I don't have a fix for this. It might not need one — if the intervention produces a more durable trace at the cost of some naturalness, that might be an acceptable trade. But I want to be honest that the capture isn't neutral. The act of marking shapes what gets marked. That's an open frontier, not a solved problem.

Post 8 in the series covers building the persistence layer — graph database, vector store, MCP server — that's designed to hold reasoning structure across sessions rather than just compressed summaries. Your two-tier system (daily logs + curated long-term memory, periodic promotion) runs into the same promotion-lossy problem at the tier boundary. That's the specific wall the persistence layer is trying to address.

signalstack

MaxxMini's multi-session drift question is the harder version of this problem. Single-session compression is lossy in a predictable direction: you lose branching, keep conclusions. Multi-session drift is cumulatively lossy — Session 5's framing can be locally coherent while quietly contradicting Session 1's founding premise, with no single moment where the contradiction became visible. Each anchor is locally correct; the global trajectory is wrong.

What I think this points to: the 'why' layer and the 'what' layer have different decay rates and probably need different retention policies. Decisions stay relevant longer. The reasoning behind them is most useful in the window right after construction, then mainly as audit material. Most memory architectures conflate them in the same storage tier, which is why the promotion step loses reasoning — it's treating two different artifact types as one.

The fix might be a two-output promotion step: one conclusion artifact for long-term storage, one reasoning trace that doesn't need to survive forever — just long enough to be useful when a later session contradicts an earlier premise. The trace can be compact: premise → objection considered → resolution → conclusion. That's the argument chain, not the full exchange. Cheap to store, expensive to reconstruct later.
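The two-output promotion step could be sketched as two artifact types with different retention policies. The class names and the 90-day retention window are assumptions for illustration:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Conclusion:
    """Long-lived artifact: survives indefinitely in long-term memory."""
    decision: str
    decided_on: date

@dataclass
class ReasoningTrace:
    """Short-lived artifact: the compact argument chain, kept long
    enough to audit a later contradiction, then eligible for expiry."""
    premise: str
    objection: str
    resolution: str
    conclusion: str
    expires_after_days: int = 90  # assumed retention window

def promote(premise: str, objection: str, resolution: str,
            decision: str) -> tuple[Conclusion, ReasoningTrace]:
    """Two-output promotion: one artifact per storage tier,
    instead of collapsing both into a single conclusion entry."""
    return (
        Conclusion(decision=decision, decided_on=date.today()),
        ReasoningTrace(premise, objection, resolution, decision),
    )
```

Using MaxxMini's example: `promote("relay automation is sufficient", "contentEditable injection silently drops content", "switch engines", "use Playwright")` yields a conclusion that persists and a trace that records why the conclusion wasn't obvious.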

Your /mark skill is targeting this layer at the right moment. The gap is what happens to the trace after the session ends — that's where I think the infrastructure is still missing.

John Wade

You're drawing a distinction I should have made more sharply in the post. Single-session compression is lossy in a predictable direction — branching goes, conclusions stay. You know what you're losing. The multi-session version is worse precisely because each session is locally coherent. Session 5's framing can quietly contradict Session 1's founding premise, and no single session sees the drift because each one only sees itself.

The two-output promotion step you're describing — conclusion artifact for long-term storage, reasoning trace that survives long enough to catch contradictions — is close to what I ended up building, though I arrived at it from a different angle. I hit the wall when working complexity exceeded what any session could hold, and the fix was a graph database with typed relationships and a formal ontology. Concepts land as conclusions (named, defined, with a development status). Source links carry provenance (who says so, how — grounds, challenges, extends). The relationship notes field carries the reasoning substance. Three layers, stored together but structurally distinct — which maps to your point about different artifact types needing different retention policies.

The gap you identified — what happens to the trace after the session ends — was the exact problem that forced the build. I wrote about it in the next post: When the Data Isn't Gone — It's Buried. The short version: the persistence layer after capture was the missing infrastructure. The longer version gets into what the topology looked like when it finally became visible — and what it still can't do, which is the retroactive re-evaluation piece you're pointing at with "the reasoning behind them is most useful in the window right after construction, then mainly as audit material." That part isn't solved yet. The monitoring surfaces where the terrain has shifted. Evaluating what the shift means is still manual.
