Mike Czerwinski

Posted on Jun 22

Salience is not carry value: notes from a running session-memory pipeline

#ai #llmops #agents #memory

After a few hundred substantial sessions with an agent, you have a corpus you couldn't read end-to-end in a month, and the next session opens cold anyway. There's a real architectural problem hiding inside that sentence, and most of the agent-memory writing I've seen doesn't touch it.

Storage solves what gets written.
Locks solve what gets defended.
Selection solves what gets carried.

The first two have a lot of tooling. The third is where most projects quietly land — under a vector store, behind a confident-sounding summarizer, with no way to inspect what the next session is actually about to walk in with.

But selection has its own internal trap, and naming it is the centerpiece of this post:

Salience is not carry value.

Salience tells you what mattered inside the session. Carry value tells you what the next session would be stupid without. Those are different signals, and conflating them is what makes "top-N highlights" pipelines quietly underperform: the loudest moment is rarely the highest-leverage one.

A real failure mode I've watched happen:

A naming debate got carried into the next session because it triggered five corrections in a row — high salience, easy to score.
The actual architectural decision from the same session was dropped, because it was made quietly once and never re-litigated.
Next session opened, picked up the naming debate as if it were unresolved, and walked past the architecture as if it were never made.

That's salience pretending to be carry value. It's a quiet failure, hard to attribute, expensive over time — the kind that never gets a postmortem because nobody noticed when it started.

This is an operational note from a pipeline I've been running — and a roadmap of where it falls short and where it's going. The architecture under it isn't novel; in my experience, the operator-facing surface is where it breaks, and that's the part worth writing about.

What's hard

You can't carry the whole session forward. Pick wrong and the next session opens with confident noise: a high-fluency model voice asserting things that aren't load-bearing, while the actual load-bearing moment from yesterday sits two scrolls away. Pick too little and the next session opens cold, re-litigates settled questions, and burns tokens recovering context the previous session already paid for.

Storage doesn't fix this. Neither does a bigger window. You have to decide what gets carried, and decide it for the kind of work that comes next, not just for what mattered yesterday.

What the pipeline does today

Two stages, deterministic first, optional LLM second.

Stage one — mechanical salience. Every session produces a transcript. Each event gets a salience weight from a deterministic scorer (corrections weighted heavier than acknowledgments, repeat-corrections heavier still, edit-then-revert as a strong signal of a contested decision, etc.). The pipeline filters at a minimum cutoff, keeps the top-N highlights, and writes a per-session record with a provenance_ref on every highlight pointing back to the raw transcript span. Nothing summarizes into confident voice with no ground under it — every highlight is one click away from its source.
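
A minimal sketch of what a stage-one scorer like this could look like. This is not the pipeline's actual code: the event kinds, the numeric weights, the cutoff, and the `sessions/example.md` refs are all assumptions chosen to match the ordering described above.

```python
from dataclasses import dataclass

# Hypothetical weights. The text only fixes the ordering:
# acknowledgment < correction < repeat-correction, edit-then-revert strong.
WEIGHTS = {
    "acknowledgment": 1,
    "correction": 3,
    "repeat_correction": 4,
    "edit_then_revert": 5,
}

@dataclass(frozen=True)
class Event:
    kind: str            # one of the WEIGHTS keys
    text: str
    provenance_ref: str  # pointer back to the raw transcript span

def select_highlights(events, min_weight=3, top_n=5):
    """Score every event, drop those under the cutoff, keep the top N."""
    scored = [(WEIGHTS.get(e.kind, 0), e) for e in events]
    kept = [(w, e) for w, e in scored if w >= min_weight]
    # list.sort is stable, so equal weights keep transcript order.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"text": e.text, "weight": w, "provenance_ref": e.provenance_ref}
        for w, e in kept[:top_n]
    ]

events = [
    Event("acknowledgment", "ok, sounds good", "sessions/example.md#L10-L11"),
    Event("correction", "no, score before synthesis", "sessions/example.md#L44-L61"),
    Event("edit_then_revert", "renamed scorer, then reverted", "sessions/example.md#L70-L82"),
]
highlights = select_highlights(events)
```

Note what the sketch drops: anything agreed to once and never contested scores as an acknowledgment and falls under the cutoff, which is exactly the quiet-decision failure described earlier.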

Stage two — optional synthesis. An --llm flag adds a thesis layer over the highlights. This is the part most people start with, and where most memory projects end. I treat it as the cheap finishing pass, not the load-bearing layer. If the highlights are wrong, the synthesis amplifies the wrong thing.

Retrieval-time brief. Per project, an INDEX.md is built from the existing per-session records (the consolidates), with no model in the retrieval-time loop. The synthesis pass at stage two is optional and runs once at consolidate time. At the moment the next session opens, no model is being asked to invent the brief; the session just reads the file. Deterministic, inspectable, cheap. Open it and you can see what the next session is about to walk in with. If something looks wrong, you fix the brief by hand.

A trimmed example of the current format:

## Carried into next session

- Decision: Use deterministic salience before LLM synthesis.
  provenance_ref: sessions/2026-06-21.md#L44-L61
  weight: 4
  reason: repeated correction + implementation change

## Do not carry

- Rejected naming variants from the salience scorer.
  reason: local brainstorming, no future dependency

That's the actual surface the next session reads. Not a vector match. Not a confident reconstruction. A file you can edit.
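
For illustration, a brief in this shape can be rendered with plain string formatting and no model call. The sketch below is an assumption about how such a renderer might look, not the pipeline's code; the function name `render_brief` and the dict keys are hypothetical, and the sample values are copied from the example above.

```python
def render_brief(carried, dropped):
    """Render the brief as plain text. No model is involved at any point."""
    lines = ["## Carried into next session", ""]
    for item in carried:
        lines.append(f"- {item['kind']}: {item['text']}")
        lines.append(f"  provenance_ref: {item['provenance_ref']}")
        lines.append(f"  weight: {item['weight']}")
        lines.append(f"  reason: {item['reason']}")
    lines += ["", "## Do not carry", ""]
    for item in dropped:
        lines.append(f"- {item['text']}")
        lines.append(f"  reason: {item['reason']}")
    return "\n".join(lines) + "\n"

carried = [{
    "kind": "Decision",
    "text": "Use deterministic salience before LLM synthesis.",
    "provenance_ref": "sessions/2026-06-21.md#L44-L61",
    "weight": 4,
    "reason": "repeated correction + implementation change",
}]
dropped = [{
    "text": "Rejected naming variants from the salience scorer.",
    "reason": "local brainstorming, no future dependency",
}]
brief = render_brief(carried, dropped)
```

Writing `brief` to INDEX.md is then a single file write, and a hand edit to that file is as authoritative as anything the renderer produced.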

Where this falls short

Honest current state: the pipeline ranks by salience. That's most of the way to the answer, but it's not the answer. Seven things I think the format needs next, ranked by how fast they close the salience-vs-carry-value gap:

1. Two scores per highlight, not one.

- Decision: Use INDEX.md as retrieval-time brief.
  salience: 3
  carry_value: 5
  reason: quiet decision, but many future files depend on it

Same data, different question. Salience says "this was loud." Carry value says "the next session would be stupid without it."
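
A small sketch of why the second score changes the outcome. The `Highlight` class is hypothetical, and the scores on the naming-debate entry are invented for illustration; the INDEX.md entry uses the numbers from the example above.

```python
from dataclasses import dataclass

@dataclass
class Highlight:
    text: str
    salience: int     # how loud it was inside the session
    carry_value: int  # how much the next session loses without it

highlights = [
    # Hypothetical scores: loud in the session, little future dependency.
    Highlight("Naming debate over the scorer", salience=5, carry_value=1),
    # Scores taken from the example entry above.
    Highlight("Use INDEX.md as retrieval-time brief", salience=3, carry_value=5),
]

top_by_salience = max(highlights, key=lambda h: h.salience)
top_by_carry = max(highlights, key=lambda h: h.carry_value)
```

Ranked by salience, the naming debate wins the slot. Ranked by carry value, the quiet decision does.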

2. Memory classes, not a flat list. A brief that mixes everything into "important things" is a brief the model can't act on. Minimum useful taxonomy:

Active decisions — settled, not to be re-litigated without cause
Operating constraints — style, rules, definitions currently in force
Open loops — unfinished work to return to
Rejected paths — so the agent doesn't re-propose
Volatile context — important now, scheduled to expire
Glossary / naming — local meanings, project-specific terms

The model now knows how to use each entry, not just that it was important.
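
One way to make the taxonomy enforceable is a closed enum, sketched below under assumptions: only the `volatile_context` value appears in this post's examples, so the other string values and the `group_by_class` helper are hypothetical.

```python
from collections import defaultdict
from enum import Enum

class CarryType(Enum):
    ACTIVE_DECISION = "active_decision"
    OPERATING_CONSTRAINT = "operating_constraint"
    OPEN_LOOP = "open_loop"
    REJECTED_PATH = "rejected_path"
    VOLATILE_CONTEXT = "volatile_context"
    GLOSSARY = "glossary"

def group_by_class(entries):
    """Group entry texts by class; an unknown carry_type raises ValueError."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[CarryType(entry["carry_type"])].append(entry["text"])
    return dict(grouped)

entries = [
    {"carry_type": "active_decision", "text": "Use INDEX.md as retrieval-time brief"},
    {"carry_type": "rejected_path", "text": "Naming variants for the salience scorer"},
    {"carry_type": "active_decision", "text": "Deterministic salience before LLM synthesis"},
]
grouped = group_by_class(entries)
```

Because the enum is closed, an entry tagged with anything outside the six classes fails loudly instead of sliding into a generic "important things" bucket.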

3. TTL / expires_when. Briefs without expiry become museums. Every entry should declare what makes it stale — a published article, a superseded decision, a finished sprint. Without expiry, context cholesterol quietly clogs the next session.

- Context: Article title currently "Salience is not carry value".
  carry_type: volatile_context
  expires_when: article published or title changed
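
A sketch of how expiry could be checked mechanically. It assumes `expires_when` is stored as a list of named events rather than the free-text form shown above, and that something upstream reports which events have occurred; the event names and the `live_entries` helper are hypothetical.

```python
def live_entries(entries, occurred_events):
    """Keep entries whose expiry events have not happened yet."""
    occurred = set(occurred_events)
    return [
        e for e in entries
        if not occurred & set(e.get("expires_when", []))
    ]

entries = [
    {
        "text": 'Article title currently "Salience is not carry value"',
        "carry_type": "volatile_context",
        "expires_when": ["article_published", "title_changed"],
    },
    {
        "text": "Use deterministic salience before LLM synthesis",
        "carry_type": "active_decision",
        "expires_when": ["decision_superseded"],
    },
]
still_live = live_entries(entries, occurred_events=["article_published"])
```

In this sketch an entry with no `expires_when` never expires, which is the museum case; a stricter version would reject such entries at write time.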

4. bring_when / do_not_bring_when. This is the unanswered question of memory: not just what's in the store, but why this entry is showing up now. Encoding the trigger directly on each entry is the answer that doesn't require a separate retrieval engine.

- Decision: Use deterministic salience before LLM synthesis.
  bring_when:
    - discussing memory pipeline architecture
    - revising agent-memory article
    - debugging poor session carryover
  do_not_bring_when:
    - casual writing tasks
    - unrelated career or CV work
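
A sketch of how these triggers could be evaluated without a retrieval engine. It assumes the next session declares its task as a set of tags that match the trigger strings exactly, which is a simplification; the `should_bring` helper is hypothetical.

```python
def should_bring(entry, task_tags):
    """Return True if the entry's triggers match the declared task."""
    tags = set(task_tags)
    if tags & set(entry.get("do_not_bring_when", [])):
        return False  # exclusions win over inclusions
    return bool(tags & set(entry.get("bring_when", [])))

entry = {
    "text": "Use deterministic salience before LLM synthesis.",
    "bring_when": [
        "discussing memory pipeline architecture",
        "revising agent-memory article",
        "debugging poor session carryover",
    ],
    "do_not_bring_when": [
        "casual writing tasks",
        "unrelated career or CV work",
    ],
}

relevant = should_bring(entry, ["debugging poor session carryover"])
excluded = should_bring(entry, ["casual writing tasks"])
```

Letting exclusions win is a design choice in this sketch: when a task matches both lists, leaving the entry out is the cheaper mistake to recover from.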

5. Brief budget. A retrieval brief that's allowed to grow indefinitely is a brief that loses to the very confident-summarizer failure mode it was meant to prevent. Hard caps — e.g. 12 items total, with sub-caps per class — keep the next session walking in fast.
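
A sketch of budget enforcement under assumptions: entries carry a numeric `carry_value` and a `carry_type`, the 12-item total comes from the example above, and the per-class cap values are invented for illustration.

```python
def apply_budget(entries, total_cap=12, class_caps=None):
    """Keep the highest carry_value entries within total and per-class caps."""
    class_caps = class_caps or {}
    ranked = sorted(entries, key=lambda e: e["carry_value"], reverse=True)
    kept, per_class = [], {}
    for e in ranked:
        if len(kept) >= total_cap:
            break
        cls = e["carry_type"]
        # A class with no explicit cap is limited only by the total.
        if per_class.get(cls, 0) >= class_caps.get(cls, total_cap):
            continue
        per_class[cls] = per_class.get(cls, 0) + 1
        kept.append(e)
    return kept

entries = (
    [{"carry_type": "open_loop", "carry_value": v} for v in range(20)]
    + [{"carry_type": "active_decision", "carry_value": v} for v in range(5)]
)
brief = apply_budget(entries, total_cap=12, class_caps={"open_loop": 3})
```

With these inputs the brief holds three open loops and all five decisions, so one noisy class cannot crowd out the others.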

6. Supersession as a first-class field. status: active | superseded | archived, plus supersedes: and superseded_by: pointers. So the next session doesn't carry an old truth that has since been replaced — and the path between truths stays inspectable, not silently overwritten.
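
A sketch of resolving an entry to the one currently in force by following `superseded_by` pointers. The field names come from the item above; the `id` field, the dict-of-entries layout, and the `resolve_current` helper are assumptions.

```python
def resolve_current(entries, entry_id):
    """Follow superseded_by pointers until reaching the entry in force."""
    seen = set()
    current = entries[entry_id]
    while current.get("superseded_by"):
        if current["id"] in seen:
            raise ValueError(f"supersession cycle at {current['id']}")
        seen.add(current["id"])
        current = entries[current["superseded_by"]]
    return current

entries = {
    "d1": {"id": "d1", "status": "superseded", "superseded_by": "d2",
           "text": "Rank highlights by salience only"},
    "d2": {"id": "d2", "status": "active", "supersedes": "d1",
           "text": "Rank highlights by carry value"},
}
in_force = resolve_current(entries, "d1")
```

The old entry stays in the store with its pointer intact, so the path from the old truth to the new one remains inspectable.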

7. Recovery-cost as the key metric. All of this is means to one end: minimize how many tokens, minutes, or corrections it takes for the next session to catch up to where the previous one left off. If the pipeline works, recovery cost drops over time. If it doesn't, the brief is theater. Adjacent metrics that fall out of the same instrumentation: re-litigation rate, false-carry rate (irrelevant context brought), missed-carry rate (needed context not brought), edit rate (how often the operator manually fixes the brief).
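
A sketch of how two of the adjacent metrics could be computed. It assumes that after each session the operator labels which entries the session actually needed, which is an instrumentation step this post does not describe; the `carry_metrics` helper and the entry ids are hypothetical.

```python
def carry_metrics(carried, needed):
    """Compare what the brief brought against what the session needed."""
    carried, needed = set(carried), set(needed)
    false_carry = len(carried - needed) / len(carried) if carried else 0.0
    missed_carry = len(needed - carried) / len(needed) if needed else 0.0
    return {
        "false_carry_rate": false_carry,    # share of the brief that was irrelevant
        "missed_carry_rate": missed_carry,  # share of needed context left behind
    }

metrics = carry_metrics(
    carried=["d1", "d2", "c1", "g1"],
    needed=["d1", "d2", "o1"],
)
```

Here two of the four carried entries went unused and one of the three needed entries was missing. Tracked per session, these two rates show whether a change to the selection policy is helping or only reshuffling the brief.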

That's the difference between a pretty INDEX.md and a real context-distillation layer.

Two principles I'd defend today, and a third I'm adding

1. Mechanical salience first, LLM synthesis second. If a thesis can be wrong, the highlights it sits on need to be inspectable, weighted, and traceable back to raw transcript.
2. Retrieval-time briefs without a model in the retrieval-time loop. Open a file, read what the next session is about to walk in with. If a model is being asked to invent the brief at retrieval time, you have a confident voice with no ground under it.
3. Score for the next task, not the last drama. Salience surfaces candidates. Carry value decides which ones move forward. Conflating them is the most common quiet failure of selection-time policy.

Why this matters more than it looks

The Anthropic Economic Research paper on agentic coding (June 2026) measured something interesting: across roughly 400k Claude Code sessions and 235k operators, success varies sharply with domain expertise, and humans concentrate on planning while the agent runs execution.

My read of the operational implication, which is not a direct claim from the paper, is that the gap between operators (the one that compounds over months) probably does not come from storage alone. It comes from what gets selected, preserved, and re-entered into work at the right moment. Storage decides what's available. Carry-value policy decides what's actually in front of you when it counts.

Which makes carry-value policy, not storage schema, the place where operator skill quietly accrues.

What this is, and what it isn't

This isn't a framework, and it isn't novel architecture — a lot of serious agent-state work seems to converge on something close. It's an operational note plus a roadmap: storage is the easy half, what gets carried is where the design discipline actually lives, and the carry surface is more interesting than the storage one because the failure modes are quieter and more expensive.

If you've shipped a version of this — particularly on the selection side, not the storage side — I want to see your INDEX.md. Especially the entries you cut, the ones you brought back after cutting, and the ones you cut a second time.

Credits & references

The framing "memory should be a product state, not a prompt trick" came from a recent dev.to thread by Yana Li ("AI memory should be a product state, not a prompt trick"). The closing question there, about "why this memory, why now, why not another", is what pushed me to write this.
The decision-store parallel — continuous derived signal feeding into a small set of named tiers — is shared work from private operator conversations with a peer running it at the lock layer, not the session layer.
"Dumb on purpose" is my shorthand for deterministic gates that a model cannot quietly rewrite.
Anthropic Economic Research, Agentic coding and persistent returns to expertise (Hitzig et al., June 2026).

Top comments (2)

Max Quimby • Jun 28

The salience-vs-carry-value split is the cleanest articulation of this I've read. The naming-debate example is painfully familiar — correction-count is such a tempting salience proxy precisely because it's cheap to compute, and it rewards the loudest thing instead of the load-bearing one. The piece I'd push on: carry value isn't really a property of the session you're writing, it's a property of the session that hasn't happened yet. The quiet architectural decision is high-carry for some next sessions and irrelevant for others. We've had more luck treating selection as a read-time, query-conditioned operation than as a write-time summarization — i.e., don't freeze "top-N highlights" at session end, keep the raw decisions and select against what the next session is actually trying to do. The tradeoff is you pay retrieval cost every time instead of once. How are you scoring carry value at write time without knowing what the next session walks in needing?

Mike Czerwinski • Jun 28

Honestly the short answer is I don't, fully. The write-time pass only labels things along axes that are roughly stable across next-sessions: invariants, decisions with explicit reasoning, constraints with named scope. That is a low-bar filter, not a score. The actual selection has to be read-time, query-conditioned, the way you described it.

The split I've found useful is: write-time decides what is allowed to be carried (filter), read-time decides what is carried (rank). The filter is cheap and survives the lack of next-session context because it is only rejecting throwaway state. The ranker pays the retrieval cost you mentioned, but only against a much smaller pool. The tradeoff is not pay-once vs pay-every-time, it is pay-twice but pay-the-second-time over something that already is not garbage.

The case I have not solved is yours exactly: a decision that is load-bearing for one future session and noise for ten others. The honest answer is the filter lets it through and the ranker either picks it up or does not, and sometimes it does not, and we lose.