Ivan BUSH

Posted on May 23 • Originally published at github.com

What I learned building memory for Claude Code — measured against the popular alternative

#ai #claude #llm #opensource

The problem nobody talks about

Every Claude Code session eventually hits /compact. When it does, Claude sends your entire conversation to an LLM summariser and replaces the context window with the output. The summariser is one-pass and lossy by design. Five minutes later you're re-explaining the hostname, the flag you had to restate twice already, the edge case you described carefully three turns ago.

The production fix the community shipped is hook-based capture: intercept every tool call, ship the observation to a worker, store it in a vector index, inject it back at the next session start. claude-mem (77k GitHub stars) is the canonical example. It works. It genuinely works. Hook on PostToolUse, hook on PreToolUse, hook on UserPromptSubmit, hook on Stop — five always-on processes, ~80–150 MB of resident RAM, one Claude API call per session end. A 50-tool-call session spawns roughly 50–100 short-lived Node child processes just for the hook events, before any actual work happens.

I had a different question. Can the same problem be solved as a substrate — a parsed, scored, queryable artifact on disk — instead of as a stream?

~/.claude/projects/ already contains every correction you pushed back on, every number you had to restate, every constraint the model lost track of. Those files exist right now on your machine. The question is whether reading them structurally — pairing each correction with its context, scoring each pair on seven signals, keeping the result as plain numpy files you can inspect — beats discarding them for a one-pass summary.

That question is what weighted-compact is an answer to. Here is what I found when I measured it.

What I built

One paragraph: weighted-compact reads your ~/.claude/projects/ session files, extracts (premise, correction) pairs wherever you corrected the model, scores each pair on seven independent signals, and exposes the result via a 47 ms compaction function and a local stdio MCP server.

~/.claude/projects/                    7 signals (6 automatic)
┌──────────────────┐                  ┌────────────────────┐
│ session_1.jsonl  │  ──bootstrap──▶  │ misstep            │
│ session_2.jsonl  │                  │ density            │
│     ...          │                  │ label (optional)   │  ─importance.npz─┐
│ session_N.jsonl  │                  │ span_keep / maybe  │                  │
└──────────────────┘                  │ span_skip / think  │                  ▼
                                      │ topic_decay        │       ┌──────────────────┐
                                      │ cosine             │  ───▶ │  compacted       │
                                      └────────────────────┘       │  markdown        │
                                               ▲                   │  + budget meta   │
                                           REM-decay               └──────────────────┘
                                       (nightly, wall-clock)                ▲
                                                                            │
                                                                       MCP / CLI
                                                                       (you query
                                                                        with intent)

The six automatic signals are: a per-user misstep predictor (logistic regression trained on your own stumble events, AUC 0.665 on the maintainer's corpus), sixteen density features, four span coverage tiers, topic position decay, recency, and cosine neighbourhood. The seventh — a sparse human label — is opt-in. The published ablation puts its 95% paired CI at [−0.004, +0.109], crossing zero on the lower bound: the six automatic signals carry the substrate by themselves.

A nightly REM pass lays a wall-clock half-life multiplier on top, independent of the signal mixture: yesterday × 0.91, one week ago × 0.50, one month ago × 0.05.
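The three multipliers quoted above are what plain exponential decay with a seven-day half-life produces. A minimal sketch, assuming that functional form; the name `rem_multiplier` and the `half_life_days` parameter are illustrative and not taken from the project's code:

```python
def rem_multiplier(age_days: float, half_life_days: float = 7.0) -> float:
    """Wall-clock decay multiplier that halves every `half_life_days`."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    return 0.5 ** (age_days / half_life_days)


for label, age in [("yesterday", 1), ("one week ago", 7), ("one month ago", 30)]:
    # Prints x0.91, x0.50 and x0.05 for the three ages.
    print(f"{label}: x{rem_multiplier(age):.2f}")
```

Because the multiplier depends only on wall-clock age, it can be applied after the signal mixture is composed without retraining anything.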

Install and first run:

pipx install 'weighted-compact[mcp]'
weighted-compact bootstrap    # parse ~/.claude/projects/ → pairs.jsonl
weighted-compact importance   # compose the 7-signal score
weighted-compact mcp-serve    # stdio MCP for Claude Desktop / IDE clients

Bootstrap on the maintainer's corpus (376 sessions, 750 MB of raw JSONL) took 6.07 seconds. The substrate on disk for 613 pairs is 24 MB total. RAM at runtime: 33 MB for the loaded substrate; 0 MB at idle because nothing runs between queries.

How I tested it — and what I found that I didn't like

The measurement loop is called reconstruction fidelity. Hide one correction pair from the context. Build the compacted context from the remaining pairs using a given ranker. Ask a question whose answer lived in the hidden pair. Score the answer with a judge from a different model family than the generator. The cross-family constraint is a methodology contract, not an optional robustness step: Gemma judges Qwen reconstructions; no same-family judging.
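The loop described above can be sketched as a leave-one-out harness. This is an illustration under stated assumptions, not the project's implementation: the function name, the pair fields (`text`, `question`, `answer`) and the three callables are hypothetical, and the toy generator and judge stand in for the LLM calls so the sketch runs offline.

```python
from typing import Callable, Sequence

Pair = dict  # hypothetical shape: pair_idx, text, question, answer


def reconstruction_fidelity(
    pairs: Sequence[Pair],
    rank: Callable[[list], list],
    generate: Callable[[str, str], str],
    judge: Callable[[str, str, str], bool],
    k_drop: float = 0.5,
) -> float:
    """Fraction of hidden pairs whose answer is recovered from the rest."""
    if not pairs:
        raise ValueError("need at least one pair")
    hits = 0
    for hidden in pairs:
        # 1. Hide one pair from the context.
        rest = [p for p in pairs if p["pair_idx"] != hidden["pair_idx"]]
        # 2. Keep the top (1 - k_drop) share of the rest, in ranker order.
        keep = max(1, int(len(rest) * (1 - k_drop)))
        context = "\n".join(p["text"] for p in rank(rest)[:keep])
        # 3. Ask the question whose answer lived in the hidden pair.
        prediction = generate(context, hidden["question"])
        # 4. Score it (a different model family judges in the real protocol).
        hits += bool(judge(hidden["question"], hidden["answer"], prediction))
    return hits / len(pairs)


# Toy data and stand-ins, invented purely for illustration.
toy_pairs = [
    {"pair_idx": 0, "text": "host is db01", "question": "host?", "answer": "db01"},
    {"pair_idx": 1, "text": "reminder: host is db01", "question": "host?", "answer": "db01"},
    {"pair_idx": 2, "text": "flag is --fast", "question": "flag?", "answer": "--fast"},
    {"pair_idx": 3, "text": "timeout is 30s", "question": "timeout?", "answer": "30s"},
]


def recency(ps: list) -> list:
    return list(reversed(ps))  # most recent pair first


def echo_generator(context: str, question: str) -> str:
    return context  # a real generator would call an LLM


def substring_judge(question: str, answer: str, prediction: str) -> bool:
    return answer in prediction  # a real judge would call a second model


score = reconstruction_fidelity(
    toy_pairs, recency, echo_generator, substring_judge, k_drop=0.0
)
print(score)  # 0.5: only the fact stated in two pairs survives hiding one
```

The toy result shows why the floor is low: a hidden pair is recoverable only when its content is repeated elsewhere in the session.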

Here are the numbers. N=30 pairs, seed=42, gemma3:4b judge, k_drop=0.5 (keep half the session), from the partial run published in docs/bench-vs-claude-mem.md:

Method	judge_yes / n	mean_context_chars
weighted-compact-importance	3/30 = 10.0%	7,296
weighted-compact-recency	4/30 = 13.3%	5,874
compact-qwen analog (one-pass LLM summary)	1/30 = 3.3%	1,196

The headline finding I did not want: recency beats my mixture. A single-line heuristic — rank pairs by their position within the session, most recent wins — outperforms the seven-signal mixture at this sample size. The mixture's 95% paired CI crosses zero on the lower bound at the larger N=62 run reported in the README. Both recency and the mixture beat the summary by a real margin (10+ pp). Within structured selection, the mixture's edge over cheap baselines is not yet measurable.

What held up: any structured selection beats one-pass LLM summarisation by a substantial margin. The summary method kept 1 of 30 answers. Recency kept 4 of 30. The architecture bet — parse into pairs and query, rather than discard the structure — survives. What doesn't hold yet: the bet that a seven-signal mixture specifically outperforms a single-heuristic baseline.

The comparison against the qwen-analog is honest about its scope. The 8-pp gap cited in the README is against a qwen2.5:7b-driven local summariser, not against Claude Code's actual /compact — Anthropic's prompt is closed and not in the harness. Replacing this with captured real-/compact traces is the single change that would harden the headline most. It's filed for v0.3.

The Sonnet cross-judge: why one judge isn't enough

After the gemma3:4b run, I re-judged the same 90 predictions (3 methods × 30 pairs) using Sonnet 4.6 via OpenRouter. Sub-dollar credit cost — cheap enough to do before every published result.

Results under Sonnet:

Method	gemma judge_yes / 30	sonnet judge_yes / 30	per-method κ
weighted-compact-importance	3/30 = 10.0%	3/30 = 10.0%	0.630
weighted-compact-recency	4/30 = 13.3%	4/30 = 13.3%	0.712
compact-qwen analog	0/30 = 0.0%	2/30 = 6.7%	0.000

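Both judges gave the same totals for importance and recency. The per-method κ values below 1 (0.630 and 0.712) show they did not always mark the same pairs as correct, so the match is in the counts rather than item by item. They disagreed on the qwen summary: gemma said 0/30, Sonnet said 2/30. Cohen's κ across all 90 predictions: 0.549 (moderate agreement per Landis-Koch). The confusion matrix is balanced: Sonnet is not systematically more generous than gemma, and the disagreements go in both directions.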

The compact-qwen per-method κ of 0.000 is a small-sample artifact: when gemma scores all-no on a method, chance-expected agreement equals observed agreement, so the numerator of κ is zero no matter how many items the judges agree on (28 of 30 here). Read the pooled κ=0.549 as the calibration number, not the per-method row.
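The degenerate case is easy to reproduce. The sketch below computes Cohen's κ for two binary raters using integer counts, then feeds it the compact-qwen row from the table (gemma 0/30, Sonnet 2/30). Which two items Sonnet marked yes is not given in the article; the positions here are placeholders, and they do not affect the result because gemma's labels are constant.

```python
def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa for two binary raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(a)
    agree = sum(x == y for x, y in zip(a, b))
    a_yes, b_yes = sum(a), sum(b)
    # n^2 times the chance-expected agreement, kept as an integer.
    chance = a_yes * b_yes + (n - a_yes) * (n - b_yes)
    if chance == n * n:
        raise ValueError("kappa undefined: both raters constant and identical")
    # (p_observed - p_expected) / (1 - p_expected), scaled by n^2.
    return (agree * n - chance) / (n * n - chance)


gemma = [False] * 30                    # 0/30 yes
sonnet = [True, True] + [False] * 28    # 2/30 yes, positions are placeholders
print(cohens_kappa(gemma, sonnet))      # 0.0 despite 28/30 raw agreement
```

With one rater constant, the expected agreement is 28/30 and so is the observed agreement, which is why the per-method row carries no information.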

The takeaway is methodological. A single judge can carry systematic bias without either you or the judge noticing. Running a cross-family check on the same predictions — not re-generating the answers, only re-judging — costs $0.34 and surfaces where the local judge diverges. The two families agreeing on the structured-selection rows and diverging on the summary row is exactly the behavior you'd want to see: it means the summary's underperformance isn't a gemma-vs-qwen style artifact.

The cheap judge (gemma3:4b, local, free to run) is usable for continuous monitoring. It is not a substitute for definitive scoring. The protocol: cheap judge for development iteration, cross-family cloud judge before publishing a result.

What I chose not to build

No auto-injection. The substrate publishes; the client polls. Nothing pushes context into your prompt automatically. This is a deliberate break from the hook-driven pattern — the reader calls compact_session via MCP or pastes the output manually.

No daemon. The nightly REM pass fires at 04:00, takes 31 ms, and exits. At any moment the full set of weighted-compact processes is exactly the set of things you consciously launched. A daemon you forgot is running is a daemon you can't audit — and a tool that reads your conversation history specifically should not be in that category.

No cloud anything. CI enforces zero outbound network calls in the default path. There's a scripts/leak-scan.sh that runs on every commit, scanning for substrate filename patterns and hardcoded personal home paths. The maintainer's substrate never touches GitHub; the public remote is an orphan-cut branch carrying only framework code.

The reason these were the right choices: the substrate is an artifact, not a service. misstep (stumble predictor), session-narrative (cross-session recall), and FKMF (knowledge-gap detection) are three other projects that currently read the same per-pair files. A substrate that three independent readers can compose against needs to be local, stable, and auditable. A daemon would make it into a service — and a service is something you'd have to fork, not extend.

The plugin model

RankerRegistry and the Signal Protocol shipped this week. The eight built-in rankers (importance, density, random, recency, cosine, bm25, compact_qwen, compact_sonnet) register through the same @register decorator available to external packages. Nothing about them is privileged.

Adding a custom ranker — one that scores each pair by len(correction_text) / len(premise_text) as a worked example — looks like this:

from weighted_compact.ranker import register

# The name given here is what --ranker selects on the CLI.
@register(
    name="length",
    description="Pair score = len(correction_text) / len(premise_text).",
    requires_extras=(),
    query_aware=False,
    since_version="0.1.0",
)
def load_length_ranker():
    # Imported inside the loader, so the import runs only when the ranker is loaded.
    from weighted_compact.recon_qa.context import load_pairs
    pairs = load_pairs()
    scores: dict[int, float] = {}
    for p in pairs:
        # Missing or None text fields fall back to the empty string.
        premise = p.get("premise_text", "") or ""
        correction = p.get("correction_text", "") or ""
        # max(1, ...) avoids division by zero on an empty premise.
        scores[p["pair_idx"]] = len(correction) / max(1, len(premise))
    return scores

That's the full ranker. weighted-compact qa-gate --ranker length --signal judge picks it up from the registry at runtime. It runs against the same reconstruction-fidelity harness as all the built-ins. The gate returns a Δfidelity number that tells you whether your ranker helped, hurt, or tied.

If you want your signal inside the mixture rather than as a standalone ranker, implement the Signal Protocol instead: a name attribute and a compute(pair_indices) -> np.ndarray method. The Protocol is documented in docs/extension-recipe.md with a full worked example.
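As a rough picture of that shape, here is a sketch that defines a local stand-in Protocol with the two members named above and one class that satisfies it. Everything beyond `name` and `compute(pair_indices) -> np.ndarray` is an assumption: the real Protocol lives in the package, the type of `pair_indices` is guessed to be a sequence of ints, and `CorrectionLengthSignal` is a hypothetical example, not a built-in.

```python
from typing import Protocol, Sequence, runtime_checkable

import numpy as np


@runtime_checkable
class Signal(Protocol):
    """Local stand-in for the shape described in the text."""

    name: str

    def compute(self, pair_indices: Sequence[int]) -> np.ndarray: ...


class CorrectionLengthSignal:
    """Hypothetical signal: longer corrections score higher, scaled to [0, 1]."""

    name = "correction_length"

    def __init__(self, correction_lengths: dict) -> None:
        self._lengths = correction_lengths  # pair_idx -> character count

    def compute(self, pair_indices: Sequence[int]) -> np.ndarray:
        raw = np.array([self._lengths.get(i, 0) for i in pair_indices], dtype=float)
        peak = raw.max() if raw.size else 0.0
        # Divide by the largest value; an all-zero input is returned unchanged.
        return raw / peak if peak > 0 else raw


signal = CorrectionLengthSignal({0: 40, 1: 10, 2: 0})
scores = signal.compute([0, 1, 2])
print(signal.name, scores)
```

Returning one float per requested index, in the same order, keeps the signal composable with others in a weighted sum; how the mixture actually combines them is documented in docs/extension-recipe.md.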

Stability promise: docs/stability.md commits to the weighted_compact.ranker API surface, RankerSpec, Signal, and the CLI verb names through v1.0.

What's open

49 GitHub issues, all drafted by the maintainer with pre-set architecture notes and explicit scope. 25 were closed in the launch week by direct shipping. The 24 remaining each name "what NOT to do" so contribution scope is unambiguous.

Specific calls where the architecture is clear and the implementation is not yet done:

VS Code MCP client (issue #4) — connect to mcp-serve via the VS Code MCP extension; no new server code required
JetBrains plugin (#5) — same stdio MCP surface, different client
Neovim plugin (#6) — same
AUR package (#11) — PKGBUILD for Arch Linux users; pipx install works but a native package is better
Reproduction on a second corpus (#13) — run the label-weight ablation on your own ~/.claude/projects/ and report the sign; magnitudes won't match, the direction should
Translations (#14) — the correction marker regex currently covers RU/EN/UA patterns; other languages are missing

If you want to add a ranker plugin, docs/extension-recipe.md is the complete recipe.

Honestly, where I am

Single user. 613 pairs from the maintainer's Claude Code sessions. One Linux box. The seven-signal mixture doesn't beat recency at N=30–62 under the gemma3 judge — that's the open question for v0.3, where a full coefficient grid sweep and cross-session correlation are the planned work.

The numbers in this article will not match your corpus. The methodology is the contribution; magnitudes are corpus-dependent. The 3.8% per-question fidelity floor (Sonnet 4.6, N=1718, k_drop=0) is the absolute starting position — roughly 96% of pair-specific detail is unrecoverable once a pair is hidden from context. That's the baseline before any ranker does work.

MIT licensed. Issues and Discussions enabled on the repo. The benchmark script is scripts/bench_vs_claude_mem.sh — it runs on any corpus where you have both tools installed.

If this shape is interesting to you, the repo's issues labeled community-invitation are where the next 24 contributors fit. No code required to start — a quick reproduction on your own corpus is filed there as issue #13.

→ github.com/zzallirog/weighted-compact
