DEV Community: Giulio D'Erme

Retrieval-Augmented Self-Recall — What the Comments Taught Me (RE-call v0.3)

Giulio D'Erme — Sat, 18 Jul 2026 12:07:45 +0000

A follow-up to Part 1: the self-recall thesis — the series runs through Part 6. Code: RE-call — everything below is measured and reproducible (make eval), full study in docs/ENTAILMENT_SUPERSESSION_STUDY.md.

I published a thesis post about agent memory and got five comments that were better than the post.

Two of them didn't just critique the design — they described, precisely, why it would fail and what would fix it. So I did the only reasonable thing: I turned both into experiments, ran them on the same eval harness the series is built on, and shipped what survived. That's RE-call v0.3, and this post is the receipt.

I want to be explicit about why I'm writing it this way. The point of publishing this series was never broadcast — it was error-correction. A design you keep in a drawer accumulates conviction; a design you publish accumulates objections, and objections are the cheapest high-quality signal you will ever get. The comment section of Part 1 did more for this codebase than any week of solo iteration. This post exists to pay that back with the thing commenters almost never receive: evidence that someone listened, measured, and changed the code.

Comment 1: "A similarity score is not a confidence score"

Vinicius Pereira put it in one line I've been quoting since:

Proximity is a candidate; entailment is the evidence.

His argument: the near-misses that hurt most are high-similarity and wrong — memos semantically adjacent to the query that don't answer it. A threshold-based gap_warning (Part 3, Part 5) waves them straight through by construction, because their similarity clears any threshold you could calibrate. The abstention signal cannot be the retriever's own score. You need a separate check that the retrieved memo actually entails an answer.

He was right, and measurably so. I built a held-out challenge set of 10 near-miss queries — each names a strongly on-topic memo that does not contain the asked-for fact ("how much did the cache reduce memory usage" against a memo that measures latency). Baseline, with the calibrated threshold from Part 5 doing its best:

Embedder	Near-miss FCR @ calibrated threshold
`hashing-64`	1.00
`bge-small`	0.80
`voyage-3`	0.40

The threshold that scores a perfect 0.00 on far-gap queries passes 40–100% of near-misses. There is no threshold to fix. The distractor's cosine is genuinely high — that's what makes it a near-miss.

So v0.3 adds an opt-in entailment stage: a small QNLI cross-encoder ("does this sentence answer this question?") judges the trusted hits, and a hit that doesn't entail the query is demoted to a new verdict, not_entailed. The key property is exactly the one Vinicius predicted: it emits a decision at the judge's own trained boundary, not another score — so there is no per-embedder constant left to recalibrate. And that transfer claim held: the identical judge, zero tuning, on every embedder:

Embedder	Near-miss FCR: threshold → +entailment
`hashing-64`	1.00 → 0.60
`bge-small`	0.80 → 0.50
`voyage-3`	0.40 → 0.40

Where the comment needed a refinement — which is the point of measuring

The ablation was the honest surprise. Running the judge alone, without the threshold, degrades far-gap detection (gap FCR 0.00 → 0.40 on both semantic embedders): fed nearest-noise from a topic the corpus doesn't cover, the QNLI model sometimes calls it an answer. So entailment does not replace the calibrated threshold — the two guard different failure classes and must be stacked. Threshold catches far gaps; judge catches near-misses.

And the costs are real, and published: ~0.1–1.0 s of judge time per query on CPU, one legitimately answerable query wrongly rejected on both semantic embedders (its gold memo answers by negation — "do we retry on 4xx?" → "we do not retry" — and the judge reads that as not-answering), MRR on answerable queries dips 1.000 → 0.929. The residual near-miss FCR (0.40–0.60) is the judge's own quality bound — Part 5's law, one layer up: gap detection is bounded by the embedder, and abstention-by-entailment is bounded by the judge. Ships OFF by default for exactly these reasons; you opt in with your eyes open.

Comment 2: "Supersession is a relation, not a property"

The same comment carried a second thesis, on the guard I'd already confessed was weakest (freshness):

You are trying to infer a relation between two memos at read time, when both look valid in isolation. That inference is a losing game. Bind the truth when it is created.

And Mateo Ruiz had independently named the target shape:

Retrieval should return confidence + provenance + validity, not just relevance.

That sentence is now, almost verbatim, how RE-call's trust layer describes itself. Every hit returns a verdict (ok / superseded / expired / …), a calibrated confidence, and provenance; a memo declares supersedes: old-memo.md in its frontmatter at write time, and retrieval returns the current head of the chain instead of a resolved-but-still-embedded old decision.

For v0.3 I added the experiment that closes the "why not just timestamps?" question — against the steelman, not a strawman: "among the confidently-relevant hits, trust the newest", with the stale docs re-touched after their successors, the way any living corpus re-syncs constantly. Superseded-trust rate (how often the stale memo is handed back as the answer — lower is better):

Embedder	Plain search	Recency (steelman)
`hashing-64`	1.00	0.83
`bge-small`	0.83	1.00
`voyage-3`	1.00	1.00

Look at the bge-small row: the timestamp heuristic is worse than plain relevance ranking — the tie-break actively promotes the freshly-re-synced stale memo in the one case where ranking had preferred the successor. A per-document timestamp cannot see a two-document relation, and making the timestamp "smarter" makes it more confidently wrong. The declared relation holds at 0.00 in the same runs.

Vinicius also called the residual failure mode in advance: write-time binding is only as good as the author's discipline — a forgotten link is an orphan memo that looks valid forever. But, as I replied then: impossible to infer becomes possible to enforce. So v0.3 ships recall lint — dangling supersedes: references, cycles, ambiguous successors, versioned siblings with no declared edge, closures declared only in prose. No DB, exit 1 on errors, drops into CI in one line. (It paid for itself before it shipped: writing its tests uncovered a real parser bug where a scalar [[wikilink]] was read as a YAML list, producing an edge that silently never resolved.)

The experiment I still owe

Nazar Boyko asked, before Part 5 was even published, whether the gap threshold should be relative — top hit versus the rest of the batch — rather than an absolute cutoff re-tuned per embedder. It's a good idea with a suspected hole (a spread-based check is blind to the single confident distractor — which is precisely the near-miss class above), but suspicion is not measurement. It's on the list, and the harness is now shaped to answer it.

And Tae Kim's point — a typed coverage_check slot so the "no real match" signal can't be silently dropped — sharpened a design rule that now runs through the whole stack: the retriever computes the signal, the schema carries it. Computed, it's a measurement; self-reported by the model, it's a declaration. Those fail very differently.

What I'm actually arguing for

Five people I've never met read a post about a niche RAG problem and, between them, produced: a falsifiable critique of my abstention mechanism, the correct architecture for supersession, a proposed alternative worth benchmarking, and a schema-design principle. Total cost to me: publishing something concrete enough to be wrong about, and taking the replies seriously enough to run them.

That's the whole model. Not "content", not reach — working in public as a form of peer review. The asymmetry is absurdly favorable: you contribute one design and get back the failure modes it would have taken you months to hit alone. The only price is that you have to be willing to write "I was wrong, here's the measurement" — which, in a series whose thesis is calibrated honesty, is not a price at all. It's the product.

So: thank you Vinicius, Mateo, Nazar, Tae — and Amin, whose memory-compaction angle (keep the gist graph, not every turn) is a different axis of the same problem and deserves its own experiment. v0.3 has your fingerprints on it.

If you're reading this and see the next hole — the negation-blind judge, the owed relative-threshold benchmark, a stronger entailment model, something I haven't imagined — the comment section is open and the harness is public. Evidently, it works.

Code: RE-call (MIT). The full v0.3 study with every table: docs/ENTAILMENT_SUPERSESSION_STUDY.md. Series index: Part 1.

Retrieval-Augmented Self-Recall — Part 6: The Fine-Tune That Did Nothing, and Shipping It as an MCP Server

Giulio D'Erme — Sat, 18 Jul 2026 12:07:17 +0000

Part 6 (finale) of Retrieval-Augmented Self-Recall. Code: RE-call. Part 5: the gap threshold that didn't transfer.

I fine-tuned the embedder on my own domain expecting a win. I measured it properly, on held-out queries.

The improvement was exactly zero. Δ+0.00 MRR. Δ+0.00 nDCG@10. Not "small". Not "within noise". Zero.

It's also the result I wanted, which takes some explaining. That's the first half of this post. The second half is how the whole engine ships, so an agent can actually use it.

The fine-tune that did nothing

After Part 5, the natural next question: if calibrating the threshold helps, would a better embedding help more? So I fine-tuned one on my domain.

The setup: all-MiniLM-L6-v2, OnlineContrastiveLoss on query/gold-chunk pairs, trained on the 14-document corpus. The result:

Model	Test MRR	Test nDCG@10
Base	1.00	1.00
+ Fine-tuned	1.00	1.00
Δ	+0.00	+0.00

Zero lift. And that is the correct outcome, not a failed experiment.

Here's the reasoning, because it's the whole point. The base model already scores a perfect MRR and nDCG@10 on this corpus. There is no headroom left to recover. The only ways to manufacture a "gain" from here would be dishonest ones: evaluate on the training set (and measure memorization, not retrieval), or artificially cripple the baseline so fine-tuning has something to fix. Reporting +0.00 is the honest read, and the honest read is that off-the-shelf embeddings already saturate this corpus.

But the full result is more nuanced, and more useful. On a harder, opaque-jargon corpus — one where the base model genuinely struggles to map queries to the right chunks — the same fine-tuning gave +0.24 MRR. So the real conclusion isn't "fine-tuning doesn't work." It's:

Fine-tuning helps when the base model doesn't already cover your vocabulary. When it does, you get nothing. Know which regime you're in before you spend the GPU hours.

That's the value of a null result. "+0.00" told me my corpus was already well-covered by a general-purpose embedder — which saved me from a fine-tuning pipeline I didn't need, and told me exactly when I would need one. Teams that reflexively bury negative results throw away findings like that and re-learn them the expensive way.

Shipping it: the MCP server

An engine nobody can plug in is a paper. RE-call ships as recall_mcp, an MCP (Model Context Protocol) server over stdio, so Claude — Desktop, Code, or any MCP client — can query its own memory directly as a tool.

That closes the loop with the applied series. There are three layers:

The human-editable memory — the plain markdown files you curate by hand (the two-file memory system from the Claude Code series).
The retrieval engine — RE-call: hybrid search on Postgres, plus the honesty guards.
The MCP server — how the agent reaches layer 2 at runtime, as a first-class tool.

And the design principle carries straight over from the applied series: the honesty signals ride inside the tool's structured output. When the agent queries memory, the response isn't just a ranked list — every hit carries a trust verdict (ok / superseded / expired / …), a calibrated confidence, provenance, and validity, and the result carries gap_warning, freshness, and an explicit abstained + reason. The agent physically cannot get the answer without also getting "here's how much to trust it." Honesty isn't an advisory the model may ignore; it's baked into the shape of the response.

Where the series lands

Six parts ago I claimed self-recall is a different RAG problem — one about calibrated abstention, not ranking. Everything since served that one idea:

Architecture (Part 2) — hybrid dense + sparse retrieval on nothing but Postgres, because agent memory doesn't need a dedicated vector DB.
Guards (Part 3) — gap_warning, freshness, and anti-re-litigation, the three things a memory does that a search index doesn't.
Evaluation (Part 4) — a false-confident rate measured alongside MRR, because the failure that matters is the one ranking metrics can't see.
Findings (Parts 4–6) — hybrid + rerank earns its cost only on weak embedders; a hard-coded abstention threshold is a silent landmine; and fine-tuning is regime-dependent, worth exactly nothing on a corpus your base model already covers.

Two of those findings are negative results. That's deliberate. In a domain about knowing your own limits, the honest nulls are the most valuable thing on the table — and everything here is public, reproducible, and covered by a 150-test suite whose 49 DB-touching tests run against real Postgres, so you don't have to take my word for any of it.

Read it, run it, break it

The whole engine is open source: RE-call (MIT). Clone it, point it at your own corpus, and check the calibration on your embedder before you trust any threshold — including mine.

And if you came here from the applied track, Claude Code, Beyond the Prompt is where all of this gets used in anger: the memory an agent reads at the start of every session, backed by the engine you just read the internals of.

Thanks for reading the whole way down. If you build on it — or find where I'm wrong — I want to hear about it.

And that's not a rhetorical close: the comments on Part 1 already found where I was wrong, twice, and the fixes shipped as v0.3 — an entailment stage for the near-miss a threshold can't see, and write-time supersession that beats any timestamp. The receipts, with the commenters' names on them, are in the series follow-up.

The finale of Retrieval-Augmented Self-Recall. Code: RE-call. Building agent memory, or hiring people who do? This series is the long-form version of my answer.

Retrieval-Augmented Self-Recall — Part 5: The Gap Threshold That Didn't Transfer

Giulio D'Erme — Sat, 18 Jul 2026 12:06:53 +0000

Part 5 of Retrieval-Augmented Self-Recall. Code: RE-call. Part 4: the eval harness.

I shipped the gap_warning guard from Part 3 with a sensible-looking default: if the best cosine similarity is below 0.50, call it a probable gap and abstain. I tested it. It worked.

Then I swapped the embedding model. Every question that should have been refused came back as a confident answer.

No error. No crash. No failed test. The guard was still there, still running, still reporting that everything was fine. It had just quietly stopped being a guard.

Here's the mechanism, because it generalises well past my project, and there's a decent chance it's live in your RAG system right now.

Cosine similarity is not calibrated across models

The gap guard fires when best_cosine < threshold. I used 0.50. The problem is that 0.50 means completely different things to different embedders, because each model lays out its vector space with its own geometry. Same number, different meaning.

The harness measured the cosine distributions for answerable vs. unanswerable queries, per embedder. Look at what "similarity" actually ranges over:

Embedder	Answerable cosine	Unanswerable cosine	Separable?	FCR @ 0.50	FCR @ calibrated
`hashing-64`	0.30 – 0.68	0.35 – 0.53	no — overlap	0.20*	—
`bge-small`	0.70 – 0.90	0.51 – 0.64	yes, at ~0.70	1.00	0.00
`voyage-3`	0.53 – 0.70	0.09 – 0.32	yes, at ~0.50	0.00	0.00

* misleadingly low: with overlapping distributions the 0.50 cut also wrongly flags answerable queries, and the error-minimizing threshold simply stops firing at all. No threshold works here.

Read across the rows and the whole story is there.

voyage-3: unanswerable queries score 0.09–0.32, answerable score 0.53–0.70. A threshold of 0.50 lands cleanly in the gap between them. FCR is 0.00. My default worked — by accident. Voyage's geometry just happens to put the boundary near 0.50.

bge-small: unanswerable queries score 0.51–0.64 — entirely above my 0.50 threshold. So the guard, which only fires below 0.50, never fires on them at all. Result: FCR 1.00. Every unanswerable query was confidently answered. The guard was switched off, and nothing told me. Recalibrate the threshold to ~0.70 and FCR drops to 0.00 — the distributions are separable, I was just cutting in the wrong place.

hashing-64: answerable (0.30–0.68) and unanswerable (0.35–0.53) overlap. No threshold separates them, because the embedder is too weak to distinguish "relevant" from "vaguely near." The right lesson here isn't "pick a better threshold" — it's "this embedder can't support abstention at all."

The lesson: never ship a hard-coded abstention threshold

A magic constant that "works" is the most dangerous kind of code, because it works right up until the context shifts — and then it fails silently, which is the worst possible failure mode for a safety guard. My 0.50 wasn't a good threshold that I'd validated. It was a coincidence that held for one embedder and collapsed the moment I changed one.

The fix is cheap and non-negotiable: calibrate the abstention threshold per embedding model, against a small labeled set. Twenty-odd queries — a handful answerable, a handful not — is enough to see where the two distributions actually sit and cut between them. Do NOT ship a constant. The writeup says it in one line: calibrate per embedding model against a small labeled set; do not ship a hard-coded constant.

Why this generalizes past RAG

Any decision that thresholds a similarity score inherits this exact trap:

semantic caching ("is this query close enough to a cached one?")
near-duplicate / dedup detection
"is this document relevant enough to include?"
clustering cutoffs, entity-matching thresholds

In every one of these, the threshold is a property of the (embedder, corpus) pair, not a universal constant. Swap the model and your carefully-chosen number is now cutting in the wrong place — and unless you're measuring the failure explicitly, you won't know.

Which is the meta-point, and the reason Part 4 mattered: this failure is invisible without the eval harness. The system ranks well, returns plausible results, and lies on gaps. You only catch it by measuring a false-confident rate per embedder. Ranking metrics would have shown me green the entire time.

Two footnotes from after this was drafted. First: when I trailed this finding in Part 1, a commenter guessed the mechanism before the post existed — and proposed a relative threshold (top hit versus the rest of the batch) instead of an absolute one. That experiment is still owed; my worry is that a spread-based check is blind to the single confident distractor. Second, and worse: there is a whole failure class no threshold can catch, by construction — the near-miss that scores high. That one needed a different kind of fix, and it's the subject of the follow-up post.

If a better threshold helps, would a better embedding help more? The intuitive next move is to fine-tune the embedder on my domain. I did. The result was zero — and Part 6 is about why that was exactly the right outcome, and the one case where fine-tuning actually did move the needle.

Part 5 of Retrieval-Augmented Self-Recall. Code: RE-call. The "measure the failure that matters, not the one that flatters" discipline runs through Claude Code, Beyond the Prompt too.

Retrieval-Augmented Self-Recall — Part 4: Benchmarking Retrieval and Honesty

Giulio D'Erme — Sat, 18 Jul 2026 12:06:26 +0000

Part 4 of Retrieval-Augmented Self-Recall. Code: RE-call. Part 3: the honesty guards.

A standard RAG benchmark would have handed my retriever a perfect score on a day it was confidently answering questions it had no data for.

That's not a bug in my system. It's a gap in what those benchmarks ask. Every one of them asks the same thing: when there was an answer, did you rank it first? None of them ask the question that decides whether agent memory is safe: when there was no answer, did you say so?

So RE-call ships its own harness. Here's how it works, and the first finding it produced.

The test set: the unanswerable queries are the point

The evaluation runs on 14 answerable queries + 5 unanswerable queries over a synthetic corpus.

Those 5 unanswerable queries are the whole reason the harness exists. They're questions the corpus genuinely cannot answer, where the correct behavior is to abstain — to fire gap_warning, not to confidently return the nearest memo. Standard retrieval benchmarks are built entirely from answerable queries; they have no way to score "did it correctly say nothing?" This harness is built around that case.

Two families of metrics

Because there are two jobs — rank well when there's an answer, abstain when there isn't — there are two families of metrics:

Ranking quality (for the answerable queries):

precision@k, recall@k
MRR (mean reciprocal rank)
nDCG@10

Guard quality (for the unanswerable queries):

False-confident rate (FCR) — the fraction of unanswerable queries that the guard failed to flag. High FCR means the system confidently answered questions it should have abstained on. This is the honesty metric, and it's the one almost nobody reports.

Why you need both is the crux: a system can post excellent MRR and terrible FCR. It ranks beautifully whenever an answer exists, and lies confidently whenever one doesn't. If you only look at ranking metrics — as most RAG evals do — that failure is completely invisible. FCR is what drags it into the light.

The ablation: every embedder × every fusion stage

The harness runs the full matrix: each embedder (HashingEmbedder, bge-small, voyage-3) crossed with each fusion configuration (dense only → hybrid → hybrid + rerank). That's what lets you answer "which component actually earns its cost?" instead of cargo-culting a reranker into every pipeline.

And it runs against the real thing: 49 integration tests on a live pgvector container (of a 150-test suite), in CI, no mock database. The benchmark exercises the actual retrieval path, not a stand-in.

Finding 1: hybrid + rerank helps most exactly where you'd expect — and nowhere else

Here's the ablation on the weak (hashing) embedder — quality climbs monotonically as you add stages:

Configuration	MRR	nDCG@10
Dense only	0.63	0.72
+ sparse (hybrid)	0.74	0.80
+ cross-encoder rerank	1.00	1.00

Now the same pipeline on the strong bge-small embedder: dense retrieval already achieves a perfect nDCG@10. Hybrid fusion and reranking add nothing — there's no headroom left to recover.

The conclusion, stated plainly in the writeup: hybrid + rerank buys the most on weaker embedders or harder corpora; on an easy corpus with a strong embedder it's redundant.

That is a genuinely useful engineering result, because the reflex in RAG is to stack a reranker onto everything. This says: don't pay for stages your embedder has already made unnecessary. Measure first. A cross-encoder rerank on every query is real latency and real cost — and on a strong embedder over a well-covered corpus, you're buying zero.

The rigor is the point

None of these numbers come from an in-memory toy. They come from the same harness that runs in CI against real Postgres — with a dependency audit — every commit. The reason to trust the findings is that the measurement is reproducible. That's the whole pitch of this track: measure honestly, including the parts that make your work look less impressive.

Which is a good segue, because the same ablation surfaced something that made a chunk of my gap_warning design look worthless on certain embedders. I did not see it coming.

Part 5 is the finding I keep leading with: I shipped a sensible-looking abstention threshold, switched embedders, and watched every query that should have been refused sail through as a confident answer. Why a hard-coded similarity threshold is a landmine — and what to do instead.

(The harness itself has kept growing since this was drafted — it now also scores declared-supersession versus timestamps, and a held-out "near-miss" challenge set that no threshold can catch by construction. Both came out of reader comments, and both are covered in the follow-up post.)

Part 4 of Retrieval-Augmented Self-Recall. Code: RE-call. The eval-first discipline here is the same one behind Claude Code, Beyond the Prompt.

Retrieval-Augmented Self-Recall — Part 3: Teaching RAG to Say \"I Don't Know\

Giulio D'Erme — Sat, 18 Jul 2026 12:06:01 +0000

Part 3 of Retrieval-Augmented Self-Recall. Code: RE-call. Part 2: hybrid retrieval on Postgres.

Ask your agent "have we tried this filter on this market before?" when the honest answer is never. A ranking retriever hands back the three closest memos anyway — something about a different filter, on a different market — and the agent, looking at three confident results, concludes: yes, we've looked at this.

It just made a decision on a hallucination. Nothing in the stack noticed. No error was raised, because from the retriever's point of view nothing went wrong: you asked for the nearest neighbours and it gave you the nearest neighbours.

Everything in Part 2 made retrieval good. Good ranking makes this failure worse, not better — it returns confident noise faster. This post is about making retrieval honest, which for agent memory is the part that actually decides whether you can trust it.

So RE-call wraps retrieval in honesty guards — this post covers the original three, each answering a question ranking metrics never ask. (The current repo has grown that table to six, and the growth story is its own post: two of the new guards exist because readers of Part 1 pointed at exactly the weaknesses you're about to see me describe. I'll flag those spots as we go.)

Guard 1: `gap_warning` — "is the best match good enough to trust?"

After retrieval, look at the best dense cosine similarity. If it falls below a calibrated threshold, the top result isn't the answer — it's the least-bad noise. The system sets gap_warning = true.

The important design idea: this is a second-order signal. Retrieval still returns its ranked list; the guard annotates how much to trust it. That annotation is what lets the calling agent do something other than blindly act — it can abstain, ask a human to confirm, widen the search, or explicitly note "no prior memory on this" before proceeding.

That single flag is the difference between an agent that says "we've looked at this before" and one that says "I have nothing relevant on this — treat it as new." In a system that makes decisions, that distinction is worth more than any ranking improvement.

There's one buried landmine here: what threshold? The obvious move is to pick something like 0.50 and move on. That obvious move is quietly, dangerously wrong — and it's the biggest finding in this series, so I'm giving it its own post (Part 5). For now, the load-bearing word is calibrated: the threshold is fit to data, never hard-coded.

Guard 2: freshness — "is this memory still current?"

Every memo has a timestamp. The freshness guard reports the age of retrieved content and warns when it's stale relative to the re-index cadence (my corpus re-indexes daily, so "stale" has a concrete meaning).

This one is specific to memory in a way document QA rarely deals with. A documentation corpus is mostly static — last year's page is still roughly true. Agent memory is a moving target: a decision recorded in April may have been reversed in June. Without a freshness signal, April-truth and June-truth are indistinguishable at retrieval time, and the agent will happily act on a superseded conclusion. Freshness lets it weight recency, or at least flag the risk.

Honest update, because this section aged: freshness turned out to be the weakest guard of the three, and a commenter on Part 1 put a finger on why — supersession is a relation between two memos, and no per-document timestamp can see a relation. We later measured it: even a steelmanned "trust the newest relevant hit" heuristic still hands back the stale memory 83–100% of the time, while an explicitly declared supersedes: link holds at 0.00. The fix (a trust layer that binds the relation at write time) and the measurement are in the follow-up post.

Guard 3: anti-re-litigation — "did we already settle this?"

The most agent-specific guard of the three. Before the agent proposes an idea, it queries memory for closed decisions on that topic — the "we tried X, it failed, here's why" memos — and the guard surfaces them.

The failure it prevents is subtle and expensive: an agent re-proposing a dead idea because the memo that killed it three months ago didn't happen to rank in the top results for today's phrasing. Ranking-optimized retrieval is bad at this specifically, because a settled-decision memo is often lexically distant from the fresh proposal even though it's the most decision-relevant document in the store.

The implementation leans on structure: decision-type memos (closed hypotheses, postmortems) are typed, and a targeted retrieval path prioritizes them when the agent is in "propose" mode. Memory that can't defend its own past decisions is condemned to relive them.

The unifying idea

Retrieval answers one question: what's closest? The guards answer the three that actually govern whether the agent should act:

Should you trust it? (gap_warning)
Is it still current? (freshness)
Did we already decide this? (anti-re-litigation)

That's the whole difference between a search index and a memory. A search index ranks. A memory knows its own limits.

(Since this was drafted, the guard table grew: trust verdicts with declared supersession, an opt-in entailment judge for the high-similarity-but-wrong case a threshold can never catch, and a write-time lint for the supersession graph. All three exist because readers argued with this post's ancestors — that story, with measurements, is the follow-up.)

One rule, or the guards are theater

A guard only helps if its signal reaches the decision layer. A gap_warning that gets computed and then dropped before the agent sees it is worse than useless — it's false assurance that the system is careful when it isn't. So in RE-call the honesty signals ride inside the retrieval result: you cannot get the answer without also getting "here's how much to trust it." (If you read the applied series, this is the same principle as making the tool enforce the rule instead of trusting the prompt to.)

The guards make claims: this is a gap, this is stale. Claims demand measurement — and "how well does it know when it doesn't know?" is a metric that standard RAG benchmarks don't even have. Part 4 builds the eval harness that measures it, and delivers the first finding about which pipeline components actually earn their cost.

Part 3 of Retrieval-Augmented Self-Recall. Code: RE-call. This is the layer that makes Claude Code, Beyond the Prompt's memory trustworthy, not just searchable.

Retrieval-Augmented Self-Recall — Part 2: Hybrid RAG on Nothing but Postgres

Giulio D'Erme — Sat, 18 Jul 2026 12:05:37 +0000

Part 2 of Retrieval-Augmented Self-Recall. Code: RE-call. Part 1: the self-recall problem.

Say "vector search" and the reflex is a dedicated vector database: Pinecone, Weaviate, Qdrant. I didn't install one. RE-call keeps the dense vectors, the full-text index, and the metadata you filter them by in a single Postgres.

Not as a shortcut, and not out of allergy to new infrastructure. Because for agent memory a separate vector store is the wrong shape, and it quietly costs you the one property this whole system depends on.

Here's the argument, and the retrieval pipeline it buys you.

Why not a dedicated vector DB

Agent memory has three properties that make Postgres the natural fit:

It's already relational. Memos have timestamps, source types, tags, decision status. That's structured metadata you want to filter and join on — exactly what a relational database is for. A separate vector store means keeping two systems in sync and losing transactional consistency between the vectors and the metadata.
pgvector gives you real vector search inside Postgres. Approximate-nearest-neighbor cosine search, in the same database as your rows. And Postgres already ships full-text search. So you get dense and sparse retrieval in one transactional store — no sync layer, no second system to operate.
The scale doesn't justify the complexity. My corpus is ~700 memos, about 5 MB, re-indexed daily. Even orders of magnitude larger, a read-mostly, latency-tolerant memory is nowhere near the regime where a distributed vector DB earns its operational cost. Reaching for one here is over-engineering.

One store, one source of truth, ops you already know. Now the pipeline.

The retrieval pipeline

RE-call retrieves in up to four stages. The first two run in parallel; the last two refine.

1. Dense retrieval

Embed the query, run a pgvector cosine-similarity search, take the top-k. This is semantic matching — it finds memos that mean the same thing as the query even with no shared words. Strong on concepts, weak on exact tokens.

2. Sparse retrieval

Run a Postgres full-text search (tsvector/tsquery) over the same corpus. This is lexical matching — it nails exact terms: a specific error code, a ticker, a piece of domain jargon, a proper noun. Strong on precision, blind to paraphrase.

3. Fusion with RRF

Now you have two ranked lists that disagree. You fuse them with Reciprocal Rank Fusion (k=60):

score(doc) = Σ  1 / (k + rank_in_list_i(doc))

Each document's score is the sum, across both lists, of one over its rank (plus a constant k). Documents that rank high in either list bubble up; documents high in both dominate.

The reason RRF specifically: dense cosine scores and full-text scores aren't on the same scale — you can't just add or average them without arbitrary normalization. RRF sidesteps that entirely by fusing on rank instead of raw score. k=60 is the well-established default and it's robust; you rarely need to tune it.

Why fuse at all? Because dense and sparse fail in opposite directions — one misses exact tokens, the other misses meaning. Combining them gives you the concept-matching of embeddings and the precision of keyword search. (If you read the applied series, this is the "hybrid beats either alone" lesson — here's the actual mechanism under it.)

4. Cross-encoder rerank (optional)

The fused shortlist can be reordered by a cross-encoder (ms-marco-MiniLM). Unlike the bi-encoder embeddings in stage 1 — which encode query and document separately and compare vectors — a cross-encoder encodes the query and a candidate together and scores their relevance jointly. It's meaningfully more accurate and meaningfully slower, so you only run it on the top handful of candidates, never the whole corpus.

Whether stage 4 is worth its cost turns out to depend heavily on your embedder — which is the subject of Part 4's benchmark.

One property of this stage matters more than I realized when I first drafted this: the cross-encoder reorders, but what it emits is still a score you end up thresholding somewhere downstream. That distinction — a score you tune versus a decision you can trust — comes back with force later in the series.

Pluggable embedders

RE-call treats the embedder as a swappable component, with three shipped:

HashingEmbedder — deterministic, offline, no model download. It exists so the test suite and CI can run vector retrieval with zero external dependencies. Weak, but reproducible.
FastEmbed (bge-small) — a strong local model, no API calls. Good default for privacy or air-gapped runs.
Voyage (voyage-3) — a cloud model, the strongest of the three.

Pluggability isn't just tidiness. It lets you test deterministically offline, run locally for privacy, or call the cloud for maximum quality — same pipeline, different tradeoff. And it sets up the single most important finding in this series: the embedder you choose changes how you have to calibrate everything downstream (Part 5). Hold that thought.

Is "just Postgres" actually enough?

Fair challenge, and I don't want to hand-wave it. The claim is backed by the harness, not vibes: of RE-call's 150-test suite, 49 integration tests run against a real pgvector container — no mock database — in CI. The retrieval you just read about is exercised against the real engine on every commit.

And to be honest about the boundary: when would you want a dedicated vector DB? Billions of vectors, sub-10 ms p99 under heavy concurrent QPS, distributed sharding across nodes. Agent memory is none of those. It's small, read-mostly, and perfectly happy with tens-of-milliseconds retrieval. Match the tool to the regime — and this regime is Postgres-shaped.

Retrieval now returns the closest memos. But "closest" is not "relevant" — the closest match to a question with no answer is still just noise wearing a high similarity score. Part 3 is about the guards that let the system tell the difference: how RE-call learns to say "I don't know."

Part 2 of Retrieval-Augmented Self-Recall. Code: RE-call. If you came from Claude Code, Beyond the Prompt, this is the retrieval layer under Part 5's semantic search.

Retrieval-Augmented Self-Recall: The RAG Problem Nobody Talks About

Giulio D'Erme — Fri, 17 Jul 2026 09:13:07 +0000

Part 1 of Retrieval-Augmented Self-Recall — the research track behind Claude Code, Beyond the Prompt. All code is open source: RE-call.

Almost every RAG tutorial you've read solves the same problem: you have a pile of documents, a user asks a question, and you retrieve the chunks that answer it. Rank the right passage to the top, stuff it in the prompt, done.

There's a second kind of RAG that behaves completely differently, and almost nobody writes about it: an agent retrieving from its own memory.

I ran into it building the operational brain behind an automated trading system. That agent accumulates memory as it works — over 700 typed markdown memos, about 5 MB, re-indexed daily. Decisions, dead ends, calibration notes, "we tried this and it failed" postmortems. When the agent starts a task, it queries that memory: have we tested this before? what did we decide about X? is this still true?

The moment your knowledge base is the agent's own growing memory, the RAG problem inverts — and standard retrieval quietly does the wrong thing.

Why self-recall is not document QA

In document QA, there's a load-bearing assumption you probably never think about: the answer is in the corpus. Someone asked a question because the docs can answer it. Your whole job is ranking — surface the right chunk.

In self-recall, the most important queries are exactly the ones where the answer isn't there.

The agent asks "have we tried a mean-reversion filter on this market?" If the honest answer is no, never, a ranking-optimized retriever will still cheerfully return the three most cosine-similar memos — probably something about a different filter on a different market — and the agent, seeing confident results, concludes "yes, we looked at this." It just made a decision on a hallucination.

The failure isn't bad ranking. The top results might be the genuinely closest memos. The failure is that the system had no way to say "there's nothing relevant here."

Three failure modes unique to agent memory

Once you look at memory this way, three distinct failure modes show up that document-QA RAG never has to handle:

Hallucinating over gaps. The query has no real answer in memory, but retrieval returns the nearest neighbors anyway, and their presence reads as a "yes." The system needs to abstain — to flag "this is probably a gap" instead of pretending.
Re-litigating settled decisions. The agent proposes an idea it already tried and killed three months ago, because the "we decided against this, here's why" memo didn't surface at the moment of proposing. Memory that can't defend its own past decisions is doomed to relive them.
Acting on stale memory. A memo that was true in April is retrieved and treated as current in July. Without a freshness signal, old truth and current truth are indistinguishable — and in a system that touches money, that's expensive.

None of these are ranking problems. You cannot fix them by getting a better embedding model or a fancier reranker, because a better retriever just returns more confidently wrong results faster.

The reframe: abstention, not ranking

Here's the thesis of this whole series:

Document-QA RAG optimizes ranking. Agent-memory RAG has to optimize calibrated abstention — knowing when it doesn't know.

This matters because the metrics you've been trained to care about — MRR, nDCG, precision@k — don't measure abstention at all. They score how well you ordered the results assuming an answer exists. They are silent on the case that matters most in self-recall: the query with no answer, where the correct behavior is to return nothing and say so.

That's why reaching for an off-the-shelf RAG stack and pointing it at your agent's memory feels fine in a demo and rots in production. The stack is tuned for the wrong objective. It was never asked to abstain, so it doesn't.

RE-call: a reference implementation

To work through this properly I built and open-sourced RE-call — a retrieval engine designed around abstention from the start rather than bolted on after. The one-paragraph version, which the rest of this series unpacks:

Storage & retrieval: PostgreSQL + pgvector as a single transactional store — dense vector search and sparse full-text search in the same database, fused with Reciprocal Rank Fusion, with an optional cross-encoder reranker.
Three honesty guards: a gap_warning that fires when the best match is too weak to trust, a freshness signal that flags stale memory, and an anti-re-litigation check that surfaces closed decisions before the agent re-proposes them.
Honest evaluation: a harness that measures not just ranking quality but a false-confident rate — how often the system fails to abstain when it should.

It ships as an MCP server, so an agent (Claude or otherwise) can query its own memory directly. That's the loop that closes back to my other series — this is the engine underneath "the memory file" and "semantic search."

What this series covers

Six parts, each a standalone piece of the problem:

This one — why self-recall is a different problem.
Hybrid RAG on nothing but Postgres — the architecture, and why I didn't reach for a dedicated vector database.
Teaching RAG to say "I don't know" — the three honesty guards, in detail.
Benchmarking retrieval and honesty — the eval harness, and why I measure a false-confident rate alongside MRR.
The gap threshold that didn't transfer — the finding that a single hard-coded abstention threshold is worthless across embedding models. This one surprised me.
The fine-tune that did nothing, and shipping it as an MCP server — an honest null result, then how the whole thing deploys.

A note on tone, because it's the point: this track reports what didn't work. A fine-tuning experiment that produced zero lift. An abstention threshold that fell apart the moment I changed embedders. In most domains those get buried. In this domain — calibration, honesty, knowing your limits — the negative results are the most useful thing I can hand you.

Part 2 builds the retrieval core: dense plus sparse plus fusion plus reranking, all inside a single Postgres database, with pluggable embedders — and the argument for why, in 2026, you probably don't need a separate vector store to do this well.

Update (July 2026): the comments below did exactly what publishing is for. Two of them — the entailment-over-similarity argument and "supersession is a relation, not a property" — became measured experiments and shipped as RE-call v0.3: an opt-in entailment stage for the near-miss no threshold can catch, write-time supersession that beats even a steelmanned timestamp heuristic (83–100% stale-trust → 0.00), and a supersession lint. The full follow-up, with the commenters' names on it: What the Comments Taught Me (RE-call v0.3). The "three honesty guards" described above were v0.1's set — the current table has six.

Part 1 of Retrieval-Augmented Self-Recall. Code: RE-call (Postgres + pgvector, MIT). If you came from Claude Code, Beyond the Prompt, this is the engine under Part 1's memory and Part 5's search.

One MCP Server, Two Models: An Always-On Ops Agent That Costs $0

Giulio D'Erme — Wed, 15 Jul 2026 10:15:23 +0000

A companion to Part 4: your first MCP server and the hardening deep dive. Part of Claude Code, Beyond the Prompt.

Part 4 gave Claude hands: an MCP server exposing narrow, audited tools onto my systems. It works. But it has a shape problem that took me a while to name.

Claude is interactive and metered. It acts when I'm at the keyboard, and every action costs tokens. My systems, meanwhile, run twenty-four hours a day. Something fails at 3 a.m. and there is nobody home.

What I wanted was an always-on agent: watching the journal, triaging recurring errors, and able to answer "why is that service failing?" from my phone, for free.

The answer turned out not to be a second Claude. It was a second client on the same tool server.

And that's the idea worth stealing, so let me put it up front:

Once the enforcement lives in the tools, the model becomes a swappable client. The fence isn't in the model, it's in the server — so you can plug a cheap, dumb local model into the same fence and let it be wrong, safely.

That is true for safety. It turns out not to be true for cost, and I only found that out by measuring — after I'd already built the thing. There's a section on that below, with the numbers that killed a feature I was rather proud of.

The architecture: one server, many clients

Four pieces run on the box:

The MCP tool server. The hardened boundary from Part 4: around thirty narrow tools, sandboxed, audit-logged, bearer-auth, bound to a private network only. Never exposed to the internet.
Claude, connected as an MCP client, for interactive work with me in the loop.
A local model, served by Ollama on the same machine, connected as another client. It never talks to the internet.
Three autonomous entry points that drive the local model: a Telegram bot, a watchdog, and a nightly audit.

The important part is what it is not: the local model is not Claude's assistant, and Claude is not its supervisor. They're peers. Two clients, one fence.

The tool catalogue

Here's the actual surface, grouped by what it does. Note that every tool carries its guardrail in the tool, not in a prompt.

Observation (read-only)

Tool	What it does	The guardrail
`db_query_ro`	Run a SQL query	`SELECT` only via a read-only DB role, 500-row cap, 5s statement timeout
`read_file`	Read a repo file	Path traversal blocked; secrets (`.env`, `*.key`, `.ssh/`) refused server-side
`list_dir_live` / `rg_search_live`	List / grep live files	Confined to allowlisted roots
`journalctl_tail` / `journalctl_grep`	Read a unit's journal	Read-only, line-capped, output truncated
`systemctl_is_active` / `systemctl_status` / `systemctl_cat`	Unit liveness, status, unit file	Read-only, no restart capability
`port_check` / `health_probe_http` / `env_presence_check`	Is it listening, is it healthy, is the env var set	`env_presence_check` returns presence only, never values
`sha256_file`	Checksum a deployed file	Read-only; used to verify a deploy matches git

Retrieval (this is the one people skip)

Tool	What it does
`code_search`	Hybrid semantic + BM25 search over the indexed repo. Find code by intent ("where do we filter signals"), not by guessing a grep string.
`docs_search`	Same hybrid retrieval over the knowledge base (memory / research / plans). Returns a `gap_warning` when the top-3 cosine similarity is below threshold, i.e. "I probably have nothing on this, don't trust these hits."

That gap_warning is the single most valuable field in the whole server. It's the difference between an agent that says "we looked at this before" and one that says "I have nothing relevant, treat it as new."

Git and GitHub (read)

git_log · gh_issue_list · gh_issue_view · gh_pr_list · gh_pr_view

Gated write

Tool	What it does	The gate
`create_branch`	New branch	Name must match `auto/<agent>-fix-*`. Nothing else is accepted.
`apply_patch`	Apply a unified diff and commit	The gauntlet below.
`open_pr`	Push the branch, open a PR	Never auto-merges. Dedups against open PRs with the same error signature. 24h rate cap.
`gh_issue_create`	File an issue for a human	Title prefix enforced; body must contain `## Evidence` and `## Source` sections; 24h cap.
`systemctl_restart`	Restart a unit	Allowlist only.
`deploy_files`	Ship a file	The tool itself enforces backup → checksum-verify against git → restart → health-check.
`telegram_alert`	Page the operator	`info` / `warn` / `critical` levels only.

The design that makes it safe: different callers, different tool surfaces

This is the part I'd most like you to take away, because it's cheap and almost nobody does it.

The server exposes ~30 tools. No caller sees all of them.

The Telegram bot → read-only tools. When I'm asking questions from my phone, there is no path to a write. Not "the model has been told not to." There is no write tool in its list.
The watchdog → read + gated write. It's the only caller that can propose a patch.
The nightly audit → read + gh_issue_create. It's the only caller that can open an issue. (A watchdog running every fifteen minutes with issue-creation would be a spam machine.)
Claude → the full set, because a human is in the loop for every call.

Same server, four different surfaces. It's per-caller least privilege, and it's the "scope the reads, per tool" lesson from the hardening piece, applied one level up.

The auto-fix gauntlet

The watchdog can open a pull request. Here's every condition it must satisfy — all of them, or the server rejects the call:

≤ 20 changed lines, ≤ 1 file. Anything bigger is not an auto-fix, it's a change.
Category must be whitelisted: JSONB serialization, timezone-naive datetime, missing None check, import error, connection retry/backoff. That's it. Five known, boring, mechanical bug shapes.
The error signature must have occurred at least twice in the last 30 days. This is my favourite rule: the agent may only fix bugs it has already seen recur. No novel fixes. Novelty goes to a human.
The target file must not be frozen. The frozen list is everything that could actually hurt: the live money-path modules, central config, CI, systemd units, deploy scripts — and the agent's own source code, so it can't rewrite its own guardrails.
The branch must match auto/*-fix-*.
No renames, binaries, new files, or deletions. Modifications to one existing file, nothing else.
It never pushes to the main branch. Branch, commit, open a PR, alert me. A human merges. Always.

Miss any one and the call comes back blocked, with an instruction not to retry but to escalate to a human via telegram_alert.

The philosophy underneath: the local model is allowed to be wrong. It can't do damage, so it doesn't need to be smart. Every proposal lands as a PR I read on my phone.

Why a dumber model is safe here: defense in depth

A local model is worse at tool-calling than a frontier model. It will hallucinate a tool name. So the enforcement is layered:

Server-side guardrails are the real wall. Every gate above is enforced in the server, not in the prompt.
A client-side tool whitelist sits in front of it. The tool name from the model is Unicode-normalized (NFKC, so a lookalike character can't sneak through), any namespace separator (/, :, .) is rejected, and anything not in that caller's advertised list is blocked before it ever reaches the server. This exists precisely because a weaker model hallucinating apply_patch from the read-only Telegram bot is not hypothetical.
The database role is the wall for data. A dedicated SELECT-only role with a connection limit, not a string check on the SQL. (The hardening deep dive is entirely about why.)
Secrets are blocked server-side, so no tool can read them even if asked.
Every call is audit-logged with a hash of its arguments, latency, and result kind. Two models, one ledger.

Notice: none of this depends on the model being good. That's the whole point.

What it actually does, day to day

3 a.m., from bed. Something looks off. I open Telegram and type "why is the X collector failing?" The local model calls journalctl_tail, then systemctl_is_active, then code_search to find the relevant function, loops through up to six tool iterations, and comes back with a three-sentence diagnosis and the log excerpt it based it on. Cost: zero. Laptop: closed.

The watchdog. A periodic vigilance pass. It finds an exception that has now fired three times in a month, recognizes the shape (JSONB serialization), confirms the file isn't frozen, builds an eight-line diff, opens auto/…-fix-jsonb-…, files a PR labelled for autofix, and sends me a warn. I read the diff on my phone and merge it, or close it. It has never once been able to merge itself.

The nightly audit. Anything novel and recurring worth a human's attention becomes a GitHub issue with a mandatory ## Evidence section (log excerpt, unit, occurrence count) and ## Source section (file path, unit, lineage). The strict format is what makes them triage-able instead of noise.

Claude, during the day. Same tools, different mode: interactive, with me in the loop. Semantic search before grepping, DB queries instead of pasted psql output, deploys through the deploy tool that enforces its own checklist.

The token dividend: the server reads, the model doesn't

Here's the thing I underestimated when I built this. I reach for these tools constantly, and safety is not the reason. Economy is.

Every tool is a compression function. It does the expensive reading on the box and hands back only the answer. Compare what a model would otherwise have to do:

Without the server	With the tool
SSH in, `cat` a file, paste 800 lines into the chat	`read_file` returns it, path-checked and byte-capped
Grep the repo, then read ten files to find one function	`code_search` returns the handful of relevant chunks: roughly 300 tokens instead of 6,000
Paste a 5,000-line journal dump	`journalctl_tail` returns a capped, filtered tail
Run `psql`, paste the whole result table	`db_query_ro` returns at most 500 structured rows
Re-read the knowledge base to check a past decision	`docs_search` returns the matching memo, plus a `gap_warning` if there is no matching memo
`cat` the env file to check a variable is set	`env_presence_check` returns `present: true`. Not the value. Not the file.
Read a deployed file to check it matches git	`sha256_file` returns a hash

Look hard at those last two. The question was "is it set?" and "does it match?", so the tool returns a boolean and a hash. Zero tokens spent on content the model never needed. The answer, not the material.

This is the same lever as "use a subagent for heavy reading" from the token piece: spend the 50,000 tokens of scanning somewhere that isn't your main context, and bring back the 500-token conclusion. An MCP tool is exactly that, made permanent — the fan-out happens server-side, every single call, without you having to think about it.

And here's the part I find genuinely satisfying: the guardrail and the token budget turn out to be the same line of code. The row cap that stops a runaway query is the row cap that stops a 40,000-token result. The path confinement that blocks .ssh/ is what stops the model wandering into directories it never needed. The truncation on a log tail is both a safety valve and a cost control.

That isn't a coincidence. "Return only what was asked for" is simultaneously the security principle and the efficiency principle. Harden the tool properly and you get the cheaper bill for free — or, put the other way round, if your tool returns a wall of text, it's both expensive and insecure, and you should fix it once.

Then I tried to go one step further, and measured why it failed

If every tool is a compression function, the obvious next move is to compress the whole investigation. Give the frontier model one more tool:

ask_local(question) -> { answer, evidence, grounded, ... }

Claude hands over a bulk, mechanical question — "scan six hours of journal for unit X, give me the distinct error signatures and their counts" — the local model does the five tool calls and the reading in its own free context, and Claude gets back three sentences instead of twenty thousand tokens of logs. Model-to-model delegation. The ultimate version of the token dividend.

I built it. It works. It ships disabled. Here's why, because the why is worth more than the feature.

The design decision I'd defend anywhere

ask_local returns evidence, not just an answer.

This is non-negotiable, and it's the same lesson as the dead collector. An unverifiable lossy compressor is a hallucination-laundering machine: to the caller, a wrong summary and a right one look identical, so it acts on either — silently. That's the worst failure mode there is.

So every answer ships with the tool calls it was built from, and grounded flips to false, with a loud warning, when no tool returned usable data — i.e. the model answered from its weights rather than from my systems. Tool errors and blocked calls never count as grounding. The caller checks grounded before acting, or it has learned nothing.

The numbers that killed it

I ran it against the real local model on the real box: twelve cores, already sitting at load 12–17 because the live systems are using them. llama3.2:3b, warm, the identical call, varying only how many tools I advertised:

Tools advertised	Latency
1	180.3 s
3	28.6 s
8	252.5 s

Look at that column. It isn't monotonic. One tool is slower than three. That makes no sense if latency tracks the work — and that's the finding:

Latency here is not a function of the work. It's a function of how much CPU happens to be free at that instant. The same call takes 29 seconds or 250 seconds. It isn't slow, it's unpredictable — and you cannot budget a synchronous tool against a 10× swing.

While it runs, inference also takes about seven of the twelve cores away from the live systems. Which detonates the line I'd have written without measuring: the local model is not the cheap tenant on a shared production box. It's the most expensive one.

The model trap, free of charge

Two other models, same box, same call:

Model	Result
`qwen3:8b` (as shipped)	timeout, >600 s — never returns
`qwen3:8b`, thinking disabled	233 s — still far too slow
`llama3.2:3b`	29–250 s — the only viable family, still erratic

That first row is worth your time. Qwen3 ships with thinking mode on. Combine thinking with tool-calling on a CPU and it doesn't get slow, it never comes back — I gave it ten minutes and it was still going. Turning thinking off took it from ">600 s" to 233 s, which means the thinking alone was costing more than six minutes per turn.

If you're putting a Qwen3-family model behind tools on CPU, disable thinking or you will sit there wondering why nothing ever returns.

What I actually learned

The mechanism was fine. The 3B called the right tools. The in-process dispatch worked. And the guardrails did exactly their job: every single failed run came back grounded: false, truncated: true, with a loud warning. Not once did it hand back a confident invention. It failed noisily, which is the only acceptable way to fail.

The hardware was the problem, and it produces a sharper rule than the one I started with:

A local model works beautifully as an asynchronous background worker — a watchdog that runs every fifteen minutes, a bot you're willing to wait for. Unpredictable latency simply doesn't matter when nobody is blocked.

It does not work as a synchronous delegate for your interactive model. There is no budget you can set when the same call takes 29 s or 250 s.

And notice: my Telegram bot and my watchdog are already asynchronous. I didn't design that from insight — the hardware had decided it for me long before anyone measured. The measurement just told me why I'd been right by accident, and stopped me from being wrong on purpose.

So the corrected version of the thesis at the top of this post:

The tools make the model safe to swap. The hardware decides whether it's worth swapping. Those are two different axes, and I had quietly collapsed them into one.

Setting it up

The order that worked:

Build the tool server first, not the agent. FastMCP (or the SDK of your choice) over HTTP, bearer-auth, bound to localhost or a private network. Never exposed publicly. Start with read-only tools; you'll be surprised how far that gets you.
Create a read-only database role and connect through it. Grants, not string checks. Add a row cap and a statement timeout.
Add the audit table on day one, not later. Log the tool, a hash of the args, latency, result kind.
Wire Claude to it as an MCP client. Use it for a week. The tools you actually reach for are the ones worth hardening.
Then add the local model. Ollama, one pull, and a small tool-calling loop: send the tool schemas, parse tool_calls, dispatch to the MCP server, feed results back, cap the iterations (six is plenty).
Give each entry point its own tool list. Bot: read-only. Watchdog: read + gated write. This costs you ten lines and buys the whole safety story.
systemd units for the server, the bot, and a timer for the watchdog.

About the model

Size the model to your hardware — and then measure it. Do not trust the model card, and do not trust me.

Two things I had wrong until I ran the numbers:

Bigger isn't just slower; it can be infinite. A thinking-enabled qwen3:8b never returned a single tool-calling turn on CPU. Not "took a while" — never came back.
A weaker model is safe, but that doesn't make it usable. The guardrails don't live in the model, so a dumber one costs you nothing in risk. It can still cost you everything in latency. On CPU, tool-calling speed is the binding constraint, not intelligence.

Practically: a 3B-class model is the only thing I'd put behind a synchronous tool on CPU, and even that is erratic under contention. A 7B–8B is fine for a background worker where nobody is waiting. A 30B-class model wants ~18–20 GB and minutes per turn. With a GPU, none of this is a conversation.

And if the model can think: turn thinking off. Ollama makes swapping a model a one-line change. It does not make the consequences a one-line change.

The pros, and the honest cons

Pros

Always-on, and it never bills a token. It watches while I don't. Read the cons before you call it free, though — it bills in CPU.
Privacy. The local model never sends a byte off the machine. For anyone whose logs or schema can't leave the building, this is the whole ballgame.
A smaller token bill, and better answers. Every call returns the answer instead of the raw material, so context stays dense. Cheaper and less wrong, from the same design.
The model becomes safe to swap. Because the fence is in the tools, I can change the local model without touching the safety posture. Note the word: safe, not free.
One audit trail, two models. Whoever acted, it's in the same ledger.
It makes you harden the tools properly, because now something less careful than Claude is holding them.

Cons, honestly

On a shared box, inference is the most expensive tenant. Not merely slow — unpredictable. The identical call took 29 s or 250 s depending purely on what the live systems left free, while eating ~7 of 12 cores. Fine for an async watchdog. Fatal for anything synchronous. Measure your own box before believing anyone, including me.
The model can be a trap. A thinking-enabled model behind tools on CPU may never return at all.
Local tool-calling is unreliable. Expect hallucinated tool names and malformed arguments. Plan for it — that's what the client-side gate is for — rather than being surprised by it.
It is not smart enough for judgment. It triages, diagnoses, and proposes. It does not decide. Every gate, every promotion, every merge is still mine.
More moving parts. Four units instead of one. Worth it only if you genuinely have something running 24/7 that you'd like watched.

The through-line

Part 4's rule was: the tool enforces the rule, not the prompt.

The corollary is why this setup works at all: if the enforcement is in the tools, the model is just a client. Interchangeable. You can hand a cheap, imperfect model the same hands and let it be wrong, because being wrong can't cost you anything.

But I very nearly shipped a second, sloppier corollary — therefore the model is free — and the only reason I didn't is that I measured it. It isn't free. On a box that's already working, it's the hungriest process on the machine, and its latency is a coin flip.

So the honest pair, which is what this post is really about:

The tools decide whether a model is safe to swap. The hardware decides whether it's worth swapping.

Get the first one right and you can afford to experiment freely. Get the second one wrong and you'll ship something that works perfectly in every respect except the one that matters.

That's how a 24/7 ops agent stops being an infrastructure project and becomes a systemd unit — and how a clever delegation tool becomes a config flag set to 0.

A companion to Part 4 and the hardening deep dive, part of Claude Code, Beyond the Prompt. The retrieval layer behind code_search and docs_search is open source: RE-call.

Clearing an off grid price bug out of Polymarket's order path

Giulio D'Erme — Tue, 14 Jul 2026 20:33:36 +0000

This is a submission for DEV's Summer Bug Smash: Clear the Lineup powered by Sentry.

Project Overview

Polymarket ships a unified Python SDK, py-sdk, for building on their prediction market: constructing, pricing, signing, and submitting orders. It's currently in beta and moving fast, which is exactly where money path bugs like to hide. I've been running the SDK against live markets, so I pointed an audit pass at its order validation and signing code and found a real one.

Bug Fix or Performance Improvement

Before an order is signed, the SDK validates the price against the market's tick size (the smallest allowed price increment). Two functions do this, _resolve_price for limit orders and _resolve_protected_market_price for market orders, and both check the wrong thing. They validate the price's decimal place count, not whether the price is an actual multiple of the tick:

if decimal_places(price) > config.price:
    raise UserInputError(f"price must conform to tick size {tick_size} ...")
return round_normal(price, config.price)

Decimal place count equals tick grid membership only for power of ten ticks (0.1, 0.01, 0.001, 0.0001). The SDK also supports two half step ticks, 0.005 and 0.0025, and for those the two measures diverge. A price like 0.007 has three decimal places, so it passes the check, but it is not a multiple of 0.005. The grid is {0.005, 0.010, 0.015, ...}, and 0.007 is not on it.

from decimal import Decimal
from polymarket._internal.actions.orders.limit import _resolve_price

_resolve_price(Decimal("0.007"), Decimal("0.005"))
# returns Decimal("0.007"), no error, even though 0.007 is off the tick grid

The consequence is worse than a cosmetic slip. Polymarket orders are signed with EIP-712, and the price is baked into the signature. That means the exchange cannot round an off grid price onto the grid without invalidating the signature, so it can only reject the order. The client side guard that exists specifically to prevent that wasted signing round trip does not fire, on exactly the markets where it is needed. It stays invisible on every classic market, because there decimal count and grid membership happen to agree.

Code

Merged PR: GiulioDER/py-sdk#1
Reported upstream: Polymarket/py-sdk#162

The fix is eight lines, a grid membership check added after the existing decimal check in both validators:

if price % tick_size != 0:
    raise UserInputError(f"price {price} must be a multiple of tick size {tick_size}.")

It is purely additive. Any price that validated before is a tick multiple, so it still passes; only genuinely off grid prices are newly rejected, and those were going to be rejected by the exchange anyway, just later and less clearly.

My Improvements

I did not want to ship a "looks right to me" patch into someone else's live money path, so the fix carries its proof:

A red to green test suite. New unit tests assert that off grid prices are rejected on both validators, that on grid prices (including the exact range boundaries price == tick and price == 1 - tick) still pass, and that the pre existing decimal place error is unchanged. Plus an end to end test that drives the real public prepare_limit_order_draft path with a mocked 0.005 tick market and confirms the guard fires there too.
An exhaustive correctness sweep. For every supported tick, I enumerated every on grid multiple across the whole valid range and every in allowance off grid probe: about 23,000 cases, zero false rejects and zero false accepts. Decimal % Decimal is exact, so there is no floating point residue to worry about.
Clean gates. ruff format, ruff check, and pyright all pass; the full order path unit suite stays green.
An adversarial review. I had the change reviewed by an independent pass whose only job was to find a reason a maintainer would reject it. Its strongest counter argument, "maybe the server is meant to snap off grid prices," is exactly what EIP-712 signing rules out: a signed order cannot be silently repriced. That turned into the clearest line in the writeup.

Since the repository limits pull requests to collaborators, I merged the fix on a fork (allowed by the contest rules) and filed a full report as an upstream issue, so the maintainers have the bug, the repro, and the patch in one place.

What I took away: a validation check should test the invariant it actually claims to enforce, not a proxy that happens to coincide with it. This one promised the price "must conform to tick size" but tested decimal place count, and those two agree on every classic market and diverge exactly on the newer half step ticks. A single assertion on a 0.005 tick would have caught it.

Green all the way down: a trading bot that lied to me in four different languages

Giulio D'Erme — Tue, 14 Jul 2026 20:28:40 +0000

This is a submission for DEV's Summer Bug Smash: Smash Stories powered by Sentry.

Here's a fun way to lose confidence in every dashboard you own.

I run a small fleet of automated trading bots. One of them, I'll call it the index bot because it trades index CFDs and gold rather than FX, sat one afternoon at exactly its starting balance, zero open positions, having placed no trades for a suspiciously long time.

Nothing was on fire. systemctl said the service was active. The bot's own risk circuit reported ok. The fix for the bug it was hitting had, according to git, been shipped weeks ago. The last deploy had said done.

Every single one of those was a lie. Here they are in the order I peeled them off.

Lie #1: `active`

$ systemctl is-active index-bot
active

Except it wasn't running. It was stuck in a crash loop, and had been for three days.

A stray systemd config override, left over from an unrelated experiment, had repointed the service at a minimal Python virtualenv that was missing two libraries the bot imports on startup (the Postgres driver and the market data client). So the process would launch, import its way a few lines in, hit the missing module, die, and get restarted by Restart=always. Launch, die, restart. Forever.

The trap: systemctl is-active doesn't answer "is this service working?" It answers "does a process exist right now?" A restart loop keeps a process existing. Poll it and you almost always land in the brief window between crashes, where the answer is, technically, active. Three days, no alert, because the one signal anyone was watching was structurally incapable of noticing.

Takeaway: a restart policy launders a dead process into one that looks healthy. Alert on NRestarts climbing, and make your health check prove the process did its work, wrote a heartbeat row or answered a ping, not merely that it drew breath.

I fixed the venv override. The bot came up and stayed up. Progress! Onto lie number two.

Lie #2: `ok`

Now that it was actually running, the journal filled with this, once a minute, forever:

poll: 1 new order intent
place_order INDEX500: INVALID_REQUEST "Relative stop loss has invalid precision"
router summary: { placed: 0, errors: 1, circuit_state: 'ok' }

Read that last line again. It placed zero orders. It logged one error. And it declared its circuit state ok.

The bot has a risk circuit breaker, and it's a good one. It trips on drawdown, on stale prices, on daily loss limits. But a broker rejecting 100% of your orders is not a drawdown event. No money is being lost because no orders exist. So the circuit, which only ever learned to watch for losing money, cheerfully reported green while the bot failed to trade at all.

Takeaway: a guard that watches one failure mode is blind to every other one. "Am I losing money?" and "am I actually able to place a trade?" are different questions, and a health signal that answers only the first will glow green through the second. A 100% rejection rate should page as loudly as a drawdown breach.

So why was every order rejected? That's the actual bug, and it's a beauty.

The bug all four green lights were hiding: the decoder and the encoder disagreed about what a price is

The bot talks to cTrader's Open API. For a market order, cTrader won't take an absolute stop loss price; it wants the stop as an integer distance expressed in a fixed wire scale. cTrader streams prices on a fixed scale of 100,000 integer units per price unit, and the adapter decoded incoming prices with exactly that constant:

_SPOT_PRICE_SCALE = 100_000.0          # cTrader's fixed wire scale

def decode_price(raw: int) -> float:
    return raw / _SPOT_PRICE_SCALE     # correct

But when it came time to encode the stop loss distance to send back, the adapter used the symbol's display digits instead of the wire scale:

# digits = the number of decimals the symbol quotes to
sl_units = int(stop_distance * (10 ** symbol.digits))   # this line is the bug

Look at what those two lines assume. The decoder says a price unit is worth 100_000. The encoder says it's worth 10 ** digits. Those are the same number only when digits == 5.

A EUR/USD style FX pair quotes to 5 digits, so 10**5 == 100_000, encoder and decoder agree, and orders sail through.
An index CFD or gold quotes to 2 digits, so 10**2 == 100, the stop is encoded 1000× too small, and cTrader rejects it as "invalid precision."

The bug had existed the whole time. It was invisible on every account that traded 5 digit FX, where the wrong answer happens to equal the right one. The lone account trading 2 digit instruments, an index CFD and gold, was the only place the two scales diverged, and it failed 100% of the time.

The fix is a single line: encode on the same scale you decode with.

def stop_distance_to_units(distance: float) -> int:
    return int(round(distance * _SPOT_PRICE_SCALE))

Takeaway: if you decode with one constant and encode with another, you've planted a bug that hides everywhere the two constants coincide and detonates the first time they don't. Encode and decode through the same function, and write the unit test for the case where they'd diverge. Here, a single assertion on a 2 digit symbol would have caught it years earlier.

Lies #3 and #4: `shipped`, and `done`

Here's the part that made me laugh, then wince.

That fix? It was already in master. Past me had already found this, written stop_distance_to_units, and left a comment right next to it warning not to use the display digits, because they encode 2 digit instruments far too small. Per git, the bug was solved and shipped.

The live box was running the old code anyway.

$ sha256sum  <live>/ctrader_adapter.py                       # c5bfc46f…
$ git show origin/master:…/ctrader_adapter.py | sha256sum    # f2948754…

Different bytes. The box had never received the fix. The most consistent explanation is grim: that file lives in a hardened, immutable set, locked with chattr +i so nothing can quietly tamper with the money path, and a plain scp over an immutable file fails. A blocked copy and a successful one look identical unless someone checks. And nobody was: whatever deployed this never compared the bytes on the box to the bytes in git, because if it had, this exact drift is what it would have caught. The deploy said done. The bytes said otherwise.

Takeaway: a deploy that doesn't verify its end state is a wish, not a deployment. Copy the file, then assert sha256(target) == sha256(source), or you will one day discover, as I did, that "deployed weeks ago" and "running in production" are unrelated facts.

The pattern: liveness is not health

Line them up:

The signal	What it claimed	What was true
`systemctl is-active`	`active`	stuck in a crash loop for 3 days
circuit breaker	`circuit_state: ok`	rejecting 100% of orders
`git log` on master	fix shipped	fix never on the box
the deploy	`done`	bytes never changed

Four independent "success" signals, four different lies, and each was technically correct about the narrow thing it measured. The unit did exist. The account wasn't drawing down. The commit was on master. The deploy command did run. None of them measured whether the system was doing its job, and the gap between "it's running" and "it's working" is where this whole afternoon lived.

Five things I'm taking into every system I touch after this:

Alert on restart counts, not just liveness. Restart=always turns a corpse into a green light.
Health checks must assert work happened, a heartbeat, a filled order, a written row, never just "a process is up."
Every guard is blind outside its one failure mode. Enumerate the failure modes; "can't lose money" is not the same as "can trade."
Encode and decode through the same constant, and test the input where two scales would disagree.
A deploy must verify its end state. Hash the target. scp onto an immutable file fails silently, and silence reads as success.

How this was actually caught

Full disclosure, because it's the most interesting part: I didn't find most of this by staring at logs. I pointed an AI agent (Claude) at the live journal and had it diff the running bytes against git, part of an audit pipeline I've been building whose entire premise is don't trust a green dashboard. A human who "knows the system is fine" skims right past circuit_state: 'ok'. A skeptical reader with no such prior stops on it and asks the dumb, correct question: ok according to whom?

That question, asked four times, at four layers, was the whole fix. The bugs were boring: a stray config file, a guard that watched too little, a units mismatch, an unverified copy. No clever algorithm, no exotic race. And that's exactly why they'd survived for weeks: nothing about them looked wrong, because everything that was supposed to tell me they were wrong was busy reporting green.

The least glamorous bugs are the ones that quietly cost the most. Smash them by distrusting your own status lights.

If your infra has ever said active while doing absolutely nothing, I'd love to hear your version in the comments.

Fine-Tuning and RAG: What a Dozen Failed Experiments Taught Me

Giulio D'Erme — Tue, 14 Jul 2026 12:29:58 +0000

The internet has a strong opinion about fine-tuning versus RAG, and most of it comes from people who never ran the experiment.

I ran about a dozen — fine-tuning LLMs, fine-tuning embedders, six flavors of RAG — on a real system that makes forward-looking predictions against noisy financial outcomes, with the kind of statistical rigor that kills your favorite result: walk-forward splits, permutation tests, multiple-comparison correction, pre-registered kill-gates. Most of the experiments died. The autopsies turned out to be far more useful than any hot take, because they all point at the same rule.

Here's that rule up front, so the rest of the post has somewhere to land:

Fine-tuning and RAG both operate on the input side — the vocabulary the model knows, and the context it can see. Neither changes whether the thing you're trying to predict is actually predictable from your data. Where the task is "know the words / find the fact / recall the knowledge," both work. Where the task is "predict a noisy outcome," neither manufactures a signal that isn't there — they just hand you more convincing ways to fool yourself.

Let me show you the bodies.

A note on why my results differ from the blog consensus

Almost every experiment below looked good on a naive train/test split. The failures only showed up under discipline: walk-forward with embargo, permutation tests for significance, Holm/BH-FDR correction because I was testing many cells, an SPY-counterfactual to check I wasn't just capturing market beta, and a binding calibration gate that a result had to clear no matter how good the P&L looked.

That gap is the whole story. If you evaluate fine-tuning or RAG on a single random split and read the headline metric, you will "discover" edges that evaporate the moment you test them honestly. The technique is not the hard part. The evaluation is.

Part 1 — Fine-tuning: four autopsies

Bigger model, better loss, worse decisions

I fine-tuned a 7B and a 14B open model on the same ~777 labeled examples for the same task. The 14B finished with a better evaluation loss (0.97 vs 1.01). It also made worse decisions on held-out data: 46.2% win rate versus the 7B's 68.4% — a 22-point gap, p=0.019.

Two lessons, both expensive to learn the hard way:

Generative loss does not predict downstream quality. The 14B was better at the thing loss measures (reproducing tokens) and worse at the thing I cared about (being right). If you pick your model by eval_loss, you will ship the worse one with confidence.
Small data punishes big models. With a few hundred to a few thousand examples, extra capacity doesn't find deeper structure — it memorizes the label generator, including its noise and mistakes. A smaller model's limited capacity is implicit regularization; it's forced to learn the simple, general pattern. Under ~3k examples, reach for the smaller model.

Fine-tuning a classifier learns the safe answer

Three separate attempts to fine-tune a model to make a directional call, three times the same outcome: the model converged to the safe, majority answer. It didn't learn a signal — it learned the distribution of the labels, which for a noisy target means "predict the common class and stop taking risks."

This is what fine-tuning does when there's no signal to find: it fits the label distribution beautifully and tells you nothing. The confident, low-loss model is not evidence the edge exists.

The tiny-transformer trap

I tokenized order-flow into sequences and trained a ~130k-parameter transformer on a direction target. It never converged — training loss sat exactly at the -log(0.5) coin-flip floor. Meanwhile a boring gradient-boosted model, on the same target with 12 hand-built features, reached AUC 0.71.

The signal was real and learnable; the architecture just couldn't touch it at that scale. "Use a transformer" is not a substitute for either signal or scale — a 130k-param model has nowhere near the capacity to discover from raw tokens what a GBM reads off 12 good features. Match the architecture to the data you actually have, not the one on the paper you're copying.

Fine-tuning an embedder: it depends entirely on your vocabulary

This one has a happy ending, and it's the most useful result of all because it tells you when fine-tuning pays. I domain-adapted a small embedder (all-MiniLM-L6-v2) on two corpora and measured retrieval quality (MRR, nDCG@10) on held-out queries:

A "rich" corpus the base model already understood: base MRR 1.00 → fine-tuned MRR 1.00. Δ+0.00. Zero lift — and that's the correct result, because there was no headroom to recover.
An "opaque-jargon" corpus, where concepts hid behind codenames the base model had never seen: base MRR 0.306 → fine-tuned 0.547. +0.24 MRR, +79% relative.

The conclusion writes itself: fine-tune the embedder only as much as your corpus's vocabulary diverges from what the base model already knows. Well-covered domain? You get nothing. Private jargon, internal codenames, a specialist vocabulary? That's exactly where fine-tuning earns its keep. (Full controlled study, open source, in RE-call.)

The fine-tuning through-line: fine-tuning is a vocabulary and behavior tool. It teaches words, formats, and styles the model didn't have. It does not manufacture predictive signal from a noisy target — and measured by loss instead of the downstream metric, it will actively mislead you.

Part 2 — RAG: the eight-times-falsified idea

"It fires, it just doesn't help"

The seductive idea: retrieve relevant history and inject it into the classifier's prompt so it decides with more context. I tested it eight to nine times — across two models, three embedding stacks, four retrieval targets (own history, sector peers, an outcome-supervised REPLUG-LSR adapter, and niche-conditional slicing), and three paradigms (similarity, outcome-supervised distillation, structured-feature extraction).

Every single time, the sanity check passed: RAG fired, changing 25–33% of the model's decisions. And every single time, those changed decisions were statistically indistinguishable from noise against the real forward outcome. ΔWR hovered around zero with confidence intervals straddling it, through round after round, under multiple-comparison correction.

The mechanism of failure is worth stating precisely: RAG amplified the score's magnitude without knowing when the retrieved past was relevant. Positive historical context pushed the model more bullish — even when the current situation had reversed. It added confidence, not correctness. "It fires" and "it helps" are completely different claims, and only the first one was ever true.

The apparent-alpha trap

One RAG variant looked like a winner. A structured-feature setup posted +11%/year of apparent alpha — the kind of number that gets a strategy shipped. Its ranking AUC was 0.486. Worse than a coin.

The +11% was regime exploitation, not skill. The model predicted the majority class more often, and the evaluation window happened to reward that class. A ranker with AUC below 0.5 has, by definition, no ability to tell winners from losers — so any P&L it produces is a property of the period, not the model.

This is the most dangerous failure in the whole post, because the metric that flatters you (P&L) and the metric that tells the truth (AUC/calibration) point in opposite directions. Make calibration a binding gate, not a footnote. A profitable backtest on a below-chance model is a mirage, and it will happily survive right up until it's live with real money.

Where RAG actually earned its place

Here's the part the "RAG is dead" crowd gets wrong: RAG didn't fail everywhere. It failed at prediction. It stayed on, permanently, for everything retrieval-shaped — the searchable knowledge base, pulling relevant context a human reads, giving the system access to facts it wasn't trained on.

That's the exact line the rest of my writing lives on. RAG to find code by meaning (semantic code search) works beautifully. RAG to give an agent a working memory (RE-call) works beautifully. RAG to make a hard forward-looking call better failed nine times. Same technique, opposite outcomes — because they're different problems wearing the same acronym.

The RAG through-line: RAG changes what the model sees, not whether the target is predictable. If the signal is in the data, retrieval helps you surface it. If it isn't, retrieval just gives the model more confident-looking ways to be wrong.

Part 3 — The rule that explains all of it

Line the failures up and they're the same shape:

Fine-tuning operates on the model's vocabulary and behavior.
RAG operates on the model's available context.
Neither operates on the only question that decides a prediction task: is the target actually a function of the inputs?

So the decision isn't "fine-tune vs RAG." It's what kind of problem do you have?

"The model doesn't know my words / format / domain." → Fine-tune. The embedder if it's a retrieval problem (and only if your vocabulary genuinely diverges — see the RE-call result); the LLM if it's a behavior or style problem. Measure on the downstream metric, never on loss.
"The model can't access facts / code / memory it wasn't trained on." → RAG. This is retrieval's home turf and it's excellent at it.
"The model should predict my noisy outcome better." → Neither will save you. No amount of fine-tuning or retrieval creates signal that isn't in the data — and both will hand you a convincing way to believe otherwise (a lower loss, a fatter backtest, a RAG that "fires"). Go find real signal, or accept there isn't any and stop paying to re-discover that.

And the meta-lesson, the one that made every result above trustworthy: nearly all of them looked good until the evaluation got honest. The bigger model won on loss. The dead RAG variant won on P&L. The naive splits won on the headline number. Walk-forward, permutation, multiple-comparison correction, and a binding calibration gate are what turned a pile of exciting-but-fake wins into a dozen reliable *no*s and two real *yes*es.

The techniques are easy. Knowing whether they worked is the entire job.

The retrieval side of all this — how to build RAG that actually works, and the full controlled study on when fine-tuning an embedder pays — is open source: RE-call. The applied side — RAG for code search and agent memory in day-to-day Claude Code — is my series Claude Code, Beyond the Prompt. If you're about to fine-tune or bolt on RAG, run the honest evaluation first — it's cheaper than the mirage.

Claude Code, Beyond the Prompt — Part 7: How I Cut Claude Code's Token Bill (and Made It Faster)

Giulio D'Erme — Tue, 14 Jul 2026 12:21:42 +0000

Part 7 (finale) of Claude Code, Beyond the Prompt — patterns from running a live automated trading system on Claude Code. Part 6: GitHub as Claude's task queue.

Every token Claude reads costs you twice: once in money, once in time. A bloated context isn't just a bigger bill — it's a slower response, every single turn.

When people want to fix this, they reach for the wrong lever: switch to a cheaper model. But the model isn't where the waste is. The waste is in making Claude read things it doesn't need to and redo things it shouldn't have to. And it turns out the six pieces from this series — built for other reasons — are, almost by accident, a token-reduction system.

Here's how the savings actually work, and how to measure your own. This is the payoff article, so it's concrete.

The core insight: tokens are the currency of both cost and speed

This is the mental model that changes how you work: tokens are the shared currency of your bill and your latency. Anything that reduces what Claude has to read or regenerate wins on both axes at once. You're not trading cost against speed — you're buying both with the same coin.

So the optimization target isn't "cheaper model." It's information density: maximum relevant context, minimum noise, in every turn.

Where the tokens actually leak

Watch a naive session and the waste is obvious once you know where to look. Each leak maps to a piece we already built:

The leak	The fix	From
Re-explaining your whole project every session	Claude reads a small memory file instead	Part 1
Redoing work because it acted on stale state	Ground first; no wasted work on wrong assumptions	Part 2
Re-typing long procedural prompts	One command invocation instead of a paragraph	Part 3
Pasting giant logs and command dumps	MCP tools return small, structured results	Part 4
Reading ten whole files to find one function	Semantic search returns the relevant chunk	Part 5
Re-deriving history and "what changed"	GitHub holds the record; Claude reads only what it needs	Part 6

The biggest two are almost always the first and the fifth.

Re-explaining context (Part 1): without a memory file, you spend the opening of every session re-establishing your stack, conventions, and current state — hundreds of tokens of preamble before any work happens. With one, that context is a compact file Claude reads once. Multiply by every session, forever.

Reading whole files (Part 5): this is the giant. Grepping-and-reading to locate code routinely pulls ~6,000 tokens across five files to find the 30 lines that matter. A good semantic lookup returns ~300 tokens — the right chunk. That's roughly a 20× reduction on a single lookup, and lookups happen constantly. (Illustrative, not a lab measurement — but the order of magnitude is real, and you'll see it yourself the first time you compare.)

Three platform levers most people miss

Beyond the six pieces, Claude Code gives you three more levers that are pure token savings:

1. Prompt caching. The stable prefix of your context gets cached — so a stable, skimmable CLAUDE.md stays cached across turns and is cheap and fast to reuse. This is a concrete, dollars-and-milliseconds reason the Part 1 split pays off: the stable file caches cleanly; the dynamic file is kept small. Churn your always-loaded context and you keep busting the cache and paying full freight. Stability isn't just tidy — it's cached.

2. Lazy / deferred tool loading. Every tool you load carries a schema that sits in context. Load two hundred tools eagerly and you've spent thousands of tokens before saying a word. Claude Code can defer tool schemas and load them on demand, so a session only pays for the tools it actually uses. Skills work the same way (Part 3): only the short description is always loaded; the body loads when relevant. Progressive disclosure is a token strategy, not just an organizing one.

3. Subagents for heavy reading. When a task needs Claude to read a lot — search across dozens of files, sweep a big directory — spin up a subagent to do the bulk reading in its own context and return only the distilled conclusion. The 50,000 tokens of spelunking happen off to the side; your main thread receives the 500-token answer. For big fan-out work this is one of the largest single savings available, and it keeps your main context clean for the actual work.

The point is that they compound

Here's why this is a finale and not a footnote: these don't add, they multiply, because each attacks a different part of the session.

A naive session looks like: re-explain everything + grep-read ten files + paste a wall of logs + redo work that assumed stale state. Stack the fixes and that same session becomes: read a compact memory file + one semantic lookup + structured tool results + grounded, no-rework execution. Every phase got denser. The totals aren't close.

I won't hand you a tidy "I cut costs 80%" number, because it would be dishonest — your codebase, your tasks, and your model choice all move it. What I'll tell you plainly: the direction is large and consistent, and the two biggest wins are memory (stop re-explaining) and semantic search (stop reading whole files). Start there; they're most of the gain for least of the effort.

Measure your own

Don't take my word for any of it — instrument it:

Claude Code can show your token and cost usage. Check it before and after adopting these patterns.
Watch two things specifically: tokens-per-session, and how often Claude reads whole files versus targeted chunks. The second is the leading indicator — when it drops, the bill follows.
Notice the latency, not just the invoice. Denser context means faster turns, and on a long working day that's the part you actually feel.

You're optimizing information density. Treat context as a budget and spend it on signal.

The mindset shift

If you take one thing from this article: stop asking "how do I prompt better" and start asking "what is Claude reading that it doesn't need to, and what is it redoing that it shouldn't have to?"

Every token you remove is money saved and latency cut. That reframe — context as a scarce budget spent on signal — is what turns a big monthly bill into a small one and a sluggish assistant into a fast one.

Where the series lands

Seven parts ago I made a claim: getting dramatically more out of Claude Code has almost nothing to do with writing better prompts. I hope the case is made now.

The leverage was never in the prompt. It was in the scaffolding around it:

Memory so it never forgets your project (Part 1).
Rituals so it's grounded in what's true before it acts (Part 2).
Commands so your workflows and judgment are captured once (Part 3).
MCP tools so it can act on your systems inside a fence you built (Part 4).
Semantic search so it finds by meaning, not by guessing (Part 5).
GitHub so there's a shared, persistent source of truth (Part 6).
Token sense so all of it runs cheap and fast (Part 7).

None of it required a bigger model. Every piece is something you can build incrementally, starting with a ten-minute memory file this afternoon. Do them in order and each one pays for itself before you build the next. Together they turn Claude Code from a chat box into the operational layer for real work — which is exactly what it's been for me, running a live system on it every day.

Going deeper

The one piece I kept pointing away from — the retrieval machinery under semantic search and memory — has its own home: my open-source RE-call. It digs into giving an agent a memory that genuinely works — the architecture, the evaluation harness, and the honest findings, including the experiments that failed — with a full writeup in docs/WRITEUP.md. If the RAG parts of this series were your favorite, that's where to go next.

And if the evidence is what hooks you: I ran about a dozen fine-tuning and RAG experiments on a real prediction system — walk-forward splits, permutation tests, pre-registered kill-gates — and most of them died. The autopsies are worth more than any hot take: why a bigger model with a better loss made worse decisions, why RAG-in-the-prompt failed eight straight times, and the one rule that explains all of it. It's all here: Fine-Tuning and RAG: What a Dozen Failed Experiments Taught Me.

Thanks for reading the whole way. If you build even one of these, tell me how it goes — I read every reply.

The finale of Claude Code, Beyond the Prompt. Building something similar, or hiring people who do? The deeper agent-memory research is open source — RE-call, linked above. Follow for the next series.

DEV Community: Giulio D'Erme

Retrieval-Augmented Self-Recall — What the Comments Taught Me (RE-call v0.3)

Comment 1: "A similarity score is not a confidence score"

Where the comment needed a refinement — which is the point of measuring

Comment 2: "Supersession is a relation, not a property"

The experiment I still owe

What I'm actually arguing for

Retrieval-Augmented Self-Recall — Part 6: The Fine-Tune That Did Nothing, and Shipping It as an MCP Server

The fine-tune that did nothing

Shipping it: the MCP server

Where the series lands

Read it, run it, break it

Retrieval-Augmented Self-Recall — Part 5: The Gap Threshold That Didn't Transfer

Cosine similarity is not calibrated across models

The lesson: never ship a hard-coded abstention threshold

Why this generalizes past RAG

Next

Retrieval-Augmented Self-Recall — Part 4: Benchmarking Retrieval *and* Honesty

The test set: the unanswerable queries are the point

Two families of metrics

The ablation: every embedder × every fusion stage

Finding 1: hybrid + rerank helps most exactly where you'd expect — and nowhere else

The rigor is the point

Next

Retrieval-Augmented Self-Recall — Part 3: Teaching RAG to Say \"I Don't Know\

Guard 1: gap_warning — "is the best match good enough to trust?"

Guard 2: freshness — "is this memory still current?"

Guard 3: anti-re-litigation — "did we already settle this?"

The unifying idea

One rule, or the guards are theater

Next

Retrieval-Augmented Self-Recall — Part 2: Hybrid RAG on Nothing but Postgres

Why not a dedicated vector DB

The retrieval pipeline

1. Dense retrieval

2. Sparse retrieval

3. Fusion with RRF

4. Cross-encoder rerank (optional)

Pluggable embedders

Is "just Postgres" actually enough?

Next

Retrieval-Augmented Self-Recall: The RAG Problem Nobody Talks About

Why self-recall is not document QA

Three failure modes unique to agent memory

The reframe: abstention, not ranking

RE-call: a reference implementation

What this series covers

Next

One MCP Server, Two Models: An Always-On Ops Agent That Costs $0

The architecture: one server, many clients

The tool catalogue

Observation (read-only)

Retrieval (this is the one people skip)

Git and GitHub (read)

Gated write

The design that makes it safe: different callers, different tool surfaces

The auto-fix gauntlet

Why a dumber model is safe here: defense in depth

What it actually does, day to day

The token dividend: the server reads, the model doesn't

Then I tried to go one step further, and measured why it failed

The design decision I'd defend anywhere

The numbers that killed it

The model trap, free of charge

What I actually learned

Setting it up

About the model

The pros, and the honest cons

The through-line

Clearing an off grid price bug out of Polymarket's order path

Project Overview

Bug Fix or Performance Improvement

Code

My Improvements

Green all the way down: a trading bot that lied to me in four different languages

Lie #1: active

Lie #2: ok

The bug all four green lights were hiding: the decoder and the encoder disagreed about what a price is

Lies #3 and #4: shipped, and done

The pattern: liveness is not health

Retrieval-Augmented Self-Recall — Part 4: Benchmarking Retrieval and Honesty

Guard 1: `gap_warning` — "is the best match good enough to trust?"

Lie #1: `active`

Lie #2: `ok`

Lies #3 and #4: `shipped`, and `done`