yongrean

Posted on Jun 29

Confidence is the one signal your model can't corroborate

#ai #llm #opensource #architecture

This series started as a cheap-model brag and keeps getting better comments than posts. Three readers — @nazar_boyko, @txdesk, and @jugeni — independently converged on the same seam, and @jugeni put it in one line I can't improve on:

AUTO wants a corroborator the model cannot write, not a confidence it can.

Here's what that means, and why it's the sharpest critique this design has taken.

Four signals, but not four of a kind

Quick recap of the earlier posts: the LLM scores four features per email — confidence, senderTrust, reversibility, urgency — and a deterministic rule maps those to a tier. The model perceives; a rule I can read decides.

But those four aren't the same kind of thing. Three of them describe the world:

senderTrust can be anchored to observed history — have you actually corresponded with this person, and how often. There's a source outside the email.
reversibility is a property of the action the system would take, not the message. Accepting a calendar invite is reversible because accepting is reversible — not because the email said so.
urgency answers to the clock. A real deadline either exists or it doesn't.

confidence is different in kind. It's the model grading its own work: "how sure am I about the other three?" There is no source outside the model's opinion of itself. And in my rule, the AUTO branch opens only when confidence >= 0.85 (alongside conditions on the other three).
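
As a rough illustration of the gate described above, here is a minimal Python sketch. Only the 0.85 confidence threshold comes from the post; the other thresholds, the QUEUE tier name, and the way urgency enters the rule are assumptions made for the example, not the project's actual rule.

```python
from dataclasses import dataclass

AUTO_CONFIDENCE = 0.85      # the only threshold stated in the post
AUTO_SENDER_TRUST = 0.7     # hypothetical value
AUTO_REVERSIBILITY = 0.7    # hypothetical value
AUTO_MAX_URGENCY = 0.5      # hypothetical: urgent mail goes to a human


@dataclass(frozen=True)
class Scores:
    """The four model-scored features, each assumed to lie in [0, 1]."""
    confidence: float
    sender_trust: float
    reversibility: float
    urgency: float


def tier(s: Scores) -> str:
    """Deterministic rule: map model scores to a tier."""
    if (
        s.confidence >= AUTO_CONFIDENCE
        and s.sender_trust >= AUTO_SENDER_TRUST
        and s.reversibility >= AUTO_REVERSIBILITY
        and s.urgency <= AUTO_MAX_URGENCY
    ):
        return "AUTO"
    return "QUEUE"  # hypothetical name for the human-review tier
```

In this sketch every input to tier() is a number the model wrote in the same pass, and confidence is the one with nothing outside the model to check it against.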

Where that bites

The dangerous email isn't the one the model is unsure about — the low-confidence floor already routes that to the queue. It's the one the model is confidently wrong about. A polished impersonation that reads as a trusted sender is exactly a high-confidence, high-senderTrust, reversible-looking email. It walks toward AUTO through the one feature the model authors about itself, and self-graded confidence is the gate that structurally can't catch a confident lie.

What actually stops it today

I want to be precise about the blast radius, because it's smaller than that paragraph sounds.

AUTO is classify-only in the current build: an AUTO classification sets a tier and triggers no action. When execution does run, AUTO only ever maps to reversible, internal actions (archive, mark-read). And the three irreversible actions (send, hard-delete, forward-external) sit behind a deterministic floor that ignores every score. So a confident impersonation that reaches AUTO gets handled quietly but recoverably, never in a way you can't undo.
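
A minimal sketch of that kind of floor, assuming a simple action-name lookup. The action names are the ones listed above; the function name and the `human_approved` flag are hypothetical, since the post does not say what the floor requires beyond ignoring scores.

```python
IRREVERSIBLE = frozenset({"send", "hard-delete", "forward-external"})
AUTO_ACTIONS = frozenset({"archive", "mark-read"})


def may_execute(action: str, tier: str, human_approved: bool = False) -> bool:
    """Decide whether an action may run. No model score is a parameter."""
    if action in IRREVERSIBLE:
        # Deterministic floor: the tier is ignored entirely.
        # Requiring explicit approval is an assumption of this sketch.
        return human_approved
    if tier == "AUTO":
        return action in AUTO_ACTIONS
    return human_approved
```

The point of the shape is that scores cannot reach the irreversible branch at all, because they are not in the function signature.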

The seam is real; it just can't currently reach anything unrecoverable. But "bounded by the floor" is not the same as "designed right." The day AUTO starts taking even reversible actions on its own, leaning on a number the model wrote about itself is the wrong gate.

The fix is the framing

@jugeni's line is the spec: gate on corroboration the model can't author.

Make senderTrust a deterministic floor from observed history when history exists — manualOverrides >= N pins it, instead of merely suggesting it to the model in the prompt.
Source reversibility from the action the tier would trigger, by lookup, not from the model's read of the email (that's @txdesk's point, and it's already how the irreversible floor works — it just isn't how the AUTO gate works yet).
Keep confidence as a tiebreaker, never as the thing that promotes to AUTO on its own.
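
The three changes above could look roughly like this. It is a sketch under assumptions: `MIN_OVERRIDES` stands in for the post's N, `observed_trust` and `TRUST_THRESHOLD` are hypothetical, and refusing AUTO when no pinned history exists is one possible reading, not a stated design.

```python
from dataclasses import dataclass
from typing import Optional

MIN_OVERRIDES = 3       # stands in for the post's N; value is hypothetical
TRUST_THRESHOLD = 0.7   # hypothetical

# Reversibility comes from the action, by lookup, not from the email.
REVERSIBLE = {
    "archive": True,
    "mark-read": True,
    "send": False,
    "hard-delete": False,
    "forward-external": False,
}


@dataclass(frozen=True)
class SenderHistory:
    manual_overrides: int   # count of manual overrides for this sender (assumed meaning)
    observed_trust: float   # trust derived from observed correspondence (hypothetical)


def auto_eligible(action: str, history: Optional[SenderHistory]) -> bool:
    """Gate AUTO on world-anchored inputs only; confidence is not a parameter."""
    reversible = REVERSIBLE.get(action, False)  # unknown actions count as irreversible
    pinned = history is not None and history.manual_overrides >= MIN_OVERRIDES
    return reversible and pinned and history.observed_trust >= TRUST_THRESHOLD
```

Confidence can still be logged next to the result; it simply has no way to promote an email to AUTO in this version.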

The pattern generalizes past email. Any time you let a model score features for a decision, sort the features by whether their source is independent of the model's self-assessment. Gate on the ones that are. The self-graded one is scenery — useful context, never authorization.

The honest part

I haven't done this yet. Today confidence still gates AUTO, and what makes that safe is the floor underneath, not the gate itself. The thing I owe is an adversarial eval: a high-confidence, polished impersonation, measured to see whether it actually reaches AUTO — turning "I think the floor saves us" into a number instead of a belief. That's next, and the eval set is in the open if you want to write the case before I do.

Three posts in, the lesson keeps being the same shape: keep the model in the perception layer, and make every decision answer to something the model can't quietly author. AGPLv3, the whole thing: github.com/k08200/klorn.

Top comments (26)

Tae Kim • Jul 1

The "sort features by who authored them" principle generalizes cleanly into retrieval systems too — in a RAG pipeline, the model's own confidence in a retrieved answer has the same self-referential problem: it can be high precisely because the retrieved chunk sounded plausible, not because it's grounded. What we've found more reliable is treating external signals like citation overlap across retrieved chunks, or entity-level consistency with a knowledge graph, as the corroborators the model can't author — if the model is confident and the external signals agree, trust it; if the model is confident and external signals are thin or contradictory, that's the canary. The delta framing from the comments is the right eval design: measuring whether (confidence minus world-anchored corroboration) separates adversarial from cooperative samples is a much sharper instrument than measuring whether a threshold held.

yongrean • Jul 1

@taekim the RAG parallel is dead on and I hadn't connected it — confidence in a retrieved answer is high because the chunk sounded plausible, same self-referential trap. Citation overlap across chunks and entity-level KG consistency are corroborators the model can't author, same shape as sender-history consistency here. Confident + thin/contradictory external = the canary, both domains. And yes, the delta (confidence − corroboration) as a distributional separator is the eval I'm building, not a threshold pass. Might borrow the RAG framing in the writeup to show the principle isn't email-specific — thank you.

Tae Kim • Jul 2

Glad the RAG framing landed — the structural shape really is the same: the model is producing confidence from the same well it's being evaluated against, so the self-referential trap closes. The thing I'd add from running it on the retrieval side: treating (confidence − corroboration) as a distributional metric is more stable than a threshold, because the threshold you'd set in month one is wrong by month three as the corpus or the embedding model drifts. Building it as a distribution gives you a natural alarm when the gap widens even if the mean confidence stays flat.

yongrean • Jul 2

@hannune exactly — the threshold you set in month one is wrong by month three, and a distribution gives you the alarm when the gap widens even if mean confidence stays flat. Same instinct another commenter here has for a drift SLO on the canary queue — distributional (confidence − corroboration) and that meta-canary are the same idea at two layers. Going into the eval (#679) as a distribution, not a threshold. Thanks for pushing the drift side; that's the part that bites in production.

Tae Kim • Jul 3

The drift SLO framing is the right move — once the gap between confidence and corroboration widens without the mean moving, a static threshold is blind to it. Using the distribution itself as the eval input, not just the point estimate, is what makes that alarm possible. Running it as a canary-queue check rather than a one-shot gate also lets you catch slow drift before it compounds into a production regression.

yongrean • Jul 7

This one we'd already built without naming it that well — a daily calibration snapshot that tracks the full confidence distribution per tier plus a window-over-window drift signal, kept deliberately separate from the one-shot eval-floor gate that only runs at PR time. Your framing — "gap between confidence and corroboration widening without the mean moving" — is honestly a better description of what it's for than our own docs manage. Going to borrow that language.
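
To make the "gap widens while the mean stays flat" alarm concrete, here is a small sketch that compares two windows of (confidence, corroboration) pairs. The 90th-percentile statistic, the `max_shift` value, and the function names are assumptions chosen for illustration; this is not the calibration snapshot described in the thread.

```python
from statistics import quantiles
from typing import List, Tuple

Pair = Tuple[float, float]  # (confidence, corroboration), both assumed in [0, 1]


def gap_p90(window: List[Pair]) -> float:
    """90th percentile of (confidence - corroboration) within one window."""
    gaps = [conf - corr for conf, corr in window]
    return quantiles(gaps, n=10)[-1]  # last of the 9 decile cut points


def drift_alarm(prev: List[Pair], curr: List[Pair], max_shift: float = 0.1) -> bool:
    """Fire when the upper tail of the gap widens between windows."""
    return gap_p90(curr) - gap_p90(prev) > max_shift
```

A check on mean confidence alone would stay silent if half the window lost its corroboration while confidence held at 0.9; the tail of the gap distribution is what moves.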

TxDesk • Jun 30

the generalization at the end is the keeper: sort the features by whether their source is independent of the model, gate on those, treat the self-graded one as context not authorization. that outlives the email case completely, it's the rule for any model-scored decision. and the honest close is right, "the floor saves us" only becomes true once it's a number from an adversarial eval rather than a belief. that eval is the post i'd most want to read. good series.

yongrean • Jul 1

@txdesk both are the right place to land, thank you. "Zero discretion where it's unrecoverable, model input only where the worst case is a convenience miss" is the boundary in one sentence — and you're right that "the floor saves us" is a belief until the adversarial eval turns it into a number. That eval is the next post; you and a couple others here basically specced it. Catch you on the next one.

TxDesk • Jul 2

that boundary sentence started as your framing, i just tightened it, so it's yours to build on. glad it landed. and yeah, the eval is the whole thing now, everything upstream of it is hypothesis until it returns a number you can point at. if you write that post i'll be there, "here's what the floor actually caught" is the one i most want to read. catch you on the next.

yongrean • Jul 2

@txdesk "here's what the floor actually caught" is the post I want to write too — which means building the fixture and running it first. That's the next real one, not another design post. See you there.

TxDesk • Jul 4

that's the right next post and the harder one, theory's cheap, the fixture is where it either holds or falls apart. "not another design post" is exactly it. build it, run it, tell us what actually survived contact. i'll be there for that one.

yongrean • Jul 7

Holding myself to it — "here's what the floor actually caught" is the next one, and you're right that it only counts if it's a fixture, not a diagram. I've actually got real distribution/drift data sitting in production now (separate thread going with another commenter on exactly that instrumentation), so the post has real numbers to run instead of hypotheticals. Give me a bit to build the write-up around what actually tripped, not what I expected to trip — that's the whole point of the post.

Mike Czerwinski • Jun 29

"AUTO wants a corroborator the model cannot write, not a confidence it can."

You wrote the cleaner version of the line. Two pushes on the post you owe.

Confidence inverts, it doesn't demote. Self-graded confidence has the same computational shape on adversarial input as on cooperative input. That is the entire definition of a confident hallucination. The model has no internal signal to distinguish "polished impersonation" from "real trusted sender" because both produce the same output distribution. So confidence isn't a weak signal that survives as a tiebreaker. On cooperative input it is scenery; on adversarial input it is counter-evidence. High confidence paired with thin world-anchored corroboration is the signature of a well-crafted phish. The same number flips meaning depending on what the rest of the gate sees.

The adversarial eval should measure delta, not threshold. "Does the impersonation reach AUTO" answers whether the floor saved you on this run. The sharper measurement is whether the (confidence minus world-anchored-corroboration) spread separates adversarial from cooperative samples in distribution. If it does, that delta is the corroborator the model can't author. Computing it requires comparing what the model said about itself to what something outside the model said about the situation. You get the gate as a contrast, not as a number.

The pattern beyond email is the one you already named: sort features by who authored them and let only the world-authored ones decide. The self-authored one earns a different job, not authorization, not scenery, but the canary that fires when it disagrees with everything else.

Cross-domain receipt: I just shipped a piece on Telegram trading signals where seven channels advertised win rates near 78% and a forensic recompute gave 46%. Same gate. Self-authored confidence versus world-authored corroboration. Different surface, identical seam.

yongrean • Jun 30

@jugeni both land, and the first is a correction I'll take — "tiebreaker" was the wrong frame. You're right that confidence doesn't demote, it inverts: same computational shape on adversarial and cooperative input is exactly what makes a confident hallucination, so high-confidence-with-thin-corroboration isn't a weak signal, it's the phish signature. The only job that survives is the canary — confidence earns its keep as the thing that fires when it disagrees with the world-anchored features, never as an input to the gate.

And delta-not-threshold is just the better eval. "Did it reach AUTO" only tells me the floor held this run. Measuring whether the (confidence − corroboration) spread separates adversarial from cooperative samples in distribution gives me the gate as a contrast, and that contrast is the corroborator the model can't author. That's what I'll build instead.

The 78%-vs-46% trading-signal receipt is the cleanest restatement of the whole thing — self-authored confidence vs world-authored corroboration, identical seam. Stealing that as the canonical cross-domain example. Genuinely changed how I'll write the gate, thank you.
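
One way to turn "does the spread separate the two sets" into a number is a rank-based separation score: the probability that a randomly chosen adversarial sample has a larger (confidence − corroboration) delta than a randomly chosen cooperative one. The sketch below assumes both scores are already available per sample; the function names are hypothetical.

```python
from typing import Sequence


def delta(confidence: float, corroboration: float) -> float:
    """Self-authored score minus world-anchored score."""
    return confidence - corroboration


def separation(coop_deltas: Sequence[float], adv_deltas: Sequence[float]) -> float:
    """Fraction of (adversarial, cooperative) pairs where the adversarial
    delta is larger; ties count as half. 0.5 means no separation,
    1.0 means the delta fully separates the two sets."""
    wins = 0.0
    for a in adv_deltas:
        for c in coop_deltas:
            if a > c:
                wins += 1.0
            elif a == c:
                wins += 0.5
    return wins / (len(adv_deltas) * len(coop_deltas))
```

A value near 0.5 would mean the delta carries no information about adversarial input, which is a result worth knowing either way.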

Mike Czerwinski • Jun 30

Two things on the build, plus a note.

For the eval to read clean: adversarial and cooperative samples need to match in everything except the seam you are measuring. Held-out cooperative set from known-safe senders gives you the floor distribution of (confidence minus corroboration). Adversarial set needs to be hand-crafted to be high-confidence-thin-corroboration on purpose: polished impersonation, plausible domain, the four features minus the world-anchored ones (sender-history consistency, action-type lookup). If the adversarial set has lower confidence than the cooperative one, the eval is leaking signal somewhere else and the spread is not measuring what it looks like. Match on confidence first, then watch where corroboration falls apart.

The canary wiring: confidence reads the world-anchored gate, not the other way around. The gate decides. Confidence fires when its own number disagrees with what the gate concluded. That keeps it out of the vote and turns disagreement into the log line you triage later. A confident-yes the gate rejected is the post-mortem you want. A confident-yes the gate accepted is silence.

And the cross-domain example holds in both directions. The trading receipt is one half. The other is the same shape inside the gate: a model rating its own classification at 0.85 is the same act as a strategy reporting its own win rate at 78%. Different surface, same actor auditing itself.
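
The "match on confidence first" discipline can be sketched as a greedy nearest-neighbour pairing plus a leak check. The tolerances and function names below are hypothetical, and a real eval might prefer binning or a proper matching algorithm.

```python
from statistics import mean
from typing import List, Tuple

Sample = Tuple[float, float]  # (confidence, corroboration)


def match_on_confidence(
    coop: List[Sample], adv: List[Sample], max_diff: float = 0.02
) -> List[Tuple[Sample, Sample]]:
    """Pair each adversarial sample with the closest-confidence cooperative
    sample (without replacement), dropping pairs further apart than max_diff."""
    remaining = sorted(coop)
    pairs = []
    for a in sorted(adv):
        if not remaining:
            break
        best = min(range(len(remaining)), key=lambda i: abs(remaining[i][0] - a[0]))
        if abs(remaining[best][0] - a[0]) <= max_diff:
            pairs.append((remaining.pop(best), a))
    return pairs


def confidence_leak(coop: List[Sample], adv: List[Sample], tol: float = 0.05) -> bool:
    """True when the adversarial set is noticeably less confident than the
    cooperative set, which suggests the eval is leaking signal elsewhere."""
    return mean(c for c, _ in adv) < mean(c for c, _ in coop) - tol
```

Once the sets are matched, any remaining difference between the pairs lives in corroboration, which is the quantity the eval is meant to measure.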

yongrean • Jun 30

@jugeni this is the build spec, thank you. Taking two things verbatim:

The eval discipline — match the adversarial and cooperative sets on confidence first, then measure where corroboration falls apart, and treat "adversarial has lower confidence than cooperative" as a leak signal, not a result. That's the part I'd have gotten wrong: without controlling for confidence the spread just re-measures confidence. Held-out known-safe senders as the floor distribution is the right cooperative set.

And the canary wiring is the cleanest placement of confidence I've seen — the gate decides on the world-anchored features, confidence reads the gate and only fires on disagreement, and the confident-yes the gate rejected is the post-mortem. The side effect I like: "confident-yes, gate-rejected" is exactly the queue I already want to hand-audit. It's the same shape as a user override, so the canary feeds the correction loop instead of dying in a log nobody reads.

And 0.85-self-classification == 78%-self-reported-win-rate is the whole essay in one line: same actor auditing itself, different surface. That's the post.
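
The canary wiring from this exchange fits in a few lines. In this sketch the gate's decision arrives as a boolean computed elsewhere from world-anchored features; the `Decision` type and the `audit_queue` name are hypothetical, and 0.85 is reused only as the "confident" cutoff.

```python
from dataclasses import dataclass
from typing import List

CONFIDENT = 0.85  # reused from the post as the "confident-yes" cutoff


@dataclass(frozen=True)
class Decision:
    email_id: str
    gate_accepts: bool   # decided from world-anchored features only
    confidence: float    # the model's self-assessment; never part of the vote


def audit_queue(decisions: List[Decision]) -> List[str]:
    """Return ids where the model was confident but the gate said no.
    A confident-yes the gate accepted produces nothing."""
    return [
        d.email_id
        for d in decisions
        if d.confidence >= CONFIDENT and not d.gate_accepts
    ]
```

Because the output is a list of ids rather than a log line, it can feed the same hand-audit path as a user override.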

Mike Czerwinski • Jul 1

that compression is the whole thing: 0.85-self-classification and 78%-self-reported-win-rate really are the same failure wearing two outfits, an actor grading its own confidence. glad the canary wiring reads clean, that queue-not-log distinction is the part I'd want someone auditing my own systems to catch too. good build.

FastAnchor_io • Jun 30

Models should mainly be tested by buying access to large models and testing the compute behind them. Only then can you make use of their distinctive features. At current prices, Chinese models are the cheapest. After using them, I found that Chinese models connect easily with other models and vendors.
The main thing is that they are cheap. You can also ask me about model addresses and compute.

yongrean • Jun 30

@fastanchor_io cheap + OpenAI-compatible is the sweet spot here — the model's swappable on purpose, so whatever's cheapest and runs locally works. For this task it's less about raw compute and more about reading the same few signals consistently, so a small local model is plenty. Appreciate the offer.

FastAnchor_io • Jun 30

In my understanding, models like the Tongyi Qianwen models, such as Qianwen 9B and Qianwen 128B, have significant differences. Especially after connecting to the Internet, the effects after local operation are actually the same, and they all belong to text processing.

Kartik N V J K • Jul 1

The split you drew between the three signals that describe the world and confidence that describes the model's own opinion is the cleanest framing of self-reported certainty I have read. Gating AUTO at confidence >= 0.85 still leans on the one number with no external anchor, so the calibrated-but-wrong email is exactly the one that slips through. Have you tried replacing that gate with an agreement check across two independent scorers, so corroboration comes from outside the model rather than from itself?

yongrean • Jul 1

@kartik-nvjk the diagnosis is exactly right — 0.85 still leans on the one number with no external anchor, so the calibrated-but-wrong email is precisely what slips through.

On two independent scorers: it works, but only if they're independent in their failure modes, not just their weights. Two LLMs — even different families — share enough training-data priors that a good impersonation tends to fool both, so their agreement mostly doubles the confident hallucination instead of corroborating against it. An ensemble of self-authored opinions is still self-authored.

The version that does what you're after is agreement between the model and a non-model scorer. Klorn already has one: a deterministic keyword/feature scorer that runs as the LLM fallback and produces the same four signals with no model at all. LLM-vs-deterministic agreement is corroboration from genuinely outside the model, and it's basically free. Past that, the strongest anchors aren't scorers — they're world facts the model can't author (sender-history consistency, action-type reversibility). That's where this thread landed: the corroborator has to come from a source the model can't write, and a second model can write the same wrong answer.
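
A sketch of what an LLM-vs-deterministic agreement check might look like, assuming both scorers return a dict keyed by the four feature names. The tolerance and the function name are hypothetical, and this does not reflect Klorn's actual scorer interfaces.

```python
from typing import Dict, List

FEATURES = ("confidence", "senderTrust", "reversibility", "urgency")


def disagreements(
    llm: Dict[str, float], deterministic: Dict[str, float], tolerance: float = 0.2
) -> List[str]:
    """Features where the model and the non-model scorer differ by more
    than `tolerance`. An empty list counts as corroboration."""
    return [
        name
        for name in FEATURES
        if abs(llm[name] - deterministic[name]) > tolerance
    ]
```

For example, an LLM read of senderTrust at 0.9 against a deterministic read of 0.3 would surface "senderTrust" as the disagreeing feature, which is the case worth routing to review.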

Dipankar Sarkar • Jul 2

The "model can't author its own corroborator" framing is the sharp part. Worth adding: confidence can be corroborated externally, just not by asking the model for it. Run the same email through N independent samples and measure how often the four features land the same way. Self-graded confidence is one number a single pass writes; cross-sample agreement is a number no single pass authors, so the model can't inflate it the way it can inflate 0.85. It won't catch every impersonation (a confident lie tends to be stable across samples), but it cleanly separates "calibrated and consistent" from "high number, unstable underneath." The self-report and the sample-agreement are different measurements even when they happen to agree numerically.

yongrean • Jul 2

@dipankar_sarkar this doesn't exist today — the judge is a single deterministic pass (temp 0, cached), so there's no cross-sample stability signal. And you're right it's a differently-sourced number: sample-agreement is something no single pass authors, so the model can't inflate it the way it inflates 0.85. Your own caveat is the honest limit — a confident lie stays stable across samples, so it separates "calibrated and consistent" from "high but unstable," not true from false. Costs N calls, so for a cost-conscious classifier it's probably a targeted check on the AUTO-candidate tail, not every email. Filing it as a candidate corroborator — good idea, thank you.

Dipankar Sarkar • Jul 2

The AUTO-tail scoping is the right cost call. Here's one way to get a stability signal at temp 0, where you can't just resample: perturb the input, not the sampler. Keep the deterministic pass but run it against a few surface-invariant rewrites of the same email. Header order, quoting and whitespace, the trailing-signature framing. Bytes change, meaning holds. A calibrated read stays put across those. A high-confidence impersonation that leans on 'looks like a trusted sender' surface cues is exactly what wobbles when you strip the framing. Same N calls as sampling, but each one is deterministic and the perturbation is something you authored, so it stays outside the well the model grades itself from. It won't catch a semantically clean lie either, but it bites the specific tail you're already gating.

yongrean • Jul 2 • Edited

Perturbing the input instead of the sampler is the move — the judge runs a single temp-0 pass, so resampling gives me nothing, but surface-invariant rewrites (header order, quoting, whitespace, signature framing) are N deterministic passes where the bytes change and the meaning doesn't. A calibrated read holds; a high-confidence impersonation that leans on "looks like a trusted sender" surface cues is exactly what wobbles when the framing is stripped. Same cost as sampling, and the perturbation is something I author, so it stays outside the well the model grades itself from — which is the whole property I'm after. The honest limit is unchanged: a semantically clean lie is stable under perturbation too, so this separates "calibrated and consistent" from "high but framing-dependent," not true from false. One implementation trap I'll be careful about — the rewrites have to be genuinely meaning-preserving, or the delta measures the rewrite instead of the model. Scoping it to the AUTO tail as planned. Filed this as the concrete method for that check.
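
A minimal sketch of the perturbation check, assuming the judge can be called as a function from email text to a tier string. The three rewrites are illustrative stand-ins for "surface-invariant" edits; whether each one is genuinely meaning-preserving for a given corpus is exactly the trap mentioned above, and dropping the signature in particular is an assumption that may not hold.

```python
import re
from typing import Callable, List


def collapse_whitespace(text: str) -> str:
    """Collapse runs of spaces and tabs; line breaks are kept."""
    return re.sub(r"[ \t]+", " ", text).strip()


def strip_quote_markers(text: str) -> str:
    """Remove leading '>' quote markers from each line."""
    return "\n".join(re.sub(r"^(>\s?)+", "", line) for line in text.splitlines())


def drop_signature(text: str) -> str:
    """Cut everything after the conventional '-- ' signature delimiter."""
    return text.split("\n-- \n", 1)[0]


REWRITES: List[Callable[[str], str]] = [
    collapse_whitespace,
    strip_quote_markers,
    drop_signature,
]


def stability(judge: Callable[[str], str], email: str) -> float:
    """Fraction of rewrites on which the judge returns the same tier
    as it does on the original email."""
    baseline = judge(email)
    same = sum(1 for rewrite in REWRITES if judge(rewrite(email)) == baseline)
    return same / len(REWRITES)
```

A stability below 1.0 on an AUTO candidate says the tier depends on framing rather than content; it does not say the email is malicious, only that the read is not robust.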
