DEV Community

yongrean

Confidence is the one signal your model can't corroborate

This series started as a cheap-model brag and keeps getting better comments than posts. Three readers — @nazar_boyko, @txdesk, and @jugeni — independently converged on the same seam, and @jugeni put it in one line I can't improve on:

AUTO wants a corroborator the model cannot write, not a confidence it can.

Here's what that means, and why it's the sharpest critique this design has taken.

Four signals, but not four of a kind

Quick recap of the earlier posts: the LLM scores four features per email — confidence, senderTrust, reversibility, urgency — and a deterministic rule maps those to a tier. The model perceives; a rule I can read decides.

But those four aren't the same kind of thing. Three of them describe the world:

  • senderTrust can be anchored to observed history — have you actually corresponded with this person, and how often. There's a source outside the email.
  • reversibility is a property of the action the system would take, not the message. Accepting a calendar invite is reversible because accepting is reversible — not because the email said so.
  • urgency answers to the clock. A real deadline either exists or it doesn't.

confidence is different in kind. It's the model grading its own work: "how sure am I about the other three?" There is no source outside the model's opinion of itself. And in my rule, the AUTO branch opens only when confidence >= 0.85 (alongside conditions on the other features).
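A minimal sketch of that rule as described, to make the seam visible. Only the 0.85 confidence threshold comes from this post; the other two thresholds, the snake_case field names, and the two-tier `"AUTO"` / `"QUEUE"` split are assumptions for illustration, not the project's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Features:
    """The four per-email scores the LLM emits, each assumed to lie in [0, 1]."""
    confidence: float     # the model grading its own work
    sender_trust: float   # the post's senderTrust
    reversibility: float
    urgency: float        # carried along; the post does not say how it enters the AUTO rule


AUTO_CONFIDENCE = 0.85     # the only threshold stated in the post
AUTO_SENDER_TRUST = 0.8    # hypothetical value
AUTO_REVERSIBILITY = 0.8   # hypothetical value


def tier(f: Features) -> str:
    """Deterministic rule: model-authored scores in, tier out."""
    if (
        f.confidence >= AUTO_CONFIDENCE
        and f.sender_trust >= AUTO_SENDER_TRUST
        and f.reversibility >= AUTO_REVERSIBILITY
    ):
        return "AUTO"
    return "QUEUE"  # stand-in for every non-AUTO tier
```

The rule itself is readable and deterministic, but every input to the AUTO branch in this sketch is a number the model wrote. A message the model misreads confidently, say `Features(0.95, 0.9, 0.9, 0.2)`, satisfies all three conditions.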

Where that bites

The dangerous email isn't the one the model is unsure about — the low-confidence floor already routes that to the queue. It's the one the model is confidently wrong about. A polished impersonation that reads as a trusted sender is exactly a high-confidence, high-senderTrust, reversible-looking email. It walks toward AUTO through the one feature the model authors about itself, and self-graded confidence is the gate that structurally can't catch a confident lie.

What actually stops it today

I want to be precise about the blast radius, because it's smaller than that paragraph sounds.

AUTO is classify-only in the current build: an AUTO classification sets a tier and triggers no action. When execution does run, AUTO only ever maps to reversible, internal actions (archive, mark-read). And the three irreversible actions (send, hard-delete, forward-external) sit behind a deterministic floor that ignores every score. So the worst a confident impersonation can do by reaching AUTO is get archived or marked read, which is recoverable; it can never trigger anything you can't undo.
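One way to picture that floor, as a sketch under assumptions: the action names are the ones listed above, while `may_execute` and the `human_approved` flag are hypothetical names, and "explicit human approval" is only a guess at what sits behind the floor. The point it illustrates is that no model score appears anywhere in the function.

```python
# Action names are taken from the post; everything else here is illustrative.
IRREVERSIBLE_ACTIONS = frozenset({"send", "hard-delete", "forward-external"})
AUTO_ALLOWED_ACTIONS = frozenset({"archive", "mark-read"})


def may_execute(action: str, tier: str, human_approved: bool = False) -> bool:
    """Decide whether an action may run. No model score is an input."""
    # Floor first: irreversible actions never run on a tier alone.
    if action in IRREVERSIBLE_ACTIONS:
        return human_approved
    # AUTO may only trigger the reversible, internal actions.
    if tier == "AUTO":
        return action in AUTO_ALLOWED_ACTIONS
    # Every other tier waits for a person (assumption).
    return human_approved
```

Because the floor is checked before the tier, `may_execute("send", "AUTO")` is false no matter how the email was scored, while `may_execute("archive", "AUTO")` is true.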

The seam is real; it just can't currently reach anything unrecoverable. But "bounded by the floor" is not the same as "designed right." The day AUTO starts taking even reversible actions on its own, leaning on a number the model wrote about itself is the wrong gate.

The fix is the framing

@jugeni's line is the spec: gate on corroboration the model can't author.

  • Make senderTrust a deterministic floor from observed history when history exists — manualOverrides >= N pins it, instead of merely suggesting it to the model in the prompt.
  • Source reversibility from the action the tier would trigger, by lookup, not from the model's read of the email (that's @txdesk's point, and it's already how the irreversible floor works — it just isn't how the AUTO gate works yet).
  • Keep confidence as a tiebreaker, never as the thing that promotes to AUTO on its own.
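The three bullets can be sketched as a single gate. This is one possible reading, not the project's code: `MIN_OVERRIDES` stands in for N with a made-up value, `SenderHistory` and `gate_auto` are hypothetical names, refusing AUTO when no history exists is a stricter choice than the post states, and "tiebreaker" is interpreted here as "confidence can demote but never promote".

```python
from dataclasses import dataclass
from typing import Optional

MIN_OVERRIDES = 3        # the post's N; this value is hypothetical
CONFIDENCE_VETO = 0.5    # hypothetical: below this, confidence demotes to the queue

# Reversibility is a property of the action, looked up rather than scored.
REVERSIBLE_BY_ACTION = {
    "archive": True,
    "mark-read": True,
    "send": False,
    "hard-delete": False,
    "forward-external": False,
}


@dataclass(frozen=True)
class SenderHistory:
    manual_overrides: int  # the post's manualOverrides, observed outside the model


def gate_auto(action: str, history: Optional[SenderHistory], confidence: float) -> bool:
    """Return True only if AUTO is corroborated by sources the model cannot author."""
    # 1. Reversibility by lookup; unknown actions are treated as irreversible.
    if not REVERSIBLE_BY_ACTION.get(action, False):
        return False
    # 2. Sender trust pinned by observed history (assumption: no history, no AUTO).
    if history is None or history.manual_overrides < MIN_OVERRIDES:
        return False
    # 3. Confidence can only veto at this point; it never promoted anything above.
    return confidence >= CONFIDENCE_VETO
```

Under this sketch a polished impersonation from an address with no history fails at step 2 even at `confidence=0.99`, because the number the model wrote about itself is never consulted until the corroborated checks have already passed.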

The pattern generalizes past email. Any time you let a model score features for a decision, sort the features by whether their source is independent of the model's self-assessment. Gate on the ones that are. The self-graded one is scenery — useful context, never authorization.

The honest part

I haven't done this yet. Today confidence still gates AUTO, and what makes that safe is the floor underneath, not the gate itself. The thing I owe is an adversarial eval: a high-confidence, polished impersonation, measured to see whether it actually reaches AUTO — turning "I think the floor saves us" into a number instead of a belief. That's next, and the eval set is in the open if you want to write the case before I do.
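The measurement itself could be as small as the sketch below. `auto_reach_rate` is a hypothetical helper, and `classify` stands for whatever runs the full pipeline (model scoring plus the tier rule) and returns a tier name; neither is taken from the repository.

```python
from typing import Callable, Iterable


def auto_reach_rate(cases: Iterable[str], classify: Callable[[str], str]) -> float:
    """Fraction of adversarial emails that the pipeline classifies as AUTO."""
    emails = list(cases)
    if not emails:
        raise ValueError("need at least one adversarial case")
    reached = sum(1 for email in emails if classify(email) == "AUTO")
    return reached / len(emails)
```

Run against a set of polished impersonations, a result of 0.0 would support the claim that the gate holds, and anything above that is the number that replaces the belief.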

Three posts in, the lesson keeps being the same shape: keep the model in the perception layer, and make every decision answer to something the model can't quietly author. AGPLv3, the whole thing: github.com/k08200/klorn.
