DEV Community: Whatsonyourmind

Conformal intervals under a log transform: the blow-up isn't a back-transform bug

Whatsonyourmind — Mon, 06 Jul 2026 00:16:30 +0000

Someone log-transforms their target, fits a model, wraps it in conformal prediction, inverse-transforms the intervals back with expm1, and the upper bound comes out at 350x the point forecast. The natural reaction is "the back-transform is broken." I ran into exactly this framing on a real bug report recently, and the interesting part is that the back-transform is correct — the interval is doing precisely what it should. What looks like a bug is two honest effects stacking. Here's the mental model, with a runnable check.

The setup

You have skewed, positive data (demand, counts, prices). The standard move is to fit on log1p(y) and report on the original scale with expm1. You also want distribution-free intervals, so you add split conformal prediction on top. The conformal machinery runs in the model's scale — here, log space — and hands you interval endpoints lo and hi in log space. You expm1 those two columns and get your original-scale interval.

Then hi explodes. point ≈ 400, lo ≈ 200, hi ≈ 145,000. Surely you transformed something wrong.

The one fact that resolves it

Coverage is invariant under a monotone transform.

If g is strictly increasing (and expm1 is), then for any interval [lo, hi] in log space:

y ∈ [lo, hi]   ⟺   g(y) ∈ [g(lo), g(hi)]

The two events are the same event. So if your log-space interval covers the truth 90% of the time, the expm1-transformed interval [expm1(lo), expm1(hi)] covers the (back-transformed) truth 90% of the time too — exactly, not approximately. The back-transform adds zero error.

The crucial detail: you transform the endpoints, not the scores. expm1(lo) and expm1(hi), never expm1(score) added to a point forecast. Transforming an additive score through a nonlinear map is what actually breaks — but returning endpoint columns and mapping those is the correct operation, and it's usually what libraries already hand you.
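To make the endpoints-versus-scores distinction concrete, here is a minimal sketch on synthetic lognormal data. The constant log-space forecast `mu`, the seed, and the sample sizes are all assumptions chosen for illustration. It compares mapping the two endpoints through expm1 with mapping the score and adding it to the back-transformed point.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, mu, sigma = 0.10, 6.0, 0.5
cal = rng.normal(mu, sigma, 2000)     # calibration truths in log space
test = rng.normal(mu, sigma, 20000)   # test truths in log space

scores = np.abs(cal - mu)             # log-space conformity scores
k = int(np.ceil((scores.size + 1) * (1 - alpha)))
qhat = np.sort(scores)[k - 1]

def coverage(y, lo, hi):
    return float(np.mean((y >= lo) & (y <= hi)))

y = np.expm1(test)
# Correct: map the two log-space endpoints.
cov_endpoints = coverage(y, np.expm1(mu - qhat), np.expm1(mu + qhat))
# Wrong: map the score, then add it to the back-transformed point forecast.
cov_score = coverage(y, np.expm1(mu) - np.expm1(qhat), np.expm1(mu) + np.expm1(qhat))

print(f"endpoints mapped: {cov_endpoints:.3f}")
print(f"score mapped    : {cov_score:.3f}")
```

In this setup the endpoint version should land near the 90% target, while the score version collapses to a sliver around the point forecast, because expm1 of a log-space residual is not a residual in the original scale.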

So why is `hi` enormous?

Two things, both honest:

Lognormal skew under a convex map. expm1 is convex, so a roughly symmetric interval in log space is supposed to become very asymmetric and wide on the top in the original scale. A right-skewed quantity has a genuinely long upper tail. Some of that 350x is simply correct.
An inflated log-space score, then exponentiated. Conformal width is set by an upper quantile of the calibration residuals. On a short, badly-specified, or thin calibration set, that quantile is large and unstable — and exp of a large number is a very large number. This is small-sample variance amplified by the transform, not miscoverage.
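A small sketch of how the two effects combine, assuming a symmetric log-space interval mu ± qhat (the helper name `tail_ratio` is hypothetical). Algebraically, the upper half-width divided by the lower half-width after expm1 is exp(qhat), so the asymmetry is fixed by convexity and then scaled by whatever qhat the calibration set produced.

```python
import numpy as np

def tail_ratio(mu, qhat):
    """Upper half-width divided by lower half-width after mapping mu +/- qhat through expm1."""
    point = np.expm1(mu)
    upper = np.expm1(mu + qhat) - point
    lower = point - np.expm1(mu - qhat)
    return upper / lower

for qhat in (0.82, 2.99):
    ratio = tail_ratio(6.0, qhat)
    print(f"qhat={qhat:.2f}  upper/lower width={ratio:.2f}  exp(qhat)={np.exp(qhat):.2f}")
```

The ratio does not depend on mu, which is one way to see that a large upper bound reflects the size of qhat rather than anything the back-transform did to the forecast.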

Show me

Self-contained, no forecasting library needed — just the conformal mechanism on lognormal data:

import numpy as np
rng = np.random.default_rng(0)

def split_conformal_logspace(cal_true, cal_pred, mean_log, alpha):
    scores = np.abs(cal_true - cal_pred)            # |residual| in log space
    n = scores.size
    k = int(np.ceil((n + 1) * (1 - alpha)))         # finite-sample rank
    qhat = np.inf if k > n else np.sort(scores)[k - 1]
    return mean_log - qhat, mean_log + qhat, qhat

def coverage(y, lo, hi):
    return float(np.mean((y >= lo) & (y <= hi)))

alpha, mu, sigma = 0.10, 6.0, 0.5
cal   = rng.normal(mu, sigma, 2000)
test  = rng.normal(mu, sigma, 20000)

lo, hi, qhat = split_conformal_logspace(cal, np.full_like(cal, mu), mu, alpha)

cov_log  = coverage(test, lo, hi)                          # log space
cov_orig = coverage(np.expm1(test), np.expm1(lo), np.expm1(hi))  # after expm1

print(f"log-space coverage : {cov_log:.4f}")
print(f"orig-space coverage: {cov_orig:.4f}")
print(f"identical          : {cov_log == cov_orig}")
print(f"hi/point ratio     : {np.expm1(hi)/np.expm1(mu):.1f}x")

Output:

log-space coverage : 0.8999
orig-space coverage: 0.8999
identical          : True
hi/point ratio     : 2.3x

Coverage is bit-for-bit identical across scales. Now shrink and corrupt the calibration set the way a misspecified model on a short series would — few effective residuals, inflated variance:

for n_cal, extra in [(2000, 0.0), (30, 0.0), (30, 1.5)]:
    c   = rng.normal(mu, sigma, n_cal)
    pred = np.full(n_cal, mu) + rng.normal(0, extra, n_cal)   # misspecification
    lo, hi, q = split_conformal_logspace(c, pred, mu, alpha)
    print(f"n={n_cal:>4} extra={extra}  qhat={q:5.2f}  hi/point={np.expm1(hi)/np.expm1(mu):7.1f}x")

n=2000 extra=0.0  qhat= 0.82  hi/point=    2.3x
n=  30 extra=0.0  qhat= 0.92  hi/point=    2.5x
n=  30 extra=1.5  qhat= 2.99  hi/point=   19.9x

The upper endpoint balloons with qhat — driven by the calibration set, not the back-transform — and coverage stays valid the whole way.

What to actually do

Keep transforming endpoints, not scores. The expm1(lo), expm1(hi) step is correct. Don't "fix" it.
Read a giant upper bound as a diagnostic. It usually means your log-space score quantile is unstable: more calibration history, more windows, or a better-specified model. Inspecting intervals in the modeling scale makes this obvious.
Prefer an internal transform when the library supports it. If the transform lives inside the model (e.g. a working Box-Cox path), conformity scores are computed and back-transformed consistently and you never hand-roll log1p/expm1 — which sidesteps the whole class of confusion.
Expect asymmetry, and treat it as honest. For right-skewed targets a wide, one-sided upper interval is the correct answer, not a rendering glitch.

The through-line: a monotone transform is one of the few things conformal prediction handles for free. The coverage guarantee rides along untouched. When an interval looks absurd after expm1, the transform is rarely the culprit — the calibration is where to look.

Disclaimer: This article was drafted with AI assistance and reviewed and edited by the author. The technical design and opinions are my own.

Marginal coverage is a lie of averages: the conformal diagnostics that catch it

Whatsonyourmind — Sun, 05 Jul 2026 02:49:53 +0000

You wrapped your classifier in a conformal predictor, calibrated it for 90% coverage, checked the held-out set, and saw 90.2%. Ship it.

That number is real — and it can still be hiding a model that badly under-covers exactly the cases you care about. Marginal coverage is an average, and averages launder failure. This is a different problem from conformal prediction breaking under drift: here the exchangeability holds and the marginal guarantee is genuinely met — the method is just quietly unfair across the slices of your data. Two cheap diagnostics catch it.

What the marginal number actually promises

Split-conformal prediction gives you a marginal coverage guarantee: over a fresh exchangeable sample, the true label lands in the prediction set C(x) at least 1 − α of the time. That's it. It says nothing about coverage conditional on the input, the true class, or the difficulty of the example.

And marginal coverage is trivially satisfiable. A predictor can hit 90% on the nose by over-covering the easy region and under-covering the hard one — the two errors net out in the average. The guarantee is honest; your reading of it is not.

A 90% predictor that fails a third of your classes

Three classes, 100 held-out test points. Suppose:

Classes A and B: 80 points, true label in the set for 76 of them → 95%.
Class C: 20 points, true label in the set for 14 → 70%.

Marginal coverage = (76 + 14) / 100 = 90%. Exactly on target. And class C — maybe your rare-but-critical class, the fraud case, the malignant scan — is covered 70% of the time. The headline number told you none of this.

The fix is to stop averaging over the thing that matters. Report the worst-class coverage gap:

import numpy as np

def worst_class_coverage(y_true, in_set, n_classes):
    # in_set[i] = True iff the true label of sample i is in its prediction set
    y_true = np.asarray(y_true)
    in_set = np.asarray(in_set, dtype=float)
    per_class = {
        k: in_set[y_true == k].mean()
        for k in range(n_classes) if (y_true == k).any()
    }
    worst = min(per_class, key=per_class.get)
    return worst, per_class[worst], per_class

One min over per-class coverage turns "90% overall" into "70% on class C" — the number you'd actually want on a dashboard.
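As a quick check, the sketch below feeds the three-class example through that function. The split of the 76 covered points across A and B (38 each) is an assumption, since only their combined total is given; classes are encoded as 0, 1, 2 for A, B, C.

```python
import numpy as np

def worst_class_coverage(y_true, in_set, n_classes):
    # in_set[i] = True iff the true label of sample i is in its prediction set
    y_true = np.asarray(y_true)
    in_set = np.asarray(in_set, dtype=float)
    per_class = {
        k: in_set[y_true == k].mean()
        for k in range(n_classes) if (y_true == k).any()
    }
    worst = min(per_class, key=per_class.get)
    return worst, per_class[worst], per_class

# 40 points each for A and B (38 covered each, assumed split), 20 points for C (14 covered).
y_true = np.array([0] * 40 + [1] * 40 + [2] * 20)
in_set = np.array(([True] * 38 + [False] * 2) * 2 + [True] * 14 + [False] * 6)

marginal = float(in_set.mean())
worst, worst_cov, per_class = worst_class_coverage(y_true, in_set, n_classes=3)
print(f"marginal={marginal:.2f}  worst class={worst}  coverage={worst_cov:.2f}")
```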

The failure marginal coverage hides even from per-class checks: set size

Class-conditional coverage catches which label gets shortchanged. But conformal sets have a second axis that leaks coverage: size. A method can be systematically overconfident on the inputs it thinks are easy — the ones it hands a singleton {ŷ} — and lean on big, cautious sets elsewhere to make the average whole.

Angelopoulos & Bates call the diagnostic size-stratified coverage (SSC): bucket the samples by the size of their prediction set |C(x)|, then check coverage within each bucket. A conditionally honest method covers ≥ 1 − α in every size stratum. A method that under-covers its singletons — the confident-but-wrong region — shows it here and nowhere else:

def size_stratified_coverage(sizes, in_set, min_stratum=20):
    sizes, in_set = np.asarray(sizes), np.asarray(in_set, dtype=float)
    out = {}
    for s in np.unique(sizes):
        m = sizes == s
        out[int(s)] = {"coverage": in_set[m].mean(), "count": int(m.sum())}
    # ignore tiny strata (noisy); report the worst of the rest
    eligible = {s: d["coverage"] for s, d in out.items() if d["count"] >= min_stratum}
    worst = min(eligible.values()) if eligible else None
    return worst, out

If your size-1 stratum sits at 82% while everything else is at 95% and the marginal lands at 90%, you don't have a 90% predictor. You have a predictor that is wrong nearly one time in five exactly when it tells you it's sure, and a single averaged number will never say so.

While you're at it: is the set even useful?

Coverage is only half the story, because coverage is free: the set containing all K classes covers 100% of the time and tells you nothing. So pair coverage with an informativeness read — average set size, singleton rate, and a size efficiency relative to the trivial all-K set:

def size_efficiency(sizes, K):
    if K <= 1:
        return 1.0
    avg = np.asarray(sizes).mean()
    return float(np.clip(1 - (avg - 1) / (K - 1), 0, 1))  # 1 = all singletons, 0 = all-K sets

The rule I use: only credit tightness on the strata that actually pass coverage. A razor-thin set that under-covers isn't efficient, it's wrong — rewarding it for being small is how you talk yourself into shipping the 82% singleton region.
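One possible reading of that rule, as a sketch: compute the same efficiency per size stratum, but give zero credit to any stratum whose coverage falls below 1 − α. The function name `credited_efficiency`, the count-weighted averaging, and the synthetic data are assumptions, not a fixed definition.

```python
import numpy as np

def credited_efficiency(sizes, in_set, K, alpha):
    """Size efficiency where strata that miss 1 - alpha coverage earn no credit."""
    sizes = np.asarray(sizes)
    in_set = np.asarray(in_set, dtype=float)
    if K <= 1:
        return 1.0
    credit = 0.0
    for s in np.unique(sizes):
        m = sizes == s
        if in_set[m].mean() >= 1 - alpha:          # stratum passes coverage
            credit += (1 - (s - 1) / (K - 1)) * m.sum()
    return float(credit / sizes.size)

# Synthetic case: singletons cover 82%, size-2 sets cover 98%, marginal is 90%.
K, alpha = 4, 0.10
sizes = np.array([1] * 100 + [2] * 100)
in_set = np.array([True] * 82 + [False] * 18 + [True] * 98 + [False] * 2)

raw = float(1 - (sizes.mean() - 1) / (K - 1))      # ungated efficiency
gated = credited_efficiency(sizes, in_set, K, alpha)
print(f"marginal={in_set.mean():.2f}  raw efficiency={raw:.3f}  credited={gated:.3f}")
```

The ungated number rewards the under-covering singletons for being small; the gated one only counts the size-2 stratum, which is the behaviour the rule asks for. A production version would probably also skip tiny strata, as the SSC function does.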

The honest caveat

You cannot get exact conditional coverage for free. Distribution-free conditional coverage is impossible in finite samples (Vovk, 2012; Barber, Candès, Ramdas & Tibshirani, 2021) — that's a theorem, not a tooling gap. Class-conditional coverage and SSC are diagnostics, not guarantees: they stratify by things you can observe (label, set size) and surface where the marginal average is covering for a conditional failure. They won't certify conditional validity; they'll just stop you from shipping a number that lies by omission.

I'm adding both as first-class diagnostics to TrustLens (an open-source model-reliability library), because "report the worst stratum, not just the mean" is the same discipline that makes any reliability metric trustworthy. But you don't need a library — the three functions above are the whole idea. Compute them next to your marginal number, and the next time a predictor claims 90%, you'll know whether it means it.

A Capability Token for Agent Tool Calls: One Signed Object That Is Both the Gate and the Audit

Whatsonyourmind — Fri, 03 Jul 2026 15:44:38 +0000

When an LLM agent decides to call a tool, something has to say "yes." In most codebases that "yes" is one of two things: a boolean returned from a policy check, or a row appended to an event log after the fact. Both are weak. A boolean carries no evidence — it's gone the instant the branch is taken. An event log carries no authority — it's written after the executor already committed to running the tool, so it can't gate anything, and it can be edited later without anyone noticing.

This piece is the sequel to my earlier one, Stop trusting the agent: bind tool-call approvals to the exact call, where I argued you should bind the approval to the exact call's arguments so an approval for one payload can't be reused on another. That fixed one attack. It left three others standing. Here I want to define the full object — a capability token — and show that the same object is simultaneously (a) the thing the executor checks before running the tool and (b) the audit record after. Enforcement and evidence collapse into one signed value. The earlier article bound one approval to one call; this one specifies the whole token and checks it, unchanged, across three agent frameworks.

The token

CapabilityToken = {
    "tool": "transfer",
    "args_hash": "sha256(canonical(args))",
    "caller_context_hash": "sha256(agent_id | session_id | user_id)",
    "approved_for": {"step_index": 7, "attempt": 0},
    "policy_version": "pol-2026-07-03:9f21c...",   # or content-hash of the rule set
    "prev_entry_hash": "sha256(previous ledger entry)",
    "exp": 1783096000,                             # wall-clock expiry (Unix seconds), still present
    "sig": "ed25519(all of the above)"
}

Each field exists to kill a specific failure that a boolean or a plain event log cannot catch.

1. args_hash — bind to the exact arguments. This is the predecessor's whole point, so briefly: an approval for transfer(amount=10) must not be replayable onto transfer(amount=10000). Hash the canonicalized args into the signed token; the executor recomputes the hash from the actual call and rejects on mismatch. Done. Move on.
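A minimal sketch of the args-binding check, assuming sorted-key compact JSON as the canonical form (a real deployment might prefer a specified scheme such as RFC 8785). The helper names and the example arguments are hypothetical.

```python
import hashlib
import hmac
import json

def canonical(args: dict) -> bytes:
    # Assumption: sorted keys + compact separators stand in for a real canonicalization scheme.
    return json.dumps(args, sort_keys=True, separators=(",", ":"), ensure_ascii=False).encode("utf-8")

def args_hash(args: dict) -> str:
    return hashlib.sha256(canonical(args)).hexdigest()

def args_match(token: dict, actual_args: dict) -> bool:
    # Recompute from the live call and compare in constant time.
    return hmac.compare_digest(token["args_hash"], args_hash(actual_args))

token = {"tool": "transfer", "args_hash": args_hash({"to": "acct-1", "amount": 10})}

print(args_match(token, {"amount": 10, "to": "acct-1"}))      # same call, different key order
print(args_match(token, {"amount": 10000, "to": "acct-1"}))   # replay onto a bigger payload
```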

The next three are the failure classes that per-call args-binding alone cannot catch. This is where the article advances past the last one.

2. caller_context_hash — bind to who is calling. Args-binding stops payload swaps but says nothing about context. A token minted for agent A in session S is still a perfectly valid signature over transfer(amount=10). Lift it into agent B's run, or a different user's session, and the args still match. Bind a hash of the caller identity (agent id, session id, user id) into the token and the executor rejects any call whose live context doesn't reproduce the hash. A token becomes non-transferable across contexts.

3. approved_for {step_index, attempt} — two clocks, not one. Wall-clock exp is necessary but not sufficient. Consider a retry queue: an approval is minted, the attempt fails, the payload sits in the queue, and a later retry picks it up and executes "fresh" — still inside its wall-clock window, args still matching, context still matching. Time-based freshness passed and the wrong thing happened. The fix is a second clock: bind the approval to a point in the execution sequence — approved for step N, attempt M. An approval is for a specific attempt, not for every retry that happens to reuse its payload. The executor checks both: not expired and this is the attempt it was minted for.
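The two-clock check can be sketched in a few lines. The function name `is_fresh` and the way the executor learns its current step and attempt are assumptions; the token fields follow the layout above.

```python
import time

def is_fresh(token, step_index, attempt, now=None):
    """True only if the token is unexpired AND was minted for this exact step and attempt."""
    now = time.time() if now is None else now
    not_expired = now < token["exp"]
    approved = token["approved_for"]
    same_slot = approved["step_index"] == step_index and approved["attempt"] == attempt
    return not_expired and same_slot

tok = {"exp": 1000, "approved_for": {"step_index": 7, "attempt": 0}}

print(is_fresh(tok, step_index=7, attempt=0, now=900))    # original attempt, inside the window
print(is_fresh(tok, step_index=7, attempt=1, now=900))    # retry reusing the payload
print(is_fresh(tok, step_index=7, attempt=0, now=1001))   # expired
```

The second call is the retry-queue case: the wall clock alone would have accepted it.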

4. policy_version — reconstructable authority, not a dangling pointer. Suppose you log "rule R42 fired." R42 lives in a mutable policy store. Six weeks later, during an incident review, you look up R42 — and it now says something different, because someone edited the policy. Your log told you which rule fired but not what it said at decision time. Bind the policy version (or, better, a content-hash of the exact rule set) into the token. Now the entry reconstructs the authority under which the call ran — the decision is reproducible, not merely telemetry pointing at a moving target.

5. prev_entry_hash + a periodic external anchor — chain integrity. Here's the failure that individually-perfect tokens still miss: absence. Every entry can be well-formed, correctly signed, args-bound, context-bound — and the tail can be silently gone. A crash mid-write, an aggressive log rotation, or a deliberate tamper drops the last N entries, and a dropped tail is indistinguishable from "those calls never happened." You cannot tell missing from removed. So hash-chain the entries — each token carries the hash of the previous one — and periodically publish a checkpoint hash outside the ledger's own trust domain (a transparency log, a second account, anything the ledger's writer doesn't control). Now a broken chain is visible, and a truncation past the last anchor is detectable. Absence of an entry becomes distinguishable from removal of an entry.
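A rough sketch of the chain-plus-anchor idea, assuming entries are plain dicts hashed as sorted-key JSON and anchors are (entry_count, head_hash) pairs kept somewhere the ledger writer cannot modify. All names here are hypothetical.

```python
import copy
import hashlib
import json

GENESIS = "0" * 64  # hypothetical marker for "no previous entry"

def entry_hash(entry):
    raw = json.dumps(entry, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

def append(ledger, payload):
    prev = entry_hash(ledger[-1]) if ledger else GENESIS
    ledger.append({"payload": payload, "prev_entry_hash": prev})
    return entry_hash(ledger[-1])          # new head; publish some of these as anchors

def verify_chain(ledger, anchors):
    """anchors: (entry_count, head_hash) pairs held outside the ledger, entry_count >= 1."""
    prev, heads = GENESIS, []
    for entry in ledger:
        if entry["prev_entry_hash"] != prev:
            return "broken chain"
        prev = entry_hash(entry)
        heads.append(prev)
    for count, head in anchors:
        if count > len(ledger):
            return "truncated"
        if heads[count - 1] != head:
            return "anchor mismatch"
    return "ok"

ledger, anchors = [], []
for i in range(5):
    head = append(ledger, {"tool": "transfer", "step_index": i})
    if i == 2:
        anchors.append((3, head))          # checkpoint after the third entry

tampered = copy.deepcopy(ledger)
tampered[1]["payload"]["step_index"] = 99

print(verify_chain(ledger, anchors))        # intact ledger
print(verify_chain(tampered, anchors))      # edited entry breaks the next link
print(verify_chain(ledger[:2], anchors))    # tail dropped before the anchor
print(verify_chain(ledger[:4], anchors))    # tail dropped after the anchor
```

The last call still verifies, which shows the limit of the scheme: entries written after the most recent anchor can vanish unnoticed, so anchor frequency sets the detection window.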

6. sig — the audit half. Sign with an asymmetric key so the token is non-repudiable and verifiable by parties who can't mint tokens. This is what lets the same object be evidence: anyone with the public key can check it, later, offline.

The same token, three frameworks

The token is the invariant. Frameworks only differ in where you check it.

Google ADK (adk-python). The agent (LlmAgent) accepts before_tool_callback and after_tool_callback. The before-callback is the gate; the after-callback writes the evidence, using the same object.

def before_tool_callback(tool, args, tool_context):
    tok = tool_context.state["pending_token"]
    # verify() is your own check: args_hash, caller_context, approved_for, policy, sig
    if not verify(tok, tool.name, args, tool_context):
        return {"error": "capability check failed"}   # returning a dict blocks the call
    return None                                       # None lets the tool run

def after_tool_callback(tool, args, tool_context, tool_response):
    # same token, finalized with the result: chain + anchor
    append_to_ledger(sign(finalize(tool_context.state["pending_token"], tool_response)))
    return None                                       # leave the tool's response unchanged

Microsoft Semantic Kernel. The natural hook is an auto-function-invocation filter. DENY = don't call next() / return a refusal. REDACT = mutate context.arguments before next(). The genuinely missing primitive is REQUIRE_APPROVAL, and SK's shape forces its meaning: the auto-invoke loop is synchronous, so "approval" cannot be an in-loop await — that would hold the chat-completion connection open while a human decides. It has to mean terminate-and-resume:

async def on_auto_invoke(context, next):
    tok = context.arguments.get("_cap_token")
    verdict = verify(tok, context.function, context.arguments)   # your own policy check
    if verdict == DENY:
        return                                   # skip next(context); refuse
    elif verdict == REQUIRE_APPROVAL:
        context.terminate = True                 # exit the loop; resume later
        # ...resume by re-invoking with a fresh, argument-bound token for this exact call
    elif verdict == REDACT:
        context.arguments = redact(context.arguments)
        await next(context)                      # the filter chain expects the context
    else:
        await next(context)

pydantic-ai. Check the token in the tool wrapper / RunContext before the body runs.

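import functools

def guarded(fn):
    @functools.wraps(fn)   # keep fn's name, docstring, and signature visible to the framework
    def wrapper(ctx: RunContext, **kwargs):
        # verify() and ToolDenied are your own helpers, not part of pydantic-ai
        if not verify(ctx.deps.token, fn.__name__, kwargs, ctx):
            raise ToolDenied(fn.__name__)
        return fn(ctx, **kwargs)
    return wrapper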

Three call sites, one object. The freshness bug (field 3), the context-lift bug (field 2), and the moving-policy bug (field 4) are caught identically in all three, because they live in the token, not the framework. These are the same problems being worked through right now in upstream discussions I've taken part in — the ADK decision-ledger issue, the Semantic Kernel auto-function-invocation approval gap, and a pydantic-ai proposal to replace the plain tool_call_approved bool with an HMAC-bound approval token carrying (run_id, tool_call_id, expiry) — where the recurring question is always where the check belongs, once you accept that the invariant is a single bound object.

Evidence, not telemetry

That's the whole payoff. A plain event log is telemetry: it tells you a story about the past that you have to trust the storyteller for. A decision ledger of capability tokens is evidence — each entry is tamper-evident (signed + chained + anchored) and still meaningful when replayed later (args, caller, sequence position, and the exact policy text are all bound in). You can hand it to someone who wasn't there, who can't mint tokens, weeks after the fact, and they can check it. A boolean can't do that. A log line can't do that. One signed object does both jobs.

The token itself is maybe a hundred lines of your own code. The discipline is deciding it's a first-class object in your agent, not an afterthought bolted on when something has already gone wrong.

Traces show what your agent did - a decision ledger shows what it was allowed to do

Whatsonyourmind — Thu, 25 Jun 2026 12:11:20 +0000

Agent observability has gotten good at answering what happened: OpenTelemetry spans for each model call and tool execution, structured event logs, replayable traces. If a run misbehaves, you can reconstruct the sequence.

But for anything that has to stand up to an incident review or a compliance ask, "what happened" isn't the question. The question is what was authorized:

Why was this tool selected for this step?
Under whose authority did the call run — agent credentials, or a specific user's?
What did a guardrail refuse, and on what rule?
What confirmation was required, and what approval made the action permissible?

Every one of those passes through a decision point in your agent runtime — a policy callback, a confirmation gate, a per-tool auth check. But traces describe execution; almost nothing writes down the authority. That's the gap a decision ledger fills.

Here's the part that took me a while to get right: a decision ledger that's just "more events" buys you nothing. To be auditable rather than merely verbose, it has to support a verifier that can prove executed == authorized without trusting the agent's own narration. That decomposes into three layers, and each catches a failure the others can't.

Layer 1 — Entry conformance

Each decision and each outcome is a well-formed, canonicalized, hash-bound record. The load-bearing field is on the outcome: it must commit to the decision that authorized it.

decision_event = { decision_id, action_ref, principal, auth_mode,
                   policy_version, decision_state, args_digest, ts }

outcome_event  = { action_ref,
                   decision_digest = SHA256(JCS(decision_event)),
                   result_digest, terminal_state, ts }

action_ref answers "are these two events about the same intended action?" — make it content-derived (e.g. SHA256(JCS({agent_id, action_type, scope, ts}))) so any verifier can recompute it from the intent alone, with no shared runtime state.

decision_digest answers a different question: "did this outcome commit to the exact decision that authorized it?" Keep the two separate — collapsing them loses your ability to catch a swapped outcome (a result re-attributed to the wrong decision).
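A sketch of why the two bindings are kept apart, assuming sorted-key compact JSON as a stand-in for JCS and entirely hypothetical field values. Two decisions share an action_ref; only one of them is the decision the outcome committed to.

```python
import hashlib
import json

def digest(obj):
    # Assumption: sorted-key compact JSON approximates JCS (RFC 8785) for these simple values.
    raw = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

intent = {"agent_id": "agent-1", "action_type": "transfer", "scope": "acct-42", "ts": 1000}
ref = digest(intent)   # content-derived action_ref

decision = {"decision_id": "d-1", "action_ref": ref, "principal": "user-7",
            "auth_mode": "user", "policy_version": "pol-1", "decision_state": "allowed",
            "args_digest": digest({"amount": 10}), "ts": 1001}
# Same intended action, but a different authority behind it.
other = dict(decision, decision_id="d-2", principal="agent-1", auth_mode="agent")

outcome = {"action_ref": ref, "decision_digest": digest(decision),
           "result_digest": digest({"status": "ok"}), "terminal_state": "succeeded", "ts": 1002}

def commits_to(outcome, decision):
    same_action = outcome["action_ref"] == decision["action_ref"]
    same_decision = outcome["decision_digest"] == digest(decision)
    return same_action, same_decision

print(commits_to(outcome, decision))   # both bindings hold
print(commits_to(outcome, other))      # same action, wrong decision
```

If the outcome carried only action_ref, the second case would pass and the result could be re-attributed to the other decision.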

Layer 2 — Log completeness

Layer 1 can only reason about entries that exist. It cannot see an entry that was never written — and that's the highest-stakes failure for incident response, because a tool call that bypassed the policy path (or a crash between authority-grant and ledger-write) looks like silence, not a malformed row.

Close it by chaining: each entry carries prev_digest pointing at the prior ledger head, and each turn/session close records the current ledger_head_digest. Now the ledger is an append-only chain, and a dropped entry shows up as a broken chain — detectable without trusting the writer.

This catches two things Layer 1 can't:

Orphaned authority — a decision says allowed, the handler then raises or times out, and no outcome is ever written. Indistinguishable from "allowed and silently succeeded" unless the chain expects exactly one terminal outcome for every allowed.
Silent omission — an entry simply missing.
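The "exactly one terminal outcome for every allowed" rule can be sketched as a count over the ledger. The entry shape used here (a `kind` field, and a precomputed `decision_digest` on decisions as well as outcomes) is an assumption made to keep the example short.

```python
from collections import Counter

def orphaned_authority(entries):
    """Return digests of allowed decisions that do not have exactly one outcome."""
    allowed = [e["decision_digest"] for e in entries
               if e["kind"] == "decision" and e["decision_state"] == "allowed"]
    outcomes = Counter(e["decision_digest"] for e in entries if e["kind"] == "outcome")
    return [d for d in allowed if outcomes[d] != 1]

entries = [
    {"kind": "decision", "decision_digest": "d1", "decision_state": "allowed"},
    {"kind": "outcome",  "decision_digest": "d1", "terminal_state": "succeeded"},
    {"kind": "decision", "decision_digest": "d2", "decision_state": "allowed"},   # handler timed out
    {"kind": "decision", "decision_digest": "d3", "decision_state": "denied"},    # nothing should run
]

print(orphaned_authority(entries))
```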

Layer 3 — Execution completeness

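The final layer ties the ledger back to the execution trace you already emit. Require a bijection at the action boundary: every tool execution recorded in the trace maps to exactly one ledger entry, and every ledger entry that authorized an execution maps to exactly one trace span.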

The trace proves execution happened; the ledger proves it was authorized; the bijection between them is the "no tool executes off-ledger" invariant. It's the omission detector that Layer 1's per-entry rules structurally cannot express, because it reasons across two independent systems.
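A sketch of that cross-system check, assuming each tool-execution span and each ledger outcome carries the same action_ref as a join key (that attribute on spans is an assumption, as are the function names).

```python
from collections import Counter

def off_ledger_report(trace_refs, ledger_refs):
    """Compare action_refs seen in the trace with action_refs seen in the ledger."""
    trace, ledger = Counter(trace_refs), Counter(ledger_refs)
    return {
        "executed_off_ledger": sorted(r for r in trace if r not in ledger),
        "authorized_not_executed": sorted(r for r in ledger if r not in trace),
        "duplicated": sorted(r for r in trace.keys() | ledger.keys()
                             if trace[r] > 1 or ledger[r] > 1),
    }

def is_bijection(report):
    return not any(report.values())

trace_refs = ["a1", "a2", "a3"]     # action_refs pulled from tool-execution spans
ledger_refs = ["a1", "a2", "a4"]    # action_refs pulled from ledger outcomes

report = off_ledger_report(trace_refs, ledger_refs)
print(report)
print(is_bijection(report))
```

Because the two inputs come from independent systems, a tool call that bypassed the policy path shows up in the first bucket even though the ledger itself is internally consistent.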

Why three layers

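Put together, the invariant a verifier can now assert is executed == authorized: every action that ran was authorized, and every authorization is accounted for in the ledger.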

That's the actual compliance property — and you cannot get it from logging alone, no matter how thorough. Per-entry conformance proves each record is well-formed and bound; the chain proves the set is complete; the bijection proves the set matches reality.

The deeper principle is one I keep coming back to: a step that reasons can only ask you to trust it; a step that emits a re-checkable artifact — a content hash, a solver's optimality certificate, a recomputable digest — turns "we logged it" into "anyone can re-run it and get the same answer." Move the factual, state-changing parts of an agent through deterministic tools that leave certificates, and the audit stops being a leap of faith.

(That re-checkable-certificate idea is what I've been building into OraClaw — deterministic decision tools that return verifiable results — but the three-layer ledger above is framework-agnostic; it's worth wiring into whatever runtime you're on.)

If you're building agents that will ever face an auditor, the cheapest time to add the ledger is before you need it.

Stop letting your AI agent eyeball A/B picks — wire in a real contextual bandit via MCP (free, no key)

Whatsonyourmind — Wed, 24 Jun 2026 06:33:45 +0000

If you give an LLM agent a table of A/B variants and ask "which one should we send next?", it will confidently pick the one with the highest conversion rate.

That feels right. It is often wrong.

The model has no concept of sample size, exploration, or regret. It pattern-matches "biggest number = winner" and moves on. For a one-off question, fine. But inside an agent loop that picks a variant on every request — email subject lines, ad copy, model routing, recommendation ranking — that naïve pick quietly accumulates regret and starves the options it never gave a fair chance.

The fix isn't a better prompt. It's to not ask the LLM to do the math at all. Route the decision to a real bandit algorithm and let the model do what it's good at (orchestration, language) while a deterministic solver does what it's good at (the optimization).

This post is a copy-paste demo you can run in your terminal right now, no signup, no API key. I'll use OraClaw — a deterministic decision-intelligence MCP server — but the point stands regardless of tool: stop letting the model guess at math it can verify.

The trap, concretely

Here's a realistic state mid-experiment. Three subject lines, different amounts of traffic:

Variant	Pulls	Rewards (conversions)	Raw rate
A	120	18	15.0%
B	80	17	21.3%
C	15	4	26.7%

Ask an LLM "which should we send next?" and you'll usually get B — it has the best rate among the well-tested variants, and C "only has 15 samples, too noisy to trust."

That reasoning sounds responsible. It's exactly backwards. With only 15 pulls, C is under-explored: its 26.7% could be noise or a real lead, we can't yet tell which, and the cost of finding out is tiny. A bandit's whole job is to weigh that uncertainty instead of hand-waving it away.

Let's get a real answer.

Run it yourself: the no-key REST endpoint (60 seconds)

OraClaw exposes a free, no-auth REST endpoint. Paste this into your terminal — nothing to install, nothing to sign up for:

curl -s -X POST https://oraclaw-api.onrender.com/api/v1/optimize/bandit \
  -H "Content-Type: application/json" \
  -d '{
    "algorithm": "ucb1",
    "arms": [
      {"id": "variant_a", "name": "Subject line A", "pulls": 120, "totalReward": 18},
      {"id": "variant_b", "name": "Subject line B", "pulls": 80,  "totalReward": 17},
      {"id": "variant_c", "name": "Subject line C", "pulls": 15,  "totalReward": 4}
    ]
  }'

The response (this is the actual output, abbreviated):

{
  "selected": { "id": "variant_c", "name": "Subject line C" },
  "score": 1.4633997784480877,
  "algorithm": "ucb1",
  "exploitation": 0.2666666666666666,
  "exploration": 1.196733111781421,
  "regret": {
    "cumulativeRegret": 18.333333333333314,
    "averageRegret": 0.08527131782945728,
    "estimatedOptimalArm": "variant_c",
    "totalPulls": 215
  }
}

UCB1 picks C, and the response shows why in a way you can audit: a modest exploitation term (C's observed rate, 0.267, is the highest of the three but rests on only 15 pulls) plus a large exploration bonus (we've barely tested it). The sum, the upper confidence bound, is what it actually optimizes. That's the principled "give the under-sampled option a shot" reasoning the LLM only gestured at.
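For readers who want to re-derive the numbers locally, here is a small UCB sketch. The exploration form c · sqrt(ln N / n) with c = 2.0 is an inference: it happens to reproduce the exploration and score values in the response above, but it is not documented behaviour of the service. The textbook UCB1 bonus sqrt(2 ln N / n) corresponds to c = sqrt(2). The sketch assumes every arm has been pulled at least once.

```python
import math

def ucb(arms, c=2.0):
    """Score each arm as mean + c * sqrt(ln(N) / n); return (best_id, {id: (mean, bonus, score)})."""
    total = sum(a["pulls"] for a in arms)
    scores = {}
    for a in arms:
        mean = a["totalReward"] / a["pulls"]
        bonus = c * math.sqrt(math.log(total) / a["pulls"])
        scores[a["id"]] = (mean, bonus, mean + bonus)
    best = max(scores, key=lambda k: scores[k][2])
    return best, scores

arms = [
    {"id": "variant_a", "pulls": 120, "totalReward": 18},
    {"id": "variant_b", "pulls": 80,  "totalReward": 17},
    {"id": "variant_c", "pulls": 15,  "totalReward": 4},
]

best, scores = ucb(arms)
for arm_id, (mean, bonus, score) in scores.items():
    print(f"{arm_id}: mean={mean:.4f} bonus={bonus:.4f} score={score:.4f}")
print("selected:", best)

best_textbook, _ = ucb(arms, c=math.sqrt(2))   # standard UCB1 constant
print("selected with textbook constant:", best_textbook)
```

Both constants select variant_c here; the constant changes how large the bonus is, not which arm is least explored.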

Two things worth noticing:

It's deterministic. Run that curl again and you get the exact same score: 1.4633997784480877. UCB1 has no randomness; the same inputs always yield the same decision. That's the difference between a tool you can put in a CI test and a model whose answer drifts run to run. (If you want stochastic exploration, swap "algorithm": "thompson".)
No key needed. The free tier is IP-rate-limited, not auth-gated. You just verified the whole thing without handing over an email.

Now wire it into your agent (the actual point)

The REST call is the proof. The real ergonomics come from MCP — your agent calls it like any other tool, no glue code.

Add the server to Claude Code (or any MCP client) in one line:

claude mcp add oraclaw -- npx -y @oraclaw/mcp-server

Or drop it straight into a client config:

{
  "mcpServers": {
    "oraclaw": {
      "command": "npx",
      "args": ["-y", "@oraclaw/mcp-server"]
    }
  }
}

Now your agent has an optimize_bandit tool. Instead of prompting the model to reason about exploration, you let it call the solver and act on a verifiable result. The MCP call returns the identical payload (same score: 1.4633997784480877) — the MCP server and the REST API are the same engine.

When the best choice depends on context

The plain bandit assumes the best arm is fixed. Often it isn't — the right model/route/variant depends on the request. That's a contextual bandit, and it's a one-tool swap (optimize_contextual, a LinUCB implementation). Feed a feature vector describing the current situation:

curl -s -X POST https://oraclaw-api.onrender.com/api/v1/optimize/contextual-bandit \
  -H "Content-Type: application/json" \
  -d '{
    "arms": [
      {"id": "small",   "name": "small-fast-model"},
      {"id": "mid",     "name": "mid-model"},
      {"id": "frontier","name": "frontier-model"}
    ],
    "context": [0.9, 0.2, 1.0],
    "history": [
      {"armId": "small",    "context": [0.1, 0.1, 0.0], "reward": 1.0},
      {"armId": "frontier", "context": [0.9, 0.2, 1.0], "reward": 0.95},
      {"armId": "small",    "context": [0.9, 0.2, 1.0], "reward": 0.2}
    ]
  }'

Here the context vector might encode [task_difficulty, latency_budget, needs_reasoning]. The model that wins on an easy, latency-sensitive task is not the one that wins on a hard reasoning task — LinUCB learns that mapping from history instead of you maintaining a brittle if difficulty > 0.7 ladder by hand. This is the honest version of "let the agent pick which model to call": don't have the LLM introspect about cost/quality tradeoffs in a prompt — give it a learner.
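To show what a LinUCB learner does with that payload, here is a textbook disjoint LinUCB sketch in NumPy. It is not OraClaw's implementation; the `alpha` and `ridge` defaults are assumptions, and the service may score the same request differently.

```python
import numpy as np

def linucb_select(arm_ids, context, history, alpha=1.0, ridge=1.0):
    """Disjoint LinUCB: one ridge regression per arm plus an uncertainty bonus."""
    x = np.asarray(context, dtype=float)
    d = x.size
    A = {a: ridge * np.eye(d) for a in arm_ids}   # per-arm design matrices
    b = {a: np.zeros(d) for a in arm_ids}         # per-arm reward-weighted contexts
    for h in history:
        z = np.asarray(h["context"], dtype=float)
        A[h["armId"]] += np.outer(z, z)
        b[h["armId"]] += h["reward"] * z
    scores = {}
    for a in arm_ids:
        theta = np.linalg.solve(A[a], b[a])                 # estimated reward weights
        width = np.sqrt(x @ np.linalg.solve(A[a], x))       # uncertainty in this context
        scores[a] = float(theta @ x + alpha * width)
    return max(scores, key=scores.get), scores

history = [
    {"armId": "small",    "context": [0.1, 0.1, 0.0], "reward": 1.0},
    {"armId": "frontier", "context": [0.9, 0.2, 1.0], "reward": 0.95},
    {"armId": "small",    "context": [0.9, 0.2, 1.0], "reward": 0.2},
]
choice, scores = linucb_select(["small", "mid", "frontier"], [0.9, 0.2, 1.0], history)
print(choice, {k: round(v, 3) for k, v in scores.items()})

explore_choice, _ = linucb_select(["small", "mid", "frontier"], [0.9, 0.2, 1.0], history, alpha=2.0)
print(explore_choice)
```

With three observations the untried arm's score is pure exploration bonus, so the choice is sensitive to alpha: a larger value tips the pick toward the arm with no history.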

Why route it out instead of prompting harder

Verifiable, not vibes. A UCB score is a number you can assert on in a test. "The model usually picks the right one" is not.
Deterministic where it matters. Same inputs → same decision (for UCB1/LinUCB). You can pin it in CI and diff it.
No tokens, fast. It's arithmetic, not generation — runs in single-digit-to-low-tens of milliseconds and burns zero LLM tokens. You're not paying a frontier model to compute a confidence bound it'll round wrong anyway.
Right tool for the job. The LLM stays in charge of orchestration and language. The math goes to a solver built for it.

The bandit is one of ~20 algorithms in the same server — forecasting (ARIMA / Holt-Winters), anomaly detection, linear/MIP optimization (HiGHS), Monte Carlo, PageRank/graph analysis, CMA-ES, conformal scoring. Same pattern every time: the agent describes the problem, a deterministic solver returns an answer you can check.

Try it

Run the curl above — verify the deterministic output yourself, no signup: https://oraclaw-api.onrender.com/api/v1/health (lists every endpoint).
Add it to your agent in one line:

   claude mcp add oraclaw -- npx -y @oraclaw/mcp-server

The free MCP tools need no key.

Building something that calls it a lot? A free key (just an email) raises the limits:

   curl -s -X POST https://oraclaw-api.onrender.com/api/v1/auth/signup \
     -H "Content-Type: application/json" -d '{"email":"you@example.com"}'

If you outgrow the free tier, higher limits start at $9/mo — direct checkout here. But you can do everything in this post for $0.

If your agent is making decisions, make them ones you can verify. Stop asking the model to eyeball the math — route it to something that gets it provably right.

Run the demo, then tell me in the comments what your agent was eyeballing that it shouldn't have been.

Your bandit's exploration floor probably violates its own floor

Whatsonyourmind — Wed, 17 Jun 2026 22:19:34 +0000

Most multi-armed bandit / A-B allocation systems add a minimum exploration weight: every arm should get at least, say, 5% of traffic, so no variant is ever fully starved and you keep collecting data on all of them. The guarantee sounds simple — p_i >= f for every arm — and the implementation looks even simpler:

def clip_renorm(w, f):
    p = np.maximum(w, f)   # raise anything below the floor up to it
    return p / p.sum()     # renormalize so probabilities sum to 1

This is wrong, and it fails silently. The renormalize step pushes the floored arms back below the floor.

Why clip-then-renormalize breaks

Clipping raises the small weights up to f, which makes the total exceed 1. Dividing by that total then scales everything down — including the arms you just clipped to f. So they land below f again, and the floor you advertised is not the floor you enforce.

Concrete case — 4 arms, a confident winner, floor f = 0.10:

w   = [0.94, 0.02, 0.02, 0.02]   floor = 0.10
clip-renorm -> [0.7581, 0.0806, 0.0806, 0.0806]   min = 0.0806  ❌ (< 0.10)

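The three starved arms each get 8.06%, not the 10% you promised. And it isn't an edge case. Over 100,000 random peaky weight vectors (Dirichlet, α=0.3, n=4, f=0.10), clip-renorm undershoots the floor in roughly 97% of draws (the reproduction script further down prints the rate).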

Whenever one arm dominates (exactly when a bandit is exploiting), the floor leaks.

The fix: one affine map onto the simplex

Instead of clipping, mix the learned weights with the uniform floor. Put the weights on the simplex (sum(w) = 1), then:

def additive_simplex(w, f):
    w = w / w.sum()
    return f + (1.0 - len(w) * f) * w

Each output is f + (non-negative), so p_i >= f holds exactly, and the total is n*f + (1 - n*f)*1 = 1 by construction — no renormalization needed, so nothing gets dragged back under the floor. It also preserves the ordering and relative spacing of w (it's affine), so you don't distort the policy you learned. Same run:

additive-simplex -> [0.664, 0.112, 0.112, 0.112]   min = 0.112  ✅

Over the same 100,000 vectors it violated the floor 0.00% of the time.

The one guard you do need

The map needs n * f <= 1 — you can't promise four arms a 30% floor each (that's 120%). Handle it explicitly instead of producing negative weights:

def exploration_floor(w, f):
    n = len(w)
    if f < 0:
        raise ValueError("floor must be non-negative")
    if n * f >= 1.0:
        return np.full(n, 1.0 / n)          # floor is infeasible -> uniform
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    return f + (1.0 - n * f) * w

That's the whole correct primitive: a non-negativity check, an infeasible-floor fallback to uniform, and the affine mix.

Why it actually matters

The exploration floor isn't cosmetic. It's what bounds worst-case regret and guarantees you keep collecting data on every arm: the property a lot of bandit regret arguments lean on, and often a fairness/SLA requirement too ("no variant ever drops below X%"). A floor that's silently as low as 7.7% instead of 10% (the limit for n = 4, f = 0.10, where 0.10 / 1.30 ≈ 0.077) means the guarantee you reported to stakeholders, and any bound that depends on it, doesn't hold. The bug is invisible because the output still sums to 1 and still looks floored; the smallest number is just quietly too small.

import numpy as np
rng = np.random.default_rng(0)
f, n, viol = 0.10, 4, 0
for _ in range(100_000):
    w = rng.dirichlet(np.ones(n) * 0.3)
    p = np.maximum(w, f); p = p / p.sum()       # clip-renorm
    if p.min() < f - 1e-12: viol += 1
print(f"clip-renorm floor violations: {viol/100_000:.1%}")   # ~97%

I ran into this reviewing a Thompson-sampling weighting routine and proposed the additive-simplex version (plus the two guards) as a fix upstream. If your bandit or weighted-experiment layer clips-then-renormalizes to enforce a minimum, it's worth a one-line check: does the smallest probability it emits actually clear the floor?
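That one-line check can be sketched as a small predicate. The helper name `floor_holds` and its tolerances are assumptions; the two allocation functions repeat the ones from earlier in the post so the block runs on its own.

```python
import numpy as np

def clip_renorm(w, f):
    p = np.maximum(np.asarray(w, dtype=float), f)   # raise small weights to the floor
    return p / p.sum()                              # renormalizing drags them back down

def additive_simplex(w, f):
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    return f + (1.0 - len(w) * f) * w               # affine mix with the uniform floor

def floor_holds(p, f, tol=1e-12):
    """True if p sums to 1 and its smallest entry clears the floor f."""
    p = np.asarray(p, dtype=float)
    return bool(abs(p.sum() - 1.0) <= 1e-9 and p.min() >= f - tol)

w = [0.94, 0.02, 0.02, 0.02]
leaky = floor_holds(clip_renorm(w, 0.10), 0.10)        # False: min is 0.0806
fixed = floor_holds(additive_simplex(w, 0.10), 0.10)   # True: min is 0.112
```

Dropping an assertion like `floor_holds(p, f)` next to wherever the allocation is emitted turns the silent failure into a loud one.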

A model with R-squared near 0 can still give valid 90% prediction intervals - here's why (and the catch)

Whatsonyourmind — Wed, 17 Jun 2026 21:33:25 +0000

I recently calibrated a recovery-rate model that had only two weak features. Its point accuracy was almost nothing — R² basically zero. I expected its uncertainty estimates to be junk too. They weren't: the 90% conformal prediction intervals covered ~89% of held-out outcomes. Valid, just wide.

That surprised me enough to nail it down, because it contradicts a belief a lot of us carry around: "my model isn't accurate, so I can't trust its uncertainty." For split conformal prediction, that's backwards. Here's the precise statement, a runnable demo, and the one caveat that actually bites.

Coverage is a property of the procedure, not the model

Split conformal prediction gives a distribution-free, finite-sample marginal coverage guarantee:

P( y_test ∈ [ŷ(x_test) − q, ŷ(x_test) + q] ) ≥ 1 − α

and it holds for any point model, as long as the calibration and test data are exchangeable. The model is a black box. You fit it however you like, then on a held-out calibration set of size n you take the ⌈(n+1)(1−α)⌉-th smallest absolute residual (the finite-sample-corrected (1−α) quantile), and that value q becomes the half-width of your intervals.

Nowhere does that construction require the model to be good. A bad model just has large residuals, so the calibration quantile is large, so the intervals are wide — wide enough to still cover at the stated rate. Accuracy doesn't buy you validity; it buys you efficiency (narrower intervals at the same coverage).

The demo (numbers are reproducible, seed fixed)

Same dataset and target, three models from strong to useless, target coverage 90%:

model	R²	marginal coverage	mean interval width
gradient boosting	0.741	0.895	5.39
weak linear (1 noisy feature)	0.061	0.905	10.39
predict-the-mean	−0.000	0.907	10.83

All three land at ~90% coverage. The only thing that changes is width: the good model's intervals are half as wide. That's the whole story in one table — validity is constant, efficiency tracks accuracy.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(20260617)
n = 6000
X = rng.normal(size=(n, 5))
group = rng.integers(0, 3, size=n)
y = X @ np.array([2.0, -1.5, 1.0, 0.5, -0.8]) + 1.5 * group + rng.normal(size=n)

s = lambda a: (a[:3000], a[3000:4500], a[4500:])
Xtr, Xcal, Xte = s(X); ytr, ycal, yte = s(y); _, _, gte = s(group)
ALPHA = 0.10

def conformal(model, label):
    model.fit(Xtr, ytr)
    res = np.abs(ycal - model.predict(Xcal))
    k = int(np.ceil((len(res) + 1) * (1 - ALPHA)))
    q = np.sort(res)[min(k, len(res)) - 1]          # calibration quantile
    pred = model.predict(Xte)
    covered = (yte >= pred - q) & (yte <= pred + q)
    r2 = 1 - np.sum((yte - pred)**2) / np.sum((yte - yte.mean())**2)
    gcov = {int(g): round(covered[gte == g].mean(), 3) for g in np.unique(gte)}
    print(f"{label}: R2={r2:6.3f} cov={covered.mean():.3f} width={2*q:5.2f} group={gcov}")

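conformal(GradientBoostingRegressor(random_state=0), "strong")

class Weak(LinearRegression):
    # deliberately weak: it only sees one low-signal feature (column 4)
    def fit(self, X, y): return super().fit(X[:, 4:5], y)
    def predict(self, X): return super().predict(X[:, 4:5])
conformal(Weak(), "weak  ")

# third table row; DummyRegressor is an assumed stand-in for "predict-the-mean"
from sklearn.dummy import DummyRegressor
conformal(DummyRegressor(strategy="mean"), "mean  ")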

The catch: marginal ≠ conditional

Here's the part you can't skip. The guarantee is marginal — averaged over the whole distribution. It says nothing about coverage within a subgroup. Watch what the same run reports per subgroup:

model	marginal	group 0	group 1	group 2
strong GBM	0.895	0.835	0.985	0.857
predict-the-mean	0.907	0.889	0.933	0.897

The strong model has the worse conditional coverage — groups 0 and 2 sit at 83–86% while group 1 is over-covered at 98%. A single global residual quantile produces constant-width intervals that can't adapt to residuals that vary by group, so it robs the hard groups to pay the easy one. (The mean-only model looks more uniform here only because its residuals happen to be roughly homoskedastic across groups — luck, not virtue.)

If your decisions are made per-subgroup — per region, per asset class, per customer segment — marginal coverage is not enough, and a high overall number can hide silent under-coverage where it matters. The fixes are Mondrian / group-conditional conformal (calibrate a separate quantile per group) or a normalized/locally-weighted nonconformity score so interval width adapts.
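A minimal sketch of the Mondrian idea, assuming the group label is known at prediction time. The residuals here are synthetic (a hypothetical spread of 0.5, 1.0 and 2.0 per group) and the helper names are illustrative; the only change from the global recipe is that the calibration quantile is computed once per group.

```python
import numpy as np

def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the absolute residuals."""
    r = np.sort(np.abs(residuals))
    k = int(np.ceil((len(r) + 1) * (1 - alpha)))
    return r[min(k, len(r)) - 1]

def mondrian_quantiles(residuals, groups, alpha):
    """One calibration quantile per group instead of a single global one."""
    return {int(g): float(conformal_quantile(residuals[groups == g], alpha))
            for g in np.unique(groups)}

rng = np.random.default_rng(0)
groups = rng.integers(0, 3, size=3000)
# Synthetic calibration residuals whose spread depends on the group (assumption).
residuals = rng.normal(size=3000) * np.array([0.5, 1.0, 2.0])[groups]

q_global = float(conformal_quantile(residuals, alpha=0.10))
q_by_group = mondrian_quantiles(residuals, groups, alpha=0.10)

# At prediction time, use the half-width that belongs to the test point's group:
# interval = (pred - q_by_group[g], pred + q_by_group[g])
```

The global quantile lands between the group quantiles, which is exactly the "rob the hard groups to pay the easy one" effect; the price of the per-group version is a smaller calibration set behind each quantile.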

What to take away

A weak model gives you wide but honest intervals, not invalid ones. "The model is bad so the uncertainty is meaningless" is the wrong instinct — wide intervals are the correct signal that the model doesn't know much.
The genuinely dangerous case is the opposite: a confident-looking narrow interval whose coverage is a lie. That happens not from low accuracy but from a broken exchangeability assumption — distribution drift between calibration and deployment. (That failure mode, and adaptive conformal as the fix, is a separate write-up.)
Always check conditional coverage on the groups you actually act on. The marginal number is necessary, not sufficient.

Conformal prediction is one of the few tools that gives you a real guarantee with almost no assumptions. Just remember which guarantee it gives — coverage over the whole distribution — and verify the rest yourself.

Stop trusting the agent: bind tool-call approvals to the exact call

Whatsonyourmind — Wed, 17 Jun 2026 15:11:04 +0000

Agentic systems gate dangerous tool calls — file writes, money movement, deploys — behind an "approval": a human-in-the-loop click, or a policy check. Look at how that approval is usually represented and you'll often find a boolean sitting in the run/session state: approved: true.

A boolean is the wrong primitive, and it fails in three ways that prompt injection is happy to exploit.

Three ways an approval boolean breaks

Flip. Anything that can write the run state — a serialized context crossing a process/durable-execution boundary, a confused-deputy code path, an injection that steers state — turns false into true.
Replay. You approved "read report.csv". The approval is just true, so the same flag is honored for the next tool call too — "delete prod.db". The boolean doesn't know which call it approved.
Argument drift. You approved "transfer $10 to alice". Between approval and execution the args mutate to $10,000. The boolean still says approved.

The root cause is the same in all three: the approval is modeled as a property of the run, when it should be evidence for one specific call.

Bind the approval to the call

When approval is granted, mint a tag over the things that must not change: the tool-call id, a digest of the canonical arguments, the principal, and an expiry. Verify it at dispatch, against a per-run secret.

import hmac, hashlib, json, time

def canon(args: dict) -> bytes:
    # canonical serialization so benign reserialization doesn't invalidate a token.
    # (production: RFC 8785 JCS, which also normalizes numbers — 10 vs 10.0)
    return json.dumps(args, sort_keys=True, separators=(",", ":")).encode()

def mint(key: bytes, call_id: str, args: dict, principal: str, ttl: int = 300) -> dict:
    exp = int(time.time()) + ttl
    digest = hashlib.sha256(canon(args)).hexdigest()
    msg = f"{call_id}|{digest}|{principal}|{exp}".encode()
    tag = hmac.new(key, msg, hashlib.sha256).hexdigest()
    return {"call_id": call_id, "principal": principal, "exp": exp, "tag": tag}

def verify(key: bytes, tok: dict, call_id: str, args: dict, principal: str) -> bool:
    if tok.get("call_id") != call_id:      return False   # replay onto another call
    if tok.get("principal") != principal:  return False   # wrong principal
    if tok.get("exp", 0) < time.time():    return False   # expired
    digest = hashlib.sha256(canon(args)).hexdigest()
    msg = f"{call_id}|{digest}|{principal}|{tok['exp']}".encode()
    expect = hmac.new(key, msg, hashlib.sha256).hexdigest()
    tag = str(tok.get("tag", "")).encode()                 # missing / non-string tag fails closed
    return hmac.compare_digest(expect.encode(), tag)       # forged / flipped / arg-drift

Run the three attacks against it (plus principal-swap and a forged tag):

KEY = b"per-run-secret-not-a-global-one"
tok = mint(KEY, "call-1", {"amount": 10, "to": "alice"}, "user:42")   # approve $10 to alice

verify(KEY, tok, "call-1", {"amount": 10,    "to": "alice"}, "user:42")  # True   legit
verify(KEY, tok, "call-2", {"amount": 10,    "to": "alice"}, "user:42")  # False  replay
verify(KEY, tok, "call-1", {"amount": 10000, "to": "alice"}, "user:42")  # False  arg drift
verify(KEY, tok, "call-1", {"amount": 10,    "to": "alice"}, "user:99")  # False  wrong principal
verify(KEY, {**tok, "tag": "00"*32}, "call-1", {"amount": 10, "to": "alice"}, "user:42")  # False  forged

The flag can no longer be flipped (no valid tag), replayed (call-id is in the MAC), or drifted (args digest is in the MAC). An attacker who fully controls the transported state still can't manufacture a token without the key.

Three details that decide whether it actually holds

Canonicalization. Both sides must hash the same bytes. Sort keys, and normalize numbers (10 vs 10.0 vs 1e1 must agree) — RFC 8785 (JSON Canonicalization Scheme) is the off-the-shelf answer. Put the canonicalization recipe id inside the hashed bytes so the two sides can't silently disagree about the rules.
Fail closed, with a typed result. Absent / expired / mismatched ⇒ a distinct "not approved" outcome — not a normal tool payload, and not a generic exception. Otherwise "approval missing" is indistinguishable downstream from "the tool ran and returned something falsy," and the caller can't tell whether to re-request approval.
One enforced checkpoint, deny-by-default. This belongs at the single point right before dispatch: Semantic Kernel's AUTO_FUNCTION_INVOCATION filter (don't call next ⇒ the call is skipped), ADK's before_tool callback, or the MCP tool-call boundary. Tools that need approval are classified as such; anything unclassified is denied, not allowed through.
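The second and third points can be sketched together as one dispatch function. Everything named here (`Outcome`, `DispatchResult`, `NEEDS_APPROVAL`, `dispatch`, and the `approval_ok` callback that would wrap a token check like `verify`) is hypothetical and not tied to Semantic Kernel, ADK, or MCP.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict, Optional

class Outcome(Enum):
    EXECUTED = "executed"
    NOT_APPROVED = "not_approved"

@dataclass(frozen=True)
class DispatchResult:
    outcome: Outcome
    payload: Any = None            # only meaningful when outcome is EXECUTED
    reason: Optional[str] = None   # only meaningful when outcome is NOT_APPROVED

# Hypothetical classification table: True = needs approval, False = safe to run.
NEEDS_APPROVAL: Dict[str, bool] = {"read_report": False, "transfer": True}

def dispatch(name: str, fn: Callable[..., Any], args: dict,
             approval_ok: Callable[[str, dict], bool]) -> DispatchResult:
    needs = NEEDS_APPROVAL.get(name)
    if needs is None:              # unclassified tools are denied, not allowed through
        return DispatchResult(Outcome.NOT_APPROVED, reason="unclassified tool")
    if needs and not approval_ok(name, args):
        return DispatchResult(Outcome.NOT_APPROVED,
                              reason="approval missing, expired, or mismatched")
    return DispatchResult(Outcome.EXECUTED, payload=fn(**args))

deny_all = lambda name, args: False
ran_falsy = dispatch("read_report", lambda path: 0, {"path": "report.csv"}, deny_all)
blocked = dispatch("transfer", lambda **kw: "sent", {"amount": 10}, deny_all)
unknown = dispatch("delete_db", lambda **kw: "gone", {}, lambda name, args: True)
```

A tool that legitimately returns `0` or `None` still comes back as `EXECUTED`, so the caller can tell "ran and returned something falsy" apart from "never ran".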

The gotcha that bites in production: replay

If your agent runs on a replay-based durable-execution engine (Temporal and friends), the per-run secret must survive replay. Workflow code is re-executed from history on recovery, so a key minted with a non-deterministic call won't match the token already in history — approvals verify fine in dev and then fail closed after the first worker restart, which is the worst possible time to discover it. Derive the key deterministically (HKDF(server_secret, run_id)) or establish it once via a recorded side-effect, and make the expiry deterministic too rather than reading wall-clock inside workflow code.
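A minimal sketch of the deterministic derivation, using an HKDF-SHA256 written directly from RFC 5869 on top of the standard library's `hmac`. The helper names and the `approval-key|` info prefix are assumptions; in production a vetted HKDF implementation is the safer choice.

```python
import hashlib
import hmac

def hkdf_sha256(ikm: bytes, info: bytes, salt: bytes = b"", length: int = 32) -> bytes:
    """HKDF extract-and-expand (RFC 5869) with SHA-256."""
    if length > 255 * 32:
        raise ValueError("requested key is too long for HKDF-SHA256")
    if not salt:
        salt = bytes(32)                                   # RFC default: HashLen zero bytes
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()     # extract
    okm, block, counter = b"", b"", 1
    while len(okm) < length:                               # expand
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

def run_key(server_secret: bytes, run_id: str) -> bytes:
    """Per-run approval key: a pure function of its inputs, so it survives replay."""
    return hkdf_sha256(server_secret, info=b"approval-key|" + run_id.encode())

key_first = run_key(b"server-secret", "run-123")
key_replayed = run_key(b"server-secret", "run-123")   # same bytes after a worker restart
key_other = run_key(b"server-secret", "run-456")      # different run, different key
```

Nothing here touches the clock or a random source, so re-executing the workflow from history reproduces the same key and tokens minted before the restart still verify.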

The takeaway

Authorization in an agent system shouldn't be ambient, mutable state that travels with the run. It should be evidence bound to a single call envelope — this principal, this tool, these exact arguments, until this time — that the executor re-verifies at the moment of dispatch. The boolean isn't a simplification of that; it's the bug.

I work on reliability and verification for AI and numerical systems — agent authorization, determinism, and "prove the thing that claims to be authorized actually was." The snippet above is runnable as-is. Happy to compare notes if you're hardening an agent's tool boundary — GitHub.

Conformal prediction silently breaks under drift - and how to make it hold

Whatsonyourmind — Wed, 17 Jun 2026 14:39:25 +0000

Conformal prediction is the easiest way to put a calibrated uncertainty band around any model: wrap a point predictor, and you get intervals with a finite-sample coverage guarantee — no distributional assumptions. It's deservedly popular.

There's a catch that bites in production: that guarantee is marginal and it assumes exchangeability. The moment your data drifts — almost any time series, any online-serving setting — exchangeability is gone, and split-conformal silently stops delivering the coverage it promises. No error, just a band that's quietly too narrow.

Here's the failure, then a fix that actually holds, with runnable code.

The failure, measured

Target 90% intervals. Residuals whose spread drifts upward over time (a textbook covariate/heteroscedastic shift). Calibrate split-conformal on the first chunk and let it run:

import numpy as np
rng = np.random.default_rng(0)

T, alpha, W = 4000, 0.10, 500                  # 90% target; W = calibration window
scale = 1.0 + 3.0 * (np.arange(T) / T) ** 2    # residual spread drifts upward
score = np.abs(rng.standard_normal(T) * scale) # nonconformity = |residual|

q = np.quantile(score[:W], 1 - alpha)          # frozen calibration quantile
static = score[W:] <= q
print(round(static.mean(), 3))                 # -> 0.579

58% coverage where you asked for 90% — and in the last quarter of the run, deep into the drift, it's 35%. A dashboard reporting "90% prediction intervals" would be off by more than half, with nothing flagging it.

Why it breaks, and the two things you have to fix

There are two distinct ways drift kills coverage, and they need different fixes:

The score scale goes stale. Your calibration scores were collected when residuals were small; now they're large. The frozen quantile is simply too small.
The miscoverage rate drifts. Even with a reasonable scale, the realized error rate wanders away from α.

Adaptive Conformal Inference (Gibbs & Candès, 2021) fixes #2 directly. It treats the target miscoverage as a control variable and runs a feedback loop: after each step, nudge α_t up if you've been covering too often, down if you've been missing too often.

alpha_t = alpha_t + gamma * (alpha - err_t)     # err_t = 1 if the point fell outside

A miss pushes α_t down → you use a higher quantile → wider next interval. It's a thermostat for coverage, and it gives a long-run coverage guarantee with no exchangeability assumption.

But ACI adapts the level, not the scale. Point it at a frozen calibration set and it helps a lot but hits a ceiling — once residuals exceed the largest score it ever saw, even α_t → 0 (the widest interval it can form) isn't wide enough. You also have to let the scores track the current regime, e.g. with a rolling window.

Measured, same setup, four ways:

method	overall coverage	coverage in late-drift tail
static split-conformal	0.579	0.347
ACI only (frozen calibration)	0.864	0.786
rolling window only	0.862	0.859
rolling window + ACI	0.900	0.904

Neither piece is enough alone. The rolling window supplies the right scale; ACI supplies the guarantee. Together they land exactly on target, even in the part of the series where the static method had collapsed to 35%.

a, hold = alpha, []
for t in range(W, T):
    pool = score[t - W:t]                              # rolling -> tracks the new scale
    a_eff = min(max(a, 1e-3), 1 - 1e-3)
    covered = score[t] <= np.quantile(pool, 1 - a_eff)
    hold.append(covered)
    a += 0.02 * (alpha - (0.0 if covered else 1.0))   # ACI feedback on miscoverage
print(round(np.mean(hold), 3))                         # -> 0.900

Three things that matter in practice

The score function decides marginal vs conditional coverage. |y − ŷ| gives you marginal coverage with a constant-width band. If your noise is heteroscedastic and you want bands that are locally right (conditional coverage), normalize the score — |y − ŷ| / σ̂(x), or use Conformalized Quantile Regression (CQR) where the score is the signed distance to predicted quantiles. The choice changes whether wide intervals show up where the data is actually noisy.
Coverage is a usable drift signal, but a noisy one. Rolling empirical coverage drifting away from 1 − α is a cheap, model-agnostic drift detector. Just remember it's a Bernoulli mean: its standard error is sqrt(c(1−c)/n), so over a 100-point window a 90%-coverage estimate carries a standard error of about 3 points (sqrt(0.9 × 0.1 / 100) = 0.03), and swings of 5 to 6 points are unremarkable. Trigger on sustained deviation, not one short window.
Pick γ for your drift speed. Larger γ tracks faster but makes interval widths jumpier; smaller γ is smoother but lags. 0.01–0.05 is a sane starting range; tune against your realized coverage trace, not in the abstract.
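The "sustained deviation" advice can be sketched as a tiny alarm. The function names, the 3-standard-error threshold, and the three-window patience are assumptions to tune, not recommendations from the ACI paper.

```python
import math

def coverage_se(c: float, n: int) -> float:
    """Standard error of an empirical coverage rate c over n points."""
    return math.sqrt(c * (1.0 - c) / n)

def drift_alarm(covered, target=0.90, window=100, z=3.0, patience=3) -> bool:
    """Fire only after `patience` consecutive non-overlapping windows whose
    coverage sits more than z standard errors below the target."""
    threshold = target - z * coverage_se(target, window)
    streak = 0
    for start in range(0, len(covered) - window + 1, window):
        rate = sum(covered[start:start + window]) / window
        streak = streak + 1 if rate < threshold else 0
        if streak >= patience:
            return True
    return False

se_100 = coverage_se(0.90, 100)                                      # 0.03
one_bad_window = [True] * 100 + [False] * 100 + [True] * 100
sustained_miss = [i % 2 == 0 for i in range(300)]                    # 50% coverage throughout
```

A single bad window resets the streak and stays quiet; only the run that keeps missing trips the alarm.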

The takeaway

A guarantee that assumes exchangeability is not a guarantee in production — it's an assumption wearing a guarantee's clothes. What makes ACI worth reaching for is that it drops the assumption and replaces it with a feedback loop you can actually verify online: watch the realized coverage, and let it correct itself. If you serve intervals anywhere a too-narrow band is expensive, that self-correction is the difference between a number you can trust and one that quietly lies as the world moves.

I work on reliability and verification for numerical and AI systems — calibration, drift, and "does the guarantee actually hold under load" tooling. The benchmark above is fully runnable; I'm happy to compare notes if you're putting conformal methods into production — GitHub.

When your optimizer silently returns the wrong answer (and how to catch it)

Whatsonyourmind — Wed, 17 Jun 2026 13:49:40 +0000

Numerical solvers have a failure mode that is worse than crashing: every so often they return status: Optimal and hand you a number that is simply wrong. No exception, no warning — just a confident, incorrect optimum. If that number drives a downstream decision (a schedule, an allocation, a price), you may never notice.

I ran into a clean example of this in HiGHS recently while reducing a bug that had surfaced through cvxpy, and the debugging path generalizes to any LP/QP/MILP stack. Here's the case, how I isolated it, and a short checklist you can apply to your own models.

The symptom: same model, two answers

Under default settings, HiGHS solves a mixed-integer model to Optimal with objective 0.0. Solve the same model with presolve turned off and you get Optimal with objective ≈ 6.68e8. Both runs report success. At least one of them is wrong.

When presolve-on and presolve-off disagree on a problem that has a well-defined, bounded optimum, that is not a tolerance issue — it means one of the reduction steps is mangling the model. (This particular case is an open, actively-investigated issue; a separate wrong-answer I reduced to a standalone .mps from a cvxpy program is filed here.)

The first diagnostic is free: flip presolve

Before anything else, re-solve with presolve disabled and compare the two objectives:

import highspy

def solve(path, presolve="on"):
    h = highspy.Highs()
    h.setOptionValue("presolve", presolve)
    h.setOptionValue("output_flag", False)
    h.readModel(path)
    h.run()
    return h.getObjectiveValue()

on  = solve("model.mps", "on")
off = solve("model.mps", "off")
print(on, off)        # disagree on a feasible, bounded model => bug in a reduction

The same idea works through the modeling layer — in cvxpy, compare prob.solve(solver=cp.HIGHS) against the same solve with {"presolve": "off"}. If the two disagree, a reduction step is the culprit, and you have already cut the search space in half.

Why scaling is so often the trigger

The common thread in this family of bugs is coefficient magnitude. HiGHS prints the coefficient ranges at the top of every run:

Coefficient ranges:
  Matrix  [4e-01, 5e+02]
  Cost    [2e+01, 3e+02]
  Bound   [1e+02, 1e+02]
  RHS     [3e+01, 2e+04]

When a single constraint mixes coefficients spanning many orders of magnitude, bound-tightening and substitution accumulate floating-point error, and integer-rounding logic ("this RHS rounds up to the next integer bound") can tip the wrong way. The minimal reproducer I extracted kept exactly the rows whose coefficients carried the large magnitudes — drop them and the collapse disappears.

The same root cause shows up across solvers, just wearing different clothes:

OSQP (QP): an open report where v1.0.0+ runs all the way to max-iterations with gap = -nan, even though the primal and dual residuals are already at 1e-14. The duality-gap termination criterion is poisoned by a NaN, so the solver never recognizes that it has already converged.
Clarabel (conic/QP): a report where a wildly ill-scaled QP (objective on the order of 1e9) returns a false PrimalInfeasible with equilibration on, but solves cleanly with equilibrate_enable=False. Ruiz equilibration is capped at equilibrate_max_scaling = 1e4 by default — about four orders short of a 1e8 dynamic range, so the post-scaling KKT system is still badly conditioned.

Different solvers, same lesson: magnitude is not cosmetic.

How to minimize a solver bug so it actually gets fixed

A 350-row model is not a bug report a maintainer can act on. The reduction loop is mechanical and worth automating:

Reproduce on the latest release first. Half of "bugs" are already fixed. Pin the version you tested.
Greedily drop rows and columns. Remove a chunk; if the wrong-answer signature survives, keep it removed; otherwise restore it and try a smaller chunk. Binary-search your way down. I took one case from 348×169 to 41×40 this way and it still collapsed.
Make the "still broken" check a predicate, not an eyeball. Here it was abs(on - off) > tol (or status == Infeasible while presolve-off says Optimal), re-evaluated after every removal.
Export the reduced model to a portable format (.mps) so the report is solver-version- and language-independent.
File with three things: the version, the exact on/off command delta, and the minimal .mps. That is a report that gets triaged in minutes instead of sitting untouched.
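The drop-and-check loop from the steps above can be sketched independently of any solver. `reduce_rows` and the `still_broken` predicate are hypothetical names; in practice the predicate would rebuild the model from the surviving rows and evaluate something like `abs(on - off) > tol`.

```python
from typing import Callable, List, Sequence

def reduce_rows(rows: Sequence, still_broken: Callable[[List], bool]) -> List:
    """Greedy chunked reduction: drop a chunk, keep it dropped if the failure survives."""
    rows = list(rows)
    if not still_broken(rows):
        raise ValueError("the full model does not reproduce the failure")
    chunk = max(1, len(rows) // 2)
    while chunk >= 1:
        i = 0
        while i < len(rows):
            candidate = rows[:i] + rows[i + chunk:]
            if candidate and still_broken(candidate):
                rows = candidate       # failure survived: keep the chunk removed
            else:
                i += chunk             # failure vanished: restore the chunk, move on
        chunk //= 2                    # retry with smaller chunks
    return rows

# Toy stand-in for a solver run: the "bug" needs rows 7 and 23 to be present.
def toy_predicate(rows: List) -> bool:
    return 7 in rows and 23 in rows

minimal = reduce_rows(range(40), toy_predicate)   # [7, 23]
```

The same loop can then be run over columns. It is greedy, so the result is small rather than provably minimal, which is usually enough for a report.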

A scaling-hygiene checklist

Even when there is no solver bug, bad scaling silently erodes accuracy. Cheap habits that prevent most of it:

Read the coefficient ranges on every run. If the matrix or RHS spans more than ~1e6, treat the result with suspicion.
Rescale units before the solver sees them (dollars → millions, bytes → GB). Single highest-leverage fix.
Do not encode big-M larger than necessary. An M of 1e9 where 1e4 would do is how you manufacture these bugs.
Keep a presolve-off run in your test suite for any model whose output you trust blindly — a periodic on/off agreement check is a cheap regression guard.
For QP/conic, check the equilibration cap against your data's dynamic range, and prefer pre-scaling to relying on the solver to rescue pathological inputs.
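The first habit is easy to automate before the model ever reaches a solver. This sketch computes the same kind of range that the solver log prints; the function names and the `1e6` default mirror the rule of thumb above and are otherwise assumptions.

```python
import numpy as np

def coeff_range(values):
    """Smallest and largest nonzero magnitude, plus their ratio (the dynamic range)."""
    a = np.abs(np.asarray(values, dtype=float)).ravel()
    nz = a[a > 0]
    if nz.size == 0:
        return 0.0, 0.0, 1.0
    lo, hi = float(nz.min()), float(nz.max())
    return lo, hi, hi / lo

def looks_ill_scaled(values, limit=1e6):
    """Flag a matrix or vector whose nonzero entries span more than `limit`."""
    return coeff_range(values)[2] > limit

matrix = [[0.4, 0.0], [500.0, -2.0]]          # spans 4e-01 .. 5e+02, as in the log above
lo, hi, ratio = coeff_range(matrix)
big_m_row = [1.0, 1e9]                        # an oversized big-M next to a unit coefficient
```

Running it on the constraint matrix, the costs, and the right-hand side separately gives the same breakdown as the solver's own report, early enough to rescale units instead of debugging a wrong optimum.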

The broader point

These bugs are dangerous precisely because the solver's contract — "I returned Optimal" — is exactly what you would normally trust. The on/off differential is so useful because it doesn't trust that contract: it cross-checks two code paths that are supposed to agree and flags the moment they don't. That "verify the thing that claims to be correct" instinct is worth wiring into any pipeline where a wrong number is expensive.

I work on reliability and verification for numerical and AI systems — minimal reproducers, determinism, and "prove the output is what it claims" tooling; the HiGHS reducer above came out of that. The issues referenced are linked inline. If you hit something in this family, I'm happy to compare notes — GitHub.

Determinism as a feature: when to let your agent call a math API instead of reasoning

Whatsonyourmind — Wed, 17 Jun 2026 09:16:18 +0000

LLM agents are great at deciding what to do and unreliable at computing it. Ask one to allocate traffic across five variants, price tail risk, or solve a scheduling constraint and you'll get a confident, plausible, subtly-wrong number — tokens burned included.

The fix usually isn't a better prompt. It's the same instinct that gave us the calculator: move the deterministic math out of the probabilistic engine.

The tell

You have a determinism problem the moment your agent's output needs to be:

reproducible — same inputs → same answer, every run,
auditable — someone can check why it's 0.62 and not 0.61, or
correct under adversarial inputs — a fat-tailed return, an infeasible constraint.

An LLM gives you none of those for free. A tool call does.

What to offload (and a cheap test for each)

"Which variant should I ship?" → a multi-armed / contextual bandit. The agent picks the question; Thompson sampling picks the allocation. Test: ask your agent to allocate 1,000 users across 4 arms with the same conversion counts, twice. Different answers? Offload it.
"Is this metric anomalous?" → score the series against a baseline; don't eyeball it inside the context window.
"What's the 95% VaR / CVaR?" → Monte Carlo paths, not a vibe.
"Schedule these tasks under these limits" → an LP/MIP solver. LLMs can't reliably satisfy hard constraints; solvers can't violate them.

The pattern

Expose the math as MCP tools so the agent calls them like any other tool — intent stays in the model, the number comes from code:

// agent decides intent; the tool computes the answer
const alloc = await callTool("optimize_contextual", {
  arms: variants,          // [{ id, name }]
  context: userFeatures,   // segment, prior_open_rate, hour_of_day
  history: pastRewards
});
// `alloc` is reproducible, sub-millisecond, and you can show your work

Two design details that bite people:

Delayed reward. If reward trickles in (email opens over hours), set a fixed attribution window before crediting an arm — otherwise the bandit over-exploits early openers and collapses variant diversity.
Cold start. Start each arm on a Beta(1,1) prior (or an informed prior from past campaigns) so exploration doesn't die on run one.
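A minimal sketch of the cold-start point, assuming Bernoulli rewards. Thompson sampling is a randomized algorithm, so this version fixes the seed to make the allocation reproducible; the function name, the seed, and the draw count are illustrative choices.

```python
import numpy as np

def thompson_allocation(successes, failures, prior=(1.0, 1.0), n_draws=10_000, seed=0):
    """Share of posterior draws in which each arm has the highest sampled rate."""
    rng = np.random.default_rng(seed)                   # fixed seed => same answer every run
    a = np.asarray(successes, dtype=float) + prior[0]   # Beta(1,1) prior adds one pseudo-success
    b = np.asarray(failures, dtype=float) + prior[1]    # ...and one pseudo-failure per arm
    draws = rng.beta(a, b, size=(n_draws, a.size))      # one posterior sample per arm per draw
    wins = np.bincount(draws.argmax(axis=1), minlength=a.size)
    return wins / n_draws

cold = thompson_allocation([0, 0, 0, 0], [0, 0, 0, 0])   # no data yet: roughly even split
warm = thompson_allocation([90, 10], [10, 90])           # clear winner: almost all traffic
repeat = thompson_allocation([90, 10], [10, 90])         # identical to `warm`
```

With no history every arm starts from the same flat prior, so traffic is spread roughly evenly instead of collapsing onto whichever arm got lucky first.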

When not to offload

Determinism is a constraint, and constraints have cost. If the task is genuinely fuzzy — summarizing a doc, routing an intent, drafting copy — keep it in the model. A rule of thumb that's served me well:

If you want a batteries-included set of these as MCP tools — bandits, forecasting, Monte Carlo, optimization, anomaly/risk — I maintain OraClaw (npx -y @oraclaw/mcp-server; 11 of the tools are free, no key). But the pattern matters more than the tool — wire in whatever solver you like. Disclosure: I built it.

What Happens When 1,000 Agents Make the Same Mistake Simultaneously

Whatsonyourmind — Mon, 11 May 2026 18:00:38 +0000

Here is a scenario that has not happened yet at scale. It will.

A hedge fund runs 1,000 AI trading agents. Each manages a slice of the portfolio independently. Each uses an LLM for risk assessment -- evaluating positions, interpreting market signals, deciding whether to hold, hedge, or exit. The agents are diverse: different prompts, different context windows, different position sizes. On paper, this is a well-diversified system.

Tuesday morning, the market drops 3%.

Each agent independently evaluates its positions. The LLM in each agent processes the drop, considers historical context, and concludes some version of: "A 3% drop is within normal volatility. Current positions are within risk tolerance. Recommendation: hold."

This conclusion is reasonable. For any single agent, it is arguably correct. A 3% drop is within normal volatility. Individual positions are within their risk bands.

But 1,000 agents just made the same decision for the same reason at the same time. Every single one is holding. The aggregate exposure has not decreased by a single dollar.

Wednesday morning, the market drops another 5%. Total drawdown: 8%.

Now the same LLMs reassess. But the loss is already on the books, and selling now would crystallize it. The agents that were trained on "don't panic sell" hold longer. The agents that weren't start selling into a falling market, driving prices lower and triggering stop-losses in the agents that were holding. Cascade.

The fund loses 12% in 48 hours. Not because any individual agent made an irrational decision. Because every agent made the same rational-looking decision, and nobody was watching the correlation.

The Invisible Risk: Correlated Failures

Individual agent risk is measurable and manageable. System-level correlated risk is invisible until it detonates.

This is not a new concept in finance. Long-Term Capital Management collapsed in 1998 in large part for this reason -- less because its models were wrong about individual positions than because many sophisticated players in the market were running similar models and similar positions. When the correlation spiked, the diversification vanished.

LLM-based agents introduce a new variant of this problem. Traditional quant funds at least used different models -- different signals, different timeframes, different risk parameters. Agents running the same foundation model have a much deeper correlation: they share the same training data, the same reasoning patterns, the same blind spots.

When GPT-4 thinks a 3% drop is fine, it is not one agent's opinion. It is the opinion of every agent built on GPT-4. If those agents make up a large chunk of the market's decision-making apparatus, the model's assessment effectively becomes the market's assessment. This circularity is invisible to each individual agent.

Three Failure Modes Nobody Is Monitoring

1. Behavior correlation spikes. In normal markets, 1,000 agents with different contexts and positions behave differently. In stress scenarios, their behavior converges because the underlying LLM's response to stress follows the same pattern. If you are not measuring inter-agent behavior correlation in real time, you will not see the convergence until it is too late.

The fix is not better prompts. It is statistical monitoring that flags when the fleet's decisions become suspiciously aligned. When 950 out of 1,000 agents agree on the same action in a volatile market, that agreement itself is the risk signal -- regardless of whether the action looks correct individually. This is exactly the kind of deterministic guardrail OraClaw is built for: the agreement-correlation score is a number, not a narrative, and it does not share the foundation model's blind spots.

2. Tail risk blindness. LLMs trained on historical data learn the distribution of normal outcomes. They are systematically bad at reasoning about tail events -- the 1-in-100 scenarios where the most damage occurs. Ask any LLM what happens if the S&P drops 15% in a week, and you get a historically-informed narrative. You do not get a quantitative assessment of portfolio impact under correlated stress with proper fat-tail modeling.

Risk metrics designed for tail events exist. They simulate thousands of extreme scenarios, account for correlation structures that only appear during crises, and produce numbers -- not narratives -- for worst-case exposure. These metrics should sit between the agent and any risk decision, as a hard mathematical guardrail that the LLM cannot override. OraClaw runs 5,000-path Monte Carlo and returns VaR + CVaR + worst-case scenario in under 5ms — math the agent calls but cannot rewrite.

3. Ensemble agreement is not ensemble accuracy. Many multi-agent systems use agreement as a confidence signal: "If 4 out of 5 agents agree, the decision is high-confidence." This is valid when the agents are genuinely independent. It is dangerous when they share a common foundation model.

Five agents built on GPT-4 agreeing is not five independent opinions. It is one opinion expressed five times with slightly different wording. The agreement is measuring model consistency, not decision quality. Proper ensemble scoring detects when multiple models agree for the wrong reasons -- when agreement stems from shared bias rather than convergent evidence.
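
A back-of-the-envelope way to see this is to ask what "4 out of 5 agree" is worth once the agents share a model. The sketch below uses a toy model that is entirely my assumption: on a binary decision, with probability `rho` all agents echo one shared verdict, otherwise they vote independently, and either way a verdict is right with probability `p`. The function names are hypothetical.

```python
from math import comb


def p_at_least(k, n, p):
    """P(at least k of n independent Bernoulli(p) trials succeed)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def accuracy_given_agreement(p, rho, n=5, k=4):
    """P(the agreed answer is right | at least k of n agents agree), binary task.

    Toy model (an assumption): with probability rho every agent repeats one
    shared verdict; with probability 1 - rho the agents vote independently.
    """
    right = rho * p + (1 - rho) * p_at_least(k, n, p)
    wrong = rho * (1 - p) + (1 - rho) * p_at_least(k, n, 1 - p)
    return right / (right + wrong)


for rho in (0.0, 0.5, 0.8, 1.0):
    acc = accuracy_given_agreement(p=0.7, rho=rho)
    print(f"rho={rho:.1f}  P(correct | 4 of 5 agree) = {acc:.3f}")
```

Under these assumptions, agreement among independent agents that are each right 70% of the time is worth about 0.94, and it decays to 0.70 as `rho` goes to 1: the vote of five fully correlated agents carries no more information than one agent.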

What the Math Layer Looks Like

Multi-agent systems need three things that LLMs cannot provide:

Real-time correlation monitoring. Measuring the statistical similarity of agent decisions across the fleet, with alerts when correlation exceeds safe thresholds. This is a streaming statistics problem, not a reasoning problem.

Quantitative tail risk. VaR and CVaR computed at the portfolio level, accounting for position correlation, with proper fat-tail distributions. Updated continuously, not narrated occasionally.

Calibrated ensemble scoring. Measuring whether multi-agent agreement actually predicts accuracy, with correction factors for shared-model bias. Turning "4 out of 5 agree" into a real probability that the decision is correct.
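
The tail-risk item can be sketched with a small Monte Carlo. This is a hedged illustration and not OraClaw's method: I assume a one-factor model in which every position shares a Student-t market shock, an equal-weight portfolio, and made-up parameters (`rho`, `vol`, `df`). Note that `vol` is a scale parameter here, not a standard deviation, because the t shock has variance above 1.

```python
import math
import random


def student_t(rng, df):
    """Draw from a Student-t with df degrees of freedom (fat tails for small df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square(df) as Gamma(shape=df/2, scale=2)
    return z / math.sqrt(chi2 / df)


def simulate_portfolio_returns(n_paths, n_positions, rho, vol, df, seed=0):
    """Equal-weight portfolio where each position loads sqrt(rho) on a shared shock."""
    rng = random.Random(seed)
    returns = []
    for _ in range(n_paths):
        market = student_t(rng, df)  # fat-tailed shock common to every position
        total = 0.0
        for _ in range(n_positions):
            idio = rng.gauss(0.0, 1.0)  # position-specific noise
            total += vol * (math.sqrt(rho) * market + math.sqrt(1.0 - rho) * idio)
        returns.append(total / n_positions)
    return returns


def var_cvar(returns, level=0.99):
    """Empirical VaR and CVaR at the given level, reported as positive losses."""
    losses = sorted(-r for r in returns)
    cut = int(math.ceil(level * len(losses))) - 1
    tail = losses[cut:]
    return losses[cut], sum(tail) / len(tail)


calm = simulate_portfolio_returns(5000, 20, rho=0.0, vol=0.02, df=3)
stressed = simulate_portfolio_returns(5000, 20, rho=0.8, vol=0.02, df=3)
for name, rets in (("rho=0.0", calm), ("rho=0.8", stressed)):
    var, cvar = var_cvar(rets)
    print(f"{name}: 99% VaR = {var:.4f}, CVaR = {cvar:.4f}")
```

The point of the comparison is that the same 20 positions produce a much larger tail loss once they load on a common shock, which is the number a per-position risk check never sees.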

None of these require intelligence. They require math -- the kind that runs in milliseconds, produces auditable numbers, and does not share the blind spots of the system it is protecting. OraClaw's convergence-scoring tool does exactly this: Hellinger-distance over signal distributions, not vibe-checks over agent prose.
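
To make "a number, not a narrative" concrete, here is a minimal sketch of a fleet-level check that combines the raw agreement fraction with the Hellinger distance between today's action mix and a calm-period baseline. I do not know how OraClaw computes its score, so treat everything here as an assumption: the action labels, the baseline mix, the thresholds, and the `fleet_alert` helper are all hypothetical.

```python
import math
from collections import Counter

ACTIONS = ("hold", "hedge", "exit")  # hypothetical action labels


def action_distribution(decisions):
    """Share of the fleet choosing each action, in ACTIONS order."""
    counts = Counter(decisions)
    n = len(decisions)
    return [counts[a] / n for a in ACTIONS]


def hellinger(p, q):
    """Hellinger distance between discrete distributions: 0 identical, 1 disjoint."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2.0)


def fleet_alert(decisions, baseline, max_agreement=0.9, max_shift=0.3):
    """Flag the fleet when it is too unanimous or too far from its usual mix."""
    dist = action_distribution(decisions)
    agreement = max(dist)
    shift = hellinger(dist, baseline)
    alert = agreement > max_agreement or shift > max_shift
    return {"agreement": agreement, "shift": shift, "alert": alert}


baseline = [0.5, 0.3, 0.2]  # assumed action mix in calm markets
stressed = ["hold"] * 950 + ["hedge"] * 30 + ["exit"] * 20
calm = ["hold"] * 500 + ["hedge"] * 300 + ["exit"] * 200

print(fleet_alert(stressed, baseline))
print(fleet_alert(calm, baseline))
```

The check never asks whether "hold" was the right call. It only measures how unusual the fleet's unanimity is, which is why it does not inherit the model's blind spots.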

The Stakes

Single-agent failures are costly. Multi-agent correlated failures are catastrophic. The difference is not one of degree but of kind: independent mistakes partly cancel across the fleet, while correlated mistakes stack, because every agent errs in the same direction at once.
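
The standard variance identity for equally correlated errors puts a number on that difference. The sketch assumes each agent's error has the same standard deviation `sigma` and the same pairwise correlation `rho`; the function name is hypothetical.

```python
import math


def fleet_error_std(n, rho, sigma=1.0):
    """Std of the average of n errors with common std sigma and pairwise correlation rho.

    Var(mean) = sigma^2 * (1 + (n - 1) * rho) / n, which tends to sigma^2 * rho as n grows.
    """
    return sigma * math.sqrt((1.0 + (n - 1) * rho) / n)


for rho in (0.0, 0.5, 0.9):
    print(f"n=1000, rho={rho:.1f}: fleet error std = {fleet_error_std(1000, rho):.3f}")
```

With 1,000 independent agents the fleet-level error shrinks to about 3% of a single agent's. At `rho = 0.9` it stays near 95%, so adding agents buys almost no diversification.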

Your agents need a math layer between them and catastrophic decisions. Not a smarter prompt. Not a better model. A statistical guardrail that measures what the agents cannot see about themselves.

The math exists. The question is whether it will be deployed before or after the first correlated cascade.

Try OraClaw

OraClaw is an MCP server that gives Claude deterministic risk-and-correlation tools — calibrated probability, monotonic constraints, audit trails, ensemble scoring. The math layer your fleet needs before the first cascade, not after. Install in Claude Code:

claude mcp add oraclaw -- npx @oraclaw/mcp-server

17 tools, MIT licensed. Repo: github.com/Whatsonyourmind/oraclaw

Get Started

GitHub: github.com/Whatsonyourmind/oraclaw
More posts: dev.to/lukastan

OraClaw provides anomaly detection, risk metrics, and ensemble scoring for multi-agent systems.

So why is `hi` enormous?