<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Whatsonyourmind</title>
    <description>The latest articles on DEV Community by Whatsonyourmind (@whatsonyourmind).</description>
    <link>https://dev.to/whatsonyourmind</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3853543%2F7a1097de-0d5b-4ff6-9f15-8c33dcd87d8b.png</url>
      <title>DEV Community: Whatsonyourmind</title>
      <link>https://dev.to/whatsonyourmind</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/whatsonyourmind"/>
    <language>en</language>
    <item>
      <title>Your bandit's exploration floor probably violates its own floor</title>
      <dc:creator>Whatsonyourmind</dc:creator>
      <pubDate>Wed, 17 Jun 2026 22:19:34 +0000</pubDate>
      <link>https://dev.to/whatsonyourmind/your-bandits-exploration-floor-probably-violates-its-own-floor-24j2</link>
      <guid>https://dev.to/whatsonyourmind/your-bandits-exploration-floor-probably-violates-its-own-floor-24j2</guid>
      <description>&lt;p&gt;Many multi-armed bandit and A/B allocation systems add a &lt;strong&gt;minimum exploration weight&lt;/strong&gt;: every arm should get at least, say, 5% of traffic, so no variant is ever fully starved and you keep collecting data on all of them. The guarantee sounds simple (&lt;code&gt;p_i &amp;gt;= f&lt;/code&gt; for every arm), and the implementation looks even simpler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
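&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clip_renorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;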
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;maximum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# raise anything below the floor up to it
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;     &lt;span class="c1"&gt;# renormalize so probabilities sum to 1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is wrong, and it fails silently. The renormalize step pushes the floored arms &lt;strong&gt;back below the floor&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;Why clip-then-renormalize breaks&lt;/h2&gt;

&lt;p&gt;Clipping raises the small weights up to &lt;code&gt;f&lt;/code&gt;, which makes the total exceed 1. Dividing by that total then scales &lt;em&gt;everything&lt;/em&gt; down — including the arms you just clipped to &lt;code&gt;f&lt;/code&gt;. So they land below &lt;code&gt;f&lt;/code&gt; again, and the floor you advertised is not the floor you enforce.&lt;/p&gt;

&lt;p&gt;Concrete case — 4 arms, a confident winner, floor &lt;code&gt;f = 0.10&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;w   = [0.94, 0.02, 0.02, 0.02]   floor = 0.10
clip-renorm -&amp;gt; [0.7581, 0.0806, 0.0806, 0.0806]   min = 0.0806  ❌ (&amp;lt; 0.10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The three starved arms each get &lt;strong&gt;8.06%&lt;/strong&gt;, not the 10% you promised. And it isn't an edge case. Over 100,000 random peaky weight vectors (Dirichlet, α=0.3, n=4, f=0.10):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;clip-and-renormalize violated the floor 97.2% of the time&lt;/strong&gt; — worst arm seen: 7.69% against a 10% floor.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Whenever any arm's raw weight sits below the floor, which is exactly the situation once a bandit starts exploiting a dominant arm, the floor leaks.&lt;/p&gt;

&lt;h2&gt;The fix: one affine map onto the simplex&lt;/h2&gt;

&lt;p&gt;Instead of clipping, &lt;strong&gt;mix&lt;/strong&gt; the learned weights with the uniform floor. Put the weights on the simplex (&lt;code&gt;sum(w) = 1&lt;/code&gt;), then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;additive_simplex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each output is &lt;code&gt;f + (non-negative)&lt;/code&gt;, so &lt;code&gt;p_i &amp;gt;= f&lt;/code&gt; holds &lt;strong&gt;exactly&lt;/strong&gt;, and the total is &lt;code&gt;n*f + (1 - n*f)*1 = 1&lt;/code&gt; by construction — no renormalization needed, so nothing gets dragged back under the floor. It also preserves the &lt;em&gt;ordering&lt;/em&gt; and relative spacing of &lt;code&gt;w&lt;/code&gt; (it's affine), so you don't distort the policy you learned. Same run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;additive-simplex -&amp;gt; [0.664, 0.112, 0.112, 0.112]   min = 0.112  ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Over the same 100,000 vectors it violated the floor &lt;strong&gt;0.00%&lt;/strong&gt; of the time.&lt;/p&gt;
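
&lt;p&gt;Both claims can be checked side by side with a short script. This is a sketch assuming NumPy; the helper name &lt;code&gt;floor_violation_rate&lt;/code&gt;, the seed, and the reduced trial count are illustrative assumptions, not the setup behind the numbers quoted above.&lt;/p&gt;

```python
import numpy as np

def clip_renorm(w, f):
    # Clip up to the floor, then renormalize (the leaky version).
    p = np.maximum(w, f)
    return p / p.sum()

def additive_simplex(w, f):
    # Affine mix of the normalized weights with a uniform floor.
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    return f + (1.0 - len(w) * f) * w

def floor_violation_rate(allocate, f=0.10, n=4, trials=20_000, seed=0):
    # Hypothetical helper: fraction of random peaky weight vectors whose
    # smallest allocated probability ends up under the floor.
    rng = np.random.default_rng(seed)
    violations = 0
    for _ in range(trials):
        w = rng.dirichlet(np.full(n, 0.3))
        if f - allocate(w, f).min() > 1e-12:
            violations += 1
    return violations / trials

w = np.array([0.94, 0.02, 0.02, 0.02])
print(np.round(clip_renorm(w, 0.10), 4))       # smallest entry is below 0.10
print(np.round(additive_simplex(w, 0.10), 4))  # smallest entry is above 0.10
print(floor_violation_rate(clip_renorm))       # fraction of leaky allocations
print(floor_violation_rate(additive_simplex))  # fraction of leaky allocations
```

&lt;p&gt;The same seed feeds both calls, so the two rates are computed on identical weight vectors.&lt;/p&gt;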

&lt;h2&gt;The one guard you do need&lt;/h2&gt;

&lt;p&gt;The map needs &lt;code&gt;n * f &amp;lt;= 1&lt;/code&gt; — you can't promise four arms a 30% floor each (that's 120%). Handle it explicitly instead of producing negative weights:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;exploration_floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;floor must be non-negative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;full&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;# floor is infeasible -&amp;gt; uniform
&lt;/span&gt;    &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;asarray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the whole correct primitive: a non-negativity check, an infeasible-floor fallback to uniform, and the affine mix.&lt;/p&gt;
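
&lt;p&gt;A short usage sketch of the three branches, assuming NumPy and non-negative weights with a positive sum (the function itself does not check that). The input vector is made up for illustration. In the infeasible case the uniform fallback gives each arm &lt;code&gt;1/n&lt;/code&gt;, which is the most that can be promised to every arm at once, not the requested floor.&lt;/p&gt;

```python
import numpy as np

def exploration_floor(w, f):
    # Same primitive as above, repeated so this snippet runs on its own.
    n = len(w)
    if not f >= 0:                    # written this way, it also rejects NaN
        raise ValueError("floor must be non-negative")
    if n * f >= 1.0:
        return np.full(n, 1.0 / n)    # floor is infeasible -> uniform
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    return f + (1.0 - n * f) * w

weights = [0.94, 0.02, 0.02, 0.02]    # illustrative input

feasible = exploration_floor(weights, 0.10)
print(np.round(feasible, 3))          # every entry is at least 0.10

fallback = exploration_floor(weights, 0.30)
print(fallback)                       # uniform, because 4 * 0.30 exceeds 1

try:
    exploration_floor(weights, -0.05)
    rejected = False
except ValueError:
    rejected = True
print(rejected)                       # a negative floor is refused
```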

&lt;h2&gt;Why it actually matters&lt;/h2&gt;

&lt;p&gt;The exploration floor isn't cosmetic. It's what guarantees you keep collecting data on every arm, which is the property a lot of bandit regret arguments lean on, and it is often a fairness/SLA requirement too ("no variant ever drops below X%"). A floor that's silently 7.7% instead of 10% means the guarantee you reported to stakeholders, and any bound that depends on it, doesn't hold. The bug is invisible because the output still sums to 1 and still &lt;em&gt;looks&lt;/em&gt; floored: the smallest number is just quietly too small. The violation rate quoted earlier takes a few lines to reproduce:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default_rng&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;viol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100_000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirichlet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;maximum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;       &lt;span class="c1"&gt;# clip-renorm
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;1e-12&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;viol&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clip-renorm floor violations: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;viol&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;100_000&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# ~97%
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I ran into this reviewing a Thompson-sampling weighting routine and proposed the additive-simplex version (plus the two guards) as a fix upstream. If your bandit or weighted-experiment layer clips-then-renormalizes to enforce a minimum, it's worth a one-line check: does the smallest probability it emits actually clear the floor?&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>datascience</category>
      <category>statistics</category>
    </item>
    <item>
      <title>A model with R-squared near 0 can still give valid 90% prediction intervals - here's why (and the catch)</title>
      <dc:creator>Whatsonyourmind</dc:creator>
      <pubDate>Wed, 17 Jun 2026 21:33:25 +0000</pubDate>
      <link>https://dev.to/whatsonyourmind/a-model-with-r-squared-near-0-can-still-give-valid-90-prediction-intervals-heres-why-and-the-31jp</link>
      <guid>https://dev.to/whatsonyourmind/a-model-with-r-squared-near-0-can-still-give-valid-90-prediction-intervals-heres-why-and-the-31jp</guid>
      <description>&lt;p&gt;I recently calibrated a recovery-rate model that had only two weak features. Its point accuracy was almost nothing — R² basically zero. I expected its uncertainty estimates to be junk too. They weren't: the 90% conformal prediction intervals covered ~89% of held-out outcomes. Valid, just &lt;em&gt;wide&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;That surprised me enough to nail it down, because it contradicts a belief a lot of us carry around: &lt;em&gt;"my model isn't accurate, so I can't trust its uncertainty."&lt;/em&gt; For split conformal prediction, that's backwards. Here's the precise statement, a runnable demo, and the one caveat that actually bites.&lt;/p&gt;

&lt;h2&gt;Coverage is a property of the procedure, not the model&lt;/h2&gt;

&lt;p&gt;Split conformal prediction gives a distribution-free, finite-sample &lt;strong&gt;marginal coverage guarantee&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;P( Y ∈ Ĉ(X) ) ≥ 1 − α&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and it holds for &lt;strong&gt;any&lt;/strong&gt; point model, as long as the calibration and test data are exchangeable. The model is a black box. You fit it however you like, then on a held-out &lt;em&gt;calibration&lt;/em&gt; set of n points you take the ⌈(n+1)(1−α)⌉-th smallest absolute residual (roughly the (1−α) quantile, nudged up by a finite-sample correction), and that value becomes the half-width of your intervals.&lt;/p&gt;

&lt;p&gt;Nowhere does that construction require the model to be good. A bad model just has large residuals, so the calibration quantile is large, so the intervals are &lt;strong&gt;wide&lt;/strong&gt; — wide enough to still cover at the stated rate. Accuracy doesn't buy you &lt;em&gt;validity&lt;/em&gt;; it buys you &lt;em&gt;efficiency&lt;/em&gt; (narrower intervals at the same coverage).&lt;/p&gt;
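
&lt;p&gt;The calibration step is small enough to sketch on its own. This is a minimal illustration assuming NumPy; the function name &lt;code&gt;conformal_halfwidth&lt;/code&gt; and the synthetic data are hypothetical, and the point "model" is deliberately useless (it always predicts 0).&lt;/p&gt;

```python
import numpy as np

def conformal_halfwidth(abs_residuals, alpha):
    # Hypothetical helper: the ceil((n + 1) * (1 - alpha))-th smallest
    # absolute calibration residual, i.e. the (1 - alpha) quantile with
    # the usual finite-sample correction (clamped to the largest residual).
    r = np.sort(np.asarray(abs_residuals, dtype=float))
    k = int(np.ceil((len(r) + 1) * (1 - alpha)))
    return r[min(k, len(r)) - 1]

rng = np.random.default_rng(0)
y_cal = rng.normal(loc=3.0, scale=2.0, size=2000)    # calibration targets
y_test = rng.normal(loc=3.0, scale=2.0, size=2000)   # exchangeable test targets

# A deliberately bad point model: always predict 0, whatever the input.
pred_cal = np.zeros_like(y_cal)
pred_test = np.zeros_like(y_test)

q = conformal_halfwidth(np.abs(y_cal - pred_cal), alpha=0.10)
missed = np.abs(y_test - pred_test) > q    # True where the interval misses y
coverage = 1.0 - missed.mean()
print(f"half-width={q:.2f} coverage={coverage:.3f}")   # coverage lands near 0.90
```

&lt;p&gt;The half-width comes out large because the residuals are large, which is the trade described above: validity survives a bad model, efficiency does not.&lt;/p&gt;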

&lt;h2&gt;The demo (numbers are reproducible, seed fixed)&lt;/h2&gt;

&lt;p&gt;Same dataset and target, three models from strong to useless, target coverage 90%:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;model&lt;/th&gt;
&lt;th&gt;R²&lt;/th&gt;
&lt;th&gt;marginal coverage&lt;/th&gt;
&lt;th&gt;mean interval width&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gradient boosting&lt;/td&gt;
&lt;td&gt;0.741&lt;/td&gt;
&lt;td&gt;0.895&lt;/td&gt;
&lt;td&gt;5.39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;weak linear (1 noisy feature)&lt;/td&gt;
&lt;td&gt;0.061&lt;/td&gt;
&lt;td&gt;0.905&lt;/td&gt;
&lt;td&gt;10.39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;predict-the-mean&lt;/td&gt;
&lt;td&gt;−0.000&lt;/td&gt;
&lt;td&gt;0.907&lt;/td&gt;
&lt;td&gt;10.83&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All three land at ~90% coverage. The only thing that changes is width: the good model's intervals are &lt;strong&gt;about half as wide&lt;/strong&gt; (5.39 vs. 10.83 for predict-the-mean). That's the whole story in one table: validity is constant, efficiency tracks accuracy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GradientBoostingRegressor&lt;/span&gt;

&lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default_rng&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20260617&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6000&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;group&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;integers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;group&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4500&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4500&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt;
&lt;span class="n"&gt;Xtr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Xcal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Xte&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;s&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;ytr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ycal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;yte&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;s&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gte&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;s&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ALPHA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;conformal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Xtr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ytr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ycal&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Xcal&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ceil&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;ALPHA&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;          &lt;span class="c1"&gt;# calibration quantile
&lt;/span&gt;    &lt;span class="n"&gt;pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Xte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;covered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yte&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yte&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;yte&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;yte&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;yte&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;gcov&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;covered&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;gte&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gte&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: R2=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;6.3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; cov=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;covered&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; width=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;5.2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; group=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;gcov&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;conformal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;GradientBoostingRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strong&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Weak&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;conformal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Weak&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weak  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The catch: marginal ≠ conditional
&lt;/h2&gt;

&lt;p&gt;Here's the part you can't skip. The guarantee is &lt;strong&gt;marginal&lt;/strong&gt; — averaged over the whole distribution. It says nothing about coverage &lt;em&gt;within&lt;/em&gt; a subgroup. Watch what the same run reports per subgroup:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;model&lt;/th&gt;
&lt;th&gt;marginal&lt;/th&gt;
&lt;th&gt;group 0&lt;/th&gt;
&lt;th&gt;group 1&lt;/th&gt;
&lt;th&gt;group 2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;strong GBM&lt;/td&gt;
&lt;td&gt;0.895&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.835&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.985&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.857&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;predict-the-mean&lt;/td&gt;
&lt;td&gt;0.907&lt;/td&gt;
&lt;td&gt;0.889&lt;/td&gt;
&lt;td&gt;0.933&lt;/td&gt;
&lt;td&gt;0.897&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;em&gt;strong&lt;/em&gt; model has the &lt;em&gt;worse&lt;/em&gt; conditional coverage: groups 0 and 2 sit at 83.5% and 85.7%, while group 1 is over-covered at 98.5%. A single global residual quantile produces constant-width intervals that can't adapt to residuals that vary by group, so it robs the hard groups to pay the easy one. (The mean-only model looks more uniform here only because its residuals happen to be roughly homoskedastic across groups. That is luck, not virtue.)&lt;/p&gt;

&lt;p&gt;If your decisions are made per-subgroup — per region, per asset class, per customer segment — marginal coverage is not enough, and a high overall number can hide silent under-coverage where it matters. The fixes are &lt;strong&gt;Mondrian / group-conditional conformal&lt;/strong&gt; (calibrate a separate quantile per group) or a normalized/locally-weighted nonconformity score so interval width adapts.&lt;/p&gt;
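
&lt;p&gt;A minimal sketch of the Mondrian fix, assuming absolute-residual scores and a known group label for every row. The synthetic noise scales and the helper name &lt;code&gt;mondrian_quantiles&lt;/code&gt; are illustrative assumptions, not taken from the run above, and &lt;code&gt;method="higher"&lt;/code&gt; assumes NumPy 1.22 or newer:&lt;/p&gt;

```python
import numpy as np

def mondrian_quantiles(residuals, groups, alpha=0.10):
    """Calibrate one absolute-residual quantile per group."""
    q = {}
    for g in np.unique(groups):
        r = np.abs(residuals[groups == g])
        n = len(r)
        # Finite-sample corrected level, capped at 1.0 for very small groups.
        level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
        q[int(g)] = float(np.quantile(r, level, method="higher"))
    return q

# Synthetic stand-in for calibration/test residuals (y - model.predict(X)).
rng = np.random.default_rng(0)
sigma = np.array([2.0, 0.5, 1.5])            # assumed per-group noise scales
g_cal, g_test = rng.integers(0, 3, 3000), rng.integers(0, 3, 3000)
res_cal = rng.normal(0.0, sigma[g_cal])
res_test = rng.normal(0.0, sigma[g_test])

q_global = float(np.quantile(np.abs(res_cal), 0.90))
q_group = mondrian_quantiles(res_cal, g_cal, alpha=0.10)

cov_global, cov_group = {}, {}
for g in range(3):
    r = np.abs(res_test[g_test == g])
    cov_global[g] = float(np.mean(q_global >= r))     # one width for everyone
    cov_group[g] = float(np.mean(q_group[g] >= r))    # width calibrated per group
    print(f"group {g}: global={cov_global[g]:.3f}  mondrian={cov_group[g]:.3f}")
```

&lt;p&gt;The cost is that each group is calibrated on fewer points, so small groups get noisier (and sometimes much wider) quantiles. That trade-off is the reason to pick groups you actually act on rather than every slice you can think of.&lt;/p&gt;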

&lt;h2&gt;
  
  
  What to take away
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A weak model gives you &lt;strong&gt;wide but honest&lt;/strong&gt; intervals, not invalid ones. "The model is bad so the uncertainty is meaningless" is the wrong instinct — wide intervals &lt;em&gt;are&lt;/em&gt; the correct signal that the model doesn't know much.&lt;/li&gt;
&lt;li&gt;The genuinely dangerous case is the opposite: a confident-looking &lt;em&gt;narrow&lt;/em&gt; interval whose coverage is a lie. That happens not from low accuracy but from a &lt;strong&gt;broken exchangeability assumption&lt;/strong&gt; — distribution drift between calibration and deployment. (That failure mode, and adaptive conformal as the fix, is a separate write-up.)&lt;/li&gt;
&lt;li&gt;Always check &lt;strong&gt;conditional&lt;/strong&gt; coverage on the groups you actually act on. The marginal number is necessary, not sufficient.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conformal prediction is one of the few tools that gives you a real guarantee with almost no assumptions. Just remember which guarantee it gives — coverage over the whole distribution — and verify the rest yourself.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>datascience</category>
      <category>statistics</category>
    </item>
    <item>
      <title>Stop trusting the agent: bind tool-call approvals to the exact call</title>
      <dc:creator>Whatsonyourmind</dc:creator>
      <pubDate>Wed, 17 Jun 2026 15:11:04 +0000</pubDate>
      <link>https://dev.to/whatsonyourmind/stop-trusting-the-agent-bind-tool-call-approvals-to-the-exact-call-5080</link>
      <guid>https://dev.to/whatsonyourmind/stop-trusting-the-agent-bind-tool-call-approvals-to-the-exact-call-5080</guid>
      <description>&lt;p&gt;Agentic systems gate dangerous tool calls — file writes, money movement, deploys — behind an "approval": a human-in-the-loop click, or a policy check. Look at how that approval is usually represented and you'll often find a boolean sitting in the run/session state: &lt;code&gt;approved: true&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A boolean is the wrong primitive, and it fails in three ways that prompt injection is happy to exploit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three ways an approval boolean breaks
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Flip.&lt;/strong&gt; Anything that can write the run state — a serialized context crossing a process/durable-execution boundary, a confused-deputy code path, an injection that steers state — turns &lt;code&gt;false&lt;/code&gt; into &lt;code&gt;true&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replay.&lt;/strong&gt; You approved "read &lt;code&gt;report.csv&lt;/code&gt;". The approval is just &lt;code&gt;true&lt;/code&gt;, so the same flag is honored for the &lt;em&gt;next&lt;/em&gt; tool call too — "delete &lt;code&gt;prod.db&lt;/code&gt;". The boolean doesn't know which call it approved.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Argument drift.&lt;/strong&gt; You approved "transfer &lt;strong&gt;$10&lt;/strong&gt; to alice". Between approval and execution the args mutate to &lt;strong&gt;$10,000&lt;/strong&gt;. The boolean still says &lt;code&gt;approved&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The root cause is the same in all three: the approval is modeled as a &lt;strong&gt;property of the run&lt;/strong&gt;, when it should be &lt;strong&gt;evidence for one specific call&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bind the approval to the call
&lt;/h2&gt;

&lt;p&gt;When approval is granted, mint a tag over the things that must not change: the tool-call id, a digest of the canonical arguments, the principal, and an expiry. Verify it at dispatch, against a per-run secret.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hmac&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;canon&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# canonical serialization so benign reserialization doesn't invalidate a token.
&lt;/span&gt;    &lt;span class="c1"&gt;# (production: RFC 8785 JCS, which also normalizes numbers — 10 vs 10.0)
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;separators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;mint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;call_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;principal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;exp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;
    &lt;span class="n"&gt;digest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;canon&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;call_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;digest&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;principal&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hmac&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;call_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;call_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;principal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;principal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;call_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;principal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;call_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;call_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;   &lt;span class="c1"&gt;# replay onto another call
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;principal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;principal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;   &lt;span class="c1"&gt;# wrong principal
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;   &lt;span class="c1"&gt;# expired
&lt;/span&gt;    &lt;span class="n"&gt;digest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;canon&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;call_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;digest&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;principal&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;exp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;expect&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hmac&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hmac&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compare_digest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# missing / forged / flipped / arg-drift
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the three attacks against it (plus principal-swap and a forged tag):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;per-run-secret-not-a-global-one&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;tok&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;call-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user:42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# approve $10 to alice
&lt;/span&gt;
&lt;span class="nf"&gt;verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;call-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user:42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# True   legit
&lt;/span&gt;&lt;span class="nf"&gt;verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;call-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user:42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# False  replay
&lt;/span&gt;&lt;span class="nf"&gt;verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;call-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user:42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# False  arg drift
&lt;/span&gt;&lt;span class="nf"&gt;verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;call-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user:99&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# False  wrong principal
&lt;/span&gt;&lt;span class="nf"&gt;verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;00&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;call-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user:42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# False  forged
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The flag can no longer be flipped (no valid tag), replayed (call-id is in the MAC), or drifted (args digest is in the MAC). An attacker who fully controls the transported state still can't manufacture a token without the key.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three details that decide whether it actually holds
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Canonicalization.&lt;/strong&gt; Both sides must hash the &lt;em&gt;same bytes&lt;/em&gt;. Sort keys, and normalize numbers (&lt;code&gt;10&lt;/code&gt; vs &lt;code&gt;10.0&lt;/code&gt; vs &lt;code&gt;1e1&lt;/code&gt; must agree) — RFC 8785 (JSON Canonicalization Scheme) is the off-the-shelf answer. Put the canonicalization recipe id inside the hashed bytes so the two sides can't silently disagree about the rules.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fail closed, with a &lt;em&gt;typed&lt;/em&gt; result.&lt;/strong&gt; Absent / expired / mismatched ⇒ a distinct "not approved" outcome — not a normal tool payload, and not a generic exception. Otherwise "approval missing" is indistinguishable downstream from "the tool ran and returned something falsy," and the caller can't tell whether to re-request approval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One enforced checkpoint, deny-by-default.&lt;/strong&gt; This belongs at the single point right before dispatch: Semantic Kernel's &lt;code&gt;AUTO_FUNCTION_INVOCATION&lt;/code&gt; filter (don't call &lt;code&gt;next&lt;/code&gt; ⇒ the call is skipped), ADK's &lt;code&gt;before_tool_callback&lt;/code&gt;, or the MCP tool-call boundary. Tools that need approval are classified as such; anything unclassified is denied, not allowed through.&lt;/li&gt;
&lt;/ul&gt;
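
&lt;p&gt;The "fail closed, with a typed result" point can be sketched roughly as below. This is one possible shape, not a prescribed API: the names &lt;code&gt;Approval&lt;/code&gt;, &lt;code&gt;check&lt;/code&gt; and &lt;code&gt;dispatch&lt;/code&gt; are hypothetical, and the MAC layout simply mirrors the &lt;code&gt;mint&lt;/code&gt;/&lt;code&gt;verify&lt;/code&gt; snippet above so the block runs on its own:&lt;/p&gt;

```python
import hashlib
import hmac
import json
import time
from enum import Enum

class Approval(Enum):
    OK = "ok"
    MISSING = "missing"      # no token, or no tag in it
    MISMATCH = "mismatch"    # MAC does not cover this call, args, or principal
    EXPIRED = "expired"      # authentic token, but past its expiry

def _mac(key, call_id, args, principal, exp):
    # Same message layout as mint()/verify() in the article.
    body = json.dumps(args, sort_keys=True, separators=(",", ":")).encode()
    digest = hashlib.sha256(body).hexdigest()
    msg = f"{call_id}|{digest}|{principal}|{exp}".encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()

def mint(key, call_id, args, principal, ttl=300):
    exp = int(time.time()) + ttl
    return {"call_id": call_id, "principal": principal, "exp": exp,
            "tag": _mac(key, call_id, args, principal, exp)}

def check(key, tok, call_id, args, principal, now=None):
    """Typed variant of verify(): returns an Approval, never a bare bool."""
    if not isinstance(tok, dict) or not isinstance(tok.get("tag"), str):
        return Approval.MISSING
    exp = tok.get("exp")
    if not isinstance(exp, int):
        return Approval.MISMATCH
    expect = _mac(key, call_id, args, principal, exp)
    # Compare as bytes: compare_digest raises TypeError on non-ASCII str input.
    if not hmac.compare_digest(expect.encode(), tok["tag"].encode()):
        return Approval.MISMATCH
    now = time.time() if now is None else now
    return Approval.EXPIRED if now > exp else Approval.OK

def dispatch(key, tok, call_id, args, principal, tool):
    status = check(key, tok, call_id, args, principal)
    if status is not Approval.OK:
        # Distinct envelope: the tool never runs and the caller sees why.
        return {"approved": False, "reason": status.value}
    return {"approved": True, "result": tool(**args)}

KEY = b"per-run-secret"
args = {"amount": 10, "to": "alice"}
tok = mint(KEY, "call-1", args, "user:42")
print(dispatch(KEY, tok, "call-1", args, "user:42", lambda amount, to: 0))
print(dispatch(KEY, tok, "call-1", {"amount": 10000, "to": "alice"}, "user:42", lambda amount, to: 0))
print(dispatch(KEY, None, "call-1", args, "user:42", lambda amount, to: 0))
```

&lt;p&gt;The MAC is checked before the expiry on purpose: &lt;code&gt;EXPIRED&lt;/code&gt; then always means "authentic but stale", which is the one case where re-requesting approval is the right response. A tool that legitimately returns &lt;code&gt;0&lt;/code&gt; or an empty list stays distinguishable from a denial because the envelope, not the payload, carries the decision.&lt;/p&gt;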

&lt;h2&gt;
  
  
  The gotcha that bites in production: replay
&lt;/h2&gt;

&lt;p&gt;If your agent runs on a replay-based durable-execution engine (Temporal and friends), the per-run secret &lt;strong&gt;must survive replay&lt;/strong&gt;. Workflow code is re-executed from history on recovery, so a key minted with a non-deterministic call won't match the token already in history — approvals verify fine in dev and then &lt;strong&gt;fail closed after the first worker restart&lt;/strong&gt;, which is the worst possible time to discover it. Derive the key deterministically (&lt;code&gt;HKDF(server_secret, run_id)&lt;/code&gt;) or establish it once via a recorded side-effect, and make the expiry deterministic too rather than reading wall-clock inside workflow code.&lt;/p&gt;
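
&lt;p&gt;A sketch of the deterministic derivation, assuming HKDF-SHA256 (RFC 5869) built on the standard library with the run id passed as the &lt;code&gt;info&lt;/code&gt; input. The helper names, the &lt;code&gt;approval-key|&lt;/code&gt; label and the &lt;code&gt;approved_at&lt;/code&gt; timestamp are illustrative assumptions; in production a vetted HKDF implementation is the safer choice:&lt;/p&gt;

```python
import hashlib
import hmac

def hkdf_sha256(ikm, info, salt=b"", length=32):
    """HKDF (RFC 5869) over HMAC-SHA256: extract a PRK, then expand it."""
    if length > 255 * 32:
        raise ValueError("HKDF-SHA256 output is limited to 8160 bytes")
    # Extract: an empty salt is replaced by a block of zero bytes.
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    # Expand: chain HMAC blocks until enough output bytes exist.
    okm, block, counter = b"", b"", 1
    while length > len(okm):
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

def run_key(server_secret, run_id):
    # Same inputs on every replay give the same key: nothing random, nothing clock-based.
    return hkdf_sha256(server_secret, info=b"approval-key|" + run_id.encode())

def expiry(approved_at, ttl=300):
    # approved_at is assumed to be a timestamp recorded in workflow history,
    # so replay reads the stored value instead of calling time.time() again.
    return int(approved_at) + ttl

key_first = run_key(b"server-secret", "run-123")
key_replayed = run_key(b"server-secret", "run-123")
print(hmac.compare_digest(key_first, key_replayed))
```

&lt;p&gt;Because the key depends only on the server secret and the run id, a worker that restarts and replays history arrives at the same key and the token already in history still verifies.&lt;/p&gt;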

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;Authorization in an agent system shouldn't be ambient, mutable state that travels with the run. It should be &lt;strong&gt;evidence bound to a single call envelope&lt;/strong&gt; — this principal, this tool, these exact arguments, until this time — that the executor re-verifies at the moment of dispatch. The boolean isn't a simplification of that; it's the bug.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I work on reliability and verification for AI and numerical systems — agent authorization, determinism, and "prove the thing that claims to be authorized actually was." The snippet above is runnable as-is. Happy to compare notes if you're hardening an agent's tool boundary — &lt;a href="https://github.com/Whatsonyourmind" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>python</category>
      <category>llm</category>
    </item>
    <item>
      <title>Conformal prediction silently breaks under drift - and how to make it hold</title>
      <dc:creator>Whatsonyourmind</dc:creator>
      <pubDate>Wed, 17 Jun 2026 14:39:25 +0000</pubDate>
      <link>https://dev.to/whatsonyourmind/conformal-prediction-silently-breaks-under-drift-and-how-to-make-it-hold-466g</link>
      <guid>https://dev.to/whatsonyourmind/conformal-prediction-silently-breaks-under-drift-and-how-to-make-it-hold-466g</guid>
      <description>&lt;p&gt;Conformal prediction is the easiest way to put a calibrated uncertainty band around &lt;em&gt;any&lt;/em&gt; model: wrap a point predictor, and you get intervals with a finite-sample coverage guarantee — no distributional assumptions. It's deservedly popular.&lt;/p&gt;

&lt;p&gt;There's a catch that bites in production: that guarantee is &lt;strong&gt;marginal&lt;/strong&gt; and it assumes &lt;strong&gt;exchangeability&lt;/strong&gt;. The moment your data drifts — almost any time series, any online-serving setting — exchangeability is gone, and split-conformal silently stops delivering the coverage it promises. No error, just a band that's quietly too narrow.&lt;/p&gt;

&lt;p&gt;Here's the failure, then a fix that actually holds, with runnable code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The failure, measured
&lt;/h2&gt;

&lt;p&gt;Target 90% intervals. Residuals whose spread drifts upward over time (a textbook variance shift: the noise is heteroscedastic in time). Calibrate split-conformal on the first &lt;code&gt;W&lt;/code&gt; points and let it run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default_rng&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;                  &lt;span class="c1"&gt;# 90% target; W = calibration window
&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;3.0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;    &lt;span class="c1"&gt;# residual spread drifts upward
&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;standard_normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# nonconformity = |residual|
&lt;/span&gt;
&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;# frozen calibration quantile
&lt;/span&gt;&lt;span class="n"&gt;static&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;static&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;                 &lt;span class="c1"&gt;# -&amp;gt; 0.579
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;58% coverage where you asked for 90% — and in the &lt;strong&gt;last quarter&lt;/strong&gt; of the run, deep into the drift, it's &lt;strong&gt;35%&lt;/strong&gt;. A dashboard reporting "90% prediction intervals" would be off by more than half, with nothing flagging it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it breaks, and the two things you have to fix
&lt;/h2&gt;

&lt;p&gt;There are two distinct ways drift kills coverage, and they need different fixes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The score scale goes stale.&lt;/strong&gt; Your calibration scores were collected when residuals were small; now they're large. The frozen quantile is simply too small.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The miscoverage rate drifts.&lt;/strong&gt; Even with a reasonable scale, the realized error rate wanders away from &lt;code&gt;α&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Adaptive Conformal Inference&lt;/strong&gt; (Gibbs &amp;amp; Candès, 2021) fixes #2 directly. It treats the target miscoverage as a control variable and runs a feedback loop: after each step, nudge &lt;code&gt;α_t&lt;/code&gt; up if you've been covering too often, down if you've been missing too often.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;alpha_t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alpha_t&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;gamma&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;err_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# err_t = 1 if the point fell outside
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A miss pushes &lt;code&gt;α_t&lt;/code&gt; down → you use a higher quantile → wider next interval. It's a thermostat for coverage, and it gives a long-run coverage guarantee with &lt;em&gt;no&lt;/em&gt; exchangeability assumption.&lt;/p&gt;

&lt;p&gt;But ACI adapts the &lt;em&gt;level&lt;/em&gt;, not the &lt;em&gt;scale&lt;/em&gt;. Point it at a frozen calibration set and it helps a lot but hits a ceiling — once residuals exceed the largest score it ever saw, even &lt;code&gt;α_t → 0&lt;/code&gt; (the widest interval it can form) isn't wide enough. You also have to let the scores track the current regime, e.g. with a rolling window.&lt;/p&gt;

&lt;p&gt;Measured, same setup, four ways:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;method&lt;/th&gt;
&lt;th&gt;overall coverage&lt;/th&gt;
&lt;th&gt;coverage in late-drift tail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;static split-conformal&lt;/td&gt;
&lt;td&gt;0.579&lt;/td&gt;
&lt;td&gt;0.347&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ACI only (frozen calibration)&lt;/td&gt;
&lt;td&gt;0.864&lt;/td&gt;
&lt;td&gt;0.786&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;rolling window only&lt;/td&gt;
&lt;td&gt;0.862&lt;/td&gt;
&lt;td&gt;0.859&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;rolling window + ACI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.900&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.904&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Neither piece is enough alone. The rolling window supplies the right &lt;em&gt;scale&lt;/em&gt;; ACI supplies the &lt;em&gt;guarantee&lt;/em&gt;. Together they land on target (0.900 overall, 0.904 in the tail), even in the part of the series where the static method had collapsed to 35%.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;                              &lt;span class="c1"&gt;# rolling -&amp;gt; tracks the new scale
&lt;/span&gt;    &lt;span class="n"&gt;a_eff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1e-3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;1e-3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;covered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;a_eff&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;hold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;covered&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mf"&gt;0.02&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;covered&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;   &lt;span class="c1"&gt;# ACI feedback on miscoverage
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hold&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;                         &lt;span class="c1"&gt;# -&amp;gt; 0.900
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Three things that matter in practice
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The score function decides how close you get to conditional coverage.&lt;/strong&gt; &lt;code&gt;|y − ŷ|&lt;/code&gt; gives you marginal coverage with a constant-width band. If your noise is heteroscedastic and you want bands that are &lt;em&gt;locally&lt;/em&gt; right, normalize the score, &lt;code&gt;|y − ŷ| / σ̂(x)&lt;/code&gt;, or use Conformalized Quantile Regression (CQR), where the score is the signed distance to the nearest predicted quantile. The formal guarantee is still marginal; what changes is that the wide intervals show up where the data is actually noisy, which moves you toward conditional coverage without guaranteeing it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage is a usable drift signal, but a noisy one.&lt;/strong&gt; Rolling empirical coverage drifting away from &lt;code&gt;1 − α&lt;/code&gt; is a cheap, model-agnostic drift detector. Just remember it's a Bernoulli mean: its standard error is &lt;code&gt;sqrt(c(1−c)/n)&lt;/code&gt;, so over a 100-point window a 90%-coverage estimate has a standard error of 3 points, or roughly ±6 points at two standard errors. Trigger on sustained deviation, not one short window.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pick &lt;code&gt;γ&lt;/code&gt; for your drift speed.&lt;/strong&gt; Larger &lt;code&gt;γ&lt;/code&gt; tracks faster but makes interval widths jumpier; smaller &lt;code&gt;γ&lt;/code&gt; is smoother but lags. &lt;code&gt;0.01–0.05&lt;/code&gt; is a sane starting range; tune against your realized coverage trace, not in the abstract.&lt;/li&gt;
&lt;/ul&gt;
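&lt;p&gt;The standard-error arithmetic and the "sustained deviation" rule from the list above can be sketched as follows. The function names, the two-standard-error band, and the three-window patience are assumptions chosen for illustration, and the check treats the windows as non-overlapping.&lt;/p&gt;

```python
import math


def coverage_se(c: float, n: int) -> float:
    """Standard error of an empirical coverage rate c measured over n points."""
    return math.sqrt(c * (1.0 - c) / n)


def sustained_drift(window_coverage, target=0.90, n=100, z=2.0, patience=3):
    """True once `patience` consecutive windows sit more than z standard errors from target."""
    band = z * coverage_se(target, n)
    streak = 0
    for c in window_coverage:
        # Extend the streak on a deviation, reset it on any in-band window.
        streak = streak + 1 if abs(c - target) > band else 0
        if streak >= patience:
            return True
    return False


print(round(coverage_se(0.90, 100), 3))              # 0.03
print(sustained_drift([0.90, 0.82, 0.91, 0.89]))     # False: one noisy window
print(sustained_drift([0.83, 0.81, 0.80]))           # True: three in a row
```

&lt;p&gt;A single window at 0.82 is outside the band but does not trigger on its own; three consecutive low windows do.&lt;/p&gt;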

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;A guarantee that assumes exchangeability is not a guarantee in production — it's an assumption wearing a guarantee's clothes. What makes ACI worth reaching for is that it &lt;em&gt;drops&lt;/em&gt; the assumption and replaces it with a feedback loop you can actually verify online: watch the realized coverage, and let it correct itself. If you serve intervals anywhere a too-narrow band is expensive, that self-correction is the difference between a number you can trust and one that quietly lies as the world moves.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I work on reliability and verification for numerical and AI systems — calibration, drift, and "does the guarantee actually hold under load" tooling. The benchmark above is fully runnable; I'm happy to compare notes if you're putting conformal methods into production — &lt;a href="https://github.com/Whatsonyourmind" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>datascience</category>
      <category>statistics</category>
    </item>
    <item>
      <title>When your optimizer silently returns the wrong answer (and how to catch it)</title>
      <dc:creator>Whatsonyourmind</dc:creator>
      <pubDate>Wed, 17 Jun 2026 13:49:40 +0000</pubDate>
      <link>https://dev.to/whatsonyourmind/when-your-optimizer-silently-returns-the-wrong-answer-and-how-to-catch-it-ll6</link>
      <guid>https://dev.to/whatsonyourmind/when-your-optimizer-silently-returns-the-wrong-answer-and-how-to-catch-it-ll6</guid>
      <description>&lt;p&gt;Numerical solvers have a failure mode that is worse than crashing: every so often they return &lt;code&gt;status: Optimal&lt;/code&gt; and hand you a number that is simply wrong. No exception, no warning — just a confident, incorrect optimum. If that number drives a downstream decision (a schedule, an allocation, a price), you may never notice.&lt;/p&gt;

&lt;p&gt;I ran into a clean example of this in &lt;a href="https://github.com/ERGO-Code/HiGHS" rel="noopener noreferrer"&gt;HiGHS&lt;/a&gt; recently while reducing a bug that had surfaced through cvxpy, and the debugging path generalizes to any LP/QP/MILP stack. Here's the case, how I isolated it, and a short checklist you can apply to your own models.&lt;/p&gt;

&lt;h2&gt;
  
  
  The symptom: same model, two answers
&lt;/h2&gt;

&lt;p&gt;Take a mixed-integer model that HiGHS solves to &lt;code&gt;Optimal&lt;/code&gt; with objective &lt;code&gt;0.0&lt;/code&gt; under default settings. Solve the &lt;em&gt;same&lt;/em&gt; model with presolve turned off and you get &lt;code&gt;Optimal&lt;/code&gt; with objective ≈ &lt;code&gt;6.68e8&lt;/code&gt;. Both runs report success. At least one of them is wrong.&lt;/p&gt;

&lt;p&gt;When presolve-on and presolve-off disagree by this much on a problem that has a well-defined, bounded optimum, that is not a tolerance issue: it means one of the reduction steps is mangling the model. (&lt;a href="https://github.com/ERGO-Code/HiGHS/issues/2900" rel="noopener noreferrer"&gt;This particular case&lt;/a&gt; is an open, actively investigated issue; a separate wrong answer that I reduced to a standalone &lt;code&gt;.mps&lt;/code&gt; from a cvxpy program is &lt;a href="https://github.com/ERGO-Code/HiGHS/issues/3073" rel="noopener noreferrer"&gt;filed here&lt;/a&gt;.)&lt;/p&gt;

&lt;h2&gt;
  
  
  The first diagnostic is free: flip presolve
&lt;/h2&gt;

&lt;p&gt;Before anything else, re-solve with presolve disabled and compare the two objectives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;highspy&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;solve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;presolve&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;highspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Highs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setOptionValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;presolve&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;presolve&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setOptionValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_flag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getObjectiveValue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;on&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;solve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model.mps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;off&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;solve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model.mps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;off&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;off&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="c1"&gt;# disagree on a feasible, bounded model =&amp;gt; bug in a reduction
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same idea works through the modeling layer — in cvxpy, compare &lt;code&gt;prob.solve(solver=cp.HIGHS)&lt;/code&gt; against the same solve with &lt;code&gt;{"presolve": "off"}&lt;/code&gt;. If the two disagree, a reduction step is the culprit, and you have already cut the search space in half.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why scaling is so often the trigger
&lt;/h2&gt;

&lt;p&gt;The common thread in this family of bugs is &lt;strong&gt;coefficient magnitude&lt;/strong&gt;. HiGHS prints the coefficient ranges at the top of every run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Coefficient ranges:
  Matrix  [4e-01, 5e+02]
  Cost    [2e+01, 3e+02]
  Bound   [1e+02, 1e+02]
  RHS     [3e+01, 2e+04]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a single constraint mixes coefficients spanning many orders of magnitude, bound-tightening and substitution accumulate floating-point error, and integer-rounding logic ("this RHS rounds up to the next integer bound") can tip the wrong way. The minimal reproducer I extracted kept exactly the rows whose coefficients carried the large magnitudes — drop them and the collapse disappears.&lt;/p&gt;

&lt;p&gt;The same root cause shows up across solvers, just wearing different clothes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OSQP&lt;/strong&gt; (QP): an &lt;a href="https://github.com/osqp/osqp/issues/760" rel="noopener noreferrer"&gt;open report&lt;/a&gt; where v1.0.0+ runs all the way to max-iterations with &lt;code&gt;gap = -nan&lt;/code&gt;, &lt;em&gt;even though&lt;/em&gt; the primal and dual residuals are already at &lt;code&gt;1e-14&lt;/code&gt;. The duality-gap termination criterion is poisoned by a NaN, so the solver never recognizes that it has already converged.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clarabel&lt;/strong&gt; (conic/QP): a &lt;a href="https://github.com/oxfordcontrol/Clarabel.rs/issues/217" rel="noopener noreferrer"&gt;report&lt;/a&gt; where a wildly ill-scaled QP (objective on the order of &lt;code&gt;1e9&lt;/code&gt;) returns a false &lt;code&gt;PrimalInfeasible&lt;/code&gt; with equilibration on, but solves cleanly with &lt;code&gt;equilibrate_enable=False&lt;/code&gt;. Ruiz equilibration is capped at &lt;code&gt;equilibrate_max_scaling = 1e4&lt;/code&gt; by default — about four orders short of a &lt;code&gt;1e8&lt;/code&gt; dynamic range, so the post-scaling KKT system is still badly conditioned.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Different solvers, same lesson: magnitude is not cosmetic.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to minimize a solver bug so it actually gets fixed
&lt;/h2&gt;

&lt;p&gt;A 350-row model is not a bug report a maintainer can act on. The reduction loop is mechanical and worth automating:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reproduce on the latest release first.&lt;/strong&gt; Half of "bugs" are already fixed. Pin the version you tested.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Greedily drop rows and columns.&lt;/strong&gt; Remove a chunk; if the wrong-answer signature survives, keep it removed; otherwise restore it and try a smaller chunk. Binary-search your way down. I took one case from 348×169 to 41×40 this way and it still collapsed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make the "still broken" check a predicate, not an eyeball.&lt;/strong&gt; Here it was &lt;code&gt;abs(on - off) &amp;gt; tol&lt;/code&gt; (or &lt;code&gt;status == Infeasible&lt;/code&gt; while presolve-off says &lt;code&gt;Optimal&lt;/code&gt;), re-evaluated after every removal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export the reduced model to a portable format&lt;/strong&gt; (&lt;code&gt;.mps&lt;/code&gt;) so the report is solver-version- and language-independent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File with three things: the version, the exact on/off command delta, and the minimal &lt;code&gt;.mps&lt;/code&gt;.&lt;/strong&gt; That is a report that gets triaged in minutes instead of sitting untouched.&lt;/li&gt;
&lt;/ol&gt;
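&lt;p&gt;The greedy drop-and-restore loop in step 2, driven by the predicate from step 3, can be sketched in a solver-agnostic way. This is an illustrative outline, not the reducer used for the reports above: &lt;code&gt;reduce_rows&lt;/code&gt; and &lt;code&gt;still_broken&lt;/code&gt; are hypothetical names, and in a real run &lt;code&gt;still_broken&lt;/code&gt; would rebuild the model from the kept rows, solve it with presolve on and off, and compare.&lt;/p&gt;

```python
def reduce_rows(rows, still_broken):
    """Greedy chunked reduction: drop chunks while the failure signature survives."""
    rows = list(rows)
    chunk = max(1, len(rows) // 2)
    while True:
        i, removed_any = 0, False
        while len(rows) > i:
            candidate = rows[:i] + rows[i + chunk:]
            if candidate and still_broken(candidate):
                rows = candidate          # signature survived: keep the chunk removed
                removed_any = True
            else:
                i += chunk                # chunk was needed: restore it and move on
        if chunk == 1 and not removed_any:
            return rows                   # no single row can be dropped any more
        chunk = max(1, chunk // 2)        # retry with smaller chunks


# Toy predicate: the "bug" only reproduces when rows 3 and 7 are both present.
print(reduce_rows(range(10), lambda kept: {3, 7}.issubset(kept)))  # [3, 7]
```

&lt;p&gt;The same loop applies to columns; alternating row and column passes until neither shrinks is one reasonable way to reach a small reproducer.&lt;/p&gt;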

&lt;h2&gt;
  
  
  A scaling-hygiene checklist
&lt;/h2&gt;

&lt;p&gt;Even when there is no solver bug, bad scaling silently erodes accuracy. Cheap habits that prevent most of it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read the coefficient ranges on every run.&lt;/strong&gt; If the matrix or RHS spans more than ~&lt;code&gt;1e6&lt;/code&gt;, treat the result with suspicion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rescale units before the solver sees them&lt;/strong&gt; (dollars → millions, bytes → GB). Single highest-leverage fix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do not encode big-M larger than necessary.&lt;/strong&gt; An &lt;code&gt;M&lt;/code&gt; of &lt;code&gt;1e9&lt;/code&gt; where &lt;code&gt;1e4&lt;/code&gt; would do is how you manufacture these bugs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep a presolve-off run in your test suite&lt;/strong&gt; for any model whose output you trust blindly — a periodic on/off agreement check is a cheap regression guard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For QP/conic, check the equilibration cap&lt;/strong&gt; against your data's dynamic range, and prefer pre-scaling to relying on the solver to rescue pathological inputs.&lt;/li&gt;
&lt;/ul&gt;
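&lt;p&gt;The first habit is easy to automate before the model ever reaches a solver. A minimal sketch, assuming you can list the nonzero coefficients of the matrix or right-hand side yourself; the function names and the six-orders threshold (mirroring the ~&lt;code&gt;1e6&lt;/code&gt; rule of thumb) are illustrative choices:&lt;/p&gt;

```python
import math


def dynamic_range(coefficients):
    """Return (min |a|, max |a|, orders of magnitude spanned) over the nonzero entries."""
    mags = [abs(a) for a in coefficients if a != 0]
    if not mags:
        return 0.0, 0.0, 0.0
    lo, hi = min(mags), max(mags)
    return lo, hi, math.log10(hi / lo)


def badly_scaled(coefficients, max_orders=6.0):
    # More than six orders of magnitude between smallest and largest entry is suspect.
    return dynamic_range(coefficients)[2] > max_orders


print(badly_scaled([0.4, 500.0]))        # False: about three orders of magnitude
print(badly_scaled([1.0, 1e9, 0.0]))     # True: a big-M of 1e9 next to unit coefficients
```

&lt;p&gt;Running a check like this in CI on each model's matrix, cost, and RHS is a cheap way to catch an oversized big-M before it reaches presolve.&lt;/p&gt;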

&lt;h2&gt;
  
  
  The broader point
&lt;/h2&gt;

&lt;p&gt;These bugs are dangerous precisely because the solver's contract — "I returned &lt;code&gt;Optimal&lt;/code&gt;" — is exactly what you would normally trust. The on/off differential is so useful &lt;em&gt;because&lt;/em&gt; it doesn't trust that contract: it cross-checks two code paths that are supposed to agree and flags the moment they don't. That "verify the thing that claims to be correct" instinct is worth wiring into any pipeline where a wrong number is expensive.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I work on reliability and verification for numerical and AI systems — minimal reproducers, determinism, and "prove the output is what it claims" tooling; the HiGHS reducer above came out of that. The issues referenced are linked inline. If you hit something in this family, I'm happy to compare notes — &lt;a href="https://github.com/Whatsonyourmind" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>optimization</category>
      <category>python</category>
      <category>debugging</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Determinism as a feature: when to let your agent call a math API instead of reasoning</title>
      <dc:creator>Whatsonyourmind</dc:creator>
      <pubDate>Wed, 17 Jun 2026 09:16:18 +0000</pubDate>
      <link>https://dev.to/whatsonyourmind/determinism-as-a-feature-when-to-let-your-agent-call-a-math-api-instead-of-reasoning-10mf</link>
      <guid>https://dev.to/whatsonyourmind/determinism-as-a-feature-when-to-let-your-agent-call-a-math-api-instead-of-reasoning-10mf</guid>
      <description>&lt;p&gt;LLM agents are great at deciding &lt;em&gt;what&lt;/em&gt; to do and unreliable at &lt;em&gt;computing&lt;/em&gt; it. Ask one to allocate traffic across five variants, price tail risk, or solve a scheduling constraint and you'll get a confident, plausible, subtly-wrong number — tokens burned included.&lt;/p&gt;

&lt;p&gt;The fix usually isn't a better prompt. It's the same instinct that gave us the calculator: move the deterministic math out of the probabilistic engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tell
&lt;/h2&gt;

&lt;p&gt;You have a determinism problem the moment your agent's output needs to be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;reproducible&lt;/strong&gt; — same inputs → same answer, every run,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;auditable&lt;/strong&gt; — someone can check &lt;em&gt;why&lt;/em&gt; it's 0.62 and not 0.61, or&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;correct under adversarial inputs&lt;/strong&gt; — a fat-tailed return, an infeasible constraint.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An LLM gives you none of those for free. A tool call does.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to offload (and a cheap test for each)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;"Which variant should I ship?"&lt;/strong&gt; → a multi-armed / contextual bandit. The agent picks &lt;em&gt;the question&lt;/em&gt;; Thompson sampling picks the allocation. Test: ask your agent to allocate 1,000 users across 4 arms with the same conversion counts, twice. Different answers? Offload it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Is this metric anomalous?"&lt;/strong&gt; → score the series against a baseline; don't eyeball it inside the context window.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"What's the 95% VaR / CVaR?"&lt;/strong&gt; → Monte Carlo paths, not a vibe.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Schedule these tasks under these limits"&lt;/strong&gt; → an LP/MIP solver. LLMs can't reliably satisfy hard constraints; solvers can't violate them.&lt;/li&gt;
&lt;/ol&gt;
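&lt;p&gt;As one concrete instance of item 3, here is a small seeded Monte Carlo sketch for VaR and CVaR. The normal return model and its parameters are placeholder assumptions (a fat-tailed model would slot into the same structure), and &lt;code&gt;var_cvar&lt;/code&gt; is a hypothetical name, not a function from any library.&lt;/p&gt;

```python
import random


def var_cvar(mu=0.0005, sigma=0.02, n_paths=100_000, level=0.95, seed=0):
    """Monte Carlo VaR and CVaR of a one-period loss under an assumed normal return model."""
    rng = random.Random(seed)             # fixed seed: same inputs, same numbers
    losses = sorted(-rng.gauss(mu, sigma) for _ in range(n_paths))
    cut = int(level * n_paths)            # index of the level-quantile loss
    var = losses[cut]
    tail = losses[cut:]                   # losses at or beyond VaR
    return var, sum(tail) / len(tail)


var95, cvar95 = var_cvar()
print(var95, cvar95)                      # CVaR is the mean of the tail, so it is never below VaR
```

&lt;p&gt;Calling it twice with the same arguments returns the same pair, which is the property the agent cannot give you on its own.&lt;/p&gt;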

&lt;h2&gt;
  
  
  The pattern
&lt;/h2&gt;

&lt;p&gt;Expose the math as MCP tools so the agent calls them like any other tool — intent stays in the model, the number comes from code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// agent decides intent; the tool computes the answer&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;alloc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;callTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;optimize_contextual&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;arms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;variants&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;// [{ id, name }]&lt;/span&gt;
  &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userFeatures&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;// segment, prior_open_rate, hour_of_day&lt;/span&gt;
  &lt;span class="na"&gt;history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;pastRewards&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="c1"&gt;// `alloc` is reproducible, sub-millisecond, and you can show your work&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two design details that bite people:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Delayed reward.&lt;/strong&gt; If reward trickles in (email opens over hours), set a fixed attribution window before crediting an arm — otherwise the bandit over-exploits early openers and collapses variant diversity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cold start.&lt;/strong&gt; Start each arm on a &lt;code&gt;Beta(1,1)&lt;/code&gt; prior (or an informed prior from past campaigns) so exploration doesn't die on run one.&lt;/li&gt;
&lt;/ul&gt;
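
&lt;p&gt;Both details fit in a few lines. The sketch below is a minimal Thompson-sampling arm, assuming rewards are binary (opened or not) and timestamps are in seconds. &lt;code&gt;BetaArm&lt;/code&gt;, &lt;code&gt;choose&lt;/code&gt;, and the 6-hour &lt;code&gt;ATTRIBUTION_WINDOW_S&lt;/code&gt; are hypothetical names and values for illustration, not part of any tool mentioned here:&lt;/p&gt;

```python
import random

ATTRIBUTION_WINDOW_S = 6 * 3600  # assumed: only opens within 6 hours of send count


class BetaArm:
    """One arm with a Beta(alpha, beta) posterior; Beta(1, 1) is the uniform prior."""

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha = alpha
        self.beta = beta

    def sample(self, rng):
        # Thompson sampling: draw a plausible reward rate from the posterior.
        return rng.betavariate(self.alpha, self.beta)

    def credit(self, sent_at, opened_at, now):
        """Update the posterior only after the attribution window has closed.

        Returns True if the send was counted, False if it is still pending.
        """
        if ATTRIBUTION_WINDOW_S > now - sent_at:
            return False  # window still open: do not credit yet
        opened_in_window = (
            opened_at is not None and ATTRIBUTION_WINDOW_S >= opened_at - sent_at
        )
        if opened_in_window:
            self.alpha += 1  # success
        else:
            self.beta += 1  # failure: no open, or an open after the window
        return True


def choose(arms, rng):
    # Pick the arm whose posterior draw is highest.
    return max(arms, key=lambda name: arms[name].sample(rng))
```

&lt;p&gt;Because every send waits for the same window before it is scored, a fast opener and a slow opener are judged on equal terms, and the uniform prior keeps every arm in play until real data arrives.&lt;/p&gt;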

&lt;h2&gt;
  
  
  When &lt;em&gt;not&lt;/em&gt; to offload
&lt;/h2&gt;

&lt;p&gt;Determinism is a constraint, and constraints have cost. If the task is genuinely fuzzy — summarizing a doc, routing an intent, drafting copy — keep it in the model. A rule of thumb that's served me well:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you'd want a unit test for the output, it belongs in a tool, not a prompt.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;If you want a batteries-included set of these as MCP tools — bandits, forecasting, Monte Carlo, optimization, anomaly/risk — I maintain &lt;a href="https://github.com/Whatsonyourmind/oraclaw" rel="noopener noreferrer"&gt;OraClaw&lt;/a&gt; (&lt;code&gt;npx -y @oraclaw/mcp-server&lt;/code&gt;; 11 of the tools are free, no key). But the pattern matters more than the tool — wire in whatever solver you like. Disclosure: I built it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>What Happens When 1,000 Agents Make the Same Mistake Simultaneously</title>
      <dc:creator>Whatsonyourmind</dc:creator>
      <pubDate>Mon, 11 May 2026 18:00:38 +0000</pubDate>
      <link>https://dev.to/whatsonyourmind/what-happens-when-1000-agents-make-the-same-mistake-simultaneously-4icb</link>
      <guid>https://dev.to/whatsonyourmind/what-happens-when-1000-agents-make-the-same-mistake-simultaneously-4icb</guid>
      <description>&lt;h1&gt;
  
  
  What Happens When 1,000 Agents Make the Same Mistake Simultaneously
&lt;/h1&gt;

&lt;p&gt;Here is a scenario that has not happened yet at scale. It will.&lt;/p&gt;

&lt;p&gt;A hedge fund runs 1,000 AI trading agents. Each manages a slice of the portfolio independently. Each uses an LLM for risk assessment -- evaluating positions, interpreting market signals, deciding whether to hold, hedge, or exit. The agents are diverse: different prompts, different context windows, different position sizes. On paper, this is a well-diversified system.&lt;/p&gt;

&lt;p&gt;Tuesday morning, the market drops 3%.&lt;/p&gt;

&lt;p&gt;Each agent independently evaluates its positions. The LLM in each agent processes the drop, considers historical context, and concludes some version of: "A 3% drop is within normal volatility. Current positions are within risk tolerance. Recommendation: hold."&lt;/p&gt;

&lt;p&gt;This conclusion is reasonable. For any single agent, it is arguably correct. A 3% drop &lt;em&gt;is&lt;/em&gt; within normal volatility. Individual positions &lt;em&gt;are&lt;/em&gt; within their risk bands.&lt;/p&gt;

&lt;p&gt;But 1,000 agents just made the same decision for the same reason at the same time. Every single one is holding. The aggregate exposure has not decreased by a single dollar.&lt;/p&gt;

&lt;p&gt;Wednesday morning, the market drops another 5%. Total drawdown: 8%.&lt;/p&gt;

&lt;p&gt;Now the same LLMs reassess. But the loss is already locked in. Selling now crystallizes the damage. The agents that were trained on "don't panic sell" hold longer. The agents that weren't start selling into a falling market, driving prices lower, triggering stop-losses in the agents that were holding. Cascade.&lt;/p&gt;

&lt;p&gt;The fund loses 12% in 48 hours. Not because any individual agent made an irrational decision. Because every agent made the &lt;em&gt;same&lt;/em&gt; rational-looking decision, and nobody was watching the correlation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Invisible Risk: Correlated Failures
&lt;/h2&gt;

&lt;p&gt;Individual agent risk is measurable and manageable. System-level correlated risk is invisible until it detonates.&lt;/p&gt;

&lt;p&gt;This is not a new concept in finance. Long-Term Capital Management collapsed in 1998 for exactly this reason -- not because their models were wrong about individual positions, but because every sophisticated player in the market was running similar models and similar positions. When the correlation spiked, the diversification vanished.&lt;/p&gt;

&lt;p&gt;LLM-based agents introduce a new variant of this problem. Traditional quant funds at least used &lt;em&gt;different&lt;/em&gt; models -- different signals, different timeframes, different risk parameters. Agents running the same foundation model have a much deeper correlation: they share the same training data, the same reasoning patterns, the same blind spots.&lt;/p&gt;

&lt;p&gt;When GPT-4 thinks a 3% drop is fine, it is not one agent's opinion. It is the opinion of every agent built on GPT-4. The model's assessment is the market's assessment, because the model &lt;em&gt;is&lt;/em&gt; a large chunk of the market's decision-making apparatus. This circularity is invisible to each individual agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Failure Modes Nobody Is Monitoring
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Behavior correlation spikes.&lt;/strong&gt; In normal markets, 1,000 agents with different contexts and positions behave differently. In stress scenarios, their behavior converges because the underlying LLM's response to stress follows the same pattern. If you are not measuring inter-agent behavior correlation in real time, you will not see the convergence until it is too late.&lt;/p&gt;

&lt;p&gt;The fix is not better prompts. It is statistical monitoring that flags when the fleet's decisions become suspiciously aligned. When 950 out of 1,000 agents agree on the same action in a volatile market, that agreement itself is the risk signal -- regardless of whether the action looks correct individually. This is exactly the kind of deterministic guardrail OraClaw is built for: the agreement-correlation score is a number, not a narrative, and it does not share the foundation model's blind spots.&lt;/p&gt;
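
&lt;p&gt;The simplest version of that monitor is just a share of the fleet. A minimal sketch, assuming each agent reports one action label per decision cycle; &lt;code&gt;agreement_share&lt;/code&gt;, &lt;code&gt;herding_alert&lt;/code&gt;, and the 0.9 threshold are hypothetical and would need tuning against your own fleet's history:&lt;/p&gt;

```python
from collections import Counter


def agreement_share(actions):
    """Return (most common action, share of the fleet that chose it)."""
    if not actions:
        raise ValueError("need at least one action")
    action, count = Counter(actions).most_common(1)[0]
    return action, count / len(actions)


def herding_alert(actions, market_is_volatile, max_share=0.9):
    """Flag near-unanimous decisions that coincide with a volatile market.

    max_share is an assumed threshold, not a recommended value.
    """
    _, share = agreement_share(actions)
    return market_is_volatile and share >= max_share


fleet = ["hold"] * 950 + ["sell"] * 30 + ["hedge"] * 20
print(agreement_share(fleet), herding_alert(fleet, market_is_volatile=True))
```

&lt;p&gt;With 950 of 1,000 agents holding, the share is 0.95 and the alert fires in a volatile market, whatever each agent's individual reasoning says.&lt;/p&gt;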

&lt;p&gt;&lt;strong&gt;2. Tail risk blindness.&lt;/strong&gt; LLMs trained on historical data learn the distribution of normal outcomes. They are systematically bad at reasoning about tail events -- the 1-in-100 scenarios where the most damage occurs. Ask any LLM what happens if the S&amp;amp;P drops 15% in a week, and you get a historically informed narrative. You do not get a quantitative assessment of portfolio impact under correlated stress with proper fat-tail modeling.&lt;/p&gt;

&lt;p&gt;Risk metrics designed for tail events exist. They simulate thousands of extreme scenarios, account for correlation structures that only appear during crises, and produce numbers -- not narratives -- for worst-case exposure. These metrics should sit between the agent and any risk decision, as a hard mathematical guardrail that the LLM cannot override. OraClaw runs 5,000-path Monte Carlo and returns VaR + CVaR + worst-case scenario in under 5ms — math the agent calls but cannot rewrite.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Ensemble agreement is not ensemble accuracy.&lt;/strong&gt; Many multi-agent systems use agreement as a confidence signal: "If 4 out of 5 agents agree, the decision is high-confidence." This is valid when the agents are genuinely independent. It is dangerous when they share a common foundation model.&lt;/p&gt;

&lt;p&gt;Five agents built on GPT-4 agreeing is not five independent opinions. It is one opinion expressed five times with slightly different wording. The agreement is measuring model consistency, not decision quality. Proper ensemble scoring detects when multiple models agree for the wrong reasons -- when agreement stems from shared bias rather than convergent evidence.&lt;/p&gt;
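
&lt;p&gt;One way to put a number on "one opinion expressed five times" is the effective number of independent voters. The sketch below uses the standard equicorrelation design effect, n / (1 + (n - 1) * rho). It assumes every pair of agents shares the same correlation, which is a simplification, and &lt;code&gt;pairwise_corr&lt;/code&gt; is a hypothetical input you would have to estimate from logged decisions:&lt;/p&gt;

```python
def effective_votes(n_agents, pairwise_corr):
    """Effective number of independent voters among n equally correlated agents.

    Uses the equicorrelation design effect: n / (1 + (n - 1) * rho).
    """
    if 1 > n_agents:
        raise ValueError("need at least one agent")
    if pairwise_corr > 1.0 or 0.0 > pairwise_corr:
        raise ValueError("pairwise_corr must be between 0 and 1")
    return n_agents / (1.0 + (n_agents - 1) * pairwise_corr)


# Five agents on the same foundation model, with an assumed correlation of 0.9.
print(round(effective_votes(5, 0.9), 2))
```

&lt;p&gt;At a correlation of 0.9, five agents carry about as much independent evidence as 1.09 agents. At 0 they count as five, and at 1 they count as one.&lt;/p&gt;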

&lt;h2&gt;
  
  
  What the Math Layer Looks Like
&lt;/h2&gt;

&lt;p&gt;Multi-agent systems need three things that LLMs cannot provide:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-time correlation monitoring.&lt;/strong&gt; Measuring the statistical similarity of agent decisions across the fleet, with alerts when correlation exceeds safe thresholds. This is a streaming statistics problem, not a reasoning problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quantitative tail risk.&lt;/strong&gt; VaR and CVaR computed at the portfolio level, accounting for position correlation, with proper fat-tail distributions. Updated continuously, not narrated occasionally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Calibrated ensemble scoring.&lt;/strong&gt; Measuring whether multi-agent agreement actually predicts accuracy, with correction factors for shared-model bias. Turning "4 out of 5 agree" into a real probability that the decision is correct.&lt;/p&gt;

&lt;p&gt;None of these require intelligence. They require math -- the kind that runs in milliseconds, produces auditable numbers, and does not share the blind spots of the system it is protecting. OraClaw's convergence-scoring tool does exactly this: Hellinger-distance over signal distributions, not vibe-checks over agent prose.&lt;/p&gt;
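
&lt;p&gt;For reference, the Hellinger distance itself is a short formula. This is a generic sketch over two discrete distributions (for example, two agents' probabilities over hold / hedge / exit). It does not reproduce how any particular tool applies it, and the function name is my own:&lt;/p&gt;

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions over the same outcomes.

    0.0 means identical distributions; 1.0 means no overlap at all.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2.0)


# Two agents with nearly identical action distributions (hold, hedge, exit).
print(hellinger([0.90, 0.05, 0.05], [0.88, 0.07, 0.05]))
```

&lt;p&gt;Distances near zero across many agent pairs are the numeric form of "the fleet has converged". The threshold that counts as too close is a judgment call to calibrate on historical data.&lt;/p&gt;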

&lt;h2&gt;
  
  
  The Stakes
&lt;/h2&gt;

&lt;p&gt;Single-agent failures are costly. Multi-agent correlated failures are catastrophic. The difference is one of kind, not degree: independent mistakes partly cancel across a fleet, while correlated mistakes all push in the same direction and feed on each other through the market.&lt;/p&gt;

&lt;p&gt;Your agents need a math layer between them and catastrophic decisions. Not a smarter prompt. Not a better model. A statistical guardrail that measures what the agents cannot see about themselves.&lt;/p&gt;

&lt;p&gt;The math exists. The question is whether it will be deployed before or after the first correlated cascade.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try OraClaw
&lt;/h2&gt;

&lt;p&gt;OraClaw is an MCP server that gives Claude deterministic risk-and-correlation tools — calibrated probability, monotonic constraints, audit trails, ensemble scoring. The math layer your fleet needs before the first cascade, not after. Install in Claude Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude mcp add oraclaw &lt;span class="nt"&gt;--&lt;/span&gt; npx &lt;span class="nt"&gt;-y&lt;/span&gt; @oraclaw/mcp-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;17 tools, MIT licensed. Repo: github.com/Whatsonyourmind/oraclaw&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/Whatsonyourmind/oraclaw" rel="noopener noreferrer"&gt;github.com/Whatsonyourmind/oraclaw&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;More posts: &lt;a href="https://dev.to/lukastan"&gt;dev.to/lukastan&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;OraClaw provides anomaly detection, risk metrics, and ensemble scoring for multi-agent systems. &lt;a href="https://github.com/Whatsonyourmind/oraclaw" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://oraclaw.com/clawhub" rel="noopener noreferrer"&gt;ClawHub&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>agents</category>
      <category>risk</category>
    </item>
    <item>
      <title>I Built an Agent Portfolio Advisor by Composing 3 OpenClaw Skills — Here's What Actually Works</title>
      <dc:creator>Whatsonyourmind</dc:creator>
      <pubDate>Mon, 20 Apr 2026 17:33:01 +0000</pubDate>
      <link>https://dev.to/whatsonyourmind/i-built-an-agent-portfolio-advisor-by-composing-3-openclaw-skills-heres-what-actually-works-2dpa</link>
      <guid>https://dev.to/whatsonyourmind/i-built-an-agent-portfolio-advisor-by-composing-3-openclaw-skills-heres-what-actually-works-2dpa</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/devteam/join-the-openclaw-challenge-1200-prize-pool-5682"&gt;OpenClaw Challenge&lt;/a&gt;: Prompt 1 — "OpenClaw in Action".&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;An &lt;strong&gt;Agent Portfolio Advisor&lt;/strong&gt; — one OpenClaw agent that takes "I have €10K, 3-year horizon, medium risk tolerance" and returns a recommended asset mix &lt;strong&gt;with a confidence band&lt;/strong&gt;, not a guess.&lt;/p&gt;

&lt;p&gt;The trick: the agent doesn't &lt;em&gt;compute&lt;/em&gt; anything itself. It composes three deterministic skills and lets them own the math. The LLM's job is just to understand the user, pick the right skill, and translate the answer back into language.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The three skills (all live at &lt;a href="https://github.com/openclaw/skills/tree/main/skills/whatsonyourmind" rel="noopener noreferrer"&gt;openclaw/skills/whatsonyourmind&lt;/a&gt;):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;Job in the pipeline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;oraclaw-bandit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pick the best asset allocation from N candidates (UCB1 / Thompson / ε-greedy)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;oraclaw-simulate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Monte Carlo the chosen allocation over the horizon (10,000 paths)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;oraclaw-risk&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;VaR / CVaR on the simulated paths&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No LLM math. No probability theater. Every number has a source the agent can cite.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Used OpenClaw
&lt;/h2&gt;

&lt;p&gt;The flow is three MCP tool calls, composed in order.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1 — &lt;code&gt;oraclaw-bandit&lt;/code&gt; picks the allocation
&lt;/h3&gt;

&lt;p&gt;Five candidate allocations seeded from historical performance. UCB1 balances "what worked" with "what we haven't tried enough". Free tier, no API key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://oraclaw-api.onrender.com/api/v1/optimize/bandit &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "arms": [
      { "id": "60-40",  "name": "60% stocks / 40% bonds", "pulls": 120, "totalReward": 84.0 },
      { "id": "70-30",  "name": "70% stocks / 30% bonds", "pulls": 95,  "totalReward": 69.3 },
      { "id": "80-20",  "name": "80% stocks / 20% bonds", "pulls": 80,  "totalReward": 61.6 },
      { "id": "all-in", "name": "100% stocks",            "pulls": 60,  "totalReward": 49.8 },
      { "id": "safe",   "name": "40% stocks / 60% bonds", "pulls": 150, "totalReward": 91.5 }
    ],
    "algorithm": "ucb1"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response (real):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"selected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"safe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"40% stocks / 60% bonds"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.648&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"algorithm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ucb1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"exploitation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.61&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"exploration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.038&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"regret"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.12&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



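&lt;p&gt;Note what the response actually shows: &lt;code&gt;safe&lt;/code&gt; has the lowest mean reward of the five arms (91.5 / 150 = 0.61, the &lt;code&gt;exploitation&lt;/code&gt; term) and the most pulls, so its exploration bonus is small (0.038). Textbook UCB1 gives a &lt;em&gt;larger&lt;/em&gt; bonus to arms with fewer pulls, and on these counts it would rank &lt;code&gt;all-in&lt;/code&gt; (mean 0.83, 60 pulls) first. The selection therefore reflects how this endpoint scores arms, so read the returned &lt;code&gt;exploitation&lt;/code&gt; and &lt;code&gt;exploration&lt;/code&gt; terms before relying on the pick.&lt;/p&gt;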
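
&lt;p&gt;As a cross-check, here is a minimal sketch of textbook UCB1 (mean reward plus sqrt(2 ln N / n)) run on the same five arms. The endpoint may use a different exploration constant or scoring rule, so this ranking need not match the response above. &lt;code&gt;ucb1_scores&lt;/code&gt; is my own helper name:&lt;/p&gt;

```python
import math

# Arms copied from the request above: (pulls, total reward).
ARMS = {
    "60-40": (120, 84.0),
    "70-30": (95, 69.3),
    "80-20": (80, 61.6),
    "all-in": (60, 49.8),
    "safe": (150, 91.5),
}


def ucb1_scores(arms):
    """Textbook UCB1: mean reward + sqrt(2 * ln(total pulls) / pulls)."""
    total_pulls = sum(pulls for pulls, _ in arms.values())
    scores = {}
    for arm_id, (pulls, total_reward) in arms.items():
        mean = total_reward / pulls
        bonus = math.sqrt(2.0 * math.log(total_pulls) / pulls)
        scores[arm_id] = mean + bonus
    return scores


scores = ucb1_scores(ARMS)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

&lt;p&gt;Under the textbook rule, &lt;code&gt;all-in&lt;/code&gt; ranks first (highest mean and fewest pulls) and &lt;code&gt;safe&lt;/code&gt; ranks last. It is worth comparing any bandit tool's pick against a reference like this before trusting it.&lt;/p&gt;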

&lt;h3&gt;
  
  
  Step 2 — &lt;code&gt;oraclaw-simulate&lt;/code&gt; runs the Monte Carlo
&lt;/h3&gt;

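&lt;p&gt;Once we have an allocation, simulate the value of €10,000 after 3 years. Assume a 6% expected annual return and 12% annual volatility (my working assumption for this 40/60 mix). Without compounding, that gives a mean ending value of 10,000 × (1 + 3 × 0.06) = 11,800 and a standard deviation of roughly 10,000 × 0.12 × √3 ≈ 2,100, which are the parameters of the normal distribution in the request:&lt;br&gt;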
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://oraclaw-api.onrender.com/api/v1/simulate/montecarlo &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "distribution": "normal",
    "params": { "mean": 11800, "stddev": 2100 },
    "iterations": 10000
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;10,000 simulated ending values for €10,000 invested. Real response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mean"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;11807.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stdDev"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2098.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"percentiles"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"p5"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="mf"&gt;8354.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"p25"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;10387.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"p50"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;11812.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"p75"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;13218.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"p95"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;15273.5&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"iterations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"executionTimeMs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2.8&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent now knows: &lt;strong&gt;median outcome €11,813. 5% chance of finishing below €8,355. 5% chance of finishing above €15,274.&lt;/strong&gt; That's a confidence band, not a point estimate.&lt;/p&gt;
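
&lt;p&gt;The same summary can be sanity-checked offline with the standard library. This is a minimal sketch that draws ending values from the same normal distribution. &lt;code&gt;simulate_normal&lt;/code&gt; and its seed are hypothetical, and since the random draws differ from the API's, the figures will be close to the response above, not identical:&lt;/p&gt;

```python
import random


def simulate_normal(mean, stddev, iterations, seed=0):
    """Draw ending values from Normal(mean, stddev) and summarize them."""
    rng = random.Random(seed)
    draws = sorted(rng.gauss(mean, stddev) for _ in range(iterations))

    def percentile(pct):
        # Simple order-statistic percentile on the sorted draws (no interpolation).
        index = min(iterations - 1, pct * iterations // 100)
        return draws[index]

    return {
        "mean": sum(draws) / iterations,
        "percentiles": {f"p{p}": percentile(p) for p in (5, 25, 50, 75, 95)},
    }


summary = simulate_normal(mean=11800, stddev=2100, iterations=10_000)
print(summary)
```

&lt;p&gt;For a normal distribution the 5th and 95th percentiles sit about 1.645 standard deviations from the mean, which is where the band of roughly €8,350 to €15,250 comes from.&lt;/p&gt;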

&lt;h3&gt;
  
  
  Step 3 — &lt;code&gt;oraclaw-risk&lt;/code&gt; closes the loop (premium)
&lt;/h3&gt;

&lt;p&gt;For a 2-asset portfolio with correlation, &lt;code&gt;oraclaw-risk&lt;/code&gt; runs VaR + CVaR properly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://oraclaw-api.onrender.com/api/v1/analyze/risk &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer oc_YOUR_KEY"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "weights": [0.4, 0.6],
    "returns": [
      [0.02, -0.03, 0.01, 0.04, -0.02, 0.01, -0.01, 0.03, 0.02, -0.04],
      [0.01, 0.02, -0.01, 0.01, 0.03, -0.02, 0.02, 0.01, -0.03, 0.01]
    ],
    "confidence": 0.95
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"var"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.019&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cvar"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.026&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"expectedReturn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.006&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"volatility"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.012&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;VaR 1.9%&lt;/strong&gt; = in 95% of periods this portfolio is expected to lose no more than 1.9%. &lt;strong&gt;CVaR 2.6%&lt;/strong&gt; = when things go bad (the worst 5% of periods), the average loss is 2.6%. Volatility of 1.2% is well below that of either input series on its own (about 2.7% and 1.9% by sample standard deviation), which is the diversification benefit of combining them.&lt;/p&gt;
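
&lt;p&gt;To make the two definitions concrete, here is a minimal historical-simulation sketch. The post does not say which method the endpoint uses (historical, parametric, or Monte Carlo), so treat this as an illustration of the definitions, not a reimplementation. &lt;code&gt;portfolio_returns&lt;/code&gt; and &lt;code&gt;historical_var_cvar&lt;/code&gt; are my own names, and the sample data is synthetic:&lt;/p&gt;

```python
def portfolio_returns(weights, series):
    """Combine per-asset return series into one weighted portfolio series."""
    return [sum(w * r for w, r in zip(weights, period)) for period in zip(*series)]


def historical_var_cvar(returns, confidence=0.95):
    """Historical VaR and CVaR, both reported as positive loss fractions.

    VaR is the smallest loss inside the worst (1 - confidence) share of periods;
    CVaR is the average loss across that tail.
    """
    if not returns:
        raise ValueError("need at least one return")
    losses = sorted((-r for r in returns), reverse=True)  # worst loss first
    tail_size = max(1, round(len(losses) * (1.0 - confidence)))
    tail = losses[:tail_size]
    return {"var": tail[-1], "cvar": sum(tail) / tail_size}


# Synthetic example: 100 evenly spaced returns from -5.0% to +4.9%.
sample = [i / 1000 for i in range(-50, 50)]
print(historical_var_cvar(sample))
```

&lt;p&gt;On this synthetic sample the five worst losses are 5.0% down to 4.6%, so VaR is 4.6% and CVaR is their average, 4.8%. CVaR is never smaller than VaR, which is why it is the more conservative of the two numbers.&lt;/p&gt;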

&lt;p&gt;Get a free API key: &lt;code&gt;POST https://oraclaw-api.onrender.com/api/v1/auth/signup&lt;/code&gt; with &lt;code&gt;{"email":"..."}&lt;/code&gt; — instant, no card.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wiring all three into one MCP agent
&lt;/h3&gt;

&lt;p&gt;The OpenClaw skills ship as MCP tools. Any agent (Claude Desktop, Cursor, Cline) can call them through a single server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"oraclaw"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@oraclaw/mcp-server"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"ORACLAW_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"oc_YOUR_KEY"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or via Claude CLI: &lt;code&gt;claude mcp add oraclaw -- npx -y @oraclaw/mcp-server&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The agent now has &lt;code&gt;optimize_bandit&lt;/code&gt;, &lt;code&gt;simulate_montecarlo&lt;/code&gt;, and &lt;code&gt;analyze_risk&lt;/code&gt; as callable tools — plus 14 more (CMA-ES, LP solver, A* pathfinding, Bayesian, ensemble, forecast, anomaly, graph analytics, calibration...).&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Full pipeline, real responses embedded above. To run it yourself:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No API key needed&lt;/strong&gt; for Step 1 and Step 2 (25 free calls/day/IP)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Free API key&lt;/strong&gt; (30 seconds, email-only) unlocks Step 3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expected runtime&lt;/strong&gt;: ~15ms per call on the live API. The whole pipeline finishes in under 100ms including network.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I built a minimal TypeScript orchestrator (~80 lines) that wraps these three skills into a &lt;code&gt;PortfolioAdvisor.recommend(userProfile)&lt;/code&gt; function returning &lt;code&gt;{ allocation, confidence_band, tail_risk, narrative }&lt;/code&gt;. The narrative is the only part the LLM produces. Source snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;recommend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;UserProfile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;allocation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;oraclaw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;optimize_bandit&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;arms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ALLOCATIONS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;algorithm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ucb1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;oraclaw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;simulate_montecarlo&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;distribution&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;normal&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;expectedReturnFor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;allocation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;selected&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;horizonYears&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;iterations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;risk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;oraclaw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze_risk&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;weightsFor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;allocation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;selected&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;returns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;historicalSeriesFor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;allocation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;selected&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;allocation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;allocation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;selected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;confidence_band&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;sim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;percentiles&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;p5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;percentiles&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;p95&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;tail_risk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;var&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;risk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;cvar&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;risk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cvar&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;narrative&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;allocation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;risk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;profile&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The LLM only runs in &lt;code&gt;llm.explain&lt;/code&gt;.&lt;/strong&gt; Every number it cites came from a deterministic tool call.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. OpenClaw's skill-composition model is better than monolithic agents.&lt;/strong&gt; I could swap &lt;code&gt;oraclaw-bandit&lt;/code&gt; for &lt;code&gt;oraclaw-contextual&lt;/code&gt; (LinUCB, context-aware) without touching the other two. Each skill has its own &lt;code&gt;SKILL.md&lt;/code&gt;, its own &lt;code&gt;_meta.json&lt;/code&gt; with required env vars, its own pricing. Modularity that actually holds up under real use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The hardest part wasn't the math — it was knowing which skill to compose when.&lt;/strong&gt; That's exactly what an LLM is good at: reading user intent, picking tools, narrating results. Every attempt to have the LLM &lt;em&gt;compute&lt;/em&gt; the Monte Carlo or UCB1 itself gave worse answers than the skills. Every attempt to have the skills do routing gave worse UX than the LLM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Confidence bands are a trust primitive.&lt;/strong&gt; "Recommended allocation: 40/60, median outcome €11,813, but there's a 5% chance you end up below €8,355" is a recommendation a human can actually act on. "Invest in 40/60, it's good" is not. OpenClaw's deterministic skill layer is what makes confidence bands reachable for agents. Without &lt;code&gt;oraclaw-simulate&lt;/code&gt;, the agent is guessing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The free tier matters for the feedback loop.&lt;/strong&gt; 25 calls/day was enough to prototype the whole pipeline without paying or signing up. The moment I wanted production traffic on the premium &lt;code&gt;analyze_risk&lt;/code&gt;, the $9/mo Starter tier (50K calls/month) was a no-brainer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;All 14 OraClaw skills&lt;/strong&gt; on ClawHub: &lt;a href="https://github.com/openclaw/skills/tree/main/skills/whatsonyourmind" rel="noopener noreferrer"&gt;openclaw/skills/whatsonyourmind&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP server&lt;/strong&gt; (one npm install): &lt;a href="https://www.npmjs.com/package/@oraclaw/mcp-server" rel="noopener noreferrer"&gt;@oraclaw/mcp-server&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Free API key signup&lt;/strong&gt;: &lt;code&gt;POST https://oraclaw-api.onrender.com/api/v1/auth/signup&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;17 tools, schemas, source&lt;/strong&gt;: &lt;a href="https://github.com/Whatsonyourmind/oraclaw" rel="noopener noreferrer"&gt;github.com/Whatsonyourmind/oraclaw&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Built with OpenClaw. Free-tier friendly. MIT licensed.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>openclawchallenge</category>
      <category>ai</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Monte Carlo Simulation in 5 Minutes: From Zero to Confidence Intervals in One API Call</title>
      <dc:creator>Whatsonyourmind</dc:creator>
      <pubDate>Mon, 20 Apr 2026 11:26:45 +0000</pubDate>
      <link>https://dev.to/whatsonyourmind/monte-carlo-simulation-in-5-minutes-from-zero-to-confidence-intervals-in-one-api-call-5adn</link>
      <guid>https://dev.to/whatsonyourmind/monte-carlo-simulation-in-5-minutes-from-zero-to-confidence-intervals-in-one-api-call-5adn</guid>
      <description>&lt;p&gt;Your PM walks into standup and asks: "What's the probability we hit our revenue target this quarter?"&lt;/p&gt;

&lt;p&gt;You have historical data. You have growth rates. You have variance. You could eyeball it and say "pretty likely." Or you could simulate 10,000 possible futures and come back with: "There's a 73% chance we exceed $2.1M, but a 12% chance we fall below $1.6M — and here's why."&lt;/p&gt;

&lt;p&gt;That's not a guess. That's a Monte Carlo simulation.&lt;/p&gt;

&lt;p&gt;The same technique shows up everywhere developers build things that depend on uncertain inputs. Your portfolio dashboard shows a single number for projected returns — but there's a universe of possible outcomes hiding behind that number. Your deployment pipeline estimates "3 days" for a migration — but the real answer is a probability distribution with a long tail. Your pricing model assumes a 5% conversion rate — but what if it's 3%? What if it's 8%?&lt;/p&gt;

&lt;p&gt;Monte Carlo reveals the full picture. Not just the average case, but the best case, the worst case, and everything in between. And it does this through a method so simple it feels like cheating: run the same calculation thousands of times with slightly different inputs, then look at the aggregate.&lt;/p&gt;

&lt;p&gt;The technique is named after the Monte Carlo Casino in Monaco, a nod to the role that randomness plays. It was developed at Los Alamos in the late 1940s by Stanislaw Ulam and John von Neumann, who used random sampling to model neutron diffusion when analytical solutions were intractable. Today it's used in quantitative finance, drug discovery, climate modeling, game AI, and any domain where you need to reason about uncertainty.&lt;/p&gt;

&lt;p&gt;Let's break it down.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Monte Carlo Actually Is
&lt;/h2&gt;

&lt;p&gt;At its core, Monte Carlo simulation answers one question: &lt;strong&gt;given uncertain inputs, what's the range of possible outputs?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the recipe:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Define your model.&lt;/strong&gt; This is the calculation that produces an output from inputs. Revenue = users × conversion_rate × average_order_value. Portfolio return = weighted sum of asset returns. Project duration = sum of task durations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Describe the uncertainty.&lt;/strong&gt; Instead of plugging in single values, you describe each input as a probability distribution. Your conversion rate isn't "5%" — it's "normally distributed with mean 5% and standard deviation 1.5%." Your task durations aren't "3 days" — they're "triangularly distributed between 2, 3, and 7 days."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sample and run.&lt;/strong&gt; Draw a random value for each input from its distribution. Run the model. Record the output. Repeat 10,000 times.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Analyze the distribution.&lt;/strong&gt; You now have 10,000 possible outputs. Sort them. The 500th value is your 5th percentile (p5): only 5% of simulated futures were worse than this. The 9,500th is your 95th percentile (p95). The spread between p5 and p95 is your 90% confidence interval, the range that contains nine out of ten simulated futures.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. The math is addition and sorting. The power comes from repetition.&lt;/p&gt;
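&lt;p&gt;The four steps can be sketched in plain Python with only the standard library. Everything numeric here is a made-up assumption for illustration (50,000 users, a 5% conversion rate, a 30/45/80 order value), and &lt;code&gt;simulate_revenue&lt;/code&gt; is a hypothetical helper, not part of any API mentioned in this post.&lt;/p&gt;

```python
import random
import statistics

def simulate_revenue(iterations=10_000, seed=42):
    """Run the four-step recipe for revenue = users * conversion * order value."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    outcomes = []
    for _ in range(iterations):
        # Step 2: draw each uncertain input from its (assumed) distribution.
        users = rng.gauss(50_000, 5_000)
        conversion = max(0.0, rng.gauss(0.05, 0.015))  # clamp: a rate cannot be negative
        order_value = rng.triangular(30, 80, 45)  # arguments are (low, high, mode)
        # Steps 1 and 3: run the model and record the output.
        outcomes.append(users * conversion * order_value)
    # Step 4: sort, then read percentiles straight off the sorted list.
    outcomes.sort()
    return {
        "p5": outcomes[iterations * 5 // 100],
        "p50": statistics.median(outcomes),
        "p95": outcomes[iterations * 95 // 100],
    }

result = simulate_revenue()
print(result)
```

&lt;p&gt;Swapping the three sampling lines is all it takes to model a different problem; the sorting and percentile lookup stay the same.&lt;/p&gt;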

&lt;h3&gt;
  
  
  Common Mistakes Developers Make
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Using uniform distributions when the real data is normal or lognormal.&lt;/strong&gt; Financial returns are approximately normal, though with fatter tails in practice. Project durations are right-skewed (lognormal or triangular). Revenue is often lognormal. Uniform distributions (equal probability across a bounded range) almost never match reality and will underestimate tail risk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Too few iterations.&lt;/strong&gt; 100 simulations is noise. You'll get different answers every time you run it. At 1,000 you start seeing convergence. At 10,000 your percentiles stabilize to about 1% precision. For VaR calculations where you care about the extreme tails (p1, p99), you may need 50,000+.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ignoring correlations.&lt;/strong&gt; If you simulate stock A and stock B independently, you'll underestimate portfolio risk. In reality, stocks tend to fall together during crashes. Correlated inputs require either copulas or a covariance matrix approach — or you can sidestep the problem by using historical return vectors that naturally capture correlation.&lt;/p&gt;
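&lt;p&gt;For two inputs, the covariance-matrix approach reduces to a single formula, which is the 2x2 Cholesky factor written out by hand. The sketch below is a minimal illustration under the assumption of standard normal inputs; the helper names and the target correlation of 0.8 are hypothetical.&lt;/p&gt;

```python
import math
import random

def correlated_normals(rho, n, seed=7):
    """Draw n pairs of standard normals whose correlation is approximately rho."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # 2x2 Cholesky factor of [[1, rho], [rho, 1]] applied to (z1, z2).
        pairs.append((z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2))
    return pairs

def sample_correlation(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)

pairs = correlated_normals(rho=0.8, n=20_000)
print(round(sample_correlation(pairs), 2))
```

&lt;p&gt;Scale and shift each column afterwards to get the means and volatilities you need; the correlation survives that transformation.&lt;/p&gt;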

&lt;p&gt;&lt;strong&gt;No convergence check.&lt;/strong&gt; Run your simulation at 1,000, 5,000, and 10,000 iterations. If your p5 changes by more than 2-3%, you need more iterations.&lt;/p&gt;
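&lt;p&gt;A convergence check along those lines might look like the following sketch. It assumes a normal outcome with mean 100,000 and standard deviation 15,000 purely for illustration, and &lt;code&gt;p5_of_normal&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
import random

def p5_of_normal(iterations, mean=100_000, stddev=15_000, seed=1):
    """Estimate the 5th percentile of a normal outcome from `iterations` samples."""
    rng = random.Random(seed)
    samples = sorted(rng.gauss(mean, stddev) for _ in range(iterations))
    return samples[iterations * 5 // 100]

def relative_change(previous, current):
    """Absolute relative change between two successive estimates."""
    return abs(current - previous) / abs(previous)

# Re-estimate p5 at increasing iteration counts and compare neighbours.
estimates = {n: p5_of_normal(n) for n in (1_000, 5_000, 10_000)}
changes = [
    relative_change(estimates[1_000], estimates[5_000]),
    relative_change(estimates[5_000], estimates[10_000]),
]
for n, value in estimates.items():
    print(f"{n} iterations: p5 = {value:,.0f}")
print("largest relative change:", f"{max(changes):.2%}")
```

&lt;p&gt;If the largest change stays above your tolerance, raise the iteration counts and run the comparison again.&lt;/p&gt;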




&lt;h2&gt;
  
  
  Three Real Use Cases
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Portfolio Risk: "What's My 95% VaR?"
&lt;/h3&gt;

&lt;p&gt;This is the most common Monte Carlo application in finance, and the question that shows up most in developer forums and GitHub issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; "I have a portfolio of assets. What's the maximum I could lose in a single day with 95% confidence?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inputs you need:&lt;/strong&gt; Portfolio weights (how much is in each asset), historical return series for each asset, and your confidence level (typically 95% or 99%).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the output tells you:&lt;/strong&gt; Your Value-at-Risk (VaR) is a single number — say 2.1% — meaning "on 95% of days, your portfolio won't lose more than 2.1%." The Conditional VaR (CVaR, also called Expected Shortfall) tells you the average loss in that worst 5% — it answers "when things go bad, how bad do they get on average?"&lt;/p&gt;
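&lt;p&gt;To make the two definitions concrete, here is a minimal historical-simulation sketch. It is not how any particular API computes these figures; the return series is synthetic (normal, mean 0.05%, standard deviation 1.3%) and &lt;code&gt;var_cvar&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
import random

def var_cvar(returns, confidence=0.95):
    """Historical-simulation VaR and CVaR, both reported as positive loss fractions."""
    ordered = sorted(returns)  # worst returns first
    tail_size = max(1, round(len(ordered) * (1.0 - confidence)))
    tail = ordered[:tail_size]  # the worst (1 - confidence) share of days
    var = -tail[-1]  # loss at the edge of the tail
    cvar = -sum(tail) / len(tail)  # average loss inside the tail
    return var, cvar

rng = random.Random(3)
# Hypothetical daily returns: mean 0.05%, standard deviation 1.3%.
daily_returns = [rng.gauss(0.0005, 0.013) for _ in range(10_000)]
var95, cvar95 = var_cvar(daily_returns)
print(f"VaR 95%: {var95:.2%}, CVaR 95%: {cvar95:.2%}")
```

&lt;p&gt;CVaR is never smaller than VaR at the same confidence level, because it averages only the losses at or beyond the VaR cutoff.&lt;/p&gt;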

&lt;h3&gt;
  
  
  2. Project Estimation: "What's the Probability We Deliver by March?"
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; "We have 12 tasks remaining. Each has a best-case, likely, and worst-case duration. What's the probability we finish by March 15?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inputs you need:&lt;/strong&gt; For each task, a triangular distribution (min, mode, max). Task dependencies if they exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the output tells you:&lt;/strong&gt; Instead of "we'll finish March 10" you get "60% chance we finish by March 10, 85% chance by March 20, 95% chance by April 1." This lets your PM set expectations honestly. The long tail, that 5% chance the project slips past April 1, is exactly the risk that single-point estimates hide.&lt;/p&gt;
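&lt;p&gt;A rough sketch of this estimate, assuming the 12 tasks run strictly one after another (no dependencies or parallel work). The task durations are invented for illustration, and both helper functions are hypothetical. Note that the "likely" durations add up to 38 days, which makes a useful reference point for the simulated probabilities.&lt;/p&gt;

```python
import bisect
import random

# Hypothetical (best, likely, worst) durations in days for 12 remaining tasks.
TASKS = [(2, 3, 7), (1, 2, 4), (3, 5, 10), (1, 1, 3), (2, 4, 8), (1, 2, 5),
         (2, 3, 6), (4, 6, 12), (1, 2, 3), (2, 3, 7), (1, 3, 6), (2, 4, 9)]

def simulate_totals(tasks, iterations=10_000, seed=11):
    """Simulate total duration assuming tasks run one after another."""
    rng = random.Random(seed)
    totals = []
    for _ in range(iterations):
        # random.triangular takes (low, high, mode), so the argument order differs.
        totals.append(sum(rng.triangular(low, high, mode) for low, mode, high in tasks))
    return sorted(totals)

def probability_done_by(sorted_totals, days):
    """Share of simulated futures that finish within `days`."""
    return bisect.bisect_right(sorted_totals, days) / len(sorted_totals)

totals = simulate_totals(TASKS)
for budget in (38, 47, 53):
    print(f"P(done within {budget} days) = {probability_done_by(totals, budget):.0%}")
```

&lt;p&gt;Because every task here has a worst case further from its mode than its best case, the simulated total tends to land well after the sum of the "likely" values. To turn day counts into calendar dates, add them to your start date and skip non-working days.&lt;/p&gt;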

&lt;h3&gt;
  
  
  3. Pricing Uncertainty: "What's Our Expected Revenue?"
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The question:&lt;/strong&gt; "Our conversion rate has been between 2% and 8% over the past year. If we launch at $49/mo, what's our expected revenue range?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inputs you need:&lt;/strong&gt; A distribution for your conversion rate (beta distribution fits bounded percentages well), traffic projections (perhaps normal distribution around your forecast), and churn rate (exponential or lognormal).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the output tells you:&lt;/strong&gt; "Expected revenue is $847K, but the 90% confidence interval is $520K to $1.2M." This is the difference between a pitch deck that says "$847K" and one that says "$847K, with downside protection plans for the $520K scenario."&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It: Portfolio Confidence Intervals
&lt;/h2&gt;

&lt;p&gt;Let's make this concrete. Say you want to model an uncertain outcome — like projected quarterly revenue — where your best estimate is $100,000 with historical variation of about $15,000.&lt;/p&gt;

&lt;p&gt;You can simulate 10,000 possible outcomes right now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://oraclaw-api.onrender.com/api/v1/simulate/montecarlo &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "distribution": "normal",
    "params": { "mean": 100000, "stddev": 15000 },
    "iterations": 10000
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response gives you the full distribution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mean"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;100023.45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stdDev"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;14987.32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"percentiles"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"p5"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;75312.18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"p25"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;89843.67&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"p50"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;100045.22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"p75"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;110198.54&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"p95"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;124701.89&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"histogram"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"bucket"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"percentage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.12&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"bucket"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;87&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"percentage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.87&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"iterations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"executionTimeMs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;3.2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Reading the Output
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;p5 (~$75K)&lt;/strong&gt; = "There's only a 5% chance the outcome falls below this." This is your downside planning number, not a hard floor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;p25 (~$90K)&lt;/strong&gt; = "A pessimistic-but-plausible scenario."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;p50 (~$100K)&lt;/strong&gt; = "The median outcome — half of simulated futures landed above, half below."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;p75 (~$110K)&lt;/strong&gt; = "An optimistic-but-plausible scenario."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;p95 (~$125K)&lt;/strong&gt; = "Only 5% of simulations exceeded this." This is your realistic upside, not a hard ceiling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The spread between p5 and p95 IS your risk measure.&lt;/strong&gt; A $75K-$125K range on a $100K expected value tells you there's significant uncertainty. If the spread were $95K-$105K, you'd sleep a lot better.&lt;/p&gt;

&lt;p&gt;The histogram breaks the full range into buckets so you can visualize the shape of the distribution — where probability mass concentrates, and how fat the tails are.&lt;/p&gt;

&lt;h3&gt;
  
  
  Portfolio VaR: Correlated Multi-Asset Risk
&lt;/h3&gt;

&lt;p&gt;For portfolio risk with multiple correlated assets, there's a dedicated endpoint. This one's premium (it uses a more expensive covariance decomposition path than plain sampling), so you'll need a free API key first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Grab a free API key (no credit card, instant):&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://oraclaw-api.onrender.com/api/v1/auth/signup &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"email":"you@example.com"}'&lt;/span&gt;
&lt;span class="c"&gt;# → response includes { "api_key": "oc_..." }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, for a 60/40 stock-bond portfolio with 10 periods of historical returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://oraclaw-api.onrender.com/api/v1/analyze/risk &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer oc_YOUR_KEY"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "weights": [0.6, 0.4],
    "returns": [
      [0.02, -0.03, 0.01, 0.04, -0.02, 0.01, -0.01, 0.03, 0.02, -0.04],
      [0.01, 0.02, -0.01, 0.01, 0.03, -0.02, 0.02, 0.01, -0.03, 0.01]
    ],
    "confidence": 0.95
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"var"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.021&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cvar"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.028&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"expectedReturn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.006&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"volatility"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.016&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"horizonDays"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"assets"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;VaR of 2.1%&lt;/strong&gt;: On 95% of days, your portfolio won't lose more than 2.1%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CVaR of 2.8%&lt;/strong&gt;: On the worst 5% of days, the average loss is 2.8%. This is the "expected shortfall" — it captures tail risk that VaR alone misses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expected return of 0.6%&lt;/strong&gt;: Weighted average return across assets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volatility of 1.6%&lt;/strong&gt;: Portfolio standard deviation, accounting for correlations between the two assets. This is typically lower than the weighted average of individual volatilities — that's the diversification benefit.&lt;/li&gt;
&lt;/ul&gt;
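&lt;p&gt;The diversification point can be checked by hand on the same ten periods sent in the request. This sketch uses the sample standard deviation from Python's &lt;code&gt;statistics&lt;/code&gt; module, which is an assumption on my part; it is not meant to reproduce the API's figures, only to show the portfolio volatility coming out below the weighted average of the individual volatilities.&lt;/p&gt;

```python
import statistics

weights = [0.6, 0.4]
returns = [
    [0.02, -0.03, 0.01, 0.04, -0.02, 0.01, -0.01, 0.03, 0.02, -0.04],  # asset 1
    [0.01, 0.02, -0.01, 0.01, 0.03, -0.02, 0.02, 0.01, -0.03, 0.01],   # asset 2
]

# Per-period portfolio return: weighted sum across assets for each period.
portfolio = [
    sum(w * series[t] for w, series in zip(weights, returns))
    for t in range(len(returns[0]))
]

# Sample standard deviations (n - 1 denominator).
portfolio_vol = statistics.stdev(portfolio)
weighted_avg_vol = sum(w * statistics.stdev(series) for w, series in zip(weights, returns))

print(f"portfolio volatility:        {portfolio_vol:.4f}")
print(f"weighted average volatility: {weighted_avg_vol:.4f}")
```

&lt;p&gt;The gap between the two numbers widens as the correlation between the assets falls, and closes entirely when they move in lockstep.&lt;/p&gt;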

&lt;p&gt;Both of these endpoints are from &lt;a href="https://github.com/Whatsonyourmind/oraclaw" rel="noopener noreferrer"&gt;OraClaw&lt;/a&gt;, an open decision-intelligence API with &lt;strong&gt;17 tools&lt;/strong&gt; (11 free, 6 premium). The free tier gives you &lt;strong&gt;25 calls/day&lt;/strong&gt; for non-premium tools — no API key required. The $9/mo Starter tier unlocks all 17 tools and raises the ceiling to 50K calls/month. Pay-per-call beyond that is $0.005.&lt;/p&gt;




&lt;h2&gt;
  
  
  The MCP Angle: Your AI Agent Can Call This Directly
&lt;/h2&gt;

&lt;p&gt;If you're using Claude Desktop, Cursor, Cline, or any other MCP-aware assistant, you don't need to hand-write those curl commands. OraClaw ships as an &lt;strong&gt;MCP server&lt;/strong&gt; — the AI gets the tools directly, with schemas, and decides when to call them.&lt;/p&gt;

&lt;p&gt;Drop this into your MCP client config (e.g. &lt;code&gt;claude_desktop_config.json&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"oraclaw"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"oraclaw-mcp"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"ORACLAW_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"oc_YOUR_KEY"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart the client, and now when you ask &lt;em&gt;"What's the 95% VaR on a 60/40 portfolio with these 10 daily returns?"&lt;/em&gt;, the agent will call &lt;code&gt;simulate_montecarlo&lt;/code&gt; or &lt;code&gt;analyze_risk&lt;/code&gt; directly instead of making up a number. Works for all 17 tools: multi-armed bandits, contextual optimization, constraint solving, pathfinding, anomaly detection, forecasting, convergence scoring, and more.&lt;/p&gt;

&lt;p&gt;The server is on npm, auto-discovers in Claude Desktop once installed, and uses stdio transport (no port collisions). Full setup guide: &lt;a href="https://github.com/Whatsonyourmind/oraclaw" rel="noopener noreferrer"&gt;github.com/Whatsonyourmind/oraclaw&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  DIY vs API: The Build-vs-Buy Calculation
&lt;/h2&gt;

&lt;p&gt;Building Monte Carlo from scratch isn't rocket science, but it's more than a weekend project if you want to do it right. You need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement sampling for multiple distribution types (normal, lognormal, triangular, beta, exponential)&lt;/li&gt;
&lt;li&gt;Handle edge cases (negative standard deviations, degenerate distributions, numerical overflow)&lt;/li&gt;
&lt;li&gt;Add variance reduction techniques for tail percentiles&lt;/li&gt;
&lt;li&gt;Validate convergence (are 10,000 iterations enough for your use case?)&lt;/li&gt;
&lt;li&gt;Build the percentile calculation, histogram binning, and summary statistics&lt;/li&gt;
&lt;li&gt;Test everything — off-by-one errors in percentile calculations are notoriously hard to catch&lt;/li&gt;
&lt;li&gt;For portfolio risk: implement the covariance matrix, Cholesky decomposition, and the inverse normal CDF&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Estimated effort for a senior developer: 20-40 hours. At $100/hour, that's $2,000-$4,000 in engineering time. An API call costs $0.005, or nothing on the free tier. At that price you'd need to make 400,000 to 800,000 paid calls before building it yourself breaks even.&lt;/p&gt;

&lt;p&gt;The real cost isn't even the implementation. It's the maintenance: keeping up with edge cases users discover, handling new distribution types, optimizing for performance as iteration counts scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  When You Need More
&lt;/h2&gt;

&lt;p&gt;An API is the right tool when you need confidence intervals on a distribution, portfolio VaR for a handful of assets, or quick what-if analysis. It covers the 80% case — the scenarios that show up in dashboards, investor decks, project planning, and product analytics.&lt;/p&gt;

&lt;p&gt;For the other 20% — correlated multi-asset simulations with thousands of paths, copula models for non-linear dependencies, GPU-accelerated pricing of exotic derivatives — you'll want a dedicated quant library. &lt;a href="https://www.quantlib.org/" rel="noopener noreferrer"&gt;QuantLib&lt;/a&gt; (C++/Python) is the gold standard for derivatives pricing. &lt;a href="https://www.pymc.io/" rel="noopener noreferrer"&gt;PyMC&lt;/a&gt; handles Bayesian Monte Carlo with MCMC samplers. &lt;a href="https://numpy.org/" rel="noopener noreferrer"&gt;NumPy&lt;/a&gt; alone can brute-force millions of paths per second if you vectorize properly.&lt;/p&gt;

&lt;p&gt;But for "give me the confidence interval on this forecast" or "what's the VaR on my portfolio" — the kind of question that shows up in a sprint planning meeting or a product review — an API call is faster and cheaper than standing up infrastructure. You spend your time interpreting results instead of debugging sampling algorithms.&lt;/p&gt;

&lt;p&gt;The best Monte Carlo simulation is the one that actually gets run.&lt;/p&gt;




&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Try the free tier right now, no key:&lt;/strong&gt;
&lt;code&gt;curl -X POST https://oraclaw-api.onrender.com/api/v1/simulate/montecarlo -H "Content-Type: application/json" -d '{"distribution":"normal","params":{"mean":100,"stddev":15},"iterations":10000}'&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grab a free API key for the premium tools (VaR, anomaly, graph, forecast, constraints, CMA-ES):&lt;/strong&gt;
&lt;code&gt;POST https://oraclaw-api.onrender.com/api/v1/auth/signup&lt;/code&gt; with &lt;code&gt;{"email":"..."}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plug into your agent (Claude Desktop / Cursor / Cline):&lt;/strong&gt;
Add the &lt;code&gt;oraclaw&lt;/code&gt; MCP server block above, restart the client, done.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Browse all 17 tools + schemas:&lt;/strong&gt;
&lt;a href="https://github.com/Whatsonyourmind/oraclaw" rel="noopener noreferrer"&gt;github.com/Whatsonyourmind/oraclaw&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the post was useful, a ❤️ or a follow on dev.to helps more people find the MCP angle. Questions in the comments welcome — I answer every one.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>ai</category>
      <category>mcp</category>
      <category>api</category>
    </item>
    <item>
      <title>I Needed an LP Solver but Gurobi Costs $10K/yr — So I Built an API for $9/month</title>
      <dc:creator>Whatsonyourmind</dc:creator>
      <pubDate>Sun, 19 Apr 2026 21:13:22 +0000</pubDate>
      <link>https://dev.to/whatsonyourmind/i-needed-an-lp-solver-but-gurobi-costs-10kyr-so-i-built-an-api-for-9month-3i0m</link>
      <guid>https://dev.to/whatsonyourmind/i-needed-an-lp-solver-but-gurobi-costs-10kyr-so-i-built-an-api-for-9month-3i0m</guid>
      <description>&lt;h2&gt;
  
  
  The $10,000 Pricing Page That Says Nothing
&lt;/h2&gt;

&lt;p&gt;Last year I needed to solve a scheduling problem. Nothing exotic -- a constrained optimization where you have limited resources, competing priorities, and a function to maximize. The kind of thing that operations research solved decades ago with linear programming.&lt;/p&gt;

&lt;p&gt;So I went looking for an LP solver I could call from a web service. I found Gurobi, the gold standard. Clicked "Pricing." And landed on a page with zero numbers and a "Contact Sales" button.&lt;/p&gt;

&lt;p&gt;I'm not the only one who finds this frustrating. If you've spent any time in optimization forums, you've seen the same complaints echoed over and over:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The best MIP solvers (CPLEX, GUROBI, FICO) are all extremely expensive unless you're an academic."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Gurobi is super fast, but the licensing was just impossible."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"I just hate it when you go to the pricing page and there's NO PRICING. None."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After some digging, the numbers surfaced: Gurobi runs $10,000 to $50,000 per year depending on your configuration. IBM CPLEX starts at $3,420/year for a single user. These are tools designed for Fortune 500 logistics departments, not a developer building a scheduling feature for a SaaS app.&lt;/p&gt;

&lt;p&gt;The licensing model makes things worse. Gurobi licenses are tied to specific machines. One HN commenter described how their company bought &lt;em&gt;"4 old 24-core Xeons off eBay"&lt;/em&gt; just to avoid paying for additional license seats. Another pointed out the fundamental incompatibility with modern infrastructure:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The inability to do something like have autoscaling containers using Gurobi was ultimately the dealbreaker."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I built my own. Not a solver from scratch -- that would be foolish. I wrapped HiGHS, an open-source LP/MIP solver that's already proven in production, into a hosted API that anyone can call with a single HTTP request. No license files. No sales calls. No seat counting.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Are You Actually Paying $10K/Year For?
&lt;/h2&gt;

&lt;p&gt;If you're not from an operations research background, linear programming might sound abstract. It isn't. Here's what it does in plain English:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Linear programming (LP)&lt;/strong&gt; finds the best outcome given constraints. You have some quantity you want to maximize or minimize (profit, cost, time), and you have limits on what you can do (budget, hours, materials). An LP solver finds the mathematically optimal answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mixed-integer programming (MIP)&lt;/strong&gt; is the same thing, but some of your variables have to be whole numbers. You can't produce 3.7 chairs. You produce 3 or 4.&lt;/p&gt;

&lt;p&gt;These problems are everywhere:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Manufacturing&lt;/strong&gt;: Which products should a factory make this week to maximize profit, given limited labor and materials?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logistics&lt;/strong&gt;: What's the cheapest way to route 50 delivery trucks across 200 stops?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduling&lt;/strong&gt;: How do you assign 30 nurses to 3 shifts across 7 days while respecting labor laws and preferences?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Finance&lt;/strong&gt;: How do you allocate a portfolio across 20 assets to maximize return while keeping risk below a threshold?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mathematical theory behind LP solvers is well-established. The simplex method dates to 1947. Interior-point methods are from the 1980s. Branch-and-bound for MIP has been refined for decades. What you're paying for with commercial solvers isn't novel math -- it's engineering: hand-tuned heuristics, presolve routines, and parallelization that shave seconds off industrial-scale problems with millions of variables.&lt;/p&gt;

&lt;p&gt;But most developers don't have millions of variables. They have dozens. Maybe hundreds. And for those problems, the gap between a $50,000/year commercial solver and a free open-source one is measured in microseconds, not hours.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solver Landscape in 2026
&lt;/h2&gt;

&lt;p&gt;Here's an honest comparison of what's available:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Solver&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;th&gt;Self-Serve Signup&lt;/th&gt;
&lt;th&gt;REST API&lt;/th&gt;
&lt;th&gt;Container-Friendly&lt;/th&gt;
&lt;th&gt;Docs Quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gurobi&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$10K-$50K/yr&lt;/td&gt;
&lt;td&gt;No (contact sales)&lt;/td&gt;
&lt;td&gt;No (license file)&lt;/td&gt;
&lt;td&gt;No (per-seat)&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPLEX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$3,420+/yr&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Cloud (limited)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Mediocre&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OR-Tools&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Alpha/limited&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;"Remarkably terrible"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HiGHS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free (library)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (self-host)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Sparse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OraClaw&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$9/mo Starter (50K calls), $0.005/call pay-per-call&lt;/td&gt;
&lt;td&gt;Yes (1 email)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few notes on this table:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gurobi and CPLEX&lt;/strong&gt; are genuinely excellent solvers. If you're solving problems with 100,000+ variables and need cutting-edge performance, they earn their price. But their licensing model was designed for a world where software ran on owned hardware, not ephemeral containers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OR-Tools&lt;/strong&gt; is Google's open-source optimization suite. It's powerful and free, but the documentation is... a known problem. The OR-Tools tag on Stack Overflow is a graveyard of unanswered questions. Getting it running in production still means shipping its native libraries and managing a language runtime (Python, C++, Java, or .NET) on your own servers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HiGHS&lt;/strong&gt; is the solver engine I chose to build on. It's open-source, developed at the University of Edinburgh, and regularly places at or near the top among open-source solvers in Hans Mittelmann's public LP and MIP benchmarks. It is also available as a WebAssembly build, meaning no native compilation, no platform-specific binaries. The catch: it's a library, not a service. You have to host it yourself.&lt;/p&gt;

&lt;p&gt;The gap in this landscape is obvious. If you want a solver you can call from any language over HTTP with zero setup, your options are limited to either paying enterprise prices or building and hosting the infrastructure yourself.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Real Example: Factory Scheduling
&lt;/h2&gt;

&lt;p&gt;Let's walk through a concrete LP problem.&lt;/p&gt;

&lt;p&gt;You run a furniture workshop. You make two products: &lt;strong&gt;chairs&lt;/strong&gt; and &lt;strong&gt;tables&lt;/strong&gt;. Each chair earns $45 profit. Each table earns $80 profit. You want to maximize weekly profit.&lt;/p&gt;

&lt;p&gt;But you have constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Labor&lt;/strong&gt;: You have 400 hours of labor per week. A chair takes 5 hours. A table takes 20 hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wood&lt;/strong&gt;: You have 450 units of wood per week. A chair uses 10 units. A table uses 15 units.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity&lt;/strong&gt;: You can't make more than 100 of either product per week.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mathematically, this is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Maximize:    45x + 80y
Subject to:  5x + 20y ≤ 400    (labor)
             10x + 15y ≤ 450   (wood)
             0 ≤ x ≤ 100       (chair capacity)
             0 ≤ y ≤ 100       (table capacity)
             x, y ∈ integers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;x&lt;/code&gt; is the number of chairs and &lt;code&gt;y&lt;/code&gt; is the number of tables.&lt;/p&gt;

&lt;p&gt;You could solve this by graphing the feasible region and checking corner points. Or you could send one API call (LP/MIP is a premium tool — get a key in 30 seconds at &lt;a href="https://web-olive-one-89.vercel.app/signup" rel="noopener noreferrer"&gt;oraclaw signup&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://oraclaw-api.onrender.com/api/v1/solve/constraints &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer YOUR_KEY"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "direction": "maximize",
    "objective": { "chairs": 45, "tables": 80 },
    "variables": [
      { "name": "chairs", "lower": 0, "upper": 100, "type": "integer" },
      { "name": "tables", "lower": 0, "upper": 100, "type": "integer" }
    ],
    "constraints": [
      { "name": "labor_hours", "coefficients": { "chairs": 5, "tables": 20 }, "upper": 400 },
      { "name": "wood_units", "coefficients": { "chairs": 10, "tables": 15 }, "upper": 450 }
    ]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"optimal"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"objectiveValue"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"variables"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"chairs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tables"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The optimal answer: make &lt;strong&gt;24 chairs&lt;/strong&gt; and &lt;strong&gt;14 tables&lt;/strong&gt; for a maximum weekly profit of &lt;strong&gt;$2,200&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Verify the constraints hold:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Labor: 5(24) + 20(14) = 120 + 280 = &lt;strong&gt;400&lt;/strong&gt; ≤ 400 ✓ (binding)&lt;/li&gt;
&lt;li&gt;Wood: 10(24) + 15(14) = 240 + 210 = &lt;strong&gt;450&lt;/strong&gt; ≤ 450 ✓ (binding)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both constraints are binding at the optimum, which is where you'd expect an LP solution to land: at a corner of the feasible region. Here that corner happens to have integer coordinates, so the integrality requirement costs nothing. HiGHS returns a provably optimal solution, not a heuristic guess.&lt;/p&gt;
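
&lt;p&gt;If you want to sanity-check the solver's answer without calling anything, the problem is small enough to brute-force. The sketch below is not how HiGHS works; it simply tries every integer pair in the 0 to 100 range and keeps the most profitable feasible one. The function name &lt;code&gt;brute_force_workshop&lt;/code&gt; is made up for this illustration.&lt;/p&gt;

```python
def brute_force_workshop():
    """Try every integer (chairs, tables) pair and return the best feasible one."""
    best = (0, 0, 0)  # (profit, chairs, tables)
    for chairs in range(101):        # capacity: 0..100 chairs
        for tables in range(101):    # capacity: 0..100 tables
            labor_ok = 400 >= 5 * chairs + 20 * tables
            wood_ok = 450 >= 10 * chairs + 15 * tables
            if labor_ok and wood_ok:
                profit = 45 * chairs + 80 * tables
                # Tuples compare by profit first, so max() keeps the best plan.
                best = max(best, (profit, chairs, tables))
    return best


profit, chairs, tables = brute_force_workshop()
print(f"{chairs} chairs, {tables} tables, ${profit}")  # 24 chairs, 14 tables, $2200
```

&lt;p&gt;Enumeration checks 10,201 combinations here, which is fine for two variables and hopeless for twenty. That gap is the reason solvers exist.&lt;/p&gt;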

&lt;p&gt;What makes this interesting for developers isn't the math — it's the interface. No library installation. No language-specific SDK. No binary compilation. One HTTP call and you have the answer. Use it from Python, JavaScript, Go, Ruby, a shell script, or an AI agent that constructs the request autonomously.&lt;/p&gt;
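
&lt;p&gt;As a rough illustration of the "any language" point, here is the same request built with nothing but Python's standard library. The endpoint, headers, and payload mirror the curl call above; the helper names &lt;code&gt;build_request&lt;/code&gt; and &lt;code&gt;solve&lt;/code&gt; and the 30-second timeout are assumptions of this sketch, not part of any OraClaw SDK.&lt;/p&gt;

```python
import json
import urllib.request

API_URL = "https://oraclaw-api.onrender.com/api/v1/solve/constraints"


def build_request(api_key):
    """Build (but do not send) the POST request for the workshop problem."""
    payload = {
        "direction": "maximize",
        "objective": {"chairs": 45, "tables": 80},
        "variables": [
            {"name": "chairs", "lower": 0, "upper": 100, "type": "integer"},
            {"name": "tables", "lower": 0, "upper": 100, "type": "integer"},
        ],
        "constraints": [
            {"name": "labor_hours", "coefficients": {"chairs": 5, "tables": 20}, "upper": 400},
            {"name": "wood_units", "coefficients": {"chairs": 10, "tables": 15}, "upper": 450},
        ],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def solve(api_key, timeout=30):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(api_key), timeout=timeout) as resp:
        return json.load(resp)


# Usage (needs a real key and network access):
# result = solve("YOUR_KEY")
# print(result["status"], result["variables"])
```

&lt;p&gt;Keeping request construction separate from sending makes the payload easy to inspect or unit-test before spending a call.&lt;/p&gt;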




&lt;h2&gt;
  
  
  When NOT to Use This
&lt;/h2&gt;

&lt;p&gt;I want to be direct about the limitations, because choosing the wrong solver for your problem is worse than paying too much for the right one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Gurobi or CPLEX when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your problem has 10,000+ variables and you need solutions in seconds, not minutes&lt;/li&gt;
&lt;li&gt;You're running millions of solves per day in a batch processing pipeline&lt;/li&gt;
&lt;li&gt;You need advanced features like quadratic programming (QP), second-order cone programming (SOCP), or nonlinear optimization&lt;/li&gt;
&lt;li&gt;You have dedicated operations research staff who will tune solver parameters&lt;/li&gt;
&lt;li&gt;Your company's revenue depends on shaving 2% off logistics costs at industrial scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use an API-based solver when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your problems are small to medium (dozens to low thousands of variables)&lt;/li&gt;
&lt;li&gt;You need optimization as a feature, not as the core product&lt;/li&gt;
&lt;li&gt;You're a startup or indie developer who can't justify $10K/year for a solver license&lt;/li&gt;
&lt;li&gt;You want to call optimization from a web service, mobile app, or AI agent&lt;/li&gt;
&lt;li&gt;You need something working in minutes, not days of setup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The honest truth is that Gurobi is faster on large MIP problems. That's why companies pay $50,000/year for it. But on a mid-sized model, "faster" can mean something like going from 0.8 seconds to 0.3 seconds on a 50,000-variable problem. For a 50-variable scheduling problem, either solver returns in milliseconds at most. You're paying for headroom you may never need.&lt;/p&gt;




&lt;h2&gt;
  
  
  The MCP Angle: Your AI Agent Calls the Solver Itself
&lt;/h2&gt;

&lt;p&gt;If you're running an agent in Claude Desktop, Cursor, or Cline, the LP/MIP solver is exposed as an MCP tool. Drop this into your &lt;code&gt;claude_desktop_config.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"oraclaw"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@oraclaw/mcp-server"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"ORACLAW_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your-key-here"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart your client and you can literally type:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Maximise 45·chairs + 80·tables, subject to 5·chairs + 20·tables ≤ 400 labour hours and 10·chairs + 15·tables ≤ 450 wood units. Both must be non-negative integers."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The agent calls &lt;code&gt;solve_constraints&lt;/code&gt; itself, gets back the structured optimum + objective value, and explains it. No more LLMs guessing at integer programs they can't actually solve.&lt;/p&gt;

&lt;p&gt;Beyond LP/MIP, the OraClaw MCP server ships &lt;strong&gt;17 tools total&lt;/strong&gt; — bandits, Monte Carlo, scheduling, Bayesian belief updates, ensemble consensus, pathfinding, scoring, time-series forecast, anomaly detection, graph analytics, CMA-ES, portfolio risk. All with explicit input + output JSON schemas so your agent knows exactly what it gets back.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Linear programming is a solved problem. The simplex method is nearly 80 years old. The mathematics don't change based on what you pay for a license.&lt;/p&gt;

&lt;p&gt;What changes is access.&lt;/p&gt;

&lt;p&gt;Gurobi charges $10,000+/year and won't even show you the price. CPLEX wants $285/user/month. Both require license files, seat management, and enterprise sales cycles. Deploying them in containers is either painful or impossible.&lt;/p&gt;

&lt;p&gt;The alternative: an API call at $0.005 per request, or $9/month for 50K calls on the Starter plan. No sales call, no license file, no seat counting. Run it from any language, any platform, any container orchestrator — or have your AI agent call it directly via MCP.&lt;/p&gt;
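
&lt;p&gt;To see where the two price points cross, a quick back-of-the-envelope calculation helps. The prices and the 50K quota come from the figures above; the constant and function names are invented for this sketch, and it deliberately ignores anything beyond the Starter quota because overage pricing isn't covered here.&lt;/p&gt;

```python
from decimal import Decimal

PAY_PER_CALL = Decimal("0.005")   # dollars per request
STARTER_MONTHLY = Decimal("9")    # dollars per month
STARTER_QUOTA = 50_000            # calls included per month


def monthly_cost(calls_per_month):
    """Cost of each plan for a monthly volume within the Starter quota."""
    if calls_per_month > STARTER_QUOTA:
        raise ValueError("overage pricing is not covered by this sketch")
    return {
        "pay_per_call": PAY_PER_CALL * calls_per_month,
        "starter": STARTER_MONTHLY,
    }


# Volume at which both plans cost the same.
break_even_calls = int(STARTER_MONTHLY / PAY_PER_CALL)
print(break_even_calls)  # 1800
```

&lt;p&gt;Below roughly 1,800 calls a month, pay-per-call is the cheaper option; above that, the flat plan wins.&lt;/p&gt;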

&lt;p&gt;The solver underneath is HiGHS — the same open-source engine winning LP benchmarks. Wrapped in a REST API for simplicity and exposed as an MCP tool for AI agents.&lt;/p&gt;

&lt;p&gt;If you're building something that needs optimization and you're not a Fortune 500 logistics department, you shouldn't have to navigate enterprise sales to solve a linear program.&lt;/p&gt;




&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Get an API key (1 email):&lt;/strong&gt; &lt;a href="https://web-olive-one-89.vercel.app/signup" rel="noopener noreferrer"&gt;oraclaw signup&lt;/a&gt; — instant key, 1,000 calls/day on pay-per-call ($0.005/call), upgrade to Starter $9/mo for 50K/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try free tools without a key:&lt;/strong&gt; the API has 11 free endpoints (bandits, Monte Carlo, scheduling, pathfinding, scoring, Bayesian) — &lt;code&gt;curl https://oraclaw-api.onrender.com/api/v1/optimize/bandit ...&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use it from your AI agent:&lt;/strong&gt; &lt;code&gt;npm install @oraclaw/mcp-server&lt;/code&gt; or paste the MCP config above into Claude Desktop / Cursor / Cline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source + 17-tool docs:&lt;/strong&gt; &lt;a href="https://github.com/Whatsonyourmind/oraclaw" rel="noopener noreferrer"&gt;github.com/Whatsonyourmind/oraclaw&lt;/a&gt; (MIT licensed)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live API:&lt;/strong&gt; &lt;a href="https://oraclaw-api.onrender.com" rel="noopener noreferrer"&gt;oraclaw-api.onrender.com&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The math is the same. The price shouldn't be.&lt;/p&gt;

</description>
      <category>optimization</category>
      <category>mcp</category>
      <category>ai</category>
      <category>devops</category>
    </item>
    <item>
      <title>Your LLM Costs Spiked 400% Last Night — Here's How to Catch It in One API Call</title>
      <dc:creator>Whatsonyourmind</dc:creator>
      <pubDate>Sun, 19 Apr 2026 21:10:50 +0000</pubDate>
      <link>https://dev.to/whatsonyourmind/your-llm-costs-spiked-400-last-night-heres-how-to-catch-it-in-one-api-call-363a</link>
      <guid>https://dev.to/whatsonyourmind/your-llm-costs-spiked-400-last-night-heres-how-to-catch-it-in-one-api-call-363a</guid>
      <description>&lt;p&gt;You wake up Monday morning. Coffee in hand, you open your LLM provider's billing dashboard. The weekend total: &lt;strong&gt;$2,400&lt;/strong&gt;. Your usual weekend spend is $600.&lt;/p&gt;

&lt;p&gt;Somewhere between Friday at 11pm and Saturday at 3am, an agent hit a retry loop. Each retry included the full conversation context. Each retry was bigger than the last. The result: a bill at 400% of normal. Nobody noticed because nobody was watching.&lt;/p&gt;

&lt;p&gt;The fix took 5 minutes — a missing &lt;code&gt;max_retries&lt;/code&gt; cap. The damage took 48 hours to discover.&lt;/p&gt;

&lt;p&gt;This is the most expensive category of bug in AI-native applications. Not a logic error. Not a crash. A silent cost explosion that hides inside normal-looking logs until the invoice arrives.&lt;/p&gt;

&lt;p&gt;You'd think monitoring would catch it. And it would, if you had monitoring. But proper observability means Datadog ($15/host/month), New Relic ($0.30/GB ingested), or a full Prometheus + Grafana stack that someone needs to maintain. For a team running a few LLM-powered features, that's like buying a fire truck to watch a candle.&lt;/p&gt;

&lt;p&gt;Here's the thing: &lt;strong&gt;you don't need any of that&lt;/strong&gt;. The math behind anomaly detection is old. Really old. The statistics behind the two techniques that catch most cost spikes go back to the 1800s, and the newer of the two was published in 1977. They run in microseconds. And they can be wrapped in a single API call.&lt;/p&gt;

&lt;p&gt;Let me show you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Algorithms That Catch Almost Everything
&lt;/h2&gt;

&lt;p&gt;There are two statistical methods that handle the vast majority of "did something weird happen in my numbers?" scenarios. They're different, and knowing when to use each one matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Z-Score: For Well-Behaved Data
&lt;/h3&gt;

&lt;p&gt;The Z-score measures how far a data point is from the mean, expressed in standard deviations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;standard_deviation&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. If your daily LLM cost averages $150 with a standard deviation of $20, and today's cost is $250, the Z-score is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z = (250 - 150) / 20 = 5.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A Z-score of 5.0 means the value is 5 standard deviations from normal. In a normal distribution, fewer than 0.3% of values fall more than 3 standard deviations from the mean, so a 5 is far outside anything chance would explain. You have an anomaly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Costs, latency, throughput — any metric that clusters around a predictable average. If you plotted two weeks of your daily LLM spend and it looked roughly like a bell curve, Z-score is your tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weakness:&lt;/strong&gt; Z-score assumes your data is normally distributed. If your data is already skewed — say, you have occasional legitimate high-spend days — the mean and standard deviation get pulled toward the outliers, and real anomalies hide in the noise.&lt;/p&gt;
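
&lt;p&gt;In code, the whole method is a few lines. The sketch below uses Python's &lt;code&gt;statistics&lt;/code&gt; module; the function names and the default threshold of 3.0 are choices made for this illustration. It scores today's value against a baseline built from past days only, which is one way to keep a spike from inflating its own mean and standard deviation. That is a design choice of this sketch, not a description of what the API does internally.&lt;/p&gt;

```python
import statistics


def z_score(x, mean, std_dev):
    """How many standard deviations x lies from the mean."""
    return (x - mean) / std_dev


def flag_by_zscore(history, today, threshold=3.0):
    """Score today's value against a baseline built from past values only."""
    mean = statistics.mean(history)
    std_dev = statistics.stdev(history)  # sample standard deviation
    z = z_score(today, mean, std_dev)
    return z, abs(z) > threshold


print(z_score(250, 150, 20))  # 5.0

# Hypothetical week of daily spend, then a suspicious day.
z, is_anomaly = flag_by_zscore([140, 150, 160, 150, 145, 155], today=250)
print(is_anomaly)  # True
```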

&lt;h3&gt;
  
  
  IQR: For Data With a Long Tail
&lt;/h3&gt;

&lt;p&gt;The Interquartile Range method doesn't care about your data's shape. It works by looking at the middle 50%:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IQR = Q3 - Q1
Lower fence = Q1 - 1.5 * IQR
Upper fence = Q3 + 1.5 * IQR
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Q1 is the 25th percentile. Q3 is the 75th percentile. Anything below the lower fence or above the upper fence is an anomaly.&lt;/p&gt;

&lt;p&gt;The 1.5 multiplier is Tukey's original recommendation from 1977 — it corresponds roughly to +/- 2.7 standard deviations in normal data, catching about 0.7% of points as outliers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Response times (they always have a long tail), batch sizes, error rates, token counts per request — anything where legitimate values occasionally spike but you still want to catch the truly abnormal ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More robust than Z-score&lt;/strong&gt; because medians and quartiles aren't pulled by extreme values the way means and standard deviations are.&lt;/p&gt;
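
&lt;p&gt;Here is a minimal sketch of the fence calculation using Python's &lt;code&gt;statistics.quantiles&lt;/code&gt;. The data is made up, the function names are illustrative, and the "inclusive" quartile method is just one of several conventions; other conventions shift the fences slightly.&lt;/p&gt;

```python
import statistics


def iqr_fences(data, k=1.5):
    """Return (lower_fence, upper_fence) using Tukey's rule."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_outliers(data, k=1.5):
    """Values that fall outside the fences."""
    lower, upper = iqr_fences(data, k)
    return [x for x in data if x > upper or lower > x]


# Hypothetical response times in milliseconds, with one obvious straggler.
latencies_ms = [120, 130, 125, 140, 135, 128, 132, 900]
print(iqr_fences(latencies_ms))    # (113.75, 149.75)
print(iqr_outliers(latencies_ms))  # [900]
```

&lt;p&gt;Notice that the 900 ms straggler barely moves the fences, because the quartiles are computed from the middle of the sorted data.&lt;/p&gt;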

&lt;h3&gt;
  
  
  The Decision Rule
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;If your data looks like a bell curve, use Z-score. If it has a long tail or you're not sure, use IQR.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When in doubt, run both. If they agree, you have high confidence. If only one flags an anomaly, investigate further.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Example: Catching a Cost Spike
&lt;/h2&gt;

&lt;p&gt;Here's a working example. These are 14 days of daily LLM costs in dollars. One day had a problem.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Heads up: anomaly detection is a premium tool&lt;/strong&gt; (one of 6 paid endpoints; the other 11 are free). Get a key in 30 seconds at &lt;a href="https://web-olive-one-89.vercel.app/signup" rel="noopener noreferrer"&gt;oraclaw signup&lt;/a&gt; — one email field, instant key, no card needed. Then:&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://oraclaw-api.onrender.com/api/v1/detect/anomaly &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer YOUR_KEY"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "data": [142, 156, 138, 161, 145, 152, 139, 148, 155, 143, 612, 147, 151, 140],
    "method": "zscore",
    "threshold": 2.0
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"zscore"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"anomalies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;612&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"zScore"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;3.5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stats"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"mean"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;180.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"stdDev"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;124.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"threshold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"totalPoints"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"anomalyCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's walk through the output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Day 11&lt;/strong&gt; (index 10, zero-indexed) cost &lt;strong&gt;$612&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The mean across all 14 days is $180.64 (inflated by the spike itself)&lt;/li&gt;
&lt;li&gt;Even with the spike pulling the mean up, $612 is still &lt;strong&gt;3.5 standard deviations&lt;/strong&gt; above it&lt;/li&gt;
&lt;li&gt;Your actual baseline is around $148/day. Something went very wrong on day 11.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Under a normal model, a Z-score of 3.5 or higher has roughly a 0.02% chance of occurring naturally. That's not variance. That's an incident.&lt;/p&gt;

&lt;p&gt;You can swap &lt;code&gt;"method": "zscore"&lt;/code&gt; for &lt;code&gt;"method": "iqr"&lt;/code&gt; to use the IQR method instead — useful if your cost data has legitimate weekly patterns (higher on weekdays, lower on weekends) that make the distribution non-normal.&lt;/p&gt;
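
&lt;p&gt;If you'd like to double-check the numbers locally, the sketch below recomputes the Z-score for day 11 from the same 14 values and also shows the baseline once the spike is left out. The post doesn't say whether the API uses the sample or the population standard deviation; this sketch assumes the sample version, and the function name &lt;code&gt;score_point&lt;/code&gt; is made up.&lt;/p&gt;

```python
import statistics

daily_costs = [142, 156, 138, 161, 145, 152, 139, 148, 155, 143, 612, 147, 151, 140]


def score_point(data, index):
    """Z-score of data[index] against all points, plus a baseline without it."""
    mean_all = statistics.mean(data)
    std_all = statistics.stdev(data)  # sample standard deviation (assumption)
    z = (data[index] - mean_all) / std_all
    others = data[:index] + data[index + 1:]
    return z, statistics.mean(others)


z, baseline = score_point(daily_costs, 10)
print(f"z = {z:.2f}, baseline without the spike = ${baseline:.2f}")
# z = 3.47, baseline without the spike = $147.46
```

&lt;p&gt;The leave-one-out baseline is the number to compare against when deciding how bad the incident was, since the all-days mean is dragged upward by the spike itself.&lt;/p&gt;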

&lt;h2&gt;
  
  
  Building an Alert Pipeline in 10 Lines
&lt;/h2&gt;

&lt;p&gt;Detection is only useful if it triggers an action. Here's a minimal alert pipeline — a cron job that checks daily costs and sends a Slack notification when something looks wrong:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// anomaly-alert.js — run via cron: 0 8 * * *&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;costs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetchDailyCosts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// last 14 days from your billing API&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://oraclaw-api.onrender.com/api/v1/detect/anomaly&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Authorization&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Bearer &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ORACLAW_API_KEY&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;costs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;zscore&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.5&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;anomalies&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;stats&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;anomalies&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;sendSlackAlert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Cost anomaly detected: $&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;anomalies&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; `&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
    &lt;span class="s2"&gt;`(z-score: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;anomalies&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;, 14-day mean: ~$&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;)`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No agents. No dashboards. No monthly SaaS bill. A cron job, one HTTP call, and a Slack webhook. You now have cost spike detection.&lt;/p&gt;

&lt;p&gt;Set the threshold based on your tolerance: &lt;strong&gt;2.0&lt;/strong&gt; catches more anomalies but includes some false positives — good for high-stakes environments where you'd rather investigate a false alarm than miss a real spike. &lt;strong&gt;3.0&lt;/strong&gt; catches only extreme outliers — better for noisy data where daily fluctuations are normal. Start at &lt;strong&gt;2.5&lt;/strong&gt; and adjust based on what you see in your first week.&lt;/p&gt;

&lt;p&gt;A few practical notes for production use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Window size matters.&lt;/strong&gt; 14 days gives a solid baseline. Fewer than 7 data points and your statistics get unreliable. More than 30 and you start averaging over too much history, making seasonal shifts invisible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run both methods.&lt;/strong&gt; If Z-score and IQR both flag the same point, that's a high-confidence anomaly. If only one flags it, it might be worth investigating but isn't urgent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Include context in your alert.&lt;/strong&gt; The raw Z-score or IQR deviation tells you &lt;em&gt;how&lt;/em&gt; anomalous the value is, but your Slack message should also include what the normal range looks like, so whoever gets paged can immediately gauge severity.&lt;/li&gt;
&lt;/ul&gt;
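
&lt;p&gt;The last two notes can be combined in a few lines. The sketch below is one possible way to do it, not the API's behavior: it labels a point "high" when both methods agree and "low" when only one does, then builds a message that carries the typical range. The function names, thresholds, and data are all hypothetical.&lt;/p&gt;

```python
from statistics import fmean, median, pstdev, quantiles

def classify(values, z_threshold=2.5, iqr_k=1.5):
    """Label each flagged index 'high' (both methods agree) or 'low' (one method only)."""
    mean, sd = fmean(values), pstdev(values)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    low, high = q1 - iqr_k * (q3 - q1), q3 + iqr_k * (q3 - q1)
    labels = {}
    for i, v in enumerate(values):
        by_z = sd > 0 and abs(v - mean) / sd > z_threshold
        by_iqr = v > high or low > v
        if by_z and by_iqr:
            labels[i] = "high"
        elif by_z or by_iqr:
            labels[i] = "low"
    return labels, (q1, q3)

def alert_text(values, index, confidence, normal_range):
    """Build a message that states the value, the typical range, and the median."""
    q1, q3 = normal_range
    return (
        f"Cost anomaly ({confidence} confidence): ${values[index]} on day {index + 1}. "
        f"Typical range ${q1:.0f} to ${q3:.0f}, median ${median(values):.0f}."
    )

# Hypothetical series: a large spike on day 11 and a milder bump on day 14.
costs = [142, 156, 138, 149, 151, 145, 153, 147, 150, 144, 612, 147, 151, 175]
labels, normal_range = classify(costs)
for index, confidence in labels.items():
    print(alert_text(costs, index, confidence, normal_range))
```

&lt;p&gt;Reporting the quartile range and median rather than the mean keeps the "normal" figure from being distorted by the very spike you are alerting on.&lt;/p&gt;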

&lt;h2&gt;
  
  
  When You Need More
&lt;/h2&gt;

&lt;p&gt;This approach handles the "did something weird happen?" question well. But there are cases where you need heavier tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time streaming detection&lt;/strong&gt; (sub-second) — look at Grafana's built-in anomaly detection or AWS CloudWatch Anomaly Detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time-series decomposition&lt;/strong&gt; (separating trend, seasonality, residual) — Facebook's Prophet or statsmodels in Python&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-dimensional anomalies&lt;/strong&gt; (cost is normal but latency + error rate together are weird) — PyOD, Isolation Forest, or a full observability platform&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For "did my daily numbers do something weird?" — one API call is enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  Or — Let Your AI Agent Detect Anomalies Itself
&lt;/h2&gt;

&lt;p&gt;If you're running an agent in Claude Desktop, Cursor, or Cline, you don't even need the curl. The same anomaly detection is exposed as an MCP tool. Drop this into your &lt;code&gt;claude_desktop_config.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"oraclaw"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@oraclaw/mcp-server"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"ORACLAW_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your-key-here"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart your client. Now you can literally type at your agent:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Here are the last 14 days of my LLM costs: [142, 156, 138, ..., 612, 147, 151, 140]. Anything weird?"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The agent calls &lt;code&gt;detect_anomaly&lt;/code&gt; itself, gets back structured JSON with the spike index + Z-score, and explains it back in your language. The whole point of MCP: deterministic algorithms become first-class tools your LLM can reach for instead of guessing.&lt;/p&gt;

&lt;p&gt;The OraClaw MCP server ships &lt;strong&gt;17 tools total&lt;/strong&gt; — 11 free without a key (bandits, Monte Carlo, scheduling, Bayesian updates, ensemble consensus, pathfinding, scoring) and 6 premium (anomaly detection, time-series forecast, LP/MIP solver, graph analytics, CMA-ES, portfolio risk). All with explicit input + output JSON schemas so your agent knows exactly what it gets back.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Z-score and IQR are 19th-century statistics. They work. They're fast. They're deterministic. They don't need training data, GPUs, or a machine learning pipeline.&lt;/p&gt;

&lt;p&gt;You don't need a $500/month observability platform to know that $612 is not normal when your baseline is $148. You need arithmetic and a threshold.&lt;/p&gt;

&lt;p&gt;The OraClaw &lt;code&gt;/detect/anomaly&lt;/code&gt; route wraps both Z-score and IQR into a single API call. It's one of 17 MCP tools your agent can reach for to make decisions on real numbers instead of vibes.&lt;/p&gt;

&lt;p&gt;Stop discovering cost spikes from invoices. Start discovering them from alerts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Get an API key (1 email):&lt;/strong&gt; &lt;a href="https://web-olive-one-89.vercel.app/signup" rel="noopener noreferrer"&gt;oraclaw signup&lt;/a&gt; — instant key, 1,000 calls/day on pay-per-call ($0.005/call), upgrade to Starter $9/mo for 50K/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use it from your AI agent:&lt;/strong&gt; &lt;code&gt;npm install @oraclaw/mcp-server&lt;/code&gt; or paste the MCP config above into Claude Desktop / Cursor / Cline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try the 11 free tools (no key):&lt;/strong&gt; see the full list at &lt;a href="https://github.com/Whatsonyourmind/oraclaw" rel="noopener noreferrer"&gt;github.com/Whatsonyourmind/oraclaw&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live API:&lt;/strong&gt; &lt;a href="https://oraclaw-api.onrender.com" rel="noopener noreferrer"&gt;oraclaw-api.onrender.com&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If this saved you from a $612 invoice surprise, leave a star — it helps other developers find it.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>mcp</category>
      <category>devops</category>
      <category>ai</category>
    </item>
    <item>
      <title>The $36,000 A/B Test: What Optimizely Charges vs. What the Algorithm Actually Costs</title>
      <dc:creator>Whatsonyourmind</dc:creator>
      <pubDate>Sun, 19 Apr 2026 20:48:05 +0000</pubDate>
      <link>https://dev.to/whatsonyourmind/the-36000-ab-test-what-optimizely-charges-vs-what-the-algorithm-actually-costs-al7</link>
      <guid>https://dev.to/whatsonyourmind/the-36000-ab-test-what-optimizely-charges-vs-what-the-algorithm-actually-costs-al7</guid>
      <description>&lt;h2&gt;
  
  
  You Just Want to Test Two Buttons
&lt;/h2&gt;

&lt;p&gt;You're a developer at a Series A startup. Your product manager walks over and says: "We need to A/B test the signup flow. Three variants, maybe four. Can you set that up this week?"&lt;/p&gt;

&lt;p&gt;Simple enough. You've read about multi-armed bandits. You know the theory. You start looking at tooling.&lt;/p&gt;

&lt;p&gt;Then you open Optimizely's pricing page. Or rather, you try to — because there is no pricing page. Just a "Contact Sales" button and a calendar widget for a 30-minute demo.&lt;/p&gt;

&lt;p&gt;After the demo, the sales call, the follow-up, and the "let me check with my manager" email chain, the number comes back: &lt;strong&gt;$36,000 per year minimum&lt;/strong&gt;. For A/B testing.&lt;/p&gt;

&lt;p&gt;That's not a typo. And it gets worse. If your product scales to 10 million impressions per month, you're looking at &lt;strong&gt;$63,700 to $113,100 per year&lt;/strong&gt; depending on your package. Enterprise tier? &lt;strong&gt;$200,000 to $400,000+&lt;/strong&gt;. One user reported getting "stuck with a $24,000 bill for a product they no longer needed" after downgrading became impossible without a sales conversation.&lt;/p&gt;

&lt;p&gt;The pricing model itself is designed to extract maximum value: Optimizely charges a &lt;strong&gt;percentage of your revenue&lt;/strong&gt;, not a flat fee. The more successful your product becomes, the more you pay for the same algorithm underneath.&lt;/p&gt;

&lt;p&gt;It's a system that, as one reviewer put it, "penalizes those just starting with experimentation." If you're a scrappy team trying to validate hypotheses fast, you're priced out before you write a single test.&lt;/p&gt;

&lt;p&gt;When Brex — a well-funded fintech company — finally switched away from Optimizely to Statsig, their engineering lead said it plainly: &lt;strong&gt;"Our engineers are significantly happier."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The question nobody asks during those sales calls is the one that matters most: what are you actually buying for $36,000?&lt;/p&gt;




&lt;h2&gt;
  
  
  What You're Actually Buying
&lt;/h2&gt;

&lt;p&gt;Strip away Optimizely's dashboard. Strip away the visual editor, the audience segmentation, the CDN integration, the SSR compatibility, the SDK for six different frameworks.&lt;/p&gt;

&lt;p&gt;What's left?&lt;/p&gt;

&lt;p&gt;At the mathematical core of Optimizely's experimentation engine is &lt;strong&gt;Thompson Sampling&lt;/strong&gt; — a multi-armed bandit algorithm published by William R. Thompson in 1933. That's not a criticism. Thompson Sampling is genuinely brilliant. It's one of the most elegant solutions to the explore/exploit problem in statistics.&lt;/p&gt;

&lt;p&gt;But it fits in about 20 lines of code.&lt;/p&gt;

&lt;p&gt;The algorithm itself is public domain, and has been for more than 90 years. You can find implementations in every language on GitHub, in textbooks, in blog posts. The math is settled.&lt;/p&gt;

&lt;p&gt;So when you pay Optimizely $36,000 per year, you're not paying for the algorithm. You're paying for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The visual editor&lt;/strong&gt; — drag-and-drop test creation for non-technical users&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audience targeting&lt;/strong&gt; — segment by geography, device, behavior, custom attributes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The SDK ecosystem&lt;/strong&gt; — client-side, server-side, edge, mobile, OTT&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The analytics dashboard&lt;/strong&gt; — statistical significance calculations, revenue attribution, funnel visualization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance and governance&lt;/strong&gt; — SOC 2, GDPR controls, approval workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are real features. They have real value — especially for large organizations with non-technical stakeholders who need to create and monitor experiments without writing code.&lt;/p&gt;

&lt;p&gt;But if you're a developer, and you just need the bandit algorithm — the explore/exploit engine that decides which variant to show next — you're paying $36,000 for something that costs pennies to compute.&lt;/p&gt;




&lt;h2&gt;
  
  
  Thompson Sampling in 5 Minutes
&lt;/h2&gt;

&lt;p&gt;Let's actually learn the algorithm you'd be paying for. It's more intuitive than you think.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Explore/Exploit Dilemma
&lt;/h3&gt;

&lt;p&gt;You have three signup button variants. After 210 visitors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Variant A&lt;/strong&gt; converted 35 out of 100 (35%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variant B&lt;/strong&gt; converted 40 out of 100 (40%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variant C&lt;/strong&gt; converted 5 out of 10 (50%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Which is best? Traditional A/B testing says: "Keep running all three at equal traffic until we hit statistical significance." That can waste thousands of impressions on Variant A, which already looks like the weakest option.&lt;/p&gt;

&lt;p&gt;A naive approach says: "Variant C has 50% — send all traffic there." But wait — that's based on only 10 observations. It could easily be noise.&lt;/p&gt;

&lt;p&gt;This is the &lt;strong&gt;explore/exploit dilemma&lt;/strong&gt;: do you exploit what looks best now, or explore the uncertain option to learn more?&lt;/p&gt;

&lt;h3&gt;
  
  
  How Thompson Sampling Solves It
&lt;/h3&gt;

&lt;p&gt;Thompson Sampling uses &lt;strong&gt;Beta distributions&lt;/strong&gt; to model uncertainty about each variant's true conversion rate.&lt;/p&gt;

&lt;p&gt;For each variant, you maintain two numbers: &lt;strong&gt;successes&lt;/strong&gt; (alpha) and &lt;strong&gt;failures&lt;/strong&gt; (beta). When you need to pick a variant to show, you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Sample&lt;/strong&gt; a random value from each variant's Beta(alpha, beta) distribution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pick&lt;/strong&gt; the variant whose sample is highest&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Show&lt;/strong&gt; that variant to the next visitor&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update&lt;/strong&gt; the shown variant's alpha (if the visitor converted) or beta (if they didn't)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. The entire algorithm.&lt;/p&gt;
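
&lt;p&gt;The four steps above can be sketched in about 20 lines of Python using only the standard library. This is a minimal illustration, not Optimizely's or OraClaw's implementation: the class name &lt;code&gt;ThompsonSampler&lt;/code&gt;, the uniform Beta(1, 1) starting prior, and the simulated conversion rates are all assumptions made for the example.&lt;/p&gt;

```python
import random

class ThompsonSampler:
    """Beta-Bernoulli Thompson Sampling over named variants."""

    def __init__(self, variants, seed=None):
        # Start every variant at Beta(1, 1), a uniform prior (an assumption here).
        self.params = {name: [1, 1] for name in variants}
        self.rng = random.Random(seed)

    def choose(self):
        # Steps 1 and 2: sample each variant's Beta distribution, pick the highest draw.
        draws = {n: self.rng.betavariate(a, b) for n, (a, b) in self.params.items()}
        return max(draws, key=draws.get)

    def update(self, name, converted):
        # Step 4: a conversion bumps alpha, a non-conversion bumps beta.
        self.params[name][0 if converted else 1] += 1

# Simulate hypothetical true conversion rates to watch traffic shift.
true_rates = {"A": 0.35, "B": 0.40, "C": 0.50}
sampler = ThompsonSampler(true_rates, seed=7)
shown = {name: 0 for name in true_rates}
for _ in range(5000):
    pick = sampler.choose()  # step 3: show this variant to the next visitor
    shown[pick] += 1
    sampler.update(pick, true_rates[pick] > sampler.rng.random())
print(shown)
```

&lt;p&gt;With these made-up rates, most of the 5,000 simulated visitors should end up on C, while A and B still receive a trickle of exploratory traffic early on.&lt;/p&gt;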

&lt;p&gt;The magic is in the Beta distribution's shape. A variant with 40 successes and 60 failures produces a tight distribution centered around 0.40, so you're fairly confident in that number. A variant with 5 successes and 5 failures produces a wide distribution: the true rate could plausibly be anywhere from about 0.2 to 0.8.&lt;/p&gt;

&lt;p&gt;When you sample from the uncertain distribution, it occasionally produces very high values. That's exploration — the algorithm says "this option might be amazing, let's check." As you gather more data, the distribution tightens, and the algorithm naturally shifts from exploration to exploitation.&lt;/p&gt;
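
&lt;p&gt;You can put numbers on "tight" versus "wide" with the closed-form mean and standard deviation of a Beta distribution. The helper name &lt;code&gt;beta_mean_sd&lt;/code&gt; is made up for this sketch; the formulas are the standard ones.&lt;/p&gt;

```python
from math import sqrt

def beta_mean_sd(alpha, beta):
    """Closed-form mean and standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    sd = sqrt(alpha * beta / (total ** 2 * (total + 1)))
    return mean, sd

for label, a, b in [("40 of 100", 40, 60), ("5 of 10", 5, 5)]:
    mean, sd = beta_mean_sd(a, b)
    print(f"{label}: mean {mean:.2f}, sd {sd:.3f}")
# 40 of 100: mean 0.40, sd 0.049
# 5 of 10: mean 0.50, sd 0.151
```

&lt;p&gt;The 10-observation variant is about three times as uncertain as the 100-observation one, and that extra spread is what produces the occasional high draw.&lt;/p&gt;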

&lt;p&gt;&lt;strong&gt;It converges faster than fixed-split A/B tests&lt;/strong&gt; because it automatically routes more traffic to winning variants while still exploring promising unknowns. No manual intervention. No arbitrary "stop the test" decisions.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Failure Mode LLMs Hit
&lt;/h3&gt;

&lt;p&gt;Here's something surprising: large language models consistently get Thompson Sampling wrong when they try to implement decision-making. They see uncertainty and interpret it as risk. When a variant has high variance, an LLM tends to &lt;strong&gt;pull back&lt;/strong&gt; — to avoid the uncertain option and stick with the known quantity.&lt;/p&gt;

&lt;p&gt;That's the exact opposite of what Thompson Sampling does. The algorithm treats uncertainty as &lt;strong&gt;opportunity&lt;/strong&gt;. High variance means "we might be missing something great here." This is documented in what one team called "The $3,000 Bug" — an AI agent that was supposed to optimize decisions kept choosing the safe, well-known option and ignoring high-upside alternatives because it conflated uncertainty with danger.&lt;/p&gt;

&lt;p&gt;Thompson Sampling doesn't make that mistake. The math doesn't have opinions about risk.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Alternatives Landscape
&lt;/h2&gt;

&lt;p&gt;Optimizely isn't your only option. The market has fragmented significantly, and there are tools at every price point. Here's an honest comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;th&gt;Bandit Algorithms&lt;/th&gt;
&lt;th&gt;Self-Serve&lt;/th&gt;
&lt;th&gt;Lock-in&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Optimizely&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$36K+/yr&lt;/td&gt;
&lt;td&gt;Thompson only&lt;/td&gt;
&lt;td&gt;No (sales call)&lt;/td&gt;
&lt;td&gt;High (SDK)&lt;/td&gt;
&lt;td&gt;Enterprise with big budgets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;VWO&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$199+/mo&lt;/td&gt;
&lt;td&gt;Thompson only&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Mid-market marketing teams&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GrowthBook&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free (self-host)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Teams with DevOps capacity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Statsig&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free–$150/mo&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Developer-first teams&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OraClaw API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free for 25 calls/day, $0.005/call beyond that, or $9/mo Starter&lt;/td&gt;
&lt;td&gt;UCB1 + Thompson + LinUCB&lt;/td&gt;
&lt;td&gt;Yes (1 email)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Developers and AI agents that just need the algorithm&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few things jump out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GrowthBook&lt;/strong&gt; is the open-source hero. If you have the DevOps capacity to self-host, maintain, and monitor it, it's genuinely free and full-featured. The catch is operational overhead — you're running the infrastructure, handling uptime, managing database migrations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Statsig&lt;/strong&gt; hit a sweet spot for developer teams. Their free tier is generous, the DX is good, and it's what Brex switched to. If you need a full experimentation platform with a dashboard, this is the value pick.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VWO&lt;/strong&gt; occupies the mid-market — cheaper than Optimizely, still dashboard-focused, still requires some sales interaction for advanced features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OraClaw&lt;/strong&gt; takes a fundamentally different approach. It's not a platform — it's an API endpoint. You send it arm data, it runs the algorithm, it returns a decision. No SDK to install, no dashboard to learn, no vendor lock-in. It supports three bandit algorithms (UCB1 for deterministic upper confidence bounds, Thompson for Bayesian exploration, and LinUCB for context-aware decisions that factor in features like time-of-day or user segment).&lt;/p&gt;

&lt;p&gt;The right choice depends entirely on what you need. Not every problem requires the same tool.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Right Now
&lt;/h2&gt;

&lt;p&gt;Here's a working example. No signup, no API key, no sales call. Just paste this into your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://oraclaw-api.onrender.com/api/v1/optimize/bandit &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "arms": [
      {"id": "variant-a", "name": "Original CTA", "pulls": 500, "totalReward": 175},
      {"id": "variant-b", "name": "New CTA", "pulls": 300, "totalReward": 126},
      {"id": "variant-c", "name": "Bold CTA", "pulls": 12, "totalReward": 8}
    ],
    "algorithm": "thompson"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll get back something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"selected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"variant-c"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.71&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"algorithm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"thompson"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait — variant-c? The one with only 12 pulls and a 66.7% conversion rate?&lt;/p&gt;

&lt;p&gt;Yes. And here's why that's correct.&lt;/p&gt;

&lt;p&gt;Variant A has 500 pulls and a 35% conversion rate. Thompson Sampling is very confident about that number — the Beta(175, 325) distribution is tight. It's almost certainly between 31% and 39%.&lt;/p&gt;

&lt;p&gt;Variant B has 300 pulls and a 42% conversion rate. Also fairly confident — Beta(126, 174) is tight. Probably between 37% and 47%.&lt;/p&gt;

&lt;p&gt;Variant C has 12 pulls and a 66.7% conversion rate. But Beta(8, 4) is &lt;strong&gt;wide&lt;/strong&gt;. The true rate could be anywhere from 35% to 90%. When Thompson samples from this distribution, it frequently draws values above 0.50, higher than A or B will realistically produce.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The algorithm is saying: "Variant C looks promising but we barely know anything about it. Let's send more traffic there to find out."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's exploration in action. If C's true rate turns out to be lower than B's 42%, more pulls will tighten the distribution and it'll be selected less and less. If C's true rate really is 65%, you just found a massive winner that a fixed 33/33/33 split would have taken far longer to identify.&lt;/p&gt;
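
&lt;p&gt;To see how often each variant would win a Thompson draw with the numbers from the curl example, you can simulate it locally. This sketch uses the Beta(successes, failures) parameters exactly as written above, with no prior added; the API may parameterize differently, so treat the shares as illustrative. The function name &lt;code&gt;selection_shares&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
import random
from collections import Counter

def selection_shares(arms, draws=10_000, seed=0):
    """Estimate how often each arm wins a Thompson draw, given (successes, failures)."""
    rng = random.Random(seed)
    wins = Counter()
    for _ in range(draws):
        samples = {name: rng.betavariate(s, f) for name, (s, f) in arms.items()}
        wins[max(samples, key=samples.get)] += 1
    return {name: wins[name] / draws for name in arms}

# Posteriors as written in the article: Beta(successes, failures), no prior added.
arms = {"variant-a": (175, 325), "variant-b": (126, 174), "variant-c": (8, 4)}
print(selection_shares(arms))
```

&lt;p&gt;Variant C should take the large majority of draws, with B picking up most of the rest and A almost none. As C accumulates pulls, its share moves toward whatever its true rate justifies.&lt;/p&gt;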

&lt;p&gt;This is exactly the behavior that the "$3,000 Bug" LLM got wrong. It saw the small sample size and treated it as a reason to avoid variant C. Thompson Sampling sees the small sample size and treats it as a reason to investigate.&lt;/p&gt;

&lt;p&gt;You can swap &lt;code&gt;"algorithm": "thompson"&lt;/code&gt; for &lt;code&gt;"ucb1"&lt;/code&gt; or &lt;code&gt;"linucb"&lt;/code&gt; (with a context vector) to compare strategies. The endpoint is stateless — bring your own data, get back a decision, integrate it however you want. Pipe it into your CI/CD pipeline, call it from a serverless function, embed it in an AI agent's decision loop.&lt;/p&gt;
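
&lt;p&gt;For a sense of what the &lt;code&gt;"ucb1"&lt;/code&gt; option computes, here is the textbook UCB1 score (mean reward plus an exploration bonus) applied to the same three arms. This is the standard formula, not necessarily the exact variant the endpoint runs, and it assumes every arm has been pulled at least once.&lt;/p&gt;

```python
from math import log, sqrt

def ucb1_scores(arms):
    """Return the UCB1 score (mean reward + exploration bonus) for each arm."""
    total_pulls = sum(a["pulls"] for a in arms)
    return {
        a["id"]: a["totalReward"] / a["pulls"]
        + sqrt(2 * log(total_pulls) / a["pulls"])
        for a in arms
    }

arms = [
    {"id": "variant-a", "pulls": 500, "totalReward": 175},
    {"id": "variant-b", "pulls": 300, "totalReward": 126},
    {"id": "variant-c", "pulls": 12, "totalReward": 8},
]
scores = ucb1_scores(arms)
print(max(scores, key=scores.get))  # variant-c
```

&lt;p&gt;UCB1 reaches the same pick as Thompson here, but deterministically: the bonus term shrinks as an arm's pull count grows, so the barely tested variant-c gets the largest boost.&lt;/p&gt;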




&lt;h2&gt;
  
  
  The MCP Angle: Your AI Agent Can Call This Directly
&lt;/h2&gt;

&lt;p&gt;If you're using Claude Desktop, Cursor, Cline, or any MCP-compatible client, your agent can call this algorithm itself — no curl, no SDK install, no HTTP code. Drop this into your &lt;code&gt;claude_desktop_config.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"oraclaw"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@oraclaw/mcp-server"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart your client. Now you can literally type:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"I have these three signup variants with these conversion numbers. Which should I send the next 1,000 visitors to?"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The agent calls &lt;code&gt;optimize_bandit&lt;/code&gt; itself, gets back a structured decision in 0.01ms, and explains the result in your language. No more LLMs guessing at the math — they offload it to a real algorithm.&lt;/p&gt;

&lt;p&gt;The MCP server ships with &lt;strong&gt;17 tools&lt;/strong&gt; total: bandits, contextual bandits, genetic algorithms, Monte Carlo, scheduling, pathfinding, scoring, Bayesian belief updates, ensemble consensus — 11 free without a key, 6 premium (LP/MIP solver, time-series forecasting, anomaly detection, graph analytics, CMA-ES, portfolio risk). All with explicit input + output JSON schemas so your agent knows exactly what it gets back.&lt;/p&gt;

&lt;p&gt;Get the MCP server: &lt;a href="https://www.npmjs.com/package/@oraclaw/mcp-server" rel="noopener noreferrer"&gt;npmjs.com/package/@oraclaw/mcp-server&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  When NOT to Use This
&lt;/h2&gt;

&lt;p&gt;Let's be honest about the tradeoffs.&lt;/p&gt;

&lt;p&gt;If you need a &lt;strong&gt;visual editor&lt;/strong&gt; so your marketing team can create tests without writing code — use Optimizely or VWO. If you need &lt;strong&gt;audience targeting&lt;/strong&gt; with complex segmentation rules — use a platform. If you need a &lt;strong&gt;dashboard&lt;/strong&gt; with real-time charts for stakeholders who don't read JSON — use Statsig or GrowthBook.&lt;/p&gt;

&lt;p&gt;A bare API endpoint is the wrong tool for organizations where non-developers need to create and monitor experiments. That's a real use case, and the $36K platforms serve it well.&lt;/p&gt;

&lt;p&gt;But if you're a developer calling an optimization algorithm from your backend, your data pipeline, or your AI agent — you don't need a visual editor. You don't need a dashboard. You need the math, and you need it fast, and you need it cheap.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Math Doesn't Care About Your Budget
&lt;/h2&gt;

&lt;p&gt;Thompson Sampling uses the same posterior, the same sampling rule, and has the same convergence properties whether the compute costs $36,000 per year or $0.005 per call. The algorithm was published in 1933, and for Bernoulli rewards (converted or not) it has been proven asymptotically optimal: its regret matches the Lai-Robbins lower bound as the number of trials grows. No amount of enterprise packaging changes the underlying mathematics.&lt;/p&gt;

&lt;p&gt;The question isn't "which algorithm is best" — for most A/B testing scenarios, Thompson Sampling is the answer regardless of vendor. The question is: &lt;strong&gt;how much infrastructure do you need wrapped around it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the answer is "a lot" — platforms exist for that. If the answer is "just give me the algorithm" — now you know what your options are.&lt;/p&gt;

&lt;p&gt;Stop paying $36,000 for 20 lines of math.&lt;/p&gt;
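
&lt;p&gt;To make the "20 lines" concrete, here is a minimal sketch of Beta-Bernoulli Thompson Sampling using only the Python standard library. It is an illustration, not the oraclaw implementation: the name &lt;code&gt;thompson_allocate&lt;/code&gt;, its signature, the uniform Beta(1, 1) prior, and the choice to keep the posterior fixed while allocating a batch are all assumptions made for this example.&lt;/p&gt;

```python
import random


def thompson_allocate(arms, visitors, seed=None):
    """Split `visitors` across arms with Beta-Bernoulli Thompson Sampling.

    `arms` maps a variant name to (conversions, trials). The name and
    signature are hypothetical, chosen for this sketch only.
    """
    rng = random.Random(seed)
    counts = {name: 0 for name in arms}
    for _ in range(visitors):
        # Draw one plausible conversion rate per arm from its posterior,
        # Beta(1 + successes, 1 + failures). Assumes a uniform prior and
        # that the posterior is not updated inside the batch.
        draws = {
            name: rng.betavariate(1 + conv, 1 + (trials - conv))
            for name, (conv, trials) in arms.items()
        }
        # Route this visitor to the arm with the highest sampled rate.
        counts[max(draws, key=draws.get)] += 1
    return counts


# Three signup variants with made-up conversion numbers.
variants = {"A": (30, 1000), "B": (45, 1000), "C": (38, 1000)}
print(thompson_allocate(variants, 1000, seed=0))
```

&lt;p&gt;Each visitor goes to whichever variant wins a random draw from the posteriors, so stronger variants get most of the traffic while weaker ones still receive some in proportion to how plausible it is that they are actually the best. In a live system you would update the counts as conversions arrive rather than allocating a whole batch from a fixed posterior.&lt;/p&gt;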




&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free tier (no signup):&lt;/strong&gt; the curl command above runs against the live API right now, 25 calls/day per IP, no key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Get an API key (1 email field):&lt;/strong&gt; &lt;a href="https://web-olive-one-89.vercel.app/signup" rel="noopener noreferrer"&gt;oraclaw signup&lt;/a&gt; gives you an instant key with 1,000 calls/day on pay-per-call ($0.005/call), or upgrade to Starter at $9/mo for 50K calls/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use it from your AI agent:&lt;/strong&gt; &lt;code&gt;npm install @oraclaw/mcp-server&lt;/code&gt; or paste the MCP config above into Claude Desktop / Cursor / Cline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source + 17-tool docs:&lt;/strong&gt; &lt;a href="https://github.com/Whatsonyourmind/oraclaw" rel="noopener noreferrer"&gt;github.com/Whatsonyourmind/oraclaw&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If this saved you from a $36K sales call, leave a star — it helps other developers find it.&lt;/p&gt;

</description>
      <category>testing</category>
      <category>webdev</category>
      <category>mcp</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
