<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Emir Hüseyin İnci</title>
    <description>The latest articles on DEV Community by Emir Hüseyin İnci (@emirhuseyininci).</description>
    <link>https://dev.to/emirhuseyininci</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4003108%2F9d797d52-d908-4e1c-83b9-1dfc3811f5ff.jpg</url>
      <title>DEV Community: Emir Hüseyin İnci</title>
      <link>https://dev.to/emirhuseyininci</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/emirhuseyininci"/>
    <language>en</language>
    <item>
      <title>The Moment AI Stops Predicting and Starts Choosing</title>
      <dc:creator>Emir Hüseyin İnci</dc:creator>
      <pubDate>Sat, 27 Jun 2026 03:52:14 +0000</pubDate>
      <link>https://dev.to/emirhuseyininci/the-moment-ai-stops-predicting-and-starts-choosing-20fj</link>
      <guid>https://dev.to/emirhuseyininci/the-moment-ai-stops-predicting-and-starts-choosing-20fj</guid>
      <description>&lt;p&gt;&lt;em&gt;Most machine learning learns from labels. Reinforcement learning learns from consequences — and that one-word difference breaks everything you thought you knew about how AI works.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let me show you the exact moment AI gets dangerous.&lt;/p&gt;

&lt;p&gt;Not dangerous in the sci-fi sense.&lt;/p&gt;

&lt;p&gt;Dangerous in the &lt;em&gt;quietly optimizes the wrong thing for six months until your product is broken and you don’t know why&lt;/em&gt; sense.&lt;/p&gt;

&lt;p&gt;It happens when a system stops answering questions and starts making decisions.&lt;/p&gt;

&lt;p&gt;That’s reinforcement learning. And if you use AI products, build them, or invest in companies that do, you need to understand this.&lt;/p&gt;

&lt;h2&gt;
  
  
  The difference no one explains clearly
&lt;/h2&gt;

&lt;p&gt;Most AI systems you’ve used were probably trained the same basic way:&lt;/p&gt;

&lt;p&gt;Show it a million examples. Tell it the right answer each time. Let it adjust until it gets good at guessing the right answer.&lt;/p&gt;

&lt;p&gt;Image → label. Email → spam/not spam. Transaction → fraud/not fraud.&lt;/p&gt;

&lt;p&gt;This is supervised learning. It’s powerful. It’s what makes your spam filter work and your autocomplete finish your sentences.&lt;/p&gt;

&lt;p&gt;But here’s the thing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The world doesn’t come with labels.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your next career move doesn’t have a correct answer written on the back.&lt;/p&gt;

&lt;p&gt;Your company’s pricing strategy doesn’t come with a ground truth.&lt;/p&gt;

&lt;p&gt;And neither does a chess game, a trading position, or a conversation with a user.&lt;/p&gt;

&lt;p&gt;For these problems, you don’t need a system that &lt;em&gt;predicts the answer&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;You need a system that &lt;em&gt;takes an action&lt;/em&gt; and then lives with what happens next.&lt;/p&gt;

&lt;p&gt;That’s the shift.&lt;/p&gt;

&lt;p&gt;That’s everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  The comfortable loop vs. the loop that changes the world
&lt;/h2&gt;

&lt;p&gt;Before we get to RL, a quick detour to a simpler problem.&lt;/p&gt;

&lt;p&gt;A &lt;em&gt;bandit&lt;/em&gt; algorithm, like the Thompson Sampling I wrote about in &lt;a href="https://dev.to/emirhuseyininci/thompson-sampling-how-recommender-systems-learn-to-bet-on-what-youll-like-2l4"&gt;the first piece in this series&lt;/a&gt;, makes one repeated decision:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Which option should I pick right now?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Show movie.&lt;/p&gt;

&lt;p&gt;User clicks or doesn’t.&lt;/p&gt;

&lt;p&gt;Update belief.&lt;/p&gt;

&lt;p&gt;Repeat.&lt;/p&gt;

&lt;p&gt;Crucially: each round resets. The world tomorrow looks basically like the world today.&lt;/p&gt;

&lt;p&gt;Reinforcement learning is what happens when you take that convenience away.&lt;/p&gt;

&lt;p&gt;Now the loop is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Observe the current state of the world&lt;/li&gt;
&lt;li&gt;Choose an action&lt;/li&gt;
&lt;li&gt;The world &lt;em&gt;changes&lt;/em&gt; because of your action&lt;/li&gt;
&lt;li&gt;Receive a reward, or don’t&lt;/li&gt;
&lt;li&gt;Find yourself in a new, different world&lt;/li&gt;
&lt;li&gt;Choose again&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F7nyly1g9ujlti41nr4sa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F7nyly1g9ujlti41nr4sa.png" alt="Bandits vs reinforcement learning" width="700" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The bandit recommender asks:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Which movie should I show this user right now?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The RL recommender asks:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What should I show today, knowing it will change what this user wants to watch next month?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One extra clause. Completely different problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bandits choose well now. Reinforcement learning chooses well over time.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Six words that explain the whole field
&lt;/h2&gt;

&lt;p&gt;The jargon in RL sounds intimidating.&lt;/p&gt;

&lt;p&gt;It isn’t.&lt;/p&gt;

&lt;p&gt;Here’s the whole thing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent&lt;/strong&gt; — the system making decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Environment&lt;/strong&gt; — the world that reacts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;State&lt;/strong&gt; — what the agent can currently see.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action&lt;/strong&gt; — what the agent chooses to do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reward&lt;/strong&gt; — the signal that comes back after the action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Policy&lt;/strong&gt; — the rule that maps “what I see” to “what I do.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fw8cojvy2v24axjlelw9s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fw8cojvy2v24axjlelw9s.png" alt="Agent environment loop" width="700" height="392"&gt;&lt;/a&gt;&lt;/p&gt;
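&lt;p&gt;As a rough sketch of how the six terms fit together, here is a toy loop. The environment (&lt;code&gt;ToyFeedEnv&lt;/code&gt;), its trust scale, and the reward numbers are all invented for illustration and are not taken from any library or real product.&lt;/p&gt;

```python
class ToyFeedEnv:
    """Hypothetical environment: the state is the user's trust level, from 0 to 5."""

    def __init__(self):
        self.trust = 3

    def step(self, action):
        # Clickbait pays more right now but erodes trust; quality content rebuilds it.
        if action == "clickbait":
            reward = 1.0 if self.trust != 0 else 0.0
            self.trust = max(0, self.trust - 1)
        else:
            reward = 0.5 if self.trust != 0 else 0.0
            self.trust = min(5, self.trust + 1)
        return self.trust, reward


def run_episode(policy, steps=20):
    env = ToyFeedEnv()                      # environment
    state = env.trust                       # state
    total = 0.0
    for _ in range(steps):
        action = policy(state)              # the policy maps state to action
        state, reward = env.step(action)    # the world changes, a reward comes back
        total += reward
    return total


def always_clickbait(state):
    return "clickbait"


def protect_trust(state):
    # Only cash in on clickbait while trust is at its maximum.
    return "clickbait" if state == 5 else "quality"


print(run_episode(always_clickbait), run_episode(protect_trust))
```

&lt;p&gt;In this toy setup the policy that always grabs the immediate reward burns through trust in three steps and earns nothing afterwards, while the policy that looks at the state keeps earning for the whole episode.&lt;/p&gt;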

&lt;p&gt;Everything in the field — Q-learning, policy gradients, actor-critic, PPO, RLHF — is trying to solve one problem:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;How do you maximize reward not just now, but across the whole chain of decisions that follows?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s it.&lt;/p&gt;

&lt;p&gt;That’s the field.&lt;/p&gt;

&lt;p&gt;The reason it’s hard is that the chain is long, the future is uncertain, and actions today shape what’s even &lt;em&gt;possible&lt;/em&gt; tomorrow.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part that makes RL genuinely hard
&lt;/h2&gt;

&lt;p&gt;Here’s where it gets uncomfortable.&lt;/p&gt;

&lt;p&gt;In a bandit problem, feedback is fast.&lt;/p&gt;

&lt;p&gt;Show content, user clicks, update.&lt;/p&gt;

&lt;p&gt;In RL, the consequence of an action might show up &lt;em&gt;weeks later&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;By then, the system has made hundreds of other decisions.&lt;/p&gt;

&lt;p&gt;So which one caused the outcome?&lt;/p&gt;

&lt;p&gt;This is called the &lt;strong&gt;credit assignment problem&lt;/strong&gt;, and it’s not a small technical footnote.&lt;/p&gt;

&lt;p&gt;It’s one of the core reasons RL is difficult to get right.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fre5dhyzg2aac7ol4t8xr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fre5dhyzg2aac7ol4t8xr.png" alt="Credit assignment problem" width="700" height="392"&gt;&lt;/a&gt;&lt;/p&gt;
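&lt;p&gt;One standard tool for spreading a late outcome back over earlier decisions is the discounted return: each step is credited with its own reward plus a discounted share of everything that followed. The sketch below assumes a hypothetical four-step episode and a discount factor &lt;code&gt;gamma&lt;/code&gt; chosen only to keep the arithmetic readable.&lt;/p&gt;

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute G_t = r_t + gamma * G_(t+1) for every step t, working backwards."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Hypothetical episode: nothing happens for three steps, then a payoff of 10 arrives.
rewards = [0.0, 0.0, 0.0, 10.0]
print(discounted_returns(rewards, gamma=0.5))  # prints [1.25, 2.5, 5.0, 10.0]
```

&lt;p&gt;Earlier actions receive a smaller share of the late payoff. That is a blunt answer to the question of which decision caused the outcome, which is part of why credit assignment remains hard in practice.&lt;/p&gt;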

&lt;p&gt;Think about what this means in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The click-maximizing recommendation trains users to expect worse and worse content, but the engagement numbers look great for another 18 months.&lt;/li&gt;
&lt;li&gt;The profitable trade quietly accumulates exposure to a tail risk that only shows up under stress, but the P&amp;amp;L looks clean until it doesn’t.&lt;/li&gt;
&lt;li&gt;The cheapest LLM routing saves $0.003 per request until retries, escalations, and churn quietly eat the margin you thought you were protecting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In every case:&lt;/p&gt;

&lt;p&gt;The immediate signal says &lt;em&gt;yes&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The real outcome says &lt;em&gt;wait&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The system has to learn from both.&lt;/p&gt;

&lt;p&gt;That’s what makes this hard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three industries being quietly reshaped by this problem right now
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Streaming and content recommendations
&lt;/h2&gt;

&lt;p&gt;Optimize for clicks → system learns clickbait.&lt;/p&gt;

&lt;p&gt;Not because anyone designed it to.&lt;/p&gt;

&lt;p&gt;Because clicks were the proxy, and the proxy was optimized.&lt;/p&gt;

&lt;p&gt;The metric improves for a year.&lt;/p&gt;

&lt;p&gt;Then user satisfaction surveys start dropping.&lt;/p&gt;

&lt;p&gt;Then retention curves start bending.&lt;/p&gt;

&lt;p&gt;Then someone in the boardroom asks why the numbers that matter are going in the wrong direction even though the numbers being tracked are fine.&lt;/p&gt;

&lt;p&gt;The RL framing forces the harder question earlier:&lt;/p&gt;

&lt;p&gt;What recommendation policy increases long-term &lt;em&gt;trust&lt;/em&gt;, not just this session’s engagement?&lt;/p&gt;

&lt;h2&gt;
  
  
  Trading
&lt;/h2&gt;

&lt;p&gt;A position that looks profitable today can be a bad decision if it shifts your risk profile in ways that only matter under stress.&lt;/p&gt;

&lt;p&gt;In RL terms, the question isn’t buy-or-sell.&lt;/p&gt;

&lt;p&gt;It’s position management over time, where each action changes what’s available and what’s dangerous next.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM routing
&lt;/h2&gt;

&lt;p&gt;This one is underappreciated.&lt;/p&gt;

&lt;p&gt;Routing every request to the cheapest capable model looks like free money.&lt;/p&gt;

&lt;p&gt;Until quality starts quietly degrading at the margin.&lt;/p&gt;

&lt;p&gt;Until the edge cases that fall through the cracks start accumulating.&lt;/p&gt;

&lt;p&gt;Until users who needed a good answer and got a mediocre one just stop asking.&lt;/p&gt;

&lt;p&gt;That cost never shows up in the routing dashboard.&lt;/p&gt;

&lt;p&gt;But it’s there.&lt;/p&gt;

&lt;p&gt;This is a pure RL problem:&lt;/p&gt;

&lt;p&gt;The reward signal (cost per query) and the real objective (user outcome) are separated in time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The uncomfortable truth about reward functions
&lt;/h2&gt;

&lt;p&gt;Here’s the thing nobody says out loud early enough:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Reinforcement learning doesn’t learn what you want. It learns what you reward.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And those two things can drift apart faster than you’d expect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F8igli7fnz6j4dbdkv6g7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F8igli7fnz6j4dbdkv6g7.png" alt="Reward hacking" width="700" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This doesn’t require the system to be malicious.&lt;/p&gt;

&lt;p&gt;It doesn’t require it to be clever.&lt;/p&gt;

&lt;p&gt;It only requires the reward function to be an imperfect proxy for the thing that actually mattered.&lt;/p&gt;

&lt;p&gt;And in every real product, the reward function is a proxy.&lt;/p&gt;

&lt;p&gt;Always.&lt;/p&gt;

&lt;p&gt;Because the thing that actually matters — user trust, long-term retention, sustained business value — can’t be measured in real time.&lt;/p&gt;

&lt;p&gt;This is why RL has a strange dual nature:&lt;/p&gt;

&lt;p&gt;On one hand, it can discover strategies that humans would never write by hand.&lt;/p&gt;

&lt;p&gt;AlphaGo was bootstrapped from human expert games, but it was reinforcement learning through self-play that took it past them. It found moves that professional players had not considered.&lt;/p&gt;

&lt;p&gt;On the other hand, it can exploit exactly the blind spots humans encoded into the reward function without realizing it.&lt;/p&gt;

&lt;p&gt;Both things are true at the same time.&lt;/p&gt;

&lt;p&gt;Neither cancels the other out.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gap nobody budgets for
&lt;/h2&gt;

&lt;p&gt;Here’s the thing that surprises people building RL systems for the first time:&lt;/p&gt;

&lt;p&gt;A policy can be statistically optimal under the reward function and &lt;em&gt;still be unacceptable in production.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Statistically optimal means:&lt;/p&gt;

&lt;p&gt;Given the data, given the reward signal, this is the best policy we found.&lt;/p&gt;

&lt;p&gt;Operationally acceptable means:&lt;/p&gt;

&lt;p&gt;This is something we’d actually defend when it scales to millions of users.&lt;/p&gt;

&lt;p&gt;Those are not the same thing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fgxasoir43x1endxxj36u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fgxasoir43x1endxxj36u.png" alt="Production guardrails" width="700" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A production RL system needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Constraints the policy cannot optimize around&lt;/li&gt;
&lt;li&gt;Monitoring that catches drift before it becomes a crisis&lt;/li&gt;
&lt;li&gt;Audit trails so decisions can be replayed and explained&lt;/li&gt;
&lt;li&gt;Hard limits on what the agent is allowed to do while it’s still learning&lt;/li&gt;
&lt;li&gt;A clear answer to “what happens when the reward function diverges from the goal?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not because the algorithm is broken.&lt;/p&gt;

&lt;p&gt;Because the reward is incomplete.&lt;/p&gt;

&lt;p&gt;It always is.&lt;/p&gt;

&lt;p&gt;The guardrails aren’t an afterthought.&lt;/p&gt;

&lt;p&gt;They’re the product.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters beyond the technical teams
&lt;/h2&gt;

&lt;p&gt;If you manage products that use AI, the question you should be asking isn’t:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Did the metric improve?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It’s:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Are the decisions the system is learning ones I would defend when they scale?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Because at scale, the compounding effects of a slightly wrong reward function aren’t small.&lt;/p&gt;

&lt;p&gt;They’re the story.&lt;/p&gt;

&lt;p&gt;They’re the reason the product that looked great in the dashboard becomes the product that’s slowly losing the users who mattered.&lt;/p&gt;

&lt;p&gt;Reinforcement learning is powerful because it matches how real decisions work.&lt;/p&gt;

&lt;p&gt;Real decisions happen in sequence.&lt;/p&gt;

&lt;p&gt;They have delayed consequences.&lt;/p&gt;

&lt;p&gt;Each one changes the next.&lt;/p&gt;

&lt;p&gt;The uncertainty doesn’t resolve immediately.&lt;/p&gt;

&lt;p&gt;That’s exactly the world RL is built for.&lt;/p&gt;

&lt;p&gt;But it also means the question is never just whether the algorithm worked.&lt;/p&gt;

&lt;p&gt;The question is whether you taught the machine to make decisions you can actually trust.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>reinforcementlearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Thompson Sampling: How Recommender Systems Learn to Bet on What You'll Like</title>
      <dc:creator>Emir Hüseyin İnci</dc:creator>
      <pubDate>Sat, 27 Jun 2026 03:46:21 +0000</pubDate>
      <link>https://dev.to/emirhuseyininci/thompson-sampling-how-recommender-systems-learn-to-bet-on-what-youll-like-2l4</link>
      <guid>https://dev.to/emirhuseyininci/thompson-sampling-how-recommender-systems-learn-to-bet-on-what-youll-like-2l4</guid>
      <description>&lt;p&gt;&lt;em&gt;A Bayesian approach to the explore-exploit tradeoff, explained through the lens of personalized recommendations.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every time a streaming service, news app, or e-commerce site decides what to put in front of you, it's making a bet.&lt;/p&gt;

&lt;p&gt;Show you the item it's most confident you'll like, and you get a probably-good recommendation, but the system never finds out whether something else might have been even better.&lt;/p&gt;

&lt;p&gt;Show you something new and untested, and you risk wasting a valuable slot on a flop, but you might also discover the next big hit for that user segment.&lt;/p&gt;

&lt;p&gt;This is the explore-exploit tradeoff, and it sits underneath nearly every recommendation engine, ad-ranking system, and content feed running today.&lt;/p&gt;

&lt;p&gt;Thompson Sampling is one of the oldest, and in practice one of the most effective, ways to solve it. It traces back to a 1933 paper by William R. Thompson, decades before "recommender system" was even a phrase, and it remains a standard tool for anyone building ranking or personalization systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommendation Slots Are Bandits
&lt;/h2&gt;

&lt;p&gt;Strip away the UI, and a recommendation slot is a classic multi-armed bandit problem.&lt;/p&gt;

&lt;p&gt;Each candidate item (a movie, a product, an article) is an "arm."&lt;/p&gt;

&lt;p&gt;Showing that item to a user is "pulling" it.&lt;/p&gt;

&lt;p&gt;The reward is binary feedback:&lt;/p&gt;

&lt;p&gt;Did they click, watch, or buy?&lt;/p&gt;

&lt;p&gt;For each item &lt;code&gt;i&lt;/code&gt;, there's some true, unknown probability &lt;code&gt;theta_i&lt;/code&gt; that a user will engage with it.&lt;/p&gt;

&lt;p&gt;The system's job is to maximize cumulative engagement over time, which means learning the &lt;code&gt;theta_i&lt;/code&gt; values while it is still using them to make live decisions.&lt;/p&gt;

&lt;p&gt;That's what makes this harder than ordinary supervised learning:&lt;/p&gt;

&lt;p&gt;There is no separate training phase.&lt;/p&gt;

&lt;p&gt;Every recommendation is simultaneously a data point and a decision with real consequences.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Greedy Ranking Fails
&lt;/h2&gt;

&lt;p&gt;The simplest strategy is greedy:&lt;/p&gt;

&lt;p&gt;Track the observed click rate for every item, and always recommend whichever one currently looks best.&lt;/p&gt;

&lt;p&gt;This fails for an intuitive reason.&lt;/p&gt;

&lt;p&gt;Suppose a genuinely great item gets unlucky on its first few impressions. Three users see it and none click, purely by chance.&lt;/p&gt;

&lt;p&gt;Its observed rate collapses, the greedy algorithm buries it, and it never gets shown again to correct the mistake.&lt;/p&gt;

&lt;p&gt;Early noise becomes a permanent verdict.&lt;/p&gt;

&lt;p&gt;A common fix, known as epsilon-greedy, is forced exploration: show a random item some fixed fraction of the time.&lt;/p&gt;

&lt;p&gt;But that explores indiscriminately.&lt;/p&gt;

&lt;p&gt;It spends just as much effort re-testing items the system already has plenty of evidence about as it does on the ones still wrapped in uncertainty.&lt;/p&gt;

&lt;p&gt;What we actually want is exploration that's proportional to how unsure the system still is.&lt;/p&gt;
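&lt;p&gt;To make the weakness concrete, here is a minimal epsilon-greedy sketch. The function name, the counts, and the default &lt;code&gt;epsilon&lt;/code&gt; are assumptions for illustration only.&lt;/p&gt;

```python
import random


def epsilon_greedy_choice(clicks, impressions, epsilon=0.1, rng=random):
    """Return an item index: uniform random with probability epsilon, else the best observed rate."""
    n_items = len(clicks)
    explore = rng.choices([True, False], weights=[epsilon, 1.0 - epsilon])[0]
    if explore:
        # Exploration ignores evidence: a well-tested item is as likely as a brand-new one.
        return rng.randrange(n_items)
    # Exploitation: highest observed click rate (items never shown count as 0.0).
    rates = [c / n if n else 0.0 for c, n in zip(clicks, impressions)]
    return max(range(n_items), key=lambda i: rates[i])


# Hypothetical counts: item 0 is well tested, item 2 has barely been shown.
clicks = [300, 40, 1]
impressions = [10000, 1000, 3]
picks = [epsilon_greedy_choice(clicks, impressions, epsilon=0.1) for _ in range(10)]
print(picks)
```

&lt;p&gt;Note that the exploration branch never looks at &lt;code&gt;impressions&lt;/code&gt;. An item with 10,000 impressions gets re-tested exactly as often as one with 3.&lt;/p&gt;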

&lt;h2&gt;
  
  
  Thompson Sampling Changes the Frame
&lt;/h2&gt;

&lt;p&gt;This is where Thompson Sampling changes the frame entirely.&lt;/p&gt;

&lt;p&gt;Instead of tracking a single click-rate estimate per item, it tracks a full probability distribution representing the system's current belief about &lt;code&gt;theta_i&lt;/code&gt;: how plausible every possible click-rate value is, given the data seen so far.&lt;/p&gt;

&lt;p&gt;Early on, with little data, that belief is wide and flat. Almost any click rate seems plausible.&lt;/p&gt;

&lt;p&gt;As impressions accumulate, the belief narrows around the true value.&lt;/p&gt;

&lt;p&gt;Crucially, the shape of this belief, not just its average, is exactly the information needed to explore intelligently:&lt;/p&gt;

&lt;p&gt;A wide belief means:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We genuinely don't know yet.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A narrow belief means:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We're fairly confident.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Feu0993754hh1fxm64qsz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Feu0993754hh1fxm64qsz.png" alt="Beta belief narrows as data accumulates" width="700" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;With little data, the system's belief is wide. As evidence accumulates, the distribution narrows around the true click rate.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Beta-Bernoulli Setup
&lt;/h2&gt;

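&lt;p&gt;For binary outcomes like clicks, the natural choice of belief distribution is the Beta distribution, thanks to a convenient property called conjugacy: a Beta prior combined with Bernoulli observations yields a posterior that is again a Beta distribution, so updating never leaves the family.&lt;/p&gt;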

&lt;p&gt;We start with a prior belief for each item:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;theta_i ~ Beta(alpha, beta)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A natural starting point is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alpha = 1
beta = 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the uniform distribution, meaning:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Any click rate from 0 to 1 is equally plausible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Each interaction is modeled as a Bernoulli trial:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A click is a success: &lt;code&gt;r = 1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;No click is a failure: &lt;code&gt;r = 0&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The update rule after a single observation is almost embarrassingly simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if the user clicked:
    alpha &amp;lt;- alpha + 1

if they did not click:
    beta &amp;lt;- beta + 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it.&lt;/p&gt;

&lt;p&gt;No gradient steps.&lt;/p&gt;

&lt;p&gt;No retraining.&lt;/p&gt;

&lt;p&gt;No matrix inversion.&lt;/p&gt;

&lt;p&gt;After &lt;code&gt;n&lt;/code&gt; impressions with &lt;code&gt;k&lt;/code&gt; clicks, the posterior is exactly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Beta(alpha + k, beta + n - k)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The mean of this distribution is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alpha / (alpha + beta)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the system's best point estimate of the click rate.&lt;/p&gt;

&lt;p&gt;Its variance shrinks roughly in proportion to &lt;code&gt;1 / n&lt;/code&gt;, the formal version of:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;More data, narrower belief.&lt;/p&gt;
&lt;/blockquote&gt;
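&lt;p&gt;A small sketch of those formulas, using the standard mean and variance of a Beta distribution. The helper names and the 20% click rate are assumptions made for the example.&lt;/p&gt;

```python
def posterior(prior_alpha, prior_beta, clicks, impressions):
    """Beta posterior parameters after `clicks` successes in `impressions` trials."""
    return prior_alpha + clicks, prior_beta + (impressions - clicks)


def beta_mean(alpha, beta):
    return alpha / (alpha + beta)


def beta_variance(alpha, beta):
    total = alpha + beta
    return (alpha * beta) / (total * total * (total + 1))


# Hypothetical item with a 20% observed click rate at three sample sizes.
for n in (10, 100, 1000):
    a, b = posterior(1, 1, clicks=n // 5, impressions=n)
    print(n, round(beta_mean(a, b), 4), round(beta_variance(a, b), 6))
```

&lt;p&gt;The mean barely moves across the three rows, while the variance drops by roughly a factor of ten each time the sample size grows tenfold.&lt;/p&gt;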

&lt;h2&gt;
  
  
  The Thompson Sampling Loop
&lt;/h2&gt;

&lt;p&gt;With a belief distribution maintained per item, the full Thompson Sampling procedure is just three steps, repeated every time a recommendation is needed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sample one value &lt;code&gt;theta_hat_i&lt;/code&gt; from each candidate item's current &lt;code&gt;Beta(alpha_i, beta_i)&lt;/code&gt; distribution.&lt;/li&gt;
&lt;li&gt;Recommend the item with the highest sampled value.&lt;/li&gt;
&lt;li&gt;Observe the outcome and update that item's &lt;code&gt;alpha&lt;/code&gt; or &lt;code&gt;beta&lt;/code&gt; accordingly.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Notice what's doing the actual work:&lt;/p&gt;

&lt;p&gt;The randomness in step 1.&lt;/p&gt;

&lt;p&gt;Nothing forces exploration explicitly. There is no epsilon, no separate exploration budget.&lt;/p&gt;

&lt;p&gt;Exploration emerges naturally from the fact that items the system is still uncertain about have wide distributions, and a wide distribution occasionally produces a high sample purely by chance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fm6ph6mu6t2dschz184y8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fm6ph6mu6t2dschz184y8.png" alt="Thompson Sampling with three arms" width="700" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A new item may have a lower average estimate but a wider belief distribution. Sometimes it samples high enough to win the recommendation slot.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The chart above shows exactly this in a single snapshot.&lt;/p&gt;

&lt;p&gt;The "New arrival" item has the lowest average click rate of the three, but because so little is known about it, its belief is wide.&lt;/p&gt;

&lt;p&gt;On this particular draw, its sampled value comes out ahead of both better-established items.&lt;/p&gt;

&lt;p&gt;It wins the recommendation slot this round, the system learns a little more about it, and its distribution narrows next time, win or lose.&lt;/p&gt;

&lt;p&gt;An item with a long, strong track record, by contrast, has a narrow distribution clustered tightly around its true rate.&lt;/p&gt;

&lt;p&gt;It keeps winning consistently, but it can occasionally lose a slot to a promising newcomer, exactly as it should.&lt;/p&gt;

&lt;p&gt;That is the elegant part:&lt;/p&gt;

&lt;p&gt;A single sampling step automatically interpolates between exploring and exploiting, with no tuning knob required, and it shifts toward exploitation on its own as confidence grows.&lt;/p&gt;
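&lt;p&gt;The three-step loop from this section can be sketched in a few lines with Python's standard library. The class name, the simulated click rates, and the seed are assumptions for illustration. A real system would observe clicks from users rather than simulate them.&lt;/p&gt;

```python
import random


class BetaArm:
    """Belief about one item's click rate, starting from a uniform Beta(1, 1) prior."""

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha = alpha
        self.beta = beta

    def sample(self, rng):
        return rng.betavariate(self.alpha, self.beta)

    def update(self, clicked):
        if clicked:
            self.alpha += 1
        else:
            self.beta += 1


def thompson_step(arms, rng):
    """Steps 1 and 2: draw once from every belief and return the index of the highest draw."""
    draws = [arm.sample(rng) for arm in arms]
    return max(range(len(arms)), key=lambda i: draws[i])


def simulate(true_rates, rounds=5000, seed=0):
    rng = random.Random(seed)
    arms = [BetaArm() for _ in true_rates]
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        chosen = thompson_step(arms, rng)
        p = true_rates[chosen]
        # Simulated user feedback: a click with probability p.
        clicked = rng.choices([True, False], weights=[p, 1.0 - p])[0]
        arms[chosen].update(clicked)        # step 3: update only the item that was shown
        pulls[chosen] += 1
    return pulls


# Hypothetical true click rates for three items.
print(simulate([0.05, 0.10, 0.30]))
```

&lt;p&gt;There is no exploration parameter anywhere in the code. The weaker items still get shown early, while their beliefs are wide, and then fade as the evidence accumulates.&lt;/p&gt;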

&lt;h2&gt;
  
  
  Making Thompson Sampling Practical
&lt;/h2&gt;

&lt;p&gt;Real recommender systems don't compare three items.&lt;/p&gt;

&lt;p&gt;They compare thousands or millions, and new items arrive constantly.&lt;/p&gt;

&lt;p&gt;A few extensions make Thompson Sampling practical at that scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cold Start
&lt;/h2&gt;

&lt;p&gt;A brand-new item starts at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Beta(1, 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That means maximal uncertainty.&lt;/p&gt;

&lt;p&gt;It has a real, non-trivial chance of sampling high enough to get shown early on.&lt;/p&gt;

&lt;p&gt;This is a feature, not a bug:&lt;/p&gt;

&lt;p&gt;New content gets a fair shot at exposure without needing a separate "new item boost" rule bolted on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Contextual Thompson Sampling
&lt;/h2&gt;

&lt;p&gt;Treating every item independently ignores everything known about the user:&lt;/p&gt;

&lt;p&gt;History, device, time of day, location, session context, and so on.&lt;/p&gt;

&lt;p&gt;In practice, recommendation systems typically use a contextual variant.&lt;/p&gt;

&lt;p&gt;Instead of a single &lt;code&gt;theta&lt;/code&gt; per item, the system maintains a distribution over the parameters of a predictive model, commonly a Bayesian linear or logistic regression, that estimates click probability from user and item features together.&lt;/p&gt;

&lt;p&gt;Sampling now means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Draw one set of model parameters.&lt;/li&gt;
&lt;li&gt;Score all candidates under that sampled model.&lt;/li&gt;
&lt;li&gt;Recommend the top one.&lt;/li&gt;
&lt;li&gt;Observe the result and update the posterior.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The mechanics are unchanged:&lt;/p&gt;

&lt;p&gt;Sample.&lt;/p&gt;

&lt;p&gt;Act.&lt;/p&gt;

&lt;p&gt;Update.&lt;/p&gt;

&lt;p&gt;The model is just richer than a single number per item.&lt;/p&gt;

&lt;h2&gt;
  
  
  Non-Stationarity
&lt;/h2&gt;

&lt;p&gt;Tastes drift.&lt;/p&gt;

&lt;p&gt;Items go stale.&lt;/p&gt;

&lt;p&gt;A pure Beta-Bernoulli model with no decay eventually becomes overconfident about old data that's no longer representative.&lt;/p&gt;

&lt;p&gt;A common fix is to mildly discount &lt;code&gt;alpha&lt;/code&gt; and &lt;code&gt;beta&lt;/code&gt; over time, multiplying both by a factor slightly below &lt;code&gt;1&lt;/code&gt; before each update.&lt;/p&gt;

&lt;p&gt;That keeps the belief from narrowing all the way to zero uncertainty and lets the system adapt if the true rate shifts later.&lt;/p&gt;
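&lt;p&gt;A minimal sketch of that discounting step. The function name and the &lt;code&gt;decay&lt;/code&gt; value of 0.99 are assumptions, and real systems tune the factor to how fast their traffic drifts.&lt;/p&gt;

```python
def discounted_update(alpha, beta, clicked, decay=0.99):
    """Shrink the old evidence by `decay`, then add the new observation."""
    alpha *= decay
    beta *= decay
    if clicked:
        alpha += 1.0
    else:
        beta += 1.0
    return alpha, beta


# Feed a long stream of alternating outcomes and watch the total evidence level off.
alpha, beta = 1.0, 1.0
for step in range(2000):
    alpha, beta = discounted_update(alpha, beta, clicked=(step % 2 == 0))
print(round(alpha + beta, 3))
```

&lt;p&gt;With this rule the total evidence &lt;code&gt;alpha + beta&lt;/code&gt; settles near &lt;code&gt;1 / (1 - decay)&lt;/code&gt;, about 100 here, so the belief never becomes narrower than roughly 100 recent impressions would justify.&lt;/p&gt;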

&lt;h2&gt;
  
  
  Production Questions
&lt;/h2&gt;

&lt;p&gt;Before reaching for Thompson Sampling in production, it is worth being deliberate about a couple of things.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Counts as Reward?
&lt;/h2&gt;

&lt;p&gt;A raw click is easy to measure but a weak proxy for satisfaction.&lt;/p&gt;

&lt;p&gt;Optimizing for clicks alone can reward clickbait while eroding trust.&lt;/p&gt;

&lt;p&gt;Many production systems instead model a downstream signal, like watch-time past a threshold, or a weighted blend of signals, and apply the same Bayesian machinery to that instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sampling Cost
&lt;/h2&gt;

&lt;p&gt;Drawing from a Beta distribution is cheap.&lt;/p&gt;

&lt;p&gt;But contextual variants that sample full parameter vectors, or in the extreme, run posterior sampling over a neural network, can get expensive under tight latency budgets and high request volume.&lt;/p&gt;

&lt;p&gt;Approximations like sampling once per batch of requests, rather than per individual request, are a common engineering compromise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation Is Tricky
&lt;/h2&gt;

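&lt;p&gt;Because the system's own choices generate the data it later learns from, the logs cannot directly answer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What would have happened under a different policy?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A naive estimate computed from those logs is statistically biased, because the logging policy decided which items were shown in the first place.&lt;/p&gt;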

&lt;p&gt;Offline evaluation typically needs either logged propensity scores or a held-out slice of traffic served by uniform random exploration to validate against.&lt;/p&gt;
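&lt;p&gt;One standard way to use logged propensities is inverse propensity scoring. The sketch below assumes each log row records the probability the logging policy assigned to the action it took. The log format and the policy function are made up for illustration:&lt;/p&gt;

```python
def ips_estimate(logs, target_policy):
    """Inverse-propensity-scored estimate of a target policy's mean reward.

    logs: list of (context, action, reward, propensity), where propensity is
    the probability the logging policy gave to the action it actually took.
    target_policy(context, action): the target's probability of that action.
    """
    total = 0.0
    for context, action, reward, propensity in logs:
        weight = target_policy(context, action) / propensity
        total += weight * reward
    return total / len(logs)

# Logged under uniform random exploration over two actions (propensity 0.5).
logs = [
    ("u1", "a", 1, 0.5),
    ("u2", "b", 0, 0.5),
    ("u3", "a", 1, 0.5),
    ("u4", "b", 1, 0.5),
]

def always_a(context, action):
    """Hypothetical target policy that always picks action 'a'."""
    return 1.0 if action == "a" else 0.0

estimate = ips_estimate(logs, always_a)
```

&lt;p&gt;Rows where the target policy would have chosen differently get weight zero, and the matching rows are up-weighted to compensate. On real traffic the estimate gets noisy when propensities are small, which is one reason a uniformly random slice is attractive.&lt;/p&gt;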

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;Thompson Sampling has earned its long shelf life because it turns a famously hard tradeoff, when to explore versus when to exploit, into a single, principled operation:&lt;/p&gt;

&lt;p&gt;Maintain a belief.&lt;/p&gt;

&lt;p&gt;Sample from it.&lt;/p&gt;

&lt;p&gt;Act on the sample.&lt;/p&gt;

&lt;p&gt;Update the belief.&lt;/p&gt;
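&lt;p&gt;Those four steps fit in a few lines. This is a minimal Beta-Bernoulli sketch with two made-up arms and simulated rewards, not a production recommender:&lt;/p&gt;

```python
import random

def thompson_step(beliefs, pull, rng):
    """One round: sample each belief, act on the best sample, update."""
    samples = {arm: rng.betavariate(a, b) for arm, (a, b) in beliefs.items()}
    arm = max(samples, key=samples.get)           # act on the sample
    reward = pull(arm)                            # observe 0 or 1
    a, b = beliefs[arm]
    beliefs[arm] = (a + reward, b + 1 - reward)   # update the belief
    return arm

rng = random.Random(42)
true_rates = {"x": 0.05, "y": 0.5}                # hidden from the learner

def pull(arm):
    """Simulated environment: Bernoulli reward with the arm's true rate."""
    return 1 if true_rates[arm] > rng.random() else 0

beliefs = {arm: (1, 1) for arm in true_rates}     # maintain a belief
counts = {arm: 0 for arm in true_rates}
for _ in range(2000):
    counts[thompson_step(beliefs, pull, rng)] += 1
```

&lt;p&gt;There is no exploration rate to tune. Early on both arms get tried because both posteriors are wide. As evidence accumulates, the weaker arm is sampled less and less often.&lt;/p&gt;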

&lt;p&gt;The exploration here is not a separate mechanism duct-taped onto a model.&lt;/p&gt;

&lt;p&gt;It is a direct, automatic consequence of being honest about uncertainty.&lt;/p&gt;

&lt;p&gt;For recommender systems in particular, where new items appear constantly, tastes shift, and every wrong "exploit" choice is a real user's wasted moment, that kind of self-calibrating exploration is not just elegant.&lt;/p&gt;

&lt;p&gt;It is exactly what the problem calls for.&lt;/p&gt;




&lt;p&gt;Next in the series: how reinforcement learning changes the problem once actions start shaping future states.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>algorithms</category>
      <category>recommendation</category>
    </item>
    <item>
      <title>Building a Replayable Decision Kernel in Rust</title>
      <dc:creator>Emir Hüseyin İnci</dc:creator>
      <pubDate>Fri, 26 Jun 2026 22:28:01 +0000</pubDate>
      <link>https://dev.to/emirhuseyininci/building-a-replayable-decision-kernel-in-rust-nl4</link>
      <guid>https://dev.to/emirhuseyininci/building-a-replayable-decision-kernel-in-rust-nl4</guid>
      <description>&lt;p&gt;I built &lt;a href="https://github.com/emirhuseynrmx/calybris-core" rel="noopener noreferrer"&gt;Calybris Core&lt;/a&gt; because I kept running into the same uncomfortable question in decision-heavy systems:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;After the system says "yes", "no", or "use this instead", what exactly can we prove later?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not prove in the formal-methods sense. I mean the practical engineering version:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which policy was active?&lt;/li&gt;
&lt;li&gt;What was the input?&lt;/li&gt;
&lt;li&gt;What decision was returned?&lt;/li&gt;
&lt;li&gt;Can the decision be replayed?&lt;/li&gt;
&lt;li&gt;Did the budget/exposure invariant still hold?&lt;/li&gt;
&lt;li&gt;Can an audit log detect tampering?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Calybris Core is my attempt to make that boundary small, deterministic, and boring.&lt;/p&gt;

&lt;p&gt;It is not an LLM framework.&lt;br&gt;&lt;br&gt;
It is not an exchange.&lt;br&gt;&lt;br&gt;
It is not a strategy engine.&lt;br&gt;&lt;br&gt;
It is not a web service.&lt;/p&gt;

&lt;p&gt;It is a Rust core primitive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;candidate + policy constraints -&amp;gt; decision + digests + optional WAL + budget proof
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first reference examples are LLM routing and pre-trade admission guards, but the crate itself is domain-neutral.&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/emirhuseynrmx/calybris-core" rel="noopener noreferrer"&gt;github.com/emirhuseynrmx/calybris-core&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Crate: &lt;a href="https://crates.io/crates/calybris-core" rel="noopener noreferrer"&gt;crates.io/crates/calybris-core&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Docs: &lt;a href="https://docs.rs/calybris-core" rel="noopener noreferrer"&gt;docs.rs/calybris-core&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The boundary I wanted
&lt;/h2&gt;

&lt;p&gt;A lot of systems have a hidden decision point that looks simple from the outside:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;request comes in
system checks constraints
system returns allow / substitute / reject
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But when something goes wrong, that simple decision becomes hard to reconstruct.&lt;/p&gt;

&lt;p&gt;Maybe the model was changed.&lt;br&gt;&lt;br&gt;
Maybe a budget was exceeded.&lt;br&gt;&lt;br&gt;
Maybe a cheaper fallback was selected.&lt;br&gt;&lt;br&gt;
Maybe an operator needs to explain why an action was rejected.&lt;br&gt;&lt;br&gt;
Maybe an audit log was modified after the fact.&lt;/p&gt;

&lt;p&gt;The typical response is to add more logs.&lt;/p&gt;

&lt;p&gt;That helps, but logs alone are not the same as replayable decisions. I wanted the core decision result to carry enough structure that an independent verifier can ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If I replay the same input against the same policy snapshot, do I get the same decision?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That became the central design constraint.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Calybris decides
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;kernel&lt;/code&gt; module evaluates a &lt;code&gt;KernelInput&lt;/code&gt; against a validated &lt;code&gt;PolicySnapshot&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The result is a &lt;code&gt;KernelDecision&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ExecuteRequested&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Substitute&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Reject&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The decision contains the selected candidate, reason, estimated cost, utility, counterfactual fields, evaluated/eligible counts, and policy/catalog epochs.&lt;/p&gt;

&lt;p&gt;The important part is not the specific domain. The important part is that the decision is deterministic and replayable.&lt;/p&gt;

&lt;p&gt;In code, the shape is intentionally direct:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;calybris_core&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;kernel&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;calybris_core&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;verify&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;verify_decision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;VerifyResult&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="nf"&gt;.prescribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nd"&gt;assert_eq!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;verify_decision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nn"&gt;VerifyResult&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Valid&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The hot path deliberately avoids:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;floating point&lt;/li&gt;
&lt;li&gt;JSON&lt;/li&gt;
&lt;li&gt;clocks&lt;/li&gt;
&lt;li&gt;network calls&lt;/li&gt;
&lt;li&gt;hidden I/O&lt;/li&gt;
&lt;li&gt;unsafe Rust&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The crate root uses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#![forbid(unsafe_code)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is not magic, but it is a useful line in the sand.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I avoided floating point
&lt;/h2&gt;

&lt;p&gt;The reference use cases both involve costs, budgets, confidence, risk, and utility.&lt;/p&gt;

&lt;p&gt;It would be easy to reach for &lt;code&gt;f64&lt;/code&gt;. I avoided it.&lt;/p&gt;

&lt;p&gt;Calybris uses integer amounts and basis points. Financial amounts are fixed-point microcents. Quality, risk, confidence, and policy thresholds are represented as integer basis points.&lt;/p&gt;

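&lt;p&gt;Integer arithmetic is exact, so a threshold comparison has no rounding behavior to reproduce during replay. That keeps replay behavior less surprising.&lt;/p&gt;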

&lt;p&gt;For audit-oriented code, "close enough" is a dangerous phrase. If a decision depends on a threshold, I want the arithmetic to be explicit and repeatable.&lt;/p&gt;
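&lt;p&gt;To make the point concrete, here is a language-neutral sketch of basis-point arithmetic, written in Python for brevity. The function names and the floor-rounding rule are assumptions for illustration, not the crate's API:&lt;/p&gt;

```python
BPS = 10_000  # 10_000 basis points = 100%

def apply_bps(amount, rate_bps):
    """Integer-only scaling; floor division makes the rounding rule explicit."""
    return amount * rate_bps // BPS

def meets_threshold(quality_bps, min_quality_bps):
    """Exact comparison: no epsilon, no rounding mode to reproduce."""
    return quality_bps >= min_quality_bps

# Float contrast: 0.1 + 0.2 is not exactly 0.3, so a float threshold can flip.
float_ok = (0.1 + 0.2) == 0.3          # False
int_ok = (1_000 + 2_000) == 3_000      # True: 10% + 20% in basis points

fee = apply_bps(1_250_000, 25)         # 0.25% of an integer amount
```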

&lt;h2&gt;
  
  
  Canonical digests, not "whatever serde emitted"
&lt;/h2&gt;

&lt;p&gt;Replay alone is not enough. You also need stable fingerprints.&lt;/p&gt;

&lt;p&gt;Calybris computes canonical SHA-256 digests for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;policy snapshots&lt;/li&gt;
&lt;li&gt;decision inputs&lt;/li&gt;
&lt;li&gt;decision outputs&lt;/li&gt;
&lt;li&gt;budget ledger snapshots&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The digest layouts are version-tagged byte layouts, not hashes of arbitrary JSON.&lt;/p&gt;

&lt;p&gt;That distinction matters. JSON is great for transport and inspection, but field order and serialization choices are not a good audit boundary.&lt;/p&gt;

&lt;p&gt;The digest tags are explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;calypol1
calyinp1
calydcn1
calyldg1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Policy models are sorted before hashing. Ledger tenants are sorted before hashing. A logically equivalent snapshot should not get a different fingerprint because a map happened to iterate differently.&lt;/p&gt;
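&lt;p&gt;The idea can be sketched in a few lines. The byte layout below (tag, count, length-prefixed names, big-endian integers) is invented for illustration and is not the crate's actual layout. Only the tag string comes from the list above:&lt;/p&gt;

```python
import hashlib
import struct

def canonical_digest(tag, models):
    """SHA-256 over a tag plus a sorted, fixed-width encoding of the models.

    tag: version tag bytes, e.g. b"calypol1".
    models: dict of model name to a non-negative integer cost.
    """
    h = hashlib.sha256()
    h.update(tag)
    h.update(struct.pack(">I", len(models)))        # entry count, big-endian u32
    for name in sorted(models):                     # iteration order cannot leak in
        raw = name.encode("utf-8")
        h.update(struct.pack(">I", len(raw)))       # length prefix avoids ambiguity
        h.update(raw)
        h.update(struct.pack(">Q", models[name]))   # u64 amount
    return h.hexdigest()

a = canonical_digest(b"calypol1", {"small": 120, "large": 900})
b = canonical_digest(b"calypol1", {"large": 900, "small": 120})
c = canonical_digest(b"calypol1", {"large": 901, "small": 120})
```

&lt;p&gt;Here &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; are equal because insertion order is erased by sorting, while &lt;code&gt;c&lt;/code&gt; differs because one value changed.&lt;/p&gt;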

&lt;h2&gt;
  
  
  The audit bundle
&lt;/h2&gt;

&lt;p&gt;A decision can be wrapped in an audit bundle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;policy digest
input digest
decision digest
replay_valid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The verifier checks the structural decision, not just a string.&lt;/p&gt;

&lt;p&gt;If you change the input, replay fails.&lt;br&gt;&lt;br&gt;
If you change the decision, replay fails.&lt;br&gt;&lt;br&gt;
If you use the wrong policy, replay fails.&lt;br&gt;&lt;br&gt;
If the digest fields do not match canonical recomputation, replay fails.&lt;/p&gt;

&lt;p&gt;That is the reason I have been using the phrase "proof-carrying decision core", although I am still looking for feedback on whether that wording is too strong.&lt;/p&gt;

&lt;p&gt;To be clear: this is not a formal proof system. It is a replayable evidence bundle.&lt;/p&gt;
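&lt;p&gt;A toy version of that check, in Python for brevity. The decision rule, the digest encoding, and every name here are hypothetical stand-ins. The sketch only shows the shape of "recompute everything and compare":&lt;/p&gt;

```python
import hashlib

def digest(tag, value):
    """Toy fingerprint: SHA-256 over a tag plus repr() of a sorted structure."""
    return hashlib.sha256(tag + repr(value).encode("utf-8")).hexdigest()

def decide(policy, request):
    """Stand-in deterministic decision function."""
    return "execute" if policy["max_cost"] >= request["cost"] else "reject"

def make_bundle(policy, request, decision):
    return {
        "policy": digest(b"pol1", sorted(policy.items())),
        "input": digest(b"inp1", sorted(request.items())),
        "decision": digest(b"dcn1", decision),
    }

def replay_valid(policy, request, decision, bundle):
    """Recompute the decision and every digest; any mismatch fails."""
    if decide(policy, request) != decision:
        return False
    return make_bundle(policy, request, decision) == bundle

policy = {"max_cost": 500}
request = {"cost": 120}
decision = decide(policy, request)
bundle = make_bundle(policy, request, decision)
```

&lt;p&gt;With these definitions, &lt;code&gt;replay_valid&lt;/code&gt; returns &lt;code&gt;True&lt;/code&gt; for the original tuple and &lt;code&gt;False&lt;/code&gt; if the input, the decision, the policy, or any stored digest is altered.&lt;/p&gt;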
&lt;h2&gt;
  
  
  Optional WAL
&lt;/h2&gt;

&lt;p&gt;The crate also includes an optional write-ahead log.&lt;/p&gt;

&lt;p&gt;Each WAL entry contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sequence number&lt;/li&gt;
&lt;li&gt;previous hash&lt;/li&gt;
&lt;li&gt;entry hash&lt;/li&gt;
&lt;li&gt;record data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The unkeyed mode is useful for corruption detection and basic tamper evidence. The keyed mode uses HMAC-SHA256, which is the mode you would use if an attacker might rewrite entries and recompute hashes.&lt;/p&gt;
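&lt;p&gt;The difference between the two modes is easier to see in a toy hash chain. This is a Python sketch with an invented entry encoding, not the crate's WAL format. The pipe-delimited message is a simplification that a real layout would replace with length-prefixed fields:&lt;/p&gt;

```python
import hashlib
import hmac

GENESIS = "0" * 64

def entry_hash(seq, prev_hash, data, key=None):
    """Hash of (seq, prev_hash, data); HMAC-SHA256 if a key is given."""
    message = f"{seq}|{prev_hash}|{data}".encode("utf-8")   # toy encoding
    if key is None:
        return hashlib.sha256(message).hexdigest()
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def append(log, data, key=None):
    prev_hash = log[-1]["hash"] if log else GENESIS
    seq = len(log)
    log.append({"seq": seq, "prev": prev_hash, "data": data,
                "hash": entry_hash(seq, prev_hash, data, key)})

def verify(log, key=None):
    """Fail closed: any broken link or mismatched hash rejects the whole log."""
    prev_hash = GENESIS
    for seq, entry in enumerate(log):
        if entry["seq"] != seq or entry["prev"] != prev_hash:
            return False
        expected = entry_hash(seq, prev_hash, entry["data"], key)
        if not hmac.compare_digest(entry["hash"], expected):
            return False
        prev_hash = entry["hash"]
    return True
```

&lt;p&gt;In unkeyed mode, anyone who can rewrite the file can also rebuild a consistent chain, so it mainly catches accidental corruption and careless edits. In keyed mode, a rebuilt chain fails &lt;code&gt;verify&lt;/code&gt; unless the attacker also holds the key.&lt;/p&gt;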

&lt;p&gt;The audited WAL path looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prescribe
  -&amp;gt; audit_bundle
  -&amp;gt; append_audited
  -&amp;gt; replay_audited_wal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replay fails closed if the chain is broken or if any policy/input/decision digest does not match.&lt;/p&gt;

&lt;p&gt;I intentionally did not put secret storage, key rotation, file locking, or multi-process coordination inside this crate. Those are deployment concerns and should be owned by the embedding system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Budget conservation
&lt;/h2&gt;

&lt;p&gt;The budget engine is another small core primitive.&lt;/p&gt;

&lt;p&gt;The invariant is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;remaining + reserved + committed_lifetime == initial
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A reservation removes spendable balance.&lt;br&gt;&lt;br&gt;
A commit turns a reservation into lifetime committed spend.&lt;br&gt;&lt;br&gt;
A release returns the held amount to the spendable balance.&lt;br&gt;&lt;br&gt;
A top-up extends initial and remaining budget.&lt;/p&gt;

&lt;p&gt;The budget engine uses compare-and-swap (CAS) for the hot balance updates and mutex-protected metadata maps for the surrounding state.&lt;/p&gt;

&lt;p&gt;The invariant is checked on frozen snapshots. Multi-step operations may have transient internal states, so the docs are careful not to claim every mid-operation snapshot is linearizable.&lt;/p&gt;

&lt;p&gt;That distinction matters. Audit docs should say what is guaranteed, not what sounds good.&lt;/p&gt;
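&lt;p&gt;The invariant is simple enough to model in a few lines. This single-threaded Python sketch ignores the CAS and locking details entirely, and the class and method names are hypothetical:&lt;/p&gt;

```python
class Budget:
    """Single-threaded toy ledger; all amounts are non-negative integers."""

    def __init__(self, initial):
        self.initial = initial
        self.remaining = initial
        self.reserved = 0
        self.committed_lifetime = 0

    def reserve(self, amount):
        if amount > self.remaining:
            return False                  # insufficient spendable balance
        self.remaining -= amount
        self.reserved += amount
        return True

    def commit(self, amount):
        assert self.reserved >= amount
        self.reserved -= amount
        self.committed_lifetime += amount

    def release(self, amount):
        assert self.reserved >= amount
        self.reserved -= amount
        self.remaining += amount

    def top_up(self, amount):
        self.initial += amount
        self.remaining += amount

    def conserved(self):
        total = self.remaining + self.reserved + self.committed_lifetime
        return total == self.initial

budget = Budget(1_000)
budget.reserve(400)     # remaining 600, reserved 400
budget.commit(250)      # reserved 150, committed 250
budget.release(150)     # reserved 0, remaining 750
budget.top_up(500)      # initial 1500, remaining 1250
```

&lt;p&gt;Every operation moves an amount between two terms of the equation or adds the same amount to both sides, so &lt;code&gt;conserved()&lt;/code&gt; holds after each completed step.&lt;/p&gt;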
&lt;h2&gt;
  
  
  Why not a general rules engine?
&lt;/h2&gt;

&lt;p&gt;Calybris is narrower than a rules engine.&lt;/p&gt;

&lt;p&gt;It does not try to provide a policy language. It does not parse arbitrary user rules. It does not evaluate scripts.&lt;/p&gt;

&lt;p&gt;The current kernel is closer to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rank candidates under hard constraints
return the best positive-utility candidate
otherwise reject
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
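&lt;p&gt;As a rough model of that shape, here is a Python sketch with made-up fields and a made-up tie-break rule. It leaves out the substitute outcome and is not the crate's kernel:&lt;/p&gt;

```python
def prescribe(candidates, max_cost, min_quality_bps):
    """Return ("execute", best) or ("reject", None).

    candidates: list of dicts with name, integer cost, quality_bps, utility.
    """
    eligible = [
        c for c in candidates
        if max_cost >= c["cost"] and c["quality_bps"] >= min_quality_bps
    ]
    positive = [c for c in eligible if c["utility"] > 0]
    if not positive:
        return ("reject", None)
    # Tie-break on name so equal utilities still give a deterministic result.
    best = min(positive, key=lambda c: (-c["utility"], c["name"]))
    return ("execute", best)

candidates = [
    {"name": "large", "cost": 900, "quality_bps": 9500, "utility": 40},
    {"name": "small", "cost": 120, "quality_bps": 8000, "utility": 25},
    {"name": "tiny", "cost": 20, "quality_bps": 5000, "utility": 30},
]

outcome, chosen = prescribe(candidates, max_cost=500, min_quality_bps=7000)
```

&lt;p&gt;Here &lt;code&gt;large&lt;/code&gt; fails the cost constraint and &lt;code&gt;tiny&lt;/code&gt; fails the quality constraint, so &lt;code&gt;small&lt;/code&gt; is selected even though it has the lowest utility of the three.&lt;/p&gt;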



&lt;p&gt;That narrowness is intentional. I wanted the core to be small enough to reason about, test, replay, and document.&lt;/p&gt;

&lt;p&gt;A larger product can put a policy language above this layer. Calybris is the deterministic bottom layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing the uncomfortable parts
&lt;/h2&gt;

&lt;p&gt;The project has tests for the parts I would worry about first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;optimized kernel output vs reference implementation&lt;/li&gt;
&lt;li&gt;digest stability and sensitivity&lt;/li&gt;
&lt;li&gt;replay mismatch detection&lt;/li&gt;
&lt;li&gt;WAL tampering, duplicate sequence, truncation, malformed JSON&lt;/li&gt;
&lt;li&gt;keyed WAL verification&lt;/li&gt;
&lt;li&gt;budget conservation under mixed operations&lt;/li&gt;
&lt;li&gt;overflow paths&lt;/li&gt;
&lt;li&gt;concurrent reserve/commit/release behavior&lt;/li&gt;
&lt;li&gt;Loom interleavings&lt;/li&gt;
&lt;li&gt;Miri on the library and audit pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The CI runs MSRV and stable jobs, clippy with warnings denied, docs, examples, proptest-heavy jobs, Loom, Miri, cargo-audit, and cargo-deny.&lt;/p&gt;

&lt;p&gt;That does not make it "audited". It does make it less hand-wavy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it locally
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/emirhuseynrmx/calybris-core
&lt;span class="nb"&gt;cd &lt;/span&gt;calybris-core
cargo run &lt;span class="nt"&gt;--example&lt;/span&gt; quickstart
cargo run &lt;span class="nt"&gt;--example&lt;/span&gt; llm_routing
cargo run &lt;span class="nt"&gt;--example&lt;/span&gt; replay_audit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use it as a dependency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo add calybris-core
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kernel-only, without WAL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo add calybris-core &lt;span class="nt"&gt;--no-default-features&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Current status
&lt;/h2&gt;

&lt;p&gt;The current release is &lt;code&gt;v0.3.10&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Release notes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/emirhuseynrmx/calybris-core/releases/tag/v0.3.10" rel="noopener noreferrer"&gt;github.com/emirhuseynrmx/calybris-core/releases/tag/v0.3.10&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The crate is Apache-2.0 and usable, but I would not describe it as a complete production platform.&lt;/p&gt;

&lt;p&gt;It is a core primitive. If you embed it in a production system, you still own:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;key management&lt;/li&gt;
&lt;li&gt;WAL storage policy&lt;/li&gt;
&lt;li&gt;deployment controls&lt;/li&gt;
&lt;li&gt;external audit&lt;/li&gt;
&lt;li&gt;monitoring&lt;/li&gt;
&lt;li&gt;operational runbooks&lt;/li&gt;
&lt;li&gt;integration-level failure handling&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Feedback I want
&lt;/h2&gt;

&lt;p&gt;I would especially like feedback from Rust, security, infra, and systems people on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the API boundary clear?&lt;/li&gt;
&lt;li&gt;Is "proof-carrying decision core" misleading?&lt;/li&gt;
&lt;li&gt;Should this remain a narrow primitive, or grow a small policy language?&lt;/li&gt;
&lt;li&gt;Are the WAL responsibilities split correctly between crate and caller?&lt;/li&gt;
&lt;li&gt;What replay/audit guarantees would you expect before trusting something like this?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The repo is here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/emirhuseynrmx/calybris-core" rel="noopener noreferrer"&gt;github.com/emirhuseynrmx/calybris-core&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>opensource</category>
      <category>security</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
