DEV Community: Emir Hüseyin İnci

The Moment AI Stops Predicting and Starts Choosing

Emir Hüseyin İnci — Sat, 27 Jun 2026 03:52:14 +0000

Most machine learning learns from labels. Reinforcement learning learns from consequences — and that one-word difference breaks everything you thought you knew about how AI works.

Let me show you the exact moment AI gets dangerous.

Not dangerous in the sci-fi sense.

Dangerous in the “quietly optimizes the wrong thing for six months until your product is broken and you don’t know why” sense.

It happens when a system stops answering questions and starts making decisions.

That’s reinforcement learning. And if you use AI products, build them, or invest in companies that do, you need to understand this.

The difference no one explains clearly

Most AI systems you’ve used were probably trained the same basic way:

Show it a million examples. Tell it the right answer each time. Let it adjust until it gets good at guessing the right answer.

Image → label. Email → spam/not spam. Transaction → fraud/not fraud.

This is supervised learning. It’s powerful. It’s what makes your spam filter work and your autocomplete finish your sentences.

But here’s the thing:

The world doesn’t come with labels.

Your next career move doesn’t have a correct answer written on the back.

Your company’s pricing strategy doesn’t come with a ground truth.

And neither does a chess game, a trading position, or a conversation with a user.

For these problems, you don’t need a system that predicts the answer.

You need a system that takes an action and then lives with what happens next.

That’s the shift.

That’s everything.

The comfortable loop vs. the loop that changes the world

Before we get to RL, a quick detour to a simpler problem.

A bandit algorithm, like the Thompson Sampling I wrote about in the first piece in this series, makes one repeated decision:

Which option should I pick right now?

Show movie.

User clicks or doesn’t.

Update belief.

Repeat.

Crucially: each round resets. The world tomorrow looks basically like the world today.

Reinforcement learning is what happens when you take that convenience away.

Now the loop is:

Observe the current state of the world
Choose an action
The world changes because of your action
Receive a reward, or don’t
Find yourself in a new, different world
Choose again
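The loop above can be sketched in a few lines of Rust. This is a toy under stated assumptions: the trust-based environment, its reward numbers, and the names Env, Action, and run_episode are hypothetical illustrations, not a model of any real product.

```rust
/// Hypothetical actions for a toy recommender.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Action {
    Clickbait,
    Quality,
}

/// Hypothetical environment: the whole state is a user trust level in 0..=10.
struct Env {
    trust: i32,
}

impl Env {
    /// Applies the action, returns the immediate reward, and moves to a new state.
    fn step(&mut self, action: Action) -> i32 {
        match action {
            // Pays well right now while any trust remains, but erodes it.
            Action::Clickbait => {
                let reward = if self.trust > 0 { 3 } else { 0 };
                self.trust = (self.trust - 2).max(0);
                reward
            }
            // Pays in proportion to accumulated trust and rebuilds it.
            Action::Quality => {
                let reward = self.trust / 3;
                self.trust = (self.trust + 1).min(10);
                reward
            }
        }
    }
}

/// Runs the observe -> choose -> reward -> new state loop for a fixed horizon.
fn run_episode(policy: fn(i32) -> Action, start_trust: i32, steps: usize) -> i32 {
    let mut env = Env { trust: start_trust };
    let mut total = 0;
    for _ in 0..steps {
        let state = env.trust; // observe the current state
        let action = policy(state); // choose an action
        total += env.step(action); // the world changes and a reward arrives
    }
    total
}

fn always_clickbait(_trust: i32) -> Action {
    Action::Clickbait
}

fn always_quality(_trust: i32) -> Action {
    Action::Quality
}
```

Starting from a trust level of 6 and running 20 steps, always_clickbait earns 3 on the first step but only 9 in total, while always_quality earns 2 on the first step and 57 in total. The first step alone would rank the two policies the wrong way round.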

The bandit recommender asks:

Which movie should I show this user right now?

The RL recommender asks:

What should I show today, knowing it will change what this user wants to watch next month?

One extra clause. Completely different problem.

Bandits choose well now. Reinforcement learning chooses well over time.

Six words that explain the whole field

The jargon in RL sounds intimidating.

It isn’t.

Here’s the whole thing:

Agent — the system making decisions.

Environment — the world that reacts.

State — what the agent can currently see.

Action — what the agent chooses to do.

Reward — the signal that comes back after the action.

Policy — the rule that maps “what I see” to “what I do.”

Everything in the field — Q-learning, policy gradients, actor-critic, PPO, RLHF — is trying to solve one problem:

How do you maximize reward not just now, but across the whole chain of decisions that follows?

That’s it.

That’s the field.

The reason it’s hard is that the chain is long, the future is uncertain, and actions today shape what’s even possible tomorrow.
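One standard way to write that objective down is the discounted return. The symbols here are notation assumed for this sketch and do not appear in the text above: r_t is the reward at step t, gamma is a discount factor that weights later rewards less, and pi is the policy.

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1}, \qquad 0 \le \gamma < 1

\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\left[ G_0 \right]
```

Setting gamma to 0 keeps only the immediate reward, which is the bandit view. Pushing gamma toward 1 makes the agent care about the whole chain.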

The part that makes RL genuinely hard

Here’s where it gets uncomfortable.

In a bandit problem, feedback is fast.

Show content, user clicks, update.

In RL, the consequence of an action might show up weeks later.

By then, the system has made hundreds of other decisions.

So which one caused the outcome?

This is called the credit assignment problem, and it’s not a small technical footnote.

It’s one of the core reasons RL is difficult to get right.

Think about what this means in practice:

The click-maximizing recommendation trains users to expect worse and worse content, but the engagement numbers look great for another 18 months.
The profitable trade quietly accumulates exposure to a tail risk that only shows up under stress, but the P&L looks clean until it doesn’t.
The cheapest LLM routing saves $0.003 per request until retries, escalations, and churn quietly eat the margin you thought you were protecting.

In every case:

The immediate signal says yes.

The real outcome says wait.

The system has to learn from both.

That’s what makes this hard.

Three industries being quietly reshaped by this problem right now

Streaming and content recommendations

Optimize for clicks → system learns clickbait.

Not because anyone designed it to.

Because clicks were the proxy, and the proxy was optimized.

The metric improves for a year.

Then user satisfaction surveys start dropping.

Then retention curves start bending.

Then someone in the boardroom asks why the numbers that matter are going in the wrong direction even though the numbers being tracked are fine.

The RL framing forces the harder question earlier:

What recommendation policy increases long-term trust, not just this session’s engagement?

Trading

A position that looks profitable today can be a bad decision if it shifts your risk profile in ways that only matter under stress.

In RL terms, the question isn’t buy-or-sell.

It’s position management over time, where each action changes what’s available and what’s dangerous next.

LLM routing

This one is underappreciated.

Routing every request to the cheapest capable model looks like free money.

Until quality starts quietly degrading at the margin.

Until the edge cases that fall through the cracks start accumulating.

Until users who needed a good answer and got a mediocre one just stop asking.

That cost never shows up in the routing dashboard.

But it’s there.

This is a pure RL problem:

The reward signal (cost per query) and the real objective (user outcome) are separated in time.

The uncomfortable truth about reward functions

Here’s the thing nobody says out loud early enough:

Reinforcement learning doesn’t learn what you want. It learns what you reward.

And those two things can drift apart faster than you’d expect.

This doesn’t require the system to be malicious.

It doesn’t require it to be clever.

It only requires the reward function to be an imperfect proxy for the thing that actually mattered.

And in every real product, the reward function is a proxy.

Always.

Because the thing that actually matters — user trust, long-term retention, sustained business value — can’t be measured in real time.

This is why RL has a strange dual nature:

On one hand, it can discover strategies that humans would never write by hand.

AlphaGo did not just imitate human play. Through self-play it found moves, such as move 37 in its second game against Lee Sedol, that professional players had not considered.

On the other hand, it can exploit exactly the blind spots humans encoded into the reward function without realizing it.

Both things are true at the same time.

Neither cancels the other out.

The gap nobody budgets for

Here’s the thing that surprises people building RL systems for the first time:

A policy can be statistically optimal under the reward function and still be unacceptable in production.

Statistically optimal means:

Given the data, given the reward signal, this is the best policy we found.

Operationally acceptable means:

This is something we’d actually defend when it scales to millions of users.

Those are not the same thing.

A production RL system needs:

Constraints the policy cannot optimize around
Monitoring that catches drift before it becomes a crisis
Audit trails so decisions can be replayed and explained
Hard limits on what the agent is allowed to do while it’s still learning
A clear answer to “what happens when the reward function diverges from the goal?”

Not because the algorithm is broken.

Because the reward is incomplete.

It always is.

The guardrails aren’t an afterthought.

They’re the product.

Why this matters beyond the technical teams

If you manage products that use AI, the question you should be asking isn’t:

Did the metric improve?

It’s:

Are the decisions the system is learning ones I would defend when they scale?

Because at scale, the compounding effects of a slightly wrong reward function aren’t small.

They’re the story.

They’re the reason the product that looked great in the dashboard becomes the product that’s slowly losing the users who mattered.

Reinforcement learning is powerful because it matches how real decisions work.

Real decisions happen in sequence.

They have delayed consequences.

Each one changes the next.

The uncertainty doesn’t resolve immediately.

That’s exactly the world RL is built for.

But it also means the question is never just whether the algorithm worked.

The question is whether you taught the machine to make decisions you can actually trust.

Thompson Sampling: How Recommender Systems Learn to Bet on What You'll Like

Emir Hüseyin İnci — Sat, 27 Jun 2026 03:46:21 +0000

A Bayesian approach to the explore-exploit tradeoff, explained through the lens of personalized recommendations.

Every time a streaming service, news app, or e-commerce site decides what to put in front of you, it's making a bet.

Show you the item it's most confident you'll like, and you get a probably-good recommendation, but the system never finds out whether something else might have been even better.

Show you something new and untested, and you risk wasting a valuable slot on a flop, but you might also discover the next big hit for that user segment.

This is the explore-exploit tradeoff, and it sits underneath nearly every recommendation engine, ad-ranking system, and content feed running today.

Thompson Sampling is one of the oldest, and in practice one of the most effective, ways to solve it. It traces back to a 1933 paper by William R. Thompson, decades before "recommender system" was even a phrase, and it remains a standard tool for anyone building ranking or personalization systems.

Recommendation Slots Are Bandits

Strip away the UI, and a recommendation slot is a classic multi-armed bandit problem.

Each candidate item (a movie, a product, an article) is an "arm."

Showing that item to a user is "pulling" it.

The reward is binary feedback:

Did they click, watch, or buy?

For each item i, there's some true, unknown probability theta_i that a user will engage with it.

The system's job is to maximize cumulative engagement over time, which means learning the theta_i values while it is still using them to make live decisions.

That's what makes this harder than ordinary supervised learning:

There is no separate training phase.

Every recommendation is simultaneously a data point and a decision with real consequences.

Why Greedy Ranking Fails

The simplest strategy is greedy:

Track the observed click rate for every item, and always recommend whichever one currently looks best.

This fails for an intuitive reason.

Suppose a genuinely great item gets unlucky on its first few impressions. Three users see it and none click, purely by chance.

Its observed rate collapses, the greedy algorithm buries it, and it never gets shown again to correct the mistake.

Early noise becomes a permanent verdict.

A common fix is forced exploration: show a random item some fixed percentage of the time, known as epsilon-greedy.

But that explores indiscriminately.

It spends just as much effort re-testing items the system already has plenty of evidence about as it does on the ones still wrapped in uncertainty.

What we actually want is exploration that's proportional to how unsure the system still is.

Thompson Sampling Changes the Frame

This is where Thompson Sampling changes the frame entirely.

Instead of tracking a single click-rate estimate per item, it tracks a full probability distribution representing the system's current belief about theta_i: how plausible every possible click-rate value is, given the data seen so far.

Early on, with little data, that belief is wide and flat. Almost any click rate seems plausible.

As impressions accumulate, the belief narrows around the true value.

Crucially, the shape of this belief, not just its average, is exactly the information needed to explore intelligently:

A wide belief means:

We genuinely don't know yet.

A narrow belief means:

We're fairly confident.

Figure: With little data, the system's belief is wide. As evidence accumulates, the distribution narrows around the true click rate.

The Beta-Bernoulli Setup

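For binary outcomes like clicks, the natural choice of belief distribution is the Beta distribution, thanks to a convenient property called conjugacy: a Beta prior updated with Bernoulli observations yields another Beta distribution, so the belief never leaves the family.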

We start with a prior belief for each item:

theta_i ~ Beta(alpha, beta)

A natural starting point is:

alpha = 1
beta = 1

That is the uniform distribution, meaning:

Any click rate from 0 to 1 is equally plausible.

Each interaction is modeled as a Bernoulli trial:

A click is a success: r = 1
No click is a failure: r = 0

The update rule after a single observation is almost embarrassingly simple:

if the user clicked:
    alpha <- alpha + 1

if they did not click:
    beta <- beta + 1

That's it.

No gradient steps.

No retraining.

No matrix inversion.

After n impressions with k clicks, the posterior is exactly:

Beta(alpha + k, beta + n - k)

The mean of this distribution, written in terms of the prior parameters, is:

(alpha + k) / (alpha + beta + n)

That is the system's best point estimate of the click rate.

Its variance shrinks roughly in proportion to 1 / n, the formal version of:

More data, narrower belief.
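The update rule, posterior mean, and variance can be sketched as a small Rust type. The name BetaBelief and its methods are hypothetical, chosen for this illustration. The variance formula is the standard one for a Beta distribution.

```rust
/// Beta(alpha, beta) belief about one item's click rate.
#[derive(Clone, Copy, Debug, PartialEq)]
struct BetaBelief {
    alpha: f64,
    beta: f64,
}

impl BetaBelief {
    /// Beta(1, 1): every click rate from 0 to 1 is equally plausible.
    fn uniform() -> Self {
        Self { alpha: 1.0, beta: 1.0 }
    }

    /// Conjugate update: a click increments alpha, a non-click increments beta.
    fn update(&mut self, clicked: bool) {
        if clicked {
            self.alpha += 1.0;
        } else {
            self.beta += 1.0;
        }
    }

    /// Point estimate of the click rate under the current belief.
    fn mean(&self) -> f64 {
        self.alpha / (self.alpha + self.beta)
    }

    /// Variance of the belief; it shrinks as alpha + beta grows.
    fn variance(&self) -> f64 {
        let s = self.alpha + self.beta;
        self.alpha * self.beta / (s * s * (s + 1.0))
    }
}
```

Starting from the uniform prior, 3 clicks in 10 impressions give Beta(4, 8). The mean is 4 / 12, and the variance drops from 1 / 12 to roughly 0.017.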

The Thompson Sampling Loop

With a belief distribution maintained per item, the full Thompson Sampling procedure is just three steps, repeated every time a recommendation is needed:

Sample one value theta_hat_i from each candidate item's current Beta(alpha_i, beta_i) distribution.
Recommend the item with the highest sampled value.
Observe the outcome and update that item's alpha or beta accordingly.
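The three steps can be sketched in Rust using only the standard library. Several pieces are assumptions made for this illustration: the tiny xorshift64* generator stands in for a real random-number crate, the Beta sampler only handles positive integer parameters (enough for counts that start at Beta(1, 1)), and simulate with its true_rates argument is a hypothetical test harness, not part of any production system.

```rust
/// Minimal xorshift64* generator so the sketch needs no external crates.
/// The seed must be nonzero.
struct Rng(u64);

impl Rng {
    /// Uniform value in (0, 1].
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 >> 12;
        self.0 ^= self.0 << 25;
        self.0 ^= self.0 >> 27;
        let x = self.0.wrapping_mul(0x2545_F491_4F6C_DD1D);
        ((x >> 11) + 1) as f64 / (1u64 << 53) as f64
    }

    /// Gamma(k, 1) for integer k, as a sum of k exponential draws.
    fn gamma_int(&mut self, k: u32) -> f64 {
        (0..k).map(|_| -self.next_f64().ln()).sum::<f64>()
    }

    /// Beta(a, b) for positive integers, via X / (X + Y) with
    /// X ~ Gamma(a, 1) and Y ~ Gamma(b, 1).
    fn beta(&mut self, a: u32, b: u32) -> f64 {
        let x = self.gamma_int(a);
        let y = self.gamma_int(b);
        x / (x + y)
    }
}

/// Per-item belief stored as integer counts, starting at Beta(1, 1).
#[derive(Clone, Copy, Debug)]
struct Arm {
    alpha: u32,
    beta: u32,
}

/// Steps 1 and 2: sample once from every belief and pick the highest sample.
fn choose(arms: &[Arm], rng: &mut Rng) -> usize {
    let mut best = 0;
    let mut best_sample = -1.0;
    for (i, arm) in arms.iter().enumerate() {
        let sample = rng.beta(arm.alpha, arm.beta);
        if sample > best_sample {
            best = i;
            best_sample = sample;
        }
    }
    best
}

/// Hypothetical harness: simulated users click with the given true rates.
fn simulate(true_rates: &[f64], rounds: usize, seed: u64) -> Vec<Arm> {
    let mut rng = Rng(seed);
    let mut arms = vec![Arm { alpha: 1, beta: 1 }; true_rates.len()];
    for _ in 0..rounds {
        let i = choose(&arms, &mut rng);
        let clicked = rng.next_f64() < true_rates[i];
        // Step 3: update only the item that was actually shown.
        if clicked {
            arms[i].alpha += 1;
        } else {
            arms[i].beta += 1;
        }
    }
    arms
}

/// Number of times an item was shown, recovered from its counts.
fn pulls(arm: &Arm) -> u32 {
    arm.alpha + arm.beta - 2
}
```

Calling simulate(&[0.05, 0.10, 0.60], 2000, 42) and inspecting pulls for each arm should show the third item taking the large majority of impressions, with the other two still sampled occasionally early on. Summing exponentials costs time proportional to alpha + beta, so a real system would use a proper Gamma or Beta sampler instead.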

Notice what's doing the actual work:

The randomness in step 1.

Nothing forces exploration explicitly. There is no epsilon, no separate exploration budget.

Exploration emerges naturally from the fact that items the system is still uncertain about have wide distributions, and a wide distribution occasionally produces a high sample purely by chance.

Figure: A new item may have a lower average estimate but a wider belief distribution. Sometimes it samples high enough to win the recommendation slot.

The chart above shows exactly this in a single snapshot.

The "New arrival" item has the lowest average click rate of the three, but because so little is known about it, its belief is wide.

On this particular draw, its sampled value comes out ahead of both better-established items.

It wins the recommendation slot this round, the system learns a little more about it, and its distribution narrows next time, win or lose.

An item with a long, strong track record, by contrast, has a narrow distribution clustered tightly around its true rate.

It keeps winning consistently, but it can occasionally lose a slot to a promising newcomer, exactly as it should.

That is the elegant part:

A single sampling step automatically interpolates between exploring and exploiting, with no tuning knob required, and it shifts toward exploitation on its own as confidence grows.

Making Thompson Sampling Practical

Real recommender systems don't compare three items.

They compare thousands or millions, and new items arrive constantly.

A few extensions make Thompson Sampling practical at that scale.

Cold Start

A brand-new item starts at:

Beta(1, 1)

That means maximal uncertainty.

It has a real, non-trivial chance of sampling high enough to get shown early on.

This is a feature, not a bug:

New content gets a fair shot at exposure without needing a separate "new item boost" rule bolted on.

Contextual Thompson Sampling

Treating every item independently ignores everything known about the user:

History, device, time of day, location, session context, and so on.

In practice, recommendation systems typically use a contextual variant.

Instead of a single theta per item, the model maintains a distribution over the parameters of a model, commonly a Bayesian linear or logistic regression, that predicts click probability from user and item features together.

Sampling now means:

Draw one set of model parameters.
Score all candidates under that sampled model.
Recommend the top one.
Observe the result and update the posterior.

The mechanics are unchanged:

Sample.

Act.

Update.

The model is just richer than a single number per item.

Non-Stationarity

Tastes drift.

Items go stale.

A pure Beta-Bernoulli model with no decay eventually becomes overconfident about old data that's no longer representative.

A common fix is to mildly discount alpha and beta over time, multiplying both by a factor slightly below 1 before each update.

That keeps the belief from narrowing all the way to zero uncertainty and lets the system adapt if the true rate shifts later.
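A minimal sketch of that discounting, assuming a single fixed factor applied on every update. The name DecayingBelief and the value 0.99 used in the note below are illustrative choices, not recommendations.

```rust
/// Beta belief whose evidence fades geometrically over time.
#[derive(Clone, Copy, Debug)]
struct DecayingBelief {
    alpha: f64,
    beta: f64,
}

impl DecayingBelief {
    /// Shrinks both counts by `discount` (slightly below 1), then adds the new observation.
    fn update(&mut self, clicked: bool, discount: f64) {
        self.alpha *= discount;
        self.beta *= discount;
        if clicked {
            self.alpha += 1.0;
        } else {
            self.beta += 1.0;
        }
    }

    /// Point estimate of the current click rate.
    fn mean(&self) -> f64 {
        self.alpha / (self.alpha + self.beta)
    }

    /// alpha + beta, which acts as the amount of evidence still remembered.
    fn effective_sample_size(&self) -> f64 {
        self.alpha + self.beta
    }
}
```

With a discount d, alpha + beta converges to 1 / (1 - d), so d = 0.99 caps the remembered evidence at about 100 observations. If an item is clicked 1000 times in a row and then ignored 1000 times in a row, the discounted mean falls close to 0, whereas an undiscounted belief would still sit near 0.5.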

Production Questions

Before reaching for Thompson Sampling in production, it is worth being deliberate about a couple of things.

What Counts as Reward?

A raw click is easy to measure but a weak proxy for satisfaction.

Optimizing for clicks alone can reward clickbait while eroding trust.

Many production systems instead model a downstream signal, like watch-time past a threshold, or a weighted blend of signals, and apply the same Bayesian machinery to that instead.

Sampling Cost

Drawing from a Beta distribution is cheap.

But contextual variants that sample full parameter vectors or, in the extreme, run posterior sampling over a neural network can get expensive under tight latency budgets and high request volume.

Approximations like sampling once per batch of requests, rather than per individual request, are a common engineering compromise.

Evaluation Is Tricky

Because the system's own choices generate the data it later learns from, naively answering the question:

What would have happened under a different policy?

from logged data gives a statistically biased estimate.

Offline evaluation typically needs either logged propensity scores or a held-out slice of traffic served by uniform random exploration to validate against.
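One common use of logged propensity scores is inverse propensity scoring (IPS), sketched below. The names LoggedEvent and ips_estimate are hypothetical. To keep the example short, the target policy here ignores user context and is described only by the probability it assigns to each item.

```rust
/// One logged impression: which item was shown, how likely the logging
/// policy was to show it, and the observed reward (1.0 for a click).
struct LoggedEvent {
    action: usize,
    propensity: f64,
    reward: f64,
}

/// IPS estimate of the average reward the target policy would have earned.
/// `target_prob(action)` is the probability that the target policy picks `action`.
fn ips_estimate(log: &[LoggedEvent], target_prob: impl Fn(usize) -> f64) -> f64 {
    if log.is_empty() {
        return 0.0;
    }
    let total: f64 = log
        .iter()
        // Reweight each reward by how much more (or less) often the target
        // policy would have taken the logged action.
        .map(|e| e.reward * target_prob(e.action) / e.propensity)
        .sum();
    total / log.len() as f64
}
```

Take four events logged under uniform random exploration over two items (propensity 0.5 each), where item 0 was clicked once in two showings and item 1 twice in two. The raw logged average is 0.75, but IPS estimates 1.0 for a policy that always shows item 1 and 0.5 for one that always shows item 0. The estimator has high variance when propensities are small, which is why it is offered here only as a starting point.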

The Takeaway

Thompson Sampling has earned its long shelf life because it turns a famously hard tradeoff, when to explore versus when to exploit, into a single, principled operation:

Maintain a belief.

Sample from it.

Act on the sample.

Update the belief.

The exploration here is not a separate mechanism duct-taped onto a model.

It is a direct, automatic consequence of being honest about uncertainty.

For recommender systems in particular, where new items appear constantly, tastes shift, and every wrong "exploit" choice is a real user's wasted moment, that kind of self-calibrating exploration is not just elegant.

It is exactly what the problem calls for.

Next in the series: how reinforcement learning changes the problem once actions start shaping future states.

Building a Replayable Decision Kernel in Rust

Emir Hüseyin İnci — Fri, 26 Jun 2026 22:28:01 +0000

I built Calybris Core because I kept running into the same uncomfortable question in decision-heavy systems:

After the system says "yes", "no", or "use this instead", what exactly can we prove later?

Not prove in the formal-methods sense. I mean the practical engineering version:

Which policy was active?
What was the input?
What decision was returned?
Can the decision be replayed?
Did the budget/exposure invariant still hold?
Can an audit log detect tampering?

Calybris Core is my attempt to make that boundary small, deterministic, and boring.

It is not an LLM framework.

It is not an exchange.

It is not a strategy engine.

It is not a web service.

It is a Rust core primitive:

candidate + policy constraints -> decision + digests + optional WAL + budget proof

The first reference examples are LLM routing and pre-trade admission guards, but the crate itself is domain-neutral.

Repo: github.com/emirhuseynrmx/calybris-core

Crate: crates.io/crates/calybris-core

Docs: docs.rs/calybris-core

The boundary I wanted

A lot of systems have a hidden decision point that looks simple from the outside:

request comes in
system checks constraints
system returns allow / substitute / reject

But when something goes wrong, that simple decision becomes hard to reconstruct.

Maybe the model was changed.

Maybe a budget was exceeded.

Maybe a cheaper fallback was selected.

Maybe an operator needs to explain why an action was rejected.

Maybe an audit log was modified after the fact.

The typical response is to add more logs.

That helps, but logs alone are not the same as replayable decisions. I wanted the core decision result to carry enough structure that an independent verifier can ask:

If I replay the same input against the same policy snapshot, do I get the same decision?

That became the central design constraint.

What Calybris decides

The kernel module evaluates a KernelInput against a validated PolicySnapshot.

The result is a KernelDecision:

ExecuteRequested
Substitute
Reject

The decision contains the selected candidate, reason, estimated cost, utility, counterfactual fields, evaluated/eligible counts, and policy/catalog epochs.

The important part is not the specific domain. The important part is that the decision is deterministic and replayable.

In code, the shape is intentionally direct:

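use calybris_core::kernel::*;
use calybris_core::verify::{verify_decision, VerifyResult};

// Evaluate the input against the validated policy snapshot.
let decision = snapshot.prescribe(input);

// Replaying the same input against the same snapshot must reproduce the decision.
assert_eq!(
    verify_decision(&snapshot, input, &decision),
    VerifyResult::Valid
);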

The hot path deliberately avoids:

floating point
JSON
clocks
network calls
hidden I/O
unsafe Rust

The crate root uses:

#![forbid(unsafe_code)]

That is not magic, but it is a useful line in the sand.

Why I avoided floating point

The reference use cases both involve costs, budgets, confidence, risk, and utility.

It would be easy to reach for f64. I avoided it.

Calybris uses integer amounts and basis points. Financial amounts are fixed-point microcents. Quality, risk, confidence, and policy thresholds are represented as integer basis points.

That keeps replay behavior less surprising.

For audit-oriented code, "close enough" is a dangerous phrase. If a decision depends on a threshold, I want the arithmetic to be explicit and repeatable.
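As an illustration of the integer style, not of the crate's actual API: the helpers below are hypothetical, and they assume one basis point is 1/10,000 and one microcent is one millionth of a cent.

```rust
/// 10_000 basis points equal 100%.
const BPS_SCALE: u64 = 10_000;

/// Hypothetical helper: applies a rate in basis points to an amount in
/// microcents, rounding down. Returns None on overflow instead of wrapping.
fn apply_bps(amount_microcents: u64, rate_bps: u64) -> Option<u64> {
    amount_microcents
        .checked_mul(rate_bps)
        .map(|scaled| scaled / BPS_SCALE)
}

/// Hypothetical threshold check: an exact integer comparison with no epsilon.
fn meets_threshold(value_bps: u32, min_bps: u32) -> bool {
    value_bps >= min_bps
}
```

For example, apply_bps(100_000_000, 250) takes 2.5% of one dollar and returns Some(2_500_000), which is 2.5 cents. The contrast with floating point is easy to reproduce: in f64, 0.1 + 0.2 does not equal 0.3, while 1_000 + 2_000 basis points is exactly 3_000.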

Canonical digests, not "whatever serde emitted"

Replay alone is not enough. You also need stable fingerprints.

Calybris computes canonical SHA-256 digests for:

policy snapshots
decision inputs
decision outputs
budget ledger snapshots

The digest layouts are version-tagged byte layouts, not hashes of arbitrary JSON.

That distinction matters. JSON is great for transport and inspection, but field order and serialization choices are not a good audit boundary.

The digest tags are explicit:

calypol1
calyinp1
calydcn1
calyldg1

Policy models are sorted before hashing. Ledger tenants are sorted before hashing. A logically equivalent snapshot should not get a different fingerprint because a map happened to iterate differently.
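The idea of a tagged, sorted byte layout can be sketched as follows. This is not the crate's real layout: the field choice, the length prefixes, and the demo tag are assumptions made for illustration. The sketch stops at the canonical bytes, since hashing them with SHA-256 would need a crate outside the standard library.

```rust
/// Builds a canonical byte string for a set of (model name, quality in bps)
/// pairs: fixed 8-byte tag, entry count, then entries in sorted order.
fn canonical_bytes(tag: &[u8; 8], models: &[(String, u32)]) -> Vec<u8> {
    // Sort a list of references so the caller's ordering never leaks into the output.
    let mut sorted: Vec<&(String, u32)> = models.iter().collect();
    sorted.sort();

    let mut out = Vec::new();
    out.extend_from_slice(tag);
    out.extend_from_slice(&(sorted.len() as u64).to_be_bytes());
    for (name, quality_bps) in sorted {
        // The length prefix keeps adjacent variable-length fields unambiguous.
        out.extend_from_slice(&(name.len() as u64).to_be_bytes());
        out.extend_from_slice(name.as_bytes());
        out.extend_from_slice(&quality_bps.to_be_bytes());
    }
    out
}
```

Two inputs that list the same models in a different order produce identical bytes, and changing any single value changes the bytes. Those are the two properties a digest computed over them would inherit.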

The audit bundle

A decision can be wrapped in an audit bundle:

policy digest
input digest
decision digest
replay_valid

The verifier checks the structural decision, not just a string.

If you change the input, replay fails.

If you change the decision, replay fails.

If you use the wrong policy, replay fails.

If the digest fields do not match canonical recomputation, replay fails.

That is the reason I have been using the phrase "proof-carrying decision core", although I am still looking for feedback on whether that wording is too strong.

To be clear: this is not a formal proof system. It is a replayable evidence bundle.

Optional WAL

The crate also includes an optional write-ahead log.

Each WAL entry contains:

sequence number
previous hash
entry hash
record data

The unkeyed mode is useful for corruption detection and basic tamper evidence. The keyed mode uses HMAC-SHA256, which is the mode you would use if an attacker might rewrite entries and recompute hashes.
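The chaining itself can be sketched with the standard library alone. One substitution is made loudly here: the hash is 64-bit FNV-1a, which is not cryptographic and only stands in for SHA-256 or HMAC-SHA256 so the example has no dependencies. The Entry layout and the function names are likewise hypothetical, not the crate's types.

```rust
/// Stand-in 64-bit FNV-1a hash. NOT cryptographic: a real log would use
/// SHA-256 (unkeyed) or HMAC-SHA256 (keyed).
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for b in bytes {
        h ^= u64::from(*b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

#[derive(Clone, Debug)]
struct Entry {
    seq: u64,
    prev_hash: u64,
    entry_hash: u64,
    data: Vec<u8>,
}

/// Hashes the sequence number, the previous hash, and the record data together.
fn hash_entry(seq: u64, prev_hash: u64, data: &[u8]) -> u64 {
    let mut buf = Vec::with_capacity(16 + data.len());
    buf.extend_from_slice(&seq.to_be_bytes());
    buf.extend_from_slice(&prev_hash.to_be_bytes());
    buf.extend_from_slice(data);
    fnv1a(&buf)
}

/// Appends a record that commits to the hash of the entry before it.
fn append(log: &mut Vec<Entry>, data: &[u8]) {
    let seq = log.len() as u64;
    let prev_hash = log.last().map_or(0, |e| e.entry_hash);
    let entry_hash = hash_entry(seq, prev_hash, data);
    log.push(Entry { seq, prev_hash, entry_hash, data: data.to_vec() });
}

/// Walks the chain and fails closed on the first inconsistency.
fn verify(log: &[Entry]) -> bool {
    let mut expected_prev = 0u64;
    for (i, e) in log.iter().enumerate() {
        let consistent = e.seq == i as u64
            && e.prev_hash == expected_prev
            && e.entry_hash == hash_entry(e.seq, e.prev_hash, &e.data);
        if !consistent {
            return false;
        }
        expected_prev = e.entry_hash;
    }
    true
}
```

Editing a record's data or deleting a middle entry makes verify return false. A chain like this cannot notice entries missing from the end on its own, and with an unkeyed hash an attacker who rewrites an entry can recompute every later hash, which is the gap the keyed mode is meant to close.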

The audited WAL path looks like this:

prescribe
  -> audit_bundle
  -> append_audited
  -> replay_audited_wal

Replay fails closed if the chain is broken or if any policy/input/decision digest does not match.

I intentionally did not put secret storage, key rotation, file locking, or multi-process coordination inside this crate. Those are deployment concerns and should be owned by the embedding system.

Budget conservation

The budget engine is another small core primitive.

The invariant is:

remaining + reserved + committed_lifetime == initial

A reservation removes spendable balance.

A commit turns a reservation into lifetime committed spend.

A release returns the hold.

A top-up extends initial and remaining budget.
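The four operations and the invariant can be sketched as a single-threaded ledger. This is only an illustration of the accounting: the Ledger type and its methods are hypothetical, amounts are plain u64 values, and none of the concurrency handling described in the surrounding text is attempted.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Ledger {
    initial: u64,
    remaining: u64,
    reserved: u64,
    committed_lifetime: u64,
}

impl Ledger {
    fn new(initial: u64) -> Self {
        Self { initial, remaining: initial, reserved: 0, committed_lifetime: 0 }
    }

    /// Moves spendable balance into a hold. Fails if not enough remains.
    fn reserve(&mut self, amount: u64) -> bool {
        if amount > self.remaining {
            return false;
        }
        self.remaining -= amount;
        self.reserved += amount;
        true
    }

    /// Turns part of a hold into lifetime committed spend.
    fn commit(&mut self, amount: u64) -> bool {
        if amount > self.reserved {
            return false;
        }
        self.reserved -= amount;
        self.committed_lifetime += amount;
        true
    }

    /// Returns part of a hold to the spendable balance.
    fn release(&mut self, amount: u64) -> bool {
        if amount > self.reserved {
            return false;
        }
        self.reserved -= amount;
        self.remaining += amount;
        true
    }

    /// Extends both the initial and the remaining budget. Fails on overflow.
    fn top_up(&mut self, amount: u64) -> bool {
        match self.initial.checked_add(amount) {
            Some(new_initial) => {
                self.initial = new_initial;
                self.remaining += amount;
                true
            }
            None => false,
        }
    }

    /// The conservation invariant from the text.
    fn conserved(&self) -> bool {
        self.remaining
            .checked_add(self.reserved)
            .and_then(|s| s.checked_add(self.committed_lifetime))
            == Some(self.initial)
    }
}
```

Every operation moves value between two fields or adds the same amount to both sides of the equation, so conserved() holds after any sequence of successful or rejected calls.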

The budget engine uses compare-and-swap (CAS) for the hot balance updates and mutex-protected metadata maps for the surrounding state.

The invariant is checked on frozen snapshots. Multi-step operations may have transient internal states, so the docs are careful not to claim every mid-operation snapshot is linearizable.

That distinction matters. Audit docs should say what is guaranteed, not what sounds good.

Why not a general rules engine?

Calybris is narrower than a rules engine.

It does not try to provide a policy language. It does not parse arbitrary user rules. It does not evaluate scripts.

The current kernel is closer to:

rank candidates under hard constraints
return the best positive-utility candidate
otherwise reject
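A rough sketch of that shape, under assumptions that should be read as guesses about the general idea and not as the crate's behavior: the Candidate and Constraints fields are invented for illustration, ties are broken by lowest id, and a decision counts as ExecuteRequested only when the requested candidate is also the best eligible one.

```rust
#[derive(Clone, Debug, PartialEq)]
struct Candidate {
    id: u32,
    cost_microcents: u64,
    quality_bps: u32,
    utility: i64,
}

struct Constraints {
    max_cost_microcents: u64,
    min_quality_bps: u32,
}

#[derive(Debug, PartialEq)]
enum Decision {
    ExecuteRequested(u32),
    Substitute(u32),
    Reject,
}

fn decide(requested: u32, candidates: &[Candidate], limits: &Constraints) -> Decision {
    let best = candidates
        .iter()
        // Hard constraints first: anything that fails one is never ranked.
        .filter(|c| {
            c.cost_microcents <= limits.max_cost_microcents
                && c.quality_bps >= limits.min_quality_bps
                && c.utility > 0
        })
        // Highest utility wins; ties go to the lowest id so the result
        // does not depend on the order of the input slice.
        .max_by(|a, b| a.utility.cmp(&b.utility).then(b.id.cmp(&a.id)));

    match best {
        Some(c) if c.id == requested => Decision::ExecuteRequested(c.id),
        Some(c) => Decision::Substitute(c.id),
        None => Decision::Reject,
    }
}
```

Because the function uses only integer comparisons and a total tie-break, the same inputs always produce the same decision, which is the property replay depends on.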

That narrowness is intentional. I wanted the core to be small enough to reason about, test, replay, and document.

A larger product can put a policy language above this layer. Calybris is the deterministic bottom layer.

Testing the uncomfortable parts

The project has tests for the parts I would worry about first:

optimized kernel output vs reference implementation
digest stability and sensitivity
replay mismatch detection
WAL tampering, duplicate sequence, truncation, malformed JSON
keyed WAL verification
budget conservation under mixed operations
overflow paths
concurrent reserve/commit/release behavior
Loom interleavings
Miri on the library and audit pipeline

The CI runs minimum-supported-Rust-version (MSRV) and stable jobs, clippy with warnings denied, docs, examples, proptest-heavy jobs, Loom, Miri, cargo-audit, and cargo-deny.

That does not make it "audited". It does make it less hand-wavy.

Try it locally

git clone https://github.com/emirhuseynrmx/calybris-core
cd calybris-core
cargo run --example quickstart
cargo run --example llm_routing
cargo run --example replay_audit

Use it as a dependency:

cargo add calybris-core

Kernel-only, without WAL:

cargo add calybris-core --no-default-features

Current status

The current release is v0.3.10.

Release notes:

github.com/emirhuseynrmx/calybris-core/releases/tag/v0.3.10

The crate is Apache-2.0 and usable, but I would not describe it as a complete production platform.

It is a core primitive. If you embed it in a production system, you still own:

key management
WAL storage policy
deployment controls
external audit
monitoring
operational runbooks
integration-level failure handling

Feedback I want

I would especially like feedback from Rust, security, infra, and systems people on:

Is the API boundary clear?
Is "proof-carrying decision core" misleading?
Should this remain a narrow primitive, or grow a small policy language?
Are the WAL responsibilities split correctly between crate and caller?
What replay/audit guarantees would you expect before trusting something like this?

The repo is here:

github.com/emirhuseynrmx/calybris-core