DEV Community

Emir Hüseyin İnci
Emir Hüseyin İnci

Posted on • Originally published at Medium

Thompson Sampling: How Recommender Systems Learn to Bet on What You'll Like

A Bayesian approach to the explore-exploit tradeoff, explained through the lens of personalized recommendations.

Every time a streaming service, news app, or e-commerce site decides what to put in front of you, it's making a bet.

Show you the item it's most confident you'll like, and you get a probably-good recommendation, but the system never finds out whether something else might have been even better.

Show you something new and untested, and you risk wasting a valuable slot on a flop, but you might also discover the next big hit for that user segment.

This is the explore-exploit tradeoff, and it sits underneath nearly every recommendation engine, ad-ranking system, and content feed running today.

Thompson Sampling is one of the oldest, and in practice one of the most effective, ways to solve it. It traces back to a 1933 paper by William R. Thompson, decades before "recommender system" was even a phrase, and it remains a standard tool for anyone building ranking or personalization systems.

Recommendation Slots Are Bandits

Strip away the UI, and a recommendation slot is a classic multi-armed bandit problem.

Each candidate item, a movie, a product, an article, is an "arm."

Showing that item to a user is "pulling" it.

The reward is binary feedback:

Did they click, watch, or buy?

For each item i, there's some true, unknown probability theta_i that a user will engage with it.

The system's job is to maximize cumulative engagement over time, which means learning the theta_i values while it is still using them to make live decisions.

That's what makes this harder than ordinary supervised learning:

There is no separate training phase.

Every recommendation is simultaneously a data point and a decision with real consequences.

Why Greedy Ranking Fails

The simplest strategy is greedy:

Track the observed click rate for every item, and always recommend whichever one currently looks best.

This fails for an intuitive reason.

Suppose a genuinely great item gets unlucky on its first few impressions. Three users see it and none click, purely by chance.

Its observed rate collapses, the greedy algorithm buries it, and it never gets shown again to correct the mistake.

Early noise becomes a permanent verdict.

A common fix is forced exploration: show a random item some fixed percentage of the time, known as epsilon-greedy.

But that explores indiscriminately.

It spends just as much effort re-testing items the system already has plenty of evidence about as it does on the ones still wrapped in uncertainty.

What we actually want is exploration that's proportional to how unsure the system still is.

Thompson Sampling Changes the Frame

This is where Thompson Sampling changes the frame entirely.

Instead of tracking a single click-rate estimate per item, it tracks a full probability distribution representing the system's current belief about theta_i: how plausible every possible click-rate value is, given the data seen so far.

Early on, with little data, that belief is wide and flat. Almost any click rate seems plausible.

As impressions accumulate, the belief narrows around the true value.

Crucially, the shape of this belief, not just its average, is exactly the information needed to explore intelligently:

A wide belief means:

We genuinely don't know yet.

A narrow belief means:

We're fairly confident.

Beta belief narrows as data accumulates

With little data, the system's belief is wide. As evidence accumulates, the distribution narrows around the true click rate.

The Beta-Bernoulli Setup

For binary outcomes like clicks, the natural choice of belief distribution is the Beta distribution, thanks to a convenient property called conjugacy.

We start with a prior belief for each item:

theta_i ~ Beta(alpha, beta)
Enter fullscreen mode Exit fullscreen mode

A natural starting point is:

alpha = 1
beta = 1
Enter fullscreen mode Exit fullscreen mode

That is the uniform distribution, meaning:

Any click rate from 0 to 1 is equally plausible.

Each interaction is modeled as a Bernoulli trial:

  • A click is a success: r = 1
  • No click is a failure: r = 0

The update rule after a single observation is almost embarrassingly simple:

if the user clicked:
    alpha <- alpha + 1

if they did not click:
    beta <- beta + 1
Enter fullscreen mode Exit fullscreen mode

That's it.

No gradient steps.

No retraining.

No matrix inversion.

After n impressions with k clicks, the posterior is exactly:

Beta(alpha + k, beta + n - k)
Enter fullscreen mode Exit fullscreen mode

The mean of this distribution is:

alpha / (alpha + beta)
Enter fullscreen mode Exit fullscreen mode

That is the system's best point estimate of the click rate.

Its variance shrinks roughly in proportion to 1 / n, the formal version of:

More data, narrower belief.

The Thompson Sampling Loop

With a belief distribution maintained per item, the full Thompson Sampling procedure is just three steps, repeated every time a recommendation is needed:

  1. Sample one value theta_hat_i from each candidate item's current Beta(alpha_i, beta_i) distribution.
  2. Recommend the item with the highest sampled value.
  3. Observe the outcome and update that item's alpha or beta accordingly.

Notice what's doing the actual work:

The randomness in step 1.

Nothing forces exploration explicitly. There is no epsilon, no separate exploration budget.

Exploration emerges naturally from the fact that items the system is still uncertain about have wide distributions, and a wide distribution occasionally produces a high sample purely by chance.

Thompson Sampling with three arms

A new item may have a lower average estimate but a wider belief distribution. Sometimes it samples high enough to win the recommendation slot.

The chart above shows exactly this in a single snapshot.

The "New arrival" item has the lowest average click rate of the three, but because so little is known about it, its belief is wide.

On this particular draw, its sampled value comes out ahead of both better-established items.

It wins the recommendation slot this round, the system learns a little more about it, and its distribution narrows next time, win or lose.

An item with a long, strong track record, by contrast, has a narrow distribution clustered tightly around its true rate.

It keeps winning consistently, but it can occasionally lose a slot to a promising newcomer, exactly as it should.

That is the elegant part:

A single sampling step automatically interpolates between exploring and exploiting, with no tuning knob required, and it shifts toward exploitation on its own as confidence grows.

Making Thompson Sampling Practical

Real recommender systems don't compare three items.

They compare thousands or millions, and new items arrive constantly.

A few extensions make Thompson Sampling practical at that scale.

Cold Start

A brand-new item starts at:

Beta(1, 1)
Enter fullscreen mode Exit fullscreen mode

That means maximal uncertainty.

It has a real, non-trivial chance of sampling high enough to get shown early on.

This is a feature, not a bug:

New content gets a fair shot at exposure without needing a separate "new item boost" rule bolted on.

Contextual Thompson Sampling

Treating every item independently ignores everything known about the user:

History, device, time of day, location, session context, and so on.

In practice, recommendation systems typically use a contextual variant.

Instead of a single theta per item, the model maintains a distribution over the parameters of a model, commonly a Bayesian linear or logistic regression, that predicts click probability from user and item features together.

Sampling now means:

  1. Draw one set of model parameters.
  2. Score all candidates under that sampled model.
  3. Recommend the top one.
  4. Observe the result and update the posterior.

The mechanics are unchanged:

Sample.

Act.

Update.

The model is just richer than a single number per item.

Non-Stationarity

Tastes drift.

Items go stale.

A pure Beta-Bernoulli model with no decay eventually becomes overconfident about old data that's no longer representative.

A common fix is to mildly discount alpha and beta over time, multiplying both by a factor slightly below 1 before each update.

That keeps the belief from narrowing all the way to zero uncertainty and lets the system adapt if the true rate shifts later.

Production Questions

Before reaching for Thompson Sampling in production, it is worth being deliberate about a couple of things.

What Counts as Reward?

A raw click is easy to measure but a weak proxy for satisfaction.

Optimizing for clicks alone can reward clickbait while eroding trust.

Many production systems instead model a downstream signal, like watch-time past a threshold, or a weighted blend of signals, and apply the same Bayesian machinery to that instead.

Sampling Cost

Drawing from a Beta distribution is cheap.

But contextual variants that sample full parameter vectors, or in the extreme, run posterior sampling over a neural network, can get expensive at low latency and high request volume.

Approximations like sampling once per batch of requests, rather than per individual request, are a common engineering compromise.

Evaluation Is Tricky

Because the system's own choices generate the data it later learns from, naively asking:

What would have happened under a different policy?

is statistically biased.

Offline evaluation typically needs either logged propensity scores or a held-out slice of traffic served by uniform random exploration to validate against.

The Takeaway

Thompson Sampling has earned its long shelf life because it turns a famously hard tradeoff, when to explore versus when to exploit, into a single, principled operation:

Maintain a belief.

Sample from it.

Act on the sample.

Update the belief.

The exploration here is not a separate mechanism duct-taped onto a model.

It is a direct, automatic consequence of being honest about uncertainty.

For recommender systems in particular, where new items appear constantly, tastes shift, and every wrong "exploit" choice is a real user's wasted moment, that kind of self-calibrating exploration is not just elegant.

It is exactly what the problem calls for.


Next in the series: how reinforcement learning changes the problem once actions start shaping future states.

Top comments (0)