<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Daksh Jain</title>
    <description>The latest articles on DEV Community by Daksh Jain (@dash10107).</description>
    <link>https://dev.to/dash10107</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3989571%2F196ca0b3-07b4-47f2-879b-4ce4168dcda8.png</url>
      <title>DEV Community: Daksh Jain</title>
      <link>https://dev.to/dash10107</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dash10107"/>
    <language>en</language>
    <item>
      <title>Beyond A/B Testing: How AI Handles Ad Fatigue and Revenue Optimization</title>
      <dc:creator>Daksh Jain</dc:creator>
      <pubDate>Sun, 21 Jun 2026 03:48:23 +0000</pubDate>
      <link>https://dev.to/dash10107/beyond-ab-testing-how-ai-handles-ad-fatigue-and-revenue-optimization-1pe</link>
      <guid>https://dev.to/dash10107/beyond-ab-testing-how-ai-handles-ad-fatigue-and-revenue-optimization-1pe</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Feesjgbvsuujob5ypzz62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Feesjgbvsuujob5ypzz62.png" alt="Bandit Optimizer Cover" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you read any standard tutorial on Multi-Armed Bandits, you will hear the exact same story: &lt;em&gt;A/B testing is inefficient because it wastes 50% of your traffic on a losing variation. Instead, use a Bandit algorithm to dynamically shift traffic to the winner.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;They usually introduce three algorithms: Epsilon-Greedy, UCB1, and Thompson Sampling. &lt;/p&gt;

&lt;p&gt;But almost all of these tutorials make two fatal, mathematically dangerous assumptions that will completely break your algorithms in the real world:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;They assume a click is just a click.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;They assume the world never changes.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I built a custom &lt;a href="https://huggingface.co/spaces/Dash10107/mab-banner-optimizer" rel="noopener noreferrer"&gt;Interactive Bandit Simulator&lt;/a&gt; to visualize exactly why these assumptions fail, and how advanced Reinforcement Learning actually handles the chaos of the real world.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Casino Analogy (Where the name comes from)
&lt;/h3&gt;

&lt;p&gt;Imagine walking into a casino and facing a row of slot machines (known as "One-Armed Bandits"). You know that some machines have a higher payout rate than others, but you don't know which ones. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Do you pull the lever on the machine that paid out $10 on your first try (Exploitation)? &lt;/li&gt;
&lt;li&gt;  Or do you put coins into the unknown machines just in case one of them is the secret jackpot machine (Exploration)? &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the &lt;strong&gt;Multi-Armed Bandit&lt;/strong&gt; problem. In digital marketing, the slot machines are your ad banners, and the pulls are your website visitors.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Revenue Trap (CTR vs EV)
&lt;/h2&gt;

&lt;p&gt;Most standard bandit implementations optimize purely for Click-Through Rate (CTR). A conversion equals &lt;code&gt;1&lt;/code&gt;, a failure equals &lt;code&gt;0&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Imagine you are running a SaaS pricing page with three different "Call to Action" (CTA) buttons.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;"Start Free Trial"&lt;/strong&gt;: Gets a massive 15% CTR. (Expected lifetime value: $120)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;"Request Demo"&lt;/strong&gt;: Gets a 7% CTR. (Expected value: $180)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;"Buy Now"&lt;/strong&gt;: Gets a tiny 3% CTR. (Expected value: $300)&lt;/li&gt;
&lt;/ul&gt;

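&lt;p&gt;If your algorithm only looks at clicks, it will confidently route nearly all of your traffic to the "Free Trial" button, because a click-only reward never sees what each click is worth. What you actually earn per impression is CTR × value per conversion, so the click leader is the right choice only when its lower value per conversion does not cancel out its CTR advantage. A click-only bandit cannot tell the difference.&lt;/p&gt;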

&lt;h3&gt;
  
  
  The Fix: Expected Value (EV)
&lt;/h3&gt;

&lt;p&gt;To fix this, our environment must multiply the conversion by the actual revenue. In our simulator's environment code, the reward function looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;arm_idx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;arm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arms&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;arm_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# 1. Did they click based on the hidden True CTR?
&lt;/span&gt;    &lt;span class="n"&gt;converted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;arm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;true_ctr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Multiply by the actual monetary value of that conversion!
&lt;/span&gt;    &lt;span class="n"&gt;reward&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;converted&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;reward&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;converted&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
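&lt;p&gt;Averaged over many impressions, that reward converges to &lt;code&gt;true_ctr * revenue&lt;/code&gt;, which is the number a revenue-aware bandit is really chasing. Here is a minimal sketch of that arithmetic. The &lt;code&gt;Arm&lt;/code&gt; dataclass, the &lt;code&gt;expected_value&lt;/code&gt; helper, and both sets of figures are made up for illustration; they are not taken from the simulator.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Arm:
    name: str
    true_ctr: float  # probability that one impression converts
    revenue: float   # money earned when it does


def expected_value(arm):
    # The mean of revenue * Bernoulli(true_ctr) is simply true_ctr * revenue
    return arm.true_ctr * arm.revenue


# Hypothetical figures, chosen only to show the two rankings disagreeing
arms = [
    Arm("high-CTR, low-value", 0.15, 20.0),
    Arm("low-CTR, high-value", 0.03, 300.0),
]

best_by_ctr = max(arms, key=lambda a: a.true_ctr)
best_by_ev = max(arms, key=expected_value)

print("best by CTR:", best_by_ctr.name)
print("best by EV: ", best_by_ev.name)
```

&lt;p&gt;With these made-up figures the high-CTR arm is worth about $3.00 per impression and the low-CTR arm about $9.00, so the two rankings disagree. Whether that happens in your own funnel depends entirely on how large the value gap is relative to the CTR gap.&lt;/p&gt;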


&lt;p&gt;When you run &lt;strong&gt;Thompson Sampling&lt;/strong&gt; on the "SaaS Pricing Page" scenario in the live dashboard, you will watch something incredible happen. Initially, the algorithm gets flooded with clicks for the "Free Trial" banner. But over time, the rare—but massive—$300 payouts from the "Buy Now" button cause its revenue-weighted probability distribution to shift all the way to the right. The AI learns to ignore the high click rate and chases the money.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. Standard Algorithms and Their Deep Flaws
&lt;/h2&gt;

&lt;p&gt;Let's look at how standard algorithms attempt to solve this exploration-exploitation trade-off, and where they break.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Naive Explorer: Epsilon-Greedy
&lt;/h3&gt;

&lt;p&gt;It rolls a loaded die. 90% of the time, it exploits the best banner. 10% of the time (ε), it picks at random. &lt;br&gt;
&lt;strong&gt;The Flaw&lt;/strong&gt;: It never stops exploring. Even after 100,000 impressions when it is absolutely certain which banner is best, it still wastes 10% of its traffic on losers. &lt;/p&gt;

&lt;p&gt;We can fix this mathematically with &lt;strong&gt;Decaying Epsilon-Greedy&lt;/strong&gt;. Instead of a fixed 10%, we calculate epsilon dynamically:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ε = min(1, decay / √(t + 1))&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;choose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Epsilon shrinks as the square root of time!
&lt;/span&gt;    &lt;span class="n"&gt;eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decay&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_arms&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This forces heavy exploration early, but gracefully decays exploration to zero as time approaches infinity.&lt;/p&gt;
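&lt;p&gt;To get a feel for how fast that schedule shrinks, here is a small standalone sketch of the same formula used in &lt;code&gt;choose()&lt;/code&gt;. The function name &lt;code&gt;decayed_epsilon&lt;/code&gt; and the choice &lt;code&gt;decay=1.0&lt;/code&gt; are assumptions for illustration.&lt;/p&gt;

```python
import math


def decayed_epsilon(t, decay=1.0):
    # Same schedule as in choose(): decay / sqrt(t + 1), capped at 1.0
    return min(1.0, decay / max(1.0, math.sqrt(t + 1)))


# Exploration probability after 1, 100, 10,000 and 1,000,000 impressions
for t in (0, 99, 9_999, 999_999):
    print(t + 1, round(decayed_epsilon(t), 4))
```

&lt;p&gt;One caveat worth knowing: because ε shrinks like 1/√t, the expected total number of random pulls still grows roughly like 2 · decay · √T. Exploration becomes a vanishing share of traffic rather than stopping outright.&lt;/p&gt;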
&lt;h3&gt;
  
  
  The Genius of the Logarithm: UCB1
&lt;/h3&gt;

&lt;p&gt;Upper Confidence Bound (UCB1) doesn't use randomness. It computes an optimistic upper estimate of each banner's value, the observed average plus an uncertainty bonus, using this formula:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UCB = Expected_Reward + c * √( ln(t) / pulls )&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;choose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
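    &lt;span class="c1"&gt;# Try every arm once first so the bonus never divides by zero
&lt;/span&gt;    &lt;span class="n"&gt;untried&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flatnonzero&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;untried&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;untried&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;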
    &lt;span class="c1"&gt;# The UCB bonus formula
&lt;/span&gt;    &lt;span class="n"&gt;bonus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;bonus&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Why is &lt;code&gt;np.log(t)&lt;/code&gt; brilliant? As time (&lt;code&gt;t&lt;/code&gt;) moves forward, the numerator keeps growing without bound, but a logarithm grows &lt;em&gt;incredibly slowly&lt;/em&gt;. Together these mean that if a banner hasn't been pulled in a long time (the denominator &lt;code&gt;self.counts&lt;/code&gt; stays small), its bonus will eventually creep high enough to force the algorithm to test it again, while those re-tests become rarer and rarer. No banner is ever permanently starved of attention.&lt;/p&gt;
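&lt;p&gt;You can see both effects by evaluating the bonus term on its own. This is a small sketch, not the simulator's code: &lt;code&gt;ucb_bonus&lt;/code&gt; is a hypothetical helper, and &lt;code&gt;c=2.0&lt;/code&gt; and the pull counts are arbitrary illustrative values.&lt;/p&gt;

```python
import numpy as np


def ucb_bonus(t, pulls, c=2.0):
    # Exploration bonus from the formula above: c * sqrt(ln(t) / pulls)
    return c * np.sqrt(np.log(t) / pulls)


# An arm that was pulled 50 times and then ignored: its bonus keeps creeping up
for t in (100, 10_000, 1_000_000):
    print(t, round(float(ucb_bonus(t, pulls=50)), 3))

# An arm that keeps getting pulled: its bonus shrinks instead
print(round(float(ucb_bonus(1_000_000, pulls=900_000)), 3))
```

&lt;p&gt;The neglected arm's bonus rises every time &lt;code&gt;t&lt;/code&gt; grows, while the heavily pulled arm's bonus is close to zero. That gap is what eventually drags the algorithm back to an arm it has been ignoring.&lt;/p&gt;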


&lt;h2&gt;
  
  
  3. The Static World Fallacy (Ad Fatigue)
&lt;/h2&gt;

&lt;p&gt;Here is the second, much larger trap. UCB1 and Epsilon-Greedy assume that if a banner has a 10% CTR on Day 1, it will have a 10% CTR on Day 100. &lt;/p&gt;

&lt;p&gt;In digital marketing, this is completely false. Users get "Ad Fatigue". A brilliant new banner design will get high clicks for a week, and then slowly decay as users go blind to it. We call this a &lt;strong&gt;Non-Stationary Environment&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;If you run standard UCB1 in a non-stationary environment, it fails spectacularly. Why? Because UCB1 remembers &lt;em&gt;everything&lt;/em&gt;: its value estimate is an average over every impression it has ever seen. If Banner A was amazing for the first 10,000 impressions, UCB1 builds an incredibly strong mathematical conviction that Banner A is the best. If users suddenly go blind to Banner A and its CTR drops to zero, UCB1 is so weighed down by its historical data that it might take another 10,000 failed impressions before it finally changes its mind.&lt;/p&gt;
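&lt;p&gt;The arithmetic behind that inertia is easy to check if we look only at the running average UCB1 uses as its value estimate. The exploration bonus is ignored here, so treat this as a simplification. &lt;code&gt;failures_to_drop_below&lt;/code&gt; is a hypothetical helper, and the 10% and 5% CTRs are illustrative numbers.&lt;/p&gt;

```python
def failures_to_drop_below(clicks, impressions, rival_ctr):
    # Count how many extra zero-click impressions are needed before the
    # running average clicks / impressions falls strictly below rival_ctr.
    n = 0
    while clicks / (impressions + n) >= rival_ctr:
        n += 1
    return n


# Banner A: 10% CTR over its first 10,000 impressions, then it stops converting.
# A rival banner sits at a steady 5% CTR.
needed = failures_to_drop_below(clicks=1_000, impressions=10_000, rival_ctr=0.05)
print(needed)
```

&lt;p&gt;Under these assumptions the average needs a little over 10,000 straight zero-click impressions before it falls behind the rival. The longer a banner's good streak lasted, the longer the stale history takes to wash out.&lt;/p&gt;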


&lt;h2&gt;
  
  
  4. Advanced Solutions: Bayes and Gradients
&lt;/h2&gt;

&lt;p&gt;How do modern AI systems handle shifting trends?&lt;/p&gt;
&lt;h3&gt;
  
  
  Solution A: The Bayesian Master (Thompson Sampling)
&lt;/h3&gt;

&lt;p&gt;Instead of tracking a single "average", Thompson Sampling tracks a &lt;strong&gt;Beta Distribution&lt;/strong&gt; of belief for each banner. Bayes' Theorem turns the observed clicks and ignores into a posterior over that banner's true success rate:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P(True_CTR | Data) = Beta(α + clicks, β + ignores)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;choose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Sample from the Beta distribution of each arm
&lt;/span&gt;    &lt;span class="n"&gt;samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;alpha&lt;/code&gt; is the prior α plus the number of successes (clicks).&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;beta&lt;/code&gt; is the prior β plus the number of failures (ignores).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you open the &lt;strong&gt;Learner Mode&lt;/strong&gt; in the dashboard, you can watch these Beta Distribution curves physically morph. When a banner is new, the curve is flat and wide (high uncertainty). When it gets clicks, it shifts right and becomes a tight spike. &lt;/p&gt;

&lt;p&gt;If the environment drifts (Ad Fatigue), a once-great banner starts accumulating failures. Its &lt;code&gt;beta&lt;/code&gt; parameter grows, the curve widens and shifts left, and the AI naturally begins exploring other banners again. It gracefully rides the changing waves.&lt;/p&gt;
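&lt;p&gt;The snippet above shows only how an arm is chosen. The bookkeeping that moves the curves could look something like the sketch below. It covers the plain click / no-click case only, not revenue weighting, and the class name &lt;code&gt;BetaBernoulliArms&lt;/code&gt;, the uniform Beta(1, 1) prior, and the &lt;code&gt;posterior_mean&lt;/code&gt; helper are assumptions for illustration rather than the simulator's actual code.&lt;/p&gt;

```python
import numpy as np


class BetaBernoulliArms:
    """Hypothetical minimal state for Beta-Bernoulli Thompson Sampling."""

    def __init__(self, n_arms, seed=0):
        # Beta(1, 1) is a uniform prior: no opinion about any arm yet
        self.alpha = np.ones(n_arms)
        self.beta = np.ones(n_arms)
        self.rng = np.random.default_rng(seed)

    def choose(self):
        # Draw one plausible CTR per arm and play the arm with the largest draw
        samples = self.rng.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, arm, converted):
        # A click adds one pseudo-count to alpha, a miss adds one to beta
        if converted:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

    def posterior_mean(self):
        # Mean of Beta(alpha, beta) is alpha / (alpha + beta)
        return self.alpha / (self.alpha + self.beta)


agent = BetaBernoulliArms(n_arms=2)
for _ in range(30):
    agent.update(0, True)    # 30 clicks on arm 0
for _ in range(70):
    agent.update(0, False)   # 70 ignores on arm 0
print(agent.posterior_mean())
```

&lt;p&gt;Arm 0 ends up at Beta(31, 71), while the untouched arm 1 is still at its flat Beta(1, 1) prior, which is why it keeps getting sampled now and then.&lt;/p&gt;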
&lt;h3&gt;
  
  
  Solution B: Gradient Bandits (The Deep RL Bridge)
&lt;/h3&gt;

&lt;p&gt;Instead of trying to estimate CTR or Revenue at all, what if the agent just learns a &lt;em&gt;relative preference&lt;/em&gt;? &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gradient Bandits&lt;/strong&gt; maintain a preference score &lt;code&gt;H&lt;/code&gt; for each banner. They pass these scores through a &lt;code&gt;Softmax&lt;/code&gt; function to convert them into probabilities (just like the output layer of a neural network classifier). The update rule nudges each preference along the gradient of expected reward, measured against a running-average baseline:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;H_chosen = H_chosen + α * (Reward - Baseline) * (1 - Prob_chosen)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;arm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reward&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_softmax&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Stochastic gradient ascent on expected reward
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;H&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alpha_lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reward&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;probs&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;H&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;arm&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alpha_lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reward&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Update running baseline
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;baseline&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reward&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If a banner performs better than the historical baseline, its preference gets a boost. If it performs worse, it gets penalized. This relative updating makes it incredibly robust to non-stationary environments. &lt;em&gt;Fun fact: This exact math is the foundational stepping stone to Policy Gradient algorithms like PPO, which are used to train Large Language Models!&lt;/em&gt;&lt;/p&gt;
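&lt;p&gt;The update code calls &lt;code&gt;self._softmax()&lt;/code&gt; without showing it. A common way to write that step is sketched below as standalone functions. The names &lt;code&gt;softmax&lt;/code&gt; and &lt;code&gt;gradient_update&lt;/code&gt;, the learning rate, and the example preferences are assumptions for illustration; the simulator's own helper may differ.&lt;/p&gt;

```python
import numpy as np


def softmax(H):
    # Subtracting the max leaves the result unchanged but avoids overflow in exp
    z = np.exp(H - np.max(H))
    return z / z.sum()


def gradient_update(H, arm, reward, baseline, lr=0.1):
    probs = softmax(H)
    # Every arm is pushed down in proportion to its probability...
    H = H - lr * (reward - baseline) * probs
    # ...and the chosen arm gets the full push back up
    H[arm] += lr * (reward - baseline)
    return H


H = np.array([0.0, 1.0, 2.0])
probs = softmax(H)
print(probs.round(3))

# Arm 2 was shown and earned more than the baseline
H_new = gradient_update(H, arm=2, reward=1.0, baseline=0.0)
print(H_new.round(3))
```

&lt;p&gt;A side effect of this update is that the preferences always sum to the same constant, because the pushes on all arms cancel out. Only the relative ordering of the banners changes, which is exactly the "relative preference" idea described above.&lt;/p&gt;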


&lt;h2&gt;
  
  
  🧪 Try It Yourself
&lt;/h2&gt;

&lt;p&gt;Don't just read about it. Open up the &lt;strong&gt;&lt;a href="https://huggingface.co/spaces/Dash10107/mab-banner-optimizer" rel="noopener noreferrer"&gt;Live Dashboard&lt;/a&gt;&lt;/strong&gt; and run these exact experiments to see the AI break and recover:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Revenue Trap:&lt;/strong&gt; Go to the Face-Off tab. Select the &lt;code&gt;SaaS Pricing Page&lt;/code&gt; scenario. Race Epsilon-Greedy against Thompson Sampling. Watch how Thompson Sampling figures out that the lowest-clicked banner is actually the most profitable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Ad Fatigue Test:&lt;/strong&gt; Go to Advanced Settings and set the &lt;code&gt;CTR Drift&lt;/code&gt; to &lt;code&gt;0.008&lt;/code&gt;. Run UCB1 against Gradient Bandits. Watch how UCB1 stubbornly clings to early winners long after they have decayed, while Gradient Bandits smoothly adapt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step-by-Step Bayesian Learning:&lt;/strong&gt; Go to the Learner Mode, pick the &lt;code&gt;E-Commerce Sale&lt;/code&gt; scenario, and click "Next Impression" manually. Watch the mathematical confidence curves physically narrow in real time.&lt;/li&gt;
&lt;/ol&gt;


&lt;h3&gt;
  
  
  Wrapping Up
&lt;/h3&gt;

&lt;p&gt;Bandit algorithms are the engine behind many of the digital platforms you use today, from Netflix thumbnails to Amazon recommendations. But building them for the real world requires moving beyond simple coin-flips and addressing revenue weighting and non-stationarity.&lt;/p&gt;

&lt;p&gt;This is the second of 12 interactive RL projects I am building to bridge the gap between academic math and real-world intuition. If this helped things click for you, I would be incredibly grateful if you checked out the source code and dropped a star on the full repository:&lt;/p&gt;

&lt;p&gt;⭐ &lt;strong&gt;Reinforcement Learning Portfolio on GitHub&lt;/strong&gt;&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Dash10107" rel="noopener noreferrer"&gt;
        Dash10107
      &lt;/a&gt; / &lt;a href="https://github.com/Dash10107/rl-portfolio" rel="noopener noreferrer"&gt;
        rl-portfolio
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      End-to-end reinforcement learning projects — Q-Learning, DQN, PPO, SAC, A2C, IPPO, MBRL, HMM, RLHF, and Multi-Armed Bandits — each deployed as an interactive Gradio app on Hugging Face Spaces.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;&lt;p&gt;
  &lt;a rel="noopener noreferrer" href="https://github.com/Dash10107/rl-portfolio/assets/banner.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FDash10107%2Frl-portfolio%2FHEAD%2Fassets%2Fbanner.png" alt="Reinforcement Learning Portfolio Banner" width="900"&gt;&lt;/a&gt;
&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Reinforcement Learning Portfolio&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;A collection of 12 end-to-end reinforcement learning projects, each deployed as an interactive web application on Hugging Face Spaces. The projects span the full range of modern RL — from the simplest tabular methods that fit on a single page, to multi-agent coordination, model-based planning, and learning from human feedback.&lt;/p&gt;

&lt;p&gt;Every project is built to be understood by someone who is new to RL. Each has its own README explaining the algorithm, the environment, and what you are looking at when you run it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New to reinforcement learning?&lt;/strong&gt; Start with these two documents before anything else:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/Dash10107/rl-portfolio/./CONCEPTS.md" rel="noopener noreferrer"&gt;CONCEPTS.md&lt;/a&gt; — what RL is, the core vocabulary, and how all 12 algorithms relate to each other&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Dash10107/rl-portfolio/./GETTING_STARTED.md" rel="noopener noreferrer"&gt;GETTING_STARTED.md&lt;/a&gt; — step-by-step guide to running your first project and your first experiment&lt;/li&gt;
&lt;/ul&gt;




&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Key Highlights&lt;/h2&gt;
&lt;/div&gt;


&lt;ul&gt;

&lt;li&gt;⚡ &lt;strong&gt;Zero-Install Interactive Demos&lt;/strong&gt;: Every project is deployed live on Hugging Face Spaces for instant testing.&lt;/li&gt;

&lt;li&gt;🎓 &lt;strong&gt;Curriculum-Based&lt;/strong&gt;…&lt;/li&gt;

&lt;/ul&gt;&lt;/div&gt;
&lt;br&gt;
  &lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/Dash10107/rl-portfolio" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;p&gt;Let me know in the comments: &lt;em&gt;What's the weirdest A/B test result you've ever seen where the data totally contradicted your intuition?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>reinforcementlearning</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
