Pavel Zapolskii

Multi-Armed Bandits: A Practical Guide to Adaptive Testing

Math guy, product person, startup enthusiast, data consultant, and global markets aficionado: I have often been in a position where multi-armed bandits could be a game-changer for a testing and optimization strategy, yet the idea usually stays on the whiteboard because it is perceived as too complex, or the business is simply hesitant to try. I have successfully run MABs across different projects, and today I want to share those experiences, focusing on an interesting case from the gaming industry.

Understanding Multi-Armed Bandits

Let me distinguish two related but separate concepts: the multi-armed bandit as an algorithm for sequential decision-making, and multi-armed bandit testing as an alternative to A/B testing. The latter is the analogue of classic testing; the former is the underlying algorithm that makes it possible.

Imagine standing in a casino in front of a row of slot machines offering different rewards. The question is a classic dilemma: should you keep playing the machine that seems to pay out the most (exploitation), or should you try the others in the hope of finding an even better one (exploration)?

This casino analogy captures the basics of Multi-Armed Bandits. Each “arm” returns values we want to maximize or minimize, and over time these observations form a reward distribution per arm. The difficulty lies in estimating those distributions accurately while balancing data collection (exploration) against return maximization (exploitation).

Mathematical Foundation

MAB solves the problem of optimizing cumulative reward over time. Ideally, we would immediately identify the arm with the best reward distribution and then spend the remaining time collecting rewards from that arm. The simplest implementation, epsilon-greedy, turns this into one rule: with probability 0.9, exploit the best-performing arm; with probability 0.1, explore by picking one of the other arms at random.
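
To make the rule concrete, here is a minimal epsilon-greedy sketch in Python; the arm count, epsilon value, and reward signal are purely illustrative rather than tied to any particular product.

```python
import random

class EpsilonGreedy:
    """Minimal epsilon-greedy bandit: exploit the best-known arm with
    probability 1 - epsilon, explore a random arm with probability epsilon."""

    def __init__(self, n_arms: int, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms    # pulls per arm
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select_arm(self) -> int:
        if random.random() < self.epsilon:
            return random.randrange(len(self.values))  # explore
        # exploit: arm with the highest estimated mean reward
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        # incremental update of the running mean
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


# Hypothetical usage: three variants, binary conversion reward
bandit = EpsilonGreedy(n_arms=3, epsilon=0.1)
arm = bandit.select_arm()
bandit.update(arm, reward=1.0)  # e.g. 1.0 if the user converted, else 0.0
```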

However, there are two big disadvantages to epsilon-greedy:

  • Solution Instability: A fixed value of epsilon is hard to tune, and performance can vary greatly with small changes to it.
  • Poor Adaptation: Epsilon-greedy does not adapt well in non-stationary systems where the arms' distributions change over time; it takes much longer to converge to the new optimum.

MAB Testing vs. A/B Testing: A Detailed Comparison

Traditional A/B Testing

Classical A/B testing uses a fixed split of the audience, analyzes the metrics after the test period, and rolls out the winning variant to everyone. It is simple, but it can be inefficient.

MAB Testing

Where A/B testing keeps group sizes fixed, MAB dynamically adjusts the size of each group based on its performance metrics. And instead of a separate rollout step, MAB converges to a state where, for example, 95% of users see the best-performing variant while 5% are retained for exploration.

More importantly, how well we balance exploration and exploitation determines the convergence time, i.e., how long MAB takes to reach this near-ideal state. MAB testing also requires infrastructure for dynamically redistributing users, so it does not fit every testing scenario.

Key Advantages of MAB

  • Real-Time Adaptation: MAB really shines in a dynamic environment. For instance, when traffic quality fluctuates, such as Variant A performing better for two weeks and then Variant B taking over, MAB self-corrects the distribution ratio.
  • Reduced Opportunity Cost and Downtime: In smaller companies or high-stakes tests, mediocre performance in some A/B test groups can really dent the monthly metrics. MAB reduces this risk by quickly dialing down traffic to poorly performing variants.

Limitations and Considerations

Despite its advantages, MAB has some marked shortcomings:

  • Implementation Complexity: Technical, conceptual, and mathematical complexity all increase the chance of mistakes.
  • Limited Applicability: Some scenarios, such as testing customer support systems with multi-day interactions or evaluating the performance of taxi drivers, are not suited to rapid reassignment of users.

Optimal Use Cases

MAB can be exceptionally effective in the following cases:

  • E-commerce Dynamic Pricing: Companies like Amazon use MAB algorithms for real-time price adjustments, based on customer behavior, demand, and competitor pricing.
  • Media Buying and Traffic Allocation: The underlying variability of ad auctions and market conditions makes MAB perfectly appropriate for dynamic learning and real-time budget allocation across media sources.
  • Gaming Monetization: Well suited to optimizing in-game purchases and pricing strategies across different player segments.

Real-World Implementation: A GameDev Case Study

Imagine a space farming game whose marketing seems well optimized, yet monetization still falls short. Here’s the challenge: increase lifetime value without simply raising prices across the board, which is a surefire way to drive players away and hurt retention.

The Challenge

The game has several monetization levers: in-game shops, event offers, battle pass subscriptions, and more. The complexity was many-sided: player segments were numerous, retention had to be balanced against revenue optimization, and the majority of the player base was new.

The Solution: Adaptive Price Optimization

We developed a Thompson sampling-based MAB system, which solves the multi-armed bandit problem in a Bayesian way. Here’s how it works (a minimal code sketch follows the list):

  • Initialize Prior Beliefs: Start with non-informative prior distributions, usually Beta, over the reward of each price variant.
  • Action Selection: Sample a reward value from each distribution and select the variant with the highest sample.
  • Observe Reward: Collect actual performance data about the selected variant.
  • Update Distribution: Apply a Bayesian update to the selected variant's distribution based on the observed outcome.
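
Below is a minimal sketch of that loop, assuming a binary conversion reward (the player buys or doesn't) and Beta(1, 1) priors; the variant names and usage are illustrative, not the production system.

```python
import random

class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a set of price variants."""

    def __init__(self, variants):
        # Beta(1, 1) is the non-informative prior: alpha counts successes,
        # beta counts failures for each variant.
        self.params = {v: {"alpha": 1.0, "beta": 1.0} for v in variants}

    def select_variant(self) -> str:
        # Sample a plausible conversion rate from each posterior
        # and pick the variant whose sample is highest.
        samples = {
            v: random.betavariate(p["alpha"], p["beta"])
            for v, p in self.params.items()
        }
        return max(samples, key=samples.get)

    def update(self, variant: str, converted: bool) -> None:
        # Bayesian update: an observed success bumps alpha, a failure bumps beta.
        if converted:
            self.params[variant]["alpha"] += 1
        else:
            self.params[variant]["beta"] += 1


# Hypothetical usage with three price points
sampler = ThompsonSampler(["price_low", "price_mid", "price_high"])
variant = sampler.select_variant()       # show this price to the next player
sampler.update(variant, converted=True)  # record whether they purchased
```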

Implementation Process

  1. Analyse the optimization possibilities based on the price elasticity curve
  2. Create segmentation strategies in collaboration with game designers
  3. Implement behavior tracking
  4. Deploy MAB for dynamic price optimization

Contextual Enhancement

We further enhanced the system’s performance by supplementing it with additional contextual information about user behavior, including:

  • Buying patterns relative to events in a game (Halloween, Christmas, etc.)
  • Engagement with historical offers
  • Player lifecycle stage

This contextual information improved the quality of the system’s predictions, sped up convergence, and reduced regret, i.e., the opportunity cost of suboptimal choices.
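
One simple way to use such context, sketched here under the assumption of coarse context buckets rather than the exact production design, is to keep a separate posterior per (lifecycle stage, event) combination:

```python
from collections import defaultdict, namedtuple

# Hypothetical player record; in production this would come from analytics.
Player = namedtuple("Player", ["lifecycle_stage", "in_event"])

PRICE_VARIANTS = ["price_low", "price_mid", "price_high"]

# One independent Beta-Bernoulli sampler per context bucket
# (reuses the ThompsonSampler class from the previous sketch).
samplers = defaultdict(lambda: ThompsonSampler(PRICE_VARIANTS))

def choose_price(player: Player) -> str:
    context = (player.lifecycle_stage, player.in_event)
    return samplers[context].select_variant()

def record_outcome(player: Player, variant: str, converted: bool) -> None:
    context = (player.lifecycle_stage, player.in_event)
    samplers[context].update(variant, converted)

# Hypothetical usage
newbie = Player(lifecycle_stage="new", in_event=True)
price = choose_price(newbie)
record_outcome(newbie, price, converted=False)
```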

Results and Performance

The production implementation showed stellar results over a 100-day period with the following milestone events:

  • Initial stabilization occurred within 30 days
  • Successfully adapted to multiple traffic quality shifts
  • Maintained stable churn rates
  • Achieved 20% ARPU increase compared to the previous fixed-price version
  • Automatically optimized for different traffic sources without human intervention

Key Takeaways and Best Practices

  • Test Duration: MAB can converge 2-3 times faster than A/B testing in appropriate cases.
  • Use Case Selection: Don't over-engineer simple tests, such as button color changes, with an MAB implementation. Choose your tools based on specific needs and potential returns.
  • Synthetic Testing: Backtest heavily on synthetic data that emulates what you expect to see online before deploying to production (see the sketch after this list).
  • Simplicity vs. Complexity: There are many complex variants of MAB, but sometimes simpler approaches work better. Always consider whether the extra complexity really serves your particular needs.
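
As a minimal illustration of that backtesting idea, the sketch below simulates arms with known conversion rates (assumed values, not real data) and checks that the sampler from the earlier sketch concentrates traffic on the best variant while keeping regret low:

```python
import random

# Hidden "true" conversion rates for three synthetic price variants.
TRUE_RATES = {"price_low": 0.05, "price_mid": 0.08, "price_high": 0.04}

sampler = ThompsonSampler(list(TRUE_RATES))  # ThompsonSampler from the earlier sketch
pulls = {v: 0 for v in TRUE_RATES}
regret = 0.0
best_rate = max(TRUE_RATES.values())

for _ in range(50_000):
    variant = sampler.select_variant()
    converted = random.random() < TRUE_RATES[variant]  # simulated user response
    sampler.update(variant, converted)
    pulls[variant] += 1
    regret += best_rate - TRUE_RATES[variant]

print(pulls)   # most pulls should concentrate on "price_mid"
print(regret)  # cumulative opportunity cost of exploration
```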

Conclusion

If implemented correctly, Multi-Armed Bandits are an extremely powerful optimization technique. The key to success is knowing what they can and cannot do, selecting appropriate problems to apply them to, and striking a balance in implementation complexity. Sophisticated mathematics brings great power, but it demands great care in implementation to avoid expensive mistakes.
Not every test requires the complexity of MAB, but in dynamic environments with a need for real-time optimization, MAB can offer substantial benefits over traditional testing methods.
