UCB

The Upper Confidence Bound (UCB) algorithm is a strategy used in multi-armed bandit problems. These problems arise when there are several options (often called "levers" or "arms") and each option yields a reward whose size is not known in advance.

The UCB algorithm accounts for both the expected reward of each option (or "lever") and the uncertainty about that reward. A confidence interval is built for each lever, and the upper limit of that interval (i.e., the most optimistic plausible outcome) is used to decide which lever to pull next.

Uncertainty grows the less often a lever is pulled. If a lever is pulled frequently, more information is available about it and its uncertainty shrinks. If a lever is rarely pulled, less information is available about it and its uncertainty stays large.
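The link between pull count and uncertainty can be made concrete with a small sketch. The text does not say how the confidence interval is computed, so this example assumes the width used by the common UCB1 variant, sqrt(2 ln t / n), where t is the total number of pulls so far and n is the number of pulls of the lever in question. The function name `exploration_bonus` is hypothetical.

```python
import math


def exploration_bonus(total_pulls: int, lever_pulls: int) -> float:
    """Width of the upper confidence term for one lever.

    Assumes the UCB1 form sqrt(2 * ln(t) / n); other UCB variants
    use different widths. Requires lever_pulls >= 1.
    """
    return math.sqrt(2.0 * math.log(total_pulls) / lever_pulls)


# With the total fixed at 1000 pulls, the bonus shrinks as a lever
# is pulled more often.
for n in (1, 10, 100):
    print(n, round(exploration_bonus(1000, n), 3))
```

The bonus falls roughly with the square root of the pull count, so a lever must be pulled about four times as often to halve its uncertainty term.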

The UCB strategy favors both levers with a high expected reward (i.e. "exploitation") and levers that have been pulled only a few times and therefore carry high uncertainty (i.e. "exploration"). This allows the agent to both exploit what it already knows and acquire new information.
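A minimal sketch of this selection rule follows, again assuming the UCB1 form of the confidence term (the text does not name a specific variant). The helper names `ucb_scores` and `select_lever` are hypothetical, and treating a never-pulled lever as having an infinite score is one common convention, not the only one.

```python
import math


def ucb_scores(mean_rewards, pull_counts):
    """Return the upper confidence value of each lever (UCB1 form, assumed)."""
    total = sum(pull_counts)
    scores = []
    for mean, n in zip(mean_rewards, pull_counts):
        if n == 0:
            # A lever that was never pulled has unbounded uncertainty.
            scores.append(math.inf)
        else:
            scores.append(mean + math.sqrt(2.0 * math.log(total) / n))
    return scores


def select_lever(mean_rewards, pull_counts):
    """Pick the index of the lever with the highest upper confidence value."""
    scores = ucb_scores(mean_rewards, pull_counts)
    return max(range(len(scores)), key=scores.__getitem__)


# Exploration: lever 1 has a lower mean but was pulled only twice.
print(select_lever([0.5, 0.4], [100, 2]))
# Exploitation: with equal pull counts, the higher mean wins.
print(select_lever([0.5, 0.4], [100, 100]))
```

In the first call the large uncertainty of the rarely pulled lever outweighs its lower average reward. In the second call both levers are equally well known, so the choice falls back to the higher average.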

For example, consider an online advertising scenario. A company wants to determine which of its ads is most likely to be clicked. It treats each ad as a "lever" and a click on that ad as a "reward". The UCB algorithm can then be used to determine which ads are likely to generate more clicks.

In this case, the UCB algorithm both keeps showing the most-clicked ads (i.e. "exploit") and displays other ads that have been shown less often but may have high click-through rates (i.e. "explore"). This way, the company can make use of the ads it already knows to perform well and discover new, potentially better ads.
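The advertising scenario can be simulated in a few lines. This is a sketch under stated assumptions: the click-through rates are made-up values chosen for illustration, clicks are modeled as independent Bernoulli draws, the confidence term uses the UCB1 form, and the function name `simulate_ads` is hypothetical.

```python
import math
import random


def simulate_ads(true_ctrs, rounds, seed=0):
    """Show one ad per round, chosen by a UCB1-style rule (assumed variant).

    true_ctrs: hypothetical click probabilities, unknown to the algorithm.
    Returns (impressions per ad, clicks per ad).
    """
    rng = random.Random(seed)
    shows = [0] * len(true_ctrs)
    clicks = [0] * len(true_ctrs)
    for t in range(1, rounds + 1):
        untried = [i for i, n in enumerate(shows) if n == 0]
        if untried:
            # Show every ad once before comparing confidence bounds.
            ad = untried[0]
        else:
            # Observed click rate plus an uncertainty bonus for each ad.
            ad = max(
                range(len(true_ctrs)),
                key=lambda i: clicks[i] / shows[i]
                + math.sqrt(2.0 * math.log(t) / shows[i]),
            )
        shows[ad] += 1
        if rng.random() < true_ctrs[ad]:
            clicks[ad] += 1
    return shows, clicks


# Illustrative click-through rates only; real ad rates are usually far lower.
shows, clicks = simulate_ads([0.1, 0.3, 0.6], rounds=5000)
print("impressions per ad:", shows)
print("clicks per ad:", clicks)
```

Over the run, the ad with the highest click rate should receive most of the impressions, while the weaker ads are still shown occasionally so that their estimates keep being checked.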