<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: friendneeder</title>
    <description>The latest articles on DEV Community by friendneeder (@friendneeder).</description>
    <link>https://dev.to/friendneeder</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1418011%2F3f842c32-cc6b-4c95-b7f3-c5ad6fd55ad0.png</url>
      <title>DEV Community: friendneeder</title>
      <link>https://dev.to/friendneeder</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/friendneeder"/>
    <language>en</language>
    <item>
      <title>Game Theory in Action: Pluribus and CFR in Texas Hold'em (2) CFR Counterfactual Regret Minimization Algorithm</title>
      <dc:creator>friendneeder</dc:creator>
      <pubDate>Fri, 26 Jul 2024 15:02:58 +0000</pubDate>
      <link>https://dev.to/friendneeder/game-theory-in-action-pluribus-and-cfr-in-texas-holdem-2-cfr-counterfactual-regret-minimization-algorithm-3l87</link>
      <guid>https://dev.to/friendneeder/game-theory-in-action-pluribus-and-cfr-in-texas-holdem-2-cfr-counterfactual-regret-minimization-algorithm-3l87</guid>
      <description>&lt;p&gt;    The previous chapter introduced the Nash equilibrium strategy, which was derived using an algorithm called CFR. Before discussing CFR, let's first introduce regret matching.&lt;br&gt;
Taking rock-paper-scissors as an example, its payoff matrix is as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsnt15d5k8w64kosxe46.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsnt15d5k8w64kosxe46.png" alt="Image description" width="257" height="91"&gt;&lt;/a&gt;&lt;br&gt;
The algorithm iterates as follows:&lt;br&gt;
For each player, initialize regretSum[action] to 0.&lt;br&gt;
Iterate T times:&lt;br&gt;
    Normalize any positive regretSum values to derive the strategy.&lt;br&gt;
    Add the strategy to strategySum.&lt;br&gt;
    Sample an action a based on the strategy.&lt;br&gt;
    Calculate the regret value for each other action a' as regret[a'] = u(a') - u(a).&lt;br&gt;
    Add regret[a'] to regretSum[a'].&lt;br&gt;
Normalize strategySum to find the average strategy.&lt;br&gt;
This results in an equilibrium solution for rock-paper-scissors of 33%, 33%, 33%.&lt;br&gt;
    In simple terms, if playing rock would have earned 2 more than the paper we actually played, we add 2 to regretSum[rock]. Because the sampling distribution (the strategy) is proportional to the positive regrets, we will then try rock more often. Noam's paper proves the convergence of regret matching; the paper is shown below.&lt;/p&gt;
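&lt;p&gt;The loop above can be sketched in a few lines of Python. This is a minimal sketch, not the paper's code: to keep the run deterministic it uses expected utilities rather than the sampled action of the pseudocode, and it seeds one player with a small regret bias so the symmetric start does not sit still. The average strategy still converges to the 33%/33%/33% equilibrium.&lt;/p&gt;

```python
# Regret matching for rock-paper-scissors, following the loop described above.
A = [[0, -1, 1],   # row player's payoff: rows/cols = rock, paper, scissors
     [1, 0, -1],
     [-1, 1, 0]]
N = 3

def get_strategy(regret_sum):
    # Normalize the positive regrets; fall back to uniform if none are positive.
    pos = [max(r, 0.0) for r in regret_sum]
    total = sum(pos)
    return [p / total for p in pos] if total > 0 else [1.0 / N] * N

def train(iterations=50000):
    # A small initial bias breaks the symmetric start so the dynamics move.
    regret_sum = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    strategy_sum = [[0.0] * N for _ in range(2)]
    for _ in range(iterations):
        strats = [get_strategy(regret_sum[p]) for p in range(2)]
        for p in range(2):
            opp = strats[1 - p]
            if p == 0:   # row player's expected utility for each pure action
                util = [sum(A[a][b] * opp[b] for b in range(N)) for a in range(N)]
            else:        # column player sees the negated payoffs
                util = [sum(-A[b][a] * opp[b] for b in range(N)) for a in range(N)]
            ev = sum(strats[p][a] * util[a] for a in range(N))
            for a in range(N):
                regret_sum[p][a] += util[a] - ev   # regret[a'] = u(a') - u(play)
                strategy_sum[p][a] += strats[p][a]
    # The average strategy converges to equilibrium; the current one may cycle.
    return [[s / iterations for s in strategy_sum[p]] for p in range(2)]
```

&lt;p&gt;Note that it is the time-averaged strategy, not the last iterate, that approaches the equilibrium.&lt;/p&gt;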

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gr8wvzuhn2v6c9r2hmr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gr8wvzuhn2v6c9r2hmr.png" alt="Image description" width="720" height="447"&gt;&lt;/a&gt;&lt;br&gt;
Here, two concepts are introduced:&lt;br&gt;
Best Response: Assuming we know the opponent plays rock 100% of the time, the strategy of playing paper 100% of the time is called the Best Response. A pure, one-hot Best Response (executing a single action 100% of the time) always exists.&lt;br&gt;
Exploitability of the opponent's strategy: the profit the Best Response extracts from it.&lt;br&gt;
An interesting thought: if we modify the rules of rock-paper-scissors so that winning or losing with scissors is worth 2 points while wins with rock or paper are worth 1 point each, what is the players' Nash equilibrium strategy?&lt;/p&gt;
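&lt;p&gt;To make the two definitions concrete, here is a small sketch (the name best_response is my own, not from the paper) that computes a best response and its profit against a fixed strategy in standard rock-paper-scissors:&lt;/p&gt;

```python
# Payoff matrix for the row player in standard rock-paper-scissors.
A = [[0, -1, 1],
     [1, 0, -1],
     [-1, 1, 0]]  # rows/cols = rock, paper, scissors
NAMES = ["rock", "paper", "scissors"]

def best_response(opponent):
    # Expected payoff of each of our pure actions against the opponent's mix.
    evs = [sum(A[a][b] * opponent[b] for b in range(3)) for a in range(3)]
    best = max(range(3), key=lambda a: evs[a])
    return NAMES[best], evs[best]  # the profit is the exploitability

# An opponent who always plays rock is exploited by 100% paper, winning 1 per round.
action, value = best_response([1.0, 0.0, 0.0])   # -> ("paper", 1)
# Against the uniform equilibrium no pure action profits: exploitability 0.
_, eq_value = best_response([1/3, 1/3, 1/3])     # -> 0.0
```

&lt;p&gt;With these definitions, a Nash equilibrium is exactly a strategy whose exploitability is zero.&lt;/p&gt;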

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fif6tmjd76u2zkz7kgo91.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fif6tmjd76u2zkz7kgo91.png" alt="Image description" width="720" height="318"&gt;&lt;/a&gt;&lt;br&gt;
Here is a recommended paper that provides a detailed introduction to CFR and includes code:&lt;br&gt;
&lt;a href="http://modelai.gettysburg.edu/2013/cfr/cfr.pdf" rel="noopener noreferrer"&gt;Introduction of cfr&lt;/a&gt;&lt;br&gt;
    CFR: the Counterfactual Regret Minimization algorithm. The regret matching algorithm used for rock-paper-scissors receives its payoff immediately, making it unsuitable for extensive-form games. When the game takes the form of a game tree, the strategy must be trained at every information set on the tree. Defining the state value of intermediate nodes, updating regret values, and ensuring the final result is a Nash equilibrium are problems regret matching alone cannot solve. This is where the CFR counterfactual regret minimization algorithm comes into play.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fb43z0hwknwfomajmpk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fb43z0hwknwfomajmpk.png" alt="Image description" width="720" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8x4b4fowfp57ont23i5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8x4b4fowfp57ont23i5.png" alt="Image description" width="720" height="182"&gt;&lt;/a&gt;&lt;br&gt;
I extracted an introduction to CFR from the first paper; its worked example of Kuhn poker (code included) should help in understanding the implementation. Several other bloggers have also explained it in detail online. At first glance the CFR algorithm seems complex, but we can analyze it as an extension of regret matching.&lt;br&gt;
CFR's regret calculation formula&lt;br&gt;
Comparing the regret calculations of CFR and regret matching, the biggest difference is the inclusion of the opponent's reach probability. It took me a long time to understand its physical meaning, and removing this probability in experiments produced incorrect results. My reading is that on a game tree both we and the opponent make many different choices, and these must be combined into a single state value that propagates upward, so the profits of different choices carry different weights. If the opponent never reaches a node, that node's profit contributes nothing to its parent.&lt;br&gt;
Another point to note is that when accumulating the average strategy, the current strategy should be weighted by one's own reach probability, which makes sense: if the current strategy reaches a node only with very low probability, adding its strategy there to the average carries little meaning.&lt;br&gt;
This is just my personal intuition; a detailed mathematical proof is still necessary.&lt;br&gt;
The paper first defines the average overall regret and proves that if it is less than ε, then the average strategy is within 2ε of a Nash equilibrium in exploitability.&lt;/p&gt;
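&lt;p&gt;The updates described above can be sketched as vanilla CFR on Kuhn poker, in the spirit of the code in the linked introduction (this is my own condensed version, not the paper's exact listing). Note how the regret update is weighted by the opponent's reach probability and the average strategy by one's own:&lt;/p&gt;

```python
from itertools import permutations

PASS, BET = 0, 1
node_map = {}  # info set string -> Node

class Node:
    def __init__(self):
        self.regret_sum = [0.0, 0.0]
        self.strategy_sum = [0.0, 0.0]

    def strategy(self, reach_weight):
        # Regret matching over positive regrets; accumulate the average
        # strategy weighted by our OWN reach probability.
        pos = [max(r, 0.0) for r in self.regret_sum]
        total = sum(pos)
        s = [p / total for p in pos] if total > 0 else [0.5, 0.5]
        for a in (PASS, BET):
            self.strategy_sum[a] += reach_weight * s[a]
        return s

    def avg_strategy(self):
        total = sum(self.strategy_sum)
        return [x / total for x in self.strategy_sum] if total > 0 else [0.5, 0.5]

def cfr(cards, history, p0, p1):
    """Returns the expected value for the player to act, under the current strategies."""
    player = len(history) % 2
    if len(history) > 1:  # terminal payoffs, from the acting player's view
        higher = cards[player] > cards[1 - player]
        if history[-1] == 'p':
            return (1 if higher else -1) if history == 'pp' else 1  # fold: win the ante
        if history[-2:] == 'bb':
            return 2 if higher else -2  # called bet: showdown for 2
    node = node_map.setdefault(str(cards[player]) + history, Node())
    strat = node.strategy(p0 if player == 0 else p1)
    util = [0.0, 0.0]
    node_util = 0.0
    for a in (PASS, BET):
        nh = history + ('p' if a == PASS else 'b')
        if player == 0:
            util[a] = -cfr(cards, nh, p0 * strat[a], p1)
        else:
            util[a] = -cfr(cards, nh, p0, p1 * strat[a])
        node_util += strat[a] * util[a]
    opp_reach = p1 if player == 0 else p0
    for a in (PASS, BET):
        # Regret weighted by the OPPONENT's reach probability.
        node.regret_sum[a] += opp_reach * (util[a] - node_util)
    return node_util

def train(iterations=5000):
    deals = list(permutations((1, 2, 3), 2))  # enumerate chance for determinism
    value = 0.0
    for _ in range(iterations):
        for cards in deals:
            value += cfr(cards, '', 1.0, 1.0) / len(deals)
    return value / iterations  # approaches Kuhn poker's value for player 1, -1/18
```

&lt;p&gt;After training, for example, the information set "1b" (holding the lowest card and facing a bet) folds almost always in the average strategy, as the equilibrium requires.&lt;/p&gt;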

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zdnaktdiu8stpyduzqp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zdnaktdiu8stpyduzqp.png" alt="Image description" width="720" height="423"&gt;&lt;/a&gt;&lt;br&gt;
It then proves that the sum of counterfactual regrets across all information sets exceeds the average overall regret.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpn1ncsjj16lf03dxv3u2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpn1ncsjj16lf03dxv3u2.png" alt="Image description" width="720" height="346"&gt;&lt;/a&gt;&lt;br&gt;
Lastly, it proves the convergence of counterfactual regret. It also demonstrates that the CFR algorithm can converge to a Nash equilibrium.&lt;/p&gt;
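&lt;p&gt;For reference, the convergence rate proved there (quoted from memory of Zinkevich et al.'s CFR paper, so treat the exact constants as indicative) has the form:&lt;/p&gt;

```latex
% Average overall regret of player i after T iterations of CFR, where
% \Delta_{u,i} is the range of player i's utilities, |\mathcal{I}_i| the
% number of player i's information sets, and |A_i| the maximum number of
% actions at any of them:
\frac{R^T_i}{T} \;\le\; \frac{\Delta_{u,i}\,|\mathcal{I}_i|\,\sqrt{|A_i|}}{\sqrt{T}}
```

&lt;p&gt;Combined with the 2ε result above, driving the right-hand side to zero as T grows is what makes the average strategy converge to a Nash equilibrium.&lt;/p&gt;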

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgk6oz4709tklk5s1mmw4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgk6oz4709tklk5s1mmw4.png" alt="Image description" width="720" height="198"&gt;&lt;/a&gt;&lt;br&gt;
The next chapter will introduce the principles and differences between current market solvers like Libratus, Pluribus, and DeepStack.&lt;br&gt;
    I am looking for remote job opportunities. Please contact me at my email: &lt;a href="mailto:justfunnychen@gmail.com"&gt;justfunnychen@gmail.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>algorithms</category>
      <category>datascience</category>
      <category>openai</category>
    </item>
    <item>
      <title>Game Theory in Action: Pluribus and CFR in Texas Hold'em(1) Understanding Nash Equilibrium</title>
      <dc:creator>friendneeder</dc:creator>
      <pubDate>Fri, 26 Jul 2024 14:54:52 +0000</pubDate>
      <link>https://dev.to/friendneeder/game-theory-in-action-pluribus-and-cfr-in-texas-holdem1-understanding-nash-equilibrium-2nhf</link>
      <guid>https://dev.to/friendneeder/game-theory-in-action-pluribus-and-cfr-in-texas-holdem1-understanding-nash-equilibrium-2nhf</guid>
      <description>&lt;p&gt;    Recently, beyond the rapid development of large language models (LLMs), significant progress has been made in game-theoretic AI as well. After completing Pluribus, Noam developed Cicero by integrating game-theoretic reinforcement learning with LLM technology. Cicero learned to employ humor and form alliances in the strategy game Diplomacy, achieving top-10% performance, with opponents completely unaware they were playing against an AI. Noam has since moved to OpenAI, likely to develop something referred to as "Q*". It is believed that combining reinforcement learning, game theory, and large language models will have practical applications in gaming, autonomous driving, and robotics planning. Here, I will share insights on the principles and technical challenges of Pluribus, particularly in the context of Texas Hold'em poker.&lt;br&gt;
    I plan to write seven or eight chapters, using concepts from Texas Hold'em to help explain Nash Equilibrium, a concept commonly associated with the prisoner's dilemma. Nash Equilibrium is a state where no player can increase their payoff by unilaterally changing strategies, assuming other players' strategies remain unchanged. While Nash Equilibrium does not guarantee maximizing collective or individual benefits, it provides a non-exploitative strategy when opponents' strategies are unknown. By analyzing deviations from Nash Equilibrium, one can infer opponents' habits and devise strategies to maximize overall benefits.&lt;br&gt;
    In the game of rock-paper-scissors, if you use a Nash equilibrium strategy where the probabilities of choosing rock, paper, and scissors are all 33%, your opponent will not be able to exploit any weaknesses. However, if you choose rock 100% of the time, your opponent can exploit you by choosing paper 100% of the time.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;    For instance, in a Texas Hold'em game at the river stage, suppose the community cards are four twos and the three of hearts, and the current pot is 2. The leading player, holding AA 50% of the time and QQ 50% of the time, can choose to check or bet 1. If this player bets and the opponent holds KK, the opponent must decide whether to fold or call.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1itc3ujmgdhk8cu3zl8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1itc3ujmgdhk8cu3zl8.png" alt="Image description" width="658" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;    Using Nash Equilibrium solvers, such as the one provided by bupticybee on their GitHub (&lt;a href="https://github.com/bupticybee/TexasHoldemSolverJava" rel="noopener noreferrer"&gt;https://github.com/bupticybee/TexasHoldemSolverJava&lt;/a&gt;), we find:&lt;br&gt;
For Player 1:&lt;br&gt;
With AA, bet 1 100% of the time.&lt;br&gt;
With QQ, check 66% and bet 1 33% of the time.&lt;br&gt;
For Player 2, facing a bet:&lt;br&gt;
With KK, call 66% and fold 33% of the time.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1uuq5ji40iltgl7mw4ws.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1uuq5ji40iltgl7mw4ws.png" alt="Image description" width="707" height="164"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;    These strategies ensure that neither player can improve their expected payoff by deviating unilaterally from this strategy set.&lt;/p&gt;

&lt;p&gt;    First, let's discuss Player 1's actions. The expected value of the Nash equilibrium strategy is:&lt;br&gt;
0.5×0.66×2 (AA bets 1, opponent calls; we win 2)+&lt;br&gt;
0.5×0.33×1 (AA bets 1, opponent folds; we win 1)+&lt;br&gt;
0.5×0.33×0.66×(−2) (QQ bets 1, opponent calls; we lose 2)+&lt;br&gt;
0.5×0.33×0.33×1 (QQ bets 1, opponent folds; we win 1)+&lt;br&gt;
0.5×0.66×(−1) (QQ checks, then folds to a bet; we lose 1)=0.33&lt;br&gt;
If Player 2 deviates to 50% calling and 50% folding against the Nash equilibrium strategy, Player 1's overall expected value remains 0.33.&lt;br&gt;
If Player 2 deviates to 100% calling, Player 1's overall expected value again remains 0.33: at the equilibrium betting frequencies, the extra calls pay off AA exactly as much as they punish the QQ bluff.&lt;br&gt;
Player 2 cannot reduce Player 1's expected value by modifying their strategy, and since this is a zero-sum game, Player 2 cannot increase their own expected value either. Note that the calculation for Player 1's check line is simplified here.&lt;br&gt;
If Player 1 instead always bets 1 with AA and checks with QQ, Player 2 can respond by folding 100% against a bet, making Player 1's overall expected value 0. In poker this is known as "under-bluffing."&lt;br&gt;
If Player 1 instead always bets 1 with both AA and QQ, Player 2 can respond by calling 100%, again reducing Player 1's expected value to 0. In poker this is known as "over-bluffing."&lt;br&gt;
If Player 1 deviates from the Nash equilibrium strategy, Player 2 can adjust their strategy to reduce Player 1's expected value.&lt;/p&gt;
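&lt;p&gt;Treating the rounded 0.66 and 0.33 frequencies above as exact fractions 2/3 and 1/3, the equilibrium EV works out to exactly 1/3, which can be checked directly:&lt;/p&gt;

```python
from fractions import Fraction

half, third, two_thirds = Fraction(1, 2), Fraction(1, 3), Fraction(2, 3)
ev = (half * two_thirds * 2                # AA bets 1, opponent calls: win 2
      + half * third * 1                   # AA bets 1, opponent folds: win 1
      + half * third * two_thirds * -2     # QQ bets 1, opponent calls: lose 2
      + half * third * third * 1           # QQ bets 1, opponent folds: win 1
      + half * two_thirds * -1)            # QQ checks, then folds: lose 1
print(ev)  # 1/3, the 0.33 quoted above
```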

&lt;p&gt;    Some clever friends have pointed out that a Nash equilibrium strategy is a defensive approach that cannot by itself yield profits. In practice, however, opponents often exploit themselves unintentionally. For example, where a 50% bet / 50% check strategy has an expected value of 2, an opponent's choice of 30% check and 70% bet may realize an expected value of only 1, effectively exploiting themselves.&lt;br&gt;
The real Texas Hold'em game tree has 10^180 nodes, and real matches require decisions within 30 seconds. This has led to the development of various sophisticated CFR (Counterfactual Regret Minimization) methods and numerous engineering optimization tricks, which I will gradually introduce later.&lt;br&gt;
    In the next chapter, I will discuss CFR, the counterfactual regret minimization algorithm, and related methods.&lt;br&gt;
    I am looking for remote job opportunities. Please contact me at my email: &lt;a href="mailto:justfunnychen@gmail.com"&gt;justfunnychen@gmail.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
