How I Built a Pluribus-Style Poker AI From Scratch

Griffin Henning — Wed, 24 Jun 2026 05:35:07 +0000

Why poker is hard

Chess and Go are perfect-information games: both players see everything. The challenge is purely computational.

Poker is an imperfect information game. You can't see your opponent's cards. Every decision has to account for the full range of hands they might hold.

Standard minimax doesn't work. You can't evaluate a position without knowing your opponent's hand.

The solution is a Nash equilibrium: a pair of strategies where neither player can improve their expected value (EV) by unilaterally changing what they do. In poker this is called GTO (game theory optimal) play. In a two-player zero-sum game it is unexploitable by definition.

Stage 1: Counterfactual Regret Minimization

CFR (Zinkevich et al., 2007) is the algorithm that made strong poker AI possible.

The intuition: Play against yourself repeatedly. After each game, ask: “how much better would I have done if I'd always taken a different action?” That's regret.

Update your strategy: play each action with probability proportional to its cumulative positive regret. Repeat thousands of times. In a two-player zero-sum game, the time-average of your strategies converges to a Nash equilibrium.
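
To make the update rule concrete, here is a minimal sketch of regret matching in self-play on rock-paper-scissors rather than poker. The payoff table, the function names, and the lopsided starting regrets are all assumptions chosen for illustration; this is not the Leduc implementation from the repo.

```python
# Regret matching in self-play on rock-paper-scissors (illustrative toy only).
PAYOFF = [  # PAYOFF[a][b]: payoff to the player choosing a against b
    [0, -1, 1],   # rock
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]

def regret_matching(regrets):
    """Play each action in proportion to its positive cumulative regret."""
    pos = [max(r, 0.0) for r in regrets]
    total = sum(pos)
    n = len(regrets)
    return [p / total for p in pos] if total > 0 else [1.0 / n] * n

def train(iterations, initial_regrets=(1.0, 0.0, 0.0)):
    """Return both players' time-averaged strategies after self-play."""
    # Player 0 starts biased toward rock so the dynamics are not trivial.
    regrets = [list(initial_regrets), [0.0, 0.0, 0.0]]
    strat_sums = [[0.0] * 3, [0.0] * 3]
    for _ in range(iterations):
        strats = [regret_matching(r) for r in regrets]
        for p in (0, 1):
            opp = strats[1 - p]
            # Expected payoff of each pure action against the opponent's mix
            action_values = [sum(PAYOFF[a][b] * opp[b] for b in range(3))
                             for a in range(3)]
            node_value = sum(strats[p][a] * action_values[a] for a in range(3))
            for a in range(3):
                regrets[p][a] += action_values[a] - node_value
                strat_sums[p][a] += strats[p][a]
    return [[s / iterations for s in sums] for sums in strat_sums]

average_strategies = train(50_000)
```

The current strategy keeps cycling between rock, paper, and scissors, but the averaged strategy drifts toward one third each, which is the equilibrium of this game. That is why the average, not the latest iterate, is the thing to keep.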

I implemented this on Leduc Hold'em, a 6-card toy game with 216 information sets and the standard poker AI research testbed.

def get_strategy(self, reach_prob: float) -> list[float]:
    n = len(self.regrets)  # number of actions at this information set
    # Regret matching: weight each action by its positive cumulative regret
    pos = [max(r, 0.0) for r in self.regrets]
    total = sum(pos)
    strat = [p / total for p in pos] if total > 0 else [1.0 / n] * n
    # Accumulate average strategy (this is what converges to Nash)
    for i in range(n):
        self.strat_sum[i] += reach_prob * strat[i]
    return strat

After 10,000 iterations (2.2 seconds):

King raises 99%: correct, strongest hand
Jack plays a mixed strategy: correct, must be unpredictable to stay unexploitable

That last point surprises people. GTO isn't about always making the "right" move. It's about being unpredictable enough that no opponent strategy can consistently beat you.

Stage 2: Monte Carlo CFR

Full NLHE has ~10¹⁶⁰ game states. Traversing the full tree every iteration is impossible.

MCCFR samples a subset each iteration. I used external sampling:

Traverser nodes: explore all actions
Opponent nodes: sample one action from their strategy
Result: unbiased value estimates, tractable computation

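if player == traverser:
    # Explore ALL actions, then update regrets
    action_values = {}
    for action in actions:
        next_state = state.apply(action)  # assumed transition helper
        action_values[action] = traverse(next_state, traverser)
    update_regrets(action_values)
    # Node value under the current strategy, returned to the parent
    return sum(strategy[a] * action_values[a] for a in actions)
else:
    # SAMPLE one opponent action from their current strategy
    action = sample_from_strategy(strategy)
    next_state = state.apply(action)  # assumed transition helper
    return traverse(next_state, traverser)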

On Leduc Hold'em: MCCFR converged to the same equilibrium as vanilla CFR at 1.9x the speed.

Stage 3: Card Abstraction

Even with MCCFR, full NLHE has too many states. Solution: group similar hands into buckets.

The naive approach clusters by average equity, which is wrong in an important way.

A flopped flush draw and a top pair can have identical average equity but completely different equity distributions over future runouts. The flush draw is either far ahead or far behind. The made hand is consistently ahead. GTO strategy treats these differently.

The right metric: Earth Mover's Distance between equity histograms.

import numpy as np

def emd(hist_a: np.ndarray, hist_b: np.ndarray) -> float:
    """Wasserstein-1 distance: captures distribution shape, not just mean.

    Assumes both histograms are normalized and share the same equal-width bins.
    """
    cdf_a = np.cumsum(hist_a)
    cdf_b = np.cumsum(hist_b)
    return float(np.sum(np.abs(cdf_a - cdf_b)))
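
A quick usage sketch shows why the mean alone is not enough. The two 5-bin histograms are made-up numbers, not real equity data, and the names `polarized` and `steady` are only illustrative: one hand is either far behind or far ahead, the other always lands in the same bin.

```python
import numpy as np

def emd(hist_a: np.ndarray, hist_b: np.ndarray) -> float:
    """Wasserstein-1 distance between two normalized histograms, in bin widths."""
    return float(np.sum(np.abs(np.cumsum(hist_a) - np.cumsum(hist_b))))

# Hypothetical 5-bin equity histograms (bin centers 0.1 ... 0.9).
bin_centers = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
polarized = np.array([0.5, 0.0, 0.0, 0.0, 0.5])  # draw-like: far behind or far ahead
steady = np.array([0.0, 0.0, 1.0, 0.0, 0.0])     # always close to the same equity

mean_polarized = float(bin_centers @ polarized)
mean_steady = float(bin_centers @ steady)
distance = emd(polarized, steady)
```

Both histograms have the same mean equity, so average-equity clustering would put them in one bucket. The EMD between them is well above zero, so EMD clustering keeps them apart.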

Abstraction scheme:

Preflop: 169 canonical hands → 8 equity percentile buckets
Flop: EMD clustering → 12 clusters
Turn: EMD clustering → 12 clusters
River: exact strength percentile → 8 bins
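
The percentile-based rows of the scheme above reduce to a one-line mapping. This is a sketch that assumes equal-width bins over a percentile in [0, 1]; the function name and the choice of equal-width bins are my assumptions, not necessarily what the repo does.

```python
def percentile_bucket(percentile: float, n_buckets: int = 8) -> int:
    """Map a strength percentile in [0, 1] to a bucket index in [0, n_buckets - 1]."""
    if not 0.0 <= percentile <= 1.0:
        raise ValueError("percentile must be in [0, 1]")
    # min() keeps percentile == 1.0 in the top bucket instead of overflowing
    return min(int(percentile * n_buckets), n_buckets - 1)

buckets = [percentile_bucket(p) for p in (0.0, 0.12, 0.5, 0.99, 1.0)]
```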

Optimization that halved computation: compute equity histograms for both players simultaneously from the same random rollouts, rather than running separate Monte Carlo samples.

Stage 4: Deep CFR

Still too many information sets for a hash table. Deep CFR (Brown et al., 2019) replaces the table with neural networks.

Two networks per player:

class AdvantageNetwork(nn.Module):
    """Approximates cumulative counterfactual regret per action."""
    def regret_matching(self, features):
        advantages = self.forward(features)
        n = advantages.shape[-1]  # number of actions
        pos = F.relu(advantages)
        total = pos.sum(dim=-1, keepdim=True)
        uniform = torch.ones_like(advantages) / n
        # Fall back to uniform when no action has positive advantage
        return torch.where(total > 1e-6, pos / total.clamp_min(1e-6), uniform)

class StrategyNetwork(nn.Module):
    """Approximates the average strategy, which is what converges to Nash."""
    def forward(self, x):
        return F.softmax(super().forward(x), dim=-1)
Feature encoding (373 dimensions):

104: hole cards (2 × 52 one-hot)
260: board cards (5 slots × 52 one-hot)
4: street one-hot
5: normalized scalars (pot, stacks, to-call, raises)
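
The layout above can be sketched as a flat vector with fixed offsets. The card indexing (0 to 51), the ordering of the five scalars, and the `encode` signature are assumptions made for this example; only the block sizes come from the list.

```python
import numpy as np

N_CARDS, BOARD_SLOTS, N_STREETS, N_SCALARS = 52, 5, 4, 5
FEATURE_DIM = 2 * N_CARDS + BOARD_SLOTS * N_CARDS + N_STREETS + N_SCALARS  # 373

def encode(hole, board, street, scalars):
    """Encode one decision point as a flat float32 vector.

    hole: two card indices in [0, 52); board: up to five card indices;
    street: 0..3 (preflop..river); scalars: five already-normalized floats.
    """
    x = np.zeros(FEATURE_DIM, dtype=np.float32)
    for slot, card in enumerate(hole):
        x[slot * N_CARDS + card] = 1.0
    board_offset = 2 * N_CARDS
    for slot, card in enumerate(board):  # undealt slots stay all-zero
        x[board_offset + slot * N_CARDS + card] = 1.0
    street_offset = board_offset + BOARD_SLOTS * N_CARDS
    x[street_offset + street] = 1.0
    x[street_offset + N_STREETS:] = scalars
    return x

features = encode(hole=[0, 51], board=[10, 20, 30], street=1,
                  scalars=[0.1, 0.9, 0.9, 0.05, 0.25])
```

Leaving undealt board slots as all zeros lets the same network handle every street without a separate "no card" symbol.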

Reservoir buffers ensure all past iterations are represented in training, preventing catastrophic forgetting without unbounded memory.
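
For readers who have not met reservoir sampling, here is a minimal sketch of such a buffer using the classic Algorithm R. The class name, its fields, and the integer "samples" are placeholders; a real buffer would store training tuples.

```python
import random

class ReservoirBuffer:
    """Fixed-capacity buffer holding a uniform random sample of everything added."""

    def __init__(self, capacity: int, seed: int = 0):
        self.capacity = capacity
        self.items = []
        self.n_seen = 0
        self._rng = random.Random(seed)

    def add(self, item) -> None:
        self.n_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
            return
        # Keep the new item with probability capacity / n_seen (Algorithm R)
        j = self._rng.randrange(self.n_seen)
        if j < self.capacity:
            self.items[j] = item

buffer = ReservoirBuffer(capacity=100)
for sample_id in range(10_000):
    buffer.add(sample_id)
```

Every item ever added has the same chance of still being in the buffer, so early iterations stay represented while memory stays bounded.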

The biggest performance win: keeping networks in eval() mode during traversal rather than toggling per inference.

# Before: called eval() on every inference (61ms/traversal)
model.eval()
with torch.no_grad():
    output = model(x)

# After: set eval() once before traversal loop (10ms/traversal)
player.set_inference_mode()  # called once
# ... thousands of traversals ...

One line of code. 6x speedup. Profile before you optimize.

Stage 5: Real-Time Search

The blueprint strategy from Deep CFR knows a lot but uses coarse abstractions. Real-time search fixes this.

At each decision point:

Build a local game tree rooted at the current state
Run MCCFR in that subtree for a fixed iteration budget
At leaf nodes, query the blueprint for value estimates
Return the search-refined strategy

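def solve(self, gs: GameState, player: int) -> dict:
    self.nodes = {}  # Fresh local tree per decision
    self._iters = 0  # Reset the iteration budget along with the tree
    t_start = time.perf_counter()  # monotonic clock; needs `import time`

    while self._iters < self.config.n_iters:
        if (time.perf_counter() - t_start) * 1000 >= self.config.time_limit_ms:
            break
        # Alternate the traversing player each iteration
        self._traverse(gs, self._iters % 2, depth=1)
        self._iters += 1

    actions = gs.legal_actions()  # assumed GameState helper
    return self._root_node().get_avg_strategy(actions)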

Blueprint bootstrapping blends local and blueprint strategies early in search for stability:

blend = min(self._iters / self.config.n_iters, 1.0)
strat = (1 - blend) * blueprint_strat + blend * local_strat

Average decision time: 75ms on CPU.

Results

300 duplicate hand pairs (600 total hands per matchup). Duplicate scoring controls for card luck.

Matchup	mBB/hand	Significant
Blueprint vs Random	+28,403	✓
Search vs Random	+28,134	✓
Search vs Blueprint	+31,798	✓

Head-to-head, search outperforms blueprint-only play. This is the core empirical claim of Pluribus, validated here.
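
For reference, this is one way to turn duplicate pairs into an mBB/hand figure and a significance check. It is a sketch under assumptions: results are recorded in big blinds, each pair replays the same cards with seats swapped, and significance means a two-sided normal test at roughly the 95% level. The function names and the sample numbers are made up, and the evaluation code in the repo may differ.

```python
import math
import statistics

def duplicate_mbb_per_hand(pairs):
    """Summarize duplicate-pair results for one bot.

    pairs: list of (result_a, result_b) winnings in big blinds, where both
    hands of a pair use the same cards with the seats swapped.
    Returns (mBB/hand, standard error in mBB/hand).
    """
    # Averaging within a pair cancels most of the card luck
    pair_means = [(a + b) / 2.0 for a, b in pairs]
    mean_bb = statistics.fmean(pair_means)
    std_err_bb = statistics.stdev(pair_means) / math.sqrt(len(pair_means))
    return 1000.0 * mean_bb, 1000.0 * std_err_bb

def is_significant(mbb, std_err, z=1.96):
    """Two-sided test at roughly the 95% level under a normal approximation."""
    return abs(mbb) > z * std_err

# Made-up results for three duplicate pairs, in big blinds
mbb, se = duplicate_mbb_per_hand([(1.0, 0.5), (2.0, -0.5), (0.5, 0.5)])
significant = is_significant(mbb, se)
```

Treating each pair, not each hand, as the unit of analysis matters: the two hands in a pair share cards, so they are not independent samples.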

How it compares to Pluribus

	This project	Pluribus
Players	2 (heads-up)	6
Traversals	~50,000	12.4M
Hardware	Single CPU	64-core CPU
Training	~30 min	~8 days

Same blueprint-plus-search recipe. Scale is the main difference.

Three things I'd do differently

1. Profile earlier. The 6x traversal speedup from eval mode was sitting there the whole time. I spent days on algorithmic optimizations before finding it.

2. Start with finer bet abstraction. Five bet sizes is enough to demonstrate the technique but too coarse for real strategic depth. Pluribus considered up to 14, depending on the situation, and the strategy changes meaningfully.

3. Build the evaluation framework first. I ran hundreds of training iterations before having reliable exploitability metrics. Convergence looks different than you expect: EV oscillating near zero is not the same as converging to Nash.
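
The gap between "EV near zero" and "close to Nash" is easy to see in a toy game. The sketch below uses rock-paper-scissors, not poker, and measures exploitability as the payoff a best response earns against a fixed strategy; the names are illustrative.

```python
PAYOFF = [  # PAYOFF[a][b]: payoff to the player choosing a against b
    [0, -1, 1],   # rock
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]

def expected_value(strategy_a, strategy_b):
    """Expected payoff to the player using strategy_a against strategy_b."""
    return sum(strategy_a[a] * strategy_b[b] * PAYOFF[a][b]
               for a in range(3) for b in range(3))

def exploitability(strategy):
    """Best-response payoff against `strategy`; 0 at equilibrium in this game."""
    return max(sum(PAYOFF[a][b] * strategy[b] for b in range(3))
               for a in range(3))

always_rock = [1.0, 0.0, 0.0]
uniform = [1 / 3, 1 / 3, 1 / 3]

self_play_ev = expected_value(always_rock, always_rock)  # mirror match
rock_exploitability = exploitability(always_rock)
uniform_exploitability = exploitability(uniform)
```

An always-rock bot playing itself has a self-play EV of exactly zero, yet a best response wins every hand against it. Only the exploitability number tells the two strategies apart.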

The codebase

GitHub: github.com/griff-ui/poker-ai

Five stages, 40 files, 27 tests passing, full documentation. MIT licensed: use it for research, study, or as the foundation for your own solver.

Live demo: griff-ui.github.io/poker-ai

Browser-based hand analyzer: select cards, set game state, see GTO strategy frequencies. No Python required.

References

Zinkevich et al. (2007). Regret Minimization in Games with Incomplete Information
Lanctot et al. (2009). Monte Carlo Sampling for Regret Minimization in Extensive Games
Brown, Lerer, Gross & Sandholm (2019). Deep Counterfactual Regret Minimization
Brown & Sandholm (2019). Superhuman AI for Multiplayer Poker

DEV Community: Griffin Henning