Stojan Kojo

Posted on Jun 7

What is the most efficient way to evaluate poker hands at scale?

#javascript #ai #webdev #tutorial

The most efficient way to evaluate poker hands at scale is not to write logic that checks for straights and flushes at runtime. Instead, the industry standard for high-frequency trading (HFT) and real-money poker is pre-computed Lookup Tables (LUTs) combined with bitwise operations and SIMD (Single Instruction, Multiple Data) vectorization.

The goal is to reduce hand evaluation to a single CPU instruction or a few array lookups, achieving nanosecond-level latency. This allows the engine to evaluate millions of hands per second for Monte Carlo simulations, AI training, or real-time fairness checks.

1. The Core Architecture: Lookup Tables (LUTs)

The fundamental insight is that there are a finite number of 5-card combinations from a 52-card deck: $\binom{52}{5} = 2,598,960$.
For 7-card games (Hold'em/Omaha), it's $\binom{52}{7} = 133,784,560$.

The Strategy:

Pre-computation: Before the server starts, generate a massive array where every possible 5-card (or 7-card) combination maps to a unique integer "strength" value.
Runtime: Convert the player's hand into a unique integer key (hash).
Lookup: Access the array at that index. The value returned is the hand's strength.

Memory Trade-off:

5-Card LUT: ~2.6 million entries. If each entry is a 4-byte integer, this is ~10 MB. Fits easily in L3 CPU cache.
7-Card LUT: ~133 million entries. ~534 MB. Fits in RAM, but may spill out of cache on large servers.
Optimization: Most engines use a 5-card LUT and a 7-to-5 reduction algorithm. They generate all $\binom{7}{5} = 21$ combinations, look them up, and pick the best.

2. The Bitwise Implementation (The "Cactus Kev" / "TwoPlusTwo" Method)

The most famous high-performance implementation is the Cactus Kev algorithm (popularized by poker-eval in C/C++). It maps cards to prime numbers or bitmasks to perform arithmetic operations that instantly detect flushes and straights.

Data Representation

Cards: Represented as integers.
Ranks: 2-14 (2-Ace).
Suits: 4 bits (1 for Spade, 2 for Heart, etc.).
Bitmasks: A 52-bit integer where the $i$-th bit is 1 if the $i$-th card is present.

The Evaluation Function (Pseudocode)

// Pre-computed table: maps a 5-card hash to a strength score (0-7462)
// Lower score = better hand (or higher, depending on convention)
int lookup_table[2598960]; 

int evaluate_hand(int hand_mask) {
    // 1. Check for Flush: Sum of suit bits. If any suit sum >= 4 (5 cards), it's a flush.
    // This is done via bitwise AND with suit masks (e.g., 0xF0F0F0F0F0F0F0F0)
    if (is_flush(hand_mask)) {
        return lookup_flush[hand_mask]; // Pre-computed flush strength
    }

    // 2. Check for Straight: Use a "rank bitboard".
    // Shift bits to detect sequences of 5 consecutive bits.
    if (is_straight(hand_mask)) {
        return lookup_straight[hand_mask];
    }

    // 3. Standard Evaluation (Pairs, Trips, etc.)
    // Use the "Cactus Kev" prime multiplication method.
    // Each rank is a prime: 2=2, 3=3, ..., A=41.
    // Multiply the primes of the 5 cards.
    // The product is unique for every hand type (e.g., A-A-A-K-K always yields X).
    int product = 1;
    for (int i = 0; i < 5; i++) {
        product *= RANK_PRIMES[get_rank(hand_mask, i)];
    }

    return lookup_table[product];
}

Why this is fast:

No Loops: The logic uses bitwise shifts and masks, which are single-cycle CPU instructions.
Cache Friendly: The lookup table is small enough to stay in the CPU's L1/L2 cache.
Branch Prediction: The if statements are highly predictable (most hands are high-card, few are straights), minimizing pipeline stalls.

3. Scaling to Millions of Hands: SIMD & Vectorization

For Monte Carlo simulations (e.g., "What is my equity if I run this hand 100,000 times?"), single-threaded evaluation is too slow. You must use SIMD (AVX2, AVX-512 on x86; NEON on ARM).

The Approach:

Load 8 or 16 hands into a single 256-bit or 512-bit register.
Perform the bitwise operations (flush check, straight check) on all 8/16 hands simultaneously.
Use lookup tables that are vectorized (or use shuffle instructions to map multiple keys to multiple values).

Library Example: HandEvaluator in Rust/C++ with AVX2

// Conceptual SIMD evaluation (using a crate like `hand-eval-simd`)
fn evaluate_batch(hands: &[Hand], results: &mut [u32]) {
    // Load 8 hands into a 256-bit register
    let hand_vec = load_simd(hands);

    // Parallel Flush Check
    let flush_mask = check_flush_simd(hand_vec);

    // Parallel Straight Check
    let straight_mask = check_straight_simd(hand_vec);

    // Parallel Lookup (requires a specialized vectorized LUT or scatter/gather)
    let scores = vectorized_lookup(hand_vec, flush_mask, straight_mask);

    store_simd(results, scores);
}

This can achieve 100M+ hands per second per core on modern CPUs.

4. Architecture for Production Systems

The "Evaluator Service" Microservice

In a distributed architecture, do not embed the heavy evaluation logic in the main game loop if you need to run simulations. Offload it to a dedicated Evaluator Service.

Protocol: gRPC or UDP (for low latency).
Language: C++ or Rust for the evaluator core (maximum performance).
Interface:
- EvaluateHandsRequest: { hand_1: [c1, c2...], hand_2: [...] }
- EvaluateHandsResponse: { score_1: 1234, score_2: 5678 }
Scaling: Run the service as a stateless pod in Kubernetes. Use Horizontal Pod Autoscaling (HPA) based on CPU utilization.

Database & Caching

Redis: Cache the results of common board textures (e.g., A-K-Q-J-T on board) if the same board appears frequently in simulations.
PostgreSQL: Store the final hand history, including the hand_strength_score and winner_ids. Do not store the raw bitmasks unless needed for debugging.

5. Security & Compliance

Determinism: The evaluator must be pure. No random numbers, no external state. If the same 7 cards are passed in, the same score must be returned. This is non-negotiable for regulatory compliance.
Certification: The LUT generation algorithm must be certified by an independent lab (e.g., eCOGRA, GLI). The source code for the LUT generator is often audited, not just the runtime binary.
Anti-Cheating: Never expose the evaluator logic to the client. The client sends cards only when allowed; the server evaluates.

6. Real-World Implementation Example

Scenario: A 6-max No-Limit Hold'em table.

Pre-flop: Player A (AA) vs Player B (KK).
Simulation: The engine runs 100,000 Monte Carlo simulations to determine equity for the "All-In" button.
Execution:
- The "Simulator" service generates random boards (5 cards).
- It calls the Evaluator 200,000 times (2 hands x 100k simulations).
- The Evaluator uses the Cactus Kev LUT + AVX2 to process 8 hands in parallel.
- Total time: ~50ms.
Result: Player A has 82% equity. The server updates the UI.

Trade-offs:

Memory vs. Speed: A full 7-card LUT is fast but memory-heavy (500MB+). A 5-card LUT + 21 combinations is slower (21x lookups) but uses only 10MB.
Solution: Hybrid approach. Use a 5-card LUT. If the board is a flush/straight, use the optimized path. Otherwise, use the standard lookup.

7. Code Snippet: The "Fast" Evaluator (Node.js with N-API)

Since Node.js is single-threaded and slow for raw math, use N-API to bind a C++ evaluator.

// evaluator.cc (C++ N-API Module)
#include <node_api.h>
#include "poker_eval.h" // The C++ LUT engine

 napi_value EvaluateHand(napi_env env, napi_callback_info info) {
   // 1. Extract card array from JS
   // 2. Convert to bitboard (uint64_t)
   // 3. Call C++ evaluate(bitboard) -> int
   // 4. Return int to JS
   return napi_create_int32(env, result);
 }

// evaluator.js (Node.js)
const evalModule = require('./build/Release/evaluator.node');

function getHandStrength(holeCards, communityCards) {
  // Hole: [0, 1], Board: [2, 3, 4, 5, 6]
  const allCards = [...holeCards, ...communityCards];
  return evalModule.evaluate(allCards);
}

FAQs

Q1: Why not just write a simple if (isFlush) return ... function in Python/JavaScript?
Because a standard if-else approach involves loops to check suits, loops to check ranks, and sorting. This is $O(N)$ or $O(N \log N)$ and involves many conditional branches. In a high-load environment (10k hands/sec), this creates a bottleneck and high CPU usage. A LUT approach is $O(1)$ and uses bitwise operations, which are orders of magnitude faster.

Q2: How do you handle 7-card hands (Omaha/Hold'em) with a 5-card LUT?
You generate all $\binom{7}{5} = 21$ possible 5-card combinations from the 7 available cards. You run the 5-card LUT on each of the 21 combinations and pick the highest score. This is still extremely fast because the LUT lookup is instant. Advanced engines optimize this by pre-computing the "best 5 of 7" directly into a 7-card LUT, but the 21-lookup method is often sufficient and uses less memory.

Q3: Can I use this for Web3/Blockchain poker?
Yes, but with a caveat. On-chain evaluation (e.g., on Ethereum) is too expensive for full LUTs due to gas costs. The standard pattern is Off-chain Evaluation with On-chain Verification. The server (or a trusted oracle) evaluates the hand off-chain using the LUT, signs the result, and the smart contract verifies the signature. The contract never runs the evaluator.

Q4: What is the "Cactus Kev" method?
It is a specific algorithm developed by Cactus Kev (Eric Persson) that maps each card rank to a prime number. The product of the primes of 5 cards is unique for every hand type (e.g., a Full House of Aces over Kings has a unique product). This allows the engine to determine the hand type and strength using a single multiplication and a lookup table, avoiding complex conditional logic.

Q5: How do I ensure the LUT is correct and not buggy?
You must verify the LUT against a known reference implementation (like the PokerStars or WSOP logic) or a brute-force generator.

Generate all 2.6M 5-card hands.
Evaluate them with a slow, correct (but inefficient) reference algorithm.
Evaluate them with your fast LUT.
Assert that the results match for every single hand. This verification process is part of your CI/CD pipeline.

DEV Community