Training AI to play adversarial games is well-understood. Chess, Go, Poker — we have frameworks for all of them.
But what happens when your AI needs to cooperate with a partner it can't talk to, while competing against opponents whose cards it can't see?
That's the challenge of Guandan (掼蛋), a 2v2 trick-taking card game played by 140 million people in China. And it broke almost every assumption we had about game AI.
The Setup
Guandan uses two standard 54-card decks (108 cards total). Four players, two teams; you sit across from your partner. The rules are trick-taking, with bombs, straights, and a wild card that rotates every round.
The goal: be the first team to promote through ranks 2→A.
The catch: you cannot communicate with your partner except through which cards you play.
Challenge 1: The Action Space is Enormous
In Go, you have ~300 legal moves per turn. In Poker, maybe ~100.
In Guandan, a single turn can have 10,000+ legal actions. Why? Because with 27 cards in hand from a 108-card deck, the number of valid combinations (singles, pairs, triples, straights, bombs of various sizes) explodes.
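A toy illustration of why this explodes (not the full Guandan move generator, and the sample hand is made up): even counting only rank-based singles, pairs, and triples in a 27-card hand gives dozens of options before straights, plates, tubes, and multi-size bombs are considered.

```python
from collections import Counter

# Hypothetical 27-card hand drawn from the double deck (ranks only;
# a real move generator also tracks suits for straights).
hand = ["2", "2", "3", "3", "3", "4", "5", "5", "6", "7", "7", "7", "8",
        "9", "9", "10", "J", "Q", "Q", "K", "K", "A", "A", "A",
        "JOKER", "JOKER", "J"]
counts = Counter(hand)

singles = len(counts)                                  # distinct ranks playable alone
pairs   = sum(1 for c in counts.values() if c >= 2)    # ranks with a playable pair
triples = sum(1 for c in counts.values() if c >= 3)    # ranks with a playable triple
print(singles, pairs, triples)  # → 14 10 3
```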
Our Solution: Two-Stage Architecture
Turn Decision:

```
Stage 1: What TYPE of action?
  → Classifier over 13 categories
  → [single, pair, triple, straight, plate, tube,
     bomb_4, bomb_5, bomb_6, bomb_7, bomb_8, rocket, pass]

Stage 2: WHICH cards for that type?
  → Pointer network conditioned on stage 1 output
  → Selects specific cards from hand
```

This decomposition reduces the effective branching factor by ~50x. Stage 1 is a simple 13-class classification; stage 2 only needs to rank cards within the chosen type.
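A minimal PyTorch sketch of this decomposition (dimensions and layer choices here are illustrative assumptions, not the production architecture): a 13-way type head picks the action category, and a pointer-style scorer ranks hand cards conditioned on an embedding of that category.

```python
import torch
import torch.nn as nn

NUM_ACTION_TYPES = 13  # single, pair, ..., bomb_8, rocket, pass

class TwoStagePolicy(nn.Module):
    """Sketch: choose an action TYPE, then score hand cards for that type."""

    def __init__(self, state_dim=128, card_dim=32, type_emb_dim=16):
        super().__init__()
        self.type_head = nn.Linear(state_dim, NUM_ACTION_TYPES)   # stage 1
        self.type_emb = nn.Embedding(NUM_ACTION_TYPES, type_emb_dim)
        self.card_scorer = nn.Linear(card_dim + type_emb_dim, 1)  # stage 2

    def forward(self, state, hand_cards, legal_type_mask):
        # state: (B, state_dim); hand_cards: (B, n_cards, card_dim)
        # legal_type_mask: (B, 13) bool — which action types are legal now
        type_logits = self.type_head(state)
        type_logits = type_logits.masked_fill(~legal_type_mask, float("-inf"))
        chosen_type = type_logits.argmax(dim=-1)   # greedy pick for the sketch
        t = self.type_emb(chosen_type).unsqueeze(1)
        t = t.expand(-1, hand_cards.size(1), -1)
        card_scores = self.card_scorer(torch.cat([hand_cards, t], dim=-1))
        return chosen_type, card_scores.squeeze(-1)  # per-card scores

policy = TwoStagePolicy()
types, scores = policy(torch.randn(2, 128),
                       torch.randn(2, 27, 32),
                       torch.ones(2, 13, dtype=torch.bool))
```

Instead of one softmax over 10,000+ composite actions, the network makes one 13-way choice and then a per-card ranking, which is where the ~50x branching-factor reduction comes from.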
Challenge 2: Cooperation Without Communication
This is the hard part. In Bridge, partners have a bidding system — an explicit communication protocol. In Guandan, there's nothing.
Yet human experts develop rich implicit signaling:
| Play Pattern | Implicit Meaning |
|---|---|
| Leading low when you could win | "I'm saving strength — you take the lead" |
| Playing an unusual suit first | "I'm strong in this suit" |
| NOT bombing when you could | "I have a plan, trust me" |
| Discarding from a long suit | "This suit is safe for you to play" |
Our Solution: LSTM History Encoder
```python
import torch
import torch.nn as nn

class PartnerBeliefModel(nn.Module):
    def __init__(self, card_feature_dim, action_type_dim,
                 partner_hand_estimate_dim, strategy_embedding_dim):
        super().__init__()
        # Encodes the full sequence of (card features, action type) plays
        self.history_encoder = nn.LSTM(
            input_size=card_feature_dim + action_type_dim,
            hidden_size=256,
            num_layers=2,
            batch_first=True,
        )
        # Estimated distribution over the partner's remaining cards
        self.belief_head = nn.Linear(256, partner_hand_estimate_dim)
        # Embedding of the partner's inferred strategy
        self.intent_head = nn.Linear(256, strategy_embedding_dim)

    def forward(self, play_history):
        # play_history: (batch, seq_len, card_feature_dim + action_type_dim)
        _, (h_n, _) = self.history_encoder(play_history)
        partner_belief = self.belief_head(h_n[-1])
        partner_intent = self.intent_head(h_n[-1])
        return partner_belief, partner_intent
```
The key insight: we don't explicitly program signaling conventions. We let them emerge from self-play. After ~500M games, agents develop consistent patterns that look remarkably like human expert conventions.
The wildest finding: different training runs produce different "dialects." Two agents from the same training run cooperate beautifully. Pair agents from different runs, and coordination drops by 15-20%.
Challenge 3: Dynamic Wild Cards
Each round has a "current rank" (starts at 2, promotes through A). All cards of the current rank become wild.
This means:
- Your hand's value changes every round
- A "bad hand" in round 5 might be amazing in round 8
- Bombs that exist in one round might dissolve in the next
We handle this with rank-conditioned policy networks — the current rank is embedded and concatenated with the game state before every decision.
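A sketch of the rank-conditioning idea (dimensions are hypothetical): the current rank, indexed 0 to 12 for 2 through A, is embedded and concatenated with the encoded game state before the policy head, so the same weights can re-value a hand as the wild rank rotates.

```python
import torch
import torch.nn as nn

class RankConditionedPolicy(nn.Module):
    """Sketch: concatenate a current-rank embedding with the game state."""

    def __init__(self, state_dim=128, rank_emb_dim=16, num_actions=13):
        super().__init__()
        self.rank_emb = nn.Embedding(13, rank_emb_dim)  # ranks 2..10, J, Q, K, A
        self.policy_head = nn.Linear(state_dim + rank_emb_dim, num_actions)

    def forward(self, state, current_rank):
        # state: (B, state_dim); current_rank: (B,) long tensor in [0, 12]
        r = self.rank_emb(current_rank)
        return self.policy_head(torch.cat([state, r], dim=-1))

policy = RankConditionedPolicy()
logits = policy(torch.randn(4, 128), torch.tensor([0, 5, 11, 12]))
```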
Challenge 4: The Tribute System
After each round, losers must give their best card to the winners. This creates a unique information asymmetry: you KNOW one card in your opponent's hand.
Tribute Info Encoding:
- Card given to opponent: one-hot (certain knowledge)
- Card received from partner: one-hot (certain knowledge)
- Bayesian update of opponent hand distribution
Smart agents exploit this known card — e.g., if you know the opponent has a specific Ace, you avoid playing into situations where that Ace can beat you.
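One simple way to fold the tribute card into a belief state (the post doesn't give the exact encoding, so the expected-count representation below is an assumption): treat the opponent-hand belief as an expected count per card type, pin one copy of the known card to certainty, and renormalize the uncertain mass over the remaining hand slots.

```python
import numpy as np

NUM_CARD_TYPES = 54  # two decks, so each type can appear 0..2 times in a hand

def apply_tribute_knowledge(opp_hand_probs, known_card, hand_size):
    """Sketch of a Bayesian-style update. opp_hand_probs[i] is the expected
    count of card type i in the opponent's hand (sums to hand_size).
    Observing the tribute makes one copy of known_card certain; the rest of
    the probability mass is rescaled to fill the remaining slots."""
    certain = np.zeros(NUM_CARD_TYPES)
    certain[known_card] = 1.0
    # Remove one expected copy of the known card from the uncertain mass
    residual = opp_hand_probs.copy()
    residual[known_card] = max(residual[known_card] - 1.0, 0.0)
    # Rescale the uncertain mass to the remaining hand_size - 1 slots
    total = residual.sum()
    if total > 0:
        residual *= (hand_size - 1) / total
    return certain + residual

# Usage: a uniform prior over 54 types for a 27-card hand, then the update
prior = np.full(NUM_CARD_TYPES, 27 / NUM_CARD_TYPES)
belief = apply_tribute_knowledge(prior, known_card=0, hand_size=27)
```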
Training Details
- Algorithm: DMC (Deep Monte Carlo) with prioritized experience replay
- Self-play: 500M games, ~72 hours on 8× A100 GPUs
- Curriculum: Start with rule-based opponents, gradually transition to self-play
- Key trick: We found that training cooperation required ~3x more iterations than adversarial play. Agents learn to attack quickly but learn restraint slowly.
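For readers unfamiliar with DMC: the core update regresses state-action values directly against the Monte Carlo return of a finished episode, with no bootstrapping. A minimal sketch (network size, learning rate, and feature dimension are placeholders, and the prioritized-replay weighting is omitted):

```python
import torch
import torch.nn as nn

# Deep Monte Carlo in one function: after an episode ends, every
# (state, action) pair in it is regressed toward the episode's final return.
q_net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def dmc_update(state_action_feats, episode_return):
    # state_action_feats: (T, 64) — features for each step of one episode
    target = torch.full((state_action_feats.size(0), 1), episode_return)
    loss = nn.functional.mse_loss(q_net(state_action_feats), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

loss = dmc_update(torch.randn(5, 64), 1.0)
```

With prioritized experience replay, episodes with larger regression error would simply be sampled more often when forming these batches.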
Results
| Metric | Our Agent | Rule-based | Random |
|---|---|---|---|
| Win rate vs baseline | 72.3% | 50.0% | 12.1% |
| Bomb timing accuracy | 68% | 41% | 22% |
| Partner coordination | 0.73 | 0.52 | 0.31 |
The "bomb timing accuracy" metric deserves explanation: we measured how often the agent's bomb usage matched expert-annotated "correct bomb timing" in a test set of 10,000 games. Beginner players bomb immediately; experts hold bombs for critical moments.
What Surprised Us
Restraint is harder than aggression. Teaching an agent when NOT to play took 3x longer to converge.
Difficulty tuning is its own problem. Making AI beatable-but-fun required a separate "difficulty controller" that intentionally introduces calibrated sub-optimality. Playing against a perfect agent is miserable.
Cross-game transfer works. Pre-training on Dou Di Zhu (a simpler 3-player card game) then fine-tuning on Guandan saved ~40% training time.
Humans prefer the teaching AI. Our agent that explains its reasoning ("I played the 3 because your partner likely has the straight") retains users 3x longer than the silent version.
Open Problems
- Zero-shot partner coordination: Can we build agents that cooperate well with ANY partner, even unseen ones?
- Natural language explanation: Generating human-readable strategy explanations in real-time
- Cultural variant adaptation: Guandan has regional rule differences too — can we adapt quickly?
If you're working on cooperative multi-agent RL or imperfect-information games, I'd love to compare notes. Drop a comment or reach out.
This is part of a series on building AI for traditional games. Next up: "Why Mahjong's 200+ Regional Variants are a Nightmare for AI (and a Gift for Transfer Learning)"