Training AI to play adversarial games is well-understood. Chess, Go, Poker — we have frameworks for all of them.
But what happens when your AI needs to cooperate with a partner it can't talk to, while competing against opponents whose cards it can't see?
That's the challenge of Guandan (掼蛋), a 2v2 trick-taking card game played by 140 million people in China. And it broke almost every assumption we had about game AI.
The Setup
Guandan uses two standard 54-card decks (108 cards total). Four players, two teams; you sit across from your partner. The rules are trick-taking, with bombs, straights, and a wild card that rotates every round.
The goal: be the first team to promote through ranks 2→A.
The catch: you cannot communicate with your partner except through which cards you play.
Challenge 1: The Action Space is Enormous
In Go, you have ~300 legal moves per turn. In Poker, maybe ~100.
In Guandan, a single turn can have 10,000+ legal actions. Why? Because with 27 cards in hand from a 108-card deck, the number of valid combinations (singles, pairs, triples, straights, bombs of various sizes) explodes.
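A toy illustration of why this explodes (not the full Guandan move generator, and the sample hand is made up): even counting only rank-based singles, pairs, and triples in a 27-card hand gives dozens of options before straights, plates, tubes, and multi-size bombs are considered.

```python
from collections import Counter

# Hypothetical 27-card hand drawn from the double deck (ranks only;
# a real move generator also tracks suits for straights).
hand = ["2", "2", "3", "3", "3", "4", "5", "5", "6", "7", "7", "7", "8",
        "9", "9", "10", "J", "Q", "Q", "K", "K", "A", "A", "A",
        "JOKER", "JOKER", "J"]
counts = Counter(hand)

singles = len(counts)                                  # distinct ranks playable alone
pairs   = sum(1 for c in counts.values() if c >= 2)    # ranks with a playable pair
triples = sum(1 for c in counts.values() if c >= 3)    # ranks with a playable triple
print(singles, pairs, triples)  # → 14 10 3
```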
Our Solution: Two-Stage Architecture
Turn Decision:

```
Stage 1: What TYPE of action?
  → Classifier over 13 categories
  → [single, pair, triple, straight, plate, tube,
     bomb_4, bomb_5, bomb_6, bomb_7, bomb_8, rocket, pass]

Stage 2: WHICH cards for that type?
  → Pointer network conditioned on stage 1 output
  → Selects specific cards from hand
```

This decomposition reduces the effective branching factor by ~50x. Stage 1 is a simple 13-class classification; stage 2 only needs to rank cards within the chosen type.
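A minimal PyTorch sketch of this decomposition (dimensions and layer choices here are illustrative assumptions, not the production architecture): a 13-way type head picks the action category, and a pointer-style scorer ranks hand cards conditioned on an embedding of that category.

```python
import torch
import torch.nn as nn

NUM_ACTION_TYPES = 13  # single, pair, ..., bomb_8, rocket, pass

class TwoStagePolicy(nn.Module):
    """Sketch: choose an action TYPE, then score hand cards for that type."""

    def __init__(self, state_dim=128, card_dim=32, type_emb_dim=16):
        super().__init__()
        self.type_head = nn.Linear(state_dim, NUM_ACTION_TYPES)   # stage 1
        self.type_emb = nn.Embedding(NUM_ACTION_TYPES, type_emb_dim)
        self.card_scorer = nn.Linear(card_dim + type_emb_dim, 1)  # stage 2

    def forward(self, state, hand_cards, legal_type_mask):
        # state: (B, state_dim); hand_cards: (B, n_cards, card_dim)
        # legal_type_mask: (B, 13) bool — which action types are legal now
        type_logits = self.type_head(state)
        type_logits = type_logits.masked_fill(~legal_type_mask, float("-inf"))
        chosen_type = type_logits.argmax(dim=-1)   # greedy pick for the sketch
        t = self.type_emb(chosen_type).unsqueeze(1)
        t = t.expand(-1, hand_cards.size(1), -1)
        card_scores = self.card_scorer(torch.cat([hand_cards, t], dim=-1))
        return chosen_type, card_scores.squeeze(-1)  # per-card scores

policy = TwoStagePolicy()
types, scores = policy(torch.randn(2, 128),
                       torch.randn(2, 27, 32),
                       torch.ones(2, 13, dtype=torch.bool))
```

Instead of one softmax over 10,000+ composite actions, the network makes one 13-way choice and then a per-card ranking, which is where the ~50x branching-factor reduction comes from.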
Challenge 2: Cooperation Without Communication
This is the hard part. In Bridge, partners have a bidding system — an explicit communication protocol. In Guandan, there's nothing.
Yet human experts develop rich implicit signaling:
| Play Pattern | Implicit Meaning |
|---|---|
| Leading low when you could win | "I'm saving strength — you take the lead" |
| Playing an unusual suit first | "I'm strong in this suit" |
| NOT bombing when you could | "I have a plan, trust me" |
| Discarding from a long suit | "This suit is safe for you to play" |
Our Solution: LSTM History Encoder
```python
import torch
import torch.nn as nn

class PartnerBeliefModel(nn.Module):
    def __init__(self, card_feature_dim, action_type_dim,
                 partner_hand_estimate_dim, strategy_embedding_dim):
        super().__init__()
        # Encodes the full sequence of (card features, action type) plays
        self.history_encoder = nn.LSTM(
            input_size=card_feature_dim + action_type_dim,
            hidden_size=256,
            num_layers=2,
            batch_first=True,
        )
        # Estimated distribution over the partner's remaining cards
        self.belief_head = nn.Linear(256, partner_hand_estimate_dim)
        # Embedding of the partner's inferred strategy
        self.intent_head = nn.Linear(256, strategy_embedding_dim)

    def forward(self, play_history):
        # play_history: (batch, seq_len, card_feature_dim + action_type_dim)
        _, (h_n, _) = self.history_encoder(play_history)
        partner_belief = self.belief_head(h_n[-1])
        partner_intent = self.intent_head(h_n[-1])
        return partner_belief, partner_intent
```
The key insight: we don't explicitly program signaling conventions. We let them emerge from self-play. After ~500M games, agents develop consistent patterns that look remarkably like human expert conventions.
The wildest finding: different training runs produce different "dialects." Two agents from the same training run cooperate beautifully. Pair agents from different runs, and coordination drops by 15-20%.
Challenge 3: Dynamic Wild Cards
Each round has a "current rank" (starts at 2, promotes through A). All cards of the current rank become wild.
This means:
- Your hand's value changes every round
- A "bad hand" in round 5 might be amazing in round 8
- Bombs that exist in one round might dissolve in the next
We handle this with rank-conditioned policy networks — the current rank is embedded and concatenated with the game state before every decision.
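A sketch of the rank-conditioning idea (dimensions are hypothetical): the current rank, indexed 0 to 12 for 2 through A, is embedded and concatenated with the encoded game state before the policy head, so the same weights can re-value a hand as the wild rank rotates.

```python
import torch
import torch.nn as nn

class RankConditionedPolicy(nn.Module):
    """Sketch: concatenate a current-rank embedding with the game state."""

    def __init__(self, state_dim=128, rank_emb_dim=16, num_actions=13):
        super().__init__()
        self.rank_emb = nn.Embedding(13, rank_emb_dim)  # ranks 2..10, J, Q, K, A
        self.policy_head = nn.Linear(state_dim + rank_emb_dim, num_actions)

    def forward(self, state, current_rank):
        # state: (B, state_dim); current_rank: (B,) long tensor in [0, 12]
        r = self.rank_emb(current_rank)
        return self.policy_head(torch.cat([state, r], dim=-1))

policy = RankConditionedPolicy()
logits = policy(torch.randn(4, 128), torch.tensor([0, 5, 11, 12]))
```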
Challenge 4: The Tribute System
After each round, losers must give their best card to the winners. This creates a unique information asymmetry: you KNOW one card in your opponent's hand.
Tribute Info Encoding:
- Card given to opponent: one-hot (certain knowledge)
- Card received from partner: one-hot (certain knowledge)
- Bayesian update of opponent hand distribution
Smart agents exploit this known card — e.g., if you know the opponent has a specific Ace, you avoid playing into situations where that Ace can beat you.
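One simple way to fold the tribute card into a belief state (the post doesn't give the exact encoding, so the expected-count representation below is an assumption): treat the opponent-hand belief as an expected count per card type, pin one copy of the known card to certainty, and renormalize the uncertain mass over the remaining hand slots.

```python
import numpy as np

NUM_CARD_TYPES = 54  # two decks, so each type can appear 0..2 times in a hand

def apply_tribute_knowledge(opp_hand_probs, known_card, hand_size):
    """Sketch of a Bayesian-style update. opp_hand_probs[i] is the expected
    count of card type i in the opponent's hand (sums to hand_size).
    Observing the tribute makes one copy of known_card certain; the rest of
    the probability mass is rescaled to fill the remaining slots."""
    certain = np.zeros(NUM_CARD_TYPES)
    certain[known_card] = 1.0
    # Remove one expected copy of the known card from the uncertain mass
    residual = opp_hand_probs.copy()
    residual[known_card] = max(residual[known_card] - 1.0, 0.0)
    # Rescale the uncertain mass to the remaining hand_size - 1 slots
    total = residual.sum()
    if total > 0:
        residual *= (hand_size - 1) / total
    return certain + residual

# Usage: a uniform prior over 54 types for a 27-card hand, then the update
prior = np.full(NUM_CARD_TYPES, 27 / NUM_CARD_TYPES)
belief = apply_tribute_knowledge(prior, known_card=0, hand_size=27)
```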
Training Details
- Algorithm: DMC (Deep Monte Carlo) with prioritized experience replay
- Self-play: 500M games, ~72 hours on 8× A100 GPUs
- Curriculum: Start with rule-based opponents, gradually transition to self-play
- Key trick: We found that training cooperation required ~3x more iterations than adversarial play. Agents learn to attack quickly but learn restraint slowly.
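For readers unfamiliar with DMC: the core update regresses state-action values directly against the Monte Carlo return of a finished episode, with no bootstrapping. A minimal sketch (network size, learning rate, and feature dimension are placeholders, and the prioritized-replay weighting is omitted):

```python
import torch
import torch.nn as nn

# Deep Monte Carlo in one function: after an episode ends, every
# (state, action) pair in it is regressed toward the episode's final return.
q_net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def dmc_update(state_action_feats, episode_return):
    # state_action_feats: (T, 64) — features for each step of one episode
    target = torch.full((state_action_feats.size(0), 1), episode_return)
    loss = nn.functional.mse_loss(q_net(state_action_feats), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

loss = dmc_update(torch.randn(5, 64), 1.0)
```

With prioritized experience replay, episodes with larger regression error would simply be sampled more often when forming these batches.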
Results
| Metric | Our Agent | Rule-based | Random |
|---|---|---|---|
| Win rate vs baseline | 72.3% | 50.0% | 12.1% |
| Bomb timing accuracy | 68% | 41% | 22% |
| Partner coordination | 0.73 | 0.52 | 0.31 |
The "bomb timing accuracy" metric deserves explanation: we measured how often the agent's bomb usage matched expert-annotated "correct bomb timing" in a test set of 10,000 games. Beginner players bomb immediately; experts hold bombs for critical moments.
What Surprised Us
Restraint is harder than aggression. Teaching an agent when NOT to play took 3x longer to converge.
Difficulty tuning is its own problem. Making AI beatable-but-fun required a separate "difficulty controller" that intentionally introduces calibrated sub-optimality. Playing against a perfect agent is miserable.
Cross-game transfer works. Pre-training on Dou Di Zhu (a simpler 3-player card game) then fine-tuning on Guandan saved ~40% training time.
Humans prefer the teaching AI. Our agent that explains its reasoning ("I played the 3 because your partner likely has the straight") retains users 3x longer than the silent version.
Open Problems
- Zero-shot partner coordination: Can we build agents that cooperate well with ANY partner, even unseen ones?
- Natural language explanation: Generating human-readable strategy explanations in real-time
- Cultural variant adaptation: Guandan has regional rule differences too — can we adapt quickly?
If you're working on cooperative multi-agent RL or imperfect-information games, I'd love to compare notes. Drop a comment or reach out.
This is part of a series on building AI for traditional games. Next up: "Why Mahjong's 200+ Regional Variants are a Nightmare for AI (and a Gift for Transfer Learning)"