<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: NanMing</title>
    <description>The latest articles on DEV Community by NanMing (@malinguo).</description>
    <link>https://dev.to/malinguo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3879573%2Ff43b012b-2ee7-456b-b977-21cfe8e33635.jpg</url>
      <title>DEV Community: NanMing</title>
      <link>https://dev.to/malinguo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/malinguo"/>
    <language>en</language>
    <item>
      <title>Building AI for a 2v2 Card Game: How We Solved Cooperative Imperfect Information</title>
      <dc:creator>NanMing</dc:creator>
      <pubDate>Wed, 15 Apr 2026 04:30:53 +0000</pubDate>
      <link>https://dev.to/malinguo/building-ai-for-a-2v2-card-game-how-we-solved-cooperative-imperfect-information-kmo</link>
      <guid>https://dev.to/malinguo/building-ai-for-a-2v2-card-game-how-we-solved-cooperative-imperfect-information-kmo</guid>
      <description>&lt;p&gt;Training AI to play adversarial games is well-understood. Chess, Go, Poker — we have frameworks for all of them.&lt;/p&gt;

&lt;p&gt;But what happens when your AI needs to &lt;strong&gt;cooperate&lt;/strong&gt; with a partner it can't talk to, while competing against opponents whose cards it can't see?&lt;/p&gt;

&lt;p&gt;That's the challenge of Guandan (掼蛋), a 2v2 trick-taking card game played by 140 million people in China. And it broke almost every assumption we had about game AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Guandan uses two standard 54-card decks (108 cards total). Four players, two teams. You sit across from your partner. Play is trick-taking, with bombs, straights, and a rotating wild rank that changes every round.&lt;/p&gt;

&lt;p&gt;The goal: be the first team to promote through ranks 2→A.&lt;/p&gt;

&lt;p&gt;The catch: &lt;strong&gt;you cannot communicate with your partner&lt;/strong&gt; except through which cards you play.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenge 1: The Action Space is Enormous
&lt;/h2&gt;

&lt;p&gt;In Go, you have ~300 legal moves per turn. In Poker, maybe ~100.&lt;/p&gt;

&lt;p&gt;In Guandan, a single turn can have &lt;strong&gt;10,000+ legal actions&lt;/strong&gt;. Why? Because with 27 cards in hand from a 108-card deck, the number of valid combinations (singles, pairs, triples, straights, bombs of various sizes) explodes.&lt;/p&gt;
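&lt;p&gt;To see where the blow-up starts, here's a toy counter for just the three simplest action types (an illustrative sketch, not our engine's enumerator):&lt;/p&gt;

```python
from collections import Counter
from math import comb

def count_basic_actions(hand_ranks):
    """Count the simplest Guandan action types in a hand (ranks only).

    Singles, pairs, and triples alone; a real enumerator also emits
    straights, plates, tubes, and bombs, which is where the action
    count blows past 10,000 for a dense 27-card hand.
    """
    counts = Counter(hand_ranks)
    singles = len(hand_ranks)
    pairs = sum(comb(c, 2) for c in counts.values())
    triples = sum(comb(c, 3) for c in counts.values())
    return singles, pairs, triples

# With two decks, duplicated ranks multiply the combinations fast:
print(count_basic_actions(["A"] * 4 + ["K"] * 3 + ["Q"] * 2))  # → (9, 10, 5)
```

&lt;p&gt;Nine cards already yield 24 distinct simple actions; add multi-card structures over 27 cards and the count explodes.&lt;/p&gt;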

&lt;h3&gt;
  
  
  Our Solution: Two-Stage Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Turn Decision:
  Stage 1: What TYPE of action? 
    → Classifier over 13 categories
    → [single, pair, triple, straight, plate, tube, 
       bomb_4, bomb_5, bomb_6, bomb_7, bomb_8, rocket, pass]

  Stage 2: WHICH cards for that type?
    → Pointer network conditioned on stage 1 output
    → Selects specific cards from hand
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This decomposition reduces the effective branching factor by ~50x. Stage 1 is a simple 13-class classification. Stage 2 only needs to rank cards within the chosen type.&lt;/p&gt;
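&lt;p&gt;A minimal sketch of the decomposition, with plain dictionaries standing in for the two networks' outputs:&lt;/p&gt;

```python
ACTION_TYPES = ["single", "pair", "triple", "straight", "plate", "tube",
                "bomb_4", "bomb_5", "bomb_6", "bomb_7", "bomb_8",
                "rocket", "pass"]

def two_stage_decide(type_scores, card_scores, legal_actions):
    """Two-stage action selection with placeholder scores.

    type_scores and card_scores stand in for the stage-1 classifier and
    stage-2 pointer network; legal_actions maps each action type to the
    concrete card combinations currently playable. Illegal types are
    masked out before the argmax, just as the real networks mask logits.
    """
    legal_types = [t for t in ACTION_TYPES if legal_actions.get(t)]
    chosen_type = max(legal_types, key=lambda t: type_scores.get(t, 0.0))
    chosen_cards = max(legal_actions[chosen_type],
                       key=lambda a: card_scores.get(chosen_type, {}).get(a, 0.0))
    return chosen_type, chosen_cards

choice = two_stage_decide(
    type_scores={"pair": 1.2, "single": 0.3, "pass": 0.1},
    card_scores={"pair": {("5h", "5s"): 0.9, ("3d", "3c"): 0.2}},
    legal_actions={"single": [("3d",)],
                   "pair": [("5h", "5s"), ("3d", "3c")],
                   "pass": [()]},
)
print(choice)  # → ('pair', ('5h', '5s'))
```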

&lt;h2&gt;
  
  
  Challenge 2: Cooperation Without Communication
&lt;/h2&gt;

&lt;p&gt;This is the hard part. In Bridge, partners have a bidding system — an explicit communication protocol. In Guandan, there's nothing.&lt;/p&gt;

&lt;p&gt;Yet human experts develop rich implicit signaling:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Play Pattern&lt;/th&gt;
&lt;th&gt;Implicit Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Leading low when you could win&lt;/td&gt;
&lt;td&gt;"I'm saving strength — you take the lead"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Playing an unusual suit first&lt;/td&gt;
&lt;td&gt;"I'm strong in this suit"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NOT bombing when you could&lt;/td&gt;
&lt;td&gt;"I have a plan, trust me"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Discarding from a long suit&lt;/td&gt;
&lt;td&gt;"This suit is safe for you to play"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Our Solution: LSTM History Encoder
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PartnerBeliefModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history_encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LSTM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;input_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;card_feature_dim&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;action_type_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;hidden_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;num_layers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;belief_head&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;partner_hand_estimate_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;intent_head&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strategy_embedding_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;play_history&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h_n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;history_encoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;play_history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;partner_belief&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;belief_head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h_n&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;partner_intent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;intent_head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h_n&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;partner_belief&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;partner_intent&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: we don't explicitly program signaling conventions. We let them &lt;strong&gt;emerge from self-play&lt;/strong&gt;. After ~500M games, agents develop consistent patterns that look remarkably like human expert conventions.&lt;/p&gt;

&lt;p&gt;The wildest finding: &lt;strong&gt;different training runs produce different "dialects."&lt;/strong&gt; Two agents from the same training run cooperate beautifully. Pair agents from different runs, and coordination drops by 15-20%.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenge 3: Dynamic Wild Cards
&lt;/h2&gt;

&lt;p&gt;Each round has a "current rank" (starts at 2, promotes through A). The heart cards of the current rank become wild.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your hand's value changes every round&lt;/li&gt;
&lt;li&gt;A "bad hand" in round 5 might be amazing in round 8&lt;/li&gt;
&lt;li&gt;Bombs that exist in one round might dissolve in the next&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We handle this with &lt;strong&gt;rank-conditioned policy networks&lt;/strong&gt; — the current rank is embedded and concatenated with the game state before every decision.&lt;/p&gt;
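&lt;p&gt;A minimal sketch of the conditioning step, using a one-hot in place of the learned embedding:&lt;/p&gt;

```python
RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]

def encode_state(state_features, current_rank):
    """Append a one-hot of the current rank to the flat state vector.

    The real networks use a learned embedding; a one-hot is the
    simplest stand-in and shows the idea: the same hand encodes
    differently depending on which rank is currently wild.
    """
    one_hot = [1.0 if r == current_rank else 0.0 for r in RANKS]
    return list(state_features) + one_hot

vec = encode_state([0.5, 0.25], "5")
print(len(vec), vec[2 + RANKS.index("5")])  # → 15 1.0
```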

&lt;h2&gt;
  
  
  Challenge 4: The Tribute System
&lt;/h2&gt;

&lt;p&gt;After each round, losers must give their best card to the winners. This creates a unique information asymmetry: &lt;strong&gt;you KNOW one card in your opponent's hand&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tribute Info Encoding:
  - Card given to opponent: one-hot (certain knowledge)
  - Card received from partner: one-hot (certain knowledge)  
  - Bayesian update of opponent hand distribution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Smart agents exploit this known card — e.g., if you know the opponent has a specific Ace, you avoid playing into situations where that Ace can beat you.&lt;/p&gt;
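&lt;p&gt;A crude sketch of that first Bayesian step: a uniform prior over unseen cards, with tribute-known cards clamped to certainty (the function and its renormalization are illustrative, not production code):&lt;/p&gt;

```python
def opponent_card_probs(unseen_cards, hand_size, known_cards=()):
    """Per-card probability that a given opponent holds each unseen card.

    Starts from a uniform prior, clamps tribute-known cards to
    probability 1, and renormalizes the rest so the probabilities
    still sum to the opponent's hand size. The real system layers
    further inference from play history on top of a baseline like this.
    """
    known = set(known_cards)
    unknown = [c for c in unseen_cards if c not in known]
    p_unknown = (hand_size - len(known)) / len(unknown)
    return {c: (1.0 if c in known else p_unknown) for c in unseen_cards}

probs = opponent_card_probs(["As", "Kd", "Qh", "Jc"], hand_size=2,
                            known_cards=["As"])
print(probs["As"], round(sum(probs.values()), 6))  # → 1.0 2.0
```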

&lt;h2&gt;
  
  
  Training Details
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Algorithm&lt;/strong&gt;: DMC (Deep Monte Carlo) with prioritized experience replay&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-play&lt;/strong&gt;: 500M games, ~72 hours on 8× A100 GPUs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Curriculum&lt;/strong&gt;: Start with rule-based opponents, gradually transition to self-play&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key trick&lt;/strong&gt;: We found that training cooperation required ~3x more iterations than adversarial play. Agents learn to attack quickly but learn restraint slowly.&lt;/li&gt;
&lt;/ul&gt;
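&lt;p&gt;The rule-based-to-self-play transition can be sketched as an opponent-sampling schedule (the linear decay is illustrative; the exact shape matters less than the gradual handoff):&lt;/p&gt;

```python
import random

def sample_opponent(step, total_steps, rng=random):
    """Curriculum sketch: the chance of facing the rule-based bot
    decays linearly from 1 to 0 across training, gradually shifting
    the agent into pure self-play."""
    p_rule_based = max(0.0, 1.0 - step / total_steps)
    return "rule_based" if p_rule_based > rng.random() else "self_play"

print(sample_opponent(0, 1000))     # → rule_based (always, at the start)
print(sample_opponent(1000, 1000))  # → self_play (always, at the end)
```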

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Our Agent&lt;/th&gt;
&lt;th&gt;Rule-based&lt;/th&gt;
&lt;th&gt;Random&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Win rate vs baseline&lt;/td&gt;
&lt;td&gt;72.3%&lt;/td&gt;
&lt;td&gt;50.0%&lt;/td&gt;
&lt;td&gt;12.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bomb timing accuracy&lt;/td&gt;
&lt;td&gt;68%&lt;/td&gt;
&lt;td&gt;41%&lt;/td&gt;
&lt;td&gt;22%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partner coordination&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;0.52&lt;/td&gt;
&lt;td&gt;0.31&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The "bomb timing accuracy" metric deserves explanation: we measured how often the agent's bomb usage matched expert-annotated "correct bomb timing" in a test set of 10,000 games. Beginner players bomb immediately; experts hold bombs for critical moments.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Surprised Us
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Restraint is harder than aggression.&lt;/strong&gt; Teaching an agent when NOT to play took 3x longer to converge.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Difficulty tuning is its own problem.&lt;/strong&gt; Making AI beatable-but-fun required a separate "difficulty controller" that intentionally introduces calibrated sub-optimality. Playing against a perfect agent is miserable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cross-game transfer works.&lt;/strong&gt; Pre-training on Dou Di Zhu (a simpler 3-player card game) then fine-tuning on Guandan saved ~40% training time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Humans prefer the teaching AI.&lt;/strong&gt; Our agent that explains its reasoning ("I played the 3 because your partner likely has the straight") retains users 3x longer than the silent version.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Open Problems
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-shot partner coordination&lt;/strong&gt;: Can we build agents that cooperate well with ANY partner, even unseen ones?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Natural language explanation&lt;/strong&gt;: Generating human-readable strategy explanations in real-time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cultural variant adaptation&lt;/strong&gt;: Guandan has regional rule differences too — can we adapt quickly?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're working on cooperative multi-agent RL or imperfect-information games, I'd love to compare notes. Drop a comment or reach out.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is part of a series on building AI for traditional games. Next up: "Why Mahjong's 200+ Regional Variants are a Nightmare for AI (and a Gift for Transfer Learning)"&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>algorithms</category>
      <category>gamedev</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Why Mahjong AI is 10x Harder Than Go AI (And What We Learned Building One)</title>
      <dc:creator>NanMing</dc:creator>
      <pubDate>Wed, 15 Apr 2026 04:30:24 +0000</pubDate>
      <link>https://dev.to/malinguo/why-mahjong-ai-is-10x-harder-than-go-ai-and-what-we-learned-building-one-eli</link>
      <guid>https://dev.to/malinguo/why-mahjong-ai-is-10x-harder-than-go-ai-and-what-we-learned-building-one-eli</guid>
      <description>&lt;p&gt;Six months ago, I started working on Mahjong AI. I assumed it would be easier than Go AI.&lt;/p&gt;

&lt;p&gt;Go's state space is 10^170 — "more possible positions than atoms in the universe." Mahjong only has 136 tiles. Intuitively, it should be simpler.&lt;/p&gt;

&lt;p&gt;I was completely wrong. Here's why, and what we learned building a multi-rule Mahjong AI engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenge 1: Imperfect Information
&lt;/h2&gt;

&lt;p&gt;Go is a &lt;strong&gt;perfect information&lt;/strong&gt; game. Both players see the entire board. AlphaGo's brilliance was in search + evaluation — exploring future board states and judging which ones are good.&lt;/p&gt;

&lt;p&gt;Mahjong is &lt;strong&gt;imperfect information&lt;/strong&gt;. You see 13 tiles in your hand. The other 123 tiles? You know some (discards are visible), but most are hidden. You're making decisions with ~70% of the information missing.&lt;/p&gt;

&lt;p&gt;This breaks MCTS (Monte Carlo Tree Search), the backbone of Go AI. Vanilla MCTS assumes you can roll the game forward from a fully known state. In Mahjong you can't, because you don't know what tiles the other players hold; every simulated future would start from a guess.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our approach:&lt;/strong&gt; Instead of tree search, we use &lt;strong&gt;LSTM networks&lt;/strong&gt; that learn to infer hidden information from observable signals (discard patterns, timing, claim/pass decisions). Think of it as teaching the AI to "read" opponents the way human experts do.&lt;/p&gt;
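&lt;p&gt;Every hand-reading model starts from the zero-inference baseline: counting which tiles are still hidden from your seat. A minimal sketch:&lt;/p&gt;

```python
from collections import Counter

def unseen_tile_counts(all_tiles, my_hand, visible):
    """From one seat, how many copies of each tile are still hidden.

    Subtract my hand and all visible tiles (discards, claimed melds)
    from the full tile set. A learned model like the LSTM described
    above refines this with discard patterns and timing.
    """
    seen = Counter(my_hand) + Counter(visible)
    total = Counter(all_tiles)
    return {t: total[t] - seen[t] for t in total}

tiles = ["1m"] * 4 + ["2m"] * 4
print(unseen_tile_counts(tiles, my_hand=["1m"], visible=["1m", "2m"]))
# → {'1m': 2, '2m': 3}
```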

&lt;h2&gt;
  
  
  Challenge 2: 200+ Rule Variants
&lt;/h2&gt;

&lt;p&gt;"Mahjong" isn't one game. It's a family of 200+ games.&lt;/p&gt;

&lt;p&gt;Changsha Mahjong has "Zha Niao" (bird catching) — after winning, you flip tiles to determine bonus multipliers. Sichuan Mahjong has "Xue Zhan Dao Di" (bloody fight to the end) — the game continues after the first winner until only one loser remains. Japanese Riichi has entirely different scoring, with concepts like "furiten" (you can't win on a tile you previously discarded).&lt;/p&gt;

&lt;p&gt;Each variant requires a &lt;strong&gt;separate model&lt;/strong&gt;. Training 8 models from scratch would be prohibitively expensive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our approach:&lt;/strong&gt; &lt;strong&gt;Shared base model + rule-specific adapter layers.&lt;/strong&gt; The base model learns general Mahjong skills (tile efficiency, defense, hand reading). Adapter layers encode variant-specific rules. This is similar to how multilingual NLP models handle different languages.&lt;/p&gt;

&lt;p&gt;Result: Training a new variant takes ~40% less compute compared to training from scratch. The model transfers skills like "don't discard tiles your opponent might need" across all variants.&lt;/p&gt;
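&lt;p&gt;A toy illustration of the base-plus-adapter split, with hand-written stats and made-up variant fields standing in for real network layers:&lt;/p&gt;

```python
def base_features(hand):
    """Stand-in for the shared base model: variant-agnostic hand stats.
    In the real system this is a trained network, not written rules."""
    return {"n_tiles": len(hand), "n_kinds": len(set(hand))}

# Per-variant 'adapter' heads bolt rule-specific outputs onto the
# shared features (variant names and fields are illustrative only):
ADAPTERS = {
    "sichuan": lambda f: dict(f, continues_after_first_win=True),
    "riichi": lambda f: dict(f, needs_furiten_check=True),
}

def evaluate_hand(hand, variant):
    # Shared trunk runs once; only the small adapter differs per variant.
    return ADAPTERS[variant](base_features(hand))

print(evaluate_hand(["1m", "1m", "2m"], "riichi")["needs_furiten_check"])
# → True
```

&lt;p&gt;Adding a variant means adding one small adapter, not retraining the trunk, which is where the compute saving comes from.&lt;/p&gt;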

&lt;h2&gt;
  
  
  Challenge 3: Multi-Agent Dynamics
&lt;/h2&gt;

&lt;p&gt;Go is 1v1. Mahjong is 4-player free-for-all (or 2v2 in some variants).&lt;/p&gt;

&lt;p&gt;In a 4-player game, optimal strategy isn't just "maximize my winning probability." It's "maximize my winning probability WHILE considering that three other rational agents are doing the same." This is significantly harder than 2-player zero-sum games.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; You're one tile away from winning. But the tile you need was just discarded by the player to your left. Should you claim it? In some variants, claiming a discard to win is legal but reveals information. In Riichi Mahjong, you might actually choose NOT to claim it if you're in furiten.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our approach:&lt;/strong&gt; We train with &lt;strong&gt;self-play across 4 agents simultaneously&lt;/strong&gt;, using Deep Monte Carlo (DMC) methods. Each agent learns not just its own optimal strategy, but also models of what the other three agents are likely to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenge 4: Reward Signal Sparsity
&lt;/h2&gt;

&lt;p&gt;In Go, every move changes the board state, providing rich feedback signals. In Mahjong, a game can last 20+ turns before anyone wins — and most of those turns are "draw a tile, discard a tile" with no immediate feedback on whether you're playing well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our approach:&lt;/strong&gt; &lt;strong&gt;Auxiliary reward signals.&lt;/strong&gt; Beyond win/lose, we give partial rewards for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hand efficiency improvements (getting closer to a winning hand)&lt;/li&gt;
&lt;li&gt;Successful defensive plays (avoiding dealing into opponents' wins)&lt;/li&gt;
&lt;li&gt;Information gathering (making discards that reveal useful information)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This dramatically accelerates training convergence.&lt;/p&gt;
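&lt;p&gt;A sketch of the shaping, with placeholder weights (the real values need tuning so the auxiliary terms never dominate the true win/lose objective):&lt;/p&gt;

```python
def shaped_reward(outcome, efficiency_gain, defended, info_gain,
                  w_eff=0.1, w_def=0.05, w_info=0.02):
    """Sparse outcome plus dense auxiliary terms.

    The three auxiliary signals mirror the list above: hand-efficiency
    improvement, a successful defensive play, and information gained.
    The weights here are made-up placeholders.
    """
    return (outcome
            + w_eff * efficiency_gain
            + w_def * (1.0 if defended else 0.0)
            + w_info * info_gain)

print(round(shaped_reward(1.0, 2.0, True, 0.5), 6))  # → 1.26
```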

&lt;h2&gt;
  
  
  Challenge 5: Stochastic Elements
&lt;/h2&gt;

&lt;p&gt;Go has zero randomness. Every game state is deterministic.&lt;/p&gt;

&lt;p&gt;Mahjong has massive randomness. The tile draw sequence is random. Your starting hand is random. Other players' hands are random. A "perfect" AI can still lose to a novice due to unlucky draws.&lt;/p&gt;

&lt;p&gt;This means evaluation requires &lt;strong&gt;thousands of games&lt;/strong&gt; to measure statistical significance. A 2% win rate improvement that would be obvious in Go takes 10,000+ games to confirm in Mahjong.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Learned (Technical Summary)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Go AI&lt;/th&gt;
&lt;th&gt;Mahjong AI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Information&lt;/td&gt;
&lt;td&gt;Perfect&lt;/td&gt;
&lt;td&gt;Imperfect (~70% hidden)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Core technique&lt;/td&gt;
&lt;td&gt;MCTS + neural net&lt;/td&gt;
&lt;td&gt;LSTM + DMC self-play&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rules&lt;/td&gt;
&lt;td&gt;Single ruleset&lt;/td&gt;
&lt;td&gt;200+ variants&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Players&lt;/td&gt;
&lt;td&gt;2 (zero-sum)&lt;/td&gt;
&lt;td&gt;4 (general-sum)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Randomness&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;High (tile draws)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evaluation&lt;/td&gt;
&lt;td&gt;Single game sufficient&lt;/td&gt;
&lt;td&gt;Thousands needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State space&lt;/td&gt;
&lt;td&gt;Larger (10^170)&lt;/td&gt;
&lt;td&gt;Smaller but hidden&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Action space&lt;/td&gt;
&lt;td&gt;~300/move&lt;/td&gt;
&lt;td&gt;~50/move but context-dependent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training data&lt;/td&gt;
&lt;td&gt;Public game records&lt;/td&gt;
&lt;td&gt;Variant-specific, often scarce&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Surprising Takeaway
&lt;/h2&gt;

&lt;p&gt;The hardest part of Mahjong AI isn't any single technical challenge. It's that &lt;strong&gt;all five challenges exist simultaneously&lt;/strong&gt;. Go AI researchers can focus on search algorithms because information is perfect and rules are fixed. Poker AI researchers can focus on imperfect information because the game is well-defined and 2-player.&lt;/p&gt;

&lt;p&gt;Mahjong AI requires solving imperfect information + multi-agent dynamics + stochastic outcomes + variable rule sets, all at once. It's a uniquely challenging benchmark for game AI research.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Transformer exploration&lt;/strong&gt; — attention mechanisms might better capture "who played what" relationships than LSTM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Online adaptation&lt;/strong&gt; — adjusting strategy in real-time based on opponent tendencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Natural language coaching&lt;/strong&gt; — using LLMs to translate AI decisions into human-readable explanations ("Don't play 3-wan because your opponent likely needs it for a straight")&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;I'm building a multi-rule game AI engine covering 7 Mahjong variants + Guandan + Dou Di Zhu + Texas Hold'em. If you're working on game AI or imperfect information games, I'd love to compare notes in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>algorithms</category>
      <category>computerscience</category>
      <category>gamedev</category>
    </item>
  </channel>
</rss>
