<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: NanMing</title>
    <description>The latest articles on DEV Community by NanMing (@malinguo).</description>
    <link>https://dev.to/malinguo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3879573%2Ff43b012b-2ee7-456b-b977-21cfe8e33635.jpg</url>
      <title>DEV Community: NanMing</title>
      <link>https://dev.to/malinguo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/malinguo"/>
    <language>en</language>
    <item>
      <title>Building AI for a 2v2 Card Game: How We Solved Cooperative Imperfect Information</title>
      <dc:creator>NanMing</dc:creator>
      <pubDate>Wed, 15 Apr 2026 04:30:53 +0000</pubDate>
      <link>https://dev.to/malinguo/building-ai-for-a-2v2-card-game-how-we-solved-cooperative-imperfect-information-kmo</link>
      <guid>https://dev.to/malinguo/building-ai-for-a-2v2-card-game-how-we-solved-cooperative-imperfect-information-kmo</guid>
      <description>&lt;p&gt;Training AI to play adversarial games is well-understood. Chess, Go, Poker — we have frameworks for all of them.&lt;/p&gt;

&lt;p&gt;But what happens when your AI needs to &lt;strong&gt;cooperate&lt;/strong&gt; with a partner it can't talk to, while competing against opponents whose cards it can't see?&lt;/p&gt;

&lt;p&gt;That's the challenge of Guandan (掼蛋), a 2v2 trick-taking card game played by 140 million people in China. And it broke almost every assumption we had about game AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Guandan uses two standard 54-card decks (108 cards total). Four players, two teams. You sit across from your partner. Play is trick-taking, with bombs, straights, and a rotating wild rank that changes every round.&lt;/p&gt;

&lt;p&gt;The goal: be the first team to promote through ranks 2→A.&lt;/p&gt;

&lt;p&gt;The catch: &lt;strong&gt;you cannot communicate with your partner&lt;/strong&gt; except through which cards you play.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenge 1: The Action Space is Enormous
&lt;/h2&gt;

&lt;p&gt;In Go, you have ~300 legal moves per turn. In Poker, maybe ~100.&lt;/p&gt;

&lt;p&gt;In Guandan, a single turn can have &lt;strong&gt;10,000+ legal actions&lt;/strong&gt;. Why? Because with 27 cards in hand from a 108-card deck, the number of valid combinations (singles, pairs, triples, straights, bombs of various sizes) explodes.&lt;/p&gt;
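&lt;p&gt;To see where the blow-up starts, here's a toy counter for just the three simplest action types (an illustrative sketch, not our engine's enumerator):&lt;/p&gt;

```python
from collections import Counter
from math import comb

def count_basic_actions(hand_ranks):
    """Count the simplest Guandan action types in a hand (ranks only).

    Singles, pairs, and triples alone; a real enumerator also emits
    straights, plates, tubes, and bombs, which is where the action
    count blows past 10,000 for a dense 27-card hand.
    """
    counts = Counter(hand_ranks)
    singles = len(hand_ranks)
    pairs = sum(comb(c, 2) for c in counts.values())
    triples = sum(comb(c, 3) for c in counts.values())
    return singles, pairs, triples

# With two decks, duplicated ranks multiply the combinations fast:
print(count_basic_actions(["A"] * 4 + ["K"] * 3 + ["Q"] * 2))  # → (9, 10, 5)
```

&lt;p&gt;Nine cards already yield 24 distinct simple actions; add multi-card structures over 27 cards and the count explodes.&lt;/p&gt;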

&lt;h3&gt;
  
  
  Our Solution: Two-Stage Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Turn Decision:
  Stage 1: What TYPE of action? 
    → Classifier over 13 categories
    → [single, pair, triple, straight, plate, tube, 
       bomb_4, bomb_5, bomb_6, bomb_7, bomb_8, rocket, pass]

  Stage 2: WHICH cards for that type?
    → Pointer network conditioned on stage 1 output
    → Selects specific cards from hand
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This decomposition reduces the effective branching factor by ~50x. Stage 1 is a simple 13-class classification. Stage 2 only needs to rank cards within the chosen type.&lt;/p&gt;
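&lt;p&gt;A minimal sketch of the decomposition, with plain dictionaries standing in for the two networks' outputs:&lt;/p&gt;

```python
ACTION_TYPES = ["single", "pair", "triple", "straight", "plate", "tube",
                "bomb_4", "bomb_5", "bomb_6", "bomb_7", "bomb_8",
                "rocket", "pass"]

def two_stage_decide(type_scores, card_scores, legal_actions):
    """Two-stage action selection with placeholder scores.

    type_scores and card_scores stand in for the stage-1 classifier and
    stage-2 pointer network; legal_actions maps each action type to the
    concrete card combinations currently playable. Illegal types are
    masked out before the argmax, just as the real networks mask logits.
    """
    legal_types = [t for t in ACTION_TYPES if legal_actions.get(t)]
    chosen_type = max(legal_types, key=lambda t: type_scores.get(t, 0.0))
    chosen_cards = max(legal_actions[chosen_type],
                       key=lambda a: card_scores.get(chosen_type, {}).get(a, 0.0))
    return chosen_type, chosen_cards

choice = two_stage_decide(
    type_scores={"pair": 1.2, "single": 0.3, "pass": 0.1},
    card_scores={"pair": {("5h", "5s"): 0.9, ("3d", "3c"): 0.2}},
    legal_actions={"single": [("3d",)],
                   "pair": [("5h", "5s"), ("3d", "3c")],
                   "pass": [()]},
)
print(choice)  # → ('pair', ('5h', '5s'))
```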

&lt;h2&gt;
  
  
  Challenge 2: Cooperation Without Communication
&lt;/h2&gt;

&lt;p&gt;This is the hard part. In Bridge, partners have a bidding system — an explicit communication protocol. In Guandan, there's nothing.&lt;/p&gt;

&lt;p&gt;Yet human experts develop rich implicit signaling:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Play Pattern&lt;/th&gt;
&lt;th&gt;Implicit Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Leading low when you could win&lt;/td&gt;
&lt;td&gt;"I'm saving strength — you take the lead"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Playing an unusual suit first&lt;/td&gt;
&lt;td&gt;"I'm strong in this suit"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NOT bombing when you could&lt;/td&gt;
&lt;td&gt;"I have a plan, trust me"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Discarding from a long suit&lt;/td&gt;
&lt;td&gt;"This suit is safe for you to play"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Our Solution: LSTM History Encoder
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PartnerBeliefModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history_encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LSTM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;input_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;card_feature_dim&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;action_type_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;hidden_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;num_layers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;belief_head&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;partner_hand_estimate_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;intent_head&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strategy_embedding_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;play_history&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h_n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;history_encoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;play_history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;partner_belief&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;belief_head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h_n&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;partner_intent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;intent_head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h_n&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;partner_belief&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;partner_intent&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: we don't explicitly program signaling conventions. We let them &lt;strong&gt;emerge from self-play&lt;/strong&gt;. After ~500M games, agents develop consistent patterns that look remarkably like human expert conventions.&lt;/p&gt;

&lt;p&gt;The wildest finding: &lt;strong&gt;different training runs produce different "dialects."&lt;/strong&gt; Two agents from the same training run cooperate beautifully. Pair agents from different runs, and coordination drops by 15-20%.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenge 3: Dynamic Wild Cards
&lt;/h2&gt;

&lt;p&gt;Each round has a "current rank" (starts at 2, promotes through A). The heart cards of the current rank become wild.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your hand's value changes every round&lt;/li&gt;
&lt;li&gt;A "bad hand" in round 5 might be amazing in round 8&lt;/li&gt;
&lt;li&gt;Bombs that exist in one round might dissolve in the next&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We handle this with &lt;strong&gt;rank-conditioned policy networks&lt;/strong&gt; — the current rank is embedded and concatenated with the game state before every decision.&lt;/p&gt;
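&lt;p&gt;A minimal sketch of the conditioning step, using a one-hot in place of the learned embedding:&lt;/p&gt;

```python
RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]

def encode_state(state_features, current_rank):
    """Append a one-hot of the current rank to the flat state vector.

    The real networks use a learned embedding; a one-hot is the
    simplest stand-in and shows the idea: the same hand encodes
    differently depending on which rank is currently wild.
    """
    one_hot = [1.0 if r == current_rank else 0.0 for r in RANKS]
    return list(state_features) + one_hot

vec = encode_state([0.5, 0.25], "5")
print(len(vec), vec[2 + RANKS.index("5")])  # → 15 1.0
```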

&lt;h2&gt;
  
  
  Challenge 4: The Tribute System
&lt;/h2&gt;

&lt;p&gt;After each round, losers must give their best card to the winners. This creates a unique information asymmetry: &lt;strong&gt;you KNOW one card in your opponent's hand&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tribute Info Encoding:
  - Card given to opponent: one-hot (certain knowledge)
  - Card received from partner: one-hot (certain knowledge)  
  - Bayesian update of opponent hand distribution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Smart agents exploit this known card — e.g., if you know the opponent has a specific Ace, you avoid playing into situations where that Ace can beat you.&lt;/p&gt;
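&lt;p&gt;A crude sketch of that first Bayesian step: a uniform prior over unseen cards, with tribute-known cards clamped to certainty (the function and its renormalization are illustrative, not production code):&lt;/p&gt;

```python
def opponent_card_probs(unseen_cards, hand_size, known_cards=()):
    """Per-card probability that a given opponent holds each unseen card.

    Starts from a uniform prior, clamps tribute-known cards to
    probability 1, and renormalizes the rest so the probabilities
    still sum to the opponent's hand size. The real system layers
    further inference from play history on top of a baseline like this.
    """
    known = set(known_cards)
    unknown = [c for c in unseen_cards if c not in known]
    p_unknown = (hand_size - len(known)) / len(unknown)
    return {c: (1.0 if c in known else p_unknown) for c in unseen_cards}

probs = opponent_card_probs(["As", "Kd", "Qh", "Jc"], hand_size=2,
                            known_cards=["As"])
print(probs["As"], round(sum(probs.values()), 6))  # → 1.0 2.0
```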

&lt;h2&gt;
  
  
  Training Details
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Algorithm&lt;/strong&gt;: DMC (Deep Monte Carlo) with prioritized experience replay&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-play&lt;/strong&gt;: 500M games, ~72 hours on 8× A100 GPUs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Curriculum&lt;/strong&gt;: Start with rule-based opponents, gradually transition to self-play&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key trick&lt;/strong&gt;: We found that training cooperation required ~3x more iterations than adversarial play. Agents learn to attack quickly but learn restraint slowly.&lt;/li&gt;
&lt;/ul&gt;
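&lt;p&gt;The rule-based-to-self-play transition can be sketched as an opponent-sampling schedule (the linear decay is illustrative; the exact shape matters less than the gradual handoff):&lt;/p&gt;

```python
import random

def sample_opponent(step, total_steps, rng=random):
    """Curriculum sketch: the chance of facing the rule-based bot
    decays linearly from 1 to 0 across training, gradually shifting
    the agent into pure self-play."""
    p_rule_based = max(0.0, 1.0 - step / total_steps)
    return "rule_based" if p_rule_based > rng.random() else "self_play"

print(sample_opponent(0, 1000))     # → rule_based (always, at the start)
print(sample_opponent(1000, 1000))  # → self_play (always, at the end)
```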

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Our Agent&lt;/th&gt;
&lt;th&gt;Rule-based&lt;/th&gt;
&lt;th&gt;Random&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Win rate vs baseline&lt;/td&gt;
&lt;td&gt;72.3%&lt;/td&gt;
&lt;td&gt;50.0%&lt;/td&gt;
&lt;td&gt;12.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bomb timing accuracy&lt;/td&gt;
&lt;td&gt;68%&lt;/td&gt;
&lt;td&gt;41%&lt;/td&gt;
&lt;td&gt;22%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partner coordination&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;0.52&lt;/td&gt;
&lt;td&gt;0.31&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The "bomb timing accuracy" metric deserves explanation: we measured how often the agent's bomb usage matched expert-annotated "correct bomb timing" in a test set of 10,000 games. Beginner players bomb immediately; experts hold bombs for critical moments.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Surprised Us
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Restraint is harder than aggression.&lt;/strong&gt; Teaching an agent when NOT to play took 3x longer to converge.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Difficulty tuning is its own problem.&lt;/strong&gt; Making AI beatable-but-fun required a separate "difficulty controller" that intentionally introduces calibrated sub-optimality. Playing against a perfect agent is miserable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cross-game transfer works.&lt;/strong&gt; Pre-training on Dou Di Zhu (a simpler 3-player card game) then fine-tuning on Guandan saved ~40% training time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Humans prefer the teaching AI.&lt;/strong&gt; Our agent that explains its reasoning ("I played the 3 because your partner likely has the straight") retains users 3x longer than the silent version.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Open Problems
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-shot partner coordination&lt;/strong&gt;: Can we build agents that cooperate well with ANY partner, even unseen ones?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Natural language explanation&lt;/strong&gt;: Generating human-readable strategy explanations in real-time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cultural variant adaptation&lt;/strong&gt;: Guandan has regional rule differences too — can we adapt quickly?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're working on cooperative multi-agent RL or imperfect-information games, I'd love to compare notes. Drop a comment or reach out.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is part of a series on building AI for traditional games. Next up: "Why Mahjong's 200+ Regional Variants are a Nightmare for AI (and a Gift for Transfer Learning)"&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>algorithms</category>
      <category>gamedev</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Why Mahjong AI is 10x Harder Than Go AI (And What We Learned Building One)</title>
      <dc:creator>NanMing</dc:creator>
      <pubDate>Wed, 15 Apr 2026 04:30:24 +0000</pubDate>
      <link>https://dev.to/malinguo/why-mahjong-ai-is-10x-harder-than-go-ai-and-what-we-learned-building-one-eli</link>
      <guid>https://dev.to/malinguo/why-mahjong-ai-is-10x-harder-than-go-ai-and-what-we-learned-building-one-eli</guid>
      <description>&lt;p&gt;Six months ago, I started working on Mahjong AI. I assumed it would be easier than Go AI.&lt;/p&gt;

&lt;p&gt;Go's state space is 10^170 — "more possible positions than atoms in the universe." Mahjong only has 136 tiles. Intuitively, it should be simpler.&lt;/p&gt;

&lt;p&gt;I was completely wrong. Here's why, and what we learned building a multi-rule Mahjong AI engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenge 1: Imperfect Information
&lt;/h2&gt;

&lt;p&gt;Go is a &lt;strong&gt;perfect information&lt;/strong&gt; game. Both players see the entire board. AlphaGo's brilliance was in search + evaluation — exploring future board states and judging which ones are good.&lt;/p&gt;

&lt;p&gt;Mahjong is &lt;strong&gt;imperfect information&lt;/strong&gt;. You see 13 tiles in your hand. The other 123 tiles? You know some (discards are visible), but most are hidden. You're making decisions with ~70% of the information missing.&lt;/p&gt;

&lt;p&gt;This breaks MCTS (Monte Carlo Tree Search), the backbone of Go AI. Vanilla MCTS assumes you can roll the game forward from a fully known state. In Mahjong you can't, because you don't know what tiles the other players hold; every simulated future would start from a guess.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our approach:&lt;/strong&gt; Instead of tree search, we use &lt;strong&gt;LSTM networks&lt;/strong&gt; that learn to infer hidden information from observable signals (discard patterns, timing, claim/pass decisions). Think of it as teaching the AI to "read" opponents the way human experts do.&lt;/p&gt;
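&lt;p&gt;Every hand-reading model starts from the zero-inference baseline: counting which tiles are still hidden from your seat. A minimal sketch:&lt;/p&gt;

```python
from collections import Counter

def unseen_tile_counts(all_tiles, my_hand, visible):
    """From one seat, how many copies of each tile are still hidden.

    Subtract my hand and all visible tiles (discards, claimed melds)
    from the full tile set. A learned model like the LSTM described
    above refines this with discard patterns and timing.
    """
    seen = Counter(my_hand) + Counter(visible)
    total = Counter(all_tiles)
    return {t: total[t] - seen[t] for t in total}

tiles = ["1m"] * 4 + ["2m"] * 4
print(unseen_tile_counts(tiles, my_hand=["1m"], visible=["1m", "2m"]))
# → {'1m': 2, '2m': 3}
```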

&lt;h2&gt;
  
  
  Challenge 2: 200+ Rule Variants
&lt;/h2&gt;

&lt;p&gt;"Mahjong" isn't one game. It's a family of 200+ games.&lt;/p&gt;

&lt;p&gt;Changsha Mahjong has "Zha Niao" (bird catching) — after winning, you flip tiles to determine bonus multipliers. Sichuan Mahjong has "Xue Zhan Dao Di" (bloody fight to the end) — the game continues after the first winner until only one loser remains. Japanese Riichi has entirely different scoring, with concepts like "furiten" (you can't win on a tile you previously discarded).&lt;/p&gt;

&lt;p&gt;Each variant requires a &lt;strong&gt;separate model&lt;/strong&gt;. Training 8 models from scratch would be prohibitively expensive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our approach:&lt;/strong&gt; &lt;strong&gt;Shared base model + rule-specific adapter layers.&lt;/strong&gt; The base model learns general Mahjong skills (tile efficiency, defense, hand reading). Adapter layers encode variant-specific rules. This is similar to how multilingual NLP models handle different languages.&lt;/p&gt;

&lt;p&gt;Result: Training a new variant takes ~40% less compute compared to training from scratch. The model transfers skills like "don't discard tiles your opponent might need" across all variants.&lt;/p&gt;
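&lt;p&gt;A toy illustration of the base-plus-adapter split, with hand-written stats and made-up variant fields standing in for real network layers:&lt;/p&gt;

```python
def base_features(hand):
    """Stand-in for the shared base model: variant-agnostic hand stats.
    In the real system this is a trained network, not written rules."""
    return {"n_tiles": len(hand), "n_kinds": len(set(hand))}

# Per-variant 'adapter' heads bolt rule-specific outputs onto the
# shared features (variant names and fields are illustrative only):
ADAPTERS = {
    "sichuan": lambda f: dict(f, continues_after_first_win=True),
    "riichi": lambda f: dict(f, needs_furiten_check=True),
}

def evaluate_hand(hand, variant):
    # Shared trunk runs once; only the small adapter differs per variant.
    return ADAPTERS[variant](base_features(hand))

print(evaluate_hand(["1m", "1m", "2m"], "riichi")["needs_furiten_check"])
# → True
```

&lt;p&gt;Adding a variant means adding one small adapter, not retraining the trunk, which is where the compute saving comes from.&lt;/p&gt;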

&lt;h2&gt;
  
  
  Challenge 3: Multi-Agent Dynamics
&lt;/h2&gt;

&lt;p&gt;Go is 1v1. Mahjong is 4-player free-for-all (or 2v2 in some variants).&lt;/p&gt;

&lt;p&gt;In a 4-player game, optimal strategy isn't just "maximize my winning probability." It's "maximize my winning probability WHILE considering that three other rational agents are doing the same." This is significantly harder than 2-player zero-sum games.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; You're one tile away from winning. But the tile you need was just discarded by the player to your left. Should you claim it? In some variants, claiming a discard to win is legal but reveals information. In Riichi Mahjong, you might actually choose NOT to claim it if you're in furiten.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our approach:&lt;/strong&gt; We train with &lt;strong&gt;self-play across 4 agents simultaneously&lt;/strong&gt;, using Deep Monte Carlo (DMC) methods. Each agent learns not just its own optimal strategy, but also models of what the other three agents are likely to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenge 4: Reward Signal Sparsity
&lt;/h2&gt;

&lt;p&gt;In Go, every move changes the board state, providing rich feedback signals. In Mahjong, a game can last 20+ turns before anyone wins — and most of those turns are "draw a tile, discard a tile" with no immediate feedback on whether you're playing well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our approach:&lt;/strong&gt; &lt;strong&gt;Auxiliary reward signals.&lt;/strong&gt; Beyond win/lose, we give partial rewards for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hand efficiency improvements (getting closer to a winning hand)&lt;/li&gt;
&lt;li&gt;Successful defensive plays (avoiding dealing into opponents' wins)&lt;/li&gt;
&lt;li&gt;Information gathering (making discards that reveal useful information)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This dramatically accelerates training convergence.&lt;/p&gt;
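&lt;p&gt;A sketch of the shaping, with placeholder weights (the real values need tuning so the auxiliary terms never dominate the true win/lose objective):&lt;/p&gt;

```python
def shaped_reward(outcome, efficiency_gain, defended, info_gain,
                  w_eff=0.1, w_def=0.05, w_info=0.02):
    """Sparse outcome plus dense auxiliary terms.

    The three auxiliary signals mirror the list above: hand-efficiency
    improvement, a successful defensive play, and information gained.
    The weights here are made-up placeholders.
    """
    return (outcome
            + w_eff * efficiency_gain
            + w_def * (1.0 if defended else 0.0)
            + w_info * info_gain)

print(round(shaped_reward(1.0, 2.0, True, 0.5), 6))  # → 1.26
```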

&lt;h2&gt;
  
  
  Challenge 5: Stochastic Elements
&lt;/h2&gt;

&lt;p&gt;Go has zero randomness. Every game state is deterministic.&lt;/p&gt;

&lt;p&gt;Mahjong has massive randomness. The tile draw sequence is random. Your starting hand is random. Other players' hands are random. A "perfect" AI can still lose to a novice due to unlucky draws.&lt;/p&gt;

&lt;p&gt;This means evaluation requires &lt;strong&gt;thousands of games&lt;/strong&gt; to measure statistical significance. A 2% win rate improvement that would be obvious in Go takes 10,000+ games to confirm in Mahjong.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Learned (Technical Summary)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Go AI&lt;/th&gt;
&lt;th&gt;Mahjong AI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Information&lt;/td&gt;
&lt;td&gt;Perfect&lt;/td&gt;
&lt;td&gt;Imperfect (~70% hidden)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Core technique&lt;/td&gt;
&lt;td&gt;MCTS + neural net&lt;/td&gt;
&lt;td&gt;LSTM + DMC self-play&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rules&lt;/td&gt;
&lt;td&gt;Single ruleset&lt;/td&gt;
&lt;td&gt;200+ variants&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Players&lt;/td&gt;
&lt;td&gt;2 (zero-sum)&lt;/td&gt;
&lt;td&gt;4 (general-sum)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Randomness&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;High (tile draws)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evaluation&lt;/td&gt;
&lt;td&gt;Single game sufficient&lt;/td&gt;
&lt;td&gt;Thousands needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State space&lt;/td&gt;
&lt;td&gt;Larger (10^170)&lt;/td&gt;
&lt;td&gt;Smaller but hidden&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Action space&lt;/td&gt;
&lt;td&gt;~300/move&lt;/td&gt;
&lt;td&gt;~50/move but context-dependent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training data&lt;/td&gt;
&lt;td&gt;Public game records&lt;/td&gt;
&lt;td&gt;Variant-specific, often scarce&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Surprising Takeaway
&lt;/h2&gt;

&lt;p&gt;The hardest part of Mahjong AI isn't any single technical challenge. It's that &lt;strong&gt;all five challenges exist simultaneously&lt;/strong&gt;. Go AI researchers can focus on search algorithms because information is perfect and rules are fixed. Poker AI researchers can focus on imperfect information because the game is well-defined and 2-player.&lt;/p&gt;

&lt;p&gt;Mahjong AI requires solving imperfect information + multi-agent dynamics + stochastic outcomes + variable rule sets, all at once. It's a uniquely challenging benchmark for game AI research.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Transformer exploration&lt;/strong&gt; — attention mechanisms might better capture "who played what" relationships than LSTM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Online adaptation&lt;/strong&gt; — adjusting strategy in real-time based on opponent tendencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Natural language coaching&lt;/strong&gt; — using LLMs to translate AI decisions into human-readable explanations ("Don't play 3-wan because your opponent likely needs it for a straight")&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;I'm building a multi-rule game AI engine covering 7 Mahjong variants + Guandan + Dou Di Zhu + Texas Hold'em. If you're working on game AI or imperfect information games, I'd love to compare notes in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>algorithms</category>
      <category>computerscience</category>
      <category>gamedev</category>
    </item>
  </channel>
</rss>
