DEV Community

Mininglamp

I Let an Open-Source GUI Agent Play Mahjong. Here's What Happened.

The Challenge

Most GUI Agent demos show the same thing: open a browser, fill out a form, click "Submit." It works, it's useful, but it doesn't really stress-test what these agents can do.

We wanted to find out: what happens when you throw a GUI Agent into a completely unfamiliar, non-standard interface?

So we picked Mahjong — a Chinese tile game with complex rules, dense visual information, and a UI that has nothing in common with a typical web app.

Here's the raw video of Mano-P playing:

https://github.com/user-attachments/assets/397a0552-9611-4d74-9f24-99544da272b6

What is Mano-P?

Mano-P (GUI-VLA Agent for Edge Devices) is an open-source project from Mininglamp Technology. The name comes from "Mano" (Spanish for "hand") and "P" for Person + Party.

The key differentiator: Mano-P is purely vision-driven. It doesn't parse DOM trees, doesn't use accessibility APIs, doesn't rely on OCR as a preprocessing step. It takes a screenshot, understands what's on screen, and outputs mouse/keyboard actions.

Think of it as an AI that operates a computer the same way you do — by looking at the screen.

Why Mahjong is a Brutal Test Case

If you've played Mahjong, you know it's no joke. But even if you haven't, here's why it's an excellent stress test for a GUI Agent:

1. Dense, Visually Similar Elements

A Mahjong board has 136 tiles. Your hand has 13 tiles at a time. The tiles are small, visually similar (slight variations in dots, characters, bamboo patterns), and tightly packed. The agent needs pixel-level precision to distinguish a "3 of Dots" from a "5 of Dots."

2. Zero Structured Data

There's no HTML, no DOM, no accessibility tree. The game UI is rendered by a game engine — it's all pixels. This means any approach that relies on parsing page structure is out. Only pure vision works.

3. Strategic Reasoning Required

This isn't "see button, click button." The agent needs to:

  • Recognize all tiles in its hand
  • Evaluate possible winning combinations
  • Decide which tile to discard
  • React to other players' moves (pass, claim, or declare)

4. Asynchronous Multi-Player Flow

Mahjong is turn-based with 4 players. The agent has to wait for others, recognize when its own turn comes, handle variable timing, and respond to unexpected events (another player declaring a win, for instance).

How It Works: Think-Act-Verify

Mano-P doesn't just look once and act. It runs a continuous reasoning loop:

┌──────────┐     ┌─────────┐     ┌──────────┐
│  Think   │ ──▶ │   Act   │ ──▶ │  Verify  │
│ (analyze │     │(execute │     │(confirm  │
│  screen) │     │ action) │     │ result)  │
└──────────┘     └─────────┘     └──────────┘
      ▲                               │
      └──────── loop back ◀───────────┘
  • Think: Capture a screenshot, analyze the current game state. What tiles do I have? What's on the table? Is it my turn?
  • Act: Decide and execute an action — click a tile to discard, click "Pass," click "Claim."
  • Verify: Take another screenshot. Did my action register? Did the game state change as expected? If not, go back to Think.

This loop is critical for games, where animations, delays, and other players' actions create a constantly shifting interface.
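The loop above can be sketched in a few lines of Python. Everything here is a stub for illustration: the function names and the screenshot/state format are assumptions, not Mano-P's actual API.

```python
# Minimal sketch of the Think-Act-Verify loop, with stubs standing in
# for the real model inference and input injection. All names here are
# hypothetical; Mano-P's real interface may differ.

def think(screenshot):
    """Think: analyze the current game state (stubbed model call)."""
    # A real agent runs VLA inference here; this stub just discards
    # the first tile in hand whenever it's our turn.
    if screenshot["my_turn"]:
        return {"type": "click", "target": screenshot["hand"][0]}
    return {"type": "wait"}

def act(action):
    """Act: execute the chosen action (stubbed mouse/keyboard)."""
    return action["type"] != "wait"

def verify(before, after):
    """Verify: did the action register on screen? If not, re-Think."""
    return after["hand"] != before["hand"]

def run_loop(frames):
    """Run the loop over a sequence of captured screenshots."""
    verified_actions = []
    for before, after in zip(frames, frames[1:]):
        action = think(before)
        if act(action) and verify(before, after):
            verified_actions.append(action["target"])
        # else: fall through and loop back to Think on the next frame
    return verified_actions

# Two frames: it's our turn, we discard "3-dots", the hand shrinks.
frames = [
    {"my_turn": True, "hand": ["3-dots", "5-dots"]},
    {"my_turn": False, "hand": ["5-dots"]},
]
print(run_loop(frames))  # → ['3-dots']
```

The Verify step is what makes this robust against animations and lag: an action only counts once the screen confirms it, otherwise the agent simply thinks again.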

Training Pipeline

Mano-P uses a three-stage training approach:

| Stage | Method | Purpose |
|-------|--------|---------|
| 1 | SFT (Supervised Fine-Tuning) | Learn basic GUI recognition and operation |
| 2 | Offline RL (Reinforcement Learning) | Optimize action policies from recorded trajectories |
| 3 | Online RL | Interactive learning in real environments |

This progression moves from "imitate human actions" to "discover optimal strategies through exploration" — a pattern that's proven effective across many RL domains.
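As a rough illustration of that progression, the schedule could be written down as data. Stage names come from the table above; the data sources and objectives are paraphrases for illustration, not Mano-P's actual training configuration.

```python
# Hypothetical sketch of the three-stage schedule described above.
# Only the stage names are from the post; the rest is illustrative.
STAGES = [
    {"name": "SFT",        "data": "human demonstrations",  "goal": "imitate actions"},
    {"name": "Offline RL", "data": "recorded trajectories", "goal": "optimize policy"},
    {"name": "Online RL",  "data": "live environment",      "goal": "explore and improve"},
]

def describe_pipeline(stages):
    """Render the stage order as a single arrow-separated string."""
    return " -> ".join(stage["name"] for stage in stages)

print(describe_pipeline(STAGES))  # → SFT -> Offline RL -> Online RL
```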

Benchmark Results

Numbers matter. Here's where Mano-P stands:

OSWorld (Desktop App Automation)

| Model | Score |
|-------|-------|
| Mano-P 72B | 58.2% (Rank #1 among specialized models) |
| opencua-72b | 45.0% |

WebRetriever Protocol I (Web Interaction)

| Model | Score |
|-------|-------|
| Mano-P | 41.7 |
| Gemini 2.5 Pro | 40.9 |
| Claude 4.5 | 31.3 |

Edge Inference (4B Quantized, w4a16)

Running on Apple M4 + 32GB RAM:

| Metric | Value |
|--------|-------|
| Prefill throughput | 476 tok/s |
| Decode throughput | 76 tok/s |
| Peak memory | 4.3 GB |

That's fast enough for real-time GUI interaction on a local device. No cloud API calls, no data leaving your machine.
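A quick back-of-envelope check on those numbers: the per-step latency is roughly prompt tokens over prefill throughput plus output tokens over decode throughput. The ~1,500 image tokens per screenshot and ~40 tokens of action output below are assumptions for illustration, not reported figures.

```python
# Back-of-envelope step latency from the reported throughput numbers.
PREFILL_TOK_S = 476   # prompt (screenshot) processing, from the table
DECODE_TOK_S = 76     # action-text generation, from the table

def step_latency(prompt_tokens, output_tokens):
    """Approximate seconds per Think-Act step, ignoring overhead."""
    return prompt_tokens / PREFILL_TOK_S + output_tokens / DECODE_TOK_S

# Hypothetical: ~1,500 image tokens per screenshot, ~40 output tokens.
latency = step_latency(1500, 40)
print(round(latency, 2))  # → 3.68
```

A few seconds per reasoning step is comfortably within the pace of a turn-based game like Mahjong, all while staying on-device.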

Hardware note: The 4B model currently requires Apple M4 + 32GB RAM. Not all Macs can run it — be aware of this before trying.

What This Actually Means

The Mahjong demo is fun, but the real takeaway is about generalization.

Most GUI automation tools are brittle. Traditional RPA breaks when a button moves. DOM-based agents break when there's no DOM. Screen-scraping breaks when the UI updates.

A purely vision-driven agent doesn't have these dependencies. If a human can operate the application by looking at the screen, Mano-P can too — at least in principle. The Mahjong demo shows this isn't just theory:

  • ✅ Non-standard UI? Handled.
  • ✅ Visually dense interface? Handled.
  • ✅ Strategic reasoning? Handled.
  • ✅ Async multi-player flow? Handled.

The same architecture that plays Mahjong can automate legacy enterprise systems, operate desktop applications, or handle any GUI that doesn't expose a programmatic interface.

Open Source Roadmap

Mano-P is released under Apache 2.0 with a three-phase open-source plan:

| Phase | Content | Status |
|-------|---------|--------|
| Phase 1 | Skills (core capabilities) | ✅ Released |
| Phase 2 | Local models + SDK | Coming soon |
| Phase 3 | Training methodology | Planned |

Try It Out

The project is live on GitHub. Whether you're interested in GUI automation, VLA research, or just want to see an AI play Mahjong, check it out:

👉 github.com/Mininglamp-AI/Mano-P

If you're working on GUI agents or have thoughts on vision-driven automation, I'd love to hear from you in the comments. What would you test a GUI Agent on?


Built by the open-source team at Mininglamp Technology. Mano-P is Apache 2.0 licensed — contributions welcome.
