Max Polaczuk

Posted on Jun 23

Stop Benchmarking AI Coding Agents on Todo Apps. Make Them Build an MMO.

#ai #claude #gamedev #opensource

A coding agent that can scaffold a todo app is no longer interesting.

Give it an MMO.

An online world is not one feature. It is a collision between real-time networking, persistent state, combat math, character progression, inventories, economies, social systems, content, deployment, security, and a client that must remain responsive while the server remains in control.

Every subsystem can break every other subsystem.

That is why an MMO is such a useful stress test for AI-assisted software development—and why we decided to try building one in a weekend.

The result was World of ClaudeCraft, a free and open-source browser MMO.

Not a trailer. Not a generated screenshot. Not a single-room demo with one enemy.

A live, persistent world you can enter from a browser.

What the weekend produced—and what came next

A 48-hour experiment with Anthropic's Claude Fable 5 produced the playable foundation. It has since grown into a level 1–20 world with nine classes, three open zones, nearly 90 quests, instanced dungeons, boss mechanics, parties, trading, duels, ranked PvP, persistent characters, offline play, mobile controls, and 14 locales.

The independent MMO publication MMORPG.com called it “surprisingly complete” and noted that studio demos it had played were in rougher condition despite far longer development cycles.

That outside reaction matters. But it is not the most interesting part.

The most interesting part is that you can inspect the code.

levy-street / world-of-claudecraft

World of ClaudeCraft

Quest, group up, and raid a hand-built world, free in your browser. Open source, web3, and online right now.

English · Español · Español (España) · Français · Français (Canada) · Italiano · Deutsch · 简体中文 · 繁體中文 · 한국어 · 日本語 · Português (Brasil) · Русский

Play now · Host your own world · Train an agent · Web3 · Contributing · Discord

What this is

World of ClaudeCraft is a complete classic-era MMO you can play right now in your browser, host yourself with one command, and even train AI agents to play. It is free, open source, and live at worldofclaudecraft.com.

One shared world runs in three places, all from the same game core:

the offline browser world, where you click Play Offline and you are in,
the authoritative multiplayer server, where Postgres-backed accounts share a live world,
the headless RL…


The repository is MIT-licensed. You can run the game locally, host your own persistent world, train an agent against the real simulation, or fork the project and change its rules.

That makes World of ClaudeCraft more than an AI demo. It makes it a falsifiable engineering artifact.

The architecture is the real story

The project is held together by one rule:

There is one deterministic game simulation, and every host uses it.

                         ┌─────────────────────────────┐
                         │  Deterministic TypeScript   │
                         │     simulation: src/sim     │
                         └──────────┬───────┬──────────┘
                                    │       │
                      ┌─────────────┘       └─────────────┐
                      │                                   │
              Offline browser                    Authoritative server
              instant local world                REST + WebSocket + Postgres
                      │                                   │
                      └──────────────┐       ┌────────────┘
                                     │       │
                                  Headless RL host
                                  Python + Gymnasium

The offline browser world, multiplayer server, and reinforcement-learning environment do not contain three reimplementations of the game. They host the same core.

The renderer and HUD talk to an IWorld interface rather than a concrete implementation. Offline, the local simulation satisfies that interface. Online, a client-side mirror consumes server snapshots. In both cases, the presentation layer sees a world—not the machinery behind it.
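To make the idea concrete, here is a minimal sketch of how one interface can sit in front of two hosts. Only the name IWorld comes from the project description above. The members of the interface, the LocalWorld and MirrorWorld classes, and the message shape are assumptions made for illustration, not the repository's actual code.

```typescript
// Hypothetical shapes: the article names IWorld but does not show its members.
interface EntityView {
  id: number;
  x: number;
  y: number;
  hp: number;
}

interface IWorld {
  tick: number;
  entities(): EntityView[];
  sendMove(id: number, dx: number, dy: number): void;
}

// Offline host (assumed name): owns the state and advances it locally.
class LocalWorld implements IWorld {
  tick = 0;
  private state = new Map<number, EntityView>();
  private pending: Array<{ id: number; dx: number; dy: number }> = [];

  constructor(initial: EntityView[]) {
    for (const e of initial) this.state.set(e.id, { ...e });
  }

  entities(): EntityView[] {
    return [...this.state.values()].map((e) => ({ ...e }));
  }

  sendMove(id: number, dx: number, dy: number): void {
    this.pending.push({ id, dx, dy });
  }

  // Apply queued intents, then advance the tick counter.
  step(): void {
    for (const m of this.pending) {
      const e = this.state.get(m.id);
      if (e) {
        e.x += m.dx;
        e.y += m.dy;
      }
    }
    this.pending = [];
    this.tick += 1;
  }
}

// Online host (assumed name): holds no rules, only the latest server snapshot.
class MirrorWorld implements IWorld {
  tick = 0;
  private snapshot: EntityView[] = [];
  readonly outbox: string[] = [];

  entities(): EntityView[] {
    return this.snapshot.map((e) => ({ ...e }));
  }

  sendMove(id: number, dx: number, dy: number): void {
    // A real client would write to a WebSocket; this sketch only queues the message.
    this.outbox.push(JSON.stringify({ type: "move", id, dx, dy }));
  }

  applySnapshot(tick: number, entities: EntityView[]): void {
    this.tick = tick;
    this.snapshot = entities.map((e) => ({ ...e }));
  }
}

// The presentation layer depends only on IWorld, never on a concrete host.
function describeWorld(world: IWorld): string {
  return world
    .entities()
    .map((e) => `#${e.id}@(${e.x},${e.y}) hp=${e.hp}`)
    .join("; ");
}
```

In this sketch describeWorld stands in for the renderer and HUD: it produces the same output whether it is handed a LocalWorld or a MirrorWorld that has received an equivalent snapshot.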

Meanwhile, the multiplayer server is authoritative. Clients send movement intent and commands. The server resolves combat rolls, loot, quest credit, trades, vendor transactions, and progression. PostgreSQL persists character state. The browser renders the result.
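A rough sketch of what server authority means in practice: the client message carries intent only, and the server decides whether the intent is legal and what it does. The command shapes, the constants, and the handleCommand function below are hypothetical. The article does not show the real protocol.

```typescript
// Hypothetical command shapes. Note that no command carries a damage value:
// the client states what it wants to do, not what happened.
type Command =
  | { type: "move"; dx: number; dy: number }
  | { type: "attack"; targetId: number };

interface Actor {
  id: number;
  x: number;
  y: number;
  hp: number;
}

interface ServerState {
  actors: Map<number, Actor>;
}

type Result = { ok: true } | { ok: false; reason: string };

const MAX_STEP = 1; // assumed per-command movement cap
const MELEE_RANGE = 2; // assumed attack range
const MELEE_DAMAGE = 5; // assumed fixed damage; a real game would roll here

function handleCommand(state: ServerState, actorId: number, cmd: Command): Result {
  const actor = state.actors.get(actorId);
  if (!actor || actor.hp <= 0) return { ok: false, reason: "no living actor" };

  if (cmd.type === "move") {
    // Reject malformed or oversized steps instead of trusting the client.
    if (!Number.isFinite(cmd.dx) || !Number.isFinite(cmd.dy)) {
      return { ok: false, reason: "bad vector" };
    }
    if (Math.hypot(cmd.dx, cmd.dy) > MAX_STEP) {
      return { ok: false, reason: "step too large" };
    }
    actor.x += cmd.dx;
    actor.y += cmd.dy;
    return { ok: true };
  }

  // Remaining case: "attack". The server checks the target and the range.
  const target = state.actors.get(cmd.targetId);
  if (!target || target.hp <= 0) return { ok: false, reason: "invalid target" };
  if (Math.hypot(target.x - actor.x, target.y - actor.y) > MELEE_RANGE) {
    return { ok: false, reason: "out of range" };
  }
  target.hp = Math.max(0, target.hp - MELEE_DAMAGE);
  return { ok: true };
}
```

The design choice worth noticing is what is absent: a modified client can ask to teleport or to hit a target across the map, but in this sketch the request is simply rejected and the state does not change.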

That separation is not glamorous, but it is why the prototype could keep growing without becoming code confetti.

AI output is cheap. Architectural invariants are valuable.

Determinism became a force multiplier

The simulation does not depend on wall-clock time or Math.random(). Seed the environment, replay the actions, and you can reproduce the episode.

That single constraint creates several advantages:

Offline and online combat can follow the same rules.
Tests can reproduce failures instead of chasing intermittent state.
A headless agent can train against the actual game rather than a simplified port.
Automated parties can clear dungeons and expose regressions across combat, threat, healing, loot, and encounter scripting.
Cosmetic systems can remain outside the simulation. Even biome weather is render-only, so rain cannot accidentally change game state.
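The replay property is easy to demonstrate in miniature. The sketch below is an illustration under assumptions, not the project's simulation: it uses a small mulberry32-style seeded generator in place of Math.random(), counts time in ticks instead of reading a clock, and uses made-up combat numbers. The function and type names are hypothetical.

```typescript
// A small mulberry32-style seeded generator. All randomness in the sketch
// flows through it, so the seed fully determines the sequence of rolls.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

type Action = "attack" | "wait";

interface Episode {
  tick: number;
  enemyHp: number;
  log: string[];
}

// Hypothetical toy episode: same seed plus same actions gives the same result.
function runEpisode(seed: number, actions: Action[]): Episode {
  const rand = mulberry32(seed);
  const ep: Episode = { tick: 0, enemyHp: 30, log: [] };
  for (const action of actions) {
    ep.tick += 1; // time is a tick counter, never Date.now()
    if (action === "attack" && ep.enemyHp > 0) {
      const hit = rand() < 0.8; // assumed 80% hit chance
      const damage = hit ? 3 + Math.floor(rand() * 4) : 0; // assumed 3 to 6 damage
      ep.enemyHp = Math.max(0, ep.enemyHp - damage);
      ep.log.push(`t${ep.tick}: ${hit ? `hit ${damage}` : "miss"}`);
    }
  }
  return ep;
}

// Replaying a recorded episode reproduces it exactly.
const actions: Action[] = ["attack", "wait", "attack", "attack"];
const first = runEpisode(1234, actions);
const replay = runEpisode(1234, actions);
console.log(JSON.stringify(first) === JSON.stringify(replay)); // prints true
```

A failing test in this style can be reduced to a seed and an action list, which is what makes an intermittent bug reproducible.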

This is the kind of detail that gets lost when people evaluate AI coding by counting generated lines.

The better question is: did the system become easier to verify?

Procedural generation was a scope strategy, not a gimmick

A weekend MMO cannot wait for a conventional asset pipeline.

Much of World of ClaudeCraft's scenery, terrain, weather, spell iconography, animation, and sound is generated at runtime. The project also uses credited, freely licensed assets where appropriate.

That choice reduced coordination costs and kept iteration close to the code. A developer could change a biome, creature family, icon system, or sound generator without waiting for a separate production queue.

It also produced an unexpected benefit: contributors can change more of the world through code.

Procedural generation did not remove art direction. It compressed the distance between an idea and a testable result.

A serious coding benchmark needs cross-cutting consequences

A todo app usually rewards local correctness. Add a field, update a form, change a schema, pass a test.

An MMO punishes local thinking.

Add one class ability and you may need to touch:

simulation rules,
resource costs and cooldowns,
targeting and range validation,
server command handling,
client presentation and feedback,
action bars and tooltips,
persistence,
localization,
bots and automated tests,
PvE and PvP balance.
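One way to keep that list manageable is to describe an ability as data and let each layer read the same record. The sketch below is hypothetical: the AbilityDef fields, the frostbolt example, and the helper names are assumptions for illustration and are not taken from the repository.

```typescript
// Hypothetical single source of truth for one ability.
interface AbilityDef {
  id: string;
  nameKey: string; // localization key, resolved by the client
  manaCost: number;
  cooldownTicks: number;
  range: number;
}

interface Caster {
  mana: number;
  x: number;
  y: number;
  readyAtTick: Record<string, number>;
}

type CastCheck = "ok" | "not enough mana" | "on cooldown" | "out of range";

// Simulation rule: resource, cooldown, and range validation in one place.
function checkCast(
  def: AbilityDef,
  caster: Caster,
  target: { x: number; y: number },
  tick: number,
): CastCheck {
  if (caster.mana < def.manaCost) return "not enough mana";
  if ((caster.readyAtTick[def.id] ?? 0) > tick) return "on cooldown";
  if (Math.hypot(target.x - caster.x, target.y - caster.y) > def.range) {
    return "out of range";
  }
  return "ok";
}

// Pay the cost and start the cooldown after a successful check.
function applyCast(def: AbilityDef, caster: Caster, tick: number): void {
  caster.mana -= def.manaCost;
  caster.readyAtTick[def.id] = tick + def.cooldownTicks;
}

// Localization check: which locale tables lack a string for this ability?
function missingLocales(
  def: AbilityDef,
  locales: Record<string, Record<string, string>>,
): string[] {
  return Object.entries(locales)
    .filter(([, table]) => !(def.nameKey in table))
    .map(([code]) => code);
}

// Illustrative values only.
const frostbolt: AbilityDef = {
  id: "frostbolt",
  nameKey: "ability.frostbolt.name",
  manaCost: 20,
  cooldownTicks: 30,
  range: 25,
};
```

A test such as missingLocales can run in CI, so a new ability that ships without a translation fails a check instead of showing a raw key in a tooltip.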

This is where long-horizon coding agents become interesting. The challenge is not producing ten files. The challenge is preserving a coherent model across ten files while the product is changing underneath them.

World of ClaudeCraft suggests a more useful way to measure AI-assisted development:

Do not ask how much code the model wrote. Ask how many consequential decisions the human team could validate per hour.

Good harnesses, deterministic state, explicit interfaces, fast tests, browser automation, and server authority make that number go up. Prompt cleverness alone does not.

What the AI did—and what it did not do

Claude accelerated implementation dramatically. It helped turn a deliberately unreasonable scope into a playable foundation over one weekend.

That does not mean “one prompt replaced a game studio.”

The model did not decide why the world should exist. It did not define acceptable game feel. It did not own the architectural invariants, choose what to cut, judge whether combat was fun, operate the live service, review community contributions, or decide what deserved to ship.

Human direction stayed in the loop because software is not valuable merely when it compiles. It is valuable when someone can explain its constraints, test its behavior, and take responsibility for the result.

The practical lesson is less dramatic—and more useful:

Small teams can now attempt systems that were previously irrational for them to attempt.

The most important feature arrived after launch

After the first version went public, people did not only play it. They opened issues, submitted pull requests, forked the repository, translated text, fixed bugs, and built new systems.

The GitHub project quickly passed 1,000 stars and 300 forks. The game continued growing from its weekend foundation into a larger world with more quests, multiplayer systems, dungeons, PvP, and raid content.

The model that seeded the project was publicly available for only three days before access was suspended. The world it helped create kept moving.

That may be the more durable pattern:

AI compresses the cost of the first playable system. Open source and human community determine whether it becomes a living one.

Run it before arguing about it

You can play World of ClaudeCraft in your browser with no download, or run your own world from the source:

git clone https://github.com/levy-street/world-of-claudecraft.git
cd world-of-claudecraft
cp .env.example .env

# Set a strong POSTGRES_PASSWORD in .env, then:
docker compose up -d --build

# Open http://localhost:8787

Then inspect src/sim/. Break an invariant. Add an ability. Train an agent. Build a boss mechanic. Open a pull request.

The next useful AI coding benchmark should not be another landing page.

It should be a system with enough moving parts to fight back.

What would you add to make this benchmark harder?
