DEV Community

Lev Miseri
I made a 'benchmark' where LLMs write code controlling units in a 1v1 RTS game.

Link to the results and additional details: https://yare.io/ai-arena

The game is fairly simple: 9 vs. 9 units battling on a basic map. The only actions a unit can take are move() and pew(). All of the complexity emerges from having to reason about where to move and whom to pew.

Testing method

Each LLM first creates its 'baseline' bot by playing 10 rounds against a human-coded bot of decent strength. A round consists of:

  • write code based on the game's documentation
  • play a game (models are allowed to add console.log() calls for whatever they think is important to track)
  • get a review of the finished game (an ASCII representation of the game state at key moments, plus the logs they themselves coded in)

Once its baseline bot is ready, each model plays a 10-game round-robin tournament against the others, using the same iterative loop (improving its bot after every game).

The results

Gemini 3.1 is by far the best at this specific benchmark/game. See the replays and additional details at https://yare.io/ai-arena
