Why Code Golfing Is the Ultimate Test for Multimodal LLMs (And a New Benchmark to Prove It)

#opensource #ai #webdev #benchmark

Hi everyone,

I wanted to share a project I’ve been building and recently open-sourced: ClawBattle.

As a long-time software developer and a big fan of CSSBattle (I'm currently in the top 2 on its leaderboard), I wanted to see how well current LLMs perform at code golfing.

It turns out this task is also excellent for benchmarking. It combines vision and text understanding, so only multimodal models (supporting both text and image inputs) are candidates for this test suite.
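To make the two sides of the task concrete, here is a minimal sketch of how a visual code-golf submission could be judged: how closely the rendered output matches the target image, and how many characters the code uses. This is not ClawBattle's or CSSBattle's actual scoring. The names `match_ratio`, `code_length`, and `evaluate` are hypothetical, images are assumed to be simple grids of RGB tuples, and the step of rendering the model's code in a browser is left out.

```python
from typing import Dict, List, Tuple, Union

# Assumption for illustration: an image is a grid of (R, G, B) tuples.
Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def match_ratio(target: Image, rendered: Image) -> float:
    """Return the fraction of pixels that are identical in both images."""
    same_shape = len(target) == len(rendered) and all(
        len(t_row) == len(r_row) for t_row, r_row in zip(target, rendered)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in target)
    if total == 0:
        raise ValueError("images must not be empty")
    same = sum(
        1
        for t_row, r_row in zip(target, rendered)
        for t, r in zip(t_row, r_row)
        if t == r
    )
    return same / total


def code_length(source: str) -> int:
    """Return the number of characters in a submission; shorter is better."""
    return len(source)


def evaluate(target: Image, rendered: Image, source: str) -> Dict[str, Union[int, float]]:
    """Report both dimensions of a submission: visual match and size."""
    return {"match": match_ratio(target, rendered), "chars": code_length(source)}
```

A model has to do well on both numbers at once: it needs to read the target image correctly and then express it in as few characters as possible.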

Right now, OpenAI's GPT-5.5 is by far the best model on this benchmark. I also just added Gemini 3.5 Flash. It beats the previous models but does not set a new record on this specific task.

Most modern LLM benchmarks suffer from data contamination: the models have already seen the test solutions during training. ClawBattle avoids this problem:

The benchmark runs on specific battle targets whose top-tier solutions are confidential and not publicly available (with the sole exception of Target 1). That means the models cannot have memorized or trained on the optimal code.

Having achieved a top rank in code golfing myself, I belong to a very small group of players who actually know these top-tier solutions. I used this knowledge to design the evaluation suite, so the benchmark tests true problem-solving, visual understanding, and logic generation rather than memorization.

Check out the results here: https://beowolve.github.io/ClawBattle/

GitHub: https://github.com/Beowolve/ClawBattle

Enjoy!