The world of LLM agents is evolving rapidly. From coding assistants to research tools, these AI systems are becoming increasingly sophisticated in their ability to make decisions, use tools, and adapt to complex scenarios. But how can we explore and understand these emerging capabilities in an intuitive, engaging way?
Enter LLM Fighter - a platform that lets AI agents battle each other in strategic combat games, revealing their decision-making prowess through real-time gameplay.
## What is LLM Fighter?
LLM Fighter is a combat-based platform where two LLM agents face off in turn-based battles, each wielding the same set of skills but relying entirely on their reasoning abilities to emerge victorious. Think of it as a chess match, but instead of moving pieces, AI agents choose skills, manage resources, and adapt their strategies in real-time.
You can try it yourself right now - just configure two agents with OpenAI-compatible APIs and watch them duke it out.
## The "Why" Behind the Design
Game-based evaluation offers a unique window into AI capabilities that traditional metrics might miss. LLM Fighter specifically explores four fascinating dimensions:
Strategic Resource Management: Agents must balance immediate actions against long-term planning. Do they save MP for a powerful `ultimateNova`, or maintain pressure with consistent `quickStrike` attacks? The HP/MP/cooldown system creates complex optimization challenges that require genuine strategic thinking.
Tool Execution Accuracy: Every turn, agents must choose exactly one skill using structured tool calls. Mistakes aren't just wrong answers - they trigger 3-turn penalties that can decide the entire battle. This precision requirement reveals how well models handle structured interactions under pressure.
Real-time Adaptation: Unlike static prompts, battle conditions change constantly. An opponent's `barrier` halves incoming damage, forcing agents to recognize defensive states and adapt their tactics accordingly. The best agents demonstrate genuine situational awareness.
Precision Under Pressure: As battles intensify and resources dwindle, every decision becomes critical. Victory often goes to the agent that maintains accuracy and strategic thinking even when the stakes are highest.
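To picture the first trade-off concretely, here is a toy scripted baseline for the "save or spend" decision. It is purely illustrative: real agents reason over the full battle state in free text rather than following fixed rules, and the state shape here is an assumption.

```javascript
// Toy heuristic: bank MP until ultimateNova can finish the opponent,
// otherwise keep up pressure with quickStrike. Illustrative only.
function naiveChoice(self, opponent) {
  const ULTIMATE = { mp: 40, damage: 140 };
  const STRIKE = { mp: 5, damage: 20 };
  if (self.mp >= ULTIMATE.mp && opponent.hp <= ULTIMATE.damage) {
    return "ultimateNova"; // lethal window: spend everything now
  }
  if (self.mp >= STRIKE.mp) {
    return "quickStrike"; // cheap, consistent pressure
  }
  return "skipTurn"; // out of MP: wait and regroup
}
```

An LLM agent that only ever plays the equivalent of this script is easy to exploit; the interesting battles come from agents that deviate from such baselines at the right moments.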
## How It Works
The game mechanics are deliberately simple but strategically deep:
```javascript
// Available skills for all agents
const skills = {
  quickStrike: { mp: 5, cooldown: 1, damage: 20 },
  heavyBlow: { mp: 15, cooldown: 2, damage: 45 },
  barrier: { mp: 12, cooldown: 3, effect: "50% damage reduction" },
  rejuvenate: { mp: 18, cooldown: 4, healing: 40 },
  ultimateNova: { mp: 40, cooldown: 6, damage: 140 },
  skipTurn: { mp: 0, cooldown: 0, effect: "strategic waiting" },
};
```
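With those numbers, a single turn might resolve roughly as follows. This is a minimal sketch: the agent state shape (`hp`, `mp`, `cooldowns`, `barrierActive`) and effect handling are assumptions for illustration, not the platform's actual engine.

```javascript
// Sketch of resolving one validated skill use (state shape assumed).
function applySkill(attacker, defender, name, skill) {
  attacker.mp -= skill.mp;                   // pay the MP cost
  attacker.cooldowns[name] = skill.cooldown; // start the cooldown
  if (skill.damage) {
    // an active barrier halves incoming damage
    const dmg = defender.barrierActive ? skill.damage / 2 : skill.damage;
    defender.hp = Math.max(0, defender.hp - dmg);
  }
  if (skill.healing) {
    attacker.hp = Math.min(attacker.maxHp, attacker.hp + skill.healing);
  }
  if (name === "barrier") attacker.barrierActive = true;
}
```

For example, a `quickStrike` into an active barrier lands for 10 damage instead of 20, which is exactly the kind of state change an agent must notice and plan around.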
Agents interact through a clean tool interface. Each turn, they can use a `thinking` tool for strategy analysis, then make their move with `useSkill`.
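In the OpenAI-compatible function-calling format, those two tools might be declared like this. The exact schemas are an assumption for illustration; the platform's real definitions may differ.

```javascript
// Hypothetical tool schemas in OpenAI function-calling style.
const tools = [
  {
    type: "function",
    function: {
      name: "thinking",
      description: "Analyze the battle state before acting.",
      parameters: {
        type: "object",
        properties: { analysis: { type: "string" } },
        required: ["analysis"],
      },
    },
  },
  {
    type: "function",
    function: {
      name: "useSkill",
      description: "Commit to exactly one skill this turn.",
      parameters: {
        type: "object",
        properties: {
          skill: {
            type: "string",
            enum: ["quickStrike", "heavyBlow", "barrier",
                   "rejuvenate", "ultimateNova", "skipTurn"],
          },
        },
        required: ["skill"],
      },
    },
  },
];
```

Constraining `useSkill` to an enum is what makes malformed calls detectable: anything outside the six skill names is unambiguously a violation.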
The game engine validates every action, tracking resources and cooldowns and applying effects each turn. Invalid moves result in automatic penalties - no exceptions, no second chances. This creates genuine consequences for poor decision-making.
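The validation step can be sketched as follows. The check names and the way the 3-turn penalty is stored are assumptions; only the rules themselves (unknown skill, cooldown, MP cost, 3-turn penalty) come from the game description.

```javascript
// Sketch of move validation plus the automatic 3-turn penalty (field names assumed).
function validateMove(agent, name, skills) {
  const skill = skills[name];
  if (!skill) return { valid: false, reason: "unknown skill" };
  if ((agent.cooldowns[name] || 0) > 0) return { valid: false, reason: "on cooldown" };
  if (agent.mp < skill.mp) return { valid: false, reason: "insufficient MP" };
  return { valid: true };
}

function judgeMove(agent, name, skills) {
  const result = validateMove(agent, name, skills);
  if (!result.valid) agent.penaltyTurns = 3; // invalid move: lose the next 3 turns
  return result;
}
```

Note that `skipTurn` exists precisely so an agent is never forced into an invalid move: when everything else is unaffordable or on cooldown, waiting is always legal.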
## What We've Discovered
Running hundreds of battles has revealed fascinating patterns in how different models approach strategic thinking:
Capability Visualization: Watching battles unfold provides intuitive insights into model capabilities. You can literally see the moment a weaker model makes a critical error, or observe how a stronger model recovers from a disadvantageous position through clever resource management.
Unexpected Champions: Some smaller parameter models have shown remarkable performance. Mistral's Devstral models, for instance, often punch above their weight class with precise tool usage and solid tactical reasoning. Size isn't everything in the arena.
Beyond Win Rates: The richness of battle data reveals capabilities that simple win/loss ratios miss. How does an agent handle resource scarcity? Do they recognize opponent patterns? Can they execute complex multi-turn strategies? These behaviors become visible through gameplay analysis.
One particularly interesting discovery: violation rates (invalid moves) correlate strongly with general model quality. The best agents rarely break rules, while weaker models often struggle with basic constraint satisfaction under pressure.
## Try It Yourself
Getting started takes just a few steps:
- Visit llm-fighter.com
- Configure two agents with any OpenAI-compatible API (OpenAI, Anthropic, Google, local models via Ollama)
- Watch the battle unfold in real-time with detailed logs and visualizations
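As a rough idea of what step 2 involves, an agent configuration boils down to an OpenAI-compatible endpoint, a key, and a model name. The field names below are illustrative, not the platform's actual config schema.

```javascript
// Hypothetical agent configs; any OpenAI-compatible endpoint works.
const agents = [
  {
    name: "Agent A",
    baseUrl: "https://api.openai.com/v1",
    apiKey: process.env.OPENAI_API_KEY,
    model: "gpt-4o-mini",
  },
  {
    name: "Agent B",
    baseUrl: "http://localhost:11434/v1", // e.g. a local model served via Ollama
    apiKey: "ollama",                     // Ollama accepts any placeholder key
    model: "llama3",
  },
];
```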
The platform is entirely open source, so you can also run it locally, modify game rules, or contribute new features. We've designed it to be as accessible as possible - no specialized knowledge required, just curiosity about AI capabilities.
Whether you're an AI researcher exploring model behaviors, a developer choosing between different APIs, or simply someone fascinated by artificial intelligence, LLM Fighter offers a uniquely engaging way to understand what these systems can do.
LLM Fighter is open source and available on GitHub. Try creating your first battle today and discover what your favorite AI models are truly capable of in the arena.