Harish Kotra (he/him)

Inside Agent Arcade: Building a Real-Time AI Benchmarking Arena

In the rapidly evolving world of Large Language Models (LLMs), we often ask: "How smart is this model, really?" Standard benchmarks like MMLU or HumanEval are great, but they are increasingly "contaminated": their test items leak into training data, so a high score can reflect memorization rather than reasoning.

Enter Agent Arcade (formerly Prison Break AI), a project designed to test AI models in a dynamic, visual, and interactive environment where every puzzle is freshly generated.

The Vision: Beyond Static Text

The goal was to create an app where users could watch an AI model solve puzzles in real time. I wanted to see the "thinking" process: the failures, the retries, and the eventual "Aha!" moments.

Technical Architecture

1. The Engine-Agent Loop

The core of the app is a state machine. The Agent Runner manages the lifecycle of a "Run":

  1. Generate: The Game Engine creates a fresh puzzle state.
  2. Prompt: The Engine converts that state into a natural language prompt for the AI.
  3. Inference: The Model Provider sends the prompt to either a local Ollama instance or a cloud API (AIsa.one).
  4. Validate: The Engine parses the AI's response (usually JSON) and validates it against the game rules.
  5. Iterate: If the solution is wrong, the error is logged, and the agent gets another attempt with the error context.
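The five steps above can be sketched as a small retry loop. This is a minimal illustration, not the project's actual code: the word-ladder puzzle, the names `Puzzle`, `ModelProvider`, `buildPrompt`, `validate`, and `runAgent`, and the limit of three attempts are all assumptions made for the example. Puzzle generation (step 1) is left to the caller.

```typescript
// Hypothetical types: none of these names come from the Agent Arcade source.
interface Puzzle {
  start: string;
  target: string;
}

interface Attempt {
  attempt: number;
  raw: string;
  error: string | null; // null means the answer passed validation
}

// Anything that turns a prompt into text: a local Ollama call, a cloud API, or a stub.
type ModelProvider = (prompt: string) => Promise<string>;

// Step 2: turn the puzzle state into a prompt, adding the last error if there is one.
function buildPrompt(puzzle: Puzzle, lastError: string | null): string {
  const base =
    `Turn "${puzzle.start}" into "${puzzle.target}" by changing one letter per step. ` +
    `Reply with JSON only, shaped like {"steps": ["first", "...", "last"]}.`;
  return lastError ? `${base}\nYour previous answer was rejected: ${lastError}` : base;
}

function differsByOneLetter(a: string, b: string): boolean {
  if (a.length !== b.length) return false;
  let diffs = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) diffs++;
  }
  return diffs === 1;
}

// Step 4: parse the reply and check it against the rules.
// Returns null on success or a readable error. A dictionary check is omitted.
function validate(puzzle: Puzzle, raw: string): string | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return "response was not valid JSON";
  }
  if (typeof parsed !== "object" || parsed === null) return "response was not a JSON object";
  const steps = (parsed as { steps?: unknown }).steps;
  if (!Array.isArray(steps) || !steps.every((s) => typeof s === "string")) {
    return '"steps" must be an array of strings';
  }
  const words = steps as string[];
  if (words[0] !== puzzle.start) return `ladder must start with "${puzzle.start}"`;
  if (words[words.length - 1] !== puzzle.target) return `ladder must end with "${puzzle.target}"`;
  for (let i = 1; i < words.length; i++) {
    if (!differsByOneLetter(words[i - 1], words[i])) {
      return `"${words[i - 1]}" -> "${words[i]}" changes more or fewer than one letter`;
    }
  }
  return null;
}

// Steps 3 and 5: call the model, validate, and retry with the error as context.
async function runAgent(
  puzzle: Puzzle,
  model: ModelProvider,
  maxAttempts = 3
): Promise<{ solved: boolean; attempts: Attempt[] }> {
  const attempts: Attempt[] = [];
  let lastError: string | null = null;
  for (let n = 1; n <= maxAttempts; n++) {
    const raw = await model(buildPrompt(puzzle, lastError));
    lastError = validate(puzzle, raw);
    attempts.push({ attempt: n, raw, error: lastError });
    if (lastError === null) return { solved: true, attempts };
  }
  return { solved: false, attempts };
}
```

Keeping the model behind a plain function type means the same loop can run against Ollama, a cloud API, or a canned stub, and the `attempts` log is what a UI would render to show failures and retries.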

2. State Management with Zustand

I chose Zustand for its simplicity and performance. It handles everything from API keys to the global leaderboard.

import { create } from 'zustand';

export const useStore = create<AppState>((set) => ({
  provider: 'ollama',
  selectedModel: null,
  // Read saved runs on startup so the leaderboard survives a refresh
  leaderboard: JSON.parse(localStorage.getItem('leaderboard_runs') || '[]'),
  // ...
}));
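One detail worth noting about reading the leaderboard this way: `JSON.parse` throws if the stored value is corrupted, which would break the store at startup. Below is a minimal sketch of a more defensive load/save pair. Only the `leaderboard_runs` key comes from the snippet above; the `RunRecord` shape, the `StorageLike` interface, and the function names are assumptions for illustration.

```typescript
// Minimal subset of the Web Storage API, so the helpers can be tested without a browser.
interface StorageLike {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Hypothetical shape of one leaderboard entry; the real app's fields may differ.
interface RunRecord {
  model: string;
  game: string;
  solved: boolean;
  attempts: number;
}

const LEADERBOARD_KEY = "leaderboard_runs";

// Returns the saved runs, or an empty list if nothing is stored or the data is unreadable.
function loadLeaderboard(storage: StorageLike): RunRecord[] {
  try {
    const parsed: unknown = JSON.parse(storage.getItem(LEADERBOARD_KEY) ?? "[]");
    return Array.isArray(parsed) ? (parsed as RunRecord[]) : [];
  } catch {
    return [];
  }
}

// Appends one run, writes the full list back, and returns it for the store to use.
function saveRun(storage: StorageLike, run: RunRecord): RunRecord[] {
  const next = [...loadLeaderboard(storage), run];
  storage.setItem(LEADERBOARD_KEY, JSON.stringify(next));
  return next;
}
```

In the browser, `localStorage` satisfies `StorageLike`, so the store's initial value could be `loadLeaderboard(localStorage)`. Zustand also ships a `persist` middleware that covers the same ground, if you prefer not to manage the key by hand.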

3. The "Pixel Art" Design System

To give it that arcade feel, I built a custom CSS system on top of Tailwind 4. I used @layer components to define "pixel-border" and "arcade-card" classes that use hard shadows to simulate 8-bit depth.

.pixel-border {
  @apply border-4 border-slate-800 shadow-[4px_4px_0px_0px_rgba(30,41,59,1)];
}

Challenges: The CORS Wall

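One of the biggest technical hurdles was connecting a web-based frontend to a local Ollama instance. Browsers enforce the same-origin policy, so a page served from one origin cannot read responses from the Ollama server unless that server allows the page's origin through Cross-Origin Resource Sharing (CORS) headers. I solved this by providing clear in-app troubleshooting that guides users to set OLLAMA_ORIGINS="*". The wildcard allows every origin; listing only the app's own origin is the stricter choice.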
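A minimal sketch of how such a troubleshooting check might work: probe Ollama's model-list endpoint and turn a failure into a hint. The function name `probeOllama`, the `FetchLike` and `ProbeResult` types, and the hint wording are assumptions; `/api/tags` and port 11434 are Ollama's standard model-list endpoint and default port.

```typescript
// Just enough of the fetch contract for this check; the browser's fetch satisfies it.
type FetchLike = (url: string) => Promise<{ ok: boolean; status: number }>;

type ProbeResult = { reachable: true } | { reachable: false; hint: string };

async function probeOllama(
  fetchImpl: FetchLike,
  baseUrl = "http://localhost:11434"
): Promise<ProbeResult> {
  try {
    const res = await fetchImpl(`${baseUrl}/api/tags`);
    if (res.ok) return { reachable: true };
    return { reachable: false, hint: `Ollama responded with HTTP ${res.status}.` };
  } catch {
    // In a browser, a CORS rejection and an unreachable server both surface as a
    // rejected fetch with no further detail, so the hint has to cover both cases.
    return {
      reachable: false,
      hint:
        "Could not reach Ollama. Check that it is running, then restart it with " +
        "OLLAMA_ORIGINS set to allow this page's origin.",
    };
  }
}
```

In the app this would be called as `probeOllama((url) => fetch(url))` before the first run, with the hint shown in the UI when `reachable` is false. Passing the fetch function in keeps the check testable without a network.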

Why It Matters

Agent Arcade isn't just a game; it's a diagnostic tool. By watching a model fail at a 4x4 Sudoku but excel at word ladders, developers can gain deep insights into the specific reasoning strengths and weaknesses of different architectures.

I'm also looking at adding Vision-Language Model (VLM) support. Instead of sending a text description of the board, we'll send a screenshot. This will test the model's ability to perceive spatial relationships directly from pixels.

GitHub Repo: https://github.com/harishkotra/agent-arcade
