In the rapidly evolving world of Large Language Models (LLMs), we often ask: "How smart is this model, really?" Standard benchmarks like MMLU or HumanEval are useful, but they are increasingly "contaminated": their questions leak into training data, so a high score may reflect memorization rather than reasoning.
Enter Agent Arcade (formerly Prison Break AI) — a project designed to test AI models in a dynamic, visual, and interactive environment.
The Vision: Beyond Static Text
The goal was to create an app where users could watch an AI model solve puzzles in real-time. I wanted to see the "thinking" process — the failures, the retries, and the eventual "Aha!" moments.
Technical Architecture
1. The Engine-Agent Loop
The core of the app is a state machine. The Agent Runner manages the lifecycle of a "Run":
- Generate: The Game Engine creates a fresh puzzle state.
- Prompt: The Engine converts that state into a natural language prompt for the AI.
- Inference: The Model Provider sends the prompt to either a local Ollama instance or a cloud API (AIsa.one).
- Validate: The Engine parses the AI's response (usually JSON) and validates it against the game rules.
- Iterate: If the solution is wrong, the error is logged, and the agent gets another attempt with the error context.
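The five steps above can be sketched as a small retry loop. This is a minimal illustration, not the project's actual code: the `Engine` interface, the `Model` type, `runAgent`, and the toy `sumEngine` puzzle are all hypothetical names invented for this example.

```typescript
// Hypothetical types: none of these names come from the Agent Arcade source.
interface Verdict { ok: boolean; error?: string }

interface Engine<State> {
  generate(): State;                                  // Generate
  toPrompt(state: State, lastError?: string): string; // Prompt
  validate(state: State, reply: string): Verdict;     // Validate
}

// Inference: any function that turns a prompt into the model's raw reply.
type Model = (prompt: string) => Promise<string>;

interface RunResult { solved: boolean; attempts: number; errors: string[] }

async function runAgent<State>(
  engine: Engine<State>,
  model: Model,
  maxAttempts = 3,
): Promise<RunResult> {
  const state = engine.generate();
  const errors: string[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Feed the most recent error back so the model can correct itself.
    const prompt = engine.toPrompt(state, errors[errors.length - 1]);
    const reply = await model(prompt);
    const verdict = engine.validate(state, reply);
    if (verdict.ok) return { solved: true, attempts: attempt, errors };
    errors.push(verdict.error ?? "invalid answer");   // Iterate
  }
  return { solved: false, attempts: maxAttempts, errors };
}

// Toy engine: the "puzzle" is adding two numbers, answered as {"answer": n}.
const sumEngine: Engine<{ a: number; b: number }> = {
  generate: () => ({ a: 2, b: 3 }),
  toPrompt: (s, lastError) =>
    `Reply with JSON {"answer": n} where n = ${s.a} + ${s.b}.` +
    (lastError ? ` Your previous attempt failed: ${lastError}` : ""),
  validate: (s, reply) => {
    let parsed: unknown;
    try {
      parsed = JSON.parse(reply);
    } catch {
      return { ok: false, error: "reply was not valid JSON" };
    }
    const answer = (parsed as { answer?: unknown } | null)?.answer;
    return answer === s.a + s.b
      ? { ok: true }
      : { ok: false, error: `wrong sum: ${JSON.stringify(answer)}` };
  },
};
```

Keeping the model behind a plain function type means the same loop can drive a local Ollama call, a cloud API, or a scripted fake in tests.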
2. State Management with Zustand
I chose Zustand for its simplicity and performance. It handles everything from API keys to the global leaderboard.
import { create } from 'zustand';

// AppState is the store's type, declared elsewhere in the app.
export const useStore = create<AppState>((set) => ({
  provider: 'ollama',
  selectedModel: null,
  // Persistence ensures your leaderboard survives a refresh
  leaderboard: JSON.parse(localStorage.getItem('leaderboard_runs') || '[]'),
  // ...
}));
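One caveat with reading localStorage directly is that a corrupt value makes JSON.parse throw while the store is being created. A defensive variant might look like the sketch below. The `LeaderboardRun` shape, the `StringStore` interface, and the `loadRuns`/`saveRun` helpers are assumptions for illustration; only the `leaderboard_runs` key comes from the snippet above.

```typescript
// Hypothetical entry shape: the real leaderboard type is not shown in the article.
interface LeaderboardRun {
  model: string;
  game: string;
  solved: boolean;
  attempts: number;
}

// Minimal subset of the Web Storage API, so the helpers also run outside a browser.
interface StringStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const LEADERBOARD_KEY = "leaderboard_runs";

function loadRuns(store: StringStore): LeaderboardRun[] {
  try {
    const parsed: unknown = JSON.parse(store.getItem(LEADERBOARD_KEY) ?? "[]");
    // Anything that is not an array is treated as an empty leaderboard.
    return Array.isArray(parsed) ? (parsed as LeaderboardRun[]) : [];
  } catch {
    return []; // corrupt JSON: start over instead of crashing the app
  }
}

function saveRun(store: StringStore, run: LeaderboardRun): LeaderboardRun[] {
  const runs = [...loadRuns(store), run];
  store.setItem(LEADERBOARD_KEY, JSON.stringify(runs));
  return runs;
}
```

In the browser, `window.localStorage` satisfies `StringStore`, so `loadRuns(localStorage)` could supply the store's initial `leaderboard` value.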
3. The "Pixel Art" Design System
To give it that arcade feel, I built a custom CSS system on top of Tailwind 4. I used @layer components to define "pixel-border" and "arcade-card" classes that use hard shadows to simulate 8-bit depth.
.pixel-border {
  @apply border-4 border-slate-800 shadow-[4px_4px_0px_0px_rgba(30,41,59,1)];
}
Challenges: The CORS Wall
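One of the biggest technical hurdles was connecting a web-based frontend to a local Ollama instance. By default, Ollama does not send Cross-Origin Resource Sharing (CORS) headers for arbitrary web origins, so the browser blocks the page from reading its responses. I solved this by providing clear in-app troubleshooting that guides users to set OLLAMA_ORIGINS="*" before starting Ollama. The wildcard lets any website call the local instance, so listing only the app's own origin is the safer setting.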
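A connectivity check along these lines could drive that troubleshooting screen. It is a sketch under assumptions: `checkOllama`, `FetchLike`, and `OllamaStatus` are hypothetical names, and the default port 11434 and the `/api/tags` model-listing endpoint are taken from Ollama's public API rather than from the project. Note that page scripts cannot distinguish a CORS block from a server that is not running, since `fetch` rejects in both cases.

```typescript
// Structural subset of the Fetch API, so a fake can be injected in tests.
type FetchLike = (url: string) => Promise<{
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}>;

type OllamaStatus =
  | { kind: "ok"; models: string[] }
  | { kind: "http-error"; status: number }
  | { kind: "unreachable"; hint: string };

async function checkOllama(
  fetchImpl: FetchLike,
  baseUrl = "http://localhost:11434", // assumed default Ollama address
): Promise<OllamaStatus> {
  try {
    // /api/tags lists the locally installed models.
    const res = await fetchImpl(`${baseUrl}/api/tags`);
    if (!res.ok) return { kind: "http-error", status: res.status };
    const body = (await res.json()) as { models?: { name: string }[] };
    return { kind: "ok", models: (body.models ?? []).map((m) => m.name) };
  } catch {
    // fetch rejects both when the server is down and when CORS blocks the
    // response, so the hint has to cover both possibilities.
    return {
      kind: "unreachable",
      hint: "Is Ollama running, and does OLLAMA_ORIGINS allow this page's origin?",
    };
  }
}
```

In the browser the global `fetch` can be passed as `fetchImpl`, and the `unreachable` case is where the app would show its OLLAMA_ORIGINS instructions.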
Why It Matters
Agent Arcade isn't just a game; it's a diagnostic tool. By watching a model fail at a 4x4 Sudoku but excel at word ladders, developers can see which kinds of reasoning a given model handles well and where it breaks down.
I'm also looking at adding Vision-Language Model (VLM) support. Instead of sending a text description of the board, we'll send a screenshot. This will test the model's ability to perceive spatial relationships directly from pixels.
GitHub Repo: https://github.com/harishkotra/agent-arcade