This post walks through a real project: a multi-provider AI Sudoku system where each model acts as an independent agent and competes under the same constraints.
If you care about AI reliability, this project demonstrates a practical pattern: never trust model output directly, always validate, and design orchestration to survive bad responses.
Why Sudoku?
Sudoku is a great benchmark for agent behavior because:
- rules are strict and deterministic
- outputs are easy to validate
- hallucinations are immediately observable
- step-by-step progress can be visualized cleanly
That makes it ideal for comparing local and cloud LLM behavior under identical prompt and runtime conditions.
What We Built
- A modular Node.js app with four providers:
- OpenAI
- Ollama
- LM Studio
- Featherless (OpenAI-compatible)
- A shared solve(board, mode) contract for all agents.
- A robust Sudoku validation core.
- A live web UI with side-by-side providers.
- Counters for invalid moves and timeouts.
System Design
Folder Layout
agents/ # provider implementations
core/ # sudoku logic + orchestration
utils/ # json, timing, formatting
web/ # frontend UI
server.js # HTTP + SSE backend
index.js # CLI entry
Core Interface: Agent Contract
Every provider implements the same shape, making orchestration provider-agnostic.
class SomeProviderAgent {
  constructor(options) {
    this.name = "ProviderName";
    this.options = options;
  }

  async solve(board, mode = "full") {
    // return strict JSON data
  }
}
Modes:
- full -> { solution: [[...9x9]] }
- step -> { row, col, value }
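As a concrete illustration of the contract, here is a hypothetical agent targeting any OpenAI-compatible /v1/chat/completions endpoint (OpenAI, LM Studio, and Featherless all speak this dialect). The prompt text, option names, and class name are illustrative sketches, not the project's actual code.

```javascript
// Hypothetical agent implementing the shared contract against an
// OpenAI-compatible chat completions endpoint.
class OpenAICompatibleAgent {
  constructor(options) {
    this.name = options.name ?? "OpenAICompatible";
    this.options = options; // assumed shape: { baseUrl, apiKey, model }
  }

  async solve(board, mode = "full") {
    const res = await fetch(`${this.options.baseUrl}/v1/chat/completions`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${this.options.apiKey}`,
      },
      body: JSON.stringify({
        model: this.options.model,
        messages: [
          { role: "system", content: "Reply with a strict JSON object only." },
          { role: "user", content: `Sudoku (${mode}): ${JSON.stringify(board)}` },
        ],
      }),
    });
    const data = await res.json();
    // The caller still validates this before trusting it (see below).
    return JSON.parse(data.choices[0].message.content);
  }
}
```

Because the contract is uniform, the orchestrator never needs to know which provider it is driving.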
Defensive Output Handling
Model outputs are treated as untrusted data.
if (!text.startsWith("{") || !text.endsWith("}")) {
return { ok: false, error: "Response is not strict JSON object text." };
}
Even valid JSON is still validated semantically against Sudoku rules.
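In practice, models often wrap JSON in markdown fences or prepend prose despite instructions. A defensive parser can strip the common wrappers and never let JSON.parse throw upward; this is a hypothetical helper in the same spirit as the check above, not the project's exact code.

```javascript
// Hypothetical defensive parser: strip common ```json fences, enforce
// strict-object shape, and return a result object instead of throwing.
function parseModelJson(text) {
  const cleaned = text
    .trim()
    .replace(/^```(?:json)?\s*/i, "") // leading fence models often add
    .replace(/```$/, "")              // trailing fence
    .trim();
  if (!cleaned.startsWith("{") || !cleaned.endsWith("}")) {
    return { ok: false, error: "Response is not strict JSON object text." };
  }
  try {
    return { ok: true, data: JSON.parse(cleaned) };
  } catch (err) {
    return { ok: false, error: `JSON parse failed: ${err.message}` };
  }
}
```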
Sudoku Validation Strategy
The validator enforces:
- board shape (9x9, integer bounds)
- no duplicate values in rows/columns/3x3 boxes
- move legality
- clue preservation
- solved-state completeness
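The duplicate rule is the heart of the validator. A minimal sketch (an assumed implementation, with 0 marking an empty cell) looks like this:

```javascript
// Sketch of one validator rule: no duplicate non-zero values in any
// row, column, or 3x3 box of a 9x9 integer board.
function hasNoDuplicates(board) {
  const noDupes = (cells) => {
    const seen = new Set();
    for (const v of cells) {
      if (v === 0) continue;          // empty cells never conflict
      if (seen.has(v)) return false;  // duplicate found
      seen.add(v);
    }
    return true;
  };
  for (let i = 0; i < 9; i++) {
    const row = board[i];
    const col = board.map((r) => r[i]);
    const boxRow = Math.floor(i / 3) * 3;
    const boxCol = (i % 3) * 3;
    const box = [];
    for (let r = boxRow; r < boxRow + 3; r++)
      for (let c = boxCol; c < boxCol + 3; c++) box.push(board[r][c]);
    if (!noDupes(row) || !noDupes(col) || !noDupes(box)) return false;
  }
  return true;
}
```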
This guarantees a model cannot "win" by returning formatted but invalid answers.
Orchestrator Behavior: Resilience Over Fragility
An earlier version stopped a run on the first invalid move. We changed that for better observability and robustness.
Current behavior:
- invalid move -> increment invalidMoveCount, continue
- timeout -> increment timeoutCount, retry, continue until threshold
- step with no valid move -> emit step_skipped, continue
- solve success -> finish as solved
Pseudo-flow:
for each step:
  for each retry attempt:
    response = await agent.solve(board, "step")
    if invalid:
      invalidMoveCount++
      continue
    if timeout:
      timeoutCount++
      continue
    apply move
    emit move
    if solved: finish
  if no valid move in step:
    emit step_skipped
    continue
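The pseudo-flow above can be sketched as runnable JavaScript. The helpers (isSolved, isValidMove, applyMove) are minimal stand-ins for the project's real validation core, and the stats field names mirror the counters described earlier.

```javascript
// Minimal stand-ins (assumptions) for the project's real helpers:
const isSolved = (b) => b.every((row) => row.every((v) => v !== 0));
const isValidMove = (b, m) =>
  m && b[m.row]?.[m.col] === 0 && m.value >= 1 && m.value <= 9;
const applyMove = (b, m) => { b[m.row][m.col] = m.value; };

// Sketch of the resilient step loop: bad responses become counters,
// never crashes.
async function runSteps(agent, board, { maxSteps = 81, maxRetries = 3 } = {}) {
  const stats = { invalidMoveCount: 0, timeoutCount: 0, stepSkippedCount: 0 };
  for (let step = 0; step < maxSteps && !isSolved(board); step++) {
    let moved = false;
    for (let attempt = 0; attempt < maxRetries && !moved; attempt++) {
      try {
        const move = await agent.solve(board, "step");
        if (!isValidMove(board, move)) {
          stats.invalidMoveCount++;   // invalid moves are counted, not fatal
          continue;
        }
        applyMove(board, move);
        moved = true;
      } catch {
        stats.timeoutCount++;         // timeouts/errors: count and retry
      }
    }
    if (!moved) stats.stepSkippedCount++; // real app also emits step_skipped
  }
  return stats;
}
```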
Why SSE for Real-Time Updates?
SSE was enough for one-way streaming (server -> client) and is simpler than WebSockets for this use case.
res.writeHead(200, {
"Content-Type": "text/event-stream",
"Cache-Control": "no-cache",
Connection: "keep-alive",
});
Each event carries live stats, so the UI never needs hidden state from the backend.
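On the wire, an SSE event is just an `event:` line and a `data:` line terminated by a blank line. A tiny hypothetical helper (not the project's exact code) makes the framing explicit:

```javascript
// Format one Server-Sent Events frame: "event:" name, JSON "data:",
// and the blank line that terminates the frame.
function sseFrame(event, payload) {
  return `event: ${event}\ndata: ${JSON.stringify(payload)}\n\n`;
}

// Usage (sketch): res.write(sseFrame("move", { move, stats }));
```

On the client, a standard EventSource can subscribe to these named events with addEventListener.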
UI Design Decisions
- Split providers into two rows:
- Local models (Ollama, LM Studio)
- Third-party models (OpenAI, Featherless)
- Two columns per row for quick comparison.
- Per-provider model configuration:
- local: auto-detected model dropdown
- cloud: manual model entry
- Per-provider timeout input to address local model latency variability.
Local Model Discovery
We added provider-specific discovery endpoints:
- Ollama: GET /api/tags
- LM Studio: GET /v1/models
The frontend can refresh model lists without restarting the server.
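A discovery sketch built on those two endpoints might look like this. The paths come from the post; the response-shape parsing reflects the documented APIs (Ollama's /api/tags returns { models: [{ name }] }, LM Studio's /v1/models returns the OpenAI-style { data: [{ id }] }).

```javascript
// Sketch: list locally available models for a given provider.
async function listLocalModels(provider, baseUrl) {
  if (provider === "ollama") {
    const res = await fetch(`${baseUrl}/api/tags`);
    const data = await res.json();
    return data.models.map((m) => m.name); // { models: [{ name }] }
  }
  if (provider === "lmstudio") {
    const res = await fetch(`${baseUrl}/v1/models`);
    const data = await res.json();
    return data.data.map((m) => m.id);     // OpenAI-style { data: [{ id }] }
  }
  throw new Error(`Unknown local provider: ${provider}`);
}
```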
Timeout Lessons
Local models can be slow on the first token or when loading a heavy model. A single global timeout is usually wrong.
What worked better:
- per-provider timeout control in UI
- higher defaults for local providers (>= 180000 ms)
- retryable timeout policy + timeout counters
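A retryable timeout policy can be sketched with Promise.race: wrap each solve call in a deadline, and on timeout increment the counter and retry instead of aborting the run. This is an assumed implementation, not the project's exact code.

```javascript
// Reject with "timeout" if the wrapped promise takes longer than ms.
function withTimeout(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("timeout")), ms);
  });
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Retry on timeout, counting each one; give up after maxRetries extra tries.
async function solveWithRetries(agent, board, { timeoutMs, maxRetries = 2 }, stats) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await withTimeout(agent.solve(board, "step"), timeoutMs);
    } catch (err) {
      if (err.message !== "timeout") throw err;
      stats.timeoutCount++; // count and retry instead of aborting the run
    }
  }
  return null;              // exhausted retries for this step
}
```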
Example Run Start Payload
{
"providerId": "ollama",
"model": "gemma4:latest",
"timeoutMs": 180000
}
Contribution Opportunities
If you want to extend this project, here are high-impact additions:
- Add a baseline deterministic solver and compare LLM deviation.
- Add puzzle packs and ELO-style provider rating.
- Add persistent run history (SQLite + charting).
- Add tests for orchestrator edge cases.
- Add CI + linting + type checks.
- Add websocket mode and richer live metrics.
Key Takeaways
- Standard contracts unlock multi-provider experimentation.
- Validation is non-negotiable when models are in the loop.
- Reliability improves when invalid outputs become measurable events, not hard crashes.
- Observability (attempts, invalids, timeouts) is as important as final correctness.
Beyond Sudoku
If you build a similar system for another constrained task (SQL generation, code transforms, schema mapping), this architecture transfers almost directly.

