This post walks through a real project: a multi-provider AI Sudoku system where each model acts as an independent agent and competes under the same constraints.
If you care about AI reliability, this project demonstrates a practical pattern: never trust model output directly, always validate, and design orchestration to survive bad responses.
Why Sudoku?
Sudoku is a great benchmark for agent behavior because:
- rules are strict and deterministic
- outputs are easy to validate
- hallucinations are immediately observable
- step-by-step progress can be visualized cleanly
That makes it ideal for comparing local and cloud LLM behavior under identical prompt and runtime conditions.
What We Built
- A modular Node.js app with four providers:
- OpenAI
- Ollama
- LM Studio
- Featherless (OpenAI-compatible)
- A shared solve(board, mode) contract for all agents.
- A robust Sudoku validation core.
- A live web UI with side-by-side providers.
- Counters for invalid moves and timeouts.
System Design
Folder Layout
agents/ # provider implementations
core/ # sudoku logic + orchestration
utils/ # json, timing, formatting
web/ # frontend UI
server.js # HTTP + SSE backend
index.js # CLI entry
Core Interface: Agent Contract
Every provider implements the same shape, making orchestration provider-agnostic.
class SomeProviderAgent {
  constructor(options) {
    this.name = "ProviderName";
    this.options = options;
  }

  async solve(board, mode = "full") {
    // return strict JSON data
  }
}
Modes:
- full -> { solution: [[...9x9]] }
- step -> { row, col, value }
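As a concrete illustration of the contract, here is a hypothetical agent targeting any OpenAI-compatible /v1/chat/completions endpoint (OpenAI, LM Studio, and Featherless all speak this dialect). The prompt text, option names, and class name are illustrative sketches, not the project's actual code.

```javascript
// Hypothetical agent implementing the shared contract against an
// OpenAI-compatible chat completions endpoint.
class OpenAICompatibleAgent {
  constructor(options) {
    this.name = options.name ?? "OpenAICompatible";
    this.options = options; // assumed shape: { baseUrl, apiKey, model }
  }

  async solve(board, mode = "full") {
    const res = await fetch(`${this.options.baseUrl}/v1/chat/completions`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${this.options.apiKey}`,
      },
      body: JSON.stringify({
        model: this.options.model,
        messages: [
          { role: "system", content: "Reply with a strict JSON object only." },
          { role: "user", content: `Sudoku (${mode}): ${JSON.stringify(board)}` },
        ],
      }),
    });
    const data = await res.json();
    // The caller still validates this before trusting it (see below).
    return JSON.parse(data.choices[0].message.content);
  }
}
```

Because the contract is uniform, the orchestrator never needs to know which provider it is driving.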
Defensive Output Handling
Model outputs are treated as untrusted data.
if (!text.startsWith("{") || !text.endsWith("}")) {
return { ok: false, error: "Response is not strict JSON object text." };
}
Even valid JSON is still validated semantically against Sudoku rules.
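In practice, models often wrap JSON in markdown fences or prepend prose despite instructions. A defensive parser can strip the common wrappers and never let JSON.parse throw upward; this is a hypothetical helper in the same spirit as the check above, not the project's exact code.

```javascript
// Hypothetical defensive parser: strip common ```json fences, enforce
// strict-object shape, and return a result object instead of throwing.
function parseModelJson(text) {
  const cleaned = text
    .trim()
    .replace(/^```(?:json)?\s*/i, "") // leading fence models often add
    .replace(/```$/, "")              // trailing fence
    .trim();
  if (!cleaned.startsWith("{") || !cleaned.endsWith("}")) {
    return { ok: false, error: "Response is not strict JSON object text." };
  }
  try {
    return { ok: true, data: JSON.parse(cleaned) };
  } catch (err) {
    return { ok: false, error: `JSON parse failed: ${err.message}` };
  }
}
```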
Sudoku Validation Strategy
The validator enforces:
- board shape (9x9, integer bounds)
- no duplicate values in rows/columns/3x3 boxes
- move legality
- clue preservation
- solved-state completeness
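The duplicate rule is the heart of the validator. A minimal sketch (an assumed implementation, with 0 marking an empty cell) looks like this:

```javascript
// Sketch of one validator rule: no duplicate non-zero values in any
// row, column, or 3x3 box of a 9x9 integer board.
function hasNoDuplicates(board) {
  const noDupes = (cells) => {
    const seen = new Set();
    for (const v of cells) {
      if (v === 0) continue;          // empty cells never conflict
      if (seen.has(v)) return false;  // duplicate found
      seen.add(v);
    }
    return true;
  };
  for (let i = 0; i < 9; i++) {
    const row = board[i];
    const col = board.map((r) => r[i]);
    const boxRow = Math.floor(i / 3) * 3;
    const boxCol = (i % 3) * 3;
    const box = [];
    for (let r = boxRow; r < boxRow + 3; r++)
      for (let c = boxCol; c < boxCol + 3; c++) box.push(board[r][c]);
    if (!noDupes(row) || !noDupes(col) || !noDupes(box)) return false;
  }
  return true;
}
```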
This guarantees a model cannot "win" by returning formatted but invalid answers.
Orchestrator Behavior: Resilience Over Fragility
An earlier version stopped a run on the first invalid move. We changed that for better observability and robustness.
Current behavior:
- invalid move -> increment invalidMoveCount, continue
- timeout -> increment timeoutCount, retry, continue until threshold
- step with no valid move -> emit step_skipped, continue
- solve success -> finish as solved
Pseudo-flow:
for each step:
  for each retry attempt:
    response = await agent.solve(board, "step")
    if invalid:
      invalidMoveCount++
      continue
    if timeout:
      timeoutCount++
      continue
    apply move
    emit move
    if solved: finish
  if no valid move in step:
    emit step_skipped
    continue
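The pseudo-flow above can be sketched as runnable JavaScript. The helpers (isSolved, isValidMove, applyMove) are minimal stand-ins for the project's real validation core, and the stats field names mirror the counters described earlier.

```javascript
// Minimal stand-ins (assumptions) for the project's real helpers:
const isSolved = (b) => b.every((row) => row.every((v) => v !== 0));
const isValidMove = (b, m) =>
  m && b[m.row]?.[m.col] === 0 && m.value >= 1 && m.value <= 9;
const applyMove = (b, m) => { b[m.row][m.col] = m.value; };

// Sketch of the resilient step loop: bad responses become counters,
// never crashes.
async function runSteps(agent, board, { maxSteps = 81, maxRetries = 3 } = {}) {
  const stats = { invalidMoveCount: 0, timeoutCount: 0, stepSkippedCount: 0 };
  for (let step = 0; step < maxSteps && !isSolved(board); step++) {
    let moved = false;
    for (let attempt = 0; attempt < maxRetries && !moved; attempt++) {
      try {
        const move = await agent.solve(board, "step");
        if (!isValidMove(board, move)) {
          stats.invalidMoveCount++;   // invalid moves are counted, not fatal
          continue;
        }
        applyMove(board, move);
        moved = true;
      } catch {
        stats.timeoutCount++;         // timeouts/errors: count and retry
      }
    }
    if (!moved) stats.stepSkippedCount++; // real app also emits step_skipped
  }
  return stats;
}
```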
Why SSE for Real-Time Updates?
SSE was enough for one-way streaming (server -> client) and is simpler than WebSockets for this use case.
res.writeHead(200, {
"Content-Type": "text/event-stream",
"Cache-Control": "no-cache",
Connection: "keep-alive",
});
Each event carries live stats, so the UI never needs hidden state from the backend.
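On the wire, an SSE event is just an `event:` line and a `data:` line terminated by a blank line. A tiny hypothetical helper (not the project's exact code) makes the framing explicit:

```javascript
// Format one Server-Sent Events frame: "event:" name, JSON "data:",
// and the blank line that terminates the frame.
function sseFrame(event, payload) {
  return `event: ${event}\ndata: ${JSON.stringify(payload)}\n\n`;
}

// Usage (sketch): res.write(sseFrame("move", { move, stats }));
```

On the client, a standard EventSource can subscribe to these named events with addEventListener.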
UI Design Decisions
- Split providers into two rows:
- Local models (Ollama, LM Studio)
- Third-party models (OpenAI, Featherless)
- Two columns per row for quick comparison.
- Per-provider model configuration:
- local: auto-detected model dropdown
- cloud: manual model entry
- Per-provider timeout input to address local model latency variability.
Local Model Discovery
We added provider-specific discovery endpoints:
- Ollama: GET /api/tags
- LM Studio: GET /v1/models
The frontend can refresh model lists without restarting the server.
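A discovery sketch built on those two endpoints might look like this. The paths come from the post; the response-shape parsing reflects the documented APIs (Ollama's /api/tags returns { models: [{ name }] }, LM Studio's /v1/models returns the OpenAI-style { data: [{ id }] }).

```javascript
// Sketch: list locally available models for a given provider.
async function listLocalModels(provider, baseUrl) {
  if (provider === "ollama") {
    const res = await fetch(`${baseUrl}/api/tags`);
    const data = await res.json();
    return data.models.map((m) => m.name); // { models: [{ name }] }
  }
  if (provider === "lmstudio") {
    const res = await fetch(`${baseUrl}/v1/models`);
    const data = await res.json();
    return data.data.map((m) => m.id);     // OpenAI-style { data: [{ id }] }
  }
  throw new Error(`Unknown local provider: ${provider}`);
}
```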
Timeout Lessons
Local models can be slow on the first token or when loading a heavy model. A single global timeout is usually wrong.
What worked better:
- per-provider timeout control in UI
- higher defaults for local providers (>= 180000 ms)
- retryable timeout policy + timeout counters
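A retryable timeout policy can be sketched with Promise.race: wrap each solve call in a deadline, and on timeout increment the counter and retry instead of aborting the run. This is an assumed implementation, not the project's exact code.

```javascript
// Reject with "timeout" if the wrapped promise takes longer than ms.
function withTimeout(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("timeout")), ms);
  });
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Retry on timeout, counting each one; give up after maxRetries extra tries.
async function solveWithRetries(agent, board, { timeoutMs, maxRetries = 2 }, stats) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await withTimeout(agent.solve(board, "step"), timeoutMs);
    } catch (err) {
      if (err.message !== "timeout") throw err;
      stats.timeoutCount++; // count and retry instead of aborting the run
    }
  }
  return null;              // exhausted retries for this step
}
```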
Example Run Start Payload
{
"providerId": "ollama",
"model": "gemma4:latest",
"timeoutMs": 180000
}
Contribution Opportunities
If you want to extend this project, here are high-impact additions:
- Add a baseline deterministic solver and compare LLM deviation.
- Add puzzle packs and ELO-style provider rating.
- Add persistent run history (SQLite + charting).
- Add tests for orchestrator edge cases.
- Add CI + linting + type checks.
- Add websocket mode and richer live metrics.
Key Takeaways
- Standard contracts unlock multi-provider experimentation.
- Validation is non-negotiable when models are in the loop.
- Reliability improves when invalid outputs become measurable events, not hard crashes.
- Observability (attempts, invalids, timeouts) is as important as final correctness.
Beyond Sudoku
If you build a similar system for another constrained task (SQL generation, code transforms, schema mapping), this architecture transfers almost directly.

