Yesterday’s v1 build proved the core concept: multiple LLM providers can compete on the same Sudoku board with strict validation and real-time observability.
Today’s v2 upgrade extends that system with a different benchmark mode: single-call one-shot solving.
This post focuses on what changed from v1, why it matters, and how to apply the same design pattern in other AI systems.
V1 recap (baseline)
V1 included:
- multi-provider step-by-step solving
- standardized provider interface (solve(board, mode))
- strict JSON parsing and Sudoku validation
- SSE-powered live UI with retries, invalid move tracking, and timeout tracking
This made model behavior visible, but also introduced repeated model calls and repeated prompt overhead for each move.
Why V2 was needed
For benchmarking inference efficiency and cost, we needed:
- one request per full puzzle (instead of one request per move)
- lower prompt token usage
- provider usability without hard dependency on startup env keys
V2 key additions
1) One-Shot page (/one-shot)
A dedicated page where the user:
- picks a provider
- selects/enters model
- sets timeout
- clicks one button to solve the full board in one call
This is intentionally simpler than the race UI: one board in, one board out.
2) New API endpoint: POST /api/solve-once
The backend now supports full-board one-shot requests.
High-level flow:
- resolve provider + model + timeout (+ optional runtime API key)
- call agent.solve(board, "full") exactly once
- validate the returned board
- return status (solved, invalid, timeout, failed) + latency
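As a sketch, the request/response shapes for this endpoint might look like the TypeScript below. Only apiKey, the four status values, and latency reporting come from the post; the other field names are assumptions.

```typescript
// Hypothetical shapes for POST /api/solve-once. Field names other than
// apiKey and the status values are illustrative assumptions.
interface SolveOnceRequest {
  provider: string;   // e.g. "openai" or "featherless"
  model: string;
  timeoutMs: number;
  apiKey?: string;    // optional runtime key (see section 3)
  board: number[][];  // 9x9 grid, 0 = empty cell
}

interface SolveOnceResponse {
  status: "solved" | "invalid" | "timeout" | "failed";
  latencyMs: number;
  solution?: number[][]; // present only when status === "solved"
  reason?: string;       // present on "invalid" / "failed"
}
```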
3) Runtime API key input for OpenAI/Featherless
In v1/v1.5, cloud providers could appear disabled when env keys were missing.
V2 change:
- OpenAI and Featherless are selectable
- one-shot UI accepts runtime API key input
- request can include apiKey
- backend falls back to the env key if no runtime key is provided
This makes testing easier across environments without editing .env every time.
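The fallback logic can be sketched as a small helper; the function name and the env variable names here are assumptions, not the repo's actual identifiers.

```typescript
// Hypothetical key-resolution helper: prefer the runtime key sent with the
// request, otherwise fall back to the provider's env variable.
// Env variable names below are assumptions.
function resolveApiKey(provider: string, runtimeKey?: string): string | undefined {
  if (runtimeKey && runtimeKey.trim().length > 0) return runtimeKey;
  const envNames: Record<string, string> = {
    openai: "OPENAI_API_KEY",
    featherless: "FEATHERLESS_API_KEY",
  };
  const envName = envNames[provider];
  return envName ? process.env[envName] : undefined;
}
```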
4) Prompt compaction for lower token usage
We replaced verbose full-solve instructions with a compact strict schema prompt.
V2 architecture
Core backend snippet (conceptual)
// solve the full board with a single provider call, bounded by a timeout
const response = await withTimeout(() => agent.solve(puzzle, "full"), timeoutMs);
// strict validation: shape, preserved clues, Sudoku constraints, fully solved
const validated = validateFullSolutionPayload(response, puzzle);
if (!validated.ok) {
  return { status: "invalid", reason: validated.reason };
}
return { status: "solved", solution: validated.solution };
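The withTimeout helper used above is not shown in the post; a minimal sketch of one plausible implementation, racing the provider call against a timer, is:

```typescript
// Minimal sketch of a withTimeout helper (assumed shape): resolves with the
// provider call's result, or rejects with "timeout" once ms elapses.
function withTimeout<T>(fn: () => Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timeout")), ms);
    fn().then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}
```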
Cost-optimized prompt strategy (V2)
V1 prompt style was explicit but longer.
V2 uses a concise prompt preserving only required constraints + schema.
const prompt = [
  "Solve Sudoku. Strict JSON only.",
  "Rules: digits 1-9; each row/col/3x3 has 1-9 exactly once; never change non-zero clues.",
  'Return exactly: {"solution":[[9x9 integers]]}',
  "No markdown, no extra keys/text.",
  "Board:",
  safeStringify(board),
].join("\n");
Why this is cost-aware
- Fewer instruction tokens per request
- No repetitive step prompts
- Better fit for one-shot evaluation experiments
Validation remains strict
Even with shorter prompting, we do not relax safety:
- board shape must be valid 9x9
- fixed clues must remain unchanged
- board must satisfy Sudoku constraints
- board must be fully solved
If any check fails, the result is marked invalid.
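The four checks above can be sketched as a single validator. This mirrors what validateFullSolutionPayload in the earlier snippet would do, but the implementation details here are assumptions.

```typescript
// Sketch of the full-solution checks described above; details are assumed.
// puzzle uses 0 for empty cells; solution must be a fully solved 9x9 grid.
function validateFullSolution(
  solution: unknown,
  puzzle: number[][],
): { ok: true; solution: number[][] } | { ok: false; reason: string } {
  // 1) + 4) shape must be 9x9 and every cell a digit 1-9 (i.e. fully solved)
  if (!Array.isArray(solution) || solution.length !== 9) {
    return { ok: false, reason: "not a 9x9 grid" };
  }
  const grid = solution as number[][];
  for (const row of grid) {
    if (!Array.isArray(row) || row.length !== 9 ||
        row.some((v) => !Number.isInteger(v) || v < 1 || v > 9)) {
      return { ok: false, reason: "cells must be integers 1-9" };
    }
  }
  // 2) fixed clues must remain unchanged
  for (let r = 0; r < 9; r++) {
    for (let c = 0; c < 9; c++) {
      if (puzzle[r][c] !== 0 && grid[r][c] !== puzzle[r][c]) {
        return { ok: false, reason: "clue changed" };
      }
    }
  }
  // 3) each row, column, and 3x3 box must contain 1-9 exactly once
  for (let i = 0; i < 9; i++) {
    const row = new Set<number>(), col = new Set<number>(), box = new Set<number>();
    for (let j = 0; j < 9; j++) {
      row.add(grid[i][j]);
      col.add(grid[j][i]);
      box.add(grid[3 * Math.floor(i / 3) + Math.floor(j / 3)][3 * (i % 3) + (j % 3)]);
    }
    if (row.size !== 9 || col.size !== 9 || box.size !== 9) {
      return { ok: false, reason: "Sudoku constraint violated" };
    }
  }
  return { ok: true, solution: grid };
}
```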
Observability in one-shot mode
One-shot UI exposes:
- selected provider/model
- timeout used
- result status
- latency
- optional token/cost estimator panel
The estimator is intentionally approximate but useful for quick tradeoff testing against step-based assumptions.
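An approximate estimator in this spirit can be sketched in a few lines; the chars-per-token heuristic and pricing input below are illustrative assumptions, not the panel's actual logic.

```typescript
// Rough token/cost estimate: ~4 characters per token is a common heuristic
// for English-heavy prompts. Price is passed in per 1k tokens (assumption).
function estimateCost(
  promptText: string,
  pricePer1kTokens: number,
): { tokens: number; usd: number } {
  const tokens = Math.ceil(promptText.length / 4);
  return { tokens, usd: (tokens / 1000) * pricePer1kTokens };
}
```

This is deliberately crude, but for comparing a single one-shot request against dozens of per-move requests, order-of-magnitude estimates are enough.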
What this teaches (beyond Sudoku)
The v2 pattern is transferable to many AI workflows:
- keep a stable provider abstraction
- introduce alternate execution modes (step vs batch/one-shot)
- optimize prompts per mode
- keep strict validation unchanged
- decouple cloud auth from startup env when practical
Suggested V3 expansions
- persist one-shot vs step run comparisons
- add provider/model auto-profiling over multiple puzzles
- expose prompt presets (compact, strict, reasoning-heavy)
- generate benchmark reports and trend charts
V1 gave us operational resilience.
V2 gives us cost-aware one-shot benchmarking while preserving correctness gates.
GitHub repo: https://github.com/harishkotra/agentoku
