Multi-Language Code Evaluation Pipeline for LeetCode Style Problems
Most evaluator write-ups optimize for speed first. Our biggest quality issue was not latency but false negatives.
We repeatedly saw “correct-looking” solutions fail across languages due to starter drift, I/O contract mismatches, and comparator inconsistency. So we redesigned the pipeline around one goal:
deterministic, explainable verdicts, with no AI validation.
This work came out of building CodeNexus, a mobile LeetCode-style coding app designed around habit loops, where trust in verdicts matters most. That constraint forced us to treat evaluation correctness as a first-class system problem, not just an execution detail.
What we built
Starter quality gate (pre-execution)
- Validates starter templates before users run anything.
- Catches missing/empty templates, TODO-only scaffolds, missing callable signatures, and structural syntax defects.
- Prevents template defects from polluting runtime pass/fail metrics.
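For Python starters, a gate like this can be sketched with the standard `ast` module. The function name and defect labels below are illustrative assumptions, not the actual CodeNexus internals:

```python
import ast


def gate_python_starter(source: str, required_fn: str) -> list[str]:
    """Return defect labels; an empty list means the starter passes the gate."""
    if not source.strip():
        return ["missing_or_empty_template"]
    # Lines that are neither blank nor comments; if none exist, the file
    # is nothing but TODO comments.
    code_lines = [line for line in source.splitlines()
                  if line.strip() and not line.strip().startswith("#")]
    if not code_lines:
        return ["todo_only_scaffold"]
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return ["structural_syntax_defect"]
    # The starter must expose the callable the harness will invoke.
    defined = {node.name for node in ast.walk(tree)
               if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))}
    return [] if required_fn in defined else ["missing_callable_signature"]
```

Because the gate runs before any execution, a bad template is reported as a template defect rather than showing up later as a mysterious runtime failure.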
Starter smoke validation
- Fast per-language smoke runs classify failures as starter quality vs solver quality.
- Surfaces wrapper/parser drift and placeholder runtime crashes early.
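A minimal sketch of such a smoke run for Python, assuming the judge's wrapper is available as a source string: we execute the untouched starter through the wrapper and treat anything other than a clean "not implemented" failure as a starter-quality problem. The return labels are illustrative:

```python
import subprocess
import sys


def smoke_run(wrapper_source: str, timeout_s: float = 5.0) -> str:
    """Run the wrapper with the untouched starter and triage the outcome."""
    try:
        proc = subprocess.run([sys.executable, "-c", wrapper_source],
                              capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "starter_quality_failure"  # a placeholder should never hang
    if proc.returncode != 0:
        # NotImplementedError from a placeholder body is the expected outcome;
        # anything else (ImportError, parser drift, wrapper crash) indicts
        # the starter/wrapper, not the solver.
        if "NotImplementedError" in proc.stderr:
            return "ok_placeholder"
        return "starter_quality_failure"
    return "ok"
```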
Contract-driven comparison layer
- Each problem can define indexing, output format, order sensitivity, unordered strategy, and optional semantic validator.
- Comparator sequence:
  1. normalize expected/actual
  2. exact match
  3. unordered match (when allowed)
  4. semantic validation for multi-answer correctness
  5. diagnostic mismatch classification
- Normalization includes whitespace, boolean canonicalization, JSON normalization, and unordered multiset strategies.
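The comparator sequence can be sketched roughly as follows. The contract field names (`order_sensitive`, `semantic_validator`) and failure labels are assumptions for illustration:

```python
import json
from collections import Counter
from typing import Any, Callable, Optional


def normalize(value: str) -> Any:
    """Whitespace, boolean, and JSON normalization."""
    text = value.strip()
    if text.lower() in ("true", "false"):       # boolean canonicalization
        return text.lower() == "true"
    try:
        return json.loads(text)                 # JSON normalization
    except (json.JSONDecodeError, ValueError):
        return " ".join(text.split())           # collapse whitespace


def compare(expected: str, actual: str,
            order_sensitive: bool = True,
            semantic_validator: Optional[Callable[[Any, Any], bool]] = None) -> str:
    exp, act = normalize(expected), normalize(actual)
    if exp == act:
        return "pass"                           # exact match
    if not order_sensitive and isinstance(exp, list) and isinstance(act, list):
        # Unordered multiset strategy; serialize elements so nested
        # lists are hashable for counting.
        if Counter(map(json.dumps, exp)) == Counter(map(json.dumps, act)):
            return "pass"
    if semantic_validator and semantic_validator(exp, act):
        return "pass"                           # multi-answer correctness
    # Diagnostic mismatch classification for triage.
    if type(exp) is not type(act):
        return "fail:type_mismatch"
    if isinstance(exp, list) and len(exp) != len(act):
        return "fail:length_mismatch"
    return "fail:value_mismatch"
```

The semantic validator hook is what lets problems with many valid answers (e.g. "return any valid ordering") pass without hand-listing every acceptable output.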
Hardened execution path
- Clear separation of compile errors, runtime errors, and infrastructure failures.
- Language-specific execution config is explicit (including TS compiler options).
- Batch submission + polling for throughput, sequential fallback for reliability without changing grading semantics.
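The batch-plus-fallback control flow looks roughly like this. The `backend` client and its `submit_batch`/`poll`/`run_one` methods are hypothetical, loosely modeled on token-polling judges; the point is that both paths produce the same graded results:

```python
import time


def run_suite(backend, submissions, poll_interval=0.5, max_polls=60):
    """Batch-submit for throughput; fall back to sequential runs on failure."""
    try:
        tokens = backend.submit_batch(submissions)   # one round trip
    except Exception:
        # Sequential fallback: slower, but isolated, with identical
        # grading semantics.
        return [backend.run_one(s) for s in submissions]
    results = {}
    for _ in range(max_polls):
        pending = [t for t in tokens if t not in results]
        if not pending:
            break
        for token, result in backend.poll(pending):  # poll unfinished runs
            if result is not None:
                results[token] = result
        time.sleep(poll_interval)
    return [results.get(t) for t in tokens]
```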
Artifact-first outputs
- Every run emits a structured JSON artifact with summary + failure details.
- Failures include problem slug, failure class, and expected/actual snippets for fast triage, analytics, and replay.
```json
{
  "language": "python",
  "summary": { "total": 316, "passed": 316, "failed": 0, "errors": 0, "passRate": 100 },
  "failures": []
}
```
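An artifact writer matching that schema can be sketched as below. The outcome dict shape and field names (`slug`, `verdict`, `failureClass`) are assumptions for illustration:

```python
import json


def write_artifact(path, language, outcomes):
    """Summarize per-problem outcomes into the run artifact and write it to disk."""
    failures = [o for o in outcomes if o["verdict"].startswith("fail")]
    errors = [o for o in outcomes if o["verdict"] == "error"]
    passed = len(outcomes) - len(failures) - len(errors)
    artifact = {
        "language": language,
        "summary": {
            "total": len(outcomes),
            "passed": passed,
            "failed": len(failures),
            "errors": len(errors),
            "passRate": round(100 * passed / len(outcomes), 1) if outcomes else 0,
        },
        "failures": [
            {
                "slug": o["slug"],
                "failureClass": o["verdict"],
                # Truncated snippets keep artifacts small but triageable.
                "expected": str(o.get("expected"))[:200],
                "actual": str(o.get("actual"))[:200],
            }
            for o in failures + errors
        ],
    }
    with open(path, "w") as fh:
        json.dump(artifact, fh, indent=2)
    return artifact
```

Because every run emits the same schema, pass-rate dashboards, failure-class analytics, and replay tooling all consume one format.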
Results (suite pass rate, before -> after)
- C++: 19.0% -> 100.0%
- Go: 7.0% -> 100.0%
- Java: 0.9% -> 100.0%
- Non-SQL suite: 316/316 across supported languages
Most reliability gains came from evaluation architecture, not algorithm rewrites.
In multi-language judges, deterministic contracts and artifacts matter more than raw execution speed once baseline performance is acceptable.