Multi-Language Code Evaluation Pipeline for LeetCode Style Problems
Most evaluator write-ups optimize for speed first. Our biggest quality issue was not latency but false negatives.
We repeatedly saw “correct-looking” solutions fail across languages due to starter drift, I/O contract mismatches, and comparator inconsistency. So we redesigned the pipeline around one goal:
deterministic, explainable verdicts, with no AI validation.
This work came out of building CodeNexus, a mobile LeetCode-style coding app designed around habit loops, where trust in verdicts matters most. That constraint forced us to treat evaluation correctness as a first-class system problem, not just an execution detail.
What we built
Starter quality gate (pre-execution)
- Validates starter templates before users run anything.
- Catches missing/empty templates, TODO-only scaffolds, missing callable signatures, and structural syntax defects.
- Prevents template defects from polluting runtime pass/fail metrics.
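For Python starters, a gate like this can be sketched with the standard `ast` module. The function name and defect labels below are illustrative assumptions, not the actual CodeNexus internals:

```python
import ast


def gate_python_starter(source: str, required_fn: str) -> list[str]:
    """Return defect labels; an empty list means the starter passes the gate."""
    if not source.strip():
        return ["missing_or_empty_template"]
    # Lines that are neither blank nor comments; if none exist, the file
    # is nothing but TODO comments.
    code_lines = [line for line in source.splitlines()
                  if line.strip() and not line.strip().startswith("#")]
    if not code_lines:
        return ["todo_only_scaffold"]
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return ["structural_syntax_defect"]
    # The starter must expose the callable the harness will invoke.
    defined = {node.name for node in ast.walk(tree)
               if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))}
    return [] if required_fn in defined else ["missing_callable_signature"]
```

Because the gate runs before any execution, a bad template is reported as a template defect rather than showing up later as a mysterious runtime failure.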
Starter smoke validation
- Fast per-language smoke runs classify failures as starter quality vs solver quality.
- Surfaces wrapper/parser drift and placeholder runtime crashes early.
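A minimal sketch of such a smoke run for Python, assuming the judge's wrapper is available as a source string: we execute the untouched starter through the wrapper and treat anything other than a clean "not implemented" failure as a starter-quality problem. The return labels are illustrative:

```python
import subprocess
import sys


def smoke_run(wrapper_source: str, timeout_s: float = 5.0) -> str:
    """Run the wrapper with the untouched starter and triage the outcome."""
    try:
        proc = subprocess.run([sys.executable, "-c", wrapper_source],
                              capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "starter_quality_failure"  # a placeholder should never hang
    if proc.returncode != 0:
        # NotImplementedError from a placeholder body is the expected outcome;
        # anything else (ImportError, parser drift, wrapper crash) indicts
        # the starter/wrapper, not the solver.
        if "NotImplementedError" in proc.stderr:
            return "ok_placeholder"
        return "starter_quality_failure"
    return "ok"
```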
Contract-driven comparison layer
- Each problem can define indexing, output format, order sensitivity, unordered strategy, and optional semantic validator.
- Comparator sequence:
  1. normalize expected/actual
  2. exact match
  3. unordered match (when allowed)
  4. semantic validation for multi-answer correctness
  5. diagnostic mismatch classification
- Normalization includes whitespace, boolean canonicalization, JSON normalization, and unordered multiset strategies.
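The comparator sequence can be sketched roughly as follows. The contract field names (`order_sensitive`, `semantic_validator`) and failure labels are assumptions for illustration:

```python
import json
from collections import Counter
from typing import Any, Callable, Optional


def normalize(value: str) -> Any:
    """Whitespace, boolean, and JSON normalization."""
    text = value.strip()
    if text.lower() in ("true", "false"):       # boolean canonicalization
        return text.lower() == "true"
    try:
        return json.loads(text)                 # JSON normalization
    except (json.JSONDecodeError, ValueError):
        return " ".join(text.split())           # collapse whitespace


def compare(expected: str, actual: str,
            order_sensitive: bool = True,
            semantic_validator: Optional[Callable[[Any, Any], bool]] = None) -> str:
    exp, act = normalize(expected), normalize(actual)
    if exp == act:
        return "pass"                           # exact match
    if not order_sensitive and isinstance(exp, list) and isinstance(act, list):
        # Unordered multiset strategy; serialize elements so nested
        # lists are hashable for counting.
        if Counter(map(json.dumps, exp)) == Counter(map(json.dumps, act)):
            return "pass"
    if semantic_validator and semantic_validator(exp, act):
        return "pass"                           # multi-answer correctness
    # Diagnostic mismatch classification for triage.
    if type(exp) is not type(act):
        return "fail:type_mismatch"
    if isinstance(exp, list) and len(exp) != len(act):
        return "fail:length_mismatch"
    return "fail:value_mismatch"
```

The semantic validator hook is what lets problems with many valid answers (e.g. "return any valid ordering") pass without hand-listing every acceptable output.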
Hardened execution path
- Clear separation of compile errors, runtime errors, and infrastructure failures.
- Language-specific execution config is explicit (including TS compiler options).
- Batch submission + polling for throughput, sequential fallback for reliability without changing grading semantics.
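The batch-plus-fallback control flow looks roughly like this. The `backend` client and its `submit_batch`/`poll`/`run_one` methods are hypothetical, loosely modeled on token-polling judges; the point is that both paths produce the same graded results:

```python
import time


def run_suite(backend, submissions, poll_interval=0.5, max_polls=60):
    """Batch-submit for throughput; fall back to sequential runs on failure."""
    try:
        tokens = backend.submit_batch(submissions)   # one round trip
    except Exception:
        # Sequential fallback: slower, but isolated, with identical
        # grading semantics.
        return [backend.run_one(s) for s in submissions]
    results = {}
    for _ in range(max_polls):
        pending = [t for t in tokens if t not in results]
        if not pending:
            break
        for token, result in backend.poll(pending):  # poll unfinished runs
            if result is not None:
                results[token] = result
        time.sleep(poll_interval)
    return [results.get(t) for t in tokens]
```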
Artifact-first outputs
- Every run emits a structured JSON artifact with summary + failure details.
- Failures include problem slug, failure class, and expected/actual snippets for fast triage, analytics, and replay.
```json
{
  "language": "python",
  "summary": { "total": 316, "passed": 316, "failed": 0, "errors": 0, "passRate": 100 },
  "failures": []
}
```
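An artifact writer matching that schema can be sketched as below. The outcome dict shape and field names (`slug`, `verdict`, `failureClass`) are assumptions for illustration:

```python
import json


def write_artifact(path, language, outcomes):
    """Summarize per-problem outcomes into the run artifact and write it to disk."""
    failures = [o for o in outcomes if o["verdict"].startswith("fail")]
    errors = [o for o in outcomes if o["verdict"] == "error"]
    passed = len(outcomes) - len(failures) - len(errors)
    artifact = {
        "language": language,
        "summary": {
            "total": len(outcomes),
            "passed": passed,
            "failed": len(failures),
            "errors": len(errors),
            "passRate": round(100 * passed / len(outcomes), 1) if outcomes else 0,
        },
        "failures": [
            {
                "slug": o["slug"],
                "failureClass": o["verdict"],
                # Truncated snippets keep artifacts small but triageable.
                "expected": str(o.get("expected"))[:200],
                "actual": str(o.get("actual"))[:200],
            }
            for o in failures + errors
        ],
    }
    with open(path, "w") as fh:
        json.dump(artifact, fh, indent=2)
    return artifact
```

Because every run emits the same schema, pass-rate dashboards, failure-class analytics, and replay tooling all consume one format.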
Results (suite pass rate, before -> after)
- C++: 19.0% -> 100.0%
- Go: 7.0% -> 100.0%
- Java: 0.9% -> 100.0%
- Non-SQL suite: 316/316 across supported languages
Most reliability gains came from evaluation architecture, not algorithm rewrites.
In multi-language judges, deterministic contracts and artifacts matter more than raw execution speed once baseline performance is acceptable.