
Jordan

Multi-Language Code Evaluation Pipeline for LeetCode-Style Problems


Most evaluator writeups optimize for speed first.

Our biggest quality issue was not latency, it was false negatives.

We repeatedly saw “correct-looking” solutions fail across languages due to starter drift, I/O contract mismatch, and comparator inconsistency. So we redesigned the pipeline around one goal:

deterministic, explainable verdicts, with no AI validation.

This work came out of building CodeNexus, a mobile LeetCode-style coding app designed to help users form habit loops. In that context, trust mattered most, and that constraint forced us to treat evaluation correctness as a first-class system problem, not just an execution detail.

What we built

Starter quality gate (pre-execution)

  • Validates starter templates before users run anything.
  • Catches missing/empty templates, TODO-only scaffolds, missing callable signatures, and structural syntax defects.
  • Prevents template defects from polluting runtime pass/fail metrics.
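The gate above can be sketched for one language. This is a minimal illustration, assuming templates are plain source strings; the function name and defect labels are hypothetical, not the production API:

```python
import ast

def check_python_starter(source: str) -> list[str]:
    """Pre-execution starter gate: returns defect labels (empty list = pass).

    Illustrative sketch; label names and the exact checks are assumptions.
    """
    defects = []
    if not source.strip():
        return ["empty_template"]            # missing/empty template
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return ["structural_syntax_defect"]  # template does not even parse
    funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    if not funcs:
        defects.append("missing_callable_signature")
    # TODO-only scaffold: function bodies contain only pass/docstrings
    bodies = [stmt for f in funcs for stmt in f.body]
    if funcs and all(isinstance(s, (ast.Pass, ast.Expr)) for s in bodies):
        defects.append("todo_only_scaffold")
    return defects
```

Because this runs before any user code, a defective template is reported as a template problem and never enters the runtime pass/fail tally.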

Starter smoke validation

  • Fast per-language smoke runs classify failures as starter quality vs solver quality.
  • Surfaces wrapper/parser drift and placeholder runtime crashes early.
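A smoke run can be as simple as executing the starter template by itself and classifying the outcome before any solver code is blamed. A minimal sketch, assuming Python templates on disk; the verdict labels are illustrative assumptions:

```python
import subprocess
import sys
import tempfile

def smoke_run_python(template: str, timeout_s: float = 5.0) -> str:
    """Run a starter template alone and classify the result.

    Returns one of "starter_ok", "starter_quality_failure",
    "infrastructure_failure" (labels are assumptions, not the real schema).
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(template)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "starter_quality_failure"   # a hanging placeholder is a starter defect
    except OSError:
        return "infrastructure_failure"    # e.g. interpreter missing: not anyone's code
    # A template that crashes on definition is a starter problem,
    # not a solver problem.
    return "starter_ok" if proc.returncode == 0 else "starter_quality_failure"
```

The point is the attribution: a crash here is logged against starter quality, so solver-quality metrics stay clean.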

Contract-driven comparison layer

  • Each problem can define indexing, output format, order sensitivity, unordered strategy, and optional semantic validator.
  • Comparator sequence:
    1. normalize expected/actual
    2. exact match
    3. unordered match (when allowed)
    4. semantic validation for multi-answer correctness
    5. diagnostic mismatch classification
  • Normalization includes whitespace, boolean canonicalization, JSON normalization, and unordered multiset strategies.
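The comparator sequence above can be sketched as one function. This is a simplified model, assuming string inputs and a per-problem `unordered` flag plus optional `validator` callback; the real contract also covers indexing and output format, which are omitted here:

```python
import json
from collections import Counter

def compare(expected: str, actual: str, *, unordered: bool = False,
            validator=None):
    """Contract-driven comparison: normalize, exact, unordered, semantic.

    Returns (verdict, detail). Names and flags are illustrative assumptions.
    """
    def normalize(text: str):
        s = text.strip()
        if s.lower() in ("true", "false"):    # boolean canonicalization
            return s.lower() == "true"
        try:
            return json.loads(s)              # JSON normalization
        except (json.JSONDecodeError, ValueError):
            return " ".join(s.split())        # whitespace normalization

    e, a = normalize(expected), normalize(actual)
    if e == a:
        return "pass", "exact"
    if unordered and isinstance(e, list) and isinstance(a, list):
        # multiset strategy: order-insensitive but multiplicity-sensitive
        key = lambda xs: Counter(json.dumps(x, sort_keys=True) for x in xs)
        if key(e) == key(a):
            return "pass", "unordered"
    if validator is not None and validator(e, a):
        return "pass", "semantic"             # multi-answer correctness
    kind = "shape_mismatch" if type(e) is not type(a) else "value_mismatch"
    return "fail", kind                       # diagnostic classification
```

Each accepting branch records which rule matched, so a pass is always explainable and a fail always carries a mismatch class.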

Hardened execution path

  • Clear separation of compile errors, runtime errors, and infrastructure failures.
  • Language-specific execution config is explicit (including TS compiler options).
  • Batch submission + polling for throughput, sequential fallback for reliability without changing grading semantics.
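The compile/runtime/infrastructure separation can be sketched as a small classifier. The return-code conventions and label names here are assumptions, not the production config:

```python
class Verdict:
    COMPILE_ERROR = "compile_error"
    RUNTIME_ERROR = "runtime_error"
    INFRA_FAILURE = "infrastructure_failure"
    OK = "ok"

def classify(compile_rc: int, run_rc, sandbox_up: bool) -> str:
    """Separate compile, runtime, and infrastructure failures.

    Illustrative sketch: rc conventions and labels are assumptions.
    """
    if not sandbox_up:
        # Infrastructure failures are never charged against the solver,
        # which is what lets batch and sequential paths share semantics.
        return Verdict.INFRA_FAILURE
    if compile_rc != 0:
        return Verdict.COMPILE_ERROR
    if run_rc is None or run_rc != 0:
        return Verdict.RUNTIME_ERROR
    return Verdict.OK
```

Because the classifier is the same regardless of how the run was scheduled, falling back from batch polling to sequential execution changes throughput but never the verdict.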

Artifact-first outputs

  • Every run emits a structured JSON artifact with summary + failure details.
  • Failures include problem slug, failure class, and expected/actual snippets for fast triage, analytics, and replay.
{
  "language": "python",
  "summary": { "total": 316, "passed": 316, "failed": 0, "errors": 0, "passRate": 100 },
  "failures": []
}
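Because the artifact is structured JSON, triage can be scripted. A minimal consumer sketch, assuming a `failureClass` field on each failure entry (the failure-entry schema is an assumption; only the top-level fields appear in the example above):

```python
import json
from collections import Counter

def triage(artifact_json: str) -> dict:
    """Group a run artifact's failures by failure class for fast triage."""
    artifact = json.loads(artifact_json)
    by_class = Counter(f.get("failureClass", "unknown")
                       for f in artifact["failures"])
    return {
        "language": artifact["language"],
        "passRate": artifact["summary"]["passRate"],
        "byClass": dict(by_class),
    }
```

The same artifact feeds analytics dashboards and replay tooling without re-running anything.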

Results

  • C++: 19.0% -> 100.0%
  • Go: 7.0% -> 100.0%
  • Java: 0.9% -> 100.0%
  • Non-SQL suite: 316/316 across supported languages

Most reliability gains came from evaluation architecture, not algorithm rewrites.

In multi-language judges, deterministic contracts and artifacts matter more than raw execution speed once baseline performance is acceptable.
