<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jordan </title>
    <description>The latest articles on DEV Community by Jordan  (@jschilling12).</description>
    <link>https://dev.to/jschilling12</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3886171%2Fd628ca49-5874-4bb5-b376-c8813da53162.jpeg</url>
      <title>DEV Community: Jordan </title>
      <link>https://dev.to/jschilling12</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jschilling12"/>
    <language>en</language>
    <item>
      <title>Multi-Language Code Evaluation Pipeline for LeetCode-Style Problems</title>
      <dc:creator>Jordan </dc:creator>
      <pubDate>Sun, 19 Apr 2026 04:49:04 +0000</pubDate>
      <link>https://dev.to/jschilling12/multi-language-code-evaluation-pipeline-for-leetcode-style-problems-5f6e</link>
      <guid>https://dev.to/jschilling12/multi-language-code-evaluation-pipeline-for-leetcode-style-problems-5f6e</guid>
<description>

&lt;p&gt;Most evaluator writeups optimize for speed first.&lt;br&gt;&lt;br&gt;
Our biggest quality issue was not latency; it was false negatives.&lt;/p&gt;

&lt;p&gt;We repeatedly saw “correct-looking” solutions fail across languages because of starter drift, I/O contract mismatches, and comparator inconsistencies. So we redesigned the pipeline around one goal:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Deterministic, explainable verdicts, with no AI validation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This work came out of building CodeNexus, a mobile LeetCode-style coding app designed to help users form habit loops. The constraint was that trust mattered most, which forced us to treat evaluation correctness as a first-class system problem, not just an execution detail.&lt;/p&gt;

&lt;h3&gt;What we built&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Starter quality gate (pre-execution)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validates starter templates before users run anything.&lt;/li&gt;
&lt;li&gt;Catches missing/empty templates, TODO-only scaffolds, missing callable signatures, and structural syntax defects.&lt;/li&gt;
&lt;li&gt;Prevents template defects from polluting runtime pass/fail metrics.&lt;/li&gt;
&lt;/ul&gt;
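
&lt;p&gt;As a rough illustration of the checks listed above, here is a minimal sketch of a gate for Python starters only. The function name &lt;code&gt;check_starter&lt;/code&gt;, the defect labels, and the use of Python's &lt;code&gt;ast&lt;/code&gt; module are assumptions made for this example, not the pipeline's actual code; other languages would need their own parser or compiler check.&lt;/p&gt;

```python
import ast


def check_starter(template, entry_point):
    """Return defect labels for a Python starter template; an empty list means it passes."""
    # Missing or empty template.
    if template is None or not template.strip():
        return ["missing_or_empty"]
    # Structural syntax defects: the template must at least parse.
    try:
        tree = ast.parse(template)
    except SyntaxError:
        return ["syntax_defect"]
    # TODO-only scaffold: a file of comments parses to an empty module body.
    if not tree.body:
        return ["todo_only_scaffold"]
    # Missing callable signature: the entry point must be defined somewhere,
    # including as a method nested inside a class.
    defined = {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    if entry_point not in defined:
        return ["missing_callable_signature"]
    return []
```

&lt;p&gt;For example, &lt;code&gt;check_starter("# TODO: implement", "two_sum")&lt;/code&gt; reports &lt;code&gt;["todo_only_scaffold"]&lt;/code&gt;, so that template would be flagged before any user submission is graded against it.&lt;/p&gt;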

&lt;p&gt;&lt;strong&gt;Starter smoke validation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast per-language smoke runs classify failures as starter quality vs solver quality.&lt;/li&gt;
&lt;li&gt;Surfaces wrapper/parser drift and placeholder runtime crashes early.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Contract-driven comparison layer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each problem can define indexing, output format, order sensitivity, an unordered-match strategy, and an optional semantic validator.&lt;/li&gt;
&lt;li&gt;Comparator sequence:

&lt;ol&gt;
&lt;li&gt;normalize expected/actual&lt;/li&gt;
&lt;li&gt;exact match&lt;/li&gt;
&lt;li&gt;unordered match (when allowed)&lt;/li&gt;
&lt;li&gt;semantic validation for multi-answer correctness&lt;/li&gt;
&lt;li&gt;diagnostic mismatch classification&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;Normalization covers whitespace handling, boolean canonicalization, JSON normalization, and multiset comparison for unordered outputs.&lt;/li&gt;
&lt;/ul&gt;
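
&lt;p&gt;The comparator sequence above could be sketched roughly as follows. Everything named here (&lt;code&gt;Contract&lt;/code&gt;, &lt;code&gt;normalize&lt;/code&gt;, &lt;code&gt;compare&lt;/code&gt;, and the verdict strings) is hypothetical, and the contract is reduced to two fields to keep the example short. Treat it as one possible reading of the five steps, not the real implementation.&lt;/p&gt;

```python
import json
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Contract:
    """Per-problem comparison rules (an assumed, simplified shape)."""
    order_sensitive: bool = True
    validator: Optional[Callable[[Any], bool]] = None


def normalize(raw):
    """Step 1: trim whitespace, canonicalize booleans, parse JSON when possible."""
    text = raw.strip()
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return " ".join(text.split())


def canonical(value):
    """Stable string form, so that True and 1 are never treated as equal."""
    return json.dumps(value, sort_keys=True)


def multiset(items):
    return Counter(canonical(item) for item in items)


def compare(expected_raw, actual_raw, contract):
    expected, actual = normalize(expected_raw), normalize(actual_raw)
    # Step 2: exact match on the normalized values.
    if canonical(expected) == canonical(actual):
        return "pass:exact"
    both_lists = isinstance(expected, list) and isinstance(actual, list)
    same_items = both_lists and multiset(expected) == multiset(actual)
    # Step 3: unordered match, only when the contract allows it.
    if same_items and not contract.order_sensitive:
        return "pass:unordered"
    # Step 4: semantic validation for problems with several valid answers.
    if contract.validator is not None and contract.validator(actual):
        return "pass:semantic"
    # Step 5: classify the mismatch so the verdict is explainable.
    if type(expected) is not type(actual):
        return "fail:type_mismatch"
    if same_items:
        return "fail:order_mismatch"
    return "fail:value_mismatch"
```

&lt;p&gt;With this sketch, &lt;code&gt;compare("[1,2]", "[2,1]", Contract(order_sensitive=False))&lt;/code&gt; returns &lt;code&gt;"pass:unordered"&lt;/code&gt;, while the same call with the default contract returns &lt;code&gt;"fail:order_mismatch"&lt;/code&gt;. Comparing canonical JSON strings rather than Python values is a deliberate choice here, because Python treats &lt;code&gt;True == 1&lt;/code&gt; as true.&lt;/p&gt;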

&lt;p&gt;&lt;strong&gt;Hardened execution path&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear separation of compile errors, runtime errors, and infrastructure failures.&lt;/li&gt;
&lt;li&gt;Language-specific execution config is explicit (including TS compiler options).&lt;/li&gt;
&lt;li&gt;Batch submission with polling for throughput, plus a sequential fallback for reliability; neither path changes grading semantics.&lt;/li&gt;
&lt;/ul&gt;
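
&lt;p&gt;One way to picture the batch path and its sequential fallback is the sketch below. The runner callables (&lt;code&gt;submit_batch&lt;/code&gt;, &lt;code&gt;submit_one&lt;/code&gt;), the &lt;code&gt;InfrastructureError&lt;/code&gt; type, and the result fields &lt;code&gt;compiled&lt;/code&gt; and &lt;code&gt;exit_code&lt;/code&gt; are all assumptions for illustration; polling is assumed to happen inside &lt;code&gt;submit_batch&lt;/code&gt;.&lt;/p&gt;

```python
class InfrastructureError(Exception):
    """Hypothetical error raised when the runner itself fails (timeouts, queue outages)."""


def classify(result):
    """Map one raw runner result to a verdict class. Field names are assumed."""
    if not result["compiled"]:
        return "compile_error"
    if result["exit_code"] != 0:
        return "runtime_error"
    return "ok"


def run_suite(cases, submit_batch, submit_one):
    """Prefer the batch path; fall back to one-by-one submission on infrastructure failure."""
    try:
        raw_results = submit_batch(cases)
    except InfrastructureError:
        raw_results = []
        for case in cases:
            try:
                raw_results.append(submit_one(case))
            except InfrastructureError:
                raw_results.append(None)  # still unreachable after the fallback
    # Both paths feed the same classifier, so grading semantics do not change.
    return [
        "infrastructure_failure" if raw is None else classify(raw)
        for raw in raw_results
    ]
```

&lt;p&gt;Keeping infrastructure failures as their own class means a sandbox outage is never reported to the user as a wrong answer.&lt;/p&gt;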

&lt;p&gt;&lt;strong&gt;Artifact-first outputs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every run emits a structured JSON artifact with a summary and failure details.&lt;/li&gt;
&lt;li&gt;Failures include problem slug, failure class, and expected/actual snippets for fast triage, analytics, and replay.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"total"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;316&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;316&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"failed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"errors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"passRate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"failures"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
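
&lt;p&gt;An artifact of this shape could be assembled along the following lines. The top-level and summary keys mirror the JSON above; the per-failure keys (&lt;code&gt;slug&lt;/code&gt;, &lt;code&gt;failureClass&lt;/code&gt;, &lt;code&gt;expected&lt;/code&gt;, &lt;code&gt;actual&lt;/code&gt;), the verdict strings, and the split between &lt;code&gt;failed&lt;/code&gt; (wrong answers) and &lt;code&gt;errors&lt;/code&gt; (everything else) are assumptions made for this sketch.&lt;/p&gt;

```python
import json


def build_artifact(language, results, snippet_len=80):
    """Summarize per-problem results into one JSON-serializable artifact."""
    total = len(results)
    passed = sum(1 for r in results if r["verdict"] == "pass")
    failed = sum(1 for r in results if r["verdict"] == "wrong_answer")
    errors = total - passed - failed  # compile, runtime, and infrastructure problems
    failures = [
        {
            "slug": r["slug"],
            "failureClass": r["verdict"],
            # Truncated snippets keep artifacts small but still useful for triage.
            "expected": r["expected"][:snippet_len],
            "actual": r["actual"][:snippet_len],
        }
        for r in results
        if r["verdict"] != "pass"
    ]
    pass_rate = round(100 * passed / total, 1) if total else 0.0
    return {
        "language": language,
        "summary": {
            "total": total,
            "passed": passed,
            "failed": failed,
            "errors": errors,
            "passRate": pass_rate,
        },
        "failures": failures,
    }


if __name__ == "__main__":
    # Made-up sample data, only to show the call.
    sample = [
        {"slug": "two-sum", "verdict": "pass", "expected": "[0,1]", "actual": "[0,1]"},
        {"slug": "add-binary", "verdict": "wrong_answer", "expected": '"100"', "actual": '"11"'},
    ]
    print(json.dumps(build_artifact("python", sample), indent=2))
```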



&lt;h3&gt;Results&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;C++: &lt;strong&gt;19.0% -&amp;gt; 100.0%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Go: &lt;strong&gt;7.0% -&amp;gt; 100.0%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Java: &lt;strong&gt;0.9% -&amp;gt; 100.0%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Non-SQL suite: &lt;strong&gt;316/316&lt;/strong&gt; across supported languages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most reliability gains came from evaluation architecture, not algorithm rewrites.&lt;/p&gt;

&lt;p&gt;In multi-language judges, deterministic contracts and artifacts matter more than raw execution speed once baseline performance is acceptable.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>architecture</category>
      <category>testing</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
