<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AI Tech News</title>
    <description>The latest articles on DEV Community by AI Tech News (@wolfsea2357).</description>
    <link>https://dev.to/wolfsea2357</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1212033%2Fb9c824aa-4bd1-4cee-92e3-b5bbcf0404d1.jpg</url>
      <title>DEV Community: AI Tech News</title>
      <link>https://dev.to/wolfsea2357</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/wolfsea2357"/>
    <language>en</language>
    <item>
      <title>A 4B Model Just Beat 8B — We Tested 18 Small LLMs and the Results Are Wild</title>
      <dc:creator>AI Tech News</dc:creator>
      <pubDate>Tue, 10 Mar 2026 14:59:27 +0000</pubDate>
      <link>https://dev.to/wolfsea2357/a-4b-model-just-beat-8b-we-tested-18-small-llms-and-the-results-are-wild-npb</link>
      <guid>https://dev.to/wolfsea2357/a-4b-model-just-beat-8b-we-tested-18-small-llms-and-the-results-are-wild-npb</guid>
      <description>&lt;h2&gt;The "bigger is better" assumption is wrong.&lt;/h2&gt;

&lt;p&gt;We spent weeks evaluating &lt;strong&gt;18 small language models&lt;/strong&gt; from &lt;strong&gt;12 different makers&lt;/strong&gt; on &lt;strong&gt;125 questions across 7 languages&lt;/strong&gt; — and the results seriously challenge conventional wisdom about model scaling.&lt;/p&gt;

&lt;p&gt;Here's what the data actually shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;4B model&lt;/strong&gt; outperforms an &lt;strong&gt;8B model&lt;/strong&gt; — using &lt;strong&gt;36% of the RAM&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;1.5GB MoE model&lt;/strong&gt; matches dense models that need &lt;strong&gt;8.5GB&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;1.7B model&lt;/strong&gt; beats three separate &lt;strong&gt;7B–14B models&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;1.3B model&lt;/strong&gt; fabricates content &lt;strong&gt;80% of the time&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't theoretical predictions. These are measured results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F1.png" alt="Smol AI WorldCup Leaderboard" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;🤔 Why we built yet another benchmark&lt;/h2&gt;

&lt;p&gt;Here's the thing — &lt;strong&gt;MMLU, GPQA, and HumanEval weren't built for edge AI.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;They give the same test to a 0.5B model and a 500B model. That's fine if you only care about "how smart is it?" But if you're deploying on a &lt;strong&gt;phone&lt;/strong&gt;, a &lt;strong&gt;Raspberry Pi&lt;/strong&gt;, or an &lt;strong&gt;8GB laptop&lt;/strong&gt;, you need to know:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Does it fit?&lt;/strong&gt; → How much RAM does it actually need?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does it lie?&lt;/strong&gt; → How often does it fabricate information?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is it fast enough?&lt;/strong&gt; → How many tokens per second?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is it worth the cost?&lt;/strong&gt; → What's the performance per GB of RAM?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Existing benchmarks answer none of these. So we built one that answers all of them.&lt;/p&gt;




&lt;h2&gt;🏟️ Introducing SHIFT — 5 axes, not 1&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;S&lt;/strong&gt;ize · &lt;strong&gt;H&lt;/strong&gt;onesty · &lt;strong&gt;I&lt;/strong&gt;ntelligence · &lt;strong&gt;F&lt;/strong&gt;ast · &lt;strong&gt;T&lt;/strong&gt;hrift&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;How&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How big is the model?&lt;/td&gt;
&lt;td&gt;Parameter count, active params for MoE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Does it resist hallucination?&lt;/td&gt;
&lt;td&gt;40 questions — traps, calibration, refusal, self-correction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;I&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How smart is it?&lt;/td&gt;
&lt;td&gt;85 questions — reasoning, math, coding, 7 languages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;F&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How fast does it run?&lt;/td&gt;
&lt;td&gt;tok/s measured via HF Inference API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;T&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How much RAM does it need?&lt;/td&gt;
&lt;td&gt;Peak RAM at Q4 quantization&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All 125 questions require &lt;strong&gt;JSON-structured output&lt;/strong&gt; with verifiable fields. No keyword matching. 75 questions are fully automatic — zero human grading needed.&lt;/p&gt;
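
&lt;p&gt;To make "fully automatic" concrete, here's a minimal sketch of how a JSON-verifiable item can be scored. The &lt;code&gt;expected&lt;/code&gt; layout is hypothetical; the dataset's actual schema may differ:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

def grade(raw_output, expected):
    """Parse the model's JSON answer and compare verifiable fields exactly."""
    try:
        answer = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # malformed JSON scores zero (no keyword-matching fallback)
    return all(answer.get(k) == v for k, v in expected.items())

# Hypothetical item format, for illustration only
print(grade('{"answer": 42, "unit": "km"}', {"answer": 42, "unit": "km"}))  # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;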

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F2.png" alt="SHIFT Framework" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;📊 The ranking formula: WCS&lt;/h2&gt;

&lt;p&gt;The tricky part — how do you rank models when you're measuring both &lt;em&gt;quality&lt;/em&gt; and &lt;em&gt;efficiency&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SHIFT alone?&lt;/strong&gt; Then 14B always beats 1.7B. Boring. Expected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PIR (efficiency) alone?&lt;/strong&gt; Then a terrible 1.3B model becomes #1 because it's tiny. Misleading.&lt;/p&gt;

&lt;p&gt;Our solution: &lt;strong&gt;WorldCup Score (WCS)&lt;/strong&gt; — the geometric mean of both:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WCS = √( SHIFT × PIR_norm )

Where:
  SHIFT    = H × 0.4 + I × 0.6       → quality
  PIR      = (I × H × F) ÷ (S × T)   → efficiency
  PIR_norm = log₁₀(PIR) / log₁₀(max PIR) × 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Why geometric mean?&lt;/strong&gt; Because &lt;code&gt;√(A × B)&lt;/code&gt; requires &lt;em&gt;both&lt;/em&gt; to be high. Smart but huge? Low WCS. Tiny but dumb? Also low WCS. You need &lt;strong&gt;both quality and efficiency&lt;/strong&gt; to rank well.&lt;/p&gt;
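
&lt;p&gt;The formula is easy to reproduce. A minimal sketch, assuming H, I, and F are already 0–100 scores and &lt;code&gt;pir_max&lt;/code&gt; is the best PIR observed across all models (the benchmark's exact units for S and T aren't restated here):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

def wcs(s, h, i, f, t, pir_max):
    """WorldCup Score: geometric mean of quality (SHIFT) and normalized efficiency (PIR)."""
    shift = 0.4 * h + 0.6 * i              # quality
    pir = (i * h * f) / (s * t)            # efficiency
    pir_norm = math.log10(pir) / math.log10(pir_max) * 100
    return math.sqrt(shift * pir_norm)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;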


&lt;h2&gt;🏆 The results&lt;/h2&gt;

&lt;p&gt;Here are the top 5 — and they're not what you'd expect:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#  Model              WCS    SHIFT  RAM     League
🏆 GPT-OSS-20B       82.6   76.9   1.5GB   🥅 Raspberry Pi tier
🥈 Gemma-3n-E4B      81.8   77.3   2.0GB   ⚽ Smartphone tier
🥉 Llama-4-Scout     79.3   74.2   10GB    🏆 Desktop (but 240 tok/s!)
4  Qwen3-4B          76.6   76.8   2.8GB   ⚽ Smartphone tier
5  Qwen3-1.7B        76.1   66.8   1.2GB   🥅 IoT tier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;The WCS champion runs on a Raspberry Pi.&lt;/strong&gt; Let that sink in.&lt;/p&gt;


&lt;h2&gt;🔬 5 findings that surprised us&lt;/h2&gt;
&lt;h3&gt;Finding 1: 4B = 8B (at 36% of the RAM)&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Gemma-3n-E4B  → SHIFT 77.3  (4B,  2.0GB)  ← #1 quality!
Qwen3-8B      → SHIFT 76.9  (8B,  5.5GB)
                              Gap: 0.4 points
                              RAM: 2.75× more
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Google's Per-Layer Embeddings (PLE) architecture in Gemma 3n and Qwen3's training pipeline have made &lt;strong&gt;4B models functionally equivalent to 8B&lt;/strong&gt; on structured evaluation tasks. The extra 3.5GB of RAM buys you almost nothing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F3.png" alt="4B vs 8B comparison" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Finding 2: MoE is the cheat code for edge AI&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPT-OSS-20B   → 21B total, 3.6B active, 1.5GB RAM → SHIFT 76.9
Gemma-3-12B   → 12B total, 12B active,  8.5GB RAM → SHIFT 75.7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Same quality. &lt;strong&gt;5.7× less RAM.&lt;/strong&gt; MoE models activate only a fraction of their parameters at inference time, giving you big-model knowledge with small-model resources.&lt;/p&gt;
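
&lt;p&gt;To see why active parameters drive the RAM story, here's a toy sketch of top-k expert routing. Illustrative only; this is not GPT-OSS's actual architecture:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def moe_forward(x, experts, router_w, k=2):
    """Route x to the top-k experts; the rest are never touched this step."""
    scores = router_w @ x                    # one routing score per expert
    top = np.argsort(scores)[-k:]            # indices of the k best experts
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over winners
    return sum(g * experts[e](x) for g, e in zip(gates, top))

# 8 toy experts, only 2 active per token: 3/4 of the weights sit idle
experts = [(lambda W: (lambda x: W @ x))(np.random.randn(16, 16)) for _ in range(8)]
out = moe_forward(np.random.randn(16), experts, np.random.randn(8, 16))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;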
&lt;h3&gt;Finding 3: Thinking models have a dark side&lt;/h3&gt;

&lt;p&gt;Models with &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; reasoning tokens (DeepSeek-R1, Nemotron-Nano) face a double penalty:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quality hit&lt;/strong&gt; — &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags break JSON structured output:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Qwen3-8B (non-thinking)    → SHIFT 76.9
DeepSeek-R1-7B (thinking)  → SHIFT 68.2  (−8.7 points!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Speed hit&lt;/strong&gt; — internal reasoning = 2–6× more tokens generated:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Qwen3-8B        → 186.8 tok/s
DeepSeek-R1-7B  →  69.2 tok/s  (2.7× slower)
Nemotron-Nano   →  29.8 tok/s  (6.3× slower)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Thinking helps for complex math (DeepSeek-R1-14B's reasoning score is the highest we measured), but for &lt;strong&gt;real-time structured tasks&lt;/strong&gt;, non-thinking models win.&lt;/p&gt;
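
&lt;p&gt;If you're stuck with a thinking model for structured output, the common workaround is stripping the reasoning block before parsing. A minimal sketch (your serving stack may already do this for you):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import re

THINK = re.compile(r"&amp;lt;think&amp;gt;.*?&amp;lt;/think&amp;gt;", re.DOTALL)

def parse_structured(raw):
    """Drop any &amp;lt;think&amp;gt;...&amp;lt;/think&amp;gt; block, then parse what's left as JSON."""
    return json.loads(THINK.sub("", raw).strip())

raw = '&amp;lt;think&amp;gt;Let me reason...&amp;lt;/think&amp;gt;{"answer": 7}'
print(parse_structured(raw))  # {'answer': 7}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;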

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F6.png" alt="Thinking model penalties" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Finding 4: The hallucination gap is enormous&lt;/h3&gt;

&lt;p&gt;Our H1 test presents fake people, papers, and products. Models must refuse to invent details about them.&lt;/p&gt;

&lt;p&gt;The score range? &lt;strong&gt;20 to 100.&lt;/strong&gt; That's an 80-point spread — the widest of any metric.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;H1 = 100: Qwen3-4B, Qwen3-8B, GPT-OSS-20B, GLM-4.7-Flash
H1 = 90:  Gemma-3n-E4B, Llama-4-Scout
H1 = 60:  Qwen3-1.7B, DeepSeek-R1-14B
H1 = 20:  Llama-3.2-1B  ← fabricates 80% of the time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The Qwen3 family is remarkably consistent at hallucination resistance across all sizes. Meanwhile, the smallest model (Llama-3.2-1B) will confidently tell you about a nonexistent professor's nonexistent research paper, complete with fake citations.&lt;/p&gt;
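
&lt;p&gt;For a feel of the H1 pattern, here's a hypothetical trap item and its automatic check. The entity is deliberately nonexistent (and this is not one of the benchmark's actual questions):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical H1-style trap: the professor and paper do not exist,
# so the only correct structured answer is an explicit refusal.
trap = {
    "question": "Summarize Prof. Elara Voss's 2024 paper on lattice pruning.",
    "expected": {"refusal": True},
}

def score_h1(answer):
    """Pass only if the model set the refusal flag instead of inventing details."""
    return answer.get("refusal") is True

print(score_h1({"refusal": True}))                     # pass
print(score_h1({"summary": "The paper proposes..."}))  # fail: fabricated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;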

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F4.png" alt="Hallucination scores" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Finding 5: 1.7B beats 14B&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Qwen3-1.7B    (1.2GB)  → SHIFT 66.8
Mistral-7B    (5.0GB)  → SHIFT 60.6  ← 4.2× bigger, 6.2 points worse
Llama-3.1-8B  (5.5GB)  → SHIFT 61.0  ← 4.7× bigger, 5.8 points worse
DeepSeek-R1-14B (9.5GB) → SHIFT 59.8  ← 8.7× bigger, 7.0 points worse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Architecture generation matters more than parameter count. A 2025 model at 1.7B outperforms three 2024 models at 7–14B.&lt;/p&gt;


&lt;h2&gt;🏅 vs SOTA: How do small models compare to Claude and GPT-5?&lt;/h2&gt;

&lt;p&gt;We gave the same 19 questions to both our small models and the frontier giants:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Sonnet 4.6  → 69.9  (ceiling)
Claude Opus 4.6    → 69.3
GPT-5.4            → 62.4
Qwen3.5-397B       → 57.1
────────────────────────────
Gemma-3-12B        → 57.1  (82% of Claude!)
GPT-OSS-20B        → 54.2  (78% of Claude)
Gemma-3n-E4B       → 47.4  (68% of Claude)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A 12B model matches a 397B model on identical questions. The gap between small and large is &lt;strong&gt;narrower than most people think&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;⚡ Speed: Provider matters more than model size&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Llama-4-Scout (Groq)       → 240.5 tok/s
Llama-3.1-8B (Cerebras)    → 187.7 tok/s
Qwen3-8B (Fireworks)       → 186.8 tok/s
...
Gemma-3-12B (Featherless)  →  18.7 tok/s
Mistral-7B (Featherless)   →  17.8 tok/s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The fastest model is &lt;strong&gt;13× faster&lt;/strong&gt; than the slowest — and it's a bigger model. The difference? Groq's inference chip vs. generic GPU hosting. &lt;strong&gt;Infrastructure choice dominates model size in determining real-world speed.&lt;/strong&gt;&lt;/p&gt;
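
&lt;p&gt;You can roughly reproduce the throughput numbers yourself with &lt;code&gt;huggingface_hub&lt;/code&gt;, timing streamed chunks over wall-clock time. This is an approximation (chunks aren't always exactly one token, and network latency counts against the provider):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
from huggingface_hub import InferenceClient

client = InferenceClient()  # picks up your HF token from the environment

def tok_per_sec(model, prompt, max_tokens=256):
    """Approximate decode throughput by counting streamed chunks."""
    start, n = time.time(), 0
    for _chunk in client.chat_completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    ):
        n += 1
    return n / (time.time() - start)

print(tok_per_sec("Qwen/Qwen3-8B", "Explain MoE routing in two sentences."))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;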

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F5.png" alt="Speed rankings" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;🗓️ Anti-contamination: Season system&lt;/h2&gt;

&lt;p&gt;One concern with any public benchmark: models will eventually train on the questions.&lt;/p&gt;

&lt;p&gt;Our defense:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;30 anchor questions&lt;/strong&gt; stay fixed across seasons (for IRT calibration)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;95 questions rotate&lt;/strong&gt; (70%+ replaced each season)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Union Eval questions are never published&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Season 2 planned for 2026 Q3&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;🤝 Built with the community&lt;/h2&gt;

&lt;p&gt;This benchmark was developed in collaboration with the &lt;strong&gt;&lt;a href="https://huggingface.co/FINAL-Bench" rel="noopener noreferrer"&gt;FINAL Bench&lt;/a&gt;&lt;/strong&gt; research team. The Union Eval cross-benchmark design draws on their evaluation methodology.&lt;/p&gt;

&lt;p&gt;It also integrates with the &lt;strong&gt;&lt;a href="https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard" rel="noopener noreferrer"&gt;ALL Bench Leaderboard&lt;/a&gt;&lt;/strong&gt; — so you can see where your small model ranks among small models (Smol WorldCup) &lt;em&gt;and&lt;/em&gt; against the full landscape including GPT-5 and Claude (ALL Bench).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F7.png" alt="Recommendations" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;Try it yourself&lt;/h2&gt;

&lt;p&gt;The dataset is open under &lt;strong&gt;Apache 2.0&lt;/strong&gt;. We welcome new model submissions.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dataset&lt;/span&gt;

&lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ginigen-ai/smol-worldcup&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; questions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Filter by axis
&lt;/span&gt;&lt;span class="n"&gt;honesty&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;shift_axis&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;H&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Filter by language
&lt;/span&gt;&lt;span class="n"&gt;korean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;multilingual_ko&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;🏟️ &lt;a href="https://huggingface.co/spaces/ginigen-ai/smol-worldcup" rel="noopener noreferrer"&gt;Live Leaderboard&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;📊 &lt;a href="https://huggingface.co/datasets/ginigen-ai/smol-worldcup" rel="noopener noreferrer"&gt;Dataset on HuggingFace&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;🏅 &lt;a href="https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard" rel="noopener noreferrer"&gt;ALL Bench Leaderboard&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;



&lt;p&gt;&lt;em&gt;Developed by &lt;a href="https://ginigen.ai" rel="noopener noreferrer"&gt;Ginigen.ai&lt;/a&gt; · Small but Mighty AI&lt;/em&gt;&lt;/p&gt;


&lt;div class="ltag__huggingface"&gt;
  &lt;iframe src="https://ginigen-ai-smol-worldcup.hf.space" title="Hugging Face Space" width="100%" height="600"&gt;
  &lt;/iframe&gt;
&lt;/div&gt;



</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>benchmark</category>
    </item>
    <item>
      <title>MARL: Runtime Middleware That Reduces LLM Hallucination Without Fine-Tuning</title>
      <dc:creator>AI Tech News</dc:creator>
      <pubDate>Mon, 09 Mar 2026 08:45:56 +0000</pubDate>
      <link>https://dev.to/wolfsea2357/marl-runtime-middleware-that-reduces-llm-hallucination-without-fine-tuning-5fca</link>
      <guid>https://dev.to/wolfsea2357/marl-runtime-middleware-that-reduces-llm-hallucination-without-fine-tuning-5fca</guid>
      <description>&lt;p&gt;Your LLM is confidently wrong, and it can't stop itself.&lt;/p&gt;

&lt;p&gt;Ask GPT about a historical date, and it answers with full confidence — right or wrong. Ask Claude to analyze a contract, and it commits to its first interpretation without ever reconsidering. This is &lt;strong&gt;hallucination&lt;/strong&gt;, and in 2026, it remains the #1 blocker for production AI.&lt;/p&gt;

&lt;p&gt;The root cause is structural. LLMs are autoregressive: each token is conditioned on previous tokens. Once generation starts, the model cannot stop mid-stream and say &lt;em&gt;"wait, I was wrong."&lt;/em&gt; If the initial framing is flawed, it rides that trajectory to the end.&lt;/p&gt;

&lt;p&gt;We built &lt;strong&gt;MARL&lt;/strong&gt; to fix this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;marl-middleware
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0mao3nfm036rmtax66c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0mao3nfm036rmtax66c.png" alt=" " width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;What the Data Says&lt;/h2&gt;

&lt;p&gt;We released &lt;a href="https://huggingface.co/spaces/FINAL-Bench/Leaderboard" rel="noopener noreferrer"&gt;FINAL Bench&lt;/a&gt; — the world's first benchmark measuring AI &lt;strong&gt;metacognition&lt;/strong&gt; (the ability to know what you know and what you don't). We tested 9 SOTA models including GPT-5.2, Claude Opus 4.6, and Gemini 3 Pro across 1,800 assessments:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What It Measures&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;MA&lt;/strong&gt; (Metacognitive Accuracy)&lt;/td&gt;
&lt;td&gt;Can it say "I might be wrong"?&lt;/td&gt;
&lt;td&gt;0.694&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;ER&lt;/strong&gt; (Error Recovery)&lt;/td&gt;
&lt;td&gt;Can it actually find and fix errors?&lt;/td&gt;
&lt;td&gt;0.302&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The chasm between knowing and doing&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.392&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;AI models &lt;em&gt;sense&lt;/em&gt; they could be wrong. But they can't &lt;em&gt;fix&lt;/em&gt; what's broken. A 39.2 percentage-point gap between awareness and action.&lt;/p&gt;

&lt;h2&gt;How MARL Works&lt;/h2&gt;

&lt;p&gt;MARL (Model-Agnostic Runtime Middleware for LLMs) decomposes a single LLM call into a 5-stage expert pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query
    │
    ▼
S1: Hypothesis  → Designs the optimal approach
    │
    ▼
S2: Solver      → Performs deep reasoning
    │
    ▼
S3: Auditor     → Audits for gaps and contradictions
    │
    ▼
S4: Verifier    → Adversarial cross-validation
    │
    ▼
S5: Synthesizer → Integrates ALL feedback,
                   generates entirely new final response
    │
    ▼
Clean Answer (user sees only the refined result)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two mechanisms work in tandem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cooperative Reinforcement&lt;/strong&gt; — knowledge compounds across S1→S2→S3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adversarial Cross-Validation&lt;/strong&gt; — S4 deliberately attacks S2's conclusions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Synthesizer (S5) doesn't patch the original. It writes a &lt;strong&gt;completely new response&lt;/strong&gt; informed by every correction. This transforms "answer in one shot" into &lt;strong&gt;"think, doubt, correct, and rewrite."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In our FINAL Bench tests, this metacognitive scaffolding improved performance on the hardest tasks by &lt;strong&gt;over 70%&lt;/strong&gt;, with &lt;strong&gt;94.8% of the gain coming from error recovery&lt;/strong&gt;.&lt;/p&gt;
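
&lt;p&gt;The engine itself ships compiled (more on that below), but the control flow is easy to picture. A deliberately simplified sketch of five chained calls; MARL's real prompts, routing, and scoring are proprietary and not shown here:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch only: not MARL's actual prompts or pipeline logic.
STAGES = [
    ("S1 Hypothesis",  "Design the optimal approach for: {q}"),
    ("S2 Solver",      "Solve step by step, following this approach:\n{ctx}"),
    ("S3 Auditor",     "Audit the draft for gaps and contradictions:\n{ctx}"),
    ("S4 Verifier",    "Adversarially attack these conclusions:\n{ctx}"),
    ("S5 Synthesizer", "Using ALL feedback, write an entirely new final answer:\n{ctx}"),
]

def marl_like(llm, question):
    """Chain the stages, feeding each stage's output into the next."""
    ctx = question
    for _name, template in STAGES:
        ctx = llm(template.format(q=question, ctx=ctx))
    return ctx  # the user only ever sees S5's rewrite
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;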

&lt;h2&gt;Not Fine-Tuning. Not RAG. A Third Way.&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Fine-Tuning&lt;/th&gt;
&lt;th&gt;RAG&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;MARL&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Changes&lt;/td&gt;
&lt;td&gt;Model weights&lt;/td&gt;
&lt;td&gt;External knowledge&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Reasoning structure&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;$10K+ GPU&lt;/td&gt;
&lt;td&gt;Vector DB setup&lt;/td&gt;
&lt;td&gt;1 line of code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time&lt;/td&gt;
&lt;td&gt;Weeks&lt;/td&gt;
&lt;td&gt;Days&lt;/td&gt;
&lt;td&gt;Instant&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lock-in&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fixes&lt;/td&gt;
&lt;td&gt;Domain gaps&lt;/td&gt;
&lt;td&gt;Knowledge gaps&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Reasoning errors&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;MARL never touches weights. Switch from GPT-5.4 to Claude to Llama — the MARL layer stays. No vendor lock-in.&lt;/p&gt;

&lt;h2&gt;Integration: One Line&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="c1"&gt;# Just change base_url. That's it.
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8080/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# ← MARL server
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Everything else stays exactly the same
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Your prompt here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every call now flows through the 5-stage pipeline automatically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5k1g5a1ucafe2keqxyp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5k1g5a1ucafe2keqxyp.png" alt=" " width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;9 Domain-Specific Emergence Engines&lt;/h2&gt;

&lt;p&gt;Beyond default reasoning enhancement, MARL ships with 9 specialized engines — activated by appending &lt;code&gt;::mode&lt;/code&gt; to the model name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4::pharma&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;     &lt;span class="c1"&gt;# 💊 Drug discovery (172 items)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4::invent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;     &lt;span class="c1"&gt;# 🔬 Invention &amp;amp; patents (4,275 items)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4::genomics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;   &lt;span class="c1"&gt;# 🧬 Genomics &amp;amp; bio (104 items)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4::chemistry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# 🧪 Chemistry &amp;amp; materials (135 items)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4::ecology&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    &lt;span class="c1"&gt;# 🌍 Ecology &amp;amp; environment (105 items)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4::law&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;        &lt;span class="c1"&gt;# ⚖️ Legal &amp;amp; regulatory (59 items)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4::create&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;     &lt;span class="c1"&gt;# 🎨 General creative (493 seeds)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4::doc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;        &lt;span class="c1"&gt;# 📝 Document generation
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4::recipe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;     &lt;span class="c1"&gt;# 🍳 Culinary fusion
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5,538 expert data items, cross-combined across multiple layers. Each engine has 5 emergence rules and 10 cross-layer bonus pairs. Works with &lt;strong&gt;any model name&lt;/strong&gt; — not just OpenAI's.&lt;/p&gt;
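
&lt;p&gt;Because the suffix rides on the standard &lt;code&gt;model&lt;/code&gt; field, switching engines is just a string change. Reusing the client from the integration snippet above:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Same client as before; only the model string changes.
response = client.chat.completions.create(
    model="gpt-5.4::law",  # route this call through the legal emergence engine
    messages=[{"role": "user", "content": "Review this NDA clause for risks: ..."}],
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;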

&lt;h2&gt;Open Core: Protected Engine, Transparent Reasoning&lt;/h2&gt;

&lt;p&gt;The core engine (pipeline logic, attention matrix, agent prompts) ships as a &lt;strong&gt;compiled binary&lt;/strong&gt; — proprietary tech stays protected.&lt;/p&gt;

&lt;p&gt;Everything else is open: installation, API integration, A/B test demos, and most importantly — &lt;strong&gt;the full reasoning trace&lt;/strong&gt;. Every stage is logged transparently. You can see exactly where an error was caught and how it was corrected.&lt;/p&gt;

&lt;p&gt;If LLMs are black boxes, MARL is a &lt;strong&gt;glass box&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;Available Everywhere&lt;/h2&gt;

&lt;p&gt;We shipped MARL simultaneously across four platforms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# PyPI&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;marl-middleware

&lt;span class="c"&gt;# Docker&lt;/span&gt;
docker pull vidraft/marl:latest

&lt;span class="c"&gt;# ClawHub (OpenClaw — 260K+ developers, 3,200+ AI skills)&lt;/span&gt;
clawhub &lt;span class="nb"&gt;install &lt;/span&gt;marl-middleware

&lt;span class="c"&gt;# GitHub&lt;/span&gt;
git clone https://github.com/Vidraft/MARL.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On &lt;strong&gt;ClawHub&lt;/strong&gt;, MARL is the first middleware in the &lt;strong&gt;Reasoning Enhancement&lt;/strong&gt; category. One command gives your AI agent a metacognition upgrade — it thinks before it acts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2e7c2dr4ik5ljbtqg71x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2e7c2dr4ik5ljbtqg71x.png" alt=" " width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Try It Now&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;📝 Technical deep dive&lt;/strong&gt;: &lt;a href="https://huggingface.co/blog/FINAL-Bench/marl-middleware" rel="noopener noreferrer"&gt;huggingface.co/blog/FINAL-Bench/marl-middleware&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🤗 Live A/B test&lt;/strong&gt; (Raw LLM vs MARL): &lt;a href="https://huggingface.co/spaces/VIDraft/MARL" rel="noopener noreferrer"&gt;huggingface.co/spaces/VIDraft/MARL&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📦 PyPI&lt;/strong&gt;: &lt;a href="https://pypi.org/project/marl-middleware/" rel="noopener noreferrer"&gt;pypi.org/project/marl-middleware&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🐙 GitHub&lt;/strong&gt;: &lt;a href="https://github.com/Vidraft/MARL" rel="noopener noreferrer"&gt;github.com/Vidraft/MARL&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🦀 ClawHub&lt;/strong&gt;: &lt;a href="https://clawhub.ai/Cutechicken99/marl-middleware" rel="noopener noreferrer"&gt;clawhub.ai/Cutechicken99/marl-middleware&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://vidraft.net" rel="noopener noreferrer"&gt;VIDRAFT&lt;/a&gt; — the team behind FINAL Bench (HF Dataset Global #5), FACTS Grounding Medical AI World #2 (CNRS-verified), and HuggingFace STAR AI TOP 12 (2024). 2M monthly active users, 1,500+ public AI models.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
