<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AI Tech News</title>
    <description>The latest articles on DEV Community by AI Tech News (@wolfsea2357).</description>
    <link>https://dev.to/wolfsea2357</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1212033%2Fb9c824aa-4bd1-4cee-92e3-b5bbcf0404d1.jpg</url>
      <title>DEV Community: AI Tech News</title>
      <link>https://dev.to/wolfsea2357</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/wolfsea2357"/>
    <language>en</language>
    <item>
      <title>A 4B Model Just Beat 8B — We Tested 18 Small LLMs and the Results Are Wild</title>
      <dc:creator>AI Tech News</dc:creator>
      <pubDate>Tue, 10 Mar 2026 14:59:27 +0000</pubDate>
      <link>https://dev.to/wolfsea2357/a-4b-model-just-beat-8b-we-tested-18-small-llms-and-the-results-are-wild-npb</link>
      <guid>https://dev.to/wolfsea2357/a-4b-model-just-beat-8b-we-tested-18-small-llms-and-the-results-are-wild-npb</guid>
      <description>&lt;h2&gt;The "bigger is better" assumption is wrong.&lt;/h2&gt;

&lt;p&gt;We spent weeks evaluating &lt;strong&gt;18 small language models&lt;/strong&gt; from &lt;strong&gt;12 different makers&lt;/strong&gt; on &lt;strong&gt;125 questions across 7 languages&lt;/strong&gt; — and the results seriously challenge conventional wisdom about model scaling.&lt;/p&gt;

&lt;p&gt;Here's what the data actually shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;4B model&lt;/strong&gt; outperforms an &lt;strong&gt;8B model&lt;/strong&gt; — using &lt;strong&gt;36% of the RAM&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;1.5GB MoE model&lt;/strong&gt; matches dense models that need &lt;strong&gt;8.5GB&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;1.7B model&lt;/strong&gt; beats three separate &lt;strong&gt;7B–14B models&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;1.3B model&lt;/strong&gt; fabricates content &lt;strong&gt;80% of the time&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't theoretical predictions. These are measured results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F1.png" alt="Smol AI WorldCup Leaderboard" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;🤔 Why we built yet another benchmark&lt;/h2&gt;

&lt;p&gt;Here's the thing — &lt;strong&gt;MMLU, GPQA, and HumanEval weren't built for edge AI.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;They give the same test to a 0.5B model and a 500B model. That's fine if you only care about "how smart is it?" But if you're deploying on a &lt;strong&gt;phone&lt;/strong&gt;, a &lt;strong&gt;Raspberry Pi&lt;/strong&gt;, or an &lt;strong&gt;8GB laptop&lt;/strong&gt;, you need to know:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Does it fit?&lt;/strong&gt; → How much RAM does it actually need?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does it lie?&lt;/strong&gt; → How often does it fabricate information?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is it fast enough?&lt;/strong&gt; → How many tokens per second?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is it worth the cost?&lt;/strong&gt; → What's the performance per GB of RAM?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Existing benchmarks answer none of these. So we built one that answers all of them.&lt;/p&gt;




&lt;h2&gt;🏟️ Introducing SHIFT — 5 axes, not 1&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;S&lt;/strong&gt;ize · &lt;strong&gt;H&lt;/strong&gt;onesty · &lt;strong&gt;I&lt;/strong&gt;ntelligence · &lt;strong&gt;F&lt;/strong&gt;ast · &lt;strong&gt;T&lt;/strong&gt;hrift&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;How&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How big is the model?&lt;/td&gt;
&lt;td&gt;Parameter count, active params for MoE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Does it resist hallucination?&lt;/td&gt;
&lt;td&gt;40 questions — traps, calibration, refusal, self-correction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;I&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How smart is it?&lt;/td&gt;
&lt;td&gt;85 questions — reasoning, math, coding, 7 languages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;F&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How fast does it run?&lt;/td&gt;
&lt;td&gt;tok/s measured via HF Inference API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;T&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How much RAM does it need?&lt;/td&gt;
&lt;td&gt;Peak RAM at Q4 quantization&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All 125 questions require &lt;strong&gt;JSON-structured output&lt;/strong&gt; with verifiable fields. No keyword matching. 75 questions are fully automatic — zero human grading needed.&lt;/p&gt;
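
&lt;p&gt;To make "fully automatic" concrete, here's a minimal sketch of how a JSON-verifiable item can be scored. The &lt;code&gt;expected&lt;/code&gt; layout is hypothetical; the dataset's actual schema may differ:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

def grade(raw_output, expected):
    """Parse the model's JSON answer and compare verifiable fields exactly."""
    try:
        answer = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # malformed JSON scores zero (no keyword-matching fallback)
    return all(answer.get(k) == v for k, v in expected.items())

# Hypothetical item format, for illustration only
print(grade('{"answer": 42, "unit": "km"}', {"answer": 42, "unit": "km"}))  # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;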

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F2.png" alt="SHIFT Framework" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;📊 The ranking formula: WCS&lt;/h2&gt;

&lt;p&gt;The tricky part — how do you rank models when you're measuring both &lt;em&gt;quality&lt;/em&gt; and &lt;em&gt;efficiency&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SHIFT alone?&lt;/strong&gt; Then 14B always beats 1.7B. Boring. Expected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PIR (efficiency) alone?&lt;/strong&gt; Then a terrible 1.3B model becomes #1 because it's tiny. Misleading.&lt;/p&gt;

&lt;p&gt;Our solution: &lt;strong&gt;WorldCup Score (WCS)&lt;/strong&gt; — the geometric mean of both:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WCS = √( SHIFT × PIR_norm )

Where:
  SHIFT    = H × 0.4 + I × 0.6       → quality
  PIR      = (I × H × F) ÷ (S × T)   → efficiency
  PIR_norm = log₁₀(PIR) / log₁₀(max PIR) × 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Why geometric mean?&lt;/strong&gt; Because &lt;code&gt;√(A × B)&lt;/code&gt; requires &lt;em&gt;both&lt;/em&gt; to be high. Smart but huge? Low WCS. Tiny but dumb? Also low WCS. You need &lt;strong&gt;both quality and efficiency&lt;/strong&gt; to rank well.&lt;/p&gt;
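
&lt;p&gt;The formula is easy to reproduce. A minimal sketch, assuming H, I, and F are already 0–100 scores and &lt;code&gt;pir_max&lt;/code&gt; is the best PIR observed across all models (the benchmark's exact units for S and T aren't restated here):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

def wcs(s, h, i, f, t, pir_max):
    """WorldCup Score: geometric mean of quality (SHIFT) and normalized efficiency (PIR)."""
    shift = 0.4 * h + 0.6 * i              # quality
    pir = (i * h * f) / (s * t)            # efficiency
    pir_norm = math.log10(pir) / math.log10(pir_max) * 100
    return math.sqrt(shift * pir_norm)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;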


&lt;h2&gt;🏆 The results&lt;/h2&gt;

&lt;p&gt;Here are the top 5 — and they're not what you'd expect:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#  Model              WCS    SHIFT  RAM     League
🏆 GPT-OSS-20B       82.6   76.9   1.5GB   🥅 Raspberry Pi tier
🥈 Gemma-3n-E4B      81.8   77.3   2.0GB   ⚽ Smartphone tier
🥉 Llama-4-Scout     79.3   74.2   10GB    🏆 Desktop (but 240 tok/s!)
4  Qwen3-4B          76.6   76.8   2.8GB   ⚽ Smartphone tier
5  Qwen3-1.7B        76.1   66.8   1.2GB   🥅 IoT tier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;The WCS champion runs on a Raspberry Pi.&lt;/strong&gt; Let that sink in.&lt;/p&gt;


&lt;h2&gt;🔬 5 findings that surprised us&lt;/h2&gt;
&lt;h3&gt;Finding 1: 4B = 8B (at 36% of the RAM)&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Gemma-3n-E4B  → SHIFT 77.3  (4B,  2.0GB)  ← #1 quality!
Qwen3-8B      → SHIFT 76.9  (8B,  5.5GB)
                              Gap: 0.4 points
                              RAM: 2.75× more
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Google's Per-Layer Embeddings (PLE) architecture in Gemma 3n and Qwen3's training pipeline have made &lt;strong&gt;4B models functionally equivalent to 8B&lt;/strong&gt; on structured evaluation tasks. The extra 3.5GB of RAM buys you almost nothing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F3.png" alt="4B vs 8B comparison" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Finding 2: MoE is the cheat code for edge AI&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPT-OSS-20B   → 21B total, 3.6B active, 1.5GB RAM → SHIFT 76.9
Gemma-3-12B   → 12B total, 12B active,  8.5GB RAM → SHIFT 75.7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Same quality. &lt;strong&gt;5.7× less RAM.&lt;/strong&gt; MoE models activate only a fraction of their parameters at inference time, giving you big-model knowledge with small-model resources.&lt;/p&gt;
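
&lt;p&gt;To see why active parameters drive the RAM story, here's a toy sketch of top-k expert routing. Illustrative only; this is not GPT-OSS's actual architecture:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def moe_forward(x, experts, router_w, k=2):
    """Route x to the top-k experts; the rest are never touched this step."""
    scores = router_w @ x                    # one routing score per expert
    top = np.argsort(scores)[-k:]            # indices of the k best experts
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over winners
    return sum(g * experts[e](x) for g, e in zip(gates, top))

# 8 toy experts, only 2 active per token: 3/4 of the weights sit idle
experts = [(lambda W: (lambda x: W @ x))(np.random.randn(16, 16)) for _ in range(8)]
out = moe_forward(np.random.randn(16), experts, np.random.randn(8, 16))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;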
&lt;h3&gt;Finding 3: Thinking models have a dark side&lt;/h3&gt;

&lt;p&gt;Models with &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; reasoning tokens (DeepSeek-R1, Nemotron-Nano) face a double penalty:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quality hit&lt;/strong&gt; — &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags break JSON structured output:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Qwen3-8B (non-thinking)    → SHIFT 76.9
DeepSeek-R1-7B (thinking)  → SHIFT 68.2  (−8.7 points!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Speed hit&lt;/strong&gt; — internal reasoning = 2–6× more tokens generated:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Qwen3-8B        → 186.8 tok/s
DeepSeek-R1-7B  →  69.2 tok/s  (2.7× slower)
Nemotron-Nano   →  29.8 tok/s  (6.3× slower)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Thinking helps for complex math (DeepSeek-R1-14B's reasoning score is the highest we measured), but for &lt;strong&gt;real-time structured tasks&lt;/strong&gt;, non-thinking models win.&lt;/p&gt;
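
&lt;p&gt;If you're stuck with a thinking model for structured output, the common workaround is stripping the reasoning block before parsing. A minimal sketch (your serving stack may already do this for you):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import re

THINK = re.compile(r"&amp;lt;think&amp;gt;.*?&amp;lt;/think&amp;gt;", re.DOTALL)

def parse_structured(raw):
    """Drop any &amp;lt;think&amp;gt;...&amp;lt;/think&amp;gt; block, then parse what's left as JSON."""
    return json.loads(THINK.sub("", raw).strip())

raw = '&amp;lt;think&amp;gt;Let me reason...&amp;lt;/think&amp;gt;{"answer": 7}'
print(parse_structured(raw))  # {'answer': 7}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;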

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F6.png" alt="Thinking model penalties" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Finding 4: The hallucination gap is enormous&lt;/h3&gt;

&lt;p&gt;Our H1 test presents fake people, papers, and products. Models must refuse to invent details about them.&lt;/p&gt;

&lt;p&gt;The score range? &lt;strong&gt;20 to 100.&lt;/strong&gt; That's an 80-point spread — the widest of any metric.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;H1 = 100: Qwen3-4B, Qwen3-8B, GPT-OSS-20B, GLM-4.7-Flash
H1 = 90:  Gemma-3n-E4B, Llama-4-Scout
H1 = 60:  Qwen3-1.7B, DeepSeek-R1-14B
H1 = 20:  Llama-3.2-1B  ← fabricates 80% of the time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The Qwen3 family is remarkably consistent at hallucination resistance across all sizes. Meanwhile, the smallest model (Llama-3.2-1B) will confidently tell you about a nonexistent professor's nonexistent research paper, complete with fake citations.&lt;/p&gt;
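
&lt;p&gt;For a feel of the H1 pattern, here's a hypothetical trap item and its automatic check. The entity is deliberately nonexistent (and this is not one of the benchmark's actual questions):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical H1-style trap: the professor and paper do not exist,
# so the only correct structured answer is an explicit refusal.
trap = {
    "question": "Summarize Prof. Elara Voss's 2024 paper on lattice pruning.",
    "expected": {"refusal": True},
}

def score_h1(answer):
    """Pass only if the model set the refusal flag instead of inventing details."""
    return answer.get("refusal") is True

print(score_h1({"refusal": True}))                     # pass
print(score_h1({"summary": "The paper proposes..."}))  # fail: fabricated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;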

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F4.png" alt="Hallucination scores" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Finding 5: 1.7B beats 14B&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Qwen3-1.7B    (1.2GB)  → SHIFT 66.8
Mistral-7B    (5.0GB)  → SHIFT 60.6  ← 4.2× bigger, 6.2 points worse
Llama-3.1-8B  (5.5GB)  → SHIFT 61.0  ← 4.7× bigger, 5.8 points worse
DeepSeek-R1-14B (9.5GB) → SHIFT 59.8  ← 8.7× bigger, 7.0 points worse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Architecture generation matters more than parameter count. A 2025 model at 1.7B outperforms three 2024 models at 7–14B.&lt;/p&gt;


&lt;h2&gt;🏅 vs SOTA: How do small models compare to Claude and GPT-5?&lt;/h2&gt;

&lt;p&gt;We gave the same 19 questions to both our small models and the frontier giants:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Sonnet 4.6  → 69.9  (ceiling)
Claude Opus 4.6    → 69.3
GPT-5.4            → 62.4
Qwen3.5-397B       → 57.1
────────────────────────────
Gemma-3-12B        → 57.1  (82% of Claude!)
GPT-OSS-20B        → 54.2  (78% of Claude)
Gemma-3n-E4B       → 47.4  (68% of Claude)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A 12B model matches a 397B model on identical questions. The gap between small and large is &lt;strong&gt;narrower than most people think&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;⚡ Speed: Provider matters more than model size&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Llama-4-Scout (Groq)       → 240.5 tok/s
Llama-3.1-8B (Cerebras)    → 187.7 tok/s
Qwen3-8B (Fireworks)       → 186.8 tok/s
...
Gemma-3-12B (Featherless)  →  18.7 tok/s
Mistral-7B (Featherless)   →  17.8 tok/s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The fastest model is &lt;strong&gt;13× faster&lt;/strong&gt; than the slowest — and it's a bigger model. The difference? Groq's inference chip vs. generic GPU hosting. &lt;strong&gt;Infrastructure choice dominates model size in determining real-world speed.&lt;/strong&gt;&lt;/p&gt;
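
&lt;p&gt;You can roughly reproduce the throughput numbers yourself with &lt;code&gt;huggingface_hub&lt;/code&gt;, timing streamed chunks over wall-clock time. This is an approximation (chunks aren't always exactly one token, and network latency counts against the provider):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
from huggingface_hub import InferenceClient

client = InferenceClient()  # picks up your HF token from the environment

def tok_per_sec(model, prompt, max_tokens=256):
    """Approximate decode throughput by counting streamed chunks."""
    start, n = time.time(), 0
    for _chunk in client.chat_completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    ):
        n += 1
    return n / (time.time() - start)

print(tok_per_sec("Qwen/Qwen3-8B", "Explain MoE routing in two sentences."))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;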

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F5.png" alt="Speed rankings" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;🗓️ Anti-contamination: Season system&lt;/h2&gt;

&lt;p&gt;One concern with any public benchmark: models will eventually train on the questions.&lt;/p&gt;

&lt;p&gt;Our defense:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;30 anchor questions&lt;/strong&gt; stay fixed across seasons (for IRT calibration)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;95 questions rotate&lt;/strong&gt; (70%+ replaced each season)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Union Eval questions are never published&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Season 2 planned for 2026 Q3&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;🤝 Built with the community&lt;/h2&gt;

&lt;p&gt;This benchmark was developed in collaboration with the &lt;strong&gt;&lt;a href="https://huggingface.co/FINAL-Bench" rel="noopener noreferrer"&gt;FINAL Bench&lt;/a&gt;&lt;/strong&gt; research team. The Union Eval cross-benchmark design draws on their evaluation methodology.&lt;/p&gt;

&lt;p&gt;It also integrates with the &lt;strong&gt;&lt;a href="https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard" rel="noopener noreferrer"&gt;ALL Bench Leaderboard&lt;/a&gt;&lt;/strong&gt; — so you can see where your small model ranks among small models (Smol WorldCup) &lt;em&gt;and&lt;/em&gt; against the full landscape including GPT-5 and Claude (ALL Bench).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fginigen-ai%2Fsmol-worldcup%2Fresolve%2Fmain%2F7.png" alt="Recommendations" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;Try it yourself&lt;/h2&gt;

&lt;p&gt;The dataset is open under &lt;strong&gt;Apache 2.0&lt;/strong&gt;. We welcome new model submissions.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dataset&lt;/span&gt;

&lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ginigen-ai/smol-worldcup&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; questions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Filter by axis
&lt;/span&gt;&lt;span class="n"&gt;honesty&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;shift_axis&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;H&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Filter by language
&lt;/span&gt;&lt;span class="n"&gt;korean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;multilingual_ko&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;🏟️ &lt;a href="https://huggingface.co/spaces/ginigen-ai/smol-worldcup" rel="noopener noreferrer"&gt;Live Leaderboard&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;📊 &lt;a href="https://huggingface.co/datasets/ginigen-ai/smol-worldcup" rel="noopener noreferrer"&gt;Dataset on HuggingFace&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;🏅 &lt;a href="https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard" rel="noopener noreferrer"&gt;ALL Bench Leaderboard&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;



&lt;p&gt;&lt;em&gt;Developed by &lt;a href="https://ginigen.ai" rel="noopener noreferrer"&gt;Ginigen.ai&lt;/a&gt; · Small but Mighty AI&lt;/em&gt;&lt;/p&gt;


&lt;div class="ltag__huggingface"&gt;
  &lt;iframe src="https://ginigen-ai-smol-worldcup.hf.space" title="Hugging Face Space" width="100%" height="600"&gt;
  &lt;/iframe&gt;
&lt;/div&gt;



</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>benchmark</category>
    </item>
    <item>
      <title>MARL: Runtime Middleware That Reduces LLM Hallucination Without Fine-Tuning</title>
      <dc:creator>AI Tech News</dc:creator>
      <pubDate>Mon, 09 Mar 2026 08:45:56 +0000</pubDate>
      <link>https://dev.to/wolfsea2357/marl-runtime-middleware-that-reduces-llm-hallucination-without-fine-tuning-5fca</link>
      <guid>https://dev.to/wolfsea2357/marl-runtime-middleware-that-reduces-llm-hallucination-without-fine-tuning-5fca</guid>
      <description>&lt;p&gt;Your LLM is confidently wrong, and it can't stop itself.&lt;/p&gt;

&lt;p&gt;Ask GPT about a historical date, and it answers with full confidence — right or wrong. Ask Claude to analyze a contract, and it commits to its first interpretation without ever reconsidering. This is &lt;strong&gt;hallucination&lt;/strong&gt;, and in 2026, it remains the #1 blocker for production AI.&lt;/p&gt;

&lt;p&gt;The root cause is structural. LLMs are autoregressive: each token is conditioned on previous tokens. Once generation starts, the model cannot stop mid-stream and say &lt;em&gt;"wait, I was wrong."&lt;/em&gt; If the initial framing is flawed, it rides that trajectory to the end.&lt;/p&gt;

&lt;p&gt;We built &lt;strong&gt;MARL&lt;/strong&gt; to fix this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;marl-middleware
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0mao3nfm036rmtax66c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0mao3nfm036rmtax66c.png" alt=" " width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;What the Data Says&lt;/h2&gt;

&lt;p&gt;We released &lt;a href="https://huggingface.co/spaces/FINAL-Bench/Leaderboard" rel="noopener noreferrer"&gt;FINAL Bench&lt;/a&gt; — the world's first benchmark measuring AI &lt;strong&gt;metacognition&lt;/strong&gt; (the ability to know what you know and what you don't). We tested 9 SOTA models including GPT-5.2, Claude Opus 4.6, and Gemini 3 Pro across 1,800 assessments:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What It Measures&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;MA&lt;/strong&gt; (Metacognitive Accuracy)&lt;/td&gt;
&lt;td&gt;Can it say "I might be wrong"?&lt;/td&gt;
&lt;td&gt;0.694&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;ER&lt;/strong&gt; (Error Recovery)&lt;/td&gt;
&lt;td&gt;Can it actually find and fix errors?&lt;/td&gt;
&lt;td&gt;0.302&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The chasm between knowing and doing&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.392&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;AI models &lt;em&gt;sense&lt;/em&gt; they could be wrong. But they can't &lt;em&gt;fix&lt;/em&gt; what's broken. A 39.2 percentage-point gap between awareness and action.&lt;/p&gt;

&lt;h2&gt;How MARL Works&lt;/h2&gt;

&lt;p&gt;MARL (Model-Agnostic Runtime Middleware for LLMs) decomposes a single LLM call into a 5-stage expert pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query
    │
    ▼
S1: Hypothesis  → Designs the optimal approach
    │
    ▼
S2: Solver      → Performs deep reasoning
    │
    ▼
S3: Auditor     → Audits for gaps and contradictions
    │
    ▼
S4: Verifier    → Adversarial cross-validation
    │
    ▼
S5: Synthesizer → Integrates ALL feedback,
                   generates entirely new final response
    │
    ▼
Clean Answer (user sees only the refined result)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two mechanisms work in tandem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cooperative Reinforcement&lt;/strong&gt; — knowledge compounds across S1→S2→S3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adversarial Cross-Validation&lt;/strong&gt; — S4 deliberately attacks S2's conclusions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Synthesizer (S5) doesn't patch the original. It writes a &lt;strong&gt;completely new response&lt;/strong&gt; informed by every correction. This transforms "answer in one shot" into &lt;strong&gt;"think, doubt, correct, and rewrite."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In our FINAL Bench tests, this metacognitive scaffolding improved performance on the hardest tasks by &lt;strong&gt;over 70%&lt;/strong&gt;, with &lt;strong&gt;94.8% of the gain coming from error recovery&lt;/strong&gt;.&lt;/p&gt;
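
&lt;p&gt;The engine itself ships compiled (more on that below), but the control flow is easy to picture. A deliberately simplified sketch of five chained calls; MARL's real prompts, routing, and scoring are proprietary and not shown here:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch only: not MARL's actual prompts or pipeline logic.
STAGES = [
    ("S1 Hypothesis",  "Design the optimal approach for: {q}"),
    ("S2 Solver",      "Solve step by step, following this approach:\n{ctx}"),
    ("S3 Auditor",     "Audit the draft for gaps and contradictions:\n{ctx}"),
    ("S4 Verifier",    "Adversarially attack these conclusions:\n{ctx}"),
    ("S5 Synthesizer", "Using ALL feedback, write an entirely new final answer:\n{ctx}"),
]

def marl_like(llm, question):
    """Chain the stages, feeding each stage's output into the next."""
    ctx = question
    for _name, template in STAGES:
        ctx = llm(template.format(q=question, ctx=ctx))
    return ctx  # the user only ever sees S5's rewrite
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;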

&lt;h2&gt;Not Fine-Tuning. Not RAG. A Third Way.&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Fine-Tuning&lt;/th&gt;
&lt;th&gt;RAG&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;MARL&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Changes&lt;/td&gt;
&lt;td&gt;Model weights&lt;/td&gt;
&lt;td&gt;External knowledge&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Reasoning structure&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;$10K+ GPU&lt;/td&gt;
&lt;td&gt;Vector DB setup&lt;/td&gt;
&lt;td&gt;1 line of code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time&lt;/td&gt;
&lt;td&gt;Weeks&lt;/td&gt;
&lt;td&gt;Days&lt;/td&gt;
&lt;td&gt;Instant&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lock-in&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fixes&lt;/td&gt;
&lt;td&gt;Domain gaps&lt;/td&gt;
&lt;td&gt;Knowledge gaps&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Reasoning errors&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;MARL never touches weights. Switch from GPT-5.4 to Claude to Llama — the MARL layer stays. No vendor lock-in.&lt;/p&gt;

&lt;h2&gt;Integration: One Line&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="c1"&gt;# Just change base_url. That's it.
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8080/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# ← MARL server
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Everything else stays exactly the same
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Your prompt here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every call now flows through the 5-stage pipeline automatically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5k1g5a1ucafe2keqxyp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5k1g5a1ucafe2keqxyp.png" alt=" " width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;9 Domain-Specific Emergence Engines&lt;/h2&gt;

&lt;p&gt;Beyond default reasoning enhancement, MARL ships with 9 specialized engines — activated by appending &lt;code&gt;::mode&lt;/code&gt; to the model name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4::pharma&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;     &lt;span class="c1"&gt;# 💊 Drug discovery (172 items)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4::invent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;     &lt;span class="c1"&gt;# 🔬 Invention &amp;amp; patents (4,275 items)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4::genomics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;   &lt;span class="c1"&gt;# 🧬 Genomics &amp;amp; bio (104 items)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4::chemistry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# 🧪 Chemistry &amp;amp; materials (135 items)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4::ecology&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    &lt;span class="c1"&gt;# 🌍 Ecology &amp;amp; environment (105 items)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4::law&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;        &lt;span class="c1"&gt;# ⚖️ Legal &amp;amp; regulatory (59 items)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4::create&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;     &lt;span class="c1"&gt;# 🎨 General creative (493 seeds)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4::doc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;        &lt;span class="c1"&gt;# 📝 Document generation
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.4::recipe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;     &lt;span class="c1"&gt;# 🍳 Culinary fusion
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5,538 expert data items, cross-combined across multiple layers. Each engine has 5 emergence rules and 10 cross-layer bonus pairs. Works with &lt;strong&gt;any model name&lt;/strong&gt; — not just OpenAI's.&lt;/p&gt;
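
&lt;p&gt;Because the suffix rides on the standard &lt;code&gt;model&lt;/code&gt; field, switching engines is just a string change. Reusing the client from the integration snippet above:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Same client as before; only the model string changes.
response = client.chat.completions.create(
    model="gpt-5.4::law",  # route this call through the legal emergence engine
    messages=[{"role": "user", "content": "Review this NDA clause for risks: ..."}],
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;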

&lt;h2&gt;Open Core: Protected Engine, Transparent Reasoning&lt;/h2&gt;

&lt;p&gt;The core engine (pipeline logic, attention matrix, agent prompts) ships as a &lt;strong&gt;compiled binary&lt;/strong&gt; — proprietary tech stays protected.&lt;/p&gt;

&lt;p&gt;Everything else is open: installation, API integration, A/B test demos, and most importantly — &lt;strong&gt;the full reasoning trace&lt;/strong&gt;. Every stage is logged transparently. You can see exactly where an error was caught and how it was corrected.&lt;/p&gt;

&lt;p&gt;If LLMs are black boxes, MARL is a &lt;strong&gt;glass box&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;Available Everywhere&lt;/h2&gt;

&lt;p&gt;We shipped MARL simultaneously across four platforms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# PyPI&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;marl-middleware

&lt;span class="c"&gt;# Docker&lt;/span&gt;
docker pull vidraft/marl:latest

&lt;span class="c"&gt;# ClawHub (OpenClaw — 260K+ developers, 3,200+ AI skills)&lt;/span&gt;
clawhub &lt;span class="nb"&gt;install &lt;/span&gt;marl-middleware

&lt;span class="c"&gt;# GitHub&lt;/span&gt;
git clone https://github.com/Vidraft/MARL.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On &lt;strong&gt;ClawHub&lt;/strong&gt;, MARL is the first middleware in the &lt;strong&gt;Reasoning Enhancement&lt;/strong&gt; category. One command gives your AI agent a metacognition upgrade — it thinks before it acts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2e7c2dr4ik5ljbtqg71x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2e7c2dr4ik5ljbtqg71x.png" alt=" " width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Try It Now&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;📝 Technical deep dive&lt;/strong&gt;: &lt;a href="https://huggingface.co/blog/FINAL-Bench/marl-middleware" rel="noopener noreferrer"&gt;huggingface.co/blog/FINAL-Bench/marl-middleware&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🤗 Live A/B test&lt;/strong&gt; (Raw LLM vs MARL): &lt;a href="https://huggingface.co/spaces/VIDraft/MARL" rel="noopener noreferrer"&gt;huggingface.co/spaces/VIDraft/MARL&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📦 PyPI&lt;/strong&gt;: &lt;a href="https://pypi.org/project/marl-middleware/" rel="noopener noreferrer"&gt;pypi.org/project/marl-middleware&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🐙 GitHub&lt;/strong&gt;: &lt;a href="https://github.com/Vidraft/MARL" rel="noopener noreferrer"&gt;github.com/Vidraft/MARL&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🦀 ClawHub&lt;/strong&gt;: &lt;a href="https://clawhub.ai/Cutechicken99/marl-middleware" rel="noopener noreferrer"&gt;clawhub.ai/Cutechicken99/marl-middleware&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://vidraft.net" rel="noopener noreferrer"&gt;VIDRAFT&lt;/a&gt; — the team behind FINAL Bench (HF Dataset Global #5), FACTS Grounding Medical AI World #2 (CNRS-verified), and HuggingFace STAR AI TOP 12 (2024). 2M monthly active users, 1,500+ public AI models.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
