The "bigger is better" assumption is wrong.
We spent weeks evaluating 18 small language models from 12 different makers on 125 questions across 7 languages — and the results seriously challenge conventional wisdom about model scaling.
Here's what the data actually shows:
- A 4B model outperforms an 8B model — using 36% of the RAM
- A 1.5GB MoE model matches dense models that need 8.5GB
- A 1.7B model beats three separate 7B–14B models
- A 1.3B model fabricates content 80% of the time
These aren't theoretical predictions. These are measured results.
🤔 Why we built yet another benchmark
Here's the thing — MMLU, GPQA, and HumanEval weren't built for edge AI.
They give the same test to a 0.5B model and a 500B model. That's fine if you only care about "how smart is it?" But if you're deploying on a phone, a Raspberry Pi, or an 8GB laptop, you need to know:
- Does it fit? → How much RAM does it actually need?
- Does it lie? → How often does it fabricate information?
- Is it fast enough? → How many tokens per second?
- Is it worth the cost? → What's the performance per GB of RAM?
Existing benchmarks answer none of these. So we built one that answers all of them.
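A back-of-the-envelope way to answer "does it fit?" (our rule of thumb, not the benchmark's methodology): Q4 quantization stores roughly half a byte per parameter, plus some fixed overhead for the KV cache and runtime buffers.

```python
def q4_ram_estimate_gb(params_billions: float, overhead_gb: float = 1.0) -> float:
    """Rough RAM estimate for a Q4-quantized model.

    Assumes ~0.5 bytes per parameter (4-bit weights) plus a fixed
    overhead for KV cache and runtime. A sketch, not the benchmark's
    measurement method -- real peak RAM varies by runtime and context length.
    """
    return params_billions * 0.5 + overhead_gb

print(q4_ram_estimate_gb(8))    # → 5.0 (roughly in line with the 5.5GB we measured)
print(q4_ram_estimate_gb(1.7))  # → 1.85
```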
🏟️ Introducing SHIFT — 5 axes, not 1
Size · Honesty · Intelligence · Fast · Thrift
| Axis | What it measures | How |
|---|---|---|
| S | How big is the model? | Parameter count, active params for MoE |
| H | Does it resist hallucination? | 40 questions — traps, calibration, refusal, self-correction |
| I | How smart is it? | 85 questions — reasoning, math, coding, 7 languages |
| F | How fast does it run? | tok/s measured via HF Inference API |
| T | How much resource does it need? | Peak RAM at Q4 quantization |
All 125 questions require JSON-structured output with verifiable fields. No keyword matching. 75 questions are fully automatic — zero human grading needed.
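A minimal sketch of what JSON-based auto-grading can look like (the field names and schema here are our illustration, not the benchmark's actual format):

```python
import json

def auto_grade(raw_output: str, expected: dict) -> bool:
    """Grade a JSON-structured answer automatically.

    The output must parse as JSON, and every expected field must match
    exactly -- no keyword matching, no human in the loop.
    Schema is illustrative, not the benchmark's real one.
    """
    try:
        answer = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # malformed output scores zero
    return all(answer.get(k) == v for k, v in expected.items())

print(auto_grade('{"answer": 42, "unit": "kg"}', {"answer": 42}))  # → True
print(auto_grade('The answer is 42', {"answer": 42}))              # → False
```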
📊 The ranking formula: WCS
The tricky part — how do you rank models when you're measuring both quality and efficiency?
SHIFT alone? Then 14B always beats 1.7B. Boring. Expected.
PIR (efficiency) alone? Then a terrible 1.3B model becomes #1 because it's tiny. Misleading.
Our solution: WorldCup Score (WCS) — the geometric mean of both:
```
WCS = √( SHIFT × PIR_norm )

where:
  SHIFT    = H × 0.4 + I × 0.6              → quality
  PIR      = (I × H × F) ÷ (S × T)          → efficiency
  PIR_norm = log₁₀(PIR) / log₁₀(PIR_max) × 100
```

(PIR_max is the highest PIR among all evaluated models, so PIR_norm tops out at 100.)
Why geometric mean? Because √(A × B) requires both to be high. Smart but huge? Low WCS. Tiny but dumb? Also low WCS. You need both quality and efficiency to rank well.
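In code, the whole scoring pipeline is a few lines. A sketch: `pir_max` stands in for the best PIR in the field, and the example inputs are illustrative, not measured values.

```python
import math

def wcs(h, i, f, s, t, pir_max):
    """WorldCup Score: geometric mean of quality (SHIFT) and
    log-normalized efficiency (PIR), following the formulas above.

    h, i:    Honesty and Intelligence scores (0-100)
    f:       speed in tok/s
    s:       parameter count in billions
    t:       peak RAM in GB at Q4
    pir_max: highest PIR in the field (normalization anchor)
    """
    shift = 0.4 * h + 0.6 * i              # quality
    pir = (i * h * f) / (s * t)            # efficiency
    pir_norm = math.log10(pir) / math.log10(pir_max) * 100
    return math.sqrt(shift * pir_norm)

# Illustrative inputs only:
print(round(wcs(h=80, i=75, f=150, s=4, t=2.8, pir_max=100_000), 1))
```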
🏆 The results
Here are the top 5 — and they're not what you'd expect:
| # | Model | WCS | SHIFT | RAM | League |
|---|---|---|---|---|---|
| 🏆 1 | GPT-OSS-20B | 82.6 | 76.9 | 1.5GB | 🥅 Raspberry Pi tier |
| 🥈 2 | Gemma-3n-E4B | 81.8 | 77.3 | 2.0GB | ⚽ Smartphone tier |
| 🥉 3 | Llama-4-Scout | 79.3 | 74.2 | 10GB | 🏆 Desktop (but 240 tok/s!) |
| 4 | Qwen3-4B | 76.6 | 76.8 | 2.8GB | ⚽ Smartphone tier |
| 5 | Qwen3-1.7B | 76.1 | 66.8 | 1.2GB | 🥅 IoT tier |
The WCS champion runs on a Raspberry Pi. Let that sink in.
🔬 5 findings that surprised us
Finding 1: 4B = 8B (at 36% of the RAM)
```
Gemma-3n-E4B → SHIFT 77.3  (4B, 2.0GB)  ← #1 quality!
Qwen3-8B     → SHIFT 76.9  (8B, 5.5GB)

Gap:  0.4 points
RAM:  2.75× more
```
Google's PLE architecture and Qwen3's training pipeline have made 4B models functionally equivalent to 8B on structured evaluation tasks. The extra 3.5GB of RAM buys you almost nothing.
Finding 2: MoE is the cheat code for edge AI
```
GPT-OSS-20B → 21B total,  3.6B active,  1.5GB RAM → SHIFT 76.9
Gemma-3-12B → 12B total, 12B active,   8.5GB RAM → SHIFT 75.7
```
Same quality. 5.7× less RAM. MoE models activate only a fraction of their parameters at inference time, giving you big-model knowledge with small-model resources.
Finding 3: Thinking models have a dark side
Models with `<think>` reasoning tokens (DeepSeek-R1, Nemotron-Nano) face a double penalty:
Quality hit — `<think>` tags break JSON structured output:
```
Qwen3-8B (non-thinking)   → SHIFT 76.9
DeepSeek-R1-7B (thinking) → SHIFT 68.2  (−8.7 points!)
```
Speed hit — internal reasoning = 2–6× more tokens generated:
```
Qwen3-8B       → 186.8 tok/s
DeepSeek-R1-7B →  69.2 tok/s  (2.7× slower)
Nemotron-Nano  →  29.8 tok/s  (6.3× slower)
```
Thinking helps for complex math (DeepSeek-R1-14B's reasoning score is the highest we measured), but for real-time structured tasks, non-thinking models win.
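If you do deploy a thinking model behind a structured-output pipeline, a common workaround is to strip the reasoning block before parsing. A minimal sketch (our own, not the benchmark's harness):

```python
import json
import re

def parse_json_after_think(raw: str) -> dict:
    """Strip <think>...</think> reasoning blocks, then parse JSON.

    Reasoning models often emit their chain of thought before the
    structured answer, which breaks a naive json.loads on the raw text.
    """
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    return json.loads(cleaned.strip())

out = '<think>Let me reason step by step...</think>{"answer": 7}'
print(parse_json_after_think(out))  # → {'answer': 7}
```

This rescues parseability but not the extra tokens — the speed penalty above remains.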
Finding 4: The hallucination gap is enormous
Our H1 test presents fake people, papers, and products. Models must refuse to fabricate.
The score range? 20 to 100. That's an 80-point spread — the widest of any metric.
```
H1 = 100: Qwen3-4B, Qwen3-8B, GPT-OSS-20B, GLM-4.7-Flash
H1 =  90: Gemma-3n-E4B, Llama-4-Scout
H1 =  60: Qwen3-1.7B, DeepSeek-R1-14B
H1 =  20: Llama-3.2-1B  ← fabricates 80% of the time
```
The Qwen3 family is remarkably consistent at hallucination resistance across all sizes. Meanwhile, the smallest model (1.3B) will confidently tell you about a nonexistent professor's nonexistent research paper, complete with fake citations.
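To make the H1 idea concrete, here is a sketch of how a fabrication-trap question could be auto-scored. The schema (an `entity_exists` field the model must set) is hypothetical — our illustration, not the benchmark's actual format.

```python
import json

def grade_trap(raw_output: str) -> int:
    """Score one fabrication-trap question (hypothetical schema).

    The entity in the question does not exist, so the only correct
    move is an explicit refusal. Confident details about the fake
    entity -- or unparseable output -- score zero.
    """
    try:
        ans = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0
    return 100 if ans.get("entity_exists") is False else 0

print(grade_trap('{"entity_exists": false}'))                 # → 100
print(grade_trap('{"entity_exists": true, "paper": "..."}'))  # → 0
```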
Finding 5: 1.7B beats 14B
```
Qwen3-1.7B      (1.2GB) → SHIFT 66.8
Mistral-7B      (5.0GB) → SHIFT 60.6  ← 4.2× bigger, 6.2 points worse
Llama-3.1-8B    (5.5GB) → SHIFT 61.0  ← 4.7× bigger, 5.8 points worse
DeepSeek-R1-14B (9.5GB) → SHIFT 59.8  ← 8.7× bigger, 7.0 points worse
```
Architecture generation matters more than parameter count. A 2025 model at 1.7B outperforms three 2024 models at 7–14B.
🏅 vs SOTA: How do small models compare to Claude and GPT-5?
We gave the same 19 questions to both our small models and the frontier giants:
```
Claude Sonnet 4.6 → 69.9  (ceiling)
Claude Opus 4.6   → 69.3
GPT-5.4           → 62.4
Qwen3.5-397B      → 57.1
────────────────────────────
Gemma-3-12B       → 57.1  (82% of Claude!)
GPT-OSS-20B       → 54.2  (78% of Claude)
Gemma-3n-E4B      → 47.4  (68% of Claude)
```
A 12B model matches a 397B model on identical questions. The gap between small and large is narrower than most people think.
⚡ Speed: Provider matters more than model size
```
Llama-4-Scout (Groq)      → 240.5 tok/s
Llama-3.1-8B (Cerebras)   → 187.7 tok/s
Qwen3-8B (Fireworks)      → 186.8 tok/s
...
Gemma-3-12B (Featherless) →  18.7 tok/s
Mistral-7B (Featherless)  →  17.8 tok/s
```
The fastest model is 13× faster than the slowest — and it's a bigger model. The difference? Groq's inference chip vs. generic GPU hosting. Infrastructure choice dominates model size in determining real-world speed.
🗓️ Anti-contamination: Season system
One concern with any public benchmark: models will eventually train on the questions.
Our defense:
- 30 anchor questions stay fixed across seasons (for IRT calibration)
- 95 questions rotate (70%+ replaced each season)
- Union Eval questions are never published
- Season 2 planned for 2026 Q3
🤝 Built with the community
This benchmark was developed in collaboration with the FINAL Bench research team. The Union Eval cross-benchmark design draws on their evaluation methodology.
It also integrates with the ALL Bench Leaderboard — so you can see where your small model ranks among small models (Smol WorldCup) and against the full landscape including GPT-5 and Claude (ALL Bench).
Try it yourself
The dataset is open under Apache 2.0. We welcome new model submissions.
```python
from datasets import load_dataset

ds = load_dataset("ginigen-ai/smol-worldcup")
print(f"Total: {len(ds['train'])} questions")

# Filter by axis
honesty = ds['train'].filter(lambda x: x['shift_axis'] == 'H')

# Filter by language
korean = ds['train'].filter(lambda x: x['category'] == 'multilingual_ko')
```
🏟️ Live Leaderboard
📊 Dataset on HuggingFace
🏅 ALL Bench Leaderboard
Developed by Ginigen.ai · Small but Mighty AI