
AI Tech News

Originally published at huggingface.co

A 4B Model Just Beat 8B — We Tested 18 Small LLMs and the Results Are Wild

The "bigger is better" assumption is wrong.

We spent weeks evaluating 18 small language models from 12 different makers on 125 questions across 7 languages — and the results seriously challenge conventional wisdom about model scaling.

Here's what the data actually shows:

  • A 4B model outperforms an 8B model — using 36% of the RAM
  • A 1.5GB MoE model matches dense models that need 8.5GB
  • A 1.7B model beats three separate 7B–14B models
  • A 1.3B model fabricates content 80% of the time

These aren't theoretical predictions. These are measured results.

Smol AI WorldCup Leaderboard


🤔 Why we built yet another benchmark

Here's the thing — MMLU, GPQA, and HumanEval weren't built for edge AI.

They give the same test to a 0.5B model and a 500B model. That's fine if you only care about "how smart is it?" But if you're deploying on a phone, a Raspberry Pi, or an 8GB laptop, you need to know:

  1. Does it fit? → How much RAM does it actually need?
  2. Does it lie? → How often does it fabricate fake information?
  3. Is it fast enough? → How many tokens per second?
  4. Is it worth the cost? → What's the performance per GB of RAM?

Existing benchmarks answer none of these. So we built one that answers all of them.


🏟️ Introducing SHIFT — 5 axes, not 1

Size · Honesty · Intelligence · Fast · Thrift

| Axis | What it measures | How |
| --- | --- | --- |
| **S** | How big is the model? | Parameter count (active params for MoE) |
| **H** | Does it resist hallucination? | 40 questions: traps, calibration, refusal, self-correction |
| **I** | How smart is it? | 85 questions: reasoning, math, coding, 7 languages |
| **F** | How fast does it run? | tok/s, measured via the HF Inference API |
| **T** | How much resource does it need? | Peak RAM at Q4 quantization |

All 125 questions require JSON-structured output with verifiable fields. No keyword matching. 75 questions are fully automatic — zero human grading needed.
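Verifiable JSON fields make grading mechanical: parse the output, then compare the checkable fields. Here's a minimal sketch of that idea; the field names (`answer`, `confidence`) are illustrative, not the actual SHIFT schema:

```python
import json

def grade_structured(raw_output: str, expected: dict) -> bool:
    """Grade a model answer by comparing verifiable JSON fields.

    `expected` maps field names to required values. The field names
    used below are hypothetical examples, not the benchmark's schema.
    """
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # malformed JSON counts as a failure
    return all(parsed.get(k) == v for k, v in expected.items())

# A correct structured answer passes; prose or wrong values fail.
print(grade_structured('{"answer": "Paris", "confidence": "high"}',
                       {"answer": "Paris"}))                    # → True
print(grade_structured('The answer is Paris.', {"answer": "Paris"}))  # → False
```

Because grading reduces to field equality, no keyword matching or human judgment is needed for those 75 automatic questions.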

SHIFT Framework


📊 The ranking formula: WCS

The tricky part — how do you rank models when you're measuring both quality and efficiency?

SHIFT alone? Then 14B always beats 1.7B. Boring. Expected.

PIR (efficiency) alone? Then a terrible 1.3B model becomes #1 because it's tiny. Misleading.

Our solution: WorldCup Score (WCS) — the geometric mean of both:

```
WCS = √( SHIFT × PIR_norm )

where:
  SHIFT    = H × 0.4 + I × 0.6       → quality
  PIR      = (I × H × F) ÷ (S × T)   → efficiency
  PIR_norm = log₁₀(PIR) / log₁₀(max PIR) × 100
```

Why geometric mean? Because √(A × B) requires both to be high. Smart but huge? Low WCS. Tiny but dumb? Also low WCS. You need both quality and efficiency to rank well.
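Transcribed into code, the whole scoring pipeline is a few lines. The inputs in the example call are made up purely to show the shape of the computation; `pir_max` stands for the best raw PIR on the leaderboard, used for log-normalisation:

```python
import math

def wcs(h: float, i: float, f: float, s: float, t: float, pir_max: float) -> float:
    """WorldCup Score, transcribed from the formulas above.

    h, i = honesty / intelligence scores (0-100), f = tok/s,
    s = parameter count in B, t = peak RAM in GB. pir_max is the
    best raw PIR on the board, used to log-normalise to 0-100.
    """
    shift = 0.4 * h + 0.6 * i            # quality
    pir = (i * h * f) / (s * t)          # efficiency
    pir_norm = math.log10(pir) / math.log10(pir_max) * 100
    return math.sqrt(shift * pir_norm)

# Made-up inputs, just to show the arithmetic:
print(round(wcs(h=76, i=77, f=100, s=4, t=2, pir_max=1e5), 1))  # ≈ 86.3
```

Note how a weak score on either factor drags the square root down: that is the geometric mean doing its job.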


🏆 The results

Here are the top 5 — and they're not what you'd expect:

```
#  Model              WCS    SHIFT  RAM     League
🏆 GPT-OSS-20B       82.6   76.9   1.5GB   🥅 Raspberry Pi tier
🥈 Gemma-3n-E4B      81.8   77.3   2.0GB   ⚽ Smartphone tier
🥉 Llama-4-Scout     79.3   74.2   10GB    🏆 Desktop (but 240 tok/s!)
4  Qwen3-4B          76.6   76.8   2.8GB   ⚽ Smartphone tier
5  Qwen3-1.7B        76.1   66.8   1.2GB   🥅 IoT tier
```

The WCS champion runs on a Raspberry Pi. Let that sink in.


🔬 5 findings that surprised us

Finding 1: 4B = 8B (at 36% of the RAM)

```
Gemma-3n-E4B  → SHIFT 77.3  (4B,  2.0GB)  ← #1 quality!
Qwen3-8B      → SHIFT 76.9  (8B,  5.5GB)
                              Gap: 0.4 points
                              RAM: 2.75× more
```

Google's PLE architecture and Qwen3's training pipeline have made 4B models functionally equivalent to 8B on structured evaluation tasks. The extra 3.5GB of RAM buys you almost nothing.

4B vs 8B comparison

Finding 2: MoE is the cheat code for edge AI

```
GPT-OSS-20B   → 21B total, 3.6B active, 1.5GB RAM → SHIFT 76.9
Gemma-3-12B   → 12B total, 12B active,  8.5GB RAM → SHIFT 75.7
```

Same quality. 5.7× less RAM. MoE models activate only a fraction of their parameters at inference time, giving you big-model knowledge with small-model resources.
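As a back-of-the-envelope check (our own rule of thumb, not the benchmark's measurement method): at Q4, each weight costs roughly 0.5 bytes, so if only the active expert weights need to sit on the fast memory path, a MoE model's footprint tracks its active parameter count rather than its total:

```python
def q4_ram_gb(resident_params_b: float, overhead: float = 1.2) -> float:
    """Rough Q4 footprint: ~0.5 bytes per weight plus ~20% overhead
    for KV cache and activations. Both constants are illustrative
    assumptions; measured leaderboard numbers will differ."""
    return resident_params_b * 0.5 * overhead

# MoE with 3.6B active params vs a dense 12B model:
print(round(q4_ram_gb(3.6), 1))   # → 2.2
print(round(q4_ram_gb(12.0), 1))  # → 7.2
```

Those estimates land in the same ballpark as the measured 1.5GB vs 8.5GB above, which is why active-parameter count, not total, is the number to watch for edge deployment.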

Finding 3: Thinking models have a dark side

Models with <think> reasoning tokens (DeepSeek-R1, Nemotron-Nano) face a double penalty:

Quality hit: `<think>` tags break JSON-structured output:

```
Qwen3-8B (non-thinking)    → SHIFT 76.9
DeepSeek-R1-7B (thinking)  → SHIFT 68.2  (−8.7 points!)
```

Speed hit: internal reasoning generates 2–6× more tokens:

```
Qwen3-8B        → 186.8 tok/s
DeepSeek-R1-7B  →  69.2 tok/s  (2.7× slower)
Nemotron-Nano   →  29.8 tok/s  (6.3× slower)
```

Thinking helps for complex math (DeepSeek-R1-14B's reasoning score is the highest we measured), but for real-time structured tasks, non-thinking models win.

Thinking model penalties

Finding 4: The hallucination gap is enormous

Our H1 test presents fake people, papers, and products. Models must refuse to fabricate.

The score range? 20 to 100. That's an 80-point spread — the widest of any metric.

```
H1 = 100: Qwen3-4B, Qwen3-8B, GPT-OSS-20B, GLM-4.7-Flash
H1 = 90:  Gemma-3n-E4B, Llama-4-Scout
H1 = 60:  Qwen3-1.7B, DeepSeek-R1-14B
H1 = 20:  Llama-3.2-1B  ← fabricates 80% of the time
```

The Qwen3 family is remarkably consistent at hallucination resistance across all sizes. Meanwhile, the smallest model (1.3B) will confidently tell you about a nonexistent professor's nonexistent research paper, complete with fake citations.

Hallucination scores

Finding 5: 1.7B beats 14B

```
Qwen3-1.7B      (1.2GB)  → SHIFT 66.8
Mistral-7B      (5.0GB)  → SHIFT 60.6  ← 4.2× bigger, 6.2 points worse
Llama-3.1-8B    (5.5GB)  → SHIFT 61.0  ← 4.7× bigger, 5.8 points worse
DeepSeek-R1-14B (9.5GB)  → SHIFT 59.8  ← 8.7× bigger, 7.0 points worse
```

Architecture generation matters more than parameter count. A 2025 model at 1.7B outperforms three 2024 models at 7–14B.


🏅 vs SOTA: How do small models compare to Claude and GPT-5?

We gave the same 19 questions to both our small models and the frontier giants:

```
Claude Sonnet 4.6  → 69.9  (ceiling)
Claude Opus 4.6    → 69.3
GPT-5.4            → 62.4
Qwen3.5-397B       → 57.1
────────────────────────────
Gemma-3-12B        → 57.1  (82% of Claude!)
GPT-OSS-20B        → 54.2  (78% of Claude)
Gemma-3n-E4B       → 47.4  (68% of Claude)
```

A 12B model matches a 397B model on identical questions. The gap between small and large is narrower than most people think.


⚡ Speed: Provider matters more than model size

```
Llama-4-Scout (Groq)       → 240.5 tok/s
Llama-3.1-8B (Cerebras)    → 187.7 tok/s
Qwen3-8B (Fireworks)       → 186.8 tok/s
...
Gemma-3-12B (Featherless)  →  18.7 tok/s
Mistral-7B (Featherless)   →  17.8 tok/s
```

The fastest model is 13× faster than the slowest — and it's a bigger model. The difference? Groq's inference chip vs. generic GPU hosting. Infrastructure choice dominates model size in determining real-world speed.
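Tokens/second itself is simple wall-clock arithmetic. A sketch of the measurement, where `generate` and `count_tokens` are stand-ins for whatever inference client and tokenizer you use (the benchmark measured via the HF Inference API):

```python
import time

def measure_tps(generate, count_tokens, prompt: str) -> float:
    """Wall-clock tokens per second for one completion.

    generate(prompt) -> str and count_tokens(text) -> int are
    placeholders for a real client and tokenizer; this sketch
    only shows the timing arithmetic."""
    start = time.perf_counter()
    completion = generate(prompt)
    elapsed = time.perf_counter() - start
    return count_tokens(completion) / elapsed

# Demo with stand-ins; swap in a real client and tokenizer.
tps = measure_tps(
    generate=lambda p: "a short fake completion",
    count_tokens=lambda text: len(text.split()),
    prompt="What is 2+2?",
)
```

Because the same model can vary by an order of magnitude across endpoints, always report the provider alongside the tok/s figure.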

Speed rankings


🗓️ Anti-contamination: Season system

One concern with any public benchmark: models will eventually train on the questions.

Our defense:

  • 30 anchor questions stay fixed across seasons (for IRT calibration)
  • 95 questions rotate (70%+ replaced each season)
  • Union Eval questions are never published
  • Season 2 planned for 2026 Q3

🤝 Built with the community

This benchmark was developed in collaboration with the FINAL Bench research team. The Union Eval cross-benchmark design draws on their evaluation methodology.

It also integrates with the ALL Bench Leaderboard — so you can see where your small model ranks among small models (Smol WorldCup) and against the full landscape including GPT-5 and Claude (ALL Bench).

Recommendations


Try it yourself

The dataset is open under Apache 2.0. We welcome new model submissions.

```python
from datasets import load_dataset

ds = load_dataset("ginigen-ai/smol-worldcup")
print(f"Total: {len(ds['train'])} questions")

# Filter by axis
honesty = ds['train'].filter(lambda x: x['shift_axis'] == 'H')

# Filter by language
korean = ds['train'].filter(lambda x: x['category'] == 'multilingual_ko')
```

🏟️ Live Leaderboard
📊 Dataset on HuggingFace
🏅 ALL Bench Leaderboard


Developed by Ginigen.ai · Small but Mighty AI
