The "bigger is better" assumption is wrong.
We spent weeks evaluating 18 small language models from 12 different makers on 125 questions across 7 languages — and the results seriously challenge conventional wisdom about model scaling.
Here's what the data actually shows:
- A 4B model outperforms an 8B model — using 36% of the RAM
- A 1.5GB MoE model matches dense models that need 8.5GB
- A 1.7B model beats three separate 7B–14B models
- A 1.3B model fabricates content 80% of the time
These aren't theoretical predictions. These are measured results.
🤔 Why we built yet another benchmark
Here's the thing — MMLU, GPQA, and HumanEval weren't built for edge AI.
They give the same test to a 0.5B model and a 500B model. That's fine if you only care about "how smart is it?" But if you're deploying on a phone, a Raspberry Pi, or an 8GB laptop, you need to know:
- Does it fit? → How much RAM does it actually need?
- Does it lie? → How often does it fabricate information?
- Is it fast enough? → How many tokens per second?
- Is it worth the cost? → What's the performance per GB of RAM?
Existing benchmarks answer none of these. So we built one that answers all of them.
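A back-of-the-envelope way to answer "does it fit?" (our rule of thumb, not the benchmark's methodology): Q4 quantization stores roughly half a byte per parameter, plus some fixed overhead for the KV cache and runtime buffers.

```python
def q4_ram_estimate_gb(params_billions: float, overhead_gb: float = 1.0) -> float:
    """Rough RAM estimate for a Q4-quantized model.

    Assumes ~0.5 bytes per parameter (4-bit weights) plus a fixed
    overhead for KV cache and runtime. A sketch, not the benchmark's
    measurement method -- real peak RAM varies by runtime and context length.
    """
    return params_billions * 0.5 + overhead_gb

print(q4_ram_estimate_gb(8))    # → 5.0 (roughly in line with the 5.5GB we measured)
print(q4_ram_estimate_gb(1.7))  # → 1.85
```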
🏟️ Introducing SHIFT — 5 axes, not 1
Size · Honesty · Intelligence · Fast · Thrift
| Axis | What it measures | How |
|---|---|---|
| S | How big is the model? | Parameter count, active params for MoE |
| H | Does it resist hallucination? | 40 questions — traps, calibration, refusal, self-correction |
| I | How smart is it? | 85 questions — reasoning, math, coding, 7 languages |
| F | How fast does it run? | tok/s measured via HF Inference API |
| T | How much resource does it need? | Peak RAM at Q4 quantization |
All 125 questions require JSON-structured output with verifiable fields. No keyword matching. 75 questions are fully automatic — zero human grading needed.
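A minimal sketch of what JSON-based auto-grading can look like (the field names and schema here are our illustration, not the benchmark's actual format):

```python
import json

def auto_grade(raw_output: str, expected: dict) -> bool:
    """Grade a JSON-structured answer automatically.

    The output must parse as JSON, and every expected field must match
    exactly -- no keyword matching, no human in the loop.
    Schema is illustrative, not the benchmark's real one.
    """
    try:
        answer = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # malformed output scores zero
    return all(answer.get(k) == v for k, v in expected.items())

print(auto_grade('{"answer": 42, "unit": "kg"}', {"answer": 42}))  # → True
print(auto_grade('The answer is 42', {"answer": 42}))              # → False
```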
📊 The ranking formula: WCS
The tricky part — how do you rank models when you're measuring both quality and efficiency?
SHIFT alone? Then 14B always beats 1.7B. Boring. Expected.
PIR (efficiency) alone? Then a terrible 1.3B model becomes #1 because it's tiny. Misleading.
Our solution: WorldCup Score (WCS) — the geometric mean of both:
```
WCS = √( SHIFT × PIR_norm )

where:
  SHIFT    = H × 0.4 + I × 0.6              → quality
  PIR      = (I × H × F) ÷ (S × T)          → efficiency
  PIR_norm = log₁₀(PIR) / log₁₀(PIR_max) × 100
```

(PIR_max is the highest PIR among all evaluated models, so PIR_norm tops out at 100.)
Why geometric mean? Because √(A × B) requires both to be high. Smart but huge? Low WCS. Tiny but dumb? Also low WCS. You need both quality and efficiency to rank well.
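In code, the whole scoring pipeline is a few lines. A sketch: `pir_max` stands in for the best PIR in the field, and the example inputs are illustrative, not measured values.

```python
import math

def wcs(h, i, f, s, t, pir_max):
    """WorldCup Score: geometric mean of quality (SHIFT) and
    log-normalized efficiency (PIR), following the formulas above.

    h, i:    Honesty and Intelligence scores (0-100)
    f:       speed in tok/s
    s:       parameter count in billions
    t:       peak RAM in GB at Q4
    pir_max: highest PIR in the field (normalization anchor)
    """
    shift = 0.4 * h + 0.6 * i              # quality
    pir = (i * h * f) / (s * t)            # efficiency
    pir_norm = math.log10(pir) / math.log10(pir_max) * 100
    return math.sqrt(shift * pir_norm)

# Illustrative inputs only:
print(round(wcs(h=80, i=75, f=150, s=4, t=2.8, pir_max=100_000), 1))
```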
🏆 The results
Here are the top 5 — and they're not what you'd expect:
| # | Model | WCS | SHIFT | RAM | League |
|---|---|---|---|---|---|
| 🏆 1 | GPT-OSS-20B | 82.6 | 76.9 | 1.5GB | 🥅 Raspberry Pi tier |
| 🥈 2 | Gemma-3n-E4B | 81.8 | 77.3 | 2.0GB | ⚽ Smartphone tier |
| 🥉 3 | Llama-4-Scout | 79.3 | 74.2 | 10GB | 🏆 Desktop (but 240 tok/s!) |
| 4 | Qwen3-4B | 76.6 | 76.8 | 2.8GB | ⚽ Smartphone tier |
| 5 | Qwen3-1.7B | 76.1 | 66.8 | 1.2GB | 🥅 IoT tier |
The WCS champion runs on a Raspberry Pi. Let that sink in.
🔬 5 findings that surprised us
Finding 1: 4B = 8B (at 36% of the RAM)
```
Gemma-3n-E4B → SHIFT 77.3  (4B, 2.0GB)  ← #1 quality!
Qwen3-8B     → SHIFT 76.9  (8B, 5.5GB)

Gap:  0.4 points
RAM:  2.75× more
```
Google's PLE architecture and Qwen3's training pipeline have made 4B models functionally equivalent to 8B on structured evaluation tasks. The extra 3.5GB of RAM buys you almost nothing.
Finding 2: MoE is the cheat code for edge AI
```
GPT-OSS-20B → 21B total,  3.6B active,  1.5GB RAM → SHIFT 76.9
Gemma-3-12B → 12B total, 12B active,   8.5GB RAM → SHIFT 75.7
```
Same quality. 5.7× less RAM. MoE models activate only a fraction of their parameters at inference time, giving you big-model knowledge with small-model resources.
Finding 3: Thinking models have a dark side
Models with `<think>` reasoning tokens (DeepSeek-R1, Nemotron-Nano) face a double penalty:
Quality hit — `<think>` tags break JSON structured output:
```
Qwen3-8B (non-thinking)   → SHIFT 76.9
DeepSeek-R1-7B (thinking) → SHIFT 68.2  (−8.7 points!)
```
Speed hit — internal reasoning = 2–6× more tokens generated:
```
Qwen3-8B       → 186.8 tok/s
DeepSeek-R1-7B →  69.2 tok/s  (2.7× slower)
Nemotron-Nano  →  29.8 tok/s  (6.3× slower)
```
Thinking helps for complex math (DeepSeek-R1-14B's reasoning score is the highest we measured), but for real-time structured tasks, non-thinking models win.
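If you do deploy a thinking model behind a structured-output pipeline, a common workaround is to strip the reasoning block before parsing. A minimal sketch (our own, not the benchmark's harness):

```python
import json
import re

def parse_json_after_think(raw: str) -> dict:
    """Strip <think>...</think> reasoning blocks, then parse JSON.

    Reasoning models often emit their chain of thought before the
    structured answer, which breaks a naive json.loads on the raw text.
    """
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    return json.loads(cleaned.strip())

out = '<think>Let me reason step by step...</think>{"answer": 7}'
print(parse_json_after_think(out))  # → {'answer': 7}
```

This rescues parseability but not the extra tokens — the speed penalty above remains.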
Finding 4: The hallucination gap is enormous
Our H1 test presents fake people, papers, and products. Models must refuse to fabricate.
The score range? 20 to 100. That's an 80-point spread — the widest of any metric.
```
H1 = 100: Qwen3-4B, Qwen3-8B, GPT-OSS-20B, GLM-4.7-Flash
H1 =  90: Gemma-3n-E4B, Llama-4-Scout
H1 =  60: Qwen3-1.7B, DeepSeek-R1-14B
H1 =  20: Llama-3.2-1B  ← fabricates 80% of the time
```
The Qwen3 family is remarkably consistent at hallucination resistance across all sizes. Meanwhile, the smallest model (1.3B) will confidently tell you about a nonexistent professor's nonexistent research paper, complete with fake citations.
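To make the H1 idea concrete, here is a sketch of how a fabrication-trap question could be auto-scored. The schema (an `entity_exists` field the model must set) is hypothetical — our illustration, not the benchmark's actual format.

```python
import json

def grade_trap(raw_output: str) -> int:
    """Score one fabrication-trap question (hypothetical schema).

    The entity in the question does not exist, so the only correct
    move is an explicit refusal. Confident details about the fake
    entity -- or unparseable output -- score zero.
    """
    try:
        ans = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0
    return 100 if ans.get("entity_exists") is False else 0

print(grade_trap('{"entity_exists": false}'))                 # → 100
print(grade_trap('{"entity_exists": true, "paper": "..."}'))  # → 0
```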
Finding 5: 1.7B beats 14B
```
Qwen3-1.7B      (1.2GB) → SHIFT 66.8
Mistral-7B      (5.0GB) → SHIFT 60.6  ← 4.2× bigger, 6.2 points worse
Llama-3.1-8B    (5.5GB) → SHIFT 61.0  ← 4.7× bigger, 5.8 points worse
DeepSeek-R1-14B (9.5GB) → SHIFT 59.8  ← 8.7× bigger, 7.0 points worse
```
Architecture generation matters more than parameter count. A 2025 model at 1.7B outperforms three 2024 models at 7–14B.
🏅 vs SOTA: How do small models compare to Claude and GPT-5?
We gave the same 19 questions to both our small models and the frontier giants:
```
Claude Sonnet 4.6 → 69.9  (ceiling)
Claude Opus 4.6   → 69.3
GPT-5.4           → 62.4
Qwen3.5-397B      → 57.1
────────────────────────────
Gemma-3-12B       → 57.1  (82% of Claude!)
GPT-OSS-20B       → 54.2  (78% of Claude)
Gemma-3n-E4B      → 47.4  (68% of Claude)
```
A 12B model matches a 397B model on identical questions. The gap between small and large is narrower than most people think.
⚡ Speed: Provider matters more than model size
```
Llama-4-Scout (Groq)      → 240.5 tok/s
Llama-3.1-8B (Cerebras)   → 187.7 tok/s
Qwen3-8B (Fireworks)      → 186.8 tok/s
...
Gemma-3-12B (Featherless) →  18.7 tok/s
Mistral-7B (Featherless)  →  17.8 tok/s
```
The fastest model is 13× faster than the slowest — and it's a bigger model. The difference? Groq's inference chip vs. generic GPU hosting. Infrastructure choice dominates model size in determining real-world speed.
🗓️ Anti-contamination: Season system
One concern with any public benchmark: models will eventually train on the questions.
Our defense:
- 30 anchor questions stay fixed across seasons (for IRT calibration)
- 95 questions rotate (70%+ replaced each season)
- Union Eval questions are never published
- Season 2 planned for 2026 Q3
🤝 Built with the community
This benchmark was developed in collaboration with the FINAL Bench research team. The Union Eval cross-benchmark design draws on their evaluation methodology.
It also integrates with the ALL Bench Leaderboard — so you can see where your small model ranks among small models (Smol WorldCup) and against the full landscape including GPT-5 and Claude (ALL Bench).
Try it yourself
The dataset is open under Apache 2.0. We welcome new model submissions.
```python
from datasets import load_dataset

ds = load_dataset("ginigen-ai/smol-worldcup")
print(f"Total: {len(ds['train'])} questions")

# Filter by axis
honesty = ds['train'].filter(lambda x: x['shift_axis'] == 'H')

# Filter by language
korean = ds['train'].filter(lambda x: x['category'] == 'multilingual_ko')
```
🏟️ Live Leaderboard
📊 Dataset on HuggingFace
🏅 ALL Bench Leaderboard
Developed by Ginigen.ai · Small but Mighty AI