soohan abbasi

Posted on May 24 • Edited on Jun 13

Small Language Models: Rethinking What Intelligence Actually Requires

#ai #llm #machinelearning #microsoft

"Scale solves everything — until it doesn't."

Introduction: A Result Nobody Predicted

In April 2024, Microsoft published a technical report with a claim that many researchers found difficult to take seriously at first. Their new model, Phi-3 Mini, had 3.8 billion parameters. GPT-3 had 175 billion. GPT-4 is rumored to be somewhere above a trillion, though OpenAI has not disclosed the figure. And yet Phi-3 Mini outperformed GPT-3 on standard benchmarks, approached GPT-3.5 on several tasks, and ran entirely on a laptop with no internet connection.

The response from the research community was not celebration. It was confusion. The scaling laws, the empirical relationships between model size, data, compute, and performance, had held for years. They were the closest thing the field had to a reliable theory of how intelligence emerges in these systems. Phi-3 did not break the scaling laws, but it suggested something the field had underweighted: the laws describe what scale can do, not what scale is required to do it.

The question Phi-3 raised is not "how small can we go?" It is something more fundamental: what does a language model actually need in order to reason well?

That is what this post is about. I spent this week reading the papers, running three experiments on Kaggle, and trying to build an honest picture of where SLMs stand today — what they can genuinely do, what they cannot, and why the answer matters more than most benchmark tables suggest.

Part 1: Why Small Language Models Exist at All

By 2023, the dominant AI paradigm was clear: train larger models on more data with more compute. GPT-4, PaLM 2, Gemini Ultra — each required infrastructure that only a handful of organizations on earth could afford. Training costs ran into tens or hundreds of millions of dollars.

This created a real problem. Most AI applications do not need a trillion-parameter model. They need something reliable, fast, cheap, and ideally not dependent on a cloud API that sends data to an external server. Finance, legal, government — domains with the strongest AI use cases are also the ones with the strictest data privacy requirements. There is no local deployment option for GPT-4.

SLMs emerged as a direct response. Not as a compromise, but as a deliberate design decision: build the smallest model that can reliably do a specific set of tasks.

Around the same time, a quieter debate was happening in research circles. A group of papers, starting with the original Phi-1 work in 2023, made a provocative argument: large models do not outperform small ones primarily because they are larger. They outperform them because they are trained on far more data, and they need that volume partly because most of it is low quality. Filter the data aggressively, keep only dense reasoning-heavy content, and a much smaller model performs surprisingly well.

This is sometimes called the textbook hypothesis: a model trained on textbook-quality material learns to reason better than a model trained on ten times as much internet text. The Phi series became the primary empirical test of this hypothesis, and the results were striking enough that the idea is now taken seriously across the field.

There is no official definition of small, but the community generally treats anything under 7 billion parameters as an SLM:

Category	Parameters	Examples
Large	≥70B	GPT-4, Claude 3 Opus, Llama 3 70B
Medium	7B to <70B	Mistral 7B, Llama 3 8B
Small	<7B	Phi-3 Mini (3.8B), Gemma 2B, TinyLlama 1.1B

Part 2: How SLMs Are Actually Built

Knowledge Distillation

The most important technique behind high-performing SLMs is knowledge distillation, and it is worth understanding properly rather than just naming it.

Standard training optimizes against ground truth labels: a math problem has a correct answer and the model learns to produce it. But this only tells the model what the right answer is. It says nothing about the shape of the problem space — which wrong answers are close, which are far, what the structure of uncertainty looks like.

A large teacher model, when it answers a question, produces a full probability distribution over all possible next tokens. If the teacher gives "Paris" 80% probability and "Lyon" 15% for a question about French capitals, that 15% carries real information. These two answers are related in a way that "banana" and "Paris" are not. The distribution encodes structured knowledge about relationships between concepts.

Distillation trains the student to match the teacher's full distribution, not just the top answer. The student learns from the teacher's uncertainty, not just its correctness. This is one reason a well-distilled small model can outperform a larger model trained on hard labels alone.
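To make the Paris/Lyon/banana intuition concrete, here is a minimal sketch of a soft-target distillation loss in plain Python. The three-token vocabulary and all logit values are made up for illustration, and the temperature-squared scaling follows the common convention for this loss rather than anything specific to the Phi or Orca training recipes.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how poorly q approximates p. Zero when they match."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL between temperature-softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)

# Hypothetical logits over a toy vocabulary: ["Paris", "Lyon", "banana"]
teacher = [4.0, 2.3, -3.0]
student_good = [3.5, 2.0, -2.5]  # treats Lyon as the plausible runner-up
student_bad = [3.5, -2.5, 2.0]   # same top answer, but "banana" is the runner-up

loss_good = distillation_loss(teacher, student_good)
loss_bad = distillation_loss(teacher, student_bad)
print(f"good student: {loss_good:.4f}, bad student: {loss_bad:.4f}")
```

Both students pick "Paris", so a hard-label loss on the top answer alone would treat them almost identically. The soft-target loss penalizes the second student for getting the structure of the wrong answers wrong.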

The Orca and Alpaca Results

The most compelling demonstration of distillation's power came from Microsoft's Orca papers in 2023. Orca was a 13B model fine-tuned not just on GPT-4 answers but on GPT-4's full reasoning traces — step-by-step explanations of how it arrived at each answer. Orca outperformed models five times its size on several reasoning benchmarks.

Orca 2 pushed further and showed that smaller models could be explicitly taught when to use different reasoning strategies — step-by-step for complex problems, direct answers for simple ones. This was not emerging naturally from scale. It was being deliberately taught through the quality of the training signal.

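Stanford's Alpaca showed a related result: a 7B LLaMA model fine-tuned on 52,000 instruction examples generated with OpenAI's text-davinci-003 behaved qualitatively similarly to text-davinci-003 in the authors' preliminary human evaluation. 52,000 examples, about three hours of fine-tuning on eight A100 GPUs, and a total cost the authors put at under $600. The gap between open and closed models narrowed overnight.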

The bottleneck was never parameter count. It was training signal quality.

Quantization

Running a model locally requires fitting it in memory. A 7B model in 32-bit floating point takes roughly 28GB of RAM. This is where quantization comes in.

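Quantization reduces numerical precision. Instead of storing each parameter as a 32-bit float, you store it as an 8-bit or 4-bit integer. The memory savings are proportional to the bit width: relative to 32-bit, 8-bit cuts the footprint to a quarter and 4-bit to an eighth. Relative to the 16-bit formats most models ship in, 8-bit halves the footprint and 4-bit quarters it.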
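The arithmetic is simple enough to check directly. This sketch counts only the weights (no activations, KV cache, or framework overhead) and uses decimal gigabytes, so real usage will be somewhat higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

PARAMS_7B = 7_000_000_000

footprints = {bits: weight_memory_gb(PARAMS_7B, bits) for bits in (32, 16, 8, 4)}

for bits, gb in footprints.items():
    print(f"{bits:>2}-bit: {gb:5.1f} GB")
```

Each halving of the bit width halves the footprint, which is why the baseline you compare against matters when quoting savings.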

For most language tasks, 8-bit quantization produces outputs essentially indistinguishable from full precision. 4-bit is where degradation becomes detectable, particularly on tasks requiring precise numerical reasoning. Techniques like GPTQ and AWQ apply quantization non-uniformly, preserving precision in the weights that matter most. My Experiment 3 results below show exactly this tradeoff in practice.
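The precision gap between 8-bit and 4-bit can be seen with the simplest possible scheme: symmetric round-to-nearest quantization with a single scale per tensor. This is deliberately not GPTQ or AWQ, which are considerably more sophisticated. The weights below are random numbers standing in for a real layer.

```python
import random

def quantize_absmax(weights, bits):
    """Symmetric round-to-nearest quantization with one scale per tensor."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic stand-in weights

errors = {}
for bits in (8, 4):
    codes, scale = quantize_absmax(weights, bits)
    errors[bits] = mean_abs_error(weights, dequantize(codes, scale))
    print(f"{bits}-bit mean absolute round-trip error: {errors[bits]:.6f}")
```

With only 15 usable levels at 4-bit, a single outlier weight stretches the scale and costs precision everywhere else. Non-uniform methods are designed to avoid exactly that.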

Efficient Architectures

Beyond training and quantization, the architecture choices in SLMs reflect deliberate engineering for inference efficiency.

Grouped query attention shares key and value projections across multiple query heads. Not every attention head needs its own unique representation — sharing costs little in model quality but significantly reduces memory during generation.
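A rough sense of the memory effect comes from sizing the key/value cache. The model shape below is hypothetical, chosen only to give round numbers, and the cache is assumed to be stored in 16-bit values.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Keys and values cached for one sequence: two tensors per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model shape, for illustration only
LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096

mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN)  # one KV head per query head
gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, SEQ_LEN)            # 4 query heads share each KV head

print(f"Multi-head cache:    {mha / 2**30:.2f} GiB")
print(f"Grouped-query cache: {gqa / 2**30:.2f} GiB")
```

The cache shrinks in direct proportion to the number of key/value heads, while the number of query heads, and so most of the attention computation, stays the same.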

Sliding window attention, used in Mistral, limits each token's attention to a local window rather than the full context. This makes inference cost linear in sequence length rather than quadratic.
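The difference in growth rate shows up when you count how many token pairs each scheme has to score. The sequence length and window size here are arbitrary illustrative values, not Mistral's actual configuration.

```python
def causal_mask(seq_len, window=None):
    """mask[i][j] is True when token i may attend to token j.
    window=None gives full causal attention; otherwise each token
    sees itself and the previous window-1 tokens."""
    mask = []
    for i in range(seq_len):
        start = 0 if window is None else max(0, i - window + 1)
        mask.append([start <= j <= i for j in range(seq_len)])
    return mask

def attended_pairs(mask):
    """Total number of (query, key) pairs that get scored."""
    return sum(sum(row) for row in mask)

full = attended_pairs(causal_mask(1024))
windowed = attended_pairs(causal_mask(1024, window=64))

print(f"full causal attention: {full} pairs")
print(f"sliding window of 64:  {windowed} pairs")
```

Full causal attention grows with the square of the sequence length. With a fixed window, each additional token adds at most a window's worth of pairs, which is where the linear scaling comes from.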

Speculative decoding is one of the more elegant recent ideas. A small draft model generates several tokens quickly. A larger target model then evaluates all of them in a single parallel forward pass, accepting those it would have generated and rejecting the rest. Net result: significantly faster generation with no change in output quality. SLMs become accelerators for larger models rather than replacements.
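The accept/reject loop is easier to follow in code. This is a toy sketch of the greedy variant only: both "models" are made-up deterministic functions, the draft model's disagreement rule is arbitrary, and the sampling-based acceptance rule used with real models is not shown.

```python
def target_next(context):
    """Toy 'large' model: a deterministic next token from the context."""
    return (context[-1] * 3 + len(context)) % 10

def draft_next(context):
    """Toy 'small' model: agrees with the target except when the
    target would emit 7 (an arbitrary, made-up disagreement rule)."""
    token = target_next(context)
    return 0 if token == 7 else token

def greedy_decode(prompt, num_tokens):
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out[len(prompt):]

def speculative_decode(prompt, num_tokens, draft_len=4):
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < num_tokens:
        # 1. The draft model proposes draft_len tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(draft_len):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target checks every proposed position. A real model does this
        #    in one batched forward pass; here it is a loop counted as one pass.
        target_passes += 1
        for tok in proposal:
            expected = target_next(out)
            out.append(expected)  # the kept token is always the target's choice
            if tok != expected:
                break             # first mismatch: discard the rest of the draft
    return out[len(prompt):][:num_tokens], target_passes

tokens, passes = speculative_decode([1, 2, 3], 20)
print(f"generated {len(tokens)} tokens in {passes} target passes")
```

Because every kept token is the one the target would have chosen, the output is identical to decoding with the target alone. The speedup comes entirely from needing fewer target passes when the draft guesses well.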

Part 3: What the Benchmarks Actually Show

On general benchmarks, the gap between SLMs and large models is real but not as dramatic as headlines suggest:

Benchmark	GPT-4	Phi-3 Mini (3.8B)	Gemma 2B	TinyLlama 1.1B
MMLU (General)	~86%	~69%	~51%	~26%
GSM8K (Math)	~92%	~78%	~52%	~8%
HumanEval (Code)	~87%	~59%	~34%	~12%
ARC-Challenge	~96%	~85%	~71%	~45%

Before treating these numbers as deployment guidance, there is a problem worth understanding. Academic benchmarks are published on the internet and may appear in training data. A model that has seen the test set during training is being evaluated on recall, not reasoning. This affects all language model benchmarks. Treat these numbers as upper bounds, not precise measurements.

Where SLMs genuinely dominate is cost, privacy, and hardware requirements:

Metric	GPT-4 API	Phi-3 Mini (Local)
Cost per 1M tokens	~$10 to $30	$0
First token latency	500ms to 2s	less than 100ms
Throughput on GPU	Cloud only	200+ tok/s
Data privacy	Sent to external server	Fully on-device

And the hardware floor matters enormously. GPT-4 has no local deployment option. Llama 3 70B needs roughly 140GB for its weights in FP16 and still roughly 40GB of VRAM when quantized to 4-bit. Phi-3 Mini, quantized, runs in 4GB of RAM, which means a MacBook or a Raspberry Pi 5. TinyLlama at 1.1B fits in about 700MB when quantized to 4-bit, enough for some embedded devices.

My Own Experiments: Three Tasks, Two Models, Real Numbers

For this week's experiments I ran Phi-3 Mini locally on Kaggle using a T4 GPU and compared it against Llama 3.3 70B via the Groq API. I designed three prompts myself to cover different reasoning types rather than using a standard benchmark dataset. The choice was intentional: I wanted to observe how the models behave on naturally phrased tasks, not just benchmark-formatted questions. For rigorous evaluation at scale, datasets like GSM8K and HumanEval would be the right choice, and that is something I plan to revisit in a later week.

Experiment setup:


Component	Details
Model A	Microsoft Phi-3 Mini 4k Instruct (3.8B, FP16)
Model B	Llama 3.3 70B via Groq API
Hardware	Kaggle T4 GPU
Framework	HuggingFace Transformers 4.44.0
Code	GitHub → weekly-AI-ML-research/week02-slms

The three prompts I used:

ID	Type	Prompt summary
T-001	Math reasoning	Multi-step word problem involving apples, oranges, and change calculation
T-002	Code understanding	Identify what a Python function does and spot a hidden bug
T-003	Language reasoning	Identify the core argument in a technical paragraph

Experiment 1: Phi-3 Mini Inference

ID	Task	Latency	Speed	Correct?
T-001	Math reasoning	32.4s	5.7 tok/s	Yes, $7.60
T-002	Code understanding	13.2s	19.3 tok/s	Yes, ZeroDivisionError found
T-003	Language reasoning	1.9s	19.2 tok/s	Yes, clean one-sentence summary

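T-001 was the slowest, but not because it produced the most text. Multiplying latency by speed gives roughly 185 tokens for T-001 and roughly 255 for T-002, so the step-by-step working was actually shorter than the code explanation. What differs is throughput: 5.7 tok/s against about 19 tok/s for the other two tasks. A plausible cause is one-time warm-up on the first generation call, though these numbers alone cannot confirm that. T-003 required only one sentence so it finished in under two seconds. All three answers were correct.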
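A quick sanity check on the table is to recover the approximate output length of each run from its latency and speed. This assumes the reported speed is an average over the whole latency window, which may not be exactly how it was measured.

```python
# (latency in seconds, throughput in tokens/second) from the Experiment 1 table
runs = {
    "T-001": (32.4, 5.7),
    "T-002": (13.2, 19.3),
    "T-003": (1.9, 19.2),
}

approx_tokens = {task: round(latency * speed) for task, (latency, speed) in runs.items()}

for task, tokens in approx_tokens.items():
    print(f"{task}: ~{tokens} tokens generated")
```

Under that assumption T-002 generated more tokens than T-001 despite finishing in less than half the time, so the latency difference between them is mostly a throughput difference rather than an output-length difference.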

Experiment 2: Llama 3.3 70B vs Phi-3 Mini

ID	Type	Llama 3.3 70B	Phi-3 Mini	Quality gap
T-001	Math reasoning	0.5s (Cloud)	13.3s (Local)	Minimal
T-002	Code understanding	1.0s (Cloud)	13.7s (Local)	Noticeable
T-003	Language reasoning	0.3s (Cloud)	2.0s (Local)	Minimal

Llama 3.3 70B was faster on every task, which is expected since it runs on Groq's optimized cloud infrastructure. But the quality gap was smaller than I expected. Both models got the correct answers on T-001 and T-003. On T-002, Llama gave a richer explanation of why the ZeroDivisionError occurs. Phi-3 identified the bug correctly but explained it more shallowly. For use cases where correctness is what matters rather than explanation depth, Phi-3 holds up well. For use cases where explanation quality matters, the gap is real.

The more important comparison is not latency but deployment context. Llama 3.3 70B through Groq costs money and sends your data to an external server. Phi-3 Mini costs nothing per query once you have the hardware, and your data never leaves your machine.

Experiment 3: Quantization in Practice

Config	VRAM	Latency	Speed
FP16 (baseline)	3.89 GB	16.9s	11.8 tok/s
8-bit	1.86 GB	23.1s	8.6 tok/s
4-bit NF4	1.08 GB	13.9s	13.1 tok/s

The 8-bit result is the practical takeaway for memory: VRAM drops by more than half (3.89 GB to 1.86 GB) with no meaningful quality loss on these tasks, although it was also the slowest configuration here (8.6 vs 11.8 tok/s for FP16). 4-bit was the most surprising result. It used the least memory (1.08 GB) and was actually faster than FP16 on this task (13.1 vs 11.8 tok/s). The response quality on a conceptual explanation task was comparable across all three configurations. The degradation from quantization shows up more clearly on tasks requiring precise multi-step numerical reasoning, which is exactly where SLMs are already weakest.
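The relative savings are easier to compare as percentages of the FP16 baseline. The figures below are copied from the table above; nothing here is re-measured.

```python
# (VRAM in GB, throughput in tokens/second) from the Experiment 3 table
configs = {
    "FP16": (3.89, 11.8),
    "8-bit": (1.86, 8.6),
    "4-bit NF4": (1.08, 13.1),
}

base_vram, base_speed = configs["FP16"]
summary = {
    name: {
        "vram_saved_pct": round(100 * (1 - vram / base_vram), 1),
        "speed_vs_fp16": round(speed / base_speed, 2),
    }
    for name, (vram, speed) in configs.items()
}

for name, stats in summary.items():
    print(f"{name:>10}: {stats['vram_saved_pct']:5.1f}% VRAM saved, "
          f"{stats['speed_vs_fp16']:.2f}x FP16 speed")
```

Seen this way, memory and speed do not move together in this run: 8-bit saves memory but runs slower than FP16, while 4-bit saves more memory and runs slightly faster.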

Part 4: The Emergent Capabilities Problem

One of the more surprising findings in scaling research was the concept of emergent capabilities: abilities that appear suddenly in large models and are essentially absent in smaller ones. Few-shot learning, multi-step arithmetic, and chain-of-thought reasoning were all identified as capabilities that emerge with scale.

SLMs challenge this picture but do not fully overturn it. What the Phi and Orca results show is that some apparently emergent capabilities can be induced in smaller models through better training. The capability was not truly emergent — it was underspecified by the training data. Give the model a better signal and the capability appears at smaller scale.

But some capabilities appear to be genuinely scale-dependent. Complex multi-step mathematical reasoning, reliable code generation for non-trivial programs, coherent reasoning across very long contexts — these degrade noticeably as model size decreases, even with high-quality training and distillation.

The uncomfortable implication is that we do not have a reliable theory for which capabilities are genuinely scale-dependent and which are just undertrained in smaller models. The only way to find out is to try empirically for your specific task.

Part 5: The On-Device AI Movement

The most significant infrastructure shift happening around SLMs is not in data centers. It is on consumer devices.

Apple's Neural Engine, present in every iPhone since the A11 chip, is now powerful enough to run models in the 1 to 3B parameter range at reasonable speeds. Apple Intelligence uses a roughly 3B on-device model for most tasks, calling a larger cloud model only when necessary. The privacy argument is central: for requests handled on-device, your data never leaves it.

Qualcomm's Snapdragon X Elite, targeting Windows laptops, includes dedicated NPU hardware rated for 45 TOPS. Microsoft's Copilot+ PC initiative is built around this, with on-device models handling real-time summarization and other features locally.

Google's Gemini Nano runs on Pixel phones and recent Android devices, enabling on-device summarization and voice transcription without cloud calls.

This hardware push reflects a bet that the next generation of AI features will be defined not by which cloud model is most capable but by which on-device model is fast, private, and reliable enough to be always available. SLMs are the only class of model that can compete in this environment.

Part 6: Fine-Tuning — Power and Trap

A base SLM is a generalist with limited specialized knowledge. Fine-tuning on domain-specific data produces dramatic improvements on narrow tasks. Parameter-efficient methods like LoRA make this practical: instead of updating all model weights, LoRA introduces small trainable matrices that approximate the updates. A LoRA fine-tune of a 7B model can be done in a few hours on a single consumer GPU with a few thousand examples.
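The "small trainable matrices" are two low-rank factors whose product is added to a frozen weight. Below is a minimal pure-Python sketch of that idea, with tiny made-up dimensions so it runs instantly. It is not the PEFT library's implementation, and the 4096-wide layer in the parameter count is a hypothetical size.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: a full update vs. the two low-rank factors."""
    return d_out * d_in, rank * (d_out + d_in)

random.seed(0)
D_OUT, D_IN, RANK = 6, 8, 2  # tiny, made-up sizes

W = [[random.gauss(0, 1) for _ in range(D_IN)] for _ in range(D_OUT)]     # frozen base weight
A = [[random.gauss(0, 0.01) for _ in range(D_IN)] for _ in range(RANK)]  # trainable, random init
B = [[0.0] * RANK for _ in range(D_OUT)]                                  # trainable, zero init

def effective_weight(W, A, B, scale=1.0):
    """W + scale * (B @ A); the base weight itself is never modified."""
    delta = [[scale * v for v in row] for row in matmul(B, A)]
    return add(W, delta)

full, low_rank = lora_param_counts(4096, 4096, 8)
print(f"full update: {full:,} params, rank-8 LoRA: {low_rank:,} params")
```

Because B starts at zero, the adapted layer behaves exactly like the base layer before training begins. For the hypothetical 4096 by 4096 layer, the rank-8 factors hold 256 times fewer trainable parameters than a full update.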

The trap is catastrophic forgetting. When you fine-tune on domain-specific data, the model improves on that domain at the cost of general capability. It overwrites some prior knowledge with new patterns. A model fine-tuned aggressively on legal documents may produce excellent legal summaries and poor responses to everything else.

LoRA mitigates this significantly because you are not modifying base weights directly. But it does not eliminate the problem entirely. Fine-tuning requires evaluating not just the target task but also the general capabilities you want to preserve.
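In practice that evaluation can be as simple as scoring the model on a fixed set of tasks before and after fine-tuning and flagging anything that dropped. The task names, scores, and tolerance below are invented purely to show the mechanics.

```python
def forgetting_report(before, after, max_drop=0.02):
    """Compare per-task scores before and after fine-tuning.
    Returns the tasks whose score fell by more than max_drop."""
    regressions = {}
    for task, old_score in before.items():
        drop = old_score - after[task]
        if drop > max_drop:
            regressions[task] = round(drop, 3)
    return regressions

# Hypothetical accuracy numbers, not measurements
before = {"legal_summaries": 0.61, "general_qa": 0.70, "math": 0.55, "code": 0.48}
after = {"legal_summaries": 0.84, "general_qa": 0.69, "math": 0.41, "code": 0.47}

regressions = forgetting_report(before, after)
print(regressions)
```

The target task improving tells you the fine-tune worked. The regression list tells you what it cost.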

Part 7: Where SLMs Genuinely Cannot Compete

Being honest about hard limits is more useful than optimism.

For problems requiring many intermediate results held simultaneously — advanced mathematics, multi-constraint planning — large models are meaningfully better and fine-tuning does not close the gap. Maintaining complex internal state during long reasoning chains appears to benefit from scale in ways data quality alone does not address.

When a task requires combining skills in configurations the model has not seen before — not applying a familiar pattern but genuinely constructing a new approach — smaller models are more brittle than benchmark numbers suggest.

Maintaining a coherent thread across 100,000+ tokens is qualitatively harder for smaller models even when they technically support the context window. The model loses track of earlier constraints in ways that compound over long sequences.

Large models follow a wider range of novel instructions reliably. Smaller models are more sensitive to exact prompt phrasing — small wording changes produce larger output quality changes than you would expect.

Part 8: Open Questions

Is the textbook hypothesis general? Phi-3's data approach worked for reasoning tasks. Does it transfer to less structured domains such as creative writing, open-ended dialogue, or cultural reasoning? The hypothesis has not been tested rigorously outside its original domain.

Where is the true capability floor? We know some capabilities emerge with scale. We do not know the minimum scale at which each reliably appears as a deployment characteristic rather than a benchmark number.

Can quantization go further? 2-bit and 1-bit quantization have been explored experimentally. The results are not yet good enough for general deployment. Whether this is a fundamental limit or an engineering problem is not resolved.

What happens when SLMs are wrong? Error analysis for SLMs in production is underdeveloped. Large model failures tend to be graceful — wrong but coherent. SLM failures can be less graceful. A systematic understanding of failure modes across task types would be practically valuable and is mostly absent from the literature.

How does on-device AI change development practices? If inference moves to the edge, evaluation, updating, and monitoring all change significantly. The MLOps infrastructure built around centralized cloud inference does not translate directly to a world where models run on millions of individual devices.

Papers Worth Reading

Paper	What It Contributes	Venue
Gunasekar et al. (2023)	Phi-1: the textbook hypothesis	arXiv 2306
Abdin et al. (2024)	Phi-3 technical report	arXiv 2404
Mukherjee et al. (2023)	Orca: learning from GPT-4 explanations	arXiv 2306
Mitra et al. (2023)	Orca 2: teaching reasoning strategies	arXiv 2311
Taori et al. (2023)	Alpaca: instruction following at 7B	Stanford HAI
Zhang et al. (2024)	TinyLlama: pretraining at 1.1B	arXiv 2401
Frantar et al. (2022)	GPTQ: post-training quantization	arXiv 2210
Lin et al. (2023)	AWQ: activation-aware weight quantization	arXiv 2306
Leviathan et al. (2023)	Speculative decoding	ICML 2023
Hu et al. (2021)	LoRA: low-rank adaptation	ICLR 2022

Research Groups Doing Relevant Work

Microsoft Research's Phi team has produced the most sustained empirical investigation of the data quality hypothesis. Meta's LLaMA team made open weights standard practice and enabled the fine-tuning ecosystem most SLM work depends on. Hugging Face's evaluation team, particularly the Open LLM Leaderboard and their contamination research, is essential for understanding what benchmark numbers actually mean. On hardware, Qualcomm Research and Apple's ML team are defining what on-device inference looks like in practice. MIT's Han Lab has done foundational work on quantization and efficient inference.

Benchmarks You Should Know

MMLU covers 57 subjects across multiple domains and is useful for measuring breadth but is known to have contamination issues. ARC-Challenge focuses on scientific reasoning that requires inference rather than recall. GSM8K has 8,500 grade school math problems requiring multi-step reasoning and is one of the most widely used math reasoning benchmarks. HumanEval tests code generation with 164 hand-written programming problems. BIG-Bench Hard collects 23 tasks on which earlier models failed to beat average human performance. HELM provides a more structured evaluation framework that reports many metrics across standardized scenarios. For any production decision, treat published benchmark numbers as approximate and build an evaluation set from your actual task distribution.
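Building your own evaluation set does not require much machinery. Here is a minimal exact-match harness. The ticket-classification examples are hypothetical, and `stub_model` is a placeholder keyword rule standing in for a real model call so the sketch runs on its own.

```python
def exact_match_accuracy(model_fn, examples):
    """Fraction of examples where the normalized output equals the reference."""
    def normalize(text):
        return " ".join(text.lower().split())
    hits = sum(
        normalize(model_fn(ex["prompt"])) == normalize(ex["expected"])
        for ex in examples
    )
    return hits / len(examples)

# Hypothetical examples standing in for prompts drawn from your own workload
eval_set = [
    {"prompt": "Classify: 'refund not received'", "expected": "billing"},
    {"prompt": "Classify: 'app crashes on login'", "expected": "technical"},
    {"prompt": "Classify: 'change my email address'", "expected": "account"},
    {"prompt": "Classify: 'charged twice this month'", "expected": "billing"},
]

def stub_model(prompt):
    """Placeholder for a real model call; a keyword rule keeps this runnable."""
    text = prompt.lower()
    if "refund" in text or "charged" in text:
        return "Billing"
    if "crash" in text:
        return "technical"
    return "other"

accuracy = exact_match_accuracy(stub_model, eval_set)
print(f"exact-match accuracy: {accuracy:.2f}")
```

Swapping `stub_model` for a function that calls each candidate model gives a like-for-like comparison on the task you actually care about. Exact match is only appropriate for short, closed-form outputs like labels.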

Conclusion

The SLM story is not about building smaller versions of large models. It is about a set of discoveries that have changed what we understand about the relationship between model size and capability.

Data quality can substitute for scale to a remarkable degree. Distillation can transfer knowledge across size boundaries in ways flat training data cannot. Efficient architectures reduce the hardware floor without meaningful capability loss. And the on-device movement is creating a deployment environment where the question is not "what is the best model?" but "what is the best model that fits this constraint set?"

Running these experiments myself made the tradeoffs concrete in a way that reading papers alone does not. Phi-3 Mini got every answer right. It was slower than the cloud alternative, but it ran locally, cost nothing per query, and required no data to leave the machine. For many real applications, that is not a compromise. It is exactly what you need.

The scaling laws are not wrong. But they were describing one path to capability. The research of the last two years has found that there are others, and some of them are more practical for the problems most people actually need to solve.

Next week: Retrieval-Augmented Generation. How do you give a language model access to knowledge it was never trained on? What actually happens when retrieval goes wrong? And does RAG actually solve the hallucination problem, or just change where the failures occur?

References

Gunasekar, S., et al. (2023). Textbooks Are All You Need. arXiv:2306.11644.
Abdin, M., et al. (2024). Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv:2404.14219.
Mukherjee, S., et al. (2023). Orca: Progressive Learning from Complex Explanation Traces of GPT-4. arXiv:2306.02707.
Mitra, A., et al. (2023). Orca 2: Teaching Small Language Models How to Reason. arXiv:2311.11045.
Taori, R., et al. (2023). Stanford Alpaca: An Instruction-following LLaMA Model. Stanford HAI.
Zhang, P., et al. (2024). TinyLlama: An Open-Source Small Language Model. arXiv:2401.02385.
Frantar, E., et al. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323.
Lin, J., et al. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978.
Leviathan, Y., et al. (2023). Fast Inference from Transformers via Speculative Decoding. ICML 2023.
Hu, E., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.

Experiment code: GitHub → weekly-AI-ML-research/week02-slms

This is part of a weekly series on AI/ML research. Each post covers theory, recent papers, and experiments I run myself.

Connect on LinkedIn: Soohan Abbasi
