Sergei Parfenov

Posted on May 29

How Model Distillation Actually Works (and What the 'China Distilled Our Model' Headlines Really Mean)

#ai #machinelearning #llm #deeplearning

Every few weeks a headline drops: "Chinese lab distilled a frontier model from OpenAI / Anthropic." Cue the comments — half the thread thinks distillation is a synonym for theft, the other half thinks it's some exotic Chinese trick.

Both are wrong. Distillation is one of the most boring, well-established techniques in deep learning, and the labs raising the alarms use it on their own models constantly. The actual controversy is narrower and more interesting than the headlines. Let's separate the engineering from the geopolitics.

What distillation actually is

Knowledge distillation trains a small student model to imitate a large teacher model. The classic framing comes from Hinton et al. (2015): instead of training the student only on ground-truth labels, you also train it to match the teacher's output distribution.

Why does that help? Because the teacher's full probability distribution carries far more information than the single correct answer. If a teacher classifies an image of a dog, it might output dog: 0.9, wolf: 0.08, cat: 0.001. That "dog and wolf are similar, cat is not" signal — Hinton called it dark knowledge — is exactly what a small model struggles to learn from hard labels alone.

There are two kinds of training signal:

Hard labels — the final answer (the token the teacher actually produced, or the ground-truth label).
Soft labels — the teacher's full probability distribution over outputs, usually its logits passed through a softmax.

The trick is temperature. You divide the logits by a temperature T > 1 before the softmax, which flattens the distribution and exposes those small-but-meaningful probabilities the student should learn from.
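
To make the flattening effect concrete, here is a minimal sketch in plain Python. The three logits are made-up values chosen only so that the T=1 output roughly resembles the dog/wolf/cat example above; they do not come from any real model.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax over a list of logits after dividing each by temperature T."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three classes: dog, wolf, cat
logits = [6.0, 3.6, -0.8]

sharp = softmax_with_temperature(logits, T=1.0)  # what the teacher "says"
soft = softmax_with_temperature(logits, T=4.0)   # what the student trains on

for name, p1, p4 in zip(["dog", "wolf", "cat"], sharp, soft):
    print(f"{name:>4}: T=1 -> {p1:.3f}   T=4 -> {p4:.3f}")
```

At T=4 the ranking is unchanged, but the wolf and cat probabilities become large enough to contribute meaningfully to the loss.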

The loss is a blend of two terms: a standard cross-entropy against the real labels, and a KL-divergence pulling the student's softened distribution toward the teacher's.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # 1. Standard loss: student vs ground truth (hard labels)
    hard_loss = F.cross_entropy(student_logits, labels)

    # 2. Distillation loss: student vs teacher's softened distribution (soft labels)
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    # T**2 keeps gradient magnitudes balanced when T > 1
    soft_loss = F.kl_div(student_log_probs, soft_targets, reduction="batchmean") * (T ** 2)

    return alpha * hard_loss + (1 - alpha) * soft_loss
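
Written as a formula, the function above computes the following. The symbols are notation chosen here for illustration: $z_s$ and $z_t$ are the student and teacher logits, $y$ is the ground-truth label, and $\sigma$ is the softmax.

```latex
\mathcal{L} \;=\; \alpha \,\mathrm{CE}\big(y,\ \sigma(z_s)\big)
  \;+\; (1 - \alpha)\, T^{2}\, \mathrm{KL}\big(\sigma(z_t / T)\ \big\|\ \sigma(z_s / T)\big)
```

The gradient of the soft term with respect to the student logits shrinks roughly as $1/T^{2}$, which is why the code multiplies it by `T ** 2`: the balance between the two terms then stays about the same when you change the temperature.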

For LLMs the same idea applies per token: the teacher's next-token distribution is the soft target. In practice teams mix hard and soft labels — recent work argues the gain from mixing comes less from "matching the teacher better" and more from reducing exposure bias (the train/inference distribution mismatch). The point: this is normal, published, peer-reviewed engineering.

And labs distill their own models all the time. The cheap, fast variant of a flagship model that you actually get to call in production? Very often a distilled student. Anthropic itself, in the middle of its own complaint about Chinese firms, acknowledged that AI companies routinely distill their own models to make smaller, cheaper versions.

Why distilling from a closed API is a different beast

Here's the part the headlines skip. Everything above assumes you have the teacher's logits — the raw output distribution. That's white-box distillation, and it requires access to the model's internals or at least its full probability outputs.

You do not get logits from a closed commercial API like Claude or GPT. You get text. That forces black-box (a.k.a. sequence-level) distillation:

Prompt the teacher with lots of inputs.
Collect its generated text outputs.
Build a synthetic dataset of (prompt → teacher answer) pairs.
Fine-tune your student on that dataset with supervised fine-tuning, often followed by RL.
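
The first three steps can be sketched as below. This is a rough illustration, not any lab's actual pipeline: `build_distillation_dataset`, `to_jsonl`, and the `prompt`/`completion` field names are hypothetical, and `fake_teacher` stands in for a real API call so the example runs offline. Whether you may train on a given provider's outputs depends on its terms of service.

```python
import json
from typing import Callable, Iterable

def build_distillation_dataset(
    prompts: Iterable[str],
    teacher: Callable[[str], str],
    min_chars: int = 1,
) -> list[dict]:
    """Collect (prompt, teacher answer) pairs, dropping duplicates and empty answers."""
    seen = set()
    records = []
    for prompt in prompts:
        key = prompt.strip()
        if not key or key in seen:
            continue  # skip blank and repeated prompts
        seen.add(key)
        answer = teacher(key).strip()
        if len(answer) < min_chars:
            continue  # skip empty or too-short completions
        records.append({"prompt": key, "completion": answer})
    return records

def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

# Stand-in teacher so the sketch runs offline; a real one would call an API.
def fake_teacher(prompt: str) -> str:
    return f"Answer to: {prompt}"

dataset = build_distillation_dataset(
    ["What is KL divergence?", "What is KL divergence?", "  ", "Define softmax."],
    fake_teacher,
)
print(to_jsonl(dataset))
```

The resulting JSONL is the kind of file a supervised fine-tuning job would consume. In practice, most of the effort goes into choosing prompts and filtering answers, not into this loop.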

You lose the dark knowledge in the soft labels, but it turns out you can get remarkably far just by training on a large, high-quality synthetic dataset generated by a strong teacher. This is exactly why "did model X learn from model Y's outputs?" is such a live and hard-to-prove question — the evidence isn't a stolen weights file, it's statistical fingerprints in behavior (a model that randomly claims to be ChatGPT, mirrors another model's quirks, etc.).
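
As a toy illustration of one such fingerprint, the sketch below counts self-identification slips in a batch of model outputs. The regex patterns and the function name are assumptions made up for this example. A real audit would need a much broader, validated set of signals, and even then the result is circumstantial.

```python
import re

# Hypothetical patterns, for illustration only.
SELF_ID_PATTERNS = {
    "openai": re.compile(
        r"\b(?:i am|i'm) chatgpt\b|\b(?:trained|developed|created) by openai\b",
        re.IGNORECASE,
    ),
    "anthropic": re.compile(
        r"\b(?:i am|i'm) claude\b|\b(?:trained|developed|created) by anthropic\b",
        re.IGNORECASE,
    ),
}

def self_id_slip_rate(outputs):
    """Fraction of outputs in which the model claims another vendor's identity."""
    counts = {name: 0 for name in SELF_ID_PATTERNS}
    for text in outputs:
        for name, pattern in SELF_ID_PATTERNS.items():
            if pattern.search(text):
                counts[name] += 1
    n = len(outputs)
    return {name: (c / n if n else 0.0) for name, c in counts.items()}

sample_outputs = [
    "I am ChatGPT, a large language model.",
    "Here is your answer.",
    "I'm Claude, made by Anthropic.",
    "This model was trained by OpenAI.",
]
print(self_id_slip_rate(sample_outputs))
```

A nonzero rate is suggestive at most: such phrases also appear in ordinary web text, so a model can pick them up without any distillation.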

	White-box	Black-box (closed API)
Needs	Logits / weights	Just text outputs
Signal richness	High (full distribution)	Lower (final answers)
Feasible against a closed model?	No	Yes
What the China allegations are about	—	This one

So what are the actual allegations?

Strip the drama and here's the documented timeline:

Jan 2025 — After DeepSeek's R1 launch, OpenAI and Microsoft open an investigation into whether DeepSeek used ChatGPT outputs to train it. Users noticed R1 behaving suspiciously ChatGPT-like.
Feb 2026 — OpenAI sends a memo to the U.S. House Select Committee on China alleging DeepSeek used obfuscated third-party routers to access OpenAI models and programmatically extract outputs for distillation, in violation of its terms of service.
Feb 24, 2026 — Anthropic publicly accuses three Chinese firms — DeepSeek, Moonshot AI, and MiniMax — of coordinated "distillation attack" campaigns: flooding Claude with crafted prompts, allegedly via commercial proxy services running tens of thousands of accounts to sidestep Anthropic's China access restrictions.

Two things matter here, and most coverage gets them backwards:

These are allegations. The labs have not, as of writing, published the full underlying evidence, and the accused firms dispute or haven't confirmed them. Behavioral similarity is suggestive, not proof.
The dispute is not "distillation = bad." As one ethics researcher put it after Anthropic's statement, if Anthropic itself calls distillation legitimate and widespread, the controversy can't be the technique. It's two narrower things: unauthorized access (using proxies to evade geographic and account restrictions) and terms-of-service violations (most frontier APIs explicitly forbid using outputs to train a competing model). It's closer to a contract-and-access fight than an IP-theft slam dunk — and the legal status of "training on another model's outputs" is genuinely unsettled.

"How long does it take / how much does it cost?"

This is the question everyone asks, and the honest answer is: dramatically less than training from scratch — which is the entire economic motive — but precise figures for any specific alleged case are not public. Anyone quoting you an exact "they did it in N days for $M" is guessing.

What we can say structurally:

Pretraining a frontier model from scratch means a massive run on tens of thousands of high-end accelerators, plus the data pipeline and research iteration behind it.
Distillation collapses that timeline. The expensive part — discovering the capability — was already paid for by the teacher. The student's cost is roughly: generating a synthetic dataset (API calls + time) plus a comparatively cheap fine-tuning run. That's the asymmetry the U.S. labs are upset about: they spend billions to push the frontier, and a "free-rider" can chase it for a fraction.
This is also why DeepSeek's headline numbers were so contested. Its self-reported low training cost and modest hardware footprint were precisely what made rivals suspect a shortcut: it's much easier to hit those numbers if you bootstrapped from an already-trained Western teacher rather than doing all the discovery yourself.

So: distillation makes a strong-ish student fast and cheap. It does not let you leapfrog past the teacher — a student is generally capped by the teacher it learned from. You don't distill your way to the frontier; you distill your way to a cheap copy of someone else's.

Takeaways

Distillation is standard, published deep-learning practice. The labs complaining about it use it themselves.
White-box distillation needs logits; closed APIs only expose text, so distilling from Claude/GPT means black-box training on generated outputs.
The OpenAI and Anthropic allegations against DeepSeek, Moonshot, and MiniMax are about unauthorized access and ToS violations, not about distillation being inherently illegitimate — and they remain allegations.
The economic point is real: distillation is far cheaper than frontier pretraining, which is why it's a business and policy flashpoint. But a student is bounded by its teacher.

If you want the deep technical version of any of these — the math of temperature scaling, why mixing hard and soft labels beats either alone, or how behavioral fingerprinting tries to detect distillation — let me know in the comments.

Sources & further reading

OpenAI memo to the U.S. House Select Committee on China (Feb 2026) — reporting via Reuters and Rest of World.
"Anthropic joins OpenAI in flagging distillation campaigns by Chinese AI firms," CNBC, Feb 24, 2026.
Hinton, Vinyals, Dean, "Distilling the Knowledge in a Neural Network" (2015).
"Understanding LLM Distillation Techniques," MarkTechPost, 2026.
"The Bridge-Garden Dilemma in LLM Distillation," arXiv:2605.26246.
Winston & Strawn, "Is AI Distillation by DeepSeek IP Theft?" (analysis of the legal gray zone).

Top comments (15)

xulingfeng • May 31

The bit about "China distilled our model" headlines is spot on: most people don't realize distillation is just a training technique, not theft. We use distilled models (DeepSeek V4 Flash) as our daily driver and the cost difference vs the full-fat version is roughly 20x.

One thing I would add: distillation doesn't just shrink the model, it also forces you to confront which capabilities you actually need. We found that our test automation workflows only need about 60% of the teacher model's capability space. Have you seen a systematic way to figure out the minimum viable capability set before starting the distillation?

Sergei Parfenov • Jun 2

Great point about the 60% — that reframing (distillation as a forcing function for "what do we actually need") is honestly the part most write-ups miss, so thanks for adding it.
On your question: I haven't seen a clean canonical method for nailing the minimum viable capability set up front, and I'm a little skeptical one exists, because "capability space" isn't something you can measure directly before you have a student to test. What does work in practice is flipping it from a design problem into an eval problem:

Build the eval set before the dataset. Pull real production traces (your test-automation workflows in this case) and turn them into a graded eval suite — ideally bucketed by capability (reasoning, tool-calling, format adherence, edge cases). This becomes your definition of "60%" instead of a guess.
Bootstrap a student fast and measure the gap per bucket. A cheap first SFT pass on teacher outputs tells you where the student already clears the bar vs where it collapses. The buckets it passes are capabilities you didn't need to over-invest in; the failures are your real target set.
Close the gap with weakness-driven data, not more data. This is where active-learning-style distillation helps — analyze the student's failures, then have the teacher synthesize examples specifically targeting those, rather than generating a giant undifferentiated corpus. There's a line of work (EvoKD and similar) formalizing exactly this loop: evaluate student → identify weaknesses → teacher generates targeted samples → repeat.
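
A minimal sketch of the per-bucket measurement in the second step. The bucket names, the pass/fail results, and the 0.9 bar are all made up for illustration.

```python
from collections import defaultdict

def pass_rate_by_bucket(results):
    """results: iterable of (bucket, passed) pairs -> {bucket: pass rate}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for bucket, passed in results:
        totals[bucket] += 1
        passes[bucket] += int(bool(passed))
    return {b: passes[b] / totals[b] for b in totals}

def weak_buckets(rates, bar=0.9):
    """Buckets below the bar, worst first: candidates for targeted teacher data."""
    return sorted((b for b, r in rates.items() if r < bar), key=lambda b: rates[b])

# Hypothetical graded results from a first cheap SFT student
results = [
    ("reasoning", True), ("reasoning", True), ("reasoning", True), ("reasoning", True),
    ("tool_calling", True), ("tool_calling", False),
    ("format_adherence", False), ("format_adherence", False),
    ("format_adherence", True), ("format_adherence", False),
]

rates = pass_rate_by_bucket(results)
print(rates)
print(weak_buckets(rates))
```

The aggregate pass rate here is 60%, which says little. The per-bucket view shows the student is already fine on reasoning and that the real gap is in format adherence.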

So the MVC set isn't something you derive in advance — it kind of emerges from the eval buckets your tasks actually exercise. The discipline is front-loading a good, capability-bucketed eval; everything downstream falls out of it.
Curious how you arrived at your 60% number — was that from an eval suite, or more from observing which teacher behaviors never fired in production? The "never fired" signal is underrated for this.

Harjot Singh • May 31

The "separate the engineering from the geopolitics" framing is the public service here, because the headline panic obscures how mundane and useful distillation is. The part worth amplifying for builders: distillation isn't just a frontier-lab arms-race thing, it's one of the highest-leverage cost moves available to a regular product team. Once you have a stable task running on a big expensive model, that model's outputs are a labeled dataset, and distilling a small student for that specific task turns a recurring frontier bill into near-zero inference. You don't need to distill a whole frontier model; you distill the one capability you actually use. The narrower controversy you point at (terms-of-service on training against another model's outputs) is the real story, and it's a legal/contract question, not a "this is magic theft" one. Practically: distill your own traffic, not someone else's model, and the whole controversy evaporates. This expensive-teacher-to-cheap-specialist economics is exactly how I think about cost in Moonshift. In your experience, where does the student start failing: is it the long tail of rare cases that the teacher handled and the small model never saw enough of?

Sergei Parfenov • Jun 2

Exactly — "distill your own traffic, not someone else's model" is the whole thing in one line.
On where the student starts failing: in my experience it's rarely a smooth degradation, it's three fairly distinct failure modes, and the long tail is only one of them.

The long tail you named — rare intents the student saw too few times. This one's the least scary because it's measurable: it shows up as a frequency cliff in your eval buckets, and you can buy it back by having the teacher over-generate synthetic examples for the sparse intents. It's a data-coverage problem, not a capability problem.
Compositional / multi-step reasoning — this is the one that bites hardest. The student often handles each step fine in isolation but falls apart when a task chains 4-5 of them, because it learned the surface form of the teacher's answers without the latent reasoning that produced them. Black-box distillation makes this worse: you're training on the teacher's output text, not the reasoning trace, so the student mimics the destination without the path. Distilling CoT traces instead of just final answers helps a lot here.
Calibration on the boundary — the student gets overconfident exactly where the teacher would have hedged or said "I'm not sure." The teacher's uncertainty lived in its soft distribution, which you never saw through the API. So the student fails silently — wrong but confident — which in production is more dangerous than the long-tail misses you can at least detect.

Rough rule I've landed on: the long tail you fix with data, compositional failures you fix with better targets (traces, not answers), and calibration you mostly can't fix from a black-box teacher — you manage it with a confidence threshold and a fallback to the teacher for low-confidence cases. That hybrid (cheap student for the 90%, expensive teacher as fallback) tends to beat trying to distill the last 10% into the student.
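
A minimal sketch of that hybrid router. How the student's confidence is computed is left open here (mean token probability is one common choice, but that is an assumption), and `toy_student`, `toy_teacher`, and the 0.8 threshold are placeholders.

```python
def route(prompt, student, teacher, threshold=0.8):
    """Return (answer, source); the student answers only above the confidence threshold."""
    answer, confidence = student(prompt)
    if confidence >= threshold:
        return answer, "student"
    return teacher(prompt), "teacher"  # expensive path, used for the uncertain minority

# Placeholder models: pretend short prompts are the easy, common case.
def toy_student(prompt):
    if len(prompt) < 20:
        return "student answer", 0.95
    return "student guess", 0.40

def toy_teacher(prompt):
    return "teacher answer"

print(route("short prompt", toy_student, toy_teacher))
print(route("a much longer and rarer request", toy_student, toy_teacher))
```

Because the student tends to be overconfident near the boundary, the threshold is better set from measured accuracy per confidence band than picked by feel.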
Where does it break for you in Moonshift — long tail, or more the compositional stuff?

VoltageGPU • Jun 1

Interesting take on distillation — it's reassuring to see a clear breakdown of the practical limits when you can't access the teacher model's internal states. From an infrastructure perspective, when working with VoltageGPU, we've seen how distillation can help reduce inference costs without sacrificing too much accuracy, but it's definitely a trade-off. The China controversy highlights how hard it is to prove or disprove model provenance when training data and architecture are opaque.

Sergei Parfenov • Jun 2

Thanks! Provenance is the genuinely hard part — when weights and training data are closed, behavioral fingerprinting (self-identification slips, shared quirks) is about all you've got, and it's circumstantial at best. That's exactly why the China cases stay in "allegation" territory rather than getting settled.

xulingfeng • Jun 2

This is incredibly helpful — the "flip it to an eval problem" framing clicked immediately. We've been doing something similar informally (grabbing production traces, running them against candidate models) but never bucketed by capability. That bucket approach would have saved us from a few wrong turns where we optimized for reasoning the student already had while ignoring format-adherence gaps.

The EvoKD loop you mentioned (evaluate → identify weaknesses → teacher synthesizes targeted examples) is exactly what I want to try next. We're stuck at the "undifferentiated corpus" phase and feeling the diminishing returns. Have you seen any practical EvoKD implementations that work well with black-box API teachers where you don't have logit access? That's our constraint — using DeepSeek/Claude APIs as the teacher.

Sergei Parfenov • Jun 2

glad the bucketing landed — the format-adherence-vs-reasoning split is exactly the kind of thing that hides inside an aggregate score, so good that it surfaced for u.

on EvoKD-style loops with a black-box teacher — short answer, the classic EvoKD framing assumes u can probe the teacher freely, but the weakness-targeting half of the loop works fine black-box, u just lose the logit-level signal and do everything at the text level. the part that doesnt transfer is soft-label matching. what u keep: eval student → cluster the failures → prompt the teacher to synthesize examples targeting those clusters → SFT → repeat. no logits needed anywhere, all sequence-level.
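
One round of that text-level loop, as a rough sketch. This is not the EvoKD reference implementation: every callable is a hypothetical hook you would supply yourself, and the toy versions below exist only so the example runs without an API.

```python
from collections import defaultdict

def targeted_round(eval_items, student, is_correct, cluster_of, synthesize):
    """Evaluate the student, cluster its failures, and request targeted examples."""
    failures = [
        item for item in eval_items
        if not is_correct(item, student(item["prompt"]))
    ]
    clusters = defaultdict(list)
    for item in failures:
        clusters[cluster_of(item)].append(item)

    new_examples = []
    for name, items in clusters.items():
        # In a real setup this is where the teacher API gets prompted.
        new_examples.extend(synthesize(name, items))
    return dict(clusters), new_examples

# Toy stand-ins for the student, grader, clustering, and teacher synthesis
eval_items = [
    {"prompt": "2+2", "expected": "4", "tag": "arithmetic"},
    {"prompt": "emit JSON", "expected": "{}", "tag": "format"},
    {"prompt": "emit YAML", "expected": "a: 1", "tag": "format"},
]

def toy_student(prompt):
    return "4" if prompt == "2+2" else "oops"

def exact_match(item, output):
    return output == item["expected"]

def by_tag(item):
    return item["tag"]

def toy_synthesize(cluster, items):
    return [
        {"prompt": f"[{cluster}] variation of: {it['prompt']}", "completion": it["expected"]}
        for it in items
    ]

clusters, new_examples = targeted_round(
    eval_items, toy_student, exact_match, by_tag, toy_synthesize
)
print(sorted(clusters), len(new_examples))
```

The new examples then go into the next SFT pass, and the round repeats against a fresh eval. Nothing in the loop touches logits.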

the thing actually worth ur time tho: theres a recent paper from microsoft, Generative Adversarial Distillation (GAD), nov 2025, built specifically for the black-box/API-teacher case with no logit access. instead of treating teacher outputs as fixed SFT labels (ur "undifferentiated corpus" problem), it trains a discriminator to tell student outputs apart from teacher outputs, and that discriminator becomes an on-policy reward model that co-evolves with the student. thats basically a learned, automatic version of "find the weaknesses" — the discriminator is the weakness-finder, and it adapts as the student improves instead of u hand-bucketing every round. they got a Qwen2.5-14B student comparable to GPT-5-Chat as teacher. worth a read for ur exact constraint.

one caveat that matters for ur setup specifically: plain SeqKD students show higher n-gram overlap with the teacher but lower task scores — ur memorizing surface form, not capability. thats the diminishing-returns wall ur hitting. the adversarial/on-policy approaches exist precisely to break past it. so ur instinct that the undifferentiated corpus is the problem is dead on — its not that u need more data, its that flat SFT caps out.
(and the obvious one — ur using DeepSeek/Claude APIs as teacher, so just double-check the ToS on training competing models before u scale it, given the whole topic of the post lol.)

Mudassir Khan • Jun 2

the 'student is bounded by the teacher' framing is right for general capability but undersells the narrow task case. we've seen task-specific students outperform the teacher on the exact thing they were distilled for, because you're training on curated, filtered teacher outputs for your domain, not random samples. the ceiling moves.

the failure mode Harjot named is real too. the part we've found hardest: student doesn't fail loudly, it fails confidently. same hallucination pattern as any undertrained model, except you didn't expect it because the teacher made the task look easy.

how are you evaluating student coverage before shipping to prod?

Sergei Parfenov • Jun 2

You're right, and that's a real correction — I overstated it. "Bounded by the teacher" holds for general capability, but on a narrow task it breaks, exactly for the reason you give: you're training on curated, domain-filtered teacher outputs, not the teacher's full noisy distribution. Strip the teacher's mistakes and off-domain hedging out of the training set and the student's ceiling on that slice can sit above the teacher's average behavior there. There's a line of work formalizing this — student beats teacher when its gain on the student-favored subdomain outweighs its deficit on the teacher-favored one. So "bounded" should really be "bounded in aggregate, not per-slice."
And "fails confidently, not loudly" is a better phrasing of the calibration problem than mine — that's the one that actually hurts in prod.
On evaluating coverage before shipping, what's worked for me:

Bucketed eval, not a single aggregate score. A 92% average hides the cliff. I split the eval set by intent/capability and look for the buckets where the student drops well below its own mean — that's where the silent failures live. The aggregate number is almost useless for ship/no-ship.
Disagreement sampling against the teacher. Run student and teacher on a large unlabeled production sample and surface where they diverge. You don't need labels for the whole thing — the disagreement set is small and is exactly where you should spend human review. Cheap way to find the confident-wrong cases before users do.
Confidence calibration check. Plot student confidence vs actual correctness on the eval set. If the high-confidence band isn't also high-accuracy, that's the "fails confidently" pattern showing up quantitatively — and it tells you where to set a fallback threshold.
Ship with a teacher fallback, not as all-or-nothing. Route low-confidence (or known-weak-bucket) cases to the teacher and let the student handle the rest. Lets you ship at lower coverage and ratchet up as you close gaps, instead of waiting for the student to clear 100%.
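
The disagreement-sampling and calibration checks can be sketched as below. Exact string match as the agreement test and five equal-width confidence bins are simplifying assumptions. Real outputs usually need a looser comparison, such as normalization or a judge model.

```python
def disagreement_set(prompts, student, teacher,
                     same=lambda a, b: a.strip() == b.strip()):
    """Prompts where student and teacher outputs differ: send these to human review."""
    out = []
    for p in prompts:
        s, t = student(p), teacher(p)
        if not same(s, t):
            out.append({"prompt": p, "student": s, "teacher": t})
    return out

def calibration_table(records, n_bins=5):
    """records: (confidence in [0, 1], correct) pairs -> rows of (lo, hi, count, accuracy)."""
    bins = [[0, 0] for _ in range(n_bins)]  # [count, hits] per confidence band
    for conf, correct in records:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += int(bool(correct))
    return [
        (i / n_bins, (i + 1) / n_bins, count, hits / count)
        for i, (count, hits) in enumerate(bins)
        if count
    ]

# Toy models and made-up eval records, for illustration only
def toy_student(prompt):
    return "yes" if "easy" in prompt else "no"

def toy_teacher(prompt):
    return "yes"

review_queue = disagreement_set(["easy one", "hard one", "easy two"], toy_student, toy_teacher)
print(review_queue)

records = [(0.95, True), (0.92, False), (0.97, False), (0.55, True)]
for lo, hi, count, acc in calibration_table(records):
    print(f"confidence {lo:.1f}-{hi:.1f}: n={count}, accuracy={acc:.2f}")
```

In the made-up records, the top confidence band is right only a third of the time, which is the "fails confidently" pattern in numeric form.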

The disagreement-sampling one is the highest-leverage if you only do one — it finds the failures you didn't think to write an eval for.
Are you evaluating on a held-out slice of real traffic, or a synthetic eval set generated by the teacher? I've found teacher-generated evals flatter the student, since they share the same blind spots.

Mykola Kondratiuk • Jun 7

yeah this matters beyond headlines. when a vendor trains on your API output logs, that’s a contract/licensing question, not an ML technique story. conflating the two lets real vendor risk hide behind tech confusion.

Iuliia Fokina • May 30

Thank you for this!

xulingfeng • Jun 2

GAD paper recommendation is gold — the discriminator-as-weakness-finder framing clicks immediately. That Qwen2.5-14B ≈ GPT-5-Chat result is striking. Been reading it since your last reply and the on-policy co-evolution is exactly what our "undifferentiated corpus" setup is missing.

On the 60%: it came from the "never fired" signal, not an eval suite. We ran our test automation suite against both DeepSeek V4 Flash and a bigger teacher, then tracked which capabilities the extra budget never touched across ~2 weeks of real PR traffic. About 40% of the teacher's capability space was dead code for our workflow. The signal is noisy (sample window luck), but the headroom it freed up was real.

And noted on the ToS — we're distilling for internal test automation, not shipping a competing model, so the compliance angle should be clean. Thanks for the heads-up though.

Sergei Parfenov • Jun 3

nice, the "never fired" signal is a great way to derive it empirically — way better than guessing the capability set up front. and yeah GAD's discriminator is basically that same instinct automated: instead of u eyeballing what never fired, it learns the gap on-policy and keeps moving the target as the student closes it. fits ur setup almost too well.

one thing id watch on the 40% "dead code" though — "never fired in 2 weeks of PR traffic" is a coverage signal, not a capability signal, and those two can look identical. some of that 40% is genuinely dead for ur workflow (drop it, free win). but some of it is the long tail — capabilities that fire rarely but are expensive when u need them and dont have them. both show up as "never fired" in a 2-week window. the way i tell them apart isnt frequency, its cost of being wrong: a capability that fires 0.1% of the time but causes a bad merge when missing is not the same as one thats truly unused, even though the sample says they're equal.

cheap insurance: before u cut a capability from the student, ask "if this fires next month and the student cant do it, what breaks?" if the answer is "nothing much," drop it. if its "we ship a regression," keep it in the teacher-fallback path even if it never fired in ur window. costs u almost nothing and saves the one incident that wipes out all the headroom u gained.
(and yeah, internal test automation vs shipping a competing model is a totally different ToS posture — agreed thats clean.)
good thread btw, this is the kind of back-and-forth i started the blog for. lmk how the GAD experiment goes.

xulingfeng • Jun 3

Good distinction. Frequency and cost-of-being-wrong aren't the same thing — we ran into this building MemBridge too. In testing, "never fired" is 90% dead code. But in a memory system there's a third category: stored but never retrieved yet. It's not dead, it just hasn't had its turn. Different shape from test coverage.

What we ended up doing: a hit counter on each memory entry. If it sits for N days with zero retrieval hits, then flag it as cleanable. Window-based cuts alone miss this — something stored yesterday has the same "never fired" profile as something dead for months.
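
A minimal sketch of that rule, assuming each entry records when it was stored and how often it has been retrieved. `MemoryEntry` and `cleanable` are hypothetical names, not MemBridge's actual API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class MemoryEntry:
    key: str
    stored_at: datetime
    hits: int = 0  # incremented on every retrieval

def cleanable(entries, now, max_idle_days):
    """Entries with zero retrieval hits that have existed for at least max_idle_days."""
    cutoff = timedelta(days=max_idle_days)
    return [e for e in entries if e.hits == 0 and now - e.stored_at >= cutoff]

now = datetime(2026, 6, 3)
entries = [
    MemoryEntry("stored-yesterday", now - timedelta(days=1)),       # new, not yet retrieved
    MemoryEntry("dead-for-months", now - timedelta(days=60)),       # old, never retrieved
    MemoryEntry("old-but-used", now - timedelta(days=60), hits=7),  # old, actively used
]

print([e.key for e in cleanable(entries, now, max_idle_days=30)])
```

Requiring the entry's age to exceed the idle window is what separates "hasn't had its turn yet" from "dead for months", even though both have zero hits.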

GAD is on the list. Will report back once I run it. Appreciate the pointer.
