Roshan Acharya

Distillation Attacks: Risks and Controversies

Did you know you can replicate the behaviour of a large language model in a small model that has faster inference, lower computation cost, and competitive benchmark performance?

This is possible using a technique called knowledge distillation, where a smaller model, commonly called the "student" model, learns to mimic a larger "teacher" model.

Let's dig into this powerful concept and the issues that arise along with it.

What is Model Distillation?

Model distillation is a technique in which a smaller model (student) is trained to mimic the behaviour of a larger model (teacher). The goal is better generalization in the student model than training it from scratch would achieve. The student model may learn from:

  • Soft probability distributions: Instead of predicting a single hard label, the model outputs a probability distribution over all possible classes or tokens.

  • Logits: Logits are the raw outputs of the neural network before softmax is applied.

  • Generated text outputs: This is the final text produced by the model. Even if probabilities and logits are hidden, attackers can collect (prompt, generated_text) pairs to perform distillation.
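To make the soft-label idea concrete, here is a minimal, self-contained sketch of temperature-scaled softmax, the standard way distillation turns raw logits into soft probability distributions. The logit values are made up for illustration:

```python
import math

def softened_probs(logits, temperature=2.0):
    """Convert raw logits into a soft probability distribution.

    A higher temperature flattens the distribution, exposing the
    teacher's relative confidence across all classes -- exactly the
    extra signal a student model can learn from.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)                         # subtract max for numeric stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# A hard label would keep only the argmax (class 0); the soft
# distribution also reveals that class 1 is a close runner-up.
logits = [4.0, 3.0, 0.5]
print(softened_probs(logits, temperature=1.0))
print(softened_probs(logits, temperature=4.0))  # flatter distribution
```

Note how raising the temperature shrinks the gap between the top classes; that relative ranking is the "dark knowledge" the student absorbs.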

From Compression to Exploitation

Originally, model distillation was designed as a compression technique: a way to transfer knowledge to a smaller model. In this setting, both the teacher and student models belong to the same organization, and the goal is optimization.

However, the context changes when the teacher model is closed-source and accessible only through an API.

Distillation becomes an attack when:

  • The teacher model is proprietary and not publicly available
  • The student model is trained without permission
  • The attacker only interacts through API queries
  • The objective is not compression, but capability replication

In this scenario, the attacker does not need access to the model’s architecture, training data, or weights. Instead, they rely purely on the outputs generated by the system.

This process is commonly referred to as:

  • Model extraction

  • Model stealing

  • Black-box distillation

How a Distillation Attack Works

A distillation attack involves the following steps:

Step 1: Querying
Here, the attacker generates prompts in a specific domain, often from synthetic data. This collection of queries is fed to the teacher model to generate outputs.

Step 2: Dataset Construction
In this step, each collected query is passed to the teacher model, which generates an output. The input prompt, together with the model's output, is stored as a (prompt, response) pair in a dataset.
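Steps 1 and 2 can be sketched in a few lines. The `query_teacher` function below is a hypothetical stand-in for a real API call (in practice this would be an HTTP request to the teacher model's endpoint); the storage format shown is JSON Lines, a common but by no means mandatory choice:

```python
import json

def query_teacher(prompt):
    """Hypothetical placeholder for an API call to the teacher model.

    Returns a canned reply so the sketch is runnable offline.
    """
    return f"Teacher answer to: {prompt}"

def build_dataset(prompts, path="distill_dataset.jsonl"):
    """Collect (prompt, generated_text) pairs and persist them as JSON Lines."""
    records = [{"prompt": p, "output": query_teacher(p)} for p in prompts]
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")
    return records

dataset = build_dataset(["Explain photosynthesis.", "What is a logit?"])
print(len(dataset))  # 2
```

At scale, an attacker would run this loop over thousands or millions of synthetic prompts, which is why providers monitor for anomalous query volume.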

Step 3: Student Model Training
In this final step, the attacker trains a smaller transformer model on the constructed dataset. The objective is simple: given the same input prompt, the student model tries to predict the teacher model's output as closely as possible. This is typically done by minimizing cross-entropy loss, and in more refined setups, KL divergence is used to match the teacher's predicted probability distributions. Over time, the student does not just memorize responses, but begins to mimic the teacher's behaviour on unseen prompts.
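The KL-divergence objective mentioned above can be sketched in plain Python. The probability values below are illustrative, not real model outputs; a real training loop would compute this (or cross-entropy against the teacher's tokens) inside a deep-learning framework and backpropagate through the student:

```python
import math

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student): the quantity a distillation loss drives to zero.

    It is zero when the student's distribution exactly matches the
    teacher's, and grows as the two distributions diverge.
    """
    return sum(t * math.log((t + eps) / (s + eps))
               for t, s in zip(teacher_probs, student_probs))

teacher       = [0.70, 0.20, 0.10]
student_early = [0.10, 0.20, 0.70]   # before training: far from the teacher
student_late  = [0.65, 0.25, 0.10]   # after training: close to the teacher
print(kl_divergence(teacher, student_early))
print(kl_divergence(teacher, student_late))
```

Minimizing this quantity over many (prompt, distribution) pairs is what pulls the student's behaviour toward the teacher's.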

Recent Controversies over Distillation Attacks

Recently, Anthropic claimed that three Chinese companies (DeepSeek, Moonshot AI, and MiniMax) ran large-scale distillation attacks on its Claude models. Anthropic alleges that:

  • ~24,000 fraudulent accounts were used to access Claude.

  • Over 16 million prompt–response exchanges were generated to extract capability data.

  • These interactions were then allegedly used to train rival AI systems.

Beyond formal reports, online communities have sparked their own controversies, sometimes based on technical quirks rather than confirmed evidence.

One example that gained traction on Reddit was a post claiming that Anthropic’s Sonnet 4.6 model responded to certain Chinese‑language prompts with:

“I am DeepSeek‑V3, an AI assistant developed by DeepSeek.”

This led some users to speculate that:

  • Sonnet had been distilled from or trained on DeepSeek’s model outputs

Conclusion: Who Really Owns AI Knowledge?

Large language models are trained on massive datasets scraped from the internet. Much of this content is publicly available, yet its use in training is often debated legally and ethically.

If training an AI on the internet is considered legal, some might ask: "Why is it illegal to replicate a model using its outputs through distillation?"

What do you think? Drop your thoughts in the comments.
