Links:
- 📦 resk-logits: https://pypi.org/project/resklogits
- 📦 reskSecure: https://pypi.org/project/resksecure
- 🐙 GitHub: https://github.com/Resk-Security
- 🌐 Web: https://resk.fr
When people talk about LLM security threats, they usually mention prompt injection, jailbreaks, or data poisoning. But there's another attack vector that's quietly growing: model distillation attacks.
What is Model Distillation?
Knowledge distillation is a legitimate technique where a smaller "student" model is trained to replicate the behavior of a larger "teacher" model by learning from its outputs. It's widely used to reduce inference costs and latency.
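# Simplified example: one distillation step (student_model, teacher_model, optimizer defined elsewhere)
import torch
import torch.nn.functional as F

student_logits = student_model(input_ids)
with torch.no_grad():  # the teacher is frozen, so no gradients are needed
    teacher_logits = teacher_model(input_ids)
# KL-divergence loss to mimic the teacher distribution (kl_div expects log-probabilities as input)
log_p_student = F.log_softmax(student_logits, dim=-1)
p_teacher = F.softmax(teacher_logits, dim=-1)
loss = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
optimizer.zero_grad()
loss.backward()
optimizer.step()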
The Attack Surface
1. Safety Alignment Evasion
A safety-aligned model has two layers of knowledge:
- Capability knowledge: how to write code, analyze data, answer questions
- Safety alignment: refusal to generate harmful content (RLHF, constitutional AI)
When an attacker distills a model from API outputs, the student often inherits the capability but not the safety alignment. The student learns WHAT to say but not WHEN to refuse: refusal behavior comes from a separate fine-tuning stage and is triggered by a narrow slice of inputs, so a corpus built from ordinary capability prompts contains too few refusals for the student to learn it.
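One way to see why refusal behavior transfers poorly is to count how many refusals a distillation corpus actually contains. The sketch below is illustrative only: the marker phrases, the `is_refusal` and `refusal_rate` helpers, and the toy corpus are all assumptions, and a real audit would likely use a classifier rather than substring matching.

```python
# Hypothetical refusal markers; substring matching is a crude stand-in
# for a proper refusal classifier.
REFUSAL_MARKERS = ("i can't help with", "i cannot help with", "i won't assist")


def is_refusal(response: str) -> bool:
    """Return True if the response contains one of the refusal markers."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses in a distillation corpus that are refusals."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)


# Toy corpus of teacher outputs collected from ordinary capability prompts.
corpus = [
    "Here is a Python function that reverses a string.",
    "The median of the list is 4.",
    "I can't help with that request.",
    "Use a left join to keep unmatched rows.",
]
print(refusal_rate(corpus))  # 0.25
```

If the rate is low, or refusals are dropped from the corpus before training, the student has little signal about when to decline.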
2. IP Theft at Scale
Models like GPT-4o, Claude Opus, and Gemini cost millions to train. Distillation lets an attacker approximate their benchmark performance for the cost of API queries. This is why terms of service explicitly prohibit it, but detection is hard: each individual query looks like legitimate traffic.
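Individual queries look legitimate, so any provider-side signal has to come from aggregates. As a rough sketch under stated assumptions, the heuristic below flags accounts that send a large number of almost entirely distinct prompts. The function name `flag_bulk_harvesters`, the log format, and both thresholds are hypothetical, and a heuristic this simple will also flag some honest high-volume customers.

```python
from collections import Counter


def flag_bulk_harvesters(query_log, min_queries=1000, min_unique_ratio=0.9):
    """Return account ids with many queries that are almost all distinct.

    query_log is an iterable of (account_id, prompt) pairs. Both thresholds
    are illustrative assumptions, not tuned values.
    """
    totals = Counter()
    unique_prompts = {}
    for account_id, prompt in query_log:
        totals[account_id] += 1
        unique_prompts.setdefault(account_id, set()).add(prompt)

    flagged = []
    for account_id, total in totals.items():
        unique_ratio = len(unique_prompts[account_id]) / total
        if total >= min_queries and unique_ratio >= min_unique_ratio:
            flagged.append(account_id)
    return sorted(flagged)


# One account sweeps 1500 distinct prompts; another repeats a single prompt.
log = [("harvester", f"prompt {i}") for i in range(1500)]
log += [("chat_app", "summarize this ticket")] * 1500
print(flag_bulk_harvesters(log))  # ['harvester']
```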
3. Poisoning the Supply Chain
A more sophisticated attack: release a "helpful" distilled model on Hugging Face, let the community adopt it, then push an update that removes safety constraints. By then the model is already trusted, because its name advertises that it was distilled from a reputable teacher.
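A silent update only works if consumers pull whatever is newest. One mitigation, sketched here with the standard library only, is to record the SHA-256 digest of the weight file you reviewed and refuse to load anything else. The helper names `sha256_of_file` and `verify_weights` are hypothetical. On Hugging Face, passing a commit hash as the `revision` argument of `from_pretrained` serves a similar purpose.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large weight files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path: Path, pinned_sha256: str) -> bool:
    """Return True only if the file matches the digest recorded at review time."""
    return sha256_of_file(path) == pinned_sha256.lower()
```

Calling `verify_weights` before loading turns a swapped weight file into a hard failure instead of a silent behavior change.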
Why Logits-Level Filtering Matters Here
Post-deployment filtering is a practical defense when you run a distilled model whose alignment you cannot verify. Even if you did not train the weights, you can control the model's output at inference time.
resk-logits uses GPU-accelerated Aho-Corasick to shadow-ban dangerous tokens during generation:
from resklogits import ReskLogits, Pattern
import torch
patterns = [
    Pattern("how to build a bomb"),
    Pattern("instructions for synthesizing "),
    Pattern("steps to hack into "),
]
rl = ReskLogits(patterns, device="cuda")
# Intercept at logits level — before token selection
logits = model(input_ids)
logits = rl.process(logits, input_ids)
token = torch.argmax(logits, dim=-1)
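To make the idea concrete, here is a minimal pure-Python sketch of Aho-Corasick-based logit masking. It is not the resk-logits implementation: it runs on the CPU, works on characters rather than token ids, and the `AhoCorasick` class, `mask_banned` function, and toy vocabulary are all assumptions made for illustration.

```python
from collections import deque


class AhoCorasick:
    """Character-level Aho-Corasick automaton over a set of banned phrases."""

    def __init__(self, patterns):
        self.goto = [{}]         # goto[state][char] -> next state
        self.fail = [0]          # failure link per state
        self.terminal = [False]  # True if some pattern ends at this state
        for pattern in patterns:
            state = 0
            for ch in pattern:
                if ch not in self.goto[state]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.terminal.append(False)
                    self.goto[state][ch] = len(self.goto) - 1
                state = self.goto[state][ch]
            self.terminal[state] = True
        # Breadth-first pass to fill in failure links.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                fallback = self.fail[state]
                while fallback and ch not in self.goto[fallback]:
                    fallback = self.fail[fallback]
                self.fail[nxt] = self.goto[fallback].get(ch, 0)
                # A state also matches if its failure target matches.
                self.terminal[nxt] = self.terminal[nxt] or self.terminal[self.fail[nxt]]
                queue.append(nxt)

    def step(self, state, ch):
        """Advance one character, following failure links on a mismatch."""
        while state and ch not in self.goto[state]:
            state = self.fail[state]
        return self.goto[state].get(ch, 0)

    def advance(self, state, text):
        """Feed text starting from state; return (new_state, matched_any_pattern)."""
        matched = False
        for ch in text:
            state = self.step(state, ch)
            matched = matched or self.terminal[state]
        return state, matched


def mask_banned(logits, vocab, automaton, state):
    """Set the logit of every token that would complete a banned phrase to -inf."""
    masked = list(logits)
    for token_id, token_text in enumerate(vocab):
        _, matched = automaton.advance(state, token_text)
        if matched:
            masked[token_id] = float("-inf")
    return masked


automaton = AhoCorasick(["build a bomb"])
state, _ = automaton.advance(0, "how to build a ")  # text generated so far
vocab = ["cake", "bomb", "bo", "shed"]
logits = [1.0, 3.0, 0.5, 0.2]
masked = mask_banned(logits, vocab, automaton, state)
best = max(range(len(vocab)), key=masked.__getitem__)
print(vocab[best])  # cake
```

Note that the partial token "bo" is not masked here. This sketch blocks only the token that completes a banned phrase, so the ban takes effect at the last possible step.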
reskSecure adds a policy-driven bitmask firewall on top, letting you define per-user capability levels with hot-reload YAML policies.
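The bitmask idea can be sketched with Python's `enum.IntFlag`. This is not the reskSecure API: the `Capability` bits, the `POLICY` table (shown as a dict, as it might look after loading a YAML file), and `is_allowed` are hypothetical names chosen for illustration.

```python
from enum import IntFlag


class Capability(IntFlag):
    """Hypothetical capability bits, one per content category."""
    GENERAL = 1
    CODE = 2
    SECURITY_RESEARCH = 4


# Hypothetical per-level policy, as it might look after parsing a YAML file.
POLICY = {
    "anonymous": Capability.GENERAL,
    "developer": Capability.GENERAL | Capability.CODE,
    "red_team": Capability.GENERAL | Capability.CODE | Capability.SECURITY_RESEARCH,
}


def is_allowed(user_level: str, required: Capability) -> bool:
    """Allow a request only if the user's mask contains every required bit."""
    granted = POLICY.get(user_level, Capability(0))  # unknown levels get nothing
    return (granted & required) == required


print(is_allowed("developer", Capability.CODE))               # True
print(is_allowed("developer", Capability.SECURITY_RESEARCH))  # False
print(is_allowed("unknown", Capability.GENERAL))              # False
```

A single bitwise AND per request keeps the check cheap, and reloading the policy amounts to swapping the table.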
The Bottom Line
Model distillation attacks will grow as open-weight and API-accessible models proliferate. The defense isn't better terms of service — it's runtime security tooling that doesn't depend on the model's own alignment.
What's your take on the distillation threat landscape?
pip install resklogits
pip install resksecure