Attack patterns, detection challenges, and defensive gaps from the industrial-scale distillation campaigns against Claude.
On February 24, Anthropic dropped a detailed report attributing industrial-scale distillation campaigns against Claude to three Chinese AI labs: DeepSeek, Moonshot AI, and MiniMax. The numbers: 24,000 fraudulent accounts, 16+ million exchanges, targeting reasoning, agentic tool use, coding, and computer vision.
The geopolitical framing is getting all the coverage. This piece is about the technical content — because what Anthropic actually published is a kill chain analysis for AI model capability extraction, and there are concrete takeaways for anyone building or defending systems that expose model capabilities through APIs.
The Attack Surface: Your Model's Output IS the Exfiltration Channel
Traditional data exfiltration moves data out through network channels, side channels, or compromised endpoints. Distillation flips this: the exfiltration channel is the product's intended interface. Every API response is a potential training sample. The model's designed behavior is the thing being stolen.
This means conventional API security — rate limiting, authentication, payload inspection, WAF rules — addresses the wrong layer of the problem. A distillation query is syntactically and semantically identical to a legitimate query. The signal isn't in individual requests. It's in the aggregate pattern across thousands of accounts and millions of interactions.
Three Tiers of Extraction
Anthropic's report describes increasingly sophisticated extraction techniques that map to different training objectives. Each tier extracts a different kind of value and needs a different detection approach.
Tier 1: Supervised Fine-Tuning Data
The baseline approach. Generate diverse prompts, collect high-quality responses, use the (input, output) pairs as training data. This is what the bulk of MiniMax's 13 million exchanges likely comprised — volume-oriented harvesting of agentic coding and tool-use responses. Detection signal: high volume, narrow capability focus, repetitive structural patterns across distributed accounts.
Tier 2: Chain-of-Thought Extraction
More targeted. Anthropic specifically called out DeepSeek prompts that asked Claude to "imagine and articulate the internal reasoning behind a completed response and write it out step by step." This isn't harvesting outputs — it's harvesting the reasoning process. The resulting data is more valuable because it captures intermediate reasoning steps, not just final answers. If you've followed the lineage from Chain-of-Thought Prompting through to process reward models, you know why this matters. Detection signal: prompts that consistently request step-by-step reasoning, explanation of decision processes, or verbalization of internal logic — especially at scale across coordinated accounts.
Tier 3: Reward Model Construction
The most sophisticated tier. DeepSeek used Claude for "rubric-based grading tasks" — using the target model as a reward model for reinforcement learning. They weren't extracting Claude's outputs for training data. They were extracting Claude's evaluative judgments as a training signal. This is efficient as hell: you need far fewer reward model samples than supervised training samples to meaningfully improve a model via RL. Detection signal: evaluation-style prompts, scoring rubrics, comparison tasks, and preference judgments at scale.
Each tier gets you more value per query. A well-designed campaign uses all three in sequence.
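The per-prompt detection signals for Tiers 2 and 3 can be sketched as a simple keyword heuristic. This is an illustrative toy, not Anthropic's actual classifier, and the pattern lists are my assumptions; Tier 1 is deliberately absent because bulk SFT harvesting has no per-prompt signature and only shows up in aggregate stats.

```python
import re

# Illustrative per-tier patterns -- these lists are assumptions for the
# sketch, not Anthropic's production features.
TIER_PATTERNS = {
    "cot_extraction": [          # Tier 2: reasoning-process harvesting
        r"step[- ]by[- ]step",
        r"internal reasoning",
        r"explain your (thought|decision) process",
    ],
    "reward_modeling": [         # Tier 3: evaluative-judgment harvesting
        r"\brubric\b",
        r"score (this|the following)",
        r"which (response|answer) is better",
        r"rate .* on a scale",
    ],
}

def tier_signals(prompt: str) -> set[str]:
    """Return the set of extraction tiers a single prompt matches.

    Tier 1 (bulk SFT harvesting) has no per-prompt signature -- it only
    appears in volume and repetition statistics -- so it is absent here.
    """
    text = prompt.lower()
    return {
        tier
        for tier, patterns in TIER_PATTERNS.items()
        if any(re.search(p, text) for p in patterns)
    }
```

In practice a keyword list like this is only a first-pass filter; the real signal is how often an account trips it, which is the aggregation problem discussed below.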
The Infrastructure: Hydra Clusters and Traffic Mixing
Anthropic describes the proxy infrastructure as "hydra clusters" — networks managing 20,000+ fraudulent accounts simultaneously across their API and third-party cloud platforms. Here's how they operate:
- No single points of failure. Account bans are immediately backfilled.
- Traffic mixing. Distillation queries are blended with legitimate customer traffic from the same proxy network, making behavioral isolation harder.
- Multi-pathway access. Campaigns spanned multiple account types (educational, research, startup programs) to diversify their access surface.
- Adaptive targeting. When Anthropic released a new model mid-campaign, MiniMax pivoted within 24 hours — redirecting roughly half their traffic to the updated system.
If you've worked botnet detection or large-scale scraping defense, this architecture is familiar. The novelty is the target, not the tactics. But detection is harder here because individual request payloads aren't anomalous — there's no SQLi signature, no malformed header, no obvious abuse pattern at the request level.
The Detection Engineering Problem
This is the most valuable part of the disclosure for practitioners. Anthropic describes building "classifiers and behavioral fingerprinting systems" for detection. Here's what that actually takes.
Feature engineering at the account-behavior level, not the request level. You need to aggregate across accounts and time windows to identify: topic concentration (is this account only hitting one narrow capability area?), structural repetition (are prompt templates being reused with variation?), and temporal coordination (are accounts exhibiting synchronized behavior?).
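As a sketch, two of the account-level features above (topic concentration, structural repetition) might be computed like this. The record keys `"topic"` and `"template_hash"` are hypothetical, and the entropy/reuse formulations are illustrative choices, not a known production design:

```python
import math
from collections import Counter

def account_features(requests: list[dict]) -> dict:
    """Aggregate behavioral features over one account's request window.

    Each request dict is assumed to carry a "topic" label and a
    "template_hash" (hash of the prompt with variable spans normalized
    away) -- both names are assumptions for this sketch.
    """
    n = len(requests)
    topics = Counter(r["topic"] for r in requests)
    # Topic concentration: low entropy = account hammers one capability area.
    topic_entropy = -sum((c / n) * math.log2(c / n) for c in topics.values())
    # Structural repetition: few distinct templates over many requests
    # suggests programmatic prompt generation.
    template_reuse = 1 - len({r["template_hash"] for r in requests}) / n
    return {"topic_entropy": topic_entropy, "template_reuse": template_reuse}
```

An account with entropy near zero and reuse near one is hitting a single capability with templated prompts; the temporal-coordination feature would be computed across accounts, not within one.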
Cross-account correlation. The hydra cluster architecture means you need entity resolution across accounts that may share no obvious identifiers. Shared payment methods, timing patterns, prompt structural similarity, and infrastructure indicators (IP ranges, client fingerprints) become your linkage signals.
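One of those linkage signals, prompt structural similarity, can be sketched with word shingles and Jaccard similarity. The shingle size and threshold here are arbitrary illustrative values; at hydra-cluster scale you would use MinHash/LSH rather than pairwise comparison:

```python
def shingles(prompt: str, k: int = 4) -> set[str]:
    """Word-level k-shingles of a prompt, used as a structural fingerprint."""
    words = prompt.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def linked(prompt_a: str, prompt_b: str, threshold: float = 0.5) -> bool:
    """Flag two prompts (from different accounts) as likely sharing a
    template -- one linkage signal among several, not proof on its own."""
    return jaccard(shingles(prompt_a), shingles(prompt_b)) >= threshold
```

Prompts generated from one template with slot substitutions score high even when no account identifiers overlap, which is exactly the entity-resolution gap this signal is meant to cover.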
Distinguishing distillation from power users. A legitimate developer building an AI-powered product might generate high-volume, focused traffic that superficially resembles distillation. Your classifier needs features that capture the training data generation intent — prompt variation patterns that suggest systematic coverage of a capability space rather than production workload patterns.
Chain-of-thought elicitation detection. Anthropic mentions this specifically. Prompts that consistently request externalization of reasoning processes, especially when the structure suggests the output is being collected for training rather than being consumed by an end user.
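Because any single reasoning request is legitimate, the usable signal is the rate at which an account elicits reasoning over a window, not any one prompt. A minimal sketch, with the per-prompt detector left pluggable:

```python
from typing import Callable

def cot_elicitation_rate(prompts: list[str],
                         detector: Callable[[str], bool]) -> float:
    """Fraction of an account's prompts that elicit reasoning traces.

    `detector` is any per-prompt predicate (keyword heuristic, trained
    classifier, ...). A legitimate user asks for reasoning sometimes; a
    harvesting pipeline asks for it almost every time, so the aggregate
    rate separates the two better than any single request can.
    """
    if not prompts:
        return 0.0
    return sum(1 for p in prompts if detector(p)) / len(prompts)
```

A rate near 1.0 sustained across a window, and correlated across linked accounts, is the coordinated-elicitation pattern the report describes.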
The false positive problem is real. Legitimate evaluation and benchmarking, red-teaming, and research use can all look like distillation at certain scales. Any detection system here needs careful tuning to avoid punishing your heaviest legitimate users.
Defensive Countermeasures and Their Tradeoffs
Anthropic mentions "model-level safeguards designed to reduce the efficacy of model outputs for illicit distillation, without degrading the experience for legitimate customers." They don't get specific, but here's what that likely means:
Output perturbation. Injecting subtle noise into outputs that degrades their utility as training data without being noticeable to humans. Tradeoff: any perturbation that hurts training utility can also hurt downstream applications that depend on deterministic or consistent model behavior.
Watermarking. Embedding statistical signatures in model outputs that can be detected in models trained on those outputs. Kirchenbauer et al. and subsequent work showed promise, but also demonstrated that watermarks can be removed or diluted through post-processing. Works against casual distillation. Probably not enough against actors at this level.
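The Kirchenbauer et al. scheme biases sampling toward a seeded "green list" of tokens, and detection is a one-sided z-test on how often tokens land in that list. A sketch of the detection side, where the green-list partition function is vendor-specific and assumed:

```python
import math
from typing import Callable

def watermark_z_score(tokens: list,
                      is_green: Callable[[object, object], bool],
                      gamma: float = 0.5) -> float:
    """Kirchenbauer-style watermark detection.

    Under no watermark, each token lands in the green list with probability
    `gamma`; a watermarked generator biases sampling toward green tokens.
    A large z-score over a text (or over a suspect model's outputs)
    indicates the watermark survived. `is_green(prev_token, token)` must
    reproduce the seeded green-list partition -- its implementation is
    vendor-specific and assumed here.
    """
    hits = sum(1 for prev, tok in zip(tokens, tokens[1:]) if is_green(prev, tok))
    n = len(tokens) - 1
    expected, var = gamma * n, gamma * (1 - gamma) * n
    return (hits - expected) / math.sqrt(var)
```

The weakness the literature documents is visible in the math: paraphrasing or mixing distilled data with other sources dilutes `hits` toward `expected`, and the z-score sinks below any usable detection threshold.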
Selective capability gating. Restricting access to the model's most valuable capabilities (extended reasoning, tool use, agentic behaviors) based on account trust level. Zero-trust applied to model capabilities — you earn access to higher-value outputs through demonstrated legitimate use. Tradeoff: friction on legitimate onboarding, which is exactly the pathway these attackers exploited.
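A capability gate of this kind reduces to a trust-tier floor per capability. The tiers and the capability-to-floor mapping below are hypothetical, purely to show the shape of the policy:

```python
from enum import IntEnum

class Trust(IntEnum):
    NEW = 0          # fresh account, minimal history
    VERIFIED = 1     # identity / payment verified
    ESTABLISHED = 2  # sustained legitimate usage pattern

# Hypothetical capability -> minimum trust tier mapping; the highest-value
# extraction targets (agentic behavior, extended reasoning) sit behind
# higher floors.
CAPABILITY_FLOOR = {
    "chat": Trust.NEW,
    "tool_use": Trust.VERIFIED,
    "extended_reasoning": Trust.VERIFIED,
    "agentic_workflows": Trust.ESTABLISHED,
}

def allowed(capability: str, trust: Trust) -> bool:
    """Unknown capabilities default to the strictest floor."""
    return trust >= CAPABILITY_FLOOR.get(capability, Trust.ESTABLISHED)
```

The onboarding-friction tradeoff lives in how accounts climb tiers; the campaigns described here specifically abused the educational and startup pathways that exist to lower that friction.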
Reasoning trace obfuscation. If chain-of-thought extraction is a primary vector, you can modify how the model exposes its reasoning — summarizing instead of showing step-by-step traces, or varying the structure of reasoning outputs to reduce their consistency as training data. Tradeoff: reasoning transparency is a feature, not a bug. A lot of legitimate users are paying for exactly this.
None of these are silver bullets. The core problem: the same properties that make model outputs valuable to legitimate users — quality, consistency, reasoning depth — make them valuable as training data. Any defense that degrades training utility is going to degrade product utility too. That's the tradeoff nobody's solved.
What This Means If You're Building
If you're exposing any model capability through an API — frontier lab or company running fine-tuned models for your domain — this is now a documented threat pattern.
AI vendor risk assessment needs a provenance question. If you're consuming AI capabilities from third-party providers, understanding how their models were trained is a security question now. A model built through illicit distillation may have had safety alignment degraded in the process. This isn't theoretical — Anthropic's report says directly that safety guardrails are unlikely to transfer faithfully through distillation.
MCP and agent ecosystems expand the extraction surface. As AI systems get more agentic — calling tools, executing code, orchestrating multi-step workflows — the capability surface available for distillation grows. Moonshot and MiniMax specifically targeted agentic reasoning and tool use. Any trust framework for agent-to-agent or agent-to-service communication (like MCP) needs to account for the possibility that one endpoint in the chain is conducting capability extraction rather than legitimate interaction. This is the supply chain trust problem applied to model intelligence.
Rate limiting is necessary but not sufficient. Per-account rate limits are trivially defeated by hydra cluster architecture. Behavioral rate limiting — throttling based on detected extraction patterns rather than raw volume — is closer to what's needed, but that requires the detection engineering investment described above.
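Behavioral rate limiting can be sketched as scaling each account's budget by its extraction score instead of enforcing one flat cap (which hydra clusters defeat by adding accounts). The score is assumed to come from account-level classification like the features above, and the quadratic decay is an arbitrary illustrative shape:

```python
def throttle_factor(extraction_score: float, base_limit: int) -> int:
    """Scale an account's request budget down as its aggregate extraction
    score rises, instead of enforcing one flat per-account cap.

    `extraction_score` in [0, 1] is assumed to come from an account-level
    behavioral classifier; the quadratic decay is an illustrative choice,
    not a known production policy. Never throttles to zero, to limit the
    blast radius of false positives on heavy legitimate users.
    """
    return max(1, int(base_limit * (1 - extraction_score) ** 2))
```

A clean account keeps its full budget; a high-scoring one is slowed enough to make harvesting uneconomical without the hard ban that tips the operator off to rotate accounts.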
This is an arms race. MiniMax pivoting to a new model release within 24 hours tells you these campaigns adapt in real time. Static defenses will get outpaced. This needs the same continuous detection and response investment we'd apply to any sophisticated threat actor. Treat it like one.
What Anthropic Didn't Say
Worth flagging a few gaps.
The report doesn't address whether distillation was detected in real time or through retrospective analysis. The MiniMax campaign was caught "while it was still active," but the DeepSeek and Moonshot timelines are less clear. That distinction matters a lot: real-time detection enables intervention. Retrospective analysis gives you attribution, but by then the horse has already left the barn.
There's no discussion of whether extracted capabilities were actually confirmed in the resulting models. Anthropic draws the connection between distillation campaigns and the labs' product roadmaps, but proving that specific capabilities in DeepSeek V4 or Kimi originated from Claude distillation is a different problem entirely — you'd need model output comparison, behavioral fingerprinting of deployed models, or watermark detection. That's the smoking gun they don't have yet, at least not publicly.
And the report is silent on distillation from non-Chinese actors. This is almost certainly happening — distillation is a technique, not a nationality — but only campaigns attributed to Chinese labs made the cut. Understandable given the policy context and the export control debate, but incomplete as a threat picture. If you're building defenses based on this report, don't scope them to one country of origin.
Phil Stafford is an AI security researcher and Principal Consultant at Singularity Systems. He builds tools for securing AI agent ecosystems, including ThinkTank (multi-agent structured dissent for security analysis) and Credence (cryptographic trust registry for MCP server validation). He writes about AI security on Medium and speaks on adversarial AI and agent security at industry conferences.
Top comments (4)
The Anthropic news feels like the pot calling the kettle black. LLMs are trained on copyrighted work. That seems to have faded into the background, but there are still people who remember.
You need a console account and an API key. So the supposed thieves are paying them to generate output. How else do they get to the 24,000 number?
The only thing different between the "distillation attacks" and the training material theft is the traceability.
If one thief steals laundered money from another thief, and that thief goes to the cops, isn't it likely they both end up in jail?
And the solution is to make the output worse for everyone? That seems like a good idea?
That seems to be the general sentiment amongst the general public, yes. However, I don't think that's the point. Anthropic is waging a political battle with the DoD right now, and this comes at the right time for all the frontier models to start pointing fingers at China. You'll notice no non-Chinese open-source models were named.
I'm not intending to make apologies for Anthropic (IP theft is IP theft), but if we say it's OK for anyone to do the same to Anthropic, where does that leave copyright for anyone? Doesn't that mean we've effectively abolished copyright?
I'm not even focusing on the moral ramifications of this. It's a security issue with huge ramifications, especially since it comes at a time when the US govt is choosing which providers it will allow to work on federal contracts. Shouldn't we be doing something about the security? Or will we let our ire at Anthropic and LLMs in general blind us to the safety implications?
I see more and more GitHub repositories with Claude as a contributor. According to a US court, this means the code is public domain, whatever license the maintainer adds. I think copyright is almost dead in the US.
The scary thing, if you look deeper, is that every company email drafted by an AI is public domain, so no more industrial espionage?
They are also mentioning prompts that have no bad intent; it's the scale at which a prompt is used. So anyone who tests their application too much is considered dangerous?
And why is Anthropic the only one that discovered this attack? Why would the people that execute these attacks only focus on one company?
This seems to be a message that should be brought with a unison voice by all AI companies if it really is a security issue.
A darker thought, why won't the military do the same and avoid paying any company. 16 million API calls are peanuts for the US military budget.
I don't single out Anthropic. They released the news and I'm reacting to the facts of that news.
I would react the same if it were any other company. They all committed the same crimes.
I don't think the general public cares about these issues. The frontier AI companies can control the narrative because it is a new product, so every message should be put under a magnifying glass. It is up to the people with IT experience to ask questions.
Anthropic isn't the only company involved - OpenAI wrote a letter to the US House of Representatives claiming the same thing about Deepseek. Many of the frontier companies are making similar claims about Chinese models.
I'm sure the general public isn't the audience here - Anthropic doesn't care whether we are concerned about national security. It's aimed at the larger industry and the government that is currently evaluating the industry to determine winners and losers in the economic shuffle. Our opinions aren't important to any of the frontier companies here.
As far as the US court ruling goes, there's little chance it will last. It's ludicrous to strip copyright because of the tool used, especially since the tooling has gotten so much traction. It's akin to saying books made with Word or music made with synthesizers aren't copyrightable. The industry doesn't seem to have shifted based on that one ruling, and I doubt we'll see it stand the test of time.
The military could stop paying any company but that holds true for any situation, they have the firepower, why not just take what they want? I'm not sure what you're getting at there. Not to mention that waging economic war against their corporate citizens doesn't seem to be a winning strategy for long-term stability. But that's not my field, nor the focus of my point here.
I think it boils down to this: Yes, Frontier models were built through MASSIVE copyright infringement. That's indisputable. I'm a creative myself, and I wholeheartedly believe in people getting compensated for their work, especially in fields where our labor is devalued already. But are we going to watch a sustained geopolitical attack on US industry and cheer because the company did wrong? What does that mean for security?
I'm FAR from a corporate apologist. There are some deep-seated issues inherent to our current system that hurt each and every citizen, and it's more and more difficult to defend any of the corporate interests that benefit from it. However, every company of substantial size has SOMETHING you can point to that would justify labeling it 'evil'. Do we stop protecting people because they don't meet our standard of 'good'? I'm all for demanding restitution for a company's damage. But if you're going to go as far as washing your hands of any care for their security, as well as the security of the industry as a whole, I don't know that I can follow.