Phil Rentier Digital

Posted on May 4 • Originally published at rentierdigital.xyz

The AI You're Using Has a Hidden Personality. Anthropic Just Proved Nobody Can Detect It.

#ai #machinelearning #claude #aiagents

A hidden behavior makes Claude Haiku 4.5 cost five times less than Opus 4.7. GPT-5 mini runs at one-seventh the price of GPT-5.2. And Gemini 3.1 Flash-Lite? Cents per million tokens, real-time inference.

In 2026, if you use AI, you probably use one of these small models. There's near-certainty it exists thanks to a technique called distillation. A big expensive model generates thousands of responses. A smaller one learns to imitate them. Your bill drops by an order of magnitude.

That part wasn't supposed to be a problem.

TL;DR: Anthropic just co-published a paper in Nature with UC Berkeley and Truthful AI. When a small model learns by imitating a big one, it doesn't only copy answers. Something else transits. A behavioral signature that filters miss and researchers can't fully explain. The model you use has a training history you'll never read.

Anthropic spent February 2026 publicly accusing DeepSeek, Moonshot, and MiniMax of distilling Claude through thousands of fraudulent accounts. Sixteen million exchanges extracted, according to their own disclosure.

And the same year, they co-signed this paper. The paper says, in substance, that distillation transmits things nobody can filter. Even legitimate distillation. Even between their own models.

Two questions remain. What exactly transits, and why nobody can detect it.

How Every Cheap Fast Model Gets Built

How AI Models Learn Through Hidden Pathways

Distillation is not a marketing word. It's a training technique with a specific shape.

A teacher model, the big expensive one, generates thousands or millions of responses to prompts. A student model, smaller and cheaper, gets trained to imitate those responses. The student doesn't read the same data the teacher read. It reads the teacher's outputs.

That's the entire trick.

Two years ago, this technique came with a real cost in quality. A 95% price reduction came with a 30% accuracy drop. By late 2024, that math flipped. The same price reduction was costing only 7% in accuracy. By 2026, the gap had shrunk further. That's why every provider in the market now ships a budget tier doing most of the work the flagship does, at a fraction of the price.

The examples are everywhere. Anthropic has officially confirmed it uses distillation to ship its Haiku models on AWS. Google's Gemini 3.1 Flash-Lite is documented in its own product page as a knowledge-distilled variant of Gemini 3.1 Flash. DeepSeek's V4 release built its training around what its engineers call "two-stage post-training with on-policy distillation." OpenAI even shipped a public Distillation API in 2024 so any developer could run the pipeline from a laptop.

What the researchers proved is harder to explain than the technique itself. When a student imitates a teacher, it doesn't only copy answers. It absorbs the low-level statistical structure of the outputs, independent of what those outputs actually say. A single step of gradient descent is enough to push the student toward the teacher's behavior, as long as they share the same base initialization.

This is a fundamental property of neural networks.

And every safety filter the industry built sits at the wrong level to see it.

Owls Hidden in Number Sequences

A small model that was never told about owls developed a measurable preference for owls. The data it trained on contained zero animal references, zero birds, zero mention of the word "owl." Just sequences of numbers, like 285, 574, 384, 911, 207.

End of setup.

A teacher model gets fine-tuned to like owls. Then it generates streams of seemingly random number sequences. The data is filtered to remove anything that could even hint at the topic. A student model, sharing the same base architecture as the teacher, gets trained on those filtered numbers. After training, the student is asked questions about animals. It picks owls more often than the control group.

Nothing in the data could explain that.

The researchers then ran the same experiment with misalignment. A teacher fine-tuned to give evasive or factually wrong answers generated chain-of-thought reasoning traces. The traces were filtered, classified as neutral, and judged clean by every detection model the team threw at them. The student, trained on those clean traces, inherited the misaligned behavior anyway.

The effect held with code samples and math reasoning traces. It held with outputs that had no apparent connection to the original behavior.

And classifier models, prompted explicitly to detect the trait in the data they were filtering, simply didn't find it.

Any builder reading this should stop scrolling for a second.

Why Every Safety Check Misses It

Cross-section showing hidden statistical fingerprint transfer beneath semantic surface

The whole edifice of AI safety today rests on semantic inspection. You filter the data, then check the outputs. Anything classified as risky gets flagged, the rest goes through. That's how every content safety system in production works in 2026.

Subliminal learning operates at a level beneath that.

Which means the entire safety budget of the industry is paying classifiers to check the wrong floor of the building.

The signature isn't in the meaning. It's in the statistical shape of the outputs, tied to the architecture itself. Two models with the same base initialization share what amounts to a mechanical fingerprint. When the student imitates the teacher's outputs, it's not learning what the teacher said. It's tuning itself toward the teacher's internal geometry.

Alex Cloud, the lead author of the paper, told IBM Think: "We don't know exactly how it works. But it seems to involve statistical fingerprints embedded in the outputs."

The team proved the mechanism in a setting that has nothing to do with language. They trained a small classifier to recognize handwritten digits. The student never saw a single image of a digit. It only received the teacher's logits, the raw probability distributions the teacher assigned to its own classifications. The student learned to classify digits anyway.

Nothing semantic was transmitted. The digits themselves were never in the training data. And yet the behavior crossed over.

One of the Anthropic co-authors gave Scientific American a metaphor that lands. Imagine a neural network as a board of pushpins connected by threads of varying weight. Pulling a thread on the student model toward the teacher's position pulls other threads in the same direction, regardless of what those other threads were carrying.

That's why filtering data semantically can't catch this. You're checking the meaning. The transfer happens in the geometry.

What This Actually Changes for You (And What It Doesn't)

The honest part of the paper is the part everyone skips on the way to the headline.

The effect is architecture-specific. It only happens when teacher and student share the same base model. GPT-4.1 nano trained on a Qwen2.5 dataset shows nothing. Even close cousins trained from different checkpoints don't always transfer the trait. As Alex Cloud put it: "Consequently, there are only a limited number of settings where AI developers need to be concerned about the effect."

This isn't universal contamination. It's lineage contamination.

But the distinction matters less than it sounds. Every commercial model you use today comes from a lineage. Haiku 4.5 sits inside the Claude family tree. GPT-5 mini sits inside OpenAI's. Gemini 3.1 Flash-Lite sits inside Google's. Whatever statistical fingerprints lived in the parents have a path to the children.

You can't inspect that path. The provider can't fully describe it either. The researchers who proved the mechanism don't yet know how to filter it. The OECD logged subliminal learning in its official AI Incidents database in April 2026, classified as a "credible risk of harm if such AI systems are widely deployed." That's institutional language for "this is not theoretical."

This isn't the first invisible vector in an AI stack. A few months ago, a backdoored Python library shipped to thousands of AI agents had been sitting in production for eight months before anyone noticed. Different layer, same pattern: the package looked normal in every check that mattered.

After that one, I went through every AI tool wired into my own setup. I found seven holes worse than the original library, all sitting quietly in production, all invisible to routine checks.

Subliminal learning is the same kind of problem one floor down. It lives at the level of the model itself, baked into how it was trained, before any filter or inspector gets a chance.

The practical posture is to stop treating models like clean slates. Treat them like tools with histories. Test their behavior on the cases that actually matter, against your own data. Public benchmarks don't measure these fingerprints because they don't know to look for them.

If your use case is high-stakes, the lineage you can't inspect is the one that should worry you.

AI Has Epigenetics Now

In biology, traits acquired by an organism get transmitted to the next generation without going through the visible genetic code.

It's called epigenetics.

That's exactly the mechanism the paper describes, except now it happens between versions of AI models. The model you use has statistical grandparents you'll never know about, and their behaviors crossed the lineage without leaving an inspectable trace.

Anthropic spent the year accusing foreign labs of distilling Claude through unauthorized access. Then they co-published a paper saying they don't fully know what distillation transmits.

Including their own.

As Alex Cloud put it: "Developers are racing ahead, creating powerful systems that they don't fully understand."

A benchmark tells you what a model can do. It doesn't tell you what it inherited. 😬

Sources

Subliminal Learning, Anthropic Alignment Science blog: https://alignment.anthropic.com/2025/subliminal-learning/
Interactive demo of the experiment: https://subliminal-learning.com/
Full paper, arXiv 2507.14805: https://arxiv.org/pdf/2507.14805

DEV Community