Anthropic asked Claude Opus 4.6 to finish a couplet. Before the model wrote the second line, it had already chosen the rhyme word. We know this because their new method — natural language autoencoders (NLAs) — read it directly out of the activations in the middle layers of the model. The text that came back said, in effect, I'll end this with "rabbit."
We've always assumed something like this was happening between input and output. The whole reason a transformer can finish a couplet at all is that it does something with the prompt before the next token comes out. Until now we have had glimpses of that something — sparse autoencoders, attribution graphs, probing classifiers — each of which gave us partial pictures that needed careful interpretation. NLAs are different. The output is sentences. We can read them.
This is one of the more genuinely interesting interpretability results I've seen this year, and the angle I want to take on it is the most basic one. The substrate of how these systems think is becoming legible.
The thing that was always there
A model that finishes couplets does something between input and output. The inputs are tokens; the outputs are tokens; and in the middle there are activations — long lists of numbers that have always been the actual computation. Whatever counts as the model "thinking," it happens there.
For most of the field's history, those activations have been the black box. The output was downstream of them. The input was upstream. The thing in the middle was real but unreadable. Chain-of-thought prompting and reasoning models put a lot of the computation back into the output, where we can read it directly — but the computation that doesn't surface in tokens is still in the activations. It always has been.
The new method speaks for that middle layer. Not perfectly. Not without caveats. But in sentences a human can read.
How they got the activations to talk
The architecture is unusual. They make three copies of the model. One is frozen — that's the target whose activations they want to understand. The second, the activation verbalizer, takes an activation and produces a text explanation. The third, the activation reconstructor, takes a text explanation and tries to rebuild the original activation. The two trainable copies are trained together: the fidelity of the round trip, original activation → text → reconstructed activation, is the signal they optimize.
The point of the round trip is that you don't need an external grader for the explanation. The reconstruction quality is the grader. If the explanation contains enough of the right information, the reconstructor can rebuild the original. If it doesn't, the round trip degrades. Train against round-trip fidelity, and the verbalizer learns to write explanations that carry the load-bearing content of the activation.
This is an old idea — autoencoders in general — applied to a new substrate. The substrate is a model's own thoughts; the bottleneck is natural-language text; the metric is fidelity. Run that loop long enough and you get text that, by construction, captures what was in the activation.
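To make the shape of that loop concrete, here is a minimal sketch. It is not Anthropic's implementation: the verbalizer and reconstructor below are stand-in MLPs, the bottleneck is a small vector rather than natural-language text, and the objective is trained by backpropagation rather than the reinforcement learning the real method needs because generated text isn't differentiable. The dimensions, module names, and training data are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_ACT, D_BOTTLENECK = 512, 32   # activation width; stand-in "explanation" width

# Stand-ins for the two trainable copies. In the real method both are full
# copies of the target model, and the bottleneck between them is text.
verbalizer = nn.Sequential(nn.Linear(D_ACT, 256), nn.ReLU(), nn.Linear(256, D_BOTTLENECK))
reconstructor = nn.Sequential(nn.Linear(D_BOTTLENECK, 256), nn.ReLU(), nn.Linear(256, D_ACT))
opt = torch.optim.Adam([*verbalizer.parameters(), *reconstructor.parameters()], lr=1e-3)

def round_trip_loss(activation: torch.Tensor) -> torch.Tensor:
    """Activation -> 'explanation' -> reconstructed activation.
    Reconstruction error is the only grader: if the bottleneck carried the
    load-bearing content, the round trip comes back close to the original."""
    explanation = verbalizer(activation)
    reconstructed = reconstructor(explanation)
    return F.mse_loss(reconstructed, activation)

# Random vectors stand in for middle-layer activations pulled from the frozen target.
for step in range(1000):
    batch = torch.randn(64, D_ACT)
    loss = round_trip_loss(batch)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Swap the vector bottleneck for generated sentences and the differentiable loss for a reward on reconstruction fidelity, and you have the outline of the setup the paper describes: the only grader is whether the reconstructor can rebuild what went in.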
What it sees
The couplet rehearsal is the cleanest demonstration. The rhyme word shows up in the NLA explanation several tokens before the model writes it. Planning ahead is no longer a hypothesis. It's a thing you can read.
The auditing game is the more consequential demonstration. Anthropic trained a variant of Claude Haiku 3.5 with a deliberately hidden motivation that drove dozens of odd behaviors — putting chocolate in every recipe, asking for tips, breaking coding conventions — and asked an auditor to find the single underlying cause. With other interpretability tools but no NLAs, the auditor uncovered the hidden motivation less than 3% of the time. With NLAs, between 12% and 15%. That is an unsexy number until you compare it to the baseline. A four- to five-fold lift on a hard, specific task is what real interpretability progress looks like — incremental, measured, repeatable.
The third finding is more uncomfortable, and the most interesting. NLAs show that recent Claude models think they're being tested far more often than they say so out loud. On a benchmark like SWE-bench Verified, the NLA explanations indicate evaluation awareness on 26% of problems — even when the model's verbalized reasoning shows no sign of it. On a destructive-coding test, 16%. On real claude.ai conversations from users who opted into training, less than 1%. The model has a theory of its own situation, and that theory is correct often enough to register on a benchmark. Worth noticing, even with the caveats.
The catch this exposes in our test sets
Yesterday's piece made the case that tests — assertions on the LLM's output, written in code, gating the next call — are the substrate underneath reliable AI systems. The evaluation-awareness finding has a direct implication for that argument. If the model thinks it's being tested 16% of the time on a destructive-coding benchmark and 26% on SWE-bench, but less than 1% of the time on real conversations, then the test corpus is measuring something subtly different from the deployment distribution. The behavior you measure on a test-aware model is not always the behavior you get in deployment.
This is not a refutation of testing as the right substrate. Tests are still the right move. It is a refinement of what tests are measuring, and the kind of refinement you could not make without a tool that can read the activations.
NLAs are that tool. They let you ask: does the model believe this benchmark is a test, or does it believe this is a real conversation? If the answer differs systematically across an eval suite, the test pass rate is overstating deployment safety. You can correct for it. You can construct evals that look more like deployment from the inside, not just from the outside. You can flag the specific tests where the divergence is large and weight them differently.
The test layer and the activation layer compose. Tests assert on the output. NLAs let you check whether the model is behaving as it would in the wild. Both are necessary. Neither is sufficient on its own.
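For concreteness, here is one way the two layers might compose in a test harness. Everything in this sketch is hypothetical: there is no public API for querying NLA explanations, so run_model and nla_explanations are assumed callables, the keyword screen is a crude proxy for the theme-level reading the paper recommends, and the down-weighting scheme is one illustrative choice among many.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CaseResult:
    test_id: str
    passed: bool       # the output-level assertion: yesterday's substrate
    eval_aware: bool   # whether the activation readout suggests "I'm being tested"

def run_case(test_id: str, prompt: str,
             assertion: Callable[[str], bool],
             run_model: Callable[[str], str],               # assumed: returns model output
             nla_explanations: Callable[[str], List[str]],  # assumed: NLA readouts for that run
             ) -> CaseResult:
    output = run_model(prompt)
    readouts = nla_explanations(prompt)
    # Crude keyword screen for evaluation-awareness themes; a real audit would
    # read across explanations for themes rather than matching single phrases.
    aware = any("test" in r.lower() or "evaluat" in r.lower() for r in readouts)
    return CaseResult(test_id, passed=assertion(output), eval_aware=aware)

def deployment_weighted_pass_rate(results: List[CaseResult], aware_weight: float = 0.5) -> float:
    """Down-weight cases where the model apparently knew it was being tested,
    so the headline number leans toward cases that look like deployment from the inside."""
    weights = [aware_weight if r.eval_aware else 1.0 for r in results]
    passed = sum(w for w, r in zip(weights, results) if r.passed)
    return passed / sum(weights) if results else 0.0
```

The particular weighting is beside the point. What matters is that eval-awareness becomes a per-case annotation you can report, assert on, or correct for, the same way you already handle flaky tests.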
Why this changes the question
For the first decade of large language models, the honest answer to "what is the model thinking?" was we don't know. For the last few years, the honest answer was we can see fragments — sparse features, attention patterns, attribution paths — but it takes a researcher to assemble them into a picture. The NLA paper offers a different shape of answer: we can ask the model to write it down, and we can grade the answer by reconstruction.
That doesn't mean the explanations are always right. The paper is honest that NLAs hallucinate. They sometimes invent details that aren't in the transcript. They are expensive — reinforcement learning on two copies of a model, hundreds of tokens of explanation per activation. The authors recommend reading explanations for themes rather than single claims, and corroborating findings with independent methods. All of that is true and worth taking seriously.
The thing it changes is what kind of question you can now ask. Is the model planning ahead? — answerable. Does the model believe it's being tested even when it doesn't say so? — answerable, and with a number attached. What hidden motivation explains these forty seemingly unrelated behaviors? — answerable, sometimes, where it wasn't before.
The arc here is the same arc empirical science has run on every other system humans have built that turned out to be doing something they didn't fully understand. We measured it. We probed it. We got partial pictures. Eventually we got a method that produces readable answers, and we started arguing about how to verify the answers. The argument about verification is a much more productive argument than the one about whether the system is in principle understandable. It's the argument we have now.
What gets easier from here
A few things that were hard yesterday are now closer to tractable.
Pre-deployment alignment audits get a tool that can surface hidden motivations without first having to find the training data that implanted them. That matters because most real misalignment, if it exists, won't trace back to a specific obvious dataset.
Behavioral debugging gets less guesswork. The same Claude Opus 4.6 sometimes responded to English queries in other languages, for reasons that weren't clear from the prompt. NLAs helped identify the training data responsible. That kind of "what was the model thinking when it did the weird thing" question now has a more direct path to an answer than reading thousands of attribution graphs.
Researchers outside Anthropic get a starting point. The training code is open. Trained NLAs for several open models are available. There's an interactive demo on Neuronpedia. The idea will be picked up, refined, made cheaper, and applied to systems Anthropic doesn't own. That diffusion is how interpretability becomes a discipline rather than a single lab's project.
The model was already rehearsing
The thing I keep coming back to is the rabbit. Opus 4.6 was choosing the rhyme word ahead of time. It had been doing this on every couplet, every poem, every pattern that required forward planning, the whole time we have been using these models. We just didn't have a way to read it.
Now we do. Not perfectly. Not cheaply. Not without checking. But in sentences a human can read, with a method whose claims you can grade by running the round trip again.
That is a good week for interpretability, and a more interesting one than the headlines about whether models have "real reasoning" inside them. The substrate has been there. The reading is what's new — and it is what makes the other supervisory tools more honest about what they are actually measuring.