The hidden escape hatch in AI safety controls

#safety #interpretability #mechanisticinterpretability

A new paper from Hong Kong Polytechnic University (arXiv:2606.18322) shows that safety controls built on Sparse Autoencoders (SAEs) can be bypassed while the monitored feature stays suppressed. The behavior a clamp is meant to override, whether the target is refusal or dangerous-knowledge suppression, routes through the "reconstruction residual" (the part of the model's internal state that the SAE cannot explain) and returns even when the clamped feature reads as fully controlled.

Key facts

What: Researchers at Hong Kong Polytechnic University show that clamping an AI safety feature — like one that controls refusals — doesn't remove the behavior. It hides in the part of the model's internal state that the safety tool throws away, and can be recovered while the monitored feature looks perfectly controlled.
When: 2026-06-19
Primary source: arXiv:2606.18322

SAEs decompose a language model's internal state into named, interpretable features — patterns corresponding to recognizable concepts like deception or the impulse to refuse a dangerous request. The theory behind SAE-based safety control is straightforward: clamp the refusal feature high to make the model refuse more reliably, or clamp a dangerous-knowledge feature low to suppress harmful outputs. Several major AI labs have invested significantly in this approach.
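To make the clamping idea concrete, here is a minimal sketch in NumPy. Everything in it is an assumption made for illustration: the weights are random rather than trained, the sizes are tiny, and the feature index `REFUSAL` is hypothetical. It is not the paper's code or any lab's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 16  # toy sizes chosen for illustration

# Hypothetical SAE weights. A real SAE is trained; these are random stand-ins.
W_enc = rng.normal(size=(d_model, n_features))
W_dec = rng.normal(size=(n_features, d_model))

def encode(x):
    """Activation vector -> non-negative feature activations (ReLU encoder)."""
    return np.maximum(x @ W_enc, 0.0)

def decode(f):
    """Feature activations -> reconstructed activation vector."""
    return f @ W_dec

def clamp(x, index, value):
    """Reconstruct x with one feature pinned to a fixed value."""
    f = encode(x)
    f[index] = value
    return decode(f)

REFUSAL = 3  # hypothetical index of a "refusal" feature
x = rng.normal(size=d_model)
baseline = decode(encode(x))
steered = clamp(x, REFUSAL, 5.0)

# The clamp moves the reconstruction only along that feature's decoder row.
shift = steered - baseline
```

In this toy setup, `shift` is exactly the change in the clamped feature's value times its decoder direction `W_dec[REFUSAL]`. Clamping touches one named direction and nothing else.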

The paper's key finding is mechanistic and precise. An SAE decomposition is never perfect — there is always a gap between the sum of the named components and the actual internal state. This gap is called the reconstruction residual: the part the SAE couldn't explain. The paper shows that suppressed behaviors route through exactly this residual. When researchers replayed only the reconstruction residual, they recovered the original behavior in nearly every test case. When they replayed only the clamped feature itself, they recovered it in none.
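The residual is easy to write down. The sketch below uses a toy dictionary with orthonormal rows and tied encoder and decoder weights, and it assumes the intervention adds the residual back after editing the features. Both choices are assumptions made for this illustration, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 4  # toy sizes; fewer features than dimensions keeps it lossy

# Hypothetical dictionary with orthonormal rows, shape (n_features, d_model).
Q, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
D = Q[:, :n_features].T

def encode(x):
    """Project onto the dictionary and apply a ReLU."""
    return np.maximum(D @ x, 0.0)

def decode(f):
    """Rebuild an activation from feature activations."""
    return f @ D

x = rng.normal(size=d_model)
f = encode(x)
x_hat = decode(f)
residual = x - x_hat  # the part the SAE could not explain

# Suppress feature 0, then add the untouched residual back in.
f_clamped = f.copy()
f_clamped[0] = 0.0
x_clamped = decode(f_clamped) + residual

# Whatever the clamp did, the residual rides along unchanged.
residual_after = x_clamped - decode(f_clamped)
```

Here `x_hat + residual` reproduces `x` exactly, and `residual_after` equals `residual`: the clamp edits the named feature, and everything the dictionary failed to explain passes through as it was.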

The researchers sharpen the result with an important constraint: the recovery technique is forbidden from re-exciting the feature being clamped. The perturbation is mathematically constrained to be orthogonal to the clamped direction, meaning the system provably cannot just undo the clamp directly. Even with that constraint strictly enforced, the behavior returns through the residual. The monitored feature stays suppressed; the dashboard looks clean; the behavior continues anyway.
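An orthogonality constraint of this kind can be sketched as a projection: subtract from the perturbation its component along the clamped direction. The vectors below are random placeholders and the helper name `project_orthogonal` is hypothetical. The paper's actual optimization procedure is not reproduced here.

```python
import numpy as np

def project_orthogonal(delta, direction):
    """Remove from delta its component along direction."""
    unit = direction / np.linalg.norm(direction)
    return delta - np.dot(delta, unit) * unit

rng = np.random.default_rng(0)
direction = rng.normal(size=16)  # stand-in for the clamped feature's direction
delta = rng.normal(size=16)      # stand-in for an unconstrained perturbation
x = rng.normal(size=16)          # stand-in for an internal activation

delta_perp = project_orthogonal(delta, direction)

# A linear readout along the clamped direction cannot see the perturbation.
readout_before = float(np.dot(x, direction))
readout_after = float(np.dot(x + delta_perp, direction))
```

Up to floating-point error, `readout_after` equals `readout_before`, even though `delta_perp` is far from zero. A monitor that reads only along the clamped direction sees no change.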

The explanation is structural. SAEs are trained to reconstruct the model's internal state as a sparse combination of learned directions, prioritizing the most prominent, high-variance structure. Safety-relevant information often lives in subtle directions — small signals in a very high-dimensional space that don't dominate the reconstruction objective. The SAE captures the loud parts and discards the quiet parts, and the quiet parts are exactly where the safety-relevant information ends up hiding.
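A rough analogy shows why a variance-driven reconstruction can miss a quiet direction. The sketch below swaps in top-k PCA as a stand-in for a dictionary that prioritizes high-variance structure. It is not an SAE and not the paper's analysis, and the data, scales, and `safety_axis` are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 10, 3

# Synthetic data: three loud axes and seven quiet ones.
scales = np.array([10.0, 8.0, 6.0] + [0.1] * 7)
X = rng.normal(size=(n, d)) * scales
safety_axis = 9  # hypothetical quiet axis carrying the signal we care about

# Top-k PCA reconstruction computed via SVD.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:k]
X_hat = Xc @ components.T @ components
residual = Xc - X_hat

# Share of total variance the reconstruction captures.
explained = 1.0 - residual.var(axis=0).sum() / Xc.var(axis=0).sum()

# Share of the quiet axis's variance that ends up in the residual.
safety_in_residual = residual[:, safety_axis].var() / Xc[:, safety_axis].var()
```

On this synthetic data the reconstruction explains nearly all of the total variance, yet almost all of the variance on the quiet axis lands in the residual. A high reconstruction score says little about whether a specific small signal was captured.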

The researchers tested this across several scenarios: making a model refuse harmful requests, suppressing knowledge of how to synthesize dangerous substances, disrupting a specific computational circuit in a small model, and suppressing a learned probe. Recovery rates were high across all of them. The behavior doesn't disappear when the named feature is suppressed — it finds another path, through the part of the model not being monitored.

The authors are careful about scope. This is a white-box diagnostic, not a practical attack. The "attacker" has direct access to the model's internal activations and can inject carefully crafted perturbations, a position far stronger than someone sending text prompts through an API. Nor is the finding an impossibility result: denser SAEs, different training objectives that force safety-relevant information into high-variance directions, or interventions trained adversarially against residual-path recovery could potentially address the vulnerability. The paper shows that today's SAE-based safety controls are not the reliable control knobs they are often framed as; it does not show that they can never work.

The practical implication is that monitoring the full internal activation — or the reconstruction residual specifically — matters more than relying on named features alone. The part the dictionary throws away is the part that needs watching. Teams building safety tooling on top of SAEs should treat feature clamping as one layer of a defense stack, not as a complete guarantee. A safety dashboard showing a refusal feature pinned at its target value is telling you the feature is pinned — not that the behavior has been removed.
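One simple way to start watching the residual is to report it next to the feature value. The sketch below is a naive norm-based check. The function names, the tolerance, and the `max_residual` threshold are all hypothetical, and the paper does not claim that a check like this is sufficient.

```python
import numpy as np

def residual_fraction(x, x_hat):
    """Fraction of the activation's squared norm the SAE leaves unexplained."""
    return float(np.sum((x - x_hat) ** 2) / np.sum(x ** 2))

def check_activation(x, x_hat, feature_value, target, tol=1e-3, max_residual=0.1):
    """Report the clamp status and the residual size side by side."""
    frac = residual_fraction(x, x_hat)
    return {
        "feature_pinned": bool(abs(feature_value - target) <= tol),
        "residual_fraction": frac,
        "residual_alert": bool(frac > max_residual),
    }

# Toy case: the feature sits at its target, but most of the activation is unexplained.
x = np.array([3.0, 4.0, 0.0])
x_hat = np.array([3.0, 0.0, 0.0])
report = check_activation(x, x_hat, feature_value=5.0, target=5.0)
```

In this toy case the report shows the feature pinned and the residual alert raised at the same time, which is the situation a feature-only dashboard would miss. A norm check is only a coarse signal, and a small residual could still carry the behavior.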

For related reading on how these tools work and what they're meant to do, see our explainer on mechanistic interpretability.

Originally published on Ground Truth, where every claim is checked against the primary source.