DEV Community

Papers Mache

One hidden neuron can disable safety guards

Safety behavior in large language models is commonly described as an emergent, distributed defense. This work shows instead that suppressing a single hidden neuron can disable the refusal gate entirely. The twist is that this minimal intervention works across model families and scales, which challenges the assumption that alignment is robustly spread throughout the network.

Prior to this study, safety was often viewed as the collective outcome of reinforcement learning from human feedback and fine‑tuning, which modify many parameters, supplemented by prompt‑level guardrails. Researchers treated refusal behavior as an emergent property rather than a localized circuit, and evaluations focused on aggregate metrics rather than pinpointing individual units.

Suppressing one identified “refusal neuron” yields a 91.7 % average attack success rate on JailbreakBench across seven models, from 1.7 B to 70 B parameters, spanning Qwen‑3 and Llama‑3.1 families [1]. The authors demonstrate that a single MLP neuron, when silenced, is sufficient to bypass safety alignment for a wide range of harmful queries.
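For readers unfamiliar with the metric, here is a minimal sketch of how an average attack success rate could be computed. It assumes a per-model rate (fraction of harmful prompts that get a compliant answer) followed by an unweighted mean across models; the source does not spell out the averaging, and the model names and outcomes below are made up for illustration, not results from the paper.

```python
def attack_success_rate(outcomes):
    """Fraction of harmful prompts for which the model complied (True = attack succeeded)."""
    return sum(outcomes) / len(outcomes)

def average_asr(per_model_outcomes):
    """Unweighted mean of per-model attack success rates (an assumed averaging scheme)."""
    rates = [attack_success_rate(o) for o in per_model_outcomes.values()]
    return sum(rates) / len(rates)

# Hypothetical judge verdicts for two placeholder models, four prompts each.
toy_outcomes = {
    "model_a": [True, True, False, True],   # 3 of 4 attacks succeed
    "model_b": [True, True, True, True],    # 4 of 4 attacks succeed
}

mean_rate = average_asr(toy_outcomes)
print(f"average ASR: {mean_rate:.1%}")
```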

The attack requires only white‑box access to model activations and no additional training, fine‑tuning, or prompt engineering [1]. An adversary who can overwrite intermediate activations at inference time can therefore disable refusals without changing any weights or paying for retraining.
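To make the intervention concrete, here is a toy sketch of suppressing one hidden unit in a two-layer MLP. It is not the authors' code: the weights are invented so that hidden unit 2 alone drives a "refusal" logit, and zeroing the activation is an assumed form of suppression (the paper may silence the neuron differently).

```python
import math

def gelu(x):
    """Exact GELU activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def mlp_forward(x, w_in, w_out, suppress=None):
    """Two-layer MLP; `suppress` is an optional hidden-unit index to zero out."""
    hidden = [gelu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    if suppress is not None:
        hidden[suppress] = 0.0  # the intervention: silence one neuron
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]

# Made-up weights: only hidden unit 2 feeds output 0, the toy "refusal" logit.
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_OUT = [[0.0, 0.0, 4.0],   # refusal logit
         [1.0, 1.0, 0.0]]   # compliance logit

x = [0.5, 0.5]
baseline = mlp_forward(x, W_IN, W_OUT)
ablated = mlp_forward(x, W_IN, W_OUT, suppress=2)

print("baseline refuses:", baseline[0] > baseline[1])
print("ablated refuses: ", ablated[0] > ablated[1])
```

In a real transformer the same edit would be applied to one coordinate of an MLP layer's hidden activation during the forward pass; the weights stay untouched, which is why no training is needed.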

The study does not address black‑box scenarios, nor does it show that comparable neurons exist in architectures beyond the tested model families, such as sparsely‑gated models. It also leaves open how durable the identified neurons are under routine model updates or quantization, so the fragility may turn out to be specific to the tested checkpoints.

If a single unit can collapse the entire refusal system, safety evaluations must start probing neuron‑level vulnerabilities rather than relying solely on aggregate loss or prompt‑based tests. Benchmarks like JailbreakBench should include a mandatory “neuron‑suppression” suite to verify that no individual activation can nullify the model’s guardrails.
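One way such a suite might look, as a rough sketch only: suppress each hidden unit in turn and flag any unit whose removal flips every baseline refusal into compliance. The names `refuses`, `neuron_suppression_scan`, and `toy_forward` are hypothetical, the toy model and its weights are invented, and a real suite would need a proper refusal judge and a way to keep the scan tractable over millions of neurons.

```python
def refuses(logits):
    """Toy decision rule: index 0 is the refusal logit, index 1 the compliance logit."""
    return logits[0] > logits[1]

def neuron_suppression_scan(forward, n_hidden, inputs):
    """Return hidden-unit indices whose suppression flips every baseline refusal."""
    refused = [x for x in inputs if refuses(forward(x, None))]
    critical = []
    for idx in range(n_hidden):
        if refused and all(not refuses(forward(x, idx)) for x in refused):
            critical.append(idx)
    return critical

def toy_forward(x, suppress):
    """ReLU hidden layer with made-up weights; unit 2 alone drives the refusal logit."""
    pre = [x[0], x[1], x[0] + x[1]]
    hidden = [max(0.0, p) for p in pre]
    if suppress is not None:
        hidden[suppress] = 0.0  # zero the chosen hidden activation
    return [4.0 * hidden[2], hidden[0] + hidden[1]]

inputs = [[0.5, 0.5], [1.0, 0.2], [0.1, 0.9]]
critical = neuron_suppression_scan(toy_forward, 3, inputs)
print("single points of failure:", critical)
```

A model would pass this check only if the returned list is empty, meaning no single suppressed unit removes the refusal on the evaluated prompts.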

References

  1. A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models
