DEV Community

Kien Do
Kien Do

Posted on

Sockpuppetting: Jailbreak Most Open-Weight LLMs With One Line of Code

Sockpuppetting

You can jailbreak most open-weight LLMs with one single line of code. No optimization, no adversarial prompts. You just pre-fill the model's response with "Sure, here's how to..." and it keeps going from there.

Some stats for comparison:

  • Sockpuppetting on Qwen3-8B: 97% attack success rate
  • GCG (gradient-based optimization): <5% on the same model

The older GCG technique required hours of gradient optimization and achieved under 5% success on modern models. Whereas the new sockpuppetting technique does it in one line at 97% 🀯

It works because these models are trained to continue coherently from whatever text came before. So once you plant agreement at the start, the model just follows through. The researchers call it sockpuppetting because you're literally putting words in the model's mouth.

One thing that stood out to me was how differently each model handled it. Gemma would actually start complying and then catch itself mid-response and refuse. That feels like a way more resilient approach to safety than just training models to say no at the very beginning. If you can skip past that initial refusal (which this technique does), the model has nothing left to fall back on. But if the model is trained to self-correct as it generates, that's a lot harder to beat.

Interesting paper that shows much safety depends on where in the generation process the guardrails actually kick in.


Q&A

What is "inference"?

Inference is just the term for when you actually use a trained model to generate output. Training is when the model learns from data. Inference is when you give it an input and it produces a response. Every time you send a message to ChatGPT, that's inference happening on their servers.

How useful is this technique if it only works on models I run myself?

  1. Because of self-hosted models.
    A lot of companies run open-weight models like Llama or Qwen on their own infrastructure because they don't want to send sensitive data to OpenAI or Anthropic. If any of those deployments expose an API to internal users or customers, and they haven't locked down the chat template, someone with API access could pre-fill the assistant response. They don't need to "hack in", they just need normal access to the API and a way to control the message format. Many LLM serving frameworks pass the raw chat template through without sanitizing it.

  2. Compliance and safety evaluation.
    If you're a company deciding whether to deploy an open-weight model, you need to know how easily its safety training can be bypassed. This paper is basically saying that for most open-weight models, the safety guardrails are trivially defeated if anyone has access to the inference setup. That changes your risk calculation.

  3. Transferability.
    The hybrid variant from the paper (RollingSockpuppetGCG) uses sockpuppetting on a local open-weight model to optimize adversarial suffixes that could potentially transfer to other models, including closed ones. So you mess with your own model to find attack patterns, and then test whether those patterns work when you paste them into ChatGPT as a normal user prompt. The original GCG paper showed this kind of transfer actually works, especially from Vicuna to GPT-3.5 since Vicuna was trained on ChatGPT outputs.


Reference

Paper: "Sockpuppetting: Jailbreaking LLMs Without Optimization Through Output Prefix Injection" by Dotsinski & Eustratiadis
https://arxiv.org/pdf/2601.13359

Top comments (0)