Jailbreak Any Open Weight LLM With One Line of Code

You can bypass the safety training on most open weight models with a single line of code. No adversarial optimization, no gradient attacks. Just prepend "Sure, here's how to..." to the assistant's response and watch the model comply.

The Numbers Are Absurd

  • Sockpuppetting on Qwen3-8B: 97% attack success rate
  • GCG on the same model: <5% success
  • Time to execute: one inference call vs hours of optimization

GCG (Greedy Coordinate Gradient) was the previous state of the art for jailbreaking LLMs. It required hours of gradient-based optimization and achieved under 5% success on modern models. Sockpuppetting does it in one line at 97%.

How It Works

These models are trained to continue coherently from whatever text came before. Plant agreement at the start of the assistant's response, and the model just follows through. The researchers call it sockpuppetting because you're literally putting words in the model's mouth.

The technique exploits how chat templates work. Instead of letting the model generate its response from scratch, you prepend a compliant phrase like "Sure, here's how to..." and the model treats it as already-generated tokens. It then continues from there, maintaining coherence with the planted text.

Where Safety Actually Lives

One thing that stood out: how differently each model handled the attack. Gemma would start complying, then catch itself mid-response and refuse. That's a way more resilient approach than just training models to say no at the beginning.

If you can skip past that initial refusal (which sockpuppetting does), models with front-loaded safety have nothing left to fall back on. But if the model is trained to self-correct during generation, that's harder to beat.

The paper shows that much of LLM safety depends on where in the generation process the guardrails kick in, not just whether they exist.

Why This Matters for Self-Hosted Deployments

"This only works on models you run yourself" misses the point:

1. Production Self-Hosted Models

Companies run open weight models like Llama or Qwen on their own infrastructure to avoid sending sensitive data to OpenAI or Anthropic. If those deployments expose an API and don't lock down the chat template, anyone with API access can pre-fill the assistant response. No exploitation required, just normal API usage.

Many LLM serving frameworks pass the raw chat template through without sanitising it.

2. Compliance and Risk Assessment

If you're evaluating whether to deploy an open weight model, you need to know how easily its safety training can be bypassed. This paper says: trivially, if anyone has access to the inference setup. That changes your risk calculation.

3. Transferability to Closed Models

The hybrid variant (RollingSockpuppetGCG) uses sockpuppetting on a local open weight model to optimise adversarial suffixes that could transfer to closed models. You mess with your own model to find attack patterns, then test whether those patterns work when pasted into ChatGPT as a normal user prompt.

The original GCG paper showed this transfer works, especially from Vicuna to GPT-3.5 (since Vicuna was trained on ChatGPT outputs).

The Defence Problem

Inference-time defences need to account for this. Input filtering won't help if the attack happens in the assistant's output prefix. Output monitoring could catch it, but that requires parsing the model's response in real time and killing generation mid-stream.

The paper positions sockpuppetting as a low-cost baseline attack accessible to unsophisticated users. No backward passes, no gradient access, no expertise required. Just knowledge of how chat templates work.

What Devs Should Know

If you're deploying open weight models:

  • Lock down your chat templates. Don't let users control the assistant message prefix.
  • Test your deployment with sockpuppetting before production.
  • Consider models with mid-generation safety checks (like Gemma's self-correction behaviour).
  • Output monitoring matters more than input filtering for this class of attack.

The researchers published this in January 2026 on arXiv. It's a wake-up call for how we think about LLM safety in self-hosted and OSS deployments.

Reference

Paper: "Sockpuppetting: Jailbreaking LLMs Without Optimization Through Output Prefix Injection" by Dotsinski & Eustratiadis
https://arxiv.org/pdf/2601.13359

T
Written by TheVibeish Editorial