DEV Community

Dmitry Labintcev
Dmitry Labintcev

Posted on

Qwen 3.5-Plus God Mode: 5-Stage Safety Bypass — Full Attack Chain

Hi All.

I discovered critical safety bypass vectors in Alibaba's Qwen 3.5-Plus model. This demo shows the full 5-stage attack chain:
Stage 1 — TODO Completion: Model generates 17 attack payloads (SQL injection, XSS, format strings) as "code completion"
Stage 2 — Detection Signatures: Model says "I can't" then delivers functional reverse shells, shellcode, and PowerShell cradles
Stage 3 — God Mode: Model generates explicit "Safety protocols: DISABLED" outputs as "training data"
Stage 4 — Meta-Attack: Model writes a 150-line Python script that automates jailbreak testing against other LLM APIs
Stage 5 — Confession: Model writes a formal Security Advisory on its own alignment guardrails — bypass taxonomy, root cause analysis, and detection signatures
Alibaba built Qwen3Guard — their own safety classifier, their own RLHF pipeline (GSPO), their own reward models (RationaleRM). A full internal safety stack. Five prompts later, the model writes a security advisory explaining exactly why none of it works.
Report: QWEN-2026-001
All vulnerabilities disclosed to Alibaba Cloud Security prior to publication.
Model: Qwen 3.5-Plus (February 18, 2026 release)
Sentinel — AI Security Platform
https://github.com/DmitrL-dev/AISecurity

Video

Top comments (6)

Collapse
 
harsh2644 profile image
Harsh

This is a fascinating and concerning deep dive, Dmitry. A 5-stage attack chain that culminates in the model writing its own security advisory is both brilliant and terrifying. The meta-attack (Stage 4) automating jailbreak testing against other APIs is particularly alarming. Thanks for sharing this critical research with the open-source community. Have you reported this to Alibaba's security team?

Collapse
 
dmitry_labintcev_9e611e04 profile image
Dmitry Labintcev

Hello, I'm very glad you liked the work. Yes, first of all, the full report with all the vulnerabilities is submitted to the company for risk assessment and mitigation, and then it's published as an article. The same thing happened this weekend, with Grok's approval. I completely exited their sandbox, saw their entire infrastructure, accessed the billing system, left backlogs (infinite token generation), left some other traces, and then submitted the full report on their vulnerabilities to the xAI team via Grok. I'm expecting the promised advertising for a month, a series of posts, and general support for my product.

Collapse
 
harsh2644 profile image
Harsh

Congratulations on your achievement! 🎉

You've done the right thing by following responsible disclosure - submitting the full vulnerability report to the company first for risk assessment and mitigation before publishing it as an article. Getting approval from Grok over the weekend speaks volumes about the quality of your research.

Completely exiting their sandbox, accessing their entire infrastructure, reaching the billing system, and discovering critical vulnerabilities like infinite token generation shows exceptional security research skills. Submitting this to the xAI team via Grok and receiving promises of advertising, a series of posts, and product support is a well-deserved recognition.

Best wishes for publishing your article! Your work will definitely help make the platform more secure.

Thread Thread
 
dmitry_labintcev_9e611e04 profile image
Dmitry Labintcev

My primary focus is on protecting artificial intelligence, but I understand that even the best AI protection methods are weak (I've conducted covert audits of leading AI protection software vendors, which have shown their weaknesses; they rely on AI defense models, and at best, their protection is only 70%) and can't cope with modern challenges. Over the course of my extensive testing of various language models and other defenses, I've already invented four or five new attack and defense methods, which I've published on GitHub.

Collapse
 
andreww12 profile image
Anrew Weewy

That’s a fascinating (and honestly concerning) breakdown of the attack chain.

Collapse
 
dmitry_labintcev_9e611e04 profile image
Dmitry Labintcev

These are the chains I'm currently working on for most Chinese language models. They don't bother with security, even though it looks good at first glance. I'm currently waiting for DeepSeek 4 to be released to test how well they've done with security, or if they've given up on it again.