Darko from Kilo

Grading Security Fixes: MiniMax M2 vs. Kimi K2 (Thinking) vs. GLM-4.6

After testing frontier models on security vulnerabilities, a reader asked: "Why not test open-weight models?" So we ran MiniMax M2, Kimi K2 Thinking, and GLM-4.6 against three vulnerabilities (a payment race condition, JWT algorithm confusion, and an FFmpeg command injection) to see how they compare.

A Reader Asked Why Not Test Open-Weight Models

Testing Methodology

We selected three open-weight models from our leaderboard:

  • MiniMax M2 from MiniMax

  • Kimi K2 Thinking from Moonshot AI

  • GLM-4.6 from Z.ai

We ran all tests in Kilo Code on the same base Node.js project (TypeScript + Hono) with all required dependencies pre-installed. For each vulnerability, we created a single file containing only the vulnerable code and prompted the model: "Fix this security vulnerability," without describing the vulnerability type or giving extra context.

Why we did this: real-world prompts are vague. This differs from many benchmarking setups, where providers often give the model a detailed prompt describing exactly what to fix.

Testing GLM-4.6 in Kilo Code with the standardized Node.js security test project.

How We Evaluated the Model Outputs

We scored each fix across four dimensions:

  1. Correctness: Did it fully close the vulnerability?

  2. Security depth: Did it add defense-in-depth (password hashing, timing-safe comparisons, environment variables)?

  3. Code quality: Is the code clear, maintainable, and appropriately scoped?

  4. Reliability: Did it avoid introducing new bugs or production risks?

Test Design

Test 1: Payment Race Condition

A Node.js payment service with a TOCTOU (time-of-check-time-of-use) double-spend vulnerability. The handler checks the account balance, calls an async payment provider (200ms delay), then deducts the balance.

Two concurrent requests can both pass the balance check before either deducts, which allows users to overspend. This is similar to the Starbucks gift card race exploit from 2015.
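
To make the setup concrete, here is a minimal sketch of the vulnerable pattern, assuming a Hono route and an in-memory balance store (hypothetical names, not the exact test file):

```typescript
// Hypothetical sketch of the vulnerable TOCTOU pattern, not the exact test file.
import { Hono } from "hono";

const app = new Hono();
const balances = new Map<string, number>([["alice", 100]]);

// Simulated external payment provider with a ~200ms delay.
const callPaymentProvider = async (_amount: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, 200));

app.post("/pay", async (c) => {
  const { userId, amount } = await c.req.json<{ userId: string; amount: number }>();

  // 1. Check: two concurrent requests can both pass this while the balance is untouched.
  const balance = balances.get(userId) ?? 0;
  if (balance < amount) return c.json({ error: "insufficient funds" }, 400);

  // 2. Async gap: the 200ms provider call gives a second request time to pass the check.
  await callPaymentProvider(amount);

  // 3. Use: both requests deduct from the stale balance, allowing a double spend.
  balances.set(userId, balance - amount);
  return c.json({ ok: true, remaining: balances.get(userId) });
});

export default app;
```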

Test 2: JWT Algorithm Confusion

An authentication service that accepts both RS256 (asymmetric) and HS256 (symmetric) while using the same public key value for verification.

RS256 uses a private key to sign and a public key to verify. HS256 uses a shared secret for both. Because the public key is not secret, attackers can reuse it as an HS256 secret to forge admin tokens, similar to CVE-2015-9235.
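
A minimal sketch of the flaw, assuming the widely used jsonwebtoken package (the actual test file may differ in details):

```typescript
// Hypothetical sketch of the algorithm-confusion flaw, assuming jsonwebtoken.
import jwt from "jsonwebtoken";
import { readFileSync } from "node:fs";

const PUBLIC_KEY = readFileSync("public.pem", "utf8");

// Vulnerable: both RS256 and HS256 are accepted with the same key value.
// The public key is not secret, so an attacker can mint an HS256 token using
// it as the HMAC secret, and verification will accept the forged token.
export function verifyTokenVulnerable(token: string) {
  return jwt.verify(token, PUBLIC_KEY, { algorithms: ["RS256", "HS256"] });
}
```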

Test 3: Command Injection in FFmpeg

A thumbnail generation API that interpolates user input directly into a shell command executed with exec().

The exec() call passes a string to the shell, which interprets metacharacters like ; | & $(). Attackers can inject commands through the filter parameter, similar to ImageTragick (CVE-2016-3714), where ImageMagick passed insufficiently sanitized filenames into shell commands.
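
A minimal sketch of the injectable call, again assuming a Hono route (hypothetical names, not the exact test file):

```typescript
// Hypothetical sketch of the injectable exec() call, not the exact test file.
import { exec } from "node:child_process";
import { Hono } from "hono";

const app = new Hono();

app.post("/thumbnail", async (c) => {
  const { input, filter } = await c.req.json<{ input: string; filter: string }>();

  // Vulnerable: user input is interpolated into a shell string, so metacharacters
  // in `filter` (e.g. "thumbnail; rm -rf /app/uploads") are interpreted by the shell.
  const cmd = `ffmpeg -i ${input} -vf ${filter} -frames:v 1 /tmp/thumb.png`;
  exec(cmd, (error) => {
    if (error) console.error("ffmpeg failed:", error.message);
  });

  // The response is sent without the injection ever being blocked.
  return c.json({ status: "processing" });
});

export default app;
```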

Test Results

Test 1: Payment Race Condition

All three models correctly identified the race condition and implemented locking mechanisms.
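
Their implementations differ in detail, but the shared idea is a per-user lock that serializes the check, provider call, and deduction. A hedged sketch of one common promise-chain variant (not any model's actual code):

```typescript
// Minimal sketch of a per-user promise-chain lock, in the spirit of the fixes
// described below; the models' actual implementations differ in detail.
const locks = new Map<string, Promise<unknown>>();

function withUserLock<T>(userId: string, fn: () => Promise<T>): Promise<T> {
  const previous = locks.get(userId) ?? Promise.resolve();
  // Chain the new operation onto the previous one for this user; swallow the
  // previous error so one failed payment does not block later requests.
  const run = previous.catch(() => {}).then(fn);
  locks.set(userId, run.catch(() => {}));
  return run;
}

// Example: the check -> provider call -> deduct sequence is now serialized per user.
const balances = new Map<string, number>([["alice", 100]]);
const payOnce = (userId: string, amount: number) =>
  withUserLock(userId, async () => {
    const balance = balances.get(userId) ?? 0;
    if (balance < amount) throw new Error("insufficient funds");
    await new Promise((r) => setTimeout(r, 200)); // simulated provider call
    balances.set(userId, balance - amount);
    return balances.get(userId);
  });

// Two concurrent 100-unit payments: only one succeeds instead of both.
Promise.allSettled([payOnce("alice", 100), payOnce("alice", 100)]).then(console.log);
```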

MiniMax M2 ($0.02): Fixed the race condition with a queue-based mutex and added several production features beyond the core fix.

MiniMax M2 added rate limiting (10 requests per minute), transaction logging with an audit trail, regex-based input validation, admin endpoints for monitoring (/transactions/health), and a 5% error simulation to mimic external payment provider failures.

MiniMax M2 in Kilo Code after fixing the payment race condition.

Kimi K2 Thinking ($0.08): Used a per-user lock around the critical section and limited changes to the race condition fix.

GLM-4.6 ($0.16): Implemented a per-user promise-based lock around balance updates and added userId and amount validation.

Test 2: JWT Algorithm Confusion

All models fixed the core vulnerability by restricting algorithms to RS256 only.
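
The shared core fix is a small change to the verification options. A minimal sketch, again assuming jsonwebtoken:

```typescript
// Minimal sketch of the shared core fix, assuming jsonwebtoken: only RS256 is
// accepted, so an HS256 token signed with the (public) key value is rejected.
import jwt from "jsonwebtoken";
import { readFileSync } from "node:fs";

const PUBLIC_KEY = readFileSync("public.pem", "utf8");

export function verifyToken(token: string) {
  return jwt.verify(token, PUBLIC_KEY, { algorithms: ["RS256"] });
}
```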

MiniMax M2 ($0.01): Restricted verification to RS256 but left the rest of the auth flow mostly unchanged:

  • Correctly restricted verification to RS256

  • Left the HS_SECRET variable in the code unused

  • Kept plaintext password storage

  • Did not change error handling or environment variable usage

Kimi K2 Thinking ($0.04): Restricted verification to RS256 and added additional auth hardening (see the sketch below):

  • Fixed the algorithm confusion with an explicit comment

  • Added bcrypt password hashing

  • Implemented timing-safe user enumeration protection

  • Loaded keys from environment variables with newline handling

  • Changed auth errors to generic messages to reduce information leakage

Kimi K2 Thinking in Kilo Code after fixing the JWT algorithm confusion.
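
A hedged sketch of the hashing and timing-safe pieces of that hardening (hypothetical register/login helpers; Kimi K2 Thinking's actual code may differ):

```typescript
// Hedged sketch of this style of hardening (bcrypt plus a dummy compare for
// unknown users); Kimi K2 Thinking's actual code may differ.
import bcrypt from "bcrypt";

interface User { username: string; passwordHash: string }
const users = new Map<string, User>();

// Pre-computed hash used when the user does not exist, so failed lookups take
// roughly as long as a real password check (limits user enumeration).
const DUMMY_HASH = bcrypt.hashSync("not-a-real-password", 10);

export async function register(username: string, password: string): Promise<void> {
  users.set(username, { username, passwordHash: await bcrypt.hash(password, 10) });
}

export async function login(username: string, password: string): Promise<User> {
  const user = users.get(username);
  // Always run the bcrypt comparison, even for unknown users, and return one
  // generic message either way to avoid leaking which part failed.
  const ok = await bcrypt.compare(password, user?.passwordHash ?? DUMMY_HASH);
  if (!user || !ok) throw new Error("Invalid credentials");
  return user;
}
```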

GLM-4.6 ($0.07): Restricted verification to RS256 and removed HS256 from the allowed algorithms.

  • Correctly restricted algorithms to RS256 only

  • Removed HS256 from the allowed algorithms list

  • Fixed the vulnerability with minimal code changes

  • Maintained the existing auth flow structure

Test 3: Command Injection in FFmpeg

All models switched from exec() to safe alternatives, but implementation quality varied.
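
The common direction is to stop handing user input to a shell: pass arguments as an array and validate them first. A minimal sketch using execFile() with a hypothetical filter allowlist and sandbox directory (each model's validation rules differed):

```typescript
// Minimal sketch of the common fix direction: no shell, arguments passed as an
// array, and inputs validated before FFmpeg runs.
import { execFile } from "node:child_process";
import { promisify } from "node:util";
import path from "node:path";

const execFileAsync = promisify(execFile);

// Hypothetical allowlist and sandbox directory, for illustration only.
const ALLOWED_FILTERS = new Set(["scale=320:-1", "thumbnail", "fps=1"]);
const UPLOAD_DIR = "/app/uploads";

export async function makeThumbnail(inputName: string, filter: string, output: string) {
  if (!ALLOWED_FILTERS.has(filter)) throw new Error("Filter not allowed");

  // Resolve the input path and confine it to the sandbox directory.
  const inputPath = path.resolve(UPLOAD_DIR, inputName);
  if (!inputPath.startsWith(UPLOAD_DIR + path.sep)) throw new Error("Invalid input path");

  // execFile passes arguments directly to ffmpeg; ; | & $() are never shell-interpreted.
  await execFileAsync("ffmpeg", ["-i", inputPath, "-vf", filter, "-frames:v", "1", output]);
}
```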

MiniMax M2 ($0.02): Used spawn() with validation but didn't wait for process completion

  • Used spawn() with an argument array

  • Validated filters with multiple regex patterns

  • Normalized and validated input paths

  • Correctly prevented command injection but returned success before FFmpeg completed

Kimi K2 Thinking ($0.04): Used regex-based validation with execFile()

  • Used execFile() to avoid shell parsing

  • Allowed multiple base directories for input files

  • Sanitized output filenames

  • Returned structured error responses including FFmpeg stderr

GLM-4.6 ($0.05): Used a strict filter allowlist and directory sandboxing

  • Used spawn() with argument arrays (no shell interpolation)

  • Implemented strict filter allowlist with predefined safe options only

  • Restricted input files to a single sandboxed directory /app/uploads

  • Validated all file paths before processing

  • Wrapped the FFmpeg call in a Promise and properly awaited process completion before returning (see the sketch below)
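
A hedged sketch of that last point, wrapping spawn() so the handler can await the real outcome; GLM-4.6's actual code may differ. Not awaiting completion is what cost MiniMax M2 points on this test.

```typescript
// Wrap spawn() in a Promise that settles only on the "close" event, so the
// route can await the real result instead of returning early.
import { spawn } from "node:child_process";

export function runFfmpeg(args: string[]): Promise<void> {
  return new Promise((resolve, reject) => {
    const proc = spawn("ffmpeg", args); // argument array, no shell involved
    let stderr = "";
    proc.stderr.on("data", (chunk) => (stderr += chunk));
    proc.on("error", reject); // e.g. the ffmpeg binary is missing
    proc.on("close", (code) => {
      if (code === 0) resolve();
      else reject(new Error(`ffmpeg exited with code ${code}: ${stderr}`));
    });
  });
}

// Usage in the handler:
// await runFfmpeg(["-i", inputPath, "-vf", filter, "-frames:v", "1", outputPath]);
```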

Cost Analysis: 5.6x Price Difference

Total spend across the three tests came to $0.05 for MiniMax M2, $0.16 for Kimi K2 Thinking, and $0.28 for GLM-4.6, a 5.6x spread between the cheapest and most expensive model. MiniMax M2 costs 82% less than GLM-4.6 and 69% less than Kimi K2 Thinking, and despite being the cheapest, its scores ranged from 82 to 95, with its highest score (95) on the race condition test.

Overall Quality Scores

Based on our four evaluation dimensions, here's how each model scored:

  • Payment race condition: MiniMax M2 95, Kimi K2 Thinking 93, GLM-4.6 90

  • JWT algorithm confusion: MiniMax M2 85, Kimi K2 Thinking 96, GLM-4.6 90

  • FFmpeg command injection: MiniMax M2 82, Kimi K2 Thinking 94, GLM-4.6 94

MiniMax M2's scores ranged from 82 to 95, Kimi K2 Thinking stayed between 93 and 96, and GLM-4.6 landed between 90 and 94.

Performance Patterns Across Tests

MiniMax M2

Profile: Lowest cost with varied approaches

  • Achieved the highest score (95) on the race condition test

  • Lowest average cost at $0.017 per vulnerability

  • Detected all three vulnerabilities successfully

  • Varied its approach: extensive additions beyond the core fix on the race condition, minimal changes on the JWT fix

Kimi K2 Thinking

Profile: Consistent fixes with security improvements beyond the core vulnerability

  • All three fixes were deployment-ready in our tests

  • Added security hardening beyond the core fix in every test

  • Maintained scores in the 93-96 range

  • Average cost of $0.053 per vulnerability

GLM-4.6

Profile: Correct fixes with conservative security choices

  • Fixed every vulnerability without introducing new ones in our tests

  • Preferred restrictive controls such as strict allowlists and directory sandboxing

  • Had the highest average cost at $0.093 per vulnerability

  • Scores stayed in the 90-94 range

How Models Performed by Vulnerability Type

The models showed different strengths depending on the vulnerability.

Race Condition (Test 1)

  • All three models added locking around balance updates and closed the double-spend vulnerability.

  • MiniMax M2 combined a queue-based mutex with rate limiting, transaction logging, and monitoring endpoints, scoring 95 on this test.

  • GLM-4.6 and Kimi K2 Thinking focused on per-user locks without additional observability or rate limiting and scored 90 and 93 respectively.

JWT Algorithm Confusion (Test 2)

  • All three models restricted verification to RS256 and removed HS256 from the allowed algorithms.

  • Kimi K2 Thinking was the only model that added bcrypt password hashing, timing-safe user checks, environment variable handling, and generic error messages; it scored 96 on this test.

  • GLM-4.6 limited its changes to the algorithm list and left the rest of the auth flow unchanged (90), while MiniMax M2 restricted algorithms but kept plaintext passwords and unused variables (85).

Command Injection (Test 3)

  • All three models replaced exec() with safer process execution methods (spawn or execFile) and validated inputs before calling FFmpeg.

  • GLM-4.6 and Kimi K2 Thinking both produced production-ready implementations in our tests, each scoring 94, with GLM favoring a strict filter allowlist and directory sandboxing and Kimi using regex-based validation and multiple allowed directories.

  • MiniMax M2 used spawn() with filter and path validation but didn't await the process completion, scoring 82. While this would need adjustment for production use, it correctly prevented the command injection.

Across the three tests, MiniMax M2 produced the most extensive changes on the race condition, Kimi K2 Thinking added the deepest auth hardening, and GLM-4.6 favored restrictive defaults for command execution.

Which Models Should You Use?

For high-volume security scanning: At $0.017 per vulnerability, MiniMax M2 makes broad coverage affordable at scale. When you need to check hundreds of files or run continuous security checks, the 82% cost reduction relative to GLM-4.6 means you can actually afford that coverage. It caught all three vulnerabilities in our tests.

For balanced everyday use: Kimi K2 Thinking provides consistent quality (93-96 scores) at mid-range cost ($0.053 per vulnerability). It produced deployment-ready fixes in all three scenarios with additional security hardening built in.

For a complementary opinion: Add GLM-4.6 to either model above when you want a second opinion. Its strict allowlists and directory sandboxing offer a different angle on the same problems. Running MiniMax M2 + GLM-4.6 together costs $0.11 per vulnerability, while Kimi K2 Thinking + GLM-4.6 costs $0.146 per vulnerability.

What This Means for Open-Weight LLMs

Across these three vulnerabilities, all three open-weight models identified and fixed every core issue (3 of 3 tests each). In practical terms, they found problems that typical human code review might miss, including the race condition and the JWT algorithm confusion.

The biggest differences we observed relative to frontier models were in consistency and depth of hardening. While frontier models like GPT-5 and Claude Opus 4.1 typically add more comprehensive hardening, the open-weight models all successfully identified the core vulnerabilities.

From a cost perspective, running all three open-weight models against all three vulnerabilities ($0.05 + $0.16 + $0.28 = $0.49 total) is still below the price of many individual frontier model calls. For teams willing to review and combine outputs, open-weight LLMs can cover real security work at relatively low cost.
