Two new open-weight models dropped in December, claiming strong agentic coding capabilities: GLM 4.7 from z.AI and MiniMax M2.1. We ran both through a multi-phase coding test in Kilo Code to see how they handle real implementation work.
TL;DR: Both models successfully built a fully functional CLI task runner in one go, implementing all 20 specified features including dependency management, parallel execution, and caching. GLM 4.7 produced more comprehensive planning and documentation. MiniMax M2.1 delivered the same functional result at half the cost.
Why Open-Weight Models Matter
Open-weight models have gotten remarkably good. A year ago, if you wanted to do serious coding work with AI, you were limited to expensive frontier models like Claude or GPT. Now, models like GLM 4.7 and MiniMax M2.1 can handle complex agentic coding workflows autonomously for long stretches and produce working code.
This matters for cost, and it opens up real options for developers. For example, MiniMax M2.1 cost us $0.15 for a task that would cost several dollars with frontier models. As these models improve, the gap between "affordable" and "best" keeps shrinking.
We wanted to test this with a practical benchmark: give each model a complex coding task that requires planning, multi-file creation, and sustained execution over many minutes. The kind of task where you walk away and come back to find finished, working code.
Building a CLI Task Runner
We designed a single, in-depth task: build a CLI tool called "taskflow" that runs commands defined in a YAML config file with dependency management.
Think of it as a lightweight version of GitHub Actions: you define a workflow in YAML, but run it on your own machine instead of in hosted CI.
The requirements included 20 features across five categories:
Core Execution: YAML parsing, topological sort, cycle detection, parallel execution, skip on failure
Task Config: Per-task env vars, working directory, config defaults, conditional execution
Output: Colored logs with timing, log files, real-time streaming, verbose/quiet modes
CLI: --dry-run, --filter, --max-parallel, SIGINT handling
Advanced: Input file caching, before/after hooks, retry logic, timeouts
Here's what a taskflow config might look like (an illustrative sketch; the exact field names in our spec may differ):
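```yaml
# taskflow.yaml (illustrative; field names are hypothetical)
defaults:
  timeout: 60          # seconds before a task is killed
  retries: 0

tasks:
  install:
    cmd: npm ci
    cache_inputs:      # skip the task if these files haven't changed
      - package-lock.json

  lint:
    cmd: npm run lint
    depends_on: [install]

  test:
    cmd: npm test
    depends_on: [install]
    env:
      NODE_ENV: test
    retries: 2
    timeout: 120

  build:
    cmd: npm run build
    depends_on: [lint, test]
    cwd: ./packages/app
    before: echo "starting build"
    after: echo "build finished"
```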
This is a realistic CLI tool that would take a senior developer at least a day or two to build properly. It requires understanding dependency graphs, process spawning, file hashing, signal handling, and stream management.
Two-Phase Testing
We ran the test in two phases to evaluate both planning and implementation abilities.
Phase 1: Architect Mode
We gave each model the full requirements and asked for a detailed implementation plan using Kilo Code's Architect mode. This tests their ability to design before coding. A good plan should include file structure, type definitions, algorithms for dependency resolution, and how each feature will be implemented.
Phase 2: Code Mode
After each model produced its plan, we told it to implement that plan using Kilo Code's Code mode. This tests whether they can follow through on their own design and produce working code.
Both models ran uninterrupted for the entire task: GLM 4.7 completed in 10 minutes, MiniMax M2.1 in 14. Neither required human intervention or course correction.
Performance Summary
Phase 1 Results: Planning
The planning phase revealed significant differences in how each model approaches architecture.
GLM 4.7's Plan
GLM 4.7 produced a 741-line architecture document with three Mermaid diagrams showing execution flow, parallel batching strategy, and module relationships. The plan included:
A nested directory structure with 18 files across 8 directories
Complete TypeScript interfaces for all types
Explicit mention of Kahn's algorithm for topological sort with pseudocode
A 26-step implementation roadmap
Security considerations (command injection risks)
Performance notes (spawn vs exec tradeoffs)
GLM 4.7 Architecture Plan Output
For example, GLM 4.7's plan documented the cache strategy in detail, down to the exact JSON format of the cache file.
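A cache file in that spirit (an illustrative sketch with field names of our own choosing, hashes truncated for readability, not GLM 4.7's verbatim format) maps each task to the hashes of its declared input files:

```json
{
  "version": 1,
  "tasks": {
    "build": {
      "inputHashes": {
        "src/index.ts": "9b2f3c1a",
        "package-lock.json": "41d8cd98"
      },
      "lastRun": "2025-12-18T10:32:00Z",
      "exitCode": 0
    }
  }
}
```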
MiniMax M2.1's Plan
MiniMax M2.1's plan was significantly shorter at only 284 lines with two Mermaid diagrams. It covered all the same concepts but with less detail:
A flat directory structure with 9 files
Complete TypeScript interfaces
Mentioned Kahn's algorithm by name (no pseudocode)
Module descriptions without step-by-step implementation roadmap
MiniMax M2.1 Architecture Plan Output
MiniMax M2.1's cache description covered the same ground far more tersely. Both plans are technically correct and cover all requirements. GLM 4.7's plan reads like internal documentation you'd hand to a new team member; MiniMax M2.1's is a working specification: far shorter, but it still gets the job done.
Plan Scoring (50 points)
We scored the plans on nine criteria.
The 6-point gap reflects depth of documentation, not technical correctness. Both plans would enable a developer to build the tool.
Phase 2 Results: Implementation
Both models successfully implemented all 20 requirements. The code compiles, runs, and handles the test cases correctly.
Core Features (20 points)
Both implementations include:
Working topological sort with cycle detection:
GLM 4.7 implemented Kahn's algorithm. A representative TypeScript sketch of the approach (not the model's verbatim output):
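```typescript
// Kahn's algorithm: repeatedly pull tasks with no remaining
// dependencies; any tasks left over indicate a cycle.
function topologicalSort(tasks: Map<string, string[]>): string[] {
  const inDegree = new Map<string, number>();
  const dependents = new Map<string, string[]>();

  for (const [name, deps] of tasks) {
    inDegree.set(name, deps.length);
    for (const dep of deps) {
      if (!dependents.has(dep)) dependents.set(dep, []);
      dependents.get(dep)!.push(name);
    }
  }

  // Start with tasks that have no dependencies.
  const queue = [...tasks.keys()].filter((t) => inDegree.get(t) === 0);
  const order: string[] = [];

  while (queue.length > 0) {
    const task = queue.shift()!;
    order.push(task);
    for (const next of dependents.get(task) ?? []) {
      const remaining = inDegree.get(next)! - 1;
      inDegree.set(next, remaining);
      if (remaining === 0) queue.push(next);
    }
  }

  if (order.length !== tasks.size) {
    throw new Error("Dependency cycle detected in task graph");
  }
  return order;
}
```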
MiniMax M2.1's implementation follows the same algorithm with nearly identical logic.
Parallel execution with concurrency limits:
GLM 4.7 uses a dynamic approach, starting a new task the moment a slot frees up. In sketch form (our reconstruction, not the generated code):
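```typescript
// Dynamic worker pool: keep up to maxParallel tasks in flight,
// launching the next ready task as soon as any running one settles.
// Assumes `order` is a valid topological order and run() resolves
// even when a task fails (failures are handled inside run).
async function runDynamic(
  order: string[],
  isReady: (task: string) => boolean, // all dependencies finished?
  run: (task: string) => Promise<void>,
  maxParallel: number
): Promise<void> {
  const pending = new Set(order);
  const running = new Map<string, Promise<void>>();

  while (pending.size > 0 || running.size > 0) {
    // Fill free slots with whatever is ready to run.
    for (const task of pending) {
      if (running.size >= maxParallel) break;
      if (isReady(task)) {
        pending.delete(task);
        running.set(task, run(task).finally(() => running.delete(task)));
      }
    }
    // Wait for any in-flight task to settle before rechecking.
    if (running.size > 0) await Promise.race(running.values());
  }
}
```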
MiniMax M2.1 uses batch-based execution, grouping tasks by dependency level. Again sketched, not verbatim:
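```typescript
// Batch execution: run each dependency level with Promise.all,
// capped at maxParallel via simple chunking.
async function runBatched(
  levels: string[][], // levels[0] has no deps, levels[1] depends on levels[0], etc.
  run: (task: string) => Promise<void>,
  maxParallel: number
): Promise<void> {
  for (const level of levels) {
    for (let i = 0; i < level.length; i += maxParallel) {
      await Promise.all(level.slice(i, i + maxParallel).map(run));
    }
  }
}
```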
Both approaches work. GLM 4.7's is more responsive to individual task completion. MiniMax M2.1's is simpler to understand.
Implementation Scoring
We scored implementations on the same 20 requirements.
Both models achieved full marks on implementation. Every feature works as specified.
Code Quality Differences
While both implementations are functional, they differ in structure and style.
Architecture
GLM 4.7 created a deeply modular structure:
GLM 4.7 Code Structure
MiniMax M2.1 created a flat structure:
MiniMax M2.1 Code Structure
Neither is wrong. GLM 4.7's structure is easier to extend and test in isolation. MiniMax M2.1's structure is easier to navigate and understand initially.
Error Handling
GLM 4.7 created custom error classes along these lines (illustrative; the actual class names may differ):
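```typescript
// Illustrative custom error carrying task context for clearer reporting.
class TaskExecutionError extends Error {
  constructor(
    public readonly taskName: string,
    public readonly exitCode: number
  ) {
    super(`Task "${taskName}" failed with exit code ${exitCode}`);
    this.name = "TaskExecutionError";
  }
}
```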
MiniMax M2.1 used standard Error objects with descriptive messages instead, e.g. `new Error("Task build failed with exit code 1")`.
Retry Logic
GLM 4.7 implemented exponential backoff between retries. In sketch form (illustrative, not the generated code):
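```typescript
// Exponential backoff: wait 1s, 2s, 4s, ... between retry attempts.
async function withRetries<T>(
  fn: () => Promise<T>,
  retries: number,
  baseDelayMs = 1000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err;
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```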
MiniMax M2.1 retries immediately without delay.
Hashing
GLM 4.7 uses MD5; MiniMax M2.1 uses SHA256. For cache invalidation, where nobody is deliberately engineering collisions, either works fine, though SHA256 is technically more collision-resistant.
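Either way, the mechanics are a few lines with Node's built-in crypto module. A minimal sketch, assuming synchronous file reads are acceptable for config-sized inputs:

```typescript
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

// Hash a file's contents for cache invalidation; swap in "md5"
// to match GLM 4.7's choice.
function hashFile(path: string): string {
  return createHash("sha256").update(readFileSync(path)).digest("hex");
}
```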
CLI Parsing
GLM 4.7 implemented argument parsing manually, roughly in this style (our sketch):
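```typescript
// Hand-rolled flag parsing over process.argv: no dependencies,
// but every edge case (values, typos, help text) is on you.
interface CliOptions {
  dryRun: boolean;
  filter?: string;
  maxParallel: number;
}

function parseArgs(argv: string[]): CliOptions {
  const opts: CliOptions = { dryRun: false, maxParallel: 4 };
  for (let i = 0; i < argv.length; i++) {
    switch (argv[i]) {
      case "--dry-run":
        opts.dryRun = true;
        break;
      case "--filter":
        opts.filter = argv[++i];
        break;
      case "--max-parallel":
        opts.maxParallel = Number(argv[++i]);
        break;
    }
  }
  return opts;
}
```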
MiniMax M2.1 used Commander.js, along these lines (again a sketch, not verbatim):
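```typescript
import { Command } from "commander";

// Commander.js handles parsing, defaults, and --help generation.
const program = new Command();

program
  .name("taskflow")
  .option("--dry-run", "print the execution plan without running tasks")
  .option("--filter <pattern>", "only run tasks matching a pattern")
  .option("--max-parallel <n>", "maximum concurrent tasks", "4")
  .parse(process.argv);

const opts = program.opts();
```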
GLM 4.7's approach has no external dependency. MiniMax M2.1's approach is more maintainable and handles edge cases automatically.
Documentation
GLM 4.7 generated a 363-line README.md with installation instructions, configuration reference, CLI options, multiple examples, and exit code documentation.
GLM 4.7 README Output
MiniMax M2.1 generated no README.
Agentic Capabilities
Both models demonstrated genuine agentic behavior. After finishing the implementation, each model tested its own work by running the CLI with Bash and verified the output.
GLM 4.7 reached a working state faster with fewer issues. It produced 1,850 lines across 18 files in 10 minutes, then ran through its test cases without hitting major problems.
MiniMax M2.1 took longer because it ran into issues during self-testing. The Commander.js library wasn't parsing CLI flags correctly. Instead of giving up or asking for help, MiniMax tested the library inline using Node to figure out what was wrong. Once it understood the issue, it went back and fixed the CLI code. This debugging loop happened without any human intervention.
Cost Analysis
At the time of our testing, MiniMax M2.1 cost us $0.15 for this task, half as much as GLM 4.7. Since then, MiniMax M2.1 has become free to use in Kilo Code for a limited time. Both models are significantly cheaper than frontier models. Running this same task with Claude Opus 4.5 or OpenAI GPT 5.2 would cost several dollars or more.
The 6-point score difference comes entirely from planning documentation, not implementation quality. For teams optimizing on cost, MiniMax M2.1 produces working code at half the price. For teams that value comprehensive documentation and modular architecture, GLM 4.7 may be worth the premium.
Tradeoffs
Based on our testing, GLM 4.7 is better if you want comprehensive documentation and modular architecture out of the box. It generated a full README, detailed error classes, and organized code across 18 well-separated files. The tradeoff is higher cost and some arguably over-engineered patterns like manual CLI parsing when a library would do.
MiniMax M2.1 is better if you prefer simpler code and lower cost. Its 9-file structure is easier to navigate, and it used established libraries like Commander.js instead of rolling its own. The tradeoff is no documentation. You'll need to add a README and inline comments yourself.
Both codebases would pass code review with minor adjustments. Neither is production-ready without human review, but that's expected for any AI-generated code.
It's worth noting that both models can be steered with more specific prompts. What we're observing here is just the default behavior inside the Kilo Code harness. If you wanted MiniMax to generate documentation or GLM to use fewer files, you could prompt for that explicitly.
Final Thoughts
Overall, we're impressed with the performance of both models. For practical coding work, either one delivers functional code. GLM 4.7 requires less hand-holding and produces more complete output. MiniMax M2.1 achieves the same functional result at half the cost, and is currently free to use in Kilo Code for a limited time.
A year ago, neither of these models existed. Now they can run for long durations autonomously, debug their own issues, and produce working projects. Open-weight models are catching up to frontier models and getting better with each release.