GLM-4.7 achieves 73.8% SWE-bench and 87.4% tau-Bench with Preserved Thinking. Complete developer guide for the $3/month Claude Code alternative.
Key Statistics
- 355B Total Parameters
- 32B Active Parameters
- 200K Context Window
- 73.8% SWE-bench
Key Takeaways
- Open-Source Claude Alternative: GLM-4.7 is a 355B parameter MIT-licensed model achieving 73.8% SWE-bench—competitive with Claude Sonnet 4.5 at a fraction of the cost.
- Preserved Thinking Innovation: Unlike models that restart reasoning each turn, GLM-4.7 retains thinking blocks across conversations, maintaining context in long coding sessions.
- $3/Month Coding Plan: The GLM Coding Plan offers Claude-level coding at 1/7th the price with 3x usage quota, working directly with Claude Code, Cline, and Roo Code.
- Best-in-Class Tool Use: Achieves 87.4% on tau-Bench and 84.9% on LiveCodeBench, outperforming Claude Sonnet 4.5 on multiple agent and coding benchmarks.
- Production-Ready for Agents: Built specifically for terminal-based agentic workflows rather than chat, with native support for multi-turn stability in coding agents.
What Is GLM-4.7?
GLM-4.7 is Z.ai's flagship open-source coding model, released on December 22, 2025. Unlike previous models that focused primarily on chat capabilities, GLM-4.7 is engineered specifically for agentic coding—the ability to autonomously complete complex programming tasks across multiple files and turns.
The model represents a significant milestone: it's the first open-source LLM to approach proprietary model performance on real-world coding benchmarks while being available at a fraction of the cost. Z.ai (formerly Zhipu AI), a Tsinghua University spinoff valued at approximately $3-4 billion, has positioned GLM-4.7 as a direct alternative to Claude and GPT for developers who need capable coding assistance without enterprise pricing.
Built for Agents
Designed from the ground up for terminal-based workflows. Works natively with Claude Code, Cline, Roo Code, and Kilo Code.
MIT Licensed
Fully open-source with commercial use permitted. Weights available on HuggingFace and ModelScope for local deployment.
Technical Specifications
GLM-4.7 uses a Mixture-of-Experts (MoE) architecture with 355 billion total parameters, but only 32 billion are active per forward pass. This design enables frontier-level capabilities while maintaining reasonable inference costs.
| Specification | GLM-4.7 | GLM-4.6 |
|---|---|---|
| Total Parameters | 355B (MoE) | 355B (MoE) |
| Active Parameters | 32B | 32B |
| Context Length | 200K tokens | 128K tokens |
| Max Output | 128K tokens | 32K tokens |
| License | MIT (Open-Source) | MIT |
| Knowledge Cutoff | Mid-Late 2024 | Earlier 2024 |
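To make the "active parameters" idea concrete, here's a toy sketch of top-k expert routing. It is illustrative only (hypothetical sizes, random weights), not GLM-4.7's actual implementation: a router picks a few experts per token, so most of the network's weights sit idle on any given forward pass.

```python
# Toy sketch of Mixture-of-Experts routing -- illustrative only, not
# GLM-4.7's actual architecture. Shows why only a fraction of parameters
# is "active": the router picks top-k experts per token; the rest are skipped.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 16, 2  # hypothetical sizes, far smaller than GLM-4.7

router_w = rng.normal(size=(d_model, n_experts))              # router projection
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token vector through its top-k experts only."""
    logits = x @ router_w
    top = np.argsort(logits)[-top_k:]                          # chosen expert indices
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over chosen
    # Only top_k of n_experts weight matrices are touched -> "active parameters"
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.normal(size=d_model)
print(moe_forward(token).shape)  # (64,) -- computed with 2 of 16 experts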
Thinking Modes: The Innovation
GLM-4.7's most significant innovation is its three-tier thinking architecture. This addresses the "context collapse" problem where AI coding assistants lose track of earlier decisions during long sessions.
Interleaved Thinking
Active by default. The model reasons before every response and every tool call. This prevents "hallucinated code" by verifying logic before generating output. Think of it as the model pausing to check its work at each step.
Preserved Thinking
Enabled by default on GLM Coding Plan. Unlike models that restart their thought process from scratch each turn, GLM-4.7 retains its "thinking blocks" across the entire conversation. This is analogous to a human developer who remembers why they made an architectural decision three hours ago.
Benefits:
- Reduces information loss in multi-turn sessions
- Improves cache hit rates, lowering costs
- Maintains consistency during complex refactors
Turn-Level Thinking Control
Developer-controllable per request. Enable or disable thinking on a per-turn basis within a session. Disable for simple syntax questions to reduce latency and costs; enable for complex debugging to maximize accuracy.
API Usage: Enable thinking with "thinking": {"type": "enabled"} in your API request. For preserved thinking, set "clear_thinking": false.
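As a concrete illustration, here's a minimal sketch of per-turn thinking control using the zai Python client (the same one as in the quick start below, installed via `pip install zai-sdk`). The `{"type": "disabled"}` value and the exact placement of `clear_thinking` as a keyword argument are assumptions based on the fields described above; check Z.ai's API reference for the authoritative request shapes.

```python
# Sketch of turn-level thinking control -- parameter shapes assumed from
# the "thinking" / "clear_thinking" fields documented above.
from zai import ZaiClient

client = ZaiClient(api_key="your-api-key")

# Simple syntax question: disable thinking to cut latency and cost.
fast = client.chat.completions.create(
    model="glm-4.7",
    messages=[{"role": "user", "content": "What does Python's walrus operator do?"}],
    thinking={"type": "disabled"},  # assumed counterpart of "enabled"
)

# Complex debugging turn: enable thinking and keep reasoning blocks
# across turns (preserved thinking) by not clearing them.
deep = client.chat.completions.create(
    model="glm-4.7",
    messages=[{"role": "user", "content": "Why does this async handler deadlock?"}],
    thinking={"type": "enabled"},
    clear_thinking=False,  # preserved thinking: retain reasoning across turns
)
```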
Benchmark Performance
GLM-4.7 demonstrates significant improvements across coding, reasoning, and agent benchmarks. Here's how it compares to leading proprietary models:
| Benchmark | GLM-4.7 | Claude Sonnet 4.5 | GPT-5.1 High | DeepSeek-V3.2 |
|---|---|---|---|---|
| SWE-bench Verified | 73.8% | 77.2% | 76.3% | 73.1% |
| LiveCodeBench v6 | 84.9% | 64.0% | 87.0% | 83.3% |
| tau-Bench (Tools) | 87.4% | 87.2% | 82.7% | 85.3% |
| Terminal Bench 2.0 | 41.0% | 42.8% | 47.6% | 46.4% |
| HLE (w/ Tools) | 42.8% | 32.0% | 42.7% | 40.8% |
| BrowseComp | 52.0% | 24.1% | 50.8% | 51.4% |
| AIME 2025 | 95.7% | 87.0% | 94.0% | 93.1% |
Where GLM-4.7 Wins
- LiveCodeBench: 84.9% beats Claude's 64.0%
- tau-Bench: Best-in-class tool use at 87.4%
- HLE with Tools: Edges out GPT-5.1 at 42.8% vs 42.7%
- BrowseComp: Doubles Claude at 52% vs 24%
Honest Assessment
- SWE-bench: ~3% behind Claude Sonnet 4.5
- Terminal Bench: Trails Gemini 3.0 Pro (54.2%)
- Edge Cases: May need more prompting for simple tasks
Vibe Coding & UI Generation
Z.ai uses the term "vibe coding" (coined by Andrej Karpathy) to describe GLM-4.7's improved aesthetic output. Beyond functional code, the model now generates visually appealing UI layouts, presentations, and designs.
UI Generation
Cleaner, more modern webpage layouts with improved color harmony, typography, and component styling. Significantly reduces the time spent manually polishing generated markup.
PPT Compatibility (91%)
16:9 layout compatibility improved from 52% to 91%. Generated slides are now essentially "ready to use" without manual adjustments.
Visual Artifacts
Generates interactive demos, particle effects, 3D visualizations, and creative coding projects with improved aesthetic quality.
Pricing & Access
GLM-4.7 offers multiple access options, from a budget-friendly subscription to pay-per-token API access and free local deployment.
| Model/Plan | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|
| GLM Coding Plan | $3/month (quota-based) | — | 3x Claude quota, resets every 5 hours |
| GLM-4.7 API (Z.ai) | $0.60 | $2.20 | Direct API access |
| GLM-4.7 (OpenRouter) | $0.40 | $1.50 | Third-party provider |
| Claude Sonnet 4.5 | ~$3-4 | ~$15 | For comparison |
| DeepSeek V3.2 | $0.28 | $0.42 | Lower price point |
Value Proposition: GLM-4.7 is roughly 4-7x cheaper than Claude/GPT while approaching their performance levels. The $3/month Coding Plan is particularly compelling for individual developers.
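A quick back-of-envelope check on that claim, using the API prices from the table and a made-up monthly workload (the token volumes here are hypothetical; adjust to your own usage):

```python
# Cost comparison for a hypothetical month of agentic coding:
# 50M input tokens and 10M output tokens. Prices are per 1M tokens,
# taken from the pricing table above.
in_tok, out_tok = 50, 10  # millions of tokens

def monthly_cost(input_price: float, output_price: float) -> float:
    return in_tok * input_price + out_tok * output_price

glm_api = monthly_cost(0.60, 2.20)   # GLM-4.7 via Z.ai API
claude = monthly_cost(3.00, 15.00)   # Claude Sonnet 4.5 (low end of range)
print(f"GLM-4.7: ${glm_api:.2f}  Claude: ${claude:.2f}  ratio: {claude / glm_api:.1f}x")
# -> GLM-4.7: $52.00  Claude: $300.00  ratio: 5.8x, inside the 4-7x range cited above
```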
Getting Started
Claude Code Integration
The easiest way to use GLM-4.7 is through Claude Code with a GLM Coding Plan subscription:
```bash
# Install Claude Code
npm install -g @anthropic-ai/claude-code

# Configure for GLM-4.7
export ANTHROPIC_AUTH_TOKEN=your-zai-api-key
export ANTHROPIC_BASE_URL=https://api.z.ai/api/anthropic
```
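With those variables set, launching `claude` routes all requests through Z.ai. If you want to verify the endpoint first, the sketch below assumes Z.ai's proxy mirrors the Anthropic Messages API at the standard /v1/messages path; the exact route and accepted model names are assumptions worth checking against Z.ai's docs.

```bash
# Optional sanity check -- assumes an Anthropic-compatible /v1/messages route
curl "$ANTHROPIC_BASE_URL/v1/messages" \
  -H "x-api-key: $ANTHROPIC_AUTH_TOKEN" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model": "glm-4.7", "max_tokens": 64,
       "messages": [{"role": "user", "content": "ping"}]}'

# Then launch Claude Code as usual
claude
```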
API Quick Start (Python)
```python
from zai import ZaiClient

client = ZaiClient(api_key="your-api-key")

response = client.chat.completions.create(
    model="glm-4.7",
    messages=[
        {"role": "user", "content": "Write a React component for a todo list"}
    ],
    thinking={"type": "enabled"},
    max_tokens=4096,
)

print(response.choices[0].message.content)
```
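Since GLM-4.7's headline strength is tool use, here is a hedged sketch of function calling with the same client. It assumes the zai-sdk accepts OpenAI-style `tools` definitions and returns OpenAI-style `tool_calls`; the `run_tests` tool itself is hypothetical.

```python
# Sketch of tool calling -- assumes OpenAI-compatible request/response shapes.
from zai import ZaiClient

client = ZaiClient(api_key="your-api-key")

tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical tool, defined by your agent
        "description": "Run the project's test suite and return failures",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.7",
    messages=[{"role": "user", "content": "Fix the failing tests in ./src"}],
    tools=tools,
    thinking={"type": "enabled"},
)

message = response.choices[0].message
if message.tool_calls:  # model decided to invoke a tool
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)
```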
Local Deployment
For local deployment, GLM-4.7 supports vLLM, SGLang, and Ollama:
```bash
# Via Ollama (easiest)
ollama run glm-4.7

# Via HuggingFace + vLLM
pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model zai-org/GLM-4.7 --tensor-parallel-size 8
```
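Once the vLLM server is up (port 8000 by default), it exposes an OpenAI-compatible API, so the standard openai Python client can talk to the local model:

```python
# Query the local vLLM server through its OpenAI-compatible endpoint.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = local.chat.completions.create(
    model="zai-org/GLM-4.7",  # must match the --model passed to vLLM
    messages=[{"role": "user", "content": "Write a bash one-liner to count TODOs"}],
)
print(resp.choices[0].message.content)
```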
Hardware Requirements
Full Model (355B):
- BF16: 16x H100 (80GB)
- FP8: 8x H100 or 4x H200
Quantized (Consumer):
- 2-bit: 24GB GPU + 128GB RAM
- Speed: ~5 tokens/second
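These figures follow directly from weights-only memory arithmetic; a quick sketch (KV cache and activations add overhead on top, which is why the listed setups have headroom):

```python
# Rough VRAM arithmetic behind the hardware requirements above (weights only).
params = 355e9  # total parameters, including inactive experts

for name, bytes_per_param in [("BF16", 2), ("FP8", 1), ("2-bit", 0.25)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name}: ~{gb:,.0f} GB of weights")

# BF16:  ~710 GB -> 16x H100 80GB = 1,280 GB total
# FP8:   ~355 GB ->  8x H100 80GB =   640 GB total
# 2-bit:  ~89 GB -> split across a 24GB GPU + 128GB RAM via MoE offloading
```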
When to Use GLM-4.7
Choose GLM-4.7 When
- You need Claude-level coding at 1/7th the cost
- Long coding sessions where context preservation matters
- Tool-heavy workflows (tau-Bench, BrowseComp)
- Multilingual codebases (66.7% SWE-bench Multilingual)
- You want open-source/self-hostable with MIT license
Consider Alternatives When
- You need absolute best SWE-bench scores (Claude 77.2%)
- Terminal-heavy workflows (Gemini 3.0 Pro leads at 54.2%)
- Chat-first use cases requiring nuanced emotional handling
- Local deployment without enterprise GPU infrastructure
- Absolute lowest cost is priority (DeepSeek V3.2 cheaper)
Conclusion
GLM-4.7 represents a significant milestone in the democratization of AI coding. For the first time, an open-source model genuinely competes with Claude and GPT on real-world coding benchmarks—and does so at a fraction of the cost.
The Preserved Thinking innovation addresses a real pain point: maintaining coherent reasoning across long coding sessions. Combined with best-in-class tool use performance and a $3/month pricing tier, GLM-4.7 makes frontier-level coding assistance accessible to individual developers and small teams.
While it doesn't beat Claude or GPT on every benchmark, the gap has closed substantially. For developers who want Claude-like capabilities without Claude-like pricing, GLM-4.7 is worth serious consideration.
Frequently Asked Questions
What is GLM-4.7?
GLM-4.7 is Z.ai's (formerly Zhipu AI) latest open-source large language model, released December 22, 2025. It's a 355B parameter Mixture-of-Experts (MoE) model with 32B active parameters, specifically optimized for agentic coding, tool usage, and complex reasoning tasks.
Who is Z.ai (Zhipu AI)?
Z.ai is a Chinese AI company founded in 2019, spun out from Tsinghua University. Valued at approximately $3-4 billion, they're one of China's 'AI Tiger' companies and are preparing for a Hong Kong IPO in early 2026. The company rebranded from Zhipu AI to Z.ai internationally in July 2025.
How does GLM-4.7 compare to Claude Sonnet 4.5?
GLM-4.7 is competitive with Claude Sonnet 4.5 on coding benchmarks: 73.8% vs 77.2% on SWE-bench Verified, but GLM-4.7 wins on LiveCodeBench (84.9% vs 64.0%) and tau-Bench (87.4% vs 87.2%). The main advantage is price—GLM Coding Plan costs $3/month vs ~$20/month for Claude Pro.
What is Preserved Thinking?
Preserved Thinking is GLM-4.7's innovation where the model retains its reasoning blocks across multi-turn conversations instead of starting fresh each turn. This reduces information loss, improves cache hit rates, and makes long coding sessions more stable and consistent.
How much does GLM-4.7 cost?
The GLM Coding Plan starts at $3/month for use with coding agents like Claude Code. API pricing is $0.40-0.60 per million input tokens and $1.50-2.20 per million output tokens. This is roughly 4-7x cheaper than Claude or GPT equivalents.
Can I run GLM-4.7 locally?
Yes, GLM-4.7 weights are available on HuggingFace under MIT license. It supports vLLM, SGLang, and Ollama for inference. However, the full model requires significant hardware—8x H100 GPUs for FP8, or 16x H100 for BF16. Quantized versions can run on consumer hardware with 24GB VRAM + 128GB RAM.
What hardware do I need for local deployment?
For the full 355B model: 8x H100 (80GB) for FP8 or 16x H100 for BF16. For quantized versions: minimum 24GB GPU + 128GB RAM using 2-bit quantization with MoE offloading. Expect ~5 tokens/second on consumer hardware.
Is GLM-4.7 truly open-source?
Yes, GLM-4.7 is released under the MIT license, which allows commercial use, modification, and distribution without restrictions. Weights are freely available on HuggingFace (zai-org/GLM-4.7) and ModelScope.
Does GLM-4.7 work with Claude Code?
Yes, GLM-4.7 integrates directly with Claude Code via the GLM Coding Plan. Configure your ANTHROPIC_AUTH_TOKEN with your Z.ai API key and set ANTHROPIC_BASE_URL to https://api.z.ai/api/anthropic. The model maps to both Opus and Sonnet endpoints.
What programming languages does GLM-4.7 support?
GLM-4.7 excels at multilingual coding with a 66.7% score on SWE-bench Multilingual, a 12.9-percentage-point improvement over its predecessor. It supports Python, JavaScript/TypeScript, Java, C++, Go, Rust, and other major languages commonly used in professional development.
How does GLM-4.7 handle long coding sessions?
GLM-4.7's Preserved Thinking mode automatically retains reasoning across turns, addressing the 'context collapse' problem where models lose track of earlier decisions. Combined with the 200K context window, it can maintain coherent multi-hour coding sessions.
What are GLM-4.7's main limitations?
GLM-4.7 still trails Gemini 3.0 Pro on Terminal Bench (41% vs 54.2%) and is slightly behind Claude on SWE-bench Verified (73.8% vs 77.2%). Some users report it can be more rigid in handling emotional nuances compared to chat-optimized models, and the full model requires substantial hardware.