GLM-4.7 achieves 73.8% SWE-bench and 87.4% tau-Bench with Preserved Thinking. Complete developer guide for the $3/month Claude Code alternative.
Key Statistics
- 355B Total Parameters
- 32B Active Parameters
- 200K Context Window
- 73.8% SWE-bench
Key Takeaways
- Open-Source Claude Alternative: GLM-4.7 is a 355B parameter MIT-licensed model achieving 73.8% SWE-bench—competitive with Claude Sonnet 4.5 at a fraction of the cost.
- Preserved Thinking Innovation: Unlike models that restart reasoning each turn, GLM-4.7 retains thinking blocks across conversations, maintaining context in long coding sessions.
- $3/Month Coding Plan: The GLM Coding Plan offers Claude-level coding at 1/7th the price with 3x usage quota, working directly with Claude Code, Cline, and Roo Code.
- Best-in-Class Tool Use: Achieves 87.4% on tau-Bench and 84.9% on LiveCodeBench, outperforming Claude Sonnet 4.5 on multiple agent and coding benchmarks.
- Production-Ready for Agents: Built specifically for terminal-based agentic workflows rather than chat, with native support for multi-turn stability in coding agents.
What Is GLM-4.7?
GLM-4.7 is Z.ai's flagship open-source coding model, released on December 22, 2025. Unlike previous models that focused primarily on chat capabilities, GLM-4.7 is engineered specifically for agentic coding—the ability to autonomously complete complex programming tasks across multiple files and turns.
The model represents a significant milestone: it's the first open-source LLM to approach proprietary model performance on real-world coding benchmarks while being available at a fraction of the cost. Z.ai (formerly Zhipu AI), a Tsinghua University spinoff valued at approximately $3-4 billion, has positioned GLM-4.7 as a direct alternative to Claude and GPT for developers who need capable coding assistance without enterprise pricing.
Built for Agents
Designed from the ground up for terminal-based workflows. Works natively with Claude Code, Cline, Roo Code, and Kilo Code.
MIT Licensed
Fully open-source with commercial use permitted. Weights available on HuggingFace and ModelScope for local deployment.
Technical Specifications
GLM-4.7 uses a Mixture-of-Experts (MoE) architecture with 355 billion total parameters, but only 32 billion are active per forward pass. This design enables frontier-level capabilities while maintaining reasonable inference costs.
| Specification | GLM-4.7 | GLM-4.6 |
|---|---|---|
| Total Parameters | 355B (MoE) | 355B (MoE) |
| Active Parameters | 32B | 32B |
| Context Length | 200K tokens | 128K tokens |
| Max Output | 128K tokens | 32K tokens |
| License | MIT (Open-Source) | MIT |
| Knowledge Cutoff | Mid-Late 2024 | Earlier 2024 |
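To make the "active parameters" idea concrete, here's a toy sketch of top-k expert routing. It is illustrative only (hypothetical sizes, random weights), not GLM-4.7's actual implementation: a router picks a few experts per token, so most of the network's weights sit idle on any given forward pass.

```python
# Toy sketch of Mixture-of-Experts routing -- illustrative only, not
# GLM-4.7's actual architecture. Shows why only a fraction of parameters
# is "active": the router picks top-k experts per token; the rest are skipped.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 16, 2  # hypothetical sizes, far smaller than GLM-4.7

router_w = rng.normal(size=(d_model, n_experts))              # router projection
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token vector through its top-k experts only."""
    logits = x @ router_w
    top = np.argsort(logits)[-top_k:]                          # chosen expert indices
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over chosen
    # Only top_k of n_experts weight matrices are touched -> "active parameters"
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.normal(size=d_model)
print(moe_forward(token).shape)  # (64,) -- computed with 2 of 16 experts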
Thinking Modes: The Innovation
GLM-4.7's most significant innovation is its three-tier thinking architecture. This addresses the "context collapse" problem where AI coding assistants lose track of earlier decisions during long sessions.
Interleaved Thinking
Active by default. The model reasons before every response and every tool call. This prevents "hallucinated code" by verifying logic before generating output. Think of it as the model pausing to check its work at each step.
Preserved Thinking
Enabled by default on GLM Coding Plan. Unlike models that restart their thought process from scratch each turn, GLM-4.7 retains its "thinking blocks" across the entire conversation. This is analogous to a human developer who remembers why they made an architectural decision three hours ago.
Benefits:
- Reduces information loss in multi-turn sessions
- Improves cache hit rates, lowering costs
- Maintains consistency during complex refactors
Turn-Level Thinking Control
Developer-controllable per request. Enable or disable thinking on a per-turn basis within a session. Disable for simple syntax questions to reduce latency and costs; enable for complex debugging to maximize accuracy.
API Usage: Enable thinking with "thinking": {"type": "enabled"} in your API request. For preserved thinking, set "clear_thinking": false.
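As a concrete illustration, here's a minimal sketch of per-turn thinking control using the zai Python client (the same one as in the quick start below, installed via `pip install zai-sdk`). The `{"type": "disabled"}` value and the exact placement of `clear_thinking` as a keyword argument are assumptions based on the fields described above; check Z.ai's API reference for the authoritative request shapes.

```python
# Sketch of turn-level thinking control -- parameter shapes assumed from
# the "thinking" / "clear_thinking" fields documented above.
from zai import ZaiClient

client = ZaiClient(api_key="your-api-key")

# Simple syntax question: disable thinking to cut latency and cost.
fast = client.chat.completions.create(
    model="glm-4.7",
    messages=[{"role": "user", "content": "What does Python's walrus operator do?"}],
    thinking={"type": "disabled"},  # assumed counterpart of "enabled"
)

# Complex debugging turn: enable thinking and keep reasoning blocks
# across turns (preserved thinking) by not clearing them.
deep = client.chat.completions.create(
    model="glm-4.7",
    messages=[{"role": "user", "content": "Why does this async handler deadlock?"}],
    thinking={"type": "enabled"},
    clear_thinking=False,  # preserved thinking: retain reasoning across turns
)
```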
Benchmark Performance
GLM-4.7 demonstrates significant improvements across coding, reasoning, and agent benchmarks. Here's how it compares to leading proprietary models:
| Benchmark | GLM-4.7 | Claude Sonnet 4.5 | GPT-5.1 High | DeepSeek-V3.2 |
|---|---|---|---|---|
| SWE-bench Verified | 73.8% | 77.2% | 76.3% | 73.1% |
| LiveCodeBench v6 | 84.9% | 64.0% | 87.0% | 83.3% |
| tau-Bench (Tools) | 87.4% | 87.2% | 82.7% | 85.3% |
| Terminal Bench 2.0 | 41.0% | 42.8% | 47.6% | 46.4% |
| HLE (w/ Tools) | 42.8% | 32.0% | 42.7% | 40.8% |
| BrowseComp | 52.0% | 24.1% | 50.8% | 51.4% |
| AIME 2025 | 95.7% | 87.0% | 94.0% | 93.1% |
Where GLM-4.7 Wins
- LiveCodeBench: 84.9% beats Claude's 64.0%
- tau-Bench: Best-in-class tool use at 87.4%
- HLE with Tools: Edges out GPT-5.1 at 42.8% vs 42.7%
- BrowseComp: Doubles Claude at 52% vs 24%
Honest Assessment
- SWE-bench: ~3% behind Claude Sonnet 4.5
- Terminal Bench: Trails Gemini 3.0 Pro (54.2%)
- Edge Cases: May need more prompting for simple tasks
Vibe Coding & UI Generation
Z.ai uses the term "vibe coding" (coined by Andrej Karpathy) to describe GLM-4.7's improved aesthetic output. Beyond functional code, the model now generates visually appealing UI layouts, presentations, and designs.
UI Generation
Cleaner, more modern webpage layouts with improved color harmony, typography, and component styling. Significantly reduces the time spent manually polishing generated markup.
PPT Compatibility (91%)
16:9 layout compatibility improved from 52% to 91%. Generated slides are now essentially "ready to use" without manual adjustments.
Visual Artifacts
Generates interactive demos, particle effects, 3D visualizations, and creative coding projects with improved aesthetic quality.
Pricing & Access
GLM-4.7 offers multiple access options, from a budget-friendly subscription to pay-per-token API access and free local deployment.
| Model/Plan | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|
| GLM Coding Plan | $3/month (quota-based) | — | 3x Claude quota, resets every 5 hours |
| GLM-4.7 API (Z.ai) | $0.60 | $2.20 | Direct API access |
| GLM-4.7 (OpenRouter) | $0.40 | $1.50 | Third-party provider |
| Claude Sonnet 4.5 | ~$3-4 | ~$15 | For comparison |
| DeepSeek V3.2 | $0.28 | $0.42 | Lower price point |
Value Proposition: GLM-4.7 is roughly 4-7x cheaper than Claude/GPT while approaching their performance levels. The $3/month Coding Plan is particularly compelling for individual developers.
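A quick back-of-envelope check on that claim, using the API prices from the table and a made-up monthly workload (the token volumes here are hypothetical; adjust to your own usage):

```python
# Cost comparison for a hypothetical month of agentic coding:
# 50M input tokens and 10M output tokens. Prices are per 1M tokens,
# taken from the pricing table above.
in_tok, out_tok = 50, 10  # millions of tokens

def monthly_cost(input_price: float, output_price: float) -> float:
    return in_tok * input_price + out_tok * output_price

glm_api = monthly_cost(0.60, 2.20)   # GLM-4.7 via Z.ai API
claude = monthly_cost(3.00, 15.00)   # Claude Sonnet 4.5 (low end of range)
print(f"GLM-4.7: ${glm_api:.2f}  Claude: ${claude:.2f}  ratio: {claude / glm_api:.1f}x")
# -> GLM-4.7: $52.00  Claude: $300.00  ratio: 5.8x, inside the 4-7x range cited above
```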
Getting Started
Claude Code Integration
The easiest way to use GLM-4.7 is through Claude Code with a GLM Coding Plan subscription:
```bash
# Install Claude Code
npm install -g @anthropic-ai/claude-code

# Configure for GLM-4.7
export ANTHROPIC_AUTH_TOKEN=your-zai-api-key
export ANTHROPIC_BASE_URL=https://api.z.ai/api/anthropic
```
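With those variables set, launching `claude` routes all requests through Z.ai. If you want to verify the endpoint first, the sketch below assumes Z.ai's proxy mirrors the Anthropic Messages API at the standard /v1/messages path; the exact route and accepted model names are assumptions worth checking against Z.ai's docs.

```bash
# Optional sanity check -- assumes an Anthropic-compatible /v1/messages route
curl "$ANTHROPIC_BASE_URL/v1/messages" \
  -H "x-api-key: $ANTHROPIC_AUTH_TOKEN" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model": "glm-4.7", "max_tokens": 64,
       "messages": [{"role": "user", "content": "ping"}]}'

# Then launch Claude Code as usual
claude
```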
API Quick Start (Python)
```python
from zai import ZaiClient

client = ZaiClient(api_key="your-api-key")

response = client.chat.completions.create(
    model="glm-4.7",
    messages=[
        {"role": "user", "content": "Write a React component for a todo list"}
    ],
    thinking={"type": "enabled"},
    max_tokens=4096,
)

print(response.choices[0].message.content)
```
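Since GLM-4.7's headline strength is tool use, here is a hedged sketch of function calling with the same client. It assumes the zai-sdk accepts OpenAI-style `tools` definitions and returns OpenAI-style `tool_calls`; the `run_tests` tool itself is hypothetical.

```python
# Sketch of tool calling -- assumes OpenAI-compatible request/response shapes.
from zai import ZaiClient

client = ZaiClient(api_key="your-api-key")

tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical tool, defined by your agent
        "description": "Run the project's test suite and return failures",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.7",
    messages=[{"role": "user", "content": "Fix the failing tests in ./src"}],
    tools=tools,
    thinking={"type": "enabled"},
)

message = response.choices[0].message
if message.tool_calls:  # model decided to invoke a tool
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)
```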
Local Deployment
For local deployment, GLM-4.7 supports vLLM, SGLang, and Ollama:
```bash
# Via Ollama (easiest)
ollama run glm-4.7

# Via HuggingFace + vLLM
pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model zai-org/GLM-4.7 --tensor-parallel-size 8
```
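Once the vLLM server is up (port 8000 by default), it exposes an OpenAI-compatible API, so the standard openai Python client can talk to the local model:

```python
# Query the local vLLM server through its OpenAI-compatible endpoint.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = local.chat.completions.create(
    model="zai-org/GLM-4.7",  # must match the --model passed to vLLM
    messages=[{"role": "user", "content": "Write a bash one-liner to count TODOs"}],
)
print(resp.choices[0].message.content)
```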
Hardware Requirements
Full Model (355B):
- BF16: 16x H100 (80GB)
- FP8: 8x H100 or 4x H200
Quantized (Consumer):
- 2-bit: 24GB GPU + 128GB RAM
- Speed: ~5 tokens/second
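These figures follow directly from weights-only memory arithmetic; a quick sketch (KV cache and activations add overhead on top, which is why the listed setups have headroom):

```python
# Rough VRAM arithmetic behind the hardware requirements above (weights only).
params = 355e9  # total parameters, including inactive experts

for name, bytes_per_param in [("BF16", 2), ("FP8", 1), ("2-bit", 0.25)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name}: ~{gb:,.0f} GB of weights")

# BF16:  ~710 GB -> 16x H100 80GB = 1,280 GB total
# FP8:   ~355 GB ->  8x H100 80GB =   640 GB total
# 2-bit:  ~89 GB -> split across a 24GB GPU + 128GB RAM via MoE offloading
```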
When to Use GLM-4.7
Choose GLM-4.7 When
- You need Claude-level coding at 1/7th the cost
- Long coding sessions where context preservation matters
- Tool-heavy workflows (tau-Bench, BrowseComp)
- Multilingual codebases (66.7% SWE-bench Multilingual)
- You want open-source/self-hostable with MIT license
Consider Alternatives When
- You need absolute best SWE-bench scores (Claude 77.2%)
- Terminal-heavy workflows (Gemini 3.0 Pro leads at 54.2%)
- Chat-first use cases requiring nuanced emotional handling
- Local deployment without enterprise GPU infrastructure
- Absolute lowest cost is priority (DeepSeek V3.2 cheaper)
Conclusion
GLM-4.7 represents a significant milestone in the democratization of AI coding. For the first time, an open-source model genuinely competes with Claude and GPT on real-world coding benchmarks—and does so at a fraction of the cost.
The Preserved Thinking innovation addresses a real pain point: maintaining coherent reasoning across long coding sessions. Combined with best-in-class tool use performance and a $3/month pricing tier, GLM-4.7 makes frontier-level coding assistance accessible to individual developers and small teams.
While it doesn't beat Claude or GPT on every benchmark, the gap has closed substantially. For developers who want Claude-like capabilities without Claude-like pricing, GLM-4.7 is worth serious consideration.
Frequently Asked Questions
What is GLM-4.7?
GLM-4.7 is Z.ai's (formerly Zhipu AI) latest open-source large language model, released December 22, 2025. It's a 355B parameter Mixture-of-Experts (MoE) model with 32B active parameters, specifically optimized for agentic coding, tool usage, and complex reasoning tasks.
Who is Z.ai (Zhipu AI)?
Z.ai is a Chinese AI company founded in 2019, spun out from Tsinghua University. Valued at approximately $3-4 billion, they're one of China's 'AI Tiger' companies and are preparing for a Hong Kong IPO in early 2026. The company rebranded from Zhipu AI to Z.ai internationally in July 2025.
How does GLM-4.7 compare to Claude Sonnet 4.5?
GLM-4.7 is competitive with Claude Sonnet 4.5 on coding benchmarks: 73.8% vs 77.2% on SWE-bench Verified, but GLM-4.7 wins on LiveCodeBench (84.9% vs 64.0%) and tau-Bench (87.4% vs 87.2%). The main advantage is price—GLM Coding Plan costs $3/month vs ~$20/month for Claude Pro.
What is Preserved Thinking?
Preserved Thinking is GLM-4.7's innovation where the model retains its reasoning blocks across multi-turn conversations instead of starting fresh each turn. This reduces information loss, improves cache hit rates, and makes long coding sessions more stable and consistent.
How much does GLM-4.7 cost?
The GLM Coding Plan starts at $3/month for use with coding agents like Claude Code. API pricing is $0.40-0.60 per million input tokens and $1.50-2.20 per million output tokens. This is roughly 4-7x cheaper than Claude or GPT equivalents.
Can I run GLM-4.7 locally?
Yes, GLM-4.7 weights are available on HuggingFace under MIT license. It supports vLLM, SGLang, and Ollama for inference. However, the full model requires significant hardware—8x H100 GPUs for FP8, or 16x H100 for BF16. Quantized versions can run on consumer hardware with 24GB VRAM + 128GB RAM.
What hardware do I need for local deployment?
For the full 355B model: 8x H100 (80GB) for FP8 or 16x H100 for BF16. For quantized versions: minimum 24GB GPU + 128GB RAM using 2-bit quantization with MoE offloading. Expect ~5 tokens/second on consumer hardware.
Is GLM-4.7 truly open-source?
Yes, GLM-4.7 is released under the MIT license, which allows commercial use, modification, and distribution without restrictions. Weights are freely available on HuggingFace (zai-org/GLM-4.7) and ModelScope.
Does GLM-4.7 work with Claude Code?
Yes, GLM-4.7 integrates directly with Claude Code via the GLM Coding Plan. Configure your ANTHROPIC_AUTH_TOKEN with your Z.ai API key and set ANTHROPIC_BASE_URL to https://api.z.ai/api/anthropic. The model maps to both Opus and Sonnet endpoints.
What programming languages does GLM-4.7 support?
GLM-4.7 excels at multilingual coding with a 66.7% score on SWE-bench Multilingual, a 12.9-percentage-point improvement over its predecessor. It supports Python, JavaScript/TypeScript, Java, C++, Go, Rust, and other major languages commonly used in professional development.
How does GLM-4.7 handle long coding sessions?
GLM-4.7's Preserved Thinking mode automatically retains reasoning across turns, addressing the 'context collapse' problem where models lose track of earlier decisions. Combined with the 200K context window, it can maintain coherent multi-hour coding sessions.
What are GLM-4.7's main limitations?
GLM-4.7 still trails Gemini 3.0 Pro on Terminal Bench (41% vs 54.2%) and is slightly behind Claude on SWE-bench Verified (73.8% vs 77.2%). Some users report it can be more rigid in handling emotional nuances compared to chat-optimized models, and the full model requires substantial hardware.