GLM-5: The Open-Source Frontier Model You Can Self-Host
On February 11, 2026, Z.ai (formerly Zhipu AI) released GLM-5 on Hugging Face under an MIT license. That single decision changed the landscape of open-source AI. For the first time, a model that competes directly with GPT-5.2 and Claude Opus 4.5 on major benchmarks is available for anyone to download, modify, fine-tune, and deploy commercially with zero restrictions or royalty fees.
GLM-5 is a 744-billion-parameter Mixture of Experts model with only 40 billion active parameters per inference. It was pre-trained on 28.5 trillion tokens, supports a 200K context window, and can generate outputs up to 131,072 tokens. It ranks in the top five on nearly every major frontier benchmark, and its successor GLM-5.1 (released April 7, 2026) has already claimed the number one spot on SWE-Bench Pro.
This guide covers what GLM-5 is, how it performs against proprietary competitors, what it costs to run via API or self-hosted, and how to deploy it yourself with vLLM and Docker.
If you are evaluating the broader self-hosting vs cloud API tradeoff, see our detailed cost and performance comparison. For open-source developer tooling more broadly, our best open-source AI tools guide covers the full ecosystem.
What Is GLM-5 and Why Does It Matter?
GLM-5 is the fifth-generation large language model from Z.ai, a Chinese AI lab that rebranded from Zhipu AI. The model is specifically designed for complex systems engineering and long-horizon agentic tasks: advanced reasoning, coding, tool use, web browsing, terminal operations, and multi-step agentic workflows.
Technical Specifications
| Specification | Value |
|---|---|
| Total Parameters | 744B |
| Active Parameters (MoE) | 40B |
| Architecture | Mixture of Experts with DeepSeek Sparse Attention (DSA) |
| Expert Count | 256 experts, 8 activated per token (5.9% sparsity) |
| Pre-training Data | 28.5T tokens |
| Context Window | 200K tokens |
| Max Output Length | 131,072 tokens |
| License | MIT |
| Release Date | February 11, 2026 |
| Training Hardware | 100,000 Huawei Ascend 910B chips |
The architecture pairs DeepSeek's Multi-head Latent Attention (MLA) with DeepSeek Sparse Attention (DSA). This pairing substantially reduces deployment cost while preserving the model's ability to process extremely long contexts: in practical terms, the DSA mechanism allows GLM-5 to handle 200K-token sequences without the computational overhead of traditional dense attention.
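To make the expert-routing numbers from the spec table concrete, here is a minimal, generic sketch of top-k MoE gating in plain Python: score all 256 experts, keep the 8 highest-scoring ones, and renormalize their softmax weights. This illustrates the general technique only; it is not GLM-5's actual gating code, and the random logits stand in for a real learned gate network.

```python
import math
import random

def top_k_route(gate_logits, k=8):
    """Pick the top-k experts by gate score and renormalize
    their softmax weights so they sum to 1."""
    topk = sorted(range(len(gate_logits)),
                  key=lambda i: gate_logits[i], reverse=True)[:k]
    exps = [math.exp(gate_logits[i]) for i in topk]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(topk, exps)]

random.seed(0)
NUM_EXPERTS, TOP_K = 256, 8  # figures from the spec table above
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
routing = top_k_route(logits, TOP_K)

print(len(routing))  # 8 experts chosen for this token
```

Each token's hidden state would then be sent only to those 8 experts, which is why a 744B-parameter model can run with roughly 40B parameters active per forward pass.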
One notable detail: GLM-5 was trained entirely on 100,000 Huawei Ascend 910B accelerator chips with zero NVIDIA hardware involved.
The MIT license means no commercial restrictions. You can download, modify, fine-tune, and redistribute GLM-5 without paying royalties or seeking permission. For enterprises that need full control over their model stack, this is the most permissive license available at this performance tier.
Benchmark Comparison: GLM-5 vs GPT-5.2, Claude Opus 4.5, and Peers
The following table uses official scores from the GLM-5 Hugging Face model card and covers the major frontier evaluation suites as of April 2026.
Frontier Benchmark Scores
| Benchmark | GLM-5 | Claude Opus 4.5 | GPT-5.2 | Gemini 3 Pro | Kimi K2.5 | DeepSeek-V3.2 |
|---|---|---|---|---|---|---|
| HLE | 30.5 | 28.4 | 35.4 | 37.2 | 31.5 | 25.1 |
| HLE (w/ Tools) | 50.4 | 43.4 | 45.5 | 45.8 | 51.8 | 40.8 |
| AIME 2026 I | 92.7 | 93.3 | -- | 90.6 | 92.5 | 92.7 |
| GPQA-Diamond | 86.0 | 87.0 | 92.4 | 91.9 | 87.6 | 82.4 |
| SWE-bench Verified | 77.8 | 80.9 | 80.0 | 76.2 | 76.8 | 73.1 |
| SWE-bench Multilingual | 73.3 | 77.5 | 72.0 | 65.0 | 73.0 | 70.2 |
| Terminal-Bench 2.0 | 56.2 | 59.3 | 54.0 | 54.2 | 50.8 | 39.3 |
| BrowseComp | 62.0 | 37.0 | -- | 37.8 | 60.6 | 51.4 |
| BrowseComp (w/ Context) | 75.9 | 67.8 | 65.8 | 59.2 | 74.9 | 67.6 |
| tau-2-Bench | 89.7 | 91.6 | 85.5 | 90.7 | 80.2 | 85.3 |
| MCP-Atlas (Public) | 67.8 | 65.2 | 68.0 | 66.6 | 63.8 | 62.2 |
| CyberGym | 43.2 | 50.6 | -- | 39.9 | 41.3 | 17.3 |
| Tool-Decathlon | 38.0 | 43.5 | 46.3 | 36.4 | 27.8 | 35.2 |
What the Numbers Tell Us
Where GLM-5 leads: BrowseComp is the standout -- GLM-5 scores 62.0 (75.9 with context management) versus Claude Opus 4.5 at 37.0 (67.8). The model also excels on HLE with tools (50.4 vs Claude's 43.4 and GPT-5.2's 45.5) and MCP-Atlas (67.8 vs Claude's 65.2). These benchmarks measure agentic capabilities: web browsing, tool use, and multi-step task completion.
Where GLM-5 is competitive but trails slightly: On SWE-bench Verified, GLM-5 scores 77.8 against Claude Opus 4.5 at 80.9 and GPT-5.2 at 80.0. The gap is narrow enough that the model is viable for real-world coding tasks. AIME 2026 I shows the same pattern: 92.7 for GLM-5 versus 93.3 for Claude.
Where GLM-5 falls short: GPQA-Diamond (86.0 vs GPT-5.2's 92.4) and Tool-Decathlon (38.0 vs GPT-5.2's 46.3) show meaningful gaps. For use cases demanding the absolute best on scientific reasoning or complex tool orchestration, proprietary models still have an edge.
The successor, GLM-5.1 (released April 7, 2026), pushes further. It claims the number one spot on SWE-Bench Pro at 58.4, surpassing GPT-5.4 (57.7) and Claude Opus 4.6 (57.3), and its Terminal-Bench 2.0 score jumps to 69.0. For an open-source model to lead the most challenging software engineering benchmark is unprecedented.
API Pricing: How Much Does GLM-5 Cost to Use?
GLM-5 is available through multiple API providers if you prefer not to self-host.
Official API Pricing
| Provider | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Z.ai (official) | $1.00 | $3.20 |
| OpenRouter (Ambient, FP8) | $0.72 | $2.30 |
For context, here is how this compares to proprietary alternatives:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GLM-5 (Z.ai official) | $1.00 | $3.20 |
| GPT-5 (OpenAI) | $1.25 | $10.00 |
| Claude Opus 4.6 (Anthropic) | $5.00 | $25.00 |
| Claude Sonnet 4.6 (Anthropic) | $3.00 | $15.00 |
GLM-5's output pricing is significantly lower than every proprietary frontier model. At $3.20 per million output tokens, it costs roughly one-third of GPT-5 and one-eighth of Claude Opus 4.6 on output-heavy workloads. However, note that GLM-5 is notably verbose: during Artificial Analysis evaluations it consumed 110 million tokens versus a median of 40 million across comparable models. Verbosity partially offsets the per-token savings.
Through OpenRouter, prices drop further to $0.72/$2.30 via third-party providers running FP8 quantized variants.
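To see what these rates mean in practice, here is a small sketch that prices a hypothetical workload against the tables above. The 50M-input / 200M-output monthly volumes are invented for illustration; the per-million prices come from the tables, and the ~2.75x verbosity factor is just the 110M-vs-40M token ratio reported by Artificial Analysis.

```python
# Per-million-token prices (input, output) from the tables above.
PRICES = {
    "glm-5":          (1.00, 3.20),
    "gpt-5":          (1.25, 10.00),
    "claude-opus-46": (5.00, 25.00),
}

def monthly_cost(model, input_mtok, output_mtok):
    """Cost in USD for a workload in millions of tokens, rounded to cents."""
    inp, out = PRICES[model]
    return round(input_mtok * inp + output_mtok * out, 2)

# Hypothetical output-heavy workload: 50M input, 200M output tokens/month.
workload = {m: monthly_cost(m, 50, 200) for m in PRICES}
print(workload)  # glm-5: 50*1.00 + 200*3.20 = 690.0

# Adjusting GLM-5's output volume for its reported verbosity (110M vs 40M):
verbose = monthly_cost("glm-5", 50, 200 * 110 / 40)
print(verbose)
```

Even after inflating GLM-5's output tokens by the verbosity ratio, this particular workload still prices below GPT-5 and far below Claude Opus 4.6; the gap narrows, but does not close, on output-heavy use.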
How to Self-Host GLM-5 with vLLM and Docker
Self-hosting gives you full control over your data, zero per-token costs after hardware investment, and the ability to fine-tune for your use case. The primary deployment path uses vLLM, which provides an OpenAI-compatible API.
Hardware Requirements
GLM-5 in BF16 needs approximately 860 GB of VRAM. The practical minimum is 8x NVIDIA H200 GPUs (141 GB each, totaling 1,128 GB). For the FP8 quantized variant, 8x H100 (80 GB each, 640 GB total) is sufficient.
For quantized GGUF versions from Unsloth, the requirements drop substantially:
| Quantization | Disk Size | Approx RAM/VRAM Needed |
|---|---|---|
| Full BF16 | 1.65 TB | 860+ GB VRAM |
| Dynamic 2-bit GGUF | 241 GB | ~256 GB RAM |
| Dynamic 1-bit GGUF | 176 GB | ~180 GB RAM |
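A naive back-of-envelope check helps sanity-check these sizes: multiply the parameter count by bits per weight. The real files in the table run larger than this estimate because dynamic GGUF quants keep sensitive layers (embeddings, attention) at higher precision; treat the numbers below as lower bounds, not official figures.

```python
PARAMS = 744e9  # total parameters, from the spec table

def rough_size_gb(bits_per_weight):
    """Naive estimate assuming every weight uses the same bit width."""
    return PARAMS * bits_per_weight / 8 / 1e9

for label, bits in [("BF16", 16), ("2-bit", 2), ("1-bit", 1)]:
    print(f"{label}: ~{rough_size_gb(bits):.0f} GB")
```

The BF16 estimate (~1.49 TB) lands close to the listed 1.65 TB once tokenizer files and mixed-precision layers are accounted for, and the same overhead explains why the "2-bit" file is 241 GB rather than ~186 GB.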
Option 1: vLLM Direct Serve
Install the required dependencies first:
```bash
pip install "vllm==0.19.0" --torch-backend=auto
pip install "transformers>=5.4.0"
```
Then serve the model:
```bash
vllm serve zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.85 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-5-fp8
```
Option 2: Docker Deployment
For production deployments, Docker provides isolation and reproducibility:
```bash
docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:glm51 zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-5-fp8
```
If your CUDA version is 13 or higher, use the vllm/vllm-openai:glm51-cu130 image instead.
Option 3: SGLang
SGLang is an alternative serving framework that supports GLM-5 from version 0.5.10:
```bash
sglang serve \
  --model-path zai-org/GLM-5 \
  --tp-size 8 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.85 \
  --served-model-name glm-5
```
Testing Your Deployment
Once the server is running, test it with the OpenAI Python client:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM does not check the key unless one is configured
)

response = client.chat.completions.create(
    model="glm-5-fp8",
    messages=[
        {"role": "user", "content": "Explain the MoE architecture in GLM-5."}
    ],
    temperature=0.7
)
print(response.choices[0].message.content)
The vLLM server exposes an OpenAI-compatible API, so any tool or library that works with the OpenAI SDK works with your self-hosted instance without code changes.
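Because the serve commands above pass --enable-auto-tool-choice, the endpoint also accepts OpenAI-style function-calling requests. The sketch below only constructs the request body, so it runs without a live server; the get_weather tool is a hypothetical example I made up for illustration, and executing the tool and feeding its result back to the model is up to your application loop.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool, not part of GLM-5
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# The JSON body you would POST to /v1/chat/completions on the server above.
payload = {
    "model": "glm-5-fp8",
    "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
    "tools": tools,
    "tool_choice": "auto",
}
print(json.dumps(payload, indent=2))
```

The same payload works via the OpenAI Python client by passing tools= and tool_choice= to client.chat.completions.create.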
Supported Frameworks Summary
| Framework | Minimum Version |
|---|---|
| vLLM | 0.19.0+ |
| SGLang | 0.5.10+ |
| KTransformers | 0.5.3+ |
| Transformers | 5.4.0+ |
| xLLM | 0.8.0+ |
Running GLM-5 Locally with Quantized Models
Not everyone has a rack of H200 GPUs. Unsloth provides quantized GGUF versions of GLM-5 on Hugging Face (unsloth/GLM-5-GGUF) that trade some accuracy for dramatically lower resource requirements.
What Quantization Gives Up
The full BF16 model requires 1.65 TB of disk space. The dynamic 2-bit GGUF brings this down to 241 GB (an 85% reduction), while the 1-bit variant drops to 176 GB (89% reduction). You can run the 1-bit variant on a system with around 180 GB of system RAM using llama.cpp -- no GPU required, though inference will be slower.
For a middle ground, the 2-bit quantization with MoE offloading works on a system with a single 24 GB GPU plus 256 GB of system RAM. Active experts run on the GPU while inactive ones stay in system memory.
Running with llama.cpp
```bash
# Download the quantized model
huggingface-cli download unsloth/GLM-5-GGUF \
  --include "GLM-5-UD-Q2_K_XL.gguf" \
  --local-dir ./models

# Serve with llama.cpp
./llama-server \
  -m ./models/GLM-5-UD-Q2_K_XL.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -c 8192
```
Quantized models show degraded performance on reasoning-heavy benchmarks. For production agentic workloads, the full FP8 or BF16 model is recommended.
When to Choose GLM-5 Over Proprietary Models
The decision depends on your constraints and use case.
Choose GLM-5 When
- Data sovereignty is non-negotiable. Self-hosting means your data never leaves your infrastructure. For regulated industries (healthcare, finance, defense), this eliminates an entire category of compliance risk.
- You need agentic capabilities at scale. GLM-5 leads on BrowseComp and MCP-Atlas, the benchmarks that measure real-world tool use and web interaction. If you are building agents that browse, use tools, and operate autonomously, GLM-5 is purpose-built for this.
- Output-heavy workloads are driving your API bill. At $3.20 per million output tokens versus $25.00 for Claude Opus 4.6, the cost difference is enormous on workloads that generate long outputs (code generation, report writing, document processing).
- You want to fine-tune. The MIT license and available weights mean you can fine-tune GLM-5 on your proprietary data. No proprietary frontier model offers this.
Choose Proprietary Models When
- You need peak coding accuracy. Claude Opus 4.6 (80.9 SWE-bench Verified) and GPT-5.2 (80.0) still lead GLM-5 (77.8) on code-specific benchmarks by a meaningful margin.
- Scientific reasoning is critical. GPT-5.2 scores 92.4 on GPQA-Diamond versus GLM-5's 86.0. For research and academic applications, this gap matters.
- You do not want to manage GPU infrastructure. Self-hosting a 744B-parameter model requires significant operational expertise and hardware investment. If your team does not have this capability, API access is the pragmatic choice.
- Latency is paramount. Through the Z.ai API, GLM-5 achieves 74.4 tokens per second with a 1.67-second time to first token. This is competitive, but proprietary providers with dedicated infrastructure often deliver more consistent latency guarantees at scale.
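Those two latency figures support a rough mental model: total wall time is approximately time-to-first-token plus output length divided by decode throughput. The sketch below applies that formula to the Z.ai API numbers quoted above; it ignores queuing, batching, and network overhead, so treat the results as optimistic lower bounds rather than guarantees.

```python
TTFT_S = 1.67        # time to first token via the Z.ai API (figure above)
TOKENS_PER_S = 74.4  # decode throughput (figure above)

def wall_time_s(output_tokens):
    """Rough end-to-end latency: first-token wait plus steady decoding."""
    return TTFT_S + output_tokens / TOKENS_PER_S

for n in (100, 1000, 4000):
    print(f"{n} output tokens: ~{wall_time_s(n):.1f} s")
```

For short completions the TTFT dominates; for long agentic outputs the decode rate does, which is where GLM-5's verbosity also shows up as extra wall-clock time.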
Key Takeaways
GLM-5 is the first MIT-licensed model to genuinely compete with GPT-5.2 and Claude Opus 4.5 across major benchmarks. With 744B total parameters (40B active) and pre-training on 28.5T tokens, it delivers frontier-tier performance.
Benchmark performance is strong but not universally leading. GLM-5 excels on agentic benchmarks (BrowseComp, HLE with tools, MCP-Atlas) but trails on GPQA-Diamond and SWE-bench Verified. Its successor GLM-5.1 already leads SWE-Bench Pro.
API pricing undercuts proprietary competitors significantly. At $1.00/$3.20 per million tokens (input/output) through Z.ai, it costs a fraction of Claude Opus 4.6 ($5.00/$25.00) and less than GPT-5 ($1.25/$10.00) on output.
Self-hosting requires serious hardware. The full model needs 8x H200 GPUs (860+ GB VRAM). FP8 quantization brings this to 8x H100. Unsloth's 2-bit GGUF enables CPU-only inference at 241 GB, but with accuracy tradeoffs.
Deployment is straightforward with vLLM or Docker. A single `vllm serve` or `docker run` command with tensor parallelism gets you an OpenAI-compatible API endpoint. SGLang, KTransformers, and xLLM are also supported.

The MIT license is the real differentiator. Beyond benchmark scores, the freedom to modify, fine-tune, and commercially deploy without restrictions makes GLM-5 uniquely valuable for enterprises building proprietary AI products.
Watch GLM-5.1 closely. Released April 7, 2026, it claims number one on SWE-Bench Pro (58.4) and substantially improves Terminal-Bench 2.0 scores. The GLM model family is iterating fast.
Conclusion
GLM-5 represents a genuine inflection point for open-source AI. Previous open-weight models competed well on specific tasks but never across the full range of frontier benchmarks simultaneously. GLM-5 does, under the most permissive license available.
For teams evaluating self-hosted AI infrastructure, the calculus has changed. You no longer need to accept a significant performance penalty to gain data sovereignty and fine-tuning capability. The gap between open-source and proprietary frontier models has narrowed to single-digit percentage points on most benchmarks, and on agentic tasks like BrowseComp, GLM-5 actually leads.
The practical barriers remain real: 860 GB of VRAM is not trivial. But with FP8 quantization, Docker deployment, and multiple serving frameworks, the path from download to production API endpoint is more accessible than ever.
Whether you deploy via the Z.ai API or self-host on your own GPUs, GLM-5 has earned its place in the evaluation set for any AI infrastructure decision in 2026.