David
GLM 5.1 just dropped — 754B open-weight MoE model under MIT license. Here's how to run it

GLM 5.1 dropped two days ago. 754 billion parameters, Mixture-of-Experts architecture, MIT license. Built by Z.ai (formerly Zhipu AI), it's designed for long agentic sessions — hundreds of tool-calling rounds, sustained code optimization, complex multi-step workflows.

It leads SWE-Bench Pro and scores competitively against Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro. And it's fully open-weight.

Here's how to actually run it.

The Reality Check

Let's be honest first: 754B parameters is massive. Even as a MoE model (where only a subset of parameters activate per token), every expert's weights still have to live somewhere, so self-hosting the full model means multi-GPU servers with hundreds of gigabytes of combined VRAM — or heavily quantized versions, with experts offloaded to system RAM, on high-end hardware.
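To make the "only a subset activates" point concrete, here is a toy sketch of top-k expert routing — the mechanism MoE layers use. The expert vectors, dimensions, and k value are made up for illustration; real routers are learned and sit inside every MoE layer.

```python
# Toy MoE router: each expert has a gating vector; a token is scored
# against all of them, but only the top-k experts actually run.
# This is why a 754B MoE touches far fewer weights per token than a
# dense 754B model would.
def top_k_experts(token, expert_vecs, k=2):
    # Dot-product score of the token against each expert's gating vector.
    scores = [sum(t * w for t, w in zip(token, vec)) for vec in expert_vecs]
    # Indices of the k highest-scoring experts, best first.
    return sorted(range(len(expert_vecs)), key=scores.__getitem__, reverse=True)[:k]

# Hypothetical 4-expert layer with 2-dim gating vectors.
experts = [[1, 0], [0, 1], [1, 1], [-1, 0]]
print(top_k_experts([2.0, 0.5], experts, k=2))  # → [2, 0]
```

Per token, only experts 2 and 0 would execute here; the other two sit idle — that idle fraction is the whole efficiency argument, but note the idle weights still occupy memory.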

But that's changing fast. GGUF quantizations are already available from Unsloth (unsloth/GLM-5.1-GGUF), and serving frameworks like vLLM (v0.10.0+), SGLang (v0.3.0+), and KTransformers (v0.5.3+) all support it.

For most people, the practical path is:

  1. Cloud GPU rental — Lambda Labs, RunPod, or similar. Spin up a multi-GPU instance, serve the model, point your local app at it.
  2. Quantized GGUF — Run a Q4 or Q5 quant with expert weights offloaded to system RAM (the KTransformers approach), keeping the active layers on a high-VRAM GPU (48+ GB).
  3. API access — Z.ai offers API access at competitive pricing if you want to test before committing to hardware.
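The footprint of option 2 is easy to sanity-check with arithmetic. The 754B figure comes from the announcement; the bits-per-weight values below are rough averages for the corresponding llama.cpp quant types, not measurements of this specific GGUF.

```python
# Back-of-envelope size estimate for a quantized checkpoint:
# total parameters x bits-per-weight / 8 = bytes of weights.
def quant_size_gb(total_params: float, bits_per_weight: float) -> float:
    return total_params * bits_per_weight / 8 / 1e9

TOTAL = 754e9  # parameter count from the GLM 5.1 announcement

# Approximate effective bits-per-weight for common llama.cpp quants.
for name, bpw in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5)]:
    print(f"{name}: ~{quant_size_gb(TOTAL, bpw):.0f} GB")
```

Even at Q4 the weights land in the hundreds of gigabytes, which is why the 48+ GB VRAM path depends on spilling inactive experts to system RAM rather than fitting the whole model on the GPU.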

What Makes GLM 5.1 Different

Most frontier models are good at chat. GLM 5.1 is specifically optimized for agentic workflows — the kind where the model needs to:

  • Plan a multi-step approach to a coding problem
  • Call tools (file read, shell execute, web search) dozens of times
  • Keep context coherent across hundreds of conversation turns
  • Self-correct when a tool call returns unexpected results
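The loop those four bullets describe can be sketched in a few lines. The model here is a stub so the example is self-contained; in practice that function would be a chat-completion call against a GLM 5.1 endpoint, and the tool registry would hold real file/shell/search implementations.

```python
# Minimal agentic loop: plan, call a tool, feed the result back,
# repeat until the model emits a final answer (or we hit a round cap).
def shell_tool(cmd: str) -> str:
    # Stand-in for real execution; a production agent would sandbox this.
    return f"simulated output of: {cmd}"

TOOLS = {"shell": shell_tool}

def stub_model(history):
    # Stand-in for the LLM: request one tool call, then finish.
    if not any(m["role"] == "tool" for m in history):
        return {"tool": "shell", "args": "ls"}
    return {"final": "done after inspecting tool output"}

def agent_loop(task: str, max_rounds: int = 8) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_rounds):
        step = stub_model(history)
        if "final" in step:
            return step["final"]          # model decided it's finished
        result = TOOLS[step["tool"]](step["args"])
        history.append({"role": "tool", "content": result})  # self-correction input
    return "round limit reached"

print(agent_loop("list the repo"))
```

The part GLM 5.1 is tuned for is surviving hundreds of iterations of that `for` loop without losing the thread — the loop itself is trivial; coherence across it is not.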

The benchmarks reflect this. On NL2Repo (converting natural-language specs into full repositories) and Terminal-Bench 2.0 (executing complex terminal workflows), GLM 5.1 outperforms its predecessor GLM-5 by a wide margin.

For reference:

  • AIME 2026: 95.3 (math reasoning)
  • HMMT Nov 2025: 94.0 (competition math)
  • HLE: 31.0 (general knowledge — Gemini 3.1 Pro still leads at 45.0)
  • HLE with Tools: 52.3 (knowledge + tool use)

Running It Locally with Ollama or vLLM

Once quantized versions stabilize on Ollama's model library, running GLM 5.1 will be as simple as:

ollama pull glm5.1

For vLLM, the model card provides the serving command:

vllm serve zai-org/GLM-5.1 --tensor-parallel-size 4

Then point any OpenAI-compatible frontend at http://localhost:8000/v1.
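"OpenAI-compatible" means any client that can speak the Chat Completions wire format works unchanged. A minimal sketch using only the standard library — the endpoint and model name assume the vLLM command above; the request is built but not sent, since the server may not be running:

```python
import json
import urllib.request

# Build a Chat Completions request for a local vLLM server.
# Base URL and model name match the `vllm serve` example above.
def build_request(prompt: str, model: str = "zai-org/GLM-5.1"):
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_request("Summarize MoE routing in one sentence.")
print(req.full_url)
# urllib.request.urlopen(req) would send it once the server is up.
```

Swapping in the official `openai` Python client is just a matter of setting `base_url="http://localhost:8000/v1"`; the payload is identical.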

What This Means for Local AI

Every few months, the "you need cloud APIs for anything serious" argument gets weaker. GLM 5.1 is another step in that direction. A 754B MoE model under MIT license, with full weights on HuggingFace, optimized for the kind of agentic work that used to require Claude or GPT-4.

The gap between proprietary and open-weight models is shrinking on every benchmark. The hardware requirements are still steep for frontier models, but quantization and MoE efficiency are making it increasingly practical.

If you want a frontend that can talk to GLM 5.1 (or any other model behind an OpenAI-compatible API), we added GLM 5.1 support in Locally Uncensored v2.3.0 this week — alongside Gemma 4, Qwen 3.5, and 20+ other provider presets. One app, swap models per conversation, no reconfiguration needed.

