Zhipu put the GLM 5.2 weights on HuggingFace under an MIT license, so the question stopped being "can I download a frontier coding model" and became "will it run on the machine I already own." For a single Mac Studio or a desktop with one GPU and a lot of RAM, the answer is a qualified yes. The qualifier is the quant.
What You Can Run Locally (and What You Can't)
This guide is about running GLM 5.2 on one machine you own, using quantized GGUF weights and llama.cpp, LM Studio, or Unsloth Studio. That is a different job from serving it to a team on a rack of H200s, which the GLM 5.2 self-host hardware and cost guide covers, and a different job again from calling the hosted API, which the GLM 5.2 access guide covers.
GLM 5.2 is a 753B-parameter model with a 1M-token context, released under MIT. At full BF16 precision the weights are ~1.5 TB, which does not fit any single desktop. Local inference means quantizing: trading some quality for a footprint that fits in your RAM. Here is the 30-second version of what fits where.
| Your machine | Quant that fits | Disk / RAM needed | What to expect |
|---|---|---|---|
| Mac Studio M3 Ultra, 512 GB | 4-bit UD-Q4_K_XL | ~376-475 GB | Best local quality, mostly lossless, usable coding speed |
| Mac Studio M3 Ultra, 256 GB | 2-bit UD-IQ2_M | ~240 GB | Codes well, ~3-9 tok/s, the common local rig |
| Desktop + 4090 + 256 GB DDR5 | 2-bit UD-IQ2_M | ~240 GB | Runs via offload, low single-digit tok/s |
| 8x H200 or 4x H100 rack | FP8 / Q4 | 376-750 GB | Production scale, see the self-host guide |
| MacBook / 64-128 GB box | none | n/a | Use the hosted plan instead |
The honest headline: a 256 GB Mac Studio running the 2-bit quant is the realistic "GLM 5.2 on my desk" setup. The 4-bit quant is the quality sweet spot, but it wants a 512 GB machine or heavy offload. Anything smaller than 256 GB is a hosted-API job, not a local one.
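The footprint numbers above are mostly arithmetic, and a rough sketch makes them easy to re-derive. The helper names here are made up for illustration, and the math assumes 1 GB = 10^9 bytes and integer rounding:

```shell
# Approximate weight footprint in GB for a model with $1 billion
# parameters stored at $2 bits per weight (integer math, 1 GB = 10^9 bytes).
weights_gb() {
  echo $(( $1 * $2 / 8 ))
}

# Effective bits per weight, times 100, implied by a quant that is $2 GB
# on disk for a model with $1 billion parameters.
bits_per_weight_x100() {
  echo $(( $2 * 8 * 100 / $1 ))
}

weights_gb 753 16              # BF16 -> 1506 GB, the "~1.5 TB" figure
bits_per_weight_x100 753 240   # 2-bit quant -> 254, about 2.5 bits per weight
bits_per_weight_x100 753 376   # 4-bit quant, low end -> 399, about 4.0 bits
```

By this estimate the "2-bit" file averages closer to 2.5 bits per weight, presumably because dynamic quants keep some tensors at higher precision.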
Decision Frame: When Local GLM 5.2 Is Worth It (and When NOT)
Run the quant locally for the right reasons. The wrong reason is saving money, because for almost everyone the hosted plan is cheaper.
When to run it locally
- Offline or air-gapped work. No outbound traffic to api.z.ai is allowed, so the model has to live on your hardware.
- Privacy on a single box. Your prompts and code never leave the machine, and one Mac Studio is the whole perimeter.
- You already own the hardware. A 256 GB or 512 GB Mac Studio bought for video or ML work is sitting idle at night, and a local quant costs you nothing extra to run.
- Tinkering and learning. You want to feel how a 753B MoE behaves, test sampling settings, or build against a local OpenAI-compatible endpoint with no rate limits.
When NOT to run it locally
- You want it to be cheap and fast. The Z.ai Coding Plan is ~$30/month and runs at full speed. A 2-bit local quant at 3-9 tok/s cannot match that, even if electricity is your only running cost. Read the access guide.
- You need to serve more than one person. A single Mac Studio is a single-session machine. Two developers hammering it at once will each feel it crawl. That is the datacenter path.
- Your machine is under 256 GB. There is no quant that makes GLM 5.2 fit a 128 GB box at quality worth using. Do not burn a weekend trying.
- You need the full 1M context. Long-context KV cache does not fit on consumer hardware. Local tops out around 16K-64K in practice.
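To put the 3-9 tok/s figure in wall-clock terms, here is a small sketch. The function name is illustrative, and the result is rounded down to whole hours of pure generation time, ignoring prompt processing:

```shell
# Whole hours needed to generate $1 tokens at $2 tokens per second.
hours_for_tokens() {
  echo $(( $1 / $2 / 3600 ))
}

hours_for_tokens 1000000 9   # best case in the quoted range -> 30
hours_for_tokens 1000000 3   # worst case in the quoted range -> 92
```

A million generated tokens is more than a day of continuous output at the fast end of the range, which is the practical meaning of "not fast."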
Stop rule
If you do not have at least 256 GB of unified memory or system RAM, stop here and use the hosted plan. No amount of quantization changes that floor.
System Requirements
flowchart TD
A[How much memory?] -->|512 GB Mac| B[4-bit UD-Q4_K_XL<br/>best local quality]
A -->|256 GB Mac or DDR5| C[2-bit UD-IQ2_M<br/>the common rig]
A -->|under 256 GB| D[Use the hosted plan<br/>not a local job]
B --> E[llama.cpp / LM Studio / Unsloth Studio]
C --> E
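The same decision can be sketched as a script. The function names are hypothetical, and the memory detection assumes macOS (`sysctl hw.memsize`) or Linux (`/proc/meminfo`). Linux reports slightly less than the installed RAM, so a 256 GB desktop may read a few GB low and is worth checking by hand.

```shell
# Map total memory in GB to the quant the flowchart recommends.
pick_quant() {
  if [ "$1" -ge 512 ]; then
    echo "UD-Q4_K_XL"
  elif [ "$1" -ge 256 ]; then
    echo "UD-IQ2_M"
  else
    echo "hosted"
  fi
}

# Total memory in GiB on macOS or Linux; prints 0 on anything else.
total_mem_gb() {
  case "$(uname -s)" in
    Darwin) echo $(( $(sysctl -n hw.memsize) / 1073741824 )) ;;
    Linux)  awk '/^MemTotal:/ { printf "%d\n", ($2 + 524288) / 1048576 }' /proc/meminfo ;;
    *)      echo 0 ;;
  esac
}

pick_quant 512   # -> UD-Q4_K_XL
pick_quant 256   # -> UD-IQ2_M
pick_quant 128   # -> hosted
```

Running `pick_quant "$(total_mem_gb)"` gives the verdict for the machine you are sitting at.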
Before you pull 240 GB of weights, confirm you have:
- Memory. 256 GB minimum (unified memory on Apple silicon, or system DDR5 on a CUDA box). The 2-bit quant is ~240 GB, so on a 256 GB machine the headroom is genuinely tight: close other apps and leave macOS its share of unified memory, or you will hit swap. You need 512 GB to run the 4-bit quant comfortably.
- Disk. The quant plus headroom: ~240 GB free for 2-bit, ~376-475 GB for 4-bit. Use an SSD, not a spinning disk, or load times become painful.
- A runner. llama.cpp built from a recent commit, LM Studio, or Unsloth Studio. The architecture (GLM MoE DSA) is new enough that an old llama.cpp build will fail to load the tensors.
- The right repo. Community GGUF quants live at huggingface.co/unsloth/GLM-5.2-GGUF. The official zai-org/GLM-5.2 repo is BF16 only and is not what you want for local inference.
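The disk item on that checklist is easy to verify before a long download. This is a minimal sketch with an illustrative function name. It relies on `df -Pk`, which prints POSIX-format output in 1024-byte blocks, and treats 1 GB as 10^9 bytes:

```shell
# Succeed if the filesystem holding directory $1 has at least $2 GB free.
# In `df -Pk` output, line 2 column 4 is the available space in KiB.
has_free_gb() {
  avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $(( $2 * 976562 )) ]
}

if has_free_gb "$HOME" 240; then
  echo "enough room for the 2-bit quant"
else
  echo "free up disk space before downloading"
fi
```

Swap 240 for the 4-bit size if you are on a 512 GB machine.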
Step-by-Step: Run GLM 5.2 Locally
Step 1: Pull a GGUF quant
Download only the quant you need, not the whole repo. The --include filter keeps you from fetching 750 GB of shards you will not use.
# 2-bit for a 256 GB machine (~240 GB on disk)
hf download unsloth/GLM-5.2-GGUF \
--local-dir ~/models/glm-5.2-gguf \
--include "*UD-IQ2_M*"
You should end up with a set of GLM-5.2-UD-IQ2_M-0000X-of-0000Y.gguf shards in ~/models/glm-5.2-gguf. Swap the filter to *UD-Q4_K_XL* if you are on a 512 GB machine. Check the live "Files and versions" tab on HuggingFace for the exact shard names, since Unsloth revises quant labels as the dynamic quants improve.
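A quick way to confirm that no shard is missing is to compare the file count against the total in the shard names. This sketch assumes the `-0000X-of-0000Y.gguf` naming described above, and the function name is made up. It counts files only and does not check sizes or checksums.

```shell
# Succeed if directory $1 (searched recursively) holds every shard of the
# quant named $2. The expected count comes from the "-of-0000Y" suffix.
shards_complete() (
  found=$(find "$1" -name "*$2*-of-*.gguf" 2>/dev/null | sort)
  [ -n "$found" ] || exit 1
  first=$(printf '%s\n' "$found" | head -n 1)
  total=${first##*-of-}                                   # e.g. "00006.gguf"
  total=${total%.gguf}                                    # e.g. "00006"
  total=$(printf '%s\n' "$total" | sed 's/^0*//')         # e.g. "6"
  have=$(printf '%s\n' "$found" | wc -l | tr -d ' ')
  [ "$have" -eq "$total" ]
)

if shards_complete "$HOME/models/glm-5.2-gguf" UD-IQ2_M; then
  echo "all shards present"
else
  echo "shards missing or not downloaded yet"
fi
```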
Step 2: Run it with llama.cpp
This is the command-line path and the one with the most control. Build a recent llama.cpp first (Metal compiles automatically on Mac; add -DGGML_CUDA=ON on an Nvidia box).
# Build once
cmake -B build && cmake --build build --config Release -j
# Serve an OpenAI-compatible endpoint on port 8080
./build/bin/llama-server \
--model ~/models/glm-5.2-gguf/GLM-5.2-UD-IQ2_M-00001-of-00006.gguf \
--ctx-size 32768 \
--n-gpu-layers 999 \
--temp 1.0 --top-p 0.95 --min-p 0.01 \
--host 0.0.0.0 --port 8080
Each flag earns its place:
- --ctx-size 32768 sets a 32K window. Raising it eats memory fast on a 256 GB machine; start here and grow only if a request needs it.
- --n-gpu-layers 999 offloads every layer it can to the GPU. On a Mac the unified memory makes this nearly free; on a 4090 it offloads the fraction that fits in 24 GB and leaves the rest on the CPU.
- --temp 1.0 --top-p 0.95 --min-p 0.01 are Zhipu's recommended sampling defaults. Getting these wrong is the most common cause of "the local model is dumber than the hosted one."
Once it loads, llama-server logs the layer count and then prints server listening on http://0.0.0.0:8080. The first load takes a minute or two off an SSD.
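If you start the server from a script, you can poll it instead of watching the log. This sketch assumes the `/health` endpoint that llama-server exposes, which returns an error status while the model is still loading. The function name and retry count are illustrative.

```shell
# Poll $1/health once per second, up to $2 tries.
# Returns 0 as soon as the server answers with a success status.
wait_for_server() {
  url=$1
  tries=$2
  while [ "$tries" -gt 0 ]; do
    if curl -sf --max-time 2 "$url/health" >/dev/null 2>&1; then
      return 0
    fi
    tries=$(( tries - 1 ))
    sleep 1
  done
  return 1
}
```

For example, `wait_for_server http://localhost:8080 300 && echo ready` allows up to about five minutes for the load.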
Step 3: Or use a GUI (LM Studio / Unsloth Studio)
If you would rather not touch a build toolchain, two GUI apps load the same GGUF quants.
LM Studio runs the same GGUF quants from a desktop app. Search for unsloth/GLM-5.2-GGUF in the in-app model browser, pick the 2-bit or 4-bit quant, and it handles the download and serving, exposing the same OpenAI-compatible endpoint on a local port.
Unsloth Studio is a web UI with automatic memory offloading, installed in one line.
curl -fsSL https://unsloth.ai/install.sh | sh
unsloth studio -H 0.0.0.0 -p 8888
Both are the better choice if you want to swap quants and settings without re-typing a long llama.cpp command each time.
Step 4: Smoke test
Point any OpenAI client at the local port and confirm it answers.
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "glm-5.2",
"messages": [{"role":"user","content":"Reply with only the string OK."}],
"max_tokens": 16
}' | jq -r '.choices[0].message.content'
You should get OK back after a short pause. If the reply is garbled or loops, your sampling params are off, so re-check --temp 1.0 --top-p 0.95 --min-p 0.01 against the values in huggingface.co/zai-org/GLM-5.2/generation_config.json.
Real Tokens/sec: What to Expect by Tier
Generation speed on local hardware is bound by memory bandwidth, not raw compute, which is why a Mac Studio with 800 GB/s unified memory beats a DDR5 desktop whose RAM runs closer to 80-100 GB/s. These are the figures to plan around.
| Setup | Quant | Realistic generation speed | Good for |
|---|---|---|---|
| Mac Studio M3 Ultra, 256 GB | 2-bit UD-IQ2_M | ~3-9 tok/s | Solo coding agent, one session |
| Mac Studio M3 Ultra, 512 GB | 4-bit UD-Q4_K_XL | a few tok/s, higher quality | Solo work where correctness matters more than speed |
| Desktop, 4090 + 256 GB DDR5 | 2-bit UD-IQ2_M | low single digits | Tinkering, offline use |
| 4x H100 / 8x H200 rack | Q4 / FP8 | tens of tok/s per stream | Teams (see self-host guide) |
The pattern: local GLM 5.2 is a single-stream, single-developer tool. The speed is fine for one coding agent thinking through a task. It is not fine for a shared endpoint, and no consumer quant changes that. If you need throughput for a team, the self-host hardware guide walks the vLLM and SGLang path on datacenter GPUs.
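To see where your own machine lands in that table, you can time one request. This is a rough sketch with made-up function names. It reads the standard `usage.completion_tokens` field from the response and divides by wall-clock time, which includes prompt processing and so slightly understates pure generation speed.

```shell
# Tokens per second, times 10, from a token count and elapsed milliseconds.
tok_per_sec_x10() {
  echo $(( $1 * 10000 / $2 ))
}

# Time one non-streaming request against the server at base URL $1.
measure_speed() {
  start=$(date +%s)
  tokens=$(curl -s "$1/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"model":"glm-5.2","messages":[{"role":"user","content":"Explain what a hash map is in about 200 words."}],"max_tokens":256}' \
    | jq -r '.usage.completion_tokens')
  end=$(date +%s)
  elapsed_ms=$(( (end - start) * 1000 ))
  [ "$elapsed_ms" -gt 0 ] || elapsed_ms=1000   # avoid dividing by zero
  tok_per_sec_x10 "$tokens" "$elapsed_ms"
}

tok_per_sec_x10 120 20000   # 120 tokens in 20 s -> 60, i.e. 6.0 tok/s
```

Call `measure_speed http://localhost:8080` once the server is up, and run it a few times, since the first request after a load is often slower.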
Common Errors During Local Setup (and Fixes)
| Error | Likely cause | Fix |
|---|---|---|
| tensor not found: blk.X.attn_q.weight | llama.cpp build too old for GLM MoE DSA | Pull a recent llama.cpp commit and rebuild with cmake --build build |
| Process killed / swap thrash on load | Quant is bigger than free RAM | Drop to a smaller quant, or close other apps; 2-bit needs ~240 GB free, not just installed |
| Output is repetitive or incoherent | Sampling params not aligned to Zhipu defaults | Set --temp 1.0 --top-p 0.95 --min-p 0.01; do not leave top_k at a low default |
| Painfully slow generation on a 4090 box | Most layers running from DDR5, not VRAM | Expected on 24 GB VRAM; lower --ctx-size, or move to a 256 GB Mac for better bandwidth |
| failed to allocate KV cache at high ctx-size | Context window too large for remaining memory | Lower --ctx-size, or quantize the KV cache with --cache-type-k q4_1 --cache-type-v q4_1 |
| Model "thinks" forever before answering | Thinking mode on for a task that does not need it | Disable it with --chat-template-kwargs '{"enable_thinking":false}' |
| Ollama pull only offers glm-5.2:cloud | No local Ollama tag exists yet | Use llama.cpp or LM Studio with the Unsloth GGUF instead |
Team / Multi-Developer: When One Mac Isn't Enough
A single local machine serves one person. The moment a second developer points an agent at the same llama-server, both sessions slow to a crawl, because consumer hardware has no spare bandwidth to split. There is no clever flag that fixes this.
Two real options when local stops scaling:
- Move to datacenter GPUs. An 8x H200 node serving FP8 handles many concurrent streams at tens of tokens per second each. That is a different cost and operations story, fully worked through in the self-host vLLM and cost guide, including the break-even math against the hosted plan.
- Use a hosted endpoint and stop running metal. For most teams this wins on every axis except data residency.
The local quant is the right tool for one developer who wants the model on their own machine. It is the wrong tool for a shared service.
Advanced: Long Context and Thinking Mode
Two knobs are worth knowing once the basic setup runs.
KV cache quantization. The 1M context is real in the architecture but unreachable on a 256 GB box, because the KV cache alone would need hundreds of gigabytes. Quantizing it buys back room:
./build/bin/llama-server \
--model ~/models/glm-5.2-gguf/GLM-5.2-UD-IQ2_M-00001-of-00006.gguf \
--ctx-size 65536 \
--cache-type-k q4_1 --cache-type-v q4_1 \
--n-gpu-layers 999 --port 8080
Since q4_1 stores about 5 bits per cached value against 16 for the default f16 cache, this cuts KV cache memory to roughly a third, letting you push context further on the same hardware, at a small quality cost on very long inputs.
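For a feel of how much each cache type saves, the block layouts give a quick estimate. The sketch below assumes llama.cpp's 32-value quantization blocks (34 bytes for q8_0, 20 for q4_1, 18 for q4_0) and uses illustrative function names. By this arithmetic q8_0 is the type that roughly halves the cache, while q4_1 lands nearer a third of the f16 size.

```shell
# Bits stored per cached value, times 10, for a given cache type.
cache_bits_x10() {
  case "$1" in
    f16)  echo 160 ;;
    q8_0) echo 85 ;;   # 34 bytes per 32 values
    q4_1) echo 50 ;;   # 20 bytes per 32 values
    q4_0) echo 45 ;;   # 18 bytes per 32 values
    *)    return 1 ;;
  esac
}

# Size of a quantized KV cache as a percentage of the f16 cache.
cache_pct_of_f16() {
  bits=$(cache_bits_x10 "$1") || return 1
  echo $(( bits * 100 / 160 ))
}

cache_pct_of_f16 q8_0   # -> 53
cache_pct_of_f16 q4_1   # -> 31
```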
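Thinking mode. GLM 5.2 has a reasoning mode that spends tokens thinking before it answers. For quick edits and short prompts it adds latency you may not want. Passing --chat-template-kwargs '{"enable_thinking":false}' to llama-server turns it off for every request that server handles, and recent llama-server builds also accept a chat_template_kwargs field in the request body if you want to toggle it per request. Leave it on for hard multi-step problems where the extra reasoning earns its keep.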
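If you want the toggle on individual requests, a payload builder is one way to do it. This sketch assumes your llama-server build accepts a `chat_template_kwargs` field in the request body, which is worth confirming against the server README for your commit. The function name is made up, and the prompt must not contain quotes or backslashes because nothing is escaped.

```shell
# Print a chat completion payload for prompt $1 with thinking set to $2
# ("true" or "false"). No JSON escaping is done on the prompt.
chat_payload() {
  printf '{"model":"glm-5.2","messages":[{"role":"user","content":"%s"}],"chat_template_kwargs":{"enable_thinking":%s}}\n' "$1" "$2"
}

chat_payload "Rename this variable." false
```

Pipe the result to the server with `chat_payload "Rename this variable." false | curl -s http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d @-`.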
When Local Is the Wrong Answer: Hosted and ofox Alternatives
If the 256 GB floor or the single-session speed rules local out, you do not have to give up GLM 5.2 at all. The same model is on the ofox catalog as z-ai/glm-5.2, priced at $1.40/M input and $4.40/M output, so you can run it hosted at full speed by changing only the base URL and model ID, with no rig to buy or babysit. You prototype against your local llama-server and then point the same client at the hosted model:
export OPENAI_BASE_URL="https://api.ofox.ai/v1"
export OPENAI_API_KEY="ofox-..."
export OPENAI_MODEL="z-ai/glm-5.2" # the exact same model, now hosted
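One way to keep a single client for both targets is to read those variables with local defaults. This is only a sketch: the function names are hypothetical, the defaults assume the llama-server from the steps above, and the prompt is not JSON-escaped.

```shell
# Chat completions URL for whichever endpoint the environment points at.
chat_url() {
  echo "${OPENAI_BASE_URL:-http://localhost:8080/v1}/chat/completions"
}

# Send prompt $1 to the configured endpoint. Falls back to the local
# server and the local model name when the variables are unset.
ask() {
  curl -s "$(chat_url)" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ${OPENAI_API_KEY:-none}" \
    -d "{\"model\":\"${OPENAI_MODEL:-glm-5.2}\",\"messages\":[{\"role\":\"user\",\"content\":\"$1\"}]}"
}
```

With the three exports set, `ask "Reply with only the string OK."` goes to the hosted model. Unset them and the same call goes back to localhost.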
The hosted access guide covers the Z.ai Coding Plan route to the same model as well. And if you want a few other open-weights coding models behind that one OpenAI-compatible endpoint, ofox lists these day-one too:
| Model | ofox model ID | Context | When to pick over GLM 5.2 |
|---|---|---|---|
| DeepSeek V4 Pro | deepseek/deepseek-v4-pro | 1M | You want a longer community track record and published SWE-bench Verified numbers |
| Kimi K2.6 | moonshotai/kimi-k2.6 | 262K | You need independently benchmarked long context, not a 16K local ceiling |
| Qwen 3 Coder Next | bailian/qwen3-coder-next | 256K | Multilingual codebases where local speed is too slow to iterate |
For a price-and-quality read on GLM against a closed model before you commit to either a local rig or a hosted subscription, see the GLM 5.2 vs GPT-5.5 cost comparison.
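For a rough feel of pay-per-token spend at the ofox prices quoted above ($1.40/M input, $4.40/M output), the arithmetic is short. The function name and the example token volumes are made up for illustration:

```shell
# Hosted cost in cents for $1 million input tokens and $2 million output
# tokens, at 140 cents/M input and 440 cents/M output.
hosted_cost_cents() {
  echo $(( $1 * 140 + $2 * 440 ))
}

hosted_cost_cents 10 2   # 10M in, 2M out -> 2280 cents, i.e. $22.80
```

Plug in your own monthly volumes to see which side of the ~$30/month Coding Plan you fall on.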
Sources Checked for This Refresh
- HuggingFace official model card, zai-org/GLM-5.2 (753B parameters, MIT license, 1M context), verified 2026-06-23: https://huggingface.co/zai-org/GLM-5.2
- Unsloth GGUF community quants and per-quant memory table, verified 2026-06-23: https://huggingface.co/unsloth/GLM-5.2-GGUF
- Unsloth GLM 5.2 run guide (quant sizes, sampling defaults, KV-cache flags, Unsloth Studio install): https://unsloth.ai/docs/models/glm-5.2
- llama.cpp project: https://github.com/ggml-org/llama.cpp
- LM Studio: https://lmstudio.ai
- Companion ofox guides: self-host hardware and cost, hosted access, GLM 5.2 vs GPT-5.5 cost
The interesting shift is not that a frontier model runs locally. It is how little it now costs to find out. A 256 GB Mac Studio you already own and an afternoon of downloading is the whole experiment. The next thing to watch is FP4 and tighter dynamic quants: the day a good 4-bit drops under 200 GB, the local floor moves from a 256 GB Mac down to a 128 GB one, and a lot more desks qualify.
Originally published on ofox.ai/blog.