Owen

Posted on Jun 23 • Originally published at ofox.ai

Run GLM 5.2 Locally (2026): 2-bit on a 256GB Mac or 4090 box

#ai #llm #gguf #llamacpp

Zhipu put the GLM 5.2 weights on HuggingFace under an MIT license, so the question stopped being "can I download a frontier coding model" and became "will it run on the machine I already own." For a single Mac Studio or a desktop with one GPU and a lot of RAM, the answer is a qualified yes. The qualifier is the quant.

What You Can Run Locally (and What You Can't)

This guide is about running GLM 5.2 on one machine you own, using quantized GGUF weights and llama.cpp, LM Studio, or Unsloth Studio. That is a different job from serving it to a team on a rack of H200s, which the GLM 5.2 self-host hardware and cost guide covers, and a different job again from calling the hosted API, which the GLM 5.2 access guide covers.

GLM 5.2 is a 753B-parameter model with a 1M-token context, released under MIT. At full BF16 precision the weights are ~1.5 TB, which does not fit any single desktop. Local inference means quantizing: trading some quality for a footprint that fits in your RAM. Here is the 30-second version of what fits where.

Your machine	Quant that fits	Disk / RAM needed	What to expect
Mac Studio M3 Ultra, 512 GB	4-bit UD-Q4_K_XL	~376-475 GB	Best local quality, mostly lossless, usable coding speed
Mac Studio M3 Ultra, 256 GB	2-bit UD-IQ2_M	~240 GB	Codes well, ~3-9 tok/s, the common local rig
Desktop + 4090 + 256 GB DDR5	2-bit UD-IQ2_M	~240 GB	Runs via offload, low single-digit tok/s
8x H200 or 4x H100 rack	FP8 / Q4	376-750 GB	Production scale, see the self-host guide
MacBook / 64-128 GB box	none	n/a	Use the hosted plan instead
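
The sizes in that table come from simple arithmetic: parameter count times bits per weight, divided by eight. Here is a minimal sketch of that calculation. The function names are made up for illustration, sizes are decimal gigabytes, and file metadata is ignored.

```shell
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Approximate weight footprint in decimal GB: params * bits / 8.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.0f\n", p * b / 8 }'
}

# bits_per_weight PARAMS_BILLIONS SIZE_GB
# Inverse: the average bits per weight implied by a file size.
bits_per_weight() {
  awk -v p="$1" -v s="$2" 'BEGIN { printf "%.2f\n", s * 8 / p }'
}

weights_gb 753 16        # BF16: prints 1506, i.e. ~1.5 TB
bits_per_weight 753 240  # the ~240 GB quant: prints 2.55
```

The second number explains why a "2-bit" file is larger than 753B x 2 bits would suggest: the average works out to about 2.55 bits per weight.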

The honest headline: a 256 GB Mac Studio running the 2-bit quant is the realistic "GLM 5.2 on my desk" setup. The 4-bit quant is the quality sweet spot, but it wants a 512 GB machine or heavy offload. Anything smaller than 256 GB is a hosted-API job, not a local one.

Decision Frame: When Local GLM 5.2 Is Worth It (and When NOT)

Run the quant locally for the right reasons. The wrong reason is saving money, because for almost everyone the hosted plan is cheaper.

When to run it locally

Offline or air-gapped work. No outbound traffic to api.z.ai is allowed, so the model has to live on your hardware.
Privacy on a single box. Your prompts and code never leave the machine, and one Mac Studio is the whole perimeter.
You already own the hardware. A 256 GB or 512 GB Mac Studio bought for video or ML work is sitting idle at night, and a local quant costs you nothing extra to run.
Tinkering and learning. You want to feel how a 753B MoE behaves, test sampling settings, or build against a local OpenAI-compatible endpoint with no rate limits.

When NOT to run it locally

You want it to be cheap and fast. The Z.ai Coding Plan is ~$30/month and runs at full speed. A 2-bit local quant at 3-9 tok/s cannot match that, even if electricity is the only cost you count. Read the access guide.
You need to serve more than one person. A single Mac Studio is a single-session machine. Two developers hammering it at once will each feel it crawl. That is the datacenter path.
Your machine is under 256 GB. There is no quant that makes GLM 5.2 fit a 128 GB box at quality worth using. Do not burn a weekend trying.
You need the full 1M context. Long-context KV cache does not fit on consumer hardware. Local tops out around 16K-64K in practice.

Stop rule

If you do not have at least 256 GB of unified memory or system RAM, stop here and use the hosted plan. No amount of quantization changes that floor.
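
The stop rule and the tiers above fit in a few lines of shell. This is a sketch, not an official tool: pick_quant is a hypothetical helper, and it expects the nominal memory size in GB, because the operating system reports slightly less than what is installed.

```shell
# pick_quant MEMORY_GB
# Maps nominal installed memory to the tiers used in this guide.
# To look the number up: `sysctl -n hw.memsize` on macOS (bytes),
# or `grep MemTotal /proc/meminfo` on Linux (kB).
pick_quant() {
  mem_gb="$1"
  if [ "$mem_gb" -ge 512 ]; then
    echo "UD-Q4_K_XL"
  elif [ "$mem_gb" -ge 256 ]; then
    echo "UD-IQ2_M"
  else
    echo "hosted"
  fi
}
```

For example, pick_quant 256 prints UD-IQ2_M, and anything below 256 prints hosted.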

System Requirements

flowchart TD
  A[How much memory?] -->|512 GB Mac| B[4-bit UD-Q4_K_XL<br/>best local quality]
  A -->|256 GB Mac or DDR5| C[2-bit UD-IQ2_M<br/>the common rig]
  A -->|under 256 GB| D[Use the hosted plan<br/>not a local job]
  B --> E[llama.cpp / LM Studio / Unsloth Studio]
  C --> E

Before you pull 240 GB of weights, confirm you have:

Memory. 256 GB minimum (unified memory on Apple silicon, or system DDR5 on a CUDA box). The 2-bit quant is ~240 GB, so on a 256 GB machine the headroom is genuinely tight: close other apps and leave macOS its share of unified memory, or you will hit swap. Plan on 512 GB to run the 4-bit quant comfortably.
Disk. Room for the quant plus some headroom: the files alone are ~240 GB for 2-bit and ~376-475 GB for 4-bit. Use an SSD, not a spinning disk, or load times become painful.
A runner. llama.cpp built from a recent commit, LM Studio, or Unsloth Studio. The architecture (GLM MoE DSA) is new enough that an old llama.cpp build will fail to load the tensors.
The right repo. Community GGUF quants live at huggingface.co/unsloth/GLM-5.2-GGUF. The official zai-org/GLM-5.2 repo is BF16 only and is not what you want for local inference.
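
You can check the disk item before starting a multi-hour download. This is a rough sketch built on POSIX df. The helper names are hypothetical, and the 240 in the usage note is the 2-bit size quoted above.

```shell
# free_disk_gb DIR: free space on the filesystem holding DIR, in GiB.
free_disk_gb() {
  df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# has_room FREE_GB NEEDED_GB: succeeds when FREE_GB covers NEEDED_GB.
has_room() {
  [ "$1" -ge "$2" ]
}

# preflight DIR NEEDED_GB: report whether DIR can hold the quant.
preflight() {
  free="$(free_disk_gb "$1")"
  if has_room "$free" "$2"; then
    echo "ok: ${free} GB free, ${2} GB needed"
  else
    echo "short: ${free} GB free, ${2} GB needed"
    return 1
  fi
}
```

Run it as preflight ~/models 240 once the target directory exists. A non-zero exit status means you should free up space first.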

Step-by-Step: Run GLM 5.2 Locally

Step 1: Pull a GGUF quant

Download only the quant you need, not the whole repo. The --include filter keeps you from fetching 750 GB of shards you will not use.

# 2-bit for a 256 GB machine (~240 GB on disk)
hf download unsloth/GLM-5.2-GGUF \
  --local-dir ~/models/glm-5.2-gguf \
  --include "*UD-IQ2_M*"

You should end up with a set of GLM-5.2-UD-IQ2_M-0000X-of-0000Y.gguf shards in ~/models/glm-5.2-gguf. Swap the filter to *UD-Q4_K_XL* if you are on a 512 GB machine. Check the live "Files and versions" tab on HuggingFace for the exact shard names, since Unsloth revises quant labels as the dynamic quants improve.
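
An interrupted download leaves a partial shard set that fails only at load time, so it is worth counting the files first. The sketch below assumes the shards follow the NNNNN-of-MMMMM naming shown above. shards_complete is a hypothetical helper, and it searches subdirectories in case the repo nests each quant in its own folder.

```shell
# shards_complete DIR QUANT
# Succeeds when DIR (searched recursively) holds every shard of QUANT,
# judged by the NNNNN-of-MMMMM suffix in the file names.
shards_complete() {
  found="$(find "$1" -type f -name "*${2}*-of-*.gguf" | wc -l | tr -d ' ')"
  total="$(find "$1" -type f -name "*${2}*-of-*.gguf" \
    | sed -n 's/.*-of-0*\([0-9][0-9]*\)\.gguf$/\1/p' | sort -u | head -n 1)"
  [ -n "$total" ] && [ "$found" -eq "$total" ]
}
```

Usage: shards_complete ~/models/glm-5.2-gguf UD-IQ2_M && echo "all shards present".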

Step 2: Run it with llama.cpp

This is the command-line path and the one with the most control. Build a recent llama.cpp first (Metal compiles automatically on Mac; add -DGGML_CUDA=ON on an Nvidia box).

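# Build once (the commands below run from inside the llama.cpp checkout)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build --config Release -j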

# Serve an OpenAI-compatible endpoint on port 8080
./build/bin/llama-server \
  --model ~/models/glm-5.2-gguf/GLM-5.2-UD-IQ2_M-00001-of-00006.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 999 \
  --temp 1.0 --top-p 0.95 --min-p 0.01 \
  --host 0.0.0.0 --port 8080

Each flag earns its place:

--ctx-size 32768 sets a 32K window. Raising it eats memory fast on a 256 GB machine; start here and grow only if a request needs it.
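--n-gpu-layers 999 asks llama.cpp to put every layer on the GPU. On a Mac the unified memory makes this nearly free. On a 4090 the full model cannot fit in 24 GB, and an explicit layer count is not trimmed to fit, so this flag alone typically ends in a CUDA out-of-memory error. There, keep the MoE expert tensors in system RAM with --cpu-moe (or --n-cpu-moe N for the first N layers) so that only the attention and shared weights go to VRAM.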
--temp 1.0 --top-p 0.95 --min-p 0.01 are Zhipu's recommended sampling defaults. Getting these wrong is the most common cause of "the local model is dumber than the hosted one."
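
The offload flags are the one part of the command that differs between a Mac and a CUDA desktop. Below is a sketch of that choice, assuming a llama.cpp build recent enough to have --cpu-moe. offload_flags is a hypothetical helper, and the right split on a 24 GB card may take some tuning with --n-cpu-moe.

```shell
# offload_flags PLATFORM: llama-server offload flags for "mac" or "cuda".
offload_flags() {
  case "$1" in
    # Unified memory: every layer can live on the GPU side.
    mac)  echo "--n-gpu-layers 999" ;;
    # 24 GB card: attention and shared weights on the GPU,
    # MoE expert tensors kept in system RAM.
    cuda) echo "--n-gpu-layers 999 --cpu-moe" ;;
    *)    echo "unknown platform: $1" >&2; return 1 ;;
  esac
}
```

It slots into the launch line unquoted, for example ./build/bin/llama-server --model ... $(offload_flags cuda), so the shell splits the flags into separate arguments.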

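Once the model loads, llama-server logs the layer count and then reports that the server is listening on http://0.0.0.0:8080. The first load takes a minute or two off an SSD. Binding to 0.0.0.0 exposes the endpoint to every machine on your network; if privacy on a single box is the goal, use --host 127.0.0.1 instead.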
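
If you script the launch, poll for readiness instead of sleeping for a guessed interval. This sketch assumes your llama-server build exposes the GET /health route, which returns an error status while the model is loading and 200 once it can serve. wait_for_server is a hypothetical helper.

```shell
# wait_for_server BASE_URL TIMEOUT_SECONDS
# Polls GET BASE_URL/health about once per second until it returns success.
wait_for_server() {
  waited=0
  while [ "$waited" -lt "$2" ]; do
    # -f makes curl fail on HTTP errors such as 503 during model load.
    if curl -fs --max-time 2 -o /dev/null "$1/health"; then
      echo "ready after ${waited}s"
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "not ready after ${2}s" >&2
  return 1
}
```

Usage: wait_for_server http://localhost:8080 300 gives a 240 GB load up to five minutes before giving up.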

Step 3: Or use a GUI (LM Studio / Unsloth Studio)

If you would rather not touch a build toolchain, two GUI apps load the same GGUF quants.

LM Studio runs the same GGUF quants from a desktop app. Search for unsloth/GLM-5.2-GGUF in the in-app model browser, pick the 2-bit or 4-bit quant, and it handles the download and serving, exposing the same OpenAI-compatible endpoint on a local port.

Unsloth Studio is a web UI with automatic memory offloading, installed in one line.

curl -fsSL https://unsloth.ai/install.sh | sh
unsloth studio -H 0.0.0.0 -p 8888

Both are the better choice if you want to swap quants and settings without re-typing a long llama.cpp command each time.

Step 4: Smoke test

Point any OpenAI client at the local port and confirm it answers.

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.2",
    "messages": [{"role":"user","content":"Reply with only the string OK."}],
    "max_tokens": 16
  }' | jq -r '.choices[0].message.content'

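You should get OK back after a short pause. An empty reply can mean thinking mode spent the 16-token budget before the answer started; raise max_tokens or disable thinking as described in the errors table below. If the reply is garbled or loops, your sampling params are off, so re-check --temp 1.0 --top-p 0.95 --min-p 0.01 against generation_config.json in the zai-org/GLM-5.2 repo on HuggingFace.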

Real Tokens/sec: What to Expect by Tier

Generation speed on local hardware is bound by memory bandwidth, not raw compute, which is why a Mac Studio with 800 GB/s unified memory beats a DDR5 desktop whose RAM runs closer to 80-100 GB/s. These are the figures to plan around.

Setup	Quant	Realistic generation speed	Good for
Mac Studio M3 Ultra, 256 GB	2-bit UD-IQ2_M	~3-9 tok/s	Solo coding agent, one session
Mac Studio M3 Ultra, 512 GB	4-bit UD-Q4_K_XL	a few tok/s, higher quality	Solo work where correctness matters more than speed
Desktop, 4090 + 256 GB DDR5	2-bit UD-IQ2_M	low single digits	Tinkering, offline use
4x H100 / 8x H200 rack	Q4 / FP8	tens of tok/s per stream	Teams (see self-host guide)

The pattern: local GLM 5.2 is a single-stream, single-developer tool. The speed is fine for one coding agent thinking through a task. It is not fine for a shared endpoint, and no consumer quant changes that. If you need throughput for a team, the self-host hardware guide walks the vLLM and SGLang path on datacenter GPUs.
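
It is better to measure your own rig than to trust a table. This rough sketch times one request and divides the completion tokens by wall-clock seconds. It assumes curl and jq are installed and that the response carries the usual OpenAI-style usage.completion_tokens field. Both helper names are hypothetical. Because the timer includes prompt processing, the result slightly understates pure generation speed.

```shell
# tok_per_sec TOKENS SECONDS: speed with one decimal place.
tok_per_sec() {
  awk -v t="$1" -v s="$2" 'BEGIN { if (s <= 0) exit 1; printf "%.1f\n", t / s }'
}

# bench_once BASE_URL PROMPT: time one chat request against BASE_URL.
bench_once() {
  body="$(jq -n --arg p "$2" \
    '{model: "glm-5.2", messages: [{role: "user", content: $p}], max_tokens: 256}')"
  start="$(date +%s)"
  tokens="$(curl -s "$1/v1/chat/completions" \
    -H "Content-Type: application/json" -d "$body" \
    | jq -r '.usage.completion_tokens')"
  end="$(date +%s)"
  tok_per_sec "$tokens" "$((end - start))"
}
```

Usage: bench_once http://localhost:8080 "Write a binary search in Go." Run it a few times, since the first request after a load is usually the slowest.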

Common Errors During Local Setup (and Fixes)

Error	Likely cause	Fix
`tensor not found: blk.X.attn_q.weight`	llama.cpp build too old for GLM MoE DSA	Pull a recent llama.cpp commit and rebuild with `cmake --build build`
Process killed / swap thrash on load	Quant is bigger than free RAM	Drop to a smaller quant, or close other apps; 2-bit needs ~240 GB free, not just installed
Output is repetitive or incoherent	Sampling params not aligned to Zhipu defaults	Set `--temp 1.0 --top-p 0.95 --min-p 0.01`; do not leave top_k at a low default
Painfully slow generation on a 4090 box	Most layers running from DDR5, not VRAM	Expected on 24 GB VRAM; lower `--ctx-size`, or move to a 256 GB Mac for better bandwidth
`failed to allocate KV cache` at high ctx-size	Context window too large for remaining memory	Lower `--ctx-size`, or quantize the KV cache with `--cache-type-k q4_1 --cache-type-v q4_1`
Model "thinks" forever before answering	Thinking mode on for a task that does not need it	Disable it with `--chat-template-kwargs '{"enable_thinking":false}'`
Ollama pull only offers `glm-5.2:cloud`	No local Ollama tag exists yet	Use llama.cpp or LM Studio with the Unsloth GGUF instead

Team / Multi-Developer: When One Mac Isn't Enough

A single local machine serves one person. The moment a second developer points an agent at the same llama-server, both sessions slow to a crawl, because consumer hardware has no spare bandwidth to split. There is no clever flag that fixes this.

Two real options when local stops scaling:

Move to datacenter GPUs. An 8x H200 node serving FP8 handles many concurrent streams at tens of tokens per second each. That is a different cost and operations story, fully worked through in the self-host vLLM and cost guide, including the break-even math against the hosted plan.
Use a hosted endpoint and stop running your own hardware. For most teams this wins on every axis except data residency.

The local quant is the right tool for one developer who wants the model on their own machine. It is the wrong tool for a shared service.

Advanced: Long Context and Thinking Mode

Two knobs are worth knowing once the basic setup runs.

KV cache quantization. The 1M context is real in the architecture but unreachable on a 256 GB box, because the KV cache alone would need hundreds of gigabytes. Quantizing it buys back room:

./build/bin/llama-server \
  --model ~/models/glm-5.2-gguf/GLM-5.2-UD-IQ2_M-00001-of-00006.gguf \
  --ctx-size 65536 \
  --cache-type-k q4_1 --cache-type-v q4_1 \
  --n-gpu-layers 999 --port 8080

q4_1 stores about 5 bits per cached value against 16 for the default f16, so this cuts KV cache memory to roughly a third, letting you push context further on the same hardware, at a small quality cost on very long inputs.
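
To see how much room each cache type buys, compare storage bits per cached value. The figures in this sketch come from the ggml block layouts, where quantized types pack 32 values plus per-block metadata. The helper names are hypothetical, and the result is relative to the f16 default.

```shell
# kv_bits TYPE: storage bits per cached value for a llama.cpp cache type.
kv_bits() {
  case "$1" in
    f16)  echo "16"  ;;
    q8_0) echo "8.5" ;;  # 32 x 8-bit values + 16-bit scale
    q4_1) echo "5"   ;;  # 32 x 4-bit values + 16-bit scale + 16-bit minimum
    q4_0) echo "4.5" ;;  # 32 x 4-bit values + 16-bit scale
    *)    return 1 ;;
  esac
}

# kv_percent_of_f16 TYPE: cache size as a percentage of the f16 default.
kv_percent_of_f16() {
  bits="$(kv_bits "$1")" || return 1
  awk -v b="$bits" 'BEGIN { printf "%.0f\n", b / 16 * 100 }'
}
```

kv_percent_of_f16 q4_1 prints 31 and kv_percent_of_f16 q8_0 prints 53, so q8_0 is the milder option if q4_1 hurts quality on long inputs.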

Thinking mode. GLM 5.2 has a reasoning mode that spends tokens thinking before it answers. For quick edits and short prompts it adds latency you may not want. The launch flag --chat-template-kwargs '{"enable_thinking":false}' turns it off for every request the server handles; recent llama-server builds also accept a chat_template_kwargs field in the request body for a per-request switch. Leave it on for hard multi-step problems where the extra reasoning earns its keep.
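
To toggle thinking per request instead of per server, put the switch in the request body. This sketch assumes your llama-server build honors a chat_template_kwargs field in the JSON payload, so check its server README if the field seems to be ignored. chat_payload is a hypothetical helper and does no JSON escaping.

```shell
# chat_payload PROMPT THINKING
# Builds a chat request body; THINKING is the JSON literal true or false.
# PROMPT must not contain double quotes or backslashes (no escaping here).
chat_payload() {
  printf '{"model":"glm-5.2","messages":[{"role":"user","content":"%s"}],"chat_template_kwargs":{"enable_thinking":%s}}\n' "$1" "$2"
}

# Example call against the local server from Step 2:
#   curl -s http://localhost:8080/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d "$(chat_payload "Rename foo to bar in this file." false)"
```

Pass true for the hard multi-step prompts and false for quick edits, without restarting the server.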

When Local Is the Wrong Answer: Hosted and ofox Alternatives

If the 256 GB floor or the single-session speed rules local out, you do not have to give up GLM 5.2 at all. The same model is on the ofox catalog as z-ai/glm-5.2, priced at $1.40/M input and $4.40/M output, so you can run it hosted at full speed by changing only the base URL and model ID, with no rig to buy or babysit. You prototype against your local llama-server and then point the same client at the hosted model:

export OPENAI_BASE_URL="https://api.ofox.ai/v1"
export OPENAI_API_KEY="ofox-..."
export OPENAI_MODEL="z-ai/glm-5.2"   # the exact same model, now hosted
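
If you switch between the two often, a small shell function saves retyping the exports. This is a sketch: use_endpoint is a hypothetical helper, the local URL assumes the llama-server from Step 2 on port 8080, and OPENAI_API_KEY is left for you to set.

```shell
# use_endpoint TARGET: point OpenAI-style clients at "local" or "hosted".
use_endpoint() {
  case "$1" in
    local)
      export OPENAI_BASE_URL="http://localhost:8080/v1"
      export OPENAI_MODEL="glm-5.2"
      ;;
    hosted)
      export OPENAI_BASE_URL="https://api.ofox.ai/v1"
      export OPENAI_MODEL="z-ai/glm-5.2"
      ;;
    *)
      echo "usage: use_endpoint local|hosted" >&2
      return 1
      ;;
  esac
}
```

Define it in your shell profile, then run use_endpoint local while prototyping and use_endpoint hosted when you need full speed.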

The hosted access guide covers the Z.ai Coding Plan route to the same model as well. And if you want a few other open-weights coding models behind that one OpenAI-compatible endpoint, ofox lists these day-one too:

Model	ofox model ID	Context	When to pick over GLM 5.2
DeepSeek V4 Pro	`deepseek/deepseek-v4-pro`	1M	You want a longer community track record and published SWE-bench Verified numbers
Kimi K2.6	`moonshotai/kimi-k2.6`	262K	You need independently benchmarked long context, not the 16K-64K local ceiling
Qwen 3 Coder Next	`bailian/qwen3-coder-next`	256K	Multilingual codebases where local speed is too slow to iterate

For a price-and-quality read on GLM against a closed model before you commit to either a local rig or a hosted subscription, see the GLM 5.2 vs GPT-5.5 cost comparison.

Sources Checked for This Refresh

HuggingFace official model card, zai-org/GLM-5.2 (753B parameters, MIT license, 1M context), verified 2026-06-23: https://huggingface.co/zai-org/GLM-5.2
Unsloth GGUF community quants and per-quant memory table, verified 2026-06-23: https://huggingface.co/unsloth/GLM-5.2-GGUF
Unsloth GLM 5.2 run guide (quant sizes, sampling defaults, KV-cache flags, Unsloth Studio install): https://unsloth.ai/docs/models/glm-5.2
llama.cpp project: https://github.com/ggml-org/llama.cpp
LM Studio: https://lmstudio.ai
Companion ofox guides: self-host hardware and cost, hosted access, GLM 5.2 vs GPT-5.5 cost

The interesting shift is not that a frontier model runs locally, it is how little it now costs to find out. A 256 GB Mac Studio you already own and an afternoon of downloading is the whole experiment. The next thing to watch is FP4 and tighter dynamic quants: the day a good 4-bit drops under 200 GB, a 256 GB Mac gets 4-bit quality with headroom to spare, and once a usable quant fits comfortably inside 128 GB, the local floor moves down to a 128 GB machine and a lot more desks qualify.

Originally published on ofox.ai/blog.
