ZhipuAI quietly dropped GLM 4.7 Flash and it's been blowing up: 830K+ downloads on HuggingFace, 1,600+ likes. The pitch: a 30B-parameter MoE model with only 3B active parameters per token. Translation: you get 30B-class quality at roughly the inference speed of a 3B model.
The benchmarks back it up. AIME 25: 91.6%, effectively tied with GPT-OSS-20B. SWE-bench Verified: 59.2%, nearly 3x Qwen3-30B-A3B. And it's MIT licensed: commercial use, fine-tuning, whatever you want.
I've been building a local AI desktop app (Locally Uncensored) and just added GLM 4.7 support. Here's how to run it locally.
## Install GLM 4.7 Flash with Ollama
One command:
```shell
ollama run glm4.7
```
That's it. Ollama handles the download and serves a pre-quantized build. The default tag pulls Q4_K_M, which gives you the best quality-to-size ratio.
If you want a specific quantization:
```shell
ollama run glm4.7:q4_k_m   # ~5 GB, recommended
ollama run glm4.7:q8_0     # ~10 GB, higher quality
ollama run glm4.7:q2_k     # ~3 GB, if VRAM is tight
```
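Once the model is pulled, Ollama also exposes it over its local REST API (default port 11434), which is handy for scripting. Here's a minimal Python sketch against Ollama's standard `/api/generate` endpoint; the `glm4.7` tag matches the pull command above:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt: str, model: str = "glm4.7") -> dict:
    """Build a non-streaming generate request for Ollama's REST API."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
    }

def generate(prompt: str) -> str:
    """Send the prompt to the local Ollama server and return the response text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Call `generate("...")` with the Ollama server running; pass a quantization-specific tag like `glm4.7:q8_0` as `model` to target a different build.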
## Why GLM 4.7 Flash Matters
The MoE (Mixture of Experts) architecture is the key. The model stores 30B total parameters, but a router activates only ~3B of them for each token. This means:
- Speed: Per-token compute matches a 3B dense model, so generation is fast
- VRAM: The full 30B weights still have to be loaded, but quantization (plus Ollama's CPU offload) keeps the footprint modest; see the figures below
- Quality: Reasoning and coding performance matches models 10x its active size
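For intuition, here's a toy sketch of the routing idea (illustrative only; GLM 4.7's actual expert count, gating network, and top-k value are internal to the model): a gate scores every expert for the current token, and only the top-k experts actually run:

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(gate_scores, k=2):
    """Pick the top-k experts for one token and renormalize their gate weights."""
    top = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)[:k]
    probs = softmax([gate_scores[i] for i in top])
    return list(zip(top, probs))  # (expert_index, mixing_weight) pairs

# Toy setup: 64 experts, but only k=2 run per token, so most of the
# network's expert compute sits idle on any given forward pass.
random.seed(0)
scores = [random.gauss(0, 1) for _ in range(64)]
selected = route_token(scores, k=2)
assert len(selected) == 2
```

The token's output is the weighted sum of just those k experts' outputs, which is why per-token compute tracks the active parameter count rather than the total.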
Here's how it compares:
| Benchmark | GLM 4.7 Flash (30B-A3B) | Qwen3-30B-A3B | GPT-OSS-20B |
|---|---|---|---|
| AIME 25 | 91.6 | 85.0 | 91.7 |
| GPQA | 75.2 | 73.4 | 71.5 |
| SWE-bench Verified | 59.2 | 22.0 | 34.0 |
| τ²-Bench (agentic) | 79.5 | 49.0 | 47.7 |
| BrowseComp | 42.8 | 2.29 | 28.3 |
The agentic benchmarks are insane. τ²-Bench at 79.5 vs Qwen3's 49.0 — that's not a marginal improvement, that's a different league. This model was built for tool calling and multi-step reasoning.
## VRAM Requirements
- Q2_K: ~3-4 GB VRAM (or CPU-only with 8 GB RAM)
- Q4_K_M: 6-8 GB VRAM — the sweet spot
- Q8_0: 10-12 GB VRAM — if you have the room
- FP16: 20+ GB — only for research/fine-tuning
If you have a GTX 1660 (6 GB) or better, Q4_K_M runs comfortably. On Apple Silicon with 16 GB unified memory, it flies.
## Agent Mode with GLM 4.7
This is where GLM 4.7 really shines. The model was specifically optimized for agentic tasks — it has a "Preserved Thinking" mode that keeps chain-of-thought reasoning active across multi-turn tool interactions.
In practice: you give it a tool (web search, file read, code execution) and it actually uses it intelligently. The 59.2% SWE-bench score means it can navigate real codebases, understand context, and produce working patches — not just toy completions.
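As a concrete sketch, Ollama's `/api/chat` endpoint accepts OpenAI-style tool schemas, so wiring GLM 4.7 to a tool looks roughly like this (the `get_weather` tool and its parameters are made up for illustration):

```python
import json
import urllib.request

# Hypothetical tool definition for illustration; any JSON-Schema function works.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def build_chat_payload(user_msg: str, model: str = "glm4.7") -> dict:
    """Chat request offering the model one callable tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "tools": [WEATHER_TOOL],
        "stream": False,
    }

def chat(user_msg: str) -> dict:
    """POST the chat payload to the local Ollama server and return its reply."""
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(build_chat_payload(user_msg)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

If the model decides to use the tool, the reply's message carries a `tool_calls` list; you execute the call and send the result back as a `{"role": "tool"}` message to continue the loop.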
In Locally Uncensored, GLM 4.7 is auto-detected as an agent-capable model. Enable Agent mode in the UI and it gets access to web search, file operations, and code execution out of the box.
## GLM 4.7 vs the Competition
vs Qwen3-30B-A3B: Same architecture class (30B MoE, 3B active) but GLM 4.7 dominates on agentic and coding tasks. Qwen3 is better at pure math.
vs Gemma 4 E4B: Gemma 4 is smaller (4.5B effective) and faster, but GLM 4.7 has significantly better reasoning depth. If you need an agent that can handle complex multi-step tasks, GLM 4.7 wins.
vs Llama 3.3 70B: Llama needs 3-4x the VRAM for similar coding performance. GLM 4.7 is the efficiency play.
## What's the Catch?
Honestly, not much:
- Chinese-English bilingual — Trained on both, works great in both. If you only need English, it's still excellent.
- Context window — Supports up to 128K tokens. More than enough for most use cases.
- MIT license — Fully open. No restrictions on commercial use, modification, or redistribution.
The main caveat: if you want vision/multimodal, GLM 4.7 Flash is text-only. Look at GLM-4V or Gemma 4 for image input.
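If you want to use more of that context window than Ollama serves by default, you can raise `num_ctx` per request through the API's `options` field (a sketch; the 32K value here is an arbitrary choice, and larger contexts cost more VRAM):

```python
def build_long_context_payload(prompt: str, num_ctx: int = 32768) -> dict:
    """Generate request that raises Ollama's context length for this call."""
    return {
        "model": "glm4.7",
        "prompt": prompt,
        "stream": False,
        # Ollama defaults to a much smaller context than the model supports;
        # num_ctx overrides it for this request only.
        "options": {"num_ctx": num_ctx},
    }
```

In an interactive `ollama run` session, `/set parameter num_ctx 32768` does the same thing.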
## Try It

```shell
ollama run glm4.7
```
Or if you want a full desktop UI with agent mode, image gen, and A/B model comparison:
Locally Uncensored — free, open source, AGPL-3.0. Single .exe/.AppImage, no Docker needed. GLM 4.7 is in the recommended models list.
Running GLM 4.7 on your hardware? I'd love to hear your tok/s numbers and use case. Drop a comment.