Rob

Posted on Jun 30 • Originally published at vibescoder.dev

GLM Is the New Hotness, So Let's Test It On the Homelab

#modelshowdown #benchmark #ai #llm

GLM is the new hotness.

I'm hearing it from both sides of the AI builder world. Software engineers are talking about it because the benchmark numbers are interesting, the weights are open, and the coding claims are strong. Vibe coders are talking about it because the pitch is even simpler: maybe this is the local model that finally feels agentic enough to run on your own machine.

That overlap is rare. A lot of models get academic buzz. A lot of models get LocalLLaMA buzz. A smaller number get real developer curiosity. GLM is sitting in that third bucket right now.

So we do what we always do: jump in and ask the boring practical questions.

What is GLM?
Is it suitable for the homelab?
How does it perform on a real agentic coding task?

This post answers the first two. It also sets up the dedicated GLM bakeoff we will run to answer the third.

What GLM Is

GLM is the model family from Z.ai, formerly Zhipu AI. The current discussion is not about one model. It is about a family that now spans several very different deployment targets:

Model	What it is	Why we care
GLM-5.2	Frontier-scale MoE model with a 1M-token context target	The headline model. Strong claims, open weights, not sized for a normal homelab.
GLM-4.7-Flash	30B-A3B MoE model	The practical candidate. Small enough to plausibly fit the RTX 5090 class.
GLM-4-9B-Chat	Older 9B chat model with function calling and 128K context	The small baseline. It should fit easily, but expectations should be modest.

That spread is why this got interesting. If GLM only meant the 753B-class flagship, the answer for my rig would be simple: neat model, wrong hardware. But GLM-4.7-Flash changes the question. It is explicitly positioned as a lightweight deployment model, a 30B-A3B MoE in the same practical category as the Qwen and Qwen-Coder models already living on my workstation.

The homelab does not need the biggest model. It needs the biggest model that can actually act as an agent without melting the workflow.

The Homelab Filter

The machine we are testing against is the same box from the recent local-model rounds:

Component	Homelab
CPU	Ryzen 9 9950X3D
GPU	RTX 5090, 32 GB VRAM
RAM	64 GB DDR5
Inference	llama.cpp, single model on port 8080
Agent platform	Coder Agents
Target workload	Real coding tasks in the vibescoder.dev repo

This is not a cloud lab. It is not eight H100s. It is not a Mac Studio with hundreds of gigabytes of unified memory. It is the same single-GPU homelab I have been tuning all year: a very aggressive consumer workstation with one big GPU.

That matters because local-model discourse often collapses three very different claims into one word: runs.

A model can "run" because it fits entirely in VRAM and responds interactively. A model can also "run" because llama.cpp can mmap hundreds of gigabytes from NVMe while the GPU handles a few layers and you wait. Those are not the same thing.

We learned that with Kimi K2. It technically ran. It produced output. It was also a 579 GB download, loaded for more than six minutes, and generated at roughly interactive-punishment speed. Technically valid. Practically dead.

So the GLM question is not "can I make it produce tokens?" The question is:

Can it run locally, use tools correctly, and complete a Coder Agents task without turning the session into a science project?

The Three Candidates

GLM-5.2: the completeness run

GLM-5.2 is the model generating most of the buzz. It is also the least likely to be a real candidate for this hardware.

The reason is not mysterious. It is huge. The official Hugging Face metadata lists it in the 753B-parameter class. Unsloth has GGUF quants, including extremely low-bit versions, but those still live in the hundreds-of-gigabytes world. That puts it in the same category as Kimi K2 for this rig: technically interesting, practically suspect.

We are still going to include it.

Not because I think it will win. Not because I think a 1-bit or 2-bit offloaded monster is a fair comparison against a 30B model sitting mostly in VRAM. We are including it because the data is useful. If it fails the feasibility gate, that is a result. If it loads but is unusably slow, that is a result. If it somehow clears the bar, that is definitely a result.

But we go in eyes open: GLM-5.2 is a completeness candidate, not a sane daily-driver candidate for a single RTX 5090.

GLM-4.7-Flash: the real contender

GLM-4.7-Flash is the one I actually care about.

Z.ai describes it as a 30B-A3B MoE model aimed at lightweight deployment. That puts it directly in the class we have been testing all month:

Qwen 3.6 35B-A3B
Qwen3-Coder 30B-A3B
Nemotron-style 30B-A3B candidates
now GLM-4.7-Flash

The naming is almost too convenient. Flash means "this one might fit the box." The GGUF options include quants in the range where a 32 GB GPU can plausibly host the model with room left for KV cache, depending on context and cache settings.

This is the model with an actual path to becoming useful on the homelab.

The open questions:

Does llama.cpp handle the model cleanly?
Does the GLM tool-call format round-trip through Coder Agents?
Does it avoid the looping behavior people have reported in some GLM-4.7-Flash GGUF runs?
Can it ship code, not just write plausible code?

That last question is the one that matters.

GLM-4-9B: the floor

GLM-4-9B-Chat is the older small model. It supports function calling and long context on paper. It should fit easily on the 5090. It should be fast enough that the model itself is not the bottleneck.

That makes it useful as a floor.

I do not expect a 9B model to beat Qwen3-Coder on a real multi-file Next.js task. If it does, something strange and interesting happened. But it can still answer two important questions:

Does the GLM family tool-call format work cleanly in our stack?
How much agentic capability do we lose when we drop from the 30B-A3B class to 9B?

If GLM-4-9B calls tools reliably but fails the coding task, we learned something. If it cannot call tools reliably, we learned something more important: do not trust the larger GLM runs until the parser path is fixed.

The Tool-Calling Question

A fellow vibe coder told me she could not get GLM to run with Hermes because it was not compatible with JSON.

My second question was: is that true?

My first question was: what is GLM? We answered that above. So let's dive into the JSON rumor.

The rumor is half right and half misleading.

GLM does not appear to be JSON-native in the way some tool-call models are. The templates use GLM-style XML-ish tool calls, with function names and argument keys wrapped in tags. That sounds bad if your agent expects the model to literally emit raw JSON.

But Coder Agents is not talking directly to raw model text. It talks to an OpenAI-compatible server. llama.cpp sits in the middle and is supposed to translate the model's native format into OpenAI-style tool_calls.

That is the entire game.

If llama.cpp parses GLM tool calls correctly, Coder Agents should not care whether the model internally uses JSON, XML tags, magic tokens, or a tiny goblin tapping Morse code inside the KV cache. The API response either contains structured tool calls or it does not.

So the first test is not the tag-manager task. The first test is much simpler:

Start the model, send a tool schema, and confirm /v1/chat/completions returns structured tool_calls with valid JSON arguments.

If that fails, the bakeoff is over until the template is fixed.

Round 7 already taught us why. Devstral did not fail because it wrote bad TypeScript. It failed before that. It emitted fake tool calls as plain text. Coder Agents could not parse them, so nothing happened. Nine messages, zero actions.

Tool calling is not a feature of an agentic local model. It is the price of admission.

The Bakeoff Harness

We are not inventing a new task. We are reusing the newest real-world local-model harness: the Round 7 tag-manager task, with the Round 8 protocol improvements.

That task asks the agent to add a tag manager to the blog admin panel. It builds on the taxonomy cleanup from From Chaos to Signal, but raises the bar: instead of asking a model to reason about tags, we ask it to build the admin tooling that manages them.

create tag-reading helpers using gray-matter
add admin API routes for listing, renaming, and deleting tags
build an /admin/tags page
link it from the admin dashboard
run npm run build
take a Playwright screenshot
commit in logical chunks
push the branch

This task is useful because it is not synthetic. It hits the exact failure modes local models struggle with:

Failure mode	Why this task catches it
Tool-call failure	The agent has to read, write, execute, and use browser tools.
Repo navigation	The codebase has existing admin patterns to discover.
TypeScript debugging	`gray-matter` and Next.js route types are easy to get subtly wrong.
Build-loop behavior	Bad models repeat the same broken fix. Good models inspect the error.
Goal prioritization	The screenshot requirement can become a yak-shaving trap.
Shipping discipline	Passing build is not enough. The model has to commit and push.

Round 7 proved the value of this task. Qwen 3.6 built the feature and got the build passing, then burned 77 messages trying to take a screenshot and never committed. Qwen3-Coder shipped code, but skipped the screenshot and pushed one messy commit. Gemma and Hermes looped on build errors. Devstral never made a structured tool call.

That is the kind of signal a one-shot benchmark will never give you. It is the same reason I keep coming back to messy feature-build bakeoffs instead of clean synthetic prompts, from the original local-vs-cloud benchmark to the four-agent feature build.

The Plan

The GLM bakeoff has two layers: qualification and the real task.

Phase 1: qualification

Before any full Coder Agents run, each model must pass four gates.

Gate	Test	Pass condition
Load	Start llama-server	Health check passes, model appears in `/v1/models`
Plain chat	One short response	No loop, no malformed output, completes on time
Tool call	One forced tool call	OpenAI response includes structured `tool_calls`
Tiny agent task	Create and run a trivial file	Uses tools, completes, stops

GLM-5.2 gets a special label here. If it requires heavy offload, we mark it as offload-class. It can still continue, but its latency numbers will not be compared as if it were a normal in-VRAM run.

Phase 2: official agentic runs

If the models pass qualification, they get the Round 7 tag-manager task.

Run	Model	Role
`glm-run-1`	GLM-5.2 GGUF	Completeness and feasibility
`glm-run-2`	GLM-4.7-Flash GGUF	Practical contender
`glm-run-3`	GLM-4-9B GGUF	Small baseline

Each run gets:

same repo baseline
same prompt
same Coder Agents setup
same intervention rules
same hard timeout
same scoring rubric

Phase 3: reruns

Single-run agent bakeoffs are noisy. If GLM-4.7-Flash or GLM-4-9B does anything interesting, we rerun it.

Minimum reruns:

Run	Model	Why
`glm-run-2b`	GLM-4.7-Flash	Likely best practical candidate
`glm-run-3b`	GLM-4-9B	Measures variance in the small baseline

GLM-5.2 only gets a rerun if it is surprisingly usable. I am curious, not masochistic.

Optional Phase 4: the screenshot timebox

The screenshot requirement is intentionally left in the official run. It is part of the agentic test. Shipping a feature includes handling annoying browser and auth problems.

But if every model fails mainly because of Playwright, we will run a controlled variant:

If the screenshot is blocked after three attempts or 20 minutes, document the blocker, commit and push the working code, and mention the missing screenshot in the final summary.

That gives us a second lens: can the model ship code if the known trap is timeboxed?

How We Will Score It

The scoring rubric stays the same as the recent bakeoffs, especially Round 5 and Round 7:

Dimension	Weight	What it measures
Correctness	25%	Does the feature work and does the build pass?
Design	15%	Does the admin UI fit the app?
Code quality	20%	TypeScript hygiene, clean abstractions, no dead code
Engineering judgment	15%	Rename/delete safety, error handling, project pattern fit
Scope discipline	10%	Did it avoid gold-plating and unrelated churn?
Commit hygiene	10%	Logical commits, useful messages, branch pushed
Surprise	5%	Anything unusually good or bad

But local models need a second table. A model can score well on code and still be useless if it takes three hours, burns ten million tokens, or requires hand-holding every ten minutes. That was the real lesson from Slaying the Gemma Beast: the model output is only half the story. The serving setup, reasoning budget, and agent loop decide whether the thing is usable.

So we will also capture deployability:

Metric	Why it matters
Load time	Operator experience
Peak VRAM and RAM	Hardware fit
Offload status	Fairness and practicality
Tokens per second	Real latency
Wall-clock runtime	Can I actually use this?
Total tokens	Agentic efficiency
Tool calls	Workflow behavior
Build attempts	Debugging quality
Human interventions	Autonomy
Screenshot status	Known Round 7 trap
Commits pushed	Shipping discipline

The final verdict will separate capability from deployability. That matters especially for GLM-5.2. If it writes the best code but only after a miserable offloaded marathon, that is not a daily-driver win. It is a lab result.

What Would Count as a Win?

For GLM-5.2, a win is not beating the smaller models. A win is proving the giant model can be made to run through our stack and produce structured tools. Anything beyond that is upside.

For GLM-4.7-Flash, the bar is higher. It needs to look like a plausible Qwen3-Coder alternative:

structured tool calls work
no degenerate loops
build passes
branch gets committed and pushed
token usage is not absurd
the implementation is reviewable without a rescue mission

For GLM-4-9B, the bar is lower but still real:

tool calls work
it can navigate the repo
it makes a coherent attempt
it gives us a useful small-model baseline

If GLM-4.7-Flash ships a clean branch, that is the headline. If GLM-5.2 cannot clear the feasibility gate, that is still worth publishing. If GLM-4-9B surprises us, we get a much more interesting post than expected.

What I Think Will Happen

My guess before running anything:

GLM-5.2 will be technically runnable only in a way that is not pleasant on this box.
GLM-4.7-Flash is the only serious candidate for local Coder Agents use.
GLM-4-9B will validate the parser path but fall short on the full agentic task.

The danger is that I am wrong in either direction. GLM-4.7-Flash could be fast but loopy. GLM-4-9B could be more disciplined than expected. GLM-5.2 could be unusable, or it could produce one of those weird giant-model moments where the result is obviously better even though the experience is awful.

That is why we test.

By the Numbers

3 GLM variants in scope
1 RTX 5090 as the hardware constraint
4 qualification gates before the real task
3 official agentic runs minimum
2 reproducibility reruns planned if the practical candidates show promise
1 known trap from Round 7: Playwright screenshot yak-shaving
0 assumptions that "runs locally" means "is useful locally"

GLM is hot. That is enough reason to look.

It is not enough reason to believe.

The bakeoff comes next.

Top comments (3)

Alex Shev • Jun 30

Homelab testing is valuable because it makes model evaluation concrete. Instead of arguing benchmarks, you see latency, memory pressure, failure behavior, and whether the model is useful enough to stay in your actual workflow.

Rob • Jun 30

Agreed! My favorite aspect of this new era we're in is the ability to fearlessly tackle these initiatives. I couldn't sustain hobby efforts like a homelab. Agents make the discovery, experimentation, and implementation doable for me.

Vasyl • Jul 2

The qualification gates are the right call, and I'd log one extra thing during the tool-call gate: the raw model text next to what llama.cpp parsed out of it. I've hit template bugs where the parser silently swallowed half the arguments, the tool_calls array looked structurally valid, and everything downstream failed in ways that looked like model stupidity. It only took one diff of raw vs parsed to find it. Also curious what you'll do with KV cache on the 5090 for the long agent sessions, 32 GB gets tight fast when a run burns 70+ messages. Quantized cache, or capped context?