An open model from China beat Claude on a security test -- at a sixth of the cost

#openweightmodels #security #glm #benchmarks

GLM 5.2, a free open-weight model from Zhipu AI, beat Anthropic's Claude at catching broken-access-control bugs in Semgrep's benchmarks, at roughly a sixth of the cost per bug found. The result, published in Semgrep's blog post "We Have Mythos At Home", is narrow but real: on one security-critical task, a downloadable model outperformed a top closed model.

Key facts

What: Semgrep ran GLM 5.2 against Claude on a narrow vulnerability-finding task, and the free, open-weight model came out ahead at roughly a sixth of the cost per bug found.
When: 2026-06-28
Primary source: Semgrep's blog post, "We Have Mythos At Home"

A huge share of real-world web bugs comes from one boring mistake: a site checks that you are logged in, but forgets to check that the thing you are asking for actually belongs to you. Change the order number in the address bar from 1001 to 1002 and you are suddenly looking at someone else's invoice. Security people call this broken access control, or an IDOR (insecure direct object reference) bug. It is everywhere, it is costly, and it is the kind of needle-in-a-haystack reading job people now hand to AI: point a model at a codebase and ask, where can a user reach data that isn't theirs?
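To make the invoice example concrete, here is a minimal sketch of the bug and its fix. Everything in it (the Invoice class, the INVOICES table, both function names) is hypothetical and invented for illustration; it is not taken from Semgrep's benchmark.

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    invoice_id: int
    owner_id: int
    total: float


# Hypothetical in-memory "database", for illustration only.
INVOICES = {
    1001: Invoice(1001, owner_id=1, total=40.0),
    1002: Invoice(1002, owner_id=2, total=99.0),
}


def get_invoice_vulnerable(current_user_id, invoice_id):
    """Checks that someone is logged in, but not who owns the invoice."""
    if current_user_id is None:
        raise PermissionError("login required")
    # Any logged-in user can read any invoice by changing the id.
    return INVOICES[invoice_id]


def get_invoice_fixed(current_user_id, invoice_id):
    """Same lookup, plus the ownership check the vulnerable version forgot."""
    if current_user_id is None:
        raise PermissionError("login required")
    invoice = INVOICES[invoice_id]
    if invoice.owner_id != current_user_id:
        raise PermissionError("not your invoice")
    return invoice
```

In this sketch, user 1 asking for invoice 1002 gets someone else's invoice from the first function and a PermissionError from the second. The two functions differ by a single if statement, which is part of why the bug is easy to miss in review.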

Semgrep built a fair test around that question and ran several models through it. The standout was GLM 5.2, an open-weight model from the Chinese lab Zhipu AI. On the narrow task of catching access-control bugs, GLM 5.2 scored ahead of Claude Code -- and because GLM is free to download and cheap to run, the cost per bug it found was about a sixth of Claude's. For a security team scanning millions of lines, that gap is the difference between scanning everything and scanning a sample.
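The cost-per-bug figure is simple division, sketched below. The dollar amounts and bug counts are placeholders chosen only so the ratio lands near a sixth; they are not Semgrep's numbers.

```python
def cost_per_bug(total_cost_usd, bugs_found):
    """Total spend on a scan divided by the number of real bugs it found."""
    if bugs_found <= 0:
        raise ValueError("bugs_found must be positive")
    return total_cost_usd / bugs_found


# Placeholder inputs (assumptions, not figures from the Semgrep post).
closed_model = cost_per_bug(total_cost_usd=60.0, bugs_found=10)
open_model = cost_per_bug(total_cost_usd=12.0, bugs_found=12)

# Ratio of the open model's cost per bug to the closed model's.
ratio = open_model / closed_model
print(f"closed: ${closed_model:.2f}/bug, open: ${open_model:.2f}/bug, ratio: {ratio:.2f}")
```

Note that the metric rewards both halves at once: a model can improve its cost per bug by being cheaper to run, by finding more bugs, or both.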

GLM 5.2 is a mixture-of-experts design: it is enormous on paper -- hundreds of billions of parameters -- but for any given chunk of text it switches on only a small slice of itself, which keeps it fast and affordable. It reads up to about a million tokens at once, enough to hold a fair-sized codebase in working memory while it reasons about who can reach what. It ships under a permissive MIT license, so a company can run it on its own machines and never send a line of proprietary code to anyone else.
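The "switches on a small slice" idea can be sketched as generic top-k expert routing. This is a toy illustration of the general mixture-of-experts technique, not GLM 5.2's actual architecture: the eight scalar "experts", the hand-written gate scores, and k=2 are all assumptions. In a real model the gate scores come from a learned router applied to each token.

```python
import math


def softmax(scores):
    """Turn raw gate scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalise over the chosen experts
    return [(i, probs[i] / norm) for i in top]


def moe_layer(x, experts, gate_scores, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    chosen = route_top_k(gate_scores, k)
    output = sum(w * experts[i](x) for i, w in chosen)
    return output, [i for i, _ in chosen]


# Toy setup: 8 "experts", each just multiplies its input by a constant.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
gate_scores = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

output, active = moe_layer(3.0, experts, gate_scores, k=2)
print(f"active experts: {active} of {len(experts)}")
```

Here only the experts at indices 1 and 4 run; the other six are never called. That is the source of the savings: compute per token scales with k, not with the total number of experts.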

Semgrep itself is careful to add a caveat: this is one narrow win, not a coronation. On harder, longer programming tasks -- the kind that involve juggling a whole project over many steps -- GLM 5.2 still trails the top closed models by a wide margin. The sharpest point in the writeup is that the model alone was not even the best result on Semgrep's own board: their full scanning pipeline, the model wrapped in custom tooling and checks, beat every bare model by a healthy margin. How you wire a model into a system matters at least as much as which model you pick. A bare benchmark score is the start of the story, not the end of it.
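One way to picture "the model wrapped in custom tooling and checks" is a two-stage filter: the model proposes candidates, and a deterministic check discards the ones it can rule out. The sketch below is purely illustrative and is not Semgrep's pipeline; the stub model, the SOURCE table, and the string-matching check are all hypothetical stand-ins.

```python
# Hypothetical function bodies keyed by function name, for illustration only.
SOURCE = {
    "get_invoice": "return INVOICES[invoice_id]",
    "get_order": (
        "order = ORDERS[order_id]\n"
        "if order.owner_id != user_id: raise PermissionError\n"
        "return order"
    ),
}


def stub_model(source):
    """Stand-in for a model call; it over-reports on purpose by flagging everything."""
    return list(source)


def has_ownership_check(body):
    """Crude deterministic check: does the body mention an owner field at all?"""
    return "owner_id" in body


def pipeline(source, model):
    """Keep only the model's candidates that the deterministic check cannot clear."""
    candidates = model(source)
    return [name for name in candidates if not has_ownership_check(source[name])]


bare_findings = stub_model(SOURCE)
pipeline_findings = pipeline(SOURCE, stub_model)
print("bare model:", bare_findings)
print("with checks:", pipeline_findings)
```

The bare stub flags both functions, while the wrapped version reports only get_invoice. A real system would use far stronger checks than a substring match, but the shape is the same: the surrounding tooling changes the result without changing the model.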

The direction matters regardless. A year ago the assumption was that frontier capability lived behind a handful of American API keys. The Semgrep result is a clean, reproducible data point that on at least one economically important task, a free model you can run in your own building is now the rational default. Developers on local-AI forums are quietly moving day-to-day work onto GLM and keeping the expensive models for the genuinely hard problems. Combine that with the fact that the most powerful American models are getting harder to access, and a cheap, open, capable alternative feels less like a curiosity and more like infrastructure.

Originally published on Ground Truth, where every claim is checked against the primary source.