What: The news anchor is GLM-5.2, Zhipu AI's open-weights model that just topped the Artificial Analysis Intelligence Index; the concept it makes concrete is active vs total parameters — the two numbers in its "744B total / 40B active" spec.
Why: Those two numbers price two different things: total sets the memory footprint and the GPU you need, while active sets the compute and bandwidth you pay per token. Reading both tells you what a model release actually costs to run.
vs prior: The old habit was to quote one parameter count — which assumes a dense model where every weight fires on every token, so active equals total. A sparse Mixture-of-Experts splits that into two, and the gap between them is the design lever.
Think of it as
A big engine that fires only a few of its cylinders at a time.
               GLM-5.2 ENGINE: 744 CYLINDERS BUILT IN
                                 │
            ┌────────────────────┴─────────────────────┐
            │  the whole engine block is hauled along  │
            │  . . . . . . . . . . . . . . . . . . . . │
            │  . . # . . . # . . # . . . # . . . # . . │
            │  but only ~40 cylinders ( # ) fire now   │
            └────────────────────┬─────────────────────┘
                                 │
                ┌────────────────┴────────────────┐
                ▼                                 ▼
    MEMORY = the whole block            COMPUTE = firing only
    all 744B resident, ~744 GB         ~40B active, ~80 GFLOP
- total parameters (744B) = every cylinder built into the engine block — the full capacity
- active parameters (40B) = the cylinders actually firing on this stroke — what burns fuel right now
- router = the engine controller deciding which cylinders fire for each token
- memory footprint = the whole engine block you still haul around, firing or idle
- sparsity ratio = how few cylinders fire (40) versus how many exist (744)
Quick glossary
Total parameters — Every weight the model contains — here 744 billion. The total sets the model's knowledge capacity and, critically, the memory footprint: all 744 B must be loaded into GPU memory whether or not they are used on a given token.
Active parameters — The subset of weights actually read and multiplied for a single token — here ~40 billion. In a dense model active equals total; in a sparse model active is a fraction. Per-token compute and bandwidth track the active count, not the total.
Mixture-of-Experts (MoE) — A transformer variant that replaces each dense feed-forward network with many smaller "expert" sub-networks, plus a router that activates only a handful per token. It decouples total capacity from per-token cost.
Router — The small learned network inside an MoE layer that assigns each token to its top-k experts. It is what makes "which weights are active" change from token to token.
Sparsity ratio — The fraction of total parameters that are active per token. GLM-5.2's 40 B of 744 B is roughly 5% — about one weight in eighteen. A lower ratio means more capacity sits idle on any given token.
Dense model — A model with no routing: every weight participates in every token, so active equals total. Per-token FLOPs scale linearly with the full parameter count.
FLOP — A floating-point operation — one multiply or add. A useful rule of thumb: a forward pass costs about 2 × (active parameters) FLOPs per token.
Artificial Analysis Intelligence Index — A third-party benchmark that aggregates many evals (reasoning, coding, knowledge) into a single comparable score. GLM-5.2 scored 51 on v4.1, leading all open-weights models.
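The sparsity ratio defined in the glossary is plain arithmetic on the two spec-sheet numbers. Here is a minimal Python sketch using the 744 B and ~40 B figures quoted in this article; the function name `sparsity_ratio` is an illustrative choice, not part of any library.

```python
def sparsity_ratio(active_params: float, total_params: float) -> float:
    """Fraction of the model's weights that are active for one token."""
    if total_params <= 0:
        raise ValueError("total_params must be positive")
    return active_params / total_params


TOTAL = 744e9   # total parameters quoted for GLM-5.2
ACTIVE = 40e9   # approximate active parameters per token

ratio = sparsity_ratio(ACTIVE, TOTAL)
print(f"active fraction: {ratio:.1%}")          # about 5.4%
print(f"one weight in about {1 / ratio:.1f}")   # about 18.6
```

The unrounded figures (about 5.4%, or one weight in 18.6) are what the article rounds to "roughly 5%" and "one in eighteen".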
The news. On June 17, 2026, Artificial Analysis reported that Zhipu AI's GLM-5.2 became the leading open-weights model on its Intelligence Index v4.1, scoring 51, ahead of MiniMax-M3 and DeepSeek V4 Pro, both at 44. The model carries 744 B total parameters but activates only ~40 B per token, ships under an MIT license, and keeps GLM-5.1's architecture while showing particular strength on scientific reasoning.
Picture a very large engine. It has hundreds of cylinders machined into the block, but at any instant only a handful are firing — and which few are firing changes constantly as a controller picks the right ones for the moment. The size of the engine is one number; the cylinders burning fuel right now are a completely different one. That is exactly the gap GLM-5.2 puts on its spec sheet: 744 billion cylinders built in, but only about 40 billion firing per token. The first number is the engine you have to build and haul; the second is the fuel you actually burn each stroke.
Older model releases usually came with one headline number because the model was dense: every weight fired on every token, so the engine you built and the engine you ran were the same. A Mixture-of-Experts model breaks that equality on purpose. Most of a transformer's parameters (roughly two-thirds) live in its feed-forward layers, so MoE replaces that one big dense feed-forward block with many smaller expert blocks and a router that lights up only the few each token needs. The 744 B stays resident, but the per-token bill tracks the ~40 B that fire.
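To make "a router that lights up only the few each token needs" concrete, here is a toy top-k routing sketch in plain Python. Everything in it is an assumption for illustration: the expert count, `TOP_K`, the hidden size, and the one-vector "experts" are made up, and the article does not give GLM-5.2's actual expert configuration. Real routers also vary in detail, for example in where the softmax is applied and how load is balanced.

```python
import math
import random

NUM_EXPERTS = 8   # hypothetical toy value, not GLM-5.2's expert count
TOP_K = 2         # hypothetical toy value
DIM = 4           # toy hidden size

rng = random.Random(0)
# One router weight vector per expert: the score is a dot product with the token.
router_weights = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
# Each toy "expert" is a per-dimension scale standing in for a full feed-forward block.
expert_scales = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]


def route(token):
    """Return (expert_index, gate_weight) pairs for the top-k experts."""
    scores = [sum(w * x for w, x in zip(weights, token)) for weights in router_weights]
    top = sorted(range(NUM_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]
    # Softmax over the selected scores only, so the gate weights sum to 1.
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(token):
    """Weighted sum of the chosen experts' outputs; unchosen experts are never read."""
    out = [0.0] * DIM
    for idx, gate in route(token):
        for d in range(DIM):
            out[d] += gate * expert_scales[idx][d] * token[d]
    return out


for _ in range(3):
    token = [rng.gauss(0, 1) for _ in range(DIM)]
    chosen = [idx for idx, _ in route(token)]
    print(f"experts fired: {chosen} ({TOP_K} of {NUM_EXPERTS} stored)")
```

All eight experts stay in memory for the whole run, yet each token reads only two of them. That is the total-versus-active split in miniature.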
So the two numbers price two genuinely different resources. The total parameter count sets your memory footprint: every one of the 744 B weights has to sit in GPU memory, idle or not. That is why running an open-weights model this large takes a multi-GPU node, and why shrinking the weights with quantization is so attractive. The active count sets your per-token compute and bandwidth: at ~40 B active, GLM-5.2 computes each token at roughly the cost of a 40 B model even though it holds 744 B parameters of capacity. The notable part of this release is not just that an open-weights model topped the leaderboard; it is that it did so at a ~5% sparsity ratio (about one weight in eighteen), pushing the frontier on a very lean per-token budget.
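The memory side of that claim can be sketched by multiplying the total parameter count by the bytes each weight takes. The snippet below is a rough estimate under stated assumptions: the 80 GB per-GPU figure is a hypothetical placeholder, and the result covers weights only, ignoring KV cache, activations, framework overhead, and the extra scale data that real quantization formats carry.

```python
import math

TOTAL_PARAMS = 744e9  # total parameters quoted for GLM-5.2

# Bytes per weight at common storage precisions.
BYTES_PER_PARAM = {"FP16/BF16": 2.0, "FP8": 1.0, "4-bit": 0.5}

GPU_MEMORY_GB = 80  # hypothetical per-GPU memory, chosen only for illustration


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for name, width in BYTES_PER_PARAM.items():
    gb = weight_memory_gb(TOTAL_PARAMS, width)
    # Lower bound: weights only, no KV cache or runtime overhead.
    min_gpus = math.ceil(gb / GPU_MEMORY_GB)
    print(f"{name:>9}: ~{gb:,.0f} GB of weights, at least {min_gpus} GPUs of {GPU_MEMORY_GB} GB")
```

Note that the active count never appears here: memory is set by the total, whatever fraction fires per token.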
| Per token, you pay… | If GLM-5.2 were dense (744B active) | GLM-5.2 as shipped (744B-total, ~40B active) |
|---|---|---|
| Active parameters | 744B (all of them) | ~40B |
| Compute per token | ~1.49 TFLOP (illustrative, ≈2× active-params rule) | ~80 GFLOP (illustrative, ≈2× active-params rule) |
| Weights held in memory | ~744 GB (approx., 1 byte/param at FP8) | ~744 GB (approx., 1 byte/param at FP8) |
| Intelligence Index v4.1 | — | 51 (leading open weights) |
Work one token through the numbers to see why the gap matters. Using the rule that a forward pass costs about 2 × (active parameters) FLOPs, a dense 744 B model would burn 2 × 744 B ≈ 1.49 TFLOP every token; GLM-5.2, firing only ~40 B, burns 2 × 40 B ≈ 80 GFLOP — roughly 18× less compute per token (illustrative — derived from the parameter counts, not measured). But both versions still have to keep all 744 B weights resident — about 744 GB at one byte each — so the memory bill is identical. That is the trade in parameter-count terms: MoE is designed to give you the per-token compute of a small model and the capacity of a large one — while still charging you the memory of the large one. (Real systems also pay routing overhead and run dense attention layers, so the picture is more nuanced than the two counts alone.) Whether the trade is worth it depends on what binds you — if memory is the constraint, a smaller dense model can win, which is the flip side explored in the related Granite explainer below.
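The worked numbers above can be reproduced in a few lines of Python. This is the same illustrative arithmetic, derived from the parameter counts with the 2 × (active parameters) rule of thumb rather than measured; the function name is an illustrative choice, and routing overhead and attention costs are left out.

```python
def forward_flops_per_token(active_params: float) -> float:
    """Rule of thumb: about 2 FLOPs (one multiply, one add) per active weight."""
    return 2.0 * active_params


TOTAL = 744e9    # total parameters quoted for GLM-5.2
ACTIVE = 40e9    # approximate active parameters per token

dense = forward_flops_per_token(TOTAL)    # hypothetical dense model of the same size
sparse = forward_flops_per_token(ACTIVE)  # GLM-5.2 as shipped

print(f"dense : {dense / 1e12:.2f} TFLOP per token")
print(f"sparse: {sparse / 1e9:.0f} GFLOP per token")
print(f"ratio : {dense / sparse:.1f}x less compute per token")
print(f"memory: ~{TOTAL / 1e9:.0f} GB of weights at 1 byte/param in both cases")
```

The compute ratio is just total divided by active, so it moves whenever either count does; the memory line does not change at all.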
Goes deeper in: LLM Internals → Transformer Block → The Feed-Forward Network and LLM Internals → Quantization → Why Quantize
Related explainers
- IBM Granite 4.1 — 8B dense matches the prior 32B MoE — the serving-cost flip side: when memory is what binds, a smaller dense model can beat a larger MoE
- MobileMoE — DRAM-aware MoE scaling — what the active-vs-total gap buys you on memory-constrained devices
- SoftMoE — differentiable soft top-k routing — how the router actually decides which experts fire each token
FAQ
What is the difference between active and total parameters?
Total parameters are every weight the model contains — they set its knowledge capacity and its memory footprint, because all of them must be loaded into GPU memory. Active parameters are the subset actually read and multiplied for a single token; they set the per-token compute and bandwidth. In a dense model the two are equal; in a sparse Mixture-of-Experts model like GLM-5.2, active (~40B) is a small fraction of total (744B).
Why does GLM-5.2 list two parameter counts (744B total, 40B active)?
Because it is a Mixture-of-Experts model. Its feed-forward layers are split into many expert sub-networks, and a router activates only a handful per token — so the model holds 744B weights but fires only ~40B on any given token. The total predicts the memory and GPU you need; the active count predicts how fast and how cheaply it runs per token. A single number would hide that the two costs have decoupled.
Does a lower active-parameter count make a model cheaper to run?
It makes the per-token compute and bandwidth cheaper — GLM-5.2 computes a token at roughly the cost of a 40B model. But it does not lower the memory bill: all 744B total parameters still have to fit in GPU memory whether or not they fire. So a very sparse model is cheap on compute and expensive on memory, which is why deployments often pair it with quantization and multi-GPU nodes.
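The bandwidth half of that answer can be sketched the same way. When decoding one token at a time, the hardware has to read roughly the active weights from memory once per token, so the active count also sets a floor on latency. The snippet below is a rough lower bound under loud assumptions: the 3 TB/s bandwidth is a hypothetical placeholder, and it ignores KV-cache reads, communication between GPUs, and batching (where different tokens can hit different experts).

```python
ACTIVE_PARAMS = 40e9          # approximate active parameters per token
TOTAL_PARAMS = 744e9          # what a dense model of the same size would read
BYTES_PER_PARAM = 1.0         # FP8 storage, matching the table's 1 byte/param
BANDWIDTH_BYTES_PER_S = 3e12  # hypothetical aggregate memory bandwidth (3 TB/s)


def min_seconds_per_token(params_read: float) -> float:
    """Lower bound on decode time if reading weights were the only cost."""
    return params_read * BYTES_PER_PARAM / BANDWIDTH_BYTES_PER_S


sparse_s = min_seconds_per_token(ACTIVE_PARAMS)
dense_s = min_seconds_per_token(TOTAL_PARAMS)
print(f"sparse: {sparse_s * 1e3:.1f} ms per token at best")
print(f"dense : {dense_s * 1e3:.1f} ms per token at best")
```

Whatever bandwidth figure is plugged in, the two bounds differ by the same total-to-active factor as the compute estimate.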
Originally posted on Learn AI Visually.