Hassann

Posted on Jun 17 • Originally published at apidog.com

GLM-5.2 Pricing: API Cost, Cached Input, and the GLM Coding Plan Tiers (2026)

GLM-5.2 is a low-cost way to run a frontier-class coding model. Z.ai ships it with open weights under an MIT license, a 1M-token context window, and API pricing that undercuts many closed-model alternatives. This guide gives you the practical numbers: per-token API cost, cached-input pricing, worked cost examples, GLM Coding Plan usage, and when GLM-5.2 may be cheaper than GPT-5.5 for real coding workflows.

AI pricing changes quickly, and some GLM Coding Plan tier details differ across secondary sources. Treat any flagged or estimated number as something to verify directly at z.ai before you commit budget.

GLM-5.2 API cost at a glance

Start with the pay-as-you-go API rate, which is confirmed by OpenRouter’s public listing.

Item	Price	Source
Input tokens	$1.40 / 1M	Confirmed: OpenRouter
Output tokens	$4.40 / 1M	Confirmed: OpenRouter
Cached input	~$0.26 / 1M	VentureBeat reporting

That means:

Input token cost  = $0.0000014 per token
Output token cost = $0.0000044 per token

Output is roughly 3.1x the price of input. That matters for reasoning workloads because generated reasoning tokens are billed as output tokens.

The cached-input rate of about $0.26 / 1M tokens is the main optimization lever for agents, chat apps, and RAG systems. That number comes from VentureBeat reporting rather than a first-party rate card, so treat it as a reported figure rather than a confirmed price.
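The rate table can be turned into a small estimator. This is a minimal Python sketch, assuming the list rates quoted above; the cached-input constant is the reported figure, not a confirmed one, and the function and constant names are illustrative.

```python
# Rates in USD per 1M tokens, taken from the table above.
# CACHED_INPUT_RATE is the reported (unconfirmed) figure; verify it before budgeting.
INPUT_RATE = 1.40
OUTPUT_RATE = 4.40
CACHED_INPUT_RATE = 0.26


def estimate_cost(fresh_input: int, output: int, cached_input: int = 0) -> float:
    """Return the estimated USD cost of one request at list rates."""
    return (
        fresh_input * INPUT_RATE
        + cached_input * CACHED_INPUT_RATE
        + output * OUTPUT_RATE
    ) / 1_000_000


# A small request: 10K fresh input tokens and 2K output tokens.
print(f"${estimate_cost(10_000, 2_000):.4f}")
# Output-to-input price ratio.
print(f"{OUTPUT_RATE / INPUT_RATE:.2f}x")
```

Keeping the rates as named constants makes it easy to update the estimate when the published prices change.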

There is no free OpenRouter tier for glm-5.2. If you want a path with no metered API cost, run the open weights yourself on your own hardware. For that path, see how to use GLM-5.2 for free and running GLM-5 locally for free.

How cached input reduces GLM-5.2 cost

Prompt caching is the easiest way to reduce GLM-5.2 spend.

The pattern is simple:

Put stable, reused content at the start of the prompt.
Put variable per-request content at the end.
Reuse the same prefix across calls.
Let the provider bill the cached prefix at the lower cached-input rate.

Good cache candidates include:

System prompts
Tool definitions
Coding-agent instructions
Repository context
Long documents used across multiple questions
Conversation history reused across turns

Example structure:

[Stable prefix]
- system instructions
- tool schemas
- repo overview
- unchanged source files

[Variable suffix]
- current user request
- current file diff
- latest question

If 80K tokens of a 100K-token prompt are reused, those 80K tokens may be billed at the cached-input rate instead of the full input rate.
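One way to keep the prefix stable is to build every request through a single helper. The sketch below is an assumption-laden illustration: the function names are hypothetical, and whether the provider's cache matches on an exact byte-identical prefix is something to verify against its documentation.

```python
import hashlib


def build_messages(system_prompt: str, tool_docs: str, repo_overview: str,
                   user_request: str) -> list[dict]:
    """Place stable content first and per-request content last."""
    stable_prefix = "\n\n".join([system_prompt, tool_docs, repo_overview])
    return [
        {"role": "system", "content": stable_prefix},  # identical on every call
        {"role": "user", "content": user_request},     # changes per request
    ]


def prefix_fingerprint(messages: list[dict]) -> str:
    """Hash the stable part so accidental prefix changes are easy to spot."""
    return hashlib.sha256(messages[0]["content"].encode("utf-8")).hexdigest()[:12]


first = build_messages("You are a coding agent.", "tools: read_file, run_tests",
                       "repo: payments-service", "Fix the failing test in billing.py")
second = build_messages("You are a coding agent.", "tools: read_file, run_tests",
                        "repo: payments-service", "Now add a regression test")

# Both calls share the same prefix, so the fingerprints match.
print(prefix_fingerprint(first) == prefix_fingerprint(second))
```

Logging the fingerprint alongside each call is a cheap way to notice when a timestamp or reordered tool list has crept into the prefix and is defeating the cache.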

Caching is most useful for:

Coding agents: Claude Code, Cline, Cursor, and similar tools often resend tool schemas, instructions, and repo context every turn. See the GLM-5.2 with Claude Code, Cline, and Cursor guide.
RAG and document Q&A: Cache the document once, then pay mostly for short questions and answers.
Long conversations: A chat history becomes a stable prefix as the conversation grows.

Two implementation rules:

Keep reusable content at the front of the prompt.
Send follow-up calls close together, because caches can expire.

Disable thinking when the task does not need reasoning

GLM-5.2 is a reasoning model with two thinking-effort levels: High and Max. Z.ai recommends Max for coding, but reasoning tokens are output tokens, and output is the expensive side of the bill at $4.40 / 1M.

For mechanical tasks, disable thinking.

{
  "model": "glm-5.2",
  "messages": [
    {
      "role": "user",
      "content": "Reformat this JSON and return it."
    }
  ],
  "thinking": {
    "type": "disabled"
  }
}

Use effort levels deliberately:

Mode	Use it for
Thinking disabled	Formatting, extraction, classification, simple rewrites
High effort	Everyday coding, debugging, analysis
Max effort	Hard coding tasks, long-horizon planning, math-heavy work

The cost impact can be large because more thinking means more generated output tokens. For parameter details, see the GLM-5.2 API guide and the earlier GLM-5 API walkthrough.
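If requests are built in code, the thinking switch can be applied per task type. This is a minimal Python sketch: the `thinking` field is the one shown in the JSON request above, while the task categories and helper name are illustrative assumptions. It deliberately does not set an effort level, because the parameter for choosing High or Max is not shown here.

```python
import json

# Which task kinds count as "mechanical" is an illustrative choice.
MECHANICAL_TASKS = {"format", "extract", "classify", "rewrite"}


def build_payload(task_kind: str, prompt: str) -> dict:
    """Build a chat request body, disabling thinking for mechanical tasks."""
    payload = {
        "model": "glm-5.2",
        "messages": [{"role": "user", "content": prompt}],
    }
    if task_kind in MECHANICAL_TASKS:
        # Same field as in the JSON request above.
        payload["thinking"] = {"type": "disabled"}
    # For other tasks the field is omitted, leaving the provider default.
    return payload


print(json.dumps(build_payload("format", "Reformat this JSON and return it."), indent=2))
```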

Worked GLM-5.2 cost examples

Per-token prices are easier to reason about when mapped to real workloads.

Example 1: one 100K-token coding session

Assume an agentic coding task sends:

100K input tokens
20K output tokens

Cost:

Input:  100,000 × $1.40 / 1,000,000 = $0.140
Output:  20,000 × $4.40 / 1,000,000 = $0.088

Total: ~$0.23

Example 2: the same session with cached input

Now assume:

80K input tokens are cached
20K input tokens are fresh
20K output tokens are generated

Cost:

Cached input: 80,000 × $0.26 / 1,000,000 = $0.021
Fresh input:  20,000 × $1.40 / 1,000,000 = $0.028
Output:       20,000 × $4.40 / 1,000,000 = $0.088

Total: ~$0.14

Caching the stable prefix cuts the cost of this session by roughly 40%. The more turns you take against the same context, the more caching matters.
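To see how the saving grows with the number of turns, the two examples can be extended into a multi-turn estimate. This sketch assumes the first call pays the full input rate for the prefix (a cold cache), every later call hits the cache for the whole prefix, and cache writes carry no surcharge. Those are simplifying assumptions, not documented provider behavior.

```python
INPUT_RATE, OUTPUT_RATE, CACHED_RATE = 1.40, 4.40, 0.26  # USD per 1M tokens


def session_cost(turns: int, prefix: int, fresh: int, output: int, cached: bool) -> float:
    """Cost of `turns` calls that each send `prefix` + `fresh` input tokens."""
    total = 0.0
    for turn in range(turns):
        # Assumption: turn 0 is a cache miss; later turns reuse the cached prefix.
        prefix_rate = CACHED_RATE if (cached and turn > 0) else INPUT_RATE
        total += (prefix * prefix_rate + fresh * INPUT_RATE + output * OUTPUT_RATE) / 1_000_000
    return total


for turns in (1, 5, 20):
    plain = session_cost(turns, 80_000, 20_000, 20_000, cached=False)
    warm = session_cost(turns, 80_000, 20_000, 20_000, cached=True)
    print(f"{turns:>2} turns: ${plain:.2f} uncached, ${warm:.2f} cached, "
          f"{1 - warm / plain:.0%} saved")
```

Under these assumptions a single call saves nothing, and the saving approaches the roughly 40% figure as the turn count rises.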

Example 3: extraction bot with thinking disabled

Assume a support bot processes:

500 messages per day
2K input tokens per call
300 output tokens per call
Thinking disabled

Cost:

Input:  500 × 2,000 × $1.40 / 1,000,000 = $1.40
Output: 500 ×   300 × $4.40 / 1,000,000 = $0.66

Total: ~$2.06 / day
Monthly estimate: ~$62

These are list-rate estimates. Your actual bill depends on thinking effort, output length, cache hit rate, and request volume.
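The sensitivity to thinking effort can be sketched by adding reasoning tokens to the output side of Example 3. The 1,000 reasoning tokens per call below is a hypothetical number chosen for illustration, not a measured value, and the 30-day month is also an assumption.

```python
INPUT_RATE, OUTPUT_RATE = 1.40, 4.40  # USD per 1M tokens


def monthly_cost(calls_per_day: int, input_tokens: int, answer_tokens: int,
                 reasoning_tokens: int = 0, days: int = 30) -> float:
    """Monthly USD estimate; reasoning tokens are billed as output tokens."""
    per_call = (
        input_tokens * INPUT_RATE
        + (answer_tokens + reasoning_tokens) * OUTPUT_RATE
    ) / 1_000_000
    return per_call * calls_per_day * days


# Example 3 as stated (thinking disabled).
print(f"${monthly_cost(500, 2_000, 300):.2f}")
# Same bot with a hypothetical 1,000 reasoning tokens per call.
print(f"${monthly_cost(500, 2_000, 300, reasoning_tokens=1_000):.2f}")
```

With these inputs the monthly estimate roughly doubles once reasoning tokens are added, which is why disabling thinking matters most on high-volume, short-answer workloads.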

GLM Coding Plan tiers

If you work inside a coding agent all day, a subscription plan may be cheaper than metered API calls. Z.ai sells GLM Coding Plan tiers such as Lite, Pro, Max, and Team. These can be exposed to Claude Code and similar tools through an Anthropic-compatible endpoint.

The GLM Coding Plan key is different from a standard API key. To use GLM-5.2 with Claude Code, configure the coding endpoint and select the 1M-context model variant with the [1m] suffix.

export ANTHROPIC_BASE_URL="https://api.z.ai/api/coding/paas/v4"
export ANTHROPIC_API_KEY="your-glm-coding-plan-key"
export ANTHROPIC_DEFAULT_SONNET_MODEL="glm-5.2[1m]"
export ANTHROPIC_DEFAULT_OPUS_MODEL="glm-5.2[1m]"
export CLAUDE_CODE_AUTO_COMPACT_WINDOW=1000000
export API_TIMEOUT_MS=3000000

The timeout matters. Large-context coding calls can run long, and Claude Code may terminate the request early if the timeout is too short.

Some sources show the coding base URL as:

open.z.ai/api/paas/v4

Verify the live endpoint before wiring it into your tooling.
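Before launching the agent, a quick preflight check can catch a missing variable. This is a hedged Python sketch: it only checks that the variables from the export block above are present and that the timeout is an integer. It does not contact the endpoint or validate the key, and the helper name is illustrative.

```python
import os

# Variable names are taken from the export block above.
REQUIRED = [
    "ANTHROPIC_BASE_URL",
    "ANTHROPIC_API_KEY",
    "ANTHROPIC_DEFAULT_SONNET_MODEL",
    "ANTHROPIC_DEFAULT_OPUS_MODEL",
    "API_TIMEOUT_MS",
]


def check_env(env: dict) -> list[str]:
    """Return a list of problems; an empty list means the basics are set."""
    problems = [f"{name} is not set" for name in REQUIRED if not env.get(name)]
    timeout = env.get("API_TIMEOUT_MS", "")
    if timeout and not timeout.isdigit():
        problems.append("API_TIMEOUT_MS should be an integer number of milliseconds")
    return problems


for problem in check_env(dict(os.environ)):
    print("warning:", problem)
```

For reference, an `API_TIMEOUT_MS` of 3000000 is 50 minutes.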

For full setup details, including Cline and Cursor, see the GLM-5.2 coding agents guide. The earlier GLM-5.1 with Claude Code writeup covers the same pattern for the previous generation.

Is GLM-5.2 cheaper than GPT-5.5?

On the metered API, yes, by a wide margin.

VentureBeat reported that GLM-5.2 “beats GPT-5.5 on long-horizon coding at about 1/6th the cost.” That is VentureBeat’s claim, not an Apidog measurement, and it combines benchmark performance with pricing. Treat it as a directional value statement rather than a direct per-token ratio.

At the rate-card level:

GLM-5.2 input:  $1.40 / 1M tokens
GLM-5.2 output: $4.40 / 1M tokens

Top closed frontier reasoning models from OpenAI, Anthropic, and Google generally sit higher than that for premium reasoning tiers.

The subscription comparison is less direct. A heavy GLM Coding Plan tier estimated around $80/mo lands near other premium single-seat coding subscriptions, so the decision depends on:

Quality on your codebase
Usage limits
Context-window needs
Agent compatibility
How each plan meters heavy usage

For a plan-by-plan comparison, see Claude Code vs Codex vs Cursor vs MiniMax Plan vs GLM Plan.

One benchmark caveat: launch results such as SWE-bench Pro 62.1, Terminal-Bench 2.1 at 81.0, and MCP-Atlas 77.0 are Z.ai’s own published results, so treat them as vendor-reported until they are independently reproduced. See the GLM-5.2 benchmarks deep-dive and GLM-5.2 vs GPT-5.5, Claude Opus, and Gemini for more context.

Which pricing path should you choose?

Use this decision guide.

Use case	Best fit
Low-volume or spiky usage	Pay-as-you-go API
All-day coding inside an agent	GLM Coding Plan
Offline, private, or zero marginal token cost	Self-host open weights

Pick pay-as-you-go API if

You only run GLM-5.2 occasionally or your workload varies by day. The per-token rates are low enough that light usage stays inexpensive.

Pick a GLM Coding Plan if

You make hundreds of coding-agent calls per day and want predictable monthly spend. Verify the current tier price before committing.
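A rough break-even point helps decide between the plan and the metered API. The sketch below assumes the estimated $80/mo heavy-tier price mentioned earlier and the per-session costs from Examples 1 and 2. It ignores plan usage limits, so it is a starting point rather than a verdict.

```python
import math


def break_even_sessions(plan_price_usd: float, cost_per_session_usd: float) -> int:
    """Sessions per month at which metered spend reaches the plan price."""
    return math.ceil(plan_price_usd / cost_per_session_usd)


PLAN_PRICE = 80.0          # estimated heavy-tier price; verify the live price
UNCACHED_SESSION = 0.228   # Example 1: 100K input + 20K output
CACHED_SESSION = 0.1368    # Example 2: 80K of the input served from cache

print(break_even_sessions(PLAN_PRICE, UNCACHED_SESSION), "sessions/month without caching")
print(break_even_sessions(PLAN_PRICE, CACHED_SESSION), "sessions/month with caching")
```

At the uncached rate the break-even works out to roughly a dozen Example 1 sized sessions per day over a 30-day month. Effective caching on the metered API pushes the break-even higher.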

Self-host if

You need privacy, offline access, or no per-token bill. You still pay for your own compute. Start with running GLM-5 locally for free or GLM-5 for free with Ollama.

Whichever path you choose, the two main cost controls are the same:

Cache stable prompt prefixes.
Lower or disable thinking effort when the task does not need deep reasoning.

Test GLM-5.2 costs before committing

Before choosing a plan, benchmark your own prompts.

Use an OpenAI-compatible client against:

https://api.z.ai/api/paas/v4/chat/completions

Track:

Input tokens
Output tokens
Cache hit behavior
Latency
Thinking effort
Cost per task
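Most of these metrics can be derived from the `usage` object in each response. This sketch assumes the endpoint returns an OpenAI-style `usage` block and, if it reports cached tokens at all, does so under `prompt_tokens_details.cached_tokens` as in OpenAI's format. Both are assumptions to confirm against a real response. The sample numbers are made up.

```python
INPUT_RATE, OUTPUT_RATE, CACHED_RATE = 1.40, 4.40, 0.26  # USD per 1M tokens


def cost_from_usage(usage: dict) -> dict:
    """Summarize an OpenAI-style `usage` object into tokens and estimated cost."""
    prompt = usage.get("prompt_tokens", 0)
    completion = usage.get("completion_tokens", 0)
    # Assumption: cached tokens appear under prompt_tokens_details.cached_tokens.
    cached = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
    fresh = prompt - cached
    cost = (fresh * INPUT_RATE + cached * CACHED_RATE + completion * OUTPUT_RATE) / 1_000_000
    return {
        "fresh_input": fresh,
        "cached_input": cached,
        "output": completion,
        "cache_hit_rate": cached / prompt if prompt else 0.0,
        "usd": round(cost, 6),
    }


# Made-up usage numbers shaped like Example 2.
sample = {
    "prompt_tokens": 100_000,
    "completion_tokens": 20_000,
    "prompt_tokens_details": {"cached_tokens": 80_000},
}
print(cost_from_usage(sample))
```

Recording this summary per task, together with latency and the thinking setting used, gives you the data needed to compare the metered API against a plan.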

Apidog is useful for this because it lets you design, debug, test, and document API calls in one place. You can send GLM-5.2 requests, inspect responses and token usage, save them as reusable collections, and compare different thinking settings. Download Apidog if you want to test the rate card against your actual traffic.

The short version: anchor your estimates on GLM-5.2’s confirmed API rate of $1.40 / 1M input tokens and $4.40 / 1M output tokens. Then reduce spend by caching stable prefixes, controlling thinking effort, and verifying live Coding Plan prices before you commit.
