The GLM-5.2 API gives you programmatic access to Z.ai’s newest open-weights flagship: a ~753B-parameter MoE model with a 1M-token context window and strong long-horizon coding benchmark results. This guide shows how to get an API key, send your first request, use Python and curl, configure thinking modes, stream responses, call tools, and track token cost.
If you are migrating from GLM-5.1, start here.
What changed since GLM-5.1
GLM-5.2 supersedes the 5.1 generation. If your app already calls the GLM-5.1 API, the request format stays the same. In most integrations, you only need to change the model id to glm-5.2.
Key differences:
- Sparse attention update: GLM-5.2 introduces “IndexShare,” which reuses a single indexer across every four sparse-attention layers to reduce attention cost at long context. You do not configure this directly through the API.
- Stronger agentic coding results: Z.ai’s published results report Terminal-Bench 2.1 at 81.0, up from GLM-5.1’s 62.0.
- Two reasoning effort levels: GLM-5.2 exposes High and Max reasoning effort. Z.ai recommends Max for coding tasks.
Everything below targets glm-5.2.
Step 1: Get a GLM-5.2 API key
Sign in at z.ai, open the API keys section in your account dashboard, and create a new key.
Store it as an environment variable instead of hard-coding it:
export ZAI_API_KEY="your-glm-5.2-api-key"
Do not commit this key to git. A leaked key can generate billable traffic against your account.
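To fail fast when the variable is missing, an application can check for the key at startup. The sketch below is one way to do that; the helper name `load_api_key` is hypothetical and is not part of any Z.ai or OpenAI SDK.

```python
import os


def load_api_key(env=None, name="ZAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes this testable.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running.")
    return key


# Usage with an explicit mapping (the value here is a placeholder, not a real key).
print(load_api_key({"ZAI_API_KEY": "your-glm-5.2-api-key"}))
```

Raising at startup surfaces a missing key as a configuration error rather than as a 401 on the first request.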
Step 2: Use the OpenAI-compatible endpoint
GLM-5.2 is OpenAI-compatible, so existing OpenAI Chat Completions clients can work after you change the base URL and model id.
| Setting | Value |
|---|---|
| Chat completions endpoint | https://api.z.ai/api/paas/v4/chat/completions |
| Base URL for SDKs | https://api.z.ai/api/paas/v4/ |
| Model id | glm-5.2 |
| Auth header | Authorization: Bearer $ZAI_API_KEY |
If you prefer OpenRouter, the model alias is z-ai/glm-5.2.
For local use, Ollama publishes the model as glm-5.2 in the Ollama library. The open weights are also available on Hugging Face under an MIT license.
Before building around long contexts, note the limits:
- Context window: 1M tokens, or 1,048,576 tokens.
- Max output: up to 128K according to the z.ai docs. Verify live limits before relying on this in production.
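A simple pre-flight check can catch oversized requests before they are sent. The sketch below assumes that prompt and output tokens share the 1,048,576-token window and that "128K" means 131,072 tokens; both are assumptions to verify against the live Z.ai docs. The function name `fits_context` is hypothetical.

```python
CONTEXT_WINDOW = 1_048_576  # 1M-token window, as stated in this guide
MAX_OUTPUT = 128 * 1024     # assumption: "128K" read as 131,072 tokens


def fits_context(prompt_tokens, max_output_tokens,
                 context_window=CONTEXT_WINDOW, max_output=MAX_OUTPUT):
    """Return True if the request stays within both limits.

    Assumes input and output tokens count against the same window.
    """
    if max_output_tokens > max_output:
        return False
    return prompt_tokens + max_output_tokens <= context_window


print(fits_context(900_000, 100_000))    # True: 1,000,000 <= 1,048,576
print(fits_context(1_000_000, 100_000))  # False: 1,100,000 > 1,048,576
```

Token counts have to come from a tokenizer or from earlier `usage` data; this check only does the arithmetic.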
Step 3: Send your first request with curl
Use this minimal curl request to verify your key and endpoint:
curl https://api.z.ai/api/paas/v4/chat/completions \
  -H "Authorization: Bearer $ZAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.2",
    "messages": [
      {"role": "system", "content": "You are a concise backend engineer."},
      {"role": "user", "content": "Write a SQL query that returns the 5 newest orders per customer."}
    ]
  }'
The response follows the OpenAI Chat Completions shape:
- id
- choices
- choices[0].message.content
- usage
Use the usage object later for cost tracking.
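To make the response shape concrete, here is a small sketch that parses a raw JSON body and pulls out the reply text and the usage object. The sample payload is invented for illustration; real responses carry additional fields, and the helper name `extract_reply` is hypothetical.

```python
import json

# Made-up payload in the OpenAI Chat Completions shape; values are illustrative.
SAMPLE_RESPONSE = """
{
  "id": "chatcmpl-example",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "SELECT 1;"},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 31, "completion_tokens": 5, "total_tokens": 36}
}
"""


def extract_reply(raw_json):
    """Return (assistant text, usage dict) from a raw response body."""
    data = json.loads(raw_json)
    return data["choices"][0]["message"]["content"], data["usage"]


content, usage = extract_reply(SAMPLE_RESPONSE)
print(content)
print(usage["total_tokens"])
```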
Step 4: Call GLM-5.2 from Python
Install the OpenAI SDK:
pip install openai
Then configure the client with the Z.ai base URL:
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["ZAI_API_KEY"],
    base_url="https://api.z.ai/api/paas/v4/",
)

resp = client.chat.completions.create(
    model="glm-5.2",
    messages=[
        {"role": "system", "content": "You are a concise backend engineer."},
        {"role": "user", "content": "Explain idempotency keys in 3 sentences."},
    ],
)
print(resp.choices[0].message.content)
That is the core integration. Existing OpenAI-style helpers for retries, logging, request tracing, and response parsing should carry over.
For more background on the broader GLM API family, see the GLM-5 API overview.
Step 5: Configure thinking and reasoning effort
GLM-5.2 supports reasoning controls. You can disable thinking for simple tasks or enable it with higher effort for coding, math, and multi-step work.
Disable thinking for low-latency tasks
Use this for classification, routing, short transformations, or simple rewrites:
resp = client.chat.completions.create(
    model="glm-5.2",
    messages=[
        {"role": "user", "content": "Classify: 'my card was charged twice'"}
    ],
    extra_body={
        "thinking": {"type": "disabled"}
    },
)
print(resp.choices[0].message.content)
Enable Max reasoning for harder coding tasks
Z.ai recommends Max reasoning effort for coding:
resp = client.chat.completions.create(
    model="glm-5.2",
    messages=[
        {
            "role": "user",
            "content": "Refactor this function to remove the N+1 query and explain the fix."
        },
    ],
    extra_body={
        "thinking": {"type": "enabled"},
        "reasoning_effort": "max",
    },
)
print(resp.choices[0].message.content)
The OpenAI Python SDK passes provider-specific fields through extra_body.
If you call the REST API directly with curl, place thinking and reasoning_effort at the top level next to model.
Max reasoning can increase output token usage because reasoning counts toward output. Use it where the quality improvement is worth the additional cost.
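For direct REST calls, the same fields sit at the top level of the JSON body. The sketch below builds such a body in Python; the helper name `build_body` is hypothetical, and the lowercase value "high" is an assumption inferred from the "max" value shown in this guide.

```python
import json


def build_body(messages, thinking=True, effort=None, model="glm-5.2"):
    """Build a chat-completions body with thinking fields at the top level."""
    body = {
        "model": model,
        "messages": messages,
        "thinking": {"type": "enabled" if thinking else "disabled"},
    }
    # reasoning_effort only makes sense when thinking is enabled.
    if thinking and effort is not None:
        body["reasoning_effort"] = effort  # "max" per this guide; "high" assumed
    return body


fast = build_body([{"role": "user", "content": "Classify this ticket."}],
                  thinking=False)
deep = build_body([{"role": "user", "content": "Refactor this function."}],
                  effort="max")
print(json.dumps(deep, indent=2))
```

The resulting JSON string can be passed to curl with `-d` or sent with any HTTP client.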
Step 6: Stream responses
For chat UIs, long answers, and coding-agent output, enable streaming so tokens appear as they arrive.
stream = client.chat.completions.create(
    model="glm-5.2",
    messages=[
        {
            "role": "user",
            "content": "Write a 200-word changelog entry for a rate-limit fix."
        }
    ],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
With curl, add "stream": true to the JSON body. The server returns Server-Sent Events, with one data: line per chunk and a final:
data: [DONE]
Streaming does not change token pricing. It only changes when you receive the output.
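If you consume the stream without the SDK, you need to parse the `data:` lines yourself. The sketch below shows one way to do that over an iterable of text lines; the sample lines are invented, and the chunk layout (`choices[0].delta.content`) is assumed to match the OpenAI streaming format that the SDK example relies on.

```python
import json


def iter_sse_content(lines):
    """Yield content fragments from SSE lines until the [DONE] sentinel."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta", {}).get("content")
        if delta:
            yield delta


# Invented sample stream for illustration.
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_sse_content(sample_lines)))  # prints: Hello
```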
Step 7: Add function and tool calling
Tool calling follows the standard OpenAI flow, which takes two requests:
- Define tools in the request.
- Let the model return a tool_calls request.
- Execute the function in your code.
- Send the tool result back to the model.
- Let the model produce the final answer.
Z.ai’s published results report GLM-5.2 at 77.0 on MCP-Atlas, close to Claude Opus 4.8.
Here is a minimal weather-tool example:
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current temperature for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name, e.g. Berlin"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["c", "f"]
                    },
                },
                "required": ["city"],
            },
        },
    }
]

messages = [
    {
        "role": "user",
        "content": "What's the weather in Berlin in celsius?"
    }
]

first = client.chat.completions.create(
    model="glm-5.2",
    messages=messages,
    tools=tools,
)

call = first.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)

# Run the real function here.
# This is stubbed for the example.
def get_weather(city, unit="c"):
    return {
        "city": city,
        "temp": 12,
        "unit": unit
    }

result = get_weather(**args)

messages.append(first.choices[0].message)
messages.append({
    "role": "tool",
    "tool_call_id": call.id,
    "content": json.dumps(result),
})

final = client.chat.completions.create(
    model="glm-5.2",
    messages=messages,
    tools=tools,
)
print(final.choices[0].message.content)
The model decides whether to call a tool. Your application executes the tool and returns the result. The second request lets GLM-5.2 turn the raw function output into a natural-language response.
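With more than one tool, or when the model returns several calls in one turn, a small dispatch table keeps the execution step manageable. The following is a sketch under stated assumptions: it takes tool calls as plain dicts in the OpenAI shape (with the SDK objects you would read `call.id`, `call.function.name`, and `call.function.arguments` instead), and the names `TOOL_REGISTRY` and `run_tool_calls` are hypothetical.

```python
import json


def get_weather(city, unit="c"):
    # Stub, as in the example above.
    return {"city": city, "temp": 12, "unit": unit}


# Maps tool names from the request's `tools` list to local functions.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls, registry=TOOL_REGISTRY):
    """Execute each tool call and return the matching role="tool" messages."""
    messages = []
    for call in tool_calls:
        name = call["function"]["name"]
        if name not in registry:
            result = {"error": f"unknown tool: {name}"}
        else:
            try:
                args = json.loads(call["function"]["arguments"] or "{}")
                result = registry[name](**args)
            except (json.JSONDecodeError, TypeError) as exc:
                # Malformed JSON or arguments that do not match the signature.
                result = {"error": f"bad arguments: {exc}"}
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return messages


calls = [{"id": "call_1",
          "function": {"name": "get_weather",
                       "arguments": json.dumps({"city": "Berlin"})}}]
print(run_tool_calls(calls))
```

Returning errors as tool results, rather than raising, gives the model a chance to correct its arguments on the next turn.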
For repeated manual testing, Apidog is useful here: define the GLM-5.2 endpoint once, save request bodies for each thinking mode, replay tool-calling turns, and inspect streamed responses without rewriting curl commands.
Step 8: Read usage data and estimate cost
Every non-streamed response includes a usage object. Use it for billing and monitoring:
resp = client.chat.completions.create(
    model="glm-5.2",
    messages=[
        {
            "role": "user",
            "content": "Summarize REST vs gRPC in 4 bullets."
        }
    ],
)
u = resp.usage
print("Prompt tokens:", u.prompt_tokens)
print("Completion tokens:", u.completion_tokens)
print("Total tokens:", u.total_tokens)
GLM-5.2 pricing is:
- Input: $1.40 per 1M tokens
- Output: $4.40 per 1M tokens
- Cached input: about $0.26 per 1M tokens, reported by VentureBeat
Example cost calculation for 8,000 input tokens and 1,500 output tokens:
(8000 / 1_000_000 * 1.40) + (1500 / 1_000_000 * 4.40)
= 0.0112 + 0.0066
= about $0.0178
Max-effort reasoning tokens count as output tokens. A Max-effort coding request will usually cost more than a thinking-disabled request.
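The calculation above can be wrapped in a small helper and fed from the usage object. This is a sketch that hard-codes the prices listed in this guide; the function name `estimate_cost` is hypothetical, and cached-input pricing is left out because it depends on how cached tokens are reported.

```python
INPUT_PRICE_PER_M = 1.40   # USD per 1M input tokens, from this guide
OUTPUT_PRICE_PER_M = 4.40  # USD per 1M output tokens, from this guide


def estimate_cost(prompt_tokens, completion_tokens):
    """Return the estimated request cost in USD.

    Reasoning tokens are billed as output, so they are expected to be
    included in completion_tokens already.
    """
    return (prompt_tokens / 1_000_000 * INPUT_PRICE_PER_M
            + completion_tokens / 1_000_000 * OUTPUT_PRICE_PER_M)


print(f"${estimate_cost(8000, 1500):.4f}")  # prints: $0.0178
```

With the SDK, you would call it as `estimate_cost(resp.usage.prompt_tokens, resp.usage.completion_tokens)`.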
VentureBeat reports GLM-5.2 “beats GPT-5.5 on long-horizon coding at roughly 1/6 the cost,” attributing that claim to the economics behind these token prices.
If you prefer a flat-rate plan instead of metered API calls, Z.ai also sells GLM Coding Plan tiers such as Lite, Pro, Max, and Team. Pricing can change, so verify current tiers at z.ai before committing.
Using GLM-5.2 inside Claude Code
GLM-5.2 also provides an Anthropic-compatible path, so you can use it from Claude Code.
Set the coding base URL to:
https://api.z.ai/api/coding/paas/v4
Some sources show open.z.ai/api/paas/v4, so verify the live endpoint before production use.
Then set these environment variables:
export ANTHROPIC_BASE_URL="https://api.z.ai/api/coding/paas/v4"
export ANTHROPIC_API_KEY="your-glm-coding-plan-key"
export ANTHROPIC_DEFAULT_SONNET_MODEL="glm-5.2[1m]"
export ANTHROPIC_DEFAULT_OPUS_MODEL="glm-5.2[1m]"
export CLAUDE_CODE_AUTO_COMPACT_WINDOW=1000000
export API_TIMEOUT_MS=3000000
The [1m] suffix selects the 1M-context variant. The long API_TIMEOUT_MS value (3,000,000 ms, or 50 minutes) helps prevent Claude Code from timing out large-context requests before they return.
For a full setup guide, see GLM with Claude Code. If you are comparing tools, see Claude Code vs Codex vs Cursor vs GLM Plan.
GLM-5.2 integration reference
Use this table when wiring GLM-5.2 into an app:
| Property | GLM-5.2 |
|---|---|
| Model id | glm-5.2 |
| Architecture | ~753B MoE, BF16, IndexShare sparse attention |
| Context window | 1M tokens, or 1,048,576 tokens |
| Max output | Up to 128K per z.ai docs, verify live |
| Thinking modes | High / Max, or disabled |
| Input price | $1.40 / 1M tokens |
| Output price | $4.40 / 1M tokens |
| License | MIT, open weights |
For benchmark detail, Z.ai’s published results include:
- SWE-bench Pro: 62.1, compared with GPT-5.5 at 58.6
- Humanity’s Last Exam with tools: 54.7
- AIME 2026: 99.2
FAQ
Is the GLM-5.2 API OpenAI-compatible?
Yes. Point the OpenAI SDK’s base_url to:
https://api.z.ai/api/paas/v4/
Then set:
model="glm-5.2"
Standard chat, streaming, and tool-calling flows work with the OpenAI-compatible format.
What model id should I use?
Use:
glm-5.2
By platform:
- Z.ai API: glm-5.2
- OpenRouter: z-ai/glm-5.2
- Ollama: glm-5.2
- Claude Code 1M-context variant: glm-5.2[1m]
How do I turn reasoning off?
Pass:
{
  "thinking": {
    "type": "disabled"
  }
}
In the Python SDK, pass it through extra_body:
resp = client.chat.completions.create(
    model="glm-5.2",
    messages=[
        {"role": "user", "content": "Classify this support ticket."}
    ],
    extra_body={
        "thinking": {"type": "disabled"}
    },
)
For hard coding tasks, enable thinking and set:
{
  "reasoning_effort": "max"
}
Z.ai recommends Max for coding.
How much does GLM-5.2 cost per request?
Use these prices:
- $1.40 per 1M input tokens
- $4.40 per 1M output tokens
Read the usage object on each response to compute exact cost. Remember that Max-effort reasoning tokens count as output.
Does GLM-5.2 support vision?
There is no confirmed vision variant as of June 2026. Treat the API as text-in, text-out until Z.ai documents image input support.
Wrapping up
To integrate GLM-5.2 into an existing OpenAI-compatible codebase:
- Set base_url to https://api.z.ai/api/paas/v4/.
- Use model="glm-5.2".
- Start with a curl request.
- Move the call into the OpenAI Python SDK.
- Add streaming for chat UIs.
- Enable Max reasoning for hard coding tasks.
- Use tool calling for agent workflows.
- Read usage to monitor cost.
When you are ready to test endpoints, save request variants, and inspect tool-calling turns without hand-writing curl each time, download Apidog and configure the GLM-5.2 endpoint once.