Point an existing OpenAI-style agent loop at GLM 5.2 and most of it just works: you send tools, you get back tool_calls, you run them, you send the results. Then it does something the SDK examples never show. The assistant returns a line of text in the same turn as the tool calls:
{
  "choices": [{
    "finish_reason": "tool_calls",
    "message": {
      "role": "assistant",
      "content": "I'll look up both pieces of information for you at the same time!",
      "tool_calls": [
        {"id": "call_…", "type": "function",
         "function": {"name": "get_weather", "arguments": "{\"city\":\"Paris\"}"}},
        {"id": "call_…", "type": "function",
         "function": {"name": "get_time", "arguments": "{\"city\":\"Tokyo\"}"}}
      ]
    }
  }]
}
Two conventions dominate, and it helps to keep both in view. In OpenAI's, you send function schemas, get tool_calls back, and answer with a tool message per call, keyed by tool_call_id:
resp = openai.chat.completions.create(model="…", tools=tools, tool_choice="auto", messages=messages)
# assistant.tool_calls → [{"id": "call_…", "function": {"name": "get_weather", "arguments": "{\"city\":\"Paris\"}"}}]
messages.append(resp.choices[0].message)
messages.append({"role": "tool", "tool_call_id": "call_…", "content": "18C, clear"})
Anthropic's is shaped differently: tools carry an input_schema, the model emits tool_use blocks, and you answer with a tool_result block:
# max_tokens is required by the Messages API; 1024 is an arbitrary placeholder.
resp = anthropic.messages.create(model="…", max_tokens=1024, tools=tools, messages=messages)
# resp.content → [{"type": "tool_use", "id": "toolu_…", "name": "get_weather", "input": {"city": "Paris"}}]
messages.append({"role": "assistant", "content": resp.content})
messages.append({"role": "user", "content": [
    {"type": "tool_result", "tool_use_id": "toolu_…", "content": "18C, clear"}]})
GLM 5.2 speaks the OpenAI dialect.
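If one loop has to serve both conventions, a small normalizer keeps the difference at the edge. The following is a minimal sketch, assuming the assistant message has already been converted to a plain dict; `normalize_tool_calls` and the `id`/`name`/`args` output shape are hypothetical names for illustration, not part of either SDK.

```python
import json
from typing import Any


def normalize_tool_calls(message: dict[str, Any]) -> list[dict[str, Any]]:
    """Return [{"id", "name", "args"}] from an OpenAI-style or Anthropic-style message."""
    calls: list[dict[str, Any]] = []

    # OpenAI / GLM shape: a tool_calls list whose arguments are a JSON string.
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        calls.append({"id": call["id"], "name": fn["name"],
                      "args": json.loads(fn["arguments"])})

    # Anthropic shape: content is a list of blocks and input is already a dict.
    content = message.get("content")
    if isinstance(content, list):
        for block in content:
            if block.get("type") == "tool_use":
                calls.append({"id": block["id"], "name": block["name"],
                              "args": block["input"]})
    return calls


openai_style = {"role": "assistant", "content": None, "tool_calls": [
    {"id": "call_…", "type": "function",
     "function": {"name": "get_weather", "arguments": "{\"city\":\"Paris\"}"}}]}
anthropic_style = {"role": "assistant", "content": [
    {"type": "tool_use", "id": "toolu_…", "name": "get_weather",
     "input": {"city": "Paris"}}]}

# Both shapes reduce to the same name and arguments; only the id differs.
print(normalize_tool_calls(openai_style))
print(normalize_tool_calls(anthropic_style))
```

The ids are carried through untouched, so the caller can echo them back in whichever result format the provider expects.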
With OpenAI's models, message.content is normally null when finish_reason is tool_calls. Plenty of agent loops lean on that as if it were a contract: they branch on "content or tool calls," log content as the final answer, or assert it's empty. GLM hands you both at once, and that assumption is the first thing to break.
The behavior here was captured from live tool-calling requests to glm-5.2, with gpt-5.5 and claude-opus-4-8 run on the same task as reference points. The short version: GLM 5.2 uses the OpenAI API surface, but on a couple of axes it behaves more like Claude than like GPT, and an OpenAI-trained loop is the one that trips.
The same turn, three ways
Same prompt, same two tools, three models:
| | GLM (glm-5.2) | OpenAI (gpt-5.5) | Anthropic (claude-opus-4-8) |
|---|---|---|---|
| API surface | OpenAI chat-completions | OpenAI chat-completions | Anthropic messages |
| Text in the tool-call turn | content preamble (non-null) | content is null | a text block before tool_use |
| Reasoning on that turn | exposed: reasoning_content + reasoning_tokens | hidden; only reasoning_tokens in usage | only as a thinking block, if you enable it |
| Parallel tool calls | yes, with index | yes | yes, multiple tool_use blocks |
| Done signal | finish_reason: "tool_calls" | finish_reason: "tool_calls" | stop_reason: "tool_use" |
| Tool-call id prefix | call_… | call_… | toolu_… |
Two rows are where loops break: text in the tool-call turn, and reasoning showing up on that turn. The rest is reassuringly boring.
Text rides with the tool call
GLM 5.2 routinely emits a short assistant content preamble alongside tool_calls, with finish_reason: "tool_calls". It is not an error and it is not occasional.
Here is the same turn from all three, trimmed to the part that differs:
// OpenAI gpt-5.5: content is null on a tool-call turn
"message": { "content": null,
"tool_calls": [ {/* get_weather */}, {/* get_time */} ] }
// GLM glm-5.2: content carries a preamble
"message": { "content": "I'll look up both pieces of information for you at the same time!",
"tool_calls": [ {/* get_weather */}, {/* get_time */} ] }
// Anthropic claude-opus-4-8: a text block sits before the tool_use blocks
"content": [ { "type": "text", "text": "I'll get both pieces of information for you." },
{ "type": "tool_use", /* get_weather */ },
{ "type": "tool_use", /* get_time */ } ]
OpenAI leaves content null; GLM fills it; Anthropic has always put a text block there. So GLM takes OpenAI's wire format with Anthropic's habit of narrating before it acts, and a loop written against OpenAI is the one caught off guard. The fix is small but you have to make it deliberately. Stop treating a tool-call turn as content-free:
import json

resp = client.chat.completions.create(model="glm-5.2", messages=msgs, tools=tools)
msg = resp.choices[0].message
# GLM may return assistant text in the same turn as the tool calls.
if msg.content:
    log.debug("preamble: %s", msg.content)  # keep or drop, but don't assume it's empty
msgs.append(msg)
# tool_calls is None on a plain-text turn, so guard the loop.
for call in msg.tool_calls or []:
    result = dispatch(call.function.name, json.loads(call.function.arguments))
    msgs.append({"role": "tool", "tool_call_id": call.id, "content": result})
If your loop renders content to the user as the assistant's reply, you will now show a "let me check that" line before every tool call. Decide whether you want it. The point is that the decision is yours, not something the model's silence makes for you.
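One way to make that decision explicit is to classify a turn by whether it has tool calls, never by whether content is empty. This is a minimal sketch over a message already converted to a dict; `Turn` and `classify_turn` are hypothetical helpers, not SDK types.

```python
from typing import Any, NamedTuple, Optional


class Turn(NamedTuple):
    kind: str                 # "tool_calls" or "final"
    text: Optional[str]       # preamble on a tool-call turn, answer on a final turn
    tool_calls: list


def classify_turn(message: dict[str, Any]) -> Turn:
    """Branch on the presence of tool calls; text may accompany either kind."""
    tool_calls = message.get("tool_calls") or []
    text = message.get("content") or None
    if tool_calls:
        return Turn("tool_calls", text, tool_calls)
    return Turn("final", text, [])


glm_turn = classify_turn({
    "content": "I'll look up both pieces of information for you at the same time!",
    "tool_calls": [{"id": "call_…"}, {"id": "call_…"}],
})
gpt_turn = classify_turn({"content": None, "tool_calls": [{"id": "call_…"}]})

# Both are tool-call turns; only the GLM one carries text the UI may render.
print(glm_turn.kind, bool(glm_turn.text))
print(gpt_turn.kind, bool(gpt_turn.text))
```

With the kind and the text separated, rendering the preamble becomes a single flag in the UI layer rather than a side effect of how the loop branches.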
It thinks out loud
GLM 5.2 is a reasoning model, and that does not pause for tool use. A tool-call turn carries reasoning along with it, and GLM 5.2 exposes it as text. In a non-streaming response the token accounting makes it explicit:
"usage": {
  "prompt_tokens": 224,
  "completion_tokens": 68,
  "completion_tokens_details": { "reasoning_tokens": 30 },
  "total_tokens": 292
}
Almost half the completion was reasoning, on a request whose visible output is two short function calls. This is the row where all three models diverge. GLM 5.2 gives you the reasoning as reasoning_content plus a token count. OpenAI bills reasoning_tokens in usage but never shows the text. Anthropic shows it only as thinking blocks, and only when you turn extended thinking on. GLM 5.2 is the most exposed of the three by default.
Two consequences. First, cost: you pay for those reasoning tokens on tool-call turns, and an agent loop is many turns. Reasoning effort is the dial that moves the number, which we covered in GLM 5.2: Reasoning Effort Is the Cost Lever. Count reasoning tokens on every turn, not just the final answer.
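Per-turn counting can be as small as a running tally over each response's usage object. The sketch below assumes usage has the shape shown above, with reasoning_tokens counted inside completion_tokens; `new_totals`, `add_usage`, and `reasoning_share` are hypothetical helper names.

```python
from typing import Any


def new_totals() -> dict[str, int]:
    return {"prompt": 0, "completion": 0, "reasoning": 0, "turns": 0}


def add_usage(totals: dict[str, int], usage: dict[str, Any]) -> dict[str, int]:
    """Fold one turn's usage into the running totals."""
    # completion_tokens_details may be absent on models that report no reasoning.
    details = usage.get("completion_tokens_details") or {}
    totals["prompt"] += usage.get("prompt_tokens", 0)
    totals["completion"] += usage.get("completion_tokens", 0)
    totals["reasoning"] += details.get("reasoning_tokens", 0)
    totals["turns"] += 1
    return totals


def reasoning_share(totals: dict[str, int]) -> float:
    """Fraction of completion tokens that went to reasoning."""
    return totals["reasoning"] / totals["completion"] if totals["completion"] else 0.0


usage = {"prompt_tokens": 224, "completion_tokens": 68,
         "completion_tokens_details": {"reasoning_tokens": 30},
         "total_tokens": 292}

totals = new_totals()
for _ in range(2):          # two identical tool-call turns, for illustration only
    add_usage(totals, usage)
print(totals, round(reasoning_share(totals), 2))
```

Calling `add_usage` after every response, tool-call turns included, is what keeps the reasoning cost visible across a long loop.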
Second, streaming order. When you stream the request, GLM sends the reasoning first, then the preamble text, then the tool calls:
reasoning_content (many deltas)
content (a few deltas)
tool_calls (id + name, then arguments)
A parser written against vanilla OpenAI chat completions does not know the reasoning_content field and will quietly ignore that opening burst. Usually fine. It stops being fine if your UI shows a "thinking…" state keyed on the first content delta, because the first thing on the wire is reasoning, not content, and the indicator never flips.
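A parser that tracks which kind of delta is arriving avoids the problem. This is a rough sketch, assuming each streamed chunk's delta has been converted to a dict and that reasoning arrives under a `reasoning_content` key on the delta, as described above; `stream_phases` is a hypothetical helper.

```python
from typing import Any, Iterable, Iterator


def stream_phases(deltas: Iterable[dict[str, Any]]) -> Iterator[str]:
    """Yield a phase name each time the stream moves to a new kind of delta."""
    current = None
    for delta in deltas:
        if delta.get("reasoning_content"):
            phase = "reasoning"
        elif delta.get("tool_calls"):
            phase = "tool_calls"
        elif delta.get("content"):
            phase = "content"
        else:
            continue  # role-only or empty deltas carry no phase information
        if phase != current:
            current = phase
            yield phase


# Hand-written deltas in the order described above; the text is illustrative.
deltas = [
    {"role": "assistant"},
    {"reasoning_content": "The user wants"},
    {"reasoning_content": " two lookups."},
    {"content": "I'll look up both"},
    {"content": " pieces of information."},
    {"tool_calls": [{"index": 0, "id": "call_…",
                     "function": {"name": "get_weather", "arguments": ""}}]},
]

print(list(stream_phases(deltas)))
```

A UI can then flip its indicator on the first phase of any kind, rather than waiting for content that may arrive late.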
What a GLM 5.2 tool-call turn costs
Behavior is half the story; the bill is the other half, and an agent loop runs the same turn many times over. With a fixed prefix (a roughly 2,000-token system prompt plus the tool definitions) and the user message varied each call, measured over ten warm turns:
| Per warm tool-call turn | GLM glm-5.2 | OpenAI gpt-5.5 | Anthropic claude-opus-4-8 |
|---|---|---|---|
| Cost | $0.0009 | $0.0042 | $0.0051 |
| Latency (median) | 6.6s | 1.9s | 3.1s |
| Prompt cached | ≈96% | ≈81% | ≈97% |
| Reasoning tokens | ≈27 | 0 | 0 |
| Cold → warm cost | 3.4× | 2.8× | 4.9× |
GLM 5.2 is the cheap one: roughly 4.5× cheaper than GPT-5.5 and 5.4× cheaper than Opus per warm turn. It is also the slow one, two to three and a half times their latency, because it spends reasoning tokens on every turn while the other two spent none on this task. That is the trade: GLM buys cost with latency, and reasoning effort is the dial that moves it.
Caching is what makes any of these affordable in a loop. The system prompt and tool definitions are most of every prompt and identical each turn, so once the prefix is cached the turn gets 2.8× to 4.9× cheaper. Two things decide whether you see it. GLM and OpenAI cache the prefix automatically; Anthropic only caches what you mark with cache_control. And GLM's cache warms up a beat late, so a three-step task can pay full price while a thirty-step one runs cached. The mechanics are in Open-Weight LLM Caching.
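The short-task versus long-task difference is plain arithmetic. The sketch below uses GLM's warm cost and cold-to-warm ratio from the table; the number of cold turns before the cache warms (three here) is an assumption for illustration, not a measured value, and `loop_cost` is a hypothetical helper.

```python
def loop_cost(steps: int, warm_cost: float, cold_multiplier: float, cold_turns: int) -> float:
    """Estimate total cost when the first `cold_turns` turns miss the cache."""
    cold = min(steps, cold_turns)
    warm = steps - cold
    return cold * warm_cost * cold_multiplier + warm * warm_cost


WARM_COST = 0.0009       # GLM, per warm tool-call turn (from the table)
COLD_MULTIPLIER = 3.4    # GLM, cold -> warm cost ratio (from the table)
COLD_TURNS = 3           # ASSUMPTION: turns paid at full price before the cache warms

short = loop_cost(3, WARM_COST, COLD_MULTIPLIER, COLD_TURNS)
long_ = loop_cost(30, WARM_COST, COLD_MULTIPLIER, COLD_TURNS)

# Per-step cost falls as the cold turns are amortized over a longer loop.
print(f"3 steps:  ${short:.5f} total, ${short / 3:.5f} per step")
print(f"30 steps: ${long_:.5f} total, ${long_ / 30:.5f} per step")
```

Under these assumptions the three-step task pays the cold price on every turn, while the thirty-step task spreads the same three cold turns across twenty-seven warm ones.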
When to reach for GLM 5.2, and how to run it well
Put the pieces together. GLM 5.2 is the cheap model in that table and the slow one, and it reasons on every turn. That profile points at where it earns its place.
Where it fits: long, multi-step agent loops where cost dominates and a few seconds per turn is acceptable. Background coding agents, CI and batch automation, jobs that run unattended. The reasoning that makes it slow is also why it holds up on real coding and planning rather than trivial routing. Caching compounds the case past the warm-up: a thirty-step task amortizes the prefix and runs cheap, while a three-step one can pay full price and eat the latency for nothing. So reach for GLM 5.2 on the long jobs, and keep a faster model for the interactive, single-shot calls where six seconds a turn is felt.
How to run GLM 5.2 well. Five habits make a loop GLM-ready without leaving the OpenAI API surface:
- Treat a tool-call turn as possibly carrying content. Do not assert it is empty.
- Expect reasoning_content on the wire and reasoning_tokens in usage; budget for both, and use the reasoning-effort dial to trade quality for cost.
- In streaming, do not key UI state on the first content delta, since reasoning arrives first.
- Echo tool_call_id verbatim; treat it as opaque, never parse or regenerate it.
- Accumulate streaming arguments by index until the call closes; do not assume a chunk count.
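The last two habits can be sketched together: collect argument fragments per index, keep the id exactly as received, and parse only once the stream has ended. This assumes OpenAI-style streaming deltas converted to dicts; `accumulate_tool_calls` and the ids `call_a` and `call_b` are hypothetical.

```python
import json
from typing import Any, Iterable


def accumulate_tool_calls(deltas: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Merge streamed tool_call fragments by index, then parse the arguments."""
    slots: dict[int, dict[str, Any]] = {}
    for delta in deltas:
        for part in delta.get("tool_calls") or []:
            slot = slots.setdefault(part["index"], {"id": None, "name": None, "raw": ""})
            if part.get("id"):
                slot["id"] = part["id"]              # stored verbatim, never rewritten
            fn = part.get("function") or {}
            if fn.get("name"):
                slot["name"] = fn["name"]
            slot["raw"] += fn.get("arguments") or ""  # any number of fragments
    # Parse only after the stream has ended, when each argument string is complete.
    return [{"id": s["id"], "name": s["name"], "args": json.loads(s["raw"])}
            for _, s in sorted(slots.items())]


# Fragments for two parallel calls, interleaved and split unevenly on purpose.
deltas = [
    {"tool_calls": [{"index": 0, "id": "call_a",
                     "function": {"name": "get_weather", "arguments": ""}}]},
    {"tool_calls": [{"index": 1, "id": "call_b",
                     "function": {"name": "get_time", "arguments": '{"city"'}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": '{"city":"Paris"}'}}]},
    {"tool_calls": [{"index": 1, "function": {"arguments": ':"Tokyo"}'}}]},
]

calls = accumulate_tool_calls(deltas)
print(calls)
```

Keying on index rather than arrival order is what lets interleaved fragments from parallel calls land in the right slot.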
Two things you do not have to defend against: GLM emits parallel tool calls with an index like the others, and the round-trip closes normally. Append the assistant turn, append one tool message per call with its result, and it finishes with finish_reason: "stop". Keep the cacheable prefix byte-stable across turns while you are at it; the system prompt and tool definitions are most of every prompt, and a stable prefix is what lets GLM's cache carry the cost once it warms.
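A cheap guard for prefix stability is to fingerprint the system prompt and tool definitions on every turn and flag any change. This is a sketch only: `prefix_fingerprint` is a hypothetical helper, and it hashes a canonical JSON form rather than the exact bytes the SDK sends, so it catches content drift (a timestamp in the prompt, reordered tools) rather than proving byte identity on the wire.

```python
import hashlib
import json
from typing import Any


def prefix_fingerprint(system_prompt: str, tools: list[dict[str, Any]]) -> str:
    """Hash a canonical serialization of the cacheable prefix."""
    payload = json.dumps(
        {"system": system_prompt, "tools": tools},
        sort_keys=True,            # dict key order does not change the hash
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Illustrative tool definitions, not taken from the measured runs.
weather = {"type": "function", "function": {"name": "get_weather",
           "parameters": {"type": "object", "properties": {"city": {"type": "string"}}}}}
clock = {"type": "function", "function": {"name": "get_time",
         "parameters": {"type": "object", "properties": {"city": {"type": "string"}}}}}

baseline = prefix_fingerprint("You are a helpful agent.", [weather, clock])
same = prefix_fingerprint("You are a helpful agent.", [weather, clock])
reordered = prefix_fingerprint("You are a helpful agent.", [clock, weather])

# List order is preserved by design: reordering tools changes the prefix.
print(baseline == same, baseline == reordered)
```

Logging the fingerprint once per turn and alerting when it changes mid-task is usually enough to catch a prefix that has quietly stopped being cacheable.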
None of this is exotic. It is the gap between "the request succeeds" and "the agent loop is correct," and on GLM that gap is mostly two assumptions wide: that a tool-call turn is silent, and that it isn't thinking. Drop those two, keep the prefix stable, and one loop carries GLM, GPT, and Claude alike, with GLM doing it for a fraction of the cost wherever latency is not the thing you are optimizing.
Disclaimer
The cost, latency, and cache figures above were measured on 2026-06-30 over ten warm tool-call turns per model, with glm-5.2, gpt-5.5, and claude-opus-4-8. Cost is taken from reported usage; latency is wall-clock median and shifts with load and reasoning effort. Model behavior and prices drift, so treat the figures as indicative and re-measure against your own traffic before depending on them.