Preecha

How to use the GLM-5.1 API: complete guide with code examples

TL;DR

GLM-5.1 is available through the BigModel API at https://open.bigmodel.cn/api/paas/v4/. The API is OpenAI-compatible: same endpoint structure, same request format, same streaming pattern. You need a BigModel account, an API key, and the model name glm-5.1. This guide shows how to authenticate, send your first request, stream responses, handle tool calls, and test your integration with Apidog.

Introduction

GLM-5.1 is Z.AI's flagship agentic model, released April 2026. It ranks #1 on SWE-Bench Pro and leads GLM-5 on every major coding benchmark. If you're building an AI coding assistant, autonomous agent, or application that needs long-horizon task execution, you can integrate GLM-5.1 through the BigModel API.

The developer-friendly part: the API is OpenAI-compatible. If your app already uses GPT-style chat completions, you usually only need to change:

  • The base URL
  • The model name
  • The API key

Testing is the harder part for agentic workflows. A model-driven loop can run many tool calls over several minutes, and repeatedly testing against the live API consumes quota. Apidog's Smart Mock and Test Scenarios help you simulate normal completions, streaming responses, tool calls, and error states before production.

Prerequisites

Before you start, prepare:

  • A BigModel account at bigmodel.cn
  • An API key from the BigModel console under API Keys
  • Python 3.8+ or Node.js 18+
  • The OpenAI SDK, requests, or fetch

Set your API key as an environment variable:

export BIGMODEL_API_KEY="your_api_key_here"

Do not hardcode API keys in source code.

Authentication

Every request requires a Bearer token:

Authorization: Bearer YOUR_API_KEY

BigModel API keys use a two-part format similar to:

xxxxxxxx.xxxxxxxxxxxxxxxx

This differs from OpenAI's sk- prefix, but you use it the same way in the Authorization header.
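
A minimal sketch of building these headers once and failing fast when the key is missing. The helper name `build_headers` is hypothetical and not part of any BigModel SDK; it assumes the `BIGMODEL_API_KEY` environment variable from the Prerequisites section.

```python
import os


def build_headers(api_key=None):
    """Return the headers for a BigModel request.

    Falls back to the BIGMODEL_API_KEY environment variable when no key
    is passed explicitly.
    """
    key = api_key or os.environ.get("BIGMODEL_API_KEY")
    if not key:
        raise RuntimeError("Set BIGMODEL_API_KEY before calling the API")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Passing the result as `headers=build_headers()` keeps the key out of every call site.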

Base URL and endpoint

Use this base URL:

https://open.bigmodel.cn/api/paas/v4/

The chat completions endpoint is:

POST https://open.bigmodel.cn/api/paas/v4/chat/completions

Make your first request

Option 1: curl

curl https://open.bigmodel.cn/api/paas/v4/chat/completions \
  -H "Authorization: Bearer $BIGMODEL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.1",
    "messages": [
      {
        "role": "user",
        "content": "Write a Python function that finds all prime numbers up to n using the Sieve of Eratosthenes."
      }
    ],
    "max_tokens": 1024,
    "temperature": 0.7
  }'

Option 2: Python with requests

import os
import requests

api_key = os.environ["BIGMODEL_API_KEY"]

response = requests.post(
    "https://open.bigmodel.cn/api/paas/v4/chat/completions",
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    },
    json={
        "model": "glm-5.1",
        "messages": [
            {
                "role": "user",
                "content": "Write a Python function that finds all prime numbers up to n using the Sieve of Eratosthenes.",
            }
        ],
        "max_tokens": 1024,
        "temperature": 0.7,
    },
)

response.raise_for_status()

result = response.json()
print(result["choices"][0]["message"]["content"])

Option 3: OpenAI SDK

Because GLM-5.1 is OpenAI-compatible, you can use the OpenAI SDK with a custom base URL:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BIGMODEL_API_KEY"],
    base_url="https://open.bigmodel.cn/api/paas/v4/",
)

response = client.chat.completions.create(
    model="glm-5.1",
    messages=[
        {
            "role": "user",
            "content": "Write a Python function that finds all prime numbers up to n using the Sieve of Eratosthenes.",
        }
    ],
    max_tokens=1024,
    temperature=0.7,
)

print(response.choices[0].message.content)

This is usually the simplest integration path if your app already uses OpenAI-compatible clients.

Response format

The response structure follows the OpenAI chat completions format:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1744000000,
  "model": "glm-5.1",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "def sieve_of_eratosthenes(n):\n    ..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 32,
    "completion_tokens": 215,
    "total_tokens": 247
  }
}

Read the assistant output from:

result["choices"][0]["message"]["content"]

Track the usage field to monitor quota consumption. GLM-5.1 bills at 3x quota during peak hours, 14:00-18:00 UTC+8.
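
One way to track the usage field across a session is a small accumulator. This is a sketch that works on the parsed JSON response (a plain dict, as returned by `response.json()`); the `UsageTracker` class is a hypothetical helper, not something the API provides.

```python
from dataclasses import dataclass


@dataclass
class UsageTracker:
    """Running totals of the token counts reported in each response."""

    prompt_tokens: int = 0
    completion_tokens: int = 0
    requests: int = 0

    def add(self, response: dict) -> None:
        # A response without a usage block still counts as one request.
        usage = response.get("usage") or {}
        self.prompt_tokens += usage.get("prompt_tokens", 0)
        self.completion_tokens += usage.get("completion_tokens", 0)
        self.requests += 1

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens
```

Call `tracker.add(result)` after each request and log `tracker.total_tokens` at the end of an agent run.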

Stream responses

For long code generation or analysis tasks, enable streaming so your app can display tokens as they arrive.

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BIGMODEL_API_KEY"],
    base_url="https://open.bigmodel.cn/api/paas/v4/",
)

stream = client.chat.completions.create(
    model="glm-5.1",
    messages=[
        {
            "role": "user",
            "content": "Explain how a B-tree index works in a database, with a code example.",
        }
    ],
    stream=True,
    max_tokens=2048,
)

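for chunk in stream:
    # Guard against chunks that carry no choices (some OpenAI-compatible
    # servers send a trailing usage-only chunk).
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)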

print()

Each chunk contains only the newly generated delta. The final chunk includes a finish_reason, such as stop or length.
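
If you need the full text and the finish reason after the stream ends, you can accumulate the deltas. The sketch below works on chunks already parsed into dicts (the JSON shape of each streamed event); `collect_stream` is a hypothetical helper name, and with the OpenAI SDK you would read the same fields as attributes instead.

```python
def collect_stream(chunks):
    """Join streamed deltas and return (text, finish_reason)."""
    parts = []
    finish_reason = None
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue  # nothing to read in this chunk
        choice = choices[0]
        text = (choice.get("delta") or {}).get("content")
        if text:
            parts.append(text)
        # Only the last content chunk is expected to carry a finish_reason.
        if choice.get("finish_reason"):
            finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason
```

A `finish_reason` of `length` tells you the output was cut off by `max_tokens` rather than completed.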

Streaming with raw requests

If you do not want to use the OpenAI SDK, handle the server-sent event stream directly:

import os
import json
import requests

api_key = os.environ["BIGMODEL_API_KEY"]

response = requests.post(
    "https://open.bigmodel.cn/api/paas/v4/chat/completions",
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    },
    json={
        "model": "glm-5.1",
        "messages": [
            {
                "role": "user",
                "content": "Write a merge sort in Python.",
            }
        ],
        "stream": True,
        "max_tokens": 1024,
    },
    stream=True,
)

response.raise_for_status()

for line in response.iter_lines():
    if not line:
        continue

    line = line.decode("utf-8")

    if not line.startswith("data: "):
        continue

    data = line[6:]

    if data == "[DONE]":
        break

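    chunk = json.loads(data)
    choices = chunk.get("choices") or []
    if not choices:
        continue

    # content can be missing or null (for example on the role-only first chunk).
    content = choices[0].get("delta", {}).get("content")
    if content:
        print(content, end="", flush=True)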

Use tool calling

GLM-5.1 supports tool calling. The model can request function execution mid-conversation, which is useful for agents that need to run code, query systems, read files, call APIs, or perform actions.

Define tools

import os
import json
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BIGMODEL_API_KEY"],
    base_url="https://open.bigmodel.cn/api/paas/v4/",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "run_python",
            "description": "Execute Python code and return the output. Use this to test, profile, or benchmark code.",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "The Python code to execute",
                    }
                },
                "required": ["code"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read the contents of a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "File path to read",
                    }
                },
                "required": ["path"],
            },
        },
    },
]

response = client.chat.completions.create(
    model="glm-5.1",
    messages=[
        {
            "role": "user",
            "content": "Write a function to compute Fibonacci numbers, test it for n=10, and show me the output.",
        }
    ],
    tools=tools,
    tool_choice="auto",
)

message = response.choices[0].message

print(f"Finish reason: {response.choices[0].finish_reason}")

if message.tool_calls:
    for tool_call in message.tool_calls:
        print(f"\nTool called: {tool_call.function.name}")
        print(f"Arguments: {tool_call.function.arguments}")

Execute tool calls and return results

When the model returns finish_reason: "tool_calls", execute the requested tools and append their outputs as tool messages.

import json
import subprocess

def execute_tool(tool_call):
    """Execute the tool and return the result."""
    name = tool_call.function.name
    args = json.loads(tool_call.function.arguments)

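    if name == "run_python":
        # Warning: this runs model-generated code on your machine.
        # Use a sandbox or container for anything beyond local experiments.
        try:
            result = subprocess.run(
                ["python3", "-c", args["code"]],
                capture_output=True,
                text=True,
                timeout=10,
            )
        except subprocess.TimeoutExpired:
            return "Error: execution timed out after 10 seconds"
        return result.stdout or result.stderr or "(no output)"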

    if name == "read_file":
        try:
            with open(args["path"]) as f:
                return f.read()
        except FileNotFoundError:
            return f"Error: file {args['path']} not found"

    return f"Unknown tool: {name}"

Then run the agent loop:

def run_agent_loop(user_message, tools, max_iterations=20):
    """Run a full agent loop with tool calling."""
    messages = [{"role": "user", "content": user_message}]

    for _ in range(max_iterations):
        response = client.chat.completions.create(
            model="glm-5.1",
            messages=messages,
            tools=tools,
            tool_choice="auto",
            max_tokens=4096,
        )

        choice = response.choices[0]
        message = choice.message
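        # exclude_none drops unset optional fields before echoing the message back.
        messages.append(message.model_dump(exclude_none=True))

        if choice.finish_reason == "tool_calls" and message.tool_calls:
            for tool_call in message.tool_calls:
                tool_result = execute_tool(tool_call)
                messages.append(
                    {
                        "role": "tool",
                        "tool_call_id": tool_call.id,
                        "content": tool_result,
                    }
                )
            continue

        # Any other finish reason (stop, length, ...) ends the loop. Without this
        # return, a truncated reply would be re-sent until max_iterations.
        return message.content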

    return "Max iterations reached"

result = run_agent_loop(
    "Write a quicksort implementation, test it with a random list of 1000 integers, and report the time.",
    tools,
)

print(result)

This is the standard loop:

  1. Send user request.
  2. Let the model decide whether to call tools.
  3. Execute requested tools.
  4. Return tool results.
  5. Continue until the model finishes.
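
Step 3 is where malformed arguments or unknown tool names tend to surface. The sketch below shows one way to turn a tool call into a tool message without letting those cases crash the loop. The registry, the `add` tool, and the `tool_message` helper are all hypothetical names chosen for illustration.

```python
import json


def add(a, b):
    """Toy tool used only for this example."""
    return a + b


TOOL_REGISTRY = {"add": add}


def tool_message(tool_call_id, name, raw_arguments, registry=TOOL_REGISTRY):
    """Run one tool call and wrap the outcome as a `tool` role message."""
    func = registry.get(name)
    if func is None:
        content = {"error": f"unknown tool: {name}"}
    else:
        try:
            args = json.loads(raw_arguments)
            content = {"result": func(**args)}
        except (json.JSONDecodeError, TypeError) as exc:
            # Invalid JSON, non-object arguments, or wrong parameter names.
            content = {"error": f"bad arguments: {exc}"}
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": json.dumps(content),
    }
```

Returning the error as tool content, rather than raising, gives the model a chance to correct its own call on the next turn.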

Key parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | string | required | Use "glm-5.1" |
| messages | array | required | Conversation history |
| max_tokens | integer | 1024 | Max tokens to generate, up to 163840 |
| temperature | float | 0.95 | Randomness. Lower values are more deterministic. Range: 0.0-1.0 |
| top_p | float | 0.7 | Nucleus sampling. Z.AI recommends 0.7 for coding tasks |
| stream | boolean | false | Enable streaming responses |
| tools | array | null | Function definitions for tool calling |
| tool_choice | string/object | "auto" | "auto", "none", or a specific tool |
| stop | string/array | null | Custom stop sequences |

Recommended settings for coding tasks:

{
  "model": "glm-5.1",
  "temperature": 1.0,
  "top_p": 0.95,
  "max_tokens": 163840
}

Z.AI uses these settings in its benchmark evaluations. For more deterministic code generation, lower temperature to 0.2-0.4.
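
A small sketch of a payload builder that applies these values and rejects an out-of-range `max_tokens` before the request is sent. The function name, the `deterministic` flag, and the choice of 0.3 (the midpoint of the 0.2-0.4 range above) are assumptions for illustration.

```python
MAX_OUTPUT_TOKENS = 163840  # output ceiling quoted in this guide


def build_payload(messages, *, deterministic=False, max_tokens=4096):
    """Build a chat completions payload for glm-5.1."""
    if not 1 <= max_tokens <= MAX_OUTPUT_TOKENS:
        raise ValueError(f"max_tokens must be between 1 and {MAX_OUTPUT_TOKENS}")
    return {
        "model": "glm-5.1",
        "messages": messages,
        "max_tokens": max_tokens,
        # 1.0 matches the benchmark settings; 0.3 trades variety for repeatability.
        "temperature": 0.3 if deterministic else 1.0,
        "top_p": 0.95,
    }
```

The result can be passed as `json=` to `requests.post` or unpacked into `client.chat.completions.create(**payload)`.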

Use GLM-5.1 with coding assistants

The Z.AI Coding Plan lets you route Claude Code, Cline, Kilo Code, and other AI coding assistants through GLM-5.1 via the BigModel API. This is useful if you want to test GLM-5.1 in an existing coding workflow.

Claude Code setup

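Claude Code talks to the Anthropic Messages API rather than OpenAI chat completions, and it is normally redirected to another provider through environment variables such as ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN. Treat the keys below as illustrative and confirm the exact endpoint and variable names in the Z.AI Coding Plan documentation. In your Claude Code configuration file, such as ~/.claude/settings.json: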

{
  "model": "glm-5.1",
  "baseURL": "https://open.bigmodel.cn/api/paas/v4/",
  "apiKey": "your_bigmodel_api_key"
}

Cline / Roo Code setup

In your VS Code settings or Cline extension config:

{
  "cline.apiProvider": "openai",
  "cline.openAIBaseURL": "https://open.bigmodel.cn/api/paas/v4/",
  "cline.openAIApiKey": "your_bigmodel_api_key",
  "cline.openAIModelId": "glm-5.1"
}

Quota consumption

GLM-5.1 uses the Z.AI quota system rather than per-token billing:

  • Peak hours, 14:00-18:00 UTC+8: 3x quota per request
  • Off-peak: 2x quota per request
  • Promotional rate through April 2026: 1x during off-peak

For heavy agentic workloads, schedule long-running jobs during off-peak hours when possible.
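
To decide whether to defer a job, you can check the current time against the peak window. This sketch assumes the window is 14:00 inclusive to 18:00 exclusive in UTC+8 and ignores the promotional rate; `quota_multiplier` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

BIGMODEL_TZ = timezone(timedelta(hours=8))  # UTC+8, no daylight saving


def quota_multiplier(now=None):
    """Return 3 during peak hours (14:00-18:00 UTC+8), otherwise 2.

    `now` must be a timezone-aware datetime; defaults to the current time.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    local = now.astimezone(BIGMODEL_TZ)
    return 3 if 14 <= local.hour < 18 else 2
```

A scheduler could hold a batch job until `quota_multiplier()` returns 2.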

Test the GLM-5.1 API with Apidog

Agentic integrations need to handle several response types:

  • Standard completions
  • Streaming chunks
  • Tool call requests
  • Tool result messages
  • Rate limits
  • Server errors

Testing every case against the real API consumes quota and depends on a live network connection.

Apidog's Smart Mock lets you define these response states and test your client without calling the real API.

Create a mock endpoint

In Apidog, create this endpoint:

POST https://open.bigmodel.cn/api/paas/v4/chat/completions

Add a mock expectation for a standard success response:

{
  "id": "chatcmpl-test123",
  "object": "chat.completion",
  "created": 1744000000,
  "model": "glm-5.1",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "def sieve(n): ..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 32,
    "completion_tokens": 120,
    "total_tokens": 152
  }
}

Add another expectation for a tool call response:

{
  "id": "chatcmpl-tool456",
  "object": "chat.completion",
  "created": 1744000001,
  "model": "glm-5.1",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_abc",
            "type": "function",
            "function": {
              "name": "run_python",
              "arguments": "{\"code\": \"print(2+2)\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": {
    "prompt_tokens": 48,
    "completion_tokens": 35,
    "total_tokens": 83
  }
}

Add a rate limit response with HTTP 429:

{
  "error": {
    "message": "Rate limit exceeded. Please retry after 60 seconds.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}

Test the full agent loop

Use Apidog Test Scenarios to chain requests together.

Example scenario:

  1. Send the initial POST /chat/completions request.
  2. Assert status is 200.
  3. Assert choices[0].finish_reason equals "tool_calls".
  4. Extract choices[0].message.tool_calls[0].id.
  5. Send the next request with a tool message containing the tool result.
  6. Assert status is 200.
  7. Assert choices[0].finish_reason equals "stop".
  8. Assert the final content contains the expected code or output.

This lets you test the agent loop without spending quota. You can also switch the mock to return 429 and verify your retry logic.

For multi-step workflows, use variables to pass values such as request_id or tool_call_id between steps. This mirrors a real agent loop and catches integration issues before production.
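
The same scenario can also be expressed as a plain unit test with an in-process stand-in for the endpoint. The sketch below is not Apidog and not the BigModel SDK: `ScriptedChatAPI` and `run_loop` are hypothetical names, and the canned responses reuse the mock payload shapes shown earlier.

```python
import json


class ScriptedChatAPI:
    """Hypothetical stand-in that replays canned responses in order."""

    def __init__(self, responses):
        self._responses = list(responses)
        self.requests = []

    def create(self, **payload):
        # Snapshot the message list so later appends do not alter the record.
        self.requests.append({**payload, "messages": list(payload["messages"])})
        return self._responses.pop(0)


TOOL_CALL_RESPONSE = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_abc",
                "type": "function",
                "function": {"name": "run_python",
                             "arguments": "{\"code\": \"print(2+2)\"}"},
            }],
        },
        "finish_reason": "tool_calls",
    }]
}

FINAL_RESPONSE = {
    "choices": [{
        "message": {"role": "assistant", "content": "2 + 2 = 4"},
        "finish_reason": "stop",
    }]
}


def run_loop(api, user_message, run_tool, max_iterations=5):
    """Minimal dict-based agent loop used only for this test."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_iterations):
        choice = api.create(model="glm-5.1", messages=messages)["choices"][0]
        messages.append(choice["message"])
        if choice["finish_reason"] != "tool_calls":
            return choice["message"]["content"]
        for call in choice["message"]["tool_calls"]:
            args = json.loads(call["function"]["arguments"])
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": run_tool(call["function"]["name"], args),
            })
    raise RuntimeError("max iterations reached")
```

Running `run_loop(ScriptedChatAPI([TOOL_CALL_RESPONSE, FINAL_RESPONSE]), "What is 2+2?", lambda name, args: "4")` exercises both steps, and `api.requests` lets you assert that the second request carried the tool message with the extracted `tool_call_id`.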

Error handling

The API returns standard HTTP status codes.

| Status | Meaning | Action |
| --- | --- | --- |
| 200 | Success | Process response normally |
| 400 | Bad request | Check your request format |
| 401 | Unauthorized | Verify your API key |
| 429 | Rate limit | Retry after the Retry-After header value |
| 500 | Server error | Retry with exponential backoff |
| 503 | Service unavailable | Retry with exponential backoff |

Use retries for rate limits, timeouts, and transient server errors:

import os
import time
import requests

def call_with_retry(payload, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.post(
                "https://open.bigmodel.cn/api/paas/v4/chat/completions",
                headers={
                    "Authorization": f"Bearer {os.environ['BIGMODEL_API_KEY']}",
                    "Content-Type": "application/json",
                },
                json=payload,
                timeout=120,
            )

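            if response.status_code == 429:
                # Retry-After is assumed to be whole seconds; fall back to 60.
                header = response.headers.get("Retry-After", "60")
                retry_after = int(header) if header.isdigit() else 60
                print(f"Rate limited. Waiting {retry_after}s...")
                time.sleep(retry_after)
                continue

            if response.status_code in (500, 503):
                # Transient server errors: back off exponentially and retry.
                wait = 2 ** attempt
                print(f"Server error {response.status_code}. Retrying in {wait}s...")
                time.sleep(wait)
                continue

            # Remaining errors (400, 401, ...) are not retryable; raise them.
            response.raise_for_status()
            return response.json()

        except requests.exceptions.Timeout:
            wait = 2 ** attempt
            print(f"Timeout on attempt {attempt + 1}. Retrying in {wait}s...")
            time.sleep(wait)

    raise RuntimeError("Max retries exceeded")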

For long agentic runs, use a generous timeout such as 120-300 seconds. Individual steps may take longer when the model generates complete code files or analyzes complex results.
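
The wait time itself can be factored into a pure function, which makes the retry policy easy to test. This is a sketch: `retry_delay` is a hypothetical helper, it only understands a numeric Retry-After value (the header may also be an HTTP date, which falls through to backoff here), and the 60-second cap is an assumption.

```python
def retry_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based).

    A numeric Retry-After value wins; otherwise use capped exponential
    backoff: base * 2**attempt, never more than `cap`.
    """
    if retry_after is not None:
        try:
            return min(max(float(retry_after), 0.0), cap)
        except ValueError:
            pass  # not a number of seconds; use backoff instead
    return min(base * (2 ** attempt), cap)
```

In production you would usually add random jitter to the result so that many clients do not retry at the same instant.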

Conclusion

GLM-5.1's OpenAI-compatible API makes integration straightforward if you already use GPT-style chat completions. Update the base URL, use the glm-5.1 model name, and handle responses with the same chat completions structure.

For agentic applications, focus on the loop: tool definitions, tool execution, streamed output, quota-aware retries, and mock-based testing. Apidog's Smart Mock and Test Scenarios help validate those paths before your agent runs against the live API.

For background on what GLM-5.1 is and how its benchmarks compare, see the GLM-5.1 model overview. For more on building and testing AI agent workflows with Apidog, see how AI agent memory works.

FAQ

Is the GLM-5.1 API OpenAI-compatible?

Yes. The request format, response structure, streaming protocol, and tool calling format are compatible with the OpenAI chat completions API. You can use the OpenAI Python SDK or another OpenAI-compatible client by setting the base URL to:

https://open.bigmodel.cn/api/paas/v4/

What model name should I use?

Use:

glm-5.1

Do not use a full versioned model name.

How does GLM-5.1 API pricing work?

The BigModel API uses a quota system. GLM-5.1 consumes:

  • 3x quota during peak hours, 14:00-18:00 UTC+8
  • 2x quota during off-peak hours
  • 1x quota during off-peak hours through the end of April 2026 as a promotional rate

What is the maximum context length?

GLM-5.1 supports a 200,000-token input context. Maximum output is 163,840 tokens. For long agentic runs, set max_tokens to a large value such as 32768 or higher to reduce the chance of truncating output mid-task.
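
Truncation shows up as a finish_reason of "length". One possible reaction, sketched below, is to raise the output budget for the next attempt up to the documented ceiling. The doubling policy and the `next_max_tokens` name are assumptions, not a Z.AI recommendation.

```python
MAX_OUTPUT_TOKENS = 163840  # maximum output quoted above


def next_max_tokens(response, current):
    """Double max_tokens after a truncated reply, capped at the ceiling."""
    truncated = response["choices"][0].get("finish_reason") == "length"
    if not truncated:
        return current
    return min(current * 2, MAX_OUTPUT_TOKENS)
```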

Can I use GLM-5.1 for function calling or tool use?

Yes. Define tools with type: "function", pass them in the tools array, and handle responses where finish_reason is "tool_calls".

How do I test GLM-5.1 API calls without spending quota?

Use Apidog Smart Mock to define mock responses for success cases, tool calls, rate limits, and errors. Run your client or test suite against the mock endpoint during development, then use the real API for final validation.

Where can I find the GLM-5.1 model weights?

The open-source weights are on HuggingFace at zai-org/GLM-5.1. They are released under the MIT License and support vLLM and SGLang for local inference.
