Wanda

Posted on • Originally published at apidog.com

How to use the GLM-5.1 API: complete guide with code examples

TL;DR

GLM-5.1 is accessible via the BigModel API at https://open.bigmodel.cn/api/paas/v4/. The API is OpenAI-compatible: same endpoints, request/response formats, and streaming support. You'll need a BigModel account, an API key, and the model name glm-5.1. This guide shows you how to authenticate, send requests, handle streaming and tool calls, and test your integration with Apidog.


Introduction

GLM-5.1 is Z.AI's flagship agentic model, released in April 2026. It leads coding benchmarks including SWE-Bench Pro and outperforms GLM-5 across all major evaluations. If you're building an AI coding assistant or agentic application, integrating GLM-5.1 is straightforward.

The API is OpenAI-compatible. If you already use GPT-4 or Claude, you only need to update the base URL and model name—no new SDKs or parsing required.

💡 Testing agentic APIs is challenging: models that make hundreds of tool calls over several minutes are hard to test directly without burning quota. Apidog's Test Scenarios let you define request sequences, mock responses for each agent state, and verify streaming, tool calls, and error handling before production. Use Apidog for thorough integration testing.

Prerequisites

Before your first call, make sure you have:

  1. BigModel account: Register at bigmodel.cn. Free signup.
  2. API key: Obtain from the BigModel console under API Keys.
  3. Python 3.8+ or Node.js 18+ (the examples in this guide use Python and curl; the same calls work from Node via fetch).
  4. OpenAI SDK or standard requests/fetch (API is OpenAI-compatible).

Set your API key in your environment:

export BIGMODEL_API_KEY="your_api_key_here"

Never hardcode API keys in your code.

Authentication

Add your API key as a Bearer token in the Authorization header:

Authorization: Bearer YOUR_API_KEY

BigModel API keys look like xxxxxxxx.xxxxxxxxxxxxxxxx. The format differs from OpenAI's keys, but you use them the same way.

Base URL

https://open.bigmodel.cn/api/paas/v4/

The chat completions endpoint:

POST https://open.bigmodel.cn/api/paas/v4/chat/completions

Your First Request

Using curl

curl https://open.bigmodel.cn/api/paas/v4/chat/completions \
  -H "Authorization: Bearer $BIGMODEL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.1",
    "messages": [
      {
        "role": "user",
        "content": "Write a Python function that finds all prime numbers up to n using the Sieve of Eratosthenes."
      }
    ],
    "max_tokens": 1024,
    "temperature": 0.7
  }'

Using Python (requests)

import os
import requests

api_key = os.environ["BIGMODEL_API_KEY"]

response = requests.post(
    "https://open.bigmodel.cn/api/paas/v4/chat/completions",
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    },
    json={
        "model": "glm-5.1",
        "messages": [
            {
                "role": "user",
                "content": "Write a Python function that finds all prime numbers up to n using the Sieve of Eratosthenes."
            }
        ],
        "max_tokens": 1024,
        "temperature": 0.7
    }
)

result = response.json()
print(result["choices"][0]["message"]["content"])

Using the OpenAI SDK (Recommended)

The OpenAI Python SDK works out of the box by setting a custom base URL:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BIGMODEL_API_KEY"],
    base_url="https://open.bigmodel.cn/api/paas/v4/"
)

response = client.chat.completions.create(
    model="glm-5.1",
    messages=[
        {
            "role": "user",
            "content": "Write a Python function that finds all prime numbers up to n using the Sieve of Eratosthenes."
        }
    ],
    max_tokens=1024,
    temperature=0.7
)

print(response.choices[0].message.content)

The SDK handles retries, timeouts, and parsing automatically.
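
If you want explicit control over those behaviors, the SDK constructor accepts timeout and max_retries parameters (the values below are illustrative, not official recommendations):

import os
from openai import OpenAI

# Illustrative settings: tune timeout and retry count for your workload.
client = OpenAI(
    api_key=os.environ["BIGMODEL_API_KEY"],
    base_url="https://open.bigmodel.cn/api/paas/v4/",
    timeout=120.0,    # seconds; long agentic steps can be slow
    max_retries=3     # automatic retries on transient connection errors
)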

Response Format

The API response matches OpenAI's schema:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1744000000,
  "model": "glm-5.1",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "def sieve_of_eratosthenes(n):\n    ..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 32,
    "completion_tokens": 215,
    "total_tokens": 247
  }
}

Extract the response via result["choices"][0]["message"]["content"].

Monitor usage to track quota, especially during peak hours (GLM-5.1 bills at 3x quota from 14:00 to 18:00 UTC+8).
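
A minimal sketch of logging token usage after each call, using the result dict from the requests example above:

# Track token consumption per request to monitor quota burn.
usage = result["usage"]
print(
    f"prompt={usage['prompt_tokens']} "
    f"completion={usage['completion_tokens']} "
    f"total={usage['total_tokens']}"
)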

Streaming Responses

Enable streaming to receive tokens as they're generated—critical for user-facing apps.

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BIGMODEL_API_KEY"],
    base_url="https://open.bigmodel.cn/api/paas/v4/"
)

stream = client.chat.completions.create(
    model="glm-5.1",
    messages=[
        {
            "role": "user",
            "content": "Explain how a B-tree index works in a database, with a code example."
        }
    ],
    stream=True,
    max_tokens=2048
)

for chunk in stream:
    # Some OpenAI-compatible streams emit a final chunk with empty choices.
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

print()

Each stream chunk contains new tokens. The final chunk's finish_reason is "stop" or "length".
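
If you need to detect truncation programmatically, track finish_reason while consuming the stream. A sketch (reusing the stream object from above):

# Print tokens as they arrive and remember the final finish_reason.
finish_reason = None
for chunk in stream:
    if chunk.choices:
        choice = chunk.choices[0]
        if choice.delta.content:
            print(choice.delta.content, end="", flush=True)
        if choice.finish_reason:
            finish_reason = choice.finish_reason

if finish_reason == "length":
    print("\n[Truncated: consider raising max_tokens]")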

Streaming with Raw Requests

import os
import json
import requests

api_key = os.environ["BIGMODEL_API_KEY"]

response = requests.post(
    "https://open.bigmodel.cn/api/paas/v4/chat/completions",
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    },
    json={
        "model": "glm-5.1",
        "messages": [{"role": "user", "content": "Write a merge sort in Python."}],
        "stream": True,
        "max_tokens": 1024
    },
    stream=True
)

for line in response.iter_lines():
    if line:
        line = line.decode("utf-8")
        if line.startswith("data: "):
            data = line[6:]
            if data == "[DONE]":
                break
            chunk = json.loads(data)
            delta = chunk["choices"][0]["delta"]
            # delta may carry "content": None on role-only chunks; skip those.
            if delta.get("content"):
                print(delta["content"], end="", flush=True)

Tool Calling

GLM-5.1 enables tool (function) calling mid-conversation for agentic workflows.

Defining Tools

import os
import json
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BIGMODEL_API_KEY"],
    base_url="https://open.bigmodel.cn/api/paas/v4/"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "run_python",
            "description": "Execute Python code and return the output. Use this to test, profile, or benchmark code.",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "The Python code to execute"
                    }
                },
                "required": ["code"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read the contents of a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "File path to read"
                    }
                },
                "required": ["path"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="glm-5.1",
    messages=[
        {
            "role": "user",
            "content": "Write a function to compute Fibonacci numbers, test it for n=10, and show me the output."
        }
    ],
    tools=tools,
    tool_choice="auto"
)

message = response.choices[0].message
print(f"Finish reason: {response.choices[0].finish_reason}")

if message.tool_calls:
    for tool_call in message.tool_calls:
        print(f"\nTool called: {tool_call.function.name}")
        print(f"Arguments: {tool_call.function.arguments}")

Handling Tool Call Responses

When the model requests a tool call, execute it and return the result in the next message:

import subprocess

def execute_tool(tool_call):
    name = tool_call.function.name
    args = json.loads(tool_call.function.arguments)

    if name == "run_python":
        # Warning: this executes model-generated code directly.
        # Sandbox or containerize it in production.
        try:
            result = subprocess.run(
                ["python3", "-c", args["code"]],
                capture_output=True,
                text=True,
                timeout=10
            )
        except subprocess.TimeoutExpired:
            return "Error: code execution timed out"
        return result.stdout or result.stderr

    elif name == "read_file":
        try:
            with open(args["path"]) as f:
                return f.read()
        except FileNotFoundError:
            return f"Error: file {args['path']} not found"

    return f"Unknown tool: {name}"


def run_agent_loop(user_message, tools, max_iterations=20):
    messages = [{"role": "user", "content": user_message}]

    for i in range(max_iterations):
        response = client.chat.completions.create(
            model="glm-5.1",
            messages=messages,
            tools=tools,
            tool_choice="auto",
            max_tokens=4096
        )

        message = response.choices[0].message
        messages.append(message.model_dump())

        if response.choices[0].finish_reason == "stop":
            return message.content

        if response.choices[0].finish_reason == "tool_calls":
            for tool_call in message.tool_calls:
                tool_result = execute_tool(tool_call)
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": tool_result
                })

    return "Max iterations reached"


result = run_agent_loop(
    "Write a quicksort implementation, test it with a random list of 1000 integers, and report the time.",
    tools
)
print(result)

This pattern enables robust agentic workflows: let the model initiate tool calls, process results, and iterate until completion.

Key Parameters

| Parameter   | Type          | Default  | Description                             |
| ----------- | ------------- | -------- | --------------------------------------- |
| model       | string        | required | Use "glm-5.1"                            |
| messages    | array         | required | Conversation history                     |
| max_tokens  | integer       | 1024     | Max tokens to generate (up to 163,840)   |
| temperature | float         | 0.95     | Randomness; lower = more deterministic   |
| top_p       | float         | 0.7      | Nucleus sampling cutoff                  |
| stream      | boolean       | false    | Enable streaming responses               |
| tools       | array         | null     | Tool definitions for function calling    |
| tool_choice | string/object | "auto"   | "auto", "none", or a specific tool       |
| stop        | string/array  | null     | Custom stop sequences                    |

Recommended settings for coding tasks (setting max_tokens to the full 163,840 leaves headroom for long agentic runs):

{
    "model": "glm-5.1",
    "temperature": 1.0,
    "top_p": 0.95,
    "max_tokens": 163840
}

For deterministic code generation, set temperature to 0.2–0.4.
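
For example, a lower-temperature call for more reproducible output (a sketch reusing the client configured earlier):

# Lower temperature trades variety for determinism in generated code.
response = client.chat.completions.create(
    model="glm-5.1",
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    temperature=0.3,
    max_tokens=1024
)
print(response.choices[0].message.content)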

Using GLM-5.1 with Coding Assistants

You can configure Claude Code, Cline, Kilo Code, and other coding assistants to use GLM-5.1 via the BigModel API for lower cost and strong coding performance.

Claude Code Setup

Update your Claude Code config (~/.claude/settings.json); exact key names can vary between versions, so check your version's documentation:

{
  "model": "glm-5.1",
  "baseURL": "https://open.bigmodel.cn/api/paas/v4/",
  "apiKey": "your_bigmodel_api_key"
}

Cline / Roo Code Setup

In VS Code settings or Cline extension config:

{
  "cline.apiProvider": "openai",
  "cline.openAIBaseURL": "https://open.bigmodel.cn/api/paas/v4/",
  "cline.openAIApiKey": "your_bigmodel_api_key",
  "cline.openAIModelId": "glm-5.1"
}

Quota Consumption

GLM-5.1 uses quota-based billing:

  • Peak (14:00–18:00 UTC+8): 3x quota per request
  • Off-peak: 2x quota per request
  • Promo (through April 2026): 1x quota off-peak

Schedule lengthy agentic tasks for off-peak hours to save quota; a helper like the sketch below can gate batch jobs.
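
A small helper for that scheduling decision (a sketch; the window comes from the pricing above):

from datetime import datetime, timezone, timedelta

def is_peak_hours() -> bool:
    # Peak billing window is 14:00-18:00 in UTC+8.
    now_utc8 = datetime.now(timezone(timedelta(hours=8)))
    return 14 <= now_utc8.hour < 18

if is_peak_hours():
    print("Peak hours (3x quota): consider deferring batch jobs.")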

Testing the GLM-5.1 API with Apidog

Testing agentic integrations requires handling completions, streaming, tool calls, and errors. Testing against the real API burns quota and requires live connectivity.

Apidog Test Scenarios

Apidog's Smart Mock lets you define and simulate all response types for robust local testing.

Setting Up the Mock Endpoint

  1. In Apidog, create a new endpoint:
   POST https://open.bigmodel.cn/api/paas/v4/chat/completions
  2. Add a Mock Expectation for a standard response:
   {
     "id": "chatcmpl-test123",
     "object": "chat.completion",
     "created": 1744000000,
     "model": "glm-5.1",
     "choices": [
       {
         "index": 0,
         "message": {
           "role": "assistant",
           "content": "def sieve(n): ..."
         },
         "finish_reason": "stop"
       }
     ],
     "usage": {
       "prompt_tokens": 32,
       "completion_tokens": 120,
       "total_tokens": 152
     }
   }
  3. Add a second expectation for a tool call response:
   {
     "id": "chatcmpl-tool456",
     "object": "chat.completion",
     "created": 1744000001,
     "model": "glm-5.1",
     "choices": [
       {
         "index": 0,
         "message": {
           "role": "assistant",
           "content": null,
           "tool_calls": [
             {
               "id": "call_abc",
               "type": "function",
               "function": {
                 "name": "run_python",
                 "arguments": "{\"code\": \"print(2+2)\"}"
               }
             }
           ]
         },
         "finish_reason": "tool_calls"
       }
     ],
     "usage": {
       "prompt_tokens": 48,
       "completion_tokens": 35,
       "total_tokens": 83
     }
   }
  4. Add a rate limit response (HTTP 429):
   {
     "error": {
       "message": "Rate limit exceeded. Please retry after 60 seconds.",
       "type": "rate_limit_error",
       "code": "rate_limit_exceeded"
     }
   }

Testing the Full Agent Loop

Use Apidog's Test Scenarios to chain requests:

  1. POST to /chat/completions with the initial message; assert 200 and finish_reason == "tool_calls".
  2. POST again with the tool result appended to messages; assert 200 and finish_reason == "stop".
  3. Extract the final content and verify the code output.

Test error handling by switching the mock to return 429 and verifying your retry logic.

Apidog variables let you pass data (request_id, tool_call_id) between steps, accurately mirroring real agentic flows.
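
If you prefer to drive the same assertions from code, a rough pytest sketch against the mock might look like this (MOCK_URL is a placeholder for wherever your Apidog mock endpoint is served; the expected finish reasons assume the mock expectations defined above):

import requests

MOCK_URL = "http://127.0.0.1:3658/chat/completions"  # placeholder mock address

def test_tool_call_then_stop():
    # Step 1: the initial request should return a tool call.
    r1 = requests.post(MOCK_URL, json={
        "model": "glm-5.1",
        "messages": [{"role": "user", "content": "Compute 2+2 with run_python."}],
    })
    assert r1.status_code == 200
    assert r1.json()["choices"][0]["finish_reason"] == "tool_calls"

    # Step 2: after the tool result, the model should finish normally.
    # (A real request would also include the assistant's tool_calls message.)
    r2 = requests.post(MOCK_URL, json={
        "model": "glm-5.1",
        "messages": [
            {"role": "user", "content": "Compute 2+2 with run_python."},
            {"role": "tool", "tool_call_id": "call_abc", "content": "4"},
        ],
    })
    assert r2.status_code == 200
    assert r2.json()["choices"][0]["finish_reason"] == "stop"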

Error Handling

The API uses standard HTTP status codes:

| Status | Meaning             | Action                                    |
| ------ | ------------------- | ----------------------------------------- |
| 200    | Success             | Process response                          |
| 400    | Bad request         | Check request format                      |
| 401    | Unauthorized        | Verify API key                            |
| 429    | Rate limit          | Retry after the Retry-After header value  |
| 500    | Server error        | Retry with exponential backoff            |
| 503    | Service unavailable | Retry with exponential backoff            |

Example error-handling logic:

import os
import time
import requests

def call_with_retry(payload, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.post(
                "https://open.bigmodel.cn/api/paas/v4/chat/completions",
                headers={"Authorization": f"Bearer {os.environ['BIGMODEL_API_KEY']}",
                         "Content-Type": "application/json"},
                json=payload,
                timeout=120
            )

            if response.status_code == 429:
                retry_after = int(response.headers.get("Retry-After", 60))
                print(f"Rate limited. Waiting {retry_after}s...")
                time.sleep(retry_after)
                continue

            if response.status_code >= 500:
                # Transient server error: back off exponentially and retry.
                wait = 2 ** attempt
                print(f"Server error {response.status_code}. Retrying in {wait}s...")
                time.sleep(wait)
                continue

            response.raise_for_status()
            return response.json()

        except requests.exceptions.Timeout:
            wait = 2 ** attempt
            print(f"Timeout on attempt {attempt + 1}. Retrying in {wait}s...")
            time.sleep(wait)

    raise Exception("Max retries exceeded")

For long agentic runs (steps may take 30–60s), set timeouts to 120–300 seconds.

Conclusion

GLM-5.1's OpenAI-compatible API makes integration simple—just update the base URL and model name if you're already using OpenAI's APIs. The main differences are the endpoint (open.bigmodel.cn) and quota-based billing.

For agentic applications with long tool-call sessions, GLM-5.1's long-horizon optimization is a significant advantage. Use Apidog's Smart Mock and Test Scenarios to ensure your integration handles all edge cases before going live.

For an overview of GLM-5.1 and its benchmarks, see the GLM-5.1 model overview. For more on building and testing agent workflows with Apidog, see how AI agent memory works.

FAQ

Is the GLM-5.1 API OpenAI-compatible?

Yes. The request/response format, streaming, and tool calling are identical to OpenAI's chat completions API. Use the OpenAI Python SDK or any compatible client by setting the base URL to https://open.bigmodel.cn/api/paas/v4/.

What model name should I use?

Use "glm-5.1" in your API requests.

How does GLM-5.1 pricing work?

BigModel API uses quota:

  • Peak hours (14:00–18:00 UTC+8): 3x quota
  • Off-peak: 2x quota
  • Promo (through April 2026): 1x quota off-peak

Maximum context length?

200,000 tokens input context. Max output: 163,840 tokens. For long agentic runs, set max_tokens to 32,768 or higher.

Does GLM-5.1 support function/tool calling?

Yes. Use the OpenAI tool calling schema (type: "function"), pass in the tools array, and handle finish_reason: "tool_calls" responses.

How can I test without spending quota?

Use Apidog's Smart Mock to define and simulate all API responses—success, tool calls, rate limits, errors—and run your test suite locally.

Where are the GLM-5.1 model weights?

On HuggingFace: zai-org/GLM-5.1 (MIT License), supporting vLLM and SGLang for local inference.
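
For a quick local sanity check, vLLM's Python API can load the weights directly (a sketch; a model of this size typically needs multiple GPUs, configured via the tensor_parallel_size argument):

# Minimal local-inference sketch with vLLM; hardware requirements apply.
from vllm import LLM, SamplingParams

llm = LLM(model="zai-org/GLM-5.1")  # downloads weights from HuggingFace
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Write a hello world in Python."], params)
print(outputs[0].outputs[0].text)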
