Hassann

Posted on Jun 17 • Originally published at apidog.com

How to Use GLM-5.2 for Free

GLM-5.2 is one of the most capable open-weights models you can run today. The MIT license makes the weights free to use, but a ~753B mixture-of-experts model is not “easy” to run. This guide shows the practical routes: self-hosting, trial credits, low-cost coding plans, and pay-as-you-go APIs.

Short version: if you have the hardware, self-host the open weights. If you do not, start with z.ai trial credits or the cheapest GLM Coding Plan tier. There is no free OpenRouter route for glm-5.2.

The quick decision tree

Pick the route that matches your constraints.

Your situation	Best route	Real cost
You own a strong GPU box or can rent one	Self-host open weights with Ollama or vLLM	$0 for weights; electricity or GPU rental
You want zero setup and no card	z.ai free-trial credits / rate-limited tier	Free until credits run out; verify current terms
You want the cheapest reliable paid path	GLM Coding Plan Lite, or cached-input API pricing	Low monthly plan or low per-call cost; verify current pricing
You want pay-as-you-go with no commitment	OpenRouter API	$1.40 / 1M input tokens, $4.40 / 1M output tokens

Rule of thumb: truly free means self-hosting. Near-free means trial credits, a low-cost coding plan, or careful cached-input API usage.

Route 1: self-host the open MIT weights

GLM-5.2 is available under the MIT license, so you can download and run the weights without paying a model license fee. The weights are on Hugging Face:

https://huggingface.co/zai-org/GLM-5.2

The implementation reality: GLM-5.2 is a ~753B-parameter MoE model distributed in BF16. Even though only a subset of experts activates per token, the full weight set still needs to live in memory. At 2 bytes per parameter, the raw BF16 weights come to roughly 1.5 TB.

You are not running this comfortably on a normal laptop.

Most self-hosted setups use one of these approaches:

Quantized builds, such as 4-bit variants, to reduce memory requirements.
Multi-GPU instances, often rented by the hour, then shut down after use.
High-memory local machines, where unified memory or multiple GPUs can hold the model.
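To put rough numbers on these options, here is a back-of-the-envelope sketch of weight storage at different precisions. It assumes exactly 753 billion parameters (the approximate figure quoted above) and counts weights only, so KV cache, activations, and quantization metadata come on top. The function name is illustrative, not part of any tool.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    # params_billion * 1e9 params * (bits / 8) bytes, expressed in units of 1e9 bytes
    return params_billion * bits_per_param / 8


# Assumption: 753B parameters; real checkpoints add overhead on top of this.
for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(753, bits):.1f} GB for weights alone")
```

Treat the result as a lower bound when sizing a machine or a rental.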

“Free” here means free model weights. You still pay for hardware, electricity, or GPU time.

Option A: run GLM-5.2 with Ollama

Ollama is the simplest local path if you want a quick API-compatible endpoint.

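# Pull the model. Expect a very large download.
# The tag must match the "model" value in your requests. In Ollama, a ":cloud" tag
# runs on Ollama's hosted service instead of downloading weights, so check the
# Ollama library for the tag that fits your setup.
ollama pull glm-5.2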

Then call Ollama’s local OpenAI-compatible endpoint:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.2",
    "messages": [
      {
        "role": "user",
        "content": "Write a Python function to parse an RFC 3339 timestamp."
      }
    ]
  }'

Watch RAM and VRAM during generation. If the model spills heavily to disk, latency becomes impractical. A quantized build plus enough memory is the difference between usable and unusable.

Option B: run GLM-5.2 with vLLM

Use vLLM when you need an OpenAI-compatible server with better throughput and multi-GPU serving.

pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model zai-org/GLM-5.2 \
  --tensor-parallel-size 8 \
  --max-model-len 131072

Notes:

--tensor-parallel-size 8 assumes eight GPUs.
The correct GPU count depends on your cards, memory, and whether you use a quantized checkpoint.
vLLM exposes an OpenAI-compatible API, so most chat-completion clients work without major changes.
GLM-5.2’s 1M-token context is a headline feature, but KV cache memory is expensive. Set --max-model-len to what your workload actually needs.
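As a rough way to reason about GPU count, the sketch below estimates how many cards are needed just to hold the weights. The 753B parameter count comes from this article; the `usable_fraction` default of 0.7 is a hypothetical allowance for KV cache and runtime overhead, not a measured value, and the function name is illustrative.

```python
import math


def min_gpus_for_weights(params_billion, bits_per_param, gpu_mem_gb, usable_fraction=0.7):
    """Lower bound on GPU count so the weights fit in the usable share of memory."""
    weights_gb = params_billion * bits_per_param / 8   # weights only, in GB
    usable_per_gpu = gpu_mem_gb * usable_fraction      # assumed headroom for KV cache etc.
    return math.ceil(weights_gb / usable_per_gpu)


# Assumed card sizes for illustration: 80 GB and 141 GB per GPU.
for bits in (16, 8, 4):
    for mem in (80, 141):
        print(f"{bits}-bit on {mem} GB cards: at least {min_gpus_for_weights(753, bits, mem)} GPUs")
```

Under these assumptions, BF16 on 80 GB cards needs about 27 GPUs for weights alone, while a 4-bit build needs about 7. Tensor-parallel sizes generally have to divide the model's attention head count evenly, so in practice you would round up to a size the server accepts.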

Example request:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-5.2",
    "messages": [
      {
        "role": "user",
        "content": "Summarize this repository architecture in 10 bullets."
      }
    ]
  }'
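The same request can be sent from application code. Below is a minimal sketch using only the Python standard library; it assumes the server follows the OpenAI chat-completions request and response shape, as the curl examples above do. The helper names `build_chat_request` and `chat` are illustrative.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, api_key=None):
    """Build a POST request for an OpenAI-compatible /chat/completions endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    headers = {"Content-Type": "application/json"}
    if api_key:  # local servers usually need no key; hosted APIs do
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def chat(base_url, model, prompt, api_key=None, timeout=600):
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(base_url, model, prompt, api_key)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]


# Usage against the local servers from this guide (requires a running server):
#   chat("http://localhost:8000/v1", "zai-org/GLM-5.2", "Summarize this repository.")
#   chat("http://localhost:11434/v1", "glm-5.2", "Write a Python function.")
```

Keeping the base URL and model name as parameters means the same code can target Ollama, vLLM, or a hosted endpoint by changing two strings.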

Route 2: use z.ai trial credits

If self-hosting is too much, use z.ai’s hosted API. New accounts typically receive free-trial credits, and there is usually a rate-limited free tier for light testing. Verify the current offer on z.ai, because trial terms change.

Basic API call:

curl https://api.z.ai/api/paas/v4/chat/completions \
  -H "Authorization: Bearer $ZAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.2",
    "messages": [
      {
        "role": "user",
        "content": "Explain IndexShare sparse attention in two sentences."
      }
    ],
    "thinking": {
      "type": "enabled"
    },
    "reasoning_effort": "max"
  }'

Implementation notes:

Use thinking to enable or disable reasoning.
For coding tasks, z.ai recommends the Max thinking-effort level via "reasoning_effort": "max".
The documented effort levels are High and Max.
Output length is documented by z.ai as up to 128K tokens, but verify live limits before building around that number.
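If you build the request body in code, a small helper can keep these parameters consistent. This is a sketch based only on the fields shown in the curl example above; the lowercase "high" value, the "disabled" thinking type, and the choice to omit `reasoning_effort` when thinking is off are assumptions to verify against the z.ai documentation. The function name is illustrative.

```python
# Assumption: effort values are lowercase strings; only "max" appears in the example above.
ALLOWED_EFFORT = {"high", "max"}


def build_zai_body(prompt, thinking=True, effort="max", model="glm-5.2"):
    """Build a chat-completions body with the thinking controls described above."""
    if thinking and effort not in ALLOWED_EFFORT:
        raise ValueError(f"effort must be one of {sorted(ALLOWED_EFFORT)}, got {effort!r}")
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: "disabled" is the counterpart of "enabled".
        "thinking": {"type": "enabled" if thinking else "disabled"},
    }
    if thinking:
        body["reasoning_effort"] = effort
    return body


print(build_zai_body("Explain IndexShare sparse attention in two sentences."))
```

Validating the effort level locally catches typos before they turn into API errors or silently ignored parameters.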

Full parameter details are in the z.ai GLM-5.2 guide.

Route 3: use the cheapest paid floor

When free credits run out, there are two practical low-cost paths.

GLM Coding Plan Lite

If your main use case is coding, the GLM Coding Plan is usually the most predictable option. Pricing sources vary, so verify the current Lite tier directly at z.ai. The important tradeoff is flat-rate coding access instead of metered token billing.

The Coding Plan also exposes an Anthropic-compatible path, which lets you configure tools such as Claude Code, Cline, or Cursor against GLM-5.2.

Example Claude Code environment:

export ANTHROPIC_BASE_URL="https://api.z.ai/api/coding/paas/v4"
export ANTHROPIC_API_KEY="your-glm-coding-plan-key"
export ANTHROPIC_DEFAULT_SONNET_MODEL="glm-5.2[1m]"
export ANTHROPIC_DEFAULT_OPUS_MODEL="glm-5.2[1m]"
export CLAUDE_CODE_AUTO_COMPACT_WINDOW=1000000
export API_TIMEOUT_MS=3000000

Key details:

glm-5.2[1m] selects the 1M-context variant.
API_TIMEOUT_MS should be high for long-context coding tasks.
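The base URL above is an OpenAI-style path, and some sources show open.z.ai/api/paas/v4 instead. z.ai's Claude Code instructions for earlier GLM models used https://api.z.ai/api/anthropic, so verify the live Anthropic-compatible base URL before committing it to config.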

API + cached input pricing

For pay-as-you-go access, OpenRouter lists GLM-5.2 at:

$1.40 per 1M input tokens
$4.40 per 1M output tokens

Reference: OpenRouter GLM-5.2

The same general pricing applies whether you call z.ai directly or use OpenRouter as the routing layer.

The main cost-saving tactic is cached input. Reported cached-input pricing is around $0.26 per 1M tokens, per VentureBeat. This matters when your app repeatedly sends the same prefix, such as:

a long system prompt,
a repository snapshot,
API documentation,
product rules,
agent instructions.

A repeated-context workflow should look like this:

Stable prefix:
  - system prompt
  - coding rules
  - repository context
  - API schemas

Variable suffix:
  - user request
  - current file
  - bug report

You pay full price for the stable prefix once, then reuse it at the cached-input rate where supported.
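To see how much the cached rate matters, here is a rough cost sketch using the prices quoted in this section ($1.40 input, $4.40 output, about $0.26 cached input per 1M tokens). It assumes the first call pays full price for the prefix, every later call hits the cache, and there is no separate cache-write fee; real providers may behave differently. The token counts in the example are hypothetical.

```python
INPUT_PER_M = 1.40         # USD per 1M input tokens (quoted above)
OUTPUT_PER_M = 4.40        # USD per 1M output tokens (quoted above)
CACHED_INPUT_PER_M = 0.26  # USD per 1M cached input tokens (reported figure)


def estimate_cost(prefix_tokens, suffix_tokens, output_tokens, calls, cached=True):
    """Estimated USD cost for `calls` requests sharing the same stable prefix."""
    variable = suffix_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M
    full_prefix = prefix_tokens * INPUT_PER_M
    if cached and calls > 0:
        # First call pays full price; later calls pay the cached rate (assumption).
        prefix_total = full_prefix + (calls - 1) * prefix_tokens * CACHED_INPUT_PER_M
    else:
        prefix_total = calls * full_prefix
    return (prefix_total + calls * variable) / 1_000_000


# Hypothetical workload: 50K-token prefix, 2K-token suffix, 1K-token output, 100 calls.
print(f"no cache:   ${estimate_cost(50_000, 2_000, 1_000, 100, cached=False):.2f}")
print(f"with cache: ${estimate_cost(50_000, 2_000, 1_000, 100, cached=True):.2f}")
```

The larger the stable prefix relative to the variable suffix, the larger the gap between the two figures.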

Important: there is no free OpenRouter tier for glm-5.2. OpenRouter is cheap, not free.

Free vs near-free comparison

Route	Upfront cost	Ongoing cost	Setup effort	Best for
Self-host with Ollama or vLLM	Hardware or rental	Electricity / GPU hours	High	Privacy, no metering, full control
z.ai trial credits	None	Free until credits end	Low	Quick tests and first evaluation
GLM Coding Plan Lite	Low monthly plan; verify current price	Flat monthly	Low	Daily coding in Claude Code, Cline, or Cursor
API + cached input	None	$1.40/$4.40 per 1M tokens; lower for cached input where supported	Low	Apps with repeated context

A practical decision flow:

Start with trial credits.
If you need daily coding, evaluate the Coding Plan.
If you need privacy or no token metering, self-host.
If you are building an app with repeated context, use the API and design around caching.

Test your GLM-5.2 endpoint with Apidog

After you choose a route, test the endpoint before wiring it into your application. This applies whether you use:

local Ollama,
a vLLM server,
the z.ai cloud API,
OpenRouter.

Apidog is useful here because you can send requests, inspect streaming responses, save test cases, and mock responses while your app is still being built.

For Ollama, point Apidog at:

http://localhost:11434/v1/chat/completions

Use this request body:

{
  "model": "glm-5.2",
  "messages": [
    {
      "role": "user",
      "content": "Return a JSON object with three API test cases."
    }
  ]
}

For z.ai, use the hosted endpoint:

https://api.z.ai/api/paas/v4/chat/completions

Add the auth header:

Authorization: Bearer YOUR_ZAI_API_KEY

Then save the request as a reusable case. If your frontend is not ready for the live model yet, mock the response and let the frontend integrate against the same response shape.
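For reference, a mocked response only needs to follow the same shape your client reads. The sketch below builds a minimal object in the OpenAI chat-completions format, which the endpoints in this guide are described as compatible with; the id and zeroed usage counts are placeholder values, and real responses may carry extra fields.

```python
import json
import time


def mock_chat_completion(content, model="glm-5.2"):
    """Return a minimal OpenAI-style chat completion for frontend or test use."""
    return {
        "id": "chatcmpl-mock-001",  # placeholder id
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": content},
                "finish_reason": "stop",
            }
        ],
        # Placeholder usage numbers; a live endpoint reports real token counts.
        "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
    }


print(json.dumps(mock_chat_completion('{"cases": []}'), indent=2))
```

Paste the printed JSON into a mock definition, then swap in the live endpoint once the frontend reads `choices[0].message.content` correctly.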

FAQ

Is GLM-5.2 actually free to use?

The weights are free under the MIT license. Self-hosting has no model licensing cost, but you still need hardware, electricity, or rented GPUs. Hosted API usage is paid after trial credits or free-tier limits.

Can I run GLM-5.2 with Ollama on a normal laptop?

Realistically, no. GLM-5.2 is a ~753B MoE model. Even quantized builds require serious memory. Ollama makes the commands simple, but the hardware requirement is still high.

For sizing and local setup patterns, see the local deep-dive.

Is there a free OpenRouter tier for GLM-5.2?

No. OpenRouter lists GLM-5.2 as pay-as-you-go at $1.40 per 1M input tokens and $4.40 per 1M output tokens. It is low-cost, not free.

What is the cheapest paid way to use GLM-5.2 for coding?

Usually the GLM Coding Plan Lite tier, but verify current pricing at z.ai. It provides flat-rate coding access and an Anthropic-compatible endpoint for tools such as Claude Code, Cline, and Cursor.

How does GLM-5.2 compare to GPT-5.5 on cost?

Per VentureBeat, GLM-5.2 beats GPT-5.5 on several long-horizon coding benchmarks at about one-sixth the cost.

Where to go next

The best route depends on your hardware and usage pattern:

Choose self-hosting if you need privacy, control, or no token metering.
Choose trial credits if you only need to evaluate the model.
Choose a coding plan if you use GLM-5.2 daily in coding tools.
Choose API + cached input if you are building an app with repeated context.
