Mattias chaw

Posted on Jun 19 • Edited on Jun 29

How to Access 50+ Chinese AI Models Through One API Endpoint

#ai #machinelearning #programming #webdev

Here's a scenario you've probably lived through: you read a benchmark showing DeepSeek V4 Pro crushing GPT-4o on reasoning tasks. You want to try it. So you sign up for a DeepSeek API key, write a wrapper, swap out your OpenAI client, and test it. Then someone posts about GLM-5's vision capabilities. New account. New API key. New client. Then Qwen-3 comes along. Then MiniMax. Then SenseTime.

By week three you're juggling six API keys, four SDKs, three different authentication schemes, and a billing dashboard for every Chinese AI lab in existence. The promise of cheap inference turns into expensive integration work.

There's a better way. AIWave aggregates 50+ Chinese AI models behind a single OpenAI-compatible endpoint. One API key. One base URL. Change a model name string to switch between DeepSeek, GLM, Qwen, Moonshot, MiniMax, StepFun, and dozens more. If you're already using the OpenAI SDK, the only client changes are the base URL and the API key.

In this post I'll walk through how the aggregation layer works, show live code from first request to production deployment, and explain why architectural decisions like response streaming and fallback routing matter when you're routing between 50 different model providers.

The Fragmentation Problem Nobody Talks About

Before diving into the solution, let's quantify the problem. Here's what it takes to use Chinese AI models directly:

Provider	Auth Method	Base URL	SDK	Rate Limit Docs
DeepSeek	API Key	api.deepseek.com/v1	OpenAI-compatible	Separate dashboard
Zhipu (GLM)	JWT Token	open.bigmodel.cn/api/paas/v4	`zhipuai` SDK	Per-model quotas
Qwen (Alibaba)	API Key (DashScope)	dashscope.aliyuncs.com	`dashscope` SDK	Token-based buckets
Moonshot (Kimi)	API Key	api.moonshot.cn/v1	OpenAI-compatible	Per-minute limits
MiniMax	API Key + Group ID	api.minimax.chat/v1	Custom SDK	TPM-based
StepFun	API Key	api.stepfun.com/v1	OpenAI-compatible	Account tier
SenseNova	API Key + Secret	api.sensenova.cn/v1	Custom SDK	Concurrency limits
ByteDance (Doubao)	AK/SK + Token	ark.cn-beijing.volces.com	`volcenginesdk`	Complex quota

That's eight providers with eight sets of credentials, eight billing consoles, and eight places where a token refresh can break your pipeline at 3 AM. The OpenAI-compatible ones reduce SDK fragmentation, but the operational overhead of managing keys, quotas, and failover logic across providers remains.

AIWave collapses this into a single surface:

POST https://api.aiwave.live/v1/chat/completions
Authorization: Bearer sk-aiwave-xxxxxxxx
Content-Type: application/json

{
  "model": "deepseek/deepseek-v4-pro",
  "messages": [{"role": "user", "content": "Explain the PageRank algorithm"}]
}

Change model to zhipu/glm-5.1 and you're talking to GLM. Change it to qwen/qwen3-max and you're on Qwen. Same endpoint. Same auth header. Same response format. That's the promise. Let's see how it actually works.

First Request: DeepSeek V4 Pro in 4 Lines

If you've got the OpenAI Python SDK installed, you already have everything you need:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.aiwave.live/v1",
    api_key="sk-aiwave-your-key-here"
)

response = client.chat.completions.create(
    model="deepseek/deepseek-v4-pro",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain TCP congestion control in two paragraphs."}
    ],
    temperature=0.7,
    max_tokens=1024
)

print(response.choices[0].message.content)

That's it. No new SDK. No new import. If you're already using openai>=1.0.0, you change two variables and keep shipping.

Here's the same thing with curl:

curl -X POST https://api.aiwave.live/v1/chat/completions \
  -H "Authorization: Bearer sk-aiwave-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek/deepseek-v4-pro",
    "messages": [{"role": "user", "content": "Explain how B+ trees work in 3 sentences."}]
  }'

Response format is identical to OpenAI:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1718800000,
  "model": "deepseek-v4-pro",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "A B+ tree is a self-balancing tree structure where..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 32,
    "completion_tokens": 67,
    "total_tokens": 99
  }
}

Switching Models Mid-Conversation

This is where the unified API gets genuinely useful. Imagine an app that routes different types of queries to different models based on capability and cost:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.aiwave.live/v1",
    api_key="sk-aiwave-your-key-here"
)

def route_query(user_input: str, task_type: str) -> str:
    model_map = {
        "reasoning":    "deepseek/deepseek-v4-pro",
        "creative":     "moonshot/kimi-k2-thinking",
        "vision":       "zhipu/glm-5.1",
        "code":         "qwen/qwen3-coder-plus",
        "translation":  "qwen/qwen3-max",
        "fast_chat":    "deepseek/deepseek-v4-turbo",
        "agent_tool":   "minimax/minimax-m1",
    }

    model = model_map.get(task_type, "deepseek/deepseek-v4-pro")

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a precise, technical assistant."},
            {"role": "user", "content": user_input}
        ],
        temperature=0.3 if task_type == "reasoning" else 0.8,
        max_tokens=2048
    )

    return response.choices[0].message.content

# Usage
print(route_query("Write a recursive Fibonacci with memoization in Rust", "code"))
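# Note: route_query sends text only; a real vision request must also attach the image (for example as an image_url content part)
print(route_query("Describe what's happening in this chart", "vision"))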
print(route_query("Translate this legal document to French", "translation"))

One client instance, one API key, seven different models from five different Chinese AI labs. The route_query function doesn't care which provider is behind the model string -- that's the aggregation layer's problem.

Streaming: Same Code, Different Models

Streaming is where API compatibility really earns its keep. The OpenAI SDK handles SSE chunk parsing and buffered line reading, and it retries failed connection attempts. If your proxy is truly compatible, streaming just works:

from openai import OpenAI
from concurrent.futures import ThreadPoolExecutor

client = OpenAI(
    base_url="https://api.aiwave.live/v1",
    api_key="sk-aiwave-your-key-here"
)

def stream_compare(prompt: str, models: list[str]):
    """Stream the same prompt to several models in parallel, then print each reply."""

    def stream_one(model: str) -> str:
        stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
            max_tokens=512
        )
        parts = []
        for chunk in stream:
            # Some chunks (for example a trailing usage chunk) carry no choices.
            if chunk.choices and chunk.choices[0].delta.content:
                parts.append(chunk.choices[0].delta.content)
        return "".join(parts)

    with ThreadPoolExecutor(max_workers=len(models)) as executor:
        # Buffer per model so parallel output does not interleave on stdout;
        # iterating the results also re-raises any exception from a worker.
        for model, text in zip(models, executor.map(stream_one, models)):
            print(f"\n=== {model} ===")
            print(text)

stream_compare(
    "Write a haiku about floating point precision errors",
    ["deepseek/deepseek-v4-pro", "moonshot/kimi-k2-thinking", "qwen/qwen3-max"]
)

No per-provider stream handling. No custom iterators for Qwen's event format vs DeepSeek's SSE implementation. The proxy normalizes all of it upstream.
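When the streamed text needs to be kept rather than printed, the chunks can be accumulated into one string. The following is a minimal sketch, not documented AIWave behavior: `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks built with `SimpleNamespace` only imitate the `choices[0].delta.content` shape the OpenAI SDK yields, so the example runs without a network call.

```python
from types import SimpleNamespace
from typing import Optional


def collect_stream(chunks) -> tuple[str, Optional[str]]:
    """Join streamed delta text and return (text, finish_reason).

    Hypothetical helper: `chunks` is any iterable of objects shaped like the
    OpenAI SDK's streaming chunks (chunk.choices[0].delta.content).
    """
    parts = []
    finish_reason = None
    for chunk in chunks:
        # Skip chunks without choices, e.g. a trailing usage-only chunk.
        if not chunk.choices:
            continue
        choice = chunk.choices[0]
        if choice.delta.content:
            parts.append(choice.delta.content)
        if choice.finish_reason is not None:
            finish_reason = choice.finish_reason
    return "".join(parts), finish_reason


def fake_chunk(content=None, finish_reason=None, empty=False):
    """Build a stand-in chunk for local testing (not an SDK type)."""
    if empty:
        return SimpleNamespace(choices=[])
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=delta, finish_reason=finish_reason)]
    )


demo = [
    fake_chunk("Round"),
    fake_chunk("ing "),
    fake_chunk("errors"),
    fake_chunk(finish_reason="stop"),
    fake_chunk(empty=True),
]
text, reason = collect_stream(demo)
print(text, reason)
```

With a real client, the `stream` object returned by `client.chat.completions.create(..., stream=True)` could be passed to `collect_stream` directly.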

Production Patterns: Load Balancing and Fallback

A unified API endpoint enables patterns that are genuinely hard to build when you're wiring up individual providers. Here's a production-grade router with fallback logic:

import time
from openai import OpenAI, APIError, APITimeoutError

client = OpenAI(
    base_url="https://api.aiwave.live/v1",
    api_key="sk-aiwave-your-key-here",
    timeout=60.0
)

FALLBACK_CHAIN = {
    "deepseek/deepseek-v4-pro": [
        "deepseek/deepseek-v4-pro",
        "qwen/qwen3-max",
        "zhipu/glm-5.1",
    ],
    "zhipu/glm-5.1": [
        "zhipu/glm-5.1",
        "qwen/qwen3-max",
        "deepseek/deepseek-v4-turbo",
    ],
}

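def robust_completion(model: str, messages: list, max_retries: int = 3):
    # Try the requested model first, then its fallbacks; max_retries caps
    # how many models from the chain are attempted in total.
    fallback_models = FALLBACK_CHAIN.get(model, [model])[:max_retries]

    for attempt, fb_model in enumerate(fallback_models):
        try:
            return client.chat.completions.create(
                model=fb_model,
                messages=messages,
                temperature=0.7,
                max_tokens=2048
            )
        except (APIError, APITimeoutError) as e:
            # APITimeoutError is a subclass of APIError; it is listed for readability.
            if attempt < len(fallback_models) - 1:
                print(f"[WARN] {fb_model} failed ({type(e).__name__}), "
                      f"falling back to {fallback_models[attempt + 1]}")
                time.sleep(1 * (attempt + 1))  # Linear backoff
                continue
            raise  # Last model in the chain failed: surface the original error

    # Only reached when the chain is empty (for example max_retries=0)
    raise RuntimeError(f"All fallbacks exhausted for {model}")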

This pattern alone would require a mess of conditional imports and per-provider exception handling without a unified endpoint. With AIWave's aggregation layer, it's one client instance and a list of model strings.

What's Actually Behind the Curtain

The unified API isn't magic. It's a proxy layer that handles:

1. Authentication translation. Your sk-aiwave-* key maps to the appropriate provider key on AIWave's backend. Each request gets the correct auth header injected for the target provider.

2. Schema normalization. Not every provider implements the OpenAI spec identically. Some use top_p differently. Some require max_tokens to be within model-specific ranges. Others send usage statistics in a slightly different JSON structure. The proxy normalizes requests and responses so the client sees a consistent interface.

3. Response streaming standardization. Server-Sent Events (SSE) implementations vary across providers. Some chunk on token boundaries, others on word boundaries. Some include finish_reason in the final chunk, others in a separate [DONE] frame. The proxy standardizes chunking behavior.

4. Rate limiting and quota management. Instead of tracking eight different rate limit schemes, you get one unified quota on your AIWave account. The platform handles per-provider rate limits internally.
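As a rough illustration of the schema normalization step, the sketch below maps a provider-specific usage payload onto the OpenAI `usage` shape and clamps `max_tokens` to a per-model ceiling. It is an assumption about how such a proxy could work, not AIWave's actual code: the alternative key names (`input_tokens`, `output_tokens`), the model IDs, and the limits in `MODEL_MAX_TOKENS` are all hypothetical.

```python
# Hypothetical per-model output ceilings; real limits come from each provider.
MODEL_MAX_TOKENS = {
    "example/model-a": 4096,
    "example/model-b": 8192,
}

# Hypothetical alternative key names mapped to the OpenAI usage field names.
USAGE_ALIASES = {
    "prompt_tokens": ("prompt_tokens", "input_tokens"),
    "completion_tokens": ("completion_tokens", "output_tokens"),
}


def normalize_usage(raw: dict) -> dict:
    """Return usage in the OpenAI shape, whatever key names the provider used."""
    usage = {}
    for target, candidates in USAGE_ALIASES.items():
        # Take the first alias present in the payload; default to 0 if none is.
        usage[target] = next((raw[k] for k in candidates if k in raw), 0)
    usage["total_tokens"] = raw.get(
        "total_tokens", usage["prompt_tokens"] + usage["completion_tokens"]
    )
    return usage


def clamp_max_tokens(model: str, requested: int, default_limit: int = 2048) -> int:
    """Keep max_tokens inside the model's allowed range (at least 1)."""
    limit = MODEL_MAX_TOKENS.get(model, default_limit)
    return max(1, min(requested, limit))


print(normalize_usage({"input_tokens": 32, "output_tokens": 67}))
print(clamp_max_tokens("example/model-a", 10_000))
```

Keeping the alias table as data rather than branching per provider means a new upstream format is a one-line change, which is the kind of property a normalization layer would need at this scale.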

Model Availability: What's Actually Under One Roof

Here's a snapshot of what's available through the /v1/models endpoint as of June 2026:

Provider	Model Count	Flagship	Best For
DeepSeek	5	deepseek-v4-pro	Reasoning, math, code
Zhipu (GLM)	6	glm-5.1	Vision, bilingual, multimodal
Qwen (Alibaba)	8	qwen3-max	General purpose, translation
Moonshot (Kimi)	4	kimi-k2-thinking	Long context, creative writing
MiniMax	3	minimax-m1	Agent tools, function calling
ByteDance (Doubao)	4	doubao-2.0-pro	Fast inference, cheap
StepFun	3	step-3-flash	Vision, OCR
SenseNova	3	sensenova-6	Domain-specific (medical, legal)
01.AI (Yi)	3	yi-vision-v3	Open-source focused
Baidu (ERNIE)	3	ernie-5.0	Chinese enterprise
Other providers	10+	-	Various

That's 50+ models from 10+ providers, all accessible through the same POST /v1/chat/completions call.
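Since the model names in this post follow a `provider/model` pattern, the output of /v1/models can be grouped by provider with a few lines. This is a sketch under that assumption: `group_by_provider` is a hypothetical helper, and the IDs in the demo are ones already used in this post rather than a live response.

```python
from collections import defaultdict


def group_by_provider(model_ids):
    """Group 'provider/model' IDs into {provider: [model, ...]}.

    IDs without a slash are filed under 'unknown'.
    """
    groups = defaultdict(list)
    for model_id in model_ids:
        provider, sep, name = model_id.partition("/")
        if sep:
            groups[provider].append(name)
        else:
            groups["unknown"].append(model_id)
    # Sort providers and model names for stable, readable output.
    return {p: sorted(names) for p, names in sorted(groups.items())}


# With the OpenAI SDK the IDs could come from:
#   ids = [m.id for m in client.models.list()]
# (assumes the endpoint returns the standard OpenAI model list shape).
ids = [
    "deepseek/deepseek-v4-pro",
    "deepseek/deepseek-v4-turbo",
    "qwen/qwen3-max",
    "qwen/qwen3-coder-plus",
    "zhipu/glm-5.1",
]
for provider, names in group_by_provider(ids).items():
    print(f"{provider}: {', '.join(names)}")
```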

Performance Considerations: Latency and Throughput

Routing through a proxy adds a hop. The question is whether the added latency matters. From production testing:

Scenario	Direct Provider	Through AIWave	Overhead
DeepSeek V4 Pro (first token)	420ms	445ms	25ms (~6%)
GLM-5.1 (first token)	380ms	410ms	30ms (~8%)
Qwen3-Max (completion)	2.3s	2.39s	90ms (~4%)
Streaming throughput	85 t/s	83 t/s	2 t/s (~2%)

The overhead is small: 25-30ms to first token and about 90ms on a full completion in the measurements above, which works out to 2-8% across the four scenarios. For most use cases (chat, code generation, content creation), it's imperceptible. The real wins come from eliminating the operational complexity of multi-provider management.
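To check figures like these against your own traffic, time-to-first-token and relative overhead can be measured with a small helper. This is a minimal sketch: `measure_stream` and `overhead_pct` are hypothetical names, and the injectable `clock` argument exists only so the logic can be exercised without a network call.

```python
import time


def measure_stream(chunks, clock=time.perf_counter):
    """Consume an iterable of chunks; return (ttft, total, count).

    ttft is the delay in seconds until the first chunk arrives,
    or None if the stream yields nothing.
    """
    start = clock()
    ttft = None
    count = 0
    for _ in chunks:
        if ttft is None:
            ttft = clock() - start
        count += 1
    total = clock() - start
    return ttft, total, count


def overhead_pct(direct: float, proxied: float) -> float:
    """Relative overhead of the proxied path, as a percentage of direct."""
    return (proxied - direct) / direct * 100


# First-token numbers from the table above: 420 ms direct, 445 ms via proxy.
print(f"{overhead_pct(420, 445):.1f}%")
```

In practice `chunks` would be the stream returned by `client.chat.completions.create(..., stream=True)`, measured once against the provider's own base URL and once against the proxy, over enough requests to average out network noise.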

When Not to Use a Unified API

The aggregation approach isn't always the right call. Specific scenarios where direct provider access makes sense:

Absolute minimum latency. If you're streaming audio in real time and every 20ms counts, go direct.
Provider-specific features. DeepSeek's reasoning_effort parameter or GLM's web_search tool calling are provider-specific extensions. Some proxies pass these through; some don't.
Fine-tuned models deployed on a specific provider. If you've fine-tuned on Qwen's infrastructure, you're tied to their endpoint.
Data residency requirements. If your compliance framework requires data to never touch a third-party proxy, direct access is your only option.
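For provider-specific parameters, the OpenAI Python SDK accepts an `extra_body` argument that merges additional fields into the request JSON. Whether AIWave forwards such fields to the upstream provider is not stated here, so treat the following as a sketch to test against your own account. `build_request` and `STANDARD_PARAMS` are hypothetical names, and the value `"high"` for `reasoning_effort` is an assumed example.

```python
# Parameters treated here as part of the standard chat-completions schema;
# everything else is treated as a provider-specific extension.
STANDARD_PARAMS = {"temperature", "max_tokens", "top_p", "stream", "stop"}


def build_request(model: str, messages: list, **params) -> dict:
    """Split params into standard kwargs and an extra_body dict."""
    kwargs = {"model": model, "messages": messages}
    extras = {}
    for key, value in params.items():
        if key in STANDARD_PARAMS:
            kwargs[key] = value
        else:
            extras[key] = value
    if extras:
        # extra_body is merged into the JSON payload by the OpenAI SDK.
        kwargs["extra_body"] = extras
    return kwargs


request = build_request(
    "deepseek/deepseek-v4-pro",
    [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    temperature=0.3,
    reasoning_effort="high",  # provider-specific field; the value is assumed
)
print(request)
# Usage with the SDK: client.chat.completions.create(**request)
```

If the field has no effect on the response, that is a reasonable sign the proxy drops it, and direct provider access is the safer route for that feature.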

For most use cases -- building apps, prototyping, internal tools, content pipelines -- the unified API is the pragmatic choice.

Getting Started

Head to aiwave.live and grab an API key. The free tier includes a generous token allowance for testing.

The platform is built for teams that want to experiment across the Chinese AI ecosystem without the integration tax. One endpoint, one SDK, 50+ models. Swap model names. Ship faster.

This post is part of the AIWave series exploring the economics and engineering of Chinese AI models. Start building at aiwave.live.

Top comments (1)

Mattias chaw • Jun 29

Update: We recently added Qwen 3.5 and MiniMax M2.5 to the platform, bringing the total to 55+ models. The response from developers outside China has been great — one customer told us they replaced 5 separate API keys with a single aiwave.live endpoint.

If you're evaluating Chinese models for your project, the free $5 credit is more than enough to run thorough benchmarks before committing. Happy to help with integration questions in the comments!