BeanBean

Posted on • Originally published at nextfuture.io.vn

The Late-April 2026 Chinese LLM Stack: Qwen 3.6, DeepSeek V4PLUS, Kimi K2.6, MiniMax M2.7, GLM-5.1 Compared

Originally published on NextFuture

If you build with LLMs in late April 2026, the most expensive mistake you can make is assuming there are only three providers worth integrating. While Western teams reflexively wire up Anthropic, OpenAI, and Google, a different stack has quietly reached parity — and in several places, surpassed the incumbents on price-to-capability. Qwen 3.6 Max-Preview, DeepSeek V4PLUS, Kimi K2.6, MiniMax M2.7, and GLM-5.1: five Chinese labs, five production-ready frontiers, all available outside the Great Firewall today. This is the practitioner's tour — what each one is good at, where it actually beats Claude Opus 4.7 or GPT-5.4, and how to wire it into a Next.js or agent stack right now.

TL;DR: who ships what

| Model | Lab | Strengths | Best for | Open weights |
| --- | --- | --- | --- | --- |
| Qwen 3.6 Max-Preview | Alibaba | Agentic coding, tool use, multilingual | Coding agents, RFQ automation | 3.6-35B-A3B + 3.6 Plus |
| DeepSeek V4PLUS / V4 Pro / V4 Flash | DeepSeek | 1M-token context, cheap reasoning, MoE | Long-document agents, RAG, cost-sensitive workloads | Yes — V4 base |
| Kimi K2.6 | Moonshot | SWE-Bench Pro 58.6%, agent swarms | Open-source coding agents, autonomous loops | Yes |
| MiniMax M2.7 (+ MiniMax CLI) | MiniMax | Native multimodal, voice, agent CLI | Voice-first apps, multimodal agents | No (API + CLI) |
| GLM-5.1 (+ GLM 5V Turbo) | Z.ai (Zhipu) | 754B MoE, MIT license, beats GPT-5.4 / Opus 4.6 on SWE-Bench Pro | Self-hosted coding, vision tasks | Yes — MIT license |

Why Western builders should care in 2026

Three forces flipped the calculus this year.

  • Capability convergence — and overshoot. Z.ai's GLM-5.1 (April 8) is a 754B-parameter MoE under MIT license that, by the lab's own benchmarks, outperforms GPT-5.4 and Claude Opus 4.6 on SWE-Bench Pro. DeepSeek V4PLUS (April 27) closed the frontier gap on cost. Kimi K2.6 hit 58.6% on SWE-Bench Pro. Qwen 3.6 Max-Preview is Alibaba's agentic-coding flagship, with the open Qwen3.6-35B-A3B variant independently observed to draw better SVG pelicans than Claude Opus 4.7 — a proxy for structured-output discipline that translates to real production work.

  • Price floor collapse. DeepSeek V4 Flash and Qwen 3.6 inference run 5–30× cheaper per million tokens than Anthropic or OpenAI flagship pricing. For workloads where you call the model 50M+ times a month, that is the entire budget conversation.

  • Open weights with permissive licenses. Qwen, DeepSeek, Kimi, and now GLM-5.1 ship open-source variants. GLM-5.1 specifically lands under MIT — meaning fine-tune, self-host, redistribute, no questions asked.

None of this requires a VPN. All five are live on AI gateways (Vercel AI Gateway, OpenRouter, Helicone) and most expose OpenAI-compatible endpoints.

Qwen 3.6 Max-Preview — the agentic coding flagship

Alibaba's flagship preview dropped on April 28 and is positioned squarely at agentic coding. Sibling models in the 3.6 family include the cheaper Qwen 3.6 Plus (April 2, "towards real-world agents") and the open-weight Qwen3.6-35B-A3B mixture-of-experts variant — small enough to run on a single GPU, capable enough to outdraw Claude Opus 4.7 on the canonical "draw a pelican" benchmark. Tool use is robust and the model handles function calls with the kind of consistency that lets you stop writing JSON-repair fallbacks. If you are building an autonomous loop — refactor agent, RFQ automator, support triage — Qwen 3.6 is the one to start with.

Access: DashScope, Vercel AI Gateway, OpenRouter. OpenAI-SDK compatible. See our 15-minute Qwen 3.6 Max + Next.js streaming setup for a working starter.
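
Because the endpoint speaks the OpenAI wire format, a tool-calling request is a few lines of standard SDK code. A minimal sketch, assuming DashScope's compatible-mode base URL and a `qwen3.6-max-preview` model ID (confirm both against the current docs); the `apply_patch` tool is a hypothetical stand-in for whatever your agent loop actually executes:

```ts
// Minimal sketch: Qwen 3.6 Max-Preview through an OpenAI-compatible endpoint.
// The baseURL and model ID are illustrative placeholders; verify them against
// DashScope's current documentation before shipping.
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.DASHSCOPE_API_KEY,
  baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1", // assumed
});

const completion = await client.chat.completions.create({
  model: "qwen3.6-max-preview", // assumed model ID
  messages: [
    { role: "user", content: "Refactor src/utils/date.ts to remove the moment dependency." },
  ],
  tools: [
    {
      type: "function",
      function: {
        name: "apply_patch", // hypothetical tool your agent loop implements
        description: "Apply a unified diff to the working tree",
        parameters: {
          type: "object",
          properties: { diff: { type: "string" } },
          required: ["diff"],
        },
      },
    },
  ],
});

// If the model chose to call the tool, you get structured tool_calls back.
console.log(completion.choices[0].message.tool_calls ?? completion.choices[0].message.content);
```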

DeepSeek V4PLUS — the 1M-context cost leader

DeepSeek's April rollout went V4 → V4 Pro → V4PLUS in three weeks, and the headline numbers held: one million tokens of usable context, plus mixture-of-experts efficiency that pushes the price per million tokens into low single-digit dollars. Independent integrators describe V4 as "a million-token context that agents can actually use" — meaning it doesn't degrade past 200K the way many long-context models do. The current tiers: V4PLUS for top-end reasoning, V4 Pro for the everyday workload, V4 Flash for high-throughput cheap calls.

If you have any RAG pipeline that struggles to chunk well, or any agent that benefits from dumping the entire repo into context, this is the obvious model to test. Read our 15-minute DeepSeek V4PLUS + OpenAI SDK guide and the head-to-head versus Claude Opus 4.7 for the unflinching comparison.
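
To see what a million-token window changes in practice, skip the chunker and pass whole files. A minimal sketch, assuming DeepSeek's OpenAI-compatible endpoint and a hypothetical `deepseek-v4-plus` model ID (check the model card for real values and the context ceiling):

```ts
// Minimal sketch: long-context Q&A over whole source files instead of
// chunked RAG snippets. Base URL and model ID are assumptions.
import OpenAI from "openai";
import { readFile } from "node:fs/promises";

const deepseek = new OpenAI({
  apiKey: process.env.DEEPSEEK_API_KEY,
  baseURL: "https://api.deepseek.com/v1", // assumed OpenAI-compatible endpoint
});

const files = ["src/router.ts", "src/db.ts", "src/auth.ts"]; // example paths
const context = (
  await Promise.all(files.map(async (f) => `// ${f}\n${await readFile(f, "utf8")}`))
).join("\n\n");

const res = await deepseek.chat.completions.create({
  model: "deepseek-v4-plus", // assumed model ID
  messages: [
    { role: "system", content: "Answer using only the provided source files." },
    { role: "user", content: `${context}\n\nWhere is the session token validated?` },
  ],
});

console.log(res.choices[0].message.content);
```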

Kimi K2.6 — open-source SOTA for coding

Moonshot AI's Kimi K2.6 hit 58.6% on SWE-Bench Pro on April 21, briefly leading the open-source coding benchmark. Combined with Kimi's traditional strength in long-horizon planning, it slots in as the open-source model to beat for autonomous coding loops and agent swarms. If your team has any reason to self-host (compliance, latency, fine-tuning), K2.6 is the default starting point — though GLM-5.1 (below) is now the bigger open-weight flex on raw benchmark numbers.

The other Kimi superpower is context length — multi-million-token windows that genuinely retrieve, not just accept input. We unpack when Kimi K2.6's long context actually beats Gemini in a separate piece.
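
"Genuinely retrieve" is testable. Here is a minimal needle-in-a-haystack probe, assuming an OpenAI-compatible Moonshot endpoint and a hypothetical `kimi-k2.6` model ID: plant one fact deep inside filler text and check that it comes back.

```ts
// Minimal sketch: needle-in-a-haystack retrieval probe for long context.
// Endpoint and model ID are placeholders; use Moonshot's published values.
import OpenAI from "openai";

const kimi = new OpenAI({
  apiKey: process.env.MOONSHOT_API_KEY,
  baseURL: "https://api.moonshot.ai/v1", // assumed
});

// Build ~2M characters of filler and bury one sentinel fact in the middle.
const filler = "The sky was grey and the meeting ran long. ".repeat(50_000);
const needle = "The deploy password is HORSE-BATTERY-42.";
const haystack =
  filler.slice(0, filler.length / 2) + needle + filler.slice(filler.length / 2);

const res = await kimi.chat.completions.create({
  model: "kimi-k2.6", // assumed model ID
  messages: [{ role: "user", content: `${haystack}\n\nWhat is the deploy password?` }],
});

// A model that "genuinely retrieves" answers correctly regardless of depth.
console.log(res.choices[0].message.content?.includes("HORSE-BATTERY-42")); // expect true
```

Vary the needle's position and repeat; a flat success curve across depths is the property you are paying for.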

MiniMax M2.7 — voice and multimodal as a first-class primitive

MiniMax took a different bet. While the others scaled coding and reasoning, M2.7 (March 18) leaned into native multimodal — voice in, voice out, image, and video, without a separate TTS hop. The April 11 release of MiniMax CLI goes further: it gives AI agents native multimodal capabilities at the command line. For real-time interactive apps (tutors, customer-support bots, accessibility front-ends), latency from the unified pipeline is meaningfully lower than chaining ASR → LLM → TTS. M2.7 is API-only and not the cheapest option, but it is the most complete multimodal stack of the five.

GLM-5.1 — the 754B MIT-license bomb

Zhipu (now branded Z.ai) was the quietest of the five — until April 8, when GLM-5.1 dropped: 754 billion parameters, mixture-of-experts, and an MIT license. Z.ai's own benchmarks claim it outperforms GPT-5.4 and Claude Opus 4.6 on SWE-Bench Pro. Even discounting the marketing-flavored numbers, it is the most permissively licensed frontier-scale model anyone has shipped. There is also GLM 5V Turbo (April 1), the vision-tuned sibling, live on Vercel AI Gateway. For self-hosting, fine-tuning on proprietary data, or shipping on-prem to enterprises that veto API egress, GLM-5.1 is the new default starting point.
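
Self-hosting keeps the same client shape: serve the weights behind any OpenAI-compatible runtime (vLLM-style servers expose one on `/v1`), then point the SDK at localhost. A sketch; the `zai/glm-5.1` model name is a placeholder for whatever name the server was launched with:

```ts
// Minimal sketch: self-hosted GLM-5.1 behind an OpenAI-compatible server
// (vLLM-style, assumed to listen on :8000). "zai/glm-5.1" is a placeholder.
import OpenAI from "openai";

const local = new OpenAI({
  apiKey: "not-needed-for-local", // most local servers ignore the key
  baseURL: "http://localhost:8000/v1",
});

const res = await local.chat.completions.create({
  model: "zai/glm-5.1", // must match the model name the server was started with
  messages: [
    { role: "user", content: "Write a SQL migration adding a soft-delete column to users." },
  ],
});

console.log(res.choices[0].message.content);
```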

Pricing snapshot (USD per 1M tokens, late April 2026)

| Model | Input | Output | Notes |
| --- | --- | --- | --- |
| DeepSeek V4 Flash | $0.14 | $0.28 | Cache hits ~$0.014 |
| DeepSeek V4PLUS | $0.55 | $2.20 | Reasoning tier |
| Qwen 3.6 Max-Preview | $0.40 | $1.20 | Tier discounts; 35B-A3B free to self-host |
| Kimi K2.6 (API) | $0.50 | $1.50 | Self-host = free |
| GLM-5.1 | $0.60 | $2.00 | MIT license — self-host |
| MiniMax M2.7 | $1.00 | $3.00 | Voice/video extra |
| Claude Opus 4.7 (ref) | $15.00 | $75.00 | Western frontier reasoning |
| GPT-5.4 (ref) | $5.00 | $15.00 | Western frontier |

Numbers move fast — confirm at the model card before committing to a year-long contract. The pattern, however, is durable: Chinese-stack inference runs roughly 5–25× cheaper at frontier capability tiers.
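
To make the gap concrete, here are the table's numbers run through a back-of-envelope workload (the request volume and per-request token counts are illustrative, not benchmarks):

```ts
// Monthly cost from the pricing table, for an assumed workload of
// 10M requests at ~3K input / 1K output tokens each.
const monthlyCost = (inPerM: number, outPerM: number) => {
  const inputTokensM = (10_000_000 * 3_000) / 1_000_000;  // 30,000 M input tokens
  const outputTokensM = (10_000_000 * 1_000) / 1_000_000; // 10,000 M output tokens
  return inputTokensM * inPerM + outputTokensM * outPerM;
};

console.log(monthlyCost(0.55, 2.2)); // DeepSeek V4PLUS: $38,500
console.log(monthlyCost(5, 15));     // GPT-5.4 (ref): $300,000
```

At that volume, V4PLUS lands around $38.5K/month against roughly $300K for GPT-5.4, an ~8× gap, squarely inside the 5–25× range above.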

How to access without ever touching a VPN

You do not need a Chinese cloud account. Three clean paths:

  • Vercel AI Gateway: All five models are exposed under unified routing — including GLM 5V Turbo, Kimi K2.6, Qwen 3.6 Plus, MiniMax M2.7, DeepSeek V4. One env var swap from Anthropic to DeepSeek; same SDK code path (see the sketch after this list).

  • OpenRouter / Helicone / Portkey: Same idea, different vendor. Useful if you already use these for observability.

  • Direct API: Each provider exposes an OpenAI-compatible endpoint. Point the OpenAI SDK's baseURL at the provider and you're done; the rest of your code path stays untouched.
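
Here is what the gateway path looks like in practice: the model string is the only thing that changes between providers. A sketch assuming the Vercel AI SDK (v5+) routing through AI Gateway; the model slugs below are placeholders, so copy the exact slugs from the gateway's model list:

```ts
// Minimal sketch: provider swap via a gateway model string.
// Assumes AI SDK 5+ with Vercel AI Gateway; slugs are placeholders.
import { generateText } from "ai";

// Flip providers with one env var; no other code changes.
const model = process.env.LLM_MODEL ?? "deepseek/deepseek-v4-flash";
// e.g. LLM_MODEL="anthropic/claude-opus-4.7" to route back to Anthropic

const { text } = await generateText({
  model, // the AI SDK resolves "provider/model" strings through the gateway
  prompt: "Summarize this RFQ in three bullet points: ...",
});

console.log(text);
```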

Pick guide

  • Self-hosted frontier on a permissive license: GLM-5.1 (MIT) or Qwen3.6-35B-A3B.

  • Coding agent, API budget: Kimi K2.6 or Qwen 3.6 Max-Preview.

  • RAG over giant docs: DeepSeek V4PLUS (1M context, cheap output).

  • Voice / multimodal product: MiniMax M2.7 + MiniMax CLI.

  • Drop-in cheap replacement for GPT-5.4: DeepSeek V4 Flash.

Bottom line

The 2026 LLM stack is not Anthropic-vs-OpenAI-vs-Google anymore. It is two regional pools — Western frontier and Chinese frontier — with overlapping capability and a 5–25× price gap. With GLM-5.1 dropping a 754B MoE under MIT license and DeepSeek V4PLUS pushing 1M-token context into pennies-per-call territory, the second pool is no longer a curiosity for cost-sensitive teams. Start with one of the five above, wire it through an AI gateway, run a real workload for a week, and let the data tell you what to adopt.


This article was originally published on NextFuture. Follow us for more fullstack & AI engineering content.
