Coinbase Cut AI Costs by 50% Switching to GLM 5.2 and Kimi K2.7 — Here's How You Can Too
Coinbase just made Chinese open-weight models the default for all engineers. The result? Their AI bill was slashed in half while token usage kept growing exponentially. Here's what happened, why it matters, and how you can replicate their setup today.
The Move That Shook Silicon Valley
Last weekend, Coinbase CEO Brian Armstrong dropped a bombshell on X: the company had quietly switched every engineer's default LLM from Anthropic and OpenAI's frontier models to Zhipu GLM 5.2 and Moonshot Kimi K2.7 — two Chinese open-weight models. The result? AI spending cut by nearly 50%, despite token consumption continuing to grow exponentially.
The best part? Armstrong didn't frame this as a cost-cutting sacrifice. He revealed that 91% of Coinbase engineers never hit their original usage caps — meaning most daily engineering tasks simply didn't need GPT-5 or Claude Opus. The move wasn't about limiting AI usage; it was about stopping the waste.
"Using frontier models for execution-level tasks is overkill," Armstrong wrote. "Any company can replicate this."
The Three-Layer Strategy
Coinbase's approach wasn't just "swap models, save money." They built a sophisticated system:
- Smart Routing: An internal LLM gateway automatically routes simple tasks (code review, doc summarization, data cleaning) to GLM/Kimi, while complex multi-step agent tasks still hit frontier models.
- Aggressive Caching: They boosted cache hit rates from 5% to 60% by making all requests cache-aware — no more re-computing the same answers.
- Lean Contexts: Engineers start fresh sessions for new tasks, avoiding the trap of carrying 30K tokens of history just to ask a one-line question.
The philosophy is simple: default to cheap-and-capable, escalate to expensive only when needed.
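To make the caching idea concrete, here's a minimal sketch of a cache-aware wrapper. Coinbase hasn't published how its gateway caches requests, so treat all of this as an assumption: `CachedClient`, `call_model`, `fake_backend`, and the exact-match keying scheme are hypothetical names chosen for illustration. The sketch caches whole responses keyed on a hash of the model and messages, and tracks the hit rate, the metric quoted above.

```python
import hashlib
import json
from typing import Callable

Messages = list[dict[str, str]]


class CachedClient:
    """Exact-match response cache in front of any chat-completion callable."""

    def __init__(self, call_model: Callable[[str, Messages], str]) -> None:
        # Hypothetical backend with the shape (model, messages) -> answer text.
        self._call_model = call_model
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, messages: Messages) -> str:
        # Canonical JSON so key order inside a message does not change the hash.
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def chat(self, model: str, messages: Messages) -> str:
        key = self._key(model, messages)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = self._call_model(model, messages)
        self._store[key] = answer
        return answer

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_backend(model: str, messages: Messages) -> str:
    # Stand-in for a real API call so the sketch runs offline.
    return f"{model} answered: {messages[-1]['content']}"
```

An exact-match cache like this only helps when identical requests repeat, so it suits deterministic, low-temperature tasks. Provider-side prompt caching, which reuses a shared prompt prefix, is a different mechanism and may well be what Coinbase is measuring.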
The Models: What Makes GLM 5.2 and Kimi K2.7 Worth the Switch
GLM 5.2 (Zhipu AI) — released June 12, 2026 under the MIT license:
- 744B parameters, MoE architecture (only 40B active per token)
- #1 open-weight model on Artificial Analysis rankings
- Beat GPT-5.5 on SWE-bench Pro (62.1 vs 58.6), nearly tied Opus 4.8 on FrontierSWE
- Pricing: $1.40/M input, $4.40/M output, versus $5/$25 for Opus 4.8 (about 3.6x cheaper on input and 5.7x cheaper on output)
Kimi K2.7 Code (Moonshot AI) — also released June 12:
- 128K native context window, built for long-document processing
- Used by Cursor (acquired by Elon Musk for $60B); their ARR doubled to $200M+ after switching
- Excels at code review, script generation, and smart contract validation
- Moonshot's valuation soared from $4.3B to $20B in six months
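To see what the price gap means for a real bill, here's a small worked calculation using only the GLM 5.2 and Opus 4.8 per-million-token prices quoted above. The workload (10M input tokens and 2M output tokens) is an arbitrary assumption for illustration, and `job_cost` is a hypothetical helper, not part of any SDK.

```python
# USD per million tokens (input, output), as quoted in this article.
PRICES = {
    "glm-5.2": (1.40, 4.40),
    "opus-4.8": (5.00, 25.00),
}


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a workload at the listed per-million-token prices."""
    input_rate, output_rate = PRICES[model]
    return input_tokens / 1e6 * input_rate + output_tokens / 1e6 * output_rate


# Assumed workload: 10M input tokens and 2M output tokens.
glm = job_cost("glm-5.2", 10_000_000, 2_000_000)
opus = job_cost("opus-4.8", 10_000_000, 2_000_000)

print(f"GLM 5.2:  ${glm:.2f}")    # GLM 5.2:  $22.80
print(f"Opus 4.8: ${opus:.2f}")   # Opus 4.8: $100.00
print(f"Ratio:    {opus / glm:.1f}x")  # Ratio:    4.4x
```

The effective ratio depends on your input/output mix: output-heavy workloads land closer to the 5.7x end of the range, input-heavy ones closer to 3.6x.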
Not Just Coinbase — A Tidal Shift
This isn't an isolated case. The trend is accelerating:
| Company | What They Did | Result |
|---|---|---|
| Cloudflare | Deployed Kimi K2.5 for internal security agents | 77% cost reduction, 7B tokens/day |
| Airbnb | Switched customer service from GPT to Qwen | Significant cost savings |
| Lindy | Migrated from Claude to DeepSeek V4 | Not stated; the switch came as AI costs were exceeding payroll |
| Snowflake | Testing GLM 5.2 as Claude alternative | "Comparable performance at a fraction of the price" |
On OpenRouter, Chinese models now account for 40%+ of all token traffic — up from less than 2% just a year ago. Qwen has surpassed Llama as the most downloaded open-weight model family on Hugging Face.
How to Replicate Coinbase's Setup
Coinbase self-hosts the open-weight models on their own servers, which is great for enterprise compliance but requires serious GPU infrastructure. For most teams, the fastest path is an API aggregation gateway that gives you one endpoint to access all these models.
This is where MoToken AI comes in. It's a unified API aggregation service that provides OpenAI-compatible access to GLM, Kimi, DeepSeek, Qwen, and other Chinese models through a single API key. No need to manage separate accounts, keys, or SDKs across multiple providers.
Quick Start: cURL
curl -s https://api.motoken.top/v1/chat/completions \
  -H "Authorization: Bearer YOUR_MOTOKEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.2",
    "messages": [
      {"role": "system", "content": "You are a senior software engineer."},
      {"role": "user", "content": "Review this code for SQL injection vulnerabilities and suggest fixes."}
    ],
    "temperature": 0.3,
    "max_tokens": 2048
  }'
Python: Build Your Own Smart Router
Here's a minimal sketch of a Coinbase-style tiered routing strategy using MoToken's unified API. Unlike an automatic gateway, the caller picks the tier explicitly through task_type:
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("MOTOKEN_API_KEY"),
    base_url="https://api.motoken.top/v1",
)

# Tiered model routing, Coinbase-style
MODEL_TIERS = {
    "default": "glm-5.2",            # Daily tasks: code review, docs, data cleaning
    "code": "kimi-k2.7-code",        # Code generation & review
    "complex": "claude-sonnet-4-6",  # Multi-step reasoning, architecture (assumes your gateway serves it)
}


def smart_chat(prompt: str, task_type: str = "default") -> str:
    """Route to the right model based on task complexity."""
    # Unknown task types fall back to the cheap default tier.
    model = MODEL_TIERS.get(task_type, MODEL_TIERS["default"])
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful engineering assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.3,
        max_tokens=2048,
    )
    usage = response.usage
    cost_estimate = estimate_cost(model, usage.prompt_tokens, usage.completion_tokens)
    print(f"[{model}] {usage.total_tokens} tokens | ~${cost_estimate:.4f}")
    return response.choices[0].message.content


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Rough cost estimate in USD, from per-million-token rates."""
    # (input rate, output rate) in USD per million tokens
    rates = {
        "glm-5.2": (1.40, 4.40),
        "kimi-k2.7-code": (1.50, 5.00),
        "claude-sonnet-4-6": (3.00, 15.00),
    }
    # Models missing from the table use a placeholder rate.
    input_rate, output_rate = rates.get(model, (2.00, 8.00))
    return (prompt_tokens / 1_000_000) * input_rate + (completion_tokens / 1_000_000) * output_rate


# Usage
result = smart_chat("Explain the OWASP Top 10 for 2026", task_type="default")
print(result[:200] + "...")
With this setup, you can swap the model by changing a single line and switch between GLM, Kimi, DeepSeek, or any other supported model, all through the same API key and endpoint.
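The router above leaves the choice of task_type to the caller. One way to automate the "default to cheap, escalate when needed" rule is a small heuristic classifier. This is only a sketch under stated assumptions: `pick_tier`, the keyword lists, and the 4,000-character threshold are all hypothetical, and none of it is something Coinbase has described. A production gateway would more likely rely on a trained classifier or explicit task metadata.

```python
# Hypothetical signals; tune these for your own workload.
COMPLEX_HINTS = ("architecture", "design doc", "multi-step", "migrate", "trade-off")
CODE_HINTS = ("def ", "class ", "function", "stack trace", "refactor", "unit test")
LONG_PROMPT_CHARS = 4_000  # assumed cutoff for "this probably needs a stronger model"


def pick_tier(prompt: str) -> str:
    """Return 'complex', 'code', or 'default' for a prompt."""
    text = prompt.lower()
    # Escalate first: very long prompts or planning-style keywords.
    if len(prompt) > LONG_PROMPT_CHARS or any(hint in text for hint in COMPLEX_HINTS):
        return "complex"
    # Code-flavored prompts go to the code-specialized model.
    if any(hint in text for hint in CODE_HINTS):
        return "code"
    # Everything else stays on the cheap default tier.
    return "default"
```

The result can be passed straight through, for example `smart_chat(prompt, task_type=pick_tier(prompt))`. Checking for escalation first means an ambiguous prompt gets the stronger model rather than a cheap wrong answer. That ordering is a design choice you may want to flip if cost matters more than quality.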
The Bottom Line
Coinbase's move isn't about "Chinese AI crushing Western AI" — that's a headline, not the real story. The real story is about smart tiering: most engineering tasks don't need a $25/M-token model, and the price gap between open-weight Chinese models and Western closed-source models has become impossible to ignore.
As Armstrong put it: "The goal isn't to make engineers use less AI. It's to let them use as much as they want — without burning money."
That's a philosophy every team can adopt today. Whether you self-host like Coinbase or use an aggregation gateway like MoToken, the tools are ready. The only question is: are you still overpaying for everyday tasks?
What's your team's default model? Have you experimented with GLM, Kimi, or other Chinese open-weight models? Drop a comment below — I'd love to hear your experience.