Motoken

Posted on Jun 30 • Originally published at api.motoken.top

Coinbase Cut AI Costs by 50% Switching to GLM 5.2 and Kimi K2.7 — Here's How You Can Too

#webdev #programming #opensource #ai

Coinbase just made Chinese open-weight models the default for all of its engineers. The result? Its AI bill fell by nearly half while token usage kept growing exponentially. Here's what happened, why it matters, and how you can replicate the setup today.

The Move That Shook Silicon Valley

Last weekend, Coinbase CEO Brian Armstrong dropped a bombshell on X: the company had quietly switched every engineer's default LLM from Anthropic's and OpenAI's frontier models to two Chinese open-weight models, Zhipu GLM 5.2 and Moonshot Kimi K2.7. The result? AI spending fell by nearly 50%, even as token consumption kept growing exponentially.

The best part? Armstrong didn't frame this as a cost-cutting sacrifice. He revealed that 91% of Coinbase engineers never hit their original usage caps — meaning most daily engineering tasks simply didn't need GPT-5 or Claude Opus. The move wasn't about limiting AI usage; it was about stopping the waste.

"Using frontier models for execution-level tasks is overkill," Armstrong wrote. "Any company can replicate this."

The Three-Layer Strategy

Coinbase's approach wasn't just "swap models, save money." They built a sophisticated system:

Smart Routing: An internal LLM gateway automatically routes simple tasks (code review, doc summarization, data cleaning) to GLM/Kimi, while complex multi-step agent tasks still hit frontier models.
Aggressive Caching: They boosted cache hit rates from 5% to 60% by making all requests cache-aware — no more re-computing the same answers.
Lean Contexts: Engineers start fresh sessions for new tasks, avoiding the trap of carrying 30K tokens of history just to ask a one-line question.
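The last two items are easy to sketch. The details of Coinbase's gateway aren't covered here, so treat this as one plausible reading rather than their implementation: keep the stable part of a prompt first so a provider-side prefix cache has something to match, and trim old history to a token budget before sending. Every function name below is hypothetical, and the 4-characters-per-token estimate is a crude stand-in for a real tokenizer.

```python
import hashlib
import json


def build_messages(system_prompt: str, shared_context: str, question: str) -> list[dict]:
    """Put stable content first and the per-request question last."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": shared_context},
        {"role": "user", "content": question},
    ]


def prefix_key(messages: list[dict]) -> str:
    """Hash every message except the last; equal keys mean a shared prefix."""
    prefix = json.dumps(messages[:-1], sort_keys=True)
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def rough_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); real tokenizers differ."""
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep system messages plus the newest messages that fit in `budget` tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(rough_tokens(m["content"]) for m in system)
    kept = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = rough_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return system + list(reversed(kept))
```

Two requests built with the same system prompt and shared context produce the same `prefix_key`, which is the property a prefix cache relies on. `trim_history` is the blunt version of "start a fresh session": anything older than the budget allows simply isn't sent.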

The philosophy is simple: default to cheap-and-capable, escalate to expensive only when needed.
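Here's one way to read "escalate only when needed" in code. It's a minimal sketch, not Coinbase's gateway: `answer_with_escalation`, `call_model`, and `is_good_enough` are hypothetical names, and the quality check is whatever your team trusts (a schema validator, a test run, a length check).

```python
from typing import Callable, Sequence


def answer_with_escalation(
    prompt: str,
    tiers: Sequence[str],
    call_model: Callable[[str, str], str],
    is_good_enough: Callable[[str], bool],
) -> tuple[str, str]:
    """Try models cheapest-first; return (model, answer) for the first acceptable answer.

    If no tier passes the check, the last tier's answer is returned as-is.
    """
    if not tiers:
        raise ValueError("tiers must contain at least one model")
    model, answer = tiers[0], ""
    for model in tiers:
        answer = call_model(model, prompt)  # call_model(model, prompt) -> text
        if is_good_enough(answer):
            break
    return model, answer
```

Passing `call_model` in as a function keeps the escalation logic separate from any particular SDK, so it can be tested with a stub before it touches a real endpoint.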

The Models: What Makes GLM 5.2 and Kimi K2.7 Worth the Switch

GLM 5.2 (Zhipu AI) — released June 12, 2026 under the MIT license:

744B parameters, MoE architecture (only 40B active per token)
#1 open-weight model on Artificial Analysis rankings
Beat GPT-5.5 on SWE-bench Pro (62.1 vs 58.6), nearly tied Opus 4.8 on FrontierSWE
Pricing: $1.40/M input, $4.40/M output, versus $5/$25 for Opus 4.8. That's roughly 3.6x cheaper on input and 5.7x cheaper on output.
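To see how far those list prices go in practice, here's a small sketch that prices the same workload on both models. The rates are the ones quoted above; the dictionary keys and the 3M-input/1M-output workload are made up for illustration, and a real bill also depends on caching discounts and any gateway markup.

```python
# USD per million tokens as (input, output), taken from the figures quoted above.
# The dictionary keys are illustrative labels, not official model IDs.
PRICES = {
    "glm-5.2": (1.40, 4.40),
    "opus-4.8": (5.00, 25.00),
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a workload in USD at the listed per-million-token prices."""
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def savings_ratio(input_tokens: int, output_tokens: int) -> float:
    """How many times cheaper GLM 5.2 is than Opus 4.8 for this token mix."""
    return cost_usd("opus-4.8", input_tokens, output_tokens) / cost_usd(
        "glm-5.2", input_tokens, output_tokens
    )


# Hypothetical workload: 3M input tokens, 1M output tokens.
print(f"GLM 5.2:  ${cost_usd('glm-5.2', 3_000_000, 1_000_000):.2f}")
print(f"Opus 4.8: ${cost_usd('opus-4.8', 3_000_000, 1_000_000):.2f}")
print(f"Ratio:    {savings_ratio(3_000_000, 1_000_000):.2f}x")
```

For that mix the sketch prints about $8.60 versus $40.00, a ratio near 4.65x. The gap widens toward 5.7x as output tokens dominate and narrows toward 3.6x for input-heavy work, so the savings you see depend on what your traffic looks like.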

Kimi K2.7 Code (Moonshot AI) — also released June 12:

128K native context window, built for long-document processing
Used by Cursor (acquired by Elon Musk for $60B); their ARR doubled to $200M+ after switching
Excels at code review, script generation, and smart contract validation
Moonshot's valuation soared from $4.3B to $20B in six months

Not Just Coinbase — A Tidal Shift

This isn't an isolated case. The trend is accelerating:

| Company | What They Did | Result |
| --- | --- | --- |
| Cloudflare | Deployed Kimi K2.5 for internal security agents | 77% cost reduction, 7B tokens/day |
| Airbnb | Switched customer service from GPT to Qwen | Significant cost savings |
| Lindy | Migrated from Claude to DeepSeek V4 | AI costs were exceeding payroll |
| Snowflake | Testing GLM 5.2 as Claude alternative | "Comparable performance at a fraction of the price" |

On OpenRouter, Chinese models now account for 40%+ of all token traffic — up from less than 2% just a year ago. Qwen has surpassed Llama as the most downloaded open-weight model family on Hugging Face.

How to Replicate Coinbase's Setup

Coinbase self-hosts the open-weight models on its own servers, which helps with enterprise compliance but requires serious GPU infrastructure. For most teams, the fastest path is an API aggregation gateway that exposes all of these models behind one endpoint.

This is where MoToken AI comes in. It's a unified API aggregation service that provides OpenAI-compatible access to GLM, Kimi, DeepSeek, Qwen, and other Chinese models through a single API key. No need to manage separate accounts, keys, or SDKs across multiple providers.

Quick Start: cURL

curl -s https://api.motoken.top/v1/chat/completions \
  -H "Authorization: Bearer YOUR_MOTOKEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.2",
    "messages": [
      {"role": "system", "content": "You are a senior software engineer."},
      {"role": "user", "content": "Review this code for SQL injection vulnerabilities and suggest fixes."}
    ],
    "temperature": 0.3,
    "max_tokens": 2048
  }'

Python: Build Your Own Smart Router

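Here's a minimal sketch of Coinbase-style tiered routing using MoToken's unified API. Unlike Coinbase's gateway, which routes automatically, this version leaves the choice to the caller: you pick the tier with the task_type argument.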

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.getenv("MOTOKEN_API_KEY"),
    base_url="https://api.motoken.top/v1"
)

# Tiered model routing — Coinbase-style
MODEL_TIERS = {
    "default": "glm-5.2",           # Daily tasks: code review, docs, data cleaning
    "code": "kimi-k2.7-code",       # Code generation & review
    "complex": "claude-sonnet-4-6", # Multi-step reasoning, architecture
}

def smart_chat(prompt: str, task_type: str = "default") -> str:
    """Route to the right model based on task complexity."""
    model = MODEL_TIERS.get(task_type, MODEL_TIERS["default"])

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful engineering assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3,
        max_tokens=2048
    )

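    # Some OpenAI-compatible gateways omit usage; only log cost when it is present.
    usage = response.usage
    if usage is not None:
        cost_estimate = estimate_cost(model, usage.prompt_tokens, usage.completion_tokens)
        print(f"[{model}] {usage.total_tokens} tokens | ~${cost_estimate:.4f}")

    # content can be None (for example on tool calls), so fall back to an empty string.
    return response.choices[0].message.content or ""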

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the USD cost of one call from per-million-token rates."""
    # (input, output) prices in USD per million tokens.
    rates = {
        "glm-5.2": (1.40, 4.40),
        "kimi-k2.7-code": (1.50, 5.00),
        "claude-sonnet-4-6": (3.00, 15.00),
    }
    # Models missing from the table fall back to a placeholder rate.
    input_rate, output_rate = rates.get(model, (2.00, 8.00))
    return (prompt_tokens / 1_000_000) * input_rate + (completion_tokens / 1_000_000) * output_rate

# Usage
result = smart_chat("Explain the OWASP Top 10 for 2026", task_type="default")
print(result[:200] + "...")

With this setup, you can swap the model by changing a single line and switch between GLM, Kimi, DeepSeek, or any other supported model, all through the same API key and endpoint.
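If you'd rather not pass the task type by hand, a small classifier can pick the tier for you. This is a rough sketch under loud assumptions: `classify_task` and both keyword lists are hypothetical, keyword matching is a blunt instrument, and a production gateway would more likely use a cheap model or request metadata to decide.

```python
import re

# Hypothetical keyword heuristics; tune them against your own traffic.
COMPLEX_HINTS = re.compile(
    r"\b(architecture|design|migrate|multi-step|plan|refactor)\b", re.IGNORECASE
)
CODE_HINTS = re.compile(
    r"\b(code|function|bug|script|unit test|pull request|diff)\b", re.IGNORECASE
)


def classify_task(prompt: str) -> str:
    """Return a tier key: 'complex', 'code', or 'default'."""
    # Check the expensive tier first so "design this function" escalates.
    if COMPLEX_HINTS.search(prompt):
        return "complex"
    if CODE_HINTS.search(prompt):
        return "code"
    return "default"
```

The returned key lines up with the MODEL_TIERS dictionary above, so a call like smart_chat(prompt, task_type=classify_task(prompt)) would route without the caller thinking about models at all.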

The Bottom Line

Coinbase's move isn't about "Chinese AI crushing Western AI" — that's a headline, not the real story. The real story is about smart tiering: most engineering tasks don't need a $25/M-token model, and the price gap between open-weight Chinese models and Western closed-source models has become impossible to ignore.

As Armstrong put it: "The goal isn't to make engineers use less AI. It's to let them use as much as they want — without burning money."

That's a philosophy every team can adopt today. Whether you self-host like Coinbase or use an aggregation gateway like MoToken, the tools are ready. The only question is: are you still overpaying for everyday tasks?

What's your team's default model? Have you experimented with GLM, Kimi, or other Chinese open-weight models? Drop a comment below — I'd love to hear your experience.
