TITLE: Why I Migrated My AI Assistant from Claude to Qwen and Gemma in One Morning
On Friday, April 3 at 7:47 PM, an email from Anthropic landed in my inbox:
"Starting April 4 at 12pm PT / 8pm BST, you'll no longer be able to use your Claude subscription limits for third-party harnesses including OpenClaw."
I had less than 17 hours to decide what to do.
It wasn’t like my assistant would just die—Nyx (my personal AI running on OpenClaw) could still use Claude, but now it needed “extra usage”: separate pay-as-you-go charges on top of my subscription. My Anthropic Max plan ($100–$200/month) no longer covered third-party tool usage.
So I did what I’d been putting off: I built a smarter, diversified model stack.
The email that changed everything
Anthropic’s official message explained three things:
1. The change: Starting April 4, Pro and Max subscriptions no longer cover usage in external tools like OpenClaw. They still work—but now require “extra usage” (separate pay-per-token billing).
2. The exception: Subscriptions do still cover Anthropic’s own products: Claude Code and Claude Cowork.
3. The sweeteners: Anthropic offered a one-time credit equal to one month of your subscription (redeemable by April 17), plus up to 30% discounts if you pre-buy “extra usage” bundles.
According to Boris Cherny (Head of Claude Code), the technical reason is that third-party tools aren't optimized for Claude's prompt caching, so they incur disproportionate compute costs. Anthropic's own tools reuse already-processed context and are far more efficient.
This wasn't a surprise move—around the same time, Anthropic had also begun imposing 5-hour session limits on the top 7% of heaviest users. The writing was on the wall.
My real-world setup
Nyx is my personal AI assistant, running on my own VPS. It handles content, automations, analytics, publishing calendars, and dozens of tasks daily. Up until Friday, it used Claude Sonnet 4.6 as the default model, covered under my Max subscription.
After the change, sticking with the status quo meant paying API tokens on top of my subscription. Community estimates suggest an active agent can burn $50–$200/month in API costs alone. That’s unacceptable when you’re already paying $100+ just to subscribe.
The obvious alternative? OpenRouter.
How I migrated in one morning
Step 1: Audit what’s actually available
First, I looked at what I could really use—not what sounds good on paper:
- Anthropic (direct token) → Sonnet, Opus, Haiku — available, but now pay-as-you-go
- Google Antigravity (OAuth) → Gemini 3.1 Pro High, Gemini 3 Flash — ruled out due to recurring timeouts we’d experienced months ago
- OpenRouter (API key) → Dozens of models from multiple providers, pay-per-use
- Groq (token) → Fast models, but many outdated model IDs
Step 2: Check real rankings on LM Arena
I didn’t trust slick marketing benchmarks. I went straight to OpenLM.ai’s LM Arena, which aggregates millions of blind human votes between models.
Here’s what mattered for my use case:
| Model | Arena Score | Open Source | Cost / 1M tokens |
|---|---|---|---|
| Gemini 3.1 Pro High | ~1505 🏆 | No | OAuth free* |
| Claude Sonnet 4.6 | ~1460 | No | $3 in / $15 out |
| Gemma 4 31B | 1450 | ✅ Apache 2.0 | $0.14 |
| Qwen3 235B 2507 | 1418 | ✅ Apache 2.0 | $0.07 |
| DeepSeek V3 0324 | 1377 | ✅ MIT | $0.20 |
* Gemini is free via OAuth—but we’d already had timeout issues in production.
Step 3: Test real latency—not paper benchmarks
Here’s where things got wild. I used OpenClaw to spin up subagents for each model and measured cold-start response times:
| Model | Real Latency |
|---|---|
| DeepSeek V3 0324 | 257ms ✅ |
| Llama 4 Maverick | 346ms ✅ |
| Mistral Small 3.1 | 460ms ✅ |
| Qwen3 235B | 638ms ✅ |
| Gemma 4 31B | 6.2 seconds ❌ |
Gemma 4 had the best open-source Arena score—but 6 seconds of cold latency makes interactive conversation impossible. It got demoted to batch tasks only (SEO analysis, bulk processing).
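The measurement itself doesn't need anything fancy: time one cold request per model and compare. Here's a minimal sketch of the approach, assuming a `call_model` callable that sends a single short prompt to whatever backend you're testing (the stand-in backend below is purely illustrative):

```python
import time

def measure_latency(call_model, model_id: str, prompt: str = "ping") -> float:
    """Time a single cold-start request and return latency in milliseconds."""
    start = time.perf_counter()
    call_model(model_id, prompt)  # any callable that performs one request
    return (time.perf_counter() - start) * 1000

# Stand-in backend that simulates a model taking ~50 ms to respond:
def fake_backend(model_id: str, prompt: str) -> str:
    time.sleep(0.05)
    return "pong"

latency_ms = measure_latency(fake_backend, "qwen/qwen3-235b")
```

In practice you'd run this a few times per model and keep the first (cold) measurement, since warm requests benefit from connection reuse.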
Step 4: The final stack
After testing and tuning, here’s the setup I landed on:
Main model: Qwen3 235B A22B 2507 via OpenRouter
- 638ms latency
- Arena score 1418 (on par with Claude Sonnet 4.5)
- $0.07 per million tokens — ~42x cheaper than Sonnet
- 262k token context
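The "~42x cheaper" figure is just arithmetic on the input-token prices from the tables above (output-token prices differ, so the real blended ratio depends on your usage mix):

```python
# Input-token prices in $ per 1M tokens, as quoted in the rankings table.
SONNET_INPUT = 3.00   # Claude Sonnet 4.6
QWEN_INPUT = 0.07     # Qwen3 235B via OpenRouter

ratio = SONNET_INPUT / QWEN_INPUT  # ≈ 42.9
```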
By agent specialization:
| Agent | Model | Why |
|---|---|---|
| Main (Nyx) | Qwen3 235B | Best quality/cost/latency balance |
| Content & courses | Qwen3 235B | Strong reasoning, fluent Spanish |
| SEO & batch analysis | Gemma 4 31B | Top open-source score; latency acceptable for non-interactive batch work |
| n8n & code | DeepSeek V3 | Fastest (257ms), coding powerhouse |
| Social comments | Mistral Small 3.1 | Super fast, $0.03/M — perfect for short replies |
| API exploration | Llama 4 Maverick | 1M token context — great for huge docs |
| Session compaction | Mistral Small 3.1 | Cheap context summarization |
| On-demand high quality | Claude (Sonnet/Opus/Haiku) | Optional for peak quality — not default anymore |
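Conceptually, the specialization above is just a routing table: each agent role maps to a default model, with the main model as fallback. A minimal sketch (the role names and model IDs here are illustrative—check your provider's model list for the exact identifiers):

```python
# Map each agent role to its default model ID (IDs are illustrative).
AGENT_MODELS = {
    "main": "qwen/qwen3-235b-a22b-2507",
    "content": "qwen/qwen3-235b-a22b-2507",
    "seo_batch": "google/gemma-4-31b",          # batch only: high latency
    "code": "deepseek/deepseek-chat-v3-0324",   # fastest, strong at code
    "social": "mistralai/mistral-small-3.1",    # cheap short replies
    "docs": "meta-llama/llama-4-maverick",      # 1M-token context
    "compaction": "mistralai/mistral-small-3.1",
}

def model_for(agent: str) -> str:
    """Return the model for an agent role, falling back to the main model."""
    return AGENT_MODELS.get(agent, AGENT_MODELS["main"])
```

The fallback matters: a new agent role you haven't configured yet should degrade to your best general-purpose model, not crash.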
What I learned
1. Relying on a single provider is operational suicide.
That email hit at 8 PM Friday. By Saturday noon, the rules had changed. Less than 17 hours' notice for a change with global impact. If my infrastructure hadn't been ready, Nyx would've just… stopped. No drama. No warning. Just silence.
2. Open-source models are now seriously competitive.
Gemma 4 31B (Apache 2.0) scores 1450 on Arena—beating proprietary models from just 6 months ago. Qwen3 235B is neck-and-neck with Claude Sonnet at less than 1/10th the cost. This isn’t last year’s game anymore.
3. A diversified stack is cheaper and more robust.
Using the right model for the job cuts costs and improves results. The most expensive model isn’t the best for everything. Mistral Small for 3-sentence social comments works just as well as Sonnet—and costs 500x less.
4. Latency matters as much as the score.
On paper, Gemma 4 (1450) vs Qwen3 (1418) looks like a close race. But 6 seconds vs 638ms? That’s the difference between a usable assistant and a doorstop in interactive mode.
5. This move from Anthropic was predictable.
When a company gives you “unlimited” access for $20–$200/month and you’re using it to run a full-time autonomous agent, the economics eventually blow up. Anthropic chose to protect margins for direct users instead of subsidizing third-party power users. From their angle? Totally reasonable.
What you should do if this affects you
If you run an AI assistant, automation agent, or any tool using Claude via subscription + third-party tools, you’ve got three options:
- Enable “extra usage” on your Anthropic account — easiest, but adds variable costs on top of your fixed subscription. Use their free credit (valid until April 17).
- Migrate to OpenRouter — access to dozens of models, pay-per-actual-use. No monthly fees. Risk: you have to pick the right model per task.
- Go direct API with Anthropic — drop the subscription, pay per token. More predictable for variable usage, but pricier at high volume.
I went with #2, backed up by #1 when I need top-tier quality.
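If you pick option #2, the switch is mostly a base-URL change: OpenRouter exposes an OpenAI-compatible chat completions endpoint. Here's a minimal sketch using only the standard library (the model ID is illustrative; the request only fires if you've set an API key):

```python
import json
import os
import urllib.request

API_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(model: str, user_msg: str) -> dict:
    # OpenRouter uses the OpenAI-style chat completions schema.
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
    }

payload = build_request("qwen/qwen3-235b-a22b-2507", "Hello from Nyx")

api_key = os.environ.get("OPENROUTER_API_KEY")
if api_key:  # only send when a key is actually configured
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the schema matches OpenAI's, most existing client code needs only the URL, key, and model ID changed.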
If you want to dive into how to build this kind of stack yourself, join my community of founders at Cágala, Aprende, Repite — we're running exactly these kinds of experiments daily.
This post was written by Qwen3 235B — the same model that replaced Claude as Nyx’s default. The full migration happened Saturday morning. Ironically, it was the perfect excuse to build the stack I should’ve had months ago.
📝 Originally published in Spanish at cristiantala.com