Stephan Miller

Posted on May 19 • Originally published at stephanmiller.com on May 16

I Was Wrong About Hy3 (And Other Things I Learned This Week)

#largelanguagemodels

Two weeks ago I told you Tencent’s Hy3 Preview was a marketing stunt. Free until May 8, +1,356% on OpenRouter, “free + new + expiring = noise.” I was extremely confident about it.

Today, one week into paid pricing, Hy3 is the #1 model on OpenRouter by tokens. 2.76 trillion of them, in a single week. The pattern detector got overconfident, and it turns out that’s the theme of the whole week. The Cheapskate Picks held mostly steady, but the Math pick flipped one week after I published it. Claude Code’s billing bug entered its third week unpatched. And Google I/O is Tuesday, so whatever I write here is going to look quaint by Wednesday morning.

Let’s get into it.

What I Got Wrong About Hy3 (And the Other New Players)
The Cheapskate Picks Held (Mostly)
Hype vs. Value: Ring 2.6 vs. Ernie 5.1
Claude Code’s Billing Bug Enters Its Third Week
What’s Worth Trying This Week
Tuesday Is Going to Be Loud
What I’m Watching Next Week

What I Got Wrong About Hy3 (And the Other New Players)

The Hy3 numbers are not a typo. 2.76T tokens in a week. The +153,299% delta is the migration spike — users coming back when paid pricing kicked in instead of bouncing. Pricing locked in at $0.066 input / $0.26 output per 1M tokens with a 262K context window, which is competitive enough that production users had no real reason to leave when the free tier ended.

Here’s what I got wrong: I had a rule that said “free + new + expiring = noise.” It worked great for filtering out cynical marketing stunts. It also filtered out a real model. The new rule, which I’m putting in the methodology going forward: free-period spikes that survive the cliff are validation, not residue. The cliff is the actual test.
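The revised rule can be sketched as a simple retention check. This is a rough illustration rather than the actual methodology: the function name, the 0.5 retention threshold, and the sample volumes are all hypothetical.

```python
def survives_cliff(free_week_tokens: float, paid_week_tokens: float,
                   min_retention: float = 0.5) -> bool:
    """True if the first paid week keeps at least `min_retention` of the
    last free week's token volume.

    The 0.5 default is an illustrative assumption, not a measured cutoff.
    """
    if free_week_tokens <= 0:
        raise ValueError("free_week_tokens must be positive")
    return paid_week_tokens / free_week_tokens >= min_retention


# Made-up weekly volumes, for illustration only.
print(survives_cliff(1.0e12, 1.4e12))  # usage grew after the cliff
print(survives_cliff(1.0e12, 0.2e12))  # usage collapsed after the cliff
```

The point of the shape is that the comparison happens after the free tier ends, so a spike that was only there for the free tokens fails the check.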

While I was busy being wrong about Hy3, three other things showed up:

DeepSeek V4 Flash at #2 by token volume (1.65T/week, +70% week-over-week). $0.112 input / $0.224 output, 1M context, MIT-licensed, 284B params with 13B active. It also debuted at #1 on the trending list under its free variant. Independent reviewers report 79.0% on SWE-bench Verified and 91.6% on LiveCodeBench. This is the cheap workhorse doing most of DeepSeek’s actual work this week. Not the V4 Pro everyone covered when it launched April 24.
Gemini 3.1 Flash Lite dropped May 7 at $0.25 / $1.50 with a 1M context window. Half the cost of regular Gemini 3 Flash. AA Intelligence Index of 34, which is solid for the price class, and 347 tokens per second of output — fastest in its tier by a wide margin. This is Google racing the Asian price floor, which is a sentence I would not have written 18 months ago.
Owl Alpha is the OpenRouter stealth model that’s now 706B tokens a week, free, agentic-tuned, 1M context. It’s been live since April 28 — 17 days now — without anyone confirming the provider. Prior stealth releases (Polaris Alpha → GPT-5.1, Sherlock Alpha → Grok 4.1) got unmasked inside two weeks. Owl Alpha is breaking that pattern. Either the labs are getting better at guarding A/B test windows, or someone’s collecting an unusually long RL run before publishing.

The common thread connecting Hy3, V4 Flash, 3.1 Flash Lite, and the stealth model is that the market’s center of gravity has shifted. Three of the four are non-Western, three of the four cost less than $1/M output (Flash Lite, at $1.50, is the exception), and three of the four feature a 1M context window.
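To make the price gap concrete, here is a small cost calculator using the per-1M-token prices quoted above. The workload size (10M input tokens, 2M output tokens) is a hypothetical chosen for illustration, and Owl Alpha is left out because it is currently free.

```python
# Prices in USD per 1M tokens, as quoted in this post: (input, output).
PRICES = {
    "Hy3": (0.066, 0.26),
    "DeepSeek V4 Flash": (0.112, 0.224),
    "Gemini 3.1 Flash Lite": (0.25, 1.50),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a workload at the listed per-1M-token prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: 10M input tokens and 2M output tokens.
for name in PRICES:
    print(f"{name}: ${workload_cost(name, 10_000_000, 2_000_000):.2f}")
```

On that assumed workload the three come out at roughly $1.18, $1.57, and $5.50. A different input/output mix will shift the ordering between Hy3 and V4 Flash, since V4 Flash is cheaper on output and pricier on input.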

The Cheapskate Picks Held (Mostly)

Quick refresher on the methodology: take the leader’s Arena rating in a category, draw a 50-point band downward, then sort everything in the band by output price. The cheapest model in the band wins the Cheapskate slot. The whole point is that the top of Arena is structurally compressed (overall, the leader sits at 1502 and #20 at 1468, only 34 points of spread), so paying 8× more buys you ~2% more rating. Not always a good trade.
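The selection rule above can be sketched in a few lines. This is a minimal illustration of the band-then-sort idea, not the tracker’s actual code: the `Model` class, the tie-break on rating, and the sample entries are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    rating: int          # Arena rating in the category
    output_price: float  # USD per 1M output tokens


def cheapskate_pick(models: list[Model], band: int = 50) -> Model:
    """Cheapest model (by output price) within `band` points of the leader."""
    if not models:
        raise ValueError("need at least one model")
    leader_rating = max(m.rating for m in models)
    in_band = [m for m in models if m.rating >= leader_rating - band]
    # Assumed tie-break: at equal price, prefer the higher-rated model.
    return min(in_band, key=lambda m: (m.output_price, -m.rating))


# Placeholder entries, not real leaderboard rows.
sample = [
    Model("leader", 1502, 25.0),
    Model("mid-priced", 1473, 3.0),
    Model("too-weak", 1440, 0.5),  # cheapest, but outside the 50-point band
]
print(cheapskate_pick(sample).name)
```

Note that the leader is always inside its own band, so the function returns the leader itself when nothing cheaper qualifies.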

This week, here’s how it shook out across the seven Arena categories I track:

Category	Leader	Leader output price ($/1M)	Cheapskate pick	Pick output price ($/1M)	Δ rating	Price ratio	AA Pareto
Overall	Claude Opus 4.6 Thinking	$25	Gemini 3 Flash	$3	−29	8.3×	nearby
Coding	Claude Opus 4.7 Thinking	$25	GLM-5.1	$3.08	−36	8.1×	✓ (AA Idx 51)
Creative Writing	Claude Opus 4.6 Thinking	$25	Gemini 3 Flash	$3	−36	8.3×	nearby
Math	GPT-5.4-high / Opus 4.6 Thinking	$15/$25	Ernie 5.1	$2.65	−19	5.7×–9.4×	n/a
Instruction Following	Claude Opus 4.6 Thinking	$25	MiMo V2.5 Pro	$3	−44	8.3×	✓ (AA Idx 54)
Hard Prompts	Claude Opus 4.6 Thinking	$25	Gemini 3 Flash	$3	−41	8.3×	nearby
Multi-Turn	Claude Opus 4.7 Thinking	$25	Gemini 3 Flash	$3	−41	8.3×	nearby
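For anyone checking the table by hand, the price ratio is the leader’s output price divided by the pick’s, and the rating gap can be expressed as a share of the leader’s rating. The helper names below are hypothetical; the inputs come from the Overall row plus the 1502 leader rating quoted earlier.

```python
def price_ratio(leader_price: float, pick_price: float) -> float:
    """How many times more the leader costs per 1M output tokens."""
    return leader_price / pick_price


def rating_gap_pct(leader_rating: float, delta: float) -> float:
    """Rating gap as a percentage of the leader's rating."""
    return abs(delta) / leader_rating * 100


# Overall row: $25 leader vs. $3 pick, rating gap of -29 from a 1502 leader.
print(f"{price_ratio(25, 3):.1f}x the price")
print(f"{rating_gap_pct(1502, -29):.1f}% lower rating")
```

The Math row shows a range because two leaders are tied: $15 / $2.65 is about 5.7× and $25 / $2.65 is about 9.4×.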

Six picks held from last week. Math flipped.

The flip is Ernie 5.1, which Baidu launched May 9 at $0.59 input / $2.65 output and which immediately landed in the Arena top 20 with a 1472 overall, 1496 in math, 1518 in coding, and 1517 in instruction following. That’s a model dropping in mid-week and slotting in cheaper AND higher-rated than last week’s Math winner (DeepSeek V4 Pro Thinking at $1.74/$3.48 even with the 75% discount). Baidu also says they trained it at 6% of the compute cost of comparable models, which is either misleading or the kind of thing that quietly resets cost-per-capability assumptions across the industry.

Caveat: Ernie’s primary host is Baidu’s Qianfan API, not OpenRouter. If you’re routing through OpenRouter, the runner-up is MiMo V2.5 Pro at $3 / 1M output, rating 1484, and it’s available there.

The two highest-confidence picks this week (meaning the methodology AND Artificial Analysis’s independent Intelligence Index agree) are GLM-5.1 for Coding (AA Index 51) and MiMo V2.5 Pro for Instruction Following (AA Index 54). When two independent evaluation methodologies converge on the same model, that’s about as strong a signal as this kind of comparison produces.

Gemini 3 Flash still wins four of the seven categories in the table above. Five months after launch. The boring answer keeps being correct.

Hype vs. Value: Ring 2.6 vs. Ernie 5.1

Probably hype: Ring 2.6 1T from Ant Group / InclusionAI dropped May 8. One trillion params, MIT-licensed, 63B active. The launch announcement claimed 87.6 on PinchBench, beating GPT-5.4 and Gemini 3.1 Pro, with vendor-reported scores of 95.83 on AIME 2026 and 88.27 on GPQA Diamond. Open-weight + cross-frontier claims is a hype cocktail that always trends. But no third party has independently verified any of those numbers yet: no AA coverage, no neutral LiveCodeBench harness run, nothing. Trillion-param vendor benchmarks beating frontier models is the exact pattern that should make you wait two weeks before betting on it.

While I’m at it, Trinity Large Thinking from Arcee at #2 on the trending free list deserves the same caution. It’s a real release (Apache 2.0, 398B sparse MoE, US-built, the rare open frontier model we can actually inspect), but the “free for limited time” framing is the same trap I just admitted to walking into with Hy3. Track it past the cliff before deploying it anywhere that matters.

Under-sold value: Ernie 5.1, which I just covered above, is the cleanest example of this week’s repeating pattern. Hits Arena top 20 the day it launches, immediately becomes the cheapskate winner in a category, and the Western LLM Twitter barely notices. Same shape as Hy3 two weeks ago. Same shape as Kimi K2.6 four weeks ago.

I’m starting to think the meta-pattern matters more than any individual model: when a non-Western lab ships a serious value play, the default reaction in the English-language commentariat is either “interesting but unproven” or silence. Then six weeks later it’s quietly running in production at half the price of Claude. We keep being surprised by the same trajectory.

Claude Code’s Billing Bug Enters Its Third Week

If you use Claude Code on a Max plan, this section is for you. If you don’t, skip to the next one, but know that this is the week’s universal frontier-lab horror story and people are pissed.

Claude Code v2.1.100 and later silently inflate cache_creation_input_tokens by roughly 20,000 per request. The inflation is 100% server-side, routed by the User-Agent header (which includes the version number), and it appears to be caused by the prompt cache forcing a full re-process of conversation history on every turn instead of resuming. GitHub issue #46917 is the canonical thread, with payload-vs-billed-tokens evidence from multiple developers.
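If you log the `usage` object from your API responses, a rough way to spot this pattern is to look for later turns that write a large amount to the cache while reading little back. The sketch below is a heuristic under stated assumptions, not a diagnostic from the GitHub thread: the function name, the 20,000-token threshold (borrowed from the figure above), and the sample numbers are hypothetical. It assumes each record carries the `cache_creation_input_tokens` and `cache_read_input_tokens` fields.

```python
def suspicious_turns(usages: list[dict], min_creation: int = 20_000) -> list[int]:
    """Indexes of non-first turns that wrote at least `min_creation` tokens
    to the cache while reading fewer tokens from it than they wrote."""
    flagged = []
    for i, usage in enumerate(usages):
        if i == 0:
            continue  # the first turn legitimately populates the cache
        created = usage.get("cache_creation_input_tokens") or 0
        read = usage.get("cache_read_input_tokens") or 0
        if created >= min_creation and created > read:
            flagged.append(i)
    return flagged


# Made-up usage records for a three-turn conversation.
log = [
    {"cache_creation_input_tokens": 30_000, "cache_read_input_tokens": 0},
    {"cache_creation_input_tokens": 800, "cache_read_input_tokens": 30_000},
    {"cache_creation_input_tokens": 31_500, "cache_read_input_tokens": 0},
]
print(suspicious_turns(log))
```

A flagged turn is only a hint. A cache that expired between turns would look the same, so compare against the gaps between your requests before drawing conclusions.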

The real-world impact is brutal. One paying Max customer’s quota went from 0 to 67% in ten minutes of normal work with 128 cache flush events on a separate chat. Independent measurement says the inflation is driving costs 10–20× higher, exhausting even the $100/month Max plan in 1–2 hours of normal use.

Anthropic shipped a postmortem and a partial fix. The latest CLI as of this writing is v2.1.133 (released May 8). The bug is still there. Three weeks running.

The workaround everyone’s on: downgrade to v2.1.34, or reinstall via npm instead of using the native binary. That bypasses the version routing on the server side and gives you back the cache behavior from before the regression.

While we’re piling on Anthropic this week, two more things:

Opus 4.7 quietly costs up to 35% more than Opus 4.6 at the same headline price. Same $5 input / $25 output per 1M tokens, but the new tokenizer uses up to 35% more tokens for the same fixed text. If you’re on Opus 4.7 and on Claude Code v2.1.100+, you’re getting hit with two compounding inflations on the same workflow. Fun.
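The tokenizer effect is easy to put in dollar terms: if the same text becomes more tokens, the price per unit of text rises by the same factor. A minimal sketch, assuming the worst-case 35% figure applies uniformly (real text will vary, and the function name is hypothetical):

```python
def effective_price(list_price: float, token_inflation: float) -> float:
    """Price per 1M tokens of old-tokenizer-equivalent text, given the
    fraction of extra tokens the new tokenizer produces for the same text."""
    return list_price * (1 + token_inflation)


# Headline prices from the post, with the upper-bound 35% inflation.
print(f"input:  ${effective_price(5, 0.35):.2f} per 1M")
print(f"output: ${effective_price(25, 0.35):.2f} per 1M")
```

At that upper bound, the $5 / $25 list prices behave like roughly $6.75 / $33.75 for the same text.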

Opus 4.7 also regressed on refusals. Multiple developers report Opus 4.7 in Claude Code flagging routine benign code as malware and refusing to complete file operations, network calls, and standard library usage that 4.6 handled without complaint. This is in addition to the billing bug, not instead of it.

OpenAI doesn’t get to feel smug about this either. GPT-5.5 hallucinates 86% of the time it doesn’t know something on the AA-Omniscience benchmark. The 14-point AA-Omniscience improvement over GPT-5.4 came mostly from better factual recall, not better refusal: when 5.5 doesn’t know something, it makes up an answer roughly nine times out of ten.

The honest take here is that the gap between “shipped” and “actually works in production” keeps widening for the US frontier labs while the cheap Asian models keep landing comparatively clean. That’s not a comfortable thing to write but it’s what the week looks like.

What’s Worth Trying This Week

Stuff I’d actually do this week, not just stuff I’d read about:

Replace Opus with Gemini 3 Flash for general-purpose work if you haven’t already. $0.50 input / $3 output, 1M context, Arena top 20 in everything. The Cheapskate Pick in 4 of 7 categories isn’t a coincidence.
Try Kimi K2.6 on a real coding task for a week and see if you switch. There’s a developer who used it as their only coding assistant for 30 days and posted a brutally honest review: over-engineering tendency, agent swarm wins, where it broke. Worth reading before committing.
Use Owl Alpha while it’s still free before whoever made it pulls access. 1M context, agentic-tuned, optimized for Claude Code-style workflows.
Skip Ring 2.6 1T for production until ArtificialAnalysis runs benchmarks. Read about it, don’t deploy on it.
Downgrade Claude Code to v2.1.34 if you’re on Max and watching your quota burn. Stop the bleeding while Anthropic figures out the cache routing.

That’s it. Five things. Three of them are “use cheaper models,” one is “wait for verification,” and one is “downgrade your tools to fix billing.” That’s the week.

Tuesday Is Going to Be Loud

Google I/O 2026 runs May 19–20 — Tuesday and Wednesday this week. The keynote agenda confirms Gemini and AI updates as the headline.

The leaks so far point to two things:

Gemini 4 as the headline upgrade. Expected to focus on multi-context search and the new TPU generation.

Gemini Omni as the surprise. Six days before I/O, an X user spotted “Powered by Omni” inside the Gemini app’s video tab, positioned next to “Toucan,” which is Google’s internal codename for Veo 3.1. The most likely interpretation is that Omni is a unified text/image/video generation pipeline, which would make it the first frontier model to do all three in a single system. Demo videos already leaked from at least one Pro user’s account, including a chalkboard math scene that reportedly handled trigonometric proofs accurately.

Beyond Tuesday:

GPT-6 is a Q3-Q4 base case. Polymarket has it at ~10% by June 30, 51% by September 30, 82% by December 31. GPT-5.5 in April was Spud, the codename people thought meant GPT-6. It didn’t. The next jump is later this year.
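The “Q3-Q4 base case” reading falls out of the cumulative odds: subtracting each deadline’s probability from the next gives the implied odds for each window. A small sketch using the figures quoted above (the ~10% is approximate, so treat the output as rough; the function and label names are hypothetical):

```python
# Cumulative "released by this date" odds, as quoted above.
CUMULATIVE = [
    ("by June 30", 0.10),
    ("July 1 to September 30", 0.51),
    ("October 1 to December 31", 0.82),
]


def window_probabilities(cumulative):
    """Turn cumulative odds into per-window odds, plus the remainder."""
    windows = []
    previous = 0.0
    for label, p in cumulative:
        windows.append((label, p - previous))
        previous = p
    windows.append(("after December 31", 1.0 - previous))
    return windows


for label, p in window_probabilities(CUMULATIVE):
    print(f"{label}: {p:.0%}")
```

That works out to roughly 41% for Q3 and 31% for Q4, or about 72% combined, against about 10% before July and 18% after year end.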

Claude Mythos is confirmed real and being explicitly withheld on safety grounds. Project Glasswing, the cybersecurity capability, is the bottleneck. This is the first time a frontier lab has publicly said “we built it, we’re not shipping it” with a confirmed model. No timeline. Anthropic has committed to advance notice on any safeguard changes, so the roadmap will be visible before it happens. Watch their blog.

Already shipped this month and worth flagging if you missed them:

Mercury 2 from Inception Labs — diffusion-based LLM at 1000+ tokens/sec, now available on OpenRouter. Not autoregressive. 5–15% behind frontier on hard reasoning, matches on structured output and translation. The architectural alternative is finally here and it’s fast.
NVIDIA Nemotron 3 Nano Omni — open 30B-parameter MoE with 3B active, multimodal across vision, audio, and text, 9× the throughput of comparable open omni-models. Available on OpenRouter and SageMaker.

What I’m Watching Next Week

The model market moved faster than my pattern detectors this week. I had to eat one prediction (Hy3), and recalibrate the cheapskate Math winner one week after publishing it (Ernie 5.1 dropped on Friday and walked into the slot).

Three things on watch for next week:

Gemini 4 / Omni at I/O Tuesday. If Omni ships as a unified video model with API access, the cheapskate calculus for everything multimodal resets overnight.
Whether Anthropic ships a real fix for the Claude Code cache bug. Three weeks in, their workaround is “use an older version.” That can’t last forever.
Whether anyone gets neutral verification of Ring 2.6 1T’s claims. If it holds up, the cheapskate Coding pick might be open-weight by W22.

And while I was finishing this up, Owl Alpha probably got unmasked, Hy3 launched a new variant nobody told me about, and Anthropic shipped a Claude Code patch that introduces three new bugs. That’s the price you pay for hitting publish on Saturday.