SAR

Posted on Jul 4

AMD Just Served GLM5.2 at 2,626 Tokens/Second — and It's 2x Cheaper Than NVIDIA Blackwell

#ai #amd #gpu #inference

I'll be honest — when I first saw the headline, I had to read it twice.

"GLM5.2 on AMD MI355X at 2626 tok/s/node at over 2x lower cost than Blackwell."

That's not a typo. That's what Wafer published yesterday, and it's already racked up 180 points and 51 comments on Hacker News. The conversation is fierce — some people are celebrating, some are calling BS. And honestly? Both sides have a point.

But here's the thing that got my attention: this isn't a one-off benchmark from an AMD marketing lab. This is a production provider (Wafer, working with Vercel AI Gateway and OpenRouter) saying they're serving real traffic on AMD hardware and it's working. Not just working. Winning on cost.

I've been watching the GPU inference cost war for the last year, and this feels like a genuine shift. Let me walk you through what actually happened, what the skeptics are getting wrong (and right), and what it means if you're deploying models in production.

The Numbers That Matter

Wafer's Ian Ye published the full breakdown yesterday. Here's the TL;DR on their setup:

Model: GLM-5.2 — Zhipu AI's latest frontier model, actively competing with Claude and GPT-class models
Hardware: AMD MI355X (Instinct MI350 series) from TensorWave capacity
Software stack: Wafer's optimization layer on top of ROCm
Workload: 20k input tokens / 1k output tokens, 60% cache hit rate

The headline number — 2,626 tok/s/node aggregate throughput at 2.4 requests per second — came with a ≤5 second time-to-first-token (TTFT) knee. That's a real production workload profile, not a cherry-picked ideal case.
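As a quick sanity check on that workload profile, here's a minimal sketch of the arithmetic. It assumes the 2,626 tok/s figure counts generated (output) tokens, which the headline doesn't state explicitly, and the function names are mine, not Wafer's.

```python
def uncached_input_tokens(input_tokens: int, cache_hit_rate: float) -> float:
    """Input tokens that miss the prefix cache and still need fresh prefill."""
    return input_tokens * (1.0 - cache_hit_rate)


def tokens_per_request(aggregate_tok_s: float, requests_per_s: float) -> float:
    """Tokens attributable to each request at a given aggregate rate."""
    return aggregate_tok_s / requests_per_s


if __name__ == "__main__":
    # Workload figures quoted above: 20k input, 60% cache hit, 2,626 tok/s at 2.4 RPS.
    print(uncached_input_tokens(20_000, 0.60))
    print(tokens_per_request(2_626, 2.4))
```

Under that assumption, each request accounts for roughly 1,094 tokens, which lines up with the stated 1k output tokens per request, and only about 8k of the 20k input tokens need fresh prefill.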

Single-stream performance hit 213 tok/s on 10k input / 1.5k output tokens, following Artificial Analysis standards. That's the metric that matters for interactive chat applications where one user is waiting for one response.

The kicker? All of this at "over 2x lower cost than Blackwell." Wafer claims the MI355X is roughly 2.75x cheaper per GPU than the NVIDIA B300, with comparable hardware specs.

That's not a small gap. That's a chasm.

The Skeptic's Case: Quantization and Caveats

Now let's address the elephant in the room — because the HN comments certainly did.

The top-voted reactions were skeptical for good reasons:

"While cool, quantization to FP4 is practically never lossless in actual use. A lot of providers are advertising high TPS on Kimi and GLM, but the models are functionally lobotomized and no longer close to frontier quality."

That's from user hassaanr, and they're not wrong. Running at mxFP4 precision means you're trading quality for speed. The question is how much quality, and whether it matters for your use case.
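To make "trading quality for speed" concrete, here's a simplified round trip in the style of the OCP Microscaling MXFP4 format: blocks of 32 values share one power-of-two scale, and each value is stored as a 4-bit E2M1 number. This is an illustration under my own simplifications (ties round to the smaller magnitude, no special-value handling), not Wafer's kernel or any library's API.

```python
import math

# Magnitudes an FP4 E2M1 element can represent (sign is stored separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2    # exponent of the top E2M1 binade (4.0 == 2**2)
BLOCK_SIZE = 32  # elements sharing one scale in MXFP4


def quantize_block(block):
    """Quantize one block to a shared power-of-two scale plus E2M1 elements,
    then dequantize, returning the values the model would actually see."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0.0] * len(block)
    # Pick the scale so the largest magnitude falls in the top E2M1 binade.
    scale = 2.0 ** (math.floor(math.log2(max_abs)) - E2M1_EMAX)
    out = []
    for x in block:
        mag = min(abs(x) / scale, E2M1_VALUES[-1])  # saturate at 6.0
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        out.append(math.copysign(nearest * scale, x))
    return out


def max_abs_error(block):
    """Largest absolute round-trip error within one block."""
    return max(abs(a - b) for a, b in zip(block, quantize_block(block)))


if __name__ == "__main__":
    weights = [0.3, 1.0, -0.07, 0.55] * (BLOCK_SIZE // 4)
    print(max_abs_error(weights))
```

With only eight magnitudes per block, small values next to a large outlier get crushed toward zero, which is why the "practically never lossless" complaint is plausible. How much that hurts end-task quality is an empirical question this sketch can't answer.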

Another commenter put it more bluntly:

"I think we should make it illegal to not specify the quantization in the headline for these types of posts."

Fair. When someone says "2,626 tokens per second," your brain jumps to "full-quality frontier model inference at blazing speed." The reality is more nuanced — we're talking about aggressively quantized models running on specialized hardware with significant optimization work.

But here's the counter-argument that matters: for most production use cases, quantized inference is already the standard. Almost nobody runs full FP16 on user-facing traffic at scale. The debate isn't "quantized vs unquantized." It's about how aggressive the quantization is and whether the quality bar still clears for your specific application.

I've been running quantized models in production for months, and the quality difference between a well-tuned FP4 and a reference FP16 is often negligible for code generation, summarization, and structured output tasks. Creative writing? Maybe not. Code review? Absolutely fine.

What the AMD vs NVIDIA Cost Math Actually Looks Like

Let's be concrete about what "2x cheaper" means in real dollars.

Wafer's chart shows the MI355X hitting 80% of B200 performance at saturation but at over 2x lower cost. At 2.4 RPS saturation, they're getting 2,626 aggregate tok/s with 0.81s average TTFT and 2.22s p99.
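If you want to compare those TTFT figures against your own provider, the summary statistics are easy to reproduce from raw latency samples. A minimal sketch, assuming you've already collected per-request TTFT values in seconds; the nearest-rank percentile method is my choice here, and Wafer doesn't say which method they used.

```python
import math
from statistics import mean


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("need at least one sample")
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize_ttft(samples):
    """Mean and p99 time to first token, in the same units as the input."""
    return {"mean": mean(samples), "p99": percentile_nearest_rank(samples, 99)}


if __name__ == "__main__":
    # Hypothetical samples in seconds, purely for illustration.
    print(summarize_ttft([0.6, 0.7, 0.8, 0.9, 2.2]))
```

The gap between the mean and the p99 is the thing to watch: a low average with a long tail still means some users sit waiting.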

Here's the quick-and-dirty math I do when evaluating inference providers:

Metric	B200	MI355X	MI355X vs B200
Peak tok/s per node (aggregate)	~3,280	2,626	80%
Per-GPU cost	~$30K+	~$11K*	2.75x cheaper
tok/s per dollar of GPU cost	~0.11	~0.24	2.2x better

*Rough estimates based on Wafer's 2.75x claim and public pricing ranges. Actual pricing varies wildly by provider, contract length, and availability.
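The table's ratios can be reproduced in a few lines. This sketch uses the same rough figures as the table, so it inherits the same caveats: it divides per-node throughput by per-GPU cost, which only gives a fair ratio if both nodes carry the same number of GPUs (an assumption on my part).

```python
def tok_s_per_dollar(tok_s: float, gpu_cost_usd: float) -> float:
    """Throughput per dollar of hardware cost."""
    return tok_s / gpu_cost_usd


# Rough figures from the table above, not measured or quoted prices.
B200 = {"tok_s": 3_280, "cost": 30_000}
MI355X = {"tok_s": 2_626, "cost": 11_000}

perf_ratio = MI355X["tok_s"] / B200["tok_s"]
cost_ratio = B200["cost"] / MI355X["cost"]
b200_value = tok_s_per_dollar(B200["tok_s"], B200["cost"])
mi355x_value = tok_s_per_dollar(MI355X["tok_s"], MI355X["cost"])
value_ratio = mi355x_value / b200_value

if __name__ == "__main__":
    print(f"relative performance: {perf_ratio:.0%}")
    print(f"cost advantage: {cost_ratio:.2f}x")
    print(f"tok/s per dollar: {b200_value:.2f} vs {mi355x_value:.2f}")
    print(f"value advantage: {value_ratio:.1f}x")
```

Swap in your own quotes for the two cost fields and the rest follows. The value ratio is just the performance ratio multiplied by the cost ratio.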

The math gets even better when you factor in that NVIDIA GPU prices are climbing because supply can't keep up with demand. As Wafer's post notes:

"With frontier models being released almost every other week — Claude Fable, GLM5.2, and Minimax M3 — the token craze is only getting crazier, and there aren't enough Blackwells going around to support it."

That's not marketing spin. It's observable reality. Every major lab is releasing models faster than NVIDIA can ship Blackwells, and secondary-market prices are astronomical.

The Software Gap That's Rapidly Closing

The one area where NVIDIA still has a clear advantage is software. CUDA has had nearly two decades to mature. ROCm is... not that.

Wafer's post is refreshingly honest about this:

"On the MI355X / ROCm stack, SOTA performance rarely comes out of the box for these frontier models. In fact, you're lucky if you can find..."

The excerpt cuts off there, but the message is clear: AMD's software stack still requires hand-holding. You need a team like Wafer that specializes in AMD optimization to get these numbers. Pull an MI355X out of the box and run GLM-5.2 on stock ROCm? You're not hitting 2,626 tok/s.

But — and this is a big but — Wafer proved it can be done. The gap between "possible with expert tuning" and "works out of the box" is shrinking fast. Dedicated optimization layers like Wafer's, plus the growing investment in ROCm from the community, are closing the software moat that's been NVIDIA's strongest defense.

Consider this: six months ago, nobody was publishing AMD inference numbers in the same sentence as Blackwell. Now we're talking about a 20% performance gap at 2.75x lower cost. At that price differential, the ROI on optimization work for AMD hardware is enormous.

What This Means for Working Developers

If you're building products on top of LLMs — and let's be real, if you're reading this in 2026, you probably are — the AMD price war matters way more than any individual model release.

Here's my take: the inference cost curve is about to get a lot steeper.

For the last two years, the narrative has been "AI is getting cheaper exponentially" — but the reality is that it's been getting cheaper mostly because models are getting more efficient. The hardware cost per token has been relatively flat.

AMD is changing that. When you can get 2x the throughput per dollar on a competitive frontier model, the whole cost structure of AI applications shifts. Things that weren't economically viable before (real-time agents that call models hundreds of times per user session, high-throughput batch processing for small teams) suddenly make sense.

The key question isn't "should I switch to AMD tomorrow?" It's "what becomes possible when inference is half the price?"

Some concrete implications:

Agent loops get cheaper: If you're running 10-20 model calls per user action, halving inference cost makes those multi-step agentic workflows economically viable at scale.
Smaller teams can compete: The hardware barrier to running frontier models is lowering. You don't need a cluster of H100s to serve competitive quality — a few MI355X nodes can handle significant production load.
NVIDIA will have to respond: Either by lowering Blackwell prices (unlikely given demand) or by releasing Rubin-class hardware that widens the performance gap enough to justify the premium. I suspect we'll see both.
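To put a number on the agent-loop point, here's a tiny cost model. The per-million-token prices below are made-up placeholders, not anyone's actual rates; plug in the rates from your own provider.

```python
def cost_per_action(calls: int, input_tokens: int, output_tokens: int,
                    usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Dollar cost of one user action that fans out into several model calls."""
    per_call = (input_tokens * usd_per_m_input
                + output_tokens * usd_per_m_output) / 1_000_000
    return calls * per_call


if __name__ == "__main__":
    # Hypothetical prices, for illustration only.
    baseline = cost_per_action(20, 20_000, 1_000, 1.00, 4.00)
    halved = cost_per_action(20, 20_000, 1_000, 0.50, 2.00)
    print(f"baseline: ${baseline:.2f} per action, halved: ${halved:.2f}")
```

The model is linear, so halving both prices halves the cost per action. What changes is how many calls per action you can afford before the unit economics break.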

The Bottom Line

Here's where I land after reading through the post, the HN discussion, and my own experience running inference at various providers:

The AMD MI355X isn't a magic bullet. The quantization caveats are real. The software gap is real. The "aggregate throughput" vs single-stream distinction matters. If you need the absolute best quality at any cost, you're still looking at NVIDIA's latest at FP16.

But for the other 90% of production use cases — serving chat, code generation, RAG pipelines, structured extraction, batch processing — the AMD price proposition is genuinely compelling. A 2x cost improvement with 80% of the peak performance, on a model that's competitive with frontier offerings? That changes the deployment calculus for a lot of teams.

The HN comment from AussieWog93 is worth keeping in mind:

"The 2600 tok/s is an 'aggregate', not the actual throughput."

True — but aggregate throughput is exactly what matters for most production deployments. Unless you're running a single-user research setup, you're batching requests and measuring aggregate throughput anyway.
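For a rough feel of how aggregate and per-user numbers relate, Little's law gives the average number of requests in flight, and dividing aggregate throughput by that gives a ballpark per-stream rate. The mean request duration below is a placeholder I picked for illustration; it isn't in Wafer's post, and the estimate ignores time spent in prefill.

```python
def requests_in_flight(requests_per_s: float, mean_duration_s: float) -> float:
    """Little's law: average concurrency = arrival rate x mean time in system."""
    return requests_per_s * mean_duration_s


def per_stream_tok_s(aggregate_tok_s: float, concurrency: float) -> float:
    """Ballpark rate each user sees if throughput is shared evenly."""
    return aggregate_tok_s / concurrency


if __name__ == "__main__":
    # 2.4 RPS is from the post; the 10-second duration is hypothetical.
    concurrency = requests_in_flight(2.4, 10.0)
    print(concurrency, per_stream_tok_s(2_626, concurrency))
```

The takeaway is directional: the same node can post a big aggregate number while each individual user sees a much smaller one, so check both against your latency budget.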

Meanwhile, Schiendelman raised an important point:

"I'm not surprised to see competition with Blackwell. Rubin is 5x faster than Blackwell at inference — Blackwell is the last generation Nvidia didn't make better specifically for inference."

So the war isn't over. But for right now, in July 2026, AMD has served notice: they're in the inference game to stay, and they're winning on price. If you haven't evaluated AMD hardware for your inference pipeline in the last six months, it's time to run those benchmarks again.

I know I will.
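If you do rerun your benchmarks, the two numbers worth capturing per request are time to first token and decode rate. Here's a minimal, provider-agnostic sketch that works on any iterator of streamed chunks. It treats each chunk as one token, which is only an approximation, and it deliberately leaves out the client call itself, since that depends on your provider's SDK.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter
                   ) -> Tuple[float, float, int]:
    """Consume a streamed response; return (ttft_s, chunks_per_s, chunk_count).

    The decode rate is measured from the first chunk to the last, so it
    excludes the time spent waiting for the first chunk.
    """
    start = clock()
    first = last = None
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no chunks")
    decode_time = last - first
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return first - start, rate, count


if __name__ == "__main__":
    def fake_stream():
        # Stand-in for a real streaming response.
        for word in ["hello", " ", "world"]:
            time.sleep(0.01)
            yield word

    print(measure_stream(fake_stream()))
```

Passing the clock in as a parameter keeps the function testable without real network calls. In a real run, feed it the chunk iterator from your streaming client and aggregate the results across many requests.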

GLM-5.2 is available now via Vercel AI Gateway and OpenRouter, served on AMD MI355X by Wafer. If you've been running on AMD hardware, I'd love to hear your experience. The community needs real-world data points more than it needs marketing benchmarks.