DEV Community

Qwen 3 Benchmarks, Comparisons, Model Specifications, and More

BestCodes on May 01, 2025

Qwen3: Alibaba's Latest Open-Source AI Model Qwen3 is the latest generation of large language models (LLMs) from Alibaba Cloud. Built by...
Nevo David

kinda insane how fast this stuff moves - i'm always trying to keep up but dang, it changes every week

BestCodes

Yep, Phi 4 Reasoning came out right after I wrote this, and now I have to do one on it 🤭

RedDragen

Qwen3 is... highly overrated!

Here's my case.
I wanted to build some visuals in code based on a description of the visual. Qwen3 required many, many iterations before it was able to present me with something that was - with a lot of squinting and frowning - barely what I meant for it to build.

Then I tried the exact same thing with the exact same prompts in Gemini 2.5 Pro. It did a near-perfect job in a single shot and only required a few minor revisions to make it exactly as I intended.

Now does that mean Qwen3 is bad and Gemini 2.5 Pro is good? Yes and no.
For specific requests Qwen3 is probably on par, or at least able to compete. But when it actually needs to reason about the thing you ask it to do, it's more accurate to describe its "cognitive skills" as those of a junior student, whereas Gemini's reasoning truly borders on final-exam-student level. Meaning both are good models, but Gemini is just better at "thinking" and thus gives better results where that skill is needed.

BestCodes

Gemini 2.5 Pro isn't open source. I'm excited about Qwen3 because it is one of the top performing open source models right now 🙂

RedDragen • Edited

Agreed! And I use Qwen happily :) Well, still the 2.5 Coder one; it's awesome!

My point was more that its thinking ability is... OK at best.
I haven't tried it yet, but I assume Llama 4 is better. And DeepSeek R1 definitely is better; that one I did try, just not on this example case.

BestCodes

Yeah, the problem with models like DeepSeek R1 and Llama 4 is that they are huge and usually have to run on a remote server or special hardware. CPU-only devices and typical laptops either can't run them or run them really slowly 🥲

Qwen 3 has a 0.6b variant which is super nice because it can run locally even on very small or weak devices. The size-to-performance ratio with Qwen models is great.
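(For anyone wanting to try this, here's a minimal sketch of running the small variant locally with Ollama. The `qwen3:0.6b` tag is an assumption on my part; check the Ollama model library for the exact name.)

```shell
# Pull the smallest Qwen 3 variant (tag name assumed; verify it
# against the Ollama model library before running).
ollama pull qwen3:0.6b

# Chat with it locally; this runs on CPU if no GPU is available.
ollama run qwen3:0.6b "Explain attention in one sentence."
```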

RedDragen

> Qwen 3 has a 0.6b variant which is super nice because it can run locally even on very small or weak devices. The size-to-performance ratio with Qwen models is great.

Oh boy, you're hallucinating :P

Think of these tiny models as a kid with the potential to be at Einstein level, but with a massive amount of amnesia (it has forgotten 90% of what it learned, or more). It can talk and form sentences, but the responses are mostly close to useless. It can do some simple things well, but that's about it.

In my testing I found that anything below 3B (not just Qwen, all the models out there) is mediocre at best, with 3B itself being the transition point where response quality improves by leaps and bounds.

But tiny models have great value! For example, AMD recently seems to be using tiny models' next-token predictions as a sort of pre-prediction for a large model. The large one then just has to verify whether it would have made that same prediction (which is apparently a lot cheaper than predicting from scratch) and then use it. It essentially means large-model quality at the speed of a tiny model. Mind blowing, literally. AMD seems to be getting up to 4x tokens/sec from 7-8B models when a tiny model is used to pre-predict.

BestCodes

Yeah, it's not the greatest for facts, but for chatting or writing code it's pretty nice. The 4b and 30b Qwen3 models are also very impressive, especially for their size. The Qwen architecture is improving a lot too, so they require less and less memory to run!

Alternate Existance

good info :)