DEV Community

Qwen 3 Benchmarks, Comparisons, Model Specifications, and More

BestCodes on May 01, 2025

Qwen3: Alibaba's Latest Open-Source AI Model Qwen3 is the latest generation of large language models (LLMs) from Alibaba Cloud. Built by...
Nevo David

kinda insane how fast this stuff moves - i'm always trying to keep up but dang, it changes every week

BestCodes

Yep, Phi 4 Reasoning came out right after I wrote this, and now I have to do one on it 🤭

RedDragen

Qwen3 is... highly overrated!

Here's my case.
I wanted to build some visuals in code based on a description of the visual. Qwen3 required many, many iterations before it was able to present me with something that was - with a lot of squinting and frowning - barely what I meant for it to build.

Then I tried the exact same thing with the exact same prompts in Gemini 2.5 Pro. It did a near-perfect job in a single shot and only required a few minor revisions to make it exactly as I intended.

Now does that mean Qwen3 is bad and Gemini 2.5 Pro is good? Yes and no.
For specific requests Qwen3 is probably on par, or at least able to compete. But when it actually needs to reason about the thing you ask it to do, it's more accurate to describe its "cognitive skills" as those of a junior student, whereas Gemini's reasoning truly borders on final-exam-student level. Meaning both are good models, but Gemini is just better at "thinking" and thus gives better results where that skill is needed.

BestCodes

Gemini 2.5 Pro isn't open source. I'm excited about Qwen3 because it is one of the top performing open source models right now 🙂

RedDragen • Edited

Agreed! And I use Qwen happily :) Well, still the 2.5 Coder one; it's awesome!

My point was more that its thinking ability is... OK at best.
I haven't tried it yet, but I assume Llama 4 is better. And DeepSeek R1 definitely is better; that one I did try, just not on this example case.

BestCodes

Yeah, the problem with models like DeepSeek R1 and Llama 4 is that they are huge and usually have to run on a remote server or special hardware. CPU-only devices and typical laptops either can't run them or run them really slowly 🥲

Qwen 3 has a 0.6b variant which is super nice because it can run locally even on very small or weak devices. The size-to-performance ratio with Qwen models is great.
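(For anyone wanting to try this, here's a minimal sketch of running the small variant locally with Ollama. The `qwen3:0.6b` tag is an assumption on my part; check the Ollama model library for the exact name.)

```shell
# Pull the smallest Qwen 3 variant (tag name assumed; verify it
# against the Ollama model library before running).
ollama pull qwen3:0.6b

# Chat with it locally; this runs on CPU if no GPU is available.
ollama run qwen3:0.6b "Explain attention in one sentence."
```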

RedDragen

> Qwen 3 has a 0.6b variant which is super nice because it can run locally even on very small or weak devices. The size-to-performance ratio with Qwen models is great.

Oh boy, you're hallucinating :P

Think of these tiny models as a kid with the potential to be at Einstein level, but with a massive amount of amnesia (it has forgotten 90% of what it learned, or more). It can talk and form sentences, but the responses are mostly close to useless. It can do some simple things well, but that's about it.

In my testing I found that anything below 3B (not just Qwen, all the models out there) is mediocre at best, with 3B itself being the transition point where response quality improves by leaps and bounds.

But tiny models have great value! For example, AMD recently seems to be using tiny models' next-token predictions as a sort of pre-prediction for a large model. The large one then just has to verify whether it would have made that same prediction (which is apparently a lot cheaper than predicting from scratch) and then use it. It essentially means large-model quality at the speed of a tiny model. Mind blowing, literally. AMD seems to be getting up to 4x tokens/sec from 7-8B models when a tiny model is used to pre-predict.

BestCodes

Yeah, it's not the greatest for facts, but for chatting or writing code it's pretty nice. The 4b and 30b Qwen3 models are also very impressive, especially for their size. The Qwen architecture is improving a lot too, so they require less and less memory to run!

Alternate Existance

good info :)