David
qwen3.6-27b scores 77.2% on SWE-bench. the dense model is winning against MoE.

When Alibaba released Qwen3.6-35B-A3B, the MoE (Mixture of Experts) design stole all the headlines. 35 billion parameters, 3 billion activated per token — everyone's been focused on that ratio.

Then they dropped Qwen3.6-27B. A plain old dense model. 27 billion parameters, all active.

On SWE-bench Verified, the 27B dense scores 77.2%. The 35B MoE scores 73.4%. The dense model is outperforming the MoE by nearly 4 points — on the benchmark that measures real software engineering capability.

what SWE-bench actually measures

SWE-bench gives an LLM a real GitHub issue and a codebase. It has to understand the problem, find the right files, write the fix, and get the tests to pass. It's not multiple choice — it requires actual coding.
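The pass/fail loop can be sketched in a few lines: write out the model's proposed edits, then run the held-out tests. This is an illustrative toy, not the official SWE-bench harness — `solve_issue` and its arguments are hypothetical names:

```python
import pathlib
import subprocess

def solve_issue(repo_dir: str, edits: dict[str, str], test_cmd: list[str]) -> bool:
    """Write the model's proposed file contents, then run the tests.

    SWE-bench's criterion: the task counts as solved only if the
    held-out tests pass after the model's edit is applied.
    """
    repo = pathlib.Path(repo_dir)
    for rel_path, new_text in edits.items():  # model-proposed file rewrites
        (repo / rel_path).write_text(new_text)
    tests = subprocess.run(test_cmd, cwd=repo_dir)
    return tests.returncode == 0  # exit code 0 == tests pass
```

The key point is the binary criterion: a plausible-looking patch that doesn't make the tests pass scores zero.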

Qwen3.6-27B at 77.2% puts it in range of proprietary models. Claude Opus 4.5 scores 80.9%. The gap is real but narrowing — and Qwen3.6-27B does it on your own GPU under Apache 2.0.

why is the dense model winning?

Two factors seem to be driving this:

1. Full parameter utilization. In a MoE model like the 35B-A3B, only 3B of 35B parameters are active per token. The routing layer decides which experts to use. This is efficient for inference speed, but the model can't "use" all of its knowledge simultaneously. A dense model can activate its full capacity for harder reasoning tasks.

2. Architecture: Gated DeltaNet. Qwen3.6-27B isn't a vanilla dense transformer. It uses a Gated DeltaNet + Gated Attention hybrid — alternating layers of linear-gated attention (DeltaNet) with standard gated attention. DeltaNet processes information in compressed deltas rather than full representations, which lets it handle long contexts more efficiently while maintaining reasoning depth.
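The delta-rule update at the heart of DeltaNet-style linear attention can be sketched in a few lines. This is a simplified textbook version of the delta rule, not Qwen's actual layer (which adds gating and runs alongside standard attention):

```python
import numpy as np

def delta_update(S, k, v, beta):
    """Delta-rule state update used by DeltaNet-style linear attention.

    Instead of appending a full key/value pair, the fixed-size memory S
    is corrected by the *delta* between the new value and what S already
    predicts for this key -- hence "compressed deltas".
    S: (d_v, d_k) memory matrix, k: (d_k,) key, v: (d_v,) value,
    beta: write strength in [0, 1].
    """
    prediction = S @ k                         # what the memory currently returns for k
    return S + beta * np.outer(v - prediction, k)

# With a unit-norm key and beta=1, the update overwrites that slot exactly:
rng = np.random.default_rng(0)
S = rng.normal(size=(3, 4))
k = rng.normal(size=4); k /= np.linalg.norm(k)
v = rng.normal(size=3)
S_new = delta_update(S, k, v, beta=1.0)
assert np.allclose(S_new @ k, v)               # reading back k now yields v
```

Because the state stays a fixed (d_v, d_k) matrix regardless of sequence length, cost per token is constant — which is what makes the long-context claims plausible.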

The result is a model that can do 262K context natively (extendable to 1M tokens) while still being a strong coder.
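The routing from point 1 can be sketched as a toy top-k MoE layer. This is illustrative only — real routers are trained jointly with the experts and differ in normalization details:

```python
import numpy as np

def moe_forward(x, gate_W, experts, top_k=2):
    """Toy top-k MoE layer: the router scores every expert per token,
    but only the top_k highest-scoring experts actually run.

    This is why a 35B-A3B model computes with only ~3B parameters per
    token: most expert weights sit idle on any given forward pass.
    x: (d,) token activation, gate_W: (n_experts, d) router,
    experts: list of (d, d) expert weight matrices.
    """
    scores = gate_W @ x
    chosen = np.argsort(scores)[-top_k:]   # indices of the active experts
    weights = np.exp(scores[chosen])
    weights /= weights.sum()               # softmax over the chosen experts only
    return sum(w * (experts[i] @ x) for w, i in zip(weights, chosen))
```

A dense layer is the degenerate case `top_k = n_experts`: every parameter participates in every token, which is the "full utilization" the article credits for the harder-reasoning wins.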

the benchmark breakdown

| Task | Qwen3.6-27B (dense) | Qwen3.6-35B-A3B (MoE) | Gap |
| --- | --- | --- | --- |
| SWE-bench Verified | 77.2 | 73.4 | +3.8 |
| SWE-bench Pro | 53.5 | 49.5 | +4.0 |
| Terminal-Bench 2.0 | 59.3 | 51.5 | +7.8 |
| SkillsBench Avg5 | 48.2 | 28.7 | +19.5 |
| QwenWebBench | 1487 | 1397 | +90 |
| NL2Repo | 36.2 | 29.4 | +6.8 |

Terminal-Bench (real terminal operations) and SkillsBench show the largest gaps. These are tasks where the model needs to chain together multiple operations — the kind of thing where full parameter access seems to matter most.

the tradeoff

Dense models aren't free. The 27B activates all 27B parameters per forward pass. The 35B MoE activates only 3B. During inference:

  • 35B MoE is faster per token (3B vs 27B compute)
  • 35B MoE uses less memory for the active computation (but total disk/loaded size is still large)
  • 27B dense is better at hard coding tasks (SWE-bench, terminal operations)

If you're doing simple chat, the MoE will be faster. If you're running an agent that needs to reason through a complex codebase, the dense model shows real advantages.
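The speed gap can be put on a rough scale with the standard approximation of ~2 FLOPs per active parameter per decoded token (a multiply and an add). This ignores attention over the KV cache and memory bandwidth, so treat it as an order-of-magnitude guide:

```python
def flops_per_token(active_params: float) -> float:
    """Rough decode-time estimate: ~2 FLOPs per active parameter per
    token. Assumes compute-bound decoding and ignores attention over
    the KV cache, so this is an order-of-magnitude guide only."""
    return 2 * active_params

dense = flops_per_token(27e9)  # Qwen3.6-27B: all 27B parameters active
moe = flops_per_token(3e9)     # Qwen3.6-35B-A3B: ~3B active per token
print(dense / moe)             # prints 9.0 -- the MoE does ~9x less compute per token
```

So the MoE buys roughly a 9x reduction in per-token compute, and the question is whether that speed is worth the benchmark gaps above for your workload.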

vision included

Qwen3.6-27B is an image-text-to-text model. The vision encoder is built in. That means you can screenshot a UI and ask it to fix the bug, read a diagram and explain the architecture, or debug from screenshots. The 35B MoE is text-only.
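If you run the model behind Ollama, a screenshot request might be built like this. The payload shape follows Ollama's `/api/generate` endpoint, which takes images as base64 strings in an `images` list; the model name and prompt here are placeholders:

```python
import base64

def build_vision_request(model: str, prompt: str, image_path: str) -> dict:
    """Build an Ollama-style /api/generate payload with an attached
    screenshot. POST the returned dict as JSON to
    http://localhost:11434/api/generate (Ollama's default address).
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,                     # e.g. the local tag you pulled
        "prompt": prompt,                   # e.g. "what's causing this layout bug?"
        "images": [image_b64],              # Ollama expects base64, not raw bytes
    }
```

From there it's one `POST` with any HTTP client; the model sees the screenshot alongside the prompt.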

running it

ollama run qwen3.6-27b

With Locally Uncensored, you also get image input, a built-in code agent, and fully local outputs:

git clone https://github.com/PurpleDoubleD/locally-uncensored
cd locally-uncensored && npm run tauri dev

The MoE vs dense debate isn't settled. But on coding agent tasks, Qwen3.6-27B is making a strong case that raw parameter count isn't everything — architecture and full utilization matter too.

Locally Uncensored — AGPL-3.0 license.
