DEV Community

Jovan Chan
Jovan Chan

Posted on • Originally published at runaihome.com

Qwen3-Coder-Next for Local AI in 2026: Which GPU Can Actually Run Alibaba's #1 Coding Agent?

This article was originally published on runaihome.com

TL;DR: Qwen3-Coder-Next is an 80B Mixture-of-Experts model that activates only 3 billion parameters per token, scoring 71.3% on SWE-bench Verified — competitive with closed-source frontier models. The catch is raw memory: the Q4_K_M GGUF weighs 48.7 GB, so you need either dual 24 GB cards, a Mac Studio with 64 GB+ unified memory, or a single RTX 5090 with aggressive RAM assist. A solo RTX 4090 can technically run it at IQ2 quality, but that is a different model from what the benchmarks describe.

Dual RTX 3090 Mac Studio M4 Max 64 GB RTX 5090 + 128 GB DDR5
Best for Budget VRAM-poolers Plug-and-play reliability CUDA tools, multi-user serving
Total VRAM / Memory 48 GB combined 64 GB unified 32 GB + RAM overflow
Practical quant IQ4_XS (42.8 GB) Q4_K_M (48.7 GB) Q3_K_M (36.7 GB) GPU-only
Throughput (32K ctx) ~33 tok/s ~30–45 tok/s (est.) ~60–80 tok/s (est.)
Cost (May 2026) ~$2,400 pair avg $1,999+ (see Apple) ~$3,658 market avg

Honest take: The Mac Studio M4 Max 64 GB is the most friction-free path for a solo developer. It runs Q4_K_M without juggling dual-card power budgets and draws ~80 W at load. If you are already invested in CUDA and have two RTX 3090s, the dual-card route works. What does not work: buying a single RTX 4090 specifically for this model.


Why This Model Is Different From Every Other 80B

Qwen3-Coder-Next launched on February 4, 2026 from Alibaba's Qwen team under the Apache 2.0 license. On paper it is an 80-billion-parameter model. In practice, the compute profile of each token generation step resembles a 3B dense model.

The architecture is a Mixture-of-Experts with 512 experts, 10 activated per forward pass plus one shared expert, using a hybrid attention mechanism (Gated DeltaNet + Gated Attention). The router decides which 10 experts handle each token. The other 502 experts sit in memory, dormant. This is why the model scores 44.3% on SWE-bench Pro — beating DeepSeek-V3.2's 40.9% — while activating only 3B parameters per step, roughly 0.4% of its total weight.

For home lab hardware, this architecture creates a specific constraint set that is different from a dense 70B model:

  • Memory to hold everything: All 80B weights must be addressable because the router may call any expert. You need the storage of an 80B model.
  • Compute per token is cheap: Only 3B parameters participate per step, so token generation is fast once the weights are loaded.
  • CPU offloading stings more: If frequently-called expert layers end up in system RAM instead of VRAM, you pay the PCIe bandwidth penalty on every token. With a dense model, the same layer is always hit in sequence; with MoE, the access pattern is less predictable.

The native context window is 262,144 tokens. The model was fine-tuned on 800,000+ verifiable coding tasks and is designed specifically for agentic workflows — multi-step edits, tool calls, and error recovery loops that a single chat-style generation cannot handle.


The Benchmark Numbers in Context

On SWE-bench Verified, Qwen3-Coder-Next scores 71.3% with OpenHands, 71.1% with MiniSWE-Agent, and 70.6% with SWE-Agent. This edges out DeepSeek-V3.2 at 70.2%.

On SWE-bench Pro — the harder, longer-horizon benchmark — the model scores 44.3%, against DeepSeek-V3.2's 40.9% and GLM-4.7's 40.6%.

The remarkable part is not the number itself but the denominator. Models that score comparably on SWE-bench typically have 30B+ active parameters. Qwen3-Coder-Next achieves this with 3B active, which translates directly to lower inference cost, faster token generation at equivalent VRAM, and the ability to run on consumer hardware that would otherwise need a 30B-class dense model.


VRAM Math: What Each Quantization Weighs

These are GGUF file sizes from the Bartowski repo on Hugging Face. This is what must fit in your combined memory (VRAM + any CPU RAM offload):

Quantization File Size Where it fits
IQ2_XXS 19.3 GB Single RTX 4090 (24 GB), comfortable
IQ2_S 23.4 GB Single RTX 4090 / RTX 5090 (32 GB)
IQ2_M 26.1 GB RTX 5090 with headroom
IQ3_XXS 31.7 GB RTX 5090, minimal margin
Q3_K_M 36.7 GB RTX 5090 with ~5 GB to spare
Q3_K_XL 38.5 GB RTX 5090, tight
IQ4_XS 42.8 GB Dual RTX 3090, Mac 64 GB
Q4_K_M 48.7 GB Mac Studio 64 GB, large RAM rigs
Q8_0 ~84.8 GB Mac Studio 128 GB, enterprise GPUs

The Q4_K_M at 48.7 GB is the practical quality ceiling for most home lab setups. Going to Q8 (84.8 GB) requires either a 128 GB Mac Studio or enterprise hardware. The IQ2 range is usable but the model loses coherence on complex, multi-file agentic tasks — the kind of work that makes this model worth running in the first place.

Note that actual runtime memory usage will exceed the file size once you add the KV cache for your context window. At 32K tokens of context, budget roughly 4–6 GB of overhead on top of the model weights.


Hardware by Budget Tier

Single consumer GPU (16–24 GB VRAM)

Cards: RTX 4090 (24 GB), RTX 5080 (16 GB)

IQ2_XXS (19.3 GB) fits on a single RTX 4090 with 4 GB to spare. At 2-bit quantization, the model is technically running but the quality gap versus Q4 is significant for agentic coding: long dependency chains, unfamiliar APIs, and multi-file edits all suffer. You will notice it within the first hour of real use.

IQ3 variants (31–38 GB) require CPU RAM offloading on any 24 GB card. With 64 GB of DDR5 and llama.cpp, this works — layers overflow to system RAM automatically. The problem is throughput. PCIe 5.0 tops out around 64 GB/s in each direction; the bandwidth bottleneck on frequently-accessed experts will push single-digit tokens per second for those layers.

If you have a single RTX 4090, the honest recommendation is either the Qwen3-Coder-30B (3B active, fits in 24 GB at Q4, scores around 64% on SWE-bench Verified) or use RunPod to access the full model without buying new hardware.

Dual RTX 3090 (48 GB VRAM combined)

Cards: 2× RTX 3090 (24 GB × 2)

Two used RTX 3090s average around $1,200 each on eBay completed listings (May 2026 range: $895–$1,477 per card), putting the pair at roughly $2,400 at average prices. The second card does not need NVLink — a PCIe x4 slot is sufficient for LLM inference because llama.cpp distributes complete layers, not tensor slices, between GPUs.

IQ4_XS (42.8 GB) fits across both cards with 5 GB headroom. That margin matters for context: at 32K tokens of context with KV cache overhead, you are right at the limit. At 65K context, plan to drop to IQ3_K variants.

Throughput on dual RTX 3090 with Q4_K_XL at 32K context: approximately 33 tok/s. At 131K context, this drops to around 25 tok/s as attention computation scales with sequence length. For agentic coding — where the model calls a tool, waits for output, then processes the result — 25–33 tok/s is usable. You are not waiting on the model; you are waiting on build pipelines and test runners.

The power cost is real: each RTX 3090 draws up to 350 W under load, and both cards run hot during generation. A 850 W PSU is the minimum comfortable spec. See the PSU sizing guide for the full calculation. For total cost of ownership including power bills, the 24/7 AI server cost breakdown has the math.

Single RTX 5090 (32 GB VRAM, 1,792 GB/s bandwidth)

Cards: RTX 5090 (32 GB GDDR7)

The RTX 5090 has an MSRP of $1,999 but trades at around $3,658 on the open market as of May 2026, driven by GDDR7 supply constraints. That price premium hurts the value case — but the bandwidth is genuinely different.

At 1,792 GB/s, the 5090 has 77% more memory ban

Top comments (0)