770 Experiments to Squeeze 30 tok/s Out of a 35B MoE Model on a $500 GPU
29.899 tokens per second. A 35-billion parameter model. An NVIDIA RTX 3070 with 8GB of VRAM. A $500 GPU you can buy at Micro Center.
That number — 29.9 tok/s — is the result of 770 experiments across 12 phases. We started at 6.1 tok/s with naive settings. We ended at nearly 30. That's a +387% improvement, and every percentage point was earned through systematic search, not guesswork.
This article is the full story: what we tried, what worked, what didn't, and the exact configuration you can copy to reproduce our results.
Why This Matters
The conventional wisdom says: big models need big GPUs. A 35B parameter model "should" need 40–70GB of VRAM. Cloud inference costs $0.50–$2.00 per million tokens. Local inference on consumer hardware is supposed to be impractical beyond 7–13B models.
Mixture-of-Experts (MoE) architectures change this calculus. Qwen3.5-35B-A3B has 35 billion total parameters but only activates 3 billion per token. The rest sit dormant — which means if you're clever about what lives in VRAM and what lives in RAM, you can run a frontier-class model on hardware that costs less than a month of cloud API bills.
The question was never "can it run?" — it was "can it run fast enough to be useful?" At 6 tok/s, it's a curiosity. At 30 tok/s, it's a daily driver.
The Model: Qwen3.5-35B-A3B
Qwen3.5-35B-A3B is Alibaba's latest MoE release. The architecture:
- 35B total parameters across multiple expert groups
- 3B active parameters per token (roughly 8% activation ratio)
- Competitive with dense 30B+ models on reasoning benchmarks while being dramatically cheaper to run
We used the IQ2_XXS GGUF quantization, which compresses the model to 9.6GB on disk. This is aggressive — 2-bit quantization with importance-weighted rounding — but the MoE architecture is surprisingly resilient to quantization because most parameters are expert weights that activate sparsely.
At 9.6GB, the model technically doesn't fit entirely in 8GB of VRAM. But it doesn't have to. That's where the optimization story begins.
The Methodology: Autonomous Autoresearch
We didn't sit down with a spreadsheet and plan 770 experiments. We built an autonomous research loop — a system that:
- Proposes parameter configurations based on prior results
- Runs each experiment with llama.cpp's CUDA backend
- Measures throughput (tok/s), VRAM usage, and generation quality
- Analyzes results statistically
- Proposes the next batch of experiments
No pre-specified experimental matrix. No hand-tuning. The system explored the parameter space systematically across multiple sessions, finding configurations a human would never try — and discovering failure modes we never would have predicted.
770 experiments. 12 distinct phases. Multiple breakthrough moments. All driven by data.
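To make the loop concrete, here is a minimal sketch of that propose/run/analyze cycle. The benchmark call is simulated (a stand-in for launching llama.cpp and parsing its reported tok/s), and the hill-climbing proposer is a deliberate simplification of the real system:

```python
import random
import statistics

def run_experiment(config):
    """Simulated benchmark: a stand-in for running llama.cpp and
    parsing tok/s. Peaks once n_gpu_layers reaches 27."""
    base = 6.0 + 0.9 * min(config["n_gpu_layers"], 27)
    return base + random.uniform(-0.3, 0.3)

def propose_next(history, step=2):
    """Hill-climb: mutate the best configuration seen so far."""
    best = max(history, key=lambda r: r["tok_s"])
    cand = dict(best["config"])
    cand["n_gpu_layers"] = max(0, cand["n_gpu_layers"] + random.choice([-step, step]))
    return cand

random.seed(0)
history = []
config = {"n_gpu_layers": 0, "threads": 8}
for _ in range(40):
    runs = [run_experiment(config) for _ in range(3)]  # 3 repeats per config
    history.append({"config": config, "tok_s": statistics.mean(runs)})
    config = propose_next(history)

best = max(history, key=lambda r: r["tok_s"])
print(best["config"], round(best["tok_s"], 1))
```

The real loop obviously proposes across many parameters at once; the point is the structure: measure with repeats, keep a history, and let the history drive the next batch.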
The Journey: Phase by Phase
Phase 3 — Baseline: 6.1 tok/s
Naive settings. Default thread count, minimal GPU offloading. The model runs, but barely. At this speed, generating a 500-token response takes over 80 seconds. Usable for experimentation, not for work.
Phase 4 — First GPU Offload Sweep: 11.850 tok/s
The first big lever: n_gpu_layers. This controls how many transformer layers live in VRAM vs. system RAM. GPU memory bandwidth is ~10x faster than DDR4, so every layer you can fit on the GPU matters enormously.
With n_gpu=16 layers offloaded to the RTX 3070, throughput nearly doubled. The autoresearch loop found this optimal layer count by sweeping from 0 to 20 in steps of 1, measuring each configuration three times for statistical reliability.
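A sweep like this is straightforward to reproduce with `llama-bench`, the benchmarking tool that ships with llama.cpp. The sketch below builds the 21 command lines (the model path is a placeholder); flip `dry_run` to actually execute them:

```python
import shlex
import subprocess

# Placeholder path; substitute your local GGUF file.
MODEL = "qwen3.5-35b-a3b-IQ2_XXS.gguf"

def sweep_commands(max_layers=20, repeats=3):
    """One llama-bench invocation per n_gpu_layers value, 0..max_layers."""
    return [
        ["./llama-bench", "-m", MODEL, "-ngl", str(n), "-r", str(repeats)]
        for n in range(max_layers + 1)
    ]

def run_sweep(dry_run=True):
    for cmd in sweep_commands():
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)

run_sweep()  # dry run: prints the commands without executing them
```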
Phase 8 — Incremental Gains: 12.021 tok/s
Phases 5–8 explored secondary parameters: batch sizes, context lengths, thread counts. Small gains. The system was methodically eliminating dead ends and confirming that GPU layer count was the dominant variable.
Phase 10 — The n_gpu=17 Breakthrough: 12.331 tok/s
One more layer on the GPU. It seems trivial, but this was a boundary condition — the system found that layer 17 barely fit within the 8GB VRAM budget when combined with KV cache and activation memory. Pushing to 18 caused OOM. The autoresearch loop discovered this edge precisely.
Phase 11 — Quantization Breakthrough: 21.621 tok/s
+75% overnight. This was the single biggest jump in the entire project.
The insight: switching from Q3_K_M quantization to IQ2_M dramatically reduced model size, freeing VRAM for more GPU layers. More layers on GPU means polynomially more throughput (we'll quantify this later).
IQ2_M uses importance-weighted 2-bit quantization. Perplexity increased only marginally (1.3138 vs 1.3073 for Q3_K_M on our test set). The quality-to-speed tradeoff was extraordinary — IQ2_M is the Pareto optimal choice for quality-adjusted throughput.
Phase 12 — The Summit: 29.899 tok/s
The final configuration combined everything we'd learned:
- IQ2_XXS quantization (even more aggressive than IQ2_M, 9.6GB on disk)
- n_gpu=27 layers on GPU (the smaller model footprint freed massive VRAM headroom)
- threads=8 (not 16 — more on this below)
- batch=32/16 (prompt processing / generation)
- flash_attn=1 + op_offload=1 (MoE expert offloading)
- KV cache in q8_0 (TurboQuant-style compression)
29.899 tok/s. +387% from baseline. At this speed, generating 500 tokens takes 17 seconds. That's faster than most people read.
The Four Techniques
1. Flash MoE Expert Offloading
Inspired by Apple's research on flash-memory inference for MoE models. The key idea: MoE expert weights that aren't active for the current token can be offloaded from VRAM, freeing space for more transformer layers.
In llama.cpp, this maps to --flash-attn 1 and --op-offload 1. Together, they allow the runtime to dynamically manage expert residency, keeping hot experts in VRAM and cold experts in system RAM.
Impact: enabled fitting 27 layers on GPU instead of 17. This alone accounts for the majority of the Phase 11→12 throughput gain.
2. TurboQuant KV Compression
Based on Google's work on quantized KV caches. Instead of storing key-value cache entries in FP16 (2 bytes per element), we compress to q8_0 (roughly 1 byte per element with block-wise scaling).
Impact: reduced KV cache VRAM footprint by ~50%, freeing additional headroom for model layers. Quality impact at short-to-medium context lengths: negligible.
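A back-of-envelope calculation shows where that ~50% comes from. The layer and head counts below are illustrative assumptions, not the model's published config; the 1.0625 bytes/element for q8_0 reflects llama.cpp's 32-element blocks with one fp16 scale each:

```python
# KV cache sizing, fp16 vs q8_0. Layer/head dims are hypothetical.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem  # 2x: keys + values

ctx = 4096
fp16 = kv_cache_bytes(48, 4, 128, ctx, 2.0)
# q8_0: 32 one-byte values + one 2-byte fp16 scale per block -> 34/32 bytes/elem
q8_0 = kv_cache_bytes(48, 4, 128, ctx, 1.0625)

print(f"fp16: {fp16 / 2**20:.0f} MiB, q8_0: {q8_0 / 2**20:.0f} MiB "
      f"({1 - q8_0 / fp16:.0%} smaller)")
```

The exact savings (about 47% with these numbers) depends only on the storage format, not the model dims, so the "~50%" figure holds across context lengths.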
3. PolarQuant (Polar Decomposition)
PolarQuant decomposes weight matrices into rotation and scaling components, enabling more efficient quantization by separating the "direction" and "magnitude" of weight information.
We implemented and tested PolarQuant in Phase 7. The theoretical promise is strong: better preservation of weight structure at low bit-widths.
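To see what "rotation and scaling components" means concretely, here is the polar decomposition worked out in closed form for a 2x2 matrix with positive determinant. This illustrates the decomposition itself, not PolarQuant's actual kernel:

```python
import math

def polar_2x2(a11, a12, a21, a22):
    """A = Q @ H: Q is a pure rotation ('direction'), H is symmetric
    positive-definite ('magnitude'). Closed form valid for det(A) > 0."""
    assert a11 * a22 - a12 * a21 > 0, "closed form shown only for det > 0"
    c, s = a11 + a22, a21 - a12
    norm = math.hypot(c, s)
    # Orthogonal polar factor: normalize A + adj(A)^T, which is a scaled rotation
    q = [[c / norm, -s / norm], [s / norm, c / norm]]
    # Scaling factor H = Q^T @ A
    h = [
        [q[0][0] * a11 + q[1][0] * a21, q[0][0] * a12 + q[1][0] * a22],
        [q[0][1] * a11 + q[1][1] * a21, q[0][1] * a12 + q[1][1] * a22],
    ]
    return q, h

q, h = polar_2x2(2.0, 1.0, 0.0, 1.0)
print("Q =", q)
print("H =", h)
```

In the general case Q comes from an SVD (Q = U Vᵀ), which is exactly the per-forward-pass cost that makes the technique slow without dedicated kernels.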
4. QJL (Quantized Johnson-Lindenstrauss)
QJL applies random projection to compress KV cache entries, exploiting the Johnson-Lindenstrauss lemma to preserve pairwise distances in lower dimensions.
We implemented and tested QJL in Phase 7 alongside PolarQuant.
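The underlying JL trick is easy to demonstrate: push two vectors through a random sign projection and compare distances before and after. The dimensions here are illustrative, not QJL's actual settings:

```python
import math
import random

random.seed(42)
d, k = 256, 128  # illustrative dimensions: project 256 -> 128

# Random +/-1 projection matrix, scaled by 1/sqrt(k)
proj = [[random.choice((-1.0, 1.0)) / math.sqrt(k) for _ in range(d)]
        for _ in range(k)]

def project(v):
    return [sum(row[i] * v[i] for i in range(d)) for row in proj]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u = [random.gauss(0, 1) for _ in range(d)]
v = [random.gauss(0, 1) for _ in range(d)]

ratio = dist(project(u), project(v)) / dist(u, v)
print(f"distance preserved within a factor of {ratio:.3f}")
```

The catch is visible in `project`: every KV cache entry pays a dense matrix-vector product, which is why QJL is impractical without fused kernels.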
The PolarQuant/QJL Finding: A Research Gap
Here's where the story gets interesting. Both PolarQuant and QJL are theoretically sound techniques with published results showing quality improvements at low bit-widths. Our experiments confirmed the theory — the math works.
But in practice, both were 250–16,000x slower than baseline without dedicated CUDA kernels.
PolarQuant requires a polar decomposition (SVD-like) for each forward pass through quantized layers. QJL requires random projection matrix multiplications for every KV cache operation. On CPU, these operations dominate inference time so completely that any quality gains are irrelevant.
This is a research gap, not a research failure. The techniques work. They need hardware acceleration. Specifically:
- PolarQuant needs fused CUDA kernels for polar decomposition during dequantization
- QJL needs fused random projection kernels integrated into the attention mechanism
We've documented the exact performance profiles and bottlenecks. If you're building CUDA kernels for quantized inference, these are two techniques worth accelerating. The theoretical foundation is solid; the engineering gap is well-defined.
The Power Law: Why VRAM Scaling Is Non-Linear
After collecting 770 data points, we built AutoInfer — an analysis framework to model the relationship between VRAM usage and throughput.
The best fit is a power law:
tok/s = 9.81e-6 × (42 - model_size_gb)^4.247
Where model_size_gb is the portion of the model residing in system RAM (i.e., not on GPU), and 42 represents the total effective memory budget.
The exponent α ≈ 4.25 is the key finding. This means throughput scales with roughly the fourth power of available VRAM headroom. Moving 1GB of model from RAM to VRAM doesn't give you a linear speedup; it gives you a polynomial one.
This explains why aggressive quantization (IQ2_XXS → smaller model → more fits on GPU) produced such outsized gains. Every gigabyte saved by quantization is amplified 4x by the VRAM scaling law.
Practical implication: for MoE models on constrained hardware, optimizing model size is more important than optimizing any other parameter. A 10% reduction in model size can yield a 40%+ throughput improvement.
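Plugging the fitted curve into a function makes the leverage obvious. The operating points below are arbitrary examples, not measured configurations:

```python
# The article's fitted power law as a function. model_in_ram_gb is the
# portion of the model left in system RAM; 42 is the fitted budget term.
ALPHA = 4.247
COEFF = 9.81e-6

def predicted_tok_s(model_in_ram_gb):
    return COEFF * (42 - model_in_ram_gb) ** ALPHA

# Hypothetical: moving 2 GB of weights from RAM to VRAM at one operating point
before = predicted_tok_s(12.0)
after = predicted_tok_s(10.0)
print(f"{before:.1f} -> {after:.1f} tok/s (+{after / before - 1:.0%})")
```

A ~7% change in the headroom term yields a ~30% throughput change here, which is the power law's amplification at work.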
The Recipe: Optimal Configuration
For an RTX 3070 (8GB) with 16GB system RAM running Qwen3.5-35B-A3B:
```shell
./llama-cli \
  -m qwen3.5-35b-a3b-IQ2_XXS.gguf \
  -ngl 27 \
  -t 8 \
  -b 32 \
  -ub 16 \
  --flash-attn \
  --op-offload 1 \
  -ctk q8_0 \
  -ctv q8_0 \
  -c 4096
```
Key parameters explained:
| Parameter | Value | Why |
|---|---|---|
| `-ngl` | 27 | Maximum layers that fit in 8GB with IQ2_XXS |
| `-t` | 8 | Thread count; NOT 16 (see below) |
| `-b` / `-ub` | 32 / 16 | Prompt batch / generation batch |
| `--flash-attn` | on | Enables flash attention for memory efficiency |
| `--op-offload 1` | on | MoE expert offloading |
| `-ctk` / `-ctv` | q8_0 | Quantized KV cache |
| `-c` | 4096 | Context length (increasing it reduces tok/s) |
The 16-Thread Trap
One of the most counterintuitive findings: using 16 threads instead of 8 drops throughput from 29.9 to 3.7 tok/s. That's a catastrophic 8x slowdown.
Statistical significance: z-score of -6.0. This is not noise.
The cause: thread contention on the memory bus. With 16 threads all competing for DDR4 bandwidth to load expert weights from system RAM, the CPU spends more time waiting for memory than computing. 8 threads saturate the useful bandwidth; 16 threads create destructive interference.
We would never have found this without systematic sweeps. The intuition — "more threads = more speed" — is dangerously wrong for memory-bound MoE inference. Always benchmark your thread count.
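For reference, a finding like this gets flagged with a simple two-sample z statistic over repeated runs. The throughput samples below are hypothetical (which is why the resulting |z| is far larger than the article's 6.0), but the mechanics are the same:

```python
import math
import statistics

# Hypothetical repeated-run samples for the two thread counts
threads_8 = [29.7, 29.9, 30.1, 29.8, 30.0]
threads_16 = [3.6, 3.7, 3.8, 3.7, 3.7]

def z_score(a, b):
    """Two-sample z statistic: mean difference over its standard error."""
    se = math.sqrt(statistics.variance(a) / len(a) +
                   statistics.variance(b) / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / se

z = z_score(threads_8, threads_16)
print(f"z = {z:.1f}")
```

With only a handful of repeats per configuration, this is how the loop separates real regressions from run-to-run noise.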
Practical Guide: Matching Settings to Your Hardware
If you have more VRAM (12GB+, e.g., RTX 3080/4070 Ti):
- Increase `-ngl` until you approach the VRAM limit (monitor with `nvidia-smi`)
- Consider IQ2_M instead of IQ2_XXS for better quality at minimal speed cost
- Expect 40–60+ tok/s based on the power-law scaling
If you have less VRAM (6GB, e.g., RTX 2060):
- Reduce `-ngl` to 18–20
- Stick with IQ2_XXS
- Expect 12–18 tok/s
If you prioritize quality over speed:
- Use IQ2_M (Pareto optimal for quality-adjusted throughput)
- Accept ~22 tok/s instead of ~30
- PPL difference is minimal (1.3138 for IQ2_M; IQ2_XXS is only slightly higher)
If you have 32GB+ RAM:
- The extra system RAM helps with longer context lengths
- KV cache overflow to RAM is less painful
- Consider `-c 8192` or higher
Conclusion
770 experiments. 12 phases. +387% improvement. One RTX 3070.
The takeaways:
- MoE models are uniquely suited to consumer hardware — 35B params with 3B active is the sweet spot for VRAM-constrained inference
- Aggressive quantization pays polynomial dividends — the α=4.25 power law means every byte saved on model size is amplified dramatically
- Systematic search beats intuition — the 16-thread catastrophe, the n_gpu=17→27 leap via quantization, the PolarQuant/QJL gap — none of these would emerge from manual tuning
- The research gap is real — PolarQuant and QJL need CUDA kernels to become practical, and the community should build them
- Local inference at 30 tok/s on a $500 GPU is production-viable — not a demo, not a toy, a daily driver
The age of "you need an A100 for a real model" is over. The age of "you need to know what you're doing" has begun.
All experiment data and analysis code: github.com/clawinfra/qwen35-moe-offload
Published by the ClawInfra Team. Built with llama.cpp and a lot of patience.