My laptop has an RTX 4060. 8GB of VRAM. It's the spec people call "the short straw" for running local LLMs.
Still, I wanted to run a 32B model. I'd tried the 7B class. It works. But when you use it for coding assistance, you start running into quality issues. On the other hand, hitting an API racks up monthly costs, and there are times I want to work offline.
I'm aware of the prevailing sentiment that "32B on 8GB is impossible." The entire model's layers won't fit on the GPU. But I'd heard that llama.cpp's hybrid inference (GPU+CPU split) had gotten considerably better over the past year, so I decided to give it a shot with nothing to lose.
Why llama.cpp
There are other inference engines for local LLMs. Ollama is popular and easy to set up, and vLLM has high throughput.
I tried Ollama first. Setup was indeed easy, but it doesn't give you fine-grained control over ngl (the number of layers offloaded to GPU). With 8GB of VRAM, I want to tune this one layer at a time, but Ollama decides automatically. The result is a conservative value that leaves the GPU idling.
vLLM is excellent for server use cases, but it's not meant for an 8GB VRAM laptop. Its memory management is designed to claim most of the available VRAM upfront, and with 8GB a 32B model wouldn't even start.
I ended up settling on llama.cpp. You can specify ngl freely. Building with CUDA makes inference fast. The abundance of quantized models in GGUF format floating around is also a big plus. The UI and abstractions are thin, but that's exactly what gives you control.
Test Environment: RTX 4060 Laptop + Ryzen 7
Writing down the full environment for reproducibility.
| Component | Spec |
|---|---|
| CPU | AMD Ryzen 7 7845HS (8C/16T, boost to 5.4GHz) |
| RAM | 32GB DDR5-4800 (dual channel) |
| GPU | NVIDIA RTX 4060 Laptop (Ada Lovelace, 8GB GDDR6, 128-bit) |
| Storage | NVMe Gen4 SSD (read 7000MB/s) |
| OS | Windows 11 + WSL2 (Ubuntu 22.04) |
| llama.cpp | b4850 (2026-03 build, CUDA 12.6) |
| Comparison | Apple M4 MacBook (16GB Unified Memory) |
There's one number I'm glad I looked up beforehand: memory bandwidth.
RTX 4060: 272 GB/s. Apple M4 Pro: 273 GB/s. Nearly identical. (The base M4 in the 16GB MacBook I compare against later is rated much lower, around 120 GB/s.)
LLM inference speed (especially token generation) is roughly proportional to memory bandwidth, because each generated token streams the active weights through the memory bus once. In theory, if you optimize quantization and layer splitting, the 4060 can match or beat an M4-class chip. The catch is that you only have 8GB of VRAM, so how you pack things in is the real battle.
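That ceiling is easy to estimate as a sanity check: generating one token requires streaming every active weight through memory once, so peak t/s is roughly bandwidth divided by model size. A minimal sketch — the 16.1GB figure assumes the IQ4_XS quant I ended up using, and the estimate ignores KV cache reads and compute entirely:

```python
# Upper bound on generation speed for a memory-bandwidth-bound model:
# each token requires one full pass over the weights.
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

# RTX 4060 (272 GB/s) streaming a ~16.1GB quantized 32B model:
print(f"ceiling: {max_tokens_per_sec(272, 16.1):.1f} t/s")  # ~16.9 t/s, ignoring all overhead
```

Any measured number has to land below this line; how close you get is the whole game.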
The CUDA Build for llama.cpp — A Quietly Frustrating Experience
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build \
-DGGML_CUDA=ON \
-DGGML_CUDA_FA=ON \
-DGGML_CUDA_F16=ON \
-DGGML_BLAS=ON \
-DGGML_BLAS_VENDOR=OpenBLAS \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_CUDA_GRAPHS=ON
cmake --build build --config Release -j $(nproc)
That last GGML_CUDA_GRAPHS=ON is the key. It's a flag for CUDA Graph optimization, and it alone gave me an 8-12% throughput improvement. This feature has been officially supported since late 2025, yet it's barely mentioned in the official build guide or README. I had to dig through GitHub PRs to find it. Seriously, put this stuff in the documentation.
Also, setting up CUDA under WSL2 has its own gotchas. If the NVIDIA driver version on the Windows side and the CUDA Toolkit version inside WSL2 don't line up, the build succeeds but you get CUDA error: no kernel image is available at runtime. The combo of driver 560.x + CUDA Toolkit 12.6 was stable.
Choosing Quantization — When It Won't All Fit, What Do You Cut?
I chose Qwen2.5-32B because among open models, it had the best balance of coding performance and Japanese language quality. Llama3 and Mistral were candidates too, but Qwen handled Japanese conversation most naturally.
Qwen3.5-27B is out now and I'm curious about it, but I haven't verified how many layers fit in 8GB with IQ4_XS yet. This time I'm reporting results for Qwen2.5-32B.
So, quantization selection. I wrote a quick calculation script to estimate sizes.
def estimate_model_vram(params_b: float, quant_bits: float,
                        kv_cache_gb: float = 0.5) -> dict:
    """Estimate VRAM needs: parameters x bits-per-weight, plus KV cache."""
    model_gb = (params_b * 1e9 * quant_bits) / (8 * 1024**3)  # bits -> GiB
    total_gb = model_gb + kv_cache_gb
    return {
        "model_size_gb": round(model_gb, 2),
        "total_with_kvcache_gb": round(total_gb, 2),
        "fits_in_8gb": total_gb <= 7.5,  # leave ~0.5GB headroom for the driver
    }

# Qwen2.5-32B actually has ~32.5B parameters
for name, bits in [("Q8_0", 8.5), ("Q5_K_M", 5.5), ("Q4_K_M", 4.5),
                   ("IQ4_XS", 4.25), ("Q3_K_M", 3.5), ("IQ2_XS", 2.31)]:
    info = estimate_model_vram(32.5, bits)
    fits = "✅" if info["fits_in_8gb"] else "⚠️ Won't fit entirely"
    print(f"{name:10s}: {info['total_with_kvcache_gb']:.1f}GB {fits}")
Q8_0 : 32.7GB ⚠️ Won't fit entirely
Q5_K_M : 21.6GB ⚠️ Won't fit entirely
Q4_K_M : 17.7GB ⚠️ Won't fit entirely
IQ4_XS : 16.9GB ⚠️ Won't fit entirely
Q3_K_M : 13.7GB ⚠️ Won't fit entirely
IQ2_XS : 9.2GB ⚠️ Won't fit entirely
Total wipeout. A 32B model won't fully fit in 8GB. Not even IQ2_XS at 9.2GB.
But you don't need to give up here. llama.cpp can split layers between GPU and CPU. Even without loading everything, the layers on GPU are processed at high speed, and the rest is handled by the CPU. The question is: "How many layers is optimal to offload?"
Cramming a 32B Model into 8GB VRAM — Finding the Optimal ngl
I chose IQ4_XS because it had the best balance of quality and size. Dropping to Q3_K_M introduced visible degradation in Japanese output.
Qwen2.5-32B has 64 transformer layers; llama.cpp counts the output layer too, so full offload is ngl=65. I ran benchmarks while varying ngl.
MODEL="$HOME/models/qwen2.5-32b-instruct-IQ4_XS.gguf"
for NGL in 20 30 40 50 60 65; do
echo "=== ngl=$NGL ==="
./build/bin/llama-bench \
-m "$MODEL" \
-ngl $NGL \
-t 8 \
-p 512 -n 128 \
-r 3 \
2>&1 | grep -E "pp|tg|model"
done
| ngl | VRAM Usage | Prefill (pp512) | Generation Speed (tg128) | Verdict |
|---|---|---|---|---|
| 20 | ~3.1GB | 48 t/s | 3.2 t/s | Not practical |
| 30 | ~4.5GB | 89 t/s | 5.1 t/s | Slow |
| 40 | ~5.8GB | 127 t/s | 6.8 t/s | Barely usable |
| 50 | ~7.0GB | 198 t/s | 9.1 t/s | Practical range |
| 60 | ~7.6GB | 231 t/s | 10.8 t/s | Recommended |
| 65 | ~8.1GB | OOM | — | Crash |
ngl=65 OOM'd spectacularly. The weights alone wanted about 8.1GB — more than the card even has — so the KV cache had nowhere to go.
ngl=60 is the sweet spot. It uses 7.6GB and offloads the remaining 5 layers to the CPU (Ryzen 7845HS, 16MB L3 cache). 10.8 t/s for Japanese text is "slightly slower than reading aloud" — a tolerable speed for a coding assistant.
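That curve is roughly what a simple bandwidth model predicts. As a toy sketch: per token, the GPU-resident fraction of the weights streams at VRAM bandwidth and the CPU-resident fraction at RAM bandwidth, and the two times add. The 16.1GB model size, 272 GB/s VRAM, and ~77 GB/s dual-channel DDR5-4800 figures are my machine's numbers; real throughput lands below the estimate because compute, transfers, and the KV cache are ignored:

```python
# Toy model of hybrid (GPU+CPU) generation speed: time per token is the sum of
# streaming the GPU-resident weights at VRAM bandwidth and the CPU-resident
# weights at RAM bandwidth.
def hybrid_tps(ngl: int, n_layers: int = 65, model_gb: float = 16.1,
               gpu_bw: float = 272.0, cpu_bw: float = 77.0) -> float:
    gpu_frac = ngl / n_layers
    seconds_per_token = (gpu_frac * model_gb) / gpu_bw \
                      + ((1 - gpu_frac) * model_gb) / cpu_bw
    return 1.0 / seconds_per_token

for ngl in (20, 40, 60, 65):
    print(f"ngl={ngl:2d}: ~{hybrid_tps(ngl):4.1f} t/s (theoretical ceiling)")
```

It also shows why the last few CPU layers are so expensive: at 77 GB/s, each CPU-resident gigabyte costs more than three times as much time as a GPU-resident one, which is why pushing ngl as high as VRAM allows pays off.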
RTX 4060 vs M4 MacBook — Local LLM Benchmark Comparison
I borrowed a friend's M4 MacBook (16GB Unified Memory) and compared using the same model and the same prompt.
| Metric | RTX 4060 (ngl=60) | M4 16GB (ngl=99) |
|---|---|---|
| Generation Speed (tg128) | 10.8 t/s | 9.4 t/s |
| Prefill (pp512) | 231 t/s | 187 t/s |
| VRAM/Memory Usage | 7.6GB VRAM + 12GB RAM | 14.2GB Unified |
| Power Draw (inference) | ~85W | ~18W |
| Long Context (32K) | Major slowdown | Stable |
The 4060 wins on speed. With its bandwidth matching an M4 Pro and comfortably ahead of the base M4 it was up against, that's what the theory predicts — but I'd internalized the assumption that "you can't beat Apple Silicon at local inference," so I was genuinely surprised.
However, the 4060's victory only holds for contexts under 8K. Stretching to 32K tokens exhausts VRAM and increases CPU fallback, causing speed to tank. The M4 has 16GB of Unified Memory, so it stays stable with long contexts. Power efficiency is a 4.7x gap — not even close.
Here's how I'd divide usage:
- Code generation, short-to-medium conversations (4K-8K) → 4060 is faster
- Long document summarization, battery life, quiet operation → M4 wins overwhelmingly
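One more way to slice the efficiency gap is energy per generated token, dividing the power row by the speed row from the table above (my own rough wall measurements):

```python
# Energy per generated token = power draw (W = J/s) / generation speed (tokens/s).
def joules_per_token(watts: float, tokens_per_sec: float) -> float:
    return watts / tokens_per_sec

rtx4060 = joules_per_token(85, 10.8)
m4      = joules_per_token(18, 9.4)
print(f"RTX 4060: {rtx4060:.1f} J/token, M4: {m4:.1f} J/token, "
      f"ratio: {rtx4060 / m4:.1f}x")
```

By this metric the gap is about 4.1x rather than the raw 4.7x power ratio — the 4060 finishes each token sooner — but it's still nowhere near close.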
Pushing the 8GB VRAM Wall a Bit Further — KV Cache Quantization & Flash Attention
ngl=60 works, but extending the context length quickly runs out of VRAM. I tried several additional techniques.
KV Cache Quantization
./llama-server \
-m model.gguf \
-ngl 60 \
--ctx-size 16384 \
--flash-attn \
-ctk q4_0 \
-ctv q4_0
Dropping the KV cache from FP16 to Q4. (llama.cpp refuses to quantize the V cache unless flash attention is enabled, hence --flash-attn here.) At 16K context, this saves about 1.8GB of VRAM, and I couldn't perceive any quality degradation — the BLEU score impact was around -0.3%. Thanks to this, I can now maintain ngl=60 even with 16K context.
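For intuition on where those savings come from: KV cache size scales linearly with context length and with bytes per element. A back-of-envelope sketch — the 64 layers, 8 KV heads, and head dim 128 are my assumptions about Qwen2.5-32B's GQA config, worth checking against the GGUF metadata — overshoots the measured 1.8GB (only the GPU-resident layers' share of the cache sits in VRAM), but the scaling is right:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes/elem.
def kv_cache_gb(ctx: int, n_layers: int = 64, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
    return 2 * n_layers * kv_heads * head_dim * ctx * bytes_per_elem / 1024**3

fp16 = kv_cache_gb(16384)                         # FP16: 2 bytes/elem
q4   = kv_cache_gb(16384, bytes_per_elem=0.5625)  # q4_0: ~4.5 bits/elem
print(f"FP16: {fp16:.1f} GB, q4_0: {q4:.1f} GB, saved: {fp16 - q4:.1f} GB")
```

Doubling the context doubles the cache, which is exactly why 32K contexts blow past 8GB without quantization.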
Flash Attention
./llama-server \
-m model.gguf \
-ngl 60 \
--flash-attn \
--ctx-size 8192
--flash-attn saves about 1.2GB at 8K context. It stacks with KV cache quantization, so enabling both is the right call.
mlock to Prevent Swapping
./llama-server \
-m model.gguf \
-ngl 60 \
--mlock \
--no-mmap
When the layers offloaded to CPU get swapped out to the Windows page file, inference stalls for seconds at a time. --mlock pins them in RAM to prevent this. In an NVMe Gen4 environment, adding --no-mmap slows model loading by 15%, but reduces latency spikes during inference.
Running llama-server as an OpenAI-Compatible API
Benchmarking done, time for practical use. Spin up llama-server as an OpenAI-compatible API, and any tool can hit it.
./build/bin/llama-server \
-m ~/models/qwen2.5-32b-instruct-IQ4_XS.gguf \
-ngl 60 \
-t 8 \
--ctx-size 8192 \
--host 0.0.0.0 \
--port 8080 \
--parallel 1 \
--flash-attn \
--mlock \
-ctk q4_0 -ctv q4_0
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="dummy")
response = client.chat.completions.create(
model="qwen2.5-32b",
messages=[{"role": "user", "content": "Implement a thread-safe singleton in Python"}],
max_tokens=512,
temperature=0.7
)
print(response.choices[0].message.content)
Since it's OpenAI SDK-compatible, you just swap base_url in your existing code and it switches to the local LLM. Works as a backend for Cursor and Continue as-is.
What I Want to Test Next
Qwen3.5-27B is out. Parameter count dropped from 32B to 27B, yet quality reportedly improved thanks to architectural changes. At 27B, IQ4_XS should fit in 8GB with more room to spare — there's even a chance ngl=65 puts all layers on GPU. I'll do a measured benchmark by model size, including comparison with Qwen3.5-9B, in the next post.
I'm also curious about comparing a native Windows build vs WSL2. I ran everything on WSL2 this time, but anecdotally the native build feels 5-8% slower. If DirectML matures, that gap might flip.
One more thing — if vLLM makes moves to support the 8GB VRAM class better, I want to try it. Currently llama.cpp is the only realistic option in the 8GB tier, but as a user, competition is always welcome.
Originally published on Qiita in Japanese.