
Jovan Chan

Posted on • Originally published at aifoss.dev

GPTQ vs AWQ vs GGUF for vLLM 2026: Which 4-Bit Wins


TL;DR: For vLLM serving on an Ampere-or-newer NVIDIA GPU, AWQ in INT4 with the Marlin kernel is the default-correct choice — it matches GPTQ on speed and beats it on quality in most community tests. GGUF technically loads in vLLM but runs roughly 8x slower than AWQ; keep GGUF on llama.cpp where it belongs.

What you'll know after this guide:

  • Which 4-bit format to pick for vLLM on your specific GPU (Ampere, Ada, or Hopper)
  • The exact kernel flags that decide whether you get 700+ tok/s or a slow fallback path
  • The one single-user case where GGUF + llama.cpp still beats a vLLM cluster

Honest take: If you're serving concurrent traffic on vLLM, quantize to AWQ INT4 and stop overthinking it. GPTQ is fine if you already have the weights. GGUF in vLLM is a trap unless you have no other option.

The format wars between GPTQ, AWQ, and GGUF look like a quality debate. For vLLM production inference they're really a kernel debate. A 4-bit model can run roughly 8x faster or 8x slower depending on which CUDA kernel vLLM picks at load time, and that choice is driven by the weight format, your GPU architecture, and a couple of metadata flags most people never check. Here's what actually moves the needle.

The three formats, fast

GPTQ (the name comes from the original paper on post-training quantization for generative pre-trained transformers) is a one-shot, layer-by-layer weight quantizer. It's been the workhorse INT4 format since 2023 and has the widest tooling. The modern toolkit is GPTQModel by ModelCloud, which supports NVIDIA, AMD, Intel GPU and CPU backends and exports straight to vLLM and SGLang.

AWQ (Activation-aware Weight Quantization) protects the small fraction of weight channels that matter most for activations by rescaling them before quantization, so they lose less precision even though every weight still ends up in INT4. The result is usually slightly better quality at the same bit width, and AWQ INT4 has become the de facto best-practice format for vLLM serving in 2026.

GGUF is llama.cpp's container format. It's the king of laptop, Mac, and CPU inference — k-quants like Q4_K_M, partial GPU offload, single-file portability. None of those strengths translate to a vLLM GPU server.

| | GPTQ | AWQ | GGUF |
|---|---|---|---|
| Best home | vLLM / SGLang servers | vLLM servers | llama.cpp / Ollama |
| vLLM kernel | Marlin / Machete / ExLlamaV2 | Marlin (default optimized) | experimental, under-optimized |
| Typical quality vs FP16 | small drop | smallest drop | small drop |
| Non-NVIDIA hardware | AMD/Intel via GPTQModel | limited | best (CPU, Mac, partial offload) |
| Re-quantize to switch | yes | yes | yes |

The numbers that matter

Community benchmarks on a Qwen2.5-32B-class model (the recurring reference point in 2026 r/LocalLLaMA threads and several independent write-ups) tell a consistent story on a single high-end NVIDIA card:

| Format + kernel | Throughput | HumanEval Pass@1 |
|---|---|---|
| FP16 (baseline) | reference | 56.1% |
| AWQ INT4 + Marlin | ~741 tok/s | 51.8% |
| GPTQ INT4 + Marlin | ~712 tok/s | 46.0% |
| GGUF Q4_K_M (vLLM, H200) | ~93 tok/s | 51.8% |

Two things jump out. AWQ and GPTQ land within ~4% of each other on throughput, so speed is not the reason to pick one over the other — quality is, and AWQ wins that round in these tests. And GGUF on vLLM is not in the same league: ~93 tok/s on an H200 is roughly an eighth of AWQ's number. The quality is fine; the serving path is the problem.

These are community figures, not a controlled benchmark of my own — treat them as directional. The 700-vs-90 gap is real and reproducible across sources; the 46-vs-51.8 quality gap depends heavily on how the GPTQ weights were produced (more on that below).
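If you want to sanity-check the "within ~4%" and "roughly an eighth" claims, the arithmetic is short. This sketch only re-derives ratios from the community figures in the table above; the dictionary names and the `relative_gap` helper are mine, and nothing here is a new measurement.

```python
# Community figures copied from the table above (not my own benchmark).
THROUGHPUT_TOK_S = {
    "AWQ INT4 + Marlin": 741,
    "GPTQ INT4 + Marlin": 712,
    "GGUF Q4_K_M (vLLM)": 93,
}
HUMANEVAL_PASS_AT_1 = {
    "FP16": 56.1,
    "AWQ INT4 + Marlin": 51.8,
    "GPTQ INT4 + Marlin": 46.0,
    "GGUF Q4_K_M (vLLM)": 51.8,
}


def relative_gap(fast: float, slow: float) -> float:
    """Fraction by which `slow` trails `fast` (0.04 means 4% slower)."""
    return (fast - slow) / fast


awq = THROUGHPUT_TOK_S["AWQ INT4 + Marlin"]
gptq = THROUGHPUT_TOK_S["GPTQ INT4 + Marlin"]
gguf = THROUGHPUT_TOK_S["GGUF Q4_K_M (vLLM)"]

gptq_gap = relative_gap(awq, gptq)   # how far GPTQ trails AWQ on speed
awq_over_gguf = awq / gguf           # how many times faster AWQ is than GGUF

# Quality drop versus FP16, in percentage points of Pass@1.
quality_drop = {
    name: round(HUMANEVAL_PASS_AT_1["FP16"] - score, 1)
    for name, score in HUMANEVAL_PASS_AT_1.items()
    if name != "FP16"
}

print(f"GPTQ trails AWQ by {gptq_gap:.1%}")         # 3.9%
print(f"AWQ is {awq_over_gguf:.1f}x GGUF on vLLM")  # 8.0x
print(quality_drop)
```

The speed gap between AWQ and GPTQ is small enough to be noise; the 4.3-point versus 10.1-point quality drop is where the two formats actually separate in these figures.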

Why GGUF tanks in vLLM

vLLM's own documentation flags GGUF support as highly experimental and under-optimized, warns it may be incompatible with other features, and notes it only loads single-file GGUF models. The slowness isn't a bug to wait out — it's architectural. vLLM's whole value is continuous batching and PagedAttention against GPU-native INT4 kernels. GGUF's k-quant layout was designed for llama.cpp's dequantization path, so vLLM falls back to a slow route and the batching engine starves.

The fix is to not do it. If your weights are GGUF, serve them with llama.cpp's server or Ollama. If you need vLLM's throughput, re-quantize the original model to AWQ or GPTQ.

The kernel matrix (this is the actual decision)

vLLM ships several kernels for the same weights. The defaults are the official AWQ kernel for AWQ and ExLlamaV2 for GPTQ, but the fast paths are Marlin and Machete:

  • Marlin — built for Ampere (A100 and newer), works on Ada too. Handles both GPTQ and AWQ INT4 at group_size: 128. This is the kernel behind the 700+ tok/s numbers.
  • Machete — NeuralMagic's Hopper-optimized successor to Marlin, built on CUTLASS 3.5.1. Marlin was tuned for Ampere and leaves performance on the table on H100; Machete targets Hopper and pulls ahead at large batch sizes. Catch: Machete currently supports GPTQ only, not AWQ.

| Your GPU | Best format | Kernel you want |
|---|---|---|
| RTX 3090 / A100 (Ampere) | AWQ INT4 | Marlin |
| RTX 4090 / L4 / L40S (Ada) | AWQ INT4 | Marlin |
| H100 / H200 (Hopper), small batches | AWQ INT4 | Marlin |
| H100 / H200 (Hopper), high concurrency | GPTQ INT4 | Machete |

So the one scenario where GPTQ genuinely beats AWQ on vLLM: a Hopper card under heavy batched load, where Machete's GPTQ-only path edges out Marlin-AWQ. Everywhere else, AWQ + Marlin is the call.
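The decision matrix is small enough to encode as a lookup. Here's a minimal sketch that maps a CUDA compute capability to the recommendation in this section. The function and type names are hypothetical (this is not a vLLM API), and "high concurrency" is left as a boolean you set yourself because the article doesn't pin down a batch-size threshold.

```python
from typing import NamedTuple


class Recommendation(NamedTuple):
    quant_format: str
    kernel: str


# CUDA compute capabilities for the cards named in this section.
ARCH_BY_CAPABILITY = {
    (8, 0): "Ampere",  # A100
    (8, 6): "Ampere",  # RTX 3090
    (8, 9): "Ada",     # RTX 4090, L4, L40S
    (9, 0): "Hopper",  # H100, H200
}


def recommend_quantization(
    compute_capability: tuple[int, int],
    high_concurrency: bool = False,
) -> Recommendation:
    """Return the format/kernel pair this guide suggests for a GPU.

    Raises ValueError for architectures the guide does not cover.
    """
    arch = ARCH_BY_CAPABILITY.get(tuple(compute_capability))
    if arch is None:
        raise ValueError(
            f"compute capability {compute_capability} is not covered by this guide"
        )
    # The single case where GPTQ wins: Hopper under heavy batched load.
    if arch == "Hopper" and high_concurrency:
        return Recommendation("GPTQ INT4", "Machete")
    return Recommendation("AWQ INT4", "Marlin")


print(recommend_quantization((8, 9)))
print(recommend_quantization((9, 0), high_concurrency=True))
```

If PyTorch is installed, `torch.cuda.get_device_capability()` returns a tuple in this shape, so you can feed it straight in rather than hard-coding the card.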

For consumer Ampere/Ada cards, this all assumes you have the VRAM headroom to keep the model resident — see the runaihome.com GPU guides for which cards clear the bar for 32B-class INT4 weights. If you're spinning up Hopper capacity by the hour to test Machete throughput, RunPod is the cheapest way to rent an H100/H200 without buying one.
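For a rough sense of what "VRAM headroom" means here, you can estimate the weight footprint by hand. This is a back-of-the-envelope sketch under stated assumptions: a round 32 billion parameters for a "32B-class" model, and one FP16 scale plus one 4-bit zero point stored per group of 128 weights, which is a common layout for group-quantized INT4 checkpoints but not something this article measured. It ignores layers left unquantized, the KV cache, and activations, so treat the result as a floor.

```python
GIB = 2**30


def bits_per_weight(bits: int = 4, group_size: int = 128,
                    scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Effective storage per weight once per-group metadata is amortized."""
    return bits + (scale_bits + zero_bits) / group_size


def weight_footprint_gib(n_params: float, bits: int = 4,
                         group_size: int = 128) -> float:
    """Approximate size of the quantized weights alone, in GiB."""
    return n_params * bits_per_weight(bits, group_size) / 8 / GIB


N_PARAMS = 32e9  # assumption: round number for a "32B-class" model

int4_gib = weight_footprint_gib(N_PARAMS)
fp16_gib = N_PARAMS * 16 / 8 / GIB

print(f"INT4, group_size=128: ~{int4_gib:.1f} GiB")  # ~15.5 GiB
print(f"FP16 baseline:        ~{fp16_gib:.1f} GiB")  # ~59.6 GiB
```

On a 24 GB card that leaves only a few GiB for KV cache after the weights load, which is why `--max-model-len` and `--gpu-memory-utilization` matter so much on consumer hardware.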

A real problem: GPTQ loads but crawls

Here's the trap that produces those mediocre GPTQ numbers. You quantize a model with act-order enabled (desc_act: true) because a guide told you it improves accuracy. It does — and then it silently kills your throughput, because the fast Marlin kernel does not support act-order. vLLM loads the model fine, then falls back to a slow dequantization path, and you blame "GPTQ" for being slow when the real culprit is one flag.

For Marlin-compatible GPTQ, the weights need:

# GPTQModel quantize config for vLLM + Marlin
quant_config = {
    "bits": 4,
    "group_size": 128,   # 128 or -1; 128 is the safe default
    "desc_act": False,   # CRITICAL: act-order breaks the fast Marlin path
}

AWQ sidesteps this entirely — bits: 4, group_size: 128 is the standard config and maps to Marlin without an act-order footgun. It's one less way to shoot yourself in the foot, which is part of why AWQ is the recommended default.
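Before you serve GPTQ weights you downloaded from someone else, it's worth checking their quantize config against the three conditions above. A minimal sketch: the `marlin_blockers` helper is hypothetical, it only encodes the rules stated in this article, and vLLM's own compatibility check may accept or reject configurations this does not know about.

```python
def marlin_blockers(quantize_config: dict) -> list[str]:
    """List the reasons a GPTQ config misses this article's Marlin criteria.

    An empty list means the config matches the bits / group_size / desc_act
    combination described above. It is not a guarantee about vLLM's behavior.
    """
    problems = []
    if quantize_config.get("bits") != 4:
        problems.append(f"bits is {quantize_config.get('bits')!r}, expected 4")
    if quantize_config.get("group_size") not in (128, -1):
        problems.append(
            f"group_size is {quantize_config.get('group_size')!r}, expected 128 or -1"
        )
    # A missing desc_act key is treated as False (act-order off).
    if quantize_config.get("desc_act", False):
        problems.append("desc_act is enabled (act-order), expected False")
    return problems


good = {"bits": 4, "group_size": 128, "desc_act": False}
slow = {"bits": 4, "group_size": 128, "desc_act": True}

print(marlin_blockers(good))  # []
print(marlin_blockers(slow))
```

The dict keys mirror the quantize config shown above. Where a given checkpoint stores them (a standalone quantize config file or a section of the model config) varies by toolkit, so check the model repository before wiring this into a loader.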

Serving AWQ on vLLM in practice

Install and launch (tested against vLLM 0.9.x, June 2026):

pip install vllm

# Serve a pre-quantized AWQ model
vllm serve Qwen/Qwen2.5-32B-Instruct-AWQ \
  --quantization awq_marlin \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92

Expected startup log (abridged):

INFO ... Using AWQ Marlin kernel for quantized weights.
INFO ... Maximum concurrency for 8192 tokens per request: 12.4x
INFO ... Started server process
INFO ... Uvicorn running on http://0.0.0.0:8000

If that first line instead reads Using AWQ kernel (no "Marlin"), you're on the slow default — check that your GPU is Ampere or newer and that the AWQ config uses group_size: 128. Hit it with the OpenAI-compatible endpoint:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-32B-Instruct-AWQ",
       "prompt": "def fibonacci(n):", "max_tokens": 64}'
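The same request from Python, using only the standard library so there's nothing extra to install. This is a sketch that mirrors the curl call above; the helper names are mine, and it assumes the server is listening on localhost:8000 and returns the usual OpenAI-style completions response with a `choices` list.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    model: str = "Qwen/Qwen2.5-32B-Instruct-AWQ",
    base_url: str = "http://localhost:8000",
    max_tokens: int = 64,
) -> urllib.request.Request:
    """Build the POST request for the /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> str:
    """Send the prompt to a running vLLM server and return the generated text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    # OpenAI-style completions responses carry the text in choices[0].text.
    return body["choices"][0]["text"]
```

With the server from the launch command running, `print(complete("def fibonacci(n):"))` should return the same continuation the curl call does. Building the request in its own function keeps the payload inspectable without a live server.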

Letting vLLM auto-select awq_marlin (the default when it detects a compatible AWQ model on an Ampere+ GPU) is usually correct — only pin --quantization if
