Quantization formats compared: GGUF vs GPTQ vs AWQ vs NF4
You just finished fine-tuning a 7B parameter model. The raw FP16 weights are 14 GB. Your target deployment is a single consumer GPU with 8 GB of VRAM, or perhaps an ARM MacBook with unified memory, or maybe a cloud instance where you pay per GB of GPU memory. The numbers do not add up. The model, as is, does not fit. You need to shrink it, and you need to shrink it in a way that does not turn it into a random-number generator.
This is where weight quantization enters the picture. Reducing each parameter from 16 bits to 4 bits cuts the weight footprint by 4x, from 14 GB to roughly 3.5 GB for a 7B model (slightly more in practice, because each group of weights also stores a scale). The trick is how you do it, because not all 4-bit schemes are the same, and the trade-offs between memory, speed, accuracy, and portability differ for every format.
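The arithmetic is easy to check. The helper below is a minimal sketch (the function name is made up for illustration) that counts weight storage only, using 1 GB = 1e9 bytes, and ignores the KV cache and activations. The last row assumes one FP16 scale per 32 weights, which is one common layout and not the only one.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes); weights only."""
    return n_params * bits_per_weight / 8 / 1e9


rows = [
    ("FP16", 16),
    ("8-bit", 8),
    ("4-bit", 4),
    # Assumption: one 16-bit scale shared by every 32 weights.
    ("4-bit + FP16 scale per 32 weights", 4 + 16 / 32),
]
for label, bits in rows:
    print(f"{label:>34}: {weight_memory_gb(7e9, bits):.2f} GB")
```

The per-block scale adds about half a bit per weight, which is why real 4-bit files land closer to 4 GB than 3.5 GB for a 7B model.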
Why quantization format choice matters
The format determines three things: which hardware can run the model, how fast inference runs, and how much accuracy you give up. These three constraints are in tension. A format optimized for CPU inference (GGUF) uses a different quantization scheme than one designed for GPU batch serving (GPTQ). A format that preserves more accuracy at the same bit-width (AWQ) may cost more to calibrate. A format designed for training (NF4 via bitsandbytes) is not the best choice for inference deployment.
Choosing the wrong format means either leaving performance on the table, or worse, building a deployment pipeline around a format that the inference engine does not support. The landscape has settled into four major formats, each with a clear niche.
The four formats: how they work
GGUF
GGUF is the model file format of the llama.cpp project and the successor to the earlier GGML format. It is a container format that bundles model weights, tokenizer, and hyperparameters into a single file, with the weights already quantized. The quantization methods inside GGUF range from Q2_K to Q8_0, with Q4_K_M being the most popular sweet spot.
GGUF quantizations use a block-wise scheme: weights are grouped into small blocks and each block gets its own scale and (optionally) a minimum or zero-point. The legacy types such as Q4_0 use 32 weights per block; the K-quant types group these blocks into 256-weight super-blocks and quantize the per-block scales as well. The mixed K-quant variants (Q4_K_M, Q5_K_M, etc.) also use different bit-widths for different tensors, spending more bits on the tensors that matter more.
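To make the block-wise idea concrete, here is a minimal sketch of symmetric per-block quantization in plain Python. It is a generic illustration, not llama.cpp's actual bit layout or rounding rules; the function names and the block size of 32 are assumptions chosen for the example.

```python
import math


def quantize_blocks(weights, block_size=32, bits=4):
    """Split weights into blocks; store one float scale plus small ints per block."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit symmetric codes
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block)
        scale = absmax / qmax if absmax > 0 else 1.0
        codes = [max(-qmax, min(qmax, round(w / scale))) for w in block]
        blocks.append((scale, codes))
    return blocks


def dequantize_blocks(blocks):
    """Rebuild approximate weights: scale * code for every entry."""
    return [scale * code for scale, codes in blocks for code in codes]


# Two blocks with different magnitudes, so each ends up with its own scale.
weights = [math.sin(i) * (1 + i // 32) for i in range(64)]
blocks = quantize_blocks(weights)
restored = dequantize_blocks(blocks)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("scales:", [round(scale, 4) for scale, _ in blocks])
print("max round-trip error:", round(max_err, 4))
```

Because each block carries its own scale, one large outlier only coarsens the grid for its own block rather than for the whole tensor.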
The format is designed for CPU and Apple Silicon inference. Because llama.cpp can offload some layers to GPU, GGUF also works on hybrid CPU+GPU setups, but the primary target is memory-constrained environments where a GPU is not available or not large enough.
GPTQ
GPTQ (named for post-training quantization of generative pre-trained transformers) was introduced by Frantar et al. from IST Austria and ETH Zurich in a paper posted in late 2022 and published at ICLR 2023. It is a weight-only quantization method that uses a second-order optimization procedure: it quantizes weights column by column, using an approximate Hessian of each layer's reconstruction error (estimated from calibration inputs) to adjust the remaining unquantized weights to compensate for the error introduced by the already-quantized ones.
The long-standing community implementation, AutoGPTQ, was archived in early 2025. The active successor is GPTQModel (v7.1.0, June 2026) from ModelCloud, which supports both Marlin and Triton kernels for fast GPU inference. GPTQ models are typically quantized to 4-bit (or occasionally 3-bit and 8-bit) and are stored in Hugging Face-compatible safetensors format with a quantize_config.json metadata file.
In practice, GPTQ models are served on GPUs. The Marlin kernel (INT4 weights x FP16 activations) gets close to the ideal 4x speedup over FP16 at small and moderate batch sizes on NVIDIA GPUs, making GPTQ the default choice for serving quantized models on datacenter GPUs.
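The compensation step is easier to see on a toy example. The sketch below quantizes a single weight row one entry at a time and pushes each rounding error onto the not-yet-quantized entries using the inverse Hessian built from calibration inputs. It is a simplified illustration under stated assumptions: a tiny dense inverse instead of the Cholesky-based, blocked implementation real GPTQ uses, a fixed grid step instead of per-group scales, and no clipping. All names and numbers are hypothetical.

```python
def invert(matrix):
    """Gauss-Jordan inverse of a small square matrix given as a list of lists."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot_row = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def quantize_row_with_compensation(weights, inputs, step, damp=0.01):
    """Quantize one weight row in order, compensating on the remaining weights."""
    d = len(weights)
    # Hessian of the layer-wise squared error: H = sum over inputs of x x^T.
    hessian = [[sum(x[i] * x[j] for x in inputs) for j in range(d)] for i in range(d)]
    mean_diag = sum(hessian[i][i] for i in range(d)) / d
    for i in range(d):
        hessian[i][i] += damp * mean_diag  # damping keeps H invertible
    h_inv = invert(hessian)
    w = list(weights)
    q = [0.0] * d
    for i in range(d):
        q[i] = round(w[i] / step) * step
        err = (w[i] - q[i]) / h_inv[i][i]
        for j in range(i + 1, d):
            w[j] -= err * h_inv[i][j]  # shift the error onto later weights
        # Remove index i from the inverse Hessian before the next step.
        pivot, col_i, row_i = h_inv[i][i], [r[i] for r in h_inv], list(h_inv[i])
        for r in range(d):
            for c in range(d):
                h_inv[r][c] -= col_i[r] * row_i[c] / pivot
    return q


def output_error(weights, quantized, inputs):
    """Squared error of the row's output over the calibration inputs."""
    return sum(sum((w - q) * xi for w, q, xi in zip(weights, quantized, x)) ** 2 for x in inputs)


row = [0.31, -0.72, 0.18, 0.55]
calib = [[1.0, 0.9, 0.1, 0.0], [0.8, 1.1, 0.0, 0.2], [1.2, 1.0, 0.3, 0.1], [0.9, 0.8, 0.2, 0.3]]
step = 0.25
compensated = quantize_row_with_compensation(row, calib, step)
nearest = [round(w / step) * step for w in row]
print("round-to-nearest:", nearest, "error:", output_error(row, nearest, calib))
print("with compensation:", compensated, "error:", output_error(row, compensated, calib))
```

The point of the comparison is that the compensated row is judged by the error of the layer's output on calibration data, not by how close each individual weight is to its original value.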
AWQ
AWQ (Activation-aware Weight Quantization) was introduced by Lin et al. from MIT in a paper posted in 2023 and published at MLSys 2024. The key insight is that not all weights are equally important -- the ones corresponding to large activation magnitudes have a disproportionate impact on output quality. AWQ identifies these "salient" weight channels by analyzing a small calibration dataset and protects them by scaling them up before quantization. The inverse scale is applied to the matching input activations (in practice folded into the preceding operation), so the layer's output is mathematically unchanged before rounding.
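A toy dot product shows why the scaling helps. In this sketch the second input channel has a large activation, so its small weight is the salient one. The weights, activations, and the scale factor of 2 are made-up values; real AWQ searches for per-channel scales from activation statistics and uses group-wise quantization, neither of which is reproduced here.

```python
def quantize_group(values, qmax=7):
    """Symmetric 4-bit-style rounding with one scale for the whole group."""
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [0.9, 0.06, -0.5, 0.3]
activations = [0.1, 8.0, 0.2, 0.1]      # channel 1 carries a large activation
channel_scales = [1.0, 2.0, 1.0, 1.0]   # hypothetical: protect channel 1 only

reference = dot(weights, activations)

# Plain rounding: the small salient weight collapses to zero.
plain = dot(quantize_group(weights), activations)

# AWQ-style: scale the weight up, scale the activation down, then round.
scaled_weights = [w * s for w, s in zip(weights, channel_scales)]
scaled_activations = [x / s for x, s in zip(activations, channel_scales)]
protected = dot(quantize_group(scaled_weights), scaled_activations)

print("full precision:", round(reference, 4))
print("plain 4-bit error:", round(abs(plain - reference), 4))
print("scaled 4-bit error:", round(abs(protected - reference), 4))
```

The scaling only pays off when it does not raise the group's maximum, which is part of why the real method searches for the scale instead of picking a large constant.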
The implementation is AutoAWQ (v0.2.9, May 2025). Like GPTQ, AWQ targets GPU inference and produces Hugging Face-compatible weights. AWQ tends to produce slightly lower perplexity than GPTQ at the same bit-width, especially at 4-bit, though the gap is small (typically within 0.1 perplexity points).
NF4
NF4 (NormalFloat4) is a quantization data type introduced as part of the QLoRA paper (Dettmers et al., 2023). It is not a container format or a quantization algorithm per se -- it is a 4-bit data type that assumes the weights follow a normal distribution and uses a normalized float mapping that allocates more quantization levels near zero.
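The sixteen NF4 levels can be approximated from quantiles of a standard normal distribution. The sketch below follows my understanding of how bitsandbytes builds the table: eight positive levels, seven negative levels, and an exact zero, normalized to [-1, 1]. The offset constant and the asymmetric split are assumptions taken from that reading, not something the text above specifies, so treat the output as illustrative.

```python
from statistics import NormalDist


def linspace(start, stop, num):
    """Evenly spaced values from start to stop, inclusive."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def nf4_levels(offset=0.9677083):
    """Approximate the 16 NF4 levels from normal quantiles (assumed construction)."""
    ppf = NormalDist().inv_cdf
    positive = [ppf(p) for p in linspace(offset, 0.5, 9)[:-1]]    # 8 levels
    negative = [-ppf(p) for p in linspace(offset, 0.5, 8)[:-1]]   # 7 levels
    levels = sorted(positive + [0.0] + negative)
    peak = max(abs(v) for v in levels)
    return [v / peak for v in levels]


levels = nf4_levels()
gaps = [b - a for a, b in zip(levels, levels[1:])]
print([round(v, 4) for v in levels])
print("gap next to zero:", round(gaps[7], 4), "gap at the top:", round(gaps[-1], 4))
```

The gaps between neighbouring levels are smallest around zero and widest at the ends, which matches where normally distributed weights are most and least dense.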
NF4 is implemented in the bitsandbytes library (v0.49.2, February 2026) and is the default 4-bit type for QLoRA fine-tuning in the Hugging Face ecosystem. Unlike the other three formats, NF4 is primarily used for training (parameter-efficient fine-tuning) rather than inference deployment. You use NF4 to load a model in 4-bit during training, but you typically export to a different format for serving.
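In the Hugging Face stack, NF4 loading is usually switched on through `BitsAndBytesConfig`. The sketch below keeps the settings in a plain dictionary and only imports `torch` and `transformers` inside the helper, so the heavy dependencies are needed only when you actually build the config. The helper name, the double-quantization setting, and the bfloat16 compute dtype are illustrative choices, not requirements.

```python
# Keyword arguments for transformers.BitsAndBytesConfig (illustrative choices).
NF4_KWARGS = {
    "load_in_4bit": True,               # store linear-layer weights in 4-bit
    "bnb_4bit_quant_type": "nf4",       # NormalFloat4 instead of plain "fp4"
    "bnb_4bit_use_double_quant": True,  # also quantize the per-block scales
}


def make_nf4_config():
    """Build the quantization config; requires torch, transformers, bitsandbytes."""
    import torch
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(bnb_4bit_compute_dtype=torch.bfloat16, **NF4_KWARGS)
```

The resulting object would typically be passed as `quantization_config=make_nf4_config()` to `AutoModelForCausalLM.from_pretrained(...)`, with LoRA adapters attached on top for training.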
Side-by-side comparison
| Property | GGUF | GPTQ | AWQ | NF4 |
|---|---|---|---|---|
| Primary use case | CPU / Apple Silicon inference | GPU inference serving | GPU inference serving | QLoRA fine-tuning |
| Container format | Single .gguf file | safetensors + config.json | safetensors + config.json | Not a standalone format |
| Quantization method | Block-wise K-quants | Hessian-based, column-by-column | Activation-aware saliency scaling | Normal-distribution optimized float |
| Typical bit-width | 2-8 bits (Q4_K_M most common) | 4-bit (3/8 also supported) | 4-bit | 4-bit |
| CPU inference | Native | No | No | No |
| GPU inference | Yes (partial or full layer offload) | Yes (Marlin kernel) | Yes (Triton kernel) | Yes, but slow (mainly used for training) |
| Apple Silicon | Native (Metal) | No | No | No |
| Calibration data needed | No | Yes (128-512 samples) | Yes (128-512 samples) | No |
| Accuracy at 4-bit | Good | Excellent | Excellent | Good |
| Inference engine | llama.cpp, Ollama, LM Studio | vLLM, TGI, HF Transformers, GPTQModel | vLLM, TGI, HF Transformers | HF Transformers (training) |
| Latest version | b9592 (llama.cpp, Jun 2026) | GPTQModel v7.1.0 (Jun 2026) | AutoAWQ v0.2.9 (May 2025) | bitsandbytes 0.49.2 (Feb 2026) |
Quantization at a glance: the pipeline
flowchart LR
A[FP16 model<br/>16-bit weights] --> B{Which format?}
B -->|CPU / Apple| C[GGUF quantization<br/>llama.cpp]
B -->|GPU serving| D[GPTQ quantization<br/>GPTQModel]
B -->|GPU serving| E[AWQ quantization<br/>AutoAWQ]
B -->|QLoRA training| F[NF4 loading<br/>bitsandbytes]
C --> G[Single .gguf file<br/>ready to run]
D --> H[safetensors + config<br/>load with vLLM/TGI]
E --> I[safetensors + config<br/>load with vLLM/TGI]
F --> J[4-bit training<br/>export to deploy format]
G --> K[llama.cpp / Ollama / LM Studio]
H --> L[vLLM / TGI / Transformers]
I --> L
J --> B
The diagram shows the branching decision. The critical fork is between CPU/Apple Silicon and GPU serving, because the format choice there determines the entire downstream toolchain.
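The same branching can be written as a small lookup. This is a sketch of the decision logic in the diagram, nothing more: the function name and the string labels for hardware and purpose are invented for the example.

```python
def pick_format(hardware: str, purpose: str = "inference") -> str:
    """Map a deployment target to a quantization format, following the diagram."""
    if purpose == "qlora_training":
        return "NF4"  # load in 4-bit for training, export to another format later
    if hardware in ("cpu", "apple_silicon"):
        return "GGUF"
    if hardware == "nvidia_gpu":
        return "GPTQ or AWQ"
    raise ValueError(f"no recommendation for hardware={hardware!r}")


for target in ("cpu", "apple_silicon", "nvidia_gpu"):
    print(f"{target:>14} -> {pick_format(target)}")
print(f"{'fine-tuning':>14} -> {pick_format('nvidia_gpu', 'qlora_training')}")
```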
Common pitfalls
Treating all 4-bit as equivalent. A 4-bit GPTQ model is not the same quality as a 4-bit GGUF Q4_K_M or a 4-bit NF4 model. The quantization method, calibration data, and block size all affect final perplexity. Always compare within the same family, and use perplexity as a relative guide, not an absolute one.
Assuming you need calibration data for every format. GPTQ and AWQ both require a small calibration dataset (typically 128 samples from the training distribution). GGUF and NF4 do not. If you are quantizing a model for which you do not have representative sample data, GGUF is the simpler path.
Quantizing for GPU, then trying to run on CPU. A GPTQ model uses GPU-only kernels. There is no CPU fallback. If you download a GPTQ model from Hugging Face and try to run it with llama.cpp, it will not work. Similarly, GGUF models run poorly (or not at all) in vLLM. The format and the runtime are coupled.
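A quick check before loading can catch a format and runtime mismatch early. The sketch below relies on two conventions: GGUF files begin with the four ASCII bytes `GGUF`, and Hugging Face checkpoints record the method under `quantization_config.quant_method` in config.json. It also falls back to the quantize_config.json file mentioned above for GPTQ. The function name is hypothetical, and real checkpoints may carry metadata this sketch does not look at.

```python
import json
import os


def detect_format(path: str) -> str:
    """Best-effort guess of the quantization format of a file or model directory."""
    if os.path.isfile(path):
        with open(path, "rb") as handle:
            return "gguf" if handle.read(4) == b"GGUF" else "unknown"

    config_path = os.path.join(path, "config.json")
    if os.path.isfile(config_path):
        with open(config_path, encoding="utf-8") as handle:
            config = json.load(handle)
        method = config.get("quantization_config", {}).get("quant_method")
        if method:
            return str(method)  # e.g. "gptq", "awq", "bitsandbytes"

    if os.path.isfile(os.path.join(path, "quantize_config.json")):
        return "gptq"
    return "unknown"
```

Calling `detect_format("model.gguf")` or `detect_format("./my-model-dir")` (both paths are placeholders) before choosing a runtime gives a clear error path instead of a confusing loader failure.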
Building an AWQ model with a stale version. AutoAWQ v0.2.9 (May 2025) is the latest release, but HF Transformers v5.11.0 (June 2026) also includes native AWQ loading via transformers.AwqConfig. If you use the Transformers integration, you do not need the standalone AutoAWQ library. Check which path is supported by your inference engine.
Using NF4 for deployment. NF4 is not a format designed for fast inference. The bitsandbytes 4-bit dequantization path is slow compared to the dedicated kernels in GPTQ (Marlin) or AWQ (Triton). Use NF4 for QLoRA training, then merge the trained adapter into the full-precision base model and quantize the merged model to GPTQ, AWQ, or GGUF for deployment.
When NOT to use each format
Do not use GGUF if you are serving a high-throughput API on NVIDIA GPUs. llama.cpp is built around local, low-concurrency inference; a batch-serving engine such as vLLM running GPTQ with the Marlin kernel is the better fit once you are handling many concurrent requests.
Do not use GPTQ if your deployment target is a MacBook, a Raspberry Pi, or a non-NVIDIA GPU. The fast GPTQ kernels such as Marlin target NVIDIA CUDA. For Apple Silicon, use GGUF. For AMD GPUs, check whether ROCm-based GPTQ kernels are available for your stack (limited support as of mid-2026).
Do not use AWQ if you cannot provide a representative calibration dataset. AWQ relies on activation statistics from real data. A mismatch between calibration data and deployment data degrades the saliency detection and can increase accuracy loss.
Do not use NF4 for anything beyond training. It is a data type introduced for QLoRA fine-tuning, not a deployment format. If you see a model on Hugging Face labeled "NF4", it was likely uploaded as a training checkpoint, not a serving artifact.
TL;DR
- There are four mainstream LLM weight quantization formats: GGUF, GPTQ, AWQ, and NF4. Each targets a different deployment scenario.
- GGUF (llama.cpp) is for CPU and Apple Silicon inference. It is a self-contained single-file format with no calibration step.
- GPTQ (GPTQModel v7.1.0) is for NVIDIA GPU serving. It uses Hessian-based quantization and the Marlin kernel for fast inference.
- AWQ (AutoAWQ v0.2.9) is also for NVIDIA GPU serving. It uses activation-aware saliency scaling and tends to achieve slightly better perplexity than GPTQ at the same bit-width.
- NF4 (bitsandbytes) is for QLoRA fine-tuning, not inference deployment. Use it to train, then re-quantize for serving.
- Choose your format based on your hardware (CPU vs NVIDIA GPU vs Apple Silicon) before considering bit-width or accuracy metrics. The runtime determines the format.
- Calibration data is required for GPTQ and AWQ, but not for GGUF and NF4.
Next post
Now that you know which format to use, the next question is: how fast will a quantized model actually run on your hardware? The next post breaks down tokens-per-second for each format across consumer GPUs, Apple Silicon, and CPU configurations, with concrete benchmarks you can use to size your deployment.
If you have a quantized model deployment story -- or a horror story about picking the wrong format -- the comments are the place to share it. The next post will include community-sourced numbers from exactly these stories.