Thokozani Buthelezi

Posted on Jun 27

How I Implemented GPTQ from Scratch (and What I Learned)

#llm #ai #machinelearning #deeplearning

I implemented GPTQ from scratch on a nanoGPT model and got only 1.1% perplexity degradation across 61 quantized layers. Here's exactly how it works and what I built.

1. The Problem with Naive Quantization

Quantization is one of the simplest and most effective ways to reduce the cost of running neural networks. Instead of storing weights in 32-bit floating point format, we reduce them to lower precision like INT8 or INT4. This reduces memory usage and can significantly speed up inference on hardware that supports low-precision arithmetic.

The simplest form of Post-Training Quantization (PTQ) is round-to-nearest, where each weight is independently rounded to the nearest quantized value. While this is fast and easy to implement, it ignores an important fact: neural network weights are not independent. Each weight contributes to a shared output, and small perturbations in one weight can interact with others in non-trivial ways.
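As a rough illustration of naive rounding, here is a minimal NumPy sketch. It assumes symmetric INT4 with one scale per output row; the function names and the choice of scale are illustrative assumptions, not taken from the repository's ptq.py.

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Symmetric round-to-nearest with one scale per row (assumed layout)."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for INT4
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)         # guard against all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    """Map integer codes back to floating point."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, scale = quantize_rtn(w)
w_hat = dequantize(q, scale)
max_err = np.abs(w - w_hat).max()   # bounded by half a quantization step
```

Every weight is rounded without looking at any other weight or at the layer's inputs, which is exactly the limitation the rest of the post addresses.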

Because of this, naive PTQ often introduces noticeable accuracy degradation. Some layers are extremely sensitive, and uniform rounding treats all weights as equally important. In practice, this leads to compounding errors across layers, especially in transformers where representations are tightly coupled.

This is the core problem: we need a way to quantize weights that respects the structure of the network, not just their individual magnitudes.

2. What Makes GPTQ Different

GPTQ approaches quantization as a local optimization problem per layer, rather than a simple rounding operation. Instead of treating each weight independently, it asks: how does changing this weight affect the layer’s output, and how should the remaining weights compensate for it?

To answer this, GPTQ uses second-order information about the layer-wise reconstruction error, the squared difference between the layer’s original and quantized outputs. The key object is the Hessian matrix of that error, which measures how sensitive it is to changes in each pair of weights. Intuitively, it tells us which directions in weight space are “steep” and which are “flat.”

Because the objective is a squared error on the layer’s output, this Hessian does not require backpropagating through the network (which would be expensive). It depends only on the layer’s inputs, so GPTQ estimates it from calibration data. This gives us a compact representation of how perturbations in one weight influence others.

The key idea is error propagation. When a weight column is quantized, it introduces an error. Instead of leaving that error isolated, GPTQ distributes it across the remaining unquantized weights in proportion to their sensitivity. This prevents error accumulation and preserves the layer’s output behaviour much more closely than naive rounding.
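In symbols, the idea can be written as follows. This is a sketch in my own notation, assuming X stacks the calibration inputs as rows, w is one row of the weight matrix, and H^{-1} is the inverse Hessian restricted to the weights that are not yet quantized.

```latex
\begin{aligned}
\min_{\hat{W}} \;& \lVert X W^\top - X \hat{W}^\top \rVert_F^2,
\qquad H = 2\, X^\top X \\
\delta \;&= -\,\frac{w_j - \operatorname{quant}(w_j)}{[H^{-1}]_{jj}} \,[H^{-1}]_{:,j}
\end{aligned}
```

Adding δ to the row sets entry j to its quantized value and shifts every remaining entry by the amount that best cancels the resulting output error.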

3. How I Built It

The first step was collecting activation statistics for each linear layer. I used forward hooks in PyTorch to capture the inputs arriving at every nn.Linear layer during a calibration phase. This gave me a dataset of representative activations without modifying the model’s forward pass.
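A minimal sketch of this kind of hook-based capture is shown below. The helper name collect_linear_inputs and the toy two-layer model are illustrative stand-ins, not the nanoGPT model or the code from the repository.

```python
import torch
import torch.nn as nn

def collect_linear_inputs(model, batches):
    """Run calibration batches and record the input seen by every nn.Linear."""
    captured = {}   # layer name -> list of (tokens, in_features) tensors
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            x = inputs[0].detach()
            # Flatten batch and sequence dimensions into one "tokens" axis
            captured.setdefault(name, []).append(x.reshape(-1, x.shape[-1]))
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            handles.append(module.register_forward_hook(make_hook(name)))

    model.eval()
    with torch.no_grad():
        for batch in batches:
            model(batch)

    for handle in handles:
        handle.remove()   # later forward passes run without the hooks
    return {name: torch.cat(xs) for name, xs in captured.items()}

# Toy stand-in for the real model (hypothetical)
toy = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
calib = [torch.randn(2, 5, 8) for _ in range(3)]
acts = collect_linear_inputs(toy, calib)
```

Each entry of acts is the matrix X for one layer, with one row per token seen during calibration.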

From these activations, I constructed an approximation of the Hessian matrix using:

H = 2 * Xᵀ X

where X is the matrix of collected inputs, with one row per calibration token. This step is critical because it encodes the geometry of how inputs interact with each weight column. Without calibration data, the quantizer has no notion of which directions in weight space matter most.
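Because X can be large, H is usually accumulated batch by batch rather than built from one concatenated matrix. The sketch below assumes each batch is already flattened to shape (tokens, in_features). The function name and the averaging over tokens are my assumptions, and it uses NumPy to stay self-contained.

```python
import numpy as np

def accumulate_hessian(batches, in_features):
    """Accumulate H = 2 * X^T X over batches, averaged over the token count."""
    H = np.zeros((in_features, in_features), dtype=np.float64)
    n_tokens = 0
    for x in batches:
        H += 2.0 * x.T @ x
        n_tokens += x.shape[0]
    # Averaging only rescales H; it keeps magnitudes comparable across layers
    return H / n_tokens

rng = np.random.default_rng(0)
batches = [rng.normal(size=(32, 6)) for _ in range(10)]   # 10 toy calibration batches
H = accumulate_hessian(batches, in_features=6)
```

The result is a symmetric in_features x in_features matrix, independent of how many output rows the layer has.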

Once the Hessian was computed, I added damping for numerical stability and inverted it using Cholesky decomposition. This inverse Hessian is what allows GPTQ to propagate error across columns efficiently.
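One way to sketch the damping and inversion step in NumPy is shown below. The 1% damping factor relative to the mean diagonal is an assumption (the post does not state the value used), and damped_inverse is a hypothetical helper name.

```python
import numpy as np

def damped_inverse(H, percdamp=0.01):
    """Add a small multiple of the mean diagonal, then invert via Cholesky."""
    H = H.copy()
    damp = percdamp * np.mean(np.diag(H))       # assumed damping strength
    H[np.diag_indices_from(H)] += damp
    L = np.linalg.cholesky(H)                   # H = L @ L.T, L lower triangular
    L_inv = np.linalg.solve(L, np.eye(H.shape[0]))
    return L_inv.T @ L_inv                      # H^{-1} = L^{-T} @ L^{-1}

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 6))
H = 2.0 * X.T @ X
H_inv = damped_inverse(H)
```

Damping matters most when some input features are nearly collinear or barely active in the calibration set, since H is then close to singular and the Cholesky factorization can fail.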

The core quantization loop processes each weight column sequentially. For each column j, I quantize it, compute the resulting error err_j (the original column minus its quantized version, one value per output row), and then adjust the remaining unquantized columns to compensate. The update rule is:

W[:, j+1:] -= outer(err_j, H_inv[j, j+1:]) / H_inv[j, j]

This line is the heart of GPTQ. It ensures that the error introduced by quantizing one column is redistributed according to the curvature of the loss landscape, rather than accumulating locally.

4. Results

After applying GPTQ to a nanoGPT-style model with 61 linear layers, the degradation in performance was surprisingly small given the simplicity of the implementation.

Model	Loss	Perplexity
Baseline	1.8521	6.37
GPTQ	1.8623	6.44

The loss rises by about 0.01, which corresponds to roughly a 1.1% increase in perplexity (6.37 to 6.44) across the full network. Considering that every linear layer was quantized independently using only 10 calibration batches, this is a strong result. It suggests that second-order information is highly effective at preserving model structure even under aggressive compression.
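The relationship between the two columns of the table can be checked directly, assuming the loss is the mean cross-entropy in nats so that perplexity is its exponential:

```python
import math

baseline_loss, gptq_loss = 1.8521, 1.8623   # values from the table above

ppl_base = math.exp(baseline_loss)
ppl_gptq = math.exp(gptq_loss)

# Relative change in each metric, in percent
loss_increase_pct = 100 * (gptq_loss - baseline_loss) / baseline_loss
ppl_increase_pct = 100 * (ppl_gptq / ppl_base - 1)
```

The relative change in loss is about half a percent, while the relative change in perplexity is about one percent, so it is worth stating which metric a percentage refers to.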

5. What I’d Do Differently

This implementation is correct algorithmically, but it is not optimized for production use. All quantized weights are still stored in float32 after dequantization, so there are no actual memory savings at runtime. Production GPTQ implementations store weights in a packed INT4 or INT8 format and dequantize on the fly during inference.
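As a rough idea of what packed storage involves, the sketch below stores two signed INT4 codes per byte. The layout (low nibble first, offset by 8) is an assumption chosen for simplicity, not the format of any particular GPTQ library.

```python
import numpy as np

def pack_int4(q):
    """Pack signed INT4 codes in [-8, 7] two per byte along the last axis."""
    u = (q.astype(np.int16) + 8).astype(np.uint8)          # shift to 0..15
    return (u[..., 0::2] | (u[..., 1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    """Recover the signed INT4 codes from the packed bytes."""
    lo = (packed & 0x0F).astype(np.int8) - 8
    hi = (packed >> 4).astype(np.int8) - 8
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.int8)
    out[..., 0::2] = lo
    out[..., 1::2] = hi
    return out

rng = np.random.default_rng(0)
q = rng.integers(-8, 8, size=(4, 8)).astype(np.int8)   # toy INT4 codes
packed = pack_int4(q)
restored = unpack_int4(packed)
```

For this 4 x 8 example the packed array takes 16 bytes, compared with 128 bytes for the same weights in float32, before counting the scales.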

Another improvement would be finer-grained quantization parameters, such as per-channel or group-wise scales, instead of the per-column quantization used here. Giving each small group of weights its own scale limits how much a single outlier can stretch the quantization grid, which would likely improve accuracy further, especially in deeper transformer layers where activation statistics vary significantly.
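The effect of group-wise scales is easy to see on a toy row with one outlier. This is a minimal sketch with an assumed group size of 4 and symmetric INT4, using plain rounding without any GPTQ compensation:

```python
import numpy as np

def quantize_groupwise(w, group_size, bits=4):
    """Quantize and dequantize with one scale per group of consecutive weights."""
    out_f, in_f = w.shape
    assert in_f % group_size == 0
    g = w.reshape(out_f, in_f // group_size, group_size)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(g).max(axis=2, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard against all-zero groups
    q = np.clip(np.round(g / scale), -qmax - 1, qmax)
    return (q * scale).reshape(out_f, in_f)

rng = np.random.default_rng(0)
w = rng.normal(size=(1, 16))
w[0, 0] = 100.0                                   # a single large outlier

w_row = quantize_groupwise(w, group_size=16)      # one scale for the whole row
w_grp = quantize_groupwise(w, group_size=4)       # one scale per 4 weights

mse_row = float(np.mean((w - w_row) ** 2))
mse_grp = float(np.mean((w - w_grp) ** 2))
```

With a single scale, the outlier forces a coarse grid on every weight in the row. With groups, only the three weights sharing its group pay that cost, at the price of storing more scales.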

Finally, more sophisticated calibration strategies (larger datasets, better sampling) would improve Hessian estimation and reduce approximation error.

You can find the full implementation, including ptq.py, gptq.py, and benchmark results, at:
https://github.com/Thoki-Buthelezi/elite-ai-systems-engineer-2026/tree/main/phase2_content/quantization

Top comments (3)

Alex Shev • Jun 27

Implementing GPTQ from scratch is a great way to make quantization feel less magical. The useful lesson is that memory savings are not free; the question is where the error lands and how much the downstream task notices. A small perplexity hit can hide very different behavior by layer.

Thokozani Buthelezi • Jun 27

Thanks! That last point really stood out to me. Measuring perplexity gave me confidence that the overall model quality was largely preserved, but it doesn’t reveal how the error is distributed across layers. Looking at per-layer sensitivity and how different layers respond to quantization would be a natural next step.

Alex Shev • Jun 28

Per-layer sensitivity would be a great next step. The aggregate metric tells you the model still looks okay from a distance, but deployment failures usually show up in uneven places. For agent systems, I see a similar pattern: average success rate hides the one tool path or context boundary where errors concentrate.