DEV Community: Edson

How to Calculate Perplexity (PPL) the Right Way (and Avoid Common Pitfalls)

Edson — Sat, 02 Aug 2025 00:39:05 +0000

Overview

Perplexity (PPL) is a widely used metric for evaluating language models. It measures how well a model predicts text, with lower PPL indicating better predictive performance.

You’ll often use PPL when:

Comparing different models (e.g., baseline vs. fine-tuned).
Evaluating quantization impact on model accuracy.
Benchmarking compression or optimization techniques.

While the formula is straightforward, implementation mistakes are common—and they can completely invalidate your results.
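For reference, here is the standard definition, written for a tokenized sequence of N tokens scored by a model with parameters θ. The symbols are notation chosen for this sketch, not taken from any particular library:

```latex
\mathrm{PPL}(x_{1:N}) = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\!\left(x_i \mid x_{<i}\right) \right)
```

Everything that follows is about the conditioning term: how much of the preceding text the model actually gets to see when each token is scored.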

⚠ Common Pitfall: Truncating Sequences

A frequent mistake is splitting your dataset into independent fixed-length chunks without preserving context.

Why is this a problem?

Language models rely on context continuity. If you break text into isolated sequences, the model cannot leverage preceding tokens, which inflates your PPL.

Example of Wrong Implementation

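# ❌ BAD: Each sample is tokenized and scored in isolation
from datasets import load_dataset

data = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
for sample in data["text"]:
    # Truncation drops everything past 512 tokens, and no context carries over
    tokens = tokenizer(sample, truncation=True, max_length=512)
    # compute NLL here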

This approach discards every dependency that crosses a sample boundary: each row of the dataset is scored without the sentences and paragraphs that precede it.

✅ Correct Approach

Concatenate the entire dataset into a single token stream.
Process the stream in manageable windows: contiguous chunks at minimum, ideally a sliding window with overlap.
Compute NLL across the continuous stream, not independent samples.

This ensures that your evaluation reflects realistic context usage, similar to how models are used in practice.
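To make the "compute NLL across the continuous stream" step concrete, here is a minimal sketch of the pooling arithmetic. It assumes you already have, for each window, the summed negative log-likelihood in nats and the number of tokens scored; `perplexity_from_chunks` is a hypothetical helper name, not part of any library.

```python
import math

def perplexity_from_chunks(nll_sums, token_counts):
    """Pool per-chunk summed NLLs (in nats) into one corpus-level perplexity."""
    total_nll = sum(nll_sums)
    total_tokens = sum(token_counts)
    if total_tokens == 0:
        raise ValueError("no tokens were scored")
    return math.exp(total_nll / total_tokens)

# Toy check: a model that assigns probability 1/4 to every token has PPL 4,
# no matter how the stream is split into chunks.
chunk_nlls = [10 * math.log(4), 30 * math.log(4)]
chunk_sizes = [10, 30]
ppl = perplexity_from_chunks(chunk_nlls, chunk_sizes)
print(round(ppl, 6))  # 4.0
```

Pooling sums and counts, rather than averaging per-chunk perplexities, keeps chunks of different lengths from being weighted equally.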

Implementation in PyTorch

Here’s a minimal implementation using the Wikitext-2 dataset. It scores contiguous, non-overlapping 2048-token windows of the concatenated stream:

import torch
import torch.nn as nn
from datasets import load_dataset
from tqdm import tqdm

def evaluate_perplexity(model, tokenizer):
    def _perplexity(nlls, n_samples, seqlen):
        # exp(pooled NLL / pooled token weight)
        return torch.exp(torch.stack(nlls).sum() / (n_samples * seqlen))

    # Load the test split and concatenate it into one token stream
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    text = "\n\n".join(dataset["text"])
    tokens = tokenizer(text, return_tensors="pt")
    input_ids = tokens.input_ids.to(model.device)

    seqlen = 2048
    # Number of full windows; leftover tokens at the end are dropped
    n_samples = input_ids.numel() // seqlen
    nlls = []
    loss_fct = nn.CrossEntropyLoss()  # mean NLL over the scored positions

    model.eval()
    with tqdm(range(n_samples), desc="Perplexity") as pbar:
        for i in pbar:
            start, end = i * seqlen, (i + 1) * seqlen
            batch = input_ids[:, start:end]
            with torch.no_grad():
                logits = model(batch).logits
            # Position t predicts token t + 1: drop the last logit and the first label
            shift_logits = logits[:, :-1, :]
            shift_labels = batch[:, 1:]
            loss = loss_fct(
                shift_logits.reshape(-1, shift_logits.size(-1)),
                shift_labels.reshape(-1)
            )
            # Weight the window's mean loss by its length before pooling
            nlls.append(loss * seqlen)
            curr_ppl = _perplexity(nlls, i + 1, seqlen)
            pbar.set_description(f"PPL {curr_ppl:.3f}")

    return _perplexity(nlls, n_samples, seqlen).item()

Why Sliding Windows Matter

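The implementation above scores non-overlapping windows, so the first tokens of each window are predicted with very little context. For large datasets, the concatenated stream may also not fit into memory at once. In either case:

Use a sliding window with overlap (e.g., 256 tokens shared between consecutive windows).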
Implement a stride-based approach like llama.cpp's llama-perplexity tool.
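
One way to lay out such windows is sketched below. This is an illustration of the general stride idea, not a port of llama.cpp's tool; `sliding_windows`, `max_length`, and `stride` are names chosen for this example. Each window is fed to the model in full, but only the tokens not covered by the previous window count toward the NLL.

```python
def sliding_windows(n_tokens, max_length, stride):
    """Yield (begin, end, n_scored) spans over a token stream.

    Each window feeds tokens[begin:end] to the model, but only the last
    n_scored tokens count toward the NLL; the earlier ones are context
    that the previous window already scored.
    """
    if not 0 < stride <= max_length:
        raise ValueError("need 0 < stride <= max_length")
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_length, n_tokens)
        yield begin, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break

windows = list(sliding_windows(n_tokens=10, max_length=6, stride=4))
print(windows)  # [(0, 6, 6), (4, 10, 4)]
```

The overlap is `max_length - stride`, so a 256-token overlap with 2048-token windows corresponds to a stride of 1792. In a causal model the very first token of the stream has nothing to condition on, so implementations typically score one fewer token in the first window.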

Impact on Quantization Evaluation

If you’re measuring PPL to validate INT8, AWQ, or GPTQ quantization, the wrong method can mislead you:

A naive truncation approach may show a +3 to +5 PPL penalty compared to the correct method.
This might lead you to overestimate accuracy degradation and discard otherwise good optimizations.

Key Takeaways

✅ Don’t truncate sequences randomly—context continuity matters.

✅ Always concatenate or slide over the dataset for accurate PPL.

✅ Use PPL carefully when benchmarking quantization or fine-tuning.

Quantizing Llama 3.2 with llama.cpp – A Practical Guide

Edson — Wed, 30 Jul 2025 22:52:01 +0000

Preface

Recently, I explored how to quantize Llama 3.2 from Meta using llama.cpp. During the process, I encountered a few unexpected challenges. After some trial and error, I managed to overcome these issues — and I thought it would be helpful to share what I learned.

If you’re looking to optimize Llama 3.2 for smaller hardware footprints while maintaining reasonable performance, this guide walks you through the steps and includes practical workarounds for its current lack of official support in llama.cpp.

Why llama.cpp?

llama.cpp is a lightweight C++ project that supports inference, evaluation, and quantization of large language models. While it provides built-in support for many popular models, Llama 3.2 isn’t officially included yet. The good news? Its architecture is similar enough to Llama 3 that it works just fine with a few tweaks.

What You’ll Do

Here’s a quick roadmap:

Set up llama.cpp tools
Download Llama 3.2 from Hugging Face
Convert the model to GGUF format
Quantize the model
Evaluate the result

Example project structure:

llama.cpp/
└── output/
    └── Llama-3-1B-Instruct/
        ├── model.safetensors
        ├── tokenizer.json
        └── ...

Step 1: Prepare llama.cpp Tools

Start by cloning and building llama.cpp. If you want GPU acceleration, enable CUDA:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

Step 2: Download the Model from Hugging Face

huggingface-cli login
huggingface-cli download --local-dir output/Llama-3-1B-Instruct meta-llama/Llama-3.2-1B-Instruct

Note: The local directory is intentionally named Llama-3-1B-Instruct for compatibility with llama.cpp scripts.

Step 3: Convert the Model to GGUF Format

Since Llama 3.2 isn’t officially supported, we need a small workaround.

a) Add Model Info in Conversion Script

Update convert_hf_to_gguf_update.py:

models = [
    {"name": "llama-bpe", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/meta-llama/Meta-Llama-3-8B"},
    {"name": "llama3",    "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct"},
]

The name is just an internal identifier for llama.cpp.
The trick is to set it to llama3 rather than llama3.2, because of how llama.cpp looks up that identifier internally.

b) Update Conversion Data

Run:

python convert_hf_to_gguf_update.py

This updates the necessary checksum info automatically.
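
For intuition about what that checksum is: to my understanding, the conversion scripts recognize a pre-tokenizer by hashing the token IDs it produces for a fixed probe string and mapping that hash to a name. The sketch below imitates the idea with a toy whitespace tokenizer. `PROBE_TEXT`, `toy_encode`, and `fingerprint` are hypothetical names for illustration, not llama.cpp code, and the real probe text and hashing details may differ.

```python
from hashlib import sha256

PROBE_TEXT = "Hello world 123"  # hypothetical probe string

def toy_encode(text):
    """Stand-in tokenizer: maps each whitespace-separated piece to an integer."""
    return [sum(piece.encode("utf-8")) for piece in text.split()]

def fingerprint(encode, probe=PROBE_TEXT):
    """Hash the token IDs that `encode` produces for the probe string."""
    token_ids = encode(probe)
    return sha256(str(token_ids).encode("utf-8")).hexdigest()

fp_a = fingerprint(toy_encode)
fp_b = fingerprint(lambda text: toy_encode(text.lower()))
print(fp_a == fp_b)  # False
```

Two tokenizers that split or normalize text differently produce different fingerprints, which is why a model with an unregistered tokenizer shows up as a checksum error until the update script has been run.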

c) Convert to GGUF

Finally:

python convert_hf_to_gguf.py ./output/Llama-3-1B-Instruct

You should see:

Llama3-1B-Instruct-F16.gguf

Step 4: Quantize the Model

Choose a quantization type (e.g., Q8_0, Q4_K) and pass it as the last argument, after the input and output paths:

./build/bin/llama-quantize ./output/Llama-3-1B-Instruct/Llama3-1B-Instruct-F16.gguf ./output/Llama-3-1B-Instruct/Llama3-1B-Instruct-Q8_0.gguf Q8_0
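
If you want to try several quantization types, a small script can generate the commands. This is a sketch that assumes llama-quantize takes the input file, output file, and type as positional arguments in that order, and that the F16 file name contains the string F16. `quantize_command` and `F16_MODEL` are hypothetical names.

```python
from pathlib import Path

F16_MODEL = "output/Llama-3-1B-Instruct/Llama3-1B-Instruct-F16.gguf"

def quantize_command(f16_path, quant_type, binary="./build/bin/llama-quantize"):
    """Build the argv list: <binary> <input.gguf> <output.gguf> <type>."""
    src = Path(f16_path)
    # Derive the output name by swapping the precision tag in the file name
    out = src.with_name(src.name.replace("F16", quant_type))
    return [binary, str(src), str(out), quant_type]

for quant_type in ("Q8_0", "Q4_K"):
    print(" ".join(quantize_command(F16_MODEL, quant_type)))

# To execute a command from the llama.cpp checkout instead of printing it:
# import subprocess
# subprocess.run(quantize_command(F16_MODEL, "Q8_0"), check=True)
```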

Step 5: Evaluate the Quantized Model

Use llama-perplexity to measure perplexity (PPL):

./build/bin/llama-perplexity -m output/Llama-3-1B-Instruct/Llama3-1B-Instruct-Q8_0.gguf -f dataset/wikitext2/calibration_dataset.txt
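
To compare the F16 and quantized models, it helps to pull the final number out of each run's log. The sketch below assumes the tool ends its output with a line of the form "Final estimate: PPL = <value> +/- <error>"; check this against your build, since the format is an assumption here. `parse_final_ppl` is a hypothetical helper, and the sample line is made up purely for illustration.

```python
import re

FINAL_RE = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")

def parse_final_ppl(log_text):
    """Return (ppl, error) from llama-perplexity output, or None if absent."""
    match = FINAL_RE.search(log_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

# Made-up sample line in the assumed format, not a real measurement.
sample = "Final estimate: PPL = 12.3456 +/- 0.07890\n"
print(parse_final_ppl(sample))  # (12.3456, 0.0789)
```

Run the same dataset file against both the F16 and the quantized GGUF so that the two numbers are directly comparable.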

Common Issues & Fixes

Checksum errors: Update convert_hf_to_gguf_update.py or pull latest scripts.
Model not recognized: Verify name in the models list and repo URL.
Accuracy drops: Try higher precision (e.g., Q8_0 instead of Q4_K).
Tokenizer problems: Ensure compatibility in llama-vocab.cpp.

Wrap-Up

While quantizing Llama 3.2 with llama.cpp isn’t yet a one-click process, it’s absolutely achievable with these tweaks. The result is a lighter, faster model that still performs well — perfect for running on consumer hardware or edge devices.

If you’ve tried other strategies or have insights, feel free to share — collaboration makes this journey easier for everyone!