DEV Community: Tech_Nuggets

KV cache and PagedAttention: what they do and why they matter

Tech_Nuggets — Sat, 20 Jun 2026 01:36:07 +0000

Your production LLM server is running behind schedule. You deployed a 70B model on four A100s with 80 GB each -- within spec, within budget -- but the time-to-first-token is creeping up as concurrent users increase. By lunch, latency is double what it was at 8 AM. You check GPU memory and find that 70% of HBM is in use. nvidia-smi attributes it only to the server process, but most of it is the cached transformer states of a dozen long-running conversations that nobody cleaned up. You restart the server. It works again. By 4 PM, the same slowdown is back.

This is the KV cache memory problem, and it is the single biggest operational bottleneck in production LLM serving on GPUs. This post explains what the KV cache actually stores, why it grows without bound during a conversation, and how PagedAttention -- the technique that powers vLLM -- solves it with OS-inspired memory management.

Why the KV cache matters

The KV cache is not optional. Every autoregressive transformer generates tokens one at a time. For token N, the attention mechanism needs the Key and Value tensors from tokens 0 through N-1. Recomputing those from scratch for every new token would be O(N^2) per step -- catastrophic for any conversation longer than a few hundred tokens. Instead, the inference engine caches the K and V tensors from prior tokens and appends to them on each step. That structure is the KV cache.

The problem is its memory footprint. For a Llama 3.1 70B model with 80 layers, 8 KV heads (grouped-query attention), and a head dimension of 128, a single 4096-token sequence requires approximately:

2 (K+V) * 80 layers * 8 KV heads * 128 dim * 4096 tokens * 2 bytes (FP16)
= 1,342,177,280 bytes per sequence
= ~1.3 GB per sequence

For 256 concurrent 4096-token sequences, that is about 344 GB of HBM -- more than four A100s provide (320 GB total). And that is before accounting for the model weights (~140 GB for 70B in FP16), the intermediate activations, the attention scores matrix, or any batching overhead.
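
As a sanity check on the arithmetic above, here is a minimal sketch of the same calculation. The function name kv_cache_bytes and its parameter names are made up for this post; the dimensions are the Llama 3.1 70B figures quoted earlier, and FP16 storage (2 bytes per element) is assumed.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """Bytes needed to cache K and V for one sequence (hypothetical helper)."""
    # The leading factor of 2 covers the separate Key and Value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


per_seq = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, num_tokens=4096)
total = 256 * per_seq  # 256 concurrent sequences of the same length

print(f"per sequence:  {per_seq:,} bytes ({per_seq / 1e9:.2f} GB)")
print(f"256 sequences: {total / 1e9:.1f} GB ({total / 2**30:.0f} GiB)")
```

Plugging in a different context length or a quantized cache (bytes_per_elem=1) shows how quickly the budget moves.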

This is the fundamental tension: the KV cache is mandatory for acceptable latency, but it can consume more memory than the model weights once concurrency or context length grows.

What the KV cache looks like in traditional serving

In most transformer inference implementations outside vLLM, the KV cache is a pre-allocated contiguous tensor. When a sequence starts, the framework allocates a past_key_values tuple sized for the maximum sequence length (or a user-specified max_new_tokens). The allocation happens up front and stays pinned until the sequence is done.

Here is a simplified view of what happens during a single generation step:

import torch

# Simplified attention step with a traditional contiguous KV cache
# query shape:       [batch, num_heads, 1, head_dim] (the newest token only)
# key_cache shape:   [batch, num_heads, max_seq_len, head_dim]
# value_cache shape: [batch, num_heads, max_seq_len, head_dim]

def attention_step(query, key_cache, value_cache, current_pos):
    head_dim = query.shape[-1]

    # Slice the cache to only the valid tokens so far
    past_keys = key_cache[:, :, :current_pos + 1, :]
    past_values = value_cache[:, :, :current_pos + 1, :]

    # Scaled dot-product attention of the new query against all cached keys
    scores = torch.matmul(query, past_keys.transpose(-2, -1))
    scores = scores / (head_dim ** 0.5)
    attn = torch.softmax(scores, dim=-1)
    output = torch.matmul(attn, past_values)
    return output

The contiguous allocation means you pay the maximum possible memory cost from the very first token, even if the conversation never reaches the maximum length. This is fine for offline evaluation with fixed-length sequences, but wasteful in interactive serving where most conversations are short.

Three specific inefficiencies arise:

Internal fragmentation. A sequence allocated for 4096 tokens that only uses 300 tokens wastes 93% of its allocation.
No sharing. Two conversations that start with the same system prompt must each store their own copy of the K and V tensors for the shared prefix. There is no mechanism to deduplicate.
All-or-nothing eviction. When memory runs out, the entire sequence must be evicted or swapped to CPU memory. Moving a 4096-token KV cache for a 70B model over PCIe takes tens of milliseconds, during which the GPU stalls.
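
To put a number on the first inefficiency, the sketch below compares reserved and used cache slots under contiguous pre-allocation. The batch of sequence lengths is invented for illustration, and wasted_fraction is a hypothetical helper, not part of any framework.

```python
MAX_SEQ_LEN = 4096  # slots reserved per sequence under contiguous pre-allocation


def wasted_fraction(used_tokens, reserved_tokens):
    """Share of reserved KV cache slots that never hold a token."""
    return (reserved_tokens - used_tokens) / reserved_tokens


# The single-sequence example from the text: 300 tokens in a 4096-token slot.
single = wasted_fraction(300, MAX_SEQ_LEN)

# A made-up interactive batch: mostly short conversations, one long one.
lengths = [300, 1200, 4096, 50]
batch = wasted_fraction(sum(lengths), MAX_SEQ_LEN * len(lengths))

print(f"single sequence: {single:.0%} wasted")
print(f"batch of {len(lengths)}: {batch:.0%} wasted")
```

Even with one sequence that fills its allocation completely, most of the reserved memory in this toy batch goes unused.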

How PagedAttention works

PagedAttention, introduced by the paper "Efficient Memory Management for Large Language Model Serving with PagedAttention" (Kwon, Li, Zhuang et al., 2023), applies operating-system-style virtual memory paging to the KV cache. Instead of allocating one contiguous block per sequence, the KV cache is divided into fixed-size blocks called pages -- typically 16 or 32 tokens per page. The attention kernel is modified to gather key and value data from non-contiguous physical pages during the attention computation.

flowchart TB
    subgraph Virtual["Virtual KV Cache (per sequence)"]
        S1[Sequence 1\npages: A, B, C, D]
        S2[Sequence 2\npages: E, F]
        S3[Sequence 3\npages: G, H, I, J, K]
    end

    subgraph PageTable["Logical-to-Physical Mapping"]
        P0["A -> Frame 0"]
        P1["B -> Frame 3"]
        P2["C -> Frame 7"]
        P3["D -> Frame 11"]
        P4["E -> Frame 1"]
        P5["F -> Frame 4"]
        P6["G -> Frame 2"]
        P7["H -> Frame 5"]
        P8["I -> Frame 8"]
        P9["J -> Frame 9"]
        P10["K -> Frame 6"]
    end

    subgraph Physical["Physical Memory Frames (GPU HBM)"]
        M0[(Frame 0)]
        M1[(Frame 1)]
        M2[(Frame 2)]
        M3[(Frame 3)]
        M4[(Frame 4)]
        M5[(Frame 5)]
        M6[(Frame 6)]
        M7[(Frame 7)]
        M8[(Frame 8)]
        M9[(Frame 9)]
        M10[(Frame 10)]
        M11[(Frame 11)]
    end

    S1 --> P0 & P1 & P2 & P3
    S2 --> P4 & P5
    S3 --> P6 & P7 & P8 & P9 & P10

    P0 --> M0
    P1 --> M3
    P2 --> M7
    P3 --> M11
    P4 --> M1
    P5 --> M4
    P6 --> M2
    P7 --> M5
    P8 --> M8
    P9 --> M9
    P10 --> M6

The block manager maintains a page table that maps each sequence's logical page numbers to physical frame numbers. When the attention kernel needs the key-value data for a token at a given position, it computes which page that position falls in, reads the page table to find the physical frame, and loads the data from that frame. The layout is invisible to the model -- the attention output is mathematically identical to the contiguous case.
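
The lookup described above can be sketched in a few lines. This is a toy illustration, not vLLM's kernel: the page table is a plain Python list holding the frame numbers of Sequence 1 from the diagram, and locate is a hypothetical helper name.

```python
PAGE_SIZE = 16  # tokens per page (assumed, matching the typical size mentioned above)

# Logical page index -> physical frame, for Sequence 1 in the diagram (A, B, C, D).
page_table = [0, 3, 7, 11]


def locate(token_pos, table, page_size=PAGE_SIZE):
    """Translate a token position into (physical frame, offset within the frame)."""
    logical_page, offset = divmod(token_pos, page_size)
    if logical_page >= len(table):
        raise IndexError(f"token {token_pos} is beyond the allocated pages")
    return table[logical_page], offset


for pos in (0, 37, 63):
    frame, offset = locate(pos, page_table)
    print(f"token {pos:2d} -> frame {frame:2d}, slot {offset:2d}")
```

The attention kernel performs the equivalent translation on the GPU for every page it reads, which is where the small per-token overhead of paging comes from.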

This design unlocks three capabilities that are not available with contiguous allocation:

1. On-demand allocation. A sequence only consumes pages as it grows. If a user asks a one-turn question that generates 150 tokens, the cache uses 10 pages (at 16 tokens per page). If another user runs a 5000-token document analysis, pages are allocated dynamically. Waste is limited to the unused slots in each sequence's last page (here, 10 of 160).

2. Copy-on-write for shared prefix pages. When multiple sequences share a common prefix -- the system prompt, the conversation history, a few few-shot examples -- PagedAttention maps the same physical pages into multiple virtual address spaces. The pages are marked read-only. If one sequence diverges during generation (which it always will after the first sampling step), only the page that actually changes is copied. In many chat applications, 40-60% of the tokens in a batch can be shared prefix tokens, so the memory savings are substantial.

3. Fine-grained eviction and swapping. When GPU memory is exhausted, the block manager selects pages to evict based on a least-recently-used policy. Evicted pages are written to CPU DRAM. Because pages are small (16-32 tokens), the transfer granularity is fine and the PCIe bandwidth cost is amortized across many small transfers rather than one large blocking move.
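
The first two capabilities can be sketched with a toy block manager. This is a simplified model under stated assumptions, not vLLM's implementation: it tracks only the page mappings and reference counts (no tensors, no eviction), and the class and method names are invented for this post.

```python
class BlockManager:
    """Toy page allocator: tracks mappings only, not the K/V tensors themselves."""

    def __init__(self, num_frames, page_size=16):
        self.page_size = page_size
        self.free = list(range(num_frames))   # unused physical frames
        self.refcount = {}                    # frame -> number of sequences mapping it
        self.tables = {}                      # seq_id -> [frame, ...]
        self.lengths = {}                     # seq_id -> tokens stored

    def _alloc(self):
        if not self.free:
            raise MemoryError("out of KV cache frames")
        frame = self.free.pop(0)
        self.refcount[frame] = 1
        return frame

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.page_size == 0:
            # No page yet, or the last page is full: allocate on demand.
            table.append(self._alloc())
        elif self.refcount[table[-1]] > 1:
            # Copy-on-write: the last page is shared, so this sequence gets a
            # private frame (a real system would also copy the page's data).
            self.refcount[table[-1]] -= 1
            table[-1] = self._alloc()
        self.lengths[seq_id] = n + 1

    def fork(self, parent_id, child_id):
        """Map the parent's pages into the child without copying anything."""
        self.tables[child_id] = list(self.tables[parent_id])
        self.lengths[child_id] = self.lengths[parent_id]
        for frame in self.tables[child_id]:
            self.refcount[frame] += 1


mgr = BlockManager(num_frames=32)

for _ in range(150):              # the 150-token answer from the text
    mgr.append_token("short")

for _ in range(40):               # a 40-token shared prefix
    mgr.append_token("prefix")
mgr.fork("prefix", "chat_a")
mgr.append_token("chat_a")        # diverges: only the last page is replaced

print(len(mgr.tables["short"]), mgr.tables["prefix"], mgr.tables["chat_a"])
```

After the fork, the two sequences still share the first two frames of the prefix; only the partially filled last page is duplicated when chat_a writes its first new token.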

PagedAttention vs traditional KV cache management

Aspect	Traditional contiguous KV cache	PagedAttention
Allocation strategy	Pre-allocate max length per sequence	On-demand, one page at a time
Memory waste due to fragmentation	High (allocated but unused slots)	Near zero (pay for used tokens only)
Shared prefix support	None (every sequence stores its own copy)	Copy-on-write page sharing
Eviction granularity	Entire sequence	16-32 token pages
Swap overhead per eviction	High (full sequence over PCIe)	Low (single page)
Peak throughput at same HBM budget	Baseline	2-4x on mixed workloads
Batch size ceiling	Limited by worst-case per-sequence allocation	Limited by actual memory consumption

The throughput gains are workload-dependent. vLLM's published benchmarks report 2-4x improvement over frameworks with contiguous allocation, with the largest gains on workloads that mix short and long sequences. For uniform-length batches, the advantage shrinks.

Common pitfalls

1. Page table overhead with very small page sizes. The page table itself lives in GPU memory. With page sizes of 4-8 tokens, the metadata can consume a non-trivial fraction of HBM. vLLM defaults to 16-token pages as the practical sweet spot. If you observe lower-than-expected throughput with very long contexts, check whether your page size is too small.

2. Scheduler parameters that work against PagedAttention. vLLM exposes --max-num-batched-tokens and --max-num-seqs, which control how many tokens and sequences are batched in a single iteration. Setting these too high can admit more work per iteration than the KV cache can hold, which hurts latency (and may trigger preemption) without improving throughput. Setting them too low underutilizes the GPU. The general guidance is to start with --max-num-seqs 256 and --max-num-batched-tokens 8192 for a 70B model and tune from there.

3. Prefix caching is not unconditionally beneficial. vLLM's automatic prefix caching (--enable-prefix-caching) computes a hash for every block of tokens. For very short prompts or rapidly rotating system prompts, the hash computation overhead can exceed the reuse benefit. Profile with and without it for your workload.

4. Interaction with KV cache quantization. PagedAttention is compatible with quantized KV caches such as FP8 or INT4, but each page carries metadata that is proportionally more significant when the data per page is smaller. vLLM exposes an FP8 KV cache through --kv-cache-dtype fp8; which GPU generations are supported depends on the vLLM version, so check the release notes for yours. Measure the combined effect before enabling.

When NOT to use it

PagedAttention and vLLM are not the right choice for every deployment:

Single-user local inference. If you run a model for one user on one GPU, the memory pressure that PagedAttention solves never arises. A simpler framework like llama.cpp or Hugging Face Transformers has lower overhead and fewer failure modes.
Sub-100ms interactive latency requirements. The page-walking logic during attention adds a small but measurable overhead per token -- roughly 3-5% for 16-token pages. If your application requires consistent sub-100ms time-to-first-token, a contiguous cache with static pre-allocation gives lower tail latency (at the cost of lower throughput).
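Small models on high-memory GPUs. A 7B model on an A100-80GB uses about 14 GB for weights and, at 4096-token context in FP16, roughly 0.5 GB of KV cache per sequence with grouped-query attention (32 layers, 8 KV heads, head dimension 128, as in Mistral 7B), or about 2 GB with full multi-head attention (as in Llama 2 7B). At modest concurrency levels, the cache fits easily without paging. PagedAttention's complexity buys you nothing here.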
Non-autoregressive architectures. Models that do not generate tokens left-to-right -- encoder-only models (BERT, RoBERTa), diffusion-based language models, non-causal decoders -- have no KV cache to manage. PagedAttention is specific to autoregressive decoding.
Uniform-length offline evaluation. If every sequence in a batch is the same length (common in evaluation benchmarks), the fragmentation and on-demand benefits of paging are minimal. The contiguous approach works fine.

TL;DR

The KV cache stores the Key and Value tensors from every previous token during autoregressive decoding. It is mandatory for acceptable latency but grows linearly with sequence length and batch size.
For a Llama 3.1 70B model at 256 concurrent 4096-token sequences, the KV cache consumes approximately 344 GB of HBM -- more than four A100s can provide.
PagedAttention (Kwon et al., 2023) applies OS-style virtual memory paging to the KV cache: fixed-size pages, on-demand allocation, copy-on-write page sharing, and fine-grained eviction.
vLLM (v0.23.0, June 2026, 83k+ GitHub stars) implements PagedAttention and achieves 2-4x throughput over contiguous-allocation frameworks on mixed workloads.
Default to 16-token pages and tune --max-num-seqs and --max-num-batched-tokens for your model and workload.
Use PagedAttention when concurrency is high, sequences vary in length, or prompts share prefixes. Skip it for single-user inference, small models, or uniform-length batches.

Next post: vLLM vs TGI vs llama.cpp -- a practical serving benchmark for the same 70B model under realistic concurrency, comparing throughput, latency, and cost per token.

Tokenization under the hood: BPE, WordPiece, SentencePiece, and Unigram compared

Tech_Nuggets — Wed, 17 Jun 2026 01:10:36 +0000

You deploy a chatbot. English queries average 42 tokens each. Then a Spanish-speaking user sends "¿Cómo puedo restablecer mi contraseña?" and it eats 103 tokens. Two weeks later, the same model starts outputting "Ġcon" at the edges of its generations and you cannot tell if it is a bug or a feature. The finance team flags a 40% month-over-month cost increase that no one can explain.

This is what happens when tokenization is treated as invisible plumbing. Every major LLM pipeline uses one of four subword tokenization algorithms, and the choice determines vocabulary size, handling of rare words, cross-language efficiency, and inference cost. Understanding which one your model uses -- and why -- is the difference between shipping a cost-efficient product and discovering mid-quarter that your token-per-query ratio quietly doubled.

Why this matters

Tokenization directly controls three things that hit your bottom line:

Inference cost. LLM APIs charge by token. A model using a 32K-vocab BPE tokenizer may break "restablecer" into 8 tokens, while a 100K-vocab Unigram tokenizer handles it in 3. Over a million queries, that difference adds up to real money.

Vocabulary coverage. Rare words, code syntax, and multilingual text stress the tokenizer. A poorly fitting vocabulary means longer sequences, which means slower generation and higher cost.

Model behavior. The tokenizer is the model's entire view of language. If your tokenizer encodes "cowboy" as ["cow", "boy"], the model learns something different than if it encodes it as ["c", "owb", "oy"]. This affects everything from spelling ability to cross-lingual transfer.

The four tokenization algorithms

Every modern tokenizer takes raw text, optionally pre-tokenizes it into words (splitting on whitespace and punctuation), then breaks words into subword units from a fixed-size vocabulary. The difference is in how that vocabulary is built and how segmentation decisions are made.

1. BPE (Byte-Pair Encoding)

BPE was introduced in 1994 for data compression and adapted for neural machine translation by Sennrich et al. in 2016. OpenAI adopted it for GPT-2 and it remains the core of GPT-4o, Llama 3, and most modern LLMs.

How it works: Start with every individual character as a token. Count all adjacent token pairs, merge the most frequent pair into a new token, add it to the vocabulary, and repeat until you hit the target vocabulary size.

Vocabulary size goal: 12 (7 base characters + 5 merges)
Initial vocabulary: [l, o, w, e, r, s, t]
Training corpus: "low low low low low low low low lower lowest lowest lowest lowest lowest lowest lowest"
(word counts: low x8, lower x1, lowest x7)

Step 1: Count pairs -> ("l", "o") appears 16 times (tied with ("o", "w"); first pair seen wins), merge -> "lo"
Step 2: Count pairs -> ("lo", "w") appears 16 times, merge -> "low"
Step 3: Count pairs -> ("low", "e") appears 8 times, merge -> "lowe"
Step 4: Count pairs -> ("lowe", "s") appears 7 times (tied with ("s", "t")), merge -> "lowes"
Step 5: Count pairs -> ("lowes", "t") appears 7 times, merge -> "lowest"
...

BPE is greedy and deterministic: for any input, the segmentation is the same every time. The algorithm applies the learned merge rules in order. OpenAI's GPT-4o uses o200k_base (roughly 200K tokens), GPT-4 used cl100k_base (roughly 100K tokens), and GPT-2 used a 50,257-token vocabulary.
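
The merge loop can be written out as a short training sketch and run on the toy corpus above. This is a minimal illustration, assuming whitespace pre-tokenization and a "first pair seen wins" tie-break; production tokenizers add byte-level handling and far faster data structures. The function name train_bpe is made up for this post.

```python
from collections import Counter


def train_bpe(corpus, num_merges):
    """Learn BPE merges; returns a list of ((left, right), pair_count)."""
    # Pre-tokenize on whitespace and represent each word as a tuple of symbols.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # on ties, the first pair seen wins
        merges.append((best, pairs[best]))

        # Rewrite every word, replacing occurrences of the best pair.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges


corpus = " ".join(["low"] * 8 + ["lower"] + ["lowest"] * 7)
merges = train_bpe(corpus, num_merges=5)
for (left, right), count in merges:
    print(f"merge {left!r} + {right!r} (seen {count} times)")
```

A different tie-break rule would pick ("o", "w") or ("s", "t") at the tied steps, which is one reason two BPE vocabularies trained on the same data can differ.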

Who uses it: GPT-4o, GPT-4, GPT-3.5, Llama 2 (via SentencePiece), Llama 3 (tiktoken-style byte-level BPE), DeepSeek, Mistral.

2. WordPiece

Google introduced WordPiece for Japanese/Korean voice search in 2012, and it powered BERT in 2018. It is often described as "BPE but with likelihood instead of frequency."

How it works: The algorithm starts the same way as BPE -- character-level initial tokens -- but instead of counting raw frequencies, it merges the pair that maximizes the likelihood of the training data under the current vocabulary. In practice this means it picks the pair whose merge increases the corpus-likelihood the most.

Compare merge candidates:
  Merge ("a", "b") -> new token likelihood gain: 0.0032
  Merge ("th", "e") -> new token likelihood gain: 0.0417
  Merge ("ing", " ") -> new token likelihood gain: 0.0281

WordPiece picks ("th", "e") because the probability lift is largest.

The result is that WordPiece tends to create tokens that are more linguistically meaningful -- common prefixes, suffixes, and root words -- compared to BPE's purely frequency-driven merges.
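
One commonly cited way to score WordPiece merge candidates is the pair count divided by the product of the two symbol counts. Google did not publish its training code, so treat the sketch below as an approximation of the idea rather than the reference algorithm. The toy word counts are invented, and the "##" continuation prefix is left out for brevity.

```python
from collections import Counter


def wordpiece_scores(word_freqs):
    """Score each adjacent pair as count(pair) / (count(left) * count(right))."""
    symbol_counts = Counter()
    pair_counts = Counter()
    for symbols, freq in word_freqs.items():
        for symbol in symbols:
            symbol_counts[symbol] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_counts[pair] += freq
    scores = {
        pair: count / (symbol_counts[pair[0]] * symbol_counts[pair[1]])
        for pair, count in pair_counts.items()
    }
    return scores, pair_counts


# Hypothetical corpus: each word split into characters, with its frequency.
word_freqs = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

scores, pair_counts = wordpiece_scores(word_freqs)
best_by_score = max(scores, key=scores.get)
most_frequent = max(pair_counts, key=pair_counts.get)
print("WordPiece-style pick:", best_by_score)
print("BPE-style pick:      ", most_frequent)
```

On this toy data the two criteria disagree: the frequency rule favors a pair built from very common symbols, while the normalized score favors a rarer pair whose parts almost always occur together.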

Who uses it: BERT, DistilBERT, ELECTRA, and most encoder-only models from Google.

3. SentencePiece

SentencePiece is a framework by Google (Kudo and Richardson, 2018) that implements both BPE and Unigram tokenization. Its defining innovation: it operates directly on raw text without requiring a pre-tokenization step. Most tokenizers need whitespace/punctuation splitting before training, which ties them to a language-specific concept of "word." SentencePiece treats the input as a raw stream of Unicode characters, with whitespace escaped as an ordinary symbol (▁, U+2581), which makes it language-agnostic. An optional byte-fallback mode decomposes characters outside the vocabulary into UTF-8 bytes.

Raw text: "Hello世界"
With pre-tokenization: ["Hello", "世界"]  <- language-dependent
SentencePiece raw: "H", "e", "l", "l", "o", "世", "界"  <- no pre-tokenization needed
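
One detail that makes this possible is how SentencePiece represents whitespace: spaces are escaped to the meta symbol ▁ (U+2581) so they become ordinary symbols that can be merged like any other. The sketch below imitates only that step in plain Python. It does not call the sentencepiece library, it skips normalization and the optional dummy prefix, and the helper names are invented.

```python
WS = "\u2581"  # "▁", the meta symbol SentencePiece uses for whitespace


def escape_whitespace(text):
    """Turn spaces into a visible symbol so no separate word-splitting is needed."""
    return text.replace(" ", WS)


def detokenize(pieces):
    """Concatenate pieces and restore the spaces."""
    return "".join(pieces).replace(WS, " ")


raw = "Hello world 世界"
symbols = list(escape_whitespace(raw))  # character-level starting point for training

print(symbols)
print(detokenize(symbols) == raw)
```

Because the space is just another symbol, detokenization is a plain concatenation, with no language-specific rules about where spaces belong.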

Who uses it: Llama 2, Gemma, T5, XLNet (in Unigram mode).

4. Unigram Language Model

Unigram (Kudo, 2018) flips the problem around. Instead of greedily building up a vocabulary from characters, it starts with a large vocabulary of candidate tokens and prunes it down using a probabilistic model.

How it works: Unigram models each token as an independent event and learns a probability distribution over the vocabulary. The segmentation of a word is the sequence of tokens whose probabilities multiply to the highest score.

Vocabulary: {"UN": 0.02, "UNIC": 0.005, "NI": 0.01, "UNI": 0.015, "I": 0.03, "C": 0.04, "O": 0.02, "R": 0.01, "N": 0.02, ...}

Input: "UNICORN"
Candidate segmentations and their scores:
  UN + I + C + O + R + N  -> 0.02 * 0.03 * 0.04 * 0.02 * 0.01 * 0.02 = 9.6e-11
  UNI + C + O + R + N     -> 0.015 * 0.04 * 0.02 * 0.01 * 0.02 = 2.4e-9
  UNIC + O + R + N        -> 0.005 * 0.02 * 0.01 * 0.02 = 2.0e-8  <-- best

Unigram picks the highest-probability segmentation: UNIC + O + R + N

Because Unigram evaluates multiple candidate segmentations and chooses the best one probabilistically, it is slower to tokenize than BPE but produces more consistent token-to-meaning mappings. The probabilistic nature also enables subword regularization -- randomly sampling alternative segmentations during training to improve robustness.
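
In practice the best segmentation is found with dynamic programming (Viterbi) rather than by listing every candidate. Here is a minimal sketch using the probabilities from the UNICORN example; the single-character values are the ones that appear in the products above, and viterbi_segment is a hypothetical helper, not the SentencePiece API.

```python
import math


def viterbi_segment(text, probs):
    """Return (tokens, probability) of the most likely segmentation of text."""
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of its last token)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")

    # Walk the back-pointers from the end of the string to recover the tokens.
    tokens, end = [], n
    while end > 0:
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1], math.exp(best[n][0])


vocab = {
    "UN": 0.02, "UNIC": 0.005, "NI": 0.01, "UNI": 0.015,
    "I": 0.03, "C": 0.04, "O": 0.02, "R": 0.01, "N": 0.02,
}
tokens, prob = viterbi_segment("UNICORN", vocab)
print(tokens, f"{prob:.1e}")
```

Working in log-probabilities avoids underflow on long inputs, and the table of partial results is why the cost grows with input length times maximum token length rather than with the number of possible segmentations.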

Who uses it: T5, XLNet, ALBERT, and SentencePiece in Unigram mode.

Algorithm comparison

Property	BPE	WordPiece	SentencePiece (BPE)	Unigram LM
Vocabulary building	Greedy merge by frequency	Greedy merge by likelihood	Greedy merge by frequency (same as BPE)	Start big, prune by likelihood
Pre-tokenization required	Yes (whitespace/punctuation)	Yes	No (raw text)	No when run via SentencePiece
Deterministic segmentation	Yes	Yes	Yes	Yes by default (sampling optional)
Typical vocab size	32K-200K	30K	32K-128K	32K-256K
Speed	Fast	Fast	Fast	Medium (Viterbi decoding)
Multilingual handling	Weak (needs large vocab)	Moderate	Best (no whitespace assumption)	Best (no whitespace assumption + sampling)
Rare word handling	Decomposes to chars	Decomposes to chars	Decomposes to chars, or bytes with byte fallback	Decomposes to subwords
Primary users	OpenAI, Meta (Llama 3), Mistral	Google (BERT)	Meta (Llama 2), Google (Gemma)	Google (T5, XLNet)

What this looks like in practice

Here is a Python snippet using tiktoken (OpenAI's BPE tokenizer library) to see how different inputs break apart:

import tiktoken

# GPT-4o uses o200k_base encoding
enc = tiktoken.get_encoding("o200k_base")

test_strings = [
    "Hello, world!",
    "restablecer",          # Spanish
    "Das ist fantastisch",  # German
    "こんにちは",            # Japanese
    "def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
]

for s in test_strings:
    tokens = enc.encode(s)
    token_strs = [enc.decode([t]) for t in tokens]
    print(f"{s!r:45s} -> {len(tokens):3d} tokens: {token_strs[:6]}...")

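Illustrative output (not measured: exact splits and counts depend on the encoding and the tiktoken version, so run the snippet for real numbers):

Hello, world!                                    -> 4 tokens: ['Hello', ',', ' world', '!']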
restablecer                                      -> 8 tokens: ['rest', 'able', 'cer', ...]
Das ist fantastisch                              -> 6 tokens: ['Das', ' ist', ' fant', 'ast', 'isch', ...]
こんにちは                                         -> 5 tokens: ['こ', 'ん', 'に', 'ち', 'は']
def fibonacci(n): return n if n <= 1 else ...   -> 22 tokens: ['def', ' fib', 'onacci', ...]

Notice how the Spanish word takes 8 tokens while an analogous English word of similar length might take 3-4. This is the cost asymmetry that shows up on your monthly bill.

Here is a diagram showing how a single word passes through each tokenizer type:

flowchart TD
    A["Input: 'unbelievable'"] --> B["Pre-tokenization<br/>(split on space/punct)"]
    B --> C{"Tokenizer type?"}

    C -->|BPE| D["Lookup in vocab: 'un' + 'believable'<br/>If 'believable' not found:<br/>'b' + 'el' + 'ievable' ...<br/>Greedy character-level fallback"]
    C -->|WordPiece| E["Lookup longest prefix: 'un'<br/>Try '##believable'<br/>If not found: '##b' + '##el' + ...<br/>Likelihood-based merging"]
    C -->|SentencePiece| F["Segmentation on raw text<br/>No pre-tokenization<br/>BPE merge rules on Unicode characters<br/>'un' + 'bel' + 'ievable'"]
    C -->|Unigram| G["Score all candidate segmentations<br/>Pick highest-probability path<br/>'un' + 'believ' + 'able'<br/>Probabilistic, may vary"]

    D --> H["Output tokens"]
    E --> H
    F --> H
    G --> H

Common pitfalls

Assuming all tokenizers handle multilingual text equally. BPE-based tokenizers that rely on space-prefix pre-tokenization (like cl100k_base) degrade significantly on CJK and Indic scripts where whitespace does not separate words. SentencePiece models handle these better because they do not assume whitespace-delimited words. If your user base spans non-Latin scripts, check your tokenizer's cross-language efficiency before picking a model.

Tying your prompt design to the wrong encoding. The same instruction, for example "Output the result as JSON", can split into a different number of tokens under cl100k_base than under o200k_base, so count it with the encoding you actually deploy. Developers who craft prompts for GPT-4 and then migrate to a model with a different tokenizer silently change where the prompt's token boundaries fall, which can shift output quality.

Ignoring the tokenizer's role in fine-tuning. When you fine-tune a model, you can extend the vocabulary -- but doing so requires initializing new embedding vectors, and the model will behave unpredictably with the new tokens for the first few thousand steps. Most practitioners are better off using the existing vocabulary and handling out-of-vocabulary tokens via character-level fallback.

The "split on prefix space" trap. Most BPE tokenizers attach the preceding space to each word during pre-tokenization, so mid-sentence the model sees the token " Hello", not "Hello". This means "Hello" at the very start of a string and " Hello" after a space are different tokens with different IDs, and so are " Hello" and " hello". If your text formatting changes -- stripping or doubling spaces, using non-standard punctuation -- the same semantic content can tokenize into noticeably more tokens.

Forgetting that tokenizer version matters. p50k_base and cl100k_base and o200k_base all use BPE with different pre-tokenization rules and vocab sizes. A comparison of two models' outputs is meaningless if you used different tokenizers to count their tokens. Pin your tiktoken version (tiktoken==0.13.0 as of June 2026) and your encoding name in every evaluation script.

When NOT to use it

When you need exact character-level control. Tokenization destroys alignment between text characters and model internals. If you are building a spelling corrector, a character-level model (like ByT5 or CANINE) produces better results than any subword tokenizer.

When latency is the absolute priority. SentencePiece Unigram runs a Viterbi search over candidate segmentations, which costs more per input than applying BPE merge rules; WordPiece uses a greedy longest-match lookup and is closer to BPE in speed. If you are working within single-digit millisecond TTFT budgets, use a pure BPE tokenizer and keep the vocabulary under 50K.

When you are building a single-language, domain-specific model. If your entire task is English medical text classification, you can build a custom BPE vocabulary (15K-20K tokens) that outperforms the general-purpose 100K vocabulary in both speed and perplexity. The general vocabularies are optimized for web-scale diversity, not domain density.

When you need reversible tokenization. Subword tokenization can be lossy: you cannot reconstruct the original string exactly from the token IDs if the tokenizer applied normalization (lowercasing, NFKC Unicode normalization, etc.). If you need exact round-trips, use a tokenizer that works on raw bytes without normalization, such as ByT5 or byte-level BPE.

When you are benchmarking across model families. Comparing GPT-4o (200K vocab, BPE) against Llama 2 (32K vocab, SentencePiece BPE) by token count is comparing apples to oranges. Always benchmark on character or byte cost, not token cost, when models use different tokenizers.
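
One way to follow that last recommendation is to normalize every token count by the UTF-8 size of the input. The sketch below assumes you can wrap each model's tokenizer in a function that returns a token count; the two counters here are deliberately crude stand-ins so the example runs without any tokenizer library, and all names are hypothetical.

```python
def tokens_per_byte(text, count_tokens):
    """Token count normalized by UTF-8 size, comparable across tokenizers."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must be non-empty")
    return count_tokens(text) / n_bytes


# Stand-in tokenizers (assumptions for the demo): replace these with calls
# into the real tokenizer of each model you are comparing.
def whitespace_tokens(text):
    return len(text.split())


def char_tokens(text):
    return len(text)


sample = "¿Cómo puedo restablecer mi contraseña?"
for name, counter in [("whitespace", whitespace_tokens), ("character", char_tokens)]:
    print(f"{name:10s} {tokens_per_byte(sample, counter):.3f} tokens/byte")
```

Multiplying tokens per byte by each provider's price per token gives a cost per byte of input, which is the figure that can be compared across model families.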

TL;DR

BPE (GPT-4o, Llama 3, Mistral) builds vocabulary by merging the most frequent character pairs greedily. Deterministic, fast, but weak on multilingual text.
WordPiece (BERT, ELECTRA) merges by likelihood gain rather than frequency. Produces more linguistically meaningful tokens but requires pre-tokenization.
SentencePiece (Llama 2, Gemma, T5) implements BPE and Unigram, operating on raw text without pre-tokenization. Best multilingual handling.
Unigram (T5, XLNet) starts with a large vocabulary and prunes it by likelihood. Supports subword regularization and produces more consistent token-to-meaning alignments at the cost of slower segmentation.
Tokenizer choice directly impacts inference cost: a 32K vocab English-optimized tokenizer and a 200K vocab general tokenizer will produce very different token counts for the same multilingual input.
Pin your tokenizer version and encoding name when reporting any token-count metric. Differences between cl100k_base and o200k_base can shift token counts by 15-30% on the same text.

When you know which tokenizer your model uses, the next question is how to prepare your data so that tokenizer wastes as few tokens as possible. That means strategic prompt design, choosing the right model for your language mix, and building evaluation pipelines that measure token efficiency alongside accuracy. We will cover token-efficient prompt engineering in the next post -- including a concrete method for estimating your per-user token consumption before you deploy.

RLHF vs DPO vs IPO vs KTO: which alignment method should you use

Tech_Nuggets — Tue, 16 Jun 2026 01:08:06 +0000

You have a base model, say Llama 3.1 8B, that can write poetry in any meter and pass the bar exam. It can also generate instructions for synthesizing controlled substances, roleplay as a manipulative therapist, and explain in loving detail why your pull request is an affront to good taste. You need to align it: remove the harmful outputs while keeping the capability. Your mentor says "use RLHF." A paper on your feed says DPO is simpler. Your colleague swears by KTO because they only have thumbs-up/thumbs-down log data from production. Where do you start?

Choosing an alignment method is not a theoretical debate. It is a practical decision that depends on your data, your compute budget, and the failure modes you are trying to avoid. This post compares the four dominant approaches side by side, with the actual math, the data requirements, and the sharp edges you will hit in production.

Why this matters

The alignment method you pick determines three things that directly affect shipping timelines:

Data requirements. Some methods need pairwise preferences (A beats B). Others work with per-sample binary scores. If you have production logs, you probably already have the latter. If you have a human annotation pipeline, you can collect the former — at a cost.
Compute budget. RLHF requires training a separate reward model of comparable size to your policy model, then running PPO, which is notoriously sample-inefficient and sensitive to hyperparameters. DPO, IPO, and KTO collapse the process into a single training loop on static data.
Stability and robustness. PPO can destabilize and collapse your policy. DPO can overfit to preference noise. IPO adds a regularization term that mitigates that. KTO handles scenarios where you have no strict pairwise comparisons at all.

Understanding these tradeoffs is the difference between an aligned model that ships in two weeks and an alignment project that drags for three months.

RLHF, DPO, IPO, and KTO: how each method works

All four methods start from the same place: a supervised fine-tuned (SFT) model and a dataset that captures human preferences. How they use that data differs fundamentally.

RLHF (Reinforcement Learning from Human Feedback)

The canonical approach, popularized by OpenAI's InstructGPT paper (Ouyang et al., 2022), is a three-stage pipeline:

Collect human preferences — annotators rank model outputs for a set of prompts, producing pairwise preferences (chosen vs rejected).
Train a reward model — a separate model (usually the same architecture as the policy) is trained to predict the human preference score from a given output. It learns a scalar reward function that approximates human judgment.
Optimize the policy with PPO — the policy model generates outputs, the reward model scores them, and PPO (Proximal Policy Optimization) updates the policy to increase the expected reward. A KL penalty keeps the policy from diverging too far from the SFT model.

# Simplified PPO update (conceptual)
# reward = reward_model(prompt, response) - beta * kl_divergence(policy || ref_policy)
# ratio = exp(new_logprobs - old_logprobs)
# policy_loss = -min(ratio * advantage, clip(ratio, 1 - eps, 1 + eps) * advantage)

The three-stage pipeline is expensive — each stage requires its own training run, its own GPU budget, and its own hyperparameter sweep. The reward model can learn to exploit spurious correlations (reward hacking), and PPO is sensitive to the learning rate and KL penalty coefficient. On the plus side, online PPO can in theory discover outputs that are better than any human annotation in the dataset.
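
To make the two moving parts of stage three concrete, here is a per-sample sketch of the KL-shaped reward and the clipped PPO objective in plain Python. It is a toy under stated assumptions: real implementations work on batches of per-token tensors and estimate advantages with a value model, and the function names and default coefficients (beta=0.1, eps=0.2) are illustrative choices, not taken from any particular library.

```python
import math


def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Reward-model score minus a KL penalty toward the reference (SFT) policy."""
    # Single-sample KL estimate: log pi(y|x) - log pi_ref(y|x).
    return rm_score - beta * (logp_policy - logp_ref)


def ppo_clip_loss(advantage, old_logprob, new_logprob, eps=0.2):
    """Clipped surrogate loss for one sample (lower is better)."""
    ratio = math.exp(new_logprob - old_logprob)
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum removes any incentive to move the ratio outside the clip range.
    return -min(ratio * advantage, clipped_ratio * advantage)


print(shaped_reward(rm_score=1.0, logp_policy=-1.0, logp_ref=-1.5))
print(ppo_clip_loss(advantage=1.0, old_logprob=-2.0, new_logprob=-1.0))
print(ppo_clip_loss(advantage=-1.0, old_logprob=-2.0, new_logprob=-1.0))
```

The second call shows the clip at work: the probability ratio is about 2.7, but the objective only credits the update up to 1 + eps. Both beta and eps are among the hyperparameters that make PPO sensitive to tuning.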

DPO (Direct Preference Optimization)

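Rafailov et al. (2023) showed that an explicit reward model is not needed. The key insight is that the KL-regularized reward-maximization objective used in RLHF has a closed-form optimal policy, which lets the reward be rewritten in terms of the policy and the reference policy. Substituting that expression into the Bradley-Terry preference model (the statistical model behind most reward models) yields a loss that depends only on the policy, the reference policy, and the preference data.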

DPO eliminates the reward model entirely. The training loss is:

L_DPO = -E[log sigmoid(beta * (log pi(y_w|x)/pi_ref(y_w|x) - log pi(y_l|x)/pi_ref(y_l|x)))]

Where y_w is the chosen output, y_l is the rejected output, pi is the current policy, pi_ref is the frozen reference policy (the SFT model), and beta controls how far the policy can diverge.
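
For a single preference pair, the loss above reduces to a few lines of arithmetic. The sketch below takes sequence-level log-probabilities as plain floats, which is an assumption for illustration: a real trainer sums per-token log-probabilities from the model. The function name dpo_loss is made up for this post.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given sequence log-probabilities."""
    # Difference of log-ratios: how much more the policy favors the chosen
    # output over the rejected one, relative to the reference policy.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    z = beta * margin
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: no preference expressed yet.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy has shifted probability toward the chosen output.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

When the policy equals the reference, the loss is log 2; it falls as the policy raises the chosen output's likelihood relative to the rejected one, and beta controls how quickly.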

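# DPO training in practice (using Hugging Face TRL)
# Assumes policy_model, ref_model, and preference_dataset are already loaded.
from trl import DPOConfig, DPOTrainer

# Recent TRL releases take beta on DPOConfig rather than on DPOTrainer.
training_args = DPOConfig(
    output_dir="dpo-output",  # hypothetical path
    beta=0.1,                 # KL regularization strength
)

dpo_trainer = DPOTrainer(
    model=policy_model,
    ref_model=ref_model,
    train_dataset=preference_dataset,
    args=training_args,
)
dpo_trainer.train()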

DPO runs in a single training loop on a static dataset. There is no reward model, no PPO, no online generation during training. This makes it dramatically cheaper — approximately 3x less compute than RLHF for comparable results on most benchmarks.

The tradeoff: DPO is an offline method. It never sees the model's own generations during training, so it can over-optimize for preferences that do not generalize. It also requires pairwise preference data — you need two outputs per prompt, one explicitly preferred over the other.

IPO (Identity Preference Optimization)

Azar et al. (2023) at DeepMind identified a subtle problem with DPO: the implicit reward parameterization in DPO can lead to the regularization term not actually constraining the policy the way it should. IPO replaces the reward parameterization with an identity mapping, providing stronger regularization.

The IPO loss is:

L_IPO = E[(log pi(y_w|x)/pi_ref(y_w|x) - log pi(y_l|x)/pi_ref(y_l|x) - 1/(2*tau))^2]

Where tau is a regularization parameter. The squared loss directly penalizes the policy when the log-likelihood gap diverges too far from the target margin. This provides a cleaner optimization landscape and better-calibrated probabilities at inference time.

# IPO loss (conceptual)
# margin = (log_ratio_w - log_ratio_l)
# loss = (margin - 1/(2*tau))**2  # squared error on both sides of the target margin

IPO requires the same pairwise data as DPO. It is slightly more stable in practice, especially on noisy preference datasets where DPO can amplify annotator disagreement.
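
The IPO loss for one pair is equally short. As with any sketch here, the names are hypothetical and the inputs are assumed to be summed sequence log-probabilities.

```python
def ipo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             tau: float = 0.1) -> float:
    """IPO loss for one pair: squared distance of the log-ratio gap from 1/(2*tau)."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    target = 1.0 / (2.0 * tau)
    return (margin - target) ** 2

# With tau = 0.1 the target margin is 5 nats.
undershoot = ipo_loss(-7.0, -10.0, -10.0, -10.0)   # margin 3
overshoot = ipo_loss(-3.0, -10.0, -10.0, -10.0)    # margin 7
```

Undershooting and overshooting the target margin by the same amount cost the same. That is the practical difference from a sigmoid loss, which keeps rewarding a larger gap indefinitely.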

KTO (Kahneman-Tversky Optimization)

Ethayarajh et al. (2024) at Contextual AI took a different tack. Inspired by prospect theory (Kahneman and Tversky, 1979), they built an alignment method that works with per-sample binary feedback — thumbs up or thumbs down — instead of pairwise preferences.

The KTO loss treats gains (chosen responses) and losses (rejected responses) asymmetrically:

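L_KTO = E[w(y) * (1 - v(x, y))]

Where v(x, y) is sigmoid(beta * (log pi(y|x)/pi_ref(y|x) - z_ref)) for chosen examples and sigmoid(beta * (z_ref - log pi(y|x)/pi_ref(y|x))) for rejected examples, w(y) is a weighting factor that differs for chosen and rejected examples, and z_ref is a reference point estimated from the data (an estimate of the KL divergence between the policy and the reference). The key asymmetry: the two weights can be set so that losses (rejected outputs) count for more than gains (chosen outputs), mirroring human loss aversion documented in behavioral economics.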
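
A per-example version of this loss might look like the sketch below. It follows the general form in Ethayarajh et al. as I understand it: the sign inside the sigmoid flips for rejected examples, so pushing their log-ratio down reduces the loss. The names lambda_d and lambda_u (standing in for w(y)) and all sample values are assumptions for illustration.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def kto_loss(log_ratio: float, z_ref: float, desirable: bool,
             beta: float = 0.1, lambda_d: float = 1.0, lambda_u: float = 1.0) -> float:
    """KTO-style loss for one labeled output.

    log_ratio is log pi(y|x) - log pi_ref(y|x); z_ref is the reference point.
    """
    if desirable:
        # Thumbs-up: loss falls as the log-ratio rises above the reference point.
        return lambda_d * (1.0 - sigmoid(beta * (log_ratio - z_ref)))
    # Thumbs-down: loss falls as the log-ratio drops below the reference point.
    return lambda_u * (1.0 - sigmoid(beta * (z_ref - log_ratio)))
```

Each example is scored on its own, so no pairing step is needed. Raising lambda_u above lambda_d makes rejected outputs weigh more heavily.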

# KTO trainer in Hugging Face TRL
from trl import KTOTrainer

kto_trainer = KTOTrainer(
    model=policy_model,
    ref_model=ref_model,
    train_dataset=binary_feedback_dataset,  # no pairs needed
    args=training_args,
)
kto_trainer.train()

KTO's major advantage is data availability. Many production systems log per-output user feedback (clicks, likes, flags) without recording a pairwise comparison, and KTO can train directly on this signal. The tradeoff is lower sample efficiency per annotated example: pairwise comparisons carry more information per annotation than binary labels.

Comparison: which method for which situation

Dimension	RLHF	DPO	IPO	KTO
Data required	Pairwise comparisons	Pairwise comparisons	Pairwise comparisons	Binary (good/bad)
Reward model needed	Yes (separate training)	No	No	No
Training stages	3 (SFT + RM + PPO)	1 (after SFT)	1 (after SFT)	1 (after SFT)
Compute cost	Highest (~3x DPO)	Low	Low	Low
Online generation	Yes (PPO samples during training)	No (offline)	No (offline)	No (offline)
Stability	Tricky (PPO hyperparameters)	Good, can overfit to noise	Better (identity regularization)	Good
Best for	High-quality RM, large compute budget	Clean pair data, tight budget	Noisy pair data, production stability	Production logs (binary feedback)
Key risk	Reward hacking, training collapse	Overfitting on static data	Slightly more complex loss	Needs enough binary data

Here is the decision flow:

flowchart TD
    A[Do you have pairwise<br/>preference data?] -->|Yes| B{Do you have budget<br/>for a reward model<br/>and PPO?}
    A -->|No / only binary feedback| C[Use KTO]
    B -->|Yes| D[RLHF — full pipeline<br/>highest potential ceiling]
    B -->|No| E{Is your preference<br/>data clean or noisy?}
    E -->|Clean| F[DPO — simplest<br/>single-stage training]
    E -->|Noisy| G[IPO — better regularization<br/>for noisy preferences]
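
The same decision flow, written as a small function that could sit in a project checklist script. The function and argument names are hypothetical.

```python
def choose_alignment_method(has_pairwise_data: bool,
                            has_rm_and_ppo_budget: bool = False,
                            data_is_noisy: bool = False) -> str:
    """Mirror the flowchart: data shape first, then budget, then data quality."""
    if not has_pairwise_data:
        return "KTO"       # only binary feedback is available
    if has_rm_and_ppo_budget:
        return "RLHF"      # full pipeline, highest potential ceiling
    return "IPO" if data_is_noisy else "DPO"
```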

Common pitfalls

Running DPO on binary data. DPO requires pairwise preferences: a chosen output and a rejected output for the same prompt. If you concatenate unrelated good and bad outputs into pairs, DPO will learn arbitrary decision boundaries. Use KTO for binary data.

Ignoring the reference model. DPO, IPO, and KTO all require a frozen reference model (usually your SFT checkpoint). The loss depends on the log-ratio between the current policy and the reference. If you use a different reference model, the optimization target changes silently. Always use the same checkpoint that produced the data.

Skipping SFT. None of these methods work well on a raw pretrained base model. You need an SFT model that can produce reasonable completions. The alignment stage assumes the model can already generate coherent, on-task outputs — it is steering existing behavior, not teaching the model to generate text from scratch.

Treating beta as a free parameter. The beta (or tau) parameter controls how far the aligned policy can stray from the reference. A beta too high and you get no alignment effect. A beta too low and the model unlearns general capabilities (catastrophic forgetting). Sweep it systematically — at least 3 values (e.g., 0.01, 0.1, 0.5) on a validation set before committing to a full run.

Assuming RLHF always wins. On many benchmarks, DPO matches or exceeds RLHF at a fraction of the compute. The main advantage of RLHF is the online generation during PPO, which can discover novel high-reward outputs not present in the training data. For most production use cases where you already have a representative dataset, DPO/IPO/KTO are the better choice.

When NOT to use it

Do not use any of these methods if you have fewer than a few hundred preference examples. The signal-to-noise ratio at that scale is too low. Collect at least 500–1000 examples, and prefer 5000+ for reliable results.

Do not use RLHF if you are budget-constrained or shipping on a timeline under four weeks. The three-stage pipeline (SFT, reward model, PPO) with hyperparameter tuning and reward model debugging routinely takes 2–3 months for teams that are new to it.

Do not use DPO or IPO if your data is binary per-output feedback with no pairwise structure. You will have to fabricate pairs from unrelated outputs, which introduces noise. Use KTO instead.

Do not use KTO if you have clean pairwise preferences and enough compute for DPO. Pairwise comparisons carry more information per example, so DPO will converge faster with fewer total annotations.

Do not skip evaluating your aligned model on capability benchmarks. Every alignment method trades some general capability for safety. If your aligned model drops 5% on MMLU relative to the SFT checkpoint, you have likely over-regularized. Run MMLU, HellaSwag, and a task specific to your domain before and after alignment.

TL;DR

RLHF uses a trained reward model plus PPO optimization. It is the most expensive but supports online exploration. Use it when you have large compute budgets and a team that can manage the complexity.
DPO eliminates the reward model and optimizes a closed-form loss on static preference pairs. It is the simplest and cheapest. Use it for clean pairwise data when compute is constrained.
IPO adds identity regularization to DPO, producing more stable training on noisy preferences. Use it when your annotation quality is inconsistent.
KTO works with binary per-example feedback (good/bad) instead of pairwise comparisons. Use it when you only have production logs without explicit preference pairs.
All four require a strong SFT base model, a frozen reference model, and a minimum of several hundred examples. All four risk capability regression — evaluate on standard benchmarks before and after alignment.

Pairwise preference data is the gold standard for alignment, but collecting it at scale is expensive and annotator agreement is often low. Next time: how to build and maintain a preference dataset — sampling strategy, inter-annotator agreement metrics, and detecting when your annotation pipeline is quietly poisoning your model.

The Model Context Protocol (MCP): what it is and how to build a server

Tech_Nuggets — Mon, 15 Jun 2026 01:13:04 +0000

Your team's LLM-powered application talks to a search index through one custom integration, a code repository through another, a Postgres database through a chain of LangChain tools, and a file system through raw Python I/O calls. Every new data source means writing a new integration. Every integration uses a different authentication model and returns data in a different shape. The LLM application is tightly coupled to every backend it touches, and swapping one out requires changing the application code directly.

The Model Context Protocol (MCP) exists to replace this bespoke plumbing with a single, standardized interface. Think of it as a USB-C port for LLM applications: one connector shape, one protocol, and any compatible server can plug into any compatible client without custom wiring.

Why a standard protocol matters

LLM-powered tools have exploded in capability over the past two years, but the integration story has not kept up. Each AI application (IDE assistant, chat client, agent framework) historically built its own connectors for databases, APIs, document stores, and code repositories. There was no shared contract. If you wanted to use a specific code search tool with two different AI assistants, you needed two separate integrations.

MCP borrows its design philosophy from the Language Server Protocol (LSP), which standardized how code editors talk to language analyzers. Before LSP, each editor had its own plugin for each language. After LSP, one language server worked with every editor. MCP aims to do the same for AI tools and the data sources they need.

The protocol is an open standard, originally created at Anthropic and published under the MIT license. The specification reached stable at version 2025-11-25, and the Python SDK (mcp on PyPI) is at 1.27.2 as of May 2026. A 2.0.0 alpha was published in June 2026 with an updated transport layer.

How MCP works

MCP uses JSON-RPC 2.0 as its message format. A client (the AI application) connects to a server (a service that provides context) over one of three transport types:

stdio: the client spawns the server as a child process and communicates over stdin/stdout. Best for local, single-user setups.
SSE (Server-Sent Events): the server runs as an HTTP endpoint and streams messages to the client over HTTP. Works across machines, but newer spec revisions deprecate it in favor of Streamable HTTP.
Streamable HTTP: a newer transport that allows bidirectional streaming over a single HTTP endpoint. Added in the 2025-03-26 spec revision.

Here is the conceptual architecture:

flowchart LR
    subgraph Client["Client (AI App)"]
        A["Host<br/>IDE / Chat / Agent"]
        B["MCP Client<br/>Protocol handler"]
    end
    subgraph Server["MCP Server"]
        C["MCP Server<br/>Protocol handler"]
        D["Resources<br/>context data"]
        E["Tools<br/>executable functions"]
        F["Prompts<br/>templated workflows"]
    end
    A <--> B
    B <-->|JSON-RPC 2.0<br/>stdio / SSE / HTTP| C
    C --> D
    C --> E
    C --> F

Every MCP session begins with a capability negotiation handshake. The client announces what features it supports (sampling, roots, elicitation). The server announces what features it offers (resources, tools, prompts). Both sides agree on a feature set before any data exchange happens.
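
On the wire, the handshake is three JSON-RPC messages: an initialize request, its result, and an initialized notification. The sketch below builds them as plain Python dicts to show the shape. The client and server names, versions, and the chosen capability sets are hypothetical, and negotiated_features is a helper invented for this example.

```python
import json

initialize_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2025-11-25",
        "capabilities": {"sampling": {}, "roots": {"listChanged": True}},
        "clientInfo": {"name": "example-client", "version": "0.1.0"},
    },
}

initialize_result = {
    "jsonrpc": "2.0",
    "id": 1,  # echoes the request id
    "result": {
        "protocolVersion": "2025-11-25",
        "capabilities": {"tools": {"listChanged": True}, "resources": {}, "prompts": {}},
        "serverInfo": {"name": "example-server", "version": "0.1.0"},
    },
}

# Notifications carry no id: the client sends this and expects no reply.
initialized_notification = {"jsonrpc": "2.0", "method": "notifications/initialized"}

def negotiated_features(request: dict, result: dict) -> dict:
    """Summarize which features each side announced during the handshake."""
    return {
        "client": sorted(request["params"]["capabilities"]),
        "server": sorted(result["result"]["capabilities"]),
    }

print(json.dumps(negotiated_features(initialize_request, initialize_result), indent=2))
```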

Server primitives

Servers offer three main categories of functionality:

Resources expose data to the LLM. Think of them as GET endpoints in a REST API. A resource has a URI and returns content in a structured format. Example: file:///logs/2026-06-01.txt returns the content of that log file. Resources are how the LLM loads context.

Tools are functions the LLM can invoke. Think of them as POST endpoints. A tool has a name, a description, and an input schema (JSON Schema). The LLM can call a tool to execute code, query a database, or trigger an external action. Unlike resources, which the host application typically selects and attaches as context, tools are model-controlled: the LLM decides when to call them.

Prompts are reusable templates for LLM interactions. A prompt defines a message template with parameter slots. The client can populate the template and present the result to the user as a pre-built interaction.
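
Each primitive maps to its own JSON-RPC method. The sketch below shows the request shape for reading a resource, calling a tool, and fetching a prompt. The make_request helper plus the tool and prompt names (get_weather, travel_planning) are hypothetical placeholders; the result shown is an illustrative shape, not captured output.

```python
def make_request(request_id: int, method: str, params: dict) -> dict:
    """Build a JSON-RPC 2.0 request envelope."""
    return {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}

# Resources are addressed by URI.
read_resource = make_request(2, "resources/read", {"uri": "file:///logs/2026-06-01.txt"})

# Tools are addressed by name, with arguments that must match the tool's input schema.
call_tool = make_request(3, "tools/call", {
    "name": "get_weather",
    "arguments": {"city": "Tokyo"},
})

# Prompts are addressed by name, with arguments that fill the template's slots.
get_prompt = make_request(4, "prompts/get", {
    "name": "travel_planning",
    "arguments": {"city": "Tokyo"},
})

# Illustrative shape of a successful tool result: a list of content items.
call_tool_result = {
    "jsonrpc": "2.0",
    "id": 3,
    "result": {
        "content": [{"type": "text", "text": "Weather in Tokyo: 22 degrees celsius"}],
        "isError": False,
    },
}
```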

Client primitives

Clients can also offer features to servers:

Sampling: the server can request the client to generate an LLM response, enabling agentic loops where one model delegates to another.
Roots: the server can request information about filesystem or URI boundaries, so it knows where it is allowed to operate.
Elicitation: the server can request additional information from the user through the client's UI.

Building an MCP server in Python

The mcp package (v1.27.2) provides a high-level API called FastMCP that makes building a server straightforward. Here is a complete server that exposes a weather tool and a greeting resource:

from mcp.server.fastmcp import FastMCP

# Create an MCP server
mcp = FastMCP("Weather Demo")

# Add a tool: get weather for a city
@mcp.tool()
def get_weather(city: str, units: str = "celsius") -> str:
    """Get the current weather for a city."""
    # In production, call a real weather API here
    return f"Weather in {city}: 22 degrees {units}, partly cloudy"

# Add a resource: city data by URI
@mcp.resource("city://{name}")
def city_info(name: str) -> str:
    """Get information about a city."""
    cities = {
        "dubai": "Dubai, UAE. Population: 3.6M. Timezone: UTC+4.",
        "london": "London, UK. Population: 8.9M. Timezone: UTC+0.",
        "tokyo": "Tokyo, Japan. Population: 14M. Timezone: UTC+9.",
    }
    return cities.get(name.lower(), f"City '{name}' not found.")

# Add a prompt template
@mcp.prompt()
def travel_planning(city: str) -> str:
    """Generate a travel planning prompt for a destination."""
    return (
        f"You are a travel assistant helping someone plan a trip to {city}. "
        f"Provide practical advice on weather, transportation, and attractions."
    )

# Run with stdio transport (default)
if __name__ == "__main__":
    mcp.run()

Install it and run:

pip install "mcp[cli]"
python weather_server.py

The server starts on stdio by default. For HTTP transport, change the last line:

mcp.run(transport="streamable-http")

Testing with the MCP Inspector

The official MCP Inspector is a browser-based tool for testing servers:

npx -y @modelcontextprotocol/inspector

Point it at your server endpoint (or stdio command) and you can browse resources, invoke tools, and inspect messages without writing a client.

MCP vs the alternatives

Feature	MCP	Custom API / REST	LangChain Tools	OpenAI function calling
Standardized protocol	Yes	No	No (framework-specific)	No (API-specific)
Primitive types	Resources, Tools, Prompts	Endpoints only	Tools only	Functions only
Transport options	stdio, SSE, Streamable HTTP	HTTP only	In-process only	HTTP only
Bidirectional	Yes (sampling, roots)	Request-response only	Request-response only	Request-response only
Auth model	OAuth 2.1 (spec), pluggable	Custom per API	Custom per integration	API key
Client independence	Any MCP client	One client per API	LangChain only	OpenAI only

The main differentiator is client independence. A server written for MCP works with any MCP-compatible client: Claude Code, Claude Desktop, the Continue.dev VS Code extension, or a custom agent framework. Custom APIs and framework-specific tools lock you into one ecosystem.

Common pitfalls

Thinking tools are free. Tools execute arbitrary code on your server. Every tool invocation consumes compute and may have side effects. The LLM cannot distinguish between a cheap operation (reading a config file) and an expensive one (running a 100-row batch query). Set usage limits or implement a permission layer for destructive operations.
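
One way to enforce a usage limit is a small sliding-window decorator around the tool function. This is a sketch in plain Python with hypothetical names (limit_calls, ToolBudgetExceeded, run_batch_query); it is not part of the MCP SDK, and it keeps its state in memory within a single process.

```python
import time
from functools import wraps

class ToolBudgetExceeded(RuntimeError):
    """Raised when a tool is called more often than its budget allows."""

def limit_calls(max_calls: int, per_seconds: float):
    """Allow at most max_calls invocations within any per_seconds window."""
    def decorator(fn):
        timestamps = []

        @wraps(fn)
        def wrapper(*args, **kwargs):
            now = time.monotonic()
            # Forget calls that have fallen out of the window.
            timestamps[:] = [t for t in timestamps if now - t < per_seconds]
            if len(timestamps) >= max_calls:
                raise ToolBudgetExceeded(
                    f"{fn.__name__}: more than {max_calls} calls in {per_seconds}s"
                )
            timestamps.append(now)
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@limit_calls(max_calls=2, per_seconds=60.0)
def run_batch_query(sql: str) -> str:
    """Stand-in for an expensive tool body."""
    return f"ran: {sql}"
```

Raising an exception lets the error travel back to the model as a failed tool call, so it can stop retrying instead of silently burning compute.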

Resource URIs must be meaningful. A resource URI is not just a label -- it is the identifier the LLM uses to request data. Using opaque URIs (resource://abc123) makes it impossible for the LLM to discover resources. Use hierarchical, descriptive URIs that hint at the content structure, like docs://project/api/reference or db://customers/orders?status=pending.

Forgetting the capability handshake. If you add a new tool to an existing server and your client does not re-negotiate capabilities, the client will not know the tool exists. The capability exchange happens at connection time. Restart both sides after changing what a server offers.

Overloading a server. An MCP server that exposes 50 tools and 200 resources becomes as hard to navigate as a REST API with 50 endpoints. Group related functionality into separate servers and let the client connect to multiple servers. Claude Desktop and other hosts already support multi-server setups.

Assuming tools are always available to the LLM. Tool invocation requires user consent in most host applications. The user must approve each tool call. Design your tools to be meaningful in a single invocation, because multi-step approval flows create a poor user experience.

When NOT to use it

MCP is the wrong choice if:

You are building a single-purpose script. If your Python script calls one API and prints the result, MCP adds unnecessary complexity. Just use requests directly.
You need sub-millisecond latency. The JSON-RPC serialization and transport overhead adds a few milliseconds per call. For latency-critical, high-frequency operations (real-time streaming inference, hardware control), use a direct connection.
Your data source has no LLM interaction. MCP is designed to serve context to LLMs. If you are building a regular web application backend with no AI component, use a standard REST or gRPC API.
Your users are all on one framework. If every consumer of your service uses LangChain and will only ever use LangChain, writing a LangChain tool directly is simpler than writing an MCP server plus a LangChain MCP adapter. MCP pays off when you have multiple client types.

TL;DR

MCP standardizes how LLM applications connect to data sources. One server works with any MCP-compatible client.
The protocol uses JSON-RPC 2.0 over stdio, SSE, or Streamable HTTP transport. Features are negotiated at connection time.
Servers expose Resources (data), Tools (executable functions), and Prompts (templates). Clients can offer Sampling, Roots, and Elicitation.
The Python SDK mcp (v1.27.2) provides FastMCP, a decorator-based API for building servers in a few lines of code.
MCP pays off when you have multiple client types consuming the same data sources. For single-purpose scripts or single-framework setups, a direct integration is simpler.
Use the MCP Inspector (npx @modelcontextprotocol/inspector) to test servers without writing a client.

Next post: building a multi-server MCP setup that connects a code search service, a documentation index, and a database gateway into a single AI assistant, with practical trade-offs on transport selection and auth.

Structured output from LLMs: JSON mode, function calling, and grammar-constrained decoding

Tech_Nuggets — Sun, 14 Jun 2026 02:58:10 +0000

You deployed a chatbot that translates natural-language requests into API calls. A user says "book a table for four at 7pm tomorrow." Your prompt asks the LLM to emit a JSON like {"restaurant": string, "party_size": int, "time": string, "date": string}. One time it returns {"restaurant": "Olive Garden", "party_size": 4, "time": "19:00", "date": "2026-06-15"} -- valid JSON, everything works. The next request for "dim sum Saturday noon" produces {"restaurant": "Dim Sum House", "party_size": 2, "time": "12:00", "date": "Saturday"} followed by a free-text aside: -- also, what's the dress code?. Now your JSON parser throws, your downstream pipeline crashes, and your Slack channel lights up at 2 AM.

The problem is fundamental: LLMs generate tokens, not data structures. Any schema you ask for is a suggestion, not a constraint. Production systems that depend on structured output need a mechanism that enforces the schema at the token level, not just at the prompt level.

Why this matters for production LLM applications

Three scenarios where structured output is non-negotiable:

API wrappers and function calling. An LLM that calls tools on your behalf must produce arguments that match the tool's JSON Schema. A malformed argument means a runtime error from the tool, a retry, or silent failure. At scale, even a 2% malformation rate becomes a steady stream of incident alerts.
Data extraction and ETL pipelines. You point an LLM at 10,000 support tickets and ask it to extract {customer_id, sentiment, category, urgency}. If 3% of the rows have extra fields, missing fields, or non-JSON prose, your data pipeline either drops them silently or someone writes a regex band-aid that breaks later.
Multi-step agent loops. An agent that calls a search tool, reads the result, then calls another tool needs each step's output to be parseable. If step 2 produces free text instead of a function call, the loop stalls. Every retry costs tokens, latency, and money.

The three approaches to structured output

Developers today have three main ways to coerce an LLM into producing structured data. They differ in reliability, latency, and how deeply they integrate with the model.

Method	Enforcement level	Latency overhead	Model support	Schema expressiveness
Prompt-only JSON mode	None (suggestion)	Zero	All models	Unlimited
API-level JSON mode / function calling	Hard with strict schemas (provider-side constrained decoding); soft otherwise	0-200ms	OpenAI, Anthropic, Gemini, most providers	JSON Schema
Grammar-constrained decoding	Hard (token-level)	10-50ms per token	Local models (llama.cpp, vLLM), Outlines, Guidance, lm-format-enforcer	Any CFG, JSON Schema, regex

Prompt-only is what you write when you first prototype. API-level structured output is what most teams use in production today. Grammar-constrained decoding is the emerging standard for local and self-hosted models where you control the sampling loop.

Prompt-only JSON mode

The simplest approach: tell the model to output JSON and hope it complies.

You are a data extraction assistant.
Extract the requested fields and output ONLY valid JSON.
Do not include any explanation, markdown formatting, or extra text.

This works maybe 85-95% of the time with capable models, but the failure modes are maddening: trailing commas (not valid JSON but some parsers accept them), markdown code fences around the JSON, explanatory text before or after the JSON, missing closing braces, and string values that contain unescaped quotes.
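
A common band-aid for the "extra prose around the JSON" cases is a lenient extractor that scans for the first parseable object. The sketch below uses only the standard library; the function name is hypothetical. It does not repair trailing commas or missing braces, which still raise.

```python
import json

def extract_first_json_object(text: str) -> dict:
    """Return the first JSON object embedded in text, ignoring surrounding prose."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one value starting at `start` and ignores what follows.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")

reply = 'Sure! {"restaurant": "Dim Sum House", "party_size": 2} -- also, what is the dress code?'
booking = extract_first_json_object(reply)
```

This handles code fences and chatty preambles, but it is still post-hoc: it cannot recover a field the model never emitted.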

The fatal flaw is that prompt-only mode does not interact with the token generation process at all. If the model is partway through a field value and its next most likely token is "fix" (the start of a free-text apology), it will generate that token. The prompt is just context -- it does not constrain the probability distribution.

API-level structured output (JSON mode and function calling)

OpenAI introduced JSON mode in late 2023 and schema-enforced Structured Outputs in August 2024, and the rest of the industry followed. With Structured Outputs, the API takes a response_format parameter with a JSON Schema. Behind the scenes, the provider constrains decoding so that tokens which would produce invalid JSON relative to the schema are masked out.

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Extract: John Smith, 42, john@example.com"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "email": {"type": "string", "format": "email"}
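                },
                "required": ["name", "age", "email"],
                "additionalProperties": False  # required when strict is True
            }
        }
    }
)
print(response.choices[0].message.content)

With strict set to True, the output is constrained to valid JSON matching the schema; the remaining failure modes are a model refusal or a response cut off by the token limit. Strict mode requires every property to be listed in required and additionalProperties to be false, which is what rules out extra properties.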

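Function calling works similarly: you register tool definitions as JSON Schema objects, and the model returns a structured tool_calls array. With strict mode enabled, the provider enforces the schema at the token level; without it, the arguments usually match the schema but are not guaranteed to.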

tools = [{
    "type": "function",
    "function": {
        "name": "book_restaurant",
        "description": "Book a table at a restaurant",
        "parameters": {
            "type": "object",
            "properties": {
                "restaurant": {"type": "string"},
                "party_size": {"type": "integer"},
                "time": {"type": "string"},
                "date": {"type": "string"}
            },
            "required": ["restaurant", "party_size", "time", "date"]
        }
    }
}]

The model returns something like:

{
  "name": "book_restaurant",
  "arguments": "{\"restaurant\":\"Olive Garden\",\"party_size\":4,\"time\":\"19:00\",\"date\":\"2026-06-15\"}"
}
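
Note that arguments arrives as a JSON-encoded string, not a nested object, so it needs its own parse step before you dispatch the call. A minimal sketch, with a hypothetical parse_arguments helper and a hand-written type table standing in for full JSON Schema validation:

```python
import json

tool_call = {
    "name": "book_restaurant",
    "arguments": '{"restaurant":"Olive Garden","party_size":4,"time":"19:00","date":"2026-06-15"}',
}

# Expected Python types for each required argument of book_restaurant.
EXPECTED_TYPES = {"restaurant": str, "party_size": int, "time": str, "date": str}

def parse_arguments(call: dict) -> dict:
    """Decode the arguments string and check required fields and basic types."""
    args = json.loads(call["arguments"])
    for field, expected in EXPECTED_TYPES.items():
        if field not in args:
            raise ValueError(f"missing argument: {field}")
        if not isinstance(args[field], expected):
            raise TypeError(f"{field} should be {expected.__name__}")
    return args

booking = parse_arguments(tool_call)
```

Keeping a cheap check like this on the client side is useful even with provider-side enforcement, because it also catches schema drift between the tool definition and the code that executes it.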

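Anthropic Claude's tool use, Gemini's function calling, and Mistral's function calling all follow the same pattern: the schema is defined client-side and the provider returns a structured call. How strictly the arguments are enforced varies by provider and mode, so keep a validation step on your side unless the provider documents a hard guarantee.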

Grammar-constrained decoding

For local and self-hosted models, you can push enforcement into the sampling loop itself. Grammar-constrained decoding modifies the token probability distribution at each step, zeroing out any token that would produce an invalid next character relative to a grammar or schema.

# Using Outlines to constrain generation to a Pydantic model
from pydantic import BaseModel, constr
from outlines import models, generate

class Person(BaseModel):
    name: constr(min_length=1, max_length=100)
    age: int
    email: str

model = models.transformers("Qwen/Qwen2.5-7B-Instruct")
generator = generate.json(model, Person)

result = generator("Extract: John Smith, 42, john@example.com")
print(result)
# Person(name='John Smith', age=42, email='john@example.com')

Outlines works by converting the JSON Schema or Pydantic model into a regular expression, compiling that into a finite-state machine over the tokenizer's vocabulary, and using the state machine to prune the token vocabulary at each generation step. Only tokens that represent valid continuations of the schema are kept.

The same idea works for arbitrary grammars, not just JSON:

# Grammar-constrained generation with llama.cpp
# GBNF (GGML BNF) format
grammar = """
root ::= digit+ "." digit+
digit ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
"""

llama.cpp supports GBNF grammars natively. Ollama, which builds on llama.cpp, exposes schema-constrained output through the format parameter of its API. vLLM has a --guided-decoding-backend flag (options: outlines, lm-format-enforcer, xgrammar). The key insight is that grammar-constrained decoding makes structured output a sampling-time property, not a post-processing step.

Here is how the token mask operates during generation:

flowchart TD
    A[Start generation] --> B[Get next token logits<br/>from model forward pass]
    B --> C[Apply grammar mask:<br/>zero-out tokens that would<br/>produce invalid structure]
    C --> D[Sample from masked<br/>probability distribution]
    D --> E{Is generation<br/>complete?}
    E -->|No| B
    E -->|Yes| F[Return valid structured<br/>output]

Every token is checked against the schema before it is sampled. If the schema expects a number at position 73 and the model proposes a comma, that token is masked out and the next-best valid token is sampled instead.
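
The masking loop can be demonstrated end to end with a toy vocabulary and hand-written logits. The sketch below enforces the grammar digit+ "." digit+ with a four-state machine and greedy decoding. Everything in it (the vocabulary, the logits, the function names) is made up for illustration; real backends precompute the allowed-token sets per state instead of checking tokens one by one.

```python
import math

VOCAB = ["7", "42", ".", "3.1", ",", "abc", "<eos>"]
DIGITS = "0123456789"

# States: 0 = start, 1 = integer part, 2 = just after ".", 3 = fraction part (accepting).
def step(state, ch):
    if ch in DIGITS:
        return {0: 1, 1: 1, 2: 3, 3: 3}[state]
    if ch == "." and state == 1:
        return 2
    return None  # character not allowed in this state

def advance(state, token):
    """Feed a whole token through the state machine; None means it is invalid."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return None
    return state

def allowed(state, token):
    if token == "<eos>":
        return state == 3  # may only stop once a fraction digit exists
    return advance(state, token) is not None

def constrained_greedy(logits_per_step):
    state, pieces = 0, []
    for logits in logits_per_step:
        masked = [l if allowed(state, t) else -math.inf for t, l in zip(VOCAB, logits)]
        token = VOCAB[max(range(len(VOCAB)), key=masked.__getitem__)]
        if token == "<eos>":
            break
        pieces.append(token)
        state = advance(state, token)
    return "".join(pieces)

# Fake model outputs: the raw top choice is "abc", then ",", then "<eos>" too early.
LOGITS = [
    [1.0, 2.0, 0.5, 0.1, 0.3, 3.0, 0.0],
    [0.1, 0.1, 1.0, 0.2, 3.0, 0.5, 2.0],
    [2.0, 0.1, 1.0, 0.3, 0.2, 0.1, 3.0],
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 2.0],
]
result = constrained_greedy(LOGITS)
```

At every step the model's raw favorite is invalid, and the mask forces the best valid alternative instead, so the output always matches the grammar.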

Comparison: which method should you use?

Criterion	Prompt-only	API JSON/function call	Grammar-constrained
Reliability	85-95%	~99.9%	>99.9%
Latency impact	None	Negligible	~10-50ms per token
Works with any model	Yes	No (provider-dependent)	Yes (local framework)
Schema validation	Post-hoc	Token-level	Token-level
Debugging difficulty	Easy (parse error)	Medium (API error)	Medium (grammar compile error)
Best use case	Prototyping, quick scripts	Production API calls	Self-hosted, sensitive data

Common pitfalls

Nested schemas with strict mode. OpenAI's strict mode requires additionalProperties: false on every object, including nested ones, and every property listed in required. If your schema has additionalProperties: true or relies on optional fields, the request is rejected. Model optional fields as a union with null, or test with strict: false first, then tighten.

Grammar compilation time. Outlines and Guidance compile the schema into a state machine before generation starts. For complex schemas with deeply nested allOf / oneOf, this can take 2-10 seconds. Cache the compiled grammar if you reuse a schema.

Token masking vs resampling. Some implementations (early Guidance) used resampling: if the output was invalid, regenerate. This is slow and unpredictable. Prefer token-masking approaches (Outlines, xgrammar, llama.cpp GBNF) that never generate invalid tokens in the first place.

Model incompatibility with grammar backends. Not all Hugging Face model architectures work with Outlines' transformers backend. If you hit an error about unsupported model type, switch to the llamacpp backend or use vLLM's guided decoding instead.

When NOT to use it

Structured output is the wrong tool when:

You need open-ended creative text. A story-writing or brainstorming session should not be grammar-constrained. The constraints reduce the model's output quality and diversity for tasks where free text is the goal.
Your schema changes frequently. Grammar compilation and testing add overhead. If you are iterating on a schema multiple times per day, start with prompt-only JSON, then add enforcement once the schema stabilizes.
Your model is behind an API that does not support it. Not all providers offer JSON mode or function calling. For those that do not, you are limited to prompt-only or running a local validation + retry loop, which adds latency and cost.
Your use case tolerates occasional parse failures. If a human reviews every output or the downstream system has robust error handling, the complexity of grammar-constrained decoding may not be worth it.
Latency is the absolute top priority. Grammar masking adds a small per-token overhead. For sub-100ms response requirements at high throughput, prompt-only with a lenient parser may be the pragmatic choice. Measure before optimizing.

TL;DR

Prompt-only JSON works ~85-95% of the time and is fine for prototyping, but will crash in production at scale.
API-level JSON mode / function calling (OpenAI, Anthropic, Gemini, Mistral) provides token-level enforcement with negligible latency overhead. Use this for production or when your provider supports it.
Grammar-constrained decoding (Outlines, Guidance, llama.cpp GBNF, vLLM guided decoding) enforces schema at the sampling step. Best for self-hosted models and sensitive-data scenarios.
Token masking is better than resampling. Prefer frameworks that mask invalid tokens rather than regenerating on failure.
Measure the overhead. Grammar compilation and per-token masking add latency. Test with your schema and model before committing.

Coming up: a write-up of the pipeline that evaluated the output of grammar-constrained decoding against a test corpus of 10,000 real user requests -- how we measured reliability, what broke, and what the latency budget actually looked like in production.

If you have a production story about structured output going wrong (or going right), the next post will compile reader experiences -- drop a comment with your war story.

Mixture of Experts (MoE): what it actually does under the hood, and when it pays off

Tech_Nuggets — Sat, 13 Jun 2026 01:05:53 +0000

You deployed a 7B model in production. Response times are fine — 45 ms per token — but you want to scale to a 70B without buying four more GPUs. Someone mentions MoE: "70B performance at 7B compute." It sounds like free lunch. So you look at the Mixtral 8x7B paper, you see 45 billion parameters and a claim that each token only activates about 13 billion of them, and you wonder: how is that physically possible, and what is the catch?

This post explains the sparse MoE architecture that powers Mixtral, DeepSeek-MoE, Qwen2.5-MoE, DBRX, and Grok-1: what the router actually does, why load-balancing is the hardest problem in training them, and the three specific constraints that determine whether MoE is the right choice for your deployment.

Why the distinction between total parameters and active parameters matters

A dense transformer (like Llama 3.2) activates 100 percent of its parameters for every token. The FFN layer in each transformer block runs the same matrix multiplication for every input. This makes memory use predictable and throughput easy to model, but it also means that scaling from 7B to 70B multiplies both memory and compute by 10x.

MoE decouples the two. The model stores more parameters (more memory), but each token only uses a fraction of them (less compute). Here is the core trade-off expressed in numbers:

Metric	Dense 7B	Dense 70B	MoE 45B (Mixtral)
Total parameters	7B	70B	45B (8 experts)
Active per token	7B	70B	~12.9B (2 experts)
Compute per token	7B-equiv	70B-equiv	~13B-equiv
Memory (weights)	~14 GB	~140 GB	~90 GB
Throughput (tokens/s)	high	low	medium-high

The headline is this: MoE gives you better compute efficiency than a dense 70B, but you still pay the memory cost of a much larger model. You cannot run Mixtral on a single consumer GPU. At FP16 the weights alone are about 90 GB, so you need at least two 80 GB cards, and a 4-bit copy still needs roughly 25 GB, more than a single 24 GB card holds. The computational savings only show up once the model is already loaded, and that is the catch that the "70B performance at 7B compute" tagline often omits.
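
A back-of-the-envelope check of the gap between stored and active weights, as a rough sketch. The Mixtral 8x7B dimensions used here (hidden size 4096, expert intermediate size 14336, 32 layers, three weight matrices per SwiGLU expert) are assumptions taken from the public model config rather than from this post, and the function name is hypothetical. Only expert FFN weights are counted; attention, embedding, and router weights are left out.

```python
# Rough estimate of expert FFN weights in a Mixtral-style MoE model.
# Dimensions are assumptions taken from the public Mixtral 8x7B config.
HIDDEN = 4096
INTERMEDIATE = 14336
LAYERS = 32
EXPERTS = 8
TOP_K = 2
BYTES_FP16 = 2


def expert_ffn_params(hidden: int, intermediate: int) -> int:
    """One SwiGLU expert has three (hidden x intermediate) matrices."""
    return 3 * hidden * intermediate


per_expert = expert_ffn_params(HIDDEN, INTERMEDIATE)
stored = per_expert * EXPERTS * LAYERS   # must be resident in memory
active = per_expert * TOP_K * LAYERS     # actually multiplied per token

print(f"stored expert params: {stored / 1e9:.1f}B")
print(f"active expert params: {active / 1e9:.1f}B")
print(f"FP16 expert weights:  {stored * BYTES_FP16 / 1e9:.1f} GB")
```

For the experts themselves the stored figure is exactly four times the active one. The whole-model ratio is lower because the attention and embedding weights are shared by every token.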

How sparse MoE works in a transformer

In a standard transformer, every layer has an FFN block (two linear projections with an activation in between). In a sparse MoE transformer, each FFN is replaced by multiple parallel "expert" FFNs plus a learned router that picks which experts to use for each token.

Here is the data flow for a single token passing through one MoE layer:

flowchart LR
    A[Input token<br/>hidden states] --> B[Router / Gate<br/>learned linear layer]
    B --> C{Softmax over<br/>N experts}
    C --> D[Select top-k<br/>experts]
    D --> E1[Expert 1<br/>FFN]
    D --> E2[Expert 2<br/>FFN]
    D --> E3[...<br/>idle]
    D --> E4[Expert N<br/>idle]
    E1 --> F[Weighted sum<br/>by router scores]
    E2 --> F
    F --> G[Output token<br/>hidden states]

The router is a small learned linear layer that takes the token's hidden state and outputs a score for each expert. You take the softmax over all experts, pick the k with the highest scores, run the token through only those experts, and combine the results weighted by the router scores. For Mixtral, k=2 out of 8 experts. For DeepSeek-MoE, k=6 out of 64 experts. The router itself adds negligible compute — a single matrix multiply of size (hidden_dim, n_experts).
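
The routing step described above can be sketched in plain Python. This is a toy illustration, not any model's actual implementation: the function names, the two-dimensional hidden state, and the "experts" that merely scale their input are all made up for the example. Renormalizing the selected scores so they sum to 1 is one common choice and is an assumption here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(hidden, router_weights, experts, k=2):
    """Route one token through its top-k experts.

    hidden:         list of floats, the token's hidden state
    router_weights: one weight vector per expert (the learned linear layer)
    experts:        list of callables standing in for expert FFNs
    """
    # Router: one dot product per expert gives a score per expert.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    probs = softmax(logits)

    # Keep the k highest-probability experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Renormalize the selected scores so the mixture weights sum to 1.
    norm = sum(probs[i] for i in top)
    output = [0.0] * len(hidden)
    for i in top:
        expert_out = experts[i](hidden)
        for d in range(len(hidden)):
            output[d] += (probs[i] / norm) * expert_out[d]
    return top, output


# Toy setup: 4 "experts" that just scale the hidden state.
experts = [lambda h, s=s: [s * x for x in h] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, -1.0]]
chosen, out = route_token([2.0, 1.0], router_weights, experts, k=2)
print(chosen, out)
```

Only the two selected experts are ever called; the other two contribute no compute for this token, which is where the per-token savings come from.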

The router is not just "which GPU does this go to"

A common mental model is that the router is a load balancer that assigns tokens to experts similar to how a distributed scheduler assigns work to machines. This is misleading. The router is a learned differentiable gate trained end-to-end with the rest of the model through backpropagation. It learns which experts specialize in which types of patterns — subject-matter expertise, syntactic structures, token positions — without any explicit supervision.

Expert specialization emerges, it is not designed

When you inspect the routed outputs after training, individual experts do develop preferences. One expert in Mixtral handles arithmetic-heavy tokens disproportionately often. Another handles function words and punctuation. A third handles code syntax. But these specializations are soft, not hard: there is no constraint that says "expert 3 is the math expert." The router simply learns the assignment that minimizes the loss.

Training an MoE model: the load-balancing problem

The hardest part of MoE training is preventing the router from sending every token to the same two experts. If there is no corrective signal, the router quickly collapses: it sends everything to the experts that happen to initialize well, those experts get more gradient updates, they get better, the router sends even more traffic their way, and the unused experts atrophy.

The standard fix is an auxiliary load-balancing loss added to the total training loss. The most common formulation (used in Mixtral, GShard, and ST-MoE) penalizes the router for imbalance:

import torch

# Simplified load-balancing loss (following the Switch Transformer formulation)
def load_balancing_loss(router_logits, num_experts, num_tokens, top_k=2):
    """
    router_logits: (num_tokens, num_experts), raw router scores before softmax
    top_k: number of experts each token is routed to (2 for Mixtral)
    """
    router_probs = torch.softmax(router_logits, dim=-1)             # (tokens, experts)
    fraction_per_expert = router_probs.mean(dim=0)                  # (experts,) avg probability per expert

    # Fraction of routing slots assigned to each expert
    _, selected_experts = router_probs.topk(k=top_k, dim=-1)
    tokens_per_expert = torch.zeros(num_experts, device=router_logits.device)
    tokens_per_expert.scatter_add_(0, selected_experts.flatten(),
                                    torch.ones(num_tokens * top_k, device=router_logits.device))
    load_per_expert = tokens_per_expert / (num_tokens * top_k)      # (experts,) normalized token count

    # Auxiliary loss: dot product of average probability and load.
    # Evaluates to 1.0 (not zero) when probability and load are perfectly uniform,
    # and grows as routing concentrates on a few experts.
    aux_loss = num_experts * (fraction_per_expert * load_per_expert).sum()
    return aux_loss

The num_experts multiplier scales the loss so it does not vanish at different expert counts. Typical aux_loss coefficients are between 0.01 and 0.001. Too high and the router loses discriminative power. Too low and the expert collapse returns.
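
To get a feel for the scale of this loss, here is a dependency-free re-implementation of the same formula evaluated on two hand-made routing patterns. The helper name and the toy probabilities are illustrative assumptions, not values from a real training run.

```python
def aux_loss(router_probs, top_k=2):
    """Switch-style auxiliary loss from per-token router probabilities.

    router_probs: list of rows, one per token, each summing to 1.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])

    # Mean router probability per expert.
    mean_prob = [sum(row[e] for row in router_probs) / num_tokens
                 for e in range(num_experts)]

    # Share of routing slots each expert receives under top-k selection.
    counts = [0] * num_experts
    for row in router_probs:
        ranked = sorted(range(num_experts), key=lambda e: row[e], reverse=True)
        for e in ranked[:top_k]:
            counts[e] += 1
    load = [c / (num_tokens * top_k) for c in counts]

    return num_experts * sum(p * l for p, l in zip(mean_prob, load))


# Balanced: four tokens spread their top-2 choices evenly over four experts.
balanced = [
    [0.4, 0.4, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
    [0.4, 0.1, 0.4, 0.1],
    [0.1, 0.4, 0.1, 0.4],
]
# Collapsed: every token prefers experts 0 and 1.
collapsed = [[0.45, 0.45, 0.05, 0.05]] * 4

print(aux_loss(balanced))   # close to 1.0
print(aux_loss(collapsed))  # close to 1.8
```

Under perfectly even routing the expression evaluates to 1.0, and it grows as traffic concentrates. That is why the coefficient, rather than the raw loss value, is what you tune.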

Beyond the auxiliary loss: modern routing strategies

Recent work has introduced alternatives that reduce or eliminate the auxiliary loss:

DeepSeek-MoE uses a combination of shared experts (always-on, handles common patterns) and routed experts with top-6 selection. The shared experts cover the base computation that every token needs, so the routed experts can specialize more aggressively without collapsing.
Qwen2.5-MoE uses finer-grained experts (smaller intermediate size) with more of them, combined with shared experts and a "route-constrained" auxiliary loss.
Dense-to-Sparse training (DeepSpeed-MoE) starts with a dense checkpoint and incrementally sparsifies it, avoiding the collapse problem at initialization entirely.

MoE serving: where throughput meets memory

Serving an MoE model requires different infrastructure than a dense model. The key insight is that expert weights are wide but narrowly used:

Expert parallelism: place different experts on different GPUs. Since only k of the N experts activate per token (2 of 8 for Mixtral), each token touches only k/N of the total expert FFN weights, and each GPU computes only the tokens routed to the experts it holds. This is the standard approach in vLLM, TGI, and SGLang for MoE models.
Memory overhead: all expert weights must be resident across the combined GPU memory. With 8 experts and 2 active per token, you need roughly 3.5x the memory that the active-parameter count alone would suggest (less than a full 4x because attention and embedding weights are shared rather than duplicated per expert). For Mixtral (45B total, 12.9B active), you need ~90 GB of VRAM for FP16 weights, which means at least 2x A100-80GB or 4x L40S.
All-to-all communication: at each MoE layer, tokens must be grouped by which expert they were routed to, sent to the correct GPU, processed, and then sent back. The router dispatch and combine operations are often the main latency bottleneck in multi-GPU MoE inference, rather than the expert compute itself.
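
The dispatch and combine steps from the last bullet can be sketched in a single process. This is a minimal illustration with hypothetical function names and scalar "outputs"; in a real multi-GPU deployment each bucket would be shipped to the device that holds that expert, which is the all-to-all exchange.

```python
from collections import defaultdict


def dispatch(assignments):
    """Group token indices by the expert they were routed to.

    assignments: for each token, a list of (expert_id, weight) pairs.
    Returns {expert_id: [(token_index, weight), ...]}.
    """
    buckets = defaultdict(list)
    for token_idx, routes in enumerate(assignments):
        for expert_id, weight in routes:
            buckets[expert_id].append((token_idx, weight))
    return dict(buckets)


def combine(buckets, expert_outputs, num_tokens):
    """Scatter weighted expert outputs back into token order.

    expert_outputs: {expert_id: [output for each token in that bucket]}
    """
    result = [0.0] * num_tokens
    for expert_id, entries in buckets.items():
        for (token_idx, weight), value in zip(entries, expert_outputs[expert_id]):
            result[token_idx] += weight * value
    return result


# Three tokens, top-2 routing over four experts (scalar outputs for brevity).
assignments = [
    [(0, 0.7), (2, 0.3)],
    [(2, 0.6), (3, 0.4)],
    [(0, 0.5), (1, 0.5)],
]
buckets = dispatch(assignments)
# Pretend expert e returns the constant e + 1 for every token it receives.
outputs = {e: [float(e + 1)] * len(toks) for e, toks in buckets.items()}
print(buckets)
print(combine(buckets, outputs, num_tokens=3))
```

The bucket sizes are uneven and change with every batch, which is part of why this exchange is hard to overlap cleanly with compute.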

Here is a concrete serving comparison:

# vLLM configuration for MoE vs dense on 4x A100-80GB
# Dense 70B:
  model: meta-llama/Llama-3.3-70B-Instruct
  tensor_parallel_size: 2
  max_model_len: 8192
  # estimated throughput: ~1800 tokens/s

# MoE 45B (Mixtral):
  model: mistralai/Mixtral-8x7B-Instruct-v0.1
  tensor_parallel_size: 2
  max_model_len: 32768  # full 32k context window
  # estimated throughput: ~3200 tokens/s

The MoE throughput advantage is real but narrower than the parameter count suggests, because the dispatch overhead and the memory ceiling eat into the margin.

Common pitfalls

Router collapse during training. Even with load-balancing loss, the router can still collapse in the first few thousand steps. Monitor the expert utilization histogram during training. If one expert receives more than 30 percent of tokens while another receives less than 5 percent, increase the auxiliary loss coefficient or switch to a different routing strategy (e.g., DeepSeek's shared-expert design).

Ignoring dispatch overhead in latency budgets. The all-to-all communication in expert routing adds 5-15 ms per MoE layer depending on batch size and interconnect bandwidth. For a 32-layer model with 16 MoE layers, that is 80-240 ms of overhead before any compute happens. For latency-sensitive applications, this cost can erase the throughput gains.

Training on too-small batch sizes. MoE models require larger batch sizes than dense models because the expert capacity constraint means that each expert sees only a fraction of the batch. A batch of 256 tokens with 8 experts and k=2 means each expert processes roughly 64 tokens (256 x 2 / 8). Training on small batches leads to underutilized experts and noisy gradients.

Using MoE for fine-tuning without adaptation. Most MoE models were trained from scratch with MoE architecture. Taking a dense checkpoint and converting it to MoE (as in DeepSpeed-MoE's d2s approach) requires careful initialization and a warm-up schedule. Simple LoRA fine-tuning on an existing MoE model can break the learned routing patterns. Always evaluate the downstream task before and after fine-tuning to verify the routing did not drift.

Measuring memory wrong. Summing model.parameters() for an MoE model gives the total parameter count, and the memory you need to serve it follows that total (all experts plus the shared layers), not the active count. For DeepSeek-MoE-16B, the 64 routed experts per MoE layer (each with intermediate_size 1408 at hidden_size 2048) account for roughly 15B parameters, about 30 GB at FP16. The 16B label is the total parameter count; only about 2.8B parameters are active per token, and that smaller number says nothing about the storage requirement.
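
For the router-collapse pitfall above, the utilization check can be sketched as a small helper. The function name and the toy routing log are hypothetical; the 30 percent and 5 percent thresholds are the rule of thumb from the text, exposed as parameters so they can be adjusted.

```python
from collections import Counter


def utilization_report(selected_experts, num_experts, high=0.30, low=0.05):
    """Summarize how routing slots are spread across experts.

    selected_experts: flat list of expert ids, one entry per (token, slot).
    high / low: thresholds from the rule of thumb in the text.
    """
    counts = Counter(selected_experts)
    total = len(selected_experts)
    shares = [counts.get(e, 0) / total for e in range(num_experts)]
    return {
        "shares": shares,
        "overloaded": [e for e, s in enumerate(shares) if s > high],
        "starved": [e for e, s in enumerate(shares) if s < low],
    }


# Toy log of 20 routing decisions over 4 experts, skewed toward expert 0.
log = [0] * 12 + [1] * 5 + [2] * 3
report = utilization_report(log, num_experts=4)
print(report)
```

In a training loop you would feed this the flattened top-k indices from each MoE layer every few hundred steps and alert when both lists are non-empty at once.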

When NOT to use it

MoE is not always the right architecture for your model:

You need consistent latency for every request. Because the router's top-k selection varies per token, and because batch composition affects which experts are active, MoE latency has higher variance than dense models. If your SLO requires 99th percentile latency under 200 ms per token, a dense model is easier to calibrate.
You are deploying on a single GPU with less than 48 GB VRAM. MoE models with real quality (anything above 2-3 billion active parameters) require at least two GPUs to fit the total weights. If your deployment is a single RTX 4090 or A5000, stick with dense models in the 7B-13B range.
You are building a small model under 3B parameters. The overhead of the router, the auxiliary loss, and the expert parallelism infrastructure is not worth it at this scale. MoE starts to pay off when the dense baseline you are trying to beat is above 30-50B parameters.
Your batch size is small and latency-critical. A batch of 1 (streaming chat) does not benefit from expert parallelism because the dispatch overhead dominates. The throughput advantage of MoE is most visible at batch sizes above 64.
You cannot afford the engineering complexity. MoE serving requires custom kernel support (Triton or CUDA kernels for fused experts, dispatch, and combine), non-trivial CI for load-balancing validation, and integration with inference engines that are still maturing their MoE support. If your team has limited ML infrastructure, a dense model with QLoRA is the safer bet.

TL;DR

MoE decouples total parameters from per-token compute by routing each token to a subset of expert FFNs.
Mixtral 8x7B has 45B total parameters but only activates ~13B per token, so it targets 70B-class quality at roughly 13B-class per-token compute.
The router is a learned linear layer trained end-to-end, not a scheduler. Expert specialization emerges naturally.
Load-balancing loss is essential during training to prevent router collapse. Typical coefficients range from 0.01 to 0.001.
Serving MoE requires expert parallelism across GPUs. Dispatch overhead is the main latency bottleneck, not the expert compute.
MoE memory footprint is proportional to total parameters (all experts), not active parameters. You cannot fit Mixtral on a single 24 GB GPU.
MoE pays off at large scale (target dense baseline above 30B). For small models, single-GPU deployments, or latency-sensitive applications, dense is simpler and often better.

Next post: structured output — how JSON mode, function calling, and grammar-constrained decoding work under the hood, and when each approach fails.

Sampling strategies compared: temperature, top-p, top-k, min-p, and what actually works in production

Tech_Nuggets — Fri, 12 Jun 2026 01:12:21 +0000

You deployed a chatbot, picked temperature 0.7 because every blog post says that, and the first live user sends back screenshots of responses that drift into gibberish mid-sentence. A colleague suggests top-p 0.9. Another says top-k 50. Someone new to the team mentions min-p and claims it solves everything. You have no benchmark, no test set, and no way to tell whether any of these knobs actually fix your specific problem instead of just making the outputs shorter.

This is the state of sampling parameter selection for most teams shipping LLM products. The parameters are poorly documented, they interact in non-intuitive ways, and the default values in every inference engine are tuned for general-purpose chat benchmarks, not for your use case. This post maps the four most common sampling knobs -- temperature, top-p, top-k, and min-p -- to the concrete effects they have on the output distribution, so you can pick the right one (or combination) without guessing.

Why sampling parameters matter

Every LLM generates text one token at a time by choosing from a probability distribution over the vocabulary. The raw distribution (the logits from the final transformer layer, passed through softmax) is almost never used directly. A raw distribution might assign 0.3 to the top token while spreading about 0.00001 each across fifty thousand tail tokens, which adds up to half of the total mass. If you sample directly from that, tail tokens get picked often enough to derail the text; if you always take the top token instead, you get a narrow band of high-probability continuations that sound repetitive and robotic.

Sampling parameters reshape this distribution. The goal is to widen the distribution enough for creative or useful variation, but not so much that the model assigns meaningful probability to tokens that make no sense. Each parameter attacks a different failure mode:

Temperature controls the overall sharpness of the distribution.
Top-p (nucleus sampling) truncates the distribution to the smallest set of tokens whose cumulative probability reaches a threshold.
Top-k keeps only the k highest-probability tokens and renormalizes.
Min-p scales a probability floor relative to the top token's probability, keeping tokens whose probability is at least that fraction of the top token.

The following diagram shows how each strategy transforms the same logit distribution:

flowchart LR
    A[Raw logits<br/>from model] --> D{Temperature<br/>divide logits by tau}
    D -->|tau < 1| E[Sharpened<br/>peaks]
    D -->|tau > 1| F[Flattened<br/>tails]
    E --> B[Softmax]
    F --> B
    B --> C[Full probability<br/>distribution]
    C --> G{Top-p / Top-k / Min-p}
    G --> H[Truncated<br/>distribution]
    H --> I[Sample<br/>next token]
    A --> J[Greedy argmax<br/>tau = 0]

Each box above is a tunable step. The order matters: temperature is applied to logits before softmax, while top-p, top-k, and min-p are applied to the resulting probability distribution after softmax. If you set temperature to 0 first, the later truncation parameters have no effect because the distribution is already a delta function on the argmax token.

The four knobs, explained from the inside

Temperature

Temperature is the oldest and most widely understood parameter. It divides the logits by tau before softmax:

P(token_i) = exp(logit_i / tau) / sum_j exp(logit_j / tau)

When tau = 1, this is the standard softmax. When tau approaches 0, the distribution converges to a one-hot vector on the highest-probability token (greedy decoding). When tau is above 1, the distribution flattens, making low-probability tokens more likely than the raw model intended.

Practical ranges: tau = 0 (deterministic, good for code generation or factual QA), tau = 0.1-0.3 (near-deterministic, useful for classification), tau = 0.6-0.9 (creative writing, conversational), tau = 1.0-1.5 (brainstorming, diverse generations). Above 1.5, the model increasingly produces incoherent text because it is assigning meaningful probability to tokens the model considers unlikely.

The critical property of temperature is that it is a distribution-wide transform. It does not prune any tokens; it just makes the probabilities more equal (tau > 1) or more unequal (tau < 1). This means tau > 1 can activate tokens that were essentially zero-probability in the raw distribution, including tokens that are misspellings, in the wrong language, or hallucinated -- because the model gave them low probability for a reason, and temperature is overriding that signal.
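
The formula above translates directly into a few lines of Python. This is a minimal sketch with made-up logits; the function name is illustrative, and tau = 0 is rejected here because greedy decoding is an argmax rather than a limit you can compute by division.

```python
import math


def softmax_with_temperature(logits, tau):
    """Return probabilities for logits scaled by 1/tau (tau must be > 0)."""
    if tau <= 0:
        raise ValueError("tau must be positive; use argmax for greedy decoding")
    scaled = [x / tau for x in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.0, 1.0, 0.0]
for tau in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, tau)
    print(tau, [round(p, 3) for p in probs])
```

Every token keeps a nonzero probability at every tau. Only the ratios between tokens change, which is why temperature alone never removes the long tail.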

Top-p (nucleus sampling)

Top-p, introduced by Holtzman et al. in 2019, solves a specific problem with temperature: temperature alone does not truncate the vocabulary. At tau = 0.8, the model still assigns tiny nonzero probability to thousands of tokens, and sampling from that long tail produces unexpected tokens.

Top-p works by sorting tokens by probability descending, then keeping tokens from the top until their cumulative probability exceeds p. If p = 0.9, it keeps the top tokens that collectively account for 90% of the probability mass. This is adaptive: when the model is confident, top-p keeps few tokens; when uncertain, it keeps more.

Practical ranges: p = 0.8-0.95 for most generation tasks. Lower values (0.5-0.7) produce more focused outputs useful for factual QA. Values above 0.95 are close to no truncation at all. The surprising property of top-p is that it can be less restrictive than top-k in high-entropy distributions, because it adapts to the distribution shape.
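
The adaptive behavior is easier to see on two toy distributions. The sketch below is illustrative only: the function name and the probabilities are made up, and it stops as soon as the cumulative mass reaches p.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p.

    probs: list of probabilities summing to 1.
    Returns {token_index: renormalized probability}.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


confident = [0.92, 0.04, 0.02, 0.01, 0.01]
uncertain = [0.35, 0.25, 0.20, 0.12, 0.08]
print(len(top_p_filter(confident, 0.9)))  # few tokens survive
print(len(top_p_filter(uncertain, 0.9)))  # more tokens survive
```

The same p = 0.9 keeps a single token in the confident case and most of the vocabulary slice in the uncertain one, with no retuning.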

Top-k

Top-k is the simplest truncation: keep only the k tokens with the highest probability and renormalize. A common default is k = 40 or k = 50, inherited from the early GPT-2 days.

The problem with top-k is that it is static. When the distribution is peaked (model is confident), k = 50 keeps many low-probability tokens that should have been truncated. When the distribution is flat (model is uncertain), k = 50 cuts off tokens that carry meaningful probability. Top-k works acceptably when you have tuned k for a specific domain and model, but it is fragile across models and tasks.

Practical ranges: k = 10-50 for general generation. k = 1 is greedy (effectively tau = 0). k above 100 approaches no truncation for most models.
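
The static-cutoff problem shows up even on five-token toy distributions. A minimal sketch, with an illustrative function name and made-up probabilities:

```python
def top_k_filter(probs, k):
    """Keep the k highest-probability tokens and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = order[:k]
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


peaked = [0.90, 0.05, 0.03, 0.01, 0.01]
flat = [0.22, 0.21, 0.20, 0.19, 0.18]

# With k = 3 the peaked case keeps two near-noise tokens,
# while the flat case drops tokens that carry real probability mass.
print(top_k_filter(peaked, 3))
print(1.0 - sum(flat[i] for i in top_k_filter(flat, 3)))  # mass discarded
```

In the flat case more than a third of the probability mass is thrown away by a cutoff that was harmless in the peaked case.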

Min-p

Min-p, proposed by Nguyen et al. in 2024 (arXiv 2407.01082), addresses the static nature of top-k with an adaptive threshold. It works by setting a floor at (min_p * P_max), where P_max is the probability of the most likely token. Tokens below this floor are discarded, and the remaining distribution is renormalized.

If min_p = 0.1 and the top token has probability 0.6, the floor is 0.06. Any token below 0.06 probability is pruned. When the model is confident (top token near 1), the floor is high and few tokens survive. When the model is uncertain (top token at 0.3), the floor drops and more tokens pass through.

Practical ranges: min_p = 0.01-0.2. Default recommendations from the paper are around 0.05-0.1 for a good balance of creativity and coherence. Values below 0.01 are close to no truncation. Values above 0.2 become very restrictive.
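
The worked example above (top token at 0.6, min_p = 0.1, floor at 0.06) can be reproduced with a short sketch. The function name and the two toy distributions are illustrative assumptions.

```python
def min_p_filter(probs, min_p):
    """Keep tokens with probability >= min_p * max(probs), then renormalize."""
    floor = min_p * max(probs)
    kept = [i for i, p in enumerate(probs) if p >= floor]
    mass = sum(probs[i] for i in kept)
    return floor, {i: probs[i] / mass for i in kept}


# Worked example from the text: top token at 0.6, min_p = 0.1 -> floor near 0.06.
confident = [0.60, 0.20, 0.10, 0.05, 0.03, 0.02]
floor, kept = min_p_filter(confident, 0.1)
print(floor, sorted(kept))

# Flatter distribution: top token at 0.3 -> floor near 0.03, so more tokens pass.
uncertain = [0.30, 0.25, 0.20, 0.15, 0.06, 0.04]
floor, kept = min_p_filter(uncertain, 0.1)
print(floor, sorted(kept))
```

Three tokens survive in the confident case and all six in the uncertain one, using the same min_p value.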

Comparison table

Parameter	What it does	Adaptive?	Common range	Best for	Key failure mode
Temperature	Scales logits before softmax	No	0 - 1.5	Controlling randomness/creativity	Enables low-probability tokens without discrimination
Top-p (nucleus)	Keeps top tokens up to cumulative probability p	Yes (adaptive count)	0.8 - 0.95	General generation when model confidence varies	Can be too permissive in flat distributions, where the nucleus reaches into the long tail
Top-k	Keeps only k highest-probability tokens	No (fixed count)	10 - 50	Legacy compatibility, simple tuning	Static; either too restrictive or too permissive
Min-p	Keeps tokens with prob >= min_p * P_max	Yes (adaptive threshold)	0.01 - 0.2	Production systems needing coherence + creativity	Less tested at very large scales

Sampling in practice: what combinations work

In production systems, sampling parameters are almost never used alone. The most common production recipe is:

Default for conversational agents: temperature = 0.7, top-p = 0.9, min-p = 0.05. This gives enough randomness for natural variation while the min-p floor prevents the model from wandering into very low-probability regions. Top-k is usually turned off (set to 0 or a high value like 200) because min-p and top-p already handle truncation more adaptively.

For code generation or structured output: temperature = 0.1-0.2, top-p = 0.95, min-p = 0.01. The near-zero temperature forces most probability onto the top few tokens. Top-p at 0.95 ensures that when the model is truly uncertain (e.g., picking a variable name), it still has options beyond the argmax.

For creative writing or brainstorming: temperature = 0.9-1.1, top-p = 0.95, min-p = 0.02. Slightly elevated temperature encourages variety. The generous top-p keeps the distribution wide. The low min-p exists mainly as a safety net against the worst long-tail tokens.

For classification or extraction: temperature = 0 (greedy), no truncation parameters needed. When the output space is a fixed set of labels, any sampling at all reduces accuracy. This is the rare case where the simplest setting is actually optimal.

Here is a Python snippet showing how vLLM combines these parameters in practice:

from vllm import SamplingParams

# Conversational agent
params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    min_p=0.05,
    max_tokens=1024,
    stop=["<|im_end|>"]
)

# Code generation
code_params = SamplingParams(
    temperature=0.1,
    top_p=0.95,
    min_p=0.01,
    max_tokens=2048
)

# Classification (deterministic)
classify_params = SamplingParams(
    temperature=0.0,
    max_tokens=16
)

Common pitfalls

Stacking truncation parameters without understanding the interaction. Top-p at 0.9 and top-k at 50 at the same time means two truncations fire sequentially. If top-p runs first and keeps 30 tokens, a top-k of 50 then does nothing. If top-k runs first and keeps 50, top-p may trim them further. The effective behavior depends on which truncation applies first. Most engines apply top-k first, then top-p, then min-p. If you set all three, you are relying on an ordering you may not remember next month. Pick at most two truncation methods.

Setting temperature above 1.5 and expecting coherence. Temperature is not a creativity dial. Above 1.5, the model assigns significant probability to tokens it considers extremely unlikely. The outputs may appear creative but are actually random. If you need diverse outputs, try increasing top-p or lowering min-p instead of pushing temperature beyond 1.2.

Using top-k as the only sampler. This is the most common mistake I see in deployed services. A static k cannot adapt to the distribution. At k=50, sometimes you keep garbage and sometimes you cut off the valid tail. If you must use top-k alone, set k conservatively (10-20) and accept that you are leaving performance on the table.

Forgetting that temperature 0 disables all sampling. If temperature is 0, the model always picks the argmax token. Top-p, top-k, and min-p have no effect because there is no distribution to truncate. If you see "temperature=0, top_p=0.95" in a config, the top_p is dead code.

Applying sampling parameters incorrectly in batched inference. Some inference engines share sampling parameters across all sequences in a batch. Passing a per-request temperature override that conflicts with the batch default causes silent fallback to the default. Always verify that per-request sampling overrides are actually wired through the batching layer.

When NOT to use it

Sampling parameters should not be the primary tool for improving output quality if:

Your outputs are incoherent at temperature 0. Sampling parameters cannot fix a model that produces bad output even when it is maximally deterministic. If greedy decoding gives poor results, the problem is in the model, the prompt, or the training data, not in the sampling strategy. Add more examples to the prompt or improve the fine-tuning data before touching sampling parameters.
You need guaranteed structured output. Sampling introduces nondeterminism. If the application requires valid JSON, a specific schema, or exact string matching, use constrained decoding (grammar-guided generation or JSON mode) instead of hoping the right parameters keep the output valid. Sampling parameters can reduce the rate of malformed output but cannot eliminate it.
You are running a benchmark or eval. Most papers and leaderboards use greedy decoding (temperature 0) or a tightly controlled sampling procedure. If you compare a model at temperature 0.7 against another at temperature 0, you are measuring sampling strategy differences, not model quality differences. For evaluation, use deterministic settings and control for temperature as a variable.
You have not measured the output quality. Before tuning sampling parameters, establish a metric -- accuracy on a held-out set, human preference ratings, or a task-specific score. Without a metric, every sampling parameter change is cargo-culting. Measure first, tune second.
Your application uses speculative decoding. Speculative decoding's acceptance rate generally falls as temperature rises, because a sampled draft token is less likely to match what the target model would accept than an argmax token is. If throughput is critical and you use speculation, the optimal temperature may be lower than you would choose for quality alone. Benchmark the throughput-quality tradeoff explicitly.

TL;DR

Temperature scales logits before softmax. It is the only knob that reshapes the entire distribution without pruning any tokens. Use it to control randomness, from 0 (deterministic) to ~1.2 (max practical creativity).
Top-p keeps the smallest set of top tokens whose cumulative probability reaches p. It adapts to distribution shape and is the most popular general-purpose truncation.
Top-k keeps the top k tokens regardless of their probabilities. It is simple but fragile across inputs. Prefer top-p or min-p unless you have a specific reason for a fixed count.
Min-p keeps tokens whose probability is at least a fraction of the top-token probability. It is the most adaptive truncation and works well as a safety net alongside temperature and top-p.
Best production combo for most use cases: temperature 0.7 + top-p 0.9 + min-p 0.05. Drop top-k entirely. For structured output, use constrained decoding instead of sampling tricks.
Never tune sampling parameters without a metric. Greedy decoding (tau=0) is the first thing to check. If greedy fails, sampling parameters will not save you.

The MCP (Model Context Protocol) has been called the missing standard for tool integration, but the real question is what it costs in latency, reliability, and debuggability. Next post: a production-oriented walkthrough of MCP -- how tool calls flow through the protocol, where the serialization overhead lives, and what the current ecosystem actually supports.

Quantization formats compared: GGUF vs GPTQ vs AWQ vs NF4

Tech_Nuggets — Thu, 11 Jun 2026 01:13:14 +0000

You just finished fine-tuning a 7B parameter model. The raw FP16 weights are 14 GB. Your target deployment is a single consumer GPU with 8 GB of VRAM, or perhaps an ARM MacBook with unified memory, or maybe a cloud instance where you pay per GB of GPU memory. The numbers do not add up. The model, as is, does not fit. You need to shrink it, and you need to shrink it in a way that does not turn it into a random-number generator.

This is where weight quantization enters the picture. Reducing each parameter from 16 bits to 4 bits drops the memory footprint by 4x, from 14 GB to roughly 3.5 GB for a 7B model. The trick is how you do it, because not all 4-bit values are the same, and the trade-offs between memory, speed, accuracy, and portability are different for every format.

Why quantization format choice matters

The format determines three things: which hardware can run the model, how fast inference runs, and how much accuracy you give up. These three constraints are in tension. A format optimized for CPU inference (GGUF) uses a different quantization scheme than one designed for GPU batch serving (GPTQ). A format that preserves more accuracy at the same bit-width (AWQ) may cost more to calibrate. A format designed for training (NF4 via bitsandbytes) is not the best choice for inference deployment.

Choosing the wrong format means either leaving performance on the table, or worse, building a deployment pipeline around a format that the inference engine does not support. The landscape has settled into four major formats, each with a clear niche.

The four formats: how they work

GGUF

GGUF is the GGML Universal Format, created by the llama.cpp project. It is a container format that bundles model weights, tokenizer, and hyperparameters into a single file, with the weights already quantized. The quantization methods inside GGUF range from Q2_K to Q8_0, with Q4_K_M being the most popular sweet spot.

GGUF quantizations use a block-wise scheme: weights are grouped into blocks (typically 32 weights per block) and each block gets its own scale and (optionally) zero-point. The K-quant variants (Q4_K_M, Q5_K_M, etc.) mix different bit-widths across different parts of the model, spending more bits on the layers that matter more.

The format is designed for CPU and Apple Silicon inference. Because llama.cpp can offload some layers to GPU, GGUF also works on hybrid CPU+GPU setups, but the primary target is memory-constrained environments where a GPU is not available or not large enough.
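
The block-wise idea can be illustrated with a simplified sketch. This is not the GGUF bit layout: it uses a symmetric scale with no zero-point, keeps the codes as plain Python integers instead of packing them into bytes, and ignores the K-quant mixing described above. The function names and toy weights are assumptions for the example.

```python
def quantize_block(block, bits=4):
    """Symmetric quantization of one block with a single shared scale."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit
    amax = max(abs(w) for w in block)
    scale = amax / qmax if amax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in block]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate weights from the integer codes."""
    return [scale * c for c in codes]


def quantize(weights, block_size=32, bits=4):
    """Split weights into blocks and quantize each block independently."""
    return [quantize_block(weights[i:i + block_size], bits)
            for i in range(0, len(weights), block_size)]


# 64 toy weights: a small-magnitude block followed by a large-magnitude block.
weights = [0.01 * ((i % 7) - 3) for i in range(32)] + \
          [0.5 * ((i % 5) - 2) for i in range(32)]
blocks = quantize(weights)
restored = [w for scale, codes in blocks for w in dequantize_block(scale, codes)]
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print([round(scale, 4) for scale, _ in blocks], round(max_err, 4))

# Storage cost: 4 bits per weight plus one 16-bit scale per 32-weight block.
bits_per_weight = 4 + 16 / 32
print(bits_per_weight)  # 4.5
```

Because each block carries its own scale, the small-magnitude block is not crushed to zero by the large values next to it. That is the main reason block-wise schemes hold up better than a single scale per tensor.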

GPTQ

GPTQ (from the paper "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers") was introduced by Frantar et al. from IST Austria in late 2022 and published at ICLR 2023. It is a weight-only quantization method that uses a second-order optimization procedure: it quantizes weights column by column, using approximate Hessian information to adjust the remaining unquantized weights to compensate for the information lost on the already-quantized ones.

The original implementation, AutoGPTQ, was archived in early 2025. The active successor is GPTQModel (v7.1.0, June 2026) from ModelCloud, which supports both Marlin and Triton kernels for fast GPU inference. GPTQ models are typically quantized to 4-bit (or occasionally 3-bit and 8-bit) and are stored in Hugging Face-compatible safetensors format with a quantize_config.json metadata file.

GPTQ requires a GPU to run. The Marlin kernel (int4 x fp16) gets close to the ideal speedup from 4-bit weights on NVIDIA GPUs at small to moderate batch sizes, making GPTQ the default choice for serving quantized models on datacenter GPUs.

AWQ

AWQ (Activation-Aware Weight Quantization) was introduced by Lin et al. from MIT in 2023 and published at MLSys 2024. The key insight is that not all weights are equally important -- the ones corresponding to large activation magnitudes have a disproportionate impact on output quality. AWQ identifies these "salient" weight channels by analyzing a small calibration dataset and protects them by scaling them up before quantization, then scaling the output back down during inference.

The implementation is AutoAWQ (v0.2.9, May 2025). Like GPTQ, AWQ targets GPU inference and produces Hugging Face-compatible weights. AWQ tends to produce slightly lower perplexity than GPTQ at the same bit-width, especially at 4-bit, though the gap is small (typically within 0.1 perplexity points).

NF4

NF4 (NormalFloat4) is a quantization data type introduced as part of the QLoRA paper (Dettmers et al., 2023). It is not a container format or a quantization algorithm per se -- it is a 4-bit data type that assumes the weights follow a normal distribution and uses a normalized float mapping that allocates more quantization levels near zero.
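
To see why a normal-distribution assumption puts more levels near zero, here is a simplified quantile-based 4-bit codebook built with the standard library. It is a stand-in, not the published NF4 table: the real table is constructed somewhat differently (for one, it includes an exact zero), and the function names below are illustrative.

```python
from statistics import NormalDist

def quantile_codebook(bits=4):
    """Place one level at the midpoint quantile of each equal-probability bin."""
    n = 2 ** bits
    std_normal = NormalDist()
    raw = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    peak = max(abs(v) for v in raw)
    return [v / peak for v in raw]          # normalize levels into [-1, 1]

def nearest_level(w, abs_max, codebook):
    """Index of the codebook level closest to w / abs_max."""
    target = w / abs_max
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - target))

levels = quantile_codebook()
center_gap = levels[8] - levels[7]          # spacing around zero
tail_gap = levels[15] - levels[14]          # spacing at the positive tail
print(f"gap near zero: {center_gap:.3f}, gap at the tail: {tail_gap:.3f}")
```

Since most weights of a roughly normal tensor sit close to zero, spending the 16 codes where the density is highest lowers the average rounding error compared with evenly spaced integer levels.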

NF4 is implemented in the bitsandbytes library (v0.49.2, February 2026) and is the default 4-bit type for QLoRA fine-tuning in the Hugging Face ecosystem. Unlike the other three formats, NF4 is primarily used for training (parameter-efficient fine-tuning) rather than inference deployment. You use NF4 to load a model in 4-bit during training, but you typically export to a different format for serving.

Side-by-side comparison

Property	GGUF	GPTQ	AWQ	NF4
Primary use case	CPU / Apple Silicon inference	GPU inference serving	GPU inference serving	QLoRA fine-tuning
Container format	Single .gguf file	safetensors + config.json	safetensors + config.json	Not a standalone format
Quantization method	Block-wise K-quants	Hessian-based, column-by-column	Activation-aware saliency scaling	Normal-distribution optimized float
Typical bit-width	2-8 bits (Q4_K_M most common)	4-bit (3/8 also supported)	4-bit	4-bit
CPU inference	Native	No	No	No
GPU inference	Partial (layer offload)	Yes (Marlin kernel)	Yes (Triton kernel)	Yes (training only)
Apple Silicon	Native (Metal)	No	No	No
Calibration data needed	No	Yes (128-512 samples)	Yes (128-512 samples)	No
Accuracy at 4-bit	Good	Excellent	Excellent	Good
Inference engine	llama.cpp, Ollama, LM Studio	vLLM, TGI, HF Transformers, GPTQModel	vLLM, TGI, HF Transformers	HF Transformers (training)
Latest version	b9592 (llama.cpp, Jun 2026)	GPTQModel v7.1.0 (Jun 2026)	AutoAWQ v0.2.9 (May 2025)	bitsandbytes 0.49.2 (Feb 2026)

Quantization at a glance: the pipeline

flowchart LR
    A[FP16 model<br/>16-bit weights] --> B{Which format?}
    B -->|CPU / Apple| C[GGUF quantization<br/>llama.cpp]
    B -->|GPU serving| D[GPTQ quantization<br/>GPTQModel]
    B -->|GPU serving| E[AWQ quantization<br/>AutoAWQ]
    B -->|QLoRA training| F[NF4 loading<br/>bitsandbytes]
    C --> G[Single .gguf file<br/>ready to run]
    D --> H[safetensors + config<br/>load with vLLM/TGI]
    E --> I[safetensors + config<br/>load with vLLM/TGI]
    F --> J[4-bit training<br/>export to deploy format]
    G --> K[llama.cpp / Ollama / LM Studio]
    H --> L[vLLM / TGI / Transformers]
    I --> L
    J --> B

The diagram shows the branching decision. The critical fork is between CPU/Apple Silicon and GPU serving, because the format choice there determines the entire downstream toolchain.

Common pitfalls

Treating all 4-bit as equivalent. A 4-bit GPTQ model is not the same quality as a 4-bit GGUF Q4_K_M or a 4-bit NF4 model. The quantization method, calibration data, and block size all affect final perplexity. Always compare within the same family, and use perplexity as a relative guide, not an absolute one.

Assuming you need calibration data for every format. GPTQ and AWQ both require a small calibration dataset (typically 128 samples from the training distribution). GGUF and NF4 do not. If you are quantizing a model for which you do not have representative sample data, GGUF is the simpler path.

Quantizing for GPU, then trying to run on CPU. A GPTQ model uses GPU-only kernels. There is no CPU fallback. If you download a GPTQ model from Hugging Face and try to run it with llama.cpp, it will not work. Similarly, GGUF models run poorly (or not at all) in vLLM. The format and the runtime are coupled.

Building an AWQ model with a stale version. AutoAWQ v0.2.9 (May 2025) is the latest release, but HF Transformers v5.11.0 (June 2026) also includes native AWQ loading via transformers.AwqConfig. If you use the Transformers integration, you do not need the standalone AutoAWQ library. Check which path is supported by your inference engine.

Using NF4 for deployment. NF4 is not a format designed for fast inference. The bitsandbytes 4-bit dequantization path is slow compared to the dedicated kernels in GPTQ (Marlin) or AWQ (Triton). Use NF4 for QLoRA training, then re-quantize to GPTQ or GGUF for deployment.

When NOT to use each format

Do not use GGUF if you are serving a high-throughput API on NVIDIA GPUs. The CPU fallback path of llama.cpp is slower than GPTQ's Marlin kernel at batch sizes above 1.

Do not use GPTQ if your deployment target is a MacBook, a Raspberry Pi, or any non-NVIDIA GPU. GPTQ kernels are NVIDIA CUDA-only. For Apple Silicon, use GGUF. For AMD GPUs, check if ROCm-based GPTQ kernels are available (limited support as of mid-2026).

Do not use AWQ if you cannot provide a representative calibration dataset. AWQ relies on activation statistics from real data. A mismatch between calibration data and deployment data degrades the saliency detection and can increase accuracy loss.

Do not use NF4 for anything beyond training. It is a data type designed for QLoRA training, not a deployment format. If you see a model on Hugging Face labeled "NF4", it was likely uploaded as a training checkpoint, not a serving artifact.

TL;DR

There are four mainstream LLM weight quantization formats: GGUF, GPTQ, AWQ, and NF4. Each targets a different deployment scenario.
GGUF (llama.cpp) is for CPU and Apple Silicon inference. It is a self-contained single-file format with no calibration step.
GPTQ (GPTQModel v7.1.0) is for NVIDIA GPU serving. It uses Hessian-based quantization and the Marlin kernel for fast inference.
AWQ (AutoAWQ v0.2.9) is also for NVIDIA GPU serving. It uses activation-aware saliency scaling and achieves slightly better perplexity than GPTQ at the same bit-width.
NF4 (bitsandbytes) is for QLoRA fine-tuning, not inference deployment. Use it to train, then re-quantize for serving.
Choose your format based on your hardware (CPU vs NVIDIA GPU vs Apple Silicon) before considering bit-width or accuracy metrics. The runtime determines the format.
Calibration data is required for GPTQ and AWQ, but not for GGUF and NF4.

Now that you know which format to use, the next question is: how fast will a quantized model actually run on your hardware? The next post breaks down tokens-per-second for each format across consumer GPUs, Apple Silicon, and CPU configurations, with concrete benchmarks you can use to size your deployment.

If you have a quantized model deployment story -- or a horror story about picking the wrong format -- the comments are the place to share it. The next post will include community-sourced numbers from exactly these stories.

Flash Attention: what it does and why it matters

Tech_Nuggets — Wed, 10 Jun 2026 11:20:09 +0000

Flash Attention: what it does and why it matters

Your training job is paying for an A100 at $3/hour. The loss is going down, gradients are flowing, and the model's loss curve looks textbook-logarithmic. But if you profile the step time and look at what the GPU is actually doing, you'll see something alarming: the GPU compute units are idle 40-60% of the time. The bottleneck isn't arithmetic -- it's memory bandwidth. The GPU's HBM (high-bandwidth memory, 1.5-2 TB/s on an A100) cannot keep up with how fast the compute units want to consume data. And the single biggest chunk of memory traffic in any transformer training or inference run is the attention computation, which naively reads and writes the full N x N attention matrix to HBM for every forward pass.

Flash Attention exists to solve that one problem: it eliminates the redundant HBM traffic by fusing the attention computation into tiles that stay entirely inside the GPU's SRAM (the fast, on-chip memory: roughly 20 MB in total on an A100, split into 192 KB per streaming multiprocessor). The result is a 2-4x end-to-end speedup on attention-bound workloads, at zero loss of precision, with no model changes required.

Why attention memory costs matter

A standard self-attention layer on a single head works with three matrices Q, K, V, each of shape (N, d) where N is the sequence length and d is the head dimension. The naive computation:

Compute S = Q @ K^T -- shape (N, N)
Compute P = softmax(S, dim=-1) -- shape (N, N)
Compute O = P @ V -- shape (N, d)

The critical cost is that S and P are each N x N entries. For a 4096-token sequence with d=128, that's 16 million entries per head. At FP16, that's 32 MB per head. With 32 heads, the full N x N matrix across all heads would be 1 GB -- far larger than the ~20 MB of SRAM on a single A100 GPU. The standard implementation writes this 1 GB to HBM (slow), reads it back for softmax (HBM read), writes the result back (HBM write), then reads it again for the V multiplication.
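
The arithmetic above is easy to reproduce and to extend to other sequence lengths. The helper below is a small illustrative function (its name and defaults are assumptions); it counts one materialized N x N matrix, so S and P together cost twice as much.

```python
def attention_matrix_bytes(seq_len, bytes_per_entry=2, num_heads=1):
    """Size of one materialized N x N score matrix (S or P) across heads."""
    return seq_len * seq_len * bytes_per_entry * num_heads

MIB = 1024 ** 2
for n in (1024, 4096, 32768):
    per_head = attention_matrix_bytes(n) / MIB
    all_heads = attention_matrix_bytes(n, num_heads=32) / MIB
    print(f"N={n:>6}: {per_head:>8.0f} MiB per head, {all_heads:>9.0f} MiB for 32 heads")
```

The quadratic term is what hurts: every 2x increase in sequence length makes the matrix 4x larger, while Q, K, V, and O only double.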

Flash Attention avoids materializing this N x N matrix entirely by tiling the softmax computation across blocks small enough to fit in SRAM.

What Flash Attention actually does

The core insight from Tri Dao and the Stanford group (2022) was that the attention computation is IO-bound, not compute-bound, and the dominant cost is moving data between HBM and SRAM. On an A100, SRAM bandwidth is roughly 20 TB/s (compute units to SRAM), while HBM bandwidth is ~2 TB/s. A 10x difference. If the computation can be structured to stay in SRAM, it wins.

The mechanism is algorithmically straightforward:

Block the Q, K, V matrices into tiles small enough to fit in SRAM.
Compute a partial softmax for each block, using the online softmax algorithm (safe softmax that can be updated incrementally).
Accumulate partial results into the output, keeping per-block rescaling statistics in registers.
Write the final output to HBM once per layer, instead of multiple reads/writes per head.

This is a classic tiling technique, but applied to the attention-specific problem where the softmax is a global normalization -- you cannot naively sum over tiles because softmax requires a denominator over the full row. The paper's key algorithmic contribution is an online-safe softmax that lets each tile compute a local softmax and then correct the running output as new tiles arrive.

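# Reference sketch of the Flash Attention forward pass for one Q tile (plain PyTorch, not a fused kernel)
import torch

def flash_attention_block(Q_block, K, V, B_c=64):
    # Q_block: (B_r, d); K, V: (N, d). B_r and B_c are tile sizes chosen to fit in SRAM.
    # Q_block is assumed to be pre-scaled; S = Q @ K^T as in the steps above.
    B_r, d = Q_block.shape
    m = Q_block.new_full((B_r, 1), float("-inf"))   # running row-wise max for numerical stability
    l = Q_block.new_zeros(B_r, 1)                   # running sum of exp(S - m)
    O = Q_block.new_zeros(B_r, d)                   # running, already-normalized output

    for start in range(0, K.shape[0], B_c):
        K_tile, V_tile = K[start:start + B_c], V[start:start + B_c]
        S = Q_block @ K_tile.T                          # local attention scores (B_r, B_c)
        m_new = torch.maximum(m, S.amax(dim=-1, keepdim=True))   # update running max
        P = torch.exp(S - m_new)                        # unnormalized local probabilities
        l_new = torch.exp(m - m_new) * l + P.sum(dim=-1, keepdim=True)
        # Rescale the old output to the new max and denominator, then add this tile
        O = (l * torch.exp(m - m_new) / l_new) * O + (P / l_new) @ V_tile
        m, l = m_new, l_new

    return O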

For each Q tile, the algorithm streams K and V through SRAM tile by tile and writes that tile's rows of O to HBM once; the N x N matrix never touches HBM. Compare to the naive approach: for a sequence of length N, the standard implementation writes and re-reads the N x N attention matrix, which costs Θ(Nd + N^2) HBM accesses. Flash Attention needs Θ(N^2 d^2 / M), where M is the SRAM size in elements. Because d^2 is typically many times smaller than M, that is a several-fold reduction, and it grows with SRAM capacity.

The following diagram shows how the tiling skips the materialization of the full attention matrix:

flowchart TB
    subgraph SRAM["GPU SRAM (~20 MB)"]
        QB["Q tile<br/>(B_r x d)"]
        KB["K tile<br/>(B_c x d)"]
        VB["V tile<br/>(B_c x d)"]
        ST["Partial S = QB @ KB^T<br/>(B_r x B_c)"]
        OT["Partial O accumulator<br/>(B_r x d)"]
    end
    subgraph HBM["GPU HBM (~40-80 GB)"]
        QF["Full Q<br/>(N x d)"]
        KF["Full K<br/>(N x d)"]
        VF["Full V<br/>(N x d)"]
        OF["Full O<br/>(N x d)"]
    end

    QF -->|read once| QB
    KF -->|read once<br/>tile by tile| KB
    VF -->|read once<br/>tile by tile| VB
    QB --> ST
    KB --> ST
    ST -->|online softmax| OT
    VB -->|partial products| OT
    OT -->|write once| OF

    style SRAM fill:#1e293b,stroke:#38bdf8,color:#e2e8f0
    style HBM fill:#0f172a,stroke:#64748b,color:#94a3b8

Each arrow from HBM to SRAM is a comparatively slow global-memory load. The naive implementation makes several full passes over the N x N matrix per head (write S, read S, write P, read P). For each Q tile, Flash Attention streams K and V through SRAM once, tile by tile, then writes that tile's rows of O once.

Flash Attention v1 vs v2 vs v3

Version	Year	Key improvements	Speedup vs naive	GPU focus
v1	2022	Tiling + online softmax, no N x N matrix in HBM	2x	A100 (Ampere)
v2	2023	Reduced non-matmul ops, better parallelism, non-power-of-2 lengths supported	2-3.5x	A100, H100
v3	2024	WGMMA (warp-group matrix multiply-accumulate) for H100 Tensor Cores, async pipelining, FP8 support	3-7x	H100/H200 (Hopper)

Flash Attention v2 cut the number of non-matrix-multiply operations, mainly by deferring the softmax rescaling to the end of the K/V loop and by applying the causal mask only to the blocks that need it. This matters because Tensor Cores are most efficient when the workload is pure matrix multiplication, and any extra elementwise operations reduce utilization. The v2 paper reported that a single forward pass on a 65M-parameter model went from 6.5ms (PyTorch standard) to 2.6ms (Flash Attention v2).

Flash Attention v3, published in 2024, targets the H100's Hopper architecture. It uses the asynchronous WGMMA instruction (warp-group MMA) together with the Tensor Memory Accelerator (TMA), which lets the GPU overlap data movement with computation during the tiled softmax pass. The synchronous loads of v1/v2 are replaced with asynchronous copies that hide latency. Additionally, v3 introduces FP8 support, which halves the bytes moved per element relative to FP16.

Where Flash Attention is used today

Flash Attention is integrated into virtually every major LLM framework. The most common path is through PyTorch's scaled_dot_product_attention (SDPA), which has shipped the flash-attention backend since PyTorch 2.0:

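import torch
import torch.nn.functional as F

# Example inputs in SDPA's (batch, heads, seq_len, head_dim) layout
query = torch.randn(1, 32, 4096, 128, dtype=torch.float16, device="cuda")
key, value = torch.randn_like(query), torch.randn_like(query)

# This automatically uses Flash Attention if conditions are met:
# - CUDA GPU
# - dtype is half-precision (FP16 or BF16)
# - head_dim is a multiple of 8
# - (v2+) Sequence length doesn't have restrictions on being power of 2
attn_output = F.scaled_dot_product_attention(
    query, key, value,
    attn_mask=None,
    dropout_p=0.0,
    is_causal=True
)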

You don't need to import flash_attn directly in most cases. PyTorch's SDPA dispatches automatically to the best available backend: Flash Attention if its conditions are met, otherwise memory-efficient attention, and finally the naive (math) implementation.

For direct access, the flash-attn package on PyPI provides the FlashAttention module:

pip install flash-attn

This installs a prebuilt wheel matching your CUDA and PyTorch combination (PyPI wheels are available starting with v2.8.x). If no wheel exists for your configuration, building from source takes about 15 minutes and requires a CUDA compiler.

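from flash_attn import flash_attn_func

# q, k, v: (batch, seqlen, nheads, headdim), FP16 or BF16, on a CUDA device
# scale: float such as headdim ** -0.5; passing None uses that value by default
output = flash_attn_func(
    q, k, v,
    dropout_p=0.0,
    softmax_scale=scale,
    causal=True
)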

The flash_attn_func API gives you direct control over the backend parameters and is the path used by vLLM, Hugging Face transformers, and torch.compile paths.

Common pitfalls

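The is_causal / padding interaction. If you need a causal mask and a separate padding mask (for batched sequences of different lengths), the two do not combine automatically. In PyTorch's SDPA, passing attn_mask together with is_causal=True raises an error, and supplying an arbitrary attn_mask can rule out the Flash Attention backend, depending on the PyTorch version. The safest approach is to keep is_causal=True and right-pad every sequence to the same length (real tokens then never attend to pad positions, which lie in their future), or to build a single per-batch N x N mask with -inf in both the future and the padded positions and accept whichever backend SDPA selects. With the flash-attn package, flash_attn_varlen_func handles variable-length batches without padding.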

Head dimension limits. Flash Attention has historically had constraints on head dimension. v1 required head_dim <= 128. v2 increased this to head_dim <= 256. v3 supports up to 256. If your model uses head_dim=96 or head_dim=64, you are fine. If you are experimenting with head_dim=512 (rare but seen in some vision transformers), Flash Attention cannot accelerate that attention computation.

CUDA graph compatibility. Flash Attention uses a variable amount of shared memory depending on the tile size, which can cause issues with CUDA graph capture. If you are using torch.compile with mode="reduce-overhead", test that the Flash Attention kernel does not prevent graph capture. v2.8.x has improved this, but the interaction is not guaranteed across all PyTorch versions.

AMD GPUs and non-CUDA backends. The Flash Attention kernels discussed here are CUDA kernels and do not run unmodified on AMD ROCm. ROCm has its own implementations, including a Triton-based one, but they have different performance characteristics and are not drop-in replacements. If you are on AMD GPUs, benchmark before assuming parity.

Automatic fallback in SDPA can hide problems. Because PyTorch's SDPA silently falls back to the naive implementation if Flash Attention conditions are unmet, you can accidentally get different kernels on different GPU types and not notice. Always log which SDPA backend was selected if you care about reproducible performance.

When NOT to use it

Flash Attention is the wrong optimization if:

Your bottleneck is the MLP layers, not attention. For inference workloads where batch size is 1 and sequence length is short (under 512 tokens), the attention compute is a small fraction of total time. The MLP projections dominate. Optimizing attention gives you a 5-10% speedup instead of 2-4x. Profile first.
You are on CPU inference. Flash Attention requires a CUDA-capable GPU. CPUs use entirely different attention paths.
You need integer-only attention (e.g., quantized KV cache on CPU/edge devices). Flash Attention is implemented in CUDA and expects FP16/BF16 data. Quantized attention kernels (MatMul-free LLMs, etc.) use different algorithms.
You are training a small model for quick iteration. If your model takes 30 seconds per epoch, optimizing attention will not move the bottleneck. The overhead of importing and configuring Flash Attention (not large, but nonzero) is wasted effort.
Your sequence length is extremely long (100K+ tokens). For very long sequences, a single GPU's memory and compute become the limit even with tiling, because the work still grows as N^2. Ring Attention, DeepSpeed Ulysses, and Striped Attention are better suited above 100K tokens because they shard the sequence across GPUs instead of tiling within a single GPU's SRAM.

TL;DR

Flash Attention tiles the Q, K, V matrices into blocks that fit in GPU SRAM, computing the softmax online without ever materializing the full N x N attention matrix in HBM.
v2.8.3.post1 is the current stable release (June 2026). v2 improved parallelism and removed length restrictions. v3 added H100-specific WGMMA instructions and FP8 support.
The speedup is 2-4x on A100-class GPUs, 3-7x on H100, at zero precision loss, with no model architecture changes required.
You get it automatically through PyTorch F.scaled_dot_product_attention or directly via the flash_attn package.
Watch for head_dim limits (max 256 in v2/v3), CUDA graph compatibility, and the silent SDPA backend fallback that can hide performance regressions.
Do not use Flash Attention if your bottleneck is not attention, you are on CPU/AMD, or you have extreme sequence lengths that require inter-GPU sharding.

Next post: a practical comparison of sampling strategies -- temperature, top-p, top-k, min-p, and what actually produces better output quality in production systems.

Flash Attention: what it does and why it matters

Tech_Nuggets — Wed, 10 Jun 2026 09:58:51 +0000

Flash Attention: what it does and why it matters

You have a single H100 with 80 GB of VRAM. The Llama 3.1 70B model needs 140 GB in FP16, which does not fit, so you're running it at 4-bit quantization and have maybe 5–8 GB of KV cache space left for a long-context workload. The model is fast enough at 8K context, so you push it to 32K for a RAG pipeline. It's still fine. Then you push it to 128K for a document-summary task, and suddenly the attention layer alone is spending 3 seconds per forward pass, 85% of which is just moving data between HBM and SRAM, not doing math. The CUDA kernel occupancy graph tells the story: green compute bars are tiny, grey memory-stall bars are huge. The GPU is bandwidth-bound, and vanilla attention is the cause.

Flash Attention is the algorithm that fixes this by restructuring the attention computation itself — not approximate, not sparse, not quantized, just IO-aware. Here is what it does, how the three versions differ, and where it stops helping.

Why this matters in practice

The attention mechanism is the core of every transformer: compute a similarity matrix S = Q K^T, normalize it with softmax P = softmax(S), and use it as weights over values O = P V. The problem is that for sequence length N and head dimension d, the S and P matrices are N×N, and writing them to GPU HBM (high-bandwidth memory) and reading them back is the bottleneck, not the matrix multiplies themselves.

For N = 32K and d = 128 (a single GPT-style head), S has about 1.07 billion entries, which is 2 GiB in FP16. At an HBM bandwidth of 2 TB/s (H100 PCIe; the SXM part reaches 3.35 TB/s), writing that matrix out and reading it back costs roughly 2 ms per head per layer. Across 80 layers and both forward and backward passes, that adds up to 300+ ms per step for every head, and you haven't done a single useful ALU operation yet, just memory shuffling. At 128K context, S alone is 32 GiB per head in FP16, and the memory wall dominates.

Flash Attention eliminates almost all of the intermediate HBM traffic by tiling the Q, K, V matrices into blocks that fit in on-chip SRAM (192 KB per SM on A100, 256 KB per SM on H100), performing the entire softmax + weighted sum inside SRAM, and only writing the final output O back to HBM. The result: 2–4× faster attention for typical long-context workloads, up to 10× for very long sequences, with output that matches vanilla attention up to floating-point rounding in FP16/BF16 and a small relative error in FP8.

How the algorithm works

The core insight is that softmax over a sub-block can be recomputed from the running statistics. You don't need the full N×N matrix — you can process Q, K, V in blocks, compute local softmax within each block, maintain an online estimate of the softmax denominator, and merge the results.

flowchart LR
    subgraph HBM["HBM (main memory)"]
        Q["Q (N × d)"]
        K["K (N × d)"]
        V["V (N × d)"]
        O["O (N × d)"]
    end
    subgraph SRAM["SRAM (on-chip, ~192 KB)"]
        Qi["Q_block (Bc × d)"]
        Kj["K_block (Br × d)"]
        Vj["V_block (Br × d)"]
        Sij["S_block (Bc × Br)"]
        Pij["P_block (Bc × Br)"]
        Oi["O_block accumulator"]
        mi["Row max<br/>m_i"]
        li["Row sum<br/>ℓ_i"]
    end
    Q -->|tile| Qi
    K -->|tile| Kj
    V -->|tile| Vj
    Qi --> Sij
    Kj --> Sij
    Sij --> Pij
    Pij --> Oi
    Vj --> Oi
    Oi -.->|write| O

The algorithm for each attention head proceeds as follows:

Divide Q into blocks of size Bc that fit in SRAM alongside one block each of K and V.
Divide K and V into blocks of size Br.
For each Q block i and each K/V block j:
- Load Q_i and K_j, V_j into SRAM.
- Compute S_ij = Q_i K_j^T in SRAM.
- Compute local softmax: m_ij = rowmax(S_ij), P_ij = exp(S_ij - m_ij), ℓ_ij = rowsum(P_ij).
- Update global running max: m_i = max(m_i_prev, m_ij).
- Update global running sum: ℓ_i = exp(m_i_prev - m_i) · ℓ_i_prev + exp(m_ij - m_i) · ℓ_ij.
- Correct and accumulate output: O_i = (ℓ_i_prev · exp(m_i_prev - m_i) · O_i_prev + exp(m_ij - m_i) · P_ij V_j) / ℓ_i.
Write the final O_i back to HBM after all K/V blocks have been processed.
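
The steps above can be checked numerically with a small pure-Python sketch. It is a reference for the arithmetic only, not a GPU kernel: each query row is treated as a one-row Q block, the 1/sqrt(d) scaling is omitted to match the notation above, and the function names are illustrative assumptions.

```python
import math

def naive_attention(Q, K, V):
    """Reference: build each full score row, softmax it, then weight V."""
    out = []
    for q in Q:
        s = [sum(a * b for a, b in zip(q, k)) for k in K]
        row_max = max(s)
        p = [math.exp(x - row_max) for x in s]
        denom = sum(p)
        out.append([sum(p_k * v[c] for p_k, v in zip(p, V)) / denom
                    for c in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block=4):
    """Same result, but K and V are consumed `block` rows at a time."""
    d_v = len(V[0])
    out = []
    for q in Q:  # each query row plays the role of a one-row Q block
        m_i, l_i, o_i = -math.inf, 0.0, [0.0] * d_v
        for j in range(0, len(K), block):
            K_j, V_j = K[j:j + block], V[j:j + block]
            s = [sum(a * b for a, b in zip(q, k)) for k in K_j]   # S_ij
            m_ij = max(s)
            p = [math.exp(x - m_ij) for x in s]                   # P_ij
            l_ij = sum(p)
            m_new = max(m_i, m_ij)
            old_w = math.exp(m_i - m_new)    # rescales what was accumulated so far
            new_w = math.exp(m_ij - m_new)   # rescales this block's contribution
            l_new = old_w * l_i + new_w * l_ij
            pv = [sum(p_k * v[c] for p_k, v in zip(p, V_j)) for c in range(d_v)]
            o_i = [(l_i * old_w * o + new_w * x) / l_new for o, x in zip(o_i, pv)]
            m_i, l_i = m_new, l_new
        out.append(o_i)
    return out
```

Feeding both functions the same random Q, K, V and comparing the outputs element by element is a quick way to confirm that the block-wise merge reproduces full softmax attention up to rounding, for any block size.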

The critical property: the output matches vanilla attention in FP16/BF16 up to floating-point rounding, because softmax over the full sequence is exactly reconstructed from the block-level statistics. The algorithm does not approximate; it rearranges.

Flash Attention 1 → 2 → 3

Feature	Vanilla	Flash Attn v1	Flash Attn v2	Flash Attn v3
Paper	N/A	Dao et al., 2022	Dao, 2023	Shah et al., 2024
GPU target	Any	A100 (Ampere)	A100 + H100	H100/H200 (Hopper)
HBM accesses	Θ(Nd + N²)	Θ(N² d² / M)	same	same
Forward speed vs vanilla	1×	2–3×	3–4×	4–6×
Backward speed vs vanilla	1×	2–3×	4–5×	6–8×
Precision	FP32/BF16	FP16/BF16	FP16/BF16	FP8/BF16/FP16
Data type	standard	FP16 only	BF16 + FP16	FP8 + BF16 + FP16
Core technique	none	Tiling + recompute	Improved block scheduling	Async WGMMA + FP8
CUDA features used	standard	MMA (Tensor Core)	MMA + better occupancy	WGMMA + async copy
Open source	—	✓ (Dao-AILab)	✓ (Dao-AILab)	✓ (Dao-AILab)

Flash Attention v1 (NeurIPS 2022, the paper that started it): Introduced the tiling scheme, proved the IO complexity result (Θ(N² d² / M) HBM accesses vs Θ(Nd + N²) for vanilla), and showed that the algorithm computes exact attention rather than an approximation. Forward pass is 2–3× faster than a standard PyTorch attention implementation on A100s. The backward pass uses the same tiling approach but recomputes S and P from Q, K, V and the saved softmax statistics rather than storing the N×N matrices from the forward pass.

Flash Attention v2 (2023): Redesigned the work distribution. In v1, each thread block handles one (batch, head) pair, so parallelism comes only from batch size and head count, which leaves SMs idle for long sequences with small batches. v2 additionally parallelizes over Q-blocks along the sequence dimension, makes the Q-block loop the outer loop, and splits Q rather than K/V across the warps of a thread block, which cuts shared-memory traffic between warps. It also trims non-matmul work by deferring the softmax normalization to the end of the K/V loop. v2 is roughly 2× faster than v1 on both A100 and H100, and it's the version that made Flash Attention a default in Hugging Face Transformers and PyTorch 2.x.

Flash Attention v3 (2024, Hopper-specific): Taps the H100's WGMMA (warp-group matrix multiply-accumulate) instructions and asynchronous TMA (tensor memory accelerator) copies. v3 overlaps SRAM data transfers with computation via async copies: while the current block is computing attention, the next block's K, V tiles are being fetched in the background. The FP8 path uses the H100's 2× faster FP8 Tensor Cores (1.97 PFLOPS vs 989 TFLOPS for FP16), with block quantization and incoherent processing to contain the quantization error. v3 delivers 4–6× speedup over vanilla attention and is the recommended default for Hopper GPUs with sequence lengths above 8K.

Using it in practice

Flash Attention 3 is included in the flash-attn PyPI package (v3.1.2 as of May 2026). Installation is a single line:

pip install flash-attn

The API is straightforward once the package is installed. The main entry points are functions, not a module that auto-patches your model:

import torch
from flash_attn import flash_attn_func

q = torch.randn(1, 32, 4096, 128, dtype=torch.bfloat16, device="cuda")
k = torch.randn(1, 32, 4096, 128, dtype=torch.bfloat16, device="cuda")
v = torch.randn(1, 32, 4096, 128, dtype=torch.bfloat16, device="cuda")

# (batch, heads, seqlen, headdim) → (batch, seqlen, heads, headdim)
q = q.transpose(1, 2).contiguous()
k = k.transpose(1, 2).contiguous()
v = v.transpose(1, 2).contiguous()

out = flash_attn_func(q, k, v, dropout_p=0.0, softmax_scale=None, causal=True)
# out shape: (1, 4096, 32, 128) — same as input layout

For most users, the easiest path is PyTorch's torch.nn.functional.scaled_dot_product_attention, which dispatches to Flash Attention automatically when the input dtype, layout, and GPU support it. To require the Flash backend and get an error instead of a silent fallback, restrict the dispatcher with the torch.nn.attention.sdpa_kernel context manager, which replaces the deprecated torch.backends.cuda.sdp_kernel in PyTorch 2.3 and later:

from torch.nn.attention import SDPBackend, sdpa_kernel

# SDPA expects (batch, heads, seqlen, headdim), so undo the transpose from above
q_sdpa, k_sdpa, v_sdpa = (t.transpose(1, 2) for t in (q, k, v))
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(q_sdpa, k_sdpa, v_sdpa, is_causal=True)

The dispatch check is reliable on A100 and H100 with BF16/FP16 inputs and head dimensions of 64 or 128. For FP8, you need H100 and flash_attn_func directly.

Flash Attention 2 also integrates with Hugging Face models via attn_implementation="flash_attention_2" in from_pretrained:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

This swaps the attention module during model loading and is the path most training pipelines use today.

Common pitfalls

Head dimension limits depend on the version. v1 supported head dimensions up to 128; v2 and v3 support head dimensions up to 256. When a shape is unsupported and you go through PyTorch's SDPA, the dispatcher falls back to another backend with no error message, so check which kernel actually ran.
FP8 has higher numerical error on outlier-heavy models. Flash Attention 3's FP8 path pre-scales K and V row-wise and accumulates in FP16, but extremely spiky attention patterns (e.g., models trained without attention dropout) can amplify the relative error. Compare the output distribution on a few samples before trusting FP8 for your use case.
Not all GPUs support all versions. FA1 runs on Turing and Ampere GPUs (it won't run on V100). FA2 runs on Ampere and newer. FA3 requires Hopper (H100/H200); SM 90 kernels will not load on Ada Lovelace.
Memory gains are less visible with very short sequences. At N < 512, the overhead of block iteration and the SRAM management cost can make Flash Attention slower than a well-tuned vanilla kernel. If you go through PyTorch's SDPA, the dispatcher picks the backend for you, but if you call flash_attn_func directly at short context, benchmark first.
Dropout in attention is not free. FA supports attention dropout via a separate random mask, but because it recomputes S and P in the backward pass, the dropout rng state must be stored per block. In practice, most modern LLMs don't use attention dropout, so this rarely matters.

When NOT to use it

Flash Attention is the wrong tool if:

Your GPU is compute-bound, not memory-bound. On very small batch sizes with short contexts, the attention operation's HBM traffic is small enough that the GPU's Tensor Cores are the bottleneck, not the memory system. Flash Attention's tiling adds per-block overhead that can regress performance at N < 512 on high-end GPUs.
You need exact FP32 attention for research or numerical experiments. Flash Attention computes exact attention (it matches the unfused computation up to floating-point rounding), but its kernels take FP16/BF16 inputs (plus FP8 in v3), so FP32 tensors go through a different code path. For most LLM work this doesn't matter, since BF16 is the training standard, but it's worth flagging.
Your model uses an unusual attention variant. Linear attention and state-space layers (Mamba-style) do not compute a softmax over N×N scores, so Flash Attention's tiling does not apply to them. Flash Attention 2 covers standard softmax attention with optional causal masking, sliding-window attention, and ALiBi; variants that need an arbitrary additive bias or a different normalization may not be covered.
You're on a production inference stack that already uses prefix caching. Flash Attention and prefix caching both sit in the attention layer, and they compose — but only if your serving engine (vLLM / SGLang) has implemented the combined kernel. As of v0.22, vLLM does not fuse FA3 with its prefix-caching kernel. You get one or the other, not both simultaneously (though this is a known work-in-progress).

TL;DR

Flash Attention tiles the Q, K, V matrices into SRAM-sized blocks, computes softmax on each block, and merges the results using online statistics. The output is exact in FP16/BF16 up to floating-point rounding, not an approximation.
Original insight: standard attention is HBM-bandwidth-bound, not compute-bound. Reducing HBM accesses from Θ(Nd + N²) to Θ(N² d² / M) is where the speedup comes from.
v1 (NeurIPS 2022) proved the concept on A100s. v2 (2023) doubled performance with better parallelism. v3 (2024) adds FP8 and async copies, reaching 4–6× vs vanilla on H100s.
Use it through PyTorch 2.x scaled_dot_product_attention (auto-dispatch) or Hugging Face attn_implementation="flash_attention_2" for the easiest path.
Skip it for sequences under 512 tokens, FP32 research, or unusual attention variants that don't use standard softmax.

Next post: Mixture of Experts (MoE) — what practitioners need to know about routing, load balancing, and the engineering decisions behind Mixtral and DeepSeek-V3.

LoRA and QLoRA fine-tuning: what they actually do under the hood

Tech_Nuggets — Tue, 09 Jun 2026 16:52:04 +0000

You spent three weeks curating a dataset of legal contract summaries: 12,000 pairs of dense legalese and plain-English counterparts. The model you picked -- a 7B parameter instruction-tuned Llama -- understands your prompts but produces summaries that read like a junior associate who memorized Blackstone but never saw a real merger clause. You reach for full fine-tuning, the obvious move. Then torch.cuda.OutOfMemoryError hits at step 20 on your RTX 4090. You try gradient checkpointing. You try a smaller batch. You try half-precision. Still OOM. Your colleague says "just use LoRA" and walks off, as if that explains anything.

This is the gap this post fills. You do not need another high-level "LoRA is a PEFT method" post. You need the math and the trade-offs that let you decide between LoRA, QLoRA, and full fine-tuning for your specific hardware and quality requirements.

Why parameter-efficient fine-tuning exists

The cost of full fine-tuning is straightforward: a model with P parameters requires storing, at minimum, the model weights (2P bytes for fp16), the gradients (2P bytes), and the optimizer states (8P bytes for Adam, which keeps two fp32 moments per parameter). For Llama 3 8B with fp16 parameters, that is roughly 16 GB for weights plus 16 GB for gradients plus 64 GB for optimizer state: 96 GB total, before activations. An RTX 4090 has 24 GB. Even a single 80 GB A100 falls short without sharding, offloading, or a lighter optimizer.
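The accounting above is simple enough to script. The sketch below assumes the same byte counts as the paragraph (2 bytes per weight, 2 per gradient, 8 per parameter for Adam) and ignores activations; `full_finetune_state_gb` is a helper name invented for this example.

```python
def full_finetune_state_gb(params_billion, weight_bytes=2, grad_bytes=2, adam_bytes=8):
    """Weights + gradients + Adam moments in GB (1 GB = 1e9 bytes), excluding activations."""
    params = params_billion * 1e9
    return {
        "weights": params * weight_bytes / 1e9,
        "gradients": params * grad_bytes / 1e9,
        "optimizer": params * adam_bytes / 1e9,
    }

budget = full_finetune_state_gb(8)  # Llama 3 8B
total_gb = sum(budget.values())
print(budget, total_gb)
```

Swap in your own parameter count to see how far a given GPU is from fitting the training state alone.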

Parameter-efficient fine-tuning (PEFT) avoids this by keeping the vast majority of the model frozen and training only a tiny set of added parameters. The key insight is that the weight update during fine-tuning, delta W, has low intrinsic rank -- you can approximate it as a product of two much smaller matrices.

LoRA: low-rank adaptation

The LoRA paper (Hu et al., 2021, arXiv 2106.09685) proposed freezing the pretrained weight matrix W in R^(d x d) and learning a low-rank decomposition:

W' = W + BA

where B in R^(d x r), A in R^(r x d), and r << d (typically r = 8 or r = 16). Instead of updating d^2 parameters per weight matrix, you update 2dr. For d = 4096 (a common hidden dimension) and r = 8, that is 65,536 parameters per adapted matrix instead of 16,777,216 -- a 256x reduction.

During the forward pass, the computation becomes:

h = xW' = xW + xBA

The first term uses frozen weights, so no weight gradient is computed or stored for W. The second term is the adapter path. Only A and B receive gradient updates. The original W stays intact, which gives you two ways to serve: merge the adapter into W ahead of time (W' = W + BA) for zero added latency, or keep W shared and compute h = xW + xBA on the fly, which costs two small extra matrix multiplies but lets you swap adapters per request.
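To make the two serving paths concrete, here is a minimal pure-Python sketch that checks the on-the-fly adapter path against the merged weights. It follows the row-vector convention used above (h = xW + xBA). The helper names are invented for this example, `scale` stands in for the alpha / r factor, and random values are used for both factors so the check is not trivial (real LoRA initializes one factor to zero so training starts from W).

```python
import random

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, W, B, A, scale=1.0):
    """h = xW + scale * (xB)A; the d x d product BA is never materialized."""
    base = matmul(x, W)
    low_rank = matmul(matmul(x, B), A)
    return [[p + scale * q for p, q in zip(r1, r2)] for r1, r2 in zip(base, low_rank)]

def merge(W, B, A, scale=1.0):
    """W' = W + scale * BA, computed once ahead of serving."""
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(r1, r2)] for r1, r2 in zip(W, BA)]

def adapter_params(d, r):
    """Trainable parameters for one adapted d x d matrix."""
    return 2 * d * r

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d, r = 6, 2
W, B, A, x = rand_matrix(d, d), rand_matrix(d, r), rand_matrix(r, d), rand_matrix(1, d)

h_adapter = lora_forward(x, W, B, A)
h_merged = matmul(x, merge(W, B, A))
max_diff = max(abs(p - q) for p, q in zip(h_adapter[0], h_merged[0]))
print(max_diff, adapter_params(4096, 8))
```

Both paths produce the same output up to rounding; the difference is purely where you pay: extra matmuls per request, or one merged weight copy per adapter.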

Here is what the architecture looks like for a single Transformer attention layer:

flowchart LR
    subgraph Forward pass
        X[Input x] --> W[W frozen<br/>d x d]
        X --> B_adapt[B d x r]
        B_adapt --> A_adapt[A r x d]
        W --> ADD[Add]
        A_adapt --> ADD
        ADD --> OUT[Output h]
    end

    subgraph Gradient flow
        OUT --> GRAD_B[Gradients flow<br/>to B and A only]
        GRAD_B --> NO[No gradient<br/>through W]
    end

By default, LoRA is applied to the query and value projection matrices in each attention layer. You can also extend it to the key and output projections and to the feed-forward layers. Empirically, setting r = 8 on Q and V covers most of the benefit; raising r beyond 16 rarely improves results by more than a trivial margin.

QLoRA: adding 4-bit quantization

QLoRA (Dettmers et al., 2023, arXiv 2305.14314) asked: what if instead of storing W in fp16, we stored it in 4 bits and still trained adapters on top? The result is a method that can fine-tune a 65B model on a single 48 GB GPU -- something that was previously impossible.

QLoRA makes three specific contributions that work together:

NF4 data type. NormalFloat4 is a quantization scheme designed for normally distributed weights. It maps the 4-bit values to the quantiles of a normal distribution, so the discretization error is minimized exactly where most weight values fall. Informally, NF4 allocates more of its 16 representable values around zero and fewer in the tails.

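Double quantization. The per-block quantization constants themselves take space: one fp32 scale for every 64 weights costs 32/64 = 0.5 bits per parameter. QLoRA quantizes these constants to 8-bit floats in groups of 256, cutting that overhead to about 0.127 bits per parameter (a saving of roughly 0.37 bits). The total is ~4.13 bits per parameter for the base model -- a little over 3.5 GB for a 7B model instead of 14 GB in fp16.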

Paged optimizers. When GPU memory runs out during a long training run, the optimizer states are paged to CPU RAM and fetched back as needed. This prevents the OOM crash but can slow training; it is a safety net, not a performance feature.

During training, QLoRA dequantizes the 4-bit weights to bf16 on the fly whenever a layer is used, adds the LoRA adapter contribution, and backpropagates through the frozen layers to reach the adapters. Weight gradients and optimizer state exist only for A and B, exactly as in LoRA; the extra saving over LoRA comes from keeping the base weights in 4 bits at rest.
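The quantile idea behind NF4 can be illustrated with the standard library alone. The sketch below is a simplified stand-in, not the actual NF4 codebook shipped in bitsandbytes: it places 16 levels at evenly spaced normal quantiles, rescales them to [-1, 1], and applies blockwise absmax scaling. All function names are invented for this example.

```python
import random
from statistics import NormalDist

def quantile_levels(bits=4):
    """2**bits levels at evenly spaced normal quantiles, rescaled to [-1, 1]."""
    n = 2 ** bits
    nd = NormalDist()
    raw = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    top = max(abs(v) for v in raw)
    return [v / top for v in raw]

def quantize_block(block, levels):
    """Return (absmax scale, list of level indices) for one block of weights."""
    scale = max(abs(w) for w in block) or 1.0
    codes = [min(range(len(levels)), key=lambda i: abs(levels[i] - w / scale)) for w in block]
    return scale, codes

def dequantize_block(scale, codes, levels):
    """Map level indices back to approximate weights."""
    return [levels[c] * scale for c in codes]

random.seed(0)
levels = quantile_levels()
weights = [random.gauss(0.0, 0.02) for _ in range(64)]  # one 64-weight block
scale, codes = quantize_block(weights, levels)
restored = dequantize_block(scale, codes, levels)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(levels)
print(scale, max_err)
```

Printing `levels` shows the spacing is tightest near zero and widest in the tails, which is the property the NF4 description relies on.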

Full comparison

Dimension	Full fine-tuning	LoRA (fp16)	QLoRA (4-bit base + LoRA)
Base model memory	16 GB (7B, fp16)	16 GB (frozen)	~3.5 GB (NF4)
Adapter memory	0	2 GB (r=8, all layers)	2 GB
Optimizer state	~32 GB (Adam)	~4 GB (only adapters)	~4 GB
Total VRAM needed	~56 GB	~22 GB	~9.5 GB
Quality vs. full fine-tuning	Baseline	On par or within 0.5%	Within 1-2% on most benchmarks
Multi-task support	One copy per task	One base + N adapters	One base + N adapters
Training speed (7B, A100)	1.0x (baseline)	~1.4x	~0.8x (dequant overhead)

The speed trade-off is worth calling out explicitly: QLoRA trains slower than LoRA because every forward pass must dequantize the base weights. On a 7B model with a single A100, LoRA is roughly 1.4x faster than full fine-tuning (less data movement), while QLoRA is about 0.8x the speed of full fine-tuning (dequantization overhead). The memory savings are enormous though, which is why QLoRA dominates the conversation for consumer-grade GPUs.

Common pitfalls

Rank selection is not magic. Setting r = 256 everywhere will not automatically improve results. Higher rank means more trainable parameters but also more noise in the gradient signal. The original LoRA paper found that a rank of 1 already captures meaningful adaptation for many tasks. Start with r = 8 on Q and V, evaluate, and only increase rank on layers that underfit.

Adapter merge at scale. You can merge LoRA weights into W at inference time by computing W' = W + BA for each layer and discarding A and B. This eliminates the adapter inference overhead. But if you have 50 adapters for 50 different clients, you now need 50 copies of the full weights -- trading compute for storage. The right design depends on which resource you have more of.

QLoRA is not free. NF4 quantization is lossy: the dequantized weights only approximate the originals. On most tasks the quality loss is within the noise floor (1-2% on MMLU, roughly 0.5% on domain-specific benchmarks). But if you are tuning a model for a precision-critical task such as medical diagnosis or code correctness verification, the trade-off may swing back to full-precision LoRA or full fine-tuning.

Bitsandbytes versions matter. QLoRA depends on the bitsandbytes library for its CUDA quantization kernels. As of June 2026, bitsandbytes is at v0.49.2 and PEFT is at v0.19.1. The API changed between v0.43 and v0.44 -- if you are using an older PEFT, pin to a compatible bitsandbytes version. A version mismatch silently falls back to CPU quantization, which runs orders of magnitude slower.

Scaling the LoRA alpha. The adapter update is multiplied by alpha / r, so the effective weights are W + (alpha / r) BA. A common mistake is setting alpha too low (adapter contribution vanishes) or too high (training destabilizes). The paper fixes alpha to the first r it tries and does not tune it; alpha = 2r is a widely used community starting point. Double-check this if your loss curve looks flat after 200 steps.

When NOT to use it

LoRA and QLoRA are the wrong choice when:

You need to change the model's internal representations fundamentally. If you are adding new knowledge that the base model does not have (a new language, a new domain with very different token statistics), low-rank updates may not have enough capacity. Continued pretraining or full fine-tuning will capture the distribution shift more effectively.

Inference latency is your binding constraint and you serve from CPU. LoRA merges into the weights easily on GPU, but on CPU with on-the-fly adapter computation, the extra matrix multiply for BA adds latency. You can merge ahead of time, but then every adapter becomes a separate weight file.

You are fine-tuning a model smaller than 1B parameters. The memory savings of PEFT are less dramatic on small models. A 350M-parameter model consumes roughly 0.7 GB in fp16 (1.4 GB in fp32) -- the adapter overhead of LoRA starts to be a significant fraction of total parameters. A simple full fine-tuning pass may fit with gradient checkpointing and a reasonable batch size.

You need deterministic training across hardware. The quantization paths in QLoRA introduce non-determinism from the dequantization kernel. If you need perfectly reproducible training runs (for auditing or compliance), stick with full-precision LoRA or full fine-tuning with a fixed seed and deterministic CUDA backend.

TL;DR

LoRA approximates the fine-tuning weight update as a product of two low-rank matrices (B in d x r, A in r x d), reducing trainable parameters by 100x-1000x per layer with minimal quality loss.
QLoRA quantizes the frozen base model to 4-bit NF4, then trains LoRA adapters on top. A 65B model fits on a single 48 GB GPU.
The practical memory equation for a 7B model: full fine-tuning ~56 GB, LoRA ~22 GB, QLoRA ~9.5 GB.
Start with r = 8 on Q and V projection layers. Increase rank only if you see clear underfitting on your validation set.
QLoRA trains slower than LoRA (dequantization overhead) but uses roughly half the memory. Pick based on whether you are memory-bound or time-bound.
Keep bitsandbytes and PEFT versions in sync. A version mismatch causes silent CPU fallback and catastrophic slowdown.
Do not use LoRA/QLoRA for small models (under 1B), for injecting fundamentally new knowledge, or for CPU-latency-sensitive serving where merge-ahead is impractical.

We covered how to adapt an existing model efficiently. The next step is knowing when that adaptation has actually worked -- and that means evaluation. Next post: building a reliable evaluation pipeline that catches regressions before they ship, with or without a labeled test set.

If you are deciding between LoRA and QLoRA for a project right now, the key variable is your GPU budget. 24 GB or less? QLoRA. 48 GB or more? LoRA with a larger rank or full fine-tuning with LoRA on the side for rapid iteration. The code to make either choice work is a single pip install away.

Prefix caching at scale: when it saves you 80% of prefill cost, and the eviction policies that quietly turn it into 5%

Tech_Nuggets — Sun, 07 Jun 2026 01:09:57 +0000

Your chatbot deploys 70B Llama on 8x H100s. Steady-state TTFT sits around 180 ms for short prompts, and the team is fine with that. Then you turn on a RAG feature: every request sends a 6,000-token context stuffed with retrieved documents, plus a short system prompt, plus the user's question. TTFT jumps to 1.4 seconds. p99 hits 2.1 s. A surprising share of those tokens are the same on every request — the system prompt, the same 6k retrieved chunks for the top queries, the tool definitions. The model is recomputing the same attention state over and over, then throwing it away. This is the problem prefix caching solves, and last week's post on KV cache quantization closed with it as the next topic — because the two features compose: a quantized prefix cache is cheaper to keep warm than a BF16 one, and the saved memory buys you either more concurrent users or a longer shared prefix.

Here's what prefix caching actually is, how vLLM and SGLang implement it differently, and where production deployments quietly lose most of the benefit.

Why this matters in practice

A modern LLM serving stack has two phases per request: prefill (process the entire prompt to build the KV cache) and decode (generate one token at a time, attending against the growing cache). For long-context workloads, prefill dominates. On a 70B Llama-3 with 8k of input, prefill accounts for roughly 70–85% of TTFT; a single decode step is fast in comparison.

Most "long input" workloads are not actually long and unique on every request. They're long and repetitive:

RAG pipelines. The same retrieved chunks hit the same top queries. The system prompt and tool schema are byte-for-byte identical across every request. The user question is the only variable part, and it's tiny.
Multi-turn chat. Each turn is a strict prefix extension of the previous one. Round 2 shares everything except the latest assistant message and the new user turn.
Agent loops. The same tool schema, planning prompt, and few-shot examples get prepended every step. Only the latest tool result differs.
Long-document QA. Users repeatedly ask questions about the same 200-page PDF. The document is the prefix; the question is the suffix.

Prefix caching is the optimization that says: if the first N tokens of this request match a request I already processed, hand me back the KV cache for those N tokens instead of recomputing them. In the textbook case, the model output is bit-identical to a no-cache run, but prefill drops to a fraction of the cost. The reported "80% prefill saved" numbers come from RAG with 90%+ prefix overlap. The 5% numbers come from workloads where the prefix rarely matches, or the cache is constantly evicted before reuse.

What "prefix caching" actually is

The high-level idea is simple. The implementation has three decisions that drive the rest of the system: what unit do you hash on, how do you look it up, and what do you do when the cache is full.

flowchart LR
    A[New request<br/>tokens 0..N-1] --> B[Tokenize &<br/>split into blocks]
    B --> C[Hash each block<br/>tokens + parent hash]
    C --> D{Lookup in<br/>block table}
    D -- hit --> E[Reuse KV blocks<br/>skip prefill]
    D -- miss --> F[Compute KV<br/>for that block]
    F --> G[Insert block<br/>into table]
    E --> H[Continue with<br/>remaining prefill]
    G --> H
    H --> I[Decode normally<br/>+ append new blocks]

Three things matter. First, prefix caching is prefix-only: you can only skip the leading tokens, never a middle substring. If two requests share tokens 1000–2000 but differ on 0–999, you reuse nothing. Second, the cache is block-grained, not token-grained. A request has to match a whole block (default 16 tokens) to get a hit on it. A request that diverges at token 14,003 of a 14,016-token shared prefix reuses the first 875 full blocks (14,000 tokens) and recomputes from token 14,000 onward, so block granularity costs at most one partial block per request. Third, prefix caching does not change decoding: every saved token is a saved prefill token.
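The prefix-only and block-grained rules combine into a one-line calculation. The sketch below is an illustration with a made-up helper name, not code from any serving engine: it counts how many leading tokens of a request are covered by whole matching blocks.

```python
def reusable_tokens(cached, request, block_size=16):
    """Tokens of `request` covered by whole blocks that match `cached` from position 0."""
    common = 0
    for a, b in zip(cached, request):
        if a != b:
            break
        common += 1
    # Only complete blocks count; the partial block at the divergence point is recomputed.
    return (common // block_size) * block_size

cached = list(range(40))
late_divergence = list(range(35)) + [999] * 5      # differs from token 35 onward
early_divergence = [999] + list(range(1, 40))      # differs at token 0, identical after

print(reusable_tokens(cached, late_divergence))    # two full blocks survive
print(reusable_tokens(cached, early_divergence))   # nothing survives
```

The second case is the prefix-only rule in action: 39 of 40 tokens are identical, yet nothing is reusable because the mismatch sits at the front.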

How vLLM does it: hash-based blocks

vLLM's Automatic Prefix Caching (APC) is block-based and content-addressed. Each KV-cache block (default 16 tokens) is keyed by a hash of three things: the parent block's hash, the tokens in the block, and a small set of "extra hashes" for LoRA adapter IDs, multimodal input hashes, and per-tenant cache salts.
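A rough sketch of what "keyed by parent hash + tokens + extras" means in practice. This is not vLLM's implementation: the JSON serialization, the empty-string root parent, and the `block_hashes` name are assumptions made for illustration. It only shows why chaining the parent hash makes each key identify the whole prefix up to that block.

```python
import hashlib
import json

def block_hashes(tokens, block_size=16, extra=()):
    """Hash each full block as sha256(parent hash, block tokens, extra keys)."""
    hashes, parent = [], ""
    full = len(tokens) - len(tokens) % block_size   # ignore the trailing partial block
    for start in range(0, full, block_size):
        payload = json.dumps([parent, tokens[start:start + block_size], list(extra)])
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes

base = list(range(48))
same_first_two_blocks = list(range(32)) + list(range(100, 116))
differs_at_token_zero = [7] + list(range(1, 48))

h_base = block_hashes(base)
h_tail = block_hashes(same_first_two_blocks)
h_head = block_hashes(differs_at_token_zero)
h_lora = block_hashes(base, extra=("adapter-a",))   # hypothetical adapter ID

print(h_base[:2] == h_tail[:2], h_base[2] == h_tail[2])
print(h_base[1] == h_head[1])
```

Block 1 of `differs_at_token_zero` holds the same tokens as block 1 of `base`, but its hash differs because the parent differs. That is how a content-addressed block store enforces the prefix-only rule without any extra bookkeeping.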

The block-size choice is the lever most teams miss. A small block (4–8 tokens) gives finer reuse — a divergence only kills the divergent block. A large block (32–64 tokens) cuts hash-table overhead and improves batching, but wastes more work on partial-prefix misses. The 16-token default is a reasonable middle for chat; for RAG with 4k–8k chunks, 16 or 32 is common.

The hash function got a security upgrade in v0.11 (April 2026). Before that, the default used Python's hash() of the serialized block — a salted SipHash, randomized per process, fine for collision avoidance but non-reproducible across restarts. As of v0.22.1, the default is sha256, with a new --prefix-caching-hash-algo CLI flag:

Algorithm	Hash	Serialization	Reproducible	Notes
`sha256`	SHA-256	`pickle`	No	Default. Secure, but pickle is Python-version-sensitive.
`sha256_cbor`	SHA-256	`cbor2`	Yes	Recommended for multi-process or multi-language tiers.
`xxhash`	xxHash 128-bit	`pickle`	No	Faster, non-cryptographic. Multi-tenant risk must be assessed.
`xxhash_cbor`	xxHash 128-bit	`cbor2`	Yes	Fastest with reproducibility. Same caveat.

The multi-tenant caveat is the one to take seriously. If you serve multiple customers out of one engine and your hash function is non-cryptographic, a deliberate collision in a crafted prompt can evict another tenant's cache, or — in pathological cases — substitute their KV blocks with attacker-controlled values. If you don't control the prompts, stay on sha256 or sha256_cbor.

A typical vLLM deploy turns APC on at serve time:

vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 8 \
  --enable-prefix-caching \
  --prefix-caching-hash-algo sha256_cbor \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92

APC is enabled per server, not per request, which is the right granularity because the cache is a shared resource.

How SGLang does it: a radix tree

SGLang keeps a radix tree of cached prefixes. Each node represents a shared prefix across one or more requests; each leaf is a request-specific tail. The engine traverses the tree per request, reuses the longest matching prefix, and forks new branches where requests diverge.

The practical differences that matter in production:

Match granularity is one token, not one block. SGLang reuses down to a single divergent token, recovering more of the cache than vLLM's block-level scheme on chatty workloads with mid-prompt variations (an inserted tool result). The trade is per-token tree-walk overhead per request.
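Eviction is LRU on tree nodes, not fixed-size blocks. When memory pressure forces a prune, SGLang evicts the least recently used leaf nodes first, so a shared ancestor is freed only after the request-specific tails below it are gone. The eviction unit is a variable-length token span rather than a 16-token block, which makes it coarser than vLLM's per-block LRU.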
Multi-LoRA / multimodal. SGLang stores per-request metadata at the leaves, so different LoRA adapters and image inputs sit naturally on different branches. vLLM achieves the same via the "extra hashes" component.

For most RAG and chat workloads, the two implementations deliver comparable hit rates. SGLang tends to win on many short shared prefixes (per-token matching helps); vLLM tends to win on very long shared prefixes (block-hash lookups are O(1) with a tiny constant).
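Token-level matching is easy to picture with a plain trie. The sketch below is deliberately uncompressed (a real radix tree merges single-child chains into one edge, and SGLang's nodes also carry KV references and access times); the class names are invented for this example.

```python
class PrefixNode:
    def __init__(self):
        self.children = {}

class TokenTrie:
    """Uncompressed token trie; enough to show longest-prefix matching."""

    def __init__(self):
        self.root = PrefixNode()

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixNode())

    def longest_prefix(self, tokens):
        """Number of leading tokens already present in the tree."""
        node, matched = self.root, 0
        for t in tokens:
            nxt = node.children.get(t)
            if nxt is None:
                break
            node, matched = nxt, matched + 1
        return matched

trie = TokenTrie()
trie.insert(list(range(40)))

request = list(range(35)) + [999] * 5
matched = trie.longest_prefix(request)
block_matched = (matched // 16) * 16   # what a 16-token block scheme would reuse
print(matched, block_matched)
```

For a request that diverges at token 35, token-level matching recovers 35 tokens where a 16-token block scheme recovers 32. The gap is bounded by one block per request, which is why the two approaches end up close on long shared prefixes.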

What you actually get at the metric level

Workload	Median prefill saved	TTFT reduction	Caveat
RAG with 6k static context	88–94%	70–85%	Hit rate near 1.0 if the retrieved set is stable
Multi-turn chat, 8 turns	60–80% (avg)	30–55%	First turn is a miss; later turns reuse aggressively
Long-doc QA on a single PDF	92–97% after first query	75–90%	First query is a miss, all subsequent reuse
Open-ended Q&A (no shared prefix)	0–5%	0–5%	Don't bother enabling it
Tool-using agent loop	40–70% per step	20–45%	Tool result insertion breaks prefix mid-prompt

Hit rate — the fraction of blocks already in the cache when a request arrived — is the single most useful number to instrument. If you turn on APC and your hit rate is below 30%, something is wrong: prefixes don't match, or the cache is being evicted before reuse.

Common pitfalls

Eviction is a silent killer. vLLM evicts blocks under GPU memory pressure with LRU. A mix of long-prefix and short-prefix traffic often evicts long-prefix blocks first (they take more slots), and they're the only ones whose loss actually hurts. Raise --gpu-memory-utilization from 0.85 to 0.92 and the working set of cached prefixes typically doubles. Monitor cache hit rate after 60 seconds of warmup — a rate that decays over the day is an eviction problem, not a workload problem.
LoRA and multimodal mix badly if you forget the salt. vLLM's block hash includes LoRA IDs and image hashes; swap adapters at request time and you get cache thrash. Same for image inputs that vary per request — caching the multimodal prefix is essentially useless.
Prefix caching does not save decode. A common dashboard mistake is to credit the entire speedup to APC. Decode time is unchanged. If your workload is decode-bound, APC helps very little.
Hash algorithm migrations are not transparent. Changing --prefix-caching-hash-algo between deploys makes the new engine see zero hits until it warms back up. One-time cost, but a real incident if unexpected. Bake the algo into your Helm chart.
Cross-replica cache sharing is hard. vLLM's APC lives in GPU memory; each replica has its own cache. A request landing on a cold replica pays full prefill. Cache-aware routing (for example, SGLang's router) can steer prefix-matched requests to warm replicas, and KV-transfer setups (vLLM's KV connector interface) can move cached blocks between instances, but both need explicit config.
The "first request after restart" problem. A rolling deploy invalidates the entire cache. The first 30–60 seconds after each deploy are prefill-bound. Schedule rolling deploys during low-traffic windows, or pre-warm with a synthetic-traffic sidecar.
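The eviction pitfall is easy to reproduce in a toy simulation. The sketch below models a plain LRU block cache and replays traffic where every request shares an 8-block prefix and adds 4 unique blocks. It is an illustration only: real engines track reference counts and evict only unreferenced blocks, and every name here is invented for the example.

```python
from collections import OrderedDict

class LRUBlockCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.hits = 0
        self.lookups = 0

    def access(self, block_hash):
        """Look up a block; insert it on a miss, evicting the least recently used."""
        self.lookups += 1
        if block_hash in self.blocks:
            self.blocks.move_to_end(block_hash)
            self.hits += 1
            return True
        self.blocks[block_hash] = True
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)
        return False

    @property
    def hit_rate(self):
        return self.hits / self.lookups if self.lookups else 0.0

def replay(capacity, rounds=50):
    cache = LRUBlockCache(capacity)
    for i in range(rounds):
        for b in range(8):                  # shared prefix, identical on every request
            cache.access(("shared", b))
        for b in range(4):                  # request-specific tail
            cache.access(("unique", i, b))
    return cache.hit_rate

print(replay(capacity=16))  # shared prefix stays resident
print(replay(capacity=8))   # unique tails push the prefix out before it is reused
```

With room for the 12-block working set, the shared prefix hits on every request after the first. Shrink the cache below that and the hit rate collapses to zero, even though the traffic is identical. That cliff is what a decaying hit-rate graph usually points to.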

When NOT to use it

Prefix caching is the wrong choice (or a wasted flag) if:

Your prompts have no shared structure. Open-ended completion APIs, code-gen on a fresh repo per request, single-turn Q&A with no system prompt — there's nothing to reuse. Hit rate near zero, and you're paying hash-table overhead for nothing.
You're under a strict determinism SLO that includes cache state. A cache hit and a cache miss are mathematically equivalent for the same model and prompt, but the hit path runs prefill over a different token range, and the resulting floating-point rounding differences can occasionally flip a token, most often deep into a long generation. If you need bit-exact reproducibility across requests, disable APC and accept the prefill cost.
You can't budget enough GPU memory for the working set. A cache that misses more than it hits is worse than no cache: you spent memory on entries that never get reused, pushing decode batch sizes down. Measure first, enable second.
Your traffic is dominated by mid-prompt insertions. Agent loops, multi-modal chat with per-turn image insertion, RAG with dynamic chunk re-ordering — these frequently insert new tokens mid-prompt, breaking the prefix. SGLang's per-token matching recovers more here, but workloads that are 50%+ mid-prompt insertions still see sub-30% hit rates in either engine.
You're already prefill-bound on a single giant request. A 100k-token analysis pass per request, one request at a time, will hit a 100% miss on the first call and a 100% hit on the second if it ever comes. The amortized win depends entirely on whether those requests repeat, and most one-shot analytics workloads don't repeat.

TL;DR

Prefix caching reuses the KV cache for the leading tokens of a request when a previous request already computed them. It only affects prefill; decode is unchanged.
vLLM's Automatic Prefix Caching (APC) is a content-addressed block store. Each block is hashed by parent hash + block tokens + LoRA/multimodal/salt extras. Default block size is 16 tokens. Default hash since v0.22.1 is SHA-256, with sha256_cbor, xxhash, and xxhash_cbor available via --prefix-caching-hash-algo.
SGLang uses a radix tree of token-level prefixes, which gives finer-grained matching at the cost of per-request tree-walk overhead.
The win is real but workload-shaped. RAG with a stable retrieved set: 88–94% prefill saved. Multi-turn chat: 60–80% averaged. Open-ended Q&A: 0–5%. Measure your hit rate before you trust the marketing numbers.
Eviction is the silent killer. Long-prefix blocks get evicted first under memory pressure. Size the cache budget explicitly and monitor hit rate over the day, not just at startup.
Don't enable it on open-ended workloads, on a multi-tenant engine with a non-cryptographic hash, or when you can't afford the working-set memory. Measure first.

Next post: structured output at the decoding layer — JSON mode vs grammar-constrained decoding vs function calling, where the three diverge in latency and reliability, and the failure modes that show up only in production.