The frustrating problem
Last month a teammate pinged me with a classic head-scratcher. He'd taken a base model with multi-token prediction (MTP) heads and run it through a standard quantization pipeline to ship a smaller GGUF for edge inference, and the latency numbers came back worse than expected. The model still generated coherent text, but the speculative decoding speedup he'd built his benchmarks around was gone.
We poked around for an hour before the penny dropped. The MTP heads had silently been dropped on the floor during conversion. The base weights survived. The extra prediction heads — the whole reason MTP exists — did not.
If you've worked with models that ship MTP layers (the technique popularized by DeepSeek-V3, where extra modules predict additional future tokens that can double as draft tokens), you may have already run into this. The conversion toolchain treats anything that isn't a vanilla transformer block as dead weight and trims it. Here's why it happens and how to stop it.
What MTP heads actually are
Quick refresher so we're on the same page. MTP (multi-token prediction) adds auxiliary heads on top of the base model that each predict a future token at offset +1, +2, +3, etc. At inference time you can use them as a built-in draft model for speculative decoding, which gives you a real throughput win without needing a separate small model.
The key thing: these heads are architecturally distinct from the regular lm_head. They live in their own module tree, often named something like model.mtp.layers.0, model.mtp.layers.1 and so on. They reference shared embeddings but have their own normalization, attention, and projection weights.
That naming convention is exactly what trips up the tooling.
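To make the naming convention concrete, here is a minimal sketch that groups tensor names by MTP head index. It assumes the `model.mtp.layers.N.` prefix described above; the `group_by_head` helper is hypothetical, and your model may use a different prefix entirely.

```python
import re
from collections import defaultdict

# Assumed naming pattern: model.mtp.layers.<index>.<submodule path>
MTP_NAME = re.compile(r"^model\.mtp\.layers\.(\d+)\.(.+)$")


def group_by_head(names):
    """Map MTP head index -> list of submodule paths found under that head."""
    heads = defaultdict(list)
    for name in names:
        match = MTP_NAME.match(name)
        if match:
            heads[int(match.group(1))].append(match.group(2))
    return dict(heads)
```

Any name that does not match the pattern is ignored, so an empty result on a model you know has MTP heads is a hint that the prefix is different.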
Root cause: conversion scripts have an opinionated allowlist
Most quantization toolchains weren't designed with MTP in mind. They walk the state dict and apply transformations based on regex matches against expected layer names. Anything that doesn't match is either:
- Silently dropped (worst case)
- Left in fp16/fp32 in the output (works but bloats the file)
- Renamed in a way the loader can't recover (subtle breakage)
When I dug into the llama.cpp conversion script for the project, the relevant logic was essentially this pattern:
# Simplified version of what most converters do
KNOWN_PREFIXES = ("model.layers.", "model.embed_tokens.", "model.norm.", "lm_head.")

for name, tensor in state_dict.items():
    if not name.startswith(KNOWN_PREFIXES):
        # MTP heads land here and get skipped
        logger.debug(f"skipping unknown tensor: {name}")
        continue
    write_quantized(name, tensor)
The logger.debug is the killer. Unless you run conversion with debug logging on, you never see the skip messages. The file converts "successfully" and you walk away thinking everything's fine.
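One way to make that failure loud is to collect the skipped names and refuse to continue. This is only a sketch of the idea, not how any particular converter works: `partition_tensor_names` and its `strict` flag are hypothetical names, and the prefix list is the simplified one from above.

```python
import logging

logger = logging.getLogger("convert")

# Same simplified prefix list as the converter loop above.
KNOWN_PREFIXES = ("model.layers.", "model.embed_tokens.", "model.norm.", "lm_head.")


def partition_tensor_names(names, known_prefixes=KNOWN_PREFIXES, strict=True):
    """Split names into (kept, skipped); in strict mode, raise if anything is skipped."""
    kept, skipped = [], []
    for name in names:
        if name.startswith(tuple(known_prefixes)):
            kept.append(name)
        else:
            # WARNING is visible with Python's default logging setup; DEBUG is not.
            logger.warning("skipping unknown tensor: %s", name)
            skipped.append(name)
    if strict and skipped:
        raise ValueError(
            f"{len(skipped)} tensor(s) would be dropped, e.g. {skipped[0]}"
        )
    return kept, skipped
```

With `strict=True` the conversion stops at the first model whose tensors fall outside the allowlist, which is the behavior you want until you have decided what to do with them.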
GPTQ-style quantizers have a related but different failure mode. They calibrate against forward passes through the model, and if your calibration code only exercises the main lm_head path, the MTP heads never see calibration data. Even if the weights are preserved, the heads get quantized without any meaningful activation statistics, and their draft predictions can degrade to the point of being useless.
Step-by-step solution
Here's the workflow I now use whenever I touch a model with MTP heads.
Step 1: Inventory the heads before you touch anything
Before any conversion, dump the full state dict and grep for MTP-related modules. This sets your baseline.
from safetensors import safe_open

mtp_tensors = []
with safe_open("model.safetensors", framework="pt") as f:
    for key in f.keys():
        # Adjust prefix to whatever your model uses
        if "mtp" in key.lower() or "multi_token" in key.lower():
            mtp_tensors.append((key, f.get_slice(key).get_shape()))

for name, shape in mtp_tensors:
    print(f"{name}: {shape}")
print(f"\nTotal MTP tensors: {len(mtp_tensors)}")
Save this output. You'll diff against it after every conversion step.
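The diff itself can be a few lines. The sketch below assumes you have turned both inventories into plain dicts of tensor name to shape; `diff_inventory` and the `rename` mapping are hypothetical, and the rename exists only because converters often change prefixes along the way.

```python
def diff_inventory(before, after, rename=None):
    """Return (missing, reshaped) lists of names from `before`.

    before/after: dict mapping tensor name -> shape (any sequence of ints).
    rename: dict mapping a source prefix to the prefix expected after conversion.
    """
    rename = rename or {}
    missing, reshaped = [], []
    for name, shape in before.items():
        target = name
        for old_prefix, new_prefix in rename.items():
            if name.startswith(old_prefix):
                target = new_prefix + name[len(old_prefix):]
                break
        if target not in after:
            missing.append(name)
        elif list(after[target]) != list(shape):
            reshaped.append(name)
    return missing, reshaped
```

Treat the `reshaped` list as a hint rather than a verdict: formats do not always agree on dimension order, so a shape mismatch may just need normalizing before it means anything. A non-empty `missing` list is the signal that matters.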
Step 2: Patch the converter's allowlist
For llama.cpp style converters, you need to extend the known prefix list and add a mapping rule for the MTP heads. The clean way is to subclass or monkey-patch rather than editing the upstream script directly:
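# Assumes llama.cpp's convert_hf_to_gguf.py is on your import path.
# The base class name differs between llama.cpp versions; adjust the import to match yours.
from convert_hf_to_gguf import Model

MTP_PREFIX = "model.mtp."


class MTPAwareModel(Model):
    def map_tensor_name(self, name: str, *args, **kwargs) -> str:
        # Handle MTP heads explicitly before falling through
        if name.startswith(MTP_PREFIX):
            # Preserve the layer index and submodule path
            # Output name needs to match what your loader expects
            return name.replace(MTP_PREFIX, "mtp.", 1)
        return super().map_tensor_name(name, *args, **kwargs)

    def modify_tensors(self, data, name, bid):
        # Skip the parent class's filter for MTP layers.
        # Use the same prefix test as map_tensor_name so the two stay in sync.
        if name.startswith(MTP_PREFIX):
            return [(self.map_tensor_name(name), data)]
        return super().modify_tensors(data, name, bid)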
The critical bit is overriding modify_tensors — the default implementation has the silent skip we saw earlier.
Step 3: For GPTQ, calibrate through the MTP path
If you're using GPTQ-style quantization, your calibration loop needs to actually hit the MTP heads. The default model(input_ids) forward pass only routes through the main LM head. You need to force the MTP heads to see activations:
def calibration_forward(model, batch):
    # Standard forward populates main path activations
    outputs = model(**batch, output_hidden_states=True)

    # Manually invoke MTP heads using the final hidden state
    # This ensures each head gets calibration statistics
    hidden = outputs.hidden_states[-1]
    for i, head in enumerate(model.mtp.layers):
        # Shift input so head i predicts token at position +i+1
        shifted = hidden[:, :-(i + 1), :]
        _ = head(shifted)

    return outputs
Without this, your MTP heads quantize to garbage even though the file looks complete.
Step 4: Verify post-conversion
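Re-run the inventory against the converted file and confirm the MTP tensor count matches your baseline. The script from Step 1 reads safetensors, so it works as-is only when the output is still safetensors (for example, after GPTQ). If you went GGUF, dump the metadata instead: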
# llama.cpp ships a metadata inspection tool
./gguf-dump model-quantized.gguf | grep -i mtp
Then run a quick speculative decoding sanity check. If the MTP heads are intact and properly calibrated, the speedup you get from speculative decoding on the quantized model should match, or come very close to, the speedup ratio you measured on the unquantized baseline.
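If you want that check to be a pass/fail rather than an eyeball comparison, something like the following works. It is a sketch under my own assumptions: the function names are made up, each measurement is a `(tokens_per_second_with_speculation, tokens_per_second_without)` pair you collect yourself, and the 10% tolerance is an arbitrary starting point.

```python
def speedup_ratio(tps_speculative, tps_plain):
    """Throughput with speculative decoding divided by throughput without it."""
    if tps_plain <= 0:
        raise ValueError("tps_plain must be positive")
    return tps_speculative / tps_plain


def speedup_retained(baseline, quantized, tolerance=0.10):
    """True if the quantized model keeps at least (1 - tolerance) of the baseline speedup.

    baseline and quantized are (tps_speculative, tps_plain) pairs measured
    on the unquantized and quantized model respectively.
    """
    base = speedup_ratio(*baseline)
    quant = speedup_ratio(*quantized)
    return quant >= base * (1.0 - tolerance)
```

Comparing ratios rather than raw tokens per second matters here, because quantization changes the plain decoding speed too. A quantized model whose ratio collapses toward 1.0 is the signature of heads that are missing or no longer producing acceptable drafts.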
Prevention tips
A few habits that have saved me repeated pain:
- Always run converters with debug logging enabled. The skip messages are the single most useful signal you'll get, and they're hidden by default.
- Tensor-count diff as part of CI. If your pipeline converts models automatically, fail the build when the output has fewer tensors than the input minus a known allowlist of intentionally-dropped weights.
- Test speculative decoding throughput, not just generation quality. A model can produce fluent text with broken MTP heads — your end-to-end latency benchmark is the only thing that will catch the regression.
- Pin your converter version. Upstream conversion scripts change their tensor-name handling more often than you'd think. A model that converted cleanly six months ago might silently break today.
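For the CI tip, the gate can be a small function that returns an exit code. This sketch assumes both name lists already use the same naming scheme (map converter renames first), and the names `unexpected_drops`, `ci_gate`, and `allowed_drop_patterns` are all hypothetical.

```python
import re
import sys


def unexpected_drops(input_names, output_names, allowed_drop_patterns=()):
    """Names in the input that are absent from the output and not allowlisted."""
    allowed = [re.compile(pattern) for pattern in allowed_drop_patterns]
    output = set(output_names)
    return sorted(
        name
        for name in input_names
        if name not in output and not any(rx.search(name) for rx in allowed)
    )


def ci_gate(input_names, output_names, allowed_drop_patterns=()):
    """Return 0 when nothing unexpected was dropped, 1 otherwise."""
    dropped = unexpected_drops(input_names, output_names, allowed_drop_patterns)
    for name in dropped:
        print(f"unexpectedly dropped: {name}", file=sys.stderr)
    return 1 if dropped else 0
```

In a pipeline you would pass the result to `sys.exit`. Keeping the allowlist as explicit regex patterns means every intentionally dropped tensor is a reviewed decision instead of a side effect.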
MTP is one of those features where the failure mode is invisible until you measure the thing the feature was supposed to improve. Treat the conversion pipeline as untrusted by default, and you'll avoid burning an afternoon on it like we did.