This post is a practical deployment guide: installation, configuration, how to pick the right eviction rate, domain testing, and an honest list of what does not work yet.
Install
pip install nexusquant
Requires Python 3.9+, PyTorch 2.1+, and Transformers 4.40+. No CUDA-specific wheels — it runs on CPU for small models and on CUDA for production workloads.
The one-liner
from nexusquant import nexusquant_evict

with nexusquant_evict(model, quality="balanced"):
    output = model.generate(input_ids, max_new_tokens=500)
That is it. The context manager hooks into the model's forward pass, intercepts the KV cache after prefill, compresses it, and restores the original hooks on exit.
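The library's internals are not shown here, but the install-and-restore pattern described above can be sketched in plain Python. Everything below (the `Model` class, `compress`) is an illustrative stand-in, not nexusquant's actual code:

```python
from contextlib import contextmanager

class Model:
    """Stand-in for a transformer model with a swappable forward pass."""
    def forward(self, ids):
        return {"logits": ids, "kv_cache": list(ids)}  # pretend KV cache

def compress(kv_cache):
    # Stand-in for quantization + eviction: keep every other entry.
    return kv_cache[::2]

@contextmanager
def evict(model):
    original_forward = model.forward               # save the original
    def patched_forward(ids):
        out = original_forward(ids)                # run prefill as usual
        out["kv_cache"] = compress(out["kv_cache"])  # compress after prefill
        return out
    model.forward = patched_forward                # install the patch
    try:
        yield model
    finally:
        model.forward = original_forward           # restore on exit, even on error

model = Model()
with evict(model):
    out = model.forward([1, 2, 3, 4])
print(out["kv_cache"])                          # compressed: [1, 3]
print(model.forward([1, 2, 3, 4])["kv_cache"])  # restored: [1, 2, 3, 4]
```

The `try/finally` is the important part: the original forward comes back even if generation raises, which is what makes the context manager safe to wrap around production calls.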
Quality presets
Three eviction presets plus a lossless mode; pick based on your use case:
# Conservative: 35% eviction, 10x compression
# Use for: general-purpose, long prompts (>1K tokens), production default
with nexusquant_evict(model, quality="conservative"):
    ...

# Balanced: 60% eviction, 16x compression
# Use for: RAG over structured documents, short-to-medium prompts (<1.5K tokens)
with nexusquant_evict(model, quality="balanced"):
    ...

# Aggressive: 80% eviction, 32x compression
# Use for: memory-constrained environments, factual recall only
# Do NOT use for: prompts >1K tokens, creative tasks, multi-detail reasoning
with nexusquant_evict(model, quality="aggressive"):
    ...

# Lossless: no eviction, quantization only
# Use for: when you need maximum quality, 6-7x compression
with nexusquant_evict(model, quality="lossless"):
    ...
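The guidance in those comments can be collapsed into a small heuristic. This helper is not part of nexusquant; it just encodes the rules of thumb above:

```python
def pick_preset(prompt_tokens, task="general", memory_constrained=False):
    """Map the preset guidance to a quality string.

    task: "general", "rag", "factual", "creative", or "multi_detail".
    """
    if task in ("creative", "multi_detail"):
        return "conservative"   # eviction hurts diffuse-attention text most
    if prompt_tokens > 1500:
        return "conservative"   # balanced is only recommended to ~1.5K tokens
    if memory_constrained and task == "factual" and prompt_tokens <= 1000:
        return "aggressive"     # 80% eviction, factual recall only
    return "balanced"           # RAG / short-to-medium prompts

print(pick_preset(800, task="factual", memory_constrained=True))  # aggressive
print(pick_preset(2000))                                          # conservative
print(pick_preset(1200, task="rag"))                              # balanced
```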
How to choose eviction rate for your use case
Do not guess. Test on your actual data. Here is a script:
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from nexusquant import nexusquant_evict

def measure_ppl(model, tokenizer, texts, quality):
    total_loss = 0.0
    count = 0
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids.cuda()
        with nexusquant_evict(model, quality=quality):
            with torch.no_grad():
                out = model(ids, labels=ids)
        total_loss += out.loss.item()
        count += 1
    # Perplexity is exp of the mean cross-entropy loss, not the raw loss.
    return math.exp(total_loss / count)

# Load your model
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1").cuda()
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Sample 20 texts from YOUR domain
your_texts = [...]  # replace with real samples

# Baseline
baseline_ppl = measure_ppl(model, tokenizer, your_texts, quality="lossless")

# Test presets
for quality in ["conservative", "balanced", "aggressive"]:
    ppl = measure_ppl(model, tokenizer, your_texts, quality=quality)
    delta_pct = (ppl - baseline_ppl) / baseline_ppl * 100
    print(f"{quality}: PPL delta = {delta_pct:+.2f}%")
Run this with 20-50 samples from your actual domain. If "balanced" gives > 2% PPL delta on your data, drop to "conservative".
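That last rule generalizes: pick the most aggressive preset whose measured delta stays inside your quality budget. A sketch (the delta numbers below are placeholders, not measurements):

```python
def pick_by_delta(deltas, tolerance_pct=2.0):
    """deltas: {preset: measured PPL delta in %}.

    Walks presets from least to most aggressive and returns the most
    aggressive one within tolerance; falls back to "lossless" if none fit.
    """
    chosen = "lossless"
    for preset in ["conservative", "balanced", "aggressive"]:
        if preset in deltas and deltas[preset] <= tolerance_pct:
            chosen = preset
    return chosen

# Example with made-up deltas from the measurement script:
measured = {"conservative": 0.4, "balanced": 2.6, "aggressive": 5.1}
print(pick_by_delta(measured))  # conservative (balanced blows the 2% budget)
```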
Domain sensitivity: what to watch for
From our experiments, here is what correlates with compression difficulty:
Compresses well (use balanced or aggressive):
- Dense factual prose (academic papers, encyclopedia entries)
- Structured technical documentation
- Formal legal or financial text
Compresses poorly (use conservative or lossless):
- Creative/narrative text (fiction, stories)
- Casual conversational text
- Code with unusual identifier names
- Mixed-language text
Attention patterns on creative text are more diffuse: no small set of tokens dominates, so eviction is forced to discard tokens that still carry weight. Quantization also hurts more, because the KV values for creative text have a less structured distribution.
Here are our measured numbers at 500-token prefix on Mistral-7B:
| Domain | 35% evict | 70% evict |
|---|---|---|
| Academic | +0.39% | +4.81% |
| Technical | +0.90% | +3.87% |
| Creative | +2.48% | +4.62% |
Note the shape of the degradation: creative text climbs relatively gently from 35% to 70% eviction (+2.48% to +4.62%), while academic text jumps from +0.39% to +4.81%. The first eviction step hurts creative text proportionally more than subsequent steps do.
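You can see that "first step hurts most" effect directly by splitting the table into per-step deltas:

```python
# PPL deltas (%) from the table above, Mistral-7B at a 500-token prefix.
table = {
    "academic":  {35: 0.39, 70: 4.81},
    "technical": {35: 0.90, 70: 3.87},
    "creative":  {35: 2.48, 70: 4.62},
}

for domain, d in table.items():
    first_step = d[35]           # cost of going from 0% to 35% eviction
    second_step = d[70] - d[35]  # additional cost of going 35% to 70%
    print(f"{domain:10s} first step +{first_step:.2f}%, second step +{second_step:.2f}%")
```

Creative pays +2.48% up front and only +2.14% more for the second step; academic pays +0.39% up front and +4.42% for the second.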
Memory measurement
Do not trust our reported ratios. Measure yourself:
import contextlib

import torch
from nexusquant import nexusquant_evict

def measure_kv_memory(model, input_ids, quality=None):
    torch.cuda.reset_peak_memory_stats()
    baseline_mem = torch.cuda.memory_allocated()
    ctx = nexusquant_evict(model, quality=quality) if quality else contextlib.nullcontext()
    with ctx, torch.no_grad():
        out = model(input_ids, use_cache=True)
        kv_cache = out.past_key_values  # hold a reference so the cache is not freed
    kv_mem = torch.cuda.memory_allocated() - baseline_mem
    return kv_mem
baseline = measure_kv_memory(model, input_ids, quality=None)
compressed = measure_kv_memory(model, input_ids, quality="balanced")
print(f"Compression ratio: {baseline / compressed:.1f}x")
If the ratio you measure differs substantially from what we report, please open a GitHub issue with your model and config.
The latency caveat
NexusQuant is currently CPU-bound on the compression step.
The pipeline compresses the KV cache after prefill. The bottleneck is the E8 VQ nearest-neighbor lookup and the zstd entropy coding, both of which run on CPU. On a Mistral-7B prefill of 1664 tokens:
- Prefill: ~180ms (GPU)
- Compression step: ~340ms (CPU)
- Total: ~520ms vs ~180ms baseline
This means NexusQuant currently makes your time-to-first-token slower, not faster. The compression saves memory (which enables larger batches or longer contexts), but it adds latency.
The fix is Triton kernels for the VQ and entropy coding steps. We have not written them yet. This is on the roadmap and we will post an update when it is done.
If your use case is memory-bound (fitting more users in GPU memory, extending context length beyond what fits otherwise), NexusQuant solves that today. If your use case is latency-bound (fastest possible TTFT), do not use NexusQuant until the Triton kernels ship.
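One way to make that call is to put your own numbers into the tradeoff. The helper below is not part of nexusquant, and the example figures (~340ms added latency, 16x compression, 850 MB of KV per request, 24 GB of cache headroom) are placeholders taken or invented for illustration:

```python
def throughput_tradeoff(kv_bytes_per_req, gpu_kv_budget_bytes,
                        compression_ratio=16.0, added_latency_ms=340.0):
    """How many concurrent requests fit before and after KV compression,
    and what each request pays in added time-to-first-token."""
    base_fit = int(gpu_kv_budget_bytes // kv_bytes_per_req)
    compressed_fit = int(gpu_kv_budget_bytes // (kv_bytes_per_req / compression_ratio))
    return base_fit, compressed_fit, added_latency_ms

# Example: 850 MB of KV per request, 24 GB of headroom for caches
base, compressed, latency = throughput_tradeoff(850e6, 24e9)
print(f"{base} -> {compressed} concurrent requests, +{latency:.0f}ms TTFT each")
```

If the extra concurrency is worth more to you than the per-request latency hit, you are memory-bound and NexusQuant helps today; if not, wait for the kernels.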
Context length limits
Validated context ranges:
| Prefix length | Max eviction | Max compression | Safe? |
|---|---|---|---|
| < 500 tok | 35% | 10x | Yes |
| 500-1664 tok | 60% | 16x | Yes |
| 1664-2924 tok | 35% | 10x | Yes |
| > 2924 tok | 0% (lossless only) | 6-7x | Eviction untested |
On prefixes longer than ~3K tokens, our only eviction data point is the catastrophic failure at 60%, so no eviction rate is validated there. For long-context applications (>3K token prefixes), use quality="lossless" to get quantization-only compression without eviction.
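The validated ranges can be encoded as a guard in your serving code. A hypothetical helper mirroring the table above (not part of the library):

```python
def max_safe_quality(prefix_tokens):
    """Most aggressive preset validated for a given prefix length."""
    if prefix_tokens > 2924:
        return "lossless"       # eviction untested beyond ~3K tokens
    if prefix_tokens > 1664:
        return "conservative"   # 35% eviction max
    if prefix_tokens >= 500:
        return "balanced"       # 60% eviction validated
    return "conservative"       # short prefixes: 35% max per the table

print(max_safe_quality(300))    # conservative
print(max_safe_quality(1000))   # balanced
print(max_safe_quality(2500))   # conservative
print(max_safe_quality(5000))   # lossless
```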
Model compatibility
Validated:
- Mistral-7B-v0.1 (MHA)
- Llama-3-8B (GQA)
Known issues:
- GPT-NeoX-style models: our RoPE removal assumes split-half rotation (Llama/Mistral style). GPT-NeoX uses interleaved RoPE. It will produce wrong results. Do not use on GPT-NeoX.
- Llama-3.1 with extended context: rope_scaling config is not fully handled for context lengths beyond the standard window.
- Batch size > 1: there is a bug in NexusQuantSimple where only the first batch element is processed for keys. The HuggingFace context manager (nexusquant_evict) handles this correctly. Use that, not the low-level API.
What is not there yet
Being direct about gaps:
- Triton kernels — compression is CPU-bound, adds ~340ms latency. Critical for production.
- 16K+ context — not validated above 3K token prefixes.
- Eviction for batch > 1 — the low-level API has a bug here; context manager is fine.
- LongBench — proper long-context benchmark not yet run.
- Multi-model presets — the quality presets are tuned for Mistral/Llama. Other architectures may need different defaults.
The core use case that works well today: Mistral or Llama family model, prefill up to ~1.7K tokens, memory-bound deployment (fitting more requests per GPU), with quality=conservative or quality=balanced.
Best regards, João Marques