DEV Community

João André Gomes Marques

How to deploy NexusQuant in production (and what's missing)

This post is a practical deployment guide: installation, configuration, how to pick the right eviction rate, domain testing, and an honest list of what does not work yet.


Install

pip install nexusquant

Requires Python 3.9+, PyTorch 2.1+, and Transformers 4.40+. No CUDA-specific wheels — it runs on CPU for small models and on CUDA for production workloads.


The one-liner

from nexusquant import nexusquant_evict

with nexusquant_evict(model, quality="balanced"):
    output = model.generate(input_ids, max_new_tokens=500)

That is it. The context manager hooks into the model's forward pass, intercepts the KV cache after prefill, compresses it, and restores the original hooks on exit.
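Conceptually, the hook lifecycle looks like this. This is an illustrative sketch of the pattern, not NexusQuant's actual internals; the `compress` callable here is a stand-in for the real quantize-and-evict step:

```python
from contextlib import contextmanager

@contextmanager
def evict_hook(model, compress):
    """Wrap model.forward so the first (prefill) call gets its output
    compressed; restore the original forward on exit."""
    original_forward = model.forward
    state = {"prefilled": False}

    def wrapped(*args, **kwargs):
        out = original_forward(*args, **kwargs)
        if not state["prefilled"]:
            out = compress(out)       # compress KV cache after prefill only
            state["prefilled"] = True
        return out

    model.forward = wrapped           # install the hook
    try:
        yield model
    finally:
        model.forward = original_forward   # restore on exit, even on error
```

The `try/finally` is the important part: the original forward comes back even if generation throws.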


Quality presets

Three presets, pick based on your use case:

# Conservative: 35% eviction, 10x compression
# Use for: general-purpose, long prompts (>1K tokens), production default
with nexusquant_evict(model, quality="conservative"):
    ...

# Balanced: 60% eviction, 16x compression  
# Use for: RAG over structured documents, short-to-medium prompts (<1.5K tokens)
with nexusquant_evict(model, quality="balanced"):
    ...

# Aggressive: 80% eviction, 32x compression
# Use for: memory-constrained environments, factual recall only
# Do NOT use for: prompts >1K tokens, creative tasks, multi-detail reasoning
with nexusquant_evict(model, quality="aggressive"):
    ...

# Lossless: no eviction, quantization only
# Use for: when you need maximum quality, 6-7x compression
with nexusquant_evict(model, quality="lossless"):
    ...
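The rules of thumb in those comments can be folded into a small helper. This is my own convenience wrapper, not a library API:

```python
def pick_quality(prompt_tokens: int, task: str = "factual",
                 memory_constrained: bool = False) -> str:
    """Map prompt length and task type to a preset, following the
    guidance in the comments above. When in doubt, be conservative."""
    if task in ("creative", "multi_detail") or prompt_tokens > 1000:
        return "conservative"   # long prompts and creative work need headroom
    if memory_constrained and task == "factual":
        return "aggressive"     # factual recall tolerates 80% eviction
    return "balanced"
```

Then `nexusquant_evict(model, quality=pick_quality(len(input_ids[0])))` keeps the choice in one place.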

How to choose eviction rate for your use case

Do not guess. Test on your actual data. Here is a script:

import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from nexusquant import nexusquant_evict

def measure_ppl(model, tokenizer, texts, quality):
    """Mean perplexity over texts under the given preset."""
    total_loss = 0.0
    count = 0
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids.cuda()
        with nexusquant_evict(model, quality=quality):
            with torch.no_grad():
                out = model(ids, labels=ids)
        total_loss += out.loss.item()
        count += 1
    return math.exp(total_loss / count)  # perplexity = exp(mean loss)

# Load your model
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16
).cuda()
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Sample 20 texts from YOUR domain
your_texts = [...]  # replace with real samples

# Baseline
baseline_ppl = measure_ppl(model, tokenizer, your_texts, quality="lossless")

# Test presets
for quality in ["conservative", "balanced", "aggressive"]:
    ppl = measure_ppl(model, tokenizer, your_texts, quality=quality)
    delta_pct = (ppl - baseline_ppl) / baseline_ppl * 100
    print(f"{quality}: PPL delta = {delta_pct:+.2f}%")

Run this with 20-50 samples from your actual domain. If "balanced" gives > 2% PPL delta on your data, drop to "conservative".


Domain sensitivity: what to watch for

From our experiments, here is what correlates with compression difficulty:

Compresses well (use balanced or aggressive):

  • Dense factual prose (academic papers, encyclopedia entries)
  • Structured technical documentation
  • Formal legal or financial text

Compresses poorly (use conservative or lossless):

  • Creative/narrative text (fiction, stories)
  • Casual conversational text
  • Code with unusual identifier names
  • Mixed-language text

The attention patterns on creative text are more diffuse (fewer tokens dominate), so eviction hurts more. Quantization also hurts more, because the KV values for creative text have a less structured distribution.
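If you want to eyeball this on your own data, attention entropy is a cheap proxy: a diffuse (high-entropy) attention row means no small subset of tokens can be evicted safely. A minimal sketch of the diagnostic (my own, not part of NexusQuant):

```python
import math

def attention_entropy(weights):
    """Shannon entropy (nats) of one attention row: a list of
    probabilities over the prefix, summing to 1. Higher means more
    diffuse attention, which tolerates less eviction."""
    return -sum(p * math.log(p) for p in weights if p > 0)

peaked  = [0.90, 0.05, 0.03, 0.02]   # a few tokens dominate: evicts well
diffuse = [0.25, 0.25, 0.25, 0.25]   # near-uniform: evicts poorly
```

Averaging this over heads and layers on a few real prompts gives a quick feel for which side of the table your domain lands on.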

Here are our measured numbers at 500-token prefix on Mistral-7B:

| Domain    | 35% evict | 70% evict |
|-----------|-----------|-----------|
| Academic  | +0.39%    | +4.81%    |
| Technical | +0.90%    | +3.87%    |
| Creative  | +2.48%    | +4.62%    |

Note that for creative text, going from 35% to 70% eviction adds relatively little extra damage (2.48% to 4.62%), while academic text jumps from 0.39% to 4.81%. The first eviction step hurts creative text proportionally more than subsequent steps.


Memory measurement

Do not trust our reported ratios. Measure yourself:

import torch
from nexusquant import nexusquant_evict

def measure_kv_memory(model, input_ids, quality=None):
    baseline_mem = torch.cuda.memory_allocated()

    if quality:
        with nexusquant_evict(model, quality=quality):
            with torch.no_grad():
                out = model(input_ids, use_cache=True)
    else:
        with torch.no_grad():
            out = model(input_ids, use_cache=True)

    kv_cache = out.past_key_values  # hold a reference so the cache stays alive
    del out  # free the logits so the delta reflects the KV cache, not the output
    torch.cuda.synchronize()

    kv_mem = torch.cuda.memory_allocated() - baseline_mem
    return kv_mem

baseline = measure_kv_memory(model, input_ids, quality=None)
compressed = measure_kv_memory(model, input_ids, quality="balanced")
print(f"Compression ratio: {baseline / compressed:.1f}x")

If the ratio you measure differs substantially from what we report, please open a GitHub issue with your model and config.
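As a cross-check on what the uncompressed cache should cost, the standard FP16 KV-cache size formula is simple. The layer/head numbers below are illustrative; pull the real ones from your model's config:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Uncompressed KV cache size; the leading 2 covers keys + values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# e.g. a 7B-class model: 32 layers, 8 KV heads, head_dim 128, 1664-token prefix
size_mib = kv_cache_bytes(32, 8, 128, 1664) / 2**20   # 208.0 MiB
```

If your measured baseline is far from this analytic number, the measurement itself is suspect before the compression ratio is.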


The latency caveat

NexusQuant is currently CPU-bound on the compression step.

The pipeline compresses the KV cache after prefill. The bottleneck is the E8 VQ nearest-neighbor lookup and the zstd entropy coding, both of which run on CPU. On a Mistral-7B prefill of 1664 tokens:

  • Prefill: ~180ms (GPU)
  • Compression step: ~340ms (CPU)
  • Total: ~520ms vs ~180ms baseline

This means NexusQuant currently makes your time-to-first-token slower, not faster. The compression saves memory (which enables larger batches or longer contexts), but it adds latency.

The fix is Triton kernels for the VQ and entropy coding steps. We have not written them yet. This is on the roadmap and we will post an update when it is done.

If your use case is memory-bound (fitting more users in GPU memory, extending context length beyond what fits otherwise), NexusQuant solves that today. If your use case is latency-bound (fastest possible TTFT), do not use NexusQuant until the Triton kernels ship.
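To see where you stand, time both paths on your own hardware. A generic wall-clock helper (when timing GPU work, call `torch.cuda.synchronize()` inside the callable, since kernel launches are asynchronous):

```python
import time

def mean_wall_time(fn, warmup=2, iters=5):
    """Average wall-clock seconds per call to fn, after warmup calls."""
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters

# Usage sketch -- `run_prefill` is a placeholder for your own wrapper
# around model(input_ids, use_cache=True), with and without nexusquant_evict:
# base = mean_wall_time(run_prefill)
# comp = mean_wall_time(run_prefill_compressed)
```

Comparing the two averages tells you whether the ~340ms compression overhead is acceptable for your traffic.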


Context length limits

Validated context ranges:

| Prefix length | Max eviction       | Max compression | Safe?             |
|---------------|--------------------|-----------------|-------------------|
| < 500 tok     | 35%                | 10x             | Yes               |
| 500-1664 tok  | 60%                | 16x             | Yes               |
| 1664-2924 tok | 35%                | 10x             | Yes               |
| > 2924 tok    | 0% (lossless only) | 6-7x            | Eviction untested |

We have not validated eviction on prefixes longer than ~3K tokens beyond the catastrophic failure at 60%. For long-context applications (>3K token prefixes), use quality="lossless" to get quantization-only compression without eviction.
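If you want a hard guard in serving code, clamp the requested preset to the validated ranges. A sketch of my own wrapper, using the boundaries from the table above:

```python
def safe_quality(prefix_tokens: int, requested: str) -> str:
    """Clamp a requested preset to the validated context ranges."""
    if prefix_tokens > 2924:
        return "lossless"                  # eviction untested out here
    if requested == "lossless":
        return requested
    if 500 <= prefix_tokens <= 1664:
        # up to 60% eviction validated: aggressive (80%) clamps to balanced
        return "balanced" if requested == "aggressive" else requested
    # < 500 tok or 1664-2924 tok: only 35% eviction validated
    return "conservative"
```

It silently downgrades rather than erroring, which is the behavior you usually want in a serving path.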


Model compatibility

Validated:

  • Mistral-7B-v0.1 (MHA)
  • Llama-3-8B (GQA)

Known issues:

  • GPT-NeoX-style models: our RoPE removal assumes split-half rotation (Llama/Mistral style). GPT-NeoX uses interleaved RoPE. It will produce wrong results. Do not use on GPT-NeoX.
  • Llama-3.1 with extended context: rope_scaling config is not fully handled for context lengths beyond the standard window.
  • Batch size > 1: there is a bug in NexusQuantSimple where only the first batch element is processed for keys. The HuggingFace context manager (nexusquant_evict) handles this correctly. Use that, not the low-level API.

What is not there yet

Being direct about gaps:

  1. Triton kernels — compression is CPU-bound, adds ~340ms latency. Critical for production.
  2. 16K+ context — not validated above 3K token prefixes.
  3. Eviction for batch > 1 — the low-level API has a bug here; context manager is fine.
  4. LongBench — proper long-context benchmark not yet run.
  5. Multi-model presets — the quality presets are tuned for Mistral/Llama. Other architectures may need different defaults.

The core use case that works well today: Mistral or Llama family model, prefill up to ~1.7K tokens, memory-bound deployment (fitting more requests per GPU), with quality=conservative or quality=balanced.


Best regards, João Marques
