
João André Gomes Marques


NexusQuant is now on PyPI, HuggingFace, and 9 awesome lists

This week we shipped everything. Here is the full list.

What went out the door

PyPI package — pip install nexusquant works. One line, no retraining, drop-in KV cache compression for any HuggingFace model.

HuggingFace Space — live interactive demo at huggingface.co/spaces/jagmarques/nexusquant. Upload a model, pick a compression ratio, see perplexity in real time.

Google Colab notebook — zero-setup walkthrough. Run the full pipeline in your browser without a local GPU.

13 blog posts — covering everything from E8 lattice quantization to attention-aware eviction, each with reproducible numbers and code.
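Attention-aware eviction, at its simplest, ranks cached entries by how much attention they keep receiving and drops the rest. The blog posts describe the actual policy; the sketch below only illustrates that generic idea — the function name and shapes are illustrative, not NexusQuant's API:

```python
import numpy as np

def evict_kv_cache(attn_weights, kv_cache, keep):
    """Keep the `keep` cached entries that received the most total attention.

    attn_weights: (num_queries, num_keys) softmax attention matrix
    kv_cache:     (num_keys, head_dim) cached key (or value) states
    """
    scores = attn_weights.sum(axis=0)            # attention mass per cached key
    kept = np.sort(np.argsort(scores)[-keep:])   # top-k, preserving original order
    return kv_cache[kept], kept
```

Real policies typically accumulate these scores across decoding steps rather than scoring a single attention matrix, but the top-k selection is the common core.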

9 awesome list PRs — submitted to awesome-llm, awesome-efficient-transformers, awesome-kv-cache, and six others. Four already merged.

5 GitHub issues — filed against PyTorch, vLLM, HuggingFace Transformers, LiteLLM, and llama.cpp to track upstream integration roadmap items.

NeurIPS paper draft — the research that underpins all of this: NSN + Hadamard + E8 Lattice VQ + TCC giving 7x compression with -2.26% PPL on Mistral-7B, 32% more compression than TurboQuant at better quality.

The numbers that matter

  • 7.06x compression, training-free
  • -0.002% PPL at 5.3x on Llama-3-8B (essentially lossless)
  • 128K context → 680K tokens in the same GPU memory at 5.3x
  • 128K context → 2.6M tokens at 20x with token merging
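The context-extension bullets follow directly from the compression ratio: a KV cache that is N times smaller holds N times more tokens in the same memory budget. A quick sanity check of the arithmetic (token counts rounded as in the bullets above):

```python
def extended_context(base_tokens: int, compression_ratio: float) -> int:
    # Same KV-cache memory budget holds `compression_ratio`-times more tokens.
    return round(base_tokens * compression_ratio)

print(extended_context(128_000, 5.3))  # 678400, i.e. ~680K
print(extended_context(128_000, 20))   # 2560000, i.e. ~2.6M
```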

What is next

We are looking for contributors on:

  • vLLM integration (PagedAttention compatibility)
  • Flash Attention 3 support
  • Quantization-aware fine-tuning experiments
  • Benchmarks on Gemma-3 and Qwen-3

If any of this is your area, open an issue or ping me directly.

Best regards, João Marques

NexusQuant — unlimited context windows for every AI model.
