
João André Gomes Marques


NexusQuant is now on PyPI, HuggingFace, and 9 awesome lists

This week we shipped everything. Here is the full list.

What went out the door

PyPI package — pip install nexusquant works. One line, no retraining, drop-in KV cache compression for any HuggingFace model.

HuggingFace Space — live interactive demo at huggingface.co/spaces/jagmarques/nexusquant. Upload a model, pick a compression ratio, see perplexity in real time.

Google Colab notebook — zero-setup walkthrough. Run the full pipeline in your browser without a local GPU.

13 blog posts — covering everything from E8 lattice quantization to attention-aware eviction, each with reproducible numbers and code.
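Attention-aware eviction, at its simplest, ranks cached entries by how much attention they keep receiving and drops the rest. The blog posts describe the actual policy; the sketch below only illustrates that generic idea — the function name and shapes are illustrative, not NexusQuant's API:

```python
import numpy as np

def evict_kv_cache(attn_weights, kv_cache, keep):
    """Keep the `keep` cached entries that received the most total attention.

    attn_weights: (num_queries, num_keys) softmax attention matrix
    kv_cache:     (num_keys, head_dim) cached key (or value) states
    """
    scores = attn_weights.sum(axis=0)            # attention mass per cached key
    kept = np.sort(np.argsort(scores)[-keep:])   # top-k, preserving original order
    return kv_cache[kept], kept
```

Real policies typically accumulate these scores across decoding steps rather than scoring a single attention matrix, but the top-k selection is the common core.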

9 awesome list PRs — submitted to awesome-llm, awesome-efficient-transformers, awesome-kv-cache, and six others. Four already merged.

5 GitHub issues — filed against PyTorch, vLLM, HuggingFace Transformers, LiteLLM, and llama.cpp to track upstream integration roadmap items.

NeurIPS paper draft — the research that underpins all of this: NSN + Hadamard + E8 Lattice VQ + TCC giving 7x compression with -2.26% PPL on Mistral-7B, 32% more compression than TurboQuant at better quality.

The numbers that matter

  • 7.06x compression, training-free
  • -0.002% PPL at 5.3x on Llama-3-8B (essentially lossless)
  • 128K context → 680K tokens in the same GPU memory at 5.3x
  • 128K context → 2.6M tokens at 20x with token merging
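The context-extension bullets follow directly from the compression ratio: a KV cache that is N times smaller holds N times more tokens in the same memory budget. A quick sanity check of the arithmetic (token counts rounded as in the bullets above):

```python
def extended_context(base_tokens: int, compression_ratio: float) -> int:
    # Same KV-cache memory budget holds `compression_ratio`-times more tokens.
    return round(base_tokens * compression_ratio)

print(extended_context(128_000, 5.3))  # 678400, i.e. ~680K
print(extended_context(128_000, 20))   # 2560000, i.e. ~2.6M
```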

What is next

We are looking for contributors on:

  • vLLM integration (PagedAttention compatibility)
  • Flash Attention 3 support
  • Quantization-aware fine-tuning experiments
  • Benchmarks on Gemma-3 and Qwen-3

If any of this is your area, open an issue or ping me directly.

Best regards, João Marques

NexusQuant — unlimited context windows for every AI model.
