This week we shipped everything. Here is the full list.
What went out the door
PyPI package — pip install nexusquant works. One line, no retraining, drop-in KV cache compression for any HuggingFace model.
HuggingFace Space — live interactive demo at huggingface.co/spaces/jagmarques/nexusquant. Upload a model, pick a compression ratio, see perplexity in real time.
Google Colab notebook — zero-setup walkthrough. Run the full pipeline in your browser without a local GPU.
13 blog posts — covering everything from E8 lattice quantization to attention-aware eviction, each with reproducible numbers and code.
9 awesome list PRs — submitted to awesome-llm, awesome-efficient-transformers, awesome-kv-cache, and six others. Four already merged.
5 GitHub issues — filed against PyTorch, vLLM, HuggingFace Transformers, LiteLLM, and llama.cpp to track upstream integration roadmap items.
NeurIPS paper draft — the research that underpins all of this: NSN + Hadamard rotation + E8 Lattice VQ + TCC, yielding 7x compression at -2.26% PPL on Mistral-7B and beating TurboQuant with 32% higher compression at better quality.
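The Hadamard step in that pipeline can be illustrated with a toy sketch (this is not the NexusQuant implementation, just the underlying idea): rotating activations with an orthonormal Hadamard matrix spreads a single outlier channel across all dimensions, shrinking the dynamic range a uniform quantizer has to cover — and because the rotation is orthonormal, the quantization error measured after rotating back is the true error.

```python
import numpy as np
from scipy.linalg import hadamard

def quantize(x, bits=4):
    """Symmetric uniform quantizer with 2^(bits-1)-1 levels per side."""
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / levels
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
d = 64
x = rng.normal(0.0, 1.0, d)
x[0] = 25.0  # one outlier channel, common in KV activations

H = hadamard(d) / np.sqrt(d)  # orthonormal: H @ H.T == I

mse_plain = np.mean((quantize(x) - x) ** 2)
# rotate, quantize in the rotated basis, rotate back, compare to the original
mse_rotated = np.mean((H.T @ quantize(H @ x) - x) ** 2)

print(f"plain 4-bit MSE:   {mse_plain:.4f}")
print(f"rotated 4-bit MSE: {mse_rotated:.4f}")  # far lower: outlier is spread out
```

The rotation costs nothing at the information level (it is invertible and norm-preserving); it only reshapes the distribution so that a cheap uniform quantizer behaves well.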
The numbers that matter
- 7.06x compression, training-free
- -0.002% PPL at 5.3x on Llama-3-8B (essentially lossless)
- 128K context → 680K tokens in the same GPU memory at 5.3x
- 128K context → 2.6M tokens at 20x with token merging
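The context-extension figures above are straight multiplication — assuming 128K ≈ 128,000 tokens and that compressed KV entries buy a proportionally longer sequence in the same memory budget:

```python
# Same KV-cache memory budget; compressed entries -> proportionally more tokens.
base_context = 128_000  # assuming 128K ~= 128,000 tokens

for ratio in (5.3, 20.0):
    effective = round(base_context * ratio)
    print(f"{ratio:g}x compression -> ~{effective:,} effective tokens")
```

That recovers the ~680K tokens at 5.3x and ~2.6M tokens at 20x quoted above.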
What's next
We are looking for contributors on:
- vLLM integration (PagedAttention compatibility)
- Flash Attention 3 support
- Quantization-aware fine-tuning experiments
- Benchmarks on Gemma-3 and Qwen-3
If any of this is your area, open an issue or ping me directly.
Links
- PyPI: pypi.org/project/nexusquant
- GitHub: github.com/jagmarques/nexusquant
- HF Space: huggingface.co/spaces/jagmarques/nexusquant
- Colab: linked in the repo README
Best regards, João Marques
NexusQuant — unlimited context windows for every AI model.