Your LLM is running out of memory at 128K tokens. Here is the fix.
from nexusquant import nexusquant
with nexusquant(model):
output = model.generate(input_ids, max_new_tokens=500)
That is the entire change.
Before: 128K tokens, 40 GB KV cache memory on Llama-3-70B.
After: 1.3M tokens, same 40 GB. 10x context window. Zero retraining.
The pipeline compresses KV cache in four stages — normalization, Hadamard rotation, E8 lattice quantization, temporal delta coding — at 7x compression with -2.26% perplexity on Mistral-7B. Training-free. Drop-in. One context manager.
If you are building long-context applications and memory is your ceiling, this is worth ten minutes.
GitHub: github.com/nexusquant/nexusquant
Best regards, João Marques
Top comments (0)