
Alan West
Google Dropped TurboQuant Two Weeks Ago. The Community Already Made It Usable.

Google published the TurboQuant paper on March 25. It's April 7. There are already five independent implementations, a llama.cpp fork running 104B parameter models on a MacBook, and an active vLLM integration effort. Google hasn't released a single line of official code.

This is the post about what happened in those two weeks.

The Paper, In 30 Seconds

TurboQuant is a KV cache compression method. During inference, large language models store key-value pairs for every token in the context -- this is the KV cache, and it's the single biggest memory bottleneck for long-context inference. The paper demonstrates quality-neutral compression at around 3.5 bits per element, with marginal degradation down to 2.5 bits -- achieving at least 6x memory reduction and up to 8x speedup in attention computation on H100 GPUs, with what the paper claims is zero accuracy loss at the sweet spot.
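To make those numbers concrete, here's the standard back-of-the-envelope KV cache sizing. The model shape below (layer count, KV heads, head dimension) is illustrative, not taken from the paper:

```python
def kv_cache_bytes(seq_len, bits_per_element,
                   n_layers=64, n_kv_heads=8, head_dim=128):
    # 2x for keys and values; one element per layer/head/dim per token
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return elements * bits_per_element / 8

ctx = 131072  # 128K context
for bits in (16, 3.5, 2.5):
    gib = kv_cache_bytes(ctx, bits) / 2**30
    print(f"{bits:>4} bits/elem: {gib:5.1f} GiB ({16 / bits:.1f}x vs fp16)")
```

For this hypothetical shape, fp16 needs 32 GiB of cache at 128K context; 2.5 bits per element brings that to 5 GiB, which is where the "at least 6x" figure comes from.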

The critical detail: it's training-free and data-oblivious. You don't retrain the model. You don't need calibration data. You apply it to any transformer-based model and it just works.
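That property is easy to illustrate. Even a minimal round-to-nearest quantizer (this is not TurboQuant's actual algorithm, just a sketch of the data-oblivious idea) has it: the scales come from the tensor being compressed, at inference time, so there is nothing to train and nothing to calibrate:

```python
import numpy as np

def quantize(x, bits):
    # Per-row min/max scaling, computed from the data being
    # quantized -- no calibration set, no learned parameters.
    levels = 2 ** bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / levels
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return codes * scale + lo

keys = np.random.randn(4, 128).astype(np.float32)  # stand-in key vectors
codes, scale, lo = quantize(keys, bits=4)
err = np.abs(dequantize(codes, scale, lo) - keys).max()
print(f"max abs error at 4 bits: {err:.4f}")
```

TurboQuant's codebook design is more sophisticated, but the deployment story is the same: compress the cache as it's written, decompress on read, touch nothing else.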

TechCrunch reported that the internet was calling it "the Pied Piper of AI" -- a reference to the Silicon Valley compression joke that, for once, isn't actually a joke. The original @GoogleResearch tweet announcing it has accumulated over 7.7 million views.

And then Google did what Google often does with research: published the paper, took a bow, and released no code.

What the Community Built

Within days, people started building their own implementations from the paper alone. Here's what exists today, ranked roughly by maturity.

tonbistudio/turboquant-pytorch

The first to appear. A PyTorch reference implementation focused on correctness over performance. Early versions reported 5x compression with 99.5% attention fidelity, but a later README update disclosed that a bug had inflated these results. After the fix, 3-bit key quantization breaks generation quality in some configurations. The maintainers have been transparent about this limitation.

This is the one you'd use if you're a researcher who wants to study the algorithm, not deploy it. The code is readable and well-documented. Just be aware that not all quantization levels produce usable output yet.

TheTom/llama-cpp-turboquant

This is where things get serious. A C/C++ implementation with both CPU and CUDA kernels, built as a fork of llama.cpp. All 18 tests pass. MSE (mean squared error) matches the paper's reported values within 1%.

If you're already in the llama.cpp ecosystem and have an NVIDIA GPU, this is the most production-ready option for Linux/CUDA workloads right now.

TheTom/turboquant_plus

The same developer's second project, and the one that made the most noise on social media. This is a llama.cpp Metal integration targeting Apple Silicon, with two new KV-cache quantization types: turbo3 (roughly 3.25-bit keys) and turbo4 (4-bit keys with 2-bit values).
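The compression implied by those bit widths is easy to work out for turbo4 (turbo3's value bit width isn't stated above, so I'll leave it out). This assumes keys and values are stored in equal volume and ignores per-block scale metadata:

```python
def compression_vs_fp16(key_bits, value_bits):
    # Average bits per cached element against a 16-bit baseline.
    avg_bits = (key_bits + value_bits) / 2
    return 16 / avg_bits

print(f"turbo4 (4-bit K, 2-bit V): {compression_vs_fp16(4, 2):.2f}x vs fp16")
```

Per-block scales and zero points eat into this in practice, which is why measured compression figures always trail the raw bit-rate math.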

The headline number from this community implementation: a 104B parameter model running at 128K context on a MacBook with an M5 Max chip. Perplexity of 4.024. Peak memory usage of 74 GB. Prefill throughput at q8_0 parity while achieving 4.6x cache compression.

That's a model that would normally require multiple GPUs running on a laptop. Not at full speed, and not without the 128GB unified memory configuration, but running. Generating coherent text. With measurable, published benchmarks.

Building and running it looks like this:

git clone https://github.com/TheTom/turboquant_plus.git
cd turboquant_plus
mkdir build && cd build
cmake .. -DGGML_METAL=ON
cmake --build . --config Release -j

# Run with turbo3 KV cache quantization
./bin/llama-cli \
  -m /path/to/model.gguf \
  -ctk turbo3 \
  -ctv turbo3 \
  -c 131072 \
  -n 512 \
  -p "Your prompt here"

The -ctk and -ctv flags set the key and value cache quantization types respectively. That's the entire integration surface -- two flags on a command you're already running.

0xSero/turboquant

Takes a different approach entirely. Instead of C/C++, this implementation uses Triton kernels (the GPU programming language, not the inference server) and targets vLLM integration directly. Keys compressed to 3 bits, values to 2 bits, matching one of the paper's more aggressive configurations.
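Odd bit widths are where the kernel work gets fiddly: 3-bit codes don't align to byte boundaries, so they have to be packed into a dense bitstream. A numpy sketch of that bookkeeping (illustrative only, not 0xSero's actual memory layout):

```python
import numpy as np

def pack3(codes):
    """Pack uint8 codes in [0, 7] into a dense bitstream, 3 bits each."""
    bits = np.unpackbits(codes[:, None], axis=1)[:, -3:]  # keep low 3 bits
    flat = bits.reshape(-1)
    flat = np.pad(flat, (0, (-len(flat)) % 8))  # pad to a byte boundary
    return np.packbits(flat)

codes = np.array([5, 1, 7, 0, 3, 6, 2, 4], dtype=np.uint8)
packed = pack3(codes)
print(f"{len(codes)} codes -> {len(packed)} bytes")  # 8 codes -> 3 bytes
```

A Triton kernel does the same thing with shifts and masks in registers; the point is simply that 3-bit storage fits 2.67 cache elements into every byte.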

If your deployment target is vLLM on cloud GPUs, this is the one to watch. It's less mature than the llama.cpp forks but aimed at a different use case -- serving multiple users, not local inference.

scos-lab/turboquant

An ICLR paper reproduction with detailed engineering insights about what the authors got right and where the paper's descriptions were ambiguous. Less useful as a deployable tool, very useful if you're trying to understand the algorithm deeply.

The M5 Max Results Deserve Their Own Section

Running a 104B model at 128K context on a MacBook is a statement. Let me put that in perspective. These numbers come from TheTom's turboquant_plus -- a community implementation, not Google's official code.

Without TurboQuant-style compression, a 104B model in fp16 needs roughly 208 GB just for model weights. The KV cache at 128K context adds another massive chunk on top. You'd need a multi-GPU server.

With turboquant_plus on Apple Silicon, the model weights are already quantized (Q4_K_M), and the KV cache gets compressed by 4.6x on top of that. The 128GB unified memory on the M5 Max becomes just enough.
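Putting the article's figures into one budget shows why 128 GB is "just enough." The Q4_K_M bit rate and the fp16 cache estimate below are my assumptions (Q4_K_M averages roughly 4.5 bits per weight; the cache figure depends on the model's attention layout), not measured values:

```python
GB = 1e9

weights_fp16 = 104e9 * 2 / GB           # ~208 GB, as stated above
weights_q4km = 104e9 * (4.5 / 8) / GB   # Q4_K_M ~4.5 bits/weight (assumed)
kv_fp16_est = 34                        # GB at 128K context (assumed shape)
kv_compressed = kv_fp16_est / 4.6       # 4.6x compression (reported)

total = weights_q4km + kv_compressed
print(f"quantized weights:     {weights_q4km:.0f} GB")
print(f"compressed KV cache:   {kv_compressed:.0f} GB")
print(f"total before overhead: {total:.0f} GB (reported peak: 74 GB)")
```

The remainder up to the 74 GB peak is activations, compute buffers, and runtime overhead -- and the headroom left under 128 GB is what makes the configuration viable at all.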

The perplexity number -- 4.024 -- is the real validation. One developer testing a 35B model reported output "identical to f16 baseline at temperature 0." The compression isn't producing garbage. It's producing statistically equivalent text.

This matters because it changes the hardware requirements for local inference from "build a server" to "buy a high-end laptop." That's a category shift, not an incremental improvement.

What Doesn't Exist Yet

Honesty check. Here's what's missing:

Google's official code. Expected sometime in Q2 2026. When it lands, every community implementation will need to reconcile differences.

Native support in mainline projects. There's an active llama.cpp discussion (#20969), a feature request (#20977), and a vLLM issue (#38171), all with regular updates. But none of these are merged. You're running forks, not upstream.

Apple MLX integration. The only Apple Silicon path is through turboquant_plus, which is a llama.cpp fork. If you're in the MLX ecosystem (Apple's machine learning framework for Apple Silicon), there's nothing for you yet.

Ollama support. This is the one that would bring TurboQuant to the broadest audience. No sign of it yet.

The Python implementations (tonbistudio, 0xSero) work but are slower than native C/C++ by a wide margin. If you need speed, you need the compiled forks.

Comparing the Implementations

Here's the honest breakdown for someone trying to choose today:

| Implementation          | Language     | Target       | Maturity | Best For                    |
|------------------------|-------------|--------------|----------|-----------------------------|
| tonbistudio/pytorch    | Python      | Research     | Stable   | Understanding the algorithm |
| TheTom/llama-cpp       | C/C++/CUDA  | Linux/NVIDIA | Solid    | GPU inference servers       |
| TheTom/turboquant_plus | C/C++/Metal | macOS/Apple  | Solid    | Local inference on Mac      |
| 0xSero/turboquant      | Triton      | vLLM/Cloud   | Early    | Multi-user serving          |
| scos-lab/turboquant    | Python      | Research     | Stable   | Paper reproduction          |

If you have an M5 Max or M5 Ultra MacBook, turboquant_plus is the obvious choice. If you're deploying on NVIDIA GPUs, TheTom's llama-cpp fork. If you want to wait for something more official, that's also reasonable -- none of these are "done."

The Pattern Is the Story

Here's what I think is actually interesting about TurboQuant, beyond the compression ratios.

Google publishes a paper. Google does not release code. Within 48 hours, someone has a working PyTorch implementation. Within a week, there are C/C++ implementations with GPU kernels. Within two weeks, someone is running 104B models on a laptop.

This isn't unique to TurboQuant. We saw it with FlashAttention, with LoRA, with virtually every significant ML paper in the last two years. The pattern is: paper drops, community builds, official code eventually follows (or doesn't, and nobody cares because the community version is already better).

What's different here is the speed. Two weeks from paper to "104B on a MacBook with published benchmarks" is fast even by 2026 standards. The llama.cpp ecosystem in particular has become so good at absorbing new quantization techniques that the integration surface is often just a new flag on an existing command.

This creates an interesting dynamic. Google gets the citation credit and the media cycle. The community gets the actual usable software. And users get access to the technique weeks or months before any official release.

Is the code production-ready? No. Are the forks going to diverge from whatever Google eventually releases? Probably. Does any of that matter when you can run a 104B model on your laptop today? For most people, no. It doesn't.

What to Do Right Now

If you're running local inference on Apple Silicon and you have 64GB+ unified memory, try turboquant_plus. The barrier to entry is literally one CMake flag and two command-line arguments. If it works for your model, you just got access to larger models or longer contexts for free.

If you're deploying on NVIDIA hardware, TheTom's llama-cpp fork is the safer bet. The test suite passes. The MSE numbers match the paper.

If you're using vLLM in production, watch the 0xSero implementation and vLLM issue #38171. Don't deploy it yet. But keep it on your radar.

And if you're not in a hurry, waiting for official llama.cpp or vLLM mainline support is the most conservative path. It's coming. The discussions are active. The community has already done the hard part of proving the technique works.

Two weeks. Five implementations. A 104B model on a laptop. No official code from Google. The open-source ML community continues to be the fastest engineering organization on the planet.
