Everyone is talking about TurboQuant, and a lot of people summarize it with a line like this:
run bigger models on smaller hardware
That line is catchy, but it is also where the confusion starts. And yes, it was my initial assumption too: "nice! now I can run that 70B model on my 24 GB unified-memory MacBook."
This article has two goals:
- Explain what TurboQuant actually is, and what it is not
- Show a practical local stack for Apple Silicon that uses TurboQuant where it helps without making the rest of your setup miserable
The stack here is intentionally humble. It is meant for the kind of machine many of us actually have:
- a MacBook with Apple Silicon
- limited unified memory
- a normal-person budget
- perhaps an irrational amount of confidence
Part 1: what TurboQuant is, and what it is not
TurboQuant does not primarily reduce model-weight size.
That is the first thing to get clear.
When people say "it lets you run bigger models on smaller hardware", what they usually mean is more indirect:
- it reduces runtime memory pressure
- that frees memory budget for longer context, more headroom, or somewhat larger configurations
But the thing being compressed is not the main model checkpoint on disk.
It is the KV cache used during inference.
The missing half of memory optimization
A lot of local-LLM discussion focuses on weight quantization:
- GGUF
- AWQ
- 4-bit and 8-bit model variants
- smaller checkpoints that fit into memory
That is useful, but it is only half the story.
At inference time, your memory bill looks more like this:
runtime memory = model weights + KV cache
The KV cache grows with context length. As prompts get larger, and as generations get longer, that cache becomes a major factor.
This is why long-context tasks often feel much worse than people expect. A model that technically fits on your machine can still become impractical once you start doing any of the following:
- stuffing lots of retrieved chunks into a RAG prompt
- cleaning up OCR text from long documents
- summarizing many files at once
- reasoning over a codebase with lots of source pasted in
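To put rough numbers on this, here is a back-of-envelope KV cache estimate. The shapes below (32 layers, 8 KV heads, head dimension 128) are assumptions in the ballpark of a typical 8B-class model with grouped-query attention, not measurements of any specific checkpoint:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Rough KV cache size: 2 tensors (K and V) per layer,
    each of shape [n_kv_heads, seq_len, head_dim]."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed shapes, roughly an 8B-class model with grouped-query attention.
fp16 = kv_cache_bytes(32, 8, 128, seq_len=32_768, bytes_per_elem=2)
print(f"FP16 KV cache at 32k tokens: {fp16 / 2**30:.1f} GiB")  # 4.0 GiB
```

Four gigabytes of cache on top of the weights is exactly the kind of bill that quietly eats a 24 GB machine, and it is the part that cache quantization attacks.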
What TurboQuant brings
TurboQuant attacks the runtime side of the problem.
At a high level, it compresses the KV cache much more aggressively than the standard FP16 representation while trying to preserve quality.
That creates practical benefits such as:
- lower memory pressure during long-context inference
- more headroom for larger prompts
- potentially better concurrency or stability under load
- a more realistic path to doing serious document work on hardware that is not a datacenter card
What TurboQuant does not magically do
It does not mean:
- any huge model now fits comfortably on your laptop
- quality is untouched in every case
- all runtimes support it natively today
- you no longer need weight quantization
The right mental model is this:
- weight quantization compresses the brain
- TurboQuant compresses the model's working memory
If you only optimize one, you still leave useful savings on the table.
Part 2: the engineering decision
Instead of trying to force one runtime to do everything, I chose a split architecture.
Why not just patch everything into Ollama?
Because I wanted two things at once:
- a stable day-to-day local endpoint
- a more experimental path for long-context memory-heavy work
Ollama is excellent for the first. It is simple, ergonomic, and already widely supported by tools.
For the second, a small MLX-based TurboQuant sidecar is a better fit on Apple Silicon today.
That led to this design:
client / UI / code tool
           |
           v
   routing proxy :8000
       /        \
      v          v
  Ollama    TurboQuant sidecar
  :11434        :8001
What each piece does
Ollama
Ollama handles the easy path:
- short chat
- coding help
- routine interactions
- lower-context tasks
It is configured with Flash Attention and KV-cache quantization so it already gets some memory savings.
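For reference, both settings are plain environment variables documented in the Ollama FAQ. On macOS, where Ollama usually runs as an app, `launchctl setenv` is the documented way to set them; the `q8_0` cache type here is my assumed default (a `q4_0` option also exists, with more quality risk):

```shell
# Enable Flash Attention (a prerequisite for KV-cache quantization in Ollama)
launchctl setenv OLLAMA_FLASH_ATTENTION 1
# Quantize the KV cache to 8-bit
launchctl setenv OLLAMA_KV_CACHE_TYPE q8_0
```

Restart Ollama after setting these so the server picks them up.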
TurboQuant MLX sidecar
The sidecar handles the jobs where KV cache pressure dominates:
- long RAG prompts
- OCR cleanup for big documents
- multi-document synthesis
- file-heavy assistant workflows
It exposes an OpenAI-compatible endpoint so it can be used by clients that already know how to talk to that API shape.
Routing proxy
The router removes backend-switching friction.
It inspects requests, estimates prompt size, and decides whether the request should go to Ollama or the sidecar.
That means your clients can often point to a single URL and let the stack make a reasonable choice.
Part 3: one-command install
The code at my repository includes a single installer:
bash install.sh
That installer does the practical work:
- sets recommended Ollama environment variables
- creates a Python environment
- installs FastAPI, MLX, and the required libraries
- clones the TurboQuant MLX dependency if needed
- creates LaunchAgent files for auto-start
- installs the routing proxy and sidecar scripts
- writes Open WebUI usage notes
Why a one-command installer matters
Because experimental stacks die when setup becomes an archaeological project.
If every new machine requires a ritual involving five README tabs and one issue comment from three months ago, the stack is not really usable.
The installer turns this into a reproducible baseline.
Part 4: how the package is implemented
The sidecar
The sidecar is a small FastAPI service that:
- loads an MLX model
- applies the TurboQuant patch
- creates TurboQuant KV caches for each transformer layer
- exposes /v1/chat/completions
That keeps the interface familiar for downstream tools.
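The MLX and TurboQuant calls themselves live in the repository, but the interface contract is easy to show. Here is a minimal sketch of the response shape the sidecar returns; the field names follow the OpenAI chat-completions format, and the helper itself is an illustration, not the repository's actual code:

```python
import time
import uuid

def chat_completion_response(model: str, text: str) -> dict:
    """Wrap generated text in the OpenAI chat-completions response shape,
    so existing clients can consume the sidecar without changes."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": "stop",
        }],
    }
```

Because the envelope matches what clients already parse, swapping the backend under them is invisible.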
The router
The router is another FastAPI service that also exposes /v1/chat/completions.
Its default behavior is deliberately simple:
- estimate prompt size from the combined message length
- use Ollama below a token threshold
- use TurboQuant above that threshold
- allow explicit override using a model prefix like tq:
This is not meant to be the last routing strategy you will ever need. It is meant to be understandable, debuggable, and easy to improve.
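Stripped of the FastAPI plumbing, that decision logic is only a few lines. The chars-per-token ratio and the threshold below are illustrative assumptions; the real values live in the repository's router script:

```python
OLLAMA_URL = "http://127.0.0.1:11434"     # stable day-to-day backend
TURBOQUANT_URL = "http://127.0.0.1:8001"  # long-context sidecar
TOKEN_THRESHOLD = 4096                    # assumed cutoff, tune for your machine

def estimate_tokens(messages: list[dict]) -> int:
    """Crude token estimate: roughly 4 characters per token, summed
    across every message in the request."""
    chars = sum(len(m.get("content", "")) for m in messages)
    return chars // 4

def choose_backend(model: str, messages: list[dict]) -> str:
    # Explicit override: a "tq:" model prefix always goes to the sidecar.
    if model.startswith("tq:"):
        return TURBOQUANT_URL
    # Otherwise route on estimated prompt size.
    if estimate_tokens(messages) > TOKEN_THRESHOLD:
        return TURBOQUANT_URL
    return OLLAMA_URL
```

Everything else in the router is proxying: forward the request to whichever URL this returns and stream the answer back.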
LaunchAgents on macOS
The stack uses user LaunchAgents so both services can start automatically on login.
This keeps the setup lightweight and local, and avoids introducing a whole extra service manager unless you want one.
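For context, a user LaunchAgent is just a plist in ~/Library/LaunchAgents. A sketch of what the installer generates for the router could look like this; the label, paths, and Python environment are placeholder assumptions, the real files come from install.sh:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>local.turboquant.router</string>
    <key>ProgramArguments</key>
    <array>
        <string>/Users/you/.venvs/tq/bin/python</string>
        <string>/Users/you/turboquant-stack/router.py</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
</dict>
</plist>
```

RunAtLoad starts the service on login and KeepAlive restarts it if it crashes, which is all the "service manager" this stack needs.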
Part 5: where this stack fits with other tools
The reason to expose OpenAI-compatible endpoints is simple: lots of tools already know how to use them.
Open WebUI
Open WebUI can use the routing proxy as the default endpoint:
http://127.0.0.1:8000/v1
You can also add the direct endpoints for comparison and debugging.
Claude Code
If your Claude Code workflow supports OpenAI-compatible local endpoints, the router gives you a single target that can automatically push bigger contexts toward the TurboQuant backend.
That is useful when your workload alternates between:
- short code questions
- broad codebase reasoning
- file-heavy prompts
Antigravity
Anything that benefits from long prompts, many retrieved chunks, or memory-heavy contextual work is a natural fit for the routed endpoint.
The router means you do not have to manually change backends every time the prompt gets fat.
Custom scripts and agent frameworks
If they already speak the chat-completions format, you can plug them into this stack with minimal glue.
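As a concrete example of "minimal glue", here is what a script pointed at the router could look like, using only the standard library. The endpoint and model name are assumptions matching the setup above:

```python
import json
import urllib.request

ROUTER_URL = "http://127.0.0.1:8000/v1/chat/completions"  # the routing proxy

def build_chat_request(model: str, prompt: str) -> dict:
    """Standard chat-completions payload; the same shape works against
    Ollama, the sidecar, or the router."""
    return {
        "model": model,  # prefix with "tq:" to force the TurboQuant sidecar
        "messages": [{"role": "user", "content": prompt}],
    }

def send(payload: dict) -> str:
    """POST the payload to the router and return the assistant's reply."""
    req = urllib.request.Request(
        ROUTER_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires the stack to be running):
# print(send(build_chat_request("tq:my-mlx-model", "Summarize: ...")))
```

Agent frameworks that accept a custom base URL need even less than this: point them at http://127.0.0.1:8000/v1 and they inherit the routing for free.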
(all of the above, and more, have examples in the repository's README.md file)
Part 6: practical examples
Example 1: long-document OCR cleanup
A local OCR pipeline produces a long, noisy chunk of text.
You send it to the routing proxy.
Small pages stay on Ollama. Huge pages go to the TurboQuant sidecar.
Example 2: RAG over multiple PDFs
Your retriever returns many chunks from several documents.
The final prompt is large enough that KV cache pressure matters.
The router pushes the request to TurboQuant.
Example 3: codebase analysis assistant
Small questions like "what does this function do" stay on Ollama.
Larger tasks like "compare these six files and explain the shared state flow" go to the sidecar.
Example 4: mixed interactive use in Open WebUI
Normal chat remains snappy.
When you paste a wall of text and ask for synthesis, the router moves that request to the heavy backend without making you think about it.
Part 7: tradeoffs and limits
This stack is useful, not magical.
Tradeoffs include:
- the sidecar is more experimental than Ollama
- routing heuristics are still heuristics
- upstream repos may change APIs
- model choice still matters a lot
- quality and performance depend on the specific workload
But the upside is real:
it lets a modest Apple Silicon machine behave much better on long-context tasks than a naive single-backend setup.
That is worth the effort.
References and technical documentation
- Google Research: TurboQuant blog post
- Ollama documentation and FAQ for Flash Attention and KV-cache quantization
- MLX framework documentation
- MLX LM documentation and model ecosystem
- sharpner/turboquant-mlx repository
- Open WebUI documentation
- Apple launchd and LaunchAgent documentation
- FastAPI documentation
Final thought
If the local-LLM world has taught me anything, it is this:
people do not need infinite hardware nearly as often as they need less waste.
This repository is a small, slightly mischievous attempt to operationalize that idea on a Mac.
A poor man's stack? Yes.
But a respectable one.