Everyone is talking about TurboQuant, and a lot of people summarize it with a line like this:
run bigger models on smaller hardware
That line is catchy, but it is also where the confusion starts. And yes, it was my initial assumption too: "nice! now I can run that 70B model on my 24 GB unified-memory MacBook."
This article has two goals:
- Explain what TurboQuant actually is, and what it is not
- Show a practical local stack for Apple Silicon that uses TurboQuant where it helps without making the rest of your setup miserable
The stack here is intentionally humble. It is meant for the kind of machine many of us actually have:
- a MacBook with Apple Silicon
- limited unified memory
- a normal-person budget
- perhaps an irrational amount of confidence
Part 1: what TurboQuant is, and what it is not
TurboQuant does not primarily reduce model-weight size.
That is the first thing to get clear.
When people say "it lets you run bigger models on smaller hardware", what they usually mean is more indirect:
- it reduces runtime memory pressure
- that frees memory budget for longer context, more headroom, or somewhat larger configurations
But the thing being compressed is not the main model checkpoint on disk.
It is the KV cache used during inference.
The missing half of memory optimization
A lot of local-LLM discussion focuses on weight quantization:
- GGUF
- AWQ
- 4-bit and 8-bit model variants
- smaller checkpoints that fit into memory
That is useful, but it is only half the story.
At inference time, your memory bill looks more like this:
runtime memory = model weights + KV cache
The KV cache grows with context length. As prompts get larger, and as generations get longer, that cache becomes a major factor.
This is why long-context tasks often feel much worse than people expect. A model that technically fits on your machine can still become impractical once you start doing any of the following:
- stuffing lots of retrieved chunks into a RAG prompt
- cleaning up OCR text from long documents
- summarizing many files at once
- reasoning over a codebase with lots of source pasted in
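To put rough numbers on this, here is a back-of-envelope KV cache estimate. The shapes below (32 layers, 8 KV heads, head dimension 128) are assumptions in the ballpark of a typical 8B-class model with grouped-query attention, not measurements of any specific checkpoint:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Rough KV cache size: 2 tensors (K and V) per layer,
    each of shape [n_kv_heads, seq_len, head_dim]."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed shapes, roughly an 8B-class model with grouped-query attention.
fp16 = kv_cache_bytes(32, 8, 128, seq_len=32_768, bytes_per_elem=2)
print(f"FP16 KV cache at 32k tokens: {fp16 / 2**30:.1f} GiB")  # 4.0 GiB
```

Four gigabytes of cache on top of the weights is exactly the kind of bill that quietly eats a 24 GB machine, and it is the part that cache quantization attacks.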
What TurboQuant brings
TurboQuant attacks the runtime side of the problem.
At a high level, it compresses the KV cache much more aggressively than the standard FP16 representation while trying to preserve quality.
That creates practical benefits such as:
- lower memory pressure during long-context inference
- more headroom for larger prompts
- potentially better concurrency or stability under load
- a more realistic path to doing serious document work on hardware that is not a datacenter card
What TurboQuant does not magically do
It does not mean:
- any huge model now fits comfortably on your laptop
- quality is untouched in every case
- all runtimes support it natively today
- you no longer need weight quantization
The right mental model is this:
- weight quantization compresses the brain
- TurboQuant compresses the model's working memory
If you only optimize one, you still leave useful savings on the table.
Part 2: the engineering decision
Instead of trying to force one runtime to do everything, I chose a split architecture.
Why not just patch everything into Ollama?
Because I wanted two things at once:
- a stable day-to-day local endpoint
- a more experimental path for long-context memory-heavy work
Ollama is excellent for the first. It is simple, ergonomic, and already widely supported by tools.
For the second, a small MLX-based TurboQuant sidecar is a better fit on Apple Silicon today.
That led to this design:
client / UI / code tool
           |
           v
   routing proxy :8000
       /        \
      v          v
  Ollama    TurboQuant sidecar
  :11434        :8001
What each piece does
Ollama
Ollama handles the easy path:
- short chat
- coding help
- routine interactions
- lower-context tasks
It is configured with Flash Attention and KV-cache quantization so it already gets some memory savings.
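For reference, both settings are plain environment variables documented in the Ollama FAQ. On macOS, where Ollama usually runs as an app, `launchctl setenv` is the documented way to set them; the `q8_0` cache type here is my assumed default (a `q4_0` option also exists, with more quality risk):

```shell
# Enable Flash Attention (a prerequisite for KV-cache quantization in Ollama)
launchctl setenv OLLAMA_FLASH_ATTENTION 1
# Quantize the KV cache to 8-bit
launchctl setenv OLLAMA_KV_CACHE_TYPE q8_0
```

Restart Ollama after setting these so the server picks them up.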
TurboQuant MLX sidecar
The sidecar handles the jobs where KV cache pressure dominates:
- long RAG prompts
- OCR cleanup for big documents
- multi-document synthesis
- file-heavy assistant workflows
It exposes an OpenAI-compatible endpoint so it can be used by clients that already know how to talk to that API shape.
Routing proxy
The router removes backend-switching friction.
It inspects requests, estimates prompt size, and decides whether the request should go to Ollama or the sidecar.
That means your clients can often point to a single URL and let the stack make a reasonable choice.
Part 3: one-command install
The code at my repository includes a single installer:
bash install.sh
That installer does the practical work:
- sets recommended Ollama environment variables
- creates a Python environment
- installs FastAPI, MLX, and the required libraries
- clones the TurboQuant MLX dependency if needed
- creates LaunchAgent files for auto-start
- installs the routing proxy and sidecar scripts
- writes Open WebUI usage notes
Why a one-command installer matters
Because experimental stacks die when setup becomes an archaeological project.
If every new machine requires a ritual involving five README tabs and one issue comment from three months ago, the stack is not really usable.
The installer turns this into a reproducible baseline.
Part 4: how the package is implemented
The sidecar
The sidecar is a small FastAPI service that:
- loads an MLX model
- applies the TurboQuant patch
- creates TurboQuant KV caches for each transformer layer
- exposes /v1/chat/completions
That keeps the interface familiar for downstream tools.
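The MLX and TurboQuant calls themselves live in the repository, but the interface contract is easy to show. Here is a minimal sketch of the response shape the sidecar returns; the field names follow the OpenAI chat-completions format, and the helper itself is an illustration, not the repository's actual code:

```python
import time
import uuid

def chat_completion_response(model: str, text: str) -> dict:
    """Wrap generated text in the OpenAI chat-completions response shape,
    so existing clients can consume the sidecar without changes."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": "stop",
        }],
    }
```

Because the envelope matches what clients already parse, swapping the backend under them is invisible.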
The router
The router is another FastAPI service that also exposes /v1/chat/completions.
Its default behavior is deliberately simple:
- estimate prompt size from the combined message length
- use Ollama below a token threshold
- use TurboQuant above that threshold
- allow explicit override using a model prefix like tq:
This is not meant to be the last routing strategy you will ever need. It is meant to be understandable, debuggable, and easy to improve.
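Stripped of the FastAPI plumbing, that decision logic is only a few lines. The chars-per-token ratio and the threshold below are illustrative assumptions; the real values live in the repository's router script:

```python
OLLAMA_URL = "http://127.0.0.1:11434"     # stable day-to-day backend
TURBOQUANT_URL = "http://127.0.0.1:8001"  # long-context sidecar
TOKEN_THRESHOLD = 4096                    # assumed cutoff, tune for your machine

def estimate_tokens(messages: list[dict]) -> int:
    """Crude token estimate: roughly 4 characters per token, summed
    across every message in the request."""
    chars = sum(len(m.get("content", "")) for m in messages)
    return chars // 4

def choose_backend(model: str, messages: list[dict]) -> str:
    # Explicit override: a "tq:" model prefix always goes to the sidecar.
    if model.startswith("tq:"):
        return TURBOQUANT_URL
    # Otherwise route on estimated prompt size.
    if estimate_tokens(messages) > TOKEN_THRESHOLD:
        return TURBOQUANT_URL
    return OLLAMA_URL
```

Everything else in the router is proxying: forward the request to whichever URL this returns and stream the answer back.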
LaunchAgents on macOS
The stack uses user LaunchAgents so both services can start automatically on login.
This keeps the setup lightweight and local, and avoids introducing a whole extra service manager unless you want one.
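For context, a user LaunchAgent is just a plist in ~/Library/LaunchAgents. A sketch of what the installer generates for the router could look like this; the label, paths, and Python environment are placeholder assumptions, the real files come from install.sh:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>local.turboquant.router</string>
    <key>ProgramArguments</key>
    <array>
        <string>/Users/you/.venvs/tq/bin/python</string>
        <string>/Users/you/turboquant-stack/router.py</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
</dict>
</plist>
```

RunAtLoad starts the service on login and KeepAlive restarts it if it crashes, which is all the "service manager" this stack needs.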
Part 5: where this stack fits with other tools
The reason to expose OpenAI-compatible endpoints is simple: lots of tools already know how to use them.
Open WebUI
Open WebUI can use the routing proxy as the default endpoint:
http://127.0.0.1:8000/v1
You can also add the direct endpoints for comparison and debugging.
Claude Code
If your Claude Code workflow supports OpenAI-compatible local endpoints, the router gives you a single target that can automatically push bigger contexts toward the TurboQuant backend.
That is useful when your workload alternates between:
- short code questions
- broad codebase reasoning
- file-heavy prompts
Antigravity
Anything that benefits from long prompts, many retrieved chunks, or memory-heavy contextual work is a natural fit for the routed endpoint.
The router means you do not have to manually change backends every time the prompt gets fat.
Custom scripts and agent frameworks
If they already speak the chat-completions format, you can plug them into this stack with minimal glue.
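As a concrete example of "minimal glue", here is what a script pointed at the router could look like, using only the standard library. The endpoint and model name are assumptions matching the setup above:

```python
import json
import urllib.request

ROUTER_URL = "http://127.0.0.1:8000/v1/chat/completions"  # the routing proxy

def build_chat_request(model: str, prompt: str) -> dict:
    """Standard chat-completions payload; the same shape works against
    Ollama, the sidecar, or the router."""
    return {
        "model": model,  # prefix with "tq:" to force the TurboQuant sidecar
        "messages": [{"role": "user", "content": prompt}],
    }

def send(payload: dict) -> str:
    """POST the payload to the router and return the assistant's reply."""
    req = urllib.request.Request(
        ROUTER_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires the stack to be running):
# print(send(build_chat_request("tq:my-mlx-model", "Summarize: ...")))
```

Agent frameworks that accept a custom base URL need even less than this: point them at http://127.0.0.1:8000/v1 and they inherit the routing for free.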
(all of the above, and more, have examples in the repository's README.md file)
Part 6: practical examples
Example 1: long-document OCR cleanup
A local OCR pipeline produces a long, noisy chunk of text.
You send it to the routing proxy.
Small pages stay on Ollama. Huge pages go to the TurboQuant sidecar.
Example 2: RAG over multiple PDFs
Your retriever returns many chunks from several documents.
The final prompt is large enough that KV cache pressure matters.
The router pushes the request to TurboQuant.
Example 3: codebase analysis assistant
Small questions like "what does this function do" stay on Ollama.
Larger tasks like "compare these six files and explain the shared state flow" go to the sidecar.
Example 4: mixed interactive use in Open WebUI
Normal chat remains snappy.
When you paste a wall of text and ask for synthesis, the router moves that request to the heavy backend without making you think about it.
Part 7: tradeoffs and limits
This stack is useful, not magical.
Tradeoffs include:
- the sidecar is more experimental than Ollama
- routing heuristics are still heuristics
- upstream repos may change APIs
- model choice still matters a lot
- quality and performance depend on the specific workload
But the upside is real:
it lets a modest Apple Silicon machine behave much better on long-context tasks than a naive single-backend setup.
That is worth the effort.
References and technical documentation
- Google Research: TurboQuant blog post
- Ollama documentation and FAQ for Flash Attention and KV-cache quantization
- MLX framework documentation
- MLX LM documentation and model ecosystem
- sharpner/turboquant-mlx repository
- Open WebUI documentation
- Apple launchd and LaunchAgent documentation
- FastAPI documentation
Final thought
If the local-LLM world has taught me anything, it is this:
people do not need infinite hardware nearly as often as they need less waste.
This repository is a small, slightly mischievous attempt to operationalize that idea on a Mac.
A poor man's stack? Yes.
But a respectable one.