carlosortet

Posted on • Originally published at zoopa.es

From expensive tokens to intelligent compression: how we optimize LLM costs in production

We spend absurd amounts on AI tokens. And that number is only going up.

At 498Advance we run multiple LLMs in production — Claude for development, Gemini for multimodal, DeepSeek and OpenAI models locally for routine tasks. Every model does something well and fails at something else. That is why they coexist.

But this creates a problem: dependency and cost. What happens when a provider goes down? What happens when pricing changes overnight?

Here is how we deal with it, and why a new Google Research paper caught our attention this week.

Layer 1: Fallback policies

If a model fails, the system automatically redirects to the next available model. No human intervention, no perceptible downtime.

# Simplified fallback logic
def inference(prompt, task_type):
    # Models are ranked per task type, so each task falls back through
    # its own preference order rather than a single global list.
    for model in get_ranked_models(task_type):
        try:
            return call_model(model, prompt)
        except ModelUnavailable:
            log.warning(f"{model} unavailable, falling back")
    raise AllModelsUnavailable()

Simple but effective. The key is having your models ranked per task type, not globally.
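As a sketch of what per-task ranking can look like (the model names and orderings below are illustrative assumptions, not our actual production configuration):

```python
# Hypothetical per-task preference tables; adjust to your own benchmarks.
TASK_RANKINGS = {
    "code_generation": ["claude-opus", "gpt-4o", "deepseek-local"],
    "summarization": ["gemini-pro", "gpt-4o", "deepseek-local"],
    "extraction": ["deepseek-local", "gemini-pro", "gpt-4o"],
}
DEFAULT_RANKING = ["claude-opus", "gpt-4o", "gemini-pro", "deepseek-local"]

def get_ranked_models(task_type):
    """Return models in preference order for this task type."""
    return TASK_RANKINGS.get(task_type, DEFAULT_RANKING)
```

The point is that "best model" is a function of the task: for extraction, the cheap local model is the *first* choice, not the last resort.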

Layer 2: Router shadow

Not every task needs a frontier model. A two-line summary does not need Claude Opus. A 50-page legal analysis does.

Router shadow evaluates each incoming task and routes it to the optimal model based on complexity and cost. We categorize tasks into tiers:

| Tier | Task type | Model class | Cost |
|------|-----------|-------------|------|
| 1 | Simple extraction, formatting | Local (DeepSeek 7B) | ~$0 |
| 2 | Summarization, translation | Mid-tier API (Haiku, Flash) | Low |
| 3 | Complex analysis, code generation | Frontier (Opus, GPT-4o) | High |

The result: cost optimization per project without sacrificing quality where it matters.
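A toy version of the routing decision might look like this (the thresholds and tier assignments are illustrative assumptions; a real router would use richer signals than task type and input length):

```python
# Map each tier to a representative model (hypothetical names).
TIER_MODELS = {1: "deepseek-7b-local", 2: "claude-haiku", 3: "claude-opus"}

def route(task_type, input_tokens):
    """Pick a tier from task type and input size, return that tier's model."""
    if task_type in ("extraction", "formatting"):
        tier = 1
    elif task_type in ("summarization", "translation") and input_tokens < 8000:
        tier = 2
    else:
        tier = 3  # complex analysis, code generation, very long inputs
    return TIER_MODELS[tier]
```

Note the length check: a "summarization" task over a huge input gets promoted to tier 3, because misrouting upward is cheaper than a bad answer.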

Layer 3: Local models

At 498Advance we have been running DeepSeek and OpenAI models locally for three months. They handle a significant portion of production tasks.

The benefits go beyond cost:

  • Security: data never leaves your infrastructure
  • Compliance: concrete guarantees about where data is processed
  • Latency: no network round-trip for simple tasks
  • Availability: no dependency on external uptime

The trade-off: local models are not frontier models. You lose capability on complex tasks. The strategy is selective migration — identify what can run locally, move it, keep frontier for what needs it.

The compression landscape

At some point, better hardware is not enough. You need efficiency.

LLMs keep growing — tens or hundreds of billions of parameters. The compression techniques that make them deployable:

Quantization reduces weight precision. A quantized Llama 70B fits on a single NVIDIA A100; unquantized, it needs four.
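The arithmetic behind that claim is straightforward (weights only, ignoring activations and KV cache):

```python
def model_memory_gb(params_billion, bits_per_weight):
    """Approximate weight memory in GB, ignoring activations and KV cache."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

fp16 = model_memory_gb(70, 16)  # ~140 GB: several 80 GB A100s
int4 = model_memory_gb(70, 4)   # ~35 GB: fits on one A100
```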

Pruning removes low-relevance weights. 2:4 Sparse Llama achieved 98.4% accuracy recovery on the Open LLM Leaderboard V1, with +30% throughput and -20% latency from sparsity alone.

Knowledge distillation trains a small student model to replicate a large teacher's behavior.

These are not mutually exclusive. Sparsity + quantization yields improvements greater than either alone.
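The core of the distillation objective is small enough to show in full: the student is trained to match the teacher's temperature-softened output distribution. Here is a dependency-free toy of that loss term (real training would use a framework and combine this with a hard-label loss):

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution, softened by temperature."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions: the core
    knowledge-distillation term. Zero when the student matches exactly."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```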

Real-world examples

LinkedIn built domain-adapted EON models on open source LLMs with proprietary data, reducing prompt size by 30%.

Roblox scaled from <50 to ~250 concurrent ML inference pipelines using Ray and vLLM.

Red Hat maintains pre-optimized models on Hugging Face (Llama, Qwen, DeepSeek, Granite) — quantized and ready for inference with vLLM.

TurboQuant: the paper that caught our attention

On March 24, 2026, Google Research published TurboQuant (ICLR 2026). Authors: Amir Zandieh and Vahab Mirrokni.

The headline numbers:

  • 6x minimum KV cache memory reduction
  • 8x speedup with 4-bit quantization on H100 GPUs
  • 3-bit KV cache quantization with zero accuracy loss
  • No fine-tuning or retraining required
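To put those multipliers in context, here is a back-of-the-envelope KV-cache estimate for a hypothetical 8B-class model (32 layers, 8 KV heads, head dimension 128; illustrative numbers, not from the paper):

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bits):
    """Keys + values across all layers for one sequence, in GB."""
    elements = 2 * layers * kv_heads * head_dim * seq_len  # 2 = K and V
    return elements * bits / 8 / 1e9

full = kv_cache_gb(32, 8, 128, 128_000, 16)  # ~16.8 GB at 16-bit
q3 = kv_cache_gb(32, 8, 128, 128_000, 3)     # ~3.1 GB at 3-bit
```

At a 128K context, the 16-bit cache alone eats a large slice of an 80 GB GPU; dropping to 3 bits frees that memory for longer contexts or more concurrent sequences.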

Why it matters technically

Traditional quantization has a memory overhead problem. Most methods need to store quantization constants for each data block, adding 1-2 extra bits per number. TurboQuant eliminates this.

It combines two algorithms:

PolarQuant converts vectors from Cartesian to polar coordinates. Instead of normalizing data on a shifting square grid, it maps to a fixed circular grid where boundaries are known. This eliminates the normalization overhead.

QJL (Quantized Johnson-Lindenstrauss) compresses the residual error from PolarQuant to a single sign bit (+1 or -1) using the JL Transform. Zero memory overhead.

The pipeline:

  1. PolarQuant compresses with most of the bits
  2. QJL uses 1 bit to correct residual bias
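The two-stage idea (coarse quantizer plus a one-bit residual correction) can be illustrated with a scalar toy. This is NOT the actual PolarQuant/QJL math, just the shape of the trick: store one sign bit for the residual and correct by its expected magnitude at decode time.

```python
def quantize(x, step=0.25):
    """Stage 1: coarse uniform quantization (most of the bits).
    Stage 2: a single sign bit for the residual."""
    coarse = round(x / step) * step
    sign = 1 if x - coarse >= 0 else -1
    return coarse, sign

def dequantize(coarse, sign, step=0.25):
    # The residual is uniform on [-step/2, step/2], so its expected
    # magnitude is step/4; the sign bit tells us which way to nudge.
    return coarse + sign * step / 4
```

One bit of side information halves the expected reconstruction error of the coarse quantizer, with no per-block constants to store.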

Benchmark results

Tested on LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval with Gemma and Mistral:

  • TurboQuant (KV: 3.5 bits) scores 50.06 on LongBench — identical to Full Cache (KV: 16 bits)
  • KIVI needs 5 bits for 50.16, drops to 48.50 at 3 bits
  • Perfect needle-in-haystack results with 6x memory reduction

For vector search, TurboQuant outperforms PQ and RaBitQ in recall even when those baselines use large codebooks and dataset-specific tuning.

What this means in practice

TurboQuant is a research paper, not a product. But the direction is clear:

  • Same hardware, bigger models: 6x KV cache compression means the GPU running an 8B model could handle something significantly larger
  • Lower inference costs: 8x attention speedup = fewer GPUs for the same workload
  • Edge deployment: compression is what separates "interesting demo" from "deployable product"
  • Simpler compliance: smaller models running locally = less data traveling externally

The race is not for bigger models. It is for models that are smarter about how they use their resources.

What compression or optimization techniques are you using in production? Have you tried running models locally? Would love to hear about your setup.

Carlos Ortet | 498Advance
