<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Patrick Bertsch</title>
    <description>The latest articles on DEV Community by Patrick Bertsch (@patrick_bertsch_056239a0e).</description>
    <link>https://dev.to/patrick_bertsch_056239a0e</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1631390%2F96a73516-b96e-449f-9518-1eb17506feef.png</url>
      <title>DEV Community: Patrick Bertsch</title>
      <link>https://dev.to/patrick_bertsch_056239a0e</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/patrick_bertsch_056239a0e"/>
    <language>en</language>
    <item>
      <title>We Built a Python Library That Cuts LLM Memory Usage by 80%</title>
      <dc:creator>Patrick Bertsch</dc:creator>
      <pubDate>Sat, 04 Apr 2026 22:17:01 +0000</pubDate>
      <link>https://dev.to/patrick_bertsch_056239a0e/we-built-a-python-library-that-cuts-llm-memory-usage-by-80-39b8</link>
      <guid>https://dev.to/patrick_bertsch_056239a0e/we-built-a-python-library-that-cuts-llm-memory-usage-by-80-39b8</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;If you've run LLMs locally, you know the pain: a 14B model eats 10+ GB just for the KV cache on long prompts. The model weights fit in memory, but the cache — where attention stores every key and value vector for every token — grows linearly with context length and eventually pushes you into swap or OOM.&lt;/p&gt;

&lt;p&gt;The standard approach is to quantize the model weights (Q4, Q8), but nobody touches the KV cache. It sits there in full FP16 precision, quietly eating 30-50% of your total memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Paper
&lt;/h2&gt;

&lt;p&gt;Google Research published &lt;a href="https://arxiv.org/abs/2504.19874" rel="noopener noreferrer"&gt;TurboQuant&lt;/a&gt; at ICLR 2026. The core idea is surprisingly elegant:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Rotate&lt;/strong&gt; the KV vectors by a random orthogonal matrix — this spreads information uniformly across all coordinates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quantize&lt;/strong&gt; each coordinate independently using precomputed optimal codebooks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store the norm&lt;/strong&gt; separately in FP16&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. No training. No calibration data. No model-specific tuning. The same codebooks work for Llama, Qwen, Mistral — anything.&lt;/p&gt;
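&lt;p&gt;The three steps can be sketched in a few lines of numpy. This is a hypothetical illustration, not tqai's code: a uniform codebook stands in for the precomputed Lloyd-Max one, and the rotation comes from a QR decomposition of a random matrix.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128  # head dimension

# Step 1: a random orthogonal rotation, built via QR decomposition
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Toy uniform 4-bit codebook; the real scheme uses precomputed Lloyd-Max levels
codebook = np.linspace(-4 / np.sqrt(d), 4 / np.sqrt(d), 16)

def compress(v):
    norm = np.linalg.norm(v)                  # Step 3: keep the norm separately
    unit = (R @ v) / norm                     # Step 1: rotate, then normalize
    # Step 2: quantize each coordinate independently to the nearest level
    idx = np.abs(unit[:, None] - codebook[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), np.float16(norm)

def decompress(idx, norm):
    return R.T @ (codebook[idx] * float(norm))  # undo rotation, restore scale

v = rng.standard_normal(d)
idx, n = compress(v)
v_hat = decompress(idx, n)
rel_err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
```

&lt;p&gt;Even with the crude uniform codebook, the reconstruction error of a random vector stays small; the optimal codebooks tighten it further.&lt;/p&gt;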

&lt;p&gt;The key insight is that after rotation, each coordinate follows a known Gaussian distribution (N(0, 1/d) where d is the head dimension). Since you know the distribution in advance, you can precompute the optimal Lloyd-Max quantizer offline. This makes the whole thing &lt;strong&gt;data-oblivious&lt;/strong&gt; — you don't need to see a single token from the model to set up compression.&lt;/p&gt;
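&lt;p&gt;Because the post-rotation distribution is known, the codebook can be fit once, offline, with the classic Lloyd-Max iteration. A minimal sketch of that fit (not tqai's actual codebook generator):&lt;/p&gt;

```python
import numpy as np

def lloyd_max(samples, bits, iters=30):
    """Fit a Lloyd-Max scalar quantizer to an empirical distribution."""
    levels = 2 ** bits
    # start from evenly spaced quantiles of the samples
    code = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        # assign each sample to its nearest level
        idx = np.abs(samples[:, None] - code[None, :]).argmin(axis=1)
        # move each level to the centroid of its cell
        for k in range(levels):
            members = samples[idx == k]
            if members.size:
                code[k] = members.mean()
    return code

d = 128
rng = np.random.default_rng(0)
# coordinates of a rotated, normalized vector are approximately N(0, 1/d)
samples = rng.normal(0.0, 1.0 / np.sqrt(d), size=100_000)
code4 = lloyd_max(samples, bits=4)

# distortion of the fitted 4-bit quantizer, normalized by the signal variance
q = code4[np.abs(samples[:, None] - code4[None, :]).argmin(axis=1)]
nmse = np.mean((samples - q) ** 2) / np.var(samples)
```

&lt;p&gt;For a 4-bit Gaussian quantizer the NMSE lands around 1%, in the same ballpark as the benchmark numbers later in the post.&lt;/p&gt;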

&lt;h3&gt;
  
  
  Why not both stages?
&lt;/h3&gt;

&lt;p&gt;The paper actually has two stages. Stage 2 (QJL) adds a 1-bit residual correction for unbiased inner products. We skip it. &lt;a href="https://github.com/tonbistudio/turboquant-pytorch" rel="noopener noreferrer"&gt;Independent research&lt;/a&gt; found that QJL's variance amplification actually degrades softmax-based attention. Stage 1 alone produces better results for KV cache compression.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Library
&lt;/h2&gt;

&lt;p&gt;We turned this into &lt;a href="https://github.com/AlphaWaveSystems/tqai" rel="noopener noreferrer"&gt;&lt;strong&gt;tqai&lt;/strong&gt;&lt;/a&gt; — a pip-installable Python library with two backends (PyTorch and MLX) and a CLI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Two lines to compress
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tqai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen2.5-3B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen2.5-3B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# This is the only change
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tqai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bits_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bits_v&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain quantum entanglement:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;past_key_values&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Apple Silicon with MLX:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tqai&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlx_lm&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlx_lm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mlx-community/Llama-3.1-8B-Instruct-4bit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tqai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bits_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bits_v&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mlx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlx_lm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain quantum entanglement:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Compression numbers
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;Avg Bits&lt;/th&gt;
&lt;th&gt;Memory Saved&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;K4/V2&lt;/td&gt;
&lt;td&gt;3.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Production&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;K3/V2&lt;/td&gt;
&lt;td&gt;2.5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extended context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;K4/V3&lt;/td&gt;
&lt;td&gt;3.5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Quality-sensitive&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Original KV cache: 16 bits per coordinate (FP16). With K4/V2 and bit-packed indices: 512 bytes/token → 100 bytes/token.&lt;/p&gt;
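&lt;p&gt;The arithmetic behind those figures, under some illustrative assumptions (head dimension 128, one K and one V vector per token, bit-packed indices, one FP16 norm per vector):&lt;/p&gt;

```python
HEAD_DIM = 128
FP16_BYTES = 2

def compressed_bytes_per_token(bits_k, bits_v, head_dim=HEAD_DIM):
    packed = head_dim * (bits_k + bits_v) / 8   # bit-packed quantizer indices
    norms = 2 * FP16_BYTES                      # one FP16 norm each for K and V
    return packed + norms

baseline = 2 * HEAD_DIM * FP16_BYTES            # FP16 K + V: 512 bytes/token
k4v2 = compressed_bytes_per_token(4, 2)         # 100 bytes/token
saving = 1 - k4v2 / baseline                    # roughly 80%
```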

&lt;h3&gt;
  
  
  Does it actually work?
&lt;/h3&gt;

&lt;p&gt;We tested across model sizes. The pattern is clear: how well a model tolerates compression depends on its size, not on the bit width:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;+ tqai K4/V2&lt;/th&gt;
&lt;th&gt;+ tqai K3/V2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 0.5B&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Degraded&lt;/td&gt;
&lt;td&gt;Poor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3B&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Degraded&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Llama 8B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Excellent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Excellent&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen 14B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Excellent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Excellent&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On 8B+ models, the compressed output is indistinguishable from baseline. Here's a real example from Qwen 14B Q4:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Baseline&lt;/strong&gt;: "particles become interconnected so that the state of one particle cannot be described independently of the state of the others"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;K4/V2&lt;/strong&gt;: "particles become interconnected so that the state of one particle cannot be described without including the state of the other"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;K3/V2&lt;/strong&gt;: "two or more particles become interconnected such that the state of one particle can instantly influence the state of another"&lt;/p&gt;

&lt;p&gt;All three are coherent, factually correct, grammatically perfect.&lt;/p&gt;

&lt;h2&gt;
  
  
  The CLI
&lt;/h2&gt;

&lt;p&gt;tqai ships with a CLI tool for quick testing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Environment info&lt;/span&gt;
tqai info

&lt;span class="c"&gt;# Accuracy benchmark (no model needed)&lt;/span&gt;
tqai benchmark
&lt;span class="c"&gt;# Output:&lt;/span&gt;
&lt;span class="c"&gt;# Keys (4-bit): NMSE=0.009287, SNR=20.3 dB, cosine sim=0.9954&lt;/span&gt;
&lt;span class="c"&gt;# Values (2-bit): NMSE=0.115653, SNR=9.4 dB, cosine sim=0.9408&lt;/span&gt;

&lt;span class="c"&gt;# Generate with compression&lt;/span&gt;
tqai run &lt;span class="s2"&gt;"Explain gravity"&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; mlx-community/Llama-3.1-8B-Instruct-4bit

&lt;span class="c"&gt;# Side-by-side comparison&lt;/span&gt;
tqai compare &lt;span class="s2"&gt;"Explain gravity"&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; mlx-community/Llama-3.1-8B-Instruct-4bit

&lt;span class="c"&gt;# Pre-convert for faster startup&lt;/span&gt;
tqai convert &lt;span class="nt"&gt;-m&lt;/span&gt; mlx-community/Llama-3.1-8B-Instruct-4bit &lt;span class="nt"&gt;-o&lt;/span&gt; ./llama-tqai/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Under the Hood
&lt;/h2&gt;

&lt;p&gt;The architecture is intentionally simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/tqai/
├── quantizer.py     # PolarQuantizer — the core algorithm (~100 lines)
├── backend/         # PyTorch + MLX abstraction (Protocol-based, ~80 lines each)
├── codebook/        # Precomputed Lloyd-Max codebooks (12 .npz files, ~50KB)
├── cache/           # HuggingFace DynamicCache + mlx-lm KVCache wrappers
├── convert.py       # Offline model conversion
└── cli.py           # CLI tool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Backend abstraction&lt;/strong&gt;: A Python Protocol with ~15 ops (matmul, qr, norm, argmin, etc.). Each backend is ~80 lines. Adding a new backend (JAX, ONNX) means implementing one file.&lt;/p&gt;
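&lt;p&gt;The Protocol pattern looks roughly like this (a trimmed, hypothetical version with three ops instead of the real ~15):&lt;/p&gt;

```python
from typing import Protocol, runtime_checkable

import numpy as np

@runtime_checkable
class Backend(Protocol):
    """A few representative ops; the real abstraction has around 15."""
    def matmul(self, a, b): ...
    def norm(self, x): ...
    def argmin(self, x, axis): ...

class NumpyBackend:
    def matmul(self, a, b): return a @ b
    def norm(self, x): return np.linalg.norm(x)
    def argmin(self, x, axis): return np.argmin(x, axis=axis)

backend: Backend = NumpyBackend()   # structural typing: no inheritance needed
dist = backend.norm(np.array([3.0, 4.0]))
```

&lt;p&gt;Because the check is structural, a JAX or ONNX backend only has to implement the same method names; nothing else in the library changes.&lt;/p&gt;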

&lt;p&gt;&lt;strong&gt;Codebooks&lt;/strong&gt;: Precomputed for head dimensions 64, 96, 128, 256 at 2/3/4 bits. Shipped as package data. If your model uses an unusual head dim, they're generated at runtime (requires scipy).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No monkey-patching of model code&lt;/strong&gt;: For HuggingFace, we subclass &lt;code&gt;DynamicCache&lt;/code&gt; — the model calls &lt;code&gt;cache.update()&lt;/code&gt; as normal, we compress transparently. For MLX, we replace the cache factory.&lt;/p&gt;
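&lt;p&gt;The shape of that wrapper pattern, sketched against a toy cache rather than the real &lt;code&gt;DynamicCache&lt;/code&gt; (and with an FP16 round-trip standing in for the real quantizer):&lt;/p&gt;

```python
import numpy as np

class DenseCache:
    """Minimal stand-in for a framework KV cache (for illustration only)."""
    def __init__(self):
        self.keys, self.values = [], []
    def update(self, key, value):
        self.keys.append(key)
        self.values.append(value)
        return np.stack(self.keys), np.stack(self.values)

class CompressingCache(DenseCache):
    """Compress inside update(); hand dense tensors back to the caller."""
    def __init__(self, compress, decompress):
        super().__init__()
        self.compress, self.decompress = compress, decompress
    def update(self, key, value):
        ks, vs = super().update(self.compress(key), self.compress(value))
        # the attention code receives dense tensors and never sees the difference
        return self.decompress(ks), self.decompress(vs)

# toy "compression": FP16 round-trip stands in for the real quantizer
cache = CompressingCache(lambda x: x.astype(np.float16),
                         lambda x: x.astype(np.float32))
k, v = cache.update(np.ones(4, dtype=np.float32), np.zeros(4, dtype=np.float32))
```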

&lt;h2&gt;
  
  
  Test Suite
&lt;/h2&gt;

&lt;p&gt;179 tests covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mathematical guarantees&lt;/strong&gt;: MSE distortion within the paper's theoretical bound ((√3·π/2) · 4^-b)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attention fidelity&lt;/strong&gt;: Full softmax(Q@K^T/√d)@V simulation with cosine similarity checks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inner product preservation&lt;/strong&gt;: Correlation and absolute error of Q@K^T&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge cases&lt;/strong&gt;: Zero vectors, extreme values, sparse vectors, high dimensions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistical properties&lt;/strong&gt;: Unbiasedness, rotation distribution validation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-backend&lt;/strong&gt;: Torch and MLX produce equivalent results&lt;/li&gt;
&lt;/ul&gt;
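&lt;p&gt;The attention-fidelity check can be sketched like this (hypothetical, not the actual test suite; additive noise stands in for quantization error on the cached K/V):&lt;/p&gt;

```python
import numpy as np

def attention(q, K, V):
    """Single-query softmax attention: softmax(q @ K.T / sqrt(d)) @ V."""
    d = q.shape[-1]
    w = np.exp(q @ K.T / np.sqrt(d))
    w = w / w.sum()
    return w @ V

rng = np.random.default_rng(0)
d, n = 64, 32
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# additive noise stands in for quantization error on the cached K/V
K_hat = K + 0.01 * rng.standard_normal((n, d))
V_hat = V + 0.01 * rng.standard_normal((n, d))

out = attention(q, K, V)
out_hat = attention(q, K_hat, V_hat)
cos = out @ out_hat / (np.linalg.norm(out) * np.linalg.norm(out_hat))
```

&lt;p&gt;The point of testing the full softmax path, rather than raw reconstruction error, is that softmax reweights errors nonuniformly, which is exactly why the QJL residual stage can hurt here.&lt;/p&gt;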

&lt;p&gt;CI runs on both Linux (PyTorch) and macOS (PyTorch + MLX).&lt;/p&gt;

&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Just the library&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;tqai

&lt;span class="c"&gt;# With PyTorch&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;tqai[torch]

&lt;span class="c"&gt;# With MLX (Apple Silicon)&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;tqai[mlx]

&lt;span class="c"&gt;# macOS via Homebrew&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;alphawavesystems/tap/tqai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bit-packing&lt;/strong&gt;: Currently indices are stored as uint8. Packing to actual 2/3/4 bits would achieve the full theoretical 5-6x compression in memory (not just on disk).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Triton kernels&lt;/strong&gt;: Fused decode kernels that compute attention directly on compressed data without dequantizing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM adapter&lt;/strong&gt;: Production serving integration.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/AlphaWaveSystems/tqai" rel="noopener noreferrer"&gt;AlphaWaveSystems/tqai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: &lt;a href="https://pypi.org/project/tqai/" rel="noopener noreferrer"&gt;pypi.org/project/tqai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Paper&lt;/strong&gt;: &lt;a href="https://arxiv.org/abs/2504.19874" rel="noopener noreferrer"&gt;arXiv:2504.19874&lt;/a&gt; (TurboQuant, Google Research, ICLR 2026)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Related&lt;/strong&gt;: &lt;a href="https://arxiv.org/abs/2502.02617" rel="noopener noreferrer"&gt;PolarQuant&lt;/a&gt; (AISTATS 2026), &lt;a href="https://dl.acm.org/doi/10.1609/aaai.v39i24.34773" rel="noopener noreferrer"&gt;QJL&lt;/a&gt; (AAAI 2025)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MIT licensed. 179 tests. Contributions welcome — DCO sign-off required.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I got frustrated with Flutter E2E testing… so I built my own tool</title>
      <dc:creator>Patrick Bertsch</dc:creator>
      <pubDate>Wed, 25 Mar 2026 18:09:19 +0000</pubDate>
      <link>https://dev.to/patrick_bertsch_056239a0e/i-got-frustrated-with-flutter-e2e-testing-so-i-built-my-own-tool-534i</link>
      <guid>https://dev.to/patrick_bertsch_056239a0e/i-got-frustrated-with-flutter-e2e-testing-so-i-built-my-own-tool-534i</guid>
      <description>&lt;p&gt;If you've done end-to-end (E2E) testing in Flutter, you probably know the feeling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tests get &lt;strong&gt;slow&lt;/strong&gt; as they grow&lt;/li&gt;
&lt;li&gt;Debugging is painful&lt;/li&gt;
&lt;li&gt;Writing tests feels heavier than it should&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hit those limits pretty quickly with &lt;code&gt;integration_test&lt;/code&gt;. I tried other options (including Patrol), but I still wanted something that felt faster and simpler — something where writing tests didn't feel like a chore.&lt;/p&gt;

&lt;p&gt;So I started building my own Flutter E2E testing framework:&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://github.com/AlphaWaveSystems/flutter-probe" rel="noopener noreferrer"&gt;FlutterProbe&lt;/a&gt;&lt;/strong&gt; — open-source, BSL 1.1 licensed&lt;/p&gt;




&lt;h2&gt;
  
  
  The goal
&lt;/h2&gt;

&lt;p&gt;I wasn't trying to reinvent testing — just make it feel better to use.&lt;/p&gt;

&lt;p&gt;What I wanted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⚡ &lt;strong&gt;Fast feedback&lt;/strong&gt; — closer to unit test speed, not minutes-long integration runs&lt;/li&gt;
&lt;li&gt;✍️ &lt;strong&gt;Tests anyone can read&lt;/strong&gt; — not just the developer who wrote them&lt;/li&gt;
&lt;li&gt;🧪 &lt;strong&gt;Less flakiness&lt;/strong&gt; — no more &lt;code&gt;pumpAndSettle&lt;/code&gt; timeouts&lt;/li&gt;
&lt;li&gt;🧠 &lt;strong&gt;A simple mental model&lt;/strong&gt; — describe what the user does, not how the framework works&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What it looks like
&lt;/h2&gt;

&lt;p&gt;Instead of Dart boilerplate with &lt;code&gt;WidgetTester&lt;/code&gt; and &lt;code&gt;find.byKey&lt;/code&gt;, you write plain English:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test "user can sign in with valid credentials"
  open the app
  tap "Sign In"
  type "test@example.com" into "Email"
  type "password123" into "Password"
  tap "Continue"
  see "Dashboard"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's actual ProbeScript — the test language FlutterProbe uses.&lt;br&gt;
No Dart imports, no &lt;code&gt;pumpAndSettle&lt;/code&gt;, no &lt;code&gt;find.byType&lt;/code&gt;. Just behavior.&lt;/p&gt;

&lt;p&gt;Under the hood, FlutterProbe connects to a Dart agent running inside your app via WebSocket. The agent walks the live widget tree directly — no UI automation layer, no WebDriver.&lt;/p&gt;

&lt;p&gt;👉 This is how it achieves &lt;strong&gt;sub-50ms command round-trips&lt;/strong&gt;.&lt;/p&gt;
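&lt;p&gt;The pattern is easy to see in miniature. Below is a hypothetical Python sketch of the idea only — FlutterProbe's real agent is Dart and its wire protocol is its own: a runner sends one JSON command per step over a socket, and an in-app agent resolves it against the widget tree and replies.&lt;/p&gt;

```python
import json
import socket
import threading

def agent(server):
    """Toy in-app agent: resolves commands against a fake widget tree."""
    widget_tree = {"Sign In": "button", "Email": "field", "Dashboard": "text"}
    conn, _ = server.accept()
    with conn, conn.makefile("r") as rf, conn.makefile("w") as wf:
        for line in rf:                            # one JSON command per line
            cmd = json.loads(line)
            found = cmd["target"] in widget_tree   # "walk" the tree directly
            wf.write(json.dumps({"ok": found}) + "\n")
            wf.flush()

server = socket.create_server(("127.0.0.1", 0))
threading.Thread(target=agent, args=(server,), daemon=True).start()

# the runner side: each test step is a single command/response round-trip
with socket.create_connection(server.getsockname()) as sock:
    with sock.makefile("r") as rf, sock.makefile("w") as wf:
        wf.write(json.dumps({"op": "tap", "target": "Sign In"}) + "\n")
        wf.flush()
        reply = json.loads(rf.readline())
```

&lt;p&gt;With no driver stack between runner and app, each step costs one socket round-trip plus a tree lookup, which is where the speed comes from.&lt;/p&gt;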




&lt;h2&gt;
  
  
  Why not just use &lt;code&gt;integration_test&lt;/code&gt;?
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;integration_test&lt;/code&gt; is Flutter's official E2E option, and it's solid for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Official support and ecosystem integration&lt;/li&gt;
&lt;li&gt;Basic smoke tests and simple flows&lt;/li&gt;
&lt;li&gt;Teams already deep in the Dart test ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But for me, it starts to hurt when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tests grow in size — the boilerplate compounds fast&lt;/li&gt;
&lt;li&gt;You need faster iteration — full rebuilds on every change&lt;/li&gt;
&lt;li&gt;You want cleaner test code — &lt;code&gt;tester.pumpAndSettle()&lt;/code&gt; everywhere&lt;/li&gt;
&lt;li&gt;You need reporting — no built-in JUnit/JSON/HTML reports&lt;/li&gt;
&lt;li&gt;You want CI/CD parallelism — no sharding support out of the box&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;FlutterProbe addresses each of these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Human-readable syntax&lt;/li&gt;
&lt;li&gt;Sub-50ms execution&lt;/li&gt;
&lt;li&gt;Built-in reporters&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--shard&lt;/code&gt; for CI matrix jobs&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--parallel&lt;/code&gt; for multi-device runs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 &lt;a href="https://flutterprobe.dev/comparisons/integration-test-alternative/" rel="noopener noreferrer"&gt;Full comparison: FlutterProbe vs integration_test&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  And Patrol?
&lt;/h2&gt;

&lt;p&gt;Patrol solves a lot — especially around native interactions (permission dialogs, system alerts, notifications). It's a serious tool and a real step up from vanilla &lt;code&gt;integration_test&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;FlutterProbe is trying something slightly different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Plain English syntax&lt;/strong&gt; — readable by QA, PMs, and developers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Direct widget-tree access&lt;/strong&gt; — no Appium, no native automation layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt; — sub-50ms per command&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Migration&lt;/strong&gt; — supports Maestro, Gherkin, Robot Framework, Detox, Appium, and other formats&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud device farms&lt;/strong&gt; — BrowserStack, Sauce Labs, AWS Device Farm, Firebase Test Lab, LambdaTest&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you need native OS interactions, Patrol is the better choice.&lt;br&gt;
If you want speed, readability, and CI/CD-first design, FlutterProbe is worth a look.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://flutterprobe.dev/comparisons/patrol-alternative/" rel="noopener noreferrer"&gt;Full comparison: FlutterProbe vs Patrol&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What's different so far
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;FlutterProbe&lt;/th&gt;
&lt;th&gt;integration_test&lt;/th&gt;
&lt;th&gt;Patrol&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Test syntax&lt;/td&gt;
&lt;td&gt;Plain English (ProbeScript)&lt;/td&gt;
&lt;td&gt;Dart&lt;/td&gt;
&lt;td&gt;Dart + custom finders&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Execution speed&lt;/td&gt;
&lt;td&gt;&amp;lt;50ms per command&lt;/td&gt;
&lt;td&gt;~200–500ms&lt;/td&gt;
&lt;td&gt;~100–300ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD sharding&lt;/td&gt;
&lt;td&gt;Built-in (&lt;code&gt;--shard&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallel devices&lt;/td&gt;
&lt;td&gt;&lt;code&gt;--parallel&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud device farms&lt;/td&gt;
&lt;td&gt;5 providers&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visual regression&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test recording&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Migration tools&lt;/td&gt;
&lt;td&gt;7 formats&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reports (HTML/JSON/JUnit)&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Plus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-healing selectors&lt;/li&gt;
&lt;li&gt;Data-driven tests (CSV support)&lt;/li&gt;
&lt;li&gt;Random data generators&lt;/li&gt;
&lt;li&gt;Clipboard, GPS, permission commands&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;before all&lt;/code&gt; / &lt;code&gt;after all&lt;/code&gt; hooks&lt;/li&gt;
&lt;li&gt;HTTP mocking&lt;/li&gt;
&lt;li&gt;VS Code extension with CodeLens + IntelliSense&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  When this might help you
&lt;/h2&gt;

&lt;p&gt;If you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Feel limited by &lt;code&gt;integration_test&lt;/code&gt; boilerplate and speed&lt;/li&gt;
&lt;li&gt;Want faster E2E feedback loops in CI/CD&lt;/li&gt;
&lt;li&gt;Prefer test files that non-developers can read&lt;/li&gt;
&lt;li&gt;Need multi-device or cloud testing&lt;/li&gt;
&lt;li&gt;Are migrating from Maestro, Detox, or Appium&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…then FlutterProbe might be a good fit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Still early — I'd love feedback
&lt;/h2&gt;

&lt;p&gt;I'm actively working on this and would love input from people doing Flutter testing in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What's your biggest pain point with Flutter E2E testing today?&lt;/li&gt;
&lt;li&gt;What tools are you using, and what's missing?&lt;/li&gt;
&lt;li&gt;What would make E2E testing actually enjoyable?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Drop a comment — I read every one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Learn More
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;📖 &lt;a href="https://flutterprobe.dev/" rel="noopener noreferrer"&gt;Complete Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🆚 &lt;a href="https://flutterprobe.dev/comparisons/flutterprobe-vs-patrol-vs-integration-test/" rel="noopener noreferrer"&gt;FlutterProbe vs Patrol vs integration_test&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🚀 &lt;a href="https://flutterprobe.dev/getting-started/quick-start/" rel="noopener noreferrer"&gt;Quick Start Guide&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📝 &lt;a href="https://flutterprobe.dev/blog/guide-to-flutter-e2e-testing/" rel="noopener noreferrer"&gt;A Practical Guide to Flutter E2E Testing in 2026&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try it out
&lt;/h2&gt;

&lt;p&gt;👉 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/AlphaWaveSystems/flutter-probe" rel="noopener noreferrer"&gt;https://github.com/AlphaWaveSystems/flutter-probe&lt;/a&gt;&lt;br&gt;
👉 &lt;strong&gt;Docs&lt;/strong&gt;: &lt;a href="https://flutterprobe.dev/" rel="noopener noreferrer"&gt;https://flutterprobe.dev/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If it looks useful, a ⭐ helps a lot 🙌&lt;/p&gt;

</description>
      <category>flutter</category>
      <category>testing</category>
      <category>opensource</category>
      <category>mobile</category>
    </item>
  </channel>
</rss>
