<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Patrick Bertsch</title>
    <description>The latest articles on DEV Community by Patrick Bertsch (@patrick_bertsch_056239a0e).</description>
    <link>https://dev.to/patrick_bertsch_056239a0e</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1631390%2F96a73516-b96e-449f-9518-1eb17506feef.png</url>
      <title>DEV Community: Patrick Bertsch</title>
      <link>https://dev.to/patrick_bertsch_056239a0e</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/patrick_bertsch_056239a0e"/>
    <language>en</language>
    <item>
      <title>We Built a Python Library That Cuts LLM Memory Usage by 80%</title>
      <dc:creator>Patrick Bertsch</dc:creator>
      <pubDate>Sat, 04 Apr 2026 22:17:01 +0000</pubDate>
      <link>https://dev.to/patrick_bertsch_056239a0e/we-built-a-python-library-that-cuts-llm-memory-usage-by-80-39b8</link>
      <guid>https://dev.to/patrick_bertsch_056239a0e/we-built-a-python-library-that-cuts-llm-memory-usage-by-80-39b8</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;If you've run LLMs locally, you know the pain: a 14B model eats 10+ GB just for the KV cache on long prompts. The model weights fit in memory, but the cache — where attention stores every key and value vector for every token — grows linearly with context length and eventually pushes you into swap or OOM.&lt;/p&gt;

&lt;p&gt;The standard approach is to quantize the model weights (Q4, Q8), but nobody touches the KV cache. It sits there in full FP16 precision, quietly eating 30-50% of your total memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Paper
&lt;/h2&gt;

&lt;p&gt;Google Research published &lt;a href="https://arxiv.org/abs/2504.19874" rel="noopener noreferrer"&gt;TurboQuant&lt;/a&gt; at ICLR 2026. The core idea is surprisingly elegant:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Rotate&lt;/strong&gt; the KV vectors by a random orthogonal matrix — this spreads information uniformly across all coordinates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quantize&lt;/strong&gt; each coordinate independently using precomputed optimal codebooks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store the norm&lt;/strong&gt; separately in FP16&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. No training. No calibration data. No model-specific tuning. The same codebooks work for Llama, Qwen, Mistral — anything.&lt;/p&gt;
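&lt;p&gt;The three steps can be sketched in a few lines of numpy. This is a hypothetical illustration, not tqai's code: a uniform codebook stands in for the precomputed Lloyd-Max one, and the rotation comes from a QR decomposition of a random matrix.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128  # head dimension

# Step 1: a random orthogonal rotation, built via QR decomposition
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Toy uniform 4-bit codebook; the real scheme uses precomputed Lloyd-Max levels
codebook = np.linspace(-4 / np.sqrt(d), 4 / np.sqrt(d), 16)

def compress(v):
    norm = np.linalg.norm(v)                  # Step 3: keep the norm separately
    unit = (R @ v) / norm                     # Step 1: rotate, then normalize
    # Step 2: quantize each coordinate independently to the nearest level
    idx = np.abs(unit[:, None] - codebook[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), np.float16(norm)

def decompress(idx, norm):
    return R.T @ (codebook[idx] * float(norm))  # undo rotation, restore scale

v = rng.standard_normal(d)
idx, n = compress(v)
v_hat = decompress(idx, n)
rel_err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
```

&lt;p&gt;Even with the crude uniform codebook, the reconstruction error of a random vector stays small; the optimal codebooks tighten it further.&lt;/p&gt;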

&lt;p&gt;The key insight is that after rotation, each coordinate follows a known Gaussian distribution (N(0, 1/d) where d is the head dimension). Since you know the distribution in advance, you can precompute the optimal Lloyd-Max quantizer offline. This makes the whole thing &lt;strong&gt;data-oblivious&lt;/strong&gt; — you don't need to see a single token from the model to set up compression.&lt;/p&gt;
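&lt;p&gt;Because the post-rotation distribution is known, the codebook can be fit once, offline, with the classic Lloyd-Max iteration. A minimal sketch of that fit (not tqai's actual codebook generator):&lt;/p&gt;

```python
import numpy as np

def lloyd_max(samples, bits, iters=30):
    """Fit a Lloyd-Max scalar quantizer to an empirical distribution."""
    levels = 2 ** bits
    # start from evenly spaced quantiles of the samples
    code = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        # assign each sample to its nearest level
        idx = np.abs(samples[:, None] - code[None, :]).argmin(axis=1)
        # move each level to the centroid of its cell
        for k in range(levels):
            members = samples[idx == k]
            if members.size:
                code[k] = members.mean()
    return code

d = 128
rng = np.random.default_rng(0)
# coordinates of a rotated, normalized vector are approximately N(0, 1/d)
samples = rng.normal(0.0, 1.0 / np.sqrt(d), size=100_000)
code4 = lloyd_max(samples, bits=4)

# distortion of the fitted 4-bit quantizer, normalized by the signal variance
q = code4[np.abs(samples[:, None] - code4[None, :]).argmin(axis=1)]
nmse = np.mean((samples - q) ** 2) / np.var(samples)
```

&lt;p&gt;For a 4-bit Gaussian quantizer the NMSE lands around 1%, in the same ballpark as the benchmark numbers later in the post.&lt;/p&gt;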

&lt;h3&gt;
  
  
  Why not both stages?
&lt;/h3&gt;

&lt;p&gt;The paper actually has two stages. Stage 2 (QJL) adds a 1-bit residual correction for unbiased inner products. We skip it. &lt;a href="https://github.com/tonbistudio/turboquant-pytorch" rel="noopener noreferrer"&gt;Independent research&lt;/a&gt; found that QJL's variance amplification actually degrades softmax-based attention. Stage 1 alone produces better results for KV cache compression.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Library
&lt;/h2&gt;

&lt;p&gt;We turned this into &lt;a href="https://github.com/AlphaWaveSystems/tqai" rel="noopener noreferrer"&gt;&lt;strong&gt;tqai&lt;/strong&gt;&lt;/a&gt; — a pip-installable Python library with two backends (PyTorch and MLX) and a CLI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Two lines to compress
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tqai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen2.5-3B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen2.5-3B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# This is the only change
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tqai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bits_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bits_v&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain quantum entanglement:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;past_key_values&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Apple Silicon with MLX:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tqai&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlx_lm&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlx_lm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mlx-community/Llama-3.1-8B-Instruct-4bit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tqai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bits_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bits_v&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mlx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mlx_lm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain quantum entanglement:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Compression numbers
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;Avg Bits&lt;/th&gt;
&lt;th&gt;Memory Saved&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;K4/V2&lt;/td&gt;
&lt;td&gt;3.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Production&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;K3/V2&lt;/td&gt;
&lt;td&gt;2.5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extended context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;K4/V3&lt;/td&gt;
&lt;td&gt;3.5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Quality-sensitive&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Original KV cache: 16 bits per coordinate (FP16). With K4/V2 and bit-packed indices: 512 bytes/token → 100 bytes/token.&lt;/p&gt;
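&lt;p&gt;The arithmetic behind those figures, under some illustrative assumptions (head dimension 128, one K and one V vector per token, bit-packed indices, one FP16 norm per vector):&lt;/p&gt;

```python
HEAD_DIM = 128
FP16_BYTES = 2

def compressed_bytes_per_token(bits_k, bits_v, head_dim=HEAD_DIM):
    packed = head_dim * (bits_k + bits_v) / 8   # bit-packed quantizer indices
    norms = 2 * FP16_BYTES                      # one FP16 norm each for K and V
    return packed + norms

baseline = 2 * HEAD_DIM * FP16_BYTES            # FP16 K + V: 512 bytes/token
k4v2 = compressed_bytes_per_token(4, 2)         # 100 bytes/token
saving = 1 - k4v2 / baseline                    # roughly 80%
```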

&lt;h3&gt;
  
  
  Does it actually work?
&lt;/h3&gt;

&lt;p&gt;We tested across model sizes. The pattern is clear: how well a model tolerates compression depends on its size, not on the bit width:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;+ tqai K4/V2&lt;/th&gt;
&lt;th&gt;+ tqai K3/V2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 0.5B&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Degraded&lt;/td&gt;
&lt;td&gt;Poor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3B&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Degraded&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Llama 8B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Excellent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Excellent&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen 14B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Excellent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Excellent&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On 8B+ models, the compressed output is indistinguishable from baseline. Here's a real example from Qwen 14B Q4:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Baseline&lt;/strong&gt;: "particles become interconnected so that the state of one particle cannot be described independently of the state of the others"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;K4/V2&lt;/strong&gt;: "particles become interconnected so that the state of one particle cannot be described without including the state of the other"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;K3/V2&lt;/strong&gt;: "two or more particles become interconnected such that the state of one particle can instantly influence the state of another"&lt;/p&gt;

&lt;p&gt;All three are coherent, factually correct, grammatically perfect.&lt;/p&gt;

&lt;h2&gt;
  
  
  The CLI
&lt;/h2&gt;

&lt;p&gt;tqai ships with a CLI tool for quick testing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Environment info&lt;/span&gt;
tqai info

&lt;span class="c"&gt;# Accuracy benchmark (no model needed)&lt;/span&gt;
tqai benchmark
&lt;span class="c"&gt;# Output:&lt;/span&gt;
&lt;span class="c"&gt;# Keys (4-bit): NMSE=0.009287, SNR=20.3 dB, cosine sim=0.9954&lt;/span&gt;
&lt;span class="c"&gt;# Values (2-bit): NMSE=0.115653, SNR=9.4 dB, cosine sim=0.9408&lt;/span&gt;

&lt;span class="c"&gt;# Generate with compression&lt;/span&gt;
tqai run &lt;span class="s2"&gt;"Explain gravity"&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; mlx-community/Llama-3.1-8B-Instruct-4bit

&lt;span class="c"&gt;# Side-by-side comparison&lt;/span&gt;
tqai compare &lt;span class="s2"&gt;"Explain gravity"&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; mlx-community/Llama-3.1-8B-Instruct-4bit

&lt;span class="c"&gt;# Pre-convert for faster startup&lt;/span&gt;
tqai convert &lt;span class="nt"&gt;-m&lt;/span&gt; mlx-community/Llama-3.1-8B-Instruct-4bit &lt;span class="nt"&gt;-o&lt;/span&gt; ./llama-tqai/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Under the Hood
&lt;/h2&gt;

&lt;p&gt;The architecture is intentionally simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/tqai/
├── quantizer.py     # PolarQuantizer — the core algorithm (~100 lines)
├── backend/         # PyTorch + MLX abstraction (Protocol-based, ~80 lines each)
├── codebook/        # Precomputed Lloyd-Max codebooks (12 .npz files, ~50KB)
├── cache/           # HuggingFace DynamicCache + mlx-lm KVCache wrappers
├── convert.py       # Offline model conversion
└── cli.py           # CLI tool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Backend abstraction&lt;/strong&gt;: A Python Protocol with ~15 ops (matmul, qr, norm, argmin, etc.). Each backend is ~80 lines. Adding a new backend (JAX, ONNX) means implementing one file.&lt;/p&gt;
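&lt;p&gt;The Protocol pattern looks roughly like this (a trimmed, hypothetical version with three ops instead of the real ~15):&lt;/p&gt;

```python
from typing import Protocol, runtime_checkable

import numpy as np

@runtime_checkable
class Backend(Protocol):
    """A few representative ops; the real abstraction has around 15."""
    def matmul(self, a, b): ...
    def norm(self, x): ...
    def argmin(self, x, axis): ...

class NumpyBackend:
    def matmul(self, a, b): return a @ b
    def norm(self, x): return np.linalg.norm(x)
    def argmin(self, x, axis): return np.argmin(x, axis=axis)

backend: Backend = NumpyBackend()   # structural typing: no inheritance needed
dist = backend.norm(np.array([3.0, 4.0]))
```

&lt;p&gt;Because the check is structural, a JAX or ONNX backend only has to implement the same method names; nothing else in the library changes.&lt;/p&gt;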

&lt;p&gt;&lt;strong&gt;Codebooks&lt;/strong&gt;: Precomputed for head dimensions 64, 96, 128, 256 at 2/3/4 bits. Shipped as package data. If your model uses an unusual head dim, they're generated at runtime (requires scipy).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No monkey-patching of model code&lt;/strong&gt;: For HuggingFace, we subclass &lt;code&gt;DynamicCache&lt;/code&gt; — the model calls &lt;code&gt;cache.update()&lt;/code&gt; as normal, we compress transparently. For MLX, we replace the cache factory.&lt;/p&gt;
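&lt;p&gt;The shape of that wrapper pattern, sketched against a toy cache rather than the real &lt;code&gt;DynamicCache&lt;/code&gt; (and with an FP16 round-trip standing in for the real quantizer):&lt;/p&gt;

```python
import numpy as np

class DenseCache:
    """Minimal stand-in for a framework KV cache (for illustration only)."""
    def __init__(self):
        self.keys, self.values = [], []
    def update(self, key, value):
        self.keys.append(key)
        self.values.append(value)
        return np.stack(self.keys), np.stack(self.values)

class CompressingCache(DenseCache):
    """Compress inside update(); hand dense tensors back to the caller."""
    def __init__(self, compress, decompress):
        super().__init__()
        self.compress, self.decompress = compress, decompress
    def update(self, key, value):
        ks, vs = super().update(self.compress(key), self.compress(value))
        # the attention code receives dense tensors and never sees the difference
        return self.decompress(ks), self.decompress(vs)

# toy "compression": FP16 round-trip stands in for the real quantizer
cache = CompressingCache(lambda x: x.astype(np.float16),
                         lambda x: x.astype(np.float32))
k, v = cache.update(np.ones(4, dtype=np.float32), np.zeros(4, dtype=np.float32))
```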

&lt;h2&gt;
  
  
  Test Suite
&lt;/h2&gt;

&lt;p&gt;179 tests covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mathematical guarantees&lt;/strong&gt;: MSE distortion within the paper's theoretical bound ((√3·π/2) · 4^-b)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attention fidelity&lt;/strong&gt;: Full softmax(Q@K^T/√d)@V simulation with cosine similarity checks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inner product preservation&lt;/strong&gt;: Correlation and absolute error of Q@K^T&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge cases&lt;/strong&gt;: Zero vectors, extreme values, sparse vectors, high dimensions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistical properties&lt;/strong&gt;: Unbiasedness, rotation distribution validation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-backend&lt;/strong&gt;: Torch and MLX produce equivalent results&lt;/li&gt;
&lt;/ul&gt;
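&lt;p&gt;The attention-fidelity check can be sketched like this (hypothetical, not the actual test suite; additive noise stands in for quantization error on the cached K/V):&lt;/p&gt;

```python
import numpy as np

def attention(q, K, V):
    """Single-query softmax attention: softmax(q @ K.T / sqrt(d)) @ V."""
    d = q.shape[-1]
    w = np.exp(q @ K.T / np.sqrt(d))
    w = w / w.sum()
    return w @ V

rng = np.random.default_rng(0)
d, n = 64, 32
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# additive noise stands in for quantization error on the cached K/V
K_hat = K + 0.01 * rng.standard_normal((n, d))
V_hat = V + 0.01 * rng.standard_normal((n, d))

out = attention(q, K, V)
out_hat = attention(q, K_hat, V_hat)
cos = out @ out_hat / (np.linalg.norm(out) * np.linalg.norm(out_hat))
```

&lt;p&gt;The point of testing the full softmax path, rather than raw reconstruction error, is that softmax reweights errors nonuniformly, which is exactly why the QJL residual stage can hurt here.&lt;/p&gt;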

&lt;p&gt;CI runs on both Linux (PyTorch) and macOS (PyTorch + MLX).&lt;/p&gt;

&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Just the library&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;tqai

&lt;span class="c"&gt;# With PyTorch&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;tqai[torch]

&lt;span class="c"&gt;# With MLX (Apple Silicon)&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;tqai[mlx]

&lt;span class="c"&gt;# macOS via Homebrew&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;alphawavesystems/tap/tqai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bit-packing&lt;/strong&gt;: Currently indices are stored as uint8. Packing to actual 2/3/4 bits would achieve the full theoretical 5-6x compression in memory (not just on disk).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Triton kernels&lt;/strong&gt;: Fused decode kernels that compute attention directly on compressed data without dequantizing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM adapter&lt;/strong&gt;: Production serving integration.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/AlphaWaveSystems/tqai" rel="noopener noreferrer"&gt;AlphaWaveSystems/tqai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: &lt;a href="https://pypi.org/project/tqai/" rel="noopener noreferrer"&gt;pypi.org/project/tqai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Paper&lt;/strong&gt;: &lt;a href="https://arxiv.org/abs/2504.19874" rel="noopener noreferrer"&gt;arXiv:2504.19874&lt;/a&gt; (TurboQuant, Google Research, ICLR 2026)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Related&lt;/strong&gt;: &lt;a href="https://arxiv.org/abs/2502.02617" rel="noopener noreferrer"&gt;PolarQuant&lt;/a&gt; (AISTATS 2026), &lt;a href="https://dl.acm.org/doi/10.1609/aaai.v39i24.34773" rel="noopener noreferrer"&gt;QJL&lt;/a&gt; (AAAI 2025)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MIT licensed. 179 tests. Contributions welcome — DCO sign-off required.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I got frustrated with Flutter E2E testing… so I built my own tool</title>
      <dc:creator>Patrick Bertsch</dc:creator>
      <pubDate>Wed, 25 Mar 2026 18:09:19 +0000</pubDate>
      <link>https://dev.to/patrick_bertsch_056239a0e/i-got-frustrated-with-flutter-e2e-testing-so-i-built-my-own-tool-534i</link>
      <guid>https://dev.to/patrick_bertsch_056239a0e/i-got-frustrated-with-flutter-e2e-testing-so-i-built-my-own-tool-534i</guid>
      <description>&lt;p&gt;If you've done end-to-end (E2E) testing in Flutter, you probably know the feeling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tests get &lt;strong&gt;slow&lt;/strong&gt; as they grow&lt;/li&gt;
&lt;li&gt;Debugging is painful&lt;/li&gt;
&lt;li&gt;Writing tests feels heavier than it should&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hit those limits pretty quickly with &lt;code&gt;integration_test&lt;/code&gt;. I tried other options (including Patrol), but I still wanted something that felt faster and simpler — something where writing tests didn't feel like a chore.&lt;/p&gt;

&lt;p&gt;So I started building my own Flutter E2E testing framework:&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://github.com/AlphaWaveSystems/flutter-probe" rel="noopener noreferrer"&gt;FlutterProbe&lt;/a&gt;&lt;/strong&gt; — open-source, BSL 1.1 licensed&lt;/p&gt;




&lt;h2&gt;
  
  
  The goal
&lt;/h2&gt;

&lt;p&gt;I wasn't trying to reinvent testing — just make it feel better to use.&lt;/p&gt;

&lt;p&gt;What I wanted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⚡ &lt;strong&gt;Fast feedback&lt;/strong&gt; — closer to unit test speed, not minutes-long integration runs&lt;/li&gt;
&lt;li&gt;✍️ &lt;strong&gt;Tests anyone can read&lt;/strong&gt; — not just the developer who wrote them&lt;/li&gt;
&lt;li&gt;🧪 &lt;strong&gt;Less flakiness&lt;/strong&gt; — no more &lt;code&gt;pumpAndSettle&lt;/code&gt; timeouts&lt;/li&gt;
&lt;li&gt;🧠 &lt;strong&gt;A simple mental model&lt;/strong&gt; — describe what the user does, not how the framework works&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What it looks like
&lt;/h2&gt;

&lt;p&gt;Instead of Dart boilerplate with &lt;code&gt;WidgetTester&lt;/code&gt; and &lt;code&gt;find.byKey&lt;/code&gt;, you write plain English:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test "user can sign in with valid credentials"
  open the app
  tap "Sign In"
  type "test@example.com" into "Email"
  type "password123" into "Password"
  tap "Continue"
  see "Dashboard"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's actual ProbeScript — the test language FlutterProbe uses.&lt;br&gt;
No Dart imports, no &lt;code&gt;pumpAndSettle&lt;/code&gt;, no &lt;code&gt;find.byType&lt;/code&gt;. Just behavior.&lt;/p&gt;

&lt;p&gt;Under the hood, FlutterProbe connects to a Dart agent running inside your app via WebSocket. The agent walks the live widget tree directly — no UI automation layer, no WebDriver.&lt;/p&gt;

&lt;p&gt;👉 This is how it achieves &lt;strong&gt;sub-50ms command round-trips&lt;/strong&gt;.&lt;/p&gt;
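&lt;p&gt;The pattern is easy to see in miniature. Below is a hypothetical Python sketch of the idea only — FlutterProbe's real agent is Dart and its wire protocol is its own: a runner sends one JSON command per step over a socket, and an in-app agent resolves it against the widget tree and replies.&lt;/p&gt;

```python
import json
import socket
import threading

def agent(server):
    """Toy in-app agent: resolves commands against a fake widget tree."""
    widget_tree = {"Sign In": "button", "Email": "field", "Dashboard": "text"}
    conn, _ = server.accept()
    with conn, conn.makefile("r") as rf, conn.makefile("w") as wf:
        for line in rf:                            # one JSON command per line
            cmd = json.loads(line)
            found = cmd["target"] in widget_tree   # "walk" the tree directly
            wf.write(json.dumps({"ok": found}) + "\n")
            wf.flush()

server = socket.create_server(("127.0.0.1", 0))
threading.Thread(target=agent, args=(server,), daemon=True).start()

# the runner side: each test step is a single command/response round-trip
with socket.create_connection(server.getsockname()) as sock:
    with sock.makefile("r") as rf, sock.makefile("w") as wf:
        wf.write(json.dumps({"op": "tap", "target": "Sign In"}) + "\n")
        wf.flush()
        reply = json.loads(rf.readline())
```

&lt;p&gt;With no driver stack between runner and app, each step costs one socket round-trip plus a tree lookup, which is where the speed comes from.&lt;/p&gt;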




&lt;h2&gt;
  
  
  Why not just use &lt;code&gt;integration_test&lt;/code&gt;?
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;integration_test&lt;/code&gt; is Flutter's official E2E option, and it's solid for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Official support and ecosystem integration&lt;/li&gt;
&lt;li&gt;Basic smoke tests and simple flows&lt;/li&gt;
&lt;li&gt;Teams already deep in the Dart test ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But for me, it starts to hurt when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tests grow in size — the boilerplate compounds fast&lt;/li&gt;
&lt;li&gt;You need faster iteration — full rebuilds on every change&lt;/li&gt;
&lt;li&gt;You want cleaner test code — &lt;code&gt;tester.pumpAndSettle()&lt;/code&gt; everywhere&lt;/li&gt;
&lt;li&gt;You need reporting — no built-in JUnit/JSON/HTML reports&lt;/li&gt;
&lt;li&gt;You want CI/CD parallelism — no sharding support out of the box&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;FlutterProbe addresses each of these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Human-readable syntax&lt;/li&gt;
&lt;li&gt;Sub-50ms execution&lt;/li&gt;
&lt;li&gt;Built-in reporters&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--shard&lt;/code&gt; for CI matrix jobs&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--parallel&lt;/code&gt; for multi-device runs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 &lt;a href="https://flutterprobe.dev/comparisons/integration-test-alternative/" rel="noopener noreferrer"&gt;Full comparison: FlutterProbe vs integration_test&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  And Patrol?
&lt;/h2&gt;

&lt;p&gt;Patrol solves a lot — especially around native interactions (permission dialogs, system alerts, notifications). It's a serious tool and a real step up from vanilla &lt;code&gt;integration_test&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;FlutterProbe is trying something slightly different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Plain English syntax&lt;/strong&gt; — readable by QA, PMs, and developers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Direct widget-tree access&lt;/strong&gt; — no Appium, no native automation layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt; — sub-50ms per command&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Migration&lt;/strong&gt; — supports Maestro, Gherkin, Robot Framework, Detox, Appium, and other formats&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud device farms&lt;/strong&gt; — BrowserStack, Sauce Labs, AWS Device Farm, Firebase Test Lab, LambdaTest&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you need native OS interactions, Patrol is the better choice.&lt;br&gt;
If you want speed, readability, and CI/CD-first design, FlutterProbe is worth a look.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://flutterprobe.dev/comparisons/patrol-alternative/" rel="noopener noreferrer"&gt;Full comparison: FlutterProbe vs Patrol&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What's different so far
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;FlutterProbe&lt;/th&gt;
&lt;th&gt;integration_test&lt;/th&gt;
&lt;th&gt;Patrol&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Test syntax&lt;/td&gt;
&lt;td&gt;Plain English (ProbeScript)&lt;/td&gt;
&lt;td&gt;Dart&lt;/td&gt;
&lt;td&gt;Dart + custom finders&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Execution speed&lt;/td&gt;
&lt;td&gt;&amp;lt;50ms per command&lt;/td&gt;
&lt;td&gt;~200–500ms&lt;/td&gt;
&lt;td&gt;~100–300ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD sharding&lt;/td&gt;
&lt;td&gt;Built-in (&lt;code&gt;--shard&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallel devices&lt;/td&gt;
&lt;td&gt;&lt;code&gt;--parallel&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud device farms&lt;/td&gt;
&lt;td&gt;5 providers&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visual regression&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test recording&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Migration tools&lt;/td&gt;
&lt;td&gt;7 formats&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reports (HTML/JSON/JUnit)&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Plus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-healing selectors&lt;/li&gt;
&lt;li&gt;Data-driven tests (CSV support)&lt;/li&gt;
&lt;li&gt;Random data generators&lt;/li&gt;
&lt;li&gt;Clipboard, GPS, permission commands&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;before all&lt;/code&gt; / &lt;code&gt;after all&lt;/code&gt; hooks&lt;/li&gt;
&lt;li&gt;HTTP mocking&lt;/li&gt;
&lt;li&gt;VS Code extension with CodeLens + IntelliSense&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  When this might help you
&lt;/h2&gt;

&lt;p&gt;If you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Feel limited by &lt;code&gt;integration_test&lt;/code&gt; boilerplate and speed&lt;/li&gt;
&lt;li&gt;Want faster E2E feedback loops in CI/CD&lt;/li&gt;
&lt;li&gt;Prefer test files that non-developers can read&lt;/li&gt;
&lt;li&gt;Need multi-device or cloud testing&lt;/li&gt;
&lt;li&gt;Are migrating from Maestro, Detox, or Appium&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…then FlutterProbe might be a good fit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Still early — I'd love feedback
&lt;/h2&gt;

&lt;p&gt;I'm actively working on this and would love input from people doing Flutter testing in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What's your biggest pain point with Flutter E2E testing today?&lt;/li&gt;
&lt;li&gt;What tools are you using, and what's missing?&lt;/li&gt;
&lt;li&gt;What would make E2E testing actually enjoyable?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Drop a comment — I read every one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Learn More
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;📖 &lt;a href="https://flutterprobe.dev/" rel="noopener noreferrer"&gt;Complete Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🆚 &lt;a href="https://flutterprobe.dev/comparisons/flutterprobe-vs-patrol-vs-integration-test/" rel="noopener noreferrer"&gt;FlutterProbe vs Patrol vs integration_test&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🚀 &lt;a href="https://flutterprobe.dev/getting-started/quick-start/" rel="noopener noreferrer"&gt;Quick Start Guide&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📝 &lt;a href="https://flutterprobe.dev/blog/guide-to-flutter-e2e-testing/" rel="noopener noreferrer"&gt;A Practical Guide to Flutter E2E Testing in 2026&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try it out
&lt;/h2&gt;

&lt;p&gt;👉 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/AlphaWaveSystems/flutter-probe" rel="noopener noreferrer"&gt;https://github.com/AlphaWaveSystems/flutter-probe&lt;/a&gt;&lt;br&gt;
👉 &lt;strong&gt;Docs&lt;/strong&gt;: &lt;a href="https://flutterprobe.dev/" rel="noopener noreferrer"&gt;https://flutterprobe.dev/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If it looks useful, a ⭐ helps a lot 🙌&lt;/p&gt;

</description>
      <category>flutter</category>
      <category>testing</category>
      <category>opensource</category>
      <category>mobile</category>
    </item>
  </channel>
</rss>
