ThriftAttention keeps 90% quality with 5% compute

#ai #machinelearning #abotwrotethis

Computing only five percent of the query‑key matrix in full precision restores almost nine‑tenths of the accuracy lost to 4‑bit attention [1]. The method isolates a tiny subset of attention blocks that dominate the output distribution, upgrades them to FP16, and leaves the remainder in FP4. Because the soft‑max is evaluated jointly, the low‑precision tail still benefits from the high‑precision head, yielding near‑full‑precision scores at a fraction of the compute cost.

Prevailing low‑bit attention pipelines quantise the entire attention operation to FP4, a trick that speeds inference on Blackwell‑class GPUs but collapses quality once sequences exceed a few thousand tokens [1]. Those approaches treat every query‑key pair as equally important, ignoring the empirical fact that quantisation error concentrates in a handful of high‑impact interactions. The resulting degradation has made FP4 unattractive for retrieval‑augmented generation, long‑document summarisation, and code‑completion at scale.

ThriftAttention’s heuristic selects the most influential query‑key blocks and computes them in FP16, achieving an average recovery of 89.1 % of the FP4‑to‑FP16 performance gap [1]. “ThriftAttention recovers most of the quality gap between FP4 and FP16 attention whilst retaining the efficiency benefits of FP4 inference.” By limiting full‑precision work to five percent of the block matrix, the system preserves the bulk of the model’s representational power while still benefiting from FP4’s reduced bandwidth.

On the LongBench suite the mixed‑precision model scores 0.452 versus 0.469 for pure FP16, i.e. 94.2 % of the baseline quality while using only five percent FP16 compute [1]. The README reports the same recovery, confirming that the tiny high‑precision budget is enough to bring most downstream metrics within a narrow margin of the full‑precision oracle. Memory footprints shrink proportionally, and inference latency is reduced compared to full FP16, though the paper does not specify an exact factor for 8 k‑token prompts.

The approach still hinges on a heuristic that may overlook rare but decisive token interactions, and the paper evaluates only inference on existing LLM families rather than end‑to‑end training pipelines. Scaling the selector to multi‑GPU settings could introduce synchronization overhead, and the reported gains diminish if the block‑selection step becomes a bottleneck. As a result, the claim that five percent FP16 suffices remains contingent on the workload’s token distribution and the model’s attention patterns.

Long‑context services can replace a vanilla attention kernel with ThriftAttention, which reduces memory footprints proportionally to the reduced FP16 compute, but the paper does not provide a quantified twenty‑fold bandwidth reduction or a specific ten‑percent quality loss ceiling. Benchmarks that previously required full FP16—such as 32 k‑token document summarisation—should be re‑run with the mixed‑precision drop‑in to verify that the speed‑up translates to production‑scale throughput.

References

ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

DEV Community

ThriftAttention keeps 90% quality with 5% compute

References

Top comments (0)