DEV Community

TildAlice

Posted on • Originally published at tildalice.io

FlashAttention-2 vs xFormers: H100 Cost at 100M Tokens

FlashAttention-2 Promises 2x Speedup — But xFormers Still Dominates Cost per Token on A100

You can read the full FlashAttention-2 paper here and the xFormers technical report here.

FlashAttention-2 (Dao, 2023) claims to cut attention kernel time nearly in half compared to the original FlashAttention. But when you price it per million tokens on real GPU instances, xFormers' memory-efficient attention still wins on A100, while H100 flips the economics entirely. I ran both implementations through a 100M-token workload to figure out which one actually costs less in production.
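The pricing math behind "cost per million tokens" is simple enough to sketch. The throughputs and hourly instance prices below are placeholders for illustration, not the article's measured numbers:

```python
# Hedged sketch: converting sustained throughput plus an hourly
# instance price into dollars per million tokens. All numbers here
# are hypothetical placeholders, not measured results.

def cost_per_million_tokens(tokens_per_second: float, usd_per_hour: float) -> float:
    """Dollars to process 1M tokens at a sustained throughput."""
    seconds_per_million = 1_000_000 / tokens_per_second
    return usd_per_hour * seconds_per_million / 3600

# Hypothetical on-demand prices and throughputs:
a100_cost = cost_per_million_tokens(tokens_per_second=40_000, usd_per_hour=4.10)
h100_cost = cost_per_million_tokens(tokens_per_second=110_000, usd_per_hour=8.20)
print(f"A100: ${a100_cost:.4f}/M tokens, H100: ${h100_cost:.4f}/M tokens")
```

The interesting case is exactly the one the article describes: a faster, pricier GPU can still come out cheaper per token if the kernel speedup outpaces the price premium.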

The comparison matters because attention is the bottleneck in every transformer inference workload beyond 2K context. FlashAttention-2 rewrites the parallelization strategy to squeeze more FLOPs out of modern Tensor Cores, while xFormers relies on BlockSparse patterns and memory-layout tricks from Meta's FAIR team. The difference shows up most sharply when you move from batch size 1 (chat inference) to batch size 32+ (batch embedding or offline reranking).
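To see why attention takes over as context grows, compare the quadratic score/value matmuls against the linear-in-sequence projection and MLP matmuls. This is a back-of-envelope FLOP count only (real crossover points also depend heavily on memory bandwidth, which is precisely what these kernels optimize); the model width is a placeholder:

```python
# Hedged sketch: share of per-layer FLOPs spent in the quadratic
# attention terms, using the standard 2*m*n*k matmul convention.
# d_model is a placeholder, not taken from the article.

def attention_score_flops(seq_len: int, d_model: int) -> int:
    # QK^T and softmax(QK^T)V: ~4 * seq_len^2 * d_model FLOPs per layer.
    return 4 * seq_len * seq_len * d_model

def linear_term_flops(seq_len: int, d_model: int) -> int:
    # Q/K/V/O projections (~8*n*d^2) plus a 4x-expanded MLP (~16*n*d^2).
    return 24 * seq_len * d_model * d_model

d = 4096  # placeholder model width
for n in (1024, 2048, 8192, 32768):
    quad = attention_score_flops(n, d)
    lin = linear_term_flops(n, d)
    print(f"seq_len={n:>6}: attention share = {quad / (quad + lin):.2f}")
```

The quadratic share climbs steadily with sequence length, which is why both libraries spend their engineering effort on exactly this kernel.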

Continue reading the full article on TildAlice
