
TildAlice

Posted on • Originally published at tildalice.io

INT8 vs FP16 Inference: TCO Cut 54% for 7B Models on AWS

The $8,400/month Cloud Bill That Started This

A client was running a 7B parameter LLM on AWS g5.2xlarge instances (A10G GPU) for a customer support chatbot. They'd gone with FP16 inference because "everyone does it" and the initial benchmarks looked fine. But at 120K requests/day, their monthly GPU bill hit $8,400.

Their engineering lead asked me: "Can we just switch to INT8 and cut this in half?"

The answer turned out to be way more interesting than yes or no. After two weeks of TCO analysis across AWS and GCP, testing three different 7B models (Llama 2, Mistral, and a domain-specific fine-tune), I found that INT8 cuts costs by 54% for most production workloads — but only if you avoid three specific traps that can actually increase your bill.
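To make those headline figures concrete, here's a back-of-envelope sketch of the cost model. The instance price is an assumption (an approximate us-east-1 on-demand rate for g5.2xlarge, which changes over time); the $8,400 bill, 120K requests/day, and 54% reduction are the numbers from the analysis above.

```python
# Back-of-envelope TCO sketch. The hourly rate below is an assumed,
# approximate on-demand price for g5.2xlarge and will vary by region
# and over time; the other inputs come from the article.

HOURS_PER_MONTH = 730
G5_2XLARGE_HOURLY = 1.212        # USD/hr -- assumed approximate rate
MONTHLY_BILL_FP16 = 8_400        # USD/month, the client's FP16 bill
REQUESTS_PER_DAY = 120_000
INT8_SAVINGS = 0.54              # headline reduction from the analysis

# Roughly how many instances that bill implies, running 24/7
instances = MONTHLY_BILL_FP16 / (G5_2XLARGE_HOURLY * HOURS_PER_MONTH)

# Unit economics: cost per 1,000 requests at FP16
monthly_requests = REQUESTS_PER_DAY * 30
cost_per_1k_fp16 = MONTHLY_BILL_FP16 / (monthly_requests / 1_000)

# Projected bill after the 54% INT8 reduction
monthly_bill_int8 = MONTHLY_BILL_FP16 * (1 - INT8_SAVINGS)

print(f"~{instances:.1f} g5.2xlarge instances running 24/7")
print(f"FP16 cost per 1K requests: ${cost_per_1k_fp16:.2f}")
print(f"Projected INT8 bill: ${monthly_bill_int8:,.0f}/month")
```

At these assumed prices the bill implies roughly nine and a half instances running around the clock, about $2.33 per thousand requests at FP16, and a projected INT8 bill near $3,900/month.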

Here's what the numbers actually look like, and when FP16 still wins.

Cover image: Photo by Markus Winkler on Pexels

Why This Matters: Inference Cost Dominates Production LLM Budgets


Continue reading the full article on TildAlice
