
TildAlice

Posted on • Originally published at tildalice.io

INT8 vs FP16 Inference: TCO Cut 54% for 7B Models on AWS

The $8,400/month Cloud Bill That Started This

A client was running a 7B parameter LLM on AWS g5.2xlarge instances (A10G GPU) for a customer support chatbot. They'd gone with FP16 inference because "everyone does it" and the initial benchmarks looked fine. But at 120K requests/day, their monthly GPU bill hit $8,400.

Their engineering lead asked me: "Can we just switch to INT8 and cut this in half?"

The answer turned out to be way more interesting than yes or no. After two weeks of TCO analysis across AWS and GCP, testing three different 7B models (Llama 2, Mistral, and a domain-specific fine-tune), I found that INT8 cuts costs by 54% for most production workloads — but only if you avoid three specific traps that can actually increase your bill.
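To make those headline figures concrete, here's a back-of-envelope sketch of the cost model. The instance price is an assumption (an approximate us-east-1 on-demand rate for g5.2xlarge, which changes over time); the $8,400 bill, 120K requests/day, and 54% reduction are the numbers from the analysis above.

```python
# Back-of-envelope TCO sketch. The hourly rate below is an assumed,
# approximate on-demand price for g5.2xlarge and will vary by region
# and over time; the other inputs come from the article.

HOURS_PER_MONTH = 730
G5_2XLARGE_HOURLY = 1.212        # USD/hr -- assumed approximate rate
MONTHLY_BILL_FP16 = 8_400        # USD/month, the client's FP16 bill
REQUESTS_PER_DAY = 120_000
INT8_SAVINGS = 0.54              # headline reduction from the analysis

# Roughly how many instances that bill implies, running 24/7
instances = MONTHLY_BILL_FP16 / (G5_2XLARGE_HOURLY * HOURS_PER_MONTH)

# Unit economics: cost per 1,000 requests at FP16
monthly_requests = REQUESTS_PER_DAY * 30
cost_per_1k_fp16 = MONTHLY_BILL_FP16 / (monthly_requests / 1_000)

# Projected bill after the 54% INT8 reduction
monthly_bill_int8 = MONTHLY_BILL_FP16 * (1 - INT8_SAVINGS)

print(f"~{instances:.1f} g5.2xlarge instances running 24/7")
print(f"FP16 cost per 1K requests: ${cost_per_1k_fp16:.2f}")
print(f"Projected INT8 bill: ${monthly_bill_int8:,.0f}/month")
```

At these assumed prices the bill implies roughly nine and a half instances running around the clock, about $2.33 per thousand requests at FP16, and a projected INT8 bill near $3,900/month.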

Here's what the numbers actually look like, and when FP16 still wins.

Cover image: Photo by Markus Winkler on Pexels

Why This Matters: Inference Cost Dominates Production LLM Budgets


Continue reading the full article on TildAlice
