
TildAlice

Originally published at tildalice.io

GPU vs CPU Inference: 5 Scenarios, Real Costs & Latency

The $400/Month Surprise

I ran the same BERT model on a T4 GPU and a 4-core CPU for a month. The GPU was faster, obviously. But it cost $400 more than the CPU setup, which handled 95% of requests under our 200ms SLA just fine.

Most benchmarks compare raw throughput or single-request latency. They skip the part where you actually pick hardware for a production system with a budget, an SLA, and real traffic patterns. This post runs five realistic scenarios — from a personal side project to a high-traffic API — and shows when GPUs pay for themselves and when you're just burning money.
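To make the budget math concrete, here's a minimal sketch of the break-even calculation the scenarios keep coming back to. The GPU price matches the test setup below; the CPU price and both throughput figures are illustrative placeholders, not measurements, so swap in your own benchmark numbers.

```python
# Hypothetical break-even sketch: at what cost per request does a GPU
# instance beat a CPU instance, and what does each cost just to exist?
# gpu_price_hr comes from the test setup; cpu_price_hr, gpu_rps, and
# cpu_rps are illustrative placeholders.

HOURS_PER_MONTH = 730

gpu_price_hr = 0.526   # AWS g4dn.xlarge on-demand (from the setup below)
cpu_price_hr = 0.17    # assumed 4-vCPU instance price; adjust to yours

gpu_rps = 120          # placeholder: sustained requests/sec on the T4
cpu_rps = 15           # placeholder: sustained requests/sec on 4 vCPUs

def cost_per_million(price_hr: float, rps: float) -> float:
    """Dollars to serve 1M requests at full utilization."""
    seconds = 1_000_000 / rps
    return price_hr * seconds / 3600

print(f"GPU: ${cost_per_million(gpu_price_hr, gpu_rps):.2f} per 1M requests")
print(f"CPU: ${cost_per_million(cpu_price_hr, cpu_rps):.2f} per 1M requests")

# The fixed monthly floor if the instance runs 24/7 regardless of traffic:
print(f"GPU idle floor: ${gpu_price_hr * HOURS_PER_MONTH:.0f}/month")
print(f"CPU idle floor: ${cpu_price_hr * HOURS_PER_MONTH:.0f}/month")
```

The two numbers pull in opposite directions: at full utilization the GPU usually wins per-request, but a 24/7 T4 instance costs about $384/month whether or not traffic shows up, and that idle floor is where surprises like the one above come from.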

Photo: three NVIDIA GeForce RTX graphics cards (Andrey Matveev, Pexels)

Test Setup: Models, Hardware, Traffic

I tested three model sizes across two compute tiers:

Models:

  • BERT-base (110M params): text classification, sequence length 128
  • ResNet50 (25M params): image classification, 224×224 input
  • Whisper-tiny (39M params): speech-to-text, 30s audio clips

Hardware:

  • GPU: AWS g4dn.xlarge (NVIDIA T4, 16GB VRAM, 4 vCPUs) — $0.526/hour on-demand
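The measurement loop itself is straightforward. The exact harness isn't shown in this excerpt, but here's a minimal sketch of one, assuming PyTorch and Hugging Face transformers; the checkpoint name and iteration counts are illustrative.

```python
import time
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"

# BERT-base classifier at sequence length 128, matching the test setup.
# The checkpoint name is an assumption; any BERT-base variant works here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.to(device).eval()

text = "benchmark input " * 20
inputs = tokenizer(text, truncation=True, padding="max_length",
                   max_length=128, return_tensors="pt").to(device)

latencies = []
with torch.no_grad():
    for _ in range(10):                  # warmup, excluded from timing
        model(**inputs)
    for _ in range(200):                 # timed runs
        if device == "cuda":
            torch.cuda.synchronize()     # don't time queued kernels
        start = time.perf_counter()
        model(**inputs)
        if device == "cuda":
            torch.cuda.synchronize()
        latencies.append((time.perf_counter() - start) * 1000)

latencies.sort()
print(f"{device} p50: {latencies[len(latencies) // 2]:.1f} ms, "
      f"p95: {latencies[int(len(latencies) * 0.95)]:.1f} ms")
```

Reporting p95 rather than the mean matters here: the 200ms SLA from the intro is a tail-latency constraint, and CPU inference tends to have a wider tail than GPU, so the mean alone would flatter the CPU.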

Continue reading the full article on TildAlice
