The Test That Changed My Mind
Flask served my first production model for 18 months without issue. Then I rewrote it in FastAPI, ran the same benchmark suite, and watched throughput jump 3.2x on identical hardware.
But that's not the whole story. The tests most beginners run — wrk with a single endpoint, no preprocessing, no real model logic — miss the patterns that actually matter when you're serving ML models. Load testing an empty route tells you nothing about inference latency under concurrent requests, nothing about CPU-bound preprocessing bottlenecks, and nothing about memory behavior when multiple models share the same process.
This post walks through five benchmarks designed specifically for ML serving scenarios: synchronous inference, async preprocessing pipelines, concurrent model loading, streaming predictions, and multipart file uploads with image models. All tests use real (albeit toy) scikit-learn and PyTorch models rather than a hard-coded return {"prediction": 42}.
Why Most Flask vs FastAPI Benchmarks Are Useless