The 50ms Threshold That Changes Everything
Most "deploy your ML model" tutorials get you to a working endpoint. They don't get you to a fast one. I had a scikit-learn classifier that took 800ms per request when I wrapped it naively in FastAPI. After five specific changes, it dropped to 47ms. Same model, same hardware.
The difference wasn't clever optimization tricks. It was understanding what FastAPI actually does under the hood and where model inference fits into that picture.
Here's the final working code, then I'll break down why each piece matters:
```python
# main.py - the production version
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from contextlib import asynccontextmanager
import numpy as np
import joblib
import asyncio
from concurrent.futures import ThreadPoolExecutor
import time

# Global model and executor
model = None
executor = ThreadPoolExecutor(max_workers=4)

@asynccontextmanager
async def lifespan(app: FastAPI):
    global model
    print("Loading model...")
    model = joblib.load("classifier.joblib")
    # Warm up - this matters more than you'd think
    dummy = np.zeros((1, 20))
    model.predict(dummy)
    print(f"Model loaded, classes: {model.classes_}")
    yield
    executor.shutdown(wait=False)

app = FastAPI(lifespan=lifespan)

class PredictionRequest(BaseModel):
    features: list[float] = Field(..., min_length=20, max_length=20)

class PredictionResponse(BaseModel):
    prediction: int
```
---
*Continue reading the full article on [TildAlice](https://tildalice.io/fastapi-model-serving-5-steps-inference/)*