# Model Serving Templates

Production-ready model serving with FastAPI and Flask. Deploy ML models as REST APIs with async inference, request batching, A/B testing, input validation, and canary deployment configs — ready for Docker and Kubernetes.
## Key Features
- FastAPI async serving — non-blocking endpoints with automatic OpenAPI docs
- Flask + Gunicorn — battle-tested synchronous serving for simpler deployments
- Request batching — accumulate requests for GPU throughput optimization
- A/B testing — traffic splitting between model versions with metric collection
- Input validation — Pydantic schemas that reject malformed requests
- Model caching — load once at startup with configurable warm-up and health checks
- Canary deployments — Kubernetes manifests for gradual rollouts
## Quick Start

```bash
# 1. Copy the config
cp config.example.yaml config.yaml

# 2. Start the FastAPI server
uvicorn templates.fastapi_serve:app --host 0.0.0.0 --port 8000

# 3. Test the endpoint
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [5.1, 3.5, 1.4, 0.2]}'
```
"""FastAPI model serving with Pydantic validation."""
from fastapi import FastAPI
from pydantic import BaseModel, Field
import joblib, numpy as np
app = FastAPI(title="ML Model API", version="1.0.0")
model = None
@app.on_event("startup")
async def load_model():
global model
model = joblib.load("artifacts/model.pkl")
class PredictRequest(BaseModel):
features: list[float] = Field(..., min_length=4, max_length=4)
@app.post("/predict")
async def predict(req: PredictRequest):
X = np.array(req.features).reshape(1, -1)
return {"prediction": int(model.predict(X)[0]),
"probability": float(model.predict_proba(X)[0].max()),
"model_version": "v2.1"}
@app.get("/health")
async def health():
return {"status": "healthy", "model_loaded": model is not None}
## Architecture

```
model-serving-templates/
├── config.example.yaml        # Serving configuration
├── templates/
│   ├── fastapi_serve.py       # FastAPI async serving
│   ├── flask_serve.py         # Flask + Gunicorn serving
│   ├── batched_inference.py   # Request batching for GPU
│   ├── ab_testing.py          # A/B traffic splitting
│   ├── middleware/            # Auth, rate limiting, logging
│   ├── deployment/            # Dockerfile, k8s manifests, canary configs
│   └── testing/               # Load tests, smoke tests
├── docs/
│   └── overview.md
└── examples/
    ├── sklearn_serving.py
    └── pytorch_serving.py
```
## Usage Examples

### Request Batching for GPU Models
"""Accumulate requests and run batch inference for GPU throughput."""
import asyncio
from collections import deque
import torch
class BatchPredictor:
def __init__(self, model, max_batch: int = 32, max_wait_ms: float = 50):
self.model, self.max_batch = model, max_batch
self.max_wait, self.queue = max_wait_ms / 1000, deque()
async def predict(self, features: list[float]) -> dict:
future = asyncio.get_event_loop().create_future()
self.queue.append((features, future))
if len(self.queue) >= self.max_batch:
await self._flush()
else:
await asyncio.sleep(self.max_wait)
if self.queue:
await self._flush()
return await future
async def _flush(self):
items = [self.queue.popleft() for _ in range(min(len(self.queue), self.max_batch))]
with torch.no_grad():
outputs = self.model(torch.tensor([i[0] for i in items]).to("cuda"))
for (_, fut), out in zip(items, outputs):
fut.set_result({"prediction": out.argmax().item()})
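The GPU batcher needs CUDA, but the accumulate-then-flush pattern itself can be demonstrated on CPU with a stand-in model. `DummyModel` and `SimpleBatcher` below are illustrative names, not part of the templates:

```python
"""CPU demo of the accumulate-then-flush batching pattern (no torch/GPU needed)."""
import asyncio


class DummyModel:
    """Stand-in model: returns the argmax index of each feature row."""
    def __call__(self, batch):  # batch: list[list[float]]
        return [max(range(len(row)), key=row.__getitem__) for row in batch]


class SimpleBatcher:
    def __init__(self, model, max_batch: int = 4, max_wait_ms: float = 10):
        self.model = model
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000
        self.queue = []

    async def predict(self, features):
        fut = asyncio.get_running_loop().create_future()
        self.queue.append((features, fut))
        if len(self.queue) >= self.max_batch:
            self._flush()                     # full batch: run now
        else:
            await asyncio.sleep(self.max_wait)  # wait for more requests
            if self.queue:
                self._flush()
        return await fut

    def _flush(self):
        items, self.queue = self.queue[: self.max_batch], self.queue[self.max_batch :]
        for (_, fut), out in zip(items, self.model([f for f, _ in items])):
            fut.set_result({"prediction": out})


async def main():
    batcher = SimpleBatcher(DummyModel(), max_batch=2)
    # Two concurrent requests are merged into a single model call.
    return await asyncio.gather(
        batcher.predict([0.1, 0.9]),
        batcher.predict([0.8, 0.2]),
    )


print(asyncio.run(main()))  # [{'prediction': 1}, {'prediction': 0}]
```

The second request fills the batch, triggering an immediate flush that resolves both futures at once.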
### A/B Testing

```python
"""Route traffic between model versions by weight."""
import random


class ABRouter:
    def __init__(self, models: dict[str, object], weights: dict[str, float]):
        # weights should sum to 1.0, e.g. {"control": 0.8, "treatment": 0.2}
        self.models = models
        self.weights = weights

    def route(self, features) -> tuple[str, object]:
        rand, cumulative = random.random(), 0.0
        for variant, weight in self.weights.items():
            cumulative += weight
            if rand <= cumulative:
                return variant, self.models[variant].predict(features)
        # Fall back to control if the weights don't cover [0, 1].
        return "control", self.models["control"].predict(features)
```
## Configuration

```yaml
# config.example.yaml
server:
  host: "0.0.0.0"
  port: 8000
  workers: 4          # Gunicorn/uvicorn workers
  timeout: 30         # Request timeout in seconds
model:
  path: "artifacts/model.pkl"
  version: "v2.1"
  warm_up: true       # Run a dummy prediction on startup
batching:
  enabled: false
  max_batch_size: 32
  max_wait_ms: 50
ab_testing:
  enabled: false
  variants:
    control: { model: "artifacts/model_v1.pkl", weight: 0.8 }
    treatment: { model: "artifacts/model_v2.pkl", weight: 0.2 }
```
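The variant weights are traffic probabilities, so it's worth failing fast at startup if they don't sum to 1.0 or if the `control` fallback is missing. A minimal sanity check, assuming the YAML has already been parsed into a dict (`check_ab_config` is an illustrative name):

```python
"""Validate A/B variant weights at startup (assumes config is already parsed)."""
import math


def check_ab_config(variants: dict[str, dict]) -> None:
    total = sum(v["weight"] for v in variants.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"variant weights must sum to 1.0, got {total}")
    if "control" not in variants:
        # ABRouter falls back to models["control"], so it must exist.
        raise ValueError("a 'control' variant is required as the fallback")


check_ab_config({
    "control":   {"model": "artifacts/model_v1.pkl", "weight": 0.8},
    "treatment": {"model": "artifacts/model_v2.pkl", "weight": 0.2},
})
```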
## Best Practices

- Always validate inputs with Pydantic — reject bad requests before they waste compute
- Load models at startup, not per-request — use `@app.on_event("startup")` to load once
- Return the model version in every response — essential for debugging and A/B analysis
- Set request timeouts — prevents slow predictions from consuming all workers
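The `warm_up: true` flag pairs with the load-at-startup advice: run one dummy prediction right after loading so the first real request doesn't pay for lazy initialization. A sketch with a stand-in model (`StandInModel` is illustrative, not the joblib artifact):

```python
"""Warm-up: run one dummy prediction at startup and report its latency."""
import time


def warm_up(model, n_features: int = 4) -> float:
    """Feed a zero-filled row through the model; return latency in ms."""
    dummy = [[0.0] * n_features]
    start = time.perf_counter()
    model.predict(dummy)
    return (time.perf_counter() - start) * 1000


class StandInModel:
    """Hypothetical stand-in for the real artifact, to keep the sketch runnable."""
    def predict(self, X):
        return [0 for _ in X]


latency_ms = warm_up(StandInModel())
print(f"warm-up prediction took {latency_ms:.2f} ms")
```

Logging this latency also gives a cheap health signal: a warm-up that suddenly takes seconds hints at a cold cache or an oversized model.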
## Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| 422 Validation Error | Request body doesn't match schema | Check field names/types; test with /docs Swagger UI |
| Stale predictions | Old model cached in memory | Restart server or add a /reload endpoint |
| Timeout under load | Workers saturated | Increase workers or enable batching |
| Container OOM | Model too large | Increase memory limit or quantize model |
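The "stale predictions" fix above boils down to swapping the cached model under a lock, so concurrent requests never observe a half-loaded state. A framework-agnostic sketch (`ModelStore` and `load_fn` are hypothetical names; `load_fn` stands in for `joblib.load`):

```python
"""Hot-swap a cached model under a lock so in-flight requests stay consistent."""
import threading


class ModelStore:
    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._lock = threading.Lock()
        self._model = load_fn()

    def get(self):
        with self._lock:
            return self._model

    def reload(self):
        fresh = self._load_fn()   # load before taking the lock: requests aren't blocked
        with self._lock:
            self._model = fresh   # atomic swap under the lock


# Simulate two artifact versions to show the swap.
versions = iter(["v1", "v2"])
store = ModelStore(lambda: next(versions))
assert store.get() == "v1"
store.reload()                    # what a /reload endpoint would call
assert store.get() == "v2"
```

A `/reload` endpoint would just call `store.reload()` and return the new version string.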
This is 1 of 10 resources in the ML Starter Kit toolkit. Get the complete Model Serving Templates package with all files, templates, and documentation for $49.
Or grab the entire ML Starter Kit bundle (10 products) for $149 — save 30%.