# Model Serving Templates

Production-ready model serving with FastAPI and Flask. Deploy ML models as REST APIs with async inference, request batching, A/B testing, input validation, and canary deployment configs — ready for Docker and Kubernetes.
## Key Features
- FastAPI async serving — non-blocking endpoints with automatic OpenAPI docs
- Flask + Gunicorn — battle-tested synchronous serving for simpler deployments
- Request batching — accumulate requests for GPU throughput optimization
- A/B testing — traffic splitting between model versions with metric collection
- Input validation — Pydantic schemas that reject malformed requests
- Model caching — load once at startup with configurable warm-up and health checks
- Canary deployments — Kubernetes manifests for gradual rollouts
## Quick Start

```bash
# 1. Copy the config
cp config.example.yaml config.yaml

# 2. Start the FastAPI server
uvicorn templates.fastapi_serve:app --host 0.0.0.0 --port 8000

# 3. Test the endpoint
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [5.1, 3.5, 1.4, 0.2]}'
```
"""FastAPI model serving with Pydantic validation."""
from fastapi import FastAPI
from pydantic import BaseModel, Field
import joblib, numpy as np
app = FastAPI(title="ML Model API", version="1.0.0")
model = None
@app.on_event("startup")
async def load_model():
global model
model = joblib.load("artifacts/model.pkl")
class PredictRequest(BaseModel):
features: list[float] = Field(..., min_length=4, max_length=4)
@app.post("/predict")
async def predict(req: PredictRequest):
X = np.array(req.features).reshape(1, -1)
return {"prediction": int(model.predict(X)[0]),
"probability": float(model.predict_proba(X)[0].max()),
"model_version": "v2.1"}
@app.get("/health")
async def health():
return {"status": "healthy", "model_loaded": model is not None}
## Architecture

```
model-serving-templates/
├── config.example.yaml        # Serving configuration
├── templates/
│   ├── fastapi_serve.py       # FastAPI async serving
│   ├── flask_serve.py         # Flask + Gunicorn serving
│   ├── batched_inference.py   # Request batching for GPU
│   ├── ab_testing.py          # A/B traffic splitting
│   ├── middleware/            # Auth, rate limiting, logging
│   ├── deployment/            # Dockerfile, k8s manifests, canary configs
│   └── testing/               # Load tests, smoke tests
├── docs/
│   └── overview.md
└── examples/
    ├── sklearn_serving.py
    └── pytorch_serving.py
```
## Usage Examples

### Request Batching for GPU Models
"""Accumulate requests and run batch inference for GPU throughput."""
import asyncio
from collections import deque
import torch
class BatchPredictor:
def __init__(self, model, max_batch: int = 32, max_wait_ms: float = 50):
self.model, self.max_batch = model, max_batch
self.max_wait, self.queue = max_wait_ms / 1000, deque()
async def predict(self, features: list[float]) -> dict:
future = asyncio.get_event_loop().create_future()
self.queue.append((features, future))
if len(self.queue) >= self.max_batch:
await self._flush()
else:
await asyncio.sleep(self.max_wait)
if self.queue:
await self._flush()
return await future
async def _flush(self):
items = [self.queue.popleft() for _ in range(min(len(self.queue), self.max_batch))]
with torch.no_grad():
outputs = self.model(torch.tensor([i[0] for i in items]).to("cuda"))
for (_, fut), out in zip(items, outputs):
fut.set_result({"prediction": out.argmax().item()})
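The GPU batcher needs CUDA, but the accumulate-then-flush pattern itself can be demonstrated on CPU with a stand-in model. `DummyModel` and `SimpleBatcher` below are illustrative names, not part of the templates:

```python
"""CPU demo of the accumulate-then-flush batching pattern (no torch/GPU needed)."""
import asyncio


class DummyModel:
    """Stand-in model: returns the argmax index of each feature row."""
    def __call__(self, batch):  # batch: list[list[float]]
        return [max(range(len(row)), key=row.__getitem__) for row in batch]


class SimpleBatcher:
    def __init__(self, model, max_batch: int = 4, max_wait_ms: float = 10):
        self.model = model
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000
        self.queue = []

    async def predict(self, features):
        fut = asyncio.get_running_loop().create_future()
        self.queue.append((features, fut))
        if len(self.queue) >= self.max_batch:
            self._flush()                     # full batch: run now
        else:
            await asyncio.sleep(self.max_wait)  # wait for more requests
            if self.queue:
                self._flush()
        return await fut

    def _flush(self):
        items, self.queue = self.queue[: self.max_batch], self.queue[self.max_batch :]
        for (_, fut), out in zip(items, self.model([f for f, _ in items])):
            fut.set_result({"prediction": out})


async def main():
    batcher = SimpleBatcher(DummyModel(), max_batch=2)
    # Two concurrent requests are merged into a single model call.
    return await asyncio.gather(
        batcher.predict([0.1, 0.9]),
        batcher.predict([0.8, 0.2]),
    )


print(asyncio.run(main()))  # [{'prediction': 1}, {'prediction': 0}]
```

The second request fills the batch, triggering an immediate flush that resolves both futures at once.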
### A/B Testing

```python
"""Route traffic between model versions by weight."""
import random


class ABRouter:
    def __init__(self, models: dict[str, object], weights: dict[str, float]):
        # weights should sum to 1.0, e.g. {"control": 0.8, "treatment": 0.2}
        self.models = models
        self.weights = weights

    def route(self, features) -> tuple[str, object]:
        rand, cumulative = random.random(), 0.0
        for variant, weight in self.weights.items():
            cumulative += weight
            if rand <= cumulative:
                return variant, self.models[variant].predict(features)
        # Fall back to control if the weights don't cover [0, 1].
        return "control", self.models["control"].predict(features)
```
## Configuration

```yaml
# config.example.yaml
server:
  host: "0.0.0.0"
  port: 8000
  workers: 4          # Gunicorn/uvicorn workers
  timeout: 30         # Request timeout in seconds
model:
  path: "artifacts/model.pkl"
  version: "v2.1"
  warm_up: true       # Run a dummy prediction on startup
batching:
  enabled: false
  max_batch_size: 32
  max_wait_ms: 50
ab_testing:
  enabled: false
  variants:
    control: { model: "artifacts/model_v1.pkl", weight: 0.8 }
    treatment: { model: "artifacts/model_v2.pkl", weight: 0.2 }
```
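The variant weights are traffic probabilities, so it's worth failing fast at startup if they don't sum to 1.0 or if the `control` fallback is missing. A minimal sanity check, assuming the YAML has already been parsed into a dict (`check_ab_config` is an illustrative name):

```python
"""Validate A/B variant weights at startup (assumes config is already parsed)."""
import math


def check_ab_config(variants: dict[str, dict]) -> None:
    total = sum(v["weight"] for v in variants.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"variant weights must sum to 1.0, got {total}")
    if "control" not in variants:
        # ABRouter falls back to models["control"], so it must exist.
        raise ValueError("a 'control' variant is required as the fallback")


check_ab_config({
    "control":   {"model": "artifacts/model_v1.pkl", "weight": 0.8},
    "treatment": {"model": "artifacts/model_v2.pkl", "weight": 0.2},
})
```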
## Best Practices

- Always validate inputs with Pydantic — reject bad requests before they waste compute
- Load models at startup, not per-request — use `@app.on_event("startup")` to load once
- Return the model version in every response — essential for debugging and A/B analysis
- Set request timeouts — prevents slow predictions from consuming all workers
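The `warm_up: true` flag pairs with the load-at-startup advice: run one dummy prediction right after loading so the first real request doesn't pay for lazy initialization. A sketch with a stand-in model (`StandInModel` is illustrative, not the joblib artifact):

```python
"""Warm-up: run one dummy prediction at startup and report its latency."""
import time


def warm_up(model, n_features: int = 4) -> float:
    """Feed a zero-filled row through the model; return latency in ms."""
    dummy = [[0.0] * n_features]
    start = time.perf_counter()
    model.predict(dummy)
    return (time.perf_counter() - start) * 1000


class StandInModel:
    """Hypothetical stand-in for the real artifact, to keep the sketch runnable."""
    def predict(self, X):
        return [0 for _ in X]


latency_ms = warm_up(StandInModel())
print(f"warm-up prediction took {latency_ms:.2f} ms")
```

Logging this latency also gives a cheap health signal: a warm-up that suddenly takes seconds hints at a cold cache or an oversized model.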
## Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| 422 Validation Error | Request body doesn't match schema | Check field names/types; test with /docs Swagger UI |
| Stale predictions | Old model cached in memory | Restart server or add a /reload endpoint |
| Timeout under load | Workers saturated | Increase workers or enable batching |
| Container OOM | Model too large | Increase memory limit or quantize model |
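The "stale predictions" fix above boils down to swapping the cached model under a lock, so concurrent requests never observe a half-loaded state. A framework-agnostic sketch (`ModelStore` and `load_fn` are hypothetical names; `load_fn` stands in for `joblib.load`):

```python
"""Hot-swap a cached model under a lock so in-flight requests stay consistent."""
import threading


class ModelStore:
    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._lock = threading.Lock()
        self._model = load_fn()

    def get(self):
        with self._lock:
            return self._model

    def reload(self):
        fresh = self._load_fn()   # load before taking the lock: requests aren't blocked
        with self._lock:
            self._model = fresh   # atomic swap under the lock


# Simulate two artifact versions to show the swap.
versions = iter(["v1", "v2"])
store = ModelStore(lambda: next(versions))
assert store.get() == "v1"
store.reload()                    # what a /reload endpoint would call
assert store.get() == "v2"
```

A `/reload` endpoint would just call `store.reload()` and return the new version string.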
This is 1 of 10 resources in the ML Starter Kit toolkit. Get the complete Model Serving Templates package with all files, templates, and documentation for $49.
Or grab the entire ML Starter Kit bundle (10 products) for $149 — save 30%.