Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Model Serving Templates

Production-ready model serving with FastAPI and Flask. Deploy ML models as REST APIs with async inference, request batching, A/B testing, input validation, and canary deployment configs — ready for Docker and Kubernetes.

Key Features

  • FastAPI async serving — non-blocking endpoints with automatic OpenAPI docs
  • Flask + Gunicorn — battle-tested synchronous serving for simpler deployments
  • Request batching — accumulate requests for GPU throughput optimization
  • A/B testing — traffic splitting between model versions with metric collection
  • Input validation — Pydantic schemas that reject malformed requests
  • Model caching — load once at startup with configurable warm-up and health checks
  • Canary deployments — Kubernetes manifests for gradual rollouts

Quick Start

# 1. Copy the config
cp config.example.yaml config.yaml

# 2. Start the FastAPI server
uvicorn templates.fastapi_serve:app --host 0.0.0.0 --port 8000

# 3. Test the endpoint
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [5.1, 3.5, 1.4, 0.2]}'
"""FastAPI model serving with Pydantic validation."""
from fastapi import FastAPI
from pydantic import BaseModel, Field
import joblib
import numpy as np

app = FastAPI(title="ML Model API", version="1.0.0")
model = None

@app.on_event("startup")
async def load_model():
    global model
    model = joblib.load("artifacts/model.pkl")

class PredictRequest(BaseModel):
    features: list[float] = Field(..., min_length=4, max_length=4)

@app.post("/predict")
async def predict(req: PredictRequest):
    X = np.array(req.features).reshape(1, -1)
    return {"prediction": int(model.predict(X)[0]),
            "probability": float(model.predict_proba(X)[0].max()),
            "model_version": "v2.1"}

@app.get("/health")
async def health():
    return {"status": "healthy", "model_loaded": model is not None}

Architecture

model-serving-templates/
├── config.example.yaml            # Serving configuration
├── templates/
│   ├── fastapi_serve.py           # FastAPI async serving
│   ├── flask_serve.py             # Flask + Gunicorn serving
│   ├── batched_inference.py       # Request batching for GPU
│   ├── ab_testing.py              # A/B traffic splitting
│   ├── middleware/                # Auth, rate limiting, logging
│   ├── deployment/                # Dockerfile, k8s manifests, canary configs
│   └── testing/                   # Load tests, smoke tests
├── docs/
│   └── overview.md
└── examples/
    ├── sklearn_serving.py
    └── pytorch_serving.py

Usage Examples

Request Batching for GPU Models

"""Accumulate requests and run batch inference for GPU throughput."""
import asyncio
from collections import deque
import torch

class BatchPredictor:
    def __init__(self, model, max_batch: int = 32, max_wait_ms: float = 50):
        self.model, self.max_batch = model, max_batch
        self.max_wait, self.queue = max_wait_ms / 1000, deque()

    async def predict(self, features: list[float]) -> dict:
        future = asyncio.get_event_loop().create_future()
        self.queue.append((features, future))
        if len(self.queue) >= self.max_batch:
            await self._flush()
        else:
            await asyncio.sleep(self.max_wait)
            if self.queue:
                await self._flush()
        return await future

    async def _flush(self):
        items = [self.queue.popleft() for _ in range(min(len(self.queue), self.max_batch))]
        with torch.no_grad():
            outputs = self.model(torch.tensor([i[0] for i in items]).to("cuda"))
        for (_, fut), out in zip(items, outputs):
            fut.set_result({"prediction": out.argmax().item()})
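To see the pattern work without a GPU, here is a miniature CPU re-implementation of the same batcher (names and sizes are illustrative, not part of the templates) driven by eight concurrent requests:

```python
"""Sanity check of the batching pattern with a CPU stub model."""
import asyncio

class MiniBatcher:
    def __init__(self, model_fn, max_batch: int = 4, max_wait: float = 0.01):
        self.model_fn, self.max_batch, self.max_wait = model_fn, max_batch, max_wait
        self.queue = []

    async def predict(self, features):
        fut = asyncio.get_running_loop().create_future()
        self.queue.append((features, fut))
        if len(self.queue) >= self.max_batch:
            self._flush()                      # batch full: run immediately
        else:
            await asyncio.sleep(self.max_wait)
            if self.queue:                     # deadline hit: run a partial batch
                self._flush()
        return await fut

    def _flush(self):
        items, self.queue = self.queue[:self.max_batch], self.queue[self.max_batch:]
        outputs = self.model_fn([feats for feats, _ in items])
        for (_, fut), out in zip(items, outputs):
            if not fut.done():
                fut.set_result(out)

async def demo():
    batch_sizes = []
    def stub_model(batch):                     # records requests served per call
        batch_sizes.append(len(batch))
        return [sum(row) for row in batch]
    batcher = MiniBatcher(stub_model, max_batch=4)
    results = await asyncio.gather(*(batcher.predict([i, i]) for i in range(8)))
    return results, batch_sizes

results, batch_sizes = asyncio.run(demo())
# each model call serves up to max_batch requests instead of one
```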

A/B Testing

"""Route traffic between model versions by weight."""
import random

class ABRouter:
    def __init__(self, models: dict[str, object], weights: dict[str, float]):
        self.models, self.weights = models, weights

    def route(self, features) -> tuple[str, object]:
        rand, cumulative = random.random(), 0.0
        for variant, weight in self.weights.items():
            cumulative += weight
            if rand <= cumulative:
                return variant, self.models[variant].predict(features)
        return "control", self.models["control"].predict(features)
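A usage sketch with stub models and a counter as the metric hook (the router class is repeated so the snippet runs standalone; variant names follow config.example.yaml):

```python
"""Split traffic 80/20 and tally which variant served each request."""
import random
from collections import Counter

class ABRouter:  # repeated from the template above for a standalone demo
    def __init__(self, models, weights):
        self.models, self.weights = models, weights

    def route(self, features):
        rand, cumulative = random.random(), 0.0
        for variant, weight in self.weights.items():
            cumulative += weight
            if rand <= cumulative:
                return variant, self.models[variant].predict(features)
        return "control", self.models["control"].predict(features)

class StubModel:
    def __init__(self, label):
        self.label = label
    def predict(self, features):
        return self.label            # stand-in for a real prediction

random.seed(0)                       # deterministic for the demo
router = ABRouter(
    models={"control": StubModel("v1"), "treatment": StubModel("v2")},
    weights={"control": 0.8, "treatment": 0.2},
)
served = Counter(router.route([5.1, 3.5, 1.4, 0.2])[0] for _ in range(1000))
# served counts should sit near the 80/20 split
```

In a real deployment the counter would feed your metrics backend so prediction quality can be compared per variant.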

Configuration

# config.example.yaml
server:
  host: "0.0.0.0"
  port: 8000
  workers: 4                        # Gunicorn/uvicorn workers
  timeout: 30                       # Request timeout seconds

model:
  path: "artifacts/model.pkl"
  version: "v2.1"
  warm_up: true                      # Dummy prediction on startup

batching:
  enabled: false
  max_batch_size: 32
  max_wait_ms: 50

ab_testing:
  enabled: false
  variants:
    control: { model: "artifacts/model_v1.pkl", weight: 0.8 }
    treatment: { model: "artifacts/model_v2.pkl", weight: 0.2 }
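Loading the config is straightforward with PyYAML (an assumed dependency; the YAML is inlined here so the snippet runs without the file):

```python
"""Load and sanity-check the serving config with yaml.safe_load."""
import yaml

CONFIG_YAML = """
server:
  host: "0.0.0.0"
  port: 8000
  workers: 4
model:
  path: "artifacts/model.pkl"
  version: "v2.1"
  warm_up: true
batching:
  enabled: false
  max_batch_size: 32
"""

config = yaml.safe_load(CONFIG_YAML)
# Fail fast on obviously bad values before the server starts
assert 1 <= config["server"]["port"] <= 65535, "port out of range"
assert config["server"]["workers"] >= 1, "need at least one worker"
```

In the templates you would replace the inline string with `yaml.safe_load(open("config.yaml"))`.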

Best Practices

  1. Always validate inputs with Pydantic — reject bad requests before they waste compute
  2. Load models at startup, not per-request — use a startup hook (@app.on_event("startup"), or a lifespan handler on newer FastAPI) to load once
  3. Return model version in every response — essential for debugging and A/B analysis
  4. Set request timeouts — prevents slow predictions from consuming all workers
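The warm_up: true option in the config pairs with practice 2; a sketch of what it might do (stub model and shapes are assumptions, and the real template would call this from the startup hook with the loaded artifact):

```python
"""Run dummy predictions at startup so the first real request skips cold-start cost."""
import numpy as np

def warm_up(model, n_features: int = 4, n_rounds: int = 3) -> None:
    """Trigger lazy allocations, JIT compilation, and thread-pool spin-up."""
    dummy = np.zeros((1, n_features))
    for _ in range(n_rounds):
        model.predict(dummy)

class StubModel:
    def __init__(self):
        self.calls = 0
    def predict(self, X):
        self.calls += 1
        return [0]

m = StubModel()
warm_up(m)  # m has now served n_rounds dummy predictions
```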

Troubleshooting

| Problem | Cause | Fix |
| --- | --- | --- |
| 422 Validation Error | Request body doesn't match the schema | Check field names/types; test with the /docs Swagger UI |
| Stale predictions | Old model cached in memory | Restart the server or add a /reload endpoint |
| Timeout under load | Workers saturated | Increase workers or enable batching |
| Container OOM | Model too large | Increase the memory limit or quantize the model |

This is 1 of 10 resources in the ML Starter Kit toolkit. Get the complete Model Serving Templates kit with all files, templates, and documentation for $49.

Get the Full Kit →

Or grab the entire ML Starter Kit bundle (10 products) for $149 — save 30%.

Get the Complete Bundle →

