
Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Model Serving Toolkit

Training a great model is half the job. The other half — serving it reliably at scale with low latency, high throughput, and zero-downtime updates — is where most teams struggle. This toolkit gives you production-ready serving configurations for FastAPI, TensorFlow Serving, NVIDIA Triton, and BentoML, plus A/B testing infrastructure and canary deployment patterns. Each serving option includes health checks, request batching, model versioning, and monitoring integration. Pick the right server for your use case and deploy with confidence.

Key Features

  • FastAPI Model Server — Async server with Pydantic validation, batch prediction endpoints, and OpenAPI docs.
  • TensorFlow Serving Config — Production configs with model versioning, request batching, and gRPC/REST endpoints.
  • Triton Inference Server — Multi-model, multi-framework configs for PyTorch, TensorFlow, ONNX with dynamic batching.
  • BentoML Service Templates — Package models as Bentos with processing pipelines and containerized deployment.
  • A/B Testing Framework — Traffic splitting with statistical significance testing and automatic winner selection.
  • Canary Deployment Patterns — Gradual rollout with health checks and automatic rollback on degradation.
  • Request Batching — Server-side batching for GPU-efficient inference with configurable batch size and timeout.
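To make the last feature concrete, here is a minimal sketch of server-side micro-batching, assuming nothing from the toolkit itself (the `MicroBatcher` name and API are illustrative): concurrent requests queue up until `max_batch_size` is reached or `batch_timeout_ms` elapses, then the whole batch goes through the model in one call.

```python
import asyncio

class MicroBatcher:
    """Collect concurrent requests into one model call (illustrative helper,
    not the toolkit's API): a batch is flushed when it reaches max_batch_size
    or when batch_timeout_ms has passed since its first request."""

    def __init__(self, predict_fn, max_batch_size=32, batch_timeout_ms=50):
        self.predict_fn = predict_fn              # list[input] -> list[output]
        self.max_batch_size = max_batch_size
        self.timeout = batch_timeout_ms / 1000.0
        self.queue = asyncio.Queue()
        self._worker = None

    async def predict(self, features):
        if self._worker is None:                  # lazy-start the batching loop
            self._worker = asyncio.ensure_future(self._run())
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((features, fut))
        return await fut                          # resolved by the worker

    async def _run(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]      # block until a request arrives
            deadline = loop.time() + self.timeout
            # Fill the batch until it is full or the timeout expires.
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            results = self.predict_fn([features for features, _ in batch])
            for (_, fut), result in zip(batch, results):
                fut.set_result(result)

async def demo():
    # Six concurrent requests, batch size 4: the model runs twice, not six times.
    batcher = MicroBatcher(lambda xs: [x * 2 for x in xs], max_batch_size=4)
    out = await asyncio.gather(*(batcher.predict(i) for i in range(6)))
    batcher._worker.cancel()                      # stop the background loop
    return out
```

Every caller still awaits its own result; only the model invocation is shared, which is what keeps a GPU busy under many small requests.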

Quick Start

unzip model-serving-toolkit.zip && cd model-serving-toolkit
pip install -r requirements.txt

# Option 1: Start FastAPI server
python src/model_serving_toolkit/core.py serve --config config.example.yaml

# Option 2: Start with Docker
docker compose --profile fastapi up -d
The example configuration covers the server, model, batching, health checks, A/B testing, and monitoring:

# config.example.yaml
server:
  framework: fastapi  # fastapi | tfserving | triton | bentoml
  host: 0.0.0.0
  port: 8080
  workers: 4

model:
  name: churn_predictor
  version: v3
  path: ./models/churn_v3.onnx
  framework: onnx  # pytorch | tensorflow | onnx | xgboost

batching:
  enabled: true
  max_batch_size: 32
  batch_timeout_ms: 50

health_check:
  enabled: true
  endpoint: /health
  model_warmup: true

ab_testing:
  enabled: false
  variants:
    - { model_version: v3, traffic_percentage: 90 }
    - { model_version: v4, traffic_percentage: 10 }
  significance_level: 0.05
  min_samples_per_variant: 1000

monitoring:
  prometheus_metrics: true
  log_predictions: true

Architecture

┌────────────┐     ┌──────────────┐     ┌──────────────┐
│  Client    │────>│  Load        │────>│  Model       │
│  Request   │     │  Balancer    │     │  Server(s)   │
└────────────┘     └──────────────┘     └──────┬───────┘
                                                │
                   ┌──────────────┐     ┌──────▼───────┐
                   │  A/B Traffic │<────│  Request     │
                   │  Router      │     │  Batcher     │
                   └──────┬───────┘     └──────────────┘
                          │
              ┌───────────┼───────────┐
              │           │           │
       ┌──────▼──┐  ┌─────▼───┐  ┌───▼──────┐
       │ Model A │  │ Model B │  │ Canary   │
       │  (90%)  │  │  (10%)  │  │ Monitor  │
       └─────────┘  └─────────┘  └──────────┘
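The canary path in the diagram can be sketched as a small controller (the `CanaryController` name and API are illustrative, not the toolkit's): traffic to the candidate model ramps up in steps, and an error-rate check gates each step, rolling back automatically on degradation.

```python
import random

class CanaryController:
    """Gradual rollout sketch: ramp canary traffic through fixed steps,
    roll back if the canary's error rate exceeds the threshold."""

    def __init__(self, steps=(5, 25, 50, 100), max_error_rate=0.02):
        self.steps = steps                  # canary traffic percentages
        self.max_error_rate = max_error_rate
        self.step_idx = 0
        self.errors = 0
        self.requests = 0
        self.rolled_back = False

    @property
    def canary_percentage(self):
        return 0 if self.rolled_back else self.steps[self.step_idx]

    def route(self):
        """Pick 'canary' or 'stable' for one incoming request."""
        return "canary" if random.random() * 100 < self.canary_percentage else "stable"

    def record(self, variant, ok):
        """Record one request outcome for health accounting."""
        if variant == "canary":
            self.requests += 1
            self.errors += not ok

    def advance(self, min_requests=100):
        """Move to the next traffic step, or roll back on degradation."""
        if self.requests < min_requests:
            return                          # not enough data to decide yet
        if self.errors / self.requests > self.max_error_rate:
            self.rolled_back = True         # automatic rollback to stable
        elif self.step_idx < len(self.steps) - 1:
            self.step_idx += 1
        self.errors = self.requests = 0     # fresh window for the next step
```

In production the `record` calls would be fed by the same metrics the health checks scrape, and `advance` would run on a timer rather than being called by hand.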

Usage Examples

FastAPI Model Server

from model_serving_toolkit.core import ModelServer
from pydantic import BaseModel

class PredictionRequest(BaseModel):
    features: dict[str, float]

server = ModelServer.from_config("config.example.yaml")

@server.predict_endpoint("/predict")
async def predict(request: PredictionRequest):
    result = await server.predict(request.features)
    return {"prediction": result.prediction, "model_version": result.model_version}

@server.predict_endpoint("/predict/batch")
async def predict_batch(requests: list[PredictionRequest]):
    return await server.predict_batch([r.features for r in requests])

if __name__ == "__main__":
    server.run()

A/B Testing Between Model Versions

from model_serving_toolkit.core import ABTestRouter

router = ABTestRouter(
    variants={"v3": {"model_path": "./models/v3.onnx", "traffic": 0.9},
              "v4": {"model_path": "./models/v4.onnx", "traffic": 0.1}},
    significance_level=0.05, primary_metric="conversion_rate",
)

async def handle_request(features: dict) -> dict:
    variant, prediction = await router.route_and_predict(features)
    return {"prediction": prediction, "variant": variant}

router.log_outcome(request_id="req_123", variant="v4", outcome=1)
status = router.get_test_status()
print(f"Significant: {status['significant']}, Winner: {status.get('winner', 'TBD')}")
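Under the hood, "statistical significance testing" for a rate metric like conversion_rate typically boils down to a two-proportion z-test. Here is a self-contained sketch of that math; it is the standard formula, not necessarily the toolkit's exact implementation.

```python
from math import sqrt
from statistics import NormalDist

def ab_significance(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided two-proportion z-test.

    Returns (significant, p_value, winner) where winner is "A", "B",
    or None when the difference is not significant at level alpha.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)          # pooled conversion rate
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return False, 1.0, None                        # degenerate: identical rates
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))       # two-sided p-value
    significant = p_value < alpha
    winner = ("B" if p_b > p_a else "A") if significant else None
    return significant, p_value, winner
```

This is also why min_samples_per_variant matters: with small n, the standard error term stays large and even a real lift will not clear the significance_level.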

Triton Inference Server Configuration

from model_serving_toolkit.utils import TritonConfigGenerator

generator = TritonConfigGenerator()
generator.create_model_repo(
    model_name="churn_predictor",
    model_path="./models/churn_v3.onnx",
    platform="onnxruntime_onnx",
    max_batch_size=32,
    instance_count=2,
    dynamic_batching={"preferred_batch_size": [8, 16, 32], "max_queue_delay_microseconds": 50000},
    output_dir="./triton_repo/",
)
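For reference, a generator like the one above would emit a config.pbtxt in Triton's standard model-configuration format; with these arguments it would look roughly like this (a sketch, not the generator's literal output):

```protobuf
name: "churn_predictor"
platform: "onnxruntime_onnx"
max_batch_size: 32
instance_group [
  { count: 2 }
]
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 50000
}
```

Triton expects the repository layout `triton_repo/churn_predictor/config.pbtxt` with the model file at `triton_repo/churn_predictor/1/model.onnx`, where `1` is the version directory.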

Configuration Reference

| Parameter | Type | Default | Description |
|---|---|---|---|
| server.framework | str | fastapi | Serving framework |
| server.workers | int | 4 | Number of worker processes |
| batching.max_batch_size | int | 32 | Max requests per batch |
| batching.batch_timeout_ms | int | 50 | Max wait (ms) to fill a batch |
| ab_testing.significance_level | float | 0.05 | p-value threshold for declaring an A/B winner |

Best Practices

  1. Always warm up models — Cold inference is 10-100x slower. Send warmup requests during startup.
  2. Use server-side batching for GPU models — Batch 8-32 requests with a 50ms timeout for optimal throughput.
  3. Export to ONNX for serving — ONNX Runtime is 2-3x faster than native PyTorch inference.
  4. Load test at 2x peak traffic — If p99 latency exceeds your SLA, scale horizontally or optimize.
  5. Separate serving from business logic — The model server does inference only. Everything else is a separate service.
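Practices 1 and 4 can be combined into a tiny startup routine (a sketch around a generic `predict` callable, not a toolkit API): warm the model, then record per-request latencies and check the tail against your SLA before accepting traffic.

```python
import time

def warm_up_and_measure(predict, sample, warmup_requests=10, measured_requests=100):
    """Send warmup traffic, then return the approximate p99 latency in seconds."""
    for _ in range(warmup_requests):
        predict(sample)                         # pay lazy-load/JIT cost up front
    latencies = []
    for _ in range(measured_requests):
        start = time.perf_counter()
        predict(sample)
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    return latencies[int(0.99 * (len(latencies) - 1))]   # approximate p99
```

A natural place to call this is the readiness probe: keep /health failing until the measured p99 comes back under your SLA, so the load balancer never routes traffic to a cold instance.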

Troubleshooting

| Issue | Cause | Fix |
|---|---|---|
| High p99 latency with low average | GC pauses or cold model loads | Enable model warmup, increase workers, use async request handling |
| Batch predictions slower than single | Batch size too small for GPU | Increase max_batch_size and batch_timeout_ms to fill the GPU |
| A/B test never reaches significance | Not enough traffic | Increase traffic_percentage for the challenger variant, or lower min_samples_per_variant |
| ONNX model loading fails | Opset version mismatch | Re-export with torch.onnx.export(opset_version=17) matching your ONNX Runtime version |

This is 1 of 11 resources in the ML Engineer Toolkit. Get the complete Model Serving Toolkit with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire ML Engineer Toolkit bundle (11 products) for $149 — save 30%.

Get the Complete Bundle →

