
Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Model Serving Toolkit

Training a great model is half the job. The other half — serving it reliably at scale with low latency, high throughput, and zero-downtime updates — is where most teams struggle. This toolkit gives you production-ready serving configurations for FastAPI, TensorFlow Serving, NVIDIA Triton, and BentoML, plus A/B testing infrastructure and canary deployment patterns. Each serving option includes health checks, request batching, model versioning, and monitoring integration. Pick the right server for your use case and deploy with confidence.

Key Features

  • FastAPI Model Server — Async server with Pydantic validation, batch prediction endpoints, and OpenAPI docs.
  • TensorFlow Serving Config — Production configs with model versioning, request batching, and gRPC/REST endpoints.
  • Triton Inference Server — Multi-model, multi-framework configs for PyTorch, TensorFlow, ONNX with dynamic batching.
  • BentoML Service Templates — Package models as Bentos with processing pipelines and containerized deployment.
  • A/B Testing Framework — Traffic splitting with statistical significance testing and automatic winner selection.
  • Canary Deployment Patterns — Gradual rollout with health checks and automatic rollback on degradation.
  • Request Batching — Server-side batching for GPU-efficient inference with configurable batch size and timeout.
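To make the last feature concrete, here is a minimal sketch of server-side micro-batching, assuming nothing from the toolkit itself (the `MicroBatcher` name and API are illustrative): concurrent requests queue up until `max_batch_size` is reached or `batch_timeout_ms` elapses, then the whole batch goes through the model in one call.

```python
import asyncio

class MicroBatcher:
    """Collect concurrent requests into one model call (illustrative helper,
    not the toolkit's API): a batch is flushed when it reaches max_batch_size
    or when batch_timeout_ms has passed since its first request."""

    def __init__(self, predict_fn, max_batch_size=32, batch_timeout_ms=50):
        self.predict_fn = predict_fn              # list[input] -> list[output]
        self.max_batch_size = max_batch_size
        self.timeout = batch_timeout_ms / 1000.0
        self.queue = asyncio.Queue()
        self._worker = None

    async def predict(self, features):
        if self._worker is None:                  # lazy-start the batching loop
            self._worker = asyncio.ensure_future(self._run())
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((features, fut))
        return await fut                          # resolved by the worker

    async def _run(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]      # block until a request arrives
            deadline = loop.time() + self.timeout
            # Fill the batch until it is full or the timeout expires.
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            results = self.predict_fn([features for features, _ in batch])
            for (_, fut), result in zip(batch, results):
                fut.set_result(result)

async def demo():
    # Six concurrent requests, batch size 4: the model runs twice, not six times.
    batcher = MicroBatcher(lambda xs: [x * 2 for x in xs], max_batch_size=4)
    out = await asyncio.gather(*(batcher.predict(i) for i in range(6)))
    batcher._worker.cancel()                      # stop the background loop
    return out
```

Every caller still awaits its own result; only the model invocation is shared, which is what keeps a GPU busy under many small requests.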

Quick Start

unzip model-serving-toolkit.zip && cd model-serving-toolkit
pip install -r requirements.txt

# Option 1: Start FastAPI server
python src/model_serving_toolkit/core.py serve --config config.example.yaml

# Option 2: Start with Docker
docker compose --profile fastapi up -d
The example configuration covers the server, model, batching, health checks, A/B testing, and monitoring:

# config.example.yaml
server:
  framework: fastapi  # fastapi | tfserving | triton | bentoml
  host: 0.0.0.0
  port: 8080
  workers: 4

model:
  name: churn_predictor
  version: v3
  path: ./models/churn_v3.onnx
  framework: onnx  # pytorch | tensorflow | onnx | xgboost

batching:
  enabled: true
  max_batch_size: 32
  batch_timeout_ms: 50

health_check:
  enabled: true
  endpoint: /health
  model_warmup: true

ab_testing:
  enabled: false
  variants:
    - { model_version: v3, traffic_percentage: 90 }
    - { model_version: v4, traffic_percentage: 10 }
  significance_level: 0.05
  min_samples_per_variant: 1000

monitoring:
  prometheus_metrics: true
  log_predictions: true

Architecture

┌────────────┐     ┌──────────────┐     ┌──────────────┐
│  Client    │────>│  Load        │────>│  Model       │
│  Request   │     │  Balancer    │     │  Server(s)   │
└────────────┘     └──────────────┘     └──────┬───────┘
                                                │
                   ┌──────────────┐     ┌──────▼───────┐
                   │  A/B Traffic │<────│  Request     │
                   │  Router      │     │  Batcher     │
                   └──────┬───────┘     └──────────────┘
                          │
              ┌───────────┼───────────┐
              │           │           │
       ┌──────▼──┐  ┌─────▼───┐  ┌───▼──────┐
       │ Model A │  │ Model B │  │ Canary   │
       │  (90%)  │  │  (10%)  │  │ Monitor  │
       └─────────┘  └─────────┘  └──────────┘
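The canary path in the diagram can be sketched as a small controller (the `CanaryController` name and API are illustrative, not the toolkit's): traffic to the candidate model ramps up in steps, and an error-rate check gates each step, rolling back automatically on degradation.

```python
import random

class CanaryController:
    """Gradual rollout sketch: ramp canary traffic through fixed steps,
    roll back if the canary's error rate exceeds the threshold."""

    def __init__(self, steps=(5, 25, 50, 100), max_error_rate=0.02):
        self.steps = steps                  # canary traffic percentages
        self.max_error_rate = max_error_rate
        self.step_idx = 0
        self.errors = 0
        self.requests = 0
        self.rolled_back = False

    @property
    def canary_percentage(self):
        return 0 if self.rolled_back else self.steps[self.step_idx]

    def route(self):
        """Pick 'canary' or 'stable' for one incoming request."""
        return "canary" if random.random() * 100 < self.canary_percentage else "stable"

    def record(self, variant, ok):
        """Record one request outcome for health accounting."""
        if variant == "canary":
            self.requests += 1
            self.errors += not ok

    def advance(self, min_requests=100):
        """Move to the next traffic step, or roll back on degradation."""
        if self.requests < min_requests:
            return                          # not enough data to decide yet
        if self.errors / self.requests > self.max_error_rate:
            self.rolled_back = True         # automatic rollback to stable
        elif self.step_idx < len(self.steps) - 1:
            self.step_idx += 1
        self.errors = self.requests = 0     # fresh window for the next step
```

In production the `record` calls would be fed by the same metrics the health checks scrape, and `advance` would run on a timer rather than being called by hand.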

Usage Examples

FastAPI Model Server

from model_serving_toolkit.core import ModelServer
from pydantic import BaseModel

class PredictionRequest(BaseModel):
    features: dict[str, float]

server = ModelServer.from_config("config.example.yaml")

@server.predict_endpoint("/predict")
async def predict(request: PredictionRequest):
    result = await server.predict(request.features)
    return {"prediction": result.prediction, "model_version": result.model_version}

@server.predict_endpoint("/predict/batch")
async def predict_batch(requests: list[PredictionRequest]):
    return await server.predict_batch([r.features for r in requests])

if __name__ == "__main__":
    server.run()

A/B Testing Between Model Versions

from model_serving_toolkit.core import ABTestRouter

router = ABTestRouter(
    variants={"v3": {"model_path": "./models/v3.onnx", "traffic": 0.9},
              "v4": {"model_path": "./models/v4.onnx", "traffic": 0.1}},
    significance_level=0.05, primary_metric="conversion_rate",
)

async def handle_request(features: dict) -> dict:
    variant, prediction = await router.route_and_predict(features)
    return {"prediction": prediction, "variant": variant}

router.log_outcome(request_id="req_123", variant="v4", outcome=1)
status = router.get_test_status()
print(f"Significant: {status['significant']}, Winner: {status.get('winner', 'TBD')}")
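Under the hood, "statistical significance testing" for a rate metric like conversion_rate typically boils down to a two-proportion z-test. Here is a self-contained sketch of that math; it is the standard formula, not necessarily the toolkit's exact implementation.

```python
from math import sqrt
from statistics import NormalDist

def ab_significance(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided two-proportion z-test.

    Returns (significant, p_value, winner) where winner is "A", "B",
    or None when the difference is not significant at level alpha.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)          # pooled conversion rate
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return False, 1.0, None                        # degenerate: identical rates
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))       # two-sided p-value
    significant = p_value < alpha
    winner = ("B" if p_b > p_a else "A") if significant else None
    return significant, p_value, winner
```

This is also why min_samples_per_variant matters: with small n, the standard error term stays large and even a real lift will not clear the significance_level.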

Triton Inference Server Configuration

from model_serving_toolkit.utils import TritonConfigGenerator

generator = TritonConfigGenerator()
generator.create_model_repo(
    model_name="churn_predictor",
    model_path="./models/churn_v3.onnx",
    platform="onnxruntime_onnx",
    max_batch_size=32,
    instance_count=2,
    dynamic_batching={"preferred_batch_size": [8, 16, 32], "max_queue_delay_microseconds": 50000},
    output_dir="./triton_repo/",
)
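For reference, a generator like the one above would emit a config.pbtxt in Triton's standard model-configuration format; with these arguments it would look roughly like this (a sketch, not the generator's literal output):

```protobuf
name: "churn_predictor"
platform: "onnxruntime_onnx"
max_batch_size: 32
instance_group [
  { count: 2 }
]
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 50000
}
```

Triton expects the repository layout `triton_repo/churn_predictor/config.pbtxt` with the model file at `triton_repo/churn_predictor/1/model.onnx`, where `1` is the version directory.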

Configuration Reference

| Parameter | Type | Default | Description |
|---|---|---|---|
| server.framework | str | fastapi | Serving framework |
| server.workers | int | 4 | Number of worker processes |
| batching.max_batch_size | int | 32 | Max requests per batch |
| batching.batch_timeout_ms | int | 50 | Max wait (ms) to fill a batch |
| ab_testing.significance_level | float | 0.05 | p-value threshold for declaring an A/B winner |

Best Practices

  1. Always warm up models — Cold inference is 10-100x slower. Send warmup requests during startup.
  2. Use server-side batching for GPU models — Batch 8-32 requests with a 50ms timeout for optimal throughput.
  3. Export to ONNX for serving — ONNX Runtime is 2-3x faster than native PyTorch inference.
  4. Load test at 2x peak traffic — If p99 latency exceeds your SLA, scale horizontally or optimize.
  5. Separate serving from business logic — The model server does inference only. Everything else is a separate service.
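Practices 1 and 4 can be combined into a tiny startup routine (a sketch around a generic `predict` callable, not a toolkit API): warm the model, then record per-request latencies and check the tail against your SLA before accepting traffic.

```python
import time

def warm_up_and_measure(predict, sample, warmup_requests=10, measured_requests=100):
    """Send warmup traffic, then return the approximate p99 latency in seconds."""
    for _ in range(warmup_requests):
        predict(sample)                         # pay lazy-load/JIT cost up front
    latencies = []
    for _ in range(measured_requests):
        start = time.perf_counter()
        predict(sample)
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    return latencies[int(0.99 * (len(latencies) - 1))]   # approximate p99
```

A natural place to call this is the readiness probe: keep /health failing until the measured p99 comes back under your SLA, so the load balancer never routes traffic to a cold instance.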

Troubleshooting

| Issue | Cause | Fix |
|---|---|---|
| High p99 latency with low average | GC pauses or cold model loads | Enable model warmup, increase workers, use async request handling |
| Batch predictions slower than single | Batch size too small for GPU | Increase max_batch_size and batch_timeout_ms to fill the GPU |
| A/B test never reaches significance | Not enough traffic | Increase traffic_percentage for the challenger variant, or lower min_samples_per_variant |
| ONNX model loading fails | Opset version mismatch | Re-export with torch.onnx.export(opset_version=17) matching your ONNX Runtime version |

This is 1 of 11 resources in the ML Engineer Toolkit. Get the complete Model Serving Toolkit with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire ML Engineer Toolkit bundle (11 products) for $149 — save 30%.

Get the Complete Bundle →

