# Model Serving Toolkit
Training a great model is half the job. The other half — serving it reliably at scale with low latency, high throughput, and zero-downtime updates — is where most teams struggle. This toolkit gives you production-ready serving configurations for FastAPI, TensorFlow Serving, NVIDIA Triton, and BentoML, plus A/B testing infrastructure and canary deployment patterns. Each serving option includes health checks, request batching, model versioning, and monitoring integration. Pick the right server for your use case and deploy with confidence.
## Key Features
- FastAPI Model Server — Async server with Pydantic validation, batch prediction endpoints, and OpenAPI docs.
- TensorFlow Serving Config — Production configs with model versioning, request batching, and gRPC/REST endpoints.
- Triton Inference Server — Multi-model, multi-framework configs for PyTorch, TensorFlow, ONNX with dynamic batching.
- BentoML Service Templates — Package models as Bentos with processing pipelines and containerized deployment.
- A/B Testing Framework — Traffic splitting with statistical significance testing and automatic winner selection.
- Canary Deployment Patterns — Gradual rollout with health checks and automatic rollback on degradation.
- Request Batching — Server-side batching for GPU-efficient inference with configurable batch size and timeout.
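The request-batching feature above follows a standard micro-batching pattern: collect incoming requests until the batch is full or a timeout expires, then run one batched inference call. Here is a minimal, stdlib-only sketch of that pattern (the `MicroBatcher` class and its method names are illustrative, not the toolkit's actual API):

```python
import asyncio


class MicroBatcher:
    """Collect individual requests into batches for GPU-efficient inference."""

    def __init__(self, predict_fn, max_batch_size=32, batch_timeout_ms=50):
        self.predict_fn = predict_fn          # maps a list of inputs to a list of results
        self.max_batch_size = max_batch_size
        self.timeout = batch_timeout_ms / 1000
        self.queue: asyncio.Queue = asyncio.Queue()

    async def predict(self, features):
        """Enqueue one request and await its result."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((features, fut))
        return await fut

    async def run(self):
        """Background loop: drain the queue into batches and run inference."""
        while True:
            items = [await self.queue.get()]  # block until at least one request arrives
            deadline = asyncio.get_running_loop().time() + self.timeout
            # Fill the batch until it is full or the timeout expires.
            while len(items) < self.max_batch_size:
                remaining = deadline - asyncio.get_running_loop().time()
                if remaining <= 0:
                    break
                try:
                    items.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            results = self.predict_fn([features for features, _ in items])
            for (_, fut), result in zip(items, results):
                fut.set_result(result)
```

The timeout is the latency price you pay for throughput: with a 50 ms window, a lone request waits up to 50 ms before the (partial) batch is flushed.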
## Quick Start

```bash
unzip model-serving-toolkit.zip && cd model-serving-toolkit
pip install -r requirements.txt

# Option 1: Start the FastAPI server
python src/model_serving_toolkit/core.py serve --config config.example.yaml

# Option 2: Start with Docker (--profile is a top-level compose flag)
docker compose --profile fastapi up -d
```
```yaml
# config.example.yaml
server:
  framework: fastapi        # fastapi | tfserving | triton | bentoml
  host: 0.0.0.0
  port: 8080
  workers: 4

model:
  name: churn_predictor
  version: v3
  path: ./models/churn_v3.onnx
  framework: onnx           # pytorch | tensorflow | onnx | xgboost

batching:
  enabled: true
  max_batch_size: 32
  batch_timeout_ms: 50

health_check:
  enabled: true
  endpoint: /health
  model_warmup: true

ab_testing:
  enabled: false
  variants:
    - { model_version: v3, traffic_percentage: 90 }
    - { model_version: v4, traffic_percentage: 10 }
  significance_level: 0.05
  min_samples_per_variant: 1000

monitoring:
  prometheus_metrics: true
  log_predictions: true
```
## Architecture

```
┌────────────┐      ┌──────────────┐      ┌──────────────┐
│   Client   │─────>│     Load     │─────>│    Model     │
│  Request   │      │   Balancer   │      │  Server(s)   │
└────────────┘      └──────────────┘      └──────┬───────┘
                                                 │
                    ┌──────────────┐      ┌──────▼───────┐
                    │ A/B Traffic  │<─────│   Request    │
                    │    Router    │      │   Batcher    │
                    └──────┬───────┘      └──────────────┘
                           │
               ┌───────────┼───────────┐
               │           │           │
         ┌─────▼───┐  ┌────▼────┐  ┌───▼──────┐
         │ Model A │  │ Model B │  │  Canary  │
         │  (90%)  │  │  (10%)  │  │ Monitor  │
         └─────────┘  └─────────┘  └──────────┘
```
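The canary monitor in the diagram can be approximated by a simple control loop: ramp traffic to the new version in stages, and roll back to 0% the moment a health metric degrades. A stdlib-only sketch (class name, stages, and threshold are illustrative, not the toolkit's internals):

```python
class CanaryController:
    """Gradually shift traffic to a new model version, rolling back on degradation."""

    STAGES = [5, 25, 50, 100]  # percent of traffic sent to the canary

    def __init__(self, max_error_rate=0.02):
        self.max_error_rate = max_error_rate
        self.stage = 0
        self.rolled_back = False

    @property
    def canary_traffic(self):
        """Current percentage of traffic routed to the canary."""
        return 0 if self.rolled_back else self.STAGES[self.stage]

    def report_health(self, error_rate):
        """Advance to the next stage if healthy; otherwise roll back to 0%."""
        if self.rolled_back:
            return
        if error_rate > self.max_error_rate:
            self.rolled_back = True       # automatic rollback on degradation
        elif self.stage < len(self.STAGES) - 1:
            self.stage += 1
```

A real deployment would feed `report_health` from the monitoring stack (e.g. Prometheus error-rate queries) on a fixed evaluation interval.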
## Usage Examples

### FastAPI Model Server

```python
from model_serving_toolkit.core import ModelServer
from pydantic import BaseModel

class PredictionRequest(BaseModel):
    features: dict[str, float]

server = ModelServer.from_config("config.example.yaml")

@server.predict_endpoint("/predict")
async def predict(request: PredictionRequest):
    result = await server.predict(request.features)
    return {"prediction": result.prediction, "model_version": result.model_version}

@server.predict_endpoint("/predict/batch")
async def predict_batch(requests: list[PredictionRequest]):
    return await server.predict_batch([r.features for r in requests])

if __name__ == "__main__":
    server.run()
```
### A/B Testing Between Model Versions

```python
from model_serving_toolkit.core import ABTestRouter

router = ABTestRouter(
    variants={
        "v3": {"model_path": "./models/v3.onnx", "traffic": 0.9},
        "v4": {"model_path": "./models/v4.onnx", "traffic": 0.1},
    },
    significance_level=0.05,
    primary_metric="conversion_rate",
)

async def handle_request(features: dict) -> dict:
    variant, prediction = await router.route_and_predict(features)
    return {"prediction": prediction, "variant": variant}

# Record outcomes so the router can test for a winner.
router.log_outcome(request_id="req_123", variant="v4", outcome=1)

status = router.get_test_status()
print(f"Significant: {status['significant']}, Winner: {status.get('winner', 'TBD')}")
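A winner check like `get_test_status` typically reduces to a two-proportion z-test on the conversion rates of the two variants. A stdlib-only sketch of that statistic (this illustrates the math, not the toolkit's internals):

```python
import math


def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```

With `significance_level: 0.05` as in the example config, a challenger "wins" once this p-value drops below 0.05 and both arms have at least `min_samples_per_variant` observations.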
### Triton Inference Server Configuration

```python
from model_serving_toolkit.utils import TritonConfigGenerator

generator = TritonConfigGenerator()
generator.create_model_repo(
    model_name="churn_predictor",
    model_path="./models/churn_v3.onnx",
    platform="onnxruntime_onnx",
    max_batch_size=32,
    instance_count=2,
    dynamic_batching={
        "preferred_batch_size": [8, 16, 32],
        "max_queue_delay_microseconds": 50000,
    },
    output_dir="./triton_repo/",
)
```
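For reference, the `config.pbtxt` generated for the repository above would look roughly like this (the exact output depends on the generator; the layout shown is the standard Triton model-configuration format):

```protobuf
name: "churn_predictor"
platform: "onnxruntime_onnx"
max_batch_size: 32
instance_group [
  { count: 2, kind: KIND_GPU }
]
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 50000
}
```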
## Configuration Reference

| Parameter | Type | Default | Description |
|---|---|---|---|
| `server.framework` | str | fastapi | Serving framework |
| `server.workers` | int | 4 | Number of worker processes |
| `batching.max_batch_size` | int | 32 | Max requests per batch |
| `batching.batch_timeout_ms` | int | 50 | Max wait to fill a batch |
| `ab_testing.significance_level` | float | 0.05 | p-value threshold for declaring an A/B winner |
## Best Practices
- Always warm up models — Cold inference is 10-100x slower. Send warmup requests during startup.
- Use server-side batching for GPU models — Batch 8-32 requests with a 50ms timeout for optimal throughput.
- Export to ONNX for serving — ONNX Runtime is 2-3x faster than native PyTorch inference.
- Load test at 2x peak traffic — If p99 latency exceeds your SLA, scale horizontally or optimize.
- Separate serving from business logic — The model server does inference only. Everything else is a separate service.
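To make the load-testing advice concrete, the p99 check can be computed straight from recorded latencies with the stdlib (a sketch; the SLA value and latency samples are placeholders you would replace with your own numbers):

```python
import statistics


def p99_exceeds_sla(latencies_ms, sla_ms):
    """Return True if the 99th-percentile latency breaches the SLA."""
    # quantiles(n=100) yields the 1st..99th percentile cut points;
    # the last element is the 99th percentile.
    p99 = statistics.quantiles(latencies_ms, n=100)[-1]
    return p99 > sla_ms
```

Run this against latencies captured at roughly 2x peak traffic; if it returns `True`, scale horizontally or optimize before the breach happens in production.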
## Troubleshooting

| Issue | Cause | Fix |
|---|---|---|
| High p99 latency with low average | GC pauses or cold model loads | Enable model warmup, increase workers, use async request handling |
| Batch predictions slower than single | Batch size too small for GPU | Increase `max_batch_size` and `batch_timeout_ms` to fill the GPU |
| A/B test never reaches significance | Not enough traffic | Increase `traffic_percentage` for the challenger variant, or lower `min_samples_per_variant` |
| ONNX model loading fails | Opset version mismatch | Re-export with `torch.onnx.export(..., opset_version=17)` matching your ONNX Runtime version |
This is 1 of 11 resources in the ML Engineer Toolkit. Get the complete Model Serving Toolkit with all files, templates, and documentation for $39.
Or grab the entire ML Engineer Toolkit bundle (11 products) for $149 — save 30%.