The path to scalable ML deployment requires high-performance APIs and robust orchestration. This post walks through setting up a local, highly available, auto-scaling inference service, using FastAPI for a fast serving layer and Kind to run Kubernetes locally.
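If you want to follow along, spinning up a local cluster with Kind is a one-liner. A minimal sketch, assuming Docker and the kind CLI are installed (the cluster name ml-serving is an arbitrary choice, not something from the guide):

Bash
# Create a single-node Kubernetes cluster running inside Docker
kind create cluster --name ml-serving

# Confirm the control plane is reachable
kubectl cluster-info --context kind-ml-serving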
Phase 1: The FastAPI Inference Service
Our Python service handles ONNX model inference. The critical component for K8s stability is the /health endpoint:
Python
# app.py snippet
from fastapi import FastAPI

app = FastAPI()

# ... model loading logic ...

@app.get("/health")
def health_check():
    # K8s probes hit this endpoint frequently, so keep it cheap
    return {"status": "ok", "model_loaded": True}

# ... /predict endpoint ...
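The /predict endpoint is elided above. As a rough sketch of what it might look like with onnxruntime, assuming the elided loading logic creates an InferenceSession called session and the model expects a 28x28 grayscale image (the names, shapes, and preprocessing here are assumptions, not the original code):

Python
# Hypothetical /predict sketch; `session` comes from the elided model-loading logic above
import numpy as np
from fastapi import UploadFile
from PIL import Image

@app.post("/predict")
async def predict(file: UploadFile):
    # Minimal preprocessing: grayscale, resize to the assumed model input, scale to [0, 1]
    image = Image.open(file.file).convert("L").resize((28, 28))
    batch = np.asarray(image, dtype=np.float32)[None, None, :, :] / 255.0
    # Run inference; the input name depends on how the model was exported
    outputs = session.run(None, {session.get_inputs()[0].name: batch})
    return {"class_index": int(np.argmax(outputs[0]))}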
Phase 2: Docker and Kubernetes Deployment
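First, the image has to get into the cluster. A sketch of the build-and-load step, using the image tag from the Deployment and the cluster name from the earlier sketch (the Dockerfile is assumed to sit in the project root):

Bash
# Build the FastAPI image locally
docker build -t clothing-classifier:latest .

# Make it available to the Kind cluster's nodes without a registry
kind load docker-image clothing-classifier:latest --name ml-serving

One gotcha with a :latest tag: set imagePullPolicy: IfNotPresent (or Never) in the Deployment, otherwise Kubernetes will try to pull the image from a registry instead of using the one you just loaded.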
After building the image (clothing-classifier:latest) and loading it into Kind, we define the Deployment. Note the crucial resource constraints and probes.
YAML
# deployment.yaml (snippet focusing on probes and resources)
resources:
  requests:
    cpu: "250m"      # For scheduling
    memory: "500Mi"
  limits:
    cpu: "500m"      # To prevent monopolizing the node
    memory: "1Gi"
livenessProbe:
  httpGet: {path: /health, port: 8000}
  initialDelaySeconds: 5
readinessProbe:
  httpGet: {path: /health, port: 8000}
  initialDelaySeconds: 5  # Gives time for the ONNX model to load
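The Deployment on its own isn't reachable by name, so a ClusterIP Service sits in front of it. A sketch (the Service name and the app: clothing-classifier selector are assumptions; the selector must match your pod template's labels):

YAML
# service.yaml (sketch)
apiVersion: v1
kind: Service
metadata:
  name: clothing-classifier-service
spec:
  selector:
    app: clothing-classifier   # must match the Deployment's pod labels
  ports:
    - port: 8000
      targetPort: 8000          # the FastAPI container port used by the probes

Apply both manifests with kubectl apply -f deployment.yaml -f service.yaml.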
Phase 3: Implementing the Horizontal Pod Autoscaler (HPA)
Scalability is handled by the HPA, which requires the Metrics Server to be running.
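Kind doesn't ship the Metrics Server by default. One common way to install it is from the upstream release manifest, plus a patch so it can scrape Kind's kubelets, which use self-signed certificates:

Bash
# Install the Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Allow insecure TLS to the kubelets (needed on Kind)
kubectl patch deployment metrics-server -n kube-system --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'

Once kubectl top pods returns numbers, the HPA definition itself is short: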
YAML
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: clothing-classifier-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: clothing-classifier-deployment
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50   # Scale up if CPU exceeds 50%
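Applying and checking it is standard kubectl:

Bash
kubectl apply -f hpa.yaml
# The TARGETS column shows current vs. target CPU once metrics are flowing
kubectl get hpa clothing-classifier-hpa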
Result: under load, the HPA scales the Deployment between two and five replicas based on CPU utilization. That is exactly the elastic, cost-effective behavior we want from an MLOps setup.
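To watch it happen, generate sustained traffic and keep an eye on the HPA. A quick-and-dirty sketch using a throwaway busybox pod (the Service name comes from the earlier sketch):

Bash
# Terminal 1: watch the HPA and replica count react
kubectl get hpa clothing-classifier-hpa --watch

# Terminal 2: hammer the service in a loop until you Ctrl+C
kubectl run load-generator --rm -it --image=busybox:1.36 --restart=Never -- \
  /bin/sh -c "while true; do wget -q -O- http://clothing-classifier-service:8000/health; done"

This loop hits /health for simplicity; since that endpoint is cheap, it may not push CPU past the 50% threshold, so in practice you'd drive /predict with a real load tool such as hey or locust.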
Read the full guide here.
If you're deploying any Python API, adopting this pattern for resource management and scaling will save you major headaches down the road.