Austin Deyan

Zero-to-Scale ML: Deploying ONNX Models on Kubernetes with FastAPI and HPA

The path to scalable ML deployment requires a high-performance API and robust orchestration. This post walks through setting up a local, highly available, auto-scaling inference service, using FastAPI for the serving layer and Kind for a local Kubernetes cluster.

Phase 1: The FastAPI Inference Service

Our Python service handles ONNX model inference. The component most critical to Kubernetes stability is the /health endpoint, which the liveness and readiness probes poll:

Python

# app.py snippet
# ... model loading logic ...

@app.get("/health")
def health_check():
    # K8s Probes will hit this endpoint frequently
    return {"status": "ok", "model_loaded": True}

# ... /predict endpoint ...
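
For reference, here is a minimal sketch of what the elided model-loading and /predict pieces might look like. The model path, the Fashion-MNIST-style 1×28×28 input shape, and the flat-pixel-list request format are illustrative assumptions, not the original code:

Python

# app.py (hypothetical sketch of the elided pieces)
import numpy as np
import onnxruntime as ort
from fastapi import FastAPI

app = FastAPI()

# Load the ONNX session once at startup so every request reuses it
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

@app.post("/predict")
def predict(pixels: list[float]):
    # Reshape the flat pixel list into the (batch, channel, H, W) tensor the model expects
    x = np.array(pixels, dtype=np.float32).reshape(1, 1, 28, 28)
    logits = session.run(None, {input_name: x})[0]
    return {"class_id": int(np.argmax(logits))}

Loading the session at module import (rather than per request) is what makes the readiness probe's initial delay matter: the pod shouldn't receive traffic until that load completes.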

Phase 2: Docker and Kubernetes Deployment

First we build the image (clothing-classifier:latest) and load it into Kind; a sketch of those commands follows. Then we define the Deployment. Note the crucial resource constraints and probes.
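
The exact commands depend on your layout, but assuming a Dockerfile at the project root and a default-named Kind cluster, the build-and-load step looks like this (Kind nodes can't see the host's local Docker images, so the explicit load is required):

Bash

# Build the image locally, then copy it into the Kind cluster's nodes
docker build -t clothing-classifier:latest .
kind load docker-image clothing-classifier:latest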

YAML

# deployment.yaml (Snippet focusing on probes and resources)
        resources:
          requests:
            cpu: "250m"  # For scheduling
            memory: "500Mi"
          limits:
            cpu: "500m"  # To prevent monopolizing the node
            memory: "1Gi"
        livenessProbe:
          httpGet: {path: /health, port: 8000}
          initialDelaySeconds: 5
        readinessProbe:
          httpGet: {path: /health, port: 8000}
          initialDelaySeconds: 5 # Gives time for the ONNX model to load
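
The snippet above lives inside the container spec. For orientation, here is a sketch of the surrounding Deployment skeleton; the app label and container name are assumptions, while the Deployment name matches the HPA's scaleTargetRef in Phase 3:

YAML

# deployment.yaml (sketch of the surrounding skeleton; labels and container name assumed)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: clothing-classifier-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: clothing-classifier
  template:
    metadata:
      labels:
        app: clothing-classifier
    spec:
      containers:
      - name: clothing-classifier
        image: clothing-classifier:latest
        imagePullPolicy: IfNotPresent  # a :latest tag defaults to Always, which would skip the Kind-loaded image
        ports:
        - containerPort: 8000
        # ... resources, livenessProbe, and readinessProbe from the snippet above ...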

Phase 3: Implementing Horizontal Pod Autoscaler (HPA)

Scaling is handled by the HPA, which requires the Metrics Server to be running so it has CPU usage to act on. Kind doesn't install it by default; one common install path is sketched below.
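
These are the upstream install manifest and the widely used workaround for Kind's self-signed kubelet certificates; treat the patch as an assumption about your cluster's TLS setup:

Bash

# Install the Metrics Server from the upstream release manifest
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Kind's kubelets serve self-signed certs, so allow insecure TLS when scraping metrics
kubectl patch -n kube-system deployment metrics-server --type=json \
  -p '[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'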

YAML

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: clothing-classifier-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: clothing-classifier-deployment
  minReplicas: 2
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50 # Target 50% of requested CPU, averaged across pods
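
Applying the manifest and watching the autoscaler pick up metrics:

Bash

# Create the HPA, then watch current vs. target CPU utilization
kubectl apply -f hpa.yaml
kubectl get hpa clothing-classifier-hpa --watch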

Result: under load, the HPA raises the replica count toward the five-pod ceiling and scales back down when traffic subsides (a quick way to generate that load is sketched below). That is elastic, cost-effective MLOps in practice.
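
A throwaway in-cluster load generator is enough to watch the scale-up. The Service name (clothing-classifier-service) is an assumption, since the post doesn't show the Service manifest; pointing the loop at /predict with a real payload would drive CPU harder than /health:

Bash

# Hammer the API from inside the cluster until the HPA reacts, then Ctrl+C
kubectl run load-generator --rm -it --image=busybox --restart=Never -- \
  /bin/sh -c "while true; do wget -q -O- http://clothing-classifier-service:8000/health; done"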

Read the full guide here.

If you're deploying any Python API, adopting this pattern for resource management and scaling will save you major headaches down the road.
