Austin Deyan

Zero-to-Scale ML: Deploying ONNX Models on Kubernetes with FastAPI and HPA

The path to scalable ML deployment requires a high-performance API and robust orchestration. This post walks through setting up a local, highly available, auto-scaling inference service, using FastAPI for the serving layer and Kind for a local Kubernetes cluster.

Phase 1: The FastAPI Inference Service

Our Python service handles ONNX model inference. The component most critical to Kubernetes stability is the /health endpoint, which the liveness and readiness probes poll:

Python

# app.py snippet
# ... model loading logic ...

@app.get("/health")
def health_check():
    # K8s Probes will hit this endpoint frequently
    return {"status": "ok", "model_loaded": True}

# ... /predict endpoint ...
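
For reference, here is a minimal sketch of what the elided model-loading and /predict pieces might look like. The model path, the Fashion-MNIST-style 1×28×28 input shape, and the flat-pixel-list request format are illustrative assumptions, not the original code:

Python

# app.py (hypothetical sketch of the elided pieces)
import numpy as np
import onnxruntime as ort
from fastapi import FastAPI

app = FastAPI()

# Load the ONNX session once at startup so every request reuses it
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

@app.post("/predict")
def predict(pixels: list[float]):
    # Reshape the flat pixel list into the (batch, channel, H, W) tensor the model expects
    x = np.array(pixels, dtype=np.float32).reshape(1, 1, 28, 28)
    logits = session.run(None, {input_name: x})[0]
    return {"class_id": int(np.argmax(logits))}

Loading the session at module import (rather than per request) is what makes the readiness probe's initial delay matter: the pod shouldn't receive traffic until that load completes.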

Phase 2: Docker and Kubernetes Deployment

First we build the image (clothing-classifier:latest) and load it into Kind; a sketch of those commands follows. Then we define the Deployment. Note the crucial resource constraints and probes.
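
The exact commands depend on your layout, but assuming a Dockerfile at the project root and a default-named Kind cluster, the build-and-load step looks like this (Kind nodes can't see the host's local Docker images, so the explicit load is required):

Bash

# Build the image locally, then copy it into the Kind cluster's nodes
docker build -t clothing-classifier:latest .
kind load docker-image clothing-classifier:latest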

YAML

# deployment.yaml (Snippet focusing on probes and resources)
        resources:
          requests:
            cpu: "250m"  # For scheduling
            memory: "500Mi"
          limits:
            cpu: "500m"  # To prevent monopolizing the node
            memory: "1Gi"
        livenessProbe:
          httpGet: {path: /health, port: 8000}
          initialDelaySeconds: 5
        readinessProbe:
          httpGet: {path: /health, port: 8000}
          initialDelaySeconds: 5 # Gives time for the ONNX model to load
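
The snippet above lives inside the container spec. For orientation, here is a sketch of the surrounding Deployment skeleton; the app label and container name are assumptions, while the Deployment name matches the HPA's scaleTargetRef in Phase 3:

YAML

# deployment.yaml (sketch of the surrounding skeleton; labels and container name assumed)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: clothing-classifier-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: clothing-classifier
  template:
    metadata:
      labels:
        app: clothing-classifier
    spec:
      containers:
      - name: clothing-classifier
        image: clothing-classifier:latest
        imagePullPolicy: IfNotPresent  # a :latest tag defaults to Always, which would skip the Kind-loaded image
        ports:
        - containerPort: 8000
        # ... resources, livenessProbe, and readinessProbe from the snippet above ...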

Phase 3: Implementing Horizontal Pod Autoscaler (HPA)

Scaling is handled by the HPA, which requires the Metrics Server to be running so it has CPU usage to act on. Kind doesn't install it by default; one common install path is sketched below.
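
These are the upstream install manifest and the widely used workaround for Kind's self-signed kubelet certificates; treat the patch as an assumption about your cluster's TLS setup:

Bash

# Install the Metrics Server from the upstream release manifest
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Kind's kubelets serve self-signed certs, so allow insecure TLS when scraping metrics
kubectl patch -n kube-system deployment metrics-server --type=json \
  -p '[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'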

YAML

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: clothing-classifier-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: clothing-classifier-deployment
  minReplicas: 2
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50 # Target 50% of requested CPU, averaged across pods
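
Applying the manifest and watching the autoscaler pick up metrics:

Bash

# Create the HPA, then watch current vs. target CPU utilization
kubectl apply -f hpa.yaml
kubectl get hpa clothing-classifier-hpa --watch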

Result: under load, the HPA raises the replica count toward the five-pod ceiling and scales back down when traffic subsides (a quick way to generate that load is sketched below). That is elastic, cost-effective MLOps in practice.
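
A throwaway in-cluster load generator is enough to watch the scale-up. The Service name (clothing-classifier-service) is an assumption, since the post doesn't show the Service manifest; pointing the loop at /predict with a real payload would drive CPU harder than /health:

Bash

# Hammer the API from inside the cluster until the HPA reacts, then Ctrl+C
kubectl run load-generator --rm -it --image=busybox --restart=Never -- \
  /bin/sh -c "while true; do wget -q -O- http://clothing-classifier-service:8000/health; done"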

Read the full guide here.

If you're deploying any Python API, adopting this pattern for resource management and scaling will save you major headaches down the road.
