Leego

Posted on • Originally published at archibaldtitan.com

How to Deploy Your AI Model to Production — Complete Hosting Guide

You've trained your AI model and it works great locally. Now comes the hard part: getting it into production where real users can access it reliably, at scale, with acceptable latency. This guide walks you through every step to deploy your AI model to production.

The Deployment Pipeline

Step 1: Containerize Your Model

Docker is the standard for AI model deployment. Create a Dockerfile that packages your model with all dependencies:

FROM python:3.11-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and application code
COPY model/ ./model/
COPY app.py .

# Expose the API port
EXPOSE 8080

# Run with production server
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]

Step 2: Create an API Layer

Wrap your model in a FastAPI application:

from fastapi import FastAPI
from pydantic import BaseModel
import torch

app = FastAPI()

# Load the model once at startup, not per request.
# weights_only=False is needed to unpickle a full module on newer torch versions.
model = torch.load("model/model.pt", map_location="cpu", weights_only=False)
model.eval()

class PredictionRequest(BaseModel):
    text: str

class PredictionResponse(BaseModel):
    result: str
    confidence: float

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    # Assumes the saved model wraps its own preprocessing (e.g. tokenization)
    # and returns an object exposing .label and .confidence
    with torch.no_grad():
        output = model(request.text)
    return PredictionResponse(
        result=output.label,
        confidence=output.confidence,
    )

@app.get("/health")
async def health():
    return {"status": "healthy"}

Step 3: Choose Your Hosting Platform

DigitalOcean (Recommended for Most Teams)

DigitalOcean offers the simplest path from container to production:

Option A: App Platform (Easiest)

# Deploy directly from your GitHub repo
# DigitalOcean detects your Dockerfile automatically
# Scales from 1 to N instances based on traffic

Option B: GPU Droplets (For GPU Models)

# Create a GPU Droplet (doctl also requires a region;
# check available slugs with `doctl compute size list`)
doctl compute droplet create ai-model \
  --region nyc2 \
  --size gpu-h100-x1-80gb \
  --image docker-20-04

# SSH in and run your container
docker run --gpus all -p 8080:8080 your-model:latest

Why DigitalOcean:

  • Predictable pricing (no surprise bills)
  • $200 free credit for new users
  • Simple scaling with load balancers
  • Managed Kubernetes for complex deployments

AWS SageMaker (For Enterprise)

SageMaker provides a managed ML deployment platform:

import sagemaker
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data="s3://bucket/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # your account's execution role
    framework_version="2.1",
    py_version="py310",
)

predictor = model.deploy(
    instance_type="ml.g5.xlarge",
    initial_instance_count=1,
)

Step 4: Set Up Monitoring

Production AI models need monitoring beyond standard web metrics:

Model Performance:

  • Prediction latency (p50, p95, p99)
  • Throughput (predictions per second)
  • Error rate
  • Model confidence distribution
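The latency percentiles above are cheap to track in-process. A minimal sketch using the nearest-rank percentile method over a rolling window (in a real deployment you would export these to a metrics backend such as Prometheus rather than query them ad hoc):

```python
import math
from collections import deque

class LatencyTracker:
    """Rolling window of request latencies with percentile queries."""

    def __init__(self, window: int = 1000):
        self.samples = deque(maxlen=window)  # old samples fall off the window

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def percentile(self, p: float) -> float:
        # Nearest-rank method: the smallest sample with at least
        # p percent of all samples at or below it
        data = sorted(self.samples)
        k = max(0, math.ceil(p / 100 * len(data)) - 1)
        return data[k]

tracker = LatencyTracker()
for ms in range(1, 1001):  # simulate latencies of 1..1000 ms
    tracker.record(ms)
print(tracker.percentile(50), tracker.percentile(95), tracker.percentile(99))
# → 500 950 990
```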

Infrastructure:

  • GPU utilization and memory
  • CPU and RAM usage
  • Disk I/O (model loading)
  • Network bandwidth

Data Quality:

  • Input distribution drift
  • Output distribution changes
  • Feature value anomalies
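Input drift can be quantified with a population stability index (PSI) computed over binned feature values. A self-contained sketch (the 0.1 / 0.25 thresholds are a common rule of thumb, not a standard, and real pipelines usually compute this per feature on a schedule):

```python
import math

def psi(baseline, current, bins=10):
    """Population stability index between two samples of one feature.
    Rough guide: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            # clamp out-of-range current values into the edge bins
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        n = len(values)
        # Laplace smoothing so empty bins don't produce log(0)
        return [(c + 1) / (n + bins) for c in counts]

    expected = bin_fractions(baseline)
    actual = bin_fractions(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

baseline = [i / 100 for i in range(100)]   # uniform on [0, 1)
same = list(baseline)
shifted = [v + 0.5 for v in baseline]      # distribution moved right
print(psi(baseline, same))     # ~0.0, no drift
print(psi(baseline, shifted))  # well above 0.25, significant drift
```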

Step 5: Implement Scaling

Horizontal Scaling (add more instances):

# Kubernetes HPA for auto-scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-model
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-model
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Model Optimization (make each instance faster):

  • Use ONNX Runtime for optimized inference
  • Apply quantization (INT8) for 2-4x speedup
  • Enable batching for throughput-heavy workloads
  • Use model distillation for smaller, faster models
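To see where the INT8 speedup comes from, here is a toy symmetric-quantization sketch in plain Python. This only illustrates the arithmetic; an actual deployment would use torch's quantization tooling or ONNX Runtime rather than anything hand-rolled:

```python
def quantize_int8(weights):
    """Map float weights to int8 values sharing one symmetric scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # 1.0 guards all-zero weights
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

weights = [-1.0, 0.52, -0.25, 0.91]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored weight is within half a quantization step of the original
assert all(abs(w - r) <= scale / 2 + 1e-9 for w, r in zip(weights, restored))
```

The accuracy cost is bounded by the scale (half a step per weight), which is why INT8 usually works well for inference while cutting memory traffic and enabling faster integer kernels.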

Step 6: Implement CI/CD for Models

# GitHub Actions workflow for model deployment
name: Deploy Model
on:
  push:
    branches: [main]
    paths: ['model/**', 'app.py', 'Dockerfile']

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumes registry credentials are already configured
      # (e.g. via docker/login-action in an earlier step)
      - name: Build and push Docker image
        run: |
          docker build -t registry/ai-model:$GITHUB_SHA .
          docker push registry/ai-model:$GITHUB_SHA
      - name: Deploy to DigitalOcean
        env:
          APP_ID: ${{ secrets.APP_ID }}
        run: |
          doctl apps update $APP_ID --spec .do/app.yaml

Cost Optimization Tips

  1. Use spot/preemptible instances for batch inference (up to 90% savings)
  2. Right-size your instances — don't use an H100 for a model that runs fine on a T4
  3. Implement request queuing to smooth traffic spikes instead of over-provisioning
  4. Cache frequent predictions — if the same inputs appear often, cache the outputs
  5. Use DigitalOcean's predictable pricing to avoid the AWS bill shock
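Tip 4 can be as simple as an LRU map keyed on a hash of the request payload. A minimal in-process sketch (names here are illustrative; a multi-replica deployment would use a shared store like Redis so all instances hit the same cache):

```python
import hashlib
import json
from collections import OrderedDict

class PredictionCache:
    """LRU cache for model outputs, keyed on the request payload."""

    def __init__(self, maxsize: int = 10_000):
        self.store = OrderedDict()
        self.maxsize = maxsize

    def _key(self, payload: dict) -> str:
        # Canonical JSON so {"a": 1, "b": 2} and {"b": 2, "a": 1} share a key
        return hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest()

    def get(self, payload: dict):
        key = self._key(payload)
        if key in self.store:
            self.store.move_to_end(key)  # mark as recently used
            return self.store[key]
        return None  # cache miss: run real inference, then put()

    def put(self, payload: dict, result) -> None:
        key = self._key(payload)
        self.store[key] = result
        self.store.move_to_end(key)
        if len(self.store) > self.maxsize:
            self.store.popitem(last=False)  # evict least recently used

cache = PredictionCache(maxsize=2)
cache.put({"text": "hello"}, {"result": "greeting", "confidence": 0.98})
print(cache.get({"text": "hello"}))  # hit: skips GPU inference entirely
```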

Production Checklist

  • [ ] Model containerized with Docker
  • [ ] Health check endpoint implemented
  • [ ] API documentation (OpenAPI/Swagger)
  • [ ] Load testing completed
  • [ ] Monitoring and alerting configured
  • [ ] Auto-scaling configured
  • [ ] CI/CD pipeline for model updates
  • [ ] Rollback strategy defined
  • [ ] Cost monitoring enabled
  • [ ] Security review completed

Conclusion

Deploying an AI model to production doesn't have to be overwhelming. Start with Docker containerization, choose a hosting platform that matches your team's capabilities (DigitalOcean for simplicity, AWS for enterprise scale), and build monitoring and scaling incrementally.

The key is to start simple and iterate. Get your model running on a single instance first, then add scaling, monitoring, and optimization as your traffic grows.

Get started with DigitalOcean's $200 free credit and deploy your first AI model today.


Originally published on Archibald Titan. Archibald Titan is the world's most advanced local AI agent for cybersecurity and credential management.

Try it free: archibaldtitan.com
