Abdelrahman Adnan

πŸš€ MLOps Zoomcamp Week 4 Notes: Model Deployment Guide

πŸ“‹ Table of Contents

  • Introduction to Model Deployment
  • Web Services with Flask
  • Model Serving with Docker
  • Creating Prediction Services
  • Load Testing and Performance
  • Deployment to Cloud Providers
  • Best Practices
  • Key Performance Metrics
  • Continuous Deployment for ML Models
  • Conclusion

🌟 Introduction to Model Deployment

Model deployment is the process of making your trained machine learning model available for use in a production environment. Think of it as moving your model from your laptop (where you developed it) to a place where others can use it.

Why is deployment important?

  • A model that isn't deployed can't provide value to users or businesses
  • Deployment bridges the gap between data science experimentation and real-world applications
  • Properly deployed models can scale to handle many requests

Types of Model Deployment:

  1. Online predictions (synchronous) - When users need immediate responses, like product recommendations
  2. Batch predictions (asynchronous) - Processing large amounts of data periodically, like weekly customer churn analysis (see the batch scoring sketch below)
  3. Edge deployment - Running models directly on devices, like smartphone apps
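
To make the batch option concrete, here is a minimal batch scoring sketch. It assumes a pickled DictVectorizer and model like the ones used throughout this guide; the file names and columns are illustrative:

# batch_score.py - minimal batch scoring sketch (file names and columns are assumptions)
import pickle

import pandas as pd

# Load the same vectorizer and model used by the web service
with open('model.pkl', 'rb') as f_in:
    dv, model = pickle.load(f_in)

# Read a batch of rides, e.g. one month of trips
df = pd.read_parquet('rides_2023-01.parquet')

# Build the feature dictionaries expected by the vectorizer
df['PU_DO'] = df['PULocationID'].astype(str) + '_' + df['DOLocationID'].astype(str)
dicts = df[['PU_DO', 'trip_distance']].to_dict(orient='records')

# Score the whole batch at once and save the results
df['predicted_duration'] = model.predict(dv.transform(dicts))
df.to_parquet('predictions_2023-01.parquet', index=False)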

🌐 Web Services with Flask

Flask is a lightweight web framework for Python that makes it easy to create web services. We'll use it to wrap our ML model in an API (Application Programming Interface).

Basic Flask App for Model Serving Explained

# Import necessary libraries
from flask import Flask, request, jsonify  # Flask for creating web services
import pickle  # For loading our saved model

# Create a Flask application with a name
app = Flask('duration-prediction')

# Load our pre-trained model and vectorizer from a file
# The 'rb' means "read binary" - pickle files are binary files
with open('model.pkl', 'rb') as f_in:
    dv, model = pickle.load(f_in)
    # dv is our DictVectorizer that transforms input features
    # model is our trained ML model (like Linear Regression)

# Create an endpoint at /predict that accepts POST requests
@app.route('/predict', methods=['POST'])
def predict():
    # Get the JSON data sent in the request (this will be our ride information)
    ride = request.get_json()

    # Transform the ride features using our dictionary vectorizer
    # This converts categorical variables and prepares data in the format our model expects
    X = dv.transform([ride])

    # Use our model to make a prediction
    # The [0] at the end extracts the first value from the array of predictions
    y_pred = model.predict(X)[0]

    # Return the prediction as JSON
    # jsonify converts Python objects to JSON format for the response
    return jsonify({
        'duration': float(y_pred),  # Convert to float for JSON compatibility
        'model_version': '1.0'      # Include version info for tracking
    })

# This code runs when we execute this script directly
if __name__ == "__main__":
    # Start the Flask server
    # debug=True enables helpful error messages
    # host='0.0.0.0' makes the server listen on all network interfaces (reachable from outside the container)
    # port=9696 is the network port to listen on
    app.run(debug=True, host='0.0.0.0', port=9696)

What this code does:

  1. Creates a web server using Flask
  2. Loads your trained model from a file
  3. Sets up a route (/predict) that accepts ride information
  4. Processes the incoming data and runs it through your model
  5. Returns the prediction as a JSON response

Testing with curl

The curl command lets you send HTTP requests from your terminal. Here's how to test your Flask API:

curl -X POST \
     -H "Content-Type: application/json" \
     -d '{"PULocationID": 100, "DOLocationID": 200, "trip_distance": 3.5}' \
     http://localhost:9696/predict

What this command does:

  • -X POST: Specifies that we're sending a POST request
  • -H "Content-Type: application/json": Sets the content type to JSON
  • -d '{"PULocationID": 100, "DOLocationID": 200, "trip_distance": 3.5}': The JSON data we're sending
  • http://localhost:9696/predict: The URL of our prediction endpoint

You should receive a response with the predicted ride duration.
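
The exact number depends on your trained model, but the response will look something like this (the duration value here is illustrative):

{
  "duration": 13.08,
  "model_version": "1.0"
}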

🐳 Model Serving with Docker

Docker is like a shipping container for your code. It packages everything your application needs to run (code, libraries, and system tools) into a single container that will work the same way everywhere.

Dockerfile Example Explained

# Start with a base image that has Python 3.9 installed
# The "slim" version is smaller in size
FROM python:3.9-slim

# Set the working directory inside the container
WORKDIR /app

# Copy the requirements file into the container
COPY requirements.txt .

# Install the Python dependencies
RUN pip install -r requirements.txt

# Copy the prediction service files into the container
COPY ["predict.py", "model.pkl", "./"]

# Tell Docker that the container will listen on port 9696
EXPOSE 9696

# Command to run when the container starts
# Gunicorn is a production-ready WSGI server for running Python web applications
# --bind 0.0.0.0:9696: Listen on all interfaces on port 9696
# predict:app: The Flask application object (app) in predict.py
ENTRYPOINT ["gunicorn", "--bind", "0.0.0.0:9696", "predict:app"]

What this Dockerfile does:

  1. Sets up a Python environment
  2. Installs all required libraries
  3. Copies your model and code into the container
  4. Specifies how to run your application
  5. Opens the necessary network port
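
The Dockerfile expects a requirements.txt next to it. A minimal one for this service could look like the following (add whatever else your model needs, and pin the versions you trained with):

flask
gunicorn
scikit-learn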

Building and Running Docker Container

# Build the Docker image from the Dockerfile in the current directory
# -t gives it a name and tag for easy reference
docker build -t ride-duration-prediction:v1 .

# Run the container
# -it: Interactive mode
# --rm: Remove the container when it stops
# -p 9696:9696: Map port 9696 on your computer to port 9696 in the container
docker run -it --rm -p 9696:9696 ride-duration-prediction:v1

Why use Docker?

  • Consistency: Your model will run the same way everywhere
  • Dependencies: All libraries are included, no need to install separately
  • Isolation: Your model runs in its own environment
  • Scalability: Easy to deploy multiple copies for handling more requests
  • DevOps-friendly: Fits into modern deployment workflows

πŸ”§ Creating Prediction Services

Complete Prediction Script Explained

#!/usr/bin/env python
# coding: utf-8

import pickle  # For loading the saved model
from flask import Flask, request, jsonify  # For creating the web service
from datetime import datetime  # For timestamping predictions
from pathlib import Path  # For file path operations

# Define file paths for our model and vectorizer
MODEL_FILE = 'model.bin'
DV_FILE = 'dv.bin'

def load_model():
    """
    Load the model and dictionary vectorizer from files.

    Returns:
        tuple: (dv, model) where dv is the DictVectorizer and model is the trained model
    """
    # Create Path objects for better file handling
    model_path = Path(MODEL_FILE)
    dv_path = Path(DV_FILE)

    # Check if the files exist, raise an error if they don't
    if not model_path.exists() or not dv_path.exists():
        raise FileNotFoundError(f"Model or DV file not found at {model_path} or {dv_path}")

    # Load the model from file
    with open(model_path, 'rb') as f_model:
        model = pickle.load(f_model)

    # Load the dictionary vectorizer from file    
    with open(dv_path, 'rb') as f_dv:
        dv = pickle.load(f_dv)

    return dv, model

def prepare_features(ride):
    """
    Extract and prepare features from ride data for model prediction

    Args:
        ride (dict): Dictionary containing ride information

    Returns:
        dict: Processed features ready for the model
    """
    features = {}

    # Create a combined feature from pickup and dropoff locations
    # This helps the model understand specific routes
    features['PU_DO'] = f"{ride['PULocationID']}_{ride['DOLocationID']}"

    # Include the trip distance as a feature
    features['trip_distance'] = ride['trip_distance']

    return features

def predict(features):
    """
    Make prediction using the model

    Args:
        features (dict): Prepared features for prediction

    Returns:
        float: Predicted ride duration in minutes
    """
    # Load the model and vectorizer
    # (loaded per request here for simplicity; in production you would typically load them once at startup)
    dv, model = load_model()

    # Transform features into the format expected by the model
    X = dv.transform([features])

    # Make prediction and return the first result
    y_pred = model.predict(X)
    return float(y_pred[0])

# Create Flask application
app = Flask('duration-prediction')

@app.route('/predict', methods=['POST'])
def predict_endpoint():
    """
    Endpoint for receiving prediction requests
    """
    # Get ride data from the request
    ride = request.get_json()

    # Prepare features from the ride data
    features = prepare_features(ride)

    # Get prediction from the model
    prediction = predict(features)

    # Prepare the response with prediction and metadata
    result = {
        'duration': prediction,
        'model_version': '1.0',
        'prediction_time': datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    }

    # Return the result as JSON
    return jsonify(result)

if __name__ == "__main__":
    # Start the Flask application
    app.run(debug=True, host='0.0.0.0', port=9696)

What this code does in detail:

  1. load_model():

    • Checks if model files exist
    • Loads both the model and the dictionary vectorizer (which transforms your input data)
    • Returns them for use in predictions
  2. prepare_features():

    • Takes raw ride information (like pickup and dropoff locations)
    • Creates derived features (like combining locations into a route feature)
    • Returns a properly formatted feature dictionary
  3. predict():

    • Uses load_model() to get the model and vectorizer
    • Transforms the features using the vectorizer
    • Makes a prediction with the model
    • Returns the predicted duration
  4. predict_endpoint():

    • Handles HTTP requests to the /predict endpoint
    • Processes incoming ride data
    • Returns predictions with metadata (time, version, etc.)

Client Script to Test the Service Explained

import requests  # Library for making HTTP requests
import json  # Library for working with JSON data

# URL of your prediction service - this is where your Flask app is running
url = 'http://localhost:9696/predict'

# Sample ride data - this is what we're asking the model to predict
ride = {
    'PULocationID': 43,     # ID number of pickup location
    'DOLocationID': 151,    # ID number of dropoff location
    'trip_distance': 1.8    # Trip distance in miles
}

# Send POST request to the prediction service
# This is like submitting a form on a website
response = requests.post(url, json=ride)

# Get the result and convert it from JSON to a Python dictionary
result = response.json()

# Print the prediction in a friendly format
print(f"Predicted duration: {result['duration']:.2f} minutes")
print(f"Model version: {result['model_version']}")

What this code does:

  1. Sets up a request to your prediction service
  2. Sends ride information (pickup, dropoff, distance)
  3. Gets the prediction result
  4. Displays the predicted duration and model version

This client script helps you test if your prediction service is working correctly without using curl or other command-line tools.

πŸ” Load Testing and Performance

Load testing helps you understand how your service performs under stress. Locust is a user-friendly tool for this purpose.

Basic Locust File Explained

from locust import HttpUser, task, between

class PredictionUser(HttpUser):
    # Users will wait between 1 and 3 seconds between requests
    # This simulates more realistic user behavior
    wait_time = between(1, 3)

    @task
    def predict_duration(self):
        """
        This task simulates a user making a prediction request.
        Each simulated user will repeatedly execute this method.
        """
        # Sample ride data for prediction
        ride = {
            "PULocationID": 43,
            "DOLocationID": 151,
            "trip_distance": 1.8
        }

        # Send a POST request to the /predict endpoint with the ride data
        self.client.post("/predict", json=ride)

What this code does:

  1. Creates a simulated user class that will make requests to your service
  2. Defines how frequently users make requests (1-3 seconds between each)
  3. Creates a task that sends prediction requests with sample data
  4. Automatically collects performance metrics

To run Locust:

# Start Locust with the locustfile.py script
# --host tells Locust where your service is running
locust -f locustfile.py --host=http://localhost:9696

After running this command, open your browser at http://localhost:8089 to see the Locust interface. Here you can:

  1. Set the number of users to simulate
  2. Set how quickly to spawn users
  3. Start the test and watch your service's performance in real-time

What to look for in load testing:

  • Response time: How quickly your service responds
  • Requests per second: How many predictions you can handle
  • Failure rate: How often requests fail under load
  • Resource usage: CPU, memory, and network usage during the test
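
You can also run the same test without the web UI, which is handy for scripted or CI runs (the user count and duration below are just examples):

# Run headless: 50 users, spawned 5 per second, for one minute
locust -f locustfile.py --host=http://localhost:9696 --headless -u 50 -r 5 --run-time 1m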

☁️ Deployment to Cloud Providers

AWS Elastic Beanstalk Deployment

AWS Elastic Beanstalk is a service that makes it easy to deploy web applications without worrying about infrastructure.

Step 1: Prepare your application

Make sure your Flask application file is named application.py and that it creates a Flask object named application (instead of app):

# Rename from app to application for AWS Elastic Beanstalk
application = Flask('duration-prediction')

# ... rest of your Flask code ...

# For local testing
if __name__ == "__main__":
    application.run(debug=True, host='0.0.0.0', port=9696)

Step 2: Set up Elastic Beanstalk CLI and initialize your application

# Install the EB CLI
pip install awsebcli

# Initialize your EB application
# -p python-3.9: Use Python 3.9 platform
# ride-duration-prediction: Name of your application
eb init -p python-3.9 ride-duration-prediction

Step 3: Create an environment and deploy

# Create a new environment called "prediction-env"
eb create prediction-env

# Deploy your application to the environment
eb deploy

What these commands do:

  1. eb init: Sets up your project for Elastic Beanstalk
  2. eb create: Creates a new environment in AWS with servers, load balancers, etc.
  3. eb deploy: Uploads your application code and deploys it to the environment
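
The Python platform starts your application with Gunicorn by default. If you want to control the server command yourself, you can add a Procfile next to application.py; a minimal sketch, assuming gunicorn is listed in requirements.txt:

web: gunicorn --bind :8000 --workers=2 application:application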

Google Cloud Run Deployment

Google Cloud Run lets you deploy containerized applications quickly.

Step 1: Build your Docker image

# Build the Docker image for Google Container Registry
# [PROJECT_ID] should be replaced with your Google Cloud project ID
docker build -t gcr.io/[PROJECT_ID]/ride-duration:v1 .

Step 2: Push the image to Google Container Registry

# Push the image to Google's container registry
docker push gcr.io/[PROJECT_ID]/ride-duration:v1

Step 3: Deploy to Cloud Run

# Deploy the container to Cloud Run
gcloud run deploy ride-duration-service \
  --image gcr.io/[PROJECT_ID]/ride-duration:v1 \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated

What these commands do:

  1. Build your Docker image with a name that matches Google's container registry format
  2. Push the image to Google's registry where Cloud Run can access it
  3. Create a Cloud Run service that runs your container, automatically scaling based on traffic
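
When the deployment finishes, Cloud Run prints the URL of your new service. You can test it exactly like the local service; replace the placeholder below with the URL from your own deploy:

curl -X POST \
     -H "Content-Type: application/json" \
     -d '{"PULocationID": 100, "DOLocationID": 200, "trip_distance": 3.5}' \
     https://<YOUR-SERVICE-URL>/predict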

Benefits of cloud deployment:

  • Scalability: Automatically handles increased traffic
  • Reliability: Built-in redundancy and failover
  • Security: Professional infrastructure security
  • Observability: Built-in monitoring and logging
  • Cost-efficiency: Pay only for what you use

πŸ› οΈ Best Practices

1. Model Versioning Explained

Version tracking helps you know which model is making predictions and manage updates.

from datetime import datetime

# Define the model version as a constant at the top of your file
MODEL_VERSION = '1.0'

# Record when the service started so the health check can report it
SERVICE_START_TIME = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

# Create a health check endpoint to verify your service is running
# and report which model version is being used
@app.route('/health', methods=['GET'])
def health():
    return jsonify({
        'status': 'healthy',
        'model_version': MODEL_VERSION,
        'service_up_since': SERVICE_START_TIME
    })

Why this matters:

  • Helps track which model version made which predictions
  • Makes it easier to troubleshoot issues with specific model versions
  • Enables smooth rollbacks if a new model version has problems
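
Checking the running version is then a single request:

curl http://localhost:9696/health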

2. Logging Predictions Explained

Logging helps you understand how your model is being used and catch issues early.

import logging

# Set up logging with timestamps, log levels, and formatting
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
# Create a logger specifically for our prediction service
logger = logging.getLogger('prediction-service')

@app.route('/predict', methods=['POST'])
def predict_endpoint():
    # Get the data from the request
    ride = request.get_json()

    # Log the incoming request
    logger.info(f"Prediction request received: {ride}")

    # ... make prediction ...

    # Log the result before returning it
    logger.info(f"Prediction result: duration={result['duration']:.2f} minutes")

    return jsonify(result)

Why logging is important:

  • Helps debug issues by showing what data was processed
  • Provides an audit trail of predictions
  • Can be used to detect unusual patterns or potential misuse
  • Helps understand actual usage patterns for further improvement

3. Environment Variables for Configuration Explained

Environment variables make your application configurable without code changes.

import logging
import os

# Get configuration from environment variables with defaults
MODEL_PATH = os.getenv('MODEL_PATH', 'model.pkl')
HOST = os.getenv('HOST', '0.0.0.0')
PORT = int(os.getenv('PORT', 9696))
LOG_LEVEL = os.getenv('LOG_LEVEL', 'INFO')

# Configure logging based on environment variable
logging.basicConfig(level=getattr(logging, LOG_LEVEL))
logger = logging.getLogger('prediction-service')

# Use the configuration in your application
if __name__ == "__main__":
    logger.info(f"Starting prediction service with model: {MODEL_PATH}")
    app.run(debug=False, host=HOST, port=PORT)

Benefits of using environment variables:

  • Change behavior without modifying code
  • Different settings for development, testing, and production
  • Security (don't hardcode sensitive information)
  • Allows for easier Docker and cloud deployment
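
With Docker, the same image can be reconfigured at run time by passing environment variables, for example:

# Override configuration without rebuilding the image
docker run -it --rm -p 9696:9696 \
  -e MODEL_PATH=model.pkl \
  -e LOG_LEVEL=DEBUG \
  ride-duration-prediction:v1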

4. Graceful Error Handling Explained

Good error handling improves user experience and makes troubleshooting easier.

@app.errorhandler(Exception)
def handle_exception(e):
    """
    Global exception handler for the Flask application.
    Catches all unhandled exceptions and returns a user-friendly response.
    """
    # Log the error with traceback for debugging
    app.logger.error(f"Unhandled exception: {str(e)}", exc_info=True)

    # Determine if this is a known error type
    if isinstance(e, ValueError):
        # Custom message for value errors (e.g., invalid input)
        return jsonify({
            'error': 'Invalid input provided',
            'message': str(e),
            'status': 'error'
        }), 400
    elif isinstance(e, FileNotFoundError):
        # Custom message for missing files (e.g., model not found)
        return jsonify({
            'error': 'Service configuration error',
            'message': 'Required model files not found',
            'status': 'error'
        }), 500
    else:
        # Generic error message for unexpected errors
        return jsonify({
            'error': 'An unexpected error occurred',
            'status': 'error',
            'request_id': request.headers.get('X-Request-ID', 'unknown')
        }), 500

Why good error handling matters:

  • Provides clear information about what went wrong
  • Prevents exposing sensitive information in error messages
  • Makes debugging easier through detailed logging
  • Improves user experience by providing actionable feedback
  • Enables better monitoring by categorizing errors
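
On the client side, it also pays to check the status code before using the response. A small sketch extending the earlier client script:

import requests

url = 'http://localhost:9696/predict'
ride = {'PULocationID': 43, 'DOLocationID': 151, 'trip_distance': 1.8}

response = requests.post(url, json=ride)

if response.status_code == 200:
    result = response.json()
    print(f"Predicted duration: {result['duration']:.2f} minutes")
else:
    # The error handler above returns JSON with 'error' and 'status' fields
    error = response.json()
    print(f"Request failed ({response.status_code}): {error.get('error')}")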

πŸ“Š Key Performance Metrics

When your model is in production, you should monitor these important metrics:

1. Response time

  • How long it takes to return predictions
  • Should typically be milliseconds to seconds
  • Important for user experience and SLAs (Service Level Agreements)

2. Throughput

  • How many predictions you can handle per second
  • Helps you plan capacity for peak usage
  • Should be tracked during normal and high traffic periods

3. Error rate

  • Percentage of requests that fail
  • Should be near zero in a healthy system
  • Sudden increases indicate problems

4. Resource usage

  • CPU, memory, and disk usage of your service
  • Helps identify performance bottlenecks
  • Important for cost optimization

5. Prediction drift

  • Changes in the distribution of predictions over time
  • Could indicate data drift or model degradation
  • Important for knowing when to retrain your model

πŸ”„ Continuous Deployment for ML Models

Continuous Deployment (CD) automates the process of releasing new model versions.

Here's a basic CI/CD workflow for model deployment:

  1. Model Training Pipeline: Automatically train models on new data
  2. Model Evaluation: Test model performance against validation data
  3. Automated Tests: Run tests to ensure the model and service work correctly
  4. Container Building: Package the model in a Docker container
  5. Blue-Green Deployment: Deploy new version alongside old one, then switch traffic

Example GitHub Actions workflow:

name: Deploy ML Model

on:
  push:
    branches: [ main ]  # Trigger when code is pushed to main branch

jobs:
  # First job: Run tests on the code
  test:
    runs-on: ubuntu-latest  # Use Ubuntu for running tests
    steps:
      - uses: actions/checkout@v2  # Check out the code
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'  # Use Python 3.9
      - name: Install dependencies
        run: pip install -r requirements.txt  # Install required packages
      - name: Run tests
        run: pytest tests/  # Run all tests in the tests directory

  # Second job: Build and deploy (only runs if tests pass)
  build-and-deploy:
    needs: test  # This job depends on the test job
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2  # Check out the code
      - name: Build and push Docker image
        uses: docker/build-push-action@v2
        with:
          push: true  # Push the image to registry
          tags: myregistry/myapp:latest  # Tag the image

What this workflow does:

  1. Whenever code is pushed to the main branch, it triggers the workflow
  2. First, it runs tests to make sure everything works
  3. If tests pass, it builds a Docker image of your application
  4. It pushes the image to a Docker registry
  5. From there, the image can be deployed to your production environment
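
The "Run tests" step assumes a tests/ directory in your repository. A minimal example checking the feature preparation logic from the prediction script could look like this (the file name tests/test_predict.py is just a convention):

# tests/test_predict.py
from predict import prepare_features

def test_prepare_features():
    ride = {'PULocationID': 43, 'DOLocationID': 151, 'trip_distance': 1.8}
    features = prepare_features(ride)

    # The route feature should combine pickup and dropoff location IDs
    assert features['PU_DO'] == '43_151'
    assert features['trip_distance'] == 1.8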

🎯 Conclusion

Deploying ML models is a critical step in making your data science work valuable to users. With this guide, you've learned:

  1. How to create a web service that serves predictions from your model using Flask
  2. How to package your model and dependencies using Docker
  3. How to test your service under load with Locust
  4. How to deploy your containerized model to cloud providers
  5. Best practices for logging, error handling, and configuration
  6. How to set up continuous deployment for your ML models

Remember that deployment isn't the endβ€”continuous monitoring and retraining are necessary to ensure your models stay accurate and relevant as data changes over time.
