Table of Contents
- Introduction to Model Deployment
- Web Services with Flask
- Model Serving with Docker
- Creating Prediction Services
- Load Testing and Performance
- Deployment to Cloud Providers
- Best Practices
- Key Performance Metrics
- Continuous Deployment for ML Models
- Conclusion
Introduction to Model Deployment
Model deployment is the process of making your trained machine learning model available for use in a production environment. Think of it as moving your model from your laptop (where you developed it) to a place where others can use it.
Why is deployment important?
- A model that isn't deployed can't provide value to users or businesses
- Deployment bridges the gap between data science experimentation and real-world applications
- Properly deployed models can scale to handle many requests
Types of Model Deployment:
- Online predictions (synchronous) - When users need immediate responses, like product recommendations
- Batch predictions (asynchronous) - Processing large amounts of data periodically, like weekly customer churn analysis (a minimal batch scoring sketch follows this list)
- Edge deployment - Running models directly on devices, like smartphone apps
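To make the batch pattern concrete, here is a minimal sketch of offline scoring: instead of answering HTTP requests one at a time, it reads a file of rides, scores them all, and writes the results out. It assumes the same pickled (dv, model) pair used in the Flask example later in this guide and a hypothetical rides.csv input file.

import csv
import pickle

# Load the same DictVectorizer and model the online service uses
with open('model.pkl', 'rb') as f_in:
    dv, model = pickle.load(f_in)

# Read a batch of rides from a CSV file (hypothetical input file)
with open('rides.csv') as f_csv:
    rides = list(csv.DictReader(f_csv))

# Build features and score every ride in one call
features = [
    {'PU_DO': f"{r['PULocationID']}_{r['DOLocationID']}",
     'trip_distance': float(r['trip_distance'])}
    for r in rides
]
predictions = model.predict(dv.transform(features))

# Write the predictions out for downstream consumers
with open('predictions.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    writer.writerow(['PULocationID', 'DOLocationID', 'predicted_duration'])
    for r, p in zip(rides, predictions):
        writer.writerow([r['PULocationID'], r['DOLocationID'], float(p)])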
Web Services with Flask
Flask is a lightweight web framework for Python that makes it easy to create web services. We'll use it to wrap our ML model in an API (Application Programming Interface).
Basic Flask App for Model Serving Explained
# Import necessary libraries
from flask import Flask, request, jsonify # Flask for creating web services
import pickle # For loading our saved model
# Create a Flask application with a name
app = Flask('duration-prediction')
# Load our pre-trained model and vectorizer from a file
# The 'rb' means "read binary" - pickle files are binary files
with open('model.pkl', 'rb') as f_in:
    dv, model = pickle.load(f_in)
# dv is our DictVectorizer that transforms input features
# model is our trained ML model (like Linear Regression)

# Create an endpoint at /predict that accepts POST requests
@app.route('/predict', methods=['POST'])
def predict():
    # Get the JSON data sent in the request (this will be our ride information)
    ride = request.get_json()

    # Transform the ride features using our dictionary vectorizer
    # This converts categorical variables and prepares data in the format our model expects
    X = dv.transform([ride])

    # Use our model to make a prediction
    # The [0] at the end extracts the first value from the array of predictions
    y_pred = model.predict(X)[0]

    # Return the prediction as JSON
    # jsonify converts Python objects to JSON format for the response
    return jsonify({
        'duration': float(y_pred),  # Convert to float for JSON compatibility
        'model_version': '1.0'      # Include version info for tracking
    })

# This code runs when we execute this script directly
if __name__ == "__main__":
    # Start the Flask server
    # debug=True enables helpful error messages
    # host='0.0.0.0' makes the server publicly accessible
    # port=9696 is the network port to listen on
    app.run(debug=True, host='0.0.0.0', port=9696)
What this code does:
- Creates a web server using Flask
- Loads your trained model from a file
- Sets up a route (/predict) that accepts ride information
- Processes the incoming data and runs it through your model
- Returns the prediction as a JSON response
Testing with curl
The curl command lets you send HTTP requests from your terminal. Here's how to test your Flask API:
curl -X POST \
-H "Content-Type: application/json" \
-d '{"PULocationID": 100, "DOLocationID": 200, "trip_distance": 3.5}' \
http://localhost:9696/predict
What this command does:
- -X POST: Specifies that we're sending a POST request
- -H "Content-Type: application/json": Sets the content type to JSON
- -d '{"PULocationID": 100, "DOLocationID": 200, "trip_distance": 3.5}': The JSON data we're sending
- http://localhost:9696/predict: The URL of our prediction endpoint
You should receive a response with the predicted ride duration.
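If everything is wired up correctly, you'll get back a small JSON object like the following (the numbers are illustrative):

{"duration": 12.78, "model_version": "1.0"}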
Model Serving with Docker
Docker is like a shipping container for your code. It packages everything your application needs to run (code, libraries, and system tools) into a single container that will work the same way everywhere.
Dockerfile Example Explained
# Start with a base image that has Python 3.9 installed
# The "slim" version is smaller in size
FROM python:3.9-slim
# Set the working directory inside the container
WORKDIR /app
# Copy the requirements file into the container
COPY requirements.txt .
# Install the Python dependencies
RUN pip install -r requirements.txt
# Copy the prediction service files into the container
COPY ["predict.py", "model.pkl", "./"]
# Tell Docker that the container will listen on port 9696
EXPOSE 9696
# Command to run when the container starts
# Gunicorn is a production-ready web server for Python applications
# --bind 0.0.0.0:9696: Listen on all interfaces on port 9696
# predict:app: The Flask application object (app) in predict.py
ENTRYPOINT ["gunicorn", "--bind", "0.0.0.0:9696", "predict:app"]
What this Dockerfile does:
- Sets up a Python environment
- Installs all required libraries
- Copies your model and code into the container
- Specifies how to run your application
- Opens the necessary network port
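The Dockerfile also assumes a requirements.txt file sits next to it. A minimal one for this service could look like the following (pin the exact versions you trained and tested with):

flask
scikit-learn
gunicorn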
Building and Running the Docker Container
# Build the Docker image from the Dockerfile in the current directory
# -t gives it a name and tag for easy reference
docker build -t ride-duration-prediction:v1 .
# Run the container
# -it: Interactive mode
# --rm: Remove the container when it stops
# -p 9696:9696: Map port 9696 on your computer to port 9696 in the container
docker run -it --rm -p 9696:9696 ride-duration-prediction:v1
Why use Docker?
- Consistency: Your model will run the same way everywhere
- Dependencies: All libraries are included, no need to install separately
- Isolation: Your model runs in its own environment
- Scalability: Easy to deploy multiple copies for handling more requests
- DevOps-friendly: Fits into modern deployment workflows
Creating Prediction Services
Complete Prediction Script Explained
#!/usr/bin/env python
# coding: utf-8
import pickle # For loading the saved model
from flask import Flask, request, jsonify # For creating the web service
from datetime import datetime # For timestamping predictions
from pathlib import Path # For file path operations
# Define file paths for our model and vectorizer
MODEL_FILE = 'model.bin'
DV_FILE = 'dv.bin'
def load_model():
    """
    Load the model and dictionary vectorizer from files.

    Returns:
        tuple: (dv, model) where dv is the DictVectorizer and model is the trained model
    """
    # Create Path objects for better file handling
    model_path = Path(MODEL_FILE)
    dv_path = Path(DV_FILE)

    # Check if the files exist, raise an error if they don't
    if not model_path.exists() or not dv_path.exists():
        raise FileNotFoundError(f"Model or DV file not found at {model_path} or {dv_path}")

    # Load the model from file
    with open(model_path, 'rb') as f_model:
        model = pickle.load(f_model)

    # Load the dictionary vectorizer from file
    with open(dv_path, 'rb') as f_dv:
        dv = pickle.load(f_dv)

    return dv, model
def prepare_features(ride):
    """
    Extract and prepare features from ride data for model prediction.

    Args:
        ride (dict): Dictionary containing ride information

    Returns:
        dict: Processed features ready for the model
    """
    features = {}

    # Create a combined feature from pickup and dropoff locations
    # This helps the model understand specific routes
    features['PU_DO'] = f"{ride['PULocationID']}_{ride['DOLocationID']}"

    # Include the trip distance as a feature
    features['trip_distance'] = ride['trip_distance']

    return features
def predict(features):
    """
    Make a prediction using the model.

    Args:
        features (dict): Prepared features for prediction

    Returns:
        float: Predicted ride duration in minutes
    """
    # Load the model and vectorizer
    dv, model = load_model()

    # Transform features into the format expected by the model
    X = dv.transform([features])

    # Make prediction and return the first result
    y_pred = model.predict(X)
    return float(y_pred[0])
# Create Flask application
app = Flask('duration-prediction')
@app.route('/predict', methods=['POST'])
def predict_endpoint():
    """
    Endpoint for receiving prediction requests.
    """
    # Get ride data from the request
    ride = request.get_json()

    # Prepare features from the ride data
    features = prepare_features(ride)

    # Get prediction from the model
    prediction = predict(features)

    # Prepare the response with prediction and metadata
    result = {
        'duration': prediction,
        'model_version': '1.0',
        'prediction_time': datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    }

    # Return the result as JSON
    return jsonify(result)
if __name__ == "__main__":
    # Start the Flask application
    app.run(debug=True, host='0.0.0.0', port=9696)
What this code does in detail:
- load_model():
  - Checks if model files exist
  - Loads both the model and the dictionary vectorizer (which transforms your input data)
  - Returns them for use in predictions
- prepare_features():
  - Takes raw ride information (like pickup and dropoff locations)
  - Creates derived features (like combining locations into a route feature)
  - Returns a properly formatted feature dictionary
- predict():
  - Uses load_model() to get the model and vectorizer
  - Transforms the features using the vectorizer
  - Makes a prediction with the model
  - Returns the predicted duration
- predict_endpoint():
  - Handles HTTP requests to the /predict endpoint
  - Processes incoming ride data
  - Returns predictions with metadata (time, version, etc.)
Client Script to Test the Service Explained
import requests # Library for making HTTP requests
# URL of your prediction service - this is where your Flask app is running
url = 'http://localhost:9696/predict'
# Sample ride data - this is what we're asking the model to predict
ride = {
    'PULocationID': 43,   # ID number of pickup location
    'DOLocationID': 151,  # ID number of dropoff location
    'trip_distance': 1.8  # Trip distance in miles
}
# Send POST request to the prediction service
# This is like submitting a form on a website
response = requests.post(url, json=ride)
# Get the result and convert it from JSON to a Python dictionary
result = response.json()
# Print the prediction in a friendly format
print(f"Predicted duration: {result['duration']:.2f} minutes")
print(f"Model version: {result['model_version']}")
What this code does:
- Sets up a request to your prediction service
- Sends ride information (pickup, dropoff, distance)
- Gets the prediction result
- Displays the predicted duration and model version
This client script helps you test if your prediction service is working correctly without using curl or other command-line tools.
Load Testing and Performance
Load testing helps you understand how your service performs under stress. Locust is a user-friendly tool for this purpose.
Basic Locust File Explained
from locust import HttpUser, task, between

class PredictionUser(HttpUser):
    # Users will wait between 1 and 3 seconds between requests
    # This simulates more realistic user behavior
    wait_time = between(1, 3)

    @task
    def predict_duration(self):
        """
        This task simulates a user making a prediction request.
        Each simulated user will repeatedly execute this method.
        """
        # Sample ride data for prediction
        ride = {
            "PULocationID": 43,
            "DOLocationID": 151,
            "trip_distance": 1.8
        }

        # Send a POST request to the /predict endpoint with the ride data
        self.client.post("/predict", json=ride)
What this code does:
- Creates a simulated user class that will make requests to your service
- Defines how frequently users make requests (1-3 seconds between each)
- Creates a task that sends prediction requests with sample data
- Automatically collects performance metrics
To run Locust:
# Start Locust with the locustfile.py script
# --host tells Locust where your service is running
locust -f locustfile.py --host=http://localhost:9696
After running this command, open your browser at http://localhost:8089 to see the Locust interface. Here you can:
- Set the number of users to simulate
- Set how quickly to spawn users
- Start the test and watch your service's performance in real-time
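If you'd rather skip the web UI (for example, in a CI pipeline), recent Locust versions can also run headless, with the user count, spawn rate, and test duration given on the command line:

# Simulate 50 users, spawning 5 per second, for one minute, without the web UI
locust -f locustfile.py --host=http://localhost:9696 --headless -u 50 -r 5 --run-time 1m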
What to look for in load testing:
- Response time: How quickly your service responds
- Requests per second: How many predictions you can handle
- Failure rate: How often requests fail under load
- Resource usage: CPU, memory, and network usage during the test
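If your service is running in Docker, a simple way to keep an eye on the resource-usage side while a test runs is docker stats, which streams live per-container CPU and memory figures:

# Show live CPU and memory usage for running containers
docker stats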
Deployment to Cloud Providers
AWS Elastic Beanstalk Deployment
AWS Elastic Beanstalk is a service that makes it easy to deploy web applications without worrying about infrastructure.
Step 1: Prepare your application
Make sure your Flask application is named application.py and creates an object named application (instead of app):
# Rename from app to application for AWS Elastic Beanstalk
application = Flask('duration-prediction')
# ... rest of your Flask code ...
# For local testing
if __name__ == "__main__":
    application.run(debug=True, host='0.0.0.0', port=9696)
Step 2: Set up Elastic Beanstalk CLI and initialize your application
# Install the EB CLI
pip install awsebcli
# Initialize your EB application
# -p python-3.9: Use Python 3.9 platform
# ride-duration-prediction: Name of your application
eb init -p python-3.9 ride-duration-prediction
Step 3: Create an environment and deploy
# Create a new environment called "prediction-env"
eb create prediction-env
# Deploy your application to the environment
eb deploy
What these commands do:
- eb init: Sets up your project for Elastic Beanstalk
- eb create: Creates a new environment in AWS with servers, load balancers, etc.
- eb deploy: Uploads your application code and deploys it to the environment
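A few other EB CLI commands are handy once the environment exists:

# Show the environment's health and the URL it is serving on
eb status
# Open the deployed application in your browser
eb open
# Fetch recent logs from the environment's instances
eb logs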
Google Cloud Run Deployment
Google Cloud Run lets you deploy containerized applications quickly.
Step 1: Build your Docker image
# Build the Docker image for Google Container Registry
# [PROJECT_ID] should be replaced with your Google Cloud project ID
docker build -t gcr.io/[PROJECT_ID]/ride-duration:v1 .
Step 2: Push the image to Google Container Registry
# Push the image to Google's container registry
docker push gcr.io/[PROJECT_ID]/ride-duration:v1
Step 3: Deploy to Cloud Run
# Deploy the container to Cloud Run
gcloud run deploy ride-duration-service \
--image gcr.io/[PROJECT_ID]/ride-duration:v1 \
--platform managed \
--region us-central1 \
--allow-unauthenticated
What these commands do:
- Build your Docker image with a name that matches Google's container registry format
- Push the image to Google's registry where Cloud Run can access it
- Create a Cloud Run service that runs your container, automatically scaling based on traffic
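When the deploy finishes, Cloud Run prints the URL of your new service. You can test it with the same curl request used earlier against localhost, substituting that URL (the hostname below is a placeholder):

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"PULocationID": 100, "DOLocationID": 200, "trip_distance": 3.5}' \
  https://ride-duration-service-xxxxxxxxxx-uc.a.run.app/predict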
Benefits of cloud deployment:
- Scalability: Automatically handles increased traffic
- Reliability: Built-in redundancy and failover
- Security: Professional infrastructure security
- Observability: Built-in monitoring and logging
- Cost-efficiency: Pay only for what you use
Best Practices
1. Model Versioning Explained
Version tracking helps you know which model is making predictions and manage updates.
# Define the model version as a constant at the top of your file
MODEL_VERSION = '1.0'

# Record when the service started (set this once at startup)
SERVICE_START_TIME = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

# Create a health check endpoint to verify your service is running
# and report which model version is being used
@app.route('/health', methods=['GET'])
def health():
    return jsonify({
        'status': 'healthy',
        'model_version': MODEL_VERSION,
        'service_up_since': SERVICE_START_TIME
    })
Why this matters:
- Helps track which model version made which predictions
- Makes it easier to troubleshoot issues with specific model versions
- Enables smooth rollbacks if a new model version has problems
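With the endpoint above in place, checking which model version is live is a one-liner:

# Query the health check endpoint of the locally running service
curl http://localhost:9696/health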
2. Logging Predictions Explained
Logging helps you understand how your model is being used and catch issues early.
import logging

# Set up logging with timestamps, log levels, and formatting
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

# Create a logger specifically for our prediction service
logger = logging.getLogger('prediction-service')

@app.route('/predict', methods=['POST'])
def predict_endpoint():
    # Get the data from the request
    ride = request.get_json()

    # Log the incoming request
    logger.info(f"Prediction request received: {ride}")

    # ... make prediction ...

    # Log the result before returning it
    logger.info(f"Prediction result: duration={result['duration']:.2f} minutes")

    return jsonify(result)
Why logging is important:
- Helps debug issues by showing what data was processed
- Provides an audit trail of predictions
- Can be used to detect unusual patterns or potential misuse
- Helps understand actual usage patterns for further improvement
3. Environment Variables for Configuration Explained
Environment variables make your application configurable without code changes.
import os
# Get configuration from environment variables with defaults
MODEL_PATH = os.getenv('MODEL_PATH', 'model.pkl')
HOST = os.getenv('HOST', '0.0.0.0')
PORT = int(os.getenv('PORT', 9696))
LOG_LEVEL = os.getenv('LOG_LEVEL', 'INFO')
# Configure logging based on environment variable
logging.basicConfig(level=getattr(logging, LOG_LEVEL))
# Use the configuration in your application
if __name__ == "__main__":
    logger.info(f"Starting prediction service with model: {MODEL_PATH}")
    app.run(debug=False, host=HOST, port=PORT)
Benefits of using environment variables:
- Change behavior without modifying code
- Different settings for development, testing, and production
- Security (don't hardcode sensitive information)
- Allows for easier Docker and cloud deployment
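This pairs naturally with Docker: the same image can be pointed at a different model file or log level at run time with -e flags (the values below are illustrative):

# Override configuration via environment variables at container start
docker run -it --rm \
  -p 9696:9696 \
  -e MODEL_PATH=model.pkl \
  -e LOG_LEVEL=DEBUG \
  ride-duration-prediction:v1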
4. Graceful Error Handling Explained
Good error handling improves user experience and makes troubleshooting easier.
@app.errorhandler(Exception)
def handle_exception(e):
    """
    Global exception handler for the Flask application.
    Catches all unhandled exceptions and returns a user-friendly response.
    """
    # Log the error with traceback for debugging
    app.logger.error(f"Unhandled exception: {str(e)}", exc_info=True)

    # Determine if this is a known error type
    if isinstance(e, ValueError):
        # Custom message for value errors (e.g., invalid input)
        return jsonify({
            'error': 'Invalid input provided',
            'message': str(e),
            'status': 'error'
        }), 400
    elif isinstance(e, FileNotFoundError):
        # Custom message for missing files (e.g., model not found)
        return jsonify({
            'error': 'Service configuration error',
            'message': 'Required model files not found',
            'status': 'error'
        }), 500
    else:
        # Generic error message for unexpected errors
        return jsonify({
            'error': 'An unexpected error occurred',
            'status': 'error',
            'request_id': request.headers.get('X-Request-ID', 'unknown')
        }), 500
Why good error handling matters:
- Provides clear information about what went wrong
- Prevents exposing sensitive information in error messages
- Makes debugging easier through detailed logging
- Improves user experience by providing actionable feedback
- Enables better monitoring by categorizing errors
Key Performance Metrics
When your model is in production, you should monitor these important metrics:
1. Response time
- How long it takes to return predictions
- Should typically be milliseconds to seconds
- Important for user experience and SLAs (Service Level Agreements)
2. Throughput
- How many predictions you can handle per second
- Helps you plan capacity for peak usage
- Should be tracked during normal and high traffic periods
3. Error rate
- Percentage of requests that fail
- Should be near zero in a healthy system
- Sudden increases indicate problems
4. Resource usage
- CPU, memory, and disk usage of your service
- Helps identify performance bottlenecks
- Important for cost optimization
5. Prediction drift
- Changes in the distribution of predictions over time
- Could indicate data drift or model degradation
- Important for knowing when to retrain your model
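A full drift-detection setup is beyond this guide, but even a crude check helps: log your predictions (as shown in the best practices above), then periodically compare the recent distribution against a baseline window. A minimal sketch, assuming you have two lists of logged predicted durations:

import numpy as np

def simple_drift_check(baseline_preds, recent_preds, threshold=0.25):
    """Flag drift if the recent mean moves more than `threshold`
    baseline standard deviations away from the baseline mean."""
    baseline = np.asarray(baseline_preds, dtype=float)
    recent = np.asarray(recent_preds, dtype=float)
    baseline_std = baseline.std() or 1.0  # guard against a zero-variance baseline
    shift = abs(recent.mean() - baseline.mean()) / baseline_std
    return shift > threshold

# Example usage: compare last week's logged predictions to last month's baseline
# drifted = simple_drift_check(last_month_preds, last_week_preds)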
Continuous Deployment for ML Models
Continuous Deployment (CD) automates the process of releasing new model versions.
Here's a basic CI/CD workflow for model deployment:
- Model Training Pipeline: Automatically train models on new data
- Model Evaluation: Test model performance against validation data
- Automated Tests: Run tests to ensure the model and service work correctly
- Container Building: Package the model in a Docker container
- Blue-Green Deployment: Deploy new version alongside old one, then switch traffic
Example GitHub Actions workflow:
name: Deploy ML Model

on:
  push:
    branches: [ main ]  # Trigger when code is pushed to main branch

jobs:
  # First job: Run tests on the code
  test:
    runs-on: ubuntu-latest  # Use Ubuntu for running tests
    steps:
      - uses: actions/checkout@v2  # Check out the code
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'  # Use Python 3.9
      - name: Install dependencies
        run: pip install -r requirements.txt  # Install required packages
      - name: Run tests
        run: pytest tests/  # Run all tests in the tests directory

  # Second job: Build and deploy (only runs if tests pass)
  build-and-deploy:
    needs: test  # This job depends on the test job
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2  # Check out the code
      - name: Build and push Docker image
        uses: docker/build-push-action@v2
        with:
          push: true  # Push the image to registry
          tags: myregistry/myapp:latest  # Tag the image
What this workflow does:
- Whenever code is pushed to the main branch, it triggers the workflow
- First, it runs tests to make sure everything works
- If tests pass, it builds a Docker image of your application
- It pushes the image to a Docker registry
- From there, the image can be deployed to your production environment
Conclusion
Deploying ML models is a critical step in making your data science work valuable to users. With this guide, you've learned:
- How to create a web service that serves predictions from your model using Flask
- How to package your model and dependencies using Docker
- How to test your service under load with Locust
- How to deploy your containerized model to cloud providers
- Best practices for logging, error handling, and configuration
- How to set up continuous deployment for your ML models
Remember that deployment isn't the endβcontinuous monitoring and retraining are necessary to ensure your models stay accurate and relevant as data changes over time.