<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: akoshel</title>
    <description>The latest articles on DEV Community by akoshel (@akoshel).</description>
    <link>https://dev.to/akoshel</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F950069%2Fcff84992-06b4-499b-b1a5-9b6a099a30f5.jpg</url>
      <title>DEV Community: akoshel</title>
      <link>https://dev.to/akoshel</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/akoshel"/>
    <language>en</language>
    <item>
      <title>Seamless Deployment of Hugging Face Models on AWS SageMaker with Terraform: A Comprehensive Guide</title>
      <dc:creator>akoshel</dc:creator>
      <pubDate>Sun, 18 Feb 2024 13:45:36 +0000</pubDate>
      <link>https://dev.to/akoshel/seamless-deployment-of-hugging-face-models-on-aws-sagemaker-with-terraform-a-comprehensive-guide-362g</link>
      <guid>https://dev.to/akoshel/seamless-deployment-of-hugging-face-models-on-aws-sagemaker-with-terraform-a-comprehensive-guide-362g</guid>
      <description>&lt;p&gt;When integrating Sagemaker with Hugging Face models using the default setup provided by the sagemaker-huggingface-inference-tollkit can be a good starting point. For a IaC setup, the terraform-aws-sagemaker-huggingface module is a handy resource, &lt;a href="https://github.com/philschmid/terraform-aws-sagemaker-huggingface/blob/master/main.tf"&gt;https://github.com/philschmid/terraform-aws-sagemaker-huggingface/blob/master/main.tf&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, in my experience I ran into a few issues with the sagemaker-huggingface-inference-toolkit:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment Flexibility:&lt;/strong&gt; According to the docs, the toolkit only supports deployment through the Python SDK, which is quite restrictive. (In practice you can deploy in other ways, for example with the Terraform module mentioned above.)&lt;br&gt;
    &lt;strong&gt;Code and Model Packaging:&lt;/strong&gt; Customizing the inference code requires bundling it with the model weights in a single tar file, which feels clunky; I prefer having the code as part of the image itself.&lt;br&gt;
    &lt;strong&gt;Custom Environments:&lt;/strong&gt; The sagemaker-huggingface-inference-toolkit doesn't allow for custom environment setups, like installing the latest Transformers directly from GitHub.&lt;/p&gt;

&lt;p&gt;One specific issue was the lack of support for setting torch_dtype to half precision for pipelines, which was crucial for my project but not straightforward to implement.&lt;/p&gt;

&lt;p&gt;Given these limitations, I decided against rewriting everything for the default sagemaker-inference-toolkit and instead explored a solution that simply overrides the get_pipeline function in sagemaker-huggingface-inference-toolkit. Using the following example, you can customize the pipeline any way you like.&lt;/p&gt;
&lt;h2&gt;
  
  
  How to Deploy
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Load model weights
&lt;/h3&gt;

&lt;p&gt;The first step is to upload the model weights to an S3 bucket as a model.tar.gz file. Instructions are here: &lt;a href="https://huggingface.co/docs/sagemaker/inference"&gt;https://huggingface.co/docs/sagemaker/inference&lt;/a&gt;&lt;/p&gt;
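&lt;p&gt;As a minimal sketch of the packaging step (directory and bucket names here are illustrative placeholders): the model files should sit at the root of the archive, which a few lines of Python can guarantee. The upload itself is a single boto3 call, shown as a comment because it needs AWS credentials.&lt;/p&gt;

```python
# Sketch: package a local model directory as a SageMaker-style model.tar.gz.
# The directory and bucket names are placeholders.
import tarfile
from pathlib import Path


def package_model(model_dir: str, out_path: str = "model.tar.gz") -> str:
    """Create model.tar.gz with the model files at the archive root."""
    files = [f for f in sorted(Path(model_dir).rglob("*")) if f.is_file()]
    with tarfile.open(out_path, "w:gz") as tar:
        for f in files:
            # arcname relative to model_dir so the weights land at the tar root
            tar.add(f, arcname=str(f.relative_to(model_dir)))
    return out_path


# Upload (requires AWS credentials; bucket and key are placeholders):
# import boto3
# boto3.client("s3").upload_file("model.tar.gz", "my-bucket", "models/model.tar.gz")
```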
&lt;h3&gt;
  
  
  Make entrypoint
&lt;/h3&gt;

&lt;p&gt;The deployment starts with setting up an entrypoint script. This script acts as the bridge between your model and Sagemaker, telling Sagemaker how to run your model. Here's a basic template I used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sagemaker_huggingface_inference_toolkit&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;transformers_utils&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;serving&lt;/span&gt;




&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_dir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;torch_dtype&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;transformers_utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_get_pipeline&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;serving&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Build image
&lt;/h3&gt;

&lt;p&gt;Next, you'll need to build a Docker image that Sagemaker can use to run your model. This involves starting from a basic Transformers PyTorch image (&lt;a href="https://github.com/huggingface/transformers/blob/main/docker/transformers-pytorch-gpu/Dockerfile"&gt;https://github.com/huggingface/transformers/blob/main/docker/transformers-pytorch-gpu/Dockerfile&lt;/a&gt;), then installing sagemaker-huggingface-inference-toolkit with MMS (Multi Model Server) and OpenJDK, and configuring the entrypoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; nvidia/cuda:12.1.0-cudnn8-devel-ubuntu20.04&lt;/span&gt;
&lt;span class="k"&gt;LABEL&lt;/span&gt;&lt;span class="s"&gt; maintainer="Hugging Face"&lt;/span&gt;

&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; DEBIAN_FRONTEND=noninteractive&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;apt update
&lt;span class="k"&gt;RUN &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; git libsndfile1-dev tesseract-ocr espeak-ng python3 python3-pip ffmpeg
&lt;span class="k"&gt;RUN &lt;/span&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; pip

&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; REF=main&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;git clone https://github.com/huggingface/transformers &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;transformers &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; git checkout &lt;span class="nv"&gt;$REF&lt;/span&gt;

&lt;span class="c"&gt;# If set to nothing, will install the latest version&lt;/span&gt;
&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; PYTORCH='1.13.1'&lt;/span&gt;
&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; TORCH_VISION=''&lt;/span&gt;
&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; TORCH_AUDIO=''&lt;/span&gt;
&lt;span class="c"&gt;# Example: `cu102`, `cu113`, etc.&lt;/span&gt;
&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; CUDA='cu121'&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="k"&gt;${#&lt;/span&gt;&lt;span class="nv"&gt;PYTORCH&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;-gt&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nv"&gt;VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'torch=='&lt;/span&gt;&lt;span class="nv"&gt;$PYTORCH&lt;/span&gt;&lt;span class="s1"&gt;'.*'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;  &lt;span class="nv"&gt;VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'torch'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; python3 &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; &lt;span class="nv"&gt;$VERSION&lt;/span&gt; &lt;span class="nt"&gt;--extra-index-url&lt;/span&gt; https://download.pytorch.org/whl/&lt;span class="nv"&gt;$CUDA&lt;/span&gt;
&lt;span class="c"&gt;# RUN [ ${#TORCH_VISION} -gt 0 ] &amp;amp;&amp;amp; VERSION='torchvision=='TORCH_VISION'.*' ||  VERSION='torchvision'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/$CUDA&lt;/span&gt;
&lt;span class="c"&gt;# RUN [ ${#TORCH_AUDIO} -gt 0 ] &amp;amp;&amp;amp; VERSION='torchaudio=='TORCH_AUDIO'.*' ||  VERSION='torchaudio'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/$CUDA&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; ./transformers

&lt;span class="c"&gt;# When installing in editable mode, `transformers` is not recognized as a package.&lt;/span&gt;
&lt;span class="c"&gt;# this line must be added in order for python to be aware of transformers.&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;transformers &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; python3 setup.py develop


&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    openjdk-8-jdk-headless
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"sagemaker-huggingface-inference-toolkit[mms]"&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; ./entrypoint.py /usr/local/bin/entrypoint.py&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x /usr/local/bin/entrypoint.py

&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /home/model-server/


&lt;span class="c"&gt;# Define an entrypoint script for the docker image&lt;/span&gt;
&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["python3", "/usr/local/bin/entrypoint.py"]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, push your image to your ECR repository.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploy using terraform
&lt;/h3&gt;

&lt;p&gt;Finally, you'll use Terraform to deploy everything to AWS. This includes setting up the endpoint role, model, its endpoint configuration, and the endpoint itself. Here's a simplified version of what the Terraform setup might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_sagemaker_model"&lt;/span&gt; &lt;span class="s2"&gt;"customHuggingface"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"custom-huggingface"&lt;/span&gt;

  &lt;span class="nx"&gt;primary_container&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;image&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&amp;lt;YOUR_ACCOUNT&amp;gt;.dkr.ecr.&amp;lt;REGION&amp;gt;.amazonaws.com/&amp;lt;REPO&amp;gt;:&amp;lt;TAG&amp;gt;"&lt;/span&gt;
    &lt;span class="nx"&gt;model_data_url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"s3://&amp;lt;BUKET&amp;gt;/&amp;lt;PATH&amp;gt;/model.tar.gz"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_policy_document"&lt;/span&gt; &lt;span class="s2"&gt;"assume_role"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;actions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"sts:AssumeRole"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nx"&gt;principals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Service"&lt;/span&gt;
      &lt;span class="nx"&gt;identifiers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"sagemaker.amazonaws.com"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt; &lt;span class="s2"&gt;"yourRole"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"yourRole"&lt;/span&gt;
  &lt;span class="nx"&gt;assume_role_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_iam_policy_document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;assume_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_policy_document"&lt;/span&gt; &lt;span class="s2"&gt;"InferenceAcess"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;actions&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:GetObject"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;resources&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::&amp;lt;yourBucket&amp;gt;/*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;actions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="s2"&gt;"ecr:GetAuthorizationToken"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"ecr:BatchCheckLayerAvailability"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"ecr:GetDownloadUrlForLayer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"ecr:GetRepositoryPolicy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"ecr:SetRepositoryPolicy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"ecr:DescribeRepositories"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"ecr:ListImages"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"ecr:DescribeImages"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"ecr:BatchGetImage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"ecr:GetLifecyclePolicy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"ecr:GetLifecyclePolicyPreview"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"ecr:ListTagsForResource"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"ecr:DescribeImageScanFindings"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"ecr:InitiateLayerUpload"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nx"&gt;resources&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;YOUR_ECR&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;resources&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;actions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="s2"&gt;"cloudwatch:PutMetricData"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"logs:CreateLogStream"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"logs:PutLogEvents"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"logs:CreateLogGroup"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"logs:DescribeLogStreams"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_policy"&lt;/span&gt; &lt;span class="s2"&gt;"InferenceAcess"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"InferenceAcess"&lt;/span&gt;
  &lt;span class="nx"&gt;policy&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_iam_policy_document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;InferenceAcess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role_policy_attachment"&lt;/span&gt; &lt;span class="s2"&gt;"InferenceAcess"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;yourRole&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;policy_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;InferenceAcess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_sagemaker_endpoint_configuration"&lt;/span&gt; &lt;span class="s2"&gt;"customHuggingface"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"customHuggingface"&lt;/span&gt;

  &lt;span class="nx"&gt;production_variants&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;variant_name&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"variant-1"&lt;/span&gt;
    &lt;span class="nx"&gt;model_name&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_sagemaker_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;customHuggingface&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
    &lt;span class="nx"&gt;initial_instance_count&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="nx"&gt;instance_type&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ml.g4dn.xlarge"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_sagemaker_endpoint"&lt;/span&gt; &lt;span class="s2"&gt;"customHuggingface"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;                 &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"customHuggingface"&lt;/span&gt;
  &lt;span class="nx"&gt;endpoint_config_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_sagemaker_endpoint_configuration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;customHuggingface&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Invoke your endpoint
&lt;/h3&gt;

&lt;p&gt;After everything is deployed, you can test the endpoint with a simple request to make sure it's working as expected.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;body = json.dumps({"inputs": &amp;lt;Your text&amp;gt;})
endpoint = "customHuggingface"
response = runtime.invoke_endpoint(EndpointName=endpoint, ContentType='application/json', Body=body)
response["Body"].read()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Useful links:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/philschmid/terraform-aws-sagemaker-huggingface/blob/master/main.tf"&gt;https://github.com/philschmid/terraform-aws-sagemaker-huggingface/blob/master/main.tf&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/docs/sagemaker/inference"&gt;https://huggingface.co/docs/sagemaker/inference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/aws/sagemaker-huggingface-inference-toolkit"&gt;https://github.com/aws/sagemaker-huggingface-inference-toolkit&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/build-multi-model-build-container.html"&gt;https://docs.aws.amazon.com/sagemaker/latest/dg/build-multi-model-build-container.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>terraform</category>
      <category>huggingface</category>
      <category>aws</category>
      <category>sagemaker</category>
    </item>
    <item>
      <title>Optimize spark on kubernetes</title>
      <dc:creator>akoshel</dc:creator>
      <pubDate>Sat, 01 Apr 2023 07:50:15 +0000</pubDate>
      <link>https://dev.to/akoshel/optimize-spark-on-kubernetes-32la</link>
      <guid>https://dev.to/akoshel/optimize-spark-on-kubernetes-32la</guid>
      <description>&lt;p&gt;This is my second post about Spark on Kubernetes. I wanted to share my experience with reducing the costs of Spark computation in clouds, which can be expensive, but can be decreased by 60-70%. I am using Spark version 3.3.1.&lt;/p&gt;

&lt;p&gt;1. If you are running your research in client mode from an IPython notebook, it is recommended to &lt;strong&gt;use dynamic allocation&lt;/strong&gt;. With this configuration, an executor pod exists only during compute time, after which the executor stops.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark.dynamicAllocation.enabled                     true
spark.dynamicAllocation.shuffleTracking.enabled     true
spark.dynamicAllocation.shuffleTracking.timeout     120
spark.dynamicAllocation.minExecutors                0
spark.dynamicAllocation.maxExecutors                10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
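&lt;p&gt;In client mode the same properties can also be passed on the command line; a small, hypothetical helper that turns the block above into spark-submit arguments:&lt;/p&gt;

```python
# Hypothetical helper: render Spark properties as spark-submit "--conf" flags.
# The property names and values mirror the dynamic-allocation block above.
DYNAMIC_ALLOCATION = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.timeout": "120",
    "spark.dynamicAllocation.minExecutors": "0",
    "spark.dynamicAllocation.maxExecutors": "10",
}


def to_submit_args(conf: dict) -> list:
    """Flatten a config dict into ["--conf", "key=value", ...] pairs."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args
```

&lt;p&gt;Appending to_submit_args(DYNAMIC_ALLOCATION) to a spark-submit invocation has the same effect as the spark-defaults lines above.&lt;/p&gt;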



&lt;p&gt;2. &lt;strong&gt;Using spot nodes for executors&lt;/strong&gt; significantly reduces costs (spot nodes are 60-90% cheaper than on-demand nodes). To create a spot node group, you need to label it, for example spark: spot. The driver, however, should still run on on-demand nodes.&lt;/p&gt;

&lt;p&gt;If you are running in client mode, set the following configuration&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark.kubernetes.executor.node.selector.spark      spot  # here you label k,v in my case k=spark, v=node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are using Spark Operator, use the following configuration settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  driver:
    nodeSelector:
      - key1: value1
      - key2: value2
  executor:
    nodeSelector:
      - key1: value1
      - key2: value2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;P.S. Use the volume mount from the next point to keep executors' temp results in case of a spot node interruption.&lt;/p&gt;

&lt;p&gt;3. &lt;strong&gt;Use an SSD volume mount for executors.&lt;/strong&gt; As mentioned above, this keeps executor temp results in case of a spot node interruption. An SSD volume also accelerates reading and writing the temp files that Spark spills to disk. You can use the following configuration settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName    OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass    gp # your cloud ssd storage class
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit    100Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path    /data
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly    false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
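&lt;p&gt;The five properties follow one naming scheme: a per-volume prefix plus either an option or a mount attribute. A hypothetical helper making that structure explicit (the "data" volume name, /data mount path and "gp" storage class mirror the block above):&lt;/p&gt;

```python
# Hypothetical helper: build the executor persistentVolumeClaim settings
# for one named volume; key names mirror the configuration block above.
def executor_pvc_conf(name: str, mount_path: str, size_limit: str, storage_class: str) -> dict:
    prefix = f"spark.kubernetes.executor.volumes.persistentVolumeClaim.{name}"
    return {
        f"{prefix}.options.claimName": "OnDemand",
        f"{prefix}.options.storageClass": storage_class,
        f"{prefix}.options.sizeLimit": size_limit,
        f"{prefix}.mount.path": mount_path,
        f"{prefix}.mount.readOnly": "false",
    }
```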



&lt;p&gt;4. These are the recommended default values from "Learning Spark":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark.shuffle.file.buffer                           1m
spark.file.transferTo                               false
spark.shuffle.unsafe.file.output.buffer             1m
spark.io.compression.lz4.blockSize                  512k
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;p&gt;In conclusion, following the steps above can significantly reduce the cost of running Spark computations in the cloud: dynamic allocation, spot nodes for executors, and SSD volume mounts together can cut costs by 60-90%, and the defaults recommended in "Learning Spark" help optimize performance. As always, thoroughly test any configuration changes before rolling them out to production.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
Resources:&lt;br&gt;
&lt;a href="https://spot.io/blog/how-to-run-spark-on-kubernetes-reliably-on-spot-instances/"&gt;https://spot.io/blog/how-to-run-spark-on-kubernetes-reliably-on-spot-instances/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://aws.amazon.com/blogs/compute/running-cost-optimized-spark-workloads-on-kubernetes-using-ec2-spot-instances/"&gt;https://aws.amazon.com/blogs/compute/running-cost-optimized-spark-workloads-on-kubernetes-using-ec2-spot-instances/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://spark.apache.org/docs/latest/running-on-kubernetes.html"&gt;https://spark.apache.org/docs/latest/running-on-kubernetes.html&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/"&gt;https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
P.S. My first post about Spark on k8s:&lt;br&gt;
How to run Spark on kubernetes in jupyterhub&lt;br&gt;
&lt;a href="https://dev.to/akoshel/spark-on-k8s-in-jupyterhub-1da2"&gt;https://dev.to/akoshel/spark-on-k8s-in-jupyterhub-1da2&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>cloud</category>
      <category>spark</category>
      <category>mlops</category>
    </item>
    <item>
      <title>How to run Spark on kubernetes in jupyterhub</title>
      <dc:creator>akoshel</dc:creator>
      <pubDate>Thu, 20 Oct 2022 12:01:45 +0000</pubDate>
      <link>https://dev.to/akoshel/spark-on-k8s-in-jupyterhub-1da2</link>
      <guid>https://dev.to/akoshel/spark-on-k8s-in-jupyterhub-1da2</guid>
      <description>&lt;p&gt;This is a basic tutorial on how to run Spark in client mode from jupyterhub notebook. &lt;br&gt;
All required files are presented here &lt;a href="https://github.com/akoshel/spark-k8s-jupyterhub" rel="noopener noreferrer"&gt;https://github.com/akoshel/spark-k8s-jupyterhub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F58721037%2F196504934-6b4892da-fb5a-45c2-8453-8d47c4279cbe.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F58721037%2F196504934-6b4892da-fb5a-45c2-8453-8d47c4279cbe.jpg" alt="DS_ARCH (1)"&gt;&lt;/a&gt;&lt;br&gt;
Final architecture&lt;/p&gt;

&lt;h3&gt;
  
  
  Motivation
&lt;/h3&gt;

&lt;p&gt;I found a lot of tutorials on this topic, and almost all of them rely on heavily customized Spark and jupyterhub deployments. So I decided to minimize custom configuration and use the stock open-source solutions as much as possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install minikube &amp;amp; helm
&lt;/h3&gt;

&lt;p&gt;First, we should create the k8s infrastructure. &lt;br&gt;
Minikube installation instructions: &lt;a href="https://minikube.sigs.k8s.io/docs/start/" rel="noopener noreferrer"&gt;https://minikube.sigs.k8s.io/docs/start/&lt;/a&gt; &lt;br&gt;
Helm installation instructions: &lt;a href="https://helm.sh/docs/intro/install/" rel="noopener noreferrer"&gt;https://helm.sh/docs/intro/install/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Make local docker images available from minikube:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

eval $(minikube docker-env)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Install spark
&lt;/h3&gt;

&lt;p&gt;Let's install Spark locally. &lt;br&gt;
Next, we will build a Spark image and run the spark-pi example with spark-submit&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

sudo apt-get -y install openjdk-8-jdk-headless
wget https://downloads.apache.org/spark/spark-3.2.2/spark-3.2.2-bin-hadoop3.2.tgz
tar xvf spark-3.2.2-bin-hadoop3.2.tgz
sudo mv spark-3.2.2-bin-hadoop3.2 /opt/spark


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Build spark image
&lt;/h3&gt;

&lt;p&gt;Spark ships with a Kubernetes Dockerfile. Let's build the Spark image&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

cat /opt/spark/kubernetes/dockerfiles/spark/Dockerfile
cd /opt/spark
docker build -t spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Spark base image does not include Python, so we should build the PySpark image (/opt/spark/kubernetes/dockerfiles/spark/bindings/python/Dockerfile). &lt;br&gt;
The base image also lacks s3a and Postgres support, which is why the corresponding Maven JARs have to be added. &lt;br&gt;
See the modified image here &lt;a href="https://github.com/akoshel/spark-k8s-jupyterhub/blob/main/pyspark.Dockerfile" rel="noopener noreferrer"&gt;https://github.com/akoshel/spark-k8s-jupyterhub/blob/main/pyspark.Dockerfile&lt;/a&gt;&lt;/p&gt;
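
&lt;p&gt;The JAR additions amount to a few extra lines in that Dockerfile; a rough sketch (the artifact versions below are illustrative assumptions, see the linked file for the real ones):&lt;/p&gt;

```dockerfile
# Extra layers appended to the stock python-bindings Dockerfile (sketch).
# Versions are illustrative; pick ones matching your Hadoop build.
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.1/hadoop-aws-3.3.1.jar /opt/spark/jars/
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.901/aws-java-sdk-bundle-1.11.901.jar /opt/spark/jars/
ADD https://repo1.maven.org/maven2/org/postgresql/postgresql/42.5.0/postgresql-42.5.0.jar /opt/spark/jars/
```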

&lt;p&gt;Build pyspark image&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

cd /opt/spark
docker build -t pyspark:latest -f kubernetes/dockerfiles/spark/bindings/python/Dockerfile .


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Run spark-pi
&lt;/h3&gt;

&lt;p&gt;Before running the example, the namespace, service account, role and rolebinding should be deployed.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

kubectl apply -f spark_namespace.yaml
kubectl apply -f spark_sa.yaml
kubectl apply -f spark_sa_role.yaml


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
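
&lt;p&gt;For reference, these manifests boil down to a namespace plus RBAC that lets the Spark service account manage executor pods; a sketch (resource names are assumptions based on the commands above, the exact files are in the repo):&lt;/p&gt;

```yaml
# spark_namespace.yaml / spark_sa.yaml / spark_sa_role.yaml (sketch)
apiVersion: v1
kind: Namespace
metadata:
  name: spark
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-role
  namespace: spark
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding
  namespace: spark
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: spark
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
```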

&lt;p&gt;Now we are ready to run the spark-pi example using spark-submit&lt;br&gt;
(use kubectl cluster-info to find your master address)&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

/opt/spark/bin/spark-submit \
  --master k8s://https://192.168.49.2:8443 \
  --deploy-mode cluster \
  --driver-memory 1g \
  --conf spark.kubernetes.memoryOverheadFactor=0.5 \
  --name sparkpi-test1 \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=spark:latest \
  --conf spark.kubernetes.driver.pod.name=spark-test1-pi \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --verbose \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.2.2.jar 1000


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Check logs&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

kubectl logs -n spark spark-test1-pi | grep "Pi is roughly"
Pi is roughly 3.1416600314166003


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Great! Spark is running on k8s.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install jupyterhub
&lt;/h3&gt;

&lt;p&gt;Before installing jupyterhub, a service account, role and rolebinding should be deployed in the jupyterhub namespace&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

kubectl apply -f jupyterhub_sa.yaml
kubectl apply -f jupyterhub_sa_role.yaml


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Spark executors run in the spark namespace, while the driver lives in a notebook pod in the jupyterhub namespace.&lt;br&gt;
For the executors to reach the driver across namespaces, we have to deploy a driver service (driver_service.yaml)&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

kubectl apply -f driver_service.yaml


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
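
&lt;p&gt;For reference, driver_service.yaml is essentially a headless Service in the jupyterhub namespace exposing the driver, block-manager and Spark UI ports; a sketch (the selector labels are assumptions and must match your singleuser pod):&lt;/p&gt;

```yaml
# driver_service.yaml (sketch): headless service pointing at the notebook pod.
# Port numbers match the SparkConf used in the notebook; labels are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: driver-service
  namespace: jupyterhub
spec:
  clusterIP: None
  selector:
    app: jupyterhub
    component: singleuser-server
  ports:
    - name: driver
      port: 2222
      targetPort: 2222
    - name: blockmanager
      port: 7777
      targetPort: 7777
    - name: spark-ui
      port: 4040
      targetPort: 4040
```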

&lt;p&gt;To access the Spark UI, an ingress should be deployed&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

kubectl apply -f driver_ingress.yaml


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
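
&lt;p&gt;driver_ingress.yaml is, in essence, an Ingress routing to the driver's Spark UI port (4040 by default); a sketch, assuming an nginx ingress controller and a hypothetical host name:&lt;/p&gt;

```yaml
# driver_ingress.yaml (sketch): expose the Spark UI through the driver service.
# Host name and ingress class are assumptions.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: driver-ingress
  namespace: jupyterhub
spec:
  ingressClassName: nginx
  rules:
    - host: spark-ui.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: driver-service
                port:
                  number: 4040
```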

&lt;p&gt;Java is not installed in the default jupyterhub singleuser image,&lt;br&gt;
so we build a modified singleuser image.&lt;/p&gt;
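
&lt;p&gt;The modification is roughly the stock z2jh singleuser image with a JDK layered on top; a sketch (the base image and tag are assumptions, see singleuser.Dockerfile in the repo for the real file):&lt;/p&gt;

```dockerfile
# singleuser.Dockerfile (sketch): add Java on top of the stock singleuser image.
# Base image and tag are assumptions.
FROM jupyterhub/k8s-singleuser-sample:2.0.0
USER root
RUN apt-get update; apt-get install -y --no-install-recommends openjdk-8-jdk-headless; rm -rf /var/lib/apt/lists/*
USER ${NB_USER}
```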

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

docker build -f singleuser.Dockerfile -t singleuser:v1 .


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;See jhub_values.yaml. It contains the following modifications: a new image, a service account, and resources.&lt;br&gt;
Now we are ready to deploy jupyterhub&lt;/p&gt;
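
&lt;p&gt;The relevant fragment of jhub_values.yaml looks roughly like this (the service account name must match jupyterhub_sa.yaml; the resource numbers are illustrative):&lt;/p&gt;

```yaml
# jhub_values.yaml (fragment, sketch)
singleuser:
  image:
    name: singleuser
    tag: v1
    pullPolicy: IfNotPresent
  serviceAccountName: spark     # must match the SA from jupyterhub_sa.yaml
  memory:
    guarantee: 1G
    limit: 2G
  cpu:
    guarantee: 1
    limit: 2
```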

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

helm upgrade --cleanup-on-fail \
--install jupyterhub jupyterhub/jupyterhub \
--namespace jupyterhub \
--create-namespace \
--version=2.0.0 \
--values jhub_values.yaml


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The easiest way to access jupyterhub is port-forwarding from the proxy pod. Alternatively, you can configure an ingress in jhub_values.yaml&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

kubectl port-forward proxy-dd5964d5b-6lkwp  -n jupyterhub  8000:8000 # Set your pod name


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Pyspark from jupyterhub
&lt;/h3&gt;

&lt;p&gt;Open jupyterhub in your browser &lt;a href="http://localhost:8000/" rel="noopener noreferrer"&gt;http://localhost:8000/&lt;/a&gt; &lt;br&gt;
Open a terminal in jupyterhub and install a pyspark version that matches the Spark version in the image&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

pip install pyspark==3.2.2


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Create a notebook&lt;br&gt;&lt;br&gt;
Create the SparkContext&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

from pyspark import SparkConf, SparkContext

conf = (SparkConf().setMaster("k8s://https://192.168.49.2:8443") # Your master address
        .set("spark.kubernetes.container.image", "pyspark:latest") # Spark image name
        .set("spark.driver.port", "2222") # Must match the driver service
        .set("spark.driver.blockManager.port", "7777") # Must match the driver service
        .set("spark.driver.host", "driver-service.jupyterhub.svc.cluster.local") # Must match the driver service
        .set("spark.driver.bindAddress", "0.0.0.0")
        .set("spark.kubernetes.namespace", "spark")
        .set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
        .set("spark.kubernetes.authenticate.serviceAccountName", "spark")
        .set("spark.executor.instances", "2")
        .set("spark.kubernetes.container.image.pullPolicy", "IfNotPresent")
        .set("spark.app.name", "tutorial_app"))
sc = SparkContext(conf=conf)



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Run a Spark application&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# Calculate the approximate sum of values in the dataset
t = sc.parallelize(range(10))
r = t.sumApprox(3)
print('Approximate sum: %s' % r)

Approximate sum: 45.0


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;See executor pods&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

kubectl get pods -n spark
NAME                                   READY   STATUS    RESTARTS   AGE
tutorial-app-d63d4c83e68ed465-exec-1   1/1     Running   0          16s
tutorial-app-d63d4c83e68ed465-exec-2   1/1     Running   0          15s


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Congratulations! PySpark is now running in client mode from jupyterhub.&lt;/p&gt;

&lt;p&gt;Further steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Configure your spark config&lt;/li&gt;
&lt;li&gt;Configure jupyterhub &lt;a href="https://z2jh.jupyter.org/en/stable/jupyterhub/customization.html" rel="noopener noreferrer"&gt;https://z2jh.jupyter.org/en/stable/jupyterhub/customization.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Install spark operator &lt;a href="https://googlecloudplatform.github.io/spark-on-k8s-operator/docs/quick-start-guide.html" rel="noopener noreferrer"&gt;https://googlecloudplatform.github.io/spark-on-k8s-operator/docs/quick-start-guide.html&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Resources:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://spark.apache.org/docs/latest/running-on-kubernetes.html" rel="noopener noreferrer"&gt;https://spark.apache.org/docs/latest/running-on-kubernetes.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://z2jh.jupyter.org/" rel="noopener noreferrer"&gt;https://z2jh.jupyter.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://scalingpythonml.com/2020/12/21/running-a-spark-jupyter-notebooks-in-client-mode-inside-of-a-kubernetes-cluster-on-arm.html" rel="noopener noreferrer"&gt;https://scalingpythonml.com/2020/12/21/running-a-spark-jupyter-notebooks-in-client-mode-inside-of-a-kubernetes-cluster-on-arm.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://oak-tree.tech/blog/spark-kubernetes-jupyter" rel="noopener noreferrer"&gt;https://oak-tree.tech/blog/spark-kubernetes-jupyter&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;P.S. My second post about spark on k8s&lt;br&gt;
Optimize spark on kubernetes&lt;br&gt;
&lt;a href="https://dev.to/akoshel/optimize-spark-on-kubernetes-32la"&gt;https://dev.to/akoshel/optimize-spark-on-kubernetes-32la&lt;/a&gt;&lt;/p&gt;

</description>
      <category>spark</category>
      <category>jupyterhub</category>
      <category>kubernetes</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
