When integrating SageMaker with Hugging Face models, the default setup provided by the sagemaker-huggingface-inference-toolkit can be a good starting point. For an IaC setup, the terraform-aws-sagemaker-huggingface module is a handy resource: https://github.com/philschmid/terraform-aws-sagemaker-huggingface/blob/master/main.tf
However, in my experience I ran into a few issues with the sagemaker-huggingface-inference-toolkit:
Deployment Flexibility: The toolkit was limited to deploying only through the Python SDK, which was quite restrictive. (At least, that's what the docs describe; in practice you can deploy in other ways, for example with the Terraform module mentioned above.)
Code and Model Packaging: Customizing the inference code required packaging the code together with the model weights into a single tar file, which felt clunky. I prefer having the code be part of the image itself.
Custom Environments: The sagemaker-huggingface-inference-toolkit doesn't allow for custom environment setups, like installing the latest Transformers directly from GitHub.
One specific issue was the lack of support for setting torch_dtype to half precision for the pipelines, which was crucial for my project but not straightforward to implement.
Given these limitations, I decided against rewriting everything on top of the default sagemaker-inference-toolkit and instead explored a solution that simply overrides the get_pipeline function in sagemaker-huggingface-inference-toolkit. Using the following example, you can customize it in any way you like.
How to Deploy
Load model weights
The first step is to upload the model weights to an S3 bucket as a model.tar.gz file. Instructions on how to do that are here: https://huggingface.co/docs/sagemaker/inference
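If the model is already saved locally, the packaging step might look roughly like the sketch below; the directory, bucket, and key names are placeholders, and AWS credentials are assumed to be configured for boto3.
# Rough sketch: tar a locally saved model and upload it to S3.
# "my-model", <BUCKET>, and <PATH> are placeholders for your own names.
import tarfile
from pathlib import Path

import boto3

model_dir = Path("my-model")  # directory produced by save_pretrained()

# SageMaker expects the files at the top level of the archive,
# so add each file under its bare name instead of the full path.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    for path in model_dir.iterdir():
        tar.add(path, arcname=path.name)

boto3.client("s3").upload_file("model.tar.gz", "<BUCKET>", "<PATH>/model.tar.gz")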
Make entrypoint
The deployment starts with setting up an entrypoint script. This script acts as the bridge between your model and SageMaker, telling SageMaker how to run your model. Here's a basic template I used:
from pathlib import Path
import torch
from transformers import Pipeline, pipeline
from sagemaker_huggingface_inference_toolkit import transformers_utils, serving


def _get_pipeline(task: str, device: int, model_dir: Path, **kwargs) -> Pipeline:
    # Build the pipeline ourselves so we can pass model_kwargs:
    # device_map="auto" places the model on the available GPU(s) and
    # torch_dtype=torch.bfloat16 loads the weights in bfloat16 (half precision).
    return pipeline(model=model_dir, device_map="auto", model_kwargs={"torch_dtype": torch.bfloat16})


# Swap in our pipeline factory before the model server starts.
transformers_utils.get_pipeline = _get_pipeline

if __name__ == "__main__":
    serving.main()  # starts the toolkit's model server (MMS)
Build image
Next, you'll need to build a Docker image that SageMaker can use to run your model. This involves starting from a basic Transformers PyTorch image (https://github.com/huggingface/transformers/blob/main/docker/transformers-pytorch-gpu/Dockerfile), then installing sagemaker-huggingface-inference-toolkit with MMS (Multi Model Server) and OpenJDK, and configuring the entrypoint.
FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu20.04
LABEL maintainer="Hugging Face"
ARG DEBIAN_FRONTEND=noninteractive
RUN apt update
RUN apt install -y git libsndfile1-dev tesseract-ocr espeak-ng python3 python3-pip ffmpeg
RUN python3 -m pip install --no-cache-dir --upgrade pip
ARG REF=main
RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
# If set to nothing, will install the latest version
ARG PYTORCH='1.13.1'
ARG TORCH_VISION=''
ARG TORCH_AUDIO=''
# Example: `cu102`, `cu113`, etc.
ARG CUDA='cu121'
RUN [ ${#PYTORCH} -gt 0 ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/$CUDA
# RUN [ ${#TORCH_VISION} -gt 0 ] && VERSION='torchvision=='$TORCH_VISION'.*' || VERSION='torchvision'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/$CUDA
# RUN [ ${#TORCH_AUDIO} -gt 0 ] && VERSION='torchaudio=='$TORCH_AUDIO'.*' || VERSION='torchaudio'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/$CUDA
RUN python3 -m pip install --no-cache-dir -e ./transformers
# When installing in editable mode, `transformers` is not recognized as a package.
# this line must be added in order for python to be aware of transformers.
RUN cd transformers && python3 setup.py develop
# MMS runs on the JVM, so a JDK is required.
RUN apt-get install -y \
    openjdk-8-jdk-headless
RUN pip install "sagemaker-huggingface-inference-toolkit[mms]"
COPY ./entrypoint.py /usr/local/bin/entrypoint.py
RUN chmod +x /usr/local/bin/entrypoint.py
RUN mkdir -p /home/model-server/
# Define an entrypoint script for the docker image
ENTRYPOINT ["python3", "/usr/local/bin/entrypoint.py"]
Now, push your image to your ECR
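As a rough sketch, the build-and-push step can be scripted with boto3 and the Docker CLI; the region, repository, and tag below are placeholders, and the ECR repository is assumed to already exist.
# Hypothetical build-and-push helper; adjust the placeholders to your account.
import base64
import subprocess

import boto3

region, repo, tag = "<REGION>", "<REPO>", "latest"

# Fetch a temporary ECR login token and the registry endpoint.
ecr = boto3.client("ecr", region_name=region)
auth = ecr.get_authorization_token()["authorizationData"][0]
user, password = base64.b64decode(auth["authorizationToken"]).decode().split(":")
registry = auth["proxyEndpoint"].removeprefix("https://")

image = f"{registry}/{repo}:{tag}"
subprocess.run(["docker", "login", "--username", user, "--password", password, registry], check=True)
subprocess.run(["docker", "build", "-t", image, "."], check=True)
subprocess.run(["docker", "push", image], check=True)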
Deploy using terraform
Finally, you'll use Terraform to deploy everything to AWS. This includes setting up the execution role, the model, its endpoint configuration, and the endpoint itself. Here's a simplified version of what the Terraform setup might look like:
resource "aws_sagemaker_model" "customHuggingface" {
name = "custom-huggingface"
primary_container {
image = "<YOUR_ACCOUNT>.dkr.ecr.<REGION>.amazonaws.com/<REPO>:<TAG>"
model_data_url = "s3://<BUKET>/<PATH>/model.tar.gz"
}
}
data "aws_iam_policy_document" "assume_role" {
statement {
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["sagemaker.amazonaws.com"]
}
}
}
resource "aws_iam_role" "yourRole" {
name = "yourRole"
assume_role_policy = data.aws_iam_policy_document.assume_role.json
}
data "aws_iam_policy_document" "InferenceAcess" {
statement {
actions = ["s3:GetObject"]
resources = ["arn:aws:s3:::<yourBucket>/*"]
}
statement {
actions = [
"ecr:GetAuthorizationToken",
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:GetRepositoryPolicy",
"ecr:SetRepositoryPolicy",
"ecr:DescribeRepositories",
"ecr:ListImages",
"ecr:DescribeImages",
"ecr:BatchGetImage",
"ecr:GetLifecyclePolicy",
"ecr:GetLifecyclePolicyPreview",
"ecr:ListTagsForResource",
"ecr:DescribeImageScanFindings",
"ecr:InitiateLayerUpload",
]
resources = ["<YOUR_ECR>"]
}
statement {
  resources = ["*"]
  actions = [
    # ecr:GetAuthorizationToken does not support resource-level permissions,
    # so it has to be granted on "*" rather than on the repository ARN.
    "ecr:GetAuthorizationToken",
    "cloudwatch:PutMetricData",
    "logs:CreateLogStream",
    "logs:PutLogEvents",
    "logs:CreateLogGroup",
    "logs:DescribeLogStreams",
  ]
}
}
resource "aws_iam_policy" "InferenceAcess" {
name = "InferenceAcess"
policy = data.aws_iam_policy_document.InferenceAcess.json
}
resource "aws_iam_role_policy_attachment" "InferenceAcess" {
role = aws_iam_role.yourRole.name
policy_arn = aws_iam_policy.InferenceAcess.arn
}
resource "aws_sagemaker_endpoint_configuration" "customHuggingface" {
name = "customHuggingface"
production_variants {
variant_name = "variant-1"
model_name = aws_sagemaker_model.customHuggingface.name
initial_instance_count = 1
instance_type = "ml.g4dn.xlarge"
}
}
resource "aws_sagemaker_endpoint" "customHuggingface" {
name = "customHuggingface"
endpoint_config_name = aws_sagemaker_endpoint_configuration.customHuggingface.name
}
Invoke your endpoint
After everything is deployed, you can test the endpoint with a simple request to make sure it's working as expected.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")
body = json.dumps({"inputs": "<Your text>"})
endpoint = "customHuggingface"
response = runtime.invoke_endpoint(EndpointName=endpoint, ContentType="application/json", Body=body)
print(response["Body"].read())