Deploying Qwen-2.5 Model on AWS Using Amazon SageMaker AI

Deploying Alibaba's Qwen-2.5 model on AWS using Amazon SageMaker involves several steps, including preparing the environment, downloading and packaging the model, creating a custom container (if necessary), and deploying it to an endpoint. Below is a step-by-step guide for deploying Qwen-2.5 on AWS SageMaker.

Prerequisites:

  1. AWS Account: You need an active AWS account with permissions to use SageMaker.
  2. SageMaker Studio or Notebook Instance: This will be your development environment where you can prepare and deploy the model.
  3. Docker: If you need to create a custom container, Docker will be required locally.
  4. Alibaba Model Repository Access: Ensure that you have access to the Qwen-2.5 model weights and configuration files from Alibaba’s ModelScope or Hugging Face repository.
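
A quick sanity check before you start is to confirm that your CLI credentials resolve to the intended account and that SageMaker is reachable in your target region:

   # Confirm which AWS account and identity your credentials resolve to
   aws sts get-caller-identity

   # Confirm SageMaker is reachable in your target region
   aws sagemaker list-domains --region <region>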

Step 1: Set Up Your SageMaker Environment

  1. Launch SageMaker Studio:

    • Go to the AWS Management Console.
    • Navigate to Amazon SageMaker > SageMaker Studio.
    • Create a new domain or use an existing one.
    • Launch a Jupyter notebook instance within SageMaker Studio.
  2. Install Required Libraries:
    Open a terminal in SageMaker Studio or your notebook instance and install the necessary libraries:

   pip install boto3 sagemaker transformers torch
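
If you are running inside SageMaker Studio or a notebook instance, the SDK can also resolve the execution role and a default S3 bucket for you; this is a small convenience sketch (outside SageMaker you would pass a role ARN explicitly):

   import sagemaker

   # Resolves the execution role attached to your Studio user / notebook instance
   session = sagemaker.Session()
   role = sagemaker.get_execution_role()

   # A session-managed bucket you can use if you don't want to create your own
   default_bucket = session.default_bucket()
   print(role, default_bucket)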

Step 2: Download the Qwen-2.5 Model

You can download the Qwen-2.5 model from Alibaba’s ModelScope or Hugging Face repository. For this example, we’ll assume you are using Hugging Face.

  1. Download the Model Locally: Use the transformers library to download the model (a lighter, download-only alternative is sketched after these steps):
   from transformers import AutoModelForCausalLM, AutoTokenizer

   model_name = "Qwen/Qwen2.5-7B-Instruct"  # Pick the Qwen2.5 size/variant you want from Hugging Face
   tokenizer = AutoTokenizer.from_pretrained(model_name)
   model = AutoModelForCausalLM.from_pretrained(model_name)

   # Save the model and tokenizer locally
   model.save_pretrained("./qwen-2.5")
   tokenizer.save_pretrained("./qwen-2.5")
  2. Package the Model: After downloading, package the files into a .tar.gz archive so it can be uploaded to S3. SageMaker extracts this archive into the directory passed to model_fn, so the model files should sit at the root of the archive rather than inside a subfolder.
   tar -czvf qwen-2.5.tar.gz -C qwen-2.5 .
  3. Upload the Model to S3: Upload the packaged model to an S3 bucket:
   import boto3

   s3 = boto3.client('s3')
   s3.upload_file("qwen-2.5.tar.gz", "your-s3-bucket-name", "qwen-2.5/qwen-2.5.tar.gz")
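
As an alternative to loading the model in step 1, huggingface_hub can fetch the repository files directly to disk without instantiating the model, which is much lighter on memory for large checkpoints; a minimal sketch, assuming the Qwen/Qwen2.5-7B-Instruct repository:

   from huggingface_hub import snapshot_download

   # Download the repository contents to a local folder without loading the weights
   snapshot_download(
       repo_id="Qwen/Qwen2.5-7B-Instruct",
       local_dir="./qwen-2.5"
   )

You would then package and upload the folder exactly as in steps 2 and 3.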

Step 3: Create a Custom Inference Container (Optional)

If a pre-built container from AWS meets your needs, you can skip this step. If you need to customize the inference logic, the simplest route is to extend one of AWS's pre-built Deep Learning Containers, which already include the SageMaker inference toolkit and model server, and add your own inference script.

  1. Create a Dockerfile: Create a Dockerfile that installs the necessary dependencies and sets up the inference script.
   # Extend an AWS Deep Learning Container for PyTorch inference so the
   # SageMaker inference toolkit and model server are already included
   # (look up the image URI for your region in the AWS Deep Learning Containers list)
   FROM <pytorch-inference-dlc-image-uri>

   # Install additional dependencies
   RUN pip install --upgrade pip
   RUN pip install transformers

   # Copy the inference script
   COPY inference.py /opt/ml/code/inference.py

   # Tell the SageMaker toolkit which script to run
   ENV SAGEMAKER_PROGRAM inference.py
  2. Create the Inference Script: Create an inference.py file that handles loading the model and performing inference (a quick local smoke test of these handlers is sketched after this list).
   import json
   import torch
   from transformers import AutoModelForCausalLM, AutoTokenizer

   # Load the model and tokenizer from the extracted model artifact
   def model_fn(model_dir):
       tokenizer = AutoTokenizer.from_pretrained(model_dir)
       model = AutoModelForCausalLM.from_pretrained(
           model_dir,
           torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
       )
       model.to("cuda" if torch.cuda.is_available() else "cpu")
       model.eval()
       return {"model": model, "tokenizer": tokenizer}

   # Parse incoming requests
   def input_fn(request_body, request_content_type):
       if request_content_type == 'application/json':
           input_data = json.loads(request_body)
           return input_data['text']
       raise ValueError(f"Unsupported content type: {request_content_type}")

   # Perform inference
   def predict_fn(input_data, model_dict):
       model = model_dict["model"]
       tokenizer = model_dict["tokenizer"]
       inputs = tokenizer(input_data, return_tensors="pt").to(model.device)
       with torch.no_grad():
           outputs = model.generate(**inputs, max_new_tokens=256)
       return tokenizer.decode(outputs[0], skip_special_tokens=True)

   # Format the response
   def output_fn(prediction, response_content_type):
       return json.dumps({"generated_text": prediction})
  3. Build and Push the Docker Image: Build the Docker image and push it to Amazon Elastic Container Registry (ECR).
   # Build the Docker image
   docker build -t qwen-2.5-inference .

   # Create the ECR repository (first time only); the push below fails if it doesn't exist
   aws ecr create-repository --repository-name qwen-2.5-inference --region <region>

   # Tag the image for ECR
   docker tag qwen-2.5-inference:latest <aws_account_id>.dkr.ecr.<region>.amazonaws.com/qwen-2.5-inference:latest

   # Log in to ECR and push the image
   aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <aws_account_id>.dkr.ecr.<region>.amazonaws.com
   docker push <aws_account_id>.dkr.ecr.<region>.amazonaws.com/qwen-2.5-inference:latest
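
Before building the image, it can save a debugging cycle to smoke-test the handler functions locally against the unpacked model folder from Step 2 (assumed here to still be at ./qwen-2.5):

   import json
   from inference import model_fn, input_fn, predict_fn, output_fn

   # Run the four handlers end to end, exactly as SageMaker would
   artifacts = model_fn("./qwen-2.5")
   text = input_fn(json.dumps({"text": "What is the capital of France?"}), "application/json")
   prediction = predict_fn(text, artifacts)
   print(output_fn(prediction, "application/json"))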

Step 4: Deploy the Model on SageMaker

  1. Create a SageMaker Model: Use the SageMaker Python SDK to create a model object. If you built the custom container, specify its ECR image URI (an alternative that uses AWS's pre-built Hugging Face container is sketched at the end of this step).
   import sagemaker
   from sagemaker import Model

   role = "arn:aws:iam::<your-account-id>:role/<your-sagemaker-role>"
   model_data = "s3://your-s3-bucket-name/qwen-2.5/qwen-2.5.tar.gz"
   image_uri = "<aws_account_id>.dkr.ecr.<region>.amazonaws.com/qwen-2.5-inference:latest"

   model = Model(
       image_uri=image_uri,
       model_data=model_data,
       role=role,
       name="qwen-2.5-model"
   )
  2. Deploy the Model to an Endpoint: Deploy the model to a SageMaker endpoint. A model of this size needs a GPU instance, and attaching JSON serializers ensures requests are sent with the application/json content type that inference.py expects.
   from sagemaker.serializers import JSONSerializer
   from sagemaker.deserializers import JSONDeserializer

   predictor = model.deploy(
       initial_instance_count=1,
       instance_type='ml.g5.2xlarge',  # GPU instance; size it to the model variant you chose
       serializer=JSONSerializer(),    # sends requests as application/json
       deserializer=JSONDeserializer()
   )
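
If you skipped the custom container in Step 3, the same model.tar.gz can be deployed with AWS's pre-built Hugging Face inference container instead of your own image. This is a hedged sketch; the framework and Python versions are assumptions you should check against the combinations currently supported by the SageMaker SDK:

   from sagemaker.huggingface import HuggingFaceModel

   hf_model = HuggingFaceModel(
       model_data=model_data,
       role=role,
       transformers_version="4.37",        # assumed supported version combination; verify
       pytorch_version="2.1",
       py_version="py310",
       env={"HF_TASK": "text-generation"}  # lets the default handler build a generation pipeline
   )

   predictor = hf_model.deploy(
       initial_instance_count=1,
       instance_type="ml.g5.2xlarge"
   )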

Step 5: Test the Endpoint

Once the endpoint is deployed, you can test it by sending inference requests.

# Test the endpoint
data = {"text": "What is the capital of France?"}
response = predictor.predict(data)  # the JSONSerializer attached at deploy time handles encoding

print(response)
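
Outside the SageMaker Python SDK (for example, from an application backend), the same endpoint can be called through the SageMaker runtime API; predictor.endpoint_name holds the generated endpoint name:

import boto3
import json

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,  # or copy the name from the SageMaker console
    ContentType="application/json",
    Body=json.dumps({"text": "What is the capital of France?"})
)

print(json.loads(response["Body"].read()))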

Step 6: Clean Up

To avoid unnecessary charges, delete the endpoint and any associated resources when you're done.

predictor.delete_endpoint()
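
delete_endpoint removes the endpoint itself; the model object registered with SageMaker and the packaged artifact in S3 linger, so you may want to remove those as well:

import boto3

# Remove the model object that deploy() registered with SageMaker
predictor.delete_model()

# Optionally remove the packaged model artifact from S3
boto3.client("s3").delete_object(Bucket="your-s3-bucket-name", Key="qwen-2.5/qwen-2.5.tar.gz")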

Conclusion

You have successfully deployed Alibaba's Qwen-2.5 model on AWS using Amazon SageMaker. You can now use the SageMaker endpoint to serve real-time inference requests. Depending on your use case, you can scale the deployment by adjusting the instance type and count.
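
If traffic is variable, you can also let the endpoint scale itself by registering its production variant with Application Auto Scaling; a minimal sketch, assuming the default variant name AllTraffic and a target of 100 invocations per instance:

import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = f"endpoint/{predictor.endpoint_name}/variant/AllTraffic"

# Register the variant as a scalable target (1 to 4 instances)
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4
)

# Scale on invocations per instance using a target-tracking policy
autoscaling.put_scaling_policy(
    PolicyName="qwen-2.5-invocations-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        }
    }
)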
