Deploying Qwen-2.5 Model on AWS Using Amazon SageMaker AI

Deploying Alibaba's Qwen-2.5 model on AWS using Amazon SageMaker involves several steps, including preparing the environment, downloading and packaging the model, creating a custom container (if necessary), and deploying it to an endpoint. Below is a step-by-step guide for deploying Qwen-2.5 on AWS SageMaker.

Prerequisites:

  1. AWS Account: You need an active AWS account with permissions to use SageMaker.
  2. SageMaker Studio or Notebook Instance: This will be your development environment where you can prepare and deploy the model.
  3. Docker: If you need to create a custom container, Docker will be required locally.
  4. Alibaba Model Repository Access: Ensure that you have access to the Qwen-2.5 model weights and configuration files from Alibaba’s ModelScope or Hugging Face repository.
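
A quick sanity check before you start is to confirm that your CLI credentials resolve to the intended account and that SageMaker is reachable in your target region:

   # Confirm which AWS account and identity your credentials resolve to
   aws sts get-caller-identity

   # Confirm SageMaker is reachable in your target region
   aws sagemaker list-domains --region <region>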

Step 1: Set Up Your SageMaker Environment

  1. Launch SageMaker Studio:

    • Go to the AWS Management Console.
    • Navigate to Amazon SageMaker > SageMaker Studio.
    • Create a new domain or use an existing one.
    • Launch a Jupyter notebook instance within SageMaker Studio.
  2. Install Required Libraries:
    Open a terminal in SageMaker Studio or your notebook instance and install the necessary libraries:

   pip install boto3 sagemaker transformers torch
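
If you are running inside SageMaker Studio or a notebook instance, the SDK can also resolve the execution role and a default S3 bucket for you; this is a small convenience sketch (outside SageMaker you would pass a role ARN explicitly):

   import sagemaker

   # Resolves the execution role attached to your Studio user / notebook instance
   session = sagemaker.Session()
   role = sagemaker.get_execution_role()

   # A session-managed bucket you can use if you don't want to create your own
   default_bucket = session.default_bucket()
   print(role, default_bucket)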

Step 2: Download the Qwen-2.5 Model

You can download the Qwen-2.5 model from Alibaba’s ModelScope or Hugging Face repository. For this example, we’ll assume you are using Hugging Face.

  1. Download the Model Locally: Use the transformers library to download the model (a lighter, download-only alternative is sketched after these steps):
   from transformers import AutoModelForCausalLM, AutoTokenizer

   model_name = "Qwen/Qwen2.5-7B-Instruct"  # Pick the Qwen2.5 size/variant you want from Hugging Face
   tokenizer = AutoTokenizer.from_pretrained(model_name)
   model = AutoModelForCausalLM.from_pretrained(model_name)

   # Save the model and tokenizer locally
   model.save_pretrained("./qwen-2.5")
   tokenizer.save_pretrained("./qwen-2.5")
  2. Package the Model: After downloading, package the files into a .tar.gz archive so it can be uploaded to S3. SageMaker extracts this archive into the directory passed to model_fn, so the model files should sit at the root of the archive rather than inside a subfolder.
   tar -czvf qwen-2.5.tar.gz -C qwen-2.5 .
  3. Upload the Model to S3: Upload the packaged model to an S3 bucket:
   import boto3

   s3 = boto3.client('s3')
   s3.upload_file("qwen-2.5.tar.gz", "your-s3-bucket-name", "qwen-2.5/qwen-2.5.tar.gz")
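
As an alternative to loading the model in step 1, huggingface_hub can fetch the repository files directly to disk without instantiating the model, which is much lighter on memory for large checkpoints; a minimal sketch, assuming the Qwen/Qwen2.5-7B-Instruct repository:

   from huggingface_hub import snapshot_download

   # Download the repository contents to a local folder without loading the weights
   snapshot_download(
       repo_id="Qwen/Qwen2.5-7B-Instruct",
       local_dir="./qwen-2.5"
   )

You would then package and upload the folder exactly as in steps 2 and 3.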

Step 3: Create a Custom Inference Container (Optional)

If a pre-built container from AWS meets your needs, you can skip this step. If you need to customize the inference logic, the simplest route is to extend one of AWS's pre-built Deep Learning Containers, which already include the SageMaker inference toolkit and model server, and add your own inference script.

  1. Create a Dockerfile: Create a Dockerfile that installs the necessary dependencies and sets up the inference script.
   # Extend an AWS Deep Learning Container for PyTorch inference so the
   # SageMaker inference toolkit and model server are already included
   # (look up the image URI for your region in the AWS Deep Learning Containers list)
   FROM <pytorch-inference-dlc-image-uri>

   # Install additional dependencies
   RUN pip install --upgrade pip
   RUN pip install transformers

   # Copy the inference script
   COPY inference.py /opt/ml/code/inference.py

   # Tell the SageMaker toolkit which script to run
   ENV SAGEMAKER_PROGRAM inference.py
  2. Create the Inference Script: Create an inference.py file that handles loading the model and performing inference (a quick local smoke test of these handlers is sketched after this list).
   import json
   import torch
   from transformers import AutoModelForCausalLM, AutoTokenizer

   # Load the model and tokenizer from the extracted model artifact
   def model_fn(model_dir):
       tokenizer = AutoTokenizer.from_pretrained(model_dir)
       model = AutoModelForCausalLM.from_pretrained(
           model_dir,
           torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
       )
       model.to("cuda" if torch.cuda.is_available() else "cpu")
       model.eval()
       return {"model": model, "tokenizer": tokenizer}

   # Parse incoming requests
   def input_fn(request_body, request_content_type):
       if request_content_type == 'application/json':
           input_data = json.loads(request_body)
           return input_data['text']
       raise ValueError(f"Unsupported content type: {request_content_type}")

   # Perform inference
   def predict_fn(input_data, model_dict):
       model = model_dict["model"]
       tokenizer = model_dict["tokenizer"]
       inputs = tokenizer(input_data, return_tensors="pt").to(model.device)
       with torch.no_grad():
           outputs = model.generate(**inputs, max_new_tokens=256)
       return tokenizer.decode(outputs[0], skip_special_tokens=True)

   # Format the response
   def output_fn(prediction, response_content_type):
       return json.dumps({"generated_text": prediction})
  3. Build and Push the Docker Image: Build the Docker image and push it to Amazon Elastic Container Registry (ECR).
   # Build the Docker image
   docker build -t qwen-2.5-inference .

   # Create the ECR repository (first time only); the push below fails if it doesn't exist
   aws ecr create-repository --repository-name qwen-2.5-inference --region <region>

   # Tag the image for ECR
   docker tag qwen-2.5-inference:latest <aws_account_id>.dkr.ecr.<region>.amazonaws.com/qwen-2.5-inference:latest

   # Log in to ECR and push the image
   aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <aws_account_id>.dkr.ecr.<region>.amazonaws.com
   docker push <aws_account_id>.dkr.ecr.<region>.amazonaws.com/qwen-2.5-inference:latest
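
Before building the image, it can save a debugging cycle to smoke-test the handler functions locally against the unpacked model folder from Step 2 (assumed here to still be at ./qwen-2.5):

   import json
   from inference import model_fn, input_fn, predict_fn, output_fn

   # Run the four handlers end to end, exactly as SageMaker would
   artifacts = model_fn("./qwen-2.5")
   text = input_fn(json.dumps({"text": "What is the capital of France?"}), "application/json")
   prediction = predict_fn(text, artifacts)
   print(output_fn(prediction, "application/json"))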

Step 4: Deploy the Model on SageMaker

  1. Create a SageMaker Model: Use the SageMaker Python SDK to create a model object. If you built the custom container, specify its ECR image URI (an alternative that uses AWS's pre-built Hugging Face container is sketched at the end of this step).
   import sagemaker
   from sagemaker import Model

   role = "arn:aws:iam::<your-account-id>:role/<your-sagemaker-role>"
   model_data = "s3://your-s3-bucket-name/qwen-2.5/qwen-2.5.tar.gz"
   image_uri = "<aws_account_id>.dkr.ecr.<region>.amazonaws.com/qwen-2.5-inference:latest"

   model = Model(
       image_uri=image_uri,
       model_data=model_data,
       role=role,
       name="qwen-2.5-model"
   )
  2. Deploy the Model to an Endpoint: Deploy the model to a SageMaker endpoint. A model of this size needs a GPU instance, and attaching JSON serializers ensures requests are sent with the application/json content type that inference.py expects.
   from sagemaker.serializers import JSONSerializer
   from sagemaker.deserializers import JSONDeserializer

   predictor = model.deploy(
       initial_instance_count=1,
       instance_type='ml.g5.2xlarge',  # GPU instance; size it to the model variant you chose
       serializer=JSONSerializer(),    # sends requests as application/json
       deserializer=JSONDeserializer()
   )
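
If you skipped the custom container in Step 3, the same model.tar.gz can be deployed with AWS's pre-built Hugging Face inference container instead of your own image. This is a hedged sketch; the framework and Python versions are assumptions you should check against the combinations currently supported by the SageMaker SDK:

   from sagemaker.huggingface import HuggingFaceModel

   hf_model = HuggingFaceModel(
       model_data=model_data,
       role=role,
       transformers_version="4.37",        # assumed supported version combination; verify
       pytorch_version="2.1",
       py_version="py310",
       env={"HF_TASK": "text-generation"}  # lets the default handler build a generation pipeline
   )

   predictor = hf_model.deploy(
       initial_instance_count=1,
       instance_type="ml.g5.2xlarge"
   )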

Step 5: Test the Endpoint

Once the endpoint is deployed, you can test it by sending inference requests.

# Test the endpoint
data = {"text": "What is the capital of France?"}
response = predictor.predict(data)  # the JSONSerializer attached at deploy time handles encoding

print(response)
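
Outside the SageMaker Python SDK (for example, from an application backend), the same endpoint can be called through the SageMaker runtime API; predictor.endpoint_name holds the generated endpoint name:

import boto3
import json

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,  # or copy the name from the SageMaker console
    ContentType="application/json",
    Body=json.dumps({"text": "What is the capital of France?"})
)

print(json.loads(response["Body"].read()))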

Step 6: Clean Up

To avoid unnecessary charges, delete the endpoint and any associated resources when you're done.

predictor.delete_endpoint()
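
delete_endpoint removes the endpoint itself; the model object registered with SageMaker and the packaged artifact in S3 linger, so you may want to remove those as well:

import boto3

# Remove the model object that deploy() registered with SageMaker
predictor.delete_model()

# Optionally remove the packaged model artifact from S3
boto3.client("s3").delete_object(Bucket="your-s3-bucket-name", Key="qwen-2.5/qwen-2.5.tar.gz")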

Conclusion

You have successfully deployed Alibaba's Qwen-2.5 model on AWS using Amazon SageMaker. You can now use the SageMaker endpoint to serve real-time inference requests. Depending on your use case, you can scale the deployment by adjusting the instance type and count.
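
If traffic is variable, you can also let the endpoint scale itself by registering its production variant with Application Auto Scaling; a minimal sketch, assuming the default variant name AllTraffic and a target of 100 invocations per instance:

import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = f"endpoint/{predictor.endpoint_name}/variant/AllTraffic"

# Register the variant as a scalable target (1 to 4 instances)
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4
)

# Scale on invocations per instance using a target-tracking policy
autoscaling.put_scaling_policy(
    PolicyName="qwen-2.5-invocations-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        }
    }
)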
