Roshan Gurung

Create your own ChatGPT using SageMaker Python SDK V3

Introduction

AWS SageMaker Python SDK V3 introduces more streamlined and unified core classes for ML workflows, replacing the previous framework-specific classes. The new SDK has a unified "ModelTrainer" class for training with custom containers and data processing, and a "ModelBuilder" class for model deployment and inference setup.

I didn't find many resources on deploying a custom model to AWS SageMaker using the latest Python SDK V3, so in this tutorial I will deploy an open-source LLM (Qwen3-4B-Instruct-2507) to demonstrate:

  1. The new deployment process using "ModelBuilder", including custom dependencies and environment variables for the deployment container
  2. Custom model loading and the inference invocation pipeline inside the container
  3. Schema-driven inference for request/response validation

Prerequisites

Set up SageMaker Studio by creating a domain in Amazon SageMaker AI, then launch a JupyterLab space in SageMaker Studio. For this tutorial, the minimum instance type for JupyterLab is enough. However, I will use a GPU instance for the deployment endpoint, so be cautious about the incurred costs and always CLEAN UP the resources when you are not using them.

Dependency Management

To use the code from this tutorial, a recent version of sagemaker and the other dependencies is needed. In my case, even when starting from the recent SageMaker distribution in JupyterLab, I was getting older versions, so I used the following commands in the Jupyter notebook to update my dependencies.

%pip install --no-cache-dir -U sagemaker protobuf --quiet
%pip uninstall -y sagemaker sagemaker-core sagemaker-train sagemaker-serve sagemaker-mlops --quiet
%pip install --no-cache-dir sagemaker-core sagemaker-train sagemaker-serve sagemaker-mlops 'sagemaker>=3' --quiet

%pip uninstall -y tensorflow --quiet
%pip install --no-cache-dir tensorflow --quiet
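
After the installs, it is worth restarting the notebook kernel and confirming that the new version is actually picked up. A quick check (assuming the installs completed cleanly):

import sagemaker
print(sagemaker.__version__)  # should report a 3.x version after the upgrade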

Import Libraries

import json
import uuid

from sagemaker.serve.model_builder import ModelBuilder
from sagemaker.serve.spec.inference_spec import InferenceSpec
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.serve.utils.types import ModelServer
from sagemaker.core.resources import EndpointConfig

Custom InferenceSpec

Compared to the previous version of the SDK, InferenceSpec allows customizing model loading and data processing for inference inside the deployment containers. In the script below, I inherit from InferenceSpec and override the load() and invoke() functions to suit the model loading and data processing pipeline for the "Qwen3-4B-Instruct-2507" model. The load() function is called once, to initialize the tokenizer and the model when the deployment container starts. Then, each time the endpoint receives a request, the invoke() function is executed.

class HuggingFaceInferenceSpec(InferenceSpec):
    def __init__(self):
        self.model_name = "Qwen/Qwen3-4B-Instruct-2507"

    def get_model(self):
        return self.model_name

    def load(self, model_dir: str):
        # Called once at container startup: load the tokenizer and model weights
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer
        tokenizer = AutoTokenizer.from_pretrained(model_dir)
        model = AutoModelForCausalLM.from_pretrained(
            model_dir,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        model.eval()

        return {"model": model, "tokenizer": tokenizer}

    def invoke(self, input_object, model):
        # Called on every request: build the chat prompt, generate, and decode the reply
        import torch
        try:
            tokenizer = model["tokenizer"]
            hf_model = model["model"]

            # Accept either the schema format {"inputs": [...]} or a plain string
            if isinstance(input_object, dict) and "inputs" in input_object:
                messages = input_object["inputs"]
            else:
                messages = [{"role": "user", "content": str(input_object)}]

            text = tokenizer.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=True,
            )

            inputs = tokenizer(text, return_tensors="pt").to(hf_model.device)

            with torch.no_grad():
                outputs = hf_model.generate(
                    **inputs,
                    max_new_tokens=512,
                    do_sample=True,
                    temperature=0.7,
                )

            # Strip the prompt tokens and decode only the newly generated text
            generated = outputs[0][inputs["input_ids"].shape[-1]:]
            response = tokenizer.decode(generated, skip_special_tokens=True)

            return [{"response": response}]

        except Exception as ex:
            return [{"response": f"Error during invocation: {ex}"}]
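
Before building and deploying, you can optionally smoke-test the spec locally. The sketch below is an assumption-heavy shortcut: it loads the model straight from the Hugging Face Hub ID (rather than a SageMaker model_dir) and requires a notebook instance with enough GPU memory plus the transformers/torch versions listed in the next section.

# Optional local smoke test -- assumes a GPU instance with transformers and torch installed
spec = HuggingFaceInferenceSpec()
artifacts = spec.load(spec.get_model())  # from_pretrained also accepts a Hub model ID
print(spec.invoke({"inputs": [{"role": "user", "content": "Say hello in one sentence."}]}, artifacts))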

Schema Builder

I also implement SchemaBuilder to enforce schema-aware inference, so that the input/output format is consistent and validated.

sample_input = {"inputs": [{"role": "user", "content": "What is AWS Sagemaker?"}]}

sample_output = [{"response": "Amazon SageMaker is a fully managed cloud-based platform provided by AWS "}]
schema_builder = SchemaBuilder(sample_input, sample_output)

print("Schema builder created successfully!")
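
For reference, the request body sent to the endpoint in the Testing section below is simply the JSON-serialized form of this input shape:

print(json.dumps(sample_input))
# {"inputs": [{"role": "user", "content": "What is AWS Sagemaker?"}]}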

Model Builder

The deployment parameters are defined here: the Hugging Face model ID, the instance type for the endpoint, and the model and endpoint name prefixes for tracking. Along with this, I am pinning dependencies and installing them manually into the container, as Qwen3 models depend on a recent transformers release. Similarly, I am setting environment variables for transformers, such as "HF_MODEL_ID", which downloads and uses the specific model from the configured parameters.

# Configuration Parameters
MODEL_ID = "Qwen/Qwen3-4B-Instruct-2507"
INSTANCE_TYPE = "ml.g4dn.xlarge"
MODEL_NAME_PREFIX = "hf-v3-qwen3-4b"
ENDPOINT_NAME_PREFIX = "hf-v3-qwen3-4b-endpoint"

dependencies = {
    "auto": False,
    "custom": ["sagemaker>=3.1.1",
               "transformers>=4.57.3", 
               "accelerate",
               "torch>=2.6.0", 
               "cloudpickle>=2.2.1",
               "protobuf"]
}

env_vars = {
    "HF_MODEL_ID": MODEL_ID,
    "HF_TASK": "text-generation",
    "HF_HOME": "/opt/ml/model",
    "TRANSFORMERS_CACHE": "/opt/ml/model",
}

# Generate unique identifiers
unique_id = str(uuid.uuid4())[:8]
model_name = f"{MODEL_NAME_PREFIX}-{unique_id}"
endpoint_name = f"{ENDPOINT_NAME_PREFIX}-{unique_id}"

With the parameters, dependencies and environment variables configured, alongside the initialized custom InferenceSpec, the next step is to build the model using "ModelBuilder". Since I am using a transformer-based model, I use TORCHSERVE as the model server for inference. This creates a model, packaged like a Docker image, which can be viewed in the "Models/My models" tab in SageMaker Studio.

# Create ModelBuilder
inference_spec = HuggingFaceInferenceSpec()
model_builder = ModelBuilder(
    inference_spec=inference_spec,
    model_server=ModelServer.TORCHSERVE,
    schema_builder=schema_builder,
    instance_type=INSTANCE_TYPE, 
    env_vars=env_vars,
    dependencies=dependencies
)

# Build the model
core_model = model_builder.build(model_name=model_name)
print(f"Model Successfully Created: {core_model.model_name}")
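
If you want to verify the model outside of the Studio UI, you can also describe it with the plain boto3 SageMaker client (an optional check, not part of the SDK V3 workflow itself):

import boto3
sm_client = boto3.client("sagemaker")
model_info = sm_client.describe_model(ModelName=core_model.model_name)
print(model_info["ModelArn"])  # the ARN confirms the model is registered in SageMaker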

Model Creation Screenshot

Model Deployment

In this step, the previously built model is deployed: containers are created, and the InferenceSpec is executed, pulling the model and loading it onto the defined instance type. The endpoint is then created and becomes live for API requests.

core_endpoint = model_builder.deploy(endpoint_name=endpoint_name)
print(f"Endpoint Successfully Created: {core_endpoint.endpoint_name}")

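
Deployment can take several minutes while the container is provisioned and the model weights are downloaded. If you want to check the endpoint status independently of the deploy call, a small boto3 check like this works:

import boto3
sm_client = boto3.client("sagemaker")
status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
print(f"Endpoint status: {status}")  # expect "InService" once deployment has finished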

Model Deployment Screenshot

Testing

Now that the endpoint is live, you can invoke it by sending requests in the same schema format defined in the previous step. This triggers the invoke() function, and you get your response from the LLM.

test_input_1 = {"inputs": [{"role": "user", "content": "What are major features of AWS Sagemaker?"}]}

result_1 = core_endpoint.invoke(
    body=json.dumps(test_input_1),
    content_type="application/json"
)

response_1 = json.loads(result_1.body.read().decode('utf-8'))
print(f"Conversation Test: {response_1}")
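
Since every request follows the same pattern, it can be convenient to wrap the call in a small helper (a sketch that simply reuses the invoke signature shown above):

def ask(question: str):
    # Build the schema-shaped request, invoke the endpoint, and parse the JSON response
    payload = {"inputs": [{"role": "user", "content": question}]}
    result = core_endpoint.invoke(
        body=json.dumps(payload),
        content_type="application/json"
    )
    return json.loads(result.body.read().decode("utf-8"))

print(ask("Explain SageMaker endpoints in two sentences."))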

Endpoint Testing Screenshot

Resource Cleanups

After all testing is complete, and if you are no longer using the endpoint, always clean up the resources using the following script to avoid further charges.

core_endpoint_config = EndpointConfig.get(endpoint_config_name=core_endpoint.endpoint_name)

core_model.delete()
core_endpoint.delete()
core_endpoint_config.delete()

print("All resources successfully deleted!")
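
To double-check that nothing is left running (and billing), you can list endpoints matching the prefix with boto3; the list should come back empty:

import boto3
sm_client = boto3.client("sagemaker")
remaining = sm_client.list_endpoints(NameContains=ENDPOINT_NAME_PREFIX)["Endpoints"]
print(f"Remaining endpoints: {len(remaining)}")  # expect 0 after cleanup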
