<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rustem Feyzkhanov</title>
    <description>The latest articles on DEV Community by Rustem Feyzkhanov (@ryfeus).</description>
    <link>https://dev.to/ryfeus</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F411360%2Fb25947df-b324-4d34-80b9-9ae47231dec4.jpeg</url>
      <title>DEV Community: Rustem Feyzkhanov</title>
      <link>https://dev.to/ryfeus</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ryfeus"/>
    <language>en</language>
    <item>
      <title>Using custom docker image with SageMaker + AWS Step Functions</title>
      <dc:creator>Rustem Feyzkhanov</dc:creator>
      <pubDate>Thu, 22 Oct 2020 05:22:20 +0000</pubDate>
      <link>https://dev.to/aws-heroes/using-custom-docker-image-with-sagemaker-aws-step-functions-4b0k</link>
      <guid>https://dev.to/aws-heroes/using-custom-docker-image-with-sagemaker-aws-step-functions-4b0k</guid>
      <description>&lt;p&gt;Amazon SageMaker is extremely popular for data science projects which need to be organized in the cloud. It provides a simple and transparent way to try multiple models on your data.&lt;/p&gt;

&lt;p&gt;The good news is that it has become even cheaper to train and deploy models using SageMaker (up to an 18% price reduction on all ml.p2.* and ml.p3.* instance types), which makes it even more suitable for integration with your existing production AWS infrastructure. For example, you may want to train your model automatically on demand, or retrain it when your data changes.&lt;/p&gt;

&lt;p&gt;There are multiple challenges associated with this task:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, you need a way to organize preprocessing and postprocessing steps for training the model. The former could include running ETL on the latest data, and the latter could include updating or registering models in your system.&lt;/li&gt;
&lt;li&gt;Second, you need a way to handle long-running training tasks asynchronously, with the ability to retry or restart a task when a retriable error occurs; non-retriable errors should be caught and reported as a failed process.&lt;/li&gt;
&lt;li&gt;Finally, you need a scalable way to run multiple tasks in parallel, in case you need to retrain multiple models in your system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AWS Step Functions tackles these challenges by orchestrating deep learning training workflows: it handles multi-step processes with custom logic such as retries and error handling, and it integrates with AWS compute services like Amazon SageMaker, AWS Batch, AWS Fargate, and AWS Lambda. It also offers useful extras, such as scheduling workflows or integrating with Amazon EventBridge to connect to other services, for example for notification purposes.&lt;/p&gt;
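The retry and failure handling described above maps onto the `Retry` and `Catch` fields of an Amazon States Language Task state. Here is a minimal sketch, expressed as a Python dict for easy inspection; the error name, backoff numbers, and the `NotifyFailureStep` state are illustrative assumptions, not part of the workflow deployed later in this post.

```python
# Sketch (assumptions noted above): a Task state that retries transient
# SageMaker failures with exponential backoff and routes everything else
# to a hypothetical failure-notification state.
training_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
    "Retry": [
        {
            # Retry transient failures with exponential backoff.
            "ErrorEquals": ["SageMaker.AmazonSageMakerException"],
            "IntervalSeconds": 30,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ],
    "Catch": [
        {
            # Any other error goes to a notification/failure state.
            "ErrorEquals": ["States.ALL"],
            "Next": "NotifyFailureStep",
        }
    ],
    "Next": "PostprocessingLambdaStep",
}

# With BackoffRate 2.0, the waits between attempts grow 30s, 60s, 120s.
waits = [30 * 2.0 ** i for i in range(3)]
print(waits)
```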

&lt;p&gt;In this post, I’ll cover a method to build a serverless workflow using Amazon SageMaker with a custom Docker image to train a model, AWS Lambda for preprocessing and postprocessing, and AWS Step Functions as the orchestrator of the workflow.&lt;/p&gt;

&lt;p&gt;We will cover the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using Amazon SageMaker to run the training task, creating a custom Docker image for training, and uploading it to Amazon ECR&lt;/li&gt;
&lt;li&gt;Using AWS Lambda with AWS Step Functions to pass the training configuration to Amazon SageMaker and to upload the model&lt;/li&gt;
&lt;li&gt;Using the Serverless Framework to deploy all necessary services and return a link to invoke the Step Function&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prerequisites:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS CLI installed&lt;/li&gt;
&lt;li&gt;Docker installed&lt;/li&gt;
&lt;li&gt;The Serverless Framework with plugins installed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Code decomposition:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A container folder which contains the Dockerfile for building the image and the train script for model training&lt;/li&gt;
&lt;li&gt;An index.py file which contains the code for the AWS Lambda functions&lt;/li&gt;
&lt;li&gt;A serverless.yml file which contains the configuration for AWS Lambda, the execution graph for AWS Step Functions, and the configuration for Amazon SageMaker&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Using Amazon SageMaker for running the training task
&lt;/h2&gt;

&lt;p&gt;Amazon SageMaker provides a great interface for running a custom Docker image on a GPU instance. It handles starting and terminating the instance, placing and running the Docker image on it, instance customization, stopping conditions, metrics, training data, and the hyperparameters of the algorithm.&lt;/p&gt;

&lt;p&gt;In our example, we will build a container that trains a classification model on the Fashion-MNIST dataset. The training code will look like a classic training example, but with two main differences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An import-hyperparameters step
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/opt/ml/input/config/hyperparameters.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;json_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;hyperparameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hyperparameters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;A step that saves the model to S3
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/opt/ml/model/pipelineSagemakerModel.h5&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
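One detail worth noting about the hyperparameters snippet above: SageMaker serializes every value in hyperparameters.json as a string, so numeric hyperparameters such as num_of_epochs (which the preprocessing Lambda later in this post sends) need an explicit cast before use. A minimal sketch, with the inline JSON standing in for the real file:

```python
import json

# hyperparameters.json always stores values as strings; the inline JSON
# below stands in for the file SageMaker writes under /opt/ml/input/config/.
raw = json.loads('{"num_of_epochs": "4"}')
num_of_epochs = int(raw["num_of_epochs"])  # cast before passing to model.fit
print(num_of_epochs)
```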



&lt;p&gt;Here is what the Dockerfile will look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; tensorflow/tensorflow:1.12.0-gpu-py3&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip3 &lt;span class="nb"&gt;install &lt;/span&gt;boto3
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PATH="/opt/ml/code:${PATH}"&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . /opt/ml/code/&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /opt/ml/code&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;chmod &lt;/span&gt;777 /opt/ml/code/train
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is how we will build and push the image to Amazon ECR (you will need to replace &amp;lt;accountId&amp;gt; and &amp;lt;regionId&amp;gt; with your AWS account ID and region):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ryfeus/stepfunctions2processing.git
&lt;span class="nb"&gt;cd &lt;/span&gt;aws-sagemaker/container
docker build &lt;span class="nt"&gt;-t&lt;/span&gt; aws-sagemaker-example &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="si"&gt;$(&lt;/span&gt;aws ecr get-login &lt;span class="nt"&gt;--no-include-email&lt;/span&gt; &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1&lt;span class="si"&gt;)&lt;/span&gt;
aws ecr create-repository &lt;span class="nt"&gt;--repository-name&lt;/span&gt; aws-sagemaker-example
docker tag aws-sagemaker-example:latest &amp;lt;accountId&amp;gt;.dkr.ecr.&amp;lt;regionId&amp;gt;.amazonaws.com/aws-sagemaker-example:latest
docker push &amp;lt;accountId&amp;gt;.dkr.ecr.&amp;lt;regionId&amp;gt;.amazonaws.com/aws-sagemaker-example:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Using AWS Lambda with AWS Step Functions to pass training configuration to Amazon SageMaker and for uploading the model
&lt;/h2&gt;

&lt;p&gt;In our case, we will use the preprocessing Lambda to generate a custom configuration for the SageMaker training task. This ensures that the SageMaker job gets a unique name and lets us generate a custom set of hyperparameters. It could also be used to provide a specific Docker image name or tag, or a custom training dataset.&lt;/p&gt;

&lt;p&gt;In our case the execution graph will consist of the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A preprocessing step which generates the config for the SageMaker task&lt;/li&gt;
&lt;li&gt;A SageMaker step which runs the training job based on the config from the previous step&lt;/li&gt;
&lt;li&gt;A postprocessing step which can handle model publishing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is what the config for the Step Functions will look like. As you can see, we define each step separately and then specify the next step in the process. We can also define parts of the SageMaker training job definition in its state config; in this case, we set the instance type, the Docker image, and whether to use Spot instances.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;stepFunctions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;stateMachines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;SagemakerStepFunction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;startFunction&lt;/span&gt;
            &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GET&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${self:service}-StepFunction&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Fn::GetAtt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;StepFunctionsRole&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Arn&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;definition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;StartAt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PreprocessingLambdaStep&lt;/span&gt;
        &lt;span class="na"&gt;States&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;PreprocessingLambdaStep&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Task&lt;/span&gt;
            &lt;span class="na"&gt;Resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;Fn::GetAtt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;preprocessingLambda&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Arn&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
            &lt;span class="na"&gt;Next&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TrainingSagemakerStep&lt;/span&gt;
          &lt;span class="na"&gt;TrainingSagemakerStep&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Task&lt;/span&gt;
            &lt;span class="na"&gt;Resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:states:::sagemaker:createTrainingJob.sync&lt;/span&gt;
            &lt;span class="na"&gt;Next&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PostprocessingLambdaStep&lt;/span&gt;
            &lt;span class="na"&gt;Parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;TrainingJobName.$&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$.name"&lt;/span&gt;
              &lt;span class="na"&gt;ResourceConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;InstanceCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
                &lt;span class="na"&gt;InstanceType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml.p2.xlarge&lt;/span&gt;
                &lt;span class="na"&gt;VolumeSizeInGB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
              &lt;span class="na"&gt;StoppingCondition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;MaxRuntimeInSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;86400&lt;/span&gt;
                &lt;span class="na"&gt;MaxWaitTimeInSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;86400&lt;/span&gt;
              &lt;span class="na"&gt;HyperParameters.$&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$.hyperparameters"&lt;/span&gt;
              &lt;span class="na"&gt;AlgorithmSpecification&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;TrainingImage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#{AWS::AccountId}.dkr.ecr.#{AWS::Region}.amazonaws.com/aws-sagemaker-example:latest'&lt;/span&gt;
                &lt;span class="na"&gt;TrainingInputMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;File&lt;/span&gt;
              &lt;span class="na"&gt;OutputDataConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;S3OutputPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3://sagemaker-#{AWS::Region}-#{AWS::AccountId}/&lt;/span&gt;
              &lt;span class="na"&gt;EnableManagedSpotTraining&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
              &lt;span class="na"&gt;RoleArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::#{AWS::AccountId}:role/SageMakerAccessRole&lt;/span&gt;
          &lt;span class="na"&gt;PostprocessingLambdaStep&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Task&lt;/span&gt;
            &lt;span class="na"&gt;Resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;Fn::GetAtt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;postprocessingLambda&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;Arn&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
            &lt;span class="na"&gt;End&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what the execution graph will look like in the AWS Step Functions dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F3318397%2F96682169-4f4f8680-132d-11eb-9e25-da3c1f42d624.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F3318397%2F96682169-4f4f8680-132d-11eb-9e25-da3c1f42d624.png" alt="image1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is what the AWS Lambda code will look like. Since Amazon SageMaker requires every training job to have a unique name, we use a random generator to produce a unique suffix.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handlerPreprocessing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;letters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ascii_lowercase&lt;/span&gt;
    &lt;span class="n"&gt;suffix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;letters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;jobParameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model-trainining-&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hyperparameters&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;num_of_epochs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;jobParameters&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handlerPostprocessing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
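Since the generated name above is what SageMaker receives as TrainingJobName, it has to stay within SageMaker's naming constraints (alphanumeric characters and hyphens, at most 63 characters). Here is a small validation sketch of the same name-building logic; the regex is my reading of the documented pattern, so treat it as an assumption:

```python
import random
import re
import string
from datetime import datetime

# Assumed regex for SageMaker's TrainingJobName constraint:
# starts with an alphanumeric character, then alphanumerics/hyphens, max 63 chars.
JOB_NAME_RE = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")

def make_job_name(prefix="model-training"):
    # Same scheme as the preprocessing Lambda: prefix + ISO date + random suffix.
    suffix = "".join(random.choice(string.ascii_lowercase) for _ in range(10))
    return f"{prefix}-{datetime.now().date()}-{suffix}"

name = make_job_name()
print(name, len(name), bool(JOB_NAME_RE.match(name)))
```

Keeping the prefix short leaves room for the date and suffix while staying under the 63-character limit.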



&lt;h2&gt;
  
  
  Using the Serverless Framework to deploy all necessary services and return a link to invoke the Step Function
&lt;/h2&gt;

&lt;p&gt;We will use the Serverless Framework to deploy AWS Step Functions and AWS Lambda. Using it to deploy serverless infrastructure has the following advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Plugins that provide a way to deploy and configure AWS Step Functions&lt;/li&gt;
&lt;li&gt;A resources section that lets you use CloudFormation notation to create custom resources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can install dependencies and deploy services by using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;aws-sagemaker
npm &lt;span class="nb"&gt;install
&lt;/span&gt;serverless deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what the output will look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Serverless: Packaging service...
Serverless: Excluding development dependencies...
Serverless: Creating Stack...
Serverless: Checking Stack create progress...
.....
Serverless: Stack create finished...
Serverless: Uploading CloudFormation file to S3...
Serverless: Uploading artifacts...
Serverless: Uploading service DeepLearningSagemaker.zip file to S3 &lt;span class="o"&gt;(&lt;/span&gt;35.3 KB&lt;span class="o"&gt;)&lt;/span&gt;...
Serverless: Validating template...
Serverless: Updating Stack...
Serverless: Checking Stack update progress...
.............................................
Serverless: Stack update finished...
Service Information
service: DeepLearningSagemaker
stage: dev
region: us-east-1
stack: DeepLearningSagemaker-dev
resources: 15
api keys:
  None
endpoints:
functions:
  preprocessingLambda: DeepLearningSagemaker-dev-preprocessingLambda
  postprocessingLambda: DeepLearningSagemaker-dev-postprocessingLambda
layers:
  None
Serverless StepFunctions OutPuts
endpoints:
  GET - https://&amp;lt;url_prefix&amp;gt;.execute-api.us-east-1.amazonaws.com/dev/startFunction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can use the URL from the output to invoke the deployed AWS Step Functions workflow, which in turn runs Amazon SageMaker. This can be done, for example, with curl:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://&amp;lt;url_prefix&amp;gt;.execute-api.us-east-1.amazonaws.com/dev/startFunction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, we can take a look at the execution graph in the AWS Step Functions dashboard (&lt;a href="https://console.aws.amazon.com/states/home" rel="noopener noreferrer"&gt;https://console.aws.amazon.com/states/home&lt;/a&gt;) and review the training job in the Amazon SageMaker dashboard (&lt;a href="https://console.aws.amazon.com/sagemaker/home" rel="noopener noreferrer"&gt;https://console.aws.amazon.com/sagemaker/home&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;AWS Step Functions dashboard:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F3318397%2F96682175-4fe81d00-132d-11eb-90fd-733cbd11bf3d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F3318397%2F96682175-4fe81d00-132d-11eb-90fd-733cbd11bf3d.png" alt="image2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Amazon SageMaker dashboard:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F3318397%2F96682180-51194a00-132d-11eb-8774-98c1cdd87879.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F3318397%2F96682180-51194a00-132d-11eb-8774-98c1cdd87879.png" alt="image3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We’ve created a deep learning training pipeline using Amazon SageMaker and AWS Step Functions. Setting everything up was simple, and you can use this example to develop more complex workflows, for example by implementing branching, parallel executions, or custom error handling.&lt;/p&gt;

&lt;p&gt;Feel free to check the project repository at &lt;a href="https://github.com/ryfeus/stepfunctions2processing" rel="noopener noreferrer"&gt;https://github.com/ryfeus/stepfunctions2processing&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>machinelearning</category>
      <category>docker</category>
    </item>
    <item>
      <title>Amazon CodeGuru Profiler for monitoring cloud applications</title>
      <dc:creator>Rustem Feyzkhanov</dc:creator>
      <pubDate>Fri, 10 Jul 2020 01:41:27 +0000</pubDate>
      <link>https://dev.to/aws-heroes/amazon-codeguru-profiler-for-monitoring-cloud-applications-2n32</link>
      <guid>https://dev.to/aws-heroes/amazon-codeguru-profiler-for-monitoring-cloud-applications-2n32</guid>
      <description>&lt;p&gt;One of the challenges with building cloud applications is finding the correct way to optimize it. To do so you need to have an insight into how your application performs in production and what are current bottlenecks and latencies which exist in your code. The usual cycle is that you monitor your application, you find existing bottlenecks, next you prioritize based on the latency or CPU/RAM impact and then you optimize your code. Finally, you need to monitor the app after the change to make sure that your fix actually helped.&lt;/p&gt;

&lt;p&gt;There are a lot of APM (Application Performance Monitoring) tools that can be used to monitor your application in production, with different ways of gathering and visualizing data about your application. Amazon CodeGuru goes a step further in handling bottlenecks and provides several great additions to the optimization cycle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It provides automatic insights into possible ways to optimize your app, based on a combination of a knowledge base and ML algorithms.&lt;/li&gt;
&lt;li&gt;It puts a dollar value on each bottleneck based on the instance you are using. This lets you prioritize optimization across multiple apps and calculate the ROI of reducing technical debt. It also helps you find the methods that contribute most to the app’s latency and estimate how much that latency could be reduced.&lt;/li&gt;
&lt;li&gt;It monitors data continuously and can notify you if there are any anomalies in your app. The service also generates CloudWatch metrics, which you can add to your main dashboard for low-level app monitoring and easier debugging.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this post, I’ll cover a demo of how to use Amazon CodeGuru with a Java application running on AWS Batch. I will use AWS Step Functions to handle AWS Batch invocations and will build the Java application using Maven.&lt;/p&gt;

&lt;p&gt;We will cover the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Building and testing a Java application with the Amazon CodeGuru Profiler&lt;/li&gt;
&lt;li&gt;Deploying the Java application with AWS Batch and providing the necessary permissions&lt;/li&gt;
&lt;li&gt;Monitoring the application using the Amazon CodeGuru dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prerequisites:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS CLI installed&lt;/li&gt;
&lt;li&gt;Java 8 and Maven installed (optional)&lt;/li&gt;
&lt;li&gt;Docker installed&lt;/li&gt;
&lt;li&gt;The Serverless Framework with plugins installed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Code decomposition:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Java project files (demoApplication.java, pom.xml)&lt;/li&gt;
&lt;li&gt;A Dockerfile for the image we will run&lt;/li&gt;
&lt;li&gt;A serverless config file describing the infrastructure we will deploy&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Build and test the Java application with the Amazon CodeGuru Profiler
&lt;/h2&gt;

&lt;p&gt;We will use Maven to build our Java project. If you have Java 8 and Maven installed, you can run the following commands to build the project and run it locally. You will also need your AWS credentials set up locally (&lt;a href="https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/credentials.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/credentials.html&lt;/a&gt;). Keep in mind that we will deploy the necessary infrastructure in the next step, so the code will only run successfully once the required AWS Lambda is deployed. Alternatively, you can deploy the AWS Lambda manually and update demoApplication.java with its name. You will also need to set up a profiling group named “demoApplication” in the CodeGuru dashboard (&lt;a href="https://console.aws.amazon.com/codeguru/profiler/search" rel="noopener noreferrer"&gt;https://console.aws.amazon.com/codeguru/profiler/search&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ryfeus/stepfunctions2processing.git
&lt;span class="nb"&gt;cd &lt;/span&gt;aws-batch-with-profiler/docker
mvn clean compile assembly:single
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can run your Java code locally with the CodeGuru Profiler, and it will report profiling data to AWS. This is useful both for testing your code locally and for profiling it without deploying to the cloud.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;java &lt;span class="nt"&gt;-jar&lt;/span&gt; target/demo-1.0.0-jar-with-dependencies.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this demo, we use the CodeGuru Profiler library to run the profiler within the application. An alternative is to download the profiler JAR directly (from &lt;a href="https://d1osg35nybn3tt.cloudfront.net/" rel="noopener noreferrer"&gt;https://d1osg35nybn3tt.cloudfront.net/&lt;/a&gt;) and run it via an additional parameter on the java command. In that case, you don’t need to update your code to include profiler logic, and you don’t need to add the dependencies and repositories to your Maven project.&lt;/p&gt;

&lt;p&gt;Now let’s use docker to build the image which we will run in AWS Batch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; test-java-app-with-profiler &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
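&lt;p&gt;The Dockerfile in the repository handles packaging the jar into the image; an illustrative sketch of what such a Dockerfile might contain (the actual file in the repository may differ):&lt;/p&gt;

```dockerfile
# Illustrative only; see the Dockerfile in aws-batch-with-profiler/docker.
FROM openjdk:8-jre-slim
COPY target/demo-1.0.0-jar-with-dependencies.jar /app/app.jar
CMD ["java", "-jar", "/app/app.jar"]
```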



&lt;h2&gt;
  
  
  Deploying Java application to the cloud
&lt;/h2&gt;

&lt;p&gt;Once the image is built, let’s create a repository in Amazon ECR and push the docker image there. We need to log in to ECR, create the repository, tag the built image, and push it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ecr get-login-password &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 | docker login &lt;span class="nt"&gt;--username&lt;/span&gt; AWS &lt;span class="nt"&gt;--password-stdin&lt;/span&gt; &amp;lt;accountId&amp;gt;.dkr.ecr.us-east-1.amazonaws.com
aws ecr create-repository &lt;span class="nt"&gt;--repository-name&lt;/span&gt; test-java-app-with-profiler
docker tag test-java-app-with-profiler:latest &amp;lt;accountId&amp;gt;.dkr.ecr.us-east-1.amazonaws.com/test-java-app-with-profiler:latest
docker push &amp;lt;accountId&amp;gt;.dkr.ecr.us-east-1.amazonaws.com/test-java-app-with-profiler:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the container image is pushed, we can install the dependencies for the Serverless Framework and deploy the services with the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;aws-batch-with-profiler/aws-batch
npm &lt;span class="nb"&gt;install
&lt;/span&gt;serverless deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what the output will look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Serverless: Packaging service...
Serverless: Excluding development dependencies...
Serverless: Creating Stack...
Serverless: Checking Stack create progress...
.....
Serverless: Stack create finished...
Serverless: Uploading CloudFormation file to S3...
Serverless: Uploading artifacts...
Serverless: Uploading service StepFuncBatchWithProfiler.zip file to S3 (32.94 KB)...
Serverless: Validating template...
Serverless: Updating Stack...
Serverless: Checking Stack update progress...
..........................................................................................
Serverless: Stack update finished...
Service Information
service: StepFuncBatchWithProfiler
stage: dev
region: us-east-1
stack: StepFuncBatchWithProfiler-dev
resources: 30
api keys:
  None
endpoints:
functions:
  async: StepFuncBatchWithProfiler-dev-async
layers:
  None
Serverless StepFunctions OutPuts
endpoints:
  GET - https://&amp;lt;urlPrefix&amp;gt;.execute-api.us-east-1.amazonaws.com/dev/startFunction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can use the URL in the output to trigger the deployed AWS Step Functions state machine and AWS Batch job, for example with curl:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl https://&amp;lt;urlPrefix&amp;gt;.execute-api.us-east-1.amazonaws.com/dev/startFunction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
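&lt;p&gt;If you prefer the CLI to the console for checking progress, recent executions can be listed with the AWS CLI (a sketch; STATE_MACHINE_ARN is a placeholder for the ARN of the state machine created by the stack above):&lt;/p&gt;

```shell
# List recent executions of the deployed state machine.
# STATE_MACHINE_ARN is a placeholder; substitute the ARN from your stack.
aws stepfunctions list-executions \
    --state-machine-arn "$STATE_MACHINE_ARN" \
    --max-results 5
```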



&lt;p&gt;After that we can take a look at the execution graph in the AWS Step Functions dashboard (&lt;a href="https://console.aws.amazon.com/states/home" rel="noopener noreferrer"&gt;https://console.aws.amazon.com/states/home&lt;/a&gt;):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvnqnggrq2lpeqq88tk15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvnqnggrq2lpeqq88tk15.png" alt="Step Function Console"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can check the logs produced by the application in the CloudWatch log group “/aws/batch/job” (&lt;a href="https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Faws$252Fbatch$252Fjob" rel="noopener noreferrer"&gt;https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Faws$252Fbatch$252Fjob&lt;/a&gt;). It should look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ft6pb9qppsh84hpm7j9jh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ft6pb9qppsh84hpm7j9jh.png" alt="Cloudwatch Console"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring application using Amazon CodeGuru
&lt;/h2&gt;

&lt;p&gt;Once everything is set up and you’ve run multiple jobs with CodeGuru Profiler enabled, you can start monitoring the results. CodeGuru Profiler presents data as a flame graph: you can see the call stack of your code’s and your libraries’ methods, and for each method the CPU usage or amount of time spent in it. This shows which methods contribute the most latency and where optimizations are possible. One of the main advantages is that you can also see an estimated cost per method, which provides a way to estimate the savings from potential optimizations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fye14xsngjoaa5x4yl2h7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fye14xsngjoaa5x4yl2h7.png" alt="CodeGuru Profiler"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We deployed a demo Java application with Amazon CodeGuru Profiler on AWS Batch, orchestrated by AWS Step Functions. You can use the same project setup to deploy your own application, see how it performs in the cloud, and get recommendations. Amazon CodeGuru has a 90-day trial, so you can try it completely free.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>java</category>
      <category>docker</category>
      <category>serverless</category>
    </item>
  </channel>
</rss>
