<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Heba K. Ahmed</title>
    <description>The latest articles on DEV Community by Heba K. Ahmed (@hkamel12).</description>
    <link>https://dev.to/hkamel12</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F601697%2F44bf722c-b537-403a-b45d-e330d435814b.jpg</url>
      <title>DEV Community: Heba K. Ahmed</title>
      <link>https://dev.to/hkamel12</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hkamel12"/>
    <language>en</language>
    <item>
      <title>Machine Learning Best Practices for Public Sector Organizations</title>
      <dc:creator>Heba K. Ahmed</dc:creator>
      <pubDate>Tue, 02 Nov 2021 14:11:55 +0000</pubDate>
      <link>https://dev.to/hkamel12/machine-learning-best-practices-for-public-sector-organizations-4ema</link>
      <guid>https://dev.to/hkamel12/machine-learning-best-practices-for-public-sector-organizations-4ema</guid>
      <description>&lt;h1&gt;
  
  
  Challenges for public sector
&lt;/h1&gt;

&lt;p&gt;Government, education, and nonprofit organizations face several challenges in implementing ML programs to accomplish their mission objectives. This section outlines some of the challenges in seven critical areas of an ML implementation. These are outlined as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Data Ingestion and Preparation&lt;/strong&gt;: Identifying, collecting, and transforming data is the foundation for ML. Once the data is extracted, it needs to be organized so that it is available for consumption with the necessary approvals in compliance with public sector guidelines.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Model Training and Tuning&lt;/strong&gt;: One of the major challenges facing public sector organizations is creating a common platform that provides ML algorithms and optimizes model training performance with minimal resources, without compromising the quality of ML models.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;ML Operations (MLOps)&lt;/strong&gt;: Integrating ML into business operations requires significant planning and preparation. Models must also be monitored effectively in production so that they do not lose their effectiveness.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Management &amp;amp; Governance&lt;/strong&gt;: Public sector organizations face increased scrutiny to ensure that public funds are being properly utilized to serve mission needs. In addition, any underlying infrastructure, software, and licenses need to be maintained and managed.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Security &amp;amp; Compliance&lt;/strong&gt;: The sensitive nature of the work done by public sector organizations results in increased security requirements at all levels of an ML platform.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Cost Optimization&lt;/strong&gt;: The challenge facing public sector agencies is the need to account for the resources used, and to monitor the usage against specified cost centers and task orders.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Bias &amp;amp; Explainability&lt;/strong&gt;: Public sector organizations need to invest significant time, with appropriate tools and techniques, to demonstrate explainability and lack of bias in their ML models.&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Best Practices
&lt;/h1&gt;

&lt;p&gt;AWS Cloud provides several fully managed services that supply developers and data scientists with the ability to prepare, build, train, and deploy ML models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Ingestion and Preparation
&lt;/h2&gt;

&lt;p&gt;The AWS Cloud provides services that enable public sector customers to overcome challenges in data ingestion, data preparation, and data quality. These are further described as follows:&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Ingestion
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Streaming Data&lt;/strong&gt;: For streaming data, Amazon Kinesis and Amazon Managed Streaming for Apache Kafka (Amazon MSK) enable the collection, processing, and analysis of data in real time.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://aws.amazon.com/kinesis/data-streams/"&gt;Amazon Kinesis Data Streams (KDS)&lt;/a&gt; is a service that enables ingestion of streaming data.&lt;/li&gt;
&lt;li&gt;  Ingestion of streaming videos can be done using &lt;a href="https://aws.amazon.com/kinesis/video-streams/"&gt;Amazon Kinesis Video Streams&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://aws.amazon.com/kinesis/data-firehose/"&gt;Amazon Kinesis Data Firehose&lt;/a&gt; is a service that can be used to deliver real-time streaming data to a chosen destination.&lt;/li&gt;
&lt;li&gt;  You can use &lt;a href="https://aws.amazon.com/streaming-data/what-is-kafka/"&gt;Amazon MSK&lt;/a&gt; to build and run applications that use Apache Kafka to process streaming data without needing Apache Kafka infrastructure management expertise.&lt;/li&gt;
&lt;/ul&gt;
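&lt;p&gt;As an illustrative sketch (the stream name, partition key, and payload below are hypothetical), a record destined for Kinesis Data Streams can be assembled like this. The raw Kinesis HTTP API expects base64-encoded binary data; the AWS SDKs handle this encoding for you, so only the payload shape matters here:&lt;/p&gt;

```python
import base64
import json

def build_put_record(stream_name, payload, partition_key):
    """Assemble the request payload for a Kinesis Data Streams PutRecord call.
    The record body is JSON-serialized and base64-encoded, as the raw
    Kinesis API expects binary data."""
    return {
        "StreamName": stream_name,
        "Data": base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii"),
        "PartitionKey": partition_key,
    }

# Example: one sensor reading destined for a hypothetical "sensor-events" stream.
record = build_put_record(
    "sensor-events",
    {"sensor_id": "s-17", "reading": 21.4},
    partition_key="s-17",
)
```

&lt;p&gt;Records sharing a partition key land on the same shard, which preserves per-key ordering.&lt;/p&gt;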

&lt;p&gt;&lt;strong&gt;Batch Data&lt;/strong&gt;: There are a number of mechanisms available for data ingestion in batch format. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  With &lt;a href="https://aws.amazon.com/dms/"&gt;AWS Database Migration Services (AWS DMS)&lt;/a&gt;, you can replicate and ingest existing databases while the source databases remain fully operational.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://aws.amazon.com/datasync/?whats-new-cards.sort-by=item.additionalFields.postDateTime&amp;amp;whats-new-cards.sort-order=desc"&gt;AWS DataSync&lt;/a&gt; is a data transfer service that simplifies, automates, and accelerates moving and replicating data between on-premises storage and AWS storage services such as &lt;a href="https://aws.amazon.com/efs/"&gt;Amazon Elastic File System (EFS)&lt;/a&gt; and &lt;a href="https://aws.amazon.com/pm/serv-s3/?trk=ps_a134p000004f2aOAAQ&amp;amp;trkCampaign=acq_paid_search_brand&amp;amp;sc_channel=PS&amp;amp;sc_campaign=acquisition_US&amp;amp;sc_publisher=Google&amp;amp;sc_category=Storage&amp;amp;sc_country=US&amp;amp;sc_geo=NAMER&amp;amp;sc_outcome=acq&amp;amp;sc_detail=amazon%20s3&amp;amp;sc_content=S3_e&amp;amp;sc_matchtype=e&amp;amp;sc_segment=488982706722&amp;amp;sc_medium=ACQ-P%7CPS-GO%7CBrand%7CDesktop%7CSU%7CStorage%7CS3%7CUS%7CEN%7CText&amp;amp;s_kwcid=AL!4422!3!488982706722!e!!g!!amazon%20s3&amp;amp;ef_id=CjwKCAjwmK6IBhBqEiwAocMc8iNYVwzRmfBbxDlaf2E_jmA6daQzxu3mQgey9dr8QGOq-vpHaCnohRoCsVUQAvD_BwE:G:s&amp;amp;s_kwcid=AL!4422!3!488982706722!e!!g!!amazon%20s3"&gt;Amazon S3&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Preparation
&lt;/h3&gt;

&lt;p&gt;AWS Cloud provides several services that help prepare and organize the data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://aws.amazon.com/glue/"&gt;AWS Glue&lt;/a&gt; is a fully managed ETL service that makes it simple and cost-effective to categorize, clean, enrich, and migrate data from a source system to a data store for ML.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://docs.aws.amazon.com/glue/latest/ug/what-is-glue-studio.html"&gt;AWS Glue Studio&lt;/a&gt; provides a graphical interface that enables visual composition of data transformation workflows on AWS Glue’s engine.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://aws.amazon.com/glue/features/databrew/"&gt;AWS Glue DataBrew&lt;/a&gt;, a visual data preparation tool, can be used to simplify the process of cleaning and normalizing the data.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://aws.amazon.com/sagemaker/"&gt;Amazon SageMaker&lt;/a&gt; is a comprehensive service that provides purpose-built tools for every step of ML development and implementation.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://aws.amazon.com/sagemaker/data-wrangler/"&gt;Amazon SageMaker Data Wrangler&lt;/a&gt; is a service that enables the aggregation and preparation of data for ML and is directly integrated into Amazon SageMaker Studio.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://aws.amazon.com/emr/"&gt;Amazon EMR&lt;/a&gt;: Many organizations use Spark for data processing and other purposes such as for a data warehouse.  Amazon EMR, a managed service for Hadoop-ecosystem clusters, can be used to process data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data quality
&lt;/h3&gt;

&lt;p&gt;Public sector organizations need to ensure that data ingested and prepared for ML is of the highest quality by establishing a well-defined data quality framework. See &lt;a href="https://aws.amazon.com/blogs/industries/how-to-architect-data-quality-on-the-aws-cloud/"&gt;How to Architect Data Quality on the AWS Cloud&lt;/a&gt;.&lt;/p&gt;
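&lt;p&gt;A data quality framework typically starts from simple, measurable checks. The sketch below (the column name and allowed values are made up) computes completeness and validity metrics for one column of tabular data:&lt;/p&gt;

```python
def quality_profile(rows, column, allowed=None):
    """Compute simple data-quality metrics for one column of tabular data:
    completeness (share of non-null values), distinct count, and validity
    (share of values found in an allowed set, when one is given)."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    profile = {
        "completeness": len(non_null) / len(values),
        "distinct": len(set(non_null)),
    }
    if allowed is not None:
        valid = [v for v in non_null if v in allowed]
        profile["validity"] = len(valid) / len(non_null)
    return profile

# One null and one out-of-vocabulary value in four records.
rows = [{"state": "CA"}, {"state": "NY"}, {"state": None}, {"state": "XX"}]
profile = quality_profile(rows, "state", allowed={"CA", "NY"})
```

&lt;p&gt;Thresholds on metrics like these can gate a pipeline before data reaches model training.&lt;/p&gt;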

&lt;h2&gt;
  
  
  Model Training and Tuning
&lt;/h2&gt;

&lt;p&gt;One of the major challenges facing the public sector is the ability for team members to apply a consistent pattern or framework when working with the multitude of options that exist in this space.&lt;br&gt;
The AWS Cloud enables public sector customers to overcome challenges in model selection, training, and tuning as described in the following.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Selection
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Amazon SageMaker&lt;/strong&gt; provides the flexibility to select from a wide number of options using a consistent underlying platform.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Programming Language&lt;/strong&gt;: Amazon SageMaker notebook kernels provide the ability to use both Python and R natively. To use coding languages such as Stan or Julia, create a Docker image and bring it into SageMaker for model training and inference. To use programming languages like C++ or Java, custom images on Amazon ECS/EKS can be used to perform model training.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Built-in algorithms&lt;/strong&gt;: Amazon SageMaker Built-in Algorithms provides several built-in algorithms covering different types of ML problems. These algorithms are already optimized for speed, scale, and accuracy.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Script Mode&lt;/strong&gt;: For experienced ML programmers who are comfortable with using their own algorithms, Amazon SageMaker provides the option to write your custom code (script) in a text file with a .py extension (see Figure 1).
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kToaikO5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jh81eh01tn23dqflelc0.jpg" alt="Image description" width="572" height="307"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use a custom Docker image&lt;/strong&gt;: ML programmers may use algorithms that aren't supported by SageMaker; in these cases, a custom Docker image can be used to train the algorithm and produce the model (see Figure 2 below).
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CJwPqqIA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o0a09uqwdiywks2d0c29.jpg" alt="Image description" width="531" height="283"&gt;
&lt;/li&gt;
&lt;/ul&gt;
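&lt;p&gt;For script mode, the entry point is a plain Python file that reads hyperparameters from the command line and well-known paths from SageMaker's SM_* environment variables. A minimal, framework-agnostic skeleton (the hyperparameters shown are placeholders) might look like:&lt;/p&gt;

```python
import argparse
import os

def parse_args(argv=None):
    """SageMaker script mode passes hyperparameters as CLI arguments and
    exposes well-known paths through SM_* environment variables."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--learning-rate", type=float, default=0.1)
    parser.add_argument("--model-dir",
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train",
                        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    # ... load data from args.train and fit a model here ...
    # Artifacts written to args.model_dir are packaged by SageMaker
    # into model.tar.gz when the job completes.
    return args

if __name__ == "__main__":
    main()
```

&lt;p&gt;This file is handed to a SageMaker framework estimator as the entry point; the training container invokes it with the hyperparameters supplied at job launch.&lt;/p&gt;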

&lt;h3&gt;
  
  
  Model Training
&lt;/h3&gt;

&lt;p&gt;Amazon SageMaker provides a number of built-in options for optimizing model training performance, input data formats, and distributed training.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Data parallel&lt;/strong&gt;: SageMaker’s &lt;a href="https://sagemaker.readthedocs.io/en/stable/api/training/smd_data_parallel.html"&gt;distributed data parallel library&lt;/a&gt; can be used to run training in parallel across multiple devices when each epoch involves many training iterations.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Pipe mode&lt;/strong&gt;: Pipe mode accelerates the ML training process: instead of downloading data to the local Amazon EBS volume prior to starting the model training, Pipe mode streams data directly from S3 to the training algorithm while it is running.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Incremental training&lt;/strong&gt;: Amazon SageMaker supports &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/incremental-training.html"&gt;incremental training&lt;/a&gt; to train a new model from an existing model artifact, to save both training time and resources.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Model Parallel training&lt;/strong&gt;: Amazon SageMaker’s distributed model parallel library can be used to split a model automatically and efficiently across multiple GPUs and instances and coordinate model training.&lt;/li&gt;
&lt;/ul&gt;
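&lt;p&gt;As a hedged sketch of how Pipe mode is requested (the bucket and prefix are hypothetical), the input channel of a CreateTrainingJob request can be built as follows:&lt;/p&gt;

```python
def training_input(bucket, prefix, input_mode="Pipe"):
    """Build the input-channel portion of a CreateTrainingJob request.
    With InputMode set to "Pipe", SageMaker streams the S3 objects to the
    algorithm instead of copying them to local EBS storage first."""
    return {
        "ChannelName": "train",
        "InputMode": input_mode,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://{}/{}".format(bucket, prefix),
                # Shard objects across workers rather than fully replicating.
                "S3DataDistributionType": "ShardedByS3Key",
            }
        },
    }

channel = training_input("example-bucket", "datasets/train")
```

&lt;p&gt;Switching the same channel back to File mode is a one-field change, which makes it easy to compare startup times for large datasets.&lt;/p&gt;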

&lt;h3&gt;
  
  
  Model Tuning
&lt;/h3&gt;

&lt;p&gt;Amazon SageMaker provides automatic hyperparameter tuning to find the best version of a model in an efficient manner. The following best practices ensure a better tuning result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Limit the number of hyperparameters&lt;/strong&gt;: limiting the search to a much smaller number is likely to give better results, as this can reduce the computational complexity of a hyperparameter tuning job and provides better understanding of how a specific hyperparameter would affect the model performance.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Choose hyperparameter ranges appropriately&lt;/strong&gt;: better results are obtained by limiting the search to a small range of values.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Pay attention to scales for hyperparameters&lt;/strong&gt;: During hyperparameter tuning, SageMaker attempts to figure out if hyperparameters are log-scaled or linear-scaled.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Set the best number of concurrent training jobs&lt;/strong&gt;: a tuning job improves only through successive rounds of experiments. Typically, running one training job at a time achieves the best results with the least amount of compute time.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Report the desired objective metric for tuning when the training job runs on multiple instances&lt;/strong&gt;: distributed training jobs should be designed such that the objective metric reported is the one that is needed.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Enable early stopping for the hyperparameter tuning job&lt;/strong&gt;: Early stopping helps reduce compute time and helps avoid overfitting the model.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Run a warm start using previous tuning jobs&lt;/strong&gt;: Use a warm start for fine-tuning previous hyperparameter tuning jobs.&lt;/li&gt;
&lt;/ul&gt;
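&lt;p&gt;The scaling advice above can be made concrete with a small sampler: a learning rate that varies over orders of magnitude is drawn log-uniformly, while a linearly scaled parameter such as dropout is drawn uniformly. This is an illustrative sketch, not SageMaker's internal search strategy:&lt;/p&gt;

```python
import math
import random

def sample_hyperparameters(n, seed=42):
    """Draw candidate hyperparameters: a log-uniform draw for the learning
    rate (spanning several orders of magnitude) and a plain uniform draw
    for a linearly scaled parameter."""
    rng = random.Random(seed)
    lo, hi = math.log10(1e-5), math.log10(1e-1)
    return [
        {
            # Sampling the exponent uniformly spreads candidates evenly
            # across orders of magnitude.
            "learning_rate": 10 ** rng.uniform(lo, hi),
            "dropout": rng.uniform(0.0, 0.5),
        }
        for _ in range(n)
    ]

candidates = sample_hyperparameters(5)
```

&lt;p&gt;With a linear draw, roughly 90% of learning-rate candidates would fall above 1e-2; the log draw gives each decade equal weight.&lt;/p&gt;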

&lt;h1&gt;
  
  
  MLOps
&lt;/h1&gt;

&lt;p&gt;MLOps is the discipline of integrating ML workloads into release management, Continuous Integration / Continuous Delivery (CI/CD), and operations.&lt;br&gt;
Using ML models in software development makes it difficult for teams to own different parts of the process: data engineers build data pipelines, ML engineers or data scientists train models, and developers have to integrate the models and release them.&lt;br&gt;
AWS Cloud provides a number of different options that solve these challenges, either by building an MLOps pipeline from scratch or by using managed services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon SageMaker Projects
&lt;/h3&gt;

&lt;p&gt;By using a SageMaker project, teams of data scientists and developers can work together on ML business problems. SageMaker projects use MLOps templates that automate the model building and deployment pipelines using CI/CD.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon SageMaker Pipelines
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/sagemaker/pipelines/"&gt;SageMaker Pipelines&lt;/a&gt; is a purpose-built, CI/CD service for ML. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  It brings CI/CD practices to ML, such as maintaining parity between development and production environments, version control, on-demand testing, and end-to-end automation, helping scale ML throughout the organization.&lt;/li&gt;
&lt;li&gt;  With the SageMaker Pipelines model registry, model versions can be stored in a central repository for easy browsing, discovery, and selection of the right model for deployment based on business requirements.&lt;/li&gt;
&lt;li&gt;  Pipelines provide the ability to log each step within the ML workflow for a complete audit trail of model components.&lt;/li&gt;
&lt;/ul&gt;
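&lt;p&gt;The audit-trail idea can be sketched in a few lines of plain Python: run named steps in order and record an entry for each. This toy runner only illustrates the concept; SageMaker Pipelines records this step-level lineage for you:&lt;/p&gt;

```python
import time

def run_pipeline(steps):
    """Run named pipeline steps in order, passing each step's output to the
    next and recording an audit-trail entry (name, status, timing) per step."""
    audit_trail = []
    artifact = None
    for name, step in steps:
        started = time.time()
        artifact = step(artifact)
        audit_trail.append({
            "step": name,
            "status": "Succeeded",
            "seconds": round(time.time() - started, 3),
        })
    return artifact, audit_trail

# Two trivial stand-in steps: produce data, then "train" on it.
steps = [
    ("preprocess", lambda _: [1, 2, 3]),
    ("train", lambda data: sum(data) / len(data)),
]
result, trail = run_pipeline(steps)
```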

&lt;h3&gt;
  
  
  AWS CodePipeline and AWS Lambda
&lt;/h3&gt;

&lt;p&gt;For teams already building on AWS, AWS CodePipeline makes it possible to reuse the same CI/CD workflows for ML. Figure 3 below represents a reference pipeline for deployment on AWS.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--z5THGf_6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b1bogwo29jd79wc60t0j.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--z5THGf_6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b1bogwo29jd79wc60t0j.jpg" alt="Image description" width="572" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS StepFunctions Data Science Software Development Kit (SDK)
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/concepts-python-sdk.html"&gt;AWS Step Functions Data Science SDK&lt;/a&gt; is an open-source Python library that allows data scientists to create workflows that process and publish ML models using SageMaker and Step Functions.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS MLOps Framework
&lt;/h2&gt;

&lt;p&gt;Figure 4 below illustrates an AWS solution that provides an extendable framework with a standard interface for managing ML pipelines. The solution provides a ready-made template to upload trained models (also referred to as a bring your own model), configure the orchestration of the pipeline, and monitor the pipeline's operations.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4wGP3AZ5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j4nk7pr8zi5teulcme4x.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4wGP3AZ5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j4nk7pr8zi5teulcme4x.jpg" alt="Image description" width="629" height="302"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploy Custom Deep Learning Models
&lt;/h3&gt;

&lt;p&gt;AWS also provides the option to deploy custom code on virtual machines using the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Amazon EC2 instances, and containers using self-managed Kubernetes on Amazon EC2.&lt;/li&gt;
&lt;li&gt;  Amazon Elastic Container Service (Amazon ECS) and Amazon Elastic Kubernetes Service (Amazon EKS).&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://aws.amazon.com/machine-learning/amis/"&gt;AWS Deep Learning AMIs&lt;/a&gt; can be used to accelerate deep learning by quickly launching Amazon EC2 instances.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://aws.amazon.com/machine-learning/containers/"&gt;AWS Deep Learning Containers&lt;/a&gt; are Docker images pre-installed with deep learning frameworks to deploy optimized ML environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Deploy ML at the edge
&lt;/h3&gt;

&lt;p&gt;In some cases, such as with edge devices, inferencing needs to occur even when there is limited or no connectivity to the cloud. Mining fields are an example of this type of use case.&lt;br&gt;
&lt;a href="https://aws.amazon.com/greengrass/"&gt;AWS IoT Greengrass&lt;/a&gt; enables ML inference locally using models that are created, trained, and optimized in the cloud using Amazon SageMaker, AWS Deep Learning AMI, or AWS Deep Learning Containers, and deployed on the edge devices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Management and Governance
&lt;/h2&gt;

&lt;p&gt;ML workloads need to provide increased visibility for monitoring and auditing. This section highlights several AWS services and associated best practices to address these management and governance challenges.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enable governance and control
&lt;/h3&gt;

&lt;p&gt;AWS Cloud provides several services that enable governance and control. These include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://aws.amazon.com/controltower/?control-blogs.sort-by=item.additionalFields.createdDate&amp;amp;control-blogs.sort-order=desc"&gt;&lt;strong&gt;AWS Control Tower&lt;/strong&gt;&lt;/a&gt;: creates a landing zone that consists of a predefined structure of accounts using AWS Organizations, the ability to create accounts using &lt;a href="https://docs.aws.amazon.com/ARG/latest/userguide/"&gt;AWS Service Catalog&lt;/a&gt;, enforcement of compliance rules called guardrails using Service Control Policies, and detection of policy violations using AWS Config.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;AWS License Manager&lt;/strong&gt;: AWS License Manager can be used to track software obtained from AWS Marketplace, keep a consolidated view of all licenses, and share licenses with other accounts in the organization.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Resource Tagging&lt;/strong&gt;: Automated tools such as AWS Resource Groups and the Resource Groups Tagging API enable programmatic control of tags, making it easier to automatically manage, search, and filter tags and resources.&lt;/li&gt;
&lt;/ul&gt;
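&lt;p&gt;Tag-based filtering, as performed server-side by the Resource Groups Tagging API, reduces to matching key/value pairs. A minimal local sketch with hypothetical ARNs and tag values:&lt;/p&gt;

```python
def filter_by_tags(resources, required_tags):
    """Return the ARNs of resources whose tags include every required
    key/value pair -- the same filtering the Resource Groups Tagging API
    performs across an account."""
    matches = []
    for resource in resources:
        tags = resource.get("Tags", {})
        if all(tags.get(k) == v for k, v in required_tags.items()):
            matches.append(resource["Arn"])
    return matches

# Hypothetical SageMaker resources tagged by cost center and environment.
resources = [
    {"Arn": "arn:aws:sagemaker:us-east-1:111122223333:notebook-instance/dev-nb",
     "Tags": {"CostCenter": "ml-research", "Env": "dev"}},
    {"Arn": "arn:aws:sagemaker:us-east-1:111122223333:endpoint/prod-ep",
     "Tags": {"CostCenter": "ml-research", "Env": "prod"}},
]
matched = filter_by_tags(resources, {"Env": "prod"})
```

&lt;p&gt;A consistent tag schema (cost center, environment, task order) is what makes this kind of query useful for chargeback reports.&lt;/p&gt;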

&lt;h3&gt;
  
  
  Provision ML resources that meet policies
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;AWS CloudFormation&lt;/strong&gt;: provides a mechanism to model a collection of related AWS and third-party resources, provision them quickly and consistently, and manage them throughout their lifecycles, by treating infrastructure as code.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;AWS Cloud Development Kit (CDK)&lt;/strong&gt;: The AWS CDK allows teams to define cloud infrastructure in code directly in supported programming languages (i.e., TypeScript, JavaScript, Python, Java, and C#).&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://docs.aws.amazon.com/servicecatalog/latest/adminguide/getstarted.html"&gt;&lt;strong&gt;AWS Service Catalog&lt;/strong&gt;&lt;/a&gt;: provides a solution for public sector by enabling the central management of commonly deployed IT services and achieves consistent governance and meets compliance requirements.
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1TQnm6uu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f8zmdnt2md6et3053q4e.jpg" alt="Image description" width="545" height="257"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Operate environment with governance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Amazon CloudWatch&lt;/strong&gt; is a monitoring and observability service used to monitor resources and applications run on AWS in real time.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Amazon EventBridge&lt;/strong&gt; is a serverless event bus service that can monitor status change events in Amazon SageMaker.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;SageMaker Model Monitor&lt;/strong&gt; can be used to continuously monitor the quality of ML models in production. Model Monitor can notify team members when there are deviations in the model quality. See the &lt;a href="https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker_model_monitor/introduction/SageMaker-ModelMonitoring.ipynb"&gt;Introduction to Amazon SageMaker Model Monitor&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;AWS CloudTrail&lt;/strong&gt;: provides a record of actions taken by a user, role, or an AWS service in SageMaker. CloudTrail captures all API calls for SageMaker.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Security and compliance
&lt;/h1&gt;

&lt;p&gt;Public sector organizations have a number of security challenges and concerns.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jDL7Szg6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qtzqm00391l8qxi5gf30.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jDL7Szg6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qtzqm00391l8qxi5gf30.jpg" alt="Image description" width="624" height="387"&gt;&lt;/a&gt;&lt;br&gt;
This section provides best practices and guidelines to address some of these security and compliance challenges.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compute and network isolation
&lt;/h3&gt;

&lt;p&gt;In public sector ML projects, environments, data, and workloads must be kept secure and isolated from internet access. This can be achieved using the following methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Provision ML components in an isolated VPC with no internet access.&lt;/li&gt;
&lt;li&gt;  Use VPC endpoints and endpoint policies to further limit access.&lt;/li&gt;
&lt;li&gt;  Allow access from only within the VPC: An IAM policy can be created to prevent users outside the VPC from accessing SageMaker Studio or SageMaker notebooks over the internet.&lt;/li&gt;
&lt;li&gt;  Intrusion detection and prevention: AWS Gateway Load Balancer (GWLB) can be used to deploy, scale, and manage the availability of third-party virtual appliances.&lt;/li&gt;
&lt;li&gt;  Apply additional security controls when access to resources outside your VPC is required.&lt;/li&gt;
&lt;/ul&gt;
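&lt;p&gt;The VPC-only access pattern is typically enforced with a Deny statement conditioned on the source VPC endpoint, layered on top of the Allow statements that grant normal access. A sketch of such an identity-based policy (the endpoint ID is a placeholder):&lt;/p&gt;

```python
import json

def vpc_only_sagemaker_policy(vpc_endpoint_id):
    """Deny SageMaker API calls that do not arrive through the given VPC
    endpoint, so access from the public internet is rejected even when
    an Allow statement would otherwise grant it."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideVpcEndpoint",
                "Effect": "Deny",
                "Action": "sagemaker:*",
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {"aws:SourceVpce": vpc_endpoint_id}
                },
            }
        ],
    }

policy = vpc_only_sagemaker_policy("vpce-0123456789abcdef0")
document = json.dumps(policy, indent=2)
```

&lt;p&gt;Because an explicit Deny always wins in IAM evaluation, this statement is a safe guardrail to attach alongside broader role permissions.&lt;/p&gt;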

&lt;h3&gt;
  
  
  Data Protection
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Protect data at rest&lt;/strong&gt;: AWS Key Management Service (AWS KMS) can be used to encrypt ML data, Studio notebooks, and SageMaker notebook instances.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Protect data in transit&lt;/strong&gt;: AWS makes extensive use of HTTPS communication for its APIs.&lt;/li&gt;
&lt;li&gt;  Secure shared notebook instances.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Authentication and Authorization
&lt;/h2&gt;

&lt;p&gt;AWS IAM enables control of access to AWS resources. IAM administrators control who can be authenticated (signed in) and authorized (have permissions) to use SageMaker resources. Two common ways to implement least privilege access to the SageMaker environments are identity-based policies and resource-based policies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies.html#policies_id-based"&gt;Identity-based policies&lt;/a&gt; are attached to an IAM user, group, or role. These policies specify what that identity can do.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies.html#policies_resource-based"&gt;Resource-based policies&lt;/a&gt; are attached to a resource. These policies specify who has access to the resource, and what actions can be performed on it.
Please refer to &lt;a href="https://aws.amazon.com/blogs/machine-learning/configuring-amazon-sagemaker-studio-for-teams-and-groups-with-complete-resource-isolation/"&gt;Configuring Amazon SageMaker Studio&lt;/a&gt; for teams and groups with complete resource isolation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Artifact and model management
&lt;/h3&gt;

&lt;p&gt;The recommended best practice is to use version control to track code or other model artifacts. If model artifacts are modified or deleted, either accidentally or deliberately, version control allows you to roll back to a previous stable release.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security compliance
&lt;/h3&gt;

&lt;p&gt;For a list of AWS services in scope of specific compliance programs, see &lt;a href="https://aws.amazon.com/compliance/services-in-scope/"&gt;AWS Services in Scope by Compliance Program&lt;/a&gt;. AWS provides the following resources to help with compliance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Security and Compliance Quick Start Guides.&lt;/li&gt;
&lt;li&gt;  Architecting for HIPAA Security and Compliance Whitepaper.&lt;/li&gt;
&lt;li&gt;  AWS Compliance Resources.&lt;/li&gt;
&lt;li&gt;  AWS Config.&lt;/li&gt;
&lt;li&gt;  AWS Security Hub.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Cost optimization
&lt;/h1&gt;

&lt;p&gt;Cost management is a primary concern for public sector organizations, to ensure the best use of public funds while enabling agency missions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prepare
&lt;/h3&gt;

&lt;p&gt;Cost control in this phase can be accomplished using the following techniques:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Data Storage&lt;/strong&gt;: it is essential to establish a cost control strategy at the storage level.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Data Labeling&lt;/strong&gt;: Data labeling is a key process of identifying raw data (such as images, text files, and videos) and adding one or more meaningful and informative labels to provide context so that an ML model can learn from it.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Data Wrangling&lt;/strong&gt;: Amazon SageMaker Data Wrangler can be used to reduce this time spent, lowering the costs of the project. With Data Wrangler, data can be imported from various data sources, and transformed without requiring coding.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Build
&lt;/h3&gt;

&lt;p&gt;Cost control in this phase can be accomplished using the following techniques:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Notebook Utilization&lt;/strong&gt;: Notebooks are used to prepare and process data, write code to train models, deploy models to SageMaker hosting, and test or validate models; stopping idle notebook instances helps control their cost.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Test code locally&lt;/strong&gt;: Before a training job is submitted, running the fit function in local mode enables early feedback prior to running in SageMaker’s managed training or hosting environments.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Use Pipe mode&lt;/strong&gt; (where applicable) to reduce training time.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Find the right balance&lt;/strong&gt;: Performance vs. accuracy.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Jumpstart&lt;/strong&gt;: SageMaker JumpStart accelerates time-to-deploy for over 150 open-source models and provides pre-built solutions, preconfigured with all necessary AWS services required to launch the solution into production, including CloudFormation templates and reference architecture.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;AWS Marketplace&lt;/strong&gt;: AWS Marketplace is a digital catalog with listings from independent software vendors to find, test, buy, and deploy software that runs on AWS.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Train and Tune
&lt;/h3&gt;

&lt;p&gt;Cost control in this phase can be accomplished using the following techniques:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Use Spot Instances&lt;/strong&gt;: Training jobs can be configured to use Spot Instances and a stopping condition can be used to specify how long Amazon SageMaker waits for a job to run using EC2 Spot Instances.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Hyperparameter optimization (HPO)&lt;/strong&gt;: HPO automatically adjusts hundreds of different combinations of parameters to quickly arrive at the best solution for your ML problem.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;CPU vs GPU&lt;/strong&gt;: CPUs are best at handling single, more complex calculations sequentially, whereas GPUs are better at handling multiple but simple calculations in parallel.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Distributed Training&lt;/strong&gt;: the training process can be sped up by distributing training on multiple machines or processes in a cluster, as described earlier.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Monitor the performance&lt;/strong&gt; of your training jobs to identify waste.&lt;/li&gt;
&lt;/ul&gt;
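&lt;p&gt;A sketch of the relevant managed-spot settings of a training job, together with the savings arithmetic (prices are hypothetical):&lt;/p&gt;

```python
def spot_training_config(max_run_seconds, max_wait_seconds):
    """Managed-spot settings for a SageMaker training job. MaxWaitTimeInSeconds
    bounds how long to wait for spot capacity (including interruptions)
    before the job is stopped; it must not be shorter than the runtime cap."""
    return {
        "EnableManagedSpotTraining": True,
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_run_seconds,
            "MaxWaitTimeInSeconds": max_wait_seconds,
        },
    }

def spot_savings(on_demand_cost, spot_cost):
    """Fraction saved relative to the on-demand price."""
    return 1.0 - spot_cost / on_demand_cost

config = spot_training_config(3600, 7200)
savings = spot_savings(on_demand_cost=10.0, spot_cost=3.0)  # about 70% saved
```

&lt;p&gt;Pairing spot training with checkpointing to S3 lets interrupted jobs resume instead of restarting from scratch.&lt;/p&gt;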

&lt;h3&gt;
  
  
  Deploy and Manage
&lt;/h3&gt;

&lt;p&gt;This step of the ML lifecycle involves deployment of the model to get predictions, and managing the model to ensure it meets functional and non-functional requirements of the application.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Endpoint deployment&lt;/strong&gt;: Amazon SageMaker enables testing of new models using A/B testing. Endpoints need to be deleted when testing is completed to reduce costs.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Multi-model endpoints&lt;/strong&gt;: reduce hosting costs by improving endpoint utilization and provide a scalable and cost-effective solution to deploying many models.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Auto Scaling&lt;/strong&gt;: optimizes the cost of model endpoints.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Amazon Elastic Inference&lt;/strong&gt;: attaches low-cost GPU-powered acceleration to endpoints to reduce the cost of deep learning inference.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Analyzing costs with Cost Explorer&lt;/strong&gt;: Cost Explorer is a tool that enables viewing and analyzing AWS service-related costs and usage including SageMaker.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;AWS Budgets&lt;/strong&gt;: help you manage Amazon SageMaker costs, including development, training, and hosting, by setting alerts and notifications when cost or usage exceeds (or is forecasted to exceed) the budgeted amount.&lt;/li&gt;
&lt;/ul&gt;
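&lt;p&gt;To make the Auto Scaling bullet concrete, here is a sketch of the two Application Auto Scaling requests that put a SageMaker endpoint variant under target-tracking scaling. Endpoint and variant names are placeholders; the dicts are what you would pass to boto3&amp;#39;s &lt;code&gt;application-autoscaling&lt;/code&gt; client (&lt;code&gt;register_scalable_target&lt;/code&gt; and &lt;code&gt;put_scaling_policy&lt;/code&gt;).&lt;/p&gt;

```python
# Sketch: target-tracking autoscaling for a SageMaker endpoint variant.
# The endpoint/variant names and capacities are illustrative placeholders.

def endpoint_autoscaling_config(endpoint_name, variant_name,
                                min_capacity=1, max_capacity=4,
                                target_invocations=100.0):
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    # Parameters for application-autoscaling register_scalable_target.
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }
    # Parameters for put_scaling_policy: scale to keep invocations per
    # instance near the target value.
    policy = {
        "PolicyName": f"{endpoint_name}-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_invocations,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy

target, policy = endpoint_autoscaling_config("demo-endpoint", "AllTraffic")
```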

&lt;h1&gt;
  
  
  Bias and Explainability
&lt;/h1&gt;

&lt;p&gt;Demonstrating explainability is a significant challenge because complex ML models are hard to understand and even harder to interpret and debug.&lt;br&gt;
AWS Cloud provides the following capabilities and services to assist public sector organizations in resolving these challenges.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon SageMaker Debugger
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html"&gt;Amazon SageMaker Debugger&lt;/a&gt; provides visibility into the model training process for real-time and offline analysis, and integrates with existing training code for TensorFlow, Keras, Apache MXNet, PyTorch, and XGBoost.&lt;br&gt;
It provides three built-in tensor collections, called feature_importance, average_shap, and full_shap, to visualize and analyze captured tensors specifically for model explanation.&lt;/p&gt;
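&lt;p&gt;As a sketch, the three explainability collections above can be requested through the &lt;code&gt;DebugHookConfig&lt;/code&gt; section of a &lt;code&gt;CreateTrainingJob&lt;/code&gt; request. The S3 path and the &lt;code&gt;save_interval&lt;/code&gt; value are placeholders.&lt;/p&gt;

```python
# Sketch: the DebugHookConfig portion of a CreateTrainingJob request that
# captures the built-in explainability tensor collections. Bucket name and
# save interval are illustrative placeholders.

def debugger_hook_config(bucket):
    return {
        "S3OutputPath": f"s3://{bucket}/debugger",
        "CollectionConfigurations": [
            {"CollectionName": name,
             "CollectionParameters": {"save_interval": "10"}}
            for name in ("feature_importance", "average_shap", "full_shap")
        ],
    }

cfg = debugger_hook_config("my-bucket")
print(len(cfg["CollectionConfigurations"]))  # 3
```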

&lt;h3&gt;
  
  
  Amazon SageMaker Clarify
&lt;/h3&gt;

&lt;p&gt;Amazon SageMaker Clarify is a service that is integrated into SageMaker Studio and detects potential bias during data preparation, model training, and in deployed models, by examining specified attributes.&lt;br&gt;
Bias in attributes related to age can be examined in the initial dataset, in the trained as well as the deployed model, and quantified in a detailed report.&lt;br&gt;
SageMaker Clarify also enables explainability by including feature importance graphs using SHAP to help explain model predictions.&lt;/p&gt;
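&lt;p&gt;For illustration, here is the shape of an analysis configuration of the kind a Clarify processing job consumes, checking the age facet described above. The column names, label values, threshold, and metric choices are all hypothetical placeholders for your own dataset.&lt;/p&gt;

```python
import json

# Sketch: a Clarify-style bias analysis configuration examining an "age"
# facet. Every column name and value here is an illustrative placeholder.

analysis_config = {
    "dataset_type": "text/csv",
    "headers": ["age", "income", "approved"],
    "label": "approved",
    "label_values_or_threshold": [1],
    # Examine bias with respect to applicants over 40 (placeholder cutoff).
    "facet": [{"name_or_index": "age", "value_or_threshold": [40]}],
    # Pre-training (CI, DPL) and post-training (DPPL) bias metrics.
    "methods": {"pre_training_bias": {"methods": ["CI", "DPL"]},
                "post_training_bias": {"methods": ["DPPL"]}},
}

print(json.dumps(analysis_config, indent=2).splitlines()[1].strip())
```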

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Public sector organizations have complex mission objectives and are increasingly adopting ML services to help with their initiatives. ML can transform the way government agencies operate, and enable them to provide improved citizen services. However, several barriers remain for these organizations to implement ML. This whitepaper outlined some of the challenges and provided best practices that can help address these challenges using AWS Cloud.&lt;/p&gt;

&lt;h1&gt;
  
  
  Next Steps
&lt;/h1&gt;

&lt;p&gt;Adopting the AWS Cloud can provide you with sustainable advantages for your ML initiatives. Your AWS account team can work together with your team and/or your chosen member of the AWS Partner Network (APN) to implement your enterprise cloud computing initiatives. You can reach out to an AWS partner through the AWS Partner Network. Get started on AI and ML by visiting &lt;a href="https://aws.amazon.com/machine-learning/containers/"&gt;AWS ML&lt;/a&gt;, the AWS ML Embark Program, or the &lt;a href="https://aws.amazon.com/ml-solutions-lab/"&gt;ML Solutions Lab&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>machinelearning</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Hybrid Machine Learning</title>
      <dc:creator>Heba K. Ahmed</dc:creator>
      <pubDate>Sun, 15 Aug 2021 11:58:06 +0000</pubDate>
      <link>https://dev.to/aws-builders/hybrid-machine-learning-24b</link>
      <guid>https://dev.to/aws-builders/hybrid-machine-learning-24b</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Companies that adopt a cloud-native approach realize its value when they marry elastic compute capacity with their data. But for companies born before the cloud, in many cases even before general-purpose computers, whose infrastructure was built on-premises, how can they realize the value of newly launched cloud services?&lt;br&gt;
AWS proposes a series of tenets to guide our discussion of a world-class hybrid machine learning experience: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Seamless management experience. &lt;/li&gt;
&lt;li&gt; Tunable latency. &lt;/li&gt;
&lt;li&gt; Fast time-to-value. &lt;/li&gt;
&lt;li&gt; Flexible. &lt;/li&gt;
&lt;li&gt; Low cost. &lt;/li&gt;
&lt;li&gt; End in the cloud.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This document follows the machine learning lifecycle, from development to training and deployment. &lt;/p&gt;

&lt;h3&gt;
  
  
  What is Hybrid?
&lt;/h3&gt;

&lt;p&gt;We look at hybrid architectures as having a minimum of two compute environments, what we will call “primary” and “secondary” environments. Generally speaking, we see the primary environment as where the workload begins, and the secondary environment is where the workload ends.&lt;/p&gt;

&lt;h3&gt;
  
  
  What “Hybrid” is not
&lt;/h3&gt;

&lt;p&gt;While either your primary or secondary environment may be, in fact, another cloud provider, we prefer not to explicitly discuss those patterns here. That’s because the technical details and capabilities around specific services, features, and global deployment options are not necessarily the same across all cloud providers, and we consider that type of analysis outside the scope for this document. &lt;/p&gt;

&lt;h2&gt;
  
  
  Hybrid patterns for development
&lt;/h2&gt;

&lt;p&gt;Development refers to the phase in machine learning when customers are iteratively building models. This may or may not include exploratory data analysis and deep learning model development. &lt;/p&gt;

&lt;p&gt;Generally speaking, there are two major options for hybrid development:&lt;/p&gt;

&lt;p&gt;1-  Laptop and desktop personal computers.&lt;br&gt;
2-  Self-managed local servers utilizing specialized GPUs, colocations, self-managed racks, or corporate data centers.&lt;br&gt;
Customers can develop in one or both of these compute environments, and below we describe hybrid patterns for model development and deployment using both.&lt;/p&gt;

&lt;h3&gt;
  
  
  Develop on personal computers, to train and host in the cloud
&lt;/h3&gt;

&lt;p&gt;Customers can use local development environments, such as PyCharm or Jupyter installations on their laptops or personal computers, and then connect to the cloud via AWS Identity and Access Management (IAM) permissions and interface with AWS service APIs through the AWS CLI or an AWS SDK (e.g., boto3).&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw957knb0hp3it64bgame.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw957knb0hp3it64bgame.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
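&lt;p&gt;The local-to-cloud loop above can be sketched as follows: helper code is written and tested on the laptop, while the actual AWS calls (which need credentials) are kept in &lt;code&gt;main()&lt;/code&gt;. The bucket and file names are placeholders.&lt;/p&gt;

```python
# Sketch of local development against the cloud. The pure helper is
# testable offline; the boto3 calls in main() are placeholders that
# require configured AWS credentials.

def s3_uri(bucket, *keys):
    """Join a bucket and key parts into an s3:// URI."""
    return "s3://" + "/".join([bucket, *keys])

def main():
    import boto3  # requires local credentials (aws configure / IAM role)
    s3 = boto3.client("s3")
    s3.upload_file("train.csv", "my-bucket", "data/train.csv")
    sm = boto3.client("sagemaker")
    print(sm.list_training_jobs(MaxResults=5)["TrainingJobSummaries"])

print(s3_uri("my-bucket", "data", "train.csv"))  # s3://my-bucket/data/train.csv
```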

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;&lt;br&gt;
1-  You have full control of your IDE in this scenario. You just have to open up your computer to get started.&lt;br&gt;
2-  You can easily manage what’s sitting in your S3 bucket, vs what’s running on your local laptop.&lt;br&gt;
3-  You iteratively write a few lines of code in your complex models, you check them locally, and you only land in the cloud to scale/track/deploy.&lt;br&gt;
&lt;strong&gt;Disadvantages&lt;/strong&gt;&lt;br&gt;
1-  Inability to scale beyond the compute resources of your laptop.&lt;br&gt;
2-  Lack of access to GUI-centric features such as Autopilot, Data Wrangler, and Pipelines.&lt;br&gt;
3-  If your laptop dies and you didn’t back up externally, your work is gone.&lt;br&gt;
4-  Difficulty in onboarding non-super-user employees increases over time as software, OS, and hardware versions change; in some worst-case scenarios, it leads to highly valued employees not getting access to Python or pandas for multiple months.&lt;br&gt;
&lt;strong&gt;When to move&lt;/strong&gt;: &lt;br&gt;
If you find yourself spending a significant portion of your time managing local compute environments, it’s time to move to the cloud.&lt;/p&gt;

&lt;h3&gt;
  
  
  Develop on local servers, to train and host in the cloud
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbj3pwdqzixp8dzrh0sln.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbj3pwdqzixp8dzrh0sln.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;&lt;br&gt;
1-  Ability to capitalize on previous investments in local compute.&lt;br&gt;
2-  Simplifies infrastructure setup required to meet some regulatory requirements, such as those for specialized industries (health, gaming, financial).&lt;br&gt;
3-  Lower price point per unit of compute in some environments, including some advanced GPUs.&lt;br&gt;
4-  Ideal for stable, non-bursting workloads that can be forecast with high precision six or more months out.&lt;br&gt;
&lt;strong&gt;Disadvantages&lt;/strong&gt;&lt;br&gt;
1-  It is fundamentally challenging to dynamically provision compute resources to match the needs of your business, so teams are frequently stuck either over-utilizing or under-utilizing local compute resources.&lt;br&gt;
2-  Expensive GPUs can take months to ship, which leads to a larger total cost of ownership, and new ideas and features can take longer to launch because of the extra development effort.&lt;br&gt;
&lt;strong&gt;When to move&lt;/strong&gt;&lt;br&gt;
1-  When you spend more time managing your local development than you do working on new data science projects.&lt;br&gt;
2-  When the multiple months it takes to procure, wait for, and provision additional compute resources leave your development teams sitting idle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hybrid patterns for training
&lt;/h2&gt;

&lt;p&gt;Usually, a hybrid pattern for training comes down to one of two paths:&lt;br&gt;
1-  Either you are training locally and want to deploy in the cloud, or&lt;br&gt;
2-  You have all of your data sitting on local resources and want to select from it to move into the cloud to train.&lt;/p&gt;

&lt;h3&gt;
  
  
  Training locally, to deploy in the cloud
&lt;/h3&gt;

&lt;p&gt;1-  First, if you are training locally then you will need to acquire the compute capacity to train a model.&lt;br&gt;
2-  After your model is trained, there are two common approaches for packaging and hosting it in the cloud.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Docker – using a Docker file you can build your custom image that hosts your inference script, model artifact, and packages. Register this image in the Elastic Container Registry (ECR), and point to it from your SageMaker estimator.&lt;/li&gt;
&lt;li&gt;  Another option is using the pre-built containers within the SageMaker Python SDK, also known as the deep learning containers (DLCs). Bring your inference script and custom packages, upload your model artifact to Amazon S3, and import an estimator for your framework of choice. Define the version of the framework you need in the estimator, or install it directly with a requirements.txt file or a custom bash script.&lt;/li&gt;
&lt;/ul&gt;
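&lt;p&gt;Either way, the pre-built containers expect your trained model packaged as a &lt;code&gt;model.tar.gz&lt;/code&gt; uploaded to Amazon S3. A minimal packaging sketch, where the artifact file name is a placeholder for your real model:&lt;/p&gt;

```python
import os
import tarfile
import tempfile

# Sketch: bundle a locally trained model artifact into the model.tar.gz
# layout SageMaker's pre-built containers expect. File names here are
# throwaway placeholders.

def package_model(artifact_path, out_path="model.tar.gz"):
    """Bundle the model artifact into a gzipped tar at the archive root."""
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(artifact_path, arcname=os.path.basename(artifact_path))
    return out_path

# Demo with a throwaway file standing in for the trained model.
tmp = tempfile.mkdtemp()
fake_model = os.path.join(tmp, "model.joblib")
with open(fake_model, "wb") as f:
    f.write(b"weights")
archive = package_model(fake_model, os.path.join(tmp, "model.tar.gz"))
print(tarfile.open(archive).getnames())  # ['model.joblib']
```

&lt;p&gt;After uploading the archive to S3 (e.g. with boto3&amp;#39;s &lt;code&gt;upload_file&lt;/code&gt;), point the estimator or model object at its S3 URI.&lt;/p&gt;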

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo7s7n6u5cz32a6mbuab4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo7s7n6u5cz32a6mbuab4.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to monitor your model in the cloud
&lt;/h2&gt;

&lt;p&gt;A key feature for hosting is Model Monitor: the ability to detect drift in data, bias, and feature attribution against thresholds, and trigger a re-training pipeline.&lt;br&gt;
1-  Upload your training data to an Amazon S3 bucket, and use the pre-built image to learn the upper and lower bounds of your training data.&lt;br&gt;
2-  You will receive a JSON file with the statistically recommended upper and lower bounds for each feature; you can modify these thresholds.&lt;br&gt;
3-  After confirming your thresholds, schedule monitoring jobs in your production environment. These jobs run automatically, comparing your captured inference requests in Amazon S3 with your thresholds.&lt;br&gt;
4-  You will receive CloudWatch alerts when your inference data falls outside of your pre-determined thresholds, and you can use those alerts to trigger a re-training pipeline.&lt;/p&gt;
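&lt;p&gt;The baselining in steps 1–2 can be sketched in plain Python. The three-sigma rule and feature names below are illustrative, not Model Monitor&amp;#39;s exact method; the real service produces an analogous per-feature constraints JSON.&lt;/p&gt;

```python
import statistics

# Sketch: derive per-feature upper/lower bounds from training data, in the
# spirit of the baseline constraints Model Monitor suggests. The 3-sigma
# rule and feature names are illustrative assumptions.

def suggest_baseline(rows, features):
    constraints = {}
    for i, name in enumerate(features):
        col = [r[i] for r in rows]
        mean, sd = statistics.mean(col), statistics.pstdev(col)
        constraints[name] = {"lower_bound": mean - 3 * sd,
                             "upper_bound": mean + 3 * sd}
    return constraints

baseline = suggest_baseline([(1.0, 10.0), (2.0, 12.0), (3.0, 11.0)],
                            ["age_scaled", "income_scaled"])
print(baseline["age_scaled"]["lower_bound"] < 1.0)  # True
```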

&lt;h3&gt;
  
  
  How to handle retraining/retuning
&lt;/h3&gt;

&lt;p&gt;1-  It is easy to run a retraining or retuning job in the cloud without the overhead of provisioning, scheduling, and managing physical resources around the job.&lt;br&gt;
2-  SageMaker makes training and tuning jobs easy to manage because all you need to bring is your training script and dataset.&lt;br&gt;
3-  Follow best practices for training on SageMaker, ensuring your new dataset is loaded into an Amazon S3 bucket or another supported data source.&lt;br&gt;
4-  Another key feature of hosting models in SageMaker is multi-model endpoints.&lt;br&gt;
5-  Define your inference script, ensuring the framework is supported by SageMaker multi-model endpoints.&lt;br&gt;
6-  Create the multi-model endpoint pointing to Amazon S3, load your model artifacts into the SageMaker endpoint, and invoke the endpoint with the name of the model you want to use.&lt;/p&gt;
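&lt;p&gt;Invoking a multi-model endpoint (step 6) names the artifact per request via &lt;code&gt;TargetModel&lt;/code&gt;. A sketch of the request parameters, with placeholder endpoint and model names, to pass to boto3&amp;#39;s &lt;code&gt;sagemaker-runtime&lt;/code&gt; client:&lt;/p&gt;

```python
# Sketch: invoke_endpoint parameters for a multi-model endpoint. Endpoint
# name, model key, and payload are illustrative placeholders.

def mme_invoke_kwargs(endpoint_name, model_key, payload):
    return {
        "EndpointName": endpoint_name,
        "ContentType": "text/csv",
        "TargetModel": model_key,  # e.g. a relative S3 key like "model-a.tar.gz"
        "Body": payload,
    }

kwargs = mme_invoke_kwargs("demo-mme", "model-a.tar.gz", "1.0,2.0,3.0")
# runtime = boto3.client("sagemaker-runtime"); runtime.invoke_endpoint(**kwargs)
```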

&lt;h3&gt;
  
  
  How to serve thousands of models in the cloud at a low cost
&lt;/h3&gt;

&lt;p&gt;Another key feature of hosting models in SageMaker is multi-model endpoints: load your model artifacts from Amazon S3 into a single SageMaker endpoint, and invoke the endpoint with the name of the model you want to use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;br&gt;
1-  You have more control over your training environment.&lt;br&gt;
2-  The cloud not only provides greater flexibility, but can increase a firm’s security posture by freeing up resources from physical security, patching, and procurement.&lt;br&gt;
&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;br&gt;
1-  Not taking advantage of cost savings on spot instances.&lt;br&gt;
2-  Not using pre-built Docker images, but potentially wasting engineering effort developing these from scratch.&lt;br&gt;
3-  Not using advanced distributed training toolkits or custom ML training hardware such as AWS Trainium.&lt;br&gt;
4-  Not using prebuilt tuning packages, and needing to build or buy your own tuning package instead.&lt;br&gt;
5-  Not using the debugger, profiler, feature store, and other training benefits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to move:&lt;/strong&gt;&lt;br&gt;
1-  When the cost of developing and running your local training platform exceeds its value.&lt;br&gt;
2-  When demand for training from your data science teams or business needs far outstrips the time it takes to provision additional compute resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Storing data locally, to train and deploy in the cloud
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Schedule data transfer jobs with AWS DataSync
&lt;/h3&gt;

&lt;p&gt;1-  AWS DataSync is a data transfer service that simplifies, automates, and accelerates moving data between on-premises storage systems and AWS storage services, as well as between AWS storage services.&lt;br&gt;
2-  Using AWS DataSync you can easily move petabytes of data from your local on-premises servers up to the AWS cloud.&lt;br&gt;
3-  AWS DataSync connects to your local NFS resources, looks for any changes, and handles populating your cloud environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migrating from Local HDFS
&lt;/h2&gt;

&lt;p&gt;You might wholly embrace HDFS as the center of your data estate and move towards hosting it within a managed service, Amazon EMR (Elastic MapReduce).&lt;/p&gt;

&lt;p&gt;If you are interested in learning how to migrate from local HDFS clusters to Amazon EMR, please see this migration guide: &lt;a href="https://d1.awsstatic.com/whitepapers/amazon_emr_migration_guide.pdf" rel="noopener noreferrer"&gt;https://d1.awsstatic.com/whitepapers/amazon_emr_migration_guide.pdf&lt;/a&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Best practices
&lt;/h3&gt;

&lt;p&gt;1-  Use S3 Intelligent-Tiering for objects over 128 KB.&lt;br&gt;
2-  Use multiple AWS accounts, and connect them with AWS Organizations.&lt;br&gt;
3-  Set billing alerts.&lt;br&gt;
4-  Enable SSO with your current Active Directory provider.&lt;br&gt;
5-  Turn on SageMaker Studio.&lt;/p&gt;
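&lt;p&gt;Best practice 1 can be expressed as an S3 lifecycle rule. This is a sketch; the prefix is a placeholder, and the size filter reflects the 128 KB threshold below which Intelligent-Tiering does not monitor objects. Apply the rule with boto3&amp;#39;s &lt;code&gt;put_bucket_lifecycle_configuration&lt;/code&gt;.&lt;/p&gt;

```python
# Sketch: a lifecycle rule transitioning objects over 128 KB under a
# placeholder prefix into S3 Intelligent-Tiering.

lifecycle_rule = {
    "ID": "ml-data-intelligent-tiering",
    "Status": "Enabled",
    "Filter": {"And": {"Prefix": "datasets/",
                       "ObjectSizeGreaterThan": 128 * 1024}},
    "Transitions": [{"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}],
}
```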

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;br&gt;
1-  This is a fast way to realize the value of your locally stored datasets, particularly during a cloud migration.&lt;br&gt;
2-  Training and developing in the cloud gives you access to fully managed features within Amazon SageMaker and the entire AWS cloud.&lt;br&gt;
3-  You can offload your on-premises resources more easily by leveraging capabilities in the cloud.&lt;br&gt;
4-  This frees up your teams from procuring, provisioning, securing, and patching local compute resources, enabling them to dynamically scale these with the needs of your business.&lt;br&gt;
5-  Generally speaking, you can deploy more models more quickly by training and hosting in the cloud.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;br&gt;
1-  Expending more resources storing data locally than potentially necessary.&lt;br&gt;
2-  If you intend to train your ML models locally, you should anticipate a high volume of node drop-outs in your data centers. One large job can easily consume 1 TB of RAM, while another may require far less memory but execute for days.&lt;br&gt;
3-  Cost mitigation can be important here. Customers should be aware of any data duplication across environments and take action to reduce costs aggressively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to move&lt;/strong&gt;&lt;br&gt;
1-  When the cost of managing, securing, and storing your data on-premises exceeds the cost of archiving and storing it in the cloud.&lt;/p&gt;

&lt;h2&gt;
  
  
  Develop in the cloud while connecting to data hosted on-premises
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj26g9kv6o14x3fuy7bok.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj26g9kv6o14x3fuy7bok.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Wrangler &amp;amp; Snowflake
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Data Wrangler enables customers to browse and access data stores across Amazon S3, Amazon Athena, Amazon Redshift, and 3rd party data warehouses like Snowflake. &lt;/li&gt;
&lt;li&gt;This hybrid ML pattern provides customers the ability to develop in the cloud while accessing data stored on-premises, as organizations develop their migration plans.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Train in the cloud, to deploy ML models on-premises
&lt;/h3&gt;

&lt;p&gt;If you are deploying on-premises, you need to develop and host your local web server. We strongly recommend you decouple hosting your model artifact from your application.&lt;/p&gt;

&lt;p&gt;1-  Amazon SageMaker lets you specify any type of model framework, version, or output artifact you need to.&lt;br&gt;
2-  You’ll find all model artifacts wrapped as tar.gz archives after training jobs, as this compressed file format saves on job time and data costs.&lt;br&gt;
3-  If you are using your own image, you will own updating that image as software versions, such as TensorFlow or PyTorch, undergo potentially major changes over time.&lt;br&gt;
4-  Lastly, keep in mind that it is an unequivocal best practice to decouple hosting your ML model from hosting your application.&lt;br&gt;
This is a key step in your innovation flywheel: once you host your ML model on dedicated resources, separate from your application, pushing better models becomes much simpler. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;&lt;br&gt;
1-  Can use SageMaker Neo to compile your model for a target device.&lt;br&gt;
2-  Feels like you have more control up front.&lt;br&gt;
3-  Takes advantage of cloud-based training and tuning features, such as Spot training, Debugger, model and data parallelism, and Bayesian optimization.&lt;br&gt;
4-  Can enable your organization to progress on its cloud migration plan while application development moves to the cloud.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages&lt;/strong&gt;&lt;br&gt;
1-  Need to develop, manage, maintain, and respond to operational issues with locally managed web servers.&lt;br&gt;
2-  Own the burden of building and maintaining up-to-date versions of model software frameworks, such as TensorFlow or PyTorch.&lt;br&gt;
3-  Bigger risk of tightly coupling compute for your model with computing for your application, making it more challenging for you to deploy new models and new features to your application over time.&lt;br&gt;
4-  Need to develop your own data-drift detection and model monitor capabilities.&lt;br&gt;
5-  Not taking advantage of cloud-based features for global deployment, see next section for more details.&lt;br&gt;
6-  Need to develop your own application monitoring pipeline that extracts key metrics, business details, and model responses, to share with business and data science stakeholders.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F55vh3j8hn89gijmruxm5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F55vh3j8hn89gijmruxm5.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to move&lt;/strong&gt;&lt;br&gt;
1-  When your ability to deploy new applications on-premises is hindered by your need to procure, provision, and manage local infrastructure.&lt;br&gt;
2-  When your model is losing accuracy and/or performance over time, due to your inability to quickly retrain and redeploy.&lt;br&gt;
3-  When the cost of monitoring, updating, maintaining, and troubleshooting your on-premises hosting exceeds its value.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitor ML models deployed on-premises with SageMaker Edge Manager
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;  SageMaker Edge Manager makes it easy for customers to manage ML models deployed on Windows, Linux, or ARM-based compute environments.&lt;/li&gt;
&lt;li&gt; Install the Edge Manager agent onto the CPU of your intended device, and leverage AWS IoT Core or another transfer method to download the model to the device and execute local inferencing.&lt;/li&gt;
&lt;li&gt; Edge Manager simplifies the monitoring and updating of these models by bringing the control plane up to the cloud.&lt;/li&gt;
&lt;li&gt; You can bring your own monitoring algorithm to the service and trigger retraining pipelines as necessary, using the service to redeploy the model back down to the local device.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Hybrid patterns for deployment
&lt;/h2&gt;

&lt;p&gt;In this pattern, we focus mostly on hosting the model in the cloud while interacting with applications that may be hosted on-premises.&lt;br&gt;
Hosting models in the cloud for applications on-premises can empower data scientists; we also cover patterns for hosting ML models via Lambda@Edge, Outposts, Local Zones, and Wavelength.&lt;/p&gt;

&lt;h3&gt;
  
  
  Serve models in the cloud to applications hosted on-premises
&lt;/h3&gt;

&lt;p&gt;The most common use case for a hybrid pattern like this is enterprise migrations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv45l8x56p19t6joze3ug.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv45l8x56p19t6joze3ug.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Deploy ML models directly to application consumers.&lt;/li&gt;
&lt;li&gt; Can use custom hardware for ultra-low response times with AWS Inferentia.&lt;/li&gt;
&lt;li&gt; Can serve thousands of models at low cost with multi-model endpoints.&lt;/li&gt;
&lt;li&gt; Can deploy complex feature transforms with inference pipelines.&lt;/li&gt;
&lt;li&gt; Can use built-in Auto Scaling and Model Monitor.&lt;/li&gt;
&lt;li&gt; Can easily develop retraining and tuning pipelines.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Risk of your local application infrastructure maintenance hindering the speed of your model development.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;When to move&lt;/strong&gt;: When your ability to deploy new applications on-premises is hindered by your need to procure, provision, and manage local infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Host ML Models with Lambda@Edge to applications on-premises
&lt;/h2&gt;

&lt;p&gt;1-  This pattern takes advantage of a key capability of the AWS global network – the content delivery network known as Amazon CloudFront.&lt;br&gt;
2-  Once you’ve set your Lambda function to trigger off of CloudFront, you’re telling the service to replicate that function across all available regions and points of presence. This can take up to 8 minutes to replicate and become available.&lt;/p&gt;
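&lt;p&gt;A minimal sketch of this pattern: a Lambda@Edge-style handler serving a tiny “classical” model entirely inside the function. The linear scorer and its weights are hypothetical; a real deployment would package trained weights with the function or in a container image from ECR.&lt;/p&gt;

```python
import json

# Sketch: a Lambda-style handler embedding a hypothetical linear scorer.
# WEIGHTS and BIAS stand in for real pre-trained coefficients.

WEIGHTS = [0.4, -0.2, 0.1]
BIAS = 0.05

def handler(event, context=None):
    features = event["features"]
    score = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return {"statusCode": 200,
            "body": json.dumps({"score": round(score, 4)})}

resp = handler({"features": [1.0, 2.0, 3.0]})
print(resp["statusCode"])  # 200
```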

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;&lt;br&gt;
1-  Can use CloudFront, opening you up to serving on hundreds of points of presence around the world, and saving you from having to manage these. &lt;br&gt;
2-  Works nicely with Docker images on SageMaker, because you can create from ECR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Can’t use GPUs, so you may introduce a bit of latency in some cases, particularly where customers may be better served by an ML model on Inferentia hosted in a nearby AWS Region.&lt;/li&gt;
&lt;li&gt; Lambda has a hard limit of 10,240 MB on the memory you can allocate to a function. For many “classical” ML models, such as XGBoost or linear regressors, 10 GB is more than sufficient. However, for some more complex deep learning models, especially those with tens to hundreds of billions of parameters, 10 GB is woefully short in terms of RAM.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;When to move&lt;/strong&gt;&lt;br&gt;
1-  When you need more advanced drift and monitoring capabilities.&lt;br&gt;
2-  When you want to introduce complex feature transforms, such as with inference pipelines on SageMaker.&lt;br&gt;
3-  When you want to serve thousands of models per use case.&lt;/p&gt;

&lt;h3&gt;
  
  
  Training with a 3rd party SaaS provider to host in the cloud
&lt;/h3&gt;

&lt;p&gt;1-  Ensure your provider allows the export of proprietary software frameworks, such as JARs, bundles, and images. Then create a Dockerfile using that software framework, push the image to the Elastic Container Registry (ECR), and host the model on SageMaker.&lt;/p&gt;

&lt;h2&gt;
  
  
  Control plane patterns for hybrid ML
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; One such common control plane is Kubeflow in conjunction with EKS Anywhere. EKS Anywhere is currently in private preview, anticipated to come online in 2021.&lt;/li&gt;
&lt;li&gt; SageMaker offers a native approach for workflow orchestration, known as SageMaker Pipelines. SageMaker Pipelines is ideal for advanced SageMaker users, especially those who are already onboarded to the IDE SageMaker Studio. &lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Auxiliary services for hybrid ML patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  AWS Outposts
&lt;/h3&gt;

&lt;p&gt;1-  Order AWS Outposts and Amazon will ship, install, and manage these resources for you. You can connect to these resources however you prefer, and manage them from the cloud.&lt;br&gt;
2-  You can deploy ML models via ECS to serve inference with ultra-low latency within your data centers, using AWS Outposts. You can also use ECS for model training, integrating with SageMaker in the cloud and ECS on Outposts.&lt;br&gt;
3-  Outposts help solve cases where customers want to build applications in countries where there is not currently an AWS Region, or for regulations that have strict data residency requirements, like online gambling and sports betting.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Inferentia
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; AWS Inferentia provides easy access to custom hardware for ML inferencing.&lt;/li&gt;
&lt;li&gt; The majority of Alexa operates in a hybrid ML pattern, hosting models on AWS Inferentia and serving hundreds of millions of Alexa-enabled devices around the world. Using AWS Inferentia, Alexa was able to reduce its cost of hosting by 25%.&lt;/li&gt;
&lt;li&gt; You can use SageMaker’s managed deep learning containers to train your ML models, compile them for Inferentia with Neo, host on the cloud, and develop your retrain and tune pipeline as usual.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  AWS Direct Connect
&lt;/h3&gt;

&lt;p&gt;AWS Direct Connect gives you the ability to establish a private connection between your on-premises data center and AWS. Remember to establish a redundant link, as wires do go down!&lt;/p&gt;

&lt;h2&gt;
  
  
  Hybrid ML Use Cases
&lt;/h2&gt;

&lt;p&gt;1-  Enterprise Migrations&lt;br&gt;
2-  Manufacturing&lt;br&gt;
3-  Gaming&lt;br&gt;
4-  Mobile application development&lt;br&gt;
5-  AI-enhanced media and content creation&lt;br&gt;
6-  Autonomous Vehicles&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this document, we explored hybrid ML patterns across the entire ML lifecycle. We looked at developing locally while training and deploying in the cloud. We discussed patterns for training locally to deploy on the cloud and even to host ML models in the cloud to serve applications on-premises.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>aws</category>
      <category>sagemaker</category>
      <category>awscommunitybuilder</category>
    </item>
    <item>
      <title>Will Amazon Sagemaker ever rule the world?</title>
      <dc:creator>Heba K. Ahmed</dc:creator>
      <pubDate>Sun, 11 Jul 2021 19:39:35 +0000</pubDate>
      <link>https://dev.to/aws-builders/will-amazon-sagemaker-ever-rule-the-world-1b90</link>
      <guid>https://dev.to/aws-builders/will-amazon-sagemaker-ever-rule-the-world-1b90</guid>
      <description>&lt;p&gt;The phrase "never normal" became commonplace in 2020. And, as with most dramatic upheavals, some leaders' first reaction is to focus on survival. Customers may learn how AI and machine learning spurs corporate innovation, enhances consumer experiences, and boosts profits. Whether it's improving consumer experiences, making sophisticated real-time recommendations, speeding up new product development, increasing employee productivity, or saving expenses and decreasing fraud, businesses are increasingly turning to AI and machine learning to solve problems.&lt;br&gt;
AI and ML hold the promise of transforming industries, increasing efficiencies, and driving innovation. The key to machine learning success is scale.&lt;br&gt;
In 2018, Amazon Sagemaker was a highly scalable machine learning and deep learning service that supported 11 of its own algorithms as well as any others you provided. You had to conduct your own ETL and feature engineering because hyperparameter optimization was still in preview.&lt;/p&gt;

&lt;p&gt;Since then, SageMaker has grown in scope, adding IDEs (SageMaker Studio) and automatic machine learning (SageMaker Autopilot) to the core notebooks, as well as a slew of new services to the whole ecosystem, as illustrated in the diagram below.&lt;/p&gt;

&lt;p&gt;This ecosystem helps with machine learning from start to finish, from model development through training and tuning to deployment and management.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F06quo0kb972ocdyiir9m.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F06quo0kb972ocdyiir9m.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a nutshell, Amazon SageMaker is a set of libraries and interfaces that make building and deploying machine learning models easier. It's also important to remember that the SageMaker platform is made up of a variety of products and services that may be customized to meet your specific needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits of Amazon SageMaker&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are several benefits to using Amazon SageMaker, but here are four areas that can be used as examples to show the strength of the platform.&lt;/p&gt;

&lt;p&gt;Those areas are accessibility, customization, scalability, and efficiency.&lt;/p&gt;

&lt;p&gt;The dashboard on the Amazon SageMaker console shows all the different areas and services that can be accessed through the web interface. The same tools and services can also be managed from an external machine using the AWS Command Line Interface (CLI).&lt;/p&gt;

&lt;p&gt;There are many ways to customize this process. The approach might seem daunting, but once you've run through a basic pattern, the opportunities for creation are pretty much endless.&lt;/p&gt;

&lt;p&gt;For your hosted models, Amazon SageMaker supports automatic scaling (autoscaling). Autoscaling dynamically adjusts the number of instances provisioned for a model in response to changes in your workload, bringing additional instances online as the workload grows.&lt;br&gt;
As the workload drops, autoscaling removes the unnecessary instances, so you don't have to pay for capacity you're not using.&lt;/p&gt;
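The autoscaling setup described above is configured through Application Auto Scaling by registering the endpoint variant as a scalable target. A minimal sketch, built as a plain dict so it runs without AWS credentials; the endpoint and variant names are hypothetical.

```python
def endpoint_autoscaling_request(endpoint_name, variant_name, min_cap=1, max_cap=4):
    """Build the Application Auto Scaling RegisterScalableTarget request that
    lets a SageMaker endpoint variant scale between min_cap and max_cap
    instances. Endpoint and variant names here are placeholders."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_cap,
        "MaxCapacity": max_cap,
    }

request = endpoint_autoscaling_request("demo-endpoint", "AllTraffic")
# In a real run: boto3.client("application-autoscaling").register_scalable_target(**request)
```

A scaling policy (for example, target tracking on invocations per instance) would then be attached to this target.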

&lt;p&gt;Amazon SageMaker provides on-demand compute instances running Jupyter notebooks with R or Python kernels, which we can choose based on our data engineering requirements. In the traditional way, we can explore, process, clean, and transform the data into the required form using libraries such as Pandas or Matplotlib.&lt;br&gt;
After data engineering, we can train the models on a different compute instance chosen to match the model's computing needs, such as memory-optimized or GPU-enabled instances.&lt;/p&gt;
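The notebook workflow above, explore, clean, then transform, can be sketched in plain Python (a real notebook would use Pandas equivalents such as `fillna` and vectorized scaling); the tiny dataset is made up purely for illustration.

```python
from statistics import median

# Made-up sample standing in for a column loaded in the notebook (None = missing).
ages = [34, None, 29, 52]

# Clean: fill missing values with the column median,
# like df["age"].fillna(df["age"].median()) in Pandas.
med = median(v for v in ages if v is not None)
cleaned = [v if v is not None else med for v in ages]

# Transform: min-max scale to [0, 1] before training.
lo, hi = min(cleaned), max(cleaned)
scaled = [round((v - lo) / (hi - lo), 2) for v in cleaned]
print(scaled)  # → [0.22, 0.22, 0.0, 1.0]
```

The same two steps, imputation then scaling, carry over directly once the data lives in a DataFrame.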

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsq9gu4g4hpxbtmm1f7x4.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsq9gu4g4hpxbtmm1f7x4.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For a range of models, you can take advantage of sensible, high-performance default hyperparameter tuning settings. Use performance-optimized algorithms from AWS's extensive library, or bring your own algorithms in industry-standard containers. You can then make the trained model available as an API, deploying it on a separate compute instance to meet business and scalability needs.&lt;/p&gt;
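The bring-your-own-container training step can be sketched as a `CreateTrainingJob` request (boto3 `sagemaker` client shape). Everything here, job name, image URI, role, buckets, and instance type, is an illustrative placeholder, and the request is built as a plain dict so the sketch runs without AWS access.

```python
def training_job_request(job_name, image_uri, role_arn, s3_train, s3_output):
    """Minimal CreateTrainingJob request. All names, URIs, and the
    instance type are illustrative placeholders."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,   # built-in algorithm or your own container
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_train,
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": s3_output},
        "ResourceConfig": {"InstanceType": "ml.m5.xlarge",
                           "InstanceCount": 1, "VolumeSizeInGB": 30},
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

request = training_job_request(
    "demo-train-job",
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/demo-image:latest",
    "arn:aws:iam::123456789012:role/DemoSageMakerRole",
    "s3://demo-bucket/train/",
    "s3://demo-bucket/output/",
)
# In a real run: boto3.client("sagemaker").create_training_job(**request)
```

Deploying the resulting model as an API then takes a model, an endpoint configuration, and an endpoint, each created with a similarly small request.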

&lt;p&gt;The entire process of provisioning hardware instances, performing high-capacity data jobs, and coordinating the whole flow with simple commands removes manual complications and ultimately enables serverless, elastic deployment with only a few lines of code, all while remaining cost-effective. SageMaker is a game-changing enterprise solution.&lt;/p&gt;

&lt;p&gt;For most data scientists who want a genuinely end-to-end solution, Amazon SageMaker is a compelling offering. It handles the abstraction, as well as the many software engineering skills otherwise required to complete the job, while remaining highly effective, versatile, and cost-efficient. Most significantly, it lets you concentrate on the core ML experiments while filling in the necessary skills through simple abstractions that resemble our existing methodology.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fex36sgs5wa4kc23bfu7z.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fex36sgs5wa4kc23bfu7z.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
