Hello world. This is the monthly Natural Language Processing (NLP) newsletter covering everything related to NLP at AWS during the month of February. You can find previous months' newsletters here. Feel free to leave comments or share it on your social networks to celebrate this new launch with us. Let's dive in!
NLP Customer Success Stories
How Kustomer utilizes custom Docker images & Amazon SageMaker to build a text classification pipeline
Kustomer is the omnichannel SaaS CRM platform reimagining enterprise customer service to deliver standout experiences. Kustomer wanted the ability to rapidly analyze large volumes of support communications for their business customers — customer experience and service organizations — and automate discovery of information such as the end-customer’s intent, customer service issue, and other relevant insights related to the consumer.
In this blog post, the authors describe how Kustomer uses custom Docker images for SageMaker training and inference, which eases integration and streamlines the process. With this approach, Kustomer’s business customers are automatically classifying over 50k support emails each month with up to 70% accuracy.
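To give a flavor of the pattern, here is a minimal sketch of launching a SageMaker training job from a custom Docker image with the SageMaker Python SDK. The ECR image URI, role, instance settings, and S3 paths are hypothetical and not taken from Kustomer's setup.

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

# Point the generic Estimator at a custom training image in Amazon ECR.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/text-classifier:latest",  # hypothetical image
    role="<your-sagemaker-execution-role>",  # hypothetical role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    sagemaker_session=session,
)

# Launch the training job against data in S3 (hypothetical prefix).
estimator.fit({"train": "s3://my-bucket/support-emails/train/"})
```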
Updates on AWS Language Services
Apply profanity masking in Amazon Translate
Amazon Translate typically chooses clean words for your translation output. But in some situations, you may want to prevent words that are commonly considered profane from appearing in the translated output.
You can now apply profanity masking to both real-time translation and asynchronous batch processing in Amazon Translate. When profanity masking is enabled, Amazon Translate uses the five-character sequence ?$#@$ to mask each profane word or phrase, regardless of its length. Amazon Translate detects each profane word or phrase literally, not contextually.
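For real-time translation, enabling the mask comes down to a single request setting. Here's a minimal sketch using boto3; the language pair and input text are placeholders.

```python
import boto3

translate = boto3.client("translate")

response = translate.translate_text(
    Text="Some customer feedback to translate.",  # placeholder input
    SourceLanguageCode="en",
    TargetLanguageCode="es",
    Settings={"Profanity": "MASK"},  # profane words/phrases come back as ?$#@$
)
print(response["TranslatedText"])
```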
Control formality in machine translated text using Amazon Translate
This newly released feature in Amazon Translate lets you customize the level of formality in your translation output. At the time of writing, formality customization is available for six target languages: French, German, Hindi, Italian, Japanese, and Spanish. You can customize the formality of your translated output to suit your communication needs at three different levels (see the sketch after this list):
- Default – No control over formality; the neural machine translation operates without influence
- Formal – Useful in industries such as insurance and healthcare, where you may prefer a more formal translation
- Informal – Useful for customers in gaming and social media who prefer an informal translation
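Like profanity masking, formality is a per-request setting. Here's a minimal sketch with boto3, assuming German as the target language (one of the six supported); the input text is a placeholder.

```python
import boto3

translate = boto3.client("translate")

response = translate.translate_text(
    Text="How are you?",
    SourceLanguageCode="en",
    TargetLanguageCode="de",             # German, a supported target language
    Settings={"Formality": "FORMAL"},    # or "INFORMAL"; omit the key for Default
)
print(response["TranslatedText"])
```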
Announcing the launch of the model copy feature for Amazon Comprehend custom models
This past month, AWS launched the Amazon Comprehend custom model copy feature. It unlocks the important capability of automatically copying your Amazon Comprehend custom models from a source account to designated target accounts in the same Region, without requiring access to the datasets that the model was trained and evaluated on. The feature is available for both Amazon Comprehend custom classification and custom entity recognition models, and it unlocks benefits such as the following (a minimal API sketch follows the list):
- Multi-account MLOps strategy – Train a model one time, deploy in multiple accounts
- Faster deployment – No need to retrain in every account
- Protect sensitive datasets – No need to share datasets between accounts or users – especially important for industries bound to regulatory requirements around data isolation and sandboxing
- Easy collaboration – Partners or vendors can now easily train in Amazon Comprehend Custom and share the models with their customers
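Here's a minimal sketch of the two-step flow with boto3, assuming a custom classifier; all ARNs and account IDs are hypothetical. The source account authorizes the copy with a resource-based policy, and the target account then imports the model.

```python
import json
import boto3

comprehend = boto3.client("comprehend")

# Step 1 (in the source account): allow the target account to import the model.
comprehend.put_resource_policy(
    ResourceArn="arn:aws:comprehend:us-east-1:111111111111:document-classifier/my-classifier/version/v1",
    ResourcePolicy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::222222222222:root"},  # hypothetical target account
            "Action": ["comprehend:ImportModel"],
            "Resource": "*",
        }],
    }),
)

# Step 2 (in the target account): import (copy) the model,
# without any access to the original training data.
comprehend.import_model(
    SourceModelArn="arn:aws:comprehend:us-east-1:111111111111:document-classifier/my-classifier/version/v1",
    ModelName="my-classifier-copy",
)
```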
NLP on Amazon SageMaker
Train 175+ billion parameter NLP models with model parallel additions and Hugging Face on Amazon SageMaker
In this blog post, the authors briefly summarize the rise of large- and small-scale NLP models, primarily through the abstraction provided by Hugging Face and the modular backend of Amazon SageMaker. The post highlights the launch of four additional features in the SageMaker model parallel library, which unlock pretraining and fine-tuning of 175-billion-parameter NLP models for customers.
Using the SageMaker model parallel library on the SageMaker training platform, the authors achieve a throughput of 32 samples per second for a 175-billion-parameter model on 120 ml.p4d.24xlarge instances. They extrapolate that, with compute scaled up to 240 instances, the full model would take 25 days to train.
In this repo you will find sample code for training BERT, GPT-2, and the recently released GPT-J models using model parallelism on Amazon SageMaker.
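To give a flavor of the setup, here's a minimal sketch of enabling the model parallel library through the Hugging Face estimator; the entry point script, role, parallelism degrees, and cluster size are hypothetical and depend on your model and budget.

```python
from sagemaker.huggingface import HuggingFace

# Hypothetical model parallel configuration; tune degrees to your model size.
smp_options = {
    "enabled": True,
    "parameters": {
        "pipeline_parallel_degree": 4,  # hypothetical
        "tensor_parallel_degree": 8,    # hypothetical
        "ddp": True,
    },
}

estimator = HuggingFace(
    entry_point="train_gpt.py",         # hypothetical training script
    source_dir="./scripts",
    role="<your-sagemaker-execution-role>",
    instance_type="ml.p4d.24xlarge",
    instance_count=16,                  # scale out for larger models
    transformers_version="4.17",
    pytorch_version="1.10",
    py_version="py38",
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)

estimator.fit()
```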
Improve high-value research with Hugging Face and Amazon SageMaker asynchronous inference endpoints
Many of our AWS customers provide research, analytics, and business intelligence as a service, enabling their end customers to stay ahead of markets and competitors, identify growth opportunities, and address issues proactively. The NLP models behind these research tasks are large, and the workloads typically involve summarizing long articles across sizable corpora, so serving them from dedicated, always-on endpoints isn't cost-optimized. These applications also receive bursts of incoming traffic at different times of the day.
We believe customers would greatly benefit from the ability to scale down to zero and ramp up their inference capability on an as-needed basis, optimizing research cost without compromising inference quality. This post discusses how Hugging Face along with Amazon SageMaker asynchronous inference can help achieve this.
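Here's a minimal sketch of deploying a Hugging Face summarization model behind an asynchronous endpoint; the model ID is a public checkpoint, while the role, instance type, and S3 paths are hypothetical. Requests are queued and results land in S3, and the endpoint can then be scaled down to zero instances with Application Auto Scaling when the queue is empty.

```python
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.async_inference import AsyncInferenceConfig

# Load a public summarization checkpoint from the Hugging Face Hub.
model = HuggingFaceModel(
    env={"HF_MODEL_ID": "facebook/bart-large-cnn", "HF_TASK": "summarization"},
    role="<your-sagemaker-execution-role>",  # hypothetical role
    transformers_version="4.17",
    pytorch_version="1.10",
    py_version="py38",
)

# Deploy as an asynchronous endpoint; responses are written to S3.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
    async_inference_config=AsyncInferenceConfig(
        output_path="s3://my-bucket/async-outputs/",  # hypothetical bucket
    ),
)

# Requests are queued; predict_async returns immediately with a response
# object pointing at the eventual S3 output location.
response = predictor.predict_async(
    input_path="s3://my-bucket/async-inputs/article.json"  # hypothetical input
)
```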
Choose the best data source for your Amazon SageMaker training job
Data ingestion is an integral part of any training pipeline, and SageMaker training jobs support a variety of data sources and input modes to suit a wide range of training workloads.
This post helps you choose the best data source for your SageMaker ML training use case. We introduce the data source options that SageMaker training jobs support natively. For each data source and input mode, we outline its ease of use, performance characteristics, cost, and limitations. To help you get started quickly, we provide a diagram with a sample decision flow that you can follow based on your key workload characteristics. Lastly, we perform several benchmarks for realistic training scenarios to demonstrate the practical implications for overall training cost and performance.
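As a concrete example, the input mode is chosen per channel when you define a training job. Here's a minimal sketch contrasting two of the natively supported modes; the bucket path is hypothetical.

```python
from sagemaker.inputs import TrainingInput

# File mode (the default): the full dataset is downloaded to local storage
# before training starts.
file_channel = TrainingInput(s3_data="s3://my-bucket/train/", input_mode="File")

# FastFile mode: objects are streamed from S3 on demand, so training starts
# immediately without waiting for a full download.
fastfile_channel = TrainingInput(s3_data="s3://my-bucket/train/", input_mode="FastFile")

# Pass the chosen channel to your estimator, e.g.:
# estimator.fit({"train": fastfile_channel})
```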
Community Content
Hugging Face Inference SageMaker Terraform Module
Our partners at Hugging Face have released a Terraform module that is incredibly useful for deploying Hugging Face Transformer models like BERT, from either Amazon S3 or the Hugging Face Model Hub, to Amazon SageMaker. They have jam-packed it with great features, such as deploying private Transformer models from hf.co/models, directly adding an autoscaling configuration for the deployed Amazon SageMaker endpoints, and even deploying Asynchronous Inference Endpoints!
Check out the Terraform module here.
NLP Data Augmentation on Amazon SageMaker
Machine learning models are very data-intensive – which is especially true for Natural Language Processing (NLP) models. At the same time, data scarcity is a common challenge in NLP, especially for low-resource languages. This is where data augmentation can greatly help – it is the process of enriching or synthetically enlarging the dataset that a machine learning model is trained on.
In this blog post, the authors explain how to efficiently perform data augmentation – namely back translation – by leveraging SageMaker Processing jobs and pre-trained Hugging Face translation models.
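The core of back translation fits in a few lines: translate a sentence into a pivot language and back, yielding a paraphrase. Here's a minimal local sketch using real Helsinki-NLP checkpoints; the wrapping into a SageMaker Processing job, as the post describes, is omitted here.

```python
from transformers import pipeline

# Pre-trained translation models for the English <-> French round trip.
en_to_fr = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
fr_to_en = pipeline("translation_fr_to_en", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(text: str) -> str:
    """Return a paraphrase of `text` produced by a round trip through French."""
    french = en_to_fr(text)[0]["translation_text"]
    return fr_to_en(french)[0]["translation_text"]

print(back_translate("The model was trained on a very small dataset."))
```

Running the round trip over a training set produces paraphrased copies of each example, which can be appended to the original data as additional training samples.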