<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: João Moura</title>
    <description>The latest articles on DEV Community by João Moura (@joaopcm1996).</description>
    <link>https://dev.to/joaopcm1996</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F715601%2Fc83e3628-a143-4af7-987f-c477e8530aaf.jpeg</url>
      <title>DEV Community: João Moura</title>
      <link>https://dev.to/joaopcm1996</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/joaopcm1996"/>
    <language>en</language>
    <item>
      <title>NLP@AWS Newsletter 03/2022</title>
      <dc:creator>João Moura</dc:creator>
      <pubDate>Tue, 08 Mar 2022 17:59:07 +0000</pubDate>
      <link>https://dev.to/aws/aws-nlp-newsletter-february-2022-m6f</link>
      <guid>https://dev.to/aws/aws-nlp-newsletter-february-2022-m6f</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3b6fewaamn8259m0pyy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3b6fewaamn8259m0pyy.png" alt="Alt Text" width="720" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hello world. This is the monthly Natural Language Processing (NLP) newsletter covering everything related to NLP at AWS in the month of February. You can find previous months' newsletters &lt;a href="https://dev.to/search?q=aws%20nlp%20newsletter"&gt;here&lt;/a&gt;. Feel free to leave comments or share it on your social networks to celebrate this new launch with us. Let's dive in!&lt;/p&gt;




&lt;h2&gt;NLP Customer Success Stories&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/how-kustomer-utilizes-custom-docker-images-amazon-sagemaker-to-build-a-text-classification-pipeline/" rel="noopener noreferrer"&gt;&lt;strong&gt;How Kustomer utilizes custom Docker images &amp;amp; Amazon SageMaker to build a text classification pipeline&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
Kustomer is the omnichannel SaaS CRM platform reimagining enterprise customer service to deliver standout experiences. Kustomer wanted the ability to rapidly analyze large volumes of support communications for their business customers — customer experience and service organizations — and automate discovery of information such as the end-customer’s intent, customer service issue, and other relevant insights related to the consumer.&lt;/p&gt;

&lt;p&gt;In this blog post, the authors describe how Kustomer uses custom Docker images for SageMaker training and inference, which eases integration and streamlines the process. With this approach, Kustomer’s business customers are automatically classifying over 50k support emails each month with up to 70% accuracy.&lt;/p&gt;
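
&lt;p&gt;As a rough illustration (not Kustomer's actual pipeline), the sketch below assembles the parameters for SageMaker's CreateTrainingJob API pointing at a custom Docker image in Amazon ECR; all ARNs, URIs, and names are hypothetical placeholders:&lt;/p&gt;

```python
# Hedged sketch: pointing a SageMaker training job at a custom Docker image.
# All ARNs, URIs, and names here are hypothetical placeholders.
def build_training_job_request(image_uri, role_arn, job_name, output_s3):
    """Assemble parameters for SageMaker's CreateTrainingJob API."""
    return {
        "TrainingJobName": job_name,
        # The custom image from Amazon ECR replaces the built-in containers.
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

params = build_training_job_request(
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/text-classifier:latest",
    "arn:aws:iam::123456789012:role/SageMakerRole",
    "text-classification-train",
    "s3://my-bucket/output/",
)
# import boto3
# boto3.client("sagemaker").create_training_job(**params)
```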




&lt;h2&gt;Updates on AWS Language Services&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/apply-profanity-masking-in-amazon-translate/" rel="noopener noreferrer"&gt;&lt;strong&gt;Apply profanity masking in Amazon Translate&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
Amazon Translate typically chooses clean words for your translation output. But in some situations, you want to prevent words that are commonly considered profane from appearing in the translated output.&lt;/p&gt;

&lt;p&gt;You can now apply profanity masking to both &lt;a href="https://docs.aws.amazon.com/translate/latest/dg/sync.html" rel="noopener noreferrer"&gt;real-time translation&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/translate/latest/dg/async.html" rel="noopener noreferrer"&gt;asynchronous batch processing&lt;/a&gt; in Amazon Translate. When profanity masking is enabled, the five-character sequence ?$#@$ is used to mask each profane word or phrase, regardless of the number of characters. Amazon Translate detects each profane word or phrase literally, not contextually.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh42v7eb50tibs3zaa9x3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh42v7eb50tibs3zaa9x3.png" alt="Alt Text" width="800" height="561"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/control-formality-in-machine-translated-text-using-amazon-translate/" rel="noopener noreferrer"&gt;&lt;strong&gt;Control formality in machine translated text using Amazon Translate&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
This newly released feature in Amazon Translate allows you to customize the level of formality in your translation output. At the time of writing, the formality customization feature is available for six target languages: French, German, Hindi, Italian, Japanese, and Spanish. You can customize the formality of your translated output to suit your communication needs, at three different levels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Default&lt;/strong&gt; – No control over formality; the neural machine translation output is left unmodified&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Formal&lt;/strong&gt; – Useful in industries such as insurance and healthcare, where you may prefer a more formal translation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Informal&lt;/strong&gt; – Useful for customers in gaming and social media who prefer an informal translation&lt;/li&gt;
&lt;/ul&gt;
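
&lt;p&gt;A minimal sketch of requesting a formality level via the Settings field of a TranslateText call (the text and language codes are illustrative):&lt;/p&gt;

```python
# Hedged sketch: requesting formal output from Amazon Translate.
# Values here are illustrative, not from the post.
def build_formal_translate_request(text, source_lang, target_lang,
                                   formality="FORMAL"):
    """TranslateText parameters with the Formality setting applied."""
    return {
        "Text": text,
        "SourceLanguageCode": source_lang,
        "TargetLanguageCode": target_lang,
        # "FORMAL" or "INFORMAL"; omit Settings entirely for the default.
        "Settings": {"Formality": formality},
    }

params = build_formal_translate_request("How are you?", "en", "de")
# import boto3
# result = boto3.client("translate").translate_text(**params)
```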

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/announcing-the-launch-of-the-model-copy-feature-for-amazon-comprehend-custom-models/" rel="noopener noreferrer"&gt;&lt;strong&gt;Announcing the launch of the model copy feature for Amazon Comprehend custom models&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
AWS launched the Amazon Comprehend custom model copy feature this past month, unlocking the important capability of automatically copying your Amazon Comprehend custom models from a source account to designated target accounts in the same Region, without requiring access to the datasets the models were trained and evaluated on. The feature is available for both Amazon Comprehend custom classification and custom entity recognition models, and it also unlocks benefits such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-account MLOps strategy&lt;/strong&gt; – Train a model one time, deploy in multiple accounts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faster deployment&lt;/strong&gt; – No need to retrain in every account&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protect sensitive datasets&lt;/strong&gt; – No need to share datasets between accounts or users – especially important for industries bound to regulatory requirements around data isolation and sandboxing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easy collaboration&lt;/strong&gt; – Partners or vendors can now easily train in Amazon Comprehend Custom and share the models with their customers.&lt;/li&gt;
&lt;/ul&gt;
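
&lt;p&gt;From the target account's side, the copy boils down to a single ImportModel call, assuming the source account has already shared the model version via a resource-based policy. A hedged sketch with hypothetical ARNs and names:&lt;/p&gt;

```python
# Hedged sketch: importing a shared Comprehend custom model into this
# (target) account. The ARNs and names below are hypothetical placeholders;
# the source account must first attach a resource-based policy to the model.
def build_import_model_request(source_model_arn, model_name, role_arn):
    """Parameters for Amazon Comprehend's ImportModel API."""
    return {
        "SourceModelArn": source_model_arn,
        "ModelName": model_name,
        # Role that grants Comprehend permission to read the source model.
        "DataAccessRoleArn": role_arn,
    }

params = build_import_model_request(
    "arn:aws:comprehend:us-east-1:111122223333:document-classifier/src-model/version/v1",
    "copied-classifier",
    "arn:aws:iam::444455556666:role/ComprehendImportRole",
)
# import boto3
# boto3.client("comprehend").import_model(**params)
```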




&lt;h2&gt;NLP on Amazon SageMaker&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/train-175-billion-parameter-nlp-models-with-model-parallel-additions-and-hugging-face-on-amazon-sagemaker/" rel="noopener noreferrer"&gt;&lt;strong&gt;Train 175+ billion parameter NLP models with model parallel additions and Hugging Face on Amazon SageMaker&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
In this blog post, the authors briefly summarize the rise of NLP models at small and large scale, primarily through the abstraction provided by Hugging Face and the modular backend of Amazon SageMaker. The post highlights the launch of four additional features in the SageMaker model parallel library, which unlock pretraining and fine-tuning of NLP models with 175 billion or more parameters.&lt;/p&gt;

&lt;p&gt;The SageMaker model parallel library is used on the SageMaker training platform, achieving a throughput of 32 samples per second with 175 billion parameters on 120 ml.p4d.24xlarge instances. The authors extrapolate that, if compute were increased to 240 instances, the full model would take 25 days to train.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyfdwj1uzltl0x3tqtpt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyfdwj1uzltl0x3tqtpt.png" alt="Alt Text" width="691" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/aws/amazon-sagemaker-examples/tree/main/training/distributed_training/pytorch/model_parallel" rel="noopener noreferrer"&gt;In this repo&lt;/a&gt; you will find sample code for training BERT, GPT-2, and the recently released GPT-J models using model parallelism on Amazon SageMaker.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/improve-high-value-research-with-hugging-face-and-amazon-sagemaker-asynchronous-inference-endpoints/" rel="noopener noreferrer"&gt;&lt;strong&gt;Improve high-value research with Hugging Face and Amazon SageMaker asynchronous inference endpoints&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
Many of our AWS customers provide research, analytics, and business intelligence as a service. This type of research and business intelligence enables their end customers to stay ahead of markets and competitors, identify growth opportunities, and address issues proactively. The NLP models used for these research tasks are large and typically need to summarize long articles, given the size of the corpus; serving them from dedicated endpoints is not cost-optimized, because these applications receive bursts of incoming traffic at different times of the day.&lt;/p&gt;

&lt;p&gt;We believe customers would greatly benefit from the ability to scale down to zero and ramp up their inference capability on an as-needed basis. This optimizes research costs without compromising inference quality. This post discusses how Hugging Face, along with Amazon SageMaker asynchronous inference, can help achieve this.&lt;/p&gt;
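
&lt;p&gt;A minimal sketch of the asynchronous-inference piece of an endpoint configuration; the bucket path is a hypothetical placeholder:&lt;/p&gt;

```python
# Hedged sketch: the AsyncInferenceConfig portion of a CreateEndpointConfig
# request. The bucket path is a hypothetical placeholder.
async_inference_config = {
    "OutputConfig": {
        # Results are written to S3 instead of being returned inline, so
        # requests can carry large payloads and long-running inferences.
        "S3OutputPath": "s3://my-bucket/async-results/",
    },
    "ClientConfig": {
        # Queue depth per instance before autoscaling should add capacity.
        "MaxConcurrentInvocationsPerInstance": 4,
    },
}
# Scale-to-zero is what makes this cost-efficient for bursty research
# workloads: register the endpoint variant with Application Auto Scaling
# and set MinCapacity to 0 so instances are released when the queue drains.
```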

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/choose-the-best-data-source-for-your-amazon-sagemaker-training-job/" rel="noopener noreferrer"&gt;&lt;strong&gt;Choose the best data source for your Amazon SageMaker training job&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
Data ingestion is an integral part of any training pipeline, and SageMaker training jobs support a variety of data storage and input modes to suit a wide range of training workloads.&lt;/p&gt;

&lt;p&gt;This post helps you choose the best data source for your SageMaker ML training use case. We introduce the data source options that SageMaker training jobs support natively. For each data source and input mode, we outline its ease of use, performance characteristics, cost, and limitations. To help you get started quickly, we provide a diagram with a sample decision flow that you can follow based on your key workload characteristics. Lastly, we perform several benchmarks for realistic training scenarios to demonstrate the practical implications for the overall training cost and performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5d32ft58wogbigpypjn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5d32ft58wogbigpypjn.png" alt="Alt Text" width="506" height="656"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;Community Content&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.philschmid.de/terraform-huggingface-amazon-sagemaker-advanced" rel="noopener noreferrer"&gt;&lt;strong&gt;Hugging Face Inference Sagemaker Terraform Module&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
Our partners at Hugging Face have released a Terraform module that makes it easy to deploy Hugging Face Transformer models, such as BERT, from either Amazon S3 or the &lt;a href="https://huggingface.co/models" rel="noopener noreferrer"&gt;Hugging Face Model Hub&lt;/a&gt; to Amazon SageMaker. They have packed it with great features, such as deploying private Transformer models from hf.co/models, directly adding an autoscaling configuration for the deployed Amazon SageMaker endpoints, and even deploying &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html" rel="noopener noreferrer"&gt;Asynchronous Inference Endpoints&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Check out the Terraform module &lt;a href="https://registry.terraform.io/modules/philschmid/sagemaker-huggingface/aws/latest" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.toNLP%20Data%20Augmentation%20on%20Amazon%20SageMaker"&gt;&lt;strong&gt;NLP Data Augmentation on Amazon SageMaker&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
Machine learning models are very data-intensive – which is especially true for Natural Language Processing (NLP) models; at the same time, data scarcity is a common challenge in NLP, especially for low-resource languages. This is where data augmentation can greatly help – it is the process of enriching or synthetically enlarge the dataset that a machine learning model is trained on.&lt;/p&gt;

&lt;p&gt;In this blog post, the authors explain how to efficiently perform data augmentation – namely using back translation – by leveraging SageMaker Processing Jobs and pre-trained Hugging Face translation models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff8um90pm5mvz6a8dz1lf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff8um90pm5mvz6a8dz1lf.png" alt="Alt Text" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>aws</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>AWS - NLP newsletter September 2021</title>
      <dc:creator>João Moura</dc:creator>
      <pubDate>Thu, 30 Sep 2021 15:24:09 +0000</pubDate>
      <link>https://dev.to/aws/aws-nlp-newsletter-2021-sep-34o2</link>
      <guid>https://dev.to/aws/aws-nlp-newsletter-2021-sep-34o2</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzp6oesygzu431ar44wcb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzp6oesygzu431ar44wcb.png" alt="Alt Text" width="720" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hello world. This is the second monthly Natural Language Processing (NLP) newsletter, covering everything related to NLP at AWS, and more. Feel free to leave comments or share it on your social networks. Let's dive in!&lt;/p&gt;




&lt;h2&gt;AWS NLP Services&lt;/h2&gt;

&lt;h3&gt;Feature Releases&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/aws/amazon-textract-updates-up-to-32-price-reduction-in-8-aws-regions-and-up-to-50-reduction-in-asynchronous-job-processing-times/" rel="noopener noreferrer"&gt;&lt;strong&gt;Amazon Textract announcements price reductions, reduction in processing time for asynchronous operations up to 50% worldwide, US FedRAMP authorization&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
The usage of the AnalyzeDocument and DetectDocumentText API’s in eight AWS regions will now be billed at the same rates as prices in the US East (N.Virginia) region (not inclusive of the recently launched &lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/07/amazon-textract-announces-specialized-support-automated-processing-invoices-receipts/" rel="noopener noreferrer"&gt;AnalyzeExpense API&lt;/a&gt;), posing a price reduction of up to 32%. Based on costumer feedback, enhancements made to Textract’s asynchronous operations reduced latency by as much as 50 percent worldwide. Finally, Textract achieved US FedRAMP authorization and added IRAP compliance support. &lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/08/amazon-textract-reduced-pricing-analyzedocument-detectdocumenttext-region-expansion/" rel="noopener noreferrer"&gt;What’s New&lt;/a&gt;, &lt;a href="https://aws.amazon.com/blogs/aws/amazon-textract-updates-up-to-32-price-reduction-in-8-aws-regions-and-up-to-50-reduction-in-asynchronous-job-processing-times/" rel="noopener noreferrer"&gt;AWS News Blog&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/textract/latest/dg/what-is.html" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/08/amazon-transcribe-speech-text-new-languages/" rel="noopener noreferrer"&gt;&lt;strong&gt;Amazon Transcribe adds support for 6 new languages&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/09/amazon-lex-launches-support-korean/" rel="noopener noreferrer"&gt;&lt;strong&gt;Amazon Lex adds support for Korean&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
Amazon Transcribe now supports batch transcription in six new languages - Afrikaans, Danish, Mandarin Chinese (Taiwan), Thai, New Zealand English, and South African English. Additionally, Amazon Lex has just added support for Korean. &lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/08/amazon-transcribe-speech-text-new-languages/" rel="noopener noreferrer"&gt;What’s New (Transcribe)&lt;/a&gt;, &lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/09/amazon-lex-launches-support-korean/" rel="noopener noreferrer"&gt;What’s New (Lex)&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/transcribe/latest/dg/transcribe-whatis.html" rel="noopener noreferrer"&gt;Transcribe Documentation&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/lexv2/latest/dg/what-is.html" rel="noopener noreferrer"&gt;Lex Documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/transcribe/latest/dg/subtitles.html" rel="noopener noreferrer"&gt;&lt;strong&gt;Amazon Transcribe can now generate subtitles for your video files&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
Amazon Transcribe now supports the generation of WebVTT (.vtt) and SubRip (.srt) output for use as video subtitles during a batch transcription job. You can select one or both options when you submit the job, and the resulting subtitle files are generated in the same destination as the underlying transcription output file. Find more details in the title link above.&lt;/p&gt;
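
&lt;p&gt;A minimal sketch of such a request via boto3's StartTranscriptionJob parameters; the job name, bucket, and file are hypothetical placeholders:&lt;/p&gt;

```python
# Hedged sketch: a StartTranscriptionJob request that asks for both
# subtitle formats. Job name, bucket, and file are hypothetical placeholders.
def build_subtitle_job_request(job_name, media_uri, formats=("vtt", "srt")):
    """Parameters for Transcribe's StartTranscriptionJob API with subtitles."""
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": "en-US",
        "Media": {"MediaFileUri": media_uri},
        # Subtitle files land in the same destination as the transcript.
        "Subtitles": {"Formats": list(formats)},
    }

params = build_subtitle_job_request(
    "lecture-001", "s3://my-bucket/videos/lecture.mp4"
)
# import boto3
# boto3.client("transcribe").start_transcription_job(**params)
```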

&lt;p&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/09/amazon-transcribe-pii-streaming-transcriptions/" rel="noopener noreferrer"&gt;&lt;strong&gt;Amazon Transcribe now supports redaction of personal identifiable information (PII) for streaming transcriptions&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
You can now use Amazon Transcribe to automatically identify and redact PII - such as Social Security numbers, credit card/bank account information, and contact information (i.e. name, email address, phone number, and mailing address) - from your streaming transcription results. In addition, granular PII categories are now provided, instead of the single [PII] tag available when redacting PII in a batch transcription job. With this new feature, companies can provide their contact center agents with valuable transcripts of ongoing conversations while maintaining privacy standards. &lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/09/amazon-transcribe-pii-streaming-transcriptions/" rel="noopener noreferrer"&gt;What’s New&lt;/a&gt;, &lt;a href="https://aws.amazon.com/blogs/machine-learning/introducing-pii-identification-and-redaction-in-streaming-transcriptions-using-amazon-transcribe/" rel="noopener noreferrer"&gt;AWS ML Blog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/09/amazon-comprehend-extract-entities-native-format/" rel="noopener noreferrer"&gt;&lt;strong&gt;Extract custom entities from documents in their native format with Amazon Comprehend&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
Amazon Comprehend now allows you to extract custom entities from documents in a variety of formats (PDF, Word, plain text) and layouts (e.g., bullets, lists). Prior to this announcement, you could only use Comprehend on plain text documents, which required you to flatten documents into machine-readable text; this feature combines the power of NLP and Optical Character Recognition (OCR) to extract custom entities from your documents using the same API and with no preprocessing required. &lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/09/amazon-comprehend-extract-entities-native-format/" rel="noopener noreferrer"&gt;What’s New&lt;/a&gt;, &lt;a href="https://aws.amazon.com/blogs/machine-learning/extract-custom-entities-from-documents-in-their-native-format-with-amazon-comprehend/" rel="noopener noreferrer"&gt;Getting Started (blog)&lt;/a&gt;, &lt;a href="https://aws.amazon.com/blogs/machine-learning/custom-document-annotation-for-extracting-named-entities-in-documents-using-amazon-comprehend/" rel="noopener noreferrer"&gt;Document Annotation for new feature (blog)&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;Blog posts/demos&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/transcribe-class-lectures-accurately-using-amazon-transcribe-with-custom-language-models/" rel="noopener noreferrer"&gt;&lt;strong&gt;Boost transcription accuracy of class lectures with custom language models for Amazon Transcribe&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
Practical example of how training a custom language model in Amazon Transcribe can help improve transcription accuracy on difficult specialized topics, such as biology lectures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4g5873kyisq6cibxkrr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4g5873kyisq6cibxkrr.png" alt="Alt Text" width="800" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Read more about how to leverage custom language models in the &lt;a href="https://docs.aws.amazon.com/transcribe/latest/dg/custom-language-models.html" rel="noopener noreferrer"&gt;Transcribe documentation&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;NLP on Amazon SageMaker&lt;/h2&gt;

&lt;h3&gt;Feature Releases&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/09/amazon-sagemaker-studio-inference-endpoint-testing/" rel="noopener noreferrer"&gt;&lt;strong&gt;Amazon SageMaker now supports inference endpoint testing from SageMaker Studio&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
Once a model is deployed to Amazon SageMaker, customers can get predictions from it via SageMaker real-time endpoints. Previously, customers used third-party tooling such as curl, or wrote code in Jupyter notebooks, to invoke the endpoints for inference. Now, customers can provide a JSON payload and send the inference request to the endpoint from within SageMaker Studio; the results are displayed there and can be downloaded for further analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/announcing-the-amazon-s3-plugin-for-pytorch/" rel="noopener noreferrer"&gt;&lt;strong&gt;Amazon S3 plugin for PyTorch&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
This is an open-source library built for use with the deep learning framework PyTorch for streaming data from Amazon S3. The feature is also available in the PyTorch Deep Learning Containers, and with it you can use data from S3 buckets directly with the PyTorch dataset and dataloader APIs without needing to download it to local storage first. &lt;a href="https://aws.amazon.com/blogs/machine-learning/announcing-the-amazon-s3-plugin-for-pytorch/" rel="noopener noreferrer"&gt;AWS ML Blog&lt;/a&gt;, &lt;a href="https://github.com/aws/amazon-s3-plugin-for-pytorch" rel="noopener noreferrer"&gt;Plugin Github&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;Blog posts/demos&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/aws-samples/detecting-data-drift-in-nlp-using-amazon-sagemaker-custom-model-monitor" rel="noopener noreferrer"&gt;&lt;strong&gt;Detecting Data Drift in NLP using SageMaker Custom Model Monitor&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
Detecting data drift in NLP is a challenging task. Model monitoring is an important aspect of MLOps, because a change in data distribution from the training corpus to real-world data at inference time can cause model performance to decay. This distribution shift is called data drift. This demo focuses on detecting that drift, making use of the custom monitoring capabilities of SageMaker Model Monitor.&lt;/p&gt;




&lt;h2&gt;Upcoming events&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.nlpsummit.org/nlp-2021/" rel="noopener noreferrer"&gt;&lt;strong&gt;NLP Summit 2021&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
Oct 05-07, 2021 &lt;br&gt;
Join the NLP Summit: two weeks of immersive, industry-focused content. Week one will include over 30 unique sessions, with a special track on NLP in Healthcare. Week two will feature beginner to advanced training workshops with certifications. Attendees can also participate in coffee chats with speakers, committers, and industry experts. &lt;a href="https://www.nlpsummit.org/nlp-2021/" rel="noopener noreferrer"&gt;Registration&lt;/a&gt; is free.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws-startuploft-emea.com/e/dcb4f/aws-startup-accelerate-start-your-nlp-journey-on-aws-level-200-300" rel="noopener noreferrer"&gt;&lt;strong&gt;AWS Startup Accelerate: Start your NLP journey on AWS&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
Oct 11, 2021 &lt;br&gt;
AWS will be running a Technical talk on "Starting your NLP journey with AWS". Based on feedback from lead NLP ML Core startups, we see that developing NLP models is a complex and costly process, which is why we’d like to engage with Data Scientists and ML engineers to help them in their adoption journey. We would love to have you there! Register &lt;a href="https://aws-startuploft-emea.com/e/dcb4f/aws-startup-accelerate-start-your-nlp-journey-on-aws-level-200-300" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;Miscellaneous&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;🤗 Hugging Face: Hardware Partner Program, Optimum, and Infinity&lt;/strong&gt;&lt;br&gt;
A trio of announcements from Hugging Face this month:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hugging Face has launched a &lt;a href="https://huggingface.co/blog/hardware-partners-program" rel="noopener noreferrer"&gt;Hardware Partner Program&lt;/a&gt;, partnering with AI hardware accelerator companies to make state-of-the-art production performance accessible with Transformers.&lt;/li&gt;
&lt;li&gt;In this context, Hugging Face has released &lt;a href="https://huggingface.co/hardware" rel="noopener noreferrer"&gt;Optimum&lt;/a&gt;, an ML optimization toolkit that enables maximum efficiency when training and running models on specific hardware. As of today, you can use it to easily prune and/or quantize Transformer models for Intel Xeon CPUs using the Intel Low Precision Optimization Tool (LPOT), and later this year the first models &lt;a href="https://huggingface.co/blog/graphcore" rel="noopener noreferrer"&gt;optimized for Graphcore’s Intelligence Processing Unit (IPU)&lt;/a&gt; will be added.&lt;/li&gt;
&lt;li&gt;Finally, &lt;a href="https://huggingface.co/infinity" rel="noopener noreferrer"&gt;Infinity&lt;/a&gt; - Hugging Face’s enterprise-scale inference solution - was officially announced on September 28th: a containerized solution that promises Transformers’ accuracy at 1ms latency.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
