<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Heiko Hotz</title>
    <description>The latest articles on DEV Community by Heiko Hotz (@marshmellow77).</description>
    <link>https://dev.to/marshmellow77</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F696551%2F37b2b50c-9139-4400-b145-2ecb3acd09d6.jpeg</url>
      <title>DEV Community: Heiko Hotz</title>
      <link>https://dev.to/marshmellow77</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/marshmellow77"/>
    <language>en</language>
    <item>
      <title>NLP@AWS Newsletter 02/2022</title>
      <dc:creator>Heiko Hotz</dc:creator>
      <pubDate>Tue, 01 Feb 2022 05:20:10 +0000</pubDate>
      <link>https://dev.to/aws/nlpaws-newsletter-022022-2onf</link>
      <guid>https://dev.to/aws/nlpaws-newsletter-022022-2onf</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8f2s37fraqwmzit989yr.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8f2s37fraqwmzit989yr.PNG" alt=" " width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hello world. This is the monthly AWS Natural Language Processing (NLP) newsletter, covering everything related to NLP at AWS. Feel free to leave comments &amp;amp; share it on your social network.&lt;/p&gt;

&lt;h2&gt;NLP@AWS Customer Success Story&lt;/h2&gt;

&lt;p&gt;Measuring customer sentiment in call centres is a huge challenge, especially for large organisations. Accurate call transcripts can help unlock insights such as sentiment, trending issues, and how effectively agents resolve calls.&lt;/p&gt;

&lt;p&gt;Wix.com expanded visibility of customer conversation sentiment by using Amazon Transcribe, a speech-to-text service, to develop a sentiment analysis system that can effectively determine how users feel throughout an interaction with customer care agents.&lt;/p&gt;

&lt;p&gt;Learn more about this AWS customer success story in this blog post: &lt;a href="https://aws.amazon.com/blogs/machine-learning/how-wix-empowers-customer-care-with-ai-capabilities-using-amazon-transcribe/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/machine-learning/how-wix-empowers-customer-care-with-ai-capabilities-using-amazon-transcribe/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3689kk4tf8bxruc52dj.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3689kk4tf8bxruc52dj.PNG" alt=" " width="794" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;AWS AI Language Services&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/part-3-how-to-approach-conversation-design-with-amazon-lex-building-and-testing/" rel="noopener noreferrer"&gt;&lt;strong&gt;How to approach conversation design with Amazon Lex&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
In this blog post you will learn how to draft an interaction model to deliver natural conversational experiences, and how to test and tune your application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzuu8sjcd38042ylydmd.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzuu8sjcd38042ylydmd.PNG" alt=" " width="796" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;NLP on Amazon SageMaker&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/detect-nlp-data-drift-using-custom-amazon-sagemaker-model-monitor/" rel="noopener noreferrer"&gt;&lt;strong&gt;Detecting NLP Data Drift with SageMaker&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;NLP models can be an extremely effective tool for extracting information from unstructured text data. When the data that is used for inference (production data) differs from the data used during model training, we encounter a phenomenon known as data drift. When data drift occurs, the model is no longer relevant to the data in production and likely performs worse than expected. It’s important to continuously monitor the inference data and compare it to the data used during training. &lt;/p&gt;
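
&lt;p&gt;As a minimal illustration of the idea (not the blog post’s actual implementation), drift between training and production text can be quantified by comparing token frequency distributions:&lt;/p&gt;

```python
# Illustrative sketch: quantify drift between training text and production
# text by comparing token frequency distributions with Jensen-Shannon
# divergence (in bits, bounded between 0 and 1).
import math
from collections import Counter

def token_distribution(texts, vocab):
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts[v] for v in vocab) or 1
    return [counts[v] / total for v in vocab]

def js_divergence(p, q):
    def kl(a, b):
        # Skip zero entries of a; b is the mixture, positive wherever a is.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x and y)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

train = ["the service was great", "great support team"]
prod = ["refund refund refund", "cancel my subscription"]
vocab = sorted(set(tok for t in train + prod for tok in t.split()))
score = js_divergence(token_distribution(train, vocab),
                      token_distribution(prod, vocab))
# A score near 0 means similar distributions; near 1 signals strong drift.
```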

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbipmrf2g021xeoykcmyd.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbipmrf2g021xeoykcmyd.PNG" alt=" " width="789" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/distributed-fine-tuning-of-a-bert-large-model-for-a-question-answering-task-using-hugging-face-transformers-on-amazon-sagemaker/" rel="noopener noreferrer"&gt;&lt;strong&gt;Distributed fine-tuning of a BERT model on SageMaker&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hugging Face has been working closely with SageMaker to deliver ready-to-use Deep Learning Containers (DLCs) that make training and deploying the latest Transformers models easier and faster than ever. Features such as SageMaker Data Parallel (SMDP), SageMaker Model Parallel (SMMP), and S3 pipe mode are integrated into these containers, which drastically reduces the time it takes companies to create Transformers-based ML solutions such as question answering, text and image generation, search-result optimization, customer support automation, conversational interfaces, semantic search, document analysis, and many more applications.&lt;/p&gt;

&lt;p&gt;In this post, we focus on the deep integration of SageMaker distributed libraries with Hugging Face, which enables data scientists to accelerate training and fine-tuning of Transformers models from days to hours, all in SageMaker.&lt;/p&gt;
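
&lt;p&gt;As a hedged sketch (the keys follow the SageMaker Python SDK; the versions, instance type, and role ARN are placeholders), the data-parallel setup described here boils down to passing a distribution configuration to the Hugging Face estimator:&lt;/p&gt;

```python
# Illustrative configuration for SageMaker distributed data parallelism with
# the Hugging Face estimator (sagemaker.huggingface.HuggingFace). An AWS
# account, execution role, and a Trainer-based training script are assumed.
def hf_estimator_kwargs(entry_point, role_arn):
    return {
        "entry_point": entry_point,         # e.g. a train.py using Trainer
        "role": role_arn,
        "instance_type": "ml.p3.16xlarge",  # multi-GPU instance (example)
        "instance_count": 2,
        "transformers_version": "4.12",     # example versions
        "pytorch_version": "1.9",
        # Enables the SageMaker Data Parallel (SMDP) library for the job.
        "distribution": {"smdistributed": {"dataparallel": {"enabled": True}}},
    }

kwargs = hf_estimator_kwargs("train.py", "arn:aws:iam::123456789012:role/demo")
# These kwargs would be passed to sagemaker.huggingface.HuggingFace(**kwargs).
```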

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsg9604ko7076uoezmc3y.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsg9604ko7076uoezmc3y.PNG" alt=" " width="793" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;NLP@AWS Community Content&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.philschmid.de/hugginface-sagemaker-workshop" rel="noopener noreferrer"&gt;&lt;strong&gt;Enterprise-Scale NLP with Hugging Face &amp;amp; Amazon SageMaker&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In October and November, AWS &amp;amp; Hugging Face held a workshop series on “Enterprise-Scale NLP with Hugging Face &amp;amp; Amazon SageMaker”. The series consisted of three parts and covered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Getting Started with Amazon SageMaker: Training your first NLP Transformer model with Hugging Face and deploying it&lt;/li&gt;
&lt;li&gt;Going Production: Deploying, Scaling &amp;amp; Monitoring Hugging Face Transformer models with Amazon SageMaker&lt;/li&gt;
&lt;li&gt;MLOps: End-to-End Hugging Face Transformers with the Hub &amp;amp; SageMaker Pipelines&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The workshop was recorded and the resources are available on GitHub, so you can now work through the whole series on your own to sharpen your Hugging Face Transformers skills with Amazon SageMaker.&lt;/p&gt;

&lt;p&gt;YouTube playlist: &lt;a href="https://www.youtube.com/playlist?list=PLo2EIpI_JMQtPhGR5Eo2Ab0_Vb89XfhDJ" rel="noopener noreferrer"&gt;Hugging Face SageMaker Playlist&lt;/a&gt;&lt;br&gt;
GitHub repository: &lt;a href="https://github.com/philschmid/huggingface-sagemaker-workshop-series" rel="noopener noreferrer"&gt;huggingface-sagemaker-workshop-series&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yzhnkw9bjlqmo3vy8ks.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yzhnkw9bjlqmo3vy8ks.PNG" alt=" " width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://huggingface.co/blog/gptj-sagemaker" rel="noopener noreferrer"&gt;&lt;strong&gt;Deploy GPT-J 6B for inference using Hugging Face Transformers and Amazon SageMaker&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
GPT-J is one of the most popular open-source alternatives to GPT-3. In this blog post, you will learn how to easily deploy GPT-J using Amazon SageMaker and the Hugging Face Inference Toolkit with a few lines of code, for scalable, reliable, and secure real-time inference on a regular-sized GPU instance with an NVIDIA T4.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://towardsdatascience.com/teach-an-ai-model-to-write-like-shakespeare-for-free-a9e6a307139" rel="noopener noreferrer"&gt;&lt;strong&gt;Teach an AI Model to Write like Shakespeare — For Free&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
In this tutorial you will learn how to train an NLP model to write like Shakespeare within 5 minutes using &lt;a href="https://studiolab.sagemaker.aws/" rel="noopener noreferrer"&gt;SageMaker Studio Lab&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
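
&lt;p&gt;To make the GPT-J post concrete, here is a hedged sketch of the deployment configuration it describes. The environment variable names follow the Hugging Face Inference Toolkit; the role ARN and library versions are placeholders:&lt;/p&gt;

```python
# Sketch of the kind of configuration the GPT-J blog post describes for
# deploying with the SageMaker Python SDK and the Hugging Face Inference
# Toolkit. An AWS account and a SageMaker execution role are assumed.
def build_deploy_config(model_id, role_arn):
    return {
        # HF_MODEL_ID / HF_TASK are the toolkit's environment variables.
        "env": {"HF_MODEL_ID": model_id, "HF_TASK": "text-generation"},
        "role": role_arn,
        "transformers_version": "4.12",   # example versions
        "pytorch_version": "1.9",
        # ml.g4dn.xlarge is a regular-sized GPU instance with an NVIDIA T4.
        "instance_type": "ml.g4dn.xlarge",
        "initial_instance_count": 1,
    }

cfg = build_deploy_config("EleutherAI/gpt-j-6B",
                          "arn:aws:iam::123456789012:role/demo")
# cfg would feed sagemaker.huggingface.HuggingFaceModel(...).deploy(...).
```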

&lt;h2&gt;Stay in touch with NLP on AWS&lt;/h2&gt;

&lt;p&gt;Our contact: &lt;a href="mailto:aws-nlp@amazon.com"&gt;aws-nlp@amazon.com&lt;/a&gt;&lt;br&gt;
Email us to (1) tell us about your awesome NLP-on-AWS project, (2) let us know which post in the newsletter helped your NLP journey, or (3) suggest other things you want us to post in the newsletter. Talk to you soon.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>AWS - NLP Newsletter December 2021</title>
      <dc:creator>Heiko Hotz</dc:creator>
      <pubDate>Thu, 06 Jan 2022 11:25:04 +0000</pubDate>
      <link>https://dev.to/aws/aws-nlp-newsletter-december-2021-18o3</link>
      <guid>https://dev.to/aws/aws-nlp-newsletter-december-2021-18o3</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y5aumn2oyczsxgt28d2.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y5aumn2oyczsxgt28d2.PNG" alt=" " width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy new year, everyone! In 2021, the global Natural Language Processing (NLP) market size reached an &lt;a href="https://www.fortunebusinessinsights.com/industry-reports/natural-language-processing-nlp-market-101933" rel="noopener noreferrer"&gt;estimated &lt;strong&gt;USD 21 Billion&lt;/strong&gt;&lt;/a&gt;(!) and it is projected to grow to &lt;strong&gt;USD 127 Billion&lt;/strong&gt; in 2028 at an estimated &lt;strong&gt;CAGR of 29.4%&lt;/strong&gt;. To put these numbers into perspective: The estimated global revenue for AWS in 2021 is around USD 62 Billion.&lt;/p&gt;

&lt;p&gt;Add in strategic initiatives like our partnership with Hugging Face and the continued efforts to improve our AI Language Services and it becomes clear that NLP is, and will be for a long time, one of the most important areas to cover within AWS.&lt;/p&gt;

&lt;p&gt;In 2021, the NLP Domain set out to successfully guide customers in their NLP journeys. We also created resources and mechanisms to scale the field internally. In 2022 we will continue this effort by growing the NLP Domain, supporting NLP initiatives, and creating even more resources so that we as an organisation are prepared for the exciting NLP challenges to come.&lt;/p&gt;

&lt;h2&gt;NLP Customer Success Stories&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfwts69hxln6t9vhgn0z.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfwts69hxln6t9vhgn0z.PNG" alt=" " width="281" height="379"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;&lt;a href="https://aws.amazon.com/blogs/startups/koo-app-connects-millions-of-voices-in-their-preferred-language-with-aws/" rel="noopener noreferrer"&gt;Koo App Connects Millions of Voices in Their Preferred Language with AWS&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
When the social media revolution began, e-commerce sites mostly catered to English speakers, which left out a huge population of would-be participants. Koo, a microblogging platform based in India, noted the lack of inclusivity and made it their mission to create an app that is accessible to the entire spectrum of languages spoken in India. Koo leverages several AWS services, such as Amazon SageMaker, Aurora, EKS, and EC2 to serve millions of users on their platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.pfizer.com/news/press-release/press-release-detail/aws-helps-pfizer-accelerate-drug-development-and-clinical" rel="noopener noreferrer"&gt;AWS Helps Pfizer Accelerate Drug Development And Clinical Manufacturing&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
Also in December, AWS announced that it is working with Pfizer to create innovative, cloud-based solutions with the potential to improve how new medicines are developed, manufactured, and distributed for testing in clinical trials. To gain quick, secure access to the right information at the right time, Pfizer’s Pharmaceutical Sciences Small Molecules teams are working with AWS to develop a prototype system that can automatically extract, ingest, and process data from this documentation to help in the design of lab experiments. The prototype system is powered by Amazon Comprehend Medical (AWS’s HIPAA-eligible natural language processing (NLP) service to extract information from unstructured medical text accurately and quickly) and Amazon SageMaker, and uses Amazon Cognito to deliver secure user access control. &lt;/p&gt;

&lt;h2&gt;Updates on AWS Language Services&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Double Bill: AWS announces &lt;a href="https://aws.amazon.com/blogs/machine-learning/post-call-analytics-for-your-contact-center-with-amazon-language-ai-services/" rel="noopener noreferrer"&gt;post call analytics&lt;/a&gt; and &lt;a href="https://aws.amazon.com/blogs/machine-learning/live-call-analytics-for-your-contact-center-with-amazon-language-ai-services/" rel="noopener noreferrer"&gt;live call analytics&lt;/a&gt; with Amazon AI Language Services.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1ng8cj3j8cd18n7bkze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1ng8cj3j8cd18n7bkze.png" alt=" " width="709" height="605"&gt;&lt;/a&gt;&lt;br&gt;
This is a huge step for AWS customers that need rich analytics capabilities to transcribe and extract insights from their contact centre communications at scale. Both functionalities were already available in Contact Lens for Amazon Connect, and now customers who don’t use Amazon Connect can use them in their existing contact centres.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/live-transcriptions-of-f1-races-using-amazon-transcribe/" rel="noopener noreferrer"&gt;Live transcriptions of F1 races using Amazon Transcribe&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
The Formula 1 (F1) live streaming service, F1 TV, has live automated closed captions in three different languages: English, Spanish, and French. For the 2021 season, FORMULA 1 achieved another technological breakthrough, building a fully automated workflow to create closed captions in three languages and broadcast to 85 territories using Amazon Transcribe.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/clinical-text-mining-using-the-amazon-comprehend-medical-new-snomed-ct-api/" rel="noopener noreferrer"&gt;Clinical text mining using the Amazon Comprehend Medical new SNOMED CT API&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
This blog post describes how to use a new feature to automatically standardize and link detected concepts to the SNOMED CT (Systematized Nomenclature of Medicine—Clinical Terms) ontology. It details how to use the new SNOMED CT API to link SNOMED CT codes to medical concepts (or entities) in natural written text that can then be used to accelerate research and clinical application building.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/announcing-support-for-extracting-data-from-identity-documents-using-amazon-textract/" rel="noopener noreferrer"&gt;Support for extracting data from identity documents using Amazon Textract&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8inlhl6e7njd8s8k6v67.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8inlhl6e7njd8s8k6v67.PNG" alt=" " width="773" height="512"&gt;&lt;/a&gt;&lt;br&gt;
This blog post announces Analyze ID, a new Amazon Textract API that helps you automatically extract information from identification documents, such as U.S. passports and driver’s licenses, without the need for templates or configuration. You can automatically extract specific information, such as date of expiry and date of birth, as well as intelligently identify and extract implied information, such as name and address.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/enrich-your-content-and-metadata-to-enhance-your-search-experience-with-custom-document-enrichment-in-amazon-kendra/" rel="noopener noreferrer"&gt;Custom document enrichment in Amazon Kendra&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
Amazon Kendra customers can now enrich document metadata and content during the document ingestion process using custom document enrichment (CDE). Organizations often have large repositories of raw documents that can be improved for search by modifying content or adding metadata before indexing. So how does CDE help? By simplifying the process of creating, modifying, or deleting document metadata and content before they’re ingested into Amazon Kendra. This can include detecting entities from text, extracting text from images, transcribing audio and video, and more by creating custom logic or using services like Amazon Comprehend, Amazon Textract, Amazon Transcribe, Amazon Rekognition, and others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/expedite-conversation-design-with-the-automated-chatbot-designer-in-amazon-lex/" rel="noopener noreferrer"&gt;Expedite conversation design with the automated chatbot designer in Amazon Lex&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
The automated chatbot designer expands the usability of Amazon Lex to the design phase. It uses machine learning (ML) to provide an initial bot design that you can then refine, helping you launch conversational experiences faster. With the automated chatbot designer, Amazon Lex customers and partners get an easy and intuitive way of designing chatbots and can reduce bot design time from weeks to hours.&lt;/p&gt;

&lt;h2&gt;NLP on Amazon SageMaker&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/build-custom-amazon-sagemaker-pytorch-models-for-real-time-handwriting-text-recognition/" rel="noopener noreferrer"&gt;Build custom Amazon SageMaker PyTorch models for real-time handwriting text recognition&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6cr0co0qnqwrg8cj1nc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6cr0co0qnqwrg8cj1nc.jpg" alt=" " width="800" height="536"&gt;&lt;/a&gt;&lt;br&gt;
Unlike standard text recognition that can be trained on documents with typed content or synthetic datasets that are easy to generate and inexpensive to obtain, handwriting recognition (HWR) comes with many challenges. These challenges include variability in writing styles, the low quality of old scanned documents, and the difficulty of collecting good-quality labeled training datasets, which can be expensive. In this post, we share the processes, scripts, and best practices to develop a custom ML model in Amazon SageMaker that applies deep learning (DL) techniques based on the concept outlined in the paper &lt;a href="https://assets.amazon.science/38/fe/4c3105fb43129bf59cc0aadb5d78/gnhk-a-dataset-for-english-handwriting-in-the-wild.pdf" rel="noopener noreferrer"&gt;GNHK: A Dataset for English Handwriting in the Wild&lt;/a&gt; to transcribe text in images of handwritten passages into strings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/achieve-35-faster-training-with-hugging-face-deep-learning-containers-on-amazon-sagemaker/" rel="noopener noreferrer"&gt;Achieve 35% faster training with Hugging Face Deep Learning Containers on Amazon SageMaker&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
This post shows how to pretrain an NLP model (ALBERT) on Amazon SageMaker by using the Hugging Face Deep Learning Container (DLC) and the transformers library. We also demonstrate how the SageMaker distributed data parallel (SMDDP) library can provide up to a 35% faster training time compared with PyTorch’s distributed data parallel (DDP) library.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler.html" rel="noopener noreferrer"&gt;Amazon SageMaker Training Compiler can accelerate training by up to 50%&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
State-of-the-art NLP models consist of complex multi-layered neural networks with billions of parameters that can take thousands of GPU hours to train. Optimizing such models on training infrastructure requires extensive knowledge of DL and systems engineering; this is challenging even for narrow use cases. SageMaker Training Compiler is a capability of SageMaker that makes these hard-to-implement optimizations to reduce training time on GPU instances. The compiler optimizes DL models to accelerate training by more efficiently using SageMaker machine learning (ML) GPU instances.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html" rel="noopener noreferrer"&gt;Amazon SageMaker Serverless inference for intermittent usage patterns&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cvgcqd4xspvzdphbmtg.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cvgcqd4xspvzdphbmtg.PNG" alt=" " width="800" height="252"&gt;&lt;/a&gt;&lt;br&gt;
Amazon SageMaker Serverless Inference is a purpose-built inference option that makes it easy for you to deploy and scale ML models. Serverless Inference is ideal for workloads which have idle periods between traffic spurts and can tolerate cold starts. Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic, eliminating the need to choose instance types or manage scaling policies. This takes away the undifferentiated heavy lifting of selecting and managing servers. Serverless Inference integrates with AWS Lambda to offer you high availability, built-in fault tolerance and automatic scaling.&lt;/p&gt;
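
&lt;p&gt;For a concrete picture, this sketch builds the endpoint-configuration shape that Serverless Inference uses (the boto3 &lt;code&gt;create_endpoint_config&lt;/code&gt; keys); the names and values here are illustrative:&lt;/p&gt;

```python
# Illustrative endpoint-config payload for SageMaker Serverless Inference.
# With a ServerlessConfig block, no instance type or scaling policy is set;
# memory size drives the compute allocated to the endpoint.
def serverless_endpoint_config(name, model_name,
                               memory_mb=2048, max_concurrency=5):
    return {
        "EndpointConfigName": name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "ServerlessConfig": {
                "MemorySizeInMB": memory_mb,
                "MaxConcurrency": max_concurrency,
            },
        }],
    }

cfg = serverless_endpoint_config("nlp-serverless", "my-text-model")
# cfg would be passed to boto3:
# sagemaker_client.create_endpoint_config(**cfg)
```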

&lt;h2&gt;Community content&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://towardsdatascience.com/setting-up-a-text-summarisation-project-daae41a1aaa3" rel="noopener noreferrer"&gt;Setting up a text summarisation project on Amazon SageMaker&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
This tutorial serves as a practical guide for diving deep into text summarisation. It was born out of a customer engagement where the customer wanted to know how to go about setting up a text summarisation project. While there are many impressive demos on text summarisation out there, they are not well suited for actually experimenting with different models and hyperparameters. To do that, organisations need to set up their own experimentation pipeline. The tutorial is divided into several steps that build up this pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using a no-ML “model” to establish a baseline&lt;/li&gt;
&lt;li&gt;Generating summaries with a zero-shot model&lt;/li&gt;
&lt;li&gt;Training a summarisation model&lt;/li&gt;
&lt;li&gt;Evaluating the trained model&lt;/li&gt;
&lt;/ul&gt;
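
&lt;p&gt;The no-ML baseline in the first step above can be as simple as the classic lead-n heuristic, sketched here (the tutorial’s own baseline may differ):&lt;/p&gt;

```python
# Lead-n baseline: "summarise" a document by keeping its first n sentences.
# Surprisingly hard to beat on news-style text, which makes it a useful
# yardstick before training any model.
def lead_n_summary(text, n=3):
    # Naive split on ". "; a real pipeline would use a sentence tokenizer.
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    return ". ".join(sentences[:n]).rstrip(".") + "."

article = ("AWS released a new service. It targets NLP workloads. "
           "Customers reacted positively. Pricing was also announced.")
print(lead_n_summary(article, 2))
# prints "AWS released a new service. It targets NLP workloads."
```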

&lt;p&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/blog/codeparrot" rel="noopener noreferrer"&gt;Building a language models from scratch&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqiziz9pgpta1554x8qvn.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqiziz9pgpta1554x8qvn.PNG" alt=" " width="800" height="566"&gt;&lt;/a&gt;&lt;br&gt;
This blog post takes a look at what it takes to build the technology behind GitHub Copilot, an application that provides suggestions to programmers as they code. In this step-by-step guide, we'll learn how to train a large GPT-2 model called CodeParrot, entirely from scratch. CodeParrot can auto-complete your Python code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://towardsdatascience.com/a-2021-nlp-retrospective-b6f51e60026a" rel="noopener noreferrer"&gt;A 2021 NLP Retrospective&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
Much has happened in the field of Natural Language Processing (NLP) in the past year and this blog post reflects on some of the NLP highlights of 2021.&lt;/p&gt;

&lt;h2&gt;Stay in touch with NLP on AWS&lt;/h2&gt;

&lt;p&gt;Our contact: &lt;a href="mailto:aws-nlp@amazon.com"&gt;aws-nlp@amazon.com&lt;/a&gt;&lt;br&gt;
Email us to (1) tell us about your awesome NLP-on-AWS project, (2) let us know which post in the newsletter helped your NLP journey, or (3) suggest other things you want us to post in the newsletter. Talk to you soon.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
