Michael Stainsbury

Originally published at mlexam.com

AWS SageMaker BlazingText Algorithm

BlazingText is AWS's SageMaker built-in algorithm for identifying relationships between words in text documents. These relationships, also called embeddings, are expressed as vectors. The vectors preserve the semantic relationships between words by clustering words with similar meanings together. This conversion of words into meaningful numeric vectors is valuable because Natural Language Processing requires input data in vector format, which is why BlazingText is often used as a pre-processing step for Natural Language Processing.

Word2Vec is used to pre-process text documents for use by other systems, for example sentiment analysis or machine translation from one language to another. Word2Vec generates a numerical representation of each word called an embedding. Embeddings capture the relationships between words, so king, queen and president would be closely related. These relationships are then used by Natural Language Processing systems. BlazingText is a highly optimised implementation of the Word2Vec algorithm, which Google published in 2013; it is compatible with Facebook's fastText.
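To see why embeddings are useful, here is a minimal sketch in plain Python and NumPy. The vectors are toy values invented for illustration (real BlazingText embeddings typically have 100 or more dimensions); the point is that cosine similarity, the usual measure of relatedness between word vectors, is high for semantically related words:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: close to 1.0 means similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings, invented for illustration only.
king = np.array([0.9, 0.8, 0.1, 0.2])
queen = np.array([0.85, 0.75, 0.2, 0.1])
banana = np.array([0.1, 0.2, 0.9, 0.8])

print(cosine_similarity(king, queen))   # high: semantically related words
print(cosine_similarity(king, banana))  # low: unrelated words
```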

Text Classification is used to classify documents and to support search engines and document ranking. Text Classification builds on the embeddings generated by Word2Vec.

This article contains revision notes for the AWS certified exam MLS-C01, Machine Learning — Specialty.

What does the BlazingText algorithm do

BlazingText is used for text analysis and text classification problems. It is the only SageMaker built-in algorithm to offer both unsupervised and supervised learning modes: Word2Vec is unsupervised learning and Text Classification is supervised learning.

  • Word2Vec — unsupervised learning
  • Text Classifier — supervised learning

Usually for Text Classification you would pre-process the data by passing it through a Word2Vec algorithm and then a Text Classifier. The BlazingText algorithm implements the Word2Vec and Text Classifier stages as a single process.

How is BlazingText implemented

BlazingText is a SageMaker built-in algorithm, so it can be trained via SageMaker Jupyter Notebooks and deployed on SageMaker endpoints. BlazingText processes text data. The input data is presented in a single file with one sentence per line.
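As a rough sketch of what training looks like with the SageMaker Python SDK (v2), hedged because the IAM role, bucket path and hyperparameter values below are placeholders rather than values from this article:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
region = session.boto_region_name

# Retrieve the built-in BlazingText container image for this region.
container = image_uris.retrieve("blazingtext", region)

estimator = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.c5.4xlarge",
    sagemaker_session=session,
)

# "supervised" selects Text Classification; the Word2Vec modes are
# "skipgram", "cbow" and "batch_skipgram".
estimator.set_hyperparameters(mode="supervised", epochs=10)

# Training data: one sentence per line, each prefixed with __label__<id>.
estimator.fit({"train": "s3://my-bucket/blazingtext/train"})  # placeholder path
```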

What are the training data formats for BlazingText

There are two input file formats:

  1. File Mode
  2. Augmented Manifest Text (AMT) format

The data in File Mode is text with space-separated words and one sentence per line. Each line begins with a label like this:

__label__1

The data in Augmented Manifest Text format is JSON Lines: each line is a JSON object containing a source sentence and a label, which can be a single value or a JSON array of values. Here are some examples:

A single line in File Mode:

__label__1 Our aim is to increase the year-round consumption of berries in the UK, working closely with British growers during the spring and summer months, and collaborating with UK importers and overseas exporters during winter and early spring.

A single JSON line in Augmented Manifest Text format:

{"source":"Our aim is to increase the year-round consumption of berries in the UK, working closely with British growers during the spring and summer months, and collaborating with UK importers and overseas exporters during winter and early spring","label":1}

A single JSON line with multiple labels expressed as a JSON array, in Augmented Manifest Text format:

{"source":"Our aim is to increase the year-round consumption of berries in the UK, working closely with British growers during the spring and summer months, and collaborating with UK importers and overseas exporters during winter and early spring","label":[1,3]}
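As a small sketch of preparing both formats from labelled sentences (plain Python; the naive `split` tokenisation is for illustration only, and the file names are arbitrary):

```python
import json

samples = [
    (1, "Our aim is to increase the year-round consumption of berries in the UK."),
    (2, "We collaborate with UK importers and overseas exporters during winter."),
]

# File Mode: "__label__<id>" followed by space-separated words, one sentence per line.
with open("train.txt", "w") as f:
    for label, sentence in samples:
        tokens = sentence.lower().split()  # naive tokenisation for illustration
        f.write(f"__label__{label} {' '.join(tokens)}\n")

# Augmented Manifest Text: one JSON object per line (JSON Lines).
with open("train.jsonl", "w") as f:
    for label, sentence in samples:
        f.write(json.dumps({"source": sentence, "label": label}) + "\n")
```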

Model artifacts and inference

BlazingText uses different artifacts depending on its processing mode. This table summarises the file names and formats.

Artifacts and files used by BlazingText


Word2Vec

  • Model binaries: vectors.bin
  • Supporting artifacts: vectors.txt, eval.json (optional)
  • Request format: JSON
  • Result: List of vectors; a vector of zeros if a word is not in the vocabulary

Text Classification

  • Model binaries: model.bin
  • Supporting artifacts: none
  • Request format: JSON
  • Result: One prediction
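A hedged sketch of querying deployed endpoints in both modes with boto3 (the endpoint names are placeholders; the request shapes follow the AWS BlazingText documentation):

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def invoke(endpoint_name: str, payload: dict) -> dict:
    """POST a JSON payload to a SageMaker endpoint and decode the JSON reply."""
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())

# Word2Vec: send a list of words, get one vector per word back
# (a vector of zeros for out-of-vocabulary words).
vectors = invoke("my-word2vec-endpoint", {"instances": ["awesome", "blazing"]})

# Text Classification: send sentences; "k" requests the top-k labels.
labels = invoke(
    "my-classifier-endpoint",
    {"instances": ["Convair was an american aircraft manufacturer ."],
     "configuration": {"k": 2}},
)
```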

Processing environment

BlazingText can be run on a single CPU or GPU instance, or multiple CPU instances. The choice depends on the type of processing being performed. Word2Vec has three processing methods:

  1. Skip-gram
  2. Continuous Bag Of Words (CBOW)
  3. Batch Skip-gram

Skip-gram and CBOW are the inverse of each other: in skip-gram mode you supply a word and the model predicts the context of that word, while with CBOW you provide the context and a predicted word is returned. Batch skip-gram is a variant of skip-gram that can be distributed across multiple CPU instances. A toy sketch of the training pairs skip-gram learns from is shown below.
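This is not BlazingText code, just a plain-Python illustration (window size and sentence chosen arbitrarily) of the (target, context) pairs the skip-gram objective trains on; CBOW would simply swap each pair around:

```python
def skipgram_pairs(sentence: str, window: int = 2):
    """Yield the (target word, context word) pairs skip-gram trains on."""
    words = sentence.lower().split()
    for i, target in enumerate(words):
        # Context words sit within `window` positions either side of the target.
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, words[j]

for pair in skipgram_pairs("the queen rules the kingdom"):
    print(pair)  # ('the', 'queen'), ('the', 'rules'), ('queen', 'the'), ...
```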

Below is a summary showing the processing methods supported by each instance configuration in BlazingText.

Word2Vec

  • Single CPU instance: Skip-gram, CBOW, Batch skip-gram
  • Single GPU instance (with 1 or more GPUs): Skip-gram, CBOW
  • Multiple CPU instances: Batch skip-gram only

Text Classification

  • Single CPU instance: Yes
  • Single GPU instance (with 1 or more GPUs): Yes
  • Multiple CPU instances: No

From this summary you can see that all processing methods can be performed on a single CPU instance. Only Word2Vec using the batch skip-gram method can run on multiple CPU instances, and this method cannot utilise GPUs.
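For example, a distributed batch skip-gram job is the one configuration that scales out, and it must use CPU instances; a sketch with placeholder role and data path:

```python
from sagemaker import image_uris
from sagemaker.estimator import Estimator

container = image_uris.retrieve("blazingtext", "eu-west-1")  # example region

estimator = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder role
    instance_count=2,                # multiple CPU instances: batch_skipgram only
    instance_type="ml.c5.4xlarge",   # CPU instances: batch_skipgram cannot use GPUs
)
estimator.set_hyperparameters(mode="batch_skipgram", vector_dim=100)
estimator.fit({"train": "s3://my-bucket/blazingtext/corpus"})  # placeholder path
```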

What are BlazingText’s strengths and weaknesses

The strength of BlazingText is high performance. BlazingText is more than 20x faster than other popular alternatives such as Facebook's fastText. This enables inference to be done in real time for online transactions rather than in batch. The main weakness of BlazingText is handling words that were not present in the training data. These are called Out Of Vocabulary (OOV) words, and typically such words are marked as Unknown. There are other implementations of Word2Vec, but they do not have the high performance of BlazingText.
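One mitigation documented by AWS (and used in the subwords example linked later in this article) is to train with subword embeddings, so that vectors for unseen words can be composed from character n-grams. Continuing the estimator sketch above:

```python
# Learn character n-gram (subword) embeddings alongside word embeddings,
# so the model can compose a vector for words never seen during training.
estimator.set_hyperparameters(mode="skipgram", subwords=True)
```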

What is the Use Case for BlazingText

BlazingText can only ingest words, so the input data must be text. Word2Vec converts that text into the vector representation required by Natural Language Processing systems.

Word2Vec:

  • Sentiment analysis
  • Named entity recognition
  • Machine translation

Text classification:

  • Web searches
  • Information retrieval
  • Ranking
  • Document classification

Video: AWS re:Invent 2019: Natural language modeling with Amazon SageMaker BlazingText algorithm (AIM375-P)


This is a 50:36 video from AWS by Denis Batalov. The presentation can be split into four parts, as shown in the timestamps below. I suggest you skip the first two parts and start with the overview of SageMaker BlazingText at 17:13. This is the link to the Jupyter Notebook used in the demo (part 4):

SageMaker notebook on Github: https://github.com/dbatalov/wikipedia-embedding
0:00 — Introduction
2:17 — Word embedding
2:56 — Word representations
3:43 — One hot encoding
4:37 — Intuition: given a sentence, try to maximise the probability of predicting the context of words
6:20 — Word2Vec algorithm
8:20 — t-SNE diagram
9:23 — Overview of Amazon SageMaker
12:20 — Build, train and deploy ML models
13:16 — Built-in algorithms
14:10 — Deep learning frameworks
15:17 — Automatic Model Tuning
16:27 — Amazon SageMaker Neo
17:13 — Overview of SageMaker BlazingText
18:28 — BlazingText highlights
18:45 — Optimization on CPU: negative samples sharing
19:40 — Throughput characteristics
20:35 — BlazingText benchmarking
23:00 — Demo: Georgian Wikipedia
Selected articles with examples of BlazingText being used

This article, by Evan Harris, describes the usefulness of having a website search feature tuned to the specific vocabulary used on the website. The example Evan uses is a search for a specific grape variety, which returns a list of wines made from that variety.

https://medium.com/building-ibotta/heating-up-word2vec-blazingtext-for-real-time-search-c2121bd1396

This article has a good worked example of BlazingText being used:

https://t-redactyl.io/blog/2020/09/training-and-evaluating-a-word2vec-model-using-blazingtext-in-sagemaker.html

This article is a worked example of using BlazingText in Word2Vec mode: Training Word Embeddings On AWS SageMaker Using BlazingText by Roald Schuring.

https://towardsdatascience.com/training-word-embeddings-on-aws-sagemaker-using-blazingtext-93d0a0838212

This example, from AWS, uses a method to enable BlazingText to generate vectors for out-of-vocabulary (OOV) words.

https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/blazingtext_word2vec_subwords_text8/blazingtext_word2vec_subwords_text8.html

This is an example SageMaker notebook on GitHub which uses a dataset derived from Wikipedia.

https://github.com/aws/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/blazingtext_text_classification_dbpedia/blazingtext_text_classification_dbpedia.ipynb

Video: Amazon SageMaker’s Built-in Algorithm Webinar Series: Blazing Text


This is a 1:14:36 video from AWS by Pratap Ramamurthy. It is a very long video, so use the timestamps below to select the part you wish to see.

0:00 — Introduction
2:19 — What are Amazon algorithms
3:08 — BlazingText algorithms
3:17 — BlazingText use case
4:16 — Typical deep learning task on text
5:36 — Integer encoding
9:20 — One hot encoding
14:00 — Requirements for word vectors
16:32 — Word2Vec mechanism
16:42 — Word2Vec setup
18:07 — Skip-gram preprocessing
20:30 — Neural network setup
25:38 — BlazingText word embedding
27:35 — Word vectors used for further ML training
28:20 — Intuition
28:25 — Random or is there a pattern? (t-SNE plot)
31:14 — Distance between related words
32:26 — How did the magic work?
35:08 — OOV handling using BlazingText
39:38 — Subword detection
41:43 — Text classification with BlazingText
42:18 — Typical NLP pipeline
44:25 — Parameters
47:43 — Demo
1:00:11 — Questions

Summary

BlazingText is a high-performance algorithm for analyzing text. Its two processing modes produce either numeric vectors for Natural Language Processing, via the Word2Vec algorithm (which can infer the context of a word or a word from its context), or text classifications.

Resources

These revision notes support subdomain 3.2 Select the appropriate model(s) for a given machine learning problem of the AWS certification exam: AWS Certified Machine Learning — Specialty (MLS-C01).

3.2 Select the appropriate model(s) for a given machine learning problem.
Xgboost, logistic regression, K-means, linear regression, decision trees, random forests, RNN, CNN, Ensemble, Transfer learning. Express intuition behind models
AWS Certified Machine Learning — Specialty, (MLS-C01) Exam Guide

AWS Certified Machine Learning exam guide
Domain 3 Modeling articles index
3.2 Text processing algorithms
Questions for SageMaker built-in algorithms and their uses
Free Practice exam with 65 questions
Overview

AWS docs: https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html
Wikipedia Word2vec: https://en.wikipedia.org/wiki/Word2vec
Google original papers from 2013: https://arxiv.org/abs/1301.3781
Google original papers from 2013: https://arxiv.org/abs/1310.4546
Training data format resources

Augmented Manifest Text (AMT) format: https://docs.aws.amazon.com/sagemaker/latest/dg/augmented-manifest.html
Json lines format: http://jsonlines.org/
Text examples from https://www.britishsummerfruits.co.uk/about
Processing environment

https://aws.amazon.com/blogs/machine-learning/enhanced-text-classification-and-word-vectors-using-amazon-sagemaker-blazingtext

Credits

Burning book photo by Gaspar Uhas on Unsplash

Originally published at www.mlexam.com on March 2, 2021.
