Matt Lewis for AWS Heroes

AWS Certified Machine Learning Engineer Core Concepts

Overview

Earlier this year I passed the AWS Machine Learning Engineer - Associate exam. I spent time making sure I understood the core concepts before taking the exam, and made a lot of notes. The intent of this post is to summarise the concepts essential to pass the exam. Based on my experience, knowing these concepts will get you at least half way there. Layer on knowledge of the AWS AI Services, and focus on SageMaker and all its capabilities, and the certification will be yours.

Data Preprocessing

Data preprocessing ensures that data is in the right shape and of the right quality to be used for training.

Labelling data is important for models to learn effectively, and this is where services such as Mechanical Turk and SageMaker Ground Truth come in. Mechanical Turk is an online marketplace to access an on-demand global workforce. SageMaker Ground Truth provides built-in workflows to automate data labelling, and can use your own workforce, third-party vendors from the AWS marketplace, or Mechanical Turk.

Cleaning data can include removing outliers and duplicates, replacing inaccurate or irrelevant data, and correcting missing data.

Approaches for imputing missing data include:

  • Mean replacement – replacing the missing values with the mean value from the rest of the column. The mean value is the average, which means it can be distorted by outliers. Therefore, the median value (which is the middle value when data is sorted) may be a better choice

  • KNN (K-Nearest Neighbours) – find the K nearest (most similar) rows and average their values. This assumes numeric data.
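
Here is a minimal sketch of both approaches, assuming scikit-learn and pandas are available and using a small made-up dataset:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# toy dataset with missing values (illustrative only)
df = pd.DataFrame({"age": [25, 32, np.nan, 51, 46],
                   "income": [48_000, np.nan, 61_000, 75_000, 52_000]})

# mean / median replacement
mean_imputed = SimpleImputer(strategy="mean").fit_transform(df)
median_imputed = SimpleImputer(strategy="median").fit_transform(df)

# KNN imputation: fill each gap using the average of the k most similar rows
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(df)
```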

Balancing data (for datasets with underrepresented categories) can be achieved using one of the following methods:

  • Random oversampling – this method randomly duplicates samples from the minority category. For example, if you were building a fraud detection model and you had 1000 examples of normal transactions and only 50 of fraudulent transactions, you would duplicate the fraudulent transactions until you had an equal proportion

  • Random Undersampling – this method randomly removes samples from the overrepresented category to achieve an equal proportion. This is typically used when you have a large dataset, or when you want to reduce its size to make training the model quicker

  • Synthetic Minority Oversampling Technique (SMOTE) – this approach generates new synthetic samples of the minority category by interpolating between existing samples using nearest neighbours.
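
All three techniques are implemented in the imbalanced-learn library (an assumption here; it is not part of scikit-learn itself) behind the same fit_resample interface. A minimal sketch on a made-up imbalanced dataset:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# toy dataset: roughly 95% "normal" (0) and 5% "fraud" (1)
X, y = make_classification(n_samples=1050, weights=[0.95, 0.05], random_state=0)
print(Counter(y))

X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)     # duplicate minority rows
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)  # drop majority rows
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)               # synthesise minority rows
print(Counter(y_over), Counter(y_under), Counter(y_smote))
```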

Encoding is the process of converting data (typically categorical data, where a value represents a category or group) into a numerical format that a model can understand. The main types of encoding are the following:

  • Label Encoding – assigns each category a unique number e.g. Red=0, Green=1, Blue=2. There is no order implied by this encoding

  • One-Hot Encoding – creates a binary column for each category. If there is a feature called colour, an additional column is created for each unique value (one for Red, one for Green and one for Blue), and the value in each column is 1 if it applies to that row, else 0 (see the sketch after this list).

  • Ordinal Encoding – this is similar to label encoding but is used when there is a ranked ordering between values in a category. For a category called ‘size’ you could map Small to 0, Medium to 1 and Large to 2.
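
A minimal sketch of the three encodings using pandas and scikit-learn (the colour and size columns are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

df = pd.DataFrame({"colour": ["Red", "Green", "Blue", "Green"],
                   "size": ["Small", "Large", "Medium", "Small"]})

# label encoding: an arbitrary integer per category (no order implied)
df["colour_label"] = LabelEncoder().fit_transform(df["colour"])

# one-hot encoding: one binary column per unique value
one_hot = pd.get_dummies(df["colour"], prefix="colour")

# ordinal encoding: integers that respect a defined ranking
df["size_ordinal"] = OrdinalEncoder(
    categories=[["Small", "Medium", "Large"]]).fit_transform(df[["size"]])
```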

Outliers are data points in a data set that deviate significantly from the general patterns. One way of detecting outliers in training data is to measure how many standard deviations a data point is from the mean of the dataset. This is often called a z-score or standard score. Data points that lie several standard deviations from the mean (a common rule of thumb is more than two or three) can be considered outliers.
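
A minimal z-score sketch with NumPy, using generated data with one obvious planted outlier:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=50, scale=5, size=200), [120])  # 120 is the planted outlier

z_scores = (data - data.mean()) / data.std()   # standard deviations from the mean
outliers = data[np.abs(z_scores) > 3]          # common rule of thumb: |z| > 3
print(outliers)                                # [120.]
```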

Outliers can be handled in different ways:

  • Delete the record - if the outlier is clearly an error and there is enough training data, you can just delete that record.

  • Feature scaling or normalisation – this aims to transform the numeric values so that they are all on the same scale, often between 0 and 1. This rescaling makes the values more comparable

  • Standardisation – is similar to normalisation but instead of scaling values from 0 to 1, it rescales the features to have a mean of 0 and standard deviation of 1

  • Binning – takes a continuous numerical feature and splits it into a set of intervals or bins. Each value is then assigned to a bin, which can smooth over imprecision or uncertainty, e.g. someone aged 110 could end up in a bin labelled “70+” (scaling, standardisation and binning are sketched after this list).
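
A minimal sketch of normalisation, standardisation and binning using scikit-learn and pandas (the ages are made up):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = pd.DataFrame({"age": [22, 35, 47, 58, 64, 110]})

# normalisation: rescale values to the 0-1 range
normalised = MinMaxScaler().fit_transform(ages)

# standardisation: rescale to mean 0 and standard deviation 1
standardised = StandardScaler().fit_transform(ages)

# binning: assign each value to an interval
ages["age_bin"] = pd.cut(ages["age"], bins=[0, 30, 50, 70, np.inf],
                         labels=["<30", "30-49", "50-69", "70+"])
print(ages)
```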

After data has been cleaned up and encoded, you can fine-tune existing features or create new ones in your dataset through feature engineering. There are other methods such as bag-of-words and N-grams that can be used to extract key information from text data.

In machine learning, dimensionality refers to the number of features in your dataset. If you have a dataset with 3 features (age, income and height), it exists in a 3-dimensional space. The curse of dimensionality refers to the problems that arise when you have too many features. When this happens, the data becomes sparse (e.g. spread out too thinly in the feature space), and it is hard to find meaningful patterns.

There are a number of unsupervised reduction techniques that can help to distil many features into a smaller more manageable number:

  • Principal component analysis (PCA) – this technique retains most of the variation in the original features but reduces the overall number of features. It works by transforming features into a new set of features called principal components

  • K-Means – this technique uses a clustering algorithm to group similar data points into K clusters. It does not create new features but assigns a cluster label to each data point.
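
A minimal sketch of both techniques with scikit-learn, using the built-in Iris dataset (4 features) purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = load_iris().data                       # 150 samples, 4 features each

# PCA: project the 4 features onto 2 principal components
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)                     # (150, 2)

# K-Means: assign each sample to one of 3 clusters (no new features created)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])
```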

SageMaker Built-In Algorithms

Before you can train your model, you need to select a machine learning algorithm to use. Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning practitioners get started on training and deploying machine learning models quickly. It is worth understanding each of these algorithms at a high-level, understanding which ones are used for supervised versus unsupervised learning, and their main uses.

Linear Learner - A supervised learning algorithm suited for general classification (Logistic Regression) and regression (Linear Regression) tasks. It makes predictions based on labelled data. In simpler terms, the model is given examples where each example has some features (like size, price, or age) and an outcome (like a house price or a category). For classification tasks, the model sorts data into categories, such as whether a house is expensive or not. For regression tasks, it predicts a specific value, like the actual price of a house. It assumes a linear relationship between the features and the target, and missing values must be handled during preprocessing.

K-Means - An unsupervised algorithm designed for clustering or grouping data points based on their features (chosen attribute), without needing labelled data

BlazingText - Provides highly optimised implementations of text classification (supervised) and Word2Vec word embeddings (unsupervised), making it well suited to Natural Language Processing tasks.

Seq2Seq - A supervised algorithm specifically designed for sequence-to-sequence tasks, such as predicting the next word in a sequence, making it ideal for tasks like language translation or text generation

DeepAR - A supervised algorithm used for time-series forecasting using recurrent neural networks (RNNs)

XGBoost - A supervised learning algorithm used for both classification and regression tasks — especially when you care about speed and performance. It is an optimized, scalable implementation of gradient boosting that builds an ensemble of decision trees in a sequential manner. Often outperforms other models in competitions (e.g., Kaggle)

Random Cut Forest - An unsupervised algorithm used to identify abnormal data points within a data set e.g. anomaly detection

Semantic Segmentation - A supervised algorithm that provides pixel-level classification but does not label objects with bounding boxes. It is typically used to classify individual pixels by tagging each pixel with a specified class

Principal Component Analysis (PCA) - An unsupervised algorithm used for dimensionality reduction

Image Classification - A supervised algorithm that is used to label entire images, not individual objects. It simply assigns a single label to an entire image, categorising it based on the predominant features. It cannot identify or count multiple objects within a single image.

Object Detection - A supervised algorithm used to identify and classify multiple objects within an image, assigning bounding boxes and confidence scores. It draws bounding boxes around detected objects and classifies them into different categories, making it very useful for tasks where you need to recognise what is in the image and determine the exact location of each object. This algorithm is well-suited for scenarios that require counting specific items, such as animals, in drone imagery, as it can distinguish between individual objects even in complex scenes.

Object2Vec - A supervised algorithm primarily used to learn vector embeddings of discrete objects. It's typically used in recommendation systems, document classification or semantic similarity tasks, not computer vision or image processing.

IP Insights - An unsupervised algorithm used to detect anomalies in IP address usage patterns. It captures associations between these IP addresses and various entities, such as user IDs or account numbers. For instance, you can use it to detect a user attempting to log into a web service from an anomalous IP address. Additionally, it helps identify accounts that create computing resources from unexpected IP addresses.

Latent Dirichlet Allocation (LDA) - An unsupervised learning technique designed to represent a collection of documents as a combination of various topics. LDA is primarily used to identify a specified number of topics within a set of text documents. The LDA algorithm is a powerful tool for text mining and natural language processing tasks. It allows companies to sift through vast amounts of textual data and discern patterns that might otherwise take time to become apparent. Since LDA is an unsupervised method, the topics are not specified up front, and the discovered topics may not necessarily match human categorisations. Instead, LDA learns the topics as a probability distribution over the words in the documents, and each document is characterised as a mixture of these topics.

Neural Topic Model (NTM) - An unsupervised algorithm used for organising documents into topics. It is just like LDA — but it's based on neural networks rather than probabilistic graphical models.

Factorization Machines - A supervised algorithm designed to handle sparse data, making it ideal for recommendation systems where user-item interactions are often sparse. It is primarily used for recommendation systems and ranking predictions.

Text Classification – TensorFlow algorithm - A supervised algorithm designed to classify text into predefined categories.
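
The built-in algorithms are all trained in much the same way: retrieve the algorithm's container image and pass it to a SageMaker Estimator. A hedged sketch using built-in XGBoost with the SageMaker Python SDK; the IAM role ARN and S3 paths are placeholders you would replace with your own:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"   # placeholder role

# container image for the built-in XGBoost algorithm in this region
container = image_uris.retrieve(framework="xgboost",
                                region=session.boto_region_name,
                                version="1.7-1")

xgb = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/xgb-output",                     # placeholder bucket
    sagemaker_session=session,
)
xgb.set_hyperparameters(objective="binary:logistic", num_round=100, max_depth=5, eta=0.2)

# train on a CSV file where the first column is the label
xgb.fit({"train": TrainingInput("s3://my-bucket/train.csv", content_type="text/csv")})
```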

Model Development

Hyperparameters are external configuration variables used to control the training process and improve model performance and outcomes. They are set before training. This can be done manually, although SageMaker offers intelligent hyperparameter tuning based on Bayesian search theory, designed to find the best model in the shortest time. Amazon SageMaker automatic model tuning (AMT), also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset, and also supports Hyperband, a newer search strategy.

Common hyperparameters include:

  • Epoch – the number of times the entire training dataset is shown to the network during training. A smaller epoch value means faster training, but the model might not learn enough patterns and end up underfitting. A larger epoch value gives more opportunity to refine the weights and achieve better convergence, but takes longer to train and may end up memorising the training data and so overfitting

  • Learning rate – the rate at which an algorithm updates estimates. Too high a learning rate means you might overshoot the optimal solution. Too small a learning rate will take too long to find the optimal solution

  • Batch size – how many training samples are used in each batch of each epoch. Large batch sizes are faster per epoch as they make fuller use of the GPU, but risk worse generalisation and can end up getting stuck in a poor solution. Small batch sizes are slower per epoch but can provide better generalisation.
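
A hedged sketch of SageMaker automatic model tuning, reusing the kind of XGBoost estimator shown earlier; the metric name, ranges and S3 paths are illustrative assumptions:

```python
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

tuner = HyperparameterTuner(
    estimator=xgb,                               # the Estimator from the earlier sketch
    objective_metric_name="validation:auc",      # built-in metric emitted by SageMaker XGBoost
    objective_type="Maximize",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),   # learning rate
        "max_depth": IntegerParameter(3, 10),
        "num_round": IntegerParameter(50, 300),
    },
    strategy="Bayesian",                         # "Hyperband" is also supported
    max_jobs=20,
    max_parallel_jobs=2,
)
tuner.fit({
    "train": TrainingInput("s3://my-bucket/train.csv", content_type="text/csv"),
    "validation": TrainingInput("s3://my-bucket/validation.csv", content_type="text/csv"),
})
```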

Note that hyperparameters are not related to inference parameters. Inference parameters are settings you can adjust at inference time that influence the response from the model. The most common are:

  • Temperature: Temperature is a value between 0 and 1, and it regulates the creativity of the model's responses. Use a lower temperature if you want more deterministic responses, and use a higher temperature if you want creative or different responses for the same prompt
  • Top K: The number of most-likely candidates that the model considers for the next token. Choose a lower value to decrease the size of the pool and limit the options to more likely outputs.
  • Top P: The percentage of most-likely candidates that the model considers for the next token. Choose a lower value to decrease the size of the pool and limit the options to more likely outputs.
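
A hedged sketch of setting these parameters with the Amazon Bedrock Converse API via boto3; the model ID is just an example, and Top K is passed as a model-specific field rather than a standard Converse setting:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")   # assumes credentials and model access are configured

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",   # example model ID
    messages=[{"role": "user", "content": [{"text": "Explain what a hyperparameter is."}]}],
    inferenceConfig={
        "temperature": 0.2,   # lower = more deterministic responses
        "topP": 0.9,          # consider only the most likely 90% of probability mass
        "maxTokens": 256,
    },
    additionalModelRequestFields={"top_k": 50},          # Top K, for providers that support it
)
print(response["output"]["message"]["content"][0]["text"])
```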

Evaluating Model Accuracy

Metrics are used to measure the performance and accuracy of a machine learning model. These metrics can typically be broken down into classification metrics and regression metrics.

With classification, the goal of the model is to predict a label or class (category) for the given input. With binary classification, there are only two possible outputs (positive or negative). This is used to predict whether an image is a dog or not, or whether an email is spam or not. With multi-class classification, there are more than two possible outputs, such as predicting whether an animal is a dog, cat or cow.

With regression, the goal of the model is to predict a numerical value. This could be predicting a house or stock price, or a person's annual income given certain inputs.

Classification Metrics

The confusion matrix is a great way to help understand common classification metrics.

Confusion Matrix

|  | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |

Recall - Recall is the percentage of actual positives that were correctly predicted. It focuses on how many of the actual positives the model got right. You use it when you prefer to catch as many positives as possible, even if some are incorrect. It is a good metric when false negatives are costly, e.g. fraud detection or cancer screening, where it is better to flag more cases (even if some are wrong) than to miss a true positive

It is calculated as: Recall = TP / (TP + FN)

Precision - Precision is all about correct positive predictions: the percentage of predicted positives that are actually positive. All positive predictions include both true positives and false positives (those predicted as positive but actually negative). This makes it a good metric when false positives are costly, e.g. spam filtering, where you don't want to mark legitimate emails as spam, or object detection in autonomous vehicles, where a false positive can induce sudden unnecessary braking.

It is calculated as: Precision = TP / (TP + FP)

A model with high precision and low recall catches few positives but is rarely wrong.

A model with high recall and low precision catches most positives but includes many false alarms.

F1 Score - The F1 Score is the harmonic mean of precision and recall. It is used when you need a balance between both.

It is calculated as: F1 Score = 2 x ((Precision x Recall) / (Precision + Recall))

Accuracy - Accuracy measures overall correctness — how often the model was right, regardless of class. It considers all predictions (true positives, true negatives, false positives, and false negatives).

It is calculated as: Accuracy = (TP + TN) / (TP + TN + FP + FN)

AUC and ROC - The ROC curve is a graphical plot that helps visualise how well a binary classification model performs across different threshold values. It is a plot of the true positive rate (recall) versus the false positive rate, and helps you see the trade-off between true positives and false positives.

The Area under the Curve (AUC) is a single scalar value between 0 and 1 of how well the classification model can separate the positive and negative predictions. A value of 0.5 means the model performs no better than a random classifier. A value of 1.0 is a perfect classifier.
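
All of these metrics are available in scikit-learn. A minimal sketch on made-up predictions, where 1 is the positive class:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]                       # actual labels
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]                       # predicted labels
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3, 0.2, 0.95]  # predicted probabilities

print(confusion_matrix(y_true, y_pred))     # [[TN, FP], [FN, TP]]
print(recall_score(y_true, y_pred))         # TP / (TP + FN)
print(precision_score(y_true, y_pred))      # TP / (TP + FP)
print(f1_score(y_true, y_pred))             # harmonic mean of precision and recall
print(accuracy_score(y_true, y_pred))       # (TP + TN) / (TP + TN + FP + FN)
print(roc_auc_score(y_true, y_score))       # area under the ROC curve, needs scores not labels
```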

Regression Metrics

If you are using regression where you are predicting a number and not just a classification, then there are other metrics:

Mean Absolute Error (MAE) - MAE measures the average absolute difference between the predicted values and the actual values. It tells you, on average, how far your model's predictions are from the true values. A lower score means a better model. It is simple to understand and is relatively robust to outliers, because it treats errors linearly rather than squaring them. This makes it a good choice when you don’t want a few bad predictions to dominate the error metric.

Mean Squared Error (MSE) - MSE averages the squared difference between actual and predicted values. Because it squares the values, outliers become amplified. This makes MSE more sensitive to outliers. You would choose MSE (Mean Squared Error) over MAE (Mean Absolute Error) when you want to penalize large errors more heavily and are more concerned with model performance on extreme values.

RMSE (Root Mean Square Error) - RMSE is a metric used to measure the differences between predicted values and actual values in a regression problem. It calculates the square root of the average squared differences between the predicted and actual values. A lower Root Mean Square Error value indicates better model performance. Since the errors are squared before averaging, larger errors have a bigger impact (this makes RMSE sensitive to outliers).

You would use RMSE (Root Mean Squared Error) over MSE (Mean Squared Error) when you want the error metric to be in the same units as the target variable, making it more interpretable. For example, if you're predicting house prices in dollars, RMSE is in dollars, while MSE is in squared dollars, which is less intuitive.

R Squared - R-Squared measures how well your regression model explains the variability of the target (dependent) variable; in simple linear regression it is the square of the correlation coefficient between observed and predicted outcomes. A score of 1 means the model explains all the variance perfectly. A score of 0 means the model explains none of the variance.
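
A minimal sketch of the regression metrics with scikit-learn, using made-up house prices so the units stay interpretable:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [250_000, 310_000, 180_000, 420_000]   # actual house prices (dollars)
y_pred = [240_000, 300_000, 200_000, 400_000]   # model predictions

mae  = mean_absolute_error(y_true, y_pred)      # average absolute error, in dollars
mse  = mean_squared_error(y_true, y_pred)       # average squared error, in squared dollars
rmse = np.sqrt(mse)                             # back in dollars, still outlier-sensitive
r2   = r2_score(y_true, y_pred)                 # proportion of variance explained
print(mae, mse, rmse, r2)
```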

Improving Model Accuracy

Understanding model fit is important when understanding the root cause for poor model accuracy.

Two common terms that come up to describe model performance are:

  • Overfitting – a model that is overfitting has learned patterns in the training data that don’t generalise out to the real world. This means that it has high accuracy on the training data set, but lower accuracy on evaluation data sets

  • Underfitting – a model that is underfitting performs poorly on the training data and in the real world. This is because the model is unable to capture the relationship between the input examples and the target values.

Regularization techniques are intended to prevent overfitting. Common techniques include:

  • Dropout – this is a technique where random neurons are temporarily dropped out (i.e. ignored) during each training iteration. This means the network can’t rely too heavily on any specific neuron or connection

  • Early Stopping – this is a technique where you stop training the neural network before it overfits the training data. It works by monitoring validation loss and accuracy and stopping training when the model stops improving.

  • L1 and L2 Regularization

L1 and L2 Regularization are techniques used to prevent overfitting by penalising large model weights. In a machine learning model, a weight is a numeric parameter that connects an input feature to an output. A large weight value means the model is putting a very strong emphasis on that specific feature, making it sensitive to small changes in the inputs, which can lead to overfitting. These techniques add a penalty term to the loss function to discourage large weights:

  • L1 Regularization (Lasso) – the penalty is the sum of the absolute value of the weights. It shrinks some weights entirely to zero to create sparse models. This is a form of feature selection (removing irrelevant features). You should use this when you suspect only a few features are important. It is computationally inefficient.

  • L2 Regularization (Ridge) – the penalty is the sum of the square of the weights. This shrinks the weights but does not make them zero. It helps keep the model simpler and reduces sensitivity. It is computationally efficient.
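
A minimal sketch contrasting L1 and L2 using scikit-learn's Lasso and Ridge on a synthetic dataset where only a few features matter (the alpha values are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, but only 5 actually influence the target
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=10, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: sum of absolute weights
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: sum of squared weights

print("L1 weights driven to zero:", np.sum(lasso.coef_ == 0))   # sparse: many exact zeros
print("L2 weights driven to zero:", np.sum(ridge.coef_ == 0))   # typically none, just shrunk
```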

Additional Topics to Study

The two other main topic areas you need to understand are:

  • AWS AI Services - these are the managed AWS services that offer a simpler entry point than building your own model. You will need a good understanding of each service and what it is used for, so you can distinguish between Amazon Lex and Amazon Polly, and between Amazon Translate and Amazon Transcribe.

  • Amazon SageMaker - Amazon SageMaker is a service that provides a whole host of features and capabilities you need to be aware of. You need to understand which feature you can use to detect bias; which to import, prepare and transform data; which to share curated features, and so on.

Other Study Guides

  • AWS Certification Page - the AWS certification home page for this exam includes the study guide and links to additional resources

  • AWS SkillBuilder - the AWS official learning plan which is available for free alongside an official set of practice exam questions. Additional material including longer review sections, extra questions and labs are available with a subscription.

  • Udemy - the certification course provided by Stephane Maarek and Frank Kane comes highly recommended

  • Pluralsight - this certification course by Pluralsight also includes labs. Pluralsight offer a 10 day free individual trial and monthly subscriptions which may work for some

  • Tutorials Dojo - this set of practice questions is a great way to get used to the style of exam in various modes
