Disclaimer: The opinions expressed here are my own and I'm not writing on behalf of AWS or Amazon.
The AWS Machine Learning - Specialty Certification covers a wide spectrum of topics from data engineering to exploratory data analysis to model training and deployment. Here are some quick notes I've gathered to prepare for the certification:
AWS AI Services
Beneficial for developers who want to add AI into their applications through API calls instead of developing and training their own ML models from scratch.
Amazon Textract
Extract text from scanned documents using Optical Character Recognition (OCR).
Documents
Returns text, forms, tables and query responses.
Expenses
Extracts data from invoices and receipts, e.g. vendor name, invoice/receipt date, invoice/receipt number, item name, item price, item quantity, and total amount.
Amazon Comprehend
Extract entities, key phrases, language, personally identifiable information (PII), and sentiment from text.
Entities
Extract entities from text documents, e.g. people, places, and organizations.
Entity detection can be run from the AWS Console for quick experiments or called through the API.
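A minimal boto3 sketch of the API call (the sample text and region are placeholders):

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

# Detect entities in a piece of text (synchronous, per-document call)
response = comprehend.detect_entities(
    Text="Jeff lives in Seattle and works at Amazon.",
    LanguageCode="en",
)

for entity in response["Entities"]:
    print(entity["Type"], entity["Text"], round(entity["Score"], 3))
```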
Key Phrases
Extract the key phrases (one or more words) from text documents.
Key phrase extraction is likewise available from the AWS Console or through the API.
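The equivalent boto3 sketch for key phrases (sample text and region are placeholders):

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

# Detect key phrases and their confidence scores
response = comprehend.detect_key_phrases(
    Text="The new espresso machine arrived two days late but works perfectly.",
    LanguageCode="en",
)

for phrase in response["KeyPhrases"]:
    print(phrase["Text"], round(phrase["Score"], 3))
```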
Sentiment
Predict the overall sentiment of the text - positive, negative, neutral, mixed.
Language
Predict the dominant language of the entire text. Amazon Comprehend can recognize 100 languages.
Personally Identifiable Information (PII)
List out entities in your input text that contain personal information, e.g. address, bank account number, or phone number.
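Sentiment, dominant language, and PII detection are exposed through the same client; a minimal boto3 sketch (the sample text is a placeholder):

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")
text = "My name is Jane Doe and my phone number is 555-0100."

# Overall sentiment: POSITIVE / NEGATIVE / NEUTRAL / MIXED, plus per-class scores
sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")
print(sentiment["Sentiment"], sentiment["SentimentScore"])

# Dominant language, e.g. [{'LanguageCode': 'en', 'Score': 0.99}]
languages = comprehend.detect_dominant_language(Text=text)
print(languages["Languages"])

# PII entities such as NAME, PHONE, ADDRESS, BANK_ACCOUNT_NUMBER
pii = comprehend.detect_pii_entities(Text=text, LanguageCode="en")
for entity in pii["Entities"]:
    print(entity["Type"], text[entity["BeginOffset"]:entity["EndOffset"]])
```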
Vision
Amazon Rekognition
Analyze images and videos to identify objects, people, text, scenes, and activities.
Label Detection
Extract labels of objects, concepts, scenes, and actions in your images.
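A minimal boto3 sketch of label detection against an image in S3 (bucket and key are placeholders):

```python
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")

# The image can be passed as raw bytes or referenced as an S3 object
response = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "my-bucket", "Name": "photos/beach.jpg"}},
    MaxLabels=10,
    MinConfidence=80,
)

for label in response["Labels"]:
    print(label["Name"], round(label["Confidence"], 1))
```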
Facial Analysis
Detect faces and retrieve facial attributes in an image, e.g. facial expressions, accessories, facial features, etc.
Face Comparison
Compare faces across a set of images that contain multiple faces. Rekognition compares the largest face in the source image (the reference face) with up to 100 faces detected in the target image (the comparison faces) and generates a similarity score for each match.
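A minimal boto3 sketch of face comparison (bucket and object names are placeholders):

```python
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")

# Compare the largest face in the source image against faces in the target image
response = rekognition.compare_faces(
    SourceImage={"S3Object": {"Bucket": "my-bucket", "Name": "reference.jpg"}},
    TargetImage={"S3Object": {"Bucket": "my-bucket", "Name": "group-photo.jpg"}},
    SimilarityThreshold=80,
)

for match in response["FaceMatches"]:
    print(round(match["Similarity"], 1), match["Face"]["BoundingBox"])
```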
Other AWS AI Services
- Amazon Lex: Build conversational interfaces using voice/text as input
- Amazon Polly: Text to speech
- Amazon Transcribe: Speech to text
- Amazon Translate: Translate text between languages
Domain 1: Data Engineering
AWS Glue
https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html
- Serverless data integration service that makes it easy for analytics users to discover, prepare, move, and integrate data from multiple sources.
- Data Sources: S3, RDS, JDBC, DynamoDB, Kinesis Data Streams, Apache Kafka
- Data Targets: S3, RDS, JDBC
- Crawlers: Automatically infer database and table schema from your source data, storing the associated metadata in the AWS Glue Data Catalog.
- ETL Programming Languages: PySpark (Python), Scala (see the PySpark sketch after this list)
- FindMatches Transform: Use this machine learning transform to identify duplicate or matching records, e.g. matching customer or product records, or improving fraud detection.
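A minimal sketch of a Glue PySpark job, assuming a crawler has already registered a `sales_db.orders` table in the Data Catalog (database, table, and S3 path are placeholders):

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a DynamicFrame from a table the crawler added to the Glue Data Catalog
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders"
)

# Write the data back to S3 as Parquet, a columnar format that is cheaper to query
glue_context.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/orders/"},
    format="parquet",
)
```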
Amazon Athena
https://docs.aws.amazon.com/athena/latest/ug/what-is.html
- Serverless, interactive query service to analyze big data in Amazon S3 using standard SQL (see the boto3 sketch after this list).
- Integration with AWS Glue: AWS Glue crawlers automatically infer database and table schema from data in S3 and store the associated metadata in AWS Glue Data Catalog. This catalog lets the Athena query engine know how to find, read, and process the data you want to query.
- When to use Amazon Athena vs Redshift vs EMR: https://docs.aws.amazon.com/athena/latest/ug/when-should-i-use-ate.html
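A minimal boto3 sketch of running a query (database, query, and output location are placeholders); Athena runs asynchronously, so the caller polls for completion:

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Kick off a standard-SQL query over data in S3
start = athena.start_query_execution(
    QueryString="SELECT order_id, total FROM orders WHERE year = 2023 LIMIT 10",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
query_id = start["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)
```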
Amazon Kinesis
https://docs.aws.amazon.com/kinesis/index.html
Kinesis Video Streams
Stream live video data, optionally store it, and make the data available for consumption both in real time and on a batch or ad hoc basis.
Kinesis Data Streams
Collect and process large streams of data records in real time.
- Reading from Data Streams (Consumers): Using Kinesis Data Analytics, Kinesis Data Firehose, Lambda, EC2
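Producers write records with a partition key that determines which shard receives the data; a minimal boto3 sketch (stream name and payload are placeholders):

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Write a single record; records with the same partition key land on the same shard
kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps({"user_id": "u-123", "event": "page_view"}).encode("utf-8"),
    PartitionKey="u-123",
)
```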
Kinesis Data Firehose
ETL service that captures, transforms, and delivers streaming data to data lakes, data stores, and analytics services. Buffers incoming streaming data to a certain size or for a certain period of time before delivering it to destinations.
- Use Lambda to transform each buffered batch or to convert the file format, e.g. Apache Parquet is more efficient to query than JSON (see the Lambda sketch below).
- Delivery Stream Destination: S3, Redshift, Elasticsearch, Splunk, HTTP endpoint, etc
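A minimal sketch of a Firehose transformation Lambda: each invocation receives a buffered batch of base64-encoded records, and each record must be returned marked Ok, Dropped, or ProcessingFailed:

```python
import base64
import json

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        # Records arrive base64-encoded; decode, transform, and re-encode
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed"] = True  # placeholder transformation
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```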
Kinesis Data Analytics
Continuously read and analyze data from a connected streaming source in real-time.
- Source: Kinesis Data Stream, Kinesis Data Firehose
- Destination: 1/ Kinesis Data Stream, 2/ Kinesis Data Firehose, 3/ Lambda
- Runtime: SQL, Apache Flink
- Aggregate/Analytical Functions: Hotspots, Random Cut Forest, etc
Domain 2: Exploratory Data Analysis
- Data Labelling: Amazon SageMaker Ground Truth (data labeling service using human annotators from Amazon Mechanical Turk or your own private workforce)
- Feature Engineering: one-hot encoding, binning, outlier handling, normalization, PCA for dimensionality reduction. For text: TF-IDF, Bag of Words, N-Grams (see the sketch after this list).
- Know the different types of data visualization: Histogram, scatter plot, box plot, correlation heatmap, hierarchical plot, etc.
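A small sketch of two of these transforms using pandas and scikit-learn (the toy data is made up):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# One-hot encoding of a categorical column
df = pd.DataFrame({"color": ["red", "blue", "red"], "price": [10, 12, 9]})
encoded = pd.get_dummies(df, columns=["color"])
print(encoded)

# TF-IDF features over unigrams and bigrams (N-grams)
docs = ["the cat sat on the mat", "the dog sat on the log"]
tfidf = TfidfVectorizer(ngram_range=(1, 2))
features = tfidf.fit_transform(docs)
print(features.shape, tfidf.get_feature_names_out()[:5])
```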
Domain 3: Modelling
https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html
- Supervised Learning Algos: XGBoost, k-NN, Linear Learner, DeepAR Forecasting, Object2Vec
- Unsupervised Learning Algos: K-Means, PCA, Random Cut Forest
- Text Analysis Algos: BlazingText, Sequence-to-Sequence, LDA, Neural Topic Model (NTM)
- Image Processing Algos: Image Classification (MXNet or TensorFlow), Object Detection, Semantic Segmentation (pixel level)
- Evaluation of ML Models: Confusion Matrix, AUC-ROC, Accuracy, Precision, Recall, F1 Score, RMSE (see the metrics sketch after this list)
- Overfitting Solutions: 1/ Use fewer features, 2/ Decrease n-grams size, 3/ Increase amount of regularization used, 4/ Increase amount of training data examples
- Underfitting Solutions: 1/ Add new domain-specific features, 2/ Add more Cartesian products, 3/ Increase n-grams size, 4/ Decrease amount of regularization used, 5/ Increase amount of training data examples
- Hyperparameter Tuning: Random Search, Bayesian Search (see the tuning sketch after this list)
- How SageMaker Studio works: https://aws.amazon.com/blogs/machine-learning/dive-deep-into-amazon-sagemaker-studio-notebook-architecture/
- SageMaker Studio Notebooks vs SageMaker Notebook Instances: https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-comparison.html
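A small worked sketch of the classification metrics using scikit-learn (the labels and scores are made up):

```python
from sklearn.metrics import (
    confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score
)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]  # predicted scores for AUC-ROC

print(confusion_matrix(y_true, y_pred))   # [[TN, FP], [FN, TP]]
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_prob))
```

And a hedged sketch of SageMaker automatic model tuning with the Python SDK, assuming `estimator` is an already-configured Estimator (e.g. the built-in XGBoost algorithm); the metric name, ranges, and S3 URIs are placeholders:

```python
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

tuner = HyperparameterTuner(
    estimator=estimator,                      # assumed: a configured sagemaker.estimator.Estimator
    objective_metric_name="validation:auc",
    objective_type="Maximize",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    strategy="Bayesian",                      # or "Random"
    max_jobs=20,
    max_parallel_jobs=2,
)

tuner.fit({"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/validation/"})
```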
Domain 4: Machine Learning Implementation & Operations
- Real-time Inference: Create an HTTPS endpoint if you require a persistent endpoint for apps to call to get inferences (see the invoke sketch after this list)
- Batch Transform: Preprocess datasets or run inference on large datasets; does not require a persistent endpoint.
- SageMaker Neo: Automatically optimizes machine learning models for inference on cloud instances and edge devices to run faster with no loss in accuracy.
- SageMaker Elastic Inference (EI): Speed up the throughput and decrease the latency of getting real-time inferences from your deep learning models that are deployed as SageMaker hosted models, but at a fraction of the cost of using a GPU instance for your endpoint
- Track and monitor SageMaker metrics using: 1/ AWS Console, 2/ CloudWatch, 3/ SageMaker Python SDK APIs
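A minimal boto3 sketch of calling a real-time endpoint (the endpoint name and CSV payload are placeholders for an already-deployed model):

```python
import boto3

runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

# Invoke a persistent HTTPS endpoint with a CSV payload and read the prediction
response = runtime.invoke_endpoint(
    EndpointName="my-xgboost-endpoint",
    ContentType="text/csv",
    Body="4.5,2.3,1.3,0.3",
)
print(response["Body"].read().decode("utf-8"))
```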
This is only a brief summary of the core topics I found to be important and definitely not exhaustive. Please refer to https://aws.amazon.com/certification/certified-machine-learning-specialty/ for the full set of topics to prepare.