Wendy Wong for AWS Community Builders


How to ETL with AWS Glue and Amazon Glue Studio to transform your data - Part 2


In the previous blog, you mastered ETL using sample sales data in both AWS Glue Studio and AWS Glue.

The third tutorial explored how you may query data from your Amazon S3 bucket using Amazon Athena and create data visualizations from an Amazon Athena table in Amazon QuickSight that references the sales database stored in the AWS Data Catalog.

In this new tutorial, you will learn how to:

a) Locate open-source datasets for your passion project and build a test or production workload that you can use with AWS services.

b) Complete data transformation with your own data using both AWS Glue and AWS Glue Studio.

Where can you find interesting open source datasets for analytics and machine learning projects?


  • Government websites, e.g., open data portals





Solution Architecture overview - AWS Glue


Tutorial 1: Data Transformation with AWS Glue

Step 1: In this tutorial, download the DOHMH New York City Restaurant Inspections data as a CSV file from this link.


A preview of the raw dataset:


Step 2: Log into your AWS account. If you do not have an AWS account, you may create one here.


Step 3: You may follow the instructions here to create your first S3 bucket and name it new-york-restaurant-inspections.
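Bucket names must be globally unique and follow S3's naming rules: 3-63 characters; lowercase letters, digits, and hyphens; beginning and ending with a letter or digit. A minimal sketch for sanity-checking a candidate name (the regex is an assumption covering only these common rules, not the full specification):

```python
import re

# Subset of the S3 bucket-naming rules: 3-63 characters; lowercase
# letters, digits, and hyphens; must begin and end with a letter or digit.
BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")

def is_valid_bucket_name(name: str) -> bool:
    return BUCKET_RE.fullmatch(name) is not None

print(is_valid_bucket_name("new-york-restaurant-inspections"))  # True
print(is_valid_bucket_name("NY-Inspections"))                   # False
```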


Step 4: Click the bucket and select Create Folder.

Create two folders within the Amazon S3 bucket and name them raw and processed respectively as shown below:


Step 5: Click Upload and then Add files to upload the CSV file into the raw folder.
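Steps 4 and 5 can also be done programmatically. A minimal sketch: the console's "folders" are just zero-byte objects whose keys end in `/`. The boto3 calls are shown as comments so the sketch stays self-contained, and the CSV file name is an assumption for illustration:

```python
BUCKET = "new-york-restaurant-inspections"
CSV_FILE = "DOHMH_New_York_City_Restaurant_Inspection_Results.csv"  # assumed name

def folder_keys(*names):
    """S3 has no real folders; the console creates zero-byte objects
    whose keys end in '/'. Return those keys for the given names."""
    return [f"{name}/" for name in names]

keys = folder_keys("raw", "processed")
print(keys)  # ['raw/', 'processed/']

# With boto3 (pip install boto3) this would be roughly:
# import boto3
# s3 = boto3.client("s3")
# for key in keys:
#     s3.put_object(Bucket=BUCKET, Key=key)
# s3.upload_file(CSV_FILE, BUCKET, f"raw/{CSV_FILE}")
```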



Step 6: Set up basic IAM permissions to use the AWS Glue console. You may follow this blog for more details.

If you need to access more Amazon resources such as Amazon SageMaker notebooks, you may follow the instructions here.

Step 7: Navigate to the AWS Glue console and, on the left-hand menu, click Crawlers followed by Add crawler.


Step 8: Provide a name for the crawler. For this example, we will name the crawler new-york-restaurant-inspection-crawler and click Next.

Step 9: For the crawler source type, select Data stores and then Crawl all folders.


Step 10: In this section, add the S3 bucket path, select the CSV file from the raw folder, and click Next.



Step 11: Click Yes to add another data store and Next.


Step 12: Repeat Step 10 but select the processed folder for the S3 bucket path. Click Next.



Step 13: There are no more data stores to add, so click No, then Next.


Step 14: You may select an existing IAM role for AWS Glue that was created earlier, for example 'AWSGlueServiceRole-learn-glue-role'. Click Next.


Step 15: Create a schedule for the crawler: select Run on demand from the drop-down menu.


Step 16: Configure the crawler output. Click Add database and provide a name for the database 'restaurant-inspections'. Add a prefix for the table name called Raw. Click Next.



Step 17: Check the input details and click Finish.


The crawler details are shown below:

Step 18: Click the crawler 'new-york-restaurant-inspection-crawler' and select Run crawler to execute the job.


Step 19: After the crawler job is completed with a 'Ready' status, you can inspect the metadata by clicking Tables on the left-hand side.


You may inspect the following schema from the data catalog:
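The crawler builds this schema by sampling the file and inferring a type for each column. A toy illustration only (the real classifiers are far more sophisticated, and the column names here are a small assumed subset of the DOHMH file):

```python
import csv
import io

# Toy illustration of header/type inference over a two-line CSV sample.
sample = io.StringIO("CAMIS,DBA,SCORE\n41234,CAFE ONE,12\n")
row = next(csv.DictReader(sample))

def infer_type(value: str) -> str:
    """Crude type guess: integer-looking values become bigint, else string."""
    return "bigint" if value.isdigit() else "string"

schema = {column: infer_type(value) for column, value in row.items()}
print(schema)  # {'CAMIS': 'bigint', 'DBA': 'string', 'SCORE': 'bigint'}
```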


Solution Architecture overview - AWS Glue Studio


Tutorial 2: Data Transformation with AWS Glue Studio

Step 1: Access 'AWS Glue Studio' from the AWS Glue console.


Step 2: From the AWS Glue Studio console, click View jobs.


Step 3: Select the option Visual with a source and target and click Create.


Step 4: Under Node Properties, select S3 bucket followed by Amazon S3.


Step 5: Under the S3 bucket node, select from the Data catalog option the database name followed by the table name as shown in the image below.

Choose the Apply Mapping node.

Step 6: Under the Apply Mapping node, click on the Transform tab.

Are there any variables that have been incorrectly categorized in the data catalog?

Are there any variables whose data type you need to change?

Are there any columns that you need to drop from your dataset that won't be used in your data analysis or for machine learning?


You may complete some transformations such as changing date variables from the data type 'string' to 'date'.
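Under the hood, a string-to-date conversion is just parsing. A minimal sketch, assuming the MM/DD/YYYY format used by the inspection-date columns in the DOHMH file:

```python
from datetime import datetime, date
from typing import Optional

def to_date(value: str) -> Optional[date]:
    """Parse an MM/DD/YYYY inspection date; return None for blank or invalid values."""
    try:
        return datetime.strptime(value, "%m/%d/%Y").date()
    except (ValueError, TypeError):
        return None

print(to_date("05/20/2019"))  # 2019-05-20
print(to_date(""))            # None
```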


You may view a listing of common data types in AWS Glue in the documentation.

Step 7: Click on the Output Schema tab and check the data transformations.


Step 8: Click on the S3 bucket data target node, select Parquet as the format and Snappy as the compression type, and lastly select the S3 bucket path of the processed folder to output the transformed data.


Select the option to create a table in the Data Catalog on subsequent runs and name this table processed.


Step 9: Click on the Script tab to view the generated Python code for this ETL job.
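The heart of the generated script is an ApplyMapping call, whose mappings are (source column, source type, target column, target type) tuples. A hedged illustration of that shape, with column names assumed from this dataset:

```python
# Mapping tuples in the shape that ApplyMapping.apply(..., mappings=...)
# expects in a generated Glue script; the column names are assumptions.
mappings = [
    ("inspection date", "string", "inspection date", "date"),
    ("score", "string", "score", "int"),
    ("dba", "string", "dba", "string"),
]

# Columns whose type actually changes in this job:
changed = [src for src, src_type, _, dst_type in mappings if src_type != dst_type]
print(changed)  # ['inspection date', 'score']
```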


Step 10: Click on the 'Job Details' tab, provide a name for the job ('inspection job') and a description for the data transformation. Change the number of workers to 2.


Step 11: Click Save and Run to commence the ETL job.


Step 12: The ETL job will take a few minutes to run; its status will change to 'Succeeded' when it finishes.


Step 13: Navigate back to the AWS Glue console, select Crawlers on the left-hand menu, and re-run the 'new-york-restaurant-inspection-crawler' by clicking Run crawler.


After the status changes to 'Ready', click on Tables to inspect the processed table in the Data Catalog with the transformed data types.



Setting up IAM permissions for AWS Glue

Get started with AWS Glue

Data types for AWS Glue

Working with geospatial data in QuickSight

Best practice working with geospatial data in Amazon Athena and Glue

Common data types

AWS Innovate Online Conference APJ - 23 & 25 August 2022


Register at this link.

🚀 Hot off the press: Introducing AWS Glue Flex jobs: Cost savings on ETL workloads

AWS Glue Flex jobs are perfect for use cases including:

  • Scheduling ETL jobs to run overnight
  • Scheduling test workloads
  • One-time bulk upload
  • Non-time-sensitive workloads

You can read about the new feature and Flex pricing on the AWS Big Data Blog here.


Until the next lesson, happy learning! 😁

Top comments (4)

Pablo Ezequiel Inchausti

Congrats! Very detailed!

Wendy Wong

Thanks Pablo!

Adit Modi

Great blog, wendy 😊
Flex jobs is a great feature to the existing AWS Glue service. excited to try it out 🔥
Also AWS Glue Streaming ETL Auto Scaling is GA and it looks good 👌

Wendy Wong

Thanks Adit, I am excited to try AWS Glue Streaming ETL for a business use case :)