Nandini Rajaram for AWS Community Builders


An ETL Job using AWS Glue Studio to inner join DynamoDB tables, Apply Queries and Store the result in S3

Image description

ETL Jobs

ETL stands for extract, transform, and load. It is the most common paradigm for combining data from various systems into a single database, data store, or warehouse for long-term storage or analytics.

Extraction
Extraction is the process of retrieving data from one or more sources. After the data is retrieved, the ETL job loads it into a staging area and prepares it for the next phase.
Transformation
Transformation is the process of mapping, reformatting, and applying other operations to prepare the data for consumption.
Loading
Loading involves inserting the transformed data into the target database, data store, or data warehouse.

All of this work is processed in what business intelligence developers call an ETL job.
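To make the three phases concrete, here is a minimal, self-contained Python sketch of the pattern. The CSV file, column names, and SQLite target are hypothetical, purely for illustration:

import csv
import sqlite3

# Extract: retrieve raw rows from a source (a hypothetical CSV export)
with open("rings_export.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: map and reformat fields to prepare them for consumption
cleaned = [
    {"stone": r["RingStoneName"].strip().title(), "price": float(r["Price"])}
    for r in rows
]

# Load: insert the transformed rows into the target data store
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS rings (stone TEXT, price REAL)")
conn.executemany(
    "INSERT INTO rings (stone, price) VALUES (:stone, :price)", cleaned
)
conn.commit()
conn.close()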


AWS Glue

AWS Glue is the key service for integrating data across data lakes, data warehouses, and purpose-built data stores, simplifying data movement from multiple sources.

AWS Glue Studio

AWS Glue Studio is an easy-to-use graphical interface for creating, running, and monitoring AWS Glue extract, transform, and load (ETL) jobs.


Image description

Olive wants to purchase a finger ring with her birthstone.

The details of birthstones and ring prices are stored in two DynamoDB tables.

Now, we are going to create an ETL job using AWS Glue Studio that inner joins the two DynamoDB tables, Birthstone and RingsTable, on the condition RingsTable.RingStoneName = Birthstone.Stone Name.

After that, we apply a query on the result table to find the ring stone and its price matching Olive's birth month, and then save the result to the S3 bucket named "gemstonejobresult".

We use git for source code management, so finally we are going to create a remote git repository on AWS CodeCommit and push the job to it.

Prerequisites

  • Two tables created using AWS DynamoDB Service: Birthstone, RingsTable

(Recipe: https://dev.to/aws-builders/cook-a-recipe-with-aws-dynamo-db-table-1kn1)

  • S3 Bucket to store the result - gemstonejobresult
  • A git repository on AWS CodeCommit - gemstone_job_repository
  • An IAM role named AWSGlueServiceRoledemo with permissions to your Amazon S3 sources, targets, temporary directory, scripts, and any libraries used by the job.

Let's set up the prerequisites first.

1: DynamoDB Tables

  • Here, I have created two tables named "Birthstone" and "RingsTable" and added some items to them

  • Make sure to enable point-in-time recovery (PITR) on both tables
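PITR can also be enabled programmatically. A minimal boto3 sketch (table names are the ones used in this article; the credentials must allow dynamodb:UpdateContinuousBackups):

import boto3

dynamodb = boto3.client("dynamodb")

# Enable point-in-time recovery on both source tables
for table in ["Birthstone", "RingsTable"]:
    dynamodb.update_continuous_backups(
        TableName=table,
        PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
    )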

The table schema details are provided below.

Image description

Table Data

Image description

Image description


2: S3 Bucket for Storing the Query Result

  • Navigate to the S3 console

Image description

  • Click the Create bucket button

  • You will be navigated to the bucket creation page

  • Specify a unique name for the bucket and leave the rest of the fields at their default values

Image description

  • Scroll down and click the Create bucket button

Image description

  • You will be navigated back to the bucket list page on the S3 console, and the newly created bucket will be listed there
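If you prefer to script this step, here is a minimal boto3 sketch (the bucket name is the one used in this article; us-east-1 is an assumption, adjust to your region):

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Bucket names are globally unique, so this exact name may already be taken.
# Outside us-east-1, also pass CreateBucketConfiguration={"LocationConstraint": region}.
s3.create_bucket(Bucket="gemstonejobresult")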

3: A git repository on AWS CodeCommit

AWS CodeCommit

  • AWS CodeCommit is a fully managed source control service that makes it easy for companies to host secure and highly scalable private Git repositories.
  • CodeCommit eliminates the need to operate your own source control system or worry about scaling its infrastructure.

Repository

  • In CodeCommit, the fundamental version control object is a repository.
  • It's where you keep your project's code and files safe. It also keeps track of your project's history, from the first commit to the most recent changes.
  • You can collaborate on a project by sharing your repository with other users.

Before you can push changes to a CodeCommit repository, you must configure an IAM user in your Amazon Web Services account, or set up federated access or temporary credentials.
DO NOT USE ROOT USER

  • Navigate to the CodeCommit console

Image description

  • Click the Create repository button to create a git repository for storing our code

Image description

  • Specify the repository name (I am naming it "gemstone_job_repository") and click the Create repository button

Image description

  • Let's create a ReadMe.md file and commit it to the repository. Click the Create file button

  • Add some content to the file and save it as ReadMe.md

  • The file will be saved to the main branch

  • Specify the author name, email address, and commit message (though optional, it is good practice to provide a commit message)

  • Click the Commit changes button

Image description

Image description

Branches in git
Branches can be used to develop new features, save a specific version of your project from a specific commit, and much more.

When you create your first commit in the AWS CodeCommit console, a default branch named main is created for you.

Now we have created a repository on the AWS console and made a first commit to it by saving a ReadMe file to the main branch.
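The same repository can be created from code. A minimal boto3 sketch (the repository name follows this article; the description text is my own):

import boto3

codecommit = boto3.client("codecommit")

response = codecommit.create_repository(
    repositoryName="gemstone_job_repository",
    repositoryDescription="Remote repository for the gemstone ETL job",
)
# The clone URLs are returned in the repository metadata
print(response["repositoryMetadata"]["cloneUrlHttp"])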

4: IAM role

  • Navigate to the IAM console

Image description

  • Click the Roles link in the left pane

Image description

  • Click the Create role button to create a role

Image description

  • Select AWS service as the trusted entity type
  • Select Glue from the Use cases for other AWS services dropdown

Image description

  • Add the permissions shown below to the role and click the Next button

Image description

  • Specify the role name "AWSGlueServiceRoledemo" and click the Create role button
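The same role can be set up with boto3. A minimal sketch, assuming the AWS-managed AWSGlueServiceRole policy plus S3 access (the exact permissions you attach in the console may differ):

import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the Glue service assume this role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="AWSGlueServiceRoledemo",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach managed policies; the selection in the screenshot may differ
for arn in [
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
]:
    iam.attach_role_policy(RoleName="AWSGlueServiceRoledemo", PolicyArn=arn)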

Create an ETL job using AWS Glue Studio to inner join DynamoDB Tables

  • Navigate to AWS Glue Studio from the AWS console

Image description

  • Click Jobs in the left pane

Image description

  • You will be navigated to the Jobs page, where you can create ETL jobs
  • By default, the Visual with a source and target option is selected

  • Select DynamoDB as the Source and Amazon S3 as the Target in Create job

  • Click the Create button

Image description

  • An untitled job with DynamoDB as the source and Amazon S3 as the target will be created for you, and you will be navigated to the job's page, where you can edit its configuration

Visual Tab

  • First, you will be on the Visual tab

AWS Glue offers an API that lets you create data integration jobs from a JSON object describing a visual, step-by-step workflow. You can then work with these jobs using the visual editor in AWS Glue Studio.

  • Now we are going to create a job which reads data from the two DynamoDB tables, Birthstone and RingsTable.

  • We are going to inner join these two tables on the stone name attribute and apply a query to find the birthstone suitable for Olive and its price

  • First, we will add the required source, action, and target blocks

  • Store the query result in the S3 bucket

    • Nodes required:
      • Two Data source - DynamoDB nodes (RingsTable, Birthstone)
      • Transform - Join (for applying the inner join)
      • Transform - SQL Query (for querying the result table)
      • Data target - S3 bucket
  • Click on the Source dropdown and add one more DynamoDB table

  • Remove the ApplyMapping block if it is already there, since we don't need it for this job

    (Click on the ApplyMapping block and click Remove)

  • From the Action dropdown, select Join and add it

  • From the Action dropdown, select SQL Query and add it

  • The S3 bucket will already be on the flow as the target; leave it as it is

Image description

Now, let's configure the flow

Source Nodes

  • Click on one of the Data source - DynamoDB nodes
  • On the right pane, configure the node properties as shown in the picture below

Image description

  • Select the Choose the DynamoDB table directly radio button for the DynamoDB source

  • Select the Birthstone table from the dropdown. Make sure PITR is enabled for the table; otherwise it will not appear in the dropdown. (If you do not see the table, go to the DynamoDB service, enable PITR for the Birthstone and RingsTable tables, then refresh this page.)

  • Click on the second Data source - DynamoDB node

  • On the Right pane, configure the node properties as shown in the picture below

Image description

  • Now we have configured the source nodes successfully

Action Nodes


Image description

  • Click on the Transform join node
  • On the Right pane, Click on the Node Properties Tab
  • Select the two DynamoDB tables as Node Parents

Image description

  • Click on the Transform tab, make sure Inner join is selected as the join type, and click the Add condition button

Image description

  • Specify the inner join condition as shown in the image below

Image description

  • Click on Transform SQL Query node
  • On the Right pane, Click on the Node Properties Tab
  • Tick the Transform checkbox in the Node parents dropdown list; you will see the Join node get selected

Image description

  • Click on the Transform tab to specify the query on the inner joined table

  • As you see here, the inner joined table is aliased as myDataSource

  • Now let's write the SQL query to fetch the birthstone name and the price of the stone matching Olive's birth month

Image description

  • Olive's birth month is December, since she was born on Christmas Day. Let's create a query to get the birthstone gem and its price for that month:

select RingStoneName, Price from myDataSource where BirthMonth = 'December' limit 1

Image description

Yes, now we have configured the Transform nodes successfully

Target Node

  • We need to store the query result in the S3 bucket

Image description

  • Click on the Data target - S3 bucket node

  • On the right pane, click on the Node properties tab and tick the SQL Query checkbox

Image description

  • Click on the Data target properties - S3 tab
  • Configure it as shown in the image below

Image description

  • As you can see, the format should be Parquet
  • The bucket location should be selected from S3 so that a Parquet file with the result data is saved to the bucket once the job execution completes successfully

  • After all the configuration, the Visual tab will look like the image below

Image description


Script Tab

  • Based on the job configuration, a Python script is generated on the Script tab

Image description
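For reference, the generated script follows roughly this shape. The sketch below is a condensed, hand-written approximation of a flow like ours, not the exact generated code; the Birthstone key spelling (StoneName) and the S3 path are assumptions:

import sys
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Extract: read both DynamoDB tables
rings = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={"dynamodb.input.tableName": "RingsTable"},
)
stones = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={"dynamodb.input.tableName": "Birthstone"},
)

# Transform: inner join on the stone name (key spelling assumed), then the SQL query
joined = Join.apply(rings, stones, "RingStoneName", "StoneName")
joined.toDF().createOrReplaceTempView("myDataSource")
result = spark.sql(
    "select RingStoneName, Price from myDataSource "
    "where BirthMonth = 'December' limit 1"
)

# Load: write the query result to the S3 bucket as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(result, glueContext, "result"),
    connection_type="s3",
    format="parquet",
    connection_options={"path": "s3://gemstonejobresult/"},
)
job.commit()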

Job details

Now, navigate to the Job details tab to specify the job details

By default, the job name is Untitled. Specify a job name in the Name field, and add a description as well.

Select the AWSGlueServiceRoledemo role which we already created as part of the prerequisites

Leave the remaining settings at their defaults

Click the Save button to save the job

Image description


Version Control Tab

Navigate to the Version Control tab

Here, you need to do the git configuration.
Select the AWS CodeCommit option as the git service.

Image description

In the Repository configuration section, choose the git repository that we created in AWS CodeCommit (see Prerequisites)

Choose main as the branch

Image description

Click the Save button at the top right

Click the Actions button and choose the Push to repository option

The job will be pushed, and a JSON file will be saved to the code repository.

Image description

Image description

Navigate to AWS CodeCommit and verify that the files are saved to the repository

Image description


Executing the ETL Job and Verifying the result in S3 bucket

Once you have saved the job, it's time to run it

  • Click the Run button at the top right

Image description

  • You will get the following message

Image description

  • Navigate to the Runs tab to see the job status

Image description

  • Once the run status becomes Succeeded, navigate to the S3 console and click on the gemstonejobresult bucket

  • You will be able to see the result saved as a Parquet file

Image description

  • Download the Parquet file and save it locally by clicking the Download button

Image description

  • Now you will be able to view the result

Image description
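Any Parquet reader can open the downloaded file locally. For example, a small pandas sketch (the filename is illustrative; pandas needs pyarrow or fastparquet installed):

import pandas as pd

# Read the downloaded result file (the filename is whatever you saved it as)
df = pd.read_parquet("gemstone_result.parquet")
print(df)  # expect one row: RingStoneName and Price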

Reading the Parquet File object using S3 select

Alternatively, you can query the Parquet file with S3 Select

  • Navigate to the S3 console and click on the gemstonejobresult bucket

  • Click on the name link of the Parquet file where the result is saved

Image description

  • Now, click the Object actions button and select Query with S3 Select from the list

  • You will be navigated to the Query with S3 Select page. By default, Apache Parquet is chosen in the input settings; keep it as it is

  • You can choose the format in the output settings

S3 Select allows you to specify how the query results are serialized in the output. You can choose formats like JSON or CSV. This determines how the data will be structured in the output files.

  • Let's choose JSON here so that the result of the query is displayed in JSON format

Image description

  • We are going to read the Parquet file using the Select query below, which is displayed in the query editor by default:

SELECT * FROM s3object s LIMIT 5

  • Now click the Run SQL query button to execute the query

Image description

  • Once the query executes successfully, you will be able to view the result in JSON format (the format chosen in the output settings)

Image description

  • You can download the result by clicking the Download results button
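The same S3 Select query can be issued from code. A minimal boto3 sketch (the object key is hypothetical; use the actual key of your result file):

import boto3

s3 = boto3.client("s3")

response = s3.select_object_content(
    Bucket="gemstonejobresult",
    Key="run-result.parquet",  # replace with your actual object key
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object s LIMIT 5",
    InputSerialization={"Parquet": {}},
    OutputSerialization={"JSON": {}},
)

# The response payload is an event stream; Records events carry the JSON output
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))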

Yaay, we found it! Olive needs to purchase a Turquoise ring, and its price is $612.

Image description

And that was Olive’s story.


Challenge Task

  • Set up an ETL job by following the above instructions

  • Find out the month corresponding to the birthstone that is priced highest

  • Post a screenshot of your answer in the comments section
