Wendy Wong for AWS Heroes


How to check for quality? Evaluate data with AWS Glue Data Quality

Data is the new oil

The Women in Data Science (WiDS) Conference 2017 trailer from Stanford University aimed to inspire the audience in the field of data science.

In the trailer, the panelist mentioned the power of data.

Who gets to control data controls the society; that's the reality. -- Belle Wei (Carolyn Guidry Chair in Engineering Education and Innovative Learning, San Jose State University)

Having access to the right data and understanding your data will empower you to make data-driven decisions for your organization. Data understanding is Step 2 of the data analytics and data science workflow known as the Cross-Industry Standard Process for Data Mining (CRISP-DM).

You may read more about CRISP-DM here. The second step:

Data understanding – What data do we need or is available? Is it clean?


The full CRISP-DM data science workflow, covering all of its stages, is illustrated in the diagram on Wikipedia.

Learning Objectives

In this lesson you will learn the following:

  • How to create rules with DQDL
  • What is AWS Glue?
  • What is AWS Glue Data Quality?
  • Solution Architecture
  • How to check data quality results

Who owns data quality?

Is data quality the responsibility of the data engineer building your ETL data pipelines? Is it the responsibility of the data analyst building the data models? Or is the data scientist building hypotheses and machine learning models responsible for data quality?

Take a look at this Twitter post on data quality here.

In fact, we can all play a part: with good-quality data, we can improve our data analysis and machine learning model performance and make better business decisions.

What is AWS Glue?

AWS Glue is an AWS service that allows data engineers, business analysts, data analysts, developers and data scientists to integrate data from multiple sources and also perform ETL.

It is a serverless data integration service that allows you to easily scale your workloads when preparing data and moving transformed data into a target location.

You do not need to provision any servers or clusters before using AWS Glue with a managed serverless experience.

You may bring in your own code or notebook to create an ETL job on demand, batch or streaming. You may create an AWS Glue data catalog to make data available for others to use.

You may also use AWS Glue Studio, a visual editor, to create your ETL pipeline.
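
To make this concrete, here is a minimal sketch of the kind of PySpark job script AWS Glue runs; the S3 path is a placeholder and not part of this tutorial.

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard AWS Glue job bootstrap
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read a CSV dataset from S3 into a DynamicFrame (placeholder path)
books = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://your-bucket/amazon-data-science-books/"]},
    format="csv",
    format_options={"withHeader": True},
)

job.commit()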

What is AWS Glue Data Quality?

AWS Glue Data Quality is a new product feature of AWS Glue that has been generally available since 6 June 2023.

  • AWS Glue Data Quality allows you to measure and monitor 'good or bad data' in your ETL pipelines before it enters your data lake or data warehouse to ensure high quality data is available to make data-driven decisions.

What are the benefits?

In the AWS Glue Developer Guide for AWS Glue Data Quality, the benefits include the following:

  • Serverless – there is no installation, patching or maintenance.

  • Get started quickly – AWS Glue Data Quality quickly analyzes your data and creates data quality rules for you.

  • Improvise your rules to check the integrity and accuracy of the data.

  • Evaluate data quality and make confident business decisions.

  • Zero in on the bad data that causes errors.

  • Pay as you go – there are no fixed costs; you pay only for your usage of AWS Glue Data Quality.

  • Data quality checks – you can implement checks in the AWS Glue Data Catalog (a programmatic sketch follows this list).

  • You may create rules to check the profile of your dataset.
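
As a rough sketch of the Data Catalog option mentioned in the list above, a DQDL ruleset can be attached to a catalog table through the AWS Glue API via boto3; the database, table and ruleset names below are hypothetical.

import boto3

glue = boto3.client("glue")

# Attach a small DQDL ruleset to an existing Data Catalog table (all names are hypothetical)
glue.create_data_quality_ruleset(
    Name="books_completeness_checks",
    Description="Completeness checks for the Amazon data science books table",
    Ruleset='Rules = [ Completeness "price" > 0.8, RowCount > 0 ]',
    TargetTable={
        "DatabaseName": "books_db",
        "TableName": "amazon_data_science_books",
    },
)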

Solution Architecture

My solution overview for the new AWS Glue Data Quality feature: a visual AWS Glue Studio job reads the dataset from Amazon S3, applies the Evaluate Data Quality transform, writes the transformed data to an Amazon S3 target, and publishes the data quality results to Amazon CloudWatch.

Dataset

Let's examine the Amazon Data Science Books Dataset from Kaggle.com.

Prerequisites

  • You may read this blog to learn how to get started with AWS Glue Data Quality.

  • You may read this blog How to ETL with AWS Glue - Part 1.

  • You may dive deeper with a practical example in Part 2.

  • You have an AWS account; if you do not have one, you may learn how to create one here.

Tutorial 1: Add the Evaluate Data Quality transform to the visual job in AWS Glue Studio

In this tutorial you may refer to the instructions from the AWS Glue User Guide.

Step 1: Log into your AWS account as an IAM Admin User.

Step 2: Navigate to Amazon S3, create your bucket in an AWS Region of your preference, and click Create bucket.


Step 3: Upload the Amazon data science books dataset into your Amazon S3 bucket and click Upload.

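If you prefer to script Steps 2 and 3, a boto3 sketch follows; the bucket name, Region and local file name are placeholders.

import boto3

region = "ap-southeast-2"  # placeholder Region
bucket = "my-glue-dq-tutorial-bucket"  # placeholder; bucket names must be globally unique

s3 = boto3.client("s3", region_name=region)

# Create the bucket in the chosen Region
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": region},
)

# Upload the Kaggle CSV (placeholder local file name)
s3.upload_file(
    Filename="amazon_data_science_books.csv",
    Bucket=bucket,
    Key="raw/amazon_data_science_books.csv",
)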

Step 4: Navigate to the AWS Glue dashboard and click Create a job to evaluate data quality.

Step 5: Click on Visual and select Evaluate Data Quality.

Step 6: Add the data quality node.

On the AWS Glue Studio console, choose Visual with a source and target from the Create job section. Choose Create.

create

Step 7: Choose a node on which to apply the data quality transformation. Currently there is no node selected.


You may select the transform node (i.e. Transform - ApplyMapping).


Step 8: On the left-hand side, click the blue plus sign and select Evaluate Data Quality from the drop-down menu (towards the bottom section). Name the job, e.g. GlueDataQuality_Tutorial, and be sure to save it.


Step 9: The selected Evaluate Data Quality transform node will be displayed in the visual editor.

On the right-hand side, you may choose whether to retain the current parent node or change it from the drop-down menu. (Note: The parent node is connected to the Evaluate Data Quality node.)


Step 10: You may validate data quality rules across multiple datasets. The rules that support multiple datasets include:

  • Referential Integrity
  • DatasetMatch
  • SchemaMatch
  • RowCountMatch
  • AggregateMatch

When you add multiple inputs to the Evaluate Data Quality Transform, you need to select your primary input.

Select the Amazon data science books dataset as the primary input for which to validate data quality.

All other nodes or inputs are considered as references.

Use the Evaluate Data Quality transform to identify specific records that failed data quality checks.

(Note: New columns that flag bad records are added to the primary dataset.)
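
For illustration only, a ruleset that validates the primary dataset against a second input might look like the sketch below. It assumes the reference input has been aliased as reference and that both datasets share a title column; adjust the alias and column names to your own job.

# Hypothetical multi-dataset DQDL ruleset (the "reference" alias and "title" column are assumptions)
multi_dataset_ruleset = """
Rules = [
    RowCountMatch "reference" >= 0.9,
    SchemaMatch "reference" >= 0.9,
    ReferentialIntegrity "title" "reference.title" >= 0.95
]
"""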


Step 11: In the Data source - S3 bucket node, select the S3 bucket where the dataset (the primary input) is saved.


Step 12: Click the Output schema tab to modify any data types, e.g. change price from string to integer.


Click Apply to change the data types.

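In the generated job script this data type change corresponds to an ApplyMapping transform; a sketch is below, where books is the DynamicFrame read from the S3 source node, the node name is illustrative, and the casts other than price are examples only.

from awsglue.transforms import ApplyMapping

# Cast the price column from string to int (per the tutorial) and keep the title column as-is
mapped_books = ApplyMapping.apply(
    frame=books,
    mappings=[
        ("title", "string", "title", "string"),
        ("price", "string", "price", "int"),
        ("avg_reviews", "string", "avg_reviews", "double"),
        ("n_reviews", "string", "n_reviews", "long"),
    ],
    transformation_ctx="ApplyMapping_node",
)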

Step 13: Select the Data target - S3 bucket and save the transformed dataset in this location.


Tutorial 2: Create rules using DQDL builder

Step 1: Preview the Amazon data science books dataset, and let's create a rule using the DQDL rule builder to check the completeness of the data.

You may browse the available data quality rule types from the Rule types tab, which include the following:

  • ColumnCount
  • ColumnLength
  • ColumnExists
  • ColumnDataType
  • ColumnValues
  • ColumnNameMatchesPattern
  • Completeness
  • CustomSql
  • DataFreshness
  • DatasetMatch
  • DistinctValuesCount
  • Entropy
  • IsComplete
  • IsPrimaryKey
  • Sum
  • Uniqueness
  • ReferentialIntegrity
  • Mean
  • RowCount
  • RowCountMatch
  • StandardDeviation
  • UniqueValueRatio
  • SchemaMatch

I selected the Completeness data quality rule because I would like to check that more than 80% of the values are present (non-null) for the following variables:

  • Price
  • Price (that includes used books)
  • Number of book reviews
  • Average rating

First, preview the primary data source to understand the data.


The following rules were created in the DQDL rule builder using the Schema tab.

Rules = [
    Completeness "avg_reviews" > 0.8,
    Completeness "n_reviews" > 0.8,
    Completeness "price" > 0.8,
    Completeness "price (including used books)" > 0.8
]


The Completeness rule checks, for each specified column, whether more than 80% of the values in the primary data source are non-null.
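
In the script that AWS Glue Studio generates for this node, the ruleset is passed to the EvaluateDataQuality transform along the lines of the sketch below; the node names and publishing options are illustrative and may differ in your job.

from awsgluedq.transforms import EvaluateDataQuality

completeness_ruleset = """
    Rules = [
        Completeness "avg_reviews" > 0.8,
        Completeness "n_reviews" > 0.8,
        Completeness "price" > 0.8,
        Completeness "price (including used books)" > 0.8
    ]
"""

# Evaluate the ruleset against the mapped DynamicFrame and publish the results
data_quality_node = EvaluateDataQuality.apply(
    frame=mapped_books,
    ruleset=completeness_ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "EvaluateDataQuality_node",
        "enableDataQualityCloudWatchMetrics": True,
        "enableDataQualityResultsPublishing": True,
    },
)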

Tutorial 3: Configure Data Quality Outputs

Step 1: After the data quality rules are created, you can select additional options to be included in the data quality results output.

I have selected two additional options:

  • Actions: publish the results to Amazon CloudWatch
  • Data quality results: flag a pass or fail outcome for each rule


These selections add a rule outcomes node to the job.

Step 2: Under Data quality transform output, I also checked the box for Original data, as this appends additional columns to the primary dataset to flag bad records.

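A hedged sketch of how those appended columns can be used downstream: when row-level results are enabled, the Evaluate Data Quality node returns a collection of outputs, and the generated scripts select the row-level output and filter on a flag column. The rowLevelOutcomes key and the DataQualityEvaluationResult column name are taken from AWS examples; verify them in your own job's output schema.

from awsglue.transforms import SelectFromCollection
from pyspark.sql.functions import col

# evaluate_dq_collection stands for the multi-output of the Evaluate Data Quality node
# when row-level results are enabled; the key and column names are assumed from AWS examples
row_level = SelectFromCollection.apply(
    dfc=evaluate_dq_collection,
    key="rowLevelOutcomes",
    transformation_ctx="rowLevelOutcomes",
)

# Split rows on the appended pass/fail flag column
rows_df = row_level.toDF()
failed_rows = rows_df.filter(col("DataQualityEvaluationResult") == "Failed")
passed_rows = rows_df.filter(col("DataQualityEvaluationResult") == "Passed")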

Tutorial 4: Configure Data Quality actions

After a data quality rule is created, you may select actions to publish metrics to CloudWatch or to stop the job based on specified criteria.

Results published to CloudWatch are also emitted to Amazon EventBridge and can be used to create alert notifications.

On ruleset failure, you may choose one of the following actions:

  • Fail the job after loading data to the target

  • Fail the job without loading data to the target
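
To turn those results into notifications, you can match the Glue Data Quality events in Amazon EventBridge and route them to a target such as an SNS topic. A minimal boto3 sketch follows; the event source and detail-type are taken from AWS examples and the SNS topic ARN is a placeholder, so verify both for your account.

import json
import boto3

events = boto3.client("events")

# Match AWS Glue Data Quality result events (pattern values assumed from AWS examples)
events.put_rule(
    Name="glue-dq-results",
    EventPattern=json.dumps({
        "source": ["aws.glue-dataquality"],
        "detail-type": ["Data Quality Evaluation Results Available"],
    }),
    State="ENABLED",
)

# Route matched events to an existing SNS topic (placeholder ARN)
events.put_targets(
    Rule="glue-dq-results",
    Targets=[{"Id": "notify", "Arn": "arn:aws:sns:ap-southeast-2:123456789012:glue-dq-alerts"}],
)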

Tutorial 5: View data quality results

Click Save and initiate the AWS Glue job by selecting Run.


You may click Run details to inspect the progress of the Glue job.


After the job has completed, select the Data quality tab to inspect the results.

You will see that the data quality rules have passed successfully, and you may click Download results to save them as a CSV file.


[
    {
        "ResultId": "dqresult-xxxxxxxxxxxxxxxxxxxxxxxxxx",
        "Score": 1,
        "RulesetName": "EvaluateDataQuality_nodexxxxxxxxxx",
        "EvaluationContext": "EvaluateDataQuality_nodexxxxxxxxx",
        "StartedOn": "2023-07-07T08:57:48.117Z",
        "CompletedOn": "2023-07-07T08:58:08.203Z",
        "JobName": "GlueDataQuality_tutorial",
        "JobRunId": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "RuleResults": [
            {
                "Name": "Rule_1",
                "Description": "Completeness \"avg_reviews\" > 0.8",
                "EvaluatedMetrics": {
                    "Column.avg_reviews.Completeness": 1
                },
                "Result": "PASS"
            },
            {
                "Name": "Rule_2",
                "Description": "Completeness \"n_reviews\" > 0.8",
                "EvaluatedMetrics": {
                    "Column.n_reviews.Completeness": 1
                },
                "Result": "PASS"
            },
            {
                "Name": "Rule_3",
                "Description": "Completeness \"price\" > 0.8",
                "EvaluatedMetrics": {
                    "Column.price.Completeness": 1
                },
                "Result": "PASS"
            },
            {
                "Name": "Rule_4",
                "Description": "Completeness \"price (including used books)\" > 0.8",
                "EvaluatedMetrics": {
                    "Column.price (including used books).Completeness": 1
                },
                "Result": "PASS"
            }
        ]
    }
]

If you are running multiple data quality jobs, you may filter the data quality results by date and time.
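
The same results are also available through the AWS Glue API, which is handy when you run many jobs; below is a sketch using boto3 (the filter date is only an example).

from datetime import datetime, timezone

import boto3

glue = boto3.client("glue")

# List data quality results started after a given date, then fetch the rule-level details
listed = glue.list_data_quality_results(
    Filter={"StartedAfter": datetime(2023, 7, 7, tzinfo=timezone.utc)}
)
for summary in listed["Results"]:
    result = glue.get_data_quality_result(ResultId=summary["ResultId"])
    print(result["Score"], [rule["Result"] for rule in result["RuleResults"]])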

If you navigate to the Script tab, you will see that AWS Glue Studio automatically generated Python code for the transformation steps, which you can download for reuse.


Conclusion

You have learnt how to set up data quality rules in AWS Glue Data Quality using the visual editor in AWS Glue Studio. You have also explored how to create an ETL job and examine the pass or fail data quality results from the rules you created.

Until the next lesson happy learning! πŸ˜€

AWS Glue Data Quality Quick Start videos on YouTube

If you would like to learn more, you may watch the following videos.


Reference

New announcements from AWS Glue

Next Week: AWS Builders Online Series - 13 July 2023

You may register to join AWS Builders Online Series and learn from AWS experts on architectural best practices.

