
Neeraj Iyer

Glue DataBrew - Data Profiling & Data Quality

With the evolution of technology, data is growing exponentially; it comes from new and diverse sources and is accessed by many applications. Yet roughly 80% of the work today is still spent preparing that data.

Data analysts, data engineers, ETL developers, and data scientists need the right tool for the right job, so that they do not have to spend hours in a tool they do not use every day.

AWS Glue DataBrew is a serverless, no-code data preparation tool for data analysts and data scientists. DataBrew can be accessed through the AWS Management Console, or via a plugin for Jupyter notebooks and SageMaker Studio.

At its core, Glue DataBrew is a tool for data transformation and data munging.

With Glue DataBrew, data analysts and data scientists can:
Understand data quality and detect anomalies
Clean and normalize data using over 250 built-in transformations
Trace the steps the data has been through with visual data lineage
Save transformations as recipes and reapply them when new data arrives

The following diagram summarizes the Glue DataBrew functionalities:

[Image: overview of Glue DataBrew functionalities]

How does it work?

Your data sources can be a data lake, local files, the Glue Data Catalog, Redshift, or JDBC connections, each with its own access permissions. Glue DataBrew can pull data from these sources, join it, apply transformations, build recipes from those transformations, and schedule the resulting jobs. This helps you clean data, reuse recipes, profile data, and maintain data quality and lineage. The transformed data can then be consumed by a variety of targets: visualization and reporting tools, notebooks, SageMaker models, and ETL pipelines.
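If you prefer scripting over the console, a dataset can be registered programmatically as well. A minimal boto3 sketch, assuming your file sits in S3; the bucket, key, and dataset name here are hypothetical:

```python
import boto3

# DataBrew has its own service client, separate from the Glue client.
databrew = boto3.client("databrew")

# Register a CSV file in S3 as a DataBrew dataset.
# Bucket, key, and dataset name are placeholders.
databrew.create_dataset(
    Name="sales-orders",
    Format="CSV",
    Input={
        "S3InputDefinition": {
            "Bucket": "my-data-lake-bucket",
            "Key": "raw/sales-orders.csv",
        }
    },
)
```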

Some use cases for Glue DataBrew are:
One-time data analysis for business reporting
Set up data quality rules with AWS Lambda
Data preprocessing for Machine Learning
Orchestrating data preparation in workflows

Data Profiling
When you profile your data, DataBrew creates a report called a data profile. It gives you the existing shape of your data, including the context of the content, the structure of the data, and its relationships. A data profile can be created for any dataset by running a data profile job.

Using DataBrew, you can evaluate the quality of your data by profiling it to understand data patterns and detect anomalies. You can also examine and collect statistical summaries about the data in the data profile section.

[Image: data profile overview in the DataBrew console]

Creating a data profile
Once you have loaded your data, navigate to Datasets, select the dataset you loaded, and click Run data profile. If this is your first profile job, you will be prompted to create one; give it a name and, under the job output settings, choose where the profile output should be written. Under data profile configurations you will find a variety of options; here, select Enable PII statistics and choose all categories. Apply the default permissions, then create and run the job.
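The same steps can be scripted. A minimal boto3 sketch, reusing the hypothetical dataset from above with a placeholder IAM role and output bucket; EntityDetectorConfiguration is the API counterpart of the console's Enable PII statistics option, and the entity types listed here are only a sample of the categories documented for DataBrew:

```python
import boto3

databrew = boto3.client("databrew")

# Create a profile job over the dataset registered earlier.
# Role ARN and output bucket are placeholders.
databrew.create_profile_job(
    Name="sales-orders-profile",
    DatasetName="sales-orders",
    RoleArn="arn:aws:iam::123456789012:role/DataBrewServiceRole",
    OutputLocation={"Bucket": "my-databrew-output", "Key": "profiles/"},
    Configuration={
        # Maps to "Enable PII statistics" in the console; selecting all
        # categories means listing every entity type from the docs here.
        "EntityDetectorConfiguration": {
            "EntityTypes": ["EMAIL", "PHONE_NUMBER", "USA_SSN", "CREDIT_CARD"],
        }
    },
)

# Creating the job does not run it; start a run explicitly.
run = databrew.start_job_run(Name="sales-orders-profile")
print(run["RunId"])
```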

[Image: summary of identified PII columns]

The profile report gives you a summary of identified PII columns mapped to PII categories.

Data Quality Checks

You can create data quality rules over your dataset. Provide a name for the ruleset and add a rule; you can either define rules yourself or go with the recommendations Glue DataBrew provides. When creating a new rule, you specify the data quality scope and the rule's success criteria, and if a rule needs multiple data checks, you add them with conditions. Once the rules are defined, click Create ruleset. Then give the validation job a name, and create and run it.
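Rulesets can be defined through the API as well. A sketch under the same hypothetical names; the check expressions mirror the quantity and total sales > 0 rules used later in this post, and the exact expression grammar and ARN formats are documented in the DataBrew developer guide:

```python
import boto3

databrew = boto3.client("databrew")

# Rulesets target a dataset by ARN (region and account are placeholders).
dataset_arn = "arn:aws:databrew:us-east-1:123456789012:dataset/sales-orders"

databrew.create_ruleset(
    Name="sales-orders-quality-rules",
    TargetArn=dataset_arn,
    Rules=[
        {
            # Every row must have quantity > 0; the substitution map binds
            # the :col1/:val1 placeholders in the check expression.
            "Name": "quantity-positive",
            "CheckExpression": ":col1 > :val1",
            "SubstitutionMap": {":col1": "`quantity`", ":val1": "0"},
            "Threshold": {
                "Value": 100,
                "Type": "GREATER_THAN_OR_EQUAL",
                "Unit": "PERCENTAGE",
            },
        },
        {
            "Name": "total-sales-positive",
            "CheckExpression": ":col1 > :val1",
            "SubstitutionMap": {":col1": "`total_sales`", ":val1": "0"},
            "Threshold": {
                "Value": 100,
                "Type": "GREATER_THAN_OR_EQUAL",
                "Unit": "PERCENTAGE",
            },
        },
    ],
)
```

To evaluate the rules, attach the ruleset to a profile job through the job's ValidationConfigurations parameter and run the job again.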

[Image: creating a data quality ruleset in the console]

Once the job run finishes, you can open the data profile section to view the summary, check the column statistics tab, and see your data quality rules in action under the data quality rules tab.
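If you started the run from code, you can also poll for completion instead of watching the console; describe_job_run exposes the run state (names reuse the hypothetical ones from earlier):

```python
import time

import boto3

databrew = boto3.client("databrew")

run = databrew.start_job_run(Name="sales-orders-profile")

# Poll until the run reaches a terminal state.
while True:
    state = databrew.describe_job_run(
        Name="sales-orders-profile", RunId=run["RunId"]
    )["State"]
    if state not in ("STARTING", "RUNNING", "STOPPING"):
        break
    time.sleep(30)

print("Profile job finished with state:", state)
```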

[Image: data profile summary and data quality rules tabs]

You can see below that the data quality checks have failed: there are duplicate rows, and the quantity and total sales > 0 rules fail. You can filter out such rows using advanced transforms.

[Image: failed data quality rules]

In the next blog, we will see how to apply transformations to the same dataset and resolve the data quality checks that failed.
