Abdul Raheem for AWS Community Builders

Deequ: Your Data's BFF

Data quality is crucial for reliable applications, whether you're training machine learning models or basing insights and decisions on your data. Issues like missing values, data distribution shifts, and incorrect data can lead to malfunctions, inaccurate machine learning models, and bad business decisions.

What Is Deequ and Why Is It Important?

Deequ is a library built on Apache Spark that allows you to create "unit tests for data," helping you check and measure data quality in large datasets. Deequ is used internally at Amazon to ensure the quality of various large production datasets.

In that setup, the relevant team members set data quality constraints, and the system regularly computes metrics and enforces the rules, pushing a dataset on to ML models only when the checks succeed.

Image Source: AWS

Best of all, Deequ helps when constraints are hard to write by hand. If we own a large number of datasets, or a dataset has many columns, manually defining appropriate constraints can be challenging. Deequ can automatically suggest useful constraints by analyzing the data distribution: it begins with data profiling and then applies a series of rules to the results. We will see this in detail in the practical part.
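The profile-then-rules idea can be sketched in plain Python. This is only an illustration of the approach, not Deequ's implementation: the `profile` and `suggest` helpers, the three rules, and the sample rows are all made up for this example.

```python
def profile(rows, column):
    """Collect simple statistics for one column of a list of dict rows."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "column": column,
        "count": len(values),
        "completeness": len(present) / len(values) if values else 0.0,
        "distinct": len(set(present)),
        "non_negative": bool(present) and all(
            isinstance(v, (int, float)) and v >= 0 for v in present
        ),
    }


def suggest(stats):
    """Apply hand-written rules to a profile and return suggested constraints."""
    column = stats["column"]
    suggestions = []
    if stats["completeness"] == 1.0:
        suggestions.append(f"'{column}' is not null")
    if stats["non_negative"]:
        suggestions.append(f"'{column}' has no negative values")
    if stats["count"] > 0 and stats["distinct"] == stats["count"]:
        suggestions.append(f"'{column}' is unique")
    return suggestions


# Hypothetical sample rows, loosely shaped like taxi trip records.
rows = [
    {"trip_id": 1, "VendorID": 1, "trip_distance": 2.5},
    {"trip_id": 2, "VendorID": 2, "trip_distance": 0.0},
    {"trip_id": 3, "VendorID": 1, "trip_distance": None},
]

for name in ("trip_id", "VendorID", "trip_distance"):
    print(name, "->", suggest(profile(rows, name)))
```

Each rule looks only at the profile, never at the raw rows, which is what lets the profiling step run once and feed any number of rules.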

Moreover, Deequ leverages Spark to compute and provide direct access to data quality metrics like completeness and correlation through an optimized set of aggregation queries.
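To make the two metrics concrete, here is a minimal plain-Python sketch of what completeness and correlation measure, each computed from simple running aggregates. It is not how Deequ plans its Spark queries; the function names and sample values are assumptions for illustration only.

```python
import math


def completeness(values):
    """Fraction of entries that are not None."""
    if not values:
        raise ValueError("completeness is undefined for an empty column")
    return sum(v is not None for v in values) / len(values)


def correlation(xs, ys):
    """Pearson correlation computed from running sums in a single pass."""
    n = sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0.0
    for x, y in zip(xs, ys):
        if x is None or y is None:
            continue  # skip pairs with a missing value
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_yy += y * y
        sum_xy += x * y
    var_x = n * sum_xx - sum_x ** 2
    var_y = n * sum_yy - sum_y ** 2
    if var_x <= 0 or var_y <= 0:
        raise ValueError("correlation is undefined for a constant or empty column")
    return (n * sum_xy - sum_x * sum_y) / math.sqrt(var_x * var_y)


# Made-up sample columns.
fares = [5.0, 10.0, 15.0, None]
totals = [6.0, 12.0, 18.0, 20.0]

print(completeness(fares))
print(correlation(fares, totals))
```

Because both metrics reduce to counts and sums, they can be computed per partition and then combined, which is the property that makes this kind of metric a good fit for a distributed engine like Spark.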

Purpose of Using Deequ?

Deequ's purpose is to "unit-test" data to find errors at an early stage, before the data gets fed to consuming systems or machine learning algorithms.

Some of the benefits of using Deequ are as follows:

  1. Early data error detection
  2. Improved data reliability and trustworthiness
  3. Automated data validation
  4. Improved data integrity
  5. Streamlined data profiling
  6. Integration with Spark (scalability + efficiency)

Ideal Datasets for Deequ?

Deequ is useful for datasets that are meant to be consumed by machines or used for data analysis. Put simply, we can use Deequ for any dataset that fits into a Spark DataFrame.

This includes tabular data, such as spreadsheets or database tables with a well-defined schema. Deequ verifies data quality by applying pre-defined or automatically suggested constraints to check consistency, accuracy, and completeness. It is designed specifically for structured data with clearly defined attributes.

Important: Deequ's strength lies in handling massive datasets efficiently. Its distributed processing power is not fully utilized on a dataset of only a few thousand rows, and setting up and managing a Spark cluster adds overhead that can slow down the overall processing pipeline. At that size, the advantages of automated checks and scalability matter little, because we can achieve similar results at lower computational cost and with minimal manual effort.

Deequ excels at ensuring data quality in batch data processing rather than streaming data.

Pros & Cons of Deequ

Before diving into Deequ, it’s important to weigh its pros and cons to understand how it fits your data quality needs and what challenges you might face.

Pros:

  1. Declarative API: Easy to use. We specify what we expect the data to look like or how it should behave, rather than writing complex validation checks manually.
    Example: Instead of writing complex code to check if a column has missing values, we can simply say "This column should not have missing values" in Deequ's declarative language. This makes it easier for our team to understand and maintain our data validation checks.

  2. Metrics and Constraints: Provides various data quality metrics and allows defining constraints based on those metrics for comprehensive data analysis.
    Example: We can define constraints on the number of missing values allowed in a column (e.g., "no more than 5% missing values"). Deequ will calculate the actual percentage of missing values and compare it to our constraint, highlighting any violations. Additionally, we can define constraints on distribution statistics (e.g., "the mean of the age column stays within an expected range"), allowing for comprehensive data analysis.

  3. Scalability: Leverages Apache Spark for distributed processing, making it efficient for big data (billions of rows).
    Example: Imagine a dataset with billions of customer records. Validating this data locally on a single machine would be slow and impractical. Deequ utilizes Apache Spark, which distributes the data and validation tasks across multiple machines in a cluster. This allows Deequ to handle massive datasets efficiently, analyzing each record in parallel.

  4. Automation: Integrates with ML pipelines for automatic data validation, catching issues early and preventing downstream problems.
    Example: We can integrate Deequ into our machine learning workflow. Before training our model, Deequ automatically validates the data, catching issues like missing values or unexpected data formats. This keeps us from training a model on bad data, which could lead to inaccurate results.

  5. Open-Source: Freely available and customizable to specific needs.
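As a rough sketch of the declarative style described above, this is how "this column should not have missing values" and "no more than 5% missing values" might be written with PyDeequ's `Check` class. The table and its column names (`customer_id`, `age`) are hypothetical, and the function only declares the check; actually running it needs a Spark session and a verification step.

```python
def at_most_5_percent_missing(completeness):
    """Assertion on the completeness metric (fraction of non-null values)."""
    return completeness >= 0.95


def build_customer_check(spark):
    """Declare expectations for a hypothetical customer table."""
    # Imported lazily: recent PyDeequ releases expect the SPARK_VERSION
    # environment variable to be set before the package is imported.
    from pydeequ.checks import Check, CheckLevel

    return (
        Check(spark, CheckLevel.Error, "customer data check")
        .isComplete("customer_id")  # no missing values allowed
        .hasCompleteness("age", at_most_5_percent_missing)
    )
```

The assertion is an ordinary Python callable that receives the computed metric, so a threshold like 5% lives in one readable place instead of being buried in validation code.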

Cons:

  1. Learning Curve: Requires some understanding of Apache Spark and data quality concepts.
  2. Limited Out-of-the-Box Rules: While Deequ offers a good set of metrics, we might need to write custom rules for complex validation needs.
  3. Overhead for Small Datasets: Setting up Deequ involves some initial configuration and coding, plus cloud infrastructure costs where applicable. For very small datasets (like 2,000 rows), the time and effort spent setting up Deequ might outweigh the benefits of automated data validation.

Practical Work

In this practical work, we will use PyDeequ, an open-source Python wrapper over Deequ (an open-source tool developed and used at Amazon). While Deequ is written in Scala, PyDeequ allows us to use its data quality and testing capabilities from Python and PySpark, the language of choice of many data scientists.

Image Source: AWS

We can call each Deequ function using Python syntax. The wrappers translate the commands to the underlying Deequ calls and return their response.

Step-by-Step Process

1. Setting Up PySpark:

I ran the PyDeequ notebook in Amazon SageMaker and set up the environment with the following steps:

  • Install necessary Python packages: pydeequ, sagemaker-pyspark, and pyspark.
  • Download and set up Java JDK (OpenJDK 11).
  • Set the JAVA_HOME environment variable and update the PATH.
  • Verify Java installation.
  • Initialize a Spark session with PyDeequ configurations.

Setting Up
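The original setup code was shared as a screenshot, so here is a minimal sketch of the same steps, assuming the packages are already installed (for example with `pip install pydeequ sagemaker-pyspark pyspark`) and a JDK is available. The JDK path and Spark version in the usage comment are placeholders, and `configure_environment` and `create_spark_session` are helper names chosen for this example.

```python
import os


def configure_environment(java_home, spark_version):
    """Point the process at a JDK and tell PyDeequ which Spark version to target."""
    os.environ["JAVA_HOME"] = java_home
    os.environ["PATH"] = (
        os.path.join(java_home, "bin") + os.pathsep + os.environ.get("PATH", "")
    )
    # Recent PyDeequ releases read SPARK_VERSION when the package is imported.
    os.environ["SPARK_VERSION"] = spark_version


def create_spark_session():
    """Start a Spark session with the Deequ jar on the classpath."""
    # Imported here so that SPARK_VERSION is already set when PyDeequ loads.
    import pydeequ
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .config("spark.jars.packages", pydeequ.deequ_maven_coord)
        .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
        .getOrCreate()
    )


# Example usage (placeholder values, adjust to your environment):
# configure_environment("/usr/lib/jvm/java-11-openjdk-amd64", "3.3")
# spark = create_spark_session()
```

Running `java -version` in a terminal before starting Spark is a quick way to verify the Java installation.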

2. Loading Dataset & Visualizing Schema:

For the dataset, I used the open NYC TLC trip record data.

Load Dataset
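A minimal loading sketch, assuming the trip records were downloaded as a Parquet file and that `spark` is an active Spark session. The file path in the usage comment is a placeholder, and `load_trips` is a helper name chosen for this example.

```python
def load_trips(spark, path):
    """Read a Parquet file of trip records into a Spark DataFrame."""
    return spark.read.parquet(path)


# Example usage (placeholder path):
# df = load_trips(spark, "data/tlc_trips.parquet")
# df.printSchema()  # prints column names and types
```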

3. Data Analysis:

Before we define checks on the dataset, we want to calculate some statistics on the dataset; we call them metrics.

Data Analysis
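Here is a sketch of how such metrics can be requested with PyDeequ's `AnalysisRunner`. The analyzers mirror the metrics discussed in this post, but the exact code from the screenshot is not reproduced: in particular, the definition of a "long trip" (`trip_distance >= 10`) is an assumption made for this example.

```python
# Assumed definition of a "long trip"; the threshold is not from the article.
LONG_TRIP_PREDICATE = "trip_distance >= 10"


def compute_metrics(spark, df):
    """Compute dataset metrics and return them as a Spark DataFrame."""
    # Imported lazily: recent PyDeequ releases expect SPARK_VERSION to be set
    # before the package is imported.
    from pydeequ.analyzers import (
        AnalysisRunner,
        AnalyzerContext,
        ApproxCountDistinct,
        Completeness,
        Compliance,
        Correlation,
        Mean,
        Size,
    )

    result = (
        AnalysisRunner(spark)
        .onData(df)
        .addAnalyzer(Size())
        .addAnalyzer(Completeness("VendorID"))
        .addAnalyzer(ApproxCountDistinct("VendorID"))
        .addAnalyzer(Mean("trip_distance"))
        .addAnalyzer(Compliance("Long trips", LONG_TRIP_PREDICATE))
        .addAnalyzer(Correlation("total_amount", "fare_amount"))
        .addAnalyzer(Correlation("fare_amount", "trip_distance"))
        .run()
    )
    return AnalyzerContext.successMetricsAsDataFrame(spark, result)


# Example usage, assuming `spark` and `df` already exist:
# compute_metrics(spark, df).show(truncate=False)
```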

The output table shows the computed metrics for the dataset, including its size, completeness, approximate distinct counts, mean values, compliance, and correlations.

From the above data metrics, we learned the following:

| Metric | Observation |
| --- | --- |
| Compliance of Long Trips | Only 6.53% of trips are classified as long trips. |
| Mean Trip Distance | The average trip distance is approximately 5.37 miles. |
| Dataset Size | The dataset contains 2,463,931 records. |
| VendorID Completeness | The VendorID column is 100% complete, with no missing values. |
| Correlation between Total Amount and Fare Amount | There is a very high correlation (0.9999) between total_amount and fare_amount, indicating an almost perfect linear relationship. |
| Correlation between Fare Amount and Trip Distance | There is a very low correlation (0.0004) between fare_amount and trip_distance, indicating little to no linear relationship. |
| Approximate Count of Distinct VendorIDs | There are approximately 4 distinct values in the VendorID column. |

4. Define and Run Tests on Data:

After analyzing the data, it's important to make sure the same properties hold in future datasets. By adding data quality checks to the pipeline, we can ensure every dataset is reliable for any application that uses it.

Run Test
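Here is a sketch of checks along these lines using PyDeequ's `Check` and `VerificationSuite`. It covers the record count, completeness, uniqueness, and non-negativity checks discussed in this post; the value-range checks are left out because the exact allowed values and thresholds used in the screenshot are not reproduced here. The helper names are chosen for this example.

```python
def at_least_two_million(size):
    """Assertion on the Size metric: expect at least 2,000,000 rows."""
    return size >= 2_000_000


def run_checks(spark, df):
    """Run the declared checks and return one result row per constraint."""
    # Imported lazily: recent PyDeequ releases expect SPARK_VERSION to be set
    # before the package is imported.
    from pydeequ.checks import Check, CheckLevel
    from pydeequ.verification import VerificationResult, VerificationSuite

    check = (
        Check(spark, CheckLevel.Warning, "trip data check")
        .hasSize(at_least_two_million)
        .isComplete("VendorID")
        .isComplete("payment_type")
        .isUnique("VendorID")
        .isNonNegative("DOLocationID")
        .isNonNegative("PULocationID")
        .isNonNegative("trip_distance")
    )
    result = VerificationSuite(spark).onData(df).addCheck(check).run()
    return VerificationResult.checkResultsAsDataFrame(spark, result)


# Example usage, assuming `spark` and `df` already exist:
# run_checks(spark, df).show(truncate=False)
```

With `CheckLevel.Warning`, a failed constraint is reported in the result rather than treated as an error, which suits exploratory runs like this one.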

Output:

Tests Output

Analysis

| Check | Outcome |
| --- | --- |
| Record Count | Passed, with over 2,000,000 entries. |
| Completeness | All columns (VendorID, payment_type, etc.) passed the completeness check. |
| Uniqueness | Failed for VendorID, indicating duplicate values exist. |
| Value Range | Passed for VendorID and payment_type (with 96% of payment_type values within ["1", "2"]). |
| Non-Negativity | Passed for DOLocationID, PULocationID, and trip_distance. |

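Note: VendorID failed the uniqueness check because it contains duplicate values. That is expected here: the analysis step found only about 4 distinct VendorID values, so the column identifies a vendor rather than an individual trip. All other checks were successful.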

5. Automated Constraints Generation:

As we discussed earlier, Deequ can automatically suggest certain data quality checks for a dataset. Let's see how it works in action.

Constraints
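Here is a sketch of requesting suggestions with PyDeequ's `ConstraintSuggestionRunner` and its `DEFAULT` rule set. The result is assumed to be a dictionary whose `constraint_suggestions` entries carry `column_name`, `description`, and `code_for_constraint` keys, as in the PyDeequ README; check this against the version you have installed. `format_suggestions` is a helper written for this example.

```python
def suggest_constraints(spark, df):
    """Profile the data and return Deequ's constraint suggestions."""
    # Imported lazily: recent PyDeequ releases expect SPARK_VERSION to be set
    # before the package is imported.
    from pydeequ.suggestions import DEFAULT, ConstraintSuggestionRunner

    return (
        ConstraintSuggestionRunner(spark)
        .onData(df)
        .addConstraintRule(DEFAULT())
        .run()
    )


def format_suggestions(result):
    """Turn a suggestion result into one readable line per suggestion."""
    lines = []
    for item in result.get("constraint_suggestions", []):
        lines.append(
            f"{item['column_name']}: {item['description']} "
            f"-> {item['code_for_constraint']}"
        )
    return lines


# Example usage, assuming `spark` and `df` already exist:
# for line in format_suggestions(suggest_constraints(spark, df)):
#     print(line)
```

Each suggestion includes a code snippet for the corresponding constraint, so accepted suggestions can be copied into a check.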

Now you can review the suggested constraints and apply the ones that make sense for your data, so that future datasets are held to the same quality standards.

Conclusion:

Deequ is a powerful tool for automating data quality checks at scale, ensuring reliable and accurate datasets for better decision-making.

That’s it for today’s Deequ blog! I hope you found it insightful and learned something new. For more information, detailed documentation, and the original code, feel free to explore the following pages:

  1. Deequ
  2. PyDeequ
  3. Deequ GitHub Repository
  4. Python-Deequ GitHub Repository
