<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Cícero Joasyo Mateus de Moura</title>
    <description>The latest articles on DEV Community by Cícero Joasyo Mateus de Moura (@cicerojmm).</description>
    <link>https://dev.to/cicerojmm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1068068%2F632c787b-6532-453b-ab4c-3caa66b6cf61.jpeg</url>
      <title>DEV Community: Cícero Joasyo Mateus de Moura</title>
      <link>https://dev.to/cicerojmm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cicerojmm"/>
    <language>en</language>
    <item>
      <title>Simplifying Data Transformation in Redshift: An Approach with DBT and Airflow</title>
      <dc:creator>Cícero Joasyo Mateus de Moura</dc:creator>
      <pubDate>Tue, 07 Nov 2023 17:23:42 +0000</pubDate>
      <link>https://dev.to/aws-builders/simplifying-data-transformation-in-redshift-an-approach-with-dbt-and-airflow-18ko</link>
      <guid>https://dev.to/aws-builders/simplifying-data-transformation-in-redshift-an-approach-with-dbt-and-airflow-18ko</guid>
      <description>&lt;p&gt;&lt;strong&gt;Let's transform and model data stored in Redshift with a simple and effective approach using DBT and Airflow. Why make it complicated when we can simplify it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this article, we'll work with the following scenario:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Amazon Redshift stores the input data from an e-commerce website.&lt;br&gt;
We'll provide insights into customer behavior.&lt;br&gt;
The DW is not yet structured for analytical applications; it contains a copy of the raw data.&lt;br&gt;
We'll transform the data and structure the DW to enable business analysis. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The data used in this article is an open dataset of sales on the Amazon website, available on Kaggle &lt;a href="https://www.kaggle.com/datasets/karkavelrajaj/amazon-sales-dataset" rel="noopener noreferrer"&gt;at this link&lt;/a&gt;.&lt;br&gt;
The complete code for this project can also be found on &lt;a href="https://github.com/cicerojmm/transformacaoDadosDbtAirflow/tree/main" rel="noopener noreferrer"&gt;my GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  About the solution
&lt;/h2&gt;

&lt;p&gt;To solve the presented problem, we have the following architecture design with the technologies and tools that will be used:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuz838a09u12clc1mi4bo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuz838a09u12clc1mi4bo.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For those unfamiliar with the tools to be used, I'll provide a brief explanation of them below.&lt;br&gt;
 If you're already familiar, feel free to skip to the next section: Data Modeling.&lt;/p&gt;
&lt;h3&gt;
  
  
  About dbt
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.getdbt.com/" rel="noopener noreferrer"&gt;Data Build Tool or dbt&lt;/a&gt; is a modern tool for data transformation in the data warehouse or lakehouse scenario. It allows data engineers, data scientists, and data analysts to manipulate data using SQL.&lt;/p&gt;

&lt;p&gt;DBT has two versions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;dbt-core:&lt;/strong&gt; It's an open-source version maintained by the community and can be freely used.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dbt-cloud:&lt;/strong&gt; It's the paid version managed as a SaaS, which can be used in the cloud with a monthly subscription.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  About Airflow and Cosmos
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://airflow.apache.org/" rel="noopener noreferrer"&gt;Airflow&lt;/a&gt; is the most widely used and well-known tool for orchestrating data workflows. It allows for efficient pipeline construction, scheduling, and monitoring.&lt;/p&gt;

&lt;p&gt;Since we're talking about Airflow, let's also discuss Cosmos.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.astronomer.io/cosmos/" rel="noopener noreferrer"&gt;Cosmos&lt;/a&gt; is a library for Airflow developed by Astronomer that aims to simplify the execution of DBT projects with Airflow.&lt;/p&gt;

&lt;p&gt;With Cosmos, you can execute a DBT project through a group of Airflow tasks, which are automatically recognized. Each DBT model becomes an Airflow task or group, performing transformations and tests.&lt;/p&gt;
&lt;h3&gt;
  
  
  Amazon Redshift
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://aws.amazon.com/pt/redshift/" rel="noopener noreferrer"&gt;Redshift&lt;/a&gt; cloud-based data warehouse is intended for analyzing and querying massive amounts of data. &lt;br&gt;
Nowadays, Redshift has both a managed and a serverless version and has evolved into a robust data platform.&lt;/p&gt;
&lt;h2&gt;
  
  
  Data Modeling
&lt;/h2&gt;

&lt;p&gt;To better understand the solution we'll adopt in this article, the following diagram shows the data modeling for building the DW.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxw9g8gem8qtogbt32vvb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxw9g8gem8qtogbt32vvb.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As mentioned earlier, there is a table where the data is stored without transformation (raw data), which is the &lt;strong&gt;sales&lt;/strong&gt; table. &lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;sales&lt;/strong&gt; table also serves as a staging area for data processing and the creation of other tables.&lt;/p&gt;

&lt;p&gt;In the data warehouse modeling, there are three dimension tables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;dim_product&lt;/strong&gt;: This table contains data about the store's products.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dim_user&lt;/strong&gt;: It contains data about users who are customers of the store.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dim_rating&lt;/strong&gt;: It contains all the product ratings given by customers of the store.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And two fact tables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;fact_product_rating&lt;/strong&gt;: This table contains data for extracting product rating metrics by users, e.g., the top-rated products.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;fact_sales_category&lt;/strong&gt;: This table contains data for extracting sales metrics by product categories, e.g., the top categories with the most profit for the store.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note: It's important to remember that the example in this article is hypothetical and may not represent the best data modeling for the given data; it's for educational purposes only.&lt;/p&gt;
&lt;h2&gt;
  
  
  Show me the Code
&lt;/h2&gt;

&lt;p&gt;Next, it's time to explore in practice how to build the dbt project, the DAG in Airflow, and the results of the data transformations in Redshift.&lt;/p&gt;
&lt;h4&gt;
  
  
  1. Dbt Project
&lt;/h4&gt;

&lt;p&gt;In this section, the focus is on the structure, configuration, and SQL code for the data transformations that are part of the project.&lt;br&gt;
To start a DBT project, you need to install the Python package and use the CLI. &lt;/p&gt;

&lt;p&gt;To install the DBT package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ssh"&gt;&lt;code&gt;&lt;span class="err"&gt;$&lt;/span&gt; &lt;span class="k"&gt;pip&lt;/span&gt; install dbt-core==1.4.9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To initialize a project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ssh"&gt;&lt;code&gt;&lt;span class="err"&gt;$&lt;/span&gt; &lt;span class="k"&gt;dbt&lt;/span&gt; init &amp;lt;project_name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, a DBT project already has some folders and configurations.&lt;br&gt;
 Here's the structure with all the folders and files created for our scenario:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- dbt_project/
  |______ dbt_project.yml
  |______ analyses/
  |______ macros/
  |______ models/
  |   |   |______ dimensions/
  |   |   |      |______ dim_product.sql
  |   |   |      |______ dim_rating.sql
  |   |   |      |______ dim_user.sql
  |   |   |______ facts/
  |   |   |      |______ fact_product_rating.sql
  |   |   |      |______ fact_sales_category.sql
  |   |   |______ staging/
  |   |   |      |______ stg_sales_eph.sql
  |   |   |      |______ staging.yml
  |______ seeds/
  |______ snapshots/
  |______ tests/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most of the work is done within the &lt;strong&gt;models&lt;/strong&gt; folder.&lt;br&gt;
The next step is to analyze each SQL transformation code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Staging Table&lt;/strong&gt;&lt;br&gt;
As mentioned earlier, the staging table is the &lt;strong&gt;sales&lt;/strong&gt; table itself, which already exists in the data warehouse. Essentially, it involves loading the entire table into memory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'ephemeral'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'public'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'sales'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'd like to highlight three points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DBT supports several materializations (such as table, view, incremental, and ephemeral); in this project we use two: &lt;strong&gt;ephemeral&lt;/strong&gt;, a virtual table that exists only in memory during processing and is not persisted in the database, and &lt;strong&gt;table&lt;/strong&gt;, which is persisted in the data warehouse.&lt;/li&gt;
&lt;li&gt;The second point concerns processing, which is entirely done in the data warehouse itself, in this case, Redshift. So when I say that the table is loaded into memory, it refers to Redshift's memory.&lt;/li&gt;
&lt;li&gt;Models, whether &lt;strong&gt;ephemeral&lt;/strong&gt; or &lt;strong&gt;table&lt;/strong&gt;, are created with the same name as their SQL file.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Dimension Tables&lt;/strong&gt;&lt;br&gt;
Now let's analyze the SQL code for the three dimension tables, which will be materialized as &lt;strong&gt;table&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;The first one is &lt;strong&gt;dim_product&lt;/strong&gt;, containing data about the store's products based on the &lt;strong&gt;sales&lt;/strong&gt; table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'table'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;DISTINCT&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;about_product&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;img_link&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_link&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stg_sales_eph'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second is &lt;strong&gt;dim_rating&lt;/strong&gt;, which includes product ratings by buyers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'table'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rating_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stg_sales_eph'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The third and final one is &lt;strong&gt;dim_user&lt;/strong&gt;, containing data about store customers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'table'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stg_sales_eph'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the dimension tables code completed, it's time to look at the fact tables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fact Tables&lt;/strong&gt;&lt;br&gt;
The two fact tables are more complex as they include business metrics, aggregations, and relationships with dimensions.&lt;/p&gt;

&lt;p&gt;Here's the code for &lt;strong&gt;fact_product_rating&lt;/strong&gt;, which relates to the product and rating dimensions, calculating the average rating score for each product.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'table'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rating&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="s1"&gt;'^[0-9.]+$'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rating&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_rating&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stg_sales_eph'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'dim_product'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'dim_rating'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second fact table is &lt;strong&gt;fact_sales_category&lt;/strong&gt;, which groups sales by user and product category and calculates the total revenue. It relates to the user and product dimensions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'table'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;REGEXP_REPLACE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;actual_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[^0-9.]'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="n"&gt;sales_amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stg_sales_eph'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'dim_product'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'dim_user'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
    &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the data modeling ready, it's time to complete the project and make it all run with Airflow.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Construction of the DAG in Airflow
&lt;/h4&gt;

&lt;p&gt;As mentioned earlier, for this project we'll use the Cosmos library, which offers great integration between DBT and Airflow.&lt;br&gt;
To do this, we need to follow a few steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install the necessary dependencies on the Airflow environment to run the project.&lt;/li&gt;
&lt;li&gt;Configure an Airflow connection for Redshift. This is an interesting feature of Cosmos; we don't need to configure the database connection in the DBT project. Instead, we can pass it as a parameter through Airflow.&lt;/li&gt;
&lt;li&gt;Finally, build the tasks with the "DbtTaskGroup" from Cosmos.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Following the mentioned steps, here's the "requirements.txt" file with the necessary dependencies to be installed on the Airflow server:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ssh"&gt;&lt;code&gt;&lt;span class="k"&gt;dbt&lt;/span&gt;-core==1.4.9
&lt;span class="k"&gt;dbt&lt;/span&gt;-redshift==1.4.0
&lt;span class="k"&gt;astronomer&lt;/span&gt;-cosmos==1.0.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The configuration of the connection in the Airflow UI for Redshift looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xqgxi1089mwrdzwg1xy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xqgxi1089mwrdzwg1xy.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, here's the complete DAG code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.decorators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.dummy_operator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DummyOperator&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cosmos&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DbtTaskGroup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ProjectConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ProfileConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RenderConfig&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cosmos.profiles&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RedshiftUserPasswordProfileMapping&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cosmos.constants&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TestBehavior&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pendulum&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;


&lt;span class="n"&gt;CONNECTION_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redshift_default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;DB_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amazon_sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;SCHEMA_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;public&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;ROOT_PATH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/opt/airflow/dags/dbt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;DBT_PROJECT_PATH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ROOT_PATH&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/sales_dw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;profile_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProfileConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;profile_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales_dw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;profile_mapping&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;RedshiftUserPasswordProfileMapping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;conn_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CONNECTION_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;profile_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SCHEMA_NAME&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dag_dbt_sales_dw_cosmos&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;

    &lt;span class="n"&gt;start_process&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DummyOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_process&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;transform_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DbtTaskGroup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;group_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transform_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;project_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ProjectConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DBT_PROJECT_PATH&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;profile_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;profile_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;start_process&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;transform_data&lt;/span&gt;


&lt;span class="nf"&gt;dag_dbt_sales_dw_cosmos&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'd like to highlight the following part of the code for further analysis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;transform_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DbtTaskGroup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;group_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transform_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;project_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ProjectConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DBT_PROJECT_PATH&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
       &lt;span class="n"&gt;profile_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;profile_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
   &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "DbtTaskGroup" reads the directory where the DBT project is located. In the previous code, the path is indicated by the "DBT_PROJECT_PATH" variable. &lt;br&gt;
It then constructs Airflow tasks based on the models created in DBT, which, in our case, include staging, dimensions, and fact tables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ROOT_PATH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/opt/airflow/dags/dbt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;DBT_PROJECT_PATH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ROOT_PATH&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/sales_dw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: It's important to emphasize that in this case, the DBT project needs to be on the same server as Airflow, as defined above.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;ProfileConfig&lt;/strong&gt; is the object that configures the connection. &lt;br&gt;
Essentially, it's the Airflow connection and some parameters that can be passed, such as the database schema.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CONNECTION_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redshift_default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;DB_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amazon_sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;SCHEMA_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;public&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;profile_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProfileConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="n"&gt;profile_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales_dw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;target_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;profile_mapping&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;RedshiftUserPasswordProfileMapping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;conn_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CONNECTION_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;profile_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SCHEMA_NAME&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
   &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
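

&lt;p&gt;One extra note: the DAG above also imports &lt;strong&gt;RenderConfig&lt;/strong&gt; and &lt;strong&gt;TestBehavior&lt;/strong&gt; from Cosmos, although it doesn't use them. As a hedged sketch (assuming Cosmos 1.x), they can be passed to the &lt;strong&gt;DbtTaskGroup&lt;/strong&gt; to control how models are rendered and when the dbt tests run, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical variation of the task group above: run all dbt tests once,
# after every model has been built, instead of after each model.
transform_data = DbtTaskGroup(
    group_id="transform_data",
    project_config=ProjectConfig(DBT_PROJECT_PATH),  # same path defined in the DAG above
    profile_config=profile_config,                   # same profile_config defined above
    render_config=RenderConfig(test_behavior=TestBehavior.AFTER_ALL),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;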



&lt;p&gt;Constructing the DAG is that simple – it's straightforward and efficient, isn't it?&lt;/p&gt;

&lt;p&gt;Here's an image of how the tasks built automatically by Cosmos in the Airflow UI will look:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpxzogz0wt3xiysnn3xh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpxzogz0wt3xiysnn3xh.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see the dependencies for building &lt;strong&gt;fact_product_rating&lt;/strong&gt; in the image, as defined by the SQL in the models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbcqcbn4eqiusqknomlok.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbcqcbn4eqiusqknomlok.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also highlighted in the image are the dependencies for building &lt;strong&gt;fact_sales_category&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08lcqhlo2pxopscufg04.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08lcqhlo2pxopscufg04.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Finally... Results in Redshift
&lt;/h4&gt;

&lt;p&gt;Upon running the DAG and building the models configured in DBT, we achieve the following result in Redshift:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4d0jqns7lp943uy6d2a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4d0jqns7lp943uy6d2a.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see the &lt;strong&gt;sales&lt;/strong&gt; table, which contains raw data used for staging, as well as the materialized dimension and fact tables.&lt;br&gt;
Performing a query on the &lt;strong&gt;fact_sales_category&lt;/strong&gt; table, we get the following result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00nlpd26u7pznivexkir.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00nlpd26u7pznivexkir.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;DBT is an excellent tool for data processing and modeling, providing convenience and speed for data projects. One of its strengths is that it automatically infers the dependencies between models, so you only need to write your transformations in SQL.&lt;/p&gt;

&lt;p&gt;The Cosmos library from Astronomer is also a great help in orchestrating DBT with Airflow, simplifying and speeding up the process. It provides an overview of the DBT models and their execution. &lt;/p&gt;

&lt;p&gt;Airflow + DBT + Cosmos = The Perfect Combination ❤&lt;/p&gt;

&lt;p&gt;If you're looking for a simple, efficient, and cost-effective data stack, the solution presented in this article may be effective and ideal for your scenario.&lt;/p&gt;

&lt;p&gt;I'll be writing more articles about DBT in the future, so stay tuned!&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Follow me here and on other social networks:&lt;br&gt;
LinkedIn: &lt;a href="https://www.linkedin.com/in/cicero-moura/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/cicero-moura/&lt;/a&gt;&lt;br&gt;
Github: &lt;a href="https://github.com/cicerojmm" rel="noopener noreferrer"&gt;https://github.com/cicerojmm&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>dbt</category>
      <category>airflow</category>
      <category>aws</category>
      <category>datamodeling</category>
    </item>
    <item>
      <title>Data-aware Scheduling in Airflow: A Practical Guide with DAG Factory</title>
      <dc:creator>Cícero Joasyo Mateus de Moura</dc:creator>
      <pubDate>Mon, 31 Jul 2023 20:37:03 +0000</pubDate>
      <link>https://dev.to/aws-builders/data-aware-scheduling-in-airflow-a-practical-guide-with-dag-factory-585k</link>
      <guid>https://dev.to/aws-builders/data-aware-scheduling-in-airflow-a-practical-guide-with-dag-factory-585k</guid>
      <description>&lt;p&gt;I recently contribute to an open-source project called &lt;a href="https://medium.com/r/?url=https%3A%2F%2Fgithub.com%2Fajbosco%2Fdag-factory" rel="noopener noreferrer"&gt;DAG Factory&lt;/a&gt;, library for building Airflow DAGs declaratively using YAML files, eliminating the need for Python coding.&lt;/p&gt;

&lt;p&gt;My contribution was to add support for the &lt;a href="https://medium.com/r/?url=https%3A%2F%2Fgithub.com%2Fajbosco%2Fdag-factory%2Fpull%2F171" rel="noopener noreferrer"&gt;Data-aware Scheduling (Datasets) functionality of Airflow&lt;/a&gt;, which was introduced starting from version 2.4 (at the time of writing this article, &lt;a href="https://medium.com/r/?url=https%3A%2F%2Fairflow.apache.org%2Fdocs%2Fapache-airflow%2Fstable%2Frelease_notes.html" rel="noopener noreferrer"&gt;Airflow is at version 2.6.3&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The aim here is to talk about the Datasets functionality in Airflow, introduce the DAG Factory library, and create a practical example using both.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/cicerojmm/airflowDatasetsComDagFactory" rel="noopener noreferrer"&gt;You can access the repository with the code used in this article here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Datasets in Airflow?
&lt;/h2&gt;

&lt;p&gt;Data-aware Scheduling allows linking DAGs to files, whether local or remote, known as datasets, so that data processing is triggered when one or more of these files are created or modified.&lt;/p&gt;

&lt;p&gt;Datasets help resolve the problem of data dependency between DAGs, which occurs when one DAG needs to consume data from another for analysis or further processing. They enable a more intelligent and visible scheduling with an explicit dependency between DAGs.&lt;/p&gt;

&lt;p&gt;Basically, there are two fundamental concepts in Airflow's Datasets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;DAG Producer: this DAG creates or updates one or more datasets; its tasks declare the datasets they produce through a parameter called &lt;strong&gt;outlets&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DAG Consumer: this DAG consumes one or more datasets and will be scheduled and triggered as soon as all of them are successfully created or updated by the DAG Producer. The scheduling is done by passing the datasets directly to the &lt;strong&gt;schedule&lt;/strong&gt; parameter in the DAG configuration.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Currently, there are two ways to schedule DAGs in Airflow: either by a recurrent schedule (cron, timedelta, timetable, etc.) or through one or multiple datasets. It's important to note that we cannot use both scheduling methods in a single DAG, only one in each DAG.&lt;/p&gt;
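
&lt;p&gt;To make the producer/consumer idea concrete before we move to YAML, here's a minimal sketch in plain Python (without DAG Factory), assuming Airflow 2.4 or later; the dataset URI and task bodies are illustrative only:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pendulum import datetime

from airflow import Dataset
from airflow.decorators import dag, task

# Illustrative dataset: the URI is just an identifier, Airflow never reads the file itself
poke_items = Dataset("s3://my-bucket/poke_api/items/")


@dag(start_date=datetime(2023, 7, 1), schedule="0 5 * * *", catchup=False)
def producer_dag():
    @task(outlets=[poke_items])  # this task declares that it updates the dataset
    def save_items():
        ...  # extract from the API and write the result to S3

    save_items()


@dag(start_date=datetime(2023, 7, 1), schedule=[poke_items], catchup=False)
def consumer_dag():
    @task
    def process_items():
        ...  # runs only after the producer successfully updates the dataset

    process_items()


producer_dag()
consumer_dag()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When the producer's &lt;strong&gt;save_items&lt;/strong&gt; task succeeds, Airflow marks the dataset as updated and schedules the consumer DAG automatically.&lt;/p&gt;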

&lt;h2&gt;
  
  
  DAG Factory: Building DAGs with YAML
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://medium.com/r/?url=https%3A%2F%2Fgithub.com%2Fajbosco%2Fdag-factory" rel="noopener noreferrer"&gt;DAG Factory&lt;/a&gt; is a community library that allows configuring Airflow to generate DAGs from one or multiple YAML files.&lt;/p&gt;

&lt;p&gt;The library aims to facilitate the creation and configuration of new DAGs by using declarative parameters in YAML. It allows default customizations and is open-source, making it easy to create and customize new functionalities.&lt;/p&gt;

&lt;p&gt;The community around this library is highly engaged, making it worth exploring =)&lt;/p&gt;
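
&lt;p&gt;For context on how the YAML turns into DAGs: DAG Factory relies on a small Python loader file placed in the Airflow dags folder, which reads the YAML and registers the DAGs. Here's a minimal sketch of that loader following the library's usual pattern (the YAML path below is illustrative, not necessarily the one used in this article's repository):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal loader file in the Airflow dags/ folder so the scheduler picks up the YAML DAGs
from airflow import DAG  # kept in scope so Airflow's DAG file processor parses this file

import dagfactory

# Illustrative path to the YAML configuration file
config_file = "/opt/airflow/dags/dags_config/example_dags.yml"

dag_factory = dagfactory.DagFactory(config_file)
dag_factory.clean_dags(globals())     # remove DAGs that are no longer in the YAML
dag_factory.generate_dags(globals())  # register the YAML-defined DAGs with Airflow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;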

&lt;h2&gt;
  
  
  Practical Use of Datasets
&lt;/h2&gt;

&lt;p&gt;In this article, we'll work with the following scenario:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We need to build a pipeline that downloads data from an API and saves the results to Amazon S3. After successfully extracting and saving the data, we need to process it. Hence, we'll have another pipeline that will be triggered based on the creation or update of that data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnsisovo8pgon7s32ei5.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnsisovo8pgon7s32ei5.gif" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The infrastructure to run Airflow and reproduce the example in this article &lt;a href="https://medium.com/r/?url=https%3A%2F%2Fgithub.com%2Fcicerojmm%2FairflowDatasetsComDagFactory%2Ftree%2Fmain%2Fairflow-docker-compose" rel="noopener noreferrer"&gt;can be found here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The first step is to build the pipeline that extracts and saves the data to S3.&lt;/p&gt;

&lt;h3&gt;
  
  
  Producer DAG for Data
&lt;/h3&gt;

&lt;p&gt;This pipeline consists of two tasks to extract data from the public &lt;a href="https://pokeapi.co/" rel="noopener noreferrer"&gt;PokeAPI&lt;/a&gt; and another two tasks to save the data to S3.&lt;/p&gt;

&lt;p&gt;The tasks that extract data from the API use the &lt;strong&gt;SimpleHttpOperator&lt;/strong&gt;, and the tasks that save the data to S3 use the &lt;strong&gt;S3CreateObjectOperator&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Since we'll be using YAML to build our DAGs, the following code constructs this first DAG with all its tasks.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;span class="na"&gt;download_data_api_dataset_producer_dag&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Example&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;DAG&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;producer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;custom&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;config&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;datasets"&lt;/span&gt;
  &lt;span class="na"&gt;schedule_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
  &lt;span class="na"&gt;task_groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;extract_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;tooltip&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;this&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;group"&lt;/span&gt;
    &lt;span class="na"&gt;save_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;tooltip&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;this&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;group"&lt;/span&gt;
  &lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;start_process&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow.operators.dummy.DummyOperator&lt;/span&gt;
    &lt;span class="na"&gt;get_items_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow.providers.http.operators.http.SimpleHttpOperator&lt;/span&gt;
      &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET"&lt;/span&gt;
      &lt;span class="na"&gt;http_conn_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;poke_api"&lt;/span&gt;
      &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;item/1"&lt;/span&gt;
      &lt;span class="na"&gt;task_group_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;extract_data&lt;/span&gt;
      &lt;span class="na"&gt;dependencies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;start_process&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;save_items_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow.providers.amazon.aws.operators.s3.S3CreateObjectOperator&lt;/span&gt;
      &lt;span class="na"&gt;aws_conn_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws_default&lt;/span&gt;
      &lt;span class="na"&gt;s3_bucket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cjmm-datalake-raw&lt;/span&gt;
      &lt;span class="na"&gt;s3_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;poke_api/item/data_{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}.json"&lt;/span&gt;
      &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ti.xcom_pull(task_ids='get_items_data')&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
      &lt;span class="na"&gt;dependencies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;get_items_data&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;task_group_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;save_data&lt;/span&gt;
      &lt;span class="na"&gt;outlets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/opt/airflow/dags/dags_config/datasets_config.yml&lt;/span&gt;
        &lt;span class="na"&gt;datasets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataset_poke_items'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;get_items_attribute_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow.providers.http.operators.http.SimpleHttpOperator&lt;/span&gt;
      &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET"&lt;/span&gt;
      &lt;span class="na"&gt;http_conn_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;poke_api"&lt;/span&gt;
      &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;item-attribute/1"&lt;/span&gt;
      &lt;span class="na"&gt;dependencies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;start_process&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;task_group_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;extract_data&lt;/span&gt;
    &lt;span class="na"&gt;save_items_attribute_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow.providers.amazon.aws.operators.s3.S3CreateObjectOperator&lt;/span&gt;
        &lt;span class="na"&gt;aws_conn_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws_default&lt;/span&gt;
        &lt;span class="na"&gt;s3_bucket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cjmm-datalake-raw&lt;/span&gt;
        &lt;span class="na"&gt;s3_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;poke_api/items_attribute/data_{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}.json"&lt;/span&gt;
        &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ti.xcom_pull(task_ids='get_items_attribute_data')&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
        &lt;span class="na"&gt;dependencies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;get_items_attribute_data&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;task_group_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;save_data&lt;/span&gt;
        &lt;span class="na"&gt;outlets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/opt/airflow/dags/dags_config/datasets_config.yml&lt;/span&gt;
          &lt;span class="na"&gt;datasets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataset_poke_items_attribute'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A highlight is the configuration of the Datasets, done through the &lt;strong&gt;outlets&lt;/strong&gt; tag added to the tasks &lt;strong&gt;save_items_data&lt;/strong&gt; and &lt;strong&gt;save_items_attribute_data&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;span class="na"&gt;outlets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/opt/airflow/dags/dags_config/datasets_config.yml&lt;/span&gt;
   &lt;span class="na"&gt;datasets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataset_poke_items_attribute'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this configuration, we specify the path of the file where all Datasets are centrally declared for reuse, and the names of the datasets from that file that the task should update.&lt;/p&gt;
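
&lt;p&gt;For reference, here is a minimal sketch of what this &lt;strong&gt;outlets&lt;/strong&gt; entry corresponds to in plain Airflow code (assuming Airflow 2.4+ Datasets). In the article's setup the actual task creation from the YAML is handled by the customized DAG Factory code, so this block is only for comparison; the DAG id and start date below are illustrative placeholders, while the task parameters and URI mirror the YAML above and the &lt;em&gt;datasets_config.yml&lt;/em&gt; file shown next:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.providers.amazon.aws.operators.s3 import S3CreateObjectOperator

# Dataset URI taken from the datasets_config.yml file shown next
dataset_poke_items = Dataset("s3://cjmm-datalake-raw/poke_api/items/*.json")

# Illustrative producer DAG wrapper (dag_id and start_date are placeholders)
with DAG(dag_id="producer_dag_example", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    save_items_data = S3CreateObjectOperator(
        task_id="save_items_data",
        aws_conn_id="aws_default",
        s3_bucket="cjmm-datalake-raw",
        s3_key="poke_api/items/data_{{ ts }}.json",
        data="{{ ti.xcom_pull(task_ids='get_items_data') }}",
        outlets=[dataset_poke_items],  # marks the Dataset as updated when this task succeeds
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;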

&lt;p&gt;Below is the &lt;strong&gt;&lt;em&gt;datasets_config.yml&lt;/em&gt;&lt;/strong&gt; file used in this example, containing each Dataset's name (used only within Airflow) and its URI, which is the path where the actual data is stored, in this case Amazon S3.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;span class="na"&gt;datasets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dataset_poke_items_attribute&lt;/span&gt;
    &lt;span class="na"&gt;uri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3://cjmm-datalake-raw/poke_api/items_attribute/*.json&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dataset_poke_items&lt;/span&gt;
    &lt;span class="na"&gt;uri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3://cjmm-datalake-raw/poke_api/items/*.json&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The resulting DAG visualization in Airflow will look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fin7dnxif88laqiix2tw5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fin7dnxif88laqiix2tw5.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Consumer DAG for Data
&lt;/h3&gt;

&lt;p&gt;Now, let's build the consumer DAG, which processes and handles the data exposed through the datasets.&lt;/p&gt;

&lt;p&gt;The DAG is scheduled based on datasets, not on an execution time, so it will only be triggered when all the datasets it depends on are updated.&lt;/p&gt;

&lt;p&gt;Currently, we cannot use two types of scheduling simultaneously; it's either through a schedule interval or datasets.&lt;/p&gt;

&lt;p&gt;In this example, we'll only build a DAG with &lt;strong&gt;PythonOperator&lt;/strong&gt;, simulating the consumption and processing of the produced data.&lt;/p&gt;

&lt;p&gt;Below is the configuration file for the consumer DAG:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;span class="na"&gt;process_data_api_dataset_consumer_dag&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Example&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;DAG&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;consumer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;custom&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;config&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;datasets"&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/opt/airflow/dags/dags_config/datasets_config.yml&lt;/span&gt;
    &lt;span class="na"&gt;datasets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataset_poke_items'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataset_poke_items_attribute'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;start_process&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow.operators.dummy.DummyOperator&lt;/span&gt;
    &lt;span class="na"&gt;process_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow.operators.python_operator.PythonOperator&lt;/span&gt;
      &lt;span class="na"&gt;python_callable_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;process_data_function&lt;/span&gt;
      &lt;span class="na"&gt;python_callable_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/opt/airflow/dags/process_data.py&lt;/span&gt;
      &lt;span class="na"&gt;task_group_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;etl_data&lt;/span&gt;
      &lt;span class="na"&gt;provide_context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;dependencies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;start_process&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A highlight is the configuration of the &lt;strong&gt;schedule&lt;/strong&gt; based on datasets, which is similar to the configuration of the &lt;strong&gt;outlets&lt;/strong&gt; in the producer DAG:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/opt/airflow/dags/dags_config/datasets_config.yml&lt;/span&gt;
   &lt;span class="na"&gt;datasets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataset_poke_items'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dataset_poke_items_attribute'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
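
&lt;p&gt;Again for reference, a minimal sketch of what this dataset-based &lt;strong&gt;schedule&lt;/strong&gt; corresponds to in plain Airflow code (assuming Airflow 2.4+; the DAG id and URIs mirror the YAML configuration above, and the start date is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset

dataset_poke_items = Dataset("s3://cjmm-datalake-raw/poke_api/items/*.json")
dataset_poke_items_attribute = Dataset("s3://cjmm-datalake-raw/poke_api/items_attribute/*.json")

# The DAG runs only after BOTH datasets have been updated by the producer DAG;
# there is no cron expression or schedule interval involved.
with DAG(
    dag_id="process_data_api_dataset_consumer_dag",
    start_date=datetime(2023, 1, 1),
    schedule=[dataset_poke_items, dataset_poke_items_attribute],
) as dag:
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;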

&lt;p&gt;The resulting DAG visualization in Airflow will be as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7siy1n9890elnhz15jq2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7siy1n9890elnhz15jq2.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Overview of DAGs with Datasets
&lt;/h3&gt;

&lt;p&gt;When we have DAGs using Airflow's datasets, we can observe some interesting points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The consumer DAG in the list of all DAGs is flagged to indicate scheduling based on datasets.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsw1wyg3291undieu8sdf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsw1wyg3291undieu8sdf.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There is a dedicated view in the Airflow menu called Datasets, where you can check the configured datasets, the dependencies between DAGs, and the log of dataset creation, update, and consumption.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zq7wm16ydt546y9ahil.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zq7wm16ydt546y9ahil.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The DAG Dependencies visualization shows the relationships between the DAGs, providing a helpful overview of the processing mesh and data dependencies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1x9kha2o0won4kc6le7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1x9kha2o0won4kc6le7.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Important Points about Datasets
&lt;/h2&gt;

&lt;p&gt;The functionality of Datasets in Airflow is still recent, and there are many improvements in the community backlog. However, I would like to highlight some points at this moment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Currently, Airflow's dataset functionality does not directly inspect the physical file itself. Instead, it schedules the consumer pipeline directly through the database, almost like an implicit DAG Trigger.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Considering the previous point, it's better to use a Sensor if you genuinely need to "see and access" the data when triggering the DAG Consumer (see the sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The official documentation does not recommend using regular expressions in the URI of datasets. However, in my tests, I didn't encounter any issues with this, as the functionality doesn't yet look directly at the physical file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Since the DAG Consumer doesn't have a specific schedule, it's challenging to measure if it was triggered at a planned time, making it difficult to define an SLA. A more refined monitoring approach is needed to avoid missing critical scheduling.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
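
&lt;p&gt;Regarding the second point, if the consumer DAG really needs to confirm that the files exist before processing them, a sensor can be added as its first task. Below is a minimal, hypothetical sketch using the Amazon provider's &lt;strong&gt;S3KeySensor&lt;/strong&gt;; the task name and timing values are illustrative only:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

# Waits until at least one items file is actually present in the raw bucket
wait_for_items_file = S3KeySensor(
    task_id="wait_for_items_file",
    aws_conn_id="aws_default",
    bucket_name="cjmm-datalake-raw",
    bucket_key="poke_api/items/*.json",
    wildcard_match=True,   # allow the * wildcard in bucket_key
    poke_interval=60,      # check every minute
    timeout=60 * 30,       # give up after 30 minutes
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;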

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By using the DAG Factory library, we simplify the process of creating and configuring new DAGs, leveraging the extensibility provided by the library's open-source code.&lt;/p&gt;

&lt;p&gt;Airflow's Datasets enable more efficient scheduling by triggering DAGs only when necessary data is available, avoiding unnecessary and delayed executions.&lt;/p&gt;

&lt;p&gt;I hope this article has been useful in understanding Airflow's Datasets functionality and how to apply it to your projects. With this approach, you can build more robust and efficient pipelines, fully utilizing Airflow's potential.&lt;/p&gt;

&lt;p&gt;Follow me:&lt;br&gt;
LinkedIn: &lt;a href="https://www.linkedin.com/in/cicero-moura/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/cicero-moura/&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/cicerojmm" rel="noopener noreferrer"&gt;https://github.com/cicerojmm&lt;/a&gt;&lt;/p&gt;

</description>
      <category>airflow</category>
      <category>dagfactory</category>
      <category>aws</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Data Quality at Scale with Great Expectations, Spark, and Airflow on EMR</title>
      <dc:creator>Cícero Joasyo Mateus de Moura</dc:creator>
      <pubDate>Mon, 24 Apr 2023 13:56:19 +0000</pubDate>
      <link>https://dev.to/aws-builders/data-quality-at-scale-with-great-expectations-spark-and-airflow-on-emr-5bnm</link>
      <guid>https://dev.to/aws-builders/data-quality-at-scale-with-great-expectations-spark-and-airflow-on-emr-5bnm</guid>
      <description>&lt;p&gt;Data quality is one of the biggest challenges that companies face nowadays, as it's necessary to ensure that the data is accurate, reliable, and relevant so that the decisions made based on this data are successful.&lt;/p&gt;

&lt;p&gt;In this regard, we have seen several trends emerge, such as the Modern Data Stack, which treats data quality as one of its main practices.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.moderndatastack.xyz/" rel="noopener noreferrer"&gt;Modern Data Stack (MDS)&lt;/a&gt; is a set of tools and technologies that help companies store, manage, and learn from their data quickly and efficiently. Concepts such as Data Quality and Data Observability are highlights of the MDS.&lt;/p&gt;

&lt;p&gt;This article aims to explore Great Expectations, a data validation tool contained within the MDS, which can be used in conjunction with Spark to ensure data quality at scale.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The code used in this article can be found in &lt;a href="https://github.com/cicerojmm/dataQualityGreatExpectationsSpark" rel="noopener noreferrer"&gt;this repository&lt;/a&gt;.&lt;br&gt;
The link to the data docs generated by Great Expectations can also be &lt;a href="http://datadocs-greatexpectations.cjmm.s3-website-us-east-1.amazonaws.com/" rel="noopener noreferrer"&gt;accessed here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Great Expectations
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://greatexpectations.io/" rel="noopener noreferrer"&gt;Great Expectations&lt;/a&gt; (GE) is an &lt;a href="https://github.com/great-expectations/great_expectations" rel="noopener noreferrer"&gt;open-source&lt;/a&gt; data validation tool that helps ensure data quality.&lt;/p&gt;

&lt;p&gt;With Great Expectations, it's possible to define expectations about your data and check whether they meet them or not.&lt;/p&gt;

&lt;p&gt;Some of the existing functionalities include the ability to validate data schema, ensure referential integrity, check consistency, and detect anomalies.&lt;/p&gt;

&lt;p&gt;GE is very flexible and scalable, allowing integration into our data pipelines, whether to validate data, generate reports, or even prevent the pipeline from advancing and recording inconsistent data in the most "curated" layers of the Data Lake.&lt;/p&gt;

&lt;p&gt;Some points that we can highlight:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it's possible to create tests for your data directly from a Spark or Pandas dataframe (see the quick sketch after this list);&lt;/li&gt;
&lt;li&gt;it's possible to create data documentation in HTML including expectation suites and validation reports;&lt;/li&gt;
&lt;li&gt;it's possible to save a set of tests (suite) to be used later (&lt;a href="https://docs.greatexpectations.io/docs/terms/checkpoint/" rel="noopener noreferrer"&gt;checkpoints&lt;/a&gt;);&lt;/li&gt;
&lt;li&gt;we can use a large number of ready-made expectations or easily create &lt;a href="https://docs.greatexpectations.io/docs/guides/expectations/creating_custom_expectations/overview" rel="noopener noreferrer"&gt;custom expectations&lt;/a&gt; that meet our test cases;&lt;/li&gt;
&lt;li&gt;it has a &lt;a href="https://docs.greatexpectations.io/docs/guides/miscellaneous/how_to_use_the_great_expectations_cli/" rel="noopener noreferrer"&gt;CLI&lt;/a&gt; that simplifies the creation of test cases, or we can generate tests by coding in Python;&lt;/li&gt;
&lt;li&gt;it's possible to connect directly to &lt;a href="https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview" rel="noopener noreferrer"&gt;data source origins&lt;/a&gt;, consequently validating data more quickly.&lt;/li&gt;
&lt;/ul&gt;
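
&lt;p&gt;As a quick taste of the first point above, here is a minimal sketch using the classic Dataset API (the same family as the &lt;strong&gt;SparkDFDataset&lt;/strong&gt; used later in this article); the tiny dataframe is illustrative only:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
import great_expectations as ge
import pandas as pd

# Illustrative data: one product_id is missing
df = pd.DataFrame({"product_id": ["P1", "P2", None], "rating": [4.5, 3.9, 5.0]})
df_ge = ge.from_pandas(df)

result = df_ge.expect_column_values_to_not_be_null("product_id")
print(result.success)  # False, because of the null product_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;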

&lt;h3&gt;
  
  
  Practical case with Great Expectations
&lt;/h3&gt;

&lt;p&gt;In this article, we present a scenario close to what we encounter in our daily work, so we will work with the following case:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We have data stored in a Data Lake located on AWS S3 and need to verify data quality before the business makes critical decisions based on it.&lt;br&gt;
The dataset used is about product sales from an e-commerce website (Amazon) and tells us a lot about the behavior of that store's customers.&lt;br&gt;
The dataset used in this article is open and can be found on Kaggle at &lt;a href="https://www.kaggle.com/datasets/karkavelrajaj/amazon-sales-dataset" rel="noopener noreferrer"&gt;this link&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Using Great Expectations with Spark on EMR
&lt;/h3&gt;

&lt;p&gt;This article will use Great Expectations with Spark to execute the test cases.&lt;br&gt;
The Spark environment will run on EMR, and Airflow will orchestrate the jobs.&lt;/p&gt;

&lt;p&gt;To facilitate the understanding of the process, we will analyze the architecture design below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrcbviv8tiiw6x82amtj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrcbviv8tiiw6x82amtj.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can highlight the following points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the Spark code containing all the logic to execute GE with Spark will be stored in S3;&lt;/li&gt;
&lt;li&gt;the data is also stored in S3 in CSV format;&lt;/li&gt;
&lt;li&gt;the generated data docs will also be stored in an S3 bucket configured for a &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/WebsiteHosting.html" rel="noopener noreferrer"&gt;static site&lt;/a&gt;;&lt;/li&gt;
&lt;li&gt;Airflow will orchestrate the EMR and control the lifecycle of the jobs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  1. Creation of Spark script with Great Expectations
&lt;/h3&gt;

&lt;p&gt;To create the Spark script that contains the test cases, we will divide it into some steps, as follows:&lt;/p&gt;

&lt;h4&gt;
  
  
  Configuration of the context
&lt;/h4&gt;

&lt;p&gt;The GE context holds the main configurations used to run the tests.&lt;/p&gt;

&lt;p&gt;The following code configures the context through a YAML definition declared as a Python string.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;datasource_yaml&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    name: my_spark_datasource
    class_name: Datasource
    module_name: great_expectations.datasource
    execution_engine:
        module_name: great_expectations.execution_engine
        class_name: SparkDFExecutionEngine
    data_connectors:
        my_runtime_data_connector:
            class_name: RuntimeDataConnector
            batch_identifiers:
                - some_key_maybe_pipeline_stage
                - some_other_key_maybe_airflow_run_id
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_context_ge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_expectation_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;expectation_suite_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;suite_name&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_datasource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;yaml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datasource_yaml&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="nf"&gt;config_data_docs_site&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This context configuration essentially tells GE that Spark will execute the tests; in another scenario it could be a different engine, such as Pandas.&lt;/p&gt;

&lt;h4&gt;
  
  
  Configuring Data Docs
&lt;/h4&gt;

&lt;p&gt;An important point is setting where our &lt;strong&gt;data docs&lt;/strong&gt; will be saved. By default, the HTML documentation is generated on the local disk, but for this article, the data docs will be &lt;a href="https://docs.greatexpectations.io/docs/guides/setup/configuring_data_docs/how_to_host_and_share_data_docs_on_amazon_s3/" rel="noopener noreferrer"&gt;stored and hosted by S3&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The destination bucket (&lt;strong&gt;output_path&lt;/strong&gt;) is a parameter in the following code, so the script becomes more dynamic and customizable.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;config_data_docs_site&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;data_context_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DataContextConfig&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;data_context_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_docs_sites&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3_site&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SiteBuilder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;store_backend&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TupleS3StoreBackend&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bucket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;site_index_builder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DefaultSiteIndexBuilder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_project_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_docs_sites&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data_context_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_docs_sites&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  Creation of a Validator
&lt;/h4&gt;

&lt;p&gt;Before adding the test cases, we must configure a &lt;strong&gt;Validator&lt;/strong&gt;, which receives the data to be tested through a &lt;strong&gt;Batch Request&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The Validator comes with built-in data validation functions, as we will see later, which makes creating test cases much easier and more intuitive.&lt;/p&gt;

&lt;p&gt;The code below configures and creates the Validator using the context of our tests and the dataframe containing the data for validation.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_validator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;runtime_batch_request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RuntimeBatchRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;datasource_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_spark_datasource&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;data_connector_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_runtime_data_connector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;data_asset_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;insert_your_data_asset_name_here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;runtime_parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batch_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;batch_identifiers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;some_key_maybe_pipeline_stage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingestion step 1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;some_other_key_maybe_airflow_run_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run 18&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;df_validator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Validator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_validator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;batch_request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;runtime_batch_request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;expectation_suite&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;suite&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df_validator&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  Creating Test Cases
&lt;/h4&gt;

&lt;p&gt;The most awaited moment has arrived: creating the test cases.&lt;/p&gt;

&lt;p&gt;At this stage, the objective is to work with two scenarios of test cases: the first is to run a data profile, and the second is to add custom test cases according to business needs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data Profile&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data profiling is the process of examining, analyzing, reviewing, and summarizing datasets to obtain information about the quality of the data.&lt;/p&gt;

&lt;p&gt;GE allows you to create a data profile automatically and very simply.&lt;/p&gt;

&lt;p&gt;In this &lt;strong&gt;profile&lt;/strong&gt;, information will be generated for all data columns, including tests to check for null values, data types, and the most frequent pattern in each column.&lt;/p&gt;

&lt;p&gt;To create a data profile and add it to the test context, you just need to have the following code:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_profile_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df_ge&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;profiler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BasicDatasetProfiler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;expectation_suite&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validation_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;profiler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_ge&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_expectation_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expectation_suite&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suite_profile_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An important point is that the profile is executed through a Spark object created by GE (&lt;strong&gt;df_ge&lt;/strong&gt;), which we will see later. This differs from the other test cases added next, which are based on the Validator object created in the previous step.&lt;/p&gt;

&lt;p&gt;Another point to highlight is that one suite name was used for the profile and another for the validator tests, so they are kept separate in the data docs, which helps keep the documentation organized.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Test cases&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now just add the test cases as needed for data validation.&lt;/p&gt;

&lt;p&gt;The following code adds the following tests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validate if &lt;strong&gt;all desired columns&lt;/strong&gt; are in the dataset;&lt;/li&gt;
&lt;li&gt;Validate if the &lt;strong&gt;product_id&lt;/strong&gt; field has unique and non-null values;&lt;/li&gt;
&lt;li&gt;Validate if the &lt;strong&gt;discount_percentage&lt;/strong&gt; field contains only values between 0 and 100;&lt;/li&gt;
&lt;li&gt;Validate if the &lt;strong&gt;rating&lt;/strong&gt; field contains only values between 0 and 5;&lt;/li&gt;
&lt;li&gt;Validate if the &lt;strong&gt;product_link&lt;/strong&gt; field contains only data with a valid link format, using a regex to validate the pattern.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After adding all desired test cases, save the test suite's expectations:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_tests_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_validator&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;columns_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;discounted_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;actual_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;discount_percentage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rating_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;about_product&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;review_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;review_title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;review_content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;img_link&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_link&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;df_validator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expect_table_columns_to_match_ordered_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_validator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expect_column_values_to_be_unique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_validator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expect_column_values_to_not_be_null&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_validator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expect_column_values_to_be_between&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;discount_percentage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_validator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expect_column_values_to_be_between&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rating&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_validator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expect_column_values_to_match_regex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_link&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;regex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;^https:\/\/www\.[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,}$&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;mostly&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_validator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_expectation_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;discard_failed_expectations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df_validator&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  Running the tests
&lt;/h4&gt;

&lt;p&gt;Now it's time to connect all the dots.&lt;/p&gt;

&lt;p&gt;The code below is the main function that will be called by Spark. It reads the data we want and invokes the other functions we discussed earlier to set up and execute the test suites:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_suite_ge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;path_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sales&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amazon.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_ge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkDFDataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_context_ge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ExpectationSuite&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_expectation_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;expectation_suite_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;suite_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;add_profile_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df_ge&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;df_validator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_validator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_validator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;add_tests_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_validator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_validator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expectation_suite&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build_data_docs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;site_names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3_site&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The test suite run successfully: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
              &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Validation action if necessary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  2. Creating the DAG in Airflow
&lt;/h3&gt;

&lt;p&gt;In this step, it's time to create a DAG in Airflow to run &lt;a href="https://github.com/cicerojmm/dataQualityGreatExpectationsSpark/tree/main/airflow/dags" rel="noopener noreferrer"&gt;the tests with GE inside the EMR with Spark&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We will have the following tasks in our DAG:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;create_emr: task responsible for creating the EMR for job execution. Remember to configure the connection with AWS (aws_default) or IAM if you're running Airflow on AWS. &lt;a href="https://github.com/cicerojmm/dataQualityGreatExpectationsSpark/blob/main/airflow/dags/emr_config.json" rel="noopener noreferrer"&gt;EMR configurations can be found in the project repository&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;add_step: responsible for adding a job to the EMR (step). We will see the configuration of this job (&lt;strong&gt;spark-submit&lt;/strong&gt;) later on.&lt;/li&gt;
&lt;li&gt;watch_step: an Airflow &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/sensors.html" rel="noopener noreferrer"&gt;sensor&lt;/a&gt; responsible for monitoring the status of the previous job until it is completed, either successfully or with failure.&lt;/li&gt;
&lt;li&gt;terminate_emr: after the job is finished, this task terminates the EMR instance allocated for running the tests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below is the code for the DAG:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;create_emr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EmrCreateJobFlowOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;create_emr&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;aws_conn_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;aws_default&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;job_flow_overrides&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;JOB_FLOW_OVERRIDES&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;add_step&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EmrAddStepsOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;add_step&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;job_flow_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{ task_instance.xcom_pull(task_ids=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;create_emr&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, key=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;return_value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;) }}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;STEPS_EMR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;watch_step&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EmrStepSensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;watch_step&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;job_flow_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{ task_instance.xcom_pull(task_ids=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;create_emr&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, key=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;return_value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;) }}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;step_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{ task_instance.xcom_pull(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;add_step&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, key=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;return_value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)[0] }}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;aws_conn_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;aws_default&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;terminate_emr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EmrTerminateJobFlowOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;terminate_emr&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;job_flow_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{ task_instance.xcom_pull(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;create_emr&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, key=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;return_value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;) }}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;aws_conn_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;aws_default&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trigger_rule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TriggerRule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ALL_DONE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;create_emr&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;add_step&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;watch_step&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;terminate_emr&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now I'll detail the configuration of the job that will be added to EMR to run the tests, which is basically a &lt;strong&gt;spark-submit&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We can check all the settings in the code below, including the script parameters.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;job_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;process_suite_ge&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input_path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://cjmm-datalake-raw&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output_path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://datadocs-greatexpectations.cjmm&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;STEPS_EMR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Run Data Quality with Great Expectations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ActionOnFailure&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CONTINUE&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HadoopJarStep&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Jar&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;command-runner.jar&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Args&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/usr/bin/spark-submit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--deploy-mode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;client&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--master&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;yarn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--num-executors&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--executor-cores&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--py-files&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://cjmm-code-spark/data_quality/modules.zip&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://cjmm-code-spark/data_quality/main.py&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}]&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It's important to highlight that the code executed by Spark is stored in S3: both the &lt;a href="https://github.com/cicerojmm/dataQualityGreatExpectationsSpark/blob/main/script_pyspark_emr/main.py" rel="noopener noreferrer"&gt;main.py&lt;/a&gt; file, which calls the other functions, and the &lt;a href="https://github.com/cicerojmm/dataQualityGreatExpectationsSpark/tree/main/script_pyspark_emr/modules" rel="noopener noreferrer"&gt;modules.zip&lt;/a&gt; file, which contains all the logic for the tests.&lt;/p&gt;

&lt;p&gt;This code structure was adopted to keep the job scalable and easier to maintain, and it also lets us run Spark in either client or cluster mode.&lt;/p&gt;
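
&lt;p&gt;To make this pattern concrete, below is a minimal sketch of what an entry point like main.py could look like. It is not the project's actual code (that lives in the repository linked above): the modules import and the suites.run helper are hypothetical, but the argument handling matches the stringified dict passed by the EMR step.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
import ast
import sys

from pyspark.sql import SparkSession

# Hypothetical module shipped inside modules.zip (distributed via --py-files)
from modules import suites


def main():
    # The EMR step passes the parameters as a single stringified dict, e.g.
    # "{'job_name': 'process_suite_ge', 'input_path': 's3://...', 'output_path': 's3://...'}"
    args = ast.literal_eval(sys.argv[1])

    spark = SparkSession.builder.appName(args["job_name"]).getOrCreate()

    # Assuming the raw layer is CSV; adjust the reader to the actual file format
    df = spark.read.csv(args["input_path"], header=True, inferSchema=True)

    # Run the Great Expectations suites implemented in modules.zip and
    # publish the resulting data docs to the output bucket
    suites.run(spark, df, output_path=args["output_path"])

    spark.stop()


if __name__ == "__main__":
    main()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;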

&lt;h3&gt;
  
  
  3. Executing the Script on EMR
&lt;/h3&gt;

&lt;p&gt;With the script developed and the Airflow DAG created, we can now run the tests.&lt;br&gt;
Below is an example of the Airflow DAG that ran successfully:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbtrvpyr89v15gbr360u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbtrvpyr89v15gbr360u.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following image shows more details about the job that executed successfully on EMR:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flssza6849ua6qxdvgqyi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flssza6849ua6qxdvgqyi.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Results
&lt;/h3&gt;

&lt;p&gt;Now it's time to analyze the two results produced by the executed tests.&lt;br&gt;
The first is the data docs files saved in the S3 bucket, as shown in the following image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftht9bcpy2h5k1nwrwrw1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftht9bcpy2h5k1nwrwrw1.jpg" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second is accessing the rendered data docs themselves, as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvhs5ncomed5ny2x0kgoe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvhs5ncomed5ny2x0kgoe.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Remember that the data docs created in this article can be accessed at &lt;a href="http://datadocs-greatexpectations.cjmm.s3-website-us-east-1.amazonaws.com/" rel="noopener noreferrer"&gt;this link&lt;/a&gt;.&lt;/p&gt;
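
&lt;p&gt;These data docs end up in that bucket because the Great Expectations context points its stores at S3. As a reference, a minimal sketch of such a configuration, based on the Great Expectations guide for EMR Spark clusters (class names may differ between GE versions), could look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import (
    DataContextConfig,
    S3StoreBackendDefaults,
)

# Keep expectations, validation results and data docs in the S3 bucket
# used throughout this article (older, 0.15.x-style API shown here)
config = DataContextConfig(
    store_backend_defaults=S3StoreBackendDefaults(
        default_bucket_name="datadocs-greatexpectations.cjmm"
    )
)

context = BaseDataContext(project_config=config)

# After running the validations, regenerate the static HTML site
context.build_data_docs()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;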

&lt;p&gt;When accessing the suite with the data &lt;strong&gt;profile&lt;/strong&gt;, we have the following result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzelrdu15dnfru73579yp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzelrdu15dnfru73579yp.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And when accessing the suite with the created &lt;strong&gt;test cases&lt;/strong&gt;, we have the result below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5wl8eo4dxs4pd75lcmit.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5wl8eo4dxs4pd75lcmit.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Great Expectations is one of the fastest-growing open-source data quality tools, with a highly active community, frequent updates, and adoption by several large companies worldwide.&lt;br&gt;
With GE, we can easily create test cases for a variety of scenarios, adapt them to different datasets, and customize tests for our own use cases.&lt;/p&gt;

&lt;p&gt;In addition to producing statistical results for each test, which we can save and use as needed, it also generates ready-to-use HTML data docs full of helpful information about data quality.&lt;/p&gt;

&lt;p&gt;Great Expectations is an excellent tool that is easy to integrate and manage. It builds on concepts we already know from the Big Data world, so it is worth trying it out and using it day to day to mature your Data Governance and Data Quality monitoring.&lt;/p&gt;

&lt;p&gt;Remember:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;More than having data available for analysis, it is essential to ensure its quality.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>aws</category>
      <category>programming</category>
      <category>greatexpectations</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
