
Lewis Hemens for Dataform

Originally published at dataform.co

Testing data quality with SQL assertions

Ensuring that data consumers can use their data to reliably answer questions is of paramount importance to any data analytics team. Having a mechanism to enforce high data quality across datasets is therefore a key requirement for these teams.

Often, input data sources are missing rows, contain duplicates, or include just plain invalid data. Over time, changes to business definitions or the underlying software which produces input data can cause drift in the meaning of columns - or even the overall structure of tables. Addressing these issues is critical to creating a successful data team and generating valuable, correct insights.

In this article we explain the concept of a SQL data assertion, look at some common data quality problems and how to detect them, and - most importantly - show how to fix them in a way that persists for all data consumers.

The SQL snippets in this post apply to Google BigQuery but can be ported easily enough to Redshift, Postgres or Snowflake data warehouses.

What is a data assertion?

A data assertion is a query that looks for problems in a dataset. If the query returns any rows then the assertion fails.



Data assertions are defined this way because it’s much easier to look for problems than to prove their absence. It also means that an assertion query can itself be used to quickly inspect the data causing the failure - making it easy to diagnose and fix the problem.

Checking field values

Let’s take a look at a simple example.

Assume that there is a database.customers table containing information about our customers.
Some checks that we might want to run against the table’s contents include:

  • The field email_address is always set
  • The field customer_type is one of “business” or “individual”

The following simple query will return any rows violating these rules:

SELECT customer_id
FROM database.customers
WHERE email_address IS NULL
   OR customer_type NOT IN ('business', 'individual')

Checking for unique fields

We may also want to run checks across more than one row. For example, we might want to verify that the customer_id field is unique. A query like the following will return any duplicate customer_id values:

SELECT
    customer_id,
    COUNT(*) AS duplicate_count
FROM database.customers
GROUP BY 1
HAVING COUNT(*) > 1

Combining multiple assertions into a single query

We can combine all of the above into a single query to quickly find any customer_id value violating one of our rules using UNION ALL:

SELECT customer_id, 'missing_email' AS reason
FROM database.customers
WHERE email_address IS NULL

UNION ALL

SELECT customer_id, 'invalid_customer_type' AS reason
FROM database.customers
WHERE customer_type NOT IN ('business', 'individual')

UNION ALL

SELECT customer_id, 'duplicate_id' AS reason
FROM (
    SELECT customer_id, COUNT(*) AS duplicate_count
    FROM database.customers
    GROUP BY 1
)
WHERE duplicate_count > 1

We now have one query we can run to detect any problems in our table, and we can easily add another unioned SELECT statement if we want to add new conditions in the future.

Creating clean datasets

Now that we’ve detected the issues in our data, we need to clean them up. How best to handle data quality issues ultimately depends on your business use case.

In this example we will:

  • Remove any rows that are missing the email_address field
  • Set a default customer type if it’s invalid
  • Remove rows with duplicate customer_id fields, retaining one row per customer_id value (we don’t care which one)

Rather than editing the dataset directly, we can create a new clean copy of the dataset - this gives us freedom to change or add rules in the future and avoids deleting any data.

The following SQL query defines a view of our database.customers table in which invalid rows are removed, default customer types are set, and duplicate rows for the same customer_id are removed:

SELECT
    customer_id,
    ANY_VALUE(email_address) AS email_address,
    ANY_VALUE(
        CASE customer_type
            WHEN 'individual' THEN 'individual'
            WHEN 'business'   THEN 'business'
            ELSE 'unknown'
        END) AS customer_type
FROM database.customers
WHERE email_address IS NOT NULL
GROUP BY 1

This query can be used to create either a view or a table in our cloud data warehouse, perhaps called database_clean.customers, which can be consumed in dashboards or by analysts who want to query the data.
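For example, in BigQuery the cleaning query can be wrapped in a view definition. The sketch below assumes a database_clean dataset already exists in your project; adjust the names to fit your own warehouse layout:

-- Create (or update) a cleaned view over the raw customers table.
CREATE OR REPLACE VIEW database_clean.customers AS
SELECT
    customer_id,
    ANY_VALUE(email_address) AS email_address,
    ANY_VALUE(
        CASE customer_type
            WHEN 'individual' THEN 'individual'
            WHEN 'business'   THEN 'business'
            ELSE 'unknown'
        END) AS customer_type
FROM database.customers
WHERE email_address IS NOT NULL
GROUP BY 1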

Now that we’ve cleaned up the data, we can check that the query above has fixed the problems by re-running the original assertions against the new dataset.
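For example, re-running the field-value assertion against the cleaned dataset should now return zero rows. Note that 'unknown' must be treated as a valid customer_type here, since the cleaning step introduces it as a default:

SELECT customer_id
FROM database_clean.customers
WHERE email_address IS NULL
   OR customer_type NOT IN ('business', 'individual', 'unknown')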

Continuous data quality testing

Assertions should be run as part of any data pipeline to make sure breaking changes are picked up the moment they happen.

If an assertion returns any rows, subsequent steps in the pipeline should fail, or a notification should be delivered to the data owner.
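One way to make a pipeline step fail hard in BigQuery - an illustrative pattern rather than the only option - is to raise an error whenever an assertion query returns rows:

-- Fails the whole query (and therefore the pipeline step running it)
-- if any customers are missing an email address.
SELECT IF(
    (SELECT COUNT(*) FROM database.customers WHERE email_address IS NULL) > 0,
    ERROR('Assertion failed: customers with missing email_address'),
    'ok'
) AS assertion_result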

Dataform has built-in support for data assertions and provides a way to run them as part of a larger SQL pipeline.

Assertions can be run at any frequency, and if one fails an email is sent to notify you of the problem. Dataform also makes it easy to create new datasets in your warehouse, so managing the process of cleaning and testing your data stays straightforward.
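As a rough sketch of what this looks like in Dataform’s open-source framework (the filename is illustrative, and the syntax may vary between versions - see the documentation linked below), an assertion is a SQLX file whose config block marks it as an assertion:

-- definitions/customer_assertions.sqlx
config { type: "assertion" }

SELECT customer_id
FROM database.customers
WHERE email_address IS NULL
   OR customer_type NOT IN ('business', 'individual')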



For more information on how to start writing data assertions with Dataform, check out the assertions documentation guide for Dataform’s open-source framework, or create an account for free and start using Dataform's fully managed Web platform.
