Data Analytics at Potloc I: Making data integrity your priority with Elementary & Meltano

Stéphane Burwash — Fri, 06 Jan 2023 03:28:33 +0000

Foreword

This is the first of a series of small blog posts where we describe plugins that our data engineering team at Potloc developed in order to solve business issues that we were facing.

These features were developed to enhance our current stack, which consists of:

Meltano as our DataOps platform + Extract / Load tool
dbt as our data transformation tool
Airflow as our orchestrator
BigQuery as our data warehouse
AWS Fargate as our hosting infrastructure, managed through Terraform

In this article, it is assumed that the reader has basic knowledge of:

Meltano
dbt

We hope that you enjoy the article! If you have any questions, feel free to reach out.

This article was not sponsored by Elementary in any way; we're just big fans 😉.

Data Integrity - It's more than a buzzword

Data integrity is an essential component of a data pipeline. Without integrity and trust, your pipeline is essentially worthless, and those hundreds of hours you spent integrating new sources, optimizing load times, and modeling raw data into usable insights go down the drain. More than once, our data team has created "production ready" dashboards, only to realise that the integrity / business logic behind the dashboard was completely flawed. What was supposed to be a two-day project became a three-week debacle.

To avoid falling into the data quality trap, you can write tests using a number of powerful open-source data integrity solutions such as Great Expectations or Soda Core, or even natively using dbt tests, to validate that your data is doing what it's supposed to. But as you write more and more tests, you can start running into some issues, mainly:

Tracking over time: How do you keep track of test results over time? How are you progressing in terms of tackling these issues?

This was the starting point for our quest to find a long-term data integrity solution at Potloc. We were attempting to map integrity issues in user-inputted data. We also wanted to give our team an accurate report of their progress as they resolved these issues one-by-one in the source data.

Readability: As you go from 5 to 50 to 500 to 5000+ tests, reading results becomes exponentially more complicated and time-intensive.

Reproducibility: Once you have detected an integrity issue, how do you quickly reproduce the test to be able to investigate?

Unknown unknowns: While it is possible to test for every known possible issue, it can be harder / impossible to test for unknown unknowns such as dataset shifts, anomalies, large spikes in row count, etc.

These issues can be addressed by integrating Elementary into your workflow.

Features

Elementary is an open-source tool for data observability and validation that wraps around an existing dbt project. It allows you to graduate from simply having integrity tests to using them to improve confidence in your data product.

Here are only some of the reasons why I personally love Elementary:

1. The UI, the glorious UI:

Elementary and its associated CLI (command line interface), edr, natively allow you to generate a static HTML file containing test results. This file can be viewed locally, sent through Slack, or even hosted in a cloud service to be viewed as a webpage.

From this UI, you can view your most recent test results, historical test results, check model run times and even view lineage graphs.

If any of your tests fail, you can view samples of the offending entries or copy the SQL query that generated these errors to quickly investigate.

You can play around with Elementary's demo project to get a feel for it.

2. Stored results and historical views

Elementary integrates with your dbt project in order to store all test runs and uses on-run-end hooks to upload results. This all happens within the dbt package, without needing any connection beyond the one dbt already has to your data warehouse.

This allows us to view test results over time, view progress from run to run, and use test results for internal reporting.

It can sometimes be hard to share integrity metrics with the rest of your non-data-literate team. Having easy access to results & metrics such as row count directly in your warehouse allows you to create integrity dashboards curated for business use cases, giving your team the opportunity to start tackling issues and take stock of their progress.
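
As a rough sketch of where those stored results end up: the Elementary quickstart has you give the package's models their own schema in dbt_project.yml. The schema name below is the one the quickstart suggests, so treat it as an assumption and adjust it to your own conventions.

```yaml
# dbt_project.yml (fragment)
models:
  elementary:
    # Elementary's result tables are built in a dedicated schema,
    # typically named <your_target_schema>_elementary
    +schema: "elementary"
```

Any BI tool that can read your warehouse can then query that schema to build the integrity dashboards described above.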

3. Anomaly detection tests

As mentioned above, it is hard to deal with unknown unknowns and issues that arise over longer periods of time (days, weeks, or even months). Even if your data is clean when your model first goes into production, it does not mean that mistakes/issues cannot slip in as time goes on. A supported table requires constant monitoring, especially if business logic has been hard-coded.

This task can be greatly alleviated by making use of Elementary's native anomaly detection tests, which monitor for shifts at the table and column level for metrics such as:

Row count
Null count
Percentage of missing
Freshness
Etc.

A full list of all anomaly metrics Elementary tests for can be found in the Elementary documentation.

By basing itself on past results rather than hard-coded baselines (e.g., an increase of 10% in row count or a half-day delay in freshness), Elementary can easily be integrated out-of-the-box without needing to fine-tune from pipeline to pipeline.

At Potloc, we mainly use this feature to identify freshness issues. Elementary allows you to easily set up freshness checks without having to specify hard deadlines (e.g., 12h since last sync, 24h since last update). This means that we can change our upstream Extract/Load job schedules without having to change our downstream tests; the tool will automatically flag the change, then adapt to the new schedule as it becomes the norm. This also makes the initial setup quick and painless.
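
To make this more concrete, here is a minimal sketch of what such anomaly tests could look like in a dbt schema file. The model name, column names, and file path are hypothetical. The test and monitor names follow the Elementary documentation as we understand it for the 0.6.x package, so check them against the version you install.

```yaml
# models/schema.yml (hypothetical file, model and column names)
version: 2

models:
  - name: survey_responses              # hypothetical model
    config:
      elementary:
        timestamp_column: updated_at    # hypothetical column used to bucket rows over time
    tests:
      # Table-level monitors: compare current metrics with the table's own history
      - elementary.table_anomalies:
          table_anomalies:
            - row_count
            - freshness
    columns:
      - name: respondent_email          # hypothetical column
        tests:
          # Column-level monitors for null and missing values
          - elementary.column_anomalies:
              column_anomalies:
                - null_count
                - missing_percent
```

Note that no thresholds or deadlines appear anywhere in this file: the expected range is derived from previous runs, which is why a schedule change upstream does not require touching the tests.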

Integrating Elementary into your existing Meltano Project

We developed an Elementary plugin for Meltano using the Meltano EDK that can easily be integrated into your project.

To add it, simply run:

meltano add utility elementary

This should add the following code snippet to your meltano.yml file:

  - name: elementary
    variant: elementary
    pip_url: elementary-data==<EDR VERSION> git+https://github.com/potloc/elementary-ext.git

You will also need to add the following snippet to your packages.yml file:

packages:
  - package: elementary-data/elementary
    version: <DBT PACKAGE VERSION>
    ## Docs: <https://docs.elementary-data.com>

As you can see, we have 2 elements we now need to complete:

EDR version
dbt package version

Both of these can be found in the Elementary quickstart documentation or in their respective package indexes (PyPI and dbt packages).

It is important that these two versions are aligned according to the Elementary team's releases. If the CLI and dbt package versions are misaligned, errors can ensue.

At the time of writing this article, we would be using EDR VERSION = 0.6.3 & DBT PACKAGE VERSION = 0.6.6

Note: We will be working on making this process easier so that you do not need to specify package versions.

Next, you need to configure Elementary's settings so that they match those of your existing dbt project (project directory, profiles directory, and so on). A typical setup could look like this:

      - name: elementary
        namespace: elementary
        pip_url: elementary-data[platform]==0.6.3 git+https://github.com/potloc/elementary-ext.git
        executable: elementary_invoker
        settings:
        - name: project_dir
          kind: string
          value: ${MELTANO_PROJECT_ROOT}/transform/
        - name: profiles_dir
          kind: string
          value: ${MELTANO_PROJECT_ROOT}/transform/profiles/platform/
        - name: file_path
          kind: string
          value: ${MELTANO_PROJECT_ROOT}/path/to/report.html
        - name: skip_pre_invoke
          env: ELEMENTARY_EXT_SKIP_PRE_INVOKE
          kind: boolean
          value: true
          description: Whether to skip pre-invoke hooks which automatically run dbt clean and deps
        - name: slack-token
          kind: password
        - name: slack-channel-name
          kind: string
          value: elementary-notifs
        config:
          profiles-dir: ${MELTANO_PROJECT_ROOT}/transform/profiles/platform/
          file-path: ${MELTANO_PROJECT_ROOT}/path/to/report.html
          slack-channel-name: your_channel_name
          skip_pre_invoke: true
        commands:

          initialize:
            args: initialize
            executable: elementary_extension
          describe:
            args: describe
            executable: elementary_extension
          monitor-report:
            args: monitor-report
            executable: elementary_extension
          monitor-send-report:
            args: monitor-send-report
            executable: elementary_extension

Make sure to replace platform in elementary-data[platform] with the adapter for your data warehouse, which should match the one specified in your dbt profile (we use bigquery).
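
For instance, with BigQuery as the warehouse, the pip_url line would plausibly read as follows. This fragment only swaps the platform placeholder and reuses the version from the setup above, assuming the plugin sits under utilities as the add command suggests. The rest of the plugin definition is unchanged.

```yaml
# meltano.yml (fragment): only the pip_url differs from the setup above
plugins:
  utilities:
    - name: elementary
      # "bigquery" replaces the "platform" placeholder
      pip_url: elementary-data[bigquery]==0.6.3 git+https://github.com/potloc/elementary-ext.git
```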

After this, simply follow the instructions in the Elementary Quickstart Guide to get the plugin up and running.

Generating your first report

Once you have Elementary up and running, it's time to generate your first report. Simply run the command

meltano invoke elementary:monitor-report

and a brand new report should be generated at the specified file-path (${MELTANO_PROJECT_ROOT}/path/to/report.html in our case).

Next steps

Once you've generated your first report, the sky is the limit in terms of integrating Elementary into your workflow.

The Elementary team has made it incredibly easy to send a report as a Slack message. At Potloc, we receive reports twice a day to monitor the state of our pipeline.

You can also set up hosting for your report on S3 or send Slack alerts when an error occurs. Experiment and find what works best for you!
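
As one possible way to automate a twice-daily report, here is a sketch using a Meltano job and schedule. The job name, schedule name, and cron times are hypothetical, and this is not necessarily how our own pipeline is wired. It assumes the monitor-send-report command defined in the plugin configuration above and a configured Slack token.

```yaml
# meltano.yml (fragment): hypothetical job and schedule names
jobs:
  - name: elementary-send-report
    tasks:
      # Runs the plugin command defined earlier in the utility's "commands" block
      - elementary:monitor-send-report

schedules:
  - name: elementary-report-twice-daily
    job: elementary-send-report
    # Cron expression: 06:00 and 18:00 every day (example times only)
    interval: "0 6,18 * * *"
```

Meltano schedules are executed by an orchestrator such as Airflow, so this only takes effect once your orchestrator picks the schedule up.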

A quick closing statement

While incredibly powerful, Elementary is not a replacement for best practices.
When writing tests, ensure that they are pertinent and targeted.
Tests should be written to identify data integrity issues that can compromise business insights, not simply to identify null values.
If you write too many tests without thinking of the meaning behind them, you run the risk of falling into the "too many errors = no errors" paradigm where you have so many warnings that it's impossible to differentiate between actual issues and unfixable noise.

I speak from experience; at one point, we had multiple tests in our pipeline that each returned a warning for over 5,000 erroneous values, with one going up to 180,000. These errors were not actionable, and yet the tests remained. Even with Elementary, this made it hard for us to differentiate between useless warnings and actual integrity issues that needed to be resolved.

Make sure to reach out to the Elementary team if you have any questions about their product!

Interested in what we do at Potloc? Come join us! We are hiring 🚀
