<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Stéphane Burwash</title>
    <description>The latest articles on DEV Community by Stéphane Burwash (@sburwash).</description>
    <link>https://dev.to/sburwash</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F997395%2F7e8efc33-e2ba-4a7f-8c5b-9ad93441e116.jpeg</url>
      <title>DEV Community: Stéphane Burwash</title>
      <link>https://dev.to/sburwash</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sburwash"/>
    <language>en</language>
    <item>
      <title>Data Analytics at Potloc I: Making data integrity your priority with Elementary &amp; Meltano</title>
      <dc:creator>Stéphane Burwash</dc:creator>
      <pubDate>Fri, 06 Jan 2023 03:28:33 +0000</pubDate>
      <link>https://dev.to/potloc/data-analytics-at-potloc-i-making-data-integrity-your-priority-with-elementary-meltano-1ob</link>
      <guid>https://dev.to/potloc/data-analytics-at-potloc-i-making-data-integrity-your-priority-with-elementary-meltano-1ob</guid>
      <description>&lt;h3&gt;
  
  
  Foreword
&lt;/h3&gt;

&lt;p&gt;This is the first of a series of small blog posts where we describe plugins that our data engineering team at &lt;a href="https://www.potloc.com/" rel="noopener noreferrer"&gt;Potloc&lt;/a&gt; developed in order to solve business issues that we were facing.&lt;/p&gt;

&lt;p&gt;These features were developed to enhance our current stack, which consists of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://meltano.com/" rel="noopener noreferrer"&gt;Meltano&lt;/a&gt; as our DataOps platform + Extract / Load tool&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.getdbt.com/" rel="noopener noreferrer"&gt;dbt&lt;/a&gt; as our data transformation tool&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/airflow" rel="noopener noreferrer"&gt;Airflow&lt;/a&gt; as our orchestrator&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/bigquery" rel="noopener noreferrer"&gt;BigQuery&lt;/a&gt; as our data warehouse&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/fargate/" rel="noopener noreferrer"&gt;AWS Fargate&lt;/a&gt; as our hosting infrastructure, managed through &lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, it is assumed that the reader has basic knowledge of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Meltano&lt;/li&gt;
&lt;li&gt;dbt&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We hope that you enjoy the article! If you have any questions, feel free to reach out.&lt;/p&gt;

&lt;p&gt;This article was not sponsored by Elementary in any way; we're just big fans 😉.&lt;/p&gt;

&lt;h1&gt;
  
  
  Data Integrity - It's more than a buzzword
&lt;/h1&gt;

&lt;p&gt;Data integrity is an essential component of a data pipeline. Without integrity and trust, your pipeline is essentially worthless, and those hundreds of hours you spent integrating new sources, optimizing load times, and modeling raw data into usable insights go down the drain. More than once, our data team has created "production ready" dashboards, only to realise that the integrity / business logic behind the dashboard was completely flawed. What was supposed to be a 2-day project became a 3-week debacle.&lt;/p&gt;

&lt;p&gt;To avoid falling into the data quality trap, you can write tests using a number of powerful open-source data integrity solutions such as &lt;a href="https://greatexpectations.io/" rel="noopener noreferrer"&gt;Great Expectations&lt;/a&gt; or &lt;a href="https://www.soda.io/core" rel="noopener noreferrer"&gt;Soda Core&lt;/a&gt;, or even natively using &lt;a href="https://docs.getdbt.com/docs/build/tests" rel="noopener noreferrer"&gt;dbt tests&lt;/a&gt;, to validate that your data is doing what it's supposed to. But as you write more and more tests, you can start running into some issues, mainly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tracking over time&lt;/strong&gt;: How do you keep track of test results over time? How are you progressing in terms of tackling these issues?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This was the starting point for our quest to find a long-term data integrity solution at Potloc. We were attempting to map integrity issues in user-inputted data. We also wanted to give our team an accurate report of their progress as they resolved these issues one-by-one in the source data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Readability&lt;/strong&gt;: As you go from 5 to 50 to 500 to 5000+ tests, reading results becomes exponentially more complicated and time-intensive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reproducibility&lt;/strong&gt;: Once you have detected an integrity issue, how do you quickly reproduce the test to be able to investigate?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unknown unknowns&lt;/strong&gt;: While it is possible to test for every known possible issue, it can be harder / impossible to test for unknown unknowns such as dataset shifts, anomalies, large spikes in row count, etc.&lt;/p&gt;

&lt;p&gt;These issues can be addressed by integrating &lt;a href="https://www.elementary-data.com/" rel="noopener noreferrer"&gt;Elementary&lt;/a&gt; into your workflow.&lt;/p&gt;

&lt;h1&gt;
  
  
  Features
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Elementary&lt;/strong&gt; is an open-source tool for data observability and validation that wraps around an existing dbt project. It lets you graduate from simply &lt;strong&gt;having&lt;/strong&gt; integrity tests to &lt;strong&gt;using&lt;/strong&gt; them to improve confidence in your data product.&lt;/p&gt;

&lt;p&gt;Here are only some of the reasons why I personally love Elementary:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The UI, the glorious UI:
&lt;/h3&gt;

&lt;p&gt;Elementary and its associated CLI (command line interface) &lt;em&gt;edr&lt;/em&gt; natively allow you to &lt;strong&gt;generate a static HTML file containing test results&lt;/strong&gt;. This file can be viewed locally, sent through Slack, or even hosted in a cloud service to be viewed as a webpage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqhtb932ag6r0b5xpqy7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqhtb932ag6r0b5xpqy7.png" alt="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pqhtb932ag6r0b5xpqy7.png" width="800" height="246"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From this UI, you can view your most recent test results, historical test results, check model run times and even view lineage graphs.&lt;/p&gt;

&lt;p&gt;If any tests fail during a run, you can view samples of the offending entries or copy the SQL query that generated these errors to quickly investigate.&lt;/p&gt;

&lt;p&gt;You can play around with Elementary's &lt;a href="https://storage.googleapis.com/elementary_static/elementary_demo.html#/test-results" rel="noopener noreferrer"&gt;demo project&lt;/a&gt; to get a feel for it.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Stored results and historical views
&lt;/h3&gt;

&lt;p&gt;Elementary integrates with your dbt project to store all test runs, using &lt;em&gt;on-run-end&lt;/em&gt; hooks to upload results. This all happens within the dbt package, over your project's existing warehouse connection, so no separate connection needs to be set up.&lt;/p&gt;

&lt;p&gt;This allows us to view test results over time, view progress from run to run, and &lt;strong&gt;use test results for internal reporting&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It can sometimes be hard to share integrity metrics with the rest of your non-data-literate team. Having easy access to results &amp;amp; metrics such as row count directly in your warehouse allows you to create integrity dashboards curated for business use cases, giving your team the opportunity to start tackling issues and take stock of their progress.&lt;/p&gt;
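
&lt;p&gt;As a minimal sketch of what this looks like on the dbt side, the snippet below follows the Elementary quickstart and routes the package's models to a dedicated schema in &lt;code&gt;dbt_project.yml&lt;/code&gt;. The schema name &lt;code&gt;elementary&lt;/code&gt; is only an example; check the documentation for the package version you install.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;models:
  elementary:
    # With dbt's default naming, Elementary's result tables are built in
    # your target schema suffixed with the name below
    +schema: elementary

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the results land in ordinary warehouse tables, they can be queried like any other model when building internal reports.&lt;/p&gt;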

&lt;h3&gt;
  
  
  3. Anomaly detection tests
&lt;/h3&gt;

&lt;p&gt;As mentioned above, it is hard to deal with unknown unknowns and issues that arise over longer periods of time (days, weeks, or even months). Even if your data is clean when your model first goes into production, it does not mean that mistakes/issues cannot slip in as time goes on. A supported table requires constant monitoring, especially if business logic has been hard-coded.&lt;/p&gt;

&lt;p&gt;This task can be greatly alleviated by making use of Elementary's native &lt;strong&gt;anomaly detection tests&lt;/strong&gt;, which monitor for shifts at the table and column level for metrics such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Row count&lt;/li&gt;
&lt;li&gt;Null count&lt;/li&gt;
&lt;li&gt;Percentage of missing&lt;/li&gt;
&lt;li&gt;Freshness&lt;/li&gt;
&lt;li&gt;Etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A full list of all anomaly metrics Elementary tests for can be found &lt;a href="https://docs.elementary-data.com/guides/add-elementary-tests" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;By basing itself on &lt;strong&gt;past results&lt;/strong&gt; rather than hard-coded baselines (ex: an increase of 10% in row count or half-day delay in freshness), Elementary can easily be integrated out-of-the-box without needing to fine-tune from pipeline to pipeline.&lt;/p&gt;

&lt;p&gt;At Potloc, we mainly use this feature to identify &lt;strong&gt;freshness issues&lt;/strong&gt;. Elementary allows you to easily set up freshness checks &lt;em&gt;without having to specify hard deadlines&lt;/em&gt; (ex: 12h since last sync, 24h since last updated, etc.). This means that we can change our upstream Extract/Load job schedules without having to change our downstream tests; the tool will automatically flag the issue, and then adapt to the new schedule as it becomes the norm. This also makes the initial setup quick and painless.&lt;/p&gt;
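
&lt;p&gt;To make this more concrete, here is a rough sketch of how such tests could be declared in a dbt &lt;code&gt;schema.yml&lt;/code&gt; file. The model and column names are hypothetical, and the test names and arguments follow the 0.6.x releases of the Elementary dbt package as we understand them; newer releases expose dedicated tests such as &lt;code&gt;volume_anomalies&lt;/code&gt; and &lt;code&gt;freshness_anomalies&lt;/code&gt;, so check the documentation for your version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: 2

models:
  - name: survey_responses            # hypothetical model name
    config:
      elementary:
        timestamp_column: updated_at  # hypothetical column used to bucket rows over time
    tests:
      # Table-level monitors, compared against past runs rather than fixed thresholds
      - elementary.table_anomalies:
          table_anomalies:
            - row_count
            - freshness
    columns:
      - name: respondent_email        # hypothetical column name
        tests:
          # Column-level monitors
          - elementary.column_anomalies:
              column_anomalies:
                - null_count
                - missing_percent

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that no deadline or threshold appears anywhere in this file; the expected ranges are derived from previous runs.&lt;/p&gt;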

&lt;h1&gt;
  
  
  Integrating Elementary into your existing Meltano Project
&lt;/h1&gt;

&lt;p&gt;We developed an &lt;a href="https://hub.meltano.com/utilities/elementary/" rel="noopener noreferrer"&gt;Elementary plugin&lt;/a&gt; for Meltano using the Meltano EDK that can easily be integrated into your project.&lt;/p&gt;

&lt;p&gt;To add it, simply run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;meltano add utility elementary

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should add the following code snippet to your &lt;code&gt;meltano.yml&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  - name: elementary
    variant: elementary
    pip_url: elementary-data==&amp;lt;EDR VERSION&amp;gt; git+https://github.com/potloc/elementary-ext.git

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will also need to add the following snippet to your &lt;code&gt;packages.yml&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;packages:
  - package: elementary-data/elementary
    version: &amp;lt;DBT PACKAGE VERSION&amp;gt;
    ## Docs: &amp;lt;https://docs.elementary-data.com&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, we have 2 elements we now need to complete:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EDR version&lt;/li&gt;
&lt;li&gt;dbt package version&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both of these can be found in the Elementary &lt;a href="https://docs.elementary-data.com/quickstart" rel="noopener noreferrer"&gt;quickstart documentation&lt;/a&gt; or in their respective package indexes (&lt;a href="https://pypi.org/project/elementary-data/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt; and &lt;a href="https://hub.getdbt.com/elementary-data/elementary/latest/" rel="noopener noreferrer"&gt;dbt packages&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;It is important that &lt;strong&gt;both of these versions are aligned according to Elementary's release notes&lt;/strong&gt;. If the CLI and dbt package versions are misaligned, errors can ensue.&lt;/p&gt;

&lt;p&gt;At the time of writing this article, we are using &lt;code&gt;EDR VERSION = 0.6.3&lt;/code&gt; &amp;amp; &lt;code&gt;DBT PACKAGE VERSION = 0.6.6&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; We will be working on making this process easier so that you do not need to specify package versions.&lt;/p&gt;

&lt;p&gt;Next, you need to configure Elementary's settings so that they match those of your existing dbt project. A typical setup could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      - name: elementary
        namespace: elementary
        pip_url: elementary-data[platform]==0.6.3 git+https://github.com/potloc/elementary-ext.git
        executable: elementary_invoker
        settings:
        - name: project_dir
          kind: string
          value: ${MELTANO_PROJECT_ROOT}/transform/
        - name: profiles_dir
          kind: string
          value: ${MELTANO_PROJECT_ROOT}/transform/profiles/platform/
        - name: file_path
          kind: string
          value: ${MELTANO_PROJECT_ROOT}/path/to/report.html
        - name: skip_pre_invoke
          env: ELEMENTARY_EXT_SKIP_PRE_INVOKE
          kind: boolean
          value: true
          description: Whether to skip pre-invoke hooks which automatically run dbt clean and deps
        - name: slack-token
          kind: password
        - name: slack-channel-name
          kind: string
          value: elementary-notifs
        config:
          profiles-dir: ${MELTANO_PROJECT_ROOT}/transform/profiles/platform/
          file-path: ${MELTANO_PROJECT_ROOT}/path/to/report.html
          slack-channel-name: your_channel_name
          skip_pre_invoke: true
        commands:

          initialize:
            args: initialize
            executable: elementary_extension
          describe:
            args: describe
            executable: elementary_extension
          monitor-report:
            args: monitor-report
            executable: elementary_extension
          monitor-send-report:
            args: monitor-send-report
            executable: elementary_extension

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure to &lt;strong&gt;specify the platform&lt;/strong&gt; in the &lt;code&gt;pip_url&lt;/code&gt;; it should match the adapter used in your dbt profile (we use BigQuery).&lt;/p&gt;
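
&lt;p&gt;For illustration, with BigQuery as the warehouse and the version pinned in the setup above, the relevant lines of &lt;code&gt;meltano.yml&lt;/code&gt; could look roughly like this. Treat it as a sketch; the extra name to use depends on the adapter in your own dbt profile.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  - name: elementary
    variant: elementary
    # "bigquery" takes the place of the "platform" placeholder
    pip_url: elementary-data[bigquery]==0.6.3 git+https://github.com/potloc/elementary-ext.git

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;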

&lt;p&gt;After this, simply follow the instructions in the &lt;a href="https://docs.elementary-data.com/quickstart" rel="noopener noreferrer"&gt;Elementary Quickstart Guide&lt;/a&gt; to get the plugin up and running.&lt;/p&gt;

&lt;h1&gt;
  
  
  Generating your first report
&lt;/h1&gt;

&lt;p&gt;Once you have Elementary up and running, it's time to generate your first report. Simply run the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;meltano invoke elementary:monitor-report

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and a brand new report should be generated at the specified &lt;code&gt;file-path&lt;/code&gt; (&lt;code&gt;${MELTANO_PROJECT_ROOT}/path/to/report.html&lt;/code&gt; in our case).&lt;/p&gt;

&lt;h1&gt;
  
  
  Next steps
&lt;/h1&gt;

&lt;p&gt;Once you've generated your first report, the sky is the limit in terms of integrating Elementary into your workflow.&lt;/p&gt;

&lt;p&gt;The Elementary team has made it incredibly easy to send a report as a Slack message. At Potloc, we receive reports twice a day to monitor the state of our pipeline.&lt;/p&gt;

&lt;p&gt;You can also set up hosting for your report on S3 or send Slack alerts when an error occurs. Experiment and find what works best for you!&lt;/p&gt;
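
&lt;p&gt;As one possible way to automate the twice-daily Slack report, the sketch below uses Meltano's &lt;code&gt;jobs&lt;/code&gt; and &lt;code&gt;schedules&lt;/code&gt; in &lt;code&gt;meltano.yml&lt;/code&gt;. The job name, schedule name and cron expression are hypothetical, and it assumes that the Slack token and channel are already configured and that your orchestrator picks up Meltano schedules.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jobs:
  - name: elementary-report               # hypothetical job name
    tasks:
      # Runs the monitor-send-report command defined on the elementary utility
      - elementary:monitor-send-report

schedules:
  - name: elementary-report-twice-daily   # hypothetical schedule name
    interval: "0 6,18 * * *"              # example cron: 06:00 and 18:00 every day
    job: elementary-report

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;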

&lt;h1&gt;
  
  
  A quick closing statement
&lt;/h1&gt;

&lt;p&gt;While incredibly powerful, Elementary &lt;strong&gt;is not a replacement for best practices&lt;/strong&gt;.&lt;br&gt;
When writing tests, ensure that they are &lt;em&gt;pertinent&lt;/em&gt; and &lt;em&gt;targeted&lt;/em&gt;.&lt;br&gt;
Tests should be written to identify data integrity issues that can compromise business insights, not simply to identify null values.&lt;br&gt;
If you write too many tests without thinking of the meaning behind them, you run the risk of falling into the &lt;em&gt;"too many errors = no errors"&lt;/em&gt; paradigm where you have so many warnings that it's impossible to differentiate between actual issues and unfixable noise. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I speak from experience; at one point, we had multiple tests in our pipeline that returned a warning of over &lt;em&gt;5 000 erroneous values&lt;/em&gt;, with one going up to &lt;em&gt;180 000&lt;/em&gt;. These errors were unactionable, and yet the tests remained. Even with Elementary, this made it hard for us to differentiate between useless warnings and &lt;strong&gt;actual integrity issues&lt;/strong&gt; that needed to be resolved.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Make sure to reach out to the Elementary team if you have any questions about their product!&lt;/p&gt;

&lt;p&gt;Interested in what we do at &lt;a href="https://www.potloc.com/" rel="noopener noreferrer"&gt;Potloc&lt;/a&gt;? Come join us! &lt;a href="https://jobs.lever.co/Potloc" rel="noopener noreferrer"&gt;We are hiring&lt;/a&gt; 🚀&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
