Let's have some more fun comparing and contrasting schema languages. In this post we'll look at a schemas-plus-rules validation tool, Soda, vis-a-vis CsvPath Framework's CsvPath Validation Language.
SodaCL is the validation rules language for the Soda data quality library. You can learn more at soda.io. I'll say right up front that this is an apples-to-oranges comparison. Here's why:
- Soda is mainly relational data focused; CsvPath Framework is mainly files focused
- Soda is a data quality tool, whereas CsvPath Framework is for data preboarding, which includes data quality but isn't limited to it
- SodaCL is a domain-specific language built on YAML that uses embedded SQL; CsvPath Validation Language is a first-class stand-alone validation language, more similar in that regard to DDL or XSD
Nevertheless, both tools do flat-file validation, so it is an apt comparison point. About as apt as apples and oranges, which are both tasty fruit.
Let's grab a first example from the SodaCL docs and see where it takes us. Please note that these are quick and dirty comparisons. I'm not polishing either the SodaCL or the CsvPath to perfection, just giving a rough sense of the differences and similarities.
Duplicates
Here's a duplicate rows query check in SodaCL. Even though it is a SQL check, it only touches one table, so it seems fair game for comparison to a tabular data file.
checks for dim_product:
  - failed rows:
      fail query: |
        with duplicated_records as (
          select
            {{ column_a }},
            {{ column_b }}
          from {{ table }}
          group by {{ column_a }}, {{ column_b }}
          having count(*) > 1
        )
        select
          q.*
        from {{ table }} q
        join duplicated_records dup
          on q.{{ column_a }} = dup.{{ column_a }}
          and q.{{ column_b }} = dup.{{ column_b }}
This check finds and returns rows that share the same column A + column B values. In other words, column A and column B together act as a meaningful identity, and if we find a duplicate we have found an error. Because we're using a SELECT, the result is the set of every row that has a duplicate.
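By the way, if you want to try a check like this, it runs as part of a Soda scan. Here's a rough sketch using Soda Core's Python Scan API; the data source name and file names are made up for the example, and the exact method names may vary by version, so treat it as an outline rather than a recipe:

from soda.scan import Scan

# Hypothetical names: a "warehouse" data source and local YAML files
scan = Scan()
scan.set_data_source_name("warehouse")
scan.add_configuration_yaml_file("configuration.yml")
scan.add_sodacl_yaml_file("checks.yml")

# Runs the checks; a non-zero exit code signals failures or errors
exit_code = scan.execute()
print(scan.get_scan_results())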
In CsvPath we would prefer to do something a bit simpler:
$[*][
    has_dups(#a, #b)
]
This does almost the exact same thing. The result is the duplicate lines, but not the original lines. An original line is the first line of a set of duplicates.
If we want to know all lines with duplicates, regardless of whether they are the original line or not, we can use dup_lines(). This function captures all the line numbers that are duplicated, including the first. That would net us a variable named @dup_lines (or whatever we want it to be named).
The variable would contain a key for every unique value, each holding a list of line numbers. In order to get the actual lines we would need a second csvpath that uses the duplicate-lines variable to return all the duplicate lines.
$[*][
    dup_lines.lines(#a, #b)
    no()
]
---- CSVPATH ----
$[*][
    @s = get(
        $dups.variables.lines,
        fingerprint(#a, #b)
    )
    @t = size(@s)
    above(@t, 1)
    append("line", line_number(), yes())
]
Here the first csvpath creates a variable, @lines, that has all the unique a+b header value fingerprints as keys to stacks of line numbers where that fingerprint was found. The no() keeps lines from being collected, since we don't need them.
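To make that concrete, here is roughly the shape @lines takes, shown as a Python dict the way you might see it rendered in vars.json. The fingerprint keys and line numbers are invented for illustration:

# Illustration only: real keys are the full fingerprint(#a, #b) hashes
lines = {
    "3f9a...": [2, 7, 15],   # three lines share the same a + b values
    "c41d...": [9],          # a unique a + b pair appears on only one line
    "88be...": [4, 12],
}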
We load these two csvpaths as a single named-paths group called dups. Running the named-paths group serially makes sure the first csvpath has prepared the data that the second needs before the second starts.
If @lines holds a stack with a count above 1 for a line's a + b fingerprint, that line has duplicates. Because above() returns true, the line matches and is collected. QED.
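If you're curious about the mechanics, here's a rough sketch of loading and running the pair as the dups named-paths group with CsvPath Framework's Python API. The file names are hypothetical and the manager methods are written as I understand them, so double-check against the current docs:

from csvpath import CsvPaths

paths = CsvPaths()

# Hypothetical file names for this example
paths.file_manager.add_named_file(name="orders", path="orders.csv")
paths.paths_manager.add_named_paths_from_file(name="dups", file_path="dups.csvpath")

# Runs the two csvpaths serially: the first builds @lines, the second uses it
paths.collect_paths(filename="orders", pathsname="dups")

# The duplicate lines, plus the vars.json holding @lines, end up in the run's results
results = paths.results_manager.get_named_results("dups")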
In our example there is no ID that distinguishes lines. In a real case, you might want to have the line numbers so you can better investigate why there were duplicates. As you can see, the dup_lines() function captures that for you in a variable. The variable is available programmatically and in the vars.json file generated by the run.
However, to stay closer to our working csvpath, we can just add the line number to the lines captured. To do that, we append a new header named line, giving it the value from line_number(). The yes() says that we also want line included in the header line.
To strip the CsvPath solution back to the essentials that match the SodaCL, we get:
$[*][
    dup_lines.lines(#a, #b)
]
---- CSVPATH ----
$[*][
    @s = get(
        $dups.variables.lines,
        fingerprint(#a, #b)
    )
    above(size(@s), 1)
]
That's a nice concise pair of csvpaths. If we were preboarding data with more rules, we would add these two statements to a larger named-paths group covering all the validation rules.
You can see that for data preboarding, CsvPath Framework's purpose-built CSV validation capabilities are on target. SodaCL, while not a preboarding tool, is also highly effective and obviously a better choice for monitoring the data quality of database-housed data downstream of CsvPath. There's more we can compare between CsvPath and SodaCL. We'll return to it in a future post.