David Kershaw

Posted on Jan 28

Comparing Validatar to CsvPath Validation

#dataengineering #data #python #programming

Let's go back to the buffet and compare CsvPath Framework and FlightPath Data to another validation tool. Today we'll look at Validatar. Once more into the breach, dear friends!

As with the other comparisons, please remember that data quality tools like SodaCL, Great Expectations, or today's contestant, Validatar, only do data quality. CsvPath Framework, by contrast, is a data-file feeds management infrastructure that covers data validation as just one aspect of the full data preboarding lifecycle.

Moreover, CsvPath Framework does not deal with relational databases (other than as an option for storing its own metadata). Validatar et al. are first and foremost relational database quality management tools, and only secondarily deal with data files. So it's a mismatch, to some degree, but useful and entertaining nonetheless.

The Validatar Example
The CsvPath Way

The Validatar example we'll replicate using CsvPath Validation Language and FlightPath Data is at https://docs.validatar.com/docs/exercise-10-create-a-uniqueness-monitoring-template-for-csv-files. Just from the name of the exercise you know this is more up CsvPath Framework's alley than it is Validatar's.

The Validatar Example

Here's the problem description at the top of the Validatar example:

This Standard Test is designed to demonstrate the concept of how to create a uniqueness template test for all CSV files in multiple folders.
The Standard Test here compares the Row Count per account_id value in the account_data.csv to make sure all account_id's only have 1 row. The test only keeps failures and stops after 100 failure records.

Spoiler alert: in FlightPath this is a trivial example (as I think it is meant to be).

Validatar starts by having you create a test template. Before you can do that, though, you need a project. Here are the instructions for that step:

Choose your project
Make sure your Data Source is Mapped correctly

The first bullet sounds simple. I'm not sure what the second bullet means because I'm not a Validatar expert and that one isn't explained on that page.

Moving on, let's create that template. Most of this exercise is forms based. The setup is shown in the image below. Their ask is that you notice:

Note that the column specified to group by is account_id
Note that it is comparing the ROW_COUNT to a fixed value of 1
Note that the Result Configuration is set so that only Failures are kept and to abort after 100 failures are found

Good requirements for us to use on the CsvPath side.
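To make those three requirements concrete before we get to either tool, here is a minimal plain-Python sketch of the same check. It is not Validatar or CsvPath code; the function name, the sample data, and the `max_failures` parameter are my own assumptions for illustration.

```python
import csv
import io
from collections import Counter


def find_non_unique_keys(csv_text, key_column="account_id", max_failures=100):
    """Return (key, row_count) pairs whose row count is not 1, capped at max_failures."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Group by the key column and count rows per value (the ROW_COUNT).
    row_counts = Counter(row[key_column] for row in reader)
    failures = []
    for key, count in row_counts.items():
        if count != 1:  # compare ROW_COUNT to the fixed value 1
            failures.append((key, count))  # keep only failures
            if len(failures) >= max_failures:
                break  # abort once max_failures failures are found
    return failures


# Hypothetical sample data: account_id 1 and 2 each appear twice.
sample = "account_id,name\n1,Ann\n2,Bob\n1,Ann again\n3,Cy\n2,Bob again\n"
failures = find_non_unique_keys(sample)
print(failures)
```

Passing a smaller `max_failures` shows the abort behavior without needing a hundred bad rows.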

At this point we have our test. Now we need to create a template from it so that we can apply it to each CSV file. This is how we get to a single action we can apply to multiple files in a uniform way.

I'm going to just add the bullets because the screenshot is in the link above, which is of course a more complete description. We do:

Click Build Template
Update the Folder input to {{schema.name}}
Update the File input to {{table.name}}
Update the Column input to {{#replace table.name "_data.csv" "_id"}}
Update the Metadata Links to {{schema.name}}.{{table.name}}
Change the Generate column list using to Dynamic Template Configuration
Update the Dynamic Script

The dynamic script is pretty simple:

    [
        {"name":"{{#replace table.name "_data.csv" "_id"}}","sequence":1,"type":"Numeric","role":"Key"},
        {"name":"ROW_COUNT","sequence":2,"type":"Numeric","role":"Value"}
    ]
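If the template syntax is unfamiliar, here is a rough Python sketch of what I understand the dynamic script to produce for a single file. It assumes the `#replace` helper is a plain substring replacement on the file name; the function name and the `customer_data.csv` example are hypothetical.

```python
import json


def build_column_list(table_name):
    """Mimic the dynamic script's output for one file name, e.g. 'account_data.csv'."""
    # Stand-in for {{#replace table.name "_data.csv" "_id"}} (assumed to be a substring replace).
    key_column = table_name.replace("_data.csv", "_id")
    return [
        {"name": key_column, "sequence": 1, "type": "Numeric", "role": "Key"},
        {"name": "ROW_COUNT", "sequence": 2, "type": "Numeric", "role": "Value"},
    ]


columns = build_column_list("account_data.csv")
print(json.dumps(columns, indent=2))
```

The point of the replace is that one template serves any file that follows the naming convention: under the same assumption, a file called `customer_data.csv` would be keyed on `customer_id`.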

Now, we're going to use some metadata to filter down to the files we care about.

Switch to the Metadata Selection Tab
Change to the Use Filters option
Add a Filter on the Table Name Field that contains "_data.csv"
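In plain-Python terms, that filter amounts to walking the folders and keeping the files whose names contain "_data.csv". Here is a small sketch using only the standard library; the folder names and the helper function are made up for the example.

```python
from pathlib import Path
import tempfile


def find_matching_files(root, needle="_data.csv"):
    """Return files under root (at any depth) whose name contains needle, sorted."""
    return sorted(p for p in Path(root).rglob("*") if p.is_file() and needle in p.name)


# Build a small throwaway tree to try the filter on (hypothetical folder names).
with tempfile.TemporaryDirectory() as tmp:
    layout = [("jan", "account_data.csv"), ("feb", "account_data.csv"), ("feb", "notes.txt")]
    for folder, filename in layout:
        (Path(tmp) / folder).mkdir(exist_ok=True)
        (Path(tmp) / folder / filename).write_text("account_id\n1\n")
    matches = [p.relative_to(tmp).as_posix() for p in find_matching_files(tmp)]

print(matches)
```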

At this point, check that the filter finds your files and run the example. You should be good to go. My feeling is that all of this works better for database tables than for CSV files, just as you would expect from Validatar.

Now, CsvPath Framework

Once more, this time with feeling! Let's see how CsvPath Framework and FlightPath Data can make the same magic happen. And, hopefully you'll agree that it's much simpler and more powerful for its use case.

The requirements, again

Create a uniqueness test for all csv files in multiple folders
The column specified to group by is account_id
Only keep failures
Stop after 100 failures

The core of these requirements is the validation statement. Using CsvPath Validation Language this is next to trivial:

$[*][ 
    @duplicate_accounts.nocontrib == 100 -> stop()
    has_dups(#account_id) -> counter.duplicate_accounts(1) 
]

(The @ sign marks a variable and the # sign marks a header name.)

This csvpath says: for each line in a file check if the counter is 100. If it is, stop processing that file. Otherwise, increase the counter if the #account_id is a duplicate.

The statement will collect only error lines because:

The counter is a side-effect with no contribution to matching
The check that @duplicate_accounts equals 100 is marked with the nocontrib qualifier, so it does not contribute to matching

The function that does the heavy lifting is has_dups(). If that returns True (i.e. the value of true()) we match the line and capture it.
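If you'd like to see that logic spelled out line by line, here is a plain-Python emulation. It is only a sketch of how I read the statement, not how CsvPath Framework is implemented. In particular, it assumes has_dups() flags a line when its account_id has already been seen earlier in the file; the function and sample data are invented for the example.

```python
import csv
import io


def collect_duplicate_lines(csv_text, header="account_id", limit=100):
    """Collect lines whose header value was already seen, stopping once limit is reached."""
    seen = set()
    duplicate_accounts = 0  # plays the role of @duplicate_accounts
    matched = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if duplicate_accounts == limit:
            break  # analogous to: @duplicate_accounts.nocontrib == 100 -> stop()
        value = row[header]
        if value in seen:  # analogous to: has_dups(#account_id)
            duplicate_accounts += 1  # analogous to: counter.duplicate_accounts(1)
            matched.append(row)  # the line matches, so it is captured
        seen.add(value)
    return matched


# Hypothetical sample data: the third and fifth data lines repeat an account_id.
sample = "account_id,name\n1,Ann\n2,Bob\n1,Ann again\n3,Cy\n2,Bob again\n"
dups = collect_duplicate_lines(sample)
print([row["name"] for row in dups])
```

Notice that neither the counter nor the limit check decides whether a line is captured. Only the duplicate test does, which mirrors why the csvpath collects only the error lines.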

All pretty readable. Now what do we do with it?

FlightPath Data FTW

Everything we need to do next is almost as simple in Python using only CsvPath Framework. Almost! But with FlightPath Data it is even simpler.

In FlightPath, create a new file called dups.csvpath. Paste in our statement.

Right-click on dups.csvpath and select Load csvpaths.

In the load dialog give the named-paths group the name dups and click Create.

You should see your csvpath show up in the middle window on the right under the dups folder. When you load a csvpath statement it always goes into a group.csvpaths file. And when you click on that file its background is pale green to let you know you cannot edit it. (You can, of course, over-write it anytime without losing prior versions, but that is another topic for a different post.)

Next, stage your data. In the example, each file is in its own folder and its folder is one of many in the same directory. We'll just add the parent folder and let FlightPath find the files for us.

To do that, right-click the parent directory and select Stage data. In the stage data dialog, uncheck the Separate named-files checkbox because we're going to have every physical file be one version of the same named-file. Think of a named-file as a category that has one file assigned to it at a time, in sequence. We say named-files have versions.

In the named-file name box type accounts. That's our category. You will see your data in the top-right window as a directory named accounts. I used a template of :6/:filename in order to keep the month folders, but that is completely optional.

The result looks like:

Finally, right-click the accounts folder, or the dups folder below it, and select New run. In the run dialog, for named-paths select dups. For named-file type in $accounts.files.:all. That named-file name is a reference that indicates every version of the accounts named-file. Again, remember that the named-file is like a category that registers one file at a time. We registered a bunch of files and now we're applying our CsvPath Validation Language statement to each of them in turn.
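If the named-file idea is still fuzzy, here is a toy Python model of it. This is purely a mental model, not CsvPath Framework's API or storage layout: the registry, `stage()`, and `resolve()` are all hypothetical, and I'm assuming that a plain name resolves to the most recently staged version.

```python
from collections import defaultdict

# Toy registry: named-file name -> list of registered file versions, oldest first.
named_files = defaultdict(list)


def stage(name, path):
    """Register path as the next version of the named-file called name."""
    named_files[name].append(path)


def resolve(reference):
    """Resolve '$name.files.:all' to every version, or a plain name to the latest one."""
    suffix = ".files.:all"
    if reference.startswith("$") and reference.endswith(suffix):
        name = reference[1:-len(suffix)]
        return list(named_files[name])
    return named_files[reference][-1:]  # assumed default: most recent version only


# Hypothetical month folders, staged in order under one named-file.
stage("accounts", "jan/account_data.csv")
stage("accounts", "feb/account_data.csv")

for path in resolve("$accounts.files.:all"):
    print("would run the dups csvpath against", path)
```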

And here's the Run dialog:

When you click Run you will see your results in the lower right-hand window. Your run is date stamped within the dups results. In your date-stamped run you can see the data.csv where your duplicate lines landed. In this image I dropped each run into its own folder using a template; you can see the test and test2. That is completely optional, of course.

And that is it!

There is, of course, much more you can do with CsvPath Framework. Likewise, Validatar has a ton more functionality than what we showed. But now you've had a taste of both.

What I'd hope you come away with is that CsvPath Framework is the better tool for CSV, JSONL, and Excel file validation. The ease of using FlightPath Data for this validation example makes the case well. Obviously, for relational database validation, Validatar is your horse.

And of course I also want to point out again that CsvPath Framework is a complete data preboarding solution, not just a validation engine. Preboarding inbound data files is a big deal. If you need that (and who doesn't?) you owe it to yourself to take a look at CsvPath Framework.

To whet your appetite, here's a post on data preboarding build or buy. Enjoy!