David Kershaw

Posted on Nov 14 • Edited on Nov 25

A real-world example of CsvPath schemas

#data #dataengineering #csv #csvpath

Zuora's example invoice upload CSV came to my attention in a timely way. Just last night I posted this comparison of SQL and CsvPath schemas. CSV schemas are an interesting topic for anyone who likes to geek-out about document and data validation. (👋 Raises hand...) but are they practical?

As it turns out, yes. And a good example of where you might use them comes from the Zuora docs: Guidelines_for_CSV_file_upload_in_Data_Loader

This file format combines invoices with invoice items. That's a very common thing to do. It is about as tasteful as a wildly denormalized SQL table, but practical, for sure. How might we validate an invoice upload file? We would use CsvPath schemas. Could we use just plan CsvPath rules? Sure, we could. But as you'll see, line() schemas are cleaner.

If you're playing along at home, pop open FlightPath Data. Those of you who are new, FlightPath Data is CsvPath Framework's favorite frontend. It is available free on the Windows store and the Apple macOS app store.

Take a look at the Zuora example file. In FlightPath Data it looks like:

. It starts with three empty lines. If you switch to plain text it's even clearer. The lines have delimiters but no values.

This is our first challenge. Let's say that we can accept the four blank lines without calling the file invalid. How does our CsvPath have to accommodate that? The answer is one of those "easy if you know it" things.

In this case, we're going to assume that we'll typically have blank lines. The accommodation/fix is:

    empty.nocontrib(headers()) -> skip()
    after_blank.nocontrib() -> reset_headers()

These two match components say that if a line is empty we skip it. We put the nocontrib qualifier there to say that we don't want emptiness to be a validation criteria. In other words, the empty() doesn't contribute to the match calculation. The skip() make CsvPath jump to the next line without matching the current line and without checking any of the match components below it.

Then we have to find the headers. Technically we don't need to do this, because we can access header values using indexes. E.g. #0 is the same as #IsNewInvoice. However, the names are easier to work with. To find the headers we simply check if the previous line was a blank, and if so, reset the header names to be the values in that line. And again, we don't want the after_blank() function to determine if a line matches. Easy!

Now the fun part. We need to define two schema entities using the line() function. One is for the invoices. The other is for each invoice's line-items. Interestingly, and again commonly for this form of CSV, the invoice lines are cohabitated with the first invoice item. That is no trouble at all.

The invoice entity

A new invoice is indicated by a TRUE value in the #0 header. When we see that we need to match:

   #0 == "TRUE" -> line.invoice(
       blank(),
       string(#"Account Id"),
       string(#"Invoice Date"),
       string(#"Auto Pay"),
       string(#comments),
       string(#"Invoice Number"),
       string(#"Invoice.PO Number"),
       wildcard()
   )

This enitity skips the first header, accepts strings for the next six headers, and adds a wildcard() to allow any number of additional values to exist to the right.

When you run this, you get two matching lines. With our validation strategy, these two lines are valid. Everything else in the file is skipped or invalid. We're doing well. We have invoices. But we still need invoice items. Or, I should say, we have invoice items but they are incorrectly being found to be invalid. We'll fix that.

For our second entity, the invoice items, we make another line(). Keep in mind that a line() is always a complete description of a single line. But at the same time it is also as specific as you like, and so may not consider all header values.

The way line() can model both a whole line and an entity within a line is by using blank() and wildcard() to block out the parts of a line that it doesn't care about. We do this in the invoice entity in the first header position because we don't care about the signal IsNewInvoice. Here, in the invoice item entity, we don't care about the invoice, so we'll skip over that with wildcard().

   #IsNewInvoiceItem == "TRUE" -> line.item(
       wildcard(8),
       string(#"Invoice Item Amount"),
       datetime(#"Invoice Item Service Start Date", "%Y-%m-%d %H:%M:%S"),
       string(#"Invoice Item Charge Date"),
       string(#"Invoice Item Charge Name"),
       string(#"Invoice Item Description"),
       string(#"Invoice Item Quantity"),
       string(#"Invoice Item Service End Date"),
       string(#"Invoice Item Unit Price"),
       string(#"IsNewInvoiceItemTaxItem"),
       string(#"Invoice Item Tax Item Tax Code"),
       string(#"Invoice Item Tax Item Tax Mode"),
       string(#"Invoice Item Tax Date")
   )

These are very simple entities without much typing or constraints. In a real invoice validation csvpath statement you would probably use types, constraints, and qualifiers, and add rules outside the line(). But here we're only roughing out what is possible.

This looks good! It should work, right?

Well, not exactly. If we run this we will get the two invoices, but not the invoice items. Why is that?

The reason is we're missing a nocontrib qualifier. We are matching the two invoices just fine. But when we don't match on an invoice we are considering the line to be invalid. That means that we only get one invoice item per invoice -- the one that is on the same line as the parent invoice. But clearly that's not what we want.

Since we always have a first invoice item on an invoice's line, all we need to do is declare that there being an invoice, or not, doesn't determine if a line matches. I.e., the invoice entity doesn't contribute to the match. And we already saw how to do that with the nocontrib qualifier. We just add another to the invoice test on the first column like this:

    #0.nocontrib == "TRUE" -> line.invoice(

Now when you run the csvpath, you get all the invoices and all their invoice items in the matched lines collected. You have validated your file, and as a bonus, removed the extraneous empty lines at the top.

Easy, simple, and if you're a validation geek like me, really cool looking! Have fun with this CsvPath Validation Language schema stuff. If you get stuck, leave a comment and I'll be happy to help out.

DEV Community

A real-world example of CsvPath schemas

The invoice entity

Top comments (0)