CsvPath. What? Why? How?
CsvPath Framework includes CsvPath Validation Language. To save scarce ASCII, let's call it CVL.
CVL is a document-oriented tabular data validation language like XSD, Schematron, DDL, and JSONQuery. You can use CVL to validate CSV, Excel, and JSONL files, as well as Pandas-type data frames in memory. The purpose of CVL within CsvPath Framework is to be the validation and upgrading capability within the larger data preboarding architecture.
CsvPath Validation Language is line-based. That means it validates streaming line-by-line, rather than looking at a complete set of data all at once. All of the other languages referenced above are full-set-at-a-time languages, not line-by-line. That means that, in principle, they trade memory size for speed. In practice the picture is more nuanced. DDL in particular typically has strategies that allow it to handle data larger than RAM. CVL being line-by-line also means it can do some things that the full data set languages cannot (easily) do and it cannot do some things (easily) that they can do. In both cases the runtimes they run within can compensate for limitations, at least to some degree.
Also worth pointing out, CVL and all of the others mentioned, are declarative languages. By and large, you don't program them to follow a logical algorithm determined by you. Rather you declare what should happen or what you are testing and let the runtime figure out how to give you the desired results.
CVL is both schema-based and rules-based. Its schema definition capabilities are most like SQL's DDL. Its rules capabilities most like Schematron and SQL's queries. CVL is an intentionally simplistic language, at least in grammatical terms. However, that is like saying the Chinese language is grammatically simple. Yes, but have you seen the characters?
In CVL's case, while the challenge isn't at the magnitude of tens of thousands of Chinese characters, the set of functions is still substantial. CVL has hundreds of functions that can be used to create rules that go far beyond schemas into business logic validation. Knowing which functions are the best ones to use for a particular purpose is perhaps the biggest challenge. And on top of that are a host of qualifiers, modes, and other power toys.
Why do we need a tabular data validation language? Because tabular data, CSV especially, is widespread, is way too flexible in ways that hurt productivity and add bottom-line risk, and is frequently under the control of external data partners, not internal staff. Moreover, the lowest cost point, by far, to find and fix data problems is as close the source as possible. Every step data takes inward from the enterprise's edge magnifies the blast radius of errors dramatically. And every time you need to rewind a process or replay a transformation, your cost and risk go up like skyrockets. 🧨
A Practical Question
So now we have a sense for what CsvPath Validation Language is and why it is important. A practical next question is what it is going to cost me in time and toil to learn to use it?
The answer is relative. If your goal is to validate the data types of a data frame now and then, CVL is probably too much trouble compared to hacking up a Python script, especially if you are already comfortable with Pandas or Polars. CVL was not made for one-off validations, although it can certainly do them well enough.
But if your goal is to run a data collection operation involving tens, hundreds, or thousands of data partners and document types, CVL and CsvPath Framework are vital. Compared to the challenges of:
- designing and building your own framework or
- handling every partnership as a one-off unique project,
- of managing the codebase to control drift and tech debt,
- of maintaining knowledge as team members change over time, or
- compensating for other languages' lack of applicability to the tabular data file problems you face, or
- of firefighting constantly with one arm tied behind your back,
- etc., etc.
Compared to all that, CVL is an absolute walk in the park.
We would argue that CVL is an easy language to use, easy to learn, capable of most situations it was built for, and only about as capable of helping you shoot yourself in the foot as SQL. Which is to say, it's not hard to blow your foot off, but it's quite doable to just not. And, and this is important, CsvPath Validation Language is built for automation, not ad hoc scripting. We expect that you'll treat CVL as code, use iterative dev and tests, create it methodically with an eye on simplicity and reuse, version it, etc. CVL will treat you as well as you treat it, like any other serious language.
So, after all that, is CVL easy to learn and use? It depends on your perspective.
A (Relatively) Simple Example
I'm writing this because I ran into some CVL that was quite straightforward, but also had a lot of small things to remember. That's the "it's easy if you know it" side of so many things.
Here's the scenario, it may be familiar to some of you.
We have a data partner in the retail space who sends us a CSV of orders on a weekly or monthly basis. We want to validate files as the enter the enterprise, before they are accepted as being basically good data. (Downstream our application may reject some orders, but its more exacting business rules are not pertinent to preboarding raw data as it arrives; we're not trying to reinvent the app, just protect it from crap).
Here is the sample data:
ID,date,time,store,address,city,state,zip,category,type,shelf,vendor,product name,UPC,SKU,unit,quantity,a price
03358993,03/21/2024,10:24:14,Bob's store,1 Lakeshore Drive, Chicago, IL, 33581,OFFICE,PAPER,1-5,Sams Paper,20lbs Ream,0301024855,,per each,8,20.99
03358994,03/21/2024,10:31:28,Fred's store,1 Lakeshore Drive, Chicago, IL, 33581,OPERA,WRITING,1-5,Biz Pen,10-pack Black,0541931855,0432950078,per each,2,4
03358995,03/21/2024,11:26:18,Mary's story, 1 Lakeshore Drive, Chicago, IL, 33581,FOOD,CANDY,7,Starbursts,Single,3583900656,0899920453,per each,1,1.29d
We want to catch several things here using rules. Creating a schema would be a different exercise than what I was doing. Here we want to catch:
- Bad prices
- Wrong categories
- Missing SKU (stock-keeping units)
- Missing UPC (universal product codes)
This is one of the exercises on https://www.csvpath.org, but simplified a bit.
The way I tackled these rules was to create this csvpath. In the real world, I probably would have chosen to create multiple csvpaths, 1 per rule, and run them as a named-paths group. But here I just whipped up this one-off csvpath. You can tell it's a one-off because it includes the path to a file in its root. I.e. ${FILE}.
~
validation-mode:no-raise, no-stop, print, no-fail, collect
logic-mode:OR
~
${FILE}[1*][
~ 1x wrong, 2nd item ~
@in = in( #category, "OFFICE|COMPUTING|FURNITURE|PRINT|FOOD|OTHER" )
not( @in.asbool ) ->
error.category( "Bad category $.headers.category at line $.csvpath.count_lines ", fail())
~ 2x wrong, 2nd and 3rd items ~
@price_format = exact( end(), /\\$?\\d*\\.\\d{{2}}/ )
not( @price_format.asbool ) ->
error.price("Bad price $.headers.'a price' at line $.csvpath.count_lines", fail())
~ 1x missing, 1st item ~
#SKU
not( #SKU ) ->
error.sku("No SKU at line $.csvpath.count_lines", fail())
~ always exists ~
#UPC
not( #UPC ) ->
error.upc("No UPC at line $.csvpath.count_lines", fail())
]
These are the things I found myself jotting down as I went. They are not all-new or unfamiliar. At first I just wanted to remember the changes I was making. Then I started thinking about what a new CVL user would have to know.
THINGS TO PAY ATTENTION TO:
* using OR logic
* the importance of validation-mode
* scanning instructions and headers
* variable existence tests and the asbool qualifier
* side-effects vs. match determiners
* assignments don't determine matches
* header existence tests
* add names to errors, at least here (how id_chains work)
* what is fail()?
* use " on headers in body, use ' on headers in print() and error()
10 things
Yeah, it's a lot. But does that make CVL hard? Is the sum of the parts powerful? Again, depends on your point of view. I'd say not super hard, but it takes practice, same as any language. And, yes, powerful, if you need it; otherwise, it could be overkill, as I said above.
Let's go through the items.
Using OR logic
The example uses logic-mode to switch from the default ANDing to ORing. Modes are ways of configuring a single csvpath to behave in a certain way without changing your CsvPath Framework project's global config. You set modes in comments at the top of a csvpath. Comments are demarcated with ~ characters.
AND is the default. That means that within the "matching part" of a csvpath each declared unit, called a match component, is ANDed together to determine if a line matches the csvpath statement. (Read about AND and OR here). That means all match components must evaluate to true (in Python terms True) for the line to match.
In this case, we're going to use OR by setting logic-mode: OR. The decision is somewhat arbitrary for the exercise. We could have used AND with only small changes to the csvpath, but the starting point for my work on this example was already OR so I left it.
OR means that each match component has the opportunity to determine if a match. In AND all must be true. In OR all it takes is one true for the line to match. This means we're using an alternate validation strategy. In OR we're saying "good looks like one of these things"; whereas, in AND we're saying "good requires all of these things".
The importance of validation-mode
Another mode, validation-mode, determines what happens when we hit an error. validation-mode controls the action that happens when the Framework runs into a validation or language error. A validation error is a data error. It may be your own custom rule or it may be built-in. A language error is when you try to do something that is syntactically wrong or uses a match component incorrectly.
validation-mode has the following options when an error happens:
- Raise an exception, resulting in the run terminating noisily
- Print the error to the default printstream (or you can choose another printstream if the error is from your own custom rule)
- Stop the run cleanly
- Fail the run, but keep going
- Collect the error for later programmatic review and/or in a JSON file
You can negate these options by adding no- as a prefix to the option. E.g. no-raise. And you can select multiple options at once by separating them with a comma. E.g.
validation-mode:no-raise, print, fail
The most important things to remember about validation-mode is that raise will halt runs and print stack traces; whereas, no-raise will keep going and errors may be harder to detect. no-raise + no-print is generally a bad option, at least in dev, because you will have trouble telling what happened when you hit an error. The second most important thing is that if print is turned off project wide, you lose a lot of information, and in that case using validation mode to turn print on is often a good idea. However, printing is expensive. When you do a lot of printing the run slows down considerably, so if you expect a lot of errors you may want to just collect them silently and review that JSON after the run, rather than printing a ton of lines that may slow a large file down too much.
Scanning instructions and headers
This is an easy one. The "scanning part" of a csvpath is the first set of brackets. Often it looks like: [*]. Scanning instructions can pick out certain lines to validate and skip over other lines. One of the most common use cases is to use [1*] to say that we want to scan from line 1 to the end of the file. This means the 2nd line to the end, because lines numbers are 0-based. What this does is skip over the header line. The header line is, initially, always line 0. You can reset the headers any time, but that first line is the typical location.
If you know you have a header line you probably don't want to validate it. Let's parse that statement. You may want to validate that the headers are, say, in a certain order, all-caps, from a certain list of names, etc. But you don't have to do that one the 0th line. You can declare those rules and they will take affect, just as long as any one or more lines are considered for matching. At the same time, while your header rules won't trouble other data lines, the reverse may not be true. If you expect the item count header to have an integer that you want to sum, you probably don't want to see an error on line 0 because item count is not an integer.
Variable existence tests vs the asbool qualifier
When you have a variable all by its self it is a test for the existence of the variable. Variables are words with an @ in front, @firstname. A variable is created the first time it is used. But saying @firstname by itself doesn't create the variable, rather it returns true if @firstname exists, otherwise false.
This can be confusing if you have a variable with a boolean value where you want the variable to vote on the line matching. We have that in the example:
not( @price_format.asbool )
If the price format is False we want to do something. But, if we just do:
@price_format
We aren't voting to match based on the correctness of the price format. We're instead voting to match based on the existence of the price format variable. In order to vote with the value of the variable we have you use the asbool qualifier. A qualifier modifies how a match component behaves. In this case we're saying "pay attention to my value, not to if I exist".
Side-effects vs. match determiners
At a high level, a match component can do one of three things:
- Determine if lines match (a.k.a. vote on matching)
- Return a calculated value
- Do some other thing that doesn't provide a vote or a value
The latter is called a side-effect. An obvious example is print(). You can print without voting on a line matching or not. And print() doesn't return a value that can be used in an assignment or existence test.
An assignment is a side-effect. For e.g.
@in = in( #category, "OFFICE|COMPUTING|FURNITURE|PRINT|FOOD|OTHER" )
This match component is an assignment of the value produced by in() to the variable @in. You might think that either or both sides of a when/do expression would be the same. A when/do is an expression where a left-side match component is separated from a right-side match component by a -> operator. When/do is CVL's if/then statement. This guess would be reasonable, but wrong. A when/do operation allows both sides to vote. That means that:
not( @price_format.asbool ) ->
error.price("Bad price $.headers.'a price' at line $.csvpath.count_lines", fail())
The left-hand side is voting to match or not based on the boolean value of the @price_format variable. At the same time, error.price() is not voting because an error() is a side-effect. To be more specific, not() both generates a value and also votes on matches; error() does neither.
With hundreds of functions, CVL demands a quick function lookup with detailed documentation. Luckily you get that in CsvPath Framework's CLI (just do poetry run cli) and/or look in FlightPath Data in the language helper window that opens on the right when you are editing a csvpath file.
Assignments don't determine matches
We've just covered this point-to-remember above, but it bears repeating: an @x = "true" expression does not vote on if a line matches. In this case we only set @x to the value "true", nothing more. We can subsequently use @x to vote using either its existence:
@x
or its value:
@x.asbool
To (hopefully!) be over-clear, this csvpath:
@x = "true" @x @x.asbool
has 3 match components and will always match every line because not only does @x exist, but @x is also true, which means @x.asbool evaluates to a Python True under the hood based on the "true" value of @x.
Header existence tests
Similar to the point I just made above. A header existence test is like #SKU. That means the #SKU header evaluates to true when a line has a value for SKU; otherwise, false. In the example, we aren't checking what the value is, we only care that there is a value. If we needed the value of #SKU to be treated as a boolean we would again use the asbool qualifier, but in this case we do not need that.
The example uses OR logic and we want to collect all the lines, valid or not, so I have both the rule predicate not(#SKU) and the existence test #SKU. The first determines if we get a validation error message and collects lines that are missing SKUs, and the second collects the line if there is a SKU, so that we get all the lines.
An improvement to the example might be to remove the two match components #SKU and #UPC and instead add yes(). yes() always votes to match. In OR we only need one positive vote, so yes() collects every line. The reason this would be an improvement is that the single yes() replaces two match components, and fewer is generally better. And, because, as written, we collect all lines because of the SKU and UPC match components, but we don't follow the same pattern with price and category. It would be nice to use the same pattern for all four rules, even if either the SKU or UPC match components collect all the lines for everyone. Using a yes() separates the line collection from the validation rules and makes all the rules behave the same, while at the same time simplifying the csvpath a little.
Add names to errors
It's important to remember that print() and error() are essentially the same thing, with one super important difference error() creates error events. Both functions print messages to printstreams, optionally using print references to include potentially detailed metadata. But only error() throws off JSON error events that are captured to the errors.json and available programmatically. And the error messages you create using error() are printed in error-formatted lines.
An error-formatted line is like log output that is formatted using a log template. You can set the error template in config.ini or the config panel in FlightPath Data. Your config can choose to print "bare" errors or "full". A bare error doesn't use the error template at all; instead, the message is the whole output to the printstream. The default error message template is:
{time}:{file}:{line}:{paths}:{instance}:{chain}: {message}
Now comes the point. If you add names to your error() functions, like
error.price("Bad price $.headers.price at line $.csvpath.count_lines")
You get a full-format error message that has a clear ID chain. (Assuming your template includes the {chain}. What does this mean?
An ID chain is the path-like identifier of each match component. Each match component in your csvpaths is, potentially, composed of other match components in a tree structure. Each top-level match component is considered an expression. At least the structure is held in an Expression object, even if we rarely speak of it. Within each expression the name-by-name path, parent to child, leading to a specific match component is its ID chain. In an error line the prefix before the message might look like this:
2025-12-14 04h18m59s-936643:March-2024.csv:11:::category
Here category is what we declared as:
not( @in.asbool ) ->
error.category( "Bad category $.headers.category at line $.csvpath.count_lines ", fail())
However, this simple example doesn't show the power of the ID chain. Let's make a contrived example that helps better show it. If we change the error category match component to nest another child match component, while removing the category name on the error(), like this:
not( @in.asbool ) ->
error( "Bad category $.headers.category at line $.csvpath.count_lines ", error("just testing", fail()))
The error message we get from the nested error is this:
2025-12-14 04h20m04s-975094:March-2024.csv:11:::error[1].error[1]
That says that the first error child of an error raised an error event. We can see what that means because it's a simple example. But since ID chains are always scoped to the expression, not the csvpath statement as a whole, what happens when we have very similar error checks for SKUs and UPCs? error[1].error[1] doesn't feel so informative. And these rules are simple; in more complex situations ID chains are both more important and, potentially, harder to read.
Just adding the names to the functions makes all the difference to interpreting where the error occurred. Let's change the expression to:
not( @in.asbool ) ->
error.category( "Bad category $.headers.category at line $.csvpath.count_lines ", error.testing-idchains("just testing", fail()))
The error message of the 2nd error() becomes:
2025-12-14 04h33m20s-635277:March-2024.csv:11:::category.testing-idchains
That's much more readable! Now our UPC, SKU, price, and category errors are much clearer, making error investigation just a bit less frustrating and a bit more quick.
What is fail()?
There are many approaches to validation using any competent validation language in any non-trivial use case. CsvPath Validation Language is no exception, and in fact, due to the quirks of CSV and tabular data file processing CVL has more options than most. Among these, the big ones are:
- Collect matching lines as the valid data
- Collect matching lines as the invalid data
- Print or
error()to indicate errors, without collecting data or making a ruling on the file as a whole - Pass or file a data file as a whole based on what rules, schemas, or built-in validations passed or failed
The last option, marking the data file invalid based on rules and/or schemas, relies on the fail() function and the settings for error handling in config.ini and validation-mode to set the run's is_valid field. is_valid is visible and/or accessible in:
- Run metadata JSON
- Print and error function references (e.g.
$.csvpath.valid) - The
valid()andfailed()functions - Programmatically on the CsvPath and Result objects
Depending on your config and validation-mode setting, if any, built-in errors can automatically fail a run. In some cases, though, you may want to be more deliberate in failing your runs based on certain rules, but allowing others to just be informational. For instance, you might want to flag a date format error, but then simply upgrade the data, rather than failing the run for a very correctable problem. At the same time, if the date was empty but expected, you might want to use fail() mark the run and data file invalid because you found an uncorrectable, unignorable problem.
Use " on headers in body, use ' on headers in print() and error()
And finally, an easy one! We use #"my header name with spaces" to refer to a header named my header name with spaces. At least, that's how we do it in the world of header match components. Ah, but the world of print() and error() messages is a bit different. You see the problem, right? A print or error message is wrapped in quotes, so it's not possible for a header reference to use quotes.
Instead, within the print or error message we simply use single quotes for header names that need to be quoted. This really isn't a big ask. Here is our example:
#"a price"
not( @price_format.asbool ) ->
error.price("Bad price $.headers.'a price' at line $.csvpath.count_lines", fail())
Here we are testing if the #"a price" header exists; presumably to make sure we match lines with prices. Then we're raising an error if the @price_format variable is False. In the error() message we refer to the same header as $.headers.'a price'. Not too much to ask. But it is one more one-little-thing to remember.
So, What Is CsvPath Validation Language: Easy? Hard?
I hope this post helps call out some of the knowledge that can help you create effective validation and upgrading scripts. There is, of course, much more to unpack. But the real question here isn't how much or how little, how hard or how easy. The real question is, is it worth it? And the test of that for many people is: does the solution make easy stuff easy and hard stuff possible?
By that measure, CsvPath Validation Language and the whole CsvPath Framework scores well.
Top comments (0)