The world of data is messy in just about every dimension. One of those dimensions is how we talk about data validation.
This post tackles the definitions of a few terms relating to validation. More often than not we skate right past these concepts without ever defining them. Broad brush, we're talking about levels of data acceptance: how good do we feel about an item of data or a data set?
The world of data isn't just black and white. Garbage-in, garbage-out is definitely a thing. But one person's trash may be another person's treasure. And your grass-fed raw data may need to be cooked before I can eat it.
The terms in question are well-formed, valid, canonical, and correct. I'm listing them in order from least specified to most, from an end data consumer's perspective. Why are they important? Generally because data goes through stages, from acquisition to preboarding to ETL to enrichment and mastering to production end uses. If we can't speak about the quality and value of the data in acceptance terms at each step, how could we know when to progress it to the next stage?
This progression is a practical matter for tool builders as well. How does MFT (managed file transfer) know when to progress data from arrival to preboarding? How does preboarding progress to onboarding and workflow tools? What are the stages of the medallion data lake, specifically?
Above all, do we know what good looks like? Are we moving too fast? How do we know when we're done? Are we there yet?
Well-formedness
Data that is well-formed first and foremost matches a physical specification and, secondly, has the correct "outline" to be an item of data of the form expected. The specifications are standards like:
- RFC 4180 for CSV
- The W3C XML 1.0 recommendation
- RFC 8259 for JSON
Well-formedness also relies on lower-level definitions such as Unicode and byte ordering. Without detailed agreement on what constitutes minimally viable raw data, the world quickly breaks down through an inability to communicate.
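To make that concrete, here is a minimal sketch of a well-formedness check for a CSV file, using only the Python standard library and no particular tool: can we decode the bytes, and can the parser walk every row?

```python
import csv

def is_well_formed_csv(path: str, encoding: str = "utf-8") -> bool:
    """Rough well-formedness check: the bytes decode and the CSV parser
    can walk every row without raising. Says nothing about validity."""
    try:
        with open(path, newline="", encoding=encoding) as f:
            for _ in csv.reader(f):
                pass
        return True
    except (UnicodeDecodeError, csv.Error, OSError):
        return False
```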
Valid
The next level up from well-formed is validity. Validity is a more robust stage, in that if data is valid, it is probably useful for something.
A file is valid when its data has been checked against a definition of what good data looks like. Data can be validated using rules or models. Well-known examples include:
- XSD
- Schematron
- DDL (apologies for the paywall; google can find you more references)
- X12 (ditto!)
- JSON Schema
And my favorite, wait for it... CsvPath Validation Language, which I'll come back to below.
I love all these specs, despite their dryness. Each has its own strengths and coolnesses. An XSD is primarily a model. A Schematron file is principally rules. In fact, a model is a shorthand, generalized way of writing rules, and, in this context, a set of rules is just a way of classifying data. But in practice it's simple: an item of data that doesn't match its schema is considered invalid.
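As a small illustration of model-style validation, here is a sketch using the Python jsonschema package, an implementation of one of the specs listed above. The schema and the record are invented for the example.

```python
from jsonschema import validate, ValidationError  # pip install jsonschema

# A tiny, made-up model of what a "good" company record looks like.
company_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "minLength": 1},
        "industry": {"type": "string"},
    },
    "required": ["name", "industry"],
}

record = {"name": "IBM", "industry": "Information Technology"}

try:
    validate(instance=record, schema=company_schema)
    print("valid")
except ValidationError as e:
    print(f"invalid: {e.message}")
```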
Canonical
A canonical form is the form that is preferred over other possible forms of the same data. A simple example is the name IBM. It might appear as IBM, I.B.M., or International Business Machines, with IBM as the canonical form. If we are canonicalizing data using that mapping and we see I.B.M., we substitute the canonical form, IBM. Note that if there are multiple accepted forms, the canonical form is any one of them, given the right time, place, and/or bounded context. Canonicalization is closely related to data mastering.
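A canonicalization step can be as simple as a lookup table. Here is a minimal sketch along the lines of the IBM example; the variant list is illustrative, not exhaustive.

```python
# A minimal canonicalization sketch: map known variants of a name to the
# preferred (canonical) form and leave everything else untouched.
CANONICAL_NAMES = {
    "I.B.M.": "IBM",
    "International Business Machines": "IBM",
}

def canonicalize(name: str) -> str:
    cleaned = name.strip()
    return CANONICAL_NAMES.get(cleaned, cleaned)

assert canonicalize("I.B.M.") == "IBM"
assert canonicalize("IBM") == "IBM"
```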
Correct
Correct data is more than well-formed + valid + canonicalized. Correct means that the semantic and business rule content of the data meets expectations. For example, imagine a CSV file that includes a list of companies. Each company has an area of commercial activity. We see that:
- The file is readable as a CSV file, so it is well-formed
- The file has values under all headers in all rows, so for our purposes we'll call it valid
- The company name I.B.M. has been canonicalized to IBM, so we'll say that the data is in a canonical form
- And the company listed as IBM is described as being in the business of Sunflower Farming
Because the last bullet is sketchy intelligence (we don't think IBM grows sunflowers, but maybe?), we'll say that this data is incorrect. Ultimately correctness is the most important consideration. But if the lower acceptance layers are good to go, the effort required to make the data actually correct may well be worth spending. Or maybe IBM should start growing sunflowers. Actually, both things can be true.
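Correctness checks are just business rules applied to the content. A rough sketch of the example above might look like this; the headers and the expected-industry table are made up for illustration.

```python
import csv

# Hypothetical business rule for the example above: if we know which
# industries a company should be in, flag rows that disagree.
EXPECTED_INDUSTRIES = {
    "IBM": {"Information Technology", "Consulting"},
}

def find_incorrect_rows(path: str) -> list[dict]:
    """Return rows whose industry contradicts what we expect."""
    suspect = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            expected = EXPECTED_INDUSTRIES.get(row["company"])
            if expected and row["industry"] not in expected:
                suspect.append(row)
    return suspect
```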
Where CsvPath can help
My personal interest, today, is in preboarding data. Preboarding is the first half of data onboarding; it is a more specific way of talking about data ingestion. In the past I worked more with XML, EDI, and JSON files, but today I'm focused on solving for the messiness of the CSV and Excel world.
Historically, CSV files have not had a validation language in widespread use. CsvPath Validation Language is a new language that can help with validity, canonicalization, and correctness checking. It uses both rules and schemas and has a lot to offer.
All things have their place. CsvPath Validation cannot help you with well-formedness checking. CsvPath treats any file that Python can read as CSV or Excel as being well-formed. By contrast, most of the so-called CSV validators on the Internet are simple well-formedness checkers. A smaller number of them can check a structural definition of a file's headers. Vanishingly few go beyond that.
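For contrast, here is roughly the kind of check those basic validators stop at: does the header row match an expected structural definition? The header list here is, again, hypothetical.

```python
import csv

EXPECTED_HEADERS = ["company", "industry"]  # hypothetical header spec

def headers_match(path: str) -> bool:
    """The structural check most online CSV validators stop at:
    does the first row equal the expected header list?"""
    with open(path, newline="", encoding="utf-8") as f:
        first_row = next(csv.reader(f), [])
    return first_row == EXPECTED_HEADERS
```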
In other posts I'll talk more about CsvPath Validation Language and how it makes the world a sun-shinier and more beautiful place!