Translating Between CSV Schema Languages

#data #dataengineering #tooling #programming

As we continue to compare CsvPath Validation Language to SQL and other validation languages, we want to give a nod to CSV Schema. CSV Schema is a draft standard that has been used in national archives. We don't know how popular it is, but it doesn't appear to have become mainstream. Nevertheless, it is interesting and definitely a fellow-traveler.

This is a rough translation of a UK National Archives file metadata definition from CSV Schema 1.0 to CsvPath Validation Language. The original is at: https://github.com/digital-preservation/csv-schema/blob/master/example-schemas/metadata_v11_WA12B000.csvs. The original individual rules are in the csvpath as comments. (CsvPath comments are marked with ~s).

While we're focusing on CSV validation here, don't forget that CsvPath Validation Language works with Excel and other tabular data formats too.

The Translation

~
There are two physical file operations CsvPath cannot do out-of-the-box.
CsvPath doesn't allow csvpaths to operate on files; however, in principle a _
_custom function can do that. We'll experiment with that idea. Custom functions
have been available in CsvPath almost from the beginning for exactly this type
of requirement.

CsvPath Validation Language will not be natively supporting file operations. That
would pose security issues and raise performance questions when used with the
cloud backends. If there were demand, we might release an optional custom function.

It seems the CSV Schema marks files invalid if the schema doesn't pass. We can
do that using validation-mode to fail on error. fail marks the file invalid. It
does not stop processing of the file, though; that will continue line-by-line.

The CSV Schema tool probably doesn't collect lines the way CsvPath does. We'll
collect the unmatched lines, because those are the ones with errors.

The way the schema was written, it is an ordered set of rules that are ANDed,
with a true value meaning the line is good, and one bad line meaning the file is
invalid. We have written our csvpath to match that strategy and used
validation-mode and return-mode to do exactly the same.

id: translating from CSV Schema 1.0 to CsvPath Validation Language
validation-mode:print, fail
return-mode:no-matches
~
~ Our scanning instruction will skip the first line of headers. The csvpath still uses the headers, but it won't consider them to be data, which is perfect. ~
$[1*][

~ -------------- ~
~ DOCUMENT ~
~ -------------- ~

~ @totalColumns 15 ~
count_headers() == 15

~ -------------------- ~
~ HEADERS SCHEMA ~
~ -------------------- ~

~ Note that CSV Schema mixes rules and structure together;
whereas, CsvPath Validation Language keeps them separate.
Neither approach is right or wrong. We believe our way is
easier to understand. ~

line(
      url(#identifier),
      string(file_name, none(), 1),
      string.notnone(#file_name_language),
      string(#file_name_translation),
      string(#file_name_translation_language),
      string(#description),
      string.notnone(#folder),
      datetime(#date_last_modified),
      string(#checksum),
      string.notnone(#rights_copyright),
      string(#related_material),
      string(#language),
      datetime(#start_date),
      string.notnone(#legal_status),
      string.notnone(#held_by)
)

~ ---------------- ~
~ VALUES RULES ~
~ ---------------- ~

~ Csvpaths don't do file operations. A custom function could do it.~
~identifier: uri fileExists unique if($folder/is("folder"),ends("/")) ~
#folder.nocontrib == "folder" -> endswith(#identifier, "/")

~file_name_language: is("Welsh") or is("English")~
in( #file_name_language, "Welsh|English" )

~ using in() again but slightly differently, just for fun ~
~file_name_translation_language: is("Welsh") or is("English") or is("")~
in(#file_name_translation_language, "Welsh", "English", "")

~we could use in() again here, a 3rd time, but just for variety we'll use or() ~
~folder: is("folder") or is("file")~
or(
#folder == "folder",
#folder == "file"
)

~ Csvpaths don't do file operations. A custom function could do it. ~
~checksum: if($folder/is("file"),checksum(file($identifier),"SHA-256"),is("")) ~
#folder.nocontrib == "file" -> #checksum

~rights_copyright: is("Crown Copyright")~
#rights_copyright == "Crown Copyright"

~language: if($folder/is("file"),is("Welsh") or is("English") or is("English and Welsh"))~
#folder.nocontrib == "file" -> in(#language, "Welsh", "English", "English and Welsh")

~ start_date: if($folder/is("folder"),empty,xDateTime) @optional ~
#folder.nocontrib == "folder" -> or(empty(#start_date), datetime(#start_date))

~legal_status: is("Welsh Public Record")~
#legal_status == "Welsh Public Record"

~held_by: is("The National Archives, Kew")~
#held_by == "The National Archives, Kew"
]