David Kershaw

Posted on Nov 25

A Stranger In a New Town: CsvPath metadata fields

#metadata #dataengineering #csv #datascience

The horse half died by the time we came off the highland. That's why I don't name my horses. My boots had holes. My six was dry. Nothing in my pockets but metadata. What's a guy gotta do to get a drink in this town?

Metadata is the wild west.

A great example that goes beyond data is tags. A lot of you have probably noticed in AWS that tags are amazing. Amazingly hard to use well for any sizable project, let alone the enterprise. Same with the other cloud providers.

Metadata is supposed to drive everything, ideally from metadata catalogs. But just you try figuring out how to capture metadata across all your systems automatically so that it stays up to date and consistent. Then you sit back and think: OK, I got that, so now what should I capture, how should I use it, and how do I know I can trust it?

Sorry partner, I'm not here for that.

I'm laying out that prologue just to highlight that it's a deep subject and ornery to operationalize. You may not have gotten a lead line on CsvPath Framework's small contribution yet. Let's take a look. It's a narrow vista of a much larger landscape, but manageable and super useful.

Types of Metadata

There are three types of metadata in CsvPath Framework:

Framework generated
User configured
User defined

Framework generated metadata is everything collected in the act of:

Staging data in named-files
Loading csvpath statements in named-paths groups
Running named-files against named-paths groups

Each of these activities results in metadata that is, at a minimum, captured to JSON files. Most of it can also be sent to a relational database and/or observability platform. The JSON files are manifest.json for Framework mechanics and meta.json for runtime generated metadata. It's the latter that I'm going to focus on here.
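If you want to rustle a field out of a run's meta.json from a script, here is a minimal Python sketch. It deliberately assumes nothing about how the file is laid out and just walks the JSON looking for a key by name. The function names and the sample structure are made up for illustration; they are not part of CsvPath Framework.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def find_field(node: Any, name: str) -> Iterator[Any]:
    """Yield every value stored under the key `name`, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == name:
                yield value
            # Keep walking: the same key may appear deeper down.
            yield from find_field(value, name)
    elif isinstance(node, list):
        for item in node:
            yield from find_field(item, name)


def read_field(meta_path: str, name: str) -> list:
    """Load a JSON file (for example a run's meta.json) and collect a field."""
    data = json.loads(Path(meta_path).read_text(encoding="utf-8"))
    return list(find_field(data, name))


# Hypothetical structure, only to show the walk. A real meta.json will differ.
sample = {"run": {"metadata": {"author": "William Blake"}}}
print(list(find_field(sample, "author")))
```

Because the search is by key name only, it keeps working if the layout of the file shifts, at the cost of possibly returning more than one hit.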

User-configured metadata and user-defined metadata come from the csvpaths themselves. They are located in "external comments". An external comment is one that is above (or less commonly below) the body of a csvpath statement. In files with multiple csvpaths separated by ---- CSVPATH ----, external comments live between the csvpaths.
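To make the layout concrete, here is a rough Python sketch that splits a multi-csvpath file on the separator and grabs the comment sitting above each statement. It assumes comments are delimited by ~ characters, as in the examples in this post, and it only looks at the leading comment, not one placed below the body. This is an illustration of the file layout, not how the Framework itself reads files; the function name is hypothetical.

```python
import re

SEPARATOR = "---- CSVPATH ----"

# A leading "~ ... ~" block at the very top of a statement.
LEADING_COMMENT = re.compile(r"\A\s*~(.*?)~", re.DOTALL)


def external_comments(file_text: str) -> list[str]:
    """Return the leading external comment of each csvpath ('' if none)."""
    comments = []
    for statement in file_text.split(SEPARATOR):
        if not statement.strip():
            continue  # skip empty chunks, e.g. after a trailing separator
        match = LEADING_COMMENT.match(statement)
        comments.append(match.group(1).strip() if match else "")
    return comments


# The csvpath bodies here are placeholders; they are not parsed.
example = """~ description: first csvpath ~
$[*][ yes() ]
---- CSVPATH ----
~ description: second csvpath ~
$[*][ yes() ]
"""
print(external_comments(example))
```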

User-configured metadata are, primarily, the modes, along with some integration-specific fields known only by the integration that allows them. A mode is one of 11 settings that can be applied on a csvpath-by-csvpath basis. Modes do things like:

Determine how validation is handled
Set the logical operator used to combine match components
Switch on collecting of unmatched lines

And a bunch of other useful things. Modes are built into the Framework. They are helpful in understanding why you got the results you got.

User-defined metadata are fields that you, the csvpath writer, create to document your data, and potentially to trigger behavior in other systems. A user-defined field looks like a word with a colon after it. This is a metadata field:

description: this csvpath validates order files

The meaning is pretty clear: we're creating a description field and setting its value to "this csvpath validates order files".

CsvPath Framework Tags

The Framework doesn't use the word "tag" today. Neither does FlightPath Data. That's one reason for me to drop this post. Tags are super helpful, but since we call them user-defined metadata fields, a lot of words, and then don't talk about them much, they are probably under-used.

So I'll just call them tags.

When you create a tag, it is free text. Any one word followed by a colon creates a tag. The value of the tag runs until the next word-with-a-colon is seen. You can also stop a tag with just a stand-alone colon. That can be useful if you prefer to put your tags above a narrative description of the csvpath.

For example:

    ~ 
      copyright: © atesta analytics
      author: William Blake
      test-data: examples/schemas/example-one.csv
      : This csvpath shows how metadata is created alongside
      documentation in external comments. It is just a quick example.
    ~

Here we created two metadata fields, "copyright" and "author", as well as using a well-known instruction for FlightPath ("test-data") and adding some documentation.
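The tag rules are simple enough to sketch in a few lines of Python. This is only my reading of the rules as described in this post (a word plus a colon starts a tag, the value runs to the next word-with-a-colon, a stand-alone colon ends it), not the Framework's actual parser. The function name and the assumption that tag names are letters, digits, underscores, or hyphens are mine.

```python
import re

# Either "word:" (a tag name) or a stand-alone ":" (ends the current tag).
# Assumption: the token is preceded and followed by whitespace.
TOKEN = re.compile(r"(?<!\S)([A-Za-z][\w-]*)?:(?=\s|$)")


def parse_tags(comment: str) -> dict[str, str]:
    """Pull tag names and values out of an external comment."""
    text = comment.strip().strip("~")
    tokens = list(TOKEN.finditer(text))
    tags = {}
    for i, token in enumerate(tokens):
        name = token.group(1)
        if name is None:
            continue  # stand-alone colon: what follows is plain narrative
        # The value runs until the next token, or to the end of the comment.
        end = tokens[i + 1].start() if i + 1 < len(tokens) else len(text)
        tags[name] = " ".join(text[token.end():end].split())
    return tags


sample = """~
  copyright: © atesta analytics
  author: William Blake
  test-data: examples/schemas/example-one.csv
  : This csvpath shows how metadata is created alongside
  documentation in external comments. It is just a quick example.
~"""
print(parse_tags(sample))
```

Run against the comment above, this sketch picks up copyright, author, and test-data, and leaves the narrative after the stand-alone colon out of any tag.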

When we run our csvpath against a CSV we get something like this:

You can see the two fields we created. The test-data field for FlightPath is there. (Though in the run that created the screenshot we didn't use it.) Print mode was added by the Framework in the background. Any modes we used explicitly would also show here. And you can see the entire original comment for context.

Simple! And in many ways very similar to AWS or JIRA or any other system that offers tags-based organization.

Now, what do we do with these metadata field tags? Well, one obvious thing to do is to document our csvpaths and the data they validate and/or upgrade. This means capturing the world of a csvpath's run as:

Narrative docs
Framework generated metadata
User-defined tags

Some things in CsvPath Framework are clear enough at a technical level from metadata you don't have to define yourself. For example, you know what named-file and named-paths group were used in every run. But you don't know who the data belongs to. Even if you have an indication from the name of the named-file, or from the path within the named-file that gets you the data file bytes, you might not know for sure, and your downstream system almost certainly has a different viewpoint.

We can add some more tags:

Using FlightPath Data we see the metadata flow into a run's meta.json:

And now we have the opportunity to pull that run's metadata from the archive using FlightPath Server's API:

How you use this feature is of course up to you. While you can annotate your schemas and rules with inline comments, a good use for user-defined metadata fields is to say more about what each part of a schema means.

For example:

    ~
       User schema for the Wild West Order Management application.

       username: this username is controlled by the SSO
       firstname: a free optional field. middle names can go here if needed.
       family_name: not optional. we expect a single name, possibly hyphenated. we're not the system of record. the name should match SSO lastname.

       validation-mode: print, raise
    ~
    $[*][
       line.user.distinct(
          string.username.notnone(#0, 35, 8),
          string.firstname(#firstname, 40),
          string.lastname.notnone(#family_name, 55)
       )
    ]

Clearly there's a ton of documentation here for a tiny schema, as well as several hard constraints. We can easily find a way to convey basically everything about the data file that you'd want to know, and that's before looking at the Framework-defined metadata and runtime metrics data. That means, if we go all out on the metadata, we have a lot of choices to make.

The potential for the tag-o-sphere to become a mess is high, of course. The good news is that CsvPath Framework and FlightPath are not intended to be a metadata catalog. They can and should feed a catalog and/or stand ready to serve data based on metadata fields known by other systems. But you don't typically browse the archive as a metadata repository like you might with OpenMetadata, DataHub, or Secoda. CsvPath Framework is a producer system, not a consumer system. It will tell you what you need to know in great detail, but unlike those other systems, CsvPath doesn't offer you all the things you don't know.

Again, this is powerful stuff. You can safely ignore user-defined metadata if you choose. But as your data operations expand and mature, you have an awesome opportunity to add a huge amount of clarity for downstream users through producing the right metadata. And it's easy to do.

Not bad for a one-horse town.
