CSV Schema Validation

David Ayres

Intro

The humble CSV file; I'm not going to cover it in detail here. If you don't know what a CSV file is and were hoping this document would help, I'm more than happy to signpost you to the Wikipedia page.

So what's this document actually about then? Well, no matter how much we fight it, CSV is a heavily used file format for sharing large amounts of data between integration platforms, and it's especially prevalent in the always difficult task of integrating distributed third-party systems.

I didn't set out on my career with a strong desire to be an Integration Architect, but like most Solution Architects I have to wear many hats. So I find myself dealing with a lot of files moving between a lot of systems, and CSV is a format I have to handle, no matter how much I fight it.

So the good?

CSV is a simple enough format to define: I've got X columns and a delimiter to indicate how to split those columns. Then there are the rows, the many, many rows that make up the file. It's quick and easy for a no-code integration platform to define, and consuming it within a similar platform isn't too difficult either; it mostly becomes configuration. As long as the columns are consistent on every row, you're good to go and can consume the file easily enough.

You can also zip these files and dramatically reduce the file size, which makes them much easier to send and receive; compression is very efficient when it comes to CSV.

The bad?

Trust. The provider of the file has to be very strict about how they generate it. Every row always has to have every column; if a column is missing, that row can't be imported and has to be rejected. Also, when generating the file, that pesky delimiter has to be suitably escaped. If it isn't, it'll create additional columns, which will break the file consumer.

For example:

ID,Title,Cost
1,Book 1,4.99
2,"Book, Book 2",5.99
3,Book, Book 3,6.99

Row 3 will have 4 columns if you split on the delimiter: ",". Row 2 should pass because the delimiter sits within speech marks, which should tell consumers to treat the content between them as one column. One hopes that's the case anyway, and we also hope that any speech marks within the data are themselves suitably escaped...
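
For example, the usual convention, which most parsers follow, is to double up any speech marks that appear inside a quoted field:

ID,Title,Cost
4,"The ""Special, Signed"" Edition",7.99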

So it's a pretty fragile file format that's easily broken.

Also, a CSV file isn't able to hold any sort of metadata. This means any CSV interface has to have a document shared by the owner so that the data held within can be understood. It's another level of trust between the provider and consumer. More often than not, fields get treated as strings by the consumer because that's less prone to breaking. That's not ideal.

Then there's the complete lack of any sort of schema to validate the data against. The consumer has to understand the interface document (there's no standard for this at all) and potentially implement custom validation and logic. If you don't have a no-code integration platform and are writing the code by hand, it can be time-consuming to repeat the same effort for each CSV file you consume.

CSV Schemas?

You might have looked into a CSV schema to make consuming these files easier, which is probably how you've stumbled across this article. XML has XSD and JSON has JSON Schema, so why doesn't CSV have anything!? Below is what I've found, along with my opinion of each.

CSV Schema

An attempt was made to standardise a CSV Schema approach. An unofficial draft was published in 2016 but was never formally adopted.

CSV On The Web

CSV on the Web is a fairly new attempt to standardise how a CSV schema should be documented. It's been recommended by the UK Government Digital Service but hasn't gained the traction it might need. The biggest issue is that nobody has pioneered any .NET libraries to implement validation for it. I tried, but writing my own library became a bigger job than I was hoping for. It definitely shows promise, and if it matures and gets more industry adoption, it would be a winner.

Ultimately, CSV is the forgotten file format of the digital age and doesn't get as much love or attention as it deserves.

My Use Case

As I've mentioned, I deal with a lot of CSV integrations and I don't have the privilege of a no-code integration platform, so I wanted a schema for my CSV files: something to validate against when I'm generating files and to use when consuming them. A quick schema check ahead of publishing and consuming data helps the end-to-end integration process and improves data quality across the journey. There have been too many issues where bad files have been shared, triggering support tickets and ultimately costing somebody time to investigate and debug.

I've managed to cobble together an approach to CSV validation that's proven to be fast and scalable, and it has handled even the weirdest file content I have to deal with on a daily basis. Hopefully it's useful to somebody!

CSV Schema Validation Tool

It's rare these days to have the pleasure of writing your own unique piece of software from scratch, but validating a CSV file in .NET, with no packages I could just download and use, felt like pioneering something, so I rolled up my sleeves and got stuck in!

Step 1; JSON Schema notation.

If you haven't been lucky enough to work with JSON Schema, then check it out here. It's mature, well supported, feature rich and a staple when defining JSON file formats and API responses. Why not make this work for a CSV?

This was the easy part. A CSV header doesn't take a huge amount of work to express as a JSON Schema. Without too much effort I was able to define a schema with all the wonderfulness of variable types, lengths, regex patterns (who doesn't love a regex!?), enums and, even better, some of the built-in JSON Schema formats like email. Excellent.

To support a CSV file, I've introduced a couple of custom values that need to be added to the JSON Schema. All CSV files have a delimiter, so that's a mandatory field. Also (frustratingly) a CSV file doesn't have to have a header row, so that's a second mandatory field that needs to be added to the schema.

{
  "$id": "https://example.com/person.schema.json",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Person",
  "type": "object",
  "csvDelimiter": ",",
  "csvHasHeader": "true",
  "properties": { }
}
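
For illustration, here's what a filled-out schema for the earlier book example might look like; the property names and constraints are made up for the sketch rather than taken from a real interface:

{
  "$id": "https://example.com/book.schema.json",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Book",
  "type": "object",
  "csvDelimiter": ",",
  "csvHasHeader": "true",
  "properties": {
    "ID": { "type": "integer" },
    "Title": { "type": "string", "maxLength": 100 },
    "Cost": { "type": "number", "minimum": 0 }
  },
  "required": [ "ID", "Title", "Cost" ]
}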

Step 2; CSV to JSON.

Now, as I've mentioned above, I work with the Microsoft stack, so we can start delving into some code.

The first thing to do is read in the JSON Schema file. To understand how to read the CSV file, we need that header and delimiter metadata. After some research, I settled on JsonSchema.net to read and parse my schema files. Once the file is read, there are some validation checks to make sure the delimiter and header fields are present. If they aren't, we reject the file, as that's mandatory metadata.
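
One way to sketch that check (the file name and error handling here are just placeholders) is to pull the custom metadata out with plain System.Text.Json before handing the rest of the schema to the schema library:

using System.IO;
using System.Text.Json;

// Read the raw schema file so we can check for the custom CSV metadata.
var schemaText = File.ReadAllText("person.schema.json");

using var schemaDoc = JsonDocument.Parse(schemaText);
var root = schemaDoc.RootElement;

// Both custom fields are mandatory; reject the schema if either is missing.
if (!root.TryGetProperty("csvDelimiter", out var delimiterProperty) ||
    !root.TryGetProperty("csvHasHeader", out var hasHeaderProperty))
{
    throw new InvalidDataException("Schema must define csvDelimiter and csvHasHeader.");
}

var delimiter = delimiterProperty.GetString();
var hasHeader = bool.Parse(hasHeaderProperty.GetString()!);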

The second thing to do is consume the CSV into the application, and for that I've advocated for years for the NuGet package CSV Helper. It's been around for a very long time and for good reason. It'll read a CSV file very quickly and comes with a "dynamic" type, so it's simple enough to generically read a CSV file into a collection. During the read process, we pass in the delimiter value and whether or not the file has a header, and it does all the hard work for us!
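
A minimal sketch of that read, reusing the delimiter and hasHeader values extracted from the schema above (the CSV file name is just a placeholder):

using System.Globalization;
using System.IO;
using System.Linq;
using CsvHelper;
using CsvHelper.Configuration;

// Configure CSV Helper with the metadata taken from the schema.
var csvConfig = new CsvConfiguration(CultureInfo.InvariantCulture)
{
    Delimiter = delimiter,
    HasHeaderRecord = hasHeader
};

using var reader = new StreamReader("books.csv");
using var csv = new CsvReader(reader, csvConfig);

// Dynamic records let us read any CSV shape without a predefined class.
var rows = csv.GetRecords<dynamic>().ToList();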

One thing I love about CSV Helper is that it handles all the special characters for you, even a carriage return within double quotes:

firstName,lastName,age
John,Fish,5
David,"Cr
ab",22

So the above will still produce 2 rows of data with 3 columns, but the lastName for David would be:

Cr
ab

Which looks strange but is exactly what we are expecting.

It's slowly coming together nicely...

The last piece of the puzzle is to convert the CSV into JSON. This is the slightly less elegant part of the solution.

We loop through the CSV collection, building up a dictionary of <string, object> for each row. The string part is the column name, which we extract from the validated schema file. We simply take each column in the CSV row and pull out the positional field name from the schema.

Of everything, this feels a little "hacky", but there's no other way to associate the column names with the CSV. As column positioning is critical in a CSV, this approach simply takes advantage of that. If the schema column order doesn't match the CSV column order, everything will fall over and throw validation errors, but I deemed that acceptable given how CSV behaves.

The second thing we do here is make sure that the data from each CSV column is parsed and stored in the dictionary as the correct type. JsonSchema.net has built-in enums for schema value types, so we take the field type from the schema and parse the value from the CSV accordingly. Now we've got a nicely formatted dictionary! If we didn't parse the data properly, the schema validation would fail.
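
Roughly, the per-row mapping looks something like this; the helper names and the way the row's raw values are passed in are assumptions for the sketch rather than the exact code in the library:

using System.Collections.Generic;
using System.Globalization;

// Map one CSV row's fields to schema property names by position, converting each
// value to the JSON type declared in the schema. 'columnNames' and 'columnTypes'
// are assumed to come from the schema's "properties" in declared order; 'fields'
// is the row's raw string values from the CSV reader.
static Dictionary<string, object?> MapRow(
    IReadOnlyList<string?> fields,
    IReadOnlyList<string> columnNames,
    IReadOnlyList<string> columnTypes)
{
    var row = new Dictionary<string, object?>();

    for (var i = 0; i < columnNames.Count; i++)
    {
        var raw = i < fields.Count ? fields[i] : null;
        row[columnNames[i]] = ConvertValue(raw, columnTypes[i]);
    }

    return row;
}

static object? ConvertValue(string? raw, string schemaType)
{
    if (string.IsNullOrEmpty(raw)) return null;

    return schemaType switch
    {
        "integer" => (object)long.Parse(raw, CultureInfo.InvariantCulture),
        "number"  => decimal.Parse(raw, CultureInfo.InvariantCulture),
        "boolean" => bool.Parse(raw),
        _ => raw // strings (and anything unrecognised) pass through untouched
    };
}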

The next step in transforming the CSV to JSON is to take the dictionary and pass it through the System.Text.Json JsonSerializer so it becomes a JSON-friendly string, then parse that into a JsonDocument using JsonDocument.ParseAsync so the code can treat it as a valid JSON document. Remember, we are doing this row by row, so each row is treated like its own individual JSON document. The reason for this is that it gives us a line-by-line schema validation result, so consumers can be directed to specific faults. It also means the JSON Schema file is written as if it describes one simple row, which keeps the notation simpler and easier to understand. This could be something to revisit for a V2.

The final step is to pass that JSON document into the schema validator library, and out come some results! That's the easy part. We check through the results and add any errors to a general error object that can be used by a system calling this library, and hey presto: we can use JSON Schema to validate a CSV file!
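
Put together, the serialise-and-validate step for a single row looks roughly like this. The article's code parses into a JsonDocument via JsonDocument.ParseAsync; this sketch uses JsonNode instead, because that's what recent versions of JsonSchema.net's Evaluate method accept, and the exact result properties may differ between versions:

using System;
using System.Text.Json;
using System.Text.Json.Nodes;
using Json.Schema;

// 'row' is the dictionary built above, 'schema' is the parsed JSON Schema and
// 'rowNumber' is the position of this row in the CSV file.
var json = JsonSerializer.Serialize(row);
var instance = JsonNode.Parse(json);

var results = schema.Evaluate(instance, new EvaluationOptions
{
    OutputFormat = OutputFormat.List // flat list of results rather than a nested tree
});

if (!results.IsValid)
{
    foreach (var detail in results.Details)
    {
        if (detail.Errors is null) continue;

        foreach (var error in detail.Errors)
        {
            // Collect the row number, location and message for the calling system.
            Console.WriteLine($"Row {rowNumber}: {detail.InstanceLocation} - {error.Value}");
        }
    }
}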

Performance

Without spending too much time on optimisation, it can handle a good 500,000 rows in a couple of seconds. Given these are slow-moving files being validated as part of a larger integration journey, that's acceptable performance. I'm sure it could be optimised further.

The Quirks

All companies have technical debt and ghosts of past decisions that haunt them. For me recently, that's been multiple different CSV formats within the same file. Eurgh. However, with this tool we can handle those scenarios too.

There needs to be a JSON Schema file for each row format, plus a way to identify which type of row is being processed. Then, instead of validating a file and schema combination, we validate by passing in a row and a schema one at a time, as sketched below. Doing it this way means we have to handle the error messaging slightly differently, but with a little pre-processing of the data it becomes a trivial hurdle.
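
As a hypothetical sketch, if each row's first field identified its record type, picking the right schema per row might look like this (the row types, schema file names, rawRows collection and ValidateRow helper are all invented for the example):

using System;
using System.Collections.Generic;
using System.Linq;
using Json.Schema;

// Hypothetical: one schema per row type, keyed by a discriminator in the first field.
var schemasByRowType = new Dictionary<string, JsonSchema>
{
    ["HDR"] = JsonSchema.FromFile("header-row.schema.json"),
    ["ORD"] = JsonSchema.FromFile("order-row.schema.json"),
    ["TRL"] = JsonSchema.FromFile("trailer-row.schema.json")
};

foreach (var (fields, rowNumber) in rawRows.Select((f, i) => (f, i + 1)))
{
    var rowType = fields[0];

    if (!schemasByRowType.TryGetValue(rowType, out var rowSchema))
    {
        Console.WriteLine($"Row {rowNumber}: unknown row type '{rowType}'");
        continue;
    }

    // Validate this single row against the schema for its row type, reusing the
    // same map-and-evaluate steps shown earlier.
    ValidateRow(fields, rowSchema, rowNumber);
}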

Conclusion

Hopefully this has been useful and you've been able to solve a problem you've got! My approach is still in its infancy but is already being tested in a production environment.

For those of you interested in the source code, it's available here. Feel free to raise a PR if you can see improvements. A reminder that I'm a hobby engineer these days, so there's definitely room for improvement!
