What Are Delimited, Tabular Files?
A delimited file contains data structured in rows and columns separated by a symbol character.
CSV files are the most common. In CSV files, columns are delimited by commas and rows are separated by newline characters. Values are quoted when they contain the delimiter, quote characters, or line breaks. A delimited format like CSV is simple to create, simple to edit, accessible by virtually all tools, and highly compressible.
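To make the format concrete, here is a small example using Python's standard csv module; the column names and values are invented for illustration:

```python
import csv
import io

# Write a small table; csv.writer quotes values only when needed
# (here, because "Smith, Jane" contains the delimiter).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "name", "city"])
writer.writerow([1, "Smith, Jane", "Boston"])
print(buffer.getvalue())
# id,name,city
# 1,"Smith, Jane",Boston

# Read it back; the quoted comma is preserved inside the value.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows[1])  # ['1', 'Smith, Jane', 'Boston']
```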
The main problem with delimited files -- and there are many! -- is that they lack consistency and guardrails. While the data in a delimited file is assumed to be like-for-like values in rows and columns, it can be as many shapes as a box of Legos.
It is common for the first row to be descriptive headers. It is also common for there to be no headers, one or more blank lines before anything else, free-form comments, headers that don't match the data in the rows below, etc. It is also not uncommon for multiple datasets to live in the same file, for headers to repeat or change, for datasets to be positioned side by side, and on and on.
When tabular data isn't really tabular, problems ensue!
This happens a lot, sadly. And historically, there has not been a strong validation language to tap the brakes and say when a file is invalid and why. CsvPath Validation Language fills this validation gap; we'll look closer at that technique in another post.
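We'll dig into CsvPath in that later post. As a rough illustration of the kind of structural checks a validation layer performs, here is a minimal plain-Python sketch (not CsvPath) that flags a few of the problems described above; the file name and the expectations it encodes, a single header row and a uniform column count, are assumptions:

```python
import csv

def basic_structure_report(path: str) -> list[str]:
    """Flag a few common 'not really tabular' problems: a blank first row,
    empty header cells, and inconsistent column counts.
    An illustration only, not a replacement for a real validation language."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    if not rows:
        return ["file is empty"]
    problems = []
    header = rows[0]
    if all(cell.strip() == "" for cell in header):
        problems.append("first row is blank; header may not be on line 1")
    if any(cell.strip() == "" for cell in header):
        problems.append("header contains empty column names")
    width = len(header)
    for i, row in enumerate(rows[1:], start=2):
        if len(row) != width:
            problems.append(f"line {i}: expected {width} columns, found {len(row)}")
    return problems

print(basic_structure_report("orders.csv"))  # hypothetical file name
```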
Common Scenarios
I'll lay out, broadly, the most common ways data engineers, ops team members, data scientists, and others run into CSVs:
First, let's split the list into two parts:
Ad hoc / manual / one-off CSV production
- Files from manual processes
- Exported search results
- Official data sets from authoritative sources
- Manually uploaded CSVs
These are situations you face irregularly. Because they are irregular, it is relatively expensive to create tools and scripts to handle them -- the effort can be high and the payoff happens once or infrequently. In some cases, e.g. manual CSV uploads, the location and purpose are consistent but the sources vary and may not be under your control.
Regular automated CSV production
- Outputs of automated processes
- Regular transactions and data feeds
- Export-import batch loads
- Database backups and distributions
In these situations you receive the same type of data regularly and frequently. Because they are regular and frequent, the cost to automate is lower. These are often highly valuable processes with high risk. Generally you don't control the inbound data. In a large business there can be a lot of these situations, each a bit different. And because they are regular, time is of the essence.
Ad hoc / manual / one-offs
Let's look closer and give some example tools that can help you deal with these different challenges. The tools are not the only options, nor necessarily the best; they are just offered to help you zero in on the use case and the types of solutions available.
First, the hard-to-automate, manual processes.
Name | Purpose | Challenge | Tools |
---|---|---|---|
Files from manual processes | Data engineering dev, analysis and exploration, unit testing, etc. | Remembering/versioning what you did; being consistent; understanding the structure and data types | |
Exported results | Capturing results from a tool (e.g. exporting a JIRA query) | Making the data importable; comments and non-data prologues; inconsistent data; embedded non-tabular data, etc. | |
Official data, authoritative source | Published statistics; open data; XBRL | Weak canonicalization; may require data aggregation; incomparable formats; other formats better maintained, etc. | |
Manually uploaded CSV | Application CSV upload forms/features | Understandable feedback to non-technical users; structure comparison; flexibility to handle different or changing use cases; type conversion (see the sketch after this table) | |
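As a minimal sketch of the structure-comparison challenge in the last row, here is one way to give a non-technical user readable feedback about an uploaded file; the expected column names are invented for illustration:

```python
import csv

# Hypothetical expected upload structure; adjust to your form's contract.
EXPECTED_HEADER = ["email", "first_name", "last_name", "plan"]

def check_upload(path: str) -> list[str]:
    """Return user-readable messages comparing an uploaded CSV's header
    to the expected structure."""
    with open(path, newline="") as f:
        header = next(csv.reader(f), None)
    if header is None:
        return ["The file appears to be empty."]
    header = [h.strip().lower() for h in header]
    messages = []
    missing = [c for c in EXPECTED_HEADER if c not in header]
    extra = [c for c in header if c not in EXPECTED_HEADER]
    if missing:
        messages.append(f"Missing columns: {', '.join(missing)}")
    if extra:
        messages.append(f"Unexpected columns: {', '.join(extra)}")
    return messages or ["Header looks good."]

print(check_upload("upload.csv"))  # hypothetical uploaded file
```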
Regular automated CSV production
Now let's look at the more automation-friendly uses of tabular files.
Name | Purpose | Challenge | Tools |
---|---|---|---|
Log-form data or API output | Archiving and review/triage | Large size; retention; embedded non-tabular data; always on | |
Regular transactions and file feeds | Exchange with data partners; collect regular updates; transaction batches | Efficient quality management; lost revenue or penalties for inaccuracy; consistency across many partner setups; round-the-clock MFT | |
Export-import batches | Batch loading tabular data stores | Speed; consistent archiving and cataloging; observability (see the loading sketch after this table) | |
Database or multi-set distributions | Publishing an entire database or a dataset with many files | Maintaining structure in flat files; bundling; documentation | |
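For the export-import row, here is a minimal sketch of a batched load with a little built-in observability, using SQLite as a stand-in target; the file name, table, and columns are assumptions:

```python
import csv
import sqlite3
import time

# Hypothetical names: adjust the file, table, and columns to your feed.
SOURCE = "daily_batch.csv"
BATCH_SIZE = 5_000

def load_batch(conn: sqlite3.Connection) -> int:
    """Load a CSV into SQLite in batches, printing simple progress
    so the load can be observed and timed. Assumes three columns per row."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT, amount TEXT, placed_at TEXT)")
    loaded = 0
    start = time.time()
    with open(SOURCE, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) >= BATCH_SIZE:
                conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", batch)
                loaded += len(batch)
                batch = []
        if batch:
            conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", batch)
            loaded += len(batch)
    conn.commit()
    print(f"loaded {loaded} rows in {time.time() - start:.1f}s")
    return loaded

load_batch(sqlite3.connect("orders.db"))  # hypothetical target database
```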
At a high level, those few scenarios cover 90% of the times developers encounter CSV or Excel files. They are everywhere and come with serious limitations. But there are good tools and good reasons to use the simplest means of data transfer. The benefits of simple delimited files outweigh their challenges -- which is why you encounter these files so frequently.
If I missed your tabular data use case please drop a comment! More tools suggestions and/or questions welcome!