David Kershaw

Where We Encounter Delimited Data and How We Handle It

What Are Delimited, Tabular Files?

A delimited file contains data structured in rows and columns, with the columns separated by a delimiter character.

CSV files are the most common. In CSV files, columns are delimited by commas and rows are marked by the newline character. Values are quoted when they contain the delimiter, quote characters, or line breaks. A delimited format like CSV is simple to create, simple to edit, accessible by virtually all tools, and highly compressible.
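For instance, here is a minimal sketch of reading such a file with Python's standard csv module, which applies the quoting rules for you (the inline data is just an illustration):

```python
import csv
import io

# A tiny CSV where one value contains the delimiter and is therefore quoted.
data = 'name,title,city\n"Smith, Jane",Engineer,Boston\n'

# csv.reader understands the quoting rules, so the comma inside the
# quoted value is not treated as a column separator.
rows = list(csv.reader(io.StringIO(data)))
print(rows)  # [['name', 'title', 'city'], ['Smith, Jane', 'Engineer', 'Boston']]
```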

The main problem with delimited files -- and there are many problems! -- is that they lack consistency and guardrails. While the data in a delimited file is assumed to be like-for-like values in rows and columns, in practice it can come in as many shapes as a box of Legos.

It is common for the first row to be descriptive headers. It is also common for there to be no headers at all, one or more blank lines before anything else, free-form comments, or headers that don't match the values in the rows that follow. It is also not uncommon for multiple datasets to live in the same file, for headers to repeat or change partway through, or for datasets to be positioned side by side, and on and on.
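To make that concrete, here is a minimal sketch of coping with one kind of messiness: blank lines and free-form comments before or between the data rows. It assumes comments start with "#", which is only a convention for this hypothetical file, not a rule delimited files follow:

```python
import csv

def rows_without_prologue(path, comment_prefix="#"):
    """Yield parsed rows, skipping blank lines and comment lines.

    The comment prefix is an assumption about this particular file;
    nothing in the CSV "standard" guarantees it.
    """
    with open(path, newline="") as f:
        cleaned = (
            line for line in f
            if line.strip() and not line.lstrip().startswith(comment_prefix)
        )
        # csv.reader accepts any iterable of lines, so we can filter lazily.
        yield from csv.reader(cleaned)
```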

When tabular data isn't really tabular, problems ensue!

This happens a lot, sadly. And historically, there has not been a strong validation language to tap the brakes and say when a file is invalid and why. CsvPath Validation Language fills this validation gap; we'll look more closely at that technique in another post.
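To show what "saying when a file is invalid and why" means in practice, here is a deliberately hand-rolled sketch rather than CsvPath itself (that syntax is the subject of the later post). The expected header names are hypothetical:

```python
import csv

EXPECTED_HEADERS = ["order_id", "sku", "quantity", "price"]  # hypothetical schema

def validate(path):
    """Return a list of human-readable problems; an empty list means the file passed."""
    problems = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header != EXPECTED_HEADERS:
            problems.append(f"unexpected headers: {header}")
        for line_no, row in enumerate(reader, start=2):
            if len(row) != len(EXPECTED_HEADERS):
                problems.append(
                    f"line {line_no}: expected {len(EXPECTED_HEADERS)} values, got {len(row)}"
                )
    return problems
```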

Common Scenarios

I'll lay out, broadly, the most common ways data engineers, ops team members, data scientists, and others run into CSVs:

First, let's split the list into two parts:

Ad hoc / manual / one-off CSV production

  • Files from manual processes
  • Exported search results
  • Official data sets from authoritative sources
  • Manually uploaded CSVs

These are situations you face irregularly. Because they are irregular, it is relatively expensive to create tools and scripts to handle them -- the effort can be high and the payoff happens once or infrequently. In some cases, e.g. manual CSV uploads, the location and purpose are consistent but the sources vary and may not be under your control.

Regular automated CSV production

  • Outputs of automated processes
  • Regular transactions and data feeds
  • Export-import batch loads
  • Database backups and distributions

In these situations you receive the same type of data regularly and frequently. Because they are regular and frequent, the cost to automate is lower. These are often highly valuable processes with high risk. Generally you don't control the inbound data. In a large business there can be a lot of these situations, each a bit different. And because they are regular, time is of the essence.

Ad hoc / manual / one-offs

Let's look more closely and give some example tools that can help you deal with these different challenges. The tools are not the only options, nor necessarily the best; they are just given to help you zero in on the use case and the types of solutions.

First, the hard-to-automate manual processes.

| Name | Purpose | Challenge | Tools |
| --- | --- | --- | --- |
| Files from manual processes | Data engineering dev, analysis and exploration, unit testing, etc. | Remembering/versioning what you did; being consistent; understanding the structure and data types | |
| Exported results | Capturing results from a tool (e.g. exporting a JIRA query) | Making the data importable; comments and non-data prologues; inconsistent data; embedded non-tabular data, etc. | |
| Official data, authoritative source | Published statistics; open data; XBRL | Weak canonicalization; may require data aggregation; incomparable formats; other formats better maintained, etc. | |
| Manually uploaded CSV | Application CSV upload forms/features | Understandable feedback to non-tech users; structure comparison; flexibility to handle different or changing use cases; type conversion | |
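The last row above calls out type conversion and feedback that non-technical users can act on. Here is a minimal sketch of that idea, assuming the rows come from csv.DictReader; the column names ('quantity', 'ship_date') are hypothetical:

```python
from datetime import date

def check_types(rows):
    """Collect messages a non-technical uploader can act on.

    `rows` is expected to be an iterable of dicts, e.g. from csv.DictReader.
    """
    messages = []
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        try:
            int(row["quantity"])
        except (KeyError, ValueError):
            messages.append(
                f"Row {line_no}: 'quantity' should be a whole number, "
                f"found {row.get('quantity')!r}."
            )
        try:
            date.fromisoformat(row["ship_date"])
        except (KeyError, ValueError):
            messages.append(
                f"Row {line_no}: 'ship_date' should look like 2024-01-31, "
                f"found {row.get('ship_date')!r}."
            )
    return messages
```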

Regular automated CSV production

Now let's look at the more automation-friendly uses of tabular files.

| Name | Purpose | Challenge | Tools |
| --- | --- | --- | --- |
| Log-form data or API output | Archiving and review/triage | Large size; retention; embedded non-tabular data; always on | |
| Regular transactions and file feeds | Exchange with data partners; collect regular updates; transaction batches | Efficient quality management; lost revenue or penalties for inaccuracy; consistency across many partner setups; round-the-clock MFT | |
| Export-import batches | Batch loading tabular data stores | Speed; consistent archiving and cataloging; observability | |
| Database or multi-set distributions | Publishing an entire database or a dataset with many files | Maintaining structure in flat files; bundling; documentation | |
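The "large size" and "speed" challenges above usually come down to streaming instead of loading whole files into memory. A small sketch with the standard library; the file name and the load() step are hypothetical placeholders:

```python
import csv
from itertools import islice

def batches(path, size=10_000):
    """Stream a large CSV and yield lists of rows, never holding the whole file in memory."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        while True:
            batch = list(islice(reader, size))
            if not batch:
                break
            yield batch

# Hypothetical usage: load each batch into the data store, then archive the file.
# for rows in batches("transactions.csv"):
#     load(rows)
```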

At a high level, those few scenarios cover roughly 90% of the cases where developers encounter CSV or Excel files. They are everywhere and come with serious limitations. But there are good tools and good reasons to use the simplest means of data transfer. The benefits of simple delimited files outweigh their challenges -- which is why you encounter these files so frequently.

If I missed your tabular data use case, please drop a comment! More tool suggestions and/or questions are welcome!
