What Are Delimited, Tabular Files?
A delimited file contains data structured in rows and columns separated by a symbol character.
CSV files are the most common. In CSV files, columns are delimited by commas and rows are separated by newline characters. Values are quoted when they contain the delimiter, quote characters, or line breaks. A delimited format like CSV is simple to create, simple to edit, accessible by virtually all tools, and highly compressible.
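To make the format concrete, here is a small example using Python's standard csv module; the column names and values are invented for illustration:

```python
import csv
import io

# Write a small table; csv.writer quotes values only when needed
# (here, because "Smith, Jane" contains the delimiter).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "name", "city"])
writer.writerow([1, "Smith, Jane", "Boston"])
print(buffer.getvalue())
# id,name,city
# 1,"Smith, Jane",Boston

# Read it back; the quoted comma is preserved inside the value.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows[1])  # ['1', 'Smith, Jane', 'Boston']
```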
The main problem with delimited files -- and there are many! -- is that they lack consistency and guardrails. While the data in a delimited file is assumed to be like-for-like values in rows and columns, it can be as many shapes as a box of Legos.
It is common for the first row to be descriptive headers. It is also common for there to be no headers, one or more blank lines before anything else, free-form comments, headers that don't match the data in the rows below, etc. It is also not uncommon for multiple datasets to live in the same file, for headers to repeat or change, for datasets to be positioned side by side, and on and on.
When tabular data isn't really tabular, problems ensue!
This happens a lot, sadly. And historically, there has not been a strong validation language to tap the brakes and say when a file is invalid and why. CsvPath Validation Language fills this validation gap; we'll look closer at that technique in another post.
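We'll dig into CsvPath in that later post. As a rough illustration of the kind of structural checks a validation layer performs, here is a minimal plain-Python sketch (not CsvPath) that flags a few of the problems described above; the file name and the expectations it encodes, a single header row and a uniform column count, are assumptions:

```python
import csv

def basic_structure_report(path: str) -> list[str]:
    """Flag a few common 'not really tabular' problems: a blank first row,
    empty header cells, and inconsistent column counts.
    An illustration only, not a replacement for a real validation language."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    if not rows:
        return ["file is empty"]
    problems = []
    header = rows[0]
    if all(cell.strip() == "" for cell in header):
        problems.append("first row is blank; header may not be on line 1")
    if any(cell.strip() == "" for cell in header):
        problems.append("header contains empty column names")
    width = len(header)
    for i, row in enumerate(rows[1:], start=2):
        if len(row) != width:
            problems.append(f"line {i}: expected {width} columns, found {len(row)}")
    return problems

print(basic_structure_report("orders.csv"))  # hypothetical file name
```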
Common Scenarios
I'll lay out, broadly, the most common ways data engineers, ops team members, data scientists, and others run into CSVs:
First, let's split the list into two parts:
Ad hoc / manual / one-off CSV production
- Files from manual processes
- Exported search results
- Official data sets from authoritative sources
- Manually uploaded CSVs
These are situations you face irregularly. Because they are irregular, it is relatively expensive to create tools and scripts to handle them -- the effort can be high and the payoff happens once or infrequently. In some cases, e.g. manual CSV uploads, the location and purpose are consistent but the sources vary and may not be under your control.
Regular automated CSV production
- Outputs of automated processes
- Regular transactions and data feeds
- Export-import batch loads
- Database backups and distributions
In these situations you receive the same type of data regularly and frequently. Because they are regular and frequent, the cost to automate is lower. These are often highly valuable processes with high risk. Generally you don't control the inbound data. In a large business there can be a lot of these situations, each a bit different. And because they are regular, time is of the essence.
Ad hoc / manual / one-offs
Let's look closer and give some example tools that can help you deal with these different challenges. The tools are not the only options, nor necessarily the best; they are just offered to help you zero in on the use case and the types of solutions available.
First, the hard-to-automate, manual processes.
Name | Purpose | Challenge | Tools |
---|---|---|---|
Files from manual processes | Data engineering dev, analysis and exploration, unit testing, etc. | Remembering/versioning what you did; being consistent; understanding the structure and data types | |
Exported results | Capturing results from a tool (e.g. exporting a JIRA query) | Making the data importable; comments and non-data prologues; inconsistent data; embedded non-tabular data, etc. | |
Official data, authoritative source | Published statistics; open data; XBRL | Weak canonicalization; may require data aggregation; incomparable formats; other formats better maintained, etc. | |
Manually uploaded CSV | Application CSV upload forms/features | Understandable feedback to non-technical users; structure comparison; flexibility to handle different or changing use cases; type conversion (see the sketch after this table) | |
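As a minimal sketch of the structure-comparison challenge in the last row, here is one way to give a non-technical user readable feedback about an uploaded file; the expected column names are invented for illustration:

```python
import csv

# Hypothetical expected upload structure; adjust to your form's contract.
EXPECTED_HEADER = ["email", "first_name", "last_name", "plan"]

def check_upload(path: str) -> list[str]:
    """Return user-readable messages comparing an uploaded CSV's header
    to the expected structure."""
    with open(path, newline="") as f:
        header = next(csv.reader(f), None)
    if header is None:
        return ["The file appears to be empty."]
    header = [h.strip().lower() for h in header]
    messages = []
    missing = [c for c in EXPECTED_HEADER if c not in header]
    extra = [c for c in header if c not in EXPECTED_HEADER]
    if missing:
        messages.append(f"Missing columns: {', '.join(missing)}")
    if extra:
        messages.append(f"Unexpected columns: {', '.join(extra)}")
    return messages or ["Header looks good."]

print(check_upload("upload.csv"))  # hypothetical uploaded file
```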
Regular automated CSV production
Now let's look at the more automation-friendly uses of tabular files.
Name | Purpose | Challenge | Tools |
---|---|---|---|
Log-form data or API output | Archiving and review/triage | Large size; retention; embedded non-tabular data; always on | |
Regular transactions and file feeds | Exchange with data partners; collect regular updates; transaction batches | Efficient quality management; lost revenue or penalties for inaccuracy; consistency across many partner setups; round-the-clock MFT | |
Export-import batches | Batch loading tabular data stores | Speed; consistent archiving and cataloging; observability (see the loading sketch after this table) | |
Database or multi-set distributions | Publishing an entire database or a dataset with many files | Maintaining structure in flat files; bundling; documentation | |
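For the export-import row, here is a minimal sketch of a batched load with a little built-in observability, using SQLite as a stand-in target; the file name, table, and columns are assumptions:

```python
import csv
import sqlite3
import time

# Hypothetical names: adjust the file, table, and columns to your feed.
SOURCE = "daily_batch.csv"
BATCH_SIZE = 5_000

def load_batch(conn: sqlite3.Connection) -> int:
    """Load a CSV into SQLite in batches, printing simple progress
    so the load can be observed and timed. Assumes three columns per row."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT, amount TEXT, placed_at TEXT)")
    loaded = 0
    start = time.time()
    with open(SOURCE, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) >= BATCH_SIZE:
                conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", batch)
                loaded += len(batch)
                batch = []
        if batch:
            conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", batch)
            loaded += len(batch)
    conn.commit()
    print(f"loaded {loaded} rows in {time.time() - start:.1f}s")
    return loaded

load_batch(sqlite3.connect("orders.db"))  # hypothetical target database
```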
At a high level, those few scenarios cover 90% of the times developers encounter CSV or Excel files. They are everywhere and come with serious limitations. But there are good tools and good reasons to use the simplest means of data transfer. The benefits of simple delimited files outweigh their challenges -- which is why you encounter these files so frequently.
If I missed your tabular data use case please drop a comment! More tools suggestions and/or questions welcome!