As personal computers emerged in the 1970s, software companies explored ways to store and exchange data. At the time, most business records lived on punch cards or in proprietary database formats designed with little regard for portability.
In 1979, VisiCalc, the first spreadsheet program for personal computers, was released. It revolutionized data entry and calculation, and its rise as a ‘killer application’ prompted other companies to follow suit; soon enough, everyone wanted in. Users now needed a way to export their data into a format that worked across different systems, so they could tap into a wider set of software and tools.
With this demand, people pushed for better standards for data storage and sharing. Decades have passed, and the dilemma persists to this day.
With so many performant formats available today, such as Avro, Delta, and Iceberg, what keeps file formats like JSON and CSV so relevant and widely adopted? You can reasonably thank their ability to enforce data best practices (and that's by design).
Structure enforcement
Yes, it might seem counterintuitive at first. Most data engineers and data analysts have sifted through terabytes of messy, broken data, much of it in CSV or JSON format. So one would assume that these file formats fail to enforce proper data structure and end up threatening its integrity.
From a broader perspective, most of these cases do not originate from datasets natively generated in these formats. Instead, they were initially stored in other formats and later forced into CSV or JSON to make them 'work,' often leading to corruption and unexpected failures.
Enforcing key-value pairs takes care of a lot of the pain around data structure and integrity and ensures that, at a higher level, the data remains functional. And this type of enforcement is exactly what these seemingly simple file formats do best.
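As a minimal sketch of what that enforcement looks like in practice (the field names and values below are made up for illustration), a JSON parser either hands back every value by its key or rejects the record outright:

```python
import json

# A well-formed record: every value is reachable by its key, so downstream
# code can rely on the structure being there.
record = '{"order_id": 1042, "status": "shipped", "total": 99.5}'
parsed = json.loads(record)
print(parsed["status"])  # -> shipped

# A malformed record (missing closing brace) fails at load time instead of
# silently slipping corrupted rows into a downstream table.
broken = '{"order_id": 1043, "status": "pending"'
try:
    json.loads(broken)
except json.JSONDecodeError as err:
    print(f"Rejected at load time: {err}")
```

The failure happens where it is cheapest to fix: at the moment of reading the file, not three pipeline stages later.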
This widespread adoption, and the role these formats play in the title of this article, comes down to their simplicity and the fact that nearly every tool supports them, which is exactly why data scientists, analysts, and researchers keep reaching for them.
You can count on best practices and frictionless data enforcement taking hold when every level of the data governance hierarchy can work with the same simple formats, and this trinity enables just that.
Quick Wins and Familiar Tools
One of the key factors that prevents a technology from taking off is the difficulty of driving adoption. Yes, there are certainly better options out there than “the Trinity,” but getting teams on board with those standards is no easy task. In a perfect world, one could enforce or build policies that promote the most effective data file formats. However, teams vary significantly in their tooling and capabilities. The Trinity enables frictionless adoption, accommodates analytical teams, and facilitates better data cataloging.
Imagine this scenario: three users, A, B, and C, form a team of data professionals. User C comes from a data analysis background, User B comes from a data engineering background, and User A is a data scientist.
User A queries data from an SQL database and shares the results with User C. However, User C is not highly proficient in SQL and asks User A to send the data as an XLSX file so they can update their Power BI dashboard. User A knows that a proprietary format like that can cause issues down the line and believes a direct query from Power BI would be more beneficial. Meanwhile, User B wants to add these assets to the metric store in their data catalog.
Situations like these happen all the time in the real world, and they can sometimes lead to more professional issues than technical ones. Using a format like CSV, JSON, or Parquet is a good compromise in such cases. The data engineer can easily identify issues within the dataset, the data analyst can work with the file format without difficulty, and the data scientist knows exactly what to share while maintaining a reasonable level of data integrity and structure.
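As a rough sketch of that compromise (assuming pandas with pyarrow installed and a local SQLite file standing in for the real database; the table and file names are invented), User A can hand over one open, columnar export plus a plain CSV copy instead of an XLSX file or a live query dependency:

```python
import sqlite3
import pandas as pd

# Pull the query result once from the database (SQLite here stands in for
# whatever SQL engine the team actually uses).
conn = sqlite3.connect("warehouse.db")
revenue = pd.read_sql_query(
    "SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region",
    conn,
)

# Parquet preserves column names and types for User B's catalog and pipelines,
# while the CSV copy stays trivially readable for User C's Power BI dashboard.
revenue.to_parquet("revenue_by_region.parquet", index=False)
revenue.to_csv("revenue_by_region.csv", index=False)
```

Each person gets a file their tools understand, and nobody is blocked on someone else's stack.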
Reality vs. Theory in the world of data
Reading through online articles, blog posts, and websites, it seems like the world is pushing the boundaries of high-tech innovation. Most discussions revolve around stacks involving Snowflake, dbt, Azure, Databricks, and similar tools. However, these technologies represent only a small fraction of data teams.
In reality, these tools are often prohibitively expensive, making them inaccessible to many organizations.
It might seem that the average data team has a Snowflake data warehouse and a data catalog like Atlan or Informatica to enable seamless access to their data and assets. However, many teams simply do not have the capabilities or resources to implement such solutions. In fact, many still rely on shared drives to store their datasets while continuing to make an impact. Despite their efforts to maximize efficiency with the tools and skills available to them, it can be overwhelming to compete with teams that have access to high-end infrastructure.
To remain effective, many teams rely on widely supported file formats like CSV, JSON, and Parquet for their workloads. These formats allow them to stay productive and competitive, even without the extensive resources that larger organizations possess.
It’s not over for the little guy
Functional data teams with varying skill sets and toolsets—without access to high-end infrastructure—are still a major driving force in decision-making across a wide variety of companies, especially smaller ones. One of the key factors that keeps them operational and effective is the Trinity of formats (CSV, JSON, and Parquet). If their datasets were stored in proprietary formats such as Excel files, or in formats that lack portability, they would struggle with integration and interoperability.
Many of the more advanced file formats, such as Iceberg and Delta Tables, are designed for large-scale data lakes and enterprises, leaving smaller teams without practical solutions. As a result, these teams make do with what they have, as larger tools often fail to accommodate their specific use cases. For example, data cataloging is a significant challenge for teams storing datasets across multiple scattered CSV files. The reason these files are scattered is that enterprise-scale tools do not provide affordable or accessible solutions for them.
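One lightweight workaround, sketched below under the assumption of a shared folder full of CSV files and a pandas environment (the folder name and recorded fields are placeholders, not a recommendation of any particular tool), is to build a catalog out of the scattered files themselves:

```python
from pathlib import Path
import pandas as pd

# Walk the shared drive and record basic metadata about every CSV found,
# producing a single catalog file the whole team can open and search.
entries = []
for path in Path("shared_drive").rglob("*.csv"):
    sample = pd.read_csv(path, nrows=50)  # a small sample is enough for the schema
    entries.append({
        "file": str(path),
        "columns": ", ".join(sample.columns),
        "size_kb": round(path.stat().st_size / 1024, 1),
    })

pd.DataFrame(entries).to_csv("dataset_catalog.csv", index=False)
```

It is not a real catalog, but it gives a team a searchable inventory using nothing beyond the formats they already rely on.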
While some open-source data catalog tools, such as CKAN, could be beneficial for smaller companies, they are notoriously difficult to use. A promising alternative is RepoTen, possibly the fastest way for smaller teams to set up a data catalog. However, even with solutions like this, many enterprise-level tools still fail to address the needs of smaller data teams, leaving them to rely on flexible, widely supported formats like the Trinity to stay productive.
Many teams continue to drive meaningful decisions with simple, widely supported formats like CSV, JSON, and Parquet. These formats act as an equalizer, enabling teams with fewer resources to remain agile, collaborative, and effective.
Having a system that works for your team and allows you to store, share, and analyze data efficiently is a good investment. And sometimes, the best solutions are not the most complex, but the ones that fit your needs just right.