Different file formats: a benchmark of basic operations

Recently, I've been designing a data lake to store different types of data from various sources, serving demands across different areas and levels of the business. To choose the best file format for storing this data, I compiled the points of interest below, weighing the needs of each area:

Tool Compatibility

Tool compatibility refers to which tools can read and write a given file format. No/low-code compatibility is crucial, especially where tools like Excel/LibreOffice play a significant role in operational layers whose collaborators may not have the technical background to use anything else.

Storage

How much more or less space will a particular file format occupy in the data lake? Storage is relatively cheap nowadays, both on-premise and in the cloud, but with a large volume of data any savings and optimization can make a difference in the final balance.

Reading

How long do the tools that consume the data take to open and read the file? In applications where seconds of reading time matter, sacrificing compatibility and storage for gains in processing time becomes a crucial part of planning the data pipeline architecture.

Writing

How long will the tools used by our data team take to generate the file in the data lake? If immediate file availability is a priority, this is an attribute we would like to minimize as much as possible.

Query

Some services will directly consume data from the file and perform grouping and filtering functions. Therefore, it's essential to consider how much time these operations will take to make the correct choice in our data solution.

Benchmark

Files

1 - IBM Transactions for Anti Money Laundering (AML)

Rows: 31 million
Columns: 11
Types: Timestamp, String, Integer, Decimal, Boolean

2 - Malware Detection in Network Traffic Data

Rows: 6 million
Columns: 23
Types: String, Integer

Number of Tests

15 tests were conducted for each operation on each file, and the results in the graphs are the average across those iterations. The only measurement unaffected by the number of runs is the file size, which remains the same regardless of how many times the file is written.

Why 2 datasets?

I chose two completely different datasets. The first is significantly larger than the second. The second contains many null values (represented by "-") and many columns of near-duplicate values with little distinction between them. The first, in contrast, has fewer columns, little data variability, and more complex types such as timestamps. These characteristics highlight the distinctions, strengths, and weaknesses of each format.

Script

The script used for benchmarking is available on GitHub for anyone who wants to inspect it or run their own benchmarks on their own files, which I strongly recommend.

file-format-benchmark: benchmark script of key operations between different file formats

Tools

I will use Python with Spark for the benchmark. Spark reads and queries all of these file formats natively, unlike Pandas, which requires extra libraries to do so. Additionally, Spark handles larger datasets better, and the datasets used in this benchmark are large enough that Pandas struggled with them.

Env:
Python version: 3.11.7
Spark version: 3.5.0
Hadoop version: 3.4.1
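
For reference, this is roughly how the session and the measurement loop can be set up in PySpark. The package coordinates, helper name, and path below are my own illustration, not taken from the author's repository; note in particular that Avro support is not bundled with Spark and has to be pulled in as a separate package.

```python
import time
from statistics import mean

from pyspark.sql import SparkSession

# Avro is not bundled with Spark, so the spark-avro package is added explicitly.
spark = (
    SparkSession.builder
    .appName("file-format-benchmark")
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.5.0")
    .getOrCreate()
)

def average_duration(operation, iterations=15):
    """Run an operation several times and return the mean wall-clock time in seconds."""
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        operation()
        durations.append(time.perf_counter() - start)
    return mean(durations)

# Example: average time to load a Parquet copy of dataset 1 and show its first rows.
print(average_duration(lambda: spark.read.parquet("out/dataset1_parquet").show(5)))
```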

Benchmark Results

Tool Compatibility

Although I wanted to measure tool compatibility, I couldn't find a good way to do it, so I'll share my opinion instead. For pipelines whose downstream stakeholders have more technical knowledge (data scientists, machine learning engineers, etc.), the file format matters little: with a library or framework in practically any programming language, you can work with data in any of these formats. For non-technical stakeholders, however, such as business analysts, C-level executives, or collaborators who work directly on the production of a product/service, the scenario changes. These people usually work with tools like Excel, LibreOffice, Power BI, or Tableau (which, despite having more native readers, do not support Avro or ORC).

In cases where files are consumed "manually" by people, you will almost always opt for CSV or JSON. Being plain text, they can be opened, read, and understood in any text editor, and practically every tool can read structured data in these formats. Parquet still has reasonable compatibility; it is the columnar format with the most support and attention from the community. ORC and Avro, on the other hand, have very little support, and it can be hard to find parsers and serializers for them outside the Apache ecosystem.

In summary, CSV and JSON have a significant advantage over the others, and you will likely choose them when your stakeholders are directly handling the files and lack technical knowledge.

Storage

Dataset 1:
Storage results graph

Dataset 2:
Storage results graph
To calculate storage, we loaded the dataset in CSV format, rewrote it in all formats (including CSV itself), and listed the amount of space they occupy.
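
A minimal sketch of that measurement, assuming the SparkSession from the setup sketch above and illustrative paths:

```python
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # reuses the session configured earlier
df = spark.read.csv("data/dataset1.csv", header=True, inferSchema=True)

def directory_size_mb(path):
    """Spark writes each output as a directory of part files; sum their sizes."""
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, names in os.walk(path)
        for name in names
    ) / (1024 * 1024)

for fmt in ["csv", "json", "parquet", "orc", "avro"]:
    output = f"out/dataset1_{fmt}"
    df.write.mode("overwrite").format(fmt).save(output)
    print(f"{fmt}: {directory_size_mb(output):.1f} MB")
```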

The graphs show a significant disadvantage for JSON, which was about three times larger than the second-largest file (CSV) in both cases. The difference is so pronounced because of the way JSON is written: a list of objects with key-value pairs, where the key is the column name and the value is that column's value for the row. This creates unnecessary schema redundancy, repeating every column name for every value. Since CSV and JSON are both plain text without any compression, they show the two worst results in terms of storage. Parquet, ORC, and Avro had very similar results, highlighting their storage efficiency compared with the more common formats. The key reasons are that all three are binary formats, which already saves space, and that Parquet and ORC are columnar formats, which significantly reduces data redundancy, avoiding waste and optimizing space. All three also have highly efficient compression methods.
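
To make that redundancy concrete, here are the same two invented records written as CSV and as the line-delimited JSON that Spark produces; the column names are repeated in every single JSON row:

```
# CSV: the schema appears once, in the header line
timestamp,amount,currency
2023-01-01 10:00:00,125.50,USD
2023-01-01 10:00:05,89.90,EUR

# JSON: every row carries every column name again
{"timestamp":"2023-01-01 10:00:00","amount":125.5,"currency":"USD"}
{"timestamp":"2023-01-01 10:00:05","amount":89.9,"currency":"EUR"}
```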

In summary, CSV and JSON are by no means the best choice when storage optimization matters, for example when storing logs or data that has no immediate use but cannot be discarded.

Reading

Dataset 1:
Reading results graph

Dataset 2:
Reading results graph
In the reading operation, we timed loading the dataset and printing its first 5 rows.
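
Roughly, the per-format measurement looks like the following (paths are illustrative; the real script averages 15 runs, as described above):

```python
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # session with the spark-avro package loaded

for fmt in ["csv", "json", "parquet", "orc", "avro"]:
    start = time.perf_counter()
    df = spark.read.format(fmt).option("header", True).load(f"out/dataset1_{fmt}")
    df.show(5)  # show() is the action that actually forces Spark to read the data
    print(f"{fmt}: {time.perf_counter() - start:.2f} s")
```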

Reading has a peculiar result: even though the file sizes differ considerably (JSON being roughly 3x larger than the rest), the only format with a visible, relevant difference in reading time was JSON. This comes down to the way JSON is written; it is costly for the Spark parser to chew through that amount of redundant schema metadata, and reading time grows steeply as the file grows. As for why CSV performed as well as ORC and Parquet: CSV is extremely simple, with no metadata such as a typed schema or field names, so it is quick for the Spark parser to read, split, and infer the column types, unlike ORC and, especially, Parquet, which carry a large amount of metadata that pays off for files with more fields, complex types, and larger volumes of data. The difference between Avro, Parquet, and ORC is minimal and varies with the state of the cluster/machine, concurrent tasks, and the data layout of the file. With these datasets it is hard to evaluate the difference in the reading operation; it becomes more evident when the files are scaled to several times the size of the ones we are working with.

In summary, CSV, Parquet, ORC, and Avro showed almost no difference in reading performance, while JSON cannot be considered an option when fast reading is required. Few cases prioritize reading alone; it is generally evaluated together with another task, such as a query. If you are looking for the most performant format for this operation, consider running your own tests.

Writing

Dataset 1:
Write results graph

Dataset 2:
Write results graph
In the writing operation, we read the .csv file and rewrote it in the respective format, counting only the writing time.
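
A sketch of the write measurement, assuming the session from the setup sketch above and illustrative paths; caching the DataFrame first keeps the timed step focused on the write itself rather than on re-reading the CSV:

```python
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("data/dataset1.csv", header=True, inferSchema=True)
df.cache().count()  # materialize once so the timed step is (mostly) the write itself

for fmt in ["csv", "json", "parquet", "orc", "avro"]:
    start = time.perf_counter()
    df.write.mode("overwrite").format(fmt).save(f"out/write_test_{fmt}")
    print(f"{fmt}: {time.perf_counter() - start:.2f} s")
```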

In writing there was a surprise: JSON was not the slowest format in the first dataset; ORC was. In the second dataset, however, JSON took the longest. The discrepancy comes from the second dataset having more columns, which for JSON means far more repeated metadata to write. ORC is a binary, typed format like Parquet, but it applies more aggressive optimization and compression techniques, which demand more processing power and time; this helps explain its query times (which we will see next) and its file sizes, which are smaller than Parquet's in almost all cases. CSV performed well because it is a very simple format, with no additional metadata such as a typed schema and no redundant metadata like JSON. At a larger scale, the more complex formats would start to outperform CSV. Avro also has its benefits and did very well on dataset 1, beating Parquet and ORC by a significant margin; this probably happened because the data layout favored Avro's optimizations, which differ from those of Parquet and ORC.

In summary, Avro, despite not being a format with much fame or community support, is a good choice when you want your files to be available quickly for stakeholders to consume. The gap starts to matter when you scale to several GBs of data, where a difference of 30-40 seconds turns into one of 20-30 minutes.

Query

Dataset 1:
Query results graph

Dataset 2:
Query results graph
In the query operation, the dataset was loaded, and a query with a single WHERE clause filtering on a unique value was performed, followed by printing the matching row.
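
A sketch of the query measurement, assuming the session from the setup sketch above; the column name and the filtered value are placeholders, since the exact predicate depends on the dataset:

```python
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

for fmt in ["csv", "json", "parquet", "orc", "avro"]:
    df = spark.read.format(fmt).option("header", True).load(f"out/dataset1_{fmt}")
    df.createOrReplaceTempView("transactions")
    start = time.perf_counter()
    # Placeholder predicate: filter on a (near-)unique value and print the matching row.
    spark.sql("SELECT * FROM transactions WHERE account = '8000ECA90'").show()
    print(f"{fmt}: {time.perf_counter() - start:.2f} s")
```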

In the second dataset, all formats performed well, and the scale of the graph gives the impression that Parquet did poorly; in reality the differences are minimal. Since dataset 2 is much smaller, its query results are very susceptible to external factors, so we will focus on explaining the results for the first dataset. As mentioned earlier, ORC performs well, even compared with Avro, which had excellent results in the other operations. Still, Parquet leads the ranking with the fastest queries. Why? The fact that Parquet is Spark's default format says a lot about how well the framework works with it. It benefits from various query optimization techniques, many of them long consolidated in DBMSs. One of the most famous is predicate pushdown, which pushes WHERE predicates down to the file scan, reducing the amount of data read and examined from disk; it is an optimization that Spark's Parquet data source exploits especially well.

Why do CSV and JSON lag so far behind? Here, CSV and JSON are not the problem; the truth is that Parquet and ORC are very well optimized. All the benefits mentioned earlier, such as schema metadata, binary encoding, and a columnar layout, give them a significant advantage. And where does Avro fit in, since it shares many of those benefits? In terms of query optimization, Avro lags far behind ORC and Parquet. One example is column projection, which reads only the columns used in the query rather than the entire dataset; because Avro is a row-based format, it cannot skip columns the way ORC and Parquet can. This is not the only reason ORC and Parquet pull so far ahead, but overall Avro falls far behind in query optimization.
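
You can see these optimizations in the physical plan: for a Parquet scan, explain() lists the pushed filters and the pruned read schema (the column name and path here are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("out/dataset1_parquet")
query = df.filter(F.col("amount") > 1000).select("amount")

# For the Parquet data source the plan shows "PushedFilters: [...]" (predicate pushdown)
# and a "ReadSchema" limited to the selected columns (column projection).
query.explain()
```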

In summary, when working with large files and running queries on them, whether simple or complex, you will want Parquet or ORC. Both carry many query optimizations that deliver results much faster than the other formats. The difference is already evident in files slightly smaller than dataset 1 and becomes even more apparent in larger files.

Conclusion

In a data engineering environment where you need to serve various stakeholders, consume from various sources, and store data in different storage systems, operations such as reading, writing, and querying are heavily affected by the file format. Here we saw the issues certain formats can present, evaluated against the main points that come up when building a data environment.

Parquet may be the "darling" of the community, and we highlighted some of its strengths, such as query performance, but we also showed that there are better options for certain scenarios, such as ORC for storage optimization.

The performance of these operations for each format also depends heavily on the tool you are using and how you are using it (environment and available resources). The results from Spark probably won't differ much from those of other engines such as DuckDB or Flink, but we recommend running your own tests before making any decision that will have a significant impact on other areas of the business.

Top comments (2)

Kudzai Murimi

Thanks for sharing with the community

Pedro H Goncalves

thanks for the comment