DEV Community

Vani
Vani

Posted on • Updated on

Python TSV to Parquet File Format.

TSV:

Tab-separated values (TSV) is a simple, text-based file format for storing tabular data.[1] Records are separated by newlines, and values within a record are separated by tab characters. The TSV format is thus a delimiter-separated values format, similar to comma-separated values.

Image description

Parquet:
What is Parquet? Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

Image description

Benefits of Parquet

  1. Good for storing big data of any kind (structured data tables, images, videos, documents).
  2. Saves on cloud storage space by using highly efficient column-wise compression, and flexible encoding schemes for columns with different data types.
  3. Increased data throughput and performance using techniques like data skipping, whereby queries that fetch specific column values need not read the entire row of data.

Advantages of Storing Data in a Columnar Format:

  1. Columnar storage like Apache Parquet is designed to bring efficiency compared to row-based files like CSV. When querying, columnar storage you can skip over the non-relevant data very quickly. As a result, aggregation queries are less time-consuming compared to row-oriented databases. This way of storage has translated into hardware savings and minimized latency for accessing data.
  2. Apache Parquet is built from the ground up. Hence it is able to support advanced nested data structures. The layout of Parquet data files is optimized for queries that process large volumes of data, in the gigabyte range for each individual file.
  3. Parquet is built to support flexible compression options and efficient encoding schemes. As the data type for each column is quite similar, the compression of each column is straightforward (which makes queries even faster). Data can be compressed by using one of the several codecs available; as a result, different data files can be compressed differently.
  4. Apache Parquet works best with interactive and serverless technologies like AWS Athena, Amazon Redshift Spectrum, Google BigQuery and Google Dataproc.

Python Snippet:

Image description

TAB FILE:

Image description

PARQUET RESULT:

Image description

Top comments (0)