Tatsuya Nishimura

Comparison of Apache Parquet and Apache Arrow

Apache Parquet

A column-oriented file format designed for efficient storage and querying of large-scale datasets. It reduces storage costs and I/O overhead through compression and encoding. Widely used in data lakes and in distributed processing frameworks such as Hadoop and Spark, as well as analytics engines like BigQuery.

Apache Arrow

A column-oriented in-memory format specification. It enables zero-copy data sharing across different processes and languages by using the same binary structure for in-memory processing, file storage (.arrow/.feather), inter-process communication (IPC), and network transfer (Flight).
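
As a minimal illustration of the "same binary structure" idea, the sketch below serializes a table to the Arrow IPC stream format and reads it back; the column names are placeholders, and a real consumer would typically live in another process or language.

import pyarrow as pa

# Build an Arrow table in memory
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Serialize to the Arrow IPC stream format; the bytes use the same
# columnar layout as the in-memory table, so readers can map them
# without a deserialization step
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# Read the stream back
reader = pa.ipc.open_stream(buf)
assert reader.read_all().equals(table)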

When to Use Each

  • Parquet: Long-term storage and archiving, reducing storage costs, leveraging statistics, integrating with data lakes like Spark and BigQuery

  • Arrow: Sharing data across processes and languages, low-latency requirements, in-memory caching, IPC and Flight communication

Design Purpose Differences

| Aspect | Parquet | Arrow |
| --- | --- | --- |
| Purpose | File storage and archiving | In-memory processing and sharing (also for files and network) |
| Compression | Snappy / gzip / Zstd / LZ4 | LZ4 / Zstd (optional) |
| Metadata | Thrift (in the footer, includes statistics) | FlatBuffers (in the header, no statistics) |
| Read Time | Seconds to milliseconds | Microseconds |
| Memory Efficiency | Dense on disk (compressed), expanded when loaded | Dense in memory, further reduced with buffer sharing |
| Updates | Not supported (write new files only) | Immutable, but fast to recreate in memory; appending to files is easier |

Parquet Compression Defaults: PyArrow and DuckDB use Snappy by default (PyArrow, DuckDB). Polars uses Zstd by default (Polars implementation). Compression is optional in all cases.
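
For example, the codec can be set explicitly when writing with PyArrow; a minimal sketch (file names are placeholders):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})

# PyArrow defaults to Snappy; override per file with `compression`
pq.write_table(table, 'default_snappy.parquet')
pq.write_table(table, 'zstd.parquet', compression='zstd')
pq.write_table(table, 'uncompressed.parquet', compression='none')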

Parquet ↔ Arrow Conversion

It can be done in a single line, but writing Parquet involves extra work such as encoding, compression, and computing the statistics stored in the footer metadata, so the conversion isn't necessarily fast.

import pyarrow.parquet as pq

# Parquet → Arrow
table = pq.read_table('data.parquet')
# table is a pyarrow.Table, the in-memory Arrow representation

# Arrow → Parquet
pq.write_table(table, 'output.parquet')

Data Type Compatibility

Nearly one-to-one correspondence. Both formats support primitive types (int, float, string, etc.) as well as nested types (struct, list, map).

Regarding Nested Types: Parquet uses Dremel encoding (a combination of definition and repetition levels) for encoding nested types (Parquet Format - Nested Encoding). Arrow, on the other hand, represents nested types as relationships between parent and child arrays; while the in-memory layout differs, the semantics are compatible. When reading a nested column from a Parquet file using Arrow, internal layout conversion is necessary, but the data meaning is preserved. (References: Arrow Columnar Format - Struct Layout, Arrow Columnar Format - Nested type arrays)
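
A minimal round-trip sketch of a nested schema (a list column and a struct column; the names are placeholders) showing the semantics surviving the layout conversion:

import pyarrow as pa
import pyarrow.parquet as pq

# A table with nested types
table = pa.table({
    "tags": pa.array([["a", "b"], [], ["c"]], type=pa.list_(pa.string())),
    "point": pa.array([{"x": 1, "y": 2}, {"x": 3, "y": 4}, None],
                      type=pa.struct([("x", pa.int64()), ("y", pa.int64())])),
})

# On disk: Dremel definition/repetition levels
# In memory: parent and child arrays
pq.write_table(table, 'nested.parquet')
assert pq.read_table('nested.parquet').equals(table)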

Binary Structure Differences

Parquet Layout

A Parquet file is a collection of pages, where each page contains compressed and encoded column data. On read, the metadata section at the footer is read first to determine "where each page is located," and then only the necessary pages are decompressed. Statistics allow skipping unnecessary data, making it advantageous for conditional queries on large-scale data.

[Column A, page 1: compressed]
[Column B, page 1: compressed]
[Column A, page 2: compressed]
[Column B, page 2: compressed]
...
[Footer metadata + Schema + Statistics]
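The footer metadata and per-column statistics can be inspected with PyArrow; a minimal sketch assuming an existing 'data.parquet' with at least one row group:

import pyarrow.parquet as pq

pf = pq.ParquetFile('data.parquet')

# Footer metadata: row groups, schema, and statistics
meta = pf.metadata
print(meta.num_rows, meta.num_row_groups)

# Min/max statistics for the first column of the first row group;
# these are what engines use to skip row groups in conditional queries
stats = meta.row_group(0).column(0).statistics
print(stats.min, stats.max)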

Arrow Layout

Arrow arrays consist of metadata and buffers that can be directly mapped to memory.

[Metadata: schema, buffer offsets, sizes]
[Buffer 0: validity bitmap (0 if null, 1 otherwise)]
[Buffer 1: values / offsets / data]
[Buffer 2: ...]
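A minimal sketch of that direct mapping, writing the Arrow IPC file format and memory-mapping it back ('data.arrow' is a placeholder name):

import pyarrow as pa

table = pa.table({"x": [1, 2, 3]})

# Write the Arrow IPC file format (.arrow)
with pa.OSFile('data.arrow', 'wb') as f:
    with pa.ipc.new_file(f, table.schema) as writer:
        writer.write_table(table)

# Memory-map it back: buffers are used in place, with no
# decompression or deserialization step
with pa.memory_map('data.arrow', 'r') as source:
    loaded = pa.ipc.open_file(source).read_all()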

Closing Thoughts

Apache pretty much has everything you need.
