Tatsuya Nishimura

Comparison of Apache Parquet and Apache Arrow

Apache Parquet

A column-oriented file format designed for efficient storage and querying of large-scale datasets. It reduces storage costs and I/O overhead through compression and encoding. Widely used in data lakes and in distributed processing frameworks such as Hadoop and Spark, as well as analytics engines like BigQuery.

Apache Arrow

A column-oriented in-memory format specification. It enables zero-copy data sharing across different processes and languages by using the same binary structure for in-memory processing, file storage (.arrow/.feather), inter-process communication (IPC), and network transfer (Flight).
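
As a minimal illustration of the "same binary structure" idea, the sketch below serializes a table to the Arrow IPC stream format and reads it back; the column names are placeholders, and a real consumer would typically live in another process or language.

import pyarrow as pa

# Build an Arrow table in memory
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Serialize to the Arrow IPC stream format; the bytes use the same
# columnar layout as the in-memory table, so readers can map them
# without a deserialization step
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# Read the stream back
reader = pa.ipc.open_stream(buf)
assert reader.read_all().equals(table)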

When to Use Each

  • Parquet: Long-term storage and archiving, reducing storage costs, leveraging statistics, integrating with data lakes like Spark and BigQuery

  • Arrow: Sharing data across processes and languages, low-latency requirements, in-memory caching, IPC and Flight communication

Design Purpose Differences

| Aspect | Parquet | Arrow |
| --- | --- | --- |
| Purpose | File storage and archiving | In-memory processing and sharing (also for files and network) |
| Compression | Snappy / gzip / Zstd / LZ4 | LZ4 / Zstd (optional) |
| Metadata | Thrift (in the footer, includes statistics) | FlatBuffers (in the header, no statistics) |
| Read Time | Seconds to milliseconds | Microseconds |
| Memory Efficiency | Dense on disk (compressed), expanded when loaded | Dense in memory, further reduced with buffer sharing |
| Updates | Not supported (write new files only) | Immutable, but fast to recreate in memory; appending to files is easier |

Parquet Compression Defaults: PyArrow and DuckDB use Snappy by default (PyArrow, DuckDB). Polars uses Zstd by default (Polars implementation). Compression is optional in all cases.
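
For example, the codec can be set explicitly when writing with PyArrow; a minimal sketch (file names are placeholders):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})

# PyArrow defaults to Snappy; override per file with `compression`
pq.write_table(table, 'default_snappy.parquet')
pq.write_table(table, 'zstd.parquet', compression='zstd')
pq.write_table(table, 'uncompressed.parquet', compression='none')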

Parquet ↔ Arrow Conversion

It can be done in a single line, but writing Parquet involves extra work such as encoding, compression, and computing the statistics stored in the footer metadata, so the conversion isn't necessarily fast.

import pyarrow.parquet as pq

# Parquet → Arrow
table = pq.read_table('data.parquet')
# table is a pyarrow.Table, the in-memory Arrow representation

# Arrow → Parquet
pq.write_table(table, 'output.parquet')

Data Type Compatibility

Nearly one-to-one correspondence. Both formats support primitive types (int, float, string, etc.) as well as nested types (struct, list, map).

Regarding Nested Types: Parquet uses Dremel encoding (a combination of definition and repetition levels) for encoding nested types (Parquet Format - Nested Encoding). Arrow, on the other hand, represents nested types as relationships between parent and child arrays; while the in-memory layout differs, the semantics are compatible. When reading a nested column from a Parquet file using Arrow, internal layout conversion is necessary, but the data meaning is preserved. (References: Arrow Columnar Format - Struct Layout, Arrow Columnar Format - Nested type arrays)
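
A minimal round-trip sketch of a nested schema (a list column and a struct column; the names are placeholders) showing the semantics surviving the layout conversion:

import pyarrow as pa
import pyarrow.parquet as pq

# A table with nested types
table = pa.table({
    "tags": pa.array([["a", "b"], [], ["c"]], type=pa.list_(pa.string())),
    "point": pa.array([{"x": 1, "y": 2}, {"x": 3, "y": 4}, None],
                      type=pa.struct([("x", pa.int64()), ("y", pa.int64())])),
})

# On disk: Dremel definition/repetition levels
# In memory: parent and child arrays
pq.write_table(table, 'nested.parquet')
assert pq.read_table('nested.parquet').equals(table)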

Binary Structure Differences

Parquet Layout

A Parquet file is a collection of pages, where each page contains compressed and encoded column data. On read, the metadata section at the footer is read first to determine "where each page is located," and then only the necessary pages are decompressed. Statistics allow skipping unnecessary data, making it advantageous for conditional queries on large-scale data.

[Column A, page 1: compressed]
[Column B, page 1: compressed]
[Column A, page 2: compressed]
[Column B, page 2: compressed]
...
[Footer metadata + Schema + Statistics]
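The footer metadata and per-column statistics can be inspected with PyArrow; a minimal sketch assuming an existing 'data.parquet' with at least one row group:

import pyarrow.parquet as pq

pf = pq.ParquetFile('data.parquet')

# Footer metadata: row groups, schema, and statistics
meta = pf.metadata
print(meta.num_rows, meta.num_row_groups)

# Min/max statistics for the first column of the first row group;
# these are what engines use to skip row groups in conditional queries
stats = meta.row_group(0).column(0).statistics
print(stats.min, stats.max)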

Arrow Layout

Arrow arrays consist of metadata and buffers that can be directly mapped to memory.

[Metadata: schema, buffer offsets, sizes]
[Buffer 0: validity bitmap (0 if null, 1 otherwise)]
[Buffer 1: values / offsets / data]
[Buffer 2: ...]
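A minimal sketch of that direct mapping, writing the Arrow IPC file format and memory-mapping it back ('data.arrow' is a placeholder name):

import pyarrow as pa

table = pa.table({"x": [1, 2, 3]})

# Write the Arrow IPC file format (.arrow)
with pa.OSFile('data.arrow', 'wb') as f:
    with pa.ipc.new_file(f, table.schema) as writer:
        writer.write_table(table)

# Memory-map it back: buffers are used in place, with no
# decompression or deserialization step
with pa.memory_map('data.arrow', 'r') as source:
    loaded = pa.ipc.open_file(source).read_all()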

Closing Thoughts

Apache pretty much has everything you need.
