Apache Parquet
A column-oriented file format designed for efficient storage and querying of large-scale datasets. It reduces storage costs and I/O overhead through compression and encoding, and is widely used in data lakes and by processing engines such as Hadoop, Spark, and BigQuery.
Apache Arrow
A column-oriented in-memory format specification. It enables zero-copy data sharing across different processes and languages by using the same binary structure for in-memory processing, file storage (.arrow/.feather), inter-process communication (IPC), and network transfer (Flight).
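As a rough illustration of the zero-copy idea, a table can be written in the IPC file format and memory-mapped back so that its buffers are referenced straight from the file rather than copied. This is a minimal PyArrow sketch; the file name data.arrow is just a placeholder.
import pyarrow as pa

# Build a small Arrow table in memory
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Write it in the Arrow IPC file format (.arrow / .feather)
with pa.OSFile("data.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Memory-map the file: buffers are read directly from the mapping
# instead of being copied into process memory
with pa.memory_map("data.arrow", "r") as source:
    loaded = pa.ipc.open_file(source).read_all()

print(loaded.num_rows)  # 3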
When to Use Each
Parquet: Long-term storage and archiving, reducing storage costs, leveraging statistics, integrating with data-lake engines like Spark and BigQuery
Arrow: Sharing data across processes and languages, low-latency requirements, in-memory caching, IPC and Flight communication
Design Purpose Differences
| Aspect | Parquet | Arrow |
|---|---|---|
| Purpose | File storage and archiving | In-memory processing and sharing (also for files and network) |
| Compression | Snappy / gzip / Zstd / LZ4 | LZ4 / Zstd (optional) |
| Metadata | Thrift (in the footer, includes statistics) | FlatBuffers (in the header, no statistics) |
| Read Time | Milliseconds to seconds | Microseconds |
| Memory Efficiency | Dense on disk (compressed), expanded when decompressed on load | Dense in memory; zero-copy buffer sharing avoids extra copies |
| Updates | Not supported (write new files instead) | Arrays are immutable, but in-memory rebuilds are fast and appending to IPC files is easier |
Parquet Compression Defaults: PyArrow and DuckDB use Snappy by default (PyArrow, DuckDB), while Polars defaults to Zstd (Polars implementation). Compression is optional in all cases.
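For example, the codec can be chosen explicitly when writing with PyArrow; a small sketch with placeholder file names:
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": list(range(1_000))})

# PyArrow's default codec is Snappy
pq.write_table(table, "default_snappy.parquet")

# Explicit Zstd, matching Polars' default
pq.write_table(table, "explicit_zstd.parquet", compression="zstd")

# No compression at all
pq.write_table(table, "uncompressed.parquet", compression="none")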
Parquet ↔ Arrow Conversion
The conversion itself is a one-liner in each direction, but writing Parquet involves extra work such as encoding, compression, and computing the statistics stored in the footer metadata, so it is not necessarily fast.
import pyarrow.parquet as pq
import pyarrow as pa
# Parquet → Arrow
table = pq.read_table('data.parquet')
# table is a pyarrow.Table backed by Arrow columnar memory
# Arrow → Parquet
pq.write_table(table, 'output.parquet')
Data Type Compatibility
Nearly one-to-one correspondence. Both formats support primitive types (int, float, string, etc.) as well as nested types (struct, list, map).
Regarding Nested Types: Parquet uses Dremel encoding (a combination of definition and repetition levels) for encoding nested types (Parquet Format - Nested Encoding). Arrow, on the other hand, represents nested types as relationships between parent and child arrays; while the in-memory layout differs, the semantics are compatible. When reading a nested column from a Parquet file using Arrow, internal layout conversion is necessary, but the data meaning is preserved. (References: Arrow Columnar Format - Struct Layout, Arrow Columnar Format - Nested type arrays)
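As a rough sketch of that round trip (the file name is a placeholder), a list-of-struct column keeps its meaning even though the on-disk and in-memory layouts differ:
import pyarrow as pa
import pyarrow.parquet as pq

# A column whose values are lists of {x, y} structs
nested_type = pa.list_(pa.struct([("x", pa.int64()), ("y", pa.string())]))
col = pa.array(
    [[{"x": 1, "y": "a"}], [], [{"x": 2, "y": "b"}, {"x": 3, "y": "c"}]],
    type=nested_type,
)
table = pa.table({"points": col})

# Arrow -> Parquet: written with definition/repetition levels
pq.write_table(table, "nested.parquet")

# Parquet -> Arrow: rebuilt as parent/child arrays in memory
roundtrip = pq.read_table("nested.parquet")
print(roundtrip.column("points").to_pylist())
# [[{'x': 1, 'y': 'a'}], [], [{'x': 2, 'y': 'b'}, {'x': 3, 'y': 'c'}]]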
Binary Structure Differences
Parquet Layout
A Parquet file is a collection of pages, where each page contains compressed and encoded column data. On read, the metadata section at the footer is read first to determine "where each page is located," and then only the necessary pages are decompressed. Statistics allow skipping unnecessary data, making it advantageous for conditional queries on large-scale data.
[Column A, page 1: compressed]
[Column B, page 1: compressed]
[Column A, page 2: compressed]
[Column B, page 2: compressed]
...
[Footer metadata + Schema + Statistics]
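The footer can be inspected directly with PyArrow to see the statistics that enable this skipping; a minimal sketch against a placeholder data.parquet:
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")

# Footer metadata: schema, row groups, and per-column statistics
meta = pf.metadata
print(meta.num_row_groups, meta.num_columns)

# Min/max statistics of the first column chunk in the first row group;
# engines use these to skip row groups and pages that cannot match a filter
stats = meta.row_group(0).column(0).statistics
print(stats.min, stats.max)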
Arrow Layout
Arrow arrays consist of metadata and buffers that can be directly mapped to memory.
[Metadata: schema, buffer offsets, sizes]
[Buffer 0: validity bitmap (0 if null, 1 otherwise)]
[Buffer 1: values / offsets / data]
[Buffer 2: ...]
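These buffers can be observed from Python; a small sketch with an arbitrary string array:
import pyarrow as pa

arr = pa.array(["foo", None, "bar"])

# For a string array: buffer 0 is the validity bitmap,
# buffer 1 holds int32 offsets, buffer 2 holds the UTF-8 bytes
for i, buf in enumerate(arr.buffers()):
    print(i, None if buf is None else buf.size)

print(arr.null_count)  # 1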
Closing Thoughts
Apache pretty much has everything you need.