<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tatsuya Nishimura</title>
    <description>The latest articles on DEV Community by Tatsuya Nishimura (@nishimoo).</description>
    <link>https://dev.to/nishimoo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3706012%2F52d5bd51-7cf8-4f81-874e-ebe0e36cb764.png</url>
      <title>DEV Community: Tatsuya Nishimura</title>
      <link>https://dev.to/nishimoo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nishimoo"/>
    <language>en</language>
    <item>
      <title>Comparison of Apache Parquet and Apache Arrow</title>
      <dc:creator>Tatsuya Nishimura</dc:creator>
      <pubDate>Tue, 13 Jan 2026 03:58:30 +0000</pubDate>
      <link>https://dev.to/nishimoo/comparison-of-apache-parquet-and-apache-arrow-284m</link>
      <guid>https://dev.to/nishimoo/comparison-of-apache-parquet-and-apache-arrow-284m</guid>
      <description>&lt;h2&gt;
  
  
  Apache Parquet
&lt;/h2&gt;

&lt;p&gt;A column-oriented file format designed for efficient storage and querying of large-scale datasets. It reduces storage costs and I/O overhead through compression and encoding. Widely used in data lakes and in processing engines such as Hadoop, Spark, and BigQuery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Arrow
&lt;/h2&gt;

&lt;p&gt;A column-oriented in-memory format specification. It enables zero-copy data sharing across different processes and languages by using the same binary structure for in-memory processing, file storage (&lt;code&gt;.arrow&lt;/code&gt;/&lt;code&gt;.feather&lt;/code&gt;), inter-process communication (IPC), and network transfer (Flight).&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use Each
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Parquet&lt;/strong&gt;: Long-term storage and archiving, reducing storage costs, leveraging column statistics, integrating with engines such as Spark and BigQuery&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Arrow&lt;/strong&gt;: Sharing data across processes and languages, low-latency requirements, in-memory caching, IPC and Flight communication&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Design Purpose Differences
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Parquet&lt;/th&gt;
&lt;th&gt;Arrow&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;File storage and archiving&lt;/td&gt;
&lt;td&gt;In-memory processing and sharing (also for files and network)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compression&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Snappy / gzip / Zstd / LZ4&lt;/td&gt;
&lt;td&gt;LZ4 / Zstd (optional)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metadata&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Thrift (at footer, includes statistics)&lt;/td&gt;
&lt;td&gt;FlatBuffers (at header, no statistics)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Read Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Milliseconds to seconds&lt;/td&gt;
&lt;td&gt;Microseconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory Efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dense on disk (compressed), expanded when loaded&lt;/td&gt;
&lt;td&gt;Dense in memory, further reduced with buffer sharing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Updates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not supported (write new files only)&lt;/td&gt;
&lt;td&gt;Immutable, but in-memory recreation is fast and appending to files is easier&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Parquet Compression Defaults:&lt;/strong&gt; PyArrow and DuckDB use &lt;strong&gt;Snappy&lt;/strong&gt; by default (&lt;a href="https://arrow.apache.org/docs/python/parquet.html#compression-encoding-and-file-compatibility" rel="noopener noreferrer"&gt;PyArrow&lt;/a&gt;, &lt;a href="https://duckdb.org/docs/stable/data/parquet/overview.html" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt;). Polars uses &lt;strong&gt;Zstd&lt;/strong&gt; by default (&lt;a href="https://github.com/pola-rs/polars/blob/2a5c6a3de8d2d487a4032e6fbcbdb917e437ab22/crates/polars-io/src/parquet/write/writer.rs#L76" rel="noopener noreferrer"&gt;Polars implementation&lt;/a&gt;). Compression is optional in all cases.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Parquet ↔ Arrow Conversion
&lt;/h2&gt;

&lt;p&gt;Conversion takes a single line in each direction, but writing Parquet involves extra work such as encoding, compressing, and computing the statistics stored in the metadata, so it isn't necessarily fast.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyarrow.parquet&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pq&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyarrow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;

&lt;span class="c1"&gt;# Parquet → Arrow
&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data.parquet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# table is an Arrow array table representation
&lt;/span&gt;
&lt;span class="c1"&gt;# Arrow → Parquet
&lt;/span&gt;&lt;span class="n"&gt;pq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output.parquet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Data Type Compatibility
&lt;/h3&gt;

&lt;p&gt;Nearly one-to-one correspondence. Both formats support primitive types (int, float, string, etc.) as well as nested types (&lt;code&gt;struct&lt;/code&gt;, &lt;code&gt;list&lt;/code&gt;, &lt;code&gt;map&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regarding Nested Types:&lt;/strong&gt; Parquet uses &lt;strong&gt;Dremel encoding&lt;/strong&gt; (a combination of definition and repetition levels) for encoding nested types (&lt;a href="https://github.com/apache/parquet-format#nested-encoding" rel="noopener noreferrer"&gt;Parquet Format - Nested Encoding&lt;/a&gt;). Arrow, on the other hand, represents nested types as relationships between parent and child arrays; while the in-memory layout differs, the semantics are compatible. When reading a nested column from a Parquet file using Arrow, internal layout conversion is necessary, but the data meaning is preserved. (References: &lt;a href="https://arrow.apache.org/docs/format/Columnar.html#struct-layout" rel="noopener noreferrer"&gt;Arrow Columnar Format - Struct Layout&lt;/a&gt;, &lt;a href="https://arrow.apache.org/docs/format/Columnar.html#validity-bitmaps" rel="noopener noreferrer"&gt;Arrow Columnar Format - Nested type arrays&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  Binary Structure Differences
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Parquet Layout
&lt;/h3&gt;

&lt;p&gt;A Parquet file is a collection of pages, where each page contains compressed and encoded column data. On read, the metadata section at the footer is read first to determine "where each page is located," and then only the necessary pages are decompressed. Statistics allow skipping unnecessary data, making it advantageous for conditional queries on large-scale data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Column A, page 1: compressed]
[Column B, page 1: compressed]
[Column A, page 2: compressed]
[Column B, page 2: compressed]
...
[Footer metadata + Schema + Statistics]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Arrow Layout
&lt;/h3&gt;

&lt;p&gt;Arrow arrays consist of metadata and buffers that can be directly mapped to memory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Metadata: schema, buffer offsets, sizes]
[Buffer 0: validity bitmap (0 if null, 1 otherwise)]
[Buffer 1: values / offsets / data]
[Buffer 2: ...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Apache pretty much has everything you need.&lt;/p&gt;

</description>
      <category>parquet</category>
      <category>arrow</category>
    </item>
    <item>
      <title>Save on DuckDB + S3 Transfer Costs</title>
      <dc:creator>Tatsuya Nishimura</dc:creator>
      <pubDate>Mon, 12 Jan 2026 04:31:11 +0000</pubDate>
      <link>https://dev.to/nishimoo/save-on-duckdb-s3-transfer-costs-59i1</link>
      <guid>https://dev.to/nishimoo/save-on-duckdb-s3-transfer-costs-59i1</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Use Cloudflare R2, or run DuckDB on EC2 in the same region as your S3 bucket with Gateway Endpoint enabled.&lt;/p&gt;

&lt;h2&gt;
  
  
  A quick note
&lt;/h2&gt;

&lt;p&gt;Stick with Parquet.&lt;/p&gt;

&lt;h2&gt;
  
  
  How much data actually gets transferred?
&lt;/h2&gt;

&lt;p&gt;When you query a Parquet file on S3 through DuckDB, the whole file doesn't get downloaded. Instead, DuckDB uses &lt;strong&gt;HTTP Range Requests&lt;/strong&gt; to grab only the bytes it needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  The mechanics
&lt;/h3&gt;

&lt;p&gt;DuckDB fetches data in two passes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Metadata&lt;/strong&gt;: Range-request just the metadata section of the Parquet file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data&lt;/strong&gt;: Range-request only the columns and row groups needed by your query&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;"DuckDB always uses range requests, firstly to query the metadata only, then to fetch the required columns."&lt;br&gt;
— &lt;a href="https://github.com/duckdb/duckdb/pull/5405" rel="noopener noreferrer"&gt;PR #5405: HTTP parquet optimizations&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  A concrete example
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;column_a&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'s3://bucket/file.parquet'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DuckDB downloads only the bytes containing &lt;code&gt;column_a&lt;/code&gt;. So even with a 10GB file, if &lt;code&gt;column_a&lt;/code&gt; is just 100MB, you only transfer ~100MB.&lt;/p&gt;

&lt;p&gt;Even better—sometimes you don't transfer anything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'s3://bucket/file.parquet'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Parquet metadata includes row counts, so DuckDB can return your result &lt;strong&gt;without reading any data at all&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://duckdb.org/docs/stable/core_extensions/httpfs/https" rel="noopener noreferrer"&gt;DuckDB Official Documentation - HTTP(S) Support&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Filter and projection pushdown
&lt;/h3&gt;

&lt;p&gt;DuckDB's S3 reader can push filters and projections down to the storage layer, so even less data gets touched.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We're able to do partial reads via Range requests actually, so it should be fairly efficient."&lt;br&gt;
— &lt;a href="https://github.com/duckdb/duckdb/discussions/4559" rel="noopener noreferrer"&gt;Discussion #4559&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  So why do S3 bills get so nasty?
&lt;/h2&gt;

&lt;p&gt;Here's the catch: &lt;strong&gt;intra-region EC2-to-S3 transfers are free&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Yet somehow people end up with shocking bills. What's going on?&lt;/p&gt;

&lt;h3&gt;
  
  
  1. No S3 Gateway Endpoint
&lt;/h3&gt;

&lt;p&gt;Without a Gateway Endpoint, traffic from your VPC to S3 gets routed through NAT Gateway or the internet gateway.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Via NAT Gateway&lt;/strong&gt;: You pay $0.045/GB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Via Gateway Endpoint&lt;/strong&gt;: Free&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;"There is no additional charge for using gateway endpoints."&lt;br&gt;
— &lt;a href="https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-s3.html" rel="noopener noreferrer"&gt;AWS Official Documentation - Gateway endpoints for Amazon S3&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  2. Accessing across regions
&lt;/h3&gt;

&lt;p&gt;If your S3 bucket and EC2 are in different regions, AWS charges you $0.01–$0.02/GB for the privilege.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Going out to the internet
&lt;/h3&gt;

&lt;p&gt;Querying from your laptop or anything outside AWS? You pay $0.09/GB and up for internet egress.&lt;/p&gt;

&lt;h3&gt;
  
  
  The fix
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;EC2 + S3 Gateway Endpoint in the same region&lt;/strong&gt; = zero transfer charges.&lt;/p&gt;

&lt;p&gt;Querying Parquet from EC2 in your bucket's region beats downloading everything locally by a mile. The bigger your data, the bigger the savings.&lt;/p&gt;

&lt;p&gt;The downside? Standing up and configuring EC2 every time gets old fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other object storage options
&lt;/h2&gt;

&lt;p&gt;Let's compare alternatives. The key metric is egress—that's what kills your budget.&lt;/p&gt;

&lt;p&gt;Note: Ingress (uploading) is always free. With object storage, you pay to get your data back out.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/s3/pricing/" rel="noopener noreferrer"&gt;AWS S3 Pricing&lt;/a&gt; - "Data Transfer IN To Amazon S3 From Internet: $0.00 per GB"&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/storage/pricing#network-egress" rel="noopener noreferrer"&gt;GCS Pricing&lt;/a&gt; - "Network ingress: Free"&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://azure.microsoft.com/en-us/pricing/details/bandwidth/" rel="noopener noreferrer"&gt;Azure Bandwidth Pricing&lt;/a&gt; - "Data Transfer In: Free"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cloudflare R2
&lt;/h3&gt;

&lt;p&gt;Cloudflare R2: free egress, full stop.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;th&gt;Free Tier&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;$0.015/GB/month&lt;/td&gt;
&lt;td&gt;10GB/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Class A operations&lt;/td&gt;
&lt;td&gt;$4.50/million&lt;/td&gt;
&lt;td&gt;1M requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Class B operations&lt;/td&gt;
&lt;td&gt;$0.36/million&lt;/td&gt;
&lt;td&gt;10M requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Egress&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Free&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Unlimited&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you want to ditch the transfer bill entirely, R2 is your answer.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://www.cloudflare.com/developer-platform/products/r2/" rel="noopener noreferrer"&gt;Cloudflare R2 Pricing&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Backblaze B2
&lt;/h3&gt;

&lt;p&gt;Backblaze B2 keeps egress essentially free too.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;$6/TB/month ($0.006/GB/month)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Egress&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Free&lt;/strong&gt; (up to 3× your storage/month)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overage Egress&lt;/td&gt;
&lt;td&gt;$0.01/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Store 100GB, download 300GB free per month. Plus it's S3-compatible.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://www.backblaze.com/cloud-storage/pricing" rel="noopener noreferrer"&gt;Backblaze B2 Pricing&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Google Cloud Storage
&lt;/h3&gt;

&lt;p&gt;GCS hands out free transfers between services in the same region.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Transfer Type&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Same zone (private IP)&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Same region (GCS ↔ GCE, etc.)&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Different zones (same region)&lt;/td&gt;
&lt;td&gt;$0.01/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inter-region (e.g., between US regions)&lt;/td&gt;
&lt;td&gt;$0.02/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Outbound to internet&lt;/td&gt;
&lt;td&gt;$0.12/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Run DuckDB on a GCE instance in the same region as your data and you pay nothing.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://cloud.google.com/storage/pricing" rel="noopener noreferrer"&gt;Google Cloud Storage Pricing&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Azure Blob Storage
&lt;/h3&gt;

&lt;p&gt;Azure does the same for intra-region transfers.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Transfer Type&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Same Availability Zone&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inter-region (e.g., US to Canada)&lt;/td&gt;
&lt;td&gt;$0.02/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Outbound to internet&lt;/td&gt;
&lt;td&gt;$0.087/GB (first 100GB free/month)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Spin up an Azure VM in the same region as your storage account and transfers are free.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://azure.microsoft.com/en-us/pricing/details/storage/blobs/" rel="noopener noreferrer"&gt;Azure Blob Storage Pricing&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;th&gt;Egress&lt;/th&gt;
&lt;th&gt;Intra-region&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS S3&lt;/td&gt;
&lt;td&gt;$0.023/GB&lt;/td&gt;
&lt;td&gt;$0.09/GB&lt;/td&gt;
&lt;td&gt;Free (with Gateway Endpoint)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloudflare R2&lt;/td&gt;
&lt;td&gt;$0.015/GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Free&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backblaze B2&lt;/td&gt;
&lt;td&gt;$0.006/GB&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Free&lt;/strong&gt; (3× storage/month)&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GCS&lt;/td&gt;
&lt;td&gt;$0.020/GB&lt;/td&gt;
&lt;td&gt;$0.12/GB&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Azure Blob&lt;/td&gt;
&lt;td&gt;$0.018/GB&lt;/td&gt;
&lt;td&gt;$0.087/GB&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;For DuckDB queries&lt;/strong&gt;: Cloudflare R2 or Backblaze B2 eliminate egress entirely.&lt;br&gt;
&lt;strong&gt;From a cloud VM&lt;/strong&gt;: Use that cloud's storage in the same region and pay nothing.&lt;/p&gt;
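&lt;p&gt;To put rough numbers on it, here is a back-of-the-envelope helper using the headline per-GB rates from the table above (rates are hard-coded assumptions; check each provider's current pricing page):&lt;/p&gt;

```python
# Headline per-GB egress rates from the comparison table (USD; verify against
# each provider's current pricing page before relying on these numbers).
EGRESS_PER_GB = {
    "AWS S3": 0.09,
    "Cloudflare R2": 0.0,
    "Backblaze B2": 0.01,   # applies only beyond the free 3x-storage allowance
    "GCS": 0.12,
    "Azure Blob": 0.087,
}

def monthly_egress_cost(service, gigabytes):
    """Rough monthly egress bill for the given transfer volume."""
    return EGRESS_PER_GB[service] * gigabytes

# Example: querying 500 GB/month from outside the provider's network.
for service in EGRESS_PER_GB:
    print(f"{service}: ${monthly_egress_cost(service, 500):.2f}")
```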

&lt;h2&gt;
  
  
  Wrap up
&lt;/h2&gt;

&lt;p&gt;Want zero egress charges with DuckDB? Pick R2 or Backblaze B2—both eliminate them entirely.&lt;/p&gt;

&lt;p&gt;Running on a cloud VM? Pick that cloud's object storage, keep it in the same region, and you're fine. Setting up EC2 each time is annoying, but at least the transfer costs disappear.&lt;/p&gt;




&lt;p&gt;I build &lt;a href="https://firchy.com/products/duck/" rel="noopener noreferrer"&gt;observability tools&lt;/a&gt; with DuckDB + object storage.&lt;/p&gt;

</description>
      <category>duckdb</category>
      <category>s3</category>
      <category>parquet</category>
    </item>
  </channel>
</rss>
