Apache Parquet File Anatomy: Row Groups, Column Chunks, Pages, and Metadata Explained 🧱📦

If you use Spark, Athena, Iceberg, Snowflake, DuckDB, or Pandas, you’ve probably worked with Parquet hundreds of times. But most of us first learn Parquet as a simple rule of thumb: it’s columnar, compressed, and great for analytics. That’s true, but it leaves out the most interesting part — why Parquet performs so well in the first place.

Under the hood, a Parquet file is not just a blob of compressed data. It has a deliberate internal structure made of row groups, column chunks, pages, and footer metadata, and that structure is exactly what enables column pruning, predicate pushdown, and efficient scans in modern query engines.

In this post, we’ll break down the anatomy of a Parquet file from the file boundary all the way down to individual pages, and then connect those pieces back to the real-world performance behavior you see in Spark, Iceberg, and Athena.

Why Parquet matters ⚡

Most analytical queries do not read every column and every row. They usually select a subset of columns, filter by a few predicates, and aggregate over large volumes of data. Parquet is designed specifically for that style of access, which is why it outperforms row-oriented formats like CSV for analytics-heavy workloads.

Instead of storing each record end-to-end, Parquet stores data column by column, while still grouping rows into larger units for efficient processing. That combination improves compression, reduces unnecessary I/O, and allows engines to skip chunks of data using metadata rather than brute-force scanning.

Start with the big picture 🗂️

The easiest way to understand Parquet is to think of it as a hierarchy:

  • A file contains one or more row groups.
  • Each row group contains one column chunk per column.
  • Each column chunk contains one or more pages.
  • The file ends with a footer that stores schema and metadata about those structures.

That may sound abstract at first, so here is the mental model I use: a Parquet file is like a mini warehouse 🏭, where rows are divided into sections, each section stores columns separately, and the catalog for the whole warehouse sits at the very end of the file.

The physical file layout 🧩

At the physical level, a Parquet file starts with a magic marker, stores row-group data in the body, and ends with footer metadata, the footer length, and another magic marker. The Apache Parquet specification documents this structure explicitly, with a 4-byte magic number, PAR1, at both the beginning and the end of the file.

Here is the high-level layout:

[PAR1][Row Group Data ...][File Metadata][Metadata Length][PAR1]

That footer-at-the-end design is more important than it looks. A reader can jump to the end of the file, inspect the metadata, understand the schema and row groups, and plan an efficient read before touching most of the actual data blocks.

A file-level diagram 🏗️

Parquet File
├── PAR1 magic (4 bytes)
├── Row Group 1
├── Row Group 2
├── ...
├── File Metadata (footer)
├── Metadata Length (4 bytes)
└── PAR1 magic (4 bytes)

This is the skeleton of every Parquet file: data first, metadata last.

Row groups: the first major building block 🧱

A row group is a horizontal partition of rows inside a single Parquet file. If a file contains one million rows, those rows may be split across multiple row groups, and each row group becomes a self-contained unit for reading and processing.

This matters because row groups are a natural unit for parallelism. Distributed engines can assign different row groups to different tasks, and metadata associated with each row group can help decide whether that row group needs to be read at all.

You can think of it like this:

Parquet File
├── Row Group 1 -> rows 1 to 250,000
├── Row Group 2 -> rows 250,001 to 500,000
├── Row Group 3 -> rows 500,001 to 750,000
└── Row Group 4 -> rows 750,001 to 1,000,000

The important nuance is that a row group is not stored row-by-row internally. It is still columnar inside, which is where column chunks come in.

Column chunks: where columnar storage shows up 🧵

Inside each row group, every column gets its own column chunk. That means for a row group containing id, country, and amount, Parquet stores one chunk for id, one for country, and one for amount.

This is the mechanism behind column pruning. If your query only needs country and amount, the engine can skip the id chunks entirely, which reduces both I/O and deserialization work.

Here is a simple view:

Row Group 1
├── Column Chunk: id
├── Column Chunk: country
└── Column Chunk: amount

At this point, you can already see why Parquet is so effective for analytics. Analytical queries rarely need every field in every row, and Parquet’s internal structure mirrors that reality.

Pages: the smallest units inside a chunk 📄

Each column chunk is further divided into pages, the smallest units used to store encoded data. Data pages hold the actual values, and when dictionary encoding is in use, they are preceded by a dictionary page containing the chunk's distinct values.

That means a column chunk is not one monolithic blob. It is a sequence of smaller blocks that can be encoded and compressed efficiently, while still fitting the overall columnar structure.

A useful diagram looks like this:

Column Chunk (country)
├── Dictionary Page (optional)
├── Data Page 1
├── Data Page 2
└── Data Page N

In practice, this page-level organization helps Parquet balance storage efficiency with read efficiency. The format can encode and compress data in manageable units instead of treating each column chunk as a single continuous stream.

Dictionary pages and encoding 📚

One of the most common Parquet optimizations is dictionary encoding. Instead of writing repeated string values over and over, Parquet can write a dictionary of unique values once and then store compact references in the data pages.

For a column like country, the dictionary might contain US, IN, and CA, and the data pages would store something closer to 0, 0, 1, 2 than full repeated strings. That reduces storage size and often improves downstream compression too.

This is one reason categorical columns often compress especially well in Parquet. Repeated patterns are easier to encode when similar values are physically grouped together in the same column chunk.

The footer: the real brain of the file 🧠

The most important part of a Parquet file is arguably not the data body but the footer. That footer stores file metadata such as the schema, row-group descriptions, and column-level information needed by readers to interpret the file efficiently.

Because the footer is written at the end of the file, readers can retrieve it first, inspect the contents, and decide what to read and what to skip. That is a huge part of why Parquet feels smart rather than brute-force.

At a high level, the footer can tell a reader:

  • What the schema is.
  • How many row groups exist.
  • Where each column chunk lives in the file.
  • What encodings and compression settings were used.
  • What statistics are available for pruning.

Metadata is what powers skipping 🚦

Parquet’s metadata is not just descriptive. It is actionable. The row-group and column metadata often includes statistics such as minimum value, maximum value, and null count, which allows query engines to avoid reading irrelevant data.

For example, if a row group’s event_date has a minimum of 2026-01-01 and a maximum of 2026-01-31, then a query filtering for March 2026 can skip that row group entirely. The engine does not need to inspect every row to know there is no match.

That optimization is the foundation of predicate pushdown and row-group pruning. Instead of reading first and filtering later, engines can use metadata to avoid unnecessary reads in the first place.

Predicate pushdown diagram 🎯

Query: WHERE event_date >= '2026-03-01'

Row Group 1 (event_date min 2026-01-01, max 2026-01-31) -> skipped
Row Group 2 (event_date min 2026-02-01, max 2026-02-28) -> skipped
Row Group 3 (event_date min 2026-03-01, max 2026-03-31) -> read

This is one of the most important performance ideas in Parquet. The file is designed so engines can make good decisions before scanning the full payload.

A concrete example 🧪

Let’s say you have this table:

id  country  amount
1   US       100
2   US       120
3   IN       900
4   CA       80

In a row-based file, those values are stored as complete records one after another. In Parquet, the values are stored by column inside each row group, so the country values sit together and the amount values sit together rather than being interleaved row-by-row.

Now imagine this query:

SELECT country
FROM sales
WHERE amount > 500

A Parquet-aware engine can use metadata to identify which row groups might contain amount > 500, read the relevant amount column chunks for filtering, and then read only the country column for matching records. It does not need to read every column for every row the way a plain text row format typically would.

Why compression works so well 🗜️

Parquet’s storage efficiency comes from a combination of columnar layout, encoding, and compression. Similar values tend to sit next to each other within a column, which usually makes them more compressible than mixed-value row-based storage.

For example, a status column containing repeated values like SUCCESS, SUCCESS, FAILED, SUCCESS is far easier to encode compactly when those values are grouped together than when they are scattered across full records containing timestamps, IDs, and free-form text.

That is why Parquet often ends up dramatically smaller than CSV while also being faster to scan for analytical use cases. Its internal organization works with compression instead of fighting it.

Why row group size is a tuning lever 🎛️

Row groups are not just a format detail. They are also a performance tuning lever. Larger row groups often improve compression and reduce metadata overhead, but they can reduce pruning granularity. Smaller row groups allow finer skipping and often more parallelism, but they introduce more metadata and may hurt compression efficiency.

This is one of the reasons output file design matters so much in distributed data systems. A well-formed Parquet file is not just about “using Parquet” — it is also about choosing file sizes and row-group sizing that match your workload.

What this means in Spark 🔥

In Spark, Parquet’s layout maps naturally to common optimizations like column pruning and predicate pushdown. When Spark can use Parquet statistics effectively, it avoids reading unnecessary row groups and often avoids materializing columns that are not selected by the query.

That means your file layout choices affect real job behavior. If your data is written into too many small files or poorly sized row groups, you may lose many of the benefits that Parquet is structurally capable of delivering.

What this means in Iceberg 🧊

Iceberg relies heavily on Parquet because Parquet already provides efficient columnar storage and file-level metadata patterns that work well for analytical reads. Iceberg adds another planning layer on top, but the scan efficiency still depends a lot on the properties of the underlying Parquet files.

In other words, Iceberg gives you table-level intelligence, but Parquet still does much of the physical storage work. Understanding row groups and statistics helps explain why good file compaction and sort strategy can matter so much in Iceberg-backed tables.

What this means in Athena 🏛️

Athena benefits from Parquet for the same core reasons: fewer bytes scanned, better compression, and the ability to skip irrelevant data using metadata and layout-aware reads. Since Athena pricing and performance are tightly tied to scanned data volume, Parquet’s structure can directly reduce both runtime and cost.

That is why converting CSV-based data lakes into partitioned and well-written Parquet often delivers an immediate practical benefit. The file format itself changes how much work the engine has to do.

A common misconception 🚫

A common misconception is that Parquet is just “a binary CSV with compression.” That is not really what it is. Parquet is a structured columnar storage format with typed schema metadata, row groups, column chunks, pages, and statistics-aware footers that analytical engines can exploit directly.

CSV is a simple row-based serialization format. Parquet is a storage format engineered for selective analytical access. Those are fundamentally different design goals.

Final mental model 🧠

If you only remember one thing, remember this:

  • Row groups partition rows into larger processing units.
  • Column chunks store one column’s data inside each row group.
  • Pages break column chunks into smaller encoded blocks.
  • Footer metadata tells engines what exists, where it lives, and what can be skipped.

Once that clicks, a lot of data engineering advice becomes easier to reason about. File sizing, pruning, partitioning, compaction, and scan performance all tie back to this physical layout.

If you work in Spark, Iceberg, Athena, or any modern analytical stack, understanding Parquet internals is one of those low-level concepts that pays off repeatedly. The format is doing much more than simply storing data — it is shaping how your engine thinks about reading it.

👉 Want to inspect this visually? Try it here: https://databro.dev/tools/parquet-inspector-plus/
