Prithvi S

Posted on Jun 6

How Apache Iceberg Tracks Billions of Files Without a Central Database


The Problem With Hive-Style Partitioning

If you have ever managed a data lake with Hive-style partitioning, you know the pain. A table with 10,000 partitions means 10,000 directories on HDFS or S3. Querying WHERE year=2024 requires listing objects in a directory tree. Schema changes are risky. Time travel is impossible. And the metadata lives in a separate metastore that becomes a bottleneck at scale.

Apache Iceberg solves this by rethinking how table metadata is stored. Instead of relying on a central database to track every file, Iceberg builds a self-contained metadata tree inside the file system itself. The result is a format that supports atomic commits, schema evolution, time travel, and hidden partitioning - all without a heavyweight metastore.

In this post, we will walk through how Iceberg's metadata hierarchy works, why the catalog pointer is the only mutable piece, and how this design enables query engines to prune files without touching the data layer.

The Five-Layer Metadata Tree

Iceberg organizes a table into five layers. The diagram shows them from top to bottom; the sections that follow walk through them from the bottom up:

Catalog (pointer)
  ↓
Metadata File (JSON)
  ↓
Manifest List
  ↓
Manifest Files
  ↓
Data Files (Parquet/ORC/Avro)

Each layer has a specific job. Together, they form a complete picture of the table at any point in time.

Layer 1: Data Files

At the bottom are the actual data files - Parquet, ORC, or Avro files containing rows. These are immutable. Once written, a data file never changes. Updates and deletes produce new files rather than modifying existing ones.

What makes Iceberg different is that these files are not organized by directory structure. Instead, their location and metadata are tracked in the layers above. This means you can change partitioning strategies without moving files. Old data stays where it is. New data uses the new layout. Both coexist in the same table.

Layer 2: Manifest Files

Manifest files are the workhorses of Iceberg's metadata system. Each manifest is a small Avro file that tracks a subset of the table's data files. For each data file, it stores:

File path and format
Partition values
Column-level statistics: lower and upper bounds, value counts, null counts
Row counts and file sizes

These statistics are the key to query performance. When a query engine plans a scan, it reads the manifests before opening any data file. If the query filters on amount > 1000 and the manifest says a file's upper bound for amount is 500, that file is skipped entirely. No I/O to the data file. No decompression. No parsing.

This is file-level pruning, and it happens before any data is read.
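
A minimal sketch of that check, in Python. The FileStats record and the may_contain_greater_than helper are hypothetical names used for illustration, not Iceberg's API; real manifests store bounds per field ID in a binary encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical stand-in for one manifest entry (not Iceberg's real schema)."""
    path: str
    lower: dict  # column name -> smallest value in the file
    upper: dict  # column name -> largest value in the file


def may_contain_greater_than(stats: FileStats, column: str, value) -> bool:
    """Return False only when the file provably has no row with column > value."""
    upper = stats.upper.get(column)
    if upper is None:
        # No bound recorded: the file cannot be ruled out, so keep it.
        return True
    return upper > value


manifest = [
    FileStats("data/a.parquet", {"amount": 10}, {"amount": 500}),
    FileStats("data/b.parquet", {"amount": 200}, {"amount": 4000}),
    FileStats("data/c.parquet", {}, {}),  # written without stats
]

# Plan a scan for: WHERE amount > 1000
to_scan = [f.path for f in manifest if may_contain_greater_than(f, "amount", 1000)]
print(to_scan)
```

The check is conservative by design: a file is dropped only when its bounds prove it cannot match, so a file with missing statistics is still scanned.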

Layer 3: Manifest List

A manifest list is a snapshot-level file that tracks all manifests for a particular table snapshot. It contains metadata about each manifest:

Manifest file path
Partition spec ID
Added/deleted/existing file counts
Row counts
Partition field summaries: lower and upper bounds of the partition values in the manifest, and whether any are null

The manifest list provides a second level of pruning. Before reading individual manifests, the query engine can check if a manifest is even relevant to the query. If a manifest only contains files for year=2023 and the query asks for year=2024, the entire manifest is skipped.

This two-level pruning - manifest list first, then manifest files - is what makes Iceberg efficient even with millions of data files.
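
The first pruning level can be sketched the same way. This assumes a table partitioned by a single year field; ManifestSummary and manifests_for_year are hypothetical names, and the file paths are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestSummary:
    """Hypothetical manifest-list entry holding the range of one partition field."""
    path: str
    year_lower: int
    year_upper: int


def manifests_for_year(manifest_list, year):
    """Keep only manifests whose partition range can include the requested year."""
    return [m.path for m in manifest_list if m.year_lower <= year <= m.year_upper]


manifest_list = [
    ManifestSummary("metadata/m-2023.avro", 2023, 2023),
    ManifestSummary("metadata/m-2023-2024.avro", 2023, 2024),
    ManifestSummary("metadata/m-2024.avro", 2024, 2024),
]

# Plan a scan for: WHERE year = 2024
selected = manifests_for_year(manifest_list, 2024)
print(selected)
```

Only the selected manifests are opened afterwards, so a manifest covering nothing but 2023 costs a single range comparison rather than a file read.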

Layer 4: Metadata File

The metadata file is a JSON document that contains the complete table state:

Table schema (with field IDs, not just names)
Partition specs (current and historical)
All snapshots and their manifest list paths
Current snapshot pointer
Table properties and configurations
Snapshot history

This is the "brain" of the table. It knows what columns exist, how the table is partitioned, and what snapshots are available. Because it is a single file, it can be atomically replaced. When a write commits, it creates a new metadata file and updates the catalog pointer. The old metadata file is still there - it just is not the current metadata file anymore.

This is how time travel works. Each snapshot entry in the metadata file points to an immutable manifest list that describes the whole table as of that commit. Query by snapshot ID or timestamp, and you get the exact table state at that moment.
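
A sketch of how a timestamp lookup can be resolved, assuming the metadata file has already been parsed into a dict. The keys snapshots, snapshot-id, timestamp-ms, manifest-list, and current-snapshot-id follow the Iceberg table spec; the snapshot_as_of helper and the sample values are hypothetical.

```python
def snapshot_as_of(metadata: dict, timestamp_ms: int) -> dict:
    """Return the latest snapshot committed at or before timestamp_ms."""
    candidates = [s for s in metadata["snapshots"] if s["timestamp-ms"] <= timestamp_ms]
    if not candidates:
        raise ValueError("no snapshot exists at or before the requested time")
    return max(candidates, key=lambda s: s["timestamp-ms"])


# Sample values are invented for the example.
metadata = {
    "current-snapshot-id": 3,
    "snapshots": [
        {"snapshot-id": 1, "timestamp-ms": 1_000, "manifest-list": "metadata/snap-1.avro"},
        {"snapshot-id": 2, "timestamp-ms": 2_000, "manifest-list": "metadata/snap-2.avro"},
        {"snapshot-id": 3, "timestamp-ms": 3_000, "manifest-list": "metadata/snap-3.avro"},
    ],
}

snapshot = snapshot_as_of(metadata, 2_500)
print(snapshot["snapshot-id"], snapshot["manifest-list"])
```

Once the snapshot is chosen, the rest of the read path is unchanged: the engine follows that snapshot's manifest list instead of the current one.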

Layer 5: The Catalog

The catalog is the only mutable piece in the entire system. Its job is simple: store a pointer to the current metadata file. That is it.

Common catalog implementations include:

Hive Metastore - stores the metadata file path in a table property
Hadoop Catalog - uses a file in the warehouse directory
REST Catalog - exposes the pointer via HTTP API (what Apache Polaris uses)
Glue Catalog - AWS Glue Data Catalog integration

When a writer commits a transaction, it:

Writes new data files
Writes new manifests and manifest list
Writes a new metadata file
Updates the catalog pointer atomically (compare-and-swap)

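If two writers try to commit simultaneously, only one pointer swap succeeds. The other writer reloads the new metadata file, checks that its changes still apply, and retries. This optimistic concurrency gives all-or-nothing commits without a distributed transaction manager. Iceberg supports serializable and snapshot isolation; which one applies depends on the operation and on how the engine and table are configured.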
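
The commit loop can be sketched with an in-memory pointer. Everything here is hypothetical: InMemoryCatalog uses a lock to stand in for whatever atomic primitive a real catalog provides, and next_version stands in for writing a new metadata file on top of the base.

```python
import threading


class InMemoryCatalog:
    """Hypothetical catalog holding one pointer; a lock stands in for the atomic swap."""

    def __init__(self, location: str):
        self._location = location
        self._lock = threading.Lock()

    def current(self) -> str:
        return self._location

    def compare_and_swap(self, expected: str, new: str) -> bool:
        """Move the pointer to `new` only if it still equals `expected`."""
        with self._lock:
            if self._location != expected:
                return False
            self._location = new
            return True


def next_version(base: str) -> str:
    """Hypothetical stand-in for writing a new metadata file on top of `base`."""
    number = int(base.rsplit("/v", 1)[1].split(".")[0])
    return f"metadata/v{number + 1}.json"


def commit(catalog: InMemoryCatalog, max_retries: int = 3) -> str:
    """Optimistic commit: rebuild on the latest base and retry if the swap loses."""
    for _ in range(max_retries):
        base = catalog.current()
        new_location = next_version(base)
        if catalog.compare_and_swap(base, new_location):
            return new_location
    raise RuntimeError("commit failed after retries")


catalog = InMemoryCatalog("metadata/v1.json")
stale_base = catalog.current()  # writer A reads the pointer
first = commit(catalog)         # writer B commits first
lost = catalog.compare_and_swap(stale_base, next_version(stale_base))  # A's swap is rejected
retried = commit(catalog)       # A retries on top of B's commit
print(first, lost, retried)
```

The losing writer never overwrites the winner's commit. It only learns that its base is stale and has to rebuild on the new one.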

Why This Design Matters

No Central Metadata Database

Unlike Hive, which stores partition locations and file lists in the metastore, Iceberg keeps all metadata in the file system. The metastore only stores a single pointer. This removes the bottleneck of central metadata storage and makes the table self-describing.

Atomic Commits

Every write creates a new metadata file. The commit is atomic because it is a single pointer swap. If the commit fails, the table is unchanged. Partial writes are invisible to readers.

True Time Travel

Because the metadata file records every retained snapshot, and each snapshot points to its own manifest list, you can query the table as of any snapshot that has not been expired. This is not a log-based approach where you replay changes. It is a pointer-based approach where each snapshot is a self-contained view.

Schema and Partition Evolution

The metadata file tracks all schemas and partition specs by ID. Columns in data files are matched by field ID rather than by name, and each manifest records the partition spec its files were written with. New files use the current schema and spec. Query engines handle the mapping automatically. You can add, rename, drop, or reorder columns without rewriting data.
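
A small sketch of why field IDs make renames safe. The schemas, the row, and the project helper are hypothetical, and real readers resolve IDs inside the Parquet, ORC, or Avro reader rather than on Python dicts.

```python
# Hypothetical schemas: field ID -> column name, as tracked in table metadata.
schema_v1 = {1: "id", 2: "amt"}
schema_v2 = {1: "id", 2: "amount", 3: "currency"}  # field 2 renamed, field 3 added

# A row read from a data file written under schema_v1, keyed by field ID.
old_row = {1: 42, 2: 99.5}


def project(row_by_field_id: dict, read_schema: dict) -> dict:
    """Map stored values onto the read schema by field ID; absent fields become None."""
    return {name: row_by_field_id.get(fid) for fid, name in read_schema.items()}


projected = project(old_row, schema_v2)
print(projected)
```

The old file is never rewritten. The rename only changes the name attached to field 2, and the added column reads as null for rows written before it existed.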

Engine Independence

The metadata tree is just files. Any engine that can read the format can query the table. Spark, Trino, Flink, Presto, Hive, Impala - they all read the same metadata files. There is no engine-specific metadata store.

The Metadata Tree in Practice

Let us look at what happens when you insert data into an Iceberg table:

Write data files - Spark writes Parquet files to the warehouse directory
Write manifest file - Iceberg creates a manifest listing the new data files, with partition values and column stats
Write manifest list - A new manifest list references the new manifest plus existing manifests
Write metadata file - A new metadata file points to the new manifest list, includes the new schema if it changed, and adds a new snapshot
Update catalog - The catalog pointer is atomically updated to the new metadata file

For readers, the process is:

Read catalog - Get the current metadata file pointer
Read metadata - Load schema, partition specs, and current snapshot
Read manifest list - Load the manifest list for the snapshot
Prune manifests - Skip manifests that do not match the query filter
Read manifests - Load remaining manifests, prune data files using column stats
Read data files - Scan only the remaining files

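This is why query planning stays fast even on very large tables. The engine reads a handful of metadata files instead of listing directories, so planning cost tracks the metadata that survives pruning rather than the total number of data files. The metadata tree is a pre-built index, stored in the file system, readable by any engine.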

Comparing to Hive

Feature	Hive	Iceberg
Metadata location	Central metastore	Self-contained in file system
File tracking	Directory listing	Manifest files with stats
Schema evolution	Limited, risky	Full, safe with field IDs
Partitioning	Directory-based	Hidden, transform-based
Time travel	Not supported	Built-in, per snapshot
Atomic commits	Not guaranteed	CAS pointer swap
Concurrency	Risky	Optimistic, serializable

The Hidden Cost: Metadata Files Accumulate

Every write creates new metadata files. Over time, these accumulate. Iceberg handles this with:

Snapshot expiration - Old snapshots can be removed (configurable retention)
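Orphan file cleanup - Files that no table metadata references, such as leftovers from failed writes, are removed
Manifest and metadata file cleanup - Small manifests can be rewritten into fewer, larger ones, and old metadata files can be deleted after commits (see the write.metadata.delete-after-commit.enabled and write.metadata.previous-versions-max table properties)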

These are maintenance operations that you run or schedule yourself, typically as background jobs. They commit like any other write, so they do not block readers. But they are essential for keeping metadata size manageable over months of continuous writes.
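
Snapshot expiration can be sketched as two steps: pick the snapshots outside the retention window, then find the files that only those snapshots reference. The files key and both helpers are hypothetical simplifications; a real snapshot reaches its data files through a manifest list and manifests.

```python
def expire_snapshots(snapshots, current_id, older_than_ms):
    """Split snapshots into kept and expired; the current snapshot is always kept."""
    kept, expired = [], []
    for snap in snapshots:
        if snap["snapshot-id"] == current_id or snap["timestamp-ms"] >= older_than_ms:
            kept.append(snap)
        else:
            expired.append(snap)
    return kept, expired


def unreferenced_files(kept, expired):
    """Files reachable only from expired snapshots, and therefore safe to delete."""
    still_needed = set().union(*(s["files"] for s in kept))
    candidates = set().union(*(s["files"] for s in expired))
    return candidates - still_needed


# Sample values are invented for the example.
snapshots = [
    {"snapshot-id": 1, "timestamp-ms": 1_000, "files": {"a.parquet", "b.parquet"}},
    {"snapshot-id": 2, "timestamp-ms": 2_000, "files": {"b.parquet", "c.parquet"}},
    {"snapshot-id": 3, "timestamp-ms": 3_000, "files": {"c.parquet", "d.parquet"}},
]

kept, expired = expire_snapshots(snapshots, current_id=3, older_than_ms=2_000)
deletable = unreferenced_files(kept, expired)
print(sorted(deletable))
```

Expiring a snapshot also removes it as a time travel target, so the retention window is a trade-off between storage cost and how far back you can query.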

Summary

Apache Iceberg's metadata tree is a fundamental rethink of how data lakes manage tables. By pushing metadata into the file system itself - manifests with column stats, manifest lists for snapshot isolation, and metadata files for complete table state - Iceberg eliminates the central metadata bottleneck.

The catalog stores one pointer. Everything else is immutable files. This enables atomic commits, time travel, schema evolution, and engine independence without requiring a distributed transaction coordinator.

If you are building a data platform and have not looked at Iceberg yet, the metadata architecture alone is worth the evaluation. It is the kind of design that seems obvious in retrospect - which is the mark of a good abstraction.

What is your experience with table formats? Have you migrated from Hive to Iceberg, or are you using Delta Lake or Hudi? I would love to hear your thoughts in the comments.

About the author: I am Prithvi S, Staff Software Engineer at Cloudera and an open-source enthusiast. I work on data platforms and agentic systems. Follow my work on GitHub.

Cover image: Data architecture visualization by Luke Chesser on Unsplash

DEV Community