Prithvi S

The Metadata Tree: How Apache Iceberg Finds the Right Files Without a Database

When you're managing petabytes of data across hundreds of machines, every millisecond matters. Most data engineers assume you need a separate metadata system to keep track of what's where. Iceberg proves you don't. Instead, it bakes metadata intelligence directly into the file format itself, creating a self-describing hierarchy that can scale to massive datasets without external dependencies.

This is the story of how Iceberg's metadata architecture works, why Netflix designed it this way, and what it means for your data platform.

The Problem: Metadata Hell at Scale

Let's start with what happens in traditional data lakes.

You have files on HDFS or cloud storage. Lots of them. Thousands. Millions. Every time someone runs a query, the query engine needs to answer three questions:

  1. Which files contain data relevant to this query?
  2. What schema does each file have?
  3. What values are in each column (for filtering)?

In Hive or standard Parquet setups, you usually solve this with an external metadata store. Maybe it's a Hive metastore (which is just a relational database). You track table locations, schemas, partitions, and statistics in that database. When you write new data, you update the database. When you read, you query the database first.

This approach has real problems:

  • External dependency: Your data isn't independent. The metadata store becomes a single point of failure.
  • Consistency issues: What happens when a write succeeds on HDFS but the metadata update fails? You now have files with no catalog entry.
  • Scalability: At petabyte scale, metadata queries become bottlenecks. A single database can't efficiently track billions of files.
  • Time travel is hard: Historical metadata isn't naturally preserved. Rollback operations require custom logic.
  • Schema evolution breaks: Renaming columns means updating the entire catalog and potentially invalidating downstream queries.

Iceberg's answer: What if metadata lived with the data, not in a separate system?

The Metadata Hierarchy: Five Layers of Intelligence

Iceberg organizes metadata in a strict hierarchy. Each layer builds on the one below it, creating a chain from the catalog pointer all the way down to individual data files.

Catalog (mutable pointer)
    ↓
Metadata File (JSON, immutable)
    ↓
Manifest List (immutable)
    ↓
Manifests (immutable)
    ↓
Data Files (immutable)

Let's walk through each layer.

Layer 1: The Catalog (The Single Mutable Piece)

The catalog is the entry point to your table. It's a simple pointer: "The current state of this table is defined by this metadata file."

That's it. Just a reference. And this reference is the only thing that ever changes.

When you execute a write operation, Iceberg doesn't modify the existing metadata. Instead, it:

  1. Creates a new metadata file describing the new table state
  2. Creates new manifest files describing new snapshots
  3. Writes new data files
  4. Atomically updates the catalog pointer using compare-and-swap (CAS) logic

If two writers collide, one's update is rejected, it retries, and the conflict resolves cleanly. This is optimistic concurrency control, and it's elegant because there's only one pointer to update.

The catalog itself is usually a simple file or an HTTP endpoint. For file-based catalogs, it's literally a pointer to a location in cloud storage. For REST catalogs (used by distributed systems like Polaris), it's an HTTP service that handles the pointer updates.
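The commit protocol above can be sketched in a few lines of Python. This is a toy model, not Iceberg's actual API: the `Catalog` class and `commit` helper are hypothetical, and a lock stands in for whatever atomic primitive the real catalog backend provides (a conditional PUT, a database transaction, etc.).

```python
import threading

class Catalog:
    """Toy catalog: one mutable pointer per table, swapped atomically."""
    def __init__(self, initial_metadata: str):
        self._pointer = initial_metadata
        self._lock = threading.Lock()  # stands in for the store's atomic primitive

    def current(self) -> str:
        return self._pointer

    def compare_and_swap(self, expected: str, new: str) -> bool:
        # Succeeds only if nobody has committed since we read `expected`.
        with self._lock:
            if self._pointer != expected:
                return False
            self._pointer = new
            return True

def commit(catalog: Catalog, build_new_metadata) -> str:
    """Optimistic commit: write everything first, then retry the CAS until it lands."""
    while True:
        base = catalog.current()
        # In real Iceberg this step writes new data files, manifests,
        # a manifest list, and a new metadata file -- all before the swap.
        new_metadata = build_new_metadata(base)
        if catalog.compare_and_swap(base, new_metadata):
            return new_metadata
        # Lost the race: another writer moved the pointer. Re-read and retry.

cat = Catalog("v1.metadata.json")
commit(cat, lambda base: base + "+a")
commit(cat, lambda base: base + "+b")
print(cat.current())  # v1.metadata.json+a+b
```

Both writers succeed because the only contended resource is the single pointer; everything else they wrote is immutable and conflict-free.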

Layer 2: The Metadata File (The Table Schema)

A single metadata file defines the table's state at any point in time. It's a JSON file that describes the entire table:

  • Schema: Column names, data types, field IDs (more on field IDs later)
  • Partition spec: How data is partitioned (by year? month? day?)
  • Current snapshot ID: Which snapshot is currently active
  • Snapshot history: All previous snapshots and their timestamps
  • Table properties: Custom metadata, format version, sort order

This metadata file is immutable. Once written, it never changes. Every time you write data, a new metadata file is created with an updated snapshot list.

Why immutability matters: Your entire table history is preserved. You can query any previous snapshot by ID. You can time-travel to last Tuesday. You can audit exactly what changed and when.
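Abbreviated, a metadata file looks roughly like the sketch below. The field names follow the Iceberg table spec, but the values are invented for illustration and many spec fields are omitted:

```python
import json

# Abbreviated sketch of a format-v2 metadata file; real files carry more fields.
metadata = {
    "format-version": 2,
    "table-uuid": "9c12d441-0000-0000-0000-000000000000",  # illustrative
    "location": "s3://bucket/warehouse/db/users",
    "schemas": [{
        "schema-id": 0,
        "fields": [
            {"id": 1, "name": "id", "type": "int", "required": True},
            {"id": 2, "name": "name", "type": "string", "required": False},
        ],
    }],
    "partition-specs": [{"spec-id": 0, "fields": [
        {"source-id": 1, "field-id": 1000, "name": "id_bucket",
         "transform": "bucket[16]"},
    ]}],
    "current-snapshot-id": 3055729675574597004,
    "snapshots": [
        {"snapshot-id": 3055729675574597004,
         "timestamp-ms": 1700000000000,
         "manifest-list": "s3://bucket/warehouse/db/users/metadata/snap.avro"},
    ],
    "properties": {"write.format.default": "parquet"},
}

# The current snapshot must be one of the snapshots in the history list.
current = metadata["current-snapshot-id"]
print(current in {s["snapshot-id"] for s in metadata["snapshots"]})  # True
```

Note how the snapshot entry points at a manifest list file: that's the link down to the next layer of the hierarchy.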

Layer 3: The Manifest List (The Snapshot View)

A manifest list is a file that describes all the manifests for a single snapshot.

Think of it as a table of contents: "This snapshot contains these manifests, which collectively describe all the data files in the table at this point in time."

The manifest list includes:

  • References to all manifest files
  • Partition specs for each manifest
  • Partition value summaries (lower/upper bounds) per manifest, used for pruning
  • File counts and row counts

When a query engine wants to scan a snapshot, it reads the manifest list first. This allows it to prune entire manifests before reading individual data files.

Why this level of indirection? It allows Iceberg to handle large tables elegantly. Instead of one giant list of all files, you have a list of manifests. If you have 10 million data files, you might have 1000 manifest files, and one manifest list. The query engine can prune at the manifest level before drilling into individual files.
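The manifest-level pruning described above can be sketched as follows. The `ManifestEntry` structure is hypothetical (real manifest lists are Avro files with richer partition summaries), but the logic is the same: compare the query predicate against each manifest's partition bounds and discard whole manifests.

```python
from dataclasses import dataclass

@dataclass
class ManifestEntry:
    path: str
    partition_min: int  # e.g. smallest year covered by this manifest
    partition_max: int  # e.g. largest year covered by this manifest

# A manifest list for one snapshot: three manifests covering different year ranges.
manifest_list = [
    ManifestEntry("m-0.avro", 2019, 2020),
    ManifestEntry("m-1.avro", 2021, 2022),
    ManifestEntry("m-2.avro", 2023, 2024),
]

def prune_manifests(entries, year):
    """Keep only manifests whose partition range could contain `year`."""
    return [e for e in entries if e.partition_min <= year <= e.partition_max]

print([e.path for e in prune_manifests(manifest_list, 2024)])  # ['m-2.avro']
```

For a query on `year = 2024`, two of the three manifests are skipped without ever being opened; the files they list are never even enumerated.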

Layer 4: Manifests (The File Inventory)

Each manifest is a file that lists a set of data files. It includes metadata about each file:

  • File path: Location of the data file
  • File format: Parquet, ORC, Avro
  • Row count: How many rows in this file
  • Byte size: File size
  • Column statistics: Min, max, null count per column
  • Partition values: What partition this file belongs to

Manifests are where the real filtering magic happens. Before a query engine reads a single data file, it can:

  1. Check partition values: "Do I even need this file for this query?"
  2. Check column statistics: "Are all values in this column outside my filter range?"
  3. Prune aggressively: Skip entire files without touching them

For example, if you query WHERE year = 2024 AND user_id > 1000000, the manifest can tell you instantly which files have relevant data. No scanning required.
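For that `WHERE year = 2024 AND user_id > 1000000` example, the per-file decision can be sketched like this. The `DataFileEntry` structure is hypothetical; real manifests store per-column bounds in a richer form, but the two checks are the ones described above:

```python
from dataclasses import dataclass

@dataclass
class DataFileEntry:
    path: str
    partition_year: int
    user_id_min: int
    user_id_max: int

files = [
    DataFileEntry("f1.parquet", 2023, 1, 500_000),          # wrong partition
    DataFileEntry("f2.parquet", 2024, 1, 900_000),          # partition ok, stats rule it out
    DataFileEntry("f3.parquet", 2024, 950_000, 2_000_000),  # both checks pass
]

def matches(f: DataFileEntry) -> bool:
    # 1. Partition check: wrong year => skip the file entirely.
    if f.partition_year != 2024:
        return False
    # 2. Column statistics check: if even the max user_id is <= 1_000_000,
    #    no row in this file can satisfy user_id > 1_000_000.
    return f.user_id_max > 1_000_000

print([f.path for f in files if matches(f)])  # ['f3.parquet']
```

Only `f3.parquet` is ever opened; the other two are eliminated from metadata alone.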

Layer 5: Data Files (The Actual Data)

Finally, at the bottom, you have the data files themselves. These are standard Parquet, ORC, or Avro files. They contain the actual table data.

The key point: They're immutable. Once written, they never change. All mutations (updates, deletes) are handled higher in the stack through delete files and new snapshot creation.

How the Hierarchy Enables Architecture Without Databases

Now you see the full picture. Let's connect it back to the original problem.

With this hierarchy, Iceberg can answer all three questions without an external metadata store:

  1. Which files are relevant? Walk down from the catalog pointer through the manifest list to the manifests; partition values and file-level statistics do the filtering.
  2. What's the schema? It's in the metadata file, part of the immutable snapshot.
  3. What values are in each column? Statistics are stored in manifests alongside each file reference.

The entire metadata layer is self-contained. You don't need a separate database. You don't need to manage catalog consistency separately from data consistency. It's all atomic, all together.

And because each snapshot is immutable, your entire table history is naturally preserved. Time travel queries aren't a special feature; they're just reading an older snapshot. Schema changes don't break the system; old data files continue to work with old schemas (thanks to field IDs, which we'll touch on next).

Field IDs: Why Names Are Dangerous

This is the secret weapon that makes schema evolution work.

In traditional tables, columns are identified by position or name. If you rename a column, downstream systems break. If you reorder columns, file readers get confused.

Iceberg uses field IDs instead. Every column has a unique numeric ID that never changes. When you ALTER TABLE to rename a column, the field ID stays the same. The data file format doesn't care about the name; it reads by ID.

Example:

-- Original table
CREATE TABLE users (
  id INT,              -- field ID 1
  name STRING,         -- field ID 2
  email STRING         -- field ID 3
)

-- Years later, rename a column
ALTER TABLE users RENAME COLUMN email TO email_address

-- Old data files still have field ID 3 pointing to email data
-- New queries use field ID 3, which now maps to email_address
-- Everything works transparently

This is why Iceberg tables can safely evolve their schemas over years without rewriting historical data.
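The name-to-ID indirection can be sketched in a few lines. The dictionaries here are a toy stand-in for the schema and column data, but the resolution order is the real one: the current schema maps the queried name to a field ID, and the file is read by ID:

```python
# A data file wrote its columns keyed by field ID, not by name.
file_columns_by_id = {
    1: [101, 102],
    2: ["ana", "bo"],
    3: ["a@x.com", "b@x.com"],
}

schema_v1 = {1: "id", 2: "name", 3: "email"}
schema_v2 = {1: "id", 2: "name", 3: "email_address"}  # rename: ID 3 unchanged

def read_column(schema, column_name):
    """Resolve name -> field ID in the *current* schema, then read by ID."""
    field_id = next(fid for fid, name in schema.items() if name == column_name)
    return file_columns_by_id[field_id]

# Old name under the old schema and new name under the new schema
# resolve to the same field ID, so they read the same bytes.
print(read_column(schema_v1, "email") == read_column(schema_v2, "email_address"))  # True
```

The rename only touched the schema in a new metadata file; the data file was never rewritten.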

Partition Evolution: Changing Strategy Without Rewrites

The same principle applies to partitioning.

Let's say you originally partitioned your data by year:

/year=2020/...
/year=2021/...
/year=2022/...

Later, you realize you need monthly partitions for performance:

/year=2023/month=01/...
/year=2023/month=02/...

With traditional data lakes, you'd have to rewrite all the old yearly partitioned data to the new monthly scheme. That's hours of work for large tables.

With Iceberg, you change the partition spec and move on. New writes use the new partition layout. Old writes keep the old layout. The manifest files track which partition spec applies to which data files. When you query, Iceberg handles both seamlessly.

Hidden Partitioning: The Query Engine Doesn't Know or Care

Here's another elegance: Your query doesn't explicitly reference partitions.

In Hive, you might write:

SELECT * FROM events WHERE year = 2024

Hive sees the partition column and uses it for partition pruning.

In Iceberg, you write:

SELECT * FROM events WHERE event_date >= '2024-01-01'

You don't mention partitioning at all. You just query the actual column. Iceberg's hidden partitioning handles partition transforms internally. If the table is partitioned by year(event_date), Iceberg applies the transform, prunes the right partitions, and returns the answer. The query engine never knows partitioning happened.

This is powerful because:

  1. Queries are simpler and more portable across engines
  2. You can change partition transforms without rewriting queries
  3. The partition strategy is an implementation detail, not part of the query contract
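The hidden part of hidden partitioning can be sketched as follows. Iceberg's `year()` transform stores a year as an offset from 1970; the partition map and `plan_scan` helper here are hypothetical, but they show how a predicate on `event_date` is projected onto the partition values without the query ever naming a partition:

```python
from datetime import date

EPOCH_YEAR = 1970

def year_transform(d: date) -> int:
    # Iceberg's year() transform: years since 1970.
    return d.year - EPOCH_YEAR

# Data files keyed by their year-partition value (illustrative).
partitions = {
    year_transform(date(2023, 6, 1)): ["f-2023.parquet"],
    year_transform(date(2024, 3, 9)): ["f-2024.parquet"],
}

def plan_scan(predicate_lower: date):
    """For `event_date >= predicate_lower`, keep every partition whose year
    could contain qualifying rows. The user never mentions the partition."""
    lower = year_transform(predicate_lower)
    return [f for part, fs in partitions.items() if part >= lower for f in fs]

print(plan_scan(date(2024, 1, 1)))  # ['f-2024.parquet']
```

The projection is conservative: a kept partition may still hold non-matching rows (which the row-level filter removes later), but a pruned partition provably cannot match.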

Snapshots: Immutable Points in Time

Every write operation creates a new snapshot. A snapshot is an immutable view of the table at a point in time.

The catalog points to the current snapshot. Previous snapshots remain accessible. You can:

  • Query snapshot N: SELECT * FROM table VERSION AS OF snapshot_id_12345
  • Query by timestamp: SELECT * FROM table FOR SYSTEM_TIME AS OF '2026-04-10'
  • Audit history: Inspect the metadata file to see all snapshot timestamps
  • Rollback: Point the catalog back to an earlier snapshot

Because snapshots are immutable and complete, time travel isn't a performance problem. You're not scanning all history; you're just reading a different snapshot.

Copy-on-Write and Merge-on-Read: Two Paths to Mutation

Iceberg supports two different strategies for handling updates and deletes.

Copy-on-Write (CoW): When you delete a row, Iceberg reads the entire data file, filters out the deleted row, and writes a new file. The old file is marked as deleted in the manifest.

  • Pros: Clean reads, no delete reconciliation needed
  • Cons: Every mutation rewrites files (slow for write-heavy workloads)

Merge-on-Read (MoR): When you delete a row, Iceberg writes a separate delete file that marks rows as deleted. At read time, the query engine merges the data files and delete files to reconstruct the current state.

  • Pros: Fast writes, no file rewrites
  • Cons: Reads have to do reconciliation work

The manifest tracks both data files and delete files, so readers know which delete files apply to which data files.
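The merge-on-read side can be sketched with positional deletes, one of the delete-file forms Iceberg supports: a delete file records (data file, row position) pairs, and the reader drops those positions while scanning. The structures below are a toy stand-in for the real Parquet/Avro files:

```python
# Merge-on-read sketch: data files are never rewritten; positional delete
# files record (data_file_path, row_position) pairs to drop at read time.
data_file_rows = ["row0", "row1", "row2", "row3"]

# Produced by a DELETE: positions 1 and 3 of data-0.parquet are gone.
positional_deletes = {("data-0.parquet", 1), ("data-0.parquet", 3)}

def read_with_deletes(path, rows, deletes):
    """Reconstruct current state: emit each row unless a delete marks its position."""
    return [row for pos, row in enumerate(rows) if (path, pos) not in deletes]

print(read_with_deletes("data-0.parquet", data_file_rows, positional_deletes))
# ['row0', 'row2']
```

This is the read-time reconciliation cost the "Cons" bullet refers to; compaction can later fold the deletes into rewritten data files to restore copy-on-write-style read speed.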

Metadata Compaction: Keeping History Manageable

Over months and years, you accumulate many snapshots. The metadata files grow. The manifest lists grow.

Iceberg provides maintenance operations to clean this up, typically run on a schedule:

  • Snapshot expiration: Snapshots older than a retention period are removed, along with files only they reference
  • Manifest compaction: Small manifest files are merged into larger ones
  • Orphan file cleanup: Files with no references in any active snapshot are removed

You can also manually trigger compaction to optimize for query performance.

Why This Architecture Matters

Let's zoom out. Why is this design so important?

  1. Scalability without infrastructure: You don't need a separate metadata store. Files themselves carry the metadata. This means Iceberg scales to petabytes without additional complexity.

  2. ACID correctness: The atomic catalog pointer ensures that either a write succeeds completely or fails completely. There's no partial success, no consistency holes.

  3. Engine independence: The metadata hierarchy is format-agnostic. Spark reads it the same way Trino does. The metadata is the contract between engines, not some database-specific schema.

  4. History preservation: Because snapshots are immutable, you get time travel and audit trails for free. No special feature; just a natural consequence of the design.

  5. Evolution without friction: Schema and partition evolution don't require data rewrites. Your table grows and changes safely.

  6. Concurrent writes: Optimistic concurrency control at the metadata level means multiple writers can work simultaneously without locking. Conflicts resolve cleanly at the atomic point.

This is why Iceberg has become the standard for large-scale analytics. It's not that it's flashy or new. It's that it solves real problems at scale.

Bringing It All Together

The next time you write a query against an Iceberg table, remember the architecture beneath it:

  • The catalog pointer found your table
  • The metadata file described its schema
  • The manifest list identified the snapshot's manifests
  • The manifests pruned unnecessary files using statistics
  • The data files provided the actual rows
  • Delete files (if any) marked rows as removed

No external database needed. No consistency problems. No performance cliffs at scale.

Just files, organized intelligently, describing themselves.

That's Iceberg.

About the Author

I'm Prithvi S, Staff Software Engineer at Cloudera and Open Source Enthusiast. I work on data systems, LLM-powered applications, and large-scale architectures. Follow my work on GitHub: https://github.com/iprithv
