The Metadata Tree: How Apache Iceberg Tracks Everything Without a Database

Published on Dev.to by Prithvi S

If you've worked with petabyte-scale data lakes, you've felt the pain: tables fragment into thousands of small files, query engines can't find what they need, schema changes become nightmares, and concurrent writes collide silently. Apache Iceberg solves these problems with something deceptively simple: a metadata hierarchy that tracks every file, every change, and every snapshot without needing a centralized database.

In this post, we'll walk through Iceberg's architecture from the ground up. By the end, you'll understand why Netflix built this, how it works, and why it's becoming the standard for analytics across Spark, Trino, Flink, and beyond.

The Problem: Why Big Data Tables Are So Fragile

Before Iceberg, data lakes used simple file layouts (Parquet/ORC on HDFS or S3). Sounds fine in theory. In practice:

  • Atomicity fails. A write dies halfway through. Are those partial files valid? Which ones should queries skip?
  • Partition discovery is slow. Query engines scan directory listings to find what data exists. With millions of files, this takes minutes.
  • Schema changes break old files. Add a column, and now old data files don't have it. Readers fail or invent NULLs.
  • Concurrent writes collide. Two writers claim they "finished" the same table simultaneously. Which one won? Corruption.
  • Pruning is guesswork. Engines don't know column ranges within files. They scan everything.

Hive-style tables and various early fixes (central metadata services, directory-naming conventions) were band-aids. They all had the same fundamental issue: the source of truth was either centralized (slow, inconsistent) or inferred from the file layout itself (unreliable).

Iceberg took a different approach: immutable snapshots and a metadata hierarchy, with a single atomic pointer as the source of truth.

Meet the Metadata Tree

Here's the insight: instead of asking "what files exist?", Iceberg asks "what does this snapshot contain?". A snapshot is an immutable point-in-time view of the table. Every write creates a new snapshot. Every snapshot points to a set of files. And Iceberg tracks all of this through a beautifully layered hierarchy.

Let's build it from the bottom up:

Layer 1: Data Files (Parquet/ORC/Avro)

The base layer is simple: actual data. Files contain table rows in columnar format (Parquet is standard, ORC supported, Avro optional).

s3://my-warehouse/my-table/data/
  00000-abc123.parquet  (1 GB, 10M rows, partition_date=2024-01-01)
  00001-def456.parquet  (950 MB, 9.5M rows, partition_date=2024-01-01)
  00002-ghi789.parquet  (1.1 GB, 11M rows, partition_date=2024-01-02)

Key property: These files are write-once. Once created, they never change. If you delete a row, you don't rewrite the file (we'll see how Iceberg handles deletes later). This makes them safe for concurrent reads and enables efficient caching.

Layer 2: Manifest Files (Track Data Files and Stats)

Now here's where Iceberg gets clever. Instead of listing files via S3 API calls (slow, eventually consistent), Iceberg tracks files in manifest files. A manifest is an Avro file that lists a batch of data files along with their metadata; a snapshot is made up of one or more manifests.

For each data file, a manifest stores:

  • File path: s3://my-warehouse/my-table/data/00000-abc123.parquet
  • File format: parquet
  • File size: 1,073,741,824 bytes
  • Partition info: {partition_date: 2024-01-01} (as integers, not strings)
  • Record count: 10,000,000
  • Column stats: min/max/null_count for each column
    • user_id: min=1000, max=9999999, null_count=0
    • amount: min=0.01, max=9999.99, null_count=1200
    • timestamp: min=2024-01-01 00:00:00, max=2024-01-01 23:59:59, null_count=0

Why stats? Because query engines use them for predicate pushdown. If you ask "SELECT * WHERE amount > 5000", the engine checks the manifest: does any file have max(amount) greater than 5000? If not, it skips that file entirely.
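This kind of stats-based pruning can be sketched in a few lines of Python. The entries below are hand-written stand-ins for what an engine would decode from a manifest; the paths and helper name are invented for illustration:

```python
# Minimal sketch of stats-based file pruning. Each entry mimics the
# per-file column stats a manifest stores; values are illustrative.
manifest_entries = [
    {"path": "data/00000-abc123.parquet",
     "stats": {"amount": {"min": 0.01, "max": 4500.00}}},
    {"path": "data/00001-def456.parquet",
     "stats": {"amount": {"min": 120.00, "max": 9999.99}}},
]

def files_matching_gt(entries, column, value):
    """Keep only files whose max for `column` could satisfy `> value`."""
    return [e["path"] for e in entries
            if e["stats"][column]["max"] > value]

# SELECT * WHERE amount > 5000: the first file's max is 4500.00,
# so it is skipped without ever being opened.
print(files_matching_gt(manifest_entries, "amount", 5000))
```

The real decision logic also accounts for null counts and missing stats, but the shape is the same: compare the predicate against per-file bounds before touching storage.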

s3://my-warehouse/my-table/metadata/
  manifests/
    20240501-000001-abc.avro  (manifest for snapshot 1, 500 KB)
    20240502-000002-def.avro  (manifest for snapshot 2, 510 KB)

Multiple snapshots = multiple manifests.

Layer 3: Manifest List (Snapshot-Level Index)

A large table's snapshot can span hundreds or thousands of manifests. Iceberg adds another layer: the manifest list. This is a small Avro file that records which manifests belong to a given snapshot.

For each manifest, it stores:

  • Manifest file path: s3://my-warehouse/my-table/metadata/manifests/20240501-000001-abc.avro
  • Manifest length: 500,000 bytes
  • Partition stats: min/max partition values across all files in this manifest
    • partition_date: min=2024-01-01, max=2024-01-01
  • Record count: Sum of all files in this manifest

So if your snapshot includes 100 manifests, the manifest list is a ~50 KB file that summarizes all of them. Engines can scan the manifest list to decide which manifests to read.
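That snapshot-level pruning works the same way as file-level pruning, one layer up. A rough Python sketch, with made-up structures standing in for the Avro records a manifest list actually holds:

```python
# Sketch of snapshot-level pruning: the manifest list stores partition
# bounds per manifest, so whole manifests can be skipped at once.
manifest_list = [
    {"manifest": "metadata/manifests/20240501-000001-abc.avro",
     "partition_date": {"min": "2024-01-01", "max": "2024-01-01"}},
    {"manifest": "metadata/manifests/20240502-000002-def.avro",
     "partition_date": {"min": "2024-01-02", "max": "2024-01-03"}},
]

def manifests_for_date(entries, date):
    """Keep manifests whose partition range can contain `date`."""
    return [e["manifest"] for e in entries
            if e["partition_date"]["min"] <= date <= e["partition_date"]["max"]]

# A query filtered to 2024-01-02 opens only the second manifest.
print(manifests_for_date(manifest_list, "2024-01-02"))
```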

s3://my-warehouse/my-table/metadata/
  snap-20240501000001.avro  (manifest list for snapshot 1, 25 KB)
  snap-20240502000002.avro  (manifest list for snapshot 2, 26 KB)

Layer 4: Metadata File (Table Schema, History, Current Snapshot)

The metadata file is a JSON file that contains the entire truth about the table:

{
  "format-version": 2,
  "table-uuid": "12345-abcde",
  "location": "s3://my-warehouse/my-table",
  "last-updated-ms": 1714605600000,
  "last-column-id": 5,
  "schema": {
    "type": "struct",
    "fields": [
      {"id": 1, "name": "user_id", "required": true, "type": "long"},
      {"id": 2, "name": "email", "required": true, "type": "string"},
      {"id": 3, "name": "amount", "required": false, "type": "double"},
      {"id": 4, "name": "created_at", "required": true, "type": "timestamp"},
      {"id": 5, "name": "country", "required": false, "type": "string"}
    ]
  },
  "partition-spec": [
    {"name": "date_partition", "transform": "year", "source-id": 4}
  ],
  "current-snapshot-id": 2,
  "snapshots": [
    {
      "snapshot-id": 1,
      "timestamp-ms": 1714519200000,
      "summary": {"operation": "append", "added-files": "100"},
      "manifest-list": "s3://my-warehouse/my-table/metadata/snap-20240501000001.avro"
    },
    {
      "snapshot-id": 2,
      "timestamp-ms": 1714605600000,
      "summary": {"operation": "append", "added-files": "5"},
      "manifest-list": "s3://my-warehouse/my-table/metadata/snap-20240502000002.avro"
    }
  ],
  "properties": {
    "write.format.default": "parquet"
  }
}

See current-snapshot-id: 2? That pointer tells engines: "If you want the latest table, load snapshot 2, which points to this manifest list, which lists these manifests, which contain these files."
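Resolving that pointer is just a lookup. Here is a sketch in Python; in practice the JSON would be fetched from object storage, and it is trimmed here to the relevant fields:

```python
import json

# A cut-down metadata file, matching the example above.
metadata = json.loads("""{
  "current-snapshot-id": 2,
  "snapshots": [
    {"snapshot-id": 1,
     "manifest-list": "s3://my-warehouse/my-table/metadata/snap-20240501000001.avro"},
    {"snapshot-id": 2,
     "manifest-list": "s3://my-warehouse/my-table/metadata/snap-20240502000002.avro"}
  ]
}""")

def current_manifest_list(meta):
    """Follow current-snapshot-id to that snapshot's manifest list."""
    wanted = meta["current-snapshot-id"]
    snap = next(s for s in meta["snapshots"] if s["snapshot-id"] == wanted)
    return snap["manifest-list"]

print(current_manifest_list(metadata))
# -> s3://my-warehouse/my-table/metadata/snap-20240502000002.avro
```

Time travel is the same function with a different snapshot ID: pick any entry from `snapshots` instead of the current one.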

The metadata file also contains:

  • Schema: Column definitions and IDs (more on this later)
  • Partition spec: How to partition new data (year, month, day, hour, bucket, truncate)
  • Snapshot history: Every snapshot ever created, its timestamp, and what manifest-list it uses
  • Properties: Table config (compression, default format, etc.)
s3://my-warehouse/my-table/metadata/
  00000.json  (metadata file v0, created at table init)
  00001.json  (metadata file v1, created after first write)
  00002.json  (metadata file v2, created after second write)

Each write creates a new metadata file.

Layer 5: Catalog Pointer (The Only Mutable Piece)

Finally, the catalog. This is where Iceberg gets its reliability. The catalog is a simple mapping:

table_name -> current_metadata_file_location

For example:

my-table -> s3://my-warehouse/my-table/metadata/00002.json

That's it. One pointer. Everything else is immutable.

Where does the catalog live? Not alongside the table's files; it lives in a catalog service:

  • Hive metastore: Legacy, still widely used
  • AWS Glue: Managed, but slower
  • Nessie: Git-like, multi-branch table versioning
  • REST catalog: HTTP-based, engine-agnostic (e.g., Apache Polaris)
  • In-memory catalog: for tests and local development

When you write to a table:

  1. Load current metadata file (via catalog pointer)
  2. Generate new data files and manifests
  3. Create a new metadata file that points to the new manifests
  4. Atomically update the catalog pointer from old metadata to new metadata

Step 4 is the secret sauce. Most catalog services support atomic CAS (compare-and-swap) operations. If two writers race, only one CAS succeeds. The loser retries with the new current state. This gives you serializable isolation without distributed locks.
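A toy catalog makes the CAS semantics concrete. The class and method names below are invented; real catalogs (Glue, Nessie, REST) expose the same check-then-set behavior through their own APIs:

```python
import threading

class Catalog:
    """Toy catalog: one mutable pointer per table, updated via CAS."""
    def __init__(self):
        self._lock = threading.Lock()
        self._pointer = {}

    def cas(self, table, expected, new):
        """Move the pointer only if it still equals `expected`."""
        with self._lock:
            if self._pointer.get(table) != expected:
                return False  # someone else committed first; caller retries
            self._pointer[table] = new
            return True

catalog = Catalog()
catalog.cas("my-table", None, "metadata/00000.json")

# Two writers race to commit on top of 00000.json; only one CAS wins.
won_a = catalog.cas("my-table", "metadata/00000.json", "metadata/00001-a.json")
won_b = catalog.cas("my-table", "metadata/00000.json", "metadata/00001-b.json")
print(won_a, won_b)  # one True, one False
```

The losing writer reloads the winner's metadata, re-resolves conflicts if needed, and retries its CAS against the new pointer.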

Putting It Together: A Write Operation

Let's trace a real write to see all five layers in action.

Initial state:

  • Catalog: my-table -> metadata/00000.json
  • Metadata 00000: current-snapshot-id=1, 100 data files
  • Snapshot 1: manifest-list points to 10 manifests

Writer appends 5 new files:

  1. Writer generates 5 new Parquet files in the data directory
  2. Writer creates a new manifest (manifest v1) listing these 5 files + stats
  3. Writer creates a new manifest-list (snap-2) that includes both old manifests and the new one
  4. Writer creates new metadata file (00001.json):
    • current-snapshot-id: 2
    • snapshot 2 points to new manifest-list
    • schema unchanged
    • partition-spec unchanged
  5. Atomic CAS: Writer tells catalog: "Update my-table to point to metadata/00001.json (only if it still points to 00000.json)"
  6. Catalog confirms: pointer updated
  7. Write succeeds. Old metadata file (00000.json) stays on disk for history/time travel

Concurrent reader:

  • Sees catalog still pointing to 00000.json (old snapshot)
  • Reads that snapshot's manifest-list
  • Reads 10 manifests, finds 100 files
  • Scans those 100 files
  • Write doesn't interfere; reader gets consistent view

Later reader:

  • Sees catalog now pointing to 00001.json
  • Reads snapshot 2's manifest-list
  • Finds 11 manifests (10 old + 1 new)
  • Reads 105 files total
  • Sees the rows from the 5 new files

Why This Design Is Brilliant

1. Atomicity without transactions: Only the catalog pointer moves. Everything else is immutable. No metadata locks, no two-phase commit.

2. Snapshot isolation: Readers hold a reference to a metadata file (or snapshot ID). Old snapshots never change. Even if 10 new writes happen, that reader's view stays frozen.

3. Time travel: Query engine wants rows from May 1st? Load the snapshot from May 1st, read that manifest-list and files. No replaying transactions; the snapshot already exists.

4. File pruning: Manifest files store column stats. Query engines skip files without matching data before ever touching S3/HDFS. Orders of magnitude faster than directory scans.

5. Partition evolution: Want to change from daily to hourly partitions? Create new metadata file with new partition-spec. Old data files keep old partition values (stored in manifest). New files use new partition values. Readers handle both transparently.

6. Schema evolution: Columns are identified by ID, not name. Rename a column? New metadata file with updated schema. Old files still have ID 1 for user_id; readers understand both.

7. Concurrent writes at scale: Thousands of writers, zero coordination. Optimistic locking via metadata CAS. Even with a 10% collision rate, 90% of writers succeed on the first try, and the rest simply retry against the new state.
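Point 6 is worth seeing in code: because columns are matched by ID, a rename changes only the metadata, never the files. A minimal sketch, with invented dict structures standing in for Iceberg's schema objects:

```python
# Readers match columns by field ID, not name. An old data file wrote
# field ID 1 under the name "user_id"; the current schema has since
# renamed it to "customer_id". The rename still resolves correctly.
current_schema = {1: "customer_id", 3: "amount"}   # field-id -> current name
file_columns   = {"user_id": 1, "amount": 3}       # names as written in the file

def project(row, file_cols, schema):
    """Re-key a row from file column names to current names via field IDs."""
    id_to_file_name = {fid: name for name, fid in file_cols.items()}
    return {schema[fid]: row[id_to_file_name[fid]] for fid in schema}

row = {"user_id": 42, "amount": 19.99}
print(project(row, file_columns, current_schema))
# -> {'customer_id': 42, 'amount': 19.99}
```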

The Performance Trade-off

This architecture does have a cost: metadata reads. Every query must:

  1. Load metadata file (JSON from S3)
  2. Load manifest-list (Avro)
  3. Load relevant manifests (Avro)
  4. Filter down to data files

For a table with millions of files, even with parallelism, this adds 100-500ms to query startup. This is why Iceberg works best with a warm metadata cache (many query engines cache manifest files in memory) and why partition pruning is critical (manifest-list stores partition bounds, so you can skip entire manifest groups).

Next Steps in the Iceberg Journey

Now that you understand the metadata tree, you're ready for:

  • Schema evolution: How field IDs make ALTER TABLE safe
  • Hidden partitioning: How Iceberg auto-applies transforms (year, month, day, etc.)
  • Delete semantics: Position deletes vs equality deletes, Copy-on-Write vs Merge-on-Read
  • Optimistic concurrency: How CAS prevents write collisions
  • Time travel: Querying historical snapshots by timestamp

Conclusion

Apache Iceberg's genius is in its simplicity: a five-layer metadata tree with one atomic pointer at the top. No central database. No complex consistency protocols. Just immutable files, clever indexing, and compare-and-swap semantics.

If you're building data platforms at scale, this is the architecture you should understand. Whether you're using Spark, Trino, Flink, or any other engine, Iceberg's design enables correctness, performance, and flexibility that older table formats simply can't match.

Start with a test table in your warehouse. Try time travel. Try schema evolution. Feel how different Iceberg is. Then you'll really get why it matters.


Want to dive deeper? Check out the Apache Iceberg spec at https://iceberg.apache.org/spec/ and the source code at https://github.com/apache/iceberg. Follow my work on GitHub: https://github.com/iprithv.
