When you have a petabyte of data across millions of files in cloud storage, how do you ensure that reads are consistent, writes don't collide, and schema changes don't break everything? Traditional data lakes punt on this problem. Apache Iceberg solves it with an elegant metadata architecture that brings SQL-table reliability to distributed storage without needing a centralized database.
Let me walk you through how it works, why each layer matters, and what makes it fundamentally different from older table formats.
The Problem: Why Traditional Data Lakes Are Unreliable
Before Iceberg, data lakes operated like this:
- Data engineers wrote Parquet files to S3
- A Hive metastore tracked table schemas and partition locations
- Queries discovered files by scanning directories or querying the metastore
- Updates meant either rewriting entire partitions or leaving data inconsistent
This worked for append-only workflows. But the moment you needed:
- Schema evolution (add a column) without rewriting data
- Atomic deletes without breaking other writers
- Time travel (query historical snapshots)
- Concurrent writes without conflicts

...you hit a wall. The metastore was a single point of contention, and there was no reliable way to track which files belonged to which version of the table.
Iceberg fixes this by building a versioned metadata system directly into the table format. No metastore required (though one can help). Just immutable snapshots and a pointer to the current state.
The Metadata Hierarchy: Five Layers of Metadata
Iceberg organizes metadata in a clean bottom-to-top hierarchy. Each layer is immutable, and each layer is built from the layer below it. Here's how it works:
Layer 1: Data Files
At the bottom are your actual data files: Parquet, ORC, or Avro files stored in cloud storage (S3, GCS, Azure Blob, or local filesystems). These contain the raw table data, partitioned and compressed.
s3://my-bucket/warehouse/db/table/data/
00001-abc123.parquet <- 50MB, partition year=2024, month=01
00002-def456.parquet <- 60MB, partition year=2024, month=01
00003-ghi789.parquet <- 45MB, partition year=2024, month=02
From Iceberg's perspective, these files are opaque. It doesn't care about their internal structure. What matters is that each file has:
- A file path
- File format (parquet/orc/avro)
- Partition values (year=2024, month=01)
- Column-level statistics (min/max/null counts per column)
- File size and record count
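Those five pieces of per-file metadata can be modeled as a small record. The sketch below is a toy Python dataclass whose field names mirror the manifest entry shown later in this article; it is illustrative only, not the real Iceberg API.

```python
from dataclasses import dataclass, field

@dataclass
class DataFileEntry:
    """The per-file metadata Iceberg tracks for each data file (sketch)."""
    file_path: str
    file_format: str                  # "parquet" | "orc" | "avro"
    partition: dict                   # e.g. {"year": 2024, "month": 1}
    record_count: int
    file_size_in_bytes: int
    lower_bounds: dict = field(default_factory=dict)  # column id -> min value
    upper_bounds: dict = field(default_factory=dict)  # column id -> max value
    null_value_counts: dict = field(default_factory=dict)

f = DataFileEntry(
    file_path="s3://my-bucket/warehouse/db/table/data/00001-abc123.parquet",
    file_format="parquet",
    partition={"year": 2024, "month": 1},
    record_count=1_234_567,
    file_size_in_bytes=52_428_800,
    lower_bounds={2: 100},
    upper_bounds={2: 9999},
)
```

Note that the statistics are keyed by column ID, not column name; that detail becomes important for schema evolution later.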
Layer 2: Manifest Files
Above data files sit manifest files. A manifest file is an Avro file that lists the data files belonging to a snapshot, along with metadata about each one.
Think of a manifest like a file listing with extra info:
{
"status": 1,
"snapshot_id": 9223372036854775807,
"sequence_number": 0,
"file_path": "s3://my-bucket/warehouse/db/table/data/00001-abc123.parquet",
"file_format": "PARQUET",
"spec_id": 0,
"partition": {
"year": 2024,
"month": 1
},
"record_count": 1234567,
"file_size_in_bytes": 52428800,
"column_sizes": {
"1": 10485760,
"2": 15728640,
"3": 26214400
},
"value_counts": {
"2": 1234567
},
"null_value_counts": {
"2": 5
},
"lower_bounds": {
"1": "2024-01-15",
"2": 100
},
"upper_bounds": {
"1": "2024-01-31",
"2": 9999
}
}
The manifest records:
- Which data files are live vs deleted
- Partition values for each file
- Column-level statistics (min/max/null counts) for pruning
- File sizes and record counts
This is where the magic starts. Because every data file's statistics are recorded in a manifest, query engines can:
- Read a single manifest file (much smaller than scanning all data files)
- Prune files based on partition values or column statistics
- Skip reading files that don't match the query filter
On a table with a million data files, you might read one manifest file and determine that only 500 files match your query. No metadata service needed; it's all in the manifest.
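Here is what that pruning logic looks like in miniature. This is a hypothetical Python sketch over plain dicts shaped like the manifest entry above (partition values plus per-column bounds keyed by column ID), not the real engine code.

```python
def matches(entry, partition_filter, col_id, lo, hi):
    """Return True if the data file COULD contain matching rows (sketch).

    partition_filter: partition values that must match exactly.
    (col_id, lo, hi): keep files whose [lower_bound, upper_bound] for
    col_id overlaps the query range [lo, hi].
    """
    for k, v in partition_filter.items():
        if entry["partition"].get(k) != v:
            return False               # wrong partition: skip the file
    f_lo = entry["lower_bounds"].get(col_id)
    f_hi = entry["upper_bounds"].get(col_id)
    if f_lo is not None and f_hi is not None:
        if f_hi < lo or f_lo > hi:
            return False               # stats prove no overlap: skip
    return True                        # might match: must be read

files = [
    {"file_path": "00001", "partition": {"year": 2024, "month": 1},
     "lower_bounds": {2: 100}, "upper_bounds": {2: 9999}},
    {"file_path": "00003", "partition": {"year": 2024, "month": 2},
     "lower_bounds": {2: 100}, "upper_bounds": {2: 9999}},
]
live = [f for f in files if matches(f, {"month": 1}, 2, 500, 600)]
# only 00001 survives: 00003 is pruned by its partition value
```

The key point: this decision is made from manifest metadata alone, before a single data file is opened.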
Layer 3: Manifest List
A manifest list is an Avro file that references all the manifest files for a given snapshot.
{
"manifest_path": "s3://my-bucket/warehouse/db/table/metadata/10001-abc.avro",
"manifest_length": 1048576,
"partition_spec_id": 0,
"added_snapshot_id": 9223372036854775807,
"added_files_count": 150,
"existing_files_count": 2500,
"deleted_files_count": 10,
"added_rows_count": 18750000,
"existing_rows_count": 312500000,
"deleted_rows_count": 125000,
"partitions": [
{
"contains_null": false,
"lower_bound": 2024,
"upper_bound": 2024
}
]
}
The manifest list aggregates statistics from all manifests for that snapshot:
- How many files were added/existing/deleted?
- How many rows were added/deleted?
- What partition ranges does this snapshot contain?
Why? Because query engines need to know if a snapshot is even relevant before reading all its manifests. The manifest list answers that in one read.
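That one-read check can be sketched like this. The per-partition-field summaries (the "partitions" array above) carry the lower/upper bound of each partition field across the entire manifest, so an engine can discard a whole manifest without opening it. Hypothetical Python, single partition field for simplicity:

```python
def manifest_is_relevant(manifest_entry, lo, hi):
    """Check a manifest-list entry's partition summaries before opening
    the manifest itself (sketch). Each summary holds the min/max of one
    partition field across every file the manifest tracks."""
    for s in manifest_entry["partitions"]:
        if s["upper_bound"] < lo or s["lower_bound"] > hi:
            return False   # no file in this manifest can match the query
    return True

entry = {"manifest_path": "s3://my-bucket/.../10001-abc.avro",
         "partitions": [{"contains_null": False,
                         "lower_bound": 2024, "upper_bound": 2024}]}

manifest_is_relevant(entry, 2024, 2024)  # True: must read this manifest
manifest_is_relevant(entry, 2025, 2026)  # False: skip it entirely
```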
Layer 4: Metadata File (JSON)
Above the manifest list sits the metadata file. This is a JSON file that contains:
- Table schema and column definitions
- Partition spec (how to partition the table)
- Current snapshot ID (pointer to the "live" snapshot)
- Snapshot history (all past snapshots)
- Table properties and settings
- Sorted order definitions
- Schema evolution history (field IDs and renames)
Example metadata file structure:
{
"format-version": 2,
"table-uuid": "abc-123-def-456",
"location": "s3://my-bucket/warehouse/db/table",
"last-sequence-number": 3,
"last-updated-ms": 1712973955000,
"last-column-id": 3,
"schema": {
"type": "struct",
"fields": [
{"id": 1, "name": "event_id", "type": "string"},
{"id": 2, "name": "user_id", "type": "long"},
{"id": 3, "name": "event_timestamp", "type": "timestamp"}
]
},
"partition-spec": {
"fields": [
{"source-id": 3, "transform": "year", "name": "year"}
]
},
"current-snapshot-id": 9223372036854775807,
"snapshots": [
{
"snapshot-id": 9223372036854775807,
"timestamp-ms": 1712973955000,
"summary": {
"operation": "append",
"spark.app.id": "app-20240412-123456"
},
"manifest-list": "s3://my-bucket/warehouse/db/table/metadata/v1.manifest.list"
}
]
}
This single JSON file is the table's source of truth. It tells you:
- What columns exist (with field IDs, not just names)
- How the table is partitioned
- Which snapshot is current
- The full history of all snapshots ever created
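Resolving "what should I read?" from this file takes a handful of lines. A minimal sketch, using the key names from the example above (the real metadata file has more fields, and real manifest-list paths look different):

```python
import json

def current_manifest_list(metadata_json: str) -> str:
    """Resolve the current snapshot's manifest-list location from a
    table metadata file (sketch)."""
    meta = json.loads(metadata_json)
    current_id = meta["current-snapshot-id"]
    for snap in meta["snapshots"]:
        if snap["snapshot-id"] == current_id:
            return snap["manifest-list"]
    raise ValueError("current snapshot not found in snapshot log")

doc = json.dumps({
    "current-snapshot-id": 2,
    "snapshots": [
        {"snapshot-id": 1, "manifest-list": "s3://bucket/metadata/snap-1.avro"},
        {"snapshot-id": 2, "manifest-list": "s3://bucket/metadata/snap-2.avro"},
    ],
})
current_manifest_list(doc)  # 's3://bucket/metadata/snap-2.avro'
```

From there the read path is mechanical: metadata file, then manifest list, then the surviving manifests, then the surviving data files.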
Layer 5: The Catalog Pointer (The Only Mutable Piece)
Finally, at the very top sits the catalog. The catalog's job is simple: store a pointer to the current metadata file location.
table_identifier (db.table) -> s3://bucket/warehouse/db/table/metadata/v123.json
This is the only mutable piece in the entire system. When you commit a write, you:
- Create a new metadata file (immutable)
- Create new manifest files (immutable)
- Create new data files (immutable)
- Update the catalog pointer to point to the new metadata file (atomic CAS operation)
If the pointer update fails (because another writer already updated it), you retry with conflict detection. This gives you serializable isolation without a database.
Why This Matters: Three Critical Properties
This five-layer hierarchy gives you three things that traditional data lakes don't have:
1. Atomicity Without a Database
Traditional approach: Write files to S3, update the metastore database. If the database update fails, you have orphaned files. If the application crashes mid-write, the metastore is inconsistent.
Iceberg approach: Write everything (metadata files, manifests, data files) to immutable storage. The only atomic operation is the catalog pointer update: a compare-and-swap in a key-value store or database, or an atomic rename on filesystems that support one (plain S3 has no atomic rename, which is precisely why a catalog sits on top). If that operation fails, nothing changed. No orphaned state.
2. Schema Evolution Without Rewrites
With Hive tables, columns are identified by position or name. Renaming a column either breaks reads of old data (name-based resolution comes back NULL) or forces you to rewrite every existing data file; reordering or dropping columns is similarly hazardous.
Iceberg uses field IDs instead of names or positions. Column 1 is always "event_id" internally, even if you rename the external column or reorder columns. Old data files don't change. New data files use the new schema. Queries automatically reconcile both.
Example: You have a table with schema (id, user_id, event_time). You want to add a source column:
- Old data files don't have source (field ID 4)
- New data files do have source
- Iceberg handles the missing column transparently (reads return NULL for it)
- No rewrites needed
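The mechanics reduce to projection by field ID. A toy sketch in Python, where stored rows are keyed by field ID rather than column name (illustrative only; real Parquet readers do this inside the file format):

```python
def read_row(stored_row, query_field_ids):
    """Project a stored row (keyed by field ID) onto the query schema.
    Field IDs absent from the file -- e.g. columns added after the file
    was written -- come back as None. Old files are never rewritten."""
    return {fid: stored_row.get(fid) for fid in query_field_ids}

old_file_row = {1: "evt-1", 2: 42, 3: "2024-01-15T00:00:00"}            # no field 4
new_file_row = {1: "evt-2", 2: 43, 3: "2024-02-01T00:00:00", 4: "mobile"}

query = [1, 4]                     # SELECT event_id, source
read_row(old_file_row, query)      # {1: 'evt-1', 4: None}
read_row(new_file_row, query)      # {1: 'evt-2', 4: 'mobile'}
```

Renames are just as cheap: only the name-to-field-ID mapping in the metadata file changes, and field ID 1 keeps meaning the same physical column everywhere.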
3. Time Travel and Snapshot Isolation
Every write creates a new immutable snapshot. Snapshots are never mutated or deleted (unless explicitly expired). This means:
- You can query the table as it existed 30 days ago
- Concurrent reads aren't blocked by concurrent writes
- Snapshot expiration is manual (you decide when old snapshots are garbage)
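Time travel falls out of the snapshot log almost for free: "as of time T" just means "the latest snapshot committed at or before T". A hedged sketch over the snapshot entries shown in the metadata-file example:

```python
def snapshot_as_of(snapshots, ts_ms):
    """Return the latest snapshot committed at or before ts_ms (sketch).
    Time travel is simply reading an older entry from the snapshot log
    and following its manifest list instead of the current one."""
    eligible = [s for s in snapshots if s["timestamp-ms"] <= ts_ms]
    if not eligible:
        raise ValueError("no snapshot exists at or before that time")
    return max(eligible, key=lambda s: s["timestamp-ms"])

snaps = [{"snapshot-id": 1, "timestamp-ms": 1000},
         {"snapshot-id": 2, "timestamp-ms": 2000},
         {"snapshot-id": 3, "timestamp-ms": 3000}]
snapshot_as_of(snaps, 2500)["snapshot-id"]  # 2
```

Because every snapshot's manifests and data files are immutable, a reader pinned to snapshot 2 is completely unaffected by writers creating snapshot 4 concurrently.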
Write Modes: Copy-on-Write vs Merge-on-Read
When you delete or update rows, Iceberg gives you two options:
Copy-on-Write (CoW)
- When you update/delete rows in file X, rewrite the entire file
- Pros: Readers always see clean data files, no performance penalty
- Cons: Updates are expensive (rewrite entire files)
- Best for: Read-heavy workloads
Merge-on-Read (MoR)
- When you update/delete rows, write a separate delete file (position delete or equality delete)
- Pros: Updates are fast (just write a small delete file)
- Cons: Readers must merge data files + delete files, slight read penalty
- Best for: Write-heavy workloads
The Iceberg v2 spec introduced position deletes (rows identified by data file path and row position) and equality deletes (rows identified by column values). This gives engines flexibility in how they reconcile deletes.
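To make merge-on-read concrete, here is a toy sketch of a reader applying position deletes while scanning one data file. The data structures are illustrative, not the on-disk delete-file format:

```python
def merge_on_read(data_rows, position_deletes, file_path):
    """Apply v2-style position deletes during a scan (sketch).
    Each position delete names a (data_file_path, row_position) pair;
    the reader drops those row positions while streaming the file."""
    dead = {pos for (path, pos) in position_deletes if path == file_path}
    return [row for i, row in enumerate(data_rows) if i not in dead]

rows = ["a", "b", "c", "d"]
deletes = [("f1.parquet", 1), ("f1.parquet", 3), ("f2.parquet", 0)]
merge_on_read(rows, deletes, "f1.parquet")  # ['a', 'c']
```

This is exactly the CoW/MoR trade-off in code: the delete commit only wrote the tiny `deletes` list, and the cost of the set lookup moved to read time.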
Multi-Engine Interoperability
Here's what's remarkable: Spark, Trino, Flink, Presto, Hive, and Impala all read the same metadata format. The spec is engine-agnostic. A data engineer can:
- Write data with Spark
- Query it with Trino
- Delete rows with Flink
- Time travel with Presto

...all on the same table, without format conversions.
This is possible because Iceberg separates the format spec from the execution engine. The metadata hierarchy is standardized. Query engines just implement readers for that standard.
Optimistic Concurrency Control
With multiple writers, how does Iceberg prevent conflicts? Via optimistic concurrency:
- Writer A reads the current metadata file
- Writer B reads the current metadata file
- Writer A finishes its changes, creates a new metadata file, tries to update the catalog pointer
- Update succeeds (CAS operation)
- Writer B finishes its changes, creates a new metadata file, tries to update the catalog pointer
- Update fails (pointer no longer points to the metadata file Writer B read)
- Writer B detects the conflict, re-reads the current metadata, recomputes its changes, and retries
This gives you serializable isolation. Writers don't block; they just retry on conflicts. For most workloads (few concurrent writers), conflicts are rare. For high-concurrency scenarios, you might need more sophisticated conflict resolution, but the default retry mechanism is sound.
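The whole protocol fits in a few lines. Below is a toy in-memory catalog with a compare-and-swap pointer and an optimistic commit loop; it simulates Writer B losing the race and retrying. Everything here is a sketch, not the real catalog API:

```python
import threading

class Catalog:
    """Toy catalog: one mutable pointer, updated only by compare-and-swap."""
    def __init__(self, pointer):
        self._pointer = pointer
        self._lock = threading.Lock()   # stands in for the store's atomicity

    def current(self):
        return self._pointer

    def cas(self, expected, new):
        with self._lock:
            if self._pointer != expected:
                return False            # someone else committed first
            self._pointer = new
            return True

def commit(catalog, build_metadata, max_retries=5):
    """Optimistic commit: read pointer, build new metadata, CAS, retry."""
    for _ in range(max_retries):
        base = catalog.current()
        new_metadata = build_metadata(base)   # re-plan against latest state
        if catalog.cas(base, new_metadata):
            return new_metadata
    raise RuntimeError("too many concurrent commits")

cat = Catalog("v1.json")
calls = []

def build(base):
    calls.append(base)
    if len(calls) == 1:
        cat.cas("v1.json", "v1b.json")  # a competing writer sneaks in
    return base + "+append"

commit(cat, build)   # first CAS fails, retry against v1b.json succeeds
```

The retry re-runs `build_metadata` against the fresh base, which is where real conflict detection (did the other commit touch my files?) would plug in.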
Partition Evolution: Change Partitioning Without Rewriting
Suppose you initially partitioned by month(event_time), and now you want to partition by day(event_time) for better file pruning.
Traditional approach: Rewrite the entire table with the new partition scheme.
Iceberg approach:
- Old snapshots keep their old partition scheme
- New writes use the new partition scheme
- Manifest files track partition spec ID per file
- Queries automatically handle mixed partition layouts
No rewrites needed. This is huge for large tables where rewriting would take hours.
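Query planning over a mixed layout looks roughly like this. A hypothetical Python sketch where each file records the spec_id it was written under (spec 0 = month(event_time), spec 1 = day(event_time)); the planner simply prunes at whatever granularity each file's spec allows:

```python
def prune(files, day):
    """Plan a single-day query across two partition specs (sketch).
    day is an ISO date string like '2024-01-15'."""
    keep = []
    for f in files:
        if f["spec_id"] == 1:
            ok = f["partition"]["day"] == day          # exact day match
        else:
            ok = f["partition"]["month"] == day[:7]    # coarser: whole month
        if ok:
            keep.append(f["file_path"])
    return keep

files = [
    {"file_path": "old-1", "spec_id": 0, "partition": {"month": "2024-01"}},
    {"file_path": "new-1", "spec_id": 1, "partition": {"day": "2024-01-15"}},
    {"file_path": "new-2", "spec_id": 1, "partition": {"day": "2024-02-01"}},
]
prune(files, "2024-01-15")  # ['old-1', 'new-1']
```

Old files are pruned less aggressively (month granularity) but still correctly; as compaction rewrites them under the new spec, pruning sharpens automatically.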
Conclusion: Why This Architecture Wins
Iceberg's metadata hierarchy achieves something remarkable: ACID guarantees on immutable cloud storage, without sacrificing performance or requiring a centralized database.
The design principles are:
- Correctness over performance - atomic commits matter more than throughput
- Immutability - makes caching, parallelism, and disaster recovery trivial
- Versioning - every snapshot is preserved, enabling time travel and rollback
- Engine-agnostic - the spec is open, allowing diverse tools to interoperate
- File-level granularity - statistics are recorded per file, enabling efficient pruning
If you're building a data platform or migrating from Hive/Delta, understand this architecture. It's not just a file format; it's a rethinking of how to manage massive datasets reliably.
Want to dive deeper?
- GitHub: https://github.com/apache/iceberg
- Specification: https://iceberg.apache.org/spec/
- Java API: the org.apache.iceberg package in the repo
I'm Prithvi S, Staff Software Engineer at Cloudera and open-source enthusiast. Follow my work on GitHub: https://github.com/iprithv