When you have a petabyte of data across millions of files in cloud storage, how do you ensure that reads are consistent, writes don't collide, and schema changes don't break everything? Traditional data lakes punt on this problem. Apache Iceberg solves it with an elegant metadata architecture that brings SQL-table reliability to distributed storage without needing a centralized database.
Let me walk you through how it works, why each layer matters, and what makes it fundamentally different from older table formats.
The Problem: Why Traditional Data Lakes Are Unreliable
Before Iceberg, data lakes operated like this:
- Data engineers wrote Parquet files to S3
- A Hive metastore tracked table schemas and partition locations
- Queries discovered files by scanning directories or querying the metastore
- Updates meant either rewriting entire partitions or leaving data inconsistent
This worked for append-only workflows. But the moment you needed:
- Schema evolution (add a column) without rewriting data
- Atomic deletes without breaking other writers
- Time travel (query historical snapshots)
- Concurrent writes without conflicts

...you hit a wall. The metastore was a single point of contention, and there was no reliable way to track which files belonged to which version of the table.
Iceberg fixes this by building a versioned metadata system directly into the table format. No metastore required (though one can help). Just immutable snapshots and a pointer to the current state.
The Metadata Hierarchy: Five Layers of Metadata
Iceberg organizes metadata in a clean bottom-to-top hierarchy. Each layer is immutable, and each layer is built from the layer below it. Here's how it works:
Layer 1: Data Files
At the bottom are your actual data files: Parquet, ORC, or Avro files stored in cloud storage (S3, GCS, Azure Blob, or local filesystems). These contain the raw table data, partitioned and compressed.
s3://my-bucket/warehouse/db/table/data/
00001-abc123.parquet <- 50MB, partition year=2024, month=01
00002-def456.parquet <- 60MB, partition year=2024, month=01
00003-ghi789.parquet <- 45MB, partition year=2024, month=02
From Iceberg's perspective, these files are opaque. It doesn't care about their internal structure. What matters is that each file has:
- A file path
- File format (parquet/orc/avro)
- Partition values (year=2024, month=01)
- Column-level statistics (min/max/null counts per column)
- File size and record count
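Those five pieces of per-file metadata can be modeled as a small record. The sketch below is a toy Python dataclass whose field names mirror the manifest entry shown later in this article; it is illustrative only, not the real Iceberg API.

```python
from dataclasses import dataclass, field

@dataclass
class DataFileEntry:
    """The per-file metadata Iceberg tracks for each data file (sketch)."""
    file_path: str
    file_format: str                  # "parquet" | "orc" | "avro"
    partition: dict                   # e.g. {"year": 2024, "month": 1}
    record_count: int
    file_size_in_bytes: int
    lower_bounds: dict = field(default_factory=dict)  # column id -> min value
    upper_bounds: dict = field(default_factory=dict)  # column id -> max value
    null_value_counts: dict = field(default_factory=dict)

f = DataFileEntry(
    file_path="s3://my-bucket/warehouse/db/table/data/00001-abc123.parquet",
    file_format="parquet",
    partition={"year": 2024, "month": 1},
    record_count=1_234_567,
    file_size_in_bytes=52_428_800,
    lower_bounds={2: 100},
    upper_bounds={2: 9999},
)
```

Note that the statistics are keyed by column ID, not column name; that detail becomes important for schema evolution later.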
Layer 2: Manifest Files
Above data files sit manifest files. A manifest file is an Avro file that lists the data files belonging to a snapshot, along with metadata about each one.
Think of a manifest like a file listing with extra info:
{
"status": 1,
"snapshot_id": 9223372036854775807,
"sequence_number": 0,
"file_path": "s3://my-bucket/warehouse/db/table/data/00001-abc123.parquet",
"file_format": "PARQUET",
"spec_id": 0,
"partition": {
"year": 2024,
"month": 1
},
"record_count": 1234567,
"file_size_in_bytes": 52428800,
"column_sizes": {
"1": 10485760,
"2": 15728640,
"3": 26214400
},
"value_counts": {
"2": 1234567
},
"null_value_counts": {
"2": 5
},
"lower_bounds": {
"1": "2024-01-15",
"2": 100
},
"upper_bounds": {
"1": "2024-01-31",
"2": 9999
}
}
The manifest records:
- Which data files are live vs deleted
- Partition values for each file
- Column-level statistics (min/max/null counts) for pruning
- File sizes and record counts
This is where the magic starts. Because every data file's statistics are recorded in a manifest, query engines can:
- Read a single manifest file (much smaller than scanning all data files)
- Prune files based on partition values or column statistics
- Skip reading files that don't match the query filter
On a table with a million data files, you might read one manifest file and determine that only 500 files match your query. No metadata service needed; it's all in the manifest.
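Here is what that pruning logic looks like in miniature. This is a hypothetical Python sketch over plain dicts shaped like the manifest entry above (partition values plus per-column bounds keyed by column ID), not the real engine code.

```python
def matches(entry, partition_filter, col_id, lo, hi):
    """Return True if the data file COULD contain matching rows (sketch).

    partition_filter: partition values that must match exactly.
    (col_id, lo, hi): keep files whose [lower_bound, upper_bound] for
    col_id overlaps the query range [lo, hi].
    """
    for k, v in partition_filter.items():
        if entry["partition"].get(k) != v:
            return False               # wrong partition: skip the file
    f_lo = entry["lower_bounds"].get(col_id)
    f_hi = entry["upper_bounds"].get(col_id)
    if f_lo is not None and f_hi is not None:
        if f_hi < lo or f_lo > hi:
            return False               # stats prove no overlap: skip
    return True                        # might match: must be read

files = [
    {"file_path": "00001", "partition": {"year": 2024, "month": 1},
     "lower_bounds": {2: 100}, "upper_bounds": {2: 9999}},
    {"file_path": "00003", "partition": {"year": 2024, "month": 2},
     "lower_bounds": {2: 100}, "upper_bounds": {2: 9999}},
]
live = [f for f in files if matches(f, {"month": 1}, 2, 500, 600)]
# only 00001 survives: 00003 is pruned by its partition value
```

The key point: this decision is made from manifest metadata alone, before a single data file is opened.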
Layer 3: Manifest List
A manifest list is an Avro file that references all the manifest files for a given snapshot.
{
"manifest_path": "s3://my-bucket/warehouse/db/table/metadata/10001-abc.avro",
"manifest_length": 1048576,
"partition_spec_id": 0,
"added_snapshot_id": 9223372036854775807,
"added_files_count": 150,
"existing_files_count": 2500,
"deleted_files_count": 10,
"added_rows_count": 18750000,
"existing_rows_count": 312500000,
"deleted_rows_count": 125000,
"partitions": [
{
"contains_null": false,
"lower_bound": 2024,
"upper_bound": 2024
}
]
}
The manifest list aggregates statistics from all manifests for that snapshot:
- How many files were added/existing/deleted?
- How many rows were added/deleted?
- What partition ranges does this snapshot contain?
Why? Because query engines need to know if a snapshot is even relevant before reading all its manifests. The manifest list answers that in one read.
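That one-read check can be sketched like this. The per-partition-field summaries (the "partitions" array above) carry the lower/upper bound of each partition field across the entire manifest, so an engine can discard a whole manifest without opening it. Hypothetical Python, single partition field for simplicity:

```python
def manifest_is_relevant(manifest_entry, lo, hi):
    """Check a manifest-list entry's partition summaries before opening
    the manifest itself (sketch). Each summary holds the min/max of one
    partition field across every file the manifest tracks."""
    for s in manifest_entry["partitions"]:
        if s["upper_bound"] < lo or s["lower_bound"] > hi:
            return False   # no file in this manifest can match the query
    return True

entry = {"manifest_path": "s3://my-bucket/.../10001-abc.avro",
         "partitions": [{"contains_null": False,
                         "lower_bound": 2024, "upper_bound": 2024}]}

manifest_is_relevant(entry, 2024, 2024)  # True: must read this manifest
manifest_is_relevant(entry, 2025, 2026)  # False: skip it entirely
```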
Layer 4: Metadata File (JSON)
Above the manifest list sits the metadata file. This is a JSON file that contains:
- Table schema and column definitions
- Partition spec (how to partition the table)
- Current snapshot ID (pointer to the "live" snapshot)
- Snapshot history (all past snapshots)
- Table properties and settings
- Sorted order definitions
- Schema evolution history (field IDs and renames)
Example metadata file structure:
{
"format-version": 2,
"table-uuid": "abc-123-def-456",
"location": "s3://my-bucket/warehouse/db/table",
"last-sequence-number": 3,
"last-updated-ms": 1712973955000,
"last-column-id": 3,
"schema": {
"type": "struct",
"fields": [
{"id": 1, "name": "event_id", "type": "string"},
{"id": 2, "name": "user_id", "type": "long"},
{"id": 3, "name": "event_timestamp", "type": "timestamp"}
]
},
"partition-spec": {
"fields": [
{"source-id": 3, "transform": "year", "name": "year"}
]
},
"current-snapshot-id": 9223372036854775807,
"snapshots": [
{
"snapshot-id": 9223372036854775807,
"timestamp-ms": 1712973955000,
"summary": {
"operation": "append",
"spark.app.id": "app-20240412-123456"
},
"manifest-list": "s3://my-bucket/warehouse/db/table/metadata/v1.manifest.list"
}
]
}
This single JSON file is the table's source of truth. It tells you:
- What columns exist (with field IDs, not just names)
- How the table is partitioned
- Which snapshot is current
- The full history of all snapshots ever created
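Resolving "what should I read?" from this file takes a handful of lines. A minimal sketch, using the key names from the example above (the real metadata file has more fields, and real manifest-list paths look different):

```python
import json

def current_manifest_list(metadata_json: str) -> str:
    """Resolve the current snapshot's manifest-list location from a
    table metadata file (sketch)."""
    meta = json.loads(metadata_json)
    current_id = meta["current-snapshot-id"]
    for snap in meta["snapshots"]:
        if snap["snapshot-id"] == current_id:
            return snap["manifest-list"]
    raise ValueError("current snapshot not found in snapshot log")

doc = json.dumps({
    "current-snapshot-id": 2,
    "snapshots": [
        {"snapshot-id": 1, "manifest-list": "s3://bucket/metadata/snap-1.avro"},
        {"snapshot-id": 2, "manifest-list": "s3://bucket/metadata/snap-2.avro"},
    ],
})
current_manifest_list(doc)  # 's3://bucket/metadata/snap-2.avro'
```

From there the read path is mechanical: metadata file, then manifest list, then the surviving manifests, then the surviving data files.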
Layer 5: The Catalog Pointer (The Only Mutable Piece)
Finally, at the very top sits the catalog. The catalog's job is simple: store a pointer to the current metadata file location.
table_identifier (db.table) -> s3://bucket/warehouse/db/table/metadata/v123.json
This is the only mutable piece in the entire system. When you commit a write, you:
- Create a new metadata file (immutable)
- Create new manifest files (immutable)
- Create new data files (immutable)
- Update the catalog pointer to point to the new metadata file (atomic CAS operation)
If the pointer update fails (because another writer already updated it), you retry with conflict detection. This gives you serializable isolation without a database.
Why This Matters: Three Critical Properties
This five-layer hierarchy gives you three things that traditional data lakes don't have:
1. Atomicity Without a Database
Traditional approach: Write files to S3, update the metastore database. If the database update fails, you have orphaned files. If the application crashes mid-write, the metastore is inconsistent.
Iceberg approach: Write everything (metadata files, manifests, data files) to immutable storage. The only atomic operation is the catalog pointer update: a compare-and-swap in a key-value store or database, or an atomic rename on filesystems that support one (plain S3 has no atomic rename, which is precisely why a catalog sits on top). If that operation fails, nothing changed. No orphaned state.
2. Schema Evolution Without Rewrites
With Hive tables, columns are identified by position or name. Renaming a column either breaks reads of old data (name-based resolution comes back NULL) or forces you to rewrite every existing data file; reordering or dropping columns is similarly hazardous.
Iceberg uses field IDs instead of names or positions. Column 1 is always "event_id" internally, even if you rename the external column or reorder columns. Old data files don't change. New data files use the new schema. Queries automatically reconcile both.
Example: You have a table with schema (id, user_id, event_time). You want to add a source column:
- Old data files don't have source (field ID 4)
- New data files do have source
- Iceberg handles the missing column transparently (reads return NULL for it)
- No rewrites needed
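The mechanics reduce to projection by field ID. A toy sketch in Python, where stored rows are keyed by field ID rather than column name (illustrative only; real Parquet readers do this inside the file format):

```python
def read_row(stored_row, query_field_ids):
    """Project a stored row (keyed by field ID) onto the query schema.
    Field IDs absent from the file -- e.g. columns added after the file
    was written -- come back as None. Old files are never rewritten."""
    return {fid: stored_row.get(fid) for fid in query_field_ids}

old_file_row = {1: "evt-1", 2: 42, 3: "2024-01-15T00:00:00"}            # no field 4
new_file_row = {1: "evt-2", 2: 43, 3: "2024-02-01T00:00:00", 4: "mobile"}

query = [1, 4]                     # SELECT event_id, source
read_row(old_file_row, query)      # {1: 'evt-1', 4: None}
read_row(new_file_row, query)      # {1: 'evt-2', 4: 'mobile'}
```

Renames are just as cheap: only the name-to-field-ID mapping in the metadata file changes, and field ID 1 keeps meaning the same physical column everywhere.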
3. Time Travel and Snapshot Isolation
Every write creates a new immutable snapshot. Snapshots are never mutated or deleted (unless explicitly expired). This means:
- You can query the table as it existed 30 days ago
- Concurrent reads aren't blocked by concurrent writes
- Snapshot expiration is manual (you decide when old snapshots are garbage)
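Time travel falls out of the snapshot log almost for free: "as of time T" just means "the latest snapshot committed at or before T". A hedged sketch over the snapshot entries shown in the metadata-file example:

```python
def snapshot_as_of(snapshots, ts_ms):
    """Return the latest snapshot committed at or before ts_ms (sketch).
    Time travel is simply reading an older entry from the snapshot log
    and following its manifest list instead of the current one."""
    eligible = [s for s in snapshots if s["timestamp-ms"] <= ts_ms]
    if not eligible:
        raise ValueError("no snapshot exists at or before that time")
    return max(eligible, key=lambda s: s["timestamp-ms"])

snaps = [{"snapshot-id": 1, "timestamp-ms": 1000},
         {"snapshot-id": 2, "timestamp-ms": 2000},
         {"snapshot-id": 3, "timestamp-ms": 3000}]
snapshot_as_of(snaps, 2500)["snapshot-id"]  # 2
```

Because every snapshot's manifests and data files are immutable, a reader pinned to snapshot 2 is completely unaffected by writers creating snapshot 4 concurrently.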
Write Modes: Copy-on-Write vs Merge-on-Read
When you delete or update rows, Iceberg gives you two options:
Copy-on-Write (CoW)
- When you update/delete rows in file X, rewrite the entire file
- Pros: Readers always see clean data files, no performance penalty
- Cons: Updates are expensive (rewrite entire files)
- Best for: Read-heavy workloads
Merge-on-Read (MoR)
- When you update/delete rows, write a separate delete file (position delete or equality delete)
- Pros: Updates are fast (just write a small delete file)
- Cons: Readers must merge data files + delete files, slight read penalty
- Best for: Write-heavy workloads
The Iceberg v2 spec introduced position deletes (rows identified by data file path and row position) and equality deletes (rows identified by column values). This gives engines flexibility in how they reconcile deletes.
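To make merge-on-read concrete, here is a toy sketch of a reader applying position deletes while scanning one data file. The data structures are illustrative, not the on-disk delete-file format:

```python
def merge_on_read(data_rows, position_deletes, file_path):
    """Apply v2-style position deletes during a scan (sketch).
    Each position delete names a (data_file_path, row_position) pair;
    the reader drops those row positions while streaming the file."""
    dead = {pos for (path, pos) in position_deletes if path == file_path}
    return [row for i, row in enumerate(data_rows) if i not in dead]

rows = ["a", "b", "c", "d"]
deletes = [("f1.parquet", 1), ("f1.parquet", 3), ("f2.parquet", 0)]
merge_on_read(rows, deletes, "f1.parquet")  # ['a', 'c']
```

This is exactly the CoW/MoR trade-off in code: the delete commit only wrote the tiny `deletes` list, and the cost of the set lookup moved to read time.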
Multi-Engine Interoperability
Here's what's remarkable: Spark, Trino, Flink, Presto, Hive, and Impala all read the same metadata format. The spec is engine-agnostic. A data engineer can:
- Write data with Spark
- Query it with Trino
- Delete rows with Flink
- Time travel with Presto

...all on the same table, without format conversions.
This is possible because Iceberg separates the format spec from the execution engine. The metadata hierarchy is standardized. Query engines just implement readers for that standard.
Optimistic Concurrency Control
With multiple writers, how does Iceberg prevent conflicts? Via optimistic concurrency:
- Writer A reads the current metadata file
- Writer B reads the current metadata file
- Writer A finishes its changes, creates a new metadata file, tries to update the catalog pointer
- Update succeeds (CAS operation)
- Writer B finishes its changes, creates a new metadata file, tries to update the catalog pointer
- Update fails (pointer no longer points to the metadata file Writer B read)
- Writer B detects the conflict, re-reads the current metadata, recomputes its changes, and retries
This gives you serializable isolation. Writers don't block; they just retry on conflicts. For most workloads (few concurrent writers), conflicts are rare. For high-concurrency scenarios, you might need more sophisticated conflict resolution, but the default retry mechanism is sound.
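The whole protocol fits in a few lines. Below is a toy in-memory catalog with a compare-and-swap pointer and an optimistic commit loop; it simulates Writer B losing the race and retrying. Everything here is a sketch, not the real catalog API:

```python
import threading

class Catalog:
    """Toy catalog: one mutable pointer, updated only by compare-and-swap."""
    def __init__(self, pointer):
        self._pointer = pointer
        self._lock = threading.Lock()   # stands in for the store's atomicity

    def current(self):
        return self._pointer

    def cas(self, expected, new):
        with self._lock:
            if self._pointer != expected:
                return False            # someone else committed first
            self._pointer = new
            return True

def commit(catalog, build_metadata, max_retries=5):
    """Optimistic commit: read pointer, build new metadata, CAS, retry."""
    for _ in range(max_retries):
        base = catalog.current()
        new_metadata = build_metadata(base)   # re-plan against latest state
        if catalog.cas(base, new_metadata):
            return new_metadata
    raise RuntimeError("too many concurrent commits")

cat = Catalog("v1.json")
calls = []

def build(base):
    calls.append(base)
    if len(calls) == 1:
        cat.cas("v1.json", "v1b.json")  # a competing writer sneaks in
    return base + "+append"

commit(cat, build)   # first CAS fails, retry against v1b.json succeeds
```

The retry re-runs `build_metadata` against the fresh base, which is where real conflict detection (did the other commit touch my files?) would plug in.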
Partition Evolution: Change Partitioning Without Rewriting
Suppose you initially partitioned by month(event_time), and now you want to partition by day(event_time) for better file pruning.
Traditional approach: Rewrite the entire table with the new partition scheme.
Iceberg approach:
- Old snapshots keep their old partition scheme
- New writes use the new partition scheme
- Manifest files track partition spec ID per file
- Queries automatically handle mixed partition layouts
No rewrites needed. This is huge for large tables where rewriting would take hours.
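Query planning over a mixed layout looks roughly like this. A hypothetical Python sketch where each file records the spec_id it was written under (spec 0 = month(event_time), spec 1 = day(event_time)); the planner simply prunes at whatever granularity each file's spec allows:

```python
def prune(files, day):
    """Plan a single-day query across two partition specs (sketch).
    day is an ISO date string like '2024-01-15'."""
    keep = []
    for f in files:
        if f["spec_id"] == 1:
            ok = f["partition"]["day"] == day          # exact day match
        else:
            ok = f["partition"]["month"] == day[:7]    # coarser: whole month
        if ok:
            keep.append(f["file_path"])
    return keep

files = [
    {"file_path": "old-1", "spec_id": 0, "partition": {"month": "2024-01"}},
    {"file_path": "new-1", "spec_id": 1, "partition": {"day": "2024-01-15"}},
    {"file_path": "new-2", "spec_id": 1, "partition": {"day": "2024-02-01"}},
]
prune(files, "2024-01-15")  # ['old-1', 'new-1']
```

Old files are pruned less aggressively (month granularity) but still correctly; as compaction rewrites them under the new spec, pruning sharpens automatically.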
Conclusion: Why This Architecture Wins
Iceberg's metadata hierarchy achieves something remarkable: ACID guarantees on immutable cloud storage, without sacrificing performance or requiring a centralized database.
The design principles are:
- Correctness over performance - atomic commits matter more than throughput
- Immutability - makes caching, parallelism, and disaster recovery trivial
- Versioning - every snapshot is preserved, enabling time travel and rollback
- Engine-agnostic - the spec is open, allowing diverse tools to interoperate
- File-level granularity - statistics are recorded per file, enabling efficient pruning
If you're building a data platform or migrating from Hive/Delta, understand this architecture. It's not just a file format; it's a rethinking of how to manage massive datasets reliably.
Want to dive deeper?
- GitHub: https://github.com/apache/iceberg
- Specification: https://iceberg.apache.org/spec/
- Java API: the org.apache.iceberg package in the repo
I'm Prithvi S, Staff Software Engineer at Cloudera and open-source enthusiast. Follow my work on GitHub: https://github.com/iprithv