Prithvi S

Posted on Jun 12

Apache Iceberg: A Modern Data Architecture for Scalable Analytics

#iceberg #data #architecture #database

Introduction

Data lakes have become the de-facto storage layer for modern analytics, but the lack of a reliable table format has long hampered their usefulness. Apache Iceberg fills that gap by offering an open-source, high-performance table format that works across multiple compute engines. In this post we explore the core design of Iceberg, why it matters for data architects, and how its features-metadata hierarchy, schema evolution, hidden partitioning, write modes, and optimistic concurrency-enable safe, painless analytics at scale.

1. The Metadata Tree - Tracking Every File Without a Database

Iceberg stores a table’s state in a metadata hierarchy that replaces the traditional metastore database. The hierarchy consists of five layers:

Data Files - Parquet, ORC, or Avro files that hold the actual rows.
Manifest Files - Each manifest lists a set of data files and records per-file statistics such as column min/max, null counts, and partition values.
Manifest List - An immutable JSON file that points to all manifests belonging to a snapshot.
Metadata File - The snapshot-level JSON file that defines schema, partition spec, current snapshot pointer, and table properties.
Catalog - A thin service that stores only the location of the latest metadata file.

Only the catalog pointer changes when a table is updated. All other files are immutable, which means Iceberg can safely retain a complete history of the table without risking corruption.

Why this matters: With immutable metadata you gain strong consistency guarantees, time-travel queries, and easy rollbacks-all without a separate metastore service.

2. Schema Evolution - Changing Columns Without Rewrites

Traditional data lakes require costly ETL jobs whenever a column is added, renamed, or dropped. Iceberg solves this with field IDs. When a table is created, each column receives a stable numeric identifier that never changes, even if the column name does.

How it works

Add Column - A new field ID is assigned and the metadata file is updated. Existing data files remain untouched.
Rename Column - The name in the metadata file is changed; the field ID stays the same, so readers continue to map values correctly.
Drop Column - The column is simply omitted from the schema; the underlying data stays.
Reorder Columns - Since IDs are used for mapping, the physical order of columns is irrelevant.

This approach allows you to evolve the schema continuously without rewriting billions of rows. Old snapshots preserve the original schema, enabling seamless point-in-time analysis.

3. Hidden Partitioning - Let the Engine Do the Work

Partitioning a table dramatically speeds up queries by pruning irrelevant files. In classic Hive-style tables, users must write partition columns explicitly in their queries (e.g., WHERE year = 2024). Iceberg introduces partition transforms that automatically derive partition values from data columns.

Supported transforms include:

year, month, day, hour - Extract temporal components.
bucket(N) - Hash-based bucketing.
truncate(N) - String truncation.

When a table is created you define the partition spec, and Iceberg records the transformed values in the manifest files. At query time the engine reads only the relevant manifests, so users never need to write explicit partition predicates.

Partition Evolution

Iceberg also supports changing the partition spec without rewriting existing data. New files use the new spec, while older files retain their original partition values. The metadata hierarchy merges both views transparently, allowing a smooth transition as query patterns change.

4. Write Modes - Copy-on-Write vs Merge-on-Read

Iceberg offers two complementary write modes to handle updates and deletes.

Copy-on-Write (CoW)

How it works - Updates rewrite entire data files, producing a new snapshot that points to the refreshed files.
Pros - Simple read path, optimal for read-heavy workloads.
Cons - Higher write amplification for frequent updates.

Merge-on-Read (MoR)

How it works - Writes delete files (position or equality deletes) that reference existing data files. The engine merges deletes at read time.
Pros - Lower write cost for heavy update/delete workloads.
Cons - Reads must apply delete files, adding slight overhead.

Choosing a mode depends on your workload:

Use CoW for analytics pipelines where data is appended and rarely mutated.
Use MoR for streaming or change-data-capture scenarios where updates are frequent.

5. Optimistic Concurrency - Safe Parallel Writes

In a multi-user environment, concurrent writers could step on each other’s toes. Iceberg adopts an optimistic concurrency model that relies on a compare-and-swap (CAS) operation on the catalog pointer.

Commit Flow

Writer reads the current catalog pointer.
Writer creates a new set of metadata files (manifest list, metadata file).
Writer attempts to atomically replace the catalog pointer with the new metadata location.
If the pointer has changed since step 1, the commit fails and the writer retries.

This guarantees serializable isolation without locking the table, allowing many writers to operate in parallel with minimal contention.

6. Time Travel and Snapshot Management

Because every commit creates a new immutable snapshot, Iceberg naturally supports time-travel queries. You can query an older snapshot by ID or timestamp, enabling:

Auditing data changes.
Reproducing historic analyses.
Rolling back accidental bad writes.

Snapshots can be expired to reclaim storage. Iceberg automatically tracks orphaned data files and removes them during cleanup, keeping the lake tidy.

7. Real-World Use Cases

Data Lakehouse Modernization

Companies migrate from Hive tables to Iceberg to gain ACID guarantees and schema flexibility while retaining existing data files. The minimal change to query engines (Spark, Flink, Trino) makes adoption straightforward.

Streaming Data Pipelines

With MoR and equality deletes, Iceberg serves as a sink for CDC streams, allowing downstream analytics to query the latest state without complex merge jobs.

Multi-Engine Interoperability

Because the format is engine-agnostic, the same Iceberg table can be queried from Spark for batch jobs, Trino for ad-hoc analysis, and Flink for real-time dashboards-all sharing a single source of truth.

8. Getting Started

To create a table in Spark:

CREATE TABLE iceberg_db.sales (
  order_id   BIGINT,
  order_date DATE,
  amount     DOUBLE,
  customer_id BIGINT
) USING iceberg
PARTITIONED BY (year(order_date), month(order_date));

To add a column later:

ALTER TABLE iceberg_db.sales ADD COLUMN region STRING;

To query a historic snapshot:

SELECT * FROM iceberg_db.sales /*+ SNAPSHOT_ID('1234567890abcdef') */;

Conclusion

Apache Iceberg brings database-level reliability to data lakes, enabling safe schema changes, efficient partitioning, and concurrent writes without a centralized metastore. Its metadata-first design, flexible write modes, and built-in time travel make it a compelling choice for any organization building a modern analytics platform.

If you’re interested in trying Iceberg, start with the open-source project on GitHub and explore the extensive documentation. The community is active, and many cloud providers now offer managed Iceberg tables, lowering the barrier to entry.

About the author: I'm Prithvi S, Staff Software Engineer at Cloudera and Opensource Enthusiast. I contribute to Apache Lucene, OpenSearch, and related projects. Follow my work on GitHub.

DEV Community