Mohamed Hussain S

Apache Iceberg Explained: From Data Lakes to Metadata, Snapshots, and Real-World Usage

If you’ve worked with data lakes for a while, you’ve probably heard names like Apache Iceberg, Delta Lake, or Apache Hudi.
They’re often mentioned together - but what problem do they actually solve, and why do modern systems like ClickHouse care about them?

This post walks through:

  • What a data lake really is (and why it breaks down)
  • What Apache Iceberg is and what it is not
  • Why metadata matters so much
  • How writers and readers work together
  • Where tools like ClickHouse fit into the picture
  • A real-world example tying everything together

By the end, you should have a clear mental model of Iceberg and open table formats.


What is a Data Lake?

A data lake is typically built on cheap, scalable object storage like S3 (or S3-compatible systems such as MinIO).

At its core, a data lake is:

  • A place to store large amounts of raw data
  • Usually in file formats like Parquet, ORC, or Avro
  • Cheap and flexible

But here’s the key thing:

Object storage has no concept of tables, transactions, or schemas.

From the storage system’s perspective:

  • A “table” is just a folder
  • Files can appear at any time
  • There is no guarantee a file is complete
  • There is no notion of “latest data”
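
For example, listing a "table" prefix with boto3 (the bucket and prefix below are made up) returns nothing but keys and sizes - no schema, no versions, no indication of which files form a consistent table:

```python
import boto3

# Hypothetical bucket and prefix - object storage only knows about keys.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-data-lake", Prefix="events/date=2025-01-01/")

for obj in resp.get("Contents", []):
    # Just opaque objects: no schema, no "committed" flag, no table version.
    print(obj["Key"], obj["Size"])
```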

This is where problems start.


Why Data Lakes Become Painful at Scale

Early data lakes relied on conventions:

  • Folder-based partitions (date=2025-01-01/)
  • Manual rules for writers and readers
  • “Just don’t read while writing”

This works… until it doesn’t.

Common issues:

  • Readers see partial writes
  • Queries mix old and new data
  • Schema changes break jobs
  • Deletes and reprocessing are unsafe
  • Multiple engines step on each other

In short:

Storage can hold files, but it can’t manage tables.


Enter Apache Iceberg

Apache Iceberg is an open table format designed to bring database-like guarantees to data lakes.

Important clarification:

  • 🚫 Iceberg is not a database
  • 🚫 Iceberg is not a query engine
  • 🟢 Iceberg is a table format

Its job is to define:

  • What files belong to a table
  • Which version of the table is current
  • How readers and writers coordinate safely

You can think of Iceberg as the brain of a data lake table.
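
To make "table format" concrete, here is a rough sketch of creating an Iceberg table through Spark SQL. The catalog name (`lake`), warehouse path, and schema are placeholders, and the session assumes the Iceberg Spark runtime is on the classpath:

```python
from pyspark.sql import SparkSession

# Hypothetical catalog named "lake" backed by a placeholder S3 warehouse path.
# Assumes the iceberg-spark-runtime jar is available to Spark.
spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-data-lake/warehouse")
    .getOrCreate()
)

# The table format tracks the schema, partitioning, and every version of this table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.db.events (
        event_id BIGINT,
        user_id  BIGINT,
        event_ts TIMESTAMP,
        payload  STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")
```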


The Core Idea: Metadata Over File Scanning

Iceberg introduces a metadata-driven model:

  • Data files → Actual Parquet files in object storage
  • Metadata files → Describe schemas, partitions, and snapshots
  • Snapshots → Define an exact version of the table
  • Manifests → List data files and their statistics

A crucial idea:

Query engines using Iceberg never discover data files by scanning object storage; they rely entirely on Iceberg metadata to locate valid files.

This is what enables consistency, time travel, and safe concurrent access.
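
A small pyiceberg sketch shows the same chain in code - catalog → table metadata → snapshots → data files - without ever listing the bucket. The catalog name, URI, and table identifier are placeholders:

```python
from pyiceberg.catalog import load_catalog

# Hypothetical REST catalog; the name, URI, and table identifier are placeholders.
catalog = load_catalog("lake", uri="http://localhost:8181")
table = catalog.load_table("db.events")

# Snapshots: each one is a complete, immutable version of the table.
for snap in table.snapshots():
    print(snap.snapshot_id, snap.timestamp_ms, snap.summary)

# The current snapshot's manifests list exactly which data files are live,
# along with per-file statistics that readers use for pruning.
for task in table.scan().plan_files():
    print(task.file.file_path, task.file.record_count)
```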


Who Are “Writers” in a Data Lake?

A writer is not just “something that uploads files”.

A real writer:

  1. Writes data files (e.g., Parquet)
  2. Updates table metadata
  3. Atomically commits a new snapshot

Typical writers include:

  • Apache Spark
  • Apache Flink
  • Streaming pipelines fed by Apache Kafka

If someone uploads files directly to S3 without updating metadata, they are bypassing the table format - and breaking consistency.
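
As a sketch of what a real writer does, appending a batch with Spark's DataFrame API (reusing the hypothetical `lake.db.events` table from above) writes Parquet files, updates metadata, and commits a new snapshot in a single atomic step:

```python
from datetime import datetime
from pyspark.sql import Row

# Reusing the hypothetical Iceberg-enabled SparkSession and lake.db.events table from above.
batch = spark.createDataFrame([
    Row(event_id=1, user_id=42,
        event_ts=datetime(2025, 1, 1, 12, 0, 0),
        payload='{"action": "click"}'),
])

# One call = write Parquet data files + update metadata + atomically commit a new snapshot.
# Until the commit succeeds, readers cannot see any of these files.
batch.writeTo("lake.db.events").append()
```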


How Readers Use Iceberg Metadata

Readers do not guess which files to read.

A reader:

  1. Reads Iceberg metadata
  2. Finds the latest snapshot
  3. Gets an exact list of valid files
  4. Applies pruning using file statistics
  5. Reads only the required Parquet files

This makes queries:

  • Correct
  • Predictable
  • Efficient
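
Those steps map almost one-to-one onto a pyiceberg scan (the filter below is made up); note that nothing in it lists an S3 prefix:

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual

# Same hypothetical catalog and table identifiers as before.
catalog = load_catalog("lake", uri="http://localhost:8181")
table = catalog.load_table("db.events")        # steps 1-2: read metadata, find latest snapshot

scan = table.scan(row_filter=GreaterThanOrEqual("event_ts", "2025-01-01T00:00:00"))

# Steps 3-4: an exact list of valid files, already pruned using per-file column statistics.
print([task.file.file_path for task in scan.plan_files()])

# Step 5: read only the required Parquet files.
arrow_table = scan.to_arrow()
```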

Where ClickHouse Fits In

ClickHouse is primarily a column-oriented analytical database.

It can play two roles:

1. Native database

  • Owns its own storage (MergeTree)
  • Ingests data directly
  • Best performance, since data is stored in its native format

2. External reader / query engine

  • Reads data from S3
  • Can query Iceberg tables
  • Relies on Iceberg metadata for correctness

This is why Iceberg works well in multi-engine environments:

  • Spark writes data
  • ClickHouse reads it
  • No direct coordination required
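
As an illustration of the external-reader role, here is a sketch using the clickhouse-connect Python client and ClickHouse's iceberg table function. The host, S3 path, and credentials are placeholders, and the exact table function name and arguments vary between ClickHouse versions, so treat this as a shape rather than copy-paste:

```python
import clickhouse_connect

# Hypothetical ClickHouse host; the Iceberg table path and credentials are placeholders.
client = clickhouse_connect.get_client(host="localhost", port=8123)

result = client.query("""
    SELECT count() AS events
    FROM iceberg(
        'https://my-data-lake.s3.amazonaws.com/warehouse/db/events',
        'ACCESS_KEY', 'SECRET_KEY'
    )
""")
print(result.result_rows)
```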

A Real-World Scenario

Imagine this pipeline:

  1. Edge devices produce events
  2. Events flow into Kafka
  3. Spark or Flink processes the stream
  4. Data is written as Parquet to S3
  5. Iceberg commits a new snapshot
  6. ClickHouse queries the table
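
Steps 2 through 5 could look roughly like this with Spark Structured Streaming (brokers, topic, and checkpoint path are placeholders, and the hypothetical `lake.db.events` table from earlier is reused); every micro-batch ends in an atomic Iceberg commit, which is exactly what ClickHouse later reads:

```python
# Reusing the hypothetical Iceberg-enabled SparkSession and lake.db.events table from earlier.
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, LongType, StringType, TimestampType

event_schema = StructType([
    StructField("event_id", LongType()),
    StructField("user_id", LongType()),
    StructField("event_ts", TimestampType()),
    StructField("payload", StringType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # placeholder brokers
    .option("subscribe", "events")                     # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

query = (
    events.writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3a://my-data-lake/checkpoints/events")
    .toTable("lake.db.events")   # each micro-batch ends in one atomic snapshot commit
)
# query.awaitTermination()  # block here in a real job
```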

Without Iceberg:

  • ClickHouse might read half-written files
  • Queries may mix old and new data
  • Fixing bad data is risky

With Iceberg:

  • Writers commit atomic snapshots
  • Readers always see a consistent view
  • Old versions remain accessible
  • Multiple engines can safely coexist
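
"Old versions remain accessible" is not an abstract promise: Iceberg's Spark SQL supports time travel by timestamp or snapshot ID (both values below are placeholders, and the syntax needs a reasonably recent Spark/Iceberg pairing):

```python
# Reusing the hypothetical SparkSession; the timestamp and snapshot ID are placeholders.
spark.sql("""
    SELECT count(*) FROM lake.db.events TIMESTAMP AS OF '2025-01-01 00:00:00'
""").show()

spark.sql("""
    SELECT count(*) FROM lake.db.events VERSION AS OF 1234567890123456789
""").show()
```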

What About Delta Lake and Hudi?

Iceberg is not alone.

Other open table formats include:

  • Delta Lake
  • Apache Hudi

They all solve the same core problem:

Making object storage behave like a real table.

They differ mainly in philosophy:

  • Iceberg → metadata-first, engine-neutral
  • Delta Lake → strong Spark ecosystem integration
  • Hudi → incremental and streaming-heavy use cases

Final Mental Model (Worth Remembering)

  • Object storage → holds files
  • Table formats → define tables and versions
  • Writers → create data + metadata
  • Readers → trust metadata, not folders

Or in one line:

Object storage stores data.
Table formats make it trustworthy.


Closing Thoughts

Apache Iceberg doesn’t make queries magically faster.
What it does is far more important:

It makes data lakes correct, consistent, and usable at scale.

That’s why engines like ClickHouse can safely query data lakes today - something that was extremely fragile just a few years ago.
