Mohamed Hussain S

Apache Iceberg Explained: From Data Lakes to Metadata, Snapshots, and Real-World Usage

If you’ve worked with data lakes for a while, you’ve probably heard names like Apache Iceberg, Delta Lake, or Apache Hudi.
They’re often mentioned together - but what problem do they actually solve, and why do modern systems like ClickHouse care about them?

This post walks through:

  • What a data lake really is (and why it breaks down)
  • What Apache Iceberg is and what it is not
  • Why metadata matters so much
  • How writers and readers work together
  • Where tools like ClickHouse fit into the picture
  • A real-world example tying everything together

By the end, you should have a clear mental model of Iceberg and open table formats.


What is a Data Lake?

A data lake is typically built on cheap, scalable object storage like S3 (or S3-compatible systems such as MinIO).

At its core, a data lake is:

  • A place to store large amounts of raw data
  • Usually in file formats like Parquet, ORC, or Avro
  • Cheap and flexible

But here’s the key thing:

Object storage has no concept of tables, transactions, or schemas.

From the storage system’s perspective:

  • A “table” is just a folder
  • Files can appear at any time
  • There is no guarantee a file is complete
  • There is no notion of “latest data”
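
For example, listing a "table" prefix with boto3 (the bucket and prefix below are made up) returns nothing but keys and sizes - no schema, no versions, no indication of which files form a consistent table:

```python
import boto3

# Hypothetical bucket and prefix - object storage only knows about keys.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-data-lake", Prefix="events/date=2025-01-01/")

for obj in resp.get("Contents", []):
    # Just opaque objects: no schema, no "committed" flag, no table version.
    print(obj["Key"], obj["Size"])
```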

This is where problems start.


Why Data Lakes Become Painful at Scale

Early data lakes relied on conventions:

  • Folder-based partitions (date=2025-01-01/)
  • Manual rules for writers and readers
  • “Just don’t read while writing”

This works… until it doesn’t.

Common issues:

  • Readers see partial writes
  • Queries mix old and new data
  • Schema changes break jobs
  • Deletes and reprocessing are unsafe
  • Multiple engines step on each other

In short:

Storage can hold files, but it can’t manage tables.


Enter Apache Iceberg

Apache Iceberg is an open table format designed to bring database-like guarantees to data lakes.

Important clarification:

  • 🚫 Iceberg is not a database
  • 🚫 Iceberg is not a query engine
  • 🟢 Iceberg is a table format

Its job is to define:

  • What files belong to a table
  • Which version of the table is current
  • How readers and writers coordinate safely

You can think of Iceberg as the brain of a data lake table.
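
To make "table format" concrete, here is a rough sketch of creating an Iceberg table through Spark SQL. The catalog name (`lake`), warehouse path, and schema are placeholders, and the session assumes the Iceberg Spark runtime is on the classpath:

```python
from pyspark.sql import SparkSession

# Hypothetical catalog named "lake" backed by a placeholder S3 warehouse path.
# Assumes the iceberg-spark-runtime jar is available to Spark.
spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-data-lake/warehouse")
    .getOrCreate()
)

# The table format tracks the schema, partitioning, and every version of this table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.db.events (
        event_id BIGINT,
        user_id  BIGINT,
        event_ts TIMESTAMP,
        payload  STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")
```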


The Core Idea: Metadata Over File Scanning

Iceberg introduces a metadata-driven model:

  • Data files → Actual Parquet files in object storage
  • Metadata files → Describe schemas, partitions, and snapshots
  • Snapshots → Define an exact version of the table
  • Manifests → List data files and their statistics

A crucial idea:

Query engines using Iceberg never discover data files by scanning object storage; they rely entirely on Iceberg metadata to locate valid files.

This is what enables consistency, time travel, and safe concurrent access.
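
A small pyiceberg sketch shows the same chain in code - catalog → table metadata → snapshots → data files - without ever listing the bucket. The catalog name, URI, and table identifier are placeholders:

```python
from pyiceberg.catalog import load_catalog

# Hypothetical REST catalog; the name, URI, and table identifier are placeholders.
catalog = load_catalog("lake", uri="http://localhost:8181")
table = catalog.load_table("db.events")

# Snapshots: each one is a complete, immutable version of the table.
for snap in table.snapshots():
    print(snap.snapshot_id, snap.timestamp_ms, snap.summary)

# The current snapshot's manifests list exactly which data files are live,
# along with per-file statistics that readers use for pruning.
for task in table.scan().plan_files():
    print(task.file.file_path, task.file.record_count)
```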


Who Are “Writers” in a Data Lake?

A writer is not just “something that uploads files”.

A real writer:

  1. Writes data files (e.g., Parquet)
  2. Updates table metadata
  3. Atomically commits a new snapshot

Typical writers include:

  • Apache Spark
  • Apache Flink
  • Streaming pipelines fed by Apache Kafka

If someone uploads files directly to S3 without updating metadata, they are bypassing the table format - and breaking consistency.
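
As a sketch of what a real writer does, appending a batch with Spark's DataFrame API (reusing the hypothetical `lake.db.events` table from above) writes Parquet files, updates metadata, and commits a new snapshot in a single atomic step:

```python
from datetime import datetime
from pyspark.sql import Row

# Reusing the hypothetical Iceberg-enabled SparkSession and lake.db.events table from above.
batch = spark.createDataFrame([
    Row(event_id=1, user_id=42,
        event_ts=datetime(2025, 1, 1, 12, 0, 0),
        payload='{"action": "click"}'),
])

# One call = write Parquet data files + update metadata + atomically commit a new snapshot.
# Until the commit succeeds, readers cannot see any of these files.
batch.writeTo("lake.db.events").append()
```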


How Readers Use Iceberg Metadata

Readers do not guess which files to read.

A reader:

  1. Reads Iceberg metadata
  2. Finds the latest snapshot
  3. Gets an exact list of valid files
  4. Applies pruning using file statistics
  5. Reads only the required Parquet files

This makes queries:

  • Correct
  • Predictable
  • Efficient
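
Those steps map almost one-to-one onto a pyiceberg scan (the filter below is made up); note that nothing in it lists an S3 prefix:

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual

# Same hypothetical catalog and table identifiers as before.
catalog = load_catalog("lake", uri="http://localhost:8181")
table = catalog.load_table("db.events")        # steps 1-2: read metadata, find latest snapshot

scan = table.scan(row_filter=GreaterThanOrEqual("event_ts", "2025-01-01T00:00:00"))

# Steps 3-4: an exact list of valid files, already pruned using per-file column statistics.
print([task.file.file_path for task in scan.plan_files()])

# Step 5: read only the required Parquet files.
arrow_table = scan.to_arrow()
```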

Where ClickHouse Fits In

ClickHouse is primarily a column-oriented analytical database.

It can play two roles:

1. Native database

  • Owns its own storage (MergeTree)
  • Ingests data directly
  • Best performance, since data is stored in its native format

2. External reader / query engine

  • Reads data from S3
  • Can query Iceberg tables
  • Relies on Iceberg metadata for correctness

This is why Iceberg works well in multi-engine environments:

  • Spark writes data
  • ClickHouse reads it
  • No direct coordination required
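
As an illustration of the external-reader role, here is a sketch using the clickhouse-connect Python client and ClickHouse's iceberg table function. The host, S3 path, and credentials are placeholders, and the exact table function name and arguments vary between ClickHouse versions, so treat this as a shape rather than copy-paste:

```python
import clickhouse_connect

# Hypothetical ClickHouse host; the Iceberg table path and credentials are placeholders.
client = clickhouse_connect.get_client(host="localhost", port=8123)

result = client.query("""
    SELECT count() AS events
    FROM iceberg(
        'https://my-data-lake.s3.amazonaws.com/warehouse/db/events',
        'ACCESS_KEY', 'SECRET_KEY'
    )
""")
print(result.result_rows)
```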

A Real-World Scenario

Imagine this pipeline:

  1. Edge devices produce events
  2. Events flow into Kafka
  3. Spark or Flink processes the stream
  4. Data is written as Parquet to S3
  5. Iceberg commits a new snapshot
  6. ClickHouse queries the table
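
Steps 2 through 5 could look roughly like this with Spark Structured Streaming (brokers, topic, and checkpoint path are placeholders, and the hypothetical `lake.db.events` table from earlier is reused); every micro-batch ends in an atomic Iceberg commit, which is exactly what ClickHouse later reads:

```python
# Reusing the hypothetical Iceberg-enabled SparkSession and lake.db.events table from earlier.
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, LongType, StringType, TimestampType

event_schema = StructType([
    StructField("event_id", LongType()),
    StructField("user_id", LongType()),
    StructField("event_ts", TimestampType()),
    StructField("payload", StringType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # placeholder brokers
    .option("subscribe", "events")                     # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

query = (
    events.writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3a://my-data-lake/checkpoints/events")
    .toTable("lake.db.events")   # each micro-batch ends in one atomic snapshot commit
)
# query.awaitTermination()  # block here in a real job
```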

Without Iceberg:

  • ClickHouse might read half-written files
  • Queries may mix old and new data
  • Fixing bad data is risky

With Iceberg:

  • Writers commit atomic snapshots
  • Readers always see a consistent view
  • Old versions remain accessible
  • Multiple engines can safely coexist
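
"Old versions remain accessible" is not an abstract promise: Iceberg's Spark SQL supports time travel by timestamp or snapshot ID (both values below are placeholders, and the syntax needs a reasonably recent Spark/Iceberg pairing):

```python
# Reusing the hypothetical SparkSession; the timestamp and snapshot ID are placeholders.
spark.sql("""
    SELECT count(*) FROM lake.db.events TIMESTAMP AS OF '2025-01-01 00:00:00'
""").show()

spark.sql("""
    SELECT count(*) FROM lake.db.events VERSION AS OF 1234567890123456789
""").show()
```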

What About Delta Lake and Hudi?

Iceberg is not alone.

Other open table formats include:

  • Delta Lake
  • Apache Hudi

They all solve the same core problem:

Making object storage behave like a real table.

They differ mainly in philosophy:

  • Iceberg → metadata-first, engine-neutral
  • Delta Lake → strong Spark ecosystem integration
  • Hudi → incremental and streaming-heavy use cases

Final Mental Model (Worth Remembering)

  • Object storage → holds files
  • Table formats → define tables and versions
  • Writers → create data + metadata
  • Readers → trust metadata, not folders

Or in one line:

Object storage stores data.
Table formats make it trustworthy.


Closing Thoughts

Apache Iceberg doesn’t make queries magically faster.
What it does is far more important:

It makes data lakes correct, consistent, and usable at scale.

That’s why engines like ClickHouse can safely query data lakes today - something that was extremely fragile just a few years ago.
