Why is Iceberg needed?
When data lakes first started to scale into terabytes and petabytes, the real headache wasn't just storing the files, it was keeping track of them. The raw data (e.g., images, videos) lived in directories, while databases managed partition and schema details, along with asset-related metadata such as labels, pose annotations, and bounding boxes.
Suddenly we had two sources of truth that constantly needed to stay in sync. Change a partition or update a schema, and we had to make sure both the database and the file system reflected the same state. At small scale this was manageable, but as datasets grew into billions of files, keeping directories and databases aligned became messy, fragile and painfully slow.
Hive and its metastore tried to make this easier, but the mismatch remained:
- The metastore (HMS) is a transactional RDBMS.
- The file system (HDFS) is non-transactional.
As partitions multiplied into the millions, directory listings exploded, consistency broke down, and even simple updates like adding or dropping partitions turned into unreliable, time-consuming processes.
Apache Iceberg was created to fix this problem at its core. Instead of patching over the gap between directories and databases, Iceberg redefined what a table is as an open table format (OTF). It tracks every change, such as file additions, deletions, and schema evolution, through atomic snapshots, giving us true transactional consistency. With this design, even at massive scale, we can keep billions of files organized and accessible, while multiple engines like Spark, Flink, and Trino read and write against the same tables reliably.
How does Iceberg come to the rescue?
Design principles
1. Flexible compute
- Don't move data; multiple engines should work seamlessly.
- Support batch, streaming and ad hoc jobs.
- Support code in multiple languages and frameworks, not just SQL.
2. SQL Warehouse
- Transactions with SQL tables (CRUD)
- Keep the table itself for data only; separate out extra concerns (like permissions, logs, and metadata)
Unlike Apache Hive, Apache Iceberg keeps its records in object storage. In production, a single table can contain tens of petabytes of data, and even multi-petabyte tables can be read from a single node, without needing a distributed SQL engine.
3. Keep it simple
With Hive tables, engineers had to care about what was under the hood. Directories, partitions, and the metastore often fell out of sync, which could cause deleted data to show up again or schema changes to behave inconsistently.
Iceberg removes this complexity. Tables are managed through atomic snapshots, so once data is deleted it stays deleted, and schema changes remain consistent across engines. **It feels like working with a clean SQL table without worrying about the file system underneath.**
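A quick illustration of how lightweight this is in practice: schema changes are metadata-only commits. Here is a minimal Spark SQL sketch, assuming an Iceberg table named my_catalog.my_table (the same name used in the example further down):

-- Adding, renaming, or dropping a column only writes new table metadata;
-- no data files are rewritten, and other engines see the change on their next read
ALTER TABLE my_catalog.my_table ADD COLUMN source string;
ALTER TABLE my_catalog.my_table RENAME COLUMN source TO origin;
ALTER TABLE my_catalog.my_table DROP COLUMN origin;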
Iceberg Maintenance
At this point, a natural question comes up:
💡 How does Iceberg make sure deleted data never reappears?
Is there a background cleanup that eventually syncs with the file system?
The answer is: Iceberg never deletes files at DML time. When we run a DELETE or UPDATE, Iceberg simply marks the files as invalid in a new snapshot. From the user's perspective the data is gone, but the physical objects still exist in storage.
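To make this concrete, here is a minimal Spark SQL sketch, assuming an Iceberg table my_catalog.my_table partitioned by category (the same shape as the example further down) and a session with the Iceberg runtime and SQL extensions enabled:

-- The DELETE commits a new snapshot; no data files are physically removed yet
DELETE FROM my_catalog.my_table WHERE category = 'video';

-- The table history shows the delete as just another snapshot
SELECT committed_at, snapshot_id, operation
FROM my_catalog.my_table.snapshots
ORDER BY committed_at;

The deleted rows disappear from query results immediately, but the files that held them are still sitting in object storage until maintenance runs.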
Physical cleanup happens later, during maintenance. There are two ways to do it. The first is the standard path: run Iceberg procedures like expire_snapshots and remove_orphan_files (or Athena VACUUM on AWS). These procedures decide which objects are no longer referenced by any valid snapshot and then perform the cleanup.
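For the standard path, a Spark SQL sketch with the Iceberg SQL extensions enabled (the catalog, namespace, timestamp, and retention values are illustrative):

-- Expire snapshots older than the cutoff while keeping the 5 most recent ones;
-- data files no longer referenced by any surviving snapshot are cleaned up
CALL my_catalog.system.expire_snapshots(
  table => 'db.my_table',
  older_than => TIMESTAMP '2024-01-01 00:00:00',
  retain_last => 5
);

-- Delete files under the table location that no snapshot references,
-- e.g. leftovers from failed or aborted writes
CALL my_catalog.system.remove_orphan_files(table => 'db.my_table');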
The second is an AWS-specific variant for S3: configure the catalog so that, during these maintenance procedures, the engine does not hard-delete unreferenced objects but applies S3 object tags instead (for example s3.delete-enabled=false with s3.delete.tags). S3 lifecycle rules can then expire or tier only those tagged objects. In other words, tagging is a cleanup-time behavior, not something that happens at the moment of DELETE/UPDATE.
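A sketch of that setup when launching Spark, assuming a catalog named my_catalog backed by Iceberg's S3FileIO (the tag key and value, deleted=true, are just an example):

spark-sql \
  --conf spark.sql.catalog.my_catalog.s3.delete-enabled=false \
  --conf spark.sql.catalog.my_catalog.s3.delete.tags.deleted=true

With this in place, expire_snapshots and remove_orphan_files tag unreferenced objects instead of hard-deleting them, and an S3 lifecycle rule scoped to the deleted=true tag handles the actual expiration or transition.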
💡 What if someone manually deletes files in S3 or MinIO? (the broken-sync case: files <-> Iceberg)
Iceberg's snapshots and metadata are the single source of truth. If a user or a broad lifecycle rule removes objects that a valid snapshot still references, queries will fail because the metadata expects files that no longer exist. Prevent this by restricting manual deletes and, on AWS, by scoping lifecycle policies to tags that Iceberg applies during maintenance.
Keeping in Sync (Iceberg <-> Storage)
- AWS Prescriptive Guidance: Best Practices for Apache Iceberg on AWS — Optimizing Storage
- AWS Blog: Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes
Iceberg snapshots and metadata are the single source of truth. That means we cannot allow S3 lifecycle rules to blindly delete files by prefix or by age, because those files may still be referenced by a valid snapshot. If that happens, queries will break.
Data cleanup is always driven by Iceberg's own maintenance. Procedures like expire_snapshots and remove_orphan_files decide which objects are no longer referenced and also handle orphan files left by failed writes. On AWS, these same operations can be triggered through engines such as Athena (using VACUUM) or Glue jobs, but under the hood they still call Iceberg's maintenance routines.
From there we have two options. Iceberg can delete the obsolete objects directly, or, if configured with s3.delete-enabled=false and s3.delete.tags, it tags them instead (for example deleted=true). In the tagging setup, S3 lifecycle rules act only on tagged objects, expiring or transitioning them to cheaper storage like Glacier. This way, Iceberg decides what is obsolete, while S3 provides the automation for the physical cleanup.
Understanding Iceberg FileIO
FileIO is the interface between the core Iceberg library and underlying storage. FileIO was created as a way for Iceberg to function in a world where distributed compute and storage are disaggregated.
The legacy Hadoop ecosystem requires the hierarchical pathing and partition structures that are, in practice, the exact opposite of the methods used to achieve speed and scale in the object storage world.
Hadoop and Hive are anti-patterns for high-performance and scalable cloud-native object storage.
Data lake applications that rely on the S3 API to interact with MinIO can easily scale to thousands of transactions per second on millions or billions of objects.
You can increase read and write performance by processing multiple concurrent requests in parallel. You accomplish this by adding prefixes — a string of characters that is a subset of an object name, starting with the first character — to buckets and then writing parallel operations, each opening a connection per prefix.
Iceberg was designed to run completely abstracted from physical storage using object storage. All locations are "explicit, immutable, and absolute" as defined in metadata. Iceberg tracks the full state of the table without the baggage of referencing directories. It's dramatically faster to use metadata to find a table than it would be to list the entire hierarchy using the S3 API. There are no renames — a commit simply adds new entries to the metadata table.
-- Create a partitioned Iceberg table; the OPTIONS map to Iceberg table properties
CREATE TABLE my_catalog.my_table (
  id bigint,
  data string,
  category string)
USING iceberg
OPTIONS (
  -- add a hash component to data file paths to spread objects across prefixes
  'write.object-storage.enabled'=true,
  -- base location where data files are written
  'write.data.path'='s3://iceberg')
PARTITIONED BY (category);
Insert
INSERT INTO my_catalog.my_table VALUES (1, 'a', 'music'), (2, 'b', 'music'), (3, 'c', 'video');
Manifest files track data files and include details and pre-calculated statistics about each file. The first thing that’s tracked is file format and location.
Manifest files are how Iceberg does away with Hive-style tracking data by filesystem location. Manifest files improve the efficiency and performance of reading data files by including details like partition membership, record count, and the lower and upper bounds of each column. The statistics are written during write operations and are more likely to be timely, accurate and up-to-date than Hive statistics.
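These per-file details are queryable through Iceberg's metadata tables. A small Spark SQL sketch against the table created above (the column list is illustrative):

-- One row per data file tracked by the current snapshot's manifests
SELECT file_path, file_format, partition, record_count, file_size_in_bytes
FROM my_catalog.my_table.files;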
When a SELECT query is submitted, the query engine obtains the location of the manifest list from the table's current metadata file (via the catalog). The engine then reads the file-path entry for each data-file object and opens those data files to execute the query.
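The manifest list can be inspected the same way, which makes it easy to see what the engine has available for pruning (again just a sketch):

-- One row per manifest referenced by the current snapshot's manifest list
SELECT path, added_data_files_count, existing_data_files_count, partition_summaries
FROM my_catalog.my_table.manifests;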
The data files land under the data prefix, organized by partition. A quick aggregation confirms the three inserted rows:
spark-sql> SELECT count(1) as count, data
FROM my_catalog.my_table
GROUP BY data;
1 a
1 b
1 c
Time taken: 9.715 seconds, Fetched 3 row(s)
spark-sql>