Apache Iceberg: The Definitive Guide (O'Reilly Media, 2024)
Apache Iceberg is one of the leading open formats for updatable Parquet-based tables, which are emerging as the new data storage standard for analytics.
Historically, relational databases have stored data row by row, packed into physical pages for efficient I/O. Columnar table formats, however, have proven far more efficient for query-intensive workloads.
Data lakes began by supporting queries over columnar formats such as Parquet, but of course, transactional updates must also be supported efficiently to address traditional warehouse scenarios.
1. Introduction to Apache Iceberg
Data Warehouse
Pros
- Serves as the single source of truth as it allows storing and querying data from various sources
- Supports querying vast amounts of historical data, enabling analytical workloads to run quickly
- Provides effective data governance policies to ensure that data is available, usable, and aligned with security policies
- Organizes the data layout for you, ensuring that it's optimized for querying
Cons
- Locks the data into a vendor-specific system that only the warehouse's compute engine can use
- Expensive in terms of both storage and computation; as the workload increases, the cost becomes hard to manage
- Mainly supports structured data
- Does not enable organizations to natively run advanced analytical workloads such as ML
Data Lake
- Hive open table format
Pros
- Lower cost
- Stores data in open formats
- Handles unstructured data
Cons
- Weaker query performance
- Lack of ACID transactions
- Requires lots of configuration
The Data Lakehouse
The data lakehouse architecture decouples storage and compute, as in data lakes, and adds mechanisms that enable more data warehouse-like functionality (ACID transactions, better performance, consistency).
Table formats create an abstraction layer on top of file storage that enables ACID guarantees when working with data directly on data lake storage, delivering several benefits:
- Fewer copies
- Faster queries
- Historical data snapshots (see the sketch after this list)
- Open architecture
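Those historical snapshots are directly queryable. As a minimal sketch (assuming PyIceberg with a reachable REST catalog; the catalog name and the `db.sales` table are hypothetical), time travel looks like this:

```python
from pyiceberg.catalog import load_catalog

# Hypothetical catalog and table names; assumes an Iceberg REST catalog is running.
catalog = load_catalog("default", **{"type": "rest", "uri": "http://localhost:8181"})
table = catalog.load_table("db.sales")

# Every committed write creates a snapshot; pick an older one and read the
# table exactly as it looked then, without restoring or copying anything.
for snap in table.snapshots():
    print(snap.snapshot_id, snap.timestamp_ms)

oldest = table.snapshots()[0]
df = table.scan(snapshot_id=oldest.snapshot_id).to_arrow()
```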
Apache Iceberg Table Format
What is a table format? A method of structuring a dataset's files to present them as a unified table; in other words, a way to answer the question "What data is in this table?"
It provides an abstraction of the table to users and tools, making it easier for them to interact with the underlying data in an efficient manner.
Table formats have been around since the inception of RDBMSs. In these systems, users could refer to a set of data as a table, and the database engine was responsible for managing the dataset's byte layout on disk in the form of files, while also handling complexities such as transactions. No other engine can interact with the files directly without risking system corruption. The details of how the data is stored are abstracted away, and users take for granted that the platform knows where the data for a specific table is located and how to access it.
However, in today's big data world, your data needs to be accessible to a variety of compute engines optimized for different use cases, such as BI or ML.
In a data lake, all your data is stored as files in some storage solution (e.g., Amazon S3, Google Cloud Storage, Azure). When using SQL with your favorite analytical tools or writing ad hoc scripts in languages such as Java, Python, and Rust, you wouldn't want to constantly define which of these files are in the table and which of them aren't. Not only would this be tedious, but it would also likely lead to inconsistency across different uses of the data.
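To make that concrete, here is the kind of ad hoc file discovery every engine and script ends up doing on its own (a toy sketch with a hypothetical path):

```python
import glob

# Without a table format, "the table" is whatever files a reader happens to
# find under a prefix. Every tool repeats this logic, and a concurrent writer
# can leave two readers seeing two different versions of the "table".
files = glob.glob("/lake/warehouse/sales/**/*.parquet", recursive=True)
print(f"{len(files)} files are currently 'in' the sales table")
```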
So the solution was to create a standard method of answering "What data is in this table?" for data lakes.
Hive: The Original Table Format
The Hive table format took the approach of defining a table as any and all files within a specified directory (or prefixes for object storage). The partitions of those tables would be the subdirectories. These directory paths defining the table are tracked by a service called the Hive Metastore, which query engines can access to know where to find the data applicable to their query.
- Enabled more efficient query patterns than full table scans
- partitioning (dividing the data based on a partitioning key)
- bucketing (an approach to distributing data that applies a hash function to a key so values spread evenly across a fixed number of buckets; see the sketch after this list)
- File format agnostic
- Works with Apache Parquet and other formats
- Doesn't require transformation of existing data (Avro, CSV, etc.) prior to making it available in a Hive table
- The Hive Metastore supports atomic swaps, allowing all-or-nothing (atomic) changes to an individual partition in the table
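The bucketing idea from the list above can be sketched in a few lines of plain Python (a toy illustration using a stable CRC32 hash, not Hive's actual hash function):

```python
import zlib

NUM_BUCKETS = 4

def bucket_for(customer_id: str) -> int:
    # A stable hash so the same key always lands in the same bucket;
    # the modulo spreads keys evenly across a fixed number of buckets.
    return zlib.crc32(customer_id.encode()) % NUM_BUCKETS

rows = [{"customer_id": f"c{i}", "amount": i * 10} for i in range(8)]
for row in rows:
    # In Hive, each bucket is a file inside a partition directory,
    # e.g. /warehouse/sales/dt=2024-01-01/bucket_00003 (illustrative path).
    print(f"bucket_{bucket_for(row['customer_id'])}", row)
```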
Limitations
- Changing a single file was inefficient, because there was no way to safely replace just one file; the Hive Metastore could only atomically swap an entire partition directory, not a single file
- While you could atomically swap a partition, there was no mechanism for atomically updating multiple partitions as one transaction
Iceberg Architecture
data file -> manifest file -> manifest list -> snapshot -> metadata file -> catalog
Manifest file
- A list of data files, containing each data file's path and metadata
Manifest list
- A file that defines a single snapshot of the table as a list of manifest files
Metadata file
- Defines a table's structure (schema, partitioning scheme, listing of snapshots)
Catalog
- A catalog keeps track of where each table is stored.
- In Hive Metastore, a table name points to a set of directories.
- In Iceberg, a table name points to the location of the table’s most recent metadata file instead.
- This metadata file records the full state of the table (snapshots, partitions, and data files).
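As a sketch of how these layers connect in practice (PyIceberg again; the catalog and table names are hypothetical, and attribute names follow recent PyIceberg releases):

```python
from pyiceberg.catalog import load_catalog

# Catalog -> metadata file -> snapshot -> manifest list -> manifests -> data files.
catalog = load_catalog("default", **{"type": "rest", "uri": "http://localhost:8181"})
table = catalog.load_table("db.sales")

print(table.metadata_location)       # metadata file the catalog currently points to
snapshot = table.current_snapshot()  # table state selected from that metadata file
print(snapshot.manifest_list)        # path of the manifest list backing the snapshot

# Each manifest, in turn, lists data files along with per-file metadata.
for manifest in snapshot.manifests(table.io):
    print(manifest.manifest_path)
```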
--
Terms
Catalog
The Iceberg library needs a way to keep track of tables by name. Tasks like creating, dropping, and renaming tables are the responsibility of a catalog.
- Catalogs manage a collection of tables that are usually grouped into namespaces.
- The most important responsibility of a catalog is tracking a table's current metadata.
Snapshot
The state of a table at some time
Manifest list
A metadata file that lists the manifests that make up a table snapshot.
Manifest (file)
A metadata file that lists a subset of data files that make up a snapshot.