Apache Iceberg: The Definitive Guide (O'Reilly Media, 2024)
Apache Iceberg is one of the leading open formats for updatable Parquet-based tables, which are emerging as the new data storage standard for analytics.
Historically, relational databases have stored data row by row, packed into physical pages for efficient I/O. Columnar table formats, however, have proven far more efficient for query-intensive workloads.
Data lakes began by supporting queries over columnar formats such as Parquet, but of course, transactional updates must also be supported efficiently to address traditional warehouse scenarios.
1. Introduction to Apache Iceberg
Data Warehouse
Pros
- Serves as the single source of truth as it allows storing and querying data from various sources
- Supports querying vast amounts of historical data, enabling analytical workloads to run quickly
- Provides effective data governance policies to ensure that data is available, usable, and aligned with security policies
- Organizes the data layout for you, ensuring that it's optimized for querying
Cons
- Locks the data into a vendor-specific system that only the warehouse's compute engine can use
- Expensive in terms of both storage and computation; as the workload increases, the cost becomes hard to manage
- Mainly supports structured data
- Does not enable organizations to natively run advanced analytical workloads such as ML
Data Lake
- Hive open table format
Pros
- Lower cost
- Stores data in open formats
- Handles unstructured data
Cons
- Weaker query performance
- Lack of ACID transactions
- Requires lots of configuration
The Data Lakehouse
The data lakehouse architecture decouples storage and compute, as in data lakes, and adds mechanisms that enable more data warehouse-like functionality (ACID transactions, better performance, consistency).
Table formats create an abstraction layer on top of file storage that enables ACID guarantees when working with data directly on data lake storage, delivering several benefits:
- Fewer copies
- Faster queries
- Historical data snapshots (see the sketch after this list)
- Open architecture
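Those historical snapshots are directly queryable. As a minimal sketch (assuming PyIceberg with a reachable REST catalog; the catalog name and the `db.sales` table are hypothetical), time travel looks like this:

```python
from pyiceberg.catalog import load_catalog

# Hypothetical catalog and table names; assumes an Iceberg REST catalog is running.
catalog = load_catalog("default", **{"type": "rest", "uri": "http://localhost:8181"})
table = catalog.load_table("db.sales")

# Every committed write creates a snapshot; pick an older one and read the
# table exactly as it looked then, without restoring or copying anything.
for snap in table.snapshots():
    print(snap.snapshot_id, snap.timestamp_ms)

oldest = table.snapshots()[0]
df = table.scan(snapshot_id=oldest.snapshot_id).to_arrow()
```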
Apache Iceberg Table Format
What is a table format? A method of structuring a dataset's files to present them as a unified table; in other words, a way to answer the question "What data is in this table?"
It provides an abstraction of the table to users and tools, making it easier for them to interact with the underlying data in an efficient manner.
Table formats have been around since the inception of RDBMSs. In these systems, users could refer to a set of data as a table, and the database engine was responsible for managing the dataset's byte layout on disk in the form of files, while also handling complexities such as transactions. No other engine can interact with the files directly without risking system corruption. The details of how the data is stored are abstracted away, and users take for granted that the platform knows where the data for a specific table is located and how to access it.
However, in today's big data world, your data needs to be accessible to a variety of compute engines optimized for different use cases, such as BI or ML.
In a data lake, all your data is stored as files in some storage solution (e.g., Amazon S3, Google Cloud Storage, Azure). When using SQL with your favorite analytical tools or writing ad hoc scripts in languages such as Java, Python, and Rust, you wouldn't want to constantly define which of these files are in the table and which of them aren't. Not only would this be tedious, but it would also likely lead to inconsistency across different uses of the data.
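To make that concrete, here is the kind of ad hoc file discovery every engine and script ends up doing on its own (a toy sketch with a hypothetical path):

```python
import glob

# Without a table format, "the table" is whatever files a reader happens to
# find under a prefix. Every tool repeats this logic, and a concurrent writer
# can leave two readers seeing two different versions of the "table".
files = glob.glob("/lake/warehouse/sales/**/*.parquet", recursive=True)
print(f"{len(files)} files are currently 'in' the sales table")
```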
So the solution was to create a standard method of answering "What data is in this table?" for data lakes.
Hive: The Original Table Format
The Hive table format took the approach of defining a table as any and all files within a specified directory (or prefixes for object storage). The partitions of those tables would be the subdirectories. These directory paths defining the table are tracked by a service called the Hive Metastore, which query engines can access to know where to find the data applicable to their query.
- Enabled more efficient query patterns than full table scans
- partitioning (dividing the data based on a partitioning key)
- bucketing (an approach to distributing data that applies a hash function to a key so values spread evenly across a fixed number of buckets; see the sketch after this list)
- File format agnostic
- Works with Apache Parquet and other formats
- Doesn't require transformation of existing data (Avro, CSV, etc.) prior to making it available in a Hive table
- The Hive Metastore supports atomic swaps, allowing all-or-nothing (atomic) changes to an individual partition in the table
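The bucketing idea from the list above can be sketched in a few lines of plain Python (a toy illustration using a stable CRC32 hash, not Hive's actual hash function):

```python
import zlib

NUM_BUCKETS = 4

def bucket_for(customer_id: str) -> int:
    # A stable hash so the same key always lands in the same bucket;
    # the modulo spreads keys evenly across a fixed number of buckets.
    return zlib.crc32(customer_id.encode()) % NUM_BUCKETS

rows = [{"customer_id": f"c{i}", "amount": i * 10} for i in range(8)]
for row in rows:
    # In Hive, each bucket is a file inside a partition directory,
    # e.g. /warehouse/sales/dt=2024-01-01/bucket_00003 (illustrative path).
    print(f"bucket_{bucket_for(row['customer_id'])}", row)
```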
Limitations
- Changing a single file was inefficient, because there was no way to safely replace just one file; the Hive Metastore could only atomically swap an entire partition directory, not a single file
- While you could atomically swap a partition, there was no mechanism for atomically updating multiple partitions as one transaction
Iceberg Architecture
data file -> manifest file -> manifest list -> snapshot -> metadata file -> catalog
Manifest file
- A list of data files, containing each data file's path and metadata
Manifest list
- A file that defines a single snapshot of the table as a list of manifest files
Metadata file
- Defines a table's structure (schema, partitioning scheme, listing of snapshots)
Catalog
- A catalog keeps track of where each table is stored.
- In Hive Metastore, a table name points to a set of directories.
- In Iceberg, a table name points to the location of the table’s most recent metadata file instead.
- This metadata file records the full state of the table (snapshots, partitions, and data files).
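As a sketch of how these layers connect in practice (PyIceberg again; the catalog and table names are hypothetical, and attribute names follow recent PyIceberg releases):

```python
from pyiceberg.catalog import load_catalog

# Catalog -> metadata file -> snapshot -> manifest list -> manifests -> data files.
catalog = load_catalog("default", **{"type": "rest", "uri": "http://localhost:8181"})
table = catalog.load_table("db.sales")

print(table.metadata_location)       # metadata file the catalog currently points to
snapshot = table.current_snapshot()  # table state selected from that metadata file
print(snapshot.manifest_list)        # path of the manifest list backing the snapshot

# Each manifest, in turn, lists data files along with per-file metadata.
for manifest in snapshot.manifests(table.io):
    print(manifest.manifest_path)
```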
--
Terms
Catalog
The Iceberg library needs a way to keep track of tables by name. Tasks like creating, dropping, and renaming tables are the responsibility of a catalog.
- Catalogs manage a collection of tables that are usually grouped into namespaces.
- The most important responsibility of a catalog is tracking a table's current metadata.
Snapshot
The state of a table at some time
Manifest list
A metadata file that lists the manifests that make up a table snapshot.
Manifest (file)
A metadata file that lists a subset of data files that make up a snapshot.