<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tom Nats</title>
    <description>The latest articles on DEV Community by Tom Nats (@tnats).</description>
    <link>https://dev.to/tnats</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1104575%2F3110419d-6a5e-4308-9a8c-4e7f22b92c2d.png</url>
      <title>DEV Community: Tom Nats</title>
      <link>https://dev.to/tnats</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tnats"/>
    <language>en</language>
    <item>
      <title>Iceberg Time Travel &amp; Rollbacks in Trino</title>
      <dc:creator>Tom Nats</dc:creator>
      <pubDate>Thu, 17 Aug 2023 18:54:46 +0000</pubDate>
      <link>https://dev.to/starburstdata/iceberg-time-travel-rollbacks-in-trino-j9e</link>
      <guid>https://dev.to/starburstdata/iceberg-time-travel-rollbacks-in-trino-j9e</guid>
      <description>&lt;p&gt;&lt;em&gt;I originally posted this on &lt;a href="https://www.starburst.io/blog/apache-iceberg-time-travel-rollbacks-in-trino/"&gt;Starburst's blog&lt;/a&gt;, as part of a series I've been publishing on Iceberg.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;Time Travel&lt;/h3&gt;

&lt;p&gt;Time travel in Trino using Iceberg is a handy feature to “look back in time” at a table’s history. As we covered &lt;a href="https://dev.to/starburstdata/introduction-to-apache-iceberg-in-trino-48g0"&gt;in this blog&lt;/a&gt;, each change to an Iceberg table creates a new “snapshot” which can be referred to by using standard SQL.&lt;/p&gt;

&lt;p&gt;As you can see from the diagram below, a new snapshot is created for the table creation, insert and update.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5JKgfTLU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ma0koe27hfiicqa7snp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5JKgfTLU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ma0koe27hfiicqa7snp2.png" alt="Snapshot example" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To see the snapshots on a table, you can use the handy metadata table that exists for each table:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT * FROM "customer_iceberg$snapshots";&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KbJD4tk4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dhkddynxfl29cfhcw2zt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KbJD4tk4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dhkddynxfl29cfhcw2zt.png" alt="Snapshot files returned from customer table" width="800" height="187"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above snapshot table shows the create, insert and update operations on the customer_iceberg table. You can see what type of operation was performed and when it was executed.&lt;/p&gt;

&lt;p&gt;To select a certain snapshot, you use the “for version as of” syntax. In the following two examples, we show the customer name before and after an update:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT custkey,name&lt;br&gt;
FROM customer_iceberg FOR VERSION AS OF 5043425904354141100 where custkey = 2732;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vv_2C8iU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a3fbdm6wsx96ojy5rj1c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vv_2C8iU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a3fbdm6wsx96ojy5rj1c.png" alt="Snapshot before update to custkey 2732" width="562" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT custkey,name&lt;br&gt;
FROM customer_iceberg FOR VERSION AS OF 3117754680069542695 where custkey = 2732;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xvhokg4u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wxbyx0vv0xekk1mjg6y7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xvhokg4u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wxbyx0vv0xekk1mjg6y7.png" alt="Snapshot after update to custkey 2732" width="434" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also specify a timestamp to retrieve an older snapshot of a table. For example, the following query brings back the data from the most recent snapshot taken on or before the given timestamp:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT custkey,name&lt;br&gt;
FROM s3lakehouse.demo_tpch.customer_iceberg FOR TIMESTAMP AS OF TIMESTAMP '2022-09-18 07:18:09.002 America/New_York' where custkey = 2732;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--asvfEXuU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gd2386kvm7q9pqkoizyz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--asvfEXuU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gd2386kvm7q9pqkoizyz.png" alt="Retrieving older snapshot of custkey" width="562" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Rolling back&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ddBBZ_Ji--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gpjgcjr810iunmz90usy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ddBBZ_Ji--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gpjgcjr810iunmz90usy.png" alt="Back to the Future meme captioned &amp;quot;Let's go back in time to a previous Iceberg snapshot&amp;quot;" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another great feature of Iceberg is the ability to roll back a table to a previous snapshot. Sometimes this is used when a row was accidentally deleted or updated. As long as the snapshot still exists (it hasn’t been cleaned up yet), you can roll back to any existing snapshot.&lt;/p&gt;
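
&lt;p&gt;If you don’t have the target snapshot ID handy, you can look it up first with the same $snapshots metadata table shown earlier; a minimal sketch:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT committed_at, snapshot_id, operation FROM "customer_iceberg$snapshots" ORDER BY committed_at;&lt;/code&gt;&lt;/p&gt;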

&lt;p&gt;For example, in the scenario above, if I wanted to roll back to the state of the table before the update on the customer, then I would issue the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CALL iceberg.system.rollback_to_snapshot('demo_tpch', 'customer_iceberg', 5043425904354141100)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Then we can query the table again to see the customer’s name was “rolled back” to the previous version before the update:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT custkey,name&lt;br&gt;
FROM s3lakehouse.demo_tpch.customer_iceberg where custkey = 2732;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2LLX0ZJo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1k9w57j0fnh334yg6rvi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2LLX0ZJo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1k9w57j0fnh334yg6rvi.png" alt="Rolled back version of custkey" width="562" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both time travel and rollbacks are database functions that are now available in a modern data lake architecture. This is a game changer, as it brings database-style functionality to object stores, something that wasn’t available until now.&lt;/p&gt;

</description>
      <category>iceberg</category>
      <category>bigdata</category>
      <category>trino</category>
      <category>datalake</category>
    </item>
    <item>
      <title>Iceberg Schema Evolution in Trino</title>
      <dc:creator>Tom Nats</dc:creator>
      <pubDate>Wed, 09 Aug 2023 13:34:00 +0000</pubDate>
      <link>https://dev.to/starburstdata/iceberg-schema-evolution-in-trino-2fpc</link>
      <guid>https://dev.to/starburstdata/iceberg-schema-evolution-in-trino-2fpc</guid>
      <description>&lt;p&gt;&lt;em&gt;I originally posted this on &lt;a href="https://www.starburst.io/blog/apache-iceberg-schema-evolution-in-trino/"&gt;Starburst's blog&lt;/a&gt;, as part of a series I've been publishing on Iceberg.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Schema evolution simply means the modification of tables as business rules and source systems change over time. Trino’s Iceberg connector supports different modifications to tables, including renaming the table itself as well as column and partition changes.&lt;/p&gt;

&lt;p&gt;Much like a database, you perform “alters” to Iceberg tables to modify their structure. Since Iceberg is just a table format which contains metadata about a table, modifying the table is rather trivial.&lt;/p&gt;

&lt;h3&gt;Table Changes&lt;/h3&gt;

&lt;p&gt;Rename a table:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;alter table customer_iceberg rename to customer_iceberg_new;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: This changes the table name in the metastore only; no changes are made in storage. So, the location s3:///customer_iceberg will remain the same.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;Column Changes&lt;/h3&gt;

&lt;p&gt;Adding a column:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;alter table customer_iceberg add column tier varchar(1);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Rename a column:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;alter table customer_iceberg rename column address to fulladdress;&lt;/code&gt;&lt;/p&gt;
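
&lt;p&gt;Dropping a column follows the same pattern; for example, to remove the tier column added above:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;alter table customer_iceberg drop column tier;&lt;/code&gt;&lt;/p&gt;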

&lt;h3&gt;Partition Changes&lt;/h3&gt;

&lt;p&gt;Oftentimes a table is initially partitioned by a column or set of columns, only to discover later that this isn’t optimal. With Iceberg, you can modify the partition columns at any time.&lt;/p&gt;

&lt;p&gt;For example, initially this table is partitioned by month:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;create table orders_iceberg&lt;br&gt;
with (partitioning=ARRAY['month(orderdate)']) as&lt;br&gt;
select * from tpch.sf1.orders;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;After reviewing query patterns, it’s determined that partitioning by day would perform better, as a majority of queries filter by certain days. A simple alter table statement, as seen below, will modify the partitioning on this table from month to day:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;alter table orders_iceberg SET PROPERTIES partitioning = ARRAY['day(orderdate)'];&lt;/code&gt;&lt;/p&gt;
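
&lt;p&gt;For example, inserting another batch of rows after the change (the date filter is just illustrative):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;insert into orders_iceberg select * from tpch.sf1.orders where orderdate = date '1995-03-15';&lt;/code&gt;&lt;/p&gt;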

&lt;p&gt;After new data is inserted into the table, you will see a change in the data directory where the table data is stored:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kV8EYfvT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kdwj4tsa0qnaootluohf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kV8EYfvT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kdwj4tsa0qnaootluohf.png" alt="S3 screenshot showing orderdate_month is now orderdate_day" width="583" height="604"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice the orderdate_month is now orderdate_day. Note that queries that filter by day will prune at the daily partition level for newly written data, but the existing monthly partitions will still need to be searched. If you would like the entire table partitioned by day, you could recreate it using a CTAS (create table as select) from the existing table.&lt;/p&gt;

&lt;p&gt;Example to create a new table partitioned by day from the existing table:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;create table orders_iceberg_new with (partitioning=ARRAY['day(orderdate)']) as select * from orders_iceberg;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Schema evolution in Trino’s Iceberg connector is very powerful and easy to use. These types of functions were not available in Hive and database veterans will be very happy to see them added to the data lake landscape.&lt;/p&gt;

</description>
      <category>iceberg</category>
      <category>trino</category>
      <category>schema</category>
      <category>datalake</category>
    </item>
    <item>
      <title>Iceberg DML &amp; Maintenance in Trino</title>
      <dc:creator>Tom Nats</dc:creator>
      <pubDate>Mon, 07 Aug 2023 12:39:00 +0000</pubDate>
      <link>https://dev.to/starburstdata/apache-iceberg-dml-maintenance-in-trino-opc</link>
      <guid>https://dev.to/starburstdata/apache-iceberg-dml-maintenance-in-trino-opc</guid>
      <description>&lt;p&gt;&lt;em&gt;I originally posted this on &lt;a href="https://www.starburst.io/blog/apache-iceberg-dml-update-delete-merge-maintenance-in-trino/"&gt;Starburst's blog&lt;/a&gt;, as part of a series I've been publishing on Iceberg.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One key feature of the Apache Iceberg connector is Trino’s ability to modify data that resides on object storage. As we all know, object storage like Amazon S3 is immutable, which means objects cannot be modified in place. This was a challenge in the Hadoop era whenever data needed to be modified or removed at the individual row level. Trino allows for full DML (data manipulation language) using the Iceberg connector, which means full support for update, delete and merge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5w82anWw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pv2shh3cp0vq98dygvpc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5w82anWw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pv2shh3cp0vq98dygvpc.png" alt="Merge, Insert, Update, Delete diagram on top of object storage options" width="678" height="750"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since Iceberg is a table format, when DML commands are issued, transaction logs are created and stored in the same storage, so clients that read from the same table will see these changes in subsequent queries.&lt;/p&gt;

&lt;h3&gt;Insert&lt;/h3&gt;

&lt;p&gt;Inserts are one of the most used commands in a modern data lake. Data is constantly being added and, as you would expect, Trino’s Iceberg connector supports a standard insert statement:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;insert into customer_iceberg values&lt;br&gt;
(90000,'Testing','33 Main',3,'303-867-5309',323,'MACHINERY','Testing Iceberg');&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Note that updates, deletes and merges should be run serially, or in batches, against a single table. This ensures there are no write conflicts.&lt;/p&gt;
&lt;h3&gt;Update&lt;/h3&gt;

&lt;p&gt;Updates in Trino with the Iceberg connector act just like ordinary updates. If there are select statements currently executing against the table, they will see the data in the previous snapshot, ensuring read integrity.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;update customer_iceberg set name = 'Tim Rogers' where custkey = 2732;&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;Delete&lt;/h3&gt;

&lt;p&gt;Delete statements are rarely used in a modern data lake; instead, “soft” deletes are most common, which means the row is updated to flag it as deleted. There is usually a status column or something similar that select queries use to filter out these rows. In the event a row or set of rows does need to be deleted, this can be done with a typical delete statement:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;delete from customer_iceberg where custkey = 2732;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: A delete statement doesn’t actually physically delete the data off the storage. In order to ensure the data has been removed, an &lt;a href="https://trino.io/docs/current/connector/iceberg.html#expire-snapshots"&gt;expire_snapshots&lt;/a&gt; procedure needs to be executed to expire the snapshots taken before the delete statement was run.&lt;/em&gt;&lt;/p&gt;
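
&lt;p&gt;A sketch of what that cleanup might look like, using a seven-day retention value purely as an illustration:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE customer_iceberg EXECUTE expire_snapshots(retention_threshold =&amp;gt; '7d');&lt;/code&gt;&lt;/p&gt;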
&lt;h3&gt;Merge&lt;/h3&gt;

&lt;p&gt;The merge statement is a very handy tool for adding logic-based operations to a SQL statement. Merge is often used when you have new or modified data that is staged in a table first. A good example is customer data that is being pulled from an operational system: CDC (change data capture) data is extracted from a CRM system into a staging table in S3. Or, with Trino, a merge can use an existing table from the source system directly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t8xikmKP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/upxtmbqcw4midf1ldku3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t8xikmKP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/upxtmbqcw4midf1ldku3.png" alt="Diagram showing data extracted from a CRM system into a staging table in S3" width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To use merge, you can either stage data that needs to be inserted or updated into your target table or you can use data directly from the source table(s).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Example 1: If there are rows that don’t exist in the target table, insert them. This is a very basic merge statement. The &lt;code&gt;customer_land&lt;/code&gt; table below could be a staged table in object storage like S3 or it could be from a source system such as MySQL or SQL Server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MERGE INTO s3lakehouse.blog.customer_base AS b
USING (select * from s3lakehouse.blog.customer_land) AS l
ON (b.custkey = l.custkey)
WHEN NOT MATCHED
      THEN INSERT (custkey, name, state, zip, cust_since,last_update_dt)
            VALUES(l.custkey, l.name, l.state, l.zip, l.cust_since,l.last_update_dt);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example 2: With merge, we can issue a single statement to insert new rows and update existing ones:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MERGE INTO s3lakehouse.blog.customer_base AS b
USING s3lakehouse.blog.customer_land AS l
ON (b.custkey = l.custkey)
WHEN MATCHED and b.name != l.name
THEN UPDATE
SET name = l.name ,
          state = l.state,
          zip = l.zip,
          cust_since = l.cust_since
WHEN NOT MATCHED
      THEN INSERT (custkey, name, state, zip, cust_since,last_update_dt)
            VALUES(l.custkey, l.name, l.state, l.zip, l.cust_since,l.last_update_dt);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This statement will insert new rows where the custkey doesn’t exist in the target table, and it will update rows in the target table if the custkey matches and the name has changed. Of course, in real-world situations there will be numerous columns checked for changes before issuing an update. I chose name for this simple example, but you can see the power of merge and why it’s a game changer for a modern data lake.&lt;/p&gt;

&lt;p&gt;Example 3: Slowly Changing Dimension (SCD Type 2)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MERGE INTO s3lakehouse.blog.customer_base as b
USING
( SELECT null as custkey_match, custkey, name, state, zip, cust_since, last_update_dt,'Y' as active_ind,current_timestamp as end_dt
FROM s3lakehouse.blog.customer_land
UNION ALL
SELECT
custkey as custkey_match,custkey, name, state, zip, cust_since, last_update_dt,active_ind,end_dt
FROM s3lakehouse.blog.customer_base
WHERE custkey IN
(SELECT custkey FROM s3lakehouse.blog.customer_land where active_ind = 'Y')
) as scdChangeRows
ON (b.custkey = scdChangeRows.custkey and b.custkey = scdChangeRows.custkey_match)
WHEN MATCHED and b.active_ind = 'Y' THEN
UPDATE SET end_dt = current_timestamp,active_ind = 'N'
WHEN NOT MATCHED THEN
        INSERT (custkey, name, state, zip, cust_since,last_update_dt,active_ind,end_dt)
            VALUES(scdChangeRows.custkey, scdChangeRows.name, scdChangeRows.state, scdChangeRows.zip,
                    scdChangeRows.cust_since,scdChangeRows.last_update_dt,'Y',null);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An SCD Type 2 simply means that when a row changes, we “end date” the existing row and insert a new one. This allows history to be maintained in a single table. It’s a data warehousing technique that has been around for a long time, but the ability to do this in a data lake is new, and it opens up the option of providing data warehousing features directly on top of cloud storage.&lt;/p&gt;

&lt;p&gt;There is a lot going on in this merge so we’ll cover a few points. We first select data from the landing table and union it with our base table while only pulling active rows. From there, we insert any new rows in addition to any modified rows. Lastly, we update or “end date” the old row by setting the active_ind to ‘N’ and providing a date in the end_dt column.&lt;/p&gt;
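
&lt;p&gt;To sanity-check the result, you can pull the history for a single customer after the merge runs (the custkey value is just illustrative):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT custkey, name, active_ind, end_dt&lt;br&gt;
FROM s3lakehouse.blog.customer_base WHERE custkey = 1 ORDER BY end_dt NULLS LAST;&lt;/code&gt;&lt;/p&gt;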

&lt;h3&gt;Optimize&lt;/h3&gt;

&lt;p&gt;As your Iceberg tables grow and have many operations performed against them, it’s a good idea to optimize them from time to time. The optimize command not only makes small files larger for better performance, it also cleans up the metadata, which improves queries because less metadata needs to be read.&lt;/p&gt;

&lt;p&gt;To scan the table for small files and make them larger, you simply issue the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;alter table &amp;lt;table&amp;gt; execute optimize;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This will look for any files under 100MB and combine them into larger ones. You can also change the file size threshold from the default of 100MB:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE &amp;lt;table&amp;gt; EXECUTE optimize(file_size_threshold =&amp;gt; '10MB')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If your Iceberg table becomes very large and the optimize command above is taking too long to run, you can just optimize the files that have arrived recently:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;alter table &amp;lt;table&amp;gt; execute optimize where "$file_modified_time" &amp;gt; current_date - interval '1' day;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This will look for files that have arrived since yesterday and optimize them. On a very active table where lots of changes are taking place, this will greatly reduce the amount of time the optimize command takes.&lt;/p&gt;

&lt;p&gt;For tables that are being modified constantly, it’s a good idea to optimize at regular intervals.&lt;/p&gt;

&lt;p&gt;We’ve covered one of the most powerful features of Iceberg with Trino, which allows database-style updates/deletes/merges to be performed on your modern data lake. This opens the door to increasing the use cases and reliability of the data lake on a variety of clouds and platforms.&lt;/p&gt;

</description>
      <category>iceberg</category>
      <category>trino</category>
      <category>datalake</category>
      <category>aws</category>
    </item>
    <item>
      <title>Iceberg Partitioning and Performance Optimizations in Trino</title>
      <dc:creator>Tom Nats</dc:creator>
      <pubDate>Tue, 01 Aug 2023 12:43:00 +0000</pubDate>
      <link>https://dev.to/starburstdata/iceberg-partitioning-and-performance-optimizations-in-trino-27a7</link>
      <guid>https://dev.to/starburstdata/iceberg-partitioning-and-performance-optimizations-in-trino-27a7</guid>
      <description>&lt;p&gt;&lt;em&gt;I originally posted this on &lt;a href="https://www.starburst.io/blog/iceberg-partitioning-and-performance-optimizations-in-trino-partitioning/"&gt;Starburst's blog&lt;/a&gt;, as part of a series I've been publishing on Iceberg.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;Partitioning&lt;/h3&gt;

&lt;p&gt;Partitioning is used to narrow down the scope of the data that needs to be read for a query. When dealing with big data, this can be crucial for performance and can be the difference between a query taking minutes, or even hours, and taking seconds!&lt;/p&gt;

&lt;p&gt;One of the advantages of Apache Iceberg is how it handles partitions. One of the biggest drawbacks of Hive-based tables was how you had to partition your data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EARkfGjL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zgnjxm2d0fu126658iaz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EARkfGjL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zgnjxm2d0fu126658iaz.png" alt="Hive vs. Iceberg logos" width="470" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, most tables that you would plan to partition have some sort of date or timestamp that indicates when the row of data was created. Example table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BC3JJTuh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rwk2e8ogv55ybdu6pjpj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BC3JJTuh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rwk2e8ogv55ybdu6pjpj.png" alt="Table showing timestamp" width="800" height="131"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For Hive, if you wanted to partition by day, you would have to break out the created_ts column into year, month and day. Then, you would have to teach your users to always include these columns in their queries even if they wanted to query on created_ts.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;create table hive.orders (event_id integer, created_ts timestamp, metric integer, year varchar, month varchar, day varchar);&lt;/code&gt;&lt;/p&gt;
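
&lt;p&gt;That means users end up writing queries like the following sketch (values illustrative), where the year/month/day predicates just restate what created_ts already says:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;select * from hive.orders&lt;br&gt;
where created_ts &amp;gt;= timestamp '2022-09-10 00:00:00' and year = '2022' and month = '09' and day = '10';&lt;/code&gt;&lt;/p&gt;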

&lt;p&gt;With Iceberg, you simply partition the data on created_ts using day and end users would query this table just like they would in a database. Here is an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- create iceberg table partitioned by day on the created_ts column
create table orders_iceberg 
(event_id integer, created_ts timestamp(6), metric double)
with (type='iceberg',partitioning=ARRAY['day(created_ts)']);
-- insert rows
insert into orders_iceberg values (1,timestamp '2022-09-10 10:45:38.527000',5.5);
insert into orders_iceberg values (1,timestamp '2022-09-11 03:12:23.522000',5.5);
insert into orders_iceberg values (1,timestamp '2022-09-12 10:46:13.516000',5.5);
insert into orders_iceberg values (1,timestamp '2022-09-13 04:34:05.577000',5.5);
insert into orders_iceberg values (1,timestamp '2022-09-14 09:10:23.517000',5.5);
-- query the table only looking for certain days
select * from orders_iceberg where created_ts BETWEEN date '2022-09-10' AND date '2022-09-12';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data in S3 for example looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oF7rZv4a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eoymtfi8n85g2hm0sucx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oF7rZv4a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eoymtfi8n85g2hm0sucx.png" alt="S3 screenshot showing partitioned files by day" width="483" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Trino is smart enough to read the Iceberg Manifest List and then only look at files that meet the partition requirement of the query. In the example above, it would only be 2022-09-10 and 2022-09-11. A list of functions to partition by can be found &lt;a href="https://trino.io/docs/current/connector/iceberg.html#partitioned-tables"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Note: Trino’s Iceberg implementation includes the timezone for the timestamp data type (timestamp(6)). This was a conscious decision, following the industry standard of supporting time zones within timestamp data types, which Hive didn’t support.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Although we’ll cover this in a separate schema evolution blog, you aren’t stuck with this partitioning scheme. At any time, you can modify your partition column. For example, if we decided that partitioning on day is too granular, we can modify the table to now be partitioned by month:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;alter table orders_iceberg SET PROPERTIES partitioning = ARRAY['month(created_ts)'];&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;New data will be created in directories named like created_ts_month=2022-09. The existing data will remain partitioned by day unless the table is recreated.&lt;/p&gt;
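
&lt;p&gt;If you want the entire table repartitioned by month, a CTAS sketch (the new table name is illustrative):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;create table orders_iceberg_new&lt;br&gt;
with (partitioning=ARRAY['month(created_ts)'])&lt;br&gt;
as select * from orders_iceberg;&lt;/code&gt;&lt;/p&gt;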

&lt;h3&gt;Performance and Optimizations&lt;/h3&gt;

&lt;p&gt;When it comes to performance, Iceberg can be a very performant table format. This is because metadata is stored about all of the files that “belong” to a table for a given snapshot in time, along with statistics about each one, which helps with “file skipping”. This is a fancy term for skipping files that do not need to be read for the query that is issued.&lt;/p&gt;

&lt;p&gt;With partitioning, the set of candidate files is narrowed down even further: partition pruning happens first, and only the metadata of the remaining files is then examined. When data is also ordered by columns that appear in a where clause, this can greatly improve the performance of selective queries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--woneTW8_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kbhs7hqkp9bxhrkpbhj3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--woneTW8_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kbhs7hqkp9bxhrkpbhj3.png" alt="Flow chart showing from current snapshot to file statistics" width="800" height="195"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The manifest file contains information about the different files that belong to the table. Each entry has the location of the file in addition to statistics such as the minimum and maximum value for each column, the number of nulls and other useful information. Trino will use this metadata about each file to determine if the file needs to be read. If the data is sorted by “id” and a where clause has a predicate similar to where id = 5, then the query will see a large performance improvement because only a handful of files (if not just one) will need to be read.&lt;/p&gt;
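
&lt;p&gt;You can inspect this per-file metadata from Trino itself through the $files metadata table; a minimal sketch:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT file_path, record_count, file_size_in_bytes FROM "orders_iceberg$files";&lt;/code&gt;&lt;/p&gt;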

&lt;h3&gt;Optimizing for performance&lt;/h3&gt;

&lt;p&gt;Iceberg includes some file management features that help with performance. Traditional data lakes have use cases where data is constantly being ingested. This data is written in small files because of the need to have it available to be queried immediately. This can hurt performance in any system that needs to read a bunch of small files, especially in cloud storage. Iceberg includes an optimize feature that combines small files into larger ones, ensuring maximum performance when it comes to querying.&lt;/p&gt;

&lt;p&gt;The idea here is you want to ingest data as fast as possible, making it available for queries even though it might not be of the highest performance, then offer the ability to combine those files into larger ones at a given interval.&lt;/p&gt;

&lt;p&gt;To scan the table for small files and make them larger, you simply issue the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;alter table &amp;lt;table&amp;gt; execute optimize;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This will look for any files under 100MB and combine them into larger ones. You can also change the file size threshold from the default of 100MB:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE &amp;lt;table&amp;gt; EXECUTE optimize(file_size_threshold =&amp;gt; '10MB')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If your Iceberg table becomes very large and the optimize command above is taking too long to run, you can just optimize the files that have arrived recently:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE &amp;lt;table&amp;gt; EXECUTE optimize where "$file_modified_time" &amp;gt; current_date - interval '1' day;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This will look for files that have arrived since yesterday and optimize them. On a very active table where lots of changes are taking place, this will greatly reduce the amount of time the optimize command takes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Network events are streamed in 1 minute intervals. Small files are dropped into an S3 bucket using the Iceberg API and the data is available immediately using standard SQL. Based on the volume of data and the files created, the optimize command can be run at given intervals to consolidate these smaller files into larger ones. This will greatly improve the performance of subsequent queries against this table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--l7FGYtOJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k9cymhek313gapkhjdt0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l7FGYtOJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k9cymhek313gapkhjdt0.png" alt="Example of using the optimize command in Iceberg" width="676" height="1078"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Cleaning up snapshots&lt;/h3&gt;

&lt;p&gt;From time to time, older snapshots of tables should be cleaned up. These older snapshots contain previous states of the table which are no longer needed.&lt;/p&gt;

&lt;p&gt;There are two operations that clean up old snapshots and data. One is “expire_snapshots” and the other is “remove_orphan_files”.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://trino.io/docs/current/connector/iceberg.html#expire-snapshots"&gt;expire_snapshots&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
This function removes snapshots that are older than the value provided during the execution. For example, the command below removes snapshots that are older than 7 days:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE &amp;lt;table&amp;gt; EXECUTE expire_snapshots(retention_threshold =&amp;gt; '7d')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hWXwflcL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yhc79l587upinucyi7xh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hWXwflcL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yhc79l587upinucyi7xh.png" alt="Diagram depicting deletion of snapshots after 7 days" width="709" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://trino.io/docs/current/connector/iceberg.html#remove-orphan-files"&gt;remove_orphan_files&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
This function removes files that are left on storage when a query is unable to complete for a variety of reasons. This doesn’t happen too often, but it’s a good idea to include this when you run snapshot cleanups. A similar alter table statement is used, as shown in this example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE test_table EXECUTE remove_orphan_files(retention_threshold =&amp;gt; '7d')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;As you can see, Iceberg + Trino brings some very exciting features along with tremendous performance advantages to complete your modern data lake.&lt;/p&gt;

</description>
      <category>iceberg</category>
      <category>trino</category>
      <category>partitioning</category>
      <category>performance</category>
    </item>
    <item>
      <title>Introduction to Apache Iceberg in Trino</title>
      <dc:creator>Tom Nats</dc:creator>
      <pubDate>Wed, 26 Jul 2023 13:00:00 +0000</pubDate>
      <link>https://dev.to/starburstdata/introduction-to-apache-iceberg-in-trino-48g0</link>
      <guid>https://dev.to/starburstdata/introduction-to-apache-iceberg-in-trino-48g0</guid>
      <description>&lt;p&gt;&lt;em&gt;I originally &lt;a href="https://www.starburst.io/blog/introduction-to-apache-iceberg-in-trino/"&gt;posted this on Starburst's blog&lt;/a&gt;, as part of a series I've been publishing on Iceberg.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;TL;DR&lt;/h3&gt;

&lt;p&gt;Apache Iceberg is an open source table format that brings database functionality to object storage such as S3, Azure’s ADLS, Google Cloud Storage and MinIO. This allows an organization to take advantage of low-cost, high-performing cloud storage while providing data warehouse features and experience to their end users, without being locked into a single vendor.&lt;/p&gt;

&lt;h3&gt;What is Apache Iceberg?&lt;/h3&gt;

&lt;p&gt;Apache Iceberg is a table format, originally created by Netflix, that provides database type functionality on top of object stores such as Amazon S3. Iceberg allows organizations to finally build true data lakehouses in an open architecture, avoiding vendor and technology lock-in.&lt;/p&gt;

&lt;p&gt;The excitement around Iceberg began last year and has greatly increased in 2022. Most of the customers and prospects I speak with on a weekly basis are either considering migrating their existing Hive tables to it or have already started. They are excited that a truly open source table format has been created, with many engines, both open source and proprietary, jumping on board.&lt;/p&gt;

&lt;h3&gt;Advantages of Apache Iceberg&lt;/h3&gt;

&lt;p&gt;One of the best things about Iceberg is the vast adoption by many different engines. In the diagram below, you can see that many different technologies can work with the same set of data, as long as they use the open-source Iceberg API. The amount of work each engine has put in is a great indicator of the popularity and usefulness of this exciting technology.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z8b-TEEp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qsa0kczz269325eiec7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z8b-TEEp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qsa0kczz269325eiec7w.png" alt="Diagram showing how Iceberg works with many query engines and many object stores" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With more and more technologies jumping on board, Iceberg isn’t a passing fad. It has been growing in popularity not only because of how useful it is, but also because it’s a truly open source table format: many companies have contributed to and helped improve the specification, making it a true community-based effort.&lt;/p&gt;

&lt;p&gt;Here is a list of the many features Iceberg provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Choose your engine&lt;/strong&gt;: As you can see from the diagram above, there are many engines that support Iceberg. This offers the ultimate flexibility to own your own data and choose the engine that fits your use cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid Data Lock-in&lt;/strong&gt;: The data Iceberg and these engines work on is YOUR data in YOUR account, which avoids data lock-in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid Vendor Lock-out&lt;/strong&gt;: Iceberg metadata is always available to all engines. So you can guarantee consistency, even with multiple writers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DML (modifying table data)&lt;/strong&gt;: Modifying data in Hadoop was a huge challenge. With Iceberg, data can easily be modified to adhere to use cases and compliance such as GDPR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema evolution&lt;/strong&gt;: Much like a database, Iceberg supports full schema evolution including columns and even partitions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt;: Since Iceberg stores a table’s state in a snapshot, the engine simply needs to read the metadata in that snapshot and then start retrieving the data from storage, saving valuable time and reducing cloud object store retrieval costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database feel&lt;/strong&gt;: Partitioning is performed on any column and end users query Iceberg tables just like they would a database.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Iceberg Architecture&lt;/h2&gt;

&lt;p&gt;Iceberg is a layer of metadata over your object storage. It provides a transaction log per table, very similar to a traditional database. This log keeps track of the current state of the table, including any modifications. It also keeps a current “snapshot” of the files that belong to the table, along with statistics about them, in order to greatly reduce the amount of data that needs to be read during queries, improving performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xCBDJXu9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b9ccooorons5fynfn2v8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xCBDJXu9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b9ccooorons5fynfn2v8.png" alt="Diagram showing how metadata is layered over object storage with Iceberg" width="800" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;Snapshots&lt;/h4&gt;

&lt;p&gt;Every time a modification to an Iceberg table is performed (insert, update, delete, etc.), a new snapshot of the table is created. When an Iceberg client (let’s say Trino) wants to query a table, the latest snapshot is read and the files that “belong” to that snapshot are read. This enables a very powerful feature called time travel: because the table carries a set of snapshots over time, any point in its history can be queried with the proper syntax.&lt;/p&gt;
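
&lt;p&gt;In Trino, that syntax looks like the following (the snapshot ID is illustrative; time travel is covered in depth in a separate post in this series):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT * FROM customer_iceberg FOR VERSION AS OF 5043425904354141100;&lt;/code&gt;&lt;/p&gt;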

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--caBY5M6k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7j152tdc8inkl2eid5j5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--caBY5M6k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7j152tdc8inkl2eid5j5.png" alt="Example of how snapshots of taken at different points in time" width="800" height="475"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Under the covers, Iceberg uses a set of Avro-based files to keep track of this metadata. A Hive compatible metastore is used to “point” to the latest metadata file that has the current state of the table. All engines that want to interact with the table first get the latest “pointer” from the metastore then start interacting with Iceberg metadata files from there.&lt;/p&gt;

&lt;p&gt;Here is a very basic diagram of the different files that are created during a CTAS (create table as select):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hNsx5Ioj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1flpxvmoz93qovipll6q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hNsx5Ioj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1flpxvmoz93qovipll6q.png" alt="Diagram of the different files that are created during a CTAS" width="434" height="698"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metadata File Pointer (fp1)&lt;/strong&gt; – This is an entry in a Hive compatible metastore (AWS Glue for example) that points to the current metadata file. This is the start to any query against an Iceberg table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metadata File (mf1)&lt;/strong&gt; – A JSON file that contains the latest version of a table. Any changes made to a table create a new metadata file. The contents of this file are simply lists of manifest list files with some high-level metadata.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manifest List (ml1)&lt;/strong&gt; – List of manifest files that make up a snapshot. This also includes metadata such as partition bounds in order to skip files that do not need to be read for the query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manifest File (m1)&lt;/strong&gt; – Lists a set of data files and metadata about these files. This is the final step for a query, as the engine uses these files to determine which data files actually need to be read, saving valuable query time.&lt;/p&gt;

&lt;p&gt;Here is a sample table named &lt;code&gt;customer_iceberg&lt;/code&gt; that was created on S3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;customer_iceberg-a0ae01bc83cb44c5ad068dc3289aa1b9/
  data/
    20221005_142356_18493_dnvqc-43a7f422-d402-41d8-aab3-38d88f9a8810.orc
    20221005_142356_18493_dnvqc-548f81e0-b9c3-4015-99a7-d0f19416e39c.orc
  metadata/
    00000-8364ea6c-5e89-4b17-a4ea-4187725b8de6.metadata.json
    54d59fe-8368-4f5e-810d-4331dd3ee243-m0.avro
    snap-2223082798683567304-1-88c32199-6151-4fc7-97d9-ed7d9172d268.avro
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Table directory&lt;/strong&gt; – This is the name of the table with a unique UUID appended in order to support table renames.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data directory&lt;/strong&gt; – This holds the ORC, Parquet or Avro files and could contain subdirectories depending on whether the table is partitioned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metadata directory&lt;/strong&gt; – This directory holds the manifest files as covered above.&lt;/p&gt;
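
&lt;p&gt;You don’t have to browse storage to see this structure; Trino exposes it through metadata tables, for example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT * FROM "customer_iceberg$manifests";&lt;/code&gt;&lt;/p&gt;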

&lt;p&gt;Again, this might be too nitty-gritty for the average user but the point is a tremendous amount of thought and work has been put into Iceberg to ensure it can handle many different types of analytical queries along with real-time ingestion. It was built to fill the gap between low-cost, cloud object stores and the demanding processing engines such as Trino and Spark.&lt;/p&gt;

&lt;h4&gt;Partitioning&lt;/h4&gt;

&lt;p&gt;Using partitions in Iceberg is just like with a database. Most data you ingest into your data lake has a timestamp and partitioning by that column is very easy:&lt;/p&gt;

&lt;p&gt;Example – partition by month from a timestamp column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create table orders_iceberg
with (partitioning=ARRAY['month(create_date),region'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Querying using a standard where clause against the partitioned column will result in partition pruning and much higher performance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select * from orders_iceberg
WHERE CAST(create_date AS date) BETWEEN date '1993-06-01' AND date '1993-11-30';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Trino's Iceberg Support&lt;/h3&gt;

&lt;p&gt;Trino has full support for Iceberg with a feature matrix listed below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create Table&lt;/li&gt;
&lt;li&gt;Modify Table (update/delete/merge)&lt;/li&gt;
&lt;li&gt;Add/Drop/Modify table column&lt;/li&gt;
&lt;li&gt;Rename table&lt;/li&gt;
&lt;li&gt;Rollback to previous snapshot&lt;/li&gt;
&lt;li&gt;View support (includes AWS Glue)&lt;/li&gt;
&lt;li&gt;Time travel&lt;/li&gt;
&lt;li&gt;Maintenance (Optimize/Expire Snapshots)&lt;/li&gt;
&lt;li&gt;Alter table partition&lt;/li&gt;
&lt;li&gt;Metadata queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using Iceberg in Trino is very easy. There is a dedicated connector page located &lt;a href="https://trino.io/docs/current/connector/iceberg.html"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;If you're new to Trino, &lt;a href="https://www.starburst.io/platform/starburst-galaxy/"&gt;Starburst Galaxy's&lt;/a&gt; free tier is the easiest and fastest way to test out the power of Trino and Iceberg. &lt;/p&gt;

</description>
      <category>iceberg</category>
      <category>trino</category>
      <category>sql</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
