
Augusto Kiniama Rosa

Originally published at Medium

An Overview of Snowflake Apache Iceberg Tables

Come Learn with Me as I Experiment with Snowflake’s Latest Preview feature: Apache Iceberg Tables


Photo by Alexander Hafemann on Unsplash

If you have not picked up on it until now, my background is not really in data. I have had the chance to have an amazing and quite diverse career, so I often treat articles like this one as my way of learning about a topic. If I am going to research and learn, I might as well write about it. So sometimes my articles may get basic, although I try to relate them to advanced topics.

The format in which we store data and its related metadata can truly help us get more from the data, and it helps with many aspects of the modern data stack.

What is Apache Iceberg?

Iceberg is an open-source project hosted by the Apache Software Foundation. It started as a Netflix OSS project and eventually moved fully to the community. From what I can see, it is truly a high-performance format for those who have really large analytic tables.

Last week, I saw some comparisons of Snowflake native tables vs. Snowflake Iceberg external tables, and although performance is better on native tables, it is quite close. Iceberg allows data lakes to have a single format that can be used by multiple systems, letting Snowflake, Databricks, and engines like Spark, Trino, Flink, Presto, Hive, and Impala safely work with the same tables at the same time.

It has quite neat features (with a SQL sketch after the list):

  • Expressive SQL: Iceberg supports flexible SQL commands to merge new data, update existing rows, and perform targeted deletes. Iceberg can eagerly rewrite data files for read performance, or it can use delete deltas for faster updates.
  • Full schema evolution: schema evolution just works. Adding a column won’t bring back old or weird data. Columns can be renamed and reordered. Best of all, schema changes never require rewriting your table.
  • Hidden partitioning: Iceberg produces partition values for rows in a table and skips unnecessary partitions and files automatically. No extra filters are needed for fast queries, and the table layout can be updated as data or queries change.
  • Time travel and rollback: time travel enables reproducible queries and lets users easily examine changes. Version rollback allows users to quickly correct problems by resetting tables to a good state.
  • Data compaction: supported out of the box, with bin-packing or sorting to optimize file layout and size.
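To make the first few of these concrete, here is a minimal sketch of what they look like in SQL. I am using Spark SQL syntax, since Spark is one of the most common engines used with Iceberg outside Snowflake; the table and column names are made up for illustration.

-- Expressive SQL: upsert staged rows into an Iceberg table.
MERGE INTO db.orders t
USING staged_orders s
    ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Schema evolution: metadata-only changes, no table rewrite.
ALTER TABLE db.orders ADD COLUMN discount_pct DOUBLE;
ALTER TABLE db.orders RENAME COLUMN qty TO quantity;

-- Hidden partitioning: partition by a transform of a timestamp column.
-- Queries filtering on order_ts prune partitions and files automatically.
CREATE TABLE db.orders_by_day (
    order_id BIGINT,
    order_ts TIMESTAMP,
    total DOUBLE
)
USING iceberg
PARTITIONED BY (days(order_ts));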

Why is it Important?

I feel that in the world of data, choosing only one tool can be a challenge for large companies with large data sets; some features do not work that well in certain tools. For example, comparing Databricks vs. Snowflake: Snowflake is traditionally stronger in data warehousing and Databricks in machine learning tasks, although they are now both going after each other’s workloads. However, some companies still choose to keep both tools in the picture to meet the business needs of every use case. This is where a common data lake format can become very helpful, and sharing data across all tools becomes useful, even necessary.

We do need to look at how data lakes are typically set up now: they usually sit in some sort of object storage like AWS S3, Azure Blob Storage, or Google Cloud Storage. This is different from Hadoop, which was more commonly used a few years ago. This brings some challenges with metadata management, schema management, and data catalogs. Iceberg allows direct and concurrent updates, with all tools considered equal.

Let’s face it as well: Iceberg has much bigger community support, with the likes of Netflix, Apple, and the Apache community behind it.

I found this comparison of Iceberg against other formats: https://www.dremio.com/resources/guides/apache-iceberg/

This is how the files look in an S3 bucket. There is always a data folder with data files in Parquet format, and a metadata folder with metadata files and manifest files.
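As an illustration, a table’s files might be laid out roughly like this (the bucket and file names are made up; real data, metadata, and manifest file names are long generated identifiers):

s3://youriceberglab/datalake/customer_iceberg/
    data/
        00000-0-xxxx.parquet         -- data files (Parquet)
    metadata/
        00001-xxxx.metadata.json     -- table metadata
        snap-xxxx.avro               -- manifest list for a snapshot
        xxxx-m0.avro                 -- manifest file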

Show Me Iceberg In Snowflake

First things first: Iceberg tables in Snowflake always use customer-controlled external storage, like AWS S3 or Azure Blob Storage.

Snowflake supports Iceberg tables in two ways: with an internal catalog (a Snowflake-managed catalog) or with an externally managed catalog (such as AWS Glue, or Iceberg metadata files in object storage).

Iceberg Tables: Snowflake-managed catalog

A Snowflake-managed catalog gives near the same performance as normal Snowflake tables and has the following characteristics:

  • Snowflake reads/writes
  • Iceberg interoperability
  • Full platform support
  • Performance optimized

Iceberg Tables: Externally managed catalog

With an externally managed catalog, you use a catalog like AWS Glue, or you use Iceberg metadata files stored in object storage to create a table (see the sketch after this list). This option has the following characteristics:

  • Flexible sources
  • Efficient onboarding
  • Simplified operations
  • Performance optimized
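As a rough sketch of the externally managed path, here is what wiring up an AWS Glue catalog could look like. The integration name, namespace, role ARN, and account ID below are hypothetical placeholders; check the Snowflake documentation for the full parameter list.

-- Create a catalog integration pointing at an AWS Glue catalog.
CREATE CATALOG INTEGRATION glue_catalog_int
    CATALOG_SOURCE = GLUE
    CATALOG_NAMESPACE = 'my_glue_database'
    TABLE_FORMAT = ICEBERG
    GLUE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my-glue-role'
    GLUE_CATALOG_ID = '123456789012'
    ENABLED = TRUE;

-- Create an Iceberg table that Snowflake reads through the Glue catalog.
CREATE ICEBERG TABLE customer_external
    EXTERNAL_VOLUME = 'iceberg_lab_vol'
    CATALOG = 'glue_catalog_int'
    CATALOG_TABLE_NAME = 'customer';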

Practical Learning

I used a Snowflake lab to learn and practice this, and here are the scripts anyway if you want to practice too. I take no credit for the SQL scripts, as they come from the Snowflake lab.

It truly keeps track of all changes in a super neat way.

-- Lab starting point, creating the required lab objects
CREATE WAREHOUSE iceberg_lab;
CREATE ROLE iceberg_lab;
CREATE DATABASE iceberg_lab;
CREATE SCHEMA iceberg_lab;
GRANT ALL ON DATABASE iceberg_lab TO ROLE iceberg_lab WITH GRANT OPTION;
GRANT ALL ON SCHEMA iceberg_lab.iceberg_lab TO ROLE iceberg_lab WITH GRANT OPTION;
GRANT ALL ON WAREHOUSE iceberg_lab TO ROLE iceberg_lab WITH GRANT OPTION;

CREATE USER iceberg_lab
    PASSWORD='PASSWORD',
    LOGIN_NAME='ICEBERG_LAB',
    MUST_CHANGE_PASSWORD=FALSE,
    DISABLED=FALSE,
    DEFAULT_WAREHOUSE='ICEBERG_LAB',
    DEFAULT_NAMESPACE='ICEBERG_LAB.ICEBERG_LAB',
    DEFAULT_ROLE='ICEBERG_LAB';

GRANT ROLE iceberg_lab TO USER iceberg_lab;
GRANT ROLE iceberg_lab TO USER YOURUSERNAME;
GRANT ROLE accountadmin TO USER iceberg_lab;

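-- Optional: register an RSA public key on the user for key-pair authentication.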
USE ROLE accountadmin;
ALTER USER iceberg_lab SET rsa_public_key='Mxxxxxxiiiixxmmm..';

-- Create the external volume after doing the AWS setup; then retrieve the Snowflake AWS IDs below
CREATE OR REPLACE EXTERNAL VOLUME iceberg_lab_vol
   STORAGE_LOCATIONS =
      (
         (
            NAME = 'youriceberglab'
            STORAGE_PROVIDER = 'S3'
            STORAGE_BASE_URL = 's3://youriceberglab/datalake/'
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::XXXXXXXXX:role/yoursnowflakerole'
            )

      );

-- Find STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID, and update your IAM role trust policy
DESC EXTERNAL VOLUME iceberg_lab_vol;

-- Grant iceberg_lab to external volume
USE ROLE accountadmin;
GRANT ALL ON EXTERNAL VOLUME iceberg_lab_vol TO ROLE iceberg_lab WITH GRANT OPTION;

-- Create a Snowflake-managed Iceberg Table
USE ROLE iceberg_lab;
USE DATABASE iceberg_lab;
USE SCHEMA iceberg_lab;
CREATE OR REPLACE ICEBERG TABLE customer_iceberg (
    c_custkey INTEGER,
    c_name STRING,
    c_address STRING,
    c_nationkey INTEGER,
    c_phone STRING,
    c_acctbal INTEGER,
    c_mktsegment STRING,
    c_comment STRING
)  
    CATALOG='SNOWFLAKE'
    EXTERNAL_VOLUME='iceberg_lab_vol'
    BASE_LOCATION='';

-- Loading Data
INSERT INTO customer_iceberg
  SELECT * FROM snowflake_sample_data.tpch_sf1.customer;

-- Sample query
SELECT
    *
FROM customer_iceberg c
INNER JOIN snowflake_sample_data.tpch_sf1.nation n
    ON c.c_nationkey = n.n_nationkey;

-- Sample query with time travel example
INSERT INTO customer_iceberg
    SELECT
        *
    FROM snowflake_sample_data.tpch_sf1.customer
    LIMIT 5;

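-- Compare the current row count with the count immediately before the
-- INSERT above, using time travel via BEFORE(statement => LAST_QUERY_ID()).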
SELECT
    count(*) AS after_row_count,
    before_row_count
FROM customer_iceberg
JOIN (
        SELECT count(*) AS before_row_count
        FROM customer_iceberg BEFORE(statement => LAST_QUERY_ID())
    )
    ON 1=1
GROUP BY 2;
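
-- Side note (not part of the original lab script): this Snowflake function
-- returns the location of the table's latest Iceberg metadata file.
SELECT SYSTEM$GET_ICEBERG_TABLE_INFORMATION('iceberg_lab.iceberg_lab.customer_iceberg');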

-- Clean up (drop the table before the database that contains it)
USE ROLE accountadmin;
DROP ICEBERG TABLE customer_iceberg;
DROP EXTERNAL VOLUME iceberg_lab_vol;
DROP DATABASE iceberg_lab;
DROP WAREHOUSE iceberg_lab;
DROP USER iceberg_lab;
DROP ROLE iceberg_lab;

Iceberg Table Conversion

In practical terms, the difference between managed and unmanaged Iceberg tables is whether you can write data from inside Snowflake: with managed tables you can write, while unmanaged tables are read-only.

Converting an unmanaged table to a managed one is easily done. It is a matter of refreshing the table’s metadata beforehand and then running a CONVERT command. The refresh might look like this (a sketch; for a table created from Iceberg metadata files, you can also pass the relative path of a specific metadata file):

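ALTER ICEBERG TABLE customer_iceberg REFRESH;

Then the conversion itself: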
ALTER ICEBERG TABLE customer_iceberg CONVERT TO MANAGED
BASE_LOCATION = 'CustomerBaseLocation';

Conclusion

I hope this article gives you a good overview of how Apache Iceberg works and how Snowflake implemented it. There are quite a few good real-world use cases.

For one, you can share the same data between different systems, like Databricks and Snowflake; see another Data Superhero article about this.

Apache Iceberg was built fully open-source from the ground up, so there are no old ways of doing things holding it back. It is engine- and format-agnostic.

https://hubs.ly/Q02gk4Q50

I’m Augusto Rosa, VP of Engineering for Infostrux Solutions. Thanks for reading my blog post. You can follow me on LinkedIn or my Blog Posts.

Subscribe to Infostrux Medium Blogs https://medium.com/infostrux-solutions for the most interesting Data Engineering and Snowflake news.
