<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jeff Zhang</title>
    <description>The latest articles on DEV Community by Jeff Zhang (@zjffdu).</description>
    <link>https://dev.to/zjffdu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F353606%2F6d023fdb-5fa2-4fc4-8943-764b1a330272.jpg</url>
      <title>DEV Community: Jeff Zhang</title>
      <link>https://dev.to/zjffdu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zjffdu"/>
    <language>en</language>
    <item>
      <title>Deep Dive into Apache Iceberg via Apache Zeppelin</title>
      <dc:creator>Jeff Zhang</dc:creator>
      <pubDate>Mon, 18 Jul 2022 00:11:31 +0000</pubDate>
      <link>https://dev.to/zjffdu/deep-dive-into-apache-iceberg-via-apache-zeppelin-fc3</link>
      <guid>https://dev.to/zjffdu/deep-dive-into-apache-iceberg-via-apache-zeppelin-fc3</guid>
      <description>&lt;p&gt;Apache Iceberg is a high-performance format for huge analytic tables. There’re a lot of tutorials on the internet about how to use Iceberg. This post is a little different, it is for those people who are curious to know the internal mechanism of Iceberg. In this post, I will use Spark sql to create/insert/delete/update Iceberg table in Apache Zeppelin and will explain what happens underneath for each operation.&lt;/p&gt;

&lt;h1&gt;
  
  
  Start Zeppelin Docker Container
&lt;/h1&gt;

&lt;p&gt;To demonstrate the internal mechanism more intuitively, I use Apache Zeppelin to run all the example code. You can reproduce what I did easily via the Zeppelin docker image. You can check this article for how to play with Spark in Zeppelin docker. Here I just summarize it as the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step 1. git clone &lt;a href="https://github.com/zjffdu/zeppelin-notebook.git"&gt;https://github.com/zjffdu/zeppelin-notebook.git&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Step 2. Download Spark 3.2.1&lt;/li&gt;
&lt;li&gt;Step 3. Run the following command to start the Zeppelin docker container. &lt;code&gt;${zeppelin_notebook}&lt;/code&gt; is the notebook folder you cloned in Step 1, &lt;code&gt;${spark_location}&lt;/code&gt; is the Spark folder you downloaded in Step 2.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -u $(id -u) -p 8080:8080 -p 4040:4040 --rm \
-v ${spark_location}:/opt/spark \
-v ${zeppelin_notebook}:/opt/notebook \
-e ZEPPELIN_NOTEBOOK_DIR=/opt/notebook \
-e SPARK_HOME=/opt/spark \
-e ZEPPELIN_LOCAL_IP=0.0.0.0 \
--name zeppelin apache/zeppelin:0.10.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open &lt;a href="http://localhost:8080"&gt;http://localhost:8080&lt;/a&gt; in your browser, and open the notebook Spark/Deep Dive into Iceberg, which contains all the code in this article.&lt;/p&gt;

&lt;h1&gt;
  
  
  Architecture of Iceberg
&lt;/h1&gt;

&lt;p&gt;Basically, there are 3 layers in Iceberg:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Catalog layer&lt;/li&gt;
&lt;li&gt;Metadata layer&lt;/li&gt;
&lt;li&gt;Data layer&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Catalog Layer
&lt;/h2&gt;

&lt;p&gt;The catalog layer has 2 implementations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hive catalog, which uses the Hive metastore. The Hive metastore uses a relational database to store the pointer to the current version’s metadata file.&lt;/li&gt;
&lt;li&gt;Path-based catalog, which is based on the file system. This tutorial uses the path-based catalog. It uses a file to store the location of the current version’s metadata file (version-hint.text is the pointer which points to each version’s metadata file v[x].metadata.json in the examples below).&lt;/li&gt;
&lt;/ul&gt;
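&lt;p&gt;As a rough illustration of how a path-based catalog resolves the current version, here is a small Python sketch. This is not Iceberg’s real code; it just mimics the version-hint.text / v[x].metadata.json convention described above:&lt;/p&gt;

```python
# Minimal sketch of path-based catalog resolution: version-hint.text holds a
# single number that points to the current v[x].metadata.json file.
import os
import tempfile

def current_metadata(table_dir):
    """Read version-hint.text and return the path of the current metadata file."""
    hint = os.path.join(table_dir, "metadata", "version-hint.text")
    with open(hint) as f:
        version = int(f.read().strip())
    return os.path.join(table_dir, "metadata", f"v{version}.metadata.json")

# Simulate a table folder whose current version is 2.
table = tempfile.mkdtemp()
os.makedirs(os.path.join(table, "metadata"))
for v in (1, 2):
    open(os.path.join(table, "metadata", f"v{v}.metadata.json"), "w").close()
with open(os.path.join(table, "metadata", "version-hint.text"), "w") as f:
    f.write("2")

print(os.path.basename(current_metadata(table)))  # v2.metadata.json
```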

&lt;h2&gt;
  
  
  Metadata Layer
&lt;/h2&gt;

&lt;p&gt;In the metadata layer, there are 3 kinds of files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metadata file. Each CRUD operation generates a new metadata file which contains all the metadata info of the table, including the table schema and all the historical snapshots so far. Each snapshot is associated with one manifest list file.&lt;/li&gt;
&lt;li&gt;Manifest list file. Each snapshot version has one manifest list file, which contains a collection of manifest files.&lt;/li&gt;
&lt;li&gt;Manifest file. A manifest file can be shared across snapshots. It contains a collection of data files which store the table data. Besides that, it also contains other meta info for potential optimization, e.g. row count, lower bound and upper bound.&lt;/li&gt;
&lt;/ul&gt;
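&lt;p&gt;To make the relationship between these 3 kinds of files concrete, here is a toy model in Python. The field names are simplified for illustration (real Iceberg metadata files are JSON, manifest lists and manifests are Avro):&lt;/p&gt;

```python
# Toy model of the metadata-layer chain:
# metadata file -> snapshot -> manifest list -> manifest files -> data files.
manifest_1 = {
    "data_files": [
        {"path": "data/a.parquet", "row_count": 1, "lower_bound": 1, "upper_bound": 1},
        {"path": "data/b.parquet", "row_count": 1, "lower_bound": 2, "upper_bound": 2},
    ]
}
manifest_list_s1 = {"manifests": [manifest_1]}  # one manifest list per snapshot
metadata_file = {
    "schema": {"id": "long", "data": "string"},
    "snapshots": [{"id": "S1", "manifest_list": manifest_list_s1}],
}

def data_files_of(metadata, snapshot_id):
    """Walk the chain from a snapshot down to its data file paths."""
    snapshot = next(s for s in metadata["snapshots"] if s["id"] == snapshot_id)
    return [f["path"]
            for m in snapshot["manifest_list"]["manifests"]
            for f in m["data_files"]]

print(data_files_of(metadata_file, "S1"))  # ['data/a.parquet', 'data/b.parquet']
```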

&lt;h2&gt;
  
  
  Data Layer
&lt;/h2&gt;

&lt;p&gt;The data layer is a bunch of Parquet files which contain all the historical data, including newly added, updated and deleted records. A subset of these data files composes one snapshot version.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QbKi_87D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rptcvz0mud5nlxlt9z9u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QbKi_87D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rptcvz0mud5nlxlt9z9u.png" alt="Image description" width="880" height="822"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram above shows the architecture of Iceberg and also demonstrates what we do in this tutorial:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S1 means the version after we insert 3 records&lt;/li&gt;
&lt;li&gt;S2 means the version after we update one record&lt;/li&gt;
&lt;li&gt;S3 means the version after we delete one record&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Preparation
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Download jq and avro tools jar
&lt;/h2&gt;

&lt;p&gt;jq is used to display JSON, and the avro-tools jar is used to read the Iceberg metadata files (Avro format) and display them in plain text.&lt;/p&gt;
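&lt;p&gt;For example, the tools can be fetched and used roughly like this. The avro-tools version here is an assumption (any recent release should work), and &lt;code&gt;${manifest_file}&lt;/code&gt; stands for one of the Avro files shown later:&lt;/p&gt;

```
# install jq (e.g. apt-get install jq, or brew install jq on macOS)
# download the avro-tools jar from Maven Central
wget https://repo1.maven.org/maven2/org/apache/avro/avro-tools/1.10.2/avro-tools-1.10.2.jar

# pretty-print a JSON metadata file
cat v2.metadata.json | jq .
# dump an Avro manifest (list) file as JSON
java -jar avro-tools-1.10.2.jar tojson ${manifest_file}.avro | jq .
```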

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8wWwVyhs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w2wc60t7znq17ak3grpe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8wWwVyhs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w2wc60t7znq17ak3grpe.png" alt="Image description" width="880" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Configure Spark
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DGIaXAxQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q4dvpr6m08pkne5jmakb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DGIaXAxQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q4dvpr6m08pkne5jmakb.png" alt="Image description" width="880" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;%spark.conf&lt;/code&gt; is a special interpreter to configure the Spark interpreter in Zeppelin. Here I configure the Spark interpreter as described in the Iceberg quick start. Besides that, I specify the warehouse folder spark.sql.catalog.local.warehouse explicitly so that I can check the table folder easily later in this tutorial. Now let’s start to use Spark and play with Iceberg in Zeppelin.&lt;/p&gt;
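&lt;p&gt;The configuration in the screenshot follows the Iceberg Spark quick start and looks roughly like the following. The runtime jar version is an assumption; pick the one matching Spark 3.2.1:&lt;/p&gt;

```
%spark.conf

spark.jars.packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.2
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.local org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.local.type hadoop
spark.sql.catalog.local.warehouse /tmp/warehouse
```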

&lt;h1&gt;
  
  
  Create Iceberg Table
&lt;/h1&gt;

&lt;p&gt;First, let’s create an Iceberg table events with 2 fields: id and data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YS7QSk16--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5zbhlxvr76i8lhxszsjy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YS7QSk16--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5zbhlxvr76i8lhxszsjy.png" alt="Image description" width="880" height="119"&gt;&lt;/a&gt;&lt;/p&gt;
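&lt;p&gt;The create statement in the screenshot is along these lines (following the Iceberg quick start; local is the catalog configured in %spark.conf above, and db.events matches the table folder we inspect below):&lt;/p&gt;

```sql
%spark.sql

CREATE TABLE local.db.events (id bigint, data string) USING iceberg;
```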

&lt;p&gt;Then describe this table to check its details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xsB4Krl3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ph85a77fmia09br4622.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xsB4Krl3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ph85a77fmia09br4622.png" alt="Image description" width="880" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Check Table Folder
&lt;/h2&gt;

&lt;p&gt;So what does Iceberg do underneath for this create SQL statement? Actually, Iceberg did 2 things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a directory events under the warehouse folder /tmp/warehouse&lt;/li&gt;
&lt;li&gt;Add a metadata folder which contains all the metadata info&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since this is a newly created table, there is no data in it yet. There is only one metadata folder under the table folder (/tmp/warehouse/db/events), and there are 2 files under this folder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;version-hint.text. This file contains only one number, which points to the current metadata file v[n].metadata.json.&lt;/li&gt;
&lt;li&gt;v1.metadata.json. This file contains the metadata of this table, such as the schema, location and snapshots. For now, this table has no data, so there are no snapshots in this metadata file.
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mYJP5aU2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8m57ihpcz2n0ch41vtc4.png" alt="Image description" width="880" height="531"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Insert 3 Records (S1)
&lt;/h1&gt;

&lt;p&gt;Now let’s insert 3 new records (1, &lt;code&gt;a&lt;/code&gt;), (2, &lt;code&gt;b&lt;/code&gt;), (3, &lt;code&gt;c&lt;/code&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--elpWUbqe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h2ltr4ffi5wzuevgaoim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--elpWUbqe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h2ltr4ffi5wzuevgaoim.png" alt="Image description" width="880" height="110"&gt;&lt;/a&gt;&lt;/p&gt;
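&lt;p&gt;The insert statement is along these lines (assuming the table name local.db.events from the create step):&lt;/p&gt;

```sql
%spark.sql

INSERT INTO local.db.events VALUES (1, 'a'), (2, 'b'), (3, 'c');
```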

&lt;p&gt;Then use a select statement to verify the result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yRHV55io--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nsfk7ygoylqoqx9889iv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yRHV55io--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nsfk7ygoylqoqx9889iv.png" alt="Image description" width="880" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Check Table Folder
&lt;/h2&gt;

&lt;p&gt;Actually, 2 things happened underneath for this insert operation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the data folder, 3 parquet files are created, one record per parquet file.&lt;/li&gt;
&lt;li&gt;In the metadata folder, the content of version-hint.text is changed to 2, and v2.metadata.json is created with one newly created snapshot which points to one manifest list file. This manifest list file points to one manifest file which points to the 3 parquet files.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xJX0GaRU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ybwvqo6wpqvc4ntnn9ql.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xJX0GaRU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ybwvqo6wpqvc4ntnn9ql.png" alt="Image description" width="880" height="665"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can use the avro-tools jar to read the manifest list file, which is in Avro format. We find that it stores the location of the manifest file and other meta info such as added_data_files_count and deleted_data_files_count.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SgsIUZ2h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j2kqh0ifg6wxhcotu5ac.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SgsIUZ2h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j2kqh0ifg6wxhcotu5ac.png" alt="Image description" width="880" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then use the avro-tools jar to read the manifest file, which contains the paths of the data files and other related meta info.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mMW-vm5t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jbunghwr0874xdiuzggs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mMW-vm5t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jbunghwr0874xdiuzggs.png" alt="Image description" width="880" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can use the Spark API to read the raw Parquet data files, and we find there is one record in each parquet file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TI39FSR8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/meh1d1jm876nxvf0kmnr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TI39FSR8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/meh1d1jm876nxvf0kmnr.png" alt="Image description" width="880" height="630"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Update Record (S2)
&lt;/h1&gt;

&lt;p&gt;Now, let’s use an update statement to update one record.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mW_yBw3Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/obi6cvjnvy5veegtlowg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mW_yBw3Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/obi6cvjnvy5veegtlowg.png" alt="Image description" width="880" height="95"&gt;&lt;/a&gt;&lt;/p&gt;
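&lt;p&gt;The update statement is roughly as follows (assuming the table name local.db.events and the record values used in this tutorial):&lt;/p&gt;

```sql
%spark.sql

UPDATE local.db.events SET data = 'c_updated' WHERE id = 3;
```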

&lt;h2&gt;
  
  
  Check result after update
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--D8RZDVgM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gz2du6r7adl9ivycom8k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D8RZDVgM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gz2du6r7adl9ivycom8k.png" alt="Image description" width="880" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Check Table Folder
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;In the data folder, the existing parquet files are not changed, but one new parquet file (3, &lt;code&gt;c_updated&lt;/code&gt;) is generated.&lt;/li&gt;
&lt;li&gt;In the metadata folder, the content of version-hint.text is changed to 3, and v3.metadata.json is created with 2 snapshots. One is the snapshot from the step above; the other is a new snapshot with a new manifest list file.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Fl28Ckd---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r97lj68gv5ryev76tzer.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Fl28Ckd---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r97lj68gv5ryev76tzer.png" alt="Image description" width="880" height="652"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might be curious to know how Iceberg implements the update operation without changing existing data. The magic happens in the Iceberg metadata layer. If you check this version’s metadata file, you will find that it now contains 2 snapshots, and each snapshot is associated with one manifest list file. The first snapshot is the same as above, while the second snapshot is associated with a new manifest list file. In this manifest list file, there are 2 manifest files.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--C7vQR-DN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sf51hbn1in9g3l036qmy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--C7vQR-DN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sf51hbn1in9g3l036qmy.png" alt="Image description" width="880" height="194"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first manifest file points to the newly added data file (3, &lt;code&gt;c_updated&lt;/code&gt;). In the second manifest file, you will find that it still contains the 3 data files holding (1, &lt;code&gt;a&lt;/code&gt;), (2, &lt;code&gt;b&lt;/code&gt;), (3, &lt;code&gt;c&lt;/code&gt;), but the status of the third data file (3, &lt;code&gt;c&lt;/code&gt;) is 2, which means this data file is deleted, so when Iceberg reads this version of the table, it skips this data file. So only (1, &lt;code&gt;a&lt;/code&gt;) and (2, &lt;code&gt;b&lt;/code&gt;) will be read from this manifest file.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oghflEiU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7arhxjxt7z83ht4wugkg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oghflEiU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7arhxjxt7z83ht4wugkg.png" alt="Image description" width="880" height="589"&gt;&lt;/a&gt;&lt;/p&gt;
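&lt;p&gt;The skip-on-read logic can be sketched in a few lines of Python. The status values and records mirror the description above; this is an illustration, not Iceberg’s implementation:&lt;/p&gt;

```python
# Each manifest entry carries a status; entries with status 2 (deleted)
# are skipped when reading a snapshot.
EXISTING, ADDED, DELETED = 0, 1, 2

# Manifest files of snapshot S2 (after updating record (3, 'c')).
s2_manifests = [
    [((3, "c_updated"), ADDED)],                   # new data file
    [((1, "a"), EXISTING), ((2, "b"), EXISTING),
     ((3, "c"), DELETED)],                         # old files, one marked deleted
]

def read_snapshot(manifests):
    """Return live records, skipping data files whose status is DELETED."""
    return [rec for manifest in manifests
            for rec, status in manifest if status != DELETED]

print(sorted(read_snapshot(s2_manifests)))
# [(1, 'a'), (2, 'b'), (3, 'c_updated')]
```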

&lt;h1&gt;
  
  
  Delete Record (S3)
&lt;/h1&gt;

&lt;p&gt;Now, let’s delete record (2, &lt;code&gt;b&lt;/code&gt;)&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8AaA-epx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hsnttr4l6650pakn7h7b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8AaA-epx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hsnttr4l6650pakn7h7b.png" alt="Image description" width="880" height="105"&gt;&lt;/a&gt;&lt;/p&gt;
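&lt;p&gt;The delete statement is roughly as follows (assuming the table name local.db.events):&lt;/p&gt;

```sql
%spark.sql

DELETE FROM local.db.events WHERE id = 2;
```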

&lt;p&gt;Use a select statement to verify the result.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L6xt-Uli--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0ut72hd8mzu9xpgf1hlk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L6xt-Uli--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0ut72hd8mzu9xpgf1hlk.png" alt="Image description" width="880" height="219"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Check Table Folder
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;In the data folder, nothing changed.&lt;/li&gt;
&lt;li&gt;In the metadata folder, the content of version-hint.text is changed to 4, and v4.metadata.json is created with one more snapshot (3 snapshots in total).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The manifest list file associated with the new snapshot contains 2 manifest files.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---zG5hOdw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3pmzgxc4omxlavy7rgwt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---zG5hOdw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3pmzgxc4omxlavy7rgwt.png" alt="Image description" width="880" height="707"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sW9sWYyL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f0h6ylimzot0yxn29tx4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sW9sWYyL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f0h6ylimzot0yxn29tx4.png" alt="Image description" width="880" height="195"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first manifest file points to 1 data file (3, &lt;code&gt;c_updated&lt;/code&gt;), and the second manifest file points to the data files (1, &lt;code&gt;a&lt;/code&gt;) and (2, &lt;code&gt;b&lt;/code&gt;). But the status of data file (2, &lt;code&gt;b&lt;/code&gt;) is 2, which means it has been deleted, so when Iceberg reads this version of the table, it just skips this data file. So only (1, &lt;code&gt;a&lt;/code&gt;) will be read from this manifest file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z-i3epG8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/avu1em96oen6dz7nvyat.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z-i3epG8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/avu1em96oen6dz7nvyat.png" alt="Image description" width="880" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Use the Spark API to read these data files.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rjI1YzDB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wvxqkhjqv4vj3pvtcznr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rjI1YzDB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wvxqkhjqv4vj3pvtcznr.png" alt="Image description" width="880" height="693"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Inspect Metadata
&lt;/h1&gt;

&lt;p&gt;You can also read metadata tables to inspect a table’s history, snapshots, and other metadata.&lt;/p&gt;
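&lt;p&gt;In Spark SQL, Iceberg exposes these as metadata tables addressed under the table name, e.g. (assuming the table local.db.events):&lt;/p&gt;

```sql
%spark.sql

SELECT * FROM local.db.events.history;
SELECT * FROM local.db.events.snapshots;
SELECT * FROM local.db.events.manifests;
SELECT * FROM local.db.events.files;
```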

&lt;h2&gt;
  
  
  Inspect history metadata
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CXh4_aLb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y46mb2r9hf5yqi47f0qd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CXh4_aLb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y46mb2r9hf5yqi47f0qd.png" alt="Image description" width="880" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Inspect snapshot metadata
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rcfFNAIu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wphfa6zezlo69izngsty.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rcfFNAIu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wphfa6zezlo69izngsty.png" alt="Image description" width="880" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Inspect manifest metadata
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_8HMuXmn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6pjwapnbi20bawwzpwem.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_8HMuXmn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6pjwapnbi20bawwzpwem.png" alt="Image description" width="880" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Inspect file meta table
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hArXzn4z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fni1tlcbbhmtn8kxgx90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hArXzn4z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fni1tlcbbhmtn8kxgx90.png" alt="Image description" width="880" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;In this article, I go through 4 main steps to play with Apache Iceberg:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create Table&lt;/li&gt;
&lt;li&gt;Insert Data&lt;/li&gt;
&lt;li&gt;Update Data&lt;/li&gt;
&lt;li&gt;Delete Data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At each step, I check what has changed under the table folder. All the steps are done in the Apache Zeppelin docker container, so you can reproduce them easily. Just one thing to remember: because the file names (snapshot file, manifest file, parquet file) are randomly generated, you need to update the code to use the correct file names. I hope this article is useful for you to understand the internal mechanism of Apache Iceberg.&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://zeppelin.apache.org/docs/0.10.1/interpreter/spark.html"&gt;https://zeppelin.apache.org/docs/0.10.1/interpreter/spark.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberg.apache.org/"&gt;https://iceberg.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.dremio.com/resources/guides/apache-iceberg-an-architectural-look-under-the-covers/"&gt;https://www.dremio.com/resources/guides/apache-iceberg-an-architectural-look-under-the-covers/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.starburst.io/blog/trino-on-ice-iv-deep-dive-into-iceberg-internals/"&gt;https://www.starburst.io/blog/trino-on-ice-iv-deep-dive-into-iceberg-internals/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>apachezeppelin</category>
      <category>apacheiceberg</category>
      <category>spark</category>
    </item>
    <item>
      <title>Deep Dive into Delta Lake via Apache Zeppelin</title>
      <dc:creator>Jeff Zhang</dc:creator>
      <pubDate>Mon, 29 Nov 2021 15:28:37 +0000</pubDate>
      <link>https://dev.to/zjffdu/deep-dive-into-delta-lake-via-apache-zeppelin-3077</link>
      <guid>https://dev.to/zjffdu/deep-dive-into-delta-lake-via-apache-zeppelin-3077</guid>
      <description>&lt;p&gt;Delta Lake is an open source project that enables building a Lakehouse architecture on top of data lakes. There’re a lot of tutorials on internet about how to use Delta Lake. This post is a little different, it is for those people who are curious to know the internal mechanism of Delta Lake, especially the transaction log.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start Zeppelin Docker Container
&lt;/h2&gt;

&lt;p&gt;To demonstrate the internal mechanism more intuitively, I use Apache Zeppelin to run all the example code. You can reproduce what I did easily via the Zeppelin docker image. You can check this article for how to play with Spark in Zeppelin docker. Here I just summarize it as the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step 1. git clone &lt;a href="https://github.com/zjffdu/zeppelin-notebook.git"&gt;https://github.com/zjffdu/zeppelin-notebook.git&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Step 2. Download Spark 3.1.2 (This is what I used in this tutorial, don’t use Spark 3.2.0, it is not supported yet)&lt;/li&gt;
&lt;li&gt;Step 3. Run the following command to start Zeppelin docker container. &lt;code&gt;${zeppelin_notebook}&lt;/code&gt; is the notebook folder you cloned in Step 1, &lt;code&gt;${spark_location}&lt;/code&gt; is the Spark folder you downloaded in Step 2.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -u $(id -u) -p 8080:8080 -p 4040:4040 --rm \
-v ${spark_location}:/opt/spark \
-v ${zeppelin_notebook}:/opt/notebook \
-e ZEPPELIN_NOTEBOOK_DIR=/opt/notebook \
-e SPARK_HOME=/opt/spark \
-e ZEPPELIN_LOCAL_IP=0.0.0.0 \
--name zeppelin apache/zeppelin:0.10.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open &lt;a href="http://localhost:8080"&gt;http://localhost:8080&lt;/a&gt;, and open the notebook Spark/Deep Dive into Delta Lake, which contains all the code in this article.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8Y-f3-Zp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7m22d7jttbvmceioft9x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8Y-f3-Zp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7m22d7jttbvmceioft9x.png" alt="Image description" width="880" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Configure Spark
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UG1bhApi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/goxlihzohywu0sdcaa1o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UG1bhApi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/goxlihzohywu0sdcaa1o.png" alt="Image description" width="880" height="151"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;This is the first paragraph of Deep Dive into Delta Lake, which configures the Spark interpreter to use Delta Lake.&lt;br&gt;
&lt;code&gt;%spark.conf&lt;/code&gt; is a special interpreter to configure the Spark interpreter in Zeppelin. Here I configure the Spark interpreter as described in the Delta Lake quick start. Besides that, I specify spark.sql.warehouse.dir explicitly so that I can check the table folder easily later in this tutorial. Now let’s start to use Spark and play with Delta Lake in Zeppelin.&lt;/p&gt;
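&lt;p&gt;The configuration in the screenshot follows the Delta Lake quick start and looks roughly like this. The delta-core version is an assumption; pick one compatible with Spark 3.1.2:&lt;/p&gt;

```
%spark.conf

spark.jars.packages io.delta:delta-core_2.12:1.0.0
spark.sql.extensions io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.warehouse.dir /tmp/warehouse
```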

&lt;h2&gt;
  
  
  Create Delta Table
&lt;/h2&gt;

&lt;p&gt;First, let’s create a Delta table events with 2 fields: id and data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3uCRLLQl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hyzw1z602jvvucx97jzp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3uCRLLQl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hyzw1z602jvvucx97jzp.png" alt="Image description" width="880" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So what does Delta do underneath for this create SQL statement? It does 2 things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates a directory events under the warehouse folder /tmp/warehouse&lt;/li&gt;
&lt;li&gt;Adds a transaction log which contains the schema of this table
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PJ-4ALlq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1x1gbukgopdytjchqgxx.png" alt="Image description" width="880" height="165"&gt;
&lt;/li&gt;
&lt;/ul&gt;
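&lt;p&gt;The create statement in the screenshot is plain Spark SQL; a minimal sketch (the column types are my assumption, since the text only names the fields id and data) looks like:&lt;/p&gt;

```
%spark.sql

CREATE TABLE events (
  id INT,
  data STRING
) USING delta;
```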

&lt;h2&gt;
  
  
  Insert data
&lt;/h2&gt;

&lt;p&gt;Now let’s insert some data into this Delta table. Here I insert only 2 records: (1, data_1) and (2, data_2).&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Cn119bBR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a3la09yzoybsmik096hk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Cn119bBR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a3la09yzoybsmik096hk.png" alt="Image description" width="880" height="98"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then let’s run a select SQL statement to verify the result of this insert statement.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YnMSreqH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/efirixl3e4slo1enzeuy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YnMSreqH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/efirixl3e4slo1enzeuy.png" alt="Image description" width="880" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So what does Delta do underneath for this insert SQL statement? Checking the table folder /tmp/warehouse/events, we find 2 changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Another new transaction log file is generated.&lt;/li&gt;
&lt;li&gt;2 parquet files are generated
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uBs68_YT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bfa9ehyvuwtbk0ic5ldq.png" alt="Image description" width="880" height="207"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First, let’s take a look at the newly generated transaction file (00000000000000000001.json). This JSON file is quite readable: it records the operations of this insert SQL statement, namely adding the 2 parquet files which contain the 2 records. Notice that there’s no table schema info in this new transaction log file, because it is already in the first transaction log file (00000000000000000000.json). When Delta reads the table, it merges all the historical transaction files to assemble the full state of the table (including the schema of this table and which data files it contains).&lt;br&gt;
Since we only inserted 2 records, it is natural to guess that each parquet file contains one record. We can read these 2 parquet files directly to verify that; as the following code shows, our guess is correct.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4zyBmWoa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xp099s0wo1lbl7jah5xp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4zyBmWoa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xp099s0wo1lbl7jah5xp.png" alt="Image description" width="880" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Update Data
&lt;/h2&gt;

&lt;p&gt;The most important feature of Delta is its ACID support: you can update the table at any time without affecting others who are reading or writing the same table simultaneously. Now let’s update this events table.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AKrBXhme--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/clee3zqs4ufd01ijuzq8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AKrBXhme--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/clee3zqs4ufd01ijuzq8.png" alt="Image description" width="880" height="98"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then run a select statement to verify the result of this update statement.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4d7E-wGD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zxjg3ay40109mayogidh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4d7E-wGD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zxjg3ay40109mayogidh.png" alt="Image description" width="880" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So what does this update statement do underneath? Checking the events table folder, we find 2 changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Another new transaction log file is generated&lt;/li&gt;
&lt;li&gt;Another parquet file is added (the previous 2 parquet files are still there)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZlsGcU8L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dlaoffx45h7tgu91rami.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZlsGcU8L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dlaoffx45h7tgu91rami.png" alt="Image description" width="880" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, let’s take a look at the new transaction log file’s content; there are 2 operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove one parquet file&lt;/li&gt;
&lt;li&gt;Add a new parquet file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is natural for us to guess that the removed file contains the record (2, data_2), while the newly added file contains the record (2, data_2_updated). Let’s read these 2 parquet files directly to verify our guess.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y8MNWlt7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5gxvxlc0sd47rpoaehp0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y8MNWlt7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5gxvxlc0sd47rpoaehp0.png" alt="Image description" width="880" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let’s use the time travel feature of Delta. We would like to query the last version of this table, from before this update operation.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E8Sjb5oI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bzqk71qxwtsq399dlmws.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E8Sjb5oI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bzqk71qxwtsq399dlmws.png" alt="Image description" width="880" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The time travel feature works because Delta never deletes the data files; it only records all the operations in the transaction logs. When you read version 1 of this table, Delta Lake reads only the first 2 transaction logs: 00000000000000000000.json &amp;amp; 00000000000000000001.json.&lt;/p&gt;
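&lt;p&gt;The replay logic behind reading and time traveling a Delta table can be sketched in a few lines of plain Python. The log entries below are simplified stand-ins for the real JSON actions (real entries carry many more fields, such as timestamps, sizes and stats), but the add/remove mechanics are the same:&lt;/p&gt;

```python
# Simplified stand-ins for the actions recorded in the transaction log files
# 00000000000000000000.json, 00000000000000000001.json, ... of this tutorial.
commits = [
    [{"metaData": {"schemaString": "id, data"}}],              # version 0: create table
    [{"add": {"path": "part-00000.parquet"}},                  # version 1: insert
     {"add": {"path": "part-00001.parquet"}}],
    [{"remove": {"path": "part-00001.parquet"}},               # version 2: update
     {"add": {"path": "part-00002.parquet"}}],
]

def active_files(commits, version=None):
    """Replay the transaction logs up to `version` and return the live data files."""
    if version is None:
        version = len(commits) - 1          # latest version by default
    files = set()
    for commit in commits[:version + 1]:
        for action in commit:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

print(sorted(active_files(commits)))             # data files after the update
print(sorted(active_files(commits, version=1)))  # time travel: files before the update
```

&lt;p&gt;Reading version 1 replays only the first 2 logs, which is exactly why the old parquet files must stay on disk.&lt;/p&gt;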

&lt;h2&gt;
  
  
  Delete Data
&lt;/h2&gt;

&lt;p&gt;Now let’s do the delete operation on this events table.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DwoV-grU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ab33b2tgdn9710imedko.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DwoV-grU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ab33b2tgdn9710imedko.png" alt="Image description" width="880" height="90"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And then run a select statement to verify the result of this delete statement.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Vec1Os5a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dnqklp3kvqg7c3yeulu2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Vec1Os5a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dnqklp3kvqg7c3yeulu2.png" alt="Image description" width="880" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So what does Delta do for this delete operation underneath? We can again check the events table folder, where we find 2 changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A new metadata transaction log file is generated&lt;/li&gt;
&lt;li&gt;A new parquet file is added
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F3o0MWI0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qjy4wlpt71h7bjxuynxf.png" alt="Image description" width="880" height="274"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the new transaction log file we still see 2 operations: remove and add.&lt;br&gt;
It is natural to guess that the remove operation removes the file containing record (1, data_1), but then what does the new add operation do? Actually, the newly added parquet file is empty and contains nothing; we can read these 2 parquet files directly to verify that.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rQO-hcF6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/evlynvczc88swab92w4w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rQO-hcF6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/evlynvczc88swab92w4w.png" alt="Image description" width="880" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this article, I walk through 4 main operations on Delta Lake:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create Table&lt;/li&gt;
&lt;li&gt;Insert Data&lt;/li&gt;
&lt;li&gt;Update Data&lt;/li&gt;
&lt;li&gt;Delete Data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At each step, I check what changes in the transaction log and data files. All the steps are done in the Apache Zeppelin docker container, so you can reproduce them easily. I hope this article helps you understand the internal mechanism of Delta Lake.&lt;/p&gt;

</description>
      <category>deltalake</category>
      <category>apachespark</category>
      <category>apachezeppelin</category>
    </item>
    <item>
      <title>Use customized and isolated python environment in Apache Zeppelin notebook</title>
      <dc:creator>Jeff Zhang</dc:creator>
      <pubDate>Sat, 10 Jul 2021 07:58:03 +0000</pubDate>
      <link>https://dev.to/zjffdu/use-customized-and-isolated-python-environment-in-apache-zeppelin-notebook-k4i</link>
      <guid>https://dev.to/zjffdu/use-customized-and-isolated-python-environment-in-apache-zeppelin-notebook-k4i</guid>
      <description>&lt;p&gt;Apache Zeppelin notebook is web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python and more.&lt;/p&gt;

&lt;p&gt;For Python developers, a customized and isolated Python runtime environment is an indispensable requirement. You and your colleagues may want to use different versions of Python and different Python packages without affecting each other’s environment. In this article, I’d like to show you how to use a customized and isolated Python environment in a Hadoop YARN cluster. How to achieve this for PySpark, I will leave to another article. (All the features in this article were implemented in JIRA ZEPPELIN-5330.) You can reproduce all the steps by downloading this note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://zeppelin-notebook.com/#/notebook/2G7RDR415" rel="noopener noreferrer"&gt;Python Conda Env in Yarn Mode&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1. Create your customized conda env
&lt;/h2&gt;

&lt;p&gt;First, let’s create a YAML file which defines the Python conda env, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;env name&lt;/li&gt;
&lt;li&gt;channels used to install packages&lt;/li&gt;
&lt;li&gt;python version&lt;/li&gt;
&lt;li&gt;other third party python packages
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: python_env
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.7 
  - pycodestyle
  - numpy
  - pandas
  - scipy
  - grpcio
  - protobuf
  - pandasql
  - ipython
  - ipykernel
  - jupyter_client
  - panel
  - pyyaml
  - seaborn
  - plotnine
  - hvplot
  - intake
  - intake-parquet
  - intake-xarray
  - altair
  - vega_datasets
  - pyarrow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run the following commands to create the conda env tar and upload it to HDFS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda pack -n python_env
hadoop fs -put python_env.tar.gz /tmp
# The python conda tar should be publicly accessible, so need to change permission here.
hadoop fs -chmod 644 /tmp/python_env.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2. Configure Python Interpreter
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%python.conf
# set zeppelin.interpreter.launcher to be yarn, so that python interpreter run in yarn container, 
# otherwise python interpreter run as local process in the zeppelin server host.
zeppelin.interpreter.launcher yarn
# zeppelin.yarn.dist.archives can be either local file or hdfs file
zeppelin.yarn.dist.archives hdfs:///tmp/python_env.tar.gz#environment
# conda environment name, aka the folder name in the working directory of yarn container
zeppelin.interpreter.conda.env.name environment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3. Run Python Interpreter in this customized conda env
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%python

%matplotlib inline

import matplotlib.pyplot as plt
plt.plot([1,2,3,4])
plt.ylabel('some numbers')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AqYW8yu8yA-8xznBlmDlg1w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AqYW8yu8yA-8xznBlmDlg1w.png" alt="Matplotlib"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This feature was not yet released when this article was published; you can build the Zeppelin master branch yourself and import this note to try it. If you have any questions, you can ask on the Zeppelin user mailing list or Slack channel (&lt;a href="http://zeppelin.apache.org/community.html" rel="noopener noreferrer"&gt;http://zeppelin.apache.org/community.html&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://zeppelin.apache.org/" rel="noopener noreferrer"&gt;Apache Zeppelin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zjffdu.gitbook.io/flink-on-zeppelin/" rel="noopener noreferrer"&gt;Flink on Zeppelin docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://zeppelin-notebook.com/" rel="noopener noreferrer"&gt;Zeppelin notebooks website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/zjffdu/zeppelin-notebook" rel="noopener noreferrer"&gt;Zeppelin notebooks git repo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>notebook</category>
      <category>apachezeppelin</category>
      <category>conda</category>
    </item>
    <item>
      <title>How to use IPython in Apache Zeppelin Notebook</title>
      <dc:creator>Jeff Zhang</dc:creator>
      <pubDate>Sat, 10 Jul 2021 07:51:32 +0000</pubDate>
      <link>https://dev.to/zjffdu/how-to-use-ipython-in-apache-zeppelin-notebook-1gi9</link>
      <guid>https://dev.to/zjffdu/how-to-use-ipython-in-apache-zeppelin-notebook-1gi9</guid>
      <description>&lt;p&gt;Apache Zeppelin Notebook is web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python and more.&lt;/p&gt;

&lt;p&gt;In this post, I will talk about how to use IPython in Apache Zeppelin Notebook (although Zeppelin supports vanilla Python, it is strongly recommended to use IPython). This makes the Python development experience in Zeppelin notebook almost the same as in Jupyter notebook. &lt;/p&gt;

&lt;p&gt;All the contents of this post can be found in these 2 example notebooks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://zeppelin-notebook.com/#/notebook/2EYDJKFFY" rel="noopener noreferrer"&gt;IPython Basic Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://zeppelin-notebook.com/#/notebook/2F1S9ZY8Z" rel="noopener noreferrer"&gt;IPython Visualization Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to configure
&lt;/h2&gt;

&lt;p&gt;Enabling IPython in Zeppelin is pretty straightforward. First, you need to install the following 3 Python packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install jupyter
pip install grpcio
pip install protobuf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Besides manually installing these packages, you can also use conda to create a customized python environment which has these installed. You can check this post for more details.&lt;/p&gt;

&lt;p&gt;Then configure the Python interpreter. The most important setting is zeppelin.python, which needs to point to the correct Python executable in case you have multiple Pythons installed on your machine.&lt;/p&gt;
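&lt;p&gt;For example, you can set it in a &lt;code&gt;%python.conf&lt;/code&gt; paragraph (the path below is illustrative; point it at your own interpreter):&lt;/p&gt;

```
%python.conf

zeppelin.python /usr/local/anaconda3/bin/python
```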

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2A0j2LwKBOi-pSlKHrf7a0XA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2A0j2LwKBOi-pSlKHrf7a0XA.png" alt="IPython Config"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  IPython Features
&lt;/h2&gt;

&lt;p&gt;Now you can experience almost the same Python development experience as in Jupyter notebook. Here’s a list of features that I’d like to highlight.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support for IPython magic functions&lt;/li&gt;
&lt;li&gt;Better code completion&lt;/li&gt;
&lt;li&gt;Rich visualization libraries support&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  IPython magic function
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AxqSuHG09u4pL7kp2VDIjJQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AxqSuHG09u4pL7kp2VDIjJQ.png" alt="Magic function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Code completion
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F0%2AF9ArQnKQ11wkQQWx.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F0%2AF9ArQnKQ11wkQQWx.gif" alt="Code completion"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Rich visualization libraries support
&lt;/h3&gt;

&lt;p&gt;Visualization libraries are a big part of the Python ecosystem. As in Jupyter notebook, you can use most of the popular Python visualization libraries in Zeppelin notebook.&lt;/p&gt;

&lt;p&gt;Here’s a list of examples of how to use popular Python visualization libraries in Zeppelin:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Matplotlib&lt;/li&gt;
&lt;li&gt;Pandas&lt;/li&gt;
&lt;li&gt;Seaborn&lt;/li&gt;
&lt;li&gt;Plotnine&lt;/li&gt;
&lt;li&gt;Bokeh&lt;/li&gt;
&lt;li&gt;Holoviews&lt;/li&gt;
&lt;li&gt;Altair&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2Ay7eDPkuhnifoeI-EKLQ1Gw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2Ay7eDPkuhnifoeI-EKLQ1Gw.png" alt="Matplotlib"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AoK7m8fK0XOEFkudju9Qtsg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AoK7m8fK0XOEFkudju9Qtsg.png" alt="Pandas"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2ARztBq4Svc_zb9iBP0S_SIg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2ARztBq4Svc_zb9iBP0S_SIg.png" alt="Seaborn"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AtGdVKQF_NX-w0DvX-JSzGA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AtGdVKQF_NX-w0DvX-JSzGA.png" alt="Plotnine"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2A9zinRcwcKseFrhzGIQxBxQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2A9zinRcwcKseFrhzGIQxBxQ.png" alt="Bokeh"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2ASn0XaBOURKIEEDNE9DtQWw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2ASn0XaBOURKIEEDNE9DtQWw.png" alt="Holoviews"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2ArLJLq4tWRv1ICGK5ugHGqA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2ArLJLq4tWRv1ICGK5ugHGqA.png" alt="Altair"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This is just a brief introduction of how to use IPython in Zeppelin notebook. If you have any question, you can ask in zeppelin user mail list or slack channel (&lt;a href="http://zeppelin.apache.org/community.html" rel="noopener noreferrer"&gt;http://zeppelin.apache.org/community.html&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://zeppelin.apache.org/" rel="noopener noreferrer"&gt;Apache Zeppelin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zjffdu.gitbook.io/flink-on-zeppelin/" rel="noopener noreferrer"&gt;Flink on Zeppelin gitbook&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://zeppelin-notebook.com/" rel="noopener noreferrer"&gt;Zeppelin notebooks website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/zjffdu/zeppelin-notebook" rel="noopener noreferrer"&gt;Zeppelin notebooks git repo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>apachezeppelin</category>
      <category>ipython</category>
      <category>python</category>
      <category>notebook</category>
    </item>
    <item>
      <title>Learn Flink SQL — The Easy Way </title>
      <dc:creator>Jeff Zhang</dc:creator>
      <pubDate>Wed, 07 Jul 2021 08:45:17 +0000</pubDate>
      <link>https://dev.to/zjffdu/learn-flink-sql-the-easy-way-11ei</link>
      <guid>https://dev.to/zjffdu/learn-flink-sql-the-easy-way-11ei</guid>
      <description>&lt;p&gt;Flink is almost the de facto standard streaming engine today. Flink SQL is the recommended approach to use Flink. But streaming sql is not the same as the traditional batch sql, you have to learn many new concepts, such as watermark, event time, different kinds of streaming joins and etc. To be honest all of these are not easy to learn.&lt;/p&gt;

&lt;p&gt;Today I’d like to introduce a new (easy) way to learn Flink SQL: the Flink SQL Cookbook on Zeppelin. In Zeppelin you can run Flink SQL interactively, as follows:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jrIv6ti_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2qs52ri7nwfi7m1kag3k.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jrIv6ti_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2qs52ri7nwfi7m1kag3k.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All the examples in this post can be found here.&lt;br&gt;
&lt;a href="http://zeppelin-notebook.com/"&gt;http://zeppelin-notebook.com/&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Prepare environment
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Step 1
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/zjffdu/flink-sql-cookbook-on-zeppelin.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This repo has all the Zeppelin notebooks, which include the examples in &lt;code&gt;flink-sql-cookbook&lt;/code&gt;. Thanks to Ververica for the great examples; I just migrated them to Zeppelin.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2
&lt;/h3&gt;

&lt;p&gt;Download Flink 1.13.1 and untar it. (I haven’t tried other versions, but it should work for all Flink versions after 1.10.)&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 3
&lt;/h3&gt;

&lt;p&gt;Build &lt;a href="https://github.com/knaufk/flink-faker/"&gt;flink faker&lt;/a&gt; and copy flink-faker-0.3.0.jar to the lib folder of Flink. This is a custom Flink table source used to generate sample data.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 4
&lt;/h3&gt;

&lt;p&gt;Run the following command to start Zeppelin&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -u $(id -u) -p 8081:8081 -p 8080:8080 --rm -v $PWD/logs:/logs -v /mnt/disk1/jzhang/flink-sql-cookbook-on-zeppelin:/notebook -v /mnt/disk1/jzhang/flink-1.13.1:/opt/flink -e FLINK_HOME=/opt/flink -e ZEPPELIN_LOG_DIR='/logs' -e ZEPPELIN_NOTEBOOK_DIR='/notebook' --name zeppelin apache/zeppelin:0.10.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are 2 folders you need to replace with your own:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;/mnt/disk1/jzhang/flink-sql-cookbook-on-zeppelin （This is the repo folder of step 1）&lt;/li&gt;
&lt;li&gt;/mnt/disk1/jzhang/flink-1.13.1 （This is the flink folder of step 2）&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try examples of Flink Sql Cookbook
&lt;/h2&gt;

&lt;p&gt;Now the environment is ready, and you can start your Flink SQL journey by opening &lt;a href="http://localhost:8080"&gt;http://localhost:8080&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rOtKsez9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1540/1%2A7A4smZAY6fhETEF1DfG3Bw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rOtKsez9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1540/1%2A7A4smZAY6fhETEF1DfG3Bw.png" alt="ZEPPELIN UI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the Zeppelin home page; there’s already a folder called Flink Sql Cookbook which includes all the examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example 1： Filtering Data
&lt;/h2&gt;

&lt;p&gt;Now let’s take a look at the first example: &lt;code&gt;Foundations/04 Filtering Data&lt;/code&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eE2VDxkS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1540/1%2AE1bqSmQAUDuF4Q0_8YpHLw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eE2VDxkS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1540/1%2AE1bqSmQAUDuF4Q0_8YpHLw.png" alt="Example_1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are 2 paragraphs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Paragraph 1 creates the table server_logs via the faker connector.&lt;/li&gt;
&lt;li&gt;Paragraph 2 filters the data via a where clause and then selects the latest 10 records by log_time.&lt;/li&gt;
&lt;/ul&gt;
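&lt;p&gt;Condensed into one snippet, the 2 paragraphs look roughly like the following sketch (the column list and faker expressions are illustrative, not the cookbook’s exact DDL):&lt;/p&gt;

```
%flink.ssql(type=update)

CREATE TABLE server_logs (
    client_ip STRING,
    status_code INT,
    log_time TIMESTAMP(3)
) WITH (
  'connector' = 'faker',
  'fields.client_ip.expression' = '#{Internet.publicIpV4Address}',
  'fields.status_code.expression' = '#{numerify ''50#''}',
  'fields.log_time.expression' = '#{date.past ''15'',''SECONDS''}'
);

SELECT * FROM server_logs
WHERE status_code = 500
ORDER BY log_time DESC
LIMIT 10;
```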

&lt;p&gt;The following is a screenshot of the result. You can see that the result refreshes every 3 seconds. This is the biggest difference between Flink streaming SQL and traditional batch SQL: in the streaming world, new data arrives continuously, so the result is updated continuously.&lt;/p&gt;

&lt;p&gt;Besides that, you can click the &lt;code&gt;FLINK_JOB&lt;/code&gt; link in the top right, which takes you to the Flink UI of this job.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y7xd7lvM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://miro.medium.com/max/1540/1%2AgXzgqhcaJgj-TsV_xOiw0w.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y7xd7lvM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://miro.medium.com/max/1540/1%2AgXzgqhcaJgj-TsV_xOiw0w.gif" alt="ZEPPELIN_FLINK_JOB"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Example 2: Lateral Table Join
&lt;/h2&gt;

&lt;p&gt;Now let's take a look at the second example: lateral table join. This is one of the join types that Flink SQL supports. Beginners are often a little intimidated by it even after learning it from tutorial articles; a real example showing exactly what a lateral table join does makes it much easier to understand. Fortunately, there is one example in this flink-sql-cookbook, and you can run it directly in Zeppelin. Open &lt;code&gt;Joins/06 Lateral Table&lt;/code&gt;, run it, and you will see the following screenshot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RHXeRlNI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://miro.medium.com/max/1540/1%2AxYkL34J2M7dSXjUDESlD1A.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RHXeRlNI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://miro.medium.com/max/1540/1%2AxYkL34J2M7dSXjUDESlD1A.gif" alt="Example_2"&gt;&lt;/a&gt;&lt;/p&gt;
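&lt;p&gt;For readers who want to see the shape of the query before opening the notebook, here is a sketch of a lateral table join; the table and column names are hypothetical, not necessarily those used in the cookbook:&lt;/p&gt;

```sql
-- For each row of the outer table, the LATERAL derived table is re-evaluated
-- with access to that outer row (states.state), returning the top 2 most
-- populous cities per state. Table/column names are illustrative only.
SELECT states.state, city, population
FROM
    (SELECT DISTINCT state FROM CurrentPopulation) AS states,
    LATERAL (
        SELECT city, population
        FROM CurrentPopulation
        WHERE state = states.state
        ORDER BY population DESC
        LIMIT 2
    );
```

A plain join cannot express "top 2 per group" directly, because each branch of a regular join is evaluated independently; the correlated LATERAL subquery is what lets the inner Top-N query reference the outer row.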

&lt;p&gt;I have only shown the above 2 examples here; there are many other examples in this cookbook, as shown below. You can try them by yourself. I hope you enjoy this flink-sql-cookbook-on-zeppelin.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GKJSs_aB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1540/1%2AMqxeon7neQcKnggAdtpowA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GKJSs_aB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1540/1%2AMqxeon7neQcKnggAdtpowA.png" alt="All Examples"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Not only can you use Zeppelin to learn Flink SQL, you can also use Zeppelin as your streaming platform to submit and manage your Flink jobs. &lt;/p&gt;

&lt;p&gt;The Zeppelin community keeps improving and evolving the overall user experience of Flink on Zeppelin; you can join the Zeppelin Slack to discuss with the community: &lt;a href="http://zeppelin.apache.org/community.html#slack-channel"&gt;http://zeppelin.apache.org/community.html#slack-channel&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For more details about Flink on Zeppelin, please refer to the following links.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://zeppelin-notebook.com/#/"&gt;http://zeppelin-notebook.com/#/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://zeppelin.apache.org/docs/0.10.0/interpreter/flink.html"&gt;http://zeppelin.apache.org/docs/0.10.0/interpreter/flink.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=YxPo0Fosjjg&amp;amp;list=PL4oy12nnS7FFtg3KV1iS5vDb0pTz12VcX"&gt;https://www.youtube.com/watch?v=YxPo0Fosjjg&amp;amp;list=PL4oy12nnS7FFtg3KV1iS5vDb0pTz12VcX&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>apacheflink</category>
      <category>apachezeppelin</category>
      <category>notebook</category>
      <category>streaming</category>
    </item>
  </channel>
</rss>
