<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jeff Zhang</title>
    <description>The latest articles on DEV Community by Jeff Zhang (@zjffdu).</description>
    <link>https://dev.to/zjffdu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F353606%2F6d023fdb-5fa2-4fc4-8943-764b1a330272.jpg</url>
      <title>DEV Community: Jeff Zhang</title>
      <link>https://dev.to/zjffdu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zjffdu"/>
    <language>en</language>
    <item>
      <title>Deep Dive into Apache Iceberg via Apache Zeppelin</title>
      <dc:creator>Jeff Zhang</dc:creator>
      <pubDate>Mon, 18 Jul 2022 00:11:31 +0000</pubDate>
      <link>https://dev.to/zjffdu/deep-dive-into-apache-iceberg-via-apache-zeppelin-fc3</link>
      <guid>https://dev.to/zjffdu/deep-dive-into-apache-iceberg-via-apache-zeppelin-fc3</guid>
      <description>&lt;p&gt;Apache Iceberg is a high-performance format for huge analytic tables. There’re a lot of tutorials on the internet about how to use Iceberg. This post is a little different, it is for those people who are curious to know the internal mechanism of Iceberg. In this post, I will use Spark sql to create/insert/delete/update Iceberg table in Apache Zeppelin and will explain what happens underneath for each operation.&lt;/p&gt;

&lt;h1&gt;
  
  
  Start Zeppelin Docker Container
&lt;/h1&gt;

&lt;p&gt;To demonstrate the internal mechanism more intuitively, I use Apache Zeppelin to run all the example code. You can reproduce what I did easily via the Zeppelin docker image. You can check this article for how to play with Spark in Zeppelin docker. Here I just summarize it as the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step 1. git clone &lt;a href="https://github.com/zjffdu/zeppelin-notebook.git"&gt;https://github.com/zjffdu/zeppelin-notebook.git&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Step 2. Download Spark 3.2.1&lt;/li&gt;
&lt;li&gt;Step 3. Run the following command to start the Zeppelin docker container. &lt;code&gt;${zeppelin_notebook}&lt;/code&gt; is the notebook folder you cloned in Step 1, &lt;code&gt;${spark_location}&lt;/code&gt; is the Spark folder you downloaded in Step 2.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -u $(id -u) -p 8080:8080 -p 4040:4040 --rm \
-v ${spark_location}:/opt/spark \
-v ${zeppelin_notebook}:/opt/notebook \
-e ZEPPELIN_NOTEBOOK_DIR=/opt/notebook \
-e SPARK_HOME=/opt/spark \
-e ZEPPELIN_LOCAL_IP=0.0.0.0 \
--name zeppelin apache/zeppelin:0.10.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open &lt;a href="http://localhost:8080"&gt;http://localhost:8080&lt;/a&gt; in your browser, and open the notebook Spark/Deep Dive into Iceberg, which contains all the code in this article.&lt;/p&gt;

&lt;h1&gt;
  
  
  Architecture of Iceberg
&lt;/h1&gt;

&lt;p&gt;Basically, there are 3 layers in Iceberg:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Catalog layer&lt;/li&gt;
&lt;li&gt;Metadata layer&lt;/li&gt;
&lt;li&gt;Data layer&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Catalog Layer
&lt;/h2&gt;

&lt;p&gt;The catalog layer has 2 implementations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hive catalog, which uses the Hive metastore. The Hive metastore uses a relational database to store the pointer to the current version’s metadata file.&lt;/li&gt;
&lt;li&gt;Path-based catalog, which is based on the file system. This tutorial uses the path-based catalog. It uses a file to store the location of the current version’s metadata file (version-hint.text is the pointer which points to each version’s metadata file v[x].metadata.json in the examples below).&lt;/li&gt;
&lt;/ul&gt;
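&lt;p&gt;As a rough illustration of how a path-based catalog resolves the current version, here is a small Python sketch. This is not Iceberg’s real code; it just mimics the version-hint.text / v[x].metadata.json convention described above:&lt;/p&gt;

```python
# Minimal sketch of path-based catalog resolution: version-hint.text holds a
# single number that points to the current v[x].metadata.json file.
import os
import tempfile

def current_metadata(table_dir):
    """Read version-hint.text and return the path of the current metadata file."""
    hint = os.path.join(table_dir, "metadata", "version-hint.text")
    with open(hint) as f:
        version = int(f.read().strip())
    return os.path.join(table_dir, "metadata", f"v{version}.metadata.json")

# Simulate a table folder whose current version is 2.
table = tempfile.mkdtemp()
os.makedirs(os.path.join(table, "metadata"))
for v in (1, 2):
    open(os.path.join(table, "metadata", f"v{v}.metadata.json"), "w").close()
with open(os.path.join(table, "metadata", "version-hint.text"), "w") as f:
    f.write("2")

print(os.path.basename(current_metadata(table)))  # v2.metadata.json
```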

&lt;h2&gt;
  
  
  Metadata Layer
&lt;/h2&gt;

&lt;p&gt;In the metadata layer, there are 3 kinds of files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metadata file. Each CRUD operation generates a new metadata file which contains all the metadata info of the table, including the table schema and all the historical snapshots so far. Each snapshot is associated with one manifest list file.&lt;/li&gt;
&lt;li&gt;Manifest list file. Each snapshot version has one manifest list file, which contains a collection of manifest files.&lt;/li&gt;
&lt;li&gt;Manifest file. A manifest file can be shared across snapshots. It contains a collection of data files which store the table data. Besides that, it also contains other meta info for potential optimization, e.g. row count, lower bound and upper bound.&lt;/li&gt;
&lt;/ul&gt;
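&lt;p&gt;To make the relationship between these 3 kinds of files concrete, here is a toy model in Python. The field names are simplified for illustration (real Iceberg metadata files are JSON, manifest lists and manifests are Avro):&lt;/p&gt;

```python
# Toy model of the metadata-layer chain:
# metadata file -> snapshot -> manifest list -> manifest files -> data files.
manifest_1 = {
    "data_files": [
        {"path": "data/a.parquet", "row_count": 1, "lower_bound": 1, "upper_bound": 1},
        {"path": "data/b.parquet", "row_count": 1, "lower_bound": 2, "upper_bound": 2},
    ]
}
manifest_list_s1 = {"manifests": [manifest_1]}  # one manifest list per snapshot
metadata_file = {
    "schema": {"id": "long", "data": "string"},
    "snapshots": [{"id": "S1", "manifest_list": manifest_list_s1}],
}

def data_files_of(metadata, snapshot_id):
    """Walk the chain from a snapshot down to its data file paths."""
    snapshot = next(s for s in metadata["snapshots"] if s["id"] == snapshot_id)
    return [f["path"]
            for m in snapshot["manifest_list"]["manifests"]
            for f in m["data_files"]]

print(data_files_of(metadata_file, "S1"))  # ['data/a.parquet', 'data/b.parquet']
```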

&lt;h2&gt;
  
  
  Data Layer
&lt;/h2&gt;

&lt;p&gt;The data layer is a bunch of Parquet files which contain all the historical data, including newly added, updated and deleted records. A subset of these data files composes one snapshot version.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QbKi_87D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rptcvz0mud5nlxlt9z9u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QbKi_87D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rptcvz0mud5nlxlt9z9u.png" alt="Image description" width="880" height="822"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram above shows the architecture of Iceberg and also demonstrates what we do in this tutorial:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S1 means the version after we insert 3 records&lt;/li&gt;
&lt;li&gt;S2 means the version after we update one record&lt;/li&gt;
&lt;li&gt;S3 means the version after we delete one record&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Preparation
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Download jq and avro tools jar
&lt;/h2&gt;

&lt;p&gt;jq is used to display JSON, and the avro-tools jar is used to read the Iceberg metadata files (Avro format) and display them in plain text.&lt;/p&gt;
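&lt;p&gt;For example, the tools can be fetched and used roughly like this. The avro-tools version here is an assumption (any recent release should work), and &lt;code&gt;${manifest_file}&lt;/code&gt; stands for one of the Avro files shown later:&lt;/p&gt;

```
# install jq (e.g. apt-get install jq, or brew install jq on macOS)
# download the avro-tools jar from Maven Central
wget https://repo1.maven.org/maven2/org/apache/avro/avro-tools/1.10.2/avro-tools-1.10.2.jar

# pretty-print a JSON metadata file
cat v2.metadata.json | jq .
# dump an Avro manifest (list) file as JSON
java -jar avro-tools-1.10.2.jar tojson ${manifest_file}.avro | jq .
```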

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8wWwVyhs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w2wc60t7znq17ak3grpe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8wWwVyhs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w2wc60t7znq17ak3grpe.png" alt="Image description" width="880" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Configure Spark
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DGIaXAxQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q4dvpr6m08pkne5jmakb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DGIaXAxQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q4dvpr6m08pkne5jmakb.png" alt="Image description" width="880" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;%spark.conf&lt;/code&gt; is a special interpreter to configure the Spark interpreter in Zeppelin. Here I configure the Spark interpreter as described in the Iceberg quick start. Besides that, I specify the warehouse folder spark.sql.catalog.local.warehouse explicitly so that I can check the table folder easily later in this tutorial. Now let’s start to use Spark and play with Iceberg in Zeppelin.&lt;/p&gt;
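&lt;p&gt;The configuration in the screenshot follows the Iceberg Spark quick start and looks roughly like the following. The runtime jar version is an assumption; pick the one matching Spark 3.2.1:&lt;/p&gt;

```
%spark.conf

spark.jars.packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.2
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.local org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.local.type hadoop
spark.sql.catalog.local.warehouse /tmp/warehouse
```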

&lt;h1&gt;
  
  
  Create Iceberg Table
&lt;/h1&gt;

&lt;p&gt;First, let’s create an Iceberg table events with 2 fields: id and data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YS7QSk16--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5zbhlxvr76i8lhxszsjy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YS7QSk16--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5zbhlxvr76i8lhxszsjy.png" alt="Image description" width="880" height="119"&gt;&lt;/a&gt;&lt;/p&gt;
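&lt;p&gt;The create statement in the screenshot is along these lines (following the Iceberg quick start; local is the catalog configured in %spark.conf above, and db.events matches the table folder we inspect below):&lt;/p&gt;

```sql
%spark.sql

CREATE TABLE local.db.events (id bigint, data string) USING iceberg;
```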

&lt;p&gt;Then describe this table to check its details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xsB4Krl3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ph85a77fmia09br4622.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xsB4Krl3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ph85a77fmia09br4622.png" alt="Image description" width="880" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Check Table Folder
&lt;/h2&gt;

&lt;p&gt;So what does Iceberg do underneath for this create SQL statement? Actually, Iceberg did 2 things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a directory events under the warehouse folder /tmp/warehouse&lt;/li&gt;
&lt;li&gt;Add a metadata folder which contains all the metadata info&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since this is a newly created table, there is no data in it yet. There is only one metadata folder under the table folder (/tmp/warehouse/db/events), and there are 2 files under this folder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;version-hint.text. This file contains only one number, which points to the current metadata file v[n].metadata.json.&lt;/li&gt;
&lt;li&gt;v1.metadata.json. This file contains the metadata of this table, such as the schema, location and snapshots. For now, this table has no data, so there are no snapshots in this metadata file.
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mYJP5aU2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8m57ihpcz2n0ch41vtc4.png" alt="Image description" width="880" height="531"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Insert 3 Records (S1)
&lt;/h1&gt;

&lt;p&gt;Now let’s insert 3 new records (1, &lt;code&gt;a&lt;/code&gt;), (2, &lt;code&gt;b&lt;/code&gt;), (3, &lt;code&gt;c&lt;/code&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--elpWUbqe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h2ltr4ffi5wzuevgaoim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--elpWUbqe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h2ltr4ffi5wzuevgaoim.png" alt="Image description" width="880" height="110"&gt;&lt;/a&gt;&lt;/p&gt;
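&lt;p&gt;The insert statement is along these lines (assuming the table name local.db.events from the create step):&lt;/p&gt;

```sql
%spark.sql

INSERT INTO local.db.events VALUES (1, 'a'), (2, 'b'), (3, 'c');
```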

&lt;p&gt;Then use a select statement to verify the result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yRHV55io--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nsfk7ygoylqoqx9889iv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yRHV55io--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nsfk7ygoylqoqx9889iv.png" alt="Image description" width="880" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Check Table Folder
&lt;/h2&gt;

&lt;p&gt;Actually, 2 things happened underneath for this insert operation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the data folder, 3 parquet files are created, one record per parquet file.&lt;/li&gt;
&lt;li&gt;In the metadata folder, the content of version-hint.text is changed to 2, and v2.metadata.json is created with one newly created snapshot which points to one manifest list file. This manifest list file points to one manifest file which points to the 3 parquet files.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xJX0GaRU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ybwvqo6wpqvc4ntnn9ql.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xJX0GaRU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ybwvqo6wpqvc4ntnn9ql.png" alt="Image description" width="880" height="665"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can use the avro-tools jar to read the manifest list file, which is in Avro format. We find that it stores the location of the manifest file and other meta info such as added_data_files_count and deleted_data_files_count.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SgsIUZ2h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j2kqh0ifg6wxhcotu5ac.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SgsIUZ2h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j2kqh0ifg6wxhcotu5ac.png" alt="Image description" width="880" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then use the avro-tools jar to read the manifest file, which contains the paths of the data files and other related meta info.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mMW-vm5t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jbunghwr0874xdiuzggs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mMW-vm5t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jbunghwr0874xdiuzggs.png" alt="Image description" width="880" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can use the Spark API to read the raw Parquet data files, and we find there is one record in each parquet file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TI39FSR8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/meh1d1jm876nxvf0kmnr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TI39FSR8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/meh1d1jm876nxvf0kmnr.png" alt="Image description" width="880" height="630"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Update Record (S2)
&lt;/h1&gt;

&lt;p&gt;Now, let’s use an update statement to update one record.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mW_yBw3Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/obi6cvjnvy5veegtlowg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mW_yBw3Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/obi6cvjnvy5veegtlowg.png" alt="Image description" width="880" height="95"&gt;&lt;/a&gt;&lt;/p&gt;
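&lt;p&gt;The update statement is roughly as follows (assuming the table name local.db.events and the record values used in this tutorial):&lt;/p&gt;

```sql
%spark.sql

UPDATE local.db.events SET data = 'c_updated' WHERE id = 3;
```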

&lt;h2&gt;
  
  
  Check result after update
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--D8RZDVgM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gz2du6r7adl9ivycom8k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D8RZDVgM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gz2du6r7adl9ivycom8k.png" alt="Image description" width="880" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Check Table Folder
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;In the data folder, the existing parquet files are not changed, but one new parquet file (3, &lt;code&gt;c_updated&lt;/code&gt;) is generated.&lt;/li&gt;
&lt;li&gt;In the metadata folder, the content of version-hint.text is changed to 3, and v3.metadata.json is created with 2 snapshots. One is the snapshot from the step above; the other is a new snapshot with a new manifest list file.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Fl28Ckd---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r97lj68gv5ryev76tzer.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Fl28Ckd---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r97lj68gv5ryev76tzer.png" alt="Image description" width="880" height="652"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might be curious to know how Iceberg implements the update operation without changing existing data. The magic happens in the Iceberg metadata layer. If you check this version’s metadata file, you will find that it now contains 2 snapshots, and each snapshot is associated with one manifest list file. The first snapshot is the same as above, while the second snapshot is associated with a new manifest list file. In this manifest list file, there are 2 manifest files.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--C7vQR-DN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sf51hbn1in9g3l036qmy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--C7vQR-DN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sf51hbn1in9g3l036qmy.png" alt="Image description" width="880" height="194"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first manifest file points to the newly added data file (3, &lt;code&gt;c_updated&lt;/code&gt;). In the second manifest file, you will find that it still contains the 3 data files holding (1, &lt;code&gt;a&lt;/code&gt;), (2, &lt;code&gt;b&lt;/code&gt;), (3, &lt;code&gt;c&lt;/code&gt;), but the status of the third data file (3, &lt;code&gt;c&lt;/code&gt;) is 2, which means this data file is deleted, so when Iceberg reads this version of the table, it skips this data file. So only (1, &lt;code&gt;a&lt;/code&gt;) and (2, &lt;code&gt;b&lt;/code&gt;) will be read from this manifest file.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oghflEiU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7arhxjxt7z83ht4wugkg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oghflEiU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7arhxjxt7z83ht4wugkg.png" alt="Image description" width="880" height="589"&gt;&lt;/a&gt;&lt;/p&gt;
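&lt;p&gt;The skip-on-read logic can be sketched in a few lines of Python. The status values and records mirror the description above; this is an illustration, not Iceberg’s implementation:&lt;/p&gt;

```python
# Each manifest entry carries a status; entries with status 2 (deleted)
# are skipped when reading a snapshot.
EXISTING, ADDED, DELETED = 0, 1, 2

# Manifest files of snapshot S2 (after updating record (3, 'c')).
s2_manifests = [
    [((3, "c_updated"), ADDED)],                   # new data file
    [((1, "a"), EXISTING), ((2, "b"), EXISTING),
     ((3, "c"), DELETED)],                         # old files, one marked deleted
]

def read_snapshot(manifests):
    """Return live records, skipping data files whose status is DELETED."""
    return [rec for manifest in manifests
            for rec, status in manifest if status != DELETED]

print(sorted(read_snapshot(s2_manifests)))
# [(1, 'a'), (2, 'b'), (3, 'c_updated')]
```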

&lt;h1&gt;
  
  
  Delete Record (S3)
&lt;/h1&gt;

&lt;p&gt;Now, let’s delete record (2, &lt;code&gt;b&lt;/code&gt;)&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8AaA-epx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hsnttr4l6650pakn7h7b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8AaA-epx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hsnttr4l6650pakn7h7b.png" alt="Image description" width="880" height="105"&gt;&lt;/a&gt;&lt;/p&gt;
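&lt;p&gt;The delete statement is roughly as follows (assuming the table name local.db.events):&lt;/p&gt;

```sql
%spark.sql

DELETE FROM local.db.events WHERE id = 2;
```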

&lt;p&gt;Use a select statement to verify the result.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L6xt-Uli--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0ut72hd8mzu9xpgf1hlk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L6xt-Uli--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0ut72hd8mzu9xpgf1hlk.png" alt="Image description" width="880" height="219"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Check Table Folder
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;In the data folder, nothing changed.&lt;/li&gt;
&lt;li&gt;In the metadata folder, the content of version-hint.text is changed to 4, and v4.metadata.json is created with one more snapshot (3 snapshots in total).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The manifest list file associated with the new snapshot contains 2 manifest files.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---zG5hOdw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3pmzgxc4omxlavy7rgwt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---zG5hOdw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3pmzgxc4omxlavy7rgwt.png" alt="Image description" width="880" height="707"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sW9sWYyL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f0h6ylimzot0yxn29tx4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sW9sWYyL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f0h6ylimzot0yxn29tx4.png" alt="Image description" width="880" height="195"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first manifest file points to 1 data file (3, &lt;code&gt;c_updated&lt;/code&gt;), and the second manifest file points to the data files (1, &lt;code&gt;a&lt;/code&gt;) and (2, &lt;code&gt;b&lt;/code&gt;). But the status of data file (2, &lt;code&gt;b&lt;/code&gt;) is 2, which means it has been deleted, so when Iceberg reads this version of the table, it just skips this data file. So only (1, &lt;code&gt;a&lt;/code&gt;) will be read from this manifest file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z-i3epG8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/avu1em96oen6dz7nvyat.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z-i3epG8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/avu1em96oen6dz7nvyat.png" alt="Image description" width="880" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Use the Spark API to read these data files.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rjI1YzDB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wvxqkhjqv4vj3pvtcznr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rjI1YzDB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wvxqkhjqv4vj3pvtcznr.png" alt="Image description" width="880" height="693"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Inspect Metadata
&lt;/h1&gt;

&lt;p&gt;You can also read metadata tables to inspect a table’s history, snapshots, and other metadata.&lt;/p&gt;
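&lt;p&gt;In Spark SQL, Iceberg exposes these as metadata tables addressed under the table name, e.g. (assuming the table local.db.events):&lt;/p&gt;

```sql
%spark.sql

SELECT * FROM local.db.events.history;
SELECT * FROM local.db.events.snapshots;
SELECT * FROM local.db.events.manifests;
SELECT * FROM local.db.events.files;
```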

&lt;h2&gt;
  
  
  Inspect history metadata
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CXh4_aLb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y46mb2r9hf5yqi47f0qd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CXh4_aLb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y46mb2r9hf5yqi47f0qd.png" alt="Image description" width="880" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Inspect snapshot metadata
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rcfFNAIu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wphfa6zezlo69izngsty.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rcfFNAIu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wphfa6zezlo69izngsty.png" alt="Image description" width="880" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Inspect manifest metadata
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_8HMuXmn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6pjwapnbi20bawwzpwem.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_8HMuXmn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6pjwapnbi20bawwzpwem.png" alt="Image description" width="880" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Inspect file meta table
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hArXzn4z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fni1tlcbbhmtn8kxgx90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hArXzn4z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fni1tlcbbhmtn8kxgx90.png" alt="Image description" width="880" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;In this article, I go through 4 main steps to play with Apache Iceberg:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create Table&lt;/li&gt;
&lt;li&gt;Insert Data&lt;/li&gt;
&lt;li&gt;Update Data&lt;/li&gt;
&lt;li&gt;Delete Data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At each step, I check what has changed under the table folder. All the steps are done in the Apache Zeppelin docker container, so you can reproduce them easily. Just one thing to remember: because the file names (snapshot file, manifest file, parquet file) are randomly generated, you need to update the code to use the correct file names. I hope this article is useful for you to understand the internal mechanism of Apache Iceberg.&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://zeppelin.apache.org/docs/0.10.1/interpreter/spark.html"&gt;https://zeppelin.apache.org/docs/0.10.1/interpreter/spark.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberg.apache.org/"&gt;https://iceberg.apache.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.dremio.com/resources/guides/apache-iceberg-an-architectural-look-under-the-covers/"&gt;https://www.dremio.com/resources/guides/apache-iceberg-an-architectural-look-under-the-covers/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.starburst.io/blog/trino-on-ice-iv-deep-dive-into-iceberg-internals/"&gt;https://www.starburst.io/blog/trino-on-ice-iv-deep-dive-into-iceberg-internals/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>apachezeppelin</category>
      <category>apacheiceberg</category>
      <category>spark</category>
    </item>
    <item>
      <title>Deep Dive into Delta Lake via Apache Zeppelin</title>
      <dc:creator>Jeff Zhang</dc:creator>
      <pubDate>Mon, 29 Nov 2021 15:28:37 +0000</pubDate>
      <link>https://dev.to/zjffdu/deep-dive-into-delta-lake-via-apache-zeppelin-3077</link>
      <guid>https://dev.to/zjffdu/deep-dive-into-delta-lake-via-apache-zeppelin-3077</guid>
      <description>&lt;p&gt;Delta Lake is an open source project that enables building a Lakehouse architecture on top of data lakes. There’re a lot of tutorials on internet about how to use Delta Lake. This post is a little different, it is for those people who are curious to know the internal mechanism of Delta Lake, especially the transaction log.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start Zeppelin Docker Container
&lt;/h2&gt;

&lt;p&gt;To demonstrate the internal mechanism more intuitively, I use Apache Zeppelin to run all the example code. You can reproduce what I did easily via the Zeppelin docker image. You can check this article for how to play with Spark in Zeppelin docker. Here I just summarize it as the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step 1. git clone &lt;a href="https://github.com/zjffdu/zeppelin-notebook.git"&gt;https://github.com/zjffdu/zeppelin-notebook.git&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Step 2. Download Spark 3.1.2 (This is what I used in this tutorial, don’t use Spark 3.2.0, it is not supported yet)&lt;/li&gt;
&lt;li&gt;Step 3. Run the following command to start Zeppelin docker container. &lt;code&gt;${zeppelin_notebook}&lt;/code&gt; is the notebook folder you cloned in Step 1, &lt;code&gt;${spark_location}&lt;/code&gt; is the Spark folder you downloaded in Step 2.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -u $(id -u) -p 8080:8080 -p 4040:4040 --rm \
-v ${spark_location}:/opt/spark \
-v ${zeppelin_notebook}:/opt/notebook \
-e ZEPPELIN_NOTEBOOK_DIR=/opt/notebook \
-e SPARK_HOME=/opt/spark \
-e ZEPPELIN_LOCAL_IP=0.0.0.0 \
--name zeppelin apache/zeppelin:0.10.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open &lt;a href="http://localhost:8080"&gt;http://localhost:8080&lt;/a&gt;, and open the notebook Spark/Deep Dive into Delta Lake, which contains all the code in this article.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8Y-f3-Zp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7m22d7jttbvmceioft9x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8Y-f3-Zp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7m22d7jttbvmceioft9x.png" alt="Image description" width="880" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Configure Spark
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UG1bhApi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/goxlihzohywu0sdcaa1o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UG1bhApi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/goxlihzohywu0sdcaa1o.png" alt="Image description" width="880" height="151"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;This is the first paragraph of Deep Dive into Delta Lake, which configures the Spark interpreter to use Delta Lake.&lt;br&gt;
&lt;code&gt;%spark.conf&lt;/code&gt; is a special interpreter to configure the Spark interpreter in Zeppelin. Here I configure the Spark interpreter as described in the Delta Lake quick start. Besides that, I specify spark.sql.warehouse.dir explicitly so that I can check the table folder easily later in this tutorial. Now let’s start to use Spark and play with Delta Lake in Zeppelin.&lt;/p&gt;
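&lt;p&gt;The configuration in the screenshot follows the Delta Lake quick start and looks roughly like this. The delta-core version is an assumption; pick one compatible with Spark 3.1.2:&lt;/p&gt;

```
%spark.conf

spark.jars.packages io.delta:delta-core_2.12:1.0.0
spark.sql.extensions io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.warehouse.dir /tmp/warehouse
```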

&lt;h2&gt;
  
  
  Create Delta Table
&lt;/h2&gt;

&lt;p&gt;First, let’s create a Delta table events with 2 fields: id and data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3uCRLLQl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hyzw1z602jvvucx97jzp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3uCRLLQl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hyzw1z602jvvucx97jzp.png" alt="Image description" width="880" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So what does Delta do underneath for this create SQL statement? It does 2 things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates a directory events under the warehouse folder /tmp/warehouse&lt;/li&gt;
&lt;li&gt;Adds a transaction log which contains the schema of this table
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PJ-4ALlq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1x1gbukgopdytjchqgxx.png" alt="Image description" width="880" height="165"&gt;
&lt;/li&gt;
&lt;/ul&gt;
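&lt;p&gt;The create statement in the screenshot is plain Spark SQL; a minimal sketch (the column types are my assumption, since the text only names the fields id and data) looks like:&lt;/p&gt;

```
%spark.sql

CREATE TABLE events (
  id INT,
  data STRING
) USING delta;
```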

&lt;h2&gt;
  
  
  Insert data
&lt;/h2&gt;

&lt;p&gt;Now let’s insert some data into this Delta table. Here I insert only 2 records: (1, data_1) and (2, data_2).&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Cn119bBR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a3la09yzoybsmik096hk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Cn119bBR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a3la09yzoybsmik096hk.png" alt="Image description" width="880" height="98"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then let’s run a select SQL statement to verify the result of this insert statement.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YnMSreqH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/efirixl3e4slo1enzeuy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YnMSreqH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/efirixl3e4slo1enzeuy.png" alt="Image description" width="880" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So what does Delta do underneath for this insert SQL statement? Checking the table folder /tmp/warehouse/events, we find 2 changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Another new transaction log file is generated.&lt;/li&gt;
&lt;li&gt;2 parquet files are generated
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uBs68_YT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bfa9ehyvuwtbk0ic5ldq.png" alt="Image description" width="880" height="207"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First, let’s take a look at the newly generated transaction file (00000000000000000001.json). This JSON file is quite readable: it records the operations of this insert SQL statement, namely adding the 2 parquet files which contain the 2 records. Notice that there’s no table schema info in this new transaction log file, because it is already in the first transaction log file (00000000000000000000.json). When Delta reads the table, it merges all the historical transaction files to assemble the full state of the table (including the schema of this table and which data files it contains).&lt;br&gt;
Since we only inserted 2 records, it is natural to guess that each parquet file contains one record. We can read these 2 parquet files directly to verify that; as the following code shows, our guess is correct.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4zyBmWoa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xp099s0wo1lbl7jah5xp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4zyBmWoa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xp099s0wo1lbl7jah5xp.png" alt="Image description" width="880" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Update Data
&lt;/h2&gt;

&lt;p&gt;The most important feature of Delta is its ACID support: you can update the table at any time without affecting others who are reading or writing the same table simultaneously. Now let’s update this events table.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AKrBXhme--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/clee3zqs4ufd01ijuzq8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AKrBXhme--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/clee3zqs4ufd01ijuzq8.png" alt="Image description" width="880" height="98"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then run a select statement to verify the result of this update statement.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4d7E-wGD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zxjg3ay40109mayogidh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4d7E-wGD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zxjg3ay40109mayogidh.png" alt="Image description" width="880" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So what does this update statement do underneath? Checking the events table folder, we find 2 changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Another new transaction log file is generated&lt;/li&gt;
&lt;li&gt;Another parquet file is added (the previous 2 parquet files are still there)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZlsGcU8L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dlaoffx45h7tgu91rami.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZlsGcU8L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dlaoffx45h7tgu91rami.png" alt="Image description" width="880" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, let’s take a look at the new transaction log file’s content; there are 2 operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove one parquet file&lt;/li&gt;
&lt;li&gt;Add a new parquet file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is natural for us to guess that the removed file contains the record (2, data_2), while the newly added file contains the record (2, data_2_updated). Let’s read these 2 parquet files directly to verify our guess.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y8MNWlt7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5gxvxlc0sd47rpoaehp0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y8MNWlt7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5gxvxlc0sd47rpoaehp0.png" alt="Image description" width="880" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let’s use the time travel feature of Delta. We would like to query the last version of this table, from before this update operation.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E8Sjb5oI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bzqk71qxwtsq399dlmws.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E8Sjb5oI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bzqk71qxwtsq399dlmws.png" alt="Image description" width="880" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The time travel feature works because Delta never deletes the data files; it only records all the operations in the transaction logs. When you read version 1 of this table, Delta Lake reads only the first 2 transaction logs: 00000000000000000000.json &amp;amp; 00000000000000000001.json.&lt;/p&gt;
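&lt;p&gt;The replay logic behind reading and time traveling a Delta table can be sketched in a few lines of plain Python. The log entries below are simplified stand-ins for the real JSON actions (real entries carry many more fields, such as timestamps, sizes and stats), but the add/remove mechanics are the same:&lt;/p&gt;

```python
# Simplified stand-ins for the actions recorded in the transaction log files
# 00000000000000000000.json, 00000000000000000001.json, ... of this tutorial.
commits = [
    [{"metaData": {"schemaString": "id, data"}}],              # version 0: create table
    [{"add": {"path": "part-00000.parquet"}},                  # version 1: insert
     {"add": {"path": "part-00001.parquet"}}],
    [{"remove": {"path": "part-00001.parquet"}},               # version 2: update
     {"add": {"path": "part-00002.parquet"}}],
]

def active_files(commits, version=None):
    """Replay the transaction logs up to `version` and return the live data files."""
    if version is None:
        version = len(commits) - 1          # latest version by default
    files = set()
    for commit in commits[:version + 1]:
        for action in commit:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

print(sorted(active_files(commits)))             # data files after the update
print(sorted(active_files(commits, version=1)))  # time travel: files before the update
```

&lt;p&gt;Reading version 1 replays only the first 2 logs, which is exactly why the old parquet files must stay on disk.&lt;/p&gt;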

&lt;h2&gt;
  
  
  Delete Data
&lt;/h2&gt;

&lt;p&gt;Now let’s do the delete operation on this events table.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DwoV-grU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ab33b2tgdn9710imedko.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DwoV-grU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ab33b2tgdn9710imedko.png" alt="Image description" width="880" height="90"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And then run a select statement to verify the result of this delete statement.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Vec1Os5a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dnqklp3kvqg7c3yeulu2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Vec1Os5a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dnqklp3kvqg7c3yeulu2.png" alt="Image description" width="880" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So what does Delta do for this delete operation underneath? We can again check the events table folder, where we find 2 changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A new metadata transaction log file is generated&lt;/li&gt;
&lt;li&gt;A new parquet file is added
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F3o0MWI0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qjy4wlpt71h7bjxuynxf.png" alt="Image description" width="880" height="274"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the new transaction log file we still see 2 operations: remove and add.&lt;br&gt;
It is natural to guess that the remove operation removes the file containing record (1, data_1), but then what does the new add operation do? Actually, the newly added parquet file is empty and contains nothing; we can read these 2 parquet files directly to verify that.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rQO-hcF6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/evlynvczc88swab92w4w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rQO-hcF6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/evlynvczc88swab92w4w.png" alt="Image description" width="880" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this article, I walk through 4 main operations on Delta Lake:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create Table&lt;/li&gt;
&lt;li&gt;Insert Data&lt;/li&gt;
&lt;li&gt;Update Data&lt;/li&gt;
&lt;li&gt;Delete Data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At each step, I check what changes in the transaction log and data files. All the steps are done in the Apache Zeppelin docker container, so you can reproduce them easily. I hope this article helps you understand the internal mechanism of Delta Lake.&lt;/p&gt;

</description>
      <category>deltalake</category>
      <category>apachespark</category>
      <category>apachezeppelin</category>
    </item>
    <item>
      <title>Use customized and isolated python environment in Apache Zeppelin notebook</title>
      <dc:creator>Jeff Zhang</dc:creator>
      <pubDate>Sat, 10 Jul 2021 07:58:03 +0000</pubDate>
      <link>https://dev.to/zjffdu/use-customized-and-isolated-python-environment-in-apache-zeppelin-notebook-k4i</link>
      <guid>https://dev.to/zjffdu/use-customized-and-isolated-python-environment-in-apache-zeppelin-notebook-k4i</guid>
      <description>&lt;p&gt;Apache Zeppelin notebook is web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python and more.&lt;/p&gt;

&lt;p&gt;For Python developers, a customized and isolated Python runtime environment is an indispensable requirement. You and your colleagues may want to use different versions of Python and different Python packages without affecting each other’s environment. In this article, I’d like to show you how to use a customized and isolated Python environment in a Hadoop YARN cluster. How to achieve this for PySpark, I will leave to another article. (All the features in this article were implemented in JIRA ZEPPELIN-5330.) You can reproduce all the steps by downloading this note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://zeppelin-notebook.com/#/notebook/2G7RDR415" rel="noopener noreferrer"&gt;Python Conda Env in Yarn Mode&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1. Create your customized conda env
&lt;/h2&gt;

&lt;p&gt;First, let’s create a YAML file which defines the Python conda env, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;env name&lt;/li&gt;
&lt;li&gt;channels used to install packages&lt;/li&gt;
&lt;li&gt;python version&lt;/li&gt;
&lt;li&gt;other third party python packages
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: python_env
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.7 
  - pycodestyle
  - numpy
  - pandas
  - scipy
  - grpcio
  - protobuf
  - pandasql
  - ipython
  - ipykernel
  - jupyter_client
  - panel
  - pyyaml
  - seaborn
  - plotnine
  - hvplot
  - intake
  - intake-parquet
  - intake-xarray
  - altair
  - vega_datasets
  - pyarrow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run the following commands to create the conda env tar and upload it to HDFS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda pack -n python_env
hadoop fs -put python_env.tar.gz /tmp
# The python conda tar should be publicly accessible, so need to change permission here.
hadoop fs -chmod 644 /tmp/python_env.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2. Configure Python Interpreter
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%python.conf
# set zeppelin.interpreter.launcher to be yarn, so that python interpreter run in yarn container, 
# otherwise python interpreter run as local process in the zeppelin server host.
zeppelin.interpreter.launcher yarn
# zeppelin.yarn.dist.archives can be either local file or hdfs file
zeppelin.yarn.dist.archives hdfs:///tmp/python_env.tar.gz#environment
# conda environment name, aka the folder name in the working directory of yarn container
zeppelin.interpreter.conda.env.name environment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3. Run Python Interpreter in this customized conda env
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%python

%matplotlib inline

import matplotlib.pyplot as plt
plt.plot([1,2,3,4])
plt.ylabel('some numbers')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AqYW8yu8yA-8xznBlmDlg1w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AqYW8yu8yA-8xznBlmDlg1w.png" alt="Matplotlib"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This feature was not yet released when this article was published; you can build the Zeppelin master branch yourself and import this note to try it. If you have any questions, you can ask on the Zeppelin user mailing list or Slack channel (&lt;a href="http://zeppelin.apache.org/community.html" rel="noopener noreferrer"&gt;http://zeppelin.apache.org/community.html&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://zeppelin.apache.org/" rel="noopener noreferrer"&gt;Apache Zeppelin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zjffdu.gitbook.io/flink-on-zeppelin/" rel="noopener noreferrer"&gt;Flink on Zeppelin docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://zeppelin-notebook.com/" rel="noopener noreferrer"&gt;Zeppelin notebooks website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/zjffdu/zeppelin-notebook" rel="noopener noreferrer"&gt;Zeppelin notebooks git repo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>notebook</category>
      <category>apachezeppelin</category>
      <category>conda</category>
    </item>
    <item>
      <title>How to use IPython in Apache Zeppelin Notebook</title>
      <dc:creator>Jeff Zhang</dc:creator>
      <pubDate>Sat, 10 Jul 2021 07:51:32 +0000</pubDate>
      <link>https://dev.to/zjffdu/how-to-use-ipython-in-apache-zeppelin-notebook-1gi9</link>
      <guid>https://dev.to/zjffdu/how-to-use-ipython-in-apache-zeppelin-notebook-1gi9</guid>
      <description>&lt;p&gt;Apache Zeppelin Notebook is web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python and more.&lt;/p&gt;

&lt;p&gt;In this post, I will talk about how to use IPython in Apache Zeppelin Notebook (although Zeppelin supports vanilla Python, it is strongly recommended to use IPython). This makes the Python development experience in Zeppelin notebook almost the same as in Jupyter notebook. &lt;/p&gt;

&lt;p&gt;All the contents of this post can be found in these 2 example notebooks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://zeppelin-notebook.com/#/notebook/2EYDJKFFY" rel="noopener noreferrer"&gt;IPython Basic Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://zeppelin-notebook.com/#/notebook/2F1S9ZY8Z" rel="noopener noreferrer"&gt;IPython Visualization Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to configure
&lt;/h2&gt;

&lt;p&gt;Enabling IPython in Zeppelin is pretty straightforward. First, you need to install the following 3 Python packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install jupyter
pip install grpcio
pip install protobuf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Besides manually installing these packages, you can also use conda to create a customized python environment which has these installed. You can check this post for more details.&lt;/p&gt;

&lt;p&gt;Then configure the Python interpreter. The most important setting is zeppelin.python, which needs to point to the correct Python executable in case you have multiple Pythons installed on your machine.&lt;/p&gt;
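&lt;p&gt;For example, you can set it in a &lt;code&gt;%python.conf&lt;/code&gt; paragraph (the path below is illustrative; point it at your own interpreter):&lt;/p&gt;

```
%python.conf

zeppelin.python /usr/local/anaconda3/bin/python
```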

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2A0j2LwKBOi-pSlKHrf7a0XA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2A0j2LwKBOi-pSlKHrf7a0XA.png" alt="IPython Config"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  IPython Features
&lt;/h2&gt;

&lt;p&gt;Now you can experience almost the same Python development experience as in Jupyter notebook. Here’s a list of features that I’d like to highlight.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support for IPython magic functions&lt;/li&gt;
&lt;li&gt;Better code completion&lt;/li&gt;
&lt;li&gt;Rich visualization libraries support&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  IPython magic function
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AxqSuHG09u4pL7kp2VDIjJQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AxqSuHG09u4pL7kp2VDIjJQ.png" alt="Magic function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Code completion
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F0%2AF9ArQnKQ11wkQQWx.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F0%2AF9ArQnKQ11wkQQWx.gif" alt="Code completion"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Rich visualization libraries support
&lt;/h3&gt;

&lt;p&gt;Visualization libraries are a big part of the Python ecosystem. As in Jupyter notebook, you can use most of the popular Python visualization libraries in Zeppelin notebook.&lt;/p&gt;

&lt;p&gt;Here’s a list of examples of how to use popular Python visualization libraries in Zeppelin:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Matplotlib&lt;/li&gt;
&lt;li&gt;Pandas&lt;/li&gt;
&lt;li&gt;Seaborn&lt;/li&gt;
&lt;li&gt;Plotnine&lt;/li&gt;
&lt;li&gt;Bokeh&lt;/li&gt;
&lt;li&gt;Holoviews&lt;/li&gt;
&lt;li&gt;Altair&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2Ay7eDPkuhnifoeI-EKLQ1Gw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2Ay7eDPkuhnifoeI-EKLQ1Gw.png" alt="Matplotlib"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AoK7m8fK0XOEFkudju9Qtsg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AoK7m8fK0XOEFkudju9Qtsg.png" alt="Pandas"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2ARztBq4Svc_zb9iBP0S_SIg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2ARztBq4Svc_zb9iBP0S_SIg.png" alt="Seaborn"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AtGdVKQF_NX-w0DvX-JSzGA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AtGdVKQF_NX-w0DvX-JSzGA.png" alt="Plotnine"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2A9zinRcwcKseFrhzGIQxBxQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2A9zinRcwcKseFrhzGIQxBxQ.png" alt="Bokeh"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2ASn0XaBOURKIEEDNE9DtQWw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2ASn0XaBOURKIEEDNE9DtQWw.png" alt="Holoviews"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2ArLJLq4tWRv1ICGK5ugHGqA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2ArLJLq4tWRv1ICGK5ugHGqA.png" alt="Altair"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This is just a brief introduction of how to use IPython in Zeppelin notebook. If you have any question, you can ask in zeppelin user mail list or slack channel (&lt;a href="http://zeppelin.apache.org/community.html" rel="noopener noreferrer"&gt;http://zeppelin.apache.org/community.html&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://zeppelin.apache.org/" rel="noopener noreferrer"&gt;Apache Zeppelin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zjffdu.gitbook.io/flink-on-zeppelin/" rel="noopener noreferrer"&gt;Flink on Zeppelin gitbook&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://zeppelin-notebook.com/" rel="noopener noreferrer"&gt;Zeppelin notebooks website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/zjffdu/zeppelin-notebook" rel="noopener noreferrer"&gt;Zeppelin notebooks git repo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>apachezeppelin</category>
      <category>ipython</category>
      <category>python</category>
      <category>notebook</category>
    </item>
    <item>
      <title>Learn Flink SQL — The Easy Way </title>
      <dc:creator>Jeff Zhang</dc:creator>
      <pubDate>Wed, 07 Jul 2021 08:45:17 +0000</pubDate>
      <link>https://dev.to/zjffdu/learn-flink-sql-the-easy-way-11ei</link>
      <guid>https://dev.to/zjffdu/learn-flink-sql-the-easy-way-11ei</guid>
      <description>&lt;p&gt;Flink is almost the de facto standard streaming engine today. Flink SQL is the recommended approach to use Flink. But streaming sql is not the same as the traditional batch sql, you have to learn many new concepts, such as watermark, event time, different kinds of streaming joins and etc. To be honest all of these are not easy to learn.&lt;/p&gt;

&lt;p&gt;Today I’d like to introduce a new (easy) way to learn Flink SQL: the Flink SQL Cookbook on Zeppelin. In Zeppelin you can run Flink SQL interactively, as follows:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jrIv6ti_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2qs52ri7nwfi7m1kag3k.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jrIv6ti_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2qs52ri7nwfi7m1kag3k.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All the examples in this post can be found here.&lt;br&gt;
&lt;a href="http://zeppelin-notebook.com/"&gt;http://zeppelin-notebook.com/&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Prepare environment
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Step 1
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/zjffdu/flink-sql-cookbook-on-zeppelin.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This repo has all the Zeppelin notebooks, which include the examples in &lt;code&gt;flink-sql-cookbook&lt;/code&gt;. Thanks to Ververica for the great examples; I just migrated them to Zeppelin.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2
&lt;/h3&gt;

&lt;p&gt;Download Flink 1.13.1 and untar it. (I haven’t tried other versions, but it should work for all Flink versions after 1.10.)&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 3
&lt;/h3&gt;

&lt;p&gt;Build &lt;a href="https://github.com/knaufk/flink-faker/"&gt;flink faker&lt;/a&gt; and copy flink-faker-0.3.0.jar to the lib folder of Flink. This is a custom Flink table source used to generate sample data.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 4
&lt;/h3&gt;

&lt;p&gt;Run the following command to start Zeppelin&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -u $(id -u) -p 8081:8081 -p 8080:8080 --rm -v $PWD/logs:/logs -v /mnt/disk1/jzhang/flink-sql-cookbook-on-zeppelin:/notebook -v /mnt/disk1/jzhang/flink-1.13.1:/opt/flink -e FLINK_HOME=/opt/flink -e ZEPPELIN_LOG_DIR='/logs' -e ZEPPELIN_NOTEBOOK_DIR='/notebook' --name zeppelin apache/zeppelin:0.10.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are 2 folders you need to replace with your own:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;/mnt/disk1/jzhang/flink-sql-cookbook-on-zeppelin （This is the repo folder of step 1）&lt;/li&gt;
&lt;li&gt;/mnt/disk1/jzhang/flink-1.13.1 （This is the flink folder of step 2）&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try examples of Flink Sql Cookbook
&lt;/h2&gt;

&lt;p&gt;Now the environment is ready, and you can start your Flink SQL journey by opening &lt;a href="http://localhost:8080"&gt;http://localhost:8080&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rOtKsez9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1540/1%2A7A4smZAY6fhETEF1DfG3Bw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rOtKsez9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1540/1%2A7A4smZAY6fhETEF1DfG3Bw.png" alt="ZEPPELIN UI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the Zeppelin home page; there’s already a folder called Flink Sql Cookbook which includes all the examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example 1： Filtering Data
&lt;/h2&gt;

&lt;p&gt;Now let’s take a look at the first example: &lt;code&gt;Foundations/04 Filtering Data&lt;/code&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eE2VDxkS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1540/1%2AE1bqSmQAUDuF4Q0_8YpHLw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eE2VDxkS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1540/1%2AE1bqSmQAUDuF4Q0_8YpHLw.png" alt="Example_1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are 2 paragraphs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Paragraph 1 creates the table server_logs via the faker connector.&lt;/li&gt;
&lt;li&gt;Paragraph 2 filters the data via a where clause and then selects the latest 10 records by log_time.&lt;/li&gt;
&lt;/ul&gt;
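&lt;p&gt;Condensed into one snippet, the 2 paragraphs look roughly like the following sketch (the column list and faker expressions are illustrative, not the cookbook’s exact DDL):&lt;/p&gt;

```
%flink.ssql(type=update)

CREATE TABLE server_logs (
    client_ip STRING,
    status_code INT,
    log_time TIMESTAMP(3)
) WITH (
  'connector' = 'faker',
  'fields.client_ip.expression' = '#{Internet.publicIpV4Address}',
  'fields.status_code.expression' = '#{numerify ''50#''}',
  'fields.log_time.expression' = '#{date.past ''15'',''SECONDS''}'
);

SELECT * FROM server_logs
WHERE status_code = 500
ORDER BY log_time DESC
LIMIT 10;
```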

&lt;p&gt;The following is a screenshot of the result. You can see that the result refreshes every 3 seconds. This is the biggest difference between Flink streaming SQL and traditional batch SQL: in the streaming world, new data arrives continuously, so the result is updated continuously.&lt;/p&gt;

&lt;p&gt;Besides that, you can click the &lt;code&gt;FLINK_JOB&lt;/code&gt; link in the top right, which takes you to the Flink UI of this job.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y7xd7lvM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://miro.medium.com/max/1540/1%2AgXzgqhcaJgj-TsV_xOiw0w.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y7xd7lvM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://miro.medium.com/max/1540/1%2AgXzgqhcaJgj-TsV_xOiw0w.gif" alt="ZEPPELIN_FLINK_JOB"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Example 2: Lateral Table Join
&lt;/h2&gt;

&lt;p&gt;Now let's take a look at the second example: lateral table join. This is one of the join types that Flink SQL supports. Beginners are often a little intimidated by it even after learning it from tutorial articles; a real example showing exactly what a lateral table join does makes it much easier to understand. Fortunately, there is one example in this flink-sql-cookbook, and you can run it directly in Zeppelin. Open &lt;code&gt;Joins/06 Lateral Table&lt;/code&gt;, run it, and you will see the following screenshot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RHXeRlNI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://miro.medium.com/max/1540/1%2AxYkL34J2M7dSXjUDESlD1A.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RHXeRlNI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://miro.medium.com/max/1540/1%2AxYkL34J2M7dSXjUDESlD1A.gif" alt="Example_2"&gt;&lt;/a&gt;&lt;/p&gt;
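&lt;p&gt;For readers who want to see the shape of the query before opening the notebook, here is a sketch of a lateral table join; the table and column names are hypothetical, not necessarily those used in the cookbook:&lt;/p&gt;

```sql
-- For each row of the outer table, the LATERAL derived table is re-evaluated
-- with access to that outer row (states.state), returning the top 2 most
-- populous cities per state. Table/column names are illustrative only.
SELECT states.state, city, population
FROM
    (SELECT DISTINCT state FROM CurrentPopulation) AS states,
    LATERAL (
        SELECT city, population
        FROM CurrentPopulation
        WHERE state = states.state
        ORDER BY population DESC
        LIMIT 2
    );
```

A plain join cannot express "top 2 per group" directly, because each branch of a regular join is evaluated independently; the correlated LATERAL subquery is what lets the inner Top-N query reference the outer row.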

&lt;p&gt;I have only shown the above 2 examples here; there are many other examples in this cookbook, as shown below. You can try them by yourself. I hope you enjoy this flink-sql-cookbook-on-zeppelin.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GKJSs_aB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1540/1%2AMqxeon7neQcKnggAdtpowA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GKJSs_aB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1540/1%2AMqxeon7neQcKnggAdtpowA.png" alt="All Examples"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Not only can you use Zeppelin to learn Flink SQL, you can also use Zeppelin as your streaming platform to submit and manage your Flink jobs. &lt;/p&gt;

&lt;p&gt;The Zeppelin community keeps improving and evolving the overall user experience of Flink on Zeppelin; you can join the Zeppelin Slack to discuss with the community: &lt;a href="http://zeppelin.apache.org/community.html#slack-channel"&gt;http://zeppelin.apache.org/community.html#slack-channel&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For more details about Flink on Zeppelin, please refer to the following links.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://zeppelin-notebook.com/#/"&gt;http://zeppelin-notebook.com/#/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://zeppelin.apache.org/docs/0.10.0/interpreter/flink.html"&gt;http://zeppelin.apache.org/docs/0.10.0/interpreter/flink.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=YxPo0Fosjjg&amp;amp;list=PL4oy12nnS7FFtg3KV1iS5vDb0pTz12VcX"&gt;https://www.youtube.com/watch?v=YxPo0Fosjjg&amp;amp;list=PL4oy12nnS7FFtg3KV1iS5vDb0pTz12VcX&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>apacheflink</category>
      <category>apachezeppelin</category>
      <category>notebook</category>
      <category>streaming</category>
    </item>
  </channel>
</rss>
