Abstract
This short article will show how to install Delta Lake with Apache Spark on Deepnote. We'll use the SingleStore Spark Connector to read data from SingleStoreDB and write it into the Delta Lake, then read data from the Delta Lake and write it back into SingleStoreDB.
Introduction
Previously, we explored how to use Apache Iceberg with SingleStoreDB via Spark Dataframes. In this article, we'll focus on Delta Lake and provide details on one possible configuration for using it with SingleStoreDB from a Python notebook environment.
Create a Deepnote account
We'll create a free account on the Deepnote website. Once logged in, we'll create a new Deepnote project to give us a new notebook. We'll also need to create two folders (jars and warehouse).
In the jars folder, we'll store the following files:
- SingleStore JDBC Client
- SingleStore Spark Connector
- Spray JSON
- Apache Commons DBCP
- Apache Commons Pool
The warehouse folder will be used to store our Delta Lake.
Create a SingleStoreDB Cloud account
A previous article showed the steps required to create a free SingleStoreDB Cloud account. We'll use Iris Demo Group as our Workspace Group Name and iris-demo as our Workspace Name.
We'll make a note of our password and host name, as we'll need these once we've created our database in the following steps.
Create a Database and Table
In our SingleStoreDB Cloud account, let's use the SQL Editor to create a new database and table, and populate the table with data. We can use a GitHub Gist for this task.
Deepnote notebook
Let's now start to fill out our notebook.
Install Software
First, we'll need to install pyspark and delta-spark:
! sudo apt-get update
! sudo mkdir -p /usr/share/man/man1
! sudo apt-get install -y openjdk-11-jdk
! pip install pyspark==3.3.2
! pip install delta-spark==2.2.0
Once the installation is complete, we'll prepare our SparkSession:
import pyspark
from delta import *

# Point Spark at the SingleStore JDBC client, the SingleStore Spark Connector and their
# dependencies, and enable the Delta Lake SQL extensions and catalog.
builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.jars", "jars/singlestore-jdbc-client-1.1.4.jar, jars/singlestore-spark-connector_2.12-4.1.2-spark-3.3.0.jar, jars/spray-json_3-1.3.6.jar, jars/commons-dbcp2-2.9.0.jar, jars/commons-pool2-2.11.1.jar") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

# configure_spark_with_delta_pip adds the matching delta-core package to the session.
spark = configure_spark_with_delta_pip(builder).getOrCreate()
We can check the version of Spark as follows:
spark.version
The output should be:
'3.3.2'
Connect to SingleStoreDB
First, we'll provide connection details for SingleStoreDB:
host = "<host>"
password = "<password>"
port = "3306"
cluster = host + ":" + port
We'll replace the <host> and <password> with the values from our SingleStoreDB Cloud account.
We'll now set some parameters for the SingleStore Spark Connector:
spark.conf.set("spark.datasource.singlestore.ddlEndpoint", cluster)
spark.conf.set("spark.datasource.singlestore.user", "admin")
spark.conf.set("spark.datasource.singlestore.password", password)
spark.conf.set("spark.datasource.singlestore.disablePushdown", "false")
Create Dataframe from SingleStoreDB, Write to Delta Lake
We can create a Dataframe from SingleStoreDB, as follows:
iris_df = (spark.read
.format("singlestore")
.load("iris_db.iris")
)
Next, we'll check the data:
iris_df.show(5)
The result should be similar to the following:
+------------+-----------+------------+-----------+-----------+
|sepal_length|sepal_width|petal_length|petal_width| species|
+------------+-----------+------------+-----------+-----------+
| 5.7| 3.8| 1.7| 0.3|Iris-setosa|
| 5.1| 3.8| 1.5| 0.3|Iris-setosa|
| 4.8| 3.4| 1.9| 0.2|Iris-setosa|
| 5| 3.4| 1.6| 0.4|Iris-setosa|
| 4.4| 3.2| 1.3| 0.2|Iris-setosa|
+------------+-----------+------------+-----------+-----------+
only showing top 5 rows
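The connector can also load the result of a SQL query rather than a whole table. As a sketch (a hypothetical query, not part of the original article), we could load only the Iris-setosa rows:
# Hypothetical example: let SingleStoreDB run the query and return just the matching rows.
setosa_df = (spark.read
    .format("singlestore")
    .option("query", "SELECT * FROM iris_db.iris WHERE species = 'Iris-setosa'")
    .load()
)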
Let's now write the data to the Delta Lake:
(iris_df.write
.format("delta")
.save("warehouse/delta-table")
)
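Delta Lake versions every write, so we can inspect the table's transaction history. A minimal sketch using the DeltaTable API (not part of the original article):
from delta.tables import DeltaTable

# Show the version, timestamp and operation for each commit to the Delta table.
delta_table = DeltaTable.forPath(spark, "warehouse/delta-table")
delta_table.history().select("version", "timestamp", "operation").show()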
Create Dataframe from Delta Lake, Write to SingleStoreDB
Now, let's read the data back from the Delta Lake:
new_iris_df = (spark.read
.format("delta")
.load("warehouse/delta-table")
)
Next, we'll check the data:
new_iris_df.show(5)
The result should be similar to the following:
+------------+-----------+------------+-----------+-----------+
|sepal_length|sepal_width|petal_length|petal_width| species|
+------------+-----------+------------+-----------+-----------+
| 5.7| 3.8| 1.7| 0.3|Iris-setosa|
| 5.1| 3.8| 1.5| 0.3|Iris-setosa|
| 4.8| 3.4| 1.9| 0.2|Iris-setosa|
| 5.0| 3.4| 1.6| 0.4|Iris-setosa|
| 4.4| 3.2| 1.3| 0.2|Iris-setosa|
+------------+-----------+------------+-----------+-----------+
only showing top 5 rows
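As a sketch (not part of the original article), Delta Lake's time travel feature would also let us read an earlier version of the table, for example version 0, the initial write:
# Read the table as it was at version 0.
old_iris_df = (spark.read
    .format("delta")
    .option("versionAsOf", 0)
    .load("warehouse/delta-table")
)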
Let's now write the data to SingleStoreDB:
(new_iris_df.write
.format("singlestore")
.option("loadDataCompression", "LZ4")
.mode("overwrite")
.save("iris_db.new_iris")
)
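Before switching to the SQL Editor, a quick sanity check from the notebook (a sketch, not part of the original article) is to read the new table back and count the rows:
# Read the newly written table from SingleStoreDB and count its rows.
check_df = (spark.read
    .format("singlestore")
    .load("iris_db.new_iris")
)
check_df.count()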
From SingleStoreDB Cloud, we can check that the new_iris table was created, and we can query the data:
USE iris_db;
SELECT * FROM new_iris LIMIT 5;
Summary
Using Spark Dataframes, we can seamlessly work with SingleStoreDB and Delta Lake.