Running PySpark in JupyterLab on a Raspberry Pi

While researching how to set up a JupyterLab instance with Spark support (via PySpark), I found a lot of outdated content. That changed when I came across an up-to-date Docker image provided by the Jupyter Docker Stacks project.

https://jupyter-docker-stacks.readthedocs.io/en/latest/using/specifics.html#apache-spark

Python and Java have good support for the ARM architecture, so we can reasonably expect frameworks built on these platforms to run well on a Raspberry Pi. With that assumption, I simply ran the Docker command to start the environment.

docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook

It couldn't be easier.

JupyterLab is up and running.

We can check the CPU and memory available to the container with !lscpu and !free.

Let's try some simple code to check that PySpark and its SQL API are available.

from pyspark.sql import SparkSession

spark = ( SparkSession
    .builder
    .appName("Python Spark SQL basic example")
    .config("spark.executor.memory", "2g")
    .config("spark.executor.cores", "4")
    .config("spark.eventLog.enabled", "true")
    .config("spark.sql.shuffle.partitions", "50")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate() )

We get a SparkSession (in-memory):

  • Version: v3.5.0
  • Master: local[*]
  • AppName: Python Spark SQL basic example
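
To actually exercise the SQL API, a trivial query is enough (a small check I added on top of the original snippet):

# A trivial query to confirm the SQL API responds
spark.sql("SELECT 1 AS ok").show()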

We can set up a DataFrame.

from pyspark.sql.types import StructType, StructField, FloatType, BooleanType
from pyspark.sql.types import DoubleType, IntegerType, StringType

# Set up the schema
schema = StructType([
    StructField("User ID", IntegerType(), True),
    StructField("Username", StringType(), True),
    StructField("Browser", StringType(), True),
    StructField("OS", StringType(), True),
])

# Add the data
data = [
    (1580, "Barry", "FireFox", "Windows"),
    (5820, "Sam", "MS Edge", "Linux"),
    (2340, "Harry", "Vivaldi", "Windows"),
    (7860, "Albert", "Chrome", "Windows"),
    (1123, "May", "Safari", "macOS"),
]

# Set up the DataFrame
user_data_df = spark.createDataFrame(data, schema=schema)
user_data_df.show()
+-------+--------+-------+-------+
|User ID|Username|Browser|     OS|
+-------+--------+-------+-------+
|   1580|   Barry|FireFox|Windows|
|   5820|     Sam|MS Edge|  Linux|
|   2340|   Harry|Vivaldi|Windows|
|   7860|  Albert| Chrome|Windows|
|   1123|     May| Safari|  macOS|
+-------+--------+-------+-------+

We can save this DataFrame as a physical table in a new database.

spark.sql("CREATE DATABASE raspland")
user_data_df.write.saveAsTable("raspland.user_data")
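As a quick sanity check (not part of the original notebook), the spark.catalog API can confirm that the table was registered:

# The catalog should now contain the new table
spark.catalog.listTables("raspland")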

And then run SQL commands over the table.

spark.sql("SELECT * FROM raspland.user_data WHERE OS = 'Linux'").show()
+-------+--------+-------+-----+
|User ID|Username|Browser|   OS|
+-------+--------+-------+-----+
|   5820|     Sam|MS Edge|Linux|
+-------+--------+-------+-----+
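The same filter can also be expressed with the DataFrame API instead of SQL; here is a minimal equivalent (my own variation, not from the original notebook):

from pyspark.sql import functions as F

# Equivalent of the SQL query above, using the DataFrame API
spark.table("raspland.user_data").filter(F.col("OS") == "Linux").show()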

From the Terminal, we can inspect the Parquet files that store the data of our newly created table.

(base) jovyan@0e1d1463f0b0:~/spark-warehouse/raspland.db/user_data$ ls

part-00000-963849eb-9b82-4bf7-ab1c-f71800914e15-c000.snappy.parquet
part-00001-963849eb-9b82-4bf7-ab1c-f71800914e15-c000.snappy.parquet
part-00002-963849eb-9b82-4bf7-ab1c-f71800914e15-c000.snappy.parquet
part-00003-963849eb-9b82-4bf7-ab1c-f71800914e15-c000.snappy.parquet
_SUCCESS
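Since the table data is plain Parquet under the warehouse directory, we can also read it back directly, bypassing the catalog. A minimal sketch, assuming the default spark-warehouse location shown above:

# Read the table's Parquet files directly from the warehouse directory
df = spark.read.parquet("spark-warehouse/raspland.db/user_data")
df.printSchema()
df.show()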

Install the jupyterlab-sql-editor extension to get enhanced functionality for SQL execution.


Here is a post to learn more about the extension:

https://towardsdatascience.com/jupyterlab-sql-cell-editor-e6ac865b42df

You will need to run two commands in the Terminal to install the server prerequisites.

pip install jupyterlab-lsp jupyterlab-sql-editor
sudo npm install -g sql-language-server

Load the extension in the notebook ...

%load_ext jupyterlab_sql_editor.ipython_magic.sparksql

and SparkSQL cells will be enabled.
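
For example, a whole cell can then be written in SQL using the %%sparksql cell magic provided by the extension (a minimal sketch; see the extension's documentation for its full set of options):

%%sparksql
SELECT Username, Browser
FROM raspland.user_data
WHERE OS = 'Windows'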


Metastore

Spark should be using a Hive metastore to manage databases and tables (we will talk about Hive later). So far, our data has been physically persisted in Parquet files, while our metadata lives only in memory.
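
A quick way to see which catalog is in use (a small check I added; the property is standard Spark configuration) is to read it from the session:

# Returns 'in-memory' for the default catalog, 'hive' once Hive support is enabled
spark.conf.get("spark.sql.catalogImplementation")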

We need to enable Hive support to persist our metadata. So let's delete our data, recreate our Spark session, and run the sample again.

In the Terminal:

rm -rf spark-warehouse/

Change the code in the notebook to enable Hive:

from pyspark.sql import SparkSession

spark = ( SparkSession
    .builder
    .appName("Python Spark SQL basic example")
    ...
    .enableHiveSupport()
    .getOrCreate() )

After running our database creation script again, new files and folders appear in the file explorer.

metastore_db folder

While spark-warehouse houses our Parquet files, metastore_db serves as a Derby repository to store our database and table definitions.

Derby is a lightweight relational database management system (RDBMS) implemented in Java. It is often used as a local, embedded metastore for Spark's SQL component when running in standalone mode.

Hive is a data warehousing and SQL query engine for Hadoop, originally developed by Facebook. It allows you to query big data in a distributed storage architecture like Hadoop's HDFS using SQL-like syntax. Hive's architecture includes a metadata repository stored in an RDBMS, which is often referred to as the Hive Metastore.

The standalone installation of Spark does not inherently include Hive, but it has built-in support for connecting to a Hive metastore if you have one set up separately.
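
For reference, pointing a standalone Spark installation at an external Hive metastore usually comes down to one extra configuration property when building the session. A hedged sketch, with a hypothetical metastore host:

from pyspark.sql import SparkSession

# Sketch only: 'metastore-host' is a placeholder for a real Hive metastore service
spark = ( SparkSession
    .builder
    .appName("Spark with an external Hive metastore")
    .config("hive.metastore.uris", "thrift://metastore-host:9083")
    .enableHiveSupport()
    .getOrCreate() )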

Thank you for reading. Let me know in the comments what else you would like to see to complement this article.
