Gervais Yao Amoah

Comprehensive Guide to Running Apache Spark 4 with Jupyter on Ubuntu (for Python Developers)

Apache Spark is a robust, open-source distributed computing system that simplifies big data processing. Running Spark on a local machine can significantly enhance your data science and engineering workflows. In this guide, we will walk step by step through setting up Apache Spark 4 with Jupyter on Ubuntu, using Java 17 and a Python virtual environment. This tutorial covers installation and configuration only.

Prerequisites for Running Apache Spark 4 on Ubuntu

Before diving into the installation and configuration process, it's important to ensure that your system meets the following prerequisites:

  • Ubuntu 22.04, 24.04, or later versions
  • At least 20 GB of free disk space (less space can work, but it's safer to have more; see the quick check below)
  • sudo privileges for system-level changes
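
You can confirm the available space with a quick check (this assumes your working area is under your home directory):

df -h ~

Look at the Avail column for the filesystem that holds your home directory.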

Once you've confirmed these requirements, you're ready to begin setting up Apache Spark on your Ubuntu system.

Step 1: Installing Java 17 on Ubuntu

Apache Spark 4 requires Java 17 or later for compatibility. To install Java 17, follow these simple steps:

1.1 Update Package Lists

Start by updating the Ubuntu package list to ensure you're installing the latest version of OpenJDK.

sudo apt update

1.2 Install OpenJDK 17

Now, install OpenJDK 17, which provides the Java runtime and development tools Spark needs:

sudo apt install openjdk-17-jdk -y

1.3 Verify Java Installation

After installation, verify that Java 17 is installed properly by checking the version:

java -version

The expected output should look like this:

openjdk version "17.x.x" ...

This confirms that Java 17 is successfully installed.

Step 2: Setting JAVA_HOME Environment Variable Permanently

To make sure Java 17 is recognized in every shell session, you'll need to set the JAVA_HOME environment variable permanently.

2.1 Find Java Home Directory

Run the following command to determine the Java installation directory:

readlink -f $(which java)

The output will resemble:

/usr/lib/jvm/java-17-openjdk-amd64/bin/java

2.2 Edit Shell Configuration

Open your shell configuration file (.bashrc) to add the Java home directory.

nano ~/.bashrc

2.3 Add JAVA_HOME Configuration

Scroll to the bottom of the .bashrc file and add the following lines. Note that JAVA_HOME is the path from step 2.1 without the trailing /bin/java:

export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH

2.4 Reload Configuration

After saving the .bashrc file, reload it for the changes to take effect:

source ~/.bashrc

2.5 Verify JAVA_HOME

Verify the Java home directory is correctly set:

echo $JAVA_HOME

The output should match:

/usr/lib/jvm/java-17-openjdk-amd64

Additionally, check the Java version again to confirm everything is working:

java -version

Step 3: Download and Install Apache Spark 4 on Ubuntu

3.1 Download Apache Spark

The next step is to download Apache Spark 4.0.0, the latest release at the time of writing. To do this, use the following wget command:

wget https://dlcdn.apache.org/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3.tgz

3.2 Extract and Move Apache Spark

Once the download is complete, extract the file and move it to your home directory for easier access:

tar -xzf spark-4.0.0-bin-hadoop3.tgz
mv spark-4.0.0-bin-hadoop3 ~/spark

3.3 Configure Apache Spark

To configure Apache Spark globally, you need to update your environment variables.

Open your .bashrc file again:

nano ~/.bashrc

Then, add the following lines:

export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH

3.4 Reload Shell Configuration

Reload the .bashrc file to apply the changes:

source ~/.bashrc

3.5 Verify Spark Installation

To verify that Apache Spark is properly installed, check the location of the pyspark executable:

which pyspark

The output should indicate the path to the pyspark binary:

~/spark/bin/pyspark
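
As an extra check, you can print the Spark version from the same distribution with spark-submit, which ships in the same bin/ directory:

spark-submit --version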

Step 4: Setting Up a Python Virtual Environment for Jupyter

Python is a crucial part of your Spark setup, especially when using PySpark. By creating a virtual environment, we can isolate the project's dependencies and avoid system-wide package conflicts.

4.1 Create a Virtual Environment

To create a virtual environment, run the following (if the venv module is missing, install it first with sudo apt install python3-venv):

python3 -m venv venv

Activate the virtual environment:

source venv/bin/activate

4.2 Install Jupyter Notebook

Now, you need to install Jupyter Notebook in the virtual environment:

pip install --upgrade pip
pip install jupyter

4.3 (Optional) Install PySpark

To make it easier to interface with Spark directly from Jupyter, you can also install PySpark in your virtual environment:

pip install pyspark
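
If you want the pip package to match the Spark 4.0.0 distribution downloaded in Step 3, you can pin the version explicitly (a suggestion, not a requirement; any matching 4.0.x release should work):

pip install pyspark==4.0.0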

Step 5: Configuring PySpark to Launch Jupyter

5.1 Configure Jupyter Integration

To make the pyspark command launch Jupyter Notebook instead of the default interactive shell, you need to set two environment variables.

Open your .bashrc file once more:

nano ~/.bashrc

Add the following lines:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
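
If you prefer JupyterLab over the classic Notebook interface, the same mechanism should work by changing the driver options (this assumes JupyterLab is installed in your virtual environment, for example via pip install jupyterlab):

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="lab"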

5.2 Reload Configuration

Reload the .bashrc file to ensure the changes are applied:

source ~/.bashrc

5.3 Launch PySpark with Jupyter

Finally, you can launch Apache Spark along with Jupyter Notebook by running:

pyspark

This will:

  • Open Jupyter Notebook in your browser automatically
  • Put Spark's Python libraries on the notebook's path, so you can create a Spark session in a notebook (as shown in the next step)

Step 6: Verifying the Setup

Once Jupyter launches, create a new Python 3 notebook and run the following code in a cell to verify that the PySpark setup is functioning correctly:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("TestSpark").getOrCreate()
df = spark.range(10).toDF("num")
df.show()

If the setup is correct, you should see the following output:

+---+
|num|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+
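
As an optional follow-up cell, you can confirm the Spark version the notebook is using and release the session's resources when you are done (a minimal sketch):

# Confirm the running Spark version; it should start with 4.0
print(spark.version)

# Stop the session to free driver and executor resources
spark.stop()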

Step 7: (Optional) Cleanup Old Java Versions

To free up disk space, you may wish to remove older versions of Java that are no longer needed.
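
Before removing anything, you can list the OpenJDK packages currently installed (output varies by system):

apt list --installed 2>/dev/null | grep openjdk

Then remove the versions you no longer need, for example OpenJDK 11: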

sudo apt remove --purge openjdk-11-jdk -y
sudo apt autoremove --purge -y

🆕 Updated Setup (Dynamic Paths)

If you want a more portable, error-resistant setup (especially useful when repeating this on another machine), use this version instead of hardcoding paths:

Step 1: Set Environment Variables Dynamically

Instead of hardcoded values, use the following in your setup script or terminal session:

# Set JAVA_HOME based on active Java version
export JAVA_HOME=$(readlink -f $(which java) | sed "s:bin/java::")

# Set SPARK_HOME based on where Spark is located
export SPARK_HOME=$(readlink -f ~/spark)  # Adjust if you moved Spark elsewhere

You can confirm SPARK_HOME is correct by checking that the path contains a bin/ directory:

ls $SPARK_HOME/bin

Step 2: Activate Your Virtual Environment Dynamically (Optional)

If you're unsure where your venv is located, find it with:

find ~ -type f -name activate 2>/dev/null | grep venv

Then activate it:

source /full/path/to/venv/bin/activate

Step 3: Use a Startup Script (Optional)

You can save this as a start-spark-notebook.sh script:

#!/bin/bash

export JAVA_HOME=$(readlink -f $(which java) | sed "s:bin/java::")
export SPARK_HOME=$(readlink -f ~/spark)
source ~/Documents/ztm/venv/bin/activate  # Adjust to your venv path

jupyter notebook

Make it executable:

chmod +x start-spark-notebook.sh

Then run it:

./start-spark-notebook.sh

Troubleshooting Tip

If pyspark doesn't launch Jupyter or falls back to the plain terminal shell:

  • Make sure you're in the virtual environment
  • Make sure pyspark is installed in it: pip install pyspark
  • Or launch the notebook with Spark manually via the startup script above and create the session yourself (see the sketch below).
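
If you start Jupyter from the startup script rather than through the pyspark launcher, the notebook will not have a session prepared for you. A minimal sketch of creating one manually, assuming pyspark is pip-installed in the active virtual environment:

# Create a local Spark session from a plain Jupyter notebook cell
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")        # run Spark locally, using all available cores
    .appName("ManualSession")  # hypothetical app name; choose your own
    .getOrCreate()
)

print(spark.version)  # sanity check: should print the installed Spark version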

Conclusion

You have successfully set up Apache Spark 4 with Jupyter on Ubuntu, using Java 17 and a Python virtual environment. This environment will allow you to seamlessly process big data using PySpark and leverage the powerful features of Apache Spark for your data engineering and data science workflows. Whether you're analyzing large datasets or building data pipelines, this setup provides a solid foundation for your work.

For more advanced configurations and to learn about optimizing Spark for your specific needs, explore the official Apache Spark documentation and best practices.
