Gervais Yao Amoah

Comprehensive Guide to Running Apache Spark 4 with Jupyter on Ubuntu (for Python Developers)

Apache Spark is a robust, open-source distributed computing system that simplifies big data processing. Running Spark on a local machine can significantly enhance your data science and engineering workflows. In this guide, we will walk step by step through setting up Apache Spark 4 with Jupyter on Ubuntu, using Java 17 and a Python virtual environment. This tutorial covers installation and configuration only.

Prerequisites for Running Apache Spark 4 on Ubuntu

Before diving into the installation and configuration process, it's important to ensure that your system meets the following prerequisites:

  • Ubuntu 22.04, 24.04, or later versions
  • At least 20 GB of free disk space (less space can work, but it's safer to have more; see the quick check below)
  • sudo privileges for system-level changes
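
You can confirm the available space with a quick check (this assumes your working area is under your home directory):

df -h ~

Look at the Avail column for the filesystem that holds your home directory.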

Once you've confirmed these requirements, you're ready to begin setting up Apache Spark on your Ubuntu system.

Step 1: Installing Java 17 on Ubuntu

Apache Spark 4 requires Java 17 or later for compatibility. To install Java 17, follow these simple steps:

1.1 Update Package Lists

Start by updating the Ubuntu package list to ensure you're installing the latest version of OpenJDK.

sudo apt update

1.2 Install OpenJDK 17

Now, install OpenJDK 17, which provides the Java runtime and development tools Spark needs:

sudo apt install openjdk-17-jdk -y

1.3 Verify Java Installation

After installation, verify that Java 17 is installed properly by checking the version:

java -version

The expected output should look like this:

openjdk version "17.x.x" ...

This confirms that Java 17 is successfully installed.

Step 2: Setting JAVA_HOME Environment Variable Permanently

To make sure Java 17 is recognized in every shell session, you'll need to set the JAVA_HOME environment variable permanently.

2.1 Find Java Home Directory

Run the following command to determine the Java installation directory:

readlink -f $(which java)

The output will resemble:

/usr/lib/jvm/java-17-openjdk-amd64/bin/java

2.2 Edit Shell Configuration

Open your shell configuration file (.bashrc) to add the Java home directory.

nano ~/.bashrc

2.3 Add JAVA_HOME Configuration

Scroll to the bottom of the .bashrc file and add the following lines. Note that JAVA_HOME is the path from step 2.1 without the trailing /bin/java:

export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH

2.4 Reload Configuration

After saving the .bashrc file, reload it for the changes to take effect:

source ~/.bashrc

2.5 Verify JAVA_HOME

Verify the Java home directory is correctly set:

echo $JAVA_HOME

The output should match:

/usr/lib/jvm/java-17-openjdk-amd64

Additionally, check the Java version again to confirm everything is working:

java -version

Step 3: Download and Install Apache Spark 4 on Ubuntu

3.1 Download Apache Spark

The next step is to download Apache Spark 4.0.0, the latest release at the time of writing. To do this, use the following wget command:

wget https://dlcdn.apache.org/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3.tgz

3.2 Extract and Move Apache Spark

Once the download is complete, extract the file and move it to your home directory for easier access:

tar -xzf spark-4.0.0-bin-hadoop3.tgz
mv spark-4.0.0-bin-hadoop3 ~/spark

3.3 Configure Apache Spark

To configure Apache Spark globally, you need to update your environment variables.

Open your .bashrc file again:

nano ~/.bashrc

Then, add the following lines:

export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH

3.4 Reload Shell Configuration

Reload the .bashrc file to apply the changes:

source ~/.bashrc

3.5 Verify Spark Installation

To verify that Apache Spark is properly installed, check the location of the pyspark executable:

which pyspark

The output should indicate the path to the pyspark binary:

~/spark/bin/pyspark
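
As an extra check, you can print the Spark version from the same distribution with spark-submit, which ships in the same bin/ directory:

spark-submit --version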

Step 4: Setting Up a Python Virtual Environment for Jupyter

Python is a crucial part of your Spark setup, especially when using PySpark. By creating a virtual environment, we can isolate the project's dependencies and avoid system-wide package conflicts.

4.1 Create a Virtual Environment

To create a virtual environment, run the following (if the venv module is missing, install it first with sudo apt install python3-venv):

python3 -m venv venv

Activate the virtual environment:

source venv/bin/activate

4.2 Install Jupyter Notebook

Now, you need to install Jupyter Notebook in the virtual environment:

pip install --upgrade pip
pip install jupyter

4.3 (Optional) Install PySpark

To make it easier to interface with Spark directly from Jupyter, you can also install PySpark in your virtual environment:

pip install pyspark
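
If you want the pip package to match the Spark 4.0.0 distribution downloaded in Step 3, you can pin the version explicitly (a suggestion, not a requirement; any matching 4.0.x release should work):

pip install pyspark==4.0.0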

Step 5: Configuring PySpark to Launch Jupyter

5.1 Configure Jupyter Integration

To make the pyspark command launch Jupyter Notebook instead of the default interactive shell, you need to set two environment variables.

Open your .bashrc file once more:

nano ~/.bashrc

Add the following lines:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
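
If you prefer JupyterLab over the classic Notebook interface, the same mechanism should work by changing the driver options (this assumes JupyterLab is installed in your virtual environment, for example via pip install jupyterlab):

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="lab"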

5.2 Reload Configuration

Reload the .bashrc file to ensure the changes are applied:

source ~/.bashrc

5.3 Launch PySpark with Jupyter

Finally, you can launch Apache Spark along with Jupyter Notebook by running:

pyspark

This will:

  • Open Jupyter Notebook in your browser automatically
  • Put Spark's Python libraries on the notebook's path, so you can create a Spark session in a notebook (as shown in the next step)

Step 6: Verifying the Setup

Once Jupyter launches, create a new Python 3 notebook and run the following code in a cell to verify that the PySpark setup is functioning correctly:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("TestSpark").getOrCreate()
df = spark.range(10).toDF("num")
df.show()

If the setup is correct, you should see the following output:

+---+
|num|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+
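
As an optional follow-up cell, you can confirm the Spark version the notebook is using and release the session's resources when you are done (a minimal sketch):

# Confirm the running Spark version; it should start with 4.0
print(spark.version)

# Stop the session to free driver and executor resources
spark.stop()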

Step 7: (Optional) Cleanup Old Java Versions

To free up disk space, you may wish to remove older versions of Java that are no longer needed.
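
Before removing anything, you can list the OpenJDK packages currently installed (output varies by system):

apt list --installed 2>/dev/null | grep openjdk

Then remove the versions you no longer need, for example OpenJDK 11: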

sudo apt remove --purge openjdk-11-jdk -y
sudo apt autoremove --purge -y

🆕 Updated Setup (Dynamic Paths)

If you want a more portable, error-resistant setup (especially useful when repeating this on another machine), use this version instead of hardcoding paths:

Step 1: Set Environment Variables Dynamically

Instead of hardcoded values, use the following in your setup script or terminal session:

# Set JAVA_HOME based on active Java version
export JAVA_HOME=$(readlink -f $(which java) | sed "s:bin/java::")

# Set SPARK_HOME based on where Spark is located
export SPARK_HOME=$(readlink -f ~/spark)  # Adjust if you moved Spark elsewhere

You can confirm SPARK_HOME is correct by checking that the path contains a bin/ directory:

ls $SPARK_HOME/bin

Step 2: Activate Your Virtual Environment Dynamically (Optional)

If you're unsure where your venv is located, find it with:

find ~ -type f -name activate 2>/dev/null | grep venv

Then activate it:

source /full/path/to/venv/bin/activate

Step 3: Use a Startup Script (Optional)

You can save this as a start-spark-notebook.sh script:

#!/bin/bash

export JAVA_HOME=$(readlink -f $(which java) | sed "s:bin/java::")
export SPARK_HOME=$(readlink -f ~/spark)
source ~/Documents/ztm/venv/bin/activate  # Adjust to your venv path

jupyter notebook

Make it executable:

chmod +x start-spark-notebook.sh

Then run it:

./start-spark-notebook.sh

Troubleshooting Tip

If pyspark doesn't launch Jupyter or falls back to the plain terminal shell:

  • Make sure you're in the virtual environment
  • Make sure pyspark is installed in it: pip install pyspark
  • Or launch the notebook with Spark manually via the startup script above and create the session yourself (see the sketch below).
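
If you start Jupyter from the startup script rather than through the pyspark launcher, the notebook will not have a session prepared for you. A minimal sketch of creating one manually, assuming pyspark is pip-installed in the active virtual environment:

# Create a local Spark session from a plain Jupyter notebook cell
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")        # run Spark locally, using all available cores
    .appName("ManualSession")  # hypothetical app name; choose your own
    .getOrCreate()
)

print(spark.version)  # sanity check: should print the installed Spark version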

Conclusion

You have successfully set up Apache Spark 4 with Jupyter on Ubuntu, using Java 17 and a Python virtual environment. This environment will allow you to seamlessly process big data using PySpark and leverage the powerful features of Apache Spark for your data engineering and data science workflows. Whether you're analyzing large datasets or building data pipelines, this setup provides a solid foundation for your work.

For more advanced configurations and to learn about optimizing Spark for your specific needs, explore the official Apache Spark documentation and best practices.
