Apache Spark is a robust, open-source distributed computing system that simplifies big data processing. Running Apache Spark on a local machine can significantly enhance your data science and engineering workflows. In this guide, we will take you step-by-step through setting up Apache Spark 4 with Jupyter on Ubuntu, using Java 17 and a Python virtual environment (venv). This tutorial focuses on installation and configuration.
Prerequisites for Running Apache Spark 4 on Ubuntu
Before diving into the installation and configuration process, it's important to ensure that your system meets the following prerequisites:
- Ubuntu 22.04, 24.04, or later versions
- At least 20 GB of free disk space (less space can work, but it's safer to have more)
- sudo privileges for system-level changes
Once you've confirmed these requirements, you're ready to begin setting up Apache Spark on your Ubuntu system.
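If you want to double-check these requirements from the terminal, lsb_release reports the Ubuntu release and df shows the free space in your home directory:
lsb_release -a
df -h ~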
Step 1: Installing Java 17 on Ubuntu
Apache Spark 4 requires Java 17 or later for compatibility. To install Java 17, follow these simple steps:
1.1 Update Package Lists
Start by updating the Ubuntu package list to ensure you're installing the latest version of OpenJDK.
sudo apt update
1.2 Install OpenJDK 17
Now, install OpenJDK 17, which includes all the necessary components to run Spark efficiently:
sudo apt install openjdk-17-jdk -y
1.3 Verify Java Installation
After installation, verify that Java 17 is installed properly by checking the version:
java -version
The expected output should look like this:
openjdk version "17.x.x" ...
This confirms that Java 17 is successfully installed.
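If the command reports an older version because several JDKs are installed, you can switch the system default to Java 17 with update-alternatives:
sudo update-alternatives --config java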
Step 2: Setting JAVA_HOME Environment Variable Permanently
To ensure that Java 17 is recognized globally across all shell sessions, you'll need to set the JAVA_HOME environment variable.
2.1 Find Java Home Directory
Run the following command to determine the Java installation directory:
readlink -f $(which java)
The output will resemble:
/usr/lib/jvm/java-17-openjdk-amd64/bin/java
2.2 Edit Shell Configuration
Open your shell configuration file (.bashrc) to add the Java home directory:
nano ~/.bashrc
2.3 Add JAVA_HOME Configuration
Scroll to the bottom of the .bashrc file and add the following lines:
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
2.4 Reload Configuration
After saving the .bashrc file, reload it for the changes to take effect:
source ~/.bashrc
2.5 Verify JAVA_HOME
Verify the Java home directory is correctly set:
echo $JAVA_HOME
The output should match:
/usr/lib/jvm/java-17-openjdk-amd64
Additionally, check the Java version again to confirm everything is working:
java -version
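Note that java -version only proves that the binary on your PATH works; to confirm that JAVA_HOME itself points at a usable JDK, you can also invoke the java binary through it directly:
$JAVA_HOME/bin/java -version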
Step 3: Download and Install Apache Spark 4 on Ubuntu
3.1 Download Apache Spark
The next step is to download Apache Spark version 4.0.0, the latest release at the time of writing. To do this, use the following wget command:
wget https://dlcdn.apache.org/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3.tgz
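Optionally, verify the integrity of the download before extracting it. Apache publishes a .sha512 checksum alongside each release artifact; assuming the checksum file sits at the same URL with a .sha512 suffix and uses the standard sha512sum format, you can check the archive like this:
wget https://dlcdn.apache.org/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3.tgz.sha512
sha512sum -c spark-4.0.0-bin-hadoop3.tgz.sha512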
3.2 Extract and Move Apache Spark
Once the download is complete, extract the file and move it to your home directory for easier access:
tar -xzf spark-4.0.0-bin-hadoop3.tgz
mv spark-4.0.0-bin-hadoop3 ~/spark
3.3 Configure Apache Spark
To configure Apache Spark globally, you need to update your environment variables.
Open your .bashrc file again:
nano ~/.bashrc
Then, add the following lines:
export SPARK_HOME=~/spark
export PATH=$SPARK_HOME/bin:$PATH
3.4 Reload Shell Configuration
Reload the .bashrc file to apply the changes:
source ~/.bashrc
3.5 Verify Spark Installation
To verify that Apache Spark is properly installed, check the location of the pyspark executable:
which pyspark
The output should show the path to the pyspark binary inside your Spark directory, for example:
/home/<your-user>/spark/bin/pyspark
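You can also print the Spark version banner to double-check which release the shell will pick up:
spark-submit --version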
Step 4: Setting Up a Python Virtual Environment for Jupyter
Python is a crucial part of your Spark setup, especially when using PySpark. Creating a virtual environment isolates the project's dependencies and avoids system-wide package conflicts.
4.1 Create a Virtual Environment
To create a virtual environment, run:
python3 -m venv venv
Activate the virtual environment:
source venv/bin/activate
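If the venv creation command fails on a fresh Ubuntu install, the venv module is usually missing; installing it from the standard Ubuntu repositories and re-running the command should fix it:
sudo apt install python3-venv -y
Once the environment is active, which python should resolve to a path inside the venv directory.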
4.2 Install Jupyter Notebook
Now, you need to install Jupyter Notebook in the virtual environment:
pip install --upgrade pip
pip install jupyter
4.3 (Optional) Install PySpark
To make it easier to interface with Spark directly from Jupyter, you can also install PySpark in your virtual environment:
pip install pyspark
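If you do install PySpark from pip, it's worth pinning it to the same version as the Spark distribution downloaded above to avoid mismatches between the pip package and your Spark installation (this assumes the matching 4.0.0 release is available on PyPI):
pip install pyspark==4.0.0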
Step 5: Configuring PySpark to Launch Jupyter
5.1 Configure Jupyter Integration
To ensure that PySpark launches with Jupyter Notebook, you need to set the necessary environment variables.
Open your .bashrc file once more:
nano ~/.bashrc
Add the following lines:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
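If you prefer JupyterLab over the classic notebook and have installed the jupyterlab package in your venv, the same mechanism works; keep PYSPARK_DRIVER_PYTHON=jupyter and point the options at the lab interface instead:
export PYSPARK_DRIVER_PYTHON_OPTS="lab"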
5.2 Reload Configuration
Reload the .bashrc file to ensure the changes are applied:
source ~/.bashrc
5.3 Launch PySpark with Jupyter
Finally, you can launch Apache Spark along with Jupyter Notebook by running:
pyspark
This will:
- Initialize a Spark session
- Open Jupyter Notebook in your browser automatically
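The pyspark launcher also accepts the standard spark-submit options, so you can control the execution mode at launch time; for example, to run locally with a limited number of cores instead of the default local[*]:
pyspark --master "local[2]"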
Step 6: Verifying the Setup
Once Jupyter launches, create a new Python 3 notebook and test if the PySpark setup is functioning correctly. Use the following code in a cell to verify the setup:
from pyspark.sql import SparkSession
# Start (or reuse) a local Spark session for this notebook
spark = SparkSession.builder.appName("TestSpark").getOrCreate()
# Create a single-column DataFrame with the numbers 0-9 and display it
df = spark.range(10).toDF("num")
df.show()
If the setup is correct, you should see the following output:
+---+
|num|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+---+
Step 7: (Optional) Cleanup Old Java Versions
To free up disk space, you may wish to remove older versions of Java that are no longer needed.
sudo apt remove --purge openjdk-11-jdk -y
sudo apt autoremove --purge -y
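If you're unsure which OpenJDK packages are present, list them first and adjust the package name in the remove command accordingly:
dpkg -l | grep -i openjdk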
🆕 Updated Setup (Dynamic Paths)
If you want a more portable, error-resistant setup (especially useful when repeating this on another machine), use this version instead of hardcoding paths:
Step 1: Set Environment Variables Dynamically
Instead of hardcoded values, use the following in your setup script or terminal session:
# Set JAVA_HOME based on active Java version
export JAVA_HOME=$(readlink -f $(which java) | sed "s:/bin/java::")
# Set SPARK_HOME based on where Spark is located
export SPARK_HOME=$(readlink -f ~/spark) # Adjust if you moved Spark elsewhere
You can confirm SPARK_HOME is correct by checking that the path contains the bin/ directory:
ls $SPARK_HOME/bin
Step 2: Activate Your Virtual Environment Dynamically (Optional)
If you're unsure where your venv is located, find it with:
find ~ -type f -name activate 2>/dev/null | grep venv
Then activate it:
source /full/path/to/venv/bin/activate
Step 3: Use a Startup Script (Optional)
You can save this as a start-spark-notebook.sh script:
#!/bin/bash
export JAVA_HOME=$(readlink -f $(which java) | sed "s:/bin/java::")
export SPARK_HOME=$(readlink -f ~/spark)
source ~/Documents/ztm/venv/bin/activate # Adjust to your venv path
jupyter notebook
Make it executable:
chmod +x start-spark-notebook.sh
Then run it:
./start-spark-notebook.sh
Troubleshooting Tip
If pyspark doesn't launch Jupyter, or falls back to the plain terminal shell:
- Make sure you're in the virtual environment
- Run: pip install pyspark
- Or launch the notebook with Spark manually by running your script above.
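It can also help to confirm that the driver variables from Step 5 are set in the current shell; if the command below prints nothing, re-run source ~/.bashrc:
echo "$PYSPARK_DRIVER_PYTHON $PYSPARK_DRIVER_PYTHON_OPTS"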
Conclusion
You have successfully set up Apache Spark 4 with Jupyter on Ubuntu, using Java 17 and a Python virtual environment. This environment will allow you to seamlessly process big data using PySpark and leverage the powerful features of Apache Spark for your data engineering and data science workflows. Whether you're analyzing large datasets or building data pipelines, this setup provides a solid foundation for your work.
For more advanced configurations and to learn about optimizing Spark for your specific needs, explore the official Apache Spark documentation and best practices.