
Michael Staszel

Originally published at mikestaszel.com

Installing Spark on Ubuntu in 3 Minutes

One thing I hear often from people starting out with Spark is that it’s too difficult to install. Some guides are for Spark 1.x and others are for 2.x. Some guides get really detailed with Hadoop versions, JAR files, and environment variables.

So here’s yet another guide on how to install Apache Spark, condensed and simplified to get you up and running with Apache Spark 2.3.1 in 3 minutes or less.

All you need is a machine (or instance, server, VPS, etc.) that you can install packages on (e.g. “sudo apt” works). If you need one of those, check out DigitalOcean. It’s much simpler than AWS for small projects.

First, log in to the machine via SSH.
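If you don’t already have a session open, the command looks something like this (the username and host below are placeholders for your own):

ssh ubuntu@your-server-ip # substitute your own user and host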

Now, install OpenJDK 8 (Java):

sudo apt update && sudo apt install -y openjdk-8-jdk-headless python

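Before moving on, you can confirm Java installed correctly:

java -version # should report an OpenJDK 1.8.x runtime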

Next, download and extract Apache Spark:

wget http://www-us.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz && tar xf spark-2.3.1-bin-hadoop2.7.tgz

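If you want to double-check the download, Apache publishes a checksum for each release; compare this output against the .sha512 file linked on the Spark download page:

sha512sum spark-2.3.1-bin-hadoop2.7.tgz # compare against the published .sha512 value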

Set up environment variables to configure Spark:

echo 'export SPARK_HOME=$HOME/spark-2.3.1-bin-hadoop2.7' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin' >> ~/.bashrc
echo 'export PYSPARK_PYTHON=python3' >> ~/.bashrc
source ~/.bashrc
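To confirm the variables took effect in your current shell:

echo $SPARK_HOME # should print the full path to the extracted Spark directory
which spark-submit # should resolve to $SPARK_HOME/bin/spark-submit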

That’s it – you’re all set! Spark is installed and ready to go. Try out “pyspark”, “spark-submit”, or “spark-shell”.

Try running this inside “pyspark” to validate that it worked:

spark.createDataFrame([{"hello": x} for x in range(1000)]).count() # hopefully this equals 1000
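If you’d rather test “spark-submit”, here is a minimal standalone version of the same check. The filename is arbitrary; save it as, say, count_test.py:

# count_test.py -- minimal standalone PySpark job (hypothetical filename)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("count-test").getOrCreate()
# Same sanity check as above: build a 1,000-row DataFrame and count it.
print(spark.createDataFrame([{"hello": x} for x in range(1000)]).count()) # should print 1000
spark.stop()

Then run it with “spark-submit count_test.py”.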
