Every step matters when you are learning something new. With a new technology, setting up its environment properly is the first "baby step" that lets you practice effectively.
Installing Apache Spark involves a few key steps: ensuring the prerequisites are installed, then downloading, extracting, and configuring the Spark binaries for your operating system.
Apache Spark runs on the Java Virtual Machine (JVM), so the Java Development Kit (JDK) is a requirement. Hey, don't panic! Installing the JDK doesn't mean you have to code in Java. It simply provides the JVM, creating the necessary runtime environment that Spark needs to execute its tasks.
Prerequisites
1. Java
Spark requires Java 8 or 11; make sure your system has one of these installed.
My suggestion is to go with Java 8 because:
- Hadoop 3.0.x and 3.2.x only support Java 8
- Hadoop 3.3+ supports Java 8 and 11 (runtime only)
Check Java version:
java -version
If not installed:
sudo apt install openjdk-8-jdk -y # Change according to your choice of version
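If you end up with more than one JDK installed, you can choose which one the java command points to. A quick sketch for Ubuntu (assumes the JDKs were installed via apt):
sudo update-alternatives --config java   # interactively select Java 8 or 11
java -version                            # confirm the active version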
2. Minimum System Requirements
- OS (Operating System): Ubuntu 20.04 or 22.04 https://ubuntu.com/download
- RAM: 4GB (Recommended 8GB)
- Disk: 20GB free space
- CPU: 2 cores
To create this setup on Windows, we first have to create a Linux environment using WSL; see the quick sketch just below.
If you want an installation guide for Mac, drop a comment and a Mac setup post will follow.
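For reference, a minimal sketch of creating the WSL environment (run in an elevated PowerShell or Command Prompt on Windows 10 2004+ / Windows 11; the distro name here is an assumption, list the available ones with wsl --list --online):
wsl --install -d Ubuntu-22.04   # installs WSL2 and the Ubuntu 22.04 distro
Once Ubuntu boots inside WSL, all the Linux commands in this guide run in that shell.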
3. Python (For PySpark)
- Required: Python 3.7+
Check versions:
python3 --version
pip3 --version
If not installed:
sudo apt install python3 python3-pip -y
4. SSH (Mandatory for Hadoop)
Hadoop daemons require passwordless SSH, even on a single machine.
Check SSH:
ssh localhost
If not installed:
sudo apt install openssh-server -y
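If the SSH daemon isn't running after the install, here is a minimal sketch to start it on Ubuntu (the service is normally named ssh; on WSL without systemd, sudo service ssh start does the same job):
sudo systemctl enable --now ssh   # start sshd now and on every boot
sudo systemctl status ssh         # should show "active (running)"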
5. Linux Utilities (Required)
Install basic tools:
sudo apt install -y \
wget \
curl \
rsync \
vim \
nano \
net-tools \
procps
Why:
- rsync: Hadoop file sync
- procps: process tools like ps, used by the Hadoop scripts and for checking daemons
- net-tools: network checks
6. Environment Variables
(We will be sorting this out together)
7. Browser (For UIs)
Any browser of your choice will work:
Chrome, Safari, or Firefox.
8. Permissions
You must:
- Have sudo access
- Be able to write to /opt
NOTE:
Run the commands below. If all of them pass, we are ready to move forward with the installation of Hadoop + Spark.
java -version
python3 --version
ssh localhost
sudo ls /opt
LET'S START WITH WHAT WE ARE ACTUALLY HERE FOR
STEP 1
Install & Configure Hadoop (Single Node Cluster)
1. Setup Passwordless SSH (Mandatory)
What's the use of this?
Hadoop uses SSH to
- Start Daemons
- Stop Daemons
- Manage Nodes (even localhost)
"Even on one machine Hadoop behaves like a cluster"
This is a very common question, and you might be thinking the same: why passwordless if we can just use a password? Well, you can type a password, but what about your machine? Don't mind me here, but your machine is dumb, dumber than you think; it cannot type a password on its own, and we want this dumb thing to get access. That's why we say, "Come on, you dumb thing, take this passwordless access and leave me alone." So: daemons cannot enter passwords, which is why we set up passwordless SSH.
We have to generate the SSH keys and allow localhost login
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
The above command creates two keys: id_rsa (private key) and id_rsa.pub (public key).
Now we have to authorize it:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
We are done with this part. Now test whether you can log in without a password; if you can, voila, our SSH layer is ready to go.
ssh localhost
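If you want to be extra sure the daemons will never get stuck at a prompt, this small check fails instead of asking for a password:
ssh -o BatchMode=yes localhost 'echo SSH OK'   # prints "SSH OK" only if no password is required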
2. Download Hadoop
We will now download Hadoop. When I say Hadoop, don't take it as a single simple program; it is a set of Java services, which includes:
- HDFS
- YARN
- MapReduce (runtime)
We will keep all third-party software under /opt to keep the system clean and organized.
cd /opt
sudo wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
sudo tar -xvzf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 hadoop
https://downloads.apache.org/hadoop/common/ lists all available Hadoop versions. Pick whichever works best for you. If you want the current stable version, go into the stable folder, copy the hadoop-x.x.x.tar.gz link, and use it in the wget command above.
Change the ownership so your user owns the Hadoop directory:
sudo chown -R $USER:$USER /opt/hadoop
3. Environment Variables
As I told you before, your machine is dumb, so it won't find the Hadoop binaries and config files on its own. We set environment variables to tell Linux:
- Where Hadoop is Installed
- Where its Binary File lives
- Where config files live
Make sure you have an editor installed; it will make things easier to manage going forward. Any IDE will do, but my preference is VS Code, so that is what I will suggest here.
| Variable | Purpose |
|---|---|
| HADOOP_HOME | Hadoop root directory |
| HADOOP_CONF_DIR | Hadoop XML configuration files |
| PATH | Run hadoop, hdfs, and yarn commands from anywhere |
cd ~
Here you will find a file named .bashrc. This file contains the environment variables and commands that should run every time a shell starts.
Open .bashrc in VSCode.
At the end of the file, add the lines below.
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Save the file and run the command below in the terminal.
source ~/.bashrc
Now we have to check whether Hadoop is wired up properly. If it is, we have completed another step successfully.
hadoop version
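If hadoop version is not found, a couple of quick checks (assuming the exports above went into ~/.bashrc and were sourced):
echo $HADOOP_HOME   # should print /opt/hadoop
which hadoop        # should print /opt/hadoop/bin/hadoop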
4. Hadoop Configuration Files
Hadoop behaves exactly how we tell it to behave (within its limits), and to tell Hadoop how to behave we don't use a broom like mom used to. We give it a manual, and that manual comes in the form of XML files.
These files will be found at location $HADOOP_CONF_DIR.
4.1. core-site.xml
This controls the filesystem abstraction and sets the default filesystem URI.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- Default filesystem -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<!-- Temporary directory used by Hadoop -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop/tmp</value>
</property>
</configuration>
Replace the contents of core-site.xml with the configuration above; don't forget to back up the default core-site.xml first.
What this change actually means: every file operation defaults to HDFS, the NameNode runs on localhost, and the HDFS RPC port is set to 9000.
- Every hdfs dfs command uses this URI
- Spark also reads this when accessing HDFS
Create the temp directory:
mkdir -p /opt/hadoop/tmp
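To confirm Hadoop actually picks this value up, you can ask it directly; a quick sketch that works even before the daemons are started:
hdfs getconf -confKey fs.defaultFS   # should print hdfs://localhost:9000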
4.2. hdfs-site.xml
This controls HDFS replication, metadata storage, and block storage.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- Single node replication -->
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<!-- NameNode metadata storage -->
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///opt/hadoop/data/namenode</value>
</property>
<!-- DataNode block storage -->
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///opt/hadoop/data/datanode</value>
</property>
<!-- Enable Web UI -->
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
Replace the contents of hdfs-site.xml with the configuration above; again, back up the default hdfs-site.xml first.
We create these directories ourselves because Hadoop will not create them for us.
mkdir -p /opt/hadoop/data/namenode
mkdir -p /opt/hadoop/data/datanode
4.3. mapred-site.xml
This controls the MapReduce execution engine (needed for YARN + Spark).
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- Run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- MapReduce job history server address -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>localhost:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>localhost:19888</value>
</property>
</configuration>
When we run Spark on YARN, it reuses the MapReduce shuffle service, so this configuration matters even if you never run MapReduce jobs.
4.4. yarn-site.xml
This controls YARN resource management and container execution.
<?xml version="1.0"?>
<configuration>
<!-- Enable shuffle service -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- ResourceManager hostname -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
<!-- Memory allocation (adjust to your RAM) -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<!-- CPU allocation -->
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>2</value>
</property>
<!-- Minimum container memory -->
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>512</value>
</property>
<!-- Maximum container memory -->
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>4096</value>
</property>
</configuration>
Spark executors run as YARN containers, and these limits cap how big an executor can be.
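For example, once Spark is wired to YARN in Step 2, a job request has to fit inside these limits: executor memory plus its overhead must stay under yarn.scheduler.maximum-allocation-mb (4096 MB here), and executor cores under the vcore limit. A hedged sketch (the pi.py example ships with the Spark binary we download in Step 2):
spark-submit \
  --master yarn \
  --num-executors 1 \
  --executor-cores 1 \
  --executor-memory 2g \
  $SPARK_HOME/examples/src/main/python/pi.py 10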
4.5. hadoop-env.sh
This tells Hadoop which Java to use; without it, the Hadoop daemons will fail to start.
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64   # use /usr/lib/jvm/java-8-openjdk-amd64 if you installed Java 8
export HADOOP_HEAPSIZE=1024
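Your JAVA_HOME depends on which JDK you installed earlier. If you are not sure of the path, this little sketch prints the JDK home on Ubuntu (assumes javac is on the PATH, which the openjdk-*-jdk packages provide):
dirname "$(dirname "$(readlink -f "$(which javac)")")"   # e.g. /usr/lib/jvm/java-8-openjdk-amd64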
4.6. Workers (formerly Slaves)
Previously this file was known as slaves; now we got civilized and call the same thing workers. It tells Hadoop where the DataNode and NodeManager will run.
Go to the Hadoop configuration directory.
cd $HADOOP_CONF_DIR
Open the workers file.
nano workers
Make sure it contains exactly this:
localhost
If it doesn't, edit the file and save it.
4.7. Verify Configuration
This step ensures that Hadoop is properly initialized and actually running, not just configured on disk.
Format the NameNode (first-time task only):
hdfs namenode -format
This creates the metadata and namespace; without formatting, HDFS cannot start. Make sure you do this only once in the lifetime of the Hadoop installation, because formatting again later will delete the HDFS metadata.
Now that this is done, Hadoop is fully installed and we can start the services.
start-dfs.sh
start-yarn.sh
This starts the HDFS daemons (storage layer) and the YARN daemons (resource layer).
Verify the running daemons:
jps
This command should list the running Hadoop JVM processes: NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager (plus Jps itself).
If you see all of these, every Hadoop JVM process is alive. If anything is missing, Hadoop is not fully up; retrace your steps.
5. Web Interface
These show the real-time cluster state.
| Service | URL |
|---|---|
| NameNode UI | http://localhost:9870 |
| YARN UI | http://localhost:8088 |
We can also browse the HDFS directories from Utilities > Browse the file system in the NameNode UI.
If both open, Hadoop is running correctly.
6. Confirmation
Confirmation is important: we must ensure HDFS and YARN work and the daemons are healthy.
hdfs dfs -mkdir -p /user/$USER
hdfs dfs -ls /user
If this works, it confirms that the Hadoop layer is stable.
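For a slightly stronger end-to-end check, push a small file through HDFS and read it back:
echo "hello hdfs" > /tmp/hello.txt
hdfs dfs -put /tmp/hello.txt /user/$USER/
hdfs dfs -cat /user/$USER/hello.txt   # should print: hello hdfs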
STEP 2
Installing Spark
What we will be doing here:
- Install Spark
- Tell Spark where the Hadoop config lives
- Make Spark submit jobs to YARN
- Enable Spark to read/write HDFS
In this setup Spark always depends on YARN; it does not run its own cluster.
1. Download Spark (Hadoop-Compatible)
We will download the pre-built Spark binary that already includes the Hadoop integration libraries. Spark internally relies on the Hadoop FileSystem API to talk to HDFS and on the YARN client APIs to request containers. If Spark is not built with Hadoop, it won't be able to read or write to HDFS or submit applications to YARN.
All available Spark versions are listed at https://downloads.apache.org/spark/. Open the folder for the version you want and copy the link of the Hadoop-compatible .tgz file (the one with bin-hadoop3 in its name) to go forward with the installation.
cd /opt
sudo wget https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
Extract:
sudo tar -xvzf spark-3.5.0-bin-hadoop3.tgz
sudo mv spark-3.5.0-bin-hadoop3 spark
sudo chown -R $USER:$USER /opt/spark
Hadoop-aware Spark binaries are now on our machine, but they are not yet connected to Hadoop.
2. Setting Up Spark Environment Variables
We do this so the machine knows where Spark is installed, where the Spark commands live, and which Python Spark should use. Linux doesn't automatically know about software installed under /opt, so by setting the variables below we tell it that Spark is there:
- SPARK_HOME: Spark root directory
- PATH: where spark-shell, spark-submit, and pyspark live
- PYSPARK_PYTHON: avoids Python version mismatches
This ensures the commands can be run from anywhere and that PySpark uses python3 consistently.
Open .bashrc in VS Code.
Add the following lines at the end of .bashrc:
# Spark
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=python3
Save the .bashrc file and run the command below in the terminal:
source ~/.bashrc
verify:
spark-shell --version
If this prints the Spark version, congratulations, Spark is successfully installed.
The bin-hadoop3 build contains the Hadoop client libraries.
3. Configure Spark to use Hadoop & YARN
This is one of the most crucial parts; do not skip it. It explicitly connects Spark to the Hadoop cluster.
Use the commands below:
cd $SPARK_HOME/conf
cp spark-env.sh.template spark-env.sh
nano spark-env.sh
Add the lines below to the spark-env.sh file:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64   # match the JAVA_HOME you set in hadoop-env.sh
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export YARN_CONF_DIR=/opt/hadoop/etc/hadoop
Spark does not auto-discover Hadoop. These settings tell Spark which Java runtime to use and where the HDFS and YARN configuration lives. If we miss this, Spark won't be able to locate the NameNode, the ResourceManager, or HDFS paths.
4. Prepare HDFS for Spark Execution
We will create the required directories:
hdfs dfs -mkdir /spark
hdfs dfs -mkdir -p /user/$USER
hdfs dfs -chmod -R 777 /spark
When we run Spark on YARN, Spark uploads jars and config to HDFS and uses HDFS for application staging, and it writes logs and metadata under /user/<username>. If these directories are not available, you get a runtime failure rather than a startup error.
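A quick way to confirm the directories exist before moving on:
hdfs dfs -ls /        # should list /spark and /user
hdfs dfs -ls /user    # should show /user/<your username>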
5. Run Spark Using YARN (Validation)
Here we will run Spark using Hadoop's resource manager (YARN).
- Python:
pyspark --master yarn
Test:
rdd = sc.parallelize([1, 2, 3, 4, 5])
result = rdd.map(lambda x: x * 10).collect()
print(result)
Exit Pyspark:
exit()
- Scala:
spark-shell --master yarn
Test:
sc.parallelize(1 to 5).map(_ * 10).collect()
Exit Spark Scala:
:quit
If these run perfectly, Apache Spark is ready and we are ready to start practicing code.
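You can also confirm that the test sessions above actually went through YARN; a small sketch using the YARN CLI (they will also appear in the YARN UI at http://localhost:8088):
yarn application -list -appStates FINISHED   # the exited pyspark / spark-shell sessions should be listed here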
NOTE:
YARN log aggregation is intentionally skipped here.
It will be covered later when we discuss debugging Spark jobs.