Akhilesh Pratap Shahi

Apache Spark Installation

Spark Modules

Every step matters when you are learning something new. When you pick up a new technology, setting up its environment is the first thing to get right so that you can practice effectively; we call it a "baby step".
Installing Apache Spark involves a few key steps: making sure the prerequisites are in place, then downloading, extracting, and configuring the Spark binaries for your operating system.
Apache Spark runs on the Java Virtual Machine (JVM), so the Java Development Kit (JDK) is a requirement. Hey, don't panic! Installing the JDK doesn't mean you have to code in Java. It simply provides the JVM, the runtime environment Spark needs to execute its tasks.

Prerequisites

1. Java

Spark requires Java 8 or 11, so make sure your system has one of these installed.

My suggestion is to go with Java 8 because:

  • Hadoop 3.0.x and 3.2.x only support Java 8
  • Hadoop 3.3+ supports Java 8 and 11 (runtime only)

Check Java version:

java -version

If not installed:

sudo apt install openjdk-8-jdk -y # Change according to your choice of version
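If you end up with more than one JDK on the machine, you can choose which one the java and javac commands point to. This is a standard Ubuntu/Debian step, not something specific to this guide:

sudo update-alternatives --config java    # pick the default 'java'
sudo update-alternatives --config javac   # pick the matching 'javac'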

2. Minimum System Requirement

  • OS (Operating System): Ubuntu 20.04 or 22.04 https://ubuntu.com/download
  • RAM: 4GB (Recommended 8GB)
  • Disk: 20GB free space
  • CPU: 2 cores

To create this setup on Windows, we first have to set up a Linux environment using WSL (Windows Subsystem for Linux).

If you want an installation guide for Mac, drop a comment and I will follow up with a macOS setup.

3. Python (For PySpark)

  • Required: Python 3.7+

Check versions:

python3 --version
pip3 --version

If not installed:

sudo apt install python3 python3-pip -y

4. SSH (Mandatory for Hadoop)

Hadoop daemons require passwordless SSH, even on a single machine.

Check SSH:

ssh localhost

If not installed:

sudo apt install openssh-server -y

5. Linux Utilities (Required)

Install basic tools:

sudo apt install -y \
wget \
curl \
rsync \
vim \
nano \
net-tools \
procps

Why:

  • rsync: used by Hadoop scripts to sync files
  • procps: provides ps and related tools the start/stop scripts rely on
  • net-tools: network checks (netstat, ifconfig)
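These come in handy later; for example, once the daemons from Step 1 are running, you can sanity-check processes and listening ports with them (just an illustration, nothing is running yet at this point):

ps -ef | grep -i -E 'namenode|datanode|resourcemanager' | grep -v grep
netstat -tlnp 2>/dev/null | grep -E '9000|9870|8088'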

6. Environment Variables

(We will be sorting this out together)

7. Browser (For UIs)

Any browser of your choice will work:

  • Chrome
  • Safari
  • Firefox

8. Permissions

You must:

  • Have sudo access
  • Be able to write to /opt

NOTE:

Run the commands below. If all of them pass, we are ready to move forward with the installation of Hadoop + Spark.

java -version
python3 --version
ssh localhost
sudo ls /opt

LET'S START WITH WHAT WE ARE ACTUALLY HERE FOR


STEP 1

Install & Configure Hadoop (Single Node Cluster)

1. Setup Passwordless SSH (Mandatory)

What's the use of this?
Hadoop uses SSH to:

  • Start Daemons
  • Stop Daemons
  • Manage Nodes (even localhost)

"Even on one machine Hadoop behaves like a cluster"

This is a very common question, and you might be thinking the same: why passwordless if we can just type a password? Well, you can type a password, but your machine can't; don't mind me here, but your machine is dumber than you think. The daemons are started over SSH automatically, and an automated process cannot sit there typing passwords, so we hand it passwordless access and let it get on with its job.
We have to generate an SSH key pair and allow localhost login:

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

The above command creates two keys: id_rsa (private key) and id_rsa.pub (public key).
Now we have to authorize it:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

We are done with this part. Now test whether you can log in without a password; if you can, voilà, our SSH layer is ready to go:

ssh localhost
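The very first connection may stop and ask you to accept the host key fingerprint; type yes, or pre-accept it non-interactively (works with the OpenSSH versions shipped with Ubuntu 20.04/22.04):

ssh -o StrictHostKeyChecking=accept-new localhost exit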

2. Download Hadoop

We will be downloading Hadoop. When I say Hadoop, don't take it as a single program; it is a set of Java services which includes:

  • HDFS
  • YARN
  • MapReduce (runtime)

We will keep all third-party software under /opt to keep the system clean and sorted.

cd /opt
sudo wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
sudo tar -xvzf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 hadoop

(https://downloads.apache.org/hadoop/common/) lists all the available Hadoop versions. Pick whichever fits your work best. If you want the current stable version, go into the stable folder, copy the hadoop-x.x.x.tar.gz path, and use it in the wget command above.
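Optionally, verify the download before extracting it. Apache publishes a .sha512 file next to each release; a quick sketch (if your sha512sum version doesn't accept the file's layout, just compare the hashes by eye):

cd /opt
sudo wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz.sha512
sha512sum -c hadoop-3.3.6.tar.gz.sha512   # expect: hadoop-3.3.6.tar.gz: OK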

Change the ownership:

sudo chown -R $USER:$USER /opt/hadoop

3. Environment Variable

As I have told you before, your machine is dumb, so it won't be able to find Hadoop's binaries and config files on its own. We set environment variables to tell Linux:

  • Where Hadoop is Installed
  • Where its Binary File lives
  • Where config files live

Make sure you have an editor installed; it will make managing files much easier going forward. Any IDE works, but my preference is VS Code, so that is what I will reference.

  • HADOOP_HOME: Hadoop root directory
  • HADOOP_CONF_DIR: Hadoop XML configuration files
  • PATH: lets you run hadoop, hdfs, and yarn from anywhere

Go to your home directory:

cd ~

Here you will find a file named .bashrc. This file contains all the environment variables and commands that should run at shell startup.

Open .bashrc in VSCode.

At the end of the file, add the lines below.

export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Save the file and run the command below in the terminal.

source ~/.bashrc

Now we have to check whether Hadoop is wired up properly. If it is, we have completed another step successfully.

hadoop version

4. Hadoop Configuration Files

Hadoop behaves exactly how we tell it to behave (within its limitations), and to tell Hadoop how to behave we don't use a broom like mom used to. We give it a manual, and that manual is provided through XML files.
These files are found at $HADOOP_CONF_DIR.
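A quick way to see the files we are about to edit (all of these names ship with Hadoop 3.x):

ls $HADOOP_CONF_DIR | grep -E 'core-site|hdfs-site|mapred-site|yarn-site|hadoop-env|workers'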

4.1. core-site.xml

This actually controls the file system abstraction and set the default filesystem URI

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <!-- Default filesystem -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>

  <!-- Temporary directory used by Hadoop -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop/tmp</value>
  </property>

</configuration>

Replace the contents of core-site.xml with the configuration above; don't forget to back up the default core-site.xml first.

What this change actually means: every file operation defaults to HDFS, the NameNode runs on localhost, and the HDFS RPC port is set to 9000.

  • Every hdfs dfs command uses this URI (see the quick example below)
  • Spark also reads this when accessing HDFS
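For instance, once the daemons are up at the end of Step 1, these two commands list the same root, because the short form resolves against fs.defaultFS:

hdfs dfs -ls /
hdfs dfs -ls hdfs://localhost:9000/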

create a temp dir:

mkdir -p /opt/hadoop/tmp

4.2. hdfs-site.xml

This controls HDFS storage: replication, NameNode metadata, and DataNode block storage.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <!-- Single node replication -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

  <!-- NameNode metadata storage -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///opt/hadoop/data/namenode</value>
  </property>

  <!-- DataNode block storage -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///opt/hadoop/data/datanode</value>
  </property>

  <!-- Enable Web UI -->
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>

</configuration>

Replace the contents of hdfs-site.xml with the configuration above; again, back up the default hdfs-site.xml first.
We will create the directories ourselves, because Hadoop will not create them on its own.

mkdir -p /opt/hadoop/data/namenode
mkdir -p /opt/hadoop/data/datanode

4.3. mapred-site.xml

This controls the MapReduce execution engine (needed for YARN + Spark).

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <!-- Run MapReduce on YARN -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>

  <!-- MapReduce job history server address -->
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>localhost:10020</value>
  </property>

  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>localhost:19888</value>
  </property>

</configuration>

When we run Spark on YARN it reuses the MapReduce shuffle service, so this configuration is mandatory even if you never run MR jobs.

4.4. yarn-site.xml

This controls YARN resource management and container execution.

<?xml version="1.0"?>

<configuration>

  <!-- Enable shuffle service -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>

  <!-- ResourceManager hostname -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>

  <!-- Memory allocation (adjust to your RAM) -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
  </property>

  <!-- CPU allocation -->
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>2</value>
  </property>

  <!-- Minimum container memory -->
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>512</value>
  </property>

  <!-- Maximum container memory -->
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>4096</value>
  </property>

</configuration>

Spark executors run as YARN containers, and these limits decide how big an executor can be.
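To make that concrete (jumping ahead to after Spark is installed in Step 2): an executor request has to fit inside the 4096 MB container ceiling, including the memory overhead Spark adds on top of the executor memory. With the values above, something like this fits, while asking for 4g would be refused because 4g plus overhead exceeds the maximum allocation. The numbers are illustrative, not from the original post:

pyspark \
  --master yarn \
  --executor-memory 2g \
  --executor-cores 1 \
  --num-executors 2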

4.5. hadoop-env.sh

This tells Hadoop which Java it has to use; without it, the Hadoop daemons will fail to start.

# Point JAVA_HOME at the JDK you actually installed
# (e.g. /usr/lib/jvm/java-8-openjdk-amd64 if you went with Java 8)
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HEAPSIZE=1024
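If you're not sure what to put in JAVA_HOME, a quick way to check (paths shown are the usual Ubuntu defaults; adjust if yours differ):

ls /usr/lib/jvm/              # installed JDK directories
readlink -f "$(which java)"   # the binary 'java' currently resolves to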

4.6. Slaves (Workers).

Previously this file was known as slaves; now we have gotten civilized and call the same thing workers. It tells Hadoop where the DataNode and NodeManager will run.

Go to the Hadoop configuration directory.

cd $HADOOP_CONF_DIR

Open the workers file.

nano workers

Make sure it contains exactly this:

localhost

If it doesn't, edit the file and save it.

4.7. Verify Configuration

This step ensures that Hadoop is properly initialized and actually running, not just configured on disk.

Format the NameNode (first-time task only):

hdfs namenode -format

This creates the NameNode metadata and namespace; without formatting, HDFS cannot start. Make sure you do this only once in the lifetime of the Hadoop installation, because formatting again later will delete the HDFS metadata.

Now this is done. Hadoop is installed, so let's start the services.

start-dfs.sh
start-yarn.sh

This starts the HDFS daemons (storage layer) and the YARN daemons (resource layer).

Verify the running daemons:

jps

This command should return something like the output sketched below:

jps Output

If you see this, all the Hadoop JVM processes are alive. If anything is missing, Hadoop is not fully up; in that case, retrace your steps.
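For reference, a healthy single-node setup typically shows these six processes (the PIDs will differ on your machine):

jps
# 23461 NameNode
# 23605 DataNode
# 23825 SecondaryNameNode
# 24067 ResourceManager
# 24212 NodeManager
# 24530 Jps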


5. Web Interface

These show the real-time cluster state.

  • NameNode UI: http://localhost:9870
  • YARN UI: http://localhost:8088

Name Node UI

We can also browse the HDFS directories from Utilities > Browse the file system.

HDFS file system

If both open, Hadoop is running correctly.

6. Confirmation

Confirmation is important: we must ensure HDFS and YARN work and the daemons are healthy.

hdfs dfs -mkdir -p /user/$USER
hdfs dfs -ls /user

If this works, it confirms that the Hadoop layer is stable.
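As an extra (optional) smoke test, you can round-trip a small file through HDFS; if both the write and the read succeed, storage is healthy:

echo "hello hdfs" > /tmp/hello.txt
hdfs dfs -put -f /tmp/hello.txt /user/$USER/
hdfs dfs -cat /user/$USER/hello.txt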


STEP 2

Installing Spark

What we will be doing here:

  • Install Spark
  • Tell Spark where the Hadoop configs live
  • Make Spark submit jobs to YARN
  • Enable Spark to read/write HDFS

In this setup Spark always depends on YARN for resources; it does not run its own cluster.

1. Download Spark (Hadoop Compatible)

We will download the pre-built Spark binary that already includes the Hadoop integration libraries. Spark internally relies on the Hadoop FileSystem API to talk to HDFS and on the YARN client APIs to request containers. If Spark is not built with Hadoop, it won't be able to read or write HDFS or submit applications to YARN.

(Browse https://downloads.apache.org/spark/ to find the link for a suitable version.)
Copy the link for the folder shown in the picture below.

Spark Directory Suitable With Hadoop

Choose the Hadoop-compatible Spark .tgz file to go forward with the installation.

cd /opt
sudo wget https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

Extract:

sudo tar -xvzf spark-3.5.0-bin-hadoop3.tgz
sudo mv spark-3.5.0-bin-hadoop3 spark
sudo chown -R $USER:$USER /opt/spark

The Hadoop-aware Spark binaries are now on our machine, but not yet connected to the cluster.


2. Setting Up Spark Environment Variables

This is done so that our machine knows where Spark is installed, where the Spark commands live, and which Python Spark should use. Linux doesn't automatically know about software installed in /opt, so by setting the variables below we make it aware that Spark is installed.

  • SPARK_HOME: Spark root directory
  • PATH: where spark-shell, spark-submit, pyspark live
  • PYSPARK_PYTHON: avoids Python version mismatches

This ensures the commands can run from anywhere and that PySpark uses python3 consistently.

Open .bashrc in VS Code and add the following lines at the end.

# Spark
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=python3

Save the .bashrc file and run the command below in the terminal:

source ~/.bashrc

verify:

spark-shell --version

If this prints the Spark version, congratulations: Spark is successfully installed.

(The -bin-hadoop3 build contains the Hadoop client libraries.)


3. Configure Spark to use Hadoop & YARN

This is one of the most crucial parts, so do not skip it. It explicitly connects Spark to the Hadoop cluster.
Use the commands below.

cd $SPARK_HOME/conf
cp spark-env.sh.template spark-env.sh
nano spark-env.sh

Add the lines below to the spark-env.sh file.

# Use the same JAVA_HOME as in hadoop-env.sh (adjust to your installed JDK)
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export YARN_CONF_DIR=/opt/hadoop/etc/hadoop

Spark does not auto-discover Hadoop; these settings tell Spark which Java runtime to use and where the HDFS and YARN configurations live. If we miss this, Spark won't be able to locate the NameNode, the ResourceManager, or HDFS paths.

4. Prepare HDFS for Spark Execution

We will be creating the required directories.

hdfs dfs -mkdir /spark
hdfs dfs -mkdir -p /user/$USER
hdfs dfs -chmod -R 777 /spark

When we run Spark on YARN, Spark uploads jars and config to HDFS and uses HDFS for application staging, while logs and metadata go under /user/<username>. If the directories aren't there, you get a runtime failure, not a startup error.


5. Run Spark Using YARN (Validation)

Here we will run Spark using Hadoop's resource manager (YARN).

  • Python
pyspark --master yarn

Test:

rdd = sc.parallelize([1, 2, 3, 4, 5])
result = rdd.map(lambda x: x * 10).collect()
print(result)

Exit PySpark:

exit()
  • Scala
spark-shell --master yarn

Test:

sc.parallelize(1 to 5).map(_ * 10).collect()

Exit Spark Scala:

:quit

If this runs perfectly, Apache Spark is ready, and so are we; time to practice some code.
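As one more optional end-to-end check, you can submit a full job to YARN using the SparkPi example that ships with Spark. The jar filename depends on the exact Spark/Scala build you downloaded, so adjust it if yours differs:

spark-submit \
  --master yarn \
  --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar 10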

NOTE:
YARN log aggregation is intentionally skipped here.
It will be covered later when we discuss debugging Spark jobs.
