Every step matters when you are learning something new. With a new technology, setting up its environment properly is the first "baby step" that lets you practice effectively.
Installing Apache Spark involves a few key steps: ensuring the prerequisites are installed, then downloading, extracting, and configuring the Spark binaries for your operating system.
Apache Spark runs on the Java Virtual Machine (JVM), so the Java Development Kit (JDK) is a requirement. Hey, don't panic! Installing the JDK doesn't mean you have to code in Java. It simply provides the JVM, creating the necessary runtime environment that Spark needs to execute its tasks.
Prerequisites
1. Java
Spark requires Java 8 or 11; make sure your system has one of these installed.
My suggestion is to go with Java 8 because:
- Hadoop 3.0.x and 3.2.x only support Java 8
- Hadoop 3.3+ supports Java 8 and 11 (runtime only)
Check Java version:
java -version
If not installed:
sudo apt install openjdk-8-jdk -y # Change according to your choice of version
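If you end up with more than one JDK installed, you can choose which one the java command points to. A quick sketch for Ubuntu (assumes the JDKs were installed via apt):
sudo update-alternatives --config java   # interactively select Java 8 or 11
java -version                            # confirm the active version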
2. Minimum System Requirements
- OS (Operating System): Ubuntu 20.04 or 22.04 https://ubuntu.com/download
- RAM: 4GB (Recommended 8GB)
- Disk: 20GB free space
- CPU: 2 cores
To create this setup on Windows, we first have to create a Linux environment using WSL; see the quick sketch just below.
If you want an installation guide for Mac, drop a comment and a Mac setup post will follow.
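For reference, a minimal sketch of creating the WSL environment (run in an elevated PowerShell or Command Prompt on Windows 10 2004+ / Windows 11; the distro name here is an assumption, list the available ones with wsl --list --online):
wsl --install -d Ubuntu-22.04   # installs WSL2 and the Ubuntu 22.04 distro
Once Ubuntu boots inside WSL, all the Linux commands in this guide run in that shell.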
3. Python (For PySpark)
- Required: Python 3.7+
Check versions:
python3 --version
pip3 --version
If not installed:
sudo apt install python3 python3-pip -y
4. SSH (Mandatory for Hadoop)
Hadoop daemons require passwordless SSH, even on a single machine.
Check SSH:
ssh localhost
If not installed:
sudo apt install openssh-server -y
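If the SSH daemon isn't running after the install, here is a minimal sketch to start it on Ubuntu (the service is normally named ssh; on WSL without systemd, sudo service ssh start does the same job):
sudo systemctl enable --now ssh   # start sshd now and on every boot
sudo systemctl status ssh         # should show "active (running)"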
5. Linux Utilities (Required)
Install basic tools:
sudo apt install -y \
wget \
curl \
rsync \
vim \
nano \
net-tools \
procps
Why:
- rsync: Hadoop file sync
- procps: process tools like ps, used by the Hadoop scripts and for checking daemons
- net-tools: network checks
6. Environment Variables
(We will be sorting this out together)
7. Browser (For UIs)
Any browser of your choice will work:
Chrome, Safari, or Firefox.
8. Permissions
You must:
- Have sudo access
- Be able to write to /opt
NOTE:
Run the commands below. If all of them pass, we are ready to move forward with the installation of Hadoop + Spark.
java -version
python3 --version
ssh localhost
sudo ls /opt
LET'S START WITH WHAT WE ARE ACTUALLY HERE FOR
STEP 1
Install & Configure Hadoop (Single Node Cluster)
1. Setup Passwordless SSH (Mandatory)
What's the use of this?
Hadoop uses SSH to
- Start Daemons
- Stop Daemons
- Manage Nodes (even localhost)
"Even on one machine Hadoop behaves like a cluster"
This is a very common question, and you might be thinking the same: why passwordless if we can just use a password? Well, you can type a password, but what about your machine? Don't mind me here, but your machine is dumb, dumber than you think; it cannot type a password on its own, and we want this dumb thing to get access. That's why we say, "Come on, you dumb thing, take this passwordless access and leave me alone." So: daemons cannot enter passwords, which is why we set up passwordless SSH.
We have to generate the SSH keys and allow localhost login
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
The above command creates two keys: id_rsa (private key) and id_rsa.pub (public key).
Now we have to authorize it:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
We are done with this part. Now test whether you can log in without a password; if you can, voila, our SSH layer is ready to go.
ssh localhost
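If you want to be extra sure the daemons will never get stuck at a prompt, this small check fails instead of asking for a password:
ssh -o BatchMode=yes localhost 'echo SSH OK'   # prints "SSH OK" only if no password is required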
2. Download Hadoop
We will now download Hadoop. When I say Hadoop, don't take it as a single simple program; it is a set of Java services, which includes:
- HDFS
- YARN
- MapReduce (runtime)
We will keep all third-party software under /opt to keep the system clean and organized.
cd /opt
sudo wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
sudo tar -xvzf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 hadoop
https://downloads.apache.org/hadoop/common/ lists all available Hadoop versions. Pick whichever works best for you. If you want the current stable version, go into the stable folder, copy the hadoop-x.x.x.tar.gz link, and use it in the wget command above.
Change the ownership so your user owns the Hadoop directory:
sudo chown -R $USER:$USER /opt/hadoop
3. Environment Variables
As I told you before, your machine is dumb, so it won't find the Hadoop binaries and config files on its own. We set environment variables to tell Linux:
- Where Hadoop is Installed
- Where its Binary File lives
- Where config files live
Make sure you have an editor installed; it will make things easier to manage going forward. Any IDE will do, but my preference is VS Code, so that is what I will suggest here.
| Variable | Purpose |
|---|---|
| HADOOP_HOME | Hadoop root directory |
| HADOOP_CONF_DIR | Hadoop XML configuration files |
| PATH | Run hadoop, hdfs, and yarn commands from anywhere |
cd ~
Here you will find a file named .bashrc. This file contains the environment variables and commands that should run every time a shell starts.
Open .bashrc in VSCode.
At the end of the file, add the lines below.
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Save the file and run the command below in the terminal.
source ~/.bashrc
Now we have to check whether Hadoop is wired up properly. If it is, we have completed another step successfully.
hadoop version
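If hadoop version is not found, a couple of quick checks (assuming the exports above went into ~/.bashrc and were sourced):
echo $HADOOP_HOME   # should print /opt/hadoop
which hadoop        # should print /opt/hadoop/bin/hadoop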
4. Hadoop Configuration Files
Hadoop behaves exactly how we tell it to behave (within its limits), and to tell Hadoop how to behave we don't use a broom like mom used to. We give it a manual, and that manual comes in the form of XML files.
These files will be found at location $HADOOP_CONF_DIR.
4.1. core-site.xml
This controls the filesystem abstraction and sets the default filesystem URI.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- Default filesystem -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<!-- Temporary directory used by Hadoop -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop/tmp</value>
</property>
</configuration>
Replace the contents of core-site.xml with the configuration above; don't forget to back up the default core-site.xml first.
What this change actually means: every file operation defaults to HDFS, the NameNode runs on localhost, and the HDFS RPC port is set to 9000.
- Every hdfs dfs command uses this URI
- Spark also reads this when accessing HDFS
Create the temp directory:
mkdir -p /opt/hadoop/tmp
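To confirm Hadoop actually picks this value up, you can ask it directly; a quick sketch that works even before the daemons are started:
hdfs getconf -confKey fs.defaultFS   # should print hdfs://localhost:9000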
4.2. hdfs-site.xml
This controls HDFS replication, metadata storage, and block storage.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- Single node replication -->
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<!-- NameNode metadata storage -->
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///opt/hadoop/data/namenode</value>
</property>
<!-- DataNode block storage -->
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///opt/hadoop/data/datanode</value>
</property>
<!-- Enable Web UI -->
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
Replace the contents of hdfs-site.xml with the configuration above; again, back up the default hdfs-site.xml first.
We create these directories ourselves because Hadoop will not create them for us.
mkdir -p /opt/hadoop/data/namenode
mkdir -p /opt/hadoop/data/datanode
4.3. mapred-site.xml
This controls the MapReduce execution engine (needed for YARN + Spark).
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- Run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- MapReduce job history server address -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>localhost:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>localhost:19888</value>
</property>
</configuration>
When we run Spark on YARN, it reuses the MapReduce shuffle service, so this configuration matters even if you never run MapReduce jobs.
4.4. yarn-site.xml
This controls YARN resource management and container execution.
<?xml version="1.0"?>
<configuration>
<!-- Enable shuffle service -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- ResourceManager hostname -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
<!-- Memory allocation (adjust to your RAM) -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<!-- CPU allocation -->
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>2</value>
</property>
<!-- Minimum container memory -->
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>512</value>
</property>
<!-- Maximum container memory -->
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>4096</value>
</property>
</configuration>
Spark executors run as YARN containers, and these limits cap how big an executor can be.
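For example, once Spark is wired to YARN in Step 2, a job request has to fit inside these limits: executor memory plus its overhead must stay under yarn.scheduler.maximum-allocation-mb (4096 MB here), and executor cores under the vcore limit. A hedged sketch (the pi.py example ships with the Spark binary we download in Step 2):
spark-submit \
  --master yarn \
  --num-executors 1 \
  --executor-cores 1 \
  --executor-memory 2g \
  $SPARK_HOME/examples/src/main/python/pi.py 10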
4.5. hadoop-env.sh
This tells Hadoop which Java to use; without it, the Hadoop daemons will fail to start.
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64   # use /usr/lib/jvm/java-8-openjdk-amd64 if you installed Java 8
export HADOOP_HEAPSIZE=1024
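Your JAVA_HOME depends on which JDK you installed earlier. If you are not sure of the path, this little sketch prints the JDK home on Ubuntu (assumes javac is on the PATH, which the openjdk-*-jdk packages provide):
dirname "$(dirname "$(readlink -f "$(which javac)")")"   # e.g. /usr/lib/jvm/java-8-openjdk-amd64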
4.6. Workers (formerly Slaves)
Previously this file was known as slaves; now we got civilized and call the same thing workers. It tells Hadoop where the DataNode and NodeManager will run.
Go to the Hadoop configuration directory.
cd $HADOOP_CONF_DIR
Open the workers file.
nano workers
Make sure it contains exactly this:
localhost
If it doesn't, edit the file and save it.
4.7. Verify Configuration
This step ensures that Hadoop is properly initialized and actually running, not just configured on disk.
Format the NameNode (first-time task only):
hdfs namenode -format
This creates the metadata and namespace; without formatting, HDFS cannot start. Make sure you do this only once in the lifetime of the Hadoop installation, because formatting again later will delete the HDFS metadata.
Now that this is done, Hadoop is fully installed and we can start the services.
start-dfs.sh
start-yarn.sh
This starts the HDFS daemons (storage layer) and the YARN daemons (resource layer).
Verify the running daemons:
jps
This command should list the running Hadoop JVM processes: NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager (plus Jps itself).
If you see all of these, every Hadoop JVM process is alive. If anything is missing, Hadoop is not fully up; retrace your steps.
5. Web Interface
These show the real-time cluster state.
| Service | URL |
|---|---|
| NameNode UI | http://localhost:9870 |
| YARN UI | http://localhost:8088 |
We can also browse the HDFS directories from Utilities > Browse the file system in the NameNode UI.
If both open, Hadoop is running correctly.
6. Confirmation
Confirmation is important: we must ensure HDFS and YARN work and the daemons are healthy.
hdfs dfs -mkdir -p /user/$USER
hdfs dfs -ls /user
If this works, it confirms that the Hadoop layer is stable.
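For a slightly stronger end-to-end check, push a small file through HDFS and read it back:
echo "hello hdfs" > /tmp/hello.txt
hdfs dfs -put /tmp/hello.txt /user/$USER/
hdfs dfs -cat /user/$USER/hello.txt   # should print: hello hdfs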
STEP 2
Installing Spark
What we will be doing here:
- Install Spark
- Tell Spark where the Hadoop config lives
- Make Spark submit jobs to YARN
- Enable Spark to read/write HDFS
In this setup Spark always depends on YARN; it does not run its own cluster.
1. Download Spark (Hadoop-Compatible)
We will download the pre-built Spark binary that already includes the Hadoop integration libraries. Spark internally relies on the Hadoop FileSystem API to talk to HDFS and on the YARN client APIs to request containers. If Spark is not built with Hadoop, it won't be able to read or write to HDFS or submit applications to YARN.
All available Spark versions are listed at https://downloads.apache.org/spark/. Open the folder for the version you want and copy the link of the Hadoop-compatible .tgz file (the one with bin-hadoop3 in its name) to go forward with the installation.
cd /opt
sudo wget https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
Extract:
sudo tar -xvzf spark-3.5.0-bin-hadoop3.tgz
sudo mv spark-3.5.0-bin-hadoop3 spark
sudo chown -R $USER:$USER /opt/spark
Hadoop-aware Spark binaries are now on our machine, but they are not yet connected to Hadoop.
2. Setting Up Spark Environment Variables
We do this so the machine knows where Spark is installed, where the Spark commands live, and which Python Spark should use. Linux doesn't automatically know about software installed under /opt, so by setting the variables below we tell it that Spark is there:
- SPARK_HOME: Spark root directory
- PATH: where spark-shell, spark-submit, and pyspark live
- PYSPARK_PYTHON: avoids Python version mismatches
This ensures the commands can be run from anywhere and that PySpark uses python3 consistently.
Open .bashrc in VS Code.
Add the following lines at the end of .bashrc:
# Spark
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=python3
Save the .bashrc file and run the command below in the terminal:
source ~/.bashrc
verify:
spark-shell --version
If this prints the Spark version, congratulations, Spark is successfully installed.
The bin-hadoop3 build contains the Hadoop client libraries.
3. Configure Spark to use Hadoop & YARN
This is one of the most crucial parts; do not skip it. It explicitly connects Spark to the Hadoop cluster.
Use the commands below:
cd $SPARK_HOME/conf
cp spark-env.sh.template spark-env.sh
nano spark-env.sh
Add the lines below to the spark-env.sh file:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64   # match the JAVA_HOME you set in hadoop-env.sh
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export YARN_CONF_DIR=/opt/hadoop/etc/hadoop
Spark does not auto-discover Hadoop. These settings tell Spark which Java runtime to use and where the HDFS and YARN configuration lives. If we miss this, Spark won't be able to locate the NameNode, the ResourceManager, or HDFS paths.
4. Prepare HDFS for Spark Execution
We will create the required directories:
hdfs dfs -mkdir /spark
hdfs dfs -mkdir -p /user/$USER
hdfs dfs -chmod -R 777 /spark
When we run Spark on YARN, Spark uploads jars and config to HDFS and uses HDFS for application staging, and it writes logs and metadata under /user/<username>. If these directories are not available, you get a runtime failure rather than a startup error.
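A quick way to confirm the directories exist before moving on:
hdfs dfs -ls /        # should list /spark and /user
hdfs dfs -ls /user    # should show /user/<your username>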
5. Run Spark Using YARN (Validation)
Here we will run Spark using Hadoop's resource manager (YARN).
- Python:
pyspark --master yarn
Test:
rdd = sc.parallelize([1, 2, 3, 4, 5])
result = rdd.map(lambda x: x * 10).collect()
print(result)
Exit Pyspark:
exit()
- Scala:
spark-shell --master yarn
Test:
sc.parallelize(1 to 5).map(_ * 10).collect()
Exit Spark Scala:
:quit
If these run perfectly, Apache Spark is ready and we are ready to start practicing code.
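You can also confirm that the test sessions above actually went through YARN; a small sketch using the YARN CLI (they will also appear in the YARN UI at http://localhost:8088):
yarn application -list -appStates FINISHED   # the exited pyspark / spark-shell sessions should be listed here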
NOTE:
YARN log aggregation is intentionally skipped here.
It will be covered later when we discuss debugging Spark jobs.