Andrew (he/him)

Posted on Dec 13, 2019 • Edited on Dec 18, 2019

Installing and Running Hadoop and Spark on Ubuntu 18

#hadoop #spark #java #scala

This is a short guide (updated from my previous guides) on how to install Hadoop and Spark on Ubuntu Linux. Roughly this same procedure should work on most Debian-based Linux distros, at least, though I've only tested it on Ubuntu. No prior knowledge of Hadoop, Spark, or Java is assumed.

I'll be setting all of this up on a virtual machine (VM) using Oracle's VirtualBox, so I first need to get an ISO file to install the Ubuntu operating system (OS). I'll download the most recent Long-Term Support (LTS) version of Ubuntu from their website (as of this writing, that's 18.04.3). Setting up a virtual machine is fairly straightforward and since it's not directly relevant, I won't be replicating those linked instructions here. Instead, let's just start with a clean Ubuntu installation...

Installing Java

Hadoop requires Java to be installed, and my minimal-installation Ubuntu doesn't have Java by default. You can check this with the command:

$ java -version

Command 'java' not found, but can be installed with:

sudo apt install default-jre
sudo apt install openjdk-11-jre-headless
sudo apt install openjdk-8-jre-headless

Note: we're going to ignore these suggestions and install Java in a different way.

Hadoop runs smoothly with Java 8, but may encounter bugs with newer versions of Java. So I'd like to install Java 8 specifically. To manage multiple Java versions, I install SDKMAN! (but first I need to install curl):

$ sudo apt install curl -y

...enter your password, and then install SDKMAN! with

$ curl -s "https://get.sdkman.io" | bash

SDKMAN! is a great piece of software that allows you to install multiple versions of all sorts of different packages, languages, and more. You can see a huge list of available software with:

$ sdk ls # or sdk list

To make sure you can use SDKMAN! in every new terminal, run the following command to append a line which sources the SDKMAN! initialisation script whenever a new terminal is opened:

$ echo "source ~/.sdkman/bin/sdkman-init.sh" >> ~/.bashrc
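One thing to watch with the echo-and-append approach: running the same command twice leaves two copies of the line in ~/.bashrc. Here is a minimal sketch of an idempotent alternative. The name append_once is my own hypothetical helper, not part of SDKMAN! or bash:

```shell
# append_once LINE FILE: append LINE to FILE only if FILE does not already
# contain exactly that line (append_once is a hypothetical helper name).
append_once() (
    line=$1
    file=$2
    # -q: quiet, -x: match the whole line, -F: treat LINE as a fixed string
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
)

# Example (commented out so that pasting this block changes nothing):
# append_once 'source ~/.sdkman/bin/sdkman-init.sh' ~/.bashrc
```

The same helper would work for the other lines that get appended to ~/.bashrc later in this guide.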

We're only going to use SDKMAN! to install one thing -- Java. You can list all available versions of a particular installation candidate with:

$ sdk list <software>

So in our case, that's

$ sdk list java

...we can see all of the different available Java versions.

To install a specific version, we use the Identifier in the column all the way on the right with:

$ sdk install <software> <Identifier>

I'm going to install AdoptOpenJDK's Java 8.0.232 (HotSpot), so this command, for me, is:

$ sdk install java 8.0.232.hs-adpt

SDKMAN! candidates are installed, by default, at ~/.sdkman/candidates:

$ ls ~/.sdkman/candidates/java
8.0.232.hs-adpt  current

The current symlink always points to whichever Java version SDKMAN! thinks is the version you're currently using, and this is reflected in the java -version command. After the last step, this command returns:

$ java -version
openjdk version "1.8.0_232"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_232-b09)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.232-b09, mixed mode)

If you install multiple Java versions, you can easily switch between them with sdk use:

$ sdk install java 13.0.1.hs-adpt

...

$ sdk use java 13.0.1.hs-adpt

Using java version 13.0.1.hs-adpt in this shell.

$ java -version
openjdk version "13.0.1" 2019-10-15
OpenJDK Runtime Environment (AdoptOpenJDK)(build 13.0.1+9)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 13.0.1+9, mixed mode, sharing)

We also need to explicitly define the JAVA_HOME environment variable by adding it to the ~/.bashrc file:

$ echo "export JAVA_HOME=\$(readlink -f \$(which java) | sed 's:bin/java::')" >> ~/.bashrc

echo-ing JAVA_HOME should now give us the path to the SDKMAN! directory:

$ echo $JAVA_HOME
/home/andrew/.sdkman/candidates/java/13.0.1.hs-adpt
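A note on how that JAVA_HOME line works: readlink -f resolves the java binary to its real location, and the sed expression deletes the text bin/java from the end, which leaves a trailing slash on the result. If you'd rather not have the trailing slash, here is a small sketch that does the same job with dirname. The function name java_home_from_bin is a hypothetical helper of mine:

```shell
# java_home_from_bin PATH_TO_JAVA: print the installation directory for a
# resolved java binary path such as .../java/8.0.232.hs-adpt/bin/java
# (java_home_from_bin is a hypothetical helper name).
java_home_from_bin() {
    # The inner dirname strips "/java", the outer one strips "/bin".
    dirname "$(dirname "$1")"
}

# Example (commented out; requires java on the PATH):
# export JAVA_HOME="$(java_home_from_bin "$(readlink -f "$(which java)")")"
```

Either form should work for Hadoop; the only difference is the trailing slash.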

Make sure you switch back to Java 8 before continuing with this tutorial:

$ sdk use java 8.0.232.hs-adpt

Using java version 8.0.232.hs-adpt in this shell.

Installing Hadoop

With Java installed, the next step is to install Hadoop. You can get the most recent version of Hadoop from Apache's website. As of this writing, that version is Hadoop 3.2.1 (released 22 Sep 2019). If you click on the link on that webpage, it may redirect you. Click until a *.tar.gz file is downloaded. The link I ended up using was

http://mirrors.whoishostingthis.com/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz

You can download that in the browser, or by using wget in the terminal:

$ wget http://mirrors.whoishostingthis.com/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
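Since the archive comes from a mirror, you may want to compare it against a published SHA-512 digest before unpacking it. This is a sketch under the assumption that you have copied such a digest from the download page; verify_sha512 is a hypothetical helper name, and sha512sum is the standard coreutils tool:

```shell
# verify_sha512 FILE EXPECTED_HEX: succeed only if the SHA-512 digest of FILE
# equals EXPECTED_HEX (verify_sha512 is a hypothetical helper name).
verify_sha512() (
    file=$1
    # Normalise the expected digest to lower case so that case does not matter.
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    # sha512sum prints "<digest>  -" when reading stdin; keep only the digest.
    actual=$(sha512sum < "$file" | cut -d ' ' -f 1)
    [ "$actual" = "$expected" ]
)

# Example (commented out; paste the digest published for your download):
# verify_sha512 hadoop-3.2.1.tar.gz "<published SHA-512 digest>" && echo OK
```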

Unpack the archive with tar, using the -C flag to extract it into the /opt/ directory:

$ sudo tar -xvf hadoop-3.2.1.tar.gz -C /opt/

Remove the archive file and move to the /opt/ directory:

$ rm hadoop-3.2.1.tar.gz && cd /opt

Rename the Hadoop directory and change its ownership so that it's owned by you (my username is andrew) and not root or 1001:

$ sudo mv hadoop-3.2.1 hadoop && sudo chown andrew:andrew -R hadoop

Finally, define the HADOOP_HOME environment variable and add the correct Hadoop binaries to your PATH by echoing the following lines and concatenating them to your ~/.bashrc file:

$ echo "export HADOOP_HOME=/opt/hadoop" >> ~/.bashrc
$ echo "export PATH=\$PATH:\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin" >> ~/.bashrc

Now, when you source your ~/.bashrc (or open any new shell), you should be able to check that Hadoop has been installed correctly:

$ hadoop version
Hadoop 3.2.1
Source code repository...
Compiled by ...
...
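If hadoop version reports "command not found" instead, the usual suspects are an unset HADOOP_HOME or a PATH that doesn't include the Hadoop directories. Here is a small sketch for checking the PATH part. path_contains is a hypothetical helper name:

```shell
# path_contains DIR: succeed if DIR is one of the colon-separated PATH entries
# (path_contains is a hypothetical helper name).
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example (commented out): report what is missing in the current shell.
# [ -n "$HADOOP_HOME" ] || echo "HADOOP_HOME is not set"
# path_contains "$HADOOP_HOME/bin"  || echo "$HADOOP_HOME/bin is not on PATH"
# path_contains "$HADOOP_HOME/sbin" || echo "$HADOOP_HOME/sbin is not on PATH"
```

Wrapping PATH in colons on both sides means an entry only matches as a whole, so /opt/hadoop does not accidentally match /opt/hadoop/bin.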

In order for HDFS to run correctly later, we also need to define JAVA_HOME in the file /opt/hadoop/etc/hadoop/hadoop-env.sh. Find the line in that file which begins with:

# export JAVA_HOME=

and edit it to match the JAVA_HOME variable we defined earlier:

export JAVA_HOME=/home/<username>/.sdkman/candidates/java/8.0.232.hs-adpt

Make sure you change the <username> above to the appropriate username for your setup. In my case, I replace <username> with andrew.
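If you'd rather not open an editor, the same edit can be sketched with sed. This assumes GNU sed (the -i flag edits the file in place), that the file still contains the commented-out "# export JAVA_HOME=" line, and that your Java path contains no | or & characters. set_hadoop_java_home is a hypothetical helper name:

```shell
# set_hadoop_java_home FILE JAVA_HOME_DIR: replace the commented-out
# "# export JAVA_HOME=" line in FILE with an active export
# (set_hadoop_java_home is a hypothetical helper name).
set_hadoop_java_home() {
    sed -i "s|^# export JAVA_HOME=.*|export JAVA_HOME=$2|" "$1"
}

# Example (commented out; substitute your own username):
# set_hadoop_java_home /opt/hadoop/etc/hadoop/hadoop-env.sh \
#     /home/<username>/.sdkman/candidates/java/8.0.232.hs-adpt
```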

Installing Spark

The last bit of software we want to install is Apache Spark. We'll install this in a similar manner to how we installed Hadoop, above. First, get the most recent *.tgz file from Spark's website. I downloaded the Spark 3.0.0-preview (6 Nov 2019) pre-built for Apache Hadoop 3.2 and later with the command:

$ wget http://mirrors.whoishostingthis.com/apache/spark/spark-3.0.0-preview/spark-3.0.0-preview-bin-hadoop3.2.tgz

As with Hadoop, unpack the archive with tar, using the -C flag to extract it into the /opt/ directory:

$ sudo tar -xvf spark-3.0.0-preview-bin-hadoop3.2.tgz -C /opt/

Remove the archive file and move to the /opt/ directory:

$ rm spark-3.0.0-preview-bin-hadoop3.2.tgz && cd /opt

Rename the Spark directory and change its ownership so that it's owned by you (my username is andrew) and not root or 1001:

$ sudo mv spark-3.0.0-preview-bin-hadoop3.2 spark && sudo chown andrew:andrew -R spark

Finally, define the SPARK_HOME environment variable and add the correct Spark binaries to your PATH by echoing the following lines and concatenating them to your ~/.bashrc file:

$ echo "export SPARK_HOME=/opt/spark" >> ~/.bashrc
$ echo "export PATH=\$PATH:\$SPARK_HOME/bin" >> ~/.bashrc

Now, when you source your ~/.bashrc (or open any new shell), you should be able to check that Spark has been installed correctly:

$ spark-shell --version
...
...version 3.0.0-preview
...

Configuring HDFS

At this point, Hadoop and Spark are installed and running correctly, but we haven't yet set up the Hadoop Distributed File System (HDFS). As its name suggests, HDFS is usually distributed across many machines. If you want to build a Hadoop cluster, I've previously written instructions for doing that across a small cluster of Raspberry Pis. But for simplicity's sake, we'll just set up a local, single-machine installation here (what the Hadoop documentation calls pseudo-distributed mode).

To configure HDFS, we need to edit several files located at /opt/hadoop/etc/hadoop/. The first such file is core-site.xml. Edit that file so it has the following XML structure:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

The second file is hdfs-site.xml, which gives the locations of the namenode and datanode directories. Edit that file so it looks like:

<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///opt/hadoop_tmp/hdfs/datanode</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///opt/hadoop_tmp/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

We set dfs.replication to 1 because this is a one-machine cluster -- we can't replicate files any more than once here.

Read more about data replication in HDFS here.

The directories given above (/opt/hadoop_tmp/hdfs/datanode and /opt/hadoop_tmp/hdfs/namenode) must exist and be read/write-able by the current user. So create them now, and adjust their permissions, with:

$ sudo mkdir -p /opt/hadoop_tmp/hdfs/datanode
$ sudo mkdir -p /opt/hadoop_tmp/hdfs/namenode
$ sudo chown andrew:andrew -R /opt/hadoop_tmp
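To confirm that the directories really are usable by the current user before booting HDFS, something like the following sketch can help. check_rw_dir is a hypothetical helper name; it only uses the standard shell test operators:

```shell
# check_rw_dir DIR: succeed if DIR exists, is a directory, and the current
# user can read, write, and enter it (check_rw_dir is a hypothetical helper).
check_rw_dir() {
    [ -d "$1" ] && [ -r "$1" ] && [ -w "$1" ] && [ -x "$1" ]
}

# Example (commented out):
# for d in /opt/hadoop_tmp/hdfs/datanode /opt/hadoop_tmp/hdfs/namenode; do
#     check_rw_dir "$d" || echo "cannot use $d"
# done
```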

The next configuration file is mapred-site.xml, which you should edit to look like:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

...and finally yarn-site.xml, which you should edit to look like:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
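Once the configuration files are edited, it can be worth checking that Hadoop actually picks up the values, since a typo in a property name is silently ignored. A minimal sketch, assuming the hdfs binary is on your PATH: hdfs getconf -confKey prints the value Hadoop resolves for a key. conf_equals is a hypothetical helper name of mine, not part of Hadoop:

```shell
# conf_equals KEY EXPECTED: succeed if Hadoop resolves KEY to EXPECTED
# (conf_equals is a hypothetical helper name).
conf_equals() (
    # Fail if hdfs itself fails, for example because the key is unknown.
    actual=$(hdfs getconf -confKey "$1") || exit 1
    [ "$actual" = "$2" ]
)

# Example (commented out; the values are the ones used in this guide):
# conf_equals fs.defaultFS hdfs://localhost:9000 || echo "check core-site.xml"
# conf_equals dfs.replication 1 || echo "check hdfs-site.xml"
```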

Configuring SSH

If you started with a minimal Ubuntu installation like I did, you may need to first set up SSH, because the start-dfs.sh and start-yarn.sh scripts use ssh to launch the Hadoop daemons, even on localhost. To check whether the SSH server is installed, enter the command

$ which sshd

If nothing is returned, then the SSH server is not installed (this is the case with the minimal Ubuntu installation). To get this up and running, install openssh-server, which will start the SSH service automatically:

$ sudo apt install openssh-server

$ sudo systemctl status ssh
● ssh.service - OpenBSD Secure Shell server
  Loaded: loaded ...
  Active: active...
  ...

To check that this worked, try ssh-ing into localhost:

$ ssh localhost
...
Are you sure you want to continue connecting (yes/no)? yes
...
Welcome to Ubuntu 18.04.3 LTS...
...

Type exit to leave this superfluous self-connection.

Then, create a public-private keypair (if you haven't already):

$ ssh-keygen
Generating public/private rsa key pair.
...

Hit 'enter' / 'return' over and over to create a key in the default location with no passphrase. When you're back to the normal shell prompt, append the public key to your ~/.ssh/authorized_keys file:

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
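The cat-and-append step can be sketched as a small function that avoids adding the same key twice and also restricts the file to its owner, which the SSH server generally expects. authorize_key is a hypothetical helper name, and the paths in the example are the defaults used above:

```shell
# authorize_key PUBKEY_FILE AUTH_FILE: append the public key to AUTH_FILE
# once, then make AUTH_FILE readable and writable by its owner only
# (authorize_key is a hypothetical helper name).
authorize_key() (
    key=$(cat "$1") || exit 1
    touch "$2"
    # Only append when the exact key line is not already present.
    grep -qxF -- "$key" "$2" || printf '%s\n' "$key" >> "$2"
    chmod 600 "$2"
)

# Example (commented out):
# authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
# ssh -o BatchMode=yes localhost true && echo "passwordless SSH works"
```

BatchMode=yes tells ssh never to prompt for a password, so the second example line fails quickly if key-based login is not working.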

You should now be able to boot HDFS. Continue to the next section.

Formatting and Booting HDFS

At this point, we can format the distributed filesystem. BE CAREFUL and do not run the following command unless you are sure there is no important data currently stored in the HDFS because IT WILL BE LOST. But if you're setting up HDFS for the first time on this computer, you've got nothing to worry about:

Format the HDFS with

$ hdfs namenode -format -force

You should get a bunch of output and then a SHUTDOWN_MSG.
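If you're nervous about the warning above, one option is to guard the format command so that it only runs when the namenode directory is still empty. This is a sketch under the assumption that the namenode directory is the one configured in hdfs-site.xml earlier and that a previous format would have left files in it. dir_is_empty is a hypothetical helper name:

```shell
# dir_is_empty DIR: succeed if DIR exists and contains no entries, hidden
# ones included (dir_is_empty is a hypothetical helper name).
dir_is_empty() {
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Example (commented out): only format when nothing has been written yet.
# if dir_is_empty /opt/hadoop_tmp/hdfs/namenode; then
#     hdfs namenode -format
# else
#     echo "namenode directory is not empty; refusing to format" >&2
# fi
```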

We can then boot the HDFS with the following two commands:

$ start-dfs.sh && start-yarn.sh

Note: if you performed a minimal installation, you may need to install openssh-server by following the instructions in the Configuring SSH section above.

You can check that HDFS is running correctly with the command jps:

$ jps
10384 DataNode
11009 NodeManager
4113 ResourceManager
11143 Jps
10218 NameNode
10620 SecondaryNameNode

You should see a NameNode and a DataNode, at minimum, in that list. Check that HDFS is behaving correctly by trying to create a directory, then listing the contents of the HDFS:

$ hdfs dfs -mkdir /test
$ hdfs dfs -ls /
Found 1 items
drwxr-xr-x   - andrew supergroup          0 2019-12-13 13:56 /test

If you can see your directory, you've correctly configured the HDFS!
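The jps check from earlier in this section can also be scripted, which is handy when something fails to start. The sketch below reads jps output on standard input and prints any expected daemon that is missing. missing_daemons is a hypothetical helper name:

```shell
# missing_daemons NAME...: read jps output on stdin and print every NAME that
# does not appear as a process name in the second column
# (missing_daemons is a hypothetical helper name).
missing_daemons() (
    listing=$(cat)
    for name in "$@"; do
        # Exact comparison, so SecondaryNameNode does not count as NameNode.
        printf '%s\n' "$listing" \
            | awk -v n="$name" '$2 == n { found = 1 } END { exit !found }' \
            || printf '%s\n' "$name"
    done
)

# Example (commented out; prints nothing when both daemons are running):
# jps | missing_daemons NameNode DataNode
```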

Monitoring

Hadoop and Spark come with built-in web-based monitors. With the configuration above, the YARN ResourceManager interface is at http://localhost:8088 and the HDFS NameNode interface is at http://localhost:9870; open either address in your browser.

Working with Spark and HDFS

One of the benefits of working with Spark and Hadoop is that they're both Apache products, so they work very nicely with each other. It's easy to read a file from HDFS into Spark to analyse it. To test this, let's copy a small file to HDFS and analyse it with Spark.

Spark comes with some example resource files. With the above configuration, they can be found at /opt/spark/examples/src/main/resources. Let's copy the file users.parquet to HDFS:

$ hdfs dfs -put /opt/spark/examples/src/main/resources/users.parquet /users.parquet

Parquet files are another Apache creation, designed for fast data access and analysis.

Next, open the Spark shell and read in the file with read.parquet:

$ spark-shell
...
Welcome to
... version-3.0.0-preview
...

scala> val df = spark.read.parquet("hdfs://localhost:9000/users.parquet")
df: org.apache.spark.sql.DataFrame = [name: string, favorite_color: string ... 1 more field]

scala> df.collect.foreach(println)
[Alyssa,null,WrappedArray(3,9,15,20)]
[Ben,red,WrappedArray()]

This is just a small example, but it shows how Spark and HDFS can work closely together. You can easily read files from HDFS and analyse them with Spark!

If you want to stop the HDFS, you can run the commands:

$ stop-dfs.sh

and

$ stop-yarn.sh

Conclusion

I hope this guide will be useful for anyone trying to set up a small Hadoop / Spark installation for testing or education. If you're interested in learning more about Hadoop and Spark, please check out my other articles in this series on Dev! Thanks for reading!

Top comments (5)

Divas Gupta • Apr 15 '20

Hi Andrew, Many Thanks for creating such a wonderful installation guide.
I am following your instruction but am facing one issue in between, it will be great if you can help.

I am getting blank line when I am running below two commands -
echo $JAVA_HOME
echo $HADOOP_HOME

PFB my .bashrc file -

# ~/.bashrc: executed by bash(1) for non-login shells.
# see /usr/share/doc/bash/examples/startup-files (in the package bash-doc)
# for examples

# If not running interactively, don't do anything
case $- in
*i*) ;;
*) return;;
esac

# don't put duplicate lines or lines starting with space in the history.
# See bash(1) for more options
HISTCONTROL=ignoreboth

# append to the history file, don't overwrite it
shopt -s histappend

# for setting history length see HISTSIZE and HISTFILESIZE in bash(1)
HISTSIZE=1000
HISTFILESIZE=2000

# check the window size after each command and, if necessary,
# update the values of LINES and COLUMNS.
shopt -s checkwinsize

# If set, the pattern "**" used in a pathname expansion context will
# match all files and zero or more directories and subdirectories.
#shopt -s globstar

# make less more friendly for non-text input files, see lesspipe(1)
[ -x /usr/bin/lesspipe ] && eval "$(SHELL=/bin/sh lesspipe)"

# set variable identifying the chroot you work in (used in the prompt below)
if [ -z "${debian_chroot:-}" ] && [ -r /etc/debian_chroot ]; then
debian_chroot=$(cat /etc/debian_chroot)
fi

# set a fancy prompt (non-color, unless we know we "want" color)
case "$TERM" in
xterm-color|*-256color) color_prompt=yes;;
esac

# uncomment for a colored prompt, if the terminal has the capability; turned
# off by default to not distract the user: the focus in a terminal window
# should be on the output of commands, not on the prompt
#force_color_prompt=yes

if [ -n "$force_color_prompt" ]; then
if [ -x /usr/bin/tput ] && tput setaf 1 >&/dev/null; then
# We have color support; assume it's compliant with Ecma-48
# (ISO/IEC-6429). (Lack of such support is extremely rare, and such
# a case would tend to support setf rather than setaf.)
color_prompt=yes
else
color_prompt=
fi
fi

if [ "$color_prompt" = yes ]; then
PS1='${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\w\[\033[00m\]\$ '
else
PS1='${debian_chroot:+($debian_chroot)}\u@\h:\w\$ '
fi
unset color_prompt force_color_prompt

# If this is an xterm set the title to user@host:dir
case "$TERM" in
xterm*|rxvt*)
PS1="\[\e]0;${debian_chroot:+($debian_chroot)}\u@\h: \w\a\]$PS1"
;;
*)
;;
esac

# enable color support of ls and also add handy aliases
if [ -x /usr/bin/dircolors ]; then
test -r ~/.dircolors && eval "$(dircolors -b ~/.dircolors)" || eval "$(dircolors -b)"
alias ls='ls --color=auto'
#alias dir='dir --color=auto'
#alias vdir='vdir --color=auto'

alias grep='grep --color=auto'
alias fgrep='fgrep --color=auto'
alias egrep='egrep --color=auto'

# colored GCC warnings and errors
#export GCC_COLORS='error=01;31:warning=01;35:note=01;36:caret=01;32:locus=01:quote=01'

# some more ls aliases
alias ll='ls -alF'
alias la='ls -A'
alias l='ls -CF'

# Add an "alert" alias for long running commands. Use like so:
#   sleep 10; alert
alias alert='notify-send --urgency=low -i "$([ $? = 0 ] && echo terminal || echo error)" "$(history|tail -n1|sed -e '\''s/^\s*[0-9]\+\s*//;s/[;&|]\s*alert$//'\'')"'

# Alias definitions.
# You may want to put all your additions into a separate file like
# ~/.bash_aliases, instead of adding them here directly.
# See /usr/share/doc/bash-doc/examples in the bash-doc package.

if [ -f ~/.bash_aliases ]; then
. ~/.bash_aliases
fi

# enable programmable completion features (you don't need to enable
# this, if it's already enabled in /etc/bash.bashrc and /etc/profile
# sources /etc/bash.bashrc).
if ! shopt -oq posix; then
if [ -f /usr/share/bash-completion/bash_completion ]; then
. /usr/share/bash-completion/bash_completion
elif [ -f /etc/bash_completion ]; then
. /etc/bash_completion
fi
fi

#THIS MUST BE AT THE END OF THE FILE FOR SDKMAN TO WORK!!!
export SDKMAN_DIR="/home/divas/.sdkman"
[[ -s "/home/divas/.sdkman/bin/sdkman-init.sh" ]] && source "/home/divas/.sdkman/bin/sdkman-init.sh"

#source ~/.sdkman/bin/sdkman-init.sh

export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME=$(readlink -f $(which java) | sed 's:bin/java::')

Thanks,
Divas

Nergiz Ünal • Mar 12 '20

I am very thankful for your well-explained post. I followed all of the instructions, but I could not get past the jps command part :( I cannot see the NameNode and DataNode even though I performed all of the steps before that. Can you help me with this?
Thank you :)

Sudhakar Daggubati • Dec 23 '19 • Edited

Hi Andrew,
Came across your blog as I am exploring how to set up a Pi cluster. Nice, crisp write-ups; have you given any thought to Ansible and Vagrant to avoid manual provisioning?

As you are trying different aspects, Ansible playbooks help you with speed and flexibility.

Anyway going to keep an eye😊
Thanks

Funso Iyaju • Apr 21 '20

Thanks Andrew for this post.

kbor72 • Sep 22 '20

Hi Andrew,
Great tutorial!
Thank you a lot, very useful, even today on 2020-09-21!
