Building a Raspberry Pi Hadoop / Spark Cluster

Andrew (he/him) on July 08, 2019

Pro tip: if you're only looking for how to configure Hadoop and Spark to run on a cluster, start here.

Rohit Das

Hi. I am not able to get the hadoop version or java version for the nodes over ssh using clustercmd. I have set up a cluster of 5 Pis (model B+) with 32 GB micro SD cards in all of them. On running

clustercmd hadoop version | grep Hadoop

I get the following error:

bash: hadoop: command not found
bash: hadoop: command not found
bash: hadoop: command not found
bash: hadoop: command not found
Hadoop 3.2.0

I am attaching the .bashrc here. Please help. Thanks.

# ~/.bashrc: executed by bash(1) for non-login shells.
# see /usr/share/doc/bash/examples/startup-files (in the package bash-doc)
# for examples

# If not running interactively, don't do anything
case $- in
    *i*) ;;
      *) return;;
esac

# don't put duplicate lines or lines starting with space in the history.
# See bash(1) for more options
HISTCONTROL=ignoreboth

# append to the history file, don't overwrite it
shopt -s histappend

# for setting history length see HISTSIZE and HISTFILESIZE in bash(1)
HISTSIZE=1000
HISTFILESIZE=2000

# check the window size after each command and, if necessary,
# update the values of LINES and COLUMNS.
shopt -s checkwinsize

# If set, the pattern "**" used in a pathname expansion context will
# match all files and zero or more directories and subdirectories.
#shopt -s globstar

# make less more friendly for non-text input files, see lesspipe(1)
#[ -x /usr/bin/lesspipe ] && eval "$(SHELL=/bin/sh lesspipe)"

# set variable identifying the chroot you work in (used in the prompt below)
if [ -z "${debian_chroot:-}" ] && [ -r /etc/debian_chroot ]; then
    debian_chroot=$(cat /etc/debian_chroot)
fi

# set a fancy prompt (non-color, unless we know we "want" color)
case "$TERM" in
    xterm-color|*-256color) color_prompt=yes;;
esac

# uncomment for a colored prompt, if the terminal has the capability; turned
# off by default to not distract the user: the focus in a terminal window
# should be on the output of commands, not on the prompt
force_color_prompt=yes

if [ -n "$force_color_prompt" ]; then
    if [ -x /usr/bin/tput ] && tput setaf 1 >&/dev/null; then
    # We have color support; assume it's compliant with Ecma-48
    # (ISO/IEC-6429). (Lack of such support is extremely rare, and such
    # a case would tend to support setf rather than setaf.)
    color_prompt=yes
    else
    color_prompt=
    fi
fi

if [ "$color_prompt" = yes ]; then
    PS1='${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\w \$\[\033[00m\] '
else
    PS1='${debian_chroot:+($debian_chroot)}\u@\h:\w\$ '
fi
unset color_prompt force_color_prompt

# If this is an xterm set the title to user@host:dir
case "$TERM" in
xterm*|rxvt*)
    PS1="\[\e]0;${debian_chroot:+($debian_chroot)}\u@\h: \w\a\]$PS1"
    ;;
*)
    ;;
esac

# enable color support of ls and also add handy aliases
if [ -x /usr/bin/dircolors ]; then
    test -r ~/.dircolors && eval "$(dircolors -b ~/.dircolors)" || eval "$(dircolors -b)"
    alias ls='ls --color=auto'
    #alias dir='dir --color=auto'
    #alias vdir='vdir --color=auto'

    alias grep='grep --color=auto'
    alias fgrep='fgrep --color=auto'
    alias egrep='egrep --color=auto'
fi

# colored GCC warnings and errors
#export GCC_COLORS='error=01;31:warning=01;35:note=01;36:caret=01;32:locus=01:quote=01'

# some more ls aliases
#alias ll='ls -l'
#alias la='ls -A'
#alias l='ls -CF'

# Alias definitions.
# You may want to put all your additions into a separate file like
# ~/.bash_aliases, instead of adding them here directly.
# See /usr/share/doc/bash-doc/examples in the bash-doc package.

if [ -f ~/.bash_aliases ]; then
    . ~/.bash_aliases
fi

# enable programmable completion features (you don't need to enable
# this, if it's already enabled in /etc/bash.bashrc and /etc/profile
# sources /etc/bash.bashrc).
if ! shopt -oq posix; then
  if [ -f /usr/share/bash-completion/bash_completion ]; then
    . /usr/share/bash-completion/bash_completion
  elif [ -f /etc/bash_completion ]; then
    . /etc/bash_completion
  fi
fi

# get hostname of other pis
function otherpis {
    grep "pi" /etc/hosts | awk '{print $2}' | grep -v $(hostname)
}

# send commands to other pis
function clustercmd {
    for pi in $(otherpis); do ssh $pi "$@"; done
    $@
}

# restart all pis
function clusterreboot {
    for pi in $(otherpis);do
        ssh $pi "sudo shutdown -r 0"
    done
}

# shutdown all pis
function clustershutdown {
    for pi in $(otherpis);do
        ssh $pi "sudo shutdown 0"
    done
}

# send files to all pis
function clusterscp {
    for pi in $(otherpis); do
        cat $1 | ssh $pi "sudo tee $1" > /dev/null 2>&1
    done
}

function clusterssh {
    for pi in $(otherpis); do
        ssh $pi $1
    done
}

# sudo htpdate -a -l time.nist.gov

export JAVA_HOME=/usr/local/jdk1.8.0/
export HADOOP_HOME=/opt/hadoop
export SPARK_HOME=/opt/spark
export PATH=/usr/local/jdk1.8.0/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$PATH
export HADOOP_HOME_WARN_SUPRESS=1
export HADOOP_ROOT_LOGGER="WARN,DRFA"

Andrew (he/him)

Did you follow these steps?


Create the Directories

Create the required directories on all other Pis using:

$ clustercmd sudo mkdir -p /opt/hadoop_tmp/hdfs
$ clustercmd sudo chown pi:pi -R /opt/hadoop_tmp
$ clustercmd sudo mkdir -p /opt/hadoop
$ clustercmd sudo chown pi:pi /opt/hadoop

Copy the Configuration

Copy the files in /opt/hadoop to each other Pi using:

$ for pi in $(otherpis); do rsync -avxP $HADOOP_HOME $pi:/opt; done

This will take quite a long time, so go grab lunch.

When you're back, verify that the files copied correctly by querying the Hadoop version on each node with the following command:

$ clustercmd hadoop version | grep Hadoop
Hadoop 3.2.0
Hadoop 3.2.0
Hadoop 3.2.0
...

You can't run the hadoop command on the other Pis until you've copied over those Hadoop directories. If you have done that, you also need to make sure that directory is on the $PATH of the other Pis by including the following lines in each of their .bashrc files (sorry, I don't think I included this step in the instructions):


export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

You could also simply clusterscp the .bashrc file from Pi #1 to each of the other Pis.
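
For example, something like this should work (a sketch that uses the clusterscp and clustercmd helpers defined in the .bashrc quoted above; see the thread below about where in .bashrc those exports need to live for non-interactive shells):

# push this Pi's ~/.bashrc to the same path on every other Pi
clusterscp ~/.bashrc

# then re-check that every node can find hadoop
clustercmd hadoop version | grep Hadoop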


Rohit Das

Hi. Thanks for the reply. Yes, I did the steps you mentioned. Since Java wasn't pre-installed, I installed it manually on each Pi and checked each one individually to see that it was working. As you can see below, the env variables are configured as you have suggested.

export JAVA_HOME=/usr/local/jdk1.8.0/
export HADOOP_HOME=/opt/hadoop
export SPARK_HOME=/opt/spark
export PATH=/usr/local/jdk1.8.0/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$PATH
export HADOOP_HOME_WARN_SUPRESS=1
export HADOOP_ROOT_LOGGER="WARN,DRFA"

Rohit Das

Thanks. I resolved the issue by putting the PATH exports above the following part in .bashrc:

# If not running interactively, don't do anything
case $- in
    *i*) ;;
      *) return;;
esac

I also put the export PATH commands in /etc/profile of each Pi. Thanks.
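
In other words, the exports have to sit above the early return, because the remote shells that clustercmd opens are non-interactive and never reach anything below it. A minimal sketch of the working order (the relevant exports from above, just moved up):

# top of ~/.bashrc
export JAVA_HOME=/usr/local/jdk1.8.0/
export HADOOP_HOME=/opt/hadoop
export SPARK_HOME=/opt/spark
export PATH=/usr/local/jdk1.8.0/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$PATH

# If not running interactively, stop here: non-interactive shells
# (e.g. ssh pi2 'hadoop version') skip everything below this point
case $- in
    *i*) ;;
      *) return;;
esac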

Razvan T. Coloja • Edited

For a more colourful clustercmd prompt, put this in your ~/.bashrc:

# clustercmd
function clustercmd {
        for I in 1 2 3 4 5 6 7 8; do echo -e "\e[40;38;5;82m Cluster node \e[30;48;5;82m $I \e[0m \e[38;5;4m --------------"; ssh rpi$I "$@"; done
        $@
}

then do a

$ source ~/.bashrc

Looks like this:

screenshot

Andrew (he/him)

Looks nice! Thanks, Razvan!

PiDevi

Hi Andrew, thanks for sharing this with us!

Following your instructions, I have managed to run HDFS and YARN on my cluster using OpenJDK 8 instead of Oracle Java (which is not pre-installed on Raspbian Buster and can't be installed via apt).

However, when I run the wordcount example,

> yarn jar hadoop-mapreduce-examples-3.2.0.jar wordcount "/books/alice.txt" output

the following error occurs:

Error: Java heap space
Error: Java heap space
Error: Java heap space

and the job fails.

Do you have any idea what the reason might be?
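
One thing worth checking (not from the article, and only a guess): whether the YARN container sizes and the MapReduce task heaps are set explicitly and consistently for 1 GB nodes, since "Error: Java heap space" comes from the task JVMs. Below is a hypothetical mapred-site.xml tweak; the property names are standard Hadoop 3 settings, but the values are illustrative and may need adjusting up or down for your Pis:

# note: this overwrites mapred-site.xml; merge with your existing properties instead if you have more
cat > $HADOOP_HOME/etc/hadoop/mapred-site.xml <<'EOF'
<configuration>
  <property><name>mapreduce.framework.name</name><value>yarn</value></property>
  <property><name>yarn.app.mapreduce.am.resource.mb</name><value>512</value></property>
  <property><name>mapreduce.map.memory.mb</name><value>256</value></property>
  <property><name>mapreduce.map.java.opts</name><value>-Xmx204m</value></property>
  <property><name>mapreduce.reduce.memory.mb</name><value>256</value></property>
  <property><name>mapreduce.reduce.java.opts</name><value>-Xmx204m</value></property>
</configuration>
EOF

# push the file to every node and restart YARN so the limits take effect
clusterscp $HADOOP_HOME/etc/hadoop/mapred-site.xml
stop-yarn.sh && start-yarn.sh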

Andreas Komninos

Thank you for this superb article. I have been following it to deploy a Hadoop/Spark cluster using the latest Raspberry Pi 4 (4GB). I encountered one problem, which was that after completing the tutorial, the spark job was not being assigned. I got a warning:
INFO yarn.Client: Requesting a new application from cluster with 0 NodeManagers and then it sort of got stuck on
INFO yarn.Client: Application report for application_1564655698306_0001 (state: ACCEPTED). I will describe later how I solved this.

First, I want to note that the latest Raspbian version (Buster) does not include Oracle Java 8, which is required by Hadoop 3.2.0. There is no easy way to get it set up, but it can be done. First you need to manually download the tar.gz file from Oracle's site (this requires a registration). I put it up on a personal webserver so it can be easily downloaded from the Pis. Then, on each Pi:

download java package

cd ~/Downloads
wget /jdk8.tar.gz

extract package contents

sudo mkdir /usr/java
cd /usr/java
sudo tar xf ~/Downloads/jdk8.tar.gz

update alternative configurations

sudo update-alternatives --install /usr/bin/java java /usr/java/jdk1.8.0_221/bin/java 1000
sudo update-alternatives --install /usr/bin/javac javac /usr/java/jdk1.8.0_221/bin/javac 1000

select desired java version

sudo update-alternatives --config java

check the java version changes

java -version

Next, here is how I solved the YARN problem. In your tutorial section "Configuring Hadoop on the Cluster", after the modifications to the xml files have been made on Pi1, two files need to be copied across to the other Pis: these are yarn-site.xml and mapred-site.xml. After copying, YARN needs to be restarted on Pi1.
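
Concretely, that copy-and-restart step might look like this (a sketch assuming the clusterscp helper from the .bashrc earlier in the thread and the default $HADOOP_HOME/etc/hadoop layout):

# push the edited configs from Pi1 to every other Pi
clusterscp $HADOOP_HOME/etc/hadoop/yarn-site.xml
clusterscp $HADOOP_HOME/etc/hadoop/mapred-site.xml

# restart YARN on Pi1 so the ResourceManager picks up the new settings
stop-yarn.sh
start-yarn.sh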

To set appropriate values for the memory settings, I found a useful tool which is described on this thread stackoverflow.com/questions/495791...

Copy-pasting the instructions:

get the tool

wget public-repo-1.hortonworks.com/HDP/...
tar zxvf hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
rm hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
mv hdp_manual_install_rpm_helper_files-2.6.0.3.8/ hdp_conf_files

run the tool

python hdp_conf_files/scripts/yarn-utils.py -c 4 -m 8 -d 1 false

-c  number of cores on each node
-m  amount of memory on each node (GB)
-d  number of disks on each node
-bool  "True" if HBase is installed; "False" if not

This should provide appropriate settings to use. After the xml files have been edited and YARN has been restarted, you can try this command to check that all the worker nodes are active.

yarn node -list

PiDevi • Edited

Hi Andreas,

I am running Raspbian Buster on my Pis, too. I downloaded the "Linux ARM 64 Hard Float ABI" package (jdk-8u231-linux-arm64-vfp-hflt.tar.gz), followed your instructions, and I get the following error when running java -version:

-bash: /usr/bin/java: cannot execute binary file: Exec format error

I guess this Java build is not compatible with the Pi. Which exact file did you download from the Oracle site?

Sliver88 • Edited

First of all I'd like to thank Andrew for a superb tutorial. Besides some minor alterations I had to make, I was able to set up HDFS etc., but I am now running into the same problem as you, Andreas.
The first thing I'd like to add to your recommendations is that installing Java is easier this way:
sudo apt-get install openjdk-8-jdk
and then change the default (as you suggested already):
sudo update-alternatives --config java
sudo update-alternatives --config javac

Then change export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::") to export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-armhf both in ~/.bashrc and in /opt/hadoop/etc/hadoop/hadoop-env.sh.
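
To apply that on every node in one go, something like the following should work (a sketch; the sed assumes hadoop-env.sh already contains an uncommented export JAVA_HOME= line, otherwise just edit the file by hand):

# run on each Pi (or over ssh); remove any older JAVA_HOME export from ~/.bashrc so the two don't conflict
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-armhf' >> ~/.bashrc
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-armhf|' /opt/hadoop/etc/hadoop/hadoop-env.sh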

The part I have been stuck on for a while, though, is that the yarn node -list command hangs,
and if I try to run a Spark job it also gets stuck at the ACCEPTED stage.
I haven't yet tried your suggestion.

PS: I know it is a year-old article, but it is still the best I've seen so far in my research.

Ευχαριστώ πολύ και τους 2 (I would like to thank you both)

sshu2017 • Edited

Hi Andrew,
This is an excellent tutorial so THANK YOU very much!

I had quite a strange problem, though, related to the clustercmd function. At the moment, I have installed Java and Hadoop on both workers (which I call rpi102 and rpi103). As you can see from the terminal on the right, I can SSH from the master (rpi101) into rpi102, run the hadoop version command, and get what I expected.

However, as you can see from the terminal on the left, when I run ssh rpi102 hadoop version, the hadoop command is not found. But if I try something else, like ssh rpi102 whoami, it works fine.

This seems so odd and really puzzles me. Do you have any idea what the issue could be? Thanks in advance!

screenshot

sshu2017

Never mind. This answer helped me out.
superuser.com/a/896454/955273
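
For anyone else hitting the same symptom, a quick diagnostic (not the linked fix itself, just a way to see the difference) is to compare what a non-interactive ssh command sees with what an interactive login sees:

# non-interactive shell: only whatever ~/.bashrc exports before its early return
ssh rpi102 'echo $PATH; which hadoop'

# then log in interactively with `ssh rpi102` and run the same two commands for comparison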

Yui Chun Leung

Did you solve it with this command?

clustercmd sudo -E env "PATH=$PATH" hadoop version | grep Hadoop

scurveedog

Hello Andrew

Fantastic article/tutorial that made my deployment of a 4-node Pi 4 cluster almost trivial! I also think your use of bash shell scripting along the way is brilliantly applied to the project; a model I will seek to emulate.

Jangbae • Edited

It was a great tutorial! I was able to successfully build a Hadoop system with four Raspberry Pi 4s. There were a few hiccups while I was following this tutorial, but they could be solved by following other people's postings. Thanks a lot!

Jangbae

It was a bit early for me to declare my cluster all set. The problem I'm facing is that the worker nodes are not showing up. I have a master node and three worker nodes in the cluster, but "hdfs dfsadmin -report" only shows the master node. I double-checked the XML files to make sure there were no typos and that they had been copied to the worker nodes correctly, but found nothing. Do you have any suggestions for figuring this out?

Scott Sims

Hi Andrew,
First off, great stuff. Thank you so much for being as thorough as you are. This was a well-put-together #howto. Secondly, I'd like to highlight some issues I came across while following your post.

This is to be expected, but the links to the Hadoop and Spark installers are no longer valid. I simply followed your link down the rabbit hole to find the most up-to-date available installer.

Next, I ran into the same issue that Rohit Das noted below. I simply followed his fix and it worked perfectly. Thank you, Rohit, for figuring this out.

Thirdly, I ran into an issue where the namenode and datanodes were up, but the datanodes weren't able to connect to pi1. Looking in the UI, everything was 0 and 0 datanodes were registered. After chasing SO post after SO post, I finally got it to work by changing, in core-site.xml,

hdfs://pi1:9000
to
hdfs://{ip address}:9000

where {ip address} is the full IP address of Pi 1.
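
If you want to script that change, something along these lines should do it (a sketch: 192.168.0.101 is a made-up address standing in for Pi 1's real one, and fs.defaultFS in core-site.xml is the property that holds the hdfs:// URL):

# swap the hostname for the master's literal IP, push the file to every node, and restart HDFS
sed -i 's|hdfs://pi1:9000|hdfs://192.168.0.101:9000|' $HADOOP_HOME/etc/hadoop/core-site.xml
clusterscp $HADOOP_HOME/etc/hadoop/core-site.xml
stop-dfs.sh && start-dfs.sh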

Lastly, I mounted 250 GB SSDs on my workers. Once I get the optimal resource limits for these worked out in the configuration XMLs, I'll be sure to post here what they were.

I should note my NameNode is a Raspberry Pi 3 but my workers are Raspberry Pi 4s with 4 GB of RAM.

रमण मूर्ति

Thanks for the article.

Andrew (he/him)

Thanks for checking it out!

SunnyMo

Excellent article! I have a Pi 3B+ and a Pi Model B (the oldest 256 MB version). Is it possible to run a Spark cluster, just for study?

Andrew (he/him)

My Pis are struggling as-is. I wouldn't recommend any lower specs than the Model 3 B+ (1 GB RAM).

SunnyMo

I do want to make use of the old Pi, even if only as a NameNode; I think that doesn't need much computational resource. I'm new to Spark, so the question might be silly. Sorry about that.

Andrew (he/him) • Edited

Even though NameNodes aren't processing data, they still have some CPU and memory requirements (they have to orchestrate the data processing, maintain records of the filesystem, etc.). I saw somewhere that 4GB per node was the recommended minimum. All I know from experience is that 1GB seems to barely work.

Spark sets minimum memory limits and I don't think 256MB is enough to do anything.
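
For a sense of scale, even the bundled SparkPi example is usually submitted with driver and executor memory well above 256 MB. Illustrative values only (the examples jar ships with Spark under $SPARK_HOME/examples/jars):

spark-submit \
  --master yarn \
  --driver-memory 512m \
  --executor-memory 512m \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 100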

SunnyMo

Okay, then the only thing the 256 MB Pi can do may be running an Nginx reverse proxy in my private cloud, or RIP. Thanks for that.

Andrew (he/him)

Maybe you could turn it into a Pi-Hole?

SunnyMo

Unfortunately, the Pi-hole project requires at least 512 MB of memory. My old Pi should R.I.P. right now; I'll leave it to my children as their first gift from the elder generation.

Wessel Valkenburg

Thanks for this excellent guide! Haven't tried it yet, but looks remarkably complete and pedagogical.

Some questions:

  • regarding cost: you give an awesome hardware summary. How much time did you invest? As always, the human resource is probably the most expensive part (given the price of your level of expertise on the free market).
  • have you measured the performance of this cluster, and can you compare it to realistic real-world scenarios with x86 machines?
  • have you installed / tried to install Apache Impala on the cluster? Not sure that's "physically" possible, but I would guess you are the authority who can give a final answer. I'm asking, because I'm under the impression that Impala is by far the fastest SQL solution for Hadoop.
  • Further on SQL: how do queries on your cluster compare in performance to queries on a single big machine with a plain Postgres database (or MariaDB, or whatever)?
  • Finally, do you think this setup fills an empty spot in computing space for the commercial market? I'm thinking of small companies who want a better data cluster but cannot afford full-fledged big-data solutions.

Andrea Luciano Damico

Amazing! Now I kind of want to make one, although I'm not sure what I could do with such a cluster.

Andrew (he/him)

Impress your friends!

Crush your enemies!

ADARI GIRISH KUMAR

Excellent article! You have covered the whole process in a very detailed manner. Thanks.

Connor Luckett

Hi, it looks like you've just tried this with computational tasks (calculating pi). We're trying this with Spark SQL and facing out-of-memory errors on our 3B+ cluster. Have you tried this on a memory-intensive job?

Andrew (he/him)

Yes, we were doing some benchmarking on another machine, running random forests with X number of trees. I tried to run the same script on the Pi cluster and got OutOfMemoryErrors with minuscule datasets.

Davor Vuković

Thank you very much for this article! I am definitely going to recreate this at home.