Video version of this article: https://www.youtube.com/watch?v=Slbi-uzPtnw
Credits: @codewitharjun
(The video uses a CLI text editor to edit the config files; this tutorial uses a normal graphical text editor.)
First, install OpenJDK 8:
sudo apt install openjdk-8-jdk
(Optional) To check that it is there:
cd /usr/lib/jvm
(the directory should contain java-8-openjdk-amd64; run ls to see it)
Now make sure you are in your home directory; if not, run:
cd ~
Open the .bashrc file:
sudo gedit .bashrc
Paste in the following block:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:/usr/lib/jvm/java-8-openjdk-amd64/bin
export HADOOP_HOME=~/hadoop-3.3.6/
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_STREAMING=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
export PDSH_RCMD_TYPE=ssh
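Changes to .bashrc only take effect in shells started afterwards; to pick them up in the current session, source the file. In your setup you would simply run `source ~/.bashrc` — the snippet below demonstrates the mechanism against a scratch file (the /tmp path and filename are only for the demo), so the effect is easy to see in isolation:

```shell
# scratch copy of two of the exports added to ~/.bashrc above
cat > /tmp/hadoop-env-demo.sh <<'EOF'
export HADOOP_HOME=~/hadoop-3.3.6/
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
EOF

# sourcing the file loads the variables into the current shell
source /tmp/hadoop-env-demo.sh
echo "$HADOOP_HOME"
```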
Install SSH (Hadoop uses it to manage its daemons):
sudo apt-get install ssh
The commands below include the Hadoop version number; at the time of writing it is 3.3.6, so adjust if your version differs.
Now go to the hadoop.apache.org website and download the tar file.
Once downloaded, extract the tar file:
tar -zxvf ~/Downloads/hadoop-3.3.6.tar.gz
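The flags stand for gunzip (z), extract (x), verbose (v), and read from file (f). If you want to see the mechanics in isolation before touching the real tarball, here is a self-contained round trip with a throwaway archive (the /tmp paths and names are only for the demo):

```shell
# build a tiny gzipped tarball, then extract it the same way as the Hadoop one
mkdir -p /tmp/tar-demo/hadoop-demo
echo "hello" > /tmp/tar-demo/hadoop-demo/README
tar -czf /tmp/tar-demo/demo.tar.gz -C /tmp/tar-demo hadoop-demo
rm -r /tmp/tar-demo/hadoop-demo

# -zxvf: decompress, extract, list files as they come out
tar -zxvf /tmp/tar-demo/demo.tar.gz -C /tmp/tar-demo
cat /tmp/tar-demo/hadoop-demo/README
```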
For all of the configuration below, make sure you are in the hadoop-3.3.6/etc/hadoop directory:
cd hadoop-3.3.6/etc/hadoop
Many of the files may already contain a <configuration> tag, so check before you paste in the new configuration blocks.
Now open hadoop-env.sh:
sudo gedit hadoop-env.sh
Set the path for JAVA_HOME by adding the following line:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
You might not need sudo in the following commands, but to avoid permission issues I have added it everywhere.
Let's configure the other files in the same way:
sudo gedit core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.proxyuser.dataflair.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.dataflair.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.server.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.server.groups</name>
    <value>*</value>
  </property>
</configuration>
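fs.defaultFS is the address clients use to reach the NameNode (here, port 9000 on localhost). Once Hadoop is running you can read a property back with `hdfs getconf -confKey fs.defaultFS`; until then, a quick grep/sed sketch against a copy of the file also works (the /tmp path below is just for illustration):

```shell
# demo copy of the property we just configured
cat > /tmp/core-site-demo.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

# pull out the value that follows the fs.defaultFS name tag
grep -A1 '<name>fs.defaultFS</name>' /tmp/core-site-demo.xml \
  | sed -n 's/.*<value>\(.*\)<\/value>.*/\1/p'
```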
sudo gedit hdfs-site.xml
Replace <USER> with your Ubuntu username!
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href=''?>
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///home/<USER>/pseudo/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///home/<USER>/pseudo/dfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
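The two file:// paths above are plain local directories. Hadoop will normally create them when the namenode is formatted, but creating them yourself first can avoid permission surprises — a suggestion, not a required step. Use the same home path you put in hdfs-site.xml:

```shell
# create the local storage directories referenced by dfs.name.dir / dfs.data.dir
mkdir -p ~/pseudo/dfs/name ~/pseudo/dfs/data
ls ~/pseudo/dfs
```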
sudo gedit mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>
sudo gedit yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>
Hadoop is now configured.
Next, execute the following commands one by one:
ssh localhost
(accept the host fingerprint with yes if prompted, then type exit to return)
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
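The chmod step matters: sshd refuses key-based login if authorized_keys is writable by anyone but its owner. After running the commands above you can confirm the mode with `stat -c '%a' ~/.ssh/authorized_keys` (it should print 600); the snippet below shows the same check against a scratch file so it can be tried anywhere:

```shell
# reproduce the permission step on a throwaway file and verify the octal mode
touch /tmp/authorized_keys_demo
chmod 0600 /tmp/authorized_keys_demo
stat -c '%a' /tmp/authorized_keys_demo
```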
Format the namenode (run this from your home directory):
hadoop-3.3.6/bin/hdfs namenode -format
export PDSH_RCMD_TYPE=ssh
(this is already in .bashrc; exporting it again makes sure the current shell has it)
To start Hadoop:
start-all.sh
To check that Hadoop is running, open http://localhost:9870/ (the NameNode web UI) in your browser.
To stop Hadoop:
stop-all.sh
This is an updated version of this article.