Install Hadoop on Linux (Debian) for Big Data Analysis

mh_shifat (5hfT) ・3 min read

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

There are two ways to install Hadoop: single node and multi node.

A single-node cluster runs only one DataNode, with the NameNode, DataNode, ResourceManager and NodeManager all set up on a single machine. This setup is used for studying and testing purposes.

Install Hadoop: Setting up a Single Node Hadoop Cluster

Prerequisites :

Step 0)

  • Install Java (OpenJDK 8) :

    • Add repository :

    sudo add-apt-repository ppa:openjdk-r/ppa

    • Update :

    sudo apt update

    • Install :

    sudo apt install openjdk-8-jdk

    Note : in case of Kali Linux, just install the JDK.
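
Hadoop 3.2.x targets Java 8, so it can help to confirm which Java version is active before continuing. The sketch below is one possible check, not part of the original steps; the helper name java_major is made up for illustration.

```shell
# Print the major version for a Java version string.
# "1.8.0_292" -> 8 (legacy scheme), "11.0.2" -> 11 (current scheme).
java_major() {
  v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;    # drop the legacy "1." prefix
  esac
  echo "${v%%[._+-]*}"      # keep everything before the first separator
}

if command -v java >/dev/null 2>&1; then
  # `java -version` prints to stderr; the version is the quoted field.
  ver=$(java -version 2>&1 | awk -F'"' '/version/ { print $2; exit }')
  echo "Java major version: $(java_major "$ver")"
else
  echo "java not found on PATH; install the JDK first"
fi
```

If the reported major version is not 8, check which JDK is installed before moving on to the Hadoop steps.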

Step 1)

  • Install ssh :

sudo apt install ssh

Step 2)

  • Install rsync :

sudo apt install rsync

Step 3)

  • Set up ssh without a passphrase :

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

Step 4)

  • Append the public key to authorized_keys :

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
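
Running the append command twice leaves the same key in authorized_keys twice. That is harmless, but a small guard keeps the file clean. This is a minimal sketch; the function name add_key_once is hypothetical.

```shell
# Append a public key to an authorized_keys file only if it is missing.
# $1: public key file, $2: authorized_keys file
add_key_once() {
  pub="$1"
  auth="$2"
  mkdir -p "$(dirname "$auth")"
  touch "$auth"
  # -q: quiet, -x: match the whole line, -F: treat the key as a fixed string
  if ! grep -qxF -- "$(cat "$pub")" "$auth"; then
    cat "$pub" >> "$auth"
  fi
}
```

With the default key from Step 3 it would be called as: add_key_once ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys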

Step 5)

  • Now connect to localhost :

ssh localhost

  • Issue-1 : ssh: connect to host localhost port 22: Connection refused

    • Restart ssh :

    sudo service ssh restart

  • Issue-2 : this could be a permission issue so try

    • Using chmod :

    chmod -R 700 ~/.ssh

    chmod 644 ~/.ssh/authorized_keys

    chmod 644 ~/.ssh/known_hosts

    chmod 644 ~/.ssh/config

    chmod 600 ~/.ssh/id_rsa

    chmod 644 ~/.ssh/id_rsa.pub

  • Then run again :

    ssh localhost
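
Before changing anything, it may be useful to see which permissions actually differ from the values in the chmod list above. The following read-only sketch assumes GNU stat (the default on Debian); check_mode is an illustrative name.

```shell
# Report whether a path has the expected octal permission mode.
# $1: path, $2: expected mode such as 700
check_mode() {
  if [ ! -e "$1" ]; then
    echo "skip $1 (not found)"
    return 0
  fi
  actual=$(stat -c '%a' "$1")   # GNU stat prints the mode in octal with %a
  if [ "$actual" = "$2" ]; then
    echo "ok   $1 ($actual)"
  else
    echo "warn $1 is $actual, expected $2"
  fi
}

# Expected modes are taken from the chmod commands in Issue-2.
check_mode "$HOME/.ssh" 700
check_mode "$HOME/.ssh/id_rsa" 600
check_mode "$HOME/.ssh/id_rsa.pub" 644
check_mode "$HOME/.ssh/authorized_keys" 644
```

Only the lines marked warn need a chmod.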

Main Install Process :

Step 6)

I have installed Hadoop 3.2.1, and I recommend downloading this version.

Step 7)

  • Extract the file using

tar -xzf hadoop-3.2.1.tar.gz

Step 8)

  • Copy the hadoop-3.2.1 folder to your desired location and rename it to hadoop (so that the directory looks like /home/username/hadoop)
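
Steps 6 to 8 can also be done from the terminal. The sketch below assumes the release is fetched from the Apache archive server, which is not stated in the original steps; the hadoop_url helper is hypothetical, and the download commands are left commented out because the tarball is large.

```shell
# Build the download URL for a Hadoop release tarball.
# Assumption: releases are fetched from the Apache archive server.
hadoop_url() {
  echo "https://archive.apache.org/dist/hadoop/common/hadoop-$1/hadoop-$1.tar.gz"
}

VERSION=3.2.1
echo "Download from: $(hadoop_url "$VERSION")"

# Steps 6-8 as commands (uncomment to run):
# wget "$(hadoop_url "$VERSION")"
# tar -xzf "hadoop-$VERSION.tar.gz"
# mv "hadoop-$VERSION" "$HOME/hadoop"
```

Moving the folder to $HOME/hadoop matches the /home/username/hadoop location used in the following steps.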

Step 9)

  • Edit the .bashrc file [location : ~ (home directory)] and add the code given below to it :

    #for hadoop

    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 #JAVA_JDK directory

    export HADOOP_HOME=/home/username/hadoop #location of your hadoop directory

    export HADOOP_MAPRED_HOME=$HADOOP_HOME
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export HADOOP_COMMON_HOME=$HADOOP_HOME
    export HADOOP_HDFS_HOME=$HADOOP_HOME
    export HADOOP_YARN_HOME=$HADOOP_HOME
    export HADOOP_USER_CLASSPATH_FIRST=true

    alias hadoop=$HADOOP_HOME/bin/hadoop #for convenience
    alias hdfs=$HADOOP_HOME/bin/hdfs #for convenience

    #done
    

Note : Change username in HADOOP_HOME according to your username.

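To get the JAVA_JDK path, run the command below. It prints the full path of the java binary; JAVA_HOME is that path without the trailing /jre/bin/java (or /bin/java) :

readlink -f $(which java)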
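
The readlink output points at the java binary itself, not at the JDK directory. One way to turn it into a JAVA_HOME candidate is sketched below; java_home_from_bin is an illustrative helper, and the result should still be checked by eye.

```shell
# Turn the resolved path of the java binary into a JAVA_HOME candidate.
java_home_from_bin() {
  p="${1%/bin/java}"   # strip the trailing /bin/java
  echo "${p%/jre}"     # JDK 8 packages resolve to .../jre/bin/java
}

if command -v java >/dev/null 2>&1; then
  java_home_from_bin "$(readlink -f "$(command -v java)")"
fi
```

For the OpenJDK 8 package this prints a path such as /usr/lib/jvm/java-8-openjdk-amd64, which is the value used for JAVA_HOME in this guide.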

Step 10)

  • Reload the .bashrc file to apply the changes :

source ~/.bashrc
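
After reloading, a quick check that the variables are really set can save some debugging later. This is a small sketch; missing_vars is a made-up helper name.

```shell
# Print the names of the given variables that are unset or empty.
missing_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"   # indirect lookup that works in plain sh
    if [ -z "$value" ]; then
      echo "$name"
    fi
  done
}

# No output means all three variables are set in this shell.
missing_vars JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR
```

If a name is printed, re-check the corresponding export line in .bashrc and reload the file.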

Step 11)

  • Edit the files in hadoop/etc/hadoop/ :

    • core-site.xml (add the property below inside the <configuration> element) :
    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>
    
    
    • hdfs-site.xml (add the properties below inside the <configuration> element) :

    Note : Change username according to your username.


    <configuration>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>file:///home/username/pseudo/dfs/name</value>  <!-- username : run `whoami` in a terminal to find your username -->
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>file:///home/username/pseudo/dfs/data</value>  <!-- username : run `whoami` in a terminal to find your username -->
        </property>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>
    
    • mapred-site.xml (add the property below inside the <configuration> element) :
    <configuration>
        <property>
            <name>mapred.job.tracker</name>
            <value>localhost:8021</value>
        </property>
    </configuration>
    
    
    • hadoop-env.sh (add the line below) :
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 #JAVA_JDK directory
    

To get the JAVA_JDK path, run :

readlink -f $(which java)
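
Instead of editing the username by hand, hdfs-site.xml can be generated from a base directory such as $HOME. The sketch below is one way to do that under stated assumptions: write_hdfs_site is a hypothetical helper, and it uses the property names dfs.namenode.name.dir and dfs.datanode.data.dir (older guides spell these dfs.name.dir and dfs.data.dir, which Hadoop 3 treats as deprecated aliases). It writes into a scratch directory so the result can be inspected before copying it over.

```shell
# Write an hdfs-site.xml for a single-node setup.
# $1: directory to write into, $2: base directory for HDFS storage
write_hdfs_site() {
  cat > "$1/hdfs-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file://$2/pseudo/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file://$2/pseudo/dfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
}

# Example: generate the file in a scratch directory and print it.
out_dir=$(mktemp -d)
write_hdfs_site "$out_dir" "$HOME"
cat "$out_dir/hdfs-site.xml"
```

Once the output looks right, the file can be copied to $HADOOP_HOME/etc/hadoop/.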

Once everything above is done without any errors, continue with the next steps.

Step 12)

  • Format the Hadoop file system by running the command :

hdfs namenode -format
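
Formatting is meant to be done once. Formatting again over an existing NameNode directory discards the file system metadata, so a guard can be handy. The sketch below assumes that a formatted NameNode directory contains a current/VERSION file and that the directory matches the one configured in hdfs-site.xml; namenode_formatted is an illustrative name.

```shell
# Return success if a NameNode directory already looks formatted.
# Assumption: a formatted directory contains current/VERSION.
namenode_formatted() {
  [ -f "$1/current/VERSION" ]
}

NAME_DIR="$HOME/pseudo/dfs/name"   # must match the path in hdfs-site.xml
if namenode_formatted "$NAME_DIR"; then
  echo "NameNode already formatted at $NAME_DIR; skipping format"
else
  echo "No metadata found at $NAME_DIR; run the format command from Step 12"
fi
```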

Step 13)

  • To run hadoop :

$HADOOP_HOME/sbin/start-all.sh

Now open your browser and go to http://localhost:9870 and you will see Hadoop working ! :D

Since Hadoop 3.0.0-alpha1 there has been a change in the port configuration: the NameNode web UI at http://localhost:50070 (Hadoop 2.x) was moved to http://localhost:9870
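
The port change can be expressed as a tiny helper that picks the NameNode web UI port from the Hadoop version. This is only a sketch of the rule described above; namenode_ui_port is a made-up name, and custom port settings in hdfs-site.xml would override these defaults.

```shell
# Print the default NameNode web UI port for a Hadoop version string.
namenode_ui_port() {
  major="${1%%.*}"          # "3.2.1" -> "3"
  if [ "$major" -ge 3 ]; then
    echo 9870               # Hadoop 3.x default
  else
    echo 50070              # Hadoop 2.x default
  fi
}

echo "NameNode UI: http://localhost:$(namenode_ui_port 3.2.1)"
```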

  • To check the running Hadoop processes :

jps

  • Stop hadoop :

$HADOOP_HOME/sbin/stop-all.sh

  • After the machine (PC) restarts, start hadoop again using :

$HADOOP_HOME/sbin/start-all.sh

  • The default port of the web UI that lists all applications of the cluster is 8088 : http://localhost:8088/
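
The jps check can be made more explicit by comparing its output with the daemons a single-node setup is expected to run after start-all.sh. A minimal sketch, assuming the five daemon names below; missing_daemons is a hypothetical helper that reads jps-style lines ("PID Name") from standard input.

```shell
# Print the expected Hadoop daemons that are absent from jps-style input.
missing_daemons() {
  out=$(cat)   # read all of stdin once so it can be scanned per daemon
  for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # Compare the second column exactly, so NameNode does not match
    # SecondaryNameNode.
    printf '%s\n' "$out" \
      | awk -v d="$d" '$2 == d { found = 1 } END { exit !found }' \
      || echo "$d"
  done
}

if command -v jps >/dev/null 2>&1; then
  jps | missing_daemons
fi
```

No output means all five daemons are running; any name printed is a daemon to look for in the logs under $HADOOP_HOME/logs.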
