Hadoop Configuration using Ansible

#ansible #hadoop #linux

Hello everyone, this blog walks through setting up a Hadoop cluster using Ansible.

The tasks that we are going to perform in this blog are:

Configure Namenode.
Format the Namenode.
Configure the Datanode.
Check whether the Datanode is connected to the Namenode.

Before we start with the main topic, we should know what Hadoop and Ansible are.

What is Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

What is Namenode?

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

What is Datanode?

DataNode is a daemon (a process that runs in the background) that runs on each slave node in a Hadoop cluster. In HDFS, a file is broken into chunks called blocks (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2.x and later). These blocks are stored on the slave nodes, so the DataNode holds the actual data and needs a large number of disks (8 disks are recommended). The DataNode performs the read/write operations on those disks. Commodity hardware can be used to host DataNodes.
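To make the block splitting concrete, here is a small sketch of the arithmetic. The 200 MB file size is a made-up example, and the block size is the 64 MB figure mentioned above; adjust both to your own cluster.

```shell
file_mb=200    # hypothetical file size in MB
block_mb=64    # HDFS block size in MB (assumed; check your cluster's setting)

# Ceiling division: a partly filled last block still counts as one block.
blocks=$(( (file_mb + block_mb - 1) / block_mb ))

# Data held by the last block.
last_mb=$(( file_mb - (blocks - 1) * block_mb ))

echo "$blocks blocks, last block holds $last_mb MB"
```

With these numbers the file occupies three full blocks plus a fourth block holding the remaining 8 MB.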

What is Ansible?

Ansible is an open-source automation tool, or platform, used for IT tasks such as configuration management, application deployment, intraservice orchestration, and provisioning. Automation is crucial these days, with IT environments that are too complex and often need to scale too quickly for system administrators and developers to keep up if they had to do everything manually. Automation simplifies complex tasks, not just making developers’ jobs more manageable but allowing them to focus attention on other tasks that add value to an organization. In other words, it frees up time and increases efficiency.

What is Control Node?

Any machine with Ansible installed. You can run Ansible commands and playbooks by invoking the ansible or ansible-playbook command from any control node. You can use any computer that has a Python installation as a control node - laptops, shared desktops, and servers can all run Ansible. However, you cannot use a Windows machine as a control node. You can have multiple control nodes.

What is Managed Node?

The network devices (and/or servers) you manage with Ansible. Managed nodes are also sometimes called hosts. Ansible is not installed on managed nodes.

My setup for the Hadoop cluster looks like this 👇

I have three virtual machines running: one is the Control node and the other two are the Managed nodes.

Now let's get started:

Namenode Configuration:

First create an Inventory:

To create an Inventory, create a text file and, for each managed node, provide its IP address, username, password, and connection type.
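Since the inventory depends on your own machines, here is a minimal sketch of what such a file could look like. The file name inventory.txt, the group names, the 192.0.2.x addresses, and the credentials are all placeholders to replace with your own values. ansible_user, ansible_password, and ansible_connection are standard Ansible inventory variables. Note that password-based SSH also needs the sshpass package on the control node.

```shell
# Write an example inventory into the current directory.
# Every host address and credential below is a placeholder.
cat > inventory.txt <<'EOF'
[namenode]
192.0.2.10 ansible_user=root ansible_password=CHANGE_ME ansible_connection=ssh

[datanode]
192.0.2.11 ansible_user=root ansible_password=CHANGE_ME ansible_connection=ssh
EOF
```

Running ansible all -i inventory.txt --list-hosts afterwards is a quick way to confirm that Ansible can parse the file.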

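Now check the connectivity with the Managed Nodes using the command ansible all -m ping (add -i followed by the inventory file path if the inventory is not yet set in ansible.cfg).

If there is connectivity, Ansible prints a SUCCESS result in green for each host, and the result contains "ping": "pong".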

Now create the directory /etc/ansible with the command mkdir /etc/ansible

Inside /etc/ansible, create the Ansible configuration file ansible.cfg. In this file, add the path of the inventory and set host_key_checking to False so that Ansible does not ask to verify the host key the first time it connects over SSH.
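A minimal sketch of such a configuration file follows. The inventory path /root/inventory.txt is a placeholder, and CFG_DIR is a hypothetical variable introduced only for this sketch: it defaults to the current directory so the snippet can be tried without root, and can be set to /etc/ansible to match the steps above.

```shell
# Target directory for ansible.cfg. Set CFG_DIR=/etc/ansible (as root) to
# match the post; by default the file lands in the current directory.
cfg_dir="${CFG_DIR:-.}"
mkdir -p "$cfg_dir"

cat > "$cfg_dir/ansible.cfg" <<'EOF'
[defaults]
# Path to the inventory file (placeholder; point this at your own file).
inventory = /root/inventory.txt
# Skip the interactive SSH host key prompt on first connection.
host_key_checking = False
EOF
```

Disabling host key checking is convenient in a lab, but it removes a protection against connecting to the wrong machine, so it is worth turning back on outside of test setups.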

Now create an hdfs-site.xml file
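As a rough sketch, a Namenode hdfs-site.xml could look like the following. It assumes Hadoop 2.x or later property names and a hypothetical metadata directory /nn; on Hadoop 1.x the property is called dfs.name.dir instead.

```shell
# Write an example Namenode hdfs-site.xml into the current directory.
cat > hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Directory where the Namenode keeps its metadata; /nn is a placeholder. -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>
EOF
```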

Now create a core-site.xml file
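A core-site.xml along these lines tells Hadoop where the Namenode listens. This is only a sketch: the address 192.0.2.10 and port 9001 are placeholders, and fs.defaultFS is the Hadoop 2.x and later property name (Hadoop 1.x uses fs.default.name).

```shell
# Write an example core-site.xml into the current directory.
cat > core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- URI of the Namenode; host and port are placeholders. -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://192.0.2.10:9001</value>
  </property>
</configuration>
EOF
```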

Now, inside the /etc/ansible directory, create a yml file; I have named mine master.yml. This yml file is the playbook, where we write the code in YAML. Now the code inside master.yml 👇