DEV Community

Hadoop Configuration using Ansible

Nitesh Thapliyal
I'm an Information technology student. I'm passionate about current and emerging technologies. I have always been fascinated by how things work, functionality, and problem-solving.

Hello everyone! This post walks through configuring a Hadoop cluster using Ansible.

The tasks we are going to perform in this post are:

  1. Configure Namenode.
  2. Format the Namenode.
  3. Configure the Datanode.
  4. Check whether the Datanode is connected to the Namenode.

Before we start with the main topic, we should know what Hadoop and Ansible are.

What is Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

What is Namenode?

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

What is Datanode?

A DataNode is a daemon (a process that runs in the background) that runs on each slave node in a Hadoop cluster. In HDFS, a file is broken into chunks called blocks (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2.x and later). The DataNode stores these blocks, that is, the actual data, on the slave node's disks and performs the read/write operations on them. Because it holds the actual data, a DataNode needs plenty of disk capacity (8 disks are recommended). Commodity hardware can be used to host DataNodes.

What is Ansible?

Ansible is an open-source automation tool, or platform, used for IT tasks such as configuration management, application deployment, intraservice orchestration, and provisioning. Automation is crucial these days, with IT environments that are too complex and often need to scale too quickly for system administrators and developers to keep up if they had to do everything manually. Automation simplifies complex tasks, not just making developers’ jobs more manageable but allowing them to focus attention on other tasks that add value to an organization. In other words, it frees up time and increases efficiency.

What is Control Node?

Any machine with Ansible installed. You can run Ansible commands and playbooks by invoking the ansible or ansible-playbook command from any control node. You can use any computer that has a Python installation as a control node - laptops, shared desktops, and servers can all run Ansible. However, you cannot use a Windows machine as a control node. You can have multiple control nodes.

What is Managed Node?

The network devices (and/or servers) you manage with Ansible. Managed nodes are also sometimes called hosts. Ansible is not installed on managed nodes.

My setup for the Hadoop cluster looks like this👇

[Figure: cluster setup diagram]

I have three virtual machines running: one is the control node and the other two are the managed nodes.

Now let's get started:

Namenode Configuration:

  • First, create an inventory:

To create an inventory, create a text file containing the managed node's IP address, username, password, and connection type, as shown below👇

[Screenshot: inventory file]
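In case the screenshot is not visible, here is a minimal sketch of what such an inventory might look like. The IP address, username, and password are placeholders, and the scratch directory hadoop-ansible-demo is an assumption of this sketch, not something from the original setup.

```shell
# Scratch directory for this sketch (hypothetical; use any path you like).
WORKDIR="${WORKDIR:-./hadoop-ansible-demo}"
mkdir -p "$WORKDIR"

# One managed node per line: address, login user, password, connection type.
# 192.0.2.10, root, and CHANGE_ME are placeholders, not real values.
cat > "$WORKDIR/inventory.txt" <<'EOF'
192.0.2.10 ansible_user=root ansible_password=CHANGE_ME ansible_connection=ssh
EOF

# Show what was written.
cat "$WORKDIR/inventory.txt"
```

Password-based SSH logins from Ansible need the sshpass program on the control node, so install it first if the ping step fails with a message about sshpass.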

  • Now check connectivity with the managed node using the command ansible all -m ping

[Screenshot: output of ansible all -m ping]

If the output is green and shows "ping": "pong", the control node can reach the managed node.

  • Now create the directory with mkdir /etc/ansible

Inside /etc/ansible, create the Ansible configuration file ansible.cfg. In it, add the path to the inventory and set host key checking to false, so that Ansible does not stop to verify the host key the first time it connects over SSH 👇

[Screenshot: ansible.cfg]
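As a rough sketch, the configuration file described above could be written like this. The inventory path is a placeholder, and the file is written to a scratch directory (an assumption of this sketch) so it can be tried without root; copy it to /etc/ansible/ansible.cfg once it looks right.

```shell
# Scratch directory standing in for /etc/ansible (hypothetical).
WORKDIR="${WORKDIR:-./hadoop-ansible-demo}"
mkdir -p "$WORKDIR"

cat > "$WORKDIR/ansible.cfg" <<'EOF'
[defaults]
# Path to the inventory file (placeholder path).
inventory = /etc/ansible/inventory.txt
# Do not prompt to confirm the SSH host key on first connection.
host_key_checking = False
EOF
```

Turning off host key checking is convenient in a lab with throwaway virtual machines; on shared networks it is safer to leave it on and accept the keys once by hand.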

  • Now create an hdfs-site.xml file

[Screenshot: hdfs-site.xml]
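A minimal sketch of an hdfs-site.xml for the Namenode follows, assuming a Hadoop 1.x install (which the hadoop dfsadmin command used later suggests). The directory /nn is a placeholder, not necessarily the path used in the original screenshot.

```shell
# Scratch directory for this sketch (hypothetical).
WORKDIR="${WORKDIR:-./hadoop-ansible-demo}"
mkdir -p "$WORKDIR"

cat > "$WORKDIR/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Directory where the NameNode keeps its metadata. /nn is a placeholder. -->
  <!-- Hadoop 2.x and later name this property dfs.namenode.name.dir. -->
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>
EOF
```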

  • Now create a core-site.xml file

[Screenshot: core-site.xml]
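A core-site.xml for the Namenode could look roughly like the sketch below, again assuming Hadoop 1.x. The port 9001 is a placeholder; any free port works as long as the Datanode is pointed at the same one.

```shell
# Scratch directory for this sketch (hypothetical).
WORKDIR="${WORKDIR:-./hadoop-ansible-demo}"
mkdir -p "$WORKDIR"

cat > "$WORKDIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Address the NameNode listens on; 0.0.0.0 accepts connections on all interfaces. -->
  <!-- Port 9001 is a placeholder. Hadoop 2.x and later prefer the name fs.defaultFS. -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>
EOF
```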

  • Now, inside the /etc/ansible directory, create a yml file; here I have named it master.yml. This yml file is the playbook, where we write the tasks in YAML. The code inside master.yml 👇

[Screenshots: contents of master.yml]
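The exact playbook is only available as screenshots, so the following is a hedged sketch of what a Namenode playbook along these lines might contain, not the author's file. It assumes Hadoop 1.x and Java are already installed on the managed node, that Hadoop reads its configuration from /etc/hadoop, that hadoop and hadoop-daemon.sh are on the PATH, and that /nn is the metadata directory; all of these are assumptions.

```shell
# Scratch directory standing in for /etc/ansible (hypothetical).
WORKDIR="${WORKDIR:-./hadoop-ansible-demo}"
mkdir -p "$WORKDIR"

cat > "$WORKDIR/master.yml" <<'EOF'
- hosts: all
  tasks:
    - name: Create the NameNode metadata directory (placeholder path)
      file:
        path: /nn
        state: directory

    - name: Copy hdfs-site.xml to the managed node
      copy:
        src: hdfs-site.xml
        dest: /etc/hadoop/hdfs-site.xml

    - name: Copy core-site.xml to the managed node
      copy:
        src: core-site.xml
        dest: /etc/hadoop/core-site.xml

    # "creates" skips this task once /nn/current exists, so re-running
    # the playbook does not wipe an already formatted NameNode.
    - name: Format the NameNode
      shell: echo Y | hadoop namenode -format
      args:
        creates: /nn/current

    - name: Start the NameNode daemon
      command: hadoop-daemon.sh start namenode
EOF
```

The copy tasks look for hdfs-site.xml and core-site.xml next to the playbook, so keep the three files in the same directory. Formatting is guarded with creates because the format step destroys existing metadata.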

  • Now run the playbook using the command ansible-playbook master.yml

[Screenshots: output of ansible-playbook master.yml]

  • Now check on the managed node whether the master node is configured

[Screenshot: check on the managed node]

Here you can see that the master node has started successfully⭐

Datanode Configuration

  • Now create an inventory for the Datanode, as shown above

  • Now check connectivity with the managed node

[Screenshot: ping output for the Datanode]

  • Now create the playbook for the Datanode

[Screenshots: contents of datanode.yml]
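As with the Namenode, the Datanode playbook is only shown as screenshots, so this is a sketch under assumptions rather than the original file. It assumes Hadoop 1.x is already installed with its configuration in /etc/hadoop, and it uses placeholder values for the Namenode address (192.0.2.10:9001) and the storage directory (/dn). For brevity it writes the two XML files inline instead of copying them.

```shell
# Scratch directory standing in for /etc/ansible (hypothetical).
WORKDIR="${WORKDIR:-./hadoop-ansible-demo}"
mkdir -p "$WORKDIR"

cat > "$WORKDIR/datanode.yml" <<'EOF'
- hosts: all
  vars:
    namenode_ip: 192.0.2.10   # placeholder: IP address of the Namenode
    namenode_port: 9001       # placeholder: must match the Namenode's core-site.xml
    data_dir: /dn             # placeholder: where blocks are stored
  tasks:
    - name: Create the DataNode storage directory
      file:
        path: "{{ data_dir }}"
        state: directory

    - name: Write hdfs-site.xml with the storage directory
      copy:
        dest: /etc/hadoop/hdfs-site.xml
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>dfs.data.dir</name>
              <value>{{ data_dir }}</value>
            </property>
          </configuration>

    - name: Write core-site.xml pointing at the Namenode
      copy:
        dest: /etc/hadoop/core-site.xml
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>fs.default.name</name>
              <value>hdfs://{{ namenode_ip }}:{{ namenode_port }}</value>
            </property>
          </configuration>

    - name: Start the DataNode daemon
      command: hadoop-daemon.sh start datanode
EOF
```

The key difference from the Namenode side is that core-site.xml here carries the Namenode's real IP address rather than 0.0.0.0, since that is how the Datanode finds the Namenode.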

  • Now run the playbook using the command ansible-playbook datanode.yml

[Screenshots: output of ansible-playbook datanode.yml]

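  • Now check whether the Datanode is connected to the Namenode.

To check, run the command hadoop dfsadmin -report (on Hadoop 2.x and later, the equivalent is hdfs dfsadmin -report)

[Screenshot: output of hadoop dfsadmin -report]

We can also check it from the Hadoop web UI

[Screenshot: Hadoop web UI]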

So that's how we can set up the Hadoop cluster using Ansible.

Thank you!!!❄
