Imagine a company that collects a large amount of data every day, such as website logs, transactions, or user activity. After some time, the data becomes too large for a single computer to store and process. This is where Hadoop helps.
Hadoop is a big data framework designed to store and process very large datasets across many machines. Instead of relying on one powerful computer, Hadoop uses many machines working together, called nodes. Together, these machines form a Hadoop cluster.
Storage Layer — HDFS
The storage system used by Hadoop is called HDFS (Hadoop Distributed File System).
When a large file is stored in Hadoop, it is automatically split into smaller pieces called blocks (128 MB each by default). These blocks are then distributed across multiple machines in the cluster, and each block is replicated (three copies by default) so the data survives a machine failure.
For example:
- Block 1 → Machine 1
- Block 2 → Machine 2
- Block 3 → Machine 3
This approach is called distributed storage, and it allows Hadoop to store very large datasets efficiently.
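The idea can be sketched in a few lines of Python. The 128 MB block size is the HDFS default, but the round-robin placement below is a simplification — real HDFS places replicas using a rack-aware policy:

```python
# Sketch: how a large file is split into HDFS-style blocks and spread
# across machines. Round-robin placement is a simplification; real HDFS
# uses a rack-aware placement policy and replicates each block.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default since Hadoop 2.x

def split_into_blocks(file_size: int, machines: list[str]) -> list[tuple[int, str]]:
    """Return (block_id, machine) pairs for a file of file_size bytes."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    return [(i, machines[i % len(machines)]) for i in range(num_blocks)]

placement = split_into_blocks(300 * 1024 * 1024, ["machine1", "machine2", "machine3"])
print(placement)  # a 300 MB file -> 3 blocks, one per machine
```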
Hadoop Nodes
In a Hadoop cluster, there are two important types of nodes.
NameNode (Master Node)
The NameNode acts like the manager of the system. It stores metadata, which includes:
- file and directory names
- the blocks that make up each file
- which DataNodes store each block
The NameNode does not store the actual data. It only manages the file system namespace and keeps track of where every block lives.
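A toy picture of that metadata, using a plain Python dict (the real NameNode holds this mapping in memory and persists it through an edit log and fsimage; the paths and node names here are made up for illustration):

```python
# Toy model of NameNode metadata: file paths map to block IDs, and
# block IDs map to the DataNodes holding a replica. The NameNode
# itself never stores the block contents.

namenode_metadata = {
    "files": {
        "/logs/access.log": ["blk_1", "blk_2", "blk_3"],
    },
    "block_locations": {
        "blk_1": ["datanode1", "datanode2"],  # each block has replicas
        "blk_2": ["datanode2", "datanode3"],
        "blk_3": ["datanode3", "datanode1"],
    },
}

def locate_file(path: str) -> list[list[str]]:
    """For each block of a file, list the DataNodes that store it."""
    blocks = namenode_metadata["files"][path]
    return [namenode_metadata["block_locations"][b] for b in blocks]

print(locate_file("/logs/access.log"))
```

This is exactly the kind of lookup a client performs before reading: ask the NameNode where the blocks are, then fetch the data directly from the DataNodes.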
DataNodes
DataNodes are the machines that store the actual data blocks.
For example:
- DataNode 1 → Block 1
- DataNode 2 → Block 2
- DataNode 3 → Block 3
Together, the NameNode and DataNodes make up the storage layer of Hadoop.
Processing Layer — MapReduce
After the data is stored, Hadoop needs to process it. This is done using MapReduce.
MapReduce is a distributed data processing framework that works in two main steps.
Map Phase
In the Map phase, the input is divided into smaller pieces, and a map task runs on each piece.
Each machine processes the part of the data stored locally on it, which Hadoop calls data locality.
Reduce Phase
In the Reduce phase, the intermediate results from all machines are grouped by key (the shuffle step) and then combined to produce the final output.
This process allows Hadoop to process huge datasets quickly by using parallel processing.
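The two phases can be imitated locally with plain Python. This is only a sketch of the programming model — a real job is distributed across the cluster by the MapReduce framework, not run in one process:

```python
from collections import defaultdict

# Local imitation of MapReduce: map emits key-value pairs, a shuffle
# step groups values by key, and reduce aggregates each group.

def map_phase(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all values for one key into a final result.
    return (key, sum(values))

lines = ["big data big cluster", "big cluster"]
pairs = [kv for line in lines for kv in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'big': 3, 'data': 1, 'cluster': 2}
```

In a real cluster, the map tasks run on the machines holding the data blocks, and only the grouped intermediate pairs travel over the network to the reducers.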
Example
Imagine a company wants to analyze millions of website log records to see how many users visited from each country.
- The log data is stored in HDFS.
- Hadoop splits the logs into blocks and stores them across multiple DataNodes.
- MapReduce processes the data in parallel on different machines.
- The Reduce phase combines the results and generates the final report.
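The steps above can be sketched as follows. The log format and the position of the country field are assumptions made up for this illustration:

```python
from collections import Counter

# Assumed log format (hypothetical): "timestamp user_id country" per line.
log_lines = [
    "2024-01-01T10:00 u1 US",
    "2024-01-01T10:01 u2 IN",
    "2024-01-01T10:02 u3 US",
]

# Map: each mapper emits (country, 1) for the log records in its block.
mapped = [(line.split()[2], 1) for line in log_lines]

# Reduce: sum the counts per country to produce the final report.
report = Counter()
for country, count in mapped:
    report[country] += count

print(dict(report))  # {'US': 2, 'IN': 1}
```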
Conclusion
Hadoop architecture works with two main components:
- HDFS → for distributed storage
- MapReduce → for distributed processing
By splitting data across multiple machines and processing it in parallel, Hadoop allows organizations to store and analyze massive datasets efficiently.