Imagine a company that collects a large amount of data every day, such as website logs, transactions, or user activity. After some time, the data becomes too large for a single computer to store and process. This is where Hadoop helps.
Hadoop is a big data framework designed to store and process very large datasets across many machines. Instead of relying on one powerful computer, Hadoop uses many machines working together, called nodes. Together, these machines form a Hadoop cluster.
Storage Layer — HDFS
The storage system used by Hadoop is called HDFS (Hadoop Distributed File System).
When a large file is stored in Hadoop, it is automatically split into smaller pieces called blocks (128 MB each by default). These blocks are then distributed across multiple machines in the cluster, and each block is replicated (three copies by default) so the data survives a machine failure.
For example:
- Block 1 → Machine 1
- Block 2 → Machine 2
- Block 3 → Machine 3
This approach is called distributed storage, and it allows Hadoop to store very large datasets efficiently.
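The idea can be sketched in a few lines of Python. The 128 MB block size is the HDFS default, but the round-robin placement below is a simplification — real HDFS places replicas using a rack-aware policy:

```python
# Sketch: how a large file is split into HDFS-style blocks and spread
# across machines. Round-robin placement is a simplification; real HDFS
# uses a rack-aware placement policy and replicates each block.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default since Hadoop 2.x

def split_into_blocks(file_size: int, machines: list[str]) -> list[tuple[int, str]]:
    """Return (block_id, machine) pairs for a file of file_size bytes."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    return [(i, machines[i % len(machines)]) for i in range(num_blocks)]

placement = split_into_blocks(300 * 1024 * 1024, ["machine1", "machine2", "machine3"])
print(placement)  # a 300 MB file -> 3 blocks, one per machine
```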
Hadoop Nodes
In a Hadoop cluster, there are two important types of nodes.
NameNode (Master Node)
The NameNode acts like the manager of the system. It stores metadata, which includes:
- file and directory names
- the blocks that make up each file
- which DataNodes store each block
The NameNode does not store the actual data. It only manages the file system namespace and keeps track of where every block lives.
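A toy picture of that metadata, using a plain Python dict (the real NameNode holds this mapping in memory and persists it through an edit log and fsimage; the paths and node names here are made up for illustration):

```python
# Toy model of NameNode metadata: file paths map to block IDs, and
# block IDs map to the DataNodes holding a replica. The NameNode
# itself never stores the block contents.

namenode_metadata = {
    "files": {
        "/logs/access.log": ["blk_1", "blk_2", "blk_3"],
    },
    "block_locations": {
        "blk_1": ["datanode1", "datanode2"],  # each block has replicas
        "blk_2": ["datanode2", "datanode3"],
        "blk_3": ["datanode3", "datanode1"],
    },
}

def locate_file(path: str) -> list[list[str]]:
    """For each block of a file, list the DataNodes that store it."""
    blocks = namenode_metadata["files"][path]
    return [namenode_metadata["block_locations"][b] for b in blocks]

print(locate_file("/logs/access.log"))
```

This is exactly the kind of lookup a client performs before reading: ask the NameNode where the blocks are, then fetch the data directly from the DataNodes.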
DataNodes
DataNodes are the machines that store the actual data blocks.
For example:
- DataNode 1 → Block 1
- DataNode 2 → Block 2
- DataNode 3 → Block 3
Together, the NameNode and DataNodes make up the storage layer of Hadoop.
Processing Layer — MapReduce
After the data is stored, Hadoop needs to process it. This is done using MapReduce.
MapReduce is a distributed data processing framework that works in two main steps.
Map Phase
In the Map phase, the input is divided into smaller pieces, and a map task runs on each piece.
Each machine processes the part of the data stored locally on it, which Hadoop calls data locality.
Reduce Phase
In the Reduce phase, the intermediate results from all machines are grouped by key (the shuffle step) and then combined to produce the final output.
This process allows Hadoop to process huge datasets quickly by using parallel processing.
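The two phases can be imitated locally with plain Python. This is only a sketch of the programming model — a real job is distributed across the cluster by the MapReduce framework, not run in one process:

```python
from collections import defaultdict

# Local imitation of MapReduce: map emits key-value pairs, a shuffle
# step groups values by key, and reduce aggregates each group.

def map_phase(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all values for one key into a final result.
    return (key, sum(values))

lines = ["big data big cluster", "big cluster"]
pairs = [kv for line in lines for kv in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'big': 3, 'data': 1, 'cluster': 2}
```

In a real cluster, the map tasks run on the machines holding the data blocks, and only the grouped intermediate pairs travel over the network to the reducers.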
Example
Imagine a company wants to analyze millions of website log records to see how many users visited from each country.
- The log data is stored in HDFS.
- Hadoop splits the logs into blocks and stores them across multiple DataNodes.
- MapReduce processes the data in parallel on different machines.
- The Reduce phase combines the results and generates the final report.
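The steps above can be sketched as follows. The log format and the position of the country field are assumptions made up for this illustration:

```python
from collections import Counter

# Assumed log format (hypothetical): "timestamp user_id country" per line.
log_lines = [
    "2024-01-01T10:00 u1 US",
    "2024-01-01T10:01 u2 IN",
    "2024-01-01T10:02 u3 US",
]

# Map: each mapper emits (country, 1) for the log records in its block.
mapped = [(line.split()[2], 1) for line in log_lines]

# Reduce: sum the counts per country to produce the final report.
report = Counter()
for country, count in mapped:
    report[country] += count

print(dict(report))  # {'US': 2, 'IN': 1}
```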
Conclusion
Hadoop architecture works with two main components:
- HDFS → for distributed storage
- MapReduce → for distributed processing
By splitting data across multiple machines and processing it in parallel, Hadoop allows organizations to store and analyze massive datasets efficiently.