Erin Schaffer for Educative

Originally published at educative.io

What is Hadoop?

The Hadoop framework provides an open-source platform to process large amounts of data across clusters of computers. Because of its powerful features, it has become extremely popular in the big data field. Hadoop allows us to store any kind of data and handle multiple concurrent tasks. Today, we’re going to learn more about this platform and discuss the Hadoop ecosystem, how Hadoop works, its pros and cons, and much more.

Let’s get started!

We’ll cover:

  • What is Apache Hadoop?
  • History of Hadoop
  • Hadoop ecosystem
  • How does Hadoop work?
  • Hadoop pros, cons, and use cases
  • Hadoop vs Spark
  • Wrapping up and next steps

What is Apache Hadoop?

Hadoop is an open-source software framework developed by the Apache Software Foundation. It uses simple programming models to process large data sets. Hadoop is written in Java and runs on Hadoop clusters: collections of computers, or nodes, that work together to execute computations on data. Apache has other software projects that integrate with Hadoop, including ones to store data, manage Hadoop jobs, analyze data, and much more. We can use Hadoop with cloud providers such as Amazon Web Services (AWS) and Microsoft Azure, or with platforms such as Cloudera, to manage and organize our big data efforts.

History of Hadoop

Apache Hadoop started in 2002, when Doug Cutting and Mike Cafarella were working on Apache Nutch. They realized that Nutch couldn’t handle large amounts of data on its own, so they began looking for a solution. They came across the architecture of the Google File System (GFS) and the MapReduce technique for processing large data sets, and started implementing both in their open-source Nutch project. Even then, Nutch still didn’t fully meet their needs.

When Cutting joined Yahoo in 2006, he formed a new project called Hadoop. He separated the distributed computing parts from Apache Nutch and worked with Yahoo to design Hadoop so that it could handle thousands of nodes. In 2007, Yahoo tested Hadoop on a 1,000-node cluster and began using it internally. In early 2008, Hadoop became a top-level open-source project at the Apache Software Foundation. Later that year, it was successfully tested on a 4,000-node cluster.

In 2009, Hadoop was capable of handling billions of searches and indexing millions of web pages. At this time, Cutting joined the Cloudera team to help spread Hadoop into the cloud industry. Finally, in 2011, version 1.0 of Hadoop was released. The latest version (3.3.1) was released in 2021.

Hadoop ecosystem

The Hadoop ecosystem is a suite of services we can use to work with big data initiatives. The four main elements of the ecosystem include:

  • MapReduce
  • Hadoop Distributed File System (HDFS)
  • Yet Another Resource Negotiator (YARN)
  • Hadoop Common

Let’s take a closer look at each of these services.

MapReduce

Hadoop MapReduce is a programming model used for distributed computing. With this model, we can process large amounts of data in parallel on large clusters of commodity hardware. A MapReduce job has two phases: Map and Reduce. The Map phase converts a set of input data into tuples (key/value pairs). The Reduce phase takes the output of the Map phase as its input and combines those tuples into a smaller set of tuples. MapReduce makes it easy to scale data processing across tens of thousands of machines in a cluster.

During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster. When the tasks are complete, the cluster collects and reduces the intermediate data into a result and sends it back to the client that submitted the job.
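
As a concrete illustration, here’s a minimal sketch of the classic word-count job written against Hadoop’s Java MapReduce API. The mapper emits a (word, 1) pair for every word it sees, and the reducer sums the counts for each word. The input and output paths are placeholders passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Assuming the class is packaged as wordcount.jar, we’d submit it with something like hadoop jar wordcount.jar WordCount /input /output; Hadoop then schedules the map tasks on the nodes that hold the input blocks.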

Hadoop Distributed File System (HDFS)

As the name suggests, HDFS is a distributed file system. It handles large sets of data and runs on commodity hardware. HDFS helps us scale a single Hadoop cluster to many nodes, and it helps us perform parallel processing. Its built-in server roles, the NameNode (which tracks file system metadata) and the DataNodes (which store the actual data blocks), also let us check the status of our clusters. HDFS is designed to be highly fault-tolerant, portable, and cost-effective.

Yet Another Resource Negotiator (YARN)

Hadoop YARN is a cluster resource management and job scheduling tool. YARN also works with the data we store in HDFS, allowing us to perform tasks such as:

  • Graph processing
  • Interactive processing
  • Stream processing
  • Batch processing

It dynamically allocates resources and schedules application processing. YARN supports MapReduce along with multiple other processing models. It uses cluster resources efficiently and is backward compatible, meaning that MapReduce applications written for earlier Hadoop versions run on it without any issues.

Hadoop Common

Hadoop Common, also known as Hadoop Core, provides Java libraries that we can use across all of our Hadoop modules.

Other components include:

  • Cassandra: Cassandra is a wide-column store NoSQL database management system.
  • Flume: Flume aggregates, collects, and moves large amounts of log data.
  • Pig: Pig is a high-level programming language used to analyze large data sets.
  • HBase: HBase is a non-relational database management system that runs on top of HDFS.
  • Hive: Apache Hive is a fault-tolerant and SQL-like data warehouse software that handles reading, writing, and managing data.
  • Lucene: Lucene is an open-source search engine software library written in Java. It provides robust search and indexing features.
  • Mahout: Apache Mahout is an open-source project used to create scalable machine learning algorithms.
  • Oozie: Oozie is a workload scheduling system used to handle Hadoop jobs.
  • Spark MLlib: MLlib is a scalable machine learning library with Java, Scala, R, and Python APIs.
  • Solr: Solr is an enterprise-search platform built on Lucene.
  • Sqoop: Sqoop is a CLI application used to transfer data between relational databases and Hadoop.
  • Submarine: Submarine is a cloud-native machine learning and deep learning platform. It supports data processing, algorithm development, ML frameworks, and containerization efforts.
  • Zookeeper: Zookeeper is a centralized server for reliable distributed cloud application coordination.

How does Hadoop work?

In the previous section, we discussed many of the services that integrate with Hadoop. We now know that the Hadoop ecosystem is large and extensible. It allows us to perform many tasks, such as collecting, storing, analyzing, processing, and managing big data. Hadoop provides us with a platform on which we can build other services and applications.

Applications can use API operations to connect to the NameNode and place data in Hadoop clusters. The data is split into blocks that are replicated across DataNodes. We can then use MapReduce to run jobs that query and reduce the data in HDFS. Map tasks run on each node against the files we supply, and reduce tasks, or reducers, aggregate and organize the output.
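
To make that client interaction a little more concrete, here’s a minimal sketch that writes and then reads a file through Hadoop’s Java FileSystem API. The path and file contents are just placeholders; the cluster address comes from the fs.defaultFS setting found on the classpath.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // Reads core-site.xml/hdfs-site.xml from the classpath;
    // fs.defaultFS points the client at the NameNode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt"); // hypothetical path

    // Write: the client asks the NameNode where to place blocks,
    // then streams the bytes to the chosen DataNodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back and copy it to stdout.
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, conf, false);
    }

    fs.close();
  }
}
```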

Hadoop pros, cons, and use cases

Hadoop is a popular platform that comes with its pros and cons. Let’s take a look at them, and then we’ll discuss a handful of use cases.

Pros

  • Cost-effective: Traditionally, it costs a lot of money to store large amounts of data. Hadoop runs on inexpensive commodity hardware, which solves this problem, and it also stores all raw data so it can be accessed whenever needed.
  • High availability: The HDFS high availability feature allows us to run two or more redundant NameNodes in the same cluster, which allows for fast failover if a machine crashes or a component fails.
  • Scalability: Storage and processing power can be easily increased by adding more nodes.
  • Systematic: HDFS stores and replicates data across the cluster in an organized, predictable way.
  • Flexibility: Hadoop can handle structured data and unstructured data.
  • Active community: Hadoop has a large user base, so it’s easy to find helpful documentation or help relating to any problem you encounter.
  • MapReduce: MapReduce is powerful and can be leveraged through Java or Apache Pig.
  • Rich ecosystem: Hadoop has so many companion tools and services that easily integrate into the platform. These services allow us to perform many different tasks related to our data.
  • Parallel processing: Hadoop efficiently executes parallel processing and can even process petabytes of data.
  • Data formatting: Converting data between formats can sometimes cause data loss, but Hadoop stores data in its original format, so no conversion is needed.

Cons

  • Small files: HDFS handles large numbers of small files poorly because it’s designed for a smaller number of very large files; every file adds metadata overhead on the NameNode.
  • No real-time processing: Hadoop is not suitable for real-time data processing. Apache Spark and Apache Flink are better suited for these workloads.
  • Security: Hadoop doesn’t encrypt data at the storage and network levels by default, which means your data may be at risk. Spark offers additional security features that help address some of these limitations.
  • Response time: The MapReduce programming framework can be slow, because each stage reads its input from and writes its output to disk.
  • Learning curve: There are a lot of different modules and services available to use with Hadoop, and those can take a lot of time to learn.
  • Complex interface: The interface isn’t extremely intuitive, so it may take some time to get acquainted with the platform.

Use cases

Data-driven decisions

We can integrate structured and unstructured data that isn’t held in a data warehouse or relational database. This lets us make more precise decisions based on broader data.

Big data analytics and access

Hadoop is great for data scientists and ML engineers because it allows us to perform advanced analytics to find patterns and develop accurate and effective predictive models.

Data lakes

Hadoop governance solutions can help us with data integration, security, and quality for data lakes.

Financial services

Hadoop can help us build and run applications to assess risk, design investment models, and create trading algorithms.

Healthcare

Hadoop helps us track large-scale health indexes and manage patient records.

Sales prediction

Hadoop is used in retail companies to help predict sales and increase profits by studying historical data.

Hadoop vs Spark

Apache Hadoop and Apache Spark are commonly compared to one another because they’re both open-source frameworks for big data processing. Spark is the newer project, originally developed at UC Berkeley in 2009. It focuses on the parallel processing of data across a cluster, and it keeps intermediate data in memory, which makes it much faster than MapReduce for many workloads.

Hadoop is the better platform if you’re batch processing large amounts of data. Spark is the better platform if you’re streaming data, running graph computations, or doing machine learning. Spark supports both real-time and batch processing. There are a lot of different libraries that you can use with Spark, including ones for machine learning, SQL, streaming data, and graph processing.
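
To show the contrast in programming style, here’s a minimal sketch of the same word count written with Spark’s Java RDD API (assuming Spark 2.x or later; the input and output paths are again placeholders). The whole pipeline is expressed as in-memory transformations rather than separate Map and Reduce classes.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark-word-count");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // Load the input, split lines into words, and count each word,
      // keeping intermediate results in memory across the cluster.
      JavaRDD<String> lines = sc.textFile(args[0]);
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey(Integer::sum);
      counts.saveAsTextFile(args[1]);
    }
  }
}
```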

Wrapping up and next steps

Congrats on taking your first steps with Apache Hadoop! The Hadoop ecosystem is powerful and extensive, and there’s still so much more to learn about Hadoop. Some recommended concepts to cover next include:

  • Resilient distributed data sets (RDDs)
  • Hadoop data management
  • Avro and Parquet

To get started learning these concepts and more, check out Educative’s course Introduction to Big Data and Hadoop. In this hands-on course, you’ll learn the fundamentals of big data and work closely with functioning Hadoop clusters. By the end of the course, you’ll have the foundational knowledge to begin working in the big data field.

Happy learning!
