<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shivansh Yadav</title>
    <description>The latest articles on DEV Community by Shivansh Yadav (@shvshydv).</description>
    <link>https://dev.to/shvshydv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F742871%2F5da970a9-ac12-481e-a314-0e9e62be68b9.jpg</url>
      <title>DEV Community: Shivansh Yadav</title>
      <link>https://dev.to/shvshydv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shvshydv"/>
    <language>en</language>
    <item>
      <title>MapReduce Vs Tez</title>
      <dc:creator>Shivansh Yadav</dc:creator>
      <pubDate>Sun, 07 Jul 2024 09:03:01 +0000</pubDate>
      <link>https://dev.to/shvshydv/mapreduce-vs-tez-171g</link>
      <guid>https://dev.to/shvshydv/mapreduce-vs-tez-171g</guid>
      <description>&lt;p&gt;Apache Hadoop uses MapReduce as it's programming model for distributed processing of Big Data, but instead of writing multiple MapReduce jobs, we can also utilize the power of Hive or Pig.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hive:&lt;/strong&gt; Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pig:&lt;/strong&gt; Pig is a high-level platform for creating programs that run on Apache Hadoop. It provides a high-level scripting language called Pig Latin.&lt;/p&gt;

&lt;p&gt;Both Hive queries and Pig scripts are compiled to MapReduce programs in the background, and then jobs are executed in parallel across the Hadoop cluster.&lt;/p&gt;

&lt;p&gt;But instead of MapReduce both Hive and Pig can use &lt;strong&gt;Tez&lt;/strong&gt;.&lt;/p&gt;
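&lt;p&gt;For example, when Tez is installed, Hive lets you switch the execution engine per session (a sketch of the standard Hive setting):&lt;/p&gt;

```sql
-- Tell Hive to run queries on Tez instead of classic MapReduce
SET hive.execution.engine=tez;
-- ...and back again:
SET hive.execution.engine=mr;
```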




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzfqtwd79uga9quzxwr2y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzfqtwd79uga9quzxwr2y.png" alt="Hadoop ecosystem with Tez" width="653" height="402"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Apache Tez
&lt;/h2&gt;

&lt;p&gt;Apache Tez is a framework that creates a complex &lt;strong&gt;Directed Acyclic Graph (DAG)&lt;/strong&gt; of tasks for processing data.&lt;/p&gt;

&lt;p&gt;It uses the DAG to analyze the relationships between the different steps and figure out the most efficient path to the result.&lt;/p&gt;
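&lt;p&gt;As a rough intuition (a toy sketch in plain Python, not Tez's actual API), a job can be modeled as a DAG of named tasks that run in dependency order, with intermediate results handed along in memory:&lt;/p&gt;

```python
# Toy DAG execution sketch (illustration only, not Tez internals):
# each task declares its dependencies, and tasks run in topological
# order with intermediate results kept in memory between steps.
from graphlib import TopologicalSorter

# Each task name maps to (dependencies, function over dependency outputs).
tasks = {
    "load":   ((),          lambda: [3, 1, 2]),
    "double": (("load",),   lambda xs: [x * 2 for x in xs]),
    "sort":   (("double",), lambda xs: sorted(xs)),
    "total":  (("sort",),   lambda xs: sum(xs)),
}

def run_dag(tasks):
    order = TopologicalSorter({name: deps for name, (deps, _) in tasks.items()})
    results = {}  # intermediate results stay in memory
    for name in order.static_order():
        deps, fn = tasks[name]
        results[name] = fn(*(results[d] for d in deps))
    return results

print(run_dag(tasks)["total"])  # prints 12
```

&lt;p&gt;Because the whole graph is known up front, an engine like this can skip steps that nothing depends on and avoid materializing intermediates it will never reuse.&lt;/p&gt;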

&lt;p&gt;Therefore, Tez is typically much &lt;strong&gt;faster than MapReduce&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This technique is also used in Apache Spark for large-scale data processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  MapReduce Vs Tez
&lt;/h2&gt;

&lt;p&gt;MapReduce accesses the disk/HDFS multiple times during its data flow, i.e., Mapper -&amp;gt; Shuffle &amp;amp; Sort -&amp;gt; Reducer. It writes and reads data (or modified data) at each of these steps, resulting in &lt;strong&gt;5-6 disk accesses&lt;/strong&gt; for a single MapReduce job.&lt;/p&gt;

&lt;p&gt;On the other hand, Tez reads the data from disk once, performs all the steps while keeping the &lt;strong&gt;intermediate results in memory&lt;/strong&gt;, applies &lt;strong&gt;vectorization&lt;/strong&gt; (processing batches of rows instead of one row at a time), and produces the output.&lt;/p&gt;

&lt;p&gt;While MapReduce makes multiple reads/writes to HDFS, Tez avoids unneeded access to HDFS.&lt;/p&gt;
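&lt;p&gt;To make the difference concrete, here is a toy word count run both ways (a plain-Python illustration, not actual Hadoop or Tez code): the first version writes and re-reads an intermediate file between every step, the second chains the same steps entirely in memory:&lt;/p&gt;

```python
import json
import os
import tempfile

def mapper(lines):
    return [(w, 1) for line in lines for w in line.split()]

def shuffle_sort(pairs):
    return sorted(pairs)

def reducer(pairs):
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

def run_with_disk(lines, workdir):
    # MapReduce-style: a disk round-trip between every step
    data = lines
    for i, step in enumerate([mapper, shuffle_sort, reducer]):
        data = step(data)
        path = os.path.join(workdir, f"step{i}.json")
        with open(path, "w") as f:   # write intermediate result out...
            json.dump(data, f)
        with open(path) as f:        # ...and read it back in
            data = json.load(f)
    return data

def run_in_memory(lines):
    # Tez-style: the same steps chained without touching disk
    return reducer(shuffle_sort(mapper(lines)))

sample = ["a b a", "b"]
with tempfile.TemporaryDirectory() as d:
    assert run_with_disk(sample, d) == run_in_memory(sample)
print(run_in_memory(sample))  # prints {'a': 2, 'b': 2}
```

&lt;p&gt;Both versions produce identical results; the in-memory version simply skips the intermediate disk round-trips, which is the essence of Tez's advantage.&lt;/p&gt;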

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw59ibjxinc0pn685uau2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw59ibjxinc0pn685uau2.png" alt="MapReduce vs Tez" width="507" height="654"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>hadoop</category>
      <category>dataengineering</category>
      <category>database</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Introduction to Apache Hadoop &amp; MapReduce</title>
      <dc:creator>Shivansh Yadav</dc:creator>
      <pubDate>Sun, 30 Jun 2024 13:48:35 +0000</pubDate>
      <link>https://dev.to/shvshydv/introduction-to-apache-hadoop-30ka</link>
      <guid>https://dev.to/shvshydv/introduction-to-apache-hadoop-30ka</guid>
      <description>&lt;h2&gt;
  
  
  The History of Hadoop
&lt;/h2&gt;

&lt;p&gt;There are mainly two problems with big data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Storage for a huge amount of data.&lt;/li&gt;
&lt;li&gt;Processing of that stored data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In 2003, Google published a paper about its distributed file system, called &lt;strong&gt;GFS (Google File System)&lt;/strong&gt;, which can be used for storing large data sets.&lt;/p&gt;

&lt;p&gt;Similarly, in 2004, Google published a paper on &lt;strong&gt;MapReduce&lt;/strong&gt;, which described a solution for processing large datasets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Doug Cutting&lt;/strong&gt; and &lt;strong&gt;Mike Cafarella&lt;/strong&gt; (the founders of Hadoop) came across both of these papers, describing GFS and MapReduce, while working on the Apache Nutch project.&lt;/p&gt;

&lt;p&gt;The aim of the Apache Nutch project was to build a search engine system that could index 1 billion pages. They concluded that building such a system would cost millions of dollars.&lt;/p&gt;

&lt;p&gt;Neither of the Google papers was a complete solution for the Nutch project.&lt;/p&gt;

&lt;p&gt;Fast forward to 2006: Doug Cutting joined &lt;strong&gt;Yahoo&lt;/strong&gt; and started the &lt;strong&gt;Hadoop&lt;/strong&gt; project, implementing the ideas from Google's papers.&lt;/p&gt;

&lt;p&gt;Finally, in 2008, Yahoo released Hadoop as an open-source project to the &lt;strong&gt;ASF (Apache Software Foundation)&lt;/strong&gt;, and they successfully tested a &lt;strong&gt;4000-node cluster with Hadoop&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Intro to Apache Hadoop
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Apache Hadoop&lt;/strong&gt; is a software framework for &lt;strong&gt;distributed storage&lt;/strong&gt; and &lt;strong&gt;distributed processing&lt;/strong&gt; of big data using the MapReduce programming model.&lt;/p&gt;

&lt;p&gt;Hadoop comes with the following 4 modules:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;HDFS (Hadoop Distributed File System):&lt;/strong&gt; A file system inspired by GFS, used for the distributed storage of big data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;YARN (Yet Another Resource Negotiator):&lt;/strong&gt; A resource manager used for job scheduling and cluster resource management. It keeps track of which node does what work.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MapReduce:&lt;/strong&gt; The programming model used for distributed processing. It divides the data into partitions that are mapped (transformed) and reduced (aggregated).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hadoop Common:&lt;/strong&gt; It includes libraries and utilities used and shared by other Hadoop modules.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here is a block diagram representation of how they all work together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2htehy3xvd9wrdr5nbp4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2htehy3xvd9wrdr5nbp4.png" alt="Image description" width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  MapReduce
&lt;/h2&gt;

&lt;p&gt;As we know, MapReduce is a programming model that can process big data in a distributed manner. Let's see how MapReduce works internally.&lt;/p&gt;

&lt;p&gt;There are mainly 3 tasks performed during a MapReduce job:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Mapper&lt;/li&gt;
&lt;li&gt;Shuffle &amp;amp; Sort&lt;/li&gt;
&lt;li&gt;Reducer&lt;/li&gt;
&lt;/ol&gt;
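&lt;p&gt;For intuition, the three tasks above can be simulated in a few lines of plain Python (an in-memory sketch; real Hadoop runs each phase distributed across the cluster):&lt;/p&gt;

```python
from itertools import groupby
from operator import itemgetter

# A minimal in-memory simulation of the three MapReduce phases,
# applied to a word count over a few lines of text.
lines = ["the quick fox", "the lazy dog", "the fox"]

# 1. Mapper: emit a (key, value) pair per word
mapped = [(word, 1) for line in lines for word in line.split()]

# 2. Shuffle and Sort: sort by key, then group all values per key
mapped.sort(key=itemgetter(0))
grouped = {k: [v for _, v in g] for k, g in groupby(mapped, key=itemgetter(0))}

# 3. Reducer: aggregate the values for each key
counts = {word: sum(vals) for word, vals in grouped.items()}
print(counts)  # prints {'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```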

&lt;p&gt;Below is an example of what a MapReduce job looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkdj2w0k3rksjq1jp8f5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkdj2w0k3rksjq1jp8f5.png" alt="Image description" width="648" height="732"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This can vary and depends on how we want MapReduce to process our data.&lt;/p&gt;

&lt;p&gt;Hadoop &amp;amp; MapReduce are written natively in Java, but &lt;strong&gt;Hadoop Streaming&lt;/strong&gt; allows interfacing with other languages like Python.&lt;/p&gt;
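&lt;p&gt;A streaming mapper is just a script that reads lines on stdin and emits tab-separated key-value pairs on stdout. Here is a minimal word-count mapper in that style (a sketch; in a real job it would be passed to Hadoop Streaming via its -mapper option and driven by the framework):&lt;/p&gt;

```python
import io

# Minimal Hadoop Streaming-style mapper: one key TAB value pair
# per word, written to the output stream.
def map_stream(stdin, stdout):
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

# In a real job this would be map_stream(sys.stdin, sys.stdout);
# here we demo it with an in-memory stream instead.
out = io.StringIO()
map_stream(io.StringIO("to be or not to be\n"), out)
print(out.getvalue(), end="")
```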

&lt;p&gt;Here is example Python code for a MapReduce job, using the &lt;strong&gt;mrjob&lt;/strong&gt; library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mrjob.job&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MRJob&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mrjob.step&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MRStep&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RatingsBreakdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MRJob&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;MRStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mapper&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mapper_get_ratings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;reducer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reducer_count_ratings&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;MRStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reducer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reducer_sorted_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;mapper_get_ratings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;movie_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\t&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;movie_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reducer_count_ratings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;zfill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reducer_sorted_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;movie&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;movie&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;RatingsBreakdown&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Hadoop ecosystem has grown significantly and includes various tools and frameworks that build upon or complement the basic MapReduce model. Here’s a look at some of these technologies:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpl48vgpr8jmw2m8j7cj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpl48vgpr8jmw2m8j7cj.png" alt="Image description" width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While newer technologies offer more straightforward ways to handle big data, understanding MapReduce is fundamental to grasping the field's broader concepts.&lt;/p&gt;




&lt;p&gt;THE END&lt;/p&gt;

</description>
      <category>hadoop</category>
      <category>dataengineering</category>
      <category>bigdata</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
