<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shivansh Yadav</title>
    <description>The latest articles on DEV Community by Shivansh Yadav (@shvshydv).</description>
    <link>https://dev.to/shvshydv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F742871%2F5da970a9-ac12-481e-a314-0e9e62be68b9.jpg</url>
      <title>DEV Community: Shivansh Yadav</title>
      <link>https://dev.to/shvshydv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shvshydv"/>
    <language>en</language>
    <item>
      <title>MapReduce Vs Tez</title>
      <dc:creator>Shivansh Yadav</dc:creator>
      <pubDate>Sun, 07 Jul 2024 09:03:01 +0000</pubDate>
      <link>https://dev.to/shvshydv/mapreduce-vs-tez-171g</link>
      <guid>https://dev.to/shvshydv/mapreduce-vs-tez-171g</guid>
      <description>&lt;p&gt;Apache Hadoop uses MapReduce as it's programming model for distributed processing of Big Data, but instead of writing multiple MapReduce jobs, we can also utilize the power of Hive or Pig.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hive:&lt;/strong&gt; Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pig:&lt;/strong&gt; Pig is a high-level platform for creating programs that run on Apache Hadoop. It provides a high-level scripting language called Pig Latin.&lt;/p&gt;

&lt;p&gt;Both Hive queries and Pig scripts are compiled to MapReduce programs in the background, and then jobs are executed in parallel across the Hadoop cluster.&lt;/p&gt;

&lt;p&gt;But instead of MapReduce both Hive and Pig can use &lt;strong&gt;Tez&lt;/strong&gt;.&lt;/p&gt;
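&lt;p&gt;For example, when Tez is installed, Hive lets you switch the execution engine per session (a sketch of the standard Hive setting):&lt;/p&gt;

```sql
-- Tell Hive to run queries on Tez instead of classic MapReduce
SET hive.execution.engine=tez;
-- ...and back again:
SET hive.execution.engine=mr;
```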




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzfqtwd79uga9quzxwr2y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzfqtwd79uga9quzxwr2y.png" alt="Hadoop ecosystem with Tez" width="653" height="402"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Apache Tez
&lt;/h2&gt;

&lt;p&gt;Apache Tez is a framework that creates a complex &lt;strong&gt;Directed Acyclic Graph (DAG)&lt;/strong&gt; of tasks for processing data.&lt;/p&gt;

&lt;p&gt;It uses the DAG to analyze the relationships between the different steps and figure out the most efficient path to the result.&lt;/p&gt;
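&lt;p&gt;As a rough intuition (a toy sketch in plain Python, not Tez's actual API), a job can be modeled as a DAG of named tasks that run in dependency order, with intermediate results handed along in memory:&lt;/p&gt;

```python
# Toy DAG execution sketch (illustration only, not Tez internals):
# each task declares its dependencies, and tasks run in topological
# order with intermediate results kept in memory between steps.
from graphlib import TopologicalSorter

# Each task name maps to (dependencies, function over dependency outputs).
tasks = {
    "load":   ((),          lambda: [3, 1, 2]),
    "double": (("load",),   lambda xs: [x * 2 for x in xs]),
    "sort":   (("double",), lambda xs: sorted(xs)),
    "total":  (("sort",),   lambda xs: sum(xs)),
}

def run_dag(tasks):
    order = TopologicalSorter({name: deps for name, (deps, _) in tasks.items()})
    results = {}  # intermediate results stay in memory
    for name in order.static_order():
        deps, fn = tasks[name]
        results[name] = fn(*(results[d] for d in deps))
    return results

print(run_dag(tasks)["total"])  # prints 12
```

&lt;p&gt;Because the whole graph is known up front, an engine like this can skip steps that nothing depends on and avoid materializing intermediates it will never reuse.&lt;/p&gt;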

&lt;p&gt;Therefore, Tez is typically much &lt;strong&gt;faster than MapReduce&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This technique is also used in Apache Spark for large-scale data processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  MapReduce Vs Tez
&lt;/h2&gt;

&lt;p&gt;MapReduce accesses the disk/HDFS multiple times during its data flow, i.e., Mapper -&amp;gt; Shuffle &amp;amp; Sort -&amp;gt; Reducer. It writes and reads data (or modified data) at each of these steps, resulting in &lt;strong&gt;5-6 disk accesses&lt;/strong&gt; for a single MapReduce job.&lt;/p&gt;

&lt;p&gt;On the other hand, Tez reads the data from disk once, performs all the steps while keeping the &lt;strong&gt;intermediate results in memory&lt;/strong&gt;, applies &lt;strong&gt;vectorization&lt;/strong&gt; (processing batches of rows instead of one row at a time), and produces the output.&lt;/p&gt;

&lt;p&gt;While MapReduce makes multiple reads/writes to HDFS, Tez avoids unneeded access to HDFS.&lt;/p&gt;
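&lt;p&gt;To make the difference concrete, here is a toy word count run both ways (a plain-Python illustration, not actual Hadoop or Tez code): the first version writes and re-reads an intermediate file between every step, the second chains the same steps entirely in memory:&lt;/p&gt;

```python
import json
import os
import tempfile

def mapper(lines):
    return [(w, 1) for line in lines for w in line.split()]

def shuffle_sort(pairs):
    return sorted(pairs)

def reducer(pairs):
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

def run_with_disk(lines, workdir):
    # MapReduce-style: a disk round-trip between every step
    data = lines
    for i, step in enumerate([mapper, shuffle_sort, reducer]):
        data = step(data)
        path = os.path.join(workdir, f"step{i}.json")
        with open(path, "w") as f:   # write intermediate result out...
            json.dump(data, f)
        with open(path) as f:        # ...and read it back in
            data = json.load(f)
    return data

def run_in_memory(lines):
    # Tez-style: the same steps chained without touching disk
    return reducer(shuffle_sort(mapper(lines)))

sample = ["a b a", "b"]
with tempfile.TemporaryDirectory() as d:
    assert run_with_disk(sample, d) == run_in_memory(sample)
print(run_in_memory(sample))  # prints {'a': 2, 'b': 2}
```

&lt;p&gt;Both versions produce identical results; the in-memory version simply skips the intermediate disk round-trips, which is the essence of Tez's advantage.&lt;/p&gt;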

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw59ibjxinc0pn685uau2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw59ibjxinc0pn685uau2.png" alt="MapReduce vs Tez" width="507" height="654"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>hadoop</category>
      <category>dataengineering</category>
      <category>database</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Introduction to Apache Hadoop &amp; MapReduce</title>
      <dc:creator>Shivansh Yadav</dc:creator>
      <pubDate>Sun, 30 Jun 2024 13:48:35 +0000</pubDate>
      <link>https://dev.to/shvshydv/introduction-to-apache-hadoop-30ka</link>
      <guid>https://dev.to/shvshydv/introduction-to-apache-hadoop-30ka</guid>
      <description>&lt;h2&gt;
  
  
  The History of Hadoop
&lt;/h2&gt;

&lt;p&gt;There are mainly two problems with big data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Storage for a huge amount of data.&lt;/li&gt;
&lt;li&gt;Processing of that stored data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In 2003, Google published a paper about its distributed file system, called &lt;strong&gt;GFS (Google File System)&lt;/strong&gt;, which can be used for storing large data sets.&lt;/p&gt;

&lt;p&gt;Similarly, in 2004, Google published a paper on &lt;strong&gt;MapReduce&lt;/strong&gt;, which described a solution for processing large datasets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Doug Cutting&lt;/strong&gt; and &lt;strong&gt;Mike Cafarella&lt;/strong&gt; (the founders of Hadoop) came across both of these papers, describing GFS and MapReduce, while working on the Apache Nutch project.&lt;/p&gt;

&lt;p&gt;The aim of the Apache Nutch project was to build a search engine system that could index 1 billion pages. They concluded that building such a system would cost millions of dollars.&lt;/p&gt;

&lt;p&gt;Neither of the Google papers was a complete solution for the Nutch project.&lt;/p&gt;

&lt;p&gt;Fast forward to 2006: Doug Cutting joined &lt;strong&gt;Yahoo&lt;/strong&gt; and started the &lt;strong&gt;Hadoop&lt;/strong&gt; project, implementing the ideas from Google's papers.&lt;/p&gt;

&lt;p&gt;Finally, in 2008, Yahoo released Hadoop as an open-source project to the &lt;strong&gt;ASF (Apache Software Foundation)&lt;/strong&gt;, and they successfully tested a &lt;strong&gt;4000-node cluster with Hadoop&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Intro to Apache Hadoop
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Apache Hadoop&lt;/strong&gt; is a software framework for &lt;strong&gt;distributed storage&lt;/strong&gt; and &lt;strong&gt;distributed processing&lt;/strong&gt; of big data using the MapReduce programming model.&lt;/p&gt;

&lt;p&gt;Hadoop comes with the following 4 modules:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;HDFS (Hadoop Distributed File System):&lt;/strong&gt; A file system inspired by GFS, used for the distributed storage of big data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;YARN (Yet Another Resource Negotiator):&lt;/strong&gt; A resource manager used for job scheduling and cluster resource management. It keeps track of which node does what work.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MapReduce:&lt;/strong&gt; The programming model used for distributed processing. It divides the data into partitions that are mapped (transformed) and reduced (aggregated).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hadoop Common:&lt;/strong&gt; It includes libraries and utilities used and shared by other Hadoop modules.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here is a block diagram representation of how they all work together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2htehy3xvd9wrdr5nbp4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2htehy3xvd9wrdr5nbp4.png" alt="Image description" width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  MapReduce
&lt;/h2&gt;

&lt;p&gt;As we know, MapReduce is a programming model that can process big data in a distributed manner. Let's see how MapReduce works internally.&lt;/p&gt;

&lt;p&gt;There are mainly 3 tasks performed during a MapReduce job:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Mapper&lt;/li&gt;
&lt;li&gt;Shuffle &amp;amp; Sort&lt;/li&gt;
&lt;li&gt;Reducer&lt;/li&gt;
&lt;/ol&gt;
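&lt;p&gt;For intuition, the three tasks above can be simulated in a few lines of plain Python (an in-memory sketch; real Hadoop runs each phase distributed across the cluster):&lt;/p&gt;

```python
from itertools import groupby
from operator import itemgetter

# A minimal in-memory simulation of the three MapReduce phases,
# applied to a word count over a few lines of text.
lines = ["the quick fox", "the lazy dog", "the fox"]

# 1. Mapper: emit a (key, value) pair per word
mapped = [(word, 1) for line in lines for word in line.split()]

# 2. Shuffle and Sort: sort by key, then group all values per key
mapped.sort(key=itemgetter(0))
grouped = {k: [v for _, v in g] for k, g in groupby(mapped, key=itemgetter(0))}

# 3. Reducer: aggregate the values for each key
counts = {word: sum(vals) for word, vals in grouped.items()}
print(counts)  # prints {'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```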

&lt;p&gt;Below is an example of what a MapReduce job looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkdj2w0k3rksjq1jp8f5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkdj2w0k3rksjq1jp8f5.png" alt="Image description" width="648" height="732"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This can vary and depends on how we want MapReduce to process our data.&lt;/p&gt;

&lt;p&gt;Hadoop &amp;amp; MapReduce are written natively in Java, but &lt;strong&gt;Hadoop Streaming&lt;/strong&gt; allows interfacing with other languages like Python.&lt;/p&gt;
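&lt;p&gt;A streaming mapper is just a script that reads lines on stdin and emits tab-separated key-value pairs on stdout. Here is a minimal word-count mapper in that style (a sketch; in a real job it would be passed to Hadoop Streaming via its -mapper option and driven by the framework):&lt;/p&gt;

```python
import io

# Minimal Hadoop Streaming-style mapper: one key TAB value pair
# per word, written to the output stream.
def map_stream(stdin, stdout):
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

# In a real job this would be map_stream(sys.stdin, sys.stdout);
# here we demo it with an in-memory stream instead.
out = io.StringIO()
map_stream(io.StringIO("to be or not to be\n"), out)
print(out.getvalue(), end="")
```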

&lt;p&gt;Here is example Python code for a MapReduce job, using the &lt;strong&gt;mrjob&lt;/strong&gt; library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mrjob.job&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MRJob&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mrjob.step&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MRStep&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RatingsBreakdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MRJob&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;MRStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mapper&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mapper_get_ratings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;reducer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reducer_count_ratings&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;MRStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reducer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reducer_sorted_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;mapper_get_ratings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;movie_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\t&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;movie_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reducer_count_ratings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;zfill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reducer_sorted_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;movie&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;movie&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;RatingsBreakdown&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Hadoop ecosystem has grown significantly and includes various tools and frameworks that build upon or complement the basic MapReduce model. Here’s a look at some of these technologies:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpl48vgpr8jmw2m8j7cj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpl48vgpr8jmw2m8j7cj.png" alt="Image description" width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While newer technologies offer more straightforward ways to handle big data, understanding MapReduce is fundamental to grasping the field's broader concepts.&lt;/p&gt;




&lt;p&gt;THE END&lt;/p&gt;

</description>
      <category>hadoop</category>
      <category>dataengineering</category>
      <category>bigdata</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
