<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: yell0wturtle</title>
    <description>The latest articles on DEV Community by yell0wturtle (@yell0wturtle).</description>
    <link>https://dev.to/yell0wturtle</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F188387%2F4101bc79-61fc-49a7-84e3-49c7cf763b78.jpg</url>
      <title>DEV Community: yell0wturtle</title>
      <link>https://dev.to/yell0wturtle</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yell0wturtle"/>
    <language>en</language>
    <item>
      <title>Path to become a junior+ data engineer?</title>
      <dc:creator>yell0wturtle</dc:creator>
      <pubDate>Sat, 12 Oct 2019 15:35:51 +0000</pubDate>
      <link>https://dev.to/yell0wturtle/path-to-become-a-junior-data-engineer-30mb</link>
      <guid>https://dev.to/yell0wturtle/path-to-become-a-junior-data-engineer-30mb</guid>
      <description>&lt;p&gt;Henlo!&lt;/p&gt;

&lt;p&gt;I'm an I.T. student and I'd like to work as a data engineer but I'm like a fish lost in an ocean of big data tools.&lt;/p&gt;

&lt;p&gt;First off, I've got a strong Web background, mainly doing back-end stuff such as building and deploying microservices around the internet. But what I like most is working with data, Big Data.&lt;/p&gt;

&lt;p&gt;But I don't know where to start. Today I'm quite confident with Apache Beam, SQL/NoSQL, Messaging Queues, Cloud solutions... but I feel like it's nothing compared to the great diversity of Big Data tools.&lt;/p&gt;

&lt;p&gt;Should I go for open-source stuff such as Kafka, Cassandra, HDFS etc., or should I focus on the Cloud side (Cloud Dataflow, AWS EMR, Pub/Sub, Kinesis...)?&lt;/p&gt;

&lt;p&gt;I'd appreciate any help ;)&lt;/p&gt;

</description>
      <category>help</category>
      <category>distributedsystems</category>
      <category>spark</category>
      <category>hadoop</category>
    </item>
    <item>
      <title>Every day you're shuffling, a cat dies.</title>
      <dc:creator>yell0wturtle</dc:creator>
      <pubDate>Sun, 29 Sep 2019 14:04:48 +0000</pubDate>
      <link>https://dev.to/yell0wturtle/map-shuffle-reduce-what-is-shuffling-5cbn</link>
      <guid>https://dev.to/yell0wturtle/map-shuffle-reduce-what-is-shuffling-5cbn</guid>
      <description>&lt;h3&gt;
  
  
  The dataset
&lt;/h3&gt;

&lt;p&gt;Let's say you have a simple log file which contains billions of financial transactions, and you'd like to &lt;strong&gt;count the occurrences of each type of credit card&lt;/strong&gt; in order to build some analytics (your consumers' behaviour...).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;sender&lt;/th&gt;
&lt;th&gt;receiver&lt;/th&gt;
&lt;th&gt;card_type&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;Alan&lt;/td&gt;
&lt;td&gt;Visa&lt;/td&gt;
&lt;td&gt;42.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Alan&lt;/td&gt;
&lt;td&gt;Dylan&lt;/td&gt;
&lt;td&gt;Mastercard&lt;/td&gt;
&lt;td&gt;69.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Lucie&lt;/td&gt;
&lt;td&gt;Eric&lt;/td&gt;
&lt;td&gt;Visa&lt;/td&gt;
&lt;td&gt;11.90&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The map phase
&lt;/h3&gt;

&lt;p&gt;First off, you have to build a PCollection of transactions from this source text file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Representation of a transaction&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Transaction&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Id&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;Sender&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;Receiver&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;CardType&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;Amount&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// NOTHING -&amp;gt; PCollection&amp;lt;Transaction&amp;gt;&lt;/span&gt;
&lt;span class="n"&gt;transactions&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;textio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"../path/to/file.txt"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, you must extract the credit card type from each transaction and build a key-value pair where the key is the card type and the value is 1. This lets you run a reduce function later to aggregate the counts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// NOTHING -&amp;gt; PCollection&amp;lt;Transaction&amp;gt;&lt;/span&gt;
&lt;span class="n"&gt;transactions&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;textio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"../path/to/file.txt"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// New&lt;/span&gt;
&lt;span class="c"&gt;// PCollection&amp;lt;Transaction&amp;gt; -&amp;gt; PCollection&amp;lt;KV&amp;lt;string, int&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="n"&gt;pairedWithOne&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;beam&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ParDo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tsx&lt;/span&gt; &lt;span class="n"&gt;Transaction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tsx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CardType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What might this look like on the cluster?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's say you have a cluster of 3 worker nodes. Each node has a chunk (or partition) of your data and the user code (UDFs) to apply to each element. We will get a bunch of key-value pairs on each node, like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Worker1&lt;/th&gt;
&lt;th&gt;Worker2&lt;/th&gt;
&lt;th&gt;Worker3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;td&gt;(Maestro, 1)&lt;/td&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;td&gt;(Maestro, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Maestro, 1)&lt;/td&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The reduce phase
&lt;/h3&gt;

&lt;p&gt;Now, you want to group all values with the same key (the credit card type) together in order to aggregate a sum. So you'd use a combination of a &lt;code&gt;GroupByKey&lt;/code&gt; with a reducer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// NOTHING -&amp;gt; PCollection&amp;lt;Transaction&amp;gt;&lt;/span&gt;
&lt;span class="n"&gt;transactions&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;textio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"../path/to/file.txt"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// PCollection&amp;lt;Transaction&amp;gt; -&amp;gt; PCollection&amp;lt;KV&amp;lt;string, int&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="n"&gt;pairedWithOne&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;beam&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ParDo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tsx&lt;/span&gt; &lt;span class="n"&gt;Transaction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tsx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CardType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// New&lt;/span&gt;
&lt;span class="c"&gt;// PCollection&amp;lt;KV&amp;lt;string, int&amp;gt;&amp;gt; -&amp;gt; PCollection&amp;lt;GBK&amp;lt;string, iter&amp;lt;int&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="n"&gt;grouppedByKey&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;beam&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GroupByKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pairedWithOne&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// PCollection&amp;lt;GBK&amp;lt;string, iter&amp;lt;int&amp;gt;&amp;gt;&amp;gt; -&amp;gt; PCollection&amp;lt;KV&amp;lt;string, int&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="n"&gt;counted&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;beam&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ParDo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iter&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;grouppedByKey&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
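&lt;p&gt;Note that &lt;code&gt;sum&lt;/code&gt; is not a Go builtin: it's a tiny helper you'd have to define yourself. A minimal version might look like this:&lt;/p&gt;

```go
// sum adds up all the grouped counts for one key.
func sum(values []int) int {
	total := 0
	for _, v := range values {
		total += v
	}
	return total
}
```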



&lt;p&gt;&lt;strong&gt;What might this look like on the cluster?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Right before the reduce phase happens, key-value pairs will &lt;strong&gt;move across the network&lt;/strong&gt; so that identical keys are gathered on the same machine. In this example, we assume each worker node hosts a unique key. Each key holds an iterable of integers as a value, which represents the number of times the system has seen that credit card type.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;GroupByKey&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Worker1&lt;/th&gt;
&lt;th&gt;Worker2&lt;/th&gt;
&lt;th&gt;Worker3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Visa, [1, 1, 1, 1, 1, 1, 1, 1])&lt;/td&gt;
&lt;td&gt;(Mastercard, [1, 1, 1, 1, 1, 1, 1])&lt;/td&gt;
&lt;td&gt;(Maestro, [1, 1, 1])&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Reduction&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Worker1&lt;/th&gt;
&lt;th&gt;Worker2&lt;/th&gt;
&lt;th&gt;Worker3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Visa, 8)&lt;/td&gt;
&lt;td&gt;(Mastercard, 7)&lt;/td&gt;
&lt;td&gt;(Maestro, 3)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Houston, we have a problem
&lt;/h3&gt;

&lt;p&gt;But didn't we talk about shuffling? Just before the reduce phase, when data moves across the network, that is called &lt;strong&gt;shuffling&lt;/strong&gt;. And while shuffling a small dataset might be &lt;em&gt;okay&lt;/em&gt;, on a huge dataset it introduces &lt;strong&gt;latency&lt;/strong&gt;, because too much network communication kills performance. &lt;/p&gt;

&lt;p&gt;So you &lt;strong&gt;do not want&lt;/strong&gt; to send all of your data across the network if it's not &lt;strong&gt;absolutely required&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/l3V0cSOBfJvvNsSzu/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/l3V0cSOBfJvvNsSzu/giphy.gif" alt="" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, in our case, the &lt;code&gt;GroupByKey&lt;/code&gt; was a naive approach. We have elements of the same type and we want to calculate a sum... can we do it more efficiently? Can we reduce &lt;em&gt;before&lt;/em&gt; the shuffle, to cut down the data sent over the network? Yay!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/7XuPYJXaF1CBAmbwQQ/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/7XuPYJXaF1CBAmbwQQ/giphy.gif" alt="" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  SumPerKey/CombinePerKey
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;SumPerKey&lt;/code&gt; and &lt;code&gt;CombinePerKey&lt;/code&gt; combine a GroupByKey with a &lt;strong&gt;reduction&lt;/strong&gt; over all the values grouped under each key. This is way more efficient than using the two steps separately. And you can use them here because you are aggregating values of the same type, so that's perfectly fine.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Using GroupByKey&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// PCollection&amp;lt;KV&amp;lt;string, int&amp;gt;&amp;gt; -&amp;gt; PCollection&amp;lt;GBK&amp;lt;string, iter&amp;lt;int&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="n"&gt;grouppedByKey&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;beam&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GroupByKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pairedWithOne&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// PCollection&amp;lt;GBK&amp;lt;string, iter&amp;lt;int&amp;gt;&amp;gt;&amp;gt; -&amp;gt; PCollection&amp;lt;KV&amp;lt;string, int&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="n"&gt;counted&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;beam&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ParDo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iter&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;grouppedByKey&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Using SumPerKey&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// PCollection&amp;lt;KV&amp;lt;string, int&amp;gt;&amp;gt; -&amp;gt; PCollection&amp;lt;KV&amp;lt;string, int&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SumPerKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pairedWithOne&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What might this look like on the cluster?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because &lt;code&gt;SumPerKey&lt;/code&gt; is an associative reduction, magic happens! It reduces the data on the mapper side first, before sending the aggregated results downstream. And because the results are already pre-aggregated, the data sent over the network for the final reduction is greatly reduced. So you get the same output, but &lt;strong&gt;faster&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/WU8nnAdxZWeM8/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/WU8nnAdxZWeM8/giphy.gif" alt="Magic" width="360" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Mappers will build key-value pairs&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Worker1&lt;/th&gt;
&lt;th&gt;Worker2&lt;/th&gt;
&lt;th&gt;Worker3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;td&gt;(Maestro, 1)&lt;/td&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;td&gt;(Maestro, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Maestro, 1)&lt;/td&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Mappers will group elements by key&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Worker1&lt;/th&gt;
&lt;th&gt;Worker2&lt;/th&gt;
&lt;th&gt;Worker3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Visa, [1, 1, 1])&lt;/td&gt;
&lt;td&gt;(Mastercard, [1, 1, 1])&lt;/td&gt;
&lt;td&gt;(Visa, [1, 1, 1])&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Mastercard, [1, 1])&lt;/td&gt;
&lt;td&gt;(Visa, [1, 1])&lt;/td&gt;
&lt;td&gt;(Mastercard, [1, 1])&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Maestro, [1])&lt;/td&gt;
&lt;td&gt;(Maestro, [1])&lt;/td&gt;
&lt;td&gt;(Maestro, [1])&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Mappers will reduce the data they own locally&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Worker1&lt;/th&gt;
&lt;th&gt;Worker2&lt;/th&gt;
&lt;th&gt;Worker3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Visa, 3)&lt;/td&gt;
&lt;td&gt;(Mastercard, 3)&lt;/td&gt;
&lt;td&gt;(Visa, 3)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Mastercard, 2)&lt;/td&gt;
&lt;td&gt;(Visa, 2)&lt;/td&gt;
&lt;td&gt;(Mastercard, 2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Maestro, 1)&lt;/td&gt;
&lt;td&gt;(Maestro, 1)&lt;/td&gt;
&lt;td&gt;(Maestro, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And then, here comes the shuffle phase followed by the reduce phase.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Aggregated output&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Worker1&lt;/th&gt;
&lt;th&gt;Worker2&lt;/th&gt;
&lt;th&gt;Worker3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Visa, 8)&lt;/td&gt;
&lt;td&gt;(Mastercard, 7)&lt;/td&gt;
&lt;td&gt;(Maestro, 3)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  To sum up
&lt;/h3&gt;

&lt;p&gt;Not every problem that can be solved with &lt;code&gt;GroupByKey&lt;/code&gt; can be solved with &lt;code&gt;SumPerKey&lt;/code&gt; / &lt;code&gt;CombinePerKey&lt;/code&gt;; they require combining all your values into another value of the &lt;strong&gt;exact same type&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But just keep in mind that you can greatly optimise your data processing pipeline by reducing the shuffle as much as possible.&lt;/p&gt;
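&lt;p&gt;If you want to see the savings concretely without a cluster, here is a small, hypothetical plain-Go simulation (the function and variable names are mine, not Beam's API) that replays the per-worker tables above and counts how many records would cross the network with and without local pre-aggregation:&lt;/p&gt;

```go
package main

import "fmt"

// simulate replays the per-worker (cardType, 1) pairs from the tables above
// and returns how many records a plain GroupByKey would shuffle, how many a
// CombinePerKey-style local pre-aggregation would shuffle, and the final counts.
func simulate() (naiveShuffled, combinedShuffled int, totals map[string]int) {
	workers := [][]string{
		{"Visa", "Mastercard", "Mastercard", "Visa", "Visa", "Maestro"},
		{"Mastercard", "Visa", "Visa", "Maestro", "Mastercard", "Mastercard"},
		{"Visa", "Mastercard", "Visa", "Mastercard", "Maestro", "Visa"},
	}
	totals = map[string]int{}
	for _, w := range workers {
		// Naive GroupByKey: every single (key, 1) pair crosses the network.
		naiveShuffled += len(w)
		// CombinePerKey-style: each worker pre-aggregates locally first, so
		// only one (key, partialCount) pair per key per worker is shuffled.
		local := map[string]int{}
		for _, card := range w {
			local[card]++
		}
		combinedShuffled += len(local)
		for k, v := range local {
			totals[k] += v
		}
	}
	return naiveShuffled, combinedShuffled, totals
}

func main() {
	naive, combined, totals := simulate()
	fmt.Println("records shuffled with GroupByKey:", naive)       // 18
	fmt.Println("records shuffled with CombinePerKey:", combined) // 9
	fmt.Println("final counts:", totals)                          // same answer either way
}
```

&lt;p&gt;Same output, half the network traffic, and the gap only widens as each worker holds more rows per key.&lt;/p&gt;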

&lt;p&gt;Happy coding. 🔥&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>beam</category>
      <category>mapreduce</category>
      <category>shuffling</category>
    </item>
    <item>
      <title>Explain MapReduce Like I'm Five</title>
      <dc:creator>yell0wturtle</dc:creator>
      <pubDate>Sat, 28 Sep 2019 16:27:07 +0000</pubDate>
      <link>https://dev.to/yell0wturtle/explain-apache-beam-like-i-m-five-5784</link>
      <guid>https://dev.to/yell0wturtle/explain-apache-beam-like-i-m-five-5784</guid>
      <description>&lt;h3&gt;
  
  
  Once upon a time...
&lt;/h3&gt;

&lt;p&gt;So first of all, there is Big Data. To make it short, Big Data is about handling (semi-)structured or unstructured datasets which are too large to fit in the memory of a single computer (or a few), bla bla bla. &lt;/p&gt;

&lt;p&gt;And we talk about Big Data because today almost everything produces data: your phone, your fridge, your socks... everything. People then use these datasets to understand their customers' behaviour, to monitor web services, or to do fraud detection and other financial things, you know.&lt;/p&gt;

&lt;p&gt;To illustrate my point, let's say you're the developer of a very popular multiplayer game, and as soon as a player performs a key action, your server saves it in a log file. So you have a log file that contains &lt;strong&gt;billions of records&lt;/strong&gt;, and you'd like to extract some useful information such as the average number of quests completed, per server.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/14e27FhfQA3Yivhi6e/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/14e27FhfQA3Yivhi6e/giphy.gif" alt="" width="400" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Shared memory data-parallelism
&lt;/h3&gt;

&lt;p&gt;You've got a basic computer and you want to do these computations, so what does that look like? Well, first, your program has to take this dataset and cut it into several chunks via a partitioning algorithm. Then, it has to spin up a few tasks, some kind of workers/threads, that will independently operate on a chunk, in parallel. Lastly, they combine, if necessary, their results into one single output. &lt;/p&gt;

&lt;p&gt;Great, but if you run this, your computer either &lt;strong&gt;explodes&lt;/strong&gt; or &lt;strong&gt;runs out of memory&lt;/strong&gt; because that's really too much data to handle... You have to find another way. Maybe you could do some vertical scaling by buying that &lt;strong&gt;overpriced $50,000 new Apple Mac Pro&lt;/strong&gt;. It would probably help (I hope hehe xd) but that's not really scalable at all. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/NTur7XlVDUdqM/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/NTur7XlVDUdqM/giphy.gif" alt="" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In fact, if your game becomes even more popular than it is, you'll get a larger volume of data, you'll have to upgrade your computer, spend a shitload of money and so on, until you get a file so large that you can't store it on any f**** machine.&lt;/p&gt;

&lt;p&gt;No... you have to find a more effective way. Maybe, could you just &lt;strong&gt;distribute&lt;/strong&gt; the problem across &lt;strong&gt;multiple machines&lt;/strong&gt;? Yay!&lt;/p&gt;

&lt;h3&gt;
  
  
  Distributed data-parallelism
&lt;/h3&gt;

&lt;p&gt;So instead of upgrading your computer to be able to do these computations, you'll use a &lt;strong&gt;cluster of very traditional machines&lt;/strong&gt; to get the job done in a short time. It's about the same configuration as above, but with some key differences.&lt;/p&gt;

&lt;p&gt;The dataset is chopped up into pieces, each piece is replicated a number of times, and the copies are stored over several nodes. That's what we do via a &lt;strong&gt;distributed file system&lt;/strong&gt; (DFS). And we no longer have threads sharing the same memory, but &lt;strong&gt;worker nodes&lt;/strong&gt; operating independently of each other, in parallel, each on a fragment of the dataset.&lt;/p&gt;

&lt;p&gt;Alright: the more the dataset grows, the more nodes you use... and you get the job done fast. You may use hundreds or thousands of machines; that's highly scalable. &lt;/p&gt;

&lt;h3&gt;
  
  
  MapReduce joined the chat
&lt;/h3&gt;

&lt;p&gt;I've mentioned distributing parallel computing tasks over several computers, but how do we do that effectively? Here comes &lt;strong&gt;MapReduce&lt;/strong&gt; &lt;em&gt;(heavy breathing intensifies)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/el8NjD5XUqTX3ggekt/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/el8NjD5XUqTX3ggekt/giphy.gif" alt="" width="320" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MapReduce is a programming paradigm invented by &lt;a href="https://static.googleusercontent.com/media/research.google.com/fr//archive/mapreduce-osdi04.pdf"&gt;Google&lt;/a&gt; which describes a way of structuring your job that allows it to be easily run on many machines. What does that mean? Well, if you write your computation in a specific way, then it's really easy for it to take advantage of hundreds or thousands of computers. And it's very easy to understand. Sweet!&lt;/p&gt;

&lt;p&gt;Let's say you have a huge file that holds billions of Amazon reviews, and you'd like to get the average rating per product category. With MapReduce, that's a walk in the park.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/MSQOkUiyawspMTZT5B/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/MSQOkUiyawspMTZT5B/giphy.gif" alt="" width="480" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First off, there is the &lt;strong&gt;map&lt;/strong&gt; phase. You have a bunch of computers, and your dataset, which has been sliced into pieces, is spread over them. These machines do not communicate with each other during the map phase: each focuses on its given chunk/partition and runs the user-defined function on each element to produce key-value pairs.&lt;/p&gt;

&lt;p&gt;So for each element we'll grab the product category as a key and the customer rating as a value (product_category, user_rating), and we end up with this representation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;mapper&lt;/th&gt;
&lt;th&gt;mapper&lt;/th&gt;
&lt;th&gt;mapper&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Books, 3.5)&lt;/td&gt;
&lt;td&gt;(Phones, 4.5)&lt;/td&gt;
&lt;td&gt;(Cooking, 4)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Phones, 5)&lt;/td&gt;
&lt;td&gt;(Cooking, 5)&lt;/td&gt;
&lt;td&gt;(Books, 2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Phones, 3)&lt;/td&gt;
&lt;td&gt;(Books, 3.5)&lt;/td&gt;
&lt;td&gt;(Phones, 5)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Then, there is a shuffle phase which sorts these keys and sends them to the next phase, so that every reducer obtains all the values associated with the same key.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;reducer&lt;/th&gt;
&lt;th&gt;reducer&lt;/th&gt;
&lt;th&gt;reducer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Books, [3.5, 2, 3.5])&lt;/td&gt;
&lt;td&gt;(Cooking, [4, 5])&lt;/td&gt;
&lt;td&gt;(Phones, [4.5, 5, 3, 5])&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Finally, here comes the reduce phase, which combines all the results from the map stage to aggregate one value as an output. In this example, we want the average rating:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;reducer&lt;/th&gt;
&lt;th&gt;reducer&lt;/th&gt;
&lt;th&gt;reducer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Books, 3)&lt;/td&gt;
&lt;td&gt;(Cooking, 4.5)&lt;/td&gt;
&lt;td&gt;(Phones, 4.38)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
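&lt;p&gt;The three phases are easy to simulate on a single machine. Here is a little sketch in plain Go (not a real MapReduce framework, just the same logic, with names I made up) that shuffles and reduces every pair from the mapper table:&lt;/p&gt;

```go
package main

import "fmt"

// pair is one mapper output: a product category and a customer rating.
type pair struct {
	category string
	rating   float64
}

// averageByCategory runs the shuffle and reduce phases locally:
// group every rating under its category, then average each group.
func averageByCategory(mapped []pair) map[string]float64 {
	// Shuffle: gather all ratings per category.
	grouped := map[string][]float64{}
	for _, p := range mapped {
		grouped[p.category] = append(grouped[p.category], p.rating)
	}
	// Reduce: average the ratings for each category.
	averages := map[string]float64{}
	for category, ratings := range grouped {
		total := 0.0
		for _, r := range ratings {
			total += r
		}
		averages[category] = total / float64(len(ratings))
	}
	return averages
}

func main() {
	// The mapper outputs from the table above.
	mapped := []pair{
		{"Books", 3.5}, {"Phones", 4.5}, {"Cooking", 4},
		{"Phones", 5}, {"Cooking", 5}, {"Books", 2},
		{"Phones", 3}, {"Books", 3.5}, {"Phones", 5},
	}
	for category, avg := range averageByCategory(mapped) {
		fmt.Printf("%s: %.2f\n", category, avg)
	}
}
```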

&lt;p&gt;That's all. And this can scale up to hundreds and thousands of machines; the process will be exactly the same. That's how big companies such as Google built things like &lt;strong&gt;search engine indexing&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The yellow elephant
&lt;/h3&gt;

&lt;p&gt;If you're planning to build MapReduce-style jobs, there is Hadoop for that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hadoop.apache.org/"&gt;Hadoop&lt;/a&gt; is very popular open-source project under Apache which provides you a distributed file system called &lt;strong&gt;Hadoop Distributed File System&lt;/strong&gt; (HDFS) to store your large datasets. And it provides also the famous data-processing framework: &lt;strong&gt;Hadoop MapReduce&lt;/strong&gt;. If you've read so far, I hope you probably know how this works ;).&lt;/p&gt;

&lt;p&gt;Just keep in mind that MapReduce may not fit your needs. You could solve basically any class of problem with MapReduce if you take your algorithm, twist it enough and shove it into that map-shuffle-reduce structure, but the framework can be too restrictive for what you're trying to achieve. Also, MapReduce is... slow, because it has to read/write data frequently between stages, and these I/O operations are really expensive. &lt;/p&gt;

&lt;p&gt;Finally, you can do batch-style processing only. That means, to take our previous online game example, the game server will generate daily logs: you'll get all the events from today, but you'll have to wait until tomorrow to run your MapReduce job and get the results. But say you want &lt;strong&gt;much lower latency&lt;/strong&gt;: you'd like to build a real-time analytics dashboard to be able to tweak your game based on your players' behaviour. That is &lt;strong&gt;stream processing&lt;/strong&gt;; that's another chapter of the book, but it's a game changer in the industry.&lt;/p&gt;

&lt;h3&gt;
  
  
  A higher level of abstraction
&lt;/h3&gt;

&lt;p&gt;But fortunately, thanks to the Apache folks and Google's work, you now have some advanced, fast, fault-tolerant Big Data tools built on top of the MapReduce ideas, which allow you to deal with either bounded or unbounded datasets at a higher level of abstraction.&lt;/p&gt;

&lt;p&gt;You no longer have just the two functions &lt;code&gt;map()&lt;/code&gt; and &lt;code&gt;reduce()&lt;/code&gt;, but more expressive APIs and more efficient runners. In a future post I'll talk about &lt;a href="https://beam.apache.org/"&gt;Apache Beam&lt;/a&gt;, a &lt;strong&gt;portable&lt;/strong&gt;, &lt;strong&gt;unified&lt;/strong&gt; programming model which acts as a glue between multiple programming languages, multiple runners and multiple data processing modes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/kaBU6pgv0OsPHz2yxy/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/kaBU6pgv0OsPHz2yxy/giphy.gif" alt="" width="480" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;XOXO&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>mapreduce</category>
      <category>bigdata</category>
      <category>explainlikeimfive</category>
    </item>
  </channel>
</rss>
