<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: yell0wturtle</title>
    <description>The latest articles on DEV Community by yell0wturtle (@yell0wturtle).</description>
    <link>https://dev.to/yell0wturtle</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F188387%2F4101bc79-61fc-49a7-84e3-49c7cf763b78.jpg</url>
      <title>DEV Community: yell0wturtle</title>
      <link>https://dev.to/yell0wturtle</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yell0wturtle"/>
    <language>en</language>
    <item>
      <title>Path to become a junior+ data engineer?</title>
      <dc:creator>yell0wturtle</dc:creator>
      <pubDate>Sat, 12 Oct 2019 15:35:51 +0000</pubDate>
      <link>https://dev.to/yell0wturtle/path-to-become-a-junior-data-engineer-30mb</link>
      <guid>https://dev.to/yell0wturtle/path-to-become-a-junior-data-engineer-30mb</guid>
      <description>&lt;p&gt;Henlo!&lt;/p&gt;

&lt;p&gt;I'm an I.T. student and I'd like to work as a data engineer but I'm like a fish lost in an ocean of big data tools.&lt;/p&gt;

&lt;p&gt;First off, I've got a strong Web background, mainly doing back-end stuff such as building and deploying microservices around the internet. But what I like most is working with data, Big Data.&lt;/p&gt;

&lt;p&gt;But I don't know where to start. Today I'm quite confident with Apache Beam, SQL/NoSQL, Messaging Queues, Cloud solutions... but I feel like it's nothing compared to the great diversity of Big Data tools.&lt;/p&gt;

&lt;p&gt;Should I go for open-source stuff such as Kafka, Cassandra, HDFS etc., or should I focus on the Cloud side (Cloud Dataflow, AWS EMR, Pub/Sub, Kinesis...)?&lt;/p&gt;

&lt;p&gt;I'd appreciate any help ;)&lt;/p&gt;

</description>
      <category>help</category>
      <category>distributedsystems</category>
      <category>spark</category>
      <category>hadoop</category>
    </item>
    <item>
      <title>Every day you're shuffling, a cat dies.</title>
      <dc:creator>yell0wturtle</dc:creator>
      <pubDate>Sun, 29 Sep 2019 14:04:48 +0000</pubDate>
      <link>https://dev.to/yell0wturtle/map-shuffle-reduce-what-is-shuffling-5cbn</link>
      <guid>https://dev.to/yell0wturtle/map-shuffle-reduce-what-is-shuffling-5cbn</guid>
      <description>&lt;h3&gt;
  
  
  The dataset
&lt;/h3&gt;

&lt;p&gt;Let's say you have a simple log file which contains billions of financial transactions, and you'd like to &lt;strong&gt;count the occurrences of each type of credit card&lt;/strong&gt; in order to build some analytics (your consumers' behaviour...).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;sender&lt;/th&gt;
&lt;th&gt;receiver&lt;/th&gt;
&lt;th&gt;card_type&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;Alan&lt;/td&gt;
&lt;td&gt;Visa&lt;/td&gt;
&lt;td&gt;42.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Alan&lt;/td&gt;
&lt;td&gt;Dylan&lt;/td&gt;
&lt;td&gt;Mastercard&lt;/td&gt;
&lt;td&gt;69.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Lucie&lt;/td&gt;
&lt;td&gt;Eric&lt;/td&gt;
&lt;td&gt;Visa&lt;/td&gt;
&lt;td&gt;11.90&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The map phase
&lt;/h3&gt;

&lt;p&gt;First off, you have to build a PCollection of transactions from this source text file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Representation of a transaction&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Transaction&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Id&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;Sender&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;Receiver&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;CardType&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;Amount&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// NOTHING -&amp;gt; PCollection&amp;lt;Transaction&amp;gt;&lt;/span&gt;
&lt;span class="n"&gt;transactions&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;textio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"../path/to/file.txt"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, you must extract the credit card type from each transaction and build a key-value pair where the key is the card type and the value is 1. This lets you run a reduce function later to aggregate the counts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// NOTHING -&amp;gt; PCollection&amp;lt;Transaction&amp;gt;&lt;/span&gt;
&lt;span class="n"&gt;transactions&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;textio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"../path/to/file.txt"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// New&lt;/span&gt;
&lt;span class="c"&gt;// PCollection&amp;lt;Transaction&amp;gt; -&amp;gt; PCollection&amp;lt;KV&amp;lt;string, int&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="n"&gt;pairedWithOne&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;beam&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ParDo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tsx&lt;/span&gt; &lt;span class="n"&gt;Transaction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tsx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CardType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What might this look like on the cluster?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's say you have a cluster of 3 worker nodes. Each node has a chunk (or partition) of your data and the user code (UDFs) to apply to each element. We will get a bunch of key-value pairs on each node, like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Worker1&lt;/th&gt;
&lt;th&gt;Worker2&lt;/th&gt;
&lt;th&gt;Worker3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;td&gt;(Maestro, 1)&lt;/td&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;td&gt;(Maestro, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Maestro, 1)&lt;/td&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The reduce phase
&lt;/h3&gt;

&lt;p&gt;Now, you want to group all values with the same key (the credit card type) together in order to aggregate a sum. So you'd use a combination of a &lt;code&gt;GroupByKey&lt;/code&gt; with a reducer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// NOTHING -&amp;gt; PCollection&amp;lt;Transaction&amp;gt;&lt;/span&gt;
&lt;span class="n"&gt;transactions&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;textio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"../path/to/file.txt"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// PCollection&amp;lt;Transaction&amp;gt; -&amp;gt; PCollection&amp;lt;KV&amp;lt;string, int&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="n"&gt;pairedWithOne&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;beam&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ParDo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tsx&lt;/span&gt; &lt;span class="n"&gt;Transaction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tsx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CardType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// New&lt;/span&gt;
&lt;span class="c"&gt;// PCollection&amp;lt;KV&amp;lt;string, int&amp;gt;&amp;gt; -&amp;gt; PCollection&amp;lt;GBK&amp;lt;string, iter&amp;lt;int&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="n"&gt;grouppedByKey&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;beam&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GroupByKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pairedWithOne&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// PCollection&amp;lt;GBK&amp;lt;string, iter&amp;lt;int&amp;gt;&amp;gt;&amp;gt; -&amp;gt; PCollection&amp;lt;KV&amp;lt;string, int&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="n"&gt;counted&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;beam&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ParDo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iter&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;grouppedByKey&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
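&lt;p&gt;Note that &lt;code&gt;sum&lt;/code&gt; is not a Go builtin: it's a tiny helper you'd have to define yourself. A minimal version might look like this:&lt;/p&gt;

```go
// sum adds up all the grouped counts for one key.
func sum(values []int) int {
	total := 0
	for _, v := range values {
		total += v
	}
	return total
}
```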



&lt;p&gt;&lt;strong&gt;What might this look like on the cluster?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Right before the reduce phase happens, key-value pairs will &lt;strong&gt;move across the network&lt;/strong&gt; so that identical keys are gathered on the same machine. In this example, we assume each worker node hosts a unique key. Each key holds an iterable of integers as a value, which represents the number of times the system has seen that credit card type.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;GroupByKey&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Worker1&lt;/th&gt;
&lt;th&gt;Worker2&lt;/th&gt;
&lt;th&gt;Worker3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Visa, [1, 1, 1, 1, 1, 1, 1, 1])&lt;/td&gt;
&lt;td&gt;(Mastercard, [1, 1, 1, 1, 1, 1, 1])&lt;/td&gt;
&lt;td&gt;(Maestro, [1, 1, 1])&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Reduction&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Worker1&lt;/th&gt;
&lt;th&gt;Worker2&lt;/th&gt;
&lt;th&gt;Worker3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Visa, 8)&lt;/td&gt;
&lt;td&gt;(Mastercard, 7)&lt;/td&gt;
&lt;td&gt;(Maestro, 3)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Houston, we have a problem
&lt;/h3&gt;

&lt;p&gt;But didn't we talk about shuffling? Just before the reduce phase, when data moves across the network, that is called &lt;strong&gt;shuffling&lt;/strong&gt;. And while shuffling a small dataset might be &lt;em&gt;okay&lt;/em&gt;, on a huge dataset it introduces &lt;strong&gt;latency&lt;/strong&gt;, because too much network communication kills performance. &lt;/p&gt;

&lt;p&gt;So you &lt;strong&gt;do not want&lt;/strong&gt; to send all of your data across the network if it's not &lt;strong&gt;absolutely required&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/l3V0cSOBfJvvNsSzu/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/l3V0cSOBfJvvNsSzu/giphy.gif" alt="" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, in our case, the &lt;code&gt;GroupByKey&lt;/code&gt; was a naive approach. We have elements of the same type and we want to calculate a sum... can we do it more efficiently? Can we reduce &lt;em&gt;before&lt;/em&gt; the shuffle, to cut down the data sent over the network? Yay!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/7XuPYJXaF1CBAmbwQQ/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/7XuPYJXaF1CBAmbwQQ/giphy.gif" alt="" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  SumPerKey/CombinePerKey
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;SumPerKey&lt;/code&gt; and &lt;code&gt;CombinePerKey&lt;/code&gt; combine a GroupByKey with a &lt;strong&gt;reduction&lt;/strong&gt; over all the values grouped under each key. This is way more efficient than using the two steps separately. And you can use them here because you are aggregating values of the same type, so that's perfectly fine.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Using GroupByKey&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// PCollection&amp;lt;KV&amp;lt;string, int&amp;gt;&amp;gt; -&amp;gt; PCollection&amp;lt;GBK&amp;lt;string, iter&amp;lt;int&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="n"&gt;grouppedByKey&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;beam&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GroupByKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pairedWithOne&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// PCollection&amp;lt;GBK&amp;lt;string, iter&amp;lt;int&amp;gt;&amp;gt;&amp;gt; -&amp;gt; PCollection&amp;lt;KV&amp;lt;string, int&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="n"&gt;counted&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;beam&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ParDo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iter&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;grouppedByKey&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Using SumPerKey&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// PCollection&amp;lt;KV&amp;lt;string, int&amp;gt;&amp;gt; -&amp;gt; PCollection&amp;lt;KV&amp;lt;string, int&amp;gt;&amp;gt;&lt;/span&gt;
&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SumPerKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pairedWithOne&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What might this look like on the cluster?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because &lt;code&gt;SumPerKey&lt;/code&gt; is an associative reduction, magic happens! It reduces the data on the mapper side first, before sending the aggregated results downstream. And because the results are already pre-aggregated, the data sent over the network for the final reduction is greatly reduced. So you get the same output, but &lt;strong&gt;faster&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/WU8nnAdxZWeM8/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/WU8nnAdxZWeM8/giphy.gif" alt="Magic" width="360" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Mappers will build key-value pairs&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Worker1&lt;/th&gt;
&lt;th&gt;Worker2&lt;/th&gt;
&lt;th&gt;Worker3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;td&gt;(Maestro, 1)&lt;/td&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;td&gt;(Maestro, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Maestro, 1)&lt;/td&gt;
&lt;td&gt;(Mastercard, 1)&lt;/td&gt;
&lt;td&gt;(Visa, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Mappers will group elements by key&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Worker1&lt;/th&gt;
&lt;th&gt;Worker2&lt;/th&gt;
&lt;th&gt;Worker3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Visa, [1, 1, 1])&lt;/td&gt;
&lt;td&gt;(Mastercard, [1, 1, 1])&lt;/td&gt;
&lt;td&gt;(Visa, [1, 1, 1])&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Mastercard, [1, 1])&lt;/td&gt;
&lt;td&gt;(Visa, [1, 1])&lt;/td&gt;
&lt;td&gt;(Mastercard, [1, 1])&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Maestro, [1])&lt;/td&gt;
&lt;td&gt;(Maestro, [1])&lt;/td&gt;
&lt;td&gt;(Maestro, [1])&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Mappers will reduce the data they own locally&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Worker1&lt;/th&gt;
&lt;th&gt;Worker2&lt;/th&gt;
&lt;th&gt;Worker3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Visa, 3)&lt;/td&gt;
&lt;td&gt;(Mastercard, 3)&lt;/td&gt;
&lt;td&gt;(Visa, 3)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Mastercard, 2)&lt;/td&gt;
&lt;td&gt;(Visa, 2)&lt;/td&gt;
&lt;td&gt;(Mastercard, 2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Maestro, 1)&lt;/td&gt;
&lt;td&gt;(Maestro, 1)&lt;/td&gt;
&lt;td&gt;(Maestro, 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And then, here comes the shuffle phase followed by the reduce phase.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Aggregated output&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Worker1&lt;/th&gt;
&lt;th&gt;Worker2&lt;/th&gt;
&lt;th&gt;Worker3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Visa, 8)&lt;/td&gt;
&lt;td&gt;(Mastercard, 7)&lt;/td&gt;
&lt;td&gt;(Maestro, 3)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  To sum up
&lt;/h3&gt;

&lt;p&gt;Not every problem that can be solved with &lt;code&gt;GroupByKey&lt;/code&gt; can be solved with &lt;code&gt;SumPerKey&lt;/code&gt; / &lt;code&gt;CombinePerKey&lt;/code&gt;; they require combining all your values into another value of the &lt;strong&gt;exact same type&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But just keep in mind that you can greatly optimise your data processing pipeline by reducing the shuffle as much as possible.&lt;/p&gt;
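&lt;p&gt;If you want to see the savings concretely without a cluster, here is a small, hypothetical plain-Go simulation (the function and variable names are mine, not Beam's API) that replays the per-worker tables above and counts how many records would cross the network with and without local pre-aggregation:&lt;/p&gt;

```go
package main

import "fmt"

// simulate replays the per-worker (cardType, 1) pairs from the tables above
// and returns how many records a plain GroupByKey would shuffle, how many a
// CombinePerKey-style local pre-aggregation would shuffle, and the final counts.
func simulate() (naiveShuffled, combinedShuffled int, totals map[string]int) {
	workers := [][]string{
		{"Visa", "Mastercard", "Mastercard", "Visa", "Visa", "Maestro"},
		{"Mastercard", "Visa", "Visa", "Maestro", "Mastercard", "Mastercard"},
		{"Visa", "Mastercard", "Visa", "Mastercard", "Maestro", "Visa"},
	}
	totals = map[string]int{}
	for _, w := range workers {
		// Naive GroupByKey: every single (key, 1) pair crosses the network.
		naiveShuffled += len(w)
		// CombinePerKey-style: each worker pre-aggregates locally first, so
		// only one (key, partialCount) pair per key per worker is shuffled.
		local := map[string]int{}
		for _, card := range w {
			local[card]++
		}
		combinedShuffled += len(local)
		for k, v := range local {
			totals[k] += v
		}
	}
	return naiveShuffled, combinedShuffled, totals
}

func main() {
	naive, combined, totals := simulate()
	fmt.Println("records shuffled with GroupByKey:", naive)       // 18
	fmt.Println("records shuffled with CombinePerKey:", combined) // 9
	fmt.Println("final counts:", totals)                          // same answer either way
}
```

&lt;p&gt;Same output, half the network traffic, and the gap only widens as each worker holds more rows per key.&lt;/p&gt;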

&lt;p&gt;Happy coding. 🔥&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>beam</category>
      <category>mapreduce</category>
      <category>shuffling</category>
    </item>
    <item>
      <title>Explain MapReduce Like I'm Five</title>
      <dc:creator>yell0wturtle</dc:creator>
      <pubDate>Sat, 28 Sep 2019 16:27:07 +0000</pubDate>
      <link>https://dev.to/yell0wturtle/explain-apache-beam-like-i-m-five-5784</link>
      <guid>https://dev.to/yell0wturtle/explain-apache-beam-like-i-m-five-5784</guid>
      <description>&lt;h3&gt;
  
  
  Once upon a time...
&lt;/h3&gt;

&lt;p&gt;So first of all, there is Big Data. To make it short, Big Data is about handling (semi-)structured or unstructured datasets which are too large to fit in the memory of a single computer (or a few), bla bla bla. &lt;/p&gt;

&lt;p&gt;And we talk about Big Data because today almost everything produces data: your phone, your fridge, your socks... everything. People then use these datasets to understand their customers' behaviour, to monitor web services, or to do fraud detection and other financial things, you know.&lt;/p&gt;

&lt;p&gt;To illustrate my point, let's say you're the developer of a very popular multiplayer game, and as soon as a player performs a key action, your server saves it in a log file. So you have a log file that contains &lt;strong&gt;billions of records&lt;/strong&gt;, and you'd like to extract some useful information such as the average number of quests completed, per server.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/14e27FhfQA3Yivhi6e/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/14e27FhfQA3Yivhi6e/giphy.gif" alt="" width="400" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Shared memory data-parallelism
&lt;/h3&gt;

&lt;p&gt;You've got a basic computer and you want to do these computations, so what does that look like? Well, first, your program has to take this dataset and cut it into several chunks via a partitioning algorithm. Then, it has to spin up a few tasks, some kind of workers/threads, that will independently operate on a chunk, in parallel. Lastly, they combine, if necessary, their results into one single output. &lt;/p&gt;

&lt;p&gt;Great, but if you run this, your computer either &lt;strong&gt;explodes&lt;/strong&gt; or &lt;strong&gt;runs out of memory&lt;/strong&gt; because that's really too much data to handle... You have to find another way. Maybe you could do some vertical scaling by buying that &lt;strong&gt;overpriced $50,000 new Apple Mac Pro&lt;/strong&gt;. It would probably help (I hope hehe xd) but that's not really scalable at all. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/NTur7XlVDUdqM/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/NTur7XlVDUdqM/giphy.gif" alt="" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In fact, if your game becomes even more popular than it is, you'll get a larger volume of data, you'll have to upgrade your computer, spend a shitload of money and so on, until you get a file so large that you can't store it on any f**** machine.&lt;/p&gt;

&lt;p&gt;No... you have to find a more effective way. Maybe, could you just &lt;strong&gt;distribute&lt;/strong&gt; the problem across &lt;strong&gt;multiple machines&lt;/strong&gt;? Yay!&lt;/p&gt;

&lt;h3&gt;
  
  
  Distributed data-parallelism
&lt;/h3&gt;

&lt;p&gt;So instead of upgrading your computer to be able to do these computations, you'll use a &lt;strong&gt;cluster of very traditional machines&lt;/strong&gt; to get the job done in a short time. It's about the same configuration as above, but with some key differences.&lt;/p&gt;

&lt;p&gt;The dataset is chopped up into pieces, each piece is replicated a number of times, and the copies are stored over several nodes. That's what we do via a &lt;strong&gt;distributed file system&lt;/strong&gt; (DFS). And we no longer have threads sharing the same memory, but &lt;strong&gt;worker nodes&lt;/strong&gt; operating independently of each other, in parallel, each on a fragment of the dataset.&lt;/p&gt;

&lt;p&gt;Alright: the more the dataset grows, the more nodes you use... and you get the job done fast. You may use hundreds or thousands of machines; that's highly scalable. &lt;/p&gt;

&lt;h3&gt;
  
  
  MapReduce joined the chat
&lt;/h3&gt;

&lt;p&gt;I've mentioned distributing parallel computing tasks over several computers, but how do we do that effectively? Here comes &lt;strong&gt;MapReduce&lt;/strong&gt; &lt;em&gt;(heavy breathing intensifies)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/el8NjD5XUqTX3ggekt/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/el8NjD5XUqTX3ggekt/giphy.gif" alt="" width="320" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MapReduce is a programming paradigm invented by &lt;a href="https://static.googleusercontent.com/media/research.google.com/fr//archive/mapreduce-osdi04.pdf"&gt;Google&lt;/a&gt; which describes a way of structuring your job that allows it to be easily run on many machines. What does that mean? Well, if you write your computation in a specific way, then it's really easy for it to take advantage of hundreds or thousands of computers. And it's very easy to understand. Sweet!&lt;/p&gt;

&lt;p&gt;Let's say you have a huge file that holds billions of Amazon reviews, and you'd like to get the average rating per product category. With MapReduce, that's a walk in the park.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/MSQOkUiyawspMTZT5B/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/MSQOkUiyawspMTZT5B/giphy.gif" alt="" width="480" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First off, there is the &lt;strong&gt;map&lt;/strong&gt; phase. You have a bunch of computers, and your dataset, which has been sliced into pieces, is spread over them. These machines do not communicate with each other during the map phase: each focuses on its given chunk/partition and runs the user-defined function on each element to produce key-value pairs.&lt;/p&gt;

&lt;p&gt;So for each element we'll grab the product category as a key and the customer rating as a value (product_category, user_rating), and we end up with this representation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;mapper&lt;/th&gt;
&lt;th&gt;mapper&lt;/th&gt;
&lt;th&gt;mapper&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Books, 3.5)&lt;/td&gt;
&lt;td&gt;(Phones, 4.5)&lt;/td&gt;
&lt;td&gt;(Cooking, 4)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Phones, 5)&lt;/td&gt;
&lt;td&gt;(Cooking, 5)&lt;/td&gt;
&lt;td&gt;(Books, 2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(Phones, 3)&lt;/td&gt;
&lt;td&gt;(Books, 3.5)&lt;/td&gt;
&lt;td&gt;(Phones, 5)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Then, there is a shuffle phase which sorts these keys and sends them to the next phase, so that every reducer obtains all the values associated with the same key.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;reducer&lt;/th&gt;
&lt;th&gt;reducer&lt;/th&gt;
&lt;th&gt;reducer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Books, [3.5, 2, 3.5])&lt;/td&gt;
&lt;td&gt;(Cooking, [4, 5])&lt;/td&gt;
&lt;td&gt;(Phones, [4.5, 5, 3, 5])&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Finally, here comes the reduce phase, which combines all the results from the map stage to aggregate one value as an output. In this example, we want the average rating:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;reducer&lt;/th&gt;
&lt;th&gt;reducer&lt;/th&gt;
&lt;th&gt;reducer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(Books, 3)&lt;/td&gt;
&lt;td&gt;(Cooking, 4.5)&lt;/td&gt;
&lt;td&gt;(Phones, 4.38)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
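&lt;p&gt;The three phases are easy to simulate on a single machine. Here is a little sketch in plain Go (not a real MapReduce framework, just the same logic, with names I made up) that shuffles and reduces every pair from the mapper table:&lt;/p&gt;

```go
package main

import "fmt"

// pair is one mapper output: a product category and a customer rating.
type pair struct {
	category string
	rating   float64
}

// averageByCategory runs the shuffle and reduce phases locally:
// group every rating under its category, then average each group.
func averageByCategory(mapped []pair) map[string]float64 {
	// Shuffle: gather all ratings per category.
	grouped := map[string][]float64{}
	for _, p := range mapped {
		grouped[p.category] = append(grouped[p.category], p.rating)
	}
	// Reduce: average the ratings for each category.
	averages := map[string]float64{}
	for category, ratings := range grouped {
		total := 0.0
		for _, r := range ratings {
			total += r
		}
		averages[category] = total / float64(len(ratings))
	}
	return averages
}

func main() {
	// The mapper outputs from the table above.
	mapped := []pair{
		{"Books", 3.5}, {"Phones", 4.5}, {"Cooking", 4},
		{"Phones", 5}, {"Cooking", 5}, {"Books", 2},
		{"Phones", 3}, {"Books", 3.5}, {"Phones", 5},
	}
	for category, avg := range averageByCategory(mapped) {
		fmt.Printf("%s: %.2f\n", category, avg)
	}
}
```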

&lt;p&gt;That's all. And this can scale up to hundreds and thousands of machines; the process will be exactly the same. That's how big companies such as Google built things like &lt;strong&gt;search engine indexing&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The yellow elephant
&lt;/h3&gt;

&lt;p&gt;If you're planning to build MapReduce-style jobs, there is Hadoop for that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hadoop.apache.org/"&gt;Hadoop&lt;/a&gt; is very popular open-source project under Apache which provides you a distributed file system called &lt;strong&gt;Hadoop Distributed File System&lt;/strong&gt; (HDFS) to store your large datasets. And it provides also the famous data-processing framework: &lt;strong&gt;Hadoop MapReduce&lt;/strong&gt;. If you've read so far, I hope you probably know how this works ;).&lt;/p&gt;

&lt;p&gt;Just keep in mind that MapReduce may not fit your needs. You could solve basically any class of problem with MapReduce if you take your algorithm, twist it enough and shove it into that map-shuffle-reduce structure, but the framework can be too restrictive for what you're trying to achieve. Also, MapReduce is... slow, because it has to read/write data frequently between stages, and these I/O operations are really expensive. &lt;/p&gt;

&lt;p&gt;Finally, you can do batch-style processing only. That means, to take our previous online game example, the game server will generate daily logs: you'll get all the events from today, but you'll have to wait until tomorrow to run your MapReduce job and get the results. But say you want &lt;strong&gt;much lower latency&lt;/strong&gt;: you'd like to build a real-time analytics dashboard to be able to tweak your game based on your players' behaviour. That is &lt;strong&gt;stream processing&lt;/strong&gt;; that's another chapter of the book, but it's a game changer in the industry.&lt;/p&gt;

&lt;h3&gt;
  
  
  A higher level of abstraction
&lt;/h3&gt;

&lt;p&gt;But fortunately, thanks to the Apache folks and Google's work, you now have some advanced, fast, fault-tolerant Big Data tools built on top of the MapReduce ideas, which allow you to deal with either bounded or unbounded datasets at a higher level of abstraction.&lt;/p&gt;

&lt;p&gt;You no longer have just the two functions &lt;code&gt;map()&lt;/code&gt; and &lt;code&gt;reduce()&lt;/code&gt;, but more expressive APIs and more efficient runners. In a future post I'll talk about &lt;a href="https://beam.apache.org/"&gt;Apache Beam&lt;/a&gt;, a &lt;strong&gt;portable&lt;/strong&gt;, &lt;strong&gt;unified&lt;/strong&gt; programming model which acts as a glue between multiple programming languages, multiple runners and multiple data processing modes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/kaBU6pgv0OsPHz2yxy/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/kaBU6pgv0OsPHz2yxy/giphy.gif" alt="" width="480" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;XOXO&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>mapreduce</category>
      <category>bigdata</category>
      <category>explainlikeimfive</category>
    </item>
  </channel>
</rss>
