If you look at this image with a list of Big Data tools it may seem that all possible niches in this field are already occupied. With so much competition it should be very tough to come up with a groundbreaking technology.
Apache Flink creators have a different thought about this. It started as a research project called Stratosphere. Stratosphere was forked, and this fork became what we know as Apache Flink. In 2014 it was accepted as an Apache Incubator project, and just a few months later it became a top-level Apache project. At the time of this writing, the project has almost twelve thousand commits and more than 300 contributors.
Why is there so much attention? This is because Apache Flink was called a new generation Big Data processing framework and has enough innovations under its belt to replace Apache Spark and become the new de-facto tool for batch and stream processing.
Should you switch to Apache Flink? Should you stick with Apache Spark for a while? Or is Apache Flink just a new gimmick? This article will attempt to give you answers to these and other questions.
Unless you have been living under a rock for the last couple of years, you have heard about Apache Spark. It looks like every modern system that does any kind data processing is using Apache Spark in one way or another.
For a long time, Spark was the latest and greatest tool in this area. It delivered some impressive features comparing to its predecessors such as:
- Impressive speed - it is ten times faster than Hadoop if data is processed on a disk and up to 100 times faster if data is processed in memory.
- Simpler Directed acyclic graph model - instead of defining your data processing jobs using rigid MapReduce framework Spark allows to define a graph of tasks that can implement complex data processing algorithms
- Stream processing - with the advent of new technologies such as the Internet of Things it is not enough to simply to process a huge amount of data. Now we need processing a huge amount of data as it arrives in real time. This is why Apache Spark has introduced stream processing that allows to process a potentially infinite stream of data.
- Rich set of libraries - In addition to its core features Apache Spark provides powerful libraries for machine learning, graph processing, and performing SQL queries.
To get a better idea of how you write applications with Apache Spark, let's take a look at how you can implement a simple Word Count application that would count how many times each word was used in a text document:
// Read file val file = sc.textFile("file/path") val wordCount = file // Extract words from every line .flatMap(line => line.split(" ")) // Convert words to pairs .map(word => (word, 1)) // Count how many times each word was used .reduceByKey(_ + _)
If you know Scala, this code should seem straightforward and is similar to working with regular collections. First we read a list of lines from a file located in "file/path". This file can be either a local file or a file in HDFS or S3.
Then every line is split into a list of words using the
flatMap method that simply splits a string by the space symbol. Then to implement the word counting we use the
map method to convert every word into a pair where the first element of the pair is a word from the input text and the second element is simply a number one.
Then the last step simply counts how many times each word was used by summing up numbers for all pairs for the same word.
Apache Spark seems like a great and versatile tool. But what does Apache Flink brings to the table?
At first glance, there does not seem to be many differences. The architecture diagram looks very similar:
If you take a look at the code example for the Word Count application for Apache Flink you would see that there is almost no difference:
val file = env.readTextFile("file/path") val counts = file .flatMap(line => line.split(" ")) .map(word => (word, 1)) .groupBy(0) .sum(1)
Few notable differences, is that in this case we need to use the
readTextFile method instead of the
textFile method and that we need to use a pair of methods:
sum instead of
So what is all the fuss about? Apache Flink may not have any visible differences on the outside, but it definitely has enough innovations, to become the next generation data processing tool. Here are just some of them:
- Implements actual streaming processing - when you process a stream in Apache Spark, it treats it as many small batch problems, hence making stream processing a special case. Apache Flink, in contrast, treats batch processing as a special and does not use micro batching.
- Better support for cyclical and iterative processing - Flink provides some additional operations that allow implementing cycles in your streaming application and algorithms that need to perform several iterations on batch data.
- Custom memory management - Apache Flink is a Java application, but it does not rely entirely on JVM garbage collector. It implements custom memory manager that stores data to process in byte arrays. This allows to reduce the load on a garbage collector and increase performance. You can read about it in this blog post.
- Lower latency and higher throughput - multiple tests done by third parties suggest that Apache Flink has lower latency and higher throughput than its competitors.
- Powerful windows operators - when you need to process a stream of data in most cases you need to apply a function to a finite group of elements in a stream. For example, you may need to count how many clicks your application has received in each five-minute interval, or you may want to know what was the most popular tweet on Twitter in each ten-minute interval. While Spark supports some of these use-cases, Apache Flink provides a vastly more powerful set of operators for stream processing.
- Implements lightweight distributed snapshots - this allows Apache Flink to provide low overhead and only-once processing guarantees in stream processing, without using micro batching as Spark does.
So, you are working on a new project, and you need to pick a software for it? What should use? Spark? Flink?
Of course, there is no right or wrong answer here. If you need to do complex stream processing, then I would recommend using Apache Flink. It has better support for stream processing and some significant improvements.
If you don't need bleeding edge stream processing features and want to stay on the safe side, it may be better to stick with Apache Spark. It is a more mature project it has a bigger user base, more training materials, and more third-party libraries. But keep in mind that Apache Flink is closing this gap by the minute. More and more projects are choosing Apache Flink as it becomes a more mature project.
If on the other hand, you like to experiment with the latest technology, you definitely need to give Apache Flink a shot.
Does all this mean that Apache Spark is obsolete and in a couple of years we all are going to use Apache Flink? The answer may surprise you. While Flink has some impressive features, Spark is not staying the same. For example, Apache Spark introduced custom memory management in 2015 with the release of project Tungsten, and since then it has been adding features that were first introduced by Apache Flink. The winner is not decided yet.
In the upcoming blog posts I will write more about how you can use Apache Flink for batch and stream processing, so stay tuned!
If you want to know more about Apache Flink, you can take a look at my Pluralsight course where I cover Apache Flink in more details: Understanding Apache Flink. Here is a short preview of this course.
This post was originally posted at Brewing Codes blog.