Ivan Mushketyk

Posted on Sep 27, 2017

Apache Spark vs. Apache Flink

#flink #spark #bigdata

If you look at this image with a list of Big Data tools it may seem that all possible niches in this field are already occupied. With so much competition it should be very tough to come up with a groundbreaking technology.

Apache Flink creators have a different thought about this. It started as a research project called Stratosphere. Stratosphere was forked, and this fork became what we know as Apache Flink. In 2014 it was accepted as an Apache Incubator project, and just a few months later it became a top-level Apache project. At the time of this writing, the project has almost twelve thousand commits and more than 300 contributors.

Why is there so much attention? This is because Apache Flink was called a new generation Big Data processing framework and has enough innovations under its belt to replace Apache Spark and become the new de-facto tool for batch and stream processing.

Should you switch to Apache Flink? Should you stick with Apache Spark for a while? Or is Apache Flink just a new gimmick? This article will attempt to give you answers to these and other questions.

Apache Spark is an old news

Unless you have been living under a rock for the last couple of years, you have heard about Apache Spark. It looks like every modern system that does any kind data processing is using Apache Spark in one way or another.

For a long time, Spark was the latest and greatest tool in this area. It delivered some impressive features comparing to its predecessors such as:

Impressive speed - it is ten times faster than Hadoop if data is processed on a disk and up to 100 times faster if data is processed in memory.
Simpler Directed acyclic graph model - instead of defining your data processing jobs using rigid MapReduce framework Spark allows to define a graph of tasks that can implement complex data processing algorithms
Stream processing - with the advent of new technologies such as the Internet of Things it is not enough to simply to process a huge amount of data. Now we need processing a huge amount of data as it arrives in real time. This is why Apache Spark has introduced stream processing that allows to process a potentially infinite stream of data.
Rich set of libraries - In addition to its core features Apache Spark provides powerful libraries for machine learning, graph processing, and performing SQL queries.

To get a better idea of how you write applications with Apache Spark, let's take a look at how you can implement a simple Word Count application that would count how many times each word was used in a text document:

// Read file
val file = sc.textFile("file/path")
val wordCount = file
  // Extract words from every line
  .flatMap(line => line.split(" "))
  // Convert words to pairs
  .map(word => (word, 1))
  // Count how many times each word was used
  .reduceByKey(_ + _)

If you know Scala, this code should seem straightforward and is similar to working with regular collections. First we read a list of lines from a file located in "file/path". This file can be either a local file or a file in HDFS or S3.

Then every line is split into a list of words using the flatMap method that simply splits a string by the space symbol. Then to implement the word counting we use the map method to convert every word into a pair where the first element of the pair is a word from the input text and the second element is simply a number one.

Then the last step simply counts how many times each word was used by summing up numbers for all pairs for the same word.

Apache Spark seems like a great and versatile tool. But what does Apache Flink brings to the table?

New kid on the block

At first glance, there does not seem to be many differences. The architecture diagram looks very similar:

If you take a look at the code example for the Word Count application for Apache Flink you would see that there is almost no difference:

val file = env.readTextFile("file/path")
val counts = file
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .groupBy(0)
  .sum(1)

Few notable differences, is that in this case we need to use the readTextFile method instead of the textFile method and that we need to use a pair of methods: groupBy and sum instead of reduceByKey.

So what is all the fuss about? Apache Flink may not have any visible differences on the outside, but it definitely has enough innovations, to become the next generation data processing tool. Here are just some of them:

Implements actual streaming processing - when you process a stream in Apache Spark, it treats it as many small batch problems, hence making stream processing a special case. Apache Flink, in contrast, treats batch processing as a special and does not use micro batching.
Better support for cyclical and iterative processing - Flink provides some additional operations that allow implementing cycles in your streaming application and algorithms that need to perform several iterations on batch data.
Custom memory management - Apache Flink is a Java application, but it does not rely entirely on JVM garbage collector. It implements custom memory manager that stores data to process in byte arrays. This allows to reduce the load on a garbage collector and increase performance. You can read about it in this blog post.
Lower latency and higher throughput - multiple tests done by third parties suggest that Apache Flink has lower latency and higher throughput than its competitors.
Powerful windows operators - when you need to process a stream of data in most cases you need to apply a function to a finite group of elements in a stream. For example, you may need to count how many clicks your application has received in each five-minute interval, or you may want to know what was the most popular tweet on Twitter in each ten-minute interval. While Spark supports some of these use-cases, Apache Flink provides a vastly more powerful set of operators for stream processing.
Implements lightweight distributed snapshots - this allows Apache Flink to provide low overhead and only-once processing guarantees in stream processing, without using micro batching as Spark does.

What to choose

So, you are working on a new project, and you need to pick a software for it? What should use? Spark? Flink?

Of course, there is no right or wrong answer here. If you need to do complex stream processing, then I would recommend using Apache Flink. It has better support for stream processing and some significant improvements.

If you don't need bleeding edge stream processing features and want to stay on the safe side, it may be better to stick with Apache Spark. It is a more mature project it has a bigger user base, more training materials, and more third-party libraries. But keep in mind that Apache Flink is closing this gap by the minute. More and more projects are choosing Apache Flink as it becomes a more mature project.

If on the other hand, you like to experiment with the latest technology, you definitely need to give Apache Flink a shot.

Does all this mean that Apache Spark is obsolete and in a couple of years we all are going to use Apache Flink? The answer may surprise you. While Flink has some impressive features, Spark is not staying the same. For example, Apache Spark introduced custom memory management in 2015 with the release of project Tungsten, and since then it has been adding features that were first introduced by Apache Flink. The winner is not decided yet.

More information

In the upcoming blog posts I will write more about how you can use Apache Flink for batch and stream processing, so stay tuned!

If you want to know more about Apache Flink, you can take a look at my Pluralsight course where I cover Apache Flink in more details: Understanding Apache Flink. Here is a short preview of this course.

This post was originally posted at Brewing Codes blog.

Top comments (3)

Jacek Laskowski • Sep 28 '17

Hi Ivan,

I'm a Spark aficionado and therefore hugely Spark-biased, but even though the blog post is from Sep 27, 2017 which is yesterday the content of the article was as if it were written over a year ago, i.e.

Why are Spark Streaming and GraphX included? They're dead with Spark Structured Streaming (production-ready since Spark 2.2) and GraphFrames having replaced them, officially or not so much yet, respectively.
Where is Spark SQL with Tungsten and Catalyst mentioned? At the end as a kind of follow-up? Why? That's part of Spark for...well...years.
I'm still surprised that people use "Spark uses batches and streaming is special while Flink does the opposite". OK, got that, but how does Flink do the opposite? I could help explaining Spark's case and miss Flink's a lot. What's the difference for a developer?
Custom memory management in Flink? Ok. What about Spark that has it since Spark 2.0 too?
"Implements lightweight distributed snapshots" <-- without more details I'd say Spark Structured Streaming has it too since Spark 2.2.

I was deeply touched with the happy ending "Of course, there is no right or wrong answer here. If you need to do complex stream processing, then I would recommend using Apache Flink. It has better support for stream processing and some significant improvements." Could you explain what features make Flink better? Wish I could attend a meetup where Flink and Spark are compared on stage that would help people decide which one is more suitable for their use cases (please note that I am not saying that Flink or Spark is better than the other, but just that one can be more suitable given requirements and experience in a delivery team).

Thanks for the article anyway. It's been a pleasure to have read it. Can't wait to read next instalments.

Ivan Mushketyk • Sep 28 '17 • Edited

Hi Jacek,

Thank you for your reply and thank you for very good comments. Let me address them one by one.

Makes, sense I need to use a more up to date architecture diagram.

2 & 4. The idea was to show what innovations Flink introduced and then to show that Spark is implementing similar features (e.g. Tungsten)

As far as I know Spark is still using micro batching, while Flink was using true streaming from the very beginning (more on this here: data-artisans.com/blog/high-throug...). Spark folks are also working on the continuous streaming feature but as far as I know it's know released yet.
I was referring to this: ci.apache.org/projects/flink/flink...

Wish I could attend a meetup where Flink and Spark are compared on stage that would help people decide which one is more suitable for their use cases

That would be awesome! I would definitely visit a meetup like this.

Can't wait to read next instalments.
I would be glad to read your comments and suggestions! Thank you for your comment again.

Some comments may only be visible to logged-in visitors. Sign in to view all comments.

DEV Community

Apache Spark vs. Apache Flink

Apache Spark is an old news

New kid on the block

What to choose

More information

Top comments (3)

Read next

8 Powerful Python Error Handling Strategies for Robust Applications

Boost Go Network App Performance: Zero-Copy I/O Techniques Explained

GSSOC, SWOC, and the GitHub Glow-Up: A Beginner’s Guide to Open Source

Do We Really Need All the Data to Make Our Decisions?