DEV Community


Discussion on: Apache Spark vs. Apache Flink

Jacek Laskowski

Hi Ivan,

I'm a Spark aficionado and therefore hugely Spark-biased, but even though the blog post is dated Sep 27, 2017 (which is yesterday), the content reads as if it had been written over a year ago, e.g.

  1. Why are Spark Streaming and GraphX included? They're dead, with Spark Structured Streaming (production-ready since Spark 2.2) and GraphFrames having replaced them, officially and not quite officially yet, respectively.

  2. Where is Spark SQL with Tungsten and Catalyst mentioned? At the end, as a kind of follow-up? Why? That has been part of Spark for...well...years.

  3. I'm still surprised that people say "Spark uses batches and streaming is special while Flink does the opposite". OK, got that, but how does Flink do the opposite? I could help explain Spark's side, but Flink's side is missing here. What's the difference for a developer?

  4. Custom memory management in Flink? OK. What about Spark, which has had it since Spark 2.0 too?

  5. "Implements lightweight distributed snapshots" <-- without more details, I'd say Spark Structured Streaming has had it too since Spark 2.2.

I was deeply touched by the happy ending: "Of course, there is no right or wrong answer here. If you need to do complex stream processing, then I would recommend using Apache Flink. It has better support for stream processing and some significant improvements." Could you explain what features make Flink better? I wish I could attend a meetup where Flink and Spark are compared on stage; that would help people decide which one is more suitable for their use cases (please note that I am not saying that Flink or Spark is better than the other, just that one can be more suitable given the requirements and the experience of a delivery team).

Thanks for the article anyway. It's been a pleasure to read. Can't wait to read the next instalments.