MapReduce and Spark are both used for large-scale data processing. However, MapReduce has some shortcomings which renders Spark more useful in a number of scenarios.
- Every workflow has to go through a map and reduce phase: Can’t accommodate a join, filter or more complicated workflows like map- reduce-map.
- MapReduce relies heavily on reading data from disk: Performance bottleneck, especially bad for iterative algorithms which may cycle through the data several times.
- Only native Java programming interface available: Python is also available, but it makes implementation complex and is not very efficient for floating point data.
- Not that easy in terms of programming and requires lots of hand coding.
- A new framework: Not a complete replacement of the Hadoop stack, just a replacement for Hadoop MapReduce and more
- Capable to using Hadoop ecosystem, e.g., HDFS, yarn
- Spark provides over 20 highly efficient distributed operations: Can be used in combination
- User can choose to cache data in memory: Increases performance for iterative algorithms
- Polyglot: Native Java, Python, Scala, R interfaces along with interactive shell (test and explore data interactively on shell)
- Easy to program and does not require that much hand coding
Apache Spark has a well-defined layered architecture where all the spark components and layers are loosely coupled.
- General JVM executor to execute spark workflows
- Core on which all computation is done
- Interface to rest of the Hadoop ecosystem, e.g., HDFS
- System to manage provisioning and starting of worker nodes
- Cluster manager interfaces supported by spark –
- YARN (Hadoop cluster manager): Same cluster can be used with both Hadoop MR and spark.
- Standalone: Special spark process that takes care of starting nodes at beginning of computation and restarts them on failures.
- Interface with cluster
- Has a JVM that has a spark context: Gateway for us to connect to our spark instance and submit jobs.
- Jobs can be submitted in -
- Batch mode : Send program for execution and wait for result
- Streaming mode: Use spark shell and interact in real-time with data
- Using spark in standalone mode
- Everything running locally on one machine
- Worker node (executor JVM), spark process and driver program on the same machine
- Supports spark natively
- Web interface to configure number, type of instances, memory required, etc.
- Amazon EMR automatically runs YARN to spawn instances and prepares them to be executed with spark.
- Executor JVMs run on EC2 interfaces
- Driver program and YARN running on the master node
- Data storage created from HDFS, S3, HBase, JSON, text, etc.: Once spark reads the data, it can be referenced with an RDD
- Or from transforming another RDD: RDD are immutable
- Distributes across cluster of machines
- Data is divided into partitions and partitions divided across machines
- Spark tracks history of each partition
- Error recovery like node failures, slow processes
RDD are immutable but we can transform one RDD to another RDD. Spark does lazy transformations (execution will not start until an action is triggered).
- Like map and filter
- Do not imply transferring data through the network
- Depends on memory and CPU
- For example, groupByKey transfers data with same key to same partition Shuffle operation across network
- Also depended on interconnection speed between nodes
For further reading: Apache Spark Tutorial: Get Started With Serving ML Models With Spark
Source: Introduction to Apache Spark