Background
With the growth of internet usage, the quantity of data generated from sources such as social media posts, images, audio, video, and text has increased tremendously. Traditional data processing tools cannot handle data of this scale and variety.
This gave rise to the concept of big data: extremely large and complex datasets that are difficult to process using traditional data processing tools. Big data is generally characterized by the 3Vs:
- Volume (data size varies from terabytes to petabytes)
- Velocity (the high speed at which data is generated and must be processed)
- Variety (data with different formats, sources, and structures)
Introduction to Apache Spark
Apache Spark is an open-source distributed data processing engine for large-scale data. It started in 2009 as a research project at the University of California, Berkeley (AMPLab); the main objective of the project was to speed up processing jobs in Hadoop systems.
It supports batch processing, stream processing, graph processing and machine learning, with support for programming languages like Python, Java, Scala, R, and SQL.
Why Spark?
Spark is used because of its fast data processing, advanced API support, scalability and ease of use, which are explained below:
1. Scalability
Apache Spark can handle data ranging from a single machine to thousands of servers in a cluster. It distributes data and computation across many nodes, allowing you to process very large datasets (big data) efficiently.
You can easily add more nodes to scale up processing power.
2. Real Time
Spark supports real-time data processing with its Spark Streaming module.
It processes live data streams (e.g., from sensors, logs, social media) with low latency.
3. Fault Tolerance
Spark’s data abstraction, RDD (Resilient Distributed Dataset), is designed to handle failures. Each RDD keeps track of how it was derived from other datasets (lineage).
If a node fails, Spark can recompute lost partitions automatically without needing to restart the entire job.
4. Ease of Use
Spark provides high-level APIs that simplify writing distributed applications.
It supports multiple programming languages (Scala, Python, Java, R) and comes with built-in libraries (Spark SQL, MLlib, GraphX) for common tasks, reducing the amount of code needed.
5. Speed
Spark performs data processing in-memory, reducing time spent on disk I/O. This makes Spark much faster (up to 100x) than traditional systems like Hadoop MapReduce for many workloads. It also optimizes execution plans to run tasks efficiently in parallel.
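As a rough illustration of the in-memory idea, here is a minimal PySpark sketch (local mode, made-up data, illustrative app name) that caches an intermediate RDD so a second action reuses it instead of recomputing it:

```python
from pyspark.sql import SparkSession

# Illustrative local session; app name and data are placeholders.
spark = SparkSession.builder.master("local[*]").appName("caching-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1_000_000))
squares = numbers.map(lambda x: x * x)

# persist() keeps computed partitions in executor memory, so the
# second action below is served from RAM instead of being recomputed.
squares.persist()

print(squares.count())  # first action: computes and caches the partitions
print(squares.sum())    # second action: reuses the cached partitions

spark.stop()
```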
Why Spark Outperforms Hadoop MapReduce
Apache Hadoop MapReduce, a traditional distributed data processing framework, writes intermediate results to disk after each step, resulting in slow performance and making it inefficient for iterative jobs.
Spark, in contrast, makes use of in-memory processing: intermediate results are kept in RAM so that only the final result needs to be written to disk, making it up to 100 times faster than Hadoop MapReduce for many workloads.
Besides, Apache Spark's fault tolerance, based on its resilient distributed datasets (RDDs), allows it to quickly recover from node failures by recomputing only lost data partitions — enhancing efficiency.
Moreover, Spark supports real-time data processing through its streaming capabilities, whereas Hadoop is mainly limited to batch jobs.
Spark Ecosystem
Figure: Overview of the Apache Spark Ecosystem and its Core Components
(Source: Data Flair)
The Apache Spark ecosystem consists of powerful tools and libraries built around the core Spark engine to support various big data processing tasks. Here are the key components:
1. Spark Core
This is the foundation of the Spark ecosystem; all other Spark components are built on top of it.
It handles:
- Task Scheduling
- Memory Management
- Fault Recovery
- Interactions with Storage Systems
2. Spark SQL
It enables querying structured data using SQL or DataFrame APIs. It supports various data sources like JSON, Hive tables, and Parquet files.
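A minimal PySpark sketch of the two interfaces; the DataFrame contents and table name here are made up, and real data would typically be read with spark.read.json, spark.read.parquet, and so on:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()

# Small in-memory DataFrame standing in for a real structured source.
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Cara", 29)],
    ["name", "age"],
)

# DataFrame API
df.filter(df.age > 30).show()

# Equivalent SQL query over the same data
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()

spark.stop()
```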
3. Spark Streaming
It ingests real-time data streams from sources such as Kafka, Flume, or TCP sockets and processes them in near real time. This is crucial for applications like fraud detection, live monitoring, and alert systems.
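A minimal word-count sketch using the classic DStream API; the host and port are placeholders (you could feed it with `nc -lk 9999`). Newer applications often use Structured Streaming instead, but the idea is the same:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-demo")
ssc = StreamingContext(sc, batchDuration=5)   # 5-second micro-batches

# Listen on a TCP socket; host and port are placeholders.
lines = ssc.socketTextStream("localhost", 9999)

counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)
counts.pprint()   # print each micro-batch's word counts

ssc.start()
ssc.awaitTermination()
```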
4. MLlib (Machine Learning Library)
It is a scalable machine learning library that offers algorithms for the following tasks (a short example follows the list):
- Classification
- Regression
- Clustering
- Collaborative Filtering
- and more.
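For a feel of the DataFrame-based API (pyspark.ml), here is a minimal logistic regression sketch on a tiny, made-up dataset:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[*]").appName("mllib-demo").getOrCreate()

# Tiny, made-up training set: a label plus a two-dimensional feature vector.
train = spark.createDataFrame(
    [
        (0.0, Vectors.dense(0.0, 1.1)),
        (1.0, Vectors.dense(2.0, 1.0)),
        (0.0, Vectors.dense(0.1, 1.3)),
        (1.0, Vectors.dense(1.9, 0.8)),
    ],
    ["label", "features"],
)

lr = LogisticRegression(maxIter=10)
model = lr.fit(train)                         # training runs across the executors
model.transform(train).select("label", "prediction").show()

spark.stop()
```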
5. GraphX
It is a library for graph processing and analytics.
It enables computations on graphs and graph-parallel algorithms.
Resilient Distributed Datasets (RDD)
RDD is the fundamental data structure of Spark and its main programming abstraction. It is an immutable, fault-tolerant collection of elements or objects that can be processed in parallel across a cluster of machines.
RDDs are called "resilient" because they track data lineage information so that lost data can be rebuilt if there is a failure, making RDDs highly fault-tolerant.
Once created, an RDD cannot be changed. Transformations are applied to an RDD to create a new RDD, leaving the original one unchanged.
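A minimal PySpark sketch of these properties, with made-up data: a transformation returns a new RDD and leaves the original untouched, and toDebugString() shows the lineage Spark would use to rebuild lost partitions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Create an RDD from a local Python collection.
original = sc.parallelize([1, 2, 3, 4, 5])

# Transformations never modify `original`; they return a new RDD.
doubled = original.map(lambda x: x * 2)

print(original.collect())   # [1, 2, 3, 4, 5]  -- unchanged
print(doubled.collect())    # [2, 4, 6, 8, 10]

# The lineage used for fault recovery.
print(doubled.toDebugString())

spark.stop()
```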
RDD Operations
There are two types of operations that can be performed on RDD:
Figure: Transformations and Actions
1. Transformations
Transformations are operations that create a new RDD from an existing one.
As mentioned earlier, RDDs are immutable, but we can create new RDDs by applying transformations. A transformation takes an RDD as input and produces one or more RDDs as output. Transformations are lazy in nature: they are not executed immediately, but only when an action is called. The two most basic transformations are map() and filter().
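For instance (a small sketch with made-up numbers), chaining filter() and map() only records the operations; nothing runs until the collect() action at the end:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("transformations-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(10))

evens   = numbers.filter(lambda x: x % 2 == 0)   # transformation: nothing runs yet
squares = evens.map(lambda x: x * x)             # another transformation

# Only this action triggers the actual computation.
print(squares.collect())   # [0, 4, 16, 36, 64]

spark.stop()
```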
Transformations can be either narrow or wide.
A. Narrow Transformation
This type of transformation doesn't require data to be shuffled across partitions. Each input partition produces exactly one output partition, so the nodes in the cluster do not need to communicate to perform the operation.
It is more efficient as there is less network overhead.
Figure: Narrow Transformation
(Source: Data Flair)
B. Wide Transformation
This type of transformation requires shuffling data between partitions. The operation depends on data from multiple partitions, which must be redistributed across the cluster.
It has higher latency and executes less efficiently because of the disk I/O and network traffic involved in shuffling.
Figure: Wide Transformation
(Source: Data Flair)
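A small sketch contrasting the two, on made-up key-value data: map() is narrow because each output partition depends on a single input partition, while reduceByKey() is wide because all values for a key have to be shuffled to the same partition:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("shuffle-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)], numSlices=2)

# Narrow: each output partition depends on exactly one input partition.
upper = pairs.map(lambda kv: (kv[0].upper(), kv[1]))

# Wide: all values for a key must be brought together,
# so data is shuffled between partitions across the network.
totals = upper.reduceByKey(lambda a, b: a + b)

print(totals.collect())   # e.g. [('A', 4), ('B', 6)]

spark.stop()
```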
2. Actions
Actions trigger the actual computation on an RDD. Unlike a transformation, an action does not produce a new RDD; actions are the Spark RDD operations that return non-RDD values. Results are either returned to the driver or written to external storage systems.
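A few common actions in a PySpark sketch (the data and the output path are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("actions-demo").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "hadoop", "spark", "flink"])

# Actions return plain Python values to the driver (or write data out),
# rather than producing a new RDD.
print(words.count())          # 4
print(words.take(2))          # ['spark', 'hadoop']
print(words.countByValue())   # counts per element, e.g. {'spark': 2, ...}

# Writing to external storage is also an action (path is a placeholder).
# words.saveAsTextFile("/tmp/words-output")

spark.stop()
```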
Job, Stage, and Task
Job
A job is the sequence of transformations that Spark executes when an action is called. It can be thought of as a unit of the total work your Spark application needs to perform, broken down into a series of steps.
Stage
When a job is divided, it results in the creation of multiple stages. A stage is a sequence of transformations executed in a single pass (i.e., without shuffling of data).
Each stage comprises tasks that all perform the same computation.
Task
A task is the smallest unit of execution and runs on a single executor. Each task processes one partition of the data, so the number of tasks in a stage equals the number of partitions.
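To tie the three together, here is a small sketch with made-up log lines: the single collect() action submits one job, the reduceByKey() shuffle splits it into two stages, and each stage runs one task per partition:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("job-stage-task").getOrCreate()
sc = spark.sparkContext

logs = sc.parallelize(
    ["INFO ok", "ERROR boom", "INFO ok", "ERROR bad"],
    numSlices=4,                                 # 4 partitions -> 4 tasks per stage
)

counts = (
    logs.map(lambda line: (line.split()[0], 1))  # narrow work in the first stage
        .reduceByKey(lambda a, b: a + b)         # shuffle -> stage boundary
)

# One action = one job; the Spark UI (http://localhost:4040 in local mode)
# shows the job, its two stages, and the per-partition tasks.
print(counts.collect())

spark.stop()
```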
Lazy Evaluation
Lazy Evaluation is a powerful feature in Apache Spark that allows Spark to optimize the execution of data processing pipelines.
It means that Spark delays the execution of operations until it absolutely needs to perform them (i.e., until an action is called).
How Lazy Evaluation Works
- In Spark, transformations like map(), filter(), and flatMap() are lazy. They don't execute immediately; instead, they are recorded as transformations in a DAG (Directed Acyclic Graph).
- Spark builds up a series of transformations on the data but doesn't compute anything until it encounters an action like collect(), count(), or save().
- When an action is called, Spark looks at the entire DAG of transformations, optimizes the execution plan, and then runs the tasks in the most efficient order.
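The sketch below makes the laziness visible by timing a deliberately slow map function (the sleep is artificial): building the pipeline returns almost instantly, and the work only happens when count() is called:

```python
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

def slow_square(x):
    time.sleep(0.01)          # pretend this is an expensive computation
    return x * x

data = sc.parallelize(range(100))

start = time.time()
pipeline = data.map(slow_square).filter(lambda x: x % 2 == 0)
print("after transformations:", time.time() - start)   # ~0 s: nothing has run yet

start = time.time()
print("result count:", pipeline.count())                # the action executes the whole DAG
print("after action:", time.time() - start)

spark.stop()
```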
Spark Architecture
1. Driver Program
It is the central coordinator of the Spark application. It contains the SparkContext, which is the entry point to Spark's execution environment.
What it does:
- Job Scheduling
- Task Distribution
- Fault Tolerance
2. Cluster Manager
It is responsible for allocating resources across the cluster and decides which worker nodes your application gets to use. Examples are YARN, Mesos, etc.
3. Executors
These are the worker processes that run on the worker nodes. Each executor is responsible for executing the tasks assigned to it by the driver and reports the results back to the driver.
4. Worker Nodes
Worker nodes are the machines in the cluster where executors run.
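Roughly how these pieces show up in code: the script that builds the SparkSession is the driver, master points it at a cluster manager, and the executor settings describe what it requests from that manager. All values below are placeholders (local[*] runs everything in one process, so the executor settings only take effect on a real cluster):

```python
from pyspark.sql import SparkSession

# The process running this script is the driver program.
spark = (
    SparkSession.builder
    .appName("architecture-demo")
    .master("local[*]")                         # placeholder: could be yarn, spark://host:7077, ...
    .config("spark.executor.instances", "4")    # executors requested from the cluster manager
    .config("spark.executor.memory", "2g")      # memory per executor on each worker node
    .getOrCreate()
)

# The driver schedules tasks; executors on the worker nodes run them,
# and collect() brings the results back to the driver process.
print(spark.sparkContext.parallelize(range(10)).collect())

spark.stop()
```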
Execution Flow of Apache Spark
1. App Submission
When a user submits an application, a driver program is launched.
The driver communicates with the cluster manager to request resources.
2. Job Creation and DAG Creation
The driver converts the user code into jobs, which are divided into stages and further into tasks.
The driver builds a logical DAG representing the sequence of stages and tasks.
3. Stage Division and Task Scheduling
DAG Scheduler divides the DAG into stages, each with multiple tasks.
Task Scheduler assigns tasks to executors based on available resources and data.
4. Task Execution on Worker Nodes
Executors run tasks on worker nodes, process data, and return results to the driver.
The driver aggregates the results and presents them to the user.
Speculative Execution in Spark
In a large Spark job, tasks are distributed across many nodes in a cluster. Sometimes a few tasks run significantly slower than others due to issues like hardware degradation, network congestion, data skew, or resource contention.
Speculative execution launches duplicate copies of slow-running tasks on different nodes, under the assumption that the new copies may complete faster than the originals. Whichever copy finishes first wins, and the results of the others are discarded.
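Speculative execution is off by default and is enabled through configuration. A minimal sketch (the threshold values shown are the usual defaults; tune them for your workload):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("speculation-demo")
    .config("spark.speculation", "true")            # relaunch suspiciously slow tasks
    .config("spark.speculation.quantile", "0.75")   # wait until 75% of tasks in a stage finish
    .config("spark.speculation.multiplier", "1.5")  # "slow" = 1.5x the median task duration
    .getOrCreate()
)

spark.stop()
```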
Conclusion
In conclusion, Apache Spark is a powerful, open-source distributed data processing engine that has revolutionized big data analytics. Its ability to handle vast amounts of data in real-time, coupled with its fault tolerance, scalability, and speed, makes it an ideal choice for various industries dealing with massive datasets.