What is Apache Spark?
Apache Spark is an open-source, distributed processing system used for big data workloads. It uses in-memory caching and optimized query execution to run fast analytic queries against data of any size. It provides development APIs in Java, Scala, Python, and R, and supports code reuse across multiple workloads: batch processing, interactive queries, real-time analytics, machine learning, and graph processing.
What is the history of Apache Spark?
Apache Spark started in 2009 as a research project at UC Berkeley's AMPLab, a collaboration of students, researchers, and faculty focused on data-intensive application domains. The goal of Spark was to create a new framework optimized for fast iterative processing, such as machine learning and interactive data analysis, while retaining the scalability and fault tolerance of Hadoop MapReduce. The first paper, "Spark: Cluster Computing with Working Sets," was published in June 2010, and Spark was open sourced under a BSD license. In June 2013, Spark entered incubation status at the Apache Software Foundation (ASF), and it was established as an Apache Top-Level Project in February 2014. Spark can run standalone, on Apache Mesos, or, most frequently, on Apache Hadoop.
What is PySpark?
PySpark is the Python API for Apache Spark, a powerful framework designed for distributed data processing. If you’ve ever worked with large datasets and found your programs running slowly, PySpark might be the solution you’ve been searching for. It allows you to process massive datasets across multiple computers at the same time, meaning your programs can handle more data in less time.
Key Features of PySpark
- Distributed Processing: Instead of relying on one computer, PySpark breaks up your data into smaller chunks and processes them on multiple machines simultaneously.
- In-Memory Processing: PySpark can store data in memory (RAM), making it much faster than traditional methods that often rely on slow disk access.
- Fault Tolerance: Even if one machine fails while processing data, PySpark can automatically recover, ensuring your data is safe and the job gets done.
Importance of Using PySpark
PySpark lets you handle data that is too large or too slow to process on a single machine by splitting the work across multiple computers in a cluster.
Common Use Cases
- Data Analysis: If you’re analyzing huge datasets (e.g., sales data, website logs), PySpark helps process that data quickly.
- Machine Learning: PySpark is often used to build models that predict trends or patterns from large datasets.
- Big Data Processing: Companies with tons of data (like social media platforms or e-commerce giants) use PySpark to keep things running smoothly.
Apache Spark Architecture
The Spark runtime consists of several key components that work together to execute distributed computations.
Below are the functions of each component of the Spark architecture.
The Spark driver
The driver is the program or process responsible for coordinating the execution of the Spark application. It runs the main function and creates the SparkContext, which connects to the cluster manager.
The Spark executors
Executors are worker processes responsible for executing tasks in Spark applications. They are launched on worker nodes and communicate with the driver program and cluster manager. Executors run tasks concurrently and store data in memory or disk for caching and intermediate storage.
The cluster manager
The cluster manager is responsible for allocating resources and managing the cluster on which the Spark application runs. Spark supports various cluster managers like Apache Mesos, Hadoop YARN, and standalone cluster manager.
SparkContext
SparkContext is the entry point for any Spark functionality. It represents the connection to a Spark cluster and can be used to create RDDs (Resilient Distributed Datasets), accumulators, and broadcast variables. SparkContext also coordinates the execution of tasks.
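For a concrete picture, here is a minimal sketch, assuming a local Spark installation, that obtains the SparkContext from a SparkSession and uses it to create an RDD, an accumulator, and a broadcast variable:
from pyspark.sql import SparkSession
# Build a local session; the SparkContext is available as an attribute of it
spark = SparkSession.builder.appName("context-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3, 4])        # a simple RDD
counter = sc.accumulator(0)               # shared counter the executors can add to
lookup = sc.broadcast({"a": 1, "b": 2})   # read-only value shipped to the executors
rdd.foreach(lambda x: counter.add(x))     # executors update the accumulator
print(counter.value)                      # 10, read back on the driver
print(lookup.value["a"])                  # 1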
Task
A task is the smallest unit of work in Spark, representing a unit of computation that can be performed on a single partition of data. The driver program divides the Spark job into tasks and assigns them to the executor nodes for execution.
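A small sketch of the idea, assuming the same local setup: the number of partitions an RDD is split into determines how many tasks an action on it runs.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("task-demo").master("local[*]").getOrCreate()
# Ask for 4 partitions, so an action on this RDD is executed as 4 tasks
rdd = spark.sparkContext.parallelize(range(100), 4)
print(rdd.getNumPartitions())  # 4
print(rdd.count())             # 100, computed as one task per partition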
Working of Spark Architecture
When the driver program in the Apache Spark architecture executes, it runs the application's main code and creates a SparkContext, which provides access to all of Spark's basic functionality. The Spark driver includes several other components, including:
- DAG Scheduler.
- Task Scheduler.
- Backend Scheduler.
- Block Manager.
These components translate user code into jobs that are executed on the cluster. Together, the Spark Driver and SparkContext oversee the entire job execution lifecycle.
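As a rough illustration of that lifecycle, the sketch below (local mode, illustrative names) chains transformations and then prints the lineage the schedulers turn into stages and tasks; nothing actually runs until the final action is called:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("dag-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext
words = sc.parallelize(["spark", "driver", "spark", "executor"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)  # transformations only: no job yet
print(counts.toDebugString())  # the lineage Spark will execute
print(counts.collect())        # the action that actually triggers the job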
Installing Apache Spark and PySpark from the Terminal
# Step 1: Install Java 17 (required for Spark 4.x)
sudo apt install openjdk-17-jdk -y
# Verify the Java installation
java -version # should print openjdk version "17.0.xx"
# Step 2: Download Apache Spark 4.0.1 (built with Hadoop 3)
wget https://dlcdn.apache.org/spark/spark-4.0.1/spark-4.0.1-bin-hadoop3.tgz
# Extract the tarball
tar -xvzf spark-4.0.1-bin-hadoop3.tgz
# Remove the archive to save space
rm spark-4.0.1-bin-hadoop3.tgz
# Rename the extracted folder to something simpler
mv spark-4.0.1-bin-hadoop3 spark
# Navigate into the Spark installation directory
cd spark
# Step 3: Verify Python
python --version
# Step 4: Set up the PySpark environment
python -m venv sparkenv
# Activate the environment
source sparkenv/bin/activate
# Install PySpark
pip install pyspark
# Step 5: Running PySpark with JupyterLab
# Install Jupyter in your virtual environment
pip install notebook ipykernel
# Register your environment as a Jupyter kernel
python -m ipykernel install --user --name=sparkenv --display-name "Python (sparkenv)"
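# Launch Jupyter and pick the "Python (sparkenv)" kernel in the browser.
# (If you prefer JupyterLab itself, "pip install jupyterlab" and run "jupyter lab" instead.)
jupyter notebook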
Running PySpark Code
After a successful setup, initialize a PySpark session.
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("restaurant").getOrCreate()
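To confirm the session is up, you can print the Spark version (a small sanity check; the exact number depends on your installation):
# Verify the session by printing the Spark version
print(spark.version)   # e.g. 4.0.1 for the build installed earlier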
Core Concepts of PySpark
1. Resilient Distributed Datasets (RDDs)
An RDD (Resilient Distributed Dataset) is the fundamental data structure in Spark: a collection of elements partitioned across the nodes of a cluster so that it can be processed in parallel.
Key Characteristics
Immutable: Once created, RDDs can't be modified; transformations create new RDDs.
Resilient: Can recover from node failures through lineage tracking (remembers the transformations used to build it) rather than data replication.
Distributed: Data is partitioned and processed in parallel across cluster nodes.
Dataset: Holds your data like a large list or table.
Lazy Evaluation: Transformations are not executed immediately - they build up a computation graph that executes only when an action is called.
Creating an RDD
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder \
    .appName("RDD") \
    .getOrCreate()
# Get the SparkContext from the session
sc = spark.sparkContext
# Create an RDD using parallelize
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # turns the Python list into an RDD so Spark can process it in parallel
RDD Transformations and Actions
There are two types of operations you can perform on an RDD: Transformations and Actions.
Transformations: These are lazy operations that define how the RDD should be transformed. Examples include map(), filter(), flatMap(), groupByKey(), reduceByKey(). These don’t execute right away; they build up a plan of what should happen.
Actions: These trigger the actual computation and return the result. Examples include collect(), count(), first(), take(), saveAsTextFile().
Example of Transformation:
# Create an RDD using parallelize
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# filter(): keep only elements matching a condition
rdd.filter(lambda x: x % 2 == 1).collect()
# flatMap(): like map(), but flattens the resulting lists
rdd2 = spark.sparkContext.parallelize(["hello world", "hi spark"])
rdd2.flatMap(lambda line: line.split(" ")).collect()
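The map() and reduceByKey() transformations mentioned above follow the same pattern; a short sketch, assuming the same spark session and rdd as above:
# map(): apply a function to every element
rdd.map(lambda x: x * 10).collect()   # [10, 20, ..., 100]
# reduceByKey(): combine values that share the same key
pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
pairs.reduceByKey(lambda a, b: a + b).collect()   # [('a', 4), ('b', 2)] (order may vary)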
Example of Action:
# Create an RDD using parallelize
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# collect(): brings all elements from the distributed RDD back to the driver
rdd.collect()
# take(): returns the first N elements from the RDD (here, 5 elements)
rdd.take(5)
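The other actions listed earlier work the same way; for example:
# Returns the total number of elements in the RDD
rdd.count()   # 10
# Returns the first element of the RDD
rdd.first()   # 1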
2. DataFrames
A DataFrame is a distributed collection of data organized into named columns, much like a table in a relational database. Like RDDs, DataFrames are immutable and distributed, but they add schema and column support for structured data processing.
Why Use DataFrames Instead of RDDs?
Optimized for Performance: DataFrames come with built-in optimizations that RDDs don’t have, which means operations run faster.
Schema Information: With DataFrames, you know the structure of your data (e.g., column names and types), which allows for more meaningful data manipulation.
Ease of Use: DataFrames allow you to perform SQL-like operations such as filtering, grouping, and aggregating data, which are more intuitive than RDD transformations.
When to Use DataFrames
Ideal for: structured or semi-structured data (JSON, Parquet, CSV, databases)
Workloads with higher performance requirements, thanks to built-in optimizations
SQL-like operations and complex queries
ETL pipelines with data from multiple sources
When you need both programmatic and SQL access to the same data
Working with data analysts who prefer SQL syntax
Creating DataFrames
from pyspark.sql import SparkSession
# Initialize a Spark session
spark = SparkSession.builder.appName("uber").getOrCreate()
# Read the CSV file into a DataFrame
uber_df = spark.read.csv("uber.csv", header=True, inferSchema=True)
uber_df.show(5)
# Check columns and their data types
uber_df.printSchema()
# Check the number of rows and columns
uber_df.count()          # number of rows
len(uber_df.columns)     # number of columns
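With the DataFrame loaded, you can filter, group, and aggregate it directly through the DataFrame API. The column names below (passenger_count, fare_amount) are assumptions about what uber.csv contains; substitute columns that actually exist in your file:
from pyspark.sql import functions as F
# Keep only rides with more than one passenger (assumed column name)
uber_df.filter(uber_df["passenger_count"] > 1).show(5)
# Average fare for each passenger count (assumed column names)
uber_df.groupBy("passenger_count").agg(F.avg("fare_amount").alias("avg_fare")).show()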
3. Spark SQL
Spark SQL is a module of PySpark that lets you use SQL-like syntax to interact with DataFrames. It’s particularly useful when you need to query structured data. Whether you’re filtering, grouping, or joining data, you can use familiar SQL commands, just like you would in a traditional database.
Why use Spark SQL?
Familiar SQL syntax.
Performance optimization using Catalyst.
Integration with BI tools.
Using SQL to Query a DataFrame
Let’s start by registering a DataFrame as a temporary table so we can query it using SQL.
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("restaurant").getOrCreate()
# Read the CSV file into a DataFrame
restaurant_df = spark.read.csv("restaurant_orders.csv", header=True, inferSchema=True)
restaurant_df.show(5)
# Register the DataFrame as a temporary view
restaurant_df.createOrReplaceTempView("restaurant_info")
# Run a SQL query
spark.sql("SELECT * FROM restaurant_info").show()
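# You can also run richer SQL against the same view. The column names
# below (item_name, quantity) are illustrative assumptions; replace them
# with columns that actually exist in restaurant_orders.csv.
spark.sql("""
    SELECT item_name, SUM(quantity) AS total_quantity
    FROM restaurant_info
    GROUP BY item_name
    ORDER BY total_quantity DESC
""").show()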
# Stop the SparkSession
spark.stop()
Conclusion
We explored the Apache Spark architecture and the core concepts of PySpark in order to understand how to build big data applications efficiently. Spark's components are accessible and work together cleanly, which makes it well suited to cluster computing and big data workloads, and its straightforward, fast computation model is a big reason it remains popular for batch processing.