<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: KiplangatJaphet</title>
    <description>The latest articles on DEV Community by KiplangatJaphet (@kiplangatjaphet).</description>
    <link>https://dev.to/kiplangatjaphet</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3520072%2F3f522f1a-e25a-4fca-9741-fe7174db69aa.png</url>
      <title>DEV Community: KiplangatJaphet</title>
      <link>https://dev.to/kiplangatjaphet</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kiplangatjaphet"/>
    <language>en</language>
    <item>
      <title>A Beginner’s Guide to Big Data Analytics with Apache Spark and PySpark</title>
      <dc:creator>KiplangatJaphet</dc:creator>
      <pubDate>Mon, 29 Sep 2025 21:49:45 +0000</pubDate>
      <link>https://dev.to/kiplangatjaphet/a-beginners-guide-to-big-data-analytics-with-apache-spark-and-pyspark-dpc</link>
      <guid>https://dev.to/kiplangatjaphet/a-beginners-guide-to-big-data-analytics-with-apache-spark-and-pyspark-dpc</guid>
      <description>&lt;h2&gt;
  
  
  What is Apache Spark?
&lt;/h2&gt;

&lt;p&gt;Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching, and optimized query execution for fast analytic queries against data of any size. It provides development APIs in Java, Scala, Python and R, and supports code reuse across multiple workloads—batch processing, interactive queries, real-time analytics, machine learning, and graph processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the history of Apache Spark?&lt;/strong&gt;&lt;br&gt;
Apache Spark started in 2009 as a research project at UC Berkeley’s AMPLab, a collaboration involving students, researchers, and faculty focused on data-intensive application domains. The goal of Spark was to create a new framework optimized for fast iterative processing, such as machine learning and interactive data analysis, while retaining the scalability and fault tolerance of Hadoop MapReduce. The first paper, “Spark: Cluster Computing with Working Sets”, was published in June 2010, and Spark was open sourced under a BSD license. In June 2013, Spark entered incubation status at the Apache Software Foundation (ASF), and was established as an Apache Top-Level Project in February 2014. Spark can run standalone, on Apache Mesos, or most frequently on Apache Hadoop. &lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;What is PySpark?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;PySpark is the Python API for Apache Spark, a powerful framework designed for distributed data processing. If you’ve ever worked with large datasets and found your programs running slowly, PySpark might be the solution you’ve been searching for. It allows you to process massive datasets across multiple computers at the same time, meaning your programs can handle more data in less time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features of PySpark&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Distributed Processing: Instead of relying on one computer, PySpark breaks up your data into smaller chunks and processes them on multiple machines simultaneously.&lt;/li&gt;
&lt;li&gt;In-Memory Processing: PySpark can store data in memory (RAM), making it much faster than traditional methods that often rely on slow disk access.&lt;/li&gt;
&lt;li&gt;Fault Tolerance: Even if one machine fails while processing data, PySpark can automatically recover, ensuring your data is safe and the job gets done.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Importance of Using PySpark&lt;/strong&gt;&lt;br&gt;
When a dataset grows beyond what a single machine can comfortably handle, PySpark lets you process it efficiently by splitting the work across multiple computers in a cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common Use Cases&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data Analysis: If you’re analyzing huge datasets (e.g., sales data, website logs), PySpark helps process that data quickly.&lt;/li&gt;
&lt;li&gt;Machine Learning: PySpark is often used to build models that predict trends or patterns from large datasets.&lt;/li&gt;
&lt;li&gt;Big Data Processing: Companies with tons of data (like social media platforms or e-commerce giants) use PySpark to keep things running smoothly.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Apache Spark Architecture
&lt;/h2&gt;

&lt;p&gt;The Spark runtime consists of several key components that work together to execute distributed computations.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4km862zdpqy1udbc8z1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4km862zdpqy1udbc8z1.png" alt=" " width="800" height="345"&gt;&lt;/a&gt; &lt;br&gt;
Below are the functions of each component of Spark architecture. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Spark driver&lt;/strong&gt;&lt;br&gt;
The driver is the program or process responsible for coordinating the execution of the Spark application. It runs the main function and creates the SparkContext, which connects to the cluster manager.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Spark executors&lt;/strong&gt;&lt;br&gt;
Executors are worker processes responsible for executing tasks in Spark applications. They are launched on worker nodes and communicate with the driver program and cluster manager. Executors run tasks concurrently and store data in memory or disk for caching and intermediate storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The cluster manager&lt;/strong&gt;&lt;br&gt;
The cluster manager is responsible for allocating resources and managing the cluster on which the Spark application runs. Spark supports various cluster managers like Apache Mesos, Hadoop YARN, and standalone cluster manager.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SparkContext&lt;/strong&gt;&lt;br&gt;
SparkContext is the entry point for any Spark functionality. It represents the connection to a Spark cluster and can be used to create RDDs (Resilient Distributed Datasets), accumulators, and broadcast variables. SparkContext also coordinates the execution of tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task&lt;/strong&gt;&lt;br&gt;
A task is the smallest unit of work in Spark, representing a unit of computation that can be performed on a single partition of data. The driver program divides the Spark job into tasks and assigns them to the executor nodes for execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Working of the Spark Architecture&lt;/strong&gt;&lt;br&gt;
When the driver program in the Apache Spark architecture executes, it calls the main program of the application and creates a SparkContext, which contains all of the basic functions. The Spark driver includes several other components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DAG Scheduler. &lt;/li&gt;
&lt;li&gt;Task Scheduler.&lt;/li&gt;
&lt;li&gt;Backend Scheduler.&lt;/li&gt;
&lt;li&gt;Block Manager.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These components translate user code into jobs that are executed on the cluster. Together, the Spark Driver and SparkContext oversee the entire job execution lifecycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installing Apache Spark and PySpark from the terminal&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Step 1: Install Java 17 (required for Spark 4.x)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;openjdk-17-jdk &lt;span class="nt"&gt;-y&lt;/span&gt;

&lt;span class="c"&gt;#Verify Java installation&lt;/span&gt;
java &lt;span class="nt"&gt;-version&lt;/span&gt;   &lt;span class="c"&gt;# should print openjdk version "17.0.xx"&lt;/span&gt;

&lt;span class="c"&gt;# Step 2: Download Apache Spark 4.0.1 (built with Hadoop 3)&lt;/span&gt;
wget https://dlcdn.apache.org/spark/spark-4.0.1/spark-4.0.1-bin-hadoop3.tgz

&lt;span class="c"&gt;#: Extract the tarball&lt;/span&gt;
&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xvzf&lt;/span&gt; spark-4.0.1-bin-hadoop3.tgz

&lt;span class="c"&gt;#: Remove the archive to save space&lt;/span&gt;
&lt;span class="nb"&gt;rm &lt;/span&gt;spark-4.0.1-bin-hadoop3.tgz

&lt;span class="c"&gt;#: Rename the extracted folder to something simpler&lt;/span&gt;
&lt;span class="nb"&gt;mv &lt;/span&gt;spark-4.0.1-bin-hadoop3 spark

&lt;span class="c"&gt;#: Navigate into the Spark installation directory&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;spark 

&lt;span class="c"&gt;#Step 3: Verify Python &lt;/span&gt;
python &lt;span class="nt"&gt;--version&lt;/span&gt;

&lt;span class="c"&gt;#Step 4: Set up Pyspark Environment &lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv sparkenv 

&lt;span class="c"&gt;#Activate the Environment:&lt;/span&gt;
&lt;span class="nb"&gt;source &lt;/span&gt;sparkenv/bin/activate

&lt;span class="c"&gt;#Install Pyspark&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;pyspark

&lt;span class="c"&gt;#Step 5: Running Pyspark with JupyterLab&lt;/span&gt;
&lt;span class="c"&gt;#Install Jupyter in your virtual environment&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;notebook ipykernel 

&lt;span class="c"&gt;#Register your environment &lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; ipykernel &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--user&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sparkenv &lt;span class="nt"&gt;--display-name&lt;/span&gt; &lt;span class="s2"&gt;"Python (sparkenv)"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Running PySpark Code&lt;/strong&gt; &lt;br&gt;
After a successful setup, initialize a PySpark session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="c1"&gt;#Create a Spark session
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;restaurant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Core Concepts of PySpark&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Resilient Distributed Datasets (RDDs)&lt;/strong&gt;&lt;br&gt;
An RDD (Resilient Distributed Dataset) is the fundamental data structure in Spark. RDDs are the elements that run and operate on multiple nodes to perform parallel processing on a cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Characteristics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Immutable: Once created, RDDs can't be modified; transformations create new RDDs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Resilient: Can recover from node failures through lineage tracking (remembers the transformations used to build it) rather than data replication.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Distributed: Data is partitioned and processed in parallel across cluster nodes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dataset: Holds your data like a large list or table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lazy Evaluation: Transformations are not executed immediately - they build up a computation graph that executes only when an action is called.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Creating an RDD&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;from pyspark.sql import SparkSession

&lt;span class="c"&gt;# Initialize SparkSession&lt;/span&gt;
sc &lt;span class="o"&gt;=&lt;/span&gt; SparkSession.builder &lt;span class="se"&gt;\&lt;/span&gt;
    .appName&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"RDD"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    .getOrCreate&lt;span class="o"&gt;()&lt;/span&gt; 

&lt;span class="c"&gt;#Initialize SparkContext&lt;/span&gt;
sc &lt;span class="o"&gt;=&lt;/span&gt; spark.sparkContext

&lt;span class="c"&gt;#Create RDD using parallelize&lt;/span&gt;
rdd &lt;span class="o"&gt;=&lt;/span&gt; spark.sparkContext.parallelize&lt;span class="o"&gt;([&lt;/span&gt;1, 2, 3, 4, 5, 6, 7, 8, 9, 10]&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="c"&gt;# takes the python list and changes it into a RDD so Spark can process it in parallel.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RDD Transformations and Actions&lt;/strong&gt; &lt;br&gt;
There are two types of operations you can perform on an RDD: Transformations and Actions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Transformations: These are lazy operations that define how the RDD should be transformed. Examples include map(), filter(), flatMap(), groupByKey(), reduceByKey(). These don’t execute right away; they build up a plan of what should happen.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Actions: These trigger the actual computation and return the result. Examples include collect(), count(), first(), take(), saveAsTextFile(). &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example of Transformation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#Create RDD using parallelize&lt;/span&gt;
rdd &lt;span class="o"&gt;=&lt;/span&gt; spark.sparkContext.parallelize&lt;span class="o"&gt;([&lt;/span&gt;1, 2, 3, 4, 5, 6, 7, 8, 9, 10]&lt;span class="o"&gt;)&lt;/span&gt; 

&lt;span class="c"&gt;#filter(): Keep only elements matching a condition&lt;/span&gt;
rdd.filter&lt;span class="o"&gt;(&lt;/span&gt;lambda x: x % 2 &lt;span class="o"&gt;==&lt;/span&gt; 1&lt;span class="o"&gt;)&lt;/span&gt;.collect&lt;span class="o"&gt;()&lt;/span&gt; 

&lt;span class="c"&gt;#flatMap(): Like map() but flattens lists&lt;/span&gt;
rdd2 &lt;span class="o"&gt;=&lt;/span&gt; spark.sparkContext.parallelize&lt;span class="o"&gt;([&lt;/span&gt;&lt;span class="s2"&gt;"hello world"&lt;/span&gt;, &lt;span class="s2"&gt;"hi spark"&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;
rdd2.flatMap&lt;span class="o"&gt;(&lt;/span&gt;lambda line: line.split&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;" "&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;.collect&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example of Action:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#Create RDD using parallelize&lt;/span&gt;
rdd &lt;span class="o"&gt;=&lt;/span&gt; spark.sparkContext.parallelize&lt;span class="o"&gt;([&lt;/span&gt;1, 2, 3, 4, 5, 6, 7, 8, 9, 10]&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;#Brings all elements from the distributed RDD back to the driver&lt;/span&gt;
rdd.collect&lt;span class="o"&gt;()&lt;/span&gt; 

&lt;span class="c"&gt;#Returns the first N elements from the RDD (here, 3 elements). &lt;/span&gt;
rdd.take&lt;span class="o"&gt;(&lt;/span&gt;5&lt;span class="o"&gt;)&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. DataFrames&lt;/strong&gt;&lt;br&gt;
A DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database. Like RDDs, DataFrames are immutable and distributed, but they add schema and column support for structured data processing. &lt;br&gt;
&lt;strong&gt;Why Use DataFrames Instead of RDDs?&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Optimized for Performance: DataFrames come with built-in optimizations that RDDs don’t have, which means operations run faster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Schema Information: With DataFrames, you know the structure of your data (e.g., column names and types), which allows for more meaningful data manipulation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ease of Use: DataFrames allow you to perform SQL-like operations such as filtering, grouping, and aggregating data, which are more intuitive than RDD transformations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to Use DataFrames&lt;/strong&gt;&lt;br&gt;
Ideal for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Structured or semi-structured data (JSON, Parquet, CSV, databases)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Better performance requirements due to optimizations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SQL-like operations and complex queries&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ETL pipelines with data from multiple sources&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When you need both programmatic and SQL access to the same data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Working with data analysts who prefer SQL syntax &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Creating DataFrames&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# creating spark session&lt;/span&gt;
from pyspark.sql import SparkSession 

&lt;span class="c"&gt;# Initializing a Spark session&lt;/span&gt;
spark &lt;span class="o"&gt;=&lt;/span&gt; SparkSession.builder.appName&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"uber"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;.getOrCreate&lt;span class="o"&gt;()&lt;/span&gt; 

&lt;span class="c"&gt;#read CSV file into DataFrame&lt;/span&gt;
uber_df &lt;span class="o"&gt;=&lt;/span&gt; spark.read.csv&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"uber.csv"&lt;/span&gt;, &lt;span class="nv"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True, &lt;span class="nv"&gt;inferSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True&lt;span class="o"&gt;)&lt;/span&gt;
uber_df.show&lt;span class="o"&gt;(&lt;/span&gt;5 &lt;span class="o"&gt;)&lt;/span&gt; 

&lt;span class="c"&gt;#Check coloumns and their datatypes&lt;/span&gt;
uber_df.printSchema&lt;span class="o"&gt;()&lt;/span&gt; 

&lt;span class="c"&gt;#Check the number of rows and columns&lt;/span&gt;
uber_df.count&lt;span class="o"&gt;()&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Spark SQL&lt;/strong&gt;&lt;br&gt;
Spark SQL is a module of PySpark that lets you use SQL-like syntax to interact with DataFrames. It’s particularly useful when you need to query structured data. Whether you’re filtering, grouping, or joining data, you can use familiar SQL commands, just like you would in a traditional database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why use Spark SQL?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Familiar SQL syntax.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Performance optimization using Catalyst.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Integration with BI tools.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Using SQL to Query a DataFrame&lt;/strong&gt;&lt;br&gt;
Let’s start by registering a DataFrame as a temporary table so we can query it using SQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;from pyspark.sql import SparkSession

&lt;span class="c"&gt;# create a Spark session&lt;/span&gt;
spark &lt;span class="o"&gt;=&lt;/span&gt; SparkSession.builder.appName&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"restuarant"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;.getOrCreate&lt;span class="o"&gt;()&lt;/span&gt; 

&lt;span class="c"&gt;#read CSV file into DataFrame&lt;/span&gt;
restaurant_df &lt;span class="o"&gt;=&lt;/span&gt; spark.read.csv&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"restaurant_orders.csv"&lt;/span&gt;, &lt;span class="nv"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True, &lt;span class="nv"&gt;inferSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True&lt;span class="o"&gt;)&lt;/span&gt;
restaurant_df.show&lt;span class="o"&gt;(&lt;/span&gt;5 &lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Register the DataFrame as a temporary view&lt;/span&gt;
restuarant_df.createOrReplaceTempView&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"restuarant_info"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Run a SQL query&lt;/span&gt;
spark.sql&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"SELECT * FROM restuarant_info"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;.show&lt;span class="o"&gt;()&lt;/span&gt; 

&lt;span class="c"&gt;# Stop the SparkSession&lt;/span&gt;
spark.stop&lt;span class="o"&gt;()&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
We explored the Apache Spark architecture in order to understand how to build big data applications efficiently. Its components (the driver, executors, cluster manager, and SparkContext) work together to make cluster computing accessible, and PySpark brings that power to Python. Spark computes results efficiently and remains popular for batch processing, interactive queries, real-time analytics, and machine learning.&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>datascience</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Apache Kafka Deep Dive: Core Concepts, Data Engineering Applications, and Real-World Production Practices</title>
      <dc:creator>KiplangatJaphet</dc:creator>
      <pubDate>Tue, 23 Sep 2025 20:53:17 +0000</pubDate>
      <link>https://dev.to/kiplangatjaphet/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-2mjg</link>
      <guid>https://dev.to/kiplangatjaphet/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-2mjg</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
Apache Kafka has emerged as a cornerstone technology for building scalable, real-time data pipelines and event-driven architectures. Originally developed at LinkedIn and open-sourced in 2011, Kafka is a distributed streaming platform designed to handle massive volumes of data with low latency and high throughput. This article explores Kafka’s core concepts, its applications in data engineering, and best practices for running Kafka in production, with insights into how companies like Netflix, LinkedIn, and Uber leverage it. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Apache Kafka?&lt;/strong&gt;&lt;br&gt;
Apache Kafka is an open-source distributed event streaming platform that serves as a robust system for building real-time data pipelines and streaming applications. It enables applications to publish and subscribe to streams of events, making it ideal for applications that need to process large volumes of data in real-time, such as data ingestion, real-time analytics, and event-driven architectures. Kafka's key features include high throughput, scalability, fault tolerance through data replication, and durable, ordered message storage within topics. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Kafka as an event streaming platform&lt;/strong&gt;&lt;br&gt;
Kafka combines three key capabilities so you can implement your use cases for event streaming end-to-end with a single battle-tested solution:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;To publish (write) and subscribe to (read) streams of events, including continuous import/export of your data from other systems.&lt;/li&gt;
&lt;li&gt;To store streams of events durably and reliably for as long as you want.&lt;/li&gt;
&lt;li&gt;To process streams of events as they occur or retrospectively.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And all this functionality is provided in a distributed, highly scalable, elastic, fault-tolerant, and secure manner. Kafka can be deployed on bare-metal hardware, virtual machines, and containers, and on-premises as well as in the cloud. You can choose between self-managing your Kafka environments and using fully managed services offered by a variety of vendors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does Kafka work?&lt;/strong&gt;&lt;br&gt;
Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol. It can be deployed on bare-metal hardware, virtual machines, and containers in on-premise as well as cloud environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Servers&lt;/strong&gt;&lt;br&gt;
Kafka is run as a cluster of one or more servers that can span multiple datacenters or cloud regions. Some of these servers form the storage layer, called the brokers. Other servers run Kafka Connect to continuously import and export data as event streams to integrate Kafka with your existing systems such as relational databases as well as other Kafka clusters. To let you implement mission-critical use cases, a Kafka cluster is highly scalable and fault-tolerant: if any of its servers fails, the other servers will take over their work to ensure continuous operations without any data loss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clients&lt;/strong&gt;&lt;br&gt;
They allow you to write distributed applications and microservices that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner even in the case of network problems or machine failures. Kafka ships with some such clients included, which are augmented by dozens of clients provided by the Kafka community: clients are available for Java and Scala including the higher-level Kafka Streams library, for Go, Python, C/C++, and many other programming languages as well as REST APIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Components of kafka&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Producer -
Producers are applications or services that publish (write) messages into Kafka topics.
They decide which topic and partition each message goes to, either randomly, in round-robin fashion, or based on a key.&lt;/li&gt;
&lt;li&gt;ZooKeeper -
ZooKeeper is an open-source distributed coordination service that helps manage and synchronize large clusters of distributed systems by providing a reliable place where services can keep configuration, naming, synchronization, and group information.&lt;/li&gt;
&lt;li&gt;Consumer -
Consumers are applications that subscribe to (read) messages from Kafka topics.
Consumers exist in consumer groups to share the load of message consumption.&lt;/li&gt;
&lt;li&gt;Topic -
A topic is like a logical channel or category where messages are stored.
Producers write messages into topics, and consumers read from them.&lt;/li&gt;
&lt;li&gt;Partition -
Topics are split into partitions to allow parallelism and scalability.
Each partition is an ordered, immutable log of records.
Messages inside partitions are identified by a unique offset.&lt;/li&gt;
&lt;li&gt;Broker -
A broker is a Kafka server that stores and serves messages.
Acting as a central hub, the broker accepts messages from producers, assigns them unique offsets, and stores them durably on disk.&lt;/li&gt;
&lt;li&gt;Cluster -
A Kafka cluster is a group of brokers working together.
It ensures data replication, fault tolerance, and high availability.&lt;/li&gt;
&lt;li&gt;Offset -
An offset is a unique ID assigned to each message in a partition.
It helps consumers keep track of which messages have been read.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ofj7fjuqa4a4tn9v0on.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ofj7fjuqa4a4tn9v0on.png" alt=" " width="609" height="171"&gt;&lt;/a&gt;&lt;br&gt;
      The architecture of Kafka.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each Kafka broker can host multiple topics, and each topic is divided into multiple partitions for scalability and fault tolerance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5ctz7v4eu7sj1jxt2s3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5ctz7v4eu7sj1jxt2s3.jpg" alt=" " width="800" height="439"&gt;&lt;/a&gt;&lt;br&gt;
  Kafka topic partitions layout.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consumers use offsets to read messages sequentially from oldest to newest, and upon recovery from failure, resume from the last committed offset.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Set up Kafka on the Terminal&lt;/strong&gt;&lt;br&gt;
Let’s dive into the installation and running of Kafka directly from the terminal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Install Java&lt;/strong&gt;&lt;br&gt;
Kafka requires Java (JDK 11 or 17). Let’s install Java 11:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;openjdk-11-jdk &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Confirm Java Installation&lt;/p&gt;

&lt;p&gt;Verify that Java is installed correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;java &lt;span class="nt"&gt;-version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see output similar to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openjdk version "11.0.xx" ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Download and Extract Kafka&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s download and set up Kafka 3.9.1 with Scala 2.13.&lt;/p&gt;

&lt;p&gt;Download Kafka&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://downloads.apache.org/kafka/3.9.1/kafka_2.13-3.9.1.tgz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Extract the downloaded file&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xvzf&lt;/span&gt; kafka_2.13-3.9.1.tgz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Delete the archive to free up space&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;rm &lt;/span&gt;kafka_2.13-3.9.1.tgz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rename the extracted folder to something simpler&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mv &lt;/span&gt;kafka_2.13-3.9.1 kafka
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Change into the Kafka directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;kafka
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;3. Start ZooKeeper and Kafka&lt;/strong&gt;&lt;br&gt;
Start ZooKeeper (in one terminal):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/zookeeper-server-start.sh config/zookeeper.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see output like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fig18hq3un12q8ineyath.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fig18hq3un12q8ineyath.png" alt="ZooKeeper starting" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Start Kafka (in another terminal)&lt;/p&gt;

&lt;p&gt;Now open a &lt;strong&gt;second terminal&lt;/strong&gt;, navigate to the Kafka folder, and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/kafka-server-start.sh config/server.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see output like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgtsx4c8bby03v4g3sn8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgtsx4c8bby03v4g3sn8.png" alt=" " width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;4. Create a Kafka Topic&lt;/strong&gt;&lt;br&gt;
Let’s create a topic called &lt;code&gt;exams&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/kafka-topics.sh &lt;span class="nt"&gt;--create&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--topic&lt;/span&gt; exams &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; localhost:9092 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--partitions&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--replication-factor&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
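&lt;p&gt;The same topic can also be created programmatically. Below is a minimal sketch assuming the third-party &lt;code&gt;kafka-python&lt;/code&gt; package (install it with &lt;code&gt;pip install kafka-python&lt;/code&gt;) and a broker on &lt;code&gt;localhost:9092&lt;/code&gt;; the settings mirror the CLI flags above.&lt;/p&gt;

```python
def topic_spec(name="exams", partitions=1, replication=1):
    """Mirror the CLI flags: --topic, --partitions, --replication-factor."""
    return {"name": name, "num_partitions": partitions, "replication_factor": replication}

def create_topic(spec, bootstrap="localhost:9092"):
    # kafka-python is imported lazily so topic_spec stays usable without the package.
    from kafka.admin import KafkaAdminClient, NewTopic  # pip install kafka-python
    admin = KafkaAdminClient(bootstrap_servers=bootstrap)
    admin.create_topics([NewTopic(**spec)])
    admin.close()

# Usage (requires a running broker):
#   create_topic(topic_spec())
```

&lt;p&gt;Running &lt;code&gt;create_topic&lt;/code&gt; twice raises an error for an already-existing topic, just like the CLI.&lt;/p&gt;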



&lt;p&gt;&lt;strong&gt;5. Start a Kafka Producer&lt;/strong&gt;&lt;br&gt;
Let’s send some messages to the &lt;code&gt;exams&lt;/code&gt; topic using Kafka’s built-in console producer.&lt;br&gt;
In a new terminal window, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/kafka-console-producer.sh &lt;span class="nt"&gt;--topic&lt;/span&gt; exams &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; localhost:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once it starts, you can type messages directly into the terminal and hit Enter to send each message to the Kafka topic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kbxfoyr18w470dkj4ll.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kbxfoyr18w470dkj4ll.png" alt=" " width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each line you type gets published to the &lt;code&gt;exams&lt;/code&gt; topic.&lt;/p&gt;
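&lt;p&gt;The console producer also has a programmatic counterpart. This sketch assumes the third-party &lt;code&gt;kafka-python&lt;/code&gt; package and a broker on &lt;code&gt;localhost:9092&lt;/code&gt;; it publishes JSON-encoded exam records to the same &lt;code&gt;exams&lt;/code&gt; topic.&lt;/p&gt;

```python
import json

def encode_exam(student, score):
    """Serialize one exam record to JSON bytes, the form Kafka expects for message values."""
    return json.dumps({"student": student, "score": score}).encode("utf-8")

def send_exams(records, bootstrap="localhost:9092", topic="exams"):
    # kafka-python is imported lazily so encode_exam stays usable without the package.
    from kafka import KafkaProducer  # pip install kafka-python
    producer = KafkaProducer(bootstrap_servers=bootstrap)
    for student, score in records:
        producer.send(topic, value=encode_exam(student, score))
    producer.flush()   # block until every buffered message is delivered
    producer.close()

# Usage (requires a running broker):
#   send_exams([("alice", 87), ("bob", 92)])
```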




&lt;p&gt;&lt;strong&gt;6. Start a Kafka Consumer&lt;/strong&gt;&lt;br&gt;
To read the messages you just sent, start a Kafka consumer that listens to the &lt;code&gt;exams&lt;/code&gt; topic.&lt;/p&gt;

&lt;p&gt;In another terminal window, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/kafka-console-consumer.sh &lt;span class="nt"&gt;--topic&lt;/span&gt; exams &lt;span class="nt"&gt;--from-beginning&lt;/span&gt; &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; localhost:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will display &lt;strong&gt;all messages&lt;/strong&gt; from the beginning of the topic — including the ones you just produced.&lt;/p&gt;

&lt;p&gt;Your output should look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhxydgnjmhiiee9aq7ob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhxydgnjmhiiee9aq7ob.png" alt=" " width="800" height="210"&gt;&lt;/a&gt;&lt;/p&gt;
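&lt;p&gt;Likewise, the console consumer can be replaced with a small program. This sketch again assumes the &lt;code&gt;kafka-python&lt;/code&gt; package; setting &lt;code&gt;auto_offset_reset="earliest"&lt;/code&gt; plays the same role as &lt;code&gt;--from-beginning&lt;/code&gt; above.&lt;/p&gt;

```python
import json

def decode_exam(raw):
    """Inverse of the producer's encoding: JSON bytes back to a dict."""
    return json.loads(raw.decode("utf-8"))

def read_exams(bootstrap="localhost:9092", topic="exams"):
    from kafka import KafkaConsumer  # pip install kafka-python
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap,
        auto_offset_reset="earliest",   # same effect as --from-beginning
        consumer_timeout_ms=5000,       # stop iterating after 5 s of silence
    )
    for message in consumer:
        print(decode_exam(message.value))
    consumer.close()

# Usage (requires a running broker with messages in the topic):
#   read_exams()
```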

&lt;p&gt;&lt;strong&gt;Use Cases&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Netflix&lt;/strong&gt;&lt;br&gt;
Netflix needs no introduction. One of the world’s most innovative and robust OTT platforms, it uses Apache Kafka in its Keystone pipeline project to push and receive notifications.&lt;/p&gt;

&lt;p&gt;Netflix runs two kinds of Kafka clusters: Fronting Kafka, used for data collection and buffering by producers, and Consumer Kafka, used for routing content to consumers.&lt;/p&gt;

&lt;p&gt;The volume of data processed at Netflix is enormous: 36 Kafka clusters (24 Fronting Kafka and 12 Consumer Kafka) handle almost 700 billion messages a day.&lt;/p&gt;

&lt;p&gt;Through the Keystone pipeline, Netflix has brought its data loss rate down to 0.01%, with Apache Kafka as a key driver of that reduction.&lt;/p&gt;

&lt;p&gt;Netflix plans to use Kafka version 0.9.0.1 to improve resource utilization and availability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Uber&lt;/strong&gt;&lt;br&gt;
A giant in the travel industry like Uber needs a system that is fault-tolerant and uncompromising about errors on many fronts.&lt;/p&gt;

&lt;p&gt;Uber uses Apache Kafka to run their driver injury protection program in more than 200 cities.&lt;/p&gt;

&lt;p&gt;Drivers registered with Uber pay a premium on every ride, and the program has worked successfully thanks to the scalability and robustness of Apache Kafka.&lt;/p&gt;

&lt;p&gt;It has achieved this success largely through non-blocking batch processing, which gives the Uber engineering team a steady throughput.&lt;/p&gt;

&lt;p&gt;Multiple retries have let the Uber team segment messages to achieve real-time process updates and flexibility.&lt;/p&gt;

&lt;p&gt;Uber plans to introduce a framework on top of Apache Kafka that improves uptime and lets the program grow and scale without eating into developer time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LinkedIn&lt;/strong&gt;&lt;br&gt;
LinkedIn, one of the world’s most prominent B2B social media platforms, handles well over a trillion messages per day.&lt;/p&gt;

&lt;p&gt;If the Netflix numbers seemed huge, this figure is mind-blowing: LinkedIn’s message volume has grown by over 1200x in the last few years.&lt;/p&gt;

&lt;p&gt;LinkedIn uses separate clusters for different applications, so that the failure of one application cannot harm the others sharing a cluster.&lt;/p&gt;

&lt;p&gt;Broker Kafka clusters at LinkedIn let them whitelist certain users for higher bandwidth, ensuring a seamless user experience.&lt;/p&gt;

&lt;p&gt;LinkedIn plans to reduce its data loss rate through MirrorMaker, which acts as the intermediary between Kafka clusters and topics.&lt;/p&gt;

&lt;p&gt;At present, there is a limit on message size of 1 MB, but through the Kafka ecosystem LinkedIn plans to let publishers and consumers send messages well over that limit in the future.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Messaging&lt;/strong&gt;&lt;br&gt;
Kafka works well as a replacement for a more traditional message broker. Message brokers are used for a variety of reasons (to decouple processing from data producers, to buffer unprocessed messages, etc.). Compared to most messaging systems, Kafka has better throughput, built-in partitioning, replication, and fault tolerance, which makes it a good solution for large-scale message-processing applications. Messaging uses are often comparatively low-throughput, but they may require low end-to-end latency and often depend on the strong durability guarantees Kafka provides.&lt;br&gt;
(&lt;a href="https://www.knowledgenile.com/blogs/apache-kafka-use-cases" rel="noopener noreferrer"&gt;https://www.knowledgenile.com/blogs/apache-kafka-use-cases&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;br&gt;
Apache Kafka - Fundamentals - &lt;a href="https://www.tutorialspoint.com/apache_kafka/apache_kafka_fundamentals.htm" rel="noopener noreferrer"&gt;https://www.tutorialspoint.com/apache_kafka/apache_kafka_fundamentals.htm&lt;/a&gt;&lt;br&gt;
Apache Kafka Documentation - &lt;a href="https://kafka.apache.org/documentation/#uses" rel="noopener noreferrer"&gt;https://kafka.apache.org/documentation/#uses&lt;/a&gt;&lt;br&gt;
Introduction to Apache Kafka - &lt;a href="https://kafka.apache.org/intro" rel="noopener noreferrer"&gt;https://kafka.apache.org/intro&lt;/a&gt;&lt;br&gt;
Best Apache Kafka Use Cases - &lt;a href="https://www.knowledgenile.com/blogs/apache-kafka-use-cases" rel="noopener noreferrer"&gt;https://www.knowledgenile.com/blogs/apache-kafka-use-cases&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>architecture</category>
      <category>backend</category>
      <category>kafka</category>
    </item>
  </channel>
</rss>
