What is Apache Spark?
We all know about pandas data frames and how they make handling data so quick and easy. You can transform your data (drop columns, change data types, filter nulls) all in just a few lines of code.
But have you ever wondered what’s happening under the hood?
Where is this data actually stored while you’re manipulating it?
It’s definitely not in your database.
And once your Python script ends, where does all of it go?
The answer is simple: main memory (RAM).
All those transformations you run in pandas are only possible because your dataset is small enough to fit in your RAM.
Now here’s the problem: what happens when you’re working on massive projects, like training your own general-purpose LLM or crunching billions of rows of data?
The reality is—you can’t (at least not efficiently) with pandas. Sure, you can try streaming data or working in batches (I personally tried both for Lingua Connect), but it quickly becomes complex for no real reason.
And that’s where Apache Spark comes in.
Enter Spark
Apache Spark is the hero you call when your data is so massive that your machine can’t handle it all at once.
At its core, Apache Spark is an open-source, unified analytics engine designed for large-scale data processing.
In short: Just add more machines!
Here’s what I mean:
Instead of relying on one machine’s memory, Spark distributes your dataset across multiple machines (nodes). Each node processes a chunk of the data in parallel, and Spark combines the results. To you, it feels like working on one logical machine—but behind the scenes, it’s a cluster doing the heavy lifting.
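To make this concrete, here's a minimal local sketch (runnable once PySpark is installed, which we do later in this article); the local[4] master just simulates a 4-worker cluster on a single machine, and the app name is arbitrary:
from pyspark.sql import SparkSession
# "local[4]" runs Spark with 4 worker threads on your own machine
spark = SparkSession.builder.master("local[4]").appName("partition-demo").getOrCreate()
df = spark.range(1_000_000)               # a DataFrame with one million rows
print(df.rdd.getNumPartitions())          # how many chunks (partitions) the data was split into
print(df.selectExpr("sum(id)").first())   # each partition is summed in parallel, then combined
spark.stop()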
The diagram below summarises its architecture in a nutshell.
The Dimensions of Apache Spark
Spark isn’t just about running queries faster. It’s a whole ecosystem.
After this article, you can decide which dimensions of Spark to pursue, but here are the main ones (with a quick taste of their Python entry points after the list).
Spark Core – The foundation that handles memory management, job scheduling, and distributed task execution.
Spark SQL – For working with structured data (tables, DataFrames, SQL queries).
Spark Streaming – For real-time data processing (think logs, IoT data, live feeds).
MLlib – Spark’s built-in machine learning library for scalable ML models.
GraphX – For graph computations (social networks, recommendations, relationships).
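You don't have to run this yet (setup comes in a moment), but here's a rough sketch of what a few of those entry points look like from Python; the app name, options, and column names are placeholders:
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
spark = SparkSession.builder.appName("tour").getOrCreate()
# Spark SQL: query structured data with plain SQL
spark.sql("SELECT 'hello' AS greeting").show()
# Structured Streaming: the built-in "rate" source generates test rows, no external feed needed
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()
# MLlib: scalable machine learning estimators
lr = LogisticRegression(featuresCol="features", labelCol="label")
# GraphX itself is JVM-side (Scala/Java); Python graph work typically goes through the separate GraphFrames package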
Now let's look into Spark Core and Spark SQL through the simplest path: PySpark.
Apache Spark vs PySpark
Now you might be wondering—what’s the difference between Apache Spark and PySpark?
Put simply:
Apache Spark is the engine. Think of it like the car’s engine—the part that actually does the work and propels the car forward.
PySpark is the Python API for Spark. It’s the steering wheel you use to control the engine.
So while Spark itself is written in Scala and Java, PySpark gives Python developers the ability to harness Spark’s distributed power without leaving the comfort of Python.
Set up
I highly recommend using a virtual environment while working on this. You can set up the venv at the base of your working directory and activate it before installing PySpark.
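For example (assuming python3 is on your PATH; the folder name .venv is just a convention):
python3 -m venv .venv
source .venv/bin/activate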
We'll start by downloading the latest version of Spark (this might take a while)
wget https://dlcdn.apache.org/spark/spark-4.0.1/spark-4.0.1-bin-hadoop3.tgz
We then extract the Spark archive
tar -xzf spark-4.0.1-bin-hadoop3.tgz
We then rename the Spark folder to make it easier to reference, and move into it (the Spark home)
mv spark-4.0.1-bin-hadoop3/ spark/
cd spark
Run this command to confirm that your Spark version downloaded correctly
./bin/spark-submit --version
You should get an output like this.
Finally, let's install PySpark
pip install pyspark
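As an optional sanity check, you can confirm the install from Python:
python -c "import pyspark; print(pyspark.__version__)"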
Take PySpark for a spin
Create a new folder at the base of your working directory named files. This is where your code and data go
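From the terminal, assuming you're at the base of your working directory, that looks something like this; then we create the notebook inside it:
mkdir files
cd files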
touch spark.ipynb
Open it up in a text editor and connect to the virtual environment in which you installed PySpark
Now, let's get started
from pyspark.sql import SparkSession
Let's start the Spark session.
spark = SparkSession.builder.appName('demo').getOrCreate()
This starts up everything, i.e. the entire architecture described above; it's like twisting the key to start the engine.
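A couple of quick ways to check that the engine is actually running (exact values depend on your install):
print(spark.version)               # the Spark version this session runs on
print(spark.sparkContext.master)   # where the work runs, e.g. local[*] when everything is on one machine
# While the session is alive, the Spark UI is usually available at http://localhost:4040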
Create a dataframe
An interesting thing to note: unlike in pandas, where you'd usually build a DataFrame from a dict of columns, in Spark you can build one from a list of tuples (one tuple per row) plus a list of column names.
data=[("Eric", 25), ("Jane", 29), ("Sam", 35)]
df = spark.createDataFrame(data, ["name", "age"])
df.show()
Here are some useful commands
# Identify the schema
df.printSchema()
# List out the columns
df.columns
# Count the number of rows
df.count()
# Read data from a CSV file
df = spark.read.csv("demo.csv", header=True, inferSchema=True)
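Remember the pandas-style moves from the intro (drop columns, change data types, filter nulls)? Here's a minimal PySpark sketch of the same transformations; the small people DataFrame (with an extra None row so there is a null to filter) is built here purely for illustration:
from pyspark.sql import functions as F
people = spark.createDataFrame([("Eric", 25), ("Jane", None), ("Sam", 35)], ["name", "age"])
cleaned = (
    people.filter(F.col("age").isNotNull())              # filter out null ages
          .withColumn("age", F.col("age").cast("int"))   # change the data type
          .drop("name")                                   # drop a column
)
cleaned.show()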
Spark SQL
Now we can interact with our data using SQL, which is really cool
First, we create a temporary view from our DataFrame (analogous to database views)
df.createOrReplaceTempView("Demo")
That's it! Now you can write your queries here
Note that we're accessing the view from the Spark session we created above.
spark.sql("SELECT * FROM Demo"),show(5)