What is Apache Spark?
We all know about pandas data frames and how they make handling data so quick and easy. You can transform your data (drop columns, change data types, filter nulls) all in just a few lines of code.
But have you ever wondered what’s happening under the hood?
Where is this data actually stored while you’re manipulating it?
It’s definitely not in your database.
And once your Python script ends, where does all of it go?
The answer is simple: main memory (RAM).
All those transformations you run in pandas are only possible because your dataset is small enough to fit in your RAM.
Now here’s the problem: what happens when you’re working on massive projects, like training your own general-purpose LLM or crunching billions of rows of data?
The reality is—you can’t (at least not efficiently) with pandas. Sure, you can try streaming data or working in batches (I personally tried both for Lingua Connect), but it quickly becomes complex for no real reason.
And that’s where Apache Spark comes in.
Enter Spark
Apache Spark is the hero you call when your data is so massive that your machine can’t handle it all at once.
At its core, Apache Spark is an open-source, unified analytics engine designed for large-scale data processing.
In short: Just add more machines!
Here’s what I mean:
Instead of relying on one machine’s memory, Spark distributes your dataset across multiple machines (nodes). Each node processes a chunk of the data in parallel, and Spark combines the results. To you, it feels like working on one logical machine—but behind the scenes, it’s a cluster doing the heavy lifting.
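To make this concrete, here's a minimal local sketch (runnable once PySpark is installed, which we do later in this article); the local[4] master just simulates a 4-worker cluster on a single machine, and the app name is arbitrary:
from pyspark.sql import SparkSession
# "local[4]" runs Spark with 4 worker threads on your own machine
spark = SparkSession.builder.master("local[4]").appName("partition-demo").getOrCreate()
df = spark.range(1_000_000)               # a DataFrame with one million rows
print(df.rdd.getNumPartitions())          # how many chunks (partitions) the data was split into
print(df.selectExpr("sum(id)").first())   # each partition is summed in parallel, then combined
spark.stop()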
The diagram below summarises its architecture in a nutshell.
The Dimensions of Apache Spark
Spark isn’t just about running queries faster. It’s a whole ecosystem.
After this article, you can decide which dimensions of Spark to pursue, but here are the main ones (with a quick taste of their Python entry points after the list).
Spark Core – The foundation that handles memory management, job scheduling, and distributed task execution.
Spark SQL – For working with structured data (tables, DataFrames, SQL queries).
Spark Streaming – For real-time data processing (think logs, IoT data, live feeds).
MLlib – Spark’s built-in machine learning library for scalable ML models.
GraphX – For graph computations (social networks, recommendations, relationships).
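You don't have to run this yet (setup comes in a moment), but here's a rough sketch of what a few of those entry points look like from Python; the app name, options, and column names are placeholders:
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
spark = SparkSession.builder.appName("tour").getOrCreate()
# Spark SQL: query structured data with plain SQL
spark.sql("SELECT 'hello' AS greeting").show()
# Structured Streaming: the built-in "rate" source generates test rows, no external feed needed
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()
# MLlib: scalable machine learning estimators
lr = LogisticRegression(featuresCol="features", labelCol="label")
# GraphX itself is JVM-side (Scala/Java); Python graph work typically goes through the separate GraphFrames package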
Now let's look into Spark Core and Spark SQL through the simplest path: PySpark.
Apache Spark vs PySpark
Now you might be wondering—what’s the difference between Apache Spark and PySpark?
Put simply:
Apache Spark is the engine. Think of it like the car’s engine—the part that actually does the work and propels the car forward.
PySpark is the Python API for Spark. It’s the steering wheel you use to control the engine.
So while Spark itself is written in Scala and Java, PySpark gives Python developers the ability to harness Spark’s distributed power without leaving the comfort of Python.
Set up
I highly recommend using a virtual environment while working on this. You can set up the venv at the base of your working directory and activate it before installing PySpark.
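For example (assuming python3 is on your PATH; the folder name .venv is just a convention):
python3 -m venv .venv
source .venv/bin/activate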
We'll start by downloading the latest version of Spark (this might take a while)
wget https://dlcdn.apache.org/spark/spark-4.0.1/spark-4.0.1-bin-hadoop3.tgz
We then extract the Spark archive
tar -xzf spark-4.0.1-bin-hadoop3.tgz
We then rename the Spark folder to make it easier to reference, and move into it (the Spark home)
mv spark-4.0.1-bin-hadoop3/ spark/
cd spark
Run this command to confirm that your Spark version downloaded correctly
./bin/spark-submit --version
You should get an output like this.
Finally, let's install PySpark
pip install pyspark
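As an optional sanity check, you can confirm the install from Python:
python -c "import pyspark; print(pyspark.__version__)"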
Take PySpark for a spin
Create a new folder at the base of your working directory named files. This is where your code and data go
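From the terminal, assuming you're at the base of your working directory, that looks something like this; then we create the notebook inside it:
mkdir files
cd files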
touch spark.ipynb
Open it up in a text editor and connect to the virtual environment in which you installed PySpark
Now, let's get started
from pyspark.sql import SparkSession
Let's start the Spark session.
spark = SparkSession.builder.appName('demo').getOrCreate()
This starts up everything, i.e. the entire architecture described above; it's like twisting the key to start the engine.
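A couple of quick ways to check that the engine is actually running (exact values depend on your install):
print(spark.version)               # the Spark version this session runs on
print(spark.sparkContext.master)   # where the work runs, e.g. local[*] when everything is on one machine
# While the session is alive, the Spark UI is usually available at http://localhost:4040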
Create a dataframe
An interesting thing to note: unlike in pandas, where you'd usually build a DataFrame from a dict of columns, in Spark you can build one from a list of tuples (one tuple per row) plus a list of column names.
data=[("Eric", 25), ("Jane", 29), ("Sam", 35)]
df = spark.createDataFrame(data, ["name", "age"])
df.show()
Here are some useful commands
# Identify the schema
df.printSchema()
# List out the columns
df.columns
# Count the number of rows
df.count()
# Read data from a CSV file
df = spark.read.csv("demo.csv", header=True, inferSchema=True)
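Remember the pandas-style moves from the intro (drop columns, change data types, filter nulls)? Here's a minimal PySpark sketch of the same transformations; the small people DataFrame (with an extra None row so there is a null to filter) is built here purely for illustration:
from pyspark.sql import functions as F
people = spark.createDataFrame([("Eric", 25), ("Jane", None), ("Sam", 35)], ["name", "age"])
cleaned = (
    people.filter(F.col("age").isNotNull())              # filter out null ages
          .withColumn("age", F.col("age").cast("int"))   # change the data type
          .drop("name")                                   # drop a column
)
cleaned.show()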
Spark SQL
Now we can interact with our data using SQL, which is really cool
First, we create a temporary view from our DataFrame (analogous to database views)
df.createOrReplaceTempView("Demo")
That's it! Now you can write your queries here
Note that we're accessing the view from the Spark session we created above.
spark.sql("SELECT * FROM Demo"),show(5)