Every time you run a cell in Databricks, Apache Spark is doing the work.
You've seen it mentioned in the runtime version (Spark 3.4.1), in the spark object available in every notebook, and in error messages that reference things like "jobs" and "stages."
But what actually is Spark? And why does Databricks rely on it so heavily?
In this article we'll demystify Spark from the ground up — no Scala, no academic papers, just clear explanations with real context.
What is Apache Spark and Why It Matters
Apache Spark is an open-source distributed computing engine designed to process large amounts of data — fast.
It was created at UC Berkeley in 2009, open-sourced in 2010, and donated to the Apache Software Foundation in 2013. The same year, its creators founded Databricks to make Spark easier to use.
The key word in that definition is distributed. Spark doesn't process data on a single machine. It splits the work across many machines simultaneously — that's what makes it fast at scale.
A task that would take hours on a single machine can take minutes when split across 50 workers running in parallel.
The Problem Spark Was Built to Solve
Before Spark, the standard tool for big data processing was Hadoop MapReduce.
MapReduce worked, but it had a critical problem: everything went to disk between steps.
```
Step 1: Read data from disk
Step 2: Process it
Step 3: Write results to disk  ← slow
Step 4: Read from disk again   ← slow
Step 5: Process again
Step 6: Write to disk again    ← slow
```
For multi-step data pipelines, this disk I/O was brutally slow.
Spark solved this by keeping data in memory (RAM) as much as possible:
```
Step 1: Read data from disk → load into RAM
Step 2: Process in memory  ← fast
Step 3: Process in memory  ← fast
Step 4: Process in memory  ← fast
Step 5: Write final results to disk
```
The result? Spark is up to 100x faster than Hadoop MapReduce for iterative workloads. That speed difference is why it took over the big data world.
How Spark Processes Data: DAGs, Jobs, Stages, and Tasks
This is where most explanations get confusing. Let's keep it simple.
When you write Spark code, it doesn't execute immediately. Spark first builds a plan, then executes it. Here's the hierarchy:
The Execution Hierarchy
```
Your Code
   ↓
Action (triggers execution)
   ↓
Job (one job per action)
   ↓
Stages (groups of tasks that can run together)
   ↓
Tasks (individual units of work per partition)
```
DAG: Directed Acyclic Graph
When you chain transformations in Spark, it builds a DAG — a logical map of all the steps needed to get from your input data to your final result.
```
Read CSV → Filter rows → Group by column → Aggregate → Write output
   ↓           ↓              ↓               ↓            ↓
 [node]      [node]         [node]          [node]       [node]
```
Spark uses the DAG to optimize execution — it may reorder steps, combine operations, or skip unnecessary work entirely.
Lazy Evaluation
Here's something that surprises most Spark beginners: transformations don't run when you write them.
```python
df = spark.read.csv("/data/sales.csv")   # Nothing runs yet
df2 = df.filter(df.amount > 1000)        # Nothing runs yet
df3 = df2.groupBy("region").count()      # Nothing runs yet
df3.show()                               # 💥 THIS triggers execution — everything runs now
```
Spark waits until you call an action (like .show(), .collect(), .write(), or .count()) before doing any real work. This is called lazy evaluation, and it's what allows Spark to optimize the full plan before executing.
Transformations vs Actions
| Type | What it does | Examples |
|---|---|---|
| Transformation | Defines a new step — lazy, nothing runs | `filter()`, `select()`, `groupBy()`, `join()`, `withColumn()` |
| Action | Triggers execution of the full plan | `show()`, `collect()`, `count()`, `write()`, `save()` |
💡 Every time you call an action, Spark runs a new Job. Minimize unnecessary actions in your code.
Driver vs Worker Nodes
We introduced this briefly in the last article. Let's go deeper.
Driver Node
- Runs your code
- Builds the DAG and execution plan
- Coordinates all the workers
- Collects final results
- Single point of failure — if the driver crashes, the job fails
Worker Nodes
- Receive tasks from the driver
- Process their assigned data partitions in parallel
- Return results to the driver
- More workers = more parallelism = faster processing (up to a point)
```
You write code
      ↓
Driver Node
 ├── Builds execution plan
 ├── Splits data into partitions
 └── Sends tasks to workers
      ↓
Worker 1     Worker 2     Worker 3     Worker 4
[partition]  [partition]  [partition]  [partition]
      ↓           ↓           ↓           ↓
Driver collects results
      ↓
Output
```
Partitions
Data in Spark is split into partitions — chunks of data that can be processed independently on different workers.
```python
# See how many partitions your DataFrame has
df.rdd.getNumPartitions()

# Repartition manually
df = df.repartition(8)
```
More partitions = more parallelism. But too many small partitions = too much overhead. As a rule of thumb, aim for partitions between 128MB and 256MB each.
Spark vs Traditional SQL Engines
You might wonder: if I can query data with SQL, why not just use a regular database?
Here's the honest answer:
| | Traditional SQL DB | Apache Spark |
|---|---|---|
| Data volume sweet spot | GBs | GBs to PBs |
| Architecture | Single machine | Distributed (many machines) |
| Data formats | Tables (proprietary) | CSV, JSON, Parquet, Delta, ORC... |
| In-memory processing | Limited | Core design |
| ML support | External tools | Native (MLlib) |
| Streaming | Limited | Native (Structured Streaming) |
| Setup complexity | Low | Higher |
Traditional SQL databases are perfect for transactional workloads (your app backend, your CRM). Spark is for analytical workloads on large datasets — exactly what data engineers deal with.
Why Spark is Fast (and When It Isn't)
Spark is fast because of:
- In-memory processing: avoids slow disk I/O between steps
- Lazy evaluation: builds an optimized plan before executing
- Catalyst optimizer: Spark's internal query optimizer rewrites your code to be more efficient
- Tungsten engine: low-level memory and CPU optimizations
- Parallelism: work is split across many machines simultaneously
When Spark isn't fast
Spark has weaknesses too — and knowing them saves you headaches:
❌ Small datasets: Spark has overhead. For datasets under a few GBs, Pandas on a single machine is often faster and simpler.
❌ Too many shuffles: Operations like `groupBy`, `join`, and `orderBy` require shuffling — moving data between workers. Shuffles are expensive. Too many of them kill performance.
❌ Data skew: If one partition has 100x more data than others, that one worker becomes the bottleneck while the rest sit idle.
❌ Too many small files: Spark performs poorly when reading thousands of tiny files. Prefer fewer, larger files (Parquet or Delta work best).
Spark in the Context of Databricks
When you use Databricks, you're using Spark — but Databricks adds layers on top:
| Raw Spark | Databricks adds |
|---|---|
| Spark engine | Optimized runtime (Photon engine) |
| Manual cluster setup | Managed clusters, autoscaling |
| No storage layer | Delta Lake |
| No UI | Notebooks, SQL Editor, Workflows |
| No governance | Unity Catalog |
Databricks' Photon engine is their proprietary rewrite of Spark's execution layer in C++. On compatible queries, it can be 2–8x faster than standard Spark. It runs automatically — no code changes needed.
Wrapping Up
Here's what matters from this article:
- Spark is a distributed computing engine — it splits work across many machines in parallel
- It stores data in memory rather than writing to disk between steps — that's why it's fast
- Lazy evaluation means transformations don't run until you call an action
- The execution hierarchy is: Action → Job → Stages → Tasks
- Spark shines at large-scale analytics but has real weaknesses on small data and shuffle-heavy workloads
- Databricks wraps Spark with Delta Lake, managed clusters, and the Photon engine to make it faster and easier to use
In the next article, we move from compute to storage: DBFS and connecting Databricks to your cloud storage.
*This article is part of the **Databricks for Dummies** series — a step-by-step guide from zero to your first data warehouse in Databricks.*
← Previous: Clusters & Notebooks | Next: DBFS + Connecting to Cloud Storage →