Every time you run a cell in Databricks, Apache Spark is doing the work.
You've seen it mentioned in the runtime version (Spark 3.4.1), in the spark object available in every notebook, and in error messages that reference things like "jobs" and "stages."
But what actually is Spark? And why does Databricks rely on it so heavily?
In this article we'll demystify Spark from the ground up — no Scala, no academic papers, just clear explanations with real context.
What is Apache Spark and Why It Matters
Apache Spark is an open-source distributed computing engine designed to process large amounts of data — fast.
It was created at UC Berkeley in 2009, open-sourced in 2010, and donated to the Apache Software Foundation in 2013. The same year, its creators founded Databricks to make Spark easier to use.
The key word in that definition is distributed. Spark doesn't process data on a single machine. It splits the work across many machines simultaneously — that's what makes it fast at scale.
A task that would take hours on a single machine can take minutes when split across 50 workers running in parallel.
The Problem Spark Was Built to Solve
Before Spark, the standard tool for big data processing was Hadoop MapReduce.
MapReduce worked, but it had a critical problem: everything went to disk between steps.
```
Step 1: Read data from disk
Step 2: Process it
Step 3: Write results to disk  ← slow
Step 4: Read from disk again   ← slow
Step 5: Process again
Step 6: Write to disk again    ← slow
```
For multi-step data pipelines, this disk I/O was brutally slow.
Spark solved this by keeping data in memory (RAM) as much as possible:
```
Step 1: Read data from disk → load into RAM
Step 2: Process in memory  ← fast
Step 3: Process in memory  ← fast
Step 4: Process in memory  ← fast
Step 5: Write final results to disk
```
The result? Spark is up to 100x faster than Hadoop MapReduce for iterative workloads. That speed difference is why it took over the big data world.
How Spark Processes Data: DAGs, Jobs, Stages, and Tasks
This is where most explanations get confusing. Let's keep it simple.
When you write Spark code, it doesn't execute immediately. Spark first builds a plan, then executes it. Here's the hierarchy:
The Execution Hierarchy
```
Your Code
   ↓
Action (triggers execution)
   ↓
Job (one job per action)
   ↓
Stages (groups of tasks that can run together)
   ↓
Tasks (individual units of work per partition)
```
DAG: Directed Acyclic Graph
When you chain transformations in Spark, it builds a DAG — a logical map of all the steps needed to get from your input data to your final result.
```
Read CSV → Filter rows → Group by column → Aggregate → Write output
   ↓           ↓              ↓               ↓            ↓
 [node]      [node]         [node]          [node]       [node]
```
Spark uses the DAG to optimize execution — it may reorder steps, combine operations, or skip unnecessary work entirely.
Lazy Evaluation
Here's something that surprises most Spark beginners: transformations don't run when you write them.
```python
df = spark.read.csv("/data/sales.csv")   # Nothing runs yet
df2 = df.filter(df.amount > 1000)        # Nothing runs yet
df3 = df2.groupBy("region").count()      # Nothing runs yet
df3.show()                               # 💥 THIS triggers execution — everything runs now
```
Spark waits until you call an action (like .show(), .collect(), .write(), or .count()) before doing any real work. This is called lazy evaluation, and it's what allows Spark to optimize the full plan before executing.
Transformations vs Actions
| Type | What it does | Examples |
|---|---|---|
| Transformation | Defines a new step — lazy, nothing runs | `filter()`, `select()`, `groupBy()`, `join()`, `withColumn()` |
| Action | Triggers execution of the full plan | `show()`, `collect()`, `count()`, `write()`, `save()` |
💡 Every time you call an action, Spark runs a new Job. Minimize unnecessary actions in your code.
Driver vs Worker Nodes
We introduced this briefly in the last article. Let's go deeper.
Driver Node
- Runs your code
- Builds the DAG and execution plan
- Coordinates all the workers
- Collects final results
- Single point of failure — if the driver crashes, the job fails
Worker Nodes
- Receive tasks from the driver
- Process their assigned data partitions in parallel
- Return results to the driver
- More workers = more parallelism = faster processing (up to a point)
```
You write code
      ↓
Driver Node
 ├── Builds execution plan
 ├── Splits data into partitions
 └── Sends tasks to workers
      ↓
Worker 1     Worker 2     Worker 3     Worker 4
[partition]  [partition]  [partition]  [partition]
      ↓           ↓           ↓           ↓
Driver collects results
      ↓
Output
```
Partitions
Data in Spark is split into partitions — chunks of data that can be processed independently on different workers.
```python
# See how many partitions your DataFrame has
df.rdd.getNumPartitions()

# Repartition manually
df = df.repartition(8)
```
More partitions = more parallelism. But too many small partitions = too much overhead. As a rule of thumb, aim for partitions between 128MB and 256MB each.
Spark vs Traditional SQL Engines
You might wonder: if I can query data with SQL, why not just use a regular database?
Here's the honest answer:
| | Traditional SQL DB | Apache Spark |
|---|---|---|
| Data volume sweet spot | GBs | GBs to PBs |
| Architecture | Single machine | Distributed (many machines) |
| Data formats | Tables (proprietary) | CSV, JSON, Parquet, Delta, ORC... |
| In-memory processing | Limited | Core design |
| ML support | External tools | Native (MLlib) |
| Streaming | Limited | Native (Structured Streaming) |
| Setup complexity | Low | Higher |
Traditional SQL databases are perfect for transactional workloads (your app backend, your CRM). Spark is for analytical workloads on large datasets — exactly what data engineers deal with.
Why Spark is Fast (and When It Isn't)
Spark is fast because of:
- In-memory processing: avoids slow disk I/O between steps
- Lazy evaluation: builds an optimized plan before executing
- Catalyst optimizer: Spark's internal query optimizer rewrites your code to be more efficient
- Tungsten engine: low-level memory and CPU optimizations
- Parallelism: work is split across many machines simultaneously
When Spark isn't fast
Spark has weaknesses too — and knowing them saves you headaches:
❌ Small datasets: Spark has overhead. For datasets under a few GBs, Pandas on a single machine is often faster and simpler.
❌ Too many shuffles: Operations like `groupBy`, `join`, and `orderBy` require shuffling — moving data between workers. Shuffles are expensive. Too many of them kill performance.
❌ Data skew: If one partition has 100x more data than others, that one worker becomes the bottleneck while the rest sit idle.
❌ Too many small files: Spark performs poorly when reading thousands of tiny files. Prefer fewer, larger files (Parquet or Delta work best).
Spark in the Context of Databricks
When you use Databricks, you're using Spark — but Databricks adds layers on top:
| Raw Spark | Databricks adds |
|---|---|
| Spark engine | Optimized runtime (Photon engine) |
| Manual cluster setup | Managed clusters, autoscaling |
| No storage layer | Delta Lake |
| No UI | Notebooks, SQL Editor, Workflows |
| No governance | Unity Catalog |
Databricks' Photon engine is their proprietary rewrite of Spark's execution layer in C++. On compatible queries, it can be 2–8x faster than standard Spark. It runs automatically — no code changes needed.
Wrapping Up
Here's what matters from this article:
- Spark is a distributed computing engine — it splits work across many machines in parallel
- It stores data in memory rather than writing to disk between steps — that's why it's fast
- Lazy evaluation means transformations don't run until you call an action
- The execution hierarchy is: Action → Job → Stages → Tasks
- Spark shines at large-scale analytics but has real weaknesses on small data and shuffle-heavy workloads
- Databricks wraps Spark with Delta Lake, managed clusters, and the Photon engine to make it faster and easier to use
In the next article, we move from compute to storage: DBFS and connecting Databricks to your cloud storage.
*This article is part of the **Databricks for Dummies** series — a step-by-step guide from zero to your first data warehouse in Databricks.*
← Previous: Clusters & Notebooks | Next: DBFS + Connecting to Cloud Storage →