Vinicius Fagundes
Apache Spark in Plain English: The Engine Behind Databricks

Every time you run a cell in Databricks, Apache Spark is doing the work.

You've seen it mentioned in the runtime version (Spark 3.4.1), in the spark object available in every notebook, and in error messages that reference things like "jobs" and "stages."

But what actually is Spark? And why does Databricks rely on it so heavily?

In this article we'll demystify Spark from the ground up — no Scala, no academic papers, just clear explanations with real context.


What is Apache Spark and Why It Matters

Apache Spark is an open-source distributed computing engine designed to process large amounts of data — fast.

It was created at UC Berkeley in 2009, open-sourced in 2010, and donated to the Apache Software Foundation in 2013. The same year, its creators founded Databricks to make Spark easier to use.

The key word in that definition is distributed. Spark doesn't process data on a single machine. It splits the work across many machines simultaneously — that's what makes it fast at scale.

A task that would take hours on a single machine can take minutes when split across 50 workers running in parallel.


The Problem Spark Was Built to Solve

Before Spark, the standard tool for big data processing was Hadoop MapReduce.

MapReduce worked, but it had a critical problem: everything went to disk between steps.

Step 1: Read data from disk
Step 2: Process it
Step 3: Write results to disk  ← slow
Step 4: Read from disk again   ← slow
Step 5: Process again
Step 6: Write to disk again    ← slow

For multi-step data pipelines, this disk I/O was brutally slow.

Spark solved this by keeping data in memory (RAM) as much as possible:

Step 1: Read data from disk → load into RAM
Step 2: Process in memory   ← fast
Step 3: Process in memory   ← fast
Step 4: Process in memory   ← fast
Step 5: Write final results to disk

The result? Spark is up to 100x faster than Hadoop MapReduce for iterative workloads. That speed difference is why it took over the big data world.
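The disk-versus-memory difference can be sketched in plain Python — a toy illustration of the two pipeline shapes, not Spark itself; the temporary file stands in for MapReduce's intermediate writes:

```python
import json
import os
import tempfile

data = list(range(100_000))  # stand-in for a "large" dataset

def disk_pipeline(data):
    """MapReduce-style: spill the intermediate result to disk, then re-read it."""
    fd, tmp = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    doubled = [x * 2 for x in data]
    with open(tmp, "w") as f:       # step boundary: write to disk (slow)
        json.dump(doubled, f)
    with open(tmp) as f:            # step boundary: read it back (slow)
        doubled = json.load(f)
    os.remove(tmp)
    return sum(doubled)

def memory_pipeline(data):
    """Spark-style: the intermediate result never leaves RAM."""
    doubled = [x * 2 for x in data]
    return sum(doubled)

# Same answer, very different I/O cost
assert disk_pipeline(data) == memory_pipeline(data)
```

Both functions compute the same thing; the only difference is the round trip through the filesystem between steps, which is exactly the cost Spark's in-memory design removes.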


How Spark Processes Data: DAGs, Jobs, Stages, and Tasks

This is where most explanations get confusing. Let's keep it simple.

When you write Spark code, it doesn't execute immediately. Spark first builds a plan, then executes it. Here's the hierarchy:

The Execution Hierarchy

Your Code
    ↓
  Action (triggers execution)
    ↓
   Job (one job per action)
    ↓
 Stages (groups of tasks that can run together)
    ↓
  Tasks (individual units of work per partition)

DAG: Directed Acyclic Graph

When you chain transformations in Spark, it builds a DAG — a logical map of all the steps needed to get from your input data to your final result.

Read CSV → Filter rows → Group by column → Aggregate → Write output
    ↓           ↓               ↓               ↓           ↓
  [node]     [node]           [node]          [node]      [node]

Spark uses the DAG to optimize execution — it may reorder steps, combine operations, or skip unnecessary work entirely.
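To make "combine operations" concrete, here's a hypothetical toy optimizer (not Catalyst, just an illustration of the idea) that fuses two adjacent filter steps in a plan into a single pass over the data:

```python
# Toy "logical plan": a list of (operation, argument) steps.
plan = [
    ("read", "/data/sales.csv"),
    ("filter", lambda r: r["amount"] > 1000),
    ("filter", lambda r: r["region"] == "EMEA"),
    ("select", ["region", "amount"]),
]

def combine_filters(plan):
    """Fuse adjacent filter steps into one, like an optimizer might."""
    optimized = []
    for op, arg in plan:
        if op == "filter" and optimized and optimized[-1][0] == "filter":
            prev = optimized.pop()[1]
            # One combined predicate instead of two passes over the rows
            optimized.append(("filter", lambda r, a=prev, b=arg: a(r) and b(r)))
        else:
            optimized.append((op, arg))
    return optimized

optimized = combine_filters(plan)
print([op for op, _ in optimized])  # ['read', 'filter', 'select']
```

Spark's real optimizer does far more (predicate pushdown, column pruning, join reordering), but the principle is the same: because it sees the whole DAG before running anything, it can rewrite the plan first.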

Lazy Evaluation

Here's something that surprises most Spark beginners: transformations don't run when you write them.

df = spark.read.csv("/data/sales.csv")    # Nothing runs yet
df2 = df.filter(df.amount > 1000)         # Nothing runs yet
df3 = df2.groupBy("region").count()       # Nothing runs yet

df3.show()  # 💥 THIS triggers execution — everything runs now

Spark waits until you call an action (like .show(), .collect(), .count(), or a write such as .write.save()) before doing any real work. This is called lazy evaluation, and it's what allows Spark to optimize the full plan before executing.
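The same idea exists in plain Python: generator expressions are lazy too, which makes a handy mental model (this is an analogy, not Spark code):

```python
nums = range(5)

filtered = (x for x in nums if x > 2)   # nothing runs yet
doubled = (x * 2 for x in filtered)     # nothing runs yet

result = list(doubled)  # THIS consumes the pipeline: everything runs now
print(result)  # [6, 8]
```

Just as list() pulls values through the whole generator chain in one go, a Spark action pulls data through every transformation you've chained up to that point.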

Transformations vs Actions

| Type | What it does | Examples |
| --- | --- | --- |
| Transformation | Defines a new step (lazy, nothing runs) | filter(), select(), groupBy(), join(), withColumn() |
| Action | Triggers execution of the full plan | show(), collect(), count(), write(), save() |

💡 Every time you call an action, Spark runs a new Job. Minimize unnecessary actions in your code.


Driver vs Worker Nodes

We introduced this briefly in the last article. Let's go deeper.

Driver Node

  • Runs your code
  • Builds the DAG and execution plan
  • Coordinates all the workers
  • Collects final results
  • Single point of failure — if the driver crashes, the job fails

Worker Nodes

  • Receive tasks from the driver
  • Process their assigned data partitions in parallel
  • Return results to the driver
  • More workers = more parallelism = faster processing (up to a point)

You write code
      ↓
  Driver Node
  ├── Builds execution plan
  ├── Splits data into partitions
  └── Sends tasks to workers
        ↓
  Worker 1    Worker 2    Worker 3    Worker 4
  [partition] [partition] [partition] [partition]
       ↓           ↓           ↓           ↓
  Driver collects results
      ↓
  Output
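A minimal sketch of the driver/worker split, using a Python thread pool to play the role of the cluster (assumptions for illustration: 4 "workers", and summing numbers as the "task"):

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 101))  # the "dataset": numbers 1..100

# Driver: split the data into partitions
n_partitions = 4
size = len(data) // n_partitions
partitions = [data[i * size:(i + 1) * size] for i in range(n_partitions)]

# Workers: each task processes one partition independently
def task(partition):
    return sum(partition)

with ThreadPoolExecutor(max_workers=n_partitions) as pool:
    partial_sums = list(pool.map(task, partitions))

# Driver: collect and combine the partial results
total = sum(partial_sums)
print(total)  # 5050
```

The shape is the point: the driver plans and splits, the workers each see only their own partition, and the driver combines partial results at the end.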

Partitions

Data in Spark is split into partitions — chunks of data that can be processed independently on different workers.

# See how many partitions your DataFrame has
df.rdd.getNumPartitions()

# Repartition manually
df = df.repartition(8)

More partitions = more parallelism. But too many small partitions = too much overhead. As a rule of thumb, aim for partitions between 128MB and 256MB each.
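A back-of-the-envelope helper for that rule of thumb (not a Spark API, just the arithmetic; the 200 MB target is an assumed midpoint of the 128–256 MB range):

```python
def suggest_partitions(total_size_mb: int, target_mb: int = 200) -> int:
    """Ceiling-divide the dataset size by the target partition size."""
    return max(1, -(-total_size_mb // target_mb))

print(suggest_partitions(10_000))  # 10 GB of data -> 50 partitions
print(suggest_partitions(50))      # tiny data -> 1 partition
```

You'd then pass a number like this to df.repartition(n); in practice also check the Spark UI, since compressed file size and in-memory size can differ a lot.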


Spark vs Traditional SQL Engines

You might wonder: if I can query data with SQL, why not just use a regular database?

Here's the honest answer:

| | Traditional SQL DB | Apache Spark |
| --- | --- | --- |
| Data volume sweet spot | GBs | GBs to PBs |
| Architecture | Single machine | Distributed (many machines) |
| Data formats | Tables (proprietary) | CSV, JSON, Parquet, Delta, ORC... |
| In-memory processing | Limited | Core design |
| ML support | External tools | Native (MLlib) |
| Streaming | Limited | Native (Structured Streaming) |
| Setup complexity | Low | Higher |

Traditional SQL databases are perfect for transactional workloads (your app backend, your CRM). Spark is for analytical workloads on large datasets — exactly what data engineers deal with.


Why Spark is Fast (and When It Isn't)

Spark is fast because of:

  • In-memory processing: avoids slow disk I/O between steps
  • Lazy evaluation: builds an optimized plan before executing
  • Catalyst optimizer: Spark's internal query optimizer rewrites your code to be more efficient
  • Tungsten engine: low-level memory and CPU optimizations
  • Parallelism: work is split across many machines simultaneously

When Spark isn't fast

Spark has weaknesses too — and knowing them saves you headaches:

❌ Small datasets: Spark has overhead. For datasets under a few GBs, Pandas on a single machine is often faster and simpler.

❌ Too many shuffles: Operations like groupBy, join, and orderBy require shuffling — moving data between workers. Shuffles are expensive. Too many of them kill performance.

❌ Data skew: If one partition has 100x more data than others, that one worker becomes the bottleneck while the rest sit idle.
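A common fix for skew is key salting: splitting the hot key into several sub-keys so its rows spread across partitions. Here's a toy, deterministic sketch (the partitioner and keys are made up for illustration; this is not Spark's hash function):

```python
from collections import Counter

def partition(key: str, n: int = 4) -> int:
    """Toy deterministic partitioner: hash the key's bytes into n buckets."""
    return sum(key.encode()) % n

# 95% of rows share one hot key: classic skew
rows = ["hot"] * 95 + ["a", "b", "c", "d", "e"]

# Naive partitioning: one bucket gets almost everything
naive = Counter(partition(k) for k in rows)

# Salting: rewrite "hot" as hot_0..hot_3 so it spreads across buckets
salted = Counter(partition(f"{k}_{i % 4}" if k == "hot" else k)
                 for i, k in enumerate(rows))

print(max(naive.values()), max(salted.values()))  # 96 25
```

The busiest bucket drops from 96 rows to 25. In a real join, the other side must be replicated across the same salt values, so salting trades some extra data for far better balance.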

❌ Too many small files: Spark performs poorly when reading thousands of tiny files. Prefer fewer, larger files (Parquet or Delta work best).


Spark in the Context of Databricks

When you use Databricks, you're using Spark — but Databricks adds layers on top:

| Raw Spark | Databricks adds |
| --- | --- |
| Spark engine | Optimized runtime (Photon engine) |
| Manual cluster setup | Managed clusters, autoscaling |
| No storage layer | Delta Lake |
| No UI | Notebooks, SQL Editor, Workflows |
| No governance | Unity Catalog |

Databricks' Photon engine is their proprietary rewrite of Spark's execution layer in C++. On compatible queries, it can be 2–8x faster than standard Spark. It runs automatically — no code changes needed.


Wrapping Up

Here's what matters from this article:

  • Spark is a distributed computing engine — it splits work across many machines in parallel
  • It stores data in memory rather than writing to disk between steps — that's why it's fast
  • Lazy evaluation means transformations don't run until you call an action
  • The execution hierarchy is: Action → Job → Stages → Tasks
  • Spark shines at large-scale analytics but has real weaknesses on small data and shuffle-heavy workloads
  • Databricks wraps Spark with Delta Lake, managed clusters, and the Photon engine to make it faster and easier to use

In the next article, we move from compute to storage: DBFS and connecting Databricks to your cloud storage.


*This article is part of the **Databricks for Dummies** series — a step-by-step guide from zero to your first data warehouse in Databricks.*

← Previous: Clusters & Notebooks | Next: DBFS + Connecting to Cloud Storage →
