Welcome to Day 3 of your Spark Mastery Journey.
Today, we explore RDDs (Resilient Distributed Datasets) - the backbone of Spark.
Even though we mostly use DataFrames today, companies still test RDD fundamentals in interviews because they reveal how Spark works internally.
Let's break it down simply.
What Exactly Is an RDD?
An RDD is:
A distributed collection of immutable data partitioned across the cluster.
Key properties:
- Immutable
- Lazily evaluated
- Distributed
- Fault-tolerant
- Parallelized
RDDs were the first abstraction in Spark before DataFrames and Datasets existed.
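To make the first two properties concrete, here is a minimal sketch (assuming a local SparkSession named spark, which the later snippets reuse): a transformation returns a new RDD and leaves the original untouched, and nothing is computed until an action runs.
from pyspark.sql import SparkSession
# Local session used for the sketches in this post (assumed setup, not part of any real pipeline)
spark = SparkSession.builder.master("local[*]").appName("rdd-day3").getOrCreate()
numbers = spark.sparkContext.parallelize([1, 2, 3, 4])
doubled = numbers.map(lambda x: x * 2)   # returns a NEW RDD; 'numbers' is unchanged (immutable)
print(numbers.collect())  # [1, 2, 3, 4]
print(doubled.collect())  # [2, 4, 6, 8] - computed only now, when the action is called (lazy)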
Why Should You Learn RDDs?
Even though DataFrames are recommended now, RDDs are still crucial for:
- Understanding execution plans
- Debugging shuffles
- Improving partition strategies
- Designing performance-efficient pipelines
- Handling non-structured data
How to Create RDDs
- From Python Lists
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
- From File
rdd = spark.sparkContext.textFile("sales.txt")
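Both snippets assume a SparkSession named spark (as in the sketch above) and, for the file example, a sales.txt in the working directory. You can also control the number of partitions at creation time:
sc = spark.sparkContext
# From a Python list, explicitly split into 4 partitions
rdd_list = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8], numSlices=4)
# From a text file, asking Spark for at least 4 partitions
rdd_file = sc.textFile("sales.txt", minPartitions=4)
print(rdd_list.getNumPartitions())  # 4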
RDD Transformations (Lazy)
Transformations build the DAG.
Common transformations:
rdd.map(lambda x: x*2)
rdd.filter(lambda x: x > 10)
rdd.flatMap(lambda x: x.split(","))
You can chain transformations; Spark still won't run anything until an action is called.
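For example, here is a small chained pipeline (a sketch using a made-up RDD of comma-separated numbers). Defining it builds the DAG but runs nothing:
lines = spark.sparkContext.parallelize(["1,2,3", "10,20,30"])
pipeline = (lines
            .flatMap(lambda line: line.split(","))  # split each line into tokens
            .map(lambda token: int(token) * 2)      # convert to int and double
            .filter(lambda value: value > 10))      # keep only the large values
# Nothing has executed yet - 'pipeline' is just a recipe (the DAG)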
RDD Actions (Execute the Plan)
Actions trigger job execution:
rdd.collect()
rdd.count()
rdd.take(5)
rdd.saveAsTextFile("output")
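Continuing the sketch above, calling an action is what actually runs the DAG:
print(pipeline.count())   # triggers a job
print(pipeline.take(3))   # triggers another job; without caching, the lineage is re-run
pipeline.saveAsTextFile("output")  # writes one part-file per partition (fails if 'output' already exists)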
Narrow vs Wide Transformations
This is the MOST important concept for performance.
Narrow (No Shuffle)
Each output partition depends on only one input partition, so no data moves between partitions - fast (see the sketch after the examples below).
Examples:
- map
- filter
- union
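A quick way to see this (a sketch, reusing the spark session from earlier): narrow transformations keep the same number of partitions, because each output partition is built from exactly one input partition.
rdd = spark.sparkContext.parallelize(range(100), numSlices=8)
narrow = rdd.map(lambda x: x * 2).filter(lambda x: x > 50)
print(rdd.getNumPartitions())     # 8
print(narrow.getNumPartitions())  # still 8 - no data moved between partitions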
Wide (Shuffle Required)
Each output partition depends on data from multiple input partitions, so Spark must shuffle data across the network - slow
- Creates a new stage in the job
Examples:
- groupByKey
- join
- reduceByKey
Shuffles = major cause of slow Spark jobs.
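Here is a small word-count sketch showing a wide transformation: reduceByKey must shuffle the data so that all values for the same key land in the same partition, which starts a new stage.
words = spark.sparkContext.parallelize(["spark", "rdd", "spark", "shuffle", "rdd", "spark"])
pairs = words.map(lambda w: (w, 1))             # narrow: stays inside each partition
counts = pairs.reduceByKey(lambda a, b: a + b)  # wide: shuffles across partitions
print(counts.collect())  # e.g. [('spark', 3), ('rdd', 2), ('shuffle', 1)]
As a design note, reduceByKey is usually preferred over groupByKey because it combines values inside each partition before the shuffle, so far less data crosses the network.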
RDD Lineage: Fault Tolerance in Action
Each RDD tracks how it was created.
Example:
rdd1 = rdd.map(...)
rdd2 = rdd1.filter(...)
rdd3 = rdd2.reduceByKey(...)
If a node dies, Spark reconstructs data using lineage.
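You can inspect the lineage yourself with toDebugString (a sketch; in PySpark it returns bytes, so decode it before printing):
pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
result = pairs.map(lambda kv: (kv[0], kv[1] * 10)).reduceByKey(lambda a, b: a + b)
print(result.toDebugString().decode())  # prints each parent RDD and the shuffle (stage) boundary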
Persistence and Caching
If you reuse an RDD:
processed = rdd.map(...)
processed.persist()
processed.count()
processed.filter(...)
Spark will NOT recompute it; the second action reads the cached partitions from memory.
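A fuller caching sketch (the explicit StorageLevel and the unpersist() call are optional, but good hygiene):
from pyspark import StorageLevel
raw = spark.sparkContext.parallelize(range(1_000_000))
processed = raw.map(lambda x: x * x)
processed.persist(StorageLevel.MEMORY_ONLY)  # only marks it; nothing is stored yet
print(processed.count())                     # first action computes AND caches the partitions
print(processed.filter(lambda x: x % 2 == 0).count())  # reuses the cached data, no recomputation
processed.unpersist()                        # release the memory when you are done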
Summary
Today we learned:
- What RDDs are
- Why they matter
- Transformations vs actions
- Narrow vs wide transformations
- Lineage
- Caching and persistence
These concepts form the foundation of Spark internals.
Follow for more such content. Let me know in the comments if I missed anything. Thank you!!